### 1.1 Your First Web Scraper
Try visiting [http://pythonscraping.com/pages/page1.html](http://pythonscraping.com/pages/page1.html).
Let's read the contents of this page.

The code here uses urllib; reading the [official documentation](https://docs.python.org/ja/3/library/urllib.html) is recommended.

In [1]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
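The output above starts with `b'...'` because `html.read()` returns a `bytes` object rather than a string. A minimal sketch of turning such a response into text follows. The sample bytes are an abbreviated copy of the output above, `decode_body` is a hypothetical helper name, and UTF-8 is assumed as the encoding rather than read from the server's response headers.

```python
def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw response bytes into text, replacing undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# Abbreviated sample standing in for the value returned by html.read().
raw = b"<html>\n<head>\n<title>A Useful Page</title>\n</head>\n</html>\n"

text = decode_body(raw)
print(text)
```

Printing the decoded string shows real line breaks instead of literal `\n` sequences, which makes the HTML easier to read.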


### 1.2 A First Look at BeautifulSoup
Let's confirm that the bs object structures the HTML and makes it easy to access.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


In [3]:
bs.html

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [4]:
bs.head

<head>
<title>A Useful Page</title>
</head>

In [5]:
bs.title

<title>A Useful Page</title>
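The attribute access used in the cells above can also be chained, and the text inside a tag can be pulled out with `get_text()`. The sketch below parses an abbreviated copy of the page from a local string (an assumption, so that it runs without network access) and shows that `bs.h1` and `bs.html.body.h1` reach the same tag.

```python
from bs4 import BeautifulSoup

# Abbreviated local copy of the page, used instead of a network request.
markup = """<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet.
</div>
</body>
</html>
"""

bs = BeautifulSoup(markup, "html.parser")

# Both forms find the first <h1> in the document.
print(bs.h1)
print(bs.html.body.h1)

# get_text() returns only the text content, without the surrounding tag.
print(bs.h1.get_text())
print(bs.title.get_text())
```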

#### 1.2.3 Connecting Reliably
This section deals with handling errors and other failures.

In [6]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    # Try to open this URL
    html = urlopen("https://pythonscrapingthisurldoesnotexist.com")  # cannot be opened
except HTTPError as e:
    # HTTP errors end up here
    print("The server returned an HTTP error")
except URLError as e:
    # URL errors end up here
    print("The server could not be found!")
else:
    print(html.read())

The server could not be found!
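In the cell above, `except HTTPError` is listed before `except URLError`. The order matters because `HTTPError` is a subclass of `URLError`: with the clauses reversed, the `URLError` clause would catch HTTP errors as well. The following sketch illustrates this without any network access. The `describe` helper and the hand-built exception objects are hypothetical stand-ins for real failures.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc: Exception) -> str:
    """Raise exc and report which except clause catches it."""
    try:
        raise exc
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return f"The server returned an HTTP error: {e.code}"
    except URLError as e:
        return f"The server could not be found: {e.reason}"


# Hand-built exceptions with assumed values, standing in for real failures.
http_error = HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO()
)
url_error = URLError("name resolution failed")

print(issubclass(HTTPError, URLError))
print(describe(http_error))
print(describe(url_error))
```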


In [7]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    # Try to open this URL
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")  # can be opened
except HTTPError as e:
    print("The server returned an HTTP error")
except URLError as e:
    print("The server could not be found!")
else:
    # The else clause runs when no error occurred
    print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


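Even if the URL can be accessed, it is still uncertain whether the tag you want actually exists. BeautifulSoup returns `None` for a tag that does not exist, and accessing a child of that `None` raises an `AttributeError` that halts processing.
So let's also put the access to the object we want inside a try statement.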

In [8]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

# The function ends with return None as soon as an error occurs.
def getTitle(url):
    # First check whether the URL can be opened
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None

    # Next, try to access the object
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

<h1>An Interesting Title</h1>


In [9]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    # First check whether the URL can be opened
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None

    # Next, try to access the object
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.hogehoge  # changed to a tag that does not exist, so this evaluates to None
    except AttributeError as e:
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

Title could not be found
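To see which step actually raises in a case like the cell above, the sketch below separates the two situations: looking up a missing tag gives `None`, while looking up a child of a missing tag raises `AttributeError`. It assumes a local HTML string and `html.parser` instead of a network request and `lxml`, and `find_h1_under` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

markup = "<html><body><h1>An Interesting Title</h1></body></html>"
bs = BeautifulSoup(markup, "html.parser")

# A tag that is not in the document evaluates to None; no exception is raised.
missing = bs.hogehoge
print(missing)  # None


def find_h1_under(soup, parent_name):
    """Return the <h1> under the named parent tag, or None if the parent is missing."""
    try:
        # find() returns None for a missing parent, so .h1 raises AttributeError.
        return soup.find(parent_name).h1
    except AttributeError:
        return None


found = find_h1_under(bs, "body")
not_found = find_h1_under(bs, "hogehoge")
print(found)
print(not_found)
```

Either way the caller receives `None` for a missing title, so a single `is None` check at the call site covers both cases.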
