# Ch1 Building Scrapers

- To give you some idea of the infrastructure required to get information to your browser, consider the following example. Alice owns a web server. Bob uses a desktop computer, which is trying to connect to Alice's server. When one machine wants to talk to another machine, something like the following exchange takes place:

1. Bob's computer sends along a stream of 1 and 0 bits, indicated by high and low voltages on a wire. These bits form some information, containing a header and a body. The header contains an immediate destination of his local router's MAC address, with a final destination of Alice's IP address. The body contains his request for Alice's server application.
2. Bob's local router receives all these 1's and 0's and interprets them as a packet, from Bob's own MAC address, and destined for Alice's IP address. His router stamps its own IP address on the packet as the "from" IP address, and sends it off across the Internet.
3. Bob's packet traverses several intermediary servers, which direct his packet toward the correct physical (wired) path, on to Alice's server.
4. Alice's server receives the packet, at her IP address.
5. Alice's server reads the packet's destination port in the header (almost always port 80 for web applications; the port can be thought of as an "apartment number" for packet data, where the IP address is the "street address") and passes the packet off to the appropriate application: the web server application.
6. The web server application receives a stream of data from the server processor. This data says something like:
  - This is a GET request
  - The following file is requested: index.html
7. The web server locates the correct HTML file, bundles it up into a new packet to send to Bob, and sends it through to its local router, for transport back to Bob's machine, through the same process.
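
The request described in step 6 can be sketched as plain text. This is only a minimal illustration: the helper names `build_get_request` and `parse_request_line`, the host `example.com`, and the header set are assumptions made for the example, and a real browser sends many more headers.

```python
def build_get_request(host: str, path: str = "/index.html") -> bytes:
    """Build a bare-bones HTTP/1.1 GET request (hypothetical helper)."""
    lines = [
        f"GET {path} HTTP/1.1",  # "This is a GET request" for this file
        f"Host: {host}",         # which site on the server is wanted
        "Connection: close",     # ask the server to close the connection when done
        "",                      # a blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_request_line(raw: bytes) -> tuple[str, str]:
    """Return (method, path), roughly what a web server application reads first."""
    request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
    method, path, _version = request_line.split(" ")
    return method, path


request = build_get_request("example.com")
print(request.decode("ascii"))
print(parse_request_line(request))
```

Nothing is sent over the network here; the sketch only shows what the "GET request for index.html" looks like once the bits have been reassembled into text.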

In [1]:
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")

In [2]:
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


- So far, this only reads the single HTML file we requested. Unlike a browser, it does not react to links or fetch anything else the page refers to.

## Beautiful Soup

In [4]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read(), "lxml")
print(bsObj.h1)

<h1>An Interesting Title</h1>


- As in the example before, we import the urlopen function and call html.read() to get the HTML content of the page. This HTML content is then transformed into a BeautifulSoup object, whose nested structure can be navigated tag by tag:

In [5]:
bsObj.html.body.h1

<h1>An Interesting Title</h1>

In [6]:
bsObj.html.h1

<h1>An Interesting Title</h1>

In [7]:
html = urlopen("http://pythonscraping.com/pages/page1.html")

- There are two main things that can go wrong in this line:

- The page is not found on the server (or there was some error in retrieving it)
- The server is not found

In the first case, the server returns an HTTP error such as
- "404 Page Not Found"
- "500 Internal Server Error"
- in all of these cases, urlopen raises an "HTTPError"

In [None]:
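# handle like
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen("http://pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # return None, break, or fall back to some "plan B" here
else:
    pass  # program continues: html holds the open response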

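- If the server is not found at all (for example, the domain does not resolve or the server is down), urlopen raises a URLError rather than returning None. HTTPError is a subclass of URLError, so catch HTTPError first if both are handled. A None check is still a useful guard when a wrapper of your own returns None on failure; None is analogous to null in other programming languages.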

In [None]:
if html is None:
    print("URL is not found")
else:
    pass  # program continues
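
Both failure modes can be handled in one place. The following is a minimal sketch, not the notes' own method: `fetch`, `fake_404`, and `fake_unreachable` are hypothetical names, and the `opener` parameter exists only so the sketch can be run without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the page bytes, or None on failure (hypothetical wrapper).

    `opener` is injectable only so the sketch can be exercised offline.
    """
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {e.code}: {e.reason}")
    except URLError as e:
        # No usable answer at all: bad domain, refused connection, and so on.
        print(f"Server not reachable: {e.reason}")
    return None


def fake_404(url):
    # Stand-in opener that fails the way a missing page would.
    raise HTTPError(url, 404, "Not Found", hdrs=None, fp=None)


def fake_unreachable(url):
    # Stand-in opener that fails the way an unreachable server would.
    raise URLError("name resolution failed")


print(fetch("http://example.com/missing", opener=fake_404))
print(fetch("http://no-such-host.invalid/", opener=fake_unreachable))
```

HTTPError is caught before URLError because it is the more specific of the two. Returning None from the wrapper is what makes a later `if html is None` check meaningful.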

- What if the tag we want does not exist? Accessing a missing tag returns None, and accessing a tag on that None raises an error:

In [None]:
print(bsObj.nonExistingTag.someTag)


we get
- AttributeError: 'NoneType' object has no attribute 'someTag'
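
The two stages (None first, AttributeError second) can be reproduced offline. This is a small sketch under stated assumptions: the inline HTML is made up, `nav` simply stands for any tag that is absent, and the built-in "html.parser" is used instead of "lxml" to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Tiny inline document so the sketch runs without a network connection.
soup = BeautifulSoup(
    "<html><body><h1>An Interesting Title</h1></body></html>", "html.parser"
)

missing = soup.nav  # there is no <nav> tag, so Beautiful Soup returns None
print(missing is None)

try:
    soup.nav.a  # attribute access on None is what actually fails
except AttributeError as e:
    error_message = str(e)
    print("AttributeError:", error_message)
```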

In [None]:
# guarding
try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

# Ch2 Advanced HTML Parsing

- Class Selector

In [10]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "lxml")
nameList = bsObj.find_all("span", {"class":"green"})
for name in nameList:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


- find_all(tag, attributes, recursive, text, limit, keywords)
- find(tag, attributes, recursive, text, keywords)

- .find_all(["h1", "h2", "h3", "h4", "h5", "h6"])

- .find_all("span", {"class": ["green", "red"]})
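
A short offline sketch of matching several tag names and several class values at once. The inline HTML is invented for the example and "html.parser" is used to avoid the lxml dependency. Note that a Python dict cannot hold the key "class" twice, so the alternatives are passed as a list.

```python
from bs4 import BeautifulSoup

html = """
<h1>Title</h1><h2>Subtitle</h2>
<p><span class="green">Anna</span> met <span class="red">Prince</span>
and <span class="blue">nobody</span>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Any of these tag names matches.
headings = soup.find_all(["h1", "h2", "h3"])
# A span matches if its class is green OR red.
colored = soup.find_all("span", {"class": ["green", "red"]})

print([h.name for h in headings])
print([s.get_text() for s in colored])
```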

- The recursive argument is a boolean. How deeply into the document do you want to go? If recursive is set to True, the find_all function looks into children, and children's children, for tags that match your parameters. If it is False, it looks only at the direct children of the tag it is called on (the top-level tags when called on the whole document). By default, find_all works recursively (recursive is set to True); it's generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.
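
The effect of recursive is easiest to see on a tiny tree. A minimal sketch, assuming made-up inline HTML and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div id='outer'><p>direct child</p><section><p>grandchild</p></section></div>",
    "html.parser",
)
outer = soup.find("div", {"id": "outer"})

deep = outer.find_all("p")                      # recursive=True is the default
shallow = outer.find_all("p", recursive=False)  # direct children of outer only

print([p.get_text() for p in deep])
print([p.get_text() for p in shallow])
```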

In [11]:
nameList  = bsObj.find_all(text="the prince")
print(len(nameList))

7


In [12]:
nameList

['the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince',
 'the prince']

- Keyword arguments select the tags that contain a particular attribute.

In [13]:
allText = bsObj.find_all(id="text")
print(allText[0].get_text())


"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."

It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite.

All her invitations without exception, written in French, and
delivered by a scarlet-liveri

> Anything a keyword argument does can also be done with a regular expression or a lambda expression.
- Because class is a reserved word in Python, use class\_ for the class attribute.
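
A brief sketch comparing the keyword form with its equivalents. The inline HTML is invented for illustration and "html.parser" is used so the example has no lxml dependency.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="text"><span class="green">Anna</span> '
    '<span class="red">Genoa</span></div><div id="footer">end</div>',
    "html.parser",
)

by_keyword = soup.find_all(id="text")            # keyword argument
by_attrs = soup.find_all(attrs={"id": "text"})   # same filter as a dict
by_regex = soup.find_all(id=re.compile(r"^te"))  # id starts with "te"
green = soup.find_all(class_="green")            # class_ because class is reserved

print(len(by_keyword), len(by_attrs), len(by_regex))
print([tag.get_text() for tag in green])
```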

- 4 Objects in bs4

1. BeautifulSoup
2. Tag
3. NavigableString
4. Comment
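
The four object types can be seen in a single small document. A minimal sketch with made-up HTML, parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")
p = soup.p
text, comment = p.contents  # the text node and the comment node inside <p>

print(type(soup).__name__)     # the whole parsed document
print(type(p).__name__)        # a single element
print(type(text).__name__)     # text inside a tag
print(type(comment).__name__)  # an HTML comment
```

Comment is a special kind of NavigableString, so code that handles text nodes may want to check for Comment first.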

- children and descendants

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")

for child in bsObj.find("table", {"id":"giftList"}).children:
    print(child)
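
The difference between children and descendants shows up on a two-row table. This is an offline sketch: the table content is invented, only the id giftList is taken from the cell above, and "html.parser" is used in place of "lxml".

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(
    '<table id="giftList"><tr><th>Item</th></tr><tr><td>Rock</td></tr></table>',
    "html.parser",
)
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes exactly one level below the table.
children = [c.name for c in table.children if isinstance(c, Tag)]
# .descendants walks every level below the table, in document order.
descendants = [d.name for d in table.descendants if isinstance(d, Tag)]

print(children)
print(descendants)
```

The isinstance filter drops text nodes, which both iterators also yield.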

**Dealing with siblings**

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")

for sibling in bsObj.find("table", {"id":"giftList"}).tr.next_siblings:
    print(sibling)

- So, by selecting the title row and calling next_siblings, we can select all the rows in the table without selecting the title row itself, because an object cannot be a sibling of itself.

- previous_siblings
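
Both directions can be tried on a small table. A sketch with invented rows (only the id giftList comes from the notes), parsed with "html.parser" and written without whitespace between tags so that the siblings are tags only:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table id='giftList'><tr><th>Item</th></tr>"
    "<tr><td>first gift</td></tr><tr><td>second gift</td></tr></table>",
    "html.parser",
)

header_row = soup.find("table", {"id": "giftList"}).tr
following = [row.get_text() for row in header_row.next_siblings]

last_row = soup.find_all("tr")[-1]
# previous_siblings walks backwards, starting with the nearest sibling.
preceding = [row.get_text() for row in last_row.previous_siblings]

print(following)
print(preceding)
```

On real pages, the whitespace between tags shows up as extra text-node siblings, so the lists usually contain more than just rows.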

**Dealing with your parents**

In [15]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")

print(bsObj.find("img", {"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())


$15.00
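
The chained call can be read one step at a time. A minimal offline sketch: the one-row table is a simplified stand-in for the real page, and "html.parser" replaces "lxml".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table><tr><td>$15.00</td>"
    "<td><img src='../img/gifts/img1.jpg'></td></tr></table>",
    "html.parser",
)

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})  # 1. select the image
image_cell = img.parent                                   # 2. the <td> wrapping it
price_cell = image_cell.previous_sibling                  # 3. the <td> just before
print(price_cell.get_text())                              # 4. the text in that cell
```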



## Regular expression and BeautifulSoup

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "lxml")

images = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
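
What the pattern accepts and rejects can be checked with the re module alone. In this sketch the pattern is written as a raw string, and the candidate src values are made up for illustration.

```python
import re

# Literal "../img/gifts/img", then anything, then a literal ".jpg".
gift_image = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values, invented for this example.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img22.jpg",
    "../img/gifts/logo.jpg",
    "../img/spacer.jpg",
    "xx/img/gifts/img1.jpg",
]

matches = [src for src in candidates if gift_image.search(src)]
print(matches)
```

Escaping the dots matters: an unescaped "." would match any character, so the last candidate would slip through.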

### Accessing Attributes

- myImgTag.attrs['src']
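
A small sketch of the ways to read attributes from a tag. The img element and its alt value are invented for the example, and "html.parser" is used as the parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="../img/gifts/img1.jpg" alt="gift">', "html.parser")
myImgTag = soup.img

print(myImgTag.attrs)         # all attributes as a dictionary
print(myImgTag.attrs["src"])  # one attribute through the dictionary
print(myImgTag["src"])        # shorthand for the same lookup
print(myImgTag.get("title"))  # .get returns None for a missing attribute
```

Indexing with a missing key raises KeyError, so `.get` is the safer choice when an attribute may be absent.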

## Lambda Expressions
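
find_all also accepts a function that takes a tag and returns True or False. A minimal sketch, assuming made-up inline HTML and the built-in "html.parser", that selects every tag with exactly two attributes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div class="body" id="content"></div>'
    '<span class="red"></span>'
    '<p id="a" class="b"></p>',
    "html.parser",
)

# The lambda is called once per tag; tags for which it returns True are kept.
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
print([tag.name for tag in two_attrs])
```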