# Web scraping with python 

Edited on Sep 23

This book attempted to cover the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools.

## Part I building scrapers

+ Retrieving HTML data from a domain name
+ Parsing that data for target information
+ Storing the target information
+ Optionally, moving to another page to repeat the process

## Chapter 1 Your first web scraper

开头是一个computer之间交流的例子

In [10]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')   ## 注意是逗号，我疯了 检查了五分钟才发现  或者也可以直接 BeautifulSoup(html.'html.parser')
print(bs.h1)   # This returns only the first instace of the h1 tag found on the page.
               # <h1> defines the most important heading

<h1>An Interesting Title</h1>


### An introduction to BeautifulSoup

html.read() get the content of the page

HTML content could be transfomred into a BS object, with the structure (html--body--h1)

Another popular parser is lxml. It is generally better at parsing "messy" or malformed HTML code than html.parser. 

Another popular HTML parser is html5lib.

### Connecting Reliably and Handling Exceptions


In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)
    # return null, break, or other plan B
except URLError as e:
    print('The server could not be found!')
else:
    ptint('It Worked!')
    # program continues (写了return或break的就不用else了)
    

## Advanced HTML Parsing


In [12]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(),'html.parser')

nameList = bs.findAll('span', {'class':'green'})   # bs.find_all(tagName, tagAttributes)
for name in nameList:
    print(name.get_text())                         # .get_text() strips all the tags and returns a unicode string

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


### find() and find_all() in BS

find_all(tag, attributes, recursive, text, limit, keywords)  

find(tag, attributes, recursive, text, keywords)

常用的是tag 和 attributes  

The keyword argument allows you to select tags that contain a particular attribute of set of attributes.  

title = bs.find_all(id='title', class_='text')

In [15]:
nameList = bs.find_all(text='the prince')
print(len(nameList))                              # find the number of times "the prince" surrounded by tags on the example page

7


### Other BS Objects

BeautifulSoup objects  

Tag objects  

NavigableString Objects -- used to represent text with tags, rather than the tags themselves  

Comment Object -- used to find HTML comments in comment tags, <!--like this one-->


### Navigating trees

To find a tag based on its location in a document

**Dealing with children and other descendants **  两者区别就相当于亲生孩子和直属后代


In [16]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:     ## find only descendants that are children
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


**Dealing with siblings**  

next_siblings()    它只会返回object itself之后的sibling  

**Dealing with parents**  



## Regular Expressions   

star -- match 0 or more times  
加号  -- match 1 or more times  
[]      match any character with the brackets(pick one of these things)  
()      a grouped subexpression  
{m,n}  match the preceding character, subexpression, or bracketed character between m and n times(inclusive)  
[^]    match any single character that is not in the brackets  
|      match any (or)  
.      match any gingle character(including symbols, numbers, a space, etc.)  
^      indicate a character or subexpression occurs at the beginning of a string ^a  
\      An escape character 转义  
$      Often used at the end of a regular expression  
?!     Doe not contain  


In [18]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')
images = bs.find_all('img',
                    {'src':re.compile('\.\.\/img\/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


### Accessing Attributes

With tag objects, a Python list of attributes can be automatically accessed by calling this:  

myTag.attrs  -- this literally returns a Pytho dictionary object  

## Lambda Expressions 闭包

就是原来要f(x,y) lambda可以弄为f(g(x),h(x))  





# Chapter 3 Writing Web Crawlers

## Traversing a Single Domain



In [22]:
# Retrieve an arbitrary Wikipedia page and produces a list of links on that page
# Retrieve only the desired article links by using the regular expression

from urllib.request import urlopen
from bs4 import BeautifulSoup

# article links 和 其他 links 的区别
# they reside within the div within the id set to bodyContent
# The urls do not contain colons
# The urls begin with /wiki/

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html,'html.parser')
for link in bs.find('div', {'id':'bodyContent'}).find_all(
    'a', href=re.compile('^(/wiki/)((?!:).)*S')):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Sleepers
/wiki/Screen_Actors_Guild_Award
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Julia_R._Masterman_High_School
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Search_for_Tomorrow
/wiki/Second_Stage_Theatre
/wiki/Slab_Boys
/wiki/Sean_Penn
/wiki/Steve_Guttenberg
/wiki/Daniel_Stern_(actor)
/wiki/She%27s_Having_a_Baby
/wiki/Joel_Schumacher
/wiki/He_Said,_She_Said_(film)
/wiki/Oliver_Stone
/wiki/Meryl_Streep
/wiki/Sleepers_(film)
/wiki/Stir_of_Echoes
/wiki/Sean_Penn
/wiki/Michael_Strobl
/wiki/Desert_Storm
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Sebastian_Shaw_(comics)
/wiki/Saturn_Award_for_Best_Actor_on_Television
/wiki/Kyra_Sedgwick
/wiki/PBS
/wiki/Lemon_Sky
/wiki/Sosie_Bacon
/wiki/Upper_West_Side
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/Six_degrees_of_separation
/wiki/S