# Web Scraping

Web scraping is the process of extracting useful information from web pages. Before we get into web scraping, we should have a basic understanding of the technologies used to create web pages. The process involves making an HTTP request to a website and parsing the returned HTML so that we can extract the information.



# HTTP



In [1]:
from requests_html import HTMLSession

In [2]:
sess = HTMLSession()

In [3]:
url = "https://en.wikipedia.org/wiki/42_(number)"

In [4]:
response = sess.get(url) #make get request

The response object has many attributes and methods. Below we look at the status code of the response to check whether the request was successful.

In [5]:
response.status_code # 200 means successful

200
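
Status codes are grouped by their first digit, which is often more useful to check than a single value. The following is a minimal sketch using only the standard library; `describe_status` is a hypothetical helper name and not part of requests_html.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a readable label for an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if code < 200:
        kind = "informational"
    elif code < 300:
        kind = "success"
    elif code < 400:
        kind = "redirect"
    elif code < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} {phrase} ({kind})"


for code in (200, 301, 404, 503):
    print(describe_status(code))
```

With the `response` object from the cell above, `response.ok` and `response.raise_for_status()` (both inherited from the underlying requests library) offer a similar check without inspecting the number by hand.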

Headers carry additional information when sending and responding to HTTP requests, such as whether the request is coming from a laptop or a mobile device, or which browser was used to make it. In the response headers below we can see some additional information, like the character encoding (`UTF-8`) and that the content has been gzipped. The browser makes use of this information to render the page correctly.

In [6]:
response.headers

{'Date': 'Wed, 20 Jun 2018 04:39:02 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '47641', 'Connection': 'keep-alive', 'Server': 'mw1247.eqiad.wmnet', 'X-Content-Type-Options': 'nosniff', 'P3P': 'CP="This is not a P3P policy! See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'X-Powered-By': 'HHVM/3.18.6-dev', 'Content-language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Link': '</static/images/project-logos/enwiki.png>;rel=preload;as=image;media=not all and (min-resolution: 1.5dppx),</static/images/project-logos/enwiki-1.5x.png>;rel=preload;as=image;media=(min-resolution: 1.5dppx) and (max-resolution: 1.999999dppx),</static/images/project-logos/enwiki-2x.png>;rel=preload;as=image;media=(min-resolution: 2dppx)', 'Last-Modified': 'Tue, 19 Jun 2018 22:38:51 GMT', 'Backend-Timing': 'D=105941 t=1529448196438069', 'Content-Encoding': 'gzip', 'X-Varnish': '465681593 449058662, 116932997 97369490, 124377582 76287078, 687304923 6
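
To show how a header value like `Content-Type` can be picked apart, here is a small sketch that uses the standard library's `email.message.Message` to split it into a media type and a character set. The function name `parse_content_type` is an assumption made for this example.

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Split a Content-Type header into (media type, charset).

    The charset is None when the header does not declare one.
    """
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type(), message.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```

`response.headers` behaves like a case-insensitive dictionary, so `response.headers['content-type']` could be passed straight into this function.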

What we're really interested in is the response's text, in other words the HTML document itself.

In [7]:
html = response.text #the html
print("Number of chars: ",len(html))
html[:1000]

Number of chars:  250460


'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>42 (number) - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"42_(number)","wgTitle":"42 (number)","wgCurRevisionId":844818912,"wgRevisionId":844818912,"wgArticleId":191178,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","Articles needing additional references from February 2011","All articles needing additional references","Articles with short description","All art

As you can see, it would be very tricky to extract information from this large HTML document without some help. This is where the HTML parsing built into requests_html comes in.

## CSS Selectors

Web developers use CSS selectors to select the elements on a page that they want to style. CSS selectors therefore give us a succinct way to specify which information on the HTML page we'd like to extract. The simplest way to get to grips with them is to play one of these games:

* [CSS Diner](https://flukeout.github.io/)
* [CSS Leveler](http://toolness.github.io/css-selector-game/)

Alternatively, you could write a simple HTML page and try to style it. For a good cheatsheet on CSS selectors and XPath see [here](http://www.cheetyr.com/css-selectors).


# Parsing HTML

requests_html can also help us parse the HTML and return an easy-to-use object. Using this object we can get anything we want from the page. First, however, we need to work out what information we want and what kind of HTML element contains it. Does that element always have a certain attribute, or some other unique way to identify it? The easiest way to find out is to use the Chrome developer tools.

## Links
Using `response.html.find` we can extract all of the `<a>` tags from a page.

In [8]:
a_tags = response.html.find('a') #get all a tags
a_tags[:5]

[<Element 'a' id='top'>,
 <Element 'a' class=('mw-jump-link',) href='#mw-head'>,
 <Element 'a' class=('mw-jump-link',) href='#p-search'>,
 <Element 'a' href='/wiki/File:Question_book-new.svg' class=('image',)>,
 <Element 'a' href='/wiki/Wikipedia:Verifiability' title='Wikipedia:Verifiability'>]

We can access the attributes of a tag using `.attrs`, which returns a dictionary of the attributes.

In [9]:
a_tag = a_tags[1]
a_tag.attrs

{'class': ('mw-jump-link',), 'href': '#mw-head'}

If we only want `<a>` tags with an href attribute, we can filter the list using a list comprehension.

In [10]:
[ a for a in a_tags if 'href' in a.attrs][:5]

[<Element 'a' class=('mw-jump-link',) href='#mw-head'>,
 <Element 'a' class=('mw-jump-link',) href='#p-search'>,
 <Element 'a' href='/wiki/File:Question_book-new.svg' class=('image',)>,
 <Element 'a' href='/wiki/Wikipedia:Verifiability' title='Wikipedia:Verifiability'>,
 <Element 'a' class=('external', 'text') href='//en.wikipedia.org/w/index.php?title=42_(number)&action=edit'>]

But if all we want is the links, there are some convenient attributes for this: `response.html.links` returns a set of all the links.

In [11]:
list(response.html.links)[:5]

['/wiki/370_(number)',
 '/wiki/Coldplay',
 '/w/index.php?title=42_(number)&action=edit&section=11',
 '/wiki/136_(number)',
 '/wiki/623_(number)']
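
Most of these links are relative to the page they came from, so they need to be combined with the page URL before they can be requested. A minimal sketch with the standard library's `urljoin` is shown here; the variable names are chosen for illustration, and the sample links are taken from the outputs above.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/42_(number)"
raw_links = [
    "/wiki/Coldplay",  # path relative to the site root
    "#mw-head",        # fragment on the same page
    "//en.wikipedia.org/w/index.php?title=42_(number)&action=edit",  # scheme-relative
]

# urljoin resolves each link against the URL of the page it was found on.
absolute = [urljoin(page_url, link) for link in raw_links]
for link in absolute:
    print(link)
```

requests_html also exposes `response.html.absolute_links`, which should return the same kind of result without the manual step.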

Another option is to use [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) if we want `<a>` tags with a particular href.

In [12]:
response.html.find('a[href*=numerals]') # select all a tags where the href contains the substring numerals.

[<Element 'a' href='/wiki/Greek_numerals' title='Greek numerals'>,
 <Element 'a' href='/wiki/Roman_numerals' title='Roman numerals'>,
 <Element 'a' href='/wiki/Arabic_numerals' title='Arabic numerals'>,
 <Element 'a' href='/wiki/Chinese_numerals' title='Chinese numerals'>,
 <Element 'a' href='/wiki/Chuvash_numerals' title='Chuvash numerals'>,
 <Element 'a' href='/wiki/Hebrew_numerals' title='Hebrew numerals'>]
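
The `*=` in the selector above is one of several attribute operators. As a rough illustration of what they mean, the sketch below maps the common ones onto plain Python string checks. `attribute_matches` is a hypothetical helper written for this explanation; it is not how requests_html evaluates selectors internally.

```python
def attribute_matches(value: str, operator: str, needle: str) -> bool:
    """Mimic the CSS attribute selectors [attr=x], [attr*=x], [attr^=x], [attr$=x]."""
    if operator == "=":
        return value == needle          # exact match
    if needle == "":
        return False                    # substring operators never match an empty string
    if operator == "*=":
        return needle in value          # contains
    if operator == "^=":
        return value.startswith(needle) # begins with
    if operator == "$=":
        return value.endswith(needle)   # ends with
    raise ValueError(f"unsupported operator: {operator}")


hrefs = ["/wiki/Greek_numerals", "/wiki/Coldplay", "/wiki/File:Question_book-new.svg"]
print([h for h in hrefs if attribute_matches(h, "*=", "numerals")])
print([h for h in hrefs if attribute_matches(h, "$=", ".svg")])
```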

## Text

Often we'll want to extract the text from a web page. We can use the `.text` attribute for that.

In [13]:
text = response.html.text
text[:1000]

'42 (number) - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" ); (window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"42_(number)","wgTitle":"42 (number)","wgCurRevisionId":844818912,"wgRevisionId":844818912,"wgArticleId":191178,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1: Julian–Gregorian uncertainty","All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","Articles needing additional references from February 2011","All articles needing additional references","Articles with short description","All articles with unsourced statements","Articles with unsourced statements from February 2011","Articles with unsourced statements from April 201

This also picks up lots of text that we don't care about. We'll have to be more specific, for example by only getting the text from the `<p>` tags.

In [14]:
p_tags =  response.html.find('p')
p_tags[0]

<Element 'p' >

In [15]:
p_tags[0].text

'42 (forty-two) is the natural number that succeeds 41 and precedes 43.'
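
To make the idea of "only the text inside `<p>` tags" concrete without a network request, here is a small sketch built on the standard library's `html.parser`. The `ParagraphText` class and the sample HTML are assumptions for this example; requests_html does this work for us through `find('p')` and `.text`.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here.
        if self._in_paragraph:
            self._buffer.append(data)


sample = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<p><b>42</b> (forty-two) is the natural number.</p>"
    "<p>Second paragraph.</p></body></html>"
)
parser = ParagraphText()
parser.feed(sample)
parser.close()
print(parser.paragraphs)
```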

## Dynamic Pages

Sometimes when we request a page, the content that we want isn't in the HTML. This is because it is dynamically fetched or created by executing JavaScript in the browser. Since we're not using a browser, the JavaScript never gets run and the content never gets loaded.

In [16]:
url = "http://quotes.toscrape.com/js/"
res = sess.get(url)

In [17]:
res.html.find(".quote")

[]

Nothing is returned because the JavaScript hasn't been executed yet. Thankfully, requests_html lets us render the HTML easily. Unfortunately, due to limitations in the library, this won't work in a Jupyter notebook, but it will in a `.py` script or the Python console. Nevertheless, below is how we'd render the HTML and fetch all of the elements with a class of `quote`.

In [21]:
res.html.render() #won't work in a jupyter notebook
res.html.find(".quote")
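
The difference between the HTML as delivered and the HTML after the JavaScript has run can be illustrated offline. The sketch below counts elements with a given class in two hand-written documents: one as a server might send it, and one as it might look after a browser has executed the script. Both documents and the `count_class` helper are made up for this example and are not taken from the site above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html: str, class_name: str) -> int:
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# What the server sends: the quote only exists inside a script, as inert text.
static_html = """
<html><body>
<div id="quotes"></div>
<script>
  document.getElementById("quotes").innerHTML =
      '<div class="quote">An example quote</div>';
</script>
</body></html>
"""

# What the page would contain once a browser has run the script.
rendered_html = """
<html><body>
<div id="quotes"><div class="quote">An example quote</div></div>
</body></html>
"""

print(count_class(static_html, "quote"))
print(count_class(rendered_html, "quote"))
```

The parser treats the contents of `<script>` as plain text, which is why the first count comes out empty, just like `res.html.find(".quote")` before rendering.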

## Tables

For extracting tables, the pandas package has a really useful function, `read_html`. It won't work on all HTML tables, but it can be good for some. The table might not always be formatted correctly, but this is often easy to fix in pandas.

In [72]:
import pandas as pd

In [73]:
tables = pd.read_html("http://www.nanotech-now.com/metric-prefix-table.htm") #download all tables on page

In [74]:
df = tables[0]
df = df.rename(columns=df.iloc[0])
df = df[df.index > 0]
df.head()

Unnamed: 0,Prefix,Symbol,Multiplier,Exponential
1,yotta,Y,1000000000000000000000000,1024
2,zetta,Z,1000000000000000000000,1021
3,exa,E,1000000000000000000,1018
4,peta,P,1000000000000000,1015
5,tera,T,1000000000000,1012
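
One formatting problem is visible in the output above: the `Exponential` column reads `1024` where the page presumably showed 10 raised to the power 24, because the superscript was lost when the table was parsed. A possible fix, sketched here under the assumption that every value is `10` followed directly by the exponent, is to strip the leading `10`. `parse_flattened_power` is a hypothetical helper name.

```python
def parse_flattened_power(value) -> int:
    """Recover the exponent from a power of ten whose superscript was flattened.

    Assumes the value is the base "10" followed directly by the exponent,
    e.g. "1024" for 10^24 or "10-24" for 10^-24.
    """
    text = str(value).strip().replace("\u2212", "-")  # normalise a Unicode minus sign
    if not text.startswith("10") or len(text) == 2:
        raise ValueError(f"not a flattened power of ten: {value!r}")
    return int(text[2:])


for raw in ("1024", "1021", "1018", "10-24"):
    print(raw, "->", parse_flattened_power(raw))
```

Applied to the table, this could look like `df["Exponential"].astype(str).map(parse_flattened_power)`.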


In [None]:
# (href, text) pairs for every element with the class "storytime"
[(link.attrs.get('href'), link.text) for link in response.html.find('.storytime')]

# Resources 

* [MDN Docs](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)