## Scraping websites

In [None]:
%matplotlib inline

import pandas as pd

Will pandas solve my problems?

Some tables on web pages can be read in with `read_html`. This works for tables that are in the document's HTML, rather than rendered by JavaScript or some other technique. You can confirm by inspecting the HTML source, but trial and error is usually faster.

The Titanic data might have been displayed on a website like this: 

![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/html_table.png)


Unlike the other `read` methods, which return a dataframe, `read_html` returns a list of dataframes. This is useful when a page contains more than one table. In this case there is only one table, so it returns a list with one item.
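To see why a list is the natural return type, here is a rough standard-library sketch that collects every `<table>` on a page. It is not how pandas implements `read_html`; the `PAGE` string and the `TableCollector` class are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser

# A made-up page with two separate tables
PAGE = """
<html><body>
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Ada</td><td>36</td></tr>
</table>
<p>Some text between the tables.</p>
<table>
  <tr><th>Class</th><th>Count</th></tr>
  <tr><td>First</td><td>2</td></tr>
  <tr><td>Third</td><td>5</td></tr>
</table>
</body></html>
"""


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []       # one entry per <table> found
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.tables[-1][-1][-1] += data


collector = TableCollector()
collector.feed(PAGE)
tables = collector.tables

print(len(tables))  # how many tables the page contains
print(tables[0])    # rows of the first table
```

Two tables on the page means two items in the list, which is why you index into the result of `read_html` before treating it as a dataframe.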

In [None]:
titanic_url = 'https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/data/titanic.html'

pd.read_html(titanic_url)

In [None]:
html_table_list = pd.read_html(titanic_url)

In [None]:
len(html_table_list)

In [None]:
html_table_list[0]

In [None]:
df_html = html_table_list[0]
df_html.sample(3)


<div class="alert alert-info">
<h3> Your turn</h3>
<p> Using the Wikipedia page on homicide, make a dataframe of the estimated homicide rates
in Europe by century. 

</div>


<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
html_table_list = pd.read_html('https://en.wikipedia.org/wiki/Homicide')

eh_df = html_table_list[1]
eh_df
</code>
</details>

## `read_html` limitations:
    
* Only handles basic HTML tables.
* Only grabs displayed text -- no URLs.
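The second limitation matters when the link behind a cell is what you need. The sketch below, using only the standard library, pulls the `href` out of each `<a>` tag in a made-up table (`TABLE` and `LinkCollector` are illustrative names, not part of pandas). Pandas 1.5 and later also accept an `extract_links` argument to `read_html`; check the documentation for your version before relying on it.

```python
from html.parser import HTMLParser

# A made-up table whose first column contains links
TABLE = """
<table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td><a href="/team/a">Team A</a></td><td>10</td></tr>
  <tr><td><a href="/team/b">Team B</a></td><td>7</td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Record (link text, href) for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._text.strip(), self._href))
            self._href = None


collector = LinkCollector()
collector.feed(TABLE)
links = collector.links
print(links)
```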

### See if someone else has written a package for you

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/espn1.png)


![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/espn2.png)


![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/espn3.png)

## Sometimes you have to do it yourself

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology.png)

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology_view_source.png)

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology_source.png)

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology_title.png)

In [None]:
%pip install requests_html

In [None]:
from requests_html import HTMLSession

In [None]:
session = HTMLSession()

In [None]:
url = 'https://en.wikipedia.org/wiki/Sociology'
r = session.get(url)

In [None]:
r.status_code

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/statuscode.png)
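If you are unsure what a status code means, the standard library can translate it. This is a small sketch; `describe_status` is a hypothetical helper, not part of `requests_html`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code}: not a standard HTTP status code"
    # Codes in the 200s mean the request succeeded
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 503):
    print(describe_status(code))
```

In a scraper you would typically check `r.status_code` before parsing, and skip or retry pages that did not return a code in the 200s.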

In [None]:
r.html

In [None]:
r.html.html[:500]

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology_source.png)

In [None]:
parsed_html = r.html

### In HTML  parsed by `requests_html`,  `find` looks for tags

In [None]:
parsed_html.find('title')

In [None]:
parsed_html.find('title')[0]

In [None]:
parsed_html.find('title')[0].html

In [None]:
parsed_html.find('title')[0].text

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/wiki_sociology_h1.png)

In [None]:
parsed_html.find('h1')

In [None]:
parsed_html.find('h1')[0].text

### In HTML  parsed by `requests_html`,  `find` looks for tags. It returns a list.

For each element in the resulting list:
* `.html` returns the  HTML (as a string)
* `.text` returns the visible text (as a string)
* `.absolute_links` returns the full path of any links (as a set)

In [None]:
parsed_html.find('h1')[0].html

In [None]:
parsed_html.find('h1')[0].absolute_links
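"Full path" means that relative links from the page source are resolved against the page's own URL. The sketch below shows that idea with `urllib.parse.urljoin`; it is an approximation of what `.absolute_links` gives you, not the library's own code, and the `hrefs` list is made up.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Sociology"

# Made-up hrefs, written the way they might appear in a page's source
hrefs = [
    "/wiki/Social_science",               # relative to the site root
    "https://www.example.org/paper.pdf",  # already absolute
    "/wiki/Social_science",               # a duplicate
]

# A set comprehension, so duplicates collapse just as they do in a set of links
absolute = {urljoin(page_url, href) for href in hrefs}

for link in sorted(absolute):
    print(link)
```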

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Explore the content tagged <code>h2</code> on the Sociology Wikipedia page.
</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
h2s = parsed_html.find('h2')
print(len(h2s))
for item in h2s:
    print(item.text)
</code>
</details>



![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/wiki_sociology_h2.png)

### `find` looks for classes if you put a `.` before the class

In [None]:
parsed_html.find('h1')[0]

In [None]:
parsed_html.find('.firstHeading')

In [None]:
parsed_html.find('.firstHeading')[0].text

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Print out 'History' by locating its class.
</div>




<details>
<summary>Sample answer </summary> 
<code style="background-color: white">
parsed_html.find('.mw-headline')[0].text</code>
</details>



![](https://raw.githubusercontent.com/nealcaren/UiOBigData/master/notebooks/images/wiki_sociology_h2.png)

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> What is the tag associated with the text of the article?
</div>



<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
parsed_html.find('p')
</code>
</details>

In [None]:
for item in parsed_html.find('p'):
    print(item.text)

### `find` looks for attributes when you put them in square brackets.

![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/socnote.png)

In [None]:
parsed_html.find('[role=note]')

In [None]:
for item in parsed_html.find('[role=note]'):
    print(item.text)

In [None]:
for item in parsed_html.find('.reference-text'):
    print(item.text)
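The three kinds of lookup used so far (by tag, by class, by attribute) can be compared side by side on a tiny snippet. This sketch swaps in the standard library's `xml.etree.ElementTree` and its XPath-style paths instead of `requests_html`, so it runs offline. The snippet is made up and must be well-formed, and the class test here is an exact match, whereas a CSS `.class` selector also matches elements that carry several classes.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed snippet loosely shaped like an article page
SNIPPET = """
<div>
  <h1 class="firstHeading" id="top">Sociology</h1>
  <div role="note">A note pointing to a related article.</div>
  <p>Sociology is the study of <b>society</b>.</p>
  <p>A second paragraph.</p>
</div>
"""

root = ET.fromstring(SNIPPET.strip())


def text_of(element):
    """All text inside an element, including text in nested tags."""
    return "".join(element.itertext()).strip()


by_tag = root.findall(".//p")                           # compare find('p')
by_class = root.findall(".//*[@class='firstHeading']")  # compare find('.firstHeading')
by_attr = root.findall(".//*[@role='note']")            # compare find('[role=note]')

for label, matches in [("tag", by_tag), ("class", by_class), ("attribute", by_attr)]:
    print(label, [text_of(m) for m in matches])
```

Each lookup returns a list, even when only one element matches, which mirrors the behavior of `find`.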

<div class="alert alert-info">
<h3>Homework</h3>
    <p> Find the links for all the references.</p>
    <p> Hint: sets can be converted to a list:</p>
    <p> <code>link_list = list(links_in_a_set)</code></p>
    
</div>




<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
links = []

for item in parsed_html.find('.reference-text'):
    item_links = list(item.absolute_links)
    links = item_links + links
    
print(len(links))
print(links[:5])
</code>
</details>
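The sample answer keeps a link once for every reference that cites it. If you want each link only once, one option is to accumulate into a set and convert at the end. The `link_sets` values below are hypothetical stand-ins for `item.absolute_links`.

```python
# Hypothetical link sets, standing in for item.absolute_links on three references
link_sets = [
    {"https://example.org/a", "https://example.org/b"},
    {"https://example.org/b"},
    set(),  # a reference with no links
]

all_links = set()
for links_in_a_set in link_sets:
    all_links.update(links_in_a_set)  # in-place union; duplicates collapse

# sorted() returns a list, and gives a repeatable order (sets are unordered)
link_list = sorted(all_links)

print(len(link_list))
print(link_list)
```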



<div class="alert alert-info">
<h3>A new page</h3>

<p> Store the text of the Crime in the United States Wikipedia article as a string.
</div>





<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
url = 'https://en.wikipedia.org/wiki/Crime_in_the_United_States'
r = session.get(url)
parsed_html = r.html
text = ''
for paragraph in parsed_html.find('p'):
    text = text + paragraph.text


print(text[:100])
print(text[-200:])
</code>
</details>




<div class="alert alert-info">
<h3>A new new page</h3>

<p> Extract the headline and text from this <a href='https://www.justice.gov/opa/pr/two-former-twitter-employees-and-saudi-national-charged-acting-illegal-agents-saudi-arabia'>DOJ press release</a>.
</div>





<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
url = 'https://www.justice.gov/opa/pr/two-former-twitter-employees-and-saudi-national-charged-acting-illegal-agents-saudi-arabia'
    
r = session.get(url)
parsed_html = r.html

headline = parsed_html.find('h1')[1].text
subhead = parsed_html.find('h2')[3].text


text = ''
for paragraph in parsed_html.find('p'):
    text = text + paragraph.text

    
</code>
</details>


![](https://raw.githubusercontent.com/nealcaren/KULeuvenBigData/master/notebooks/images/function.png)

<div class="alert alert-info">
<h3> Your turn</h3>
    <p> Make a function that takes a URL and returns the text (ignoring headers and notes and other stuff) of a Wikipedia article. 
</div>




<details>
<summary>Sample answer code</summary> 
<code style="background-color: white">
def get_wiki_text(url):
    r = session.get(url)
    parsed_html = r.html
    text = ''
    for paragraph in parsed_html.find('p'):
        text = text + paragraph.text
    return text

</code>
<code style="background-color: white">
# sample usage
get_wiki_text('https://en.wikipedia.org/wiki/Sociology')
 
</code>
</details>
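One thing to watch in the sample answers: `text = text + paragraph.text` adds no separator, so the last word of one paragraph runs straight into the first word of the next. A possible variation is to join the paragraphs with a blank line and skip the empty ones. The `paragraphs` list here is a made-up stand-in for the `.text` of each `p` element.

```python
# Made-up paragraph strings, standing in for paragraph.text
paragraphs = ["First paragraph.", "", "Second paragraph.", "  "]

# Skip empty or whitespace-only paragraphs; separate the rest with a blank line
text = "\n\n".join(p.strip() for p in paragraphs if p.strip())

print(text)
```

Inside `get_wiki_text`, the same idea would replace the loop with a single `join` over `parsed_html.find('p')`.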
