## What is Screen-Scraping?


Screenscraping refers to the process of automatically extracting data from web pages. 

A typical screenscraping program:
- downloads a webpage in HTML format
- finds some piece of desired information
- places that information in a convenient format
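The three steps above can be sketched on a tiny example. This is only an illustration: the HTML string, the `scrape_temperature` function, and the tag names in it are made up for this sketch, and the string stands in for a page that a real scraper would download.

```python
from bs4 import BeautifulSoup

# Step 1 (stand-in): a real scraper would download this HTML from the web.
# The page content below is hypothetical.
HTML = """
<html><body>
  <h1 id="title">Weather today</h1>
  <span class="temp">21</span>
</body></html>
"""

def scrape_temperature(html):
    """Hypothetical helper: extract the title and temperature from the page."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: find the desired pieces of information
    title = soup.find("h1", id="title").get_text(strip=True)
    temp = soup.find("span", class_="temp").get_text(strip=True)
    # Step 3: place them in a convenient format (here, a dictionary)
    return {"title": title, "temperature": int(temp)}

record = scrape_temperature(HTML)
print(record)
```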

Screenscraping can also be used to download other types of content, such as audio-visual content.

## Reading content from a web-page in Python 

Let's say we wanted to parse a table from this web page:
[https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending](https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending)


<img src="../slides_and_videos/images/table_example.png" alt="drawing" width="300"/>


First, we would import a couple of useful packages. 

In [1]:
from bs4 import BeautifulSoup  # A package to work with HTML data
import requests  # A package to make HTTP requests

Then, we would request the page from the internet using the `requests` package, and parse the HTML content using `BeautifulSoup`. We store the parsed content of the page in a variable named _soup_.

In [2]:

LINK = "https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending"
r = requests.get(LINK) 


In [5]:
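soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid a warning
# soup  # uncomment to display the parsed HTML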
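A request can fail (for example with a 404 or 500 status), in which case the response body is not the page we expect. The sketch below wraps the two steps in a function that checks for this. The name `fetch_soup`, the 10-second timeout, and the injectable `get` parameter are assumptions made for this illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10, get=requests.get):
    """Hypothetical helper: download a page and return it as parsed HTML.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be tried out without network access.
    """
    response = get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")

# Usage (requires network access):
# soup = fetch_soup("https://en.wikipedia.org/wiki/List_of_countries_by_social_welfare_spending")
```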

To find our table, we open the elements panel in our browser (Command + Shift + C in Chrome on a Mac, Ctrl + Shift + C on Windows or Linux).
Scrolling through the elements, we find that our table is stored in a _&lt;table&gt;_ element of class _wikitable_.
Each row in the table is stored in a _&lt;tr&gt;_ (table row) element.
The cells of each row are defined using a mix of _&lt;td&gt;_ (data cell) and _&lt;th&gt;_ (header cell) elements.
    
<img src="../slides_and_videos/images/table_example2.png" alt="drawing" width="700"/>


We use the ``find`` method to get the first _&lt;table&gt;_ element of class _wikitable_ within _soup_.

Then, we use the ``find_all`` method to get all the rows (table row elements _&lt;tr&gt;_) in our table. The first row holds the headers (table header elements _&lt;th&gt;_); we loop through the remaining rows and use ``find_all`` again to get the data (table data elements _&lt;td&gt;_).

In [11]:

table = soup.find("table", {"class": "wikitable"})
table_rows = table.find_all("tr")  # all rows of the table, header row first

# Get the column names from the header cells of the first row
ths = table_rows[0].find_all("th")
header = [th.text.replace("\n", "") for th in ths]

# Get the data cells of every remaining row
rows = []
for tr in table_rows[1:]:
    tds = tr.find_all("td")
    row = [td.text.replace("\n", "") for td in tds]
    rows.append(row)
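The same pattern can be tried without network access on a small inline table. The HTML string below is a made-up stand-in that borrows two rows from the Wikipedia table; the first `<table>` is there only to show that ``find`` skips elements whose class does not match.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: one unrelated table, then the one we want
HTML = """
<table class="other"><tr><td>ignored</td></tr></table>
<table class="wikitable">
  <tr><th>Country</th><th>2019</th></tr>
  <tr><td>France</td><td>31.2</td></tr>
  <tr><td>Belgium</td><td>28.9</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find returns the first matching element only
table = soup.find("table", {"class": "wikitable"})

# find_all returns every matching element as a list
table_rows = table.find_all("tr")

# Header cells come from the first row, data cells from the remaining rows
header = [th.get_text(strip=True) for th in table_rows[0].find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table_rows[1:]
]

print(header)
print(rows)
```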

In [64]:
import pandas as pd
pd.DataFrame(rows, columns=header)

Unnamed: 0,Unnamed: 1,Country,2019,2016,2010,2005,2000
0,1,France,31.2,31.5,30.7,28.7,27.5
1,2,Belgium,28.9,29.0,28.3,25.3,23.5
2,3,Finland,28.7,30.8,27.4,23.9,22.6
3,4,Italy,28.2,28.9,27.6,24.1,22.6
4,5,Denmark,28.0,28.7,28.9,25.2,23.8
5,6,Austria,26.6,27.8,27.6,25.9,25.5
6,7,Sweden,26.1,27.1,26.3,27.4,26.8
7,8,Germany,25.1,25.3,25.9,26.3,25.4
8,9,Norway,25.0,25.1,21.9,20.7,20.4
9,10,Spain,23.7,24.6,25.8,20.4,19.5
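Note that every cell scraped this way is a string, because it comes from the text of an HTML element. Before doing any arithmetic, the year columns would need to be converted to numbers. A minimal sketch, assuming a shortened version of the `header` and `rows` lists built above (the two rows here are copied from the output for illustration):

```python
import pandas as pd

# Shortened stand-ins for the scraped header and rows (all values are strings)
header = ["Country", "2019", "2016"]
rows = [["France", "31.2", "31.5"], ["Belgium", "28.9", "29.0"]]

df = pd.DataFrame(rows, columns=header)

# Convert every column except "Country" to numbers;
# errors="coerce" turns cells that are not numeric into NaN instead of failing
year_cols = [col for col in df.columns if col != "Country"]
df[year_cols] = df[year_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```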


Pandas can also extract the tables from an HTML page automatically with ``pd.read_html``, which returns a list containing one DataFrame per table it finds.
However, this will not be useful if you need to parse content other than tables.

In [65]:
tables = pd.read_html(LINK)
tables[1]

Unnamed: 0.1,Unnamed: 0,Country,2019,2016,2010,2005,2000
0,1,France,31.2,31.5,30.7,28.7,27.5
1,2,Belgium,28.9,29.0,28.3,25.3,23.5
2,3,Finland,28.7,30.8,27.4,23.9,22.6
3,4,Italy,28.2,28.9,27.6,24.1,22.6
4,5,Denmark,28.0,28.7,28.9,25.2,23.8
5,6,Austria,26.6,27.8,27.6,25.9,25.5
6,7,Sweden,26.1,27.1,26.3,27.4,26.8
7,8,Germany,25.1,25.3,25.9,26.3,25.4
8,9,Norway,25.0,25.1,21.9,20.7,20.4
9,10,Spain,23.7,24.6,25.8,20.4,19.5


## Is Screen-Scraping Legal?


Yes, with some limitations that depend on how you perform the scraping and how you use the data you scrape.
There are still a lot of grey areas when it comes to the law on web scraping. Here is a list of practical tips you can follow to scrape the web ethically.

1. Respect and follow the Terms of Service (ToS). Check the website’s terms of use and its [``robots.txt``](https://en.wikipedia.org/wiki/Robots.txt) file before scraping data ([here](https://en.wikipedia.org/robots.txt) is the robots.txt file for Wikipedia).

2. Scrape at a moderate rate, so that your requests do not overload the website's server.

3. Consider any actions a website takes to restrict web scraping. If a website uses anti-scraping measures, such as CAPTCHAs, rate limits, or blocking of IP addresses, treat them as a sign that scraping may not be welcome and proceed with caution.

4. Consider whether any data to be scraped is personally identifiable information (PII) of EU citizens. The GDPR places limits on collecting PII.

5. Consider whether any data to be scraped is protected by copyright. Don't scrape copyrighted data.

6. Make good use of the scraped data (don't reshare it).

When in doubt, talk to an expert. 
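Tips 1 and 2 can be partly automated. The sketch below uses Python's standard `urllib.robotparser` module to check which URLs a `robots.txt` file allows, and pauses between requests. The `robots.txt` content, the URLs, the user-agent name `my-course-scraper`, and the helper functions are all hypothetical; this is one possible approach, not a guarantee of legal or ethical compliance.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-course-scraper"  # hypothetical name for our scraper

def make_parser(robots_text):
    """Build a robots.txt parser from the text of the file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]

def fetch_politely(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

parser = make_parser(ROBOTS_TXT)
urls = ["https://example.org/wiki/Page", "https://example.org/private/data"]
to_fetch = allowed_urls(parser, urls)

# Use the site's requested delay if it declares one, otherwise wait 1 second
delay = parser.crawl_delay(USER_AGENT) or 1
print(to_fetch, delay)
```

In a real scraper, `fetch` could be a function that calls `requests.get`, and `fetch_politely(to_fetch, fetch, delay)` would then download the allowed pages one at a time.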
