# Course: Intro to Python & R for Data Analysis
## Lecture: Webscraping
### The Wild West of data collection
Professor: Mary Kaltenberg

Fall 2020

contact: mkaltenberg@pace.edu

About me: www.mkaltenberg.com

## Objectives:

* Inspect an HTML page and identify which parts you want to scrape.
* Scrape web pages with `requests` and `BeautifulSoup`.
* Weigh the ethical considerations (be a good citizen of the Internet).

If we have time in the next class:
* Crawl across the web
* Navigate JavaScript elements with `Selenium`

Note: additional information about Selenium is included in this notebook but is not covered in class.

<img src="https://media.giphy.com/media/Cjv37jPMVJw0o/giphy.gif" width = 300>

Fun fact: When Google started out as a Stanford research project in 1996, it was just two graduate students with an old server and a Python web crawler.



# Webscraping

Generally, webscraping consists of these steps:
1. Sending a GET request to a web server for a specific page
2. Reading the HTML output from that page
3. Extracting the data by isolating the desired content
4. Storing that content somewhere
5. (Optionally) moving to another page to rinse and repeat
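As a rough illustration, the steps above might be sketched like this. The function names (`fetch`, `extract`, `store`) and the sample HTML are made up for this example, and the sketch parses a stand-in string so that it runs without a network connection.

```python
import csv
import io

import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Steps 1-2: send a GET request and return the page's HTML as text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return response.text


def extract(html):
    """Step 3: isolate the content we care about (here, every link)."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"text": a.get_text(strip=True), "href": a.get("href")}
            for a in soup.find_all("a")]


def store(rows, handle):
    """Step 4: write the extracted rows as CSV to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=["text", "href"])
    writer.writeheader()
    writer.writerows(rows)


# Stand-in page so the sketch runs offline; in practice: html = fetch(some_url)
sample_html = '<p><a href="/a">First</a> and <a href="/b">Second</a></p>'
rows = extract(sample_html)

buffer = io.StringIO()  # stands in for a real file opened with open(...)
store(rows, buffer)
print(buffer.getvalue())
```

Step 5 would wrap `fetch`, `extract`, and `store` in a loop over several URLs.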

Webscraping typically involves detective work. You will often have to adjust your steps according to the type of data you want, and the steps that worked on one website may not work on another or even work on the same website a few months later. It requires a fair bit of art. But, if you can see it in your browser, you can scrape it.

## HTML page structure

**Hypertext Markup Language (HTML)** is the standard markup language for documents designed to be displayed in a web browser. HTML describes the structure of a web page, and it can be used with **Cascading Style Sheets (CSS)** and a scripting language such as **JavaScript** to create interactive websites. HTML consists of a series of elements that tell the browser how to display the content. Elements are represented by **tags**.

Here are some tags:
* `<!DOCTYPE html>` declaration defines this document to be HTML5.  
* `<html>` element is the root element of an HTML page.  
* `<div>` tag defines a division or a section in an HTML document. It's usually a container for other elements.
* `<head>` element contains meta information about the document.  
* `<title>` element specifies a title for the document.  
* `<body>` element contains the visible page content.  
* `<h1>` element defines a large heading.  
* `<p>` element defines a paragraph.  
* `<a>` element defines a hyperlink.

HTML tags normally come in pairs like `<p>` and `</p>`. The first tag in a pair is the opening tag, the second tag is the closing tag. The end tag is written like the start tag, but with a slash inserted before the tag name.

<img src="tags.png" width="512">

An HTML document has a tree-like 🌳 🌲 structure, which browsers expose through the **Document Object Model (DOM)**, a cross-platform and language-independent interface. Here's what a very simple HTML tree looks like.

<img src="dom_tree.gif">
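To see the tree structure without a browser, here is a minimal sketch using Python's built-in `html.parser` module. The `TreePrinter` class and the sample HTML are assumptions made for illustration. It indents each tag by its depth, and it only works for HTML in which every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect an indented outline of the tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new node one level below its parent
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level
        self.depth -= 1


html = ("<html><head><title>Intro</title></head>"
        "<body><h1>Hi</h1><p>Text with a <a href='#'>link</a>.</p></body></html>")

printer = TreePrinter()
printer.feed(html)
print("\n".join(printer.lines))
```

The printed outline shows `head` and `body` nested under `html`, and the `a` tag nested under `p`.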

### Creating a simple HTML page

Here is some simple HTML code showing what a webpage is made of (when you run the cell as Markdown, it renders roughly as it would on a webpage).

<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <title>Intro to HTML</title>
</head>

<body>
  <h1>Heading h1</h1>
  <h2>Heading h2</h2>
  <h3>Heading h3</h3>
  <h4>Heading h4</h4>

  <p>
    That's a text paragraph. You can also <b>bold</b>, <mark>mark</mark>, <ins>underline</ins>, <del>strikethrough</del> and <i>emphasize</i> words.
    You can also add links - here's one to <a href="https://en.wikipedia.org/wiki/Main_Page">Wikipedia</a>.
  </p>

  <p>
    This <br> is a paragraph <br> with <br> line breaks
  </p>

  <p style="color:red">
    Add color to your paragraphs.
  </p>

  <p>Unordered list:</p>
  <ul>
    <li>Python</li>
    <li>R</li>
    <li>Julia</li>
  </ul>

  <p>Ordered list:</p>
  <ol>
    <li>Data collection</li>
    <li>Exploratory data analysis</li>
    <li>Data analysis</li>
    <li>Policy recommendations</li>
  </ol>
  <hr>

  <!-- This is a comment -->

</body>
</html>

## Web Scraping with `requests` and `BeautifulSoup`

We will use `requests` and `BeautifulSoup` to access and scrape content from [IMDB](https://www.imdb.com).

### What is `BeautifulSoup`?

It is a Python library for pulling data out of HTML and XML files. It provides methods to navigate the document's tree structure that we discussed before and scrape its content.

*Fun fact: The BeautifulSoup library was named after a Lewis Carroll poem of the same name in Alice’s Adventures in Wonderland.*

### A Pipeline Example
<img src='scrape-pipeline.png' width="1024">

## How do you figure out what information to scrape?

"Selectors" can be found using the “inspect web element” feature that is available in all modern browsers.

In Google Chrome, you can use Chrome DevTools.

#### Chrome DevTools

[Chrome DevTools](https://developers.google.com/web/tools/chrome-devtools/) is a set of web developer tools built directly into the Google Chrome browser. DevTools can help you view and edit web pages. We will use Chrome's tool to inspect an HTML page and find which elements correspond to the data we might want to scrape.

#### Short exercise
To get some experience with the HTML page structure, we will search and locate elements in [IMDB](https://www.imdb.com/). 

**Tip**: Hit *Command+Option+C* (Mac) or *Control+Shift+C* (Windows, Linux) to access the elements panel, or right-click and choose “Inspect”.

Typically, it doesn't matter which browser you use.

#### Tasks (we will do them together)
* Find the _Sign in_ button
* Find the box containing the _Up next_
* Locate one of the photos/videos in the main section of the page.
* What is the size of the main photo?

# Let's get to Scraping!

In [1]:
!pip install bs4

Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Created wheel for bs4: filename=bs4-0.0.1-py3-none-any.whl size=1273 sha256=8065d874fc2e4d93994c8884316728c8fc8c932fe92ceabb20bf7794169ad1ee
  Stored in directory: /Users/mkaltenberg/Library/Caches/pip/wheels/75/78/21/68b124549c9bdc94f822c02fb9aa3578a669843f9767776bca
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1


In [1]:
# Import necessary libraries
import requests 
import numpy as np
import pandas as pd
import pprint
from bs4 import BeautifulSoup  #this is the spell that helps you read tags

In [10]:
# IMDB's Top 250 chart
imdb_url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'

# Use requests to retrieve data from a given URL
imdb_response = requests.get(imdb_url)

# Parse the whole HTML page using BeautifulSoup
imdb_soup = BeautifulSoup(imdb_response.text, 'html.parser')

# Title of the parsed page
imdb_soup.title

<title>Top 250 Movies - IMDb</title>

See how the `<title>` tags identify what you are looking for.

In [12]:
# We can also get it without the HTML tags
imdb_soup.title.string

'Top 250 Movies - IMDb'

In [13]:
# We can inspect what the HTML code looks like and spot the title manually,
# but really we just want to extract the information relevant to our data collection process
print(imdb_soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   Top 250 Movies - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
  </script>
  <link href="https://www.imdb.com/chart/top/" rel="canonical"/>
  <meta content="http://www.imdb.com/chart/top/" property="og:url">
   <script>
  

### Find links

In many cases, it is useful to collect the links contained in a webpage (for example, you might want to scrape them too or webcrawl). Here is how you can do this.

In [15]:
imdb_soup.find_all('a')

[<a href="/?ref_=nv_home"><svg class="ipc-logo drawer-logo" height="56" version="1.1" viewbox="0 0 64 32" width="98" xmlns="http://www.w3.org/2000/svg"><g fill="#F5C518"><rect height="100%" rx="4" width="100%" x="0" y="0"></rect></g><g fill="#000000" fill-rule="nonzero" transform="translate(8.000000, 7.000000)"><polygon points="0 18 5 18 5 0 0 0"></polygon><path d="M15.6725178,0 L14.5534833,8.40846934 L13.8582008,3.83502426 C13.65661,2.37009263 13.4632474,1.09175121 13.278113,0 L7,0 L7,18 L11.2416347,18 L11.2580911,6.11380679 L13.0436094,18 L16.0633571,18 L17.7583653,5.8517865 L17.7707076,18 L22,18 L22,0 L15.6725178,0 Z"></path><path d="M24,18 L24,0 L31.8045586,0 C33.5693522,0 35,1.41994415 35,3.17660424 L35,14.8233958 C35,16.5777858 33.5716617,18 31.8045586,18 L24,18 Z M29.8322479,3.2395236 C29.6339219,3.13233348 29.2545158,3.08072342 28.7026524,3.08072342 L28.7026524,14.8914865 C29.4312846,14.8914865 29.8796736,14.7604764 30.0478195,14.4865461 C30.2159654,14.2165858 30.3021941,13.486

In [96]:
# Find all links
# The goal is to code like this:
links = [link.get('href') for link in imdb_soup.find_all('a')]

# Translated code (the same thing as an explicit loop):
somelist = imdb_soup.find_all('a')
links = []

for link in somelist:
    ab = link.get('href')  # None if the tag has no href attribute
    links.append(ab)

# Turn relative links into absolute ones and keep the unique links
from urllib.parse import urljoin
fixed_links = set(urljoin(imdb_url, link) for link in links if link)

# Translated code:
fixed_links = set()

for link in links:  # for each item in the list called links (that we created above)
    if link:  # skip missing (None) or empty hrefs
        ab = urljoin(imdb_url, link)  # resolve the link against the page's URL
        fixed_links.add(ab)  # a set keeps only one copy of each link
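One caveat when building absolute links: an href such as `/title/tt0111161/` is relative to the site root, so gluing it onto the full page URL as a plain string gives an address that does not exist. A minimal sketch with the standard library's `urllib.parse.urljoin`, which resolves each href the way a browser would (the `hrefs` list below is made up for illustration):

```python
from urllib.parse import urljoin

page_url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"

# Hypothetical hrefs: relative, already absolute, missing, and a duplicate
hrefs = ["/title/tt0111161/", "https://www.imdb.com/", None, "/title/tt0111161/"]

# Skip missing hrefs, resolve the rest, and use a set to drop duplicates
absolute = sorted({urljoin(page_url, h) for h in hrefs if h})
print(absolute)
```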

# Let's scrape IMDB's top 250 movies

In [21]:
url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
response = requests.get(url)

imdb_soup = BeautifulSoup(response.text, 'lxml')  # parse the HTML we downloaded with BeautifulSoup
# 'lxml' selects the lxml parser (it must be installed). Learn more here: https://lxml.de/


### Let's explore the information related to:
* title of movie
* year of the movie
* ranking
* links (this will be useful if we want to scrape additional information about that movie later on)
* the crew (director and star cast)
* IMDb ratings

-> Go to your inspector page and find the relevant tags to these things

What's easy to find:
1. title = titleColumn (with tag td)
2. IMDb rating = ratingColumn.imdbRating (with tag td)

What about the other stuff?

We can find what's within tags using `find`

In [22]:
imdb_soup.find('td', {'class': 'titleColumn'})

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [23]:
imdb_soup.find('td', {'class': 'ratingColumn'})

<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,707,952 user ratings">9.2</strong>
</td>

We will use the `.find_all()` method to search the HTML tree for particular tags and get a `list` with all the relevant objects.
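The difference between `find` and `find_all` can be sketched on a tiny page. The two-row table below is a made-up stand-in that mimics the structure shown above, so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table mimicking the titleColumn cells
html = """
<table>
  <tr><td class="titleColumn">1. <a href="/title/tt0111161/">The Shawshank Redemption</a></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/tt0068646/">The Godfather</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the first match (or None when nothing matches)
first = soup.find("td", {"class": "titleColumn"})

# find_all returns every match in a list-like ResultSet
all_cells = soup.find_all("td", {"class": "titleColumn"})

first_title = first.a.string
hrefs = [cell.a["href"] for cell in all_cells]
print(first_title)
print(len(all_cells), hrefs)
```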

In [24]:
imdb_title = imdb_soup.find_all('td', {'class': 'titleColumn'})

In [26]:
imdb_title

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather Part II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
 <span class="secondaryInfo">(

In [34]:
# You can turn this into a dataframe, but it's quite messy:
# we'd need to clean it up, and it's only the first bit of what we want
imdb_table = pd.DataFrame(imdb_title).drop(columns=[2, 4])  # drop the two columns we don't need
imdb_table[0] = imdb_table[0].str.strip('\n .')
imdb_table[1] = imdb_table[1].str.join(', ')  # turn lists into values
imdb_table[3] = imdb_table[3].str.join(', ').str.strip('( )')
imdb_table.columns = 'ranking', 'movie', 'year'
imdb_table

## tbh, I often scrape like this and then join pieces of the dataframes together. Doesn't really matter how you do it.

| | ranking | movie | year |
|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 |
| 1 | 2 | The Godfather | 1972 |
| 2 | 3 | The Dark Knight | 2008 |
| 3 | 4 | The Godfather Part II | 1974 |
| 4 | 5 | 12 Angry Men | 1957 |
| ... | ... | ... | ... |
| 245 | 246 | Dersu Uzala | 1975 |
| 246 | 247 | The Help | 2011 |
| 247 | 248 | Aladdin | 1992 |
| 248 | 249 | Gandhi | 1982 |


## Another way to add even more information

We could instead work out which selectors we need and then build a list from each one.
We'll use the `select` method to do this.
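As a small offline sketch of how `select` works: it takes a CSS selector, where `td.titleColumn` means "a `td` tag with class `titleColumn`", a space means "somewhere inside", and square brackets match on an attribute. The one-row table below is a simplified, made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table with simplified values
html = """
<table>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.2"></span></td>
    <td class="titleColumn">1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> tag located inside a <td class="titleColumn">
anchors = soup.select("td.titleColumn a")
links = [a.attrs.get("href") for a in anchors]
crew = [a.attrs.get("title") for a in anchors]

# Text of every <span class="secondaryInfo">
years = [span.get_text() for span in soup.select("span.secondaryInfo")]

# Attribute selector: spans whose name attribute equals "ir"
ratings = [span["data-value"] for span in soup.select('td.posterColumn span[name="ir"]')]

print(links, crew, years, ratings)
```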

In [35]:
# You can do this by analyzing the website, or you can whittle down the info with our
# find option
imdb_soup.find('td', {'class': 'titleColumn'})

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [36]:
# Given the above:
# href always holds a link in HTML, so that's a hint to store that information
# notice that it is an attribute of the "a" tag
# title (another attribute of the "a" tag) has my crew info
# year information is in the span with the class "secondaryInfo"

movies = imdb_soup.select('td.titleColumn') #I'm going to select the title information
year = imdb_soup.select('span.secondaryInfo')#I'm going to select the year information

# a bit more tricky is getting the attribute href and title from the node "a"
#we'll use list comprehension to create a list of values that looks into the node td.titleColumn a and get 
# attributes related to our keywords

links = [a.attrs.get('href') for a in imdb_soup.select('td.titleColumn a')] #get the links

crew = [a.attrs.get('title') for a in imdb_soup.select('td.titleColumn a')] #get the crew

In [27]:
imdb_soup.find('td', {'class': 'posterColumn'})

<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.222790917333004" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2291236" name="nv"></span>
<span data-value="-1.777209082666996" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>

In [44]:
# And based on the above, we can collect the rating
ratings = [b.attrs.get('data-value') for b in imdb_soup.select('td.posterColumn span[name=ir]')]

### So, we got a list of stuff, how do we put it together?

In [45]:
movies[0]

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [46]:
# We're going to loop over everything and append it all into one list of dictionaries.
# As we move along, we'll clean the data.

# First, we're going to get the information contained in movies. Remember that this is a list of tags,
# so I'm going to index into the list we created of movies and get the text.
# Here is an example using the first item in the list.

# let's name this
movie_string = movies[0].get_text()
movies[0].get_text()

'\n      1.\n      The Shawshank Redemption\n(1994)\n'

I'm going to clean it up.

First, I'll split the string into a list of words with `.split()`, which splits on any run of whitespace (including the newlines).

Then, I'll join the words back together with single spaces using `' '.join()`.

Finally, I'll delete that pesky '.' using `.replace()`.

In [47]:
# This looks a lot nicer
movie = (' '.join(movie_string.split()).replace('.', ''))
(' '.join(movie_string.split()).replace('.', ''))


'1 The Shawshank Redemption (1994)'

In [48]:
# I'm going to skip the ranking (plus the space after it) at the front
# and drop the last 7 characters, " (1994)", to get rid of the year
movie_title = movie[len(str(1))+1:-7]
movie[len(str(1))+1:-7]


'The Shawshank Redemption'

In [49]:
#I'll do the same thing to the years variable
year_string = year[0].get_text()
years = (' '.join(year_string.split()).replace('(', '').replace(')', ''))
years

'1994'

In [50]:
# I'll take the ranking from the front of the string: it is as many characters long as the rank has digits
ranking = movie[:len(str(1))]
movie[:len(str(1))]

'1'
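The slicing above relies on knowing how many characters the ranking and the year take up. As an alternative sketch (not the approach used in the rest of this notebook), a regular expression can pull out all three pieces at once and leaves periods inside titles untouched. The pattern assumes the text always looks like `rank. title (four-digit year)`, and `parse_movie` is a hypothetical helper name.

```python
import re

# rank, a dot, the title, then a four-digit year in parentheses
MOVIE_PATTERN = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\((\d{4})\)\s*$", re.DOTALL)


def parse_movie(text):
    """Split a titleColumn string into (ranking, title, year)."""
    match = MOVIE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    ranking, title, year = match.groups()
    # Collapse any leftover newlines or repeated spaces inside the title
    return int(ranking), " ".join(title.split()), int(year)


first = parse_movie('\n      1.\n      The Shawshank Redemption\n(1994)\n')
tricky = parse_movie('10. Monsters, Inc. (2001)')  # made-up rank, period kept in the title
print(first)
print(tricky)
```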

In [51]:
# Then I'll build out a dictionary
data = {"movie_title": movie_title,
            "year": years,
            "ranking": ranking,
            "star_cast": crew[0],
            "rating": ratings[0],
            "link": links[0]}
data

{'movie_title': 'The Shawshank Redemption',
 'year': '1994',
 'ranking': '1',
 'star_cast': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
 'rating': '9.23567279859621',
 'link': '/title/tt0111161/'}

In [43]:
len(movies)

250

In [52]:
# I'll turn this into a for loop and append each item
imdb = []
for index in range(len(movies)):
    # Separate movie into: 'ranking', 'title', 'year'
    rank_width = len(str(index + 1))  # rankings start at 1, indexes at 0
    movie_string = movies[index].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    movie_title = movie[rank_width+1:-7]
    year_string = year[index].get_text()
    years = (' '.join(year_string.split()).replace('(', '').replace(')', ''))
    rankings = movie[:rank_width]
    data = {"movie_title": movie_title,
            "year": years,
            "ranking": rankings,
            "star_cast": crew[index],
            "rating": ratings[index],
            "link": links[index]}
    imdb.append(data)
imdb

[{'movie_title': 'The Shawshank Redemption',
  'year': '1994',
  'ranking': '1',
  'star_cast': 'Frank Darabont (dir.), Tim Robbins, Morgan Freeman',
  'rating': '9.23567279859621',
  'link': '/title/tt0111161/'},
 {'movie_title': 'The Godfather',
  'year': '1972',
  'ranking': '2',
  'star_cast': 'Francis Ford Coppola (dir.), Marlon Brando, Al Pacino',
  'rating': '9.155920890085513',
  'link': '/title/tt0068646/'},
 {'movie_title': 'The Dark Knight',
  'year': '2008',
  'ranking': '3',
  'star_cast': 'Christopher Nolan (dir.), Christian Bale, Heath Ledger',
  'rating': '8.991222925312774',
  'link': '/title/tt0468569/'},
 {'movie_title': 'The Godfather Part II',
  'year': '1974',
  'ranking': '4',
  'star_cast': 'Francis Ford Coppola (dir.), Al Pacino, Robert De Niro',
  'rating': '8.98386561827145',
  'link': '/title/tt0071562/'},
 {'movie_title': '12 Angry Men',
  'year': '1957',
  'ranking': '5',
  'star_cast': 'Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb',
  'rating': '8.952792

In [53]:
# Last, I'll create a dataframe from that list of dictionaries
pd.DataFrame(imdb)

| | movie_title | year | ranking | star_cast | rating | link |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 1 | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 9.23567279859621 | /title/tt0111161/ |
| 1 | The Godfather | 1972 | 2 | Francis Ford Coppola (dir.), Marlon Brando, Al... | 9.155920890085513 | /title/tt0068646/ |
| 2 | The Dark Knight | 2008 | 3 | Christopher Nolan (dir.), Christian Bale, Heat... | 8.991222925312774 | /title/tt0468569/ |
| 3 | The Godfather Part II | 1974 | 4 | Francis Ford Coppola (dir.), Al Pacino, Robert... | 8.98386561827145 | /title/tt0071562/ |
| 4 | 12 Angry Men | 1957 | 5 | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 8.95279265099551 | /title/tt0050083/ |
| ... | ... | ... | ... | ... | ... | ... |
| 245 | Dersu Uzala | 1975 | 246 | Akira Kurosawa (dir.), Maksim Munzuk, Yuriy So... | 8.005718887277654 | /title/tt0071411/ |
| 246 | The Help | 2011 | 247 | Tate Taylor (dir.), Viola Davis, Emma Stone | 8.005134781498315 | /title/tt1454029/ |
| 247 | Aladdin | 1992 | 248 | Ron Clements (dir.), Scott Weinger, Robin Will... | 8.004032924028945 | /title/tt0103639/ |
| 248 | Gandhi | 1982 | 249 | Richard Attenborough (dir.), Ben Kingsley, Joh... | 8.00217705972711 | /title/tt0083987/ |


In [46]:
# The Full code
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Download IMDB's Top 250 data
#(Let's quickly check out what it looks like)
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)

imdb_soup = BeautifulSoup(response.text, 'lxml')  # parse the downloaded HTML with BeautifulSoup
# 'lxml' selects the lxml parser (it must be installed). Learn more here: https://lxml.de/

movies = imdb_soup.select('td.titleColumn') #I'm going to select the title column
links = [a.attrs.get('href') for a in imdb_soup.select('td.titleColumn a')] #get the links
crew = [a.attrs.get('title') for a in imdb_soup.select('td.titleColumn a')] #get the crew
ratings = [b.attrs.get('data-value') for b in imdb_soup.select('td.posterColumn span[name=ir]')]
year = imdb_soup.select('span.secondaryInfo')#I'm going to select the year information

imdb = [] #creating an empty list

# Store each item into dictionary (data), then put those into a list (imdb)
for i in range(len(movies)):
    # Separate movie into: 'place', 'title', 'year'
    rank_width = len(str(i + 1))  # places start at 1, indexes at 0
    movie_string = movies[i].get_text()
    movie = (' '.join(movie_string.split()).replace('.', ''))
    movie_title = movie[rank_width+1:-7]
    year_string = year[i].get_text()
    years = (' '.join(year_string.split()).replace('(', '').replace(')', ''))
    place = movie[:rank_width]
    data = {"movie_title": movie_title,
            "year": years,
            "place": place,
            "star_cast": crew[i],
            "rating": ratings[i],
            "link": links[i]}
    imdb.append(data)

imdb = pd.DataFrame(imdb)  # make it into a data frame
imdb['rating'] = imdb['rating'].astype(float).round(3) #rounding the values

# voilà, easy-peasy data in your hands
imdb

| | movie_title | year | place | star_cast | rating | link |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 1 | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 9.223 | /title/tt0111161/ |
| 1 | The Godfather | 1972 | 2 | Francis Ford Coppola (dir.), Marlon Brando, Al... | 9.149 | /title/tt0068646/ |
| 2 | The Godfather: Part II | 1974 | 3 | Francis Ford Coppola (dir.), Al Pacino, Robert... | 8.981 | /title/tt0071562/ |
| 3 | The Dark Knight | 2008 | 4 | Christopher Nolan (dir.), Christian Bale, Heat... | 8.973 | /title/tt0468569/ |
| 4 | 12 Angry Men | 1957 | 5 | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 8.931 | /title/tt0050083/ |
| ... | ... | ... | ... | ... | ... | ... |
| 245 | The Terminator | 1984 | 246 | James Cameron (dir.), Arnold Schwarzenegger, L... | 8.009 | /title/tt0088247/ |
| 246 | Tangerines | 2013 | 247 | Zaza Urushadze (dir.), Lembit Ulfsak, Elmo Nüg... | 8.008 | /title/tt2991224/ |
| 247 | Aladdin | 1992 | 248 | Ron Clements (dir.), Scott Weinger, Robin Will... | 8.007 | /title/tt0103639/ |
| 248 | Fanny and Alexander | 1982 | 249 | Ingmar Bergman (dir.), Bertil Guve, Pernilla A... | 8.007 | /title/tt0083922/ |


# Ethical considerations

**You can scrape it, should you though?**

A very good summary of practices for [ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

* If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
* I will only save the data I absolutely need from your page.
* I will respect any content I do keep. I’ll never pass it off as my own.
* I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
* I will respond in a timely fashion to your outreach and work with you towards a resolution.
* I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other [important components](http://robertorocha.info/on-the-ethics-of-web-scraping/) of ethical web scraping practices include:

* Read the Terms of Service and Privacy Policies of a website before scraping it (this might not be possible in many situations though).
* If it’s not clear from looking at the website, contact the webmaster and ask if and what you’re allowed to harvest.
* Be gentle on smaller websites
    * Run your scraper in off-peak hours
    * Space out your requests.
* Identify yourself by name and email in your User-Agent strings.
* Inspect the **robots.txt** file for rules about what pages can be scraped, indexed, etc.
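To make "space out your requests" concrete, here is a minimal sketch that pauses between consecutive downloads. `polite_fetch_all` is a hypothetical helper, and the `fetch` argument is a stand-in for whatever actually downloads a page (for example, a small function wrapping `requests.get`). The demo passes `str.upper` so that it runs without touching any website, and the delay is shortened for the demo.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # give the server a break before the next request
        pages.append(fetch(url))
    return pages


start = time.monotonic()
pages = polite_fetch_all(["page-1", "page-2", "page-3"], fetch=str.upper, delay_seconds=0.1)
elapsed = time.monotonic() - start

print(pages)
print(f"took about {elapsed:.1f} seconds")
```

For a real scraper, a delay of a second or more per request is a gentler default than the 0.1 seconds used here.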

### What is a robots.txt?

It's a simple text file placed on the web server that tells crawlers which files they can and cannot access. The standard it follows is called the _Robots Exclusion Protocol_.

<img src='robots.png' width="600">

#### Some examples

In [54]:
print(requests.get('https://www.imdb.com/robots.txt').text)

# robots.txt for https://www.imdb.com properties
User-agent: *
Disallow: /OnThisDay
Disallow: /ads/
Disallow: /ap/
Disallow: /mymovies/
Disallow: /r/
Disallow: /register
Disallow: /registration/
Disallow: /search/name-text
Disallow: /search/title-text
Disallow: /find
Disallow: /find$
Disallow: /find/
Disallow: /tvschedule
Disallow: /updates
Disallow: /watch/_ajax/option
Disallow: /_json/video/mon
Disallow: /_json/getAdsForMediaViewer/
Disallow: /list/ls*/_ajax
Disallow: /list/ls*/export
Disallow: /*/*/rg*/mediaviewer/rm*/tr
Disallow: /*/rg*/mediaviewer/rm*/tr
Disallow: /*/mediaviewer/*/tr
Disallow: /title/tt*/mediaviewer/rm*/tr
Disallow: /name/nm*/mediaviewer/rm*/tr
Disallow: /gallery/rg*/mediaviewer/rm*/tr
Disallow: /tr/
Disallow: /title/tt*/watchoptions
Disallow: /search/title/?title_type=feature,tv_movie,tv_miniseries,documentary,short,video,tv_short&release_date=,2020-12-31&lists=%21ls538187658,%21ls539867036,%21ls538186228&view=simple&sort=num_votes,asc&aft
Disallow: /name/nm*/fil

In [53]:
print(requests.get('https://www.nesta.org.uk/robots.txt').text)

User-Agent: *

Disallow: /search/

Allow: /



In [None]:
print(requests.get('https://www.howtogeek.com/robots.txt').text)
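Python's standard library can read these rules for you. Below is a minimal sketch using `urllib.robotparser`, with the rules pasted in as a string (mirroring the Nesta output above) so that no request is made. The two URLs being checked are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the Nesta robots.txt output shown above
rules = """
User-Agent: *
Disallow: /search/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # parse the rules from a list of lines instead of downloading them

# can_fetch(user_agent, url) answers: may this crawler request this URL?
search_allowed = parser.can_fetch("*", "https://www.nesta.org.uk/search/?q=python")
about_allowed = parser.can_fetch("*", "https://www.nesta.org.uk/about/")

print(search_allowed, about_allowed)
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` downloads and parses the file.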

#### What's a User-Agent?

A User-Agent is a string identifying the browser and operating system to the web server. It's your machine's way of saying _Hi, I am Chrome on macOS_ to a web server.

Web servers use user agents for a variety of purposes:
* Serving different web pages to different web browsers. This can be used for good – for example, to serve simpler web pages to older browsers – or evil – for example, to display a “This web page must be viewed in Internet Explorer” message.
* Displaying different content to different operating systems – for example, by displaying a slimmed-down page on mobile devices.
* Gathering statistics showing the browsers and operating systems in use by their users. If you ever see browser market-share statistics, this is how they’re acquired.

Let's break down the structure of a human-operated User-Agent:

```Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405```

The components of this string are as follows:

* Mozilla/5.0: Previously used to indicate compatibility with the Mozilla rendering engine.
* (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us): Details of the system in which the browser is running.
* AppleWebKit/531.21.10: The platform the browser uses.
* (KHTML, like Gecko): Browser platform details.
* Mobile/7B405: This is used by the browser to indicate specific enhancements that are available directly in the browser or through third parties. An example of this is Microsoft Live Meeting which registers an extension so that the Live Meeting service knows if the software is already installed, which means it can provide a streamlined experience to joining meetings.

When scraping websites, it is a good idea to include your contact information as a custom **User-Agent** string so that the webmaster can get in contact. For example:

In [1]:
import requests

headers = {
    'User-Agent': 'Mary Kaltenberg bot',
    'From': 'mkaltenberg@pace.edu'
}
response = requests.get('https://www.pace.edu/', headers=headers)
print(response.request.headers)  # the headers that were actually sent with the request

Setting headers like this (or, in some cases, ones that look just like your browser's) can also help you overcome obstacles when web scraping: some websites will block repeated requests if you don't identify yourself.

## Advanced web scraping tools 
This section and its example exercise come from a tutorial by Nesta. *Thanks to Kostas Stathoulopoulos and Alex Bishop for this open-source information and tutorial.*


**[Scrapy](https://scrapy.org)** is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.

**[ARGUS](https://github.com/datawizard1337/ARGUS)** is an easy-to-use web mining tool that's built on Scrapy. It is able to crawl a broad range of different websites.

**[Selenium](https://selenium-python.readthedocs.io/index.html)** is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provides infrastructure for the W3C WebDriver specification — a platform and language-neutral coding interface compatible with all major web browsers. We can use it to imitate a user's behaviour and interact with Javascript elements (buttons, sliders etc.).

For now, let's see how Selenium works.

### How to install Selenium
1. If you are using Anaconda: `conda install -c conda-forge selenium`
2. Download the driver for your web browser from [here](https://selenium-python.readthedocs.io/installation.html#drivers). **Note:** Choose a driver that corresponds to your web browser's version. Unzip the file and move the executable to your working directory.

#### Important note on Selenium and web drivers
If you are running this notebook locally, follow the above steps and run the code directly below (change the path to where your web driver is located). If you are running this notebook on Colab, skip the next cell and run the one below it.

### Scraping data with Selenium
We will use [UK's Yearly Box Office](https://www.boxofficemojo.com/intl/uk/yearly/) to scrape not only the top 100 but all the top movies of 2019. This will be our pipeline:

<img src='selenium-pipeline.png' width='1024'>

### Exercise

Use Selenium to scrape Box Office Mojo's top \#100 for every year between 2002 and 2019.

[Link to the solutions (note: this is dated and has not been checked)](https://colab.research.google.com/github/nestauk/im-tutorials/blob/3-ysi-tutorial/notebooks/Web-Scraping/solutions.ipynb)

## Web Crawling
#### The very grey area of collecting data

Before webcrawling, think about (from WSP):

* As you may know, just a few clicks can lead you down a rabbit hole of unsavory information/content. Don't web crawl if you are uncomfortable with that possibility for any reason. It doesn't matter where you start; you can end up in a black hole of a dumpster fire.

* What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?

* When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?

* Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?

* How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across? (Check out Appendix C for more information on this subject.)

* Be conscientious about how much bandwidth you are using, and make every effort to determine whether there’s a way to lighten the load on the target server.

* Check out Appendix C of WSP on the pitfalls and legal considerations in the USA.

* Be careful about web-scraping from European websites especially concerning personal information. A useful guide can be found [here](https://blog.scrapinghub.com/web-scraping-gdpr-compliance-guide).
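Several of the questions above (stay on one site or follow outbound links, how far to go) become parameters of the crawler. Here is a minimal breadth-first sketch over a made-up, in-memory "web", so nothing is actually downloaded. `FAKE_WEB`, `crawl`, and its arguments are all hypothetical names; a real crawler would replace `get_links` with a function that fetches a page and extracts its hrefs.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web": page URL -> hrefs found on that page
FAKE_WEB = {
    "https://example.org/": ["/a", "/b", "https://other.example.com/"],
    "https://example.org/a": ["/b", "/c"],
    "https://example.org/b": [],
    "https://example.org/c": ["/"],
}


def crawl(start_url, get_links, max_pages=10, same_site_only=True):
    """Breadth-first crawl from start_url; returns pages in visiting order."""
    start_host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # never queue the same page twice
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if same_site_only and urlparse(link).netloc != start_host:
                continue  # skip outbound links to other websites
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("https://example.org/", lambda url: FAKE_WEB.get(url, []))
print(order)
```

Setting `same_site_only=False` lets the crawler wander off to other sites, which is exactly where the rabbit-hole risk comes from; `max_pages` caps how far it can go.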

<img src ="https://media.giphy.com/media/4BgQaxfQfeqys/giphy.gif">


### Dark/Deep Web
Many websites are not indexed by search engines; for example, crawlers skip sites that specifically exclude them through the robots.txt file. These places are sometimes called the "deep web". You can web scrape information in these places. I won't go into it, but there are a lot of resources out there that can show you how. Indexed websites (aka "googleable") are only a small percentage of what's out there.

The dark web (or darknet) is another thing entirely. It runs over the existing network infrastructure but is reached through a client such as Tor, which routes traffic through an encrypted overlay network to provide a secure channel for exchanging information. I've never attempted to scrape from here, so you'd be on your own if you do it.

## Scrapy

Scrapy is another useful package; it is introduced above under the advanced web scraping tools.

## Additional resources/references:

* [Document Object Model (DOM)](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction)
* [HTML elements reference guide](https://www.w3schools.com/tags/default.asp)
* [About /robots.txt](https://www.robotstxt.org/robotstxt.html)
* [The robots.txt file](https://varvy.com/robottxt.html)
* [Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01)
* [On the Ethics of Web Scraping](http://robertorocha.info/on-the-ethics-of-web-scraping/)
* [User-Agent](https://en.wikipedia.org/wiki/User_agent)
* [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Selenium Python - Unofficial documentation](https://selenium-python.readthedocs.io/)
* [ARGUS paper](http://ftp.zew.de/pub/zew-docs/dp/dp18033.pdf)
* [Brian C. Keegan's](http://www.brianckeegan.com/) excellent [5-week web scraping course](https://github.com/CU-ITSS/Web-Data-Scraping-S2019) intended for researchers in the social sciences and humanities.

Note: Much of this tutorial is inspired by:
- WSP book (cited in the syllabus)
- [Grant McDermott's Lecture 6 Notes](https://raw.githack.com/uo-ec607/lectures/master/06-web-css/06-web-css.html)
- Tutorial by Nesta (I used their code/examples)