# Web Scraping

Ideally, all the data you need is easy to access, and you don't have to scour the web for a source that supports your analysis.

We'd all prefer one of these:

<img src="images/other_options.png" alt="Three easier options: a downloadable CSV, a database connection, or an API (image taken from materials provided by another instructor; original source unknown)" width=650>

But we're not always so lucky! Sometimes we need data that's less accessible.

Enter...

<img alt="beautiful soup logo" src="images/bs.png" width=500>

> "You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."

- From the Beautiful Soup [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#)

## Grabbing Movie Data

We might think about grabbing more movie data, as we gear up towards our Phase 1 project which uses movie data. 

If we go to [IMDb](https://www.imdb.com/), their API offerings seem expensive, while their advanced search returns tabular results that seem _extremely_ scrapable.

**BUT** 

Enter - [conditions of use pages](https://www.imdb.com/conditions) ... and ethics!

> "**Robots and Screen Scraping:** You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below."

**Let's Discuss**

- Do people scrape sites they shouldn't? Sure, all the time. But am I going to tell you to ignore conditions/terms of use? Absolutely not. Make good choices.
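
Alongside the written conditions of use, many sites also publish a `robots.txt` file with machine-readable crawling rules. It does not replace reading the terms, but it's a quick extra check. Here's a minimal sketch using the standard library; the rules in `SAMPLE_ROBOTS_TXT` and the `can_scrape` helper are made up for illustration, and a real site's file will differ:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, hand-written for this example.
# A real file lives at https://<site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Allow: /
"""


def can_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_scrape(SAMPLE_ROBOTS_TXT, "https://example.com/wiki/2000_in_film"))
print(can_scrape(SAMPLE_ROBOTS_TXT, "https://example.com/search/title"))
```

For a live site you could call `parser.set_url(...)` and `parser.read()` instead of parsing a string, which makes a network request.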


Instead, let's scrape Wikipedia for movie data - Wikipedia has a very accessible Creative Commons license for use!

Let's explore a few [years in film](https://en.wikipedia.org/wiki/Table_of_years_in_film).

## Task: Grab the top 10 highest-grossing films for each year, 2000-2020

### Imports

Our goal is to collect data into a pandas DataFrame. We're still working with websites, so we'll still need the `requests` library.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup # note this odd import statement structure
import lxml

In [2]:
# you may also need lxml - https://lxml.de/index.html
# helps process html or xml in python
# !pip install lxml

Test case - [the year 2000](https://en.wikipedia.org/wiki/2000_in_film).

In [3]:
# Get the response from the website, using requests
resp = requests.get("https://en.wikipedia.org/wiki/2000_in_film")
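
Before digging into the response, it can help to confirm the request actually succeeded. A minimal sketch, assuming we want an error on any 4xx/5xx status; `fetch_html` is a made-up helper name, not part of `requests`:

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page's HTML, raising an error for 4xx/5xx responses."""
    # `session` is optional so a caller can reuse one connection across requests
    session = session or requests.Session()
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError on a bad status code
    return resp.text


# Usage (makes a real network request, so it's left commented out):
# html = fetch_html("https://en.wikipedia.org/wiki/2000_in_film")
```

The `timeout` keeps a stalled connection from hanging the notebook forever.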

In [None]:
# Let's check out the text attribute of that response...
resp.text
# (ew)

In [5]:
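# And now... beautiful soup! Let's soup-ify that text attribute
# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(resp.text, 'lxml')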

In [None]:
# Can use a prettify function to pretty print
print(soup.prettify())

In [7]:
# Now we need to find the table we want in the soup - use .find()
# Can pass a dictionary in the attributes argument
table = soup.find('table', {'class':"wikitable sortable"})
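
One thing to watch: searching on a class string with a space in it matches the class attribute's exact text. A table with an extra class, or the same classes in a different order, won't match. Here's a small sketch on hand-written HTML (not the real Wikipedia markup) comparing that with a CSS selector, which only requires that both classes be present:

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration only
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches only tables whose class attribute is exactly "wikitable sortable"
exact = soup.find_all("table", {"class": "wikitable sortable"})

# Matches any table that has both classes, whatever else it has
by_selector = soup.select("table.wikitable.sortable")

print(len(exact), len(by_selector))
```

If `.find()` ever comes back as `None` on a page you expected to work, this is a good first thing to check.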

In [8]:
# Explore that result
len(table.find_all('tr'))

11

In [9]:
# Check out the first real row in the table
table.find_all('tr')[1].get_text().split('\n\n')

['\n1', 'Mission: Impossible 2', 'Paramount', '$546,388,105\n']

In [10]:
# Check out the last row... what's missing?
table.find_all('tr')[10].get_text()

'\n10\n\nWhat Lies Beneath\n\n$291,420,351\n'
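
The distributor is what's missing from that last row, most likely because the page merges repeated distributors into a single cell with `rowspan`. If we wanted to handle that by hand, we'd have to carry the merged cell down into the rows it covers. A rough sketch on a hand-written miniature of the table (`table_rows` is a made-up helper, and it ignores `colspan`):

```python
from bs4 import BeautifulSoup

# Hand-written miniature of the table; the real page's markup is larger
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Title</th><th>Distributor</th></tr>
  <tr><td>9</td><td>X-Men</td><td rowspan="2">20th Century Fox</td></tr>
  <tr><td>10</td><td>What Lies Beneath</td></tr>
</table>
"""


def table_rows(table):
    """Return rows as lists of cell text, repeating cells that span several rows."""
    rows = []
    pending = {}  # column index -> (text, number of later rows still to fill)
    for tr in table.find_all("tr"):
        cells = iter(tr.find_all(["th", "td"]))
        row, col = [], 0
        while True:
            if col in pending:
                # This column is covered by a rowspan cell from an earlier row
                text, remaining = pending[col]
                row.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (text, remaining - 1)
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(strip=True)
            span = int(cell.get("rowspan", 1))
            row.append(text)
            if span > 1:
                pending[col] = (text, span - 1)
            col += 1
        rows.append(row)
    return rows


rows = table_rows(BeautifulSoup(html, "html.parser").find("table"))
for row in rows:
    print(row)
```

That's a fair amount of bookkeeping for one merged cell, which is part of why a ready-made table parser is so appealing.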

**But wait...** there's a shortcut (thanks pandas)

In [16]:
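# Note - pandas likes the prettify objects better
# Newer pandas versions expect a file-like object rather than a raw HTML string
from io import StringIO
df = pd.read_html(StringIO(table.prettify()))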

In [23]:
films_2000 = df[0]

In [25]:
films_2000['Year'] = 2000

In [26]:
films_2000

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Mission: Impossible 2 | Paramount | $546,388,105 | 2000 |
| 1 | 2 | Gladiator | Universal | $460,583,960 | 2000 |
| 2 | 3 | Cast Away | 20th Century Fox | $429,632,142 | 2000 |
| 3 | 4 | What Women Want | Paramount | $374,111,707 | 2000 |
| 4 | 5 | Dinosaur | Disney | $349,822,765 | 2000 |
| 5 | 6 | How the Grinch Stole Christmas | Universal | $345,141,403 | 2000 |
| 6 | 7 | Meet the Parents | Universal | $330,444,045 | 2000 |
| 7 | 8 | The Perfect Storm | Warner Bros. | $328,718,434 | 2000 |
| 8 | 9 | X-Men | 20th Century Fox | $296,339,527 | 2000 |
| 9 | 10 | What Lies Beneath | 20th Century Fox | $291,420,351 | 2000 |


### Now Loop It!

In [30]:
# My preference - create a list of dataframes, then concat afterwards
# Are there other ways to create one big df from this? OF COURSE!

from io import StringIO  # newer pandas versions expect a file-like object in read_html

list_of_dfs = []

for year in range(2000, 2021):  # 2000 through 2020 inclusive
    url = f"https://en.wikipedia.org/wiki/{year}_in_film"
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find('table', {'class':"wikitable sortable"})
    df = pd.read_html(StringIO(table.prettify()))[0]
    df['Year'] = year
    list_of_dfs.append(df)
    # Only 21 requests... not going to worry about using time to pause requests
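
If a loop like this grows well beyond a couple dozen pages, pausing between requests is the polite thing to do. A minimal sketch: `fetch_all` and its parameters are made-up names, and the `fetch` argument stands in for whatever function does the real request:

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results


urls = [f"https://en.wikipedia.org/wiki/{year}_in_film" for year in range(2000, 2003)]

# A stand-in fetch function so this example makes no network requests
pages = fetch_all(urls, fetch=lambda url: f"<html>{url}</html>", delay_seconds=0)
print(len(pages))
```

In real use you might pass something like `lambda url: requests.get(url).text` as `fetch` and keep the default one-second delay.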

In [34]:
list_of_dfs[20]

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | The Eight Hundred | CMC Pictures Holdings | $464,760,324 | 2020 |
| 1 | 2 | Bad Boys for Life | Sony | $424,505,244 | 2020 |
| 2 | 3 | My People, My Homeland | China Lion Film Distribution | $394,780,000 | 2020 |
| 3 | 4 | Tenet | Warner Bros. | $341,600,000 | 2020 |
| 4 | 5 | Sonic the Hedgehog | Paramount | $308,434,533 | 2020 |
| 5 | 6 | Dolittle | Universal | $250,482,863 | 2020 |
| 6 | 7 | Jiang Ziya | Beijing Enlight Pictures | $234,023,520 | 2020 |
| 7 | 8 | Birds of Prey | Warner Bros. | $201,858,461 | 2020 |
| 8 | 9 | Demon Slayer: Infinity Train | Toho / Aniplex | $150,000,000 | 2020 |
| 9 | 10 | Onward | Disney | $144,983,422 | 2020 |


In [39]:
# Now to concat...
full_df = pd.concat(list_of_dfs, ignore_index=True)

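Let's practice some data cleaning on the `Worldwide gross` column:

In [75]:
# Note: this cell was re-run after the cleaning steps below, so the output
# already shows integers; before cleaning, the values are strings like '$546,388,105'
full_df['Worldwide gross'].head()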

0    546388105
1    460583960
2    429632142
3    374111707
4    349822765
Name: Worldwide gross, dtype: int64

In [66]:
full_df.loc[full_df['Worldwide gross'] == '$871,014,978  [2]']

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 31 | 2 | Finding Nemo | Disney | $871,014,978 [2] | 2003 |


In [68]:
full_df['Worldwide gross'][31].split()[0]

'$871,014,978'

In [71]:
# Keep only the dollar amount, dropping any trailing footnote marker like [2]
full_df['Worldwide gross'] = full_df['Worldwide gross'].map(lambda x: x.split()[0])

In [73]:
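# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
full_df['Worldwide gross'] = full_df['Worldwide gross'].str.replace(",", "", regex=False).str.replace("$", "", regex=False).astype(int)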
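
The same cleaning logic can also live in one small function, which is easier to test on the odd cases before applying it to the whole column. A sketch, assuming every value starts with a dollar amount (`parse_gross` is a made-up name):

```python
import re


def parse_gross(raw: str) -> int:
    """Convert a string like '$871,014,978 [2]' to an integer number of dollars."""
    match = re.search(r"\$([\d,]+)", raw)
    if match is None:
        raise ValueError(f"No dollar amount found in {raw!r}")
    return int(match.group(1).replace(",", ""))


print(parse_gross("$871,014,978 [2]"))
print(parse_gross("$546,388,105"))
```

On the raw string column this could be applied with `.map(parse_gross)`; raising an error on an unexpected value makes surprises in the data show up right away.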

In [77]:
full_df.head()

| | Rank | Title | Distributor | Worldwide gross | Year |
|---|---|---|---|---|---|
| 0 | 1 | Mission: Impossible 2 | Paramount | 546388105 | 2000 |
| 1 | 2 | Gladiator | Universal | 460583960 | 2000 |
| 2 | 3 | Cast Away | 20th Century Fox | 429632142 | 2000 |
| 3 | 4 | What Women Want | Paramount | 374111707 | 2000 |
| 4 | 5 | Dinosaur | Disney | 349822765 | 2000 |


## Discussion Time!

What else could we do with web scraping? Do any project ideas come to mind? Is there anything else on that page we could use to grab more data? Let's discuss!

- The table rows contained links to each film's page - we could follow those URLs to grab even more data on each movie
- Can loop through any kind of repeatable URL, provided you figure out the pattern!
- The possibilities are endless... (but don't forget to check the terms of use, don't get in trouble!)
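
As a starting point for the first idea, here's a rough sketch of pulling the link out of each title cell. The HTML is a hand-written stand-in for the real table, the `href` values are assumptions about what the page contains, and `title_links` is a made-up helper:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org"

# Hand-written stand-in for the scraped table; hrefs are assumed, not copied from the page
html = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Title</th></tr>
  <tr><td>1</td><td><i><a href="/wiki/Mission:_Impossible_2">Mission: Impossible 2</a></i></td></tr>
  <tr><td>2</td><td><i><a href="/wiki/Gladiator_(2000_film)">Gladiator</a></i></td></tr>
</table>
"""


def title_links(table, base_url=BASE_URL):
    """Map each linked cell's text to an absolute URL."""
    links = {}
    for a in table.select("td a[href]"):
        # Relative hrefs like /wiki/... need the site's base URL prepended
        links[a.get_text(strip=True)] = urljoin(base_url, a["href"])
    return links


soup = BeautifulSoup(html, "html.parser")
links = title_links(soup.find("table"))
for title, url in links.items():
    print(title, url)
```

Each of those URLs could then go through the same request-and-parse steps we used for the yearly pages.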
