# Web Scraping

## What is Web Scraping?

- "The web": a collection of files hosted on a large network of 
communicating servers.
- *Web scraping*: the act of accessing those files and programmatically saving them, or parts of them, to a chosen location (usually your computer). This is often a critical task when writing projects that require data from the internet.



HTML (HyperText Markup Language): said to be the fabric of the internet. 

Nearly all of the things that you 
would normally think of as "webpages" are really files 
written in HTML. A browser like Firefox, Chrome, or Safari is
just a program for *rendering* HTML in an attractive visual 
format. 

- Unfortunately, for scraping, we often need to interact
with raw HTML, which can get messy. 
- Fortunately, the BeautifulSoup package gives us some tools with which to do this. 
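
To make the contrast between raw and rendered HTML concrete, here is a minimal sketch that parses a small hand-written HTML string. The document contents are made up for illustration, and `html.parser` (Python's built-in parser) is chosen only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (illustrative, not from a real site).
raw_html = """
<html>
  <head><title>My page</title></head>
  <body>
    <h1>Hello</h1>
    <p>This is <b>raw</b> HTML.</p>
  </body>
</html>
"""

soup = BeautifulSoup(raw_html, "html.parser")

page_title = soup.title.get_text()  # text inside <title>
heading = soup.h1.get_text()        # text of the first <h1> tag
paragraph = soup.p.get_text()       # nested tags are dropped, their text is kept

print(page_title, heading, paragraph, sep="\n")
```

A browser would render the same string as a page with a heading and one paragraph; here we work directly with the tags instead.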


Resources:

- pd.read_html: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

- requests: https://requests.readthedocs.io/en/latest/

- Introduction to HTML: https://www.w3schools.com/html/html_intro.asp

- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
# Uncomment one of the following lines to install BeautifulSoup if it is missing.
# !conda install -c conda-forge beautifulsoup4
# !pip install beautifulsoup4

Let's take a quick look at the tutorial website we'll scrape from. 

http://quotes.toscrape.com/

We observe that there are a number of quotes, which possess 
text, authors, and tags. There are multiple pages of 
these quotes, which are accessed via the "Next" button. 

For now, let's just try to obtain the text on the webpage.

In [None]:
import requests
link = "http://quotes.toscrape.com/"

In [None]:
from bs4 import BeautifulSoup

The `BeautifulSoup` class is the basic type for parsing a webpage: it turns raw HTML into a tree of tags that can be searched.
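
One way to combine `requests` and `BeautifulSoup` is a small helper that downloads a page and returns its parsed form. The sketch below is an assumption about how such a helper might look: the name `link2soup` matches the helper called later in this notebook, whose definition is not shown, and the `timeout` parameter and the choice of `html.parser` are illustrative.

```python
import requests
from bs4 import BeautifulSoup


def link2soup(link, timeout=10):
    """Download the page at `link` and return it as a BeautifulSoup object."""
    response = requests.get(link, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

With this in place, `soup = link2soup(link)` followed by `soup.get_text()` would return the text of the page as one string.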

## CSS Selectors

CSS (Cascading Style Sheets) is a language for styling web pages. It is designed to apply formatting to certain parts of a webpage. How do we select those "certain parts"? That is what CSS selectors are for.


- CSS selector references: https://www.w3schools.com/cssref/css_selectors.php
- a fun activity: https://flukeout.github.io/


A quick snippet to parse the text, author name, and list of tags of each quote:
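
A minimal sketch of such a parser is shown below. It runs on a small inline HTML sample written to mimic the structure of the quotes page (a `div.quote` containing `span.text`, `small.author`, and `a.tag` elements). Those class names are assumptions that should be checked against the live page with the browser's inspector, and the function name `parse_quotes` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates the assumed layout of one page of quotes.
sample_html = """
<div class="quote">
  <span class="text">"A sample quote."</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags">
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/humor/">humor</a>
  </div>
</div>
<div class="quote">
  <span class="text">"Another quote."</span>
  <span>by <small class="author">John Roe</small></span>
  <div class="tags"></div>
</div>
"""


def parse_quotes(soup):
    """Return one dict per quote with its text, author, and list of tags."""
    quotes = []
    for block in soup.select("div.quote"):
        quotes.append({
            "text": block.select_one("span.text").get_text(strip=True),
            "author": block.select_one("small.author").get_text(strip=True),
            "tags": [a.get_text(strip=True) for a in block.select("a.tag")],
        })
    return quotes


quotes = parse_quotes(BeautifulSoup(sample_html, "html.parser"))
for q in quotes:
    print(q)
```

Selecting the enclosing `div.quote` first and then searching inside it keeps each text, author, and tag list together, even when a quote has no tags.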


### Following the links

At the bottom of each page, there is a "Next" button. Can we follow its link?
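
One possible way to follow the button is to look up the link inside it and resolve it against the current page URL. The sketch below assumes the button is marked up as `<li class="next"><a href="...">`, which should be verified in the browser's inspector; the helper name `next_page_url` and the sample HTML are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(soup, current_url):
    """Return the absolute URL behind the "Next" button, or None if there is none."""
    link = soup.select_one("li.next > a")
    if link is None:
        return None
    # The href is typically relative (e.g. "/page/2/"), so resolve it.
    return urljoin(current_url, link["href"])


# Two made-up pager fragments: one with a "Next" link and one without.
with_next = BeautifulSoup(
    '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>',
    "html.parser",
)
without_next = BeautifulSoup('<ul class="pager"></ul>', "html.parser")

second_page = next_page_url(with_next, "http://quotes.toscrape.com/")
no_page = next_page_url(without_next, "http://quotes.toscrape.com/page/10/")
print(second_page, no_page)
```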

__Exercise__: Can we continue on and parse all the quotes on that website?


In [None]:
base_url = "http://quotes.toscrape.com/"
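
For the exercise, one possible approach is a loop that parses a page, looks for the "Next" link, and stops when there is none. The sketch below is only one way to structure it: `fetch_soup` is a hypothetical parameter (any function that maps a URL to a `BeautifulSoup` object), the `li.next > a` and `span.text` selectors are assumptions about the page markup, and `max_pages` is an illustrative safety limit.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl_quote_texts(start_url, fetch_soup, max_pages=50):
    """Collect quote texts page by page until there is no "Next" link."""
    texts = []
    url = start_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        soup = fetch_soup(url)
        texts.extend(s.get_text(strip=True) for s in soup.select("span.text"))
        next_link = soup.select_one("li.next > a")
        url = urljoin(url, next_link["href"]) if next_link else None
        pages_seen += 1
    return texts


# Offline demonstration with two fake pages instead of real HTTP requests.
fake_site = {
    "http://example.test/": (
        '<span class="text">first</span>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "http://example.test/page/2/": '<span class="text">second</span>',
}


def fake_fetch(url):
    return BeautifulSoup(fake_site[url], "html.parser")


all_texts = crawl_quote_texts("http://example.test/", fake_fetch)
print(all_texts)
```

On the real site, `fetch_soup` could be a function that calls `requests.get` and passes the response text to `BeautifulSoup`.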


## Example: Get the Wikipedia links to country capitals

Our question: "*Get the Wikipedia links to each country capital from [this page](https://en.m.wikipedia.org/wiki/List_of_national_capitals)*" (note the mobile page link)

If you are on a desktop machine, note that the Wikipedia page has another table at the top, before the table with the capitals and countries.

In [None]:
soup = link2soup("https://en.m.wikipedia.org/wiki/List_of_national_capitals")

In [None]:
# Recall that we have the hierarchy <tr> -> <td> -> <a>, and that the href= attribute is part of the <a> tag. We need to find all the <tr> tags, then get the (first) <td> tag for each, the <a> tag from the <td> tag, and finally get the href= from that.
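
The steps in the comment above can be sketched on a small inline table. The HTML below is a made-up stand-in for the Wikipedia table (the real page's layout should be checked before relying on it), and the variable names are illustrative. Rows without a `<td>`, such as header rows, and cells without a link are skipped.

```python
from bs4 import BeautifulSoup

# Made-up table that imitates the <tr> -> <td> -> <a> hierarchy.
table_html = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td><a href="/wiki/Paris">Paris</a></td><td><a href="/wiki/France">France</a></td></tr>
  <tr><td><a href="/wiki/Tokyo">Tokyo</a></td><td><a href="/wiki/Japan">Japan</a></td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

capital_links = []
for row in soup.find_all("tr"):
    first_cell = row.find("td")              # first <td> in the row, or None
    if first_cell is None:
        continue                             # header row: only <th> cells
    anchor = first_cell.find("a", href=True)  # first <a> that has an href
    if anchor is not None:
        capital_links.append(anchor["href"])

print(capital_links)
```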


In [None]:
# What if we want both the links for the capital AND the country? Then we need to get ALL the <td> tags for each <tr> row. Using a nested list comprehension(!):
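
A nested list comprehension for this could look like the sketch below, again on a made-up table rather than the live page. The outer loop runs over the rows and the inner loop over the `<td>` cells of each row; cells without a link are skipped, and rows with no `<td>` cells are filtered out.

```python
from bs4 import BeautifulSoup

# Made-up table; the last row has a country cell without a link.
table_html = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td><a href="/wiki/Paris">Paris</a></td><td><a href="/wiki/France">France</a></td></tr>
  <tr><td><a href="/wiki/Tokyo">Tokyo</a></td><td><a href="/wiki/Japan">Japan</a></td></tr>
  <tr><td><a href="/wiki/Lima">Lima</a></td><td>Peru</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

row_links = [
    [
        td.find("a", href=True)["href"]      # link of the first <a> in the cell
        for td in tr.find_all("td")
        if td.find("a", href=True) is not None
    ]
    for tr in soup.find_all("tr")
    if tr.find("td") is not None             # skip header rows
]

print(row_links)
```

Each inner list holds the links found in one row, so a capital link and its country link stay together.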


## Example: The 100 most popular feature films released in 2023

Can be accessed at: https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=100

In [None]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=100"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'} 
# The User-Agent header makes the request look like it comes from a browser rather than a script.
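
Passing the headers to `requests.get` could look like the sketch below. The helper name `fetch_with_headers` and the `timeout` value are illustrative, and whether a given site accepts the request still depends on the site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_with_headers(url, headers, timeout=10):
    """Request `url` with custom HTTP headers and return the parsed page."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly if the request was rejected
    return BeautifulSoup(response.text, "html.parser")
```

With the `url` and `headers` defined in the cell above, `soup = fetch_with_headers(url, headers)` would return the parsed search page.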


Suppose we want to scrape the following six features from this page:
- Rank (popularity)
- Title
- Description
- Runtime
- User rating
- Metascore

### Rank and title

### Descriptions

### Runtimes

### User rating

### Metascore

Oops, we only have 23 Metascore values, and 2 are missing. How do we figure out which films are missing a Metascore?
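
One way to find the films with a missing value is to loop over each film's container and look for the score inside it, rather than selecting all scores on the page at once. A film without a score then yields `None` instead of silently disappearing from the list. The class names `film`, `title`, and `metascore` below are hypothetical placeholders on a made-up sample, not IMDb's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second film has no score element.
sample_html = """
<div class="film"><h3 class="title">Film A</h3><span class="metascore">82</span></div>
<div class="film"><h3 class="title">Film B</h3></div>
<div class="film"><h3 class="title">Film C</h3><span class="metascore">67</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

titles, metascores = [], []
for film in soup.select("div.film"):
    titles.append(film.select_one(".title").get_text(strip=True))
    score = film.select_one(".metascore")  # None when the film has no score
    metascores.append(int(score.get_text(strip=True)) if score is not None else None)

# Films whose score is missing.
missing = [t for t, m in zip(titles, metascores) if m is None]
print(missing)
```

Because `titles` and `metascores` have the same length, they can be placed side by side in a table without misaligning the rows.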

### Visualizing the data