# Web Scraping

## What is Web Scraping?

- "The web": a collection of files hosted on a large network of communicating servers.
- *Web scraping*: the act of accessing those files and programmatically saving them, or parts of them, to a chosen location (usually your computer). This is often a critical task in projects that require data from the internet.



HTML (HyperText Markup Language) is often described as the fabric of the web. Nearly all of the things that you would normally think of as "webpages" are really files written in HTML. A browser like Firefox, Chrome, or Safari is just a program for *rendering* HTML in an attractive visual format.

- Unfortunately, for scraping, we often need to interact
with raw HTML, which can get messy. 
- Fortunately, the BeautifulSoup package gives us some tools with which to do this. 
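
To make "raw HTML" concrete, here is a minimal sketch that parses a tiny hand-written page. The page content is made up for illustration; only `BeautifulSoup`, `get_text`, and attribute access are real library features.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical content, for illustration only).
raw_html = """
<html>
  <body>
    <h1>My page</h1>
    <p class="intro">Hello, <b>world</b>!</p>
  </body>
</html>
"""

# Parse the raw text with Python's built-in "html.parser" backend.
soup = BeautifulSoup(raw_html, "html.parser")

heading = soup.h1.get_text()   # text inside the first <h1> tag
intro = soup.p.get_text()      # text of the first <p>, with nested tags flattened
intro_class = soup.p["class"]  # attributes are read like dict keys; class is a list

print(heading)
print(intro)
print(intro_class)
```

The browser would show only the heading and the sentence; the tags and attributes are the part we have to navigate when scraping.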


Resources:

- pd.read_html: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

- requests: https://requests.readthedocs.io/en/latest/

- Introduction to HTML: https://www.w3schools.com/html/html_intro.asp

- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
! conda install -c conda-forge beautifulsoup4

Let's take a quick look at the tutorial website we'll scrape from. 

http://quotes.toscrape.com/

We observe that there are a number of quotes, each of which has a text, an author, and tags. There are multiple pages of these quotes, which are accessed via the "Next" button.

For now, let's just try to obtain the text of the webpage.

In [None]:
import requests
link = "http://quotes.toscrape.com/"
data = requests.get(link).text

In [None]:
print(data)

In [None]:
from bs4 import BeautifulSoup

The `BeautifulSoup` class is the basic type for parsing a webpage: it turns the raw HTML text into a tree of tags that we can search.

In [None]:
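def link2soup(link):
    """Download the page at `link` and parse it into a BeautifulSoup object."""
    data = requests.get(link).text
    # Name the parser explicitly; otherwise BeautifulSoup picks one and emits a warning.
    return BeautifulSoup(data, "html.parser")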

In [None]:
soup = link2soup(link)

In [None]:
type(soup)

## CSS Selectors

CSS (Cascading Style Sheets) is a language for styling web pages. It is designed to apply formatting to certain parts of a webpage. How do we select "certain parts"? That is what CSS selectors are for, and the `select` method of BeautifulSoup accepts the same selector syntax.


- CSS selector references: https://www.w3schools.com/cssref/css_selectors.php
- a fun activity: https://flukeout.github.io/
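
A minimal sketch of a few common selector patterns, run against a small inline snippet. The snippet reuses the class names that the selectors in this section target (`quote`, `text`, `author`, `tags`, `tag`), but its contents are made up.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with the same class names as the tutorial site.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag" href="/tag/x/">x</a><a class="tag" href="/tag/y/">y</a></div>
</div>
<div class="quote">
  <span class="text">Second quote</span>
  <small class="author">Author B</small>
  <div class="tags"><a class="tag" href="/tag/z/">z</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "tag.class": <small> elements whose class is "author"
authors = [el.get_text() for el in soup.select("small.author")]

# "A B" (descendant): <a class="tag"> anywhere inside a <div class="tags">
tags = [el.get_text() for el in soup.select("div.tags a.tag")]

# select_one returns the first match (or None), so no [0] indexing is needed
first_text = soup.select_one("div.quote span.text").get_text()

# "[attr]": elements that carry an href attribute
hrefs = [el["href"] for el in soup.select("a[href]")]

print(authors, tags, first_text, hrefs)
```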


Some quick code to parse the text, author name, and list of tags of each quote:

In [None]:
soup.select("small.author")

In [None]:
soup.select("small.author")[0].get_text()

In [None]:
l = []

for t in soup.select("div.quote"):
    text = t.select("span.text")[0].get_text()
    author = t.select("small.author")[0].get_text()
    tags = t.select("div.tags a.tag")
    tags = [x.get_text() for x in tags]
    l.append((text, author, tags))        

In [None]:
l

### Following the links

At the bottom of each page, there is a "next" button. Can we follow the link?

In [None]:
next_button = soup.select(".next a")[0]
next_button

In [None]:
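from urllib.parse import urljoin

# urljoin resolves the relative href against the base link
# (plain concatenation doubles the slash if the href starts with "/")
next_url = urljoin(link, next_button.attrs["href"])
next_url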
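
A note on building the next URL: the `href` of the button is a relative path, so it has to be combined with the site address. The sketch below contrasts plain string concatenation with `urllib.parse.urljoin` from the standard library, assuming the href has the form `/page/2/`.

```python
from urllib.parse import urljoin

base = "http://quotes.toscrape.com/"
href = "/page/2/"  # assumed shape of the relative link in the "next" button

naive = base + href           # plain concatenation keeps both slashes
joined = urljoin(base, href)  # resolves the href against the base URL

# urljoin also works when the base is itself a sub-page
from_page_2 = urljoin("http://quotes.toscrape.com/page/2/", "/page/3/")

print(naive)
print(joined)
print(from_page_2)
```

Many servers tolerate the doubled slash, but `urljoin` avoids relying on that.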

In [None]:
next_soup = link2soup(next_url)
next_soup

In [None]:
for t in next_soup.select("div.quote"):
    text = t.select("span.text")[0].get_text()
    author = t.select("small.author")[0].get_text()
    tags = t.select("div.tags a.tag")
    tags = [x.get_text() for x in tags]
    l.append((text, author, tags))        

__Exercise__: Can we continue on and parse all the quotes on that website?


In [None]:
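from urllib.parse import urljoin

def parse_page(l, soup, base_url):
    """Append (text, author, tags) for each quote on the page to `l`.

    Returns the URL of the next page, or None on the last page.
    """
    for t in soup.select("div.quote"):
        text = t.select("span.text")[0].get_text()
        author = t.select("small.author")[0].get_text()
        tags = t.select("div.tags a.tag")
        tags = [x.get_text() for x in tags]
        l.append((text, author, tags))

    # The "next" button is absent on the last page, so the match list may be empty.
    next_button_match = soup.select(".next a")
    if next_button_match:
        # urljoin resolves the relative href against the base URL
        return urljoin(base_url, next_button_match[0].attrs["href"])
    return None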


In [None]:
base_url = "http://quotes.toscrape.com/"
l = []
soup = link2soup(base_url)
while True:
    next_url = parse_page(l, soup, base_url)
    if not next_url:
        break
    else:
        soup = link2soup(next_url)

In [None]:
len(l)

## Example: Get the Wikipedia links to country capitals

Our question: "*Get the Wikipedia links to each country capital from [this page](https://en.m.wikipedia.org/wiki/List_of_national_capitals)*" (note the mobile page link)

Note that the desktop version of this page has an extra table at the top, before the table of capitals and countries.

In [None]:
soup = link2soup("https://en.m.wikipedia.org/wiki/List_of_national_capitals")

In [None]:
soup.find_all('tr')[1:]

In [None]:
# Recall the hierarchy <tr> -> <td> -> <a>; the href attribute belongs to the <a> tag.
# So: find all <tr> tags, take the first <td> of each, then its <a>, then the href.
# The `if` guard skips rows whose first cell is missing or has no link.
['https://en.m.wikipedia.org' + tr.td.a['href']
 for tr in soup.find_all('tr')[1:] if tr.td and tr.td.a][:30]
# (first 30 only)
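
The `<tr>` -> `<td>` -> `<a>` navigation can be tried offline on a small table. This is a sketch on made-up HTML (the city and country names and paths are hypothetical); it also shows why a guard is useful when a cell has no link.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a header row, one fully linked row, one row with an unlinked first cell.
html = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td><a href="/wiki/City_A">City A</a></td><td><a href="/wiki/Country_A">Country A</a></td></tr>
  <tr><td>City B (no link)</td><td><a href="/wiki/Country_B">Country B</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
base = "https://en.m.wikipedia.org"

rows = soup.find_all("tr")[1:]  # skip the header row

# tr.td is the first <td> of the row; tr.td.a is the first <a> inside it (None if absent)
first_links = [base + tr.td.a["href"] for tr in rows if tr.td and tr.td.a]

# every linked cell among the first two <td> of each row
all_links = [base + td.a["href"] for tr in rows for td in tr.find_all("td")[:2] if td.a]

print(first_links)
print(all_links)
```

Without the `if` guards, the second data row would raise an error, because `tr.td.a` is `None` there.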

In [None]:
# What if we want the links for both the capital AND the country? Then we need the first two <td> tags of each <tr> row, using a nested list comprehension:
['https://en.m.wikipedia.org' + td.a['href'] for tr in soup.find_all('tr')[1:]
 for td in tr.find_all('td')[:2] if td.a][:30]
# (first 30 only)

## Example: The 100 most popular feature films released in 2023

Can be accessed at: https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=100

In [None]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2023-01-01,2023-12-31&count=100"
# Send a browser-like User-Agent header so the request looks like it comes from a user, not a robot.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
data = requests.get(url, headers=headers).text
soup = BeautifulSoup(data, "html.parser")

In [None]:
soup;

Suppose we want to scrape the following 6 features from this page:
- Rank (popularity)
- Title
- Description
- Runtime
- User rating
- Metascore

In [None]:
soup.select('.ipc-title-link-wrapper')

In [None]:
soup.select('.ipc-title-link-wrapper .ipc-title__text')

### Rank and title

In [None]:
title_texts = [x.get_text() for x in soup.select('.ipc-title-link-wrapper .ipc-title__text')]

In [None]:
title_texts

In [None]:
import re # the Python regex module

In [None]:
rank_data = [int(re.search('^[0-9]+', x).group(0)) for x in title_texts]
rank_data

In [None]:
title_data = [x[re.search(r'^[0-9]+\. ', x).end():] for x in title_texts]  # \. matches a literal dot
title_data
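
The rank and the title can also be pulled out with one pattern and two capture groups. A minimal sketch on made-up strings in the same `<rank>. <title>` shape:

```python
import re

# Made-up strings in the "<rank>. <title>" shape used above.
title_texts = ["1. First Example Film", "2. Film No. 2", "10. Another Title"]

# Group 1: the leading digits; \. is a literal dot; group 2: everything after the spaces.
pattern = re.compile(r"^(\d+)\.\s+(.*)$")

rank_data, title_data = [], []
for s in title_texts:
    m = pattern.match(s)
    if m is None:  # skip strings that do not follow the pattern
        continue
    rank_data.append(int(m.group(1)))
    title_data.append(m.group(2))

print(rank_data)
print(title_data)
```

Escaping the dot matters: an unescaped `.` matches any character, so a title that itself contains digits could be split in the wrong place.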

### Descriptions

In [None]:
description_data = [x.get_text() for x in soup.select('.ipc-html-content-inner-div')]
description_data

### Runtimes

In [None]:
runtime_data = [x.get_text() for x in soup.select('.dli-title-metadata-item:nth-child(2)')]
runtime_data[0]

In [None]:
runtime_hr = [int(re.search(r"\d+(?=h)", x).group(0)) if re.search(r"\d+(?=h)", x) else 0 for x in runtime_data]

In [None]:
runtime_min = [int(re.search(r"\d+(?=m)", x).group(0)) if re.search(r"\d+(?=m)", x) else 0 for x in runtime_data]

In [None]:
runtime_data = [h * 60 + m for h, m in zip(runtime_hr, runtime_min)]  # total runtime in minutes

In [None]:
runtime_data
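
The hour and minute extraction can be wrapped in one helper. A minimal sketch, assuming runtime strings of the form `2h 30m`, `3h`, or `45m` (the function name `runtime_to_minutes` is hypothetical):

```python
import re

def runtime_to_minutes(s):
    """Convert a string such as '2h 30m', '3h' or '45m' to total minutes."""
    hours = re.search(r"(\d+)\s*h", s)    # digits directly before an "h"
    minutes = re.search(r"(\d+)\s*m", s)  # digits directly before an "m"
    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total

examples = ["2h 30m", "3h", "45m", ""]
converted = [runtime_to_minutes(s) for s in examples]
print(converted)
```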

### User rating

In [None]:
userrating_data = [x.get_text() for x in soup.select('.ratingGroup--imdb-rating')]

In [None]:
userrating_data[0]

In [None]:
userrating_data = [float(x.split('\xa0')[0]) for x in userrating_data]

In [None]:
userrating_data
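
The `'\xa0'` in the split above is a non-breaking space (the HTML entity `&nbsp;`), which separates the rating from the text after it. A minimal sketch of a slightly more defensive parser, assuming strings shaped like `8.3\xa0(1.2M)`; the examples and the helper name `parse_rating` are made up:

```python
import math
import re

def parse_rating(s):
    """Return the number at the start of `s` as a float, or NaN if there is none."""
    m = re.match(r"\s*(\d+(?:\.\d+)?)", s)
    return float(m.group(1)) if m else math.nan

# Made-up examples; "\xa0" is a non-breaking space.
examples = ["8.3\xa0(1.2M)", "7\xa0(900)", ""]
ratings = [parse_rating(s) for s in examples]
print(ratings)
```

Returning NaN instead of raising keeps the list the same length as the other columns when a rating is absent.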

### Metascore

In [None]:
metascore_data = [float(x.get_text()) for x in soup.select('.metacritic-score-box')]

In [None]:
len(metascore_data)

Oops, we only have 23 metascores, so 2 are missing. How do we figure out which films are missing a metascore?

In [None]:
mixed = [x.get_text() for x in soup.select('.ipc-title-link-wrapper .ipc-title__text , .metacritic-score-box')]
mixed

In [None]:
mixed = [x[:re.search(' .*', x).start()] if re.search(' .*', x) else x for x in mixed]

In [None]:
mixed

In [None]:
import numpy as np
isrank = np.array(['.' in x for x in mixed]) # True where the element is a rank such as "1."
ismissing = isrank[:-1] & isrank[1:] # a rank immediately followed by another rank has no metascore

In [None]:
ismissing = np.hstack([ismissing, [isrank[-1]]]) # if the last entry is a rank, no score follows it, so it is missing too

In [None]:
missingpos = np.array([int(float(x)) for x in np.array(mixed)[np.array(ismissing)]]) - 1 # zero-based positions (rank - 1) of the films missing a metascore
missingpos

In [None]:
mask = np.ones(len(rank_data), dtype=bool) # one entry per film, instead of a hard-coded 25
mask[missingpos] = False

In [None]:
metascore_data_ = np.full(len(rank_data), np.nan) # start with NaN everywhere
metascore_data_[mask] = metascore_data

In [None]:
metascore_data = metascore_data_ 

In [None]:
metascore_data
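
The same alignment can be done without masks by walking through the interleaved list once. This is an alternative sketch, not the approach used above; the toy `mixed` list is made up and assumes, as above, that ranks end with `.` and that the list starts with a rank.

```python
import math

# Toy version of `mixed`: ranks end with ".", scores do not (made-up values).
mixed = ["1.", "85", "2.", "3.", "70", "4."]

scores = []
for item in mixed:
    if item.endswith("."):
        scores.append(math.nan)   # new film: no score until we see one
    else:
        scores[-1] = float(item)  # a score belongs to the most recent rank

print(scores)
```

Films 2 and 4 end up with NaN because no score follows their rank.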

### Visualizing the data

In [None]:
import pandas as pd
df = pd.DataFrame(data={
    "poprank": rank_data,
    "title": title_data,
    "description": description_data,
    "runtime": runtime_data,
    "userrating": userrating_data,
    "metascore": metascore_data,
})

In [None]:
df

In [None]:
from plotly import express as px

In [None]:
fig = px.scatter(df,
                 x="userrating",
                 y="metascore",
                 hover_name="title",
                 height=500,
                 trendline="lowess"  # LOWESS trendline; requires the statsmodels package
)
fig.show()