### Assignment 2 - Introduction to web scraping

In this assignment, we open an HTTP connection to Metacritic and retrieve the first 100 records (the first results page) of movies released in a specific year.

We use validated, tested regular expressions to make sure we target and capture the correct data. Once retrieved, the data is immediately stored in a pandas DataFrame, where we can store and manipulate it however we want.
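One way to validate a pattern before pointing it at the live page is to run it against a small hand-written fragment whose expected matches are known in advance. The sketch below assumes a fragment shaped like the title markup the notebook targets; `sample_html` and the two example titles are made up for illustration and are not taken from Metacritic.

```python
import re

# Hypothetical, hand-written HTML fragment; not copied from the real page.
sample_html = (
    '<a href="/movie/example-one" class="title"><h3>Example One</h3></a>\n'
    '<a href="/movie/example-two" class="title"><h3>Example Two</h3></a>\n'
)

# Same shape as the Title pattern used later in the notebook.
title_pattern = r'class="title"><h3>(.*)</h3></a>'

# "." does not match a newline by default, so each line yields its own match.
titles = re.findall(title_pattern, sample_html)
print(titles)
```

If the printed list differs from the titles written into the fragment, the pattern needs another look before it is used on real data.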

In [1]:
# imports
import pandas as pd
import urllib3
import re

In [2]:
# establish target URL to get page contents
url = "https://www.metacritic.com/browse/movies/score/metascore/year/filtered?year_selected=2015&sort=desc&view=detailed"

# create a connection pool manager to send requests through
http = urllib3.PoolManager()

# send a browser-like User-Agent header, per advice from June --> still need to learn more about why this is needed and how the answer was found
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'

# "GET" request to get page contents, passing user agent string
resp = http.request('GET', url, headers={'User-Agent': user_agent})

# store page contents as string for later parsing
datastring = str(resp.data, "utf-8")

print(f"Page response: {resp.status}, with a total page length of {len(datastring)} characters")

Page response: 200, with a total page length of 519412 characters
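The cell above prints the status code but parses the body regardless of what it is. A minimal sketch of a guard that refuses to continue on a non-200 response might look like the following; `decode_page` is a hypothetical helper, not part of urllib3, and the values passed to it here are stand-ins for `resp.status` and `resp.data`.

```python
def decode_page(status: int, body: bytes, encoding: str = "utf-8") -> str:
    """Return the decoded page text, or raise if the request did not succeed."""
    if status != 200:
        raise RuntimeError(f"Unexpected HTTP status: {status}")
    # errors="replace" keeps parsing going if a few bytes are not valid UTF-8
    return body.decode(encoding, errors="replace")


# Stand-in values; in the notebook these would be resp.status and resp.data
page = decode_page(200, b"<html>ok</html>")
print(len(page))
```

Failing early this way avoids running the regexes over an error page and silently producing an empty DataFrame.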


In [3]:
# store tested regex patterns in a dict keyed by column name; reused from assignment 1 after validation and some extra fixes
# raw strings keep backslash sequences such as \s intact for the regex engine
reg_vals = {
    'Title': r'class="title"><h3>(.*)</h3></a>',
    'Description': r'<div class="summary">\s(.*)\s',
    'ReleaseDate': r'<div class="clamp-details">\s+<span>(.*)</span>',
    'Metascore': r'<div class="clamp-score-wrap">\s*.*\s<div class="metascore_w large movie positive">(.*)<',
    'PhotoURL': r'<a href="/movie/.*"><img src="(.*)" a'
}
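Every pattern above captures with a greedy `(.*)`. That works as long as each target sits on its own line, because `.` stops at a newline, but it can over-capture if two tags ever share a line. The sketch below illustrates the difference between greedy and lazy capture on a made-up line; the `line` string is hypothetical and not taken from the real page.

```python
import re

# Hypothetical single line containing two image tags (not from the real page)
line = '<img src="a.jpg" alt="A"><img src="b.jpg" alt="B">'

# Greedy: (.*) runs to the LAST '" a' on the line, swallowing both tags
greedy = re.findall(r'<img src="(.*)" a', line)

# Lazy: (.*?) stops at the FIRST '" a', so each tag is captured separately
lazy = re.findall(r'<img src="(.*?)" a', line)

print(greedy)
print(lazy)
```

Whether switching to `(.*?)` is worthwhile depends on how the page lays out its markup, so it is worth checking against the actual HTML first.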

In [79]:
# establish dataframe object for storage
df = pd.DataFrame()

# iterate through reg_vals, using each key as the column name and each value as the regex
for col_name, regx_str in reg_vals.items():
    # find all matches for this pattern and store them as a named pandas Series
    sers = pd.Series(re.findall(regx_str, datastring), name=col_name)
    # concat the new column onto the growing dataframe
    df = pd.concat([df, sers], axis=1)

# show dataframe
df


,Title,Description,ReleaseDate,Metascore,PhotoURL
0,Carol,"Set in 1950s New York,...","November 20, 2015",94,https://static.metacritic.com/images/products/...
1,45 Years,There is just one week...,"December 23, 2015",94,https://static.metacritic.com/images/products/...
2,Inside Out,Growing up can be a bu...,"June 19, 2015",94,https://static.metacritic.com/images/products/...
3,Sherpa,A fight on Everest? It...,"October 2, 2015",93,https://static.metacritic.com/images/products/...
4,Spotlight,Spotlight tells the ri...,"November 6, 2015",93,https://static.metacritic.com/images/products/...
...,...,...,...,...,...
95,Wild Tales,Vulnerable in the face...,"February 20, 2015",77,https://static.metacritic.com/images/products/...
96,Best of Enemies,"In the summer of 1968,...","July 31, 2015",77,https://static.metacritic.com/images/products/...
97,Buzzard,Paranoia forces small-...,"March 6, 2015",77,https://static.metacritic.com/images/products/...
98,The Hunting Ground,From the makers of The...,"February 27, 2015",77,https://static.metacritic.com/images/products/...
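Because each column is extracted independently and the columns are then joined by position, a pattern that misses even one record would shift its column out of line with the others (pandas pads the shorter column with NaN at the end). A minimal sketch of a sanity check is to compare match counts across patterns before building the DataFrame. The `match_counts` helper, the sample markup, and the `mixed` class name below are assumptions made for illustration.

```python
import re


def match_counts(patterns: dict, text: str) -> dict:
    """Count how many matches each named pattern finds in text."""
    return {name: len(re.findall(pattern, text)) for name, pattern in patterns.items()}


# Hypothetical two-record sample where one score uses a different CSS class
sample = (
    '<h3>First</h3><div class="metascore_w large movie positive">90</div>\n'
    '<h3>Second</h3><div class="metascore_w large movie mixed">55</div>\n'
)
patterns = {
    "Title": r"<h3>(.*)</h3>",
    "Metascore": r'<div class="metascore_w large movie positive">(.*)<',
}

counts = match_counts(patterns, sample)
# All columns line up only if every pattern found the same number of matches
aligned = len(set(counts.values())) == 1
print(counts, "aligned" if aligned else "NOT aligned")
```

In the notebook, the same check could be run with `reg_vals` and `datastring` in place of the sample values.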
