# Improving and Adding Review Score Data
### Movie List Project - Notebook #1
#### by Max Ruther

## Motivation and Overview

One of my main interests for this Movie List project is to train models that predict my enjoyment of a given movie. Seeing critics' enjoyment as promisingly predictive of my own, I created a _critic_ratings_ table in my MySQL Movie database. As initially constructed, it features review scores from IMDb, Rotten Tomatoes, and Metacritic. However, some of its scores are erroneously missing. It also lacks scores from a favorite review site of mine, RogerEbert.com.

### The Two Main Concerns

These two issues with review scores make up the primary concerns of this notebook. In greater detail, these involve:

1. Identifying, for each review site (IMDb, Rotten Tomatoes, or Metacritic), which film records are missing review scores. If a review score is erroneously missing (as I determine through online searches not shown here), I add its true score to a mapping, which is ultimately used to correct those values in the table.

2. Adding scores from a reviewer of interest not recorded in the OMDb.

### Realized in the Project Code

The code featured in this notebook comprises the various methods of my Python project's *RatingsTableMender* class, contained in my **critic_ratings.RatingsTableMender** package. The code is spread across the following modules, specifically:
- *_map_missing_ratings.py*
- *_reporting.py*
- *_add_reviewers.py*
- *_reviewer_mappings.py*


## Setup

##### Imports

In [1]:
import pandas as pd
from sqlalchemy import create_engine

I use pandas to manipulate the data. I use sqlalchemy to connect to my MySQL database and query the _critic_ratings_ table.

##### Connect the SQLAlchemy engine to my local MySQL movie database

In [2]:
# Read in my database's creds/URL from a file.
movie_db_url = None
with open('../.secret/movie_db_url.txt', 'r') as f:
    movie_db_url = f.read().strip()

# Connect to my MySQL movie database.
engine = create_engine(movie_db_url)
conn = engine.connect()

## Missing Review Scores

### Metacritic

OMDb seems particularly slow to update Metacritic scores.

#### Identify Missing Scores

##### Querying the missing Metacritic reviews from the critic_ratings table

In [3]:
query = """SELECT Movie_ID, Title FROM 
(SELECT c.Movie_ID, c.Title, c.Year, c.MetaC_Score, a.Release_Date 
FROM critic_ratings c INNER JOIN allmovies a 
ON c.Title=a.Title 
WHERE c.MetaC_Score IS NULL
ORDER BY a.Release_Date ASC) AS tt;"""

review_df = pd.read_sql_query(query, engine, index_col='Movie_ID')


##### Printing these in the format of a Python dictionary literal
This way, they are ready to be pasted into a Python statement, following my manual entry of the missing scores.

In [4]:
print("metacritic_mapping = {")
for i in review_df.values:
    print(f'\t"{i[0]}": ,')
print("}")

metacritic_mapping = {
	"Salo": ,
	"The Ascent": ,
	"Troll 2": ,
	"Memories": ,
	"Air Bud": ,
	"Pokemon 2000": ,
	"Rampant": ,
	"Inspector Ike": ,
	"Nate - A One Man Show": ,
	"El Conde": ,
	"Outlaw Johnny Black": ,
	"Janet Planet": ,
	"Wingwomen": ,
	"Good One": ,
	"The People's Joker": ,
	"The Nature of Love": ,
	"Only the River Flows": ,
	"La Cocina": ,
	"Bird": ,
	"Hard Truths": ,
	"Striking Rescue": ,
	"Nickel Boys": ,
}
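The same template is printed again for Rotten Tomatoes and IMDb further down, so the printing step could be wrapped in a small helper. The following is only a sketch of that idea: `format_mapping_template` is a hypothetical name (not a method of the project's *RatingsTableMender* class), and it assumes titles contain no double quotes.

```python
def format_mapping_template(name, titles):
    """Return a dict-literal skeleton with one empty entry per title."""
    lines = [f"{name} = {{"]
    # One line per film, with the score left blank for manual entry.
    lines += [f'\t"{title}": ,' for title in titles]
    lines.append("}")
    return "\n".join(lines)


# Toy usage with two of the titles printed above.
template = format_mapping_template("metacritic_mapping", ["Salo", "The Ascent"])
print(template)
```

The returned text is deliberately not valid Python until the blank scores are filled in, which matches how the printed output above is used.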


#### Correct Erroneously Missing Scores

##### Creating the mapping for the missing reviews
Commented out at the top of the mapping are films that lack Metacritic review scores and will probably never be scored.

Commented out at the bottom of the mapping are films that erroneously lack a review score **but have only just come out.** In such cases, I hold off on including them in this mapping until more Metacritic reviews come in to form a larger sample base for their scores. (The *MetaC_Score* in my database is the Metacritic aggregate score.)

In [5]:
metacritic_mapping = {
    # "Salo": ,
	# "The Ascent": ,
	# "Troll 2": ,
	# "Memories": ,
	# "Air Bud": ,
	# "Rampant": ,
	# "Nate - A One Man Show": ,
	# "Inspector Ike": ,
	# "Sirocco and the Kingdom of the Winds": ,
	"Pokemon 2000": 0.28,
	"Hundreds of Beavers": 0.82,
	"The Holdovers": 0.82,
    "The Wonderful Story of Henry Sugar": 0.85,
    "El Conde": 0.72,
    "American Fiction": 0.81,
	"Quiz Lady": 0.59,
	"Sing Sing": 0.84,
	"Outlaw Johnny Black": 0.54,
    "Janet Planet": 0.83,
    "Wingwomen": 0.54,
    "Saltburn": 0.61,
	"Silent Night": 0.53,
	"The Boy and the Heron": 0.91,
	"All of Us Strangers": 0.9,
	"Society of the Snow": 0.72,
	"Migration": 0.56,
	"The Teachers' Lounge": 0.82,
	"Good One": 0.87,
	"Godzilla Minus One": 0.81,
	"Upgraded": 0.59,
	"Molli and Max in the Future": 0.7,
	"Drive-Away Dolls": 0.56,
	"Love Lies Bleeding": 0.77,
	"Do Not Expect Too Much from the End of the World": 0.95,
    "The Beast": 0.8,
    "Civil War": 0.75,
    "Challengers": 0.82,
	"Slow": 0.72,
	"Evil Does Not Exist": 0.83,
	"Gasoline Rainbow": 0.8,
    "Babes": 0.71,
    "Furiosa: A Mad Max Saga": 0.79,
    "I Used to Be Funny": 0.74,
    "Ghostlight": 0.83,
    "Thelma": 0.77,
	"The Nature of Love": 0.8,
	"Oddity": 0.78,
	"Only the River Flows": 0.7,
	"His Three Daughters": 0.84,
    "La Cocina": 0.75,
	"Bird": 0.74,
	"Hard Truths": 0.88,
	"Nickel Boys": 0.91,
    "The People's Joker": 0.78,
}
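Because these scores are typed in by hand, a quick range check can catch slips such as entering 82 instead of 0.82. This is a minimal sketch, assuming every score in a mapping should be a fraction between 0 and 1 (as the values above are). The function name `find_suspect_scores` and the toy titles are hypothetical and not part of the project code.

```python
def find_suspect_scores(mapping, low=0.0, high=1.0):
    """Return the entries whose score falls outside the [low, high] range."""
    return {
        title: score
        for title, score in mapping.items()
        if not low <= score <= high
    }


# Toy mapping (made-up titles): "Film B" was entered on a 0-100 scale by mistake.
toy_mapping = {"Film A": 0.82, "Film B": 82, "Film C": 0.7}
suspects = find_suspect_scores(toy_mapping)
print(suspects)
```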

##### Import the **entire** _critic_ratings_ table into a df, from the MySQL db.

In [6]:
query = "SELECT * FROM critic_ratings"

cr_df = pd.read_sql_query(query, engine, index_col='Movie_ID')
cr_df.head(5)

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
1,Nickel Boys,2024,0.76,0.9,
2,The Brutalist,2024,0.81,0.93,0.91
3,Hard Truths,2024,0.74,0.94,
4,Vermiglio,2024,0.72,0.94,0.8
5,Wallace & Gromit: Vengeance Most Fowl,2024,0.79,1.0,0.82


##### Applying the mapping to the missing reviews.

In [7]:
cr_df['MetaC_Score'] = cr_df['MetaC_Score'].fillna(cr_df['Title'].map(metacritic_mapping))
cr_df.head(5)

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
1,Nickel Boys,2024,0.76,0.9,0.91
2,The Brutalist,2024,0.81,0.93,0.91
3,Hard Truths,2024,0.74,0.94,0.88
4,Vermiglio,2024,0.72,0.94,0.8
5,Wallace & Gromit: Vengeance Most Fowl,2024,0.79,1.0,0.82


If a film's title is featured in the mapping but it **isn't** missing a review score, then it won't be affected by this mapping. This detail matters because scores that were missing from the _critic_ratings_ table are often filled in by later updates. Such values shouldn't be overwritten by this mapping, which is meant solely to correct **missing** values.
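To illustrate that behavior, here is a small self-contained sketch using the same `fillna`/`map` pattern on toy data. The titles and scores are made up for the example and do not come from my database.

```python
import pandas as pd

# Toy table: "Film A" already has a score; "Film B" and "Film C" do not.
toy_df = pd.DataFrame(
    {
        "Title": ["Film A", "Film B", "Film C"],
        "MetaC_Score": [0.50, None, None],
    }
)

# The mapping covers "Film A" (already scored) and "Film B" (missing).
toy_mapping = {"Film A": 0.99, "Film B": 0.70}

# fillna only touches missing entries, so the existing 0.50 is left alone.
toy_df["MetaC_Score"] = toy_df["MetaC_Score"].fillna(
    toy_df["Title"].map(toy_mapping)
)
print(toy_df)
```

"Film A" keeps its existing score, "Film B" is filled from the mapping, and "Film C" stays missing because the mapping has no entry for it.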

##### Load this amended table to the MySQL db, replacing the preexisting one.

In [8]:
cr_df.to_sql('critic_ratings', engine, if_exists='replace', index=True)

289
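One thing to keep in mind with `if_exists='replace'` is that pandas drops the existing table and creates a new one from the dataframe, so constraints defined on the old table (such as a primary key) are not carried over. The sketch below shows this on a throwaway table. It assumes an in-memory SQLite database as a stand-in for my MySQL one, with made-up rows; the column types pandas picks for MySQL may differ.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the MySQL one (assumption).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE critic_ratings ("
    "Movie_ID INTEGER PRIMARY KEY, Title TEXT, MetaC_Score REAL)"
)
demo_conn.executemany(
    "INSERT INTO critic_ratings VALUES (?, ?, ?)",
    [(1, "Film A", 0.5), (2, "Film B", None)],
)

demo_df = pd.read_sql_query(
    "SELECT * FROM critic_ratings", demo_conn, index_col="Movie_ID"
)
demo_df["MetaC_Score"] = demo_df["MetaC_Score"].fillna(0.7)

# 'replace' drops the existing table, then creates a new one from the dataframe.
demo_df.to_sql("critic_ratings", demo_conn, if_exists="replace", index=True)

# Inspect the schema of the recreated table and count its rows.
new_schema = demo_conn.execute(
    "SELECT sql FROM sqlite_master "
    "WHERE type = 'table' AND name = 'critic_ratings'"
).fetchone()[0]
row_count = demo_conn.execute("SELECT COUNT(*) FROM critic_ratings").fetchone()[0]
print(new_schema)
print(row_count)
demo_conn.close()
```

The data survives the round trip, but the recreated table no longer declares `Movie_ID` as a primary key.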

### Rotten Tomatoes

#### Identify Missing Scores

##### Querying the missing RT reviews from the critic_ratings table

In [9]:
query = """SELECT Movie_ID, Title FROM
(SELECT c.Movie_ID, c.Title, c.Year, c.RT_Score, a.Release_Date 
FROM critic_ratings c INNER JOIN allmovies a ON c.Title=a.Title
WHERE c.RT_Score IS NULL
ORDER BY a.Release_Date ASC) AS tt;"""

missing_RT_df = pd.read_sql_query(query, engine, index_col='Movie_ID')

##### Printing these film titles in the format of a Python dictionary, ready for my manual data entry.

In [10]:
print("rt_mapping = {")
for i in missing_RT_df.values:
    print(f'\t"{i[0]}": ,')
print("}")

rt_mapping = {
	"Memories": ,
	"Pokemon 2000": ,
	"Possessor": ,
	"The Card Counter": ,
	"TÃ¡r": ,
	"Suzume": ,
	"Talk to Me": ,
	"Sirocco and the Kingdom of the Winds": ,
	"Striking Rescue": ,
}


#### Correct Erroneously Missing Scores

##### Create the mapping for the missing reviews

In [11]:
rt_mapping = {
    # The following commented-out films indeed lack Rotten Tomatoes reviews.
    # "Memories": ,
    # "Sirocco and the Kingdom of the Winds": ,
    "Pokemon 2000": 0.19,
    "Possessor": 0.94,
    "The Card Counter": 0.87,
    "TÃ¡r": 0.91, # This is Tár, starring Cate Blanchett
    "Suzume": 0.96,
    "Talk to Me": 0.94,
}

##### (If preceding sections weren't run, importing the _critic_ratings_ table from the MySQL db)

In [12]:
# Re-import the table only if an earlier section hasn't already loaded it.
if 'cr_df' not in globals():
    query = "SELECT * FROM critic_ratings"
    cr_df = pd.read_sql_query(query, engine, index_col='Movie_ID')

##### Print the records that lack Rotten Tomatoes scores

In [13]:
# Printing the records with missing 'RT_Score'
missing_rt_mask = cr_df['RT_Score'].isnull()
cr_df[missing_rt_mask]

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
24,Suzume,2022,0.76,,0.77
99,TÃ¡r,2022,0.74,,0.93
108,The Card Counter,2021,0.62,,0.78
178,Talk to Me,2022,0.71,,0.76
184,Possessor,2020,0.65,,0.72
186,Striking Rescue,2024,,,
202,Sirocco and the Kingdom of the Winds,2023,0.71,,0.83
209,Pokemon 2000,2000,0.75,,0.28
225,Memories,1995,0.75,,


##### Applying mapping to the missing reviews

In [14]:
cr_df['RT_Score'] = cr_df['RT_Score'].fillna(cr_df['Title'].map(rt_mapping))
cr_df[missing_rt_mask]

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
24,Suzume,2022,0.76,0.96,0.77
99,TÃ¡r,2022,0.74,0.91,0.93
108,The Card Counter,2021,0.62,0.87,0.78
178,Talk to Me,2022,0.71,0.94,0.76
184,Possessor,2020,0.65,0.94,0.72
186,Striking Rescue,2024,,,
202,Sirocco and the Kingdom of the Winds,2023,0.71,,0.83
209,Pokemon 2000,2000,0.75,0.19,0.28
225,Memories,1995,0.75,,


##### Loading this amended table to the MySQL db, replacing the preexisting one.

In [15]:
cr_df.to_sql('critic_ratings', engine, if_exists='replace', index=True)

289

### IMDb

#### Identify Missing Scores

##### Querying the missing IMDb reviews from the critic_ratings table

In [16]:
query = """SELECT Movie_ID, Title FROM
(SELECT c.Movie_ID, c.Title, c.Year, c.IMDB_Score, a.Release_Date 
FROM critic_ratings c INNER JOIN allmovies a ON c.Title=a.Title
WHERE c.IMDB_Score IS NULL
ORDER BY a.Release_Date ASC) AS tt;"""

missing_imdb_df = pd.read_sql_query(query, engine, index_col='Movie_ID')


##### Printing these film titles in the format of a Python dictionary, ready for my manual data entry.

In [17]:
print("imdb_mapping = {")
for i in missing_imdb_df.values:
    print(f'\t"{i[0]}": ,')
print("}")

imdb_mapping = {
	"Striking Rescue": ,
}


#### Correct Erroneously Missing Scores

##### Create the mapping for the missing reviews

In [18]:
imdb_mapping = {
	"Sirocco and the Kingdom of the Winds": 0.71,
	"His Three Daughters": 0.76,
}

##### (If preceding sections weren't run, importing the _critic_ratings_ table from the MySQL db)

In [19]:
# Re-import the table only if an earlier section hasn't already loaded it.
if 'cr_df' not in globals():
    query = "SELECT * FROM critic_ratings"
    cr_df = pd.read_sql_query(query, engine, index_col='Movie_ID')

##### Print the records that lack IMDb scores

In [20]:
# Printing the records with missing 'IMDB_Score'
missing_imdb_mask = cr_df['IMDB_Score'].isnull()
cr_df[missing_imdb_mask]

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
186,Striking Rescue,2024,,,


##### Applying mapping to the missing reviews

In [21]:
cr_df['IMDB_Score'] = cr_df['IMDB_Score'].fillna(cr_df['Title'].map(imdb_mapping))
cr_df[missing_imdb_mask]

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score
186,Striking Rescue,2024,,,


##### Loading this amended table to the MySQL db, replacing the preexisting one.

In [22]:
cr_df.to_sql('critic_ratings', engine, if_exists='replace', index=True)

289

## Add New Reviewer

### RogerEbert.com

The file 'ebert_ratings.csv' contains ratings from RogerEbert.com, one of my favorite sites for movie reviews. I retrieved the first of these ratings using the web scrapers in this project's **critic_ratings.ebertscrape** package. However, after scraping 40 reviews, my main scraper gets foiled by Google's anti-scraping mechanisms (a CAPTCHA check, in this case). So I entered the remaining ~150 Ebert ratings in this file manually.

Here, I read that CSV file of Ebert ratings into a dataframe, then join it onto the _critic_ratings_ table. To finish, I overwrite the _critic_ratings_ table in the MySQL db with the result.


#### Read in the two datasets

##### The _RogerEbert.com_ ratings, from file

In [23]:
ebert_df = pd.read_csv('../data/csv/ebert/ratings_manually_gathered/ebert_ratings.csv', index_col='Movie_ID')
ebert_df['Year'] = ebert_df['Year'].astype(str)
ebert_df.head(5)

Movie_ID,Title,Year,Ebert_Score
1,Sing Sing,2023,4.0
2,Y Tu Mama Tambien,2001,4.0
3,Gasoline Rainbow,2023,4.0
4,Evil Does Not Exist,2023,3.5
5,Slow,2023,3.5


##### The _critic_ratings_ table, from MySQL 

(This segment is only run if that table hasn't already been imported, like in cases where the _Missing Review Scores_ sections were not run prior.)

In [24]:
# Re-import the table only if an earlier section hasn't already loaded it.
if 'cr_df' not in globals():
    query = "SELECT * FROM critic_ratings"
    cr_df = pd.read_sql_query(query, engine, index_col='Movie_ID')

#### Merge the two and load the result

In [25]:
merged_df = cr_df.merge(ebert_df, how='left', on=['Title','Year'])
merged_df.index = range(1, len(merged_df)+1)
merged_df.index.names = ['Movie_ID']

cr_plus_ebert_df = merged_df
cr_plus_ebert_df.head(5)

Movie_ID,Title,Year,IMDB_Score,RT_Score,MetaC_Score,Ebert_Score
1,Nickel Boys,2024,0.76,0.9,0.91,4.0
2,The Brutalist,2024,0.81,0.93,0.91,4.0
3,Hard Truths,2024,0.74,0.94,0.88,4.0
4,Vermiglio,2024,0.72,0.94,0.8,3.5
5,Wallace & Gromit: Vengeance Most Fowl,2024,0.79,1.0,0.82,3.5
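The merge step can be sketched on toy data as follows. The rows are made up, and the sketch assumes that `Year` is stored as a string on the _critic_ratings_ side, which would explain the `astype(str)` cast applied to the Ebert data before merging. A left merge keeps every row of the ratings table and leaves `Ebert_Score` missing wherever no Ebert rating matches on both `Title` and `Year`.

```python
import pandas as pd

# Toy stand-in for the critic_ratings table; Year assumed to be a string.
ratings_df = pd.DataFrame(
    {
        "Title": ["Film A", "Film B"],
        "Year": ["2023", "2024"],
        "RT_Score": [0.9, 0.8],
    }
)

# Toy stand-in for the Ebert CSV; Year is read as an integer, then cast.
ebert_toy_df = pd.DataFrame(
    {"Title": ["Film A"], "Year": [2023], "Ebert_Score": [3.5]}
)
ebert_toy_df["Year"] = ebert_toy_df["Year"].astype(str)

# Left merge: every ratings row is kept, matched or not.
merged = ratings_df.merge(ebert_toy_df, how="left", on=["Title", "Year"])

# Merging on columns resets the index, so rebuild a 1-based Movie_ID.
merged.index = range(1, len(merged) + 1)
merged.index.name = "Movie_ID"
print(merged)
```

Casting both `Year` columns to the same type matters because pandas will not match the integer 2023 against the string "2023" as a join key.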


##### Loading this amended table to the MySQL db, replacing the preexisting one.

In [26]:
cr_plus_ebert_df.to_sql('critic_ratings', engine, if_exists='replace', index=True)

289

## Shutting down the SQL engine and db connection.

In [27]:
# Close the open connection first, then dispose of the engine's connection pool.
conn.close()
engine.dispose()