## IPRatings | Popularity of Star Wars Titles compared to Brickset Ratings

#### This notebook generates data to investigate the relationship between the [Rotten Tomatoes](https://rottentomatoes.com) scores for Star Wars titles and the ratings of Star Wars LEGO sets on [Brickset](https://brickset.com/). The Rotten Tomatoes scores were compiled in July 2022 into the [SWMovies Google Sheet](https://docs.google.com/spreadsheets/d/1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs). The notebook merges two DataFrames and writes "tomato.csv" to the "CSVs" folder in this repo. That .csv is ultimately used in a [Tableau Dashboard](https://public.tableau.com/app/profile/jared.sage/viz/BricksandPieces/BricksandPieces).

#### Prediction: LEGO set popularity will not directly correlate with the popularity of a Star Wars title. There are likely additional, undetermined factors specific to rating LEGO sets beyond the popularity of a licensed title.

#### Start by importing the packages needed to run the notebook. Pandas is used for importing, cleaning, and merging data. Requests is used for API calls. PythonScripts is a folder of scripts made for this project. KEY_TWO is the Brickset API key. data_clean holds functions shared across two or more notebooks in this repo.

In [None]:
import pandas as pd
import requests
from PythonScripts.keys import KEY_TWO
import PythonScripts.data_clean as dc

#### Configure the parameters for reading the Star Wars title scores Google Sheet into a DataFrame. The parameters are then fed into pandas.read_csv() to create the DataFrame. Output is the resulting DataFrame.

In [None]:
# Configure URL for pd.read_csv and read as DataFrame
# Full sheet URL == https://docs.google.com/spreadsheets/d/1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs/
workbook_id = "1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs"
sheet_name = "Tomato"
url = f"https://docs.google.com/spreadsheets/d/{workbook_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
ip_df = pd.read_csv(url, parse_dates=['Release_Date'])
ip_df
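
#### A minimal sketch of the URL construction above, assuming a sheet name could contain spaces or other characters that need percent-encoding. The helper name `gviz_csv_url` and the "My Sheet" example are hypothetical and not part of this repo.

```python
from urllib.parse import quote


def gviz_csv_url(workbook_id: str, sheet_name: str) -> str:
    """Build the CSV export URL used above (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces in the sheet name
    return (
        f"https://docs.google.com/spreadsheets/d/{workbook_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name)}"
    )


workbook_id = "1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs"
url = gviz_csv_url(workbook_id, "Tomato")
# "My Sheet" is a made-up name that shows the encoding of a space
spaced_url = gviz_csv_url(workbook_id, "My Sheet")
print(url)
print(spaced_url)
```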

#### The Release_Date column looks messy. Some values are full dates, while others just default to January 1st. To clean this up, we reduce every value to the year of the date. Output shows the new values of the release dates.

In [None]:
# Format Date column to display as the year
ip_df['Release_Date'] = ip_df['Release_Date'].dt.strftime('%Y')
ip_df
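
#### A small sketch of the year formatting on made-up dates (not rows from the Google Sheet). It contrasts `strftime('%Y')`, which returns strings as in the cell above, with `.dt.year`, which returns integers. Which one suits Tableau better is left as an open choice.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Google Sheet
demo = pd.DataFrame({"Release_Date": pd.to_datetime(["1977-05-25", "2008-01-01"])})

# strftime('%Y') yields strings, matching the cell above
demo["Year_str"] = demo["Release_Date"].dt.strftime("%Y")
# .dt.year yields integers instead, which sort and plot as numbers
demo["Year_int"] = demo["Release_Date"].dt.year
print(demo.dtypes)
```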

#### We are going to do a little more cleaning to make the merge easier down the road. We first remove the subtitles from titles with "Episode". We then define and call a function to remove "Star Wars: " from other properties leaving only the subtitles. Output shows the newly renamed values in the Title column.

In [None]:
# Clean extra text out of the Title column

# Keep only the text before ' – ' (drops the subtitles from the "Episode" titles)
ip_df['Title'] = ip_df['Title'].str.partition(' – ')[0]


# Function for cleaning a Series by partition
def part_colon(column_label: pd.Series) -> pd.Series:
    """Replace each 'prefix: rest' value with 'rest', in place, and return the Series."""
    # Snapshot the distinct values first so the Series is not changed while iterating
    for value in column_label.dropna().unique():
        if isinstance(value, str) and ': ' in value:
            head, sep, tail = value.partition(': ')
            column_label.replace(to_replace=value, value=tail, inplace=True)
    return column_label


# Run cleaning function on Title column
ip_df['Title'] = part_colon(ip_df['Title'])
ip_df['Title']
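
#### As a hedged illustration of the two cleaning steps above, the sketch below applies them to three made-up titles using vectorized string methods. It is an alternative way to get the same result, not the notebook's own function.

```python
import pandas as pd

# Made-up titles in the shapes the cell above handles
titles = pd.Series(["Episode IV – A New Hope", "Star Wars: The Clone Wars", "Rogue One"])

# Keep the text before ' – ' (the whole value when the separator is absent)
titles = titles.str.partition(" – ")[0]

# Keep the text after ': ' only where the separator is present
has_colon = titles.str.contains(": ", regex=False)
titles = titles.where(~has_colon, titles.str.partition(": ")[2])
print(titles.tolist())
```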

#### There are two values for Clone Wars with very different ratings. Contextually, the movie is a pilot episode for the TV show. Because it is a small part of a larger 7-season show, we drop the movie's rating to eliminate the duplicate. This finishes the cleaning of the first DataFrame. Output shows the Title Series with only one "The Clone Wars" value.

In [None]:
# Drop the duplicate Clone Wars row at index 9 (the movie): the theatrical release of the show's first 3 episodes is both a duplicate title and an outlier
ip_df.drop(index=9, inplace=True)
ip_df.reset_index(drop=True, inplace=True)
ip_df['Title']
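
#### A sketch of dropping the duplicate by content instead of by a hard-coded index label, which could help if the sheet's row order ever changes. The frame is made up, and the "Format" column is hypothetical. The real sheet may need a different column to tell the movie from the show.

```python
import pandas as pd

# Made-up frame; the 'Format' column is hypothetical
demo = pd.DataFrame({
    "Title": ["Episode I", "The Clone Wars", "The Clone Wars"],
    "Format": ["Movie", "Movie", "TV"],
})

# Select the row by content rather than by a hard-coded index label
is_pilot_movie = (demo["Title"] == "The Clone Wars") & (demo["Format"] == "Movie")
demo = demo[~is_pilot_movie].reset_index(drop=True)
print(demo)
```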

#### The second DataFrame starts with an API call to Brickset. This retrieves LEGO set information for all sets under the Star Wars theme. Output shows the shape and head of the new sw_df DataFrame.

In [None]:
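# API call for information on sets in the Star Wars theme, converted to a DataFrame.
import json
parameters = {'theme': 'Star Wars', 'pageSize': 900}
# Send the params as a JSON string; requests handles the URL encoding
sw_set_list = requests.get(
    "https://brickset.com/api/v3.asmx/getSets",
    params={'apiKey': KEY_TWO, 'userHash': '', 'params': json.dumps(parameters)},
)
sw_set_list.raise_for_status()  # Stop here if the HTTP request failed
sw_data = sw_set_list.json()
sw_df = pd.json_normalize(sw_data, 'sets')
print(f'sw_df shape: {sw_df.shape}')
sw_df.head()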
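
#### A minimal sketch of how the request URL can be assembled without sending it, using only the standard library. The API key is a placeholder. The point is that `json.dumps` produces double-quoted JSON, whereas formatting a Python dict straight into the URL would produce single quotes.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder key for illustration only; the notebook uses KEY_TWO
api_key = "YOUR_API_KEY"
parameters = {"theme": "Star Wars", "pageSize": 900}

# json.dumps produces double-quoted JSON; urlencode percent-encodes it
query = urlencode({"apiKey": api_key, "userHash": "", "params": json.dumps(parameters)})
request_url = f"https://brickset.com/api/v3.asmx/getSets?{query}"
print(request_url)

# Round-trip check: decode the query string and parse the JSON back
decoded = parse_qs(urlsplit(request_url).query, keep_blank_values=True)
round_trip = json.loads(decoded["params"][0])
```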

#### The new DataFrame has 44 columns, far more than we need. We drop the majority of them from the DataFrame. Output shows the shape of the DataFrame with the reduced columns.

In [None]:
# Drop the columns not needed for this analysis (helper from PythonScripts.data_clean)
dc.drop_columns(sw_df)
print(f'sw_df shape: {sw_df.shape}')

#### Subthemes in sw_df are the closest match to Star Wars titles. We rename some of the subthemes to make them equivalent to the titles in ip_df. This allows for a cleaner merge. Output shows the starting values in the subtheme column and the new values after replacement.

In [None]:
# Replace certain values with values matching the first DataFrame
subthemes = sw_df['subtheme'].sort_values().unique()
print(f'Subthemes: {subthemes}')

sw_df['subtheme'] = sw_df['subtheme'].replace({'The Clone Wars': 'Star Wars: The Clone Wars',
                                               'The Force Awakens': 'Episode VII',
                                               'The Last Jedi': 'Episode VIII',
                                               'The Rise of Skywalker': 'Episode IX'})
subthemes = sw_df['subtheme'].sort_values().unique()
print(f'\nRenamed Subthemes: {subthemes}')

#### To be safe, we want to get rid of any sets that have not been rated or are rated at 0 so that they don't skew the results. Output will show the shape of the DataFrame before and after. If the shape hasn't changed, then there are no sets with a rating of 0.

In [None]:
# Drop rows where the set has a rating of 0
print('sw_df shape: ', sw_df.shape)

unrated_index = sw_df[sw_df['rating'] == 0].index
sw_df.drop(unrated_index, inplace=True)

print('new sw_df shape: ', sw_df.shape)


#### We also want to get rid of any LEGO sets that have a NaN value for the number of pieces in the set so that they don't skew any analysis. Output will show a shape with fewer rows if there are any sets with a NaN value for pieces. If the shape is unchanged, then there were no NaN values.

In [None]:
# Drop any rows that have a NaN value in the pieces column
# (dropna leaves the DataFrame unchanged when there are none)
sw_df.dropna(subset=['pieces'], inplace=True)

print(f'sw_df shape: {sw_df.shape}')
sw_df.head()
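
#### A sketch that combines the two filters above (zero ratings and missing piece counts) into one boolean mask, shown on made-up sets. This is one possible alternative to the in-place drops, under the assumption that both conditions should remove a row.

```python
import pandas as pd

# Made-up sets for illustration; None becomes NaN in the float column
demo = pd.DataFrame({
    "number": ["A", "B", "C", "D"],
    "rating": [4.1, 0.0, 3.8, 4.5],
    "pieces": [120.0, 75.0, None, 980.0],
})

# Keep rows that have a non-zero rating and a known piece count
keep = (demo["rating"] != 0) & demo["pieces"].notna()
filtered = demo[keep].reset_index(drop=True)
print(f"{len(demo)} rows before, {len(filtered)} rows after")
```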

#### In the previously shown heads of the DataFrame, you can see that the number of pieces is currently a float value. As you can't have a partial piece, we want to convert that to an integer value. Output shows the new integer type of the pieces column and the DataFrame head.

In [None]:
# Convert pieces to the nullable Int64 dtype and report the column's dtype
sw_df['pieces'] = sw_df['pieces'].astype(pd.Int64Dtype())
print("Dtype of sw_df['pieces']: ", sw_df['pieces'].dtype)
sw_df.head()
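
#### A short sketch of the nullable `Int64` dtype on made-up piece counts. Unlike NumPy's `int64`, it can hold missing values, so the conversion would still work if a NaN remained in the column.

```python
import pandas as pd

# Made-up piece counts, including one missing value
pieces = pd.Series([120.0, None, 980.0])

# The nullable 'Int64' dtype stores whole numbers and keeps missing values as <NA>
pieces_int = pieces.astype("Int64")
print(pieces_int.dtype)
print(pieces_int)
```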

#### We run the part_colon function to remove "Star Wars: " from subthemes leaving only subtitles for the value. This will allow a clean merge between titles and subthemes. Output shows the new values in the subtheme column.

In [None]:
# Run clean via partition function on the subtheme column of the second dataframe
part_colon(sw_df['subtheme'])
sw_df['subtheme'].unique()

#### While we could merge DataFrames here, what we really want to know about LEGO sets is how they relate to the subtheme column. We group the set numbers by subtheme and count them to determine how many sets are in each subtheme. That is saved as a new Series. Output is the new lego_set_count Series.

In [None]:
# Count the number of sets in each subtheme
lego_set_count = sw_df.groupby(['subtheme'])['number'].count()
lego_set_count

#### The next set of information we want is the average Brickset rating by subtheme. We group rating by subtheme and then calculate the average. That is saved as a new Series. Output is the rating_avg Series.

In [None]:
# Group subthemes by the average rating
rating_avg = sw_df.groupby(['subtheme'])['rating'].mean().round(2)
rating_avg

#### The next step is to concatenate the two calculated Series into a new DataFrame. This is all the set information needed to then merge with our information by title. Output is the new agg_df DataFrame.

In [None]:
# Create a new DataFrame combining the set count and rating by subtheme
agg_df = pd.concat([lego_set_count, rating_avg], axis=1)
agg_df
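
#### The count, mean, and concat steps above can also be sketched as a single groupby with named aggregation. The rows are made up for illustration. The output columns are named "number" and "rating" to mirror agg_df.

```python
import pandas as pd

# Made-up sets for illustration
demo = pd.DataFrame({
    "subtheme": ["Episode IV", "Episode IV", "Rebels"],
    "number": ["A1", "A2", "B1"],
    "rating": [4.0, 4.5, 3.9],
})

# One pass: count the set numbers and average the ratings per subtheme
agg_demo = demo.groupby("subtheme").agg(
    number=("number", "count"),
    rating=("rating", "mean"),
).round(2)
print(agg_demo)
```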

#### Now we perform the merge. We are going to do a left join based on the title so that we only get subtheme information for subthemes that match a title, eliminating extraneous data. Output is the new merged_df DataFrame showing all the ip_df data with new "number" and "rating" columns from agg_df.

In [None]:
# Merge DataFrame of set #s and average rating into DataFrame of Star Wars properties
merged_df = ip_df.merge(agg_df, how='left', left_on='Title', right_on='subtheme')
merged_df
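
#### A sketch of one way to check which titles found no matching subtheme after a left join, using the `indicator` option of `merge`. The data and the title "Some Other Title" are made up. The join uses `right_index=True` here because the aggregated frame is indexed by subtheme.

```python
import pandas as pd

# Made-up inputs: two titles, one of which has aggregated set data
titles = pd.DataFrame({"Title": ["Episode IV", "Some Other Title"]})
agg_demo = pd.DataFrame(
    {"number": [2], "rating": [4.25]},
    index=pd.Index(["Episode IV"], name="subtheme"),
)

# indicator=True adds a '_merge' column saying where each row came from
checked = titles.merge(
    agg_demo, how="left", left_on="Title", right_index=True, indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Title"].tolist()
print(unmatched)
```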

#### Eventually we want to compare values in the Tomatometer column with Brickset ratings in Tableau. We want to make sure that we have numerical values for the Tomatometer instead of the string values that currently exist. We remove the "%" character, convert the remaining characters to float values, and divide by 100 to express them as decimal percentages. Output is the DataFrame head showing the new Tomatometer format.

In [None]:
# Replace the percentage string with a float decimal (e.g. "93%" becomes 0.93)
merged_df['Tomatometer'] = merged_df['Tomatometer'].str.strip('%').astype(float) / 100
merged_df.head()
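
#### A small sketch of the percentage conversion on made-up scores, using the vectorized `.str` accessor rather than a Python loop.

```python
import pandas as pd

# Made-up Tomatometer strings for illustration
scores = pd.Series(["93%", "52%", "100%"])

# Strip the trailing '%', convert to float, then scale to a 0-1 decimal
fractions = scores.str.rstrip("%").astype(float) / 100
print(fractions.tolist())
```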

#### We need to compare Brickset ratings to Rotten Tomatoes scores. However, a problem can be seen by looking at the DataFrame. The two websites use different scales. Brickset uses a 5-point scale, while Rotten Tomatoes uses a 100-point percentage scale. To make things easier we add a new Series with Brickset ratings divided by 5, which puts them on the same decimal percentage scale as the Tomatometer values. Output shows the DataFrame head with the new Series "Brickset % Rating".

In [None]:
# Make new column that converts the Brickset rating from a 5-point scale to a decimal percentage
merged_df['Brickset % Rating'] = merged_df['rating'] / 5
merged_df.head()
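
#### A sketch of the rescaling on made-up numbers. It also checks that dividing by 5 does not change the Pearson correlation with the Tomatometer, so the rescaling only affects how the values read side by side. The figures are invented and say nothing about the real data.

```python
import pandas as pd

# Made-up values for illustration only
demo = pd.DataFrame({
    "Tomatometer": [0.90, 0.50, 0.70],
    "rating": [4.0, 3.5, 4.5],
})

# Convert the 5-point rating to the same 0-1 scale as the Tomatometer
demo["Brickset % Rating"] = demo["rating"] / 5

# Pearson correlation is unchanged by a positive rescaling of one variable
r_scaled = demo["Tomatometer"].corr(demo["Brickset % Rating"])
r_raw = demo["Tomatometer"].corr(demo["rating"])
print(r_scaled, r_raw)
```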

#### Data cleaning and merging is complete. We save that data as a .csv to use in Tableau. A file called "tomato.csv" will be created in the "CSVs" folder of this repo if you would like to review the final output. Output is the .csv file in the repo. 

In [None]:
# Write the merged DataFrame to .csv for visualization in Tableau
file_path = dc.csv_path('tomato.csv')
merged_df.to_csv(file_path)
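
#### A sketch of what `to_csv` writes, using an in-memory buffer and made-up data instead of a file. By default pandas also writes the row index as an unnamed first column. Passing `index=False`, as shown here, leaves it out. Whether the Tableau dashboard expects that index column is not known, so this is an option rather than a recommendation.

```python
import io

import pandas as pd

# Made-up frame for illustration
demo = pd.DataFrame({"Title": ["Episode IV"], "Tomatometer": [0.93]})

# Write to an in-memory buffer without the row index
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```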