## IPRatings | Popularity of Star Wars Titles compared to Brickset Ratings

#### This notebook generates data to investigate the relationship between the [Rotten Tomatoes](https://rottentomatoes.com) scores for Star Wars titles and the ratings of Star Wars LEGO sets on [Brickset](https://brickset.com/). The Rotten Tomatoes scores were compiled in July 2022 into the [SWMovies Google Sheet](https://docs.google.com/spreadsheets/d/1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs). The notebook merges two DataFrames and writes "tomato.csv" to the "CSVs" folder in this repo. That .csv is ultimately used in a [Tableau Dashboard](https://public.tableau.com/app/profile/jared.sage/viz/BricksandPieces/BricksandPieces).

#### Prediction: LEGO set popularity will not directly correlate with the popularity of a Star Wars title. There are likely additional, undetermined factors specific to rating LEGO sets beyond the popularity of a licensed title.

#### Import the packages needed to run the program. PythonScripts is a folder of scripts made for this project. KEY_TWO is the Brickset API key. data_clean holds functions shared by two or more notebooks in this repo.

In [None]:
import pandas as pd
import requests
from PythonScripts.keys import KEY_TWO
import PythonScripts.data_clean as dc

#### To get started, we configure the parameters for reading the Star Wars title scores Google Sheet into a DataFrame. The parameters are then fed into pandas.read_csv to create the DataFrame. The output is the resulting DataFrame.

In [2]:
# Configure URL for pd.read_csv and read as DataFrame
# Full sheet URL == https://docs.google.com/spreadsheets/d/1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs/
workbook_id = "1xw7y9yawF6i35BTfP9M1uUawJvwpacz01Xq4MEZszBs"
sheet_name = "Tomato"
url = f"https://docs.google.com/spreadsheets/d/{workbook_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
ip_df = pd.read_csv(url, parse_dates=['Release_Date'])
ip_df

#### The Release Date column looks pretty messy. There are values with full dates, and lots that just start on January first. To clean this up, we will format these values to just the year of the date. Output shows the new values of the release dates.

In [4]:
# Format Date column to display as the year
ip_df['Release_Date'] = ip_df['Release_Date'].dt.strftime('%Y')
ip_df

Unnamed: 0,Title,Is_Movie,Is_TV,Release_Date,Tomatometer
0,Episode IV – A New Hope,Y,N,1977,93%
1,Episode V – The Empire Strikes Back,Y,N,1980,94%
2,Episode VI – Return of the Jedi,Y,N,1983,83%
3,Episode I – The Phantom Menace,Y,N,1999,51%
4,Episode II – Attack of the Clones,Y,N,2002,66%
5,Episode III – Revenge of the Sith,Y,N,2005,79%
6,Episode VII – The Force Awakens,Y,N,2015,93%
7,Episode VIII – The Last Jedi,Y,N,2017,91%
8,Episode IX – The Rise of Skywalker,Y,N,2019,52%
9,Star Wars: The Clone Wars,Y,N,2008,18%
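
#### Note that strftime('%Y') stores the year as text. The sketch below uses two illustrative rows (not read from the sheet) to contrast that with dt.year, which yields integers. Which form suits the Tableau dashboard is an assumption left to the reader.

```python
import pandas as pd

# Illustrative rows only; the real values come from the Google Sheet.
sample = pd.DataFrame({
    "Title": ["Episode IV – A New Hope", "Star Wars: The Clone Wars"],
    "Release_Date": pd.to_datetime(["1977-05-25", "2008-01-01"]),
})

# strftime('%Y') yields strings such as '1977' (object dtype).
year_as_text = sample["Release_Date"].dt.strftime("%Y")

# dt.year yields integers such as 1977, which sort and compare numerically.
year_as_int = sample["Release_Date"].dt.year

print(year_as_text.tolist())  # ['1977', '2008']
print(year_as_int.tolist())   # [1977, 2008]
```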


#### We are going to do a little more cleaning to make the merge easier down the road. We first remove the subtitles from titles with "Episode". We then define and call a function to remove "Star Wars: " from other properties leaving only the subtitles. Output shows the newly renamed values in the Title column.

In [5]:
# Clean extra text out of the Title column

# Keep only the text before ' – ' (e.g. "Episode IV – A New Hope" -> "Episode IV")
ip_df['Title'] = ip_df['Title'].str.partition(' – ')[0]


# Function for cleaning a Series by partition
def part_colon(column_label: pd.Series) -> pd.Series:
    """Replace each value containing ': ' with the text after the first ': '."""
    for value in column_label.unique():
        # Skip non-string values such as NaN
        if isinstance(value, str) and ': ' in value:
            head, sep, tail = value.partition(': ')
            column_label.replace(to_replace=value, value=tail, inplace=True)
    return column_label

# Run cleaning function on Title column
ip_df['Title'] = part_colon(ip_df['Title'])
ip_df['Title']

0                Episode IV
1                 Episode V
2                Episode VI
3                 Episode I
4                Episode II
5               Episode III
6               Episode VII
7              Episode VIII
8                Episode IX
9            The Clone Wars
10                Rogue One
11                     Solo
12            The Bad Batch
13               Resistance
14                   Rebels
15           The Clone Wars
16          The Mandalorian
17    The Book of Boba Fett
Name: Title, dtype: object
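
#### After cleaning, "The Clone Wars" appears twice in the output above. As a minimal sketch, the repeated titles can be located programmatically with Series.duplicated. The list below is copied from that output rather than read from the sheet.

```python
import pandas as pd

# Cleaned titles copied from the output above.
titles = pd.Series([
    "Episode IV", "Episode V", "Episode VI", "Episode I", "Episode II",
    "Episode III", "Episode VII", "Episode VIII", "Episode IX",
    "The Clone Wars", "Rogue One", "Solo", "The Bad Batch", "Resistance",
    "Rebels", "The Clone Wars", "The Mandalorian", "The Book of Boba Fett",
], name="Title")

# keep=False flags every occurrence of a repeated value, not only the later ones.
repeated = titles[titles.duplicated(keep=False)]
print(repeated.index.tolist())  # [9, 15]
```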

In [None]:
# Drop the duplicate Clone Wars row (index 9): the theatrical release of the TV show's first 3 episodes. It is both a duplicate title and an outlier.
ip_df.drop(index=9, inplace=True)
ip_df.reset_index(drop=True, inplace=True)
ip_df

In [None]:
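# Call the Brickset API for sets in the Star Wars theme and convert the result to a DataFrame.
import json

parameters = {'theme': 'Star Wars', 'pageSize': 900}
# Let requests URL-encode the query; the search criteria are sent as a JSON string.
sw_set_list = requests.get(
    "https://brickset.com/api/v3.asmx/getSets",
    params={'apiKey': KEY_TWO, 'userHash': '', 'params': json.dumps(parameters)},
)
sw_set_list.raise_for_status()  # Stop early on an HTTP error
sw_data = sw_set_list.json()
sw_df = pd.json_normalize(sw_data, 'sets')
print(f'sw_df shape: {sw_df.shape}')
sw_df.head()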
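
#### The request above passes the search criteria in the params query field. The sketch below uses only the standard library to show how those criteria can be serialized to JSON and percent-encoded, with no network call. build_get_sets_url is a hypothetical helper and "YOUR_KEY" is a placeholder. It assumes the endpoint expects a JSON string in params.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_get_sets_url(api_key: str, search: dict) -> str:
    """Hypothetical helper: build a getSets URL with JSON-encoded search criteria."""
    query = urlencode({
        "apiKey": api_key,
        "userHash": "",
        "params": json.dumps(search),  # double-quoted JSON, then percent-encoded
    })
    return f"https://brickset.com/api/v3.asmx/getSets?{query}"


url = build_get_sets_url("YOUR_KEY", {"theme": "Star Wars", "pageSize": 900})

# Round-trip the query string to confirm the criteria survive encoding.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(json.loads(decoded["params"][0]))
```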

In [None]:
# Drop columns not needed for this analysis (project helper from PythonScripts.data_clean)
dc.drop_columns(sw_df)
print(f'sw_df shape: {sw_df.shape}')

In [None]:
# Replace certain values with values matching the first DataFrame
subthemes = sw_df['subtheme'].sort_values().unique()
print(f'Subthemes: {subthemes}')

sw_df['subtheme'] = sw_df['subtheme'].replace(to_replace={'The Clone Wars' : 'Star Wars: The Clone Wars', 
                                                          'The Force Awakens' : 'Episode VII', 
                                                          'The Last Jedi' : 'Episode VIII', 
                                                          'The Rise of Skywalker' : 'Episode IX' })
subthemes = sw_df['subtheme'].sort_values().unique()
print(f'\nRenamed Subthemes: {subthemes}')

In [None]:
# Drop any rows where the set has not been rated or where the number of pieces is NaN.
unrated = sw_df[sw_df['rating'] == 0].index
sw_df.drop(unrated, inplace=True)

# dropna is a no-op when no 'pieces' values are missing, so no prior null check is needed
sw_df.dropna(subset=['pieces'], inplace=True)

print(f'sw_df shape: {sw_df.shape}')
sw_df.head()

In [None]:
# Convert pieces to the nullable Int64 dtype
sw_df['pieces'] = sw_df['pieces'].astype(pd.Int64Dtype())
sw_df.head()

In [None]:
# Run clean via partition function on the subtheme column of the second dataframe
part_colon(sw_df['subtheme'])

In [None]:
# Count the number of sets in each subtheme
lego_set_count = sw_df.groupby(['subtheme'])['number'].count()
lego_set_count

In [None]:
# Group subthemes by the average rating
rating_avg = sw_df.groupby(['subtheme'])['rating'].mean().round(2)
rating_avg

In [None]:
# Create a new DataFrame combining the set count and rating by subtheme
agg_df = pd.concat([lego_set_count, rating_avg], axis=1)
agg_df
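
#### The two groupby calls and the concat above can also be written as a single named aggregation. This is a minimal sketch on made-up rows. The set numbers and ratings below are illustrative, not Brickset data.

```python
import pandas as pd

# Illustrative rows only; real values come from the Brickset API.
sets = pd.DataFrame({
    "number": ["1001", "1002", "1003", "1004"],
    "subtheme": ["Episode IV", "Episode IV", "Solo", "Solo"],
    "rating": [4.0, 4.5, 3.5, 4.0],
})

# One pass over the groups: count the sets and average the ratings.
agg = sets.groupby("subtheme").agg(
    number=("number", "count"),
    rating=("rating", "mean"),
).round(2)
print(agg)
```

#### The output keeps the same column names, number and rating, indexed by subtheme.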

In [None]:
# Merge the DataFrame of set counts and average ratings into the DataFrame of Star Wars properties
merged_df = ip_df.merge(agg_df, how='left', left_on='Title', right_on='subtheme')
merged_df
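
#### A left merge keeps every title, so a title with no matching subtheme silently ends up with NaN for the set count and rating. As a hedged sketch on illustrative frames, the indicator option of merge can list those titles. Here subtheme is assumed to be a regular column rather than the index.

```python
import pandas as pd

# Illustrative frames only; values are not taken from the real data.
titles = pd.DataFrame({"Title": ["Episode IV", "Solo", "Resistance"]})
agg = pd.DataFrame({
    "subtheme": ["Episode IV", "Solo"],
    "number": [2, 3],
    "rating": [4.25, 3.75],
})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
checked = titles.merge(
    agg, how="left", left_on="Title", right_on="subtheme", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Title"].tolist()
print(unmatched)  # ['Resistance']
```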

In [None]:
# Replace the percentage string with a float fraction (e.g. '93%' -> 0.93)
merged_df['Tomatometer'] = merged_df['Tomatometer'].str.rstrip('%').astype(float) / 100
merged_df.head()
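
#### The conversion above assumes every Tomatometer value is a string ending in "%". If the sheet ever contained a blank or non-numeric score, a more tolerant variant could use pd.to_numeric. This is a sketch on hypothetical values and is not needed for the data shown here.

```python
import pandas as pd

# Hypothetical values: one valid score, one missing, one non-numeric.
scores = pd.Series(["93%", None, "n/a"], name="Tomatometer")

# errors="coerce" turns anything that is not a number into NaN instead of raising.
fractions = pd.to_numeric(scores.str.rstrip("%"), errors="coerce") / 100
print(fractions.isna().tolist())  # [False, True, True]
```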

In [None]:
# Make a new column that converts the Brickset rating from a 5-point scale to a fraction of 1, matching Tomatometer
merged_df['Brickset % Rating'] =  merged_df['rating'] / 5
merged_df.head()

In [None]:
# Write the merged DataFrame to .csv for visualization in Tableau
file_path = dc.csv_path('tomato.csv')
merged_df.to_csv(file_path)