## NumPiecesRating | Popularity of LEGO sets based on number of pieces  

#### This notebook generates data to investigate the relationship between the number of pieces in a LEGO set and the customer rating for that set on [Brickset](https://brickset.com/). The final output is a .csv file, ultimately used in a [Tableau Dashboard](https://public.tableau.com/app/profile/jared.sage/viz/BricksandPieces/BricksandPieces). Because the themes and sets are sampled randomly, the output of this notebook will differ on each run. The .csv files from this notebook that are used in the Tableau dashboard are found in the "CSVs" folder in this repo as "theme_sample_set_list1", "theme_sample_set_list2", and "theme_sample_set_list3". This notebook will generate "theme_sample_set_list" as an example of a sample.

#### Prediction: There will be an overall trend that LEGO sets with more pieces will have higher Brickset user ratings than sets with fewer pieces.

#### Start by importing the packages needed to run the notebook. Pandas will be used for importing, cleaning, and merging data. Requests will be used for API calls. PythonScripts is a folder of scripts made for this project. KEY_TWO is the Brickset API key. data_clean holds helper functions used across two or more notebooks in this repo.

In [None]:
import pandas as pd
import requests
from PythonScripts.keys import KEY_TWO
import PythonScripts.data_clean as dc

#### Make an API request to Brickset to get a list of LEGO set themes, parse the JSON response, and convert it to a DataFrame. A random sample of themes will be drawn from this DataFrame so that we can get a mix of set types. Output shows the shape and first 5 rows (head) of the DataFrame.

In [None]:
# API call to get list of LEGO themes and convert to DataFrame
themes = requests.get(f'https://brickset.com/api/v3.asmx/getThemes?apiKey={KEY_TWO}') 
data = themes.json()
theme_df = pd.json_normalize(data, 'themes')

print('Shape: ',theme_df.shape)
theme_df.head()
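
#### As an offline illustration of the conversion step, the sketch below runs `pd.json_normalize` on a small hand-written payload. The payload and its values are made up for this example; only the `themes` key and the field names used later in this notebook are taken from the code above.

```python
import pandas as pd

# Hypothetical payload: values are invented, field names mirror the ones
# this notebook filters on later (theme, setCount, yearFrom, yearTo).
sample_response = {
    "status": "success",
    "matches": 2,
    "themes": [
        {"theme": "City", "setCount": 120, "yearFrom": 2005, "yearTo": 2023},
        {"theme": "Castle", "setCount": 80, "yearFrom": 1978, "yearTo": 2014},
    ],
}

# The second argument (record_path) tells pandas which key holds the list
# of records; each record in that list becomes one row.
sample_theme_df = pd.json_normalize(sample_response, "themes")

print("Shape: ", sample_theme_df.shape)
print(sample_theme_df)
```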

#### Next, we narrow down the list of themes. We remove themes whose first sets were released before 1999, themes whose last sets were released before 2022 (themes no longer in production), themes that have fewer than 50 sets, the Collectable Minifigures theme, and the Miscellaneous theme. We're dropping minifigures because they typically have around 5 pieces and dropping miscellaneous because it is too broad. Output shows the shape of the DataFrame before and after the rows are removed.

In [None]:
# Drop themes that started before 1999, themes with fewer than 50 sets, themes that ended before 2022
# (no longer in production), the minifigure theme, and the miscellaneous theme
print('Starting theme_df shape: ',theme_df.shape)

mask = theme_df[(theme_df['yearFrom'] < 1999) | (theme_df['setCount'] < 50) | (theme_df['yearTo'] < 2022) |
         (theme_df['theme'] == 'Collectable Minifigures') | (theme_df['theme'] == 'Miscellaneous')].index
theme_df.drop(mask, inplace=True)

print('Resulting theme_df shape: ',theme_df.shape)
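
#### The same filter can also be written as a "keep" condition instead of a "drop" mask. The sketch below does this on a made-up toy DataFrame (the rows and the name `toy_themes` are illustrative, not Brickset data). For rows without missing values it selects the same themes as the drop-based cell above.

```python
import pandas as pd

# Toy data invented for this example.
toy_themes = pd.DataFrame({
    "theme": ["City", "Castle", "Collectable Minifigures", "Tiny"],
    "yearFrom": [2005, 1978, 2010, 2015],
    "yearTo": [2023, 2014, 2023, 2023],
    "setCount": [120, 80, 300, 10],
})

# Keep a theme only if it passes every condition (the logical negation
# of the "drop if any condition is true" mask).
keep = (
    (toy_themes["yearFrom"] >= 1999)
    & (toy_themes["setCount"] >= 50)
    & (toy_themes["yearTo"] >= 2022)
    & ~toy_themes["theme"].isin(["Collectable Minifigures", "Miscellaneous"])
)
filtered = toy_themes[keep].reset_index(drop=True)

print("Kept themes: ", filtered["theme"].tolist())  # ['City']
```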

#### Draw a random sample of 3 themes using the pandas sample method. Output is a list of the 3 theme names that we will pull set samples from.

In [None]:
# Generate sample theme list to use in 2nd API call. Convert list to string for API parameters.
theme_list = theme_df['theme'].sample(3).tolist()
param_string = ", ".join(theme_list)
print('Theme List: ',param_string)
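
#### This notebook draws a different sample on every run. If a repeatable draw is ever wanted (for example, to regenerate one specific .csv), pandas accepts a `random_state` seed. The sketch below uses a made-up list of theme names and an arbitrary seed of 42; neither comes from the notebook.

```python
import pandas as pd

# Theme names invented for this example.
toy_theme_names = pd.Series(
    ["City", "Friends", "Technic", "Ninjago", "Star Wars"], name="theme"
)

# The same seed gives the same sample every time the cell runs.
first_draw = toy_theme_names.sample(3, random_state=42).tolist()
second_draw = toy_theme_names.sample(3, random_state=42).tolist()

print("Draws match: ", first_draw == second_draw)
print("Theme List: ", ", ".join(first_draw))
```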

#### In order to look at LEGO sets based on the theme list, we need the set information for those themes. Make an API call to Brickset for the set information for all themes in the theme list, making sure the page size is large enough to get lots of results. This set information is then read into a DataFrame. Output shows the shape and head of the set DataFrame.

In [None]:
# 2nd API call to get a full set list for themes in the theme list generated by first API call. 
# Convert to a Dataframe.
parameters = {'theme' : f'{param_string}', 'pageSize' : 2500}
set_list = requests.get(f"https://brickset.com/api/v3.asmx/getSets?apiKey={KEY_TWO}&userHash=&params={parameters}")
set_data = set_list.json()
set_df = pd.json_normalize(set_data,'sets')
print('set_df shape: ',set_df.shape)
set_df.head()
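
#### The cell above interpolates a Python dict directly into the URL, so the request carries Python's text form of the dict (single-quoted keys) rather than JSON text. The sketch below shows one way to build the same query string with the `params` value serialized as JSON and URL-encoded. The helper name `build_get_sets_url` and the placeholder key are made up for this example, and whether Brickset requires strict JSON or a particular theme separator is an assumption, not something confirmed here.

```python
import json
from urllib.parse import parse_qs, urlencode, urlparse

BASE_URL = "https://brickset.com/api/v3.asmx/getSets"


def build_get_sets_url(api_key, theme_names, page_size):
    """Hypothetical helper: return a getSets URL with JSON-encoded params."""
    # Assumption: themes are passed as one comma-separated string.
    params_json = json.dumps(
        {"theme": ",".join(theme_names), "pageSize": page_size}
    )
    # urlencode percent-escapes the braces, quotes, and spaces in the JSON.
    query = urlencode({"apiKey": api_key, "userHash": "", "params": params_json})
    return f"{BASE_URL}?{query}"


url = build_get_sets_url("PLACEHOLDER_KEY", ["City", "Friends"], 2500)
print(url)

# Round trip: decoding the query string gives back the original parameters.
decoded = parse_qs(urlparse(url).query, keep_blank_values=True)
round_trip = json.loads(decoded["params"][0])
print(round_trip)
```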

#### The new DataFrame has 44 columns, most of which we will not need, so we drop them from the DataFrame using a helper function from data_clean. Output shows the shape of the DataFrame with reduced columns and the DataFrame head.

In [None]:
# Drop columns using helper function
dc.drop_columns(set_df)

print('set_df shape: ',set_df.shape)
set_df.head()

#### To be safe, we want to remove any sets that have not been rated or are rated at 0 so that they don't skew any of the results. Output will show the shape of the DataFrame. If the shape hasn't changed, then there are no sets with a rating of 0.

In [None]:
# Drop rows where there is no rating for the set.
mask_two = set_df[set_df['rating'] == 0].index
set_df.drop(mask_two, inplace=True)
print('set_df shape: ',set_df.shape)

#### We also want to get rid of any LEGO sets that have a NaN value for the number of pieces in the set so that they don't skew any analysis. Output will show a shape with fewer rows if there are any sets with a NaN value for pieces. If the shape is unchanged, then there were no NaN values.

In [None]:
# Drop any rows if they have a NaN value in the pieces column
if set_df['pieces'].isnull().any():
    set_df.dropna(subset=['pieces'], inplace=True)
    
print('set_df shape: ',set_df.shape)
set_df.head()
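
#### The two cleaning steps above can be seen together on a small toy DataFrame. The data below is made up for this example. Note that the sketch filters with `rating > 0`, which also removes rows whose rating is NaN; the `== 0` comparison used above does not.

```python
import pandas as pd

# Toy data invented for this example.
toy_sets = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "pieces": [100.0, None, 250.0, 40.0],
    "rating": [4.1, 3.9, 0.0, 4.5],
})

# Keep rated sets, then drop rows with no piece count.
cleaned = toy_sets[toy_sets["rating"] > 0].dropna(subset=["pieces"])

print("Starting shape: ", toy_sets.shape)
print("Resulting shape: ", cleaned.shape)
print(cleaned["name"].tolist())  # ['A', 'D']
```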

#### In the previously shown heads of the DataFrame, you can see that the number of pieces is currently a float value. As you can't have a partial piece, we want to convert that to an integer value. Output shows the new int type for a value in the pieces column and the DataFrame head.

In [None]:
# Convert the pieces column from float to a nullable integer type
set_df['pieces'] = set_df['pieces'].astype(pd.Int64Dtype())
dtype = type(set_df['pieces'].iloc[1])
print('Dtype of number of pieces: ', dtype)
set_df.head()
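
#### `pd.Int64Dtype()` is the pandas nullable integer type, which can also be written as the string "Int64". The sketch below, on made-up values, shows how it differs from plain `int`: a column that still contains a missing value converts cleanly, while a plain `int` cast raises an error. In this notebook the NaN rows were already dropped, so either type would work; the nullable type is simply the more forgiving choice.

```python
import pandas as pd

# Toy values invented for this example; None becomes NaN in a float Series.
toy_pieces = pd.Series([100.0, 250.0, None])

# The nullable integer type keeps the missing value as <NA>.
as_nullable = toy_pieces.astype("Int64")
print("Dtype: ", as_nullable.dtype)
print("Missing values: ", as_nullable.isna().sum())

# A plain int cast cannot represent the missing value and raises.
try:
    toy_pieces.astype(int)
    plain_cast_failed = False
except ValueError:
    plain_cast_failed = True
print("Plain int cast failed: ", plain_cast_failed)
```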

#### Because the set DataFrame will eventually be used in Tableau, we want to clean up some of the column titles to make them more presentable. Output will show the original column names and then the new column names.

In [None]:
# Rename column labels
print(f'Starting column labels: {set_df.columns}')

rename_dict = {
               'setID' : 'Set ID',
               'number' : 'Set Number',
               'name' : 'Set Name',
               'year' : 'Release Year',
               'theme' : 'Theme',
               'subtheme' : 'Subtheme',
               'pieces' : 'Number of Pieces',
               'rating' : 'Brickset Rating',
               }
                

set_df.rename(columns=rename_dict, inplace=True)
print(f'\nResulting column labels: {set_df.columns}')


#### Now we want to generate a small sample of sets from the DataFrame. We don't want the sample to be too big, so that the Tableau visualization doesn't get cluttered. The set DataFrame is overwritten with a sample of 50 LEGO sets and the accompanying information. If there aren't more than 50 sets in the DataFrame, you'll need to start over to generate a new theme list. Output shows that the new set DataFrame has only 50 rows.

In [None]:
# Overwrite set_df with a sample of 50 sets
if len(set_df.index) > 50:
    set_df = set_df.sample(50)
else:
    print('Restart notebook kernel and try again for a better sample')
set_df.shape
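
#### The size check can also be wrapped in a small function so that a too-small DataFrame is reported explicitly instead of passing through unchanged. The helper name `sample_or_none` is made up for this sketch, and unlike the cell above it accepts a DataFrame with exactly the requested number of rows.

```python
import pandas as pd


def sample_or_none(df, n, seed=None):
    """Hypothetical helper: return n random rows, or None if df is too small."""
    if len(df) < n:
        return None
    return df.sample(n, random_state=seed)


# Toy DataFrames invented for this example.
big = pd.DataFrame({"Set ID": range(60)})
small = pd.DataFrame({"Set ID": range(10)})

big_sample = sample_or_none(big, 50)
small_sample = sample_or_none(small, 50)

print("Rows sampled from big: ", len(big_sample))
print("Sample from small: ", small_sample)
```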

#### Now that we have a fairly random sample, we save that data as a .csv to use in Tableau. A file called "theme_sample_set_list.csv" will be created in the "CSVs" folder of this repo if you would like to review your sample. Output is the .csv file in the repo. 

In [None]:
# Save sample DataFrame to .csv for visualization in Tableau
file_path = dc.csv_path('theme_sample_set_list.csv')
set_df.to_csv(file_path)
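
#### By default, `to_csv` also writes the DataFrame index as an unnamed first column, which will show up as an extra field in Tableau. The sketch below, on made-up data written to an in-memory buffer, compares the header row with and without `index=False`. Whether the index column is wanted in the dashboard is a choice for the reader; the cell above keeps it.

```python
import io

import pandas as pd

# Toy data invented for this example.
toy_sample = pd.DataFrame(
    {"Set Name": ["A", "B"], "Number of Pieces": [100, 250]}
)

# Write to in-memory buffers instead of files so nothing is saved to disk.
with_index = io.StringIO()
toy_sample.to_csv(with_index)

without_index = io.StringIO()
toy_sample.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)     # ,Set Name,Number of Pieces
print(header_without_index)  # Set Name,Number of Pieces
```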