# Lego Analysis

Author: M. Tosic

Date: 01.2022

This notebook is part of my capstone project for a data science course. The project is independent and has no connection to the company LEGO.

# 1. Business Understanding

### Questions of interest

**A) Exploratory Analysis**
* **What themes are most dominant over the years?**
* **What sets were record breakers in terms of piece count?**
* **What sets were record breakers in terms of number of minifigs?**
* **What words come up most often in set names?**
* Are lego sets becoming more and more expensive?
* How does retail price relate to piece count?
* Does the value of sets go up after eol on average?
* What sets do best after eol? (eol = lego term for end-of-life, meaning the date from which the set is no longer being produced)

**B) Predictive Analysis**
* What features of the data set are good predictors that a set will rise in value after eol?
* What do the words contained in the set names tell us about the rise in value after eol?
* Which sets that are currently being sold can I predict to be a good investment after eol?*

*e.g. price increase of at least 10 usd (for package and shipment when selling) + at least 25% profit
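
The footnote's rule of thumb can be read in more than one way. The sketch below assumes one reading: the resale value has to cover the retail price plus a flat 10 usd for packaging and shipping, and what is left must be at least 25% of the retail price. The function name `is_good_investment` and its parameters are hypothetical and not part of the notebook.

```python
def is_good_investment(price, value_new, handling_usd=10.0, min_profit_ratio=0.25):
    """Return True if a set meets the assumed investment rule.

    Assumed reading of the footnote:
    value_new - price - handling_usd >= min_profit_ratio * price
    """
    if price <= 0:
        # no meaningful retail price, so the ratio cannot be evaluated
        return False
    net_gain = value_new - price - handling_usd
    return net_gain >= min_profit_ratio * price


# illustrative numbers, not taken from the data set
print(is_good_investment(price=50.0, value_new=80.0))
print(is_good_investment(price=50.0, value_new=70.0))
```

With this reading, a 50 usd set needs a resale value of at least 72.50 usd to qualify.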

# 2. Data Understanding

Data being used in this notebook has been downloaded from the following sources:

* https://brickset.com/
* https://rebrickable.com/downloads/

Simplifications:
* No time series data on the price averages is available. Assumption: price changes average out over time after eol, i.e. the price curves have already reached a steady state.
* No data is available on unique minifigs in sets (minifigs are popular with collectors who focus on them and are generally believed to drive up the prices of some sets after eol).

### 2.1 Import Libraries

In [431]:
import numpy as np
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_rows', 100) #pd.set_option('display.max_rows', None)


#visualization
#import matplotlib.pyplot as plt
#%matplotlib inline
#import seaborn as sns
import plotly.express as px

# import necessary libraries for batch import csv:
import os
import glob

#for counting elements in a list:
from collections import Counter

#needed for text processing:
import nltk
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

#datetime:
from datetime import datetime

#for pipeline:
from sklearn.pipeline import Pipeline

#for estimators:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

#for training:
from sklearn.model_selection import train_test_split

#for testing:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

#for Grid search:
from sklearn.model_selection import GridSearchCV

#for saving model
import pickle

### 2.2 Import Data

#### Functions:

In [296]:
#df_sets = pd.read_csv('data/rebrickable-sets.csv')
#df_themes = pd.read_csv('data/rebrickable-themes.csv')

In [297]:
def import_csv_with_date_column(filename, date_col_name, skiprows_val = 0):
    dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d')
    df = pd.read_csv(filename, parse_dates=[date_col_name], date_parser=dateparse,skiprows = skiprows_val)
    df.rename(columns=lambda x: x.strip(), inplace = True)
    return df

def slice_2_date_range(df, date_col,start_date, end_date):
    #greater than start date and smaller than the end date
    mask = (df[date_col] > start_date) & (df[date_col] <= end_date)
    df = df[mask].reset_index(drop = True)
    return df
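
One detail of `slice_2_date_range` that is easy to miss: the mask is exclusive at the start date and inclusive at the end date. A small check on made-up data (the dates and values below are illustrative and do not come from the exchange-rate files):

```python
import pandas as pd


def slice_2_date_range(df, date_col, start_date, end_date):
    # same logic as the notebook function: start exclusive, end inclusive
    mask = (df[date_col] > start_date) & (df[date_col] <= end_date)
    return df[mask].reset_index(drop=True)


toy = pd.DataFrame({
    "date": pd.to_datetime(["1991-01-01", "1991-01-02", "2021-12-31", "2022-01-01"]),
    "value": [1.0, 2.0, 3.0, 4.0],
})
sliced = slice_2_date_range(toy, "date", "1991-01-01", "2021-12-31")
print(sliced)
```

The row dated exactly on the start date is excluded, while the row on the end date is kept. If the start date should be included, `>` would have to become `>=`.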

In [298]:
def import_multiple_csv_files_2_df (relative_path):
    """ Function uses os and glob packages to import multiple csv files into one dataframe. 
    The current working directory should be the one where this notebook is located.
    INPUT: 
    Relative path to the files e.g. "./data/Kurac*.csv"
    OUTPUT: 
    One dataframe containing all csv files concatenated along axis = 0.
    """
    path = os.getcwd()
    files = glob.glob(os.path.join(path, relative_path))
    
    print('Glob search with parameters:', relative_path)
   # print('Ingested files:')
    li = []
    for file in files:
        df_temp = pd.read_csv(file, index_col = None, header = 0)
        li.append(df_temp)
        #print(file)
    try:    
        df = pd.concat(li, axis=0, ignore_index=True)
        print('Done.')

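    except ValueError:
        # pd.concat raises ValueError when the list is empty, i.e. no file matched the pattern
        print('Something went wrong with the concatenation of the files, returning None. Is the relative_path set correctly?')
        return None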
    
    return (df)
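
As a rough illustration of the same batch-import idea, the sketch below checks for an empty match list up front instead of relying on an exception. The helper name `concat_csv_files` and the `Demo_*.csv` files are hypothetical; the files are written to a temporary directory only so that the example can run on its own.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_csv_files(pattern):
    """Hypothetical helper: concatenate all csv files matching a glob pattern."""
    files = sorted(glob.glob(pattern))
    if not files:
        # nothing matched, so there is nothing to concatenate
        return None
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)


with tempfile.TemporaryDirectory() as tmp:
    # two tiny demo files standing in for the real exports
    pd.DataFrame({"number": ["1-1"], "year": [1991]}).to_csv(
        os.path.join(tmp, "Demo_a.csv"), index=False)
    pd.DataFrame({"number": ["2-1"], "year": [1992]}).to_csv(
        os.path.join(tmp, "Demo_b.csv"), index=False)
    combined = concat_csv_files(os.path.join(tmp, "Demo_*.csv"))
    nothing = concat_csv_files(os.path.join(tmp, "Missing_*.csv"))

print(combined)
print(nothing)
```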

#### Importing exchange rate data that will be used to fill in price-column:

In [299]:
df_gdp_usd = import_csv_with_date_column("data/exchange-rate-historical-chart_pound-dollar.csv",  "date", 15)
df_eur_usd = import_csv_with_date_column("data/exchange-rate-historical-chart_euro-dollar.csv",  "date", 15)
df_gdp_usd = slice_2_date_range(df_gdp_usd, "date", "1991-01-01", "2021-12-31")
df_eur_usd = slice_2_date_range(df_eur_usd, "date", "1991-01-01", "2021-12-31")
df_gdp_usd.head(3), df_eur_usd.head(3)
rate_gdp_usd = df_gdp_usd["value"].mean()
rate_eur_usd = df_eur_usd["value"].mean()
print("rate_gdp_usd:", rate_gdp_usd)
print("rate_eur_usd:", rate_eur_usd)

rate_gdp_usd: 1.5725745846764236
rate_eur_usd: 1.1970355369182681


#### Importing main data set

In [300]:
df = import_multiple_csv_files_2_df("./data/Brickset*.csv")

Glob search with parameters: ./data/Brickset*.csv
Done.


**Dropping unnecessary columns**

In [301]:
df.drop(['Qty owned','UPC','Qty owned new', 
         'Qty owned used', 'EAN','Priority','Wanted', 'Height', 'Depth', 'Weight', 'Width', 
         'Notes','Qty wanted','RRP (CAD)','Flag 1 not used', 'Flag 2 not used', 'Flag 3 not used',
         'Flag 4 not used', 'Flag 5 not used', 'Flag 6 not used','Flag 7 not used', 'Flag 8 not used'], axis=1, inplace=True)

**Renaming columns to be able to use dot-notation and make them more intuitive (e.g. price instead of rrp)**

In [302]:
df.rename(columns = lambda x : x.replace(' ', '_').replace('(','').replace(')','').lower().strip(), inplace = True)
df.rename(columns={'rrp_usd': 'price', 
                   'value_new_usd': 'value_new', 
                   'value_used_usd':'value_used',
                   'exit_date': 'eol_date'}, inplace = True) 
df.columns

Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'rrp_gbp', 'price', 'rrp_eur', 'value_new', 'value_used', 'launch_date',
       'eol_date'],
      dtype='object')

The "eol" in "eol_date" is lego lingo for "end-of-life": the date from which the set is retired from production and from official sales through lego. The sets normally remain available for some time through official and unofficial retailers.

**Sorting values by year and launch date**

In [303]:
df.sort_values(["year","launch_date"], inplace = True)

**Parse date columns to datetime**

In [304]:
df['launch_date'] = pd.to_datetime(df['launch_date'])
df['eol_date'] = pd.to_datetime(df['eol_date'])

In [305]:
display(df.head(3)), display(df.tail(3));

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,NaT,NaT
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,,,NaT,NaT
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,,,24.64,,NaT,NaT


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date
4104,ISBN9781452182261-1,Books,Chronicle,2022,The Art of the Minifigure,,,,,,,,NaT,NaT
4105,ISBN9781728257907-1,Books,Dorling Kindersley,2022,Build and Stick: NINJAGO Dragons,,,,,,,,NaT,NaT
4106,ISBN9781797214139-1,Books,Dorling Kindersley,2022,Build Every Day,,,,,,,,NaT,NaT


### 2.3 Exploring Content

**Checking basic info on dataframe and descriptive statistics:**

In [306]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15634 entries, 0 to 4106
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   number       15634 non-null  object        
 1   theme        15634 non-null  object        
 2   subtheme     12655 non-null  object        
 3   year         15634 non-null  int64         
 4   set_name     15634 non-null  object        
 5   minifigs     7171 non-null   float64       
 6   pieces       12118 non-null  float64       
 7   rrp_gbp      8172 non-null   float64       
 8   price        10234 non-null  float64       
 9   rrp_eur      3900 non-null   float64       
 10  value_new    10541 non-null  float64       
 11  value_used   8751 non-null   float64       
 12  launch_date  6624 non-null   datetime64[ns]
 13  eol_date     6624 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(7), int64(1), object(4)
memory usage: 1.8+ MB


In [307]:
df.describe()

Unnamed: 0,year,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used
count,15634.0,7171.0,12118.0,8172.0,10234.0,3900.0,10541.0,8751.0
mean,2010.44,2.67,233.25,26.55,29.97,38.66,79.04,41.11
std,8.05,2.79,470.34,39.71,44.52,56.64,213.08,75.64
min,1991.0,1.0,0.0,0.0,0.0,0.01,0.0,0.25
25%,2004.0,1.0,24.0,5.99,6.99,9.99,11.05,6.57
50%,2012.0,2.0,75.0,14.99,15.0,19.99,28.98,16.23
75%,2017.0,3.0,251.0,29.99,34.99,44.95,74.89,43.35
max,2022.0,33.0,11695.0,699.99,799.99,799.99,9773.99,1391.39


**Check if there are duplicated values:**

In [308]:
df[df.duplicated()]

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date


**Unique values per column:**

In [309]:
df.nunique().sort_values(ascending = True)

minifigs          31
year              32
theme            141
rrp_eur          143
eol_date         165
rrp_gbp          270
price            322
launch_date      404
subtheme         801
pieces          1332
value_used      4522
value_new       6178
set_name       13328
number         15634
dtype: int64

**Investigate missing values in data set:**

In [310]:
print("Percentages of missing values:\n{}".format(df.isnull().sum()/df.shape[0]*100))

Percentages of missing values:
number         0.00
theme          0.00
subtheme      19.05
year           0.00
set_name       0.00
minifigs      54.13
pieces        22.49
rrp_gbp       47.73
price         34.54
rrp_eur       75.05
value_new     32.58
value_used    44.03
launch_date   57.63
eol_date      57.63
dtype: float64
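
The same percentages can be obtained a little more compactly, because the mean of a boolean mask is the share of `True` values. A minimal sketch on a made-up frame (the column values are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pieces": [10.0, np.nan, 30.0, np.nan],
    "price": [5.0, 6.0, np.nan, 8.0],
})

# variant used in the notebook: count of missing values divided by row count
pct_a = toy.isnull().sum() / toy.shape[0] * 100
# equivalent shortcut: mean of the boolean mask
pct_b = toy.isna().mean() * 100

print(pct_b)
```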


In [311]:
df_missing_val_per = pd.DataFrame(df.isnull().sum()/df.shape[0]*100, columns=['value'])
df_missing_val_per_sorted = df_missing_val_per.sort_values(by = "value", ascending = False)

px.bar(df_missing_val_per_sorted, 
       x = df_missing_val_per_sorted.index, 
       y = "value", 
       labels = {"value":"percentage of missing values"})

**Comments:**
* There are NaN values in most columns.
* Most values are missing in rrp_eur, but this is acceptable since the analysis will be done in usd (value_new and value_used are also in usd). The available rrp_eur values can be used to fill in missing data in the usd column.
* More than half of the items have neither a launch date nor an exit date.
* The missing values for minifigs could simply mean that the items are lego sets without any minifigures, or that they are other lego merchandise.

**Tasks:**
* Roughly a quarter of the items are missing piece counts. This must be investigated, since it could indicate that the item is not a lego set but some other kind of merchandise from the database. I will aim to categorize the items into sets and other merchandise. A possible way to do this is to check for a piece count > 0 or a minifigure count > 0.

* Most prices are available in usd, and the new and used values are also in usd. Where possible, I will calculate missing usd prices from the columns in other currencies and then drop those columns to reduce complexity for further processing (one currency is enough for the intended analysis).

* Some launch and eol dates are also missing; I'll take a look at that. Sets from 2022 have probably not yet been released, so I will label them as not released. Items that have a launch date but no exit date will be labeled as active, and items that have an exit date will be labeled eol (the popular lego term "end-of-life" for items that are no longer produced).
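
The labeling rule from the last task could be sketched as follows. This is only one way to express it, using `np.select` on a made-up frame; the column name `status` and the fallback label `unknown` are assumptions, not names used in the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [2022, 2021, 2015, 2010],
    "launch_date": pd.to_datetime([None, "2021-01-03", "2015-03-01", None]),
    "eol_date": pd.to_datetime([None, None, "2016-12-31", None]),
})

# conditions are checked in order; the first match wins
conditions = [
    toy["eol_date"].notna(),                              # has an exit date
    toy["launch_date"].notna() & toy["eol_date"].isna(),  # launched, no exit date
    toy["year"] == 2022,                                  # assumed not yet released
]
labels = ["eol", "active", "not_released"]
toy["status"] = np.select(conditions, labels, default="unknown")

print(toy)
```

A stricter variant would compare `eol_date` against a reference date, since an exit date can lie in the future for sets that are still on sale.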

**Make box plots of all columns with numerical values:**

In [312]:
def make_plots_of_num_cols(df):
    for col in df.columns:
        if df[col].dtype == np.int64 or df[col].dtype == np.float64:
            print(col)
            fig = px.box(df, x = col, points="all")
            fig.update_yaxes(visible = False, showticklabels = False)
            fig.show()
        else:
            continue

In [313]:
#make_plots_of_num_cols(df)

## 3. Prepare Data

### Removing rows where there is no numeric data

In [314]:
df.shape

(15634, 14)

In [315]:
cond = df[['minifigs','pieces',
       "rrp_gbp", "rrp_eur", "price", 
       "value_new", "value_used", 
       "launch_date", "eol_date"]].isnull().values.all(axis=1)
df['numeric_data_nan'] = np.where(cond, True, False)
df[df["numeric_data_nan"] == True].head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan
142,BK15SPR1991-1,Books,Brick Kicks,1991,BRICK KICKS Spring 1991,,,,,,,,NaT,NaT,True
143,BK16SUM1991-1,Books,Brick Kicks,1991,BRICK KICKS Summer 1991,,,,,,,,NaT,NaT,True
144,BK17FAL1991-1,Books,Brick Kicks,1991,BRICK KICKS Fall 1991,,,,,,,,NaT,NaT,True


In [316]:
df = df[df["numeric_data_nan"] == False].copy()  # copy avoids SettingWithCopyWarning when columns are added later
df.head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,NaT,NaT,False
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,,,NaT,NaT,False
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,,,24.64,,NaT,NaT,False


In [317]:
df.shape

(14473, 15)
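
For comparison, the same filter can be written without a helper column. The sketch below, on made-up data, shows that a boolean mask and `dropna(subset=..., how="all")` keep the same rows; the list name `num_cols` is a hypothetical stand-in for the nine columns checked above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "set_name": ["A", "B", "C"],
    "pieces": [10.0, np.nan, np.nan],
    "price": [np.nan, np.nan, 4.99],
})
num_cols = ["pieces", "price"]

# variant 1: explicit mask, as in the notebook
mask_all_nan = toy[num_cols].isna().all(axis=1)
kept_mask = toy[~mask_all_nan]

# variant 2: dropna drops a row only if ALL listed columns are missing
kept_dropna = toy.dropna(subset=num_cols, how="all")

print(kept_dropna)
```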

### Categorization of items to sets and minifigures while removing items such as gear, books, etc.

From Lego.com: "We use numbers as a quick and convenient way to instantly identify any LEGO set. Numbers on the first sets we made were three digits long, but as we made more and more sets, we started using longer numbers. Currently, set numbers are five to seven digits long and are featured prominently on the box and instructions for the set."

Source page:
https://www.lego.com/en-my/service/help/bricks-and-sets/replacement-parts/identifying-lego-set-and-part-numbers-blte3ec07db3789cacb

This will help me further narrow down true lego sets in the main data set and filter out the other merch.
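
A compact way to express such a check is a regular expression on the full item number. The pattern below is an assumption: it only requires digits, a hyphen and digits, and does not enforce the length of five to seven digits mentioned in the quote. The last sample value is made up; the others appear in the data shown in this notebook.

```python
import pandas as pd

numbers = pd.Series(["819-1", "BK15SPR1991-1", "ISBN9781452182261-1", "43101-14", "10282"])

# digits, one hyphen, digits; anything else is treated as a non-set item
is_valid = numbers.str.match(r"^\d+-\d+$")

print(numbers[is_valid].tolist())
```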



**Drop rows where item number is not a valid set number according to lego.com**

In [318]:
df[['number_main','number_sub']] = df['number'].str.split('-',expand=True)

In [319]:
df.drop(df.index[df["number_main"].apply(lambda x: not (x.isnumeric()))], axis=0, inplace=True)
df.drop(df.index[df["number_sub"].apply(lambda x: not (x.isnumeric()))], axis=0, inplace=True)

In [320]:
df.drop(['number_main','number_sub'], axis = "columns", inplace = True)

**Adding category column with value set for all items with > 0 number of pieces:**

In [321]:
df['category'] = np.where(df['pieces'] > 0, "set", "uncategorized")
df[df.category == "uncategorized"].head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
662,8299-1,Technic,,1997,Search Sub,1.0,0.0,,50.0,,108.62,61.46,NaT,NaT,False,uncategorized
678,9708-1,Gear,Education,1997,Intelligent House Activity Pack,,,,12.0,,,,NaT,NaT,False,uncategorized
15096,3978-1,Gear,Key Chains/Castle,1998,Magic Wizard Key Chain,,,,3.0,,,,NaT,NaT,False,uncategorized


**Dealing with uncategorized items:**

In [322]:
uncat_themes = set(df[(df.category == "uncategorized")].theme)
print(uncat_themes)

{'Marvel Super Heroes', 'DC Comics Super Heroes', 'Friends', 'Vidiyo', 'BrickHeadz', 'Legends of Chima', 'Unikitty', 'Books', 'Technic', 'Ninjago', 'Power Miners', 'Star Wars', 'Promotional', 'Duplo', 'Education', 'Sports', 'The LEGO Movie 2', 'Creator Expert', 'Super Mario', 'Clikits', 'Make and Create', 'Disney', 'City', 'Gear', 'Collectable Minifigures', 'Seasonal', 'Miscellaneous'}


In [323]:
for theme in uncat_themes:
    cond_2 = (df.theme == theme) & (df.category == "uncategorized")
    print("Theme:", theme)
    print("Number of rows:", df[cond_2].shape[0])
    display(df[cond_2].head(3))

Theme: Marvel Super Heroes
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
4076,242210-1,Marvel Super Heroes,Magazine Gift,2022,Iron Man,,,,,,5.04,,NaT,NaT,False,uncategorized


Theme: DC Comics Super Heroes
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
7594,5004816-1,DC Comics Super Heroes,Product Collection,2015,Super Heroes DC Collection,,,,149.98,,,,NaT,NaT,False,uncategorized
3631,212010-1,DC Comics Super Heroes,Magazine Gift,2020,Batman,1.0,,,,,3.38,2.73,NaT,NaT,False,uncategorized


Theme: Friends
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
4805,5005553-1,Friends,Product Collection,2018,LEGO Friends Easter Bundle,,,35.97,,43.97,,,NaT,NaT,False,uncategorized
1585,66673-1,Friends,Product Collection,2021,Animal Gift Set,,,,,,,,2021-01-11,2021-12-31,False,uncategorized


Theme: Vidiyo
Number of rows: 4


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
1475,43101-0,Vidiyo,Bandmates Series 1,2021,Bandmates Series 1 {Random box},,,3.99,4.99,4.99,,,2021-01-03,2022-12-31,False,uncategorized
1489,43101-14,Vidiyo,Bandmates Series 1,2021,Bandmates Series 1 - Sealed Box,,,3.99,4.99,4.99,,,2021-01-03,2022-12-31,False,uncategorized
1496,43108-0,Vidiyo,Bandmates Series 2,2021,Bandmates Series 2 {Random box},,,3.99,4.99,4.99,,,2021-01-10,2022-12-31,False,uncategorized


Theme: BrickHeadz
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
5669,6315025-1,BrickHeadz,Promotional,2019,Amsterdam BrickHeadz,,,,,,433.6,,NaT,NaT,False,uncategorized


Theme: Legends of Chima
Number of rows: 19


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
6617,391214-1,Legends of Chima,Magazine gift,2014,Speedorz Ramp,,,,,,2.37,,NaT,NaT,False,uncategorized
6619,391404-1,Legends of Chima,Magazine gift,2014,Worriz,,,,,,6.22,,NaT,NaT,False,uncategorized
6620,391405-1,Legends of Chima,Magazine gift,2014,Crocodile Hideout,,,,,,1.63,,NaT,NaT,False,uncategorized


Theme: Unikitty
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
4385,41775-14,Unikitty,Blind Bags Series 1,2018,Unikitty! - Blind Bags Series 1 - Sealed Box,,,,,,,,2018-01-06,2018-12-31,False,uncategorized


Theme: Books
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
5988,4006-1,Books,LEGO,2000,Brick Tricks: Cool Cars,,,,8.0,,,,NaT,NaT,False,uncategorized
5989,4007-1,Books,LEGO,2000,Brick Tricks: Fantastic Fliers,,,,8.0,,,,NaT,NaT,False,uncategorized


Theme: Technic
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
662,8299-1,Technic,,1997,Search Sub,1.0,0.0,,50.0,,108.62,61.46,NaT,NaT,False,uncategorized


Theme: Ninjago
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
4804,5005552-1,Ninjago,Product Collection,2018,LEGO NINJAGO Easter Bundle,,,42.95,,44.95,,,NaT,NaT,False,uncategorized


Theme: Power Miners
Number of rows: 3


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
2556,4559288-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,29.5,,NaT,NaT,False,uncategorized
2557,4559385-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,14.63,,NaT,NaT,False,uncategorized
2558,4559387-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,9.26,,NaT,NaT,False,uncategorized


Theme: Star Wars
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
1586,66674-1,Star Wars,Product Collection,2021,Skywalker Adventures Pack,,,,,,66.58,,2021-01-11,2021-12-31,False,uncategorized


Theme: Promotional
Number of rows: 18


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
14860,4212850-1,Promotional,LEGO brand stores,2004,Easter Egg Orange,,,,,,49.0,,NaT,NaT,False,uncategorized
4841,6258620-1,Promotional,Miscellaneous,2018,Classic Wooden Duck,,,,,,47.7,,NaT,NaT,False,uncategorized
4842,6258622-1,Promotional,Miscellaneous,2018,Classic Wooden Bus,,,,,,52.51,,NaT,NaT,False,uncategorized


Theme: Duplo
Number of rows: 21


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
5838,2751-1,Duplo,,2000,Egg Fun,,,,4.0,,,,NaT,NaT,False,uncategorized
11429,5484-1,Duplo,,2006,{Zoo animal},,,,,,,,2006-01-01,2009-12-31,False,uncategorized
11431,5485-2,Duplo,,2006,Zoo - Zoo Keeper,,,,,,,,2006-01-01,2007-12-31,False,uncategorized


Theme: Education
Number of rows: 9


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
9282,9412-1,Education,Duplo,2003,Duplo Bricks,,,28.99,,,,,NaT,NaT,False,uncategorized
12048,9310-1,Education,,2007,Dinosaurs Set,,,,,,,89.99,NaT,NaT,False,uncategorized
12492,45080-1,Education,,2013,Creative Cards,,,,,,6.57,,NaT,NaT,False,uncategorized


Theme: Sports
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
5896,3406-2,Sports,Football,2000,French Team Bus,,,,,,,,2000-01-04,2002-06-30,False,uncategorized


Theme: The LEGO Movie 2
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
5435,471906-1,The LEGO Movie 2,Magazine Gift,2019,Rex with Jetpack,1.0,,,,,3.19,,NaT,NaT,False,uncategorized
5582,5005738-1,The LEGO Movie 2,,2019,Sticker roll,,,3.99,3.99,3.99,,,NaT,NaT,False,uncategorized


Theme: Creator Expert
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
1229,10282-2,Creator Expert,Adidas,2021,Adidas Originals Superstar X Footshop 'Bluepri...,,,79.99,79.99,89.99,,,2021-01-07,2023-12-31,False,uncategorized


Theme: Super Mario
Number of rows: 54


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
3468,71361-0,Super Mario,Character Pack - Series 1,2020,Character Pack - Series 1 {Random bag},,,,4.99,,,,2020-01-08,2020-12-31,False,uncategorized
3469,71361-1,Super Mario,Character Pack - Series 1,2020,Paragoomba,1.0,,,4.99,,12.92,5.39,2020-01-08,2020-12-31,False,uncategorized
3470,71361-2,Super Mario,Character Pack - Series 1,2020,Fuzzy,1.0,,,4.99,,6.0,4.67,2020-01-08,2020-12-31,False,uncategorized


Theme: Clikits
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
14672,7575-1,Clikits,Seasonal,2004,Clikits Advent Calendar,,,11.99,15.0,,18.75,,2004-01-10,2006-12-31,False,uncategorized


Theme: Make and Create
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
11514,7794-1,Make and Create,,2006,{Set with two minifigs},,,19.99,,,,,NaT,NaT,False,uncategorized


Theme: Disney
Number of rows: 4


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
1834,302102-1,Disney,Magazine Gift,2021,Rapunzel & Hairbrush,,,,,,3.66,,NaT,NaT,False,uncategorized
1835,302103-1,Disney,Magazine Gift,2021,Cinderella's Kitchen,,,,,,3.39,,NaT,NaT,False,uncategorized
1837,302105-1,Disney,Magazine Gift,2021,"Lumiere, Cogsworth and Sultan",,,,,,3.9,,NaT,NaT,False,uncategorized


Theme: City
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
10130,66540-1,City,Volcano Explorers,2016,City Volcano Value Pack,,,,,,88.23,55.18,2016-01-09,2016-12-31,False,uncategorized
4806,5005554-1,City,Product Collection,2018,LEGO City Easter Bundle,,,37.97,,44.97,,,NaT,NaT,False,uncategorized


Theme: Gear
Number of rows: 1873


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
678,9708-1,Gear,Education,1997,Intelligent House Activity Pack,,,,12.0,,,,NaT,NaT,False,uncategorized
15096,3978-1,Gear,Key Chains/Castle,1998,Magic Wizard Key Chain,,,,3.0,,,,NaT,NaT,False,uncategorized
15154,5701-1,Gear,Video Games/PC,1998,LEGO Loco,,,,10.0,,,,NaT,NaT,False,uncategorized


Theme: Collectable Minifigures
Number of rows: 68


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
14162,8683-0,Collectable Minifigures,Series 1,2010,LEGO Minifigures - Series 1 {Random bag},,,1.99,1.99,,,,2010-01-05,2010-12-31,False,uncategorized
14180,8683-18,Collectable Minifigures,Series 1,2010,LEGO Minifigures - Series 1 - Sealed Box,,,119.4,,,,,2010-01-05,2010-12-31,False,uncategorized
14181,8684-0,Collectable Minifigures,Series 2,2010,LEGO Minifigures - Series 2 {Random bag},,,1.99,1.99,,,,2010-01-09,2010-12-31,False,uncategorized


Theme: Seasonal
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
6807,5004259-1,Seasonal,Christmas,2014,Holiday Ornament Collection,,,,47.94,,,,NaT,NaT,False,uncategorized


Theme: Miscellaneous
Number of rows: 5


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
8769,4000024-1,Miscellaneous,LEGO Inside Tour Exclusive,2017,LEGO House Tree of Creativity,,,,,,1960.86,,NaT,NaT,False,uncategorized
4743,4000025-1,Miscellaneous,LEGO Inside Tour Exclusive,2018,LEGO Ferguson Tractor,,,,,,2300.0,,NaT,NaT,False,uncategorized
5559,4000034-1,Miscellaneous,LEGO Inside Tour Exclusive,2019,LEGO System House,,,,,,3432.84,,NaT,NaT,False,uncategorized


Product collections, bundles, promotionals, sealed boxes, magazine gifts, shoes such as the Adidas Originals Superstar, etc. are dropped. Other items that could be sorted as sets or minifigures are categorized accordingly. Even though sets are the main focus of the analysis, I've decided to create the minifigure category in case I choose to do a deep dive into that topic later.
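
One possible way to express this kind of rule is a theme-to-category mapping applied only to the uncategorized rows. This is a sketch on made-up rows, not the code used in the notebook; the dictionary `theme_to_category` is hypothetical, and it assumes that every uncategorized theme missing from the mapping should be dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "theme": ["Clikits", "Super Mario", "Gear", "City", "City"],
    "category": ["uncategorized", "uncategorized", "uncategorized", "uncategorized", "set"],
})

# themes to keep and the category they should receive (illustrative subset)
theme_to_category = {"Clikits": "set", "Super Mario": "minifig"}

uncat = toy["category"] == "uncategorized"
# themes missing from the mapping become NaN here
toy.loc[uncat, "category"] = toy.loc[uncat, "theme"].map(theme_to_category)
# rows without a category are the ones to drop
toy = toy.dropna(subset=["category"]).reset_index(drop=True)

print(toy)
```

Rows that already carry a category, such as the second City row, are left untouched.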


In [324]:
drop_col_list = ['Star Wars', 'DC Comics Super Heroes', 'City', 'The LEGO Movie 2', 
                     'Legends of Chima', 'Marvel Super Heroes',  'Books', 'Creator Expert', 
                       'Ninjago', 'Vidiyo', 'Disney', 'Miscellaneous', 'Gear', 'Duplo', 
                     'BrickHeadz', 'Promotional', 'Friends', 'Seasonal', 'Unikitty']
set_list = ["Clikits", 'Education','Make and Create', 'Sports']
minifig_list = ['Collectable Minifigures', 'Power Miners', 'Super Mario','Technic']
print("Rows that will be dropped:", drop_col_list)
print("To be categorized as sets:", set_list)
print("To be categorized as minifigs:", minifig_list)

Rows that will be dropped: ['Star Wars', 'DC Comics Super Heroes', 'City', 'The LEGO Movie 2', 'Legends of Chima', 'Marvel Super Heroes', 'Books', 'Creator Expert', 'Ninjago', 'Vidiyo', 'Disney', 'Miscellaneous', 'Gear', 'Duplo', 'BrickHeadz', 'Promotional', 'Friends', 'Seasonal', 'Unikitty']
To be categorized as sets: ['Clikits', 'Education', 'Make and Create', 'Sports']
To be categorized as minifigs: ['Collectable Minifigures', 'Power Miners', 'Super Mario', 'Technic']


In [325]:
for theme in set_list:
    df.loc[(df.theme == theme) & (df.category == "uncategorized"),'category'] ='set'
for theme in minifig_list:
    df.loc[(df.theme == theme) & (df.category == "uncategorized"),'category'] = "minifig"

In [326]:
df = df.drop(df[(df.category == "uncategorized") & (df.theme.isin(drop_col_list))].index)
df.category.unique()

array(['set', 'minifig'], dtype=object)

**Recategorize "Collectable Minifigures" as minifigs**


In [327]:
df.loc[(df.theme == "Collectable Minifigures"), "category"] = "minifig"

In [328]:
set(df[df.theme == "Collectable Minifigures"].category)

{'minifig'}

**Since 2022 is incomplete, we will save the rows from 2022 in a new dataframe and remove them from the main one.**


In [329]:
df_2022 = df[df.year == 2022]

In [330]:
df = df[df.year != 2022]
print(set(df.year))

{1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021}


**Creating dataframe containing only sets, since they will be the main focus of the analysis:**


In [331]:
df_sets = df[df.category == "set"].sort_values(by = "year")

### Further preparation steps for exploratory analysis

**Filling in price data in usd from other currencies where possible**
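
The idea is a fallback chain: keep the usd price where it exists, otherwise convert the gbp price, otherwise convert the eur price. A minimal sketch on made-up rows, with the rates rounded from the mean exchange rates computed earlier in this notebook (the constant names are hypothetical):

```python
import numpy as np
import pandas as pd

RATE_GBP_USD = 1.5726  # rounded mean rate, gbp to usd
RATE_EUR_USD = 1.1970  # rounded mean rate, eur to usd

toy = pd.DataFrame({
    "price":   [5.50, np.nan, np.nan, np.nan],
    "rrp_gbp": [np.nan, 10.00, np.nan, np.nan],
    "rrp_eur": [np.nan, 12.00, 20.00, np.nan],
})

# usd first, then gbp converted to usd, then eur converted to usd
toy["price"] = (
    toy["price"]
    .fillna(toy["rrp_gbp"] * RATE_GBP_USD)
    .fillna(toy["rrp_eur"] * RATE_EUR_USD)
)

print(toy)
```

The second row has both a gbp and a eur price; the gbp value wins because it comes first in the chain. The last row stays missing because no currency is available.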


In [332]:
rate_eur_usd, rate_gdp_usd

(1.1970355369182681, 1.5725745846764236)

In [333]:
print("No. of rows where the price could be filled exclusively with gbp-data:")
mask_gbp = (df_sets.price.isna()) & (df_sets.rrp_eur.isna()) & (df_sets.rrp_gbp.notnull())
df_sets[mask_gbp].shape

No. of rows where the price could be filled exclusively with gbp-data:


(243, 16)

In [334]:
print("No. of rows where the price could be filled exclusively with eur-data:")
mask_eur = (df_sets.price.isna()) & (df_sets.rrp_gbp.isna()) & (df_sets.rrp_eur.notnull())
df_sets[mask_eur].shape

No. of rows where the price could be filled exclusively with eur-data:


(5, 16)

In [335]:
df_sets[df_sets['price'].isnull()].shape

(3285, 16)

In [336]:
df_sets["price_eur_calc"] = df_sets["rrp_eur"] * rate_eur_usd
df_sets["price_gbp_calc"] = df_sets["rrp_gbp"] * rate_gdp_usd
df_sets.head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category,price_eur_calc,price_gbp_calc
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,NaT,NaT,False,set,,
91,5165-1,Service Packs,,1991,"Hinges, Couplings and Tilting Bearings",,31.0,,3.0,,20.0,,NaT,NaT,False,set,,
92,5166-1,Service Packs,,1991,"Lamp Holders, Tool Holder Plates",,18.0,,,,12.79,,NaT,NaT,False,set,,


In [337]:
df_sets['price'].fillna(df_sets['price_gbp_calc'], inplace=True)
df_sets[df_sets['price'].isnull()].shape

(2999, 18)

In [338]:
df_sets['price'].fillna(df_sets['price_eur_calc'], inplace=True)
df_sets[df_sets['price'].isnull()].shape

(2994, 18)
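The two fill steps above can also be written as one chained expression. A minimal sketch on a toy frame, assuming the same column names (`price`, `rrp_gbp`, `rrp_eur`); the rates and values below are illustrative placeholders, not the ones used in this notebook:

```python
import pandas as pd

# Illustrative conversion rates (placeholders, not the notebook's values)
rate_gbp_usd = 1.5
rate_eur_usd = 1.2

toy = pd.DataFrame({
    "price":   [10.0, None, None, None],
    "rrp_gbp": [8.0, 4.0, None, None],
    "rrp_eur": [9.0, 5.0, 10.0, None],
})

# Prefer the USD price, then the GBP conversion, then the EUR conversion
toy["price"] = (
    toy["price"]
    .fillna(toy["rrp_gbp"] * rate_gbp_usd)
    .fillna(toy["rrp_eur"] * rate_eur_usd)
)
print(toy["price"].tolist())
```

Rows without any price information in either currency stay missing, which matches the behaviour of the two separate `fillna` calls.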

In [339]:
df_sets[df_sets['price'].isnull()].head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,eol_date,numeric_data_nan,category,price_eur_calc,price_gbp_calc
92,5166-1,Service Packs,,1991,"Lamp Holders, Tool Holder Plates",,18.0,,,,12.79,,NaT,NaT,False,set,,
93,5271-1,Service Packs,,1991,Tyres and Hubs 49.6 mm White,,4.0,,,,5.9,,NaT,NaT,False,set,,
102,6352-1,Town,Vehicles,1991,Cargomaster Crane,1.0,140.0,,,,115.92,24.24,NaT,NaT,False,set,,


In [340]:
df_sets_price = df_sets[df_sets['price'].notnull()]
df_sets_price.columns

Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'rrp_gbp', 'price', 'rrp_eur', 'value_new', 'value_used', 'launch_date',
       'eol_date', 'numeric_data_nan', 'category', 'price_eur_calc',
       'price_gbp_calc'],
      dtype='object')

In [341]:
df_sets_price = df_sets_price.drop(['rrp_eur','price_gbp_calc', 'rrp_gbp', 'price_eur_calc'], axis = "columns")
df_sets = df_sets.drop(['rrp_eur','price_gbp_calc', 'rrp_gbp', 'price_eur_calc'], axis = "columns")

In [342]:
df_sets_price = df_sets_price.sort_values("year").reset_index(drop = True)
df_sets = df_sets.sort_values("year").reset_index(drop = True)

In [343]:
df_sets.head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,5.5,6.26,,NaT,NaT,False,set
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,NaT,NaT,False,set
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,24.64,,NaT,NaT,False,set


In [344]:
df_sets_price.head(3)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,5.5,6.26,,NaT,NaT,False,set
1,6540-1,Town,Police,1991,Pier Police,4.0,352.0,44.0,424.79,66.18,NaT,NaT,False,set
2,6541-1,Town,Boats,1991,Intercoastal Seaport,5.0,545.0,63.75,369.0,165.57,NaT,NaT,False,set


**Prepare dataframe for value after eol analysis**


In [345]:
df_sets_price.eol_date.min(), df_sets_price.eol_date.max()

(Timestamp('1996-12-31 00:00:00'), Timestamp('2026-12-31 00:00:00'))

In [346]:
df_sets_price[df_sets_price.eol_date.notnull()].head(5)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
366,6712-1,Western,Cowboys,1996,Sheriff's Showdown,2.0,28.0,4.0,81.1,15.67,1996-01-09,1998-12-31,False,set
367,6716-1,Western,Cowboys,1996,Covered Wagon,1.0,64.0,8.0,152.52,39.78,1996-01-09,1998-12-31,False,set
368,6769-1,Western,Cowboys,1996,Fort Legoredo,10.0,687.0,85.0,400.0,223.58,1996-01-10,1998-12-31,False,set
374,6518-1,Town,Coastguard,1996,Baja Buggy,1.0,37.0,3.5,15.14,5.53,1996-01-07,1997-12-31,False,set
375,6493-1,Time Cruisers,,1996,Flying Time Vessel,2.0,237.0,44.0,112.47,40.77,1996-01-07,1998-12-31,False,set


**Comments**
* EOL dates before 1996-12-31 are not available.
* Unfortunately, I could not find time series information about the value of sets after eol. The values provided are a snapshot from Jan 2022. The value of a set should saturate at some point after eol, and for the sake of simplicity I will assume that this saturation point has been reached. To be fair to sets that have only recently retired, I will restrict the analysis to sets with an eol no later than 31.12.2020.
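The helper `slice_2_date_range` used in the next cell is not shown in this section. A filter of that kind could look roughly like the sketch below; `slice_by_date_range` is a hypothetical stand-in, and the inclusive bounds and the index reset are assumptions:

```python
import pandas as pd

def slice_by_date_range(df, col, start, end):
    """Hypothetical stand-in: keep rows whose date in `col` lies in [start, end]."""
    mask = df[col].between(pd.Timestamp(start), pd.Timestamp(end))
    # Rows with a missing date (NaT) never satisfy the mask and are dropped
    return df[mask].reset_index(drop=True)

toy = pd.DataFrame({
    "number": ["a-1", "b-1", "c-1", "d-1"],
    "eol_date": pd.to_datetime(["1998-12-31", "2020-12-31", "2021-12-31", None]),
})
toy_eol = slice_by_date_range(toy, "eol_date", "1996-12-31", "2020-12-31")
print(toy_eol["number"].tolist())
```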

In [347]:
df_eol = slice_2_date_range(df_sets_price,"eol_date","1996-12-31","2020-12-31")
display(df_eol.head(5)), df_eol.shape

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
0,6712-1,Western,Cowboys,1996,Sheriff's Showdown,2.0,28.0,4.0,81.1,15.67,1996-01-09,1998-12-31,False,set
1,6716-1,Western,Cowboys,1996,Covered Wagon,1.0,64.0,8.0,152.52,39.78,1996-01-09,1998-12-31,False,set
2,6769-1,Western,Cowboys,1996,Fort Legoredo,10.0,687.0,85.0,400.0,223.58,1996-01-10,1998-12-31,False,set
3,6518-1,Town,Coastguard,1996,Baja Buggy,1.0,37.0,3.5,15.14,5.53,1996-01-07,1997-12-31,False,set
4,6493-1,Time Cruisers,,1996,Flying Time Vessel,2.0,237.0,44.0,112.47,40.77,1996-01-07,1998-12-31,False,set


(None, (4474, 14))

In [348]:
df_eol = df_eol[df_eol["value_new"].notnull()] #remove rows where value_new is missing
df_eol.shape

(4395, 14)

In [349]:
def change_per(a,b):
    """Function to get a change of value in percent from numbers a and b, b being the later instance.
    INPUT:
    a: first instance
    b: second instance
    OUTPUT:
    result of calculation as float in percent
    """
    try:
        result = 100*(b - a)/a 
    except(ZeroDivisionError): 
        print("Division with zero.")
        print(a,b)
        result = None
    return result

In [350]:
# The first attempt raised a division-by-zero error. Checking how many items have a price of zero:

In [351]:
df_eol[df_eol.price.values == 0].shape

(13, 14)

In [352]:
#Removing rows where price is zero (those are predominantly promotional sets gifted at, for instance, the opening of a Lego store).

In [353]:
df_eol = df_eol[df_eol.price.values != 0]

In [354]:
df_eol["value_change_per"]  = df_eol.apply(lambda f: change_per(f['price'],f['value_new']), axis=1)
df_eol.head(5)

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category,value_change_per
0,6712-1,Western,Cowboys,1996,Sheriff's Showdown,2.0,28.0,4.0,81.1,15.67,1996-01-09,1998-12-31,False,set,1927.5
1,6716-1,Western,Cowboys,1996,Covered Wagon,1.0,64.0,8.0,152.52,39.78,1996-01-09,1998-12-31,False,set,1806.5
2,6769-1,Western,Cowboys,1996,Fort Legoredo,10.0,687.0,85.0,400.0,223.58,1996-01-10,1998-12-31,False,set,370.59
3,6518-1,Town,Coastguard,1996,Baja Buggy,1.0,37.0,3.5,15.14,5.53,1996-01-07,1997-12-31,False,set,332.57
4,6493-1,Time Cruisers,,1996,Flying Time Vessel,2.0,237.0,44.0,112.47,40.77,1996-01-07,1998-12-31,False,set,155.61
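The row-wise `apply` above can also be expressed as a column operation. A minimal sketch on toy data, assuming the same column names; here zero prices are turned into missing values instead of being filtered out first, which is a design choice of this sketch and not what the notebook does:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [4.0, 8.0, 0.0], "value_new": [81.1, 152.52, 5.0]})

# Replace zero prices by NaN so the division yields NaN instead of inf
base = toy["price"].replace(0, np.nan)
toy["value_change_per"] = 100 * (toy["value_new"] - base) / base
print(toy["value_change_per"].round(2).tolist())
```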


### Further preparation steps for predictive analysis

In [398]:
df_eol.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4382 entries, 0 to 4473
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   number            4382 non-null   object        
 1   theme             4382 non-null   object        
 2   subtheme          3472 non-null   object        
 3   year              4382 non-null   int64         
 4   set_name          4382 non-null   object        
 5   minifigs          3069 non-null   float64       
 6   pieces            4381 non-null   float64       
 7   price             4382 non-null   float64       
 8   value_new         4382 non-null   float64       
 9   value_used        4219 non-null   float64       
 10  launch_date       4382 non-null   datetime64[ns]
 11  eol_date          4382 non-null   datetime64[ns]
 12  numeric_data_nan  4382 non-null   bool          
 13  category          4382 non-null   object        
 14  value_change_per  4382 n

In [None]:
#sometimes there are subthemes missing, adding themes as subthemes in those cases.

In [399]:
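# Fill from df_eol's own theme column: df_sets uses a different index, so filling
# from df_sets['theme'] would pick up the themes of unrelated rows.
df_eol['subtheme'] = df_eol['subtheme'].fillna(df_eol['theme'])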

In [402]:
df_eol['subtheme'].isnull().any()

False

In [None]:
#Can we predict which sets are 1 - not worth investing, 2 - at least double in value (100%)?
#1. prediction based

In [418]:
df_ml = df_eol.drop(["category", "numeric_data_nan", "launch_date","eol_date", "value_used", "value_new"], axis = "columns")

In [419]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4382 entries, 0 to 4473
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   number            4382 non-null   object 
 1   theme             4382 non-null   object 
 2   subtheme          4382 non-null   object 
 3   year              4382 non-null   int64  
 4   set_name          4382 non-null   object 
 5   minifigs          3069 non-null   float64
 6   pieces            4381 non-null   float64
 7   price             4382 non-null   float64
 8   value_change_per  4382 non-null   float64
dtypes: float64(4), int64(1), object(4)
memory usage: 471.4+ KB


In [420]:
# One row (7575-1, Clikits) had a null value for pieces --> drop

In [421]:
df_ml = df_ml[df_ml.pieces.notnull()]

In [422]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4381 entries, 0 to 4473
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   number            4381 non-null   object 
 1   theme             4381 non-null   object 
 2   subtheme          4381 non-null   object 
 3   year              4381 non-null   int64  
 4   set_name          4381 non-null   object 
 5   minifigs          3069 non-null   float64
 6   pieces            4381 non-null   float64
 7   price             4381 non-null   float64
 8   value_change_per  4381 non-null   float64
dtypes: float64(4), int64(1), object(4)
memory usage: 342.3+ KB


In [423]:
# minifigs still has NaN values

In [428]:
df_ml["minifigs"].fillna(0, inplace=True)

In [429]:
df_ml.info() # no NaN values anymore

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4381 entries, 0 to 4473
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   number            4381 non-null   object 
 1   theme             4381 non-null   object 
 2   subtheme          4381 non-null   object 
 3   year              4381 non-null   int64  
 4   set_name          4381 non-null   object 
 5   minifigs          4381 non-null   float64
 6   pieces            4381 non-null   float64
 7   price             4381 non-null   float64
 8   value_change_per  4381 non-null   float64
dtypes: float64(4), int64(1), object(4)
memory usage: 342.3+ KB


In [430]:
df_ml

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_change_per
0,6712-1,Western,Cowboys,1996,Sheriff's Showdown,2.00,28.00,4.00,1927.50
1,6716-1,Western,Cowboys,1996,Covered Wagon,1.00,64.00,8.00,1806.50
2,6769-1,Western,Cowboys,1996,Fort Legoredo,10.00,687.00,85.00,370.59
3,6518-1,Town,Coastguard,1996,Baja Buggy,1.00,37.00,3.50,332.57
4,6493-1,Time Cruisers,Assorted,1996,Flying Time Vessel,2.00,237.00,44.00,155.61
...,...,...,...,...,...,...,...,...,...
4469,43182-1,Disney,Mulan,2020,Mulan's Training Grounds,1.00,157.00,29.99,22.41
4470,43178-1,Disney,Cinderella,2020,Cinderella's Castle Celebration,1.00,168.00,29.99,-24.67
4471,43174-1,Disney,Storybook Adventures,2020,Mulan's Storybook Adventures,3.00,124.00,19.99,0.80
4472,43170-1,Disney,Moana,2020,Moana's Ocean Adventure,1.00,46.00,9.99,40.04


In [445]:
df_ml[['number_main','number_sub']] = df_ml['number'].str.split('-',expand=True).astype(int)

In [774]:
df_ml.value_change_per.describe()

count    4,381.00
mean       181.52
std        300.28
min        -81.01
25%         30.23
50%         99.10
75%        231.14
max     10,413.86
Name: value_change_per, dtype: float64

Mapping the value change column to following cluster groups:
* 0: not worth investing (value increase less than 100%)
* 1: good value increase (100%-300%)
* 2: awesome return on investment (>300% value increase)
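
A small sketch of how such a mapping behaves with `pd.cut`, using the bin edges from the next cell on a few made-up values. Note that `pd.cut` closes intervals on the right by default, so a change of exactly 100% falls into group 0 and exactly 300% into group 1:

```python
import pandas as pd

# Made-up percentage changes around the bin edges
changes = pd.Series([-81.0, 99.1, 100.0, 100.1, 300.0, 1927.5])

groups = pd.cut(changes, bins=[-100, 100, 300, 10500], labels=[0, 1, 2])
print(groups.astype(int).tolist())
```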

In [775]:
cut = pd.cut(df_ml.value_change_per.values, bins= [-100, 100, 300, 10500], labels = ["0", "1", "2"])


In [472]:
df_invest_category = pd.DataFrame(data=cut.astype(int), index=df_ml.index, columns=["invest_category"])


In [473]:
df_ml_2 = df_ml.join(df_invest_category.invest_category)


In [475]:
df_ml_2.describe()

Unnamed: 0,year,minifigs,pieces,price,value_change_per,number_main,number_sub,invest_category
count,4381.0,4381.0,4381.0,4381.0,4381.0,4381.0,4381.0,4381.0
mean,2011.08,2.2,321.74,36.55,181.52,33358.76,1.02,0.68
std,6.35,2.59,454.93,42.67,300.28,108000.11,0.34,0.77
min,1996.0,0.0,3.0,1.0,-81.01,1070.0,1.0,0.0
25%,2006.0,0.0,66.0,9.99,30.23,7499.0,1.0,0.0
50%,2013.0,2.0,173.0,20.0,99.1,10707.0,1.0,0.0
75%,2016.0,3.0,401.0,49.99,231.14,60076.0,1.0,1.0
max,2020.0,32.0,5923.0,499.99,10413.86,4000026.0,12.0,2.0


In [482]:
df_ml_2.columns

Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'price', 'value_change_per', 'number_main', 'number_sub',
       'invest_category'],
      dtype='object')

In [776]:
df_ml_final = df_ml_2.drop(["value_change_per","number"], axis = "columns")

In [711]:

# For testing the approach; remove later and test with multiple text columns
df_ml_final.drop(["set_name", "subtheme"], axis = "columns", inplace = True)

In [713]:
df_ml_final

Unnamed: 0,theme,year,minifigs,pieces,price,number_main,number_sub,invest_category
0,Western,1996,2.00,28.00,4.00,6712,1,2
1,Western,1996,1.00,64.00,8.00,6716,1,2
2,Western,1996,10.00,687.00,85.00,6769,1,2
3,Town,1996,1.00,37.00,3.50,6518,1,2
4,Time Cruisers,1996,2.00,237.00,44.00,6493,1,1
...,...,...,...,...,...,...,...,...
4469,Disney,2020,1.00,157.00,29.99,43182,1,0
4470,Disney,2020,1.00,168.00,29.99,43178,1,0
4471,Disney,2020,3.00,124.00,19.99,43174,1,0
4472,Disney,2020,1.00,46.00,9.99,43170,1,0


In [714]:
df_ml_final.dtypes

theme               object
year                 int64
minifigs           float64
pieces             float64
price              float64
number_main          int64
number_sub           int64
invest_category      int64
dtype: object

## 4. Modelling

In [723]:
def tokenize(text):
    """Function for text processing, in particular it replaces urls, tokenizes and lemmatizes the words in a given text.
    INPUT
    text: text to process as str
    OUTPUT:
    tokens: list of tokenized words"""
    
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    detected_urls = re.findall(url_regex, text)
    
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
 
    # normalize case and remove punctuation using regex
    text = re.sub(r"[^a-zA-Z]", " ", text.lower()) #[^a-zA-Z0-9]
    
    # tokenize text with the tokenizer
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words (build the stop-word set once instead of once per token)
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

In [724]:
# Create Function Transformer to use Feature Union
def get_numeric_data(df):
    newdf = df.select_dtypes(include=["float64", "int64"])
    return newdf
    #[record[2:].astype(float) for record in x]

def get_text_data(df):
    newdf = df.select_dtypes(include=['object'])
    return newdf

In [753]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier


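def build_model():
    """Create a machine learning pipeline that combines numeric and text features.
    The grid search is currently commented out, so the pipeline is returned unfitted
    with default parameters.
    INPUT
    None
    OUTPUT
    Unfitted ML pipeline (FeatureUnion of numeric and text features + RandomForestClassifier)
    """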
   # pipeline = Pipeline([
   #     ('vect', CountVectorizer(tokenizer=tokenize)),
   #     ('tfidf', TfidfTransformer()),
   #     ('clf', RandomForestClassifier())])
    
    #parameters = {'vect__ngram_range': ((1, 1),(1,2)),
    #              'clf__estimator__n_estimators': [10, 50],
    #             'clf__estimator__min_samples_split': [2, 5]}
    
    #cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1, cv=2, verbose = 3)
    
    transfomer_numeric = FunctionTransformer(get_numeric_data)
    transformer_text = FunctionTransformer(get_text_data)

    # Create a pipeline to concatenate Tfidf Vector and Numeric data
    # Use RandomForestClassifier as an example
    pipeline = Pipeline([
        ('features', FeatureUnion([
                ('numeric_features', Pipeline([
                    ('selector', transfomer_numeric)
                ])),
                 ('text_features', Pipeline([
                    ('selector', transformer_text),
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                   ('tfidf', TfidfTransformer())
                     
                   # ('vec', TfidfVectorizer(analyzer='word'))
                ]))
             ])),
        ('clf', RandomForestClassifier())
    ])
    
    
    print("Done.")
    return pipeline #cv 
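
`FeatureUnion` hands the whole frame to every branch, and a `CountVectorizer` expects a 1-D sequence of strings rather than a DataFrame with several text columns. One possible alternative, sketched here on toy data and not the approach used in this notebook, is scikit-learn's `ColumnTransformer`, which routes individual columns to different transformers. The column choice and hyperparameters are illustrative assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data for illustration only
toy = pd.DataFrame({
    "set_name": ["Fort Legoredo", "Covered Wagon", "Fire Truck", "Police Truck"],
    "pieces": [687.0, 64.0, 120.0, 150.0],
    "price": [85.0, 8.0, 20.0, 25.0],
    "invest_category": [2, 2, 0, 0],
})

features = ColumnTransformer([
    # A single column name (not a list) hands the vectorizer a 1-D series of strings
    ("text", TfidfVectorizer(), "set_name"),
    # Numeric columns are passed through unchanged
    ("numeric", "passthrough", ["pieces", "price"]),
])

sketch_pipeline = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

X_toy = toy.drop(columns="invest_category")
y_toy = toy["invest_category"]
sketch_pipeline.fit(X_toy, y_toy)
toy_pred = sketch_pipeline.predict(X_toy)
print(toy_pred)
```

Several text columns could be handled the same way by adding one `("name", TfidfVectorizer(), "column")` entry per column.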

In [None]:
pipeline = Pipeline([
    ('features', FeatureUnion([

        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),

        ('starting_verb', StartingVerbExtractor())
    ])),

    ('clf', RandomForestClassifier())
])

### Numeric Only Model

In [792]:
import sklearn
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

def build_numeric_model():
    """Create a machine learning pipeline for the numeric-only model.
    INPUT
    None
    OUTPUT
    Unfitted pipeline with a RandomForestClassifier (no grid search is performed)
    """
    #lm_model = linear_model.LinearRegression(normalize=True) # Instantiate
    pipeline = Pipeline([
        ('clf', RandomForestClassifier())
    ])

    print("Done.")
    return pipeline #lm_model #cv 

In [793]:
df_numeric_test = get_numeric_data(df_ml_final)
X = df_numeric_test.drop('invest_category', axis=1) #features
y = df_numeric_test['invest_category'] #labels

In [794]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3504, 6), (877, 6), (3504,), (877,))

In [795]:
print('Building model...')
model = build_numeric_model()      
print('Training model...')
model.fit(X_train, y_train)

Building model...
Done.
Training model...


Pipeline(steps=[('clf', RandomForestClassifier())])

In [796]:
#Predict and score the model
y_pred = model.predict(X_test) 
"The r-squared score for the model using only quantitative variables was {} on {} values.".format(r2_score(y_test, y_pred), len(y_test))

'The r-squared score for the model using only quantitative variables was 0.3559512283599816 on 877 values.'

In [797]:
def display_results(cv, y_test, y_pred):
    labels = np.unique(y_pred)
    #confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
   # print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)
  #  print("\nBest Parameters:", cv.best_params_)

In [798]:
display_results(model, y_test, y_pred)

Labels: [0 1 2]
Accuracy: 0.6681870011402509
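
Since `invest_category` is a class label, accuracy and a confusion matrix are more natural summaries than r-squared, which is a regression metric. A minimal sketch with made-up labels (not the model's actual predictions), using `pd.crosstab` to tabulate true against predicted classes:

```python
import pandas as pd

# Made-up labels for illustration only
y_true = pd.Series([0, 0, 1, 1, 2, 2, 0, 1], name="actual")
y_hat = pd.Series([0, 1, 1, 1, 2, 0, 0, 2], name="predicted")

accuracy = (y_true == y_hat).mean()
confusion = pd.crosstab(y_true, y_hat)  # rows: actual class, columns: predicted class

print("Accuracy:", accuracy)
print(confusion)
```

The off-diagonal cells show which groups get confused with each other, which a single accuracy number hides.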


### Text only model

In [799]:
import sklearn
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error

def build_text_model():
    """Create a text-only machine learning pipeline (bag of words, tf-idf, random forest).
    The grid search is currently commented out.
    INPUT
    None
    OUTPUT
    Unfitted ML pipeline with default parameters
    """
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier())])
    
   # parameters = {'vect__ngram_range': ((1, 1),(1,2)),
   #               'clf__estimator__n_estimators': [10, 50],
   #              'clf__estimator__min_samples_split': [2, 5]}
    
  #  cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1, cv=2, verbose = 3)
    return pipeline # cv #pipeline

In [800]:
df_text_test = get_text_data(df_ml_final)
X = df_text_test #features
y = df_ml_final['invest_category'] #labels

In [801]:
X

Unnamed: 0,theme,subtheme,set_name
0,Western,Cowboys,Sheriff's Showdown
1,Western,Cowboys,Covered Wagon
2,Western,Cowboys,Fort Legoredo
3,Town,Coastguard,Baja Buggy
4,Time Cruisers,Assorted,Flying Time Vessel
...,...,...,...
4469,Disney,Mulan,Mulan's Training Grounds
4470,Disney,Cinderella,Cinderella's Castle Celebration
4471,Disney,Storybook Adventures,Mulan's Storybook Adventures
4472,Disney,Moana,Moana's Ocean Adventure


In [802]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state = 42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3504, 3), (877, 3), (3504,), (877,))

In [803]:
print('Building model...')
model = build_numeric_model()      
print('Training model...')
model.fit(X_train, y_train)

Building model...
Done.
Training model...


ValueError: could not convert string to float: 'Technic'
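
The error above comes from the cell calling `build_numeric_model()`, whose pipeline passes the raw strings straight to the `RandomForestClassifier`. Calling `build_text_model()` alone would likely not be enough either, because `CountVectorizer` expects one string per sample, not a frame with three text columns. A hedged sketch of one way around this on toy data, joining the text columns into a single string per row (the joining step and the use of `TfidfVectorizer` are assumptions of this sketch):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data for illustration only
toy = pd.DataFrame({
    "theme": ["Western", "Western", "Town", "Disney"],
    "subtheme": ["Cowboys", "Cowboys", "Coastguard", "Moana"],
    "set_name": ["Fort Legoredo", "Covered Wagon", "Baja Buggy", "Moana's Ocean Adventure"],
})
y_toy = pd.Series([2, 2, 2, 0])

# One document per set: theme, subtheme and set name joined by spaces
docs = toy["theme"] + " " + toy["subtheme"] + " " + toy["set_name"]

text_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
text_model.fit(docs, y_toy)
text_pred = text_model.predict(docs)
print(text_pred)
```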

## 5. Analysis and Evaluation

In [362]:
def create_record_breaker_dataset(df, col):
    # The record holder of 1991 (the first year in the data) sets the initial threshold
    df_winner_1991 = df[df.year == 1991].sort_values(col, ascending = False).head(1)
    
    mask = df[col].values >= df_winner_1991[col].values[0]
    df_winners = df[mask].sort_values(by = ["year", "launch_date"])
    
    df_winners["cummulative"] = df_winners[col].cummax()
    df_winners.drop_duplicates(subset = "cummulative", inplace = True)
    df_winners.reset_index(drop = True, inplace = True)
    
    return df_winners
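
The core idea of the function is a running maximum: a set is a record breaker if its value exceeds every earlier one. A minimal sketch of that idea on toy data; it differs slightly from the function above in that it compares each row against the previous running maximum instead of dropping duplicates:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "year": [1991, 1993, 1996, 1998, 2000],
    "pieces": [2165, 912, 2456, 2000, 2882],
}).sort_values("year")

running_max = toy["pieces"].cummax()
# A row is a record if it exceeds the running maximum of all earlier rows
# (the first row always counts because it is compared against -1)
is_record = toy["pieces"] > running_max.shift(fill_value=-1)
records = toy[is_record].reset_index(drop=True)
print(records)
```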

In [365]:
def make_h_bar_chart(df, x, y, color, text, title, labels = {}):
    """ Function uses plotly express to plot a horizontal bar chart. 
    The given text column is displayed inside the bars.
    INPUT:
    df: dataframe
    x: column name as string
    y: must be index column as df.index
    color: column name used to color the bars
    text: text column to use for the bars
    title: plot title as str
    labels: empty dict by default, labels can be changed by entering e.g. {"index":"year"}
    
    OUTPUT: 
    plotly figure as fig-object
    """
    
    fig = px.bar(df, 
             x=x, y=y, 
             color = color, orientation='h', text = text,  
             title = title, labels = labels)
    fig.update_layout(yaxis = dict(tickmode = 'array',tickvals =y, ticktext = list(df.year)))
    fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1000))
    fig.update_traces(textfont_size=12, textangle=0, textposition="inside")
    return fig

### A) Exploratory Analysis


#### How many sets were launched every year since 1991?


In [355]:
df_set_count = df_sets.groupby(by = ["year", "theme"]).count().sort_values(by = ["year","number"], ascending = (True, False)).reset_index()

In [356]:
df_set_count.head(10)

Unnamed: 0,year,theme,number,subtheme,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category
0,1991,Duplo,36,0,36,25,36,7,9,15,0,0,36,36
1,1991,Service Packs,22,8,22,0,22,15,21,6,0,0,22,22
2,1991,Town,22,22,22,19,22,11,22,22,0,0,22,22
3,1991,Space,13,13,13,11,13,4,11,13,0,0,13,13
4,1991,Trains,12,12,12,7,12,11,12,12,0,0,12,12
5,1991,Basic,11,5,11,5,11,2,5,6,0,0,11,11
6,1991,Dacta,7,2,7,3,7,0,1,2,0,0,7,7
7,1991,Technic,6,1,6,0,6,2,6,6,0,0,6,6
8,1991,Pirates,5,2,5,5,5,4,5,5,0,0,5,5
9,1991,Boats,3,0,3,3,3,3,3,3,0,0,3,3


In [357]:
fig = px.area(df_set_count, "year", "number", 
       color = "theme", title = "Number of sets launched every year",
      labels = {"number":"set count"})
fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1))
fig.show()      

#### What themes are most dominant over the years by number of sets?


In [358]:
df_temp = df_sets.groupby(by = ["year","theme"]).count().sort_values(by =["year", "number"], ascending = (True, False))
df_1 = pd.DataFrame(df_temp["number"])
df_1.reset_index(inplace = True);
df_1.head(3)

Unnamed: 0,year,theme,number
0,1991,Duplo,36
1,1991,Service Packs,22
2,1991,Town,22


In [359]:
fig = px.scatter(df_1, x = "year", y = "theme", color = "theme", size = "number",
              title = "Available themes in the period 1991-2021")
fig.update_layout(height=1750,
                  font=dict(size=9),
                  yaxis = dict(tickmode = 'linear',tick0 = 1,dtick = 1),
                  xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1),
                  showlegend = False)
fig.update_traces(mode='markers+lines', textfont_size=3)

#### What were the yearly top ten most dominant themes by number of sets?

In [360]:
df_2 = df_1.groupby(by = ["year"]).head(10).reset_index()
df_2.head(12)

Unnamed: 0,index,year,theme,number
0,0,1991,Duplo,36
1,1,1991,Service Packs,22
2,2,1991,Town,22
3,3,1991,Space,13
4,4,1991,Trains,12
5,5,1991,Basic,11
6,6,1991,Dacta,7
7,7,1991,Technic,6
8,8,1991,Pirates,5
9,9,1991,Boats,3


In [361]:
fig = px.bar(df_2, x = "number",y= "year", color = "theme", 
             orientation='h', text = "theme", 
             title = "Yearly top 10 themes by number of sets")
fig.update_layout(height=1000,showlegend = False,
                  yaxis = dict(tickmode = 'linear',tick0 = 0,dtick =1, autorange = "reversed"),
                  xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 50))
fig.update_traces(textfont_size=12, textangle=0, textposition="inside")

#### What were the record-breaking sets by piece count?

In [363]:
df_piece_count_winners = create_record_breaker_dataset(df_sets,"pieces")
df_piece_count_winners.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,eol_date,numeric_data_nan,category,cummulative
0,9452-1,Dacta,,1991,Giant LEGO topic set,,2165.0,,,,NaT,NaT,False,set,2165.0
1,9287-1,Education,Town,1996,Bonus Lego Basic Town,11.0,2456.0,,511.29,,NaT,NaT,False,set,2456.0
2,3450-1,Creator Expert,Sculptures,2000,Statue of Liberty,,2882.0,199.0,200.0,793.69,2000-11-15,2002-12-31,False,set,2882.0
3,10030-1,Star Wars,Ultimate Collector Series,2002,Imperial Star Destroyer,,3096.0,269.99,1471.58,535.0,2002-06-12,2007-12-31,False,set,3096.0
4,10143-1,Star Wars,Ultimate Collector Series,2005,Death Star II,,3441.0,269.99,2449.0,1206.66,2005-01-09,2007-12-31,False,set,3441.0


In [364]:
df_piece_count_winners["text"] = df_piece_count_winners["set_name"] +" "+ "(" + df_piece_count_winners["number"] +")"

In [366]:
fig = make_h_bar_chart(df_piece_count_winners, "pieces",df_piece_count_winners.index, "theme",
                 "text", "Record breaking sets by piece count",{"index":"year"})
fig.show()

#### Do sets have more and more pieces in general?

In [367]:
df_sets_size = df_sets.groupby(by = "year").pieces.describe().reset_index()
df_sets_size.head(5)

Unnamed: 0,year,count,mean,std,min,25%,50%,75%,max
0,1991,142.0,130.44,251.15,1.0,13.25,38.5,137.25,2165.0
1,1992,107.0,122.14,165.31,1.0,23.5,48.0,164.0,954.0
2,1993,151.0,124.65,177.41,1.0,21.5,43.0,156.5,912.0
3,1994,143.0,126.29,218.77,2.0,19.0,45.0,139.0,1343.0
4,1995,163.0,148.19,226.76,1.0,20.0,47.0,194.0,1733.0


In [368]:
fig = px.line(title = "Median and mean of the number of pieces per year")

# Add mean and median as separate traces to the empty figure
fig.add_scatter(name="mean", x=df_sets_size['year'], y=df_sets_size['mean'], mode='lines+markers')
fig.add_scatter(name="median",x=df_sets_size['year'], y=df_sets_size['50%'], mode='lines+markers')

fig.update_xaxes(title="year")

fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1))
fig.show()

In [369]:
fig = px.box(df_sets, "year", "pieces", title = "Boxplot of pieces per year")
fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1))
fig.show()

#### What words do most often come up in set names?

In [371]:
results_list = []
df_sets.set_name.apply(lambda x: results_list.extend(tokenize(x)));
print(results_list[:10])

['blue', 'baseplate', 'farm', 'basic', 'building', 'set', 'gift', 'item', 'airport', 'security']


In [372]:
results_count = Counter(results_list)
df_results_count = pd.DataFrame.from_dict(results_count, orient='index', columns = ["col"]).reset_index()
df_results_count.rename(columns = {'index':"word", "col":"occurances"}, inplace = True)
df_results_count.sort_values(by = "occurances", ascending = False, inplace = True)
df_results_count.reset_index(drop = True, inplace = True)
df_results_count.head(10)

Unnamed: 0,word,occurances
0,pack,572
1,set,496
2,brick,237
3,value,224
4,truck,208
5,bucket,186
6,lego,180
7,fire,176
8,bonus,175
9,x,164
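
The counting and sorting steps can also be done with `Counter.most_common`. A small sketch on a made-up token list:

```python
from collections import Counter

# Made-up tokens for illustration only
tokens = ["fire", "truck", "police", "truck", "fire", "truck", "pack"]

# most_common(n) returns the n most frequent items as (word, count) pairs
top_words = Counter(tokens).most_common(2)
print(top_words)
```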


In [373]:
fig = px.bar(df_results_count.head(25), "word", "occurances",
       color = "word", text = "occurances",
       title = "Top 25 most frequent words in set names")
fig.show()                     
#fig.update_layout(xaxis = dict(visible = False, matches = None))


#### Are lego sets becoming more and more expensive?

In [374]:
# numeric_only=True restricts the mean to numeric columns (required by newer pandas versions)
df_expense = df_sets_price.groupby(by = "year").mean(numeric_only = True).reset_index()
df_expense.head()

Unnamed: 0,year,minifigs,pieces,price,value_new,value_used,numeric_data_nan
0,1991,2.83,127.9,22.6,195.91,62.36,0.0
1,1992,3.03,155.44,23.37,191.67,56.31,0.0
2,1993,2.79,150.68,21.33,173.64,52.11,0.0
3,1994,2.69,167.21,28.49,179.69,59.67,0.0
4,1995,2.84,187.92,25.27,156.63,47.19,0.0


In [375]:
df_expense["price_per_piece"] = df_expense.price.values / df_expense.pieces.values
df_expense.head()

Unnamed: 0,year,minifigs,pieces,price,value_new,value_used,numeric_data_nan,price_per_piece
0,1991,2.83,127.9,22.6,195.91,62.36,0.0,0.18
1,1992,3.03,155.44,23.37,191.67,56.31,0.0,0.15
2,1993,2.79,150.68,21.33,173.64,52.11,0.0,0.14
3,1994,2.69,167.21,28.49,179.69,59.67,0.0,0.17
4,1995,2.84,187.92,25.27,156.63,47.19,0.0,0.13


In [376]:
fig = px.scatter(df_expense,"year", "price", size = "price_per_piece",
       color = "price_per_piece", #markers=True,
       title = "Yearly mean price per set 1991-2021",labels = {"price":"average price in USD", "price_per_piece": "price per piece"})
#fig.update_layout(yaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 5))
fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 1,dtick = 2))
fig.show()
#fig.update_traces(mode='markers', textfont_size=3)

Sets are getting more expensive on average, but the price per piece has decreased.
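
Note that `price_per_piece` above divides the yearly mean price by the yearly mean piece count. This is not the same as averaging each set's own price per piece, and the two can differ noticeably when small and large sets are mixed. A sketch on made-up numbers:

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "price": [10.0, 100.0, 5.0, 50.0],
    "pieces": [50.0, 1000.0, 10.0, 500.0],
})

# Variant 1: ratio of yearly means (what the cell above computes)
yearly = toy.groupby("year")[["price", "pieces"]].mean()
ratio_of_means = yearly["price"] / yearly["pieces"]

# Variant 2: mean of the per-set ratios
toy["ppp"] = toy["price"] / toy["pieces"]
mean_of_ratios = toy.groupby("year")["ppp"].mean()

print(pd.DataFrame({"ratio_of_means": ratio_of_means, "mean_of_ratios": mean_of_ratios}))
```

The ratio of means is dominated by the large sets, while the mean of ratios weights every set equally.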

#### How does the value of lego sets change after they are not produced anymore (eol)?

In [377]:
#Descriptive statistics of value change in percent:

In [378]:
px.box(df_eol, df_eol.value_change_per)

In [379]:
df_eol.value_change_per.describe()

count    4,382.00
mean       181.48
std        300.26
min        -81.01
25%         30.19
50%         99.10
75%        231.06
max     10,413.86
Name: value_change_per, dtype: float64

* The median is just below 100% (99.1%).
* The mean value change is about 181%, but it is influenced by outliers far more than the median (one outlier tops out at about 10,414%).
* The worst value depreciation is -81%.
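
How strongly a single extreme value pulls the mean away from the median can be illustrated with a few made-up percentages:

```python
from statistics import mean, median

# Made-up percentage changes for illustration only
changes = [30.0, 80.0, 100.0, 150.0, 230.0]
with_outlier = changes + [10000.0]

print(mean(changes), median(changes))
print(mean(with_outlier), median(with_outlier))
```

One added outlier multiplies the mean many times over, while the median barely moves.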



In [380]:
df_eol_top = df_eol.sort_values(by = "value_change_per")
df_eol_top["text"] = df_eol_top["set_name"] +" "+ "(" + df_eol_top["number"] +")"
df_eol_top_25 = df_eol_top.tail(25).reset_index(drop = True)
df_eol_top_25.head(10)

In [384]:
fig = make_h_bar_chart(df_eol_top_25, "value_change_per",df_eol_top_25.index, "theme",
                 "text", "Top 25 value increase sets after eol",
                       {"index":"sets with set number and year of launch","value_change_per":"value change in %"})
fig.update_layout(height=999)
fig.show()

#### What themes do best after eol?

In [385]:
df_eol.groupby(by = "theme").value_change_per.describe().sort_values(by = "mean", ascending = False).head(20)

| theme | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Avatar The Last Airbender | 2.0 | 1611.07 | 828.15 | 1025.48 | 1318.28 | 1611.07 | 1903.87 | 2196.67 |
| Studios | 16.0 | 1102.75 | 2527.36 | -13.02 | 212.5 | 344.63 | 727.62 | 10413.86 |
| Western | 12.0 | 1069.21 | 509.85 | 370.59 | 711.63 | 1032.99 | 1392.5 | 1927.5 |
| Batman | 13.0 | 796.85 | 330.14 | 347.2 | 610.73 | 706.5 | 900.0 | 1543.54 |
| Vikings | 7.0 | 677.14 | 282.31 | 232.01 | 521.54 | 670.34 | 863.17 | 1068.2 |
| Adventurers | 40.0 | 587.49 | 446.22 | 116.75 | 250.52 | 485.67 | 728.14 | 2147.5 |
| Rock Raiders | 6.0 | 574.86 | 203.51 | 276.54 | 518.46 | 546.62 | 641.65 | 897.0 |
| Aqua Raiders | 1.0 | 560.06 | NaN | 560.06 | 560.06 | 560.06 | 560.06 | 560.06 |
| SpongeBob SquarePants | 14.0 | 556.23 | 469.75 | 206.2 | 283.81 | 440.21 | 531.35 | 1905.3 |
| Bionicle | 234.0 | 527.28 | 450.67 | -16.52 | 246.34 | 410.01 | 677.49 | 3040.71 |


The top 20 include a number of classic lego themes such as Western, Vikings and Adventurers, but also some licensed themes such as Batman, Spider-Man, Indiana Jones, Super Mario or Harry Potter.

In [386]:
df_eol.groupby(by = "theme").value_change_per.describe().sort_values(by = "mean", ascending = False).tail(10)

| theme | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Dimensions | 65.0 | 4.9 | 64.07 | -81.01 | -36.49 | -12.14 | 31.41 | 219.93 |
| Hidden Side | 23.0 | 2.18 | 22.2 | -22.64 | -11.99 | -2.99 | 9.5 | 75.76 |
| The Powerpuff Girls | 2.0 | -7.66 | 16.83 | -19.56 | -13.61 | -7.66 | -1.71 | 4.23 |
| Master Builder Academy | 1.0 | -8.3 | NaN | -8.3 | -8.3 | -8.3 | -8.3 | -8.3 |
| Mindstorms | 5.0 | -10.24 | 26.9 | -44.5 | -28.0 | -10.71 | 12.48 | 19.55 |
| The LEGO Movie 2 | 27.0 | -13.6 | 34.67 | -50.21 | -35.24 | -21.94 | -8.4 | 86.09 |
| Dots | 2.0 | -13.84 | 3.63 | -16.41 | -15.13 | -13.84 | -12.56 | -11.27 |
| Unikitty | 20.0 | -27.29 | 18.29 | -46.87 | -37.6 | -33.18 | -21.87 | 21.56 |
| Trolls World Tour | 4.0 | -27.65 | 20.46 | -43.72 | -42.79 | -33.46 | -18.32 | 0.05 |
| Fusion | 4.0 | -31.42 | 28.0 | -65.48 | -42.13 | -31.54 | -20.82 | 2.89 |


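The worst themes by mean value change are Fusion (-31%), Trolls World Tour (-28%) and Unikitty (-27%). Dimensions and Hidden Side also fall into the bottom 10, although their means are still slightly positive; Dimensions contains the single worst set at -81%.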
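Several themes in these tables contain only one or two sets, so their means say little about the theme as a whole. One possible way to get a more robust ranking is to require a minimum number of sets per theme and to sort by the median instead of the mean. The sketch below uses made-up toy data; the threshold `MIN_SETS` is an assumption, not a value used elsewhere in this notebook.

```python
import pandas as pd

# Toy data with hypothetical values; the column names mirror df_eol.
toy = pd.DataFrame({
    "theme": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "value_change_per": [10.0, 20.0, 30.0, 40.0, 900.0, 1100.0, -5.0, 0.0, 5.0],
})

MIN_SETS = 3  # assumed threshold, adjust as needed

# Count, median and mean of the value change per theme.
stats = toy.groupby("theme")["value_change_per"].agg(
    count="count", median="median", mean="mean"
)

# Keep only themes with enough sets, then rank them by median.
ranked = stats[stats["count"] >= MIN_SETS].sort_values("median", ascending=False)
print(ranked)
```

Theme "B" has by far the highest mean in the toy data but is dropped because it contains only two sets, much like Avatar The Last Airbender in the real table.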

In [387]:
fig = px.box(df_eol, "value_change_per", "theme", orientation = "h",
       color = "theme", labels = {"value_change_per":"change of value in percent"}, 
      title = "Value change box plot per lego theme")

fig.update_layout(height=1500)

fig.show()

### Questions of interest

**A) Exploratory Analysis**
* **Number of launched sets per year?**
* **What themes are most dominant over the years?**
* **What sets were record breakers in terms of piece count?**
* **What sets were record breakers in terms of number of minifigs?**
* **What words do most often come up in set names?**
* **Are lego sets becoming more and more expensive?**
* **How does the value of lego sets change after they are not produced anymore (eol)?**
* **What themes do best after eol?** 

*eol = lego term for end-of-life, meaning the date after which the set is no longer produced

**B) Predictive Analysis**
* What features of the data set are good predictors that a set will rise in value after eol?
* What do the words contained in the set names tell us about the rise of value after eol?
* What sets that are currently being sold can I predict to be a good investment after eol?* 

*e.g. price increase of at least 10 usd (for package and shipment when selling) + at least 25% profit
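The investment criterion in the footnote can be turned into a boolean target column. The sketch below shows one possible reading of it: the current value must exceed the retail price by 10 USD of selling costs plus 25% of the retail price. The function name, the constants and the columns `rrp_usd` and `value_usd` are assumptions made for illustration.

```python
import pandas as pd

MIN_GAIN_USD = 10.0    # assumed flat cost for packaging and shipping
MIN_PROFIT_PCT = 25.0  # assumed minimum profit relative to the retail price


def is_good_investment(rrp_usd: pd.Series, value_usd: pd.Series) -> pd.Series:
    """Return True where the value covers retail price, selling costs and profit."""
    threshold = rrp_usd * (1 + MIN_PROFIT_PCT / 100) + MIN_GAIN_USD
    return value_usd >= threshold


# Toy data with hypothetical prices.
toy = pd.DataFrame({
    "rrp_usd": [20.0, 100.0, 40.0],
    "value_usd": [30.0, 140.0, 60.0],
})
toy["good_investment"] = is_good_investment(toy["rrp_usd"], toy["value_usd"])
print(toy)
```

A stricter or looser reading (for example, treating the two conditions as independent checks) would only change the `threshold` line.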

### B) Predictive Analysis

## A. Appendix

In [388]:
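# plt and sns are needed here; their imports are commented out in section 2.1
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix without the year and the non-USD price columns
corr = df.drop(["year","rrp_gbp","rrp_eur"], axis = "columns").corr()
# Mask the upper triangle so that each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype = bool))

f, ax = plt.subplots(figsize = (20,20))

cmap = sns.diverging_palette(200, 20, as_cmap = True)
sns.heatmap(corr, mask = mask, cmap = cmap, vmax = 1, center = 0, square = True, linewidths = 4, cbar_kws = {"shrink":.5})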

#### What sets were record breakers in terms of number of minifigures?


In [None]:
df_minifig_winners = create_record_breaker_dataset(df_sets,"minifigs")
df_minifig_winners.head()

In [None]:
fig = make_h_bar_chart(df_minifig_winners, "minifigs", df_minifig_winners.index,
                 "theme", "set_name", 
                 "Record breaking sets by no of minifigures")

fig.show()