# Lego Analysis

Author: M. Tosic

Date: 01.2022

This notebook is part of my capstone project for a data science course. The project is independent and has no connection to the company LEGO.

# 1. Business Understanding

### Questions of interest

**A) Exploratory Analysis**
* **What themes are most dominant over the years?**
* **What sets were record breakers in terms of piece count?**
* **What sets were record breakers in terms of number of minifigs?**
* **What words come up most often in set names?**
* Are lego sets becoming more and more expensive?
* How does the retail price relate to the piece count?
* Does the value of sets go up after eol on average?
* What sets do best after eol? (eol = lego term for end-of-life, meaning the date after which the set is no longer produced)

**B) Predictive Analysis**
* What features of the data set are good predictors that a set will rise in value after eol?
* What do the words contained in the set names tell us about the rise in value after eol?
* What sets that are currently being sold can I predict to be a good investment after eol?*

*e.g. price increase of at least 10 usd (for package and shipment when selling) + at least 25% profit
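
The footnote above can be read in more than one way. The sketch below assumes one possible reading: the value after eol has to cover the retail price, a flat 10 usd for package and shipment, and a profit of 25% of the retail price. The function name `is_good_investment` and its parameters are hypothetical and not part of the notebook; the two example price pairs are taken from the first rows of the data shown later.

```python
def is_good_investment(price, value_new, selling_cost=10.0, min_profit_share=0.25):
    """Hypothetical helper: True if value_new covers price + selling cost + minimum profit.

    Assumption: the 25% profit is measured relative to the retail price.
    """
    if price is None or value_new is None or price <= 0:
        return False
    required_value = price + selling_cost + min_profit_share * price
    return value_new >= required_value


# Airport Security Squad: retail price 10.0 usd, value new 165.9 usd
print(is_good_investment(10.0, 165.9))
# Blue baseplate: retail price 5.5 usd, value new 6.26 usd
print(is_good_investment(5.5, 6.26))
```

Under this reading the first set qualifies and the second does not, since its small price increase does not even cover the assumed selling cost.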

# 2. Data Understanding

The data used in this notebook was downloaded from the following sources:

* https://brickset.com/
* https://rebrickable.com/downloads/

Simplifications:
* No time series data on the price averages is available. Assumption: price changes average out over time after eol, i.e. the price curves are already in a steady state.
* No data is available on unique minifigs in sets (minifigs are popular with collectors who focus on them and are generally believed to drive up the prices of some sets after eol).

### Import Libraries

In [205]:
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_rows', 100) #pd.set_option('display.max_rows', None)

# import necessary libraries for batch import csv:
import os
import glob

#for counting elements in a list:
from collections import Counter

#needed for text processing:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

import re

from datetime import datetime


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/michaeltosic/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/michaeltosic/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michaeltosic/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Import Data

In [2]:
#df_sets = pd.read_csv('data/rebrickable-sets.csv')
#df_themes = pd.read_csv('data/rebrickable-themes.csv')

In [407]:
def import_csv_with_date_column(filename, date_col_name, skiprows_val = 0):
    # Read the csv, strip whitespace from the column names, then parse the date column.
    # (The date_parser argument of pd.read_csv is deprecated in newer pandas versions.)
    df = pd.read_csv(filename, skiprows = skiprows_val)
    df.rename(columns=lambda x: x.strip(), inplace = True)
    df[date_col_name] = pd.to_datetime(df[date_col_name], format='%Y-%m-%d')
    return df

def slice_2_date_range(df, date_col,start_date, end_date):
    # keep rows strictly after start_date and up to and including end_date
    mask = (df[date_col] > start_date) & (df[date_col] <= end_date)
    df = df[mask].reset_index(drop = True)
    return df

In [414]:
df_gdp_usd = import_csv_with_date_column("data/exchange-rate-historical-chart_pound-dollar.csv",  "date", 15)
df_gdp_usd.dtypes, df_gdp_usd.head(3)

(date     datetime64[ns]
 value           float64
 dtype: object,
         date  value
 0 1971-01-04   2.39
 1 1971-01-05   2.39
 2 1971-01-06   2.40)

In [415]:
df_eur_usd = import_csv_with_date_column("data/exchange-rate-historical-chart_euro-dollar.csv",  "date", 15)
df_eur_usd.dtypes, df_eur_usd.head(3)

(date     datetime64[ns]
 value           float64
 dtype: object,
         date  value
 0 1999-01-04   1.18
 1 1999-01-05   1.18
 2 1999-01-06   1.16)

In [416]:
df_gdp_usd = slice_2_date_range(df_gdp_usd, "date", "1991-01-01", "2021-12-31")
df_eur_usd = slice_2_date_range(df_eur_usd, "date", "1991-01-01", "2021-12-31")

df_gdp_usd.head(), df_eur_usd.head()

(        date  value
 0 1991-01-02   1.94
 1 1991-01-03   1.95
 2 1991-01-04   1.93
 3 1991-01-07   1.91
 4 1991-01-08   1.91,
         date  value
 0 1999-01-04   1.18
 1 1999-01-05   1.18
 2 1999-01-06   1.16
 3 1999-01-07   1.17
 4 1999-01-08   1.16)

In [417]:
df_gdp_usd.columns

Index(['date', 'value'], dtype='object')

In [418]:
rate_gdp_usd = df_gdp_usd["value"].mean()
print(rate_gdp_usd)
rate_eur_usd = df_eur_usd["value"].mean()
print(rate_eur_usd)

1.5725745846764236
1.1970355369182681
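
The mean rates printed above could be used to fill gaps in the usd price from the gbp and eur prices. The following is only a sketch under assumptions: the rates are rounded from the output above, the frame `toy` and the column `rrp_usd_filled` are made up for illustration, and the fill order (gbp first, then eur) is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Mean rates as printed above, rounded (variable names follow the notebook)
rate_gdp_usd = 1.57
rate_eur_usd = 1.20

# Made-up rows standing in for the Brickset price columns
toy = pd.DataFrame({
    "rrp_gbp": [np.nan, 35.97, np.nan, np.nan],
    "rrp_usd": [5.50, np.nan, np.nan, np.nan],
    "rrp_eur": [np.nan, 43.97, 3.99, np.nan],
})

# Keep existing usd prices, otherwise convert from gbp, otherwise from eur
toy["rrp_usd_filled"] = (
    toy["rrp_usd"]
    .fillna(toy["rrp_gbp"] * rate_gdp_usd)
    .fillna(toy["rrp_eur"] * rate_eur_usd)
)
print(toy)
```

A single mean over 1991 to 2021 ignores how much the rates moved over that period, so a mean per release year might give closer estimates. Rows without any price stay NaN.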


In [3]:
def import_multiple_csv_files_2_df (relative_path):
    """ Function uses os and glob packages to import multiple csv files into one dataframe. 
    The current working directory should be the one where this notebook is located.
    INPUT: 
    Relative path to the files e.g. "./data/Brickset*.csv"
    OUTPUT: 
    One dataframe containing all csv files concatenated together over axis = 0.
    """
    path = os.getcwd()
    files = glob.glob(os.path.join(path, relative_path))
    
    print('Glob search with parameters:', relative_path)
   # print('Ingested files:')
    li = []
    for file in files:
        df_temp = pd.read_csv(file, index_col = None, header = 0)
        li.append(df_temp)
        #print(file)
    try:    
        df = pd.concat(li, axis=0, ignore_index=True)
        print('Done.')

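    except ValueError:
        # pd.concat raises a ValueError if the list is empty, i.e. no file matched the pattern
        print('Something went wrong with the concatenation of the files, returning None. Is the relative_path correctly set?')
        return None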
    
    return (df)

In [436]:
df = import_multiple_csv_files_2_df("./data/Brickset*.csv")

Glob search with parameters: ./data/Brickset*.csv
Done.


**Dropping unnecessary columns:**

In [437]:
df.drop(['Qty owned','UPC','Qty owned new', 
         'Qty owned used', 'EAN','Priority','Wanted', 'Height', 'Depth', 'Weight', 'Width', 
         'Notes','Qty wanted','RRP (CAD)','Flag 1 not used', 'Flag 2 not used', 'Flag 3 not used',
         'Flag 4 not used', 'Flag 5 not used', 'Flag 6 not used','Flag 7 not used', 'Flag 8 not used'], axis=1, inplace=True)

**Adapting column names to allow dot notation and more intuitive code (e.g. price instead of rrp):**

In [438]:
df.rename(columns = lambda x : x.replace(' ', '_').replace('(','').replace(')','').lower().strip(), inplace = True)
df.columns

Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'rrp_gbp', 'rrp_usd', 'rrp_eur', 'value_new_usd', 'value_used_usd',
       'launch_date', 'exit_date'],
      dtype='object')

In [439]:
df.rename(columns={'rrp_usd': 'price', 'value_new_usd': 'value_new', 'value_used_usd':'value_used'}, inplace = True)
df.columns

Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'rrp_gbp', 'price', 'rrp_eur', 'value_new', 'value_used', 'launch_date',
       'exit_date'],
      dtype='object')

In [440]:
df.sort_values(["year","launch_date"], inplace = True)

In [441]:
#Parse dates
#df['launch_date'] = pd.to_datetime(df['launch_date'])
#df['exit_date'] = pd.to_datetime(df['exit_date'])

In [442]:
df.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,,
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,,,,
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,,,24.64,,,
3,1475-1,Town,Flight,1991,Airport Security Squad,2.0,123.0,,10.0,,165.9,49.8,,
4,1476-1,Assorted,Bonus/Value Pack,1991,Five Set Bonus Pack,,158.0,,,,450.0,100.0,,


### Exploring Content

**Checking summary statistics and types per column:**

In [443]:
df.describe()

Unnamed: 0,year,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used
count,15634.0,7171.0,12118.0,8172.0,10234.0,3900.0,10541.0,8751.0
mean,2010.44,2.67,233.25,26.55,29.97,38.66,79.04,41.11
std,8.05,2.79,470.34,39.71,44.52,56.64,213.08,75.64
min,1991.0,1.0,0.0,0.0,0.0,0.01,0.0,0.25
25%,2004.0,1.0,24.0,5.99,6.99,9.99,11.05,6.57
50%,2012.0,2.0,75.0,14.99,15.0,19.99,28.98,16.23
75%,2017.0,3.0,251.0,29.99,34.99,44.95,74.89,43.35
max,2022.0,33.0,11695.0,699.99,799.99,799.99,9773.99,1391.39


In [444]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15634 entries, 0 to 4106
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   number       15634 non-null  object 
 1   theme        15634 non-null  object 
 2   subtheme     12655 non-null  object 
 3   year         15634 non-null  int64  
 4   set_name     15634 non-null  object 
 5   minifigs     7171 non-null   float64
 6   pieces       12118 non-null  float64
 7   rrp_gbp      8172 non-null   float64
 8   price        10234 non-null  float64
 9   rrp_eur      3900 non-null   float64
 10  value_new    10541 non-null  float64
 11  value_used   8751 non-null   float64
 12  launch_date  6624 non-null   object 
 13  exit_date    6624 non-null   object 
dtypes: float64(7), int64(1), object(6)
memory usage: 1.8+ MB


**Check if there are duplicated rows:**

In [445]:
df[df.duplicated()]

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date


**Unique values per column:**

In [446]:
df.nunique().sort_values(ascending = True)

minifigs          31
year              32
theme            141
rrp_eur          143
exit_date        165
rrp_gbp          270
price            322
launch_date      404
subtheme         801
pieces          1332
value_used      4522
value_new       6178
set_name       13328
number         15634
dtype: int64

**Investigate missing values in data set:**

In [447]:
print("Percentages of missing values:\n{}".format(df.isnull().sum()/df.shape[0]*100))

Percentages of missing values:
number         0.00
theme          0.00
subtheme      19.05
year           0.00
set_name       0.00
minifigs      54.13
pieces        22.49
rrp_gbp       47.73
price         34.54
rrp_eur       75.05
value_new     32.58
value_used    44.03
launch_date   57.63
exit_date     57.63
dtype: float64


In [448]:
df_missing_val_per = pd.DataFrame(df.isnull().sum()/df.shape[0]*100, columns=['value'])
df_missing_val_per_sorted = df_missing_val_per.sort_values(by = "value", ascending = False)

In [449]:
px.bar(df_missing_val_per_sorted, 
       x = df_missing_val_per_sorted.index, 
       y = "value", 
       labels = {"value":"percentage of missing values"})

**Comments:**
* There are NaN values in most columns.
* Most values are missing in rrp_eur, but this is ok since the analysis will be done in usd (value_new and value_used are also in usd). The available rrp_eur values can be used to fill in missing data in the usd column.
* More than half of the items don't have a launch and exit date.
* The missing values for minifigs could simply mean that the items are lego sets without any minifigures, or that they are other lego merchandise.

**Tasks:**
* Almost a quarter of the items (22.49%) are missing piece counts. This must be investigated since it could indicate that the item is not a lego set but some other kind of merchandise from the database. I will aim to categorize the items into sets and other merchandise. A possible way to do this is to use pieces > 0 or minifigs > 0.

* Most prices are available in usd, and value_new and value_used are also in usd. If possible, I will calculate missing usd prices from the columns in other currencies and then drop those columns to reduce complexity for further processing (one currency is enough for the intended analysis).

* Some dates are also missing, I'll take a look at that. Sets from 2022 have probably not yet been released, so I will label them as not released. Items that have a launch date but no exit date will be labeled as active, and items that have an exit date will be labeled eol (popular lego term "end-of-life" for items that are no longer produced).
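
The labeling rule from the last task could be sketched with `np.select`, which takes the first matching condition per row. This is a sketch under assumptions: the frame `toy`, the column name `status`, and the fallback label "unknown" for items without any dates are made up for illustration and not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows covering the four cases
toy = pd.DataFrame({
    "year": [2004, 2021, 2022, 1991],
    "launch_date": ["01/10/2004", "01/08/2021", None, None],
    "exit_date": ["31/12/2006", None, None, None],
})

# Conditions are checked in order, the first match wins
conditions = [
    toy["year"] >= 2022,          # probably not yet released
    toy["exit_date"].notna(),     # has an exit date
    toy["launch_date"].notna(),   # launch date but no exit date
]
labels = ["not released", "eol", "active"]

# Assumption: items without any date information get the label "unknown"
toy["status"] = np.select(conditions, labels, default="unknown")
print(toy)
```

Note that some exit dates in the data lie in the future (e.g. 31/12/2023), so comparing the parsed exit date against a reference date may be needed before calling an item eol.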

**Make box-plots of all columns with numerical values:**

In [450]:
def make_plots_of_num_cols(df):
    for col in df.columns:
        if df[col].dtype == np.int64 or df[col].dtype == np.float64:
            print(col)
            fig = px.box(df, x = col, points="all")
            fig.update_yaxes(visible = False, showticklabels = False)
            fig.show()
        else:
            continue
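
As a possible alternative to checking each dtype by hand, pandas can select the numeric columns directly with `select_dtypes`. A minimal sketch on a made-up frame (`toy` is a hypothetical name, the values come from the first rows shown earlier):

```python
import pandas as pd

toy = pd.DataFrame({
    "theme": ["Town", "Basic"],
    "year": [1991, 1991],
    "price": [10.0, 5.5],
})

# Keep only integer and float columns, whatever their exact width
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['year', 'price']
```

This also picks up numeric columns that are not exactly `int64` or `float64`, which the explicit dtype comparison would skip.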

In [451]:
#make_plots_of_num_cols(df)

**Comments**
* Year: The median is 2012, meaning that half of the items in the data set, which covers the past 30 years, were released in 2012 or later.
* Minifigs: The median is only 2. This should be investigated in more detail since the data set includes lego items that are not necessarily sets but other merchandise.
* Pieces: The same goes for the rather low median piece count (75).

**Task**
* Categorize entries as sets versus other merchandise.
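
One way this task could be approached is to derive a category from the piece and minifig counts. The sketch below is an assumption about how such a rule might look, not the notebook's final logic: the frame `toy` is made up (the three items appear in the data), and the label "minifigure" for items with minifigs but no piece count is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "set_name": ["Airport Security Squad", "Paragoomba", "LEGO Loco"],
    "pieces": [123.0, np.nan, np.nan],
    "minifigs": [2.0, 1.0, np.nan],
})

# NaN > 0 evaluates to False, so missing counts fall through to the next rule
conditions = [toy["pieces"] > 0, toy["minifigs"] > 0]
toy["category"] = np.select(conditions, ["set", "minifigure"], default="uncategorized")
print(toy)
```

Items that match neither rule stay "uncategorized" and still need a manual look per theme.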

# 3. Prepare Data

### Removing rows where there is no numeric data

In [452]:
cond = df[['minifigs','pieces',
       "rrp_gbp", "rrp_eur", "price", 
       "value_new", "value_used", 
       "launch_date", "exit_date"]].isnull().values.all(axis=1)
df['numeric_data_nan'] = np.where(cond, True, False)
df[df["numeric_data_nan"] == True].head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan
142,BK15SPR1991-1,Books,Brick Kicks,1991,BRICK KICKS Spring 1991,,,,,,,,,,True
143,BK16SUM1991-1,Books,Brick Kicks,1991,BRICK KICKS Summer 1991,,,,,,,,,,True
144,BK17FAL1991-1,Books,Brick Kicks,1991,BRICK KICKS Fall 1991,,,,,,,,,,True
145,BK18WIN1991-1,Books,Brick Kicks,1991,BRICK KICKS Winter 1991 - 1992,,,,,,,,,,True
253,BK19SPR1992-1,Books,Brick Kicks,1992,BRICK KICKS Spring 1992,,,,,,,,,,True


In [453]:
# .copy() avoids a SettingWithCopyWarning when new columns are added to df later
df = df[df["numeric_data_nan"] == False].copy()

In [454]:
df.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,,,False
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,,,,,False
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,,,24.64,,,,False
3,1475-1,Town,Flight,1991,Airport Security Squad,2.0,123.0,,10.0,,165.9,49.8,,,False
4,1476-1,Assorted,Bonus/Value Pack,1991,Five Set Bonus Pack,,158.0,,,,450.0,100.0,,,False


### Categorization of items into sets and minifigures while removing items such as gear, books, etc.

In [455]:
#Over our long history, we’ve made loads of unique sets, many with similar names. We use numbers as a quick and convenient way to instantly identify any LEGO set. Numbers on the first sets we made were three digits long, but as we made more and more sets, we started using longer numbers. Currently, set numbers are five to seven digits long and are featured prominently on the box and instructions for the set.

In [456]:
df[['number_main','number_sub']] = df['number'].str.split('-',expand=True)

In [457]:
df.number_main

0                     819
1                    1040
2                    1474
3                    1475
4                    1476
              ...        
4094              5007182
4095              5007183
4096              5007184
4097              5007185
4102    ISBN9780744054576
Name: number_main, Length: 14473, dtype: object

In [458]:
df.drop(df.index[df["number_main"].apply(lambda x: not (x.isnumeric()))], axis=0, inplace=True)

In [459]:
df.drop(df.index[df["number_sub"].apply(lambda x: not (x.isnumeric()))], axis=0, inplace=True)

In [460]:
df.drop(['number_main','number_sub'], axis = "columns", inplace = True)
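
The split-and-drop steps above could possibly be condensed into one vectorized mask with a regular expression. This is a sketch on made-up data, assuming that a valid set number has the form digits, hyphen, digits. The frame `toy` and the value "ABC" are hypothetical; the other numbers appear in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "number": ["819-1", "8683-18", "BK15SPR1991-1", "ABC", "5007182-1"],
})

# True only if the whole string is <digits>-<digits>; missing values count as False
is_numeric_number = toy["number"].str.fullmatch(r"\d+-\d+", na=False)
toy_sets = toy[is_numeric_number].copy()
print(toy_sets)
```

Unlike the `apply` version, this does not fail on numbers without a hyphen, because no helper columns have to be created first.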

**Adding a category column with the value "set" for all items with more than 0 pieces:**

In [461]:
df['category'] = np.where(df['pieces'] > 0, "set", "uncategorized")
df[df.category == "uncategorized"].describe()

Unnamed: 0,year,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used
count,2099.0,102.0,6.0,1385.0,1599.0,703.0,104.0,47.0
mean,2013.4,2.28,0.0,13.14,18.44,14.74,105.31,33.08
std,5.44,3.28,0.0,21.54,23.99,26.17,444.3,92.03
min,1997.0,1.0,0.0,1.95,0.0,1.99,1.36,2.73
25%,2009.0,1.0,0.0,3.99,4.99,4.99,4.88,4.0
50%,2014.0,1.0,0.0,6.99,12.99,6.99,7.35,5.04
75%,2018.0,3.0,0.0,14.65,24.99,14.99,40.93,14.24
max,2022.0,24.0,0.0,274.99,299.99,304.99,3432.84,499.0


**Dealing with uncategorized items:**

In [462]:
uncat_themes = set(df[(df.category == "uncategorized")].theme)
print(uncat_themes)

{'Clikits', 'Collectable Minifigures', 'Friends', 'Legends of Chima', 'Disney', 'Star Wars', 'Gear', 'The LEGO Movie 2', 'Ninjago', 'Sports', 'City', 'Education', 'BrickHeadz', 'Seasonal', 'Duplo', 'Marvel Super Heroes', 'Creator Expert', 'DC Comics Super Heroes', 'Make and Create', 'Power Miners', 'Technic', 'Promotional', 'Books', 'Miscellaneous', 'Vidiyo', 'Unikitty', 'Super Mario'}


In [463]:
for theme in uncat_themes:
    cond_2 = (df.theme == theme) & (df.category == "uncategorized")
    print("Theme:", theme)
    print("Number of rows:", df[cond_2].shape[0])
    display(df[cond_2])

Theme: Clikits
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
14672,7575-1,Clikits,Seasonal,2004,Clikits Advent Calendar,,,11.99,15.0,,18.75,,01/10/2004,31/12/2006,False,uncategorized


Theme: Collectable Minifigures
Number of rows: 68


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
14162,8683-0,Collectable Minifigures,Series 1,2010,LEGO Minifigures - Series 1 {Random bag},,,1.99,1.99,,,,01/05/2010,31/12/2010,False,uncategorized
14180,8683-18,Collectable Minifigures,Series 1,2010,LEGO Minifigures - Series 1 - Sealed Box,,,119.4,,,,,01/05/2010,31/12/2010,False,uncategorized
14181,8684-0,Collectable Minifigures,Series 2,2010,LEGO Minifigures - Series 2 {Random bag},,,1.99,1.99,,,,01/09/2010,31/12/2010,False,uncategorized
14199,8684-18,Collectable Minifigures,Series 2,2010,LEGO Minifigures - Series 2 - Sealed Box,,,,,,,,01/09/2010,31/12/2010,False,uncategorized
13594,8803-0,Collectable Minifigures,Series 3,2011,LEGO Minifigures - Series 3 {Random bag},,,1.99,2.99,,,,01/01/2011,31/05/2011,False,uncategorized
13612,8803-18,Collectable Minifigures,Series 3,2011,LEGO Minifigures - Series 3 - Sealed Box,,,119.4,,,,,01/01/2011,31/05/2011,False,uncategorized
13613,8804-0,Collectable Minifigures,Series 4,2011,LEGO Minifigures - Series 4 {Random bag},,,1.99,2.99,,,,01/04/2011,31/08/2011,False,uncategorized
13631,8804-18,Collectable Minifigures,Series 4,2011,LEGO Minifigures - Series 4 - Sealed Box,,,119.4,,,,,01/04/2011,31/08/2011,False,uncategorized
13632,8805-0,Collectable Minifigures,Series 5,2011,LEGO Minifigures - Series 5 {Random bag},,,1.99,2.99,,,,01/09/2011,31/12/2011,False,uncategorized
13650,8805-18,Collectable Minifigures,Series 5,2011,LEGO Minifigures - Series 5 - Sealed Box,,,119.4,,,,,01/09/2011,31/12/2011,False,uncategorized


Theme: Friends
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
4805,5005553-1,Friends,Product Collection,2018,LEGO Friends Easter Bundle,,,35.97,,43.97,,,,,False,uncategorized
1585,66673-1,Friends,Product Collection,2021,Animal Gift Set,,,,,,,,01/11/2021,31/12/2021,False,uncategorized


Theme: Legends of Chima
Number of rows: 19


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
6617,391214-1,Legends of Chima,Magazine gift,2014,Speedorz Ramp,,,,,,2.37,,,,False,uncategorized
6619,391404-1,Legends of Chima,Magazine gift,2014,Worriz,,,,,,6.22,,,,False,uncategorized
6620,391405-1,Legends of Chima,Magazine gift,2014,Crocodile Hideout,,,,,,1.63,,,,False,uncategorized
6621,391406-1,Legends of Chima,Magazine gift,2014,Crug minifigure with armour and sword,,,,,,2.5,,,,False,uncategorized
6622,391407-1,Legends of Chima,Magazine gift,2014,Fire spinner and ramp,,,,,,3.96,,,,False,uncategorized
6623,391408-1,Legends of Chima,Magazine gift,2014,Vornon,,,,,,2.54,3.75,,,False,uncategorized
6624,391409-1,Legends of Chima,Magazine gift,2014,Ice Prison,,,,,,1.36,,,,False,uncategorized
6625,391410-1,Legends of Chima,Magazine gift,2014,Sykor,,,,,,3.81,,,,False,uncategorized
6626,391411-1,Legends of Chima,Magazine gift,2014,Shooter,,,,,,1.84,,,,False,uncategorized
6627,391412-1,Legends of Chima,Magazine gift,2014,Worriz,,,,,,2.4,,,,False,uncategorized


Theme: Disney
Number of rows: 4


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
1834,302102-1,Disney,Magazine Gift,2021,Rapunzel & Hairbrush,,,,,,3.66,,,,False,uncategorized
1835,302103-1,Disney,Magazine Gift,2021,Cinderella's Kitchen,,,,,,3.39,,,,False,uncategorized
1837,302105-1,Disney,Magazine Gift,2021,"Lumiere, Cogsworth and Sultan",,,,,,3.9,,,,False,uncategorized
1838,302106-1,Disney,Magazine Gift,2021,Princess Ariel,,,,,,4.62,,,,False,uncategorized


Theme: Star Wars
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
1586,66674-1,Star Wars,Product Collection,2021,Skywalker Adventures Pack,,,,,,66.58,,01/11/2021,31/12/2021,False,uncategorized


Theme: Gear
Number of rows: 1873


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
678,9708-1,Gear,Education,1997,Intelligent House Activity Pack,,,,12.00,,,,,,False,uncategorized
15096,3978-1,Gear,Key Chains/Castle,1998,Magic Wizard Key Chain,,,,3.00,,,,,,False,uncategorized
15154,5701-1,Gear,Video Games/PC,1998,LEGO Loco,,,,10.00,,,,,,False,uncategorized
15155,5702-1,Gear,Video Games/PC,1998,LEGO Chess,,,,10.00,,,,,,False,uncategorized
15515,5703-1,Gear,Video Games/Nintendo 64,1999,LEGO Racers,,,,20.00,,,,,,False,uncategorized
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4093,5007181-1,Gear,Housewares,2022,Fire Chief 46 in x 60 in Throw,,,,19.99,,,,,,False,uncategorized
4094,5007182-1,Gear,Housewares,2022,City Town Map 46 in x 60 in Throw,,,,19.99,,,,,,False,uncategorized
4095,5007183-1,Gear,Housewares,2022,City Police 46 in x 60 in Throw,,,,19.99,,,,,,False,uncategorized
4096,5007184-1,Gear,Housewares,2022,Butterfly 46 in x 60 in Throw,,,,19.99,,,,,,False,uncategorized


Theme: The LEGO Movie 2
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
5435,471906-1,The LEGO Movie 2,Magazine Gift,2019,Rex with Jetpack,1.0,,,,,3.19,,,,False,uncategorized
5582,5005738-1,The LEGO Movie 2,,2019,Sticker roll,,,3.99,3.99,3.99,,,,,False,uncategorized


Theme: Ninjago
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
4804,5005552-1,Ninjago,Product Collection,2018,LEGO NINJAGO Easter Bundle,,,42.95,,44.95,,,,,False,uncategorized


Theme: Sports
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
5896,3406-2,Sports,Football,2000,French Team Bus,,,,,,,,01/04/2000,30/06/2002,False,uncategorized


Theme: City
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
10130,66540-1,City,Volcano Explorers,2016,City Volcano Value Pack,,,,,,88.23,55.18,01/09/2016,31/12/2016,False,uncategorized
4806,5005554-1,City,Product Collection,2018,LEGO City Easter Bundle,,,37.97,,44.97,,,,,False,uncategorized


Theme: Education
Number of rows: 9


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
9282,9412-1,Education,Duplo,2003,Duplo Bricks,,,28.99,,,,,,,False,uncategorized
12048,9310-1,Education,,2007,Dinosaurs Set,,,,,,,89.99,,,False,uncategorized
12492,45080-1,Education,,2013,Creative Cards,,,,,,6.57,,,,False,uncategorized
8378,45497-1,Education,Storage,2017,"Storage boxes, pack of 7",,,,,,74.92,,,,False,uncategorized
8379,45498-1,Education,,2017,"Medium storage, 8 pack",,,,,,111.49,,,,False,uncategorized
1541,45816-1,Education,FIRST LEGO League,2021,FIRST LEGO League Challenge,,,,,,,,01/08/2021,31/12/2023,False,uncategorized
1542,45817-1,Education,FIRST LEGO League,2021,Cargo Connect Explore Set,,,,,,,,01/08/2021,31/12/2023,False,uncategorized
1533,45345-1,Education,SPIKE Essential,2021,SPIKE Essential Set,,,274.99,274.95,304.99,380.61,,,,False,uncategorized
1538,45609-1,Education,SPIKE Prime,2021,Small Hub,,,189.99,189.95,209.99,,,,,False,uncategorized


Theme: BrickHeadz
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
5669,6315025-1,BrickHeadz,Promotional,2019,Amsterdam BrickHeadz,,,,,,433.6,,,,False,uncategorized


Theme: Seasonal
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
6807,5004259-1,Seasonal,Christmas,2014,Holiday Ornament Collection,,,,47.94,,,,,,False,uncategorized


Theme: Duplo
Number of rows: 21


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
5838,2751-1,Duplo,,2000,Egg Fun,,,,4.0,,,,,,False,uncategorized
11429,5484-1,Duplo,,2006,{Zoo animal},,,,,,,,01/01/2006,31/12/2009,False,uncategorized
11431,5485-2,Duplo,,2006,Zoo - Zoo Keeper,,,,,,,,01/01/2006,31/12/2007,False,uncategorized
11432,5485-3,Duplo,,2006,Zoo - Penguin,,,,,,,,01/01/2006,31/12/2007,False,uncategorized
11433,5485-4,Duplo,,2006,Zoo - Polar Bear,,,,,,,,01/01/2006,31/12/2007,False,uncategorized
11434,5485-5,Duplo,,2006,Zoo - Hippopotamus,,,,,,,,01/01/2006,31/12/2007,False,uncategorized
11435,5485-6,Duplo,,2006,Zoo - Giraffe,,,,,,,,01/01/2006,31/12/2007,False,uncategorized
14260,30060-2,Duplo,,2010,Farm - Farmer,,,,,,,,01/01/2010,31/12/2011,False,uncategorized
14261,30060-3,Duplo,,2010,Farm - Dog,,,,,,,,01/01/2010,31/12/2011,False,uncategorized
14262,30060-4,Duplo,,2010,Farm - Sheep,,,,,,,,01/01/2010,31/12/2011,False,uncategorized


Theme: Marvel Super Heroes
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
4076,242210-1,Marvel Super Heroes,Magazine Gift,2022,Iron Man,,,,,,5.04,,,,False,uncategorized


Theme: Creator Expert
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
1229,10282-2,Creator Expert,Adidas,2021,Adidas Originals Superstar X Footshop 'Bluepri...,,,79.99,79.99,89.99,,,01/07/2021,31/12/2023,False,uncategorized


Theme: DC Comics Super Heroes
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
7594,5004816-1,DC Comics Super Heroes,Product Collection,2015,Super Heroes DC Collection,,,,149.98,,,,,,False,uncategorized
3631,212010-1,DC Comics Super Heroes,Magazine Gift,2020,Batman,1.0,,,,,3.38,2.73,,,False,uncategorized


Theme: Make and Create
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
11514,7794-1,Make and Create,,2006,{Set with two minifigs},,,19.99,,,,,,,False,uncategorized


Theme: Power Miners
Number of rows: 3


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
2556,4559288-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,29.5,,,,False,uncategorized
2557,4559385-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,14.63,,,,False,uncategorized
2558,4559387-1,Power Miners,Promotional,2009,{Power Miners Promotional Polybag},1.0,,,,,9.26,,,,False,uncategorized


Theme: Technic
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
662,8299-1,Technic,,1997,Search Sub,1.0,0.0,,50.0,,108.62,61.46,,,False,uncategorized


Theme: Promotional
Number of rows: 18


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
14860,4212850-1,Promotional,LEGO brand stores,2004,Easter Egg Orange,,,,,,49.0,,,,False,uncategorized
4841,6258620-1,Promotional,Miscellaneous,2018,Classic Wooden Duck,,,,,,47.7,,,,False,uncategorized
4842,6258622-1,Promotional,Miscellaneous,2018,Classic Wooden Bus,,,,,,52.51,,,,False,uncategorized
4843,6258623-1,Promotional,Miscellaneous,2018,Classic Wooden Train,,,,,,40.77,,,,False,uncategorized
5653,5006065-1,Promotional,Minifigure,2019,Brick Friday 2019 minifigure,1.0,,,,,48.5,32.37,,,False,uncategorized
5654,5006066-1,Promotional,LEGO brand stores,2019,Brick Friday 2019 brick,,,,,,16.78,,,,False,uncategorized
5656,6244853-1,Promotional,LEGO brand stores,2019,Lion Dance,,,,,,59.04,,,,False,uncategorized
5658,6307986-1,Promotional,Toys R Us,2019,Summer,,,,,,41.42,14.24,,,False,uncategorized
5659,6307987-1,Promotional,Toys R Us,2019,Autumn,,,,,,35.07,14.24,,,False,uncategorized
5660,6307988-1,Promotional,Toys R Us,2019,Winter,,,,,,48.25,14.24,,,False,uncategorized


Theme: Books
Number of rows: 2


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
5988,4006-1,Books,LEGO,2000,Brick Tricks: Cool Cars,,,,8.0,,,,,,False,uncategorized
5989,4007-1,Books,LEGO,2000,Brick Tricks: Fantastic Fliers,,,,8.0,,,,,,False,uncategorized


Theme: Miscellaneous
Number of rows: 5


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
8769,4000024-1,Miscellaneous,LEGO Inside Tour Exclusive,2017,LEGO House Tree of Creativity,,,,,,1960.86,,,,False,uncategorized
4743,4000025-1,Miscellaneous,LEGO Inside Tour Exclusive,2018,LEGO Ferguson Tractor,,,,,,2300.0,,,,False,uncategorized
5559,4000034-1,Miscellaneous,LEGO Inside Tour Exclusive,2019,LEGO System House,,,,,,3432.84,,,,False,uncategorized
3105,11929-1,Miscellaneous,,2020,Parts for The LEGO Games Book,,0.0,,,,,,,,False,uncategorized
3106,11930-1,Miscellaneous,,2020,Parts for Halloween Ideas,,0.0,,,,,,,,False,uncategorized


Theme: Vidiyo
Number of rows: 4


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
1475,43101-0,Vidiyo,Bandmates Series 1,2021,Bandmates Series 1 {Random box},,,3.99,4.99,4.99,,,01/03/2021,31/12/2022,False,uncategorized
1489,43101-14,Vidiyo,Bandmates Series 1,2021,Bandmates Series 1 - Sealed Box,,,3.99,4.99,4.99,,,01/03/2021,31/12/2022,False,uncategorized
1496,43108-0,Vidiyo,Bandmates Series 2,2021,Bandmates Series 2 {Random box},,,3.99,4.99,4.99,,,01/10/2021,31/12/2022,False,uncategorized
1510,43108-14,Vidiyo,Bandmates Series 2,2021,Bandmates Series 2 - Sealed Box,,,4.49,4.99,4.99,,,01/10/2021,31/12/2022,False,uncategorized


Theme: Unikitty
Number of rows: 1


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
4385,41775-14,Unikitty,Blind Bags Series 1,2018,Unikitty! - Blind Bags Series 1 - Sealed Box,,,,,,,,01/06/2018,31/12/2018,False,uncategorized


Theme: Super Mario
Number of rows: 54


Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
3468,71361-0,Super Mario,Character Pack - Series 1,2020,Character Pack - Series 1 {Random bag},,,,4.99,,,,01/08/2020,31/12/2020,False,uncategorized
3469,71361-1,Super Mario,Character Pack - Series 1,2020,Paragoomba,1.0,,,4.99,,12.92,5.39,01/08/2020,31/12/2020,False,uncategorized
3470,71361-2,Super Mario,Character Pack - Series 1,2020,Fuzzy,1.0,,,4.99,,6.0,4.67,01/08/2020,31/12/2020,False,uncategorized
3471,71361-3,Super Mario,Character Pack - Series 1,2020,Spiny,1.0,,,4.99,,7.95,4.71,01/08/2020,31/12/2020,False,uncategorized
3472,71361-4,Super Mario,Character Pack - Series 1,2020,Buzzy Beetle,1.0,,,4.99,,6.8,4.13,01/08/2020,31/12/2020,False,uncategorized
3473,71361-5,Super Mario,Character Pack - Series 1,2020,Bullet Bill,1.0,,,4.99,,25.65,16.37,01/08/2020,31/12/2020,False,uncategorized
3474,71361-6,Super Mario,Character Pack - Series 1,2020,Bob-omb,1.0,,,4.99,,9.08,5.45,01/08/2020,31/12/2020,False,uncategorized
3475,71361-7,Super Mario,Character Pack - Series 1,2020,Eep Cheep,1.0,,,4.99,,7.99,4.35,01/08/2020,31/12/2020,False,uncategorized
3476,71361-8,Super Mario,Character Pack - Series 1,2020,Blooper,1.0,,,4.99,,7.27,6.14,01/08/2020,31/12/2020,False,uncategorized
3477,71361-9,Super Mario,Character Pack - Series 1,2020,Urchin,1.0,,,4.99,,8.45,4.28,01/08/2020,31/12/2020,False,uncategorized


**Dropping product collections, bundles, promotional items, sealed boxes, magazine gifts, and merchandise such as the Adidas Original Superstar shoes. The remaining items that can be classified as sets or minifigures are categorized accordingly.**


In [464]:
drop_col_list = ['Star Wars', 'DC Comics Super Heroes', 'City', 'The LEGO Movie 2',
                 'Legends of Chima', 'Marvel Super Heroes', 'Books', 'Creator Expert',
                 'Ninjago', 'Vidiyo', 'Disney', 'Miscellaneous', 'Gear', 'Duplo',
                 'BrickHeadz', 'Promotional', 'Friends', 'Seasonal', 'Unikitty']
set_list = ["Clikits", 'Education','Make and Create', 'Sports']
minifig_list = ['Collectable Minifigures', 'Power Miners', 'Super Mario','Technic']
print("Rows that will be dropped:", drop_col_list)
print("To be categorized as sets:", set_list)
print("To be categorized as minifigs:", minifig_list)


Rows that will be dropped: ['Star Wars', 'DC Comics Super Heroes', 'City', 'The LEGO Movie 2', 'Legends of Chima', 'Marvel Super Heroes', 'Books', 'Creator Expert', 'Ninjago', 'Vidiyo', 'Disney', 'Miscellaneous', 'Gear', 'Duplo', 'BrickHeadz', 'Promotional', 'Friends', 'Seasonal', 'Unikitty']
To be categorized as sets: ['Clikits', 'Education', 'Make and Create', 'Sports']
To be categorized as minifigs: ['Collectable Minifigures', 'Power Miners', 'Super Mario', 'Technic']


In [465]:
for theme in set_list:
    df.loc[(df.theme == theme) & (df.category == "uncategorized"),'category'] ='set'

In [466]:
for theme in minifig_list:
    df.loc[(df.theme == theme) & (df.category == "uncategorized"),'category'] = "minifig"

In [467]:
df = df.drop(df[(df.category == "uncategorized") & (df.theme.isin(drop_col_list))].index)

In [468]:
df.category.unique()

array(['set', 'minifig'], dtype=object)
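The three cells above (relabel by theme, then drop what is still uncategorized) can also be written without loops using `Series.isin`. The following is a minimal sketch on made-up toy data; the frame `toy` and the shortened `drop_themes` list are assumptions for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy data standing in for the notebook's df (values are illustrative only).
toy = pd.DataFrame({
    "theme": ["Sports", "Super Mario", "Books", "Town", "Technic"],
    "category": ["uncategorized", "uncategorized", "uncategorized", "set", "set"],
})

set_themes = ["Clikits", "Education", "Make and Create", "Sports"]
minifig_themes = ["Collectable Minifigures", "Power Miners", "Super Mario", "Technic"]
drop_themes = ["Books", "Vidiyo"]  # shortened list, for the example only

uncategorized = toy["category"] == "uncategorized"

# Relabel only rows that are still uncategorized; categorized rows stay untouched.
toy.loc[uncategorized & toy["theme"].isin(set_themes), "category"] = "set"
toy.loc[uncategorized & toy["theme"].isin(minifig_themes), "category"] = "minifig"

# Drop rows that remain uncategorized and belong to one of the listed themes.
still_uncategorized = toy["category"] == "uncategorized"
toy = toy[~(still_uncategorized & toy["theme"].isin(drop_themes))]

print(toy)
```

Note that the "Technic" row keeps its existing `set` label, because only uncategorized rows are relabeled.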

**Categorize Collectable Minifigures as minifigs.**


In [469]:
df.loc[(df.theme == "Collectable Minifigures"), "category"] = "minifig"

In [470]:
set(df[df.theme == "Collectable Minifigures"].category)

{'minifig'}

**Since 2022 is incomplete, we will save the rows from 2022 in a new dataframe and remove them from the main one.**


In [471]:
df_2022 = df[df.year == 2022]

In [472]:
df = df[df.year != 2022]
print(set(df.year))

{1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021}


**Creating dataframe containing only sets, since they will be the main focus of the analysis:**


In [513]:
df_sets = df[df.category == "set"].sort_values(by = "year")

### Further preparation steps for exploratory analysis

**Filling in missing USD prices from the GBP and EUR retail prices where possible:**


In [514]:
rate_eur_usd, rate_gdp_usd

(1.1970355369182681, 1.5725745846764236)

In [515]:
print("No. of rows where the price could be filled exclusively with gbp-data:")
mask_gbp = (df_sets.price.isna()) & (df_sets.rrp_eur.isna()) & (df_sets.rrp_gbp.notnull())
df_sets[mask_gbp].shape

No. of rows where the price could be filled exclusively with gbp-data:


(243, 16)

In [516]:
print("No. of rows where the price could be filled exclusively with eur-data:")
mask_eur = (df_sets.price.isna()) & (df_sets.rrp_gbp.isna()) & (df_sets.rrp_eur.notnull())
df_sets[mask_eur].shape

No. of rows where the price could be filled exclusively with eur-data:


(5, 16)

In [517]:
df_sets[df_sets['price'].isnull()].shape

(3285, 16)

In [518]:
df_sets["price_eur_calc"] = df_sets["rrp_eur"] * rate_eur_usd

In [519]:
df_sets["price_gbp_calc"] = df_sets["rrp_gbp"] * rate_gdp_usd

In [520]:
df_sets.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category,price_eur_calc,price_gbp_calc
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,,5.5,,6.26,,,,False,set,,
91,5165-1,Service Packs,,1991,"Hinges, Couplings and Tilting Bearings",,31.0,,3.0,,20.0,,,,False,set,,
92,5166-1,Service Packs,,1991,"Lamp Holders, Tool Holder Plates",,18.0,,,,12.79,,,,False,set,,
93,5271-1,Service Packs,,1991,Tyres and Hubs 49.6 mm White,,4.0,,,,5.9,,,,False,set,,
94,5272-1,Service Packs,Technic,1991,Cylinder Motor,,9.0,,3.0,,3.67,,,,False,set,,


In [521]:
df_sets['price'] = df_sets['price'].fillna(df_sets['price_gbp_calc'])

In [522]:
df_sets[df_sets['price'].isnull()].shape

(2999, 18)

In [523]:
df_sets['price'] = df_sets['price'].fillna(df_sets['price_eur_calc'])

In [524]:
df_sets[df_sets['price'].isnull()].shape

(2994, 18)
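The fill order used above (keep the USD price, otherwise use the converted GBP price, otherwise the converted EUR price) can be sketched as one chained expression. The exchange rates and the toy frame below are hypothetical values chosen for illustration; the notebook derives its own rates.

```python
import pandas as pd

# Hypothetical exchange rates for illustration; the notebook computes its own.
RATE_GBP_USD = 1.5
RATE_EUR_USD = 1.2

toy = pd.DataFrame({
    "price":   [10.0, None, None, None],
    "rrp_gbp": [8.0, 4.0, None, None],
    "rrp_eur": [9.0, 5.0, 5.0, None],
})

# Fill order mirrors the cells above: USD first, then GBP-derived, then EUR-derived.
toy["price"] = (
    toy["price"]
    .fillna(toy["rrp_gbp"] * RATE_GBP_USD)
    .fillna(toy["rrp_eur"] * RATE_EUR_USD)
)

print(toy["price"].tolist())
```

Rows without any price in any currency stay missing, which matches the remaining null rows listed in the next cell.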

In [525]:
df_sets[df_sets['price'].isnull()]

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category,price_eur_calc,price_gbp_calc
92,5166-1,Service Packs,,1991,"Lamp Holders, Tool Holder Plates",,18.00,,,,12.79,,,,False,set,,
93,5271-1,Service Packs,,1991,Tyres and Hubs 49.6 mm White,,4.00,,,,5.90,,,,False,set,,
102,6352-1,Town,Vehicles,1991,Cargomaster Crane,1.00,140.00,,,,115.92,24.24,,,False,set,,
106,6509-1,Town,Racing,1991,Red Devil Racer,1.00,39.00,,,,17.81,5.40,,,False,set,,
77,5045-1,Service Packs,,1991,"Magnets, Magnet Holders",,12.00,,,,31.82,,,,False,set,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1324,30570-1,City,Wildlife Rescue,2021,Wildlife Rescue Hovercraft,1.00,35.00,,,,4.89,,01/06/2021,31/12/2022,False,set,,
1488,43101-13,Vidiyo,Bandmates Series 1,2021,Bandmates Series 1 - Complete,12.00,121.00,,,,,,01/03/2021,31/12/2022,False,set,,
1396,40502-1,Promotional,LEGO House,2021,The Brick Moulding Machine,,1205.00,,,,149.38,112.52,01/03/2021,31/12/2022,False,set,,
1372,40473-1,Promotional,LEGOLAND,2021,Water Park,5.00,359.00,,,,48.52,,01/04/2021,31/12/2021,False,set,,


In [526]:
df_sets_price = df_sets[df_sets['price'].notnull()]
df_sets_price.columns


Index(['number', 'theme', 'subtheme', 'year', 'set_name', 'minifigs', 'pieces',
       'rrp_gbp', 'price', 'rrp_eur', 'value_new', 'value_used', 'launch_date',
       'exit_date', 'numeric_data_nan', 'category', 'price_eur_calc',
       'price_gbp_calc'],
      dtype='object')

In [528]:
df_sets_price = df_sets_price.drop(['rrp_eur','price_gbp_calc', 'rrp_gbp', 'price_eur_calc'], axis = "columns")
df_sets = df_sets.drop(['rrp_eur','price_gbp_calc', 'rrp_gbp', 'price_eur_calc'], axis = "columns")

In [537]:
df_sets_price = df_sets_price.sort_values("year").reset_index(drop = True)
df_sets = df_sets.sort_values("year").reset_index(drop = True)

In [538]:
df_sets.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,5.5,6.26,,,,False,set
1,1040-1,Dacta,,1991,Farm,4.0,89.0,,,,,,False,set
2,1474-1,Basic,Universal Building Set,1991,Basic Building Set with Gift Item,1.0,69.0,,24.64,,,,False,set
3,1475-1,Town,Flight,1991,Airport Security Squad,2.0,123.0,10.0,165.9,49.8,,,False,set
4,1476-1,Assorted,Bonus/Value Pack,1991,Five Set Bonus Pack,,158.0,,450.0,100.0,,,False,set


In [539]:
df_sets_price.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.0,5.5,6.26,,,,False,set
1,2306-1,Duplo,,1991,Large Red Building Plate,,1.0,12.0,,5.61,,,False,set
2,6646-1,Town,Racing,1991,Screaming Patriot,1.0,65.0,6.75,68.99,9.93,,,False,set
3,6887-1,Space,Blacktron 2,1991,Allied Avenger,1.0,100.0,7.93,250.0,39.96,,,False,set
4,6669-1,Town,Vehicles,1991,Diesel Daredevil,1.0,90.0,8.75,39.4,12.37,,,False,set


### Further preparation steps for predictive analysis

## 4. Analysis

### A) Exploratory Analysis
#### What themes are most dominant over the years by number of sets?


In [540]:
df_temp = df_sets.groupby(by = ["year","theme"]).count().sort_values(by =["year", "number"], ascending = (True, False))
df_1 = pd.DataFrame(df_temp["number"])
df_1.reset_index(inplace = True);
df_1.head()

Unnamed: 0,year,theme,number
0,1991,Duplo,36
1,1991,Service Packs,22
2,1991,Town,22
3,1991,Space,13
4,1991,Trains,12


In [541]:
fig = px.scatter(df_1, x = "year", y = "theme", color = "theme", size = "number",
              title = "Available themes in the period 1991-2021")
fig.update_layout(height=1750,
                  font=dict(size=9),
                  yaxis = dict(tickmode = 'linear',tick0 = 1,dtick = 1),
                  xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1),
                  showlegend = False)
fig.update_traces(mode='markers+lines', textfont_size=3)

In [542]:
df_2 = df_1.groupby(by = ["year"]).head(10).reset_index()
df_2.head(15)

Unnamed: 0,index,year,theme,number
0,0,1991,Duplo,36
1,1,1991,Service Packs,22
2,2,1991,Town,22
3,3,1991,Space,13
4,4,1991,Trains,12
5,5,1991,Basic,11
6,6,1991,Dacta,7
7,7,1991,Technic,6
8,8,1991,Pirates,5
9,9,1991,Boats,3
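The top-10 selection relies on the rows already being sorted by year and descending count, so `groupby(...).head(n)` returns the largest `n` themes per year. A small sketch of the same pattern on invented counts (the frame `toy` and its numbers are assumptions, not notebook data):

```python
import pandas as pd

# Invented counts per year and theme, for illustration only.
toy = pd.DataFrame({
    "year":   [1991, 1991, 1991, 1992, 1992, 1992],
    "theme":  ["Space", "Duplo", "Town", "Castle", "Town", "Space"],
    "number": [13, 36, 22, 5, 30, 25],
})

# Sort within each year by descending count, then keep the first n rows per year.
top2 = (
    toy.sort_values(["year", "number"], ascending=[True, False])
       .groupby("year")
       .head(2)
       .reset_index(drop=True)
)

print(top2)
```

Using `reset_index(drop=True)` avoids carrying the old row labels along as an extra `index` column.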


In [543]:
fig = px.bar(df_2, x = "number",y= "year", color = "theme", 
             orientation='h', text = "theme", 
             title = "Yearly top 10 themes by number of sets")
fig.update_layout(height=1000,showlegend = False,
                  yaxis = dict(tickmode = 'linear',tick0 = 0,dtick =1, autorange = "reversed"),
                  xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 50))
fig.update_traces(textfont_size=12, textangle=0, textposition="inside")

#### What sets were record breakers in terms of piece count?

In [544]:
def create_record_breaker_dataset(df, col):
    """Return the sets that broke the running record for `col`, starting from the 1991 maximum."""
    # Record holder in 1991 (the first year in the data)
    df_winner_1991 = df[df.year == 1991].sort_values(col, ascending = False).head(1)
    
    # Keep only sets that match or exceed the 1991 record
    mask = df[col].values >= df_winner_1991[col].values[0]
    df_winners = df[mask].sort_values(by = ["year", "launch_date"])
    
    # A set is a record breaker if it raises the running maximum
    df_winners["cummulative"] = df_winners[col].cummax()
    df_winners.drop_duplicates(subset = "cummulative", inplace = True)
    df_winners.reset_index(drop = True, inplace = True)
    
    return df_winners

In [545]:
df_piece_count_winners = create_record_breaker_dataset(df_sets,"pieces")
df_piece_count_winners.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,exit_date,numeric_data_nan,category,cummulative
0,9452-1,Dacta,,1991,Giant LEGO topic set,,2165.0,,,,,,False,set,2165.0
1,9287-1,Education,Town,1996,Bonus Lego Basic Town,11.0,2456.0,,511.29,,,,False,set,2456.0
2,3450-1,Creator Expert,Sculptures,2000,Statue of Liberty,,2882.0,199.0,200.0,793.69,15/11/2000,31/12/2002,False,set,2882.0
3,10030-1,Star Wars,Ultimate Collector Series,2002,Imperial Star Destroyer,,3096.0,269.99,1471.58,535.0,06/12/2002,31/12/2007,False,set,3096.0
4,10143-1,Star Wars,Ultimate Collector Series,2005,Death Star II,,3441.0,269.99,2449.0,1206.66,01/09/2005,31/12/2007,False,set,3441.0
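The core idea behind the record-breaker table is a running maximum: a set is a record breaker when it is the first row to reach a new value of `cummax()`. The sketch below shows this on invented data; `record_breakers` and `toy` are hypothetical names, and unlike the notebook function it does not seed the search with the 1991 maximum, so several records within the first year are all kept.

```python
import pandas as pd

# Invented sets, for illustration only.
toy = pd.DataFrame({
    "year":     [1991, 1991, 1993, 1996, 1998, 2000],
    "set_name": ["A", "B", "C", "D", "E", "F"],
    "pieces":   [100, 250, 200, 300, 300, 450],
})

def record_breakers(df, col):
    """Return the rows that raise the running maximum of `col` over the years."""
    ordered = (
        df.dropna(subset=[col])
          .sort_values("year", kind="stable")
          .reset_index(drop=True)
    )
    running_max = ordered[col].cummax()
    # The first row showing a given running maximum is the one that set it.
    is_record = ~running_max.duplicated()
    return ordered[is_record].reset_index(drop=True)

winners = record_breakers(toy, "pieces")
print(winners)
```

Set "E" ties the record of "D" but does not exceed it, so it is not counted as a record breaker.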


In [546]:
df_piece_count_winners["text"] = df_piece_count_winners["set_name"] +" "+ "(" + df_piece_count_winners["number"] +")"

In [547]:
def make_h_bar_chart(df, x, y, color, text, title):
    
    fig = px.bar(df, 
             x=x, y=y, 
             color = color, orientation='h', text = text,  
             title = title,labels = {"index":"year"})
    fig.update_layout(yaxis = dict(tickmode = 'array',tickvals = df.index, ticktext = list(df.year)))
    fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 1000))
    fig.update_traces(textfont_size=12, textangle=0, textposition="inside")
    return fig

In [548]:
fig = make_h_bar_chart(df_piece_count_winners, "pieces",df_piece_count_winners.index, "theme",
                 "text", "Record breaking sets by piece count")

fig.show()

#### What sets were record breakers in terms of number of minifigs?


In [549]:
df_minifig_winners = create_record_breaker_dataset(df_sets,"minifigs")
df_minifig_winners.head()

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,price,value_new,value_used,launch_date,exit_date,numeric_data_nan,category,cummulative
0,9361-1,Dacta,,1991,People,24.0,36.0,,150.0,,,,False,set,24.0
1,9293-1,Dacta,,1996,Community Workers,30.0,182.0,31.5,83.0,69.0,,,False,set,30.0
2,9247-1,Education,,2005,Community Workers,31.0,202.0,49.99,138.2,46.34,,,False,set,31.0
3,852293-1,Gear,Board Games,2008,Castle Giant Chess Set,33.0,2292.0,199.99,,,,,False,set,33.0


In [550]:
fig = make_h_bar_chart(df_minifig_winners, "minifigs", df_minifig_winners.index,
                 "theme", "set_name", 
                 "Record breaking sets by number of minifigures")

fig.show()

#### What words come up most often in set names?

In [551]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def tokenize(text):
    """Function for text processing: normalizes, tokenizes and lemmatizes the words in a given text.
    INPUT
    text: text to process as str
    OUTPUT:
    tokens: list of lemmatized words without English stop words"""
    
    # normalize case and replace everything except letters with spaces
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    
    # tokenize text with the tokenizer
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words (build the stop word set once, not per token)
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

In [552]:
results_list = []
df_sets.set_name.apply(lambda x: results_list.extend(tokenize(x)));
print(results_list[:10])

['blue', 'baseplate', 'farm', 'basic', 'building', 'set', 'gift', 'item', 'airport', 'security']


In [553]:
results_count = Counter(results_list)
df_results_count = pd.DataFrame.from_dict(results_count, orient='index', columns = ["col"]).reset_index()
df_results_count.rename(columns = {'index':"word", "col":"occurances"}, inplace = True)
df_results_count.sort_values(by = "occurances", ascending = False, inplace = True)
df_results_count.reset_index(drop = True, inplace = True)
df_results_count.head(10)

Unnamed: 0,word,occurances
0,pack,572
1,set,496
2,brick,237
3,value,224
4,truck,208
5,bucket,186
6,lego,180
7,fire,176
8,bonus,175
9,x,164
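For readers without the NLTK corpora installed, the counting step can be approximated with the standard library alone. This is a simplified sketch, not the notebook's method: it skips lemmatization, and `STOP_WORDS` and `simple_tokenize` are hypothetical stand-ins for NLTK's stop word list and the `tokenize` function.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "and", "of", "with", "for", "a"}

def simple_tokenize(text):
    """Lowercase, keep letters only, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z]", " ", text.lower())
    return [word for word in cleaned.split() if word not in STOP_WORDS]

# A few set names, partly taken from the outputs above.
names = ["Blue baseplate", "Basic Building Set with Gift Item",
         "Fire Truck", "Fire Station Set"]

counts = Counter()
for name in names:
    counts.update(simple_tokenize(name))

print(counts.most_common(3))
```

`Counter.most_common` returns the words already sorted by frequency, so no separate sort step is needed.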


In [554]:
fig = px.bar(df_results_count.head(25), "word", "occurances",
       color = "word", text = "occurances",
       title = "Top 25 most frequent words in set names")
fig.show()                     
#fig.update_layout(xaxis = dict(visible = False, matches = None))


#### Are LEGO sets becoming more and more expensive?

In [598]:
df_expense = df_sets_price.groupby(by = "year").mean(numeric_only = True).reset_index()
df_expense.head()

Unnamed: 0,year,minifigs,pieces,price,value_new,value_used,numeric_data_nan
0,1991,2.83,127.9,22.6,195.91,62.36,0.0
1,1992,3.03,155.44,23.37,191.67,56.31,0.0
2,1993,2.79,150.68,21.33,173.64,52.11,0.0
3,1994,2.69,167.21,28.49,179.69,59.67,0.0
4,1995,2.84,187.92,25.27,156.63,47.19,0.0


In [605]:
df_expense["price_per_piece"] = df_expense.price.values / df_expense.pieces.values
df_expense.head()

Unnamed: 0,year,minifigs,pieces,price,value_new,value_used,numeric_data_nan,price_per_piece
0,1991,2.83,127.9,22.6,195.91,62.36,0.0,0.18
1,1992,3.03,155.44,23.37,191.67,56.31,0.0,0.15
2,1993,2.79,150.68,21.33,173.64,52.11,0.0,0.14
3,1994,2.69,167.21,28.49,179.69,59.67,0.0,0.17
4,1995,2.84,187.92,25.27,156.63,47.19,0.0,0.13
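The `price_per_piece` column above divides the yearly mean price by the yearly mean piece count. That is a ratio of means, which weights large sets more heavily than the mean of each set's own price per piece. The sketch below contrasts the two on invented numbers; the frame `toy` and the column name `ppp` are assumptions for illustration.

```python
import pandas as pd

# Invented sets: one small and one large set per year.
toy = pd.DataFrame({
    "year":   [2000, 2000, 2001, 2001],
    "price":  [10.0, 100.0, 20.0, 60.0],
    "pieces": [50.0, 1000.0, 100.0, 600.0],
})

yearly = toy.groupby("year")[["price", "pieces"]].mean()

# Ratio of yearly means (the approach in the cell above): large sets dominate.
yearly["ratio_of_means"] = yearly["price"] / yearly["pieces"]

# Mean of per-set ratios: every set counts equally.
toy["ppp"] = toy["price"] / toy["pieces"]
yearly["mean_of_ratios"] = toy.groupby("year")["ppp"].mean()

print(yearly)
```

Which variant is more appropriate depends on the question: the ratio of means describes the average brick bought, while the mean of ratios describes the average set.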


In [624]:
fig = px.scatter(df_expense,"year", "price", size = "price_per_piece",
       color = "price_per_piece", #markers=True,
       title = "Yearly mean price per set 1991-2021",labels = {"price":"average price in USD"})
#fig.update_layout(yaxis = dict(tickmode = 'linear',tick0 = 0,dtick = 5))
fig.update_layout(xaxis = dict(tickmode = 'linear',tick0 = 1,dtick = 2))
fig.show()
#fig.update_traces(mode='markers', textfont_size=3)

### Questions of interest

**A) Exploratory Analysis**
* **What themes are most dominant over the years?**
* **What sets were record breakers in terms of piece count?**
* **What sets were record breakers in terms of number of minifigs?**
* **What words come up most often in set names?**
* **Are LEGO sets becoming more and more expensive?**
* Does the value of sets go up after eol on average?
* What sets do best after eol? (eol = LEGO term for end-of-life, meaning the date when the set is no longer being produced)

**B) Predictive Analysis**
* Which features of the data set are good predictors of a rise in value after eol?
* What do the words contained in the set names tell us about the rise in value after eol?
* Which sets that are currently being sold can I predict to be a good investment after eol?*

*e.g. price increase of at least 10 usd (for package and shipment when selling) + at least 25% profit
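One possible reading of the footnote is that the value after eol must cover the retail price, 10 USD of selling costs, and a 25% profit on the retail price. The sketch below encodes that reading; the function name, the constants, and the choice to measure profit against the retail price are assumptions, while the three price pairs are taken from the tables above.

```python
import pandas as pd

SELLING_COSTS_USD = 10.0  # package and shipment, from the footnote
MIN_PROFIT_RATE = 0.25    # at least 25% profit, assumed relative to the retail price

def is_good_investment(price, value_new):
    """Flag sets whose value covers the price, selling costs and the profit margin."""
    profit = value_new - price - SELLING_COSTS_USD
    return profit >= MIN_PROFIT_RATE * price

# Retail price and current new value of three sets shown earlier in the notebook.
toy = pd.DataFrame({
    "set_name":  ["Imperial Star Destroyer", "Statue of Liberty", "Diesel Daredevil"],
    "price":     [269.99, 199.0, 8.75],
    "value_new": [1471.58, 200.0, 39.4],
})
toy["good_investment"] = is_good_investment(toy["price"], toy["value_new"])

print(toy)
```

A column like `good_investment` could serve as the target label for the predictive part, provided the same definition is applied consistently.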

In [331]:
df_sets_price

Unnamed: 0,number,theme,subtheme,year,set_name,minifigs,pieces,rrp_gbp,price,rrp_eur,value_new,value_used,launch_date,exit_date,numeric_data_nan,category
0,819-1,Basic,Supplementaries,1991,Blue baseplate,,1.00,,5.50,,6.26,,,,False,set
91,5165-1,Service Packs,,1991,"Hinges, Couplings and Tilting Bearings",,31.00,,3.00,,20.00,,,,False,set
92,5166-1,Service Packs,,1991,"Lamp Holders, Tool Holder Plates",,18.00,,,,12.79,,,,False,set
93,5271-1,Service Packs,,1991,Tyres and Hubs 49.6 mm White,,4.00,,,,5.90,,,,False,set
94,5272-1,Service Packs,Technic,1991,Cylinder Motor,,9.00,,3.00,,3.67,,,,False,set
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1797,80022-1,Monkie Kid,Season 2,2021,Spider Queen's Arachnoid Base,6.00,1170.00,89.99,119.99,99.99,140.26,,01/03/2021,31/12/2022,False,set
1798,80023-1,Monkie Kid,Season 2,2021,Monkie Kid's Team Dronecopter,9.00,1462.00,114.99,149.99,129.99,154.21,,01/03/2021,31/12/2022,False,set
1230,10283-1,Creator Expert,Space,2021,NASA Space Shuttle Discovery,,2354.00,159.99,199.99,179.99,197.60,151.00,01/04/2021,31/12/2024,False,set
1358,40454-1,Marvel Super Heroes,Spider-Man,2021,Spider-Man versus Venom and Iron Venom,4.00,63.00,13.49,14.99,14.99,24.20,,01/04/2021,31/12/2022,False,set


### B) Predictive Analysis

## 5. Evaluation

In [None]:
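# Correlation matrix of the numeric columns (year and the non-USD prices are left out)
corr = df.drop(["year","rrp_gbp","rrp_eur"], axis = "columns").corr(numeric_only = True)
# Mask the upper triangle (including the diagonal) so each pair is shown only once
mask = np.triu(np.ones_like(corr, dtype = bool))

f, ax = plt.subplots(figsize = (20,20))

cmap = sns.diverging_palette(200, 20, as_cmap = True)
sns.heatmap(corr, mask = mask, cmap = cmap, vmax = 1, center = 0, square = True, linewidths = 4, cbar_kws = {"shrink":.5})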
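The effect of the upper-triangle mask can be checked without plotting. The following sketch builds a correlation matrix on a small invented frame and blanks out the masked cells; `toy` and `lower` are hypothetical names used only here.

```python
import numpy as np
import pandas as pd

# Invented data; "theme" is non-numeric and is excluded from the correlation.
toy = pd.DataFrame({
    "pieces": [10, 20, 30, 40],
    "price":  [1.0, 2.1, 2.9, 4.2],
    "theme":  ["a", "b", "a", "b"],
})

corr = toy.corr(numeric_only=True)

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

# Keep only the lower triangle; masked cells become NaN.
lower = corr.mask(mask)

print(lower)
```

Only the cell for the pair (`price`, `pieces`) survives, which is the single value a heatmap of this toy matrix would show.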