# Evolution of LEGO

Sample Notebook by Junghoo Kim

## Introduction

### Motivation

Ever since I played with my first LEGO set, I've noticed a huge increase in the variety of both colors and themes in available LEGO sets. The LEGO blocks that I remember playing with were generic square blocks in the iconic red, white, yellow, and green colors. I remember growing up and thinking Star Wars-themed sets were so cool, but now I see even more amazing sets available, like Frozen-themed sets, Super Mario-themed sets, and even a Venom set!

![](https://i5.walmartimages.ca/images/Large/319/657/6000203319657.jpg)

I feel as if there has been a big increase in the total number of LEGO sets produced over the years. The variety of themes has increased for sure, but are there fan-favourite themes that are still being produced most often? Has the distribution of colors used for LEGO parts changed as the more popular themes have shifted away from colors like red and yellow used in the classic LEGO sets? Last but not least, how has the number of parts in each set changed over the years? Does each set still contain just as many pieces as it did 20 years ago, or are there more or fewer pieces now? We will be able to address these questions using an interactive dashboard.

### Questions of interest

1. How has the total number of LEGO sets produced changed over the years?
2. How has the distribution of the number of LEGO sets with different themes changed over the years?
3. Which colors were used most often for LEGO parts over the years?
4. How has the number of parts in each set changed over the years?


## Analysis 

### Data imports

In [2]:
# Import libraries needed for this assignment
import altair as alt
import pandas as pd
import os

alt.data_transformers.enable("data_server")

# Data URLs 
themes_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego-themes.csv"
sets_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego-sets.csv"
inventories_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego_inventories.csv"
inventory_parts_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego_inventory_parts.csv"
colors_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego-colors.csv"
combined_url = "https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/lego-combined.csv"

# Load each CSV file from its URL into a DataFrame
themes_df = pd.read_csv(themes_url)
sets_df = pd.read_csv(sets_url)
inventories_df = pd.read_csv(inventories_url)
inventory_parts_df = pd.read_csv(inventory_parts_url)
colors_df = pd.read_csv(colors_url)
combined_df = pd.read_csv(combined_url)

### Dataset description

The description below was taken directly from the website where the datasets were obtained.

"LEGO is a popular brand of toy building bricks. They are often sold in sets with in order to build a specific object. Each set contains a number of parts in different shapes, sizes and colors. This database contains information on which parts are included in different LEGO sets. It was originally compiled to help people who owned some LEGO sets already figure out what other sets they could build with the pieces they had."

The LEGO dataset is composed of 8 tables: colors.csv, inventories.csv, inventory_parts.csv, inventory_sets.csv, part_categories.csv, parts.csv, sets.csv, and themes.csv. Each table is stored in a .csv file and contains different information about LEGO pieces, including shapes, sizes, sets, colors, and themes. The tables below summarize the themes, sets, inventories, inventory_parts, and colors data, as well as which columns will be used to answer which questions. The same color is used to mark columns that are shared between different .csv files.
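To make the idea of shared columns concrete, here is a minimal sketch of how the sets table can be linked to the themes table through `theme_id`. The rows are made up for illustration, and the `set_num` column is an assumption about the sets file; only `id`, `name`, `parent_id`, `year`, `theme_id`, and `num_parts` appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical miniature tables; the real files have many more rows.
themes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Technic", "Star Wars", "Town"],
    "parent_id": [None, None, 1.0],
})
sets = pd.DataFrame({
    "set_num": ["A-1", "B-1", "C-1"],  # assumed identifier column
    "name": ["Crane", "X-wing", "Fire Station"],
    "year": [1999, 1999, 1985],
    "theme_id": [1, 2, 3],
    "num_parts": [120, 260, 310],
})

# sets.theme_id refers to themes.id, so a left merge attaches the theme name.
# Renaming first avoids a clash between the two `name` columns.
theme_names = themes[["id", "name"]].rename(
    columns={"id": "theme_id", "name": "theme_name"}
)
sets_with_themes = sets.merge(theme_names, on="theme_id", how="left")

print(sets_with_themes[["name", "year", "theme_name"]])
```

A left merge keeps every set even if its theme is missing from the themes table, which makes gaps in the data easy to spot.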

### Data summary table

In [3]:


themes_df.info()
print("\n")
themes_df.describe()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   id         614 non-null    int64  
 1   name       614 non-null    object 
 2   parent_id  503 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 14.5+ KB




| | id | parent_id |
|---|---|---|
| count | 614.0 | 503.0 |
| mean | 307.5 | 274.294235 |
| std | 177.390811 | 176.070151 |
| min | 1.0 | 1.0 |
| 25% | 154.25 | 126.0 |
| 50% | 307.5 | 264.0 |
| 75% | 460.75 | 430.0 |
| max | 614.0 | 591.0 |


### How has the number of released sets changed over the years?


In [4]:
sets_per_year = (
    alt.Chart(sets_url)
    .mark_bar(color="navy")
    .encode(
        alt.X("year:O", title="Year"),
        alt.Y("count()", title="Number of Sets"),
        tooltip=[alt.Tooltip("count()", title="Number of Sets")])
    .properties(width=350)
)

sets_per_year.properties(title="Fig 1. Number of sets produced each year from 1950-2017")


We can see from the above visualization that there is a fairly clear increasing trend in the number of sets produced each year from 1950 to 2017. But this makes me ask the following question: do the sets released in recent years have the same themes as the sets released in the earlier years? If the themes have changed over the years, what are some of the themes that have "defined" each era? Let's find out!
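If you want the numbers behind the bars rather than the picture, the same per-year count can be sketched in pandas. The rows below are hypothetical; with the real data you would start from `sets_df` instead.

```python
import pandas as pd

# Hypothetical rows; only the `year` column matters for this count.
sets = pd.DataFrame({"year": [1950, 1950, 1985, 2017, 2017, 2017]})

# Rough pandas counterpart of the chart's count() aggregate grouped by year.
sets_per_year_counts = sets.groupby("year").size().rename("n_sets")

print(sets_per_year_counts)
```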

In [5]:
# On mouse click
select_year_click = alt.selection_multi(encodings=["x"], on='click', nearest=True)

sets_per_year_click = (
    sets_per_year.encode(
        color=alt.condition(select_year_click, alt.value("navy"), alt.value("lightgray")))
    .properties(height=100, width=350)
    .add_selection(select_year_click)
    .properties(title={
        "text" : "Number of sets produced each year from 1950-2017.",
        "subtitle" : ["Click on a bar to select the year. Hold shift to select multiple years.", "Double-click to clear selection(s)."]
    })
)

top_themes = (
    alt.Chart(sets_url)
    .transform_filter(select_year_click)     # filter for selected year
    .mark_bar()
    .encode(
        alt.X("name:N", title="Theme", sort='-y'),
        alt.Y("sets_count:Q", title="Number of Sets"),
        alt.Color(value="navy"),
        tooltip=[alt.Tooltip("sets_count:Q", title="Number of Sets with Theme")])
    .transform_lookup(
        lookup='theme_id',
        from_=alt.LookupData(data=themes_url, key='id',
                         fields=['name']))
    .transform_aggregate(
        sets_count="count()",
        groupby=["name"])
    .transform_window(
        rank='rank(sets_count)',
        sort=[alt.SortField("sets_count", order="descending")])
    .transform_filter(alt.datum.rank <= 10)
    .properties(title="Most used themes in selected year(s)",
                height=350, width=350)
    .add_selection(select_year_click)
)

sets_per_year_click & top_themes



That's interesting! Sets produced in the 1950s mostly had "Basic Set", "Town Plan", "Traffic", and "Supplemental" themes. These are the generic yet iconic LEGO themes that I remember.

In contrast, you can start seeing some trendy themes in the 2000s and 2010s. For example, in 2004, when the "Harry Potter and the Prisoner of Azkaban" film was released, there were 11 sets produced with the "Prisoner of Azkaban" theme! Similarly, in 2011, when the "Pirates of the Caribbean: On Stranger Tides" movie was released, there were 16 LEGO sets released with the "Pirates of the Caribbean" theme.

It's also interesting to see themes related to other major events, such as the 33 LEGO sets with the "Soccer" theme in 2002, when the 17th FIFA World Cup took place.

Despite these variations in the trendy themes across years, we see that Star Wars themes such as "Star Wars Clone Wars", "Star Wars Episode 4/5/6", and "Star Wars Episode 7" appear consistently among the most used themes between 2008 and 2017.
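Counts like these can also be checked without clicking through the dashboard. Below is a rough pandas sketch of the filter, lookup, aggregate, and rank steps; the helper name `top_themes_in` and all rows are hypothetical, and I'm assuming that `rank(method="min")` behaves like the chart's rank window, where tied themes share a rank.

```python
import pandas as pd

# Hypothetical miniature tables with the same key columns as the real data.
themes = pd.DataFrame({"id": [1, 2, 3], "name": ["Soccer", "Star Wars", "Town"]})
sets = pd.DataFrame({
    "name": ["Set A", "Set B", "Set C", "Set D", "Set E", "Set F"],
    "year": [2002, 2002, 2002, 2002, 2011, 2011],
    "theme_id": [1, 1, 2, 3, 2, 2],
})


def top_themes_in(years, n=10):
    """Count sets per theme for the given years and keep themes ranked <= n."""
    selected = sets[sets["year"].isin(years)]
    # Rename before merging so the theme name does not clash with the set name.
    theme_names = themes.rename(columns={"id": "theme_id", "name": "theme_name"})
    named = selected.merge(theme_names, on="theme_id", how="left")
    counts = named.groupby("theme_name").size()
    # Tied themes share a rank, so more than n rows can come back.
    ranks = counts.rank(method="min", ascending=False)
    return counts[ranks <= n].sort_values(ascending=False)


print(top_themes_in([2002]))
print(top_themes_in([2002, 2011], n=1))
```

Passing several years mimics holding shift to select multiple bars in the dashboard.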

It appears that there have been some major changes in the themes used for LEGO sets. Have the colors used for the LEGO blocks also changed over the years? We can answer that, too!


### Which colors were used most often over the years?

In [6]:
color_slider = alt.binding_range(
    step=1,
    min=5,
    max=25)

select_colors = alt.selection_single(
    fields=['num_colors'],
    bind=color_slider,
    init={'num_colors' : 10},
    name='Select')

colors_filtered = colors_df.query("id >= 0 & id < 9999")

# Keep one row per color name so that names and RGB values stay aligned
unique_colors = colors_filtered.drop_duplicates("name")
colors = unique_colors["name"].tolist()
# I'm adding "#" prefix to the rgb values (e.g. "0033B2") so that Altair knows these are RGB values
rgbs = [f"#{rgb}" for rgb in unique_colors["rgb"]]

colors_parts = (
    alt.Chart(combined_url)
    .transform_filter(select_year_click)
    # Parenthesize each comparison: & binds more tightly than >= and <
    .transform_filter((alt.datum.id_color >= 0) & (alt.datum.id_color < 9999))
    .mark_bar(stroke="black")
    .encode(
        alt.X("total_parts:Q", title="Number of Parts"),
        alt.Y("name_color:N", title="Color", sort='-x'),
        color=alt.Color('name_color:N', scale=alt.Scale(domain=colors, range=rgbs), legend=None),
        tooltip=[alt.Tooltip("total_parts:Q", title="Number of Parts with Color")])
    .transform_aggregate(
        total_parts="sum(quantity)",
        groupby=["name_color"])
    .transform_window(
        rank='rank(total_parts)',
        sort=[alt.SortField("total_parts", order="descending")])
    .add_selection(select_colors)
    .transform_filter(alt.datum.rank <= select_colors.num_colors)
    .properties(title={
        "text" : "Top colors with most number of parts in selected year(s)",
        "subtitle" : "Use slider below to select number of colors to show."},
                height=550)
)

((sets_per_year_click & top_themes) | colors_parts)




We see in the above visualization that the colors used for the LEGO blocks have become more diverse in recent years. Now the color palette used includes colors such as "Lime", "Trans-Light Blue", and "Pearl Gold".

In addition, we can see that while grey colors were not used very much in the early years, light grey and dark grey are used much more in recent years. This change appears to have started around 1978.
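For anyone who prefers a table to a chart, here is a small pandas sketch of the same idea: drop the placeholder color ids, total the part quantities per color, and keep the top-ranked colors. The column names follow the chart code above, but the rows, the color names, and the cutoff of 2 are made up for illustration.

```python
import pandas as pd

# Hypothetical rows using the column names from the combined data.
parts = pd.DataFrame({
    "id_color": [-1, 4, 4, 15, 71, 9999],
    "name_color": ["Unknown", "Red", "Red", "White",
                   "Light Bluish Gray", "[No Color]"],
    "quantity": [3, 10, 5, 8, 20, 7],
})

# Keep only real colors. Each comparison needs its own parentheses,
# because & binds more tightly than >= and <.
valid = parts[(parts["id_color"] >= 0) & (parts["id_color"] < 9999)]

# Total quantity per color, then keep the colors ranked in the top 2.
totals = valid.groupby("name_color")["quantity"].sum()
ranks = totals.rank(method="min", ascending=False)
top_colors = totals[ranks <= 2].sort_values(ascending=False)

print(top_colors)
```

Changing the `2` plays the same role as moving the slider in the dashboard.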

So far, we've seen that there have been some major changes in the themes used for the LEGO sets, as well as the color palette used for the LEGO blocks. I'm super interested in these newer LEGO sets with trendy themes and a wide array of colors! But do these newer LEGO sets come with just as many pieces as sets did in the earlier years?


In [7]:
# Boolean selection for point/line marks
scatter_check = alt.binding_checkbox()
line_check = alt.binding_checkbox()

scatter_selection = alt.selection_single(bind=scatter_check, name="Hide Scatter")
line_selection = alt.selection_single(bind=line_check, name="Hide Mean Trend Line")

max_parts = sets_df["num_parts"].max()

parts_per_set_scatter = (
    alt.Chart(sets_url)
    .mark_point(size=5, fill="navy")
    .encode(
        alt.X("year:O", title="Year"),
        alt.Y("num_parts:Q", title="Number of Parts in a Set",
              scale = alt.Scale(type="sqrt", domain=[0, max_parts])),
        opacity=alt.condition(scatter_selection, alt.value(0.5), alt.value(0.0)),
        tooltip=[alt.Tooltip("name:N", title="Name of Set"),
                 alt.Tooltip("num_parts:Q", title="Number of Parts")])
    .add_selection(scatter_selection)
)
    
parts_per_set_line = (
    alt.Chart(sets_url)
    .mark_line(color="red", strokeWidth=5)
    .encode(
        alt.X("year:O", title="Year"),
        alt.Y("num_parts:Q", aggregate="mean", title="Number of Parts in a Set",
              scale = alt.Scale(type="sqrt", domain=[0, max_parts])),
        opacity=alt.condition(line_selection, alt.value(1), alt.value(0.0)),
        tooltip=[alt.Tooltip("year:O", title="Year"),
                 alt.Tooltip("num_parts:Q", aggregate="mean", title="Mean Number of Parts", format='.2f')])
    .add_selection(line_selection)
)

parts_per_set = (
    (parts_per_set_scatter + parts_per_set_line)
    .properties(height=100, width=350, title="Number of parts in a set")
)

sets_per_year_click & parts_per_set 



Side Note: Unfortunately, the checkbox widgets do not work after the first click. This issue is documented in the Altair repo as well as the Vega-Lite repo, and will hopefully be fixed in future versions.

As seen in the plot above, the number of parts in a set has generally increased over the years. Using the tooltips, we can also see some sets with far more parts than the mean, such as the Taj Mahal set released in 2008 with 5922 parts, and the Millennium Falcon - UCS set released in 2007 with 5195 parts.
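The red trend line and the outliers can be reproduced numerically as well. In this sketch, the two large sets use the part counts quoted above, while the remaining rows are hypothetical fillers; with the real data you would use `sets_df` directly.

```python
import pandas as pd

sets = pd.DataFrame({
    "name": ["Small Car", "Taj Mahal", "Castle",
             "Millennium Falcon - UCS", "Mini Figure"],
    "year": [2008, 2008, 2007, 2007, 2007],
    # 5922 and 5195 come from the text above; the other values are made up.
    "num_parts": [50, 5922, 700, 5195, 5],
})

# Mean number of parts per year, the quantity drawn as the trend line.
mean_parts = sets.groupby("year")["num_parts"].mean()

# The sets with the most parts, which show up as points far above the mean.
largest = sets.nlargest(2, "num_parts")[["name", "year", "num_parts"]]

print(mean_parts)
print(largest)
```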
