![star_wars_unsplash](star_wars_unsplash.jpg)

Lego is a household name across the world, supported by a diverse toy line, hit movies, and a series of successful video games. In this project, we are going to explore a key development in the history of Lego: the introduction of licensed sets such as Star Wars, Super Heroes, and Harry Potter.

The introduction of its first licensed series, Star Wars, was a hit that sparked a series of collaborations on more themed sets. The partnerships team has asked you to analyze this success. Before you dive in, they suggest reading the descriptions of the two datasets below.

## The Data

You have been provided with two datasets to use. A summary and preview are provided below.

## lego_sets.csv

| Column     | Description              |
|------------|--------------------------|
| `"set_num"` | A code that is unique to each set in the dataset. This column is critical, and a missing value indicates the set is a duplicate or invalid! |
| `"name"` | The name of the set. |
| `"year"` | The year the set was released. |
| `"num_parts"` | The number of parts contained in the set. This column is not central to our analyses, so missing values are acceptable. |
| `"theme_name"` | The name of the sub-theme of the set. |
| `"parent_theme"` | The name of the parent theme the set belongs to. Matches the `"name"` column of the `parent_themes.csv` file. |

## parent_themes.csv

| Column     | Description              |
|------------|--------------------------|
| `"id"` | A code that is unique to every theme. |
| `"name"` | The name of the parent theme. |
| `"is_licensed"` | A Boolean column specifying whether the theme is a licensed theme. |

The team responsible for the Star Wars partnership has asked for specific information in preparation for their meeting:

- What percentage of all licensed sets ever released were Star Wars themed? Save your answer as a variable `the_force`, as an integer (e.g. 25).

- In which year was the highest number of Star Wars sets released? Save your answer as a variable `new_era`, as an integer (e.g. 2012).

In [36]:
# Import pandas, read and inspect the datasets
import pandas as pd

lego_sets = pd.read_csv('data/lego_sets.csv')
lego_sets.head()

,set_num,name,year,num_parts,theme_name,parent_theme
0,00-1,Weetabix Castle,1970,471.0,Castle,Legoland
1,0011-2,Town Mini-Figures,1978,,Supplemental,Town
2,0011-3,Castle 2 for 1 Bonus Offer,1987,,Lion Knights,Castle
3,0012-1,Space Mini-Figures,1979,12.0,Supplemental,Space
4,0013-1,Space Mini-Figures,1979,12.0,Supplemental,Space


In [37]:
lego_sets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11986 entries, 0 to 11985
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   set_num       11833 non-null  object 
 1   name          11833 non-null  object 
 2   year          11986 non-null  int64  
 3   num_parts     6926 non-null   float64
 4   theme_name    11833 non-null  object 
 5   parent_theme  11986 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 562.0+ KB
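
The `info()` output above implies 153 missing values (11986 − 11833) in each of `set_num`, `name`, and `theme_name`, and 5060 in `num_parts`. A minimal sketch of how those counts could be read off directly, and how to check whether the same rows are affected, is shown here on a small made-up frame. `toy_sets` and its values are illustrative assumptions, not rows from the real file.

```python
import pandas as pd

# Toy stand-in for lego_sets (hypothetical values, not from the real CSV)
toy_sets = pd.DataFrame({
    "set_num": ["00-1", "0011-2", None, "0012-1"],
    "name": ["Weetabix Castle", "Town Mini-Figures", None, "Space Mini-Figures"],
    "num_parts": [471.0, None, None, 12.0],
    "parent_theme": ["Legoland", "Town", "Star Wars", "Space"],
})

# Count missing values per column
missing_counts = toy_sets.isna().sum()
print(missing_counts)

# Check whether the rows missing set_num are exactly the rows missing name
same_rows = bool((toy_sets["set_num"].isna() == toy_sets["name"].isna()).all())
print(same_rows)
```

Running the same two checks on `lego_sets` would show whether the 153 gaps in the three columns fall on the same rows, which the `info()` summary alone cannot confirm.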


In [38]:
parent_themes = pd.read_csv('data/parent_themes.csv')
parent_themes.head()

,id,name,is_licensed
0,1,Technic,False
1,22,Creator,False
2,50,Town,False
3,112,Racers,False
4,126,Space,False


In [9]:
parent_themes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           111 non-null    int64 
 1   name         111 non-null    object
 2   is_licensed  111 non-null    bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 2.0+ KB


In [42]:
# Drop rows missing set_num, name, or theme_name (missing num_parts is acceptable)
lego_sets_clean = lego_sets.dropna(subset=['set_num', 'name', 'theme_name'])
lego_sets_clean.head()

,set_num,name,year,num_parts,theme_name,parent_theme
0,00-1,Weetabix Castle,1970,471.0,Castle,Legoland
1,0011-2,Town Mini-Figures,1978,,Supplemental,Town
2,0011-3,Castle 2 for 1 Bonus Offer,1987,,Lion Knights,Castle
3,0012-1,Space Mini-Figures,1979,12.0,Supplemental,Space
4,0013-1,Space Mini-Figures,1979,12.0,Supplemental,Space
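
The preview above still contains rows with an empty `num_parts`, which is expected: `dropna(subset=...)` only looks at the listed columns. The sketch below illustrates that behaviour on a hypothetical four-row frame (`toy_sets` and `toy_clean` are made-up names and data).

```python
import pandas as pd

# Toy stand-in for lego_sets (hypothetical values)
toy_sets = pd.DataFrame({
    "set_num": ["00-1", "0011-2", None, "0012-1"],
    "name": ["Weetabix Castle", "Town Mini-Figures", None, "Space Mini-Figures"],
    "theme_name": ["Castle", "Supplemental", None, "Supplemental"],
    "num_parts": [471.0, None, 5.0, 12.0],
})

# Only the columns named in subset decide whether a row is dropped
toy_clean = toy_sets.dropna(subset=["set_num", "name", "theme_name"])

print(len(toy_sets), len(toy_clean))       # rows before and after
print(toy_clean["num_parts"].isna().sum())  # missing num_parts values that survive
print(list(toy_clean.index))                # original index labels are kept
```

Note that the original index labels are preserved, so the cleaned frame can have gaps in its index; `reset_index(drop=True)` would renumber it if that matters downstream.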


In [43]:
licensed_themes = parent_themes[parent_themes['is_licensed']]['name']
licensed_themes.head()

7                    Star Wars
12                Harry Potter
16    Pirates of the Caribbean
17               Indiana Jones
18                        Cars
Name: name, dtype: object

In [46]:
licensed = lego_sets_clean['parent_theme'].isin(licensed_themes)
licensed_sets = lego_sets_clean[licensed]
licensed_sets

,set_num,name,year,num_parts,theme_name,parent_theme
44,10018-1,Darth Maul,2001,1868.0,Star Wars,Star Wars
45,10019-1,Rebel Blockade Runner - UCS,2001,,Star Wars Episode 4/5/6,Star Wars
54,10026-1,Naboo Starfighter - UCS,2002,,Star Wars Episode 1,Star Wars
57,10030-1,Imperial Star Destroyer - UCS,2002,3115.0,Star Wars Episode 4/5/6,Star Wars
95,10075-1,Spider-Man Action Pack,2002,25.0,Spider-Man,Super Heroes
...,...,...,...,...,...,...
11811,VP-12,Star Wars Co-Pack of 7121 and 7151,2000,2.0,Star Wars Episode 1,Star Wars
11816,VP-2,Star Wars Co-Pack of 7110 and 7144,2001,2.0,Star Wars Episode 4/5/6,Star Wars
11817,VP-3,Star Wars Co-Pack of 7131 and 7151,2000,2.0,Star Wars Episode 1,Star Wars
11818,VP-4,Star Wars Co-Pack of 7101 7111 and 7171,2000,3.0,Star Wars Episode 1,Star Wars
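
Because `parent_theme` matches the `name` column of `parent_themes.csv`, the same subset could also be obtained with a merge instead of `isin`. The following is a sketch of that alternative on made-up data, not the approach used in this notebook. `toy_sets`, `toy_themes`, and `themes_for_merge` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows)
toy_sets = pd.DataFrame({
    "set_num": ["10018-1", "10075-1", "00-1"],
    "parent_theme": ["Star Wars", "Super Heroes", "Legoland"],
})
toy_themes = pd.DataFrame({
    "id": [158, 482, 411],  # made-up ids
    "name": ["Star Wars", "Super Heroes", "Legoland"],
    "is_licensed": [True, True, False],
})

# Rename the key so it matches and does not clash with the sets' own "name" column
themes_for_merge = toy_themes[["name", "is_licensed"]].rename(
    columns={"name": "parent_theme"}
)

# Inner merge attaches is_licensed to every set whose theme is found
merged = toy_sets.merge(themes_for_merge, on="parent_theme", how="inner")
licensed_toy = merged[merged["is_licensed"]]
print(licensed_toy)
```

The merge keeps the `is_licensed` flag on each row, which can be handy for later comparisons between licensed and unlicensed sets; the `isin` filter is shorter when only the licensed subset is needed.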


In [47]:
all_sets = len(licensed_sets)
star_wars_sets = len(licensed_sets[licensed_sets['parent_theme'] == 'Star Wars'])
ratio = star_wars_sets / all_sets
the_force = int(ratio * 100)
print(f'The percentage of licensed sets that are Star Wars themed is {the_force}%.')

The percentage of licensed sets that are Star Wars themed is 51%.
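
One detail in the calculation above is that `int()` truncates toward zero rather than rounding, so the reported 51% means the true share lies somewhere from 51% up to, but not including, 52%. The sketch below shows the difference on a made-up three-row frame and computes the share as the mean of a Boolean mask; `toy_licensed` is a hypothetical name, and which convention the partnerships team expects is an assumption left open here.

```python
import pandas as pd

# Toy stand-in for licensed_sets: two of three sets are Star Wars
toy_licensed = pd.DataFrame(
    {"parent_theme": ["Star Wars", "Star Wars", "Harry Potter"]}
)

# The mean of a Boolean mask is the fraction of True values
share = float((toy_licensed["parent_theme"] == "Star Wars").mean())

truncated = int(share * 100)   # drops the fractional part -> 66
rounded = round(share * 100)   # rounds to nearest integer -> 67
print(truncated, rounded)
```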


In [49]:
# Create a pivot table of sets released by theme per year
licensed_pivot = licensed_sets.pivot_table(index='year', columns='parent_theme', values='set_num', aggfunc='count')
licensed_pivot

parent_theme,Angry Birds,Avatar,Ben 10,Cars,Disney,Disney Princess,Disney's Mickey Mouse,Ghostbusters,Harry Potter,Indiana Jones,...,Pirates of the Caribbean,Prince of Persia,Scooby-Doo,SpongeBob SquarePants,Star Wars,Super Heroes,Teenage Mutant Ninja Turtles,The Hobbit and Lord of the Rings,The Lone Ranger,Toy Story
year,,,,,,,,,,,,,,,,,,,,,
1999,,,,,,,,,,,...,,,,,13.0,,,,,
2000,,,,,,,5.0,,,,...,,,,,26.0,,,,,
2001,,,,,,,,,11.0,,...,,,,,14.0,,,,,
2002,,,,,,,,,19.0,,...,,,,,28.0,3.0,,,,
2003,,,,,,,,,3.0,,...,,,,,32.0,5.0,,,,
2004,,,,,,,,,14.0,,...,,,,,20.0,6.0,,,,
2005,,,,,,,1.0,,5.0,,...,,,,,28.0,1.0,,,,
2006,,2.0,,,,,,,,,...,,,,3.0,11.0,8.0,,,,
2007,,,,,,,,,1.0,,...,,,,2.0,16.0,2.0,,,,
2008,,,,,,,,,,12.0,...,,,,3.0,23.0,5.0,,,,
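
The empty cells in the table above are `NaN` values for theme and year combinations with no sets, which is also why the counts display as floats such as `13.0`. As a sketch, `pivot_table` accepts a `fill_value` argument that replaces those gaps; the example uses a hypothetical `toy_licensed` frame rather than the real data.

```python
import pandas as pd

# Toy stand-in for licensed_sets (hypothetical rows)
toy_licensed = pd.DataFrame({
    "set_num": ["a-1", "b-1", "c-1", "d-1"],
    "year": [1999, 1999, 2001, 2001],
    "parent_theme": ["Star Wars", "Star Wars", "Star Wars", "Harry Potter"],
})

# fill_value=0 puts an explicit zero where a theme released nothing that year
toy_pivot = toy_licensed.pivot_table(
    index="year",
    columns="parent_theme",
    values="set_num",
    aggfunc="count",
    fill_value=0,
)
print(toy_pivot)
```

Whether zeros or `NaN` are preferable depends on the follow-up: zeros make sums and plots straightforward, while `NaN` keeps "no sets released" distinct from a counted value.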


In [51]:
licensed_pivot.sort_values(by="Star Wars", ascending=False)["Star Wars"]

year
2016    61.0
2015    58.0
2017    55.0
2014    45.0
2012    43.0
2009    39.0
2013    35.0
2003    32.0
2011    32.0
2010    30.0
2002    28.0
2005    28.0
2000    26.0
2008    23.0
2004    20.0
2007    16.0
2001    14.0
1999    13.0
2006    11.0
Name: Star Wars, dtype: float64

In [52]:
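# Take the year with the highest Star Wars count directly from the pivot table
new_era = int(licensed_pivot["Star Wars"].idxmax())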
print(f'The year when the most Star Wars sets were released was {new_era}.')

The year when the most Star Wars sets were released was 2016.
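
The peak year can also be found without building the full pivot table, by filtering to Star Wars and counting sets per year. This is a minimal sketch on made-up rows (`toy_licensed`, `sets_per_year`, and `peak_year` are hypothetical names). Note that `idxmax` returns the first label when several years tie for the maximum.

```python
import pandas as pd

# Toy stand-in for licensed_sets (hypothetical rows)
toy_licensed = pd.DataFrame({
    "set_num": ["a-1", "b-1", "c-1", "d-1", "e-1"],
    "year": [2015, 2016, 2016, 2016, 2016],
    "parent_theme": [
        "Star Wars", "Star Wars", "Star Wars", "Harry Potter", "Star Wars",
    ],
})

# Keep only Star Wars sets, then count sets per release year
star_wars = toy_licensed[toy_licensed["parent_theme"] == "Star Wars"]
sets_per_year = star_wars.groupby("year")["set_num"].count()

# idxmax gives the index label (the year) of the largest count
peak_year = int(sets_per_year.idxmax())
print(peak_year)
```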


# Solution

In [34]:
# Import pandas and read in the DataFrame, and inspect it
import pandas as pd
lego_sets = pd.read_csv('data/lego_sets.csv')
lego_sets.head()

# Drop rows missing set_num, name, or theme_name (missing num_parts is acceptable)
lego_sets_clean = lego_sets.dropna(subset=['set_num', 'name', 'theme_name'])
lego_sets_clean.head()

# Get the names of the licensed themes
parent_themes = pd.read_csv('data/parent_themes.csv')
licensed_themes = parent_themes[parent_themes['is_licensed']]['name']
licensed_themes.head()

# Subset for licensed sets
licensed = lego_sets_clean['parent_theme'].isin(licensed_themes)
licensed_sets = lego_sets_clean[licensed]
licensed_sets.head()

# Calculate the percentage of licensed sets that are Star Wars themed
all_sets = len(licensed_sets)
star_wars_sets = len(licensed_sets[licensed_sets['parent_theme'] == 'Star Wars'])
ratio = star_wars_sets / all_sets
the_force = int(ratio * 100)
print(f'The percentage of licensed sets that are Star Wars themed is {the_force}%.')

# Create a pivot table of sets released by theme per year
licensed_pivot = licensed_sets.pivot_table(index='year', columns='parent_theme', values='set_num', aggfunc='count')

# Find the year when the most Star Wars sets were released
licensed_pivot.sort_values(by="Star Wars", ascending=False)["Star Wars"]
new_era = int(licensed_pivot["Star Wars"].idxmax())  # year with the highest count
print(f'The year when the most Star Wars sets were released was {new_era}.')

The percentage of licensed sets that are Star Wars themed is 51%.
The year when the most Star Wars sets were released was 2016.
