# <h1 align = "center">Lego Set </h1>
***


<h3><strong>Scope of the Project:</strong></h3>
<ul>
    <li>Demonstrate the ability to load files and merge them together</li>
    <li>Demonstrate the ability to clean and validate the dataset</li>
    <li>Query and manipulate the dataset to answer questions</li>
    <li>Visualize the results in Power BI (see the Power BI tab)</li>
</ul>

### __Info regarding the Data Set__
__datasets/lego_sets.csv__
- set_num: A code that is unique to each set in the dataset. This column is critical, and a missing value indicates the set is a duplicate or invalid!
- name: A name for every set in the dataset (note that different sets can share the same name).
- year: The year the set was released.
- num_parts: The number of parts contained in the set. This column is not central to our analyses, so missing values are acceptable.
- theme_name: The name of the sub-theme of the set.
- parent_theme: The name of the parent theme the set belongs to. Matches the `name` column of the `parent_themes` csv file.<br><br>
__datasets/parent_themes.csv__


- id: A code that is unique to every theme.
- name: The name of the parent theme.
- is_licensed: A Boolean column specifying whether the theme is a licensed theme.
***


### Q1:  __What percentage of all licensed sets ever released were Star Wars themed?__ <br>


In [1]:
import pandas as pd

Loading Dataset: 

In [2]:
lego_set_raw = pd.read_csv(r"C:\Users\alexc\OneDrive\Documents\Lego_Data_Analysis\Lego_Data\lego_sets.csv")
lego_set_raw.head(10)

Unnamed: 0,set_num,name,year,num_parts,theme_name,parent_theme
0,00-1,Weetabix Castle,1970,471.0,Castle,Legoland
1,0011-2,Town Mini-Figures,1978,,Supplemental,Town
2,0011-3,Castle 2 for 1 Bonus Offer,1987,,Lion Knights,Castle
3,0012-1,Space Mini-Figures,1979,12.0,Supplemental,Space
4,0013-1,Space Mini-Figures,1979,12.0,Supplemental,Space
5,0014-1,Space Mini-Figures,1979,12.0,Supplemental,Space
6,0015-1,Space Mini-Figures,1979,,Supplemental,Space
7,0016-1,Castle Mini Figures,1978,,Castle,Castle
8,00-2,Weetabix Promotional House 1,1976,,Building,Legoland
9,00-3,Weetabix Promotional House 2,1976,,Building,Legoland


Loading Dataset 2:

In [3]:
parent_theme_raw = pd.read_csv(r"C:\Users\alexc\OneDrive\Documents\Lego_Data_Analysis\Lego_Data\parent_themes.csv")
parent_theme_raw.head()

Unnamed: 0,id,name,is_licensed
0,1,Technic,False
1,22,Creator,False
2,50,Town,False
3,112,Racers,False
4,126,Space,False


Checking whether the parent themes dataset contains any duplicate rows. If it does, those duplicates will have to be removed,<br>because this dataset essentially serves as the primary-key lookup for the merge and therefore cannot contain duplicates.

In [4]:
parent_theme_raw.duplicated().sum()

0
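Calling `duplicated()` with no arguments only flags rows that match in every column. Since the merge later relies on the theme name as its key, it may also be worth checking that column on its own. The sketch below uses a small hypothetical stand-in for `parent_themes.csv` plus one made-up conflicting row (not present in the real data) to show the difference:

```python
import pandas as pd

# Hypothetical stand-in for parent_themes.csv, using values from the preview above.
themes = pd.DataFrame({
    "id": [1, 22, 50, 112, 126],
    "name": ["Technic", "Creator", "Town", "Racers", "Space"],
    "is_licensed": [False, False, False, False, False],
})

# Illustrative bad row (not in the real data): same name, different id.
conflict = pd.concat(
    [themes, pd.DataFrame({"id": [999], "name": ["Town"], "is_licensed": [True]})],
    ignore_index=True,
)

# With no arguments, duplicated() only flags rows identical in every column.
full_row_dupes = int(conflict.duplicated().sum())

# Restricting the check to the join key catches the repeated name.
key_dupes = int(conflict.duplicated(subset=["name"]).sum())

print(full_row_dupes, key_dupes, themes["name"].is_unique)
```

In this toy example the full-row check reports no duplicates while the key-only check finds one, which is the case a lookup table needs to rule out.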

In [5]:
parent_theme_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111 entries, 0 to 110
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           111 non-null    int64 
 1   name         111 non-null    object
 2   is_licensed  111 non-null    bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 2.0+ KB


Checking whether there are any null values in the `name` column:

In [7]:
parent_theme_raw["name"].isnull().sum()

0

Renaming the `name` column to `parent_theme`, so that this dataset and the Lego dataset can be merged on the same column:

In [8]:
parent_theme_raw.rename(columns= {"name":"parent_theme"},inplace=True)
parent_theme_raw.head()


Unnamed: 0,id,parent_theme,is_licensed
0,1,Technic,False
1,22,Creator,False
2,50,Town,False
3,112,Racers,False
4,126,Space,False


In [9]:
parent_theme_raw[parent_theme_raw["is_licensed"]== True].count()

id              22
parent_theme    22
is_licensed     22
dtype: int64

Since there were no issues in the parent themes dataset, it is time to merge it with the Lego dataset using a left join. <br>
In Excel terms, a left join is roughly a VLOOKUP or XLOOKUP: every row of the left table is kept, and the matching columns from the right table are attached to it.<br><br>
We are joining on the "parent_theme" column:

In [10]:
merged_data = pd.merge(left=lego_set_raw, right=parent_theme_raw, how="left", on="parent_theme")
merged_data.head(50)

Unnamed: 0,set_num,name,year,num_parts,theme_name,parent_theme,id,is_licensed
0,00-1,Weetabix Castle,1970,471.0,Castle,Legoland,411,False
1,0011-2,Town Mini-Figures,1978,,Supplemental,Town,50,False
2,0011-3,Castle 2 for 1 Bonus Offer,1987,,Lion Knights,Castle,186,False
3,0012-1,Space Mini-Figures,1979,12.0,Supplemental,Space,126,False
4,0013-1,Space Mini-Figures,1979,12.0,Supplemental,Space,126,False
5,0014-1,Space Mini-Figures,1979,12.0,Supplemental,Space,126,False
6,0015-1,Space Mini-Figures,1979,,Supplemental,Space,126,False
7,0016-1,Castle Mini Figures,1978,,Castle,Castle,186,False
8,00-2,Weetabix Promotional House 1,1976,,Building,Legoland,411,False
9,00-3,Weetabix Promotional House 2,1976,,Building,Legoland,411,False
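A left join can silently multiply rows if the lookup table repeats a key, and it leaves NaNs where a key has no match. As a hedged sketch (the two small frames below are hypothetical stand-ins, not the real files), pandas' `validate` and `indicator` options on `merge` can guard against both:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
sets = pd.DataFrame({
    "set_num": ["00-1", "0011-2", "10018-1"],
    "parent_theme": ["Legoland", "Town", "Star Wars"],
})
themes = pd.DataFrame({
    "id": [411, 50, 158],
    "parent_theme": ["Legoland", "Town", "Star Wars"],
    "is_licensed": [False, False, True],
})

merged = sets.merge(
    themes,
    how="left",
    on="parent_theme",
    validate="many_to_one",  # raises MergeError if a theme name repeats in `themes`
    indicator=True,          # adds a `_merge` column: "both" or "left_only"
)

# Rows whose parent_theme found no match in the lookup table.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With `validate="many_to_one"` the row count of the result is guaranteed to equal the row count of the left table, which is the behaviour the VLOOKUP analogy assumes.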


Let's clean up the new DataFrame `merged_data`:
- Remove the "num_parts" column: it is not needed for this analysis.
- Remove any rows with NaN in the "set_num" column: this is very important, because a NaN there means the set is duplicated or invalid.
- Check whether there are any NaNs in the other columns.

In [11]:
# removing column num_parts
merged_data.drop(columns="num_parts",inplace=True)
merged_data.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
0,00-1,Weetabix Castle,1970,Castle,Legoland,411,False
1,0011-2,Town Mini-Figures,1978,Supplemental,Town,50,False
2,0011-3,Castle 2 for 1 Bonus Offer,1987,Lion Knights,Castle,186,False
3,0012-1,Space Mini-Figures,1979,Supplemental,Space,126,False
4,0013-1,Space Mini-Figures,1979,Supplemental,Space,126,False


In [12]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11986 entries, 0 to 11985
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   set_num       11833 non-null  object
 1   name          11833 non-null  object
 2   year          11986 non-null  int64 
 3   theme_name    11833 non-null  object
 4   parent_theme  11986 non-null  object
 5   id            11986 non-null  int64 
 6   is_licensed   11986 non-null  bool  
dtypes: bool(1), int64(2), object(4)
memory usage: 667.2+ KB


We see that there are a total of 11,986 rows in the merged dataset.<br> However, the Non-Null Count column shows only 11,833 non-null values for `set_num`, `name`, and `theme_name`, so null values are present.

In [13]:
merged_data[merged_data["set_num"].isna()].head(20)

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
11833,,,2017,,Disney Princess,579,True
11834,,,2016,,Disney Princess,579,True
11835,,,2016,,Disney Princess,579,True
11836,,,2017,,Super Heroes,482,True
11837,,,2017,,Super Heroes,482,True
11838,,,2016,,Minecraft,577,True
11839,,,2014,,Super Heroes,482,True
11840,,,2013,,Super Heroes,482,True
11841,,,2013,,Super Heroes,482,True
11842,,,2013,,Super Heroes,482,True


Total Number of null values in each column:

In [15]:
merged_data.isna().sum()

set_num         153
name            153
year              0
theme_name      153
parent_theme      0
id                0
is_licensed       0
dtype: int64

Removing Null Values:

In [16]:
# drop every row that has a NaN in any column (this covers all rows missing set_num)
merged_data.dropna(how="any",inplace=True)

In [17]:
merged_data.isna().sum()

set_num         0
name            0
year            0
theme_name      0
parent_theme    0
id              0
is_licensed     0
dtype: int64

All the Null Values were removed.
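Note that `dropna(how="any")` removes a row if any column is missing, which is stricter than dropping only rows with a missing `set_num`. The sketch below uses a hypothetical three-row frame to show how the two calls can differ when a non-critical column such as `num_parts` is still present:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 has no set_num, row 2 only lacks num_parts.
df = pd.DataFrame({
    "set_num": ["00-1", None, "0012-1"],
    "name": ["Weetabix Castle", None, "Space Mini-Figures"],
    "num_parts": [471.0, np.nan, np.nan],
})

# Drops rows with a NaN in any column: rows 1 and 2 both go.
any_dropped = df.dropna(how="any")

# Drops only rows where the critical key is missing: just row 1 goes.
key_dropped = df.dropna(subset=["set_num"])

print(len(any_dropped), len(key_dropped))
```

In this notebook the `num_parts` column was removed before the call, so `how="any"` is safe here; `subset=["set_num"]` is the more targeted option if other columns are allowed to contain gaps.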

In [18]:
merged_data.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
0,00-1,Weetabix Castle,1970,Castle,Legoland,411,False
1,0011-2,Town Mini-Figures,1978,Supplemental,Town,50,False
2,0011-3,Castle 2 for 1 Bonus Offer,1987,Lion Knights,Castle,186,False
3,0012-1,Space Mini-Figures,1979,Supplemental,Space,126,False
4,0013-1,Space Mini-Figures,1979,Supplemental,Space,126,False


In [19]:
licensed_true = merged_data[merged_data["is_licensed"]==True]
licensed_true.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
44,10018-1,Darth Maul,2001,Star Wars,Star Wars,158,True
45,10019-1,Rebel Blockade Runner - UCS,2001,Star Wars Episode 4/5/6,Star Wars,158,True
54,10026-1,Naboo Starfighter - UCS,2002,Star Wars Episode 1,Star Wars,158,True
57,10030-1,Imperial Star Destroyer - UCS,2002,Star Wars Episode 4/5/6,Star Wars,158,True
95,10075-1,Spider-Man Action Pack,2002,Spider-Man,Super Heroes,482,True


In [20]:
licensed_count = licensed_true.groupby("is_licensed")
licensed_count.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
44,10018-1,Darth Maul,2001,Star Wars,Star Wars,158,True
45,10019-1,Rebel Blockade Runner - UCS,2001,Star Wars Episode 4/5/6,Star Wars,158,True
54,10026-1,Naboo Starfighter - UCS,2002,Star Wars Episode 1,Star Wars,158,True
57,10030-1,Imperial Star Destroyer - UCS,2002,Star Wars Episode 4/5/6,Star Wars,158,True
95,10075-1,Spider-Man Action Pack,2002,Spider-Man,Super Heroes,482,True


In [21]:
licensed_count["parent_theme"].value_counts().head() # number of sets per licensed theme; Star Wars has 609

is_licensed  parent_theme                    
True         Star Wars                           609
             Super Heroes                        242
             Harry Potter                         67
             The Hobbit and Lord of the Rings     40
             Minecraft                            30
Name: parent_theme, dtype: int64

In [22]:
licensed_true.info() # info() shows how many rows are in licensed_true; this is the denominator for the Star Wars share
# rows in DF --> 1179

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1179 entries, 44 to 11822
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   set_num       1179 non-null   object
 1   name          1179 non-null   object
 2   year          1179 non-null   int64 
 3   theme_name    1179 non-null   object
 4   parent_theme  1179 non-null   object
 5   id            1179 non-null   int64 
 6   is_licensed   1179 non-null   bool  
dtypes: bool(1), int64(2), object(4)
memory usage: 65.6+ KB


In [23]:
the_force =  round((609/1179)*100) # Star Wars sets as a rounded percentage of all licensed sets ever released
print(the_force)

52
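The percentage above is typed in by hand from two earlier outputs. As a sketch of how the same figure could be derived directly from the data, the mean of a boolean mask gives the share of rows that match. The four-row frame here is hypothetical and only illustrates the technique; the final line re-checks the arithmetic used above:

```python
import pandas as pd

# Hypothetical miniature of licensed_true (every row is a licensed set).
licensed = pd.DataFrame({
    "set_num": ["a", "b", "c", "d"],
    "parent_theme": ["Star Wars", "Star Wars", "Super Heroes", "Harry Potter"],
    "is_licensed": [True, True, True, True],
})

# The mean of a boolean Series is the fraction of True values.
star_wars_share = (licensed["parent_theme"] == "Star Wars").mean() * 100
print(star_wars_share)

# The notebook's own numbers: 609 Star Wars sets out of 1179 licensed sets.
notebook_share = round(609 / 1179 * 100)
print(notebook_share)
```

Computing the share from the mask avoids copying counts between cells, so the result stays correct if the cleaning steps change.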


### __Q1 Answer:__<font color="yellow"> 52 percent</font>
***

### Q2: __In which year was Star Wars not the most popular licensed theme (in terms of number of sets released that year)?__ <br>


In [24]:
licensed_true.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
44,10018-1,Darth Maul,2001,Star Wars,Star Wars,158,True
45,10019-1,Rebel Blockade Runner - UCS,2001,Star Wars Episode 4/5/6,Star Wars,158,True
54,10026-1,Naboo Starfighter - UCS,2002,Star Wars Episode 1,Star Wars,158,True
57,10030-1,Imperial Star Destroyer - UCS,2002,Star Wars Episode 4/5/6,Star Wars,158,True
95,10075-1,Spider-Man Action Pack,2002,Spider-Man,Super Heroes,482,True


In [25]:
licensed_true[licensed_true["year"]==1999]

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
7607,7101-1,Lightsaber Duel,1999,Star Wars Episode 1,Star Wars,158,True
7728,7110-1,Landspeeder,1999,Star Wars Episode 4/5/6,Star Wars,158,True
7730,7111-1,Droid Fighter,1999,Star Wars Episode 1,Star Wars,158,True
7752,7121-1,Naboo Swamp,1999,Star Wars Episode 1,Star Wars,158,True
7799,7128-1,Speeder Bikes,1999,Star Wars Episode 4/5/6,Star Wars,158,True
7804,7130-1,Snowspeeder,1999,Star Wars Episode 4/5/6,Star Wars,158,True
7815,7131-1,Anakin's Podracer,1999,Star Wars Episode 1,Star Wars,158,True
7836,7140-1,X-wing Fighter,1999,Star Wars Episode 4/5/6,Star Wars,158,True
7837,7141-1,Naboo Fighter,1999,Star Wars Episode 1,Star Wars,158,True
7845,7150-1,TIE Fighter & Y-wing,1999,Star Wars Episode 4/5/6,Star Wars,158,True


In [26]:
licensed_true.groupby("year").count()["parent_theme"]

year
1999     13
2000     31
2001     25
2002     50
2003     40
2004     40
2005     35
2006     24
2007     21
2008     43
2009     48
2010     63
2011     81
2012    109
2013     86
2014     99
2015    107
2016    121
2017    143
Name: parent_theme, dtype: int64

In [27]:
not_popular = licensed_true.groupby("year")

In [28]:
not_popular2 = not_popular["parent_theme"].value_counts()

In [29]:
not_popular2.head(50)

year  parent_theme                    
1999  Star Wars                           13
2000  Star Wars                           26
      Disney's Mickey Mouse                5
2001  Star Wars                           14
      Harry Potter                        11
2002  Star Wars                           28
      Harry Potter                        19
      Super Heroes                         3
2003  Star Wars                           32
      Super Heroes                         5
      Harry Potter                         3
2004  Star Wars                           20
      Harry Potter                        14
      Super Heroes                         6
2005  Star Wars                           28
      Harry Potter                         5
      Disney's Mickey Mouse                1
      Super Heroes                         1
2006  Star Wars                           11
      Super Heroes                         8
      SpongeBob SquarePants                3
      Avatar    

In [30]:
not_popular2.tail(30)

year  parent_theme                    
2013  The Hobbit and Lord of the Rings    13
      Teenage Mutant Ninja Turtles         9
      The Lone Ranger                      8
      Minecraft                            2
2014  Star Wars                           45
      Super Heroes                        23
      Teenage Mutant Ninja Turtles        10
      Disney Princess                      8
      Minecraft                            7
      The Hobbit and Lord of the Rings     6
2015  Star Wars                           58
      Super Heroes                        28
      Jurassic World                       7
      Scooby-Doo                           5
      Disney Princess                      4
      Minecraft                            4
      The Hobbit and Lord of the Rings     1
2016  Star Wars                           61
      Super Heroes                        33
      Disney Princess                     11
      Minecraft                            7
      Angry Bird

### __Q2 Answer:__<font color="yellow"> year 2017</font>

<font size="7">__Another way to solve the problem above__</font>

In [31]:
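popular_set = licensed_true.sort_values("year")

# every row has is_licensed == True, so summing it counts the sets per (year, theme);
# numeric_only=True keeps the text columns out of the sum
popular_set2 = popular_set.groupby(["year","parent_theme"]).sum(numeric_only=True).reset_index()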
popular_set2.head()

Unnamed: 0,year,parent_theme,id,is_licensed
0,1999,Star Wars,2054,13
1,2000,Disney's Mickey Mouse,1940,5
2,2000,Star Wars,4108,26
3,2001,Harry Potter,2706,11
4,2001,Star Wars,2212,14


In [32]:
# sort by set count (largest first), then keep only the first row for each year
popular_max = popular_set2.sort_values(by="is_licensed",ascending=False).drop_duplicates(["year"])
popular_max.head(50)

Unnamed: 0,year,parent_theme,id,is_licensed
82,2017,Super Heroes,34704,72
76,2016,Star Wars,9638,61
67,2015,Star Wars,9164,58
59,2014,Star Wars,7110,45
47,2012,Star Wars,6794,43
32,2009,Star Wars,6162,39
52,2013,Star Wars,5530,35
9,2003,Star Wars,5056,32
42,2011,Star Wars,5056,32
36,2010,Star Wars,4740,30


In [33]:
popular_max.sort_values("year",inplace=True)
popular_max.head(20)

Unnamed: 0,year,parent_theme,id,is_licensed
0,1999,Star Wars,2054,13
2,2000,Star Wars,4108,26
4,2001,Star Wars,2212,14
6,2002,Star Wars,4424,28
9,2003,Star Wars,5056,32
12,2004,Star Wars,3160,20
16,2005,Star Wars,4424,28
20,2006,Star Wars,1738,11
24,2007,Star Wars,2528,16
28,2008,Star Wars,3634,23
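Both approaches above still require scanning the output by eye. A possible third variant, sketched here on a hypothetical six-row sample, builds a year-by-theme count table with `pd.crosstab` and filters for the years whose top theme is not Star Wars:

```python
import pandas as pd

# Hypothetical miniature of licensed_true with two years.
licensed = pd.DataFrame({
    "year": [2016, 2016, 2016, 2017, 2017, 2017],
    "parent_theme": ["Star Wars", "Star Wars", "Super Heroes",
                     "Super Heroes", "Super Heroes", "Star Wars"],
})

# Rows are years, columns are themes, cells are set counts.
table = pd.crosstab(licensed["year"], licensed["parent_theme"])

# For each year, the column (theme) holding the largest count.
top_theme = table.idxmax(axis=1)

# Keep only the years where that theme is something other than Star Wars.
not_star_wars = top_theme[top_theme != "Star Wars"]
print(not_star_wars)
```

One caveat: `idxmax` returns the first column in case of a tie, so a year with two equally popular themes would need a separate check.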


***

### __Additional Question: Break down the number of sets by year__

In [34]:
merged_data["year"].unique()

array([1970, 1978, 1987, 1979, 1976, 1965, 1985, 1968, 1999, 1967, 1969,
       2001, 1966, 2003, 2002, 2004, 2006, 2005, 2010, 2007, 2008, 2009,
       2011, 2012, 2013, 2014, 2015, 2016, 2017, 1977, 1983, 1986, 1984,
       1973, 1981, 2000, 1980, 1982, 1988, 1997, 1998, 1971, 1955, 1956,
       1957, 1958, 1974, 1972, 1975, 1992, 1991, 1989, 1990, 1993, 1994,
       1996, 1995, 1959, 1962, 1961, 1960, 1963, 1964, 1950, 1953, 1954],
      dtype=int64)

In [35]:
merged_data.head()

Unnamed: 0,set_num,name,year,theme_name,parent_theme,id,is_licensed
0,00-1,Weetabix Castle,1970,Castle,Legoland,411,False
1,0011-2,Town Mini-Figures,1978,Supplemental,Town,50,False
2,0011-3,Castle 2 for 1 Bonus Offer,1987,Lion Knights,Castle,186,False
3,0012-1,Space Mini-Figures,1979,Supplemental,Space,126,False
4,0013-1,Space Mini-Figures,1979,Supplemental,Space,126,False


In [36]:
merged_data.groupby("year").count().sort_values("parent_theme",ascending=False).head(66)

year,set_num,name,theme_name,parent_theme,id,is_licensed
2014,715,715,715,715,715,715
2015,670,670,670,670,670,670
2012,615,615,615,615,615,615
2016,608,608,608,608,608,608
2013,593,593,593,593,593,593
...,...,...,...,...,...,...
1965,10,10,10,10,10,10
1950,7,7,7,7,7,7
1959,4,4,4,4,4,4
1953,4,4,4,4,4,4


In [37]:
sets_per_year = merged_data.groupby("year").count().reset_index()
sets_per_year

Unnamed: 0,year,set_num,name,theme_name,parent_theme,id,is_licensed
0,1950,7,7,7,7,7,7
1,1953,4,4,4,4,4,4
2,1954,14,14,14,14,14,14
3,1955,28,28,28,28,28,28
4,1956,12,12,12,12,12,12
...,...,...,...,...,...,...,...
61,2013,593,593,593,593,593,593
62,2014,715,715,715,715,715,715
63,2015,670,670,670,670,670,670
64,2016,608,608,608,608,608,608


In [38]:
sets_year = sets_per_year[["year","parent_theme"]].sort_values("parent_theme",ascending=False).copy()


In [39]:
sets_year.rename(columns={"parent_theme":"sets"},inplace=True)


In [40]:
sets_year.head(100)

Unnamed: 0,year,sets
62,2014,715
63,2015,670
60,2012,615
64,2016,608
61,2013,593
...,...,...
13,1965,10
0,1950,7
7,1959,4
1,1953,4


Answer: 2014 had the most sets released, with a total of 715.
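The group-count-rename sequence above can likely be shortened. As a minimal sketch on a hypothetical six-row sample, `value_counts` on the `year` column returns the number of sets per year already sorted from most to fewest, and `idxmax` picks out the busiest year:

```python
import pandas as pd

# Hypothetical miniature of merged_data: one row per set.
sets = pd.DataFrame({
    "set_num": ["a", "b", "c", "d", "e", "f"],
    "year": [2014, 2014, 2014, 2015, 2015, 1950],
})

# Number of sets per year, sorted in descending order by default.
sets_per_year = sets["year"].value_counts()

busiest_year = sets_per_year.idxmax()  # year with the highest count
busiest_count = sets_per_year.max()    # that year's number of sets
print(busiest_year, busiest_count)
```

This counts rows, so it assumes each row is one distinct set, which holds once rows with a missing `set_num` have been dropped.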