# Unsupervised Learning Predict Student Solution

© Explore Data Science Academy



<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [2]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd

# Libraries for data preparation and model building
#import *

# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###
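
The placeholder constant above could be filled in along the following lines. This is a minimal sketch: the name `RANDOM_STATE` and the value `42` are assumptions, since the notebook does not specify either.

```python
import numpy as np

# Hypothetical constant name and value; the notebook leaves both unspecified.
RANDOM_STATE = 42

# A seeded generator returns the same draws on every run.
rng = np.random.default_rng(RANDOM_STATE)
sample_a = rng.integers(0, 100, size=5)

# Re-creating the generator with the same seed reproduces those draws.
sample_b = np.random.default_rng(RANDOM_STATE).integers(0, 100, size=5)

print((sample_a == sample_b).all())
```

Passing the same constant to every sampling or model-fitting call keeps the results stable between runs.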

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the provided CSV files (such as `train.csv`) into DataFrames. |

---

In [39]:
train = pd.read_csv('train.csv')
genome_score = pd.read_csv('genome_scores.csv')
genome_tags = pd.read_csv('genome_tags.csv')
imdb = pd.read_csv('imdb_data.csv')
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv')
test = pd.read_csv('test.csv')
submission = pd.read_csv('sample_submission.csv')

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [5]:
train.sort_values(by = 'userId').head()

Unnamed: 0,userId,movieId,rating,timestamp
6308822,1,296,5.0,1147880044
3137042,1,27721,3.0,1147869115
2533005,1,665,5.0,1147878820
2524478,1,4308,3.0,1147868534
1946297,1,1250,4.0,1147868414


In [6]:
test.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [7]:
submission.head()

Unnamed: 0,Id,rating
0,1_2011,1.0
1,1_4144,1.0
2,1_5767,1.0
3,1_6711,1.0
4,1_7318,1.0
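
The `Id` values in the sample submission appear to be `userId` and `movieId` joined by an underscore. A minimal sketch of building that key is shown here on toy rows copied from the `test.head()` output; the names `test_toy` and `submission_toy` and the constant placeholder rating are assumptions for illustration only.

```python
import pandas as pd

# Toy rows copied from the test.head() output above.
test_toy = pd.DataFrame({"userId": [1, 1, 1], "movieId": [2011, 4144, 5767]})

# Build the "userId_movieId" key used in the sample submission.
test_toy["Id"] = test_toy["userId"].astype(str) + "_" + test_toy["movieId"].astype(str)

# Placeholder prediction (assumption): one constant rating for every pair.
submission_toy = test_toy[["Id"]].assign(rating=3.5)
print(submission_toy)
```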


In [8]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [40]:
train_joined = pd.merge(train, movies, how='inner', on='movieId').sort_values(by=['userId', 'movieId'])
print(train_joined.shape)
train_joined.head()

(10000038, 6)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
502801,1,296,5.0,1147880044,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller
6852635,1,665,5.0,1147878820,Underground (1995),Comedy|Drama|War
6086993,1,899,3.5,1147868510,Singin' in the Rain (1952),Comedy|Musical|Romance
6325668,1,1175,3.5,1147868826,Delicatessen (1991),Comedy|Drama|Romance
5047680,1,1217,3.5,1147878326,Ran (1985),Drama|War


In [6]:
imdb.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [89]:
train_mo_imdb = pd.merge(train_joined,imdb, how = 'inner', on = 'movieId').sort_values(by = ['userId','movieId'])
train_mo_imdb.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,title_cast,director,runtime,budget,plot_keywords
0,1,296,5.0,1147880044,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,Tim Roth|Amanda Plummer|Laura Lovelace|John Tr...,Quentin Tarantino,154.0,"$8,000,000",nonlinear timeline|overdose|drug overdose|bondage
31697,1,665,5.0,1147878820,Underground (1995),Comedy|Drama|War,Predrag 'Miki' Manojlovic|Lazar Ristovski|Mirj...,Dusan Kovacevic,170.0,"$14,000,000",magical realism|communism|zoo|black comedy
32198,1,899,3.5,1147868510,Singin' in the Rain (1952),Comedy|Musical|Romance,,,,,
36605,1,1175,3.5,1147868826,Delicatessen (1991),Comedy|Drama|Romance,Pascal Benezech|Dominique Pinon|Marie-Laure Do...,Jean-Pierre Jeunet,99.0,"FRF24,000,000",black comedy|absurd comedy|surrealist|bed
39554,1,1217,3.5,1147878326,Ran (1985),Drama|War,,,,,


In [90]:
# Fraction of missing values per column
train_mo_imdb.isna().mean()

userId           0.000000
movieId          0.000000
rating           0.000000
timestamp        0.000000
title            0.000000
genres           0.000000
title_cast       0.270362
director         0.270184
runtime          0.275413
budget           0.327236
plot_keywords    0.270947
dtype: float64
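
One way to tell whether a missing value comes from a movie with no IMDb row at all, or from a blank field inside an existing row, is a left merge with `indicator=True`. The sketch below uses small toy frames (`ratings_toy` and `imdb_toy` are hypothetical stand-ins, not the real data).

```python
import pandas as pd

# Toy stand-ins for the ratings and IMDb tables (assumption: not the real data).
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [296, 899, 296],
    "rating": [5.0, 3.5, 4.0],
})
imdb_toy = pd.DataFrame({"movieId": [296], "director": ["Quentin Tarantino"]})

# A left merge keeps every rating; "_merge" records whether a match was found.
merged = ratings_toy.merge(imdb_toy, on="movieId", how="left", indicator=True)

# Share of ratings with ("both") and without ("left_only") an IMDb row.
match_share = merged["_merge"].value_counts(normalize=True)

# Fraction of missing values per column, as in the cell above.
missing_fraction = merged.isna().mean()
print(match_share)
print(missing_fraction)
```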

In [88]:
train_mo_imdb.shape

(10000038, 11)

In [55]:
train.userId.nunique()

162541

In [30]:
#test.head()

In [31]:
#submission.head()

In [16]:
# print(genome_score.shape)
# genome_score.head()

In [17]:
# genome_score.groupby(['movieId','tagId'])[['relevance']].sum()

In [18]:
# genome_score.tagId.nunique()

In [19]:
# print(genome_tags.shape)
# genome_tags.head()

In [20]:
# print(imdb.shape)
# imdb.head()

In [21]:
# links.head()

In [22]:
# movies.shape

In [29]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [41]:
train_joined1 = pd.merge(train_joined.drop(columns = 'timestamp'),tags, on = ['userId','movieId'], how = 'inner').sort_values(by='userId')
train_joined1

Unnamed: 0,userId,movieId,rating,title,genres,tag,timestamp
0,20,1210,5.0,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Sci-Fi,bah,1155082282
1,68,3481,4.5,High Fidelity (2000),Comedy|Drama|Romance,music,1472113217
11,87,102445,1.0,Star Trek Into Darkness (2013),Action|Adventure|Sci-Fi|IMAX,unoriginal,1522677043
9,87,102445,1.0,Star Trek Into Darkness (2013),Action|Adventure|Sci-Fi|IMAX,inferior sequel,1522677007
8,87,1127,5.0,"Abyss, The (1989)",Action|Adventure|Sci-Fi|Thriller,sci-fi,1542308464
...,...,...,...,...,...,...,...
333035,162501,112556,3.5,Gone Girl (2014),Drama|Thriller,crime,1421990253
333036,162501,112556,3.5,Gone Girl (2014),Drama|Thriller,the wife did it,1421990253
333037,162534,189169,2.5,Ugly Nasty People (2017),Comedy,comedy,1527518175
333038,162534,189169,2.5,Ugly Nasty People (2017),Comedy,disabled,1527518181


In [42]:
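# Extract the four-digit release year from the trailing "(YYYY)" in the title (NaN if absent)
train_joined1['Years'] = train_joined1.title.str.extract(r'\((\d{4})\)\s*$', expand=False)
# Store genres as a comma-separated string instead of a pipe-separated one
train_joined1['genres'] = train_joined1.genres.str.replace('|', ',', regex=False)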
train_joined1

Unnamed: 0,userId,movieId,rating,title,genres,tag,timestamp,Years
0,20,1210,5.0,Star Wars: Episode VI - Return of the Jedi (1983),"Action,Adventure,Sci-Fi",bah,1155082282,1983
1,68,3481,4.5,High Fidelity (2000),"Comedy,Drama,Romance",music,1472113217,2000
11,87,102445,1.0,Star Trek Into Darkness (2013),"Action,Adventure,Sci-Fi,IMAX",unoriginal,1522677043,2013
9,87,102445,1.0,Star Trek Into Darkness (2013),"Action,Adventure,Sci-Fi,IMAX",inferior sequel,1522677007,2013
8,87,1127,5.0,"Abyss, The (1989)","Action,Adventure,Sci-Fi,Thriller",sci-fi,1542308464,1989
...,...,...,...,...,...,...,...,...
333035,162501,112556,3.5,Gone Girl (2014),"Drama,Thriller",crime,1421990253,2014
333036,162501,112556,3.5,Gone Girl (2014),"Drama,Thriller",the wife did it,1421990253,2014
333037,162534,189169,2.5,Ugly Nasty People (2017),Comedy,comedy,1527518175,2017
333038,162534,189169,2.5,Ugly Nasty People (2017),Comedy,disabled,1527518181,2017


In [28]:
train_joined1.isna().sum()

userId               0
movieId              0
rating          760320
timestamp_x     760320
title           760320
genres          760320
tag            9915548
timestamp_y    9915532
dtype: int64

In [None]:
# plot relevant feature interactions

In [None]:
# evaluate correlation
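
A possible starting point for this cell is a pairwise Pearson correlation between numeric columns. The sketch below uses a toy frame built from a few of the rating and runtime values shown earlier; which columns are worth correlating in the real data is an open choice.

```python
import pandas as pd

# Toy rows using rating/runtime values from the merged output above;
# None marks movies whose runtime is missing.
toy = pd.DataFrame({
    "rating": [5.0, 5.0, 3.5, 3.5, 3.5],
    "runtime": [154.0, 170.0, None, 99.0, None],
})

# DataFrame.corr uses Pearson correlation by default and skips
# missing values pair by pair.
corr = toy[["rating", "runtime"]].corr()
rating_runtime_corr = corr.loc["rating", "runtime"]
print(corr)
```

With only three complete pairs the coefficient is not meaningful here; the point is the mechanics.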

In [None]:
# have a look at feature distributions
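
For the rating column, a distribution can be summarised without plotting by counting how often each value occurs. A minimal sketch on toy values (taken from the ratings displayed earlier, not the full dataset):

```python
import pandas as pd

# Toy ratings copied from the outputs shown earlier in this section.
ratings_toy = pd.DataFrame({"rating": [5.0, 3.0, 5.0, 3.0, 4.0, 3.5, 3.5, 3.5]})

# Relative frequency of each rating value, ordered by rating.
dist = ratings_toy["rating"].value_counts(normalize=True).sort_index()

# Count, mean, standard deviation and quartiles in one call.
summary = ratings_toy["rating"].describe()
print(dist)
print(summary)
```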

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# remove missing values/ features
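
The `budget` column mixes currencies (for example `$30,000,000` and `FRF24,000,000`) and contains missing values, so it cannot be used numerically as it stands. The sketch below is one possible cleaning step; it assumes that only dollar amounts are kept and that no currency conversion is attempted. `imdb_toy` and `budget_usd` are hypothetical names.

```python
import pandas as pd

# Toy rows using budget strings that appear in the outputs above.
imdb_toy = pd.DataFrame({
    "movieId": [1, 296, 1175, 899],
    "budget": ["$30,000,000", "$8,000,000", "FRF24,000,000", None],
})

# Keep only values quoted in dollars; anything else becomes NaN
# (assumption: other currencies are dropped rather than converted).
usd = imdb_toy["budget"].str.extract(r"^\$([\d,]+)$", expand=False)

# Remove the thousands separators and convert to a numeric column.
imdb_toy["budget_usd"] = pd.to_numeric(usd.str.replace(",", "", regex=False))
print(imdb_toy)
```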

In [None]:
# create new features
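
One candidate feature set is a 0/1 indicator column per genre, derived from the pipe-separated `genres` string. A minimal sketch on the first three rows of `movies` (the names `movies_toy`, `genre_flags` and `movies_features` are illustrative):

```python
import pandas as pd

# First three rows of the movies table shown earlier.
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One indicator column per genre, split on the pipe separator.
genre_flags = movies_toy["genres"].str.get_dummies(sep="|")

# Attach the indicators to the movie identifier.
movies_features = pd.concat([movies_toy[["movieId"]], genre_flags], axis=1)
print(movies_features)
```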

In [None]:
# engineer existing features
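
The `timestamp` column looks like Unix time in seconds, so it can be converted to a datetime and combined with the release year. The sketch below assumes that interpretation; `release_year`, `rated_at` and `movie_age_at_rating` are hypothetical column names.

```python
import pandas as pd

# Two rows from the joined output above (Pulp Fiction 1994, Underground 1995).
ratings_toy = pd.DataFrame({
    "movieId": [296, 665],
    "timestamp": [1147880044, 1147878820],
    "release_year": [1994, 1995],
})

# Interpret the timestamp as seconds since the Unix epoch (assumption).
ratings_toy["rated_at"] = pd.to_datetime(ratings_toy["timestamp"], unit="s")

# How old the movie was, in years, when the user rated it.
ratings_toy["movie_age_at_rating"] = (
    ratings_toy["rated_at"].dt.year - ratings_toy["release_year"]
)
print(ratings_toy)
```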

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more models that are able to accurately predict the rating a user would give to a movie. |

---

In [None]:
# split data
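
A simple random holdout can be carved out with pandas alone. This is only a sketch: the 80/20 ratio, the seed and the frame `ratings_toy` are assumptions, and a per-user or time-based split may suit a recommender better.

```python
import pandas as pd

RANDOM_STATE = 42  # hypothetical seed, see the constants in section 1

# Toy ratings table (assumption: stands in for the real train data).
ratings_toy = pd.DataFrame({
    "userId": [1] * 5 + [2] * 5,
    "movieId": list(range(10)),
    "rating": [5.0, 3.0, 4.0, 3.5, 2.0, 4.5, 1.0, 3.0, 4.0, 5.0],
})

# Sample 20% of the rows as a validation set, reproducibly.
valid = ratings_toy.sample(frac=0.2, random_state=RANDOM_STATE)

# Everything that was not sampled stays in the training part.
train_part = ratings_toy.drop(valid.index)
print(len(train_part), len(valid))
```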

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models
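
Before trying heavier models, a bias baseline gives a useful reference point: predict the global mean rating, shifted by how a movie and a user deviate from it. The sketch below is not the notebook's chosen model; the toy data, the function name `predict` and the 0.5 to 5.0 clipping range are all assumptions.

```python
import pandas as pd

# Toy training ratings (assumption: stands in for the real train data).
train_toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [5.0, 3.0, 4.0, 2.0, 4.0],
})

global_mean = train_toy["rating"].mean()

# Movie bias: average deviation of a movie's ratings from the global mean.
movie_bias = (train_toy["rating"] - global_mean).groupby(train_toy["movieId"]).mean()

# User bias: average of what is left after removing the movie bias.
residual = train_toy["rating"] - global_mean - train_toy["movieId"].map(movie_bias)
user_bias = residual.groupby(train_toy["userId"]).mean()


def predict(user_id, movie_id):
    """Baseline estimate; unseen users or movies contribute a bias of zero."""
    estimate = (
        global_mean
        + movie_bias.get(movie_id, 0.0)
        + user_bias.get(user_id, 0.0)
    )
    # Clip to an assumed rating scale of 0.5 to 5.0.
    return float(min(5.0, max(0.5, estimate)))


print(predict(1, 10), predict(99, 999))
```

Any more elaborate model should beat this baseline on the holdout set to justify its extra cost.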

In [None]:
# evaluate one or more ML models
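
Rating predictions are commonly scored with root mean squared error (RMSE). The notebook does not name its metric, so treating RMSE as the target here is an assumption; the helper `rmse` is illustrative.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: errors of 1, 0 and 2 give a mean squared error of 5/3.
actual = [5.0, 3.0, 4.0]
predicted = [4.0, 3.0, 2.0]
score = rmse(actual, predicted)
print(score)
```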

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance
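
A comparison can be as simple as scoring each candidate on the same holdout rows and sorting by error. The sketch below compares two deliberately simple predictors on toy data; the data, the model names and the use of RMSE are assumptions.

```python
import numpy as np
import pandas as pd

# Toy train and holdout sets (assumption: not the real data).
train_toy = pd.DataFrame({"movieId": [10, 10, 20, 20], "rating": [5.0, 5.0, 1.0, 1.0]})
holdout_toy = pd.DataFrame({"movieId": [10, 20, 30], "rating": [5.0, 1.0, 3.0]})

global_mean = train_toy["rating"].mean()
movie_mean = train_toy.groupby("movieId")["rating"].mean()

# Candidate predictions for the holdout rows; movies unseen in training
# fall back to the global mean.
preds = {
    "global_mean": np.full(len(holdout_toy), global_mean),
    "movie_mean": holdout_toy["movieId"].map(movie_mean).fillna(global_mean).to_numpy(),
}

# RMSE per model on the holdout set, best (lowest) first.
y_true = holdout_toy["rating"].to_numpy()
results = pd.Series(
    {name: float(np.sqrt(np.mean((y_true - p) ** 2))) for name, p in preds.items()}
).sort_values()
best_model = results.index[0]
print(results)
```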

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss the chosen method's logic