# Interview Challenge

The objective of this challenge is to assess your ability to:
- Perform basic data manipulation and data pre-processing
- Demonstrate awareness of the computations involved
- Perform feature engineering
- Train and tune ML models
- Assess performance of the ML models
- Obtain clear, useful, and business-driven insights from data and models

**The objective of the model you create will be to predict whether a client will rate a movie as “high”.**

**Note:**
- If a user has several ratings, each rating must appear on a separate row
- Each column will correspond to a predictive variable 
- Response variable:
    - 1 in case the rating is >= 4 (flag for "high" rating)
    - 0 in case the rating is < 4
- Assume that this model will generate online predictions in a production setting. Be aware of the implications, and pay special attention to data leakage.
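
The notes above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy frame stands in for `rating.csv`, and the helper names `add_high_rating_flag` and `chronological_split` are hypothetical. A time-ordered split is one possible way to limit leakage for online predictions, because the model is then evaluated only on ratings made after the ones it was trained on.

```python
import pandas as pd


def add_high_rating_flag(ratings: pd.DataFrame, threshold: float = 4.0) -> pd.DataFrame:
    """Return a copy with a binary `high_rating` column (1 if rating >= threshold)."""
    out = ratings.copy()
    out["high_rating"] = (out["rating"] >= threshold).astype(int)
    return out


def chronological_split(ratings: pd.DataFrame, test_size: float = 0.3):
    """Split so that no training row is later than any test row."""
    ordered = ratings.sort_values("timestamp", kind="stable")
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy data standing in for rating.csv (values are illustrative only).
# One row per (user, movie) rating, as the notes require.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "movieId": [10, 11, 10, 12, 11, 13, 12, 14, 13, 14],
    "rating": [2.5, 5.0, 4.0, 4.0, 3.0, 3.5, 4.5, 1.0, 5.0, 0.5],
    "timestamp": pd.to_datetime([
        "2012-06-19", "2011-06-06", "2002-06-28", "2006-04-27", "1996-05-06",
        "2014-01-01", "2009-05-12", "2000-02-02", "2013-03-03", "2005-05-05",
    ]),
})

labelled = add_high_rating_flag(toy)
train, test = chronological_split(labelled, test_size=0.3)
print(train.shape, test.shape)
```

With a split like this, any aggregate feature (for example a user's average rating) would also need to be computed from rows earlier than the prediction time, otherwise information from the future leaks into training.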

## 1. EDA

### **tag.csv**

In [18]:
import pandas as pd
tag_df = pd.read_csv('data/tag.csv')

In [19]:
from sklearn.model_selection import train_test_split

# Main split 70% train - 30% test
tag_train, tag_test = train_test_split(tag_df, random_state=123, test_size=0.3)
tag_train.head(5)

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 339770 | 103076 | 4743 | remake | 2006-06-04 02:20:02 |
| 379512 | 119367 | 7158 | American dream | 2009-05-12 04:41:26 |
| 248167 | 71833 | 48304 | atmospheric | 2012-10-17 21:49:53 |
| 10906 | 1741 | 111722 | `quote:\\"Climb the stairway of mystery then cr...` | 2014-07-05 00:36:49 |
| 190117 | 57434 | 215 | bittersweet | 2011-07-08 12:50:15 |


In [None]:
print("Shape:", tag_train.shape)  # (325894, 4)
tag_train.info()  # info() prints its summary and returns None, so it is not wrapped in print()

Shape: (325894, 4)
<class 'pandas.core.frame.DataFrame'>
Index: 325894 entries, 339770 to 249342
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   userId     325894 non-null  int64 
 1   movieId    325894 non-null  int64 
 2   tag        325882 non-null  object
 3   timestamp  325894 non-null  object
dtypes: int64(2), object(2)
memory usage: 12.4+ MB


In [None]:
# Check for NaN
tag_train.isna().sum().sort_values(ascending=False)

tag          12
userId        0
movieId       0
timestamp     0
dtype: int64

### **rating.csv**

In [23]:
rating_df = pd.read_csv('data/rating.csv')
rating_train, rating_test = train_test_split(rating_df, random_state=123, test_size=0.3)
rating_train.head(5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 19165361 | 132599 | 45208 | 2.5 | 2012-06-19 22:49:51 |
| 12156569 | 83970 | 1617 | 5.0 | 2011-06-06 00:10:35 |
| 13004234 | 89791 | 2616 | 4.0 | 2002-06-28 04:00:05 |
| 355257 | 2395 | 1663 | 4.0 | 2006-04-27 22:11:56 |
| 12137396 | 83840 | 282 | 3.0 | 1996-05-06 14:19:04 |


In [24]:
rating_train.describe()

| | userId | movieId | rating |
|---|---|---|---|
| count | 14000180.0 | 14000180.0 | 14000180.0 |
| mean | 69046.28 | 9040.399 | 3.525504 |
| std | 40038.98 | 19788.3 | 1.052068 |
| min | 1.0 | 1.0 | 0.5 |
| 25% | 34395.0 | 902.0 | 3.0 |
| 50% | 69148.0 | 2167.0 | 3.5 |
| 75% | 103637.0 | 4770.0 | 4.0 |
| max | 138493.0 | 131262.0 | 5.0 |


### **movie.csv**

In [None]:
movie_df = pd.read_csv('data/movie.csv')
movie_train, movie_test = train_test_split(movie_df, random_state=123, test_size=0.3)

### **link.csv**

In [None]:
link_df = pd.read_csv('data/link.csv')
link_train, link_test = train_test_split(link_df, random_state=123, test_size=0.3)

### **genome_scores.csv**

In [None]:
genome_scores_df = pd.read_csv('data/genome_scores.csv')
genome_scores_train, genome_scores_test = train_test_split(genome_scores_df, random_state=123, test_size=0.3)

### **genome_tags.csv**

In [None]:
genome_tags_df = pd.read_csv('data/genome_tags.csv')
genome_tags_train, genome_tags_test = train_test_split(genome_tags_df, random_state=123, test_size=0.3)

## 2. Binary Classification

## 3. Feature Engineering

## 4. Model Implementation

Explain the process you followed to generate or choose the model. Do not invest too much time training or tuning your model. It will be enough for us if you choose an algorithm and a configuration of hyperparameters that you have seen work well for this type of problem. Please explain and justify your selection of the algorithm and hyperparameters.

## 5. Feature Importance

Give an explanation of the importance of each feature, and show us which of the features you created had the highest impact on your model. Explain and justify your choice of the importance metric.
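
One possible importance metric is permutation importance, which measures how much a held-out score drops when a single feature's values are shuffled. The sketch below shows the idea with scikit-learn's `permutation_importance` on synthetic data; the feature names and the data-generating process are hypothetical and chosen only so that one feature clearly dominates.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(123)

# Hypothetical feature names; the data is synthetic and for illustration only.
feature_names = ["signal", "weak_signal", "noise"]
X = rng.normal(size=(2000, 3))
y = (2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_train, X_test = X[:1400], X[1400:]
y_train, y_test = y[:1400], y[1400:]

model = LogisticRegression().fit(X_train, y_train)

# Shuffle each feature on held-out data and record the mean drop in ROC AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=123
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Computing the importance on held-out rows, rather than on the training set, keeps the ranking tied to generalization performance. Strongly correlated features can share or mask each other's importance under this metric, which is worth stating when justifying the choice.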

## 6. Conclusions

Add some comments summarizing your work. Also, comment on how you would improve it if you had more time.