# **How good is the movie?**

This is a proof-of-concept (PoC) project to predict how good a movie is. The data comes from the [MovieLens 20M dataset on Kaggle](https://www.kaggle.com/grouplens/movielens-20m-dataset?select=tag.csv). I want to predict whether a movie is good or bad, treating a rating of 4 or higher as good and a rating below 4 as bad.

In [None]:
import pandas as pd
import io
from google.colab import files
from google.colab import drive
import numpy as np
import re
import itertools
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn import svm
import matplotlib.pyplot as plt
! pip install shap
import shap
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# **Data Exploration**

There are 6 sources of information available. 
* genome scores:
     - csv type
     - movieId integer
     - tagId integer 
     - relevance (range 0-1) float 
     - 11709768 registers
   
* genome tags:
     - csv type
     - tag id integer 
     - tag string (movie characteristics)
     - 1128 registers
* link:
     - csv type
     - movieId integer
     - imdbId integer
     - tmdbId integer
     - 27278 registers
     
* movie:
     - csv type
     - movieId integer
     - titles string (movie name)
     - genres string (genres within the movie)
     - 27278 registers

* rating:
     - csv type
     - userId integer
     - movieId integer
     - rating float (rating given by the user, 0.5-5.0)
     - timestamp (timestamp)
     - 20000263 registers

* tag:
     - csv type
     - userId integer
     - movieId integer
     - tag string (specific tag written by a user)
     - timestamp (timestamp)
     - 465564 registers
        

# **Data import**

I load the data with the code below.

In [None]:
drive.mount('/content/drive')

In [None]:
!ls "/content/drive/My Drive/dataset"

In [None]:
# genome scores 
genome_scores_df = pd.read_csv('/content/drive/My Drive/dataset/genome_scores.csv')
genome_scores_df

In [None]:
print("min relevance genome score is: ", genome_scores_df['relevance'].min())
print("max relevance genome score is: ", genome_scores_df['relevance'].max())

In [None]:
rel_hist = genome_scores_df['relevance'].hist(bins=10)

In [None]:
rel_boxplot = genome_scores_df.boxplot(column=['relevance'])
print("relevance average: ", genome_scores_df['relevance'].mean())
print("relevance q25: ", genome_scores_df['relevance'].quantile(.25))
print("relevance q75: ", genome_scores_df['relevance'].quantile(.75))

In [None]:
genome_tags_df = pd.read_csv("/content/drive/My Drive/dataset/genome_tags.csv")

In [None]:
genome_tags_df

In [None]:
link_df = pd.read_csv("/content/drive/My Drive/dataset/link.csv")

In [None]:
link_df

In [None]:
movie_df = pd.read_csv("/content/drive/My Drive/dataset/movie.csv")

In [None]:
movie_df

In [None]:
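def get_year_from_title(title):
    # Parse the text inside the first pair of parentheses as the release year.
    # Titles without a parsable year get 0 as a placeholder.
    try:
        year = int(title.split('(')[1].split(')')[0])
    except (AttributeError, IndexError, ValueError):
        year = 0
    return year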
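
Titles that contain an earlier parenthesised group, such as an alternate title, make `title.split('(')[1]` pick up the wrong text, so the function falls back to 0. Below is a minimal sketch of a stricter parser, assuming the release year is always the last four-digit group in parentheses at the end of the title. The helper name `extract_year` and the sample titles are illustrative and not part of the notebook.

```python
import re

# Matches a four-digit year in parentheses at the very end of the title.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing (YYYY) year of a title, or 0 when none is found."""
    if not isinstance(title, str):
        return 0
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else 0

# Illustrative titles: plain, with an alternate title in parentheses, and without a year.
sample_titles = [
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Untitled Project",
]
years = [extract_year(t) for t in sample_titles]
print(years)
```

Switching to a parser like this would change the number of movies counted as having no year, so any statistics derived from `year` would need to be recomputed.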

In [None]:
movie_df['year'] = list(map(get_year_from_title, movie_df['title']))

In [None]:
movie_df

In [None]:
#list of genres
def get_genre_list(genre_array):
    genres = []
    for genre in genre_array:
        try:
            genres.append(genre.split('|'))
        except:
            genres.append(genre)
    return genres

In [None]:
movies = list(itertools.chain(*get_genre_list(movie_df['genres'])))

In [None]:
genres = set(movies)

In [None]:
len(genres)

In [None]:
movie_df['year'].value_counts()

In [None]:
rating_df = pd.read_csv("/content/drive/My Drive/dataset/rating.csv")

In [None]:
rating_df

In [None]:
print("the min rating", rating_df["rating"].min())
print("the max rating", rating_df["rating"].max())

In [None]:
rating_df["timestamp"].min()

In [None]:
rating_df["timestamp"].max()

In [None]:
hist_rating = rating_df["rating"].hist( bins = 5)

In [None]:
tag_df = pd.read_csv("/content/drive/My Drive/dataset/tag.csv")

In [None]:
# Copy the slice so the assignments below do not write into a view of tag_df.
tag_df_transform = tag_df[['movieId','tag']].copy()
def cast_string(tag):
    return str(tag)
tag_df_transform['tag'] = list(map(cast_string,tag_df_transform['tag']))
tag_df_transform['movieId'] = list(map(cast_string,tag_df_transform['movieId']))

In [None]:
tags_per_movie = tag_df_transform.groupby('movieId')['tag'].apply(list).reset_index(name='tags')

In [None]:
def get_tag_len(tags):
    return len(tags)
tags_per_movie['quantity_tags'] = list(map(get_tag_len,tags_per_movie['tags']))

In [None]:
tag_df

In [None]:
tags_per_movie_df = tags_per_movie[['movieId','quantity_tags']]

In [None]:
tags_per_movie['quantity_tags'].hist(bins=100)

# **Summary Exploratory Analysis**

From the descriptive analysis above, we can infer the following:

* Most relevance scores are low; very few are high.
* Users have created 1128 different tags so far to assign to movies.
* There are 27278 movies, and most of them contain the year in the title. However, around 19.2% (5218) do not have a year assigned.
* The movies span the years 1913 to 2013. The year with the most movies is 2013.
* There are 20 different genres.
* The ratings were given between 1995 and 2015.
* Most of the movies have a rating of 4 or higher.
* Not all movies have a tag assigned: only 71.6% (19545) have at least one tag.
* Users tend to assign around 10-15 tags per movie.

# **Data Science Application: Feature Engineering**

I am interested in whether a movie is good or bad, based on a rating of 4.0 or higher. I am therefore going to use the rating data as my main dataframe for the prediction, since it contains the user activity at the moment of rating the movie.

- First, I am going to work with the relevance. The relevance represents how much impact the object has on users' attention. `genome_scores_df` holds a relevance value for each pair of movie ID and tag ID, that is, how relevant each of the 1128 tags is for each movie. I am going to create a feature called average relevance, which represents the general relevance of the movie based on its tags.

In [None]:
# This returns the aggregate average of relevance for each movie given the presented tags
df_rel_avg_per_movie = genome_scores_df[['movieId','relevance']].groupby('movieId')['relevance'].mean().reset_index(name='rel_avg')

In [None]:
# This returns the aggregate max of relevance for each movie given the presented tags
df_rel_max_per_movie = genome_scores_df[['movieId','relevance']].groupby('movieId')['relevance'].max().reset_index(name='rel_max')

In [None]:
# This returns the aggregate min of relevance for each movie given the presented tags
df_rel_min_per_movie = genome_scores_df[['movieId','relevance']].groupby('movieId')['relevance'].min().reset_index(name='rel_min')

In [None]:
main_df = rating_df.merge(df_rel_avg_per_movie, how='left', on='movieId')

In [None]:
main_df = main_df.merge(df_rel_max_per_movie, how='left', on='movieId')

In [None]:
main_df = main_df.merge(df_rel_min_per_movie, how='left', on='movieId')

In [None]:
main_df.head()

In [None]:
main_df['rel_diff'] = main_df['rel_max'] - main_df['rel_min']
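
The three per-movie aggregations and the spread can also be computed in a single pass. This is a minimal sketch on a toy frame: the values are made up, and only the column names `movieId`, `tagId` and `relevance` come from the dataset.

```python
import pandas as pd

# Toy stand-in for genome_scores_df (values are illustrative).
toy_scores = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2],
    "tagId": [1, 2, 3, 1, 2],
    "relevance": [0.10, 0.50, 0.90, 0.20, 0.40],
})

# Named aggregation: one output column per statistic, one row per movie.
rel_features = (
    toy_scores.groupby("movieId")["relevance"]
    .agg(rel_avg="mean", rel_max="max", rel_min="min")
    .reset_index()
)
# Spread between the most and least relevant tag of each movie.
rel_features["rel_diff"] = rel_features["rel_max"] - rel_features["rel_min"]
print(rel_features)
```

The resulting frame could then be joined to the ratings with one `merge` on `movieId` instead of three.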

- Second, I am going to convert my dependent variable into a binary classification target.

In [None]:
# Define the binary target: 1 = rating >= 4, 0 = rating < 4.
def get_score_response(val):
    return 1 if val >= 4 else 0
main_df['rating_score'] = list(map(get_score_response,main_df['rating']))

 - Third, I am going to work with `movie_df`. The dataframe contains the title and genres of each movie. From here, I am interested in extracting the year of the movie with some text processing and in turning the genres into indicator labels, creating one column for each genre.

In [None]:
def confirm_genre_content(genres_desc,define_genre):
    value = 0
    if define_genre in genres_desc:
        value = 1
    else:
        value = 0
    return value

In [None]:
list_genres = []
for genre in genres:   
    for genre_desc in movie_df['genres']:
        list_genres.append(confirm_genre_content(genre_desc,genre))
    movie_df[genre] = list_genres
    list_genres = []
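
The same indicator columns can be produced with pandas string methods. This is a minimal sketch on a toy frame, assuming genres are separated by `|` as in `movie_df`; the rows are made up.

```python
import pandas as pd

# Toy stand-in for movie_df (rows are illustrative).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Action|Sci-Fi", "Drama", "(no genres listed)"],
})

# One 0/1 column per distinct genre token found in the "genres" strings.
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")
toy_movies = pd.concat([toy_movies, genre_flags], axis=1)
print(toy_movies)
```

One difference worth noting: `str.get_dummies` matches whole genre tokens, whereas the `in` check above is a substring test on the raw string.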

In [None]:
main_df = main_df.merge(movie_df, how='left', on='movieId')

In [None]:
main_df['movieId'] = list(map(cast_string,main_df['movieId']))

 - Fourth, from `tag_df`, which contains the user ID, movie ID, tag and timestamp of each tag assignment, I am going to derive the number of tags a movie has, the number of tags a user has assigned overall, and the number of times a user tagged a specific movie.

In [None]:
main_df = main_df.merge(tags_per_movie_df, how='left', on='movieId')

In [None]:
tag_df_transform_user = tag_df[['userId','tag']]
tags_per_user = tag_df_transform_user.groupby('userId')['tag'].apply(list).reset_index(name='tags_user')
tags_per_user['user_tags'] = list(map(get_tag_len,tags_per_user['tags_user']))

In [None]:
main_df['userId'] = main_df['userId'].astype(int)

In [None]:
tags_per_user['userId'] = tags_per_user['userId'].astype(int)

In [None]:
main_df = main_df.merge(tags_per_user, how='left', on='userId')

In [None]:
# Copy the slice so that adding a column does not write into a view of tag_df.
tag_df_user_movie = tag_df[['userId','movieId']].copy()
tag_df_user_movie['amount_us2mov'] = 1

In [None]:
tag_df_user_movie = tag_df_user_movie.groupby(['userId','movieId'])['amount_us2mov'].sum().reset_index()

In [None]:
main_df['movieId'] = main_df['movieId'].astype(int)

In [None]:
main_df = main_df.merge(tag_df_user_movie, how='left', on=['userId','movieId'])
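
The three tag counts described above are all group sizes, so they can be sketched with `groupby(...).size()`. The toy frame below is made up; only the column names and the output names `quantity_tags`, `user_tags` and `amount_us2mov` follow the notebook.

```python
import pandas as pd

# Toy stand-in for tag_df (rows are illustrative).
toy_tags = pd.DataFrame({
    "userId": [10, 10, 10, 20],
    "movieId": [1, 1, 2, 1],
    "tag": ["dark", "slow", "funny", "dark"],
})

# Number of tags each movie received.
tags_per_movie_toy = toy_tags.groupby("movieId").size().reset_index(name="quantity_tags")
# Number of tags each user assigned overall.
tags_per_user_toy = toy_tags.groupby("userId").size().reset_index(name="user_tags")
# Number of tags each user assigned to each specific movie.
tags_user_movie_toy = (
    toy_tags.groupby(["userId", "movieId"]).size().reset_index(name="amount_us2mov")
)
print(tags_user_movie_toy)
```

Counting rows directly avoids building a Python list of tags per group only to take its length.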

In [None]:
main_df = main_df.fillna(0)

In [None]:
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'])

In [None]:
def get_month(date):
    return date.month
def get_day(date):
    return date.day
def get_hour(date):
    return date.hour
def get_year(date):
    return date.year

In [None]:
main_df['month'] = list(map(get_month,main_df['timestamp']))

In [None]:
main_df['day'] = list(map(get_day,main_df['timestamp']))

In [None]:
main_df['hour'] = list(map(get_hour,main_df['timestamp']))

In [None]:
main_df['date_year'] = list(map(get_year,main_df['timestamp']))
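
Once the column holds datetimes, the same date parts are available through the `.dt` accessor without a Python-level loop. This is a minimal sketch, assuming the timestamps are stored as date-time strings; the sample rows are made up.

```python
import pandas as pd

# Toy stand-in for the rating timestamps (values are illustrative).
toy_ratings = pd.DataFrame({
    "timestamp": ["2005-04-02 23:53:47", "2012-12-31 08:15:00"],
})

# Parse the whole column at once, then read the parts via .dt.
toy_ratings["timestamp"] = pd.to_datetime(toy_ratings["timestamp"])
toy_ratings["month"] = toy_ratings["timestamp"].dt.month
toy_ratings["day"] = toy_ratings["timestamp"].dt.day
toy_ratings["hour"] = toy_ratings["timestamp"].dt.hour
toy_ratings["date_year"] = toy_ratings["timestamp"].dt.year
print(toy_ratings)
```

On a frame with millions of ratings, the vectorized accessors are typically much faster than mapping a helper function over every row.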

In [None]:
main_df['year_rating'] = main_df['date_year'] - main_df['year']

In [None]:
main_df = main_df[main_df['date_year'] >= 2008]

In [None]:
len(main_df)

In [None]:
main_df.head()

In [None]:
# remove columns that I do not need.
# main_df = main_df.drop(columns=['timestamp','userId','movieId','rating','title','genres','tags_user','date_year'])

In [None]:
main_df.head()

In [None]:
main_df = main_df.sort_values(by=['userId','timestamp'])

In [None]:
main_df.to_csv('/content/drive/My Drive/dataset/main_df.csv')

# **Data Science Application: Data normalization**

In [None]:
main_df = pd.read_csv('/content/drive/My Drive/dataset/main_df.csv')

In [None]:
main_df = main_df.drop(columns = 'Unnamed: 0')

In [None]:
# remove columns that I do not need.
main_df = main_df.drop(columns=['timestamp','userId','movieId','rating','title','genres','tags_user','date_year'])

In [None]:
main_df.head()

In [None]:
def clean_year_rating(val):
    # Movies without a parsed year have year = 0, which gives a gap of about 2000 years; reset those to 0.
    if val > 200:
        return 0
    return val
main_df['year_rating'] = list(map(clean_year_rating,main_df['year_rating']))

In [None]:
main_df['year_rating'].hist()

In [None]:
y = main_df['rating_score']
main_df = main_df.drop(columns=['rating_score'])
x = main_df

In [None]:
x['y_lag1'] = y.shift(1)

# **Data Science Application: Second Feature Engineering**

After getting the baseline result, I saw that the time features are not meaningful for the result, and neither are most of the movie genres. The genres that seem to be useful for the model are Sci-Fi, Drama and Action, so I will drop the rest. The number of tags that a user puts on a movie does not have any impact on the model either, so I am going to delete that feature as well.

In [None]:
# Drop the genres and the other features that did not help the model
x = x.drop(columns=['War','Thriller','Fantasy','Horror','(no genres listed)','Crime','IMAX','Children','Animation','Comedy','Musical','Documentary','Romance','Film-Noir','Western','Adventure','Mystery','amount_us2mov','month','day','hour'])

In [None]:
scaler = MinMaxScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x))

In [None]:
x_scaled.columns = x.columns

In [None]:
x_scaled.head()

In [None]:
x_scaled.columns

In [None]:
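# Fill the remaining NaNs (e.g. the first y_lag1 value) in the scaled frame, not in the unscaled x.
x_scaled = x_scaled.fillna(0)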

In [None]:
# The data is balanced
y.value_counts()

# **Data Science Application: Machine Learning Model**

In [None]:
# I tried to be efficient, but the computer crashes when the results are computed in parallel like this
#clf = RandomForestClassifier(n_estimators = 50, criterion='entropy', max_depth=5, random_state=0)
#accuracies = cross_val_score(estimator=clf, X = x_scaled, y = y, cv = 10, n_jobs = -1, verbose=1)

In [None]:
len(x_scaled.columns)

In [None]:
xscale = x_scaled.iloc[:1000,:]

In [None]:
yscale = y[:1000]

In [None]:
xscale.to_csv('/content/drive/My Drive/dataset/xscale.csv')

In [None]:
yscale.to_csv('/content/drive/My Drive/dataset/yscale.csv')

In [None]:
parameters = {
    'n_estimators': [5, 10, 20, 50, 100, 150, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [1,2,3,5],
    'min_weight_fraction_leaf':[0.0,0.1,0.2],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
    'max_features' : ['sqrt','log2'],
    'min_impurity_decrease' : [0.0,0.1,0.01],
    'ccp_alpha': [0.0, 0.1, 0.01, 0.05, 0.03, 0.02]
}

In [None]:
estimator = RandomForestClassifier(
    random_state=42
)

In [None]:
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring = 'roc_auc',
    n_jobs = 10,
    cv = 10,
    verbose=True
)

In [None]:
grid_search.fit(xscale, yscale)
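
An exhaustive grid search fits one model per parameter combination and per fold, so the cost grows as the product of the list lengths. The sketch below counts the fits for a grid shaped like the one above, here with two `max_features` options; the variable names `demo_grid`, `n_candidates` and `total_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid mirroring the shape of the search space (two max_features options assumed).
demo_grid = {
    "n_estimators": [5, 10, 20, 50, 100, 150, 200],
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 5],
    "min_weight_fraction_leaf": [0.0, 0.1, 0.2],
    "max_features": ["sqrt", "log2"],
    "min_impurity_decrease": [0.0, 0.1, 0.01],
    "ccp_alpha": [0.0, 0.1, 0.01, 0.05, 0.03, 0.02],
}

# ParameterGrid enumerates every combination; its length is the product of the list sizes.
n_candidates = len(ParameterGrid(demo_grid))
n_folds = 10
total_fits = n_candidates * n_folds
print(n_candidates, total_fits)
```

If that many fits is too expensive, scikit-learn's `RandomizedSearchCV` samples a fixed number of candidates (`n_iter`) from the same space instead of trying all of them.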

In [None]:
grid_search.best_estimator_

In [None]:
k = 5
kf = KFold(n_splits=k, random_state=None)

clf = RandomForestClassifier(n_estimators=50, criterion='entropy',max_depth=7)
 
acc_score = []
 
for train_index , test_index in kf.split(x_scaled):
    X_train , X_test = x_scaled.iloc[train_index,:],x_scaled.iloc[test_index,:]
    y_train , y_test = y.iloc[train_index] , y.iloc[test_index]

    clf.fit(X_train,y_train)
    pred_values = clf.predict(X_test)
     
    acc = accuracy_score(y_test, pred_values)
    acc_score.append(acc)
    print(acc)
avg_acc_score = sum(acc_score)/k
 
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
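
The manual fold loop can also be expressed with `cross_val_score`, which handles the splitting, fitting and scoring. This is a minimal sketch on synthetic data from `make_classification`, not on the movie features; the names `X_demo`, `y_demo` and `clf_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data standing in for x_scaled and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

clf_demo = RandomForestClassifier(
    n_estimators=50, criterion="entropy", max_depth=7, random_state=0
)
# Five consecutive folds without shuffling, as in the loop above.
cv = KFold(n_splits=5)
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
print("accuracy of each fold:", scores)
print("avg accuracy:", scores.mean())
```

Each fold trains a fresh clone of the estimator, so no fitted state carries over between folds.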

In [None]:
# This can take a while: SHAP values are computed for every row of x_scaled with the 50-tree forest fitted above.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(x_scaled)

In [None]:
# First iteration of the model: the features Musical, Crime, Horror, Comedy, Adventure, War, Western, Film-Noir and Romance
# are not useful to the model. Thus, I am going to drop them and fit the model again to check whether it responds better with less noisy data.
shap.summary_plot(shap_values, x_scaled, plot_type="bar")

In [None]:
# ROC curve, built from the predicted probability of the positive class on the last fold
fpr, tpr, _ = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
#create ROC curve
plt.plot(fpr,tpr)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
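
A ROC curve is traced by sweeping a decision threshold over scores, so it is most informative when computed from predicted probabilities rather than hard 0/1 labels. This is a minimal sketch on synthetic data that also reports the area under the curve; the data and the names `model`, `proba` and `auc_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the movie features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, max_depth=7, random_state=0)
model.fit(X_tr, y_tr)

# Probability of the positive class (column 1) for each held-out row.
proba = model.predict_proba(X_te)[:, 1]
fpr_demo, tpr_demo, thresholds = roc_curve(y_te, proba)
auc_demo = roc_auc_score(y_te, proba)
print("AUC:", auc_demo)
```

With hard labels there is only one operating point, so the plotted curve collapses to two straight segments; probabilities give one point per distinct score.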