# Team 12 - Unsupervised Predict

© Explore Data Science Academy

---

## Team Members

 - [Karabo Mampuru](https://www.linkedin.com/in/karabo-mampuru-318118a6/)
 - [Muhammad](https://www.linkedin.com/in/mpilenhle-hlatshwayo-70544b169/)
 - [Bohlale Kekana](https://www.linkedin.com/in/bohlale-kekana-8b753320b/)
 - [Tsepo](https://www.linkedin.com/in/sello-sydney-mafikeng-46a664110/)
 - [Mpilenhle Hlatshwayo](https://www.linkedin.com/in/mpilenhle-hlatshwayo-70544b169/)


## Introduction: Movie Recommender

Many businesses depend on meeting, and ultimately exceeding, user satisfaction. When users are not satisfied, website traffic falls and the churn rate rises. Recommending movies that a user is likely to enjoy raises user satisfaction, which in turn increases profit and lowers churn. For our company, Netflix, good recommendations keep users engaged on the site because they keep finding movies they like.
 

<img src="https://raw.githubusercontent.com/kmsekgothe/unsupervised-predict-streamlit-template/master/resources/imgs/Image_header.png" width=75%/>


## Predict Overview

We have been tasked with creating a Machine Learning model that can predict a user's next favourite movie. The first approach bases the prediction on that user's past movie preferences, giving a `Content Based Recommender`. The second bases it on the ratings of all users who have watched and rated a movie, giving a `Collaborative Based Recommender`. The two recommenders are built with different methods; because we work with text data, vectorization is used.


The structure of this notebook is as follows:

 - First, we'll load our data to get a view of the predictor variables we will be modeling with. Since this is unsupervised learning, there is no response variable.

 - We will then preprocess our data, cleaning or removing unwanted characters such as `Strokes |` `RT` `@ mentions` `urls` `emojis`.

 - We will then normalise the tokens in the text using processes such as `Stemming` and `Lemmatizing`.

 - We then vectorise the text and split the data into train and test sets.

 - We then model our data using various models such as `SVC`, `Naive Bayes` and `Linear Logistics`, paying attention to multiclass versus binary classification model parameters.

 - Following this modeling, we use the f1 score to determine the best performing model.

 - Using this metric, we then take several steps to improve our base model's performance by optimising the hyperparameters of the model through a `grid search strategy`, then use `K best` for feature selection.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Preprocessing </a>

<a href=#five>5. Feature Engineering</a>

<a href=#six>6. Modeling</a>

<a href=#seven>7. Model Performance</a>

<a href=#eight>8. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages

In [2]:
# Import necessary libraries

import numpy as np
import pandas as pd
import cufflinks as cf
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

import nltk
from nltk.corpus import stopwords
import re
import string
# importing tokenizing library
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
#importing stemmer library
from nltk import SnowballStemmer

from sklearn.ensemble import RandomForestRegressor

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (15,10)})

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [3]:
import datetime as dt
import requests
import bs4 as bs
import urllib.request

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Load the train and test datasets, together with the supporting movie metadata, from a local copy of the GitHub repository.

In [5]:
# get IMDB data
imdb_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/imdb_data.csv')

# get movies data
movies_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/movies.csv')

# get tags
tags_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/tags.csv')

# get test data
test_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/test.csv')

# get genome scores
genome_scores_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/genome_scores.csv')

# get genome tags
genome_tags_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/genome_tags.csv')

# get train data
train_data = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/unspervised/predict/unsupervised-predict-streamlit-template/Data/train.csv')
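
The seven reads above repeat the same absolute directory, which ties the notebook to one machine. A minimal sketch of one way to centralise the location follows; `DATA_DIR`, `DATASET_NAMES` and `dataset_paths` are hypothetical names introduced here for illustration, and the relative folder `Data` is an assumption about where the CSV files live.

```python
from pathlib import Path

# Hypothetical location of the CSV files; adjust to the local checkout.
DATA_DIR = Path("Data")

# File stems taken from the read_csv calls above.
DATASET_NAMES = [
    "imdb_data", "movies", "tags", "test",
    "genome_scores", "genome_tags", "train",
]


def dataset_paths(data_dir=DATA_DIR, names=DATASET_NAMES):
    """Map each dataset name to its CSV path under data_dir."""
    return {name: Path(data_dir) / f"{name}.csv" for name in names}


paths = dataset_paths()
# Each frame could then be loaded with, for example, pd.read_csv(paths["movies"]).
```

With this layout only `DATA_DIR` needs to change when the notebook moves to another machine.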



# Fetching more data
Since some of the data is missing, we need to collect more of it, especially for newer movies.

In [6]:
link_2020 = "https://en.wikipedia.org/wiki/List_of_American_films_of_2020"
link_2021 = "https://en.wikipedia.org/wiki/List_of_American_films_of_2021"

In [22]:
imdb_link = 'https://www.imdb.com/'

In [23]:
imdb_source = urllib.request.urlopen(imdb_link).read()
soup_imdb = bs.BeautifulSoup(imdb_source,'lxml')

In [24]:
tables_imdb = soup_imdb.find_all('table',class_='wikitable sortable')

In [25]:
len(tables_imdb)

0

In [7]:
source_20 = urllib.request.urlopen(link_2020).read()
soup_20 = bs.BeautifulSoup(source_20,'lxml')

In [8]:
source_21 = urllib.request.urlopen(link_2021).read()
soup_21 = bs.BeautifulSoup(source_21,'lxml')

In [9]:
tables_20 = soup_20.find_all('table',class_='wikitable sortable')

In [10]:
tables_21 = soup_21.find_all('table',class_='wikitable sortable')

In [215]:
len(tables_21)

4

In [11]:
df1_20 = pd.read_html(str(tables_20[0]))[0]
df2_20 = pd.read_html(str(tables_20[1]))[0]
df3_20 = pd.read_html(str(tables_20[2]))[0]
df4_20 = pd.read_html(str(tables_20[3]).replace("'1\"\'",'"1"'))[0]

In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the four tables in the same order
df_2020 = pd.concat([df1_20, df2_20, df3_20, df4_20], ignore_index=True)

In [13]:
df_2020.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,3,The Grudge,Screen Gems / Stage 6 Films / Ghost House Pict...,Nicolas Pesce (director/screenplay); Andrea Ri...,[2],
1,JANUARY,10,Underwater,20th Century Fox / TSG Entertainment / Chernin...,"William Eubank (director); Brian Duffield, Ada...",[3],
2,JANUARY,10,Like a Boss,Paramount Pictures,"Miguel Arteta (director); Sam Pitman, Adam Col...",[4],
3,JANUARY,10,Three Christs,IFC Films,Jon Avnet (director/screenplay); Eric Nazarian...,,
4,JANUARY,10,Inherit the Viper,Barry Films / Tycor International Film Company,Anthony Jerjen (director); Andrew Crabtree (sc...,[5],


In [201]:
df_2020.shape

(273, 7)

In [14]:
df1_21 = pd.read_html(str(tables_21[0]))[0]
df2_21 = pd.read_html(str(tables_21[1]))[0]
df3_21 = pd.read_html(str(tables_21[2]))[0]
df4_21 = pd.read_html(str(tables_21[3]).replace("'1\"\'",'"1"'))[0]

In [15]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the four tables in the same order
df_2021 = pd.concat([df1_21, df2_21, df3_21, df4_21], ignore_index=True)

In [29]:
df_2021.head()

Unnamed: 0,Opening,Opening.1,Title,Production company,Cast and crew,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref.,Ref.
0,JANUARY,1,Shadow in the Cloud,Vertical Entertainment,Roseanne Liang (director/screenplay); Max Land...,[2],
1,JANUARY,13,The White Tiger,Netflix,Ramin Bahrani (director/screenplay); Adarsh Go...,,
2,JANUARY,14,Locked Down,HBO Max / Warner Bros. Pictures,Doug Liman (director); Steven Knight (screenpl...,[3],
3,JANUARY,15,The Dig,Netflix / Clerkenwell Films,Simon Stone (director); Moira Buffini (screenp...,[4],
4,JANUARY,15,Outside the Wire,Netflix,"Mikael Håfström (director); Rob Yescombe, Rowa...",[5],


In [222]:
df_2021.shape

(356, 7)

In [21]:
#pip install tmdbv3api

In [26]:
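import os
import json
from tmdbv3api import TMDb

tmdb = TMDb()
# Read the key from an environment variable instead of hard-coding it.
# The variable name TMDB_API_KEY is an assumption; use whatever name your setup defines.
tmdb.api_key = os.environ.get('TMDB_API_KEY', '')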

In [58]:
from tmdbv3api import Movie
tmdb_movie = Movie()

def get_genre(x):
    # Search TMDb by title and return the raw list of matching results
    result = tmdb_movie.search(x)
    return result

In [59]:
w = get_genre('Shadow in the Cloud')

In [60]:
w

[{'adult': False, 'backdrop_path': '/aHYUj0hICtWZ5tPiCIm6pWUcjYK.jpg', 'genre_ids': [27, 28, 10752], 'id': 675327, 'original_language': 'en', 'original_title': 'Shadow in the Cloud', 'overview': 'A WWII pilot traveling with top secret documents on a B-17 Flying Fortress encounters an evil presence on board the flight.', 'popularity': 99.993, 'poster_path': '/t7EUMSlfUN3jUSZUJOLURAzJzZs.jpg', 'release_date': '2021-01-01', 'title': 'Shadow in the Cloud', 'video': False, 'vote_average': 5.8, 'vote_count': 520}]

In [32]:
df_2021['genres'] = df_2021['Title'].map(lambda x: get_genre(str(x)))

IndexError: list index out of range
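
An `IndexError` of this kind typically appears when a title search returns no hits and the first element is indexed anyway. The sketch below shows one defensive way to pull the genre ids out of a search result. `first_genre_ids` is a hypothetical helper, and it assumes each hit behaves like the dict printed for `w` above; the real `tmdbv3api` result objects may need attribute access instead.

```python
def first_genre_ids(results):
    """Return the genre_ids of the first search hit, or [] when there is none."""
    if not results:
        # No match for this title: return an empty list instead of indexing.
        return []
    first = results[0]
    # Assumption: hits are dict-like, as in the printed output of `w`.
    return list(first.get("genre_ids", []))


# Sample hit trimmed from the output shown for 'Shadow in the Cloud'.
sample_hits = [{"id": 675327, "genre_ids": [27, 28, 10752],
                "title": "Shadow in the Cloud"}]

found = first_genre_ids(sample_hits)
missing = first_genre_ids([])
```

Mapping a helper like this over `df_2021['Title']` would leave an empty list for titles that TMDb cannot find rather than stopping the whole cell.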

In [None]:
df_2018 = df[['Title','Cast and crew','genres']]

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Exploratory Data Analysis refers to the critical process of performing initial investigations on a dataset in order to discover patterns, spot anomalies, and check assumptions with the help of summary statistics and graphical representations. The following section analyses and provides an overview of the given data.

Looking at token lengths and the number of times a token appears in a text helps us identify tokens that add no value to the model's predictions. We will use the bag-of-words method, which builds a feature set from all the words in the text and records the number of times each word appears. But first, let us take a look at the most frequently used words in each class.
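
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. `bag_of_words` is a hypothetical helper, and splitting on `|` and whitespace is an assumption based on how the `plot_keywords` column is formatted; it is not the vectoriser used later in the notebook.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lower-cased token occurs in a pipe- or space-separated string."""
    tokens = [t for t in re.split(r"[|\s]+", text.lower()) if t]
    return Counter(tokens)


# plot_keywords value of movieId 2 in imdb_data
counts = bag_of_words("board game|adventurer|fight|game")
```

Here `counts["game"]` is 2 because the token occurs in both `board game` and `game`, while every other token is counted once.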


In [176]:
# looking at the imdb data
imdb_data.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [5]:
imdb_data.shape

(27278, 6)

In [282]:
movies_data.tail(20)

Unnamed: 0,movieId,title,genres,year
62403,209121,Adrenalin: The BMW Touring Car Story (2014),Documentary,[2014]
62404,209123,Square Roots: The Story of SpongeBob SquarePan...,Documentary,[2009]
62405,209129,Destination Titan (2011),Documentary,[2011]
62406,209131,Last Days of the Arctic (2011),Documentary,[2011]
62407,209133,The Riot and the Dance (2018),(no genres listed),[2018]
62408,209135,Jane B. by Agnès V. (1988),Documentary|Fantasy,[1988]
62409,209137,The Reward's Yours... The Man's Mine (1969),Western,[1969]
62410,209139,Rimsky-Korsakov (1953),Drama,[1953]
62411,209141,And They Lived Happily Ever After (1976),Comedy,[1976]
62412,209143,The Painting (2019),Animation|Documentary,[2019]


In [223]:
movies_data.shape

(62423, 3)

In [284]:
movies_data.isnull().sum()

movieId    0
title      0
genres     0
year       0
dtype: int64

In [229]:
x = [re.findall('[0-9]+', i) for i in movies_data.title]

In [233]:
x[:5]

[['1995'], ['1995'], ['1995'], ['1995'], ['1995']]

In [279]:
# Keep the last run of digits found in each title; titles without any digits get None
years = [i[-1] if i else None for i in x]

In [278]:
len(yrs)

15036

In [281]:
movies_data['year'] = years
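
Taking any run of digits from a title can pick up numbers that are not release years. A stricter alternative, sketched below, only accepts a four-digit number in parentheses at the end of the title, which is the pattern visible in `movies_data.title`. `YEAR_PATTERN` and `extract_year` are hypothetical names, and this is one possible approach rather than the method used to build the `year` column above.

```python
import re

# Four digits in parentheses at the very end of the title, e.g. "Toy Story (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing release year as an int, or None when the title has none."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


year_toy_story = extract_year("Toy Story (1995)")
year_missing = extract_year("Untitled")
year_with_digits = extract_year("2001: A Space Odyssey (1968)")
```

A column could then be built with `movies_data['title'].map(extract_year)`, leaving missing values where no year is present.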

In [283]:
len(movies_data.movieId.unique())

62423

Merge the tables that hold movie information so that each movie has the most complete record available.

In [285]:
new_movies = pd.merge( movies_data,imdb_data, how = 'left', on = 'movieId')

In [286]:
new_movies.head()

Unnamed: 0,movieId,title,genres,year,title_cast,director,runtime,budget,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,[1995],Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,[1995],Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,[1995],Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,[1995],Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,[1995],Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [287]:
new_movies.shape

(62423, 9)

In [288]:
new_movies.isnull().sum()

movieId              0
title                0
genres               0
year                 0
title_cast       47222
director         47076
runtime          48902
budget           55140
plot_keywords    48039
dtype: int64

The columns of interest are those that hold the information we want, but we must be careful not to repeat the same data.

In [None]:
# column names must be strings; the bare names are not defined as variables
text_features = ['title_cast', 'director', 'plot_keywords', 'title', 'genres', 'tag']

In [290]:
new_movies.shape

(62423, 9)

In [18]:
# get train data
train_data.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [177]:
train_data.shape

(10000038, 4)

In [291]:
len(train_data.movieId.unique())

48213

In [22]:
# get tags
tags_data.head(10)

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455
5,4,44665,unreliable narrators,1573943619
6,4,115569,tense,1573943077
7,4,115713,artificial intelligence,1573942979
8,4,115713,philosophical,1573943033
9,4,115713,tense,1573943042


In [9]:
tags_data.shape

(1093360, 4)

In [38]:
# genome tags
genome_tags_data.head(10)

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s
5,6,1950s
6,7,1960s
7,8,1970s
8,9,1980s
9,10,19th century


In [10]:
genome_tags_data.shape

(1128, 2)

In [28]:
# get genome scores
genome_scores_data.head()

Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [135]:
genome_scores_data.shape

(15584448, 3)

In [139]:
len(genome_scores_data['movieId'].unique())

13816

In [138]:
len(genome_scores_data['tagId'].unique())

1128

In [130]:
tag_score = pd.merge(genome_tags_data, genome_scores_data , left_on = 'tagId' , right_on  = 'movieId')

In [134]:
tag_score.tail(20)

Unnamed: 0,tagId_x,tag,movieId,tagId_y,relevance
1149412,1128,zombies,1128,1109,0.22425
1149413,1128,zombies,1128,1110,0.04675
1149414,1128,zombies,1128,1111,0.0365
1149415,1128,zombies,1128,1112,0.2225
1149416,1128,zombies,1128,1113,0.17125
1149417,1128,zombies,1128,1114,0.12625
1149418,1128,zombies,1128,1115,0.03675
1149419,1128,zombies,1128,1116,0.13975
1149420,1128,zombies,1128,1117,0.02325
1149421,1128,zombies,1128,1118,0.06625


In [132]:
tag_score.shape

(1149432, 5)

In [12]:
genome_scores_data.shape

(15584448, 3)

In [106]:
g_score = genome_scores_data.movieId.unique()

In [109]:
len(g_score)

13816

In [125]:
tag_score = pd.merge(genome_scores_data, genome_tags_data, left_on = 'movieId' , right_on  = 'tagId')

In [127]:
tag_score.tail(20)

Unnamed: 0,movieId,tagId_x,relevance,tagId_y,tag
1149412,1128,1109,0.22425,1128,zombies
1149413,1128,1110,0.04675,1128,zombies
1149414,1128,1111,0.0365,1128,zombies
1149415,1128,1112,0.2225,1128,zombies
1149416,1128,1113,0.17125,1128,zombies
1149417,1128,1114,0.12625,1128,zombies
1149418,1128,1115,0.03675,1128,zombies
1149419,1128,1116,0.13975,1128,zombies
1149420,1128,1117,0.02325,1128,zombies
1149421,1128,1118,0.06625,1128,zombies


In [121]:
tag_score.shape

(15584557, 5)

In [129]:
len(tag_score['movieId'].unique())

1019
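
Both genome tables carry a `tagId` column, so the tag text can be attached to each relevance score by joining on that shared key. The sketch below is illustrative only: `genome_tags_toy` and `genome_scores_toy` are hypothetical miniatures of the real tables (the relevance values for movie 2 are invented), and `validate='many_to_one'` is an optional guard that raises if a `tagId` appears twice in the tags table.

```python
import pandas as pd

# Toy stand-ins for genome_tags_data and genome_scores_data.
genome_tags_toy = pd.DataFrame({
    "tagId": [1, 2, 3],
    "tag": ["007", "007 (series)", "18th century"],
})
genome_scores_toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 2],
    "tagId": [1, 2, 3, 1, 2, 3],
    "relevance": [0.02875, 0.02375, 0.0625, 0.5, 0.4, 0.3],
})

# Join every (movieId, tagId, relevance) row to its tag text via the shared key.
tag_score_toy = pd.merge(
    genome_scores_toy,
    genome_tags_toy,
    on="tagId",
    how="left",
    validate="many_to_one",
)
```

The result keeps one row per score, so its length equals that of the scores table, and every `movieId` present in the scores survives the join.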

In [32]:
test_data.head()

Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [11]:
test_data.shape

(5000019, 2)

<a id="four"></a>
## 4. Data Preprocessing
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Before we can do feature engineering, the Exploratory Data Analysis (EDA) in section 3 showed that we need to ensure our data is in a clean format that can actually be used. The text columns contain many characters that are not useful and may add noise to our data. With regular expressions we can remove these characters, which include:

 - Strokes (`|`).
 - Dollar signs (`$`).
 - Numbers / years.
 - Web URLs.
 - Emojis.
 - Digits.
 - Spelling characters, e.g. ã¢â‚¬â¦.

By removing these characters, we aim to drop tokens that will not be useful during the modeling phase.

In [89]:
imdb_data.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [90]:
imdb_data.shape

(27278, 6)

In [96]:
im_ids = imdb_data.movieId.unique()

In [91]:
movies_data.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [92]:
movies_data.shape

(62423, 3)

In [97]:
m_ids = movies_data.movieId.unique()

In [100]:
simmillar = [i for i in m_ids if i in im_ids]


In [101]:
len(simmillar)

24866

In [103]:
missing_data = movies_data.shape[0] -  len(simmillar)

In [104]:
missing_data

37557
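
The overlap count above comes from a list comprehension that scans one array for every id in the other. The same two numbers can be obtained with set operations, as in this minimal sketch; `id_overlap` is a hypothetical helper and the id lists are toy values, not the real `m_ids` and `im_ids`.

```python
def id_overlap(left_ids, right_ids):
    """Return (ids present in both, ids present only in left_ids) as counts."""
    left, right = set(left_ids), set(right_ids)
    shared = left & right        # ids found in both collections
    only_left = left - right     # ids with no counterpart on the right
    return len(shared), len(only_left)


# Toy ids standing in for movies_data.movieId and imdb_data.movieId.
shared_count, missing_count = id_overlap([1, 2, 3, 4], [2, 4, 6])
```

On the real data, `id_overlap(m_ids, im_ids)` should reproduce the pair 24866 and 37557 reported above, since `movieId` is unique in `movies_data`.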

In [298]:
new_movies.head()

Unnamed: 0,movieId,title,genres,year,title_cast,director,runtime,budget,plot_keywords
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation
1,2,Jumanji (1995),Adventure Children Fantasy,[1995],Robin Williams Jonathan Hyde Kirsten Dunst Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game adventurer fight game
2,3,Grumpier Old Men (1995),Comedy Romance,[1995],Walter Matthau Jack Lemmon Sophia Loren Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat lake neighbor rivalry
3,4,Waiting to Exhale (1995),Comedy Drama Romance,[1995],Whitney Houston Angela Bassett Loretta Devine ...,Terry McMillan,124.0,"$16,000,000",black american husband wife relationship betra...
4,5,Father of the Bride Part II (1995),Comedy,[1995],Steve Martin Diane Keaton Martin Short Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood doberman dog mansion


In [71]:
new_movies.shape

(24866, 8)

In [299]:
new_movies.isnull().sum()

movieId              0
title                0
genres               0
year                 0
title_cast       47222
director         47076
runtime          48902
budget           55140
plot_keywords    48039
dtype: int64

Removing the `|` from our data 

In [293]:
# regex=False treats '|' as a literal character and avoids the pandas FutureWarning
new_movies['genres'] = new_movies['genres'].str.replace('|', ' ', regex=False)



In [294]:
# regex=False treats '|' as a literal character and avoids the pandas FutureWarning
new_movies['title_cast'] = new_movies['title_cast'].str.replace('|', ' ', regex=False)



In [295]:
# regex=False treats '|' as a literal character and avoids the pandas FutureWarning
new_movies['plot_keywords'] = new_movies['plot_keywords'].str.replace('|', ' ', regex=False)



In [296]:
len(new_movies['movieId'].unique())

62423

In [300]:
tags_data.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [301]:
t = tags_data[['movieId', 'tag']]
t.head()

Unnamed: 0,movieId,tag
0,260,classic
1,260,sci-fi
2,1732,dark comedy
3,1732,great dialogue
4,7569,so bad it's good


In [79]:
t.shape

(1093360, 2)

In [78]:
len(t['movieId'].unique())

45251

In [302]:
unique_tagIds = t['movieId'].unique()
unique_newIds = new_movies['movieId'].unique()

In [305]:
simmillar_ids = [i for i in unique_tagIds if i in unique_newIds]
len(simmillar_ids)

45251

In [310]:
new_movie_tags_merge = pd.merge(new_movies, t , how = 'left' ,on = 'movieId')

In [311]:
new_movie_tags_merge.head()

Unnamed: 0,movieId,title,genres,year,title_cast,director,runtime,budget,plot_keywords,tag
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation,Owned
1,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation,imdb top 250
2,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation,Pixar
3,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation,Pixar
4,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,[1995],Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,John Lasseter,81.0,"$30,000,000",toy rivalry cowboy cgi animation,time travel


In [312]:
new_movie_tags_merge.shape

(1110532, 10)
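
The merge above produces one row per (movie, tag) pair, which is why the movie details repeat and the frame grows to over a million rows. If one row per movie is preferred, the tags can be collapsed into a single string before merging. The sketch below uses hypothetical toy frames (`movies_toy`, `tags_toy`) and is one possible approach, not the step taken in this notebook.

```python
import pandas as pd

# Toy stand-ins for new_movies and the two-column tags frame t.
movies_toy = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
})
tags_toy = pd.DataFrame({
    "movieId": [1, 1, 1],
    "tag": ["Pixar", "Pixar", "time travel"],
})

# Drop missing and repeated tags, then join each movie's tags into one string.
tags_per_movie = (
    tags_toy.dropna(subset=["tag"])
    .drop_duplicates()
    .groupby("movieId")["tag"]
    .agg(" ".join)
    .reset_index()
)

# A left merge keeps every movie, with NaN where a movie has no tags.
merged_toy = movies_toy.merge(tags_per_movie, on="movieId", how="left")
```

The merged frame has exactly one row per movie, so its length matches `movies_toy`.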

<a id="five"></a>
## 5. Feature Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

Feature engineering, the process of creating features from raw text data that can be fed into a machine learning model, is one of the more important parts of text classification using machine learning. It uses domain knowledge of the data to create features that make machine learning algorithms work. The better the quality of the information you provide to a machine learning algorithm, the better it can interpret that information. Feature engineering therefore gives the model better data to learn from and leads to more reasonable results.
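
As an illustration of how text features can feed a content-based recommender, the sketch below turns each movie's genre string into term counts and ranks other movies by cosine similarity. It is a simplified stand-in that uses raw counts from the standard library rather than a fitted vectoriser; `term_counts`, `cosine_similarity`, `most_similar` and the three-movie `catalogue` are hypothetical, with genre strings taken from the cleaned `new_movies` preview.

```python
import math
from collections import Counter


def term_counts(text):
    """Lower-case the text and count each whitespace-separated token."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(title, catalogue):
    """Return the other title whose text is closest to the given title's text."""
    target = term_counts(catalogue[title])
    scores = {
        other: cosine_similarity(target, term_counts(text))
        for other, text in catalogue.items()
        if other != title
    }
    return max(scores, key=scores.get)


catalogue = {
    "Toy Story (1995)": "Adventure Animation Children Comedy Fantasy",
    "Jumanji (1995)": "Adventure Children Fantasy",
    "Grumpier Old Men (1995)": "Comedy Romance",
}

closest = most_similar("Toy Story (1995)", catalogue)
```

In this toy catalogue the closest match to Toy Story is Jumanji, because the two share three genre terms while Grumpier Old Men shares only one.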


<a id="six"></a>
## 6. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>


We train the models on the `processed_df_train` dataframe to find the best-performing models on the original data, using the `message` column as the input for prediction.

<a id="seven"></a>
## 7. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

In [163]:
A = 10500
a = 8000*(1 + 0.115/ 12)**13
A -a

1443.9621597386267

In [159]:
i = 0.115
p = 8000
interest = p * i
interest

920.0

In [160]:
11.5 / 100

0.115