# **Overview**

In this notebook I will be implementing movies recommendation system using only movies metadata such as the name of the movie, actors, directors, description, production company, budget...etc. The data was scrapped from IMDB website and it contains 86K rows of movies information.

1. **Loading Data**<br>
SQLAlchemy is used in this script to filter and extract the data from SQLite. 

2. **Data Processing**<br>
In this step, the dataset will go into data processing pipeline that cleans, transform the text and reduce the number of generated features (dimensionality). Transforming the text will be done using TF-IDF which gives low scores to the highly repeated words and higher scores to words that are common in few movie descriptions and thus they will be the ones that the model identifies as important and tries to learn. Due to memory issues, I had to limit the TF-IDF returned features and apply PCA to reduce the number of features so t fits the memory and tospeed up the computing process. Unfortunately, applying dimensionality reduction technique means information loss specially if the number of components were not selected carefully. In my case, I have tried few numbers but again, due to memory issues, I used only 30 components.

3. **Movie Recommendation**<br>
In this step, the similarity between the selected movie and rest of movies will be done using *Cosine Similarity* function.<br><br>

Notes:<br>
A. From the results I have seen, it's not that bad! Tunning the TF-IDF parameters and PCA will make a big difference.
B. Few movies have weird titles. i.e. *L'uomo d'acciaio* is actually *Man of Steel*, *La promessa* is *The Pledge* and so on.<br><br>

## **Recommendation System**

#### **Extract movies data**

In [1]:
import pandas as pd

from sqlalchemy import create_engine, MetaData, select, Table
from sqlalchemy import and_, or_
from databases import Database

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
DB_PATH = 'sqlite:///backend/data/movies.sqlite3'
database = Database(DB_PATH)
engine = create_engine(DB_PATH)
meta = MetaData()

In [None]:
movies_tb = Table('movies', meta, autoload=True, autoload_with=engine)

In [5]:
meta.create_all(engine)
database.connect()

In [8]:
# Convert query result (type list of tuples) along with columns (type sqlalchemy select object) into dictionary / json format
def _func_convert_results_to_json(results, columns):
    return {col: list(list(zip(*results))[indx]) for indx, col in enumerate(columns)}

In [9]:
# Import movies related data
# All movies after 1990, and English movies

query  = select([
    movies_tb.columns.title,
    movies_tb.columns.year,
    movies_tb.columns.genre,
    movies_tb.columns.duration,
    movies_tb.columns.director,
    movies_tb.columns.actors,
    movies_tb.columns.description 
])

query  = query.where(or_(movies_tb.columns.country.like('%USA%'), movies_tb.columns.country.like('%UK%'))) 
query  = query.where(and_(movies_tb.columns.year >= 1990, movies_tb.columns.language.like('%English%')))

result = await database.fetch_all(query)
result = _func_convert_results_to_json(results = result, columns = [column.name for column in query.columns])

result_df = pd.DataFrame(result)
result_df[:2]

Unnamed: 0,title,year,genre,duration,director,actors,description
0,Kate & Leopold,2001,"Comedy, Fantasy, Romance",118,James Mangold,"Meg Ryan, Hugh Jackman, Liev Schreiber, Brecki...",An English Duke from 1876 is inadvertedly drag...
1,L'altra faccia del vento,2018,Drama,122,Orson Welles,"John Huston, Oja Kodar, Peter Bogdanovich, Sus...",A Hollywood director emerges from semi-exile w...


#### **Preprocessing Dataset**

In [10]:
import re

import spacy
nlp = spacy.load("en_core_web_sm")

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, PCA

In [11]:
def create_content(data):
    return data.apply(lambda x: ' '.join(x), axis=1).rename('content')

In [12]:
def create_genres(data):
    return data.str.split(', ').apply(lambda x: pd.Series(1, x)).fillna(0)

In [13]:
def group_years(year, bins, labels):
    return pd.get_dummies(pd.cut(year, bins=bins, labels=labels))

In [14]:
def transform_duration(data):
    return data/100

In [15]:
def initial_text_cleaning(data):
    my_stopwords_list = ['year', 'years', 'movie', 'movies', 'mr', 'mrs', 'miss', 'ms', 'mx', 'sir', 'dr', 'mr.', 'mrs.', 'miss.', 
                         'ms.', 'mx.', 'sir.', 'dr.', 'count', 'counts', 'woo', 'use', 'using', 'part', 'see', 'sees', 'when', 'how', 
                         'what', 'movie', 'begins', 'begin', 'until', 'one', 'two', 'three', 'much', 'more', 'until', 'used', 'each', 
                         'everyone', 'six', 'might', 'guides', 'guide']
    
    return ' '.join([word for word in re.findall(r"(?i)\b[a-z]+\b", data.lower()) if len(word) > 2 and word not in my_stopwords_list])

In [16]:
def clean_text(text):
    return ' '.join([w.lemma_ for w in nlp(text) if not w.is_stop and w.pos_ in ['VERB', 'PROPN', 'NOUN', 'ADJ', 'ADV']])

In [17]:
def transform_text(data):
    tfidf_vect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', min_df=4, max_df=0.50)
    vectorized_content = tfidf_vect.fit_transform(data)
    return pd.DataFrame(data=vectorized_content.toarray(), columns=tfidf_vect.get_feature_names())

In [18]:
def pca_reduce_tfidf_dimensions(data):
    pca = PCA(n_components=30)
    return pd.DataFrame(data=pca.fit_transform(data), columns=['PCA_' + str(n) for n in range(1,31)])

In [19]:
def prepare_dataframe(data):
    
    years_column, genres_column, duration_column = 'year', 'genre', 'duration'
    content_columns = ['title', 'actors', 'director', 'description']
    drop_columns = ['actors', 'director', 'description']

    bins=[1000, 1990, 2000, 2010, 2020]
    labels=['before_1990', 'between_1990_2000', 'between_2000_2010', 'between_2010_2020']

    print('Create content task started...')
    data = pd.concat([data, create_content(data=data[content_columns])], axis=1)

    print('Create genres task started...')
    data = pd.concat([data, create_genres(data=data[genres_column])], axis=1)
    
    print('Transforming years task started...')
    data = pd.concat([data, group_years(year=data[years_column], bins=bins, labels=labels)], axis=1)
    
    print('Transforming duration task started...')
    data = pd.concat([data, transform_duration(data=data[duration_column]).rename('transformed_duration')], axis=1)
    
    print('Initial cleaning content task started...')
    data = pd.concat([data, data['content'].apply(initial_text_cleaning).rename('init_clean_content')], axis=1)
    
    print('Content lemmatization task started...')
    data = pd.concat([data, data['init_clean_content'].apply(clean_text).rename('clean_content')], axis=1) # content column will be created by the create_content function
    
    # Save clean dataset for furthure investigation
    # data.to_csv('movies_cleaned_data.csv')
    
    print('TF-IDF text transformation and LDA dimension reduction task started...')    
    data = pd.concat([data, pca_reduce_tfidf_dimensions(transform_text(data['clean_content']))], axis=1)

    # Prepare outputs
    print('Preparing outputs')
    titles = pd.DataFrame(data=data.title, index=range(0,len(data)))
    data   = data.drop(['title', 'year', 'genre', 'duration', 'actors', 'director', 'description', 'content', 'init_clean_content', 'clean_content'], axis=1)
    
    print('Done !')
    return titles, data

In [20]:
movies_titles, processed_data = prepare_dataframe(data=result_df)

Create content task started...
Create genres task started...
Transforming years task started...
Transforming duration task started...
Initial cleaning content task started...
Content lemmatization task started...
TF-IDF text transformation and LDA dimension reduction task started...
Preparing outputs
Done !


In [28]:
# Save datasets
movies_titles.to_csv('movies_titles.csv')
processed_data.to_csv('processed_data.csv')

In [2]:
# Load datasets
movies_titles  = pd.read_csv('movies_titles.csv')
processed_data = pd.read_csv('processed_data.csv')

### **Movie Recommendation Using Cosine Similarity**

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
def get_recommendation(movie_title):
    try: movie_id = movies_titles[movies_titles.title == movie_title].index[0]
    except: return 'Movies you have entered is not in this database'
    similarity_scores = cosine_similarity(X=processed_data.loc[movie_id].values.reshape(1,-1),Y=processed_data.values)
    similarity_scores = {indx: v[0] for indx, v in enumerate(similarity_scores.reshape(-1,1))}
    similarity_scores = {i: similarity_scores[i] for i in sorted(similarity_scores, key=similarity_scores.get, reverse=True)[1:11]}
    similarity_scores = pd.DataFrame(data={'similarity_scores':[str(round(i * 100, 1)) + ' %' for i in similarity_scores.values()]}, index=similarity_scores.keys())
    return pd.merge(movies_titles, similarity_scores, left_index=True, right_index=True, how='right')

In [18]:
get_recommendation('The Avengers')

Unnamed: 0,title,similarity_scores
16586,Avengers: Age of Ultron,99.9 %
19342,Avengers: Infinity War,99.9 %
18414,Captain America: Civil War,99.9 %
14742,Captain America: The Winter Soldier,99.8 %
14031,Ready Player One,99.8 %
18844,Rogue One: A Star Wars Story,99.7 %
12866,Into Darkness - Star Trek,99.7 %
18884,Solo: A Star Wars Story,99.7 %
19712,Maze Runner - La rivelazione,99.7 %
10601,The Amazing Spider-Man,99.7 %


In [20]:
get_recommendation('Zodiac')

Unnamed: 0,title,similarity_scores
6783,Mystic River,99.4 %
5363,La promessa,99.0 %
9222,State of Play,99.0 %
8832,Inside Man,98.9 %
7737,Black Dahlia,98.9 %
10491,In the Electric Mist - L'occhio del ciclone,98.8 %
9274,Nella valle di Elah,98.7 %
11575,The Limits of Control,98.7 %
8783,Gone Baby Gone,98.6 %
13944,Thorne: Scaredycat,98.5 %


In [23]:
get_recommendation('Horrible Bosses')

'Movies you have entered is not in this database'