# Content-Based Movie Recommendation System
This application uses a dataset called TheMoviesDatasets, which contains tens of thousands of movies. From this dataset, we used about 5,000 movies. Each movie comes with additional information such as genre, revenue, budget, and release date. The goal of this project is to recommend movies to the user based on genres and keywords.

Recommendation systems have become very popular in recent years. They are used not only for movies but also for music, shopping, and more. Companies such as Spotify, Netflix, and Amazon use them to give users a better experience.

![example](./images/momomovies_banner_long.jpg)

# Content-Based Recommendation Model
This is a content-based recommendation model that comes in two variants: one based on genres and one based on keywords. The model uses Term Frequency (TF) and Inverse Document Frequency (IDF) to weigh the relative importance of each genre or keyword. After calculating the TF-IDF values, we use a vector space model to measure how similar items are to each other. Each movie is stored as a vector of TF-IDF scores, and the similarity between two movies corresponds to the proximity of their vectors, measured by the angle between them. Because `TfidfVectorizer` normalizes each vector to unit length by default, the dot product of two vectors equals the cosine of that angle, which is the cosine similarity score. This score can be used directly to compare two movies. The result is a content-based recommendation model that returns the 10 movies most similar to a movie the user chooses. Depending on which variant the user picks, the 10 movies are similar to the chosen movie in genre or in keywords.

![tfidf](./images/TF-IDF_Sanjay.png)
![vector space model](./images/Vector-Space-Model_Sanjay.png)
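
The relationship between TF-IDF vectors, the dot product, and cosine similarity can be illustrated with a small sketch. The three genre strings below are made up for illustration and are not taken from the dataset. The sketch relies on the default `norm="l2"` setting of `TfidfVectorizer`, under which `linear_kernel` (a plain dot product) and `cosine_similarity` should give the same values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings for three made-up movies (not from the dataset)
toy_genres = [
    "drama romance",
    "drama romance thriller",
    "animation comedy family",
]

# Turn each string into a TF-IDF vector (rows are L2-normalized by default)
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_genres)

# Because every row has unit length, the dot product of two rows
# equals the cosine of the angle between them.
dot_products = linear_kernel(tfidf, tfidf)
cosines = cosine_similarity(tfidf)

print(np.round(dot_products, 3))
```

In the printed matrix, the first two movies share two genres and score well above zero, while the third shares no genre with either and scores zero against both.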

In [21]:
import numpy as np
import pandas as pd

# TF-IDF vectorizer and dot-product kernel used to compute cosine similarity between movies
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [22]:

# Define file directories
MOVIES_DATA_DIR = './data/'
MOVIES_CBR_CSV_FILE = MOVIES_DATA_DIR + 'movies_cbr_small.csv'
MOVIE_TITLE_CSV_FILE = MOVIES_DATA_DIR + 'movie_titles.csv'

def load_preprocess_movies_cbr_small(movies_cbr_small_filepath):
    movies_cbr_small = pd.read_csv(movies_cbr_small_filepath, dtype='unicode')
    movies_cbr_small["tmdbid"] = movies_cbr_small["tmdbid"].astype(str).astype('int64')
    movies_cbr_small["imdbid"] = movies_cbr_small["imdbid"].astype(str).astype('int64')
    movies_cbr_small["budget"] = movies_cbr_small["budget"].astype(str).astype('int64')
    movies_cbr_small["revenue"] = movies_cbr_small["revenue"].astype(str).astype('int64')
    movies_cbr_small["runtime"] = movies_cbr_small["runtime"].astype(str).astype(float)
    movies_cbr_small["vote_average"] = movies_cbr_small["vote_average"].astype(str).astype(float)
    movies_cbr_small["vote_count"] = movies_cbr_small["vote_count"].astype(str).astype('int64')
    movies_cbr_small['release_date'] = pd.to_datetime(movies_cbr_small['release_date'])
    movies_cbr_small = movies_cbr_small.loc[(movies_cbr_small.budget > 0) & (movies_cbr_small.revenue > 0),:]
    return movies_cbr_small

def get_movie_titles_list(movie_titles_filepath):
    # Load the movie titles file and return the titles, preceded by a placeholder entry
    df = pd.read_csv(movie_titles_filepath, dtype='unicode')
    return ['Select a movie'] + df['title'].tolist()

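def get_cosine_sim(tfidf_matrix):
    # linear_kernel computes plain dot products; because TfidfVectorizer
    # L2-normalizes each row by default, these equal cosine similarities
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    return cosine_sim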

def genre_recommendations(title, cosine_sim, df):
    # Return the 10 titles most similar to `title` according to `cosine_sim`.
    # Assumes the index labels of df match the row positions of cosine_sim.
    titles = df['title']
    # Map each title to its index label in df
    indices = pd.Series(df.index, index=df['title'])
    try:
        idx = indices[title]
        # Pair each row position with its similarity to the chosen movie
        sim_scores = list(enumerate(cosine_sim[idx]))
        # Sort from most to least similar
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
        # Skip the top hit (normally the movie itself) and keep the next 11
        sim_scores = sim_scores[1:12]
        movie_indices = [i[0] for i in sim_scores]
        # Ties can push the chosen movie below first place, so drop it if still present
        if idx in movie_indices:
            movie_indices.remove(idx)
        return titles.iloc[movie_indices[0:10]]
    except KeyError:
        return "No recommendations found for this movie"


In [23]:

#get movies dataframe
movies_cbr_small = load_preprocess_movies_cbr_small(MOVIES_CBR_CSV_FILE)

#get the list of movie titles
movie_titles = get_movie_titles_list(MOVIE_TITLE_CSV_FILE)

#Create tfidf matrix and cosine similarity by genre
#min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf_genre = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix_genre = tf_genre.fit_transform(movies_cbr_small['genres_list'])
cosine_sim_genre = get_cosine_sim(tfidf_matrix_genre)

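#Create tfidf matrix and cosine similarity by keywords (rows without keywords are skipped)
#min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf_keywords = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')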
tfidf_matrix_keywords = tf_keywords.fit_transform(movies_cbr_small.loc[pd.notnull(movies_cbr_small['keywords_list']),'keywords_list'])
cosine_sim_keywords = get_cosine_sim(tfidf_matrix_keywords)
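
One thing to watch for in the cell above: the keyword matrix is fit only on rows with a non-null `keywords_list`, and the dataframe is filtered on budget and revenue, so a row's index label may not equal its row position in the similarity matrix. Whether that happens with the actual CSV files is not something this document shows. The sketch below uses a made-up four-row frame (the titles, keywords, and the `most_similar` helper are all hypothetical) to show one way of keeping labels and positions aligned by resetting the index after filtering.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy frame; the None row stands in for a movie without keywords
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "keywords_list": ["ship ocean love", None, "ship iceberg", "robot future"],
})

# Keep only rows with keywords and renumber them 0..n-1 so that
# row i of the frame corresponds to row i of the TF-IDF matrix.
with_keywords = toy.dropna(subset=["keywords_list"]).reset_index(drop=True)

tfidf = TfidfVectorizer().fit_transform(with_keywords["keywords_list"])
sim = linear_kernel(tfidf, tfidf)

# Lookup from title to row position in `sim`
position = pd.Series(with_keywords.index, index=with_keywords["title"])

def most_similar(title):
    """Return the title of the single most similar other movie (hypothetical helper)."""
    pos = position[title]
    ranked = sim[pos].argsort()[::-1]          # positions, most similar first
    ranked = [i for i in ranked if i != pos]   # drop the movie itself
    return with_keywords["title"].iloc[ranked[0]]

print(most_similar("A"))
```

Without the `reset_index(drop=True)` call, movie "C" would keep the label 2 while sitting at position 1 in the matrix, and a label-based lookup would point at the wrong row.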

# Movie Recommendation System Sample Output Based on Genre
The results below show the ten movies most similar to Titanic (1997), based on the genres associated with that movie. Most of these movies share genres such as Drama, Romance, and Thriller with Titanic.

In [24]:
genre_recommendations('Titanic ( 1997 )',cosine_sim_genre,movies_cbr_small)

935             Cruel Intentions ( 1999 )
1512                  Angel Eyes ( 2001 )
1883           Absence of Malice ( 1981 )
4661        Fifty Shades of Grey ( 2015 )
2863             Man of the Year ( 2006 )
3654                   Tere Naam ( 2003 )
281                        Bound ( 1996 )
570                      Witness ( 1985 )
1078    Someone to Watch Over Me ( 1987 )
1650           Play Misty for Me ( 1971 )
Name: title, dtype: object

# Movie Recommendation System Sample Output Based on Keywords
The results below show the ten movies most similar to Titanic (1997), based on the keywords associated with that movie. Most of these movies share keywords such as tragic love and shipwreck with Titanic.

In [25]:
genre_recommendations('Titanic ( 1997 )',cosine_sim_keywords,movies_cbr_small)

1510                    Love Story ( 1970 )
1727            A Walk to Remember ( 2002 )
1214                       Titanic ( 1953 )
3903                       In Time ( 2011 )
652     You Can't Take It With You ( 1938 )
4578                    Aashiqui 2 ( 2013 )
1149                        Onegin ( 1999 )
604                    Deep Rising ( 1998 )
4469        The Fault in Our Stars ( 2014 )
2667            Brokeback Mountain ( 2005 )
Name: title, dtype: object