# Content Based Filtering
https://www.analyticsvidhya.com/blog/2021/12/comprehensive-project-on-building-a-movie-recommender-website/

Recommender System is basically a system that takes the user’s choice as input and predicts all the related movies, or news, books, etc.

you would have seen Recommender System in Action while Scrolling on Youtube, Netflix, etc.

Content-based filtering
This type of Filtering system recommends you on the basis of what you actually like. Imagine you love to watch comedy movies so a content-based recommender system will recommend you other related comedy movies which belong to your category

Steps Involved

- Step 1. Getting the Dataset

- Step 2. Data Cleaning and Processing

- Step 3. Training our Recommender System

- Step 4. Testing and Validation

- Step 5. Saving the Trained Model for Deployment

In [1]:
# Import Modules
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Import data

movies = pd.read_csv('data/movies.csv')
#imbd_df = pd.read_csv('data/imdb_data.csv')


In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


 ## Training Steps
 
 our final data frame is textual data, we need to parse it into numerical or floating values in order to feed as inputs in machine learning algorithms. This process is called feature extraction |  vectorization).

## Data Processing Function



In [2]:
# Import data

movies = pd.read_csv('data/movies.csv')

In [3]:
def content_data_processing(df):
    
    genres = df['genres']
    genres = [genre.split("|") for genre in genres]
    df['genre_corpus']= genres
    df['genre_corpus'] = df.genre_corpus.apply(lambda x:" ".join(x))
    cvect = CountVectorizer() 
    vectors = cvect.fit_transform(df['genre_corpus']).toarray()
    #print(vectors.shape)
    return vectors

### Model Building :

our model should be capable of finding the similarity between movies based on their tags.

Our Recommender model takes a movie title as input and predicts top-n most similar movies based on the tags

here we will use the concept of Cosine distance to calculate the similarity of tags

sklearn provides a class for calculating pairwise cosine_similarity.

## Function for Contentent Based filtering

In [4]:
def content_model(movie_list, top_n):

    new_df = movies.copy()
    
       
    movie_index_1 = new_df[new_df['title'] == movie_list[0]].index[0]
    movie_index_2 = new_df[new_df.title == movie_list[1]].index[0]
    movie_index_3 =  new_df[new_df.title == movie_list[2]].index[0]
    
    df_1 = new_df.sample(frac = 0.5)
    df_2 = new_df.iloc[[movie_index_1,movie_index_2,movie_index_3]]
    df_2 = df_2.append(df_1)
    
    vectors = content_data_processing(df_2)
    similarity = cosine_similarity(vectors)
    
    distances_1 = similarity[0]
    distances_2 = similarity[1]
    distances_3 = similarity[2]
    
    sim_score_1 = pd.Series(distances_1).sort_values(ascending = False)
    sim_score_2 = pd.Series(distances_2).sort_values(ascending = False)
    sim_score_3 = pd.Series(distances_3).sort_values(ascending = False)
    
    # Getting the indexes of the 10 most similar movies
    sim_score_list = sim_score_1.append(sim_score_2).append(sim_score_3).sort_values(ascending = False)


    # Appending the names of movies
    indexes = list(sim_score_list.index)
    
    recommended_movies = []
    
    top_n = 10
    for i in indexes:
        
        if df_2.iloc[i].title not in movie_list and len(recommended_movies) < top_n:
            recommended_movies.append(df_2.iloc[i].title)
    
        
    return recommended_movies

In [5]:
movie_list = ['Guardian Angel (1994)','Jack Frost (1979)','Wasteland No. 1: Ardent Verdant (2017)']


In [9]:

recommended_movies = content_model(movie_list, 10)


In [10]:
recommended_movies

['Caged No More (2016)',
 "President's Man: A Line in the Sand, The (2002)",
 'Firepower (1979)',
 'Deadly Encounter (1982)',
 'Eraser (1996)',
 'Assassination (2015)',
 'Moscow Heat (2004)',
 'Killer Elite, The (1975)',
 'Klansman, The (1974)',
 'Operation Thunderbolt (1977)']

# Collaborative Filter

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import pickle
import copy
from surprise import Reader, Dataset, SVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

from surprise.model_selection import train_test_split

In [2]:
movies = pd.read_csv('data/movies.csv')
ratings = pd.read_csv('data/train.csv')

### Building the model

In [3]:
print('Length of Movies dataset:', movies.shape[0],'rows')
print('Length of Ratings dataset:', ratings.shape[0],'rows')

Length of Movies dataset: 62423 rows
Length of Ratings dataset: 10000038 rows


In [6]:
ratings = ratings.drop('timestamp',axis=1)


In [19]:
ratings = ratings.sample(frac = 0.1)

In [7]:
# Set the reader variable

reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(ratings,reader)

# Instantiate the model
SVD_model = SVD()


# Train Test Split method
X_train, X_test = train_test_split(data,test_size=0.2)

In [9]:
# Fit the model to our data

SVD_model.fit(X_train)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x22ea455e040>

In [10]:
# Predict 

predictions = SVD_model.test(X_test)

In [16]:
# Check the accuracy of our model

accuracy.rmse(predictions)

RMSE: 0.8341


0.8341035926026377

In [4]:
import pickle


In [14]:
pickle.dump(SVD_model, open('SVD_model.pkl', 'wb'))

In [5]:
# Split train data into subset to save as smaller csv file to be able
# to upload to github, concat files to use again

ratings_1 = ratings[0:2500000]
ratings_2 = ratings[2500000:5000000]
ratings_3 = ratings[5000000:7500000]
ratings_4 = ratings[7500000:]

In [9]:
ratings_1.to_csv('data/ratings_1.csv', index=False)
ratings_2.to_csv('data/ratings_2.csv', index=False)
ratings_3.to_csv('data/ratings_3.csv', index=False)
ratings_4.to_csv('data/ratings_4.csv', index=False)

## Kaggle submission

In [17]:
# Prepare Kaggle submission

test = pd.read_csv('data/test.csv')

# Make predictions on test data
pred_list = []

for _,row in test.iterrows():
    x = (SVD_model.predict(row.userId, row.movieId))
    pred = x[3]
    pred_list.append(pred)

In [18]:
# Convert values to strings

test['userId'] = test['userId'].astype(str)
test['movieId'] = test['movieId'].astype(str)

In [19]:
# Create submission column

test['Id'] = test['userId'] +'_'+test['movieId']

In [20]:
submission_df = pd.DataFrame({'Id':test['Id'],
                              'rating':pred_list})

In [21]:
submission_df.to_csv('SVD_model_sub.csv', index=False)

## Function for Streamlit


In [2]:
# Importing data
movies_df = pd.read_csv('data/movies.csv')

ratings_1 = pd.read_csv('data/ratings_1.csv')
ratings_2 = pd.read_csv('data/ratings_2.csv')
ratings_3 = pd.read_csv('data/ratings_3.csv')
ratings_4 = pd.read_csv('data/ratings_4.csv')

# Add ratings together to form complete dataset

ratings_df = ratings_1.append(ratings_2).append(ratings_3).append(ratings_4)

ratings_df.drop(['timestamp'], axis=1,inplace=True)

In [3]:
# We make use of an SVD model trained on the MovieLens 10 million dataset.
model=pickle.load(open('SVD_model.pkl', 'rb'))

In [4]:
def prediction_item(item_id):
    """Map a given favourite movie to users within the
       MovieLens dataset with the same preference.

    Parameters
    ----------
    item_id : int
        A MovieLens Movie ID.

    Returns
    -------
    list
        User IDs of users with similar high ratings for the given movie.

    """
    # Data preprosessing
    reader = Reader(rating_scale=(0, 5))
    load_df = Dataset.load_from_df(ratings_df,reader)
    a_train = load_df.build_full_trainset()

    predictions = []
    for ui in a_train.all_users():
        predictions.append(model.predict(iid=item_id,uid=ui, verbose = False))
    return predictions


In [8]:
def pred_movies(movie_list):
    """Maps the given favourite movies selected within the app to corresponding
    users within the MovieLens dataset.

    Parameters
    ----------
    movie_list : list
        Three favourite movies selected by the app user.

    Returns
    -------
    list
        User-ID's of users with similar high ratings for each movie.

    """
    # Store the id of users
    id_store=[]
    # For each movie selected by a user of the app,
    # predict a corresponding user within the dataset with the highest rating
    for i in movie_list:
        predictions = prediction_item(item_id = i)
        predictions.sort(key=lambda x: x.est, reverse=True)
        # Take the top 10 user id's from each movie with highest rankings
        for pred in predictions[:10]:
            id_store.append(pred.uid)
    # Return a list of user id's
    return id_store

In [11]:
def collab_model(movie_list,top_n):
    """Performs Collaborative filtering based upon a list of movies supplied
       by the app user.

    Parameters
    ----------
    movie_list : list (str)
        Favorite movies chosen by the app user.
    top_n : type
        Number of top recommendations to return to the user.

    Returns
    -------
    list (str)
        Titles of the top-n movie recommendations to the user.

    """
    top_n=10
    new_df = movies_df.copy()
    new_df.set_index('movieId',inplace=True)
    
    indices = pd.Series(new_df['title'])
    users_ids = pred_movies(movie_list)
    
    # Get movie IDs and ratings for top users
    df_init_users = ratings_df[ratings_df['userId']==users_ids[0]]
    for i in users_ids[1:]:
        df_init_users = df_init_users.append(ratings_df[ratings_df['userId']==i])
    
    # Include predictions for chosen movies
    for j in movie_list:
        a = pd.DataFrame(prediction_item(j))
        for i in set(df_init_users['userId']):
            mid = indices[indices == j].index[0]
            est = a['est'][a['uid']==i].values[0]
            df_init_users = df_init_users.append(pd.Series([int(i),int(mid),est], index=['userId','movieId','rating']), ignore_index=True)
    
    # Remove duplicate entries
    df_init_users.drop_duplicates(inplace=True)
    
    #Create pivot table
    util_matrix = df_init_users.pivot_table(index=['userId'], columns=['movieId'], values='rating')
    
    # Fill Nan values with 0's and save the utility matrix in scipy's sparse matrix format
    util_matrix.fillna(0, inplace=True)
    util_matrix_sparse = sp.sparse.csr_matrix(util_matrix.values)
    
    # Compute the similarity matrix using the cosine similarity metric
    user_similarity = cosine_similarity(util_matrix_sparse.T)
    
    # Save the matrix as a dataframe to allow for easier indexing
    user_sim_df = pd.DataFrame(user_similarity, index = util_matrix.columns, columns = util_matrix.columns)
    user_similarity = cosine_similarity(np.array(df_init_users), np.array(df_init_users))
    user_sim_df = pd.DataFrame(user_similarity, index = df_init_users['movieId'].values.astype(int), columns = df_init_users['movieId'].values.astype(int))
    
    # Remove duplicate rows from matrix
    user_sim_df = user_sim_df.loc[~user_sim_df.index.duplicated(keep='first')]
    
    # Transpose matrix
    user_sim_df = user_sim_df.T
    
    # Find IDs of chosen load_movie_titles
    idx_1 = indices[indices == movie_list[0]].index[0]
    idx_2 = indices[indices == movie_list[1]].index[0]
    idx_3 = indices[indices == movie_list[2]].index[0]
    
    # Creating a Series with the similarity scores in descending order
    distances_1 = user_sim_df[idx_1]
    distances_2 = user_sim_df[idx_2]
    distances_3 = user_sim_df[idx_3]
    
    # Calculating the scores
    sim_score_1 = pd.Series(distances_1).sort_values(ascending = False)
    sim_score_2 = pd.Series(distances_2).sort_values(ascending = False)
    sim_score_3 = pd.Series(distances_3).sort_values(ascending = False)
    
    # Appending the names of movies
    sim_score_list = sim_score_1.append(sim_score_2).append(sim_score_3).sort_values(ascending = False)
    
    # Choose top 50
    top_50_indexes = list(sim_score_list.iloc[1:50].index)
    
    # Removing chosen movies
    indexes = np.setdiff1d(top_50_indexes,[idx_1,idx_2,idx_3])
    
    # Get titles of recommended movies
    recommended_movies = []
    for i in indexes[:top_n]:
        recommended_movies.append(list(movies_df[movies_df['movieId']==i]['title']))
    
    # Return list of movies
    recommended_movies = [val for sublist in recommended_movies for val in sublist]
    return recommended_movies


In [12]:
movie_list = ['Guardian Angel (1994)','Jack Frost (1979)','Wasteland No. 1: Ardent Verdant (2017)']
collab_model(movie_list,top_n=10)

['Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Dead Man Walking (1995)',
 'Cry, the Beloved Country (1995)',
 'Georgia (1995)',
 'Indian in the Cupboard, The (1995)',
 "Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996)",
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Two Bits (1995)',
 'Big Bully (1996)']