# User-Based Collaborative Filtering

https://medium.com/@srinidhi14vaddy/collaborative-filtering-movie-recommendation-60461c7ef897

https://www.kaggle.com/srinidhi14vaddy/collaborative-filtering-movie-recommendation/notebook

User-based collaborative filtering uses other users' ratings to recommend items to the input user. It finds users whose preferences and opinions are similar to the input user's, then recommends items that those users liked. There are several ways to find similar users (some of them based on machine learning); the one used here is the Pearson correlation coefficient.
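As a rough illustration of the idea, the sketch below computes the Pearson correlation between users on a tiny hand-made ratings table and recommends unseen movies from positively correlated users. The ratings, the user names, and the helper names `pearson` and `recommend` are all hypothetical; this is one simple way to do it under those assumptions, not necessarily how the linked notebook does it.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "input": {"A": 5.0, "B": 3.0, "C": 1.0},
    "u1": {"A": 4.5, "B": 3.0, "C": 1.5, "D": 5.0},
    "u2": {"A": 1.0, "B": 3.0, "C": 5.0, "E": 4.0},
}


def pearson(x, y):
    """Pearson correlation of two users over the movies both have rated."""
    common = set(x) & set(y)
    n = len(common)
    if n < 2:
        return 0.0  # not enough overlap to measure agreement
    mean_x = sum(x[m] for m in common) / n
    mean_y = sum(y[m] for m in common) / n
    cov = sum((x[m] - mean_x) * (y[m] - mean_y) for m in common)
    var_x = sum((x[m] - mean_x) ** 2 for m in common)
    var_y = sum((y[m] - mean_y) ** 2 for m in common)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same carries no signal
    return cov / sqrt(var_x * var_y)


def recommend(user, ratings, top_n=5):
    """Rank movies the user has not rated by similarity-weighted average rating."""
    target = ratings[user]
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(target, other_ratings)
        if sim <= 0:
            continue  # skip users with no or opposite taste
        for movie, r in other_ratings.items():
            if movie not in target:
                scores[movie] = scores.get(movie, 0.0) + sim * r
                weights[movie] = weights.get(movie, 0.0) + sim
    ranked = sorted(scores, key=lambda m: scores[m] / weights[m], reverse=True)
    return ranked[:top_n]


print(recommend("input", ratings))
```

In this toy table, u1 rates the shared movies in the same order as the input user while u2 rates them in the opposite order, so only u1's unseen movie is recommended.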

In [1]:
# Import modules
import numpy as np
import pandas as pd
from math import sqrt
import matplotlib.pyplot as plt
%matplotlib inline

## Import datasets

In [3]:
rating = pd.read_csv('data/train.csv')
movie = pd.read_csv('data/movies.csv')

A quick look at the first rows and the columns of the two dataframes:

In [4]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


# Content-Based Filtering
https://www.analyticsvidhya.com/blog/2021/12/comprehensive-project-on-building-a-movie-recommender-website/

A recommender system takes a user's choices as input and predicts related items, such as movies, news articles, or books.

You have probably seen recommender systems in action while scrolling through YouTube or Netflix.

Content-based filtering
This type of filtering recommends items based on what you already like. If you love comedy movies, for example, a content-based recommender will suggest other comedies, because they belong to the same category.

Steps Involved

- Step 1. Getting the Dataset

- Step 2. Data Cleaning and Processing

- Step 3. Training our Recommender System

- Step 4. Testing and Validation

- Step 5. Saving the Trained Model for Deployment

In [1]:
# Import Modules
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Import data

movies = pd.read_csv('data/movies.csv')
#imbd_df = pd.read_csv('data/imdb_data.csv')


In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Training Steps

The final dataframe holds textual data, which has to be converted into numerical values before it can be fed to machine learning algorithms. This process is called feature extraction (or vectorization).
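To make vectorization concrete, here is a minimal plain-Python sketch of count vectorization on three genre strings. It only approximates what CountVectorizer does by default (lowercasing, then keeping tokens of two or more word characters); the helper name `tokenize` and the sample list are illustrative assumptions.

```python
import re
from collections import Counter

# Genre strings in the same pipe-separated form as the `genres` column
genres = ["Adventure|Animation|Children", "Comedy|Romance", "Comedy"]


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


# One token list ("document") per movie
docs = [tokenize(g.replace("|", " ")) for g in genres]

# Sorted vocabulary: one column per distinct token
vocabulary = sorted({tok for doc in docs for tok in doc})

# Count matrix: one row per movie, one column per vocabulary token
vectors = [[Counter(doc)[tok] for tok in vocabulary] for doc in docs]

print(vocabulary)
for row in vectors:
    print(row)
```

Note that with this kind of tokenization a hyphenated genre such as Sci-Fi is split into two tokens, sci and fi.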

## Data Processing Function



In [3]:
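def content_data_processing(df):
    """Turn the pipe-separated genres column into a matrix of token counts."""
    # "Comedy|Romance" -> "Comedy Romance": one space-separated document per movie
    # (this adds a genre_corpus column to the dataframe that is passed in)
    df['genre_corpus'] = df['genres'].str.replace('|', ' ', regex=False)

    # Count how often each genre token occurs in each movie's document
    cvect = CountVectorizer()
    # Dense array of shape (number of movies, number of distinct tokens)
    vectors = cvect.fit_transform(df['genre_corpus']).toarray()
    return vectors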

### Model Building

The model should be able to measure the similarity between movies based on their genre tags.

The recommender takes movie titles as input and returns the top-n most similar movies based on those tags.

Cosine similarity is used here to compare the tag vectors: the closer the value is to 1, the more similar two movies are.

sklearn provides the cosine_similarity function (in sklearn.metrics.pairwise) for computing pairwise cosine similarity.
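The following plain-Python sketch shows what cosine similarity and a top-n lookup amount to on a few genre count vectors. The vectors are written out by hand from the genres of four movies shown earlier, and the helper names `cosine` and `most_similar` are assumptions made for illustration; in the notebook itself the pairwise similarities come from sklearn.

```python
from math import sqrt

# Genre counts over the vocabulary
# [adventure, animation, children, comedy, fantasy, romance]
vectors = {
    "Toy Story (1995)": [1, 1, 1, 1, 1, 0],
    "Jumanji (1995)": [1, 0, 1, 0, 1, 0],
    "Grumpier Old Men (1995)": [0, 0, 0, 1, 0, 1],
    "Father of the Bride Part II (1995)": [0, 0, 0, 1, 0, 0],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a movie without any genre tokens matches nothing
    return dot / (norm_a * norm_b)


def most_similar(title, top_n):
    """Return the top_n other titles, ranked by cosine similarity to title."""
    target = vectors[title]
    others = [t for t in vectors if t != title]
    others.sort(key=lambda t: cosine(target, vectors[t]), reverse=True)
    return others[:top_n]


print(most_similar("Father of the Bride Part II (1995)", 2))
```

Because the genre counts are never negative, the similarity here always falls between 0 (no shared genres) and 1 (identical genre profile).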

## Function for Content-Based Filtering

In [4]:
def content_model(movie_list, top_n):
    # Expects exactly three titles in movie_list, each present in movies

    new_df = movies.copy()

    # Index labels of the three input movies
    movie_index_1 = new_df[new_df['title'] == movie_list[0]].index[0]
    movie_index_2 = new_df[new_df['title'] == movie_list[1]].index[0]
    movie_index_3 = new_df[new_df['title'] == movie_list[2]].index[0]

    # Compare against a random half of the catalogue to keep the similarity
    # matrix small; results therefore vary from run to run
    df_1 = new_df.sample(frac=0.5)
    # Put the three input movies first so they occupy rows 0, 1 and 2
    df_2 = new_df.loc[[movie_index_1, movie_index_2, movie_index_3]]
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    df_2 = pd.concat([df_2, df_1])

    vectors = content_data_processing(df_2)
    similarity = cosine_similarity(vectors)

    # Similarity of each input movie to every row of df_2
    distances_1 = similarity[0]
    distances_2 = similarity[1]
    distances_3 = similarity[2]

    sim_score_1 = pd.Series(distances_1).sort_values(ascending=False)
    sim_score_2 = pd.Series(distances_2).sort_values(ascending=False)
    sim_score_3 = pd.Series(distances_3).sort_values(ascending=False)

    # Pool the three score lists and rank all candidates together
    sim_score_list = pd.concat([sim_score_1, sim_score_2, sim_score_3]).sort_values(ascending=False)

    # Row positions in df_2, best score first (a position can occur up to three times)
    indexes = list(sim_score_list.index)

    recommended_movies = []

    for i in indexes:
        title = df_2.iloc[i].title
        # Skip the input movies and anything already recommended
        if title not in movie_list and title not in recommended_movies:
            recommended_movies.append(title)
        if len(recommended_movies) == top_n:
            break

    return recommended_movies

In [5]:
movie_list = ['Guardian Angel (1994)','Jack Frost (1979)','Wasteland No. 1: Ardent Verdant (2017)']


In [9]:

recommended_movies = content_model(movie_list, 10)


In [10]:
recommended_movies

['Caged No More (2016)',
 "President's Man: A Line in the Sand, The (2002)",
 'Firepower (1979)',
 'Deadly Encounter (1982)',
 'Eraser (1996)',
 'Assassination (2015)',
 'Moscow Heat (2004)',
 'Killer Elite, The (1975)',
 'Klansman, The (1974)',
 'Operation Thunderbolt (1977)']