# **Pratinav Seth**

## **Reg. No - 200968216**

### **Section A Batch 2**

### **B.Tech - Data Science and Engineering**


# **DSE 2262 MACHINE LEARNING LABORATORY  - Mini Project**

# **Objective:**

Build an app with a simple UI which will allow the user to search for movies and give recommendations based on the user and the movie searched for by use of content based and collaborative filtering.

## **Dataset –**

#### •	GroupLens Research MovieLens ml-latest-small dataset
    It consists of 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

# **Machine Learning Objectives and Approach:**

## **Recommendation System to be used:**

Will be trying to implement-

1.	**Content based filtering–**

  a.	Recommending movies based on search made by user by comparing similarities between genres

  b.	 Recommending movies based on previous watch history of a user based on genres of previous movies

2.	**Collaborative Filtering-**

  a.	Item-Item Filtering – Given user-id Recommends Movies which isn’t seen by user based on ratings of user towards a movie he had seen

  b.	User-Item Filtering – Given user-id in database tries to find user most similar and then recommends what that user has seen but our user hasn’t


**NOTE:**

Will try to explore GNN for Collaborative Item-Item Filtering if its within time and resource compute of given timelines of project. Along with that will also try to explore other ML based approaches for the same subject to if its within time and resource compute of given timelines of project.

# **Meta-Data of Dataset:**

## **Dataset -** 
GroupLens Research MovieLens ml-latest-small dataset

	It consists of 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

## **Summary**

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.
The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.
This and other GroupLens data sets are publicly available for download at <http://grouplens.org/datasets/>.

## **Model Building and API Building - PHASE 2**

### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import recmetrics


### Import train and test CSV files

In [2]:
df_movies = pd.read_csv("movies.csv")
df_ratings = pd.read_csv("ratings.csv")
df_links = pd.read_csv("links.csv")
df_tags = pd.read_csv("tags.csv")

### List of the features within the dataset

In [3]:
print("Movie : ", df_movies.columns,end="\n\n")
print("Rating : ", df_ratings.columns,end="\n\n")
print("Links : ", df_links.columns,end="\n\n")
print("Tags : ", df_tags.columns,end="\n\n")

Movie :  Index(['movieId', 'title', 'genres'], dtype='object')

Rating :  Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

Links :  Index(['movieId', 'imdbId', 'tmdbId'], dtype='object')

Tags :  Index(['userId', 'movieId', 'tag', 'timestamp'], dtype='object')



### Dropping the timestamp from ratings and tags dataframes

In [4]:
df_ratings.drop(columns='timestamp',inplace=True)
df_tags.drop(columns='timestamp',inplace=True)

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

### MODEL 0 - API 0

### **Cold Start**

**Recommending top 10 most popular movies if its a new user**

In [6]:
def cold_start_check(userid):
    if(userid in set(df_ratings["userId"])):
        return 0
    else:
        x=df_ratings.groupby('movieId').rating.mean()
        movie = pd.merge(df_movies,x,how='outer',on='movieId')
        movie['rating'].fillna('0',inplace=True)
        x = df_ratings.groupby('movieId',as_index=False).userId.count()
        x.sort_values('userId',ascending=False,inplace=True)
        y = pd.merge(movie,x,how='outer',on='movieId')
        y.sort_values(['userId','rating'],ascending=False,inplace=True)
        df_return = y[0:10]
        return df_return
    

In [7]:
cold_start_check(999999)

Unnamed: 0,movieId,title,genres,rating,userId
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.164134,329.0
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,4.429022,317.0
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,4.197068,307.0
510,593,"Silence of the Lambs, The (1991)",Crime|Horror|Thriller,4.16129,279.0
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.192446,278.0
224,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,4.231076,251.0
418,480,Jurassic Park (1993),Action|Adventure|Sci-Fi|Thriller,3.75,238.0
97,110,Braveheart (1995),Action|Drama|War,4.031646,237.0
507,589,Terminator 2: Judgment Day (1991),Action|Sci-Fi,3.970982,224.0
461,527,Schindler's List (1993),Drama|War,4.225,220.0


### MODEL 1 - API 1

### **Content based filtering–**

**Recommending movies based on search made by user by comparing similarities between genres**

In [8]:
tfidf_movies_genres = TfidfVectorizer(token_pattern = '[a-zA-Z0-9\-]+')
df_movies['genres'] = df_movies['genres'].replace(to_replace="(no genres listed)", value="")
tfidf_movies_genres_matrix = tfidf_movies_genres.fit_transform(df_movies['genres'])
cosine_sim_movies = linear_kernel(tfidf_movies_genres_matrix, tfidf_movies_genres_matrix)

In [9]:
def get_recommendations_based_on_genres(movie_title, cosine_sim_movies=cosine_sim_movies):
    idx_movie = df_movies.loc[df_movies['title'].isin([movie_title])]
    idx_movie = idx_movie.index
    sim_scores_movies = list(enumerate(cosine_sim_movies[idx_movie][0]))
    sim_scores_movies = sorted(sim_scores_movies, key=lambda x: x[1], reverse=True)
    sim_scores_movies = sim_scores_movies[1:3]
    movie_indices = [i[0] for i in sim_scores_movies]
    return df_movies['title'].iloc[movie_indices]

In [10]:
get_recommendations_based_on_genres("Father of the Bride Part II (1995)")

17                        Four Rooms (1995)
18    Ace Ventura: When Nature Calls (1995)
Name: title, dtype: object

### STILL IN PROGRESS


### Evaluating Content Based Filtering Model using KNN

**Where We are Recommending movies based on search made by user by comparing similarities between genres**

In [11]:
from sklearn.neighbors import KNeighborsClassifier  
def get_movie_label(movie_id):
    classifier = KNeighborsClassifier(n_neighbors=5)
    x= tfidf_movies_genres_matrix
    y = df_movies.iloc[:,-1]
    classifier.fit(x, y)
    y_pred = classifier.predict(tfidf_movies_genres_matrix[movie_id])
    return y_pred

In [12]:
true_count = 0
false_count = 0
def evaluate_content_based_model():
    """
    Evaluate content based model.  
    """
    for key, colums in df_movies.iterrows():
        movies_recommended_by_model = get_recommendations_based_on_genres(colums["title"])
        predicted_genres  = get_movie_label(movies_recommended_by_model.index)
        for predicted_genre in predicted_genres:
            global true_count, false_count
            if predicted_genre == colums["genres"]:
                true_count = true_count+1
            else:
                false_count = false_count +1
                
evaluate_content_based_model()
total = true_count + false_count
print("Hit:"+ str(true_count/total))
print("Fault:" + str(false_count/total))

Hit:0.9325087251077807
Fault:0.06749127489221926


### MODEL 2 - API 2

### **Content based filtering–** 
**Recommending movies based on previous watch history of a user based on genres of previous movies**

In [13]:
def get_recommendation_content_model(userId):
    recommended_movie_list = []
    movie_list = []
    df_rating_filtered = df_ratings[df_ratings["userId"]== userId]
    for key, row in df_rating_filtered.iterrows():
        movie_list.append((df_movies["title"][row["movieId"]==df_movies["movieId"]]).values) 
    for index, movie in enumerate(movie_list):
        for key, movie_recommended in get_recommendations_based_on_genres(movie[0]).iteritems():
            recommended_movie_list.append(movie_recommended) 
    for movie_title in recommended_movie_list:
        if movie_title in movie_list:
            recommended_movie_list.remove(movie_title)
    
    return set(recommended_movie_list)

In [14]:
get_recommendation_content_model(10)

{'300: Rise of an Empire (2014)',
 'Ace Ventura: When Nature Calls (1995)',
 'Adventures in Babysitting (1987)',
 'Aladdin and the King of Thieves (1996)',
 'Alice in Wonderland (2010)',
 'Along Came a Spider (2001)',
 'American President, The (1995)',
 "Antonia's Line (Antonia) (1995)",
 'Around the World in 80 Days (1956)',
 'Assassins (1995)',
 'Asterix and Cleopatra (Astérix et Cléopâtre) (1968)',
 'Babes in Toyland (1934)',
 'Batman Forever (1995)',
 'Beauty and the Beast (1991)',
 'Before the Rain (Pred dozhdot) (1994)',
 'Ben-Hur (1959)',
 'Beowulf & Grendel (2005)',
 'Big Bully (1996)',
 'Blood and Chocolate (2007)',
 'Bolt (2008)',
 'Broken Arrow (1996)',
 "Bug's Life, A (1998)",
 'Captain Horatio Hornblower R.N. (1951)',
 'Clear and Present Danger (1994)',
 'Cliffhanger (1993)',
 'Cloud Atlas (2012)',
 'Clueless (1995)',
 'Corpse Bride (2005)',
 'Dangerous Minds (1995)',
 'Dark Knight, The (2008)',
 'Die Hard: With a Vengeance (1995)',
 "Dracula (Bram Stoker's Dracula) (1992)

In [15]:
pip install recmetrics





### MODEL 3 - API 3

### **Collaborative Filtering-**

**Item-Item Filtering – Given user-id Recommends Movies which isn’t seen by user based on ratings of user towards a movie he had seen**

In [16]:
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation



In [17]:
df_movies_ratings=pd.merge(df_movies, df_ratings)

ratings_matrix_items = df_movies_ratings.pivot_table(index=['movieId'],columns=['userId'],values='rating').reset_index(drop=True)
ratings_matrix_items.fillna( 0, inplace = True )

movie_similarity = 1 - pairwise_distances( ratings_matrix_items.values, metric="cosine" )
np.fill_diagonal( movie_similarity, 0 ) #Filling diagonals with 0s for future use when sorting is done

ratings_matrix_items = pd.DataFrame( movie_similarity )

In [18]:
def item_similarity(movieName): 
    try:
        user_inp=movieName
        inp=df_movies[df_movies['title']==user_inp].index.tolist()
        inp=inp[0]
        df_movies['similarity'] = ratings_matrix_items.iloc[inp]
        df_movies.columns = ['movie_id', 'title', 'release_date','similarity']
    except:
        print("Sorry, the movie is not in the database!")

def recommendedMoviesAsperItemSimilarity(user_id):
    user_movie= df_movies_ratings[(df_movies_ratings.userId==user_id) & df_movies_ratings.rating.isin([5,3])][['title']]
    user_movie=user_movie.iloc[0,0]
    item_similarity(user_movie)
    sorted_movies_as_per_userChoice=df_movies.sort_values( ["similarity"], ascending = False )
    sorted_movies_as_per_userChoice=sorted_movies_as_per_userChoice[sorted_movies_as_per_userChoice['similarity'] >=0.35]['movie_id']
    recommended_movies=list()
    df_recommended_item=pd.DataFrame()
    user2Movies= df_ratings[df_ratings['userId']== user_id]['movieId']
    for movieId in sorted_movies_as_per_userChoice:
            if movieId not in user2Movies:
                df_new= df_ratings[(df_ratings.movieId==movieId)]
                df_recommended_item=pd.concat([df_recommended_item,df_new])
            best10=df_recommended_item.sort_values(["rating"], ascending = False )[1:10] 
    return best10['movieId']

def movieIdToTitle(listMovieIDs):
    movie_titles= list()
    for id in listMovieIDs:
        movie_titles.append(df_movies[df_movies['movie_id']==id]['title'])
    return movie_titles

In [19]:
user_id=50
print("Recommended movies,:\n",movieIdToTitle(recommendedMoviesAsperItemSimilarity(user_id)))

Recommended movies,:
 [277    Shawshank Redemption, The (1994)
Name: title, dtype: object, 505    Ghost (1990)
Name: title, dtype: object, 505    Ghost (1990)
Name: title, dtype: object, 512    Beauty and the Beast (1991)
Name: title, dtype: object, 505    Ghost (1990)
Name: title, dtype: object, 512    Beauty and the Beast (1991)
Name: title, dtype: object, 512    Beauty and the Beast (1991)
Name: title, dtype: object, 512    Beauty and the Beast (1991)
Name: title, dtype: object, 505    Ghost (1990)
Name: title, dtype: object]


### MODEL 4 - API 4

### **Collaborative Filtering-**
**User-Item Filtering – Given user-id in database tries to find user most similar and then recommends what that user has seen but our user hasn’t**

In [20]:
ratings_matrix_users = df_movies_ratings.pivot_table(index=['userId'],columns=['movieId'],values='rating').reset_index(drop=True)
ratings_matrix_users.fillna( 0, inplace = True )
movie_similarity = 1 - pairwise_distances( ratings_matrix_users.values, metric="cosine" )
np.fill_diagonal( movie_similarity, 0 )
ratings_matrix_users = pd.DataFrame( movie_similarity )
similar_user_series= ratings_matrix_users.idxmax(axis=1)
df_similar_user= similar_user_series.to_frame()

In [21]:
movieId_recommended=list()
def getRecommendedMoviesAsperUserSimilarity(userId):
    """
     Recommending movies which user hasn't watched as per User Similarity
    :param user_id: user_id to whom movie needs to be recommended
    :return: movieIds to user 
    """
    user2Movies= df_ratings[df_ratings['userId']== userId]['movieId']
    sim_user=df_similar_user.iloc[0,0]
    df_recommended=pd.DataFrame(columns=['movieId','title','genres','userId','rating','timestamp'])
    for movieId in df_ratings[df_ratings['userId']== sim_user]['movieId']:
        if movieId not in user2Movies:
            df_new= df_movies_ratings[(df_movies_ratings.userId==sim_user) & (df_movies_ratings.movieId==movieId)]
            df_recommended=pd.concat([df_recommended,df_new])
        best10=df_recommended.sort_values(['rating'], ascending = False )[1:10]  
    return best10['movieId']

In [22]:
user_id=50
recommend_movies= movieIdToTitle(getRecommendedMoviesAsperUserSimilarity(user_id))
print("Movies you should watch are:\n")
print(recommend_movies)

Movies you should watch are:

[1431    Rocky (1976)
Name: title, dtype: object, 742    African Queen, The (1951)
Name: title, dtype: object, 733    It's a Wonderful Life (1946)
Name: title, dtype: object, 939    Terminator, The (1984)
Name: title, dtype: object, 969    Back to the Future (1985)
Name: title, dtype: object, 510    Silence of the Lambs, The (1991)
Name: title, dtype: object, 1057    Star Trek II: The Wrath of Khan (1982)
Name: title, dtype: object, 1059    Star Trek IV: The Voyage Home (1986)
Name: title, dtype: object, 1939    Matrix, The (1999)
Name: title, dtype: object]
