# ENTERTAINMENT DATASET
Using the entertainment dataset to build a recommender system.

## BUSINESS OBJECTIVE
* Maximise profit
* Maximise visibility
* Maximise ease of use
* Maximise customer base
* Minimise attrition rate

## CONSTRAINTS
* High competition
* Copyright issues
* Online piracy

## DATA DICTIONARY

| **Sl. No.** | **Name of Feature** | **Description**             | **Type** | **Relevance** |
|:-----------:|:-------------------:|:----------------------------|:--------:|:-------------:|
|      1      | Id                  | User ID.                    | Nominal  | Relevant      |
|      2      | Titles              | Name of the movie.          | Nominal  | Relevant      |
|      3      | Category            | Genre(s) of the movie.      | Nominal  | Relevant      |
|      4      | Reviews             | Review score of the movie.  | Ratio    | Irrelevant    |

Importing the required libraries

In [1]:
import pandas as pd

Loading the dataset using the pandas library.

In [2]:
df0=pd.read_csv(r"D:\360Digitmg\ASSIGNMENTS\Ass10\Entertainment.csv")
df=df0.copy()
df.head()

,Id,Titles,Category,Reviews
0,6973,Toy Story (1995),"Drama, Romance, School, Supernatural",-8.98
1,6778,Jumanji (1995),"Action, Adventure, Drama, Fantasy, Magic, Mili...",8.88
2,9702,Grumpier Old Men (1995),"Action, Comedy, Historical, Parody, Samurai, S...",99.0
3,6769,Waiting to Exhale (1995),"Sci-Fi, Thriller",99.0
4,1123,Father of the Bride Part II (1995),"Action, Comedy, Historical, Parody, Samurai, S...",-0.44


The code below gives a general overview of the dataset.

In [3]:
df.shape

(51, 4)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Id        51 non-null     int64  
 1   Titles    51 non-null     object 
 2   Category  51 non-null     object 
 3   Reviews   51 non-null     float64
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ KB


From the above information it is clear that there are no null values in the dataset and that its shape is (51, 4).

Checking the number of duplicate rows in the dataset.

In [5]:
df.duplicated(keep='first').sum()

0

In [11]:
df.columns

Index(['Id', 'Titles', 'Category', 'Reviews'], dtype='object')

* Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
* Creating a TF-IDF vectorizer that removes all stop words.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

Configuring the TF-IDF vectorizer to drop English stop words.

In [10]:
tfidf = TfidfVectorizer(stop_words = "english")  

Preparing the TF-IDF matrix by fitting the vectorizer on the Category column and transforming it into a normalized TF-IDF representation.

In [12]:
tfidf_matrix = tfidf.fit_transform(df.Category)

In [13]:
tfidf_matrix.shape

(51, 34)

Next we compute similarity scores from the above matrix. We use cosine similarity because it is independent of vector magnitude and easy to compute.

In [14]:
from sklearn.metrics.pairwise import linear_kernel

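Computing the cosine similarity on the TF-IDF matrix. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product returned by `linear_kernel` equals the cosine similarity.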

In [15]:
cosine_sim_matrix = linear_kernel(tfidf_matrix, tfidf_matrix)
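
As a quick sanity check of why `linear_kernel` can stand in for cosine similarity here, the sketch below compares the two on made-up genre strings (hypothetical data; `toy_genres`, `dot_products` and `cosines` are illustrative names). It relies on the default `norm="l2"` setting of `TfidfVectorizer`, which gives every row unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical genre strings, not taken from the Entertainment dataset.
toy_genres = ["Action, Comedy", "Action, Drama", "Drama, Romance"]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_genres)

dot_products = linear_kernel(toy_matrix, toy_matrix)   # plain dot products
cosines = cosine_similarity(toy_matrix, toy_matrix)    # explicitly normalized

# Each row has unit length, so a row's dot product with itself is 1 ...
print(np.allclose(np.diag(dot_products), 1.0))
# ... and the two similarity matrices agree.
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the appropriate call.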

Creating a mapping of movie titles to index number 

In [16]:
df_index = pd.Series(df.index, index = df['Titles'])
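
This lookup works as intended only when every title is unique; the earlier duplicate check compares whole rows, not titles alone. The following sketch, on a made-up frame with a repeated title (hypothetical data and names), shows what a lookup returns in each case and one possible way to keep only the first occurrence.

```python
import pandas as pd

# Hypothetical frame with a repeated title, for illustration only.
toy_df = pd.DataFrame({"Titles": ["A (1995)", "B (1995)", "A (1995)"]})

toy_index = pd.Series(toy_df.index, index=toy_df["Titles"])

unique_lookup = toy_index["B (1995)"]     # a single integer position
repeated_lookup = toy_index["A (1995)"]   # a Series holding two positions

# Keeping the first occurrence of each title restores a one-to-one mapping.
toy_index_unique = toy_index[~toy_index.index.duplicated(keep="first")]

print(unique_lookup)
print(list(repeated_lookup))
print(toy_index_unique["A (1995)"])
```

A check such as `df["Titles"].is_unique` on the real data would show whether this precaution is needed.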

In [19]:
def get_recommendations(Name, topN):
    # topN = 10
    # Getting the movie index using its title
    df_id = df_index[Name]

    # Getting the pairwise similarity scores of all the movies with that movie
    cosine_scores = list(enumerate(cosine_sim_matrix[df_id]))

    # Sorting the (index, score) pairs by score, highest first
    cosine_scores = sorted(cosine_scores, key=lambda x: x[1], reverse=True)

    # Getting the scores of the top N most similar movies
    cosine_scores_N = cosine_scores[1: topN + 1]

    # Getting the movie indices and their scores
    df_idx = [i[0] for i in cosine_scores_N]
    df_scores = [i[1] for i in cosine_scores_N]

    # Similar movies and scores
    df_similar_show = pd.DataFrame(columns=["Titles", "Score"])
    df_similar_show["Titles"] = df.loc[df_idx, "Titles"]
    df_similar_show["Score"] = df_scores
    df_similar_show.reset_index(inplace=True)
    # df_similar_show.drop(["index"], axis=1, inplace=True)
    print(df_similar_show)
    # return df_similar_show
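
The function above skips the first sorted entry on the assumption that it is the queried movie. If another title has an identical genre string, the two tie at a score of 1, and because Python's sort is stable, the title with the lower row index comes first. As one possible alternative (a sketch on made-up data, with the hypothetical names `toy_df`, `toy_sim`, `toy_index` and `recommend`), the queried row can be dropped by its index and the result returned as a DataFrame instead of printed.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical data standing in for the Entertainment dataset.
toy_df = pd.DataFrame({
    "Titles": ["A (1995)", "B (1995)", "C (1995)", "D (1995)"],
    "Category": ["Action, Comedy", "Action, Comedy",
                 "Drama, Romance", "Action, Drama"],
})

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_df["Category"])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_index = pd.Series(toy_df.index, index=toy_df["Titles"])


def recommend(name, top_n):
    """Return the top_n titles most similar to `name`, excluding `name` itself."""
    row = toy_index[name]
    scores = pd.Series(toy_sim[row], index=toy_df.index)
    # Drop the queried row by label rather than by sorted position.
    scores = scores.drop(row).sort_values(ascending=False).head(top_n)
    return pd.DataFrame({
        "Titles": toy_df.loc[scores.index, "Titles"],
        "Score": scores,
    })


print(recommend("B (1995)", top_n=2))
```

Here "A (1995)" and "B (1995)" share the same genres, yet querying "B (1995)" still excludes it from its own recommendations.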

Passing a movie title and the number of recommendations needed prints the desired output.

In [28]:
get_recommendations("Othello (1995)", topN = 10)


    index                                             Titles     Score
0      25                                     Othello (1995)  1.000000
1       0                                   Toy Story (1995)  0.625943
2      26                                Now and Then (1995)  0.546952
3      45                       When Night Is Falling (1995)  0.453912
4      50                                     Georgia (1995)  0.419972
5      39                                 Restoration (1995)  0.405287
6      42               How to Make an American Quilt (1995)  0.393702
7      29  Shanghai Triad (Yao a yao yao dao waipo qiao) ...  0.390965
8       3                           Waiting to Exhale (1995)  0.389033
9      23                                      Powder (1995)  0.388408
10     22                                   Assassins (1995)  0.366221


## CONCLUSION
* The recommendation system can suggest movies that the customer did not know about or was not planning to watch. Because these movies share genres with a movie the customer already chose, the recommendation is likely to generate interest in them.
* Movies in the genres the customer is interested in will be recommended.