## Content Based Movie Recommendation System
![alt text](GoogleRecommendationForTheDarkKnight.png "Google Movie Recommendation")
Google comes up with movies that are similar to the ones you like. 

It turns out that there are (mostly) three ways to build a recommendation engine:

    1. Popularity based recommendation engine
    2. Content based recommendation engine
    3. Collaborative filtering based recommendation engine
    
<h5>Popularity based recommendation engine:</h5>

This is perhaps the simplest kind of recommendation engine, one that anyone can implement. The trending lists we see on YouTube or Netflix are based on this approach. It keeps track of the view count for each movie or video and then lists them in descending order of views (highest view count to lowest). It is pretty simple, but effective.
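
A minimal sketch of this idea is shown here. The titles and view counts are made up for illustration, and the function name `trending` is hypothetical.

```python
# Hypothetical view counts; titles and numbers are made up for illustration.
view_counts = {
    "Movie A": 1200,
    "Movie B": 4500,
    "Movie C": 300,
    "Movie D": 4500,
}

def trending(view_counts, top_n):
    """Return the top_n titles ordered by view count, highest first."""
    # sorted() is stable, so titles with equal counts keep their original order.
    ranked = sorted(view_counts, key=view_counts.get, reverse=True)
    return ranked[:top_n]

print(trending(view_counts, 3))
```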

<h5>Content based recommendation engine:</h5>

This type of recommendation system takes a movie that the user currently likes as input. It then analyzes the contents (storyline, genre, cast, director, etc.) of that movie to find other movies with similar content, ranks them by their similarity scores, and recommends the most relevant ones to the user.

<h5>Collaborative filtering based recommendation engine: </h5>

This algorithm first tries to find similar users based on their activities and preferences (for example, both users watch the same type of movies, or movies directed by the same director). Then, for two such users (say, A and B), if user A has seen a movie that user B has not seen yet, that movie is recommended to user B, and vice versa. In other words, the recommendations are filtered based on the collaboration between similar users' preferences (hence the name “Collaborative Filtering”). One typical application of this algorithm can be seen on the Amazon e-commerce platform, where you get to see the “Customers who viewed this item also viewed” and “Customers who bought this item also bought” lists.
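
The A/B example can be sketched in a few lines. This is only an illustration: the watch histories are made up, and Jaccard overlap is an assumed similarity measure, since the description above does not prescribe one.

```python
# Hypothetical watch histories; user names and titles are made up.
watched = {
    "A": {"Inception", "Interstellar", "Memento"},
    "B": {"Inception", "Interstellar", "Dunkirk"},
    "C": {"Frozen", "Moana"},
}

def jaccard(a, b):
    """Overlap between two sets: |a & b| / |a | b| (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_for(user, watched):
    """Suggest titles seen by the most similar other user but not by `user`."""
    others = [u for u in watched if u != user]
    # Pick the other user whose history overlaps most with this user's.
    nearest = max(others, key=lambda u: jaccard(watched[user], watched[u]))
    return sorted(watched[nearest] - watched[user])

print(recommend_for("B", watched))  # titles A has seen that B has not
print(recommend_for("A", watched))  # and vice versa
```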

![alt text](AmazonRecommendation.jpg "Amazon Recommendation")

![alt text](ContentvsCollaborative.png "Content vs collaborative")


Another type of recommendation system can be created by mixing the properties of two or more types of recommendation systems. Such systems are known as hybrid recommendation systems.

In this project, we implement a content based recommendation system using cosine similarity.

##### Finding the similarity
We know our recommendation system is content based, so we need to find similar movies for a given movie and then recommend those movies to the user.

But how can we find which movies are similar to a given movie, and how similar they are?

Suppose we have two sentences:

1. "Bangalore Hyderabad Bangalore"
2. "Hyderabad Hyderabad Bangalore"

We can write the above strings as vectors of word counts, using the axis order (Bangalore, Hyderabad):

1. (2, 1)
2. (1, 2)
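
The counts above can be reproduced with a short standard-library sketch. The helper name `to_vector` is hypothetical, and the vocabulary order is fixed by hand so that the axes match the vectors listed above.

```python
from collections import Counter

sentences = ["Bangalore Hyderabad Bangalore", "Hyderabad Hyderabad Bangalore"]

# Fix the order of the axes: one axis per distinct word.
vocabulary = ["Bangalore", "Hyderabad"]

def to_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return tuple(counts[word] for word in vocabulary)

vectors = [to_vector(s, vocabulary) for s in sentences]
print(vectors)  # [(2, 1), (1, 2)]
```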

Plotting those two vectors on a graph looks like this:

![alt text](graph.png "Graph")

Here, the blue vector represents “Sentence 1” and the red vector represents “Sentence 2”.

Now we have represented these two sentences graphically.

Both texts are now vectors, so we can say that two vectors are similar if the distance between them is small. By distance, we mean the angular distance between the two vectors, which is represented by θ (theta). From a machine learning perspective, the value of cos θ is more convenient than the value of θ itself. Word counts are never negative, so these vectors always lie in the first quadrant, where the cosine function maps θ to a value between 0 and 1 (recall that cos 90° = 0 and cos 0° = 1). A value closer to 1 therefore means the two vectors are more similar.
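
For the two sentence vectors above, both θ and cos θ can be computed directly. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import math

a = (2, 1)  # "Bangalore Hyderabad Bangalore"
b = (1, 2)  # "Hyderabad Hyderabad Bangalore"

dot = sum(p * q for p, q in zip(a, b))   # 2*1 + 1*2 = 4
norm_a = math.hypot(*a)                  # sqrt(2^2 + 1^2) = sqrt(5)
norm_b = math.hypot(*b)                  # sqrt(1^2 + 2^2) = sqrt(5)
cos_theta = dot / (norm_a * norm_b)      # 4 / 5
theta_deg = math.degrees(math.acos(cos_theta))

print(round(cos_theta, 3), round(theta_deg, 2))
```

The cosine is about 0.8 and the angle is a little under 37°, so the two sentences are fairly similar but not identical.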

##### Cosine similarity
It is a metric that helps determine how similar data objects are, irrespective of their size. We can measure the similarity between two sentences in Python using cosine similarity. In cosine similarity, the data objects in a dataset are treated as vectors.
The formula for the cosine similarity between two vectors is:

Cos(x, y) = (x . y) / (||x|| * ||y||)

where,

x . y = dot product of the vectors ‘x’ and ‘y’.

||x|| and ||y|| = lengths (magnitudes) of the two vectors ‘x’ and ‘y’.

||x|| * ||y|| = product of the lengths of the two vectors ‘x’ and ‘y’.

Example :

Consider an example to find the similarity between two vectors – ‘x’ and ‘y’, using Cosine Similarity.

The ‘x’ vector has values, x = { 1, 6, 3, 0 }
The ‘y’ vector has values, y = { 1, 7, 4, 1 }

The formula for calculating the cosine similarity is : Cos(x, y) = (x . y) / (||x|| * ||y||)

In [1]:
import numpy as np


# cosine similarity implementation
x = [1, 6, 3, 0]
y = [1, 7, 4, 1]

# x . y = 1*1 + 6*7 + 3*4 + 0*1
xy = sum(a * b for a, b in zip(x, y))

# magx = √(1^2 + 6^2 + 3^2 + 0^2)
magx = np.sqrt(sum(a * a for a in x))

# similarly for y
magy = np.sqrt(sum(b * b for b in y))

CosineSimilarity = xy / (magx * magy)
print(f"Cosine similarity between {x} and {y} is : ", CosineSimilarity)

Cosine similarity between [1, 6, 3, 0] and [1, 7, 4, 1] is :  0.9907096022037775
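
As a cross-check, the same value can be obtained with NumPy's vectorised helpers. This is only a sketch for comparison; it is not the implementation used in the rest of this project.

```python
import numpy as np

x = np.array([1, 6, 3, 0])
y = np.array([1, 7, 4, 1])

# np.dot gives x . y; np.linalg.norm gives the Euclidean length ||v||.
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cosine)
```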


In [2]:
text = ["Bangalore Hyderabad Bangalore","Hyderabad Hyderabad Bangalore"]

# but how can we convert them into vectors?
# we need a way to represent these texts in vector form.

The CountVectorizer class from sklearn.feature_extraction.text can do this for us; we would need to import it before creating a new CountVectorizer() object. But in this project <code>we try to implement everything from scratch</code>.
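
For comparison only, here is a brief sketch of what the scikit-learn route could look like, assuming scikit-learn is installed. It is not used anywhere else in this project.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Bangalore Hyderabad Bangalore", "Hyderabad Hyderabad Bangalore"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix.
counts = vectorizer.fit_transform(text).toarray()

# By default the words are lowercased and the columns are in alphabetical order.
print(sorted(vectorizer.vocabulary_))
print(counts)
```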

### Content Based Movie Recommendation from scratch using <code>python, numpy, some mathematical concepts and pandas</code>

In [18]:
# Content Based Movie Recommendation class

class ContentBasedRecommendation:
    
    # constructor that accepts the target string, the content to search in,
    # and an optional list of movie titles
    def __init__(self, Target, Content, Movies = None):

        # storing Movies list
        self.Movies = Movies
        
        # preserving Content for example demonstration
        self.RawContent = Content
        
        # preserving Target for example demonstration
        self.RawTarget = Target
        
        # assigning the target dictionary returned by self.CustomVectorizer for a string parameter
        self.TargetContent = self.CustomVectorizer(str(Target))
        
        # list of the keys in self.TargetContent, in order
        self.Order = list(self.TargetContent.keys())
        
        # vectorizing Content (a list of strings) into dictionaries aligned to self.Order
        self.Content = self.CustomVectorizer(Content, self.Order)        
        
     
    
    def GetCountOfValuesInString(self, String):
        
        # Initializing empty dictionary
        Vector = dict()

        # for each word in the sentence, separated by a single space
        for PartofStr in String.split(" "):

            # If Vector has this key already
            if PartofStr in Vector:

                # Increment the value of the key in Vector
                Vector[PartofStr] += 1
            else: 

                # Else add the key to the Vector with value 1
                Vector[PartofStr] = 1

        # return vector
        return Vector
            
            
    
    # Custom vectorizer function that converts a given string or list of strings to count vectors
    def CustomVectorizer(self, String, Order = None):
        
        # if we pass a string
        if isinstance(String, str):
            
            return self.GetCountOfValuesInString(String)
            
        
        # if we pass a list of strings
        elif isinstance(String, list):
            
            # fall back to the target's key order when no Order is given
            if Order is None:
                Order = self.Order
            
            # Initialize empty list
            ListOfDicts = list()
            
            # for each string in the passed list
            for strr in String:
                
                # calling self.GetCountOfValuesInString() to get the vector as a dictionary,
                # aligning it to the key order in Order,
                # and appending the dictionary to ListOfDicts
                ListOfDicts.append(self.ReorderDict(self.GetCountOfValuesInString(strr), Order))
            
            # return ListOfDicts
            return ListOfDicts
    
        
    # function that returns cosine similarities
    def CosineSimilarity(self, FeatureVectors, Target):
    
        # magnitude of the target vector: square root of the sum of squares of its data points
        MagTarget = np.sqrt(sum(DataPoint * DataPoint for DataPoint in Target))
        
        # empty list to keep track of Similarities
        Similarities = list()
        
        # for each sample in the population
        for Sample in FeatureVectors:
            
            # magnitude of the sample vector
            MagSample = np.sqrt(sum(DataPoint * DataPoint for DataPoint in Sample))
            
            # dot product: sum of products of corresponding data points in target and sample
            xy = sum(x * y for x, y in zip(Target, Sample))
            
            # product of the two magnitudes (denominator of the formula)
            Denominator = MagTarget * MagSample
            
            # If either vector has zero length, the similarity is undefined
            if Denominator == 0:
                
                # append 0 instead of dividing by zero (which would give nan)
                Similarities.append(0)
            else:
                
                # else append the cosine similarity
                Similarities.append(xy / Denominator)
        
        # return cosine similarities list
        return Similarities

    # method that reorders the key-value pairs
    def ReorderDict(self, Dic, Order):
        
        # returns the counts for the keys of the target dictionary, using 0 for missing keys
        return {k : Dic.get(k, 0) for k in Order}

    
    # method that returns list of recommendations
    def GetRecommendations(self, NoofRecommendations):
        
        # getting values in target dictionary and typecasting to list
        Tar = list(self.TargetContent.values())
        
        # getting the value lists of all dictionaries of available movies
        Avals = [list(dic.values()) for dic in self.Content]
        
        # finding cosine similarities
        Similarities = self.CosineSimilarity(Avals, Tar)
        
        # sorting indexes by similarity score in ascending order and keeping the last ones.
        # Since the most similar movie to a given movie will be itself,
        # we take one extra element (NoofRecommendations + 1) so that it can be discarded below.
        SimilarMovieIndexes = sorted(range(len(Similarities)), key = lambda x: Similarities[x])[-NoofRecommendations - 1:]
        
        # reversing sorted indexes so that the most similar comes first
        SimilarMovieIndexes.reverse()
        
        # returning movies by indexes
        RecommendedMovies = list()
        
        for index in SimilarMovieIndexes:
            if self.Movies is None:
                RecommendedMovies.append(self.RawContent[index])
            else :
                RecommendedMovies.append(self.Movies[index])
        
        # removing our target movie (only done when movie titles were given)
        if self.Movies is not None:
            RecommendedMovies.pop(0)
        return RecommendedMovies
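
The alignment step done by ReorderDict can be seen in isolation in the following sketch. The standalone helpers `count_words` and `reorder` are hypothetical stand-ins that mirror GetCountOfValuesInString and ReorderDict outside the class.

```python
def count_words(sentence):
    """Word -> occurrence count, splitting on single spaces as the class does."""
    counts = {}
    for word in sentence.split(" "):
        counts[word] = counts.get(word, 0) + 1
    return counts

def reorder(counts, order):
    """Keep only the words in `order`, in that order, using 0 for absent words."""
    return {word: counts.get(word, 0) for word in order}

target = count_words("Bangalore Hyderabad Bangalore")
order = list(target)  # ['Bangalore', 'Hyderabad']

candidate = count_words("Newyork Sydney Bangalore")
aligned = reorder(candidate, order)

print(list(target.values()))   # [2, 1]
print(list(aligned.values()))  # [1, 0]  ("Newyork" and "Sydney" are dropped)
```

Because only the target's words become axes, words that appear only in a candidate do not contribute to that candidate's length, so the scores can differ from a cosine similarity computed over the full vocabulary.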

In [7]:
# demonstrating the procedure using one example
OurIntrest = "Bangalore Hyderabad Bangalore"
Availabilites = ["Hyderabad Mumbai Chennai", "Mumbai Chennai Kolkata", 
                 "Bangalore Bangalore Hyderabad", "Hyderabad Bangalore Hyderabad", "Bangalore Hyderabad Bangalore",
                "Newyork Sydney Bangalore", "Bangalore Bangalore Bangalore", "Hyderabad Hyderabad Hyderabad"]

In [8]:
cbr = ContentBasedRecommendation(OurIntrest, Availabilites)
Recommendations = cbr.GetRecommendations(3)

print(f"##########     Sentences that are similar to {OurIntrest} are   ############")
for recommendation in Recommendations:
    print(recommendation)

##########     Sentences that are similar to Bangalore Hyderabad Bangalore are   ############
Bangalore Hyderabad Bangalore
Bangalore Bangalore Hyderabad
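
To see the scores behind this ranking, the similarity of every candidate sentence can be printed. This is a standalone standard-library sketch, not the class itself; the helpers `vector` and `cosine` are hypothetical, and the axes are restricted to the target's words as in the class.

```python
import math

target = "Bangalore Hyderabad Bangalore"
candidates = [
    "Hyderabad Mumbai Chennai", "Mumbai Chennai Kolkata",
    "Bangalore Bangalore Hyderabad", "Hyderabad Bangalore Hyderabad",
    "Bangalore Hyderabad Bangalore", "Newyork Sydney Bangalore",
    "Bangalore Bangalore Bangalore", "Hyderabad Hyderabad Hyderabad",
]

# Distinct target words, in first-seen order: these are the vector axes.
order = list(dict.fromkeys(target.split(" ")))

def vector(sentence):
    """Count of each target word in the sentence."""
    words = sentence.split(" ")
    return [words.count(word) for word in order]

def cosine(u, v):
    denom = math.hypot(*u) * math.hypot(*v)
    # A candidate sharing no word with the target has length 0; score it 0.
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0

scores = [round(cosine(vector(target), vector(c)), 3) for c in candidates]
for sentence, score in zip(candidates, scores):
    print(f"{score:.3f}  {sentence}")
```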




In [9]:
import pandas as pd

# reading csv file using pandas
df = pd.read_csv("Movies1.csv")

In [10]:
# analysing columns in dataset
df.columns

Index(['index', 'budget', 'genres', 'homepage', 'id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'cast', 'crew', 'director'],
      dtype='object')

If we analyse the dataset, we see that it has a lot of extra information about each movie. We don’t need all of it, so we choose the <code>keywords, cast, genres and director</code> columns as our feature set (the so-called “content” of the movie).

In [11]:
features = ['keywords','cast','genres','director']

Create a function that combines the values of these columns into a single string, to be stored in a column called CombinedFeatures.

In [12]:
# function that combines our required feature values into a single feature for 
# further model implementation
def CombinedFeatures(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

In [13]:
# cleaning feature values using fillna
for feature in features:
    df[feature] = df[feature].fillna('') #filling all NaNs with blank string
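
The reason for this step is that a missing value is a float NaN, which cannot be concatenated with a string. The sketch below illustrates this on a made-up two-row frame; `toy` and its contents are hypothetical and not taken from Movies1.csv.

```python
import pandas as pd

# Hypothetical two-row frame; the second movie has no keywords.
toy = pd.DataFrame({
    "keywords": ["space war", float("nan")],
    "director": ["James Cameron", "Christopher Nolan"],
})

# A missing value is a float NaN, and float + str is not defined.
try:
    float("nan") + " " + "Christopher Nolan"
except TypeError as error:
    print("without fillna:", error)

# After fillna, every value is a string and concatenation works.
toy["keywords"] = toy["keywords"].fillna("")
combined = toy.apply(lambda row: row["keywords"] + " " + row["director"], axis=1)
print(combined.tolist())
```

Note that a blank feature leaves a leading or doubled space in the combined string, which `split(" ")` turns into an empty-string token.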

In [14]:
# applying the CombinedFeatures() function over each row of the dataframe and storing the result in the CombinedFeatures column
df["CombinedFeatures"] = df.apply(CombinedFeatures,axis=1) 

In [15]:
# viewing what's happening with the method by selecting one sample
df.iloc[0].CombinedFeatures

'culture clash future space war space colony society Sam Worthington Zoe Saldana Sigourney Weaver Stephen Lang Michelle Rodriguez Action Adventure Fantasy Science Fiction James Cameron'

In [16]:
# suppose I'm interested in The Dark Knight movie
# get the content of the movie
ContentForTheDarknight = df[df["title"] == "The Dark Knight"]["CombinedFeatures"].values[0]

# print content of The Dark Knight movie
print(ContentForTheDarknight)

dc comics crime fighter secret identity scarecrow sadism Christian Bale Heath Ledger Aaron Eckhart Michael Caine Maggie Gyllenhaal Drama Action Crime Thriller Christopher Nolan


In [19]:
# creating an instance of the ContentBasedRecommendation class that accepts
# our interests, the other available movies' content, and their names
cbr = ContentBasedRecommendation(ContentForTheDarknight, list(df["CombinedFeatures"].values), list(df["title"].values))

# Getting recommendations
Recommendations = cbr.GetRecommendations(10)

# printing recommendations
for recommendation in Recommendations:
    print(recommendation)

The Dark Knight Rises
Batman Begins
The Prestige
Kick-Ass 2
Kick-Ass
Batman & Robin
Batman
Batman Returns
Harsh Times
Harry Brown


