# HOW RECOMMENDATION SYSTEMS WORK 

## TODAY'S AGENDA


- What are recommender systems?
- Different strategies for implementing recommender systems.
- How to evaluate if a recommender system is effective.
- Build a rule-based recommendation system.


## What are recommender systems? 

- A recommender system is a type of information filtering system. By drawing on large data sets, the system’s algorithm can infer what each user prefers.

- Once you know what a user likes, you can recommend new and relevant content. The job of a recommender system is not just to filter the data but also to make the recommendations personal for every user.

- From Amazon to Zomato, nearly every app that you use gives you personalized recommendations.


![alt text here](900x410.jpg "Visualizing Recommendation System")

### Long Tail Problem

![alt text here](too_many_choices.png "The Paradox of Choice")

**The Paradox of Choice - the more options we are given, the harder it is to pick one and the less satisfied we tend to be with our pick**


Let us take a real-life example to see how Amazon deals with this problem. Suppose you want to buy a book online and go to the Amazon website. Here's what you see:
![alt text here](amazon-recommender-system.png "Book Choices")

- It’s the same as the jam bottles in the stores, too many choices. But once you start making choices on the platform, Amazon’s recommender system takes over. 

- Say you already know which book to buy and search for it, _"The Great Gatsby"_

- Amazon's recommendations:

![alt text here](amazon-content-based-filtering-system.png "Book Recommendations")

- Here the system served you Fahrenheit 451. That’s because past Fitzgerald customers must have also bought Bradbury. As an alternative, your recommender system could offer other Fitzgerald books.

## Let's build a Recommendation System 

### Part I - What's the idea?

- Get the Movielens dataset from Kaggle. It has ~45K movies.
<br>

- Do exploratory data analysis of the dataset. The goal is to understand which attributes of the movie data are useful for us and how we can use them.
<br>

- We then settle on the following attributes:
    1. series the movie belongs to
    2. genres
    3. language of the movie
    4. name
    5. production companies
    6. country in which the movie was made
    7. release date
    8. status (is it released, in pre-production, etc.)
    9. IMDb rating
    10. cast (actors and actresses)

<br>

- Using the attributes above, we start filtering through the data. We use these attributes as sieves to get similar data.<br> **Example** - Given a query for an English movie, we use the language of the movie as a sieve to keep only English movies.
<br>

- After filtering through multiple layers of attributes, we finally get the most similar movies together.
<br>

- We then apply a basic measure of text similarity, **Cosine Similarity**, to the cast and genres of each movie. This means the recommendations will be movies with a similar cast and similar genres.
<br>

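- A brief explanation of what Cosine Similarity does: given two sentences, it measures how similar they are based on the words they share. <br> **Example** - **statement1**: "Movie A is an American romantic comedy film."<br> **statement2**: "Movie B is a British romantic film."<br> **statement3**: "Movie C is an action film."<br> Once filler words such as "is" and "an" are ignored, statement1 and statement2 share "romantic" in addition to "movie" and "film", so they score as more similar to each other than either does to statement3.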
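
The cosine similarity idea can be sketched on the three statements with the standard library alone. This is an illustrative sketch, not the notebook's implementation (which uses scikit-learn later on): the two-word stop list and the choice to ignore one-letter tokens are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, just for this example.
STOP_WORDS = {"is", "an"}

def bag_of_words(text):
    """Lowercase the text and count every word of two or more letters."""
    words = re.findall(r"[a-z]{2,}", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(text_a, text_b):
    """Cosine of the angle between the two word-count vectors."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

statement1 = "Movie A is an American romantic comedy film."
statement2 = "Movie B is a British romantic film."
statement3 = "Movie C is an action film."

print(round(cosine(statement1, statement2), 3))  # romantic film vs romantic film
print(round(cosine(statement1, statement3), 3))  # romantic comedy vs action film
```

Under these assumptions the first pair scores higher than the second, because the two romantic films share one more word.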

As the example above shows, there are multiple ways to give recommendations for the same entity.
Which brings us to the question: what are those multiple ways?

## Different strategies to build a recommendation system

There are **three** main approaches:
- **Collaborative Filtering**
- **Content-based Filtering**
- **Hybrid (Combination of Both)**

### Collaborative Filtering 

A collaborative filtering recommender system analyzes similarities between users and/or item interactions. Once the system identifies similarities, it serves users recommendations. In general, users see items that similar users liked.

There are different types of collaborative filtering systems including:

- **Item-item Collaborative Filtering**
- **User-user Collaborative Filtering**

#### Item-item Collaborative Filtering

- An **item-item** filtering algorithm analyzes product associations taken from user ratings. Users then see recommendations based on how they rate individual products.
<br>

- For example, you watch a video. Now, you will see the most viewed videos with similar attributes. Below is an example from YouTube.
<br>

- I watched videos from a special list of videos that I had given a thumbs-up. YouTube then recommended the most viewed videos from similar viewers' lists, which in this case is a video by Sentdex.
<br>

- One more thing to notice in this example: after I watched a few videos by Comicstorian, a YouTube channel that is all about superhero comic books, YouTube automatically showed me videos from Variant Comics (another comic-book channel).
<br>

![alt text here](youtube_reco.png "Youtube Recommendations")
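
A minimal sketch of the item-item idea, assuming thumbs-up data is available as a user-to-videos mapping. The user ids, video names, and the cosine-over-fans similarity are all illustrative assumptions, not how YouTube actually works.

```python
import math

# Hypothetical thumbs-up data: user -> set of videos they liked.
likes = {
    "u1": {"python_basics", "pandas_intro", "comics_recap"},
    "u2": {"python_basics", "pandas_intro"},
    "u3": {"comics_recap", "variant_comics"},
    "u4": {"python_basics", "pandas_intro", "variant_comics"},
}

def item_similarity(item_a, item_b):
    """Cosine similarity between two items, based on which users liked them."""
    fans_a = {u for u, items in likes.items() if item_a in items}
    fans_b = {u for u, items in likes.items() if item_b in items}
    if not fans_a or not fans_b:
        return 0.0
    return len(fans_a & fans_b) / math.sqrt(len(fans_a) * len(fans_b))

def similar_items(item, top_n=2):
    """Rank every other item by similarity to `item` (ties broken by name)."""
    catalog = set().union(*likes.values()) - {item}
    ranked = sorted(catalog, key=lambda other: (-item_similarity(item, other), other))
    return ranked[:top_n]

print(similar_items("python_basics"))
```

Here `pandas_intro` ranks first because exactly the same users liked it and `python_basics`.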

#### User-user Collaborative Filtering

- The second kind of collaborative filtering takes the similarity of user tastes into consideration.
<br>

- So, user-user collaborative filtering doesn’t serve you items with the best ratings. 
<br>

- Instead, you join a cluster of other people with similar tastes and you see content based on historic choices.



Let’s say you use YouTube for the first time. You play a video tutorial by the YouTube channel Lex Fridman.
The system clusters you with other users who also like the same channel. Then the YouTube recommendation system shows you other videos chosen by users in your cluster. The more choices you make, the more relevant the results.

 
![alt text here](youtube_video_reco.png "YouTube Recommendations")
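
A minimal sketch of the user-user idea. Instead of forming explicit clusters, it weights every other user by how much their watch history overlaps with yours (Jaccard overlap), which is a simpler neighbour-based stand-in for clustering. All user names and histories are hypothetical.

```python
# Hypothetical watch histories: user -> set of channels they watched.
history = {
    "new_user": {"lex_fridman"},
    "alice": {"lex_fridman", "sentdex", "two_minute_papers"},
    "bob": {"lex_fridman", "sentdex"},
    "carol": {"comicstorian", "variant_comics"},
}

def jaccard(a, b):
    """Overlap between two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend_for(user, top_n=3):
    """Score unseen channels by the similarity of the users who watched them."""
    seen = history[user]
    scores = {}
    for other, channels in history.items():
        if other == user:
            continue
        weight = jaccard(seen, channels)
        if weight == 0:
            continue  # users with no overlap do not influence the result
        for channel in channels - seen:
            scores[channel] = scores.get(channel, 0.0) + weight
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]

print(recommend_for("new_user"))
```

Channels watched only by `carol` never appear for `new_user`, because the two share no history.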


### Content Based Recommender Systems

Content based filtering uses characteristics or properties of an item to serve recommendations. Characteristic information includes:

- Characteristics of Items (Keywords and Attributes)
- Characteristics of Users (Profile Information)

Let’s use a movie recommendation system as an example. Characteristics for the item Harry Potter and the Sorcerer’s Stone might include:

- Director Name – Chris Columbus
- Genres – Adventure, Fantasy, Family (IMDB)
- Stars – Daniel Radcliffe, Rupert Grint, Emma Watson

![alt text here](content-based-filtering-harry-potter.png "Content based Recommendations")

A content based recommender system can now serve the user:

- **More Harry Potter Movies**
- **More Adventure, Family, or Fantasy Movies**
- **More Chris Columbus Movies**
- **More Daniel Radcliffe Movies**

Of course, this isn't an exhaustive list of the rules we can use to give recommendations.
The hypothesis is that if a user liked an item in the past, they might like similar items in the future.
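
A minimal sketch of content-based matching, assuming each movie is described by a set of attribute values. The Harry Potter attributes are taken from the list above; "Movie X", "Movie Y", "Movie Z" and their attributes are made-up placeholders.

```python
# Item profiles: title -> set of attribute values (director, genres, stars).
movies = {
    "Harry Potter and the Sorcerer's Stone": {
        "Chris Columbus", "Adventure", "Fantasy", "Family",
        "Daniel Radcliffe", "Rupert Grint", "Emma Watson",
    },
    # Hypothetical movies, for illustration only.
    "Movie X": {"Chris Columbus", "Comedy", "Family", "Actor One"},
    "Movie Y": {"Director Two", "Horror", "Actor Two"},
    "Movie Z": {"Director Three", "Adventure", "Fantasy", "Daniel Radcliffe"},
}

def overlap(a, b):
    """Jaccard overlap between two attribute sets."""
    return len(a & b) / len(a | b)

def recommend_similar(title, top_n=2):
    """Rank the other movies by how many attributes they share with `title`."""
    liked = movies[title]
    others = [t for t in movies if t != title]
    ranked = sorted(others, key=lambda t: (-overlap(liked, movies[t]), t))
    return ranked[:top_n]

print(recommend_similar("Harry Potter and the Sorcerer's Stone"))
```

No other user's behaviour is needed here: the ranking depends only on the attributes of the items themselves.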

##  Why do we need so many types of recommender systems?

Well, it turns out that each of the above types of recommender system has its own set of problems. Let us go through them.

### The Collaborative Filtering Problem 

- Collaborative filtering needs a lot of data to create relevant suggestions. So, when you start using a platform with a collaborative filtering system, you start cold.

- The cold start problem in recommender systems is common for collaborative filtering systems.

- For example, when John Doe visits YouTube for the first time, the system has to wait for him to watch several videos. Only then can it serve him relevant recommendations for other videos.

- One way to tackle this situation is to start clustering John Doe with others who watch similar videos, or to simply recommend the most popular videos until he has watched enough videos.
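
The popularity fallback can be sketched as follows. The threshold of three videos, the watch histories, and the `personalised` placeholder are assumptions made for this example.

```python
from collections import Counter

MIN_HISTORY = 3  # assumed threshold: fewer watched videos than this counts as "cold"

# Hypothetical watch histories: user -> list of watched video ids.
histories = {
    "john_doe": ["v1"],
    "u2": ["v1", "v2", "v3"],
    "u3": ["v2", "v3"],
    "u4": ["v2", "v4", "v5"],
}

def most_popular(all_histories):
    """All videos, most watched first (ties broken by id for a stable order)."""
    counts = Counter(v for watched in all_histories.values() for v in watched)
    return sorted(counts, key=lambda v: (-counts[v], v))

def personalised(user):
    """Placeholder for a real collaborative-filtering model (hypothetical)."""
    return ["personal_pick_for_" + user]

def recommend(user, all_histories, top_n=3):
    watched = all_histories.get(user, [])
    if len(watched) < MIN_HISTORY:
        # Cold start: serve popular videos the user has not watched yet.
        return [v for v in most_popular(all_histories) if v not in watched][:top_n]
    return personalised(user)[:top_n]

print(recommend("john_doe", histories))  # popular videos, since John is still "cold"
print(recommend("u2", histories))        # enough history, so the personal model is used
```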

### The Content Based Filtering Problem

- The problem with content-based recommender systems is that they are restrictive. 
- You click on a t-shirt and you see more t-shirts. The system is incapable of knowing that your interests go beyond liking t-shirts.
- Basically, the system isn't dynamic enough to keep customers engaged unless they are happy to keep browsing the same category.
- A common solution is to ask users upfront about what kind of things they like. And as users interact with your site, you can use historical data to recommend them more tailored choices.

- The customer buys a t-shirt and some shorts. Now, you know that he likes both.

## How to Evaluate if a Recommender System is Effective

There are quite a few ways to evaluate, but we will look at two of them today:
- **User Studies / Personas**
- **Online Recommender System Survey (A/B Testing)**

The two methods above tell us how to check the effectiveness of a recommendation system. But which parameters do we measure it on?
Let us go through some important ones:
- **User Preference**: When systems make recommendations based on user interests, habits, and goals.
<br>

- **Prediction Accuracy**: When systems make accurate predictions about the results of serving a recommendation to a user.
<br>

- **Coverage**: The degree to which recommendations cover all available items and actions.
<br>

- **Diversity**: The more diverse the recommendations, the broader the offer for users.
<br>

- **Privacy**: Despite users willingly giving information to recommender systems, they still want that information to be private.
<br>

- **Adaptivity**: When the recommender system can adapt and serve users relevant recommendations even when the content and environment is dynamic.
<br>

- **Scalability**: The system can handle a growing amount of data.
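
Two of these parameters, coverage and diversity, are easy to put into numbers. The sketch below is one possible way to do it: the formulas (share of the catalog recommended, and average pairwise genre dissimilarity) and the sample data are assumptions chosen for illustration.

```python
from itertools import combinations

def coverage(recommendations, catalog):
    """Share of the catalog that appears in at least one user's recommendations."""
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & set(catalog)) / len(catalog)

def diversity(items, genres):
    """Average pairwise genre dissimilarity (1 - Jaccard) within one list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        shared = len(genres[a] & genres[b])
        combined = len(genres[a] | genres[b])
        total += 1 - shared / combined
    return total / len(pairs)

# Hypothetical catalog, genres, and served recommendations.
catalog = ["m1", "m2", "m3", "m4", "m5"]
genres = {"m1": {"comedy"}, "m2": {"comedy", "romance"}, "m3": {"action"}}
recommendations = {"u1": ["m1", "m2"], "u2": ["m2", "m3"]}

print(coverage(recommendations, catalog))        # 3 of 5 movies are ever recommended
print(diversity(recommendations["u1"], genres))  # two comedies: fairly similar
print(diversity(recommendations["u2"], genres))  # no shared genre: fully diverse
```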

Having understood the parameters, let us look at the two ways to evaluate.

### User Studies/Personas 

- In the beginning, you will have no real knowledge about your users or their interests. So, how do you create personalized offers for people you don’t know?
<br>

- The easiest way to test what will be effective is to create user personas. 
<br>

- A good way to assign user personas is to have new users answer a few brief onboarding questions. And that’s especially true for mobile applications.
<br>

- Here is how Spotify does it.
<br>

![alt text here](spotify1.png "Spotify Recommendations")

![alt text here](spotify2.png "Spotify Recommendations")

![alt text here](spotify3.png "Spotify Recommendations")

### Online Recommender System Survey (A/B Testing) 

- Once you launch your recommender system, you run online evaluations. These evaluations comprise various A/B tests that you serve to users live.
<br>

- If you’re unfamiliar with A/B testing, it involves serving different options to users who arrive at the same point. One user sees option “A” while the other sees option “B.” Whichever option does better, that’s the one you continue to serve all users.
<br>

- By running A/B tests, you can see which recommendation inspires more clicks and conversions. 
<br>

- While ideal in the long run, this form of testing can have a negative impact on revenue and user experience if users don’t like either option.
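
To decide which option "does better", it helps to check that the difference in click-through rate is larger than chance. One common way is a two-proportion z-test, sketched below with made-up click counts. The numbers and the 0.05 cut-off are assumptions for illustration only.

```python
import math

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """z-score for the difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (rate_b - rate_a) / std_err

def two_sided_p_value(z):
    """Probability of seeing a |z| this large if A and B were really equal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical experiment: 5000 users saw each option.
z = two_proportion_z(clicks_a=200, views_a=5000, clicks_b=260, views_b=5000)
p = two_sided_p_value(z)

print(round(z, 2), round(p, 4))
print("B wins" if p < 0.05 and z > 0 else "no clear winner")
```

With these made-up counts, option B's higher click-through rate is unlikely to be due to chance, so B would be served to all users.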

## How Do Recommendation Systems Help a Business?


- **Increase in sales thanks to personalized offers.**
- **Enhanced customer experience.**
- **More time spent on the platform.**
- **Customer retention thanks to users feeling understood.**


## Let's build a Recommendation System 

### Part II - Writing the code

In [48]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from concurrent.futures import ProcessPoolExecutor
from unidecode import unidecode
from datetime import datetime
import multiprocessing as mp
import pandas as pd
import operator
import random
import json
import time
import ast
import csv
import re

In [49]:
# Load the movie metadata and mark missing values explicitly
filename = "movies_metadata.csv"
movies = pd.read_csv(filename, delimiter = ",")
movies.fillna('Information not available', inplace=True)


In [50]:
# Load the cast and crew data and mark missing values explicitly
filename = "credits.csv"
credits_dataframe = pd.read_csv(filename, delimiter = ",")
credits_dataframe.fillna('Information not available', inplace=True)

In [53]:
movies.tail()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
45461,False,Information not available,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,Information not available,0,90,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1
45462,False,Information not available,0,"[{'id': 18, 'name': 'Drama'}]",Information not available,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0,360,"[{'iso_639_1': 'tl', 'name': ''}]",Released,Information not available,Century of Birthing,False,9.0,3
45463,False,Information not available,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",Information not available,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0,90,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6
45464,False,Information not available,0,[],Information not available,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0,87,[],Released,Information not available,Satan Triumphant,False,0.0,0
45465,False,Information not available,0,[],Information not available,461257,tt6980792,en,Queerama,50 years after decriminalisation of homosexual...,...,2017-06-09,0,75,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Information not available,Queerama,False,0.0,0


In [54]:
credits_dataframe.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


### Exploratory Data Analysis

In [55]:
movies['original_language'].value_counts()

en       32269
fr        2438
it        1529
ja        1350
de        1080
es         994
ru         826
hi         508
ko         444
zh         409
sv         384
pt         316
cn         313
fi         297
nl         248
da         225
pl         219
tr         150
cs         130
el         113
no         106
fa         101
hu         100
ta          78
th          76
he          67
sr          63
ro          57
te          45
ar          39
         ...  
eu           3
ne           2
ps           2
af           2
am           2
pa           2
lo           2
iu           2
bo           2
mn           2
si           1
jv           1
gl           1
68.0         1
lb           1
zu           1
ay           1
hy           1
82.0         1
qu           1
eo           1
rw           1
tg           1
uz           1
la           1
sm           1
104.0        1
fy           1
mt           1
cy           1
Name: original_language, Length: 93, dtype: int64

In [56]:
movies = movies.groupby('original_language').filter(lambda x : len(x)>100)

In [57]:
movies['original_language'].value_counts()

en    32269
fr     2438
it     1529
ja     1350
de     1080
es      994
ru      826
hi      508
ko      444
zh      409
sv      384
pt      316
cn      313
fi      297
nl      248
da      225
pl      219
tr      150
cs      130
el      113
no      106
fa      101
Name: original_language, dtype: int64

In [58]:
credits_dataframe['id']=credits_dataframe['id'].astype(int)
movies['id']=movies['id'].astype(int)
movies_metadata_dataframe= pd.merge(movies, credits_dataframe,on='id', how='outer')

In [59]:
movies_metadata_dataframe.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew'],
      dtype='object')

In [60]:
movie_metadata_dataframe = movies_metadata_dataframe.loc[:,["belongs_to_collection", "genres", "original_language", "original_title", "production_companies", "production_countries", "release_date", "status", "vote_average", "cast", "crew"]]

In [61]:
movie_metadata_dataframe = movie_metadata_dataframe[movie_metadata_dataframe['original_title'].notna()]

In [62]:
movie_metadata_dataframe.head()

Unnamed: 0,belongs_to_collection,genres,original_language,original_title,production_companies,production_countries,release_date,status,vote_average,cast,crew
0,"{'id': 10194, 'name': 'Toy Story Collection', ...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",en,Toy Story,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-10-30,Released,7.7,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,Information not available,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",en,Jumanji,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-15,Released,6.9,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",en,Grumpier Old Men,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,Released,6.5,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,Information not available,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",en,Waiting to Exhale,[{'name': 'Twentieth Century Fox Film Corporat...,"[{'iso_3166_1': 'US', 'name': 'United States o...",1995-12-22,Released,6.1,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,"{'id': 96871, 'name': 'Father of the Bride Col...","[{'id': 35, 'name': 'Comedy'}]",en,Father of the Bride Part II,"[{'name': 'Sandollar Productions', 'id': 5842}...","[{'iso_3166_1': 'US', 'name': 'United States o...",1995-02-10,Released,5.7,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."


In [63]:
movie_metadata_dataframe.shape

(44522, 11)

In [64]:
# Get the director's name from the crew feature.
# Returns None if no director is listed (filled in later) and
# "Information not available" if the crew data could not be parsed.
def get_director(x):
    try:
        for i in x:
            if i['job'] == 'Director':
                return i['name']
    except Exception:
        return "Information not available"

In [65]:
# Get the producer's name from the crew feature.
# Returns None if no producer is listed (filled in later) and
# "Information not available" if the crew data could not be parsed.
def get_producer(x):
    try:
        for i in x:
            if i['job'] == 'Producer':
                return i['name']
    except Exception:
        return "Information not available"

In [66]:
# Returns the names of the first 3 elements of the list (or of all elements if there are fewer) as a comma-separated string.
def get_list(x):
    try:
        if isinstance(x, list):
            names = [i['name'] for i in x]
            #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
            if len(names) > 3:
                names = names[:3]
            return ", ".join(names)
        else:
            return "Information not available"
    except:
        return "Information not available"

In [67]:
# Returns the 'name' entry of the dict, or "Information not available" if x is not a dict.
def get_dict(x):
    try:
        if isinstance(x, dict):
            names = x['name']
            return names
        else:
            return "Information not available"
    except:
        return "Information not available"

In [68]:
def get_list_type(x):
    try:
        if x!= "Information not available":
            return ast.literal_eval(str(x))   
        else:
            return "Information not available"
    except Exception as e:
        return "Information not available"

In [69]:
def get_dict_type(x):
    try:
        if x != "Information not available":
            return ast.literal_eval(str(x))   
        else:
            return "Information not available"
    except:
        return "Information not available"
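
To see why these helpers are needed: columns such as `genres` are stored as text that merely looks like a Python list. The sketch below shows the parse-then-extract step on one sample string written in the same format as the `genres` column. The variable names are illustrative.

```python
import ast

# Sample value in the same format as the genres column (stored as a string).
raw_genres = (
    "[{'id': 16, 'name': 'Animation'}, "
    "{'id': 35, 'name': 'Comedy'}, "
    "{'id': 10751, 'name': 'Family'}]"
)

# Step 1: turn the text into a real list of dicts (what get_list_type does).
parsed = ast.literal_eval(raw_genres)

# Step 2: keep at most the first three names (what get_list does).
names = ", ".join(entry["name"] for entry in parsed[:3])

print(type(parsed).__name__)
print(names)
```

`ast.literal_eval` only evaluates literals such as lists, dicts, strings, and numbers, so it is a safer choice than `eval` for text read from a file.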

In [70]:
movie_metadata_dataframe["belongs_to_collection"] = movie_metadata_dataframe["belongs_to_collection"].apply(get_dict_type)
movie_metadata_dataframe["belongs_to_collection"] = movie_metadata_dataframe["belongs_to_collection"].apply(get_dict)


In [71]:
movie_metadata_dataframe["production_countries"] = movie_metadata_dataframe["production_countries"].apply(get_list_type)
movie_metadata_dataframe["production_countries"] = movie_metadata_dataframe["production_countries"].apply(get_list)

In [72]:
movie_metadata_dataframe["genres"] = movie_metadata_dataframe["genres"].apply(get_list_type)
movie_metadata_dataframe["genres"] = movie_metadata_dataframe["genres"].apply(get_list)

In [73]:
movie_metadata_dataframe["production_companies"] = movie_metadata_dataframe["production_companies"].apply(get_list_type)
movie_metadata_dataframe["production_companies"] = movie_metadata_dataframe["production_companies"].apply(get_list)

In [74]:
movie_metadata_dataframe["cast"] = movie_metadata_dataframe["cast"].apply(get_list_type)
movie_metadata_dataframe["cast"] = movie_metadata_dataframe["cast"].apply(get_list)

In [75]:
movie_metadata_dataframe["crew"] = movie_metadata_dataframe["crew"].apply(get_list_type)
movie_metadata_dataframe["director"] = movie_metadata_dataframe["crew"].apply(get_director)
movie_metadata_dataframe["producer"] = movie_metadata_dataframe["crew"].apply(get_producer)

In [76]:
def get_year(data):
    data = re.sub(r"\(.*?\)","",data)
    data = re.sub(r"<.*?>","", data)
    data = data.split(" ")[0]
    if data.isalpha() or data=="":
        return 0
    else:
        return int(data)

In [77]:
movie_metadata_dataframe["release_date"] = movie_metadata_dataframe["release_date"].str.replace("-"," ") 

In [78]:
movie_metadata_dataframe["release_year"] = movie_metadata_dataframe["release_date"].apply(get_year)

In [79]:
movie_metadata_dataframe["vote_average"] = movie_metadata_dataframe["vote_average"].replace("Information not available",0.0) 

In [80]:
movie_metadata_dataframe.tail()

Unnamed: 0,belongs_to_collection,genres,original_language,original_title,production_companies,production_countries,release_date,status,vote_average,cast,crew,director,producer,release_year
44517,Information not available,"Drama, Action, Romance",en,Robin Hood,"Westdeutscher Rundfunk (WDR), Working Title Fi...","Canada, Germany, United Kingdom",1991 05 13,Released,5.7,"Patrick Bergin, Uma Thurman, David Morrissey","[{'credit_id': '52fe44439251416c9100a899', 'de...",John Irvin,Sarah Radclyffe,1991
44518,Information not available,"Drama, Family",fa,رگ خواب,,Iran,Information not available,Released,4.0,"Leila Hatami, Kourosh Tahami, Elham Korda","[{'credit_id': '5894a97d925141426c00818c', 'de...",Hamid Nematollah,Hamid Nematollah,0
44519,Information not available,"Action, Drama, Thriller",en,Betrayal,American World Pictures,United States of America,2003 08 01,Released,3.8,"Erika Eleniak, Adam Baldwin, Julie du Page","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",Mark L. Lester,,2003
44520,Information not available,,en,Satana likuyushchiy,Yermoliev,Russia,1917 10 21,Released,0.0,"Iwan Mosschuchin, Nathalie Lissenko, Pavel Pavlov","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",Yakov Protazanov,Joseph N. Ermolieff,1917
44521,Information not available,,en,Queerama,,United Kingdom,2017 06 09,Released,0.0,,"[{'credit_id': '593e676c92514105b702e68e', 'de...",Daisy Asquith,,2017


In [81]:
#Replace empty and None values with "Information not available" using regex
movie_metadata_dataframe = movie_metadata_dataframe.replace(r'^\s*$', "Information not available", regex=True)

In [82]:
movie_metadata_dataframe.fillna(value=float("nan"), inplace=True)

In [83]:
movie_metadata_dataframe.fillna('Information not available', inplace=True)

In [84]:
movie_metadata_dataframe.tail()

Unnamed: 0,belongs_to_collection,genres,original_language,original_title,production_companies,production_countries,release_date,status,vote_average,cast,crew,director,producer,release_year
44517,Information not available,"Drama, Action, Romance",en,Robin Hood,"Westdeutscher Rundfunk (WDR), Working Title Fi...","Canada, Germany, United Kingdom",1991 05 13,Released,5.7,"Patrick Bergin, Uma Thurman, David Morrissey","[{'credit_id': '52fe44439251416c9100a899', 'de...",John Irvin,Sarah Radclyffe,1991
44518,Information not available,"Drama, Family",fa,رگ خواب,Information not available,Iran,Information not available,Released,4.0,"Leila Hatami, Kourosh Tahami, Elham Korda","[{'credit_id': '5894a97d925141426c00818c', 'de...",Hamid Nematollah,Hamid Nematollah,0
44519,Information not available,"Action, Drama, Thriller",en,Betrayal,American World Pictures,United States of America,2003 08 01,Released,3.8,"Erika Eleniak, Adam Baldwin, Julie du Page","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",Mark L. Lester,Information not available,2003
44520,Information not available,Information not available,en,Satana likuyushchiy,Yermoliev,Russia,1917 10 21,Released,0.0,"Iwan Mosschuchin, Nathalie Lissenko, Pavel Pavlov","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",Yakov Protazanov,Joseph N. Ermolieff,1917
44521,Information not available,Information not available,en,Queerama,Information not available,United Kingdom,2017 06 09,Released,0.0,Information not available,"[{'credit_id': '593e676c92514105b702e68e', 'de...",Daisy Asquith,Information not available,2017


In [85]:
del movie_metadata_dataframe["crew"]

In [86]:
movie_metadata_dataframe.head()

Unnamed: 0,belongs_to_collection,genres,original_language,original_title,production_companies,production_countries,release_date,status,vote_average,cast,director,producer,release_year
0,Toy Story Collection,"Animation, Comedy, Family",en,Toy Story,Pixar Animation Studios,United States of America,1995 10 30,Released,7.7,"Tom Hanks, Tim Allen, Don Rickles",John Lasseter,Bonnie Arnold,1995
1,Information not available,"Adventure, Fantasy, Family",en,Jumanji,"TriStar Pictures, Teitler Film, Interscope Com...",United States of America,1995 12 15,Released,6.9,"Robin Williams, Jonathan Hyde, Kirsten Dunst",Joe Johnston,Scott Kroopf,1995
2,Grumpy Old Men Collection,"Romance, Comedy",en,Grumpier Old Men,"Warner Bros., Lancaster Gate",United States of America,1995 12 22,Released,6.5,"Walter Matthau, Jack Lemmon, Ann-Margret",Howard Deutch,Information not available,1995
3,Information not available,"Comedy, Drama, Romance",en,Waiting to Exhale,Twentieth Century Fox Film Corporation,United States of America,1995 12 22,Released,6.1,"Whitney Houston, Angela Bassett, Loretta Devine",Forest Whitaker,Ronald Bass,1995
4,Father of the Bride Collection,Comedy,en,Father of the Bride Part II,"Sandollar Productions, Touchstone Pictures",United States of America,1995 02 10,Released,5.7,"Steve Martin, Diane Keaton, Martin Short",Charles Shyer,Nancy Meyers,1995


### Recommendation Code 

In [87]:
#convert dataframes to dictionaries
movie_dictionaries = movie_metadata_dataframe.to_dict('records')

In [88]:
#make a single dictionary with the movie title as key and a dictionary with all its features as value
single_movie_dictionary = {}
for element in movie_dictionaries:
    single_movie_dictionary[element['original_title']] = element

In [90]:
movie_metadata_dataframe.columns

Index(['belongs_to_collection', 'genres', 'original_language',
       'original_title', 'production_companies', 'production_countries',
       'release_date', 'status', 'vote_average', 'cast', 'director',
       'producer', 'release_year'],
      dtype='object')

In [91]:
def similarity(movie,dataframe):
    # Turn each movie's bag of words into a vector of word counts
    count = CountVectorizer()
    count_matrix = count.fit_transform(dataframe['bag of words'])
    # Position of every title in the dataframe index
    indices = pd.Series(dataframe.index)
    cosine_sim = cosine_similarity(count_matrix, count_matrix)

    idx = indices[indices == movie].index[0]

    # Rank all movies by similarity to the query movie, most similar first
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # Skip position 0 (normally the query movie itself); the caller trims the list to ten
    ranked_indexes = list(score_series.iloc[1:].index)

    recommended_movies = [dataframe.index[i] for i in ranked_indexes]

    return recommended_movies

In [92]:
#features for movies - ['belongs_to_collection', 'genres', 'original_language', 'original_title', 'production_companies', 'production_countries', 'release_date', 'status', 'vote_average', 'cast', 'release_year']


def movie_recommendation(movie_dict, movie_dictionaries):
    start_time = time.time()
    first_level_filtering = []
    second_level_filtering = []
    third_level_filtering = []
    fourth_level_filtering = []
    same_series_movies = []
    bag_of_words_dict = {}

    movie_genres = movie_dict['genres'].lower().strip().split(", ")
    movie_language = movie_dict['original_language'].lower()
    movie_series = movie_dict['belongs_to_collection']
    movie_year = movie_dict['release_year']
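    # Split the comma-separated country names so that [0] and [1] are countries, not single characters
    movie_country = movie_dict["production_countries"].split(", ")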
    if movie_dict['vote_average'] != "Information not available":
        movie_rating = movie_dict['vote_average']
    else:
        movie_rating = 0

    if movie_year != "Information not available":
        before_released = int(movie_year) - 10
        after_released = int(movie_year) + 10
    else:
        pass

    for i in range(len(movie_dictionaries)):
        if movie_dict["original_language"].strip() == movie_dictionaries[i]['original_language'] and (movie_dictionaries[i]['vote_average'] > 5.0):
            first_level_filtering.append(movie_dictionaries[i])
#     return len(first_level_filtering)

    for i in range(len(first_level_filtering)):
        if first_level_filtering[i]['status'] != "Released":
            pass
        else:
            second_level_filtering.append(first_level_filtering[i])     
#     return len(second_level_filtering)

    if movie_series!= 'Information not available':
        for i in range(len(second_level_filtering)):
            if movie_series == second_level_filtering[i]['belongs_to_collection']:
                same_series_movies.append(second_level_filtering[i]['original_title'])

    
    for i in range(len(second_level_filtering)):
        if (movie_country[0] in second_level_filtering[i]['production_countries']):
            try:
                if (movie_country[1] in second_level_filtering[i]['production_countries']):
                    third_level_filtering.append(second_level_filtering[i])
            except IndexError:
                third_level_filtering.append(second_level_filtering[i])
    # movie_genres was lowercased above, so compare against the lowercase placeholder
    if movie_genres[0] != "information not available":
        if "animation" in movie_genres:
            for i in range(len(third_level_filtering)):
                if "animation" in third_level_filtering[i]['genres'].lower():
                    fourth_level_filtering.append(third_level_filtering[i])
            
            # Make sure the query movie itself is part of the level-four list
            name_list = [candidate['original_title'] for candidate in fourth_level_filtering]
            if movie_dict['original_title'] not in name_list:
                fourth_level_filtering.append(movie_dict)

            # Animated titles: bag of words = cast + production companies
            for candidate in fourth_level_filtering:
                cast_words = candidate["cast"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
                company_words = candidate["production_companies"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
                bag_of_words_dict[candidate["original_title"]] = cast_words + " " + company_words
            animated_random_data = pd.DataFrame.from_dict(bag_of_words_dict, columns=['bag of words'], orient='index')
            result_dataframe = similarity(movie_dict['original_title'],animated_random_data)
            final_dataframe = same_series_movies+result_dataframe
            try:
                final_dataframe.remove(movie_dict['original_title'])
            except ValueError:
                # The query title was not in the list, so there is nothing to remove
                pass
            if len(final_dataframe)>0:
                return final_dataframe[:10]
            else:
                return "Information not available"

        else:
            for i in range(len(third_level_filtering)):
                # Lower-case the genres so the check matches the animated branch above
                if "animation" not in third_level_filtering[i]['genres'].lower():
                    fourth_level_filtering.append(third_level_filtering[i])

            # Make sure the query movie itself is part of the level-four list
            name_list = [candidate['original_title'] for candidate in fourth_level_filtering]
            if movie_dict['original_title'] not in name_list:
                fourth_level_filtering.append(movie_dict)

            # Non-animated titles: bag of words = cast + production companies + genres
            for candidate in fourth_level_filtering:
                cast_words = candidate["cast"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
                company_words = candidate["production_companies"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
                genre_words = candidate["genres"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
                bag_of_words_dict[candidate["original_title"]] = cast_words + " " + company_words + " " + genre_words
            non_animated_random_data = pd.DataFrame.from_dict(bag_of_words_dict, columns=['bag of words'], orient='index')

            result_dataframe = similarity(movie_dict['original_title'],non_animated_random_data)
            final_dataframe = same_series_movies+result_dataframe
            
            
            try:
                final_dataframe.remove(movie_dict['original_title'])
            except ValueError:
                # The query title was not in the list, so there is nothing to remove
                pass
            if len(final_dataframe)>0:
                return final_dataframe[:10]
            else:
                return "Information not available"
    else:
        for i in range(len(third_level_filtering)):
            if third_level_filtering[i]['release_year']>before_released and third_level_filtering[i]['release_year']<after_released:
                fourth_level_filtering.append(third_level_filtering[i])

        # Make sure the query movie itself is part of the level-four list
        name_list = [candidate['original_title'] for candidate in fourth_level_filtering]
        if movie_dict['original_title'] not in name_list:
            fourth_level_filtering.append(movie_dict)

        # Query movie has no genre information: bag of words = cast + genres + production companies
        for candidate in fourth_level_filtering:
            cast_words = candidate["cast"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
            genre_words = candidate["genres"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
            company_words = candidate["production_companies"].replace("', '", "").replace("'", "").replace(" ", "").replace(",", " ").strip().lower()
            bag_of_words_dict[candidate["original_title"]] = cast_words + " " + genre_words + " " + company_words
        random_data = pd.DataFrame.from_dict(bag_of_words_dict, columns=['bag of words'], orient='index')

        result_dataframe = similarity(movie_dict['original_title'], random_data)
        final_dataframe = same_series_movies+result_dataframe
        
        try:
            final_dataframe.remove(movie_dict['original_title'])
        except ValueError:
            # The query title was not in the list, so there is nothing to remove
            pass
        if len(final_dataframe)>0:
            return final_dataframe[:10]
        else:
            return "Information not available"
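
The long `.replace(...)` chains in the function above turn each metadata field into a bag of words in which every name becomes a single token. The sketch below isolates that step and compares two bags with a plain cosine similarity. It assumes the fields are comma-separated strings such as `"Tom Hanks, Tim Allen"` (the stored format may differ), and `to_tokens` and `cosine` are hypothetical helpers written for illustration. The notebook's own `similarity` function is defined elsewhere and may compute its scores differently.

```python
import math
from collections import Counter


def to_tokens(raw):
    """Apply the same cleaning chain used in movie_recommendation."""
    return (raw.replace("', '", "").replace("'", "").replace(" ", "")
               .replace(",", " ").strip().lower())


def cosine(bag_a, bag_b):
    """Cosine similarity between two space-separated bags of words."""
    a, b = Counter(bag_a.split()), Counter(bag_b.split())
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical records; the real ones come from movie_dictionaries.
toy_story = to_tokens("Tom Hanks, Tim Allen") + " " + to_tokens("Pixar Animation Studios")
toy_story_2 = to_tokens("Tom Hanks, Tim Allen, Joan Cusack") + " " + to_tokens("Pixar Animation Studios")

print(toy_story)
print(round(cosine(toy_story, toy_story_2), 3))
```

Removing the spaces inside a name before splitting means "Tom Hanks" and "Tom Cruise" do not share a "tom" token, so two movies only look similar when they share a whole cast member, company, or genre.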


In [93]:
def recommendations(movie):
    reco_dict = {}
    movie = movie.replace("\n", "")
    try:
        single_movie_dict = single_movie_dictionary[movie]
        single_movie_dict['original_title'] = movie
        movie_reco_list = movie_recommendation(single_movie_dict, movie_dictionaries)
    except Exception:
        # Unknown title or a failure inside the pipeline: fall back to the placeholder
        movie_reco_list = "Information not available"
    reco_dict[movie] = movie_reco_list
    return reco_dict

In [95]:
recommendations("Toy Story")

{'Toy Story': ['Toy Story 2',
  'Toy Story 3',
  'Toy Story That Time Forgot',
  'Toy Story 2',
  'Partysaurus Rex',
  'Toy Story 3',
  'Toy Story of Terror!',
  'Hawaiian Vacation',
  'Small Fry',
  'Knick Knack']}
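
The output above lists 'Toy Story 2' and 'Toy Story 3' twice. A likely cause is that the same-series titles are concatenated with the similarity results, and both lists can contain the same movie. A minimal sketch of an order-preserving clean-up follows; `dedupe_keep_order` is a hypothetical helper and the two input lists are made-up stand-ins for `same_series_movies` and the similarity output.

```python
def dedupe_keep_order(titles):
    """Drop repeated titles while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(titles))


# Hypothetical stand-ins for the two lists that get concatenated.
same_series_movies = ["Toy Story 2", "Toy Story 3"]
similarity_results = ["Toy Story That Time Forgot", "Toy Story 2",
                      "Partysaurus Rex", "Toy Story 3"]

combined = dedupe_keep_order(same_series_movies + similarity_results)
print(combined[:10])
```

Applying this before slicing to ten items would leave room for more distinct recommendations.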

## What more can be done? 

- As you can see, generating a recommendation takes time, so a good practice is to generate recommendations for all the movies ahead of time and store them in JSON (dictionary) form, with the movie name as the key and its recommendations as the value, or in a SQL table. Fetching from either of them takes far less time than recomputing.
<br>

- You can create an API that serves recommendations from your local server, something like a recommendation engine.
![alt text here](recommendation_service.png "Recommendation Service")
<br>

- The recommendation system we built is purely content-based. That is ideal for a cold start, when we have no feedback from users yet. The next step is to build a system that collects user feedback and explores how user data and item data can be used together.
<br>

- Use machine learning and deep learning to combine user-based and item-based recommendation systems and get even better results.
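
The first suggestion in the list above, precomputing recommendations and storing them as JSON, can be sketched as follows. This is only an illustration: `build_cache`, `lookup`, and the file name are hypothetical, and `fake_recommendations` stands in for the notebook's `recommendations` function so the example runs on its own.

```python
import json
import os
import tempfile


def build_cache(titles, recommend, path):
    """Run the slow recommender once per title and save the results as JSON."""
    cache = {}
    for title in titles:
        # recommend returns {title: recommendations}, like recommendations() above
        cache.update(recommend(title))
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle)
    return cache


def lookup(title, path):
    """Read stored recommendations instead of recomputing them."""
    with open(path, encoding="utf-8") as handle:
        cache = json.load(handle)
    return cache.get(title, "Information not available")


def fake_recommendations(movie):
    """Stand-in recommender that returns the same shape as recommendations()."""
    return {movie: [movie + " 2", movie + " 3"]}


cache_path = os.path.join(tempfile.mkdtemp(), "recommendations.json")
build_cache(["Toy Story", "Cars"], fake_recommendations, cache_path)
print(lookup("Toy Story", cache_path))
```

In a real service the file would be loaded once at start-up rather than on every lookup, and the cache would be rebuilt whenever the movie data changes.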

# THANK YOU!! 