## Recommendation Systems

- Knowing **"What customers are most likely to buy in future"** is key to personalized marketing for most businesses. Understanding customers' past purchase behavior or demographics can be key to predicting future purchases. How the customer behavior data is used depends on the algorithm or technique chosen. Some algorithms use demographic information to make these predictions, but most of the time organizations do not have this kind of information about their customers at all. All an organization may have is what customers bought in the past and whether they liked it or not.

- Recommendation systems use techniques that leverage this information to make recommendations, an approach that has proved very successful. For example, Amazon.com's most popular feature is **"Customers who bought this also bought this"**.

- Some of the key techniques that recommendation systems use are


    - Association Rules mining
    - Collaborative Filtering
    - Matrix Factorization
    - PageRank Algorithm
    

- We will discuss the **Collaborative filtering** technique in this article.

- The two most widely used **Collaborative filtering techniques** are


    - User Similarity
    - Item Similarity

- Here is a nice [blog](https://buildingrecommenders.wordpress.com/2015/11/16/overview-of-recommender-algorithms-part-1/) explanation of collaborative filtering.

- For the purpose of demonstration, we will use the data provided by MovieLens. It is available [here](https://grouplens.org/datasets/movielens/).

- The dataset contains information about which user watched which movie and what rating (on a scale of 1 - 5) they gave to the movie.
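To make the difference between user similarity and item similarity concrete, here is a minimal sketch on a made-up ratings matrix. The array `toy_ratings` and its values are hypothetical and are not taken from the MovieLens data; cosine similarity is used here only for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical toy ratings: rows are users, columns are items, 0 means "not rated".
toy_ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# User similarity: compare rows, one rating vector per user.
user_sim = 1 - pairwise_distances(toy_ratings, metric="cosine")

# Item similarity: compare columns, one rating vector per item, so transpose first.
item_sim = 1 - pairwise_distances(toy_ratings.T, metric="cosine")

print(user_sim.shape)  # one row and column per user
print(item_sim.shape)  # one row and column per item
```

The same ratings matrix feeds both techniques; the only change is whether users or items are treated as the vectors being compared.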

In [None]:
import pandas as pd
import numpy as np

## Loading Ratings dataset

In [None]:
rating_df = pd.read_csv( "https://raw.githubusercontent.com/manaranjanp/IIMBClasses/main/recsys/u.data"
                        , delimiter = "\t"
                        , header = None )

In [None]:
rating_df.head( 10 )

#### Name the columns

In [None]:
rating_df.columns = ["userid", "movieid", "rating", "timestamp"]

In [None]:
rating_df.head( 10 )

#### Number of unique users

In [None]:
len( rating_df.userid.unique() )

#### Number of unique movies

In [None]:
len( rating_df.movieid.unique() )

- **So ratings data for a total of 1682 movies and 943 users is available in the dataset.**

#### Let's drop the timestamp column. We do not need it.

In [None]:
rating_df.drop( "timestamp", inplace = True, axis = 1 )

In [None]:
rating_df.head( 10 )

## Loading Movies Data

In [None]:
movies_df = pd.read_csv( "https://raw.githubusercontent.com/manaranjanp/IIMBClasses/main/recsys/u.item"
                        , delimiter = r'\|'
                        , header = None
                        , engine='python'
                        , encoding = "ISO-8859-1")

In [None]:
movies_df.head(10)

In [None]:
movies_df = movies_df.iloc[:,:2]
movies_df.columns = ['movieid', 'title']

In [None]:
movies_df.head( 10 )

In [None]:
movies_df[126:127]

## Finding Item Similarity

### Let's create a pivot table of Movies to Users 

- The rows are movies and the columns are users. The values in the matrix are the ratings given to a specific movie by a specific user.
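As a small illustration of what such a pivot produces, the sketch below reshapes a hypothetical four-row ratings table (`toy_df`, with invented ids and ratings) into a movies-by-users matrix. Movie and user pairs with no rating come out as `NaN`.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating.
toy_df = pd.DataFrame({
    "userid":  [1, 1, 2, 3],
    "movieid": [10, 20, 10, 20],
    "rating":  [5, 3, 4, 2],
})

# Rows become movies, columns become users, cells hold the rating.
toy_mat = toy_df.pivot(index="movieid", columns="userid", values="rating")
print(toy_mat)
```

Note that calling `reset_index(drop=True)` on such a matrix replaces the `movieid` labels with positions 0, 1, 2, and so on. Looking up a movie by `movieid - 1` afterwards is only valid if the ids run contiguously from 1, which appears to be the case for this dataset.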

In [None]:
rating_mat = rating_df.pivot( index='movieid', 
                              columns='userid', 
                              values = "rating" ).reset_index(drop=True)

In [None]:
rating_mat

### Fill with 0, where users have not rated the movies

In [None]:
rating_mat.fillna( 0, inplace = True )

In [None]:
rating_mat.shape

In [None]:
rating_mat.head( 10 )

In [None]:
type(rating_mat)

### Calculating the item distances and similarities

In [None]:
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine, correlation

In [None]:
movie_sim = 1 - pairwise_distances( rating_mat.to_numpy(), metric="correlation" )
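The `correlation` metric returns a distance equal to one minus the Pearson correlation, so subtracting it from 1 recovers the correlation itself. A minimal check on a hypothetical toy matrix (`toy`, with invented values) compares the result against `np.corrcoef`:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical toy matrix: each row is one item's ratings across four users.
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

# Same construction as movie_sim: 1 minus the correlation distance between rows.
sim = 1 - pairwise_distances(toy, metric="correlation")

# Pearson correlation between rows, computed directly for comparison.
pearson = np.corrcoef(toy)

print(np.allclose(sim, pearson))
```

Values therefore range from -1 (opposite rating patterns) to 1 (identical patterns), and each item has similarity 1 with itself.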

In [None]:
movie_sim.shape

In [None]:
movie_sim_df = pd.DataFrame( movie_sim )

In [None]:
movie_sim_df.shape

In [None]:
movie_sim_df.head( 10 )

### Finding similar movies to "Toy Story"

In [None]:
movies_df['similarity'] = movie_sim_df.iloc[0]
movies_df.columns = ['movieid', 'title', 'similarity']

In [None]:
movies_df.head( 10 )

In [None]:
movies_df.sort_values( ["similarity"], ascending = False )[0:10]

#### That means for anyone who buys *Toy Story* and likes it, the top 3 movies that can be recommended to him or her are *Star Wars (1977)*, *Independence Day (ID4) (1996)* and *Rock, The (1996)*

## Utility function to find similar movies

In [None]:
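def get_similar_movies( movieid, topN = 5 ):
    # Row (movieid - 1) of movie_sim_df holds this movie's similarity to every movie
    movies_df['similarity'] = movie_sim_df.iloc[movieid - 1]
    top_n = movies_df.sort_values( ["similarity"], ascending = False )[0:topN]
    # Look up the title so the printed message names the movie being queried
    print( "Similar Movies to: ", movies_df[movies_df.movieid == movieid].title.values[0] )
    return top_n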
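As one possible variation, the sketch below shows a helper that leaves the titles table unmodified and leaves the queried movie out of its own results. The function name `similar_movies`, its parameters, and the toy data are hypothetical and only illustrate the idea; like the function above, it assumes movie ids 1..N map to row positions 0..N-1.

```python
import numpy as np
import pandas as pd

def similar_movies(sim_df, titles_df, movieid, top_n=5):
    """Return the top_n most similar movies without modifying titles_df."""
    # Assumes movie ids 1..N correspond to row positions 0..N-1 in sim_df.
    scores = sim_df.iloc[movieid - 1]
    # assign() returns a new DataFrame, so titles_df itself is left unchanged.
    result = titles_df.assign(similarity=scores.to_numpy())
    # Leave out the query movie, which always has similarity 1 with itself.
    result = result[result.movieid != movieid]
    return result.sort_values("similarity", ascending=False).head(top_n)

# Hypothetical toy data: three movies and a made-up similarity matrix.
toy_titles = pd.DataFrame({"movieid": [1, 2, 3], "title": ["A", "B", "C"]})
toy_sim = pd.DataFrame(np.array([
    [1.0, 0.9, -0.2],
    [0.9, 1.0, 0.1],
    [-0.2, 0.1, 1.0],
]))

top = similar_movies(toy_sim, toy_titles, movieid=1, top_n=2)
print(top)
```

Passing the similarity matrix and titles table as arguments avoids relying on global variables, at the cost of a slightly longer call.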

### Similar movies to *Twister*

In [None]:
get_similar_movies( 118 )

### Similar movies to *The Godfather*

In [None]:
movies_df[movies_df.movieid == 127]

In [None]:
get_similar_movies( 127, 10 )

### Similar movies to *The Lion King*

In [None]:
get_similar_movies( 71 )

### Similar movies to *Star Trek*

In [None]:
get_similar_movies( 228 )

### Similar movies to *Sleepless in Seattle*

In [None]:
get_similar_movies( 88, 10 )