## Recommendation Systems

- Knowing **"What customers are most likely to buy in the future"** is key to personalized marketing for most businesses. Understanding customers' past purchase behavior or demographics can help predict future purchases. How the behavior data is used depends on the algorithm or technique chosen. Some algorithms use demographic information to make these predictions, but organizations often do not have this kind of information about their customers. Often all that an organization has is what customers bought in the past and whether they liked it.

- Recommendation systems use this information to make recommendations, an approach that has proved very successful. One example is Amazon.com's popular **"Customers who bought this also bought this"** feature.

- Some of the key techniques that recommendation systems use are


    - Association Rule Mining
    - Collaborative Filtering
    - Matrix Factorization
    - PageRank Algorithm
    

- We will discuss the **collaborative filtering** technique in this article.

- The two most widely used **collaborative filtering techniques** are


    - User Similarity
    - Item Similarity
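The difference between the two techniques can be sketched on a tiny ratings matrix. The numbers below are made up for illustration and are not from MovieLens: user similarity compares rows (users) of a user-by-item matrix, while item similarity compares its columns (items).

```python
import numpy as np

# Toy ratings (made-up values): rows are users, columns are items.
ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
])

# User similarity: correlate rows (each user's ratings across items).
user_sim = np.corrcoef(ratings)

# Item similarity: correlate columns (each item's ratings across users).
item_sim = np.corrcoef(ratings.T)

print(user_sim.round(2))
print(item_sim.round(2))
```

Users 0 and 1 rate the items in a similar pattern, so their correlation is positive, while user 2 has the opposite taste. The rest of this article uses the item-based view.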

- This [blog post](https://buildingrecommenders.wordpress.com/2015/11/16/overview-of-recommender-algorithms-part-1/) gives a good explanation of collaborative filtering.

- For the purpose of demonstration, we will use the data provided by MovieLens. It is available [here](https://grouplens.org/datasets/movielens/).

- The dataset contains information about which user watched which movie and what rating (on a scale of 1 to 5) they gave it.

In [1]:
import pandas as pd
import numpy as np

## Loading Ratings dataset

In [2]:
rating_df = pd.read_csv( "https://raw.githubusercontent.com/manaranjanp/ISB_MLUL2/main/cf/u.data"
                        , delimiter = "\t"
                        , header = None )

In [3]:
rating_df.head( 10 )

Unnamed: 0,0,1,2,3
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


#### Name the columns

In [4]:
rating_df.columns = ["userid", "movieid", "rating", "timestamp"]

In [5]:
rating_df.head( 10 )

Unnamed: 0,userid,movieid,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


#### Number of unique users

In [6]:
len( rating_df.userid.unique() )

943

#### Number of unique movies

In [7]:
len( rating_df.movieid.unique() )

1682

- **So the dataset contains ratings from 943 users for 1682 movies.**

#### Let's drop the timestamp column. We do not need it.

In [8]:
rating_df.drop( "timestamp", inplace = True, axis = 1 )

In [9]:
rating_df.head( 10 )

Unnamed: 0,userid,movieid,rating
0,196,242,3
1,186,302,3
2,22,377,1
3,244,51,2
4,166,346,1
5,298,474,4
6,115,265,2
7,253,465,5
8,305,451,3
9,6,86,3


## Loading Movies Data

In [10]:
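# Fields are separated by a literal pipe character; a plain '|' avoids the
# invalid '\|' escape sequence in a non-raw string
movies_df = pd.read_csv( "https://raw.githubusercontent.com/manaranjanp/ISB_MLUL2/main/cf/u.item"
                        , delimiter = '|'
                        , header = None
                        , engine='python'
                        , encoding = "ISO-8859-1")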

In [11]:
movies_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [12]:
movies_df = movies_df.iloc[:,:2]
movies_df.columns = ['movieid', 'title']

In [13]:
movies_df.head( 10 )

Unnamed: 0,movieid,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
6,7,Twelve Monkeys (1995)
7,8,Babe (1995)
8,9,Dead Man Walking (1995)
9,10,Richard III (1995)


In [14]:
movies_df[126:127]

Unnamed: 0,movieid,title
126,127,"Godfather, The (1972)"


## Finding Item Similarity

### Let's create a pivot table of Movies to Users

- The rows are movies and the columns are users. Each value in the matrix is the rating that a specific user gave to a specific movie.
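As a small illustration of this reshaping, here is `pivot` applied to a made-up long-format table (the values and the names `toy_ratings` and `toy_mat` are hypothetical). Each (movie, user) pair that has no rating becomes `NaN` in the result.

```python
import pandas as pd

# Made-up long-format ratings, one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userid":  [1, 1, 2, 3],
    "movieid": [10, 20, 10, 20],
    "rating":  [5, 3, 4, 2],
})

# Rows become movies, columns become users, cells hold the rating.
toy_mat = toy_ratings.pivot(index="movieid", columns="userid", values="rating")
print(toy_mat)
```

Note that `pivot` raises an error if the same (movie, user) pair appears more than once, so it assumes at most one rating per pair.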

In [15]:
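# reset_index(drop=True) replaces movieid with positions 0..n-1, so row i holds
# movieid i+1; this relies on the movie ids being contiguous and starting at 1
rating_mat = rating_df.pivot( index='movieid',
                              columns='userid',
                              values = "rating" ).reset_index(drop=True)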

In [16]:
rating_mat

userid,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
0,5.0,4.0,,,4.0,4.0,,,,4.0,...,2.0,3.0,4.0,,4.0,,,5.0,,
1,3.0,,,,3.0,,,,,,...,4.0,,,,,,,,,5.0
2,4.0,,,,,,,,,,...,,,4.0,,,,,,,
3,3.0,,,,,,5.0,,,4.0,...,5.0,,,,,,2.0,,,
4,3.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1677,,,,,,,,,,,...,,,,,,,,,,
1678,,,,,,,,,,,...,,,,,,,,,,
1679,,,,,,,,,,,...,,,,,,,,,,
1680,,,,,,,,,,,...,,,,,,,,,,


### Fill with 0, where users have not rated the movies

In [17]:
rating_mat.fillna( 0, inplace = True )

In [18]:
rating_mat.shape

(1682, 943)

In [19]:
rating_mat.head( 10 )

userid,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
0,5.0,4.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,4.0,...,2.0,3.0,4.0,0.0,4.0,0.0,0.0,5.0,0.0,0.0
1,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,4.0,...,5.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
4,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,...,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,0.0,2.0,5.0,3.0,4.0,4.0,...,0.0,0.0,4.0,0.0,4.0,0.0,4.0,4.0,0.0,0.0
7,1.0,0.0,0.0,0.0,0.0,4.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0
8,5.0,0.0,0.0,0.0,0.0,4.0,5.0,0.0,0.0,4.0,...,0.0,1.0,4.0,5.0,3.0,5.0,3.0,0.0,0.0,3.0
9,3.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
type(rating_mat)

### Calculating the item distances and similarities

In [21]:
from sklearn.metrics import pairwise_distances

In [22]:
movie_sim = 1 - pairwise_distances( rating_mat.to_numpy(), metric="correlation" )
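The correlation distance is one minus the Pearson correlation, so subtracting it from 1 gives the correlation itself as the similarity score. The sketch below works this out by hand for two made-up rating vectors using only NumPy; `correlation_similarity` is a hypothetical helper name, not part of any library.

```python
import numpy as np

# Two made-up rating vectors for two movies across four users (0 = not rated).
a = np.array([5.0, 4.0, 0.0, 1.0])
b = np.array([4.0, 5.0, 0.0, 2.0])

def correlation_similarity(x, y):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

sim = correlation_similarity(a, b)
print(round(sim, 4))

# Should agree with NumPy's built-in correlation coefficient.
print(round(float(np.corrcoef(a, b)[0, 1]), 4))
```

Because unrated movies were filled with 0, those zeros take part in the mean and in the correlation as if they were very low ratings. This is a simplification made in this walkthrough.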

In [23]:
movie_sim.shape

(1682, 1682)

In [24]:
movie_sim_df = pd.DataFrame( movie_sim )

In [25]:
movie_sim_df.shape

(1682, 1682)

In [26]:
movie_sim_df.head( 10 )

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,1.0,0.234595,0.193362,0.226213,0.12884,0.015113,0.347354,0.25449,0.209502,0.104655,...,0.018215,-0.029676,-0.029676,-0.029676,0.018215,-0.029676,-0.029676,-0.029676,0.034179,0.034179
1,0.234595,1.0,0.190649,0.409044,0.240712,0.030062,0.220022,0.20602,0.077894,0.072906,...,-0.012451,-0.012451,-0.012451,-0.012451,-0.012451,-0.012451,-0.012451,-0.012451,0.071415,0.071415
2,0.193362,0.190649,1.0,0.227849,0.141368,0.065347,0.258855,0.078636,0.146181,0.079608,...,-0.009764,-0.009764,-0.009764,-0.009764,0.023964,-0.009764,-0.009764,-0.009764,-0.009764,0.091421
3,0.226213,0.409044,0.227849,1.0,0.237298,0.021878,0.295489,0.3528,0.229922,0.13822,...,-0.016619,-0.016619,0.088984,0.088984,0.025622,-0.016619,-0.016619,-0.016619,0.046743,0.067863
4,0.12884,0.240712,0.141368,0.237298,1.0,-0.008594,0.205289,0.145866,0.142541,-0.033746,...,-0.009889,-0.009889,-0.009889,-0.009889,-0.009889,-0.009889,-0.009889,-0.009889,-0.009889,0.088618
5,0.015113,0.030062,0.065347,0.021878,-0.008594,1.0,0.054415,0.01233,0.079619,0.166084,...,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159,-0.005159
6,0.347354,0.220022,0.258855,0.295489,0.205289,0.054415,1.0,0.19067,0.286572,0.178505,...,-0.026036,0.03992,-0.026036,-0.026036,0.03992,-0.026036,-0.026036,-0.026036,0.03992,0.03992
7,0.25449,0.20602,0.078636,0.3528,0.145866,0.01233,0.19067,1.0,0.229331,0.152679,...,-0.01723,0.075617,0.057047,0.057047,0.075617,-0.01723,-0.01723,-0.01723,0.075617,-0.01723
8,0.209502,0.077894,0.146181,0.229922,0.142541,0.079619,0.286572,0.229331,1.0,0.158373,...,-0.021125,-0.021125,0.047273,0.047273,0.064372,-0.021125,-0.021125,-0.021125,0.047273,0.064372
9,0.104655,0.072906,0.079608,0.13822,-0.033746,0.166084,0.178505,0.152679,0.158373,1.0,...,-0.010138,-0.010138,0.073967,0.073967,-0.010138,-0.010138,-0.010138,-0.010138,-0.010138,-0.010138


### Finding similar movies to "Toy Story"

In [27]:
# Row 0 of the similarity matrix corresponds to movieid 1 (Toy Story)
movies_df['similarity'] = movie_sim_df.iloc[0]

In [28]:
movies_df.head( 10 )

Unnamed: 0,movieid,title,similarity
0,1,Toy Story (1995),1.0
1,2,GoldenEye (1995),0.234595
2,3,Four Rooms (1995),0.193362
3,4,Get Shorty (1995),0.226213
4,5,Copycat (1995),0.12884
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,0.015113
6,7,Twelve Monkeys (1995),0.347354
7,8,Babe (1995),0.25449
8,9,Dead Man Walking (1995),0.209502
9,10,Richard III (1995),0.104655


In [29]:
movies_df.sort_values( ["similarity"], ascending = False )[0:10]

Unnamed: 0,movieid,title,similarity
0,1,Toy Story (1995),1.0
49,50,Star Wars (1977),0.457677
120,121,Independence Day (ID4) (1996),0.454544
116,117,"Rock, The (1996)",0.431789
150,151,Willy Wonka and the Chocolate Factory (1971),0.423975
180,181,Return of the Jedi (1983),0.422991
404,405,Mission: Impossible (1996),0.41677
94,95,Aladdin (1992),0.407829
117,118,Twister (1996),0.404908
221,222,Star Trek: First Contact (1996),0.391073


#### That means that for anyone who buys *Toy Story* and likes it, the top 3 movies that can be recommended are *Star Wars (1977)*, *Independence Day (ID4) (1996)* and *Rock, The (1996)*.

## Utility function to find similar movies

In [30]:
def get_similar_movies( movieid, topN = 5 ):
    # movieid is 1-based, while the rows of movie_sim_df are 0-based
    # Note: this overwrites the 'similarity' column of the global movies_df
    movies_df['similarity'] = movie_sim_df.iloc[movieid -1]
    top_n = movies_df.sort_values( ["similarity"], ascending = False )[0:topN]
    return top_n
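The function above writes the scores into the shared `movies_df`, so its `similarity` column always reflects the most recent call. One possible alternative, sketched here on made-up data, works on a copy and leaves the input untouched. The name `similar_items` and its parameters are hypothetical and not part of the original notebook.

```python
import pandas as pd

def similar_items(sim_df, items_df, item_pos, top_n=5):
    """Return the top_n rows of items_df ranked by similarity to the item at
    row position item_pos. assign() returns a copy, so items_df is unchanged."""
    ranked = items_df.assign(similarity=sim_df.iloc[item_pos].to_numpy())
    return ranked.sort_values("similarity", ascending=False).head(top_n)

# Made-up catalogue and similarity matrix for illustration.
toy_items = pd.DataFrame({"movieid": [1, 2, 3], "title": ["A", "B", "C"]})
toy_sim = pd.DataFrame([[1.0, 0.2, 0.7],
                        [0.2, 1.0, 0.1],
                        [0.7, 0.1, 1.0]])

top = similar_items(toy_sim, toy_items, item_pos=0, top_n=2)
print(top)
```

As in the original function, the query item itself comes first because its similarity to itself is 1.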

### Similar movies to *Twister*

In [31]:
get_similar_movies( 118 )

Unnamed: 0,movieid,title,similarity
117,118,Twister (1996),1.0
120,121,Independence Day (ID4) (1996),0.629867
404,405,Mission: Impossible (1996),0.542379
545,546,Broken Arrow (1996),0.509549
596,597,Eraser (1996),0.489693


### Similar movies to *The Godfather*

In [32]:
movies_df[movies_df.movieid == 127]

Unnamed: 0,movieid,title,similarity
126,127,"Godfather, The (1972)",0.152922


In [33]:
get_similar_movies( 127, 10 )

Unnamed: 0,movieid,title,similarity
126,127,"Godfather, The (1972)",1.0
186,187,"Godfather: Part II, The (1974)",0.543335
49,50,Star Wars (1977),0.409379
181,182,GoodFellas (1990),0.396741
22,23,Taxi Driver (1976),0.369608
99,100,Fargo (1996),0.345218
179,180,Apocalypse Now (1979),0.334691
191,192,Raging Bull (1980),0.331378
356,357,One Flew Over the Cuckoo's Nest (1975),0.331135
233,234,Jaws (1975),0.325565


### Similar movies to *The Lion King*

In [34]:
get_similar_movies( 71 )

Unnamed: 0,movieid,title,similarity
70,71,"Lion King, The (1994)",1.0
94,95,Aladdin (1992),0.683855
587,588,Beauty and the Beast (1991),0.605291
68,69,Forrest Gump (1994),0.572755
81,82,Jurassic Park (1993),0.557333


### Similar movies to *Star Trek*

In [35]:
get_similar_movies( 228 )

Unnamed: 0,movieid,title,similarity
227,228,Star Trek: The Wrath of Khan (1982),1.0
228,229,Star Trek III: The Search for Spock (1984),0.747498
229,230,Star Trek IV: The Voyage Home (1986),0.723112
226,227,Star Trek VI: The Undiscovered Country (1991),0.685605
175,176,Aliens (1986),0.590461


### Similar movies to *Sleepless in Seattle*

In [36]:
get_similar_movies( 88, 10 )

Unnamed: 0,movieid,title,similarity
87,88,Sleepless in Seattle (1993),1.0
65,66,While You Were Sleeping (1995),0.612566
392,393,Mrs. Doubtfire (1993),0.596703
201,202,Groundhog Day (1993),0.567747
731,732,Dave (1993),0.564841
215,216,When Harry Met Sally... (1989),0.552912
738,739,Pretty Woman (1990),0.547383
450,451,Grease (1978),0.542937
401,402,Ghost (1990),0.539376
203,204,Back to the Future (1985),0.52477


In [37]:
# Build (title, movieid) pairs for the dropdowns; in itertuples() position 0 is the index
movies = [(movie[2], movie[1]) for movie in movies_df.itertuples()]

In [38]:
movies = sorted(movies, key=lambda x: x[0])

In [39]:
movies[0:5]

[("'Til There Was You (1997)", 1300),
 ('1-900 (1994)', 1353),
 ('101 Dalmatians (1996)', 225),
 ('12 Angry Men (1957)', 178),
 ('187 (1997)', 330)]

In [40]:
import ipywidgets as widgets
from IPython.display import display

In [41]:
movie_1 = widgets.Dropdown(
    options=movies,
    description='First Movie:',
)

movie_1

Dropdown(description='First Movie:', options=(("'Til There Was You (1997)", 1300), ('1-900 (1994)', 1353), ('1…

In [42]:
movie_1.value

1300

In [43]:
movie_2 = widgets.Dropdown(
    options=movies,
    description='Second Movie:',
)

movie_2

Dropdown(description='Second Movie:', options=(("'Til There Was You (1997)", 1300), ('1-900 (1994)', 1353), ('…

In [44]:
movie_2.value

1300

In [45]:
movie_3 = widgets.Dropdown(
    options=movies,
    description='Third Movie:',
)

movie_3

Dropdown(description='Third Movie:', options=(("'Til There Was You (1997)", 1300), ('1-900 (1994)', 1353), ('1…

In [46]:
movie_1.value, movie_2.value, movie_3.value

(1300, 1300, 1300)

In [47]:
rec_movies = (pd.concat([get_similar_movies(movie_1.value)
                       , get_similar_movies(movie_2.value)
                       , get_similar_movies(movie_3.value)], axis = 0))

In [48]:
rec_movies

Unnamed: 0,movieid,title,similarity
1299,1300,'Til There Was You (1997),1.0
1593,1594,Everest (1998),0.557241
1188,1189,Prefontaine (1997),0.456322
917,918,City of Angels (1998),0.350269
916,917,Mercury Rising (1998),0.337738
1299,1300,'Til There Was You (1997),1.0
1593,1594,Everest (1998),0.557241
1188,1189,Prefontaine (1997),0.456322
917,918,City of Angels (1998),0.350269
916,917,Mercury Rising (1998),0.337738


In [49]:
( rec_movies[rec_movies.similarity != 1]
    .drop_duplicates(subset = ['title'])
    .sort_values('similarity', ascending = False)[0:10] )

Unnamed: 0,movieid,title,similarity
1593,1594,Everest (1998),0.557241
1188,1189,Prefontaine (1997),0.456322
917,918,City of Angels (1998),0.350269
916,917,Mercury Rising (1998),0.337738
