In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

### First Practice Activity of a Recommendation System

My goal is to learn about recommendation systems (RecSys) with Reinforcement Learning (RL). Since I don't know anything about RecSys or RL yet, I decided to understand each topic separately and only then study the two together. I will start with recommendation systems, because they are simpler than RL, beginning with the important definitions in this area of study.

#### First Ideas
The basic question behind a recommendation system is "What do you want to recommend, to whom, and how?" There are many ways to make recommendations.

For example, if a client watches a horror movie, you have two main ways to choose a recommendation for them. The first is to recommend other horror movies like the first one, i.e., items (films) similar to the one in the original interaction. That is an item-focused approach.

The second is to recommend movies watched by other users who are similar to the client who watched that horror movie. That is a user-focused approach.

You can apply these approaches to different types of recommendation systems.


Independently of that choice, there are two main families of recommendation systems: collaborative and content-based.

##### Collaborative

Collaborative methods use past interactions between users and items, stored in a "User-Item Interactions Matrix," to make new recommendations. To predict, they rely on the proximity between users or between items. You can either train a model on the matrix (model-based) or compute a similarity metric and use it directly (memory-based).
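As a minimal sketch of the proximity idea, the toy user-item matrix below (invented ratings, with 0 meaning "no interaction") is compared column by column with a hand-written cosine function. The names `ratings` and `cosine` are illustrative only and do not come from the dataset used later.

```python
import numpy as np

# Toy user-item interaction matrix: rows = users, columns = items.
# The ratings are invented for illustration; 0 means "no interaction".
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 4.0],
])

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Item-item similarity: compare the columns of the matrix.
sim_item0_item1 = cosine(ratings[:, 0], ratings[:, 1])  # rated by the same users
sim_item0_item2 = cosine(ratings[:, 0], ratings[:, 2])  # no user in common

print(sim_item0_item1, sim_item0_item2)
```

Items 0 and 1 were rated by the same users and come out close to each other, while items 0 and 2 share no user and get a similarity of zero.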

They have some issues, such as handling new users and items that have no interactions yet, but other methods that use additional information can overcome this problem.

Many methods use a collaborative approach. They fall into two groups:

- Memory-Based
- Model-Based

##### Content

Content methods use additional information about users and/or items. In the movie example, you can have information about the Release Year, the Director, the Principal Actors, and much more.

These methods have much more information to work with and don't suffer too much with new users or items, because they can describe them by their attributes instead of relying on past interactions.
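A small sketch of the content idea, assuming each movie is described by a hand-made feature vector. The titles, the features, and the helper names (`features`, `cosine`, `most_similar`) are all hypothetical; no ratings are involved at all.

```python
import numpy as np

# Hypothetical item features: [is_horror, is_comedy, release decade scaled to 0..1]
features = {
    "Horror A": np.array([1.0, 0.0, 0.9]),
    "Horror B": np.array([1.0, 0.0, 0.8]),
    "Comedy C": np.array([0.0, 1.0, 0.9]),
}

def cosine(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(title):
    """Return the other title whose feature vector is closest to `title`."""
    scores = {
        other: cosine(features[title], vec)
        for other, vec in features.items()
        if other != title
    }
    return max(scores, key=scores.get)

print(most_similar("Horror A"))
```

Because the comparison uses only attributes, a movie that nobody has rated yet can still be matched to similar ones.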

### Practice

For this first recommendation system I will use the simplest kind, a memory-based collaborative model. These systems use only similarity metrics to make a recommendation, i.e., there is no ML model in the middle of the process. The simplest method is the most interesting starting point for new students like me.

One problem with this method is that it scales badly as users and movies are added, because the pivot table needs one cell for every user-movie pair.

In [2]:
names = pd.read_csv("./movie_information/movie.csv")

names.head()

# Since I am going to use a collaborative approach, I only need the user-item interaction information.
# I keep the title in the matrix only to make later visualization easier.

names = names.drop("genres", axis = 1)

In [3]:
interactions = pd.read_csv("./movie_information/rating.csv")

# These interactions are the most important information for the collaborative approach

interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 610.4+ MB


In [4]:
data = pd.merge(names, interactions)

data.head()

Unnamed: 0,movieId,title,userId,rating,timestamp
0,1,Toy Story (1995),3,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),6,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),8,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),10,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),11,4.5,2009-01-02 01:13:41


In [5]:
del data["timestamp"]

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20000263 entries, 0 to 20000262
Data columns (total 4 columns):
 #   Column   Dtype  
---  ------   -----  
 0   movieId  int64  
 1   title    object 
 2   userId   int64  
 3   rating   float64
dtypes: float64(1), int64(2), object(1)
memory usage: 762.9+ MB


In [6]:
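# The merged data has 20 million rows; keep only the first 5 million to reduce its size.
# Note: iloc takes the first rows of the merged frame, not a random sample.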
sampled_data = data.iloc[:5000000,:]
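Slicing with `iloc` keeps whatever rows come first, so the subset may cover only part of the catalogue. If a subset spread over the whole frame is preferred, `DataFrame.sample` is one option. The sketch below uses an invented stand-in frame (`toy`) rather than the real data, and the trade-off is that a random sample touches more titles and therefore produces a wider pivot table.

```python
import pandas as pd

# Stand-in for the merged ratings frame (invented rows, ordered by movieId).
toy = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3, 3, 3],
    "userId":  [3, 6, 8, 3, 9, 6, 8, 9],
    "rating":  [4.0, 5.0, 4.0, 3.0, 2.5, 4.5, 1.0, 3.5],
})

head_slice = toy.iloc[:4, :]                   # first rows only: movies 1 and 2
random_rows = toy.sample(n=4, random_state=0)  # rows drawn from the whole frame

print(head_slice["movieId"].unique())
print(sorted(random_rows["movieId"].unique()))
```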

In [7]:
sampled_data["title"].value_counts()

Pulp Fiction (1994)                 67310
Forrest Gump (1994)                 66172
Shawshank Redemption, The (1994)    63366
Silence of the Lambs, The (1991)    63299
Jurassic Park (1993)                59715
                                    ...  
Honey Moon (Honigmond) (1996)          12
Boy Called Hate, A (1995)              11
Roula (1995)                           11
Girl in the Cadillac (1995)            10
Criminals (1996)                        6
Name: title, Length: 886, dtype: int64

In [8]:
sampled_data["userId"].value_counts()

124052    843
83090     786
128653    753
118205    701
46663     667
         ... 
89906       1
134454      1
92207       1
87256       1
111487      1
Name: userId, Length: 137065, dtype: int64

In [9]:
sampled_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000000 entries, 0 to 4999999
Data columns (total 4 columns):
 #   Column   Dtype  
---  ------   -----  
 0   movieId  int64  
 1   title    object 
 2   userId   int64  
 3   rating   float64
dtypes: float64(1), int64(2), object(1)
memory usage: 190.7+ MB


In [10]:
# Tables used in collaborative approaches have the format rows = users, columns = items, so we need to create a pivot table from our data

pivot_table = sampled_data.pivot_table(index = ["userId"], columns = ["title"], values = "rating")

# After some tests, using fillna(0) gave better results than keeping the NaN values

pivot_table

title,'Til There Was You (1997),1-900 (06) (1994),"301, 302 (301/302) (1995)",8 Seconds (1994),Above the Rim (1994),Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Across the Sea of Time (1995),Addams Family Values (1993),"Addiction, The (1995)",...,Wings of Courage (1995),With Honors (1994),Wolf (1994),Women Robbers (Diebinnen) (1995),"Wonderful, Horrible Life of Leni Riefenstahl, The (Macht der Bilder: Leni Riefenstahl, Die) (1993)","Wooden Man's Bride, The (Yan shen) (1994)","World of Apu, The (Apur Sansar) (1959)",Wyatt Earp (1994),Yankee Zulu (1994),"Young Poisoner's Handbook, The (1995)"
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,3.0,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489,,,,,,,,,,,...,,,,,,,,,,
138490,,,,,,,,,,,...,,,,,,,,,,
138491,,,,,,,,,,,...,,,,,,,,,,
138492,,,,,,,,,,,...,,,,,,,,,,
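A rough back-of-the-envelope check of how large and how empty this table is, using the counts printed by the earlier cells (137065 users, 886 titles, 5,000,000 rating rows). The variable names are illustrative, and the density is an upper bound because repeated ratings of the same title by one user collapse into a single cell.

```python
# Figures taken from the outputs of the earlier cells.
n_users = 137_065      # distinct userId values in sampled_data
n_titles = 886         # distinct titles in sampled_data
n_ratings = 5_000_000  # rows in sampled_data (upper bound on filled cells)

cells = n_users * n_titles
dense_bytes = cells * 8        # float64 uses 8 bytes per cell
density = n_ratings / cells    # share of cells that can hold a rating

print(f"{cells:,} cells")
print(f"{dense_bytes / 1024**3:.2f} GiB as dense float64")
print(f"at most {density:.1%} of the cells hold a rating")
```

Most of the memory is spent on cells that hold no rating, which is the scalability problem mentioned in the Practice section.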


In [11]:
def search_movie_name(search):
    # Return the first title that contains `search` (str.contains treats it as a regex)
    matches = pivot_table.columns[pivot_table.columns.str.contains(search)]
    if matches.empty:
        raise ValueError(f"No movie title contains {search!r}.")
    return matches[0]

def get_recommendation_cosine(movie_name):
    if movie_name not in pivot_table.columns:
        print("The column is not present.")
        return

    movie = pivot_table[movie_name].values.reshape(1, -1)  # reshape to a single row: [[values]]
    pivot_matrix = pivot_table.values.T  # transpose so that each movie is a row

    similarity = cosine_similarity(movie, pivot_matrix)  # similarity of this movie to every movie

    # One row named 'Similarity', one column per title
    similarity_df = pd.DataFrame(similarity, columns=pivot_table.columns, index=['Similarity'])
    sorted_similarities = similarity_df.T.sort_values(by='Similarity', ascending=False)
    sorted_similarities = sorted_similarities.drop(movie_name)  # remove the movie itself
    top_10_similarities = sorted_similarities.head(10)

    return top_10_similarities


def get_recommendation_corr(movie_name):
    if movie_name not in pivot_table.columns:
        print("The name is not present.")
        return

    movie = pivot_table[movie_name]
    similarity = pivot_table.corrwith(movie)  # correlation of every movie with this one
    # Drop the movie itself explicitly instead of assuming it sorts first
    similarity = similarity.drop(movie_name).sort_values(ascending=False)
    return similarity.head(10).to_frame("Correlation")


I don't use cosine similarity while the data still has NaN values, because `cosine_similarity` raises an error on NaN input.
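The underlying reason can be seen with a hand-written cosine: a single NaN turns the dot product and the norms into NaN, so no meaningful similarity comes out. scikit-learn validates its input and raises a `ValueError` instead of returning NaN. The helper `cosine` and the toy vectors below are illustrative only.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([4.0, np.nan, 3.0])  # NaN = movie not rated
b = np.array([5.0, 2.0, np.nan])

with_nan = cosine(a, b)                              # NaN propagates through the arithmetic
filled = cosine(np.nan_to_num(a), np.nan_to_num(b))  # NaN replaced by 0, like fillna(0)

print(with_nan, filled)
```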

In [12]:
movie_name = search_movie_name("Ace")
get_recommendation_corr(movie_name)

Unnamed: 0_level_0,Correlation
title,Unnamed: 1_level_1
Headless Body in Topless Bar (1995),0.914991
Bye-Bye (1995),0.881422
Ace Ventura: When Nature Calls (1995),0.721531
War Stories (1995),0.693167
Dumb & Dumber (Dumb and Dumber) (1994),0.593361
Costa Brava (1946),0.578394
Every Other Weekend (Un week-end sur deux) (1990),0.560063
Shadow of Angels (Schatten der Engel) (1976),0.523733
"Modern Affair, A (1995)",0.502173
"Jar, The (Khomreh) (1992)",0.499


In [13]:
pivot_table = pivot_table.fillna(0)

pivot_table

title,'Til There Was You (1997),1-900 (06) (1994),"301, 302 (301/302) (1995)",8 Seconds (1994),Above the Rim (1994),Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Across the Sea of Time (1995),Addams Family Values (1993),"Addiction, The (1995)",...,Wings of Courage (1995),With Honors (1994),Wolf (1994),Women Robbers (Diebinnen) (1995),"Wonderful, Horrible Life of Leni Riefenstahl, The (Macht der Bilder: Leni Riefenstahl, Die) (1993)","Wooden Man's Bride, The (Yan shen) (1994)","World of Apu, The (Apur Sansar) (1959)",Wyatt Earp (1994),Yankee Zulu (1994),"Young Poisoner's Handbook, The (1995)"
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138490,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138492,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
movie_name = search_movie_name("Ace")
get_recommendation_cosine(movie_name)

Unnamed: 0_level_0,Similarity
title,Unnamed: 1_level_1
Dumb & Dumber (Dumb and Dumber) (1994),0.700865
Ace Ventura: When Nature Calls (1995),0.630757
Batman Forever (1995),0.621913
"Mask, The (1994)",0.621414
Batman (1989),0.611816
True Lies (1994),0.601368
Die Hard: With a Vengeance (1995),0.596266
Aladdin (1992),0.570974
Apollo 13 (1995),0.564414
Jurassic Park (1993),0.563637


In [15]:
movie_name = search_movie_name("Ace")
get_recommendation_corr(movie_name)

Unnamed: 0_level_0,Correlation
title,Unnamed: 1_level_1
Dumb & Dumber (Dumb and Dumber) (1994),0.616559
Ace Ventura: When Nature Calls (1995),0.559202
Batman Forever (1995),0.509123
"Mask, The (1994)",0.503958
Die Hard: With a Vengeance (1995),0.469686
Batman (1989),0.46384
True Lies (1994),0.456031
Cliffhanger (1993),0.420391
Mrs. Doubtfire (1993),0.417642
Aladdin (1992),0.416525


##### Consideration about cosine similarity

An important point is that cosine similarity ignores magnitude: a user who rates every film they watch 1 is perfectly similar to another user who rates the same films 5. The rating value matters when judging a film, but the two users are still similar in one sense, namely which movies they watched.

Cosine Similarity can be a bit dangerous, but if you know the risks, you can consider using this metric.
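A short numerical check of this point, assuming three invented users who rated the same three films. The helper `cosine` is illustrative; Euclidean distance is included only as a contrast, since it does react to the difference in scale.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

low_rater = np.array([1.0, 1.0, 1.0])    # rates every film 1
high_rater = np.array([5.0, 5.0, 5.0])   # rates the same films 5
mixed_rater = np.array([5.0, 1.0, 3.0])  # prefers some films over others

same_direction = cosine(low_rater, high_rater)        # vectors point the same way
different_direction = cosine(low_rater, mixed_rater)  # lower: the direction differs
distance = float(np.linalg.norm(low_rater - high_rater))  # Euclidean distance sees the gap

print(same_direction, different_direction, distance)
```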


In [160]:
def get_recommendation_user(user_id):
    new_user = pivot_table.loc[user_id, :]
    watched_movies = new_user[new_user > 0.].index  # titles the target user already rated
    print(new_user[new_user > 0.])

    total = {}           # per movie: sum of rating * similarity
    similaritySums = {}  # per movie: sum of similarities
    ranks = {}           # per movie: similarity-weighted average rating

    # Compare only against the first 1000 users to keep the loop fast
    for i in pivot_table.index[:1000]:
        other_user = pivot_table.loc[i, :]
        s = cosine_similarity([new_user.values], [other_user.values])[0][0]
        if s <= 0:
            continue  # no movie in common with the target user
        watched_movies_others = other_user[other_user > 0.].index
        for movie in watched_movies_others:
            if movie not in watched_movies:  # recommend only unseen movies
                total.setdefault(movie, 0)
                total[movie] += (pivot_table.loc[i, movie]) * s
                similaritySums.setdefault(movie, 0)
                similaritySums[movie] += s
                ranks[movie] = total[movie] / similaritySums[movie]
    # Highest predicted rating first
    return dict(sorted(ranks.items(), key=lambda item: item[1], reverse = True))

## The recommendations look a bit strange, but it is good enough for now; run more tests later

get_recommendation_user(1)

title
Blade Runner (1982)                                                4.0
City of Lost Children, The (Cité des enfants perdus, La) (1995)    3.5
Clerks (1994)                                                      4.0
Dragonheart (1996)                                                 3.0
Interview with the Vampire: The Vampire Chronicles (1994)          4.0
Jumanji (1995)                                                     3.5
Léon: The Professional (a.k.a. The Professional) (Léon) (1994)     4.0
Mask, The (1994)                                                   3.5
Pulp Fiction (1994)                                                4.0
Rob Roy (1995)                                                     4.0
Rumble in the Bronx (Hont faan kui) (1995)                         3.5
Seven (a.k.a. Se7en) (1995)                                        3.5
Shawshank Redemption, The (1994)                                   4.0
Silence of the Lambs, The (1991)                                   3.5


{'Cyclo (Xich lo) (1995)': 5.0,
 'Dear Diary (Caro Diario) (1994)': 5.0,
 'Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991)': 5.0,
 'Ballad of Narayama, The (Narayama Bushiko) (1958)': 5.0,
 'Sonic Outlaws (1995)': 5.0,
 'Day the Sun Turned Cold, The (Tianguo niezi) (1994)': 5.0,
 'Good Man in Africa, A (1994)': 5.0,
 'Unforgettable Summer, An (Un été inoubliable) (1994)': 5.0,
 'Nobody Loves Me (Keiner liebt mich) (1994)': 4.999999999999999,
 'Land and Freedom (Tierra y libertad) (1995)': 4.736167886441259,
 'Song of the Little Road (Pather Panchali) (1955)': 4.5130749141627,
 'Force of Evil (1948)': 4.5,
 'Eyes Without a Face (Yeux sans visage, Les) (1959)': 4.5,
 'Dream Man (1995)': 4.5,
 'Boys of St. Vincent, The (1992)': 4.497114255457546,
 "Star Maker, The (Uomo delle stelle, L') (1995)": 4.492638930939311,
 'Wonderful, Horrible Life of Leni Riefenstahl, The (Macht der Bilder: Leni Riefenstahl, Die) (1993)': 4.477581550649941,
 'Cosi (1996)': 4.36
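One possible explanation for the many 5.0 scores in this output: the score is a similarity-weighted average, so a movie rated by a single neighbour gets exactly that neighbour's rating, however weak the similarity is. The sketch below reproduces the effect with invented numbers. The `min_raters` filter is an assumption of this example and is not part of `get_recommendation_user`.

```python
# Per movie: (similarity to the target user, rating) pairs; all values are invented.
neighbour_ratings = {
    "Obscure Film": [(0.125, 5.0)],  # a single, weakly similar rater
    "Popular Film": [(0.80, 4.5), (0.60, 4.0), (0.70, 5.0)],
}

def weighted_scores(ratings_by_movie, min_raters=1):
    """Similarity-weighted mean rating per movie, skipping movies with too few raters."""
    scores = {}
    for movie, pairs in ratings_by_movie.items():
        if len(pairs) < min_raters:
            continue
        total = sum(sim * rating for sim, rating in pairs)
        sim_sum = sum(sim for sim, _ in pairs)
        scores[movie] = total / sim_sum
    return scores

unfiltered = weighted_scores(neighbour_ratings)              # obscure film scores 5.0
filtered = weighted_scores(neighbour_ratings, min_raters=2)  # obscure film is dropped

print(unfiltered)
print(filtered)
```

Requiring a minimum number of raters is only one way to damp this effect; whether it improves the real recommendations would need testing.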

#### Possible improvements: test other metrics and their possible uses, use KNN or K-Means (I haven't thought about this for the user-based case yet), use structures such as a sparse table to store the pivot

#### Standardize on the sparse table, since the results with NaN were worse/odd
#### Use this code as the base for developing the user-based approach
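For the sparse-table idea, a minimal sketch assuming SciPy is available: the long-format ratings are turned directly into a `scipy.sparse.csr_matrix`, so only rated cells are stored. The frame `toy` is invented and only mimics the columns of `sampled_data`. Note that `csr_matrix` sums duplicate (user, title) entries, whereas `pivot_table` averages them.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Invented long-format ratings with the same columns used for the pivot.
toy = pd.DataFrame({
    "userId": [3, 3, 6, 8, 8],
    "title":  ["A", "B", "A", "B", "C"],
    "rating": [4.0, 3.0, 5.0, 4.0, 2.5],
})

# Map each user and each title to a consecutive integer position.
user_codes, user_ids = pd.factorize(toy["userId"])
title_codes, titles = pd.factorize(toy["title"])

# Only the rated cells are stored; every other cell is an implicit 0.
sparse_pivot = csr_matrix(
    (toy["rating"].to_numpy(), (user_codes, title_codes)),
    shape=(len(user_ids), len(titles)),
)

print(sparse_pivot.shape, sparse_pivot.nnz)
print(sparse_pivot.toarray())
```

scikit-learn's `cosine_similarity` accepts sparse input, so a matrix like this could replace the dense pivot in the item-based function with few changes.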
