# Assignment 2: Simple Recommender

Group members: Pau Segura Baños and David Vilajosana Garriga


## 1. INTRODUCTION

### 1.1. Before you start...

**\+ Besides the libraries already imported in the first cell and Python's built-in functions, only the following libraries may be used during this assignment**:

`Pandas, Numpy, Itertools`

**\+ Do not modify the given function definitions or rename the variables and parameters already provided**

This does not mean you are required to use them. That is, a function having a parameter named `df` does not oblige you to use it if you do not find it convenient.

**\+ Each function specifies what every parameter is and its type; this must be respected**

For example (as stated in the function's docstring), `df` always refers to the `pandas.DataFrame` holding the data.

### 1.2. Data: movie ratings

The [movielens-1M](http://www.grouplens.org/node/73) dataset contains 1,000,209 ratings of about 3,900 movies made by 6,040 anonymous users who joined the online recommender [MovieLens](http://www.movielens.org/) in 2000.

All user ratings are stored in the file "ratings.dat" in the following format:

    UserID::MovieID::Rating::Timestamp

- **UserID**: user, with ids between 1 and 6040
- **MovieID**: movie, with ids between 1 and 3952
- **Rating**: score on a scale of 1 to 5 stars
- **Timestamp**: expressed in seconds

Each user has at least 20 ratings.
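
As a small illustration of the record format above, the following sketch parses two records (copied from the `ratings[:10]` output shown later in this notebook) from an in-memory buffer. The names `sample` and `sample_ratings` are hypothetical, and the conversion assumes the timestamps count seconds since the Unix epoch.

```python
import io
import pandas as pd

# Two sample records in the UserID::MovieID::Rating::Timestamp format.
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
# A multi-character separator such as '::' requires the Python parsing engine.
sample_ratings = pd.read_csv(sample, sep='::', header=None, names=rnames, engine='python')

# Assumption: timestamps are seconds since the Unix epoch.
sample_ratings['datetime'] = pd.to_datetime(sample_ratings['timestamp'], unit='s')
print(sample_ratings)
```

Converting the timestamp is optional; the exercises below only use the `rating` column.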

### 1.3. Data: users

The file ``users.dat`` holds the information about each user in the following format:

        UserID::Gender::Age::Occupation::Zip-code

- **Gender** is given as "M" for male and "F" for female.
- **Age** is encoded as follows:

	*  1:  "Under 18"
	* 18:  "18-24"
	* 25:  "25-34"
	* 35:  "35-44"
	* 45:  "45-49"
	* 50:  "50-55"
	* 56:  "56+"

- **Occupation** is chosen from the following options:

	*  0:  "other" or not specified
	*  1:  "academic/educator"
	*  2:  "artist"
	*  3:  "clerical/admin"
	*  4:  "college/grad student"
	*  5:  "customer service"
	*  6:  "doctor/health care"
	*  7:  "executive/managerial"
	*  8:  "farmer"
	*  9:  "homemaker"
	* 10:  "K-12 student"
	* 11:  "lawyer"
	* 12:  "programmer"
	* 13:  "retired"
	* 14:  "sales/marketing"
	* 15:  "scientist"
	* 16:  "self-employed"
	* 17:  "technician/engineer"
	* 18:  "tradesman/craftsman"
	* 19:  "unemployed"
	* 20:  "writer"

Users provided this information voluntarily, so it may be missing for some users.
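
The age codes listed above can be turned into readable labels with a dictionary and `Series.map`. This is only a sketch: `AGE_LABELS` and `toy_users` are hypothetical names, and the three rows are copied from the first lines of `users.dat` shown later in the notebook.

```python
import pandas as pd

# Age-group codes exactly as listed in the section above.
AGE_LABELS = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

# Three toy users (hypothetical frame, values taken from users.dat).
toy_users = pd.DataFrame({'user_id': [1, 2, 3],
                          'gender': ['F', 'M', 'M'],
                          'age': [1, 56, 25]})

# map() looks every code up in the dictionary; unknown codes would become NaN.
toy_users['age_group'] = toy_users['age'].map(AGE_LABELS)
print(toy_users)
```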


### 1.4. Data: movies

The file ``movies.dat`` holds the information about each movie in the following format:

        MovieID::Title::Genres

- **Titles** are identical to the titles in the IMDB database, including the release year.
- **Genres** are pipe-separated and selected from the following:

	* Action
	* Adventure
	* Animation
	* Children's
	* Comedy
	* Crime
	* Documentary
	* Drama
	* Fantasy
	* Film-Noir
	* Horror
	* Musical
	* Mystery
	* Romance
	* Sci-Fi
	* Thriller
	* War
	* Western

Some movie IDs may be wrong because of accidental duplicates.

The movies were entered by hand, so other inconsistencies may exist.
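
Since the genres of a movie are stored in a single pipe-separated string and the release year is embedded in the title, a short sketch of how both could be unpacked may help. The frame `toy_movies` and the regular expression are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

# Two toy movies in the same shape as movies.dat (hypothetical frame).
toy_movies = pd.DataFrame({
    'movie_id': [1, 3],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance'],
})

# One indicator column per genre: 1 if the movie has that genre, 0 otherwise.
genre_flags = toy_movies['genres'].str.get_dummies(sep='|')

# Release year taken from the trailing "(YYYY)" of the title (assumed pattern).
toy_movies['year'] = (toy_movies['title']
                      .str.extract(r'\((\d{4})\)\s*$', expand=False)
                      .astype(int))
print(genre_flags)
```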

## 2. Exploring the data

### 2.1. Downloading and reading the data

+ Download the files that make up the dataset and copy them to your working directory.

In [1]:
import os
# Download and unpack the dataset only if it is not already present.
if not os.path.isfile("ml-1m/ratings.dat"):
    os.system('wget -nc http://files.grouplens.org/datasets/movielens/ml-1m.zip')
    os.system('unzip ml-1m.zip')

+ Read the three tables of the dataset into three pandas DataFrames with this code:

In [2]:
import math
import numpy as np
import pandas as pd
import datetime
import itertools
from tqdm.notebook import trange, tqdm
import matplotlib.pyplot as plt

In [3]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python', encoding='latin-1')


### 2.2. Inspecting the tables

In [4]:
users[:10]

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455
5,6,F,50,9,55117
6,7,M,35,1,6810
7,8,M,25,12,11413
8,9,M,25,17,61614
9,10,F,35,1,95370


In [5]:
users[-10:]

Unnamed: 0,user_id,gender,age,occupation,zip
6030,6031,F,18,0,45123
6031,6032,M,45,7,55108
6032,6033,M,50,13,78232
6033,6034,M,25,14,94117
6034,6035,F,25,1,78734
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,1060
6039,6040,M,25,6,11106


In [6]:
ratings[-10:]

Unnamed: 0,user_id,movie_id,rating,timestamp
1000199,6040,2022,5,956716207
1000200,6040,2028,5,956704519
1000201,6040,1080,4,957717322
1000202,6040,1089,4,956704996
1000203,6040,1090,3,956715518
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
1000208,6040,1097,4,956715569


In [7]:
ratings[:10]

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
5,1,1197,3,978302268
6,1,1287,5,978302039
7,1,2804,5,978300719
8,1,594,4,978302268
9,1,919,4,978301368


In [8]:
ratings.sort_values('movie_id')[:5]


Unnamed: 0,user_id,movie_id,rating,timestamp
427702,2599,1,4,973796689
1966,18,1,4,978154768
683688,4089,1,5,965428947
596207,3626,1,4,966594018
465902,2873,1,5,972784317


In [9]:
movies[:5]

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [10]:
ratings[:5]

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


### 2.3 **Example:** How to extract information from a DataFrame.

Suppose we want to compute the **mean ratings of a movie by gender or age**, data that live in different frames.

The first step is to build a single structure that contains all the information. To do so we can use the pandas ``merge`` function, which automatically infers which columns to join on from the column names the two frames have in common.

Review these pandas concepts: https://pandas.pydata.org/docs/user_guide/merging.html
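
To make the key inference concrete, here is a minimal sketch on three toy frames (all names and values are hypothetical). It compares the implicit call with one where the join keys are spelled out.

```python
import pandas as pd

toy_ratings = pd.DataFrame({'user_id': [1, 1, 2],
                            'movie_id': [10, 20, 10],
                            'rating': [5, 3, 4]})
toy_users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
toy_movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['A', 'B']})

# Without `on=`, merge joins on the columns the two frames share:
# 'user_id' for the first merge, 'movie_id' for the second.
implicit = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)

# The same joins with the keys written explicitly.
explicit = toy_ratings.merge(toy_users, on='user_id').merge(toy_movies, on='movie_id')

print(implicit)
```

Writing `on=` explicitly is a little longer but protects against accidental joins when two frames happen to share an unrelated column name.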

In [11]:
data = pd.merge(pd.merge(ratings, users), movies)

# Show the table sorted by user id
data.sort_values(by='user_id')[:10]

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
28501,1,48,5,978824351,F,1,10,48067,Pocahontas (1995),Animation|Children's|Musical|Romance
13819,1,938,4,978301752,F,1,10,48067,Gigi (1958),Musical
51327,1,1207,4,978300719,F,1,10,48067,To Kill a Mockingbird (1962),Drama
31152,1,1721,4,978300055,F,1,10,48067,Titanic (1997),Drama|Romance
37916,1,2762,4,978302091,F,1,10,48067,"Sixth Sense, The (1999)",Thriller
18472,1,2687,3,978824268,F,1,10,48067,Tarzan (1999),Animation|Children's
45685,1,2692,4,978301570,F,1,10,48067,Run Lola Run (Lola rennt) (1998),Action|Crime|Romance
22832,1,720,3,978300760,F,1,10,48067,Wallace & Gromit: The Best of Aardman Animatio...,Animation
32771,1,745,3,978824268,F,1,10,48067,"Close Shave, A (1995)",Animation|Comedy|Thriller


``iloc`` lets us select a subset of rows and/or columns by integer position:

In [12]:
data.iloc[3:5]

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


Boolean indexing lets us select the part of the table that satisfies a condition.

In [13]:
# percentage of ratings made by women

print(data[data['gender']=='F']['rating'].count()/float(data['rating'].count())*100, '%')

24.638850480249626 %


To obtain the **mean rating of each movie grouped by age** we can use the ``pivot_table`` method, which reshapes the table: we specify which value to aggregate (with a predefined function) as a function of the values of two columns.

Review these concepts: 
+ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
+ https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
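
Before applying it to the full dataset, a tiny hand-checkable example may clarify what ``pivot_table`` computes. The frame `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B'],
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [4, 2, 5, 3],
})

# Rows are titles, columns are genders, each cell is the mean rating of the
# (title, gender) pair. Pairs with no ratings become NaN.
toy_pivot = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')
print(toy_pivot)
```

Movie `A` gets a mean of 3.0 from women (ratings 4 and 2) and 5.0 from men, while movie `B` has no rating from women and therefore shows NaN in column `F`.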

In [14]:
mean_ratings = data.pivot_table(values= 'rating', index='title', columns='age', aggfunc='mean')
mean_ratings[:10]

age,1,18,25,35,45,50,56
title,,,,,,,
"$1,000,000 Duck (1971)",,3.0,3.090909,3.133333,2.0,2.75,
'Night Mother (1986),2.0,4.666667,3.423077,2.904762,3.833333,3.555556,4.333333
'Til There Was You (1997),3.5,2.5,2.666667,2.9,2.333333,2.5,2.666667
"'burbs, The (1989)",4.5,3.244444,2.652174,2.818182,2.545455,3.208333,2.666667
...And Justice for All (1979),3.0,3.428571,3.724138,3.657143,4.1,3.551724,3.928571
1-900 (1994),,,2.0,,,,3.0
10 Things I Hate About You (1999),3.745455,3.41502,3.43295,3.102941,3.258065,3.62963,4.0
101 Dalmatians (1961),3.514286,3.295082,3.613757,3.826087,3.976744,3.65,3.190476
101 Dalmatians (1996),3.088235,2.467742,2.928571,3.27957,3.482759,3.4,3.555556
12 Angry Men (1957),4.176471,4.032609,4.408654,4.358333,4.274194,4.287879,4.235294


To obtain the **mean rating of each movie grouped by gender**:

In [15]:
mean_ratings = data.pivot_table('rating', index='title',columns='gender', aggfunc='mean')
mean_ratings[:10]

gender,F,M
title,,
"$1,000,000 Duck (1971)",3.375,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
1-900 (1994),2.0,3.0
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.5
101 Dalmatians (1996),3.24,2.911215
12 Angry Men (1957),4.184397,4.328421


If we want to restrict the computations to movies that have received **at least** 250 ratings, we first need a table with the number of ratings per title. To build it, we group the data by title (with the ``groupby`` method) and use ``size()`` to get the count.

Review this concept: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The ``groupby`` method implements one or more of these steps:

+ Split the data according to some criterion.
+ Apply a function to each group.
+ Combine the results into a data structure.
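
The three steps above can be followed on a toy frame (hypothetical names and values), where the threshold is lowered to 3 ratings so the result is easy to verify by hand.

```python
import pandas as pd

toy = pd.DataFrame({'title':  ['A', 'A', 'B', 'A', 'B'],
                    'rating': [5, 3, 4, 4, 2]})

# Split by title, apply size() to each group, combine into a Series.
counts = toy.groupby('title').size()

# Several aggregations at once: mean rating and number of ratings per title.
stats = toy.groupby('title')['rating'].agg(['mean', 'count'])

# Titles with at least 3 ratings (the notebook uses 250 on the real data).
popular = counts.index[counts >= 3]
print(stats)
```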

In [16]:
ratings_by_title = data.groupby('title').size()
print(ratings_by_title)

title
$1,000,000 Duck (1971)                         37
'Night Mother (1986)                           70
'Til There Was You (1997)                      52
'burbs, The (1989)                            303
...And Justice for All (1979)                 199
                                             ... 
Zed & Two Noughts, A (1985)                    29
Zero Effect (1998)                            301
Zero Kelvin (Kjærlighetens kjøtere) (1995)      2
Zeus and Roxanne (1997)                        23
eXistenZ (1999)                               410
Length: 3706, dtype: int64


We can then build an index with the titles that have at least 250 ratings.

In [17]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles

Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)

The index of titles with at least 250 ratings can be used to select rows of ``mean_ratings``: 

In [18]:
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings

gender,F,M
title,,
"'burbs, The (1989)",2.793478,2.962085
10 Things I Hate About You (1999),3.646552,3.311966
101 Dalmatians (1961),3.791444,3.500000
101 Dalmatians (1996),3.240000,2.911215
12 Angry Men (1957),4.184397,4.328421
...,...,...
Young Guns (1988),3.371795,3.425620
Young Guns II (1990),2.934783,2.904025
Young Sherlock Holmes (1985),3.514706,3.363344
Zero Effect (1998),3.864407,3.723140


To see the movies rated highest by women, we can sort by column F in descending order:

In [19]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]

gender,F,M
title,,
"Close Shave, A (1995)",4.644444,4.473795
"Wrong Trousers, The (1993)",4.588235,4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.57265,4.464589
Wallace & Gromit: The Best of Aardman Animation (1996),4.563107,4.385075
Schindler's List (1993),4.562602,4.491415
"Shawshank Redemption, The (1994)",4.539075,4.560625
"Grand Day Out, A (1992)",4.537879,4.293255
To Kill a Mockingbird (1962),4.536667,4.372611
Creature Comforts (1990),4.513889,4.272277
"Usual Suspects, The (1995)",4.513317,4.518248


Suppose now that we want the movies on which men and women disagree the most. One way is to add a column to ``mean_ratings`` holding the difference between the two means, and then sort:

In [20]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

In [21]:
print(np.nan + 9.0) 

nan
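
The cell above shows that any arithmetic with `NaN` yields `NaN`. This matters for a difference column: a title with no ratings from one gender has `NaN` in that column, so its difference is `NaN` too. The sketch below, on a hypothetical two-row frame, shows how pandas then treats such rows when sorting and averaging.

```python
import numpy as np
import pandas as pd

# Title 'B' has no rating from women, so its F mean is NaN.
toy = pd.DataFrame({'F': [4.0, np.nan], 'M': [3.5, 3.0]}, index=['A', 'B'])

# NaN propagates through the subtraction.
toy['diff'] = toy['M'] - toy['F']

# sort_values places NaN rows last by default (na_position='last').
ordered = toy.sort_values(by='diff')

# pandas reductions skip NaN by default (skipna=True).
mean_diff = toy['diff'].mean()
print(ordered)
```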


Sorting by ``diff`` puts first the movies that women rate higher than men by the widest margin:

In [22]:
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:15]

gender,F,M,diff
title,,,
Dirty Dancing (1987),3.790378,2.959596,-0.830782
Jumpin' Jack Flash (1986),3.254717,2.578358,-0.676359
Grease (1978),3.975265,3.367041,-0.608224
Little Women (1994),3.870588,3.321739,-0.548849
Steel Magnolias (1989),3.901734,3.365957,-0.535777
Anastasia (1997),3.8,3.281609,-0.518391
"Rocky Horror Picture Show, The (1975)",3.673016,3.160131,-0.512885
"Color Purple, The (1985)",4.158192,3.659341,-0.498851
"Age of Innocence, The (1993)",3.827068,3.339506,-0.487561
Free Willy (1993),2.921348,2.438776,-0.482573


Reversing the row order and slicing the top 15 rows gives the movies that men rate higher than women: 

In [23]:
sorted_by_diff[::-1][:15]

gender,F,M,diff
title,,,
"Good, The Bad and The Ugly, The (1966)",3.494949,4.2213,0.726351
"Kentucky Fried Movie, The (1977)",2.878788,3.555147,0.676359
Dumb & Dumber (1994),2.697987,3.336595,0.638608
"Longest Day, The (1962)",3.411765,4.031447,0.619682
"Cable Guy, The (1996)",2.25,2.863787,0.613787
Evil Dead II (Dead By Dawn) (1987),3.297297,3.909283,0.611985
"Hidden, The (1987)",3.137931,3.745098,0.607167
Rocky III (1982),2.361702,2.943503,0.581801
Caddyshack (1980),3.396135,3.969737,0.573602
For a Few Dollars More (1965),3.409091,3.953795,0.544704


If we want the movies that produced the most divided ratings, regardless of gender, we can use the variance or the standard deviation of the ratings: 

In [24]:
# Standard deviation of rating grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False)[:10]

title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64

### Important: performance considerations

Note how differently these similar-looking expressions behave, and write your code accordingly:

In [25]:
%timeit data['title'] 
print(type(data['title']))
%timeit data.title 
print(type(data.title))
%timeit data[['title']] 
print(type(data[['title']]))

1.92 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
<class 'pandas.core.series.Series'>
4.54 µs ± 287 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
<class 'pandas.core.series.Series'>
8.01 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
<class 'pandas.core.frame.DataFrame'>
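
The timings above suggest that retrieving a single column as a `Series` is much cheaper than selecting with a list of column names, which builds a new `DataFrame`. The sketch below (on a hypothetical frame `toy`) only illustrates the difference in the returned types; actual timings depend on the machine and the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['A', 'B'], 'rating': [5, 3]})

as_series = toy['title']     # single label   -> Series
as_attr = toy.title          # attribute form -> same Series
as_frame = toy[['title']]    # list of labels -> new one-column DataFrame

print(type(as_series), type(as_attr), type(as_frame))
```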


## 3. EXERCISES

### 3.1. EXERCISE A

+ Given the table ``data`` as defined below, compute each user's mean rating and store it in a DataFrame named ``users_mean_rating``. 

In [26]:
data_folder = 'ml-1m'

In [27]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table(f'{data_folder}/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table(f'{data_folder}/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table(f'{data_folder}/movies.dat', sep='::', header=None, names=mnames, engine='python',encoding='latin-1')

data = pd.merge(pd.merge(ratings, users), movies)

# your solution here

# Group by user id and average the ratings of the movies each user has rated.
users_mean_rating = data.groupby('user_id')['rating'].mean().reset_index()

users_mean_rating

Unnamed: 0,user_id,rating
0,1,4.188679
1,2,3.713178
2,3,3.901961
3,4,4.190476
4,5,3.146465
...,...,...
6035,6036,3.302928
6036,6037,3.717822
6037,6038,3.800000
6038,6039,3.878049


+ Which movie has the highest mean rating? (Store this value in a ``string`` variable named ``best_movie_rating``.) 

In [28]:
# your solution here

# Mean rating of every movie
mean_movie_rating = data.groupby('movie_id')['rating'].mean().reset_index()

merge_movie_rating = pd.merge(mean_movie_rating, movies, on='movie_id')

maximum_rating = merge_movie_rating['rating'].max()

# idxmax returns an index label, so the row is selected with .loc
best_movie_rating = merge_movie_rating.loc[merge_movie_rating['rating'].idxmax(), 'title']

print(best_movie_rating)

Gate of Heavenly Peace, The (1995)


+ Check whether other movies share the same mean rating as the top-rated one.

In [29]:
# your solution here

same_ratings = merge_movie_rating[merge_movie_rating['rating'] == maximum_rating]

same_ratings

Unnamed: 0,movie_id,rating,title,genres
744,787,5.0,"Gate of Heavenly Peace, The (1995)",Documentary
926,989,5.0,Schlafes Bruder (Brother of Sleep) (1995),Drama
1652,1830,5.0,Follow the Bitch (1998),Comedy
2955,3172,5.0,Ulysses (Ulisse) (1954),Adventure
3010,3233,5.0,Smashing Time (1967),Comedy
3054,3280,5.0,"Baby, The (1973)",Horror
3152,3382,5.0,Song of Freedom (1936),Drama
3367,3607,5.0,One Little Indian (1973),Comedy|Drama|Western
3414,3656,5.0,Lured (1947),Crime
3635,3881,5.0,Bittersweet Motel (2000),Documentary


+ Now find, among the movies with a mean rating of 5, the one that has received the most ratings, and store it in a variable named ``best_movie_rating_maxviews``. This gives the top-rated movie backed by the most users. 

In [30]:
# your solution here

# Number of ratings received by each movie
movie_counts = data.groupby('movie_id').size().reset_index(name='rating_count')

mean_movie_counts = pd.merge(same_ratings, movie_counts, on='movie_id')

# idxmax returns an index label, so the row is selected with .loc
max_rated_movie_count = mean_movie_counts.loc[mean_movie_counts['rating_count'].idxmax()]

best_movie_rating_maxviews = max_rated_movie_count['title']

print(best_movie_rating_maxviews)



Gate of Heavenly Peace, The (1995)


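### 3.2. EXERCISE B

+ Define a function named ``top_movie`` that, given a user, returns the movie that user rated highest.


In [31]:
def top_movie(dataFrame,usr):
    # your solution here

    # Ratings of the requested user, taken from the DataFrame passed as argument
    users_data = dataFrame[dataFrame['user_id']==usr]

    # idxmax returns an index label; after filtering, labels no longer match
    # positions, so the row is selected with .loc
    max_rating = users_data.loc[users_data['rating'].idxmax()]

    return max_rating['title']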

    
    
print(top_movie(data,1))

One Flew Over the Cuckoo's Nest (1975)
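
A detail worth keeping in mind for this exercise is that `idxmax` returns an index *label*, not a position. Once a DataFrame has been filtered the two no longer coincide, so the label should be used with `.loc`. The following sketch on a hypothetical frame `toy` shows the difference.

```python
import pandas as pd

toy = pd.DataFrame({'user_id': [1, 1, 2, 2],
                    'title':   ['A', 'B', 'C', 'D'],
                    'rating':  [5, 3, 2, 4]})

# Filtering keeps the original labels: the subset has labels 2 and 3.
subset = toy[toy['user_id'] == 2]

# Label of the row with the highest rating within the subset.
label = subset['rating'].idxmax()

# Label-based lookup returns the right row.
best_title = subset.loc[label, 'title']
print(label, best_title)

# Position-based lookup with the same number fails: the subset has only
# two rows, so position 3 does not exist.
try:
    subset.iloc[label]
except IndexError as err:
    print('iloc failed:', err)
```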


### 3.3. EXERCISE C

+ Write a function that, given the dataframe ``data``, returns another dataframe ``df_counts`` with the rating each user gave to each movie. We build it as a dataframe whose columns are the `movie_id` values, whose rows are the `user_id` values, and whose values are the ratings given.

In [32]:
def build_counts_table(df):
    """
    Returns a dataframe whose columns are the `movie_id` values, whose rows are the `user_id`
    values, and whose values are the rating a user gave to the movie with that `movie_id`.
    Movies a user has not rated are filled with 0.
    
    :param df: original DataFrame 
    :return: DataFrame described above
    """
    
    # your solution here
    build_table = df.pivot_table('rating', index='user_id',columns='movie_id').fillna(0)
    
    return build_table
    

In [33]:
df_counts = build_counts_table(data)
df_counts

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,,,,,,,,,,,,,,,,,,,,,
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6036,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


+ Write a function that, given the previous table and two ids (user and movie), extracts the rating given:

In [34]:
def get_count(df, user_id, movie_id):
    """
    Returns the rating that user 'user_id' gave to 'movie_id'
    
    :param df: DataFrame returned by `build_counts_table`
    :param user_id: user ID
    :param movie_id: movie ID
    :return: rating of the movie (0.0 if the user has not rated it)
    """
    
    # your solution here
    row = df.at[user_id,movie_id]
    
    return row

get_count(df_counts, 1, 1)

5.0

### 3.4. EXERCISE D

In [35]:
data.nunique()

user_id         6040
movie_id        3706
rating             5
timestamp     458455
gender             2
age                7
occupation        21
zip             3439
title           3706
genres           301
dtype: int64

In [36]:
unique_movies = pd.unique(data['movie_id'])
unique_movies.max()

3952

Looking at the total number of unique users and unique movies, we can see that user ids run from 1 to 6040. We usually want indexes that start at 0, running from 0 to 6039. 

+ Explore the movie indexes. **What is the problem with the movie indexes?**

> **Answer**: Movie ids go up to 3952, but only 3706 distinct movies appear in the data. The ids are therefore not consecutive: there are gaps, so they cannot be used directly as 0-based matrix indexes.

+ Using `pd.Categorical(*).codes`, re-index the user and movie ids so that they run from 0 to 6039 and from 0 to 3705 respectively:

In [37]:
# your solution here
data['movie_id']= pd.Categorical(data['movie_id']).codes
data['user_id'] = pd.Categorical(data['user_id']).codes

data.nunique()

user_id         6040
movie_id        3706
rating             5
timestamp     458455
gender             2
age                7
occupation        21
zip             3439
title           3706
genres           301
dtype: int64
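
To see what `pd.Categorical(...).codes` does with non-consecutive ids, consider this small sketch. The ids in `sparse_ids` are hypothetical.

```python
import pandas as pd

# Ids with gaps, similar in spirit to the movie ids.
sparse_ids = pd.Series([10, 3952, 10, 7])

cat = pd.Categorical(sparse_ids)

# The distinct values are sorted into categories (7, 10, 3952) and every
# element is replaced by the position of its category: 0, 1 or 2.
codes = cat.codes

# The categories allow mapping the codes back to the original ids.
original = cat.categories.to_numpy()[codes]
print(list(codes), list(original))
```

Keeping `cat.categories` around is useful if the recommender later needs to translate its 0-based indexes back into the original ids.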

+ To check that everything is correct and to store the table **df_counts** properly, recompute and display ``df_counts``:

In [38]:
# your solution here

df_counts = build_counts_table(data)
df_counts

movie_id,0,1,2,3,4,5,6,7,8,9,...,3696,3697,3698,3699,3700,3701,3702,3703,3704,3705
user_id,,,,,,,,,,,,,,,,,,,,,
0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6035,0.0,0.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.5. EXERCISE E



+ Write a function `distEuclid(x,y)` that implements the Euclidean distance between two vectors using pandas functions. 

+ Write the function `SimEuclid (DataFrame, U1, U2)` that computes the similarity between two users as follows:

    + Build one vector per user, $U1$ and $U2$, with the ratings of the items that both users have rated.
    + If there are no ratings in common, return 0. Otherwise, return 
    
    $$\frac{1}{(1+distEuclid(U1, U2))}$$

+ Use ``%timeit`` to measure how long these computations take for a pair of users.   

> *Note: Some of these exercises take minutes to run on the whole dataset. While developing the algorithms it is advisable to work with a reduced version of the dataset.* 

Only the following functions may be used to implement these functions:

* `np.sum`
* `np.sqrt`
* `np.power`
* `np.dot`
* `np.linalg.norm`
* `np.mean`

And it must be done **without loops**!
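
As a worked illustration of the formula above, the sketch below computes the similarity of two hypothetical rating vectors without loops, assuming that a value of 0 marks an item the user has not rated. It is not the notebook's solution, only a check of the arithmetic.

```python
import numpy as np

# Ratings of two hypothetical users over five items; 0 means "not rated".
u1 = np.array([5.0, 3.0, 0.0, 4.0, 0.0])
u2 = np.array([4.0, 0.0, 2.0, 2.0, 0.0])

# Items rated by both users: positions 0 and 3.
common = (u1 > 0) & (u2 > 0)

# Euclidean distance over the common items: sqrt((5-4)^2 + (4-2)^2) = sqrt(5).
dist = np.sqrt(np.sum(np.power(u1[common] - u2[common], 2)))

# Similarity as defined in the exercise, 0 when nothing is shared.
sim = 1.0 / (1.0 + dist) if common.any() else 0.0
print(dist, sim)
```

`np.linalg.norm(u1[common] - u2[common])` gives the same distance in a single call.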

In [39]:
import numpy as np
num_movies = data.nunique()['movie_id']

def distEuclid(x, y):
    """
    Returns the Euclidean distance between two n-dimensional vectors.

    :param x: first vector
    :param y: second vector
    :return : scalar (float) with the Euclidean distance
    """
    return np.linalg.norm(x - y)

def SimEuclid(data, User1, User2):
    """
    Returns a score for the similarity between User1 and User2 based on the Euclidean distance.

    :param data: dataframe containing all the data
    :param User1: id of the first user
    :param User2: id of the second user
    :return : scalar (float) with the score
    """
    # Ratings of each user indexed by movie_id, so that both vectors can be
    # aligned movie by movie regardless of the row order in `data`
    item_usr1 = data.loc[data['user_id'] == User1].set_index('movie_id')['rating']
    item_usr2 = data.loc[data['user_id'] == User2].set_index('movie_id')['rating']
    # Movies rated by both users
    common_items = np.intersect1d(item_usr1.index, item_usr2.index)

    if common_items.size == 0:
        return 0
    else:
        # On top of 1 / (1 + distance), the score is weighted by the fraction
        # of the catalogue that the two users have in common
        dist = 1 / (1 + distEuclid(item_usr1.loc[common_items].values,
                                   item_usr2.loc[common_items].values)) * \
               (len(common_items) / num_movies)
        
        return dist


In [40]:
# Execute functions
print(SimEuclid(data, 2,314))

0.0005396654074473826


In [41]:
# Execute functions
print(SimEuclid(data, 0,1))

0.0006296096420219464


In [42]:
# Execute functions
print(SimEuclid(data, 1, 5))

0.0005316828473355942


In [43]:
%timeit SimEuclid(data, 1, 5)

3.88 ms ± 475 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 3.6. EXERCISE F

In this exercise we build a **user-based** collaborative recommender system. 

The main function, ``getRecommendationsUser``, takes as input a ratings table, a ``user_id``, the number `m` of similar users to base the recommendation on, the number ``n`` of recommendations wanted, and the similarity measure to use (Euclidean). 

Example: ``getRecommendationsUser(data, 2, 50, 10, SimEuclid)``

Its output is the list of the $n$ best movies we could recommend to the user according to their similarity with other users.

> *Note 1: Avoid comparing ``user_id`` with itself.*

> *Note 2: Remember that in Python functions can be passed as arguments to other functions.*

#### EXERCISE F.1

+ Compute the similarity *score* of the target user (userID) with respect to every other user and return a dictionary with the $m$ closest users and their *scores*; these are the users we will use for the recommendation. Normalise the output *scores*.

In [44]:
def find_similar_users(DataFrame, userID, m, simfunction):
    """
    Returns a dictionary of similar users with their scores.
    
    :param DataFrame: dataframe containing all the data
    :param userID: user for whom the recommendation is made
    :param m: number of similar users to keep for the recommendation
    :param simfunction: similarity measure
    :return : dictionary
    """
    # your solution
    diccionario = {}
    
    # Score every other user of the DataFrame passed as argument
    for user in DataFrame['user_id'].unique():
        if user != userID:
            diccionario[user] = simfunction(DataFrame, userID, user)
            
    # Keep the m users with the highest score
    top_users = dict(sorted(diccionario.items(), key=lambda item: item[1], reverse=True)[:m])
    return top_users


    
    
    

In [45]:
t = datetime.datetime.now()
sim_dict = find_similar_users(data, 2, 10, SimEuclid)
t = datetime.datetime.now()-t
print(str(t))

0:00:20.188376


In [46]:
sim_dict

{4447: 0.0016759799892629866,
 1646: 0.0016575437514455324,
 5830: 0.0015585655884994666,
 2270: 0.0015274228133923392,
 1903: 0.0015241950142672186,
 3271: 0.001514953400110106,
 4276: 0.001514879601095694,
 1879: 0.0014857868522638257,
 3625: 0.001482580170135753,
 650: 0.0014790760952288534}
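
The exercise asks for normalised scores but does not say how. One possible reading, shown here only as an assumption, is to rescale the $m$ selected scores so that they sum to 1, which makes them directly usable as weights. The helper `top_m_normalised` and the toy scores are hypothetical.

```python
def top_m_normalised(scores, m):
    """Keep the m highest scores and rescale them so that they sum to 1.

    Assumed normalisation: the assignment does not specify which one to use.
    """
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:m]
    total = sum(score for _, score in top)
    if total == 0:
        # All selected scores are zero: nothing to rescale.
        return {user: 0.0 for user, _ in top}
    return {user: score / total for user, score in top}


toy_scores = {10: 0.2, 11: 0.6, 12: 0.1, 13: 0.2}
normalised = top_m_normalised(toy_scores, 2)
print(normalised)
```

With `m = 2` the function keeps users 11 and 10 and returns weights close to 0.75 and 0.25.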

+ Do you think the processing time is acceptable? How long would it take if we did this for every user?

> **ANSWER**: We do not think it is acceptable: a single user takes about 20 seconds, so repeating the computation for all 6040 users would multiply that time by 6040, which is roughly 34 hours. Moreover, a database with more users and movies would take even longer.
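
The estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes every user takes about as long as the single timing measured above for user 2.

```python
import datetime

time_per_user = datetime.timedelta(seconds=20.188376)  # timing measured above for user 2
num_users = 6040                                        # users in MovieLens-1M

total = time_per_user * num_users
hours = total.total_seconds() / 3600
print(f"Estimated total: {hours:.1f} hours")  # about 33.9 hours
```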

+ To solve the problem above, build a matrix of size $U \times U$ in which each position $(i,j)$ holds the similarity between element $i$ and element $j$. So, if you are building a user-based recommender, `matriu[2, 3]` will contain the similarity between user 2 and user 3. Measure how long your implementation takes.

Keep in mind that:

* If you write a function that works on each user's vector with a double ``for`` loop, processing the data can take quite a while.
* If you write a function that works directly on matrices (rather than vectors), it will take only a few seconds. This link gives pointers on how to do it: https://jaykmody.com/blog/distance-matrices-with-numpy/
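
The matrix-based approach can be sketched as follows. This is one possible loop-free formulation, not the solution used in this notebook; it assumes a dense users × movies array in which 0 means "not rated" and all real ratings are positive. The function name `similarity_matrix_vectorized` is hypothetical.

```python
import numpy as np


def similarity_matrix_vectorized(ratings):
    """
    Loop-free sketch of a Euclidean-based similarity over common items.

    :param ratings: 2-D array (users x movies), 0 meaning "not rated".
    :return: users x users array of similarities with a zero diagonal.
    """
    ratings = np.asarray(ratings, dtype=float)
    rated = (ratings > 0).astype(float)   # 1 where a rating exists
    squared = ratings ** 2

    # Over the movies rated by both users:
    #   sum (a - b)^2 = sum a^2 + sum b^2 - 2 * sum a*b
    # Multiplying by the other user's mask restricts each sum to common movies.
    sq_dist = squared @ rated.T + rated @ squared.T - 2 * (ratings @ ratings.T)
    sq_dist = np.maximum(sq_dist, 0)      # guard against rounding noise

    common = rated @ rated.T              # number of movies in common
    sim = (1 / (1 + np.sqrt(sq_dist))) * (common / ratings.shape[1])
    np.fill_diagonal(sim, 0)              # ignore self-similarity
    return sim


toy_ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [0, 0, 5, 4]])
toy_sim = similarity_matrix_vectorized(toy_ratings)
```

The three matrix products replace the per-pair intersection, so the cost is dominated by a few dense multiplications instead of a Python-level loop over every pair of users.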



In [47]:
def compute_distance(fixed_arr, var_arr):
    """
    Given a fixed vector and an array of vectors, compute the similarity between
    the fixed vector and each row, using only the elements they have in common.
    """
    # Movies rated by the fixed user
    user1_no0 = np.where(fixed_arr > 0)[0]
    distance = np.zeros(len(var_arr)) 
    for i, user2 in enumerate(var_arr):
        user2_no0 = np.where(user2 > 0)[0]
        # Items rated by both users
        common_items = np.intersect1d(user1_no0, user2_no0)
        # Euclidean similarity, weighted by the fraction of items in common
        distance[i] = (1 / (1 + (np.linalg.norm(fixed_arr[common_items] - user2[common_items])))) * (len(common_items) / len(fixed_arr))

    return distance

In [48]:
def similarity_matrix(compute_distance, df_counts):
    """
    Return an M x M matrix in which each position holds the
    similarity between users (resp. items).
    
    :param compute_distance: function that scores one vector against an array of vectors
    :param df_counts: df with the rating each user gave to each movie.
    :return : pandas DataFrame of size M x M with the similarities.
    """
    
    num_users = df_counts.shape[0]
    sim_matrix = np.zeros((num_users, num_users))

    for i in range(num_users):
        user1 = df_counts.iloc[i].values
        # Similarity between user i and every previous user (lower triangle)
        distances = compute_distance(user1, df_counts.values[:i, :])
        sim_matrix[i, :i] = distances

    # Fill the rest of the matrix by symmetry
    sim_matrix = sim_matrix + sim_matrix.T

    return pd.DataFrame(sim_matrix, index=df_counts.index, columns=df_counts.index)
    
    
    
    

In [49]:
t = datetime.datetime.now()
sim = similarity_matrix(compute_distance, df_counts)
t = datetime.datetime.now()-t
print("Temps amb doble for:",str(t))

Temps amb doble for: 0:21:19.357770


In [50]:
sim

user_id,0,1,2,3,4,5,6,7,8,9,...,6030,6031,6032,6033,6034,6035,6036,6037,6038,6039
0,0.000000,0.000630,0.000375,0.000447,0.000377,0.000705,0.000270,0.000705,0.001057,0.001366,...,0.000540,0.000593,0.000809,0.000135,0.000583,0.001250,0.000750,0.000000,0.000831,0.000875
1,0.000630,0.000000,0.000664,0.000469,0.000543,0.000532,0.000784,0.000852,0.000918,0.001353,...,0.000500,0.000483,0.001107,0.000135,0.000843,0.001799,0.001233,0.000250,0.000667,0.001234
2,0.000375,0.000664,0.000000,0.000417,0.000285,0.000363,0.000540,0.000369,0.000544,0.001107,...,0.000313,0.000592,0.000484,0.000000,0.000532,0.000986,0.000594,0.000405,0.000469,0.000669
3,0.000447,0.000469,0.000417,0.000000,0.000188,0.000270,0.000259,0.000494,0.000447,0.000507,...,0.000370,0.000324,0.000669,0.000000,0.000332,0.000957,0.000548,0.000130,0.000270,0.000655
4,0.000377,0.000543,0.000285,0.000188,0.000000,0.000326,0.000410,0.000883,0.000982,0.000825,...,0.000503,0.000447,0.000339,0.000335,0.001095,0.002076,0.001018,0.000154,0.000809,0.001515
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6035,0.001250,0.001799,0.000986,0.000957,0.002076,0.000802,0.000735,0.001645,0.001624,0.002506,...,0.000778,0.001619,0.001170,0.000775,0.002034,0.000000,0.002491,0.000559,0.001885,0.003295
6036,0.000750,0.001233,0.000594,0.000548,0.001018,0.000493,0.000442,0.000838,0.001192,0.001750,...,0.000617,0.001131,0.000725,0.000500,0.000870,0.002491,0.000000,0.000270,0.001341,0.002255
6037,0.000000,0.000250,0.000405,0.000130,0.000154,0.000235,0.000000,0.000270,0.000447,0.000740,...,0.000242,0.000277,0.000270,0.000000,0.000335,0.000559,0.000270,0.000000,0.000584,0.000410
6038,0.000831,0.000667,0.000469,0.000270,0.000809,0.000756,0.000270,0.000370,0.000559,0.001781,...,0.000390,0.000936,0.000447,0.000540,0.000483,0.001885,0.001341,0.000584,0.000000,0.001449


In [51]:
print("Temps amb doble for: 0:26:20.999203")

Temps amb doble for: 0:26:20.999203


+ Now rewrite the ``find_similar_users`` function and check how long it takes.

> Remember that the scores must be normalized!
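
The exercise does not say which normalization is expected. A minimal sketch, assuming "normalized" means rescaling the scores so that they add up to 1, could look like this; `normalize_scores` is a hypothetical helper and the scores below are made up.

```python
def normalize_scores(sim_dict):
    """Rescale similarity scores so that they add up to 1."""
    total = sum(sim_dict.values())
    if total == 0:
        return dict(sim_dict)  # nothing to rescale
    return {user: score / total for user, score in sim_dict.items()}


# Made-up scores, only for illustration.
toy_scores = {4447: 0.002, 1646: 0.001, 5830: 0.001}
normalized = normalize_scores(toy_scores)
```

Under this assumption the ranking of the users is unchanged, since every score is divided by the same positive constant.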

In [52]:
def find_similar_users(DataFrame, sim_mx, userID, m):
    """
    Find the m users most similar to the given user according to the similarity matrix.
    
    :param DataFrame: DataFrame containing the user data and the ratings.
    :param sim_mx: similarity matrix between users.
    :param userID: ID of the user for whom to find similar users.
    :param m: number of similar users to find.
    :return: dictionary mapping the IDs of the m most similar users to their scores.
    """
    # Indices of the users, sorted from most to least similar
    user_indices = np.argsort(sim_mx[userID])[::-1]

    # Never compare the user with itself
    user_indices = user_indices[user_indices != userID]

    top_m_similar_users = user_indices[:m]

    sim_dict = {}
    # Look up the similarity value for each user_id
    for user_id in top_m_similar_users:
        sim_value = sim_mx[userID][user_id]
        sim_dict[user_id] = sim_value

    return sim_dict

In [53]:
t = datetime.datetime.now()
sim_dict = find_similar_users(data, sim, 2, 10)
t = datetime.datetime.now()-t
print(str(t))

0:00:00.000997


In [54]:
sim_dict

{4447: 0.0016759799892629866,
 1646: 0.0016575437514455324,
 5830: 0.0015585655884994666,
 2270: 0.0015274228133923392,
 1903: 0.0015241950142672186,
 3271: 0.001514953400110106,
 4276: 0.001514879601095694,
 1879: 0.0014857868522638257,
 3625: 0.001482580170135753,
 650: 0.0014790760952288534}

#### EXERCISE F.2

+ Compute the recommendations for a specific user from the ratings of their $m$ closest users. First write a function that returns the **weighted average list** of the $m$ closest users. Use the previous function, the one based on the similarity matrix, to go faster! (Note: the **weighted average list** is computed by aggregating the $n$ top-rated items of each of the $m$ users most similar to the given user.)
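
One common way to read the weighted average is $\hat{r}_{u,i} = \frac{\sum_{v} s_{u,v}\, r_{v,i}}{\sum_{v} s_{u,v}}$, where the sums run over the similar users $v$ who rated movie $i$ and $s_{u,v}$ is their similarity to user $u$. The worked example below uses made-up numbers for a single movie and is only a sketch of that formula.

```python
import numpy as np

# Hypothetical data: three neighbours, their similarity to the target user,
# and their ratings for one movie (0 means "not rated").
similarities = np.array([0.5, 0.3, 0.2])
ratings = np.array([4.0, 0.0, 5.0])

# Only neighbours who actually rated the movie contribute.
rated = ratings > 0
predicted = (similarities[rated] * ratings[rated]).sum() / similarities[rated].sum()
print(predicted)  # (0.5 * 4 + 0.2 * 5) / (0.5 + 0.2), roughly 4.29
```

Dividing by the sum of the contributing similarities keeps the prediction on the original 1 to 5 scale, whatever the magnitude of the similarity scores.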

In [55]:
def weighted_average(DataFrame, user, sim_mx, m):
    
    # Collect all the movies and the movies rated by user as two sets
    set_movie_ids_unicas = set(DataFrame['movie_id'].unique())
    set_peliculas_usuario = set(DataFrame[DataFrame['user_id'] == user]['movie_id'])
    
    # Set difference: movies the user has not rated
    peliculas_no_vistas = set_movie_ids_unicas - set_peliculas_usuario
    recommendations = {}
    # Find the similar users
    similarity = find_similar_users(DataFrame, sim_mx, user, m)
    
    # For each movie, gather the ratings given by the similar users
    for movie_id in peliculas_no_vistas:
        similarity_total = 0
        recommendations[movie_id] = 0
        for usuario, similitud in similarity.items():
            # df_counts is the global user x movie ratings table (0 = not rated)
            rating = df_counts.at[usuario,movie_id]
            
            # Only users who rated the movie contribute, weighted by their similarity
            if rating != 0:
                recommendations[movie_id] += rating * similitud
                similarity_total += similitud
        if similarity_total != 0:
            recommendations[movie_id] = recommendations[movie_id] / similarity_total
    return recommendations

In [56]:
def getRecommendationsUser(DataFrame, user, sim_mx, n, m):
    """
    Return a dataframe of movies with their scores.
    
    :param DataFrame: dataframe containing all the data
    :param user: user for whom we make the recommendation
    :param sim_mx: similarity matrix between users
    :param n: number of movies to recommend
    :param m: number of similar users to take into account for the recommendations
    :return : pandas DataFrame of movies with their predicted scores
    """
    # Weighted average over the most similar users
    weighted_avg = weighted_average(DataFrame, user, sim_mx, m)
    df = pd.DataFrame(list(weighted_avg.items()), columns=['id_movie', 'valoracion'])

    # Sort the DataFrame by score in descending order
    df_sorted = df.sort_values(by='valoracion', ascending=False)
    top_recommendations = df_sorted.head(n)
    
    return top_recommendations

In [57]:
t = datetime.datetime.now()
user_prediction = getRecommendationsUser(data, 3, sim, 10, 50)
t = datetime.datetime.now()-t
print(str(t))

0:00:00.902228


In [58]:
user_prediction

Unnamed: 0,id_movie,valoracion
1937,1950,5.0
2285,2299,5.0
3067,3084,5.0
1746,1757,5.0
1709,1720,5.0
3579,3600,5.0
2917,2934,5.0
2411,2425,5.0
2442,2456,5.0
1088,1092,5.0


### 3.7. EXERCISE G


Next we will use the **Mean Absolute Error (MAE)** metric to evaluate our system. For a given user, this metric measures the difference between two lists:
+ The list of the user's original ratings
+ The list of predictions generated for that user
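
For reference, the standard definition is $\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N} |r_k - \hat{r}_k|$, the mean of the absolute differences between each original rating $r_k$ and its prediction $\hat{r}_k$. The sketch below applies it to two made-up lists; `mean_absolute_error` is a hypothetical helper, not a function provided by the assignment.

```python
import numpy as np


def mean_absolute_error(actual, predicted):
    """MAE: mean of |actual - predicted| over paired ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(actual - predicted).mean()


# Made-up ratings and predictions for three movies.
toy_actual = [4, 3, 5]
toy_predicted = [3.5, 3.0, 4.0]
mae_example = mean_absolute_error(toy_actual, toy_predicted)
print(mae_example)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

Note that the two lists must be aligned movie by movie; comparing the mean of one list with the mean of the other is a different, more lenient quantity.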

#### EXERCISE G.1

+ Set aside 10% of the users in a variable named ``test_set`` and keep the rest in a variable named ``train_set``.

In [59]:
# your solution here
test_set = []
train_set = []
data_copia = data.copy()

numero_usuarios_test = int(data['user_id'].nunique()*0.1)

usuaris_seleccionats = np.random.choice(data_copia['user_id'].unique(), numero_usuarios_test, replace=False)

test_set = data_copia[data_copia['user_id'].isin(usuaris_seleccionats)]        
train_set = data_copia[~data_copia['user_id'].isin(usuaris_seleccionats)]



assert len(test_set) + len(train_set) == len(data)

+ What happens if I compute the similarity matrix with ``train_set`` and then try to predict for the users in ``test_set``?

> **Answer**: We would obtain results based only on the similarities within train_set, and we could not exploit valuable information that we might be able to relate if we also had the test_set data. This would lower the accuracy of some predictions, although it would save a fair amount of computation time because fewer users would be examined.


#### EXERCISE G.2

+ Select approximately 80% of the interactions of each test user and add them to ``train_set``. Can we now evaluate the system?


> Since the assignment is very long, we give you the code for one given user; you only need to write the function that, for each user, takes 80% of the interactions and merges them into the train dataframe.
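
As an alternative to looping over users, the per-user split can be sketched with a grouped sample. This assumes pandas 1.1 or later, where `groupby(...).sample` is available; the toy dataframe and the names `held_out` and `kept_for_train` are hypothetical. The number of rows held out per user may differ slightly from truncating with `int()`.

```python
import pandas as pd

# Hypothetical toy interactions: user 1 has 10 ratings, user 2 has 5.
toy = pd.DataFrame({
    "user_id": [1] * 10 + [2] * 5,
    "movie_id": list(range(10)) + list(range(5)),
    "rating": [3] * 15,
})

# Hold out roughly 20% of each user's rows; the remaining ~80% go to train.
held_out = toy.groupby("user_id").sample(frac=0.2, random_state=0)
kept_for_train = toy.drop(held_out.index)

print(len(held_out), len(kept_for_train))
```

Because the rows are selected by index, the two parts are disjoint and together cover every interaction of the test users.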

In [60]:
test_set.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
3,14,1104,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,16,1104,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
5,17,1104,4,978156168,F,18,3,95825,One Flew Over the Cuckoo's Nest (1975),Drama
14,47,1104,4,977975061,M,25,4,92107,One Flew Over the Cuckoo's Nest (1975),Drama
20,61,1104,4,977968584,F,35,3,98105,One Flew Over the Cuckoo's Nest (1975),Drama


In [61]:
# Take 20% of the movies consumed by each test user
groupby_count = test_set.groupby('user_id')['movie_id'].count()*0.2
groupby_count

user_id
8        21.2
14       40.2
16       42.2
17       61.0
21       59.4
        ...  
5993     15.6
6009     63.4
6013     21.2
6033      4.2
6035    177.6
Name: movie_id, Length: 604, dtype: float64

We select position 1, and this user_id is the one we will use for the example code (which you will then have to replicate).

In [62]:
groupby_count.reset_index().iloc[1]

user_id     14.0
movie_id    40.2
Name: 1, dtype: float64

In [63]:
n_test_samples = int(groupby_count.reset_index().iloc[1]['movie_id'])
u = groupby_count.reset_index().iloc[1]['user_id']

In [64]:
test_set_user = test_set[test_set['user_id'] == u]
frame_test = test_set_user.sample(n_test_samples)
print("TOTAL SAMPLES OF THE USER: " + str(len(test_set_user)))
print("TOTAL SAMPLES OF THE USER IN TEST SET: " + str(len(frame_test)))

TOTAL SAMPLES OF THE USER: 201
TOTAL SAMPLES OF THE USER IN TEST SET: 40


In [65]:
len(test_set_user.index)

201

In [66]:
frame_train = test_set_user[~test_set_user.index.isin(frame_test.index)]
print("TOTAL SAMPLES OF THE USER IN TRAIN SET: " + str(len(frame_train)))

TOTAL SAMPLES OF THE USER IN TRAIN SET: 161


In [67]:
assert len(frame_train) + len(frame_test) == len(test_set_user)

In [68]:
def add_testdata(traindf, test_set):
    """
    Move 80% of each test user's interactions into the train set.
    
    :param traindf: dataframe containing the train data
    :param test_set: dataframe containing the test data

    :return : 
        - :param 1st: dataframe with the train data plus the selected 80% of test
        - :param 2nd: dataframe with the remaining test data (the other 20%)
    """
    frame_test = pd.DataFrame()
    # Number of ratings to hold out per test user (20%), computed once
    groupby_count = test_set.groupby('user_id')['movie_id'].count() * 0.2
    for user_id in test_set['user_id'].unique():
        n_test_samples = int(groupby_count.loc[user_id])
        test_set_user = test_set[test_set['user_id'] == user_id]
        # Randomly hold out 20% of this user's ratings
        frame_test_user = test_set_user.sample(n_test_samples)
        frame_test = pd.concat([frame_test, frame_test_user])
        # Add the remaining 80% of the interactions to the train set
        traindf = pd.concat([traindf, test_set_user[~test_set_user.index.isin(frame_test_user.index)]])

    return traindf, frame_test
    


In [69]:
t = datetime.datetime.now()
train, test = add_testdata(train_set, test_set)
t = datetime.datetime.now()-t
print(str(t))

0:00:19.939510


In [70]:
train

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,0,1104,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,1104,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,11,1104,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
6,18,1104,5,982730936,M,1,10,48073,One Flew Over the Cuckoo's Nest (1975),Drama
7,23,1104,5,978136709,F,25,7,10023,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
971198,2154,2319,5,974618489,F,1,10,50246,Pet Sematary II (1992),Horror
979279,2154,3594,5,974618390,F,1,10,50246,Phantasm IV: Oblivion (1998),Horror
981305,2154,1033,5,974618351,F,1,10,50246,Children of the Corn IV: The Gathering (1996),Horror
989276,2154,2579,4,974618750,F,1,10,50246,"Masque of the Red Death, The (1964)",Horror


In [71]:
test

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
537414,14,1455,3,978212065,M,25,7,22903,G.I. Jane (1997),Action|Drama|War
614471,14,610,4,978198040,M,25,7,22903,Primal Fear (1996),Drama|Thriller
70738,14,1120,4,978212663,M,25,7,22903,Star Wars: Episode VI - Return of the Jedi (1983),Action|Adventure|Romance|Sci-Fi|War
300680,14,1584,3,978198601,M,25,7,22903,"Big Lebowski, The (1998)",Comedy|Crime|Mystery|Thriller
480831,14,3256,4,978212591,M,25,7,22903,Hook (1991),Adventure|Fantasy
...,...,...,...,...,...,...,...,...,...,...
648472,2154,2411,5,974618965,F,1,10,50246,Night of the Comet (1984),Action|Horror|Sci-Fi
937022,2154,11,5,974618647,F,1,10,50246,Dracula: Dead and Loving It (1995),Comedy|Horror
950098,2154,1813,4,974619177,F,1,10,50246,Child's Play 3 (1992),Horror
973263,2154,2318,5,974619133,F,1,10,50246,Pet Sematary (1989),Horror


In [72]:
train.shape

(982350, 10)

In [73]:
test.shape

(17859, 10)

In [74]:
data.shape

(1000209, 10)

In [75]:
assert train.shape[0] + test.shape[0] == data.shape[0]

#### EXERCISE G.3

+ Write a function to evaluate our system using the MAE metric.

In [85]:
num_movies = data.nunique()['movie_id']
def evaluateRecommendations(train, test, m,n, sim):
    """
    Return the error produced by the model
    
    :param train: dataframe containing the train data
    :param test: dataframe containing the held-out test ratings
    :param m: number of similar users to use for the recommendation
    :param n: number of movies to return (unused: overridden with num_movies below)
    :param sim: similarity matrix
    :return : scalar (float): |mean predicted - mean actual rating| per user, averaged over users
    """
    total_error = 0
    num_evaluations = 0
    n = num_movies

    for user in test['user_id'].unique():
        # Get the recommendations for the current user
        recommendations = getRecommendationsUser(train, user, sim, n, m)
        
        # All the movies rated by the user in the test set
        user_test_movies = test[test['user_id'] == user]['movie_id']
        # Of the predictions, keep only those for movies the user has also rated
        user_recommendations = recommendations[recommendations['id_movie'].isin(user_test_movies)]


        # Mean of the user's actual ratings
        user_ratings_mean = test[(test['user_id'] == user) & (test['movie_id'].isin(user_test_movies))]['rating'].mean()

        # Mean of the predicted ratings
        recommendations_mean = user_recommendations['valoracion'].mean()
        
        # A NaN error (no overlap between predicted and rated movies) fails this test and is skipped
        error = abs(recommendations_mean - user_ratings_mean)
        if error > 0:
            total_error += error
            num_evaluations += 1
        

    # Average the per-user errors
    print("total_error: ", total_error, "num_evaluations: ", num_evaluations)
    mae = total_error / num_evaluations

    return mae

In [86]:
t = datetime.datetime.now()
mae = evaluateRecommendations(train, test, 50, 10, sim)
t = datetime.datetime.now()-t
print(str(t))

total_error:  123.52342274209923 num_evaluations:  604
0:07:16.326832


In [87]:
mae

0.20450897804983317

### 3.8. EXERCISE H (optional exercise, not mandatory)


+ **Which pays off more: a single recommender for both sexes, or one for each sex?** Justify your answer in writing and with the necessary code.

In [88]:
def genderRecommendations(train,test,m,n,sim):
    
    testM = test[test['gender']=='M']
    trainM = train[train['gender']=='M']
    
    testF = test[test['gender']=='F']
    trainF = train[train['gender']=='F']
     

    maeM = evaluateRecommendations(trainM, testM, m, n, sim)
    maeF = evaluateRecommendations(trainF,testF, m, n, sim)
    
    return maeM, maeF

In [89]:
t = datetime.datetime.now()
maeM, maeF = genderRecommendations(train, test, 50, 10, sim)
t = datetime.datetime.now()-t
print(str(t))

total_error:  86.92563600934278 num_evaluations:  430
total_error:  37.120599688409314 num_evaluations:  174
0:07:35.892053


In [90]:
maeM

0.2021526418821925

In [91]:
maeF

0.21333677981844434

> At first, we believed that building two separate recommenders would be more useful. However, the errors are very close (0.2045 for the single recommender versus 0.2022 for men and 0.2133 for women), so we cannot reach a definitive conclusion. We believe that with a larger amount of data we would be able to observe bigger differences.
