# Third Deliverable

Team members:
- Araoz, Tania
- Bajo, Pablo
- Barrera, Manuel

### Loading the required libraries

In [13]:
import pandas as pd
from datetime import datetime

### Loading the datasets

In [3]:
movies = pd.read_csv("../data/ml-latest/movies.csv")
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [4]:
ratings = pd.read_csv("../data/ml-latest/ratings.csv")
ratings

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


> We work with the ratings dataset, which holds the interactions between users and movies.

In [5]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


> The dataset contains 100,836 interactions. <span style="color:red">UPDATE WITH THE LARGE DATASET</span>

> The timestamp is stored as int64 and must be converted to a date type before we can work with it.

In [7]:
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

> There are no null values.

In [9]:
ratings['userId'].nunique()

610

> The dataset has 610 users. <span style="color:red">Update with the large dataset</span>

In [10]:
ratings['movieId'].nunique()

9724

> The dataset contains ratings for 9,724 movies. <span style="color:red">Update with the large dataset</span>
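
As a rough illustration of what these counts imply, the sketch below computes the density of the user-movie interaction matrix from the three figures reported above (100,836 ratings, 610 users, 9,724 movies). The variable names are ours and are not part of the notebook, and the numbers will change once the large dataset is loaded.

```python
# Figures reported above for the small dataset (to be updated with the large one).
n_ratings = 100_836
n_users = 610
n_movies = 9_724

# Share of all possible user-movie pairs that actually have a rating.
density = n_ratings / (n_users * n_movies)
print(f"density = {density:.4f}, sparsity = {1 - density:.4f}")
```

With these figures, only about 1.7% of the possible user-movie pairs have a rating.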

In [12]:
ratings['rating'].sort_values(ascending=True).unique()

array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

> Ratings range from 0.5 to 5 in steps of 0.5.
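
The observation above can be turned into an automatic check. The following is a minimal sketch on a toy Series (the sample values are made up for illustration); in the notebook it would presumably be applied to `ratings["rating"]`.

```python
import pandas as pd

# Toy stand-in for ratings["rating"]; the values are illustrative only.
sample = pd.Series([4.0, 0.5, 3.5, 5.0, 2.0])

# Allowed scale: 0.5, 1.0, ..., 5.0 (all exactly representable as floats).
allowed = [k / 2 for k in range(1, 11)]

# True when every observed rating lies on the half-star grid.
all_valid = sample.isin(allowed).all()
print(all_valid)
```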

#### Preprocessing

We convert the numeric timestamp to a date format.

In [15]:
# Interpret the integers as seconds since the Unix epoch (UTC) and format them as YYYY/MM/DD strings
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.strftime('%Y/%m/%d')

In [16]:
ratings["timestamp"]

0         2000/07/30
1         2000/07/30
2         2000/07/30
3         2000/07/30
4         2000/07/30
             ...    
100831    2017/05/03
100832    2017/05/03
100833    2017/05/08
100834    2017/05/03
100835    2017/05/03
Name: timestamp, Length: 100836, dtype: object

> The values now look like dates, but the column is still of type object.

We use pandas to convert the column to a datetime type that supports filtering.

In [17]:
ratings["timestamp"] = pd.to_datetime(ratings['timestamp'], format='%Y/%m/%d')

In [18]:
ratings["timestamp"]

0        2000-07-30
1        2000-07-30
2        2000-07-30
3        2000-07-30
4        2000-07-30
            ...    
100831   2017-05-03
100832   2017-05-03
100833   2017-05-08
100834   2017-05-03
100835   2017-05-03
Name: timestamp, Length: 100836, dtype: datetime64[ns]

> The column now has dtype datetime64.
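
The same result can probably be reached in a single step, without going through strings. The sketch below is an alternative we are suggesting, not the approach used in the notebook: it assumes the raw values are seconds since the Unix epoch and uses two of the values shown in the ratings preview.

```python
import pandas as pd

# Two epoch-second values taken from the ratings preview above.
raw = pd.Series([964982703, 1493848402])

# Parse as seconds since the Unix epoch (UTC), then truncate to midnight.
dates = pd.to_datetime(raw, unit="s").dt.normalize()
print(dates)
```

Dropping the time of day with `normalize()` mirrors what the `'%Y/%m/%d'` format does in the cells above, while keeping the datetime64 dtype throughout.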

In [19]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 2000-07-30 |
| 1 | 1 | 3 | 4.0 | 2000-07-30 |
| 2 | 1 | 6 | 4.0 | 2000-07-30 |
| 3 | 1 | 47 | 5.0 | 2000-07-30 |
| 4 | 1 | 50 | 5.0 | 2000-07-30 |


We check the date range of the dataset.

In [20]:
ratings.timestamp.min()

Timestamp('1996-03-29 00:00:00')

In [21]:
ratings.timestamp.max()

Timestamp('2018-09-24 00:00:00')

> The dates range from 29/03/1996 to 24/09/2018.

#### Splitting the dataset into train, test and validation
We look at the number of ratings per year.

In [22]:
plot_df = ratings.copy()
plot_df["year"] = ratings.timestamp.dt.year
plot_df = plot_df.groupby("year", as_index=False).count()[["year", "userId"]]
plot_df.columns = ["year", "reviews_count"]
plot_df.head(25)

| | year | reviews_count |
|---|---|---|
| 0 | 1996 | 6040 |
| 1 | 1997 | 1916 |
| 2 | 1998 | 507 |
| 3 | 1999 | 2439 |
| 4 | 2000 | 10061 |
| 5 | 2001 | 3922 |
| 6 | 2002 | 3478 |
| 7 | 2003 | 4014 |
| 8 | 2004 | 3279 |
| 9 | 2005 | 5813 |
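
The per-year count above can also be written more compactly. The sketch below is one possible variant on a toy frame (the rows are invented for illustration); it groups directly on the year extracted from `timestamp`, so no temporary `year` column or column renaming is needed.

```python
import pandas as pd

# Toy frame standing in for `ratings`; dates are illustrative only.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "timestamp": pd.to_datetime(
        ["1996-03-29", "1996-05-01", "2000-07-30", "2000-08-01"]
    ),
})

# Count rows per calendar year and return a tidy two-column frame.
per_year = (
    toy.groupby(toy["timestamp"].dt.year)
       .size()
       .rename_axis("year")
       .reset_index(name="reviews_count")
)
print(per_year)
```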


> We split the dataset into train and test with a proportion of roughly 80/20, using the rating date as the criterion.
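
One way to find a cutoff date that yields roughly this proportion is to sort by time and take the timestamp at the 80% position. The sketch below illustrates the idea on a toy frame; the helper `temporal_split` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def temporal_split(df, cutoff, time_col="timestamp"):
    """Rows strictly before `cutoff` go to train, the rest to test."""
    train = df[df[time_col] < cutoff]
    test = df[df[time_col] >= cutoff]
    return train, test

# Toy frame: ten ratings on ten consecutive days (illustrative only).
toy = pd.DataFrame({
    "userId": range(1, 11),
    "timestamp": pd.date_range("2015-12-24", periods=10, freq="D"),
})

# Pick the cutoff as the timestamp sitting at the 80% position.
ordered = toy.sort_values("timestamp")
cutoff = ordered["timestamp"].iloc[int(len(ordered) * 0.8)]

train, test = temporal_split(toy, cutoff)
print(cutoff.date(), len(train) / len(toy))
```

When many ratings share the same date, as they do here after truncating to days, the resulting split will only approximate 80/20.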

In [37]:
train = ratings[(ratings.timestamp < datetime(year=2016, month=1, day=1))]
train.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 2000-07-30 |
| 1 | 1 | 3 | 4.0 | 2000-07-30 |
| 2 | 1 | 6 | 4.0 | 2000-07-30 |
| 3 | 1 | 47 | 5.0 | 2000-07-30 |
| 4 | 1 | 50 | 5.0 | 2000-07-30 |


In [25]:
train.shape

(79518, 4)

In [27]:
train.userId.nunique()

514

In [29]:
train.movieId.nunique()

7790

In [38]:
test = ratings[ratings.timestamp >= datetime(year=2016, month=1, day=1)]
test.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 1119 | 10 | 296 | 1.0 | 2016-02-12 |
| 1120 | 10 | 356 | 3.5 | 2016-02-12 |
| 1121 | 10 | 588 | 4.0 | 2016-02-12 |
| 1122 | 10 | 597 | 3.5 | 2016-02-13 |
| 1123 | 10 | 912 | 4.0 | 2016-02-12 |


In [26]:
test.shape

(21318, 4)

In [33]:
test.userId.nunique()

120

In [31]:
test.movieId.nunique()

5714

In [39]:
plot_df = train.copy()
plot_df["year"] = train.timestamp.dt.year
plot_df = plot_df.groupby("year", as_index=False).count()[["year", "userId"]]
plot_df.columns = ["year", "reviews_count"]
plot_df.head(25)

| | year | reviews_count |
|---|---|---|
| 0 | 1996 | 6040 |
| 1 | 1997 | 1916 |
| 2 | 1998 | 507 |
| 3 | 1999 | 2439 |
| 4 | 2000 | 10061 |
| 5 | 2001 | 3922 |
| 6 | 2002 | 3478 |
| 7 | 2003 | 4014 |
| 8 | 2004 | 3279 |
| 9 | 2005 | 5813 |


In [40]:
train.shape

(79517, 4)

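In [None]:
# NOTE: this filter is identical to the one used for `test`; the validation cutoff is still to be defined
validation = ratings[ratings.timestamp >= datetime(year=2016, month=1, day=1)]
validation.head()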

In [34]:
test[~test.userId.isin(train.userId.unique())].userId.nunique()

96
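
According to the cells above, 96 of the 120 users in the test set (80%) have no ratings in train, so a model fitted on train has no history for them. The sketch below shows one way to separate those "cold" users from the ones that do appear in train. It uses toy frames, and the names `cold_users` and `warm_users` are ours.

```python
import pandas as pd

# Toy splits standing in for `train` and `test` (illustrative only).
train = pd.DataFrame({"userId": [1, 1, 2, 3]})
test = pd.DataFrame({"userId": [2, 4, 4, 5]})

seen = set(train["userId"])
test_users = set(test["userId"])

cold_users = test_users - seen   # appear only in test
warm_users = test_users & seen   # also have training history

print(sorted(cold_users), sorted(warm_users))
```

Evaluating the two groups separately may give a clearer picture than a single aggregate metric, since predictions for cold users cannot rely on their past ratings.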