## Case Study: MovieLens

MovieLens is a movie dataset created at the University of Minnesota.

The data can be downloaded from the following link:
* http://grouplens.org/datasets/movielens/20m/

The *MovieLens 20M Dataset* contains:
* 20 million movie ratings (ratings.csv)
* 465,000 tags (tags.csv)
* 27,000 movies (movies.csv)
* 138,000 users

Once the download completes, unzip ml-20m.zip into the datasets directory and rename the extracted subdirectory to movielens (../datasets/movielens).
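
The unzip-and-rename step can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `extract_and_rename` is hypothetical, and it assumes the archive unpacks into a top-level folder called `ml-20m`. The demo builds a tiny stand-in archive so that it runs without the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_rename(zip_path, datasets_dir, old_name="ml-20m", new_name="movielens"):
    """Unzip the archive into datasets_dir and rename the extracted folder.

    old_name is an assumption about the archive's top-level folder.
    """
    datasets_dir = Path(datasets_dir)
    datasets_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(datasets_dir)
    target = datasets_dir / new_name
    (datasets_dir / old_name).rename(target)
    return target


# Demo on a tiny stand-in archive (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "ml-20m.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("ml-20m/movies.csv", "movieId,title,genres\n")
    result = extract_and_rename(demo_zip, Path(tmp) / "datasets")
    extracted_files = sorted(p.name for p in result.iterdir())
    print(result.name, extracted_files)
```

For the real dataset, the call would presumably look like `extract_and_rename('ml-20m.zip', '../datasets')`, run from the notebook's directory.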

In [None]:
import pandas as pd

Import the data with *pandas*.

We will work with 3 CSV files:
* **movies.csv:** *movieId*, *title*, *genres*
* **tags.csv:** *userId*, *movieId*, *tag*, *timestamp*
* **ratings.csv:** *userId*, *movieId*, *rating*, *timestamp*

The timestamp (**Unix time**) is the number of seconds elapsed since January 1, 1970 (UTC).
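
To make the definition concrete, here is a small standard-library sketch that converts a count of seconds into a UTC date. The helper name `unix_to_utc` and the sample value are illustrative, not taken from the dataset.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert Unix seconds into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


epoch = unix_to_utc(0)                # the reference instant
sample = unix_to_utc(1_000_000_000)   # an arbitrary illustrative value
print(epoch.isoformat())   # 1970-01-01T00:00:00+00:00
print(sample.isoformat())  # 2001-09-09T01:46:40+00:00
```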


In [None]:
pwd

In [None]:
ls "../datasets/movielens"

In [None]:
movies = pd.read_csv('../datasets/movielens/movies.csv', sep=',')
print(type(movies))
movies.head(15)

In [None]:
tags = pd.read_csv('../datasets/movielens/tags.csv', sep=',')
tags.head()

In [None]:
# timestamp holds Unix seconds, which parse_dates is not designed to convert; load it as plain integers
ratings = pd.read_csv('../datasets/movielens/ratings.csv', sep=',')
ratings.head()

In [None]:
# Drop the timestamp column for the initial analysis
del ratings['timestamp']
del tags['timestamp']

## Data Structures

In [None]:
ratings.tail()

### Series

In [None]:
# Extract row 0

row_0 = tags.iloc[0]
type(row_0)

In [None]:
print(row_0)

In [None]:
row_0.index

In [None]:
row_0['userId']

In [None]:
'rating' in row_0

In [None]:
row_0.name

In [None]:
row_0 = row_0.rename('first_row')
row_0.name

In [None]:
tags.head()

### DataFrames

In [None]:
tags.head(8)

In [None]:
tags.index

In [None]:
tags.columns

In [None]:
# Extract rows 0, 11 and 2000

tags.iloc[ [0,11,2000] ]

### Descriptive Statistics

We will describe the characteristics of this dataset using summary measures, tables, and charts.

In [None]:
ratings['rating'].describe()

In [None]:
ratings.describe().transpose()

In [None]:
ratings['rating'].mean()

In [None]:
ratings.mean()

In [None]:
ratings['rating'].min()

In [None]:
ratings['rating'].max()

In [None]:
ratings['rating'].std()

In [None]:
ratings['rating'].mode()

In [None]:
# Computing the correlation is not meaningful for this dataset
ratings.corr()

In [None]:
# Is any rating greater than 5?
filter_1 = ratings['rating'] > 5
filter_1.any()

In [None]:
filter_2 = ratings['rating'] > 0
filter_2.all()

### Data Cleaning: Handling Missing Data

In [None]:
movies.shape

In [None]:
movies.iloc[0]

In [None]:
movies.isnull().any()

In [None]:
ratings.shape
#ratings.columns

In [None]:
ratings.isnull().any()

In [None]:
tags.shape
#tags.columns

In [None]:
tags.isnull().any()

In [None]:
# Drop rows with null values in tags
tags = tags.dropna()

In [None]:
tags.isnull().any()
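
`isnull().any()` only reports whether a column has missing values. The sketch below, on a made-up toy frame rather than the real tags file, shows how to count them per column and how to drop only the rows that lack a tag.

```python
import numpy as np
import pandas as pd

# Toy data standing in for tags.csv (values are invented for illustration).
toy_tags = pd.DataFrame({
    "userId": [1, 2, 3, 4],
    "movieId": [10, 20, 30, 40],
    "tag": ["funny", np.nan, "dark", None],
})

# Number of missing values in each column.
missing_per_column = toy_tags.isnull().sum()

# Drop only the rows where 'tag' is missing.
cleaned = toy_tags.dropna(subset=["tag"])

print(missing_per_column)
print(len(toy_tags), "->", len(cleaned))
```

Counting before dropping makes it easy to judge how much data a `dropna()` call will discard.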

### Data Visualization

In [None]:
%matplotlib inline

ratings.hist(column='rating', figsize=(10,5))

In [None]:
ratings.boxplot(column='rating')

In [None]:
ratings['rating'].describe()

### Data Selection

In [None]:
tags['tag'].head()

In [None]:
movies[['title','genres']].head()

In [None]:
ratings[-5:]

In [None]:
tag_counts = tags['tag'].value_counts()
tag_counts[:10]

In [None]:
tag_counts[:10].plot(kind='bar', figsize=(10,10))

### Filtering Data (Row Selection)

In [None]:
is_highly_rated = ratings['rating'] >= 4.0
ratings[is_highly_rated][-5:]

In [None]:
movies.head()

In [None]:
# Boolean mask: True for movies whose genre list includes Animation
is_animation = movies['genres'].str.contains('Animation')
movies[is_animation][5:15]

In [None]:
is_adventure = movies['genres'].str.contains('Adventure')
movies[is_adventure][5:15]

In [None]:
movies[is_animation & is_adventure].head()
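
Boolean masks combine with `&` (and), `|` (or) and `~` (not). The following sketch uses a made-up four-row frame so the result of each combination is easy to check by eye.

```python
import pandas as pd

# Toy data (invented titles) mirroring the 'genres' layout of movies.csv.
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": ["Adventure|Animation", "Animation|Comedy", "Adventure|Drama", "Drama"],
})

anim = toy_movies["genres"].str.contains("Animation")
adv = toy_movies["genres"].str.contains("Adventure")

both = toy_movies[anim & adv]        # Animation AND Adventure
either = toy_movies[anim | adv]      # Animation OR Adventure
neither = toy_movies[~(anim | adv)]  # neither genre

print(both["title"].tolist(), either["title"].tolist(), neither["title"].tolist())
```

The parentheses around `anim | adv` matter, because `~` binds more tightly than `|`.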

### Grouping and Aggregating Data

In [None]:
ratings.head()

In [None]:
ratings_count = ratings[['movieId','rating']].groupby('rating').count()
ratings_count

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId').mean()
average_rating.head()

In [None]:
movie_count = ratings[['movieId','rating']].groupby('movieId').count()
movie_count.head()

In [None]:
# movie_count was computed in the previous cell
movie_count.tail()
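
The mean and the count per movie can also be computed in a single pass with `agg`. This is a sketch on invented toy ratings. The threshold of two ratings is arbitrary and only illustrates filtering out movies with too few votes.

```python
import pandas as pd

# Toy ratings (invented values) with the same columns used above.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
})

# One row per movie, with the average rating and the number of ratings.
summary = toy_ratings.groupby("movieId")["rating"].agg(["mean", "count"])

# Keep only movies with at least 2 ratings (arbitrary illustrative threshold).
reliable = summary[summary["count"] >= 2]

print(summary)
print(reliable)
```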

### Combining DataFrames

In [None]:
print(tags.count())
tags.head()

In [None]:
print(movies.count())
movies.head()

In [None]:
t = movies.merge(tags, on='movieId', how='inner')
t.head()


More examples: http://pandas.pydata.org/pandas-docs/stable/merging.html
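
The `how` argument controls which rows survive a merge. The sketch below uses invented toy frames to contrast `how='inner'` (only movies that have tags) with `how='left'` (every movie, with missing tags shown as NaN).

```python
import pandas as pd

# Toy frames (invented values) sharing the 'movieId' key.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_tags = pd.DataFrame({"movieId": [1, 1, 3], "tag": ["funny", "classic", "dark"]})

# Inner join: movie 2 has no tags, so it disappears.
inner = toy_movies.merge(toy_tags, on="movieId", how="inner")

# Left join: movie 2 is kept, with NaN in the 'tag' column.
left = toy_movies.merge(toy_tags, on="movieId", how="left")

print(inner)
print(left)
```

Note that movie 1 appears twice in both results, once per tag, so a merge can increase the row count.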

### Data Analysis Combining Aggregation, Merging, and Filtering

In [None]:
avg_ratings = ratings.groupby('movieId', as_index=False).mean()
del avg_ratings['userId']
#print(avg_ratings.count())
avg_ratings.head()

In [None]:
print(avg_ratings.count())
avg_ratings.hist('rating')

In [None]:
print(movies.count())
movies.head()

In [None]:
print(avg_ratings.count())
avg_ratings.head()

In [None]:
box_office = movies.merge(avg_ratings, on='movieId', how='inner')
box_office.tail()

In [None]:
is_highly_rated = box_office['rating'] >= 4.0
print(box_office.count())

box_office[is_highly_rated][-5:]

In [None]:
is_comedy = box_office['genres'].str.contains('Comedy')

box_office[is_comedy][:5]

In [None]:
box_office[is_comedy & is_highly_rated][-5:]

### String Operations

In [None]:
movies.head()

### Splitting Genres into Multiple Columns

In [None]:
movie_genres = movies['genres'].str.split('|', expand=True)

In [None]:
movie_genres[:10]

### Adding a Column to Flag Comedies

In [None]:
movie_genres['isComedy'] = movies['genres'].str.contains('Comedy')

In [None]:
movie_genres[:10]

### Extracting the Year from the Movie Title

In [None]:
movies.head()

In [None]:
# Raw strings keep the regex backslashes intact; expand=False returns a Series
#movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)
movies['year'] = movies['title'].str.extract(r'.*\(([0-9]+)\).*', expand=False)

In [None]:
movies.tail()
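
The behavior of the extraction is easier to see on a handful of titles. The sketch below is a variation, not the notebook's exact pattern: it assumes the year is a four-digit number in parentheses at the end of the title, and the last two titles are made up to show what happens when that assumption fails.

```python
import pandas as pd

toy_titles = pd.Series([
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Hypothetical Series (2009– )",   # invented title with a malformed year
    "No Year Here",                   # invented title with no year at all
])

# Four digits in parentheses, anchored at the end of the title.
years = toy_titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

print(years)
```

Titles that do not match come back as NaN, so it is worth checking `years.isnull().sum()` before using the column.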

More string operations: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods

### Parsing Unix Timestamps

Unix timestamps are widely used in IoT (sensor data and other time series).

In [None]:
tags = pd.read_csv('../datasets/movielens/tags.csv', sep=',')

In [None]:
tags.dtypes

Unix time (also called POSIX time or epoch time) records time as the number of seconds elapsed since midnight Coordinated Universal Time (UTC) on January 1, 1970.

In [None]:
tags.head(5)

In [None]:
tags['parsed_time'] = pd.to_datetime(tags['timestamp'], unit='s')

In [None]:
tags.head(2)

### Selecting Rows by Date

In [None]:
greater_than_t = tags['parsed_time'] > '2015-02-01'

selected_rows = tags[greater_than_t]

selected_rows.head()
#tags.shape, selected_rows.shape

### Sorting Rows by Date

In [None]:
tags.sort_values(by='parsed_time', ascending=True)[:10]
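
The parse, filter and sort steps can be checked end to end on a few invented timestamps. In this sketch the values are chosen to fall on known dates, and the cutoff is written as an explicit `pd.Timestamp` rather than a string.

```python
import pandas as pd

# Toy tags with invented Unix timestamps (in seconds).
toy_tags = pd.DataFrame({
    "tag": ["a", "b", "c"],
    "timestamp": [1_425_168_000, 1_000_000_000, 1_420_070_400],
})

# Convert seconds since the epoch into datetime values.
toy_tags["parsed_time"] = pd.to_datetime(toy_tags["timestamp"], unit="s")

# Keep rows after the cutoff, then sort the full table chronologically.
recent = toy_tags[toy_tags["parsed_time"] > pd.Timestamp("2015-02-01")]
ordered = toy_tags.sort_values(by="parsed_time", ascending=True)

print(recent)
print(ordered)
```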

## Analyzing Movie Ratings by Year

In [None]:
average_rating = ratings[['movieId','rating']].groupby('movieId', as_index=False).mean()
average_rating.tail()

In [None]:
joined = movies.merge(average_rating, on='movieId', how='inner')
joined.head()

In [None]:
yearly_average = joined[['year','rating']].groupby('year', as_index=False).mean()
yearly_average[-20:]

In [None]:
yearly_average[-20:].plot(x='year', y='rating', figsize=(10,5), grid=True)

In [None]:
yearly_average[yearly_average['year'].str.contains('2009')]

In [None]:
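# Titles whose year could not be extracted have NaN in 'year'; na=False treats them as non-matches
err_movies = movies['year'].str.contains('2009–', na=False)
err_movies.any()

In [None]:
movies[err_movies]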

In [None]:
ratings.head()

In [None]:
ratings[ratings['movieId'] == 107434]
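
One possible way to keep malformed years out of the yearly averages is to convert the column to numbers and discard what cannot be parsed. This is a sketch on invented data, not the notebook's method. The column name `year_num` is hypothetical.

```python
import pandas as pd

# Toy joined table (invented values); one row has a malformed year.
toy_joined = pd.DataFrame({
    "year": ["1995", "1995", "2009", "2009– ", "2009"],
    "rating": [4.0, 3.0, 2.5, 5.0, 3.5],
})

# Non-numeric years become NaN instead of raising an error.
toy_joined["year_num"] = pd.to_numeric(toy_joined["year"], errors="coerce")

# Average rating per valid year.
yearly = (
    toy_joined.dropna(subset=["year_num"])
    .groupby("year_num", as_index=False)["rating"]
    .mean()
)

print(yearly)
```

A numeric year column also gives the line plot a true numeric x-axis rather than string categories.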