# Unit 1: Exploratory Data Analysis on the MovieLens 100k Dataset

The [MovieLens](https://grouplens.org/datasets/movielens/) datasets are to recommender systems practitioners and researchers what MNIST is to the computer vision community. They are not the only public datasets used in the RecSys community, but they are the most widely used. Other options include:
* [Million Song Dataset](http://millionsongdataset.com/)
* [Amazon product review dataset](https://nijianmo.github.io/amazon/index.html)
* [Criteo datasets](https://labs.criteo.com/category/dataset/)
* [BookCrossings](http://www2.informatik.uni-freiburg.de/~cziegler/BX/) and many more

On _kdnuggets_ you can find a [simple overview](https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html) of some of them.

MovieLens comes in several sizes that differ in the number of ratings, users, and items. Take a look at the GroupLens website and explore them yourself.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
from recsys_training.data import genres

In [None]:
ml100k_ratings_filepath = '../../data/raw/ml-100k/u.data'
ml100k_item_filepath = '../../data/raw/ml-100k/u.item'
ml100k_user_filepath = '../../data/raw/ml-100k/u.user'

## Load Data

In [None]:
ratings = pd.read_csv(ml100k_ratings_filepath,
                      sep='\t',
                      header=None,
                      names=['user', 'item', 'rating', 'timestamp'],
                      engine='python')

In [None]:
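# u.item is not UTF-8 encoded; reading it as latin-1 avoids a UnicodeDecodeError
items = pd.read_csv(ml100k_item_filepath, sep='|', header=None,
                    names=['item', 'title', 'release', 'video_release', 'imdb_url']+genres,
                    engine='python', encoding='latin-1')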

In [None]:
users = pd.read_csv(ml100k_user_filepath, sep='|', header=None,
                    names=['user', 'age', 'gender', 'occupation', 'zip'])

## Data Exploration

In this unit, we want to get a better picture of the data we will use for making recommendations in the upcoming units. Let's look at some statistics to become familiar with the data that the algorithms will work on.

![](../parrot.png)

**Task:**
Let's find out the following:

* number of users
* number of items
* rating distribution
* user / item mean ratings
* popularity skewness
    * user rating count distribution
    * item rating count distribution
* time
* sparsity
* user / item features
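
For the `time` item in the list, the `timestamp` column of `u.data` holds Unix seconds according to the dataset README. A minimal sketch of how it could be explored follows; the toy rows and the names `toy_ratings`, `first_rating`, `last_rating`, and `ratings_per_month` are made up for illustration, and the same steps should carry over to the `ratings` frame loaded above.

```python
import pandas as pd

# Toy rows in the u.data layout; the values are illustrative, not real data.
toy_ratings = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'item': [10, 20, 10, 30],
    'rating': [5, 3, 4, 2],
    'timestamp': [874965758, 876893171, 878542960, 891717742],
})

# Interpret the integer timestamps as seconds since the Unix epoch (UTC).
toy_ratings['datetime'] = pd.to_datetime(toy_ratings['timestamp'], unit='s')

# Time span covered by the ratings.
first_rating = toy_ratings['datetime'].min()
last_rating = toy_ratings['datetime'].max()

# Number of ratings per calendar month.
ratings_per_month = toy_ratings.groupby(
    toy_ratings['datetime'].dt.to_period('M')
).size()
print(first_rating, last_rating)
print(ratings_per_month)
```

Grouping by month is only one possible resolution; days or weeks may be more informative depending on how long the collection period turns out to be.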

### number of users

In [None]:
n_users = ratings['user'].unique().shape[0]
n_users

In [None]:
ratings['user'].unique().min()

In [None]:
ratings['user'].unique().max()

### number of items

In [None]:
n_items = ratings['item'].unique().shape[0]
n_items

In [None]:
ratings['item'].unique().min()

In [None]:
ratings['item'].unique().max()

### rating distribution

In [None]:
ratings['rating'].value_counts().sort_index()

In [None]:
ratings['rating'].value_counts(normalize=True).sort_index()

In [None]:
# compute the relative rating frequencies once and reuse them for both axes
rating_shares = ratings['rating'].value_counts(normalize=True).sort_index()
sns.barplot(x=rating_shares.index, y=rating_shares.values)

In [None]:
ratings['rating'].describe()

### user rating count distribution

In [None]:
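# value_counts sorts from most to least active user, so the cumulative sum
# gives the share of all ratings contributed by the top-k users
quantiles = ratings['user'].value_counts(normalize=True).cumsum()

# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=(np.arange(n_users) + 1) / n_users,
             y=quantiles.values)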

### item rating count distribution

In [None]:
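# value_counts sorts from most to least rated item, so the cumulative sum
# gives the share of all ratings received by the top-k items
quantiles = ratings['item'].value_counts(normalize=True).cumsum()

# recent seaborn versions require x and y as keyword arguments
sns.lineplot(x=(np.arange(n_items) + 1) / n_items,
             y=quantiles.values)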
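
The cumulative curves can also be condensed into a single number, for example the share of all ratings that falls on the most-rated 20% of items. The sketch below is one way to do that under stated assumptions: `top_share` is a hypothetical helper, not part of `recsys_training`, and the toy series stands in for `ratings['item']`.

```python
import numpy as np
import pandas as pd


def top_share(ids: pd.Series, top_fraction: float = 0.2) -> float:
    """Share of all ratings that falls on the most frequent `top_fraction` of ids."""
    counts = ids.value_counts()  # sorted from most to least frequent
    n_top = max(1, int(np.ceil(top_fraction * len(counts))))
    return float(counts.iloc[:n_top].sum() / counts.sum())


# Toy data: 11 ratings spread over 5 items, item 1 being the most popular.
toy_items = pd.Series([1] * 6 + [2] * 2 + [3, 4, 5])

share = top_share(toy_items, top_fraction=0.2)
print(f"top 20% of items receive {share:.1%} of all ratings")
```

The same helper should work for `ratings['user']` to summarize how much of the data comes from the most active users.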

### user mean ratings

In [None]:
user_mean_ratings = ratings[['user', 'rating']].groupby('user').mean().reset_index(drop=True)

In [None]:
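# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(user_mean_ratings['rating'], kde=True)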

### item mean ratings

In [None]:
item_mean_ratings = ratings[['item', 'rating']].groupby('item').mean().reset_index(drop=True)

In [None]:
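# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(item_mean_ratings['rating'], kde=True)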
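
A mean computed from a single rating says little about an item, so it can help to look at means and rating counts together. The following is a minimal sketch on toy data; `item_stats`, `min_ratings`, and the threshold of 2 are assumptions chosen for illustration, not values taken from this unit.

```python
import pandas as pd

# Toy ratings: item 1 has three ratings, item 2 one, item 3 two.
toy_ratings = pd.DataFrame({
    'item': [1, 1, 1, 2, 3, 3],
    'rating': [4, 5, 3, 5, 2, 4],
})

# Compute the mean rating and the number of ratings per item in one pass.
item_stats = toy_ratings.groupby('item')['rating'].agg(['mean', 'count'])

# Keep only items with enough ratings for the mean to be informative.
min_ratings = 2  # hypothetical threshold
reliable = item_stats[item_stats['count'] >= min_ratings]
print(reliable)
```

Plotting the distribution of `reliable['mean']` instead of all means may give a less noisy picture, at the cost of ignoring rarely rated items.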

### sparsity

In [None]:
n_users

In [None]:
n_items

In [None]:
# count the uniquely observed ratings
observed_ratings = ratings[['user', 'item']].drop_duplicates().shape[0]
observed_ratings

In [None]:
potential_ratings = n_users * n_items
potential_ratings

In [None]:
density = observed_ratings / potential_ratings

In [None]:
density

In [None]:
sparsity = 1 - density

In [None]:
sparsity
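
The steps above can be wrapped into one small function and checked on a toy example where the result is easy to verify by hand. This is a sketch, and `rating_sparsity` is a hypothetical helper name. Note that, as in the cells above, users and items are counted from the ratings themselves, so an item that nobody rated would not enter the calculation.

```python
import pandas as pd


def rating_sparsity(df: pd.DataFrame, user_col: str = 'user', item_col: str = 'item') -> float:
    """Fraction of user-item pairs without an observed rating."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    # Count each (user, item) pair once, even if it was rated repeatedly.
    observed = len(df[[user_col, item_col]].drop_duplicates())
    return 1 - observed / (n_users * n_items)


# Toy data: 3 users x 4 items = 12 possible pairs, 6 of them observed.
toy = pd.DataFrame({
    'user': [1, 1, 2, 2, 3, 3],
    'item': [1, 2, 2, 3, 3, 4],
})

toy_sparsity = rating_sparsity(toy)
print(toy_sparsity)
```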