# Unit 1: Exploratory Data Analysis on the MovieLens 100k Dataset

The [MovieLens](https://grouplens.org/datasets/movielens/) datasets are to recommender systems practitioners and researchers what MNIST is to the computer vision community. They are not the only public datasets used in the RecSys community, but they are the most popular. Others include the Million Song Dataset, the Amazon product review datasets, the Criteo dataset, and Book-Crossing.

On _kdnuggets_ you can find a [simple overview](https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html) of some of them.

The MovieLens datasets come in different sizes, determined by the number of movie ratings collected from a group of users. Take a look at the GroupLens website and explore them.

In [18]:
import numpy as np
import pandas as pd
import seaborn as sns

In [13]:
from recsys_training.data import Dataset, genres

In [14]:
ml100k_ratings_filepath = '../data/raw/ml-100k/u.data'
ml100k_item_filepath = '../data/raw/ml-100k/u.item'
ml100k_user_filepath = '../data/raw/ml-100k/u.user'

## Load Data

In [15]:
data = Dataset(ml100k_ratings_filepath)
data.rating_split(seed=42)
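
The `Dataset` class comes from the course's own `recsys_training` package, and its internals are not shown in this unit. To inspect the raw ratings directly, the following is a minimal sketch with plain pandas. It assumes that `u.data` is a tab-separated file with four columns (user id, item id, rating, timestamp), as described in the ML-100k README. The helper name `load_raw_ratings`, the column names, and the toy rows are illustrative choices, not part of the original notebook.

```python
import io

import pandas as pd

# Column names are our own choice; u.data itself has no header row.
RATING_COLUMNS = ['user', 'item', 'rating', 'timestamp']


def load_raw_ratings(source):
    """Read a tab-separated ratings file (path or file-like object)."""
    return pd.read_csv(source, sep='\t', header=None, names=RATING_COLUMNS)


# Toy stand-in for the file content (made-up rows, not real MovieLens data).
# In the notebook you would pass ml100k_ratings_filepath instead.
toy_file = io.StringIO(
    "1\t10\t4\t880000000\n"
    "1\t11\t5\t880000100\n"
    "2\t10\t3\t880000200\n"
)
raw_ratings = load_raw_ratings(toy_file)
print(raw_ratings.head())
```

Calling `load_raw_ratings(ml100k_ratings_filepath)` should give one row per rating, which is a convenient starting point for the statistics explored later in this unit.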

In [16]:
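# u.item is not UTF-8 encoded; latin-1 avoids decoding errors on accented titles
items = pd.read_csv(ml100k_item_filepath, sep='|', header=None,
                    names=['item', 'title', 'release', 'video_release', 'imdb_url']+genres,
                    engine='python', encoding='latin-1')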

In [17]:
users = pd.read_csv(ml100k_user_filepath, sep='|', header=None,
                    names=['user', 'age', 'gender', 'occupation', 'zip'])

## Data Exploration

In this unit, we want to get a better picture of the data we will use for making recommendations in the upcoming units. To that end, let's look at some statistics to get familiar with the data before turning to the algorithms.

**TODO:**
Let's find out the following:

* number of users
* number of items
* rating distribution
* user / item mean ratings
* popularity skewness
    * user rating count distribution
    * item rating count distribution
* rating distribution over time
* sparsity
* user / item features
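
As a starting point for the list above, here is a minimal sketch of how several of these statistics could be computed with pandas. It assumes a ratings DataFrame with the columns `user`, `item`, and `rating` (one row per rating, each user-item pair rated at most once). The function name `summarize_ratings` and the toy data are hypothetical and only serve as illustration; this is not necessarily how the `Dataset` class exposes the data.

```python
import pandas as pd


def summarize_ratings(ratings):
    """Compute basic EDA statistics for a user-item ratings table."""
    n_users = ratings['user'].nunique()
    n_items = ratings['item'].nunique()
    n_ratings = len(ratings)
    # Sparsity: share of user-item pairs without a rating
    # (assumes each user rated each item at most once).
    sparsity = 1 - n_ratings / (n_users * n_items)
    return {
        'n_users': n_users,
        'n_items': n_items,
        'n_ratings': n_ratings,
        'sparsity': sparsity,
        # How often each rating value occurs
        'rating_counts': ratings['rating'].value_counts().sort_index(),
        # Mean rating per user and per item
        'user_mean': ratings.groupby('user')['rating'].mean(),
        'item_mean': ratings.groupby('item')['rating'].mean(),
        # Number of ratings per user / item, most active or popular first
        'user_rating_counts': ratings.groupby('user').size().sort_values(ascending=False),
        'item_rating_counts': ratings.groupby('item').size().sort_values(ascending=False),
    }


# Toy data (made up) to show the expected input shape
toy = pd.DataFrame({
    'user': [1, 1, 2, 3],
    'item': [10, 11, 10, 12],
    'rating': [4, 5, 3, 4],
})
summary = summarize_ratings(toy)
print(summary['n_users'], summary['n_items'], round(summary['sparsity'], 3))
```

The per-user and per-item rating counts are the basis for judging popularity skewness: plotting them, for example with `sns.histplot`, shows how unevenly ratings are spread across users and items.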