# Unit 1: Exploratory Data Analysis on the MovieLens 100k Dataset

### The [MovieLens](https://grouplens.org/datasets/movielens/) datasets are to recommender systems practitioners and researchers what MNIST is to the computer vision community. They are not the only public datasets used in the RecSys community, but they are arguably the most popular. Others include the Million Song Dataset, the Amazon product review datasets, the Criteo dataset, and Book-Crossing.

Here you can find a simple overview of some of them (link to the KDnuggets article).

The MovieLens datasets come in different sizes, named after the number of movie ratings they contain (100k, 1M, and so on). Take a look at the GroupLens website and explore them.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
import math

import numpy as np
import scipy as sp
import sklearn

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import seaborn as sns
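# sns.set() resets the plotting context to its default, so context, style,
# and figure size are passed in a single call
sns.set(context="poster", style="whitegrid", rc={'figure.figsize': (16, 9.)})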

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

In [None]:
from recsys_training import *

In [2]:
ml100k_ratings_filepath = '../data/raw/ml-100k/u.data'
ml100k_item_filepath = '../data/raw/ml-100k/u.item'
ml100k_user_filepath = '../data/raw/ml-100k/u.user'

## Load Data

In [7]:
data = Dataset(ml100k_ratings_filepath)
data.rating_split(seed=42)

In [8]:
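# Note: `genres` (the list of genre column names) is not defined in this notebook;
# it is assumed to be provided by the `recsys_training` star import above
items = pd.read_csv(ml100k_item_filepath, sep='|', header=None,
                    names=['item', 'title', 'release', 'video_release', 'imdb_url']+genres,
                    engine='python')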

In [9]:
users = pd.read_csv(ml100k_user_filepath, sep='|', header=None,
                    names=['user', 'age', 'gender', 'occupation', 'zip'])

## Data Exploration

In this unit, we want to get a better picture of the data we will use for making recommendations in the upcoming units. To do so, we look at some statistics to become familiar with the data before applying any algorithms.

Let's find out the following:

* number of users
* number of items
* user rating distribution
* item rating distribution
* user / item mean ratings
* popularity skewness
* sparsity
* user / item features
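
As a rough orientation, most of the quantities listed above can be computed with a few pandas group-by operations. The following is a minimal sketch on a tiny, made-up ratings table, not the notebook's actual solution: the `ratings` frame and its column names (`user`, `item`, `rating`) are assumptions chosen for illustration, and the "share of the single most-rated item" is just one simple way to look at popularity skewness.

```python
import pandas as pd

# Toy ratings in a (user, item, rating) layout; the values are made up for illustration
ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3],
    "item":   [10, 20, 30, 10, 20, 10],
    "rating": [5, 3, 4, 4, 2, 1],
})

# Number of distinct users and items
n_users = ratings["user"].nunique()
n_items = ratings["item"].nunique()

# Rating distributions: how many ratings each user gave / each item received
ratings_per_user = ratings.groupby("user")["rating"].count()
ratings_per_item = ratings.groupby("item")["rating"].count()

# Mean rating per user and per item
user_mean = ratings.groupby("user")["rating"].mean()
item_mean = ratings.groupby("item")["rating"].mean()

# Popularity skewness (one simple view): share of all ratings
# that falls on the single most-rated item
top_share = ratings_per_item.sort_values(ascending=False).head(1).sum() / len(ratings)

# Sparsity: fraction of the user-item matrix that has no rating
sparsity = 1 - len(ratings) / (n_users * n_items)

print(f"{n_users} users, {n_items} items, sparsity {sparsity:.2%}")
```

To run the same computations on the real data, `ratings` could be read from `u.data`, which the MovieLens 100k README describes as a tab-separated list of user id, item id, rating, and timestamp, for example with `pd.read_csv(ml100k_ratings_filepath, sep='\t', header=None, names=['user', 'item', 'rating', 'timestamp'])`. The `Dataset` object loaded above may already expose the ratings in a similar form.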