# CSC6711 Project 2 - Exploring Rec Sys Data

* **Author**: Jacob Buysse

This notebook is an analysis of four datasets for recommendation systems (all files are located in the `datasets` subdirectory):
* MovieLens - `movielens_25m.feather` (Movies)
* Netflix Prize - `netflix_prize.feather` (Movies and TV Shows)
* Yahoo! Music R2 - `yahoo_r2_songs.subsampled.feather` (Songs)
* BoardGameGeek - `boardgamegeek.feather` (Board Games)

We will be using the following libraries:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

Let us configure matplotlib for readable labels, high resolution, and automatic layout.

In [2]:
matplotlib.rc('axes', labelsize=16)
matplotlib.rc('figure', dpi=150, autolayout=True)

## MovieLens Analysis

Let us inspect the MovieLens dataset.

In [4]:
df1 = pd.read_feather('./datasets/movielens_25m.feather')
df1

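| | item_id | user_id | rating |
|---|---|---|---|
| 0 | 296 | 1 | 5.0 |
| 1 | 306 | 1 | 3.5 |
| 2 | 307 | 1 | 5.0 |
| 3 | 665 | 1 | 5.0 |
| 4 | 899 | 1 | 3.5 |
| ... | ... | ... | ... |
| 25000090 | 50872 | 162541 | 4.5 |
| 25000091 | 55768 | 162541 | 2.5 |
| 25000092 | 56176 | 162541 | 2.0 |
| 25000093 | 58559 | 162541 | 4.0 |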


In [6]:
df1.describe()

| | item_id | user_id | rating |
|---|---|---|---|
| count | 24890580.0 | 24890580.0 | 24890580.0 |
| mean | 20835.73 | 81203.44 | 3.536225 |
| std | 38289.46 | 46806.52 | 1.059729 |
| min | 1.0 | 1.0 | 0.5 |
| 25% | 1196.0 | 40510.0 | 3.0 |
| 50% | 2918.0 | 80948.0 | 3.5 |
| 75% | 8446.0 | 121592.0 | 4.0 |
| max | 208737.0 | 162541.0 | 5.0 |


In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24890583 entries, 0 to 25000094
Data columns (total 3 columns):
 #   Column   Dtype  
---  ------   -----  
 0   item_id  int64  
 1   user_id  int64  
 2   rating   float64
dtypes: float64(1), int64(2)
memory usage: 759.6 MB


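So we have about 24.9M ratings, each keyed by an `item_id`/`user_id` pair. Ratings range from 0.5 to 5.0, which suggests a half-star rating scale on which a zero rating is not possible. The `count` row of `describe()` is the same for all three columns, so every record appears to have a rating (no nulls).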
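The assumptions about nulls and the rating scale can be checked directly rather than inferred from `describe()`. The following is a minimal sketch, not part of the original notebook: `summarize_ratings` is a hypothetical helper, and the toy frame is built from the first five rows shown above so the snippet runs without the dataset.

```python
import pandas as pd


def summarize_ratings(df: pd.DataFrame) -> dict:
    """Return null counts, the observed rating range, and a half-star check.

    Hypothetical helper for illustration; assumes a 'rating' column.
    """
    return {
        # Number of missing values per column
        "null_counts": df.isna().sum().to_dict(),
        "min_rating": float(df["rating"].min()),
        "max_rating": float(df["rating"].max()),
        # True when every rating is a multiple of 0.5
        "half_star_steps": bool(((df["rating"] * 2) % 1 == 0).all()),
    }


# Toy frame copied from the first five rows of the output above
sample = pd.DataFrame(
    {
        "item_id": [296, 306, 307, 665, 899],
        "user_id": [1, 1, 1, 1, 1],
        "rating": [5.0, 3.5, 5.0, 5.0, 3.5],
    }
)

summary = summarize_ratings(sample)
print(summary)
```

Calling `summarize_ratings(df1)` on the full frame should confirm or refute the no-nulls and half-star assumptions for the real data.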

Let us see how many distinct items and users we have.

In [10]:
print(f"{df1.item_id.nunique()} distinct item IDs")
print(f"{df1.user_id.nunique()} distinct user IDs")

24330 distinct item IDs
162541 distinct user IDs


So there are around 24k different movies and 162k different users. A fully populated rating matrix would have about 3.95 billion entries (24,330 × 162,541), so with only 24.9M ratings (roughly 0.6% of the cells) this is a sparse rating matrix.
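To make the sparsity claim concrete, the density of the rating matrix can be computed from the counts reported above. This is a small sketch using those printed numbers as constants; the variable names are illustrative and do not come from the original notebook.

```python
# Counts taken from the df1.info() and nunique() output above
n_ratings = 24_890_583
n_items = 24_330
n_users = 162_541

# Number of cells in a fully populated item-by-user matrix
possible_ratings = n_items * n_users

# Fraction of cells that actually hold a rating
density = n_ratings / possible_ratings

print(f"{possible_ratings:,} possible ratings")
print(f"density = {density:.4%}, sparsity = {1 - density:.4%}")
```

With the real frame loaded, the same quantity could be computed as `len(df1) / (df1.item_id.nunique() * df1.user_id.nunique())`.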