# CSC6711 Project 3 - Non-Personalized Recommendations

* **Author**: Jacob Buysse

This notebook analyzes the non-personalized predictions for the four datasets from Project 2 and how well they predict individual users' actual ratings.  The files are located in the `datasets` subdirectory:
* MovieLens - `movielens_25m.feather` (Movies)
* Netflix Prize - `netflix_prize.feather` (Movies and TV Shows)
* Yahoo! Music R2 - `yahoo_r2_songs.subsampled.feather` (Songs)
* BoardGameGeek - `boardgamegeek.feather` (Board Games)

We will be using the following libraries:

In [37]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GroupShuffleSplit

Let us configure matplotlib for readable labels, high resolution, and automatic layout.

In [2]:
matplotlib.rc('axes', labelsize=16)
matplotlib.rc('figure', dpi=150, autolayout=True)

## Datasets

Let us load the datasets using Pandas.  We know from Project 2 that the contents are structured identically:

* `df1` - MovieLens
* `df2` - Netflix
* `df3` - Yahoo Music
* `df4` - BoardGameGeek

In each file, we have `item_id`, `user_id`, and `rating`.

In [3]:
df1 = pd.read_feather('./datasets/movielens_25m.feather')
df2 = pd.read_feather('./datasets/netflix_prize.feather')
df3 = pd.read_feather('./datasets/yahoo_r2_songs.subsampled.feather')
df4 = pd.read_feather('./datasets/boardgamegeek.feather')

We need to tweak the data for BoardGameGeek.  Its `user_id` column holds strings, so we will encode it as integers using `LabelEncoder`.

In [39]:
# Map each distinct user_id string to an integer code.
user_id4_encoder = LabelEncoder()
df4['user_id'] = user_id4_encoder.fit_transform(df4.user_id)
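
To illustrate what the encoding does, here is a minimal sketch on a toy frame. The user names and ratings are made up for illustration and are not taken from the BoardGameGeek data; `user_code` is a hypothetical column name. `LabelEncoder` assigns integer codes in sorted order of the distinct values and can map the codes back to the original strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values, not from the real dataset).
toy = pd.DataFrame({
    "user_id": ["carol", "alice", "bob", "alice"],
    "item_id": [10, 10, 20, 30],
    "rating": [7.0, 8.5, 6.0, 9.0],
})

encoder = LabelEncoder()
# Distinct strings are sorted, then numbered 0..n-1.
toy["user_code"] = encoder.fit_transform(toy["user_id"])

# The fitted encoder keeps the mapping, so codes can be decoded later.
recovered = encoder.inverse_transform(toy["user_code"])

print(toy[["user_id", "user_code"]])
print(list(encoder.classes_))
```

Keeping the fitted encoder around (as `user_id4_encoder` does above) means the original user names remain recoverable if they are needed later.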

Next, let us split the datasets into 75/25 train/test subsets, grouped by user so that each user's ratings fall entirely within one subset.  We will define a helper function `TrainTestSplit` to do that.

In [43]:
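def TrainTestSplit(df):
    """Split df 75/25 by user and print coverage statistics for each subset."""
    # Grouping by user_id keeps all of a user's ratings in the same subset.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
    train_index, test_index = next(gss.split(X=df, y=df.rating, groups=df.user_id))
    train_df = df.iloc[train_index]
    test_df = df.iloc[test_index]
    total_count = train_df.shape[0] + test_df.shape[0]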
    item_count = df.item_id.nunique()
    user_count = df.user_id.nunique()
    train_pct_total = train_df.shape[0] / total_count
    test_pct_total = test_df.shape[0] / total_count
    train_pct_item = train_df.item_id.nunique() / item_count
    test_pct_item = test_df.item_id.nunique() / item_count
    train_pct_user = train_df.user_id.nunique() / user_count
    test_pct_user = test_df.user_id.nunique() / user_count
    print(f"Train {train_df.shape} ({train_pct_total:.0%} total, {train_pct_item:.0%} items, {train_pct_user:.0%} users) " +
          f"Test {test_df.shape} ({test_pct_total:.0%} total, {test_pct_item:.0%} items, {test_pct_user:.0%} users)")
    return train_df, test_df

train_df1, test_df1 = TrainTestSplit(df1)
train_df2, test_df2 = TrainTestSplit(df2)
train_df3, test_df3 = TrainTestSplit(df3)
train_df4, test_df4 = TrainTestSplit(df4)

Train (18706943, 3) (75% total, 100% items, 75% users) Test (6183640, 3) (25% total, 99% items, 25% users)
Train (38278492, 3) (75% total, 100% items, 75% users) Test (12752863, 3) (25% total, 100% items, 25% users)
Train (5201846, 3) (75% total, 100% items, 75% users) Test (1735429, 3) (25% total, 100% items, 25% users)
Train (14159851, 3) (75% total, 100% items, 75% users) Test (4782364, 3) (25% total, 100% items, 25% users)


The split is 75/25 by user, as expected from grouping on `user_id` with `GroupShuffleSplit`, and the total record counts come out at roughly 75/25 as well.  Item coverage is also high: after rounding to whole percentages, every training set reports 100% of items, and every testing set does too except MovieLens, whose testing set reports 99%.
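
As a sanity check on the grouping behavior, here is a minimal sketch on a toy frame with four users. The values are made up for illustration, and the split parameters simply mirror those used in `TrainTestSplit`. It shows that no user appears in both subsets and that the 75/25 proportion applies to the number of users rather than the number of rows.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy ratings (hypothetical): four users with different numbers of ratings.
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "item_id": [1, 2, 3, 1, 4, 2, 3, 4, 1, 2, 3, 4],
    "rating":  [4, 5, 3, 2, 4, 5, 5, 1, 3, 4, 2, 5],
})

gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
train_idx, test_idx = next(gss.split(toy, groups=toy["user_id"]))

train_users = set(toy["user_id"].iloc[train_idx])
test_users = set(toy["user_id"].iloc[test_idx])
shared_users = train_users & test_users

# Three of the four users land in train, one in test, with no overlap.
print(len(train_users), len(test_users), shared_users)
```

Because the proportion is taken over users, the row-level split only approximates 75/25 and depends on how many ratings the held-out users have.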