# CSC6711 Project 4 - Collaborative Filtering with kNN
* **Author:** Jacob Buysse

This notebook analyzes rating predictions made by user-based collaborative filtering with k-nearest neighbors (kNN).
The data files are located in the `datasets` subdirectory:

* MovieLens - `movielens_25m.feather` (Movies)
* Netflix Prize - `netflix_prize.feather` (Movies and TV Shows)
* Yahoo! Music R2 - `yahoo_r2_songs.subsampled.feather` (Songs)
* BoardGameGeek - `boardgamegeek.feather` (Board Games)

We will be using the following libraries:

In [33]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import GroupShuffleSplit, train_test_split

Let us configure matplotlib for readable labels, high resolution, and automatic layout.

In [2]:
matplotlib.rc('axes', labelsize=16)
matplotlib.rc('figure', dpi=150, autolayout=True)

## Datasets

Let us load the four datasets. We will then clean, filter, preprocess, and split them before moving on to the kNN portion.

In [45]:
datasets = [
    { 'Title': 'MovieLens', 'File': 'movielens_25m' },
    { 'Title': 'Netflix', 'File': 'netflix_prize' },
    { 'Title': 'Yahoo! Music', 'File': 'yahoo_r2_songs.subsampled' },
    { 'Title': 'BoardGameGeek', 'File': 'boardgamegeek' }
]
for dataset in datasets:
    # Load the file
    print(f"Loading {dataset['Title']}...")
    df = pd.read_feather(f"./datasets/{dataset['File']}.feather")

    # Add a rating_bin (floor of the rating) for graphing bins later
    df['rating_bin'] = np.floor(df.rating)

    # Use a label encoder to convert user_id to an integer when it is a string (object dtype)
    # NOTE: This is needed for the BoardGameGeek dataset
    if df.user_id.dtype == object:
        print('Encoding user_id: string -> int64')
        user_id_encoder = LabelEncoder()
        user_id_encoder.fit(df.user_id)
        dataset['user_id_encoder'] = user_id_encoder
        df['user_id'] = user_id_encoder.transform(df.user_id)

    # Store the df in the dataset dictionary
    dataset['df'] = df
    print(f"Shape {df.shape}")

Loading MovieLens...
Shape (24890583, 4)
Loading Netflix...
Shape (51031355, 4)
Loading Yahoo! Music...
Shape (6937275, 4)
Loading BoardGameGeek...
Encoding user_id: string -> int64
Shape (18942215, 4)


Because we split users into training and testing sets and then split each testing user's ratings into seen and unseen subsets, we must exclude every user with only a single rating. Such a user cannot contribute to both the seen subset (used to match neighbors) and the unseen subset (used to evaluate predictions). Let us filter those users out now.

In [46]:
for dataset in datasets:
    print(f"Filtering out users with only one rating for {dataset['Title']}")
    df = dataset['df']
    counts_df = df.groupby('user_id')[['rating']].count()
    merged_df = df.merge(counts_df, on='user_id', suffixes=['', '_count'])
    filtered_df = merged_df[merged_df.rating_count > 1]
    dataset['df'] = filtered_df.copy()
    print(f"New Shape: {filtered_df.shape}")

Filtering out users with only one rating for MovieLens
New Shape: (24890583, 5)
Filtering out users with only one rating for Netflix
New Shape: (51027153, 5)
Filtering out users with only one rating for Yahoo! Music
New Shape: (6532945, 5)
Filtering out users with only one rating for BoardGameGeek
New Shape: (18862919, 5)
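
The same filter can be expressed without the intermediate merge. The following is a minimal sketch on made-up data (the `toy_df` frame and its values are hypothetical, not taken from any of the four datasets), using `groupby(...).transform('count')` to broadcast each user's rating count back onto that user's rows.

```python
import pandas as pd

# Hypothetical toy ratings: user 1 has three ratings, user 2 has one, user 3 has two.
toy_df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 3, 3],
    'item_id': [10, 11, 12, 10, 11, 12],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 1.0],
})

# Broadcast each user's rating count onto every row belonging to that user.
rating_count = toy_df.groupby('user_id')['rating'].transform('count')

# Keep only the rows of users with more than one rating.
toy_filtered_df = toy_df[rating_count > 1].copy()
print(toy_filtered_df.user_id.unique())
```

Unlike the merge-based cell above, this variant does not add a `rating_count` column to the resulting frame, so the column count would differ from the shapes printed above.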


Let us encode the `item_id` column into a contiguous 0...n-1 range `item_idx` using a LabelEncoder.
This will be used for the columns of the sparse matrices.
Note that this encoding will be shared between the training and testing splits.

In [47]:
for dataset in datasets:
    print(f"Encoding item_id for {dataset['Title']}")
    df = dataset['df']
    item_id_encoder = LabelEncoder()
    item_id_encoder.fit(df.item_id)
    dataset['item_id_encoder'] = item_id_encoder
    n_items = item_id_encoder.classes_.size
    dataset['n_items'] = n_items
    df['item_idx'] = item_id_encoder.transform(df.item_id)
    print(f"Distinct Item Count: {n_items:,}")

Encoding item_id for MovieLens
Distinct Item Count: 24,330
Encoding item_id for Netflix
Distinct Item Count: 9,210
Encoding item_id for Yahoo! Music
Distinct Item Count: 1,368
Encoding item_id for BoardGameGeek
Distinct Item Count: 21,925
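
To make the encoding concrete, here is a small sketch of what `LabelEncoder` does to a handful of made-up item ids (the `toy_*` names and values are hypothetical). The distinct ids are sorted and mapped to 0...n-1, and `inverse_transform` recovers the original ids.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical, non-contiguous item ids with repeats.
toy_item_ids = np.array([500, 42, 42, 9001, 500])

toy_encoder = LabelEncoder()
toy_item_idx = toy_encoder.fit_transform(toy_item_ids)

# classes_ holds the sorted distinct ids; position in classes_ is the encoded index.
print(toy_encoder.classes_)
print(toy_item_idx)

# The mapping is reversible, so predictions can be reported by original item id.
toy_round_trip = toy_encoder.inverse_transform(toy_item_idx)
```

Fitting a single encoder before splitting is what lets a column index refer to the same item in both the training and testing matrices.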


Let us do a 75/25 split for the training/testing datasets, split across the user ids as groups.

In [48]:
def TrainTestSplit(df):
    gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
    train_index, test_index = next(gss.split(X=df, y=df.rating, groups=df.user_id))
    train_df = df.iloc[train_index].copy()
    test_df = df.iloc[test_index].copy()
    total_count = train_df.shape[0] + test_df.shape[0]
    item_count = df.item_id.nunique()
    user_count = df.user_id.nunique()
    train_pct_total = train_df.shape[0] / total_count
    test_pct_total = test_df.shape[0] / total_count
    train_pct_item = train_df.item_id.nunique() / item_count
    test_pct_item = test_df.item_id.nunique() / item_count
    train_pct_user = train_df.user_id.nunique() / user_count
    test_pct_user = test_df.user_id.nunique() / user_count
    print(f"Train {train_df.shape} ({train_pct_total:.0%} total, {train_pct_item:.0%} items, {train_pct_user:.0%} users) " +
          f"Test {test_df.shape} ({test_pct_total:.0%} total, {test_pct_item:.0%} items, {test_pct_user:.0%} users)")
    return train_df, test_df

for dataset in datasets:
    print(f"Splitting training/testing datasets for {dataset['Title']}")
    train_df, test_df = TrainTestSplit(dataset['df'])
    dataset['train_df'] = train_df
    dataset['test_df'] = test_df

Splitting training/testing datasets for MovieLens
Train (18706943, 6) (75% total, 100% items, 75% users) Test (6183640, 6) (25% total, 99% items, 25% users)
Splitting training/testing datasets for Netflix
Train (38292233, 6) (75% total, 100% items, 75% users) Test (12734920, 6) (25% total, 100% items, 25% users)
Splitting training/testing datasets for Yahoo! Music
Train (4901087, 6) (75% total, 100% items, 75% users) Test (1631858, 6) (25% total, 100% items, 25% users)
Splitting training/testing datasets for BoardGameGeek
Train (14119520, 6) (75% total, 100% items, 75% users) Test (4743399, 6) (25% total, 100% items, 25% users)
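
The key property of a group split is that all of a user's ratings land on the same side. The sketch below checks this on made-up data (the `toy_df` frame is hypothetical): 8 users with 3 ratings each, split with the same `GroupShuffleSplit` settings as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 8 users, 3 ratings each.
toy_df = pd.DataFrame({
    'user_id': np.repeat(np.arange(8), 3),
    'rating': np.tile([1.0, 3.0, 5.0], 8),
})

gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
train_index, test_index = next(gss.split(toy_df, groups=toy_df.user_id))

train_users = set(toy_df.user_id.iloc[train_index])
test_users = set(toy_df.user_id.iloc[test_index])

# No user appears on both sides of the split.
print(train_users.isdisjoint(test_users))
print(len(train_users), len(test_users))
```

Note that `train_size=0.75` applies to the number of groups (users), not rows, so the row percentages are only approximately 75/25 when users have different numbers of ratings.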


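Now, for each training/testing data frame, encode the user ids into a contiguous 0...n-1 range `user_idx` using a separate LabelEncoder per split. These indices will be used for the rows of the sparse matrices.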

In [49]:
for dataset in datasets:
    print(f"Encoding user_id for {dataset['Title']}")

    train_df = dataset['train_df']
    train_user_id_encoder = LabelEncoder()
    train_user_id_encoder.fit(train_df.user_id)
    dataset['train_user_id_encoder'] = train_user_id_encoder
    n_train_users = train_user_id_encoder.classes_.size
    dataset['n_train_users'] = n_train_users
    train_df['user_idx'] = train_user_id_encoder.transform(train_df.user_id)
    print(f"Distinct Training Users: {n_train_users:,}")

    test_df = dataset['test_df']
    test_user_id_encoder = LabelEncoder()
    test_user_id_encoder.fit(test_df.user_id)
    dataset['test_user_id_encoder'] = test_user_id_encoder
    n_test_users = test_user_id_encoder.classes_.size
    dataset['n_test_users'] = n_test_users
    test_df['user_idx'] = test_user_id_encoder.transform(test_df.user_id)
    print(f"Distinct Testing Users: {n_test_users:,}")

Encoding user_id for MovieLens
Distinct Training Users: 121,905
Distinct Testing Users: 40,636
Encoding user_id for Netflix
Distinct Training Users: 355,362
Distinct Testing Users: 118,454
Encoding user_id for Yahoo! Music
Distinct Training Users: 638,199
Distinct Testing Users: 212,733
Encoding user_id for BoardGameGeek
Distinct Training Users: 249,059
Distinct Testing Users: 83,020


Now create a sparse matrix for the training set, with one row per training user and one column per item.

In [50]:
for dataset in datasets:
    print(f"Creating sparse training matrix for {dataset['Title']}")
    dataset['train_X'] = csr_matrix(
        (dataset['train_df'].rating, (dataset['train_df'].user_idx, dataset['train_df'].item_idx)),
        shape=(dataset['n_train_users'], dataset['n_items'])
    )

Creating sparse training matrix for MovieLens
Creating sparse training matrix for Netflix
Creating sparse training matrix for Yahoo! Music
Creating sparse training matrix for BoardGameGeek
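
As an illustration of the `(data, (row, col))` constructor used above, the following sketch builds a 3 user by 4 item matrix from four made-up ratings (all `toy_*` names and values are hypothetical).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings as parallel arrays: value, row (user_idx), column (item_idx).
toy_ratings = np.array([4.0, 3.0, 5.0, 2.0])
toy_user_idx = np.array([0, 0, 1, 2])
toy_item_idx = np.array([0, 2, 1, 2])

# An explicit shape keeps columns for items that no user in this split has rated.
toy_X = csr_matrix((toy_ratings, (toy_user_idx, toy_item_idx)), shape=(3, 4))

print(toy_X.toarray())
print(toy_X.nnz)
```

Unrated cells are stored implicitly and read back as 0, so a missing rating is indistinguishable from a rating of 0 in this representation. The constructor also sums duplicate `(row, col)` entries, which would matter if a user had rated the same item more than once.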


Now we need to split the testing dataset into seen and unseen subsets (75/25), stratified by user so that each testing user's ratings are divided between the two.

In [52]:
for dataset in datasets:
    print(f"Creating seen/unseen testing split {dataset['Title']}")
    test_df = dataset['test_df']
    seen_df, unseen_df = train_test_split(test_df, train_size=0.75, random_state=777, stratify=test_df.user_idx)
    dataset['seen_df'] = seen_df
    dataset['unseen_df'] = unseen_df
    print(f"Seen {seen_df.shape}, Unseen {unseen_df.shape}")

Creating seen/unseen testing split MovieLens
Seen (4637730, 7), Unseen (1545910, 7)
Creating seen/unseen testing split Netflix
Seen (9551190, 7), Unseen (3183730, 7)
Creating seen/unseen testing split Yahoo! Music
Seen (1223893, 7), Unseen (407965, 7)
Creating seen/unseen testing split BoardGameGeek
Seen (3557549, 7), Unseen (1185850, 7)
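
The effect of stratifying on the user index can be seen on a small made-up frame (the `toy_*` names are hypothetical): 3 users with 4 ratings each, split with the same `train_test_split` arguments as above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy testing frame: 3 users, 4 ratings each.
toy_test_df = pd.DataFrame({
    'user_idx': np.repeat([0, 1, 2], 4),
    'item_idx': np.tile([0, 1, 2, 3], 3),
    'rating': np.arange(12, dtype=float) % 5 + 1,
})

toy_seen_df, toy_unseen_df = train_test_split(
    toy_test_df, train_size=0.75, random_state=777, stratify=toy_test_df.user_idx
)

# Count how many of each user's ratings ended up on each side.
seen_per_user = toy_seen_df.groupby('user_idx').size()
unseen_per_user = toy_unseen_df.groupby('user_idx').size()
print(seen_per_user.tolist(), unseen_per_user.tolist())
```

With 4 ratings per user the 75/25 division is exact. On real data the per-user allocation is rounded, so users with only two or three ratings may not be divided exactly 75/25, and counting rows per user as above is a cheap sanity check.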


Now we can create the sparse matrix for the testing seen dataset.

In [53]:
for dataset in datasets:
    print(f"Creating sparse testing/seen matrix for {dataset['Title']}")
    dataset['test_X'] = csr_matrix(
        (dataset['seen_df'].rating, (dataset['seen_df'].user_idx, dataset['seen_df'].item_idx)),
        shape=(dataset['n_test_users'], dataset['n_items'])
    )

Creating sparse testing/seen matrix for MovieLens
Creating sparse testing/seen matrix for Netflix
Creating sparse testing/seen matrix for Yahoo! Music
Creating sparse testing/seen matrix for BoardGameGeek
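
For orientation, here is one way the two matrices could be used together: fit `NearestNeighbors` on the training rows and query it with the testing/seen rows. This is only a sketch under assumptions. The cosine metric, brute-force search, `n_neighbors=2`, and the `toy_*` matrices are illustrative choices, not settings taken from this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Hypothetical training matrix: 3 users x 4 items (0 means unrated).
toy_train_X = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 5.0],
    [4.0, 5.0, 1.0, 0.0],
]))

# Hypothetical testing/seen matrix: 1 user over the same 4 item columns.
toy_test_X = csr_matrix(np.array([
    [5.0, 5.0, 0.0, 0.0],
]))

# Assumed settings: cosine distance with brute-force search, which accepts sparse input.
knn = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute')
knn.fit(toy_train_X)

# For each testing user, find the closest training users by their seen ratings.
distances, neighbor_idx = knn.kneighbors(toy_test_X)
print(neighbor_idx)
```

The returned row indices refer to training `user_idx` values, which is why both matrices must share the same item columns.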


## TODO: Left off here