# CSC6711 Project 4 - Collaborative Filtering with kNN
* **Author:** Jacob Buysse

This notebook analyzes rating predictions from user-based collaborative filtering, where each test user's neighborhood of similar training users is found using kNN.
The files are located in the datasets subdirectory:

* MovieLens - `movielens_25m.feather` (Movies)
* Netflix Prize - `netflix_prize.feather` (Movies and TV Shows)
* Yahoo! Music R2 - `yahoo_r2_songs.subsampled.feather` (Songs)
* BoardGameGeek - `boardgamegeek.feather` (Board Games)

We will be using the following libraries:

In [16]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix
from scipy.stats import linregress
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import GroupShuffleSplit, train_test_split

Let us configure matplotlib for readable labels, high resolution, and automatic layout.

In [2]:
matplotlib.rc('axes', labelsize=16)
matplotlib.rc('figure', dpi=150, autolayout=True)

## Datasets

Let us load the 4 datasets.  We will proceed to clean, filter, preprocess, and split the datasets before continuing on to the kNN portion.

In [3]:
datasets = [
    { 'Title': 'MovieLens', 'File': 'movielens_25m' },
    { 'Title': 'Netflix', 'File': 'netflix_prize' },
    { 'Title': 'Yahoo! Music', 'File': 'yahoo_r2_songs.subsampled' },
    { 'Title': 'BoardGameGeek', 'File': 'boardgamegeek' }
]
for dataset in datasets:
    # Load the file
    print(f"Loading {dataset['Title']}...")
    df = pd.read_feather(f"./datasets/{dataset['File']}.feather")

    # Add a rating_bin (floor of the rating) for graphing bins later
    df['rating_bin'] = np.floor(df.rating)

    # Use the label encoder to convert user_id into a numeric when it is a string (object)
    # NOTE: This is needed for the BoardGameGeek dataset
    if (df.user_id.dtype == object):
        print('Encoding user_id: string -> int64')
        user_id_encoder = LabelEncoder()
        user_id_encoder.fit(df.user_id)
        dataset['user_id_encoder'] = user_id_encoder
        df['user_id'] = user_id_encoder.transform(df.user_id)

    # Store the df in the dataset dictionary
    dataset['df'] = df
    print(f"Shape {df.shape}")

Loading MovieLens...
Shape (24890583, 4)
Loading Netflix...
Shape (51031355, 4)
Loading Yahoo! Music...
Shape (6937275, 4)
Loading BoardGameGeek...
Encoding user_id: string -> int64
Shape (18942215, 4)


Because of how we are doing our training/testing split and then seen/unseen split for testing, we need to exclude all users that only have a single rating.  This is because we cannot split those users in the testing dataset into both seen (for matching neighbors) and unseen (for prediction analysis).  Let us filter those users out now.

In [4]:
for dataset in datasets:
    print(f"Filtering out users with only one rating for {dataset['Title']}")
    df = dataset['df']
    counts_df = df.groupby('user_id')[['rating']].count()
    merged_df = df.merge(counts_df, on='user_id', suffixes=['', '_count'])
    filtered_df = merged_df[merged_df.rating_count > 1]
    dataset['df'] = filtered_df.copy()
    print(f"New Shape: {filtered_df.shape}")

Filtering out users with only one rating for MovieLens
New Shape: (24890583, 5)
Filtering out users with only one rating for Netflix
New Shape: (51027153, 5)
Filtering out users with only one rating for Yahoo! Music
New Shape: (6532945, 5)
Filtering out users with only one rating for BoardGameGeek
New Shape: (18862919, 5)


Let us encode the `item_id` column into a contiguous 0...n-1 range `item_idx` using a LabelEncoder.
This will be used for the columns of the sparse matrices.
Note that this encoding will be shared between the training and testing splits.

In [5]:
for dataset in datasets:
    print(f"Encoding item_id for {dataset['Title']}")
    df = dataset['df']
    item_id_encoder = LabelEncoder()
    item_id_encoder.fit(df.item_id)
    dataset['item_id_encoder'] = item_id_encoder
    n_items = item_id_encoder.classes_.size
    dataset['n_items'] = n_items
    df['item_idx'] = item_id_encoder.transform(df.item_id)
    print(f"Distinct Item Count: {n_items:,}")

Encoding item_id for MovieLens
Distinct Item Count: 24,330
Encoding item_id for Netflix
Distinct Item Count: 9,210
Encoding item_id for Yahoo! Music
Distinct Item Count: 1,368
Encoding item_id for BoardGameGeek
Distinct Item Count: 21,925
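
To make the encoding concrete, here is a minimal pure-Python sketch of the idea: each distinct id is mapped to its position in the sorted list of distinct ids. The helper name `fit_label_index` and the item ids are made up for illustration; the notebook itself relies on `LabelEncoder` for this.

```python
def fit_label_index(values):
    """Map each distinct value to its position in the sorted list of distinct values."""
    classes = sorted(set(values))
    index_of = {value: idx for idx, value in enumerate(classes)}
    return classes, index_of

# Made-up, non-contiguous item ids (with repeats, as in a ratings table).
item_ids = [1193, 47, 2571, 47, 1193, 318]

classes, index_of = fit_label_index(item_ids)
item_idx = [index_of[i] for i in item_ids]

# Going back from the contiguous index to the original id is a plain lookup.
original = [classes[i] for i in item_idx]
```

Because the mapping is fitted once on the full (filtered) data frame, column `j` refers to the same item in both the training and the testing matrices.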


Let us do a 75/25 split for the training/testing datasets, split across the user ids as groups.

In [6]:
def TrainTestSplit(df):
    gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
    train_index, test_index = next(gss.split(X=df, y=df.rating, groups=df.user_id))
    train_df = df.iloc[train_index].copy()
    test_df = df.iloc[test_index].copy()
    total_count = train_df.shape[0] + test_df.shape[0]
    item_count = df.item_id.nunique()
    user_count = df.user_id.nunique()
    train_pct_total = train_df.shape[0] / total_count
    test_pct_total = test_df.shape[0] / total_count
    train_pct_item = train_df.item_id.nunique() / item_count
    test_pct_item = test_df.item_id.nunique() / item_count
    train_pct_user = train_df.user_id.nunique() / user_count
    test_pct_user = test_df.user_id.nunique() / user_count
    print(f"Train {train_df.shape} ({train_pct_total:.0%} total, {train_pct_item:.0%} items, {train_pct_user:.0%} users) " +
          f"Test {test_df.shape} ({test_pct_total:.0%} total, {test_pct_item:.0%} items, {test_pct_user:.0%} users)")
    return train_df, test_df

for dataset in datasets:
    print(f"Splitting training/testing datasets for {dataset['Title']}")
    train_df, test_df = TrainTestSplit(dataset['df'])
    dataset['train_df'] = train_df
    dataset['test_df'] = test_df

Splitting training/testing datasets for MovieLens
Train (18706943, 6) (75% total, 100% items, 75% users) Test (6183640, 6) (25% total, 99% items, 25% users)
Splitting training/testing datasets for Netflix
Train (38292233, 6) (75% total, 100% items, 75% users) Test (12734920, 6) (25% total, 100% items, 25% users)
Splitting training/testing datasets for Yahoo! Music
Train (4901087, 6) (75% total, 100% items, 75% users) Test (1631858, 6) (25% total, 100% items, 25% users)
Splitting training/testing datasets for BoardGameGeek
Train (14119520, 6) (75% total, 100% items, 75% users) Test (4743399, 6) (25% total, 100% items, 25% users)
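
The point of splitting by group is that a user's ratings never straddle the training and testing sets. The following is a small pure-Python sketch of that idea only, not of how `GroupShuffleSplit` is implemented; the function name `group_split` and the toy rows are assumptions for illustration.

```python
import random

def group_split(rows, train_fraction=0.75, seed=777):
    """Split (user, item, rating) rows so that each user lands entirely on one side."""
    users = sorted({user for user, _item, _rating in rows})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_train_users = round(len(users) * train_fraction)
    train_users = set(users[:n_train_users])
    train = [row for row in rows if row[0] in train_users]
    test = [row for row in rows if row[0] not in train_users]
    return train, test

# Toy ratings: four users with different numbers of ratings.
rows = [
    ('a', 1, 4.0), ('a', 2, 3.0), ('a', 3, 5.0),
    ('b', 1, 2.0), ('b', 4, 4.5),
    ('c', 2, 1.0), ('c', 3, 3.5), ('c', 4, 4.0), ('c', 5, 2.5),
    ('d', 1, 5.0), ('d', 5, 3.0),
]
train, test = group_split(rows)
train_users = {row[0] for row in train}
test_users = {row[0] for row in test}
```

The split fraction applies to users rather than ratings, so the share of ratings on each side only approaches 75/25 when there are many users, as in the output above.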


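Now, for each training/testing data frame, encode the user ids into a contiguous 0...n-1 range `user_idx` using LabelEncoders.  These will be used for the rows of the sparse matrices, so the training and testing sets each get their own encoder.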

In [7]:
for dataset in datasets:
    print(f"Encoding user_id for {dataset['Title']}")

    train_df = dataset['train_df']
    train_user_id_encoder = LabelEncoder()
    train_user_id_encoder.fit(train_df.user_id)
    dataset['train_user_id_encoder'] = train_user_id_encoder
    n_train_users = train_user_id_encoder.classes_.size
    dataset['n_train_users'] = n_train_users
    train_df['user_idx'] = train_user_id_encoder.transform(train_df.user_id)
    print(f"Distinct Training Users: {n_train_users:,}")

    test_df = dataset['test_df']
    test_user_id_encoder = LabelEncoder()
    test_user_id_encoder.fit(test_df.user_id)
    dataset['test_user_id_encoder'] = test_user_id_encoder
    n_test_users = test_user_id_encoder.classes_.size
    dataset['n_test_users'] = n_test_users
    test_df['user_idx'] = test_user_id_encoder.transform(test_df.user_id)
    print(f"Distinct Testing Users: {n_test_users:,}")

Encoding user_id for MovieLens
Distinct Training Users: 121,905
Distinct Testing Users: 40,636
Encoding user_id for Netflix
Distinct Training Users: 355,362
Distinct Testing Users: 118,454
Encoding user_id for Yahoo! Music
Distinct Training Users: 638,199
Distinct Testing Users: 212,733
Encoding user_id for BoardGameGeek
Distinct Training Users: 249,059
Distinct Testing Users: 83,020


Now create a sparse matrix for the training set.

In [8]:
for dataset in datasets:
    print(f"Creating sparse training matrix for {dataset['Title']}")
    dataset['train_X'] = csr_matrix(
        (dataset['train_df'].rating, (dataset['train_df'].user_idx, dataset['train_df'].item_idx)),
        shape=(dataset['n_train_users'], dataset['n_items'])
    )

Creating sparse training matrix for MovieLens
Creating sparse training matrix for Netflix
Creating sparse training matrix for Yahoo! Music
Creating sparse training matrix for BoardGameGeek
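
For readers unfamiliar with the CSR layout, here is a rough pure-Python sketch of how (row, column, value) triplets become the three CSR arrays. It illustrates the storage format only and is not how scipy builds the matrix; the helper names `to_csr` and `row_entries` are hypothetical, and the sketch assumes each (user, item) pair appears at most once.

```python
def to_csr(triplets, n_rows):
    """Build CSR arrays (data, indices, indptr) from (row, col, value) triplets."""
    ordered = sorted(triplets)                  # group entries by row, then column
    data = [value for _row, _col, value in ordered]
    indices = [col for _row, col, _value in ordered]
    indptr = [0] * (n_rows + 1)
    for row, _col, _value in ordered:
        indptr[row + 1] += 1                    # count the entries in each row
    for row in range(n_rows):
        indptr[row + 1] += indptr[row]          # running total gives row boundaries
    return data, indices, indptr

def row_entries(csr, row):
    """Return {column: value} for the stored entries of one row."""
    data, indices, indptr = csr
    start, stop = indptr[row], indptr[row + 1]
    return dict(zip(indices[start:stop], data[start:stop]))

# Toy ratings as (user_idx, item_idx, rating); user 1 has no ratings.
triplets = [(0, 2, 4.0), (2, 0, 3.5), (0, 0, 5.0), (2, 3, 1.0)]
csr = to_csr(triplets, n_rows=3)
```

Only the observed ratings are stored; every other cell behaves as a zero, which matters later because the cosine metric treats an unrated item as a rating of 0.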


Now we need to split the test dataset into a seen/unseen split (75/25), stratified by user id.

In [9]:
for dataset in datasets:
    print(f"Creating seen/unseen testing split {dataset['Title']}")
    test_df = dataset['test_df']
    seen_df, unseen_df = train_test_split(test_df, train_size=0.75, random_state=777, stratify=test_df.user_idx)
    dataset['seen_df'] = seen_df
    dataset['unseen_df'] = unseen_df
    print(f"Seen {seen_df.shape}, Unseen {unseen_df.shape}")

Creating seen/unseen testing split MovieLens
Seen (4637730, 7), Unseen (1545910, 7)
Creating seen/unseen testing split Netflix
Seen (9551190, 7), Unseen (3183730, 7)
Creating seen/unseen testing split Yahoo! Music
Seen (1223893, 7), Unseen (407965, 7)
Creating seen/unseen testing split BoardGameGeek
Seen (3557549, 7), Unseen (1185850, 7)
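
For comparison, here is an explicit per-user hold-out written in pure Python. This is a different technique from the stratified `train_test_split` call used above, shown only to make the "at least one rating on each side" requirement visible in code. The function name `split_seen_unseen` and the toy rows are assumptions, and the sketch requires every user to have at least two ratings, which is what the earlier filter ensures.

```python
import random
from collections import defaultdict

def split_seen_unseen(rows, seen_fraction=0.75, seed=777):
    """Split each user's (user, item, rating) rows into seen and unseen parts."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    rng = random.Random(seed)
    seen, unseen = [], []
    for user in sorted(by_user):
        ratings = by_user[user][:]
        rng.shuffle(ratings)
        # Aim for the requested fraction, but keep at least one rating on each side.
        n_seen = min(max(1, round(len(ratings) * seen_fraction)), len(ratings) - 1)
        seen.extend(ratings[:n_seen])
        unseen.extend(ratings[n_seen:])
    return seen, unseen

# Toy test set: users with 2, 4 and 8 ratings.
rows = [('u2', i, 3.0) for i in range(2)] \
     + [('u4', i, 4.0) for i in range(4)] \
     + [('u8', i, 5.0) for i in range(8)]
seen, unseen = split_seen_unseen(rows)
seen_users = {row[0] for row in seen}
unseen_users = {row[0] for row in unseen}
```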


Now we can create the sparse matrix for the testing seen dataset.

In [10]:
for dataset in datasets:
    print(f"Creating sparse testing/seen matrix for {dataset['Title']}")
    dataset['test_X'] = csr_matrix(
        (dataset['seen_df'].rating, (dataset['seen_df'].user_idx, dataset['seen_df'].item_idx)),
        shape=(dataset['n_test_users'], dataset['n_items'])
    )

Creating sparse testing/seen matrix for MovieLens
Creating sparse testing/seen matrix for Netflix
Creating sparse testing/seen matrix for Yahoo! Music
Creating sparse testing/seen matrix for BoardGameGeek


## kNN Analysis

We will now try values of k ranging from 5 to 250 to find the optimal neighborhood size for each dataset.  Since computing the predictions with kNN takes some time (more than 1 minute for just k=5), we will save each resulting prediction data frame to a feather file so we can recover in the event of an error.  NOTE: This did happen.  The first pass ran out of memory while evaluating the k=250 datasets after the run had been started overnight - phew!

In [22]:
def load_or_compute_pred_df(dataset, k):
    cached_filename = f"./pred_cache/{dataset['File']}.pred.k{k}.feather"
    try:
        cached_df = pd.read_feather(cached_filename)
        print('Using cached result (clear /pred_cache folder to recompute)')
        return cached_df
    except Exception:  # cache miss or unreadable cache file: fall through and recompute
        nn = NearestNeighbors(n_neighbors=k, metric='cosine', n_jobs=-1)
        print('Fitting kNN model on training data')
        nn.fit(dataset['train_X'])
        print('Finding neighbors for "seen" testing data')
        neighbors = nn.kneighbors(dataset['test_X'], return_distance=False)
        print('Melting neighbors')
        melted_neighbors_df = pd.DataFrame(neighbors).melt(
            value_name='neighbor_train_user_idx',
            ignore_index=False
        )[['neighbor_train_user_idx']].reset_index(names='test_user_idx')
        print('Computing average ratings for the "unseen" testing data')
        unseen_df = dataset['unseen_df']
        merged_df = unseen_df\
            .merge(melted_neighbors_df, left_on='user_idx', right_on='test_user_idx')\
            .merge(train_df, left_on=['neighbor_train_user_idx', 'item_idx'], right_on=['user_idx', 'item_idx'], suffixes=['', '_train'])
        averages_df = merged_df.groupby(['user_idx', 'item_idx'])[['rating_train']].mean()
        all_pred_df = unseen_df.merge(averages_df, how='left', on=['user_idx', 'item_idx'])\
            [['item_id', 'user_id', 'rating', 'rating_bin', 'rating_train']]
        print('Saving cached result to disk')
        all_pred_df.to_feather(cached_filename)
        return all_pred_df

def knn_test(dataset, k):
    print(f"Testing neighborhood size k={k} for {dataset['Title']}")
    all_pred_df = load_or_compute_pred_df(dataset, k)
    pred_df = all_pred_df[all_pred_df.rating_train.notnull()]
    dataset[f"all_pred_df_{k}"] = all_pred_df
    dataset[f"pred_df_{k}"] = pred_df
    n_total = all_pred_df.shape[0]
    n_pred = pred_df.shape[0]
    n_null = n_total - n_pred
    dataset[f"n_total_{k}"] = n_total
    dataset[f"n_pred_{k}"] = n_pred
    dataset[f"n_null_{k}"] = n_null
    print(f"Done. Predicted {n_pred} of {n_total} ({n_null} NaN)")    
    print('')
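
To make the prediction step easier to follow, here is a toy pure-Python sketch of what it computes: rank the training users by cosine similarity to the test user's seen ratings, keep the top k, and average whatever ratings those neighbors gave to the unseen item. The users, ratings and the helper names `cosine_similarity` and `predict` are made up for illustration; the notebook does the same thing with `NearestNeighbors` (which ranks by cosine distance, i.e. 1 minus this similarity) and data frame merges.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two {item: rating} dicts; unrated items count as 0."""
    dot = sum(rating * b[item] for item, rating in a.items() if item in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(seen, train, item, k):
    """Average the ratings for `item` among the k most similar training users."""
    ranked = sorted(train, key=lambda u: cosine_similarity(seen, train[u]), reverse=True)
    neighbors = ranked[:k]
    ratings = [train[u][item] for u in neighbors if item in train[u]]
    # No neighbor rated the item: there is nothing to average (NaN in the notebook).
    return sum(ratings) / len(ratings) if ratings else None

train = {
    'u1': {'A': 5, 'B': 4, 'C': 5},
    'u2': {'A': 4, 'B': 5, 'C': 3},
    'u3': {'D': 5, 'E': 4},
    'u4': {'A': 1, 'B': 1, 'D': 5},
}
seen = {'A': 5, 'B': 5}           # the test user's "seen" ratings

pred_c = predict(seen, train, 'C', k=2)      # both nearest neighbors rated C
pred_d_k2 = predict(seen, train, 'D', k=2)   # neither nearest neighbor rated D
pred_d_k3 = predict(seen, train, 'D', k=3)   # the third neighbor did
```

In this toy case item D only becomes predictable once the neighborhood grows from 2 to 3, which is the same mechanism by which the NaN counts shrink as k increases.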

In [23]:
ks = [5, 10, 25, 50, 100, 150, 200, 250]
for k in ks:
    for dataset in datasets:
        knn_test(dataset, k)

Testing neighborhood size k=5 for MovieLens
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 1155346 of 1545910 (390564 NaN)

Testing neighborhood size k=5 for Netflix
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 69225 of 3183730 (3114505 NaN)

Testing neighborhood size k=5 for Yahoo! Music
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 13331 of 407965 (394634 NaN)

Testing neighborhood size k=5 for BoardGameGeek
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 20837 of 1185850 (1165013 NaN)

Testing neighborhood size k=10 for MovieLens
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 1280716 of 1545910 (265194 NaN)

Testing neighborhood size k=10 for Netflix
Using cached result (clear /pred_cache folder to recompute)
Done. Predicted 128844 of 3183730 (3054886 NaN)

Testing neighborhood size k=10 for Yahoo! Music
Using cached result (clear /pred_c

Next, we need to perform a linear regression of the predicted ratings against the actual ratings to compute the R^2 value for each set of predictions.  We should also track the number of predicted ratings relative to the size of the testing set (to see how the unpredictable set changes with k).

In [30]:
def ScoreRatings(dataset, k):
    df = dataset[f"pred_df_{k}"]
    result = linregress(df.rating, df.rating_train)
    rvalue = result.rvalue
    coef_det = rvalue * rvalue
    print(f"k={k}, R^2 = {coef_det}")
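
As a sanity check on the metric itself, here is a small pure-Python sketch that computes the squared Pearson correlation (which equals R^2 for a simple linear regression) together with the fraction of unseen ratings that received a prediction. The toy ratings and the names `r_squared`, `coverage` and `score` are assumptions for illustration; `None` stands in for a NaN prediction.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    return r * r

actual = [1, 2, 3, 4, 5]
predicted = [1, 3, 2, 4, None]    # the last rating could not be predicted

# Score only the ratings that received a prediction.
pairs = [(a, p) for a, p in zip(actual, predicted) if p is not None]
coverage = len(pairs) / len(actual)
score = r_squared([a for a, _p in pairs], [p for _a, p in pairs])
```

Reporting coverage next to R^2 matters because the two are computed on different subsets for different k: a small k scores only the few ratings it can predict at all.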

In [31]:
for dataset in datasets:
    print(f"Evaluating R^2 for {dataset['Title']}")
    for k in ks:
        ScoreRatings(dataset, k)
    print("")

Evaluating R^2 for MovieLens
k=5, R^2 = 0.1495154188384571
k=10, R^2 = 0.1710952844882682
k=25, R^2 = 0.19837371652315747
k=50, R^2 = 0.21426213770362265
k=100, R^2 = 0.22734575747851915
k=150, R^2 = 0.23309274287158122
k=200, R^2 = 0.23650984752459725
k=250, R^2 = 0.23873341010063248

Evaluating R^2 for Netflix
k=5, R^2 = 0.0003092962723988484
k=10, R^2 = 0.00017448519509314183
k=25, R^2 = 0.00028483986626896417
k=50, R^2 = 0.0004209217059557052
k=100, R^2 = 0.00033947374880325396
k=150, R^2 = 0.00034953068556711715
k=200, R^2 = 0.0003000495175410974
k=250, R^2 = 2.1981085558120076e-05

Evaluating R^2 for Yahoo! Music
k=5, R^2 = 6.736476279609527e-06
k=10, R^2 = 0.00010823625326729013
k=25, R^2 = 0.00014584170604428016
k=50, R^2 = 0.00012428503385046921
k=100, R^2 = 8.821046626184293e-05
k=150, R^2 = 5.673372501940381e-05
k=200, R^2 = 7.477032510212727e-05
k=250, R^2 = 7.670232163740563e-05

Evaluating R^2 for BoardGameGeek
k=5, R^2 = 0.002363213472237026
k=10, R^2 = 0.002969648605171

These numbers do not look right.  MovieLens is plausible, and so is BoardGameGeek at k=250, but the rest are abysmal.

Time to double-check the calculations.  Let us take Netflix with k=100 and step through the whole pipeline in a single linear pass below.

In [35]:
df = pd.read_feather('./datasets/netflix_prize.feather')
print(f"Initial dataset: {df.shape}")
counts_df = df.groupby('user_id')[['rating']].count()
merged_df = df.merge(counts_df, on='user_id', suffixes=['', '_count'])
filtered_df = merged_df[merged_df.rating_count > 1]
multi_df = filtered_df.copy()
print(f"Filtering out users with only one rating: {multi_df.shape}")
multi_df

Initial dataset: (51031355, 3)
Filtering out users with only one rating: (51027153, 4)


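| | item_id | user_id | rating | rating_count |
|---|---|---|---|---|
| 0 | 1 | 1488844 | 3 | 1127 |
| 1 | 1 | 822109 | 5 | 75 |
| 2 | 1 | 885013 | 4 | 181 |
| 3 | 1 | 30878 | 4 | 676 |
| 4 | 1 | 823519 | 3 | 324 |
| ... | ... | ... | ... | ... |
| 51031350 | 9210 | 2420260 | 1 | 493 |
| 51031351 | 9210 | 761176 | 3 | 134 |
| 51031352 | 9210 | 459277 | 3 | 793 |
| 51031353 | 9210 | 2407365 | 4 | 362 |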


In [37]:
item_id_encoder = LabelEncoder()
item_id_encoder.fit(multi_df.item_id)
n_items = item_id_encoder.classes_.size
print(f"Distinct Item Count: {n_items:,}")
multi_df['item_idx'] = item_id_encoder.transform(multi_df.item_id)
multi_df

Distinct Item Count: 9,210


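| | item_id | user_id | rating | rating_count | item_idx |
|---|---|---|---|---|---|
| 0 | 1 | 1488844 | 3 | 1127 | 0 |
| 1 | 1 | 822109 | 5 | 75 | 0 |
| 2 | 1 | 885013 | 4 | 181 | 0 |
| 3 | 1 | 30878 | 4 | 676 | 0 |
| 4 | 1 | 823519 | 3 | 324 | 0 |
| ... | ... | ... | ... | ... | ... |
| 51031350 | 9210 | 2420260 | 1 | 493 | 9209 |
| 51031351 | 9210 | 761176 | 3 | 134 | 9209 |
| 51031352 | 9210 | 459277 | 3 | 793 | 9209 |
| 51031353 | 9210 | 2407365 | 4 | 362 | 9209 |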


In [39]:
gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
train_index, test_index = next(gss.split(X=multi_df, y=multi_df.rating, groups=multi_df.user_id))
train_df = multi_df.iloc[train_index].copy()
test_df = multi_df.iloc[test_index].copy()
total_count = train_df.shape[0] + test_df.shape[0]
item_count = multi_df.item_idx.nunique()
user_count = multi_df.user_id.nunique()
train_pct_total = train_df.shape[0] / total_count
test_pct_total = test_df.shape[0] / total_count
train_pct_item = train_df.item_idx.nunique() / item_count
test_pct_item = test_df.item_idx.nunique() / item_count
train_pct_user = train_df.user_id.nunique() / user_count
test_pct_user = test_df.user_id.nunique() / user_count
print(f"Train {train_df.shape} ({train_pct_total:.0%} total, {train_pct_item:.0%} items, {train_pct_user:.0%} users) " +
      f"Test {test_df.shape} ({test_pct_total:.0%} total, {test_pct_item:.0%} items, {test_pct_user:.0%} users)")

Train (38292233, 5) (75% total, 100% items, 75% users) Test (12734920, 5) (25% total, 100% items, 25% users)


In [40]:
train_user_id_encoder = LabelEncoder()
train_user_id_encoder.fit(train_df.user_id)
n_train_users = train_user_id_encoder.classes_.size
train_df['user_idx'] = train_user_id_encoder.transform(train_df.user_id)
print(f"Distinct Training Users: {n_train_users:,}")

test_user_id_encoder = LabelEncoder()
test_user_id_encoder.fit(test_df.user_id)
n_test_users = test_user_id_encoder.classes_.size
test_df['user_idx'] = test_user_id_encoder.transform(test_df.user_id)
print(f"Distinct Testing Users: {n_test_users:,}")

Distinct Training Users: 355,362
Distinct Testing Users: 118,454


In [41]:
train_X = csr_matrix(
    (train_df.rating, (train_df.user_idx, train_df.item_idx)),
    shape=(n_train_users, n_items)
)

In [42]:
seen_df, unseen_df = train_test_split(test_df, train_size=0.75, random_state=777, stratify=test_df.user_idx)
print(f"Seen {seen_df.shape}, Unseen {unseen_df.shape}")

Seen (9551190, 6), Unseen (3183730, 6)


In [43]:
test_X = csr_matrix(
    (seen_df.rating, (seen_df.user_idx, seen_df.item_idx)),
    shape=(n_test_users, n_items)
)

In [44]:
k = 100
print(f"Testing neighborhood size k={k} for Netflix")
nn = NearestNeighbors(n_neighbors=k, metric='cosine', n_jobs=-1)
print('Fitting kNN model on training data')
nn.fit(train_X)
print('Finding neighbors for "seen" testing data')
neighbors = nn.kneighbors(test_X, return_distance=False)
print('Melting neigbors')
melted_neighbors_df = pd.DataFrame(neighbors).melt(
    value_name='neighbor_train_user_idx',
    ignore_index=False
)[['neighbor_train_user_idx']].reset_index(names='test_user_idx')
print('Computing average ratings for the "unseen" testing data')
merged_df = unseen_df\
    .merge(melted_neighbors_df, left_on='user_idx', right_on='test_user_idx')\
    .merge(train_df, left_on=['neighbor_train_user_idx', 'item_idx'], right_on=['user_idx', 'item_idx'], suffixes=['', '_train'])
averages_df = merged_df.groupby(['user_idx', 'item_idx'])[['rating_train']].mean()

Testing neighborhood size k=100 for MovieLens
Fitting kNN model on training data
Finding neighbors for "seen" testing data
Melting neigbors
Computing average ratings for the "unseen" testing data


KeyError: "['rating_bin'] not in index"

The cell above took around 15 minutes to find the neighbors and around 18 minutes in total to compute the predictions (the output says MovieLens, but the data is definitely Netflix).

In [45]:
all_pred_df = unseen_df.merge(averages_df, how='left', on=['user_idx', 'item_idx'])\
    [['item_id', 'user_id', 'rating', 'rating_train']]
pred_df = all_pred_df[all_pred_df.rating_train.notnull()]
n_total = all_pred_df.shape[0]
n_pred = pred_df.shape[0]
n_null = n_total - n_pred
print(f"Done. Predicted {n_pred} of {n_total} ({n_null} NaN)")

Done. Predicted 2997931 of 3183730 (185799 NaN)


In [46]:
pred_df
result = linregress(pred_df.rating, pred_df.rating_train)
rvalue = result.rvalue
coef_det = rvalue * rvalue
print(f"k={k}, R^2 = {coef_det}")

k=100, R^2 = 0.19736097082472437


WTF.  That looks fine.  There is obviously something wrong with the code or logic as it was originally written, because running the same steps as a single pass for Netflix with k=100 works fine (an R^2 of about 0.197 seems reasonable, vs. 0.00033947374880325396 which is atrocious).
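
One plausible culprit is worth spelling out.  `load_or_compute_pred_df` merges the neighbors against `train_df`, but `train_df` is neither a parameter of that function nor assigned inside it, so Python resolves it as a notebook-level variable.  Which data frame that was depended on the cells that had run before it, not on the dataset being evaluated.  The toy sketch below reproduces the pattern with strings standing in for data frames; the function names are hypothetical.

```python
# Stand-in for a notebook-level variable left behind by an earlier cell.
train_df = "BoardGameGeek training frame"

def describe_join_global(dataset):
    # `train_df` is not a parameter or a local here, so Python looks it up
    # in the module (notebook) namespace at call time.
    return f"{dataset['Title']} neighbors joined to {train_df}"

def describe_join_explicit(dataset):
    # Reading the frame from the dataset dictionary removes the hidden dependency.
    train_df = dataset['train_df']
    return f"{dataset['Title']} neighbors joined to {train_df}"

netflix = {'Title': 'Netflix', 'train_df': 'Netflix training frame'}
wrong = describe_join_global(netflix)
right = describe_join_explicit(netflix)
```

Joining one dataset's neighbor indices to another dataset's training ratings would still produce averages, just meaningless ones, which is consistent with R^2 values near zero.  The rewritten function below takes `train_df` as an explicit argument, so it does not have this dependency.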

In [47]:
def LoadOrComputePrediction(name, k, train_X, test_X, unseen_df, train_df):
    cached_filename = f"./pred_cache/{name}.pred.v2.k{k}.feather"
    try:
        cached_df = pd.read_feather(cached_filename)
        print('Using cached result (clear /pred_cache folder to recompute)')
        return cached_df
    except Exception:  # cache miss or unreadable cache file: fall through and recompute
        nn = NearestNeighbors(n_neighbors=k, metric='cosine', n_jobs=-1)
        print('Fitting kNN model on training data')
        nn.fit(train_X)
        print('Finding neighbors for "seen" testing data')
        neighbors = nn.kneighbors(test_X, return_distance=False)
        print('Melting neighbors')
        melted_neighbors_df = pd.DataFrame(neighbors).melt(
            value_name='neighbor_train_user_idx',
            ignore_index=False
        )[['neighbor_train_user_idx']].reset_index(names='test_user_idx')
        print('Computing average ratings for the "unseen" testing data')
        merged_df = unseen_df\
            .merge(melted_neighbors_df, left_on='user_idx', right_on='test_user_idx')\
            .merge(train_df, left_on=['neighbor_train_user_idx', 'item_idx'], right_on=['user_idx', 'item_idx'], suffixes=['', '_train'])
        averages_df = merged_df.groupby(['user_idx', 'item_idx'])[['rating_train']].mean()        
        all_pred_df = unseen_df.merge(averages_df, how='left', on=['user_idx', 'item_idx'])\
            [['item_id', 'user_id', 'rating', 'rating_train']]
        print('Saving cached result to disk')
        all_pred_df.to_feather(cached_filename)
        return all_pred_df    

def AnalyzeDataset(title, name):
    print(f"Analyzing dataset for {title}")
    df = pd.read_feather(f'./datasets/{name}.feather')
    print(f"Initial dataset: {df.shape}")
    counts_df = df.groupby('user_id')[['rating']].count()
    merged_df = df.merge(counts_df, on='user_id', suffixes=['', '_count'])
    filtered_df = merged_df[merged_df.rating_count > 1]
    multi_df = filtered_df.copy()
    print(f"Filtering out users with only one rating: {multi_df.shape}")

    item_id_encoder = LabelEncoder()
    item_id_encoder.fit(multi_df.item_id)
    n_items = item_id_encoder.classes_.size
    print(f"Distinct Item Count: {n_items:,}")
    multi_df['item_idx'] = item_id_encoder.transform(multi_df.item_id)

    gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=777)
    train_index, test_index = next(gss.split(X=multi_df, y=multi_df.rating, groups=multi_df.user_id))
    train_df = multi_df.iloc[train_index].copy()
    test_df = multi_df.iloc[test_index].copy()
    print(f"Train {train_df.shape}, Test {test_df.shape}")

    train_user_id_encoder = LabelEncoder()
    train_user_id_encoder.fit(train_df.user_id)
    n_train_users = train_user_id_encoder.classes_.size
    train_df['user_idx'] = train_user_id_encoder.transform(train_df.user_id)
    print(f"Distinct Training Users: {n_train_users:,}")
    
    test_user_id_encoder = LabelEncoder()
    test_user_id_encoder.fit(test_df.user_id)
    n_test_users = test_user_id_encoder.classes_.size
    test_df['user_idx'] = test_user_id_encoder.transform(test_df.user_id)
    print(f"Distinct Testing Users: {n_test_users:,}")

    train_X = csr_matrix(
        (train_df.rating, (train_df.user_idx, train_df.item_idx)),
        shape=(n_train_users, n_items)
    )

    seen_df, unseen_df = train_test_split(test_df, train_size=0.75, random_state=777, stratify=test_df.user_idx)
    print(f"Seen {seen_df.shape}, Unseen {unseen_df.shape}")

    test_X = csr_matrix(
        (seen_df.rating, (seen_df.user_idx, seen_df.item_idx)),
        shape=(n_test_users, n_items)
    )

    print('')

    for k in [5, 10, 25, 50, 100, 150, 200, 250]:
        print(f"Testing neighborhood size k={k}")
        all_pred_df = LoadOrComputePrediction(name, k, train_X, test_X, unseen_df, train_df)
        pred_df = all_pred_df[all_pred_df.rating_train.notnull()]
        n_total = all_pred_df.shape[0]
        n_pred = pred_df.shape[0]
        n_null = n_total - n_pred
        print(f"Done. Predicted {n_pred} of {n_total} ({n_null} NaN)")

        result = linregress(pred_df.rating, pred_df.rating_train)
        rvalue = result.rvalue
        coef_det = rvalue * rvalue
        print(f"k={k}, R^2 = {coef_det}")
        print('')

In [None]:
AnalyzeDataset('Netflix', 'netflix_prize')

Analyzing dataset for Netflix
Initial dataset: (51031355, 3)
Filtering out users with only one rating: (51027153, 4)
Distinct Item Count: 9,210
Train (38292233, 5), Test (12734920, 5)
Distinct Training Users: 355,362
Distinct Testing Users: 118,454
Seen (9551190, 6), Unseen (3183730, 6)
Fitting kNN model on training data
Finding neighbors for "seen" testing data
