# Preprocessing the data

Our analysis in the previous notebook identified several issues to address. The biggest is that about 5% of users account for roughly a third of the ratings. In this notebook we remove those high rating-count users (more than 500 ratings each) from the dataset. Movies with fewer than 10 ratings are also removed. Lastly, we split the data into two CSV files: one for training the model and one for evaluating it. The files are saved to a folder called 'embeddingModel' in the same location as this notebook.

This notebook assumes the MovieLens data was downloaded and unzipped by the [previous notebook](1. Downloading and Exploring the Movielens Data.ipynb) in this series.

In [4]:
import numpy as np
import os
import pandas as pd
import random
import yaml

In [5]:
INPUT_RATING_FILE = 'ml-10M100K/ratings.dat'

OUTPUT_DIR = 'embeddingModel'
WALS_OUTPUT_TRAIN_FILE = os.path.join(OUTPUT_DIR, 'walsMovielensTrain.csv')
WALS_OUTPUT_TEST_FILE = os.path.join(OUTPUT_DIR, 'walsMovielensTest.csv')
WALS_OUTPUT_STATS_FILE = os.path.join(OUTPUT_DIR,'matrixInfo.yaml')

# Preprocessing parameters
TRANSLATION = 3.5
MAX_USER_RATING_THRESHOLD = 500
MIN_MOVIE_RATING_THRESHOLD = 10
PERCENT_TRAIN = 0.80


# Preprocessing for the WALS model

The input to the WALS model is a sparse matrix. We convert the MovieLens columns 'userid', 'movieid' and 'rating' into a 'matrix row index', a 'matrix column index' and a 'shifted rating'. The matrix indices start at zero. The WALS model performs better when the mean rating is close to zero, so we subtract 3.5 from each rating.
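As a minimal sketch of this conversion on made-up data, the snippet below maps ids to zero-based indices and shifts the ratings. It uses `pd.factorize`, which is an assumption chosen for brevity and not the approach this notebook takes; the toy ids and ratings are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy ratings; the ids and values are made up for illustration.
toy = pd.DataFrame({
    'userid':  [10, 10, 42, 7],
    'movieid': [300, 5, 300, 5],
    'rating':  [5.0, 3.5, 1.0, 4.0],
})

# factorize assigns 0, 1, 2, ... in order of first appearance.
row, _ = pd.factorize(toy['userid'])
col, _ = pd.factorize(toy['movieid'])

# Shift the ratings so that a 3.5 becomes 0.
shifted = toy['rating'].values.astype(np.float32) - 3.5

print(row.tolist())      # [0, 0, 1, 2]
print(col.tolist())      # [0, 1, 0, 1]
print(shifted.tolist())  # [1.5, 0.0, -2.5, 0.5]
```

Each distinct user becomes one matrix row and each distinct movie one matrix column, so the largest index is one less than the number of distinct ids.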

In [6]:
raw = pd.read_csv(
    INPUT_RATING_FILE,
    sep='::',
    header=None,
    names=['userid', 'movieid', 'rating', 'timestamp'],
    usecols=['userid', 'movieid', 'rating'],
    engine='python')

In [7]:

unique_userids = raw['userid'].unique()
unique_movieids = raw['movieid'].unique()
num_unique_users = unique_userids.shape[0]
num_unique_movies = unique_movieids.shape[0]
num_ratings = raw.shape[0]
print('Raw dataset stats: %d users, %d movies, %d ratings' % (num_unique_users, num_unique_movies, num_ratings))

user_histo = raw['userid'].value_counts()
blacklist_users_index = (user_histo.values > MAX_USER_RATING_THRESHOLD)
blacklist_users = user_histo.index[blacklist_users_index]
whitelist_users = user_histo.index[np.logical_not(blacklist_users_index)]

movie_histo = raw['movieid'].value_counts()
blacklist_movies_index = (movie_histo.values < MIN_MOVIE_RATING_THRESHOLD)
blacklist_movies = movie_histo.index[blacklist_movies_index]
whitelist_movies = movie_histo.index[np.logical_not(blacklist_movies_index)]

delete_mask1 = raw['userid'].isin(blacklist_users)
delete_mask2 = raw['movieid'].isin(blacklist_movies)
delete_mask = delete_mask1 | delete_mask2
raw = raw[~delete_mask]

unique_userids = raw['userid'].unique()
unique_movieids = raw['movieid'].unique()

new_num_unique_users = unique_userids.shape[0]
new_num_unique_movies = unique_movieids.shape[0]
new_num_ratings = raw.shape[0]

print('New raw stats: %d users, %d movies, %d ratings' % (new_num_unique_users, new_num_unique_movies, new_num_ratings))
print('Removed %d movies, or %f%% percent of all movies' % (num_unique_movies - new_num_unique_movies, 100*(num_unique_movies - new_num_unique_movies)/float(num_unique_movies)))
print('Removed %d users, or %f%% percent of all users' % (num_unique_users - new_num_unique_users, 100*(num_unique_users - new_num_unique_users)/float(num_unique_users)))
print('Removed %d ratings, or %f%% percent of all ratings' % (num_ratings - new_num_ratings, 100*(num_ratings - new_num_ratings)/float(num_ratings)))


# Reindex the user and movie ids contiguously so that they start at 0.

# Map each remaining user id to its position in unique_userids.
old_userid_to_new = {uid: i for i, uid in enumerate(unique_userids)}
row = raw['userid'].map(old_userid_to_new)

# Map each remaining movie id to its position in unique_movieids.
old_movieid_to_new = {mid: i for i, mid in enumerate(unique_movieids)}
col = raw['movieid'].map(old_movieid_to_new)


# .values replaces the deprecated .as_matrix(), which newer pandas versions no longer provide.
values = raw['rating'].values.astype(np.float32) - TRANSLATION

preprocessed = pd.DataFrame(columns=['userid', 'movieid', 'rating'])
preprocessed['userid'] = row
preprocessed['movieid'] = col
preprocessed['rating'] = values

Raw dataset stats: 69878 users, 10677 movies, 10000054 ratings
New raw stats: 66219 users, 9670 movies, 6896233 ratings
Removed 1007 movies, or 9.431488% percent of all movies
Removed 3659 users, or 5.236269% percent of all users
Removed 3103821 ratings, or 31.038042% percent of all ratings
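
Note that the cell above computes both the user counts and the movie counts from the original data, before any rows are removed. A movie that passes the threshold can therefore end up with fewer ratings once the heavy users are gone. The sketch below shows the effect on made-up data; the thresholds `MAX_USER` and `MIN_MOVIE` and the toy ids are hypothetical.

```python
import pandas as pd

MAX_USER = 2   # hypothetical toy threshold: drop users with more ratings than this
MIN_MOVIE = 2  # hypothetical toy threshold: drop movies with fewer ratings than this

toy = pd.DataFrame({
    'userid':  [1, 1, 1, 2, 3],
    'movieid': [10, 20, 30, 10, 20],
})

# Both histograms come from the unfiltered data.
user_counts = toy['userid'].value_counts()
movie_counts = toy['movieid'].value_counts()

heavy_users = user_counts[user_counts > MAX_USER].index
rare_movies = movie_counts[movie_counts < MIN_MOVIE].index

keep = ~(toy['userid'].isin(heavy_users) | toy['movieid'].isin(rare_movies))
filtered = toy[keep]

# Movies 10 and 20 had two ratings each before filtering, one each after.
print(filtered['movieid'].value_counts().to_dict())
```

If a strict per-movie minimum matters for the model, one option is to recompute the movie counts after removing users and filter again.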


In [8]:

def splitValues(df):
  '''Given a dataframe, splits the rows into two dataframes
  
  Returns:
    df_train: a dataset for training, containing PERCENT_TRAIN
      of the rows of df.
    df_test: a dataset for testing.
  '''
  
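  # Use the size of the dataframe passed in, not the global `raw`.
  num_items = df.shape[0]

  # Get a permutation of the row indices of df.
  # Seed NumPy's generator, since np.random.permutation does not
  # use the state of Python's `random` module.
  np.random.seed(34512)
  index = np.random.permutation(num_items)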

  num_train = int(num_items * PERCENT_TRAIN)

  df_train = df.iloc[index[:num_train], :]
  df_test = df.iloc[index[num_train:], :]
  
  return (df_train, df_test)
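
A seeded split of this kind can be checked on a toy dataframe. The sketch below is an illustration, not the function used in this notebook: `split_rows` is a hypothetical helper that takes the fraction and seed as arguments and uses a local `np.random.RandomState` instead of the global NumPy seed.

```python
import numpy as np
import pandas as pd

def split_rows(df, frac_train, seed):
    """Hypothetical helper: shuffle row positions with a local RNG, then cut."""
    rng = np.random.RandomState(seed)
    index = rng.permutation(df.shape[0])
    num_train = int(df.shape[0] * frac_train)
    return df.iloc[index[:num_train]], df.iloc[index[num_train:]]

toy = pd.DataFrame({'userid': range(10), 'rating': np.zeros(10)})

train_a, test_a = split_rows(toy, 0.8, seed=34512)
train_b, test_b = split_rows(toy, 0.8, seed=34512)

print(len(train_a), len(test_a))             # 8 2
print(train_a.index.equals(train_b.index))   # True: same seed, same split
print(set(train_a.index) & set(test_a.index))  # set(): no row is in both parts
```

Repeating the call with the same seed returns the same rows, which is what makes the train and test files reproducible across runs.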


In [9]:
# Before writing any files, make sure the output folder exists.
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

In [10]:
# Split the data for training and testing, and save to a file.
(preprocessed_train, preprocessed_test) = splitValues(preprocessed)
preprocessed_train.to_csv(WALS_OUTPUT_TRAIN_FILE, header=False, index=False)
preprocessed_test.to_csv(WALS_OUTPUT_TEST_FILE, header=False, index=False)
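
Because the files are written with `header=False` and `index=False`, anything that reads them back has to supply the column names itself. A minimal sketch, assuming the column order written above (row index, column index, shifted rating); the in-memory text stands in for the real file, its values are made up, and the names `row`, `col` and `rating` are hypothetical.

```python
import io
import pandas as pd

# Stand-in for the contents of the training CSV; values are made up.
csv_text = "0,0,1.5\n0,1,0.0\n1,0,-2.5\n"

# header=None tells pandas that the first line is data, not column names.
train = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=['row', 'col', 'rating'])

print(train.shape)             # (3, 3)
print(train['rating'].tolist())  # [1.5, 0.0, -2.5]
```

To read the real file, pass its path in place of the `io.StringIO` object.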


In [11]:
# Save the number of rows and columns to a yaml file for easy use.
with open(WALS_OUTPUT_STATS_FILE, 'w') as outfile:
    matrix_info = {'num_rows': new_num_unique_users, 'num_columns': new_num_unique_movies}
    yaml.dump(matrix_info, outfile, default_flow_style=False)