# MovieLens Dataset Preparation

Raw data was obtained from the [MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/). The data needs to be cleaned and transformed before any networks can be trained. The file "ratings.csv" contains user reviews of movies. Each user and each movie is identified by a unique ID, and each review takes the form of a rating out of 5. A "timestamp" column records when each review was made, as the number of seconds since midnight Coordinated Universal Time (UTC) on January 1, 1970.

In [1]:
import numpy as np
import pandas as pd

from IPython.display import clear_output
from timeit import default_timer

In [2]:
dataset = pd.read_csv("D:/Movie Recommendation System Project/data/raw data/ratings.csv")

dataset.sample(n = 5, axis = 0)

          userId  movieId  rating   timestamp
21874402  142226    45517     3.0  1534689228
7614241    49403     5034     3.0  1165248674
17301156  112093    96728     4.0  1496380634
3638843    24027     2004     4.0   899911761
8365331    54496      432     3.0   834265074
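
As a quick illustration of the "timestamp" encoding described above, the sketch below converts two of the sampled values into readable UTC datetimes. The frame `sample` and the column name `rated_at` are made up for this example and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy frame reusing two timestamps from the sample rows shown above
sample = pd.DataFrame({"timestamp": [1534689228, 834265074]})

# Interpret the integers as seconds since the Unix epoch, in UTC
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)

print(sample)
```

The conversion is not needed for the rest of the preparation, since ranking by the raw integer gives the same time order.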


## Data Exploration and Reduction

We expect that ratings.csv will not have any null rows. Nevertheless, we can check to make sure and also print a summary of our new dataframe, along with the number of unique user and movie IDs.

In [3]:
# Rename columns
dataset.columns = ["user_ID", "movie_ID", "rating", "timestamp"]

# Print summary of dataset
print(dataset.info())

# Check for any null values in rows
print("Null Rows:", dataset[pd.isnull(dataset).any(axis=1)])

# Print numbers of unique users and movies
print("Number of unique users: {}".format(dataset["user_ID"].nunique()))
print("Number of unique movies: {}".format(dataset["movie_ID"].nunique()))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000095 entries, 0 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   user_ID    int64  
 1   movie_ID   int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 762.9 MB
None
Null Rows: Empty DataFrame
Columns: [user_ID, movie_ID, rating, timestamp]
Index: []
Number of unique users: 162541
Number of unique movies: 59047


Later transformations will increase the size of the dataset, so we select a random subset of users to keep data preparation, training and evaluation times manageable. As per the README.txt file in the "raw data" folder, each user has at least 20 reviews. We sample users rather than rows because sampling rows would leave some users with far fewer than 20 reviews. Evaluation results for a given user should improve as the number of training rows for that user increases. The minimum number of rows per user is therefore a property of the dataset that could be varied later, for example by requiring at least 50 training rows per user. For now, we simply use a random subset of users.

In [4]:
np.random.seed(123)

# Choose random subset of unique users to reduce data preparation and training time cost
user_ID_subset = np.random.choice(dataset["user_ID"].unique(), 
                                size = int(len(dataset["user_ID"].unique()) * 0.33), 
                                replace = False)

dataset = dataset.loc[dataset["user_ID"].isin(user_ID_subset)]

# Print summary of smaller dataset
dataset.info()
print("Number of unique users: {}".format(dataset["user_ID"].nunique()))
print("Number of unique movies: {}".format(dataset["movie_ID"].nunique()))

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8303384 entries, 254 to 25000094
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   user_ID    int64  
 1   movie_ID   int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 316.7 MB
Number of unique users: 53638
Number of unique movies: 43232
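
The difference between sampling users and sampling rows can be seen on a small synthetic example. The sketch below is only an illustration under assumed toy data (100 hypothetical users with exactly 20 rows each). The names `toy`, `by_user`, `by_row` and `min_rows` are invented here, and the threshold of 5 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data: 100 hypothetical users with exactly 20 ratings each
toy = pd.DataFrame({
    "user_ID": np.repeat(np.arange(100), 20),
    "movie_ID": np.tile(np.arange(20), 100),
})

# Option A: keep a third of the users, together with all of their rows
kept_users = rng.choice(toy["user_ID"].unique(), size=33, replace=False)
by_user = toy[toy["user_ID"].isin(kept_users)]

# Option B: keep a third of the rows, irrespective of the user
by_row = toy.sample(frac=0.33, random_state=0)

print("Min rows per user (user sampling):", by_user.groupby("user_ID").size().min())
print("Min rows per user (row sampling): ", by_row.groupby("user_ID").size().min())

# Optional stricter filter: keep only users with at least `min_rows` rows
min_rows = 5
rows_per_user = by_row.groupby("user_ID")["movie_ID"].transform("count")
filtered = by_row[rows_per_user >= min_rows]
```

User sampling preserves each retained user's full history, whereas row sampling thins every history. The final filter shows one way a minimum row count per user could be imposed afterwards.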


## Implicit Feedback Transformation and Leave-One-Out Methodology

For this project, I am interested in building models which can successfully predict whether or not a given user will interact with (that is, watch) a given movie. For this reason, the implicit feedback format will be used, where a value of 1 indicates an interaction and 0 indicates no interaction. Every row in the current dataset records an interaction between a user and a movie, so we can rename the "rating" column to "interaction" and set every value to 1.

In [5]:
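# Want implicit feedback
# Use 1 for 'interaction' and 0 for 'no interaction'
# assign() returns a new dataframe with an integer column, which avoids writing into a filtered slice
dataset = dataset.assign(rating = 1)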

# Rename columns
dataset.columns = ["user_ID", "movie_ID", "interaction", "timestamp"]

dataset.sample(n = 5, axis = 0)

          user_ID  movie_ID  interaction   timestamp
15858612   102781       341            1   888091760
24336347   158112      3897            1  1053819490
20485683   133209      3704            1  1297626782
11856950    76852     32296            1  1119886537
807990       5443      2433            1   944924701


To avoid a look-ahead bias and implement the leave-one-out test set methodology, we need to group rows by user ID and then rank them in time order. The pandas groupby object and the "timestamp" column allow us to do this. The single test set row for each user is selected as the row with a time rank of 1, since we have used descending order. Every other row for a given user is saved in the train set.

In [7]:
# Group by userID and rank timestamps in descending order for each userID
dataset["time_rank"] = dataset.groupby(by = "user_ID")["timestamp"].rank(method = "first", ascending = False)

train_set = dataset[dataset["time_rank"] != 1]
train_set = train_set[["user_ID", "movie_ID", "interaction"]]

print("Training Set:")
print(train_set.sample(n = 5))

test_set = dataset[dataset["time_rank"] == 1]
test_set = test_set[["user_ID", "movie_ID", "interaction"]]

print("Test Set:")
print(test_set.sample(n = 5))

Training Set:
          user_ID  movie_ID  interaction
5596867     36255     45722            1
6899980     44770     34048            1
14554676    94226       608            1
10814658    70176      6874            1
6091949     39500     33660            1
Test Set:
          user_ID  movie_ID  interaction
494681       3399      2987            1
14457181    93602    122886            1
2622146     17452      1282            1
1568280     10502       260            1
22269551   144809      2987            1
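
To make the leave-one-out split concrete, the sketch below applies the same ranking step to a tiny hand-made frame and checks two properties: each user contributes exactly one test row, and that row is that user's most recent interaction. The frame `toy` and its values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_ID":   [1, 1, 1, 2, 2],
    "movie_ID":  [10, 20, 30, 10, 40],
    "timestamp": [100, 300, 200, 50, 60],
})

# Rank each user's rows from newest (rank 1) to oldest;
# method="first" breaks ties by row order, so exactly one row per user gets rank 1
toy["time_rank"] = toy.groupby(by="user_ID")["timestamp"].rank(method="first", ascending=False)

test = toy[toy["time_rank"] == 1]
train = toy[toy["time_rank"] != 1]

# Each user contributes exactly one test row
assert (test.groupby("user_ID").size() == 1).all()

# The test row is never older than any of the same user's training rows
latest_train = train.groupby("user_ID")["timestamp"].max().sort_index()
test_time = test.set_index("user_ID")["timestamp"].sort_index()
assert (test_time >= latest_train).all()

print(test[["user_ID", "movie_ID"]])
```

Here user 1's held-out movie is 20 (timestamp 300) and user 2's is 40 (timestamp 60), so no training row is newer than the test row it is meant to predict.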


For the networks to be effective at predicting which movies users will and won't interact with, they also need to be exposed to unseen movie samples, where the interaction is 0. The dataset doesn't include such rows. However, we can randomly select movies which a given user has not seen, assume that the user is not interested in these movies, and append these rows to the dataset with an interaction of 0. This is a simplification, since an unseen movie is not necessarily an unwanted one, but this kind of negative sampling is widely used for implicit feedback data and tends to work well in practice.

In [8]:
# Define a function which can create rows for users and movies they haven't interacted with
# Unseen movies are labelled with an interaction of 0
# Ratio determines the final ratio of unseen to seen movies for each user
def append_unseen_samples(overall_dataset, subset, ratio):
    user_movie_set = set(zip(subset["user_ID"], subset["movie_ID"]))
    
    all_movie_IDs = overall_dataset["movie_ID"].unique()
    
    users = []
    movies = []
    interactions = []
    
    count = 0
    start = default_timer()
    for user, movie in user_movie_set:
        clear_output(wait = True)
        
        users.append(user)
        movies.append(movie)
        interactions.append(1)
        
        for x in range(ratio):
            unseen_movie = np.random.choice(all_movie_IDs)
            
            # Check if user has interacted with the randomly chosen movie
            # If interaction has occurred, randomly choose new movie until unseen movie is found
            while (user, unseen_movie) in user_movie_set:
                unseen_movie = np.random.choice(all_movie_IDs)
                
            users.append(user)
            movies.append(unseen_movie)
            interactions.append(0)
            
        count += 1
        
        # Print percentage of loop completion and current time elapsed to predict total runtime for loop
        print("Overall Loop Progress: {:.2f}%".format((count/len(user_movie_set))*100))
        stop = default_timer()
        print("Current Overall Runtime: {:.2f} minutes".format((stop - start)/60))
        
    new_subset = pd.DataFrame(list(zip(users, movies, interactions)), columns = ["user_ID", "movie_ID", "interaction"])    
            
    ordered_subset = new_subset.sort_values(by = "user_ID", ascending = True)
    return ordered_subset

In [9]:
train_set = append_unseen_samples(dataset, train_set, ratio = 4)

train_set.to_csv("D:/Movie Recommendation System Project/data/data preparation/dataset frac=0.33, ratio=4/train_set.csv")
test_set.to_csv("D:/Movie Recommendation System Project/data/data preparation/dataset frac=0.33, ratio=4/test_set.csv")

Overall Loop Progress: 100.00%
Current Overall Runtime: 202.78 minutes
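
The loop above refreshes the cell output on every iteration and draws one candidate movie at a time, which likely accounts for much of the runtime. It also checks candidates only against the training rows, so a user's held-out test movie can be drawn as a negative. The sketch below is one possible per-user alternative, not the function used to produce the saved files. The name `sample_negatives` and its `seed` parameter are hypothetical, and it has not been benchmarked against the original.

```python
import numpy as np
import pandas as pd

def sample_negatives(all_interactions, positives, ratio, seed=0):
    """Hypothetical per-user variant of append_unseen_samples (a sketch, not the original)."""
    rng = np.random.default_rng(seed)
    all_movie_IDs = all_interactions["movie_ID"].unique()

    # Every movie each user has interacted with, including the held-out test movie
    seen = all_interactions.groupby("user_ID")["movie_ID"].apply(set)

    frames = [positives[["user_ID", "movie_ID", "interaction"]]]
    for user, n_pos in positives.groupby("user_ID").size().items():
        candidates = np.setdiff1d(all_movie_IDs, list(seen.loc[user]))
        if len(candidates) == 0:
            continue  # this user has interacted with every movie

        # Sampled with replacement, so duplicate negatives are possible, as in the original loop
        negatives = rng.choice(candidates, size=ratio * n_pos, replace=True)
        frames.append(pd.DataFrame({"user_ID": user, "movie_ID": negatives, "interaction": 0}))

    return pd.concat(frames, ignore_index=True).sort_values(by="user_ID", kind="stable")

# Toy usage: rows 2 and 4 play the role of the held-out test rows
toy = pd.DataFrame({
    "user_ID":     [1, 1, 1, 2, 2],
    "movie_ID":    [10, 20, 30, 10, 40],
    "interaction": [1, 1, 1, 1, 1],
})
toy_train = toy.iloc[[0, 1, 3]]

result = sample_negatives(toy, toy_train, ratio=2)
print(result)
```

In the toy data, user 1 can only receive movie 40 as a negative, and user 2 can only receive movies 20 or 30, because each user's held-out movie is excluded along with their training movies. Whether to exclude the test movie in this way is a design choice that changes the resulting training set relative to the one saved above.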
