# Movie Recommendations Dataset Preparation

The data is written in libSVM format, the sparse text format used by the libSVM (Support Vector Machine) library.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.datasets import dump_svmlight_file

# Libraries for Sagemaker
import boto3
import sagemaker.amazon.common as smac

**Classification with SageMaker**

- Input features: *userId*, *movieId*
- Target feature: *rating*
- Objective: Predict how a user would rate a particular movie

MovieLens overview: https://grouplens.org/datasets/movielens/

Dataset: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

*F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.*

## Load Movies and Parse Genre

Download the movie dataset from *GroupLens*.

In [2]:
!wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip

--2020-04-19 20:05:42--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2020-04-19 20:05:42 (6.41 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]



In [None]:
!unzip ml-latest-small.zip
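
If `wget` or `unzip` is not available in the notebook environment, the same download-and-extract step can be sketched with the Python standard library. The helper names `download_file` and `extract_archive` are hypothetical, and the download call is left commented out so the cell does not reach the network unintentionally.

```python
import urllib.request
import zipfile
from pathlib import Path

DATASET_URL = "http://files.grouplens.org/datasets/movielens/ml-latest-small.zip"


def download_file(url, target):
    """Download url to target and return the target path."""
    urllib.request.urlretrieve(url, target)
    return Path(target)


def extract_archive(zip_path, dest="."):
    """Extract every member of zip_path into dest and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Uncomment to fetch and unpack the dataset:
# extract_archive(download_file(DATASET_URL, "ml-latest-small.zip"))
```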

## Create dataframe from the movie files

In [3]:
df_movies = pd.read_csv(r'ml-latest-small/movies.csv')

In [5]:
df_movies.shape # Understand the array

(9742, 3)

In [6]:
df_movies.head() # data check

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
genre_list = df_movies.genres.map(lambda value: value.split('|')) # splitting on pipe

In [8]:
genre_list[:10] # Display the genre lists of the first 10 movies

0    [Adventure, Animation, Children, Comedy, Fantasy]
1                       [Adventure, Children, Fantasy]
2                                    [Comedy, Romance]
3                             [Comedy, Drama, Romance]
4                                             [Comedy]
5                            [Action, Crime, Thriller]
6                                    [Comedy, Romance]
7                                [Adventure, Children]
8                                             [Action]
9                        [Action, Adventure, Thriller]
Name: genres, dtype: object

### Create a function to find the unique genres

In [9]:
def unique_genres(genre_list):
    # Collect every genre name once, across all movies
    unique_list = set()

    for items in genre_list:
        for selection in items:
            unique_list.add(selection)
    return sorted(unique_list)

In [10]:
genre = unique_genres(genre_list)

In [11]:
genre, len(genre) # list of genres

(['(no genres listed)',
  'Action',
  'Adventure',
  'Animation',
  'Children',
  'Comedy',
  'Crime',
  'Documentary',
  'Drama',
  'Fantasy',
  'Film-Noir',
  'Horror',
  'IMAX',
  'Musical',
  'Mystery',
  'Romance',
  'Sci-Fi',
  'Thriller',
  'War',
  'Western'],
 20)
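
For comparison, the nested loop in `unique_genres` can be collapsed with `itertools.chain`. This is only a sketch: `unique_genres_compact` and the `sample` list are hypothetical names, not part of the notebook.

```python
from itertools import chain


def unique_genres_compact(genre_lists):
    """Return the sorted unique genre names across all per-movie lists."""
    return sorted(set(chain.from_iterable(genre_lists)))


sample = [["Comedy", "Romance"], ["Action"], ["Comedy", "Drama"]]
print(unique_genres_compact(sample))
```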

### Table of genre for each movie

In [12]:
df_genre = pd.DataFrame(index=range(df_movies.shape[0]), columns=genre)

In [13]:
df_genre = df_genre.fillna(0)

In [14]:
df_genre.shape # display array shape

(9742, 20)

In [15]:
df_genre.head() # data check, should be all 0-initialized

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Fill genre for each movie

The next loop sets a positive (1) in every genre column that applies to the movie.

In [16]:
for row, movie_genre in enumerate(genre_list):
    df_genre.loc[row, movie_genre] = 1  # single .loc call avoids chained assignment

In [17]:
df_genre.head() # data check - updated

Unnamed: 0,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
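
The allocate-then-fill steps above can also be approximated in one call with pandas' `Series.str.get_dummies`, which splits on a separator and builds the indicator columns directly. This is an alternative sketch, not the notebook's method, and `toy_movies` is a hypothetical stand-in for `df_movies`.

```python
import pandas as pd

# Toy stand-in for df_movies; the real notebook loads movies.csv instead.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children", "Comedy", "(no genres listed)"],
})

# One column per distinct genre, 1 where the movie carries that genre.
toy_genre = toy_movies["genres"].str.get_dummies(sep="|")
print(toy_genre)
```

The resulting columns are sorted by name, which matches the ordering produced by `sorted(unique_list)` in `unique_genres`.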


In [18]:
# Some movies do not have a genre listed
df_genre[df_genre['(no genres listed)'] > 0].head()

      (no genres listed)  Action  Adventure  Animation  Children  Comedy  \
8517                   1       0          0          0         0       0   
8684                   1       0          0          0         0       0   
8687                   1       0          0          0         0       0   
8782                   1       0          0          0         0       0   
8836                   1       0          0          0         0       0   

In [19]:
# Join the genre indicator columns onto the movie descriptions
df_movies = df_movies.join(df_genre)

In [20]:
df_movies.head() # data check with a combined genre field

Unnamed: 0,movieId,title,genres,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
df_movies.to_csv(r'ml-latest-small/movies_genre.csv', index=False)

### Load Rating given by each user for a movie

In [22]:
df_ratings = pd.read_csv(r'ml-latest-small/ratings.csv')

In [23]:
df_ratings.userId.unique().shape

(610,)

In [24]:
df_ratings.movieId.unique().shape

(9724,)

In [25]:
df_ratings = df_ratings.drop(columns=['timestamp'])  # timestamp is not used as a feature

### Merge Ratings with Movie Description

In [26]:
df_movie_ratings = pd.merge(df_ratings,df_movies,on='movieId')

In [27]:
df_movie_ratings.head(2)

Unnamed: 0,userId,movieId,rating,title,genres,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,5,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
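
A merge like this can silently duplicate or drop rows if the key is not unique on the movie side or if some ratings reference an unknown movie. The sketch below shows one way to guard against both, using the `validate` parameter of `pd.merge`. The toy frames are hypothetical stand-ins for `df_ratings` and `df_movies`.

```python
import pandas as pd

toy_ratings = pd.DataFrame(
    {"userId": [1, 5, 7], "movieId": [1, 1, 2], "rating": [4.0, 4.0, 3.5]}
)
toy_movies = pd.DataFrame(
    {"movieId": [1, 2], "title": ["Toy Story (1995)", "Jumanji (1995)"]}
)

# validate="many_to_one" raises MergeError if a movieId is duplicated in toy_movies.
merged = pd.merge(toy_ratings, toy_movies, on="movieId", how="inner",
                  validate="many_to_one")

# Ratings whose movieId has no match in toy_movies would be dropped by the inner join.
dropped = len(toy_ratings) - len(merged)
print(f"{len(merged)} rows merged, {dropped} ratings dropped")
```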


## Training and Validation Set

### Target Variables as first column followed by input features

In [41]:
# Training =  70% of data
# Test = 30% of data
# Randomize the dataset
np.random.seed(256)
l = list(df_movie_ratings.index)
np.random.shuffle(l)
df = df_movie_ratings.iloc[l]

In [42]:
rows = df.shape[0]
train = int(0.7 * rows)
test = rows - train

In [44]:
rows, train, test # data check 

(100836, 70585, 30251)
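
The same shuffle-and-split idea can be sketched with `DataFrame.sample`, which shuffles rows reproducibly through `random_state`. This is an illustrative alternative on a hypothetical ten-row frame; it will not reproduce the exact row order that `np.random.shuffle` gives above.

```python
import pandas as pd

toy = pd.DataFrame({"rating": range(10)})

# frac=1 returns every row in random order; random_state makes it repeatable.
shuffled = toy.sample(frac=1, random_state=256)

n_train = int(0.7 * len(shuffled))
train_df = shuffled.iloc[:n_train]   # first 70% of the shuffled rows
test_df = shuffled.iloc[n_train:]    # remaining 30%
print(len(train_df), len(test_df))
```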

In [45]:
df.shape

(100836, 25)

In [46]:
df.head(2) # data display check

Unnamed: 0,userId,movieId,rating,title,genres,(no genres listed),Action,Adventure,Animation,Children,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
68520,610,108932,4.0,The Lego Movie (2014),Action|Adventure|Animation|Children|Comedy|Fan...,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
15916,313,3671,3.0,Blazing Saddles (1974),Comedy|Western,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## SageMaker Factorization Machines

The algorithm expects all values to be in *float32* format, so convert the *target variable* into *float32*.

In [47]:
y = df['rating'].astype(np.float32).to_numpy()  # 1-D float32 array of ratings

In [48]:
len(y) # length of y, which should equal 'rows'

100836

In [49]:
y.dtype # data check for type

dtype('float32')

### Create two different training datasets.

- *Training 1*: rating, user id, movie id
- *Training 2*: rating, user id, movie id, movie genre

In [50]:
columns_user_movie = ['userId','movieId'] # training 1
columns_all = columns_user_movie + genre # training 2

In [51]:
columns_all # data check

['userId',
 'movieId',
 '(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western']

### Store a copy of the *user id*, *movie id*, and *rating*

In [53]:
df[['rating','userId','movieId']][:train].to_csv(r'ml-latest-small/user_movie_Train.csv', index=False)
df[['rating','userId','movieId']][train:].to_csv(r'ml-latest-small/user_movie_Test.csv', index=False)

## Use of *One Hot Encoder*

*One Hot Encoder* is used for categorical encoding, usually on *pandas* dataframes that mix non-numeric and numeric values. - [Dante Gates](https://dantegates.github.io/2018/05/04/a-fast-one-hot-encoder-with-sklearn-and-pandas.html)

"Though label encoding is straight but it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy/order in them." - [TowardsDataScience](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd)

In [58]:
encoder = preprocessing.OneHotEncoder(dtype=np.float32, categories='auto') # auto adds future compatibility post-0.22 sklearn

In [59]:
X = encoder.fit_transform(df[columns_user_movie]) # one-hot encode userId and movieId into sparse matrix X

In [60]:
df.userId.unique().shape, df.movieId.unique().shape # counts of unique users and unique movies

((610,), (9724,))

### Write Dimensions

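The dimension is the number of unique users plus the number of unique movies, which equals the number of columns in the one-hot encoded matrix. It is saved to a file for use during *training* and *testing*.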

In [63]:
dim_movie = df.userId.unique().shape[0] + df.movieId.unique().shape[0]
with open(r'ml-latest-small/movie_dimension.txt','w') as f:
    f.write(str(dim_movie))

In [65]:
X # display matrix X characteristics

<100836x10334 sparse matrix of type '<class 'numpy.float32'>'
	with 201672 stored elements in Compressed Sparse Row format>

In [66]:
X.shape[1]

10334
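
The numbers above follow directly from how one-hot encoding works: 610 users + 9724 movies = 10334 columns, and each of the 100836 rows stores exactly two 1s (201672 elements). A minimal sketch on a hypothetical four-row frame shows the same relationship; `toy`, `toy_encoder` and `toy_X` are illustrative names only.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({"userId": [1, 1, 2, 3], "movieId": [10, 20, 10, 30]})

toy_encoder = preprocessing.OneHotEncoder(dtype=np.float32, categories="auto")
toy_X = toy_encoder.fit_transform(toy[["userId", "movieId"]])

n_users = toy["userId"].nunique()
n_movies = toy["movieId"].nunique()

# One column per unique user plus one column per unique movie.
print(toy_X.shape, n_users + n_movies)
# Each row stores exactly two 1s: one user column and one movie column.
print(toy_X.nnz)
```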

### Create a sparse matrix RecordIO file

*RecordIO* is a file format for storing a sequence of records. Here, I will use the *SageMaker* library to write the sparse matrix and its labels as protocol buffer tensors.

In [67]:
def write_sparse_recordio_file (filename, x, y=None):
    with open(filename,'wb') as f:
        smac.write_spmatrix_to_sparse_tensor(f,x,y)

### Training RecordIO file

In [68]:
write_sparse_recordio_file(r'ml-latest-small/user_movie_train.recordio',X[:train],y[:train])

### Test RecordIO file

In [69]:
write_sparse_recordio_file(r'ml-latest-small/user_movie_test.recordio',X[train:],y[train:])

## Create *libSVM* formatted file

Ref: [libSVM Data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/)

This allows output of rating, user_index:value, movie_index:value
- Example: 5.0 314:1 215:1
- Rating of 5.0, User with index 314, Movie with index 215 in *One Hot Encoder*
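
To make the line layout concrete, here is a small sketch that parses one such line back into a label and a sparse feature mapping. `parse_libsvm_line` is a hypothetical helper; it ignores optional libSVM extras such as `qid:` fields and trailing comments.

```python
def parse_libsvm_line(line):
    """Split one libSVM line into (label, {feature_index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


print(parse_libsvm_line("5.0 314:1 215:1"))
```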

This file is used for:
1. Direct training with the [*libFM*](http://www.libfm.org/libfm-1.42.manual.pdf) binary in local mode
1. Running inference against the *SageMaker* cloud, so that only sparse input is sent to the *SageMaker Prediction Service*

In [70]:
# Store in libSVM
dump_svmlight_file(X[:train],y[:train],r'ml-latest-small/user_movie_train.svm')
dump_svmlight_file(X[train:],y[train:],r'ml-latest-small/user_movie_test.svm')
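
As a sanity check, a file written by `dump_svmlight_file` can be read back with scikit-learn's `load_svmlight_file`. The sketch below round-trips a hypothetical 2x4 matrix through an in-memory buffer instead of a file on disk. Passing `n_features` on load is a deliberate choice here, since the width cannot otherwise be recovered if the last columns are empty.

```python
import io

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

toy_X = csr_matrix(np.array([[1, 0, 0, 1], [0, 1, 1, 0]], dtype=np.float32))
toy_y = np.array([5.0, 3.0], dtype=np.float32)

buffer = io.BytesIO()
dump_svmlight_file(toy_X, toy_y, buffer)  # feature indices are zero-based by default
print(buffer.getvalue().decode())

buffer.seek(0)
# n_features keeps the width fixed even if trailing columns are all zero.
X_back, y_back = load_svmlight_file(buffer, n_features=toy_X.shape[1], zero_based=True)
```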

## Create two lookup files

- File 1: Categorical Movie ID and corresponding Movie Index in One Hot Encoded Table
- File 2: Categorical User ID and corresponding User Index in One Hot Encoded Table

Why? Useful for predicting how a certain user would rate all the movies, or how all users would rate a certain movie.

In [71]:
list_of_movies = df.movieId.unique()

# User 1 and all movies. Using np.full (https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.full.html)
df_user_movie = pd.DataFrame({'userId': np.full(len(list_of_movies), 1), 'movieId' : list_of_movies})

In [72]:
df_user_movie[columns_user_movie].head() # data check

Unnamed: 0,userId,movieId
0,1,108932
1,1,3671
2,1,179819
3,1,2506
4,1,3147


In [73]:
list_of_movies # This is an array

array([108932,   3671, 179819, ..., 132618, 139747,   4197])

### Transform to *One Hot Encoding* Using Existing Encoder

In [74]:
X = encoder.transform(df_user_movie[columns_user_movie])

### Create File 1

Store *movieId* and corresponding *One Hot Encoder* entries

In [75]:
dump_svmlight_file(X,list_of_movies,r'ml-latest-small/one_hot_enc_movies.svm')
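
The same movie-to-column mapping can also be derived in memory from the fitted encoder's `categories_` attribute, which holds the sorted unique values of each input column. This is a sketch on a hypothetical three-row frame, not a replacement for the lookup file; `movie_to_column` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 30]})
toy_encoder = preprocessing.OneHotEncoder(dtype=np.float32, categories="auto")
toy_encoder.fit(toy[["userId", "movieId"]])

user_ids, movie_ids = toy_encoder.categories_  # sorted unique values per column

# Movie columns come after all user columns in the encoded matrix.
movie_to_column = {int(m): len(user_ids) + i for i, m in enumerate(movie_ids)}
print(movie_to_column)
```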

### Create File 2

Store *User ID* and corresponding One Hot Encoder entries

In [77]:
list_of_users = df.userId.unique()

In [78]:
list_of_users.shape

(610,)

In [79]:
list_of_users[:10]

array([610, 313,  98, 600, 292, 367, 160, 580, 352,  32])

In [80]:
# All users and movie 1
df_user_movie = pd.DataFrame({'userId': list_of_users, 'movieId' : np.full(len(list_of_users),1)})

In [82]:
df_user_movie.head() # data check

Unnamed: 0,userId,movieId
0,610,1
1,313,1
2,98,1
3,600,1
4,292,1


In [83]:
# Transform to one hot encoding (with existing encoder)
X = encoder.transform(df_user_movie[columns_user_movie])

In [84]:
# Store userId and corresponding one hot encoded entries
dump_svmlight_file(X,list_of_users,r'ml-latest-small/one_hot_enc_users.svm')