# Module 7 Assignment Ian Feekes

This notebook covers the Module 7 assignment for Ian Feekes. I can be contacted at ifeekes@sandiego.edu (916-333-9381).

Please feel free to contact me if this work does not meet the rubric or expectations, and I will promptly and gratefully make the necessary adjustments!

The summary of my findings can be found at the bottom of this Colab notebook, and all input and output files will be placed in the same [Google Drive folder](https://drive.google.com/drive/folders/1fwO2g027J6qu7dLiJY3MTQmJFnt3JJHm?usp=sharing) in which this file resides.

## Initial Prompt

For this assignment you will build an Artificial Neural Network to predict movie ratings.

Recommendation Systems have become pervasive – Amazon and Netflix are good examples of recommendation systems. One of the predominant applications of recommendation systems is to predict movie ratings. In this assignment, the goal is to build a deep neural network-based recommendation system.

### Instructions:



* Download the dataset
* Load the dataset – the ratings.csv file
* Prepare the train and validation datasets
* Construct the Neural Network
* Compile the model
* Fit the training dataset to the Neural Network model
* Use the validation dataset to generate predictions
* Evaluate the performance of the model – use the square root of the mean squared error
* Summarize your findings

## Initial Configuration

### Initial Imports

The imports below are used to construct the neural network, handle data frames, and perform numerical computation.

In [160]:
import keras                                           # Used for creating the model
import math                                            # Used for RMSE calculations in performance
import numpy as np                                     # Used for various math and matrix operations
import pandas as pd                                    # Data frame operations
from google.colab import drive                         # Used for allowing drive mounting
from sklearn.model_selection import train_test_split   # Splits the testing training data
from sklearn.preprocessing import LabelEncoder         # Labels for string to numerical data representation

### Specify the Hyperparameters that will be Used in the Training Cycle

In [145]:
# Model Hyperparameters
encodingDimension = 32                   # Size of encoded representations (an autoencoder setting; not used by the model below)
numEpochs = 10                           # Number of training epochs
batchSize = 32                           # Batch size for the model
shuffleVar = True                        # Whether or not to shuffle the training data. Reduces the model's tendency to memorize
nFactors = 150                           # Number of latent factors (embedding width) per user and per movie

### Specify Other Global Variables


In [146]:
dataTrainFileName = '/content/drive/My Drive/Colab Notebooks/Deep_Learning/Module_7/data_set/ml-100k/u.data' # Training file name declared for code-readability
columnNames = ['userId', 'itemId', 'rating', 'timestamp']                                                    # Column names to give the rating training data

## Import the Dataset
As per the prompt instructions, the cells below load the MovieLens 100K ratings file (`ml-100k/u.data`), which holds the user-movie ratings used to train and validate the recommendation model.

### Configure Drive for File Imports
The cell below mounts my Google Drive so that the ratings file stored there can be read like a local file. In practice, we could instead fetch the dataset through an API or library to avoid manual file handling.

In [147]:
# Mount the drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Read Drive File into Dataframe

In [148]:
# Read in the data from our Google Drive file. We have our column names specified since this file has special separators and no explicit column names
df = pd.read_csv(dataTrainFileName, sep='\t', header=None, names = columnNames)

# Break if the data seems to be empty
assert(df.shape[0] > 0 and df.shape[1] > 0)

df.head()

Unnamed: 0,userId,itemId,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
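To illustrate why the separator and column names are passed explicitly, here is a minimal sketch that parses the first three rows shown above from an in-memory string instead of the Drive file. The names `sample`, `column_names`, and `sample_df` are only for this illustration.

```python
import io
import pandas as pd

# First three rows of the ratings file as shown in the output above:
# tab-separated values with no header row
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n22\t377\t1\t878887116\n"
column_names = ['userId', 'itemId', 'rating', 'timestamp']

# header=None tells pandas the first line is data; names= supplies the column labels
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=column_names)
print(sample_df.dtypes)
```

Without `header=None`, pandas would treat the first rating row as column labels and silently drop it from the data.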


## Prepare the Train and Validation Datasets

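The cell below splits the ratings data frame into training (80%) and test (20%) sets using a fixed random seed. The test set serves as the validation set for this assignment.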

In [149]:
# Split data
X_train, X_test = train_test_split(df, test_size=0.2, random_state=1)

# Break flow of execution if there are any abnormalities with the imported training and testing data
assert(len(X_train) > 0)
assert(len(X_test) > 0)

# Print out for output
print(f"Shape of train data: {X_train.shape}")
print(f"Shape of test data: {X_test.shape}")

Shape of train data: (80000, 4)
Shape of test data: (20000, 4)
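As a sanity check on the shapes above, here is a rough sketch of what an 80/20 shuffled split does to 100,000 row indices. This is not scikit-learn's implementation: `shuffle_split_indices` is a hypothetical helper, and because it uses a different random number generator it will not select the same rows as `train_test_split`.

```python
import numpy as np

def shuffle_split_indices(n_rows, test_size=0.2, seed=1):
    """Shuffle row indices and cut them into train and test parts (illustration only)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)           # Random ordering of 0..n_rows-1
    n_test = int(round(n_rows * test_size))   # Number of rows held out for testing
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split_indices(100_000)
print(len(train_idx), len(test_idx))
```

The key property is that every row lands in exactly one of the two sets, so no rating is used for both training and evaluation.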


## Exploratory Analysis

In [150]:
# Get the number of unique entities in the movie and user columns
movie_dim = df.itemId.nunique()
user_dim = df.userId.nunique()

# Print out for output
print("Number of unique movies:", movie_dim)
print("Number of unique users:", user_dim)

Number of unique movies: 1682
Number of unique users: 943
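These counts also show how sparse the rating data is. The short calculation below is a sketch that only uses numbers printed earlier in this notebook (80,000 training rows plus 20,000 test rows); the variable names are my own.

```python
n_users = 943                  # Unique users printed above
n_movies = 1682                # Unique movies printed above
n_ratings = 80_000 + 20_000    # Train rows plus test rows from the split

possible_pairs = n_users * n_movies      # Every user rating every movie
density = n_ratings / possible_pairs     # Fraction of pairs that actually have a rating

print(f"Possible user-movie pairs: {possible_pairs}")
print(f"Fraction of pairs with a rating: {density:.4f}")
```

Only a small fraction of all possible user-movie pairs is observed, which is why the model has to generalize from learned embeddings rather than look ratings up.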


## Construct the Neural Network



The Neural Network Architecture consists of 4 layers:

1. **Input Layer**

This layer takes the movie and user IDs as input.

2. **Embedding Layer**

This layer consists of embeddings for both movies and users. These are randomly initialized values that are updated during training. The embeddings play the role of the latent factors in matrix factorization. The objective is to learn embedding values that minimize the error between the actual and predicted ratings.

3. **Fully Connected Layers**

The movie and user embedding vectors are first flattened and then concatenated. The concatenated vector is passed to a neural network consisting of 3 hidden layers: the first has 128 neurons, the second has 64 neurons, and the third has 32 neurons.



4. **Output Layer**

The output layer, represented by the `output` variable, consists of 1 neuron that produces the predicted rating a user would give to a movie.
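To make the data flow through these four stages concrete, here is a minimal NumPy sketch of a single forward pass. It is not the Keras model: the weights are random and untrained, the helper names (`dense`, `forward`, `user_table`, `movie_table`) are hypothetical, and I am assuming that user and movie IDs start at 1, which is why each table gets one extra row.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 943, 1682, 150

# Embedding tables: one row of latent factors per ID (plus an unused row 0)
user_table = rng.normal(scale=0.05, size=(n_users + 1, n_factors))
movie_table = rng.normal(scale=0.05, size=(n_movies + 1, n_factors))

def dense(x, n_out, relu=True):
    """Fully connected layer with freshly drawn random weights (illustration only)."""
    w = rng.normal(scale=0.05, size=(x.shape[1], n_out))
    b = np.zeros(n_out)
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def forward(movie_ids, user_ids):
    movie_vecs = movie_table[movie_ids]                        # Embedding lookup, shape (batch, 150)
    user_vecs = user_table[user_ids]                           # Embedding lookup, shape (batch, 150)
    concat = np.concatenate([movie_vecs, user_vecs], axis=1)   # Shape (batch, 300)
    h = dense(concat, 128)                                     # Hidden layer 1
    h = dense(h, 64)                                           # Hidden layer 2
    h = dense(h, 32)                                           # Hidden layer 3
    return dense(h, 1, relu=False)                             # One linear output per pair

ratings = forward(np.array([242, 302, 377]), np.array([196, 186, 22]))
print(ratings.shape)
```

Since the weights are untrained, the output values are meaningless; the sketch is only meant to show the shapes at each stage.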

In [151]:
# User Embedding Layer
input_users_layer = keras.layers.Input(shape=[1])
embed_users_layer = keras.layers.Embedding(user_dim + 1, nFactors, name="user_embeddings")(input_users_layer)
users_output = keras.layers.Flatten()(embed_users_layer)

# Movie Embedding Layer
input_movie_layer = keras.layers.Input(shape=[1])
embed_movie_layer = keras.layers.Embedding(movie_dim + 1, nFactors, name="movie_embeddings")(input_movie_layer)
movie_output = keras.layers.Flatten()(embed_movie_layer)

# concatenate features
concat = keras.layers.Concatenate()([movie_output, users_output])

# add fully-connected layers
# NOTE: dropout_1 and dropout_2 are created but never consumed (fc2 reads from fc1 and
# fc3 reads from fc2), so the compiled model does not apply dropout
fc1 = keras.layers.Dense(128, activation='relu')(concat)
dropout_1 = keras.layers.Dropout(0.2, name='dropout_1')(fc1)
fc2 = keras.layers.Dense(64, activation='relu')(fc1)
dropout_2 = keras.layers.Dropout(0.2, name='dropout_2')(fc2)
fc3 = keras.layers.Dense(32, activation='relu')(fc2)
output = keras.layers.Dense(1)(fc3)

# Create model and compile it
model2 = keras.Model([input_movie_layer, input_users_layer], output)
model2.compile('adam', 'mean_squared_error')

In [152]:
model2.summary()

Model: "model_9"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_20 (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 input_19 (InputLayer)          [(None, 1)]          0           []                               
                                                                                                  
 movie_embeddings (Embedding)   (None, 1, 150)       252450      ['input_20[0][0]']               
                                                                                                  
 user_embeddings (Embedding)    (None, 1, 150)       141600      ['input_19[0][0]']               
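The two embedding parameter counts in the summary above can be reproduced by hand. The sketch below does so and, as an extra, computes the dense-layer counts from the layer sizes in the code (weights plus one bias per neuron); those dense counts are my own calculation, not values read from the summary. The helper names are hypothetical.

```python
def embedding_params(n_ids, n_factors):
    """Rows are n_ids + 1 because the Embedding layers were built with dim + 1."""
    return (n_ids + 1) * n_factors

def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output neuron."""
    return n_in * n_out + n_out

n_factors = 150
movie_params = embedding_params(1682, n_factors)
user_params = embedding_params(943, n_factors)

# The concatenated input has 2 * n_factors values, followed by 128, 64, 32, and 1 neurons
layer_sizes = [2 * n_factors, 128, 64, 32, 1]
dense_total = sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

total_params = movie_params + user_params + dense_total
print(movie_params, user_params, dense_total, total_params)
```

Almost all of the parameters sit in the two embedding tables, so the number of factors has a much larger effect on model size than the dense layers do.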
                                                                                            

## Fit the Training Dataset to the Neural Network Model

In [153]:
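hist = model2.fit([X_train.itemId, X_train.userId], X_train.rating,
               batch_size = batchSize,
               epochs = numEpochs,
               shuffle = shuffleVar,      # Keras shuffles by default; passed explicitly so the hyperparameter above is used
               verbose=1)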

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Use the Validation Dataset to Generate Predictions

In [154]:
# Number of predictions we'd like to see from the test set
numPredictionsToTest = 10

# Predict them and store them in an array
predictions = model2.predict([X_test.itemId.head(numPredictionsToTest),
                              X_test.userId.head(numPredictionsToTest)])

# Print each prediction next to the actual rating
for i in range(numPredictionsToTest):
    print(predictions[i], X_test.rating.iloc[i])

[4.55851] 5
[4.219304] 5
[4.7940183] 5
[1.9933554] 4
[2.946805] 4
[4.4410677] 2
[3.1841278] 2
[1.5745565] 1
[4.770983] 4
[3.7763526] 5

In [155]:
# Generate predictions for the full test set
predictions = model2.predict([X_test.itemId, X_test.userId])

# Flatten the (n, 1) prediction matrix into a 1-D array of floats
predictions = predictions.flatten()

## Evaluate the Performance of the Model

Note: performance is evaluated with the root mean squared error (RMSE), which is the square root of the MSE, as required by the prompt.
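For reference, the metric can be written as follows. The symbols are my own notation: $y_i$ is the actual rating of the $i$-th test row, $\hat{y}_i$ is the model's prediction for it, and $n$ is the number of test rows (20,000 in this notebook).

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because of the square root, RMSE is expressed in the same unit as the ratings (stars), which makes it easier to interpret than MSE.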

In [156]:
# Calculate MSE and RMSE
MSE = np.square(np.subtract(X_test.rating, predictions)).mean() 
RMSE = math.sqrt(MSE)

# Print MSE and RMSE
print("Mean Square Error:", MSE)
print("Root Mean Square Error:", RMSE)

Mean Square Error: 1.0339088803303962
Root Mean Square Error: 1.0168130999994032


## Summarize Findings

Seven models were evaluated for the movie recommendation system. They are listed in greater detail below this summary.

Overall, the best network predicted movie ratings with an RMSE of about 0.96 (Model 5), meaning its predictions were typically off by roughly one star. In these runs, reducing the batch size from 64 to 16 lowered the RMSE, while training for more than 10 epochs or using more than 25 factors raised it, which suggests overfitting. Further reductions of the batch size, together with stopping training at the optimal epoch before overfitting occurs, may therefore yield noteworthy improvements in prediction performance.
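One common way to find the point where additional epochs start to hurt is early stopping on a validation loss. The sketch below shows the idea in plain Python; `best_epoch` is a hypothetical helper and the loss values are made up for illustration, not results from this notebook. I believe Keras provides the same behavior through its `keras.callbacks.EarlyStopping` callback, but that was not used here.

```python
def best_epoch(val_losses, patience=2):
    """Return (1-based epoch, loss) of the best validation loss, stopping after
    `patience` consecutive epochs without improvement."""
    best_loss = float("inf")
    best_idx = -1
    waited = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx, waited = loss, idx, 0   # New best: reset the counter
        else:
            waited += 1                                   # No improvement this epoch
            if waited >= patience:
                break                                     # Stop training here
    return best_idx + 1, best_loss

# Made-up validation losses that improve and then degrade (illustration only)
illustrative_losses = [1.20, 1.02, 0.97, 0.99, 1.05, 1.13]
epoch, loss = best_epoch(illustrative_losses)
print(epoch, loss)
```

This requires holding out part of the training data as a validation set, so that the test set is not used to choose the stopping point.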

Additionally, the actual highly-recommended movies in the system could be better visualized by mapping/encoding the movie IDs to the actual titles and sorting them in descending order for each user. 
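A minimal sketch of that mapping is shown below using pandas. All data in it is hypothetical: the predicted ratings and the titles "Movie A" through "Movie C" are placeholders, and `scored`, `titles`, and `top_n` are names I made up. With the real dataset, the titles would have to come from a movie metadata file, which this notebook does not load.

```python
import pandas as pd

# Hypothetical predicted ratings for two users (placeholder values)
scored = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "itemId": [10, 11, 12, 10, 12],
    "predicted": [3.1, 4.6, 2.2, 4.9, 4.0],
})

# Hypothetical ID-to-title lookup table
titles = pd.DataFrame({
    "itemId": [10, 11, 12],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Attach titles, then sort each user's movies by predicted rating, highest first
ranked = (
    scored.merge(titles, on="itemId", how="left")
          .sort_values(["userId", "predicted"], ascending=[True, False])
)

# Keep the top 2 recommendations per user
top_n = ranked.groupby("userId").head(2).reset_index(drop=True)
print(top_n[["userId", "title", "predicted"]])
```

For real recommendations, the movies a user has already rated would usually be excluded before ranking.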

While the ratings are already bounded on a scale of 1 through 5 stars, normalizing them to the 0-1 range might have improved the model's behavior with respect to overfitting and underfitting.
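As a sketch of what such a normalization could look like, the helpers below rescale ratings from the 1-5 range to 0-1 and back. This was not part of the experiments above; `scale_ratings` and `unscale_ratings` are hypothetical names.

```python
import numpy as np

RATING_MIN, RATING_MAX = 1.0, 5.0   # Bounds of the star scale

def scale_ratings(ratings):
    """Map ratings from [1, 5] to [0, 1] (min-max scaling)."""
    return (np.asarray(ratings, dtype=float) - RATING_MIN) / (RATING_MAX - RATING_MIN)

def unscale_ratings(scaled):
    """Map values from [0, 1] back to the [1, 5] star scale."""
    return np.asarray(scaled, dtype=float) * (RATING_MAX - RATING_MIN) + RATING_MIN

scaled = scale_ratings([1, 2, 3, 4, 5])
restored = unscale_ratings(scaled)
print(scaled, restored)
```

If the model were trained on scaled ratings, its predictions would need to be mapped back with `unscale_ratings` before computing RMSE, so that the metric stays comparable to the values reported here.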

### Model 1: Baseline

#### Hyperparameters Used:

* numEpochs = 10                           
* batchSize = 64                           
* shuffleVar = True
* numFactors = 25                         

#### Results

##### MSE

0.9861992850423925

##### RMSE

0.9930756693436772


### Model 2: 20 Epochs

#### Hyperparameters Used:

* numEpochs = 20                           
* batchSize = 64                           
* shuffleVar = True           
* numFactors = 25              

#### Results

##### RMSE

1.054521265306006

### Model 3: 40 Epochs

#### Hyperparameters Used:

* numEpochs = 40                           
* batchSize = 64                           
* shuffleVar = True    
* numFactors = 25                     

#### Results

##### MSE

1.29540046859509

##### RMSE

1.1381566098718972

### Model 4: Batch Size 32

#### Hyperparameters Used:

* numEpochs = 10                           
* batchSize = 32                           
* shuffleVar = True     
* numFactors = 25                    

#### Results

##### MSE

0.9711737502177759

##### RMSE

0.9854814814179796

### Model 5: Batch Size 16

#### Hyperparameters Used:

* numEpochs = 10                           
* batchSize = 16                           
* shuffleVar = True 
* numFactors = 25                       

#### Results

##### MSE

0.9196862811028448

##### RMSE

0.9590027534386149

### Model 6: Batch Size 32, Number of Factors 50

#### Hyperparameters Used:

* numEpochs = 10                           
* batchSize = 32
* shuffleVar = True 
* numFactors = 50                       

#### Results

##### MSE

1.0042209983201338

##### RMSE

1.002108276744651

### Model 7: Batch Size 32, Number of Factors 150

#### Hyperparameters Used:

* numEpochs = 10                           
* batchSize = 32
* shuffleVar = True 
* numFactors = 150                       

#### Results

##### MSE

1.0339088803303962

##### RMSE

1.0168130999994032