@Editors: Tung Nguyen, Mo Nguyen, Jian Yu

@base: 16/01/20 discussion

Main changes:

- The input is the rating matrix
- Use `he_uniform` init for ReLU layers
- No batch normalization or dropout in the encoder
- New metrics for every output:

      metrics={'output':['binary_accuracy','Precision'], 
                         'decoder':'mse', 
                         'decoder2':'mse'})
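
As a rough illustration of the `he_uniform` choice: Keras documents this initializer as drawing weights uniformly from `[-limit, limit]` with `limit = sqrt(6 / fan_in)`. The sketch below reproduces that rule in NumPy. The helper name `he_uniform_sample` and the layer sizes are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def he_uniform_sample(fan_in, fan_out, seed=0):
    """Hypothetical helper: sample a (fan_in, fan_out) weight matrix with the He-uniform rule."""
    limit = np.sqrt(6.0 / fan_in)          # bound shrinks as the layer gets wider inputs
    rng = np.random.default_rng(seed)
    weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
    return weights, limit

# Illustrative sizes only: a wide input layer feeding a 128-unit hidden layer.
weights, limit = he_uniform_sample(fan_in=1682, fan_out=128)
print(weights.shape, float(limit))
```

With a very wide input (one unit per item or user) the bound is small, so the first layer starts with small weights.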

## Problem

- The MLP does not seem to update when trained on implicit data.
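
One way to check whether the MLP head is really stuck is to snapshot the weights before and after a few training steps and compare them. The sketch below does the comparison on plain NumPy arrays. `max_weight_change` is a hypothetical helper; in this notebook the two snapshots would presumably come from `model.get_weights()`.

```python
import numpy as np

def max_weight_change(before, after):
    """Hypothetical helper: largest absolute change per weight tensor between two snapshots."""
    return [float(np.max(np.abs(a - b))) for b, a in zip(before, after)]

# Toy snapshots standing in for model.get_weights() before and after training.
before = [np.zeros((4, 2)), np.ones(2)]
after = [np.zeros((4, 2)), np.ones(2) + 0.01]

changes = max_weight_change(before, after)
print(changes)  # an entry of 0.0 means that tensor did not move at all
```

Tensors whose change stays at exactly zero across many steps point to a disconnected or saturated part of the network rather than merely slow learning.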

# Model implementation framework

TF2.0 and Keras implementation

- Create GMF model
    - Create helper methods: User/item latent
    - Create loss functions
    - Handle input $u_i, v_j$
    - Handle output $\hat{r}_{ij}$

## Organise imports


In [None]:
#@title
#import
#tensorflow_version 2.x
import tensorflow as tf
from tensorflow import keras
# Use the public tf.keras API; tensorflow.python.keras is a private module.
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Concatenate, Embedding, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1, l2, l1_l2
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
#dt_dir_name= "C:/Users/jiyu/Desktop/Mo/sample_data/ml-1m"
dt_dir_name= "C:/Users/thinguyen/Desktop/PhD_2020/Python Code/GNN/Mo/sample_data/ml-100k"

In [None]:
#100k 
dataset = pd.read_csv(dt_dir_name+"/u.data",sep='\t',names="user_id,item_id,rating,timestamp".split(","))

#ml1m
#dataset=pd.read_csv(dt_dir_name +'/'+ 'ratings.dat', delimiter='\:\:', names=['user_id', 'item_id', 'rating', 'timestamp'])  
dataset.head()

In [None]:
uids = np.sort(dataset.user_id.unique())
iids = np.sort(dataset.item_id.unique())

n_users = len(uids)
n_items = len(iids)

In [None]:
n_users, n_items

In [None]:
#reindex from 0 ids
dataset.user_id = dataset.user_id.astype('category').cat.codes.values
dataset.item_id = dataset.item_id.astype('category').cat.codes.values
#createMFModel(dataset=dataset)

# Create deep embedding using MLP of the [model](https://drive.google.com/file/d/1kN5loA18WyF1-I7BskOw6c9P1bdArxk7/view?usp=sharing)

## Create deep autoencoder 

Reference: [keras](https://blog.keras.io/building-autoencoders-in-keras.html)

## Turn the original dataset into a negative-sampled dataset

In [None]:
#Version 1.2 (flexible + superfast negative sampling uniform)
import random
import time
import scipy.sparse  # import the submodule explicitly; older SciPy does not expose it via `import scipy`

def neg_sampling(ratings_df, n_neg=1, neg_val=0, pos_val=1, percent_print=5):
  """Version 1.2: n_neg negatives per positive (with n_neg=1 the output is twice the size of the input).

    Parameters:
    ratings_df: rating data as a pandas DataFrame with columns user_id|item_id|rating
    n_neg: number of negatives sampled per positive
    neg_val, pos_val: labels written for negative and positive samples
    percent_print: print progress every percent_print % of users

    Returns:
    negative-sampled set as a pandas DataFrame
            user_id|item_id|rating (implicit: pos_val = interaction, neg_val = sampled negative)
  """
  sparse_mat = scipy.sparse.coo_matrix((ratings_df.rating, (ratings_df.user_id, ratings_df.item_id)))
  dense_mat = np.asarray(sparse_mat.todense())
  print(dense_mat.shape)

  nsamples = ratings_df[['user_id', 'item_id']].copy()  # copy to avoid SettingWithCopyWarning
  nsamples['rating'] = pos_val
  length = dense_mat.shape[0]
  printpc = max(1, int(length * percent_print/100))  # avoid modulo by zero on tiny inputs

  nTempData = []
  i = 0
  start_time = time.time()
  stop_time = time.time()

  extra_samples = 0
  for row in dense_mat:
    if(i%printpc==0):
      stop_time = time.time()
      print("processed ... {0:0.2f}% ...{1:0.2f}secs".format(float(i)*100 / length, stop_time - start_time))
      start_time = stop_time

    n_non_0 = len(np.nonzero(row)[0])
    zero_indices = np.where(row==0)[0]
    if(n_non_0 * n_neg + extra_samples >= len(zero_indices)):
      print(i, "non 0:", n_non_0,": len ",len(zero_indices))
      neg_indices = zero_indices.tolist()
      extra_samples = n_non_0 * n_neg + extra_samples - len(zero_indices)
    else:
      neg_indices = random.sample(zero_indices.tolist(), n_non_0 * n_neg + extra_samples)
      extra_samples = 0

    nTempData.extend([(uu, ii, rr) for (uu, ii, rr) in zip(np.repeat(i, len(neg_indices))
                    , neg_indices, np.repeat(neg_val, len(neg_indices)))])
    i+=1

  # DataFrame.append was removed in pandas 2.0; concat with ignore_index also resets the index
  nsamples = pd.concat([nsamples, pd.DataFrame(nTempData, columns=["user_id","item_id", "rating"])], ignore_index=True)
  return nsamples

In [None]:
neg_dataset = neg_sampling(dataset, n_neg=1)
neg_dataset.shape
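
To make the sampling rule concrete, here is a toy sketch of uniform negative sampling for a single user. It is independent of the `neg_sampling` implementation above (it has no carry-over of extra samples between users), and the function name and item counts are illustrative assumptions.

```python
import random

def sample_negatives(positive_items, n_items, n_neg=1, seed=2020):
    """Illustrative helper: draw n_neg unseen items per positive item, uniformly without replacement."""
    rng = random.Random(seed)
    candidates = [i for i in range(n_items) if i not in positive_items]
    # Never ask for more negatives than there are unseen items.
    k = min(len(candidates), n_neg * len(positive_items))
    return rng.sample(candidates, k)

positives = {0, 3, 7}                      # items this toy user interacted with
negatives = sample_negatives(positives, n_items=10, n_neg=1)
print(sorted(negatives))
```

Each positive is matched by one sampled item the user never interacted with, which is why the sampled dataset is about twice the size of the original when `n_neg=1`.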

# Create rating (u x i) matrix with implicit data

Change rating data -> 1: every observed rating becomes 1, so the matrix only records whether an interaction happened.
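
A minimal sketch of that conversion on a toy array, assuming that 0 marks an unobserved entry:

```python
import numpy as np

explicit = np.array([[5, 0, 3],
                     [0, 0, 1]])           # toy explicit ratings, 0 = unobserved
implicit = (explicit > 0).astype(int)      # any observed rating becomes 1
print(implicit)
```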

In [None]:
def create_rating_matrix(u_i_r_df):
  rating_matrix = np.zeros(shape =(n_users, n_items), dtype=int)
  for row in u_i_r_df.itertuples(index=False):
    rating_matrix[int(row[0]), int(row[1])] = int(row[2])
  return rating_matrix
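
The row-by-row loop can be slow on larger datasets. A possible vectorized variant is sketched below. It assumes each (user, item) pair appears only once, because NumPy does not guarantee which value wins when an index is repeated in a fancy-indexing assignment. The function name is hypothetical.

```python
import numpy as np
import pandas as pd

def create_rating_matrix_vectorized(df, n_users, n_items):
    """Sketch: fill the (n_users, n_items) matrix with a single fancy-indexing assignment."""
    matrix = np.zeros((n_users, n_items), dtype=int)
    matrix[df["user_id"].to_numpy(), df["item_id"].to_numpy()] = df["rating"].to_numpy()
    return matrix

toy = pd.DataFrame({"user_id": [0, 1, 1], "item_id": [2, 0, 1], "rating": [1, 1, 0]})
toy_matrix = create_rating_matrix_vectorized(toy, n_users=2, n_items=3)
print(toy_matrix)
```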

In [None]:
rating_matrix = create_rating_matrix(neg_dataset)
rating_matrix.shape, rating_matrix

In [None]:
neg_dataset

In [None]:
rating_matrix2 = create_rating_matrix(dataset)
rating_matrix2.shape, rating_matrix2

## Create train, test, val sets from neg_dataset

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(neg_dataset, test_size=0.2, random_state=2020)
train, val = train_test_split(train, test_size=0.2, random_state=2020)

# train.reset_index(inplace=True, drop=True)
# test.reset_index(inplace=True, drop=True)
# val.reset_index(inplace=True, drop=True)
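
Because the second split is taken from the remaining 80%, the final proportions are roughly 64% train, 16% validation and 20% test. The arithmetic can be sketched as follows; the total size is an illustrative number, and `train_test_split` may round differently by one sample.

```python
n_total = 200_000                          # illustrative size; substitute len(neg_dataset)
n_test = round(n_total * 0.2)              # first split: 20% held out for testing
n_val = round((n_total - n_test) * 0.2)    # second split: 20% of the remaining 80%
n_train = n_total - n_test - n_val
print(n_train / n_total, n_val / n_total, n_test / n_total)
```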

In [None]:
train.head()

## Helper functions

In [None]:
def create_hidden_size(n_hidden_layers = 3, n_latent_factors = 32):
  """Sizes of each hidden layer, decreasing order"""
  hidden_size = [n_latent_factors*2**i for i in reversed(range(n_hidden_layers))]
  return hidden_size
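
For reference, the layer sizes halve down to `n_latent_factors`. The helper is repeated here only so that the snippet runs on its own.

```python
def create_hidden_size(n_hidden_layers=3, n_latent_factors=32):
    """Same rule as the helper above: sizes of each hidden layer, in decreasing order."""
    return [n_latent_factors * 2**i for i in reversed(range(n_hidden_layers))]

default_sizes = create_hidden_size()
small_sizes = create_hidden_size(3, 8)
print(default_sizes)   # [128, 64, 32]
print(small_sizes)     # [32, 16, 8]
```

The last entry is the width of the `encoder` and `encoder2` bottleneck layers.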

# Create model with Keras


In [None]:
#create autoencoder + ...
#def create_model(n_users, n_items, l1_val=1e-6, l2_val=1e-5):
# l1_val/l2_val (not l1/l2) so the imported regularizer functions are not shadowed
def create_model(n_users,n_items,learning_rate, n_hidden_layers,n_latent_factors,l1_val=1e-5, l2_val=1e-4):

  #hidden_size = create_hidden_size()
  hidden_size = create_hidden_size(n_hidden_layers,n_latent_factors)

  #create 4 input layers (only uii and vji are wired into the model below)
  uii = Input(shape=(n_items,), name='uii')
  umi = Input(shape=(n_items,), name='umi') # is a neighbour of ui

  vji = Input(shape=(n_users,), name='vji')
  vni = Input(shape=(n_users,), name='vni') # is a neighbour of vj

  #user autoencoder
  encoded = uii
  for nn in hidden_size[:-1]:
      encoded = Dense(nn, activation='relu', 
                      kernel_initializer='he_uniform',
                      # kernel_regularizer=l2(l2_val)
                      )(encoded)
      # encoded = BatchNormalization()(encoded)
      # encoded = Dropout(0.2)(encoded)

  encoded = Dense(hidden_size[-1], activation='relu', 
                  kernel_initializer='he_uniform',
                  # kernel_regularizer=l2(l2_val),
                  name='encoder')(encoded) 

  hidden_size.reverse()
  decoded = encoded
  for nn in hidden_size[1:]:
    decoded = Dense(nn, activation='relu', 
                    kernel_initializer='he_uniform',
                    # kernel_regularizer=l2(l2_val)
                    )(decoded)
    # decoded = BatchNormalization()(decoded)
    # decoded = Dropout(0.2)(decoded)
  decoded = Dense(n_items, activation='relu',
                  kernel_initializer='he_uniform',
                  # kernel_regularizer=l2(l2_val), 
                  name='decoder')(decoded)

  #for item autoencoder
  #hidden_size = create_hidden_size() #reset hidden size
  hidden_size = create_hidden_size(n_hidden_layers,n_latent_factors)#reset hidden size
  encoded2 = vji
  for nn in hidden_size[:-1]:
      encoded2 = Dense(nn, activation='relu', 
                       kernel_initializer='he_uniform',
                      #  kernel_regularizer=l2(l2_val)
                       )(encoded2) 
      # encoded2 = BatchNormalization()(encoded2)
      # encoded2 = Dropout(0.2)(encoded2)

  encoded2 = Dense(hidden_size[-1], activation='relu',
                   kernel_initializer='he_uniform',
                  #  kernel_regularizer=l2(l2_val), 
                   name='encoder2')(encoded2) 

  hidden_size.reverse()
  decoded2 = encoded2
  for nn in hidden_size[1:]:
    decoded2 = Dense(nn, activation='relu',
                     kernel_initializer='he_uniform',
                    #  kernel_regularizer=l2(l2_val)
                     )(decoded2)
    # decoded2 = BatchNormalization()(decoded2)
    # decoded2 = Dropout(0.2)(decoded2)

  decoded2 = Dense(n_users, activation='relu',
                   kernel_initializer='he_uniform',
                    # kernel_regularizer=l2(l2_val), 
                   name='decoder2')(decoded2)

  #prod = layers.dot([encoded, encoded2], axes=1, name='DotProduct')
  #V2: replace dot prod with mlp
  concat = layers.concatenate([encoded, encoded2])
  mlp = concat
  for i in range(3,-1,-1):
    if i == 0:
      mlp = Dense(1, activation='sigmoid',
                  name="output")(mlp)
    else:
      mlp = Dense(8*2**i, activation='sigmoid',
                  # kernel_regularizer=l1_l2(l1_val,l2_val),
                  # kernel_initializer='he_uniform'
                  )(mlp)
      if i >= 2:
        mlp = BatchNormalization()(mlp)
        mlp = Dropout(0.2)(mlp)

  model = Model(inputs=[uii,  vji], outputs=[decoded,decoded2, mlp])
  adadelta = tf.keras.optimizers.Adadelta(learning_rate)
  # pass the optimizer instance; the string 'adadelta' would ignore learning_rate
  model.compile(optimizer=adadelta, loss={'output':'binary_crossentropy', 
                                        'decoder':'mean_squared_error', 
                                        'decoder2':'mean_squared_error'
                                        }, 
                metrics={'output':['binary_accuracy',
                                  #  'Precision', 'AUC'
                                   ], 
                         'decoder':'mse', 
                         'decoder2':'mse'})

  #model.summary()

  return  model
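
As a quick check of the MLP head built at the end of `create_model`, the widths produced by the loop `for i in range(3, -1, -1)` can be listed without building the model. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
# Mirrors the loop in create_model: hidden layers use 8 * 2**i units, the last layer has 1 unit.
mlp_units = [1 if i == 0 else 8 * 2**i for i in range(3, -1, -1)]
# Hidden layers with i >= 2 are followed by BatchNormalization and Dropout(0.2).
has_bn_dropout = [i >= 2 for i in range(3, 0, -1)]
print(mlp_units)        # [64, 32, 16, 1]
print(has_bn_dropout)   # [True, True, False]
```

The head's input is the concatenation of the two encoder outputs, so its width is `2 * n_latent_factors`.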

In [None]:
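#model = create_model(n_users, n_items)
# `args` is defined in the hyperparameter cell further down; run that cell first.
# Named `model` because the plotting, training and evaluation cells refer to `model`.
model = create_model(n_users,n_items,args.lr[3],args.hidden_layers[0], args.hidden_factors[3], args.regularizer[1],args.regularizer[1])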


In [None]:
tf.keras.utils.plot_model(model, to_file='model3.png', show_shapes=True)

### Create data generator using rating matrix

It takes the rating matrix and generates batches of user rows, item columns, and ratings.
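
The core of each batch is a lookup into the rating matrix: a user is represented by their row, an item by its column. A minimal sketch on a toy matrix (all names and values here are illustrative):

```python
import numpy as np

toy_matrix = np.array([[1, 0, 1],
                       [0, 1, 0]])          # toy matrix: 2 users x 3 items
batch_uids = np.array([0, 1, 1])            # user ids of the samples in the batch
batch_iids = np.array([2, 1, 0])            # item ids of the samples in the batch

users = toy_matrix[batch_uids]              # shape (batch, n_items): one row per sample
items = toy_matrix[:, batch_iids].T         # shape (batch, n_users): one column per sample
print(users.shape, items.shape)
```

These two arrays feed the `uii` and `vji` inputs and also serve as the reconstruction targets of the two autoencoders.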

In [None]:
from tensorflow.keras.utils import Sequence
import math

class DataGenerator(Sequence):
    def __init__(self, dataset, rating_matrix, batch_size=32,  shuffle=True):
        'Initialization'
        self.batch_size = batch_size
        self.dataset = dataset
        self.shuffle = shuffle
        self.indexes = self.dataset.index
        self.rating_matrix = rating_matrix
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.dataset) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        idxs = [i for i in range(index*self.batch_size,(index+1)*self.batch_size)]

        # Find list of IDs
        list_IDs_temp = [self.indexes[k] for k in idxs]

        # Generate data
        uids = self.dataset.iloc[list_IDs_temp,[0]].to_numpy().reshape(-1)
        iids = self.dataset.iloc[list_IDs_temp,[1]].to_numpy().reshape(-1)
        # print(uids)
        # use the matrix passed to the constructor, not the global variable
        Users = np.stack([self.rating_matrix[row] for row in uids])
        Items = np.stack([self.rating_matrix[:, col] for col in iids])
        ratings = self.dataset.iloc[list_IDs_temp,[2]].to_numpy().reshape(-1)
        
        # ratings = keras.utils.to_categorical(rr)
        # print(Items, type(Items))
        # print(ratings, type(ratings))
        
        return (Users, Items),(Users, Items, ratings)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.dataset))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)
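
Note that `__len__` uses `math.floor`, so the samples that do not fill a last complete batch are skipped in each epoch. With shuffling enabled, a different remainder is dropped every epoch. A small sketch of the arithmetic with illustrative numbers:

```python
import math

n_samples, batch_size = 1000, 128           # illustrative numbers
n_batches = math.floor(n_samples / batch_size)
n_dropped = n_samples - n_batches * batch_size
print(n_batches, n_dropped)                 # 7 104
```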

In [None]:
import easydict
args = easydict.EasyDict({
        'hidden_layers': [3,10],
        'hidden_factors': [8,16,32,64],
        'batch':[128,256,512,1024],
        'regularizer': [1e-4,1e-6,1e-8],
        'lr':[0.0001 ,0.0005, 0.001, 0.005],
    })     
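
To get a feel for the cost of a full sweep over these values, the number of combinations can be counted with `itertools.product`. The sketch below copies the four lists that are actually looped over; `hidden_layers` is left out because the sweep does not iterate over it.

```python
from itertools import product

grid = {
    'lr': [0.0001, 0.0005, 0.001, 0.005],
    'batch': [128, 256, 512, 1024],
    'hidden_factors': [8, 16, 32, 64],
    'regularizer': [1e-4, 1e-6, 1e-8],
}
combos = list(product(*grid.values()))
print(len(combos))   # 192
```

At 100 epochs per configuration, the full grid amounts to 19,200 training epochs.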

In [None]:

for lr in args.lr:
    for batch_size in args.batch:
        for n_factors in args.hidden_factors:
            for reg in args.regularizer:
                # argument order: learning_rate, n_hidden_layers, n_latent_factors, then the two regularization strengths
                model1 = create_model(n_users, n_items, lr, args.hidden_layers[0], n_factors, reg, reg)
                traindatagenerator = DataGenerator(train, rating_matrix, batch_size=batch_size, shuffle=True)
                history = model1.fit(traindatagenerator, epochs=100, verbose=0)
                testdatagenerator = DataGenerator(test, rating_matrix, batch_size=batch_size)
                results = model1.evaluate(testdatagenerator, verbose=0)
                print('RESULT: lr=', lr, 'regularization=', reg, ' batch=', batch_size, 'hidden factors=', n_factors, results)

## Training with data generator

In [None]:
traindatagenerator = DataGenerator(train, rating_matrix,shuffle=True)

history = model.fit(traindatagenerator, epochs=100, verbose=2)

In [None]:
##This is for normal training (old)
# history = model.fit({'uii':ext_user_matrix, 'vji':rating_matrix.T}, 
#            {'decoder':ext_user_matrix, 'decoder2':rating_matrix.T, 'output':np.diag(ext_user_matrix)},
#            epochs=100)

## Plot losses

`history.history` holds the total loss plus one loss and one metric series per output; plot the ones we need.

In [None]:
pd.Series(history.history['loss']).plot(logy=True)
pd.Series(history.history['output_binary_accuracy']).plot(logy=True)
plt.xlabel("Epoch")
plt.ylabel("Training loss / accuracy (log scale)")
plt.legend(['loss','output_binary_accuracy'])

Let's now see how our model does on the held-out test set. `evaluate` returns the total loss together with the per-output losses and metrics; `model.metrics_names` gives the order of the values.

In [None]:
testdatagenerator = DataGenerator(test, rating_matrix)

results = model.evaluate(testdatagenerator)

print(results)

# References

Input layer:

- Embedding layer: [What is an embedding layer?](https://gdcoder.com/-what-is-an-embedding-layer/)
- Embedding lookup: [Keras embedding layers](https://keras.io/layers/embeddings/)
- Multi input: [Keras functional API guide: multi-input and multi-output models](https://keras.io/getting-started/functional-api-guide/#multi-input-and-multi-output-models)