## Contributing SVD as a block to mlsquare using DeepCTR's methods

**Fork the mlsquare repository to your account and clone it.**

**Or just clone https://github.com/mlsquare/mlsquare.git**

* Navigate to the `src/mlsquare/layers` folder, where all potential lego blocks are added as Python modules.
* Add `deepctr.py`, containing the code for DeepCTR's SVD.
* The SVD implementation in this `deepctr` module is built from DeepCTR's existing classes and methods.

**The following notebook serves as a walkthrough for contributing DeepCTR's SVD to mlsquare layers.**
* Towards the end, the obtained model is evaluated against sample test values.

**The following code is saved as the `SVD` function in `mlsquare/layers/deepctr.py`**

In [None]:
import tensorflow as tf

from deepctr.inputs import build_input_features, input_from_feature_columns

from deepctr.layers.interaction import FM
from deepctr.layers.utils import concat_fun


def SVD(feature_columns, embedding_size=100,
        l2_reg_embedding=1e-5, l2_reg_linear=1e-5, l2_reg_dnn=0, init_std=0.0001, seed=1024, bi_dropout=0,
        dnn_dropout=0):
    """Instantiates an SVD-style matrix factorization model built from DeepCTR's FM layer.

    :param feature_columns: An iterable containing all the sparse features used by the model.
    :param embedding_size: positive integer, size of the latent (embedding) vector of each sparse feature.
    :param l2_reg_embedding: float. L2 regularizer strength applied to the embedding vectors.
    :param l2_reg_linear: float. Currently unused in the function body.
    :param l2_reg_dnn: float. Currently unused in the function body.
    :param init_std: float, standard deviation used to initialize the embedding vectors.
    :param seed: integer, to use as random seed.
    :param bi_dropout: Currently unused in the function body.
    :param dnn_dropout: Currently unused in the function body.
    :return: A Keras model instance.
    """
    # One Keras Input per feature column, keyed by feature name
    features = build_input_features(feature_columns)

    input_layers = list(features.values())

    # Embed every sparse feature; the second return value is not needed here
    sparse_embedding_list, _ = input_from_feature_columns(features, feature_columns, embedding_size,
                                                          l2_reg_embedding, init_std, seed)

    # Stack the embeddings along the field axis and use their pairwise (FM) interaction as the output
    fm_input = concat_fun(sparse_embedding_list, axis=1)
    fm_logit = FM()(fm_input)

    model = tf.keras.models.Model(inputs=input_layers, outputs=fm_logit)

    return model
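
Why an FM layer over two embeddings behaves like an SVD-style model can be checked by hand. The following is a minimal pure-Python sketch, assuming the FM layer computes the usual second-order term `0.5 * sum((sum of embeddings)^2 - sum of squared embeddings)`. The helper names `fm_second_order` and `dot` and the example vectors are illustrative and not part of DeepCTR.

```python
def fm_second_order(embeddings):
    """Second-order FM term for one sample.

    `embeddings` is a list of equal-length vectors, one per feature field.
    """
    dim = len(embeddings[0])
    total = 0.0
    for k in range(dim):
        column = [vec[k] for vec in embeddings]
        square_of_sum = sum(column) ** 2
        sum_of_square = sum(x * x for x in column)
        total += 0.5 * (square_of_sum - sum_of_square)
    return total


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical 3-dimensional embeddings for one user and one movie
user_vec = [0.5, -1.0, 2.0]
movie_vec = [1.5, 0.25, -0.5]

fm_value = fm_second_order([user_vec, movie_vec])
dot_value = dot(user_vec, movie_vec)
print(fm_value, dot_value)  # both are -0.5
```

With exactly two fields the FM term reduces to the dot product of the user and movie vectors, which is the prediction rule of a bias-free matrix factorization.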

**To use the above block as a module to obtain an SVD-equivalent DNN model, and then train and evaluate that model, proceed as follows:**

### 1. Import sample dataset as pandas dataframe
* List the `sparse_features` and label-encode the input dataframe.
* Perform `train_test_split` to produce training/test data and labels for model training.

In [1]:
import os
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from deepctr.models import DeepFM
from deepctr.inputs import SparseFeat


# Tab-separated ratings file without a header row, so column names are supplied explicitly
data_path = os.path.expanduser('u.data')
df = pd.read_csv(data_path, sep='\t', names=['user_id', 'movie_id', 'rating', 'timestamp'])

df.head(3)


| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |


DeepCTR version 0.7.0 detected. Your version is 0.6.3.
Use `pip install -U deepctr` to upgrade.Changelog: https://github.com/shenweichen/DeepCTR/releases/tag/v0.7.0


* List the **sparse features** and the label column of the input dataframe

In [2]:
sparse_features = ["user_id", "movie_id"]
y = ['rating']
print('feature names:', sparse_features, '\nlabel name:', y)

feature names: ['user_id', 'movie_id'] 
label name: ['rating']


* Label-encode the sparse features of the input dataframe

In [3]:
for feat in sparse_features:
    lbe = LabelEncoder()
    df[feat] = lbe.fit_transform(df[feat])

df.head(3)

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 195 | 241 | 3 | 881250949 |
| 1 | 185 | 301 | 3 | 891717742 |
| 2 | 21 | 376 | 1 | 878887116 |
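
What label encoding does to the ids can be sketched without scikit-learn. This is a minimal pure-Python approximation, assuming (as `LabelEncoder` does) that codes are assigned in sorted order of the distinct values. The function name `label_encode` and the sample ids are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    classes = sorted(set(values))
    index_of = {value: idx for idx, value in enumerate(classes)}
    return [index_of[v] for v in values], classes


# Hypothetical raw ids with gaps between them
raw_user_ids = [196, 186, 22, 196]
encoded, classes = label_encode(raw_user_ids)

print(encoded)  # [2, 1, 0, 2]
print(classes)  # [22, 186, 196]
```

In the notebook output above each id simply drops by one (196 becomes 195), which is what this mapping produces when the raw ids are already contiguous and start at 1.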


**Preparing training input data & target labels.**
* Training & test input data should each be a list of numpy arrays, one for `user_id` and one for `movie_id`.
* Labels should be a numpy array of target values.

In [4]:
train, test = train_test_split(df, test_size=0.2)

# One array per sparse feature: values of train['user_id'] and train['movie_id'] only
train_model_input = [train[name].values for name in sparse_features]
train_lbl = train[y].values

test_model_input = [test[name].values for name in sparse_features]
test_lbl = test[y].values
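
The sample counts that result from this split, followed by the `validation_split=0.2` used later in `model.fit`, can be worked out with integer arithmetic. A small sketch, assuming the dataframe has 100000 rows (inferred from the training log further down, not stated in the notebook); `split_sizes` is a hypothetical helper.

```python
def split_sizes(n_rows, test_pct=20, val_pct=20):
    """Return (train, validation, test) sample counts for two successive percentage splits."""
    n_test = n_rows * test_pct // 100      # held out by train_test_split
    n_fit = n_rows - n_test                # passed to model.fit
    n_val = n_fit * val_pct // 100         # held out by validation_split
    n_train = n_fit - n_val
    return n_train, n_val, n_test


print(split_sizes(100000))  # (64000, 16000, 20000)
```

Note that `train_test_split` is called without a `random_state` here, so the exact rows in each part differ from run to run.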

In [41]:
print('training data:\n', train_model_input, '\n\ntraining labels:\n', train_lbl)

training data:
 [array([920, 380,  82, ..., 698, 561, 552]), array([ 172,  343,   78, ..., 1374,  805,   80])] 

training labels:
 [[5]
 [3]
 [5]
 ...
 [3]
 [1]
 [3]]


### 2. Obtain feature columns and prepare the data as described in the DeepCTR docs (see https://deepctr-doc.readthedocs.io/en/latest/Quick-Start.html)
* **Define the feature columns as a list of `SparseFeat` instances, one per sparse feature (here `user_id` and `movie_id`), passing the feature name and the number of unique feature values as arguments.**

In [5]:
feature_columns = [SparseFeat(feat, df[feat].nunique()) for feat in sparse_features]
feature_columns

[SparseFeat:user_id, SparseFeat:movie_id]
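
The number of unique values passed to `SparseFeat` sets the number of rows in the embedding table, so every encoded id must be smaller than it. The sketch below illustrates why the count is taken over the full dataframe rather than the training subset; the helper `vocabulary_size_ok` and the toy ids are hypothetical.

```python
def vocabulary_size_ok(encoded_ids, vocabulary_size):
    """True when every id can index an embedding table with `vocabulary_size` rows."""
    return min(encoded_ids) >= 0 and max(encoded_ids) < vocabulary_size


encoded_all = [0, 1, 2, 3, 4]   # hypothetical label-encoded ids, full dataframe
train_only = [0, 1, 3]          # hypothetical training subset

size_from_all = len(set(encoded_all))     # 5
size_from_train = len(set(train_only))    # 3

print(vocabulary_size_ok(encoded_all, size_from_all))    # True
print(vocabulary_size_ok(encoded_all, size_from_train))  # False: id 4 has no embedding row
```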

### 3. Import `SVD` from `mlsquare.layers.deepctr`
* Instantiate the model using `feature_columns` from above.
* Train the model & evaluate results.

In [6]:
from mlsquare.layers.deepctr import SVD


In [7]:
??SVD

* **Now instantiate the model by passing in the arguments `feature_columns` & `embedding_size`**

In [8]:
model = SVD(feature_columns, embedding_size=100)
model.summary()

Instructions for updating:
Colocations handled automatically by placer.


Instructions for updating:
keep_dims is deprecated, use keepdims instead


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
user_id (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
movie_id (InputLayer)           (None, 1)            0                                            
__________________________________________________________________________________________________
sparse_emb_user_id (Embedding)  (None, 1, 100)       94300       user_id[0][0]                    
__________________________________________________________________________________________________
sparse_emb_movie_id (Embedding) (None, 1, 100)       168200      movie_id[0][0]                   
__________________________________________________________________________________________________
no_mask (N
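
The parameter counts in the summary above can be read back as table sizes: each embedding layer holds one `embedding_size`-long vector per distinct id. A quick arithmetic check, using only the numbers shown in the summary:

```python
embedding_size = 100

user_params = 94300     # Param # of sparse_emb_user_id
movie_params = 168200   # Param # of sparse_emb_movie_id

n_users = user_params // embedding_size
n_movies = movie_params // embedding_size
embedding_total = user_params + movie_params

print(n_users)          # 943
print(n_movies)         # 1682
print(embedding_total)  # 262500
```

So the model stores a 100-dimensional vector for each of 943 users and 1682 movies.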

* Compile the model & fit it on the training data

In [9]:
model.compile("adam", "mse", metrics=['mse'])
history = model.fit(train_model_input, train_lbl, batch_size=64, epochs=8, verbose=2, validation_split=0.2)

Instructions for updating:
Use tf.cast instead.

Train on 64000 samples, validate on 16000 samples

Epoch 1/8
 - 6s - loss: 6.2337 - mean_squared_error: 6.2077 - val_loss: 1.4820 - val_mean_squared_error: 1.4288
Epoch 2/8
 - 6s - loss: 1.1983 - mean_squared_error: 1.1364 - val_loss: 1.0701 - val_mean_squared_error: 1.0017
Epoch 3/8
 - 6s - loss: 1.0345 - mean_squared_error: 0.9629 - val_loss: 1.0258 - val_mean_squared_error: 0.9516
Epoch 4/8
 - 5s - loss: 0.9983 - mean_squared_error: 0.9223 - val_loss: 1.0076 - val_mean_squared_error: 0.9303
Epoch 5/8
 - 4s - loss: 0.9780 - mean_squared_error: 0.8992 - val_loss: 0.9983 - val_mean_squared_error: 0.9187
Epoch 6/8
 - 4s - loss: 0.9498 - mean_squared_error: 0.8691 - val_loss: 0.9835 - val_mean_squared_error: 0.9017
Epoch 7/8
 - 4s - loss: 0.9190 - mean_squared_error: 0.8362 - val_loss: 0.9842 - val_mean_squared_error: 0.9007
Epoch 8/8
 - 4s - loss: 0.8841 - mean_squared_error: 0.7993 - val_loss: 0.9677 - val_mean_squared_error: 0.8821
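
Two things can be read off this log: the validation RMSE, and the gap between `loss` and `mean_squared_error`. The sketch below copies the validation numbers from the log above. Attributing the gap to the L2 penalty on the embeddings (`l2_reg_embedding`) is an assumption based on how Keras adds regularization terms to the loss but not to metrics.

```python
import math

# Values copied from the training log above (epochs 1 to 8)
val_loss = [1.4820, 1.0701, 1.0258, 1.0076, 0.9983, 0.9835, 0.9842, 0.9677]
val_mse = [1.4288, 1.0017, 0.9516, 0.9303, 0.9187, 0.9017, 0.9007, 0.8821]

# RMSE is in the same units as the ratings
val_rmse = [math.sqrt(m) for m in val_mse]

# Part of the loss that is not explained by the MSE metric
gap = [loss - mse for loss, mse in zip(val_loss, val_mse)]

print(round(val_rmse[-1], 4))  # 0.9392
print(round(gap[-1], 4))       # 0.0856
```

By the last epoch the validation RMSE is about 0.94 rating points.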


* **Evaluating model prediction on test data**

In [10]:
test.head(3)

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 39551 | 344 | 233 | 4 | 884991831 |
| 44160 | 480 | 99 | 4 | 885828426 |
| 25092 | 469 | 918 | 3 | 879178370 |


In [11]:
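# Pick the second test sample (index 1), i.e. the second row of test.head(3) above
user_id = test_model_input[0][1]
item_id = test_model_input[1][1]
# model.predict scores the whole test set; [1] selects the prediction for that sample
print('Model prediction for test user id: {} & item id : {} from above is: {}'.format(user_id, item_id, model.predict(test_model_input)[1]))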

Model prediction for test user id: 480 & item id : 99 from above is: [4.0426135]
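
A single prediction says little about overall quality, so a natural next step is an error over the whole test set. Below is a minimal pure-Python sketch of that computation. It assumes labels and predictions arrive as column-shaped nested lists such as `[[4], [3]]`; the helpers `flatten` and `mse` are hypothetical. The only data point used is the one pair shown above (prediction 4.0426135 for user 480 and item 99, whose rating in `test.head(3)` is 4).

```python
def flatten(values):
    """Turn [[5], [3]] or [5, 3] into a flat list of floats."""
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(float(x) for x in v)
        else:
            flat.append(float(v))
    return flat


def mse(labels, predictions):
    """Mean squared error between two equally long sequences."""
    y_true, y_pred = flatten(labels), flatten(predictions)
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions differ in length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# The one (label, prediction) pair shown in the notebook output
labels = [[4]]
predictions = [[4.0426135]]
single_error = mse(labels, predictions)
print(single_error)
```

In the notebook itself, the already imported `mean_squared_error(test_lbl, model.predict(test_model_input))` should give the same quantity over all test rows.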
