# Recommender System for Trainings

You probably know Netflix, IMDb, Amazon and several similar services. They make extensive use of recommendation systems, either to encourage more use of their platforms or to increase the value of the items you are going to purchase. On Netflix you get movie or series recommendations; on Amazon you get recommendations for items that you might like.

There is a lot of information out there about recommendation systems, and I am not going to explain the theory here. If you are new to the topic, I encourage you to read the following post for a theoretical introduction: https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada

The idea of this project is simple. We have internal trainings created by our company's employees, external trainings that we take via Pluralsight, Udemy, Coursera or any other platform, and employees who take those internal or external trainings. Employees have attributes such as department, language, skills, etc. All of those attributes need to be taken into account by our recommender system.

For example: suppose you are a new employee with only 1 year of experience in Data Science (a feature), and your listed skills (another feature) are Statistics and Machine Learning. Another person in the company has 10 years of experience in Data Science and similar skills, and that person took the "Advanced machine learning specialization" on Coursera. The recommender system should be able to predict, a.k.a. recommend, this training for the new employee.

If you read the blog post above, you will see that some recommender systems only take into account the interactions between users and items, but not the features that describe the users and the items, a.k.a. metadata. In this project I used LightFM, a very well-known library, to build a hybrid recommender system in which the user features are also taken into account.

To start, I will use 3 datasets:
- Users Dataset with features like name, department, language, gender, etc.
- Training Dataset with features like name, and main skill.
- TrainingsTaken Dataset

The last one holds the relationship between a User-Id and a Training-Id; it simply records which user has taken which training. In recommender systems you can also have weights. A weight is basically a rating: movies, for example, can be rated from 0 to 5. It is up to you to decide whether you need weights for your business case. If you do, you would probably add that field to the TrainingsTaken dataset.
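To make the weights idea a bit more concrete, here is a minimal sketch in plain Python. The rows, the `rating` field and the helper name `to_interaction_tuples` are made up for illustration; the only thing taken from LightFM is that its `build_interactions` method accepts either `(user, item)` or `(user, item, weight)` tuples.

```python
# Hypothetical rows of a TrainingsTaken dataset; "rating" is an optional weight.
rows = [
    {"User-Id": 101, "Training-Id": 7, "rating": 5.0},
    {"User-Id": 101, "Training-Id": 9, "rating": 3.0},
    {"User-Id": 202, "Training-Id": 7, "rating": 4.0},
]


def to_interaction_tuples(rows, weight_field=None):
    """Turn rows into (user, item) or (user, item, weight) tuples."""
    if weight_field is None:
        # Implicit feedback: only the fact that the training was taken matters.
        return [(r["User-Id"], r["Training-Id"]) for r in rows]
    # Explicit feedback: keep the rating as the interaction weight.
    return [(r["User-Id"], r["Training-Id"], r[weight_field]) for r in rows]


implicit = to_interaction_tuples(rows)
explicit = to_interaction_tuples(rows, weight_field="rating")
print(implicit[0])
print(explicit[0])
```

In this post I only use the first form, because all I know is whether a training was taken.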

I am working with Azure ML, so I registered the datasets in ML Studio. You can see how I did this in my post [How to generate synthetic data with Faker in Python and Azure ML](https://python.plainenglish.io/how-to-generate-synthetic-data-with-faker-in-python-and-azure-ml-24f69ddaea0e "How to generate synthetic data with Faker in Python and Azure ML").

With the code below, I load the datasets already registered in Azure ML into memory as pandas dataframes. This will allow us to reshape the data later as required.

In [None]:

from lightfm import LightFM

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
# The Azure ML Dataset is aliased so it does not clash with lightfm.data.Dataset, which is used later
from azureml.core import Workspace, Dataset as AzureDataset

subscription_id = 'x'
resource_group = 'y'
workspace_name = 'z'

workspace = Workspace(subscription_id, resource_group, workspace_name)

# Load each registered dataset into a pandas dataframe
datasetusers = AzureDataset.get_by_name(workspace, name='usersfake')
usersdf = datasetusers.to_pandas_dataframe()

datasettrainings = AzureDataset.get_by_name(workspace, name='trainings')
trainingsdf = datasettrainings.to_pandas_dataframe()

datasettrainingstaken = AzureDataset.get_by_name(workspace, name='trainingtakenfake')
trainingstakendf = datasettrainingstaken.to_pandas_dataframe()

If you want to check the contents of each dataframe, you can use df.head(5) and the output will be similar to the following images.

![User Data Frame](https://miro.medium.com/max/700/1*-2utCuwrn560CeIqT6IoDg.png) 
User Data Frame

![Training Dataframe](https://miro.medium.com/max/403/1*ceCmQcWiVWQ6CIN9v7qxwg.png) 
Training Data Frame

![TrainingTaken Dataframe](https://miro.medium.com/max/291/1*BLQLhD89I_CV0HiwhbDz0w.png) 
Training Taken DataFrame

Now we have 3 pandas dataframes that we can use with the LightFM algorithm.

### LightFM

LightFM is a Python implementation of a number of popular recommendation algorithms for both implicit and explicit feedback, including efficient implementation of BPR and WARP ranking losses. It’s easy to use, fast (via multithreaded model estimation), and produces high-quality results. (source: https://github.com/lyst/lightfm)

I selected LightFM because, when I searched for a hybrid recommender system, it was one of the most widely used libraries that allow user features and item features to be used within the model. There are other recommender systems you can try, however.

To begin, we need to create a LightFM Dataset. This object will later allow us to give the model the data in the format it expects.
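To give some intuition for what "hybrid" means here, the sketch below shows the core idea described in the LightFM paper: a user is represented as the sum of the embeddings of its features, and a score is the dot product between the user and item representations (biases are left out for brevity). The feature names and embedding values are invented toy numbers, and this is an illustration of the idea, not LightFM's actual code.

```python
# Toy 2-dimensional embeddings; all numbers are invented for illustration.
user_feature_embeddings = {
    "user:1": [0.1, 0.0],
    "skills:python": [0.5, 0.2],
    "grade:Junior": [0.0, 0.3],
}
item_embeddings = {
    "training:ml-basics": [1.0, 1.0],
    "training:pm-101": [-1.0, 0.0],
}


def user_vector(features):
    """Sum the embeddings of all features that describe the user."""
    dims = len(next(iter(user_feature_embeddings.values())))
    total = [0.0] * dims
    for name in features:
        for d, value in enumerate(user_feature_embeddings[name]):
            total[d] += value
    return total


def score(features, item):
    """Dot product between the user vector and the item embedding (biases omitted)."""
    u = user_vector(features)
    v = item_embeddings[item]
    return sum(a * b for a, b in zip(u, v))


new_user = ["user:1", "skills:python", "grade:Junior"]
for item in item_embeddings:
    print(item, round(score(new_user, item), 2))
```

Because part of the user vector comes from shared features such as `skills:python`, a new employee with few or no interactions can still get sensible scores from what the model learned about colleagues with the same features.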

In [None]:
from lightfm.data import Dataset
dataset1 = Dataset()

### The fit method

We need to call the fit method so that LightFM knows who the users are, which items we are dealing with (trainings), and what the user and item features are. In recommender lingo, our trainings are just items.

We will pass three parameters to the fit method: the list of users, the list of items, and the user features. Passing the users and items is pretty straightforward: just use the `User-Id` column of the users dataframe and the `Training-Id` column of the trainings dataframe.

When it comes to passing the user_features, it's best to pass a list in which each element has the format `feature_name:feature_value`.

Our user_features should then look something like this:
`['name:Susana Johnson', 'Age:42', 'los:IFS', 'ou:development', 'gender:F', 'skills:azure', 'language:dutch']`.

This list is generated by considering all `feature_name:feature_value` pairs that can be found in the training data. For example, for the feature_name gender there can be two feature_values, namely M and F.

In [None]:
uf = []
# Repeat each feature name once per unique value so it can be zipped with the values below
col = ['ou']*len(usersdf.ou.unique()) + ['skills']*len(usersdf.skills.unique()) + ['language']*len(usersdf.language.unique()) + ['grade']*len(usersdf['grade'].unique()) + ['career interests']*len(usersdf['career interests'].unique())
unique_f1 = list(usersdf.ou.unique()) + list(usersdf.skills.unique()) + list(usersdf.language.unique()) + list(usersdf['grade'].unique()) + list(usersdf['career interests'].unique())
for x, y in zip(col, unique_f1):
    res = str(x) + ":" + str(y)
    uf.append(res)
    print(res)

Output:
ou:development
ou:operations
ou:architecture
ou:cloud operations
ou:pmo
skills:azure
skills:javascript
skills:pm
skills:.net
skills:python
skills:solutions design
skills:sql
language:dutch
language:french
language:german
language:spanish
language:english
grade:Junior
grade:Associate
grade:Senior Manager
grade:Manager
grade:Senior Associate
career interests:solutions design
career interests:javascript
career interests:pm
career interests:python
career interests:azure
career interests:.net
career interests:sql

The code above generates the list we need, in the format explained earlier, with every feature name and value pair found in the data. This is what LightFM expects.
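The same idea can be written as a small reusable helper. This is only a sketch in plain Python: the function name `build_feature_vocabulary` and the sample records are hypothetical stand-ins for the users dataframe.

```python
def build_feature_vocabulary(records, columns):
    """Return every distinct 'column:value' string, in first-seen order."""
    vocabulary = []
    seen = set()
    for column in columns:
        for record in records:
            feature = f"{column}:{record[column]}"
            if feature not in seen:
                seen.add(feature)
                vocabulary.append(feature)
    return vocabulary


# Hypothetical user records standing in for the users dataframe.
users = [
    {"ou": "development", "skills": "azure", "language": "dutch"},
    {"ou": "development", "skills": "python", "language": "french"},
    {"ou": "operations", "skills": "azure", "language": "dutch"},
]

vocab = build_feature_vocabulary(users, ["ou", "skills", "language"])
print(vocab)
```

Adding another feature is then just a matter of adding its column name to the list.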

Now we need to call the `.fit` method, which accepts a list of User-Ids, a list of Training-Ids, and a list of all user features (the list above).

After calling the fit method, I convert the id columns of the trainingstaken dataframe to numeric because that is the expected type. Finally, using the dataset1 instance, I call `build_interactions`. Here I iterate over all rows of the trainingstaken dataframe and pass the User-Id and the Training-Id of each row as a tuple. Optionally you can also pass the weights (ratings). In my case I ignore that column: in the beginning I assumed that a value of 10 would mean the user took the training, but in this case it is irrelevant.

In [None]:
# we call fit to supply userid, item id and user/item features
dataset1.fit(
        usersdf['User-Id'].unique(), # all the users
        trainingsdf['Training-Id'].unique(), # all the items
        user_features = uf # additional user features
)

import pandas as pd
trainingstakendf["User-Id"] = pd.to_numeric(trainingstakendf["User-Id"])
trainingstakendf["Training-Id"] = pd.to_numeric(trainingstakendf["Training-Id"])

# plugging in the interactions and their weights
(interactions, weights) = dataset1.build_interactions([(x[0], x[1]) for x in trainingstakendf.values ])

interactions.todense()
Output:
matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int32)
        
weights.todense()
Output:
matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

The `build_interactions` call returns two sparse matrices, the interactions and the weights. If you want a more readable view, you can call the `.todense()` method to see a summary of the dense matrix.
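If sparse matrices are new to you, the toy version below may help. It maps raw ids to row and column positions and stores only the cells that contain an interaction, which is roughly what happens conceptually. The ids and helper names are invented, and this is not how LightFM implements it internally.

```python
def build_toy_interactions(user_ids, item_ids, pairs):
    """Map raw ids to indices and store only the non-zero cells."""
    user_index = {uid: i for i, uid in enumerate(user_ids)}
    item_index = {iid: j for j, iid in enumerate(item_ids)}
    # Sparse representation: a set of (row, column) positions that hold a 1.
    cells = {(user_index[u], item_index[t]) for u, t in pairs}
    return cells, (len(user_ids), len(item_ids))


def to_dense(cells, shape):
    """Expand the sparse cells into a full list-of-lists matrix."""
    n_rows, n_cols = shape
    return [[1 if (r, c) in cells else 0 for c in range(n_cols)] for r in range(n_rows)]


cells, shape = build_toy_interactions(
    user_ids=[101, 202, 303],
    item_ids=[7, 8, 9, 10],
    pairs=[(101, 7), (101, 9), (303, 10)],
)
dense = to_dense(cells, shape)
for row in dense:
    print(row)
print("non-zero cells:", len(cells), "of", shape[0] * shape[1])
```

With thousands of users who each took only a few trainings, most cells are zero, which is likely why the truncated dense printout above shows nothing but zeros.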

### Creating the user features

The `build_user_features` method requires parameters like this:


In [None]:
[
(user1 , [feature1, feature2, feature3, ….]),
(user2 , [feature1, feature2, feature3, ….]),
]

>Remember that `feature1 , feature2, feature3` , etc should be one of the items present in `user_features` list that we passed to the fit method before.

Just to reiterate, this is what our user_features list currently looks like:
`['name:Susana Johnson', 'Age:42', 'los:IFS', 'ou:development', 'gender:F', 'skills:azure', 'language:dutch']`.

For our particular example, it should look like this:

In [None]:
[
     ('1', ['name:Susana Johnson', 'Age:42', 'los:IFS', 'ou:development', 'gender:F', 'skills:azure', 'language:dutch']),
     ('2', .....
 ]
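Because every feature passed to `build_user_features` has to be one that the dataset already knows from the fit step, a quick sanity check before the call can save some debugging. The sketch below is plain Python; the helper name `find_unknown_features` and the sample data are hypothetical.

```python
def find_unknown_features(user_tuples, known_features):
    """Return {user_id: [features missing from known_features]} for offending users."""
    known = set(known_features)
    problems = {}
    for user_id, features in user_tuples:
        missing = [f for f in features if f not in known]
        if missing:
            problems[user_id] = missing
    return problems


known_features = ["ou:development", "skills:azure", "language:dutch"]
user_tuples = [
    (1, ["ou:development", "skills:azure"]),
    (2, ["ou:development", "skills:cobol"]),  # 'skills:cobol' was never passed to fit
]

unknown = find_unknown_features(user_tuples, known_features)
print(unknown)
```

An empty dictionary means every feature is known and the tuples are safe to pass on.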

The following method and some of the explanation in this blog post were taken from: https://towardsdatascience.com/how-i-would-explain-building-lightfm-hybrid-recommenders-to-a-5-year-old-b6ee18571309

The specific code for my use case is below:

In [None]:
def feature_colon_value(my_list):
    """
    Takes the list of feature values of one user and prepends the matching
    feature name to each value.
    For example: if my_list = ['development', 'azure', 'dutch', 'Junior', 'python'],
    resultant output = ['ou:development', 'skills:azure', 'language:dutch',
                        'grade:Junior', 'career interests:python']
    """
    result = []
    # Prefixes must be in the same order as the columns selected below
    ll = ['ou:', 'skills:', 'language:', 'grade:', 'career interests:']
    for x, y in zip(ll, my_list):
        result.append(str(x) + str(y))
    return result

# Using the helper function to generate user features in proper format for ALL users
ad_subset = usersdf[['ou', 'skills', 'language', 'grade', 'career interests']]
ad_list = [list(x) for x in ad_subset.values]
feature_list = [feature_colon_value(item) for item in ad_list]
print(f'Final output: {feature_list}')

Basically, from our user dataframe we drop the columns that we think are not relevant for the recommendation (e.g. name, age, etc.). Then, with some Python magic, we create a list in which each element is the list of `feature_name:feature_value` strings for the user at that index position.

Finally, we need to add the User-Id to each element of the array, which can be done with the following line:

In [None]:
user_tuple = list(zip(usersdf['User-Id'], feature_list))
user_tuple
Output:
[(8361131,
  ['ou:development',
   'skills:azure',
   'language:dutch',
   'grade:Junior',
   'career interests:solutions design']),
 (2162101,
  ['ou:development',
   'skills:javascript',
   'language:french',
   'grade:Junior',
   'career interests:javascript']),
 (81727,
  ['ou:operations',
   'skills:pm',
   'language:german',
   'grade:Junior',
   'career interests:pm']),

We are almost there.

Now that we have our user features in the required format, we can call the `build_user_features` method:

In [None]:
user_features = dataset1.build_user_features(user_tuple, normalize= False)

Again, this returns a sparse matrix, and if we call the `.todense()` method we get a clearer representation.

In [None]:
user_features.todense()
matrix([[1., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 0., 0.]], dtype=float32)

In the `user_features` matrix above, the rows are the users, and columns are the user features. There is a 1 present whenever that user has that particular user feature present in training data.

If we call the `.shape` attribute on the `user_features` matrix, we get:

In [None]:
user_features.shape
Output:
(9996, 10025)

In [None]:
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset1.mapping()
user_feature_map

The `mapping` method above returns the user id mapping, the user feature mapping, the item id mapping and the item feature mapping. These can be handy when debugging, and they are very practical in the prediction phase.
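The second number in the shape can look surprising at first. By default the LightFM `Dataset` also creates one identity feature per user, so the number of columns is the number of users plus the number of metadata features. The small check below redoes that arithmetic with the numbers from this post, using the per-column counts from the feature list printed earlier.

```python
# Number of distinct values per user feature column, counted from the list printed earlier.
unique_values_per_column = {
    "ou": 5,
    "skills": 7,
    "language": 5,
    "grade": 5,
    "career interests": 7,
}

n_users = 9996  # first dimension of user_features.shape
n_metadata_features = sum(unique_values_per_column.values())

# One identity column per user, plus one column per metadata feature.
expected_columns = n_users + n_metadata_features
print(n_metadata_features, expected_columns)
```

The result matches the second dimension of `user_features.shape`, and it also explains the diagonal of ones visible in the dense printout.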

### Let's build the model

Everything we did before was in order to be able to fit the model, which expects the interactions and the user features as sparse matrices. The fit method takes the interactions, the user features, and some other optional parameters.

In [None]:
model = LightFM(loss='warp')
model.fit(interactions,
          user_features=user_features,
          epochs=10)

After training our model, we can evaluate it with AUC:

In [None]:
from lightfm.evaluation import auc_score
train_auc = auc_score(model,
                      interactions,
                      user_features=user_features
                      ).mean()
print('Hybrid training set AUC: %s' % train_auc)
Output:
Hybrid training set AUC: 0.9402231

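Remember that this is dummy data and that the AUC above is computed on the same interactions the model was trained on, so the model might be overfitting. Don't get too excited about the 0.94 score.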
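For some intuition about what the AUC number measures: for a single user, it is the probability that a randomly chosen training the user took is scored higher than a randomly chosen training the user did not take. The sketch below computes that for one user in plain Python with made-up scores. It illustrates the metric and is not LightFM's `auc_score` implementation.

```python
def user_auc(scores, positives):
    """Fraction of (positive, negative) item pairs where the positive scores higher."""
    positive_items = [i for i in range(len(scores)) if i in positives]
    negative_items = [i for i in range(len(scores)) if i not in positives]
    wins = 0.0
    for p in positive_items:
        for n in negative_items:
            if scores[p] > scores[n]:
                wins += 1.0
            elif scores[p] == scores[n]:
                wins += 0.5  # ties count as half
    return wins / (len(positive_items) * len(negative_items))


# Made-up scores for five trainings; the user took trainings 0 and 3.
scores = [0.9, 0.1, 0.4, 0.3, -0.2]
auc = user_auc(scores, positives={0, 3})
print(auc)
```

A more trustworthy number would come from interactions that were held out from training, for example by splitting the interactions with `lightfm.cross_validation.random_train_test_split` before fitting.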

### Let's predict for known users

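The predict method takes two main parameters: the user id (LightFM's internal index, not the original id) and the list of item ids. If the model was trained with user features, the same feature matrix should be passed as well. Here we use user_id_map from the previous step to get the internal index of a specific user (user_x).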

In [None]:
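import numpy as np
user_x = user_id_map[9212216]  # just a random user
n_users, n_items = interactions.shape  # no of users * no of items
# Pass the same user_features used for training so the user metadata is taken into account
model.predict(user_x, np.arange(n_items), user_features=user_features)  # predict for all items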

This returns the score for each item (training) as an array:

In [None]:
array([-0.49955484, -0.4502962 , -0.6466697 , -0.7361969 , -0.30803648,
        0.01278364, -0.37532082, -0.2221036 , -0.7242191 , -1.6705698 ,
       -0.01221651, -0.23012483, -0.89942145, -1.3498331 , -0.7373183 ,
       -0.20021401,  0.21310112, -0.9948864 ,  0.13983092, -0.7846861 ,
       -0.5542359 , -0.30498767,  1.0424366 , -0.29013318, -0.23596957,
        0.1327716 , -0.49574524, -1.5379183 , -0.7636943 , -0.12699573,
        0.14224172, -0.4512871 , -0.49226752,  0.01528413,  0.4442131 ],
      dtype=float32)

From the documentation: this method returns an array of scores corresponding to the score assigned by the model to pairs of inputs. Importantly, this means the i-th element of the output array corresponds to the score for the i-th user-item pair in the input arrays.

Concretely, you should expect the `lfm.predict([0, 1], [8, 9])` to return an array of np.float32 that may look something like `[0.42 0.31]`, where `0.42` is the score assigned to the user-item pair `(0, 8)` and `0.31` the score assigned to pair `(1, 9)` respectively.
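The pairing behaviour is easy to get wrong, so here is a toy stand-in that mimics it. The score table and the function `toy_predict` are invented for illustration; the two behaviours it copies from LightFM's `predict` are that scores are produced pair by pair, and that a single integer user id is repeated for every item.

```python
# Invented score table: toy_scores[user][item].
toy_scores = [
    [0.10, 0.42, 0.05],
    [0.31, 0.20, 0.77],
]


def toy_predict(user_ids, item_ids):
    """Score user-item pairs element-wise; a single user id is repeated for every item."""
    if isinstance(user_ids, int):
        user_ids = [user_ids] * len(item_ids)
    if len(user_ids) != len(item_ids):
        raise ValueError("user_ids and item_ids must have the same length")
    return [toy_scores[u][i] for u, i in zip(user_ids, item_ids)]


pairwise = toy_predict([0, 1], [1, 0])   # pairs (0, 1) and (1, 0)
one_user = toy_predict(1, [0, 1, 2])     # user 1 against every item
print(pairwise)
print(one_user)
```

The second call is the pattern used in this post: one user against all items.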

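If you check the LightFM documentation, you will see that you can also use `predict_rank`. It returns a sparse matrix with the rank of each interaction you pass in among all items for that user, where the highest ranked item has rank 0. To get the actual recommendations for a specific user, sort the scores returned by `predict` in descending order; the first ones are the recommendations.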
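Turning the scores into a short list of recommendations takes two steps: sort the internal item indices by score, then translate them back to the original Training-Ids by inverting the item id mapping. The sketch below does this in plain Python; the scores, the ids and the helper name `top_n_items` are made up.

```python
def top_n_items(scores, item_id_map, n=3):
    """Return the n original item ids with the highest scores, best first."""
    # item_id_map goes from original id to internal index, so invert it.
    index_to_item = {index: item_id for item_id, index in item_id_map.items()}
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [index_to_item[i] for i in ranked[:n]]


# Made-up scores and id mapping for four trainings.
scores = [-0.5, 1.04, 0.21, 0.44]
item_id_map = {"T-100": 0, "T-200": 1, "T-300": 2, "T-400": 3}

recommended = top_n_items(scores, item_id_map, n=2)
print(recommended)
```

In a real application you would probably also filter out the trainings the user has already taken before cutting the list to `n`.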

### The recommendation method

And finally, as a bonus, I will leave you the `sample_recommendation_user` method. It takes as input the trained model, the interactions matrix, an existing user ID, the users dataframe, the trainings dataframe and the trainings-taken dataframe. The end result of this method is the trainings taken by the user and the trainings recommended to the user.

In [None]:
def sample_recommendation_user(model, interactions, user_id, usersdf,
                               trainingsdf, trainingstakendf, threshold=0,
                               nrec_items=25, show=True, user_features=None):
    # NOTE: this helper expects the id columns to be named 'EID' (user) and 'TID' (training).
    # It also assumes that the row positions of usersdf and trainingsdf match the
    # internal LightFM indices, i.e. a default index in the same order as passed to fit.
    # threshold is currently unused.
    n_users, n_items = interactions.shape
    userInfo = usersdf[usersdf['EID'] == user_id]
    user_x = np.int64(userInfo.index).item()
    # Pass the user_features used for training so the user metadata is taken into account
    scores = pd.Series(model.predict(user_x, np.arange(n_items), user_features=user_features))
    Taken = ""
    Recom = ""

    # Attach each score to its training by row position
    resulting = trainingsdf.merge(scores.to_frame().reset_index(), left_index=True, right_index=True)
    resulting.drop(columns=['index'], inplace=True)
    resulting.rename(columns={0: "Score"}, errors="raise", inplace=True)

    # Keep the nrec_items trainings with the highest score
    resulting.sort_values('Score', ascending=False, inplace=True)
    resulting = resulting.head(nrec_items)
    resulting.reset_index(drop=True, inplace=True)

    # Trainings the user has already taken, with their titles
    users_trainingTaken = trainingstakendf[trainingstakendf['EID'] == user_id].drop_duplicates()
    users_trainingTaken = pd.merge(users_trainingTaken, trainingsdf, how="inner", on='TID')
    users_trainingTaken.reset_index(drop=True, inplace=True)

    if show:
        for ix, row in users_trainingTaken.iterrows():
            Taken = Taken + str({row["Training Title"]})
        for ix, row in resulting.iterrows():
            Recom = Recom + str({row["Training Title"]})

    d = {'ID': user_id, 'Trainings Taken': Taken, 'Trainings Recommended': Recom}
    returndf = pd.DataFrame(data=d, index=[user_id])

    return returndf

### Registering the Model in Azure

Finally, we need to be able to use the model in our apps. For that we can register the model in Azure ML and maybe even deploy it as a web service. To register the model, we first need to save it as a binary (pickle) file, and then we can use the Azure ML SDK to register it.

In [None]:
import pickle
with open('savefile.pickle', 'wb') as fle:
    pickle.dump(model, fle, protocol=pickle.HIGHEST_PROTOCOL)

This will save the pickle file in your current directory.

In [None]:
from azureml.core import Workspace
from azureml.core.model import Model
ws = Workspace(subscription_id="x", resource_group="y", workspace_name="z")
# Use a separate name so the trained LightFM model is not overwritten
registered_model = Model.register(ws, model_name="recommender", model_path="./savefile.pickle")

And finally with the code above we register the model in Azure.

![Azure ML Models](https://miro.medium.com/max/556/1*c90R148d2QGHUfPo3yrB0Q.png)

Once the model is registered into Azure ML, you can deploy it as a web service or load it from Python code to reuse it in your predictions.
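Loading the model back from Python is the mirror image of saving it. The sketch below shows the round trip with a plain dictionary standing in for the trained LightFM model and a temporary folder standing in for the download location; both are assumptions made only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model, used only for this illustration.
model_to_save = {"name": "stand-in for the trained model", "epochs": 10}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "savefile.pickle")
    # Save, as done earlier in the post.
    with open(path, "wb") as fle:
        pickle.dump(model_to_save, fle, protocol=pickle.HIGHEST_PROTOCOL)
    # Load it back, as an application would do before calling predict.
    with open(path, "rb") as fle:
        loaded_model = pickle.load(fle)

print(loaded_model)
```

Two practical notes: only unpickle files that come from a source you trust, and the environment that loads the real model needs LightFM installed.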

## Final Words

Today we went through all the steps needed to build a recommender system. We referenced a blog post with a clear theoretical introduction to recommender systems, we used fake data registered in Azure ML as datasets, and we went step by step through preparing the datasets in the specific formats that LightFM expects.

Later we made some predictions for a random user ID, and finally we registered the LightFM model by first exporting it as a pickle file and then registering this file as a model for later use in Azure ML.

We didn't go into the details of evaluation metrics or of tuning parameters and features, but I hope you now have a clear overview of the entire process so that you can apply it to your specific needs.