<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Regression-based Rating Score Prediction using Embedding Features**


Estimated time needed: **45** minutes


In the previous lab, you trained a neural network to predict user-item interactions while simultaneously extracting the user and item embedding features. This lab extends that work by using the two embedding vectors as input to a separate predictive model that predicts the rating.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/rating_regression.png)



Another way to make rating predictions is to use the embeddings as input to a predictive model by aggregating them into a single feature vector, the input data `X`.

With an interaction label `Y`, such as a rating score or an enrollment mode, we can build other standalone predictive models to approximate the mapping from `X` to `Y`, as shown in the flowchart above.


In this lab, you will be given the course interaction feature vectors as input data `X` and consider label `Y` as the numerical rating scores. As such, we turn the recommender system into a common regression task and you can apply what you have learned about regression modeling to predict the ratings.


## Objectives


After completing this lab you will be able to:


* Build regression models to predict ratings using the combined embedding vectors


----


## Prepare and set up the lab environment


First install and import required libraries:


In [1]:
#!pip install scikit-learn==1.0.2

In [2]:
# also set a random state
rs = 123

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

import numpy as np
import random

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers



### Load datasets


In [4]:
rating_url = "nice_data.csv"

bias = False
if bias:
    emb_url = "bui_df.csv"
else:
    emb_url = "ui_df.csv"
    



The first dataset is the rating dataset, which contains the user-item interaction matrix.


In [5]:
rating_df = pd.read_csv(rating_url)

In [6]:
# Binarize the ratings; .loc assigns in place and avoids chained-assignment warnings
rating_df.loc[rating_df["rating"] < 1, "rating"] = 0
rating_df.loc[rating_df["rating"] > 1, "rating"] = 1
rating_df.head()


Unnamed: 0,user,item,rating
0,2,BD0221EN,1.0
1,2,LB0107ENv1,1.0
2,2,SC0105EN,1.0
3,2,CO0201EN,0.0
4,2,BD0123EN,1.0


As you can see from the data above, the user and item columns contain only ids. Let's substitute them with their embedding vectors:
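One way to perform this substitution is a pair of left joins with `pandas.merge`. The sketch below is only an illustration on tiny made-up frames: `toy_ratings`, `user_emb`, `item_emb`, and their values are hypothetical stand-ins, and the layout of the embedding file behind `emb_url` is assumed rather than known.

```python
import pandas as pd

# Hypothetical toy data; the real ratings and embeddings come from the CSV files above.
toy_ratings = pd.DataFrame({
    "user": [2, 2, 5],
    "item": ["BD0221EN", "SC0105EN", "BD0221EN"],
    "rating": [1.0, 0.0, 1.0],
})
user_emb = pd.DataFrame({
    "user": [2, 5],
    "UFeature0": [0.02, -0.01],
    "UFeature1": [0.03, 0.04],
})
item_emb = pd.DataFrame({
    "item": ["BD0221EN", "SC0105EN"],
    "CFeature0": [0.10, -0.20],
    "CFeature1": [0.05, 0.07],
})

# Replace each id with its embedding columns via two left joins.
toy_merged = (
    toy_ratings
    .merge(user_emb, how="left", on="user")
    .merge(item_emb, how="left", on="item")
    .fillna(0)  # assumption: ids without an embedding get zero vectors
)
print(toy_merged)
```

A left join keeps every rating row, so the merged frame has the same number of rows as the rating data, plus one column per embedding dimension.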


In [21]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,2,BD0221EN,1.0,0.02163,0.025561,-0.116771,-0.026099,0.08159,0.118796,-0.110325,...,0.001127,-0.042721,-0.020004,-0.00416,-0.022699,-0.054916,-0.01123,-0.037942,-0.016477,0.01604
1,2,LB0107ENv1,1.0,0.02163,0.025561,-0.116771,-0.026099,0.08159,0.118796,-0.110325,...,0.026085,0.023909,-0.01502,0.018504,-0.033117,0.04193,-0.026071,0.011439,0.015818,-0.013199
2,2,SC0105EN,1.0,0.02163,0.025561,-0.116771,-0.026099,0.08159,0.118796,-0.110325,...,0.024451,-0.021855,0.076256,-0.004345,-0.012459,-0.0152,0.016222,0.002568,0.00456,0.035618
3,2,CO0201EN,0.0,0.02163,0.025561,-0.116771,-0.026099,0.08159,0.118796,-0.110325,...,-0.00308,0.023442,-0.018411,0.0056,-0.018589,-0.007688,-0.01245,-0.033382,0.017794,-0.000448
4,2,BD0123EN,1.0,0.02163,0.025561,-0.116771,-0.026099,0.08159,0.118796,-0.110325,...,0.006007,0.006638,-0.002519,0.016422,-0.054736,0.015872,0.015215,-0.012534,-0.009105,-0.017294


Next, we can combine the user features (the column labels starting with `UFeature`) and the item features (the column labels starting with `CFeature`). In machine learning, there are many ways to aggregate two feature vectors, such as element-wise add, multiply, max/min, average, etc. Here we simply add the two sets of feature columns:


In [None]:
# Both embeddings have 16 dimensions (UFeature0..UFeature15, CFeature0..CFeature15)
u_features = [f"UFeature{i}" for i in range(16)]
c_features = [f"CFeature{i}" for i in range(16)]

user_embeddings = merged_df[u_features]
course_embeddings = merged_df[c_features]
ratings = merged_df['rating']

# Aggregate the two feature columns using element-wise add
# (.values drops the column labels so the two frames are added by position)
regression_dataset = user_embeddings + course_embeddings.values
regression_dataset.columns = [f"Feature{i}" for i in range(16)]
regression_dataset['rating'] = ratings
regression_dataset.head()
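The other aggregation options mentioned above can be compared side by side. The following sketch uses small made-up NumPy arrays (`u` and `c` are hypothetical, not the lab's embeddings) and also includes concatenation, which is not named in the text but is a common alternative that keeps both vectors intact.

```python
import numpy as np

# Hypothetical 2-row, 3-dimensional embeddings, for illustration only.
u = np.array([[0.1, -0.2, 0.3],
              [0.0, 0.5, -0.1]])
c = np.array([[0.2, 0.2, -0.3],
              [0.4, -0.5, 0.1]])

aggregations = {
    "add": u + c,
    "multiply": u * c,
    "average": (u + c) / 2,
    "max": np.maximum(u, c),
    "min": np.minimum(u, c),
    "concat": np.concatenate([u, c], axis=1),  # doubles the feature width
}

for name, features in aggregations.items():
    print(f"{name:8s} -> shape {features.shape}")
```

All element-wise options keep the original dimensionality, while concatenation doubles it and leaves it to the model to learn how the two vectors interact.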

By now, we have built the input dataset `X` and the output vector `y`:


In [None]:
X = regression_dataset.iloc[:, :-1]
y = regression_dataset.iloc[:, -1]
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

## TASK: Perform regression on the interaction dataset


Now that our input data `X` and output `y` are ready, let's build regression models to map `X` to `y` and predict ratings.


In [None]:
y.unique()


In an online course system, we may consider the `Completion` mode to be `larger` than the `Audit` mode, as a learner needs to put in more effort to complete a course. If we treat this as a regression problem, we would expect the regression model to output ratings ranging from 2.0 to 3.0. To interpret the regression model's output, we can treat values closer to 2.0 as `Audit` and values closer to 3.0 as `Completion`.
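The "closer to" rule above amounts to thresholding at the midpoint between the two rating values. Below is a minimal sketch; the helper `to_mode` and the sample predictions are hypothetical, and sending a tie at the midpoint to `Completion` is an arbitrary choice. If the ratings have been recoded (for example to 0 and 1), pass the recoded values as `low` and `high`.

```python
import numpy as np

def to_mode(predictions, low=2.0, high=3.0):
    """Map continuous predictions to the nearer of the two rating values."""
    midpoint = (low + high) / 2
    predictions = np.asarray(predictions, dtype=float)
    # Ties at the midpoint go to "Completion" (an arbitrary convention).
    return np.where(predictions >= midpoint, "Completion", "Audit")

sample_predictions = [2.1, 2.49, 2.5, 2.9, 3.2]  # made-up model outputs
print(to_mode(sample_predictions))
```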


You may use `sklearn` to train and evaluate various regression models.


_TODO: First split dataset into training and testing datasets_


In [None]:
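### WRITE YOUR CODE HERE
from sklearn.preprocessing import StandardScaler

# Optional preprocessing, left disabled: standardize the features (scale X only, never y)
#scaler = StandardScaler()
#X = scaler.fit_transform(X)
#y = y - 2

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rs, test_size=0.3)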

In [None]:
X_train

In [None]:
sum( y_test == 2 ) , sum( y_test == 3 )

<details>
    <summary>Click here for Hints</summary>
    
Use `train_test_split()` to split the dataset into training and testing datasets. Use `X, y` as the input dataset and output vector. Don't forget to specify `random_state = rs` and `test_size=0.3`.

</details>


_TODO: Create a basic linear regression model_


In [None]:
### WRITE YOUR CODE HERE
model = linear_model.Ridge(alpha=0.2)

<details>
    <summary>Click here for Hints</summary>
    
You can call the `linear_model.Ridge()` constructor and specify `alpha=0.2` (it controls the regularization strength) in the parameters.

</details>
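To see what the regularization strength does, the sketch below fits `Ridge` with increasing `alpha` on synthetic data and prints the norm of the learned coefficients. The data (`X_demo`, `y_demo`) and the alpha values are made up for illustration and are unrelated to the lab dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data, for illustration only: 200 samples, 16 features.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 16))
true_coef = rng.normal(size=16)
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

coef_norms = {}
for alpha in [0.2, 20.0, 2000.0]:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms[alpha] = float(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha:>7}: ||coef|| = {coef_norms[alpha]:.3f}")
```

Larger `alpha` values shrink the coefficients toward zero, which trades some fit on the training data for lower variance.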


_TODO: Train the basic regression model with training data_


In [None]:
### WRITE YOUR CODE HERE
model.fit(X_train, y_train)

<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.fit()` method with `X_train, y_train` parameters.

</details>


_TODO: Evaluate the basic regression model_


In [None]:
### WRITE YOUR CODE HERE

### The main evaluation metric is RMSE but you may use other metrics as well

y_predict = model.predict(X_test)

# RMSE is the square root of the MSE; np.sqrt avoids relying on the `squared` argument
np.sqrt(mean_squared_error(y_test, y_predict)), y_predict.shape

In [None]:
min(y_train), max(y_train), min(y_predict), max(y_predict)

In [None]:
# Sweep the Ridge regularization strength and record the test RMSE for each value
rmses = []
alphas = np.arange(0, 10, 0.5)
for alpha in alphas:
    model = linear_model.Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print(min(y_train), max(y_train), min(y_predict), max(y_predict))
    rmse = np.sqrt(mean_squared_error(y_test, y_predict))
    print(alpha, rmse)
    rmses.append(rmse)

plt.plot(alphas, rmses)
plt.xlabel("alpha")
plt.ylabel("RMSE")

<details>
    <summary>Click here for Hints</summary>
    
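You can call the `model.predict()` method with the `X_test` parameter to get model predictions. Then use `mean_squared_error()` with `y_test, your_predictions` parameters and take the square root of the result (or pass `squared=False` on scikit-learn versions that support it) to calculate the RMSE.

</details>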
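An RMSE value is easier to judge next to a trivial baseline. The sketch below computes RMSE as the square root of the MSE and compares it with a constant predictor; the arrays are made-up numbers, not results from this lab, and in practice the baseline constant would be the mean of `y_train`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([0.0, 1.0, 1.0, 0.0])  # made-up labels
y_pred_demo = np.array([0.2, 0.7, 0.9, 0.4])  # made-up predictions

# RMSE as the square root of the MSE.
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# Baseline: always predict the mean label.
baseline = np.full_like(y_true_demo, y_true_demo.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_true_demo, baseline))

print(f"model RMSE:    {rmse:.4f}")
print(f"baseline RMSE: {baseline_rmse:.4f}")
```

A model whose RMSE is not clearly below the baseline has learned little from the features.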


_TODO: Try different regression models such as Ridge, Lasso, ElasticNet and tune their hyperparameters to see which one has the best performance_


In [None]:
### WRITE YOUR CODE HERE
# Sweep the Lasso regularization strength and record the test RMSE for each value
rmses = []
alphas = np.arange(0.01, 10, 0.5)
for alpha in alphas:
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print(min(y_train), max(y_train), min(y_predict), max(y_predict))
    rmse = np.sqrt(mean_squared_error(y_test, y_predict))
    print(alpha, rmse)
    rmses.append(rmse)

# Plot against alpha rather than the list index
plt.plot(alphas, rmses)
plt.xlabel("alpha")
plt.ylabel("RMSE")

In [None]:
min(y_predict), max(y_predict)

In [None]:
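### WRITE YOUR CODE HERE
# Sweep the ElasticNet regularization strength over several orders of magnitude
rmses = []
alphas = [0.00000001, 0.000001, 0.0001, 0.01, 0.1]
for alpha in alphas:
    model = linear_model.ElasticNet(alpha=alpha)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print(min(y_train), max(y_train), min(y_predict), max(y_predict))
    rmse = np.sqrt(mean_squared_error(y_test, y_predict))
    print(alpha, rmse)
    rmses.append(rmse)

# Plot against alpha on a log scale, since the values span several orders of magnitude
plt.plot(alphas, rmses)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("RMSE")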
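The sweeps above choose `alpha` by looking at the test RMSE, which makes the reported test score optimistic. A common alternative is to tune with cross-validation on the training split only and touch the test split once at the end. The sketch below shows this with `GridSearchCV` on synthetic data; `X_demo`, `y_demo`, and the parameter grid are assumptions for illustration, not values from this lab.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for X and y, for illustration only.
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(300, 16))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

param_grid = {"alpha": [1e-4, 1e-2, 1e-1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=5,
)
search.fit(X_tr, y_tr)  # hyperparameters are chosen on the training split only

# Evaluate the refitted best model once on the held-out split.
test_rmse = float(np.sqrt(np.mean((search.predict(X_te) - y_te) ** 2)))
print("best params:", search.best_params_)
print(f"cross-validated RMSE: {-search.best_score_:.4f}")
print(f"held-out test RMSE:   {test_rmse:.4f}")
```

The same pattern applies to `Ridge` and `Lasso` by swapping the estimator and the parameter grid.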

### Summary


In this lab, you built regression models to predict numerical course ratings using the embedding feature vectors extracted from neural networks. Because the rating has only two categorical values, classification may be a more natural problem statement, so the next lab treats the prediction problem as a classification task.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
