

# **Regression-based Rating Score Prediction using Embedding Features**


Estimated time needed: **45** minutes


In the previous lab, you trained a neural network to predict user-item interactions while simultaneously extracting user and item embedding features. This lab extends that work by using the two embedding vectors as input to a separate model that predicts the rating.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module\_4/images/rating_regression.png)


Another way to make rating predictions is to aggregate the user and item embeddings into a single feature vector and use it as the input data `X`.

With an interaction label `Y`, such as a rating score or an enrollment mode, we can build other standalone predictive models to approximate the mapping from `X` to `Y`, as shown in the flowchart above.


In this lab, you are given the course interaction feature vectors as input data `X` and use the numerical rating scores as the label `Y`. This turns the recommender system into an ordinary regression task, so you can apply what you have learned about regression modeling to predict the ratings.


## Objectives


After completing this lab you will be able to:


*   Build regression models to predict ratings using the combined embedding vectors


***


## Prepare and set up the lab environment


First install and import required libraries:


In [None]:
!pip install scikit-learn==1.0.2

In [1]:
# also set a random state
rs = 123

In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

### Load datasets


In [3]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset, which contains a user-item interaction matrix.


In [45]:
#rating_df = pd.read_csv(rating_url)
rating_df = pd.read_csv('data/ratings.csv')

In [46]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


In [47]:
rating_df.shape

(233306, 3)

As the data above shows, the `user` and `item` columns are just IDs. Let's substitute them with their embedding vectors:


In [24]:
# Load user embeddings
#user_emb = pd.read_csv(user_emb_url)
#user_emb.to_csv('data/user_embeddings.csv', sep=',', header=True, index=False)
#user_emb = pd.read_csv('data/user_embeddings.csv')
user_emb = pd.read_csv('data/user_embeddings_computed.csv')
# Load item embeddings
#item_emb = pd.read_csv(item_emb_url)
#item_emb.to_csv('data/course_embeddings.csv', sep=',', header=True, index=False)
#item_emb = pd.read_csv('data/course_embeddings.csv')
item_emb = pd.read_csv('data/course_embeddings_computed.csv')

In [25]:
user_emb.head()

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.092619,-0.231117,0.207782,0.075256,-0.175193,0.505722,-0.011609,-0.325078,0.152641,-0.265612,0.094592,-0.262018,0.838327,-0.194199,-0.153928,0.389778
1,1342067,0.190924,0.165438,0.108368,0.069375,0.041172,-0.05772,0.1167,0.497182,-0.051477,-0.164022,0.053494,0.204487,-0.138072,-0.084156,0.168388,0.03277
2,1990814,-0.015588,0.252763,0.063455,-0.099508,-0.021063,-0.315386,-0.361852,-0.055602,-0.665861,-0.089062,0.541673,0.071023,0.132675,0.17868,-0.00407,-0.061975
3,380098,-0.28042,-0.133849,-0.227513,0.00672,0.57547,0.15379,-0.395521,0.158342,-0.263698,0.137179,-0.291739,0.187631,-0.28021,0.259173,-0.023921,0.168686
4,779563,-0.01546,-0.271762,-0.472752,0.053249,-0.024169,-0.300003,0.208167,-0.101409,-0.087301,-0.062453,0.039226,-0.096039,-0.048271,-0.187053,0.294076,0.226417


In [26]:
user_emb.shape

(33901, 17)

In [27]:
item_emb.head()

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,-0.048532,0.030396,0.003829,0.004685,-0.001384,0.051492,0.014301,0.027785,0.090873,-0.047259,-0.00886,0.037038,0.01446,0.045229,-0.1027,0.088245
1,CL0101EN,0.06429,0.044361,0.047307,-0.010566,0.012116,0.026928,0.016909,0.095854,0.018161,0.012876,-0.003892,0.065286,0.0626,0.117804,0.019665,-0.023349
2,ML0120ENv3,0.034007,-0.002219,-0.040372,0.056647,0.110738,0.069199,0.104628,-0.027713,0.0524,-0.028607,0.079523,-0.008803,-0.070613,0.101053,-0.02365,-0.065341
3,BD0211EN,0.009033,-0.005793,0.017004,0.080865,-0.070522,0.057408,0.025328,-0.040609,-0.000427,-0.067003,0.102882,-0.001016,0.004374,0.009934,-0.02062,-0.033026
4,DS0101EN,0.038361,-0.013257,-0.034347,-0.017218,-0.003707,0.003634,0.007511,-0.007333,0.067755,-0.037664,-0.026654,-0.038843,-0.07587,-0.039574,-0.016651,0.04029


In [12]:
item_emb.shape

(126, 17)

In [13]:
# Merge user embedding features
user_emb_merged = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(user_emb_merged, item_emb, how='left', left_on='item', right_on='item').fillna(0)

In [14]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,3.0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,...,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,1342067,CL0101EN,3.0,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,...,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,1990814,ML0120ENv3,3.0,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,...,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,380098,BD0211EN,3.0,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,...,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,779563,DS0101EN,3.0,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,...,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
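Because the merges above use `how='left'` followed by `fillna(0)`, any rating whose user or item has no embedding row silently becomes an all-zero feature vector. The sketch below shows one way to count such rows before filling them, using the `indicator` option of `pd.merge`. The toy frames and their values are made up for illustration and only stand in for `rating_df` and `user_emb`.

```python
import pandas as pd

# Toy stand-ins for rating_df and user_emb (values are made up for illustration)
toy_ratings = pd.DataFrame({
    "user": [1, 2, 3],
    "item": ["A", "B", "A"],
    "rating": [3.0, 2.0, 3.0],
})
toy_user_emb = pd.DataFrame({
    "user": [1, 2],
    "UFeature0": [0.1, -0.2],
})

# indicator=True adds a "_merge" column recording where each row found a match
checked = pd.merge(toy_ratings, toy_user_emb, how="left", on="user", indicator=True)

# Rows tagged "left_only" had no embedding and would become zeros after fillna(0)
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"Ratings without a user embedding: {n_unmatched}")
```

If the count is zero for both merges, the `fillna(0)` calls are a no-op and the features can be trusted as they are.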


Next, we can combine the user features (the column labels starting with `UFeature`) and the item features (the column labels starting with `CFeature`). In machine learning, there are many ways to aggregate two feature vectors, such as element-wise add, multiply, max/min, and average. Here we simply add the two sets of feature columns:


In [15]:
u_features = [f"UFeature{i}" for i in range(16)]
c_features = [f"CFeature{i}" for i in range(16)]

user_embeddings = merged_df[u_features]
course_embeddings = merged_df[c_features]
ratings = merged_df['rating']

# Aggregate the two feature columns using element-wise add
regression_dataset = user_embeddings + course_embeddings.values
regression_dataset.columns = [f"Feature{i}" for i in range(16)]
regression_dataset['rating'] = ratings
regression_dataset.head()

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.090378,-0.134799,0.0839,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689,3.0
1,0.059437,-0.08474,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.20466,-0.004188,0.007914,0.02717,0.076114,3.0
2,0.152061,-0.014739,-0.080112,-0.009516,0.02413,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166,0.045002,0.057566,-0.022081,0.108929,3.0
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287,3.0
4,0.112812,-0.001395,-0.011572,-0.032638,-0.08044,-0.057321,0.064595,-0.02088,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266,3.0
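The other aggregation options mentioned above can be sketched with NumPy as follows. The two vectors are made-up 4-dimensional examples (the lab uses 16 dimensions), and concatenation is included as an extra option that is not named in the text; unlike the others, it doubles the feature dimension.

```python
import numpy as np

# Two made-up 4-dimensional embeddings (the lab itself uses 16 dimensions)
u = np.array([0.2, -0.1, 0.4, 0.0])   # hypothetical user embedding
c = np.array([0.1, 0.3, -0.2, 0.5])   # hypothetical course embedding

aggregations = {
    "add": u + c,                      # the choice used in this lab
    "multiply": u * c,
    "max": np.maximum(u, c),
    "min": np.minimum(u, c),
    "average": (u + c) / 2,
    "concat": np.concatenate([u, c]),  # extra option, keeps both vectors intact
}

for name, vec in aggregations.items():
    print(f"{name:>8}: shape={vec.shape}, values={np.round(vec, 2)}")
```

Which aggregation works best is an empirical question; one reasonable approach is to rebuild `X` with each option and compare the resulting test RMSE.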


By now, we have built the input dataset `X` and the output vector `y`:


In [16]:
X = regression_dataset.iloc[:, :-1]
y = regression_dataset.iloc[:, -1]
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)


## TASK: Perform regression on the interaction dataset


Now that our input data `X` and output `y` are ready, let's build regression models to map `X` to `y` and predict ratings.


In [17]:
y.unique()

array([3., 2.])

In an online course system, we may consider the `Completion` mode to be `larger` than the `Audit` mode, as a learner needs to put in more effort to complete a course. If we treat this as a regression problem, we would expect the regression model to output ratings ranging from 2.0 to 3.0. To interpret the regression model output, we can treat values closer to 2.0 as `Audit` and values closer to 3.0 as `Completion`.
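As a minimal sketch of this interpretation, the snippet below maps continuous predictions to the nearer of the two modes. The prediction values are hypothetical, and using the midpoint 2.5 as the cut-off (with ties going to `Completion`) is an assumption rather than something the lab prescribes. The clipping step is included because a linear model can output values outside the 2.0 to 3.0 range.

```python
import numpy as np

# Hypothetical regression outputs; real ones would come from model.predict(X_test)
predicted_ratings = np.array([2.05, 2.49, 2.51, 2.93, 3.20])

# Clip to the valid label range, then pick the nearer of 2.0 and 3.0
clipped = np.clip(predicted_ratings, 2.0, 3.0)
modes = np.where(clipped >= 2.5, "Completion", "Audit")

for value, mode in zip(predicted_ratings, modes):
    print(f"{value:.2f} -> {mode}")
```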


You may use `sklearn` to train and evaluate various regression models.


*TODO: First split dataset into training and testing datasets*


In [28]:
### WRITE YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(
    X, # predictive variables
    y, # target
    test_size=0.1, # portion of dataset to allocate to test set
    random_state=42 # we are setting the seed here, ALWAYS DO IT!
    # stratify=y # if we want to keep class ratios in splits
) # We can also use the stratify argument: stratify = X[variable]
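Since the target here takes only two values, the commented-out `stratify=y` option can keep the share of each rating the same in both splits. The sketch below illustrates this on synthetic data; the array sizes and the 90/10 class proportions are made up and are not taken from the lab dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 16))
# Imbalanced two-valued target: 90% are 3.0 and 10% are 2.0 (made-up proportions)
y_toy = np.array([3.0] * 900 + [2.0] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    random_state=42,
    stratify=y_toy,  # keep the share of each rating value equal across splits
)

share_train = float(np.mean(y_tr == 2.0))
share_test = float(np.mean(y_te == 2.0))
print(f"Share of 2.0 ratings: train={share_train:.3f}, test={share_test:.3f}")
```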

*TODO: Create a basic linear regression model*


In [29]:
### WRITE YOUR CODE HERE
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, RidgeCV
lr = LinearRegression()

*TODO: Train the basic regression model with training data*


In [31]:
### WRITE YOUR CODE HERE
lr.fit(X_train, y_train)

LinearRegression()

*TODO: Evaluate the basic regression model*


In [33]:
### WRITE YOUR CODE HERE

### The main evaluation metric is RMSE but you may use other metrics as well
pred = lr.predict(X_test)

In [35]:
from sklearn.metrics import mean_squared_error

In [37]:
rmse = mean_squared_error(y_test, pred, squared=False)
print(rmse)

0.20798726630543732
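An RMSE value is easier to judge next to a trivial baseline. The sketch below computes RMSE by hand, checks it against `np.sqrt(mean_squared_error(...))`, and compares it with a model that always predicts the mean rating. All arrays are made up for illustration. Taking the square root explicitly is used here because the `squared=False` argument may not be accepted by newer scikit-learn releases than the 1.0.2 version pinned in this lab.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up targets and hypothetical model predictions
y_true = np.array([3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 2.0, 3.0])
y_pred = np.array([2.9, 3.0, 2.4, 2.8, 3.1, 2.9, 2.6, 3.0])

# RMSE is the square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# Baseline that ignores the features and always predicts the mean of y
dummy_X = np.zeros((len(y_true), 1))
baseline = DummyRegressor(strategy="mean").fit(dummy_X, y_true)
rmse_baseline = np.sqrt(mean_squared_error(y_true, baseline.predict(dummy_X)))

print(f"model RMSE={rmse_manual:.4f}, baseline RMSE={rmse_baseline:.4f}")
```

In the lab itself, the baseline would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`; a regression model is only useful to the extent that it beats this constant prediction.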


*TODO: Try different regression models such as Ridge, Lasso, ElasticNet and tune their hyperparameters to see which one has the best performance*


In [43]:
### WRITE YOUR CODE HERE
# Regularized: we can also do cross-validation, see below: LassoCV, RidgeCV, etc.
alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]
lasso = LassoCV(alphas=alphas, random_state=0) # alpha: regularization strength
ridge = RidgeCV(alphas=alphas) 

lasso.fit(X_train, y_train)
ridge.fit(X_train, y_train)

pred_lasso = lasso.predict(X_test)
pred_ridge = ridge.predict(X_test)

rmse_lasso = mean_squared_error(y_test, pred_lasso, squared=False)
print(rmse_lasso)

rmse_ridge = mean_squared_error(y_test, pred_ridge, squared=False)
print(rmse_ridge)

0.20853908210448432
0.2079865369488658
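The TODO above also mentions ElasticNet, which combines the Lasso and Ridge penalties. A minimal sketch with `ElasticNetCV` follows. It runs on synthetic data so that it is self-contained, and the `alphas` and `l1_ratio` grids are illustrative assumptions, not tuned values for the lab dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 16 aggregated embedding features
rng = np.random.default_rng(0)
X_toy = rng.normal(scale=0.1, size=(500, 16))
true_coef = rng.normal(size=16)
y_toy = 2.8 + X_toy @ true_coef + rng.normal(scale=0.05, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.1, random_state=42)

alpha_grid = [0.0005, 0.005, 0.05, 0.5]  # illustrative regularization strengths
enet = ElasticNetCV(
    alphas=alpha_grid,
    l1_ratio=[0.1, 0.5, 0.9],  # mix between L2 (near 0) and L1 (near 1) penalties
    cv=5,
    random_state=0,
)
enet.fit(X_tr, y_tr)

rmse_enet = np.sqrt(mean_squared_error(y_te, enet.predict(X_te)))
print(f"alpha={enet.alpha_}, l1_ratio={enet.l1_ratio_}, RMSE={rmse_enet:.4f}")
```

The fitted `alpha_` attribute (also available on `LassoCV` and `RidgeCV`) shows which regularization strength cross-validation selected. If it sits at the smallest value in the grid, extending the grid toward smaller values is worth trying.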


### Summary


In this lab, you built regression models to predict numerical course ratings from the embedding feature vectors extracted by a neural network. Because the rating takes only two values, classification is arguably a more natural framing, and the next lab treats the prediction problem that way.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork32585014-2022-01-01)


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description          |
| ----------------- | ------- | ---------- | --------------------------- |
| 2021-10-25        | 1.0     | Yan        | Created the initial version |


Copyright © 2021 IBM Corporation. All rights reserved.
