<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Regression-based Rating Score Prediction using Embedding Features**


Estimated time needed: **45** minutes


In the previous lab, you trained a neural network to predict user-item interactions while simultaneously extracting the user and item embedding features. This lab extends that approach by feeding the two embedding vectors into a model that predicts the rating.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/rating_regression.png)



Another way to make rating predictions is to aggregate the two embeddings into a single feature vector and use that vector as the input data `X`.

With an interaction label `Y`, such as a rating score or an enrollment mode, we can then build standalone predictive models that approximate the mapping from `X` to `Y`, as shown in the flowchart above.


In this lab, you will be given the course interaction feature vectors as input data `X` and consider label `Y` as the numerical rating scores. As such, we turn the recommender system into a common regression task and you can apply what you have learned about regression modeling to predict the ratings.


## Objectives


After completing this lab you will be able to:


* Build regression models to predict ratings using the combined embedding vectors


----


## Prepare and set up lab environment


First install and import required libraries:


In [6]:
!pip install scikit-learn==1.0.2



In [7]:
# also set a random state
rs = 123

In [8]:
!pip install pandas



In [9]:
import pandas as pd

In [10]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

## Load Dataset

In [11]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset, which stores the user-item interaction matrix in long format, with one rating per row:


In [12]:
rating_df = pd.read_csv(rating_url)

In [13]:
rating_df.shape

(233306, 3)

In [14]:
rating_df.head(10)

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0
5,1390655,ST0101EN,3.0
6,367075,DS0301EN,3.0
7,1858700,CC0101EN,3.0
8,600100,BD0211EN,3.0
9,623377,DS0105EN,3.0


As you can see from the data above, the `user` and `item` columns contain only IDs. Let's substitute them with their embedding vectors:


In [16]:
# Load user embeddings
user_emb = pd.read_csv(user_emb_url)
# Load item embeddings
item_emb = pd.read_csv(item_emb_url)

In [17]:
user_emb.head(10)

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.08283,-0.058721,0.057929,-0.001472
1,1342067,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,1990814,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,380098,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.09066,-0.068545,0.008967,0.063962,0.052347,0.018072
4,779563,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983
5,1390655,0.023796,0.063062,0.111711,0.008723,0.083231,0.095042,0.02642,-0.014873,-0.028716,0.04214,-0.012092,0.081946,0.006987,-0.073148,0.044278,0.044275
6,367075,-0.058648,-0.089343,0.12169,0.019357,-0.037281,0.049743,0.063332,0.045058,-0.006939,-0.009103,-0.211956,-0.050017,-0.158781,0.031542,0.037287,-0.041091
7,1858700,0.021692,-0.01002,-0.033231,-0.065473,0.032229,-0.019532,0.023956,-0.047255,-0.028421,0.027716,0.081974,-0.059678,0.076866,-0.101812,-0.002822,0.049491
8,600100,-0.037679,-0.051901,0.011822,-0.027294,0.020662,-0.033249,-0.048535,0.023609,0.000347,0.089887,0.024953,-0.091012,-0.140504,0.01819,0.035233,0.011054
9,623377,0.050706,0.092145,0.004021,0.046852,0.088993,0.095863,0.022414,-0.093847,-0.06113,-0.014215,-0.02837,-0.017722,-0.013436,0.063654,-0.01056,0.114723


In [18]:
user_emb.shape

(33901, 17)

In [19]:
item_emb.head(10)

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,CL0101EN,-0.008611,0.028041,0.021899,-0.001465,0.0069,-0.017981,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,ML0120ENv3,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,BD0211EN,0.020163,-0.011972,-0.003714,-0.015548,-0.00754,0.014847,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,DS0101EN,0.006399,0.000492,0.00564,0.009639,-0.005487,-0.00059,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
5,ST0101EN,0.011419,0.001678,-0.002454,0.01898,-0.018471,-0.010495,-0.010447,0.003091,-0.00467,0.016963,-0.006641,0.015515,-0.025659,-0.0195,-0.025636,-0.011817
6,DS0301EN,0.019708,0.009191,0.015346,0.002516,-0.000523,0.007861,0.004296,0.008389,-0.000931,-0.025427,0.021304,0.01558,-0.003242,0.010813,2.3e-05,0.015307
7,DS0105EN,-0.013659,-0.01727,0.011621,0.007102,0.00582,0.005197,0.011616,-0.004366,-0.003061,-0.019932,-0.005987,0.01493,-0.010595,-0.01712,0.001179,-0.00888
8,BD0141EN,0.007567,-0.029327,0.000919,0.030705,-0.009864,0.000851,0.002402,0.003107,-0.019846,0.013243,0.010134,0.016171,-0.019714,-0.005965,-0.014285,0.006799
9,CO0201EN,-0.032723,0.006079,-0.007806,0.009733,-0.003652,-0.026304,-0.00308,0.023442,-0.018411,0.0056,-0.018589,-0.007688,-0.01245,-0.033382,0.017794,-0.000448


In [20]:
# Merge user embedding features
user_emb_merged = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(user_emb_merged, item_emb, how='left', left_on='item', right_on='item').fillna(0)

In [21]:
user_emb_merged.shape

(233306, 19)
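The left merge above keeps every interaction row, and `fillna(0)` replaces the embedding values of any user that has no match. The following is a minimal sketch of that behavior on toy data; the frames `ratings_toy` and `user_emb_toy` and their values are hypothetical and are not part of the lab datasets. It uses the `indicator` option of `pd.merge` to count unmatched rows before they are filled with zeros.

```python
import pandas as pd

# Hypothetical toy data: three interactions, but only two users have embeddings.
ratings_toy = pd.DataFrame({
    "user": [1, 2, 3],
    "item": ["A", "B", "A"],
    "rating": [3.0, 2.0, 3.0],
})
user_emb_toy = pd.DataFrame({
    "user": [1, 2],
    "UFeature0": [0.1, -0.2],
    "UFeature1": [0.05, 0.3],
})

# how="left" keeps every interaction row; indicator=True adds a `_merge`
# column that records whether a matching embedding row was found.
merged_toy = pd.merge(ratings_toy, user_emb_toy, how="left", on="user", indicator=True)

# Count interactions whose user has no embedding before filling them with 0.
n_missing = int((merged_toy["_merge"] == "left_only").sum())
print(f"Rows without a user embedding: {n_missing}")

# Drop the helper column, then fill the missing embedding values with zeros.
filled_toy = merged_toy.drop(columns="_merge").fillna(0)
print(filled_toy)
```

If the count is large on real data, an all-zero embedding may not be a good stand-in, and dropping those rows could be worth considering instead.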

In [22]:
user_emb_merged.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,CC0101EN,3.0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.08283,-0.058721,0.057929,-0.001472
1,1342067,CL0101EN,3.0,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,1990814,ML0120ENv3,3.0,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,380098,BD0211EN,3.0,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.09066,-0.068545,0.008967,0.063962,0.052347,0.018072
4,779563,DS0101EN,3.0,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983


In [23]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,3.0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,...,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,1342067,CL0101EN,3.0,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,...,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,1990814,ML0120ENv3,3.0,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,...,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,380098,BD0211EN,3.0,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,...,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,779563,DS0101EN,3.0,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,...,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283


Next, we can combine the user features (the column labels starting with `UFeature`) and the item features (the column labels starting with `CFeature`). In machine learning, there are many ways to aggregate two feature vectors, such as element-wise addition, multiplication, max/min, and averaging. Here we simply add the two sets of feature columns:


In [25]:
u_features = [f"UFeature{i}" for i in range(16)]
c_features = [f"CFeature{i}" for i in range(16)]
len(u_features)
u_features[5]

'UFeature5'

In [26]:
user_embeddings = merged_df[u_features]
course_embeddings = merged_df[c_features]
ratings = merged_df['rating']

In [27]:
user_embeddings

Unnamed: 0,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.003480,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.082830,-0.058721,0.057929,-0.001472
1,0.068047,-0.112781,0.045208,-0.007570,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,0.124623,0.012910,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,-0.034870,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.090660,-0.068545,0.008967,0.063962,0.052347,0.018072
4,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.074610,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,-0.021376,-0.081750,-0.140323,0.018257,0.070857,-0.150106,-0.101541,0.070504,0.041484,-0.133919,0.091251,0.110785,0.078464,-0.149068,-0.001619,0.004679
233302,0.038751,-0.045833,0.007787,0.054884,0.008866,-0.016915,-0.007734,-0.009360,0.023028,-0.022320,0.075616,-0.043146,0.009633,0.042186,-0.081890,-0.033386
233303,0.055601,0.032458,0.138734,-0.103575,-0.040634,0.019715,-0.024687,0.140954,-0.086923,0.016125,-0.089600,-0.009856,0.011376,0.068570,-0.049185,0.141572
233304,0.098573,-0.033596,0.146387,0.002943,0.111133,-0.100475,0.097536,0.088731,-0.006530,0.033264,0.078135,0.062370,-0.069393,0.007485,-0.034553,0.141143


In [28]:
course_embeddings

Unnamed: 0,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.032560,-0.007292,0.000966,-0.006218
1,-0.008611,0.028041,0.021899,-0.001465,0.006900,-0.017981,0.010899,-0.037610,-0.019397,-0.025682,-0.000620,0.038803,0.000196,-0.045343,0.012863,0.019429
2,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.010450,-0.050011,0.013845,-0.044454,-0.001480,-0.007559
3,0.020163,-0.011972,-0.003714,-0.015548,-0.007540,0.014847,-0.005700,-0.006068,-0.005792,-0.023036,0.015999,-0.023480,0.015469,0.022221,-0.023115,-0.001785
4,0.006399,0.000492,0.005640,0.009639,-0.005487,-0.000590,-0.010015,-0.001514,-0.017598,0.003590,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,0.006399,0.000492,0.005640,0.009639,-0.005487,-0.000590,-0.010015,-0.001514,-0.017598,0.003590,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
233302,-0.012058,-0.001864,0.003127,0.011207,0.015054,-0.000930,-0.006246,-0.001485,0.007065,-0.003130,0.007294,-0.000657,0.006152,-0.001489,0.015253,0.000122
233303,-0.006309,0.029949,-0.000870,-0.030568,-0.032244,0.011450,-0.004814,0.032963,-0.018020,0.013813,-0.048995,0.009753,-0.019230,-0.042314,-0.022855,0.008192
233304,0.007567,-0.029327,0.000919,0.030705,-0.009864,0.000851,0.002402,0.003107,-0.019846,0.013243,0.010134,0.016171,-0.019714,-0.005965,-0.014285,0.006799


In [29]:
ratings.head()

0    3.0
1    3.0
2    3.0
3    3.0
4    3.0
Name: rating, dtype: float64

In [30]:
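# Aggregate the two feature columns using element-wise add.
# `.values` turns the course embeddings into a plain array so the sum is positional;
# adding the two DataFrames directly would align on column labels, which do not match.
regression_dataset = user_embeddings + course_embeddings.values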

In [31]:
regression_dataset.head()

Unnamed: 0,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,0.090378,-0.134799,0.0839,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689
1,0.059437,-0.08474,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.20466,-0.004188,0.007914,0.02717,0.076114
2,0.152061,-0.014739,-0.080112,-0.009516,0.02413,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166,0.045002,0.057566,-0.022081,0.108929
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287
4,0.112812,-0.001395,-0.011572,-0.032638,-0.08044,-0.057321,0.064595,-0.02088,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266
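For comparison, the other aggregation options mentioned earlier can be sketched on two small vectors. This is only an illustration: `user_vec` and `course_vec` are hypothetical 4-dimensional values rather than rows from the lab data, and concatenation is included as one more common option that the lab text does not list.

```python
import numpy as np

# Two hypothetical 4-dimensional embeddings (the lab's vectors have 16 dimensions).
user_vec = np.array([0.08, -0.13, 0.09, 0.03])
course_vec = np.array([0.01, -0.01, 0.00, 0.02])

aggregations = {
    "add": user_vec + course_vec,  # the aggregation used in this lab
    "multiply": user_vec * course_vec,
    "max": np.maximum(user_vec, course_vec),
    "min": np.minimum(user_vec, course_vec),
    "average": (user_vec + course_vec) / 2,
    # Concatenation keeps both vectors intact but doubles the feature count.
    "concat": np.concatenate([user_vec, course_vec]),
}

for name, vec in aggregations.items():
    print(f"{name:>8}: shape={vec.shape}, values={np.round(vec, 4)}")
```

All element-wise options keep the original dimensionality, while concatenation yields a vector twice as long, which changes the shape of `X` seen by the regression model.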


In [32]:
regression_dataset.columns = [f"Feature{i}" for i in range(16)]

In [33]:
regression_dataset

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15
0,0.090378,-0.134799,0.083900,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689
1,0.059437,-0.084740,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.204660,-0.004188,0.007914,0.027170,0.076114
2,0.152061,-0.014739,-0.080112,-0.009516,0.024130,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166000,0.045002,0.057566,-0.022081,0.108929
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287
4,0.112812,-0.001395,-0.011572,-0.032638,-0.080440,-0.057321,0.064595,-0.020880,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,-0.014977,-0.081258,-0.134683,0.027895,0.065370,-0.150696,-0.111557,0.068990,0.023886,-0.130328,0.108049,0.113518,0.083626,-0.134038,-0.002495,-0.016603
233302,0.026693,-0.047697,0.010914,0.066091,0.023919,-0.017845,-0.013980,-0.010845,0.030093,-0.025450,0.082910,-0.043803,0.015785,0.040697,-0.066637,-0.033264
233303,0.049292,0.062408,0.137864,-0.134142,-0.072878,0.031165,-0.029502,0.173918,-0.104943,0.029938,-0.138595,-0.000103,-0.007854,0.026256,-0.072040,0.149764
233304,0.106140,-0.062923,0.147306,0.033648,0.101269,-0.099624,0.099939,0.091838,-0.026377,0.046507,0.088269,0.078541,-0.089107,0.001519,-0.048838,0.147942


In [34]:
regression_dataset['rating'] = ratings
regression_dataset.head()

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.090378,-0.134799,0.0839,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689,3.0
1,0.059437,-0.08474,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.20466,-0.004188,0.007914,0.02717,0.076114,3.0
2,0.152061,-0.014739,-0.080112,-0.009516,0.02413,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166,0.045002,0.057566,-0.022081,0.108929,3.0
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287,3.0
4,0.112812,-0.001395,-0.011572,-0.032638,-0.08044,-0.057321,0.064595,-0.02088,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266,3.0


In [35]:
regression_dataset.shape

(233306, 17)

In [80]:
regression_dataset.dropna(inplace=True)

By now, we have built the input dataset `X` and the output vector `y`:


In [98]:
X = regression_dataset.iloc[:, :-1].values
y = regression_dataset.iloc[:, -1].values
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)


In [99]:
X.shape

(233306, 16)

In [100]:
# Exploratory only: the reshaped array is not assigned, so X keeps its (233306, 16) shape
X.reshape(-1,1)

array([[ 0.0903777 ],
       [-0.13479884],
       [ 0.08389994],
       ...,
       [ 0.0142446 ],
       [-0.01347815],
       [ 0.1672347 ]])

## TASK: Perform regression on the interaction dataset


Now that our input data `X` and output `y` are ready, let's build regression models that map `X` to `y` and predict ratings.


In [82]:
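# y is a NumPy array (no .head()), so inspect the rating column of the DataFrame instead
regression_dataset['rating'].head()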

0    3.0
1    3.0
2    3.0
3    3.0
4    3.0
Name: rating, dtype: float64

In [83]:
# NumPy arrays have no .unique() method, so query the DataFrame column
regression_dataset['rating'].unique()

array([3., 2.])

In an online course system, we may consider the `Completion` mode to be "larger" than the `Audit` mode, because a learner needs to put in more effort to complete a course. If we treat this as a regression problem, we would expect the regression model to output ratings ranging from roughly 2.0 to 3.0. To interpret the model output, we can treat values closer to 2.0 as `Audit` and values closer to 3.0 as `Completion`.
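The interpretation rule above can be sketched as a simple threshold at the midpoint between the two rating values. The predictions below are made-up numbers for illustration, and the 2.5 cut-off is an assumption that follows from "closer to 2.0" versus "closer to 3.0".

```python
import numpy as np

# Hypothetical regression outputs; a real model may also predict values
# slightly below 2.0 or above 3.0.
predicted = np.array([2.05, 2.49, 2.51, 2.97, 3.10, 1.90])

# Values closer to 3.0 than to 2.0 are read as Completion, the rest as Audit.
# The 2.5 cut-off is simply the midpoint between the two rating values.
labels = np.where(predicted >= 2.5, "Completion", "Audit")

for value, label in zip(predicted, labels):
    print(f"{value:.2f} -> {label}")
```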


You may use `sklearn` to train and evaluate various regression models.


_TODO: First split dataset into training and testing datasets_


In [84]:
### WRITE YOUR CODE HERE

# Fix the random state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

<details>
    <summary>Click here for Hints</summary>
    
Use `train_test_split()` to split the dataset into training and testing datasets. Use `X, y` as the input dataset and output vector. Don't forget to specify `random_state=rs` and `test_size=0.3`.

</details>


_TODO: Create a basic linear regression model_


In [85]:
### WRITE YOUR CODE HERE
from sklearn.linear_model import LinearRegression, Ridge, Lasso
lr=LinearRegression()

In [86]:
from sklearn.linear_model import Ridge
rd = Ridge()

<details>
    <summary>Click here for Hints</summary>
    
You can create a `linear_model.Ridge()` model and specify `alpha=0.2` (which controls the regularization strength) in the parameters.

</details>


_TODO: Train the basic regression model with training data_


In [87]:
### WRITE YOUR CODE HERE
lr.fit(X_train,y_train)


LinearRegression()

In [112]:
# Alternative 80/20 split for a Ridge model; U and v hold the same data as X and y
U = regression_dataset.iloc[:, :-1].values
v = regression_dataset.iloc[:, -1].values
print(f"Input data shape: {U.shape}, Output data shape: {v.shape}")
U_train, U_test, v_train, v_test = train_test_split(U, v, test_size=0.2, random_state=42)
rd = Ridge(alpha=0.5)
rd

Input data shape: (233306, 16), Output data shape: (233306,)


Ridge(alpha=0.5)

In [113]:
U_train.shape, v_train.shape

((186644, 16), (186644,))

In [91]:
v.shape

(233306,)

In [92]:
# v is a NumPy array (no .count()), so use len() to get the number of labels
len(v)

233306

In [118]:
rd.fit(U_train,v_train)

TypeError: solve() got an unexpected keyword argument 'sym_pos'

This `TypeError` points to a version mismatch rather than a problem with the data: scikit-learn 1.0.2 (pinned at the top of this lab) calls `scipy.linalg.solve` with a `sym_pos` argument in Ridge's Cholesky solver, and newer SciPy releases no longer accept that argument. Installing a more recent scikit-learn, or an older SciPy alongside 1.0.2, should let `rd.fit` run.

<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.fit()` method with `X_train, y_train` as parameters.

</details>


_TODO: Evaluate the basic regression model_


In [None]:
### WRITE YOUR CODE HERE
from sklearn.metrics import mean_squared_error
import numpy as np

predictions = lr.predict(X_test)
### The main evaluation metric is RMSE but you may use other metrics as well
# mean_squared_error expects (y_true, y_pred); the square root turns MSE into RMSE
rmse = np.sqrt(mean_squared_error(y_test, predictions))
rmse


0.21218050847568173
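To put an RMSE value in context, it can help to compute it by hand and compare it with a trivial baseline that always predicts the mean rating. RMSE is the square root of the mean squared difference between the true and predicted ratings. The sketch below uses made-up labels and predictions, not the lab's results, and the helper `rmse` is a hypothetical name introduced only for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical labels and model predictions, for illustration only.
y_true = np.array([3.0, 3.0, 2.0, 3.0])
y_model = np.array([2.9, 3.0, 2.4, 2.7])

# Baseline: predict the mean of the labels for every row. In practice the
# mean would be taken from the training labels, not the test labels.
y_baseline = np.full_like(y_true, y_true.mean())

print(f"Model RMSE:    {rmse(y_true, y_model):.4f}")
print(f"Baseline RMSE: {rmse(y_true, y_baseline):.4f}")
```

A model is only useful for this task if its RMSE is clearly below that of the constant baseline.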

<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.predict()` method with `X_test` as the parameter to get the model predictions. Then use `mean_squared_error()` with `y_test, your_predictions` as parameters to get the MSE, and take its square root to obtain the RMSE.

</details>


_TODO: Try different regression models such as Ridge, Lasso, ElasticNet and tune their hyperparameters to see which one has the best performance_


In [None]:
### WRITE YOUR CODE HERE
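
One possible way to approach this task is a small loop over model types and regularization strengths, sketched below. To stay self-contained it runs on synthetic data (`X_demo`, `y_demo`) rather than the lab's `X` and `y`, and the `alphas` grid and `l1_ratio=0.5` are arbitrary starting points, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rs = 123  # same random state as the lab

# Synthetic stand-in for the lab data: 16 small-valued features and a
# noisy linear target. Replace with the real X and y in the notebook.
rng = np.random.default_rng(rs)
X_demo = rng.normal(scale=0.1, size=(2000, 16))
true_coef = rng.normal(size=16)
y_demo = 2.5 + X_demo @ true_coef + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=rs
)

# Candidate models and regularization strengths (arbitrary starting points).
# Small alphas are used because the features are on a ~0.1 scale.
candidates = {
    "Ridge": lambda a: Ridge(alpha=a),
    "Lasso": lambda a: Lasso(alpha=a),
    "ElasticNet": lambda a: ElasticNet(alpha=a, l1_ratio=0.5),
}
alphas = [0.0001, 0.001, 0.01, 0.1]

results = []
for name, make_model in candidates.items():
    for alpha in alphas:
        model = make_model(alpha).fit(X_tr, y_tr)
        score = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        results.append((score, name, alpha))

# Sort by RMSE, lowest (best) first.
for score, name, alpha in sorted(results):
    print(f"{name:<10} alpha={alpha:<7} RMSE={score:.4f}")
```

Because the embedding features are small in magnitude, large `alpha` values for Lasso or ElasticNet can shrink every coefficient to zero, which is why the grid starts at very small values.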


## Summary


In this lab, you built regression models to predict numerical course ratings from the embedding feature vectors extracted from neural networks. Because the rating takes only two categorical values, classification may be a more natural problem statement, so in the next lab we will treat the prediction problem as a classification problem.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
