<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Classification-based Rating Mode Prediction using Embedding Features**


Estimated time needed: **60** minutes


In the previous lab, you built regression models to predict numerical course ratings using the embedding feature vectors extracted from neural networks. We can also treat the prediction problem as a classification problem, because the rating has only two categorical values (`Audit` vs. `Completion`).


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/rating_classification.png)


The workflow is very similar to our previous lab. We first extract two embedding matrices from the neural network and aggregate them into a single interaction feature vector that serves as the input data `X`.

This time, with the interaction label `Y` being the categorical rating mode, we can build classification models to approximate the mapping from `X` to `Y`, as shown in the flowchart above.


## Objectives


After completing this lab you will be able to:


* Build classification models to predict rating modes using the combined embedding vectors


----


## Prepare and set up the lab environment


First, install and import the required libraries:


In [1]:
!pip install scikit-learn==1.0.2



In [3]:
# also set a random state
rs = 123

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Load datasets


In [5]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset, which contains the user-item interactions:


In [6]:
rating_df = pd.read_csv(rating_url)

In [7]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


As you can see from the data above, the `user` and `item` columns are just IDs. Let's substitute them with their embedding vectors:


In [8]:
user_emb = pd.read_csv(user_emb_url)
item_emb = pd.read_csv(item_emb_url)

In [9]:
user_emb.head()

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.08283,-0.058721,0.057929,-0.001472
1,1342067,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,1990814,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,380098,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.09066,-0.068545,0.008967,0.063962,0.052347,0.018072
4,779563,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983


In [10]:
item_emb.head()

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,CL0101EN,-0.008611,0.028041,0.021899,-0.001465,0.0069,-0.017981,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,ML0120ENv3,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,BD0211EN,0.020163,-0.011972,-0.003714,-0.015548,-0.00754,0.014847,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,DS0101EN,0.006399,0.000492,0.00564,0.009639,-0.005487,-0.00059,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283


In [11]:
# Merge user embedding features
merged_df = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(merged_df, item_emb, how='left', left_on='item', right_on='item').fillna(0)
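The left merges above fill any missing embedding with zeros, which silently hides users or items that have no embedding row. As a minimal sketch on hypothetical toy tables (the names `toy_ratings` and `toy_user_emb` and their values are assumptions, not part of the lab data), the `indicator=True` option of `pd.merge` can be used to count such rows before filling them:

```python
import pandas as pd

# Hypothetical toy stand-ins for the rating and user-embedding tables
toy_ratings = pd.DataFrame(
    {"user": [1, 2, 3], "item": ["A", "B", "A"], "rating": [3.0, 2.0, 3.0]}
)
toy_user_emb = pd.DataFrame({"user": [1, 2], "UFeature0": [0.1, -0.2]})

# indicator=True adds a `_merge` column whose value is "both" when an
# embedding row was found and "left_only" when it was not
checked = pd.merge(toy_ratings, toy_user_emb, how="left", on="user", indicator=True)
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"Rows without a user embedding: {n_unmatched}")
```

If the count is zero on the real data, `fillna(0)` changes nothing; otherwise the zero-filled rows may deserve a closer look.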

In [12]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,3.0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,...,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,1342067,CL0101EN,3.0,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,...,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,1990814,ML0120ENv3,3.0,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,...,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,380098,BD0211EN,3.0,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,...,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,779563,DS0101EN,3.0,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,...,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283


Each user's embedding features and each item's embedding features are added to the dataset. Next, we perform an element-wise addition of the user features (the columns whose labels start with `UFeature`) and the item features (the columns whose labels start with `CFeature`).


In [13]:
u_feautres = [f"UFeature{i}" for i in range(16)]
c_features = [f"CFeature{i}" for i in range(16)]

In [16]:
u_feautres

['UFeature0',
 'UFeature1',
 'UFeature2',
 'UFeature3',
 'UFeature4',
 'UFeature5',
 'UFeature6',
 'UFeature7',
 'UFeature8',
 'UFeature9',
 'UFeature10',
 'UFeature11',
 'UFeature12',
 'UFeature13',
 'UFeature14',
 'UFeature15']

In [17]:
c_features

['CFeature0',
 'CFeature1',
 'CFeature2',
 'CFeature3',
 'CFeature4',
 'CFeature5',
 'CFeature6',
 'CFeature7',
 'CFeature8',
 'CFeature9',
 'CFeature10',
 'CFeature11',
 'CFeature12',
 'CFeature13',
 'CFeature14',
 'CFeature15']

In [18]:
user_embeddings = merged_df[u_feautres]
course_embeddings = merged_df[c_features]

In [20]:
user_embeddings.head(10)

Unnamed: 0,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.08283,-0.058721,0.057929,-0.001472
1,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.09066,-0.068545,0.008967,0.063962,0.052347,0.018072
4,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983
5,0.023796,0.063062,0.111711,0.008723,0.083231,0.095042,0.02642,-0.014873,-0.028716,0.04214,-0.012092,0.081946,0.006987,-0.073148,0.044278,0.044275
6,-0.058648,-0.089343,0.12169,0.019357,-0.037281,0.049743,0.063332,0.045058,-0.006939,-0.009103,-0.211956,-0.050017,-0.158781,0.031542,0.037287,-0.041091
7,0.021692,-0.01002,-0.033231,-0.065473,0.032229,-0.019532,0.023956,-0.047255,-0.028421,0.027716,0.081974,-0.059678,0.076866,-0.101812,-0.002822,0.049491
8,-0.037679,-0.051901,0.011822,-0.027294,0.020662,-0.033249,-0.048535,0.023609,0.000347,0.089887,0.024953,-0.091012,-0.140504,0.01819,0.035233,0.011054
9,0.050706,0.092145,0.004021,0.046852,0.088993,0.095863,0.022414,-0.093847,-0.06113,-0.014215,-0.02837,-0.017722,-0.013436,0.063654,-0.01056,0.114723


In [21]:
course_embeddings.head()

Unnamed: 0,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,-0.008611,0.028041,0.021899,-0.001465,0.0069,-0.017981,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,0.020163,-0.011972,-0.003714,-0.015548,-0.00754,0.014847,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,0.006399,0.000492,0.00564,0.009639,-0.005487,-0.00059,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283


In [22]:
ratings = merged_df['rating']

# Aggregate the two feature columns using element-wise add
interaction_dataset = user_embeddings + course_embeddings.values
interaction_dataset.columns = [f"Feature{i}" for i in range(16)]
interaction_dataset['rating'] = ratings
interaction_dataset.head()

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.090378,-0.134799,0.0839,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689,3.0
1,0.059437,-0.08474,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.20466,-0.004188,0.007914,0.02717,0.076114,3.0
2,0.152061,-0.014739,-0.080112,-0.009516,0.02413,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166,0.045002,0.057566,-0.022081,0.108929,3.0
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287,3.0
4,0.112812,-0.001395,-0.011572,-0.032638,-0.08044,-0.057321,0.064595,-0.02088,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266,3.0


Next, let's use `LabelEncoder()` to encode the `rating` label as integer class labels:


In [23]:
X = interaction_dataset.iloc[:, :-1]
y_raw = interaction_dataset.iloc[:, -1]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw.values.ravel())
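To see which raw rating value ends up as which class label, one option is to inspect the fitted encoder's `classes_` attribute. The sketch below uses hypothetical raw values (`toy_ratings` is an assumption, not the lab data) to show the idea:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw rating values, used only for illustration
toy_ratings = np.array([3.0, 2.0, 3.0, 3.0])

toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_ratings)

# classes_ is sorted; the value at position i is encoded as label i
mapping = {raw: code for code, raw in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the raw values from the encoded labels
print(toy_encoder.inverse_transform(toy_y))
```

Knowing this mapping matters later, because metrics such as precision and recall are reported per encoded label.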

In [27]:
X

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15
0,0.090378,-0.134799,0.083900,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689
1,0.059437,-0.084740,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.204660,-0.004188,0.007914,0.027170,0.076114
2,0.152061,-0.014739,-0.080112,-0.009516,0.024130,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166000,0.045002,0.057566,-0.022081,0.108929
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287
4,0.112812,-0.001395,-0.011572,-0.032638,-0.080440,-0.057321,0.064595,-0.020880,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,-0.014977,-0.081258,-0.134683,0.027895,0.065370,-0.150696,-0.111557,0.068990,0.023886,-0.130328,0.108049,0.113518,0.083626,-0.134038,-0.002495,-0.016603
233302,0.026693,-0.047697,0.010914,0.066091,0.023919,-0.017845,-0.013980,-0.010845,0.030093,-0.025450,0.082910,-0.043803,0.015785,0.040697,-0.066637,-0.033264
233303,0.049292,0.062408,0.137864,-0.134142,-0.072878,0.031165,-0.029502,0.173918,-0.104943,0.029938,-0.138595,-0.000103,-0.007854,0.026256,-0.072040,0.149764
233304,0.106140,-0.062923,0.147306,0.033648,0.101269,-0.099624,0.099939,0.091838,-0.026377,0.046507,0.088269,0.078541,-0.089107,0.001519,-0.048838,0.147942


Then split `X` and `y` into training and testing datasets:


In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)
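When one class is rare, a plain random split can leave the training and testing sets with slightly different class proportions. A possible alternative, shown here as a sketch on synthetic data (`toy_X` and `toy_y` are assumptions), is to pass `stratify=` to `train_test_split` so both sets keep the same class ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with a 5% minority class, for illustration only
rng = np.random.default_rng(123)
toy_X = rng.normal(size=(1000, 4))
toy_y = np.array([0] * 50 + [1] * 950)

# stratify=toy_y keeps the class proportions the same in both splits
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)
print(f"Share of class 1 in train: {toy_y_train.mean():.3f}")
print(f"Share of class 1 in test:  {toy_y_test.mean():.3f}")
```

This changes which rows land in each split, so results would not be directly comparable with the numbers shown in this notebook.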

In [29]:
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)


## TASK: Perform classification tasks on the interaction dataset


Now that our input data `X` and output label `y` are ready, let's build classification models to map `X` to `y`.


You may use `sklearn` to train and evaluate various classification models.


_TODO: Define classification models such as Logistic Regression, Tree models, SVM, Bagging, and Boosting models_


In [32]:
### WRITE YOUR CODE HERE
pd.Series(y).value_counts(normalize=True)

1    0.952954
0    0.047046
Name: proportion, dtype: float64
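The proportions above show a strong class imbalance: a model that always predicts the majority class would already reach about 95% accuracy. As a sketch on synthetic labels (the toy arrays are assumptions), `DummyClassifier` gives such a baseline, and balanced accuracy exposes how little it actually learns:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Synthetic labels with a 5% minority class; features are ignored by the dummy model
toy_y = np.array([0] * 5 + [1] * 95)
toy_X = np.zeros((100, 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_y)
baseline_pred = baseline.predict(toy_X)

baseline_acc = accuracy_score(toy_y, baseline_pred)
baseline_bal_acc = balanced_accuracy_score(toy_y, baseline_pred)
print(f"Accuracy: {baseline_acc:.2f}, balanced accuracy: {baseline_bal_acc:.2f}")
```

Any real model should be judged against this kind of baseline rather than against 50%.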

In [33]:
from sklearn.metrics import classification_report
# Weight each class by the frequency of the other class, so that the rare
# class 0 (about 4.7% of samples) gets the larger weight
c_weight = {}
c_weight[0] = 0.952954
c_weight[1] = 0.047046
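The hand-written weights above are inversely proportional to the class frequencies. A possible shortcut, sketched here on synthetic labels (`toy_y` is an assumption), is `compute_class_weight` with `"balanced"`, which derives weights with the same ratio directly from the labels. Passing `class_weight="balanced"` to the estimator should have the same effect.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 5% minority class, for illustration only
toy_y = np.array([0] * 5 + [1] * 95)
classes = np.unique(toy_y)

# "balanced" uses n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=toy_y)
toy_class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(toy_class_weight)
```

The ratio between the two weights matches the inverse-frequency ratio, but the absolute scale differs from hand-written values, which can interact with the regularization strength of `LogisticRegression`.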

In [34]:
model_logReg = LogisticRegression(penalty='l2',random_state=rs, max_iter=1000,class_weight=c_weight)

In [37]:
threshhold=0.5
model_logReg.fit(X_train,y_train)
predicted_probability = model_logReg.predict_proba(X_test)
predicted_probability



array([[0.36318011, 0.63681989],
       [0.46788898, 0.53211102],
       [0.50884711, 0.49115289],
       ...,
       [0.3790562 , 0.6209438 ],
       [0.37612317, 0.62387683],
       [0.35201844, 0.64798156]])

In [38]:
predicted_probability[:,1]

array([0.63681989, 0.53211102, 0.49115289, ..., 0.6209438 , 0.62387683,
       0.64798156])

In [41]:
yp_lr = (predicted_probability[:,1] >= threshhold).astype('int')

In [43]:
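# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, "support" counts predicted labels, and precision and recall are swapped.
print(classification_report(yp_lr,y_test))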

              precision    recall  f1-score   support

           0       0.65      0.07      0.12     20638
           1       0.57      0.97      0.72     26024

    accuracy                           0.57     46662
   macro avg       0.61      0.52      0.42     46662
weighted avg       0.60      0.57      0.45     46662
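Because scikit-learn metric functions take the true labels first and the predictions second, the argument order changes what is reported. The sketch below uses small hypothetical arrays (`toy_true`, `toy_pred`) to show that swapping the arguments turns precision into recall:

```python
import numpy as np
from sklearn.metrics import classification_report, precision_score, recall_score

# Hypothetical labels: 2 true positives, 1 false positive, 2 false negatives
toy_true = np.array([1, 1, 1, 1, 0, 0])
toy_pred = np.array([1, 1, 0, 0, 0, 1])

p_correct = precision_score(toy_true, toy_pred)  # TP / (TP + FP)
r_correct = recall_score(toy_true, toy_pred)     # TP / (TP + FN)
p_swapped = precision_score(toy_pred, toy_true)  # arguments reversed

print(f"precision={p_correct:.3f}, recall={r_correct:.3f}, swapped precision={p_swapped:.3f}")

# Intended order: true labels first, predictions second
print(classification_report(toy_true, toy_pred))
```

With the intended order, the `support` column of the report counts true labels per class.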



In [44]:
# Build a stacking ensemble from SVM, KNN, and decision tree classifiers
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

In [45]:
estimators =[('SVM',SVC(random_state=42)),('KNN',KNeighborsClassifier()),('dt',DecisionTreeClassifier())]

In [46]:
clf=StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

In [47]:
clf.fit(X_train,y_train)
preds_clf=clf.predict(X_test)


In [48]:
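# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, "support" counts predicted labels, and precision and recall are swapped.
print(classification_report(preds_clf,y_test))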

              precision    recall  f1-score   support

           0       0.84      0.96      0.90      1871
           1       1.00      0.99      1.00     44791

    accuracy                           0.99     46662
   macro avg       0.92      0.97      0.95     46662
weighted avg       0.99      0.99      0.99     46662



<details>
    <summary>Click here for Hints</summary>
    
For example, you can call `RandomForestClassifier()` to define your model. Don't forget to specify `max_depth=...` and `random_state=rs` in the parameters.

</details>


_TODO: Train your classification models with training data_


In [49]:
### WRITE YOUR CODE HERE
### You may need to tune the hyperparameters of the models
import pickle
pickle.dump(clf,open('stackingclassifier.sav','wb'))

In [50]:
load_model = pickle.load(open('stackingclassifier.sav','rb'))

In [51]:
preds_clf=load_model.predict(X_test)



In [52]:
import numpy as np
from sklearn.metrics import mean_squared_error
print('RMSE', np.sqrt(mean_squared_error(preds_clf,y_test)))

RMSE 0.09464691575011283
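For 0/1 class labels, every squared error is either 0 or 1, so the mean squared error equals the misclassification rate and accuracy equals `1 - RMSE**2`. Under that reading, the RMSE of about 0.0946 above corresponds to an error rate of roughly 0.9%. A small sketch with hypothetical arrays (`toy_true`, `toy_pred`) checks the relationship:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical binary labels with exactly one mistake out of ten
toy_true = np.array([1, 1, 0, 1, 0, 1, 1, 1, 1, 1])
toy_pred = np.array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1])

toy_rmse = np.sqrt(mean_squared_error(toy_true, toy_pred))
toy_acc = accuracy_score(toy_true, toy_pred)

# For 0/1 labels, MSE is the share of wrong predictions
print(f"RMSE={toy_rmse:.4f}, 1 - RMSE^2={1 - toy_rmse**2:.4f}, accuracy={toy_acc:.4f}")
```

RMSE therefore adds no information beyond accuracy here, and classification metrics are usually easier to interpret.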


<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.fit()` method with `X_train, y_train` as parameters.

</details>


_TODO: Evaluate your classification models_


In [None]:
### WRITE YOUR CODE HERE

### The main evaluation metrics could be accuracy, recall, precision, F score, and AUC.


<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.predict()` method with `X_test` as the parameter to get the model predictions. Then use `accuracy_score()` with `y_test, your_predictions` as parameters to calculate the accuracy value.
* You can use `precision_recall_fscore_support` with `y_test, your_predictions, average='binary'` as parameters to get recall, precision, and F score.

</details>
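The metrics listed for this task can be computed along the following lines. This is a minimal sketch on hypothetical arrays (`toy_true` and `toy_score` are assumptions); in the lab you would use `y_test` and your own model's predictions instead:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Hypothetical true labels and predicted probabilities of class 1
toy_true = np.array([1, 1, 1, 0, 0, 1])
toy_score = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7])

# Hard predictions at a 0.5 threshold
toy_pred = (toy_score >= 0.5).astype(int)

acc = accuracy_score(toy_true, toy_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    toy_true, toy_pred, average="binary"
)
# AUC is computed from the scores, not from the thresholded predictions
auc = roc_auc_score(toy_true, toy_score)

print(f"accuracy={acc:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} auc={auc:.3f}")
```

Note that `average="binary"` reports the metrics for label 1 only; with an imbalanced dataset it can be worth also checking the minority class, for example by setting `pos_label=0`.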
    


### Summary


In this lab, you have built and evaluated various classification models to predict categorical course rating modes using the embedding feature vectors extracted from neural networks.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
