<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork32585014-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Classification-based Rating Mode Prediction using Embedding Features**


Estimated time needed: **60** minutes


In this lab, you have built regression models to predict numerical course ratings using the embedding feature vectors extracted from neural networks. We can also consider the prediction problem as a classification problem as rating only has two categorical values (`Aduit` vs. `Completion`).


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module\_4/images/rating_classification.png)


The workflow is very similar to our previous lab. We first extract two embedding matrices out of the neural network, and aggregate them to be a single interaction feature vector as input data `X`.

This time, with the interaction label `Y` as categorical rating mode, we can build classification models to approximate the mapping from `X` to `Y`, as shown in the above flowchart.


## Objectives


After completing this lab you will be able to:


*   Build classification models to predict rating modes using the combined embedding vectors


***


## Prepare and setup lab environment


First install and import required libraries:


In [None]:
!pip install scikit-learn==1.0.2

In [1]:
# also set a random state
rs = 123

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Load datasets


In [3]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset contains user-item interaction matrix


In [4]:
#rating_df = pd.read_csv(rating_url)
rating_df = pd.read_csv('data/ratings.csv')

In [5]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,3.0
1,1342067,CL0101EN,3.0
2,1990814,ML0120ENv3,3.0
3,380098,BD0211EN,3.0
4,779563,DS0101EN,3.0


As you can see from the above data, the user and item are just ids, let's substitute them with their embedding vectors


In [6]:
# Load user embeddings
#user_emb = pd.read_csv(user_emb_url)
#user_emb.to_csv('data/user_embeddings.csv', sep=',', header=True, index=False)
#user_emb = pd.read_csv('data/user_embeddings.csv')
user_emb = pd.read_csv('data/user_embeddings_computed.csv')
# Load item embeddings
#item_emb = pd.read_csv(item_emb_url)
#item_emb.to_csv('data/course_embeddings.csv', sep=',', header=True, index=False)
#item_emb = pd.read_csv('data/course_embeddings.csv')
item_emb = pd.read_csv('data/course_embeddings_computed.csv')

In [7]:
user_emb.head()

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.092619,-0.231117,0.207782,0.075256,-0.175193,0.505722,-0.011609,-0.325078,0.152641,-0.265612,0.094592,-0.262018,0.838327,-0.194199,-0.153928,0.389778
1,1342067,0.190924,0.165438,0.108368,0.069375,0.041172,-0.05772,0.1167,0.497182,-0.051477,-0.164022,0.053494,0.204487,-0.138072,-0.084156,0.168388,0.03277
2,1990814,-0.015588,0.252763,0.063455,-0.099508,-0.021063,-0.315386,-0.361852,-0.055602,-0.665861,-0.089062,0.541673,0.071023,0.132675,0.17868,-0.00407,-0.061975
3,380098,-0.28042,-0.133849,-0.227513,0.00672,0.57547,0.15379,-0.395521,0.158342,-0.263698,0.137179,-0.291739,0.187631,-0.28021,0.259173,-0.023921,0.168686
4,779563,-0.01546,-0.271762,-0.472752,0.053249,-0.024169,-0.300003,0.208167,-0.101409,-0.087301,-0.062453,0.039226,-0.096039,-0.048271,-0.187053,0.294076,0.226417


In [8]:
item_emb.head()

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,-0.048532,0.030396,0.003829,0.004685,-0.001384,0.051492,0.014301,0.027785,0.090873,-0.047259,-0.00886,0.037038,0.01446,0.045229,-0.1027,0.088245
1,CL0101EN,0.06429,0.044361,0.047307,-0.010566,0.012116,0.026928,0.016909,0.095854,0.018161,0.012876,-0.003892,0.065286,0.0626,0.117804,0.019665,-0.023349
2,ML0120ENv3,0.034007,-0.002219,-0.040372,0.056647,0.110738,0.069199,0.104628,-0.027713,0.0524,-0.028607,0.079523,-0.008803,-0.070613,0.101053,-0.02365,-0.065341
3,BD0211EN,0.009033,-0.005793,0.017004,0.080865,-0.070522,0.057408,0.025328,-0.040609,-0.000427,-0.067003,0.102882,-0.001016,0.004374,0.009934,-0.02062,-0.033026
4,DS0101EN,0.038361,-0.013257,-0.034347,-0.017218,-0.003707,0.003634,0.007511,-0.007333,0.067755,-0.037664,-0.026654,-0.038843,-0.07587,-0.039574,-0.016651,0.04029


In [9]:
# Merge user embedding features
merged_df = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(merged_df, item_emb, how='left', left_on='item', right_on='item').fillna(0)

In [10]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,3.0,0.092619,-0.231117,0.207782,0.075256,-0.175193,0.505722,-0.011609,...,0.014301,0.027785,0.090873,-0.047259,-0.00886,0.037038,0.01446,0.045229,-0.1027,0.088245
1,1342067,CL0101EN,3.0,0.190924,0.165438,0.108368,0.069375,0.041172,-0.05772,0.1167,...,0.016909,0.095854,0.018161,0.012876,-0.003892,0.065286,0.0626,0.117804,0.019665,-0.023349
2,1990814,ML0120ENv3,3.0,-0.015588,0.252763,0.063455,-0.099508,-0.021063,-0.315386,-0.361852,...,0.104628,-0.027713,0.0524,-0.028607,0.079523,-0.008803,-0.070613,0.101053,-0.02365,-0.065341
3,380098,BD0211EN,3.0,-0.28042,-0.133849,-0.227513,0.00672,0.57547,0.15379,-0.395521,...,0.025328,-0.040609,-0.000427,-0.067003,0.102882,-0.001016,0.004374,0.009934,-0.02062,-0.033026
4,779563,DS0101EN,3.0,-0.01546,-0.271762,-0.472752,0.053249,-0.024169,-0.300003,0.208167,...,0.007511,-0.007333,0.067755,-0.037664,-0.026654,-0.038843,-0.07587,-0.039574,-0.016651,0.04029


Each user's embedding features and each item's embedding features are added to the dataset. Next, we perform element-wise add the user features (the column labels starting with `UFeature`) and item features (the column labels starting with `CFeature`).


In [11]:
u_feautres = [f"UFeature{i}" for i in range(16)]
c_features = [f"CFeature{i}" for i in range(16)]

user_embeddings = merged_df[u_feautres]
course_embeddings = merged_df[c_features]
ratings = merged_df['rating']

# Aggregate the two feature columns using element-wise add
interaction_dataset = user_embeddings + course_embeddings.values
interaction_dataset.columns = [f"Feature{i}" for i in range(16)]
interaction_dataset['rating'] = ratings
interaction_dataset.head()

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.044087,-0.200722,0.211612,0.079941,-0.176578,0.557214,0.002692,-0.297292,0.243515,-0.312871,0.085732,-0.224981,0.852787,-0.14897,-0.256628,0.478023,3.0
1,0.255214,0.209799,0.155675,0.058809,0.053288,-0.030793,0.133609,0.593036,-0.033315,-0.151146,0.049602,0.269773,-0.075472,0.033648,0.188053,0.009421,3.0
2,0.018419,0.250544,0.023083,-0.042861,0.089675,-0.246187,-0.257224,-0.083314,-0.613461,-0.117669,0.621196,0.06222,0.062062,0.279733,-0.02772,-0.127316,3.0
3,-0.271387,-0.139642,-0.21051,0.087585,0.504948,0.211198,-0.370193,0.117733,-0.264125,0.070176,-0.188857,0.186616,-0.275836,0.269107,-0.044541,0.13566,3.0
4,0.022901,-0.28502,-0.507098,0.036032,-0.027875,-0.296369,0.215679,-0.108742,-0.019546,-0.100117,0.012572,-0.134882,-0.12414,-0.226627,0.277426,0.266707,3.0


Next, let's use `LabelEncoder()` to encode our `rating` label to be categorical:


In [13]:
X = interaction_dataset.iloc[:, :-1]
y_raw = interaction_dataset.iloc[:, -1]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_raw.values.ravel())

In [14]:
y

array([1, 1, 1, ..., 1, 1, 1])

and split X and y into training and testing dataset:


In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)

In [16]:
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)


## TASK: Perform classification tasks on the interaction dataset


Now our input data `X` and output label `y` is ready, let's build classification models to map `X` to `y`


You may use `sklearn` to train and evaluate various regression models.


*TODO: Define classification models such as Logistic Regression, Tree models, SVM, Bagging, and Boosting models*


In [19]:
### WRITE YOUR CODE HERE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

*TODO: Train your classification models with training data*


In [26]:
### WRITE YOUR CODE HERE

params_grid = {
  'max_depth': [15, 20],
  'n_estimators': [50, 100],
  'min_samples_split': [2, 5]
}

model = RandomForestClassifier(random_state=rs)

grid_search = GridSearchCV(estimator = model, 
                       param_grid = params_grid, 
                       scoring='f1',
                       cv = 3, verbose = 1)

In [None]:
grid_search.fit(X, y)

In [28]:
rfc = grid_search.best_estimator_

print(grid_search.best_score_)
print(grid_search.best_params_)

0.9868993955473416
{'max_depth': 20, 'min_samples_split': 5, 'n_estimators': 50}


*TODO: Evaluate your classification models*


In [29]:
### WRITE YOUR CODE HERE
from sklearn.metrics import precision_recall_fscore_support
### The main evaluation metrics could be accuracy, recall, precision, F score, and AUC.
pred = rfc.predict(X_test)

precision, recall, f1, support = precision_recall_fscore_support(y_test, pred)
print(f1)

[0.84034948 0.99326611]


### Summary


In this lab, you have built and evaluated various classification models to predict categorical course rating modes using the embedding feature vectors extracted from neural networks.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork32585014-2022-01-01)


### Other Contributors


## Change Log


| Date (YYYY-MM-DD) | Version | Changed By | Change Description          |
| ----------------- | ------- | ---------- | --------------------------- |
| 2021-10-25        | 1.0     | Yan        | Created the initial version |


Copyright © 2021 IBM Corporation. All rights reserved.
