<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Classification-based Rating Mode Prediction using Embedding Features**


Estimated time needed: **60** minutes


In the previous lab, you built regression models to predict numerical course ratings using the embedding feature vectors extracted from neural networks. The same prediction problem can also be treated as a classification problem, again using the embedding features.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/rating_classification.png)


The workflow is very similar to our previous lab. We first extract two embedding matrices out of the neural network, and aggregate them to be a single interaction feature vector as input data `X`.

This time, with the interaction label `Y` as categorical rating mode, we can build classification models to approximate the mapping from `X` to `Y`, as shown in the above flowchart.


## Objectives


After completing this lab you will be able to:


* Build classification models to predict rating modes using the combined embedding vectors


----


## Prepare and set up the lab environment


First install and import required libraries:


In [21]:
%pip install scikit-learn
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [22]:
# also set a random state
rs = 123

In [23]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Load datasets


In [24]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-ML0321EN-Coursera/labs/v2/module_3/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset, which contains the user-item interaction matrix:


In [25]:
rating_df = pd.read_csv(rating_url)

In [26]:
rating_df.head()

Unnamed: 0,user,item,rating
0,1889878,CC0101EN,5
1,1342067,CL0101EN,3
2,1990814,ML0120ENv3,5
3,380098,BD0211EN,5
4,779563,DS0101EN,3


As you can see from the data above, the `user` and `item` columns are just IDs. Let's substitute them with their embedding vectors:


In [27]:
user_emb = pd.read_csv(user_emb_url)
item_emb = pd.read_csv(item_emb_url)

In [28]:
user_emb.head()

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.08283,-0.058721,0.057929,-0.001472
1,1342067,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,1990814,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,380098,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.09066,-0.068545,0.008967,0.063962,0.052347,0.018072
4,779563,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983


In [29]:
item_emb.head()

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,CL0101EN,-0.008611,0.028041,0.021899,-0.001465,0.0069,-0.017981,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,ML0120ENv3,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,BD0211EN,0.020163,-0.011972,-0.003714,-0.015548,-0.00754,0.014847,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,DS0101EN,0.006399,0.000492,0.00564,0.009639,-0.005487,-0.00059,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283


In [30]:
# Merge user embedding features
merged_df = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(merged_df, item_emb, how='left', left_on='item', right_on='item').fillna(0)

In [31]:
merged_df.head()

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,5,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.00348,...,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.03256,-0.007292,0.000966,-0.006218
1,1342067,CL0101EN,3,0.068047,-0.112781,0.045208,-0.00757,-0.038382,0.068037,0.114949,...,0.010899,-0.03761,-0.019397,-0.025682,-0.00062,0.038803,0.000196,-0.045343,0.012863,0.019429
2,1990814,ML0120ENv3,5,0.124623,0.01291,-0.072627,0.049935,0.020158,0.133306,-0.035366,...,-0.012695,0.036138,0.019965,0.018686,-0.01045,-0.050011,0.013845,-0.044454,-0.00148,-0.007559
3,380098,BD0211EN,5,-0.03487,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,...,-0.0057,-0.006068,-0.005792,-0.023036,0.015999,-0.02348,0.015469,0.022221,-0.023115,-0.001785
4,779563,DS0101EN,3,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.07461,...,-0.010015,-0.001514,-0.017598,0.00359,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
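The left merges above call `fillna(0)`, so any user or item without an embedding row ends up with an all-zero feature vector. A minimal sketch of how one might count such rows before filling them is shown here. The toy frames `ratings_toy` and `user_emb_toy` and their values are made up for illustration; only `pd.merge` and its `indicator` parameter come from pandas itself.

```python
import pandas as pd

# Toy stand-ins for rating_df and user_emb (values are invented for illustration)
ratings_toy = pd.DataFrame(
    {"user": [1, 2, 3], "item": ["A", "B", "A"], "rating": [5, 3, 5]}
)
user_emb_toy = pd.DataFrame({"user": [1, 2], "UFeature0": [0.1, -0.2]})

# indicator=True adds a `_merge` column recording where each row was found
merged_toy = pd.merge(
    ratings_toy, user_emb_toy, how="left", on="user", indicator=True
)

# Rows marked "left_only" have no embedding and would be zero-filled by fillna(0)
n_missing = int((merged_toy["_merge"] == "left_only").sum())
print(f"Ratings without a user embedding: {n_missing}")
```

If the count is zero for both merges, `fillna(0)` is a no-op and the zero vectors are not a concern.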


Each user's embedding features and each item's embedding features are now added to the dataset. Next, we add the user features (the column labels starting with `UFeature`) and the item features (the column labels starting with `CFeature`) element-wise.


In [32]:
u_features = [f"UFeature{i}" for i in range(16)] # Assuming there are 16 user embedding features
c_features = [f"CFeature{i}" for i in range(16)] # Assuming there are 16 course embedding features
# Extract user embedding features
user_embeddings = merged_df[u_features]
# Extract course embedding features
course_embeddings = merged_df[c_features]
# Extract ratings
ratings = merged_df['rating']

# Aggregate the two feature sets using an element-wise add
# (.values drops the CFeature labels so the sum is positional, not label-aligned)
interaction_dataset = user_embeddings + course_embeddings.values
# Rename the columns of the resulting DataFrame
interaction_dataset.columns = [f"Feature{i}" for i in range(16)]
# Add the 'rating' column from the original DataFrame to the interaction dataset
interaction_dataset['rating'] = ratings
# Display the first few rows of the interaction dataset
interaction_dataset.head()

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.090378,-0.134799,0.0839,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689,5
1,0.059437,-0.08474,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.20466,-0.004188,0.007914,0.02717,0.076114,3
2,0.152061,-0.014739,-0.080112,-0.009516,0.02413,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166,0.045002,0.057566,-0.022081,0.108929,5
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287,5
4,0.112812,-0.001395,-0.011572,-0.032638,-0.08044,-0.057321,0.064595,-0.02088,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266,3
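The element-wise add above uses `course_embeddings.values` rather than the DataFrame itself. The following small sketch, with made-up toy frames, illustrates why: pandas aligns two DataFrames on column labels, so adding frames whose labels never match yields only `NaN`, while adding a NumPy array adds by position.

```python
import pandas as pd

# Toy frames with the same shape but different column labels (values are invented)
user_toy = pd.DataFrame({"UFeature0": [0.1, 0.2], "UFeature1": [0.3, 0.4]})
course_toy = pd.DataFrame({"CFeature0": [1.0, 2.0], "CFeature1": [3.0, 4.0]})

# DataFrame + DataFrame aligns on column labels; none match, so every cell is NaN
aligned_sum = user_toy + course_toy

# DataFrame + ndarray ignores labels and adds position by position
positional_sum = user_toy + course_toy.values

print(aligned_sum)
print(positional_sum)
```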


Next, let's use `LabelEncoder()` to encode the `rating` column as categorical class labels:


In [33]:
# Extract features (X) from the interaction_dataset DataFrame
# Selects all rows and all columns except the last column (features)
X = interaction_dataset.iloc[:, :-1]
# Extract the target variable (y_raw) from the interaction_dataset DataFrame
# Selects all rows and only the last column (target variable)
y_raw = interaction_dataset.iloc[:, -1]
# Initialize a LabelEncoder object to encode the target variable
label_encoder = LabelEncoder()
# Encode the target variable (y_raw) using the LabelEncoder
# .values.ravel() converts the target variable to a flattened array before encoding
# The LabelEncoder fits and transforms the target variable, assigning encoded labels to y
y = label_encoder.fit_transform(y_raw.values.ravel())
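To make the encoding step concrete, here is a small sketch on a hypothetical list of ratings (the values in `raw_toy` are invented, not taken from the dataset). `LabelEncoder` sorts the distinct labels and maps them to integers starting at 0, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw ratings, used only to illustrate the mapping
raw_toy = [3, 5, 4, 5, 3]

encoder_toy = LabelEncoder()
encoded_toy = encoder_toy.fit_transform(raw_toy)

# classes_ holds the sorted distinct labels; position i is encoded as integer i
print(encoder_toy.classes_)
print(encoded_toy)

# inverse_transform recovers the original rating values from the encoded integers
decoded_toy = encoder_toy.inverse_transform(encoded_toy)
print(decoded_toy)
```

Model predictions in this lab are in the encoded space, so `label_encoder.inverse_transform(predictions)` would be the way to turn them back into rating values.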

Then split `X` and `y` into training and testing datasets:


In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)

In [35]:
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)
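The split above is a plain random split. If the rating classes turn out to be imbalanced, one option (an assumption here, not something the lab requires) is to pass `stratify=y` so that the training and testing sets keep the same class proportions. The sketch below uses synthetic data with invented class frequencies to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 16 features, three imbalanced classes (invented)
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(1000, 16))
y_toy = rng.choice([0, 1, 2], size=1000, p=[0.6, 0.3, 0.1])

# stratify=y_toy keeps the class proportions (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123, stratify=y_toy
)

train_share = np.bincount(y_tr, minlength=3) / len(y_tr)
test_share = np.bincount(y_te, minlength=3) / len(y_te)
print("train class shares:", train_share.round(3))
print("test class shares: ", test_share.round(3))
```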


## TASK: Perform classification tasks on the interaction dataset


Now that our input data `X` and output label `y` are ready, let's build classification models to map `X` to `y`.


You may use `sklearn` to train and evaluate various classification models.


_TODO: Define classification models such as Logistic Regression, Tree models, SVM, Bagging, and Boosting models_


In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [37]:
### WRITE YOUR CODE HERE

lr_model = LogisticRegression(random_state=rs)

lr_params = {
    'penalty': ['l2', 'l1'],
    'C': [0.1, 1.0],
    'solver': ['liblinear']
}

lr_model_cv = GridSearchCV(lr_model, param_grid=lr_params, cv=3)
lr_model_cv.fit(X_train, y_train)
print(lr_model_cv.best_estimator_)

LogisticRegression(penalty='l1', random_state=123, solver='liblinear')


In [38]:
rf_model = RandomForestClassifier(random_state=rs)

rf_params = {
    'n_estimators': [50],
    'criterion': ['gini'],
    'max_depth': [10],
    'min_samples_split': [2],
    'min_samples_leaf': [1, 2],
}

rf_model_cv = GridSearchCV(rf_model, param_grid=rf_params, cv=3)
rf_model_cv.fit(X_train, y_train)
print(rf_model_cv.best_estimator_)

RandomForestClassifier(max_depth=10, n_estimators=50, random_state=123)


In [None]:
svm_model = SVC(random_state=rs)

svm_params = {
    'C': [0.1, 1],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale']
}

svm_model_cv = GridSearchCV(svm_model,  param_grid=svm_params, cv=3)
svm_model_cv.fit(X_train, y_train)
print(svm_model_cv.best_estimator_)
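The SVM cell above has no execution count or output. The scikit-learn documentation notes that the fit time of `SVC` scales at least quadratically with the number of samples, so a grid search over roughly 186,000 training rows can take a very long time. One possible workaround, sketched here on synthetic data, is to run the search on a stratified subsample. The 20% fraction and the synthetic `X_syn`, `y_syn` are arbitrary choices for illustration, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train: 16 features, 3 classes
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=16, n_informative=8, n_classes=3, random_state=123
)

# Keep a stratified 20% subsample for the search (the fraction is an arbitrary choice)
X_sub, _, y_sub, _ = train_test_split(
    X_syn, y_syn, train_size=0.2, stratify=y_syn, random_state=123
)

# Same parameter grid as the cell above, searched on the smaller sample
svm_search = GridSearchCV(
    SVC(random_state=123),
    param_grid={"C": [0.1, 1], "kernel": ["linear", "rbf"], "gamma": ["scale"]},
    cv=3,
)
svm_search.fit(X_sub, y_sub)
print(svm_search.best_params_)
```

Parameters found on a subsample are only a starting point; the chosen model would still need to be refit and evaluated on the data you actually intend to use.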

In [39]:
bg_model = BaggingClassifier(random_state=rs)

bg_params = {
    'n_estimators': [10],
    'max_samples': [0.5],
    'max_features': [0.5, 1.0],
    'bootstrap': [True]
}

bg_model_cv = GridSearchCV(bg_model, param_grid=bg_params, cv=3)
bg_model_cv.fit(X_train, y_train)
print(bg_model_cv.best_estimator_)

BaggingClassifier(max_features=0.5, max_samples=0.5, random_state=123)


In [20]:
boost_model = GradientBoostingClassifier(random_state=rs)

boost_params = {
    'n_estimators': [50],
    'learning_rate': [0.01]
}

boost_model_cv = GridSearchCV(boost_model, param_grid=boost_params, cv=3)
boost_model_cv.fit(X_train, y_train)
print(boost_model_cv.best_estimator_)

KeyboardInterrupt: 

<details>
    <summary>Click here for Hints </summary>
    
For example, you can call `RandomForestClassifier()` to define your model. Don't forget to specify `max_depth=...` and `random_state=rs` in the parameters.

</details>


_TODO: Train your classification models with training data_


In [40]:
### WRITE YOUR CODE HERE
### You may need to tune the hyperparameters of the models
lr_model = LogisticRegression(penalty='l1', random_state=rs, solver='liblinear')
rf_model = RandomForestClassifier(max_depth=10, n_estimators=50, random_state=rs)
#svm_model = SVC(C=1, random_state=rs)
bagging_model = BaggingClassifier(max_samples=0.5, random_state=rs)
#boosting_model = GradientBoostingClassifier(learning_rate=0.01, n_estimators=50, random_state=rs)


lr_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
#svm_model.fit(X_train, y_train)
bagging_model.fit(X_train, y_train)
#boosting_model.fit(X_train, y_train)



<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.fit()` method with `X_train, y_train` parameters.

</details>


_TODO: Evaluate your classification models_


In [42]:
### WRITE YOUR CODE HERE

### The main evaluation metrics could be accuracy, recall, precision, F score, and AUC.

lr_pred = lr_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
#svm_pred = svm_model.predict(X_test)
bagging_pred = bagging_model.predict(X_test)
#boosting_pred = boosting_model.predict(X_test)

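def calculate_metrics(y_true, y_pred):
    # Classification metrics; 'weighted' averages per-class scores by class support
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, fscore, _ = precision_recall_fscore_support(y_true, y_pred, average='weighted')
    # MSE/RMSE are computed on the encoded labels, i.e. they treat the
    # rating classes as ordered numbers rather than as unordered categories
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)

    return accuracy, precision, recall, fscore, mse, rmse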

# Calculate metrics for each model
score_lr = calculate_metrics(y_test, lr_pred)
score_rf = calculate_metrics(y_test, rf_pred)
#score_svm = calculate_metrics(y_test, svm_pred)
score_bagging = calculate_metrics(y_test, bagging_pred)
#score_boosting = calculate_metrics(y_test, boosting_pred)
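The comment in the cell above lists AUC among the possible metrics, but `calculate_metrics` does not compute it. A minimal sketch of one way to get a multi-class AUC is shown below. It needs class probabilities from `predict_proba` rather than hard predictions, and it is run here on synthetic data with a stand-in classifier, so the resulting number says nothing about the lab's models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the interaction dataset: 16 features, 3 classes
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=16, n_informative=8, n_classes=3, random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=123
)

clf = LogisticRegression(max_iter=1000, random_state=123)
clf.fit(X_tr, y_tr)

# predict_proba returns one probability column per class: (n_samples, n_classes)
proba = clf.predict_proba(X_te)

# One-vs-rest AUC, averaged over classes weighted by class support
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr", average="weighted")
print(f"Weighted one-vs-rest AUC: {auc_ovr:.3f}")
```

In the lab itself, the same call would take `y_test` and, for example, `rf_model.predict_proba(X_test)`.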


In [44]:
# Create a dictionary to store the results
score = {
    'Model': ['Logistic Regression', 'Random Forest', 'Bagging'],
    'Accuracy': [score_lr[0], score_rf[0], score_bagging[0]],
    'Precision': [score_lr[1], score_rf[1],  score_bagging[1]],
    'Recall': [score_lr[2], score_rf[2],  score_bagging[2]],
    'F-score': [score_lr[3], score_rf[3],  score_bagging[3]],
    'MSE': [score_lr[4], score_rf[4],  score_bagging[4]],
    'RMSE': [score_lr[5], score_rf[5], score_bagging[5]]
}

# Convert the dictionary into a DataFrame
results_df = pd.DataFrame(score)
results_df


Unnamed: 0,Model,Accuracy,Precision,Recall,F-score,MSE,RMSE
0,Logistic Regression,0.334126,0.334961,0.334126,0.324852,1.394304,1.180806
1,Random Forest,0.336698,0.337924,0.336698,0.328792,1.373409,1.171925
2,Bagging,0.337255,0.337399,0.337255,0.335706,1.321932,1.149753
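Accuracy values like those in the table are easier to interpret next to a trivial baseline. The sketch below shows one way to build such a baseline with `DummyClassifier`, which always predicts the most frequent training class. The three balanced classes in `y_toy` are a hypothetical setup chosen for illustration; the actual class balance of the rating data should be checked separately.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical data: 16 random features and three perfectly balanced classes
rng = np.random.default_rng(123)
X_toy = rng.normal(size=(900, 16))
y_toy = np.repeat([0, 1, 2], 300)

# The baseline ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

In the lab, fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` would give the reference point for the three models in the table.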


<details>
    <summary>Click here for Hints</summary>
    
You can call the `model.predict()` method with the `X_test` parameter to get model predictions. Then use `accuracy_score()` with `y_test, your_predictions` parameters to calculate the accuracy value.
* You can use `precision_recall_fscore_support` with `y_test, your_predictions, average='weighted'` parameters to get recall, precision, and F score. Use a multi-class average such as `'weighted'` here, because `average='binary'` only applies to two-class targets.

</details>
    


### Summary


In this lab, you have built and evaluated various classification models to predict categorical course rating modes using the embedding feature vectors extracted from neural networks.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
