<a href="https://colab.research.google.com/github/nes-a/unit3_Project/blob/main/3.%20Supervised%20task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/nes-a/unit3_Project/blob/main/Logo.png?raw=1" width="100" align="left"/>

# <center> Unit 3 Project </center>
# <center> Third section: supervised task </center>

In this notebook you will build and train a supervised learning model to classify your data.

For this task we will use another classification model: the random forest.

Steps for this task:
1. Load the already clustered dataset.
2. Do not use the previously added "cluster" column as a feature in this task.
3. Split your data.
4. Build your model using the scikit-learn `RandomForestClassifier` class.
5. Classify your data and test the performance of your model.
6. Evaluate the model (accepted models should have an accuracy of at least 86%). Experiment with the hyperparameters and provide a report on the results.
7. Provide evidence of the quality of your model (not overfitted, good metrics).
8. Create a new test dataset that contains the test set plus an additional column called "predicted_class" stating the class predicted by your random forest classifier for each data point of the test set.

## 1. Load and split the data:

In [62]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

In [63]:
df = pd.read_csv('clustered_HepatitisC.csv')
df.head()

Unnamed: 0,ID,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster
0,1,0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,1
1,2,0,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,1
2,3,0,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,1
3,4,0,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,1
4,5,0,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,1


In [64]:
# Drop ID, Category, and cluster columns for features
X = df.drop(columns=["ID", "Category", "cluster"])
# Target is the Category column
y = df["Category"]
# Split the data 80:20 (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(492, 12) (123, 12) (492,) (123,)
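
The split above does not pass `stratify`, so the class proportions in the 123-row test set are left to chance. If `Category` is imbalanced (an assumption about this dataset, not something shown above), a stratified split may give a more representative test set. The sketch below uses synthetic labels (`X_demo` and `y_demo` are illustrative names, not part of the notebook) to show the effect of `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the Category column (assumption).
rng = np.random.default_rng(0)
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = rng.normal(size=(100, 3))

# stratify=y_demo keeps the 90/10 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))  # class counts in the training part
print(np.bincount(y_te))  # class counts in the test part
```

In the notebook this would correspond to adding `stratify=y` to the existing `train_test_split` call. Stratification requires at least two samples per class, and it would change the split and therefore the scores reported in the following cells.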


## 2. Build and train the model, then evaluate its performance:

In [65]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

In [66]:
# Make predictions on the test set and on the training set
y_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

> Hint: Perfect accuracy on the training set suggests that the model is overfitted. You should therefore provide a detailed table of the hyperparameter tuning, with a conclusion showing that the model reaches an accuracy of at least 86% on the test set without signs of overfitting.

In [69]:
# Evaluate the model in terms of accuracy and precision
accuracy_test = accuracy_score(y_test, y_pred)
precision_test = precision_score(y_test, y_pred, average='weighted', zero_division=0)
print(f"Accuracy (Test Set): {accuracy_test:.4f}")
print(f"Precision (Test Set): {precision_test:.4f}")

Accuracy (Test Set): 0.8618
Precision (Test Set): 0.8414
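
Accuracy and weighted precision summarize all classes in one number, which can hide weak performance on rare classes. `classification_report` and `confusion_matrix` are already imported in this notebook but not yet used. The following sketch, on toy labels that are purely illustrative, shows what they add.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels (assumption): class 0 dominates, class 1 is rare.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class precision, recall and F1, plus macro and weighted averages.
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

Here overall accuracy is 0.90, yet recall for class 1 is only 0.50. In the notebook the equivalent calls would be `confusion_matrix(y_test, y_pred)` and `classification_report(y_test, y_pred, zero_division=0)`.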


In [70]:
accuracy_train = accuracy_score(y_train, y_train_pred)
print(f"\nAccuracy (Train Set): {accuracy_train:.4f}")
print(f"Difference (Train - Test Accuracy): {accuracy_train - accuracy_test:.4f}")


Accuracy (Train Set): 1.0000
Difference (Train - Test Accuracy): 0.1382


In [71]:
# Play with hyperparameters and provide a report about that
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

In [72]:
# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=1)

grid_search.fit(X_train, y_train)

print(f"\nBest parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_:.4f}")

Fitting 5 folds for each of 36 candidates, totalling 180 fits





Best parameters found: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validation accuracy: 0.9370


In [73]:
results_df = pd.DataFrame(grid_search.cv_results_)
results_df_sorted = results_df.sort_values(by='rank_test_score').head(5)
report_columns = ['param_n_estimators', 'param_max_depth', 'param_min_samples_split',
                  'param_min_samples_leaf', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results_df_sorted[report_columns].to_string(index=False))

 param_n_estimators param_max_depth  param_min_samples_split  param_min_samples_leaf  mean_test_score  std_test_score  rank_test_score
                200            None                        2                       1         0.937023        0.009723                1
                300            None                        2                       1         0.937023        0.013335                1
                200              10                        2                       1         0.937023        0.009723                1
                300              10                        2                       1         0.937023        0.013335                1
                200              20                        2                       1         0.937023        0.009723                1
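
The five top-ranked candidates above are tied at the same mean cross-validation accuracy, so the ranking alone says little about overfitting. One option is to pass `return_train_score=True` to `GridSearchCV`, which adds training-fold scores to `cv_results_` so the train/validation gap can be tabulated per candidate. The sketch below is a minimal illustration on synthetic data. The grid values, `demo_grid`, and the `gap` column are assumptions for demonstration, not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid that includes shallow, regularized trees.
demo_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    param_grid=demo_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,  # adds mean_train_score to cv_results_
)
search.fit(X_demo, y_demo)

# Tabulate the gap between training-fold and validation-fold accuracy.
report = pd.DataFrame(search.cv_results_)
report["gap"] = report["mean_train_score"] - report["mean_test_score"]
cols = ["param_max_depth", "param_min_samples_leaf",
        "mean_train_score", "mean_test_score", "gap"]
print(report[cols].sort_values("gap").to_string(index=False))
```

A candidate with a slightly lower validation score but a much smaller gap may be a better answer to the overfitting requirement than the top-ranked one.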


In [74]:
# Get the best model
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred_tuned = best_model.predict(X_test)
y_train_pred_tuned = best_model.predict(X_train)

In [75]:
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
precision_tuned = precision_score(y_test, y_pred_tuned, average='weighted', zero_division=0)

print(f"Accuracy (Test - Tuned Model): {accuracy_tuned:.4f}")
print(f"Precision (Test - Tuned Model): {precision_tuned:.4f}")

accuracy_train_tuned = accuracy_score(y_train, y_train_pred_tuned)
print(f"\nAccuracy (Train - Tuned Model): {accuracy_train_tuned:.4f}")
print(f"Difference (Train - Test Accuracy - Tuned Model): {accuracy_train_tuned - accuracy_tuned:.4f}")


Accuracy (Test - Tuned Model): 0.8618
Precision (Test - Tuned Model): 0.8427

Accuracy (Train - Tuned Model): 1.0000
Difference (Train - Test Accuracy - Tuned Model): 0.1382


In [76]:
print(f"Final Test Accuracy: {accuracy_tuned:.4f}")
print(f"Final Test Precision: {precision_tuned:.4f}")

if accuracy_tuned >= 0.86:
    print("\nConclusion: The tuned Random Forest model achieved an accuracy of {:.2f}% on the test set, which meets the acceptance criterion of at least 86%.".format(accuracy_tuned * 100))
else:
    print("\nConclusion: The tuned Random Forest model achieved an accuracy of {:.2f}% on the test set, which is below the acceptance criterion of 86%. Further tuning or feature engineering may be required.".format(accuracy_tuned * 100))

print(f"\nOverfitting Check: The difference between training accuracy ({accuracy_train_tuned:.4f}) and test accuracy ({accuracy_tuned:.4f}) is {accuracy_train_tuned - accuracy_tuned:.4f}.")
if (accuracy_train_tuned - accuracy_tuned) < 0.05:
    print("This small difference suggests that the model is not significantly overfitted. It generalizes well to unseen data.")
else:
    print("The difference suggests some degree of overfitting. While the model performs well on the training data, its performance drops on unseen test data. Consider techniques like increasing min_samples_leaf, reducing max_depth, or collecting more diverse data.")


Final Test Accuracy: 0.8618
Final Test Precision: 0.8427

Conclusion: The tuned Random Forest model achieved an accuracy of 86.18% on the test set, which meets the acceptance criterion of at least 86%.

Overfitting Check: The difference between training accuracy (1.0000) and test accuracy (0.8618) is 0.1382.
The difference suggests some degree of overfitting. While the model performs well on the training data, its performance drops on unseen test data. Consider techniques like increasing min_samples_leaf, reducing max_depth, or collecting more diverse data.
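
The message above suggests increasing `min_samples_leaf` or reducing `max_depth`. A hedged way to explore this is to refit the forest for a few values and compare training accuracy with the out-of-bag (OOB) estimate and the test accuracy. The sketch below runs on synthetic data; the values 1, 5 and 15 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

rows = []
for leaf in (1, 5, 15):
    forest = RandomForestClassifier(
        n_estimators=100,
        min_samples_leaf=leaf,  # larger leaves give smoother, less flexible trees
        oob_score=True,         # score each sample using trees that did not see it
        random_state=42,
    )
    forest.fit(X_tr, y_tr)
    rows.append((leaf, forest.score(X_tr, y_tr), forest.oob_score_,
                 forest.score(X_te, y_te)))

for leaf, train_acc, oob_acc, test_acc in rows:
    print(f"min_samples_leaf={leaf:>2}  train={train_acc:.3f}  "
          f"oob={oob_acc:.3f}  test={test_acc:.3f}")
```

With `oob_score=True`, each training sample is scored only by the trees whose bootstrap sample did not include it, which gives a generalization estimate without touching the test set. Whether a larger leaf size helps on the hepatitis data can only be determined by running it there.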


## 3. Create the summary test set with the additional predicted class column:
In this part, add the predicted class as a column to your test dataframe and save the result.

In [77]:
# Create the complete test dataframe
X_test_reset = X_test.reset_index(drop=True)
y_test_reset = y_test.reset_index(drop=True)
original_ids_for_test = df.loc[X_test.index, 'ID'].reset_index(drop=True)
original_cluster_for_test = df.loc[X_test.index, 'cluster'].reset_index(drop=True)

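# Start from a copy of the test features so that test_df exists before columns are added
test_df = X_test_reset.copy()
test_df['ID'] = original_ids_for_test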
test_df["Category"] = y_test_reset
test_df['cluster'] = original_cluster_for_test

feature_columns_order = [col for col in X_test.columns if col != 'ID']
final_columns_order = ['ID'] + feature_columns_order + ['cluster', 'Category']
test_df = test_df[final_columns_order]

test_df.head()

Unnamed: 0,ID,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster,Category
0,249,55,1,28.1,65.5,16.6,17.5,2.8,5.58,4.39,65.0,26.2,62.4,4,0
1,366,39,0,31.4,106.0,16.6,17.0,2.4,5.95,5.3,68.0,22.9,72.3,1,0
2,433,48,0,43.7,50.1,17.3,26.3,8.1,8.15,5.38,64.0,13.4,73.1,2,0
3,611,62,0,32.0,416.6,5.9,110.3,50.0,5.57,6.3,55.7,650.9,68.5,0,4
4,133,44,1,35.5,81.7,27.5,29.5,6.4,8.81,6.65,83.0,24.1,68.0,1,0
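
Because `train_test_split` preserves the original row labels of a DataFrame, `X_test.index` can be used to look up the matching rows in `df`. The sketch below, on a small made-up frame (`demo` and its values are hypothetical), shows a shorter way to assemble such a summary frame and checks that the labels stay aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny illustrative frame with the same kinds of columns as the notebook's df.
demo = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Category": [0, 0, 1, 0, 1, 0],
    "Age": [32, 45, 51, 38, 60, 29],
    "ALB": [38.5, 41.0, 35.2, 44.1, 30.9, 46.3],
    "cluster": [1, 1, 0, 2, 0, 2],
})
X_demo = demo.drop(columns=["ID", "Category", "cluster"])
y_demo = demo["Category"]
_, X_te, _, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# X_te.index still holds the original row labels, so .loc recovers the full
# rows in test order; the column list sets the final column order.
demo_test = demo.loc[X_te.index, ["ID", "Age", "ALB", "cluster", "Category"]]
demo_test = demo_test.reset_index(drop=True)

# The true labels pulled from demo match y_te row for row.
assert demo_test["Category"].tolist() == y_te.tolist()
print(demo_test)
```

Selecting rows once with `.loc` avoids resetting several indexes separately, which is where misalignment tends to creep in.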


In [78]:
# Add the predicted_class column
test_df["predicted_class"] = y_pred_tuned

In [79]:
test_df.head()

Unnamed: 0,ID,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster,Category,predicted_class
0,249,55,1,28.1,65.5,16.6,17.5,2.8,5.58,4.39,65.0,26.2,62.4,4,0,0
1,366,39,0,31.4,106.0,16.6,17.0,2.4,5.95,5.3,68.0,22.9,72.3,1,0,0
2,433,48,0,43.7,50.1,17.3,26.3,8.1,8.15,5.38,64.0,13.4,73.1,2,0,0
3,611,62,0,32.0,416.6,5.9,110.3,50.0,5.57,6.3,55.7,650.9,68.5,0,4,4
4,133,44,1,35.5,81.7,27.5,29.5,6.4,8.81,6.65,83.0,24.1,68.0,1,0,0


> Make sure you have 16 columns in this test set.

In [61]:
# Save the test set without the row index, so the file keeps exactly 16 columns
test_df.to_csv("test_summary.csv", index=False)
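
The `index` argument of `to_csv` decides whether the row index is written as an extra, unnamed first column, which matters when the saved file is expected to have a fixed number of columns. The sketch below writes a small hypothetical frame (`demo_test`) to a temporary file both ways and reloads it to compare the column lists.

```python
import os
import tempfile

import pandas as pd

# Hypothetical three-column frame standing in for test_df.
demo_test = pd.DataFrame(
    {"ID": [249, 366], "Category": [0, 0], "predicted_class": [0, 0]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test_summary_demo.csv")

    demo_test.to_csv(path)               # default index=True writes the row index
    with_index = pd.read_csv(path)

    demo_test.to_csv(path, index=False)  # only the real columns are written
    without_index = pd.read_csv(path)

print(with_index.columns.tolist())     # ['Unnamed: 0', 'ID', 'Category', 'predicted_class']
print(without_index.columns.tolist())  # ['ID', 'Category', 'predicted_class']
```

Reloading the saved file and checking `len(columns)` is a quick way to confirm the 16-column requirement on the real `test_summary.csv`.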