# Laboratory exercise 6

The notebook contains exercises connected to auditory exercise 7. For any questions, feel free to contact the assistant: eda.jovicic@fer.hr

The main task of this notebook is to make predictions using supervised learning with the scikit-learn library. The goal is to predict a student's grade in Math from the student's other features.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, accuracy_score, f1_score
from sklearn.impute import SimpleImputer

1. Load the cleaned dataset from the first exercise. If you haven't saved the dataset, rerun the exercise and save the final dataset.

In [2]:
students = pd.read_csv('cleaned_dataset.csv')
students

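|       | EthnicGroup | ParentEduc | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore | Gender_female | Gender_male | LunchType_free/reduced | LunchType_standard |
| ----- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0     | 2   | 1   | 1   | 1   | 1   | 1   | 3.0 | 1   | 1   | 71  | 71  | 74  | 1.0 | 0.0 | 0.0 | 1.0 |
| 1     | 2   | 5   | 1   | 1   | 2   | 1   | 0.0 | 2   | 0   | 69  | 90  | 88  | 1.0 | 0.0 | 0.0 | 1.0 |
| 2     | 1   | 3   | 1   | 2   | 2   | 1   | 4.0 | 1   | 1   | 87  | 93  | 91  | 1.0 | 0.0 | 0.0 | 1.0 |
| 3     | 0   | 0   | 1   | 1   | 0   | 0   | 1.0 | 2   | 0   | 45  | 56  | 42  | 0.0 | 1.0 | 1.0 | 0.0 |
| 4     | 2   | 5   | 1   | 1   | 2   | 1   | 0.0 | 1   | 0   | 76  | 78  | 75  | 0.0 | 1.0 | 0.0 | 1.0 |
| ...   | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29112 | 3   | 2   | 1   | 2   | 2   | 0   | 2.0 | 1   | 0   | 59  | 61  | 65  | 1.0 | 0.0 | 0.0 | 1.0 |
| 29113 | 4   | 2   | 1   | 2   | 1   | 0   | 1.0 | 0   | 0   | 58  | 53  | 51  | 0.0 | 1.0 | 0.0 | 1.0 |
| 29114 | 2   | 2   | 0   | 1   | 2   | 0   | 1.0 | 0   | 0   | 61  | 70  | 67  | 1.0 | 0.0 | 1.0 | 0.0 |
| 29115 | 3   | 0   | 0   | 1   | 1   | 0   | 3.0 | 1   | 0   | 82  | 90  | 93  | 1.0 | 0.0 | 0.0 | 1.0 |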


2. Transform the MathScore feature into grades (1-5) using the following scoring system:

|   MathScore |  Grade  |
| ----------- | ------- |
|  88 - 100   |    5    |
|  75 - 87    |    4    |
|  63 - 74    |    3    |
|  50 - 62    |    2    |
|   0 - 49    |    1    |

In [3]:
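# include_lowest=True so that a score of exactly 0 falls into grade 1 instead of becoming NaN
students['Grade'] = pd.cut(students['MathScore'], bins=[0, 49, 62, 74, 87, 100], labels=[1, 2, 3, 4, 5], include_lowest=True)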
students[['MathScore', 'Grade']]

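|       | MathScore | Grade |
| ----- | --------- | ----- |
| 0     | 71        | 3     |
| 1     | 69        | 3     |
| 2     | 87        | 4     |
| 3     | 45        | 1     |
| 4     | 76        | 4     |
| ...   | ...       | ...   |
| 29112 | 59        | 2     |
| 29113 | 58        | 2     |
| 29114 | 61        | 2     |
| 29115 | 82        | 4     |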
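
The intervals that `pd.cut` builds are right-inclusive by default, so the behaviour at the edges of the grading table is worth checking. The following is a minimal sketch on hypothetical boundary scores (not rows from the dataset); the names `scores`, `default_grades` and `fixed_grades` are illustrative only.

```python
import pandas as pd

# Hypothetical boundary scores, chosen to sit on every edge of the grading table
scores = pd.Series([0, 49, 50, 62, 63, 74, 75, 87, 88, 100])
bins = [0, 49, 62, 74, 87, 100]
labels = [1, 2, 3, 4, 5]

# Default intervals are (left, right], so a score of 0 falls outside (0, 49]
default_grades = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 maps to grade 1
fixed_grades = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "MathScore": scores,
    "default": default_grades,
    "include_lowest": fixed_grades,
}))
```

Every other boundary (49/50, 62/63, 74/75, 87/88) lands in the grade the table assigns, with or without `include_lowest`.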


3. Divide the dataset into features (X) and the target (y). For features we will use all the columns except MathScore, ReadingScore and WritingScore. For the target we will use the MathScore column. Split the dataset into training and testing sets. The split should be done in a 70-30% ratio.

In [4]:
#X (all features except MathScore, ReadingScore, WritingScore) and y (MathScore)
X = students.drop(['MathScore', 'ReadingScore', 'WritingScore'], axis=1)
y = students['MathScore']

In [5]:
#splitting the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
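
When the target classes are unevenly sized, a plain random split can leave the two parts with slightly different class proportions. A possible variation, sketched here on hypothetical synthetic data (`X_demo` and `y_demo` are made-up stand-ins, not the student table), is to pass `stratify` to `train_test_split`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the student table: 1000 rows with imbalanced grades
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "WklyStudyHours": rng.integers(0, 3, size=1000),
    "NrSiblings": rng.integers(0, 8, size=1000),
})
y_demo = pd.Series(
    rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.15, 0.25, 0.30, 0.20, 0.10]),
    name="Grade",
)

# stratify=y_demo keeps each grade's share (almost) identical in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(len(X_tr), len(X_te))
print(pd.DataFrame({"train": train_share, "test": test_share}).round(3))
```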

4. Create a Random Forest model with max_depth=5 and n_estimators=20. Train the model on the training set, and then test it on the test set. Display the confusion matrix. Show precision, recall and F1 score for all grades.

In [6]:
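# Note: y is the raw MathScore, so the classifier treats every distinct score as its own class,
# and X still contains the Grade column created in task 2
X = students.drop(['MathScore', 'ReadingScore', 'WritingScore'], axis=1)
y = students['MathScore']

# X contains missing values; replace them with the column mean
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)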

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

#Random Forest model
rf_model = RandomForestClassifier(max_depth=5, n_estimators=20, random_state=42)

#training
rf_model.fit(X_train, y_train)

#testing
y_pred = rf_model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix: {conf_matrix}")

class_report = classification_report(y_test, y_pred)
print(f"\nClassification Report: {class_report}")

Confusion Matrix: [[ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]
 ...
 [ 0  0  0 ...  0  0  7]
 [ 0  0  0 ...  0  0 15]
 [ 0  0  0 ...  0  0 47]]

Classification Report:               precision    recall  f1-score   support

           7       0.00      0.00      0.00         1
          10       0.00      0.00      0.00         1
          12       0.00      0.00      0.00         1
          14       0.00      0.00      0.00         1
          16       0.00      0.00      0.00         2
          17       0.00      0.00      0.00         2
          18       0.00      0.00      0.00         5
          19       0.00      0.00      0.00         3
          21       0.00      0.00      0.00         2
          22       0.00      0.00      0.00         2
          23       0.00      0.00      0.00         3
          24       0.00      0.00      0.00         6
          25       0.00      0.00      0.00         7
          26       0.00      0.00      0.00  

  _warn_prf(average, modifier, msg_start, len(result))
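
The report above has one row per distinct MathScore value because the target was the raw score rather than the five-level grade. The following is a minimal sketch of the same pipeline with `Grade` as the target, under several assumptions: the data is hypothetical and synthetic (`demo` stands in for `students`), `Grade` is removed from the features, the imputer is fitted on the training part only, and `most_frequent` is swapped in for `mean` because the features are integer codes. It is one possible reading of the task, not the reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; in the notebook this would be the `students` frame
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "TestPrep": rng.integers(0, 2, size=n),
    "WklyStudyHours": rng.integers(0, 3, size=n),
    "NrSiblings": rng.integers(0, 8, size=n).astype(float),
})
demo.loc[rng.choice(n, size=100, replace=False), "NrSiblings"] = np.nan  # some missing values
raw_score = 55 + 10 * demo["TestPrep"] + 5 * demo["WklyStudyHours"] + rng.normal(0, 15, size=n)
demo["MathScore"] = raw_score.clip(0, 100).round()
demo["Grade"] = pd.cut(demo["MathScore"], bins=[0, 49, 62, 74, 87, 100],
                       labels=[1, 2, 3, 4, 5], include_lowest=True).astype(int)

# Target is the five-level grade; both MathScore and Grade are kept out of X
X = demo.drop(columns=["MathScore", "Grade"])
y = demo["Grade"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the imputer on the training part only, then apply it to both parts
imputer = SimpleImputer(strategy="most_frequent")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

rf = RandomForestClassifier(max_depth=5, n_estimators=20, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

grade_labels = [1, 2, 3, 4, 5]
cm = confusion_matrix(y_test, y_pred, labels=grade_labels)
print(pd.DataFrame(cm, index=grade_labels, columns=grade_labels))
# zero_division=0 reports 0 without a warning for grades that are never predicted
print(classification_report(y_test, y_pred, labels=grade_labels, zero_division=0))
```

With five classes the confusion matrix is 5x5 and fits on screen, and the report has one row per grade, which is what the task asks for.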


5. The accuracy of our model is not the best. The reason could be having too many possible classes (grades). Let's transform the data again, but this time, instead of predicting grades, we want to predict whether the student will pass (grades 2, 3, 4 and 5) or fail (grade 1) math. After transforming the MathScore accordingly (0 - failed, 1 - passed), repeat task 4 and compare the results.

In [7]:
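# New target column: pass = 1, fail = 0
# Note: the grading table puts a score of 50 in grade 2 (a pass), but `> 50` labels it as a fail
students['PassFail'] = (students['MathScore'] > 50).astype(int)

# Note: X still contains the Grade column from task 2, which reveals pass/fail almost directly
X = students.drop(['MathScore', 'ReadingScore', 'WritingScore', 'PassFail'], axis=1)
y = students['PassFail']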

#fixing missing values in X
imputer = SimpleImputer(strategy='mean') 
X = imputer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest model
rf_model = RandomForestClassifier(max_depth=5, n_estimators=20, random_state=42)

#training
rf_model.fit(X_train, y_train)

#testing
y_pred = rf_model.predict(X_test)

conf_matrix = confusion_matrix(y_test, y_pred)
print(f"Confusion Matrix: {conf_matrix}")

class_report = classification_report(y_test, y_pred)
print(f"\nClassification Report: {class_report}")

Confusion Matrix: [[1174  124]
 [   0 7438]]

Classification Report:               precision    recall  f1-score   support

           0       1.00      0.90      0.95      1298
           1       0.98      1.00      0.99      7438

    accuracy                           0.99      8736
   macro avg       0.99      0.95      0.97      8736
weighted avg       0.99      0.99      0.99      8736
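
According to the grading table in task 2, grade 1 covers scores 0 to 49, so "pass" corresponds to `MathScore >= 50`. The sketch below checks this on hypothetical boundary scores (`demo` and the column `PassFail_strict` are illustrative names only) and compares it with a `> 50` threshold.

```python
import pandas as pd

# Hypothetical scores around the pass mark, not rows from the dataset
demo = pd.DataFrame({"MathScore": [0, 30, 49, 50, 51, 74, 100]})
demo["Grade"] = pd.cut(demo["MathScore"], bins=[0, 49, 62, 74, 87, 100],
                       labels=[1, 2, 3, 4, 5], include_lowest=True).astype(int)

# Pass = grades 2-5, which by the grading table means MathScore >= 50
demo["PassFail"] = (demo["MathScore"] >= 50).astype(int)
# The `> 50` variant, kept only for comparison
demo["PassFail_strict"] = (demo["MathScore"] > 50).astype(int)

# Rows where the two definitions disagree: only the score of exactly 50
mismatch = demo.loc[demo["PassFail"] != demo["PassFail_strict"], "MathScore"].tolist()
print(demo)
print("scores labelled differently:", mismatch)

# Grade encodes pass/fail exactly, so it should not stay among the features
agrees_with_grade = (demo["PassFail"] == (demo["Grade"] > 1).astype(int)).all()
print("PassFail equals (Grade > 1):", agrees_with_grade)
```

Because `PassFail` is a function of `Grade`, leaving `Grade` in X lets the model read the answer off a single column; dropping it together with `MathScore` gives a more honest estimate.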



6. Compare the results. Did our model work better in the first case or the second? Explain why and suggest a way to improve it.

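Our model worked better in the second case, mainly because the task was simplified to binary classification. In the first case the target was the raw MathScore, so the classifier had to distinguish one class per distinct score, many of them with only a handful of test samples, and precision, recall and F1 are 0.00 for every class shown in the report. The two results are not directly comparable, however: the second test set is imbalanced (1298 fails against 7438 passes), and the Grade column created in task 2 was still among the features, which reveals pass/fail almost directly. To improve the model, use Grade as the target for the multi-class task, remove Grade from the features, define a pass as MathScore >= 50 to match the grading table, and then tune max_depth and n_estimators.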
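
Because most students pass, accuracy alone can look good even for a model that has learned very little. The following is a minimal sketch, on hypothetical synthetic data rather than the student dataset, of two checks that could support the comparison: a majority-class baseline scored with macro F1, and a small grid search over `max_depth` and `n_estimators`. The grid values are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical imbalanced binary data (mostly positives), not the student dataset
rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1500) > -1.27).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
base_pred = baseline.predict(X_test)

# Small, illustrative grid around the notebook's max_depth=5, n_estimators=20
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [5, 10], "n_estimators": [20, 100]},
    scoring="f1_macro",
    cv=3,
)
grid.fit(X_train, y_train)
rf_pred = grid.predict(X_test)

base_f1 = f1_score(y_test, base_pred, average="macro", zero_division=0)
rf_f1 = f1_score(y_test, rf_pred, average="macro", zero_division=0)
print("baseline accuracy:", round(accuracy_score(y_test, base_pred), 3), "macro F1:", round(base_f1, 3))
print("forest accuracy:", round(accuracy_score(y_test, rf_pred), 3), "macro F1:", round(rf_f1, 3))
print("best params:", grid.best_params_)
```

Macro F1 weights both classes equally, so the baseline scores poorly on it despite its high accuracy, which makes the gain from the trained model easier to see.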