### Alex Khvatov Homework #6

## Dataset
In this homework, we will use the Students Performance in 2024 JAMB dataset from Kaggle.

In [1]:
!wget -q -O jamb_exam_results.csv https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv

The goal of this homework is to create a regression model for predicting the performance of students on a standardized test (column 'JAMB_Score').

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline

In [3]:
df = pd.read_csv("jamb_exam_results.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
del df['student_id']
df.fillna(0, inplace=True)

In [4]:
df.head()


| | jamb_score | study_hours_per_week | attendance_rate | teacher_quality | distance_to_school | school_type | school_location | extra_tutorials | access_to_learning_materials | parent_involvement | it_knowledge | age | gender | socioeconomic_status | parent_education_level | assignments_completed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 192 | 22 | 78 | 4 | 12.4 | Public | Urban | Yes | Yes | High | Medium | 17 | Male | Low | Tertiary | 2 |
| 1 | 207 | 14 | 88 | 4 | 2.7 | Public | Rural | No | Yes | High | High | 15 | Male | High | 0 | 1 |
| 2 | 182 | 29 | 87 | 2 | 9.6 | Public | Rural | Yes | Yes | High | Medium | 20 | Female | High | Tertiary | 2 |
| 3 | 210 | 29 | 99 | 2 | 2.6 | Public | Urban | No | Yes | Medium | High | 22 | Female | Medium | Tertiary | 1 |
| 4 | 199 | 12 | 98 | 3 | 8.8 | Public | Urban | No | Yes | Medium | Medium | 22 | Female | Medium | Tertiary | 1 |


In [5]:
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.2, random_state=1)
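
For reference, holding out 20% for the test set and then 20% of the remainder for validation gives roughly a 64%/16%/20% partition rather than 60/20/20. A minimal sketch on a hypothetical 1,000-row frame (the `toy` frame and its single column are placeholders, not the JAMB data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe: 1,000 rows, one column.
toy = pd.DataFrame({"x": np.arange(1000)})

# Same two-stage split as in the cell above.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.2, random_state=1)

split_sizes = {"train": len(toy_train), "val": len(toy_val), "test": len(toy_test)}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / len(toy):.0%})")
```

Passing `test_size=0.25` in the second call would instead yield a 60/20/20 split.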

In [6]:
df_train=df_train.reset_index(drop=True)
df_val=df_val.reset_index(drop=True)
df_test=df_test.reset_index(drop=True)

In [7]:
y_train = df_train.jamb_score.values
y_val = df_val.jamb_score.values
y_test = df_test.jamb_score.values

In [8]:
del df_train['jamb_score']
del df_val['jamb_score']
del df_test['jamb_score']

In [9]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction import DictVectorizer

In [10]:
train_dict = df_train.to_dict(orient='records')
val_dict = df_val.to_dict(orient='records')

In [11]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)  # reuse the feature layout fitted on the training data

## Question 1

Let's train a decision tree regressor to predict the jamb_score variable.

* Train a model with `max_depth=1`

Which feature is used for splitting the data?

* study_hours_per_week
* attendance_rate
* teacher_quality
* distance_to_school

In [12]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

In [13]:
from sklearn.tree import export_text

In [14]:
print(export_text(dt, feature_names=dv.get_feature_names_out()))

|--- study_hours_per_week <= 18.50
|   |--- value: [156.06]
|--- study_hours_per_week >  18.50
|   |--- value: [188.77]



answer: study_hours_per_week
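
The split can also be read programmatically from the fitted tree's `tree_` attribute instead of the text dump. A minimal sketch on synthetic data (the arrays are made up; with the notebook's objects one would index `dv.get_feature_names_out()` with `dt.tree_.feature[0]`):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): the target depends only on column 0.
rng = np.random.default_rng(1)
toy_X = rng.uniform(0, 40, size=(200, 3))
toy_y = np.where(toy_X[:, 0] > 18, 190.0, 155.0) + rng.normal(0, 5, size=200)

stump = DecisionTreeRegressor(max_depth=1, random_state=1).fit(toy_X, toy_y)

root_feature = int(stump.tree_.feature[0])        # column index used at the root
root_threshold = float(stump.tree_.threshold[0])  # split point on that column
print(root_feature, round(root_threshold, 2))
```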

## Question 2

Train a random forest regressor with these parameters:

* n_estimators=10
* random_state=1
* n_jobs=-1 (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 22.13
* 42.13
* 62.13
* 82.12

In [15]:
from sklearn.ensemble import RandomForestRegressor

In [16]:
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

In [17]:
from sklearn.metrics import root_mean_squared_error

In [18]:
y_pred = rf.predict(X_val)

In [19]:
root_mean_squared_error(y_val, y_pred)

np.float64(41.60899752457394)

Answer: the closest option is 42.13.
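
`root_mean_squared_error` was added in scikit-learn 1.4; the same quantity is the square root of the mean squared difference between targets and predictions. A worked sketch with made-up scores:

```python
import numpy as np

# Made-up scores for illustration only.
toy_true = np.array([200.0, 180.0, 220.0])
toy_pred = np.array([190.0, 185.0, 230.0])

errors = toy_true - toy_pred              # [10, -5, -10]
toy_rmse = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 25 + 100) / 3) = sqrt(75)
print(round(float(toy_rmse), 3))
```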

## Question 3

Now let's experiment with the `n_estimators` parameter

* Try different values of this parameter from 10 to 200 with step 10.
* Set random_state to 1.
* Evaluate the model on the validation dataset.

 After which value of n_estimators does RMSE stop improving? Consider 3 decimal places for calculating the answer.

* 10
* 25
* 80
* 200

In [20]:
n_estimators=[10, 25, 80, 200]

def get_rmse(n_estimators:int, max_depth=None)->float:
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=1, n_jobs=-1, max_depth=max_depth)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_val)
    return root_mean_squared_error(y_val, y_pred)

for n in n_estimators:
    rmse = get_rmse(n)
    print(f"n_estimators={n:>3} rmse={rmse:.3f}")

n_estimators= 10 rmse=41.609
n_estimators= 25 rmse=40.608
n_estimators= 80 rmse=40.223
n_estimators=200 rmse=40.378


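Answer: _80_. Among the four values tried, RMSE is lowest at n_estimators=80 (40.223) and gets worse again at 200 (40.378).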
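
The cell above evaluates only the four answer options. To sweep the full range the question describes, one option is scikit-learn's `warm_start`, which grows an existing forest instead of retraining it from scratch. A sketch on synthetic data (the data and variable names are assumptions; the resulting trees may differ from those of freshly trained forests, so RMSE values need not match the notebook's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption): not the JAMB features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, y_tr, X_va, y_va = X[:240], y[:240], X[240:], y[240:]

# warm_start=True keeps the trees already fitted and only adds new ones
# each time n_estimators is raised and fit() is called again.
forest = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=1)

rmse_by_n = {}
for n in range(10, 51, 10):  # the homework range would be range(10, 201, 10)
    forest.set_params(n_estimators=n)
    forest.fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, forest.predict(X_va))
    rmse_by_n[n] = round(float(np.sqrt(mse)), 3)  # 3 decimals, as in the question

best_n = min(rmse_by_n, key=rmse_by_n.get)
print(rmse_by_n, best_n)
```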

## Question 4


Let's select the best max_depth:

* Try different values of max_depth: [10, 15, 20, 25]
* For each of these values,
    * try different values of n_estimators from 10 to 200 (with step 10)
    * calculate the mean RMSE

* Fix the random seed: random_state=1


What's the best max_depth, using the mean RMSE?

* 10
* 15
* 20
* 25

In [21]:
max_depths=[10, 15, 20, 25]
n_estimators_range = range(10, 201, 10)

depth_rmse =[]
for max_depth in max_depths:
    rmse_for_given_max_depth = []
    for n_estimators in n_estimators_range:
        rmse_for_given_max_depth.append(get_rmse(n_estimators, max_depth))
    mean_rmse = np.mean(rmse_for_given_max_depth)
    depth_rmse.append((max_depth, mean_rmse))
    print(f"max_depth={max_depth} mean_rmse={mean_rmse:.3f}")

max_depth=10 mean_rmse=39.908
max_depth=15 mean_rmse=40.299
max_depth=20 mean_rmse=40.289
max_depth=25 mean_rmse=40.408


In [22]:
print(f"best max_depth: {min(depth_rmse, key=lambda x: x[1])[0]}")

best max_depth: 10


Answer: max_depth=10, which has the smallest mean RMSE (39.908).

## Question 5

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm searches for the best split. For every candidate split we can calculate the "gain": the reduction in impurity achieved by the split (impurity before the split minus impurity after it). This gain is a useful indicator of which features matter most to a tree-based model.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field.
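
For regression trees with the default squared-error criterion, a node's impurity is the variance of the target within it, and the gain of a split is the parent's variance minus the size-weighted variance of the two children. `feature_importances_` aggregates such reductions per feature and normalizes them to sum to 1. A worked sketch on four made-up points:

```python
import numpy as np

# Made-up feature and target values for illustration.
toy_x = np.array([10.0, 15.0, 20.0, 25.0])
toy_y = np.array([150.0, 160.0, 190.0, 200.0])
threshold = 17.5

left, right = toy_y[toy_x <= threshold], toy_y[toy_x > threshold]

parent_impurity = np.var(toy_y)  # variance is the squared-error impurity of a node
child_impurity = (len(left) * np.var(left) + len(right) * np.var(right)) / len(toy_y)
gain = parent_impurity - child_impurity
print(parent_impurity, child_impurity, gain)
```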

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
     * n_estimators=10,
     * max_depth=20,
     * random_state=1,
     * n_jobs=-1 (optional)
* Get the feature importance information from this model

    
What's the most important feature (among these 4)?

 * study_hours_per_week
 * attendance_rate
 * distance_to_school
 * teacher_quality

In [23]:
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

In [24]:
print(f"Feature importances: {rf.feature_importances_}")

Feature importances: [0.01094609 0.01008961 0.0668335  0.03139094 0.15118005 0.14134736
 0.00952129 0.00981117 0.01129091 0.00898984 0.0221945  0.01240691
 0.01235437 0.         0.01535858 0.01413879 0.01439532 0.02276782
 0.01479034 0.01069137 0.00988814 0.00979223 0.00990396 0.00937294
 0.02315288 0.01359314 0.01423341 0.23924465 0.08031991]


In [25]:
feature_importances = rf.feature_importances_

In [26]:
feature_names = list(dv.get_feature_names_out())

In [27]:
sorted(zip(feature_names, feature_importances), key=lambda x: x[1], reverse=True)[0]

('study_hours_per_week', np.float64(0.23924464530175324))

Answer: __study_hours_per_week__ (importance ≈ 0.239)

## Question 6

Now let's train an XGBoost model! For this question, we'll tune the eta parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:


```
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change eta from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value

In [28]:
!pip install xgboost


In [29]:
import xgboost as xgb

In [30]:
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

In [31]:
xgb_params = {
    'eta': 0.1, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
    

model = xgb.train(xgb_params, dtrain, num_boost_round=100)

In [32]:
y_pred = model.predict(dval)
rmse = root_mean_squared_error(y_val, y_pred)
print(f"{xgb_params['eta']=} {rmse=}")

xgb_params['eta']=0.1 rmse=np.float64(40.19582382984711)


Answer:

xgb_params['eta']=0.3 rmse=np.float64(42.37919586079883)

xgb_params['eta']=0.1 rmse=np.float64(40.19582382984711)

RMSE is best (smallest) when eta is 0.1