# Homework

> Note: sometimes your answer doesn't match one of 
> the options exactly. That's fine. 
> Select the option that's closest to your solution.

## Dataset

In this homework, we will use the Students Performance in 2024 JAMB dataset from [Kaggle](https://www.kaggle.com/datasets/idowuadamo/students-performance-in-2024-jamb).

Here's a wget-able [link](https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv):

```bash
wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
```

The goal of this homework is to create a regression model for predicting the performance of students on a standardized test (column `'JAMB_Score'`).

## Preparing the dataset 

First, let's make the column names lowercase and replace spaces with underscores:

```python
df.columns = df.columns.str.lower().str.replace(' ', '_')
```

Preparation:

* Remove the `student_id` column.
* Fill missing values with zeros.
* Do a train/validation/test split with a 60%/20%/20% distribution.
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
import xgboost as xgb

In [2]:
df = pd.read_csv("jamb_exam_results.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df.drop("student_id", axis=1, inplace=True)
df.fillna(0, inplace=True)

In [3]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
target = "jamb_score"
y_train = df_train[target]
y_val = df_val[target]
y_test = df_test[target]
X_train = df_train.drop(target, axis=1).to_dict(orient='records')
X_val = df_val.drop(target, axis=1).to_dict(orient='records')
X_test = df_test.drop(target, axis=1).to_dict(orient='records')

In [4]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train)
X_val = dv.transform(X_val)
X_test = dv.transform(X_test)
features = dv.feature_names_

## Question 1

Let's train a decision tree regressor to predict the `jamb_score` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?


In [5]:
reg = DecisionTreeRegressor(max_depth=1)
reg = reg.fit(X_train, y_train)
feature = features[reg.tree_.feature[0]]
print(f"{feature} is used for splitting the data")

study_hours_per_week is used for splitting the data


## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on validation?

In [6]:
reg = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_val)
rmse = root_mean_squared_error(y_val, y_pred)
print(f"The RMSE of this model on validation is {round(rmse, 2)}")

The RMSE of this model on validation is 42.14


## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

In [7]:
estimators = np.arange(10, 201, 10)
errors = []
for n_estimators in estimators:
    reg = RandomForestRegressor(n_estimators=n_estimators, random_state=1, n_jobs=-1)
    reg = reg.fit(X_train, y_train)
    y_pred = reg.predict(X_val)
    error = root_mean_squared_error(y_val, y_pred)
    errors.append(error)

errors = np.array(errors)
difference = errors[1:] - errors[:-1]
idx = np.where(difference > 0)[0][0]
print(f"After {estimators[idx]} estimators, RMSE stops improving")

After 90 estimators, RMSE stops improving
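
The question asks to consider 3 decimal places, while the cell above compares the raw RMSE values. A minimal sketch of the rounding step follows. The helper name `first_non_improving` and the RMSE values are made up for illustration (they are not results from this dataset), and "stops improving" is assumed to mean that the next rounded RMSE is not strictly lower than the current one.

```python
import numpy as np


def first_non_improving(estimators, errors, decimals=3):
    """Return the n_estimators value after which the rounded RMSE stops decreasing.

    Hypothetical helper: a step counts as "no improvement" when the next
    rounded RMSE is equal to or higher than the current one.
    Returns None if the RMSE keeps decreasing across all values.
    """
    rounded = np.round(np.asarray(errors, dtype=float), decimals)
    diffs = np.diff(rounded)              # rounded[i + 1] - rounded[i]
    stalled = np.where(diffs >= 0)[0]     # positions where RMSE did not drop
    if stalled.size == 0:
        return None
    return int(np.asarray(estimators)[stalled[0]])


# Made-up values, only to show the effect of rounding
demo_estimators = [10, 20, 30, 40, 50]
demo_errors = [42.1371, 41.4612, 41.1063, 41.1061, 41.2004]

print(first_non_improving(demo_estimators, demo_errors))              # 30
print(first_non_improving(demo_estimators, demo_errors, decimals=5))  # 40
```

With 3 decimals, the values for 30 and 40 estimators both round to 41.106, so the improvement is treated as having stopped at 30. With more decimals, the tiny drop at 40 still counts as an improvement.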


## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

In [8]:
depths = [10, 15, 20, 25]
mean_errors = []
for max_depth in depths:
    errors = []
    for n_estimators in estimators:
        reg = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1, n_jobs=-1)
        reg = reg.fit(X_train, y_train)
        y_pred = reg.predict(X_val)
        error = root_mean_squared_error(y_val, y_pred)
        errors.append(error)
    
    mean_errors.append(np.mean(errors))

print(f"The best max_depth, based on the mean RMSE, is {depths[np.argmin(mean_errors)]}")

The best max_depth, based on the mean RMSE, is 10


## Question 5

We can extract feature importance information from tree-based models. 

At each step, the decision tree learning algorithm finds the best split. 
For each candidate split, we can calculate the "gain": the reduction in impurity achieved by the split. 
This gain helps us understand which features matter most for tree-based models.
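
To make "gain" concrete, here is a small worked sketch for a regression split. It assumes variance of the target as the impurity measure (the usual choice for squared-error regression trees). The function names and the toy data are made up for illustration.

```python
def variance(values):
    """Population variance of a list of numbers (the impurity measure assumed here)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(x, y, threshold):
    """Impurity reduction from splitting the rows into x <= threshold and x > threshold."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty gains nothing
    n = len(y)
    # Impurity after the split: child variances weighted by child size
    weighted_child = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(y) - weighted_child


# Made-up toy data: hours studied vs. exam score
hours = [1, 2, 3, 10, 11, 12]
score = [100, 110, 120, 200, 210, 220]

print(split_gain(hours, score, threshold=3))  # about 2500: separates the two groups cleanly
print(split_gain(hours, score, threshold=1))  # about 720: a much weaker split
```

The tree picks the split with the largest gain, and a feature that repeatedly produces high-gain splits ends up with a high importance score.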

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

In [9]:
reg = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
reg = reg.fit(X_train, y_train)
print(f"The most important feature is {features[np.argmax(reg.feature_importances_)]}")

The most important feature is study_hours_per_week
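
Since the question asks for the most important feature among a given set of options, it can help to rank the importances by name and restrict them to the candidates. A minimal sketch follows. `rank_importances` is a hypothetical helper, and the names `feature_a` to `feature_c` and all importance values are invented for illustration.

```python
def rank_importances(feature_names, importances, candidates=None):
    """Pair feature names with importances and sort from most to least important.

    If `candidates` is given, only those feature names are kept.
    """
    pairs = [(name, float(value)) for name, value in zip(feature_names, importances)]
    if candidates is not None:
        allowed = set(candidates)
        pairs = [(name, value) for name, value in pairs if name in allowed]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented names and values, only to show the shape of the result
demo_names = ["feature_a", "feature_b", "study_hours_per_week", "feature_c"]
demo_importances = [0.10, 0.20, 0.55, 0.15]

for name, value in rank_importances(demo_names, demo_importances):
    print(f"{name}: {value:.2f}")
```

With the model from the cell above, the call would look like `rank_importances(features, reg.feature_importances_, candidates=[...])`, with the answer options filled in as the candidates.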


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```python
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

In [10]:
xgb_params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
    'device': 'cuda'
}
etas = [0.1, 0.3]
errors = []
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)
watchlist = [(dtrain, 'train'), (dval, 'val')]

for eta in etas:
    xgb_params["eta"] = eta
    model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                      verbose_eval=5,
                      evals=watchlist)
    y_pred = model.predict(dval)
    error = root_mean_squared_error(y_val, y_pred)
    errors.append(error)

[0]	train-rmse:45.50072	val-rmse:46.99373
[5]	train-rmse:40.15460	val-rmse:43.05644
[10]	train-rmse:37.11353	val-rmse:41.55631
[15]	train-rmse:35.07766	val-rmse:40.70892
[20]	train-rmse:33.57997	val-rmse:40.37859
[25]	train-rmse:32.50134	val-rmse:40.21661
[30]	train-rmse:31.47315	val-rmse:40.20963
[35]	train-rmse:30.68870	val-rmse:40.19360
[40]	train-rmse:29.89807	val-rmse:40.15747
[45]	train-rmse:29.33094	val-rmse:40.21096
[50]	train-rmse:28.58793	val-rmse:40.28533
[55]	train-rmse:27.95277	val-rmse:40.44296
[60]	train-rmse:27.26360	val-rmse:40.55054
[65]	train-rmse:26.56706	val-rmse:40.66625
[70]	train-rmse:26.05959	val-rmse:40.73555
[75]	train-rmse:25.55747	val-rmse:40.76267
[80]	train-rmse:25.13835	val-rmse:40.82813
[85]	train-rmse:24.64140	val-rmse:40.87915
[90]	train-rmse:23.93958	val-rmse:40.89645
[95]	train-rmse:23.39469	val-rmse:40.95651
[99]	train-rmse:23.14487	val-rmse:41.04335
[0]	train-rmse:42.69552	val-rmse:44.86028
[5]	train-rmse:34.43646	val-rmse:40.87186
[10]	train-rmse

In [11]:
print(f"eta={etas[np.argmin(errors)]} leads to the best RMSE score on the validation dataset")

eta=0.1 leads to the best RMSE score on the validation dataset
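
The comparison above uses the RMSE after the final boosting round. The training log shows that the validation RMSE can bottom out earlier, so it may also be worth looking at the best printed round for each run. Below is a rough sketch that parses log lines in the format shown above. The helper `best_val_round` is hypothetical, and the sample lines are copied from the first run's log.

```python
import re

# Matches lines such as "[40]<tab>train-rmse:29.89807<tab>val-rmse:40.15747"
LOG_LINE = re.compile(r"\[(\d+)\]\s+train-rmse:([\d.]+)\s+val-rmse:([\d.]+)")


def best_val_round(log_text):
    """Return (round, val_rmse) for the lowest validation RMSE found in the log text.

    Only rounds that were actually printed are considered, so with
    verbose_eval=5 the true minimum may lie between two printed rounds.
    Returns None if no line matches.
    """
    best = None
    for match in LOG_LINE.finditer(log_text):
        round_idx = int(match.group(1))
        val_rmse = float(match.group(3))
        if best is None or val_rmse < best[1]:
            best = (round_idx, val_rmse)
    return best


# A few lines copied from the first run's log above
sample_log = """\
[35]\ttrain-rmse:30.68870\tval-rmse:40.19360
[40]\ttrain-rmse:29.89807\tval-rmse:40.15747
[45]\ttrain-rmse:29.33094\tval-rmse:40.21096
[99]\ttrain-rmse:23.14487\tval-rmse:41.04335
"""

print(best_val_round(sample_log))  # (40, 40.15747)
```

In this excerpt the lowest printed validation RMSE is at round 40, while the final round is noticeably worse, which is a sign of overfitting as boosting continues.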


## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw06
* If your answer doesn't match options exactly, select the closest one