### Homework 6


### Dataset

In this homework, we will use the Students Performance in 2024 JAMB dataset from [Kaggle](https://www.kaggle.com/datasets/idowuadamo/students-performance-in-2024-jamb).

Here's a wget-able [link](https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv):

```bash
wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
```

The goal of this homework is to create a regression model for predicting the performance of students on a standardized test (column `'JAMB_Score'`).


### Preparing the dataset 

First, let's make the column names lowercase and replace spaces with underscores:

```python
df.columns = df.columns.str.lower().str.replace(' ', '_')
```

Preparation:

* Remove the `student_id` column.
* Fill missing values with zeros.
* Do train/validation/test split with 60%/20%/20% distribution. 
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.

In [1]:
import pandas as pd

In [2]:
# Load dataset
!wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
df = pd.read_csv('jamb_exam_results.csv')


In [3]:
# Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [None]:
# Remove the student_id column
df = df.drop('student_id', axis=1)

# Fill missing values
df = df.fillna(0)

In [None]:
from sklearn.model_selection import train_test_split

# Target and features
target = 'jamb_score'
X = df.drop(target, axis=1)
y = df[target]

# Split the data
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

In [None]:
from sklearn.feature_extraction import DictVectorizer

# Initialize DictVectorizer
dv = DictVectorizer(sparse=True)

# Convert dataframes to dictionaries
X_train_dict = X_train.to_dict(orient='records')
X_val_dict = X_val.to_dict(orient='records')
X_test_dict = X_test.to_dict(orient='records')

X_train = dv.fit_transform(X_train_dict)
X_val = dv.transform(X_val_dict)
X_test = dv.transform(X_test_dict)

## Question 1

Let's train a decision tree regressor to predict the `jamb_score` variable. 

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?

* `study_hours_per_week`
* `attendance_rate`
* `teacher_quality`
* `distance_to_school`

In [7]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

In [8]:
# Initialize the model with max_depth=1
dt = DecisionTreeRegressor(max_depth=1, random_state=1)

In [9]:
# Train the model
dt.fit(X_train, y_train)

In [None]:
# Get feature used for splitting
feature_names = dv.get_feature_names_out()
split_feature_index = dt.tree_.feature[0]  # the feature index at the first split
split_feature_name = feature_names[split_feature_index]

print(f"Feature used for splitting: {split_feature_name}")

Feature used for splitting: study_hours_per_week
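
Another way to inspect the split is to print the whole tree as text with `sklearn.tree.export_text`. The sketch below uses synthetic data and hypothetical feature names (`study_hours`, `noise`) so that it runs on its own; with the notebook's objects, the analogous call would presumably be `export_text(dt, feature_names=list(dv.get_feature_names_out()))`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (an assumption for illustration): the target depends only on the first column
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 40, size=(200, 2))
y_toy = 100 + 3 * X_toy[:, 0] + rng.normal(0, 5, size=200)

# A depth-1 tree (a "stump") makes exactly one split
stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(X_toy, y_toy)

toy_names = ["study_hours", "noise"]  # hypothetical feature names
rules = export_text(stump, feature_names=toy_names)
print(rules)
```

The printed rules show the splitting feature and threshold at the root, followed by the predicted value in each of the two leaves.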


## Question 2

Train a random forest regressor with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on the validation data?

* 22.13
* 42.13
* 62.13
* 82.12

In [22]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

In [23]:
# Initialize the model with the specified parameters
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)

In [24]:
# Train the model
rf.fit(X_train, y_train)

In [None]:
# Predict validation set
y_pred = rf.predict(X_val)

In [26]:
# Calculate RMSE
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE on validation data: {rmse:.2f}")

RMSE on validation data: 43.16
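
For reference, RMSE is the square root of the mean of the squared prediction errors. A small worked sketch with made-up numbers (not values from the dataset) shows that the manual computation agrees with the `np.sqrt(mean_squared_error(...))` call used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions, for illustration only
y_true = np.array([200.0, 150.0, 180.0])
y_hat = np.array([190.0, 170.0, 180.0])

errors = y_true - y_hat          # [10, -20, 0]
mse = np.mean(errors ** 2)       # (100 + 400 + 0) / 3
rmse_manual = np.sqrt(mse)

# Same quantity via scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
print(f"manual={rmse_manual:.3f}, sklearn={rmse_sklearn:.3f}")
```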


## Question 3

Now let's experiment with the `n_estimators` parameter:

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Consider 3 decimal places for calculating the answer.

- 10
- 25
- 80
- 200

In [None]:
# list to store RMSE values
rmse_values = []

In [None]:
# Loop different values of n_estimators
for n in range(10, 201, 10):
    # Initialize the model
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    
    # Train the model
    rf.fit(X_train, y_train)
    
    # Predict validation set
    y_pred = rf.predict(X_val)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_values.append((n, rmse))
    print(f"n_estimators={n}, RMSE={rmse:.3f}")

n_estimators=10, RMSE=43.158
n_estimators=20, RMSE=41.790
n_estimators=30, RMSE=41.556
n_estimators=40, RMSE=41.076
n_estimators=50, RMSE=40.957
n_estimators=60, RMSE=40.774
n_estimators=70, RMSE=40.588
n_estimators=80, RMSE=40.503
n_estimators=90, RMSE=40.435
n_estimators=100, RMSE=40.365
n_estimators=110, RMSE=40.348
n_estimators=120, RMSE=40.302
n_estimators=130, RMSE=40.286
n_estimators=140, RMSE=40.263
n_estimators=150, RMSE=40.254
n_estimators=160, RMSE=40.200
n_estimators=170, RMSE=40.187
n_estimators=180, RMSE=40.136
n_estimators=190, RMSE=40.152
n_estimators=200, RMSE=40.138


In [29]:
# Determine the point where RMSE stops improving
for i in range(1, len(rmse_values)):
    if round(rmse_values[i][1], 3) >= round(rmse_values[i-1][1], 3):
        print(f"RMSE stops improving after n_estimators={rmse_values[i-1][0]}")
        break

RMSE stops improving after n_estimators=180
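
The loop above retrains the forest from scratch for every value of `n_estimators`. As an optional alternative, scikit-learn's `warm_start=True` option keeps the trees that are already fitted and only adds the missing ones. The sketch below is an illustration on synthetic data (`make_regression`), not the homework dataset, and its scores may not match a from-scratch fit exactly, so treat it as a way to speed up exploration rather than as the reference procedure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with X_train / X_val in the notebook)
X_syn, y_syn = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25, random_state=1)

rf_ws = RandomForestRegressor(n_estimators=10, warm_start=True, random_state=1)
scores = []
for n in range(10, 51, 10):
    rf_ws.set_params(n_estimators=n)  # only the additional trees are fitted
    rf_ws.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_va, rf_ws.predict(X_va)))
    scores.append((n, rmse))
    print(f"n_estimators={n}, RMSE={rmse:.3f}")
```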



## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE 
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25

In [None]:
# Initialize dictionary to store mean RMSE values for each max_depth
mean_rmse_by_depth = {}

In [None]:
# Define values for max_depth and n_estimators
max_depth_values = [10, 15, 20, 25]
n_estimators_values = range(10, 201, 10)

In [None]:
# Loop different values of max_depth
for depth in max_depth_values:
    rmse_list = []
    
    # Loop different values of n_estimators
    for n in n_estimators_values:
        # Initialize model with current max_depth and n_estimators
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth, random_state=1, n_jobs=-1)
        
        # Train the model
        rf.fit(X_train, y_train)
        
        # Predict validation set
        y_pred = rf.predict(X_val)
        
        # Calculate RMSE and add to list
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        rmse_list.append(rmse)
    
    # Calculate mean RMSE for current max_depth
    mean_rmse_by_depth[depth] = np.mean(rmse_list)
    print(f"Mean RMSE for max_depth={depth}: {mean_rmse_by_depth[depth]:.3f}")

Mean RMSE for max_depth=10: 40.138
Mean RMSE for max_depth=15: 40.644
Mean RMSE for max_depth=20: 40.610
Mean RMSE for max_depth=25: 40.688


In [None]:
# Identify max_depth with lowest mean RMSE
best_max_depth = min(mean_rmse_by_depth, key=mean_rmse_by_depth.get)
print(f"Best max_depth based on mean RMSE: {best_max_depth}")

Best max_depth based on mean RMSE: 10


## Question 5

We can extract feature importance information from tree-based models. 

At each step, the decision tree learning algorithm finds the best split.
For each split, we can calculate the "gain": the reduction in impurity from before the split to after it.
This gain helps us understand which features matter most for tree-based models.

In Scikit-Learn, tree-based models contain this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
field. 
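
To illustrate what "gain" means for a regression tree, here is a small worked sketch. It assumes impurity is measured as the variance of the target within a node (which corresponds to scikit-learn's default `squared_error` criterion) and uses made-up target values; `variance`, `y_node`, `left`, and `right` are names introduced only for this example.

```python
import numpy as np

# Made-up target values in a node, already ordered by some feature
y_node = np.array([100.0, 110.0, 120.0, 200.0, 210.0, 220.0])
left, right = y_node[:3], y_node[3:]  # a candidate split into two children


def variance(values):
    """Impurity of a node: mean squared deviation from the node mean."""
    return float(np.mean((values - values.mean()) ** 2))


parent_impurity = variance(y_node)

# Children are weighted by the share of samples they receive
w_left = len(left) / len(y_node)
w_right = len(right) / len(y_node)
children_impurity = w_left * variance(left) + w_right * variance(right)

gain = parent_impurity - children_impurity
print(f"parent={parent_impurity:.3f}, children={children_impurity:.3f}, gain={gain:.3f}")
```

Roughly speaking, a feature's importance accumulates gains like this one (weighted by how many samples reach the node) over every split that uses the feature, and the totals are then normalized to sum to 1.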

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)? 

* `study_hours_per_week`
* `attendance_rate`
* `distance_to_school`
* `teacher_quality`

In [None]:
# Initialize and train model with specified parameters
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

In [40]:
# Get feature importances
feature_importances = rf.feature_importances_
feature_names = dv.get_feature_names_out()

In [None]:
# Create DataFrame to match feature names with importance scores
import pandas as pd

In [42]:
importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importances})
importance_df = importance_df.sort_values(by='importance', ascending=False)


In [43]:
# Display the most important feature
most_important_feature = importance_df.iloc[0]
print(f"Most important feature: {most_important_feature['feature']} with importance score of {most_important_feature['importance']:.3f}")

Most important feature: study_hours_per_week with importance score of 0.254


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```python
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value

In [44]:
# Import XGBoost
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [45]:
# Convert the training and validation data to DMatrix format
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

In [None]:
# Watchlist to monitor performance on the training and validation sets
watchlist = [(dtrain, 'train'), (dval, 'val')]

In [None]:
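# XGBoost parameters for the first run (eta=0.3, as the question specifies).
# The output recorded further down came from a run where this value was 0.1,
# so both models were identical and reported the same RMSE; rerun to compare 0.3 with 0.1.
xgb_params_1 = {
    'eta': 0.3,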
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

In [50]:
xgb_params_2 = xgb_params_1.copy()
xgb_params_2['eta'] = 0.1 

In [51]:
# Train model with eta = 0.3
model_1 = xgb.train(params=xgb_params_1, dtrain=dtrain, num_boost_round=100, evals=watchlist, early_stopping_rounds=10, verbose_eval=False)
rmse_eta_0_3 = model_1.best_score

In [52]:
# Train model with eta = 0.1
model_2 = xgb.train(params=xgb_params_2, dtrain=dtrain, num_boost_round=100, evals=watchlist, early_stopping_rounds=10, verbose_eval=False)
rmse_eta_0_1 = model_2.best_score

In [53]:
# Compare RMSE values
print(f"RMSE with eta=0.3: {rmse_eta_0_3:.3f}")
print(f"RMSE with eta=0.1: {rmse_eta_0_1:.3f}")

RMSE with eta=0.3: 40.166
RMSE with eta=0.1: 40.166


In [54]:
if rmse_eta_0_1 < rmse_eta_0_3:
    print("Best RMSE with eta=0.1")
elif rmse_eta_0_3 < rmse_eta_0_1:
    print("Best RMSE with eta=0.3")
else:
    print("Both give equal RMSE")

Both give equal RMSE
