## Ensemble trees for a single subject


In this analysis, we build a model that predicts student grades using data from one subject at a time instead of all subjects together. This lets us study that subject in more detail and measure how well its grades can be predicted. We use this approach only for comparison, since training a separate model for every subject would be computationally expensive.

To do this, we start by loading the cleaned dataset. Then we keep only the rows for the subject we are interested in, such as "Fonaments d'enginyeria, Àlgebra". This gives us a more focused, specialized model.
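The filtering step can be sketched on a toy table. The column names `assignatura` and `nota_assignatura` come from the notebook; the rows and the name `df_toy` are invented for illustration. Counting rows per subject first is a quick way to see whether a subject has enough data to train on.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the rows are invented for illustration.
df_toy = pd.DataFrame({
    "assignatura": [
        "Àlgebra", "Àlgebra",
        "Fonaments dels computadors",
        "Fonaments dels computadors",
        "Fonaments dels computadors",
    ],
    "nota_assignatura": [6.5, 4.0, 7.2, 5.5, 8.1],
})

# Rows available per subject; subjects with few rows give unstable models.
rows_per_subject = df_toy["assignatura"].value_counts()

chosen_subject = "Fonaments dels computadors"
# A boolean mask keeps only the chosen subject; .copy() gives an independent
# frame, so adding columns later does not trigger SettingWithCopyWarning.
df_subject = df_toy[df_toy["assignatura"] == chosen_subject].copy()

print(rows_per_subject)
print(df_subject)
```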

Next, we choose the most relevant features for prediction, such as the student's access grade, gender, and parents' education. These features are used to train an ensemble of tree-based regressors (a random forest, gradient boosting, and a single decision tree) whose predictions are averaged. Tree-based models tend to perform well on tabular data like this.
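To make the averaging concrete, here is a minimal sketch on synthetic numeric data (the features and target are generated at random and do not come from the student dataset). It fits scikit-learn's `VotingRegressor` with the same three members used in this notebook and checks that its prediction equals the plain mean of the members' predictions, which is what happens when no weights are given.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: three numeric features and a noisy linear target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

voting = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
    ("gb", GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ("dt", DecisionTreeRegressor(max_depth=8, random_state=42)),
])
voting.fit(X, y)

# estimators_ holds the fitted members; stack their predictions column-wise.
member_preds = np.column_stack([est.predict(X) for est in voting.estimators_])
manual_average = member_preds.mean(axis=1)

# With no weights, the ensemble prediction is the unweighted mean.
print(np.allclose(voting.predict(X), manual_average))
```

Because the members are averaged with equal weight, a weak member (for example the single decision tree) pulls the ensemble toward its own errors; passing `weights=` to `VotingRegressor` is one way to adjust for that.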

Finally, we test how well the model works and use it to predict grades for the most recent academic year (2023/24). This shows how predictable each subject is and lets us compare its performance against the other models.
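One possible way to test the model is k-fold cross-validation on the subject's training rows. The sketch below is an assumption about how this could be done, not the notebook's own evaluation: the data are synthetic, and the column names `access_grade` and `gender` are hypothetical stand-ins for the real features. It reports the mean absolute error (MAE) per fold, in grade points.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic single-subject data; column names are hypothetical.
rng = np.random.default_rng(0)
n = 150
X_demo = pd.DataFrame({
    "access_grade": rng.uniform(5, 14, size=n),
    "gender": rng.choice(["F", "M"], size=n),
})
# Grades on a 0-10 scale, loosely driven by the access grade plus noise.
y_demo = np.clip(0.6 * X_demo["access_grade"] + rng.normal(scale=1.0, size=n), 0, 10)

# One-hot encode the categorical column, pass numeric columns through.
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", VotingRegressor([
        ("rf", RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
        ("gb", GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ("dt", DecisionTreeRegressor(max_depth=8, random_state=42)),
    ])),
])

# scikit-learn returns negated MAE so that higher is better; flip the sign.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores

print("MAE per fold:", np.round(mae_per_fold, 2))
print("Mean MAE:", round(float(mae_per_fold.mean()), 2))
```

Because the encoder sits inside the pipeline, it is refit on each training fold, so no information from the held-out fold leaks into preprocessing.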


In [None]:
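# Imports used by this cell (harmless if already imported earlier in the notebook)
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor

# NOTE: df_train, df_pred_target and best_features must already be defined by
# earlier cells; running this cell on its own raises a NameError.

# Step 1: Choose a specific subject
chosen_subject = 'Fonaments dels computadors'  # Change as needed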

# Step 2: Filter train and prediction datasets
df_subject_train = df_train[df_train['assignatura'] == chosen_subject].copy()
df_subject_pred = df_pred_target[df_pred_target['assignatura'] == chosen_subject].copy()

# Step 3: Prepare features and target
X_train_subj = df_subject_train[best_features]
y_train_subj = df_subject_train['nota_assignatura']

X_pred_subj = df_subject_pred[best_features]

# Step 4: Preprocessing and modeling (re-use existing pipeline)
categorical_cols_subj = X_train_subj.select_dtypes(include='object').columns.tolist()

preprocessor_subj = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols_subj)],
    remainder='passthrough'
)

ensemble_pipeline_subj = Pipeline([
    ('preprocessor', preprocessor_subj),
    ('regressor', VotingRegressor([
        ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
    ]))
])

# Step 5: Fit and Predict
ensemble_pipeline_subj.fit(X_train_subj, y_train_subj)
df_subject_pred['predicted_nota_assignatura'] = ensemble_pipeline_subj.predict(X_pred_subj)

# Optional: View results
print(df_subject_pred[['assignatura', 'predicted_nota_assignatura']].head())


NameError: name 'df_train' is not defined