# TPM034A Machine Learning for socio-technical systems 
## `Assignment 03: Understanding primary school performance`

**Delft University of Technology**<br>
**Q2 2022**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>


## `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instructions below). 

# `Application: Primary school performance` <br>

### **Introduction**

In this lab session you will use ML to predict which primary schools perform well and which do not. The quality of primary schools is deemed to affect children's future employment outcomes, especially for students from lower-income families (see [reference](https://www.emerald.com/insight/content/doi/10.1016/S0147-9121(01)20039-9/full/html)). Therefore, **the ministry of education** expects all schools to meet certain performance criteria.<br>

**Measuring school performance** is notoriously difficult. In lab session 3, you developed ML models to predict the share of students advised to go to higher education (HAVO, VWO). However, it would be **naive** to equate such an indicator with school performance, because doing so ignores self-selection effects. For instance, wealthier families tend to cluster in wealthy neighbourhoods. Moreover, wealthier families may be able to afford extra education (e.g. Saturday morning math classes) that less well-off families cannot. As a result, the share of students advised to go to higher education may be higher in wealthier neighbourhoods not because the schools are better, but because pupils there have more opportunities for extra education. School advice thus does not necessarily reflect the performance of schools. <br>

Nonetheless, it is evident that - all else being equal - a well-performing school has a larger share of students leaving with an advice for higher education than a poorly performing school. In other words, based on behavioural intuition (and theory) we can expect school performance to affect the share of school leavers with an advice for higher education. <br>

School performance is latent; we do not measure it directly. But, using the models developed in lab session 3 we can predict the **expected** share of school leavers with an advice for higher education. We can compare this expectation with the observed share. The difference between the two may, at least partially, be explained by the school performance. In other words, if the expected share is much higher than the observed share, the school might not perform well. The figure below shows this conceptual model.<br>
![conceptual_model](data/conceptual_model.jpg)<br>


Despite the limited direct data on school performance, the ministry of education would like to get your advice on the following two points:

1. The ministry of education has the capacity to do full-fledged assessments of just 10 schools. Can you provide a list of 10 schools to assess with priority?
1. Schools receive extra funding for each student who is not a native Dutch speaker, so-called NOAT2 students. The extra funding should enable schools to allocate more resources to teaching NOAT2 students, while not leaving other students (called NOAT1) worse off. In other words, the extra funding should mitigate the impact of NOAT2 students on the share of school leavers with an advice for higher education. The ministry would like to know whether the current funding is sufficient to achieve this goal. 


#### **Data**
For this assignment you have access to two data sets:
* The school advice data set of lab session 3 [link](data/school_data.csv). As in lab session 3, the buurt (neighbourhood) features are expressed as shares of the buurt population (for example, the number of women in the buurt [count] is converted into the share of women in the buurt [%]).
* NOAT data [link](data/NOAT.csv). This data set provides the number of pupils who are native Dutch speakers (NOAT1_STUDENTS) and the number of pupils who are non-native Dutch speakers (NOAT2_STUDENTS).

### **Tasks and grading**

1.  **Load, merge, and clean data** [1 pnt]
    1. Load the school data and the NOAT data.
    1. Compute the share of NOAT1 students and the share of NOAT2 students in each school.
    1. Merge the computed shares of NOAT1 and NOAT2 students into the school data. (Note that the NOAT data does not exist for some schools in the school data; do **NOT** drop observations for these schools, and replace NaN values with 0.)
1. **Prepare the data** [1 pnt]
    1. Encode categorical variables ('DENOMINATION' and 'SPECIES_PO').
    1. Select the 30 most relevant variables for prediction. (hint: use those features that most strongly correlate with the target: SHARE_HIGH).
    1. Scale the data using sklearn's StandardScaler.
    1. Split the data into a train and a test set.
1. **Train multiple models to predict the share of students advised to go to higher education** [4 pnt]
    1. Random forest: print the performance of the random forest, visualise the importance of each feature on a barplot.
    1. MLP: perform the hyperparameter tuning, print the performance of the MLP
    1. Gradient boosting: print the performance of the gradient boosting regression
    1. Ensemble model: create 3 ensemble models from the 3 models (print the performance of each ensemble model):
        - ensemble model 1: Random Forest, MLP
        - ensemble model 2: Random Forest, GBR
        - ensemble model 3: MLP, GBR
1. **Substantive results** [3 pnt]
    1. Predict the expected share of school leavers with an advice for higher education using the model with the best generalisation performance. 
    1. Compute the difference between the expected SHARE_HIGH and the actual SHARE_HIGH. Which 10 schools would you recommend the ministry of education to assess in-depth with priority?
    1. Reflect on the meaningfulness and limitations of your analysis.
1. **Extra funding for NOAT2 students** [1 pnt]
    1. Explain why the question of the ministry regarding the adequacy of the current funding for NOAT2 **cannot** be answered using ML (or at least not the ML taught in this course), and the provided data. <br>


<br>

### `Learning objective`
This assignment provides less structure (i.e. concrete descriptions of tasks we expect you to do) than the previous ones. This is deliberate. By this time, you have more experience. The learning objective is that you are able to reasonably independently apply ML in the context of a socio-technical environment. 


### **Submission**
- The deadline for this assignment is **Wed, 07 December 2022** 
- Use **Python 3.7 or above**
- Submit your work in Brightspace as a zip file containing the **fully executed** ipynb

### `Workspace set-up`
**Option 1: Google Colab**<br>
Uncomment the following code lines if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM34A/Q2_2022
#!pip install -r Q2_2022/requirements_colab.txt
#!mv "/content/Q2_2022/Assignments/assignment_03/data" /content/data

**Option 2: Local environment**<br>
Uncomment the following cell if you are running this notebook on your local environment. This will install all dependencies on your Python version.

In [None]:
#!pip install -r requirements.txt

In [None]:
# Import required Python packages and modules
import os
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path
from os import getcwd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, VotingRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, make_scorer,log_loss
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm

# Setting
pd.set_option('display.max_columns', None)

### **1. Load, merge, and clean data**
#### 1.1 **Load** the school and the NOAT data.

#### 1.2 Compute the share of NOAT1 students and the share of NOAT2 students in each school.

#### 1.3 Add the share of NOAT1 students and the share of NOAT2 students to the school data. (The NOAT data does not exist for some schools in the school data; do **NOT** drop observations for these schools, and replace NaN values with 0.)
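
Steps 1.2 and 1.3 can be sketched on toy data as follows. This is only an illustration: the two small DataFrames stand in for the CSV files, and the merge key `SCHOOL_ID` and the output column names `SHARE_NOAT1` and `SHARE_NOAT2` are hypothetical, since the actual key column is not described here.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; SCHOOL_ID is a hypothetical merge key
schools = pd.DataFrame({
    "SCHOOL_ID": ["A", "B", "C"],
    "SHARE_HIGH": [0.55, 0.40, 0.62],
})
noat = pd.DataFrame({
    "SCHOOL_ID": ["A", "B"],  # school C has no NOAT record
    "NOAT1_STUDENTS": [90, 30],
    "NOAT2_STUDENTS": [10, 70],
})

# Share of native (NOAT1) and non-native (NOAT2) Dutch speakers per school
total = noat["NOAT1_STUDENTS"] + noat["NOAT2_STUDENTS"]
noat["SHARE_NOAT1"] = noat["NOAT1_STUDENTS"] / total
noat["SHARE_NOAT2"] = noat["NOAT2_STUDENTS"] / total

# A left merge keeps every school; schools without NOAT data get NaN shares
share_cols = ["SHARE_NOAT1", "SHARE_NOAT2"]
merged = schools.merge(noat[["SCHOOL_ID"] + share_cols], on="SCHOOL_ID", how="left")

# Replace the missing shares by 0 instead of dropping the schools
merged[share_cols] = merged[share_cols].fillna(0)
print(merged)
```

The `how="left"` argument is what prevents schools without NOAT data from being dropped.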

### **2. Prepare the data**
#### 2.1 Encode categorical features ('DENOMINATION' and 'SPECIES_PO').
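
One common way to encode categorical features is one-hot encoding with `pd.get_dummies`. The sketch below assumes this approach; the category values are made up for illustration and need not match those in the school data.

```python
import pandas as pd

# Toy data; the category values below are invented for illustration
df = pd.DataFrame({
    "DENOMINATION": ["Public", "Catholic", "Public", "Protestant"],
    "SPECIES_PO": ["Bo", "Sbo", "Bo", "Bo"],
    "SHARE_HIGH": [0.55, 0.40, 0.62, 0.48],
})

# Replace each categorical column by one indicator column per category
encoded = pd.get_dummies(df, columns=["DENOMINATION", "SPECIES_PO"], dtype=float)
print(encoded.columns.tolist())
```

Each original column is replaced by indicator columns named `<column>_<category>`, so the encoded frame contains only numeric features.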

#### 2.2 Select the 30 most relevant features for prediction. (hint: use the variables correlating the most with the target)
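
The hint can be read as ranking features by the absolute value of their correlation with `SHARE_HIGH`. A minimal sketch on synthetic data, assuming Pearson correlation and using `k = 2` instead of 30 because the toy data has only five features (named `x0` to `x4` here, which are hypothetical):

```python
import numpy as np
import pandas as pd

# Synthetic data: only x0 and x3 actually drive the target
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x{i}" for i in range(5)])
df["SHARE_HIGH"] = 0.8 * df["x0"] - 0.5 * df["x3"] + rng.normal(scale=0.1, size=n)

k = 2  # the assignment asks for 30; 2 is enough for this toy example

# Correlation of every feature with the target, excluding the target itself
corr_with_target = df.corr()["SHARE_HIGH"].drop("SHARE_HIGH")

# Rank by absolute correlation so strong negative relations count as well
top_features = corr_with_target.abs().sort_values(ascending=False).head(k).index.tolist()
print(top_features)
```

Taking the absolute value matters: a feature with a correlation of -0.5 is more informative than one with a correlation of +0.1.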

Now we can create the train and test data sets, which we will use to train our ML models.

#### 2.3 Scale the data using sklearn's StandardScaler

#### 2.4 Split the sample into a train and a test set.
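
Steps 2.3 and 2.4 might look as follows on synthetic data. The feature names, the 80/20 split, and the `random_state` are assumptions made for this sketch, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target standing in for the selected school features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(loc=5, scale=2, size=(100, 3)), columns=["f1", "f2", "f3"])
Y = pd.Series(rng.uniform(size=100), name="SHARE_HIGH")

# Scale every feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

# Hold out 20% of the schools for testing (split ratio is an assumption)
X_train, X_test, Y_train, Y_test = train_test_split(
    X_scaled, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

This follows the order of the tasks above (scale, then split). A common alternative is to split first and fit the scaler on the training set only, so that no information from the test set enters the scaling.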

### **3. Training machine learning models**<br>

To compare different models, we provide you with a custom evaluation function that allows us to swiftly report the following stats for the train and test data:
* mean squared error
* mean absolute error
* R2

In [None]:
def eval_regression_perf(model, X_train, X_test, Y_train, Y_test):
    
    # Make prediction with the trained model
    Y_pred_train = model.predict(X_train)
    Y_pred_test = model.predict(X_test)

    # Create a function that computes the MSE, MAE, and R2
    def perfs(Y,Y_pred):
        mse = mean_squared_error(Y,Y_pred)
        mae = mean_absolute_error(Y,Y_pred)
        R2 = r2_score(Y,Y_pred)
        return mse,mae,R2

    # Apply the perfs function to the train and test data sets
    mse_train, mae_train, r2_train = perfs(Y_train,Y_pred_train)
    mse_test,  mae_test , r2_test  = perfs(Y_test,Y_pred_test)
        
    # Print results
    print('Performance')
    print(f'Mean Squared  Error Train | Test: \t{mse_train:>7.4f}\t|  {mse_test:>7.4f}')
    print(f'Mean Absolute Error Train | Test: \t{mae_train:>7.4f}\t|  {mae_test:>7.4f}')
    print(f'R2                  Train | Test: \t{ r2_train:>7.4f}\t|  {r2_test:>7.4f}\n')
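
To make the three statistics reported by `eval_regression_perf` concrete, the sketch below computes them by hand-checkable arithmetic on three made-up shares. The numbers are purely illustrative.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Three made-up observed and predicted shares
y_true = [0.2, 0.4, 0.6]
y_pred = [0.3, 0.4, 0.5]

# MSE: mean of the squared errors, (0.01 + 0 + 0.01) / 3
mse = mean_squared_error(y_true, y_pred)

# MAE: mean of the absolute errors, (0.1 + 0 + 0.1) / 3
mae = mean_absolute_error(y_true, y_pred)

# R2: 1 - SS_res / SS_tot = 1 - 0.02 / 0.08
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
# prints: MSE=0.0067  MAE=0.0667  R2=0.7500
```

Because the target is a share, the MSE and MAE are expressed on the same 0 to 1 scale, whereas R2 is unitless.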

#### 3.1 Random forest:
- Set the number of trees to 250
- Use the following hyperparameters: 
    - max_depth = 9
    - max_features = 0.5
    - max_leaf_nodes = 30
    - min_samples_leaf = 50
- Print the performance of the random forest
- Visualise the importance of each feature on a barplot.
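
A minimal sketch of this step on synthetic data, using the hyperparameters listed above. The data, the feature names, the split, and the `random_state` are assumptions for illustration; the importances shown are scikit-learn's impurity-based `feature_importances_`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: feature_0 has the strongest effect on the target
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 6)), columns=[f"feature_{i}" for i in range(6)])
Y = 0.5 + 0.1 * X["feature_0"] - 0.05 * X["feature_1"] + rng.normal(scale=0.02, size=1000)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Random forest with the hyperparameters given in the task description
rf = RandomForestRegressor(
    n_estimators=250,
    max_depth=9,
    max_features=0.5,
    max_leaf_nodes=30,
    min_samples_leaf=50,
    random_state=42,
)
rf.fit(X_train, Y_train)
print(f"R2 train | test: {rf.score(X_train, Y_train):.3f} | {rf.score(X_test, Y_test):.3f}")

# Impurity-based importances, sorted so the most important bar ends up on top
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
fig, ax = plt.subplots(figsize=(6, 4))
importances.plot.barh(ax=ax)
ax.set_xlabel("Feature importance")
ax.set_title("Random forest feature importances")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. The importances sum to 1, so each bar can be read as a relative contribution.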

#### 3.2 MLP
- Set the batch size to 250, and max_iter to 2000
- Perform a grid search on the following hyperparameters: 
    - hidden_layer_sizes, values: (20,20),(30,30),(25,25)
    - alpha, values: 5,1,0.1
    - learning_rate_init: 0.01,0.001,0.0001
- Perform the MLP regression with the tuned hyperparameters
- Print the performance of the MLP
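
The grid search above could be set up with `GridSearchCV` as sketched below. The synthetic data, the 3-fold cross-validation, the scoring metric, and the `random_state` are assumptions for this illustration; only the batch size, `max_iter`, and the grid values come from the task description.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic share-like target with a mild non-linearity
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
Y = 0.5 + 0.1 * X[:, 0] - 0.05 * X[:, 1] ** 2 + rng.normal(scale=0.02, size=600)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Hyperparameter grid from the task description (3 x 3 x 3 = 27 combinations)
param_grid = {
    "hidden_layer_sizes": [(20, 20), (30, 30), (25, 25)],
    "alpha": [5, 1, 0.1],
    "learning_rate_init": [0.01, 0.001, 0.0001],
}

mlp = MLPRegressor(batch_size=250, max_iter=2000, random_state=42)

# cv=3 and the scoring metric are choices made for this sketch
search = GridSearchCV(mlp, param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X_train, Y_train)

# With the default refit=True, the best model is retrained on all training data
best_mlp = search.best_estimator_
print("Best hyperparameters:", search.best_params_)
print(f"Test MSE: {mean_squared_error(Y_test, best_mlp.predict(X_test)):.4f}")
```

Because `GridSearchCV` refits the best configuration on the full training set, `best_mlp` can be passed directly to an evaluation function.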

#### 3.3 Gradient boosting
- Use the following hyperparameters:
    - max_depth = 6
    - max_features = 0.5
    - max_leaf_nodes = 30
    - min_samples_leaf = 50
    - learning_rate = 0.003
- Print the performance of the Gradient boosting regression
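
A sketch of the gradient boosting step with the hyperparameters listed above, again on synthetic data. The number of boosting stages is not specified in the task, so the scikit-learn default (`n_estimators=100`) is used here as an assumption, as are the data and the `random_state`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic share-like target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
Y = 0.5 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=0.02, size=1000)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Hyperparameters from the task description; n_estimators is left at its default
gbr = GradientBoostingRegressor(
    max_depth=6,
    max_features=0.5,
    max_leaf_nodes=30,
    min_samples_leaf=50,
    learning_rate=0.003,
    random_state=42,
)
gbr.fit(X_train, Y_train)

# Report performance on both splits to judge generalisation
for name, X_part, Y_part in [("train", X_train, Y_train), ("test", X_test, Y_test)]:
    pred = gbr.predict(X_part)
    print(f"{name}: MSE={mean_squared_error(Y_part, pred):.4f}  R2={r2_score(Y_part, pred):.4f}")
```

A small learning rate shrinks the contribution of each tree, so it usually needs to be paired with a larger number of boosting stages; how many is a tuning decision.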

#### 3.4 Ensemble model
- Create 3 ensemble models from the 3 models with the tuned hyperparameters for each model
- model 1: Random forest, MLP
- model 2: Random forest, GBR
- model 3: MLP, GBR<br>

Print the performance of each ensemble model
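
One way to build the three pairwise ensembles is scikit-learn's `VotingRegressor`, which averages the predictions of its members. The sketch below assumes this approach and uses synthetic data; the MLP hyperparameters are placeholders for whatever the grid search selects.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic share-like target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
Y = 0.5 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=0.02, size=1000)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Base models; the MLP settings are placeholders for the tuned values
base = {
    "rf": RandomForestRegressor(n_estimators=250, max_depth=9, max_features=0.5,
                                max_leaf_nodes=30, min_samples_leaf=50, random_state=42),
    "mlp": MLPRegressor(hidden_layer_sizes=(20, 20), alpha=0.1, learning_rate_init=0.001,
                        batch_size=250, max_iter=2000, random_state=42),
    "gbr": GradientBoostingRegressor(max_depth=6, max_features=0.5, max_leaf_nodes=30,
                                     min_samples_leaf=50, learning_rate=0.003, random_state=42),
}

# The three pairs requested in the task
pairs = [("rf", "mlp"), ("rf", "gbr"), ("mlp", "gbr")]

ensembles, test_mse = {}, {}
for a, b in pairs:
    # VotingRegressor clones and refits its members, so base models can be reused
    ens = VotingRegressor([(a, base[a]), (b, base[b])])
    ens.fit(X_train, Y_train)
    name = f"{a}+{b}"
    ensembles[name] = ens
    test_mse[name] = mean_squared_error(Y_test, ens.predict(X_test))
    print(f"{name}: test MSE = {test_mse[name]:.5f}")
```

Each fitted ensemble exposes `predict`, so it can be evaluated in the same way as the individual models.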

### **4. Substantive results**

#### 4.1 Predict the expected share of school leavers with an advice for higher education using the model with the best generalisation performance.

#### 4.2 Compute the difference between the expected SHARE_HIGH and the actual SHARE_HIGH. Which 10 schools would you recommend the ministry of education to assess in-depth with priority?
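
The comparison in 4.2 can be sketched on toy numbers as follows. The column names `SCHOOL_NAME`, `EXPECTED_SHARE_HIGH`, and `GAP` are hypothetical, the values are made up, and the example selects the top 2 rather than the top 10 because it contains only five schools.

```python
import pandas as pd

# Made-up observed and model-expected shares for five hypothetical schools
results = pd.DataFrame({
    "SCHOOL_NAME": ["S1", "S2", "S3", "S4", "S5"],
    "SHARE_HIGH": [0.50, 0.30, 0.70, 0.45, 0.60],
    "EXPECTED_SHARE_HIGH": [0.55, 0.50, 0.65, 0.48, 0.40],
})

# Positive gap: the school delivers a lower share than the model expects
results["GAP"] = results["EXPECTED_SHARE_HIGH"] - results["SHARE_HIGH"]

# Schools with the largest shortfall relative to expectation come first
priority = results.nlargest(2, "GAP")
print(priority[["SCHOOL_NAME", "SHARE_HIGH", "EXPECTED_SHARE_HIGH", "GAP"]])
```

Whether a large gap truly signals poor school performance, rather than model error or omitted factors, is the kind of question 4.3 asks you to reflect on.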

#### 4.3 Reflect on the meaningfulness and limitations of your analysis

### **5. Extra funding for NOAT2 students**

#### 5.1 Explain why the question of the ministry regarding the adequacy of the current funding for NOAT2 **cannot** be answered using ML (or at least not the ML taught in this course), and the provided data. <br>