# Project Report
### Magdalena Barros, Paula Cadena, Michael Rosenbaum | CAPP 30254: Machine Learning for Public Policy 

We design a data flow to implement a Proxy Means Test (PMT) that classifies poverty status into 4 categories. These types of tests are common in international development, where readily available information on household composition and assets can proxy income to target social programs ([Grosh & Baker, 2013](https://elibrary.worldbank.org/doi/abs/10.1596/0-8213-3313-5)). This work uses a [Kaggle Dataset](https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction/data) including partially cleaned data from the Inter-American Development Bank (IDB) on 2,988 Costa Rican households.

This data is highly imbalanced and includes a number of repeated and conditional measures. Prior to tuning candidate models, we first reduce the dataset to a foundational set of durable asset, household composition, and housing features and then use synthetic oversampling techniques to create a dataset balanced on our poverty labels.

Based on a grid search of hyperparameters across penalized multinomial regressions, random forests, and KNN, we decide on **[FINAL MODEL]/!\TODO/!** to predict poverty status. We achieve **[XY]%/!\TODO/!** accuracy on our training sample. Our final model prioritizes recall, the share of households truly in a category that the model identifies as such, so that anti-poverty programs can reach the most relevant populations.

## 1 | Data Set-up

Our data exploration has focused on understanding the underlying dataset. The data comes from a [2017 ILO survey administered](https://webapps.ilo.org/surveyLib/index.php/catalog/7230/related-materials) by the Inter-American Development Bank (IDB). It is drawn from a nationally representative household survey and includes a subset of household- and individual-level variables that are cleaned by the IDB.

The initial Kaggle competition was meant to provide the IDB with different approaches to apply PMTs to social programs in low- and medium-income countries. As such, the data provided is relatively clean compared to the original survey data, but also includes a number of variables that would not be useful for prediction, are created by the IDB, or would introduce challenges in training a classifier. We handle three specific characteristics of the data in our data preparation: missingness, label imbalance, and repeated measures.

### 1.1 | Missingness

Three variables have a significant amount of missing values:

1. Monthly rent payment (`v2a1`): We replace NaN values with zero in cases where houses are fully paid or non-rent paying and retain rent status in each mode.

1. Number of tablets owned (`v18q1`): We combine the indicator variable for owning a tablet with the number of tablets owned to produce a non-missing variable on number of tablets owned.
    
1. Grades behind (`rez_esc`): We do not include this in the household-level model we predict from, but leave it in the individual-level dataset to produce features from as needed.
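
The first two rules can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the project's pipeline code: the choice of `tipovivi1`, `tipovivi4`, and `tipovivi5` as the flags marking fully paid or non-rent-paying homes, and of `v18q` as the tablet ownership indicator, is assumed here, and `clean_missing` is a hypothetical helper name.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_missing(row):
    """Return a copy of one household record with the two rules applied."""
    out = dict(row)

    # Rule 1: rent is zero when the home is fully paid or no rent is paid.
    # Which tipovivi* flags mean "no rent" is an assumption of this sketch.
    no_rent_flags = ("tipovivi1", "tipovivi4", "tipovivi5")
    pays_no_rent = any(out.get(flag) == 1 for flag in no_rent_flags)
    if is_missing(out.get("v2a1")) and pays_no_rent:
        out["v2a1"] = 0.0

    # Rule 2: a household reporting no tablet (v18q == 0) owns zero tablets.
    if is_missing(out.get("v18q1")) and out.get("v18q") == 0:
        out["v18q1"] = 0

    return out
```

A renting household with a missing rent value is left untouched by this sketch, so any remaining gaps stay visible for later handling.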

### 1.2 | Outcome Distribution

The outcome distribution is also heavily skewed. Even when the outcome is subset to the household level, nearly 70% of the observations are in the highest income category.

![](https://github.com/m-rosenbaum/cr_pmt/blob/2b4d9f89c9460ca2055c4e757bb03cbc6cb75c74/assets/label_skew.png)

As we intend to use some classifiers, such as KNN, that would require weighting to produce accurate classification measures in minority categories, we introduce synthetic oversampling to create a training set balanced on labels. To do so, we use the following procedure:

1. Split the sample into an 80:20 train:test split.
2. Synthetically oversample the 2,390 training households into 6,244 household observations, with each label making up 25% of the data. We use the SMOTE method for two reasons: the minority classes have a large enough N to produce a sufficient spread of synthetic data, and it is easy to implement in Python ([Chawla et al., 2002](https://doi.org/10.1613/jair.953)).
3. Train and tune our models on the training set.
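
To make the oversampling step concrete, the sketch below shows the core idea behind SMOTE in plain Python: each synthetic point is placed at a random position on the segment between a minority-class point and one of its nearest same-class neighbors. It is a simplified illustration, not the implementation used in this project; `smote_like` and `balance` are hypothetical names, and the sketch assumes every class has at least two observations. With the numbers above, balancing means 6,244 / 4 = 1,561 observations per label.

```python
import random


def smote_like(points, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between same-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Rank the other points by squared Euclidean distance to the base point.
        # (Re-sorting on every draw is slow but keeps the sketch short.)
        others = [p for p in points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


def balance(features_by_label, seed=0):
    """Oversample every class up to the size of the largest class."""
    target = max(len(rows) for rows in features_by_label.values())
    balanced = {}
    for label, rows in features_by_label.items():
        extra = smote_like(rows, target - len(rows), seed=seed)
        balanced[label] = rows + extra
    return balanced
```

Because every synthetic row lies between two observed rows of the same label, the method adds no information beyond the observed range of each minority class.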

Future efforts could include different approaches to weighting and oversampling.

### 1.3 | Repeated measures

We remove a number of features based on repeated measurements of the same underlying construct or based on the distribution of the outcome variable.

- Drop derived features calculated by the IDB: `edjefe`, `edjefa`, `area2`, `r4h1`, `r4h2`, `r4h3`, `r4m1`, `r4m2`, `r4m3`, `r4t1`, `r4t2`, `overcrowding`, `tamhog`, `tamviv`, `male`, `hogar_total`, `dependency`, `meaneduc`, `SQBescolari`, `SQBage`, `SQBhogar_total`, `SQBedjefe`, `SQBhogar_nin`, `SQBovercrowding`, `SQBdependency`, `SQBmeaned`, `agesq`, `techozinc`, `techoentrepiso`, `techocane`, `techootro`.
- Retain `hhsize` and drop `tamhog`, `r4t3`, `tamviv`, as `hhsize` is calculated by the survey software based on the household roster, whereas the others do not align with the counts from the household composition variables (age, sex, etc.).
- Collapse individual-level categorical variables that have percentages less than 5%: `estadocivil`, `instlevel`, `parentesco`. 5% is chosen arbitrarily as a small proportion of the dataset.
- Collapse household-level categorical variables that have percentages less than 5%, including durable assets: `piso`, `pared`, `techo`, `sanitario`, `elimbasu`, `tipovivi`. 5% is chosen arbitrarily as a small proportion of the dataset.

This leaves us with 82 features in our "foundational model" before we develop any additional features.
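
The 5% collapsing rule can be sketched as follows. This is a minimal illustration on a single categorical column, assuming the categories have already been decoded from their indicator columns; `collapse_rare` and the `"other"` label are hypothetical names, not taken from the project code.

```python
from collections import Counter


def collapse_rare(values, threshold=0.05, other="other"):
    """Replace categories whose share of observations falls below threshold."""
    counts = Counter(values)
    total = len(values)
    rare = {category for category, n in counts.items() if n / total < threshold}
    return [other if value in rare else value for value in values]
```

For example, in a column where one category covers 3% of households, that category is relabeled while the larger categories are left as they are.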


### 1.4 | Final Data Preparation Pipeline

Our final pipeline can be summarized as follows:

1. Clean the survey data into a standardized format, based on content-agnostic decisions.
    1. Clean missing values.
    2. Clean individual-level categorical variables.
    3. Clean household-level categorical variables.
    4. Create features as needed.
    5. Drop extraneous variables.
    6. Collapse to the household-level on `idhogar`.
1. Conduct an 80 / 20 train-test split.
1. Synthetically oversample the training data prior to cross-validation.
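
The collapse to the household level in step 1 can be sketched as a group-by on `idhogar`. The sketch below is illustrative only: it assumes each individual record carries an `age` field, and the output features `n_members`, `n_children`, and `mean_age` are hypothetical examples rather than the features built in this project.

```python
from collections import defaultdict


def collapse_to_household(individuals, key="idhogar"):
    """Aggregate individual-level rows into one row per household."""
    groups = defaultdict(list)
    for person in individuals:
        groups[person[key]].append(person)

    households = []
    for household_id, members in groups.items():
        ages = [member["age"] for member in members]
        households.append({
            key: household_id,
            "n_members": len(members),                   # hypothetical feature
            "n_children": sum(age < 18 for age in ages),  # hypothetical feature
            "mean_age": sum(ages) / len(ages),            # hypothetical feature
        })
    return households
```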

## 3 | Model Tuning

We tune three types of models based on the content covered in this course, with the aim of applying each of them in the PMT context and exploring a variety of hyperparameter values. This section first summarizes each of the models used, including why they may be applicable for this task, and then describes the cross-validation procedure we conduct.

### 3.1 | Model choice

The Costa Rican PMT context requires classification into 4 ordinal classes. This limits the set of models we can include slightly, but we still focus on the key models discussed in the course, outside of Neural Nets and Perceptrons:

1. **K-Nearest Neighbors** (12): Class labels are assigned based on a vote from *K* neighbors under a given distance function. As our features are standardized and neighbors normalized, we do not weight different features differently. Instead, we explore two sets of hyperparameters across 12 different models:
    1. *K* (number of nearest neighbors): We explore different sizes of the neighborhood, including {3, 5, 10, 25, 50, 100}. We expect that larger neighborhoods will perform better on the test data due to the amount of variance in the data and the risk of overfitting on smaller neighborhood sizes.
    1. *Distance weighting*: We vary between unweighted and distance-weighted votes.
2. **Penalized multinomial regression** (16): The scikit-learn logistic regression classifier allows us to estimate multinomial logit models for multiple categories. This gives us the probability that any input observation belongs to each of our labels, and we predict the label with the highest probability. Our primary concern was estimation: we use the `liblinear` solver in `scikit-learn` as it can solve both $L_1$ and $L_2$ penalties and was both quicker and more likely to converge across our hyperparameter tuples in our testing. Note that `liblinear` fits multiclass problems one-vs-rest rather than with a true multinomial loss.
    1. *Penalty form*: We consider both $L_1$ and $L_2$ penalties to handle the sparseness of the survey data quickly -- most variables are indicator variables for multiple-response option variables. 
    1. *$\lambda$*: We vary the penalty amount over powers of 10 to explore a large space of penalty terms, with the aim to explore parsimonious and complex models. We cover the following penalty values: {0.01, 0.1, 1, 10, 100, 1000, 10000, 100000}.
3. **Random forests** (144): We estimate a variety of random forest models using `scikit-learn`'s default classifier across a variety of hyperparameters. As with the other models, we are concerned about overfitting on sparse and noisy data. We focus primarily on controlling how parsimonious each tree is and how much the trees in the forest differ from one another, to push towards a simpler model given the variance in the input data. We choose a Gini split criterion based on research suggesting that Gini is quicker and is unlikely to matter for accuracy of prediction ([Raileanu & Stoffel, 2004](https://link.springer.com/article/10.1023/b:amai.0000018580.96245.c6)). Due to the number and plausible range of the hyperparameters for random forests, we only include a small candidate set that aims to explore a larger hyperparameter space, but likely misses the best-performing region.
    1. *Number of trees*: We use forests of {25, 50, 100} trees.
    1. *Maximum depth of trees*: We allow trees to grow to full depth or keep them relatively short, only allowing 5 splits (32 leaves averaging 65 observations). We test the following values: {100% features (82), 50% features (41), 25% features (20), 5 features}.
    1. *Minimum observations per leaf*: We allow the trees to end on leaves with relatively small proportions of the sample included, ranging from a single observation to 50: {1, 5, 10, 50}.
    1. *Percent of observations sampled*: We include various resampling rates, ranging between 25% and 75%: {0.25, 0.50, 0.75}.

This results in a total of 172 models to estimate. If this were a larger project, we would consider more models, including a Neural Net, as well as more expansive searches across the hyperparameter space. However, we did not discuss multiclass perceptrons, so we do not implement them in this project. Furthermore, we would likely conduct our hyperparameter search more intentionally, such as by choosing hyperparameters through gradient descent or exploring promising models in more detail after a first pass over the candidate set.
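
The count of 172 candidate models can be checked by enumerating the grids. In the sketch below the dictionary keys loosely mirror `scikit-learn` parameter names, but they are assumptions of this illustration: in particular, `strength` is a placeholder for the penalty values listed above, and reading the four depth settings as `max_depth` values of 82, 41, 20, and 5 is our interpretation of that list.

```python
from itertools import product

# Candidate values as listed in Section 3.1; key names are illustrative.
knn_grid = {
    "n_neighbors": [3, 5, 10, 25, 50, 100],
    "weights": ["uniform", "distance"],
}
logit_grid = {
    "penalty": ["l1", "l2"],
    "strength": [0.01, 0.1, 1, 10, 100, 1000, 10000, 100000],
}
forest_grid = {
    "n_estimators": [25, 50, 100],
    "max_depth": [82, 41, 20, 5],
    "min_samples_leaf": [1, 5, 10, 50],
    "max_samples": [0.25, 0.50, 0.75],
}


def expand(grid):
    """List every combination of hyperparameter values in a grid."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]


grids = {"knn": knn_grid, "logit": logit_grid, "forest": forest_grid}
counts = {name: len(expand(grid)) for name, grid in grids.items()}
total = sum(counts.values())
print(counts, total)
```

The per-family counts match the numbers in parentheses above: 12 KNN models, 16 penalized regressions, and 144 random forests.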

### 3.2 | Cross-validation

To evaluate the performance of our models, we use K-fold cross-validation with 5 folds. This involves repeatedly splitting our training data into 80/20 train-validation splits, evaluating the model on the held-out fold, and then averaging the estimates across the 5 folds. Since we oversample our data using synthetic data, we follow [Damien Martin's procedure](https://kiwidamien.github.io/how-to-do-cross-validation-when-upsampling-data.html) to avoid data leakage that would bias our training set estimates. We evaluate models on the macro F1 score, the unweighted average across classes of each class's F1 score, which is itself the harmonic mean of that class's precision and recall. Compared to overall prediction accuracy, this lets us focus on avoiding false negatives without optimizing completely for recall.
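
As a worked illustration of the macro F1 score, the sketch below computes per-class precision, recall, and F1 from true and predicted labels and then takes the unweighted mean across classes. `macro_f1` is a hypothetical helper written for this example, and it assigns a score of zero to any class where precision or recall is undefined, which is an assumption of the sketch.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally to the average, poor performance on a small poverty category lowers the score as much as poor performance on the majority category, which overall accuracy would not show.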

Since we've kept the candidate set of models at a reasonable size, we're able to conduct a grid search over all combinations of models and parameters to select the best performing model in the candidate set.
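
One common way to avoid leakage when combining oversampling with cross-validation is to oversample only the training portion of each fold, so that no synthetic point derived from a validation row is seen during fitting. The sketch below illustrates that pattern under stated assumptions: `fit`, `score`, and `oversample` are caller-supplied placeholder functions, and the fold construction is a simple shuffled partition rather than the splitter used in this project.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs that partition range(n) into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, valid in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid


def cross_validate(X, y, fit, score, oversample, k=5, seed=0):
    """Average a score over k folds, oversampling the training part only."""
    scores = []
    for train, valid in k_fold_indices(len(X), k, seed):
        # Synthetic rows are created from training rows only.
        X_tr, y_tr = oversample([X[i] for i in train], [y[i] for i in train])
        model = fit(X_tr, y_tr)
        # The validation fold stays untouched: real observations only.
        X_va, y_va = [X[i] for i in valid], [y[i] for i in valid]
        scores.append(score(model, X_va, y_va))
    return sum(scores) / len(scores)
```

Scoring on untouched validation folds means the cross-validated estimate reflects the original label imbalance, which is closer to what the held-out test set will look like.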

### 3.3 | Results

## 4 | Selection and Results

As mentioned above, we select a **TODO: MODEL DESC**. 

### 4.1 | Training set accuracy

When trained on the full training sample (complete with oversampling from 2,390 households to 6,244), the performance is as follows:

| Accuracy | Recall | Precision | F1 Macro |
|----------|---------|---------|-----------|
| X% | Y% | Z% | W% |

It results in the following confusion matrix:

**TODO: Image**

**TODO: Conclusions from the confusion matrix**

### 4.2 | Test set accuracy

We also evaluate the model on the held-out test set of 598 households that we set aside before training. The performance is as follows:

| Accuracy | Recall | Precision | F1 Macro |
|----------|---------|---------|-----------|
| X% | Y% | Z% | W% |

It results in the following confusion matrix:

**TODO: Image**

**TODO: Conclusions from the confusion matrix**

## 5 | Conclusion