# Project Report
### Magdalena Barros, Paula Cadena, Michael Rosenbaum | CAPP 30254: Machine Learning for Public Policy 

We design a data flow to implement a Proxy Means Test (PMT) to classify poverty status into 4 categories. These types of tests are common in international development, where readily available information on household composition and assets can proxy income to target social programs ([Grosh & Baker, 2013](https://elibrary.worldbank.org/doi/abs/10.1596/0-8213-3313-5)). This work uses a [Kaggle Dataset](https://www.kaggle.com/competitions/costa-rican-household-poverty-prediction/data) including partially cleaned data from the Inter-American Development Bank (IDB) on 2,988 Costa Rican households.

This data is highly imbalanced and includes a number of repeated and conditional measures. Prior to tuning candidate models, we first reduce the dataset to a foundational set of durable asset, household composition, and housing features, and then use synthetic oversampling techniques to create a dataset balanced on our poverty labels.

Based on a grid search of hyperparameters across penalized multinomial regressions, random forests, and KNN, we decide on [FINAL MODEL]/!\TODO/!\ to predict poverty status. We achieve [XY]%/!\TODO/!\ accuracy on our training sample. Our final model prioritizes recall, the share of households truly in a poverty category that the model correctly identifies, so that anti-poverty programs can reach the most relevant populations.
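To make the recall criterion concrete, the snippet below is a minimal sketch of per-class and macro-averaged recall. The function name `per_class_recall` and the toy labels are illustrative assumptions, not project code or project results.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correct predictions / actual members of the class."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in actual.items()}

# Toy labels for illustration only (not results from this project)
y_true = [1, 1, 2, 3, 4, 4, 4, 4]
y_pred = [1, 4, 2, 4, 4, 4, 4, 3]

recalls = per_class_recall(y_true, y_pred)
# Macro recall weights every poverty category equally, regardless of its size
macro_recall = sum(recalls.values()) / len(recalls)
```

In this toy case overall accuracy is 5/8, but label 3 has a recall of zero, which is the kind of failure that accuracy alone can hide on a skewed outcome.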

## 1 | Data Set-up

Our data exploration has focused on understanding the underlying dataset. The data comes from a [2017 ILO survey](https://webapps.ilo.org/surveyLib/index.php/catalog/7230/related-materials) administered by the Inter-American Development Bank (IDB). It is drawn from a nationally representative household survey and includes a subset of household- and individual-level variables that are cleaned by the IDB.

The initial Kaggle competition was meant to provide the IDB with different approaches to apply PMTs to social programs in low- and middle-income countries. As such, the data provided is relatively clean compared to the original survey data, but it also includes a number of variables that would not be useful for prediction, are created by the IDB, or would complicate training a classifier. We handle three types of specific characteristics in our data preparation: missingness, label imbalance, and repeated measures.

### 1.1 | Missingness

Three variables have a significant amount of missing values:

1. Monthly rent payment (`v2a1`): We replace NaN values with zero in cases where houses are fully paid or the household does not pay rent, and retain rent status for each type of tenure.

1. Number of tablets owned (`v18q1`): We combine the indicator variable for owning a tablet with the number of tablets owned to produce a non-missing variable on number of tablets owned.
    
1. Grades behind (`rez_esc`): We do not include this in the household-level model we predict from, but leave it in the individual-level dataset to produce features from as needed.
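The first two rules can be sketched in pandas as follows. This is a minimal illustration rather than our exact cleaning code: the helper `clean_missing`, its `no_rent_cols` parameter, and the choice of `tipovivi1`, `tipovivi4`, and `tipovivi5` as the "no rent" tenure indicators are assumptions, as is the use of `v18q` as the tablet-ownership indicator.

```python
import numpy as np
import pandas as pd

def clean_missing(df, no_rent_cols=("tipovivi1", "tipovivi4", "tipovivi5")):
    """Fill structurally missing rent and tablet counts with zero."""
    out = df.copy()
    # Assumption: these tenure dummies mark households that pay no rent
    pays_no_rent = out[list(no_rent_cols)].eq(1).any(axis=1)
    # Only fill rent where it is missing AND the household pays no rent
    out.loc[out["v2a1"].isna() & pays_no_rent, "v2a1"] = 0
    # A missing tablet count with v18q == 0 means the household owns none
    out.loc[out["v18q1"].isna() & out["v18q"].eq(0), "v18q1"] = 0
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "v2a1": [np.nan, 120000.0, np.nan],
    "tipovivi1": [1, 0, 0],
    "tipovivi3": [0, 1, 1],
    "tipovivi4": [0, 0, 0],
    "tipovivi5": [0, 0, 0],
    "v18q": [0, 1, 0],
    "v18q1": [np.nan, 2.0, np.nan],
})
cleaned = clean_missing(toy)
```

The third toy household rents but has no recorded payment, so its `v2a1` stays missing instead of being silently set to zero.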

### 1.2 | Outcome Distribution

The outcome distribution is also heavily skewed. Even when the outcome is subset to the household level, nearly 70% of the observations are in the highest income category.

![](https://github.com/m-rosenbaum/cr_pmt/blob/2b4d9f89c9460ca2055c4e757bb03cbc6cb75c74/assets/label_skew.png)

As we intend to use some classifiers, such as KNN, that would require weighting to produce accurate classification measures in minority categories, we introduce synthetic oversampling to create a training set balanced on labels. To do so, we use the following procedure:

1. Split the sample into an 80:20 train:test split.
2. Synthetically oversample the 2,390 training households into 6,244 household observations, with each label making up 25% of the data. We use the SMOTE method for two reasons: the minority classes have a large enough N to produce a sufficient spread of synthetic data, and it is easy to implement in Python ([Chawla et al., 2002](https://doi.org/10.1613/jair.953)).
3. Train and tune our models on the training set.
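The procedure above can be sketched as follows. This is a simplified, numpy-only illustration of the SMOTE idea (interpolating between a minority observation and one of its nearest same-class neighbours), not a library implementation and not our exact code. In practice, a packaged implementation such as the `SMOTE` class in `imbalanced-learn` would be the natural choice. The function names, `k=5`, the unstratified split, and the toy data are all assumptions.

```python
import numpy as np

def train_test_split_80_20(X, y, rng):
    """Shuffle rows and hold out 20% as a test set."""
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    return X[idx[:cut]], X[idx[cut:]], y[idx[:cut]], y[idx[cut:]]

def smote_like_oversample(X, y, rng, k=5):
    """Grow every class to the majority-class size by interpolating between
    a sampled point and one of its k nearest same-class neighbours."""
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for label, count in zip(labels, counts):
        n_new = target - count
        X_c = X[y == label]
        if n_new == 0 or len(X_c) < 2:
            continue
        # Pairwise distances within the class (fine for small data; O(n^2) memory)
        dist = np.linalg.norm(X_c[:, None, :] - X_c[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
        neighbours = np.argsort(dist, axis=1)[:, : min(k, len(X_c) - 1)]
        base = rng.integers(0, len(X_c), size=n_new)
        picked = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
        gap = rng.random((n_new, 1))  # position along the segment between the two points
        X_parts.append(X_c[base] + gap * (X_c[picked] - X_c[base]))
        y_parts.append(np.full(n_new, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy data skewed toward label 4 (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.repeat([1, 2, 3, 4], [20, 30, 30, 120])

X_train, X_test, y_train, y_test = train_test_split_80_20(X, y, rng)
# Oversample the training set only, so the test set keeps the original skew
X_bal, y_bal = smote_like_oversample(X_train, y_train, rng)
```

Oversampling after the split keeps synthetic observations out of the held-out test set.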

Future efforts could include different approaches to weighting and oversampling.

### 1.3 | Repeated measures

We remove a number of features based on repeated measurements of the same underlying construct or based on the distribution of the outcome variable.

- Drop features calculated by the IDB: `edjefe`, `edjefa`, `area2`, `r4h1`, `r4h2`, `r4h3`, `r4m1`, `r4m2`, `r4m3`, `r4t1`, `r4t2`, `overcrowding`, `tamhog`, `tamviv`, `male`, `hogar_total`, `dependency`, `meaneduc`, `SQBescolari`, `SQBage`, `SQBhogar_total`, `SQBedjefe`, `SQBhogar_nin`, `SQBovercrowding`, `SQBdependency`, `SQBmeaned`, `agesq`, `techozinc`, `techoentrepiso`, `techocane`, `techootro`.
- Retain `hhsize` and drop `tamhog`, `r4t3`, and `tamviv`, as `hhsize` is calculated by the survey software based on the household roster, whereas the others are not aligned with the counts from the household composition variables (age, sex, etc.).
- Collapse categories of individual-level categorical variables that account for less than 5% of observations: `estadocivil`, `instlevel`, `parentesco`. 5% is chosen arbitrarily as a small proportion of the dataset.
- Collapse categories of household-level categorical variables, including durable assets, that account for less than 5% of observations: `piso`, `pared`, `techo`, `sanitario`, `elimbasu`, `tipovivi`. 5% is chosen arbitrarily as a small proportion of the dataset.

This leaves us with 82 features in our "foundational-model" before we develop any additional features.
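The 5% collapsing rule can be sketched as follows, assuming each categorical variable is stored as a group of one-hot columns that share a prefix (for example, columns beginning with `piso`). The helper `collapse_rare_dummies`, the `<prefix>_other` column name, and the toy shares are hypothetical and for illustration only.

```python
import pandas as pd

def collapse_rare_dummies(df, prefix, threshold=0.05):
    """Merge one-hot columns starting with `prefix` whose share of 1s is
    below `threshold` into a single `<prefix>_other` column."""
    out = df.copy()
    cols = [c for c in out.columns if c.startswith(prefix)]
    rare = [c for c in cols if out[c].mean() < threshold]
    if rare:
        # A row belongs to "other" if it falls in any of the rare categories
        out[f"{prefix}_other"] = out[rare].max(axis=1)
        out = out.drop(columns=rare)
    return out

# Toy floor-material dummies for 100 households (illustrative shares)
toy = pd.DataFrame({
    "pisomoscer":  [1] * 60 + [0] * 40,
    "pisocemento": [0] * 60 + [1] * 37 + [0] * 3,
    "pisonatur":   [0] * 97 + [1] * 2 + [0] * 1,
    "pisootro":    [0] * 99 + [1] * 1,
})
collapsed = collapse_rare_dummies(toy, "piso")
```

Here the two categories below 5% are merged into `piso_other`, which reduces the feature count without dropping any households.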


### 1.4 | Final Pipeline

Our final pipeline can be summarized as follows:

1. Clean the survey data into a standardized format, based on content-agnostic decisions.
    1. Clean missing values.
    2. Clean individual-level categorical variables.
    3. Clean household-level categorical variables.
    4. Create features as needed.
    5. Drop extraneous variables.
    6. Collapse to the household-level on `idhogar`.
1. Conduct an 80 / 20 train-test split.
1. Synthetically oversample the training data prior to cross-validation.
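The collapse to the household level on `idhogar` could look like the sketch below. The aggregation rules are assumptions made for illustration (first value for variables that are constant within a household, mean for individual-level variables), and the helper `collapse_to_household`, the `mean_` prefix, and the toy rows are hypothetical.

```python
import pandas as pd

def collapse_to_household(people, household_cols, individual_cols):
    """Return one row per household keyed on `idhogar`."""
    grouped = people.groupby("idhogar")
    # Household-level variables are constant within a household: keep the first value
    hh = grouped[household_cols].first()
    # Individual-level variables are summarized; the mean is one possible choice
    indiv = grouped[individual_cols].mean().add_prefix("mean_")
    return hh.join(indiv).reset_index()

# Toy individual-level rows for two households (illustrative only)
people = pd.DataFrame({
    "idhogar": ["a", "a", "b"],
    "Target": [4, 4, 2],
    "hhsize": [2, 2, 1],
    "age": [40, 10, 65],
})
households = collapse_to_household(people, ["Target", "hhsize"], ["age"])
```

Other summaries (maximum, counts by category, or values for the household head only) would fit the same pattern by swapping the aggregation function.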

## 3 | Model Tuning

--- TODO ---- MR to keep writing

### 3.1 | Model choice

We choose 3 types of models to consider:

1. K-Nearest Neighbors classifiers:
2. Penalized multinomial regression:
3. Random forests:

### 3.2 | Cross-validation

### 3.3 | Results

## 4 | Selection and Results

### 4.1 | Selected model

### 4.2 | Results

## 5 | Conclusion