## HW 3: Predicting whether a DonorsChoose.org project will be funded within 60 days

#### Cecile Murray

### Overview

DonorsChoose.org is a website that helps connect teachers in need of financial support for educational projects with a network of donors willing to fund those projects. Donors give any amount they choose; once a project is fully funded, the requested materials are delivered. I analyzed past project funding data to build a model predicting whether or not a given project will be fully funded within 60 days of being posted on DonorsChoose. This prediction will allow projects at risk of being funded slowly to be identified so that we can promote those projects or otherwise intervene to help them be funded more quickly.

I find that tree models perform better than linear models at this prediction task. In particular, given the goal of targeting 5 percent of projects for interventions that will help them get funded more quickly, I recommend proceeding with a random forest model composed of many estimators of limited depth.

### Data 

I built predictive models using data on individual project listings posted on DonorsChoose between January 1, 2012 and December 31, 2013. Each listing includes some basic identifying information, geographic identifiers, and some fields describing the project's attributes. It also includes the date on which the project was posted on DonorsChoose and, if it was fully funded, the date on which it received full funding. From these two dates, I computed the time between posting and funding, then flagged whether that time was greater than 60 days and used the result as my outcome variable.
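The outcome construction described above can be sketched roughly as follows. The column names and toy dates are hypothetical, and treating never-funded projects as "not funded within 60 days" is an assumption of this sketch rather than something the data description states.

```python
import pandas as pd

# Hypothetical column names and toy rows; the real field names may differ.
projects = pd.DataFrame({
    "date_posted": ["2012-01-05", "2012-03-01", "2013-06-10"],
    "date_fully_funded": ["2012-02-01", "2012-06-15", None],
})
posted = pd.to_datetime(projects["date_posted"])
funded = pd.to_datetime(projects["date_fully_funded"])

# Days between posting and full funding; NaN when the project was never funded.
projects["days_to_fund"] = (funded - posted).dt.days

# Outcome: 1 if funding took more than 60 days or never happened (assumption).
projects["not_funded_in_60_days"] = (
    projects["days_to_fund"].isna() | (projects["days_to_fund"] > 60)
).astype(int)
```

Here the first project is funded in 27 days, the second takes longer than 60 days, and the third is never funded.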

#### Predictor variables

I use the following features in order to generate predictions. 

* The number of students reached
* The total price of the project
* Whether the project is eligible for a match donation from a high-profile corporate sponsor
* The type of resource being requested: for example, technology or books
* The primary academic focus area of the project
* The specific primary academic subject of the project
* The secondary focus area of the project
* The specific secondary academic subject of the project
* The grade level of the students who will be helped by the project
* Whether or not the teacher lists their title as doctor, indicating that they hold an advanced degree
* Whether or not the teacher lists their salutation as "Ms." or "Mrs.", indicating that they are female
* State in which the school is located
* Whether or not the school is located in one of the nation's largest cities, namely Los Angeles, Chicago, Houston, Brooklyn, the Bronx, or New York
* The relative poverty level of the school, as measured by the share of students receiving free or reduced-price lunch

I replaced missing values in the number of students reached field with the average among funded projects in the training set, reasoning that donors will have a sense of the typical number of students reached from other projects posted around the same time. I also added a feature flagging that the value was missing, in case donors respond differently to projects that omit it.
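A minimal sketch of this imputation step is shown below. The column names are hypothetical, and it assumes that "funded" means the positive outcome label in the training set. The fill value comes from the training data only, so the test set does not influence it.

```python
import pandas as pd

# Hypothetical column names and toy values.
train = pd.DataFrame({
    "students_reached": [20.0, None, 40.0, 30.0],
    "funded_in_60_days": [1, 1, 1, 0],
})
test = pd.DataFrame({"students_reached": [None, 25.0]})

# Average over funded training projects only (missing values are skipped).
fill_value = train.loc[train["funded_in_60_days"] == 1, "students_reached"].mean()

for df in (train, test):
    # Record missingness before filling so the signal is not lost.
    df["students_reached_missing"] = df["students_reached"].isna().astype(int)
    df["students_reached"] = df["students_reached"].fillna(fill_value)
```

Applying the same training-set fill value to the test set mirrors how the model would be used on newly posted projects.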

I normalized the number of students and the total price, since differences in scale affect the predictions made by certain classes of models. For similar reasons, I converted categorical variables, such as the type of resource, into a series of binary dummy variables.

Importantly, I did not use the field indicating when the project was fully funded, because this would by definition allow the model to perfectly predict whether or not a project would be funded within the time window.
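The normalization and dummy-variable steps might look roughly like the sketch below. The memo does not say which normalization was used, so z-score scaling with training-set statistics is an assumption here, as are the column names and category values.

```python
import pandas as pd

# Hypothetical column names and toy values.
train = pd.DataFrame({
    "total_price": [100.0, 300.0, 500.0],
    "resource_type": ["Books", "Technology", "Books"],
})
test = pd.DataFrame({
    "total_price": [300.0, 700.0],
    "resource_type": ["Technology", "Supplies"],
})

# Standardize using the training mean and standard deviation only (assumed method).
mean, std = train["total_price"].mean(), train["total_price"].std()
for df in (train, test):
    df["total_price_scaled"] = (df["total_price"] - mean) / std

# One binary column per category; test columns are aligned to the training columns,
# so categories unseen in training are dropped and absent ones are filled with 0.
train_X = pd.get_dummies(train, columns=["resource_type"])
test_X = pd.get_dummies(test, columns=["resource_type"]).reindex(
    columns=train_X.columns, fill_value=0
)
```

Aligning the test columns to the training columns keeps the feature matrix the same shape at prediction time, even when a new category appears.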


### Models and evaluation strategy

In order to identify which model is likely to produce the best predictions, I built and tested a range of types of models, with a variety of parameters. In total, I ran 165 models. 

Each model generates a prediction score for each project; higher prediction scores correspond to a higher likelihood that a project will be funded within the time window (but should not be considered probabilities). Each prediction score is then translated into a binary classification about whether or not a project will be funded, based on whether the score is above a certain threshold. I tested seven thresholds, corresponding to the top 1%, 2%, 5%, 10%, 20%, 30%, and 50% of projects ranked by score. Selecting a particular percentage threshold means that the model will predict that percentage of projects will be funded. In other words, when I set the threshold at 5 percent, the model predicts that the projects with scores in the top 5 percent will be funded.
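As an illustration of the thresholding step, the sketch below labels the highest-scoring share of projects as positive. The function name `classify_top_share` is hypothetical, and breaking ties by input order is an assumption.

```python
import numpy as np

def classify_top_share(scores, share):
    """Label the highest-scoring `share` of projects 1 and the rest 0."""
    scores = np.asarray(scores, dtype=float)
    n_positive = int(round(len(scores) * share))
    # Sort by descending score; a stable sort keeps input order among ties.
    order = np.argsort(-scores, kind="stable")
    labels = np.zeros(len(scores), dtype=int)
    labels[order[:n_positive]] = 1
    return labels

# Toy example: 100 evenly spaced scores, 5 percent threshold.
scores = np.linspace(0.0, 1.0, 100)
labels = classify_top_share(scores, 0.05)
```

With 100 projects and a 5 percent threshold, exactly the five highest-scoring projects receive a positive label.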

#### Evaluation: training and testing 

To evaluate the performance of the models, I split my data into different training and testing datasets. In each case, the training data contained all the projects posted on DonorsChoose up until a certain date, and the testing data contained all the projects posted after that date. I created three training and test sets: one that used July 1, 2012 as the split date, one that used December 31, 2012, and one that used July 1, 2013. Essentially, each training set contained increasing numbers of projects. By comparing model performance across multiple training and test sets, I can assess whether a particular model reliably outperforms the others.
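The temporal splits described above can be sketched as follows. The split dates come from the text; the column name `date_posted`, the helper name `temporal_splits`, and the choice to place projects posted on the split date itself in the training set are assumptions.

```python
import pandas as pd

SPLIT_DATES = ["2012-07-01", "2012-12-31", "2013-07-01"]  # split dates from the text

def temporal_splits(df, date_col, split_dates):
    """Yield (split_id, train, test), where train holds rows posted on or before the cutoff."""
    for split_id, cutoff in enumerate(pd.to_datetime(split_dates)):
        train = df[df[date_col] <= cutoff]
        test = df[df[date_col] > cutoff]
        yield split_id, train, test

# Toy data: one project posted on the 15th of each month in 2012 and 2013.
dates = pd.to_datetime([f"{y}-{m:02d}-15" for y in (2012, 2013) for m in range(1, 13)])
projects = pd.DataFrame({"date_posted": dates})

sizes = [(len(tr), len(te)) for _, tr, te in temporal_splits(projects, "date_posted", SPLIT_DATES)]
```

On this toy data the training sets grow from 6 to 12 to 18 projects while the test sets shrink accordingly.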

#### Metrics

I evaluated the performance of each model using a few standard metrics from machine learning, which I describe below.

* **Precision:** This metric is the share of positive predictions that turn out to be correct. It describes what share of the projects that the model predicted would be funded within 60 days actually were funded in that time window.

* **Recall:** This metric indicates the share of all the projects that actually were funded in time that the model picked up on. 

* **Accuracy:** This metric represents the share of projects that the model classified correctly. That is, it takes the sum of projects predicted to be funded that were fully funded in 60 days and the projects predicted not to be funded that were not funded in that time frame, and divides by the total number of projects. If the model makes the correct prediction less often than random guessing would, then it is probably not a good model.

* **AUC-ROC:** This measure summarizes the trade-off between the true-positive and false-positive rates across all possible thresholds. A random model would produce an AUC-ROC score of 0.5 because it would be no more likely to rank a project that will be funded in the time window above one that will not than the reverse. An AUC-ROC value closer to 1 indicates that a model is better at distinguishing projects that will be funded.
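To make the four metrics concrete, here is a small self-contained sketch in plain Python. The function names are hypothetical, labels are assumed to be coded 0/1, and the AUC-ROC is computed as the share of positive-negative pairs in which the positive example has the higher score (ties count half), which is one standard way to define it.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct share of positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # share of actual positives found
    accuracy = (tp + tn) / len(y_true)              # share of all predictions that are correct
    return precision, recall, accuracy

def auc_roc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: six projects, the top three scores are predicted positive.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

In this toy case two of the three positive predictions are correct, so precision and recall are both 2/3, and seven of the nine positive-negative pairs are ranked correctly.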

#### Benchmarks

I compared each model's performance against a few simple, non-machine-learning benchmarks to see whether a given model is truly adding value. I considered the following benchmarks:

* **Random:** If we picked projects randomly, would we be more or less accurate than the model? Given that about 25 percent of projects were not funded within the time window in the dataset, does our model correctly identify a larger share of these slowly-funded projects than we would if we picked randomly? 
* **Majority:** In the dataset as a whole, about 25 percent of projects were not funded within the time window. Consequently, if we predicted that every project would be funded within the time window, we would be correct 75 percent of the time. How does the model compare against this approach? (Note that in practice, I am comparing against the share of funded projects in each test dataset, not the dataset as a whole.)
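The arithmetic behind these two benchmarks can be worked through in a few lines. This sketch assumes the positive class is "not funded within 60 days", uses the approximate 25 percent prevalence quoted above, and assumes a random benchmark that flags 5 percent of projects uniformly at random; the function names are hypothetical.

```python
def random_benchmark(prevalence, share_flagged):
    """Expected metrics when a fixed share of projects is flagged uniformly at random."""
    precision = prevalence        # flagged projects mirror the overall class mix
    recall = share_flagged        # each positive is flagged with this probability
    accuracy = share_flagged * prevalence + (1 - share_flagged) * (1 - prevalence)
    return precision, recall, accuracy

def majority_benchmark(prevalence):
    """Accuracy from always predicting the more common class."""
    return max(prevalence, 1 - prevalence)

prevalence = 0.25     # approximate share of projects not funded within 60 days
share_flagged = 0.05  # intervention capacity

rand_precision, rand_recall, rand_accuracy = random_benchmark(prevalence, share_flagged)
majority_accuracy = majority_benchmark(prevalence)
```

Under these assumptions, random flagging has an expected precision of 0.25, an expected recall of 0.05, and an expected accuracy of 0.725, while the majority classifier reaches 0.75 accuracy without flagging any project.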

### Results

The table below shows results from the 165 model runs at a 5 percent threshold level (the full table of results at all seven thresholds is included as a separate file). 

* The best-performing model across all train-test splits was an SVM model. However, in general, the tree models outperformed the linear models on precision, though the difference is not large. By a very small margin, the tree models, especially the decision trees, also outperformed the linear models on recall.

* The AUC-ROC scores were almost identical across all models, and therefore not especially useful for identifying better-performing models.

* Overall, the results of the models are quite stable over time, with no model type showing huge gains in precision or recall as more data are added. The bootstrap aggregation models do show a somewhat consistent pattern of small increases, but again, these gains are marginal at best.

* Though the overall maximum precision score does not improve with more data, the tree models perform best in two of three train-test splits and in particular, they perform best in the last split, which has the most data. 

* Unfortunately, no model exceeded 70 percent accuracy, meaning that no model was more accurate than a majority classifier approach in which we simply classified all projects as likely to be funded within 60 days. That said, given that the goal is to identify 5 percent of projects on which to intervene, this fact alone does not mean the predictive model doesn't add any value.

### Recommendations

Given the goal, I recommend focusing on model precision. Since there is capacity to intervene on about 5 percent of posted projects, we should aim to maximize the share of those projects that are correctly predicted not to be funded within the 60-day window. In other words, because we have a fixed amount of resources and we want to deploy them most effectively, we want to minimize the number of projects that receive aid because they were predicted not to be funded in time but would in fact have been funded within the window anyway.

Given this focus on precision, I recommend using a tree model, since the tree models outperformed the linear models. In this particular case, the decision tree with a maximum depth of 1 performed the best of the tree models. However, because the bagged models' performance increased across successive train-test splits, I would recommend bagging a shallow tree model.
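As a rough illustration of the recommended approach, the sketch below bags depth-1 decision trees with scikit-learn and flags the top 5 percent of projects by score. The synthetic features, the label rule, and the choice of 50 estimators are placeholders rather than the configuration used in the analysis.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features are the ones listed in the Data section.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0.7).astype(int)

# Bag shallow trees; the base estimator is passed positionally so the call
# does not depend on the keyword name, which changed between scikit-learn versions.
model = BaggingClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)
model.fit(X, y)

# Score every project and flag the top 5 percent for intervention.
scores = model.predict_proba(X)[:, 1]
n_flagged = int(round(0.05 * len(scores)))
flagged = np.argsort(-scores)[:n_flagged]
```

In practice the model would be fit on a training split and scored on later projects, as in the evaluation setup described earlier.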



```python
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.read_csv('output/Memo_Table.csv')
```

| Row | classifier | params | Train/Test Split ID | Precision | Recall | Accuracy | AUC_ROC Score |
|---|---|---|---|---|---|---|---|
| 0 | BA | {'n_estimators': 10} | 0 | 0.276065 | 0.048945 | 0.695557 | 0.499261 |
| 1 | BA | {'n_estimators': 10} | 1 | 0.301154 | 0.051126 | 0.685552 | 0.500793 |
| 2 | BA | {'n_estimators': 10} | 2 | 0.297872 | 0.052322 | 0.695106 | 0.50162 |
| 3 | BA | {'n_estimators': 25} | 0 | 0.28499 | 0.050527 | 0.69645 | 0.500363 |
| 4 | BA | {'n_estimators': 25} | 1 | 0.292046 | 0.04958 | 0.684641 | 0.499697 |
| 5 | BA | {'n_estimators': 25} | 2 | 0.302852 | 0.053197 | 0.695604 | 0.502231 |
| 6 | BA | {'n_estimators': 50} | 0 | 0.264097 | 0.046823 | 0.69436 | 0.497784 |
| 7 | BA | {'n_estimators': 50} | 1 | 0.302975 | 0.051435 | 0.685734 | 0.501012 |
| 8 | BA | {'n_estimators': 50} | 2 | 0.296062 | 0.052004 | 0.694925 | 0.501397 |
| 9 | DecisionTree | {'criterion': 'entropy', 'max_depth': 10} | 0 | 0.281947 | 0.049987 | 0.696146 | 0.499988 |
