## HW 3: Predicting whether a DonorsChoose.org project will be funded within 60 days

#### Cecile Murray

DonorsChoose.org is a website that helps connect teachers in need of financial support for educational projects with a network of donors willing to fund those projects. Donors give any amount they choose; once a project is fully funded, the requested materials are delivered. I analyzed past project funding data to build a model predicting whether or not a given project will be fully funded within 60 days of being posted on DonorsChoose. This prediction will allow projects at risk of being funded slowly to be identified so that we can promote those projects or otherwise intervene to help them be funded more quickly.

[DESCRIBE FINDINGS]

### Data 

I built predictive models using data on the individual project listings between January 1, 2012 and December 31, 2013. Each listing includes some basic identifying information, geographic identifiers, and some fields describing the project's attributes. It also includes a field indicating when it was posted on DonorsChoose and, if it was fully funded, the date on which it received full funding. From these two dates, I computed the time between posting and funding. My outcome variable indicates whether that time is less than or equal to 60 days.
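The outcome computation described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function and argument names are hypothetical, and treating a project with no funding date as "not funded in time" is an assumption.

```python
from datetime import date
from typing import Optional


def funded_within_window(
    date_posted: date,
    date_fully_funded: Optional[date],
    window_days: int = 60,
) -> int:
    """Return 1 if the project was fully funded within window_days of posting, else 0."""
    if date_fully_funded is None:
        # Assumption: a project with no funding date was never fully funded.
        return 0
    days_to_fund = (date_fully_funded - date_posted).days
    # "Less than or equal to 60 days" counts as funded in time.
    return int(days_to_fund <= window_days)
```

For example, a project posted on January 1, 2012 and funded on March 1, 2012 took exactly 60 days (2012 is a leap year), so it falls inside the window.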

#### Predictor variables

I use the following features in order to generate predictions. 

* The number of students reached
* The total price of the project
* Whether the project is eligible for a match donation from a high-profile corporate sponsor
* The type of resource being requested: for example, technology or books
* The primary academic focus area of the project
* The specific primary academic subject of the project
* The secondary focus area of the project
* The specific secondary academic subject of the project
* The grade level of the students who will be helped by the project
* Whether or not the teacher lists their title as doctor, indicating that they hold an advanced degree
* Whether or not the teacher lists their salutation as "Ms." or "Mrs.", indicating that they are female
* State in which the school is located
* Whether or not the school is located in one of the nation's largest cities, namely Los Angeles, Chicago, Houston, Brooklyn, the Bronx, or New York
* The relative poverty level of the school, as measured by the share of students receiving free or reduced-price lunch

I replaced missing values in the number of students reached field with zeroes, as I assume that a donor who saw a project where no number of students was listed would interpret that missing information as a zero. 
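As an illustration of how the derived binary features and the missing-value rule above might be implemented, here is a small sketch. The record keys (`teacher_prefix`, `school_city`, `students_reached`) and the exact prefix strings are assumptions about the raw data, not confirmed field names.

```python
# Cities treated as "largest cities" in the feature list above.
LARGE_CITIES = {"Los Angeles", "Chicago", "Houston", "Brooklyn", "Bronx", "New York"}


def derive_features(record: dict) -> dict:
    """Build the binary teacher/city features and fill missing student counts."""
    # Hypothetical field names; a missing prefix is treated as an empty string.
    prefix = (record.get("teacher_prefix") or "").strip()
    students = record.get("students_reached")
    return {
        "teacher_is_doctor": int(prefix == "Dr."),
        "teacher_is_female": int(prefix in {"Ms.", "Mrs."}),
        "in_large_city": int(record.get("school_city") in LARGE_CITIES),
        # Missing student counts are replaced with zero, as described above.
        "students_reached": 0 if students is None else students,
    }
```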

I normalized the number of students and the total price, because differences in scale between features affect the predictions made by certain classes of models. For similar reasons, I converted categorical variables, such as the type of resource, into a series of binary dummy variables. Importantly, I did not use the field indicating when the project was fully funded, because the outcome variable is computed from it and it would therefore allow the model to perfectly predict whether or not a project would be funded within the time window.
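The normalization and dummy-variable steps could look roughly like the sketch below. The text does not say which normalization was used, so z-score standardization is an assumption here, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric column: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn a categorical column into one binary dummy variable per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]
```

After standardization, a feature such as total price (often in the hundreds of dollars) and a feature such as students reached are on comparable scales, which matters for models that are sensitive to feature magnitude.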


### Models and evaluation strategy

In order to identify which model is likely to produce the best predictions, I built and tested a range of types of models, each of which generates a prediction score. Higher prediction scores correlate with a higher likelihood that a project will be funded within the time window (but should not be considered probabilities).

#### Evaluation: training and testing 

To evaluate the performance of the models, I split my data into different training and testing datasets. In each case, the training data contained all the projects posted on DonorsChoose up until a certain date, and the testing data contained all the projects posted after that date. I created three training and test sets: one that used July 1, 2012 as the split date, one that used December 31, 2012, and one that used July 1, 2013. Each successive training set therefore contained more projects than the one before it. By comparing model performance across multiple training and test sets, I can assess whether a particular model reliably outperforms the others.
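The temporal splits described above can be sketched as follows. The `date_posted` key is a hypothetical field name, and including projects posted exactly on the split date in the training set is an assumption about what "up until" means.

```python
from datetime import date

# The three split dates named in the text.
SPLIT_DATES = [date(2012, 7, 1), date(2012, 12, 31), date(2013, 7, 1)]


def temporal_split(projects, split_date):
    """Split projects into (train, test) by posting date."""
    # Assumption: projects posted on the split date itself go to training.
    train = [p for p in projects if p["date_posted"] <= split_date]
    test = [p for p in projects if p["date_posted"] > split_date]
    return train, test


def all_splits(projects):
    """Return one (train, test) pair per split date, in chronological order."""
    return [temporal_split(projects, d) for d in SPLIT_DATES]
```

Splitting by date rather than at random keeps the evaluation honest: each model is only ever tested on projects posted after everything it was trained on.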

#### Metrics

I evaluated the performance of each model using a few standard metrics from machine learning, which I describe below. Apart from AUC-ROC, each of these metrics relies on the choice of a threshold to translate a prediction score into a binary classification about whether or not a project will be funded. I tested seven thresholds, corresponding to 1%, 2%, 5%, 10%, 20%, 30%, and 50% of projects. Selecting a particular percentage threshold means that the projects with the highest prediction scores, up to that percentage of all projects, are predicted to be funded.
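To make the percentage thresholds concrete, the sketch below labels the top-scoring share of projects as "funded" and everything else as "not funded". The function name is hypothetical, and rounding the cutoff down and breaking ties by original order are assumptions.

```python
# The seven thresholds from the text, as whole-number percentages.
THRESHOLD_PERCENTS = [1, 2, 5, 10, 20, 30, 50]


def predict_top_percent(scores, percent):
    """Return 0/1 predictions where the top `percent` of scores are labeled 1."""
    # Integer arithmetic avoids floating-point surprises; rounds down.
    n_positive = len(scores) * percent // 100
    # Indices ordered from highest to lowest score (ties keep original order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predictions = [0] * len(scores)
    for i in ranked[:n_positive]:
        predictions[i] = 1
    return predictions
```

With ten projects and a 20% threshold, for instance, only the two highest-scoring projects are predicted to be funded.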

* **Precision:** This metric is the share of positive predictions that are correct. It describes what share of the projects that the model predicted would be funded within 60 days actually were funded in that time window.

* **Recall:** This metric, also known as the true positive rate, indicates the share of all the projects that actually were funded in time that the model correctly identified.

* **Baseline accuracy:** This metric represents the share of projects that the model classified correctly. That is, it takes the sum of projects predicted to be funded that were fully funded in 60 days and the projects predicted not to be funded that were not funded in that time frame, and divides by the total number of projects. If the model makes the correct prediction less often than random guessing would, then it is probably not a good model.

* **AUC-ROC:** This measure summarizes the trade-off between the true-positive and false-positive rates across all possible thresholds. A random model would produce an AUC-ROC score of about 0.5, because it would be no more likely to rank a project that was funded in the time window above one that was not than the reverse. An AUC-ROC value closer to 1 indicates that a model is better at distinguishing projects that will be funded from those that will not.
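The four metrics above can be computed from scratch as in the following sketch. It is meant only to pin down the definitions; the function names are hypothetical, returning 0.0 for an undefined precision or recall is an assumption, and the AUC-ROC here uses the pairwise-ranking formulation, which is simple but slow on large datasets.

```python
def precision_recall_accuracy(y_true, y_pred):
    """Compute (precision, recall, accuracy) from 0/1 labels and 0/1 predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted funded, was funded
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted funded, was not
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted not funded, was
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted not funded, was not
    # Assumption: report 0.0 when a ratio is undefined (no positives predicted/present).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


def auc_roc(y_true, scores):
    """Share of (funded, unfunded) pairs where the funded project scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    # Ties count as half a win, matching the usual definition.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that `auc_roc` takes the raw prediction scores rather than thresholded predictions, which is why it does not depend on the choice of percentage threshold.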

#### Benchmarks

I compared each model's performance against a few simple, non-machine-learning benchmarks to see whether a given model is truly adding value. I considered the following benchmarks:

* **Random:** If we picked projects randomly, would we be more or less accurate than the model? 
* **Majority:** In the dataset as a whole, about 25 percent of projects were not funded within the time window. Consequently, if we predicted that every project would be funded within the time window, we would be correct 75 percent of the time. How does the model compare against this approach? (Note that in practice, I am comparing against the share of funded projects in each test dataset, not the dataset as a whole.)
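Both benchmarks reduce to simple functions of the share of funded projects in a test set, as this sketch shows. The function names are hypothetical. The random benchmark is expressed here as expected precision, on the reasoning that a randomly chosen set of projects contains funded projects at the same rate as the test set overall.

```python
def share_funded(y_true):
    """Share of projects in the test set that were funded within the window."""
    return sum(y_true) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the more common outcome."""
    rate = share_funded(y_true)
    return max(rate, 1 - rate)


def random_baseline_precision(y_true):
    """Expected precision when the projects predicted 'funded' are chosen at random."""
    return share_funded(y_true)
```

With 75 percent of projects funded, always predicting "funded" is correct 75 percent of the time, and a random selection of projects would also be about 75 percent funded, so a useful model needs to beat both figures.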

### Results

* Which classifier does better on which metrics?
* How do the results change over time? 

### Recommendations

Given that there is capacity to intervene on about 5 percent of posted projects, we should aim to maximize the share of those projects that are correctly predicted not to be funded within the 60-day window. In other words, because we have a fixed amount of resources and we want to deploy them most effectively, we want to minimize the number of projects that would have been funded within the time window but are predicted not to be, and that therefore receive aid they did not need. For that reason, I recommend we focus on precision.

* What would be your recommendation to someone who's working on this model to identify 5% of posted projects to intervene with, which model should they decide to go forward with and deploy?

