# Recommending Projects to Donors - DonorsChoose.org

![DonorChooseLogo](images/DonorsChoose_org_logo.jpg)

DonorsChoose.org is a nonprofit dedicated to providing the funds that teachers need in order to improve the overall quality of education. This kernel looks through the data to discover insights and build a recommendation system to assist in re-engaging donors.

##### By Jacob Sieber

# Table of Contents:
* [Introduction and Approach](#intro)
* [EDA Highlights](#eda)
* [Building User Profiles](#user-profiles)
* [Building Project Profiles](#project-profiles)
* [Creating a Weighted Recommendation System](#final-product)

# Introduction and Approach <a class="anchor" id="intro"></a>

## The problem

DonorsChoose.org wants to find a solution that "will enable DonorsChoose.org to build targeted email campaigns recommending specific classroom requests to prior donors". In any business, one of the greatest revenue-generating segments of customers (or donors) is those who have previously provided revenue. In DonorsChoose.org's case, previous donors have already shown a preference for the product and have an email address through which they can be contacted. In the EDA stage, the effect of re-engaged customers on the bottom line is quantified.

The primary focus of this kernel is to generate a solution that will match previous donors to projects that they are likely to donate to.

The three target metrics for the solution are:
* Performance - Good Targeting
* Adaptable - Feasible Implementation
* Intelligible - Easily Understandable

## The approach

By framing the solution as a "Recommender System", there is already a firm foundation of work on which the project can be based. However, this problem differs from the typical recommender system in two key ways: donors tend to "make a single purchase" (donate to a single project), and a significant number of products (projects) have only one to three donors. This can complicate the more standard recommendation approaches. Therefore, the approach I decided to take was to recommend new projects based on both features of users and features of projects. Several new features have been engineered from the dataset.


The best way to begin is to use the data to give a brief overview of the company and the problem we wish to solve. Perhaps the most important metric for any large company is the bottom line, so this kernel begins by examining income and a successful solution's possible impact on income.

In [1]:
# Loading in packages

library(data.table)
library(scales)
library(repr)
library(tidyverse)

options(scipen=10000)
options(repr.plot.width=10, repr.plot.height=4)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 2.2.1     v purrr   0.2.4
v tibble  1.4.2     v dplyr   0.7.4
v tidyr   0.8.0     v stringr 1.3.1
v readr   1.1.1     v forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::between()    masks data.table::between()
x readr::col_factor() masks scales::col_factor()
x purrr::discard()    masks scales::discard()
x dplyr::filter()     masks stats::filter()
x dplyr::first()      masks data.table::first()
x dplyr::lag()        masks stats::lag()
x dplyr::last()       masks data.table::last()
x purrr::transpose()  masks data.table::transpose()


In [None]:
# Loading in data

Donations <- fread('data/Donations.csv')
Donors <- fread('data/Donors.csv')
Projects <- fread('data/Projects.csv')
Resources <- fread('data/Resources.csv')
Schools <- fread('data/Schools.csv')
Teachers <- fread('data/Teachers.csv')

In [None]:
# Linking up the datasets

Donations_Linked <- merge(Donations,Projects) %>%
  merge(Donors, by = 'Donor ID') %>%
  merge(Schools, by = 'School ID') %>%
  merge(Teachers, by = 'Teacher ID')

In [None]:
# Adjusting the data for graphing and modeling

Donations[, `Donation Received Date` := anytime::anydate(`Donation Received Date`)]
Donations[, Year := format(`Donation Received Date`, '%Y')]
Donation_by_day <- Donations[,.(Total_Donations = sum(`Donation Amount`)), by = `Donation Received Date`]
Donation_by_day[, Year := format(`Donation Received Date`, '%Y')]


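# Strip currency formatting so project cost can be averaged.
# Donations_Linked was built before this step, so its copy of the column is converted as well.
Projects$`Project Cost` <- as.numeric(gsub('[$,]', '', Projects$`Project Cost`))
Donations_Linked[, `Project Cost` := as.numeric(gsub('[$,]', '', `Project Cost`))]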

# Performance intensive: per-donor counts of donations and of distinct schools, teachers, and states

Repeated_Donors <- Donations_Linked[,.(
    `Times Donated`=.N, 
    `Different Schools` = length(unique(`School ID`)),
    `Different Teachers` =  length(unique(`Teacher ID`)),
    `Different States` = length(unique(`School State`))
    ),by=`Donor ID`]

# EDA Highlights <a class="anchor" id="eda"></a>

A brief EDA to get a better idea of the DonorsChoose.org business model.

### Revenue Year by Year

There is a strong upward trend in annual revenue. This is great news for DonorsChoose.org!

In [None]:
# Data transformation

plot_data  <- Donations[Year!=2018 & Year!=2012,.(`Donation Amount` = sum( `Donation Amount`)), by =.(Year,`Donation Included Optional Donation`)]
plot_data  <- plot_data[,.(`Donation Amount`,`Direct Revenue` = (sum(`Donation Amount`[`Donation Included Optional Donation` == 'Yes']))*.15), by =.(Year,`Donation Included Optional Donation`)]

# Plotting total donations provided by donors by year

ggplot(plot_data[,.(`Donation Amount` = sum(`Donation Amount`)), by = Year], aes(Year, `Donation Amount`,group = 1)) +
    geom_line(color = 'green', size = 1.5) +
    scale_y_continuous(labels=dollar_format()) +
    labs(title = 'Total Donations by Year') +
    theme(plot.title = element_text(hjust = 0.5, size = 22))

### Funding for DonorsChoose.org Operations

If we consider the operations funding to come directly from donations that have included the optional donation (15%), there is also strong growth in contributions directly for operations.
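As a minimal sketch of that assumption on made-up numbers (the `toy` table, its years, and its amounts are hypothetical and not taken from the dataset), the operations estimate is 15% of the donation total among donations that included the optional donation:

```r
library(data.table)

# Hypothetical toy donations; years and amounts are invented for illustration
toy <- data.table(
  Year = c('2016', '2016', '2017', '2017'),
  `Donation Amount` = c(100, 50, 200, 40),
  `Donation Included Optional Donation` = c('Yes', 'No', 'Yes', 'Yes')
)

# Assumed operations share, matching the 15% used in this kernel
optional_rate <- 0.15

# Sum only the donations that included the optional donation, then take 15%
direct_revenue <- toy[`Donation Included Optional Donation` == 'Yes',
                      .(`Direct Revenue` = sum(`Donation Amount`) * optional_rate),
                      by = Year]
print(direct_revenue)
```

Donations marked 'No' contribute nothing to the estimate, which is why the 50 dollar donation in 2016 is ignored here.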

In [None]:
ggplot(data = plot_data[`Donation Included Optional Donation` == 'Yes'], aes(Year,`Direct Revenue`,group = 1))+
    geom_line(linetype = 'twodash', color ='green',size = 1.5) +
    scale_y_continuous(labels=dollar_format()) +
    labs(title = 'Contributions Directly for DonorsChoose.org Operations') +
    theme(plot.title = element_text(hjust = 0.5, size = 22))

### Donations Received over Time

There is an upward yearly trend in donations, with brief spikes of large donations. The highest-revenue special events are:

* [#BestSchoolDay Matching 2017](https://www.donorschoose.org/blog/best-school-day-2017/): March 25-30, 2017
* [The Bill & Melinda Gates Foundation Donation Matching](https://help.donorschoose.org/hc/en-us/articles/115013788948-2X-Match-on-professional-development-projects-thanks-to-the-Bill-and-Melinda-Gates-Foundation): August 23-25, 2016
* [1 Million Classroom Matching](https://www.prnewswire.com/news-releases/donorschooseorg-announces-1-million-classroom-project-requests-funded-for-teachers-nationwide-300587983.html): January 25-29, 2018
* [#GivingTuesday Raffle Prize](https://www.donorschoose.org/blog/press/press-release-donorschoose-org-launch-500000-givingtuesday-giveaway/): November 28-30, 2017

In [None]:
ggplot(Donation_by_day, aes(`Donation Received Date`, Total_Donations)) +
    geom_line(size = 1, aes(color = Year)) +
    theme(legend.position="none") +
    theme(plot.title = element_text(hjust = 0.5, size = 22)) +
    scale_y_continuous(labels=dollar_format()) +
    labs(title = 'Donations Received over Time') 

### How Much do Donors typically Donate?
The vast majority of donations are from 0 to 75 dollars.

In [None]:
ggplot(Donations[`Donation Amount` < 600], aes(`Donation Amount`)) +
  geom_histogram(binwidth = 25, fill = 'skyblue', color = 'black') +
  scale_x_continuous(breaks = seq(0, 600, by = 25)) +
  labs(title = 'Distribution of Donations (covers over 99% of the data)') +
  theme(plot.title = element_text(hjust = 0.5, size = 22))

### What is the distribution of project success and projects by grade level?

The distribution of success rate and number of projects can be easily visualized with a bar chart. Most of the projects (71.79%) are for elementary schools, 16.38% are for middle schoolers, and 11.83% are for high schoolers. There is not a large difference in success rate across grade levels.
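To make the two quantities concrete, here is a small sketch on invented data. The `toy_projects` table, its grade labels, and its status labels ('Fully Funded', 'Expired') are assumptions chosen for illustration; the real columns may contain other values. It computes each grade level's share of projects and the fraction of its projects that were funded:

```r
library(data.table)

# Hypothetical projects; grade and status labels are assumed for illustration
toy_projects <- data.table(
  `Project Grade Level Category` = c('Grades PreK-2', 'Grades PreK-2', 'Grades PreK-2',
                                     'Grades 3-5', 'Grades 6-8', 'Grades 6-8'),
  `Project Current Status` = c('Fully Funded', 'Fully Funded', 'Expired',
                               'Fully Funded', 'Fully Funded', 'Expired')
)

total_projects <- nrow(toy_projects)

# Per grade level: project count, share of all projects, and funded rate
summary_by_grade <- toy_projects[, .(
  Projects = .N,
  `Share of Projects` = .N / total_projects,
  `Funded Rate` = mean(`Project Current Status` == 'Fully Funded')
), by = `Project Grade Level Category`]

print(summary_by_grade)
```

A similar funded rate across rows, despite very different project counts, is the pattern the bar chart is meant to show.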

In [None]:
by_grade  <- Projects[`Project Grade Level Category` != 'unknown',.N,by=.(`Project Grade Level Category`,`Project Current Status`)]

# Observing the number of projects by grade. It appears that the lower grades have many more projects dedicated to them.
# However, the ratio of funded projects appears to be similar throughout, so there is little grade bias among donors.

ggplot(by_grade, aes(reorder(`Project Grade Level Category`,-N), N)) +
    geom_bar(stat= 'identity', aes(fill = reorder(`Project Current Status`, N))) +
    labs(title = 'Number of Projects by Grade', x='Grade Level', y='Count' ) +
    scale_fill_manual(values=c('#ffffb2',"#969696",  "#fb6a4a", "#78c679")) +
    guides(fill=guide_legend(title="Project Status")) +
    theme(plot.title = element_text(hjust = 0.5, size = 22))

# Building User Profiles  <a class="anchor" id="user-profiles"></a>

Now that there is a basic understanding of the data, donors can be profiled. This profiling will allow us to recommend projects that similar users have donated to. First, data analysis is used to find which features are important for donors. After that, promising variables are put into a new dataframe. Then, more features are created and added to the original variables. Finally, a cosine similarity matrix is created.
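The last step can be illustrated with a minimal sketch of cosine similarity between donor profiles. The `profiles` matrix, its donor names, and its feature names are hypothetical, and `cosine_matrix` is a helper written for this example rather than a function from a package; the kernel's actual profiles would first need to be converted to numeric features:

```r
# Hypothetical donor profiles: rows are donors, columns are numeric features
profiles <- matrix(
  c(1, 0, 2,
    2, 0, 4,
    0, 3, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c('donor_a', 'donor_b', 'donor_c'),
                  c('feature_1', 'feature_2', 'feature_3'))
)

# Cosine similarity: dot product of two rows divided by the product of their norms
cosine_matrix <- function(m) {
  norms <- sqrt(rowSums(m^2))
  (m %*% t(m)) / (norms %o% norms)
}

similarity <- cosine_matrix(profiles)
print(round(similarity, 3))
```

Here `donor_a` and `donor_b` point in the same direction and score 1, while `donor_c` shares no features with them and scores 0. A donor whose features are all zero would produce `NaN`, so such rows would need handling before this is applied to real data.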

### Is Donor location important? (State loyalty)

In order to build a recommendation system, it is important to see whether we should try to engage users across state lines. To do this, we must observe whether people donate to schools in states other than their own. We can confirm that most donations come from the state in which the donor lives. Therefore, we should include state in the recommendation system.

In [None]:
states = Donations_Linked[,.(`Donor ID`,`School State`,`School City`,`Donor State`,`Donor City`)]
states[,`Same State` := ifelse(`School State` == `Donor State`, TRUE,FALSE)]
states[,`Same City` := ifelse(`School City` == `Donor City`, TRUE,FALSE)]


state_ratio = nrow(states[`Same State`==TRUE])/nrow(states)
city_ratio = nrow(states[`Same City`==TRUE])/nrow(states)


print(paste0('The ratio of donations coming from the same state: ', round(state_ratio,4)*100,'%'))
print(paste0('The ratio of donations coming from the same city: ', round(city_ratio,4)*100,'%'))

### Other measures of loyalty

Below, there is evidence of patterns within donations for those who have given fewer than nine times. *This data representation covers around 90% of all donors*, and it is the target group from which DonorsChoose.org would like to promote more re-engagement.

* Pattern 1: There is a preference for giving to a single school.
  * Even when donating eight times, around 36% of donors will give to only one school.
* Pattern 2: There is a smaller preference for donating to a single teacher.
  * When donors donate eight times, around 27% of donors will only donate to a single teacher.
* Pattern 3: There is a strong preference for giving only to a single state.
  * When donors donate eight times, around 65% of donors will donate to a single state.
  
With this new information in mind, the school, teacher, and state donated to will be included in the donor profile.
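The loyalty measures above can be sketched on a tiny invented example. The `toy_donations` table and all of its IDs and states are hypothetical. For each donor it counts donations and distinct schools, teachers, and states, then reports the share of repeat donors who stayed with a single one:

```r
library(data.table)

# Hypothetical donations; every ID and state here is invented for illustration
toy_donations <- data.table(
  `Donor ID`     = c('d1', 'd1', 'd2', 'd2', 'd3'),
  `School ID`    = c('s1', 's1', 's1', 's2', 's3'),
  `Teacher ID`   = c('t1', 't2', 't1', 't3', 't4'),
  `School State` = c('Texas', 'Texas', 'Texas', 'Texas', 'Ohio')
)

# Per-donor counts of donations and distinct schools, teachers, and states
per_donor <- toy_donations[, .(
  `Times Donated` = .N,
  `Different Schools` = uniqueN(`School ID`),
  `Different Teachers` = uniqueN(`Teacher ID`),
  `Different States` = uniqueN(`School State`)
), by = `Donor ID`]

# Keep repeat donors only, then take the share loyal to one school, teacher, or state
loyalty <- per_donor[`Times Donated` > 1, .(
  Donors = .N,
  `One School` = mean(`Different Schools` == 1),
  `One Teacher` = mean(`Different Teachers` == 1),
  `One State` = mean(`Different States` == 1)
), keyby = `Times Donated`]

print(loyalty)
```

In this toy data both repeat donors gave within one state, one of the two stayed with a single school, and neither stayed with a single teacher, which mirrors the ordering of the three patterns.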

In [None]:
Repeated_Donors[, `:=`(
    `Only Donated to One School` = ifelse(`Different Schools` == 1,1,0),
    `Only Donated to One Teacher` = ifelse(`Different Teachers` == 1,1,0),
    `Only Donated to One State` = ifelse(`Different States` == 1,1,0)
)]

In [None]:
# Keep a copy of every donor before filtering down to repeat donors
Repeated_Donors_All  <- copy(Repeated_Donors)
Repeated_Donors <- Repeated_Donors[`Times Donated`  > 1]

In [None]:
head(Repeated_Donors)

In [None]:
Repeated_Donors_Grouped  <- Repeated_Donors[,.(
    `Donors` = .N, 
    `Different Schools` = sum(`Different Schools`),
    `Percentage of Gave only to One School`= mean(`Only Donated to One School`),
    `Percentage of Gave only to One Teacher`= mean(`Only Donated to One Teacher`),
    `Percentage of Gave only to One State`=mean(`Only Donated to One State`)
), keyby = `Times Donated`]



In [None]:
# Only_One_School  <- Repeated_Donors[,.(`Donors` = .N , `Different Schools` = sum(`Different Schools`),`Percentage of Gave only to One School`= mean(`Only Donated to One School`)), keyby = `Times Donated`]
head(Repeated_Donors_Grouped,10)

In [None]:
Repeated_Donors_Grouped[`Times Donated` <= 8,sum(Donors),]/Repeated_Donors_Grouped[,sum(Donors)]

In [None]:
ggplot(Repeated_Donors_Grouped[`Times Donated` <= 8], aes(x=`Times Donated`, y = `Percentage of Gave only to One School`)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    labs(title = 'Percentage of Gave only to One School (covers 90% of all donors)', y='') +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = seq(2, 8, by = 1))

In [None]:
ggplot(Repeated_Donors_Grouped[`Times Donated` <= 8], aes(x=`Times Donated`, y = `Percentage of Gave only to One Teacher`)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    labs(title = 'Teacher Loyalty by Times Donated (covers 90% of all donors)') +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = seq(2, 8, by = 1))

In [None]:
ggplot(Repeated_Donors_Grouped[`Times Donated` <= 8], aes(x=`Times Donated`, y = `Percentage of Gave only to One State`)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    labs(title = 'State Loyalty by Times Donated (covers 90% of all donors)') +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_y_continuous(labels = scales::percent) +
    scale_x_continuous(breaks = seq(2, 8, by = 1))

### Creating Donor Profiles

Now that we have the information that we would like to profile donors with, a new dataframe can be created with the selected variables.

In [None]:
# Adding a variable indicating if the teacher is female or not
Donations_Linked[, `Is Female` := ifelse(`Teacher Prefix` == 'Mrs.' | `Teacher Prefix` == 'Ms.', 1,0)]
Donations_Linked[, `Donation Included` := ifelse(`Donation Included Optional Donation` == 'Yes', 1,0)]

# Creating the profile dataframe
Donation_Profiles <-  Donations_Linked %>%
    select(`Donor ID`,`Donor City`,`Donor State`,`Donation Amount`,
           `Donation Included`, `Donor Is Teacher`, `Project Cost`, `School Metro Type`,
          `School Percentage Free Lunch`, `Is Female`, `Donation Received Date`) %>%
    arrange(desc(`Donation Received Date`))


In [None]:
head(Donation_Profiles)

In [None]:
# Now data is aggregated for each donor

Donor_Profiles  <- Donation_Profiles %>%
    group_by(`Donor ID`) %>%
    summarise(
        `Donor City` = first(`Donor City`),
        `Donor State` = first(`Donor State`),
        `Average Donation` = mean(`Donation Amount`),
        `Donation Included Ratio` = mean(`Donation Included`),
        `Donor Is Teacher` = first(`Donor Is Teacher`),
        `Average Project Cost` = mean(`Project Cost`),
        `School Percentage Free Lunch` = mean(`School Percentage Free Lunch`),
        `Donate to Female Teacher Ratio` = mean(`Is Female`)
    )

In [None]:
head(Donor_Profiles)