## Improved Machine Learning Pipeline Applied to DonorsChoose Data
### CAPP 30254 Homework 3

The module `pipeline_library` contains updated functions that implement the steps of a machine learning pipeline. The general steps are as follows:

1. Load the data
2. Explore the data
3. Preprocess and clean the data
4. Generate features for the ML model
5. Build a machine learning model
6. Evaluate the model

Here I use the pipeline on data from DonorsChoose, a charity site for K-12 schools, to predict which classroom projects will be fully funded within 60 days of being posted. The data is a modified version of the KDD Cup 2014 dataset available at https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data.

### Load and explore the data
I'll start by loading and exploring the data.

In [1]:
import pipeline_library as pl

In [2]:
proj_df = pl.load_csv_data("data/projects_2012_2013.csv")
proj_df.head()

Unnamed: 0,projectid,teacher_acctid,schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_metro,school_district,...,secondary_focus_subject,secondary_focus_area,resource_type,poverty_level,grade_level,total_price_including_optional_support,students_reached,eligible_double_your_impact_match,date_posted,datefullyfunded
0,00001ccc0e81598c4bd86bacb94d7acb,96963218e74e10c3764a5cfb153e6fea,9f3f9f2c2da7edda5648ccd10554ed8c,170993000000.0,41.807654,-87.673257,Chicago,IL,urban,Pershing Elem Network,...,Visual Arts,Music & The Arts,Supplies,highest poverty,Grades PreK-2,1498.61,31.0,f,4/14/13,5/2/13
1,0000fa3aa8f6649abab23615b546016d,2a578595fe351e7fce057e048c409b18,3432ed3d4466fac2f2ead83ab354e333,64098010000.0,34.296596,-119.296596,Ventura,CA,urban,Ventura Unif School District,...,Literature & Writing,Literacy & Language,Books,highest poverty,Grades 3-5,282.47,28.0,t,4/7/12,4/18/12
2,000134f07d4b30140d63262c871748ff,26bd60377bdbffb53a644a16c5308e82,dc8dcb501c3b2bb0b10e9c6ee2cd8afd,62271000000.0,34.078625,-118.257834,Los Angeles,CA,urban,Los Angeles Unif Sch Dist,...,Social Sciences,History & Civics,Technology,high poverty,Grades 3-5,1012.38,56.0,f,1/30/12,4/15/12
3,0001f2d0b3827bba67cdbeaa248b832d,15d900805d9d716c051c671827109f45,8bea7e8c6e4279fca6276128db89292e,360009000000.0,40.687286,-73.988217,Brooklyn,NY,urban,New York City Dept Of Ed,...,,,Books,high poverty,Grades PreK-2,175.33,23.0,f,10/11/12,12/5/12
4,0004536db996ba697ca72c9e058bfe69,400f8b82bb0143f6a40b217a517fe311,fbdefab6fe41e12c55886c610c110753,360687000000.0,40.793018,-73.205635,Central Islip,NY,suburban,Central Islip Union Free SD,...,Literature & Writing,Literacy & Language,Technology,high poverty,Grades PreK-2,3591.11,150.0,f,1/8/13,3/25/13


Count the null values in each column to see how often, and where, data is missing.

In [3]:
proj_df.isnull().sum()

projectid                                     0
teacher_acctid                                0
schoolid                                      0
school_ncesid                              9233
school_latitude                               0
school_longitude                              0
school_city                                   0
school_state                                  0
school_metro                              15224
school_district                             172
school_county                                 0
school_charter                                0
school_magnet                                 0
teacher_prefix                                0
primary_focus_subject                        15
primary_focus_area                           15
secondary_focus_subject                   40556
secondary_focus_area                      40556
resource_type                                17
poverty_level                                 0
grade_level                             
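
The null check above can be extended to show the share of missing values per column and which rows are affected. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `null_summary`, and `rows_with_nulls` are illustrative assumptions, not part of `pipeline_library`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for proj_df; the values are made up for illustration.
toy = pd.DataFrame({
    "school_metro": ["urban", None, "rural", None],
    "students_reached": [31.0, 28.0, np.nan, 56.0],
    "poverty_level": ["high poverty"] * 4,
})

# Frequency: null count and null share per column.
null_summary = pd.DataFrame({
    "n_null": toy.isnull().sum(),
    "share_null": toy.isnull().mean(),
})
# Keep only the columns that actually contain nulls.
null_summary = null_summary[null_summary["n_null"] > 0]

# Location: index labels of rows with at least one null value.
rows_with_nulls = toy.index[toy.isnull().any(axis=1)].tolist()

print(null_summary)
print(rows_with_nulls)
```

On the real data, the same two expressions could be run on `proj_df` to decide which columns need imputation and which can be used as they are.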

### Preprocess data and prepare features and label
Next I will preprocess and clean the data, prepare selected columns to be features, and prepare the label for the machine learning models.

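The following are the columns in the dataset, with the tentative plan for each one (taken from the commented-out feature list in the next cell):
- 'projectid'
- 'teacher_acctid'
- 'schoolid'
- 'school_ncesid'
- 'school_latitude': selected
- 'school_longitude': selected
- 'school_city': not selected
- 'school_state': not selected
- 'school_metro': selected; make dummy, handle nulls
- 'school_district': not selected
- 'school_county': not selected
- 'school_charter': selected; make dummy
- 'school_magnet': selected; make dummy
- 'teacher_prefix': selected; make dummy
- 'primary_focus_subject': not selected
- 'primary_focus_area': to be decided
- 'secondary_focus_subject': not selected
- 'secondary_focus_area': not selected
- 'resource_type': to be decided
- 'poverty_level': selected; encode
- 'grade_level': selected; encode, handle nulls
- 'total_price_including_optional_support': selected
- 'students_reached': selected; handle nulls
- 'eligible_double_your_impact_match': selected; make dummy
- 'date_posted': not selected as a feature; used to compute 'daystofunding'
- 'datefullyfunded': not selected as a feature; used to compute 'daystofunding'
- 'daystofunding': derived column ('datefullyfunded' minus 'date_posted'), used to build the label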
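
To make the "make dummy", "encode", and "handle nulls" steps concrete, here is a minimal sketch in plain pandas on a toy frame. It does not use `pipeline_library`, whose helpers may work differently. The `"unknown"` fill value, the `metro_` prefix, and the ordering of poverty levels are assumptions for illustration; only the two poverty levels visible in the `head()` output are included.

```python
import pandas as pd

# Toy rows using values that appear in the head() output above.
toy = pd.DataFrame({
    "school_metro": ["urban", None, "suburban"],
    "eligible_double_your_impact_match": ["f", "t", "f"],
    "poverty_level": ["highest poverty", "high poverty", "highest poverty"],
})

# Make dummy: 't'/'f' flags become 1/0.
toy["eligible_double_your_impact_match"] = (
    toy["eligible_double_your_impact_match"] == "t"
).astype(int)

# Handle nulls, then make dummies: give missing values their own category
# (assumed name "unknown") before one-hot encoding.
toy["school_metro"] = toy["school_metro"].fillna("unknown")
metro_dummies = pd.get_dummies(toy["school_metro"], prefix="metro", dtype=int)

# Encode: map ordered categories to integer codes (ordering is an assumption).
poverty_order = ["high poverty", "highest poverty"]
toy["poverty_level"] = pd.Categorical(
    toy["poverty_level"], categories=poverty_order, ordered=True
).codes

# Combine the transformed columns into one feature frame.
features = pd.concat([toy.drop(columns="school_metro"), metro_dummies], axis=1)
print(features)
```

Giving nulls an explicit category keeps every row in the data, which may matter here because 'school_metro' is missing for a large number of projects.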

In [5]:
import pandas as pd  # required for pd.Timedelta below

# Convert the two date columns to datetimes so they can be subtracted.
pl.cols_to_datetime(proj_df, ["date_posted", "datefullyfunded"])

# Time elapsed between posting and full funding.
proj_df["daystofunding"] = proj_df["datefullyfunded"] - proj_df["date_posted"]

# Build the label column fundedin60days from a 60-day cutoff.
cutoff = pd.Timedelta("60 days")
pl.make_dummy(proj_df, "daystofunding", cutoff, gt_cutoff=False, new_col="fundedin60days")

# selected_features = [
#     'school_latitude',
#     'school_longitude',
#     #'school_city',
#     #'school_state',
#     'school_metro', # make dummy, handle nulls
#     #'school_district',
#     #'school_county',
#     'school_charter', # make dummy
#     'school_magnet', # make dummy
#     'teacher_prefix', # make dummy
#     #'primary_focus_subject',
#     #'primary_focus_area', # tbd
#     #'secondary_focus_subject',
#     #'secondary_focus_area',
#     # 'resource_type', # tbd
#     'poverty_level', # encode
#     'grade_level', # encode, handle nulls
#     'total_price_including_optional_support',
#     'students_reached', # handle nulls
#     'eligible_double_your_impact_match' # make dummy
#     #'date_posted',
#     #'datefullyfunded'
# ]
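
For reference, the labelling step in this cell can be sketched in plain pandas. This is only an approximation of what `pl.cols_to_datetime` and `pl.make_dummy` may do, since their implementations are not shown here; in particular, treating exactly 60 days as "within 60 days" (`<=`) is an assumption. The three date pairs are taken from the `head()` output above.

```python
import pandas as pd

# Three date pairs from the head() output, in the dataset's M/D/YY format.
toy = pd.DataFrame({
    "date_posted": ["4/14/13", "1/30/12", "10/11/12"],
    "datefullyfunded": ["5/2/13", "4/15/12", "12/5/12"],
})

# Parse both date columns into datetimes.
for col in ["date_posted", "datefullyfunded"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%y")

# Time from posting to full funding, as a Timedelta column.
toy["daystofunding"] = toy["datefullyfunded"] - toy["date_posted"]

# Label: 1 if the project was funded within the cutoff, else 0.
cutoff = pd.Timedelta("60 days")
toy["fundedin60days"] = (toy["daystofunding"] <= cutoff).astype(int)

print(toy[["daystofunding", "fundedin60days"]])
```

With this approach, a project with a missing 'datefullyfunded' would have a missing 'daystofunding', which compares as False against the cutoff and so receives a label of 0.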

In [None]:
# test_size = 0.3
# x_train, x_test, y_train, y_test = pl.split_data(df, selected_features, label, test_size)
# dec_tree = pl.fit_decision_tree(x_train, y_train)
# threshold = 0.4
# pl.evaluate_model(dec_tree, x_test, y_test, threshold)
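
The commented-out cell above outlines the split, fit, and evaluate steps. Because the `pipeline_library` functions are not shown, the sketch below uses scikit-learn directly on synthetic data to illustrate the same flow. The data, `max_depth=3`, and the choice of precision and recall as metrics are assumptions; only `test_size = 0.3` and `threshold = 0.4` come from the cell.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two numeric features and a noisy binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Hold out 30% of the rows for testing (test_size = 0.3).
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A shallow tree, so that leaf probabilities are not all exactly 0 or 1.
dec_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
dec_tree.fit(x_train, y_train)

# Score each test row with the predicted probability of class 1,
# then classify as positive when the score reaches the threshold.
threshold = 0.4
scores = dec_tree.predict_proba(x_test)[:, 1]
y_pred = (scores >= threshold).astype(int)

print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
```

Lowering the threshold below the default of 0.5 labels more projects as positive, which tends to raise recall at the cost of precision.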