In [1]:
import os
import sys
import numpy as np
import pandas as pd
import random
import logging
from datetime import date, datetime, timedelta
from dateutil import parser
import pandas_profiling
import matplotlib.pyplot as plt
import warnings
from ml_utils import Pipeline

warnings.filterwarnings('ignore')

# Homework 3 - Improving the Pipeline
### Justin Cohler

For this assignment, I built out a version of the Pipeline class located in ml_utils (same directory), adding functions that loop through parameter grids for several models.

I included Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forests, Gradient Boosting, and Bagging in the runs below. I also configured SVMs, but they did not finish in a reasonable time on the full dataset, so they are not included in the result set below.

Additionally, I ran temporal validation on 2-month intervals from 6/1/2013 to 12/31/2013, always testing on a 2-month interval with a training window that grows backwards in time. Restricting the runs to this period, rather than the full 3 years, kept compute time manageable. The results dataframe is shown below. Precision-Recall graphs are all in this same directory, named by model and run number.
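The temporal validation scheme described above can be sketched as follows. This is only one possible reading of it, not the code in ml_utils: the helper names `add_months` and `temporal_splits` are hypothetical, each test window is assumed to span exactly two calendar months, and each training window is assumed to cover everything from the start date up to the day before the test window begins.

```python
from datetime import date, timedelta


def add_months(first_of_month, months):
    """Shift a first-of-month date by a whole number of months."""
    total = first_of_month.year * 12 + (first_of_month.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def temporal_splits(start, end, test_months=2):
    """Return (train_start, train_end, test_start, test_end) tuples.

    Assumptions (not taken from ml_utils): `start` is the first day of a
    month, every test window spans exactly `test_months` months, and the
    training window runs from `start` to the day before the test window.
    """
    one_day = timedelta(days=1)
    splits = []
    test_start = add_months(start, test_months)
    while True:
        test_end = add_months(test_start, test_months) - one_day
        if test_end > end:
            break  # skip a trailing window shorter than test_months
        splits.append((start, test_start - one_day, test_start, test_end))
        test_start = add_months(test_start, test_months)
    return splits


splits = temporal_splits(date(2013, 6, 1), date(2013, 12, 31))
for train_start, train_end, test_start, test_end in splits:
    print(f"train {train_start}..{train_end} | test {test_start}..{test_end}")
```

Under these assumptions a trailing window shorter than two months is dropped, so every test set covers the same span of time.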

## The winning model
The best model found was a Random Forest Classifier, with the following properties:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=50, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

And the following result metrics:

- auc-roc: 0.717098
- p_at_1: 0.977011
- p_at_2: 0.961686
- p_at_5: 0.961686
- p_at_10: 0.954406
- p_at_20: 0.915342
- p_at_30: 0.87883
- p_at_50: 0.815995
- r_at_1: 0.0145457
- r_at_2: 0.028635
- r_at_5: 0.0715875
- r_at_10: 0.142091
- r_at_20: 0.272603
- r_at_30: 0.392619
- r_at_50: 0.607609
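For readers unfamiliar with the `p_at_k` and `r_at_k` metrics listed above, here is a minimal sketch of how precision and recall at the top k percent of scored rows are commonly computed. It assumes that k is a percentage of the population and that ties are broken by sort order. The function name `precision_recall_at_k` and the toy data are hypothetical, and the ml_utils implementation may differ.

```python
def precision_recall_at_k(y_true, y_scores, k_percent):
    """Precision and recall when the top k_percent of scores are flagged positive.

    Assumption: k_percent is a percentage of all rows (e.g. 20 means the
    top 20% by score); this may not match the ml_utils definition exactly.
    """
    # Rank rows from highest to lowest score, carrying the true label along.
    ranked = sorted(zip(y_scores, y_true), key=lambda pair: pair[0], reverse=True)
    cutoff = int(len(ranked) * k_percent / 100)
    top_labels = [label for _, label in ranked[:cutoff]]

    true_positives = sum(top_labels)
    total_positives = sum(y_true)
    precision = true_positives / cutoff if cutoff else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Toy data for illustration only: ten rows, four of them positive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

for k in (20, 50):
    p, r = precision_recall_at_k(y_true, y_scores, k)
    print(f"p_at_{k}={p:.2f}  r_at_{k}={r:.2f}")
```

Precision tends to fall and recall tends to rise as k grows, which is the pattern visible in the metrics above.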

The Precision-Recall graph for the optimal RF model is below (all other precision-recall graphs are in this same directory):
!['Precision Recall - Random Forest (Gini, Max Depth of 50)'](RF1.png)


# Parameter Grid
The code below displays the parameter grid for the models, which is housed in the ml_utils Pipeline class.

The pipeline.generate_classifiers() method populates pipeline.classifiers, a dictionary of SKLearn models along with their respective parameter grids.

In [18]:
logger = logging.getLogger('hw3')

pd.options.display.max_rows = 999

pipeline = Pipeline()
pipeline.generate_classifiers()
pipeline.classifiers

{'BAG': {'params': {'bootstrap_features': [False, True],
   'max_features': [5, 20],
   'max_samples': [5],
   'n_estimators': [10]},
  'type': BaggingClassifier(base_estimator=None, bootstrap=True,
           bootstrap_features=False, max_features=1.0, max_samples=1.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)},
 'DT': {'params': {'criterion': ['gini', 'entropy'],
   'max_depth': [5, 50],
   'min_samples_split': [2, 10]},
  'type': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, presort=False, random_state=None,
              splitter='best')},
 'GB': {'params': {'learning_rate': [0.5],
   'max_depth': [1, 5],
   'n_estimators': [5, 10],
   'subsample': [0.5]},
  'type': 
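As an illustration of what looping through one of these parameter grids involves, the sketch below expands the Decision Tree grid shown above into every parameter combination. The helper name `expand_grid` is hypothetical and uses only the standard library; it is not the ml_utils code. scikit-learn's `ParameterGrid` class offers similar functionality.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Grid copied from the 'DT' entry of pipeline.classifiers above.
dt_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 50],
    'min_samples_split': [2, 10],
}

combos = expand_grid(dt_grid)
print(len(combos), "parameter combinations")
for params in combos:
    print(params)
```

Each resulting dictionary could then be passed to a model, for example via `set_params(**params)` on a scikit-learn estimator, before fitting on a training window.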

## Data Exploration

We can draw the following observations from the profile report shown below:
- Most of the projects are from High and Highest Poverty classrooms/schools.
- Literacy is a common subject focus, covering 28,050 of the 94,612 projects (about 29.6%).
- Books, supplies, and technology comprise the majority of the resources teachers request.
- Cities and counties are fairly well distributed, suggesting that Donors Choose has been adopted as a platform across the country.
- Urban schools account for nearly half of the projects (44,432 of 94,612, about 47.0%).
- The vast majority of teachers requesting projects are women (Ms. and Mrs. prefixes together account for close to 90% of the rows).
- Pearson and Spearman correlation matrices look similar, suggesting that outliers aren't a major issue in the data.
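To illustrate why agreement between Pearson and Spearman correlations is a reasonable signal about outliers, here is a small self-contained sketch on made-up data (not the Donors Choose dataset). Pearson correlation is sensitive to extreme values, while Spearman depends only on ranks, so a single outlier pulls the two apart. The helper names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: identical except for one extreme value in the last position.
x = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]
y_outlier = [2, 4, 6, 8, 100]

print("clean:   pearson", round(pearson(x, y_clean), 3), "spearman", round(spearman(x, y_clean), 3))
print("outlier: pearson", round(pearson(x, y_outlier), 3), "spearman", round(spearman(x, y_outlier), 3))
```

When the two matrices look alike across all variable pairs, as reported above, it suggests that no such extreme values dominate the linear relationships.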


In [8]:
pandas_profiling.ProfileReport(projects)

Statistic,Value
Number of variables,116
Number of observations,94612
Total Missing (%),0.0%
Total size in memory,23.9 MiB
Average record size in memory,265.0 B

Variable type,Count
Numeric,12
Categorical,0
Boolean,97
Date,1
Text (Unique),1
Rejected,5
Unsupported,0

0,1
Distinct count,214
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Minimum,2013-06-01 00:00:00
Maximum,2013-12-31 00:00:00

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.020441

0,1
0,92678
1,1934

Value,Count,Frequency (%),Unnamed: 3
0,92678,98.0%,
1,1934,2.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.29179

0,1
0,67005
1,27607

Value,Count,Frequency (%),Unnamed: 3
0,67005,70.8%,
1,27607,29.2%,

0,1
Constant value,30

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.31729

0,1
0,64593
1,30019

Value,Count,Frequency (%),Unnamed: 3
0,64593,68.3%,
1,30019,31.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.1731

0,1
0,78235
1,16377

Value,Count,Frequency (%),Unnamed: 3
0,78235,82.7%,
1,16377,17.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.13398

0,1
0,81936
1,12676

Value,Count,Frequency (%),Unnamed: 3
0,81936,86.6%,
1,12676,13.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.37564

0,1
0,59072
1,35540

Value,Count,Frequency (%),Unnamed: 3
0,59072,62.4%,
1,35540,37.6%,

0,1
Distinct count,94612
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,92078
Minimum,44772
Maximum,139383
Zeros (%),0.0%

0,1
Minimum,44772
5-th percentile,49503
Q1,68425
Median,92078
Q3,115730
95-th percentile,134650
Maximum,139383
Range,94611
Interquartile range,47306

0,1
Standard deviation,27312
Coef of variation,0.29662
Kurtosis,-1.2
Mean,92078
MAD,23653
Skewness,0
Sum,8711636430
Variance,745960000
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
133119,1,0.0%,
90809,1,0.0%,
76464,1,0.0%,
74417,1,0.0%,
80562,1,0.0%,
78515,1,0.0%,
68276,1,0.0%,
66229,1,0.0%,
72374,1,0.0%,
70327,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
44772,1,0.0%,
44773,1,0.0%,
44774,1,0.0%,
44775,1,0.0%,
44776,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
139379,1,0.0%,
139380,1,0.0%,
139381,1,0.0%,
139382,1,0.0%,
139383,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.26525

0,1
0,69516
1,25096

Value,Count,Frequency (%),Unnamed: 3
0,69516,73.5%,
1,25096,26.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.54042

0,1
1,51130
0,43482

Value,Count,Frequency (%),Unnamed: 3
1,51130,54.0%,
0,43482,46.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.03415

0,1
0,91381
1,3231

Value,Count,Frequency (%),Unnamed: 3
0,91381,96.6%,
1,3231,3.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.16018

0,1
0,79457
1,15155

Value,Count,Frequency (%),Unnamed: 3
0,79457,84.0%,
1,15155,16.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.061958

0,1
0,88750
1,5862

Value,Count,Frequency (%),Unnamed: 3
0,88750,93.8%,
1,5862,6.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.025599

0,1
0,92190
1,2422

Value,Count,Frequency (%),Unnamed: 3
0,92190,97.4%,
1,2422,2.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.038082

0,1
0,91009
1,3603

Value,Count,Frequency (%),Unnamed: 3
0,91009,96.2%,
1,3603,3.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.4465

0,1
0,52368
1,42244

Value,Count,Frequency (%),Unnamed: 3
0,52368,55.4%,
1,42244,44.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.27465

0,1
0,68627
1,25985

Value,Count,Frequency (%),Unnamed: 3
0,68627,72.5%,
1,25985,27.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.08648

0,1
0,86430
1,8182

Value,Count,Frequency (%),Unnamed: 3
0,86430,91.4%,
1,8182,8.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.066736

0,1
0,88298
1,6314

Value,Count,Frequency (%),Unnamed: 3
0,88298,93.3%,
1,6314,6.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.055109

0,1
0,89398
1,5214

Value,Count,Frequency (%),Unnamed: 3
0,89398,94.5%,
1,5214,5.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0095231

0,1
0,93711
1,901

Value,Count,Frequency (%),Unnamed: 3
0,93711,99.0%,
1,901,1.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.00297

0,1
0,94331
1,281

Value,Count,Frequency (%),Unnamed: 3
0,94331,99.7%,
1,281,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.011415

0,1
0,93532
1,1080

Value,Count,Frequency (%),Unnamed: 3
0,93532,98.9%,
1,1080,1.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0016277

0,1
0,94458
1,154

Value,Count,Frequency (%),Unnamed: 3
0,94458,99.8%,
1,154,0.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.012831

0,1
0,93398
1,1214

Value,Count,Frequency (%),Unnamed: 3
0,93398,98.7%,
1,1214,1.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.020737

0,1
0,92650
1,1962

Value,Count,Frequency (%),Unnamed: 3
0,92650,97.9%,
1,1962,2.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0031286

0,1
0,94316
1,296

Value,Count,Frequency (%),Unnamed: 3
0,94316,99.7%,
1,296,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.03916

0,1
0,90907
1,3705

Value,Count,Frequency (%),Unnamed: 3
0,90907,96.1%,
1,3705,3.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0037205

0,1
0,94260
1,352

Value,Count,Frequency (%),Unnamed: 3
0,94260,99.6%,
1,352,0.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0073352

0,1
0,93918
1,694

Value,Count,Frequency (%),Unnamed: 3
0,93918,99.3%,
1,694,0.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0091426

0,1
0,93747
1,865

Value,Count,Frequency (%),Unnamed: 3
0,93747,99.1%,
1,865,0.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.032131

0,1
0,91572
1,3040

Value,Count,Frequency (%),Unnamed: 3
0,91572,96.8%,
1,3040,3.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.011161

0,1
0,93556
1,1056

Value,Count,Frequency (%),Unnamed: 3
0,93556,98.9%,
1,1056,1.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.020547

0,1
0,92668
1,1944

Value,Count,Frequency (%),Unnamed: 3
0,92668,97.9%,
1,1944,2.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.29647

0,1
0,66562
1,28050

Value,Count,Frequency (%),Unnamed: 3
0,66562,70.4%,
1,28050,29.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.12986

0,1
0,82326
1,12286

Value,Count,Frequency (%),Unnamed: 3
0,82326,87.0%,
1,12286,13.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.14825

0,1
0,80586
1,14026

Value,Count,Frequency (%),Unnamed: 3
0,80586,85.2%,
1,14026,14.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.029425

0,1
0,91828
1,2784

Value,Count,Frequency (%),Unnamed: 3
0,91828,97.1%,
1,2784,2.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0012049

0,1
0,94498
1,114

Value,Count,Frequency (%),Unnamed: 3
0,94498,99.9%,
1,114,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.014026

0,1
0,93285
1,1327

Value,Count,Frequency (%),Unnamed: 3
0,93285,98.6%,
1,1327,1.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.00090898

0,1
0,94526
1,86

Value,Count,Frequency (%),Unnamed: 3
0,94526,99.9%,
1,86,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.01317

0,1
0,93366
1,1246

Value,Count,Frequency (%),Unnamed: 3
0,93366,98.7%,
1,1246,1.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.011436

0,1
0,93530
1,1082

Value,Count,Frequency (%),Unnamed: 3
0,93530,98.9%,
1,1082,1.1%,

0,1
Correlation,1

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0040904

0,1
0,94225
1,387

Value,Count,Frequency (%),Unnamed: 3
0,94225,99.6%,
1,387,0.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.043884

0,1
0,90460
1,4152

Value,Count,Frequency (%),Unnamed: 3
0,90460,95.6%,
1,4152,4.4%,

First 3 values
73c4c664c4cbf7af922c8f1f1af819c6
8bd8373f5ec7522aa3b0d0378aa693e8
bcf066bea99258d6a03f92f6144c974d

Last 3 values
7aec49d5b5ce11c64f51410e612a6677
bd201da6fbc54afe71ef735e65cc350f
1597d3431f0f1610d6d05cb37535d2ac

Value,Count,Frequency (%),Unnamed: 3
0000b38bbc7252972f7984848cf58098,1,0.0%,
0000ee613c92ddc5298bf63142996a5c,1,0.0%,
00017d99c933cb7888c63abbd807f406,1,0.0%,
0002386291b6c2c659a34664fbaca804,1,0.0%,
00023c177bd5f268d0c7c23894dec129,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
fffc128714a3035d7af7637c4ab30615,1,0.0%,
fffc5f77d0ba9fb9b9510582caa30bdd,1,0.0%,
fffda9a35e156656df0a2e2c7091e9cf,1,0.0%,
fffdf9d286e715165b60674ac9d05c6c,1,0.0%,
fffdfafaf7cc8fdd9b6567051b394da7,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.18819

0,1
0,76807
1,17805

Value,Count,Frequency (%),Unnamed: 3
0,76807,81.2%,
1,17805,18.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.11057

0,1
0,84151
1,10461

Value,Count,Frequency (%),Unnamed: 3
0,84151,88.9%,
1,10461,11.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.3056

0,1
0,65699
1,28913

Value,Count,Frequency (%),Unnamed: 3
0,65699,69.4%,
1,28913,30.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.39157

0,1
0,57565
1,37047

Value,Count,Frequency (%),Unnamed: 3
0,57565,60.8%,
1,37047,39.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0029383

0,1
0,94334
1,278

Value,Count,Frequency (%),Unnamed: 3
0,94334,99.7%,
1,278,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0011415

0,1
0,94504
1,108

Value,Count,Frequency (%),Unnamed: 3
0,94504,99.9%,
1,108,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.093878

0,1
0,85730
1,8882

Value,Count,Frequency (%),Unnamed: 3
0,85730,90.6%,
1,8882,9.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0028221

0,1
0,94345
1,267

Value,Count,Frequency (%),Unnamed: 3
0,94345,99.7%,
1,267,0.3%,

0,1
Distinct count,5457
Unique (%),5.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2565.2
Minimum,0
Maximum,5456
Zeros (%),0.1%

0,1
Minimum,0.0
5-th percentile,238.55
Q1,1134.0
Median,2607.0
Q3,3790.0
95-th percentile,5060.0
Maximum,5456.0
Range,5456.0
Interquartile range,2656.0

0,1
Standard deviation,1524.9
Coef of variation,0.59446
Kurtosis,-1.1475
Mean,2565.2
MAD,1309.8
Skewness,0.045576
Sum,242694296
Variance,2325300
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
2206,2731,2.9%,
836,2452,2.6%,
2760,2052,2.2%,
562,1682,1.8%,
3051,1328,1.4%,
554,1320,1.4%,
3790,1222,1.3%,
3414,1105,1.2%,
200,1009,1.1%,
2263,910,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0,68,0.1%,
1,13,0.0%,
2,21,0.0%,
3,3,0.0%,
4,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5452,27,0.0%,
5453,3,0.0%,
5454,1,0.0%,
5455,1,0.0%,
5456,1,0.0%,

0,1
Distinct count,1393
Unique (%),1.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,677.85
Minimum,0
Maximum,1392
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,66
Q1,361
Median,718
Q3,946
95-th percentile,1284
Maximum,1392
Range,1392
Interquartile range,585

0,1
Standard deviation,360.67
Coef of variation,0.53207
Kurtosis,-0.92046
Mean,677.85
MAD,302.6
Skewness,-0.029352
Sum,64133095
Variance,130080
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
725,4934,5.2%,
534,3587,3.8%,
918,3509,3.7%,
291,2757,2.9%,
811,2150,2.3%,
11,1770,1.9%,
652,1583,1.7%,
324,1377,1.5%,
1074,1320,1.4%,
143,1308,1.4%,

Value,Count,Frequency (%),Unnamed: 3
0,6,0.0%,
1,6,0.0%,
2,3,0.0%,
3,166,0.2%,
4,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1388,199,0.2%,
1389,20,0.0%,
1390,35,0.0%,
1391,1,0.0%,
1392,30,0.0%,

0,1
Distinct count,5385
Unique (%),5.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2598.5
Minimum,0
Maximum,5384
Zeros (%),0.0%

0,1
Minimum,0
5-th percentile,206
Q1,1363
Median,2680
Q3,3644
95-th percentile,4958
Maximum,5384
Range,5384
Interquartile range,2281

0,1
Standard deviation,1442.5
Coef of variation,0.55513
Kurtosis,-0.97084
Mean,2598.5
MAD,1218.3
Skewness,-0.037645
Sum,245850313
Variance,2080800
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
3267,4637,4.9%,
2680,3335,3.5%,
2965,2150,2.3%,
3691,1213,1.3%,
3084,1166,1.2%,
908,930,1.0%,
2137,882,0.9%,
816,836,0.9%,
201,829,0.9%,
3505,765,0.8%,

Value,Count,Frequency (%),Unnamed: 3
0,6,0.0%,
1,6,0.0%,
2,3,0.0%,
3,20,0.0%,
4,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5380,1,0.0%,
5381,28,0.0%,
5382,208,0.2%,
5383,155,0.2%,
5384,105,0.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0066799

0,1
0,93980
1,632

Value,Count,Frequency (%),Unnamed: 3
0,93980,99.3%,
1,632,0.7%,

0,1
Distinct count,25338
Unique (%),26.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,36.591
Minimum,18.249
Maximum,67.258
Zeros (%),0.0%

0,1
Minimum,18.249
5-th percentile,27.792
Q1,33.61
Median,36.294
Q3,40.676
95-th percentile,44.295
Maximum,67.258
Range,49.009
Interquartile range,7.0661

0,1
Standard deviation,5.4535
Coef of variation,0.14904
Kurtosis,1.803
Mean,36.591
MAD,4.3282
Skewness,0.18777
Sum,3461900
Variance,29.74
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
35.334083,151,0.2%,
31.850540999999996,118,0.1%,
35.326623,117,0.1%,
35.327049,103,0.1%,
40.80616,99,0.1%,
41.745142,99,0.1%,
41.170125,88,0.1%,
28.355344,87,0.1%,
37.148140000000005,83,0.1%,
37.271029999999996,82,0.1%,

Value,Count,Frequency (%),Unnamed: 3
18.24914,23,0.0%,
19.316349,2,0.0%,
19.387832,2,0.0%,
19.427968,4,0.0%,
19.456561,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
64.74793199999999,1,0.0%,
64.827225,1,0.0%,
64.84481,1,0.0%,
65.672562,1,0.0%,
67.258157,1,0.0%,

0,1
Distinct count,25335
Unique (%),26.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,-95.162
Minimum,-167.59
Maximum,-66.628
Zeros (%),0.0%

0,1
Minimum,-167.59
5-th percentile,-122.34
Q1,-112.4
Median,-89.73
Q3,-80.692
95-th percentile,-73.825
Maximum,-66.628
Range,100.96
Interquartile range,31.706

0,1
Standard deviation,18.436
Coef of variation,-0.19373
Kurtosis,0.097985
Mean,-95.162
MAD,15.366
Skewness,-0.81727
Sum,-9003500
Variance,339.89
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
-97.46927099999999,151,0.2%,
-81.60578100000001,118,0.1%,
-97.505367,117,0.1%,
-97.555534,103,0.1%,
-73.94802,99,0.1%,
-87.717264,99,0.1%,
-73.208586,88,0.1%,
-81.347334,87,0.1%,
-119.644249,83,0.1%,
-82.939972,82,0.1%,

Value,Count,Frequency (%),Unnamed: 3
-167.58711200000002,1,0.0%,
-165.3263,1,0.0%,
-165.10675,1,0.0%,
-163.95339099999998,12,0.0%,
-162.818719,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-67.82788199999999,1,0.0%,
-67.618477,5,0.0%,
-67.412713,1,0.0%,
-67.242699,1,0.0%,
-66.628036,23,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.076988

0,1
0,87328
1,7284

Value,Count,Frequency (%),Unnamed: 3
0,87328,92.3%,
1,7284,7.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.115

0,1
0,83732
1,10880

Value,Count,Frequency (%),Unnamed: 3
0,83732,88.5%,
1,10880,11.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.28278

0,1
0,67858
1,26754

Value,Count,Frequency (%),Unnamed: 3
0,67858,71.7%,
1,26754,28.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.46962

0,1
0,50180
1,44432

Value,Count,Frequency (%),Unnamed: 3
0,50180,53.0%,
1,44432,47.0%,

0,1
Distinct count,23808
Unique (%),25.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,250950000000
Minimum,10001000000
Maximum,610000000000
Zeros (%),0.0%

0,1
Minimum,10001000000
5-th percentile,60231000000
Q1,120020000000
Median,231080000000
Q3,390440000000
95-th percentile,490090000000
Maximum,610000000000
Range,600000000000
Interquartile range,270420000000

0,1
Standard deviation,157370000000
Coef of variation,0.6271
Kurtosis,-1.2727
Mean,250950000000
MAD,137610000000
Skewness,0.25946
Sum,2.3743e+16
Variance,2.4766e+22
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
231077000925.0,7324,7.7%,
402025000993.0,151,0.2%,
130330001303.0,118,0.1%,
402025001000.0,117,0.1%,
402025001966.0,103,0.1%,
170993000689.0,99,0.1%,
360013905776.0,95,0.1%,
90045000070.0,88,0.1%,
120144007692.0,87,0.1%,
120105005598.0,82,0.1%,

Value,Count,Frequency (%),Unnamed: 3
10000500870.0,1,0.0%,
10000500879.0,7,0.0%,
10000500889.0,24,0.0%,
10000501616.0,2,0.0%,
10000600878.0,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
560509000277.0,1,0.0%,
560609000342.0,1,0.0%,
560609000401.0,1,0.0%,
610000300001.0,1,0.0%,
610000300006.0,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0084027

0,1
0,93817
1,795

Value,Count,Frequency (%),Unnamed: 3
0,93817,99.2%,
1,795,0.8%,

0,1
Correlation,0.95512

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.047901

0,1
0,90080
1,4532

Value,Count,Frequency (%),Unnamed: 3
0,90080,95.2%,
1,4532,4.8%,

0,1
Distinct count,10323
Unique (%),10.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,56325
Minimum,1001
Maximum,99926
Zeros (%),0.0%

0,1
Minimum,1001
5-th percentile,7753
Q1,30058
Median,60620
Q3,85730
95-th percentile,95838
Maximum,99926
Range,98925
Interquartile range,55672

0,1
Standard deviation,30558
Coef of variation,0.54253
Kurtosis,-1.3988
Mean,56325
MAD,27670
Skewness,-0.16
Sum,5329000000
Variance,933790000
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
73160.0,776,0.8%,
73170.0,275,0.3%,
31313.0,184,0.2%,
92620.0,172,0.2%,
77373.0,152,0.2%,
95630.0,151,0.2%,
60085.0,149,0.2%,
75211.0,148,0.2%,
90250.0,145,0.2%,
93313.0,143,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1001.0,2,0.0%,
1020.0,3,0.0%,
1026.0,1,0.0%,
1027.0,2,0.0%,
1030.0,3,0.0%,

Value,Count,Frequency (%),Unnamed: 3
99801.0,3,0.0%,
99827.0,3,0.0%,
99901.0,8,0.0%,
99925.0,1,0.0%,
99926.0,1,0.0%,

0,1
Distinct count,25904
Unique (%),27.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,12964
Minimum,0
Maximum,25903
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,1293.6
Q1,6486.0
Median,12989.0
Q3,19445.0
95-th percentile,24584.0
Maximum,25903.0
Range,25903.0
Interquartile range,12959.0

0,1
Standard deviation,7459.5
Coef of variation,0.5754
Kurtosis,-1.1997
Mean,12964
MAD,6460.6
Skewness,-0.0029274
Sum,1226554872
Variance,55644000
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
23371,151,0.2%,
3149,118,0.1%,
10079,117,0.1%,
550,103,0.1%,
14763,99,0.1%,
7987,95,0.1%,
12836,88,0.1%,
25830,87,0.1%,
13531,82,0.1%,
16378,82,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,2,0.0%,
1,3,0.0%,
2,1,0.0%,
3,4,0.0%,
4,8,0.0%,

Value,Count,Frequency (%),Unnamed: 3
25899,1,0.0%,
25900,1,0.0%,
25901,2,0.0%,
25902,1,0.0%,
25903,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.086067

0,1
0,86469
1,8143

Value,Count,Frequency (%),Unnamed: 3
0,86469,91.4%,
1,8143,8.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.021509

0,1
0,92577
1,2035

Value,Count,Frequency (%),Unnamed: 3
0,92577,97.8%,
1,2035,2.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.052287

0,1
0,89665
1,4947

Value,Count,Frequency (%),Unnamed: 3
0,89665,94.8%,
1,4947,5.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.24067

0,1
0,71842
1,22770

Value,Count,Frequency (%),Unnamed: 3
0,71842,75.9%,
1,22770,24.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.1951

0,1
0,76153
1,18459

Value,Count,Frequency (%),Unnamed: 3
0,76153,80.5%,
1,18459,19.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.043726

0,1
0,90475
1,4137

Value,Count,Frequency (%),Unnamed: 3
0,90475,95.6%,
1,4137,4.4%,

0,1
Correlation,1

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.03081

0,1
0,91697
1,2915

Value,Count,Frequency (%),Unnamed: 3
0,91697,96.9%,
1,2915,3.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.013962

0,1
0,93291
1,1321

Value,Count,Frequency (%),Unnamed: 3
0,93291,98.6%,
1,1321,1.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0052742

0,1
0,94113
1,499

Value,Count,Frequency (%),Unnamed: 3
0,94113,99.5%,
1,499,0.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0167

0,1
0,93032
1,1580

Value,Count,Frequency (%),Unnamed: 3
0,93032,98.3%,
1,1580,1.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0036676

0,1
0,94265
1,347

Value,Count,Frequency (%),Unnamed: 3
0,94265,99.6%,
1,347,0.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.036782

0,1
0,91132
1,3480

Value,Count,Frequency (%),Unnamed: 3
0,91132,96.3%,
1,3480,3.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.027935

0,1
0,91969
1,2643

Value,Count,Frequency (%),Unnamed: 3
0,91969,97.2%,
1,2643,2.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0033928

0,1
0,94291
1,321

Value,Count,Frequency (%),Unnamed: 3
0,94291,99.7%,
1,321,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.025832

0,1
0,92168
1,2444

Value,Count,Frequency (%),Unnamed: 3
0,92168,97.4%,
1,2444,2.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0059612

0,1
0,94048
1,564

Value,Count,Frequency (%),Unnamed: 3
0,94048,99.4%,
1,564,0.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0034985

0,1
0,94281
1,331

Value,Count,Frequency (%),Unnamed: 3
0,94281,99.7%,
1,331,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0052108

0,1
0,94119
1,493

Value,Count,Frequency (%),Unnamed: 3
0,94119,99.5%,
1,493,0.5%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.02599

0,1
0,92153
1,2459

Value,Count,Frequency (%),Unnamed: 3
0,92153,97.4%,
1,2459,2.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.011404

0,1
0,93533
1,1079

Value,Count,Frequency (%),Unnamed: 3
0,93533,98.9%,
1,1079,1.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.023063

0,1
0,92430
1,2182

Value,Count,Frequency (%),Unnamed: 3
0,92430,97.7%,
1,2182,2.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.10125

0,1
0,85033
1,9579

Value,Count,Frequency (%),Unnamed: 3
0,85033,89.9%,
1,9579,10.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.099142

0,1
0,85232
1,9380

Value,Count,Frequency (%),Unnamed: 3
0,85232,90.1%,
1,9380,9.9%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.11247

0,1
0,83971
1,10641

Value,Count,Frequency (%),Unnamed: 3
0,83971,88.8%,
1,10641,11.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.008297

0,1
0,93827
1,785

Value,Count,Frequency (%),Unnamed: 3
0,93827,99.2%,
1,785,0.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.002061

0,1
0,94417
1,195

Value,Count,Frequency (%),Unnamed: 3
0,94417,99.8%,
1,195,0.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.014438

0,1
0,93246
1,1366

Value,Count,Frequency (%),Unnamed: 3
0,93246,98.6%,
1,1366,1.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0034034

0,1
0,94290
1,322

Value,Count,Frequency (%),Unnamed: 3
0,94290,99.7%,
1,322,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.018158

0,1
0,92894
1,1718

Value,Count,Frequency (%),Unnamed: 3
0,92894,98.2%,
1,1718,1.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.020558

0,1
0,92667
1,1945

Value,Count,Frequency (%),Unnamed: 3
0,92667,97.9%,
1,1945,2.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.040967

0,1
0,90736
1,3876

Value,Count,Frequency (%),Unnamed: 3
0,90736,95.9%,
1,3876,4.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0028326

0,1
0,94344
1,268

Value,Count,Frequency (%),Unnamed: 3
0,94344,99.7%,
1,268,0.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.017271

0,1
0,92978
1,1634

Value,Count,Frequency (%),Unnamed: 3
0,92978,98.3%,
1,1634,1.7%,

0,1
Distinct count,713
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,97.434
Minimum,1
Maximum,999
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,14
Q1,23
Median,32
Q3,100
95-th percentile,450
Maximum,999
Range,998
Interquartile range,77

0,1
Standard deviation,157.17
Coef of variation,1.6131
Kurtosis,12.533
Mean,97.434
MAD,93.955
Skewness,3.3635
Sum,9218400
Variance,24701
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
25.0,6739,7.1%,
30.0,5574,5.9%,
20.0,5527,5.8%,
24.0,3957,4.2%,
22.0,3607,3.8%,
100.0,3005,3.2%,
50.0,2613,2.8%,
60.0,2451,2.6%,
150.0,2392,2.5%,
18.0,2374,2.5%,

Value,Count,Frequency (%),Unnamed: 3
1.0,47,0.0%,
2.0,55,0.1%,
3.0,59,0.1%,
4.0,104,0.1%,
5.0,204,0.2%,

Value,Count,Frequency (%),Unnamed: 3
991.0,1,0.0%,
995.0,3,0.0%,
997.0,3,0.0%,
998.0,4,0.0%,
999.0,551,0.6%,

0,1
Distinct count,63724
Unique (%),67.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,31876
Minimum,0
Maximum,63723
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,3228.6
Q1,16002.0
Median,31922.0
Q3,47745.0
95-th percentile,60461.0
Maximum,63723.0
Range,63723.0
Interquartile range,31744.0

0,1
Standard deviation,18343
Coef of variation,0.57547
Kurtosis,-1.1961
Mean,31876
MAD,15879
Skewness,-0.0035666
Sum,3015805060
Variance,336480000
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
45853,45,0.0%,
57309,37,0.0%,
46905,34,0.0%,
43751,34,0.0%,
44613,32,0.0%,
41522,31,0.0%,
8231,31,0.0%,
59432,28,0.0%,
22234,28,0.0%,
13559,27,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,2,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
63719,1,0.0%,
63720,1,0.0%,
63721,1,0.0%,
63722,2,0.0%,
63723,1,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.0080117

0,1
0,93854
1,758

Value,Count,Frequency (%),Unnamed: 3
0,93854,99.2%,
1,758,0.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,2.1139e-05

0,1
0,94610
1,2

Value,Count,Frequency (%),Unnamed: 3
0,94610,100.0%,
1,2,0.0%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.12765

0,1
0,82535
1,12077

Value,Count,Frequency (%),Unnamed: 3
0,82535,87.2%,
1,12077,12.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.48119

0,1
0,49086
1,45526

Value,Count,Frequency (%),Unnamed: 3
0,49086,51.9%,
1,45526,48.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.39114

0,1
0,57605
1,37007

Value,Count,Frequency (%),Unnamed: 3
0,57605,60.9%,
1,37007,39.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.045026

0,1
0,90352
1,4260

Value,Count,Frequency (%),Unnamed: 3
0,90352,95.5%,
1,4260,4.5%,

0,1
Distinct count,49080
Unique (%),51.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,616.45
Minimum,69.01
Maximum,139730
Zeros (%),0.0%

0,1
Minimum,69.01
5-th percentile,162.02
Q1,306.71
Median,447.99
Q3,711.88
95-th percentile,1614.6
Maximum,139730.0
Range,139660.0
Interquartile range,405.17

0,1
Standard deviation,1084.9
Coef of variation,1.7599
Kurtosis,6077.9
Mean,616.45
MAD,353.21
Skewness,55.348
Sum,58324000
Variance,1177000
Memory size,739.2 KiB

Value,Count,Frequency (%),Unnamed: 3
2085.0,710,0.8%,
447.99,591,0.6%,
376.94,307,0.3%,
853.98,303,0.3%,
711.88,258,0.3%,
506.04,252,0.3%,
484.59,232,0.2%,
231.98,206,0.2%,
384.24,193,0.2%,
2268.0,175,0.2%,

Value,Count,Frequency (%),Unnamed: 3
69.01,1,0.0%,
89.38,1,0.0%,
89.45,1,0.0%,
91.71,1,0.0%,
95.01,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
45716.7,1,0.0%,
45953.28,1,0.0%,
57617.24,1,0.0%,
59220.26,1,0.0%,
139725.41,2,0.0%,

0,1
Correlation,1

Unnamed: 0,projectid,teacher_acctid,schoolid,school_ncesid,school_latitude,school_longitude,school_city,school_state,school_zip,school_district,school_county,school_charter,school_magnet,school_year_round,school_nlns,school_kipp,school_charter_ready_promise,teacher_teach_for_america,teacher_ny_teaching_fellow,fulfillment_labor_materials,total_price_excluding_optional_support,total_price_including_optional_support,students_reached,eligible_double_your_impact_match,eligible_almost_home_match,date_posted,grade_level_Grades 3-5,grade_level_Grades 6-8,grade_level_Grades 9-12,grade_level_Grades PreK-2,secondary_focus_subject_Applied Sciences,secondary_focus_subject_Character Education,secondary_focus_subject_Civics & Government,secondary_focus_subject_College & Career Prep,secondary_focus_subject_Community Service,secondary_focus_subject_ESL,secondary_focus_subject_Early Development,secondary_focus_subject_Economics,secondary_focus_subject_Environmental Science,secondary_focus_subject_Extracurricular,secondary_focus_subject_Foreign Languages,secondary_focus_subject_Gym & Fitness,secondary_focus_subject_Health & Life Science,secondary_focus_subject_Health & Wellness,secondary_focus_subject_History & Geography,secondary_focus_subject_Literacy,secondary_focus_subject_Literature & Writing,secondary_focus_subject_Mathematics,secondary_focus_subject_Music,secondary_focus_subject_Nutrition,secondary_focus_subject_Other,secondary_focus_subject_Parent Involvement,secondary_focus_subject_Performing Arts,secondary_focus_subject_Social Sciences,secondary_focus_subject_Special Needs,secondary_focus_subject_Sports,secondary_focus_subject_Visual Arts,secondary_focus_area_Applied Learning,secondary_focus_area_Health & Sports,secondary_focus_area_History & Civics,secondary_focus_area_Literacy & Language,secondary_focus_area_Math & Science,secondary_focus_area_Music & The Arts,secondary_focus_area_Special Needs,primary_focus_area_Applied Learning,primary_focus_area_Health & Sports,primary_focus_area_History & Civics,primary_focus_area_Literacy & Language,primary_focus_area_Math & Science,primary_focus_area_Music & The Arts,primary_focus_area_Special Needs,primary_focus_subject_Applied Sciences,primary_focus_subject_Character Education,primary_focus_subject_Civics & Government,primary_focus_subject_College & Career Prep,primary_focus_subject_Community Service,primary_focus_subject_ESL,primary_focus_subject_Early Development,primary_focus_subject_Economics,primary_focus_subject_Environmental Science,primary_focus_subject_Extracurricular,primary_focus_subject_Foreign Languages,primary_focus_subject_Gym & Fitness,primary_focus_subject_Health & Life Science,primary_focus_subject_Health & Wellness,primary_focus_subject_History & Geography,primary_focus_subject_Literacy,primary_focus_subject_Literature & Writing,primary_focus_subject_Mathematics,primary_focus_subject_Music,primary_focus_subject_Nutrition,primary_focus_subject_Other,primary_focus_subject_Parent Involvement,primary_focus_subject_Performing Arts,primary_focus_subject_Social Sciences,primary_focus_subject_Special Needs,primary_focus_subject_Sports,primary_focus_subject_Visual Arts,teacher_prefix_Dr.,teacher_prefix_Mr.,teacher_prefix_Mrs.,teacher_prefix_Ms.,school_metro_rural,school_metro_suburban,school_metro_urban,poverty_level_high poverty,poverty_level_highest poverty,poverty_level_low poverty,poverty_level_moderate poverty,resource_type_Books,resource_type_Other,resource_type_Supplies,resource_type_Technology,resource_type_Trips,resource_type_Visitors
44772,62526d85d2a1818432d03d600969e99c,58697,9245,171371000000.0,41.972419,-88.174597,271,14,60103.0,1406,370,0,0,0,0,0,0,0,0,30.0,444.36,522.78,7.0,0,0,2013-12-31,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0
44773,33d59ac771b80222ad63ef0f4ac47ade,55346,21904,160153000000.0,43.501154,-112.05678,2247,13,83402.0,2177,121,0,0,0,0,0,0,0,0,30.0,233.24,274.4,30.0,0,0,2013-12-31,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,0,1,0,0,0
44774,1a3aaeffc56dd2a421e37d8298024c0a,60958,14840,330261000000.0,42.888244,-71.320224,1194,30,3038.0,4243,1057,0,0,0,0,0,0,0,0,30.0,285.09,335.4,230.0,0,0,2013-12-31,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,0,0
44775,33aa19ee4da4c5adf47d0dfb84fab5ef,5741,15615,510324000000.0,37.476158,-77.488397,4073,45,23224.0,3959,1041,0,0,0,0,0,0,0,0,30.0,232.94,274.05,18.0,0,0,2013-12-31,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0
44776,e31c0ea8b68f404699dfb0d39e9bc99b,3704,20536,170993000000.0,41.952851,-87.650233,836,14,60613.0,3895,291,0,1,0,0,0,0,0,0,30.0,513.41,604.01,70.0,1,0,2013-12-31,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0


In [4]:
start = parser.parse("2013-06-01")
end = parser.parse("2013-12-31")

logger.info("Ingesting dataframes...")
outcomes = pipeline.ingest('data/outcomes.csv')
projects = pipeline.ingest('data/projects.csv')


Ingesting dataframes...


## Model Running
The run_temporal method runs Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forest, Gradient Boosting, and Bagging, fitting each model once per parameter combination in its grid.
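
As a rough illustration of what looping over parameter grids can look like, here is a minimal sketch built on scikit-learn's `ParameterGrid`. It is not the code in ml_utils: the `MODELS` dictionary, the `run_grid` helper, the Decision Tree grid values, and the synthetic data are all assumptions made for the example. The Logistic Regression grid mirrors the `C` and `penalty` values shown in the log output, with `solver='liblinear'` added so that the l1 penalty is supported.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Hypothetical model registry: key -> (estimator class, parameter grid).
# The DT grid values are illustrative assumptions, not the notebook's grid.
MODELS = {
    "LR": (LogisticRegression,
           {"C": [0.01, 0.1], "penalty": ["l1", "l2"], "solver": ["liblinear"]}),
    "DT": (DecisionTreeClassifier,
           {"criterion": ["gini", "entropy"], "max_depth": [1, 5]}),
}


def run_grid(model_keys, X_train, y_train, X_test, y_test):
    """Fit every parameter combination of every requested model.

    Returns one result row (a dict) per fitted model.
    """
    rows = []
    for key in model_keys:
        estimator_cls, grid = MODELS[key]
        for params in ParameterGrid(grid):
            clf = estimator_cls(**params)
            clf.fit(X_train, y_train)
            # Score with the predicted probability of the positive class.
            scores = clf.predict_proba(X_test)[:, 1]
            rows.append({
                "model_type": key,
                "parameters": params,
                "auc-roc": roc_auc_score(y_test, scores),
            })
    return rows


# Synthetic stand-in data so the sketch runs on its own.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
results = run_grid(["LR", "DT"], X[:300], y[:300], X[300:], y[300:])
```

Each row can then be appended to a results DataFrame, which is how a table like the one in the Results section could be assembled.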

### Baselining
Additionally, run_temporal generates a random baseline, denoted by "RAND". The baseline takes the proportion of fully funded projects in the training set and uses it as the threshold for random predictions on the test set. For example, if 30% of projects in the training set were funded, the baseline draws N random floats between 0 and 1, where N is the length of the test set, and thresholds them at 0.3. These predictions are then scored against the true values in the same way as the actual models, and the result is added to the results DataFrame.
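
The baseline described above can be sketched in a few lines of NumPy. This is an illustration rather than the ml_utils implementation: the function name `random_baseline`, the `seed` argument, and the choice to label a draw as funded when it falls below the threshold are assumptions.

```python
import numpy as np


def random_baseline(y_train, n_test, seed=None):
    """Return n_test random 0/1 predictions whose expected positive rate
    matches the share of positive labels in y_train."""
    rng = np.random.default_rng(seed)
    funded_rate = float(np.mean(y_train))   # e.g. 0.3 if 30% were funded
    draws = rng.random(n_test)              # N uniform floats in [0, 1)
    # Assumption: a draw below the threshold counts as "funded" (1).
    return (draws < funded_rate).astype(int)


# Toy training labels with a 30% positive rate.
y_train = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
preds = random_baseline(y_train, n_test=10_000, seed=0)
```

Because the predictions ignore the features entirely, any model worth keeping should beat this baseline on the same test window.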

In [5]:

# Drop the outcome columns that are not used downstream.
outcomes = outcomes.drop(['is_exciting', 'at_least_1_teacher_referred_donor',  'at_least_1_green_donation', 'great_chat', 'three_or_more_non_teacher_referred_donors',
                          'one_non_teacher_referred_donor_giving_100_plus', 'donation_from_thoughtful_donor', 'great_messages_proportion', 'teacher_referred_count', 'non_teacher_referred_count'], axis=1)

projects["date_posted"] = pd.to_datetime(projects["date_posted"])
# Keep only projects posted inside the validation window; between() includes both endpoints by default.
projects = projects[projects['date_posted'].between(start, end)]

projects = pipeline.presplit(projects)
logger.info("Project dimensions after feature generation: {}".format(projects.shape))

baselines = ['RAND']
results_df = pipeline.run_temporal(['LR', 'KNN', 'DT', 'RF', 'GB', 'BAG'], projects, outcomes, start, end, baselines)

Generating features...
Generated dummy features.
Generated labelized features.
Made binaries from 't' 'f' pair columns.
Finished feature generation.
Project dimensions after feature generation: (94612, 115)
Temporally validating on:
Train: 2013-08-30 00:00:00 - 2013-10-30 00:00:00
Test: 2013-10-31 00:00:00 - 2013-12-31 00:00:00
Prediction window: 2 months
Merging dataframes...
DataFrames merged with dimensions: train(62043, 116), test(26109, 116).
Splitting X and y, train and test...
Running LR...
Running with params {'C': 0.01, 'penalty': 'l1'}
Running with params {'C': 0.01, 'penalty': 'l2'}
Running with params {'C': 0.1, 'penalty': 'l1'}
Running with params {'C': 0.1, 'penalty': 'l2'}
LR finished.
Running KNN...
Running with params {'algorithm': 'ball_tree', 'n_neighbors': 10, 'weights': 'uniform'}
Running with params {'algorithm': 'ball_tree', 'n_neighbors': 10, 'weights': 'distance'}
Running with params {'algorithm': 'ball_tree', 'n_neighbors': 20, 'weights': 'uniform'}
Running wi

DT finished.
Running RF...
Running with params {'max_depth': 5, 'max_features': 'sqrt', 'min_samples_split': 10, 'n_estimators': 10}
Running with params {'max_depth': 50, 'max_features': 'sqrt', 'min_samples_split': 10, 'n_estimators': 10}
RF finished.
Running GB...
Running with params {'learning_rate': 0.5, 'max_depth': 1, 'n_estimators': 5, 'subsample': 0.5}
Running with params {'learning_rate': 0.5, 'max_depth': 1, 'n_estimators': 10, 'subsample': 0.5}
Running with params {'learning_rate': 0.5, 'max_depth': 5, 'n_estimators': 5, 'subsample': 0.5}
Running with params {'learning_rate': 0.5, 'max_depth': 5, 'n_estimators': 10, 'subsample': 0.5}
GB finished.
Running BAG...
Running with params {'bootstrap_features': False, 'max_features': 5, 'max_samples': 5, 'n_estimators': 10}
Running with params {'bootstrap_features': False, 'max_features': 20, 'max_samples': 5, 'n_estimators': 10}
Running with params {'bootstrap_features': True, 'max_features': 5, 'max_samples': 5, 'n_estimators': 10

## Results
Below, the results are sorted by auc-roc score in descending order.

In [17]:
results_df.sort_values(by=['auc-roc'], ascending=False)

Unnamed: 0,model_type,clf,parameters,auc-roc,p_at_1,p_at_2,p_at_5,p_at_10,p_at_20,p_at_30,p_at_50,r_at_1,r_at_2,r_at_5,r_at_10,r_at_20,r_at_30,r_at_50
51,RF,"(DecisionTreeClassifier(class_weight=None, cri...","{'max_depth': 5, 'max_features': 'sqrt', 'min_...",0.717098,0.977011,0.961686,0.961686,0.954406,0.915342,0.87883,0.815995,0.014546,0.028635,0.071587,0.142091,0.272603,0.392619,0.607609
23,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 1, 'n_esti...",0.710538,0.97318,0.971264,0.951724,0.938697,0.906914,0.877298,0.814999,0.014489,0.02892,0.070846,0.139752,0.270093,0.391934,0.606868
55,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 5, 'n_esti...",0.706727,0.957854,0.971264,0.963985,0.949042,0.91381,0.875511,0.806879,0.01426,0.02892,0.071759,0.141293,0.272146,0.391136,0.600821
56,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 5, 'n_esti...",0.70358,0.97318,0.969349,0.96092,0.937548,0.903275,0.868105,0.810556,0.014489,0.028863,0.07153,0.139581,0.269009,0.387827,0.603559
20,RF,"(DecisionTreeClassifier(class_weight=None, cri...","{'max_depth': 5, 'max_features': 'sqrt', 'min_...",0.702334,0.961686,0.95977,0.957088,0.941379,0.905382,0.868488,0.804887,0.014317,0.028578,0.071245,0.140152,0.269637,0.387998,0.599338
54,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 1, 'n_esti...",0.700499,0.977011,0.936782,0.955556,0.928736,0.892166,0.861593,0.820898,0.014546,0.027893,0.071131,0.138269,0.265701,0.384918,0.61126
24,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 5, 'n_esti...",0.700103,0.97318,0.957854,0.951724,0.943295,0.907489,0.867722,0.810709,0.014489,0.028521,0.070846,0.140437,0.270264,0.387656,0.603673
25,GB,([DecisionTreeRegressor(criterion='friedman_ms...,"{'learning_rate': 0.5, 'max_depth': 5, 'n_esti...",0.692787,0.965517,0.95977,0.948659,0.943295,0.906723,0.870787,0.804581,0.014375,0.028578,0.070618,0.140437,0.270036,0.389025,0.59911
47,DT,"DecisionTreeClassifier(class_weight=None, crit...","{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.691617,0.904215,0.948276,0.944828,0.92682,0.909979,0.832482,0.835146,0.013462,0.028236,0.070333,0.137984,0.271006,0.371913,0.62187
48,DT,"DecisionTreeClassifier(class_weight=None, crit...","{'criterion': 'entropy', 'max_depth': 5, 'min_...",0.691617,0.904215,0.948276,0.944828,0.92682,0.909979,0.832482,0.835146,0.013462,0.028236,0.070333,0.137984,0.271006,0.371913,0.62187
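
For readers unfamiliar with the `p_at_k` and `r_at_k` columns, the sketch below shows one common way to compute precision and recall at the top k percent of scored examples. Reading k as a percentage of the test set is an assumption, and `precision_recall_at_k` is a hypothetical helper, not a function taken from ml_utils.

```python
import numpy as np


def precision_recall_at_k(y_true, y_scores, k_percent):
    """Precision and recall when the top k_percent of examples,
    ranked by score, are predicted positive."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    n_selected = int(len(y_scores) * k_percent / 100.0)
    # Indices of the highest-scoring examples first.
    order = np.argsort(-y_scores, kind="stable")
    top = order[:n_selected]
    true_pos = int(y_true[top].sum())
    total_pos = int(y_true.sum())
    precision = true_pos / n_selected if n_selected else 0.0
    recall = true_pos / total_pos if total_pos else 0.0
    return precision, recall


# Toy example: 10 scored projects, 4 of which are truly positive.
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
p20, r20 = precision_recall_at_k(y_true, y_scores, 20)  # (1.0, 0.5)
p50, r50 = precision_recall_at_k(y_true, y_scores, 50)  # (0.6, 0.75)
```

Under this reading, precision tends to fall and recall tends to rise as k grows, which matches the pattern across the `p_at_*` and `r_at_*` columns in the table above.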
