# MATH8670 Machine Learning Competition

Presenter: Jeremy James

Date: 12/7/2021

## Data Exploration

- Used [Sweetviz](SWEETVIZ_REPORT.html) to get a quick overview of the data
- Dependent variable analysis
- Differences between training and test data

### Dependent variable analysis
- Dependent variable is not normally distributed
- May need to apply transformation, depending on model
- Extreme values (exactly 0 and greater than 1) make the distribution harder to model
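
As a rough illustration of the kind of transformation mentioned above, the sketch below maps a skewed target onto an approximately normal scale with sklearn's `QuantileTransformer` and then maps it back. The data is synthetic and only mimics the described shape (a spike at 0 and a few values above 1); `n_quantiles` and `output_distribution` are assumptions, not the settings used in the competition.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in for the usage rate: a spike at 0, a skewed bulk
# in (0, 1), and a small tail above 1.
usage_rate = np.concatenate([
    np.zeros(50),
    rng.beta(5, 1.5, size=900),
    1 + rng.exponential(0.2, size=50),
])

# Map the empirical distribution onto a standard normal (assumed settings).
qt = QuantileTransformer(n_quantiles=200, output_distribution="normal", random_state=0)
y_transformed = qt.fit_transform(usage_rate.reshape(-1, 1)).ravel()

# Predictions made on the transformed scale have to be mapped back
# to the original scale before computing MSE.
y_recovered = qt.inverse_transform(y_transformed.reshape(-1, 1)).ravel()
```

The transformer should be fit on the training target only, so that the mapping applied to hold-out predictions does not leak information.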

### Differences between training and test data
- Training data covers 2015 and earlier; test data covers 2016 onwards
- More volatile usage rates in earlier years
- [Notebook analysis](traintestdifferences.html)
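
One simple way to check for a volatility difference like the one described above is to compare the spread of the usage rate by year and by split. The sketch below uses synthetic data built to show that pattern; the column names `year`, `usage_rate`, and `split` are hypothetical, and the linked notebook may take a different approach.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
years = rng.integers(2010, 2020, size=2000)

# Hypothetical data: the noise is deliberately larger in earlier years.
noise_scale = np.where(years <= 2015, 0.30, 0.10)
df = pd.DataFrame({
    "year": years,
    "usage_rate": 0.8 + rng.normal(0, 1, size=2000) * noise_scale,
})
df["split"] = np.where(df["year"] <= 2015, "train", "test")

# Per-year summary and per-split standard deviation of the target.
by_year = df.groupby("year")["usage_rate"].agg(["mean", "std", "count"])
by_split = df.groupby("split")["usage_rate"].std()
```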

## Preparation for Design Matrix
- XGBoost model benefited from transformed usage rate (used sklearn's QuantileTransformer)
- Created categorical and numeric features out of date columns
- Created Final_Allocation to Initial_Allocation ratio
- Created previous grant count columns for organizations and Piids
- Filtered out records prior to 2012 whose usage rate is not between 0 and 1 (1 inclusive)
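
The feature steps above could look roughly like the following in pandas. This is a sketch on a toy frame: `Organization`, `Award_Date`, and `Usage_Rate` are hypothetical column names, the derived column names are made up for illustration, and the filter assumes "between 0 and 1 (including 1)" means the interval (0, 1].

```python
import pandas as pd

# Toy data; all column names except the two allocation columns are hypothetical.
df = pd.DataFrame({
    "Organization": ["A", "B", "A", "A", "B"],
    "Award_Date": pd.to_datetime(
        ["2011-03-01", "2012-06-15", "2013-01-10", "2014-09-30", "2015-02-20"]
    ),
    "Initial_Allocation": [100.0, 200.0, 150.0, 300.0, 250.0],
    "Final_Allocation": [120.0, 180.0, 150.0, 330.0, 250.0],
    "Usage_Rate": [1.4, 0.9, 0.0, 1.0, 0.75],
})
df = df.sort_values("Award_Date").reset_index(drop=True)

# Date columns become one numeric and one categorical feature.
df["Award_Year"] = df["Award_Date"].dt.year
df["Award_Month"] = df["Award_Date"].dt.month.astype("category")

# Ratio of final to initial allocation.
df["Allocation_Ratio"] = df["Final_Allocation"] / df["Initial_Allocation"]

# Number of earlier grants for the same organization (rows are date-sorted).
df["Prev_Org_Grants"] = df.groupby("Organization").cumcount()

# Drop pre-2012 rows whose usage rate falls outside (0, 1] (assumed interval).
is_early = df["Award_Year"] < 2012
in_range = (df["Usage_Rate"] > 0) & (df["Usage_Rate"] <= 1)
filtered = df[~is_early | in_range]
```

Here the previous-grant count is computed before filtering, so dropped rows still count as prior grants; counting after the filter would be an equally defensible choice.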

## Validation Approach
- Train/test split 70/30
- All records equally likely to be in each split
- Compared the MSE of model predictions against a baseline that predicts the training-set average usage rate for every record
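
A minimal sketch of this validation scheme follows, assuming a random 70/30 split and a mean-prediction baseline. The data is synthetic, and sklearn's `GradientBoostingRegressor` is only a stand-in for the models used in the competition.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

# Random 70/30 split: every record is equally likely to land in either part.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Stand-in model (not the EBM or XGBoost model from the competition).
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
model_mse = mean_squared_error(y_te, model.predict(X_te))

# Baseline: predict the training-set mean for every hold-out record.
baseline_pred = np.full_like(y_te, y_tr.mean())
baseline_mse = mean_squared_error(y_te, baseline_pred)

# A positive value means the model beats the constant baseline.
improvement = baseline_mse - model_mse
```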

## Method Selection
- Used XGBoost and InterpretML's Explainable Boosting Machine
- Used test data, as well as Kaggle scores, to determine that EBM fit the data better
- Found XGBoost overfit the data at a typical number of boosting rounds
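
One way to see this kind of overfitting is to track hold-out error as boosting rounds accumulate. The sketch below uses sklearn's `GradientBoostingRegressor` and its `staged_predict` method as a stand-in for XGBoost, on synthetic noisy data; the hyperparameters are chosen only to make the effect visible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X[:, 0] + rng.normal(scale=1.0, size=600)  # deliberately noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

# Stand-in for XGBoost; settings are illustrative assumptions.
gbr = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.3, max_depth=4, random_state=0
).fit(X_tr, y_tr)

# MSE after each boosting round, on the training and hold-out parts.
train_mse = [mean_squared_error(y_tr, p) for p in gbr.staged_predict(X_tr)]
test_mse = [mean_squared_error(y_te, p) for p in gbr.staged_predict(X_te)]

# Round count with the lowest hold-out error (1-based).
best_rounds = int(np.argmin(test_mse)) + 1
```

Training error keeps falling while hold-out error bottoms out early, which is the pattern that would motivate a small number of rounds.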

## Parameter Tuning
- Used test data to determine good value for min_samples_leaf for EBM
- Applied Grid Search Cross Validation with 5 folds to find optimal parameters
- 4 folds were used to train, 1 fold to validate
- Searched over 3 possible values of learning_rate and 5 for min_samples_leaf
- Did not yield a better-performing model on either the test data or Kaggle
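
The grid search described above can be sketched with sklearn's `GridSearchCV`. The grid values here are hypothetical, and `GradientBoostingRegressor` stands in for the EBM because it exposes parameters with the same names (`learning_rate`, `min_samples_leaf`); the actual search presumably passed the EBM estimator instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# 3 x 5 = 15 candidate settings (values are illustrative assumptions).
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "min_samples_leaf": [2, 5, 10, 20, 30],
}

# cv=5: each candidate is trained on 4 folds and validated on the 5th,
# rotating through all 5 folds.
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
best_params = search.best_params_
best_cv_mse = -search.best_score_  # flip the sign back to a plain MSE
```

The parameters with the best cross-validated score are not guaranteed to win on a separate hold-out set, which is consistent with the outcome reported above.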

## Results Table
Data Versions
- v1: Final_Allocation to Initial_Allocation ratio and award/usage date difference columns
- v2: Create categorical features from dates; ensure all categorical features are interpreted as such
- v3: Add previous grant counts for organizations and Piids. Filter out "extreme" data prior to 2012

Tuning Versions
- v1: Use score on hold out data to choose parameter value
- v2: Use Grid Search CV to choose parameter values

```python
import pandas as pd
pd.read_csv('results_table.csv', sep='\t')
```

| Unnamed: 0 | Model | Data Version | Tuning Version | Parameters | MSE |
|---|---|---|---|---|---|
| 0 | EBM | | | Default | 0.38772 |
| 1 | EBM | v1 | v1 | min_samples_leaf=20 | 0.38524 |
| 2 | EBM | v2 | v1 | min_samples_leaf=20 | 0.38807 |
| 3 | XGBoost | v3 | v1 | n_estimators=15 | 0.39912 |
| 4 | EBM | v3 | v1 | min_samples_leaf=20 | 0.38398 |
| 5 | EBM | v3 | v2 | min_samples_leaf=30 | 0.38699 |


## Findings & Summary
- EBM with min_samples_leaf=20 on data version v3 performed best (MSE 0.38398)
- Significant data engineering provided only a small improvement (MSE 0.38524 with data v1 vs 0.38398 with data v3, same tuning)
- [Notebook with model explanations](model_explanations.html)
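
To put the size of the improvement in numbers, here is a small sketch using MSE values copied from the results table; the helper `relative_improvement` is introduced purely for illustration.

```python
# MSE values copied from the results table.
mse_default = 0.38772  # EBM, default parameters
mse_v1 = 0.38524       # EBM, data v1, min_samples_leaf=20
mse_best = 0.38398     # EBM, data v3, min_samples_leaf=20


def relative_improvement(before, after):
    """Fractional reduction in MSE going from `before` to `after`."""
    return (before - after) / before


gain_vs_default = relative_improvement(mse_default, mse_best)
gain_v1_to_v3 = relative_improvement(mse_v1, mse_best)
print(f"{gain_vs_default:.2%} vs default, {gain_v1_to_v3:.2%} from data v1 to v3")
```

The best model reduces MSE by roughly 1% relative to the default EBM, and the step from data v1 to v3 accounts for about 0.3%.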