### Summary

### Data sources

1. NOAA NARR
2. USDA 1.88 million wildfires

### Data wrangling - CA only

The main data wrangling challenge for this project is that the training data come from two different sources: NOAA weather data and USDA fire data. Each source must therefore be parsed and formatted before the two can be joined. The hardest part of this process was matching wildfire locations from the USDA data (Fig. 1) to the geospatial bins of the NOAA weather data (Fig. 2). The major steps are:

1. Get NOAA NARR weather data
2. Parse NOAA NARR weather data
3. Parse USDA wildfire data
4. Combine NOAA NARR weather data files
5. Combine weather and fire data
6. Add weather features
7. Fully stratified sampling
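
As a rough illustration of step 5, the sketch below labels each weather row with whether a fire was recorded in the same grid bin on the same day. It assumes the fires have already been regridded to the weather bin centers; the field names (`date`, `lat`, `lon`, `air_temp_k`, `ignition`) and all values are hypothetical and not taken from the project code.

```python
# Hypothetical weather rows, one per (date, grid bin)
weather = [
    {"date": "2015-07-01", "lat": 34.0, "lon": -118.0, "air_temp_k": 305.2},
    {"date": "2015-07-01", "lat": 34.5, "lon": -117.5, "air_temp_k": 301.7},
    {"date": "2015-07-02", "lat": 34.0, "lon": -118.0, "air_temp_k": 307.9},
]

# Hypothetical fires, assumed to be already snapped to weather bin centers
fires = [{"date": "2015-07-02", "lat": 34.0, "lon": -118.0}]

# Set of (date, lat, lon) keys for fast membership checks
fire_keys = {(f["date"], f["lat"], f["lon"]) for f in fires}

# Left join: keep every weather row and add a binary ignition label
training_rows = [
    {**row, "ignition": int((row["date"], row["lat"], row["lon"]) in fire_keys)}
    for row in weather
]
```

Keeping every weather row, including those with no fire, is what gives a classifier negative examples to learn from.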

<p style="text-align: center;"><b>Figure 1: Raw USDA wildfire location data for California</b></p>

![Scatterplot of California wildfire locations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/california_fires_scatterplot.png?raw=true)

<p style="text-align: center;"><b>Figure 2: USDA wildfire location data regridded to match NOAA weather data</b></p>

![Scatterplot of regridded California wildfire locations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/regridded_california_fires_scatterplot.png?raw=true)
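
The regridding shown in Fig. 2 can be sketched as a nearest-neighbor lookup: each fire location is snapped to the closest weather bin center. This is a minimal sketch under stated assumptions, not the project's implementation. The function name `nearest_grid_point`, the grid, and the coordinates are hypothetical, and squared distance in degrees is used as a simplification of true geographic distance.

```python
def nearest_grid_point(lat, lon, grid_points):
    """Return the (lat, lon) grid bin center closest to a fire location."""
    return min(
        grid_points,
        key=lambda p: (p[0] - lat) ** 2 + (p[1] - lon) ** 2,
    )

# Hypothetical weather grid bin centers in degrees, for illustration only
grid_points = [(34.0, -118.0), (34.0, -117.5), (34.5, -118.0), (34.5, -117.5)]

# Hypothetical raw fire locations (lat, lon)
fires = [(34.05, -118.10), (34.40, -117.60)]

# Replace each fire location with its nearest weather bin center
regridded_fires = [nearest_grid_point(lat, lon, grid_points) for lat, lon in fires]
```

A linear scan like this is fine for a small grid; a spatial index would likely be needed for 1.88 million fires.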

### Exploratory data analysis

Initial exploratory data analysis was conducted on the training dataset to get a feel for the shape of the data. Aspects investigated included:

1. Trends in fires over time and space (Fig. 3)
2. Spatial distribution of weather variables (Fig. 4)
3. Numerical distribution of weather variables (Fig. 5)
4. Correlation of weather variables with fire (Fig. 6)
5. Cross-correlation of variables in the dataset (Fig. 7)
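
For item 4, one simple way to quantify how a weather variable relates to fire occurrence is the Pearson correlation between the variable and a binary fire label. The sketch below assumes that approach; the variable names and values are made up for illustration, and the analysis behind the figures may have used a different measure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily air temperatures (K) and fire labels (1 = fire)
air_temp_k = [295.0, 298.0, 301.0, 304.0, 307.0, 310.0]
ignition = [0, 0, 0, 1, 0, 1]

r = pearson_r(air_temp_k, ignition)
```

With a binary label this is the point-biserial correlation: a positive `r` means fires tend to occur at higher values of the variable.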

<p style="text-align: center;"><b>Figure 3: Trends in wildfire occurrence</b></p>

![California wildfires overview](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/fire_data_overveiw.png?raw=true)

### Model selection & optimization

1. Classifier model selection
2. XGBoost optimization
3. Deep neural net optimization
4. Single LSTM optimization
5. Parallel LSTM optimization

![Classifier model selection](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection.png?raw=true)

![Classifier model selection: time to run](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection_time.png?raw=true)

![Classifier model selection: memory](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection_memory.png?raw=true)

![XGBoost confusion matrix: no optimization](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_confusion_matrix_no_opotimization.png?raw=true)

![XGBoost confusion matrix: optimized](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_confusion_matrix_optimized.png?raw=true)
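
For readers unfamiliar with confusion matrices, the sketch below shows how one is tallied for a binary classifier, along with precision and recall. It assumes labels of 1 for fire and 0 for no fire; the labels and predictions are made up and are not results from the models above.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels (1 = fire)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # Rows index the true label, columns the predicted label
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical true labels and model predictions
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1]

matrix = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = matrix

precision = tp / (tp + fp)  # fraction of predicted fires that were real
recall = tp / (tp + fn)     # fraction of real fires that were predicted
```

When fires are rare relative to fire-free days, precision and recall are usually more informative than raw accuracy.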

![XGBoost feature importance](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_feature_importance.png?raw=true)

![MLP learning curves](https://github.com/gperdrizet/wildfire/blob/master/figures/simple_MLP_learning_curves.png?raw=true)

![MLP training predictions](https://github.com/gperdrizet/wildfire/blob/master/figures/simple_MLP_training_predictions.png?raw=true)

![LSTM Cassandra confusion matrix](https://github.com/gperdrizet/wildfire/blob/master/figures/statefull_single_LSTM_cassandra_confusion_maxrix_CA_only.png?raw=true)

![LSTM Cassandra predictions](https://github.com/gperdrizet/wildfire/blob/master/figures/statefull_single_LSTM_cassandra_training_predictions_CA_only.png?raw=true)

![LSTM Cassandra test predictions](https://github.com/gperdrizet/wildfire/blob/master/figures/statefull_single_LSTM_cassandra_training_predictions_CA_only.png?raw=true)

### Deployment

1. Luigi prediction pipeline
2. Flask API

### Scaling - whole US

1. Get NOAA NARR weather data
2. Parse NOAA NARR weather data
3. Parse USDA wildfire data
4. Combine NOAA NARR weather data files
5. Combine weather and fire data
6. Add weather features
7. Fully stratified sampling
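
One plausible reading of step 7 is to draw the same fraction of rows from every stratum, so that the sample keeps the proportions of the full dataset. The sketch below assumes that reading. The function `stratified_sample`, the choice of strata (month and fire label), and the data are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction of rows from each stratum defined by key."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Take at least one row so that rare strata are not dropped
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical rows: two months, 10 rows each, 2 fires per month
rows = [{"month": m, "ignition": int(i < 2)} for m in (6, 7) for i in range(10)]

sample = stratified_sample(
    rows,
    key=lambda r: (r["month"], r["ignition"]),
    fraction=0.5,
)
```

Stratifying on several keys at once matters more at whole-US scale, where a purely random sample could easily under-represent rare combinations such as fires in low-risk months.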