## Data wrangling - CA only

The main data wrangling challenge for this project is that the training data comes from two different sources: NOAA weather data and USDA fire data. Each source must therefore be parsed and formatted before the two can be joined. The hardest part of this process was matching wildfire locations from the USDA data (Fig. 1) to the geospatial bins of the NOAA weather data (Fig. 2). The major steps are:

1. Get NOAA NARR weather data
2. Parse NOAA NARR weather data
3. Parse USDA wildfire data
4. Combine NOAA NARR weather data files
5. Combine weather and fire data
6. Add weather features
7. Fully stratified sampling
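
The regridding part of step 5 can be illustrated with a minimal sketch. This is not the project's actual code: the function name `regrid_to_nearest`, the toy coordinates, and the choice of a brute-force nearest-neighbor search are all assumptions made for illustration. The idea is simply to assign each fire to the closest weather grid point so that fire and weather records share a common spatial key.

```python
import numpy as np


def regrid_to_nearest(fire_lat, fire_lon, grid_lat, grid_lon):
    """Return, for each fire, the index of the nearest weather grid point.

    Hypothetical helper for illustration only. Uses a flat-earth
    approximation, which is reasonable over short distances.
    """
    fire_lat = np.asarray(fire_lat, dtype=float)
    fire_lon = np.asarray(fire_lon, dtype=float)
    grid_lat = np.asarray(grid_lat, dtype=float)
    grid_lon = np.asarray(grid_lon, dtype=float)

    # Pairwise offsets, shape (n_fires, n_grid_points)
    dlat = fire_lat[:, None] - grid_lat[None, :]
    dlon = fire_lon[:, None] - grid_lon[None, :]

    # Shrink longitude offsets by cos(latitude) so that east-west and
    # north-south distances are comparable
    dlon = dlon * np.cos(np.radians(fire_lat))[:, None]

    return np.argmin(dlat**2 + dlon**2, axis=1)


# Toy weather grid: four points (made-up coordinates)
grid_lat = [34.0, 34.0, 34.3, 34.3]
grid_lon = [-118.0, -117.7, -118.0, -117.7]

# Toy fire locations (made-up coordinates)
fire_lat = [34.05, 34.28, 34.10]
fire_lon = [-118.02, -117.72, -117.75]

nearest = regrid_to_nearest(fire_lat, fire_lon, grid_lat, grid_lon)

# Number of fires assigned to each weather grid point
fires_per_bin = np.bincount(nearest, minlength=len(grid_lat))
```

The brute-force search builds an array of size fires × grid points, so for a large fire record a spatial index such as a KD-tree would likely be a better fit.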

<h2 style="text-align: center;"><b>Figure 1: Raw USDA wildfire location data for California</b></h2>

![Scatterplot of California wildfire locations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/california_fires_scatterplot.png?raw=true)

<h2 style="text-align: center;"><b>Figure 2: USDA wildfire location data regridded to match NOAA weather data</b></h2>

![Scatterplot of regridded California wildfire locations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/regridded_california_fires_scatterplot.png?raw=true)

## Exploratory data analysis
Initial exploratory data analysis was conducted on the training dataset to get a feel for the shape of the data. Aspects investigated included:

1. Trends in fires over time and space (Fig. 3)
2. Spatial distribution of weather variables (Fig. 4)
3. Numerical distribution of weather variables (Fig. 5)
4. Correlation of weather variables and fire (Fig. 6)
5. Cross-correlation of variables in the dataset (Fig. 7)
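
As a rough illustration of items 4 and 5, the sketch below computes a Pearson correlation matrix with pandas. The data are synthetic and the column names (`air_temp`, `rel_humidity`, `ignition`) are hypothetical stand-ins, not the variables used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real project uses NOAA NARR variables
rng = np.random.default_rng(0)
n = 500
air_temp = rng.normal(20.0, 5.0, n)
rel_humidity = 80.0 - 1.5 * air_temp + rng.normal(0.0, 5.0, n)

# Made-up rule: ignition is more likely when it is hot and dry
score = air_temp - 0.2 * rel_humidity + rng.normal(0.0, 3.0, n)
ignition = (score > 15.0).astype(int)

df = pd.DataFrame(
    {"air_temp": air_temp, "rel_humidity": rel_humidity, "ignition": ignition}
)

# Cross-correlation of every variable with every other (Pearson by default)
corr = df.corr()

# Correlation of each weather variable with the fire indicator
fire_corr = corr["ignition"].drop("ignition")
```

With a binary fire indicator, the Pearson coefficient in `fire_corr` is the point-biserial correlation, so its sign shows whether a variable tends to be higher or lower when a fire is recorded.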

<h2 style="text-align: center;"><b>Figure 3: Trends in wildfire occurrence</b></h2>

![California wildfires overview](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/fire_data_overveiw.png?raw=true)

<h2 style="text-align: center;"><b>Figure 4: Spatial distribution of weather variables</b></h2>

![weather data heatmaps](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/weather_data_heat_maps.png?raw=true)

<h2 style="text-align: center;"><b>Figure 5: Numerical distribution of weather variables</b></h2>

![weather data distributions](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/weather_data_distributions.png?raw=true)

<h2 style="text-align: center;"><b>Figure 6: Correlation of weather variables and fire</b></h2>

![weather fire correlations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/weather_fire_correlation_lineplot.png?raw=true)

<h2 style="text-align: center;"><b>Figure 7: Weather variable cross correlation</b></h2>

![weather cross correlations](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/weather_data_crossd_correlation.png?raw=true)

## Model selection & optimization

Several types of machine learning models were evaluated for their ability to predict fire from weather data. Ultimately, a custom geospatially parallel LSTM was chosen for its generalizability.

1. Classifier model selection (Fig. 8-10)
2. XGBoost optimization (Fig. 11-13)
3. Deep neural net optimization (Fig. 14-15)
4. Single LSTM optimization (Fig. 16-17)
5. Parallel LSTM optimization (Fig. 18-19)
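
Several of the figures below are confusion matrices. As a reading aid, here is a minimal sketch of how one is computed, assuming a binary fire / no-fire label; the labels and predictions are made-up values, not project results.

```python
import numpy as np


def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix: rows are true labels, columns are predictions."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((2, 2), dtype=int)
    # Add one to cell (true label, predicted label) for every sample
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


# Made-up labels: 1 = fire, 0 = no fire
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # fraction of predicted fires that were real
recall = tp / (tp + fn)     # fraction of real fires that were predicted
```

If fire days are much rarer than non-fire days (an assumption here), accuracy alone can look good for a model that rarely predicts fire, which is why precision and recall are worth reading off the matrix as well.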

<h2 style="text-align: center;"><b>Figure 8: Evaluation of classification models: performance</b></h2>

![Classifier model selection](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection.png?raw=true)

<h2 style="text-align: center;"><b>Figure 9: Evaluation of classification models: run time</b></h2>

![Classifier model selection: time to run](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection_time.png?raw=true)

<h2 style="text-align: center;"><b>Figure 10: Evaluation of classification models: memory use</b></h2>

![Classifier model selection: memory](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/classifier_model_selection_memory.png?raw=true)

<h2 style="text-align: center;"><b>Figure 11: XGBoost optimization: baseline</b></h2>

![XGBoost confusion matrix: no optimization](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_confusion_matrix_no_opotimization.png?raw=true)

<h2 style="text-align: center;"><b>Figure 12: XGBoost optimization: tuned hyperparameters</b></h2>

![XGBoost confusion matrix: optimized](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_confusion_matrix_optimized.png?raw=true)

<h2 style="text-align: center;"><b>Figure 13: XGBoost optimization: feature importance</b></h2>

![XGBoost feature importance](https://github.com/gperdrizet/wildfire_production/blob/master/project_info/figures/xgboost_feature_importance.png?raw=true)

<h2 style="text-align: center;"><b>Figure 14: Deep neural net: learning curves</b></h2>

![MLP learning curves](https://github.com/gperdrizet/wildfire/blob/master/figures/simple_MLP_learning_curves.png?raw=true)

<h2 style="text-align: center;"><b>Figure 15: Deep neural net: example predictions on training data</b></h2>

![MLP training predictions](https://github.com/gperdrizet/wildfire/blob/master/figures/simple_MLP_training_predictions.png?raw=true)

<h2 style="text-align: center;"><b>Figure 16: Single LSTM: training data confusion matrix</b></h2>

![LSTM Cassandra confusion matrix](https://github.com/gperdrizet/wildfire/blob/master/figures/statefull_single_LSTM_cassandra_confusion_maxrix_CA_only.png?raw=true)

<h2 style="text-align: center;"><b>Figure 17: Single LSTM: example training data predictions</b></h2>

![LSTM Cassandra predictions](https://github.com/gperdrizet/wildfire/blob/master/figures/statefull_single_LSTM_cassandra_training_predictions_CA_only.png?raw=true)

<h2 style="text-align: center;"><b>Figure 18: Parallel LSTM: learning curves</b></h2>

![Parallel LSTM learning curves](https://github.com/gperdrizet/wildfire/blob/master/figures/parallel_LSTM_learning_curves_CA_only.png?raw=true)

<h2 style="text-align: center;"><b>Figure 19: Parallel LSTM: confusion matrix</b></h2>

![Parallel LSTM training confusion matrix](https://github.com/gperdrizet/wildfire/blob/master/figures/parallel_statefull_LSTM_confusion_matrix_CA_only.png?raw=true)