# 01_Forest_Fire_Tree_Outcomes_EDA

This project builds on a dataset that was published today (22 June 2020):

**The Fire and Tree Mortality Database, for empirical modeling of individual tree mortality after fire** by C. Alina Cansler et al., published 22 June 2020.

Cansler, C.A., Hood, S.M., Varner, J.M. et al. The Fire and Tree Mortality Database, for empirical modeling of individual tree mortality after fire. Sci Data 7, 194 (2020). https://doi.org/10.1038/s41597-020-0522-7



A model that predicts tree recovery from damage measured after a fire can help decide how to manage partially burned trees in parks or on private land.
 
In addition, this data can be combined with other forest and meteorological information to extend the types of predictions and inferences we can make around forest fires.


## Contents

1. Project goals and constraints
2. Data gathering
3. Explore the primary fire and tree data
4. Explore secondary data sets
5. Summary



# 1. Project goals and constraints

Goals: 
- Build a model to predict the survival of a tree based on damage sustained in a fire.
- Examine the primary contributors to the success of the model, and compare them to published model results (see Cansler et al.).

Evaluation:
- TBD - expect to start with linear regression (predicting the number of years of survival) or logistic regression (predicting survival vs. mortality); need to finish more EDA on this and other sources first
- TBD - may try to replicate the work of previous models, in which case I'll use their metrics for comparison (need to read those papers first, though)
- TBD - if I can get to the next level, which is to show that there is additional data that has predictive power (or "identify knowledge gaps that could be addressed in future research", as Cansler says), then I may add more evaluation criteria
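
As a rough illustration of the logistic-regression starting point mentioned under Evaluation, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: `scorch_pct` is a made-up damage measure, `died` is simulated rather than drawn from the database, and the fit uses plain NumPy gradient descent instead of a library model.

```python
import numpy as np

# Synthetic data only: `scorch_pct` is a hypothetical damage measure (0-100)
# and `died` is simulated; neither comes from the FTM database.
rng = np.random.default_rng(0)
scorch_pct = rng.uniform(0, 100, size=500)
true_logit = -3.0 + 0.06 * scorch_pct
died = (rng.uniform(size=500) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Standardize the feature and add an intercept column.
x = (scorch_pct - scorch_pct.mean()) / scorch_pct.std()
X = np.column_stack([np.ones_like(x), x])

# Fit the two weights by gradient descent on the mean log-loss.
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - died) / len(died)

# Training accuracy at a 0.5 probability threshold.
p = 1 / (1 + np.exp(-X @ w))
accuracy = ((p >= 0.5) == (died == 1)).mean()
print(f"slope on standardized scorch: {w[1]:.2f}, training accuracy: {accuracy:.2f}")
```

A positive slope means higher simulated damage goes with higher predicted mortality. On the real data the same idea would need a held-out split and the metrics used in the published models.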

Value: 
- Models for tree survival are used by forest and land management, possibly by homeowners
- Cansler: "pre-fire planning and post-fire decision support"

Time Frame: 
- A functioning model will be completed by 9 July 2020.
- If time permits, this will be extended to include other data sources and insights. See the 'Roadmap' in the Summary section for where the expanding and contracting of focus will occur.

## Proposed methods and models
* **Methods**
  * EDA on core data; identification of supporting data
  * Build a model
      * DecisionTree is required for obvious reasons
      * Replicate the models built by others but using this much fuller dataset
  * Think outside the boxwood
* **Models under consideration**
  * PCA 
  * ...TODO: Fill this in

## Risks & Assumptions

|ID|Description|Possible Mitigation(s)|
|---|---|---|
|1|Kaggle|Make my project better than theirs.|
|2|||

## Revisions

There are no revisions yet. 

# 2. Data gathering

The central data comes from the 2020-06-22 publication of **The Fire and Tree Mortality Database, for empirical modeling of individual tree mortality after fire** by Cansler et al., which is in turn a collection of data from 41 other datasets, some of which were compilations already.

By the end of this project, it is likely that I will have combined this with a couple of other datasets:

|Dataset Name|Source|Link|Notes|
|---|---|---|---|
|**Fire and Tree Mortality Database**|nature.com datasets|https://www.nature.com/articles/s41597-020-0522-7|Burn extent and 0-to-10 year outcomes for 160k+ trees|
|Forest Fires Data Set|UCI, also Kaggle|https://archive.ics.uci.edu/ml/datasets/forest+fires|Dataset intended for predicting forest fires from weather conditions; I would use it to match weather pre-fire to tree outcomes post-fire|
|TBD|TBD|TBD|Weather conditions in the weeks and years following the fire|
|TBD|TBD|TBD|Satellite photos over the relevant years, possibly extending beyond the 10 years in the database|

# 3. Explore the primary fire and tree data (Cansler et al.)

## 3a. Imports and Globals

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
# Global options to display all rows and columns
pd.set_option('display.max_columns', None)
# pd.reset_option('display.max_columns')  # to reset back to limited columns

pd.set_option('display.max_rows', None)

In [7]:
fname_trees = './Data/FTM_trees.csv'
fname_fires = './Data/FTM_fires.csv'

In [8]:
!ls ./Data

In [None]:
# Without low_memory=False, this import raised:
#   DtypeWarning: Columns (4,5,6,7,10,62,63) have mixed types.
#   Specify dtype option on import or set low_memory=False.
# low_memory=False reads the whole file before inferring dtypes, so each
# column gets one consistent dtype (mixed columns become object).
df_trees = pd.read_csv(fname_trees, low_memory=False)
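
The DtypeWarning above lists column positions, not names. One possible way to handle it, sketched here on a toy CSV rather than the real file, is to read identifier-like columns explicitly as strings. The toy rows and the `DBH_cm` column name are assumptions for illustration; `Plot` and `TreeNum` are taken from the data dictionary.

```python
import io
import pandas as pd

# Toy CSV standing in for FTM_trees.csv; the rows are invented and
# `DBH_cm` is a hypothetical column name.
toy_csv = io.StringIO(
    "Plot,TreeNum,DBH_cm\n"
    "1,101,30.5\n"
    "A2,102,41.0\n"
    "3,T-7,18.2\n"
)

# Reading identifier-like columns as strings avoids mixed int/str values.
toy = pd.read_csv(toy_csv, dtype={"Plot": str, "TreeNum": str})
print(toy.dtypes)

# For the real file, the positions in the warning could be mapped to names
# by reading only the header row (left commented out, as it needs the file):
# header = pd.read_csv(fname_trees, nrows=0)
# print(header.columns[[4, 5, 6, 7, 10, 62, 63]].tolist())
```

Once the column names are known, they can go into a `dtype` mapping on the real import instead of relying on inference.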

In [None]:
df_fires = pd.read_csv(fname_fires)

In [None]:
print(f"shape of trees: {df_trees.shape}")
print(f"shape of fires: {df_fires.shape}")

num_trees = df_trees.shape[0]

## 3b. Trees

Data Dictionary

From the file _metadata_RDS-2020-0001.html

|Feature|Description|NA means|Notes|
|---|---|---|---|
|YrFireName||||
|Species||||
|Dataset||||
|Times_burned|number of times this tree was burned|||
|ID|Location within fire other than the plot number that is a location identifier in the original dataset. May include burn unit, region of fire, etc. |NA = not applicable.||
|Plot|Plot number/name from original dataset. |NA = not applicable.||
|TreeNum|Tree tag number from original dataset. |NA = tree numbers were not assessed in the original dataset.||
|Unit|Unit name within a fire from original dataset.|NA = not applicable.||


In [None]:
df_trees['ID'].value_counts()

In [None]:
df_trees.head(3)

In [None]:
df_trees['YrFireName'].value_counts()

In [None]:
df_trees[df_trees['YrFireName']=='1998 - BAND 30']

In [None]:
# Look at how many trees have a recorded status in each post-fire year.
# notnull() counts trees that were assessed that year, not trees that were alive.
for yr in range(11):
    frac = df_trees[f'yr{yr}status'].notnull().sum() / num_trees
    print(f"Fraction with recorded status, year {yr:2d}: {frac:.02f}")
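
Note that `notnull()` measures how many trees were assessed in a given year, not how many were alive. A minimal sketch of separating the two on toy data follows. It assumes the status columns are coded 0 = alive, 1 = dead, and NaN = not assessed; that coding should be checked against the metadata file before using this on `df_trees`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_trees; values are invented.
# ASSUMPTION: 0 = alive, 1 = dead, NaN = not assessed.
toy_trees = pd.DataFrame({
    "yr1status": [0, 0, 1, np.nan],
    "yr2status": [0, 1, 1, np.nan],
    "yr3status": [np.nan, 1, 1, np.nan],
})

status_cols = [c for c in toy_trees.columns if c.endswith("status")]
summary = pd.DataFrame({
    # share of all trees with any recorded status that year
    "frac_assessed": toy_trees[status_cols].notnull().mean(),
    # share alive among assessed trees only (NaN rows are excluded)
    "frac_alive_of_assessed": (toy_trees[status_cols] == 0).sum()
    / toy_trees[status_cols].notnull().sum(),
})
print(summary)
```

Keeping the two fractions side by side makes it easier to see whether a drop in later years reflects mortality or simply fewer trees being revisited.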


## 3c. Fires

In [None]:
df_fires.head()

In [None]:
df_fires['State'].value_counts()
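
To bring fire-level fields such as `State` onto the tree records, the two tables could be joined. The sketch below uses toy frames and assumes both tables share a `YrFireName` key, which should be confirmed against `df_fires.columns`; all values are invented.

```python
import pandas as pd

# Toy stand-ins for df_trees and df_fires; values are invented.
# ASSUMPTION: both tables share a `YrFireName` key.
toy_trees = pd.DataFrame({
    "YrFireName": ["2001 - Alpha", "2001 - Alpha", "2005 - Beta", "2010 - Gamma"],
    "Species": ["SP1", "SP2", "SP1", "SP3"],
})
toy_fires = pd.DataFrame({
    "YrFireName": ["2001 - Alpha", "2005 - Beta"],
    "State": ["CA", "MT"],
})

merged = toy_trees.merge(
    toy_fires,
    on="YrFireName",
    how="left",               # keep every tree, matched or not
    validate="many_to_one",   # raise if a fire appears twice in toy_fires
    indicator=True,           # adds a `_merge` column marking unmatched rows
)

# Fires that have trees but no row in the fires table.
unmatched = merged.loc[merged["_merge"] == "left_only", "YrFireName"].unique()
print(unmatched)
```

A left join keeps the tree count unchanged, and the `_merge` indicator shows which fire names failed to match, which is usually a naming or whitespace mismatch worth checking.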

In [None]:
%whos

# 4. Explore secondary data sets

# 5. Summary

In [None]:
## Identify the data types you are working with

See the data dictionary

In [None]:
## Examine the distributions of your data, numerically and/or visually

In [None]:
## Identify outliers
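
One common way to flag candidate outliers is the 1.5 x IQR rule. The sketch below applies it to a toy column; the `DBH_cm` name and the values are assumptions for illustration, not taken from the database.

```python
import pandas as pd

# Toy measurements; `DBH_cm` is a hypothetical column name for this sketch.
toy = pd.DataFrame({"DBH_cm": [12.0, 15.5, 14.0, 16.2, 13.8, 15.0, 95.0]})

# Quartiles and the 1.5 x IQR fences.
q1, q3 = toy["DBH_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag rather than drop: an unusually large tree may be real, not an error.
toy["is_outlier"] = ~toy["DBH_cm"].between(lower, upper)
print(toy[toy["is_outlier"]])
```

Flagged rows are only candidates for review. With data pooled from many source datasets, it may make more sense to compute the fences per species or per dataset.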

In [None]:
## Identify missing data and look for patterns of missing data
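
Because the database pools many contributed datasets, missing values may cluster by source. A minimal sketch of checking that on toy data follows. `Dataset` mirrors the column of that name in the data dictionary, while `scorch_pct` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_trees; values are invented and `scorch_pct` is hypothetical.
toy = pd.DataFrame({
    "Dataset": ["A", "A", "B", "B", "B"],
    "yr1status": [0, 1, np.nan, np.nan, 0],
    "scorch_pct": [10.0, np.nan, 50.0, 80.0, np.nan],
})

# Overall share of missing values per column, highest first.
missing_overall = toy.isnull().mean().sort_values(ascending=False)

# Share missing per source dataset, to spot fields a contributor never recorded.
missing_by_dataset = (
    toy.drop(columns="Dataset").isnull().groupby(toy["Dataset"]).mean()
)
print(missing_overall)
print(missing_by_dataset)
```

A column that is fully missing for some datasets and complete for others suggests the gap comes from study design, which matters when deciding whether to impute or to model on a subset.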

In [None]:
## Describe how your EDA will inform your modeling decisions and process

## Roadmap and milestones
 - Mon 22 June - **Capstone checkin #2 EDA + Plan**
   - 0 to 60 in 1 day -- data was published this morning, I found it, capstone revised
   - EDA off to the races
  
 - Mon 29 June
   - tree mortality prediction models -- obviously have to try some kind of DecisionTree
   - classification of tree damage?
   - brainstorm of other data and context info complete, top ideas prioritized
  
 - Weds 1 July - **Capstone checkin #3 Progress update**
   - explore possible innovations
   - explore expansions
  
 - Mon 06 July 
   - wind up the expansions; finalize what makes it into the capstone phase
   - organize & hide the stuff that won't finish
   - start writing reports
   
 - Thu 09 July - **Capstone checkin #4 Report writeup**
   - report should be well under way
   
 - Mon 13 July     
   - repo ready
   
 - Weds 15 July - **Capstone checkin #5 Presentation**