# Data Science Tools 1 - Final Project 

## Wildfire Dataset

#### Nicole Pierick & Elizabeth Bob



## Import Packages & Libraries

In [None]:
import pandas as pd
import plotly.express as px
import numpy as np
import missingno as msno
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium.plugins import HeatMap
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import KNNImputer

## Load the Dataset
WFIGS - Current Wildland Fire Perimeters Dataset from the Wildland Fire Interagency Geospatial Services (WFIGS) Group and National Interagency Fire Center 

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Read every column as text; numeric and datetime columns are converted later
data = pd.read_csv("WFIGS_Wildland_Fire_Perimeters_Full_History.csv", sep=",", dtype=str)

## Describe & Display the Dataset

The wildfire dataset utilized in the following analysis was created by the National Interagency Fire Center. The dataset contains 108 columns and 13555 rows, with each row representing a distinct wildfire between 2021-2022. The attributes encompass geographic, time, and monitoring/administrative information, as well as characteristics of the fire. 

Wildfire Dataset Attributes
<ul>
    <li><b>Incident Name (Polygon) </b>The Incident Name from the source polygon.</li>
<li><b>Feature Category </b>Type of wildland fire perimeter.</li>
<li><b>Map Method </b>Controlled vocabulary to define how the source polygon was derived. Map Method may help define data quality.</li>
<li><b>GIS Acres </b>User-calculated acreage.</li>
<li><b>Polygon Create Date </b>System field. Time stamp for the source polygon feature creation.</li>
<li><b>Polygon Modified Date </b>System field. Time stamp for the most recent edit to the source polygon feature.</li>
<li><b>Polygon Collection Date Time </b>Date time for the source polygon feature collection.</li>
<li><b>Acres Auto Calculated </b>Automated calculation of the source polygon acreage.</li>
<li><b>Polygon Source </b>Data source of the perimeter geometry.</li>
<li><b>ABCD Misc </b>A FireCode used by USDA FS to track and compile cost information for emergency initial attack fire suppression expenditures for A, B, C & D size class fires on FS lands.</li>
<li><b>ADS Permission State </b>Indicates the permission hierarchy that is currently being applied when a system utilizes the UpdateIncident operation.</li>
<li><b>IRWIN Archived On </b>A date set by IRWIN that indicates when an incident's data has met the rules defined for the record to become part of the historical fire records rather than an operational incident record. The value will be set to the current date/time if any of the following criteria are met:
1. ContainmentDateTime or ControlDateTime or FireOutDateTime or ModifiedOnDateTime > 12 months from the current DateTime
2. FinalFireReportDate is not null and ADSPermissionState is 'certified'.</li>
<li><b>Calculated Acres </b>A measure of acres calculated (i.e., infrared) from a geospatial perimeter of a fire.  More specifically, the number of acres within the current perimeter of a specific, individual incident, including unburned and unburnable islands.  The minimum size must be 0.1.</li>
<li><b>Containment Date Time </b>The date and time a wildfire was declared contained.</li>
<li><b>Control Date Time </b>The date and time a wildfire was declared under control.</li>
<li><b>Created By System </b>ArcGIS Server Username of system that created the IRWIN Incident record.</li>
<li><b>IRWIN Created On Date Time </b>Date/time that the IRWIN Incident record was created.</li>
<li><b>Incident Size </b>A measure of acres reported for a fire. More specifically, the number of acres within the current perimeter of a specific, individual incident, including unburned and unburnable islands. The minimum size must be 0.1.
*Field name irwin_DailyAcres. Data from IRWIN IncidentSize field.</li>
<li><b>Discovery Acres </b>An estimate of acres burning upon the discovery of the fire. More specifically when the fire is first reported by the first person that calls in the fire.  The estimate should include number of acres within the current perimeter of a specific, individual incident, including unburned and unburnable islands.</li>
<li><b>Dispatch Center ID </b>A unique identifier for a dispatch center responsible for supporting the incident.</li>
<li><b>Final Fire Report Approved By Title </b>The title of the person that approved the final fire report for the incident.</li>
<li><b>Final Fire Report Approved By Unit </b>NWCG Unit ID associated with the individual who approved the final report for the incident.</li>
<li><b>Final Fire Report Approved Date </b>The date that the final fire report was approved for the incident.</li>
<li><b>Fire Behavior General </b>A general category describing the manner in which the fire is currently reacting to the influences of fuel, weather, and topography.</li>
<li><b>Fire Behavior General 1 </b>A more specific category further describing the general fire behavior (manner in which the fire is currently reacting to the influences of fuel, weather, and topography).</li>
<li><b>Fire Behavior General 2 </b>A more specific category further describing the general fire behavior (manner in which the fire is currently reacting to the influences of fuel, weather, and topography).</li>
<li><b>Fire Behavior General 3 </b>A more specific category further describing the general fire behavior (manner in which the fire is currently reacting to the influences of fuel, weather, and topography).</li>
<li><b>Fire Cause </b>Broad classification of the reason the fire occurred identified as human, natural or unknown.</li>
<li><b>Fire Cause General </b>Agency or circumstance which started a fire or set the stage for its occurrence; source of a fire's ignition. For statistical purposes, fire causes are further broken into specific causes.</li>
<li><b>Fire Cause Specific </b>A further categorization of each General Fire Cause to indicate more specifically the agency or circumstance which started a fire or set the stage for its occurrence; source of a fire's ignition.</li>
<li><b>Fire Code </b>A code used within the interagency wildland fire community to track and compile cost information for emergency fire suppression expenditures for the incident.</li>
<li><b>Fire Department ID </b>The U.S. Fire Administration (USFA) has created a national database of Fire Departments.  Most Fire Departments do not have an NWCG Unit ID and so it is the intent of the IRWIN team to create a new field that includes this data element to assist the National Association of State Foresters (NASF) with data collection.</li>
<li><b>Fire Discovery Date Time </b>The date and time a fire was reported as discovered or confirmed to exist.  May also be the start date for reporting purposes.</li>
<li><b>Fire Mgmt Complexity </b>The highest management level utilized to manage a wildland fire event.</li>
<li><b>Fire Out Date Time </b>The date and time when a fire is declared out.</li>
<li><b>Fire Strategy Confine Percent </b>Indicates the percentage of the incident area where the fire suppression strategy of "Confine" is being implemented.</li>
<li><b>Fire Strategy Full Supp Percent </b>Indicates the percentage of the incident area where the fire suppression strategy of "Full Suppression" is being implemented.</li>
<li><b>Fire Strategy Monitor Percent </b>Indicates the percentage of the incident area where the fire suppression strategy of "Monitor" is being implemented.</li>
<li><b>Fire Strategy Point Zone Percent </b>Indicates the percentage of the incident area where the fire suppression strategy of "Point Zone Protection" is being implemented.</li>
<li><b>FS Job Code </b>A code used to indicate the Forest Service job accounting code for the incident. This is specific to the Forest Service. Usually displayed as a 2 char prefix on FireCode.</li>
<li><b>FS Override Code </b>A code used to indicate the Forest Service override code for the incident.  This is specific to the Forest Service.  Usually displayed as a 4 char suffix on FireCode.  For example, if the FS is assisting DOI, an override of 1502 will be used.</li>
<li><b>GACC </b>A code that identifies one of the wildland fire geographic area coordination centers at the point of origin for the incident.
A geographic area coordination center is a facility that is used for the coordination of agency or jurisdictional resources in support of one or more incidents within a geographic coordination area.</li>
<li><b>ICS 209 Report Date Time </b>The date and time of the latest approved ICS-209 report.</li>
<li><b>ICS 209 Report For Time Period From </b>The date and time of the beginning of the time period for the current ICS-209 submission.</li>
<li><b>ICS 209 Report For Time Period To </b>The date and time of the end of the time period for the current ICS-209 submission.</li>
<li><b>ICS 209 Report Status </b>The version of the ICS-209 report (initial, update, or final). There should never be more than one initial report, but there can be numerous updates, and even multiple finals (as determined by business rules).</li>
<li><b>Incident Management Organization </b>The incident management organization for the incident, which may be a Type 1, 2, or 3 Incident Management Team (IMT), a Unified Command, a Unified Command with an IMT, National Incident Management Organization (NIMO), etc.  This field is null if no team is assigned.</li>
<li><b>Incident Name </b>The name assigned to an incident.</li>
<li><b>Incident Short Description </b>General descriptive location of the incident such as the number of miles from an identifiable town.</li>
<li><b>Incident Type Category </b>The Event Category is a sub-group of the Event Kind code and description. The Event Category further breaks down the Event Kind into more specific event categories.</li>
<li><b>Incident Type Kind </b>A general, high-level code and description of the types of incidents and planned events to which the interagency wildland fire community responds.</li>
<li><b>Initial Latitude </b>The latitude location of the initial reported point of origin specified in decimal degrees.</li>
<li><b>Initial Longitude </b>The longitude location of the initial reported point of origin specified in decimal degrees.</li>
<li><b>Initial Response Acres </b>An estimate of acres burning at the time of initial response. More specifically when the IC arrives and performs initial size up.  The minimum size must be 0.1.  The estimate should include number of acres within the current perimeter of a specific, individual incident, including unburned and unburnable islands.</li>
<li><b>Initial Response Date Time </b>The date/time of the initial response to the incident. More specifically when the IC arrives and performs initial size up.</li>
<li><b>IRWIN ID </b>Unique identifier assigned to each incident record in IRWIN.</li>
<li><b>Is Dispatch Complete </b>An indicator used by external systems to indicate if the criteria have been met to close an incident in their system.
OR
Value of true indicates that the incident record is ready for a fire reporting system to complete and certify the final fire report. This field can be set by a CAD or Ordering system.</li>
<li><b>Is Fire Cause Investigated </b>Indicates if an investigation is underway or was completed to determine the cause of a fire.</li>
<li><b>Is FS Assisted </b>Indicates if the Forest Service provided assistance on an incident outside their jurisdiction.</li>
<li><b>Is Multi Jurisdictional </b>Indicates if the incident covers multiple jurisdictions.</li>
<li><b>Is Reimbursable </b>Indicates the cost of an incident may be another agency’s responsibility.</li>
<li><b>Is Trespass </b>Indicates if the incident is a trespass claim or if a bill will be pursued.</li>
<li><b>Is Unified Command </b>Indicates whether the incident is being managed under Unified Command.  Unified Command is an application of the Incident Command System used when there is more than one agency with incident jurisdiction or when incidents cross political jurisdictions. Under Unified Command, agencies work together through their designated Incident Commanders at a single incident command post to establish common objectives and issue a single Incident Action Plan.</li>
<li><b>Local Incident Identifier </b>A number or code that uniquely identifies an incident for a particular local fire management organization within a particular calendar year.</li>
<li><b>Modified By System </b>ArcGIS Server username of system that last modified the IRWIN Incident record.</li>
<li><b>IRWIN Modified On Date Time </b>Date/time that the IRWIN Incident record was last modified.</li>
<li><b>Organizational Assessment </b>The Organizational Assessment is part of the Wildland Fire Risk and Complexity Assessment (RCA) that was implemented by NWCG in January 2014, which guides Agency Administrators in their management organization selection, both in escalating and moderating situations.  The Organizational Assessment can be compared with the current Incident Management Organization and many other incident level data elements over the life of the fire. It may not always match the current Incident Management Organization value. The authority for producing the Organizational Assessment lies with the incident commander (NWCG PMS-210), thus, the Organizational Assessment can change independent of the published decision.</li>
<li><b>Percent Contained </b>Indicates the percent of incident area that is no longer active.  Reference definition in fire line handbook when developing standard.</li>
<li><b>Percent Perimeter To Be Contained </b>Indicates the percent of perimeter left to be completed. This entry is appropriate for full suppression, point/zone protection, and confine fires, or any combination of these strategies. This entry is not used for wildfires managed entirely under a monitor strategy.  (Note: Value is not currently being passed by ICS-209)</li>
<li><b>POO City </b>The closest city to the incident point of origin.</li>
<li><b>POO County </b>The County Name identifying the county or equivalent entity at point of origin designated at the time of collection.</li>
<li><b>POO Dispatch Center ID </b>A unique identifier for the dispatch center that intersects with the incident point of origin.</li>
<li><b>POO Fips </b>The code which uniquely identifies counties and county equivalents.  The first two digits are the FIPS State code and the last three are the county code within the state.</li>
<li><b>POO Jurisdictional Agency </b>The agency having land and resource management responsibility for an incident as provided by federal, state or local law.</li>
<li><b>POO Jurisdictional Unit </b>NWCG Unit Identifier to identify the unit with jurisdiction for the land where the point of origin of a fire falls.</li>
<li><b>POO Jurisdictional Unit Parent Unit </b>The unit ID for the parent entity, such as a BLM State Office or USFS Regional Office, that resides over the Jurisdictional Unit.</li>
<li><b>POO Landowner Category </b>More specific classification of land ownership within land owner kinds identifying the deeded owner at the point of origin at the time of the incident.</li>
<li><b>POO Landowner Kind </b>Broad classification of land ownership identifying the deeded owner at the point of origin at the time of the incident.</li>
<li><b>POO Legal Desc Principal Meridian </b>The principal meridian of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Legal Desc Qtr </b>The quarter section of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Legal Desc Qtr Qtr </b>The quarter/quarter section of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Legal Desc Range </b>The range of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Legal Desc Section </b>The section of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Legal Desc Township </b>The township of the legal description (section, township, range) of the incident at point of origin.</li>
<li><b>POO Predictive Service Area ID </b>The predictive service area ID where the incident's point of origin is located. Predictive Service Areas (PSAs) are geographic areas of similar climate based on statistical correlation of Remote Automated Weather Stations (RAWS) data.</li>
<li><b>POO Protecting Agency </b>Indicates the agency that has protection responsibility at the point of origin.</li>
<li><b>POO Protecting Unit </b>NWCG Unit responsible for providing direct incident management and services to an incident pursuant to its jurisdictional responsibility or as specified by law, contract or agreement. Definition Extension:
 - Protection can be re-assigned by agreement.
 - The nature and extent of the incident determines protection (for example, Wildfire vs. All Hazard).</li>
<li><b>POO State </b>The State alpha code identifying the state or equivalent entity at point of origin.</li>
<li><b>Predominant Fuel Group </b>The majority fuel model type that best represents fire behavior in the incident area, grouped into one of seven categories.</li>
<li><b>Predominant Fuel Model </b>Describes the type of fuels found within the majority of the incident area.</li>
<li><b>Primary Fuel Model </b>The fuel model which best represents the primary carrier of the fire for the reporting period.</li>
<li><b>Secondary Fuel Model </b>The fuel model which best represents the secondary carrier of the fire for the reporting period.</li>
<li><b>Strategic Decision Publish Date </b>The Decision Publish Date represents the date Agency Administrators published (approved) the Strategic Decision document. New decisions can be created and published at any time until the incident has been called out.</li>
<li><b>Total Incident Personnel </b>The total number of personnel assigned. Includes overhead, crewmembers, helicopter crewmembers, engine crewmembers, camp crew people, etc.</li>
<li><b>Unique Fire Identifier </b>Unique identifier assigned to each wildland fire.  yyyy = calendar year, SSUUUU = POO protecting unit identifier (5 or 6 characters), xxxxxx = local incident identifier (6 to 10 characters)</li>
<li><b>WFDSS Decision Status </b>Indicates the state of the WFDSS decision and/or if a WFDSS decision has been approved for the incident.  This information is helpful in resolving conflicts between incident records.</li>
<li><b>Is Part of Complex </b>Indicates whether the incident is part of a Complex or not. "0" for no, "1" for yes.</li>
<li><b>Complex Name </b>The Incident Name of the Complex that is the parent of the incident.</li>
<li><b>Complex ID </b>The IRWIN ID for the Complex record that is the parent of the incident.</li>
</ul>

In [None]:
data.head(20)

In [None]:
# Display dataset columns, non-null counts, & data types 
data.info(verbose = True, show_counts = True)

The dataset contains features with many null values, as well as features that may not be relevant for the current analysis. Decisions will need to be made with respect to how the null values and unnecessary columns are handled.

## Cleaning the Dataset 

In [None]:
# Remove leading and trailing spaces
data.columns = data.columns.str.strip()

In [None]:
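# Standardize text across rows and columns (lowercase, surrounding whitespace removed)
# Note: in pandas >= 2.1, DataFrame.map is the preferred name for applymap
data = data.applymap(lambda x: x.lower().strip() if isinstance(x, str) else x)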

In [None]:
# Check for duplicate rows
data.duplicated().any()

There do not appear to be any duplicate rows; thus, all rows will remain in the dataset for further processing.

Null values within the dataset can be visualized in a variety of ways that each uniquely inform decisions regarding the missing values.

In [None]:
# Visualize null values as matrix
msno.matrix(data);

In [None]:
# Visualize null values as bar plot
msno.bar(data);

In [None]:
# Visualize null values as dendrogram
msno.dendrogram(data);

Several strategies will be utilized to reduce the number of null values and columns in the dataset.

In [None]:
# Keep only columns with at least 75% non-null values
# (i.e., drop columns where more than 25% of the values are missing)
data = data.dropna(thresh = 0.75*len(data), axis = 1)
print(data.sample(10))
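
As a quick check of what `thresh` does in the cell above, the toy frame below shows that `dropna(thresh=k, axis=1)` retains only the columns with at least `k` non-null values. This is a minimal sketch on made-up data; the column names `mostly_full`, `half_full`, and `mostly_empty` are hypothetical and do not come from the wildfire dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns, four rows each
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],          # 75% non-null
    "half_full": [1.0, np.nan, 3.0, np.nan],         # 50% non-null
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],   # 25% non-null
})

# Minimum number of non-null values a column needs in order to be kept
min_non_null = int(0.75 * len(toy))

kept = toy.dropna(thresh=min_non_null, axis=1)
kept_columns = list(kept.columns)
print(kept_columns)

# Share of missing values per column, useful when choosing a threshold
missing_share = toy.isna().mean()
print(missing_share)
```

Inspecting the per-column share of missing values first may make it easier to justify whichever threshold is chosen.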

In [None]:
# Drop columns that are not relevant for the current analysis
data = data.drop(data.columns[np.r_[0, 2:4, 5:7, 8:11, 15, 19, 21:23, 25:40, 41:50]], axis=1)

The dataset contains two pairs of columns in which both columns reflect the same information. Because redundant columns do not need to be retained, each pair will be evaluated to determine which column is appropriate to drop.

In [None]:
# Dataset contains two features reflecting the incident name
# irwin_IncidentName contains fewer null values
# If irwin_IncidentName is null and poly_IncidentName is populated, add the value to irwin_IncidentName

data['irwin_IncidentName'] = data['irwin_IncidentName'].fillna(data['poly_IncidentName'])

In [None]:
# Check for null values in irwin_IncidentName
data['irwin_IncidentName'].isnull().sum()

There are no longer any missing values for the wildfire incident names. 

In [None]:
# Drop poly_IncidentName
data = data.drop('poly_IncidentName', axis = 1)
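
The fill-then-drop pattern used above for the incident name can be illustrated on a small made-up frame. The sketch below uses hypothetical columns `irwin_name` and `poly_name`; `fillna` with another Series fills each missing value from the row with the same index label, and a row stays null only when both sources are missing.

```python
import pandas as pd

# Made-up values; None marks a missing name
toy = pd.DataFrame({
    "irwin_name": ["alpha", None, None],
    "poly_name": ["alpha fire", "bravo", None],
})

# Prefer irwin_name and fall back to poly_name where irwin_name is null
toy["name"] = toy["irwin_name"].fillna(toy["poly_name"])

remaining_nulls = int(toy["name"].isna().sum())
print(toy["name"].tolist(), remaining_nulls)

# The fallback column is no longer needed once its values have been merged in
toy = toy.drop(columns=["poly_name"])
```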

In [None]:
# Dataset contains two features reflecting acreage and the values do not always align
# poly_Acres_AutoCalc contains fewer null values and is an automated calculation of the source polygon acreage
# poly_GISAcres is user-calculated acreage
# If poly_Acres_AutoCalc is null and poly_GISAcres is populated, add the value to poly_Acres_AutoCalc

data['poly_Acres_AutoCalc'] = data['poly_Acres_AutoCalc'].fillna(data['poly_GISAcres'])

In [None]:
# Check for null values in poly_Acres_AutoCalc
data['poly_Acres_AutoCalc'].isnull().sum()

There are some remaining null values; however, these will be addressed later. 

In [None]:
# Drop poly_GISAcres
data = data.drop('poly_GISAcres', axis = 1)

The remaining null values in the dataset columns relevant for analysis will be remedied using K-nearest neighbors imputation.

In [None]:
# Encode categorical variables with null values for KNN imputation
oe = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value = np.nan)
data['irwin_FireCause'] = oe.fit_transform(data[['irwin_FireCause']])  

In [None]:
# Impute missing values using K-Nearest Neighbors
imputer = KNNImputer(n_neighbors = 5)

data['poly_Acres_AutoCalc'] = imputer.fit_transform(data[['poly_Acres_AutoCalc']])
data['irwin_DailyAcres'] = imputer.fit_transform(data[['irwin_DailyAcres']])
data['irwin_DiscoveryAcres'] = imputer.fit_transform(data[['irwin_DiscoveryAcres']])
data['irwin_FireCause'] = imputer.fit_transform(data[['irwin_FireCause']])

print(data.sample(10))

In [None]:
# Inverse encoding to recover categorical values
data['irwin_FireCause'] = oe.inverse_transform(data[['irwin_FireCause']])  
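
One caveat about the imputation cells above: because each `KNNImputer` call is fitted on a single column, a row that is missing that value has no other feature from which a distance to potential neighbors can be computed. In that situation scikit-learn's `KNNImputer` falls back to the mean of the observed values, so the result matches mean imputation. For an ordinal-encoded categorical column such as `irwin_FireCause`, a mean of category codes need not correspond to any category. The sketch below uses made-up arrays (not the wildfire data) to show the fallback and, for contrast, a fit on two related columns where a neighbor can be found. It is an illustration under these assumptions, not a replacement for the cells above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up values; the third row is missing its first feature
single_column = np.array([[1.0], [3.0], [np.nan], [8.0]])
two_columns = np.array([[1.0, 10.0], [3.0, 30.0], [np.nan, 11.0], [8.0, 80.0]])

# Fitted on one column: no neighbor distances exist for the missing row,
# so the imputed value is the mean of the observed values (1, 3, 8)
single_result = KNNImputer(n_neighbors=2).fit_transform(single_column)
print(single_result[2, 0])

# Fitted on two columns: the second column identifies row 0 as the
# nearest neighbor, so that row's first-column value is used
multi_result = KNNImputer(n_neighbors=1).fit_transform(two_columns)
print(multi_result[2, 0])
```

If several columns are passed together, they would likely need to be on comparable scales so that no single feature dominates the distance.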

In [None]:
# Modify data types

# Convert appropriate columns to string
data[['irwin_FireCause', 'irwin_IncidentName', 'irwin_POOState']] = data[['irwin_FireCause', 
    'irwin_IncidentName', 'irwin_POOState']].astype('string')

# Convert appropriate columns to float
data[['poly_Acres_AutoCalc', 'irwin_DailyAcres', 'irwin_DiscoveryAcres', 'irwin_InitialLatitude',
    'irwin_InitialLongitude', 'SHAPE_Length', 'SHAPE_Area']] = data[['poly_Acres_AutoCalc', 
    'irwin_DailyAcres', 'irwin_DiscoveryAcres', 'irwin_InitialLatitude', 'irwin_InitialLongitude', 
    'SHAPE_Length', 'SHAPE_Area']].astype('float64')

# Convert appropriate columns to datetime
data[['irwin_ContainmentDateTime', 'irwin_ControlDateTime', 'irwin_FireDiscoveryDateTime', 
    'irwin_FireOutDateTime']] = data[['irwin_ContainmentDateTime', 'irwin_ControlDateTime', 
    'irwin_FireDiscoveryDateTime', 'irwin_FireOutDateTime']].apply(pd.to_datetime, 
    format = '%Y-%m-%dT%H:%M:%S.%f%z')

In [None]:
# Display modified data types
data.dtypes

## Feature Engineering

New features will be created from existing features in the dataset to support additional analysis.

In [None]:
# Extract state from irwin_POOState and create new column
data['State'] = data['irwin_POOState'].str.split('-').str[1]

# Modify data type (select the column as a Series rather than a one-column DataFrame)
data['State'] = data['State'].astype('string')

data['State'].sample(10)
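
The extraction above can be sketched on a few made-up codes. The example assumes the values look like `us-ca` (a country prefix, a hyphen, then the state code), in lowercase because all text was lowercased during cleaning; that format is inferred from the split on `'-'` and is an assumption, not something confirmed by the data dictionary. The `.str.upper()` step is optional and only restores the conventional uppercase abbreviation for display.

```python
import pandas as pd

# Made-up codes in the assumed "country-state" format; None marks a missing value
codes = pd.Series(["us-ca", "us-mt", None])

# Take the part after the hyphen, then uppercase it for display
state = codes.str.split("-").str[1].str.upper()

print(state.tolist())
```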

In [None]:
# Calculate duration of fire in days and create new column 
data['fire_Duration'] = data['irwin_ControlDateTime'] - data['irwin_FireDiscoveryDateTime']
data['fire_Duration'] = data['fire_Duration'].dt.total_seconds()/60/60/24

data['fire_Duration'].sample(10)
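
An equivalent way to express the conversion to days is to divide the timedelta by `pd.Timedelta(days=1)`. The sketch below uses made-up timestamps with hypothetical column names `discovered` and `controlled`; it also flags negative durations, which would indicate a control time recorded before the discovery time. Whether such rows exist in the wildfire data is not established here.

```python
import pandas as pd

# Made-up timestamps; the second row has control recorded before discovery
toy = pd.DataFrame({
    "discovered": pd.to_datetime(["2021-07-01T12:00:00+00:00", "2021-07-03T00:00:00+00:00"]),
    "controlled": pd.to_datetime(["2021-07-02T00:00:00+00:00", "2021-07-02T00:00:00+00:00"]),
})

# Dividing a timedelta by one day yields the duration as a float number of days
toy["duration_days"] = (toy["controlled"] - toy["discovered"]) / pd.Timedelta(days=1)

# Rows with a negative duration are candidates for review before aggregation
negative = toy["duration_days"] < 0
print(toy["duration_days"].tolist(), negative.tolist())
```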

In [None]:
# Calculate amount of time between fire containment and control in days and create new column
data['containment_Control'] = data['irwin_ControlDateTime'] - data['irwin_ContainmentDateTime']
data['containment_Control'] = data['containment_Control'].dt.total_seconds()/60/60/24

data['containment_Control'].sample(10)

In [None]:
# Impute missing values using K-Nearest Neighbors
data['fire_Duration'] = imputer.fit_transform(data[['fire_Duration']])
data['containment_Control'] = imputer.fit_transform(data[['containment_Control']])

In [None]:
# Creates year and month columns from Discovery Date Column 
data['Year'] = data['irwin_FireDiscoveryDateTime'].dt.year
data['Month'] = data['irwin_FireDiscoveryDateTime'].dt.month

## Descriptive Statistics

Descriptive statistics will be used to conduct further exploration on several of the features in the dataset. These descriptive statistics include count, mean, median, minimum, maximum, standard deviation, and quartiles.

<h3> Acres </h3>

In [None]:
# Descriptive statistics for acres
data['poly_Acres_AutoCalc'].describe().apply(lambda x: format(x, 'f'))

In [None]:
# Median for acres
median = data['poly_Acres_AutoCalc'].median()
print('median', round(median, 6))

In [None]:
# Total number of acres
data['poly_Acres_AutoCalc'].sum()

<h3> Discovery Acres </h3>

In [None]:
# Descriptive statistics for discovery acres
data['irwin_DiscoveryAcres'].describe().apply(lambda x: format(x, 'f'))

In [None]:
# Median for discovery acres
median = data['irwin_DiscoveryAcres'].median()
print('median', round(median, 6))

<h3> Daily Acres </h3>

In [None]:
# Descriptive statistics for daily acres
data['irwin_DailyAcres'].describe().apply(lambda x: format(x, 'f'))

In [None]:
# Median for daily acres
median = data['irwin_DailyAcres'].median()
print('median', round(median, 6))

<h3> Fire Duration </h3>

In [None]:
# Descriptive statistics for fire duration 
data['fire_Duration'].describe()

In [None]:
# Median for fire duration
median = data['fire_Duration'].median()
print('median', round(median, 6))

<h3> Time Between Control & Containment of Fire </h3>

In [None]:
# Descriptive statistics for amount of time to control fire following containment
data['containment_Control'].describe()

In [None]:
# Median for amount of time to control fire following containment
median = data['containment_Control'].median()
print('median', round(median, 6))

## Aggregation & Pivot Tables

In [None]:
# Pivot table of total acreage burned and total fire duration by state
pivot = np.round(pd.pivot_table(data, values = ['poly_Acres_AutoCalc', 'fire_Duration'], 
                                index = ['State'], 
                                aggfunc = 'sum',
                                fill_value = 0),2)
                                
pivot = pivot.reindex(pivot['poly_Acres_AutoCalc'].sort_values(ascending = False).index)

pivot.style.bar(color = '#33B6CB')

In [None]:
# Pivot Table showing average fire duration and acres by state
pivot = np.round(pd.pivot_table(data, values=['poly_Acres_AutoCalc', 'fire_Duration'], 
                                index=['State'], 
                                aggfunc='mean',
                                fill_value=0),2)
                                
pivot = pivot.reindex(pivot['poly_Acres_AutoCalc'].sort_values(ascending=False).index)

pivot.style.bar(color='#d65f5f')

Per the pivot tables, Arizona had the greatest total fire duration (summed over all of its fires), whereas California had the greatest total acreage burned. Conversely, Vermont had both the lowest total fire duration and the lowest total acreage burned.

In [None]:
# Pivot Table of mean, median, and sum of Fire_Duration and Acres by year
# 2016 - 2019 excluded from the table
np.round(pd.pivot_table((data[data['Year'].isin([2020,2021,2022])]), values=['poly_Acres_AutoCalc','fire_Duration'], 
                                index=['Year'], 
                                aggfunc=['mean', 'median', 'sum'],
                                fill_value=0),2)

The following observations can be made from the above pivot table:

* 2020 had the highest mean acreage impacted yet the lowest mean, median, and total fire duration
* 2021 had the highest mean fire duration, total fire duration, and total acreage impacted but the lowest mean and median acreage impacted
* 2022 had the highest median fire duration yet the lowest total acreage impacted
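
A year can rank highest on total acreage yet lowest on mean and median acreage when it has many fires, most of them small. The toy example below uses made-up numbers (hypothetical `acres` and `duration` columns) to show how the three statistics can diverge. It passes the aggregations by name (`'mean'`, `'median'`, `'sum'`), which pandas accepts in place of the NumPy functions.

```python
import pandas as pd

# Made-up data: 2021 has more fires, most of them small
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021, 2021, 2021],
    "acres": [10.0, 30.0, 5.0, 5.0, 5.0, 50.0],
    "duration": [1.0, 3.0, 2.0, 4.0, 6.0, 8.0],
})

# Columns form a two-level index: (aggregation name, value column)
summary = pd.pivot_table(
    toy,
    values=["acres", "duration"],
    index="Year",
    aggfunc=["mean", "median", "sum"],
).round(2)

print(summary)
```

In this toy table, 2021 has the larger acreage total but the smaller mean and median, which is the same kind of pattern noted in the observations above.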

In [None]:
# Pivot Table of mean, median, and sum of Fire_Duration and Acres by year and month
# 2016 - 2019 excluded from the table
np.round(pd.pivot_table((data[data['Year'].isin([2020,2021,2022])]), values=['poly_Acres_AutoCalc','fire_Duration'], 
                                index=['Year', 'Month'], 
                                aggfunc=['mean', 'median', 'sum'],
                                fill_value=0),2)

The following observations can be made from the above pivot table:

<p> 2020 </p>
<ul>
<li>September 2020 had the highest mean fire duration and acreage impacted, whereas April 2020 had the lowest</li>
<li>December 2020 had the highest median fire duration, whereas February 2020 had the highest median acreage impacted</li>
<li>July 2020 had the lowest median fire duration and May 2020 had the lowest median acreage impacted</li>
<li>August 2020 had the highest total fire duration and acreage impacted</li>
</ul>
<p> 2021 </p>
<ul>
<li>January 2021 had the lowest total fire duration and April 2021 had the lowest total acreage impacted</li>
<li>July 2021 had the highest mean fire duration and acreage impacted, while January 2021 had the lowest mean fire duration and November 2021 had the lowest mean acreage impacted</li>
<li>November 2021 had the highest median fire duration and December 2021 had the highest median acreage impacted</li>
<li>May 2021 had the lowest median fire duration and acreage impacted</li>
<li>July 2021 had the highest total fire duration and acreage impacted</li>
<li>January 2021 had the lowest total fire duration and November 2021 had the lowest total acreage impacted </li>
</ul>
<p> 2022 </p>
<ul>
<li>February 2022 had the highest mean fire duration and June 2022 had the highest mean acreage impacted</li>
<li>September 2022 had the lowest mean and median fire duration and January 2022 had the lowest mean acreage impacted, total fire duration, and total acreage impacted</li>
</ul>

## Cross Tabulation & Countplots

In [None]:
# Pandas cross tabulate: fire cause by state
cause_state_table = pd.crosstab(data.irwin_FireCause, data.State)
cause_state_table

In [None]:
# Countplot - fires per state
sns.set(rc={"figure.figsize":(20, 10)}) 
ax = sns.countplot(x='State', data=data, order = data['State'].value_counts().index)
plt.title('Number of Fires Per State', fontsize=20)
plt.show()

From the plot, it is apparent that Montana had the most wildfires, while New Jersey had the fewest.

In [None]:
# Countplot - fire cause
sns.set(rc={"figure.figsize":(20, 10)}) 
ax = sns.countplot(x='irwin_FireCause', data=data, order = data['irwin_FireCause'].value_counts().index)
plt.title('Number of Fires Per Cause', fontsize=20)
plt.xlabel('Fire Cause')
plt.show()

The fire cause count plot indicates that most of the wildfires were caused by humans. Fires with an unknown cause accounted for the fewest wildfires.

In [None]:
# Countplot - fire causes per state
sns.set(rc={"figure.figsize":(20, 10)}) 
ax = sns.countplot(x='State', data=data, hue='irwin_FireCause', order = data['State'].value_counts().index)
plt.title('Fire Causes Per State', fontsize=20)
plt.legend(title='Fire Cause', loc='upper right')
plt.show()

Wildfires in all but three states could be primarily attributed to human causes. In Alaska and Oregon the primary cause was natural, while in Texas the primary cause was undetermined.
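The primary cause per state can also be read from a cross tabulation instead of the plot. The following is a minimal sketch on a small made-up frame (the rows are illustrative and are not taken from the WFIGS data); it only assumes the column names `State` and `irwin_FireCause` used above, and the names `toy`, `state_cause`, and `primary_cause` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, `data` would be used instead.
toy = pd.DataFrame({
    "State": ["MT", "MT", "MT", "AK", "AK", "TX", "TX", "TX"],
    "irwin_FireCause": ["Human", "Human", "Natural",
                        "Natural", "Natural",
                        "Undetermined", "Undetermined", "Human"],
})

# Rows are states, columns are causes, cells are fire counts.
state_cause = pd.crosstab(toy["State"], toy["irwin_FireCause"])

# For each state, pick the cause (column label) with the largest count.
primary_cause = state_cause.idxmax(axis=1)
print(primary_cause)
```

If two causes tie within a state, `idxmax` returns whichever comes first in column order, so ties may be worth checking separately.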

In [None]:
# countplot - fire cause counts per year
# 2016 - 2019 excluded from the plot
sns.set(rc={"figure.figsize":(20, 10)}) 
ax = sns.countplot(x='Year', hue='irwin_FireCause', data=(data[data['Year'].isin([2020,2021,2022])]))
plt.title('Fire Cause Counts Per Year', fontsize=20)
plt.show()

Human-caused fires made up the majority of fires for each year.
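One way to put a number on this observation is to convert the yearly counts into within-year shares. The sketch below uses made-up rows (not the WFIGS data) and assumes the `Year` and `irwin_FireCause` columns used above; `toy`, `cause_share`, and `human_share` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows; in the notebook, the filtered `data` would be used.
toy = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2021, 2022, 2022, 2022],
    "irwin_FireCause": ["Human", "Human", "Natural",
                        "Human", "Human", "Human", "Natural",
                        "Human", "Human", "Undetermined"],
})

# normalize="index" divides each row by its total, giving shares within each year.
cause_share = pd.crosstab(toy["Year"], toy["irwin_FireCause"], normalize="index")

# Share of human-caused fires per year; values above 0.5 indicate a majority.
human_share = cause_share["Human"]
print(human_share.round(2))
```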

## Bar Plots

The following bar plots visually illustrate the results of the pivot tables above.

In [None]:
# Bar plot of mean fire_duration by state
plot_order = data.groupby('State')['fire_Duration'].mean().reset_index().sort_values('fire_Duration')
ax = sns.barplot(data=data,
            x='State',
            y='fire_Duration',
            estimator = np.mean,
            color = 'steelblue',
            ci=None,
            order = plot_order['State'])
plt.title('Mean Fire Duration Per State', fontsize=20)
plt.show()

In [None]:
# Bar plot of mean acres by state
plot_order = data.groupby('State')['poly_Acres_AutoCalc'].mean().reset_index().sort_values('poly_Acres_AutoCalc')
ax = sns.barplot(data=data,
            x='State',
            y='poly_Acres_AutoCalc',
            estimator = np.mean,
            color = 'darkseagreen',
            ci=None,
            order = plot_order['State'])
plt.title('Mean Acres Burned Per State', fontsize=20)
plt.show()

In [None]:
# Bar plot of total fire_duration by state
plot_order = data.groupby('State')['fire_Duration'].sum().reset_index().sort_values('fire_Duration')
ax = sns.barplot(data = data,
            x = 'State',
            y = 'fire_Duration',
            estimator = np.sum,
            color = 'purple',
            ci = None,
            order = plot_order['State'])
plt.title('Total Fire Duration by State', fontsize = 20)
plt.show()

In [None]:
# Bar plot of total acres by state
plot_order = data.groupby('State')['poly_Acres_AutoCalc'].sum().reset_index().sort_values('poly_Acres_AutoCalc')
ax = sns.barplot(data = data,
            x = 'State',
            y = 'poly_Acres_AutoCalc',
            estimator = np.sum,
            color = 'lightcoral',
            ci=None,
            order = plot_order['State'])
plt.title('Total Acres Burned by State', fontsize=20)
plt.show()

## Correlation

In [None]:
# Correlation between state and fire cause
# Encoded values go into new columns (State_code, FireCause_code) so the
# original labels remain available for the map visuals later in the notebook
data['State_code'] = oe.fit_transform(data[['State']])
data['FireCause_code'] = oe.fit_transform(data[['irwin_FireCause']])
data['State_code'].corr(data['FireCause_code'])

In [None]:
# Correlation between state and acreage
data['State_code'].corr(data['poly_Acres_AutoCalc'])

In [None]:
# Correlation between state and fire duration
data['State_code'].corr(data['fire_Duration'])

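There does not appear to be any correlation between state and fire cause, acreage impacted, or fire duration. Because the ordinal encoding assigns integer codes to the categories in an arbitrary order, these coefficients should be read only as a rough check.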
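Since state and fire cause are both nominal variables, an association measure built on the contingency table, such as Cramér's V, is one alternative to correlating ordinal codes. The following is a minimal sketch under stated assumptions: `cramers_v` is a hypothetical helper that is not part of the notebook, the toy rows are made up, and the statistic is computed without any bias or continuity correction.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V for two categorical series (0 = no association, 1 = perfect).

    Assumes each series has at least two distinct categories.
    """
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: (row total * column total) / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Normalize by the sample size and the smaller table dimension minus one
    k = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

# Hypothetical toy rows in which each state has exactly one cause.
toy = pd.DataFrame({
    "State": ["MT", "MT", "AK", "AK", "TX", "TX"],
    "irwin_FireCause": ["Human", "Human", "Natural", "Natural",
                        "Undetermined", "Undetermined"],
})
v_perfect = cramers_v(toy["State"], toy["irwin_FireCause"])
print(v_perfect)
```

In the notebook this would be called on the original label columns, for example `cramers_v(data['State'], data['irwin_FireCause'])`, before any ordinal encoding overwrites them.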

## Map Visuals

In [None]:
# Plotly - Fire Cause
fig = px.scatter_mapbox(data, lat="irwin_InitialLatitude", lon="irwin_InitialLongitude", hover_name="irwin_FireCause", hover_data=["irwin_InitialLatitude", "irwin_InitialLongitude"],
                        color="irwin_FireCause", zoom=5, height=500)
fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

In [None]:
# Heat Map using Folium

heat_map = folium.Map(location=[48, -102],
                    zoom_start = 10, min_zoom=1) 

# Keep only the rows where both coordinates are present
heat_df = data[['irwin_InitialLatitude', 'irwin_InitialLongitude']].dropna()

# HeatMap expects a list of [latitude, longitude] pairs
heat_data = heat_df.values.tolist()

HeatMap(heat_data, radius=10).add_to(heat_map)

heat_map

## Future Directions

Further analysis could be conducted and additional datasets utilized in order to extend the results of the current analysis.

* The incorporation of climate/weather data would allow for exploration into the relationship between wildfire activity and aspects of climate/weather such as temperature, humidity, wind speed, wind direction, and precipitation
* Further and more detailed visualization of the dataset could be obtained by plotting the size and shape of each wildfire at its respective location
* Predictive models could be developed to determine future wildfire location, duration, and severity, which has the potential to inform wildfire prevention and containment
* The present analysis could be employed to influence wildfire awareness and safety education