## Analysis of Wildfires in California

In this project, we analyze data on wildfires that occurred in California between 2013 and 2020 and assess their impact on the environment, for example on air quality and temperature.

In [None]:
import numpy as np
import pandas as pd

First, let's import the dataset and examine its structure.

In [38]:
raw_df = pd.read_csv("Wildfire-Analysis-1.csv")
print(f"The size of the raw dataset is {raw_df.shape}")
raw_df.sample(7)

The size of the raw dataset is (1636, 40)


Unnamed: 0,AcresBurned,Active,AdminUnit,AirTankers,ArchiveYear,CalFireIncident,CanonicalUrl,ConditionStatement,ControlStatement,Counties,...,SearchKeywords,Started,Status,StructuresDamaged,StructuresDestroyed,StructuresEvacuated,StructuresThreatened,UniqueId,Updated,WaterTenders
755,200.0,False,California National Guard,,2017,False,/incidents/2017/8/22/range-fire/,,,Alameda,...,"Range Fire, Alameda County, Camp Parks Dublin,...",2017-08-22T14:00:00Z,Finalized,,,,,89b5d4db-ce4e-43b9-8618-f89088d35ee1,2018-01-09T12:44:00Z,
1009,14.0,False,CAL FIRE San Benito-Monterey Unit,,2017,True,/incidents/2017/6/9/hog-fire/,,,Monterey,...,"Hog Canyon, Ranchita Canyon Rd, Parkfield, Jun...",2017-06-09T16:15:00Z,Finalized,,,,,b9142a59-e3dd-4797-b6ed-c5c847b16f67,2018-01-09T10:29:00Z,
720,460.0,False,Unified Command: CAL FIRE San Bernardino-Inyo-...,,2017,True,/incidents/2017/7/14/bridge-fire/,,,San Bernardino,...,"Bridge Fire, July 2017, Greenspot, Santa Ana C...",2017-07-14T14:23:00Z,Finalized,,,,,1348f32a-510f-4ad3-971f-b75b9cf9f0c0,2018-04-12T14:52:00Z,
307,1049.0,False,CAL FIRE/Riverside County Fire,,2015,True,/incidents/2015/4/18/highway-fire/,,,Riverside,...,"Highway Fire, Riverside Co., Hwy 71, Hwy 91, P...",2015-04-18T18:12:00Z,Finalized,,,,,8e3c0b14-01e5-4c41-8641-ff987ac10ad9,2015-04-24T07:30:00Z,
442,7050.0,False,CAL FIRE Fresno Kings Unit,,2016,True,/incidents/2016/8/9/mineral-fire/,Mineral Fire is 100% contained. For more infor...,,Fresno,...,"Mineral Fire, August 9, 2016, Fresno County, C...",2016-08-09T13:08:00Z,Finalized,,2.0,,0.0,346c59b1-375e-4d42-876a-d97b769c939f,2016-08-18T19:00:00Z,
1100,1756.0,False,CAL FIRE Madera-Mariposa-Merced Unit,,2018,True,/incidents/2018/5/2/nees-fire/,,,Merced,...,"Nees Fire, Interstate 5, W Nees Ave, Los Banos...",2018-05-02T16:00:00Z,Finalized,,,,,92c441f3-4729-4ba5-890f-fb7693725275,2019-01-04T10:26:00Z,
229,120.0,False,CAL FIRE Nevada-Placer-Yuba Unit / Placer Coun...,,2014,True,/incidents/2014/1/22/brewer-fire/,,,Placer,...,"Brewer Fire, Brewer, South Brewer Road, Rosevi...",2014-01-22T13:10:00Z,Finalized,,,,,072e04ed-1f34-45e8-b7c2-3788ebf3be4c,2014-01-22T16:35:00Z,


The dataset includes 1,636 wildfire incidents recorded between 2013 and 2020, with 40 columns providing detailed information about each incident.

In [39]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1636 entries, 0 to 1635
Data columns (total 40 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   AcresBurned           1633 non-null   float64
 1   Active                1636 non-null   bool   
 2   AdminUnit             1636 non-null   object 
 3   AirTankers            28 non-null     float64
 4   ArchiveYear           1636 non-null   int64  
 5   CalFireIncident       1636 non-null   bool   
 6   CanonicalUrl          1636 non-null   object 
 7   ConditionStatement    284 non-null    object 
 8   ControlStatement      105 non-null    object 
 9   Counties              1636 non-null   object 
 10  CountyIds             1636 non-null   object 
 11  CrewsInvolved         171 non-null    float64
 12  Dozers                123 non-null    float64
 13  Engines               191 non-null    float64
 14  Extinguished          1577 non-null   object 
 15  Fatalities           
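
The `info()` output shows that several columns are almost empty (for example, `AirTankers` has 28 non-null values out of 1636). A minimal sketch of ranking columns by completeness is given below. The toy frame and the name `non_null_share` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_df (hypothetical values, not the real dataset).
toy_df = pd.DataFrame({
    "AcresBurned": [200.0, 14.0, np.nan, 1049.0],
    "AirTankers": [np.nan, np.nan, np.nan, 2.0],
    "Counties": ["Alameda", "Monterey", "Fresno", "Riverside"],
})

# Share of non-null values per column, most complete column first.
non_null_share = toy_df.notna().mean().sort_values(ascending=False)
print(non_null_share)
```

Applied to `raw_df`, the same expression would give a quick view of which columns are complete enough to be useful.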

From the columns available in the dataset, the following are relevant to our analysis and will be extracted for further processing.

| Column Name         | Description                                                                                          |
|---------------------|:----------------------------------------------------------------------------------------------------:|
| AcresBurned         | Acres of land affected by the wildfire                                                               |
| AdminUnit           | Administrative jurisdiction of the California fire department                                        |
| ArchiveYear         | Year in which the fire incident was reported                                                         |
| Counties            | Counties involved                                                                                    |
| Extinguished        | Date and time of day the fire was extinguished                                                       |
| Fatalities          | Number of deaths caused by the incident                                                              |
| Latitude, Longitude | Geographic location of the fire incident                                                             |
| MajorIncident       | Whether the fire is a major incident (based on acres of land burned)                                 |
| Name                | Name of the fire incident (based on incident occurrence)                                             |
| PersonnelInvolved   | Manpower deployed to handle the incident                                                             |
| Started             | Time the fire was reported                                                                           |
| WaterTenders        | Firefighting apparatus that specialises in transporting water (**capacity ~2,900 U.S. gallons each**) |

In [40]:
extracted_columns = [
    'AcresBurned',
    'AdminUnit',
    'ArchiveYear',
    'Counties',
    'Extinguished',
    'Fatalities',
    'Latitude',
    'Longitude',
    'MajorIncident',
    'Name',
    'PersonnelInvolved',
    'Started',
    'WaterTenders'
]
filtered_df = raw_df[extracted_columns].copy()
filtered_df['PersonnelInvolved'] = pd.to_numeric(filtered_df['PersonnelInvolved'], errors='coerce')
filtered_df['Fatalities'] = pd.to_numeric(filtered_df['Fatalities'], errors='coerce')
filtered_df['WaterTenders'] = pd.to_numeric(filtered_df['WaterTenders'], errors='coerce')
filtered_df['Extinguished'] = pd.to_datetime(filtered_df['Extinguished'], format='ISO8601', errors='coerce')
filtered_df['Started'] = pd.to_datetime(filtered_df['Started'], format='ISO8601', errors='coerce')
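
As a minimal sketch of what `errors='coerce'` does in the conversions above, the example below runs the same two functions on hypothetical raw values. The series names and values are assumptions for illustration, and the sketch passes `utc=True` instead of the notebook's `format='ISO8601'` so that it does not depend on a specific pandas version.

```python
import pandas as pd

# Hypothetical raw values: one valid entry, one malformed, one missing.
raw_counts = pd.Series(["12", "unknown", None])
raw_times = pd.Series(["2017-08-22T14:00:00Z", "not a date", None])

# errors="coerce" turns anything unparseable into NaN / NaT instead of raising.
counts = pd.to_numeric(raw_counts, errors="coerce")
times = pd.to_datetime(raw_times, errors="coerce", utc=True)

print(counts.isna().tolist())
print(times.isna().tolist())
```

Malformed entries therefore end up as missing values, which is why the non-null counts of the converted columns are worth checking after this step.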

Now that all the necessary type conversions are done, we can compute how long it took to contain each fire from the `Started` and `Extinguished` columns. The new `fire_duration` column is expressed in days.

In [41]:
filtered_df['fire_duration'] = (filtered_df['Extinguished'] - filtered_df['Started']).dt.total_seconds() / (3600 * 24)
filtered_df['fire_duration'] = pd.to_numeric(filtered_df['fire_duration'], errors='coerce')
filtered_df.describe()

Unnamed: 0,AcresBurned,ArchiveYear,Fatalities,Latitude,Longitude,PersonnelInvolved,WaterTenders,fire_duration
count,1633.0,1636.0,21.0,1636.0,1636.0,204.0,146.0,1577.0
mean,4589.443968,2016.608802,8.619048,37.203975,-108.082642,328.553922,7.815068,84.894964
std,27266.337722,1.84534,18.529642,135.40138,37.006927,521.138789,12.719251,875.903563
min,0.0,2013.0,1.0,-120.258,-124.19629,0.0,1.0,-17051.964583
25%,35.0,2015.0,1.0,34.165891,-121.768358,55.0,2.0,1.80816
50%,100.0,2017.0,3.0,37.104065,-120.46156,151.5,4.0,20.134028
75%,422.0,2018.0,6.0,39.086808,-117.474073,350.0,6.0,170.502778
max,410203.0,2019.0,85.0,5487.0,118.9082,3100.0,79.0,17900.723611
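
The duration calculation above can be checked on a single pair of timestamps. The sketch below uses hypothetical values 36 hours apart and divides by a one-day `Timedelta`, which is equivalent to `.dt.total_seconds() / (3600 * 24)`.

```python
import pandas as pd

# Hypothetical start and end timestamps, 36 hours apart.
started = pd.Series(pd.to_datetime(["2017-08-22T14:00:00Z"], utc=True))
extinguished = pd.Series(pd.to_datetime(["2017-08-24T02:00:00Z"], utc=True))

# Subtracting datetimes yields a Timedelta; dividing by a one-day
# Timedelta converts it to fractional days.
duration_days = (extinguished - started) / pd.Timedelta(days=1)
print(duration_days.iloc[0])
```

If the two timestamps were swapped, the same expression would return a negative number of days.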


From the summary above, we can see that `fire_duration` contains negative values, which are not valid durations. Let's inspect them to find out whether they stem from data entry errors or whether we should take their absolute values.

In [42]:
filtered_df[filtered_df['fire_duration'] < 0]['fire_duration'].value_counts()

fire_duration
-1.854167        1
-0.131250        1
-0.000347        1
-0.158322        1
-0.000683        1
-0.003646        1
-0.173310        1
-0.334225        1
-0.105671        1
-0.268078        1
-0.000599        1
-0.152801        1
-1.000556        1
-0.000799        1
-0.399306        1
-0.111065        1
-0.006771        1
-0.006250        1
-17051.964583    1
-0.361111        1
-0.683333        1
-16648.938889    1
-0.270833        1
-0.439583        1
-1.510417        1
-0.051389        1
-0.104861        1
-1.000607        1
Name: count, dtype: int64

The values above, together with the summary statistics, show that `fire_duration` contains a few very large negative and very large positive values. These arise when the `Extinguished` or `Started` column was left at its default value of 31st December, 1969. For our analysis, we drop every row whose duration is negative or greater than 500 days. Since no negative durations remain after this step, the `np.abs` call that follows leaves the values unchanged; rows with a missing duration are kept.

In [43]:
filtered_df.drop(filtered_df[(filtered_df['fire_duration'] < 0) | (filtered_df['fire_duration'] > 500)].index, inplace=True)
filtered_df['fire_duration'] = np.abs(filtered_df['fire_duration'])
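
To make the effect of the drop condition above concrete, the sketch below applies the same condition as a boolean mask to a toy frame. The values are hypothetical and the names `toy`, `out_of_range` and `kept` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy durations in days: a small negative, a valid value, an implausibly
# large value, and a missing value (all hypothetical).
toy = pd.DataFrame({"fire_duration": [-0.5, 3.0, 600.0, np.nan]})

# Same condition as the drop above, written as a boolean mask.
out_of_range = (toy["fire_duration"] < 0) | (toy["fire_duration"] > 500)
kept = toy[~out_of_range]

print(kept)
```

The row with a missing duration survives because comparisons against `NaN` evaluate to `False`, so it never matches the drop condition.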

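We will also drop all rows with invalid coordinates: a latitude outside [-90, 90], or a longitude outside [-180, 0). California lies entirely in the western hemisphere, so its longitudes are negative.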

In [44]:
filtered_df.drop(filtered_df[(filtered_df['Latitude'] < -90) | (filtered_df['Latitude'] > 90)].index, inplace=True)
filtered_df.drop(filtered_df[(filtered_df['Longitude'] < -180) | (filtered_df['Longitude'] >= 0)].index, inplace=True)
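
The range checks above still accept coordinates that are valid on the globe but far from California. One possible refinement, sketched below, is a rough bounding box around the state. The box limits are approximate assumed values, and the toy coordinates are hypothetical; this is not the filter used in the notebook.

```python
import pandas as pd

# Approximate bounding box for California (assumed values, in degrees).
LAT_MIN, LAT_MAX = 32.0, 42.5
LON_MIN, LON_MAX = -125.0, -114.0

# Hypothetical coordinates: one plausible, one with a zero latitude,
# one with latitude and longitude swapped.
toy = pd.DataFrame({
    "Latitude": [37.10, 0.0, -120.26],
    "Longitude": [-120.46, -120.46, 37.10],
})

# Keep only rows whose coordinates fall inside the box (bounds inclusive).
in_box = (
    toy["Latitude"].between(LAT_MIN, LAT_MAX)
    & toy["Longitude"].between(LON_MIN, LON_MAX)
)
print(toy[in_box])
```

The second toy row would pass the global range checks but falls outside the box, which is the kind of entry this stricter test is meant to catch.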

Now let's look at the distinct values in the `AdminUnit` column.

In [45]:
filtered_df['AdminUnit'].unique().shape

(457,)

Counting the unique values in the `AdminUnit` column suggests that there are about 460 distinct fire-handling departments in California, which is not the case. Admin unit names are entered in many different formats, so the same unit appears under several names. To clean this up, we use a simple approach: convert the names to lowercase and remove common terms such as:

```
["/", "california", "cal", "fire", "unit", "county", "department", "national", "forest", "district", "unified command:", "usfs", "us", "service", "and", "of", "the", "city"]
```

Finally, we take the unique values, which leaves 229 units. This list could be cleaned further, but for this project we will stick with this number.

In [46]:
filtered_admin_df = filtered_df['AdminUnit']
words_to_remove = ["/", "california", "cal", "fire", "unit", "county", "department", "national", "forest", "district", "unified command:", "usfs", "us", "service", "and", "of", "the", "city"]

# Lowercase, turn hyphens into spaces, and abbreviate "los angeles".
filtered_admin_df = (filtered_admin_df.str.lower().str.replace("-", " ").str.replace("los angeles", "la"))

# Remove each common term as a literal substring, in list order.
for word in words_to_remove:
    filtered_admin_df = filtered_admin_df.str.replace(word, "", regex=False)

# Remove all remaining spaces so that spelling variants collapse to one key.
filtered_admin_df = filtered_admin_df.str.strip().str.replace(" ", "")
filtered_df['AdminUnitCleaned'] = filtered_admin_df
filtered_df['AdminUnitCleaned'].unique().shape

(229,)
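
Because the cleaning above removes terms as substrings, it also strips those letters from inside longer words (for example, the "us" in "stanislaus"). A possible alternative, sketched below under stated assumptions, removes whole words only. The function name `normalize_admin_unit` is hypothetical, "unified command:" is split into the two tokens "unified" and "command", and the "los angeles" abbreviation step is left out.

```python
import re

# Stop words taken from the list above, with "unified command:" split
# into two separate tokens (an assumption of this sketch).
STOP_WORDS = {
    "california", "cal", "fire", "unit", "county", "department",
    "national", "forest", "district", "unified", "command", "usfs",
    "us", "service", "and", "of", "the", "city",
}


def normalize_admin_unit(name: str) -> str:
    """Lowercase, split into alphanumeric tokens, drop stop words, join."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return "".join(token for token in tokens if token not in STOP_WORDS)


print(normalize_admin_unit("CAL FIRE San Benito-Monterey Unit"))
print(normalize_admin_unit("CAL FIRE/Riverside County Fire"))
```

It could be applied with `filtered_df['AdminUnit'].map(normalize_admin_unit)`; the resulting number of unique units may differ from the count reported above.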

In [47]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1442 entries, 0 to 1635
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype              
---  ------             --------------  -----              
 0   AcresBurned        1439 non-null   float64            
 1   AdminUnit          1442 non-null   object             
 2   ArchiveYear        1442 non-null   int64              
 3   Counties           1442 non-null   object             
 4   Extinguished       1383 non-null   datetime64[ns, UTC]
 5   Fatalities         21 non-null     float64            
 6   Latitude           1442 non-null   float64            
 7   Longitude          1442 non-null   float64            
 8   MajorIncident      1442 non-null   bool               
 9   Name               1442 non-null   object             
 10  PersonnelInvolved  186 non-null    float64            
 11  Started            1442 non-null   datetime64[ns, UTC]
 12  WaterTenders       136 non-null    float64           

Now that the data is cleaned, let's save it to `Wildfire-Analysis-Cleaned.csv`.

In [48]:
filtered_df.to_csv('Wildfire-Analysis-Cleaned.csv', index=False)
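
CSV files do not store column types, so the datetime columns come back as plain strings unless they are parsed again on load. The sketch below shows a round trip through an in-memory buffer; the toy row and the names `cleaned`, `buffer` and `restored` are assumptions for illustration.

```python
import io

import pandas as pd

# Toy cleaned frame with a timezone-aware timestamp column (hypothetical row).
cleaned = pd.DataFrame({
    "Name": ["Range Fire"],
    "Started": pd.to_datetime(["2017-08-22T14:00:00Z"], utc=True),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the column would be read back as plain strings.
restored = pd.read_csv(buffer, parse_dates=["Started"])
print(restored.dtypes)
```

When reading `Wildfire-Analysis-Cleaned.csv` later, passing `parse_dates=['Started', 'Extinguished']` to `pd.read_csv` would serve the same purpose.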