**TASK 2: Exploratory Data Analysis (EDA)** - This task explores the wildfire dataset to understand its structure, detect patterns, and generate insights, using only text and tables.


**Importing Required Libraries** - We import pandas for data manipulation, and matplotlib and seaborn for visualizations. The seaborn style is set for cleaner-looking plots.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for Seaborn (the original call was not preserved; set_theme() applies seaborn's default theme)
sns.set_theme()


**Loading the Dataset** - We read the CSV file into a pandas DataFrame and preview the first five rows to confirm that it loaded correctly.

In [2]:
# Load the dataset
df = pd.read_csv("California Wildfire Damage.csv")

# Preview the first few rows
df.head()


| | Incident_ID | Date | Location | Area_Burned (Acres) | Homes_Destroyed | Businesses_Destroyed | Vehicles_Damaged | Injuries | Fatalities | Estimated_Financial_Loss (Million $) | Cause |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INC1000 | 2020-11-22 | Sonoma County | 14048 | 763 | 474 | 235 | 70 | 19 | 2270.57 | Lightning |
| 1 | INC1001 | 2021-09-23 | Sonoma County | 33667 | 1633 | 4 | 263 | 100 | 2 | 1381.14 | Lightning |
| 2 | INC1002 | 2022-02-10 | Shasta County | 26394 | 915 | 291 | 31 | 50 | 6 | 2421.96 | Human Activity |
| 3 | INC1003 | 2021-05-17 | Sonoma County | 20004 | 1220 | 128 | 34 | 28 | 0 | 3964.16 | Unknown |
| 4 | INC1004 | 2021-09-22 | Sonoma County | 40320 | 794 | 469 | 147 | 0 | 15 | 1800.09 | Unknown |


**Dataset Info and Data Types** - This step shows the number of entries, column names, data types, and memory usage. It helps identify object (categorical) vs numeric columns.

In [3]:

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Incident_ID                           100 non-null    object 
 1   Date                                  100 non-null    object 
 2   Location                              100 non-null    object 
 3   Area_Burned (Acres)                   100 non-null    int64  
 4   Homes_Destroyed                       100 non-null    int64  
 5   Businesses_Destroyed                  100 non-null    int64  
 6   Vehicles_Damaged                      100 non-null    int64  
 7   Injuries                              100 non-null    int64  
 8   Fatalities                            100 non-null    int64  
 9   Estimated_Financial_Loss (Million $)  100 non-null    float64
 10  Cause                                 100 non-null    object 
dtypes: float64(1), int64(6), object(4)
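
The object/numeric split reported by `df.info()` can also be obtained programmatically. The following is a minimal sketch, assuming a small sample built from the first three preview rows; the names `sample`, `numeric_cols`, and `categorical_cols` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Small sample built from the first three rows of the preview (subset of columns)
sample = pd.DataFrame({
    "Incident_ID": ["INC1000", "INC1001", "INC1002"],
    "Location": ["Sonoma County", "Sonoma County", "Shasta County"],
    "Area_Burned (Acres)": [14048, 33667, 26394],
    "Estimated_Financial_Loss (Million $)": [2270.57, 1381.14, 2421.96],
    "Cause": ["Lightning", "Lightning", "Human Activity"],
})

# Numeric columns (int64, float64) versus everything else (text columns)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

On the full DataFrame the same two calls would be applied to `df` instead of `sample`.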

**Summary Statistics** - We view statistics like mean, min, max, and standard deviation for numeric columns to understand central tendencies and spread.

In [7]:

df.describe()



| | Area_Burned (Acres) | Homes_Destroyed | Businesses_Destroyed | Vehicles_Damaged | Injuries | Fatalities | Estimated_Financial_Loss (Million $) |
|---|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 26531.46 | 941.89 | 251.57 | 150.33 | 42.04 | 9.93 | 2396.0993 |
| std | 14282.123824 | 543.019967 | 138.317761 | 88.471197 | 29.201467 | 5.682251 | 1430.439422 |
| min | 357.0 | 18.0 | 4.0 | 5.0 | 0.0 | 0.0 | 52.59 |
| 25% | 15916.25 | 501.0 | 134.75 | 70.75 | 16.0 | 5.0 | 1175.195 |
| 50% | 25618.0 | 908.5 | 256.5 | 150.5 | 37.0 | 10.0 | 2408.53 |
| 75% | 39775.0 | 1401.75 | 371.0 | 229.75 | 60.0 | 14.25 | 3662.11 |
| max | 49653.0 | 1968.0 | 493.0 | 300.0 | 100.0 | 20.0 | 4866.99 |
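
The quartiles in the summary can be used for a quick outlier check. The sketch below applies Tukey's 1.5 × IQR rule to `Area_Burned (Acres)`, using the quartile, minimum, and maximum values copied from the `describe()` output. The choice of this rule is an assumption for illustration, not something the original notebook does.

```python
# Quartiles and extremes for Area_Burned (Acres), copied from the describe() output
q1, q3 = 15916.25, 39775.0
col_min, col_max = 357.0, 49653.0

iqr = q3 - q1                  # interquartile range
lower_fence = q1 - 1.5 * iqr   # values below this would be flagged
upper_fence = q3 + 1.5 * iqr   # values above this would be flagged

# If both extremes lie inside the fences, no value in the column is flagged
has_outliers = col_min < lower_fence or col_max > upper_fence
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence}), outliers: {has_outliers}")
```

Both the minimum and the maximum fall inside the fences, so this rule flags no outliers in the column.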


**Check for Missing Values** - We confirm that the dataset has no missing values. If any were present, we would need to handle them (for example, by imputing or dropping rows) before further analysis.

In [5]:

df.isnull().sum()


Incident_ID                             0
Date                                    0
Location                                0
Area_Burned (Acres)                     0
Homes_Destroyed                         0
Businesses_Destroyed                    0
Vehicles_Damaged                        0
Injuries                                0
Fatalities                              0
Estimated_Financial_Loss (Million $)    0
Cause                                   0
dtype: int64
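
Since the real dataset has no gaps, the following is only a sketch of how missing values could be handled if they appeared. The `sample` frame is hypothetical, and median/placeholder imputation is one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; the real dataset has no missing values
sample = pd.DataFrame({
    "Injuries": [70, np.nan, 50, 28],
    "Cause": ["Lightning", "Lightning", None, "Unknown"],
})

cleaned = sample.copy()
# Numeric column: fill with the median, which is robust to extreme values
cleaned["Injuries"] = cleaned["Injuries"].fillna(cleaned["Injuries"].median())
# Categorical column: fill with the dataset's existing "Unknown" label
cleaned["Cause"] = cleaned["Cause"].fillna("Unknown")

print(cleaned.isnull().sum())
```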

**Check for Unique Values** - We check how many unique values each column has. This helps us identify IDs, categories, and potential grouping columns.

In [8]:
df.nunique()

Incident_ID                             100
Date                                     97
Location                                 10
Area_Burned (Acres)                     100
Homes_Destroyed                          97
Businesses_Destroyed                     93
Vehicles_Damaged                         88
Injuries                                 61
Fatalities                               21
Estimated_Financial_Loss (Million $)    100
Cause                                     3
dtype: int64
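
The unique counts can be turned into a rough column classification. The sketch below hard-codes the counts from the output above and restricts the check to the four text columns listed by `df.info()`. The cutoff of 10 categories for a "grouping" column is an assumed threshold, and the variable names are illustrative.

```python
import pandas as pd

n_rows = 100  # number of entries reported by df.info()

# Unique counts copied from the df.nunique() output
unique_counts = pd.Series({
    "Incident_ID": 100, "Date": 97, "Location": 10,
    "Area_Burned (Acres)": 100, "Homes_Destroyed": 97,
    "Businesses_Destroyed": 93, "Vehicles_Damaged": 88,
    "Injuries": 61, "Fatalities": 21,
    "Estimated_Financial_Loss (Million $)": 100, "Cause": 3,
})

# Text (object) columns according to df.info()
text_counts = unique_counts[["Incident_ID", "Date", "Location", "Cause"]]

# ID-like: every row has its own value; grouping: few distinct values
id_like = text_counts[text_counts == n_rows].index.tolist()
grouping = text_counts[text_counts <= 10].index.tolist()

print("ID-like:", id_like)
print("Grouping candidates:", grouping)
```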

**Value Counts for Categorical Columns** - We check the distribution of wildfire causes to see which ones are most frequent (e.g., Lightning, Human Activity, Unknown).

In [10]:
df['Cause'].value_counts()

Cause
Human Activity    38
Lightning         31
Unknown           31
Name: count, dtype: int64
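
Raw counts are easier to compare as percentages. A minimal sketch, rebuilding the counts from the output above (the names `cause_counts` and `cause_share` are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output
cause_counts = pd.Series(
    {"Human Activity": 38, "Lightning": 31, "Unknown": 31}, name="count"
)

# Share of each cause as a percentage of all incidents
cause_share = (cause_counts / cause_counts.sum() * 100).round(1)
print(cause_share)
```

On the full DataFrame, `df['Cause'].value_counts(normalize=True)` returns the same proportions as fractions.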

**Convert Date to Datetime and Extract Features** - The Date column is converted to a datetime object, and Year and Month are extracted for time-series trend analysis.

In [11]:

df['Date'] = pd.to_datetime(df['Date'])


df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month


df[['Date', 'Year', 'Month']].head()


| | Date | Year | Month |
|---|---|---|---|
| 0 | 2020-11-22 | 2020 | 11 |
| 1 | 2021-09-23 | 2021 | 9 |
| 2 | 2022-02-10 | 2022 | 2 |
| 3 | 2021-05-17 | 2021 | 5 |
| 4 | 2021-09-22 | 2021 | 9 |
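
For seasonal analysis, the extracted month can be mapped to a season label. The sketch below uses the five preview dates. The month-to-season mapping (meteorological seasons for the northern hemisphere) is an assumption and not part of the original notebook.

```python
import pandas as pd

# Dates from the five preview rows
dates = pd.to_datetime(pd.Series(
    ["2020-11-22", "2021-09-23", "2022-02-10", "2021-05-17", "2021-09-22"]
))

# Assumed meteorological seasons (northern hemisphere)
season_by_month = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Look up each date's month in the mapping
seasons = dates.dt.month.map(season_by_month)
print(seasons.tolist())
```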


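**Correlation Table** - A correlation table shows how strongly numeric features relate to each other. One might expect a strong positive correlation between Area Burned and Financial Loss, since larger fires usually cause higher losses. In this dataset, however, that coefficient is only about 0.08, and no pair of features exceeds 0.19 in absolute value, so the numeric columns are close to linearly uncorrelated.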

In [14]:
df.corr(numeric_only=True)

| | Area_Burned (Acres) | Homes_Destroyed | Businesses_Destroyed | Vehicles_Damaged | Injuries | Fatalities | Estimated_Financial_Loss (Million $) | Year | Month |
|---|---|---|---|---|---|---|---|---|---|
| Area_Burned (Acres) | 1.0 | 0.051915 | 0.028195 | -0.136432 | 0.094843 | 0.050394 | 0.075187 | 0.100373 | -0.037889 |
| Homes_Destroyed | 0.051915 | 1.0 | 0.113493 | -0.073115 | 0.01527 | -0.045863 | 0.046645 | -0.055574 | -0.108145 |
| Businesses_Destroyed | 0.028195 | 0.113493 | 1.0 | -0.075566 | -0.103607 | 0.073564 | -0.07799 | 0.000181 | 0.005933 |
| Vehicles_Damaged | -0.136432 | -0.073115 | -0.075566 | 1.0 | 0.119331 | -0.177314 | -0.02445 | -0.07997 | -0.018939 |
| Injuries | 0.094843 | 0.01527 | -0.103607 | 0.119331 | 1.0 | -0.037908 | 0.079737 | 0.084566 | -0.095812 |
| Fatalities | 0.050394 | -0.045863 | 0.073564 | -0.177314 | -0.037908 | 1.0 | 0.184919 | 0.175887 | 0.022532 |
| Estimated_Financial_Loss (Million $) | 0.075187 | 0.046645 | -0.07799 | -0.02445 | 0.079737 | 0.184919 | 1.0 | 0.075614 | -0.024612 |
| Year | 0.100373 | -0.055574 | 0.000181 | -0.07997 | 0.084566 | 0.175887 | 0.075614 | 1.0 | 0.088414 |
| Month | -0.037889 | -0.108145 | 0.005933 | -0.018939 | -0.095812 | 0.022532 | -0.024612 | 0.088414 | 1.0 |
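
Scanning a wide correlation matrix by eye is error-prone, so it can help to rank feature pairs by absolute correlation. The sketch below does this for a four-column excerpt of the matrix, with values copied from the output above. Keeping only the upper triangle, so that each pair appears once, is one possible approach.

```python
import numpy as np
import pandas as pd

cols = [
    "Area_Burned (Acres)", "Vehicles_Damaged",
    "Fatalities", "Estimated_Financial_Loss (Million $)",
]
# Excerpt of the correlation matrix (values copied from the output)
corr = pd.DataFrame(
    [[1.0, -0.136432, 0.050394, 0.075187],
     [-0.136432, 1.0, -0.177314, -0.02445],
     [0.050394, -0.177314, 1.0, 0.184919],
     [0.075187, -0.02445, 0.184919, 1.0]],
    index=cols, columns=cols,
)

# Keep only entries strictly above the diagonal so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

In this excerpt the strongest pair is Fatalities and Estimated Financial Loss, and even that relationship is weak.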


**Questions Raised from EDA** - A list of insightful, data-driven questions that we can try to answer through further analysis or visualizations.

In [13]:
questions = [
    "1. Which year had the highest total area burned?",
    "2. Is there a correlation between area burned and financial loss?",
    "3. Which cause leads to the most fatalities?",
    "4. Which locations are most frequently affected?",
    "5. Are there seasonal trends in wildfire occurrences?"
]

for q in questions:
    print(q)


1. Which year had the highest total area burned?
2. Is there a correlation between area burned and financial loss?
3. Which cause leads to the most fatalities?
4. Which locations are most frequently affected?
5. Are there seasonal trends in wildfire occurrences?
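
Most of these questions reduce to a group-and-aggregate step. As an illustration for question 3, the sketch below sums fatalities per cause using only the five preview rows, so the result is not the answer for the full dataset. The names `preview` and `fatalities_by_cause` are illustrative.

```python
import pandas as pd

# Cause and Fatalities from the five preview rows only
preview = pd.DataFrame({
    "Cause": ["Lightning", "Lightning", "Human Activity", "Unknown", "Unknown"],
    "Fatalities": [19, 2, 6, 0, 15],
})

# Total fatalities per cause, largest first
fatalities_by_cause = (
    preview.groupby("Cause")["Fatalities"].sum().sort_values(ascending=False)
)
print(fatalities_by_cause)
```

Replacing `preview` with the full `df` would answer the question for all 100 incidents.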


**Grouped Yearly Statistics** - We group the data by Year and sum area burned, homes destroyed, financial loss, injuries, and fatalities for each year, which is useful for spotting temporal trends.



In [15]:
# Group data by Year to view annual totals
yearly_summary = df.groupby('Year')[[
    'Area_Burned (Acres)',
    'Homes_Destroyed',
    'Estimated_Financial_Loss (Million $)',
    'Injuries',
    'Fatalities'
]].sum()

yearly_summary


| Year | Area_Burned (Acres) | Homes_Destroyed | Estimated_Financial_Loss (Million $) | Injuries | Fatalities |
|---|---|---|---|---|---|
| 2014 | 292258 | 15616 | 22034.1 | 561 | 103 |
| 2015 | 335017 | 6637 | 20635.89 | 255 | 61 |
| 2016 | 221408 | 7365 | 22327.25 | 371 | 109 |
| 2017 | 67059 | 3863 | 13745.34 | 174 | 46 |
| 2018 | 286662 | 13066 | 30937.25 | 564 | 112 |
| 2019 | 204166 | 6657 | 23323.0 | 356 | 77 |
| 2020 | 210763 | 7424 | 20400.76 | 319 | 104 |
| 2021 | 347909 | 13944 | 37069.46 | 555 | 131 |
| 2022 | 295972 | 9246 | 26704.28 | 546 | 133 |
| 2023 | 391932 | 10371 | 22432.6 | 503 | 117 |
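
The yearly totals answer question 1 directly. The sketch below rebuilds two columns of the summary from the output above and uses `idxmax()` to find the peak years. The variable names are illustrative.

```python
import pandas as pd

# Two columns of the yearly summary, copied from the output
yearly = pd.DataFrame(
    {
        "Area_Burned (Acres)": [
            292258, 335017, 221408, 67059, 286662,
            204166, 210763, 347909, 295972, 391932,
        ],
        "Estimated_Financial_Loss (Million $)": [
            22034.1, 20635.89, 22327.25, 13745.34, 30937.25,
            23323.0, 20400.76, 37069.46, 26704.28, 22432.6,
        ],
    },
    index=pd.Index(range(2014, 2024), name="Year"),
)

# idxmax() returns the index label (the year) of the largest value
peak_area_year = yearly["Area_Burned (Acres)"].idxmax()
peak_loss_year = yearly["Estimated_Financial_Loss (Million $)"].idxmax()

print("Largest total area burned:", peak_area_year)
print("Largest total financial loss:", peak_loss_year)
```

The year with the most area burned (2023) differs from the year with the highest financial loss (2021), which is consistent with the weak correlation between the two columns.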
