# Smoke Estimate

In this notebook I explore the features of the wildfire data to build an annual smoke estimate for Mesa, Arizona. The goal is to develop a formula that quantifies smoke impact from factors such as fire size (acres burned), proximity to the city, and fire type.

While doing this, I ensure the following:

1. The estimate only considers the last 60 years of wildland fire data. (The wildfire dataset has no records for 2021-2024, so I cannot compute an estimate for those years.)
2. The estimate only considers fires that are within 650 miles of Mesa, AZ.
3. The data defines the annual fire season as running from May 1st through October 31st.

(Note: The three points above were taken from the assignment provided by Prof. McDonald; I only edited them slightly.)
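As a rough illustration of conditions 1 and 2, the sketch below applies a year window and a distance cutoff to a small made-up table. The column names `Fire_Year` and `Min_Distance` match the filtered dataset loaded later in this notebook; the sample values, the constant names, and the choice of an inclusive `<=` cutoff are assumptions for the example, not the actual preprocessing code.

```python
import pandas as pd

# Toy stand-in for the wildfire table; the values are made up for illustration.
sample_fires = pd.DataFrame({
    "Fire_Year": [1950, 1964, 1999, 2020],
    "Min_Distance": [100.0, 700.0, 320.5, 649.9],
})

# Hypothetical constants mirroring the conditions listed above.
FIRST_YEAR, LAST_YEAR = 1964, 2024   # condition 1: the 60-year window
MAX_DISTANCE_MILES = 650             # condition 2: distance from Mesa, AZ

within_window = sample_fires["Fire_Year"].between(FIRST_YEAR, LAST_YEAR)
within_range = sample_fires["Min_Distance"] <= MAX_DISTANCE_MILES

# Keep only the fires that satisfy both conditions.
filtered = sample_fires[within_window & within_range]
print(filtered)
```

Here the 1950 fire falls outside the year window and the 1964 fire is too far away, so only the 1999 and 2020 rows remain.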

In [41]:
#
#    IMPORTS
#

#    pandas is a third-party module. If it is not installed, run `!pip install pandas` first.
import pandas as pd

In [42]:
# Load the data
# The data has been filtered to make sure the above mentioned conditions are met.
wildfire_data = pd.read_csv("../Processed Data/filtered_wildfires_with_distances.csv", index_col=0)
aqi_data = pd.read_csv("../Processed Data/AQI_data.csv", index_col=0)

### Explore the wildfire data to find the best columns for calculating the smoke estimate

In [43]:
# Here I check which columns are in the filtered wildfire dataset
print(wildfire_data.shape)
print(wildfire_data.columns)

(33901, 22)
Index(['OBJECTID', 'USGS_Assigned_ID', 'Assigned_Fire_Type', 'Fire_Year',
       'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Acres',
       'GIS_Hectares', 'Listed_Fire_Types', 'Listed_Fire_Names',
       'Listed_Fire_Codes', 'Listed_Fire_IDs', 'Listed_Fire_Dates',
       'Listed_Fire_Causes', 'Listed_Fire_Cause_Class',
       'Listed_Rx_Reported_Acres', 'Circleness_Scale', 'Circle_Flag',
       'Shape_Length', 'Shape_Area', 'Min_Distance', 'Average_Distance'],
      dtype='object')


In [44]:
# List of columns I want to explore further
columns_to_explore = [
    'GIS_Acres', 'Min_Distance', 'Average_Distance',
    'Assigned_Fire_Type', 'Fire_Polygon_Tier',
    'Fire_Attribute_Tiers', 'Fire_Year', 'Circleness_Scale', 'Circle_Flag', 'Shape_Length', 'Shape_Area'
]

print(wildfire_data[columns_to_explore].info())



<class 'pandas.core.frame.DataFrame'>
Int64Index: 33901 entries, 0 to 117161
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   GIS_Acres             33901 non-null  float64
 1   Min_Distance          33901 non-null  float64
 2   Average_Distance      33901 non-null  float64
 3   Assigned_Fire_Type    33901 non-null  object 
 4   Fire_Polygon_Tier     33901 non-null  int64  
 5   Fire_Attribute_Tiers  33901 non-null  object 
 6   Fire_Year             33901 non-null  int64  
 7   Circleness_Scale      33901 non-null  float64
 8   Circle_Flag           2818 non-null   float64
 9   Shape_Length          33901 non-null  float64
 10  Shape_Area            33901 non-null  float64
dtypes: float64(7), int64(2), object(2)
memory usage: 3.1+ MB
None


In [45]:
print(wildfire_data[columns_to_explore].describe())

           GIS_Acres  Min_Distance  Average_Distance  Fire_Polygon_Tier  \
count   33901.000000  33901.000000      33901.000000       33901.000000   
mean     1958.422223    419.826642        420.451218           2.116044   
std     12566.777832    160.290488        160.258957           2.076768   
min         0.000021      8.157651          8.861608           1.000000   
25%        15.404123    316.153677        316.783063           1.000000   
50%       105.145155    445.207924        445.849193           1.000000   
75%       712.560628    544.842893        545.163520           2.000000   
max    595105.933935    649.995609        673.222077           8.000000   

          Fire_Year  Circleness_Scale  Circle_Flag  Shape_Length    Shape_Area  
count  33901.000000      33901.000000       2818.0  3.390100e+04  3.390100e+04  
mean    2000.712486          0.478881          1.0  1.073487e+04  7.925454e+06  
std       14.990027          0.263769          0.0  2.759050e+04  5.085595e+07  


In [46]:

# This function helps me explore the categorical columns
def explore_categorical(column):
    print(f"Unique values in {column}:")
    print(wildfire_data[column].value_counts())
    print(f"Number of unique values: {wildfire_data[column].nunique()}")

# Explore categorical columns
categorical_columns = ['Assigned_Fire_Type', 'Fire_Polygon_Tier',
                       'Fire_Attribute_Tiers', 'Circle_Flag']

for col in categorical_columns:
    explore_categorical(col)

Unique values in Assigned_Fire_Type:
Wildfire                            26814
Prescribed Fire                      4512
Likely Wildfire                      2167
Unknown - Likely Prescribed Fire      323
Unknown - Likely Wildfire              85
Name: Assigned_Fire_Type, dtype: int64
Number of unique values: 5
Unique values in Fire_Polygon_Tier:
1    23791
7     3841
3     3283
2     1744
6      707
8      285
5      199
4       51
Name: Fire_Polygon_Tier, dtype: int64
Number of unique values: 8
Unique values in Fire_Attribute_Tiers:
1 (1), 3 (1)                  4878
1 (1), 3 (2)                  3311
1 (2), 3 (3)                  2588
7 (1)                         2355
1 (1), 3 (3), 4 (1)           1747
                              ... 
1 (2), 3 (5), 4 (1), 6 (1)       1
1 (2), 3 (5), 4 (2), 5 (1)       1
1 (1), 3 (2), 4 (1), 7 (2)       1
1 (4), 3 (16), 4 (4)             1
8 (20)                           1
Name: Fire_Attribute_Tiers, Length: 1376, dtype: int64
Number of unique va

In [47]:
# Here I check the count of fires for each year
for year in range(1964, 2025):
  fire_count = len(wildfire_data[wildfire_data['Fire_Year'] == year])
  print(f"Number of fires in {year}: {fire_count}")

Number of fires in 1964: 161
Number of fires in 1965: 154
Number of fires in 1966: 221
Number of fires in 1967: 269
Number of fires in 1968: 243
Number of fires in 1969: 193
Number of fires in 1970: 276
Number of fires in 1971: 207
Number of fires in 1972: 237
Number of fires in 1973: 219
Number of fires in 1974: 367
Number of fires in 1975: 263
Number of fires in 1976: 283
Number of fires in 1977: 225
Number of fires in 1978: 250
Number of fires in 1979: 440
Number of fires in 1980: 457
Number of fires in 1981: 419
Number of fires in 1982: 207
Number of fires in 1983: 277
Number of fires in 1984: 381
Number of fires in 1985: 535
Number of fires in 1986: 400
Number of fires in 1987: 566
Number of fires in 1988: 600
Number of fires in 1989: 624
Number of fires in 1990: 395
Number of fires in 1991: 276
Number of fires in 1992: 423
Number of fires in 1993: 451
Number of fires in 1994: 655
Number of fires in 1995: 521
Number of fires in 1996: 697
Number of fires in 1997: 390
Number of fire

From the cells above, I see that GIS_Acres is an important feature: it ranges from nearly 0 to almost 600K acres, which indicates how large a wildfire is. Another column that I find interesting is Assigned_Fire_Type. A major wildfire that actually affects people negatively would most likely be recorded as a confirmed wildfire, but there are several other categories, so I think the type is important to consider. Lastly, Min_Distance has a wide range, from about 8 to 650 miles. I explore this further in the next cell.

Note: There is no wildfire data for 2021-2024. I decided to leave the estimates for those years as NaN.
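The per-year counts can also be obtained without a loop. The sketch below uses made-up years in place of `wildfire_data['Fire_Year']`; `fires_per_year` is a name chosen for the example. Reindexing over the full range makes years without any records show up explicitly with a count of 0.

```python
import pandas as pd

# Toy data; in the notebook this would be wildfire_data['Fire_Year'].
fire_years = pd.Series([1964, 1964, 1965, 2020, 2020, 2020], name="Fire_Year")

years = range(1964, 2025)

# Count records per year, then align to the full year range.
# Years with no fires get a count of 0 instead of being dropped.
fires_per_year = fire_years.value_counts().reindex(years, fill_value=0)

print(fires_per_year.loc[[1964, 1965, 2020, 2021]])
```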

In [48]:
# Here I define 5 proximity levels with their corresponding distance ranges (in miles)
# This helps me understand how close each fire is to the city and how that contributes to the estimate
proximity_levels = {
    1: (0, 130),
    2: (130, 260),
    3: (260, 390),
    4: (390, 520),
    5: (520, 650)
}

# Here I create a new column 'Proximity_Level' based on the Min_Distance
wildfire_data['Proximity_Level'] = pd.cut(
    wildfire_data['Min_Distance'],
    bins=[0, 130, 260, 390, 520, 650],
    labels=[1, 2, 3, 4, 5],
    include_lowest=True
)


print("Proximity Level Distribution:")
for level, (lower_bound, upper_bound) in proximity_levels.items():
    level_data = wildfire_data[wildfire_data['Proximity_Level'] == level]
    count = len(level_data)
    print(f"Proximity Level {level}: Range ({lower_bound}-{upper_bound}) miles, Count: {count}")


Proximity Level Distribution:
Proximity Level 1: Range (0-130) miles, Count: 2545
Proximity Level 2: Range (130-260) miles, Count: 3024
Proximity Level 3: Range (260-390) miles, Count: 7386
Proximity Level 4: Range (390-520) miles, Count: 8778
Proximity Level 5: Range (520-650) miles, Count: 12168


Based on the output above, most fires are far from the city: levels 4 and 5 account for more than 60% of the records, which should make those fires less impactful.
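To make the bin edges concrete, here is a small sketch of how `pd.cut` assigns levels with the same bins and labels as above. The distances are made-up test values. Bins are closed on the right, so a distance of exactly 130 miles lands in level 1, and `include_lowest=True` lets a distance of 0 fall into the first bin. Any value above 650 would get no level at all.

```python
import pandas as pd

# Made-up distances chosen to probe the bin boundaries.
distances = pd.Series([0.0, 130.0, 130.1, 649.99, 650.0, 673.2])

levels = pd.cut(
    distances,
    bins=[0, 130, 260, 390, 520, 650],
    labels=[1, 2, 3, 4, 5],
    include_lowest=True,
)

# The last value lies outside every bin, so its level is NaN.
print(levels.tolist())
```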

### Calculate Smoke Estimate

The `calculate_smoke_estimate` function computes an estimated "smoke impact" score for individual wildfire records. I calculate the smoke_estimate based on three main factors:

1. **Proximity Level** - This measures the closeness of the wildfire to my assigned city, Mesa.
2. **GIS Acres** - This represents the size of the wildfire.
3. **Fire Type** - This is the category assigned to the fire (for example, Wildfire or Prescribed Fire).

I get the final estimate for each wildfire record by weighting and combining these factors.

### Key Steps in the Calculation

#### Step 1: Set the weights for each factor
I assign each factor a weight based on how important a role it plays in terms of smoke impact:

- **`proximity_weight`**: I set this to 0.7, as proximity is a strong indicator of potential impact. I think if a fire is closer to my city, it is more likely to leave a larger impact. However, this is also dependent on other factors like wind, weather etc. so I don't give this the highest weight.
- **`gis_acres_weight`**: I set this to 0.8 to highlight that larger fires are likely to produce more smoke.
- **`fire_type_weight`**: I set this to 0.4, as there is not much specific information on the type of fire, but I believe that a major wildfire would be reported as such.

#### Step 2: Calculate how the individual factor will impact the smoke estimate

1. **Proximity**: 
   - I calculate the proximity impact as $1 / \text{Proximity\_Level}$. I do so because the smoke impact is inversely related to the proximity level: smaller levels (closer fires) yield higher impacts.

2. **Fire Size based on GIS_Acres**:
   - I normalize the wildfire area by dividing each wildfire’s acreage by the largest area in the dataset because this scales the values between 0 and 1. This makes it easier to compare wildfires of different sizes and to combine area meaningfully with other factors in the smoke estimate calculation.

3. **Fire Type Impact**:
   - I assign the fire type an impact value as follows:
     - `Wildfire`: 1.0 (highest impact due to smoke production).
     - `Prescribed Fire`: 0.7.
     - `Likely Wildfire`: 0.9.
     - `Unknown - Likely Prescribed Fire`: 0.6.
     - `Unknown - Likely Wildfire`: 0.8.
     - All other types default to 0.5.
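The fire type values listed above can be expressed as a lookup table. The sketch below is one possible way to do this with `Series.map`; the names `FIRE_TYPE_IMPACT` and `DEFAULT_IMPACT` are chosen for the example, and `"Some Other Type"` is a made-up category used only to show the default.

```python
import pandas as pd

# Impact values as listed above; the dictionary name is illustrative.
FIRE_TYPE_IMPACT = {
    "Wildfire": 1.0,
    "Likely Wildfire": 0.9,
    "Unknown - Likely Wildfire": 0.8,
    "Prescribed Fire": 0.7,
    "Unknown - Likely Prescribed Fire": 0.6,
}
DEFAULT_IMPACT = 0.5  # used for any type not in the table

fire_types = pd.Series(["Wildfire", "Prescribed Fire", "Some Other Type"])

# Types missing from the dictionary map to NaN, which is then replaced by the default.
fire_type_impact = fire_types.map(FIRE_TYPE_IMPACT).fillna(DEFAULT_IMPACT)
print(fire_type_impact.tolist())
```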

#### Step 3: Calculating the final smoke estimate for each fire

I calculate the final smoke estimate as a weighted sum of the three factors listed above as follows:

$$
\text{smoke\_estimate} = (\text{proximity\_level\_impact} \times \text{proximity\_weight}) + (\text{gis\_acres\_impact} \times \text{gis\_acres\_weight}) + (\text{fire\_type\_impact} \times \text{fire\_type\_weight})
$$


This weighted formula combines each factor based on its importance, resulting in an estimate of the smoke impact for each wildfire record.
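As a worked example of the formula, consider a hypothetical confirmed wildfire at proximity level 2 that burned 100,000 acres, in a dataset whose largest fire is assumed to be 500,000 acres. These numbers are invented for illustration and are not taken from the data.

```python
# Weights as described in Step 1.
PROXIMITY_WEIGHT = 0.7
GIS_ACRES_WEIGHT = 0.8
FIRE_TYPE_WEIGHT = 0.4

# Hypothetical fire record (values invented for illustration).
proximity_level = 2          # 130-260 miles from the city
gis_acres = 100_000.0
max_gis_acres = 500_000.0    # stand-in for the dataset maximum
fire_type_impact = 1.0       # 'Wildfire'

proximity_term = (1 / proximity_level) * PROXIMITY_WEIGHT    # 0.5 * 0.7 = 0.35
acres_term = (gis_acres / max_gis_acres) * GIS_ACRES_WEIGHT  # 0.2 * 0.8 = 0.16
type_term = fire_type_impact * FIRE_TYPE_WEIGHT              # 1.0 * 0.4 = 0.40

smoke_estimate = proximity_term + acres_term + type_term
print(round(smoke_estimate, 2))  # 0.91
```

With these weights, the score for a single fire can never exceed 0.7 + 0.8 + 0.4 = 1.9, which corresponds to a level-1 confirmed wildfire as large as the largest fire in the data.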



In [49]:
# This function calculates a smoke estimate for a single wildfire record
def calculate_smoke_estimate(row):
  # Weights for the different factors
  proximity_weight = 0.7
  gis_acres_weight = 0.8
  fire_type_weight = 0.4

  # Proximity level impact
  proximity_level_impact = 1 / row['Proximity_Level']  # Closer fires (lower level) have higher impact

  # GIS Acres impact
  gis_acres_impact = row['GIS_Acres'] / wildfire_data['GIS_Acres'].max()  # Normalize GIS Acres

  # Fire type impact
  if row['Assigned_Fire_Type'] == 'Wildfire':
    fire_type_impact = 1
  elif row['Assigned_Fire_Type'] == 'Prescribed Fire':
    fire_type_impact = 0.7
  elif row['Assigned_Fire_Type'] == 'Likely Wildfire':
    fire_type_impact = 0.9
  elif row['Assigned_Fire_Type'] == 'Unknown - Likely Prescribed Fire':
    fire_type_impact = 0.6
  elif row['Assigned_Fire_Type'] == 'Unknown - Likely Wildfire':
    fire_type_impact = 0.8
  else:
    fire_type_impact = 0.5

  # Calculate the weighted smoke estimate
  smoke_estimate = (
      (proximity_level_impact * proximity_weight)
      + (gis_acres_impact * gis_acres_weight)
      + (fire_type_impact * fire_type_weight)
  )

  return smoke_estimate

In [50]:
# Here I find the min and max of the yearly average AQI to help scale my smoke estimate
print(min(aqi_data["yearly_avg_aqi"]))
print(max(aqi_data["yearly_avg_aqi"]))

4.641221374045801
63.722527472527474


In [59]:
# Generate data for the years 1964-2024
years = list(range(1964, 2025))

# Filter wildfire data to 1964-2024 (the data itself ends in 2020) and calculate smoke estimates
# .copy() avoids pandas' SettingWithCopyWarning when adding the new column
wildfire_data = wildfire_data[(wildfire_data['Fire_Year'] >= 1964) & (wildfire_data['Fire_Year'] <= 2024)].copy()
wildfire_data['Smoke_Estimate'] = wildfire_data.apply(calculate_smoke_estimate, axis=1)

""" In the next line I sum the smoke estimates for each year to get the annual estimated smoke impact from all fires. 
I then divide it by 184 to normalize the result across the fire season which has 184 days, giving me an average  
smoke estimate.
Attribution: I took this idea about dividing it by 184 from Manasa Shivappa. 
"""
yearly_smoke_estimates = wildfire_data.groupby('Fire_Year')['Smoke_Estimate'].agg('sum') / 184

# Here I normalize yearly_smoke_estimates to the range [4, 64] based on the yearly_avg_aqi value
# I used ChatGPT to get the formula for normalization. My query was how do I normalize a column between 4 and 64. I edited the response to my use case.
min_value, max_value = 4, 64
yearly_smoke_estimates = (
    (yearly_smoke_estimates - yearly_smoke_estimates.min()) /
    (yearly_smoke_estimates.max() - yearly_smoke_estimates.min())
) * (max_value - min_value) + min_value

yearly_smoke_estimates = yearly_smoke_estimates.reindex(years)
yearly_smoke_estimates

Fire_Year
1964     4.341713
1965     4.000000
1966     6.274285
1967     7.693339
1968     6.949404
          ...    
2020    64.000000
2021          NaN
2022          NaN
2023          NaN
2024          NaN
Name: Smoke_Estimate, Length: 61, dtype: float64
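The rescaling step above is ordinary min-max normalization. The sketch below shows it on three made-up yearly totals; `rescale` is a helper name introduced for the example.

```python
import pandas as pd

def rescale(series, new_min, new_max):
    """Linearly map a series so its minimum lands on new_min and its maximum on new_max."""
    span = series.max() - series.min()
    return (series - series.min()) / span * (new_max - new_min) + new_min

# Made-up yearly totals, indexed by year.
raw = pd.Series([10.0, 25.0, 40.0], index=[1964, 1965, 1966])

scaled = rescale(raw, 4, 64)
print(scaled.tolist())  # [4.0, 34.0, 64.0]
```

Because this mapping is linear with a positive slope, it changes the scale of the estimate but not its Pearson correlation with any other series. The same holds for dividing by 184.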

Now that I have the smoke estimate for each year, I create a single table with the year, the smoke estimate and yearly_avg_aqi.
Before doing that, I need to deal with the NaN values in the yearly_avg_aqi column. I do this in two steps.
First, I use a function that handles the edge case where the first value is missing (in my case there is no data for 1964). Second, I use linear interpolation to fill any values missing in the middle of the series.
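To see why both steps are needed, here is a small sketch on made-up values. With its default settings, `Series.interpolate(method='linear')` only fills forward from the first valid value, so a missing first entry stays NaN while gaps in the middle are filled.

```python
import numpy as np
import pandas as pd

# Made-up AQI values with a missing first entry and a gap in the middle.
aqi = pd.Series(
    [np.nan, 6.0, 4.0, np.nan, 8.0],
    index=[1964, 1965, 1966, 1967, 1968],
)

interpolated = aqi.interpolate(method="linear")
print(interpolated.tolist())  # the 1967 gap becomes 6.0, 1964 is still NaN

# Filling the first entry with the mean of the next two values closes the remaining gap.
filled = interpolated.copy()
filled.iloc[0] = (filled.iloc[1] + filled.iloc[2]) / 2
print(filled.tolist())  # [5.0, 6.0, 4.0, 6.0, 8.0]
```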

In [None]:
# First, I got a list of aqi's for each year
years = list(range(1964, 2025))
yearly_aqi_unique = aqi_data[['year', 'yearly_avg_aqi']].drop_duplicates()
yearly_avg_aqi = pd.Series(yearly_aqi_unique['yearly_avg_aqi'].values, index=yearly_aqi_unique['year'])
yearly_avg_aqi = yearly_avg_aqi.reindex(years)  # Ensure complete range from 1964 to 2024

yearly_avg_aqi # Here we see that we have no aqi for 1964 and 1970


year
1964          NaN
1965     6.132075
1966     4.641221
1967    53.584158
1968    63.722527
          ...    
2020    45.770657
2021    47.492489
2022    40.919786
2023    44.967453
2024    42.166169
Length: 61, dtype: float64

In [62]:
# This function handles the missing data in the first record by filling in the average of the next two valid values
def fill_first_record(series):
    if pd.isna(series.iloc[0]):
        if pd.notna(series.iloc[1]) and pd.notna(series.iloc[2]):
            series.iloc[0] = (series.iloc[1] + series.iloc[2]) / 2
        elif pd.notna(series.iloc[1]):  # Fallback to just the next value
            series.iloc[0] = series.iloc[1]

    return series

yearly_avg_aqi = fill_first_record(yearly_avg_aqi)

# Here I interpolate for any remaining NaNs by averaging between the previous and next valid values
yearly_avg_aqi = yearly_avg_aqi.interpolate(method='linear')
yearly_avg_aqi

year
1964     5.386648
1965     6.132075
1966     4.641221
1967    53.584158
1968    63.722527
          ...    
2020    45.770657
2021    47.492489
2022    40.919786
2023    44.967453
2024    42.166169
Length: 61, dtype: float64

In [None]:
# Now I create the merged DataFrame
merged_df = pd.DataFrame({
    'Fire_Year': years,
    'Smoke_Estimate': yearly_smoke_estimates.reindex(years),
    'yearly_avg_aqi': yearly_avg_aqi
}).reset_index(drop=True)

# Display the merged table
merged_df

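|     | Fire_Year | Smoke_Estimate | yearly_avg_aqi |
|-----|-----------|----------------|----------------|
| 0   | 1964      | 4.341713       | 5.386648       |
| 1   | 1965      | 4.000000       | 6.132075       |
| 2   | 1966      | 6.274285       | 4.641221       |
| 3   | 1967      | 7.693339       | 53.584158      |
| 4   | 1968      | 6.949404       | 63.722527      |
| ... | ...       | ...            | ...            |
| 56  | 2020      | 64.000000      | 45.770657      |
| 57  | 2021      | NaN            | 47.492489      |
| 58  | 2022      | NaN            | 40.919786      |
| 59  | 2023      | NaN            | 44.967453      |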


Now that I have a merged dataframe with the smoke estimate and yearly average AQI, I check whether my smoke estimate correlates with the yearly average AQI. I use the Pearson correlation for this calculation.

In [65]:
correlation = merged_df['Smoke_Estimate'].corr(merged_df['yearly_avg_aqi'], method='pearson')
print(f"Pearson correlation between Smoke Estimate and Yearly Avg AQI: {correlation}")

Pearson correlation between Smoke Estimate and Yearly Avg AQI: 0.3242018303165157


From the Pearson correlation above (about 0.32), we can see that there is a low-to-moderate positive correlation between my smoke_estimate and the yearly_avg_aqi. This makes sense, as my smoke estimate is based only on fires (proximity, area burned and type of fire), whereas the AQI reflects overall air quality, which can be influenced by other factors such as industrial pollution, vehicle emissions, or weather conditions.
It is therefore important to note that while wildfire smoke contributes to poor air quality, it is not the sole factor affecting AQI measurements.
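One detail worth keeping in mind when reading this number: `Series.corr` skips any pair in which either value is NaN, so the years 2021-2024, which have no smoke estimate, do not enter the calculation. The sketch below illustrates this on made-up values by comparing the pandas result with a manual NumPy computation restricted to the complete pairs.

```python
import numpy as np
import pandas as pd

# Made-up values; the last smoke estimate is missing, as for 2021-2024 above.
smoke = pd.Series([4.0, 10.0, 30.0, 64.0, np.nan])
aqi = pd.Series([5.0, 20.0, 25.0, 45.0, 42.0])

# pandas drops the incomplete pair automatically.
r_pandas = smoke.corr(aqi, method="pearson")

# Manual check: keep only rows where both values are present.
valid = smoke.notna() & aqi.notna()
r_numpy = np.corrcoef(smoke[valid], aqi[valid])[0, 1]

print(int(valid.sum()), round(r_pandas, 4), round(r_numpy, 4))
```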

In [66]:
# Here I save the smoke estimate and yearly average aqi along with the year in a single table for easier comparison
merged_df.to_csv("../Processed Data/smoke_estimate_with_year_aqi.csv")