# **Forest Fire Prediction**

### Part 2: Data preparation
In this section, we will focus on cleaning the raw data to ensure its quality and consistency. Data cleaning is an essential step in the data preprocessing pipeline, as it helps to eliminate errors, handle missing values, and transform the data into a suitable format for analysis.

Data handling stages:
- Handling Missing Values
- Handling Duplicates
- Data Conversion
- Outliers and Validations

#### **Imports section:** (including warning suppression)

In [1]:
# Note: on a clean environment, install any missing modules before running
import pandas as pd
import numpy as np
import reverse_geocoder as rg

# Silence the pandas chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

#### Global variables:

In [2]:
# We write a new CSV file after each stage to minimize data loss if an error occurs
CSV_NAME = 'fire_history.csv'
COLS_RED_CSV = 'fire_history_cols_reduction.csv'
MISS_CSV = 'fire_history_miss_values_removed.csv'
DUP_CSV = 'fire_history_dup_values_removed.csv'
DATA_CONV_CSV = 'fire_history_data_conv.csv'
ADD_COLS_CSV = 'fire_history_additional.csv'
CSV_OUTLIERS = 'fire_history_outliers_removed.csv'
#CSV_VALIDATION = 'fire_history_validation.csv'

FINAL_AFTER_PREP_CSV = 'fire_history_prep.csv'

COLS = ['UniqueFireIdentifier', 'FireDiscoveryDateTime', 'FireOutDateTime', 'InitialLatitude', 'InitialLongitude', 'POOCounty', 'FireCause']

#### Removing unnecessary columns:

In [3]:
def keep_necessary_columns(df, cols):
    df_cleaned = df[cols]
    return df_cleaned

#### Handling missing values:
*The `handle_missing_values()` function fills missing values in the `FireCause` column with `'Undetermined'`, then removes every row that still has a missing value.*

In [4]:
def handle_missing_values(df):
    df_cleaned = df.copy()
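    # Assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame
    df_cleaned['FireCause'] = df_cleaned['FireCause'].fillna('Undetermined')
    # Drop every row that still has a missing value in any column
    df_cleaned.dropna(inplace=True, ignore_index=True)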
    return df_cleaned
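
As a small illustration of why the order matters, here is a sketch on a hypothetical toy frame (the values are made up and do not come from `fire_history.csv`). Filling `FireCause` first lets rows that lack only the cause survive, while rows missing any other field are still dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from fire_history.csv
toy = pd.DataFrame({
    'FireCause': ['Human', np.nan, 'Natural'],
    'POOCounty': ['Ada', 'Lane', np.nan],
})

# Fill the cause first, then drop rows that still have a missing value
filled = toy.assign(FireCause=toy['FireCause'].fillna('Undetermined'))
cleaned = filled.dropna().reset_index(drop=True)

print(cleaned)
```

The second row is kept with the cause `'Undetermined'`, and the third row is removed because its county is missing.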

#### Removing duplicates:
*We use the `UniqueFireIdentifier` column to identify duplicates. It is the unique identifier assigned to each wildland fire, in the form `yyyy-SSUUUU-xxxxxx`, where `yyyy` = calendar year, `SSUUUU` = POO protecting unit identifier (5 or 6 characters), and `xxxxxx` = local incident identifier (6 to 10 characters).*

In [5]:
def duplicated_values(df):
    df_cleaned = df.copy()
    df_cleaned.drop_duplicates(subset='UniqueFireIdentifier', keep='first', inplace=True)
    return df_cleaned
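
The identifier layout described above could be checked with a regular expression. This is only a sketch: the helper name `parse_fire_id` and the assumption that the unit and incident parts are alphanumeric are ours, inferred from identifiers such as `2022-MTCRA-220737` in the data.

```python
import re

# Assumed pattern: yyyy-SSUUUU-xxxxxx, with alphanumeric unit and incident parts
FIRE_ID_PATTERN = re.compile(r'^(\d{4})-([A-Z0-9]{5,6})-([A-Za-z0-9]{6,10})$')

def parse_fire_id(fire_id):
    """Return (year, unit, incident), or None when the ID does not match."""
    match = FIRE_ID_PATTERN.match(fire_id)
    if match is None:
        return None
    year, unit, incident = match.groups()
    return int(year), unit, incident

print(parse_fire_id('2022-MTCRA-220737'))
print(parse_fire_id('not-an-id'))
```

A check like this is optional here, since duplicates are detected by exact equality of the identifier rather than by its parts.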

#### Data Type Conversion:
*The `convert_date()` function converts dates into a usable format for our API. It parses values one at a time, so `pandas.to_datetime` limitations and data-entry errors affect only the offending value, which is replaced with NaN.*

In [6]:
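def convert_date(dates):
    # Parse one value at a time so a single bad entry does not abort the whole column
    for i in range(len(dates)):
        try:
            dates[i] = pd.to_datetime(dates[i]).date()
        except Exception:
            # Values that cannot be parsed become NaN and are dropped later
            dates[i] = np.nan
    return dates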
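
A vectorized alternative is possible with the `errors='coerce'` option of `pd.to_datetime`, which turns values that cannot be parsed into `NaT` instead of raising. The sketch below uses a hypothetical helper name, `convert_date_vectorized`, and toy strings; it assumes all values in a column share one timestamp format.

```python
import datetime
import pandas as pd

def convert_date_vectorized(dates):
    # errors='coerce' maps unparseable values to NaT instead of raising
    parsed = pd.to_datetime(dates, errors='coerce', utc=True)
    # Keep only the calendar date, as convert_date() does
    return parsed.dt.date

sample = pd.Series(['2022-09-19 06:08:00', 'not a date', '2023-05-20 18:30:00'])
converted = convert_date_vectorized(sample)
print(converted)
```

Missing entries come back as `NaT` rather than `np.nan`, and `dropna()` removes both.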

*The `data_conversion()` function converts our date columns to a usable format, and converts the `FireCause` column to a categorical column using `.map(cause_mapping)`.*

In [24]:
def data_conversion(df):
    df_cleaned = df.copy()

    # Convert FireDiscoveryDateTime and FireOutDateTime to a usable format and remove rows whose dates failed to convert
    df_cleaned['FireDiscoveryDateTime'] = convert_date(df_cleaned['FireDiscoveryDateTime'])
    df_cleaned['FireOutDateTime'] = convert_date(df_cleaned['FireOutDateTime'])
    df_cleaned.dropna(inplace=True, ignore_index=True)

    # Convert FireCause to categorical column
    cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}
    df_cleaned['FireCause'] = df_cleaned['FireCause'].map(cause_mapping)

    return df_cleaned
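
One property of `Series.map` with a dictionary is worth keeping in mind: any label that is not a key of the dictionary becomes NaN. The sketch below uses `'Arson'` as a hypothetical label that is absent from `cause_mapping`.

```python
import pandas as pd

cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}

# 'Arson' is a hypothetical label that is absent from the mapping
causes = pd.Series(['Human', 'Natural', 'Arson'])
codes = causes.map(cause_mapping)

print(codes)
print(codes.isna().sum())
```

Counting NaN values after the mapping is a cheap way to confirm that every cause in the data was covered.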

#### Extras:
*The `calc_days()` function calculates the difference in days between two dates.*

In [8]:
def calc_days(end_dates, start_dates):
    days = []
    for i in range(len(end_dates)):
        end_date = end_dates[i]
        start_date = start_dates[i]
        days.append((end_date - start_date).days)
    return days
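
When both columns have the `datetime64` dtype, the same difference can be computed without a loop. This is a sketch under that assumption; in this notebook the columns hold `datetime.date` objects, so they would need `pd.to_datetime` first.

```python
import pandas as pd

# Toy dates, converted to datetime64 so that the .dt accessor is available
start = pd.to_datetime(pd.Series(['2022-08-09', '2022-12-10']))
end = pd.to_datetime(pd.Series(['2022-08-15', '2022-12-11']))

# Subtracting two datetime64 Series gives a timedelta64 Series
duration_days = (end - start).dt.days
print(duration_days.tolist())  # [6, 1]
```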

*The `additional_cols()` function adds two useful columns to the DataFrame: `FireDuration` (in days) and `CausedByWeather` (1 when `FireCause` is Natural, otherwise 0).*

In [9]:
def additional_cols(df):
    df_extra = df.copy()

    # Add FireDuration column
    df_extra['FireDuration'] = calc_days(df_extra['FireOutDateTime'], df_extra['FireDiscoveryDateTime'])

    # Add CausedByWeather column
    df_extra['CausedByWeather'] = df_extra['FireCause'].apply(lambda x: 1 if x == 2 else 0)

    return df_extra

#### Validation and Detecting Outliers:
*The `remove_outliers()` function is used to remove rows from a DataFrame based on a given boolean mask of outliers.*

In [10]:
def remove_outliers(df, outliers):
    # Note: drops rows from df in place and assumes a default 0..n-1 index aligned with the mask
    for row in df.index:
        if outliers[row]:
            df.drop(row, inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
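
The same filtering can be expressed with boolean indexing, which avoids dropping rows one at a time. The helper name `remove_outliers_masked` is hypothetical, and unlike `remove_outliers()` this sketch returns a new frame and leaves its input unchanged.

```python
import pandas as pd

def remove_outliers_masked(df, outliers):
    # Keep only rows whose mask entry is False; the input frame is left untouched
    keep = ~pd.Series(outliers, index=df.index).astype(bool)
    return df.loc[keep].reset_index(drop=True)

toy = pd.DataFrame({'FireDuration': [3, -1, 7]})
result = remove_outliers_masked(toy, [False, True, False])
print(result)
```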

*The `duration_outliers()` function flags negative duration values, and the `calculate_outliers()` function flags outliers using the interquartile range (IQR): values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. The IQR is less sensitive to extreme values than the mean and standard deviation, and it can handle non-normal distributions.*

In [11]:
def duration_outliers(dur):
    outliers = []
    for d in dur:
        if d < 0:
            outliers.append(True)
        else:
            outliers.append(False)
    return outliers

In [26]:
def calculate_outliers(data):
    outliers = []
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - (1.5 * IQR)
    upper_bound = Q3 + (1.5 * IQR)
    for x in data:
        if x < lower_bound or x > upper_bound:
            outliers.append(True)
        else:
            outliers.append(False)
    return outliers
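
To make the IQR rule concrete, here is a worked example on five made-up values, together with a vectorized form of the same test. The numbers are illustrative only.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 100])

q1 = data.quantile(0.25)        # 2.0
q3 = data.quantile(0.75)        # 4.0
iqr = q3 - q1                   # 2.0
lower_bound = q1 - 1.5 * iqr    # -1.0
upper_bound = q3 + 1.5 * iqr    # 7.0

# Vectorized equivalent of the loop: True marks an outlier
mask = (data < lower_bound) | (data > upper_bound)
print(mask.tolist())  # [False, False, False, False, True]
```

Only the value 100 falls outside the interval from -1 to 7, so it is the only one flagged.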

*The `coordinates_outliers()` and `valid_locations()` functions ensure that the coordinates are in range and valid.*

In [13]:
def coordinates_outliers(lats, longs):
    outliers = []
    for i in range(len(lats)):
        lat = lats[i]
        long = longs[i]
        if lat == 0 or long == 0:
            outliers.append(True)
        elif lat < -90 or lat > 90 or long < -180 or long > 180:
            outliers.append(True)
        else:
            outliers.append(False)
    return outliers
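
The range check can also be written with `Series.between`, which is inclusive on both ends by default. The helper name `coordinates_outliers_vectorized` and the sample coordinates are hypothetical. One difference from the loop above: a NaN coordinate would be flagged here, so this sketch assumes missing values were already removed.

```python
import pandas as pd

def coordinates_outliers_vectorized(lats, longs):
    # A zero in either coordinate is treated as a placeholder value
    zero = (lats == 0) | (longs == 0)
    # Latitude must lie in [-90, 90] and longitude in [-180, 180]
    out_of_range = ~lats.between(-90, 90) | ~longs.between(-180, 180)
    return (zero | out_of_range).tolist()

lats = pd.Series([43.24, 0.0, 95.0])
longs = pd.Series([-116.07, -118.18, -100.0])
flags = coordinates_outliers_vectorized(lats, longs)
print(flags)  # [False, True, True]
```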

In [29]:
def valid_locations(valid_results, counties):
    for i in range(len(valid_results)):
        county = counties[i].lower()
        valid_county = valid_results[i]['admin2']
        valid_country = valid_results[i]['cc']

        if valid_county is None or county not in valid_county.lower() or valid_country != 'US':
            counties[i] = np.nan
            
    return counties 
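
The county check in `valid_locations()` is a case-insensitive substring match, so `'Ada'` matches `'Ada County'`. The sketch below isolates that check for a single record. The helper name `county_matches` is hypothetical, and the dictionaries are hand-written stand-ins that contain only the `admin2` and `cc` keys read by the function above.

```python
def county_matches(county, result):
    """Mirror the check in valid_locations() for a single record."""
    admin2 = result['admin2']
    if admin2 is None or result['cc'] != 'US':
        return False
    return county.lower() in admin2.lower()

# Hand-written stand-ins for reverse geocoder results (hypothetical values)
print(county_matches('Ada', {'admin2': 'Ada County', 'cc': 'US'}))     # True
print(county_matches('Ada', {'admin2': 'Canyon County', 'cc': 'US'}))  # False
print(county_matches('Ada', {'admin2': 'Ada County', 'cc': 'CA'}))     # False
```

Because this is a substring test, a short name can match a longer one (for example, `'Lake'` is contained in `'Salt Lake County'`), so the validation is permissive rather than exact.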

*Finally, the `handle_outliers()` and `validate_data()` functions handle outliers and validate the data in the given DataFrame.*

In [15]:
def handle_outliers(df):
    df_cleaned = df.copy()

    # Analyze fire duration values
    fire_duration_outliers = duration_outliers(df_cleaned['FireDuration'])
    df_cleaned = remove_outliers(df_cleaned, fire_duration_outliers)
    fire_duration_IQR_outliers = calculate_outliers(df_cleaned['FireDuration'])  
    df_cleaned = remove_outliers(df_cleaned, fire_duration_IQR_outliers)

    # Analyze coordinates values
    fire_coordinates_outliers = coordinates_outliers(df_cleaned['InitialLatitude'], df_cleaned['InitialLongitude'])
    df_cleaned = remove_outliers(df_cleaned, fire_coordinates_outliers)

    return df_cleaned

In [16]:
def validate_data(df):
    valid_df = df.copy()

    lats = valid_df['InitialLatitude'].to_list()
    longs = valid_df['InitialLongitude'].to_list()
    coords = list(zip(lats, longs))

    # Analyze location values
    results = rg.search(coords)
    valid_df['POOCounty'] = valid_locations(results, valid_df['POOCounty'])
    valid_df.dropna(inplace=True, ignore_index=True)

    return valid_df

### Implementation section:
Our data before any handling or preparation:

In [17]:
df = pd.read_csv(CSV_NAME)
df


| | X | Y | OBJECTID | SourceOID | ABCDMisc | ADSPermissionState | ContainmentDateTime | ControlDateTime | CreatedBySystem | IncidentSize | ... | EstimatedFinalCost | OrganizationalAssessment | StrategicDecisionPublishDate | CreatedOnDateTime_dt | ModifiedOnDateTime_dt | IsCpxChild | CpxName | CpxID | SourceGlobalID | GlobalID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -118.180712 | 33.808985 | 1 | 7747595 | | DEFAULT | | | lacocad | | ... | | | | 2020/02/28 20:52:36.363+00 | 2020/02/28 20:52:36.363+00 | 0 | | | {6A311ABB-DF4F-4947-B8DD-3900BDA784F6} | 48d2c0e2-5e38-4d40-9d5e-066b076c7d98 |
| 1 | -117.153901 | 33.176394 | 2 | 6384391 | | DEFAULT | | | firecode | | ... | | | | 2019/07/01 20:10:12.737+00 | 2019/07/01 20:10:12.737+00 | 0 | | | {1AF2C949-B159-4D8F-8D39-90CB58BC5DD5} | 17d2d66a-d451-4592-a172-7b2c860a2cc9 |
| 2 | -121.104180 | 38.834727 | 3 | 1383752 | | DEFAULT | | | firecode | | ... | | | | 2016/06/20 22:39:02.410+00 | 2016/06/20 22:39:02.410+00 | 0 | | | {1B179EA1-97CE-4699-915B-374754BCBC5B} | 60c471ff-3c85-41b4-9135-e7338d7ec90b |
| 3 | -117.228592 | 33.782442 | 4 | 22499589 | | DEFAULT | | | cfcad | | ... | | | | 2021/11/25 15:24:53.120+00 | 2021/11/25 15:24:53.120+00 | 0 | | | {E61E387B-4ED7-4971-9604-C5D7391FAF77} | 149237ec-a42e-43d6-9318-22207a705dd9 |
| 4 | -118.309032 | 33.941815 | 5 | 23869477 | | DEFAULT | | | lacocad | | ... | | | | 2022/11/21 11:28:49.097+00 | 2022/11/21 11:28:49.097+00 | 0 | | | {AEB6F7A3-A109-4132-9FEB-FB1EE1DF3193} | ef7675e3-d5be-412a-a6c1-0d63fc7153c8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 263817 | -116.073913 | 43.243246 | 315324 | 28035649 | | FIREREPORTING | 2023/05/20 19:06:00+00 | 2023/05/20 19:06:00+00 | wildcad | 0.50 | ... | | | | 2023/06/19 16:23:18.380+00 | 2023/06/19 16:26:42.300+00 | 0 | | | {1553284C-4F2F-4D1E-8DFF-77F4593289FE} | 32553b11-17d3-405a-82ff-1cfbdfa6492e |
| 263818 | -116.026013 | 43.184536 | 315326 | 28035650 | | FIREREPORTING | 2023/05/20 19:33:00+00 | 2023/05/20 19:33:00+00 | wildcad | 0.50 | ... | | | | 2023/06/19 16:28:15.883+00 | 2023/06/19 16:28:37.680+00 | 0 | | | {2957B1E7-485A-4BE8-8914-2EEAD0823DF3} | 09c8f1ca-6e1f-4438-b929-55488216cb74 |
| 263819 | -116.069113 | 43.241006 | 315327 | 28035652 | | FIREREPORTING | 2023/05/20 23:30:00+00 | 2023/05/20 23:30:00+00 | wildcad | 20.00 | ... | | | | 2023/06/19 16:35:53.457+00 | 2023/06/19 16:36:22.327+00 | 0 | | | {B0A31C1B-638A-4FBC-AF6F-E8BAC9DFCED5} | 72a22987-ba12-414c-9b8e-63eaa2587dd9 |
| 263820 | -151.187739 | 60.447151 | 315328 | 28035653 | | DEFAULT | 2023/06/19 08:19:42+00 | 2023/06/19 08:37:23+00 | ifm | 0.10 | ... | | | | 2023/06/19 16:40:37.620+00 | 2023/06/19 16:44:40.917+00 | 0 | | | {F8490B1B-82F1-4851-8386-F121978FE268} | 197b872b-1932-46aa-a7e6-628097227187 |


In [18]:
df = keep_necessary_columns(df, COLS)
df.to_csv(COLS_RED_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause |
|---|---|---|---|---|---|---|---|
| 0 | 2020-CALAC-066100 | 2020/02/28 20:45:40+00 | | 33.808980 | -118.180700 | Los Angeles | Unknown |
| 1 | 2019-CAMVU-009269 | 2019/07/01 19:54:00+00 | | | | San Diego | |
| 2 | 2016-CANEU-014375 | 2016/06/20 22:06:00+00 | | | | Placer | |
| 3 | 2021-CARRU-163915 | 2021/11/25 15:17:33+00 | | 33.782437 | -117.228580 | Riverside | Undetermined |
| 4 | 2022-CALAC-396331 | 2022/11/21 11:25:34+00 | | 33.941810 | -118.309020 | Los Angeles | Undetermined |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 263817 | 2023-IDIDNG-000356 | 2023/05/20 18:30:00+00 | 2023/05/20 19:06:00+00 | 43.243240 | -116.073900 | Ada | Human |
| 263818 | 2023-IDIDNG-000357 | 2023/05/20 19:15:00+00 | 2023/05/20 19:33:00+00 | 43.184530 | -116.026000 | Ada | Human |
| 263819 | 2023-IDIDNG-000358 | 2023/05/20 20:28:00+00 | 2023/05/20 23:30:00+00 | 43.241000 | -116.069100 | Ada | Human |
| 263820 | 2023-AKKKS-303132 | 2023/06/19 05:14:10+00 | | 60.447150 | -151.187717 | Kenai Peninsula | Human |


In [19]:
df = handle_missing_values(df)
df.to_csv(MISS_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause |
|---|---|---|---|---|---|---|---|
| 0 | 2022-MTCRA-220737 | 2022/09/19 06:08:00+00 | 2022/11/07 18:05:00+00 | 45.62299 | -107.46470 | Big Horn | Undetermined |
| 1 | 2022-COUMA-000926 | 2022/08/09 21:32:00+00 | 2022/08/15 20:39:00+00 | 37.17861 | -108.88910 | Montezuma | Natural |
| 2 | 2022-LASBR-000250 | 2022/12/10 01:09:19+00 | 2022/12/11 19:30:00+00 | 29.86160 | -93.77264 | Cameron | Undetermined |
| 3 | 2022-WAMSF-000348 | 2022/09/02 20:08:00+00 | 2022/12/16 21:13:00+00 | 47.63933 | -121.60540 | King | Undetermined |
| 4 | 2022-WACOF-001050 | 2022/04/14 17:30:00+00 | 2023/01/12 16:30:00+00 | 48.55950 | -119.06330 | Okanogan | Undetermined |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 126749 | 2023-IDIDNG-000355 | 2023/05/19 00:08:00+00 | 2023/05/19 00:16:00+00 | 43.19766 | -116.03630 | Ada | Human |
| 126750 | 2023-IDIDNG-000356 | 2023/05/20 18:30:00+00 | 2023/05/20 19:06:00+00 | 43.24324 | -116.07390 | Ada | Human |
| 126751 | 2023-IDIDNG-000357 | 2023/05/20 19:15:00+00 | 2023/05/20 19:33:00+00 | 43.18453 | -116.02600 | Ada | Human |
| 126752 | 2023-IDIDNG-000358 | 2023/05/20 20:28:00+00 | 2023/05/20 23:30:00+00 | 43.24100 | -116.06910 | Ada | Human |


In [20]:
df = duplicated_values(df)
df.to_csv(DUP_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause |
|---|---|---|---|---|---|---|---|
| 0 | 2022-MTCRA-220737 | 2022/09/19 06:08:00+00 | 2022/11/07 18:05:00+00 | 45.62299 | -107.46470 | Big Horn | Undetermined |
| 1 | 2022-COUMA-000926 | 2022/08/09 21:32:00+00 | 2022/08/15 20:39:00+00 | 37.17861 | -108.88910 | Montezuma | Natural |
| 2 | 2022-LASBR-000250 | 2022/12/10 01:09:19+00 | 2022/12/11 19:30:00+00 | 29.86160 | -93.77264 | Cameron | Undetermined |
| 3 | 2022-WAMSF-000348 | 2022/09/02 20:08:00+00 | 2022/12/16 21:13:00+00 | 47.63933 | -121.60540 | King | Undetermined |
| 4 | 2022-WACOF-001050 | 2022/04/14 17:30:00+00 | 2023/01/12 16:30:00+00 | 48.55950 | -119.06330 | Okanogan | Undetermined |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 126749 | 2023-IDIDNG-000355 | 2023/05/19 00:08:00+00 | 2023/05/19 00:16:00+00 | 43.19766 | -116.03630 | Ada | Human |
| 126750 | 2023-IDIDNG-000356 | 2023/05/20 18:30:00+00 | 2023/05/20 19:06:00+00 | 43.24324 | -116.07390 | Ada | Human |
| 126751 | 2023-IDIDNG-000357 | 2023/05/20 19:15:00+00 | 2023/05/20 19:33:00+00 | 43.18453 | -116.02600 | Ada | Human |
| 126752 | 2023-IDIDNG-000358 | 2023/05/20 20:28:00+00 | 2023/05/20 23:30:00+00 | 43.24100 | -116.06910 | Ada | Human |


In [23]:
df = data_conversion(df)
df.to_csv(DATA_CONV_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause |
|---|---|---|---|---|---|---|---|
| 0 | 2022-MTCRA-220737 | 2022-09-19 | 2022-11-07 | 45.62299 | -107.46470 | Big Horn | 4 |
| 1 | 2022-COUMA-000926 | 2022-08-09 | 2022-08-15 | 37.17861 | -108.88910 | Montezuma | 2 |
| 2 | 2022-LASBR-000250 | 2022-12-10 | 2022-12-11 | 29.86160 | -93.77264 | Cameron | 4 |
| 3 | 2022-WAMSF-000348 | 2022-09-02 | 2022-12-16 | 47.63933 | -121.60540 | King | 4 |
| 4 | 2022-WACOF-001050 | 2022-04-14 | 2023-01-12 | 48.55950 | -119.06330 | Okanogan | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 126736 | 2023-IDIDNG-000355 | 2023-05-19 | 2023-05-19 | 43.19766 | -116.03630 | Ada | 1 |
| 126737 | 2023-IDIDNG-000356 | 2023-05-20 | 2023-05-20 | 43.24324 | -116.07390 | Ada | 1 |
| 126738 | 2023-IDIDNG-000357 | 2023-05-20 | 2023-05-20 | 43.18453 | -116.02600 | Ada | 1 |
| 126739 | 2023-IDIDNG-000358 | 2023-05-20 | 2023-05-20 | 43.24100 | -116.06910 | Ada | 1 |


In [25]:
df = additional_cols(df)
df.to_csv(ADD_COLS_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause | FireDuration | CausedByWeather |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-MTCRA-220737 | 2022-09-19 | 2022-11-07 | 45.62299 | -107.46470 | Big Horn | 4 | 49 | 0 |
| 1 | 2022-COUMA-000926 | 2022-08-09 | 2022-08-15 | 37.17861 | -108.88910 | Montezuma | 2 | 6 | 1 |
| 2 | 2022-LASBR-000250 | 2022-12-10 | 2022-12-11 | 29.86160 | -93.77264 | Cameron | 4 | 1 | 0 |
| 3 | 2022-WAMSF-000348 | 2022-09-02 | 2022-12-16 | 47.63933 | -121.60540 | King | 4 | 105 | 0 |
| 4 | 2022-WACOF-001050 | 2022-04-14 | 2023-01-12 | 48.55950 | -119.06330 | Okanogan | 4 | 273 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126736 | 2023-IDIDNG-000355 | 2023-05-19 | 2023-05-19 | 43.19766 | -116.03630 | Ada | 1 | 0 | 0 |
| 126737 | 2023-IDIDNG-000356 | 2023-05-20 | 2023-05-20 | 43.24324 | -116.07390 | Ada | 1 | 0 | 0 |
| 126738 | 2023-IDIDNG-000357 | 2023-05-20 | 2023-05-20 | 43.18453 | -116.02600 | Ada | 1 | 0 | 0 |
| 126739 | 2023-IDIDNG-000358 | 2023-05-20 | 2023-05-20 | 43.24100 | -116.06910 | Ada | 1 | 0 | 0 |


In [27]:
df = handle_outliers(df)
df.to_csv(CSV_OUTLIERS, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause | FireDuration | CausedByWeather |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-COUMA-000926 | 2022-08-09 | 2022-08-15 | 37.17861 | -108.88910 | Montezuma | 2 | 6 | 1 |
| 1 | 2022-LASBR-000250 | 2022-12-10 | 2022-12-11 | 29.86160 | -93.77264 | Cameron | 4 | 1 | 0 |
| 2 | 2022-AZGID-000143 | 2022-03-14 | 2022-03-15 | 33.11710 | -110.77040 | Gila | 1 | 1 | 0 |
| 3 | 2022-PAPAS-001588 | 2022-11-09 | 2022-11-10 | 41.70000 | -79.03100 | Warren | 1 | 1 | 0 |
| 4 | 2022-AZFTA-000943 | 2022-10-29 | 2022-11-01 | 33.99352 | -110.48160 | Gila | 4 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 112497 | 2023-IDIDNG-000355 | 2023-05-19 | 2023-05-19 | 43.19766 | -116.03630 | Ada | 1 | 0 | 0 |
| 112498 | 2023-IDIDNG-000356 | 2023-05-20 | 2023-05-20 | 43.24324 | -116.07390 | Ada | 1 | 0 | 0 |
| 112499 | 2023-IDIDNG-000357 | 2023-05-20 | 2023-05-20 | 43.18453 | -116.02600 | Ada | 1 | 0 | 0 |
| 112500 | 2023-IDIDNG-000358 | 2023-05-20 | 2023-05-20 | 43.24100 | -116.06910 | Ada | 1 | 0 | 0 |


In [30]:
df = validate_data(df)
df.to_csv(FINAL_AFTER_PREP_CSV, index=False)
df

| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause | FireDuration | CausedByWeather |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-COUMA-000926 | 2022-08-09 | 2022-08-15 | 37.178610 | -108.889100 | Montezuma | 2 | 6 | 1 |
| 1 | 2022-PAPAS-001588 | 2022-11-09 | 2022-11-10 | 41.700000 | -79.031000 | Warren | 1 | 1 | 0 |
| 2 | 2022-ORNOD-220301 | 2022-08-17 | 2022-08-17 | 43.831530 | -122.733900 | Lane | 1 | 0 | 0 |
| 3 | 2022-ORNOD-220285 | 2022-08-13 | 2022-08-14 | 43.837670 | -122.773600 | Lane | 1 | 1 | 0 |
| 4 | 2022-ORBENN-000436 | 2022-07-26 | 2022-07-26 | 44.007700 | -121.223700 | Deschutes | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83421 | 2023-IDBOD-000349 | 2023-06-18 | 2023-06-20 | 43.330760 | -116.409500 | Ada | 1 | 2 | 0 |
| 83422 | 2023-CASNF-000573 | 2023-06-18 | 2023-06-19 | 37.568056 | -119.843889 | Mariposa | 4 | 1 | 0 |
| 83423 | 2023-AZA5S-230819 | 2023-06-18 | 2023-06-19 | 35.892840 | -114.087600 | Mohave | 1 | 1 | 0 |
| 83424 | 2023-AZA3S-230824 | 2023-06-19 | 2023-06-19 | 32.645000 | -111.392900 | Pinal | 1 | 0 | 0 |


In [39]:
print("DataFrame information:")
df.info()
print()
print("DataFrame description:")
df.describe(include='all')

DataFrame information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83426 entries, 0 to 83425
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   UniqueFireIdentifier   83426 non-null  object 
 1   FireDiscoveryDateTime  83426 non-null  object 
 2   FireOutDateTime        83426 non-null  object 
 3   InitialLatitude        83426 non-null  float64
 4   InitialLongitude       83426 non-null  float64
 5   POOCounty              83426 non-null  object 
 6   FireCause              83426 non-null  int64  
 7   FireDuration           83426 non-null  int64  
 8   CausedByWeather        83426 non-null  int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 5.7+ MB

DataFrame description:


| | UniqueFireIdentifier | FireDiscoveryDateTime | FireOutDateTime | InitialLatitude | InitialLongitude | POOCounty | FireCause | FireDuration | CausedByWeather |
|---|---|---|---|---|---|---|---|---|---|
| count | 83426 | 83426 | 83426 | 83426.0 | 83426.0 | 83426 | 83426.0 | 83426.0 | 83426.0 |
| unique | 83426 | 3189 | 3194 | | | 976 | | | |
| top | 2022-COUMA-000926 | 2020-07-05 | 2021-07-05 | | | Coconino | | | |
| freq | 1 | 258 | 163 | | | 1848 | | | |
| mean | | | | 40.55798 | -109.908788 | | 1.727315 | 4.892791 | 0.268501 |
| std | | | | 6.594656 | 12.933619 | | 0.994427 | 5.75579 | 0.443183 |
| min | | | | 25.136111 | -167.075908 | | 1.0 | 0.0 | 0.0 |
| 25% | | | | 35.203945 | -116.987375 | | 1.0 | 1.0 | 0.0 |
| 50% | | | | 40.108 | -111.58295 | | 1.0 | 3.0 | 0.0 |
| 75% | | | | 45.09965 | -105.0861 | | 2.0 | 7.0 | 1.0 |
