# **Forest Fire Prediction**

### Part 2: Data cleaning
In this section, we will focus on cleaning the raw data to ensure its quality and consistency. Data cleaning is an essential step in the data preprocessing pipeline, as it helps to eliminate errors, handle missing values, and transform the data into a suitable format for analysis.

Data handling stages:
- Handling Missing Values
- Handling Duplicates
- Data Conversion
- Outliers and Validations

#### **Imports section:** (and warning suppression)

In [31]:
# Note: on a clean environment, install the missing packages first (pandas, numpy, geopy)
import pandas as pd
import numpy as np
from datetime import datetime as dt
import re
from geopy.geocoders import Nominatim
pd.options.mode.chained_assignment = None  # Suppress pandas' SettingWithCopyWarning

#### Global variables:

In [42]:
# We create a new CSV file at each stage to minimize data loss if an error occurs
CSV_NAME = 'fire_history.csv'
COLS_RED_CSV = 'fire_history_cols_reduction.csv'
MISS_CSV = 'fire_history_miss_values_removed.csv'
DUP_CSV = 'fire_history_dup_values_removed.csv'
DATA_CONV_CSV = 'fire_history_data_conv.csv'
ADD_COLS_CSV = 'fire_history_additional.csv'
CSV_OUTLIERS = 'fire_history_outliers_removed.csv'

FINAL_AFTER_PREP_CSV = 'fire_history_prep.csv'

COLS = ['UniqueFireIdentifier', 'FireDiscoveryDateTime', 'FireOutDateTime', 'InitialLatitude', 'InitialLongitude', 'POOCounty', 'FireCause']

#### Removing unnecessary columns:

In [3]:
def keep_necessary_columns(df, cols):
    # Copy the selection so later edits do not touch (or warn about) the original frame
    df_cleaned = df[cols].copy()
    return df_cleaned
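
When the raw file is wide, the same column reduction can also be done while reading it. The sketch below is one possible alternative, not the approach used in this notebook: it passes `usecols` to `pd.read_csv` so only the needed columns are loaded. The in-memory CSV and the `wanted` list are stand-ins for `fire_history.csv` and `COLS`.

```python
import io
import pandas as pd

# Stand-in for the raw file: one row with an extra column we do not need
raw_csv = io.StringIO(
    "UniqueFireIdentifier,FireCause,IncidentSize\n"
    "2023-IDIDNG-000356,Human,0.50\n"
)
wanted = ['UniqueFireIdentifier', 'FireCause']  # stand-in for COLS

# usecols tells pandas to parse only the listed columns
df_small = pd.read_csv(raw_csv, usecols=wanted)
print(df_small.columns.tolist())
```

Loading fewer columns reduces memory use, at the cost of no longer having the full raw table available in the notebook.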

#### Handling missing values:
*The `handle_missing_values()` function is used to fill in missing values in the `FireCause` column before removing all rows with missing values.*

In [44]:
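def handle_missing_values(df):
    df_cleaned = df.copy()
    # Assign the result back: fillna(inplace=True) on a selected column may not update df_cleaned
    df_cleaned['FireCause'] = df_cleaned['FireCause'].fillna('Undetermined')
    # Drop every row that still has a missing value and renumber the index
    df_cleaned.dropna(inplace=True, ignore_index=True)
    return df_cleaned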
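
The order of the two steps matters: filling `FireCause` first means a row is not discarded only because its cause is unknown. A minimal sketch on made-up toy rows (the identifiers `fire-1` to `fire-3` are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'UniqueFireIdentifier': ['fire-1', 'fire-2', 'fire-3'],
    'FireOutDateTime': ['2022/08/15 20:39:00+00', np.nan, '2022/12/11 19:30:00+00'],
    'FireCause': ['Natural', 'Human', np.nan],
})

# Fill the cause first so a missing cause alone does not discard the row
filled = toy.assign(FireCause=toy['FireCause'].fillna('Undetermined'))

# Then drop rows that are still incomplete (fire-2 has no fire-out date)
cleaned = filled.dropna().reset_index(drop=True)
print(cleaned)
```

Here `fire-3` survives with the cause `Undetermined`, while `fire-2` is removed because its fire-out date is missing.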

#### Removing duplicates:
*We use the `UniqueFireIdentifier` column to identify duplicates. It is the unique identifier assigned to each wildland fire, in the form `yyyy-SSUUUU-xxxxxx`: yyyy = calendar year, SSUUUU = POO protecting unit identifier (5 or 6 characters), xxxxxx = local incident identifier (6 to 10 characters).*

In [6]:
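def duplicated_values(df):
    df_cleaned = df.copy()
    # Keep the first row per fire identifier and renumber the index so later steps see 0..n-1
    df_cleaned.drop_duplicates(subset='UniqueFireIdentifier', keep='first', inplace=True, ignore_index=True)
    return df_cleaned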
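
The identifier layout described above can be checked with a regular expression. This is a sketch under assumptions: the hyphen separators and the alphanumeric character classes are inferred from the sample rows, and `parse_fire_id` is a hypothetical helper that is not part of the notebook.

```python
import re

# Assumed pattern: yyyy-SSUUUU-xxxxxx with hyphen separators, as seen in the sample rows
FIRE_ID_PATTERN = re.compile(r'^(\d{4})-([A-Za-z0-9]{5,6})-([A-Za-z0-9]{6,10})$')

def parse_fire_id(fire_id):
    """Return (year, unit_id, local_id), or None if the identifier is malformed."""
    match = FIRE_ID_PATTERN.match(str(fire_id))
    if match is None:
        return None
    year, unit_id, local_id = match.groups()
    return int(year), unit_id, local_id

print(parse_fire_id('2020-CALAC-066100'))   # (2020, 'CALAC', '066100')
print(parse_fire_id('2023-IDIDNG-000356'))  # (2023, 'IDIDNG', '000356')
print(parse_fire_id('not-a-fire-id'))       # None
```

A check like this could flag malformed identifiers before they are used as the deduplication key.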

#### Data Type Conversion:
*The `convert_date()` function converts dates into a usable format for our API. It parses each value separately, so `pandas.to_datetime` limitations and human errors made when entering data become missing values instead of stopping the conversion.*

In [50]:
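def convert_date(dates):
    converted = []
    for value in dates:
        try:
            # Parse each value on its own so one bad entry cannot fail the whole column
            converted.append(pd.to_datetime(value).date())
        except (ValueError, TypeError, OverflowError):
            # Out-of-range or malformed dates (e.g. typos) become missing values
            converted.append(np.nan)
    # Return a new Series aligned with the input instead of modifying it in place
    return pd.Series(converted, index=dates.index)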
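
For columns where every value shares a single format, a vectorized alternative to the element-wise loop is `pd.to_datetime(..., errors='coerce')`. This is a different technique from the one used in `convert_date()`, shown only as a sketch on toy values; columns with mixed or unusual formats may still need the element-wise approach.

```python
import pandas as pd

raw_dates = pd.Series(['2022-09-19', 'not a date', None])  # toy values in one format

# errors='coerce' turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors='coerce')

# .dt.date keeps only the calendar date, mirroring what convert_date() returns
dates_only = parsed.dt.date
print(parsed.isna().tolist())  # [False, True, True]
```

The missing entries can then be removed with `dropna()`, as `data_conversion()` does.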

*The `data_conversion()` function converts our date columns to a usable format, drops the rows whose dates could not be parsed, and encodes the `FireCause` column as integer category codes using `.map(cause_mapping)`.*

In [35]:
def data_conversion(df):
    df_cleaned = df.copy()

    # Convert FireDiscoveryDateTime and FireOutDateTime to useable format and remove missing values after conversion
    df_cleaned['FireDiscoveryDateTime'] = convert_date(df_cleaned['FireDiscoveryDateTime'])
    df_cleaned['FireOutDateTime'] = convert_date(df_cleaned['FireOutDateTime'])
    df_cleaned.dropna(inplace=True, ignore_index=True)

    # Convert FireCause to categorical column
    cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}
    df_cleaned['FireCause'] = df_cleaned['FireCause'].map(cause_mapping)

    return df_cleaned
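
One property of `.map()` with a dictionary is worth keeping in mind: any label that is not a key of the dictionary becomes `NaN`. A minimal sketch of a check for such labels, using a made-up label (`'Lightning'`) that does not come from the dataset:

```python
import pandas as pd

cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}
causes = pd.Series(['Human', 'Natural', 'Lightning', 'Undetermined'])  # 'Lightning' is made up

encoded = causes.map(cause_mapping)

# Labels missing from the mapping show up as NaN after .map()
unmapped = causes[encoded.isna()].unique().tolist()
print(unmapped)  # ['Lightning']
```

Running a check like this before overwriting the column makes unexpected labels visible. Applying the mapping a second time to already encoded values would likewise produce `NaN` for every row.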

#### Extras:
*The `calc_days()` function calculates the difference in days between two dates.*

In [36]:
def calc_days(end_dates, start_dates):
    # to_datetime accepts 'YYYY-MM-DD' strings (after a CSV reload) as well as date objects
    end = pd.to_datetime(pd.Series(end_dates))
    start = pd.to_datetime(pd.Series(start_dates))
    # Subtracting the columns gives timedeltas; keep only the whole-day part
    return (end - start).dt.days.tolist()
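
As a worked example of the day difference, the sketch below uses only the standard library and the first two date pairs from the converted data shown later in this notebook. The variable names are illustrative.

```python
from datetime import datetime

date_format = '%Y-%m-%d'
pairs = [
    ('2022-11-07', '2022-09-19'),  # (fire out, fire discovered)
    ('2022-08-15', '2022-08-09'),
]

days = []
for end_text, start_text in pairs:
    end_date = datetime.strptime(end_text, date_format).date()
    start_date = datetime.strptime(start_text, date_format).date()
    # Subtracting two dates yields a timedelta; .days is the whole-day difference
    days.append((end_date - start_date).days)

print(days)  # [49, 6]
```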

*The `additional_cols()` function adds two additional useful columns to the DataFrame.*

In [37]:
def additional_cols(df):
    df_extra = df.copy()

    # Add FireDuration column (days between discovery and fire-out dates)
    df_extra['FireDuration'] = calc_days(df_extra['FireOutDateTime'], df_extra['FireDiscoveryDateTime'])

    # Add CausedByWeather column: 1 when FireCause is 2 ('Natural' in cause_mapping), else 0
    df_extra['CausedByWeather'] = df_extra['FireCause'].apply(lambda x: 1 if x == 2 else 0)

    return df_extra


#### Detecting Outliers:
*The `remove_outliers()` function is used to remove rows from a DataFrame based on a given boolean mask of outliers.*

In [38]:
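def remove_outliers(df, outliers):
    # Treat outliers as a positional boolean mask: one flag per row, in row order
    mask = np.asarray(outliers, dtype=bool)
    # Keep the rows that are not flagged and renumber the index
    return df.loc[~mask].reset_index(drop=True)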
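
Filtering with a boolean mask does not need a loop. The sketch below uses toy values and a deliberately non-contiguous index to show that a NumPy mask is applied by position (one flag per row, in row order), not by index label.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as happens after earlier rows were dropped
toy = pd.DataFrame({'FireDuration': [1, 300, 2]}, index=[10, 11, 15])
mask = np.array([False, True, False])  # second row is flagged as an outlier

# ~mask inverts the flags, so only the unflagged rows are kept
kept = toy.loc[~mask]
print(kept)
```

The result keeps the rows labelled 10 and 15, regardless of the gap in the index.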

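*The `calculate_outliers()` function flags outliers using the interquartile range (IQR): values below `Q1 - 1.5 * IQR` or above `Q3 + 1.5 * IQR`. The IQR is less sensitive to extreme values than the mean and standard deviation, and it does not assume a normal distribution.*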

In [39]:
def calculate_outliers(data):
    Q1, Q3 = np.percentile(data, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Compare element-wise; `if data < lower_bound` on a whole column raises a ValueError
    values = np.asarray(data)
    outliers = (values < lower_bound) | (values > upper_bound)
    # Return one True/False flag per value, in the original order
    return outliers.tolist()
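
To make the IQR rule concrete, here is a small worked example on toy values (not taken from the dataset):

```python
import numpy as np

durations = np.array([1, 2, 3, 4, 100])  # toy fire durations in days

q1, q3 = np.percentile(durations, [25, 75])  # 2.0 and 4.0
iqr = q3 - q1                                # 2.0
lower_bound = q1 - 1.5 * iqr                 # -1.0
upper_bound = q3 + 1.5 * iqr                 # 7.0

# Element-wise comparison produces one flag per value
flags = ((durations < lower_bound) | (durations > upper_bound)).tolist()
print(flags)  # [False, False, False, False, True]
```

Only the value 100 falls outside the fences, so it is the only one flagged.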

*The `valid_locations()` function uses the geopy library and the Nominatim geocoder to reverse-geocode each fire's coordinates, and flags rows whose resolved county or country does not match the record.*

In [40]:
def valid_locations(df):
    geolocator = Nominatim(user_agent="Geolocation")
    outliers = []

    for row in df.index:
        county = str(df.loc[row, 'POOCounty']).lower()
        lat = df.loc[row, 'InitialLatitude']
        long = df.loc[row, 'InitialLongitude']

        location = geolocator.reverse((lat, long), exactly_one=True)
        # reverse() returns None when nothing is found, so check before reading .raw
        address = location.raw.get('address', {}) if location is not None else {}
        valid_county = (address.get('county') or '').lower()
        valid_country = (address.get('country') or '').lower()

        # Both values are lowercased, so compare against the lowercase country name
        if county not in valid_county or valid_country != 'united states':
            outliers.append(True)
        else:
            outliers.append(False)
    return outliers
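
The matching rule itself can be tried without calling the geocoding service. In the sketch below, `is_location_outlier` is a hypothetical helper and `sample_address` is an assumed shape for the `address` part of a reverse-geocoding response. Real responses may contain different keys or omit some of them.

```python
def is_location_outlier(county, address):
    """Return True when a reverse-geocoded address disagrees with the recorded county."""
    # Missing keys are treated as empty strings so .lower() never fails on None
    valid_county = (address.get('county') or '').lower()
    valid_country = (address.get('country') or '').lower()
    # Substring match: the recorded 'Ada' should match a resolved 'Ada County'
    return county.lower() not in valid_county or valid_country != 'united states'

sample_address = {'county': 'Ada County', 'country': 'United States'}  # assumed response shape

print(is_location_outlier('Ada', sample_address))     # False
print(is_location_outlier('Placer', sample_address))  # True
print(is_location_outlier('Ada', {}))                 # True
```

An empty address is treated as an outlier, which matches how `valid_locations()` handles coordinates that cannot be resolved.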

*Finally, the `handle_outliers()` function removes fire-duration outliers and invalid locations from the given DataFrame.*

In [41]:
def handle_outliers(df):
    df_cleaned = df.copy()
    
    fire_duration_outliers = calculate_outliers(df_cleaned['FireDuration'])  # Analyze fire duration values
    df_cleaned = remove_outliers(df_cleaned, fire_duration_outliers)

    fire_location_outliers = valid_locations(df_cleaned)  # Validate locations
    df_cleaned = remove_outliers(df_cleaned, fire_location_outliers)

    return df_cleaned

### Implementation section:
Our data before any handling or preparation:

In [45]:
df = pd.read_csv(CSV_NAME)
df


Unnamed: 0,X,Y,OBJECTID,SourceOID,ABCDMisc,ADSPermissionState,ContainmentDateTime,ControlDateTime,CreatedBySystem,IncidentSize,...,EstimatedFinalCost,OrganizationalAssessment,StrategicDecisionPublishDate,CreatedOnDateTime_dt,ModifiedOnDateTime_dt,IsCpxChild,CpxName,CpxID,SourceGlobalID,GlobalID
0,-118.180712,33.808985,1,7747595,,DEFAULT,,,lacocad,,...,,,,2020/02/28 20:52:36.363+00,2020/02/28 20:52:36.363+00,0,,,{6A311ABB-DF4F-4947-B8DD-3900BDA784F6},48d2c0e2-5e38-4d40-9d5e-066b076c7d98
1,-117.153901,33.176394,2,6384391,,DEFAULT,,,firecode,,...,,,,2019/07/01 20:10:12.737+00,2019/07/01 20:10:12.737+00,0,,,{1AF2C949-B159-4D8F-8D39-90CB58BC5DD5},17d2d66a-d451-4592-a172-7b2c860a2cc9
2,-121.104180,38.834727,3,1383752,,DEFAULT,,,firecode,,...,,,,2016/06/20 22:39:02.410+00,2016/06/20 22:39:02.410+00,0,,,{1B179EA1-97CE-4699-915B-374754BCBC5B},60c471ff-3c85-41b4-9135-e7338d7ec90b
3,-117.228592,33.782442,4,22499589,,DEFAULT,,,cfcad,,...,,,,2021/11/25 15:24:53.120+00,2021/11/25 15:24:53.120+00,0,,,{E61E387B-4ED7-4971-9604-C5D7391FAF77},149237ec-a42e-43d6-9318-22207a705dd9
4,-118.309032,33.941815,5,23869477,,DEFAULT,,,lacocad,,...,,,,2022/11/21 11:28:49.097+00,2022/11/21 11:28:49.097+00,0,,,{AEB6F7A3-A109-4132-9FEB-FB1EE1DF3193},ef7675e3-d5be-412a-a6c1-0d63fc7153c8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263817,-116.073913,43.243246,315324,28035649,,FIREREPORTING,2023/05/20 19:06:00+00,2023/05/20 19:06:00+00,wildcad,0.50,...,,,,2023/06/19 16:23:18.380+00,2023/06/19 16:26:42.300+00,0,,,{1553284C-4F2F-4D1E-8DFF-77F4593289FE},32553b11-17d3-405a-82ff-1cfbdfa6492e
263818,-116.026013,43.184536,315326,28035650,,FIREREPORTING,2023/05/20 19:33:00+00,2023/05/20 19:33:00+00,wildcad,0.50,...,,,,2023/06/19 16:28:15.883+00,2023/06/19 16:28:37.680+00,0,,,{2957B1E7-485A-4BE8-8914-2EEAD0823DF3},09c8f1ca-6e1f-4438-b929-55488216cb74
263819,-116.069113,43.241006,315327,28035652,,FIREREPORTING,2023/05/20 23:30:00+00,2023/05/20 23:30:00+00,wildcad,20.00,...,,,,2023/06/19 16:35:53.457+00,2023/06/19 16:36:22.327+00,0,,,{B0A31C1B-638A-4FBC-AF6F-E8BAC9DFCED5},72a22987-ba12-414c-9b8e-63eaa2587dd9
263820,-151.187739,60.447151,315328,28035653,,DEFAULT,2023/06/19 08:19:42+00,2023/06/19 08:37:23+00,ifm,0.10,...,,,,2023/06/19 16:40:37.620+00,2023/06/19 16:44:40.917+00,0,,,{F8490B1B-82F1-4851-8386-F121978FE268},197b872b-1932-46aa-a7e6-628097227187


In [46]:
df = keep_necessary_columns(df, COLS)
df.to_csv(COLS_RED_CSV, index=False)
df

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,InitialLatitude,InitialLongitude,POOCounty,FireCause
0,2020-CALAC-066100,2020/02/28 20:45:40+00,,33.808980,-118.180700,Los Angeles,Unknown
1,2019-CAMVU-009269,2019/07/01 19:54:00+00,,,,San Diego,
2,2016-CANEU-014375,2016/06/20 22:06:00+00,,,,Placer,
3,2021-CARRU-163915,2021/11/25 15:17:33+00,,33.782437,-117.228580,Riverside,Undetermined
4,2022-CALAC-396331,2022/11/21 11:25:34+00,,33.941810,-118.309020,Los Angeles,Undetermined
...,...,...,...,...,...,...,...
263817,2023-IDIDNG-000356,2023/05/20 18:30:00+00,2023/05/20 19:06:00+00,43.243240,-116.073900,Ada,Human
263818,2023-IDIDNG-000357,2023/05/20 19:15:00+00,2023/05/20 19:33:00+00,43.184530,-116.026000,Ada,Human
263819,2023-IDIDNG-000358,2023/05/20 20:28:00+00,2023/05/20 23:30:00+00,43.241000,-116.069100,Ada,Human
263820,2023-AKKKS-303132,2023/06/19 05:14:10+00,,60.447150,-151.187717,Kenai Peninsula,Human


In [47]:
df = handle_missing_values(df)
df.to_csv(MISS_CSV, index=False)
df

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,InitialLatitude,InitialLongitude,POOCounty,FireCause
0,2022-MTCRA-220737,2022/09/19 06:08:00+00,2022/11/07 18:05:00+00,45.62299,-107.46470,Big Horn,Undetermined
1,2022-COUMA-000926,2022/08/09 21:32:00+00,2022/08/15 20:39:00+00,37.17861,-108.88910,Montezuma,Natural
2,2022-LASBR-000250,2022/12/10 01:09:19+00,2022/12/11 19:30:00+00,29.86160,-93.77264,Cameron,Undetermined
3,2022-WAMSF-000348,2022/09/02 20:08:00+00,2022/12/16 21:13:00+00,47.63933,-121.60540,King,Undetermined
4,2022-WACOF-001050,2022/04/14 17:30:00+00,2023/01/12 16:30:00+00,48.55950,-119.06330,Okanogan,Undetermined
...,...,...,...,...,...,...,...
126749,2023-IDIDNG-000355,2023/05/19 00:08:00+00,2023/05/19 00:16:00+00,43.19766,-116.03630,Ada,Human
126750,2023-IDIDNG-000356,2023/05/20 18:30:00+00,2023/05/20 19:06:00+00,43.24324,-116.07390,Ada,Human
126751,2023-IDIDNG-000357,2023/05/20 19:15:00+00,2023/05/20 19:33:00+00,43.18453,-116.02600,Ada,Human
126752,2023-IDIDNG-000358,2023/05/20 20:28:00+00,2023/05/20 23:30:00+00,43.24100,-116.06910,Ada,Human


In [48]:
df = duplicated_values(df)
df.to_csv(DUP_CSV, index=False)
df

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,InitialLatitude,InitialLongitude,POOCounty,FireCause
0,2022-MTCRA-220737,2022/09/19 06:08:00+00,2022/11/07 18:05:00+00,45.62299,-107.46470,Big Horn,Undetermined
1,2022-COUMA-000926,2022/08/09 21:32:00+00,2022/08/15 20:39:00+00,37.17861,-108.88910,Montezuma,Natural
2,2022-LASBR-000250,2022/12/10 01:09:19+00,2022/12/11 19:30:00+00,29.86160,-93.77264,Cameron,Undetermined
3,2022-WAMSF-000348,2022/09/02 20:08:00+00,2022/12/16 21:13:00+00,47.63933,-121.60540,King,Undetermined
4,2022-WACOF-001050,2022/04/14 17:30:00+00,2023/01/12 16:30:00+00,48.55950,-119.06330,Okanogan,Undetermined
...,...,...,...,...,...,...,...
126749,2023-IDIDNG-000355,2023/05/19 00:08:00+00,2023/05/19 00:16:00+00,43.19766,-116.03630,Ada,Human
126750,2023-IDIDNG-000356,2023/05/20 18:30:00+00,2023/05/20 19:06:00+00,43.24324,-116.07390,Ada,Human
126751,2023-IDIDNG-000357,2023/05/20 19:15:00+00,2023/05/20 19:33:00+00,43.18453,-116.02600,Ada,Human
126752,2023-IDIDNG-000358,2023/05/20 20:28:00+00,2023/05/20 23:30:00+00,43.24100,-116.06910,Ada,Human


In [51]:
df = data_conversion(df)
df.to_csv(DATA_CONV_CSV, index=False)
df

Unnamed: 0,UniqueFireIdentifier,FireDiscoveryDateTime,FireOutDateTime,InitialLatitude,InitialLongitude,POOCounty,FireCause
0,2022-MTCRA-220737,2022-09-19,2022-11-07,45.62299,-107.46470,Big Horn,
1,2022-COUMA-000926,2022-08-09,2022-08-15,37.17861,-108.88910,Montezuma,
2,2022-LASBR-000250,2022-12-10,2022-12-11,29.86160,-93.77264,Cameron,
3,2022-WAMSF-000348,2022-09-02,2022-12-16,47.63933,-121.60540,King,
4,2022-WACOF-001050,2022-04-14,2023-01-12,48.55950,-119.06330,Okanogan,
...,...,...,...,...,...,...,...
126736,2023-IDIDNG-000355,2023-05-19,2023-05-19,43.19766,-116.03630,Ada,
126737,2023-IDIDNG-000356,2023-05-20,2023-05-20,43.24324,-116.07390,Ada,
126738,2023-IDIDNG-000357,2023-05-20,2023-05-20,43.18453,-116.02600,Ada,
126739,2023-IDIDNG-000358,2023-05-20,2023-05-20,43.24100,-116.06910,Ada,


In [None]:
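# FireDuration is created by additional_cols(), so it has to run before handle_outliers()
df = additional_cols(df)
df.to_csv(ADD_COLS_CSV, index=False)
df = handle_outliers(df)
df.to_csv(CSV_OUTLIERS, index=False)
df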