# **Forest Fire Prediction**

### Part 2: Data cleaning
In this section, we will focus on cleaning the raw data to ensure its quality and consistency. Data cleaning is an essential step in the data preprocessing pipeline, as it helps to eliminate errors, handle missing values, and transform the data into a suitable format for analysis.

Data handling stages:
- Handling Missing Values
- Handling Duplicates
- Data Conversion
- Outliers and Validations

#### **Imports section:** (and warning suppression)

In [None]:
# Note: when running in a clean environment, install the missing modules first (pandas, numpy, geopy)
import pandas as pd
import numpy as np
from datetime import date as dt
import re
from geopy.geocoders import Nominatim
pd.options.mode.chained_assignment = None

#### Global variables:

In [None]:
# We write a new CSV file at each stage to minimize data loss if an error occurs
CSV_NAME = 'fire_history.csv'
COLS_RED_CSV = 'fire_history_cols_reduction.csv'
MISS_CSV = 'fire_history_miss_values_removed.csv'
DUP_CSV = 'fire_history_dup_values_removed.csv'
DATA_CONV_CSV = 'fire_history_data_conv.csv'
CSV_OUTLIERS = 'fire_history_outliers_removed.csv'

FINAL_CSV = 'fire_history_final.csv'

COLS = ['UniqueFireIdentifier', 'FireDiscoveryDateTime', 'FireOutDateTime', 'InitialLatitude', 'InitialLongitude', 'POOCounty', 'FireCause']

#### Removing unnecessary columns:

In [None]:
def keep_necessary_columns(df, cols):
    df_cleaned = df[cols]
    return df_cleaned

#### Handling missing values:

In [None]:
def missing_values(df):
    df_cleaned = df.copy()
    df_cleaned.dropna(inplace=True)
    return df_cleaned

*The `handle_missing_values()` function is used to fill in missing values in the `FireCause` column before removing all rows with missing values.*

In [None]:
def handle_missing_values(df):
    df_cleaned = df.copy()
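    # Assign the result back; in-place fillna on a selected column is deprecated in recent pandas
    df_cleaned['FireCause'] = df_cleaned['FireCause'].fillna('Undetermined')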
    df_cleaned = missing_values(df_cleaned)
    return df_cleaned
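
To illustrate the effect, here is a small sketch on made-up rows (the sample values are hypothetical and not taken from the dataset). The missing `FireCause` is filled first, so only the row that lacks a latitude should be dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'UniqueFireIdentifier': ['A', 'B', 'C'],
    'InitialLatitude': [34.1, np.nan, 36.5],
    'FireCause': ['Human', 'Natural', np.nan],
})

# Fill the missing cause first, then drop rows that still contain missing values
filled = sample.copy()
filled['FireCause'] = filled['FireCause'].fillna('Undetermined')
cleaned = filled.dropna()

print(cleaned)
```

Dropping rows before filling would instead discard row `C` as well, which is why the order of the two steps matters.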

#### Removing duplicates:
*We use the `UniqueFireIdentifier` column to identify duplicates. It is the unique identifier assigned to each wildland fire and is built from three parts: yyyy = calendar year, SSUUUU = POO (point of origin) protecting unit identifier (5 or 6 characters), xxxxxx = local incident identifier (6 to 10 characters).*

In [None]:
def duplicated_values(df):
    df_cleaned = df.copy()
    df_cleaned.drop_duplicates(subset='UniqueFireIdentifier', keep='first', inplace=True)
    return df_cleaned
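
The identifier layout described above can also be checked before deduplicating. The sketch below is only an illustration: it assumes the three parts are separated by hyphens and are alphanumeric, which the description does not state, and `is_valid_fire_id` is a hypothetical helper that is not part of this notebook.

```python
import re

# Assumed layout: yyyy-SSUUUU-xxxxxx, with hyphens between the parts (the separator is an assumption)
ID_PATTERN = re.compile(r'\d{4}-[A-Za-z0-9]{5,6}-[A-Za-z0-9]{6,10}')

def is_valid_fire_id(value):
    """Return True if value matches the assumed identifier layout."""
    return isinstance(value, str) and ID_PATTERN.fullmatch(value) is not None

# Made-up identifiers, for illustration only
print(is_valid_fire_id('2020-CALAC-000123'))
print(is_valid_fire_id('20-CALAC-1'))
```

If the real identifiers use a different separator or character set, the pattern would need to be adjusted accordingly.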

#### Data Type Conversion:
*The `convert_date()` function converts dates into a useable format for our API. It parses each value separately to work around `pandas.to_datetime` limitations and human data-entry errors; values that cannot be parsed are set to `NaN`.*

In [None]:
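def convert_date(dates):
    # Parse each value on its own so one malformed entry cannot break the whole column
    def parse(value):
        try:
            return pd.to_datetime(value).date()
        except Exception:
            # Unparseable values become NaN so they can be dropped later
            return np.nan

    # apply() works on the values, so the result does not depend on the index labels
    return dates.apply(parse)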
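
As a quick illustration of the per-value parsing, the sketch below runs the same idea on two made-up strings. `parse_or_nan` is a hypothetical stand-alone helper used only for this example.

```python
from datetime import date

import numpy as np
import pandas as pd

# Made-up raw values: one valid timestamp and one data-entry error
raw = pd.Series(['2021-07-04 15:30:00', 'unknown'])

def parse_or_nan(value):
    """Return the calendar date of value, or NaN if it cannot be parsed."""
    try:
        return pd.to_datetime(value).date()
    except Exception:
        return np.nan

parsed = raw.apply(parse_or_nan)
print(parsed.tolist())
```

The valid entry keeps only its date part, and the invalid one becomes `NaN`, so a later `dropna()` removes it.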

*The `calc_days()` function calculates the difference in days between two dates.*

In [None]:
def calc_days(end_dates, start_dates):
    # Pair the values by position; indexing with range(len(...)) fails once rows were dropped
    days = []
    for end_date, start_date in zip(end_dates, start_dates):
        days.append((end_date - start_date).days)
    return days
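
A short worked example of the day difference, using made-up dates:

```python
from datetime import date

# Made-up discovery and containment dates
start_dates = [date(2021, 7, 4), date(2021, 12, 30)]
end_dates = [date(2021, 7, 10), date(2022, 1, 2)]

# Subtracting two date objects gives a timedelta; .days is the whole-day difference
durations = [(end - start).days for end, start in zip(end_dates, start_dates)]
print(durations)  # [6, 3]
```

A negative result would mean the recorded end date precedes the discovery date, which is one kind of data-entry error worth checking for.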

*The `data_conversion()` function converts our date columns to a usable format, converts `FireCause` column to a categorical column using `.map(cause_mapping)`, and adds `FireDuration` and `CausedByWeather` columns.*

In [None]:
def data_conversion(df):
    df_cleaned = df.copy()

    # Convert FireDiscoveryDateTime and FireOutDateTime to useable format
    df_cleaned['FireDiscoveryDateTime'] = convert_date(df_cleaned['FireDiscoveryDateTime'])
    df_cleaned['FireOutDateTime'] = convert_date(df_cleaned['FireOutDateTime'])
    df_cleaned = missing_values(df_cleaned) # Remove missing values after conversion

    # Add FireDuration column
    df_cleaned['FireDuration'] = calc_days(df_cleaned['FireOutDateTime'], df_cleaned['FireDiscoveryDateTime'])

    # Convert FireCause to categorical column
    cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4, np.nan: 4}
    df_cleaned['FireCause'] = df_cleaned['FireCause'].map(cause_mapping)

    # Add CausedByWeather column
    df_cleaned['CausedByWeather'] = df_cleaned['FireCause'].apply(lambda x: 1 if x == 2 else 0)

    return df_cleaned
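
The mapping step can be sketched on a few made-up labels. Note that a label missing from the dictionary maps to `NaN`; the `'Other'` label below is hypothetical and only there to show this.

```python
import pandas as pd

# Made-up cause labels; 'Other' is a hypothetical label that is not in the mapping
causes = pd.Series(['Human', 'Natural', 'Undetermined', 'Other'])
cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}

codes = causes.map(cause_mapping)

# A fire counts as weather-caused when its code is 2 ('Natural')
caused_by_weather = (codes == 2).astype(int)

print(codes.tolist())
print(caused_by_weather.tolist())
```

Because unmapped labels end up as `NaN`, it may be worth checking `codes.isna().sum()` after mapping to see whether the data contains causes the dictionary does not cover.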

#### Detecting Outliers:
*The `remove_outliers()` function is used to remove rows from a DataFrame based on a given boolean mask of outliers.*

In [None]:
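def remove_outliers(df, outliers):
    # `outliers` holds one flag per row, in row order, so filter by position
    # instead of looking flags up by index label (labels have gaps after earlier drops)
    mask = np.asarray(outliers, dtype=bool)
    return df[~mask]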

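*The `calculate_outliers()` function flags outliers using the interquartile range (IQR): values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The IQR is less sensitive to extreme values than the mean and standard deviation and does not assume a normal distribution.*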

In [None]:
def calculate_outliers(data):
    Q1, Q3 = np.percentile(data, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Compare element-wise to get one flag per value; `or` on a whole Series raises a ValueError
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers.tolist()
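
A worked example of the IQR rule on made-up durations (in days) shows how the bounds are derived:

```python
import numpy as np

# Made-up fire durations in days; 30 is the suspicious value
durations = np.array([1, 2, 2, 3, 3, 4, 30])

q1, q3 = np.percentile(durations, [25, 75])  # 2.0 and 3.5
iqr = q3 - q1                                # 1.5
lower_bound = q1 - 1.5 * iqr                 # -0.25
upper_bound = q3 + 1.5 * iqr                 # 5.75

# One boolean flag per value
is_outlier = (durations < lower_bound) | (durations > upper_bound)
print(durations[is_outlier])  # [30]
```

Only the 30-day value falls outside the interval from -0.25 to 5.75, so it is the only one flagged.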

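*The `valid_locations()` function uses the geopy library and the Nominatim geocoder to reverse geocode each fire's coordinates. A row is flagged as an outlier when the returned county does not match `POOCounty` or the country is not the United States.*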

In [None]:
def valid_locations(df):
    geolocator = Nominatim(user_agent="Geolocation")
    outliers = []

    for row in df.index:
        county = df.at[row, 'POOCounty'].lower()
        lat = df.at[row, 'InitialLatitude']
        long = df.at[row, 'InitialLongitude']

        location = geolocator.reverse((lat, long), exactly_one=True)
        # reverse() returns None when nothing is found, so check before reading the address
        address = location.raw.get('address', {}) if location is not None else {}
        valid_county = (address.get('county') or '').lower()
        valid_country = (address.get('country') or '').lower()

        # Compare in lowercase on both sides; a missing county or country counts as invalid
        if county not in valid_county or valid_country != 'united states':
            outliers.append(True)
        else:
            outliers.append(False)
    return outliers
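
The matching rule itself can be tried without network access by passing a hand-written address dictionary in place of a geocoder response. This is a sketch only: `is_location_outlier` is a hypothetical helper, and the dictionaries below are written by hand rather than taken from real Nominatim output.

```python
def is_location_outlier(county, address):
    """Flag a row when the reverse-geocoded address disagrees with the recorded county.

    `address` stands in for the 'address' dict of a reverse-geocoding response.
    """
    valid_county = (address.get('county') or '').lower()
    valid_country = (address.get('country') or '').lower()
    return county.lower() not in valid_county or valid_country != 'united states'

# Hand-written responses, not real API output
matching = is_location_outlier('Fresno', {'county': 'Fresno County', 'country': 'United States'})
wrong_county = is_location_outlier('Fresno', {'county': 'Kern County', 'country': 'United States'})
empty_response = is_location_outlier('Fresno', {})

print(matching, wrong_county, empty_response)  # False True True
```

Using a substring test lets a recorded county such as `Fresno` match a returned value such as `Fresno County`, at the cost of also accepting partial names.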

*Finally, the `handle_outliers()` function removes rows with outlying fire durations and then rows whose location fails validation.*

In [None]:
def handle_outliers(df):
    df_cleaned = df.copy()
    
    fire_duration_outliers = calculate_outliers(df_cleaned['FireDuration'])  # Analyze fire duration values
    df_cleaned = remove_outliers(df_cleaned, fire_duration_outliers)

    fire_location_outliers = valid_locations(df_cleaned)  # Validate locations
    df_cleaned = remove_outliers(df_cleaned, fire_location_outliers)

    return df_cleaned

#### Implementation section:

In [None]:
df = pd.read_csv(CSV_NAME)
df

In [None]:
df = keep_necessary_columns(df, COLS)
df.to_csv(COLS_RED_CSV, index=False)
df

In [None]:
df = handle_missing_values(df)  # Fill missing FireCause values before dropping incomplete rows
df.to_csv(MISS_CSV, index=False)
df

In [None]:
df = duplicated_values(df)
df.to_csv(DUP_CSV, index=False)
df

In [None]:
df = pd.read_csv(DUP_CSV)
df = data_conversion(df)
df.to_csv(DATA_CONV_CSV, index=False)
df

In [None]:
df = handle_outliers(df)
df.to_csv(CSV_OUTLIERS, index=False)
df