# **Forest Fire Prediction**

### Part 2: Data cleaning

In this section, we will focus on cleaning the raw data to ensure its quality and consistency. Data cleaning is an essential step in the data preprocessing pipeline, as it helps to eliminate errors, handle missing values, and transform the data into a suitable format for analysis.

#### Handling Missing Values

Missing values are a common occurrence in datasets and can pose challenges during analysis. Here are some techniques to handle missing values:

- **Identify Missing Values**: Start by identifying missing values in the dataset using functions like `isnull()` or `isna()`.

- **Drop Rows or Columns**: If the missing values are limited to a few rows or columns, you can choose to drop them using functions like `dropna()`.

- **Imputation**: For missing values in numerical data, you can fill them with statistical measures like mean, median, or interpolation. For categorical data, you can fill missing values with the most frequent category.
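The three techniques above can be sketched on a toy DataFrame. The column names and values here are made up for illustration and are not taken from the fire dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap in a numeric and one in a categorical column
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 25.0, 30.0],
    "cause": ["Human", "Natural", None, "Human"],
})

# 1. Identify: count missing values per column
missing_counts = df.isna().sum()

# 2. Drop: remove every row that has at least one missing value
dropped = df.dropna()

# 3. Impute: median for the numeric column, most frequent category for the categorical one
imputed = df.copy()
imputed["temperature"] = imputed["temperature"].fillna(imputed["temperature"].median())
imputed["cause"] = imputed["cause"].fillna(imputed["cause"].mode().iloc[0])
```

Dropping is the simplest option but discards the other values in those rows; imputation keeps the rows at the cost of introducing estimated values.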

#### Handling Duplicates

Duplicates in the dataset can lead to biased results and skewed analysis. Here's how to handle duplicates:

- **Identify Duplicates**: Use functions like `duplicated()` to identify duplicated rows in the dataset.

- **Remove Duplicates**: If duplicates are found, you can remove them using the `drop_duplicates()` function.
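A minimal sketch of both steps, using a toy DataFrame with hypothetical column names. It also shows the `subset` argument, which treats rows as duplicates when only the listed columns match.

```python
import pandas as pd

df = pd.DataFrame({
    "fire_id": ["A1", "B2", "A1", "A1"],
    "county": ["Ada", "Kern", "Ada", "Lane"],
})

# Identify: True for rows that repeat an earlier row in every column
full_row_mask = df.duplicated()

# Remove exact duplicate rows (the first occurrence is kept by default)
deduplicated = df.drop_duplicates()

# Remove rows that share an identifier, even if other columns differ
deduplicated_by_id = df.drop_duplicates(subset="fire_id", keep="first")
```

Deduplicating on an identifier is stricter than deduplicating on full rows: here the last row has a different county but the same `fire_id`, so only the `subset` variant removes it.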

#### Data Type Conversion

Sometimes, the data types of columns may need to be converted to a different format for analysis. Here are some common data type conversions:

- **Numeric Conversion**: Convert columns from object or string types to numeric types (e.g., float or integer) using `astype()` or `to_numeric()`.

- **Datetime Conversion**: Convert columns containing dates or timestamps to datetime format using `to_datetime()`.
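As a small illustration of both conversions, the sketch below parses string columns from a toy DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "acres": ["12.5", "300", "n/a"],
    "discovered": ["2021-07-04 13:30:00", "2021-08-15 09:00:00", "2021-09-01 00:00:00"],
})

# errors="coerce" turns entries that cannot be parsed into NaN instead of raising
df["acres"] = pd.to_numeric(df["acres"], errors="coerce")

# Parse the timestamp strings into a datetime64 column
df["discovered"] = pd.to_datetime(df["discovered"])
```

`astype(float)` would raise on the `"n/a"` entry, so `to_numeric()` with `errors="coerce"` is usually the safer choice when a column may contain placeholder text.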

#### Outliers Detection and Treatment

Outliers can significantly impact statistical analysis and machine learning models. Consider the following steps for outlier detection and treatment:

- **Visualize Data**: Plotting box plots or histograms can help identify potential outliers.

- **Define Thresholds**: Establish thresholds or statistical measures (e.g., mean ± 3 standard deviations) to determine outliers.

- **Handling Outliers**: Decide on appropriate actions for outliers, such as removal, imputation, or transforming them to a more suitable value.
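The threshold step can be sketched as follows, using the mean ± 3 standard deviations rule on made-up duration values. Two possible treatments are shown: removing the flagged values, or clipping them to the threshold.

```python
import pandas as pd

# Hypothetical durations: 1..20 days plus one extreme value
durations = pd.Series(list(range(1, 21)) + [500], name="duration_days")

mean, std = durations.mean(), durations.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Flag values outside the thresholds
is_outlier = (durations < lower) | (durations > upper)

# Treatment 1: remove the flagged values
filtered = durations[~is_outlier]

# Treatment 2: keep every row but cap values at the thresholds
clipped = durations.clip(lower=lower, upper=upper)
```

This rule needs a reasonably large sample: with ten or fewer values, no point can lie more than three sample standard deviations from the mean, so nothing would ever be flagged.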

These are just a few examples of common data cleaning tasks. The specific cleaning steps may vary depending on the nature of the dataset and the analysis goals. The goal of data cleaning is to ensure the accuracy, consistency, and reliability of the data for further analysis.


Imports section:

In [1]:
# Note: on a clean environment, install any missing modules first (pandas, numpy, geopy)
import pandas as pd
import numpy as np
from datetime import date as dt
import re
from geopy.geocoders import Nominatim

Global variables:

In [None]:
CSV_NAME = 'fire_history.csv'
CLEAN_COL_CSV = 'fire_history_clean_cols.csv'
CLEAN_MISS_CSV = 'fire_history_clean_miss.csv'
CLEAN_DUP_CSV = 'fire_history_dup.csv'
DATA_CONV_CSV = 'fire_history_data_conv.csv'
OUTLIERS_CSV = 'fire_history_outliers.csv'

COLS = []

COLS1 = ['UniqueFireIdentifier', 'FireDiscoveryDateTime', 'FireOutDateTime',
        'POOCounty', 'InitialLatitude', 'InitialLongitude', 'FireCause']


Removing unnecessary columns:

In [None]:
def keep_necessary_columns(df, cols):
    df_cleaned = df[cols]
    return df_cleaned
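If the unused columns are never needed, the selection can also happen while reading the file. The sketch below assumes an in-memory CSV with a made-up `ExtraField` column and placeholder identifiers, so it runs without the real `fire_history.csv`.

```python
import io

import pandas as pd

# Stand-in for the raw CSV file (hypothetical content)
csv_text = (
    "UniqueFireIdentifier,POOCounty,ExtraField\n"
    "id-1,Ada,x\n"
    "id-2,Kern,y\n"
)
cols = ["UniqueFireIdentifier", "POOCounty"]

# usecols loads only the listed columns, which also saves memory on large files
df = pd.read_csv(io.StringIO(csv_text), usecols=cols)
```

Selecting after loading, as `keep_necessary_columns()` does, has the advantage that the full raw DataFrame stays available for inspection.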

Handling Missing Values:

In [None]:
def missing_values(df):
    df_cleaned = df.copy()
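    # Assign the result back: in-place fillna on a selected column may not update the DataFrame in newer pandas versions
    df_cleaned['FireCause'] = df_cleaned['FireCause'].fillna('Undetermined')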
    df_cleaned.dropna(inplace=True)
    return df_cleaned
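To make the effect of this two-step policy concrete, here is a sketch on a three-row toy frame. The helper name `fill_cause_then_drop` and the sample values are hypothetical; the logic mirrors the fill-then-drop order used above.

```python
import pandas as pd


def fill_cause_then_drop(df):
    """Fill missing fire causes first, then drop rows with any other gap."""
    cleaned = df.copy()
    cleaned["FireCause"] = cleaned["FireCause"].fillna("Undetermined")
    return cleaned.dropna()


sample = pd.DataFrame({
    "UniqueFireIdentifier": ["id-1", "id-2", "id-3"],  # placeholder identifiers
    "POOCounty": ["Ada", None, "Kern"],
    "FireCause": ["Human", "Natural", None],
})

result = fill_cause_then_drop(sample)
```

The order matters: the row with a missing cause survives because it is filled before `dropna()` runs, while the row with a missing county is removed.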

Handling Duplicates: <br>
<br>
*We use the `UniqueFireIdentifier` column to identify duplicates. It is the unique identifier assigned to each wildland fire and is built from three parts: yyyy = calendar year, SSUUUU = POO protecting unit identifier (5 or 6 characters), xxxxxx = local incident identifier (6 to 10 characters).*

In [None]:
def duplicate_values(df):
    df_cleaned = df.copy()
    df_cleaned.drop_duplicates(subset='UniqueFireIdentifier', keep='first', inplace=True)
    # df_cleaned.drop_duplicates(keep='first', inplace=True)
    return df_cleaned

Data Type Conversion:

In [2]:
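def convert_date(dates):
    # Parse each value and keep only the calendar date. Returns a new Series with
    # the original index instead of writing into the input by integer label,
    # which breaks once rows have been dropped and the index has gaps.
    return dates.apply(lambda value: pd.to_datetime(value).date())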


In [None]:
def calc_days(end_dates, start_dates):
    # Pair the values by position so gaps in the index (left by dropped rows)
    # cannot cause a failed label lookup
    return [(end_date - start_date).days
            for end_date, start_date in zip(end_dates, start_dates)]
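For larger datasets, the same date parsing and day counting can be done without Python-level loops. The following is a sketch of a vectorized alternative, not the approach used in this notebook; the sample timestamps are invented.

```python
import pandas as pd

dates = pd.DataFrame({
    "FireDiscoveryDateTime": ["2021-07-01 10:00:00", "2021-08-10 23:30:00"],
    "FireOutDateTime": ["2021-07-04 09:00:00", "2021-08-11 00:30:00"],
})

# normalize() resets the time to midnight, mirroring the conversion to a calendar date
start = pd.to_datetime(dates["FireDiscoveryDateTime"]).dt.normalize()
end = pd.to_datetime(dates["FireOutDateTime"]).dt.normalize()

# Subtracting two datetime Series gives timedeltas; .dt.days extracts whole days
dates["FireDuration"] = (end - start).dt.days
```

Dropping the time of day before subtracting changes the result: the second fire lasted only one hour, but it spans two calendar dates and therefore counts as one day.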


In [2]:
def data_type_conversion(df):
    df_cleaned = df.copy()

    # Convert FireDiscoveryDateTime and FireOutDateTime to useable format
    df_cleaned['FireDiscoveryDateTime'] = convert_date(df_cleaned['FireDiscoveryDateTime'])
    df_cleaned['FireOutDateTime'] = convert_date(df_cleaned['FireOutDateTime'])

    # Add FireDuration column
    df_cleaned['FireDuration'] = calc_days(df_cleaned['FireOutDateTime'], df_cleaned['FireDiscoveryDateTime'])

    # Encode FireCause as integer codes (causes missing from the mapping become NaN)
    cause_mapping = {'Human': 1, 'Natural': 2, 'Unknown': 3, 'Undetermined': 4}
    df_cleaned['FireCause'] = df_cleaned['FireCause'].map(cause_mapping)

    # Add CausedByWeather column
    df_cleaned['CausedByWeather'] = df_cleaned['FireCause'].apply(lambda x: 1 if x==2 else 0)

    return df_cleaned
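One thing worth checking after the mapping step is whether any cause labels were not covered by `cause_mapping`, since `map()` silently turns those into NaN. A small sketch, where the `"Equipment"` label is a hypothetical unmapped value:

```python
import pandas as pd

cause_mapping = {"Human": 1, "Natural": 2, "Unknown": 3, "Undetermined": 4}

# "Equipment" is a made-up label that the mapping does not cover
causes = pd.Series(["Human", "Natural", "Undetermined", "Equipment"])

codes = causes.map(cause_mapping)

# Labels that ended up as NaN after mapping
unmapped = causes[codes.isna()].unique()

# Vectorized equivalent of the CausedByWeather flag (code 2 = Natural)
caused_by_weather = (codes == 2).astype(int)
```

If `unmapped` is not empty, either extend the mapping or decide explicitly how those rows should be treated.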


Outliers Detection and Treatment:

*The `remove_outliers()` function is used to remove rows from a DataFrame based on a given boolean mask of outliers.*

In [None]:
def remove_outliers(df, outliers):
    return df[~outliers]

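*The `calculate_outliers()` function flags outliers using the interquartile range (IQR): values more than 1.5 × IQR below the first quartile or above the third quartile. The IQR is less sensitive to extreme values than the mean and standard deviation, and it does not assume a normal distribution.*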

In [5]:
def calculate_outliers(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return (series < lower_bound) | (series > upper_bound)
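A worked example of the IQR bounds on ten made-up duration values, with the intermediate numbers noted in the comments:

```python
import pandas as pd

durations = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = durations.quantile(0.25)      # 3.25 (linear interpolation between 3 and 4)
q3 = durations.quantile(0.75)      # 7.75
iqr = q3 - q1                      # 4.5
lower_bound = q1 - 1.5 * iqr       # -3.5
upper_bound = q3 + 1.5 * iqr       # 14.5

# Only the value 100 falls outside [-3.5, 14.5]
mask = (durations < lower_bound) | (durations > upper_bound)
kept = durations[~mask]
```

Note that the extreme value barely moves the quartiles, which is why the bounds stay tight around the bulk of the data.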

*The `detect_outliers()` function utilizes the geopy library and the Nominatim geocoder for reverse geocoding. The function takes latitude and longitude values as input and iterates over each pair to reverse geocode and obtain the country information.*<br>
*If the location is not within the USA, or if the reverse geocoding fails to provide country information, it is considered an outlier.*

In [4]:
def detect_outliers(latitude, longitude):
    geolocator = Nominatim(user_agent="Geolocation")
    country = "United States"
    outliers = []
    for lat, lon in zip(latitude, longitude):
        location = geolocator.reverse((lat, lon), exactly_one=True)
        # A failed lookup or a non-US country marks the point as an outlier
        is_outlier = location is None or location.raw.get('address', {}).get('country') != country
        outliers.append(is_outlier)
    # Reuse the input index so the mask aligns with a DataFrame whose rows were already filtered
    return pd.Series(outliers, index=latitude.index)
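Reverse geocoding one request per row is slow, and the public Nominatim service limits clients to about one request per second. A possible refinement, sketched below, is to pass the lookup function in as a parameter and cache repeated coordinates. The names `flag_outside_country`, `FakeLocation`, and `fake_reverse` are hypothetical, and the fake lookup only stands in for the real geocoder so the example runs offline.

```python
import pandas as pd


def flag_outside_country(latitudes, longitudes, reverse, country="United States"):
    """Return a boolean Series, aligned to `latitudes`, that is True where the
    lookup fails or reports a different country.

    `reverse` is any callable that takes a (lat, lon) tuple and returns None or
    an object with a `.raw` dict, like the result of a geopy reverse lookup.
    """
    cache = {}
    flags = []
    for point in zip(latitudes, longitudes):
        if point not in cache:  # look up each distinct coordinate pair only once
            location = reverse(point)
            found = None if location is None else location.raw.get("address", {}).get("country")
            cache[point] = found != country
        flags.append(cache[point])
    return pd.Series(flags, index=latitudes.index)


class FakeLocation:
    """Offline stand-in for a geocoder result (illustration only)."""

    def __init__(self, country):
        self.raw = {"address": {"country": country}}


calls = []


def fake_reverse(point):
    """Arbitrary fake geography: no result above 60 degrees north, otherwise split by longitude."""
    calls.append(point)
    lat, lon = point
    if lat > 60:
        return None
    return FakeLocation("United States" if lon < -100 else "Canada")


lat = pd.Series([44.0, 44.0, 70.0, 45.0], index=[10, 11, 15, 20])
lon = pd.Series([-115.0, -115.0, -150.0, -75.0], index=[10, 11, 15, 20])
flags = flag_outside_country(lat, lon, fake_reverse)
```

With geopy, the `reverse` argument could be the geocoder's `reverse` method wrapped in `RateLimiter` from `geopy.extra.rate_limiter` (for example with `min_delay_seconds=1`) to stay within the usage policy.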

*Finally, the `handle_outliers()` function removes two kinds of outliers from the input DataFrame: fire duration outliers and fire location outliers. It uses the helper functions `calculate_outliers()` and `remove_outliers()` shown above to drop fire duration outliers based on the IQR, and the custom function `detect_outliers()` to find fire location outliers from latitude and longitude.*

In [None]:
def handle_outliers(df):
    df_cleaned = df.copy()
    
    # Analyze fire duration and location outliers
    fire_duration_outliers = calculate_outliers(df_cleaned['FireDuration'])
    df_cleaned = remove_outliers(df_cleaned, fire_duration_outliers)

    fire_location_outliers = detect_outliers(df_cleaned['InitialLatitude'], df_cleaned['InitialLongitude'])
    df_cleaned = remove_outliers(df_cleaned, fire_location_outliers)

    return df_cleaned
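Because `handle_outliers()` filters twice, the second mask is applied to a DataFrame whose index already has gaps. Boolean masks are matched to rows by index label, so a mask built from a plain Python list should carry the same index as the frame it filters. A minimal sketch of that alignment, with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"FireDuration": [1, 2, 400, 3]})

# First filter: the row labelled 2 is removed, so the index is now [0, 1, 3]
df = df[df["FireDuration"] < 100]

# Flags computed row by row in a loop arrive as a plain list without labels
raw_flags = [False, True, False]

# Attach the frame's own index so each flag lines up with its row
mask = pd.Series(raw_flags, index=df.index)
result = df[~mask]
```

An alternative is to call `reset_index(drop=True)` after each filtering step, at the cost of losing the original row labels.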
