# Module Title:	 Machine Learning for Business  
### Assessment Title:  MLBus_HDipData_CA1
### Lecturer Name:	 Dr. Muhammad Iqba  
### Student Full Name & Number:	Natalia de Oliveira Rodrigues 2023112 and Heitor Gomes de Araujo Filho 2023098

This CA will assess student attainment of the following minimum intended learning outcomes:

1. Critically evaluate and implement appropriate clustering algorithms and interpret and document 
their results. (Linked to PLO 1, PLO 5)
2. Apply modelling to time series data to facilitate business intelligence needs (Linked to PLO 1, PLO 2, 
PLO 3)

**Project Objective:** 
Perform time series analysis on the historical plane crash data and use clustering techniques to identify patterns and clusters of crash incidents over time. 

1. **Temporal Patterns Analysis:** How has the frequency of plane crashes evolved over the years? Are there any long-term trends or seasonal patterns in crash occurrences?

2. **Clustering of Crash Incidents:** Identify commonalities among different incidents using clustering algorithms to group similar plane crashes based on characteristics such as crash causes, flight phases, and other relevant factors. 

3. **Visualization of Clustered Data:** How have certain types of crashes become more or less prevalent over the years? (Plot the identified clusters over time.)

4. **Anomaly Detection:** Detect extreme or unusual crash incidents that deviate from the typical patterns.

5. **Forecasting:** Predict the future trend of plane crashes from historical data using time series forecasting models, a valuable tool for aviation safety assessment.

6. **Interpreting Cluster Characteristics:** Are there specific conditions or causes that lead to certain types of accidents? Investigate the characteristics and factors that contribute to the formation of each cluster of crashes.

7. **Evaluation of Clustering Methods:** Compare and evaluate different clustering algorithms to determine which one provides the most meaningful insights into the dataset.

**Aims:** 
- Deeper understanding of the historical plane crash data, 
- Identify recurring patterns, 
- Potentially discover factors that contribute to certain types of accidents. 

# Exploratory Data Analysis

In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.preprocessing import LabelEncoder

from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('Plane Crashes.csv')

In [None]:
# To convert date to datetime
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
def glimpse(df):
    print(f'There are {df.shape[0]} observations and {df.shape[1]} attributes in this dataset.')
    print("-" * 120)
    display(df.head(3))
    print("-" * 120)
    display(df.tail(3))
    print("-" * 120)
    display(df.describe())
    print("-" * 120)
    df.info()  # info() prints its report and returns None, so display() is not needed
    print("-" * 120)
    display(df.isnull().sum().sort_values(ascending=False))
    
glimpse(df)

# Data Preprocessing 
Data preprocessing cleans, transforms, and prepares raw data for analysis or modeling. Typical tasks include handling missing data, dealing with outliers, scaling features, and encoding categorical variables.


### Data Preprocessing: Preparing raw data for immediate analysis

In [None]:
# Convert the year and count columns to nullable integers; values that cannot be parsed become <NA>
columns_to_convert = ['YOM', 'Crew on board', 'Crew fatalities', 'Pax on board', 'PAX fatalities', 
                      'Other fatalities', 'Total fatalities']

for column in columns_to_convert:
    df[column] = pd.to_numeric(df[column], errors='coerce').astype('Int64')

In [None]:
# Select the most recent 10 years of data
# to capture contemporary trends and patterns.
# An ISO-formatted cutoff (31 May 2012) avoids day/month ambiguity.
df = df[df['Date'] > pd.Timestamp('2012-05-31')]

In [None]:
# Use the 'Date' column as the dataframe index
df.index = df['Date']

# Drop the 'Date' column from the dataframe
df.drop('Date', axis = 1, inplace = True)
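
With the dates as the index, yearly crash counts can be read straight off it. The sketch below is only an illustration: the toy frame, its dates, and its `Region` values are made up, and it assumes that counting rows per calendar year is an adequate first look at the trend.

```python
import pandas as pd

# Hypothetical toy data standing in for the crash dataset
toy = pd.DataFrame(
    {"Region": ["Asia", "Europe", "Africa", "Asia", "Europe"]},
    index=pd.to_datetime(
        ["2013-01-15", "2013-07-02", "2014-03-09", "2015-11-30", "2015-12-01"]
    ),
)
toy.index.name = "Date"

# Count incidents per calendar year using the DatetimeIndex
crashes_per_year = toy.groupby(toy.index.year).size()
print(crashes_per_year)
```

The same grouping could be applied to `df` once its index holds the parsed dates.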

### Data Preprocessing: Handling missing data

In [None]:
# Drop attributes with a high share of missing values and identifier-like attributes that are unique to each aircraft or flight.
df = df.drop(columns=["Flight no.", "Time", 'MSN','Registration'])

In [None]:
# This removes all rows with any missing values - Less than 2% of the data will be dropped
df.dropna(inplace=True) 

In [None]:
df.isnull().sum().sort_values(ascending=False)
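
The share of rows lost to `dropna()` can be checked before dropping anything. This is a minimal sketch on a made-up frame: the column names mirror the dataset, but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame(
    {
        "Aircraft": ["A", "B", None, "D", "E"],
        "YOM": [1998, np.nan, 2005, 2010, 2001],
        "Region": ["Asia", "Europe", "Africa", "Asia", "World"],
    }
)

# Share of rows that contain at least one missing value
share_dropped = toy.isnull().any(axis=1).mean()
print(f"{share_dropped:.1%} of rows would be dropped")

# dropna() removes exactly those rows
cleaned = toy.dropna()
```

Running the same two lines on `df` before the `dropna` cell would confirm the "less than 2%" figure.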

### Data Preprocessing: Investigating categorical variables

In [None]:
df.describe(include = 'object').T

In [None]:
# Inspect the year of manufacture (YOM) column for invalid values
df.YOM.unique()

In [None]:
# Create a filter valid_years that keeps only years between 1900 and 2022 (both inclusive)
valid_years = (df['YOM'] >= 1900) & (df['YOM'] <= 2022)
df = df[valid_years]
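
To see how the numeric conversion and the year filter interact, here is a small sketch on hypothetical raw values (a valid year, two-digit entries, and a non-numeric string). It assumes a missing year should be treated as invalid.

```python
import pandas as pd

# Hypothetical raw values; not taken from the dataset
raw_yom = pd.Series(["1998", "16", "2015", "23", "n/a"])

# Non-numeric entries become <NA>; the rest become nullable integers
yom = pd.to_numeric(raw_yom, errors="coerce").astype("Int64")

# Keep only plausible four-digit years (both bounds inclusive)
valid = (yom >= 1900) & (yom <= 2022)
valid = valid.fillna(False)  # assumption: a missing year counts as invalid

print(yom[valid].tolist())
```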

In [None]:
# Inspect the unique values of the Region attribute
df.Region.unique()

In [None]:
# Count the observations where Region is World
df_region_check = df[df['Region'] == 'World']
print(f'There are {df_region_check.shape[0]} observations where Region is classified as World.')

**Region:**
- World classifies aviation incidents that do not belong to a specific continent or region, for example when they happen in international airspace, over oceans, or in locations outside the boundaries of any continent.

- The American continent is split into North America, South America and Central America to give more detailed information about where the incidents occurred.


In [None]:
# Inspect the unique values of the Crash cause attribute
df['Crash cause'].unique()

In [None]:
# Count the observations where Crash cause is Unknown
df_cause_check = df[df['Crash cause'] == 'Unknown']
print(f'There are {df_cause_check.shape[0]} observations where Crash cause is classified as Unknown.')

In [None]:
df['Flight type'].value_counts()

**Flight type:**
- Private: Private flights are those operated by individuals or organizations for non-commercial, personal, or business purposes.
- Scheduled Revenue Flight: These are the typical passenger or cargo flights you find in commercial aviation. Passengers purchase tickets or cargo space, and the flights follow a set timetable.
- Charter/Taxi (Non Scheduled Revenue Flight): Charter or non-scheduled revenue flights are flights that are not part of regular airline schedules. They are typically arranged on a case-by-case basis for specific customers or purposes.
- Survey / Patrol / Reconnaissance: Flights operated for purposes such as aerial photography, monitoring, or data collection.


### Data Preprocessing: Encoding categorical variables
- Nominal variables = one-hot encoded (Aircraft, Operator, Flight type, Crash site, Schedule, Crash location, Country, Crash cause, Circumstances).
- Ordinal variables = label encoded (if the order is meaningful: Survivors and Flight phase).
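
The two encoding strategies above can be sketched as follows. `LabelEncoder` assigns codes in alphabetical order of the labels, so the codes only reflect a meaningful order when that order happens to be alphabetical; an explicit category list is one way to control it. The category names and the phase order below are assumptions for illustration, and the real column values may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categories; the real column values may differ
toy = pd.DataFrame(
    {
        "Flight phase": ["Takeoff", "Cruise", "Landing", "Cruise"],
        "Flight type": ["Private", "Cargo", "Private", "Training"],
    }
)

# LabelEncoder sorts the labels: Cruise=0, Landing=1, Takeoff=2
alphabetical = LabelEncoder().fit_transform(toy["Flight phase"])

# An explicit category list keeps the order we consider meaningful
phase_order = ["Takeoff", "Cruise", "Landing"]  # assumed order, for illustration only
ordered = pd.Categorical(
    toy["Flight phase"], categories=phase_order, ordered=True
).codes

# One-hot encoding for a nominal variable: one indicator column per category
one_hot = pd.get_dummies(toy["Flight type"], prefix="FlightType")

print(list(alphabetical), list(ordered))
print(one_hot.columns.tolist())
```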

In [None]:
df['Survivors'].unique()

In [None]:
# Create a LabelEncoder object
label_encoder = LabelEncoder()

# Apply label encoding to the "Survivors" column
df['Survivors_encoded'] = label_encoder.fit_transform(df['Survivors'])

In [None]:
df['Flight phase'].unique()

In [None]:
# Refit the shared encoder on "Flight phase" (this overwrites the mapping learned for "Survivors")
df['FlightPhase_encoded'] = label_encoder.fit_transform(df['Flight phase'])

In [None]:
df.Region.unique()

In [None]:
# Define the custom order of regions
custom_order = ['Asia', 'Africa', 'North America', 'South America', 'Europe','Antarctica', 'Oceania', 
                'Central America', 'World']

# Apply label encoding with the custom order
df['Region_encoded'] = df['Region'].map({region: i for i, region in enumerate(custom_order)}).astype(int)

### Data Preprocessing: Dealing with outliers

### Feature Importance:

### Data Preprocessing: Scaling features

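- scale(): a function that applies Z-score scaling directly to the data and returns the scaled array.
- StandardScaler(): an estimator that applies the same Z-score scaling. Standardization rescales each feature to a mean of 0 and a standard deviation of 1.
- MinMaxScaler(): rescales each feature to a fixed range, by default between 0 and 1.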
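
A minimal sketch comparing the three options on a made-up numeric matrix (the values are hypothetical). Note that `scale` is called on the data directly, while the two scaler classes are instantiated and then fitted.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, scale

# Hypothetical feature matrix: two columns on very different scales
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

z_function = scale(X)                            # function: returns the scaled array
z_estimator = StandardScaler().fit_transform(X)  # estimator: can be reused on new data
min_max = MinMaxScaler().fit_transform(X)        # each column mapped to [0, 1]

# Both Z-score variants give the same result on the same data
print(np.allclose(z_function, z_estimator))
print(min_max)
```

All three expect numeric input, so categorical columns need to be encoded or excluded first.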

In [None]:
# This function lets us test different scaling methods and compare the results
def scale_data(df, method='scale'):
    if method == 'scale':
        # scale() is a plain function, not an estimator, so it is applied directly
        scaled_data = scale(df)
    elif method == 'standardization':
        scaled_data = StandardScaler().fit_transform(df)
    elif method == 'minmax':
        scaled_data = MinMaxScaler().fit_transform(df)
    else:
        raise ValueError("Invalid scaling method. Choose from 'standardization', 'minmax' or 'scale'.")

    # Keep the original index and column names (df must contain numeric columns only)
    scaled_df = pd.DataFrame(scaled_data, columns=df.columns, index=df.index)

    return scaled_df

In [None]:
# Call the scale function
#scale_data(df, method='standardization')    

**Done:**

- 24/10/2023 EDA & Pre-processing
- Date column dtype changed to datetime
- Date must be set as the dataset index and sorted in ascending order
- 4 columns dropped
- The past 10 years were selected to capture contemporary trends and patterns (the dataset contains data up to 03-06-2022)
- Less than 2% of the data (rows with missing values) was dropped
- YOM, Crew on board, Crew fatalities, Pax on board, PAX fatalities, Other fatalities, Total fatalities must be converted to integer
- YOM has incorrect values such as 16, 18, 23, 26 where a four-digit year such as 2023 is expected
- 25/10/2023 EDA & Data Normalization
- Investigate Categorical Data
- Data Normalization