# Module Title: Machine Learning for Business
### Assessment Title: MLBus_HDipData_CA1
### Lecturer Name: Dr. Muhammad Iqba
### Student Full Name & Number: Natalia de Oliveira Rodrigues 2023112 and Heitor de Araujo Filho 2023098

This CA will assess student attainment of the following minimum intended learning outcomes:

1. Critically evaluate and implement appropriate clustering algorithms and interpret and document their results. (Linked to PLO 1, PLO 5)
2. Apply modelling to time series data to facilitate business intelligence needs. (Linked to PLO 1, PLO 2, PLO 3)

**Project Objective:** 
Perform time series analysis on the historical plane crash data and use clustering techniques to identify patterns and clusters of crash incidents over time. 

1. **Temporal Patterns Analysis:** Examine how the frequency of plane crashes has evolved over the years. Are there any long-term trends or seasonal patterns in crash occurrences?

2. **Clustering of Crash Incidents:** Identify commonalities among different incidents using clustering algorithms to group similar plane crashes based on characteristics such as crash causes, flight phases, and other relevant factors. 

3. **Visualization of Clustered Data:** Plot the identified clusters over time. Have certain types of crashes become more or less prevalent over the years?

4. **Anomaly Detection:** Identify anomalies, that is, extreme or unusual crash incidents that deviate from the typical patterns.

5. **Forecasting:** Predict the future trend of plane crashes from historical data using time series forecasting models. Such forecasts can be a valuable tool for aviation safety assessment.

6. **Interpreting Cluster Characteristics:** Investigate the characteristics and factors that contribute to the formation of each cluster of crashes. Are there specific conditions or causes that lead to certain types of accidents?

7. **Evaluation of Clustering Methods:** Compare and evaluate different clustering algorithms to determine which one provides the most meaningful insights into the dataset.

**Aims:** 
- Gain a deeper understanding of the historical plane crash data.
- Identify recurring patterns.
- Potentially discover factors that contribute to certain types of accidents.
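
As a rough illustration of the temporal patterns objective, the sketch below counts incidents per calendar year and smooths the counts with a rolling mean. It is a minimal example on hypothetical toy dates, not the analysis itself; the `toy` frame, the variable names, and the window size are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the `Date` column of the crash dataset.
toy = pd.DataFrame({"Date": ["1918-05-02", "1918-06-08", "1919-01-15", "1921-03-01"]})
toy["Date"] = pd.to_datetime(toy["Date"])

# Count incidents per calendar year.
counts = toy["Date"].dt.year.value_counts().sort_index()

# Fill in years without any incident so the series has no gaps.
full_range = range(counts.index.min(), counts.index.max() + 1)
counts = counts.reindex(full_range, fill_value=0)

# A short rolling mean (window size is an arbitrary choice) to expose the trend.
trend = counts.rolling(window=2, min_periods=1).mean()
```

On the full dataset a longer window (for example 5 or 10 years) would probably give a more readable trend line.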

# Exploratory Data Analysis

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv('../../data/Plane_Crashes.csv')

In [22]:
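from IPython.display import display  # available by default in Jupyter; imported so the cell also runs as a script

def glimpse(df):
    """Show the shape, first and last rows, summary statistics, dtypes, and missing-value counts."""
    display(f'There are {df.shape[0]} observations and {df.shape[1]} attributes in this dataset.')
    display(df.head(3))         # first three rows
    display(df.tail(3))         # last three rows
    display(df.describe())      # summary statistics for the numeric columns
    display(df.info())          # info() prints its report and returns None, so 'None' is displayed as well
    display(df.isnull().sum())  # number of missing values per column

glimpse(df)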

'There are 28536 observations and 24 attributes in this dataset.'

| | Date | Time | Aircraft | Operator | Registration | Flight phase | Flight type | Survivors | Crash site | Schedule | ... | Country | Region | Crew on board | Crew fatalities | Pax on board | PAX fatalities | Other fatalities | Total fatalities | Circumstances | Crash cause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1918-05-02 | NaN | De Havilland DH.4 | United States Signal Corps - USSC | AS-32084 | Takeoff (climb) | Test | No | Airport (less than 10 km from airport) | Dayton - Dayton | ... | United States of America | North America | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 2 | The single engine airplane departed Dayton-McC... | Technical failure |
| 1 | 1918-06-08 | NaN | Handley Page V/1500 | Handley Page Aircraft Company Ltd | E4104 | Takeoff (climb) | Test | Yes | Airport (less than 10 km from airport) | Cricklewood - Cricklewood | ... | United Kingdom | Europe | 6.0 | 5.0 | 0.0 | 0.0 | 0.0 | 5 | Assembled at Cricklewood Airfield in May 1918,... | Technical failure |
| 2 | 1918-06-11 | NaN | Avro 504 | Royal Air Force - RAF | A8544 | Flight | Training | Yes | Plain, Valley | Abukir - Abukir | ... | Egypt | Africa | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | The single engine aircraft was completing a lo... | Unknown |


| | Date | Time | Aircraft | Operator | Registration | Flight phase | Flight type | Survivors | Crash site | Schedule | ... | Country | Region | Crew on board | Crew fatalities | Pax on board | PAX fatalities | Other fatalities | Total fatalities | Circumstances | Crash cause |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28533 | 2022-05-24 | 15H 40M 0S | De Havilland DHC-3 Otter | Yakutat Coastal Airlines | N703TH | Landing (descent or approach) | Charter/Taxi (Non Scheduled Revenue Flight) | Yes | Airport (less than 10 km from airport) | Yakutat – Dry Bay | ... | United States of America | North America | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0 | The single engine airplane departed Yakutat on... | Unknown |
| 28534 | 2022-05-29 | 10H 7M 0S | De Havilland DHC-6 Twin Otter | Tara Air | 9N-AET | Flight | Scheduled Revenue Flight | No | Mountains | Pokhara – Jomsom | ... | Nepal | Asia | 3.0 | 3.0 | 19.0 | 19.0 | 0.0 | 22 | The twin engine airplane departed Pokhara City... | Human factor |
| 28535 | 2022-06-03 | 13H 46M 0S | Cessna 208B Grand Caravan | GoJump Oceanside | N7581F | Landing (descent or approach) | Skydiving / Paratroopers | Yes | Airport (less than 10 km from airport) | Oceanside - Oceanside | ... | United States of America | North America | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1 | The single engine was completing local skydivi... | Unknown |


| | YOM | Flight no. | Crew on board | Crew fatalities | Pax on board | PAX fatalities | Other fatalities | Total fatalities |
|---|---|---|---|---|---|---|---|---|
| count | 23225.0 | 0.0 | 28512.0 | 28535.0 | 28482.0 | 28535.0 | 28526.0 | 28536.0 |
| mean | 1931.942519 | NaN | 3.052539 | 1.771649 | 7.705393 | 3.679727 | 0.10976 | 5.567389 |
| std | 285.486067 | NaN | 11.738151 | 2.520554 | 24.066368 | 15.288171 | 2.644296 | 16.713203 |
| min | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1944.0 | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1958.0 | NaN | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1974.0 | NaN | 4.0 | 3.0 | 4.0 | 1.0 | 0.0 | 5.0 |
| max | 19567.0 | NaN | 1924.0 | 25.0 | 509.0 | 506.0 | 297.0 | 520.0 |
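
The summary statistics above report a minimum of 0 and a maximum of 19567 for `YOM`, which cannot be valid years if, as we assume here, `YOM` means year of manufacture. The sketch below shows one possible way to flag such values and treat them as missing. The toy frame, the `YOM_MIN`/`YOM_MAX` bounds, and the `YOM_clean` column name are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy values mimicking the extremes seen in the summary (0 and 19567).
toy = pd.DataFrame({"YOM": [0.0, 1944.0, 1958.0, 19567.0, np.nan]})

# Assumed plausible range for a year of manufacture; adjust to the dataset.
YOM_MIN, YOM_MAX = 1900, 2022

# Flag values that are present but fall outside the plausible range.
implausible = toy["YOM"].notna() & ~toy["YOM"].between(YOM_MIN, YOM_MAX)

# Treat implausible years as missing rather than guessing a correction.
toy["YOM_clean"] = toy["YOM"].mask(implausible)
```

The same check could be applied to `Crew on board`, whose maximum of 1924 also looks suspicious.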


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28536 entries, 0 to 28535
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Date              28536 non-null  object 
 1   Time              13949 non-null  object 
 2   Aircraft          28535 non-null  object 
 3   Operator          28536 non-null  object 
 4   Registration      27721 non-null  object 
 5   Flight phase      27898 non-null  object 
 6   Flight type       28479 non-null  object 
 7   Survivors         27239 non-null  object 
 8   Crash site        28153 non-null  object 
 9   Schedule          19590 non-null  object 
 10  MSN               24354 non-null  object 
 11  YOM               23225 non-null  float64
 12  Flight no.        0 non-null      float64
 13  Crash location    28524 non-null  object 
 14  Country           28535 non-null  object 
 15  Region            28535 non-null  object 
 16  Crew on board     28512 non-null  float64

None

Date                    0
Time                14587
Aircraft                1
Operator                0
Registration          815
Flight phase          638
Flight type            57
Survivors            1297
Crash site            383
Schedule             8946
MSN                  4182
YOM                  5311
Flight no.          28536
Crash location         12
Country                 1
Region                  1
Crew on board          24
Crew fatalities         1
Pax on board           54
PAX fatalities          1
Other fatalities       10
Total fatalities        0
Circumstances          25
Crash cause             0
dtype: int64
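
The counts above show that `Flight no.` is missing in all 28536 rows, while `Time` and `Schedule` are missing in a large share of them. A minimal sketch of how the missing share per column could be computed, and fully empty columns dropped, is given below. It runs on a hypothetical toy frame; `missing_pct`, `empty_cols`, and `toy_clean` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one complete, one partly missing, and one fully empty column.
toy = pd.DataFrame({
    "Date": ["1918-05-02", "1918-06-08", "2022-06-03"],
    "Time": [np.nan, np.nan, "13H 46M 0S"],
    "Flight no.": [np.nan, np.nan, np.nan],
})

# Percentage of missing values per column, largest first.
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)

# Columns with no values at all carry no information and can be dropped.
empty_cols = missing_pct[missing_pct == 100].index.tolist()
toy_clean = toy.drop(columns=empty_cols)
```

`toy.dropna(axis=1, how="all")` would drop the fully empty columns in one step; the explicit version keeps the percentages available for reporting.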