# Machine Learning Pipeline - Feature Engineering

This notebook is part of a series that implements each step of the machine learning pipeline. The step covered here is shown in bold.

1. Data Analysis
2. **Feature Engineering**
3. Feature Selection
4. Model Training
5. Obtaining Predictions / Scoring

The data is the Plane Crash Dataset, available on [Kaggle.com](https://www.kaggle.com/datasets/kamilkarczmarczyk/plane-crash-dataset-03042023). Its columns are described below.

===================================================================================================

Data description:
===================================================================
- Date: Date of accident, in the format - January 01, 2001
- Time: Local time, in 24 hr. format unless otherwise specified
- Airline/Op: Airline or operator of the aircraft
- Flight #: Flight number assigned by the aircraft operator
- Route: Complete or partial route flown prior to the accident
- AC Type: Aircraft type
- Reg: ICAO registration of the aircraft
- cn / ln: Construction or serial number / Line or fuselage number
- Aboard: Total aboard (passengers / crew)
- Fatalities: Total fatalities aboard (passengers / crew)
- Ground: Total killed on the ground
- Summary: Brief description of the accident and cause if known

Target: 
========================================================
* drop samples with missing aboard_all or fatalities_all
* fill missing values in aboard_passengers, fatalities_passengers and ground with 0 (.fillna(0))
* cast aboard_all, fatalities_all, aboard_passengers, fatalities_passengers and ground to int
* drop samples where aboard_all is 0
* survived = aboard_all - fatalities_all, binarised (1 if at least one person survived, otherwise 0)

Pipeline steps:
==============================================================================================
* train/test split
* drop samples with missing summary
* apply get_multiple_locations() to the route column to create a number-of-routes column (routes_n)
* create month and decade (year - (year % 10)) from date
* apply get_locations() to the location column
* preprocess summary with spaCy and vectorise it with FastText
* select features: matrix of summary vectors concatenated with decade, month, routes_n
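
The decade formula and the final feature-selection step above can be sketched on toy data. This is only an illustration under stated assumptions: the 3-dimensional arrays stand in for the real summary vectors (which are much wider), the values are made up, and the column names follow the `DataSchema` values defined later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: made-up rows with 3-dimensional "summary vectors"
df = pd.DataFrame({
    "Year": [2001, 1977],
    "Month": [1, 3],
    "Routes_#": [2, 1],
    "Vector": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
})

# decade = year - (year % 10), e.g. 1977 -> 1970
df["Decade"] = df["Year"] - df["Year"] % 10

# Stack the per-row vectors into a 2-D matrix of shape (n_samples, vector_dim)
vectors = np.vstack(df["Vector"].tolist())

# Tabular features, one row per sample
tabular = df[["Decade", "Month", "Routes_#"]].to_numpy()

# Final feature matrix: vector columns first, then decade, month, routes_n
X = np.hstack([vectors, tabular])
```

Each row of `X` then holds the vector components followed by the three tabular features, so its width is the vector dimension plus three.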

## Imports

In [1]:
import pandas as pd
# to display all the columns of the dataframe in the notebook
pd.set_option("display.max_columns", None)
pd.set_option('display.float_format', lambda x: "%.4f" % x)

import numpy as np
import datetime as dt
import re

import spacy
import spacy.cli
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")

from sklearn.model_selection import train_test_split

import logging
logging.getLogger().setLevel(logging.INFO)

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


## src

In [2]:
def get_data(df):
    # make a copy of original data to work with
    data = df.copy()
    # replace missing values in form of "?"
    data = data.replace("?", np.nan)
    # rename columns: strip ":" and "/", then replace each run of whitespace with "_"
    multiple_white_spaces = re.compile(r"\s+")
    stripped = (re.sub(r"[:/]", "", col) for col in data.columns)
    data.columns = [multiple_white_spaces.sub("_", name) for name in stripped]
    return data


def get_target(df, drop_na, drop_zero, fatalities):
    data = df.copy()
    # drop rows with a missing value in any of the drop_na columns
    data = data.dropna(subset=drop_na)
    # for all count columns: fill in missing values with 0, convert values to int
    for col in fatalities:
        data[col] = data[col].fillna(0).astype(int)
    # drop rows where a drop_zero column is 0 (done after the int cast so the comparison is numeric)
    for col in drop_zero:
        data = data.drop(data[data[col] == 0].index)
    # create target variable survived and binarise it: 1 if anyone survived, otherwise 0
    survivors = data[DataSchema.aboard_all] - data[DataSchema.fatalities_all]
    data[DataSchema.survived] = (survivors > 0).astype(int)
    return data
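
To make the column renaming in get_data() concrete, the sketch below applies the same two regular expressions to a few names taken from the data description. The empty DataFrame is purely illustrative. Note that the actual CSV used in this notebook may already carry different headers.

```python
import re

import pandas as pd

# Illustrative column names taken from the data description
raw = pd.DataFrame(columns=["Date", "AC Type", "Flight #", "cn / ln"])

multiple_white_spaces = re.compile(r"\s+")

# Same logic as get_data(): drop ":" and "/", then turn whitespace runs into "_"
cleaned = [
    multiple_white_spaces.sub("_", re.sub(r"[:/]", "", col)) for col in raw.columns
]
# "cn / ln" loses its slash first, leaving two spaces that collapse into one "_"
```

Because the slash is removed before whitespace is collapsed, "cn / ln" becomes "cn_ln" rather than "cn___ln", which matches the `DataSchema.cn_ln` value.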

## Config

In [3]:
class DataSchema:
    date = "Date"
    time = "Time"
    location = "Location"
    ac_type = "AC_Type"
    operator = "Operator"
    route = "Route"
    cn_ln = "cn_ln"
    flight_n = "Flight_#"
    is_military = "Is_military"
    military_country = "Military_country"
    aboard_all = "Aboard_all"
    aboard_passengers = "Aboard_passengers"
    fatalities_all = "Fatalities_all"
    fatalities_passengers = "Fatalities_passengers"
    ground = "Ground"
    registration = "Registration"
    summary = "Summary"
    routes_lst = "Routes_lst"
    year = "Year"
    decade = "Decade"
    month = "Month"
    hour = "Hour"
    routes_n = "Routes_#"
    vector = "Vector"
    fatalities = "Fatalities"
    survived = "Survived"
    survived_pct = "Survived_pct"


DROP_NA = [DataSchema.aboard_all, DataSchema.fatalities_all]
DROP_ZERO = [DataSchema.aboard_all]
FATALITIES = [
    DataSchema.aboard_all, 
    DataSchema.aboard_passengers, 
    DataSchema.fatalities_all, 
    DataSchema.fatalities_passengers, 
    DataSchema.ground
]

## Load data

In [4]:
# load dataset
raw_data = pd.read_csv("data/raw_data.csv", sep=";")

In [9]:
data = get_data(raw_data)
logging.info(f"\n\033[32m{raw_data.shape=}\n\033[35m{data.shape=}\n\033[36m{data.columns=}\033[0m")

INFO:root:
raw_data.shape=(5028, 17)
data.shape=(5011, 18)
data.columns=Index(['Date', 'Time', 'Location', 'AC_Type', 'Operator', 'Route', 'cn_ln',
       'Flight_#', 'Is_military', 'Military_country', 'Aboard_all',
       'Aboard_passengers', 'Fatalities_all', 'Fatalities_passengers',
       'Ground', 'Registration', 'Summary', 'Survived'],
      dtype='object')


## Target

In [6]:
data = get_target(data, DROP_NA, DROP_ZERO, FATALITIES)
data[DataSchema.survived].value_counts(normalize=True)

Survived
0   0.6374
1   0.3626
Name: proportion, dtype: float64

## Train test split

In [7]:
# features and target
X = data.drop(DataSchema.survived, axis=1)
y = data[DataSchema.survived]

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
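
As a rough check of what `stratify=y` does, the sketch below repeats the split on toy labels with roughly the class balance reported above (about 64% / 36%). The data is synthetic and only meant to show that both subsets keep a similar share of the positive class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 64 negatives and 36 positives, one dummy feature per sample
y = np.array([0] * 64 + [1] * 36)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of the positive class in each subset; both should stay close to 0.36
train_rate = y_train.mean()
test_rate = y_test.mean()
```

Without stratification, a small test set can drift noticeably from the overall class balance, which makes evaluation metrics harder to compare between train and test.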

## Pipeline

In [None]:
# set up the pipeline (steps not yet filled in)
from sklearn.pipeline import Pipeline

price_pipe = Pipeline(

)
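
One possible way to fill in such a pipeline is to wrap plain functions in scikit-learn's `FunctionTransformer`. The sketch below is not the notebook's actual pipeline: `add_date_features`, `select_features` and `feature_pipe` are hypothetical names, only the date-derived features are covered, and the date format is assumed to be the one given in the data description ("January 01, 2001").

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_date_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add Month and Decade columns derived from Date."""
    out = df.copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    parsed = pd.to_datetime(out["Date"], format="%B %d, %Y", errors="coerce")
    out["Month"] = parsed.dt.month
    out["Decade"] = parsed.dt.year - parsed.dt.year % 10
    return out


def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep only the engineered columns."""
    return df[["Decade", "Month"]]


# Each step is a stateless function, so fit() learns nothing from the data
feature_pipe = Pipeline(
    steps=[
        ("date_features", FunctionTransformer(add_date_features)),
        ("select_features", FunctionTransformer(select_features)),
    ]
)

toy = pd.DataFrame({"Date": ["January 01, 2001", "March 27, 1977"]})
features = feature_pipe.fit_transform(toy)
```

Steps that drop rows (such as removing samples with a missing summary) do not fit this pattern well, because a scikit-learn pipeline transforms `X` only and would leave `y` misaligned. Such filtering is usually easier to do before the pipeline.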