# **Airplane Engine Type Study Notebook**

## Objectives

*   Answer business requirement 1: 
    * The client wants to understand the patterns between an airplane's design features and its performance features, so that the client can learn which variables are most relevant when choosing **Engine Type** (jet, piston or propjet) in the design process of a new airplane.

## Inputs

* outputs/datasets/collection/airplane_performance_study.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit app






---

# Change working directory

Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Data

* We drop the metadata columns 'Model' and 'Company' since these are identifier variables that are not needed for the study.
* We also drop the two engine "size" features 'THR' (thrust, a force in lbf) and 'SHP' (power, in shaft horsepower). The 'THR' column is populated only when "Engine Type" is jet and is NaN otherwise; likewise, 'SHP' is populated only when "Engine Type" is piston or propjet. These two features are interesting from an aircraft design perspective, but since they are different physical quantities with different units they are difficult and awkward to compare with each other.

* We can see that THR (thrust) is absent in the lower velocity regime simply because jet-powered airplanes, whose engines are rated in THR, do not fly in that regime. By the same reasoning, SHP (shaft horsepower) represents piston (propeller-driven) airplanes, which fly in the lower velocity regime.

<img src="/workspace/data-driven-design/images_notebook/THR_SHP_kaggle.png" alt="Screenshot showing distribution between Cessna and Piper" height="200" />
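The missingness pattern described above can be checked directly before the two columns are dropped. The sketch below uses a small made-up frame: the column name `Engine_Type` and every value in it are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the values and the 'Engine_Type' column name are assumptions
toy = pd.DataFrame({
    "Engine_Type": ["Jet", "Jet", "Piston", "Propjet", "Piston"],
    "THR": [3500.0, 2900.0, np.nan, np.nan, np.nan],  # thrust, filled for jets only
    "SHP": [np.nan, np.nan, 180.0, 750.0, 300.0],     # shaft horsepower, piston/propjet only
})

# Share of non-missing values per engine type (1.0 = always filled, 0.0 = always NaN)
coverage = toy[["THR", "SHP"]].notna().groupby(toy["Engine_Type"]).mean()
print(coverage)
```

If the real data follows the same pattern, each engine type should show full coverage in exactly one of the two columns.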

In [None]:
import pandas as pd
df = (pd.read_csv("/workspace/data-driven-design/outputs/datasets/collection/airplane_performance_study.csv")
    .drop(['Model', 'Company', 'THR', 'SHP'], axis=1))
df.head(10)

# Data Exploration

We want to get more familiar with the dataset: check variable types and distributions, missing levels, and what these variables mean in the business context of airplane design.

* The dataset is dominated by numerical (quantitative), continuous data and has only three categorical features:
  * Multi Engine
  * TP mods
  * Engine Type
* The categorical data could be considered nominal, since the categories simply represent different propulsion cases. However, they could also be considered ordinal: multiple engines are "more" than a single engine, a modified engine could represent an improved engine, and a jet engine has many advantages over a piston engine.
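The nominal versus ordinal distinction can be made concrete with pandas categoricals. This is only a sketch: the example values and the ranking piston < propjet < jet are assumptions chosen for illustration, not a claim about the dataset.

```python
import pandas as pd

engine_types = ["piston", "propjet", "jet", "piston"]  # example values only
levels = ["piston", "propjet", "jet"]                  # assumed ranking, for illustration

# Nominal: categories are plain labels, no order is implied
nominal = pd.Categorical(engine_types, categories=levels, ordered=False)

# Ordinal: the same labels, now with an explicit order
ordinal = pd.Categorical(engine_types, categories=levels, ordered=True)

# Order-based operations are only defined for the ordered version
print(ordinal.min(), ordinal.max())
try:
    nominal.min()
    nominal_supports_min = True
except TypeError:
    nominal_supports_min = False
print("min() works on nominal data:", nominal_supports_min)
```

Treating a feature as ordinal therefore adds information (a ranking) that the raw labels do not carry, which is why the choice matters for encoding.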

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

# Correlation Study

In [None]:
# Check the data types
print(df.dtypes)

Convert the data type of Multi Engine to object and check that the conversion was successful

In [None]:
# Convert to object dtype so the encoder treats the column as categorical
df['Multi_Engine'] = df['Multi_Engine'].astype('object')

# Verify the conversion
df['Multi_Engine'].dtype


In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(variables='Multi_Engine', drop_last=False)
df_ohe = encoder.fit_transform(df)
df_ohe.head(3)

In [None]:
# Get column names
column_names = df_ohe.columns
print(column_names)

Rename the columns created by the OneHotEncoder to make the table more intuitive to read.

Reference for the fix below: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [None]:
# Bugfix: Make a copy
df_ohe = df_ohe.copy()

# Replace column name
df_ohe.rename(columns={'Multi_Engine_False': 'Single_Engine'}, inplace=True)
df_ohe.rename(columns={'Multi_Engine_True': 'Multi_Engine'}, inplace=True)

df_ohe.head(3)
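Because Multi Engine is binary, the two one-hot columns carry the same information with opposite sign. The sketch below illustrates this on invented values, with `pd.get_dummies` standing in for feature_engine's `OneHotEncoder`; only the column names follow the notebook.

```python
import pandas as pd

# Made-up values for illustration
toy = pd.DataFrame({
    "Multi_Engine": pd.Series([True, False, True, False, False], dtype="object"),
    "Vmax": [420.0, 150.0, 380.0, 170.0, 160.0],
})

# One-hot encode the binary feature into two indicator columns
dummies = pd.get_dummies(toy["Multi_Engine"], prefix="Multi_Engine", dtype=int)
toy_ohe = pd.concat([toy.drop(columns="Multi_Engine"), dummies], axis=1)

corr = toy_ohe.corr(method="pearson")
print(corr["Vmax"])
```

The two indicator columns are perfectly anti-correlated, so any variable's correlation with one is the negative of its correlation with the other. This is why the complementary column is not informative when ranking correlations.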


### Using "Engine Type" as target variable

We use the One Hot Encoder on the categorical features ("Engine Type", "Multi Engine") to avoid ordinal relationships: one-hot encoding prevents a model from assuming any order between the categories. Make sure the data type is either object or category, otherwise the OneHotEncoder will not work.

We use `.corr()` for `spearman` and `pearson` methods, and investigate the top 10 correlations.
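The difference between the two methods shows up on a relationship that is monotonic but not linear. The data below is synthetic and serves only to illustrate the two `method` options of `DataFrame.corr`.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing but non-linear relationship (illustration only)
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3)})

# Pearson measures linear association, Spearman measures monotonic (rank) association
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Spearman reaches 1 here because the ranks of `x` and `y` agree perfectly, while Pearson stays below 1 because the points do not lie on a straight line. Checking both methods guards against missing a strong non-linear trend.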

Calculate Pearson to check the Linear relationship between variables

In [None]:
# Step 1: Select relevant numeric columns (excluding 'Multi Engine' if needed)
df_subset = df_ohe.select_dtypes(include=['float64', 'int64'])

# Step 2: Calculate Pearson correlation with 'Single Engine' as well as 'Multi Engine'
corr_pearson_single_engine = df_subset.corr(method='pearson')['Single_Engine'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_pearson_multi_engine = df_subset.corr(method='pearson')['Multi_Engine'].sort_values(key=abs, ascending=False)[1:].head(10)

# Drop "Single Engine" and "Multi Engine" from the Series if they exist
corr_pearson_single_engine = corr_pearson_single_engine.drop(['Multi_Engine'], errors='ignore')
corr_pearson_multi_engine = corr_pearson_multi_engine.drop(['Single_Engine'], errors='ignore')

# Now print the remaining correlations
print(corr_pearson_single_engine)

Do the same for `spearman` to check the Monotonic relationship between variables

In [None]:
# Step 1: Select relevant numeric columns (excluding 'Multi Engine' if needed)
df_subset = df_ohe.select_dtypes(include=['float64', 'int64'])

# Step 2: Calculate Spearman correlation with 'Single Engine' as well as 'Multi Engine'
corr_spearman_single_engine = df_subset.corr(method='spearman')['Single_Engine'].sort_values(key=abs, ascending=False)[1:].head(10)
corr_spearman_multi_engine = df_subset.corr(method='spearman')['Multi_Engine'].sort_values(key=abs, ascending=False)[1:].head(10)

# Drop "Single Engine" and "Multi Engine" from the Series if they exist
corr_spearman_single_engine = corr_spearman_single_engine.drop(['Multi_Engine'], errors='ignore')
corr_spearman_multi_engine = corr_spearman_multi_engine.drop(['Single_Engine'], errors='ignore')

# Now print the remaining correlations
print(corr_spearman_single_engine)

For both methods, we notice moderate or strong levels of correlation between Multi Engine and a given variable. This is good news, since we ideally pursue strong correlation levels. We will consider the top five correlation levels in `df_ohe` and study the associated variables in `df`.

In [None]:
top_n = 5
# Union of the top-n variables from the Pearson and Spearman rankings
set(corr_pearson_single_engine[:top_n].index.to_list() + corr_spearman_single_engine[:top_n].index.to_list())
set(corr_pearson_multi_engine[:top_n].index.to_list() + corr_spearman_multi_engine[:top_n].index.to_list())

The sign of each correlation value tells us whether a variable increases or decreases as the target, Multi Engine, increases. Based on this, we will investigate whether:
* A multi-engine airplane typically has a higher Hmax than a single-engine airplane
* A multi-engine airplane typically has a higher Vcruise than a single-engine airplane
* A multi-engine airplane typically has a higher Vl than a single-engine airplane
* A multi-engine airplane typically has a higher Vmax than a single-engine airplane
* A multi-engine airplane typically has a higher Vstall than a single-engine airplane

We suspect that airplanes with multiple engines fly "Higher, Further, Faster", as the slogan goes: correct. The correlation study in the 'Multi Engine Airplane Study' supports that.

The study of the airplane data showed a general performance increase in service ceiling (Hmax), range, and cruise and max speed (Vcruise and Vmax), but also, on the negative side, a higher landing speed and stall speed (Vl and Vstall). This insight will enter into the Conceptual Design Prediction tools.
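A positive correlation with a binary target means the group with the target set to 1 tends to show larger values. One simple way to sanity-check hypotheses of this kind is to compare group medians. The numbers below are invented for illustration and `Multi_Engine` is assumed to be coded as 0/1.

```python
import pandas as pd

# Invented values; Multi_Engine coded as 0 (single) and 1 (multi)
toy = pd.DataFrame({
    "Multi_Engine": [0, 0, 0, 1, 1, 1],
    "Hmax": [14000, 16000, 18000, 25000, 41000, 45000],
    "Vstall": [48, 52, 55, 70, 85, 92],
})

# Median of each variable per group, and the multi-minus-single difference
medians = toy.groupby("Multi_Engine")[["Hmax", "Vstall"]].median()
diff = medians.loc[1] - medians.loc[0]

# Sign of the correlation with the binary target
signs = toy.corr(method="spearman")["Multi_Engine"].drop("Multi_Engine")

print(medians)
print(diff)
print(signs)
```

When the median difference and the correlation share the same sign, the direction of the effect is consistent across both views.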

In [None]:
vars_to_study = ['Hmax', 'Vcruise', 'Vl', 'Vmax', 'Vstall']
vars_to_study

# EDA on selected variables

In [None]:
df_eda = df.filter(vars_to_study + ['Multi_Engine'])
df_eda.head(30)

## Variables Distribution by Multi Engine 

We plot the distribution (numerical and categorical) coloured by Multi Engine

In [16]:
%matplotlib inline

---

In [17]:
# Streamlit variant of the plotting helpers; the imports below are needed for them to run
import matplotlib.pyplot as plt
import seaborn as sns
import streamlit as st

# Code copied from "3A_airplane_engine_type_study" notebook - "Variables Distribution by Multi Engine"-section
def plot_numerical(df, col, target_var):
    fig, ax = plt.subplots(figsize=(8, 5))  # Create a figure and axis
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step", ax=ax)  # Pass ax to the plot
    ax.set_title(f"{col}", fontsize=20, y=1.05)
    st.pyplot(fig)  # Pass the figure to st.pyplot()

# Code copied from "3A_airplane_engine_type_study" notebook - "Variables Distribution by Multi Engine"-section
def multi_engine_per_feature(df_eda, vars_to_study):
    target_var = 'Multi_Engine'
    for col in vars_to_study:
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_numerical(df, col, target_var):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.title(f"{col}", fontsize=20, y=1.05)
    plt.show()


target_var = 'Multi_Engine'
for col in vars_to_study:
    plot_numerical(df_eda, col, target_var)
    print("\n\n")


---

## Parallel Plot

For the parallel plot it is only relevant to include the intervals that contain data points, and enough of them. For this reason we cut off the lower and/or upper ends of the data range for the benefit of the parallel plot. However, by letting the first (lower) and last (upper) bins extend to negative and positive infinity respectively, we do not throw away any values, since they all enter into the plot.

Extreme values can sometimes skew data, but not in this case: extreme-performance airplanes really *do* exist and are relevant to our analysis, so they *need* to be included to avoid skewing the graphs. This does not hold, however, if we only want to predict more conservative designs. In other words, if we want a prediction for conservative and conventional airplane design, we should not include the outliers, since these are typically the result of aggressive and extreme (non-conventional) designs!
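The effect of open-ended first and last bins can be seen on a handful of invented stall speeds. In this sketch `pd.cut` stands in for the discretiser used in the notebook, and the edges mirror the idea of the maps rather than any fitted result.

```python
import numpy as np
import pandas as pd

# Invented stall speeds, including values outside the 70-110 range
vstall = pd.Series([38, 55, 72, 95, 118, 160], name="Vstall")

# Open-ended bins: every value lands in some interval
open_edges = [-np.inf, 70, 90, 110, np.inf]
open_bins = pd.cut(vstall, bins=open_edges, labels=False)

# Closed bins: values outside 70-110 fall out of range and become NaN
closed_bins = pd.cut(vstall, bins=[70, 90, 110], labels=False)

print(open_bins.tolist())
print("values lost with closed bins:", int(closed_bins.isna().sum()))
```

With the infinite outer edges no observation is dropped, while the closed version silently loses the extremes.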


In [None]:
from feature_engine.discretisation import ArbitraryDiscretiser
import numpy as np

# Step 1: Define the mapping arrays
# Maps hard coded based on inspection of the histogram plots under "Variables Distribution by Multi Engine" in this notebook.
# np.inf is used instead of np.Inf, which was removed in NumPy 2.0
Hmax_map = [-np.inf, 23000, 32000, 42000, 50000, np.inf]
Vcruise_map = [-np.inf, 250, 350, 450, 550, np.inf]
Vl_map = [-np.inf, 2000, 3000, 4000, np.inf]
Vmax_map = [-np.inf, 250, 350, 450, 550, np.inf]
Vstall_map = [-np.inf, 70, 90, 110, np.inf]

# Step 2: Combine all mappings into a single binning dictionary (intermediate step, needed since we have multiple variables)
binning_dict = {
    'Hmax': Hmax_map,
    'Vcruise': Vcruise_map,
    'Vl': Vl_map,
    'Vmax': Vmax_map,
    'Vstall': Vstall_map
}

# Step 3: Initialize the ArbitraryDiscretiser with the combined binning dictionary
disc = ArbitraryDiscretiser(binning_dict=binning_dict)

# Step 4: Fit and transform the DataFrame
df_parallel = disc.fit_transform(df_eda)

# Display the first few rows of the transformed DataFrame
df_parallel.head()


In [None]:
# Fit and transform the DataFrame
df_parallel = disc.fit_transform(df_eda)

# Access the binning dictionaries after fitting
if hasattr(disc, 'binner_dict_'):
    print("Binning dictionary for Hmax:", disc.binner_dict_['Hmax'])
    print("Binning dictionary for Vcruise:", disc.binner_dict_['Vcruise'])
    print("Binning dictionary for Vl:", disc.binner_dict_['Vl'])
    print("Binning dictionary for Vmax:", disc.binner_dict_['Vmax'])
    print("Binning dictionary for Vstall:", disc.binner_dict_['Vstall'])
else:
    print("binner_dict_ does not exist. Please check if the discretiser was fitted successfully.")


Create a map to replace the variable with more informative levels.

In [None]:
# Assuming disc is already fitted and contains the binning dictionary
labels_map = {}

# Iterate over each variable in the binning dictionary
for variable in disc.binner_dict_.keys():
    classes_ranges = disc.binner_dict_[variable][1:-1]  # Exclude -Inf and +Inf
    n_classes = len(classes_ranges) + 1  # Number of intervals/classes
    
    # Initialize labels for this variable
    variable_labels = {}
    
    for n in range(n_classes):
        if n == 0:
            variable_labels[n] = f"<{classes_ranges[0]}"
        elif n == n_classes - 1:
            variable_labels[n] = f"+{classes_ranges[-1]}"
        else:
            variable_labels[n] = f"{classes_ranges[n - 1]} to {classes_ranges[n]}"
    
    # Store the labels in the main labels_map
    labels_map[variable] = variable_labels

# Output the labels map for each variable
labels_map


Replace according to the labels_map

In [None]:
# Replace the values in df_parallel for each variable using the corresponding labels from labels_map
for variable, labels in labels_map.items():
    df_parallel[variable] = df_parallel[variable].replace(labels)

# Display the first few rows of the transformed DataFrame
df_parallel.head(10)


In [None]:
# Convert boolean to integer via replacing
df_parallel['Multi_Engine'] = df_parallel['Multi_Engine'].replace({True: 1, False: 0})
# Display the first few rows of the transformed DataFrame
df_parallel.head(10)

Create a multi-dimensional categorical data plot

In [None]:
import plotly.express as px
fig = px.parallel_categories(df_parallel, color="Multi_Engine")
# fig = px.parallel_categories(df_parallel, color="Multi_Engine", color_discrete_sequence=["blue", "orange"])
fig.show(renderer='jupyterlab')

---

# Conclusions

* We suspected that airplanes with multiple engines fly "Higher, Further, Faster", as the slogan goes: correct. The correlation study in the 'Multi Engine Airplane Study' supports that.

* The study of the airplane data showed a general performance increase in service ceiling (Hmax), range, and cruise and max speed (Vcruise and Vmax), but also, on the negative side, a higher landing speed and stall speed (Vl and Vstall). This insight will enter into the Conceptual Design Prediction tools.

---