# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle (downloaded to a local computer first) and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* The file inputs/datasets/raw/Airplane_Complete_Imputation.csv, downloaded from Kaggle and uploaded to the Gitpod workspace via a local computer.

## Outputs

* Generate Dataset: outputs/datasets/collection/airplane_performance_study.csv

## Additional Comments

Downloading the data directly from Kaggle into the notebook with an authentication token (Kaggle JSON file) failed with a "403 - Forbidden" error, so the file was downloaded manually instead.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design/jupyter_notebooks'

Make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design'
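
The parent-folder step above can also be expressed with `pathlib`. The following is a minimal sketch and not part of the original notebook: the helper name `parent_of` is hypothetical, and the path is the one printed by the first cell.

```python
from pathlib import PurePosixPath


def parent_of(path: str) -> str:
    """Return the parent folder of a POSIX-style path as a string."""
    return str(PurePosixPath(path).parent)


# Path printed by os.getcwd() in the first cell of this notebook
notebook_dir = "/workspace/data-driven-design/jupyter_notebooks"
project_root = parent_of(notebook_dir)
print(project_root)
```

In the notebook itself, `os.chdir(project_root)` would then make that folder the working directory.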

---

# Install python packages in the notebooks

In [4]:
%pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [5]:
%pip install --no-cache-dir -r /workspace/data-driven-design/requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [6]:
#pip freeze > requirements.txt

# Fetch data from Kaggle

Download the data file from Kaggle to a local computer, then upload it to the Explorer in the Gitpod workspace:

Step 1: Go to [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

Step 2: Scroll down and choose the file "Airplane_Complete_Imputation.csv"

<img src="/workspace/data-driven-design/images_notebook/choose_file_from_kaggle.jpg" alt="Screenshot showing which file on Kaggle to download" width="700" />

Step 3: Download the data file from Kaggle to the local computer: [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

<img src="/workspace/data-driven-design/images_notebook/download_from_kaggle.jpg" alt="Screenshot showing where to download the data on Kaggle" width="700" />

---

# Load, Rearrange and Inspect Kaggle data

Set pandas to display all columns without truncation (the output then gets a horizontal scroll bar).

In [7]:
import pandas as pd
pd.set_option('display.max_columns', None)  # Show all columns
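
If the global setting is not wanted for the whole session, the same option can be applied temporarily. This is a minimal sketch using `pd.option_context`; the variable names are illustrative only.

```python
import pandas as pd

# Remember whatever the session currently uses
previous = pd.get_option("display.max_columns")

# Inside the block all columns are shown; the old value is restored on exit
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")

after = pd.get_option("display.max_columns")
print(inside, after == previous)
```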

In [8]:
df = pd.read_csv("inputs/datasets/raw/Airplane_Complete_Imputation.csv")
df.head()

Unnamed: 0,Model,Company,Engine Type,Multi Engine,TP mods,THR,SHP,Length,Height,Wing Span,FW,MEW,AUW,Vmax,Vcruise,Vstall,Hmax,Hmax (One),ROC,ROC (One),Vlo,Slo,Vl,Sl,Range
0,15 AC Sedan,Aeronca,Piston,False,False,,145.0,25.25,10.25,37.416667,241.2,1180.0,2050.0,104.0,91.0,46.0,13000.0,13000.0,450.0,450.0,900.0,391.970247,1300.0,257.745075,370.0
1,11 CC Super Chief,Aeronca,Piston,False,False,,85.0,20.583333,8.75,36.083333,100.5,820.0,1350.0,89.0,83.0,44.0,12300.0,12300.0,600.0,600.0,720.0,26.247647,800.0,225.324824,190.0
2,7 CCM Champ,Aeronca,Piston,False,False,,90.0,21.416667,8.583333,35.0,127.3,810.0,1300.0,90.0,78.0,37.0,16000.0,16000.0,650.0,650.0,475.0,363.139711,850.0,585.751317,210.0
3,7 DC Champ,Aeronca,Piston,False,False,,85.0,21.416667,8.583333,35.0,127.3,800.0,1300.0,88.0,78.0,37.0,13000.0,13000.0,620.0,620.0,500.0,407.797297,850.0,642.046166,210.0
4,7 AC Champ,Aeronca,Piston,False,False,,65.0,21.416667,8.75,35.0,93.8,740.0,1220.0,83.0,74.0,33.0,12500.0,12500.0,370.0,370.0,632.0,297.056192,885.0,329.571813,175.0


Rearrange the columns so that related features sit side by side, which makes heatmaps and similar plots easier to interpret.

In [9]:
new_order = ['Model', 'Company', 'Wing Span', 'Length', 'Height', 'Multi Engine', 'TP mods', 'Engine Type', 'THR', 'SHP', 'AUW', 'MEW', 'FW', 'Vmax', 'Vcruise', 'Vstall', 'Range', 'Hmax', 'Hmax (One)', 'ROC', 'ROC (One)', 'Vlo', 'Slo', 'Vl', 'Sl']  # Specify new order
df = df[new_order]
df.to_csv('rearranged_file.csv', index=False)  # Also write the rearranged data to a CSV file in the working directory
df.head()

Unnamed: 0,Model,Company,Wing Span,Length,Height,Multi Engine,TP mods,Engine Type,THR,SHP,AUW,MEW,FW,Vmax,Vcruise,Vstall,Range,Hmax,Hmax (One),ROC,ROC (One),Vlo,Slo,Vl,Sl
0,15 AC Sedan,Aeronca,37.416667,25.25,10.25,False,False,Piston,,145.0,2050.0,1180.0,241.2,104.0,91.0,46.0,370.0,13000.0,13000.0,450.0,450.0,900.0,391.970247,1300.0,257.745075
1,11 CC Super Chief,Aeronca,36.083333,20.583333,8.75,False,False,Piston,,85.0,1350.0,820.0,100.5,89.0,83.0,44.0,190.0,12300.0,12300.0,600.0,600.0,720.0,26.247647,800.0,225.324824
2,7 CCM Champ,Aeronca,35.0,21.416667,8.583333,False,False,Piston,,90.0,1300.0,810.0,127.3,90.0,78.0,37.0,210.0,16000.0,16000.0,650.0,650.0,475.0,363.139711,850.0,585.751317
3,7 DC Champ,Aeronca,35.0,21.416667,8.583333,False,False,Piston,,85.0,1300.0,800.0,127.3,88.0,78.0,37.0,210.0,13000.0,13000.0,620.0,620.0,500.0,407.797297,850.0,642.046166
4,7 AC Champ,Aeronca,35.0,21.416667,8.75,False,False,Piston,,65.0,1220.0,740.0,93.8,83.0,74.0,33.0,175.0,12500.0,12500.0,370.0,370.0,632.0,297.056192,885.0,329.571813
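
Selecting columns with a hand-written list silently drops any column that is left out of the list. The sketch below guards against that on toy data; the helper `reorder_columns` is hypothetical, and the strictness (every column must appear in the new order) is an assumption rather than something the notebook requires.

```python
import pandas as pd


def reorder_columns(df: pd.DataFrame, new_order: list) -> pd.DataFrame:
    """Return df with columns in new_order, refusing to drop or invent any."""
    missing = [c for c in new_order if c not in df.columns]
    dropped = [c for c in df.columns if c not in new_order]
    if missing or dropped:
        raise ValueError(
            f"missing from data: {missing}; not in new order: {dropped}"
        )
    return df[new_order]


# Toy frame with three of the notebook's column names
toy = pd.DataFrame({"Model": ["A"], "Range": [370.0], "Vmax": [104.0]})
reordered = reorder_columns(toy, ["Model", "Vmax", "Range"])
print(list(reordered.columns))
```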


DataFrame summary. Click 'View as a scrollable element' at the bottom of the output if the 'display.max_columns' setting has no effect (it is probably overruled by Jupyter Notebook).

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 860 entries, 0 to 859
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Model         860 non-null    object 
 1   Company       860 non-null    object 
 2   Wing Span     860 non-null    float64
 3   Length        860 non-null    float64
 4   Height        860 non-null    float64
 5   Multi Engine  860 non-null    bool   
 6   TP mods       860 non-null    bool   
 7   Engine Type   860 non-null    object 
 8   THR           156 non-null    float64
 9   SHP           704 non-null    float64
 10  AUW           860 non-null    float64
 11  MEW           860 non-null    float64
 12  FW            860 non-null    float64
 13  Vmax          860 non-null    float64
 14  Vcruise       860 non-null    float64
 15  Vstall        860 non-null    float64
 16  Range         860 non-null    float64
 17  Hmax          860 non-null    float64
 18  Hmax (One)    860 non-null    

We check for duplicated airplanes by looking at `Model`: three models appear twice, giving six rows.

`keep=False` ensures that every row of a duplicated model is displayed, not only the later occurrence.

In [11]:
df[df.duplicated(subset=['Model'], keep=False)]

Unnamed: 0,Model,Company,Wing Span,Length,Height,Multi Engine,TP mods,Engine Type,THR,SHP,AUW,MEW,FW,Vmax,Vcruise,Vstall,Range,Hmax,Hmax (One),ROC,ROC (One),Vlo,Slo,Vl,Sl
29,AT-602,"Air Tractor, Inc.",56.0,33.5,12.166667,False,False,Propjet,,1050.0,12500.0,5829.0,1447.2,158.0,126.0,52.0,600.0,8000.0,8000.0,650.0,650.0,1756.907849,1830.0,1949.973315,1634.223699
65,AT-602,Beechcraft (Hawker Beechcraft),56.0,33.5,11.0,False,False,Propjet,,1050.0,12500.0,5829.0,1447.2,158.0,126.0,52.0,600.0,8000.0,8000.0,650.0,650.0,2803.267722,1830.0,2862.65093,1820.805987
445,Lineage 1000,Dassault Falcon Jet,94.25,118.916667,33.75,True,False,Jet,18500.0,,120152.0,70841.0,48217.0,546.94,472.0,133.063795,4600.0,41000.0,16816.632111,2464.0,720.0,7351.680105,6076.0,4108.774275,2450.0
481,Lineage 1000,Embraer Aircraft - Empresa Brasileira,94.25,118.916667,33.75,True,False,Jet,18500.0,,120152.0,70841.0,48217.0,546.94,452.0,107.896643,4592.0,41000.0,16648.502977,2464.0,720.0,6661.152244,6076.0,2647.67721,3402.0
511,G500,Gulfstream Aerospace,86.333333,91.166667,25.5,True,False,Jet,15144.0,,79600.0,46850.0,30250.0,616.975,566.95,127.868748,5300.0,51000.0,29210.904017,2467.0,732.453021,5912.498345,5300.0,3333.0225,2620.0
514,G500,Gulfstream Aerospace,93.5,96.416667,25.833333,True,False,Jet,15385.0,,85100.0,47600.0,34939.0,590.295,566.95,107.629085,5200.0,51000.0,27700.0,5515.951635,707.0,6700.448422,5150.0,4383.458188,2770.0
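
The effect of the `keep` argument is easiest to see on a small example. This is a sketch on toy data (the rows are made up for illustration), comparing the default `keep="first"` with `keep=False`.

```python
import pandas as pd

# Toy data: "AT-602" appears twice, the other models once
toy = pd.DataFrame({
    "Model": ["AT-602", "G500", "AT-602", "Champ"],
    "Company": ["Air Tractor", "Gulfstream", "Beechcraft", "Aeronca"],
})

# keep="first" (the default) flags only the later occurrence
later_only = toy[toy.duplicated(subset=["Model"])]

# keep=False flags every row whose Model appears more than once
all_copies = toy[toy.duplicated(subset=["Model"], keep=False)]

print(list(later_only.index))
print(list(all_copies.index))
```

With `keep=False` both rows of a pair are visible, which makes it possible to compare them before deciding which one to drop.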


The paired entries have identical or nearly identical feature values, so they probably stem from confusion about the "Company" or from similar versions of the same model. We drop one row from each pair.

In [12]:
# Drop one row from each duplicated pair (row labels 65, 445 and 514)
df.drop(index=[65, 445, 514], inplace=True)
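
Two ways of removing duplicated models can be compared on toy data. This is a sketch, not the notebook's method: it assigns the result of `drop` instead of using `inplace=True` (which makes `drop` return `None`), and it shows `drop_duplicates` as a rule-based alternative. The toy rows reuse a few row labels and values from the table above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Model": ["AT-602", "AT-602", "G500", "G500"],
        "Height": [12.2, 11.0, 25.5, 25.8],
    },
    index=[29, 65, 511, 514],
)

# Option 1: drop hand-picked row labels and keep the returned frame
by_label = toy.drop(index=[65, 514])

# Option 2: let pandas keep the first row of every duplicated Model
by_rule = toy.drop_duplicates(subset=["Model"], keep="first")

print(list(by_label.index))
print(by_label.equals(by_rule))
```

For `Lineage 1000` the notebook keeps row 481 rather than row 445, so a blanket `keep="first"` would not reproduce its selection exactly.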


We check for duplicates again: There are no duplicates.

In [13]:
df[df.duplicated(subset=['Model'], keep=False)]

Unnamed: 0,Model,Company,Wing Span,Length,Height,Multi Engine,TP mods,Engine Type,THR,SHP,AUW,MEW,FW,Vmax,Vcruise,Vstall,Range,Hmax,Hmax (One),ROC,ROC (One),Vlo,Slo,Vl,Sl


We noticed `Engine Type` is a categorical variable with three categories: 'Piston', 'Propjet' and 'Jet'. We convert the categories to integers because the ML model requires numeric variables.

In [14]:
df['Engine Type'].unique()

array(['Piston', 'Propjet', 'Jet'], dtype=object)

In [15]:
df['Engine Type'] = df['Engine Type'].replace({"Piston":0, "Propjet":1, "Jet":2})

Check the `Engine Type` data type.

In [16]:
df['Engine Type'].dtype

dtype('int64')
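
An encoding like this can fail quietly if a new category shows up in the data. The sketch below uses `Series.map` with an explicit check for unmapped values; the helper `encode_engine_type` and the constant `ENGINE_CODES` are hypothetical names, and the code assumes the column has no missing values (as `df.info()` reports for `Engine Type`).

```python
import pandas as pd

# Same category-to-integer mapping as in the notebook
ENGINE_CODES = {"Piston": 0, "Propjet": 1, "Jet": 2}


def encode_engine_type(series: pd.Series) -> pd.Series:
    """Map engine type names to integers, rejecting unknown categories."""
    unknown = set(series.dropna().unique()) - set(ENGINE_CODES)
    if unknown:
        raise ValueError(f"unmapped engine types: {sorted(unknown)}")
    return series.map(ENGINE_CODES).astype("int64")


toy = pd.Series(["Piston", "Jet", "Propjet", "Piston"])
encoded = encode_engine_type(toy)
print(encoded.tolist())
```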

Finally, before we save the df to the repo, we replace the pesky spaces in the column names with underscores.

In [17]:
# Renaming columns
df.rename(columns={
    'Multi Engine': 'Multi_Engine',
    'Wing Span': 'Wing_Span',
    'TP mods': 'TP_mods',
    'Engine Type': 'Engine_Type',
    'Hmax (One)': 'Hmax_(One)',
    'ROC (One)': 'ROC_(One)'
}, inplace=True)
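
The same renaming can be done for all columns at once instead of listing each one. This is a sketch on an empty toy frame with a few of the notebook's column names, using the string accessor on the column index.

```python
import pandas as pd

# Empty toy frame: only the column names matter here
toy = pd.DataFrame(columns=["Model", "Wing Span", "TP mods", "Hmax (One)"])

# Replace every literal space in every column name with an underscore
toy.columns = toy.columns.str.replace(" ", "_", regex=False)
print(list(toy.columns))
```

This avoids missing a column when new ones with spaces are added later, at the cost of being less explicit than the dictionary.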

# Push files to Repo

In [18]:
import os

# Create the output folder; exist_ok=True avoids an error when it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/airplane_performance_study.csv", index=False)


Double check that the correct, updated CSV file is in the repo by reading it back. It is correct.

In [19]:
df = pd.read_csv("outputs/datasets/collection/airplane_performance_study.csv")
df.head()

Unnamed: 0,Model,Company,Wing_Span,Length,Height,Multi_Engine,TP_mods,Engine_Type,THR,SHP,AUW,MEW,FW,Vmax,Vcruise,Vstall,Range,Hmax,Hmax_(One),ROC,ROC_(One),Vlo,Slo,Vl,Sl
0,15 AC Sedan,Aeronca,37.416667,25.25,10.25,False,False,0,,145.0,2050.0,1180.0,241.2,104.0,91.0,46.0,370.0,13000.0,13000.0,450.0,450.0,900.0,391.970247,1300.0,257.745075
1,11 CC Super Chief,Aeronca,36.083333,20.583333,8.75,False,False,0,,85.0,1350.0,820.0,100.5,89.0,83.0,44.0,190.0,12300.0,12300.0,600.0,600.0,720.0,26.247647,800.0,225.324824
2,7 CCM Champ,Aeronca,35.0,21.416667,8.583333,False,False,0,,90.0,1300.0,810.0,127.3,90.0,78.0,37.0,210.0,16000.0,16000.0,650.0,650.0,475.0,363.139711,850.0,585.751317
3,7 DC Champ,Aeronca,35.0,21.416667,8.583333,False,False,0,,85.0,1300.0,800.0,127.3,88.0,78.0,37.0,210.0,13000.0,13000.0,620.0,620.0,500.0,407.797297,850.0,642.046166
4,7 AC Champ,Aeronca,35.0,21.416667,8.75,False,False,0,,65.0,1220.0,740.0,93.8,83.0,74.0,33.0,175.0,12500.0,12500.0,370.0,370.0,632.0,297.056192,885.0,329.571813
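
The visual check above can be complemented by a programmatic one. The sketch below writes a toy frame to a temporary folder and compares the reloaded copy with the original; the file name `toy.csv` and the temporary location are assumptions for illustration, not the notebook's actual output path.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Toy frame covering the dtypes in the dataset: text, bool, int, float with NaN
toy = pd.DataFrame({
    "Model": ["15 AC Sedan", "Lineage 1000"],
    "Multi_Engine": [False, True],
    "Engine_Type": [0, 2],
    "THR": [float("nan"), 18500.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder exists
    csv_path = os.path.join(out_dir, "toy.csv")
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Raises AssertionError if values, dtypes or column order differ
assert_frame_equal(toy, reloaded)
print("round trip OK")
```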
