# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle via the local computer and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* The file inputs/datasets/raw/Airplane_Complete_Imputation.csv, downloaded from Kaggle and uploaded to the Gitpod workspace via the local computer.

## Outputs

* Generate Dataset: outputs/datasets/collection/airplane_performance_study.csv

## Additional Comments

Downloading the data directly from Kaggle into the notebook with an authentication token (Kaggle JSON file) failed with a "403 - Forbidden" error, so the file was downloaded manually instead.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* Access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design/jupyter_notebooks'

Make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/data-driven-design'
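
The same parent-directory step can also be sketched with `pathlib`, which avoids nesting `os.path` calls. This is only an illustration: the helper name `parent_dir` is hypothetical, and the example path is the one shown in the output of the first cell.

```python
from pathlib import PurePosixPath

def parent_dir(path: str) -> str:
    """Return the parent folder of a POSIX-style path (hypothetical helper)."""
    return str(PurePosixPath(path).parent)

notebook_dir = "/workspace/data-driven-design/jupyter_notebooks"
project_root = parent_dir(notebook_dir)
print(project_root)  # /workspace/data-driven-design
```

In the notebook itself, `os.chdir(parent_dir(os.getcwd()))` should have the same effect as the `os.path.dirname()` call, assuming a Linux workspace such as the one used here.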

---

# Install python packages in the notebooks

In [7]:
%pip install -r /workspace/data-driven-design//requirements.txt

Collecting html-minifier==0.0.4 (from -r /workspace/data-driven-design//requirements.txt (line 14))
  Downloading html_minifier-0.0.4-py2.py3-none-any.whl.metadata (639 bytes)
Downloading html_minifier-0.0.4-py2.py3-none-any.whl (3.6 kB)
Installing collected packages: html-minifier
Successfully installed html-minifier-0.0.4

[notice] A new release of pip is available: 24.3.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [8]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Fetch data from Kaggle

Download the data file from Kaggle to the local computer, then upload it through the Explorer in the Gitpod workspace:

Step 1: Go to [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

Step 2: Scroll down and choose the file "Airplane Complete Imputation.csv"

<img src="/workspace/data-driven-design/images_notebook/choose_file_from_kaggle.jpg" alt="Screenshot showing which file on Kaggle to download" width="700" />

Step 3: Download the data file from Kaggle to the local computer: [Aircraft Performance (Aircraft Bluebook)](https://www.kaggle.com/datasets/heitornunes/aircraft-performance-dataset-aircraft-bluebook?select=Airplane_Complete_Imputation.csv)

<img src="/workspace/data-driven-design/images_notebook/download_from_kaggle.jpg" alt="Screenshot showing where to download the data on Kaggle" width="700" />
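
Once the file has been uploaded through the Explorer, a quick check that it sits where the notebook expects it can prevent a confusing `FileNotFoundError` later. The following is a minimal sketch, assuming the relative path from the Inputs section and that the working directory is the project root. `find_missing` is a hypothetical helper name.

```python
from pathlib import Path

# Path taken from the Inputs section, relative to the project root
RAW_FILE = Path("inputs/datasets/raw/Airplane_Complete_Imputation.csv")

def find_missing(paths, root="."):
    """Return the paths (as strings) that are not files under root."""
    return [str(p) for p in paths if not (Path(root) / p).is_file()]

missing = find_missing([RAW_FILE])
if missing:
    print("Upload these files before continuing:", missing)
else:
    print("Raw data file found")
```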

---

# Load, Rearrange and Inspect Kaggle data

Set pandas to display all columns without truncation (the output then gets a horizontal scrollbar).

In [6]:
import pandas as pd
pd.set_option('display.max_columns', None)  # Show all columns

In [None]:
df = pd.read_csv("inputs/datasets/raw/Airplane_Complete_Imputation.csv")
df.head()

Rearrange the columns so that related features sit side by side, which makes heatmaps and similar plots easier to interpret.

In [None]:
new_order = ['Model', 'Company', 'Wing Span', 'Length', 'Height', 'Multi Engine', 'TP mods', 'Engine Type', 'THR', 'SHP', 'AUW', 'MEW', 'FW', 'Vmax', 'Vcruise', 'Vstall', 'Range', 'Hmax', 'Hmax (One)', 'ROC', 'ROC (One)', 'Vlo', 'Slo', 'Vl', 'Sl']  # Specify new order
df = df[new_order]
df.to_csv('rearranged_file.csv', index=False)  # Save a copy of the rearranged data; df itself is already reordered
df.head()
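
Selecting columns with a hand-written list raises a `KeyError` for a misspelled name, but it silently drops any column left out of the list. A small pre-check can catch both cases. The sketch below works on plain lists of names, and both the helper `compare_columns` and the shortened column lists are hypothetical.

```python
def compare_columns(existing, new_order):
    """Return (missing, unknown): columns left out of new_order, and names not in the data."""
    missing = [c for c in existing if c not in new_order]
    unknown = [c for c in new_order if c not in existing]
    return missing, unknown

# Hypothetical, shortened column lists for illustration
existing = ["Model", "Company", "Engine Type", "THR", "SHP"]
new_order = ["Model", "Company", "THR", "SHP", "Engine Typ"]  # typo on purpose

missing, unknown = compare_columns(existing, new_order)
print(missing)  # ['Engine Type']
print(unknown)  # ['Engine Typ']
```

In the notebook this could be called as `compare_columns(list(df.columns), new_order)` before the reordering step. Both lists should come back empty.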

DataFrame summary. If the 'display.max_columns' setting has no effect (the Jupyter notebook may override it), click 'View as a scrollable element' at the bottom of the output.

In [None]:
df.info()

We check for duplicated airplanes by comparing `Model`. There are three duplicates.

`keep=False` ensures that every row of a duplicated `Model` is displayed, not only the repeats.

In [None]:
df[df.duplicated(subset=['Model'], keep=False)]
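
The effect of `keep` is easiest to see on a small example. The sketch below uses toy data (hypothetical values, not rows from the Kaggle file) to contrast `keep=False` with the default `keep="first"`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Kaggle file)
toy = pd.DataFrame({
    "Model": ["A100", "B200", "A100", "C300"],
    "Company": ["Acme", "Bolt", "Acme Air", "Core"],
})

# keep=False marks every member of a duplicated group
all_dupes = toy[toy.duplicated(subset=["Model"], keep=False)]

# keep="first" (the default) marks only the later repeats
later_dupes = toy[toy.duplicated(subset=["Model"], keep="first")]

print(all_dupes.index.tolist())    # [0, 2]
print(later_dupes.index.tolist())  # [2]
```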

The duplicated entries have identical or nearly identical feature values. They probably stem from confusion over the "Company" or from similar versions of the same model. We drop one row of each pair.

In [None]:
# Drop one row of each duplicated pair (row labels 65, 445 and 514)
df = df.drop(index=[65, 445, 514])
print(df.shape)
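
Hard-coded row labels only stay valid as long as the raw file does not change. As an alternative, `drop_duplicates` can pick the rows automatically. This is a sketch on toy data (hypothetical values), and note that `keep="first"` always retains the first occurrence, which is not necessarily the row chosen by hand in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Kaggle file)
toy = pd.DataFrame({
    "Model": ["A100", "B200", "A100", "C300"],
    "Vmax": [120, 150, 121, 180],
})

# Keep the first row of each Model and discard later repeats
deduped = toy.drop_duplicates(subset=["Model"], keep="first")

print(deduped.index.tolist())      # [0, 1, 3]
print(deduped["Model"].is_unique)  # True
```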

We check for duplicates again: There are no duplicates.

In [None]:
df[df.duplicated(subset=['Model'], keep=False)]

We noticed that `Engine Type` is a categorical variable with three categories: 'Piston', 'Propjet' and 'Jet'. We convert the categories to integers because the ML model requires numeric variables.

In [None]:
df['Engine Type'].unique()

In [14]:
df['Engine Type'] = df['Engine Type'].replace({"Piston":0, "Propjet":1, "Jet":2})

Check the `Engine Type` data type.

In [None]:
df['Engine Type'].dtype
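
One caveat with `replace` is that a label missing from the dictionary is left untouched, so the column would silently stay non-numeric. A possible variant uses `Series.map`, which turns unmapped labels into `NaN` so they can be detected. The helper name `encode_engine_type` is hypothetical and the data is a toy series.

```python
import pandas as pd

# Mapping used in this notebook
ENGINE_CODES = {"Piston": 0, "Propjet": 1, "Jet": 2}

def encode_engine_type(series):
    """Map engine labels to integers; raise if a label has no code (hypothetical helper)."""
    encoded = series.map(ENGINE_CODES)
    if encoded.isna().any():
        unknown = series[encoded.isna()].unique().tolist()
        raise ValueError(f"Unmapped engine types: {unknown}")
    return encoded.astype("int64")

toy = pd.Series(["Piston", "Jet", "Propjet", "Piston"], name="Engine Type")
print(encode_engine_type(toy).tolist())  # [0, 2, 1, 0]
```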

Finally, before we push the df to the repo, we replace the pesky spaces in the column names with underscores.

In [16]:
# Renaming columns
df.rename(columns={
    'Multi Engine': 'Multi_Engine',
    'Wing Span': 'Wing_Span',
    'TP mods': 'TP_mods',
    'Engine Type': 'Engine_Type',
    'Hmax (One)': 'Hmax_(One)',
    'ROC (One)': 'ROC_(One)'
}, inplace=True)
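
The rename mapping can also be derived from the column names instead of being typed out by hand. The sketch below builds it for a shortened, illustrative list of columns, and `underscore_names` is a hypothetical helper.

```python
def underscore_names(columns):
    """Build a rename mapping that swaps spaces for underscores (hypothetical helper)."""
    return {c: c.replace(" ", "_") for c in columns if " " in c}

# Shortened column list for illustration
columns = ["Model", "Wing Span", "TP mods", "Hmax (One)", "ROC (One)"]
mapping = underscore_names(columns)
print(mapping)
```

In the notebook, `df.rename(columns=underscore_names(df.columns), inplace=True)` should give the same column names as the explicit dictionary.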

# Push files to Repo

In [None]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/airplane_performance_study.csv", index=False)

We double-check that the repo really holds the correct, updated CSV file: it does.

In [None]:
df = pd.read_csv("outputs/datasets/collection/airplane_performance_study.csv")
df.head()
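
The visual check with `df.head()` can be complemented by a programmatic comparison of the saved and reloaded data. The sketch below runs the round trip on a toy frame (hypothetical values) in a temporary folder, so it does not touch the real output file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame (hypothetical values) standing in for the cleaned data
toy = pd.DataFrame({"Model": ["A100", "B200"], "Engine_Type": [0, 2]})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "outputs" / "datasets" / "collection" / "toy.csv"
    out_file.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# Compare shape and column names before and after the round trip
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```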