This section explains how the csv.xz files were processed so that they can be used to answer the research questions.
The first step was to load the necessary libraries, namely pandas and lzma, and to define the paths of the four input files.

In [1]:
import pandas as pd
import lzma

file_path320 = 'A320_valid.csv.xz'
file_path319 = 'A319_valid.csv.xz'
file_path321 = 'A321_valid.csv.xz'
file_path332 = 'A332_valid.csv.xz'

After evaluating all columns and parameters in the Excel file, we decided to include the following columns in our final datasets.

In [None]:

# Specify the columns to load
columns_to_load = [
    'time', 
    'timestep', 
    'maxtimestep', 
    'icao24', 
    'callsign', 
    'baroaltitude', 
    'lat', 
    'lon', 
    'velocity', 
    'segment', 
    'modeltype', 
    'operator', 
    'vertratecorr',
    'fromICAO', 
    'toICAO', 
    'distance_from_dep', 
    'trip_distance', 
    'temp', 
    'tas'
]

# Load columns from the CSV
with lzma.open(file_path319) as f:
    df319 = pd.read_csv(f, usecols=columns_to_load)
with lzma.open(file_path320) as f:
    df320 = pd.read_csv(f, usecols=columns_to_load)
with lzma.open(file_path321) as f:
    df321 = pd.read_csv(f, usecols=columns_to_load)
with lzma.open(file_path332) as f:
    df332 = pd.read_csv(f, usecols=columns_to_load)
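
The four loading steps above follow the same pattern, so they could also be wrapped in a small helper. The following is only a sketch: the function name `load_columns` is hypothetical and not part of the original notebook, and it assumes the files are plain xz-compressed CSVs with a header row.

```python
import lzma

import pandas as pd


def load_columns(path, columns):
    """Read only the requested columns from an xz-compressed CSV file.

    Hypothetical helper, not part of the original notebook.
    """
    # Text mode ("rt") hands pandas decoded lines instead of raw bytes.
    with lzma.open(path, mode="rt") as f:
        return pd.read_csv(f, usecols=columns)


# Possible usage with the names defined in the cells above:
# df319 = load_columns(file_path319, columns_to_load)
```

Restricting the read with `usecols` keeps memory use down, because the columns that are not needed are never materialised in the dataframe.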


As described in section (region selection), only flights in the area around Paris are included. We therefore defined a bounding box from minimum and maximum latitude and longitude.
The box can be seen below. We then created a new dataframe that contains only the rows which lie inside this box at timestep 0 (the start of the measurement).

In [None]:
# filter the data to only include flights in the area
lat_min = 48.65
lat_max = 49.1
lon_min = 2.01
lon_max = 2.76

filtered_df319 = df319[(df319['lat'] >= lat_min) & (df319['lat'] <= lat_max) & (df319['lon'] >= lon_min) & (df319['lon'] <= lon_max) & (df319['timestep'] == 0)]
filtered_df320 = df320[(df320['lat'] >= lat_min) & (df320['lat'] <= lat_max) & (df320['lon'] >= lon_min) & (df320['lon'] <= lon_max) & (df320['timestep'] == 0)]
filtered_df321 = df321[(df321['lat'] >= lat_min) & (df321['lat'] <= lat_max) & (df321['lon'] >= lon_min) & (df321['lon'] <= lon_max) & (df321['timestep'] == 0)]
filtered_df332 = df332[(df332['lat'] >= lat_min) & (df332['lat'] <= lat_max) & (df332['lon'] >= lon_min) & (df332['lon'] <= lon_max) & (df332['timestep'] == 0)]
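
The same bounding-box condition can be written once and reused for every aircraft type. This is a minimal sketch on invented toy data; the helper name `rows_starting_in_box` is an assumption for illustration and does not appear in the original notebook.

```python
import pandas as pd


def rows_starting_in_box(df, lat_min, lat_max, lon_min, lon_max):
    """Return the timestep-0 rows that lie inside the bounding box."""
    # Series.between is inclusive on both ends by default, matching >= and <=.
    in_box = df["lat"].between(lat_min, lat_max) & df["lon"].between(lon_min, lon_max)
    return df[in_box & (df["timestep"] == 0)]


# Toy data (invented for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "segment": [1, 1, 2],
    "timestep": [0, 1, 0],
    "lat": [48.90, 48.95, 51.00],
    "lon": [2.50, 2.55, 0.10],
})
starts = rows_starting_in_box(toy, 48.65, 49.1, 2.01, 2.76)
```

In the toy data only the first row is kept: the second row is inside the box but not at timestep 0, and the third row starts outside the box.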


As the column named segment can be used as an index for the flights, we created a list for each aircraft type that contains only the unique segments of the flights starting inside the box.
We then filtered the original dataframe so that it only includes data about the flights whose segment is in this list.

In [None]:
# get the unique segments from the filtered dataframes and filter the original dataframes
filtered_segments319 = filtered_df319['segment'].unique()
df_filteredsegments319 = df319[df319['segment'].isin(filtered_segments319)]

filtered_segments320 = filtered_df320['segment'].unique()
df_filteredsegments320 = df320[df320['segment'].isin(filtered_segments320)]

filtered_segments321 = filtered_df321['segment'].unique()
df_filteredsegments321 = df321[df321['segment'].isin(filtered_segments321)]

filtered_segments332 = filtered_df332['segment'].unique()
df_filteredsegments332 = df332[df332['segment'].isin(filtered_segments332)]
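
The point of going through the segment list is that a flight is kept or dropped as a whole, based only on where it starts. The sketch below illustrates this on invented toy data; `keep_whole_flights` is a hypothetical name, not a function from the original notebook.

```python
import pandas as pd


def keep_whole_flights(df, start_rows):
    """Return every row of df whose segment also appears in start_rows."""
    segments = start_rows["segment"].unique()
    return df[df["segment"].isin(segments)]


# Toy data (invented): segment 1 starts inside the box and then leaves it,
# segment 2 starts outside the box and later enters it.
toy = pd.DataFrame({
    "segment": [1, 1, 2, 2],
    "timestep": [0, 1, 0, 1],
    "lat": [48.90, 49.50, 51.00, 48.90],
})
starts = toy[(toy["timestep"] == 0) & toy["lat"].between(48.65, 49.1)]
kept = keep_whole_flights(toy, starts)
```

Both rows of segment 1 are retained, including the one outside the latitude range, while segment 2 is removed entirely.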


To keep all flights comparable, we only included those whose measurements start at a barometric altitude of at most 100 m.
The approach is similar to the one above. First we created a dataframe that only includes the rows which at timestep 0 were at or below 100 m.
Then we created lists of the corresponding segments to be used as indices. By filtering the previous dataframe with these segments, we obtained the dataset that only includes the flights starting at or below 100 m.


In [None]:
# filter the data to only include flights at or below 100 m at timestep 0
filtered_heightdf319 = df_filteredsegments319[(df_filteredsegments319['baroaltitude'] <= 100) & (df_filteredsegments319['timestep'] == 0)]
filtered_heightdf320 = df_filteredsegments320[(df_filteredsegments320['baroaltitude'] <= 100) & (df_filteredsegments320['timestep'] == 0)]
filtered_heightdf321 = df_filteredsegments321[(df_filteredsegments321['baroaltitude'] <= 100) & (df_filteredsegments321['timestep'] == 0)]
filtered_heightdf332 = df_filteredsegments332[(df_filteredsegments332['baroaltitude'] <= 100) & (df_filteredsegments332['timestep'] == 0)]

# get the unique segments from the filtered dataframes
plane_list319 = filtered_heightdf319['segment'].unique()
plane_list320 = filtered_heightdf320['segment'].unique()
plane_list321 = filtered_heightdf321['segment'].unique()
plane_list332 = filtered_heightdf332['segment'].unique()

final319 = df_filteredsegments319[df_filteredsegments319['segment'].isin(plane_list319)]
final320 = df_filteredsegments320[df_filteredsegments320['segment'].isin(plane_list320)]
final321 = df_filteredsegments321[df_filteredsegments321['segment'].isin(plane_list321)]
final332 = df_filteredsegments332[df_filteredsegments332['segment'].isin(plane_list332)]
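
A quick sanity check on the result can confirm that no retained flight starts above the threshold. The following is a sketch under the assumption that each segment has a row with timestep 0; the function name `segments_starting_above` and the toy values are invented for illustration.

```python
import pandas as pd


def segments_starting_above(df, max_altitude=100):
    """Return the segment ids whose timestep-0 altitude exceeds max_altitude."""
    starts = df[df["timestep"] == 0]
    too_high = starts["baroaltitude"] > max_altitude
    return starts.loc[too_high, "segment"].unique().tolist()


# Toy data (invented): segment 7 starts at 40 m, segment 8 starts at 900 m.
toy = pd.DataFrame({
    "segment": [7, 7, 8, 8],
    "timestep": [0, 1, 0, 1],
    "baroaltitude": [40.0, 350.0, 900.0, 1200.0],
})
offenders = segments_starting_above(toy)
```

Applied to a dataframe such as `final319`, an empty list would indicate that the filter behaved as intended.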



We decided not to store the modified datasets as CSV files again and instead used pickle files. The reasons are the faster reading and writing and the fact that all our team members use Python.

In [None]:
# save the filtered dataframes to pickle files
final319.to_pickle('A319_final.pkl')
final320.to_pickle('A320_final.pkl')
final321.to_pickle('A321_final.pkl')
final332.to_pickle('A332_final.pkl')
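
Reading the data back is a single call to `pd.read_pickle`. The round trip below is a minimal sketch on an invented toy dataframe written to a temporary directory; the file name `example_final.pkl` is made up for this example.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe (invented) standing in for one of the final datasets.
frame = pd.DataFrame({
    "segment": [1, 1],
    "baroaltitude": [50.0, 300.0],
    "callsign": ["ABC123", "ABC123"],
})

path = os.path.join(tempfile.mkdtemp(), "example_final.pkl")
frame.to_pickle(path)

# Column dtypes and the index come back exactly as they were saved.
restored = pd.read_pickle(path)
```

Unlike a CSV round trip, no column types have to be re-inferred or re-specified when the file is loaded.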
