Processing to do:
- Make sure that flight_ids are unique.
- How should aircraft trajectories be represented?
    - Constant length, but with different timestamps from the start?
    - Constant timestamps across trajectories (e.g. every 1 min), but with variable length? We can pad all series to the same length, with padded values set to NaN to represent missing observations. The mask is then set to zero at every position that holds a NaN.
- Normalisation using z-score (StandardScaler). Each variable is normalised independently.
- Separate per departure / arrival runway? Probably not a good idea, as we don't have that many trajectories.
- Maybe: instead of working on full trajectories, work only on 3 to 5 min segments. The time series then always have the same number of points, the number of samples increases significantly, and trajectory phases can be classified. We should try both solutions, I think.
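
The padding, masking and z-score items above can be sketched with NumPy alone. This is a minimal illustration on toy arrays, not the notebook's actual pipeline: `pad_and_mask` and `zscore` are hypothetical helper names, and the statistics are computed with NaN-aware functions so that padded positions do not bias the mean and standard deviation (a plain StandardScaler fit would need the same care).

```python
import numpy as np


def pad_and_mask(series, max_len):
    """Pad variable-length (n_i, n_features) arrays with NaN up to max_len rows.

    Returns the padded array and a mask equal to 1 where an observation
    exists and 0 where the position holds a NaN.
    """
    n_features = series[0].shape[1]
    padded = np.full((len(series), max_len, n_features), np.nan, dtype=np.float32)
    for i, s in enumerate(series):
        padded[i, : len(s)] = s
    mask = (~np.isnan(padded).any(axis=-1)).astype(np.float32)
    return padded, mask


def zscore(padded):
    """Normalise each feature independently, ignoring NaN-padded positions."""
    mean = np.nanmean(padded, axis=(0, 1), keepdims=True)
    std = np.nanstd(padded, axis=(0, 1), keepdims=True)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant features
    return (padded - mean) / std


# Toy data (hypothetical): two "flights" with 3 and 5 timestamps, 2 features each.
toy = [
    np.arange(6, dtype=float).reshape(3, 2),
    np.arange(10, dtype=float).reshape(5, 2),
]
padded, mask = pad_and_mask(toy, max_len=5)
normalised = zscore(padded)
```

The padded positions stay NaN after normalisation, so they would still have to be replaced (for example by zero) before being fed to a network, with the mask telling the model which positions to ignore.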

In [1]:
from traffic.core import Traffic, Flight

import numpy as np
from multiprocessing import Pool
from tqdm.auto import tqdm

In [2]:
file = "/mnt/beegfs/store/kruu/context_learning/datasets/04_LFPO-LFBO_merged/all_flights.parquet"
t = Traffic.from_file(file)

In [3]:
t = t.drop(["flight_id"], axis = 1)
t

icao24,callsign,count
3946e1,AFR42PN,89587
3944ef,AFR42PN,71559
3944f0,AFR41WQ,69365
4401e4,EJU627A,60301
3985a2,AFR78RT,58870
440cfe,EJU414M,56896
440097,EJU739N,55617
394c10,AFR82NK,54918
4403f0,EJU627A,54833
3944ef,AFR38GE,53386


In [4]:
t = t.assign_id().eval(max_workers=8)
t

flight_id,count
TVF53LW_7069,10318
AFR74YX_10553,8141
EJU414M_13167,7506
EJU963A_12448,7477
AFR38GE_3136,7284
AFR83CA_7217,6935
AFR97WR_8609,6896
AFR74YX_10552,6848
AFR25VF_6707,6812
AFR37BA_5940,6806
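
One possible way to address the "flight_ids are unique" item is to look for ids that appear to cover more than one flight. The sketch below works on a plain pandas DataFrame with the columns `flight_id`, `icao24`, `callsign` and `timestamp`; `check_flight_ids` is a hypothetical helper, the 30 min gap threshold is an arbitrary assumption, and the toy rows are made up for illustration.

```python
import pandas as pd


def check_flight_ids(df, max_gap="30min"):
    """Return flight_ids that look like several flights merged under one id."""
    # An id shared by several aircraft or callsigns is suspicious.
    n_keys = df.groupby("flight_id")[["icao24", "callsign"]].nunique()
    mixed = set(n_keys[(n_keys > 1).any(axis=1)].index)

    # A large gap between consecutive samples suggests two distinct flights.
    ordered = df.sort_values(["flight_id", "timestamp"])
    gaps = ordered.groupby("flight_id")["timestamp"].diff()
    gapped = set(ordered.loc[gaps > pd.Timedelta(max_gap), "flight_id"])

    return sorted(mixed | gapped)


# Toy data (hypothetical): B_2 mixes two aircraft, C_3 contains a 3 h gap.
toy = pd.DataFrame(
    {
        "flight_id": ["A_1", "A_1", "B_2", "B_2", "C_3", "C_3"],
        "icao24": ["3946e1", "3946e1", "3944ef", "3944f0", "440cfe", "440cfe"],
        "callsign": ["AFR42PN"] * 2 + ["AFR41WQ"] * 2 + ["EJU414M"] * 2,
        "timestamp": pd.to_datetime(
            [
                "2023-01-01 10:00:00",
                "2023-01-01 10:00:01",
                "2023-01-01 11:00:00",
                "2023-01-01 11:00:01",
                "2023-01-01 12:00:00",
                "2023-01-01 15:00:00",
            ]
        ),
    }
)
suspicious = check_flight_ids(toy)
```

On the real data, the same function could be called on `t.data` after `assign_id()`; an empty result would support the assumption that each `flight_id` is a single flight.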


In [6]:
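# Preprocessing steps:
# - drop rows containing a NaN (dropna() considers every column, not only input_features)
# - keep each flight only from the first point where groundspeed > 50
# - resample every 1s

def moving_groundspeed(f, threshold):
    # Trim the flight so that it starts at the first sample above the threshold.
    moving = f.query(f"groundspeed > {threshold}")
    if moving is None:
        # The flight never exceeds the threshold: return None to discard it.
        return None
    return f.after(moving.start)

t.data = t.data.dropna()
t = t.pipe(moving_groundspeed, 50).resample("1s").eval(max_workers=50, desc="")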

                                                      

In [7]:
t

flight_id,count
AFR74YX_10553,8140
AFR74YX_10552,6848
EJU194C_11214,6059
NAK097_10532,6054
TVF13UN_10557,6035
AFR19QL_5884,5951
VOE11CH_019,5945
EJU963A_12448,5885
AFR47KD_10543,5718
AFR35ZZ_13907,5635


In [8]:
# Computing duration from first observation
t.data["first_observation"] = t.data.groupby("flight_id")["timestamp"].transform("first")
# Total elapsed seconds since the first observation of the flight, as integers
t.data["duration_from_first"] = (t.data["timestamp"] - t.data["first_observation"]).dt.total_seconds().astype(int)
t.data = t.data.drop(columns=['first_observation'])
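
A small caveat on elapsed-time computations in pandas, shown on made-up timedeltas: `.dt.seconds` returns only the seconds component of each timedelta (it restarts at zero after every full day), whereas `.dt.total_seconds()` returns the whole duration. For flights shorter than a day the two agree, so this only matters if the same code is reused on longer series.

```python
import pandas as pd

# Toy timedeltas (hypothetical): 90 s, and 1 day + 5 s.
gap = pd.Series([pd.Timedelta(seconds=90), pd.Timedelta(days=1, seconds=5)])

seconds_component = gap.dt.seconds.tolist()  # the day part is ignored
total_seconds = gap.dt.total_seconds().astype(int).tolist()  # full duration
```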

In [9]:
# Selecting time-series features

input_features = [
    # "duration_from_first", #we know from the preprocessing that it's sampled every 1s
    "latitude",
    "longitude",
    "geoaltitude",
    "track",
    "vertical_rate",
    "cumdist",
]

In [19]:
# Transforming the Traffic object into samples for neural networks:
# build an array of NaNs of size n_flights * n_timestamps_max * n_features,
# fill it with the flights, and leave NaNs where a flight is not long enough.


def process_flight(
    flight: Flight,
    input_columns: list,
    max_len: int,
):
    # max_len is the largest duration_from_first (in seconds); with 1s sampling
    # the longest flight therefore has max_len + 1 rows.
    data = flight.data[input_columns].values

    if max_len >= data.shape[0]:
        # Pad with NaN rows so that every flight ends up with max_len + 1 rows.
        padding = np.full((max_len - data.shape[0] + 1, data.shape[1]), np.nan)
        data = np.vstack([data, padding])

    return data

In [20]:
# Compute the maximum duration once rather than once per flight.
max_len = int(t.data.duration_from_first.max())
chunks = [(flight, input_features, max_len) for flight in t]
with Pool(20) as p:
    results = p.starmap(process_flight, tqdm(chunks))
    

100%|██████████| 13100/13100 [05:37<00:00, 38.81it/s]


In [21]:
inputs = np.stack(results)

In [28]:
inputs.shape

(13100, 8140, 6)
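
The alternative mentioned in the processing notes, working on 3 to 5 min segments rather than full trajectories, could be sketched as follows. `extract_windows` is a hypothetical helper; it assumes that NaNs appear only as trailing padding (which holds here because rows with NaNs were dropped before resampling), and the window and stride values below are toy numbers.

```python
import numpy as np


def extract_windows(flight, window, stride):
    """Cut one (n_timestamps, n_features) flight into fixed-length windows."""
    # Drop the NaN-padded rows so that no window contains missing values.
    valid = flight[~np.isnan(flight).any(axis=1)]
    if len(valid) < window:
        return np.empty((0, window, flight.shape[1]), dtype=flight.dtype)
    starts = np.arange(0, len(valid) - window + 1, stride)
    return np.stack([valid[s : s + window] for s in starts])


# Toy flight (hypothetical): 10 valid rows, 2 features, padded to 12 rows.
valid_part = np.arange(20, dtype=np.float32).reshape(10, 2)
padding = np.full((2, 2), np.nan, dtype=np.float32)
toy_flight = np.vstack([valid_part, padding])

windows = extract_windows(toy_flight, window=4, stride=3)
```

With the 1 s sampling used in this notebook, a 3 min segment would correspond to `window=180`; applying the function to every row of `inputs` and concatenating the results would give samples of identical length without any padding.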

In [27]:
import os

path_save = "/mnt/beegfs/store/kruu/context_learning/datasets/05_LFPO-LFBO_samples"

# Create the output directory if it does not exist yet.
os.makedirs(path_save, exist_ok=True)

# Store as float32 to halve the file size compared with float64.
np.save(
    f"{path_save}/inputs.npy",
    inputs.astype(np.float32),
)
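
An array of shape (13100, 8140, 6) in float32 takes roughly 2.6 GB, so it may be convenient to read the saved file back lazily. The sketch below uses `np.load` with `mmap_mode="r"` on a small toy file written to a temporary directory (the file name and values are made up), and recovers the true length of each flight from the NaN padding.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for inputs.npy (hypothetical): 3 flights padded to 5 timestamps.
toy = np.full((3, 5, 2), np.nan, dtype=np.float32)
toy[0, :5] = 1.0
toy[1, :3] = 2.0
toy[2, :1] = 3.0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inputs.npy")
    np.save(path, toy)

    # mmap_mode="r" maps the file instead of reading it fully into memory.
    loaded = np.load(path, mmap_mode="r")

    # Number of non-padded timestamps per flight.
    lengths = np.asarray((~np.isnan(loaded).any(axis=-1)).sum(axis=1))
    del loaded  # release the memory map before the directory is removed
```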