In [1]:
import polars as pl 

from pathlib import Path 

from utils.settings_manager import Settings
from utils.data_loader import DataLoader

In [2]:
stg = Settings()

# Feature creation and exploratory data analysis
This notebook focuses on data exploration and feature creation.

## Exploratory data analysis

### Data imports

In [3]:
d_loader = DataLoader(stg.datasets["base_path"])
files = d_loader.list_files(base_name=stg.datasets["base_name"], 
                            size=stg.datasets["size"],
                            suffix=stg.datasets["suffix"]
                            )
df = d_loader.load_concat_lazy(files)
dfs = d_loader.load_list_lazy(files)


[PosixPath('/home/blaise/Documents/Dev/adsb_classifier/data/raw/Sample_2000_A1.parquet'), PosixPath('/home/blaise/Documents/Dev/adsb_classifier/data/raw/Sample_2000_A5.parquet'), PosixPath('/home/blaise/Documents/Dev/adsb_classifier/data/raw/Sample_2000_A2.parquet'), PosixPath('/home/blaise/Documents/Dev/adsb_classifier/data/raw/Sample_2000_A3.parquet')]


## Overview

Let's first take a look at the data shape, the columns and the missing values. The columns are:

- timestamp: the timestamp of the ADS-B message
- hex: the ICAO identification of the aircraft
- flight: the flight identification
- alt_baro: the barometric altitude of the aircraft
- alt_geom: the GPS altitude of the aircraft
- gs: the ground speed of the aircraft
- lat: latitude
- lon: longitude
- grounded: whether the aircraft is on the ground
- category: the category of the aircraft

In [4]:
df.collect().describe()

| describe (str) | timestamp (f64) | hex (str) | flight (str) | alt_baro (f64) | alt_geom (f64) | gs (f64) | lat (f64) | lon (f64) | grounded (f64) | category (str) |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10947950.0 | "10947950" | "10947950" | 10947950.0 | 10947950.0 | 10947950.0 | 10947950.0 | 10947950.0 | 10947950.0 | "10947950" |
| null_count | 0.0 | "0" | "0" | 38059.0 | 1126193.0 | 71169.0 | 69394.0 | 69394.0 | 38059.0 | "0" |
| mean | 1696200000.0 | | | 20274.799327 | 23465.256614 | 303.80994 | 35.72912 | -38.309601 | 0.106108 | |
| std | 23935.169752 | | | 15584.770914 | 15397.11864 | 174.347986 | 17.190321 | 76.840245 | 0.307976 | |
| min | 1696100000.0 | "001c2b" | " " | -775.0 | -3000.0 | 0.0 | -49.66983 | -173.593547 | 0.0 | "A1" |
| 25% | 1696100000.0 | | | 3625.0 | 6750.0 | 135.6 | 32.995993 | -94.80938 | | |
| 50% | 1696200000.0 | | | 21000.0 | 27325.0 | 363.4 | 38.978003 | -76.622295 | | |
| 75% | 1696200000.0 | | | 36000.0 | 37800.0 | 456.0 | 44.396484 | 11.307747 | | |
| max | 1696200000.0 | "e94bc9" | "ZUPFM " | 61600.0 | 87850.0 | 5079.6 | 72.142468 | 179.540648 | 1.0 | "A5" |
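
To put the `null_count` row above into proportion, the counts can be divided by the total number of rows. The snippet below is a small sketch in plain Python that reuses the figures reported by `describe()`; the variable names are only illustrative.

```python
# Counts copied from the describe() output above.
total_rows = 10_947_950
null_counts = {
    "alt_baro": 38_059,
    "alt_geom": 1_126_193,
    "gs": 71_169,
    "lat": 69_394,
    "lon": 69_394,
    "grounded": 38_059,
}

# Fraction of messages with a missing value, per column.
null_fractions = {col: n / total_rows for col, n in null_counts.items()}

# Print the columns from most to least affected.
for col, frac in sorted(null_fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<9} {frac:.2%}")
```

alt_geom stands out with roughly 10% of its values missing, while every other affected column is below 1%.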


We have null values in alt_baro, alt_geom, gs, lat, lon and grounded, which we'll have to handle later. The number of individual messages is large (about 10.9 million). However, since we'll group these messages by flight and generate the features per flight, the model won't use every message directly.

### Plots and analysis

In [5]:
from visualization.eda_plots import df_violin_plots

First, let's take a look at the columns with continuous values.

In [6]:
fig = df_violin_plots(df, "alt_baro", "category")
fig.show()
