In [1]:
!pip install --upgrade duckdb pandas

Collecting duckdb
  Using cached duckdb-0.10.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: duckdb
  Attempting uninstall: duckdb
    Found existing installation: duckdb 0.10.1
    Uninstalling duckdb-0.10.1:
      Successfully uninstalled duckdb-0.10.1
Successfully installed duckdb-0.10.2


In [2]:
!pip install pyspark



In [3]:
# The duckdb package should be version 0.10.1 so that the database loads without problems; check the installed version:
!pip show duckdb

Name: duckdb
Version: 0.10.2
Summary: DuckDB in-process database
Home-page: https://www.duckdb.org
Author: 
Author-email: 
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: 
Required-by: malloy


In [4]:
!pip install -U duckdb==0.10.1

Collecting duckdb==0.10.1
  Using cached duckdb-0.10.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.1 MB)
Installing collected packages: duckdb
  Attempting uninstall: duckdb
    Found existing installation: duckdb 0.10.2
    Uninstalling duckdb-0.10.2:
      Successfully uninstalled duckdb-0.10.2
Successfully installed duckdb-0.10.1


In [5]:
!wget -O "duckdb.jar" "https://repo1.maven.org/maven2/org/duckdb/duckdb_jdbc/0.10.1/duckdb_jdbc-0.10.1.jar"

--2024-04-24 19:20:01--  https://repo1.maven.org/maven2/org/duckdb/duckdb_jdbc/0.10.1/duckdb_jdbc-0.10.1.jar
Resolving repo1.maven.org (repo1.maven.org)... 199.232.192.209, 199.232.196.209, 2a04:4e42:4c::209, ...
Connecting to repo1.maven.org (repo1.maven.org)|199.232.192.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64009472 (61M) [application/java-archive]
Saving to: ‘duckdb.jar’


2024-04-24 19:20:02 (189 MB/s) - ‘duckdb.jar’ saved [64009472/64009472]



In [6]:
from pyspark.sql import SparkSession

# Start a Spark session with the DuckDB JDBC driver jar registered
spark = SparkSession.builder \
    .config("spark.jars", "duckdb.jar") \
    .getOrCreate()

In this pipeline, we predict whether a flight will arrive late at its destination airport. To do so, we use the flights, airports and weather datasets. First, we select the data from the flights dataset that the analysis needs. Then, since the airports dataset gives the coordinates of each destination airport, we join the flights dataset with the weather dataset on the nearest latitude and longitude of the destination airport and on the day of the flight, which attaches that day's meteorological data to each flight.
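The two-step join described above can be sketched in pandas on toy data. This is only an illustration: the column names follow the query used in this notebook, but every value (airport codes, coordinates, dates, temperatures) is invented, and the notebook itself runs the join in DuckDB through Spark rather than in pandas.

```python
import pandas as pd

# Toy tables; column names follow the notebook's query, values are invented.
flights = pd.DataFrame({
    "city": ["Rome", "Paris"],
    "airport_acronym": ["FCO", "CDG"],
    "arrival_date": ["2024-01-10", "2024-01-11"],
    "has_delay": [1, 0],
    "time_of_day": ["afternoon", "night"],
})
airports = pd.DataFrame({
    "airport_acronym": ["FCO", "CDG"],
    "latitude": [41.8, 49.0],
    "longitude": [12.25, 2.55],
})
weather = pd.DataFrame({
    "latitude": [41.8, 49.0, 49.0],
    "longitude": [12.25, 2.55, 2.55],
    "date": ["2024-01-10", "2024-01-10", "2024-01-11"],
    "avg_temperature_2m": [13.6, 7.1, 6.4],
})

# Step 1: attach the airport coordinates to each flight.
with_coords = flights.merge(airports, on="airport_acronym", how="inner")

# Step 2: attach the weather record for those coordinates on the arrival date.
toy = with_coords.merge(
    weather,
    left_on=["latitude", "longitude", "arrival_date"],
    right_on=["latitude", "longitude", "date"],
    how="inner",
)[["city", "has_delay", "time_of_day", "avg_temperature_2m"]]

print(toy)
```

Each flight keeps only the weather row that matches both its airport coordinates and its arrival date, so the Paris weather record for the other day is dropped.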

In [27]:
query = """SELECT
    f.city,
    f.has_delay,
    f.time_of_day,
    w.avg_temperature_2m,
    w.avg_relative_humidity_2m,
    w.total_precipitation,
    w.avg_cloud_cover
FROM
    flights f
INNER JOIN
    airports a
ON
    f.airport_acronym = a.airport_acronym
INNER JOIN
    weather w
ON
    a.latitude = w.latitude AND a.longitude = w.longitude AND f.arrival_date = w.date
"""


# Run the query against the DuckDB file over JDBC and load the result as a Spark DataFrame
DF = spark.read \
  .format("jdbc") \
  .option("url", "jdbc:duckdb:exploitation_database.duckdb") \
  .option("driver", "org.duckdb.DuckDBDriver") \
  .option("query", query) \
  .load()

DF.show()

+--------+---------+-----------+------------------+------------------------+-------------------+------------------+
|    city|has_delay|time_of_day|avg_temperature_2m|avg_relative_humidity_2m|total_precipitation|   avg_cloud_cover|
+--------+---------+-----------+------------------+------------------------+-------------------+------------------+
|  Athens|        1|    evening| 13.59660428762436|      56.208333333333336|                0.0|50.291666666666664|
|    Rome|        1|  afternoon| 13.65533318122228|      57.666666666666664|                0.0|46.208333333333336|
|    Rome|        1|  afternoon|13.991666714350382|      55.416666666666664|                0.0|45.958333333333336|
|Budapest|        1|    evening| 9.587249795595804|      55.583333333333336|                0.0|            75.375|
|  Vienna|        1|    evening| 9.837875028451284|       75.83333333333333|0.20000000298023224|              87.5|
|   Paris|        1|      night|13.895458181699118|       71.20833333333

# MODEL

We tried several models, and the DecisionTreeClassifier performed best. We use it to predict whether a flight will arrive late at its destination airport, with the following features:

- `city`: city of the destination airport
- `time_of_day`: time of the day of the flight (morning, afternoon, evening, night)
- `avg_temperature_2m`: average temperature at 2 meters above the ground
- `avg_relative_humidity_2m`: average relative humidity at 2 meters above the ground
- `total_precipitation`: total precipitation
- `avg_cloud_cover`: average cloud cover
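To make the role of these features concrete, here is a minimal sketch of how the two categorical features can be one-hot encoded and the four numerical ones standardized. The rows are invented, and the columns are listed explicitly here (an assumption made for the sketch) instead of being selected by dtype as in the notebook's own preprocessing cell.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows that reuse the feature names listed above.
X = pd.DataFrame({
    "city": ["Rome", "Paris", "Rome", "Vienna"],
    "time_of_day": ["morning", "night", "evening", "evening"],
    "avg_temperature_2m": [13.6, 9.5, 14.0, 9.8],
    "avg_relative_humidity_2m": [57.0, 71.0, 55.0, 76.0],
    "total_precipitation": [0.0, 0.0, 0.0, 0.2],
    "avg_cloud_cover": [46.0, 80.0, 45.0, 87.5],
})

cat_cols = ["city", "time_of_day"]
num_cols = [c for c in X.columns if c not in cat_cols]

pre = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),  # zero mean, unit variance
        ("cat", OneHotEncoder(), cat_cols),   # one column per category
    ]
)

encoded = pre.fit_transform(X)
# The transformer may return a sparse matrix; convert to a dense array if so.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded.shape)
```

With three distinct cities and three distinct times of day in the toy data, the four numerical columns are followed by six indicator columns.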

In [33]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Collect the Spark DataFrame into pandas for scikit-learn
data = DF.toPandas()

In [29]:
# Split the data into train (70%) and test (30%) sets
train, test = train_test_split(data, test_size=0.3, random_state=42, shuffle=True)

In [30]:
# Preprocess the data: scale numerical features and one-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), selector(dtype_exclude="object")),
        ('cat', OneHotEncoder(), selector(dtype_include="object"))
    ]
)

In [31]:
# Train a decision tree classifier, using 5-fold grid search over tree depth and split criterion
param_grid = {
    'decisiontreeclassifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'decisiontreeclassifier__criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(
    make_pipeline(preprocessor, DecisionTreeClassifier()),
    param_grid,
    cv=5,
    n_jobs=-1,
)
grid_search.fit(train.drop('has_delay', axis=1), train['has_delay'])

In [34]:
print("Best parameters found: ", grid_search.best_params_)
print("Best score found: ", grid_search.best_score_)
print("Test score: ", grid_search.score(test.drop('has_delay', axis=1), test['has_delay']))
print(classification_report(test['has_delay'], grid_search.predict(test.drop('has_delay', axis=1))))

Best parameters found:  {'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__max_depth': 9}
Best score found:  0.6003101334294987
Test score:  0.5972222222222222
              precision    recall  f1-score   support

           0       0.63      0.52      0.57       257
           1       0.58      0.68      0.62       247

    accuracy                           0.60       504
   macro avg       0.60      0.60      0.60       504
weighted avg       0.60      0.60      0.59       504



In [35]:
# Save the fitted grid search; its best_estimator_ is the best pipeline refit on the training set
import joblib
joblib.dump(grid_search, 'best_model_dt.pkl')

['best_model_dt.pkl']
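A saved model can later be restored with `joblib.load` and used for prediction. The sketch below shows the round trip on a tiny stand-in model so that it runs on its own; the toy data, the single feature and the file name `toy_model.pkl` are all assumptions made for illustration. For the file saved above, `joblib.load('best_model_dt.pkl')` returns the fitted `GridSearchCV`, whose `predict` expects a DataFrame with the same feature columns used for training.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model; the notebook saves a fitted GridSearchCV instead.
X_toy = pd.DataFrame({"avg_cloud_cover": [10.0, 20.0, 80.0, 90.0]})
y_toy = [0, 0, 1, 1]
model = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# New rows must have the same columns the model was trained on.
new_rows = pd.DataFrame({"avg_cloud_cover": [15.0, 85.0]})
predictions = restored.predict(new_rows)
print(predictions)
```

The restored object behaves like the original estimator, so the same `predict` and `score` calls used earlier in the notebook apply to it.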