 # Flight Prices XGBoost with Dask (GPU) Notebook

 This notebook demonstrates how to:
 - Download and extract the flight prices dataset.
 - Set up a local Dask cluster.
 - Read and process the data using Dask.
 - Prepare features and target for regression.
 - Train an XGBoost model using GPU acceleration via Dask.
 - Evaluate the model by computing the RMSE in a distributed manner.

 **Note:** GPU acceleration requires a CUDA-capable GPU and a GPU-enabled XGBoost build.

 ## Section 1: Download and Setup

 Install required packages, download the dataset from Kaggle, and unzip it.

In [1]:
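# Install XGBoost and the Kaggle CLI
!pip install xgboost --upgrade
!pip install kaggle
# Download the dataset (requires Kaggle API credentials) and unzip it without overwriting existing files
!kaggle datasets download -d dilwong/flightprices
!unzip -n flightprices.zip
# Install a pinned Dask release
!pip install "dask[complete]==2024.10.0"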

Dataset URL: https://www.kaggle.com/datasets/dilwong/flightprices
License(s): Attribution 4.0 International (CC BY 4.0)
Downloading flightprices.zip to /content
100% 5.51G/5.51G [02:17<00:00, 44.2MB/s]
100% 5.51G/5.51G [02:17<00:00, 42.9MB/s]
Archive:  flightprices.zip
  inflating: itineraries.csv         
Collecting lz4>=4.3.2 (from dask[complete]==2024.10.0)
  Downloading lz4-4.4.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting dask-expr<1.2,>=1.1 (from dask==2024.10.0->dask[complete]==2024.10.0)
  Downloading dask_expr-1.1.21-py3-none-any.whl.metadata (2.6 kB)
Collecting distributed==2024.10.0 (from dask==2024.10.0->dask[complete]==2024.10.0)
  Downloading distributed-2024.10.0-py3-none-any.whl.metadata (3.3 kB)
Collecting sortedcontainers>=2.0.5 (from distributed==2024.10.0->dask==2024.10.0->dask[complete]==2024.10.0)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting tblib>=1.6.0 (from distributed==2024.10

 ## Section 2: Import Libraries and Initialize Dask Cluster

 Import necessary libraries and create a local Dask cluster.

In [2]:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Create a local Dask cluster with 1 worker and 8 threads per worker
cluster = LocalCluster(n_workers=1, threads_per_worker=8)
client = Client(cluster)
print(client)

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:38245
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:8787/status
INFO:distributed.scheduler:Registering Worker plugin shuffle
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:42275'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:37789', name: 0, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:37789
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:41294
INFO:distributed.scheduler:Receive client connection: Client-6cb8f85f-f174-11ef-8195-0242ac1c000c
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:41310


<Client: 'tcp://127.0.0.1:38245' processes=1 threads=8, memory=50.99 GiB>


 ## Section 3: Data Loading and Overview

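 Read the CSV file using Dask. The `blocksize` parameter sets the approximate size of each partition (here 64 MB), and `assume_missing=True` reads integer columns without an explicit dtype as floats so that missing values do not break type inference.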
 Display a preview of the data and show the number of partitions.
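
For a single uncompressed CSV file, the partition count is roughly the file size divided by `blocksize`, rounded up, because Dask cuts the file at fixed byte offsets and then aligns each block to a line break. The sketch below shows only that arithmetic. The helper name `estimate_partitions` and the 30 GB file size are illustrative assumptions, not the measured size of `itineraries.csv`.

```python
def estimate_partitions(file_size_bytes: int, blocksize_bytes: int) -> int:
    """Estimate the number of partitions for one CSV file (ceiling division)."""
    if blocksize_bytes <= 0:
        raise ValueError("blocksize_bytes must be positive")
    # Integer ceiling division avoids floating-point rounding
    return -(-file_size_bytes // blocksize_bytes)

# Hypothetical 30 GB file; NOT the measured size of itineraries.csv
example_size = 30_000_000_000
blocksize = 64_000_000  # same value as the 64 MB blocksize used in this notebook

n_parts = estimate_partitions(example_size, blocksize)
print("Estimated partitions:", n_parts)
```

Smaller blocks produce more partitions and more scheduling overhead, while larger blocks increase the memory needed per task.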

In [3]:
df = dd.read_csv('itineraries.csv', blocksize='64MB', assume_missing=True)
print(df.head())
print("Number of partitions:", df.npartitions)

                              legId  searchDate  flightDate startingAirport  \
0  9ca0e81111c683bec1012473feefd28f  2022-04-16  2022-04-17             ATL   
1  98685953630e772a098941b71906592b  2022-04-16  2022-04-17             ATL   
2  98d90cbc32bfbb05c2fc32897c7c1087  2022-04-16  2022-04-17             ATL   
3  969a269d38eae583f455486fa90877b4  2022-04-16  2022-04-17             ATL   
4  980370cf27c89b40d2833a1d5afc9751  2022-04-16  2022-04-17             ATL   

  destinationAirport fareBasisCode travelDuration  elapsedDays  \
0                BOS      LA0NX0MC        PT2H29M          0.0   
1                BOS      LA0NX0MC        PT2H30M          0.0   
2                BOS      LA0NX0MC        PT2H30M          0.0   
3                BOS      LA0NX0MC        PT2H32M          0.0   
4                BOS      LA0NX0MC        PT2H34M          0.0   

   isBasicEconomy  isRefundable  ...  segmentsArrivalTimeEpochSeconds  \
0           False         False  ...                   

 ## Section 4: Data Preprocessing

 Define the features and target, convert those columns to float32 to reduce memory use, and fill missing values in both the features and the target with each column's approximate median.
 Optionally, save the DataFrame in Parquet format for faster loading in later runs.
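
To make the imputation step concrete, here is a minimal in-memory sketch of median filling on a toy column. It uses the exact median from Python's standard library, whereas Dask's `median_approximate()` estimates the median without sorting the whole column, so the filled value in the notebook may differ slightly. The helper name `fill_missing_with_median` and the toy values are illustrative assumptions.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the exact median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    med = median(observed)
    return [med if v is None else v for v in values]

# Toy column standing in for a feature such as seatsRemaining (made-up values)
seats = [7.0, None, 3.0, 9.0, None, 1.0]
filled = fill_missing_with_median(seats)
print(filled)
```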

In [4]:
# Define the features and target variable
features = ['elapsedDays', 'totalTravelDistance', 'seatsRemaining']
target = 'baseFare'

# Convert specified columns to float32
for col in features + [target]:
    df[col] = df[col].astype('float32')

# Fill missing values for features using the approximate median
for col in features:
    med = df[col].median_approximate()
    df[col] = df[col].fillna(med)
df[target] = df[target].fillna(df[target].median_approximate())

# Optional: Save to Parquet for faster future access
# df.to_parquet('itineraries.parquet')

 ## Section 5: Data Splitting and Conversion to Dask Arrays

 Split the data into training (80%) and testing (20%) sets. Then, convert the Dask DataFrames into Dask arrays,
 and create DaskDMatrix objects for XGBoost.
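
A random split of this kind assigns each row to a subset pseudorandomly, so the 80/20 proportions hold approximately and not exactly. The sketch below illustrates the idea on a plain Python list. It is an assumption-laden illustration of row-wise random assignment, not Dask's actual implementation, and the name `random_split_rows` is hypothetical.

```python
import random

def random_split_rows(rows, train_frac=0.8, seed=42):
    """Assign each row to train with probability train_frac, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_frac:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

rows = list(range(10_000))
train_rows, test_rows = random_split_rows(rows)
print(len(train_rows), len(test_rows))
```

Because every row lands in exactly one subset, the two parts are disjoint and together cover the full dataset.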

In [5]:
import xgboost as xgb

# Split the dataset into train (80%) and test (20%) sets
train, test = df.random_split([0.8, 0.2], random_state=42)

# Convert the Dask DataFrames into Dask arrays
X_train = train[features].to_dask_array(lengths=True)
y_train = train[target].to_dask_array(lengths=True)
X_test = test[features].to_dask_array(lengths=True)
y_test = test[target].to_dask_array(lengths=True)

# Create DaskDMatrix objects for XGBoost
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)
dtest  = xgb.dask.DaskDMatrix(client, X_test, y_test)

INFO:distributed.core:Event loop was unresponsive in Scheduler for 5.15s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Nanny for 5.16s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 5.04s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Nanny for 5.08s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 4.97s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause

 ## Section 6: Model Training with XGBoost using GPU

 Set the parameters for XGBoost to use GPU acceleration, and train the model using XGBoost's Dask interface.
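
The parameter names that select the GPU depend on the installed XGBoost release: version 2.0 introduced the `device` parameter, while earlier releases used `tree_method='gpu_hist'` and `predictor='gpu_predictor'`. A minimal sketch of choosing between the two forms follows. The helper name `gpu_params` is hypothetical, and in a real session you would pass `xgb.__version__` in place of the literal version strings.

```python
def gpu_params(xgb_version: str) -> dict:
    """Build GPU training parameters suited to the given XGBoost version string."""
    major = int(xgb_version.split(".")[0])
    params = {
        "objective": "reg:squarederror",
        "max_depth": 6,
        "eta": 0.1,
        "seed": 42,
    }
    if major >= 2:
        # XGBoost 2.0 and later: select the GPU with the `device` parameter
        params.update({"tree_method": "hist", "device": "cuda"})
    else:
        # Releases before 2.0: GPU-specific tree method and predictor
        params.update({"tree_method": "gpu_hist", "predictor": "gpu_predictor"})
    return params

new_style = gpu_params("2.1.4")
old_style = gpu_params("1.7.6")
print(new_style)
print(old_style)
```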

In [6]:
import dask.array as da

# XGBoost parameters for GPU usage
# XGBoost 2.0 replaced 'gpu_hist' and 'gpu_predictor' with the 'device' parameter
params = {
    'objective': 'reg:squarederror',
    'tree_method': 'hist',          # Histogram-based tree construction
    'device': 'cuda',               # Run training and prediction on the GPU (XGBoost >= 2.0)
    'max_depth': 6,
    'eta': 0.1,
    'seed': 42
}

num_rounds = 100

# Train the model using the Dask interface of XGBoost
output = xgb.dask.train(client, params, dtrain, num_boost_round=num_rounds, evals=[(dtest, 'test')])
bst = output['booster']

INFO:distributed.worker:Run out-of-band function '_start_tracker'
INFO:distributed.scheduler:Receive client connection: Client-worker-bd28a9a6-f184-11ef-8820-0242ac1c000c
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:56128


 ## Section 7: Prediction and Evaluation

 Make predictions on the test set using the trained model and calculate the RMSE in a distributed manner.
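
RMSE is the square root of the mean squared difference between true and predicted values. Conceptually, a distributed mean can be computed by having each chunk report a partial sum and a count, which are then combined. The sketch below shows that pattern with plain Python lists standing in for chunks. The function name `chunked_rmse` and the toy numbers are illustrative assumptions, not values from this dataset.

```python
import math

def chunked_rmse(y_true_chunks, y_pred_chunks):
    """Compute RMSE by combining per-chunk squared-error sums and counts."""
    total_sq_err = 0.0
    total_count = 0
    for y_true, y_pred in zip(y_true_chunks, y_pred_chunks):
        # Each chunk contributes only two numbers: a partial sum and a count
        total_sq_err += sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        total_count += len(y_true)
    if total_count == 0:
        raise ValueError("no values to evaluate")
    return math.sqrt(total_sq_err / total_count)

# Toy fares split across two chunks (made-up values)
y_true_chunks = [[100.0, 200.0], [300.0]]
y_pred_chunks = [[110.0, 190.0], [330.0]]

rmse_value = chunked_rmse(y_true_chunks, y_pred_chunks)
print("RMSE:", rmse_value)
```

Only the partial sums and counts need to travel between workers, which is why the full prediction array never has to be gathered in one place.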

In [7]:

# Make predictions on the test set
preds = xgb.dask.predict(client, bst, dtest)

# Compute RMSE using Dask Array operations
rmse = da.sqrt(((y_test - preds) ** 2).mean())
print("XGBoost with GPU (Dask) - RMSE:", rmse.compute())


INFO:distributed.core:Event loop was unresponsive in Nanny for 4.77s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 4.74s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.


XGBoost with GPU (Dask) - RMSE: 147.79932


## End of Notebook

 This notebook showcased how to:
 - Set up and use a local Dask cluster.
 - Process and prepare the flight prices dataset with Dask.
 - Train an XGBoost model using GPU acceleration.
 - Evaluate model performance using distributed computations.