# Retrieval

In this notebook, we describe how to retrieve the data required for our analysis. Where retrieval can be scripted, we also provide the Python code to download the data.

The following modules are required:

In [None]:
import os
import subprocess
import shutil
import requests
import tempfile

from ipynb_utils import CFG

## Flight Delay Data

Unlike platforms such as [Kaggle](https://www.kaggle.com), Zindi offers no option to retrieve the data from a (Python) script. Consequently, the easiest way to acquire the data is to download it manually from [here](https://zindi.africa/competitions/flight-delay-prediction-challenge/data) (an account is required). Of the three files provided, only `train.csv` is relevant for our purposes. Save this file at the location specified by `CFG["TRAIN_DATA_PATH"]`.
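Because this step is manual, it is easy to forget the file or save it in the wrong place. The following is a minimal sketch of a check that could be run before the analysis; the helper name `check_train_data` is hypothetical and not part of `ipynb_utils`.

```python
import os


def check_train_data(path):
    """Return the file size in bytes, or raise if the file is missing or empty.

    Hypothetical helper: it only verifies that a non-empty file exists at
    `path`, not that its contents are the expected training data.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Expected the manually downloaded train.csv at {path!r}."
        )
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"{path!r} exists but is empty.")
    return size
```

In this notebook, it would be called as `check_train_data(CFG["TRAIN_DATA_PATH"])`.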

## Airport Data

In addition to the flight delay data, we also require a "dictionary" to decode the airport identifiers. This dictionary provides further information about each airport, such as its geographic location, country, and continent.

Unlike the Zindi data, the airport data are available from numerous sites without a subscription or login, and can be downloaded via HTTP.
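To illustrate what such a "dictionary" could look like in code, here is a small sketch using only the standard library. The rows are made-up placeholders, and both the column names (`iata_code`, `latitude_deg`, and so on) and the choice of `iata_code` as the lookup key are assumptions about the airport file and the flight data, not something established in this notebook.

```python
import csv
import io

# Made-up placeholder rows; the column names are assumed to match the real file.
SAMPLE_CSV = """ident,name,latitude_deg,longitude_deg,continent,iso_country,iata_code
ZZ01,Example Airport One,10.0,20.0,AF,ZZ,ZZA
ZZ02,Example Airport Two,-5.5,30.25,AF,ZZ,ZZB
ZZ03,Example Airfield Without Code,0.0,0.0,AF,ZZ,
"""


def build_airport_lookup(csv_file, key="iata_code"):
    """Map each non-empty airport identifier to its row (a dict of strings)."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        code = row[key]
        if code:  # skip airports without an identifier in this column
            lookup[code] = row
    return lookup


airports = build_airport_lookup(io.StringIO(SAMPLE_CSV))
```

With the real file, the same function could be applied to an opened file handle, for example `open(path, newline="", encoding="utf-8")`.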

The following variables are required:

In [None]:
# HTTP address of the (public) repository.
HTTP_REPO_URL = "https://github.com/davidmegginson/ourairports-data"

# Path of the csv file relative to the (remote) repository root.
CSV_PATH_REL = "airports.csv"

# Target path of the csv file on the local machine.
AIRPORTS_DATA_PATH = CFG["AIRPORTS_DATA_PATH"]

# Full HTTP address of the raw csv file.
# Note: "/raw/" serves the file itself, whereas "/blob/" serves GitHub's HTML viewer page.
http_url = f"{HTTP_REPO_URL}/raw/main/{CSV_PATH_REL}"

The following code cell performs the actual download.

In [None]:
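# Use a timeout so the cell cannot hang indefinitely on a stalled connection.
response = requests.get(http_url, timeout=30)
# Abort with an exception on HTTP error codes (4xx/5xx).
response.raise_for_status()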

with open(AIRPORTS_DATA_PATH, "wb") as f:
    f.write(response.content)
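
A successful HTTP status does not guarantee that the saved file is actually a csv file; a wrong URL may, for instance, return an HTML page. The following sketch checks the header line of the saved file. The helper names and the required column names (`ident`, `name`, `iso_country`) are assumptions for illustration and should be adapted to the columns the analysis relies on.

```python
import csv


def read_csv_header(path):
    """Return the column names from the first line of a csv file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def looks_like_airports_csv(path, required=("ident", "name", "iso_country")):
    """Heuristic check that the file is the csv itself and not, say, an HTML page.

    Only the header line is inspected; the data rows are not validated.
    """
    header = read_csv_header(path)
    return all(column in header for column in required)
```

In this notebook, the check could be run as `looks_like_airports_csv(AIRPORTS_DATA_PATH)` directly after the download.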