# Retrieval

In this file, we describe how to retrieve the data required for our analysis. Where the retrieval can be scripted, we also provide the Python code to download the data.

We need the following modules:

In [None]:
import os
import subprocess
import shutil
import requests
import tempfile

from local_nbutils import CFG

## Flight Delay Data

Unlike e.g. [Kaggle](https://www.kaggle.com), Zindi offers no way to retrieve the data from a (Python) script. The fastest method is therefore to download them manually from [here](https://zindi.africa/competitions/flight-delay-prediction-challenge/data) (an account is required). Of the three files offered, only `train.csv` is relevant to us. Please store this file at the location `CFG["TRAIN_DATA_PATH"]`.
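Because this download is manual, a quick check that the file has actually landed in the expected place can save confusion later. The following is a minimal sketch; the helper name `require_file` is hypothetical and not part of `local_nbutils`.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise if no non-empty file exists there."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Expected a manually downloaded file at {p}")
    if p.stat().st_size == 0:
        raise ValueError(f"File at {p} is empty; the download may have failed")
    return p
```

In this notebook it could be called as `require_file(CFG["TRAIN_DATA_PATH"])`.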

## Airport Data

In addition to the flight delay data, we will also need a "dictionary" to decode the airport identifiers. This dictionary provides further information on each airport, such as geographic location, country and continent.
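To illustrate how such a dictionary might be used, here is a minimal sketch based on the standard library only. It assumes that the airport file has the columns `iata_code`, `name`, `iso_country`, `continent`, `latitude_deg` and `longitude_deg`, and that the flight data identify airports by IATA code; both assumptions should be verified against the actual files. The two sample rows are illustrative, with approximate coordinates, and the helper name `build_airport_lookup` is hypothetical.

```python
import csv
import io

# Illustrative sample rows (assumed column names, approximate coordinates).
SAMPLE_CSV = """iata_code,name,iso_country,continent,latitude_deg,longitude_deg
TUN,Tunis Carthage International Airport,TN,AF,36.85,10.23
ORY,Paris-Orly Airport,FR,EU,48.72,2.38
"""


def build_airport_lookup(csv_file, key="iata_code"):
    """Map each non-empty airport code to its full row (a dict of strings)."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        code = (row.get(key) or "").strip()
        if code:
            lookup[code] = row
    return lookup


airports = build_airport_lookup(io.StringIO(SAMPLE_CSV))
print(airports["TUN"]["iso_country"], airports["TUN"]["continent"])
```

For the real file, the same function could be given an open file handle on `CFG["AIRPORTS_DATA_PATH"]` instead of the in-memory sample.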

### HTTP Method

Unlike the Zindi data, the airport data are available at many sites without subscription or authentication barriers, so a download via HTTP is possible.

In [None]:
# HTTP address of the (public) repository.
HTTP_REPO_URL = "https://github.com/davidmegginson/ourairports-data"

# Path of the csv file relative to the (remote) repository root.
CSV_PATH_REL = "airports.csv"

# Target path of the csv file on the local machine.
AIRPORTS_DATA_PATH = CFG["AIRPORTS_DATA_PATH"]

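# Full HTTP address of the raw csv file. The "/raw/" route serves the file
# contents, whereas "/blob/" would return GitHub's HTML viewer page.
http_url = f"{HTTP_REPO_URL}/raw/main/{CSV_PATH_REL}"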

In [None]:
# A timeout prevents the request from hanging indefinitely.
response = requests.get(http_url, timeout=30)
response.raise_for_status()

with open(AIRPORTS_DATA_PATH, "wb") as f:
    f.write(response.content)
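A successful status code does not guarantee that the response body is the CSV itself, since repository hosting sites may return an HTML page for some URLs. The following sketch shows one possible sanity check on the downloaded bytes. The function name `looks_like_csv` and the expected header column `ident` are assumptions made for illustration.

```python
def looks_like_csv(content, expected_column="ident"):
    """Heuristically check that `content` (bytes) is CSV text, not an HTML page."""
    # Only the first line is inspected; "utf-8-sig" drops a leading BOM if present.
    first_line = (
        content.lstrip()
        .split(b"\n", 1)[0]
        .decode("utf-8-sig", errors="replace")
        .strip()
    )
    if first_line.lower().startswith(("<!doctype", "<html")):
        return False
    # Compare against the header names with surrounding quotes removed.
    header = [col.strip().strip('"') for col in first_line.split(",")]
    return expected_column in header
```

It could be applied to `response.content` before the file is written, for example by raising an error when the check fails.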

### SSH Method

Although the HTTP version suffices in our case, we also present, for completeness, an SSH version. It would be necessary if the csv file belonged to a repository that requires the client to provide appropriate credentials via SSH.

In [None]:
# SSH address of the (not necessarily public) repository.
SSH_REPO_URL = "git@github.com:davidmegginson/ourairports-data.git"

# Path of the csv file relative to the (remote) repository root.
CSV_PATH_REL = "airports.csv"

# Target path of the csv file on the local machine.
AIRPORTS_DATA_PATH = CFG["AIRPORTS_DATA_PATH"]


In [None]:
with tempfile.TemporaryDirectory() as tmpdir:
    subprocess.run(
        [
            "git",
            "clone",
            # Clones shallowly.
            "--depth",
            "1",
            # Skips blobs initially.
            "--filter=blob:none",
            # Enables sparse checkout mode.
            "--sparse",
            SSH_REPO_URL,
            tmpdir,
        ],
        check=True,
    )

    subprocess.run(
        [
            "git",
            "-C",
            tmpdir,
            # Initialises sparse checkout.
            "sparse-checkout",
            "init",
        ],
        check=True,
    )

    subprocess.run(
        [
            "git",
            "-C",
            tmpdir,
            # Specifies which files to include in the sparse checkout.
            "sparse-checkout",
            "set",
            CSV_PATH_REL,
            # Relaxes the checks as sparse-checkout expects directories.
            "--skip-checks",
        ],
        check=True,
    )

    src_file = os.path.join(tmpdir, CSV_PATH_REL)
    shutil.copy2(src_file, AIRPORTS_DATA_PATH)