# Download data

##### Please run ```make configure-envs``` before running this notebook, because that command is an interactive script.

### Download raw data from Kaggle

In [7]:
!make -C .. download-data

poetry run python scripts/data/download_data.py
Download start...
Downloading resized-2015-2019-diabetic-retinopathy-detection.zip to data/raw
100%|██████████████████████████████████████| 17.3G/17.3G [10:09<00:00, 30.6MB/s]
Files downloaded successfully.
Unzipping files...
Unzipping all files...
Unzipped all files.
Deleted zip file.
Files unzipped successfully.


#### Get path to the raw data

In [10]:
import os
from pathlib import Path

from dotenv import load_dotenv
load_dotenv()
root_data = os.getenv("KAGGLE_FILES_DIR")
dataset_path = Path(os.getcwd(), "..", root_data)
raw = Path(dataset_path, "raw")
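
If `KAGGLE_FILES_DIR` is missing from the environment, `os.getenv` returns `None` and the `Path(...)` call above fails with a `TypeError` that does not name the variable. A minimal sketch of a guard follows. It assumes it is called after `load_dotenv()`. The helper name `resolve_raw_dir` and the `.resolve()` call are illustrative additions, not part of the project.

```python
import os
from pathlib import Path


def resolve_raw_dir(env_var: str = "KAGGLE_FILES_DIR") -> Path:
    """Return <cwd>/../<value of env_var>/raw, failing early if the variable is unset."""
    root_data = os.getenv(env_var)
    if not root_data:
        # Passing None to Path(...) would raise a less helpful TypeError.
        raise RuntimeError(f"{env_var} is not set; check your .env file")
    # resolve() collapses the ".." segment so the printed path is easier to read.
    return Path(os.getcwd(), "..", root_data, "raw").resolve()
```

With this helper, `raw = resolve_raw_dir()` could replace the last three lines of the cell above.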

#### Read CSV files with labels


In [11]:
import pandas as pd

labels_traintest15_train19 = pd.read_csv(Path(raw, 'labels', 'traintestLabels15_trainLabels19.csv'), header=0, usecols=["image", "level"])
labels_train19 = pd.read_csv(Path(raw, 'labels', 'trainLabels19.csv'), header=0, usecols=["id_code", "diagnosis"])
labels_train19.rename(columns={"id_code": "image", "diagnosis": "level"}, inplace=True)  # rename columns to match other datasets

labels_train15 = pd.read_csv(Path(raw, 'labels', 'trainLabels15.csv'), header=0, usecols=["image", "level"])
labels_test15 = pd.read_csv(Path(raw, 'labels', 'testLabels15.csv'), header=0, usecols=["image", "level"])


#### Shapes of labels CSVs

In [12]:
print(labels_traintest15_train19.shape)
print(labels_train19.shape)
print(labels_train15.shape)
print(labels_test15.shape)

(92364, 2)
(3662, 2)
(35126, 2)
(53576, 2)


#### Concatenate all labels with no duplicates

In [14]:
labels = pd.concat([labels_traintest15_train19, labels_train19, labels_train15, labels_test15], ignore_index=True).drop_duplicates()
labels.shape

(92364, 2)

##### The concatenated, de-duplicated table has the same shape as `labels_traintest15_train19`, so that file already contains every unique label.

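
The shape comparison above can be made explicit per file. The sketch below uses `DataFrame.merge` with `indicator=True` to list rows of one table whose `(image, level)` pair is absent from another. The helper name `rows_not_in` and the toy frames are hypothetical stand-ins for the real CSVs.

```python
import pandas as pd


def rows_not_in(combined: pd.DataFrame, other: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `other` whose (image, level) pair is absent from `combined`."""
    merged = other.merge(
        combined.drop_duplicates(),
        on=["image", "level"],
        how="left",
        indicator=True,  # adds a `_merge` column: "both" or "left_only"
    )
    return merged.loc[merged["_merge"] == "left_only", ["image", "level"]]


# Toy frames standing in for the real label files (hypothetical values).
combined = pd.DataFrame({"image": ["a", "b", "c"], "level": [0, 1, 2]})
subset = pd.DataFrame({"image": ["a", "c"], "level": [0, 2]})
conflict = pd.DataFrame({"image": ["b"], "level": [4]})

print(len(rows_not_in(combined, subset)))    # 0: every row is already covered
print(len(rows_not_in(combined, conflict)))  # 1: same image, different label
```

On the real data, a call such as `rows_not_in(labels_traintest15_train19, labels_train15)` should return an empty frame if the conclusion above holds.
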
#### Check images directories length

In [17]:
imgs_15 = Path(raw, 'resized_traintest15_train19')
imgs_19 = Path(raw, 'resized_test19')

print(len(list(imgs_15.glob("*"))))
print(len(list(imgs_19.glob("*"))))


92364
1928


#### Check that every image listed in `traintestLabels15_trainLabels19.csv` exists in `resized_traintest15_train19`.

In [19]:
missing = []

# Iterate over the image names directly instead of over one-element arrays.
for image in labels["image"]:
    p = Path(imgs_15, f"{image}.jpg")
    if not p.exists():
        missing.append(image)
        print(f"Image {image} not found in `resized_traintest15_train19`")

#### All images from `traintestLabels15_trainLabels19.csv` are present in `resized_traintest15_train19`.
#### The other CSVs and the `resized_test19` directory will be removed.
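
The loop above checks one direction only: that each labelled image has a file. A set comparison can check both directions in a single directory listing. This is a minimal sketch, assuming all images use the `.jpg` suffix as in the loop above. The helper name `compare_labels_to_files` and the demo file names are hypothetical.

```python
import tempfile
from pathlib import Path


def compare_labels_to_files(image_ids, image_dir, suffix=".jpg"):
    """Return (ids without a file, files without an id) as two sorted lists."""
    wanted = set(image_ids)
    present = {p.stem for p in Path(image_dir).glob(f"*{suffix}")}
    return sorted(wanted - present), sorted(present - wanted)


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("10_left", "10_right", "extra"):
        Path(tmp, f"{name}.jpg").touch()
    missing_files, unlabeled = compare_labels_to_files(
        ["10_left", "10_right", "13_left"], tmp
    )

print(missing_files)  # ['13_left']
print(unlabeled)      # ['extra']
```

On the real data this would be called as `compare_labels_to_files(labels["image"], imgs_15)`. Two empty lists would confirm that labels and images match one to one.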