# Pneumonia Multimodal Dataset Generation

Our main goal in this notebook is to use as much of the data available in MIMIC-IV as possible in a multimodal manner. We want to investigate how incorporating clinical data affects disease classification.

## Requirements

Please make sure that the following requirements are met before executing the cells below:
- Credentialed access to [PhysioNet](https://physionet.org/) (after registering, you have to apply for the [credentialing process](https://physionet.org/settings/credentialing/) and get accepted)
- Request access for MIMIC-IV, MIMIC-CXR, and MIMIC-CXR-JPG
- Request access for MIMIC-IV and MIMIC-CXR using Google BigQuery (a Google account is needed for this; you can find more information on the [MIMIC website](https://mimic.mit.edu/docs/gettingstarted/cloud/))


## Configuration
In this section the needed packages are imported and some helper functions are defined. We also specify important variables as described below.

In [2]:
import numpy as np
import pandas as pd
import pandas_gbq as gbq
import os
import pydata_google_auth
from tqdm import tqdm
from PIL import Image
from transformers import AutoTokenizer
from multiprocessing import cpu_count, Pool

In [72]:
# Google Cloud authentication
# Needed to query data from MIMIC-IV

SCOPES = [
    'https://www.googleapis.com/auth/cloud-platform',
    'https://www.googleapis.com/auth/drive',
]

# Set use_local_webserver to True for a slightly more convenient
# authorization flow. Note that this does not work if you are running a
# notebook on a remote server, such as over SSH or with Google Colab.
credentials = pydata_google_auth.get_user_credentials(
    SCOPES,
    use_local_webserver=False,
)

gbq.context.credentials = credentials

# The id of the GCP project that is connected to physionet in GBQ
# project_id = 'my-project-1234567'
project_id = 'master-thesis-332120'

# Physionet authentication
# Needed to download reports and CXRs from MIMIC-CXR
# Replace the placeholders with your own credentials and never commit
# real credentials together with the notebook
user = 'my-user'
password = 'my-passwd'

# Don't change this
base_url = "https://physionet.org/files/"

# The image resolution R for the CXR images. 
# The original images will be downscaled to R x R
image_resolution = 225

# Set path to the local working directory
# If the path does not exist it will be created
# The generated dataframes, study reports, etc. will be copied here
# Moving files out of the directory or renaming them during the 
# execution of this notebook might lead to errors
# local_dir = '.local_dir'
local_dir = '/home/mohammad/Projects/master-thesis/frames'
os.makedirs(local_dir, exist_ok=True)

Some helper functions to run queries

In [9]:
def run_query(query, project_id=project_id):
  """Runs SQL query in GBQ and returns result as pandas DataFrame

  Args:
      query (string): Query in SQL Standard dialect.
      project_id (string, optional): The project-id from GCP to use. Defaults to a global variable named project_id.

  Returns:
      pandas.DataFrame: The query result.
  """
  return gbq.read_gbq(query, project_id=project_id, dialect='standard')

def lazy_run_query(local_path, query=None, save_local=True, project_id=project_id, transform=None):
  """Runs a query if no local version (as .csv) exists. Only tries to load local version if no query is specified. 

  Args:
      local_path (string): Path of the local version to look for.
      query (string): Query in SQL Standard dialect.
      save_local (bool, optional): Whether or not the result should be saved locally. Defaults to True.
      project_id (string, optional): The project-id from GCP to use. Defaults to a global variable named project_id.
      transform (func, optional): If specified, the function is called on the queried/loaded result. 
                                  The function should expect a pandas DataFrame as only parameter.
                                  The function should only return a pandas DataFrame. Defaults to None.

  Returns:
      pandas.DataFrame: The loaded or queried (and optionally transformed) result,
                        or None if no local version exists and no query is given.
  """
  local_exists = os.path.isfile(local_path)
  if local_exists:
    result = pd.read_csv(local_path)
  elif query is not None:
    result = run_query(query, project_id=project_id)
  else:
    return None

  # Cache freshly queried results (before any transform is applied)
  if save_local and not local_exists:
    result.to_csv(local_path, index=False)

  if transform is not None:
    result = transform(result)

  return result
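
The caching behaviour of `lazy_run_query` can be tried out without BigQuery access. The following is a minimal sketch, assuming a hypothetical stand-in `fake_query` in place of `run_query`; `cached_frame` is a simplified illustration and not the helper defined above.

```python
import os
import tempfile

import pandas as pd


def cached_frame(local_path, compute=None):
    """Load a CSV if it exists, otherwise compute it and cache the result."""
    if os.path.isfile(local_path):
        return pd.read_csv(local_path)
    if compute is None:
        return None
    result = compute()
    result.to_csv(local_path, index=False)
    return result


calls = []


def fake_query():
    # Hypothetical stand-in for a BigQuery call; records each invocation.
    calls.append(1)
    return pd.DataFrame({"subject_id": [1, 2], "study_id": [10, 20]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    missing = cached_frame(path)             # no file and no query: None
    first = cached_frame(path, fake_query)   # runs the query and writes the CSV
    second = cached_frame(path, fake_query)  # served from the CSV, no second query
```

Note that a CSV round trip does not preserve datetime dtypes, so datetime columns loaded from the cache have to be converted again with `pd.to_datetime`.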

## Datasets
In this section we retrieve all the information that we need from MIMIC-CXR/IV and filter according to our cohort definition. We rely on `pandas` for all data manipulation steps as well as the execution of SQL queries on the BigQuery instance. Thanks to the predefined concepts in `mimic-derived`, in most cases we can simply query whole tables.


Our goal for this section is to bring forth a dataset consisting of:
- Patient demographics (age, gender)
- Complete blood count (CBC) specimen (at most 3 days old)
- Vital signs (VIT) such as oxygen saturation or temperature (at most 24 hours old)
- Radiology Report Indication Section
  
We argue that the chosen sources provide a good tradeoff between utility and availability with regard to our classification task.

### MIMIC-CXR and ICD-Codes

In this subsection we merge all CXR-related tables and explain the most relevant tables and attributes. The table `study_list` contains a list of all patients and studies with their respective reports (as relative paths). The table `record_list` is analogous to `study_list`, but contains the related X-ray images instead of reports (again as relative paths). Note that the same patient can have multiple images for the same study because of multiple view positions. Lastly, the `dicom_metadata_string` table contains meta information for each image. In particular, we are interested in the timing of the studies, which we use to select appropriate covariates in later steps, and in the view position of the image.

Each row then contains at least the following attributes:
- `subject_id` (identifier for patient)
- `study_id` (identifier for radiology study)
- `study_datetime` (date and time when image was taken)
- `report_path` (relative path of the report)
- `image_path` (relative path of the image)
- `view_position` (position from which the image was taken)

There is another MIMIC-CXR related table, namely `chexpert`, which contains 14 labels per study generated by the [chexpert-labeler](https://github.com/stanfordmlgroup/chexpert-labeler), an NLP tool that relies on negation detection and tries to retrieve lung disease and finding labels from study reports (for more information, check the link). Our task is to classify `Pneumonia`, but we want to make our classifier robust against ambiguities between similar visual features of different diseases. Therefore, we include all of the 14 labels available:
- Atelectasis
- Cardiomegaly
- Consolidation
- Edema
- Enlarged_Cardiomediastinum
- Fracture
- Lung_Lesion
- Lung_Opacity
- No_Finding
- Pleural_Effusion
- Pleural_Other
- Pneumonia
- Pneumothorax
- Support_Devices

Each label contains one of four values: 1.0, 0.0, −1.0, or NaN, which indicate positive, negative, uncertain, or missing observations, respectively.
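
To make the label encoding concrete, here is a small sketch that translates the numeric codes of a label column into readable status names. The toy values and the `Pneumonia_status` column are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Meaning of the numeric codes in the chexpert label columns.
LABEL_MEANING = {1.0: "positive", 0.0: "negative", -1.0: "uncertain"}

toy = pd.DataFrame({"Pneumonia": [1.0, 0.0, -1.0, np.nan]})  # made-up values

# NaN is not a key of the mapping, so it stays NaN and is filled afterwards.
toy["Pneumonia_status"] = toy["Pneumonia"].map(LABEL_MEANING).fillna("missing")
status_counts = toy["Pneumonia_status"].value_counts().to_dict()
```

Counting the statuses per label in this way is a quick check of how many uncertain and missing observations a label has before any substitution is applied.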

Lastly, we add the age (`anchor_age`) and `gender` of the respective patients to each row by joining the `patients` table from MIMIC-Core on `subject_id`.


An issue with the chexpert-labeler (and other NLP tools for label extraction) is the large number of uncertain and missing labels, which significantly reduces the amount of data we can use. Simply substituting those labels with negative or positive ones could lead to a substantial increase in false negative rates and corrupt or bias subsequent results.

As described in the paper, we try to fill uncertain or missing labels in our preprocessing by using the available ICD-9 and ICD-10 ontologies from the `diagnoses_icd` and `d_icd_diagnoses` tables in `mimic_hosp`. This approach is only possible for diseases that can be described with ICD codes, not for mere findings in the radiograph images. ICD codes can only be identified by the `subject_id` and `hadm_id` of the respective patients. We have no information about the time of a diagnosis, only a list of all diagnoses per hospital stay. Thus, we cannot use the ICD codes on their own to determine positive labels, and we are also limited in determining negative labels: we can only derive labels globally for the whole hospitalization, not for each study during that hospitalization.

Unfortunately, we don't have the `hadm_id` for the CXR studies. In fact, we cannot even assume that each study has a `hadm_id`, since not every study was performed during a hospital admission. However, through the `hadm_id` we can retrieve the admission and discharge times (`admittime` and `dischtime`) for each admission of each patient. Since we have the `study_datetime` for each study, we can use this information to check whether a study was conducted between those two times.

For each match we concatenate all the corresponding `long_title` values from `d_icd_diagnoses`, which serve as descriptions for the associated ontologies. Lastly, we search the descriptions for keywords indicating `Pneumonia`.
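
The time-window matching described above can be sketched in `pandas` on toy data. This is only an illustration under stated assumptions: all values are made up, the frame names are hypothetical, and the notebook itself performs this step in SQL.

```python
import pandas as pd

# Made-up studies and admissions for two patients.
studies = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "study_id": [100, 101, 200],
    "study_datetime": pd.to_datetime(
        ["2150-01-02 10:00", "2150-03-01 08:00", "2151-06-05 12:00"]),
})
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [9001, 9003, 9002],
    "admittime": pd.to_datetime(
        ["2150-01-01 00:00", "2150-02-01 00:00", "2151-06-01 00:00"]),
    "dischtime": pd.to_datetime(
        ["2150-01-05 00:00", "2150-02-03 00:00", "2151-06-10 00:00"]),
    "diagnoses_text": [
        "pneumonia, unspecified organism", "essential hypertension", "heart failure"],
})

# Pair every study with every admission of the same patient ...
pairs = studies.merge(admissions, on="subject_id", how="inner")
# ... and keep only the pairs where the study falls inside the stay.
inside = pairs["study_datetime"].between(pairs["admittime"], pairs["dischtime"])
hits = pairs.loc[inside, ["study_id", "diagnoses_text"]]

# Studies without a matching admission keep a missing diagnoses_text.
matched = studies.merge(hits, on="study_id", how="left")
matched["has_pneumonia_icd"] = matched["diagnoses_text"].str.contains(
    "pneumonia", na=False)
```

Study 101 lies outside both admissions of patient 1, so its `diagnoses_text` stays missing and no ICD-based statement can be made for it.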

If the disease is not included in the extracted list of diagnoses, we set the label to 0. In particular, we don't overwrite existing certain positive labels from the `chexpert-labeler`, since it has a higher specificity than ICD based diagnoses.

If the disease is included, we change the label to 1 only if other co-occurrences are given (see `handle_missing_labels` in `src/datasets/preprocessing.py`). The reason is that we don't know at which time the patients were diagnosed. At best, the ICD ontology increases the likelihood of the disease. This can still lead to many false positives in our labels.

The remaining missing labels are mapped to `0`. For the remaining uncertain labels we try both binary mappings, to `0` and to `1`. 
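
The rules above can be summarised in a short sketch. This is a simplified, hypothetical version and not the `handle_missing_labels` implementation: the function name, the choice of `Consolidation` and `Lung_Opacity` as co-occurring findings, and the decision to apply the ICD rule only when a diagnosis list exists are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fill_pneumonia_labels(df, uncertain_to=0.0,
                          support_cols=("Consolidation", "Lung_Opacity")):
    """Simplified, hypothetical version of the label-filling rules."""
    label = df["Pneumonia"].copy()
    has_icd_list = df["diagnoses_text"].notna()
    icd_hit = df["diagnoses_text"].str.contains("pneumonia", na=False)
    unresolved = label.isna() | (label == -1.0)

    # No ICD hit during the stay: uncertain/missing labels become negative.
    # Certain positives are never "unresolved", so they are left untouched.
    label.loc[unresolved & has_icd_list & ~icd_hit] = 0.0

    # ICD hit: only upgrade to positive if an assumed supporting finding co-occurs.
    support = (df[list(support_cols)] == 1.0).any(axis=1)
    label.loc[unresolved & icd_hit & support] = 1.0

    # Remaining missing labels become negative; remaining uncertain labels
    # get the chosen binary mapping.
    return label.fillna(0.0).replace(-1.0, uncertain_to)


toy = pd.DataFrame({  # made-up rows
    "Pneumonia": [np.nan, -1.0, -1.0, 1.0, np.nan, -1.0],
    "Consolidation": [np.nan, 1.0, 0.0, np.nan, np.nan, np.nan],
    "Lung_Opacity": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
    "diagnoses_text": ["heart failure", "pneumonia, unspecified organism",
                       "pneumonia, unspecified organism", "heart failure",
                       None, None],
})
as_negative = fill_pneumonia_labels(toy, uncertain_to=0.0)
as_positive = fill_pneumonia_labels(toy, uncertain_to=1.0)
```

The two calls correspond to the two binary mappings of the remaining uncertain labels; they differ only in the rows that neither rule could resolve.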

In [56]:
mimic_cxr = lazy_run_query(os.path.join(local_dir, "mimic_cxr.csv"), query=
""" 
    WITH patients AS
    (
        SELECT *
        FROM `physionet-data.mimic_core.patients`
    )
    , study_list AS
    (
        SELECT subject_id, study_id, path AS report_path
        FROM `physionet-data.mimic_cxr.study_list`
    )
    , dicom_meta AS
    (
        SELECT dicom AS dicom_id
        , PARSE_DATETIME('%Y%m%d %H%M%E*S', CONCAT(StudyDate, ' ', StudyTime)) AS study_datetime
        , ViewPosition AS view_position
        FROM `physionet-data.mimic_cxr.dicom_metadata_string`
    )
    , chexpert AS
    (
        SELECT subject_id, study_id
        , Atelectasis
        , Cardiomegaly
        , Consolidation
        , Edema
        , Enlarged_Cardiomediastinum
        , Fracture
        , Lung_Lesion
        , Lung_Opacity
        , No_Finding
        , Pleural_Effusion
        , Pleural_Other
        , Pneumonia
        , Pneumothorax
        , Support_Devices
        FROM `physionet-data.mimic_cxr.chexpert`
    ),
    diag_adm AS (
        WITH diagnoses AS (
            SELECT hadm_id
            , STRING_AGG(long_title ORDER BY long_title) AS diagnoses_text
            FROM `physionet-data.mimic_hosp.diagnoses_icd` AS diagnoses_icd
            JOIN `physionet-data.mimic_hosp.d_icd_diagnoses` AS d_icd_diagnoses 
            ON diagnoses_icd.icd_code = d_icd_diagnoses.icd_code
            GROUP BY hadm_id 
        ) 
        SELECT subject_id
        , admittime
        , dischtime
        , diagnoses_text
        FROM `physionet-data.mimic_core.admissions` AS admissions
        JOIN diagnoses ON diagnoses.hadm_id = admissions.hadm_id
    )
    SELECT record_list.subject_id
    , record_list.study_id
    , anchor_age
    , gender
    , study_datetime
    , report_path
    , record_list.dicom_id
    , path AS image_path
    , view_position
    , Atelectasis
    , Cardiomegaly
    , Consolidation
    , Edema
    , Enlarged_Cardiomediastinum
    , Fracture
    , Lung_Lesion
    , Lung_Opacity
    , No_Finding
    , Pleural_Effusion
    , Pleural_Other
    , Pneumonia
    , Pneumothorax
    , Support_Devices
    , LOWER(diagnoses_text) AS diagnoses_text
    FROM `physionet-data.mimic_cxr.record_list` record_list
    INNER JOIN patients ON record_list.subject_id = patients.subject_id
    INNER JOIN dicom_meta ON record_list.dicom_id = dicom_meta.dicom_id
    INNER JOIN study_list ON record_list.subject_id = study_list.subject_id AND record_list.study_id = study_list.study_id
    INNER JOIN chexpert ON record_list.subject_id = chexpert.subject_id AND record_list.study_id = chexpert.study_id
    LEFT OUTER JOIN diag_adm 
        ON record_list.subject_id = diag_adm.subject_id 
        AND dicom_meta.study_datetime BETWEEN diag_adm.admittime AND diag_adm.dischtime
    WHERE record_list.dicom_id IS NOT NULL
    ORDER BY study_id
    ;
""", save_local=True, transform=None)

Downloading: 100%|██████████| 376207/376207 [03:45<00:00, 1669.00rows/s]


### Clinical Covariates 
In this subsection we want to add relevant clinical covariates as an additional modality to support our deep learning model. In particular, we are interested in the latest complete blood count (CBC) test and the latest vital signs of the patient prior to the corresponding study date. Of course, we can't expect every patient to have this information available before the image was taken. However, considering the clinical workflow, it is very likely that such data is acquired before the imaging (at least for patients with symptoms of our chosen group of diseases). As our task is to investigate the impact of clinical information on image-based disease classification, we only consider those entities for our cohort that have a CBC not older than 3 days and vital signs not older than 24 hours before the study date. In particular, we add the following features to our dataset:

Complete Blood Count (CBC):
- `hematocrit` (volume percentage of red blood cells in blood)
- `hemoglobin` (protein that carries oxygen through the blood)
- `mch` (mean corpuscular hemoglobin, average mass of hemoglobin per red blood cell)
- `mchc` (mean corpuscular hemoglobin concentration, calculated by dividing the hemoglobin by the hematocrit)
- `mcv` (mean corpuscular volume, average volume of a red blood corpuscle)
- `platelet` (cell fragments that form clots and stop or prevent bleeding)
- `rbc` (red blood cells)
- `rdw` (red blood cell distribution width)
- `wbc` (white blood cells)

Vital signs:
- `heart_rate` 
- `dbp` (diastolic blood pressure)
- `sbp` (systolic blood pressure)
- `mbp` (mean blood pressure)
- `resp_rate` (respiration rate)
- `temperature` (body temperature)
- `spo2` (peripheral capillary oxygen saturation)

Fortunately, both abstractions are available in the `mimic_derived` module, so we simply select our chosen features.
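
The idea of "latest measurement before the study, within a time window" can be sketched with `pandas.merge_asof` on toy data. This is an illustration only, with made-up values; it is not how the notebook computes the covariates, which is done in SQL. Note that `tolerance` here is an elapsed-time window, whereas the queries in this section compare calendar dates via `DATE_DIFF`, so results can differ near day boundaries.

```python
import pandas as pd

studies = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "study_id": [100, 101, 200],
    "study_datetime": pd.to_datetime(
        ["2150-01-02 10:00", "2150-01-10 09:00", "2151-06-05 12:00"]),
})
vitals = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "charttime": pd.to_datetime(
        ["2150-01-01 20:00", "2150-01-02 09:30",
         "2150-01-02 11:00", "2151-06-03 12:00"]),
    "heart_rate": [88.0, 92.0, 95.0, 70.0],
})

# For each study, take the most recent vitals row of the same patient that was
# charted at or before the study and at most 24 hours earlier.
latest = pd.merge_asof(
    studies.sort_values("study_datetime"),
    vitals.sort_values("charttime"),
    left_on="study_datetime",
    right_on="charttime",
    by="subject_id",
    direction="backward",
    tolerance=pd.Timedelta(hours=24),
)
heart_rate_by_study = latest.set_index("study_id")["heart_rate"]
```

Study 100 picks up the measurement charted 30 minutes earlier, while studies 101 and 200 have no measurement inside the window and would be excluded from the cohort.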

In [87]:
vit_path = os.path.join(local_dir, "vit_raw.csv")
cbc_path = os.path.join(local_dir, "cbc_raw.csv")

vit_raw = lazy_run_query(
    vit_path,
    """ WITH max_times AS (
        WITH cxr AS
        (
            SELECT subject_id, study_id, dicom_id, study_datetime
            FROM `physionet-data.mimic_cxr.record_list`
            INNER JOIN (
                SELECT dicom, PARSE_DATETIME('%Y%m%d %H%M%E*S', CONCAT(StudyDate, ' ', StudyTime)) AS study_datetime
                FROM `physionet-data.mimic_cxr.dicom_metadata_string`
            )
            ON dicom_id = dicom
        )
        SELECT vit.subject_id
        , cxr.study_datetime
        , cxr.study_id
        , MAX(vit.charttime) AS vit_charttime
        FROM `physionet-data.mimic_derived.vitalsign` AS vit,
        cxr
        WHERE cxr.subject_id = vit.subject_id 
        AND study_datetime >= vit.charttime
        -- Note: DATE_DIFF on DATE values counts calendar-day boundaries, not elapsed hours
        AND DATE_DIFF(CAST(study_datetime AS DATE), CAST(vit.charttime AS DATE), DAY) <= 1
        AND resp_rate IS NOT NULL
        AND heart_rate IS NOT NULL
        AND temperature IS NOT NULL
        AND spo2 IS NOT NULL
        GROUP BY vit.subject_id
        , cxr.study_datetime
        , cxr.study_id
    )
    SELECT vit.subject_id
    , max_times.study_datetime
    , max_times.study_id
    , vit.charttime AS vit_charttime
    , heart_rate
    , sbp
    , dbp
    , mbp
    , resp_rate
    , temperature
    , spo2
    FROM `physionet-data.mimic_derived.vitalsign` AS vit
    JOIN max_times
    ON max_times.subject_id = vit.subject_id
    AND max_times.vit_charttime = vit.charttime
""",
    save_local=True,
)

cbc_raw = lazy_run_query(
    cbc_path,
    """ WITH max_times AS (
        WITH cxr AS
        (
            SELECT subject_id, study_id, dicom_id, study_datetime
            FROM `physionet-data.mimic_cxr.record_list`
            INNER JOIN (
                SELECT dicom, PARSE_DATETIME('%Y%m%d %H%M%E*S', CONCAT(StudyDate, ' ', StudyTime)) AS study_datetime
                FROM `physionet-data.mimic_cxr.dicom_metadata_string`
            )
            ON dicom_id = dicom
        )
        SELECT cbc.subject_id
        , cxr.study_datetime
        , cxr.study_id
        , MAX(cbc.charttime) AS cbc_charttime
        FROM `physionet-data.mimic_derived.complete_blood_count` AS cbc, 
        cxr
        WHERE cxr.subject_id = cbc.subject_id
        AND study_datetime >= cbc.charttime
        -- Note: DATE_DIFF on DATE values counts calendar-day boundaries, not elapsed hours
        AND DATE_DIFF(CAST(study_datetime AS DATE), CAST(cbc.charttime AS DATE), DAY) <= 3
        AND hematocrit IS NOT NULL
        AND hemoglobin IS NOT NULL
        AND mch IS NOT NULL
        AND mchc IS NOT NULL
        AND mcv IS NOT NULL
        AND platelet IS NOT NULL
        AND rbc IS NOT NULL
        AND rdw IS NOT NULL
        AND wbc IS NOT NULL
        GROUP BY cbc.subject_id
        , cxr.study_datetime
        , cxr.study_id
    )
    SELECT cbc.subject_id
    , max_times.study_datetime
    , max_times.study_id
    , cbc.charttime AS cbc_charttime
    , hematocrit
    , hemoglobin
    , mch
    , mchc
    , mcv
    , platelet
    , rbc
    , rdw
    , wbc
    FROM `physionet-data.mimic_derived.complete_blood_count` AS cbc
    JOIN max_times
    ON max_times.subject_id = cbc.subject_id
    AND max_times.cbc_charttime = cbc.charttime
""",
    save_local=True,
    transform=lambda df: df.dropna(),
)


Downloading: 100%|██████████| 55890/55890 [00:11<00:00, 4772.95rows/s]
Downloading: 100%|██████████| 183496/183496 [00:40<00:00, 4533.01rows/s]


### Clinical Notes (Radiology Report Indication Sections)
In this subsection we add free text that was produced by the clinicians in the context of the study. This may include physical exams, clinical notes, study reports etc. 

Currently, we only have the radiology report available. This report is written by the radiologist *after* examining the X-Ray image of the patients. From the perspective of our classification task it doesn't make sense to include information generated by the radiologist, since this information is not available at the time we see the X-Ray image. 

The only exception is the indication section of a radiology report, which is written by the doctor who ordered the X-ray for the patient. To extract the indication sections, we first download all reports from PhysioNet. We then use a script from the [MIMIC Code Repository](https://github.com/MIT-LCP/mimic-code) to extract all indication sections. Please note that we use a [fork of the repository](https://github.com/mohkoh19/mimic-cxr), in which the `create_section_files.py` script has been modified to include the indication sections alongside the impression and findings sections.
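
To illustrate what extracting a section means, here is a minimal sketch using a regular expression. It is not the logic of `create_section_files.py`: the function `extract_section`, the assumed heading layout (capitalised section name followed by a colon at the start of a line), and the report text are all made up for this example.

```python
import re

# Assumed layout: section names in capitals followed by a colon, as in the
# made-up report below. Real reports vary more than this pattern covers.
SECTION_RE = re.compile(r"^\s*([A-Z][A-Z ]+):", re.MULTILINE)


def extract_section(report, name="INDICATION"):
    """Return the text of one section, or None if the section is absent."""
    matches = list(SECTION_RE.finditer(report))
    for i, match in enumerate(matches):
        if match.group(1).strip() == name:
            # The section ends where the next heading starts (or at the end).
            end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
            return " ".join(report[match.end():end].split())
    return None


report = """                 FINAL REPORT
 EXAMINATION:  CHEST (PA AND LAT)

 INDICATION:  ___ year old woman with cough and fever.
 Evaluate for pneumonia.

 FINDINGS:  The lungs are clear.

 IMPRESSION:  No acute cardiopulmonary process.
"""
indication = extract_section(report)
no_history = extract_section(report, "HISTORY")
```

Only the text between the `INDICATION:` heading and the next heading is kept, so nothing written by the radiologist after viewing the image leaks into the notes.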

In [70]:
# Build the URL by concatenation; os.path.join would insert backslashes on Windows
reports_file = base_url + "mimic-cxr/2.0.0/mimic-cxr-reports.zip"

In [17]:
!wget -q --user {user} --password {password} {reports_file} -P {local_dir}
!unzip -q {os.path.join(local_dir, "mimic-cxr-reports.zip")} -d {local_dir} && rm {os.path.join(local_dir, "mimic-cxr-reports.zip")}

--2022-12-20 16:06:31--  https://physionet.org/files/mimic-cxr/2.0.0/mimic-cxr-reports.zip
Resolving physionet.org (physionet.org)... 18.18.42.54
Connecting to physionet.org (physionet.org)|18.18.42.54|:443... connected.
HTTP request sent, awaiting response... 401 Unauthorized
Authentication selected: Basic realm="PhysioNet", charset="UTF-8"
Reusing existing connection to physionet.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 141942511 (135M) [application/zip]
Saving to: ‘/home/mohammad/Projects/master-thesis/sandbox/mimic-cxr-reports.zip’


2022-12-20 16:07:02 (4,49 MB/s) - ‘/home/mohammad/Projects/master-thesis/sandbox/mimic-cxr-reports.zip’ saved [141942511/141942511]



In [21]:
!cd {local_dir} && git clone https://github.com/mohkoh19/mimic-cxr.git
!cd {local_dir} && python mimic-cxr/txt/create_section_files.py --reports_path ./files --output_path . --no_split

fatal: destination path 'mimic-cxr' already exists and is not an empty directory.
p10
100%|█████████████████████████████████████| 6397/6397 [00:01<00:00, 3742.07it/s]
p11
100%|█████████████████████████████████████| 6571/6571 [00:01<00:00, 3935.57it/s]
p12
100%|█████████████████████████████████████| 6528/6528 [00:01<00:00, 4200.50it/s]
p13
100%|█████████████████████████████████████| 6550/6550 [00:01<00:00, 4049.32it/s]
p14
100%|█████████████████████████████████████| 6507/6507 [00:01<00:00, 4165.24it/s]
p15
100%|█████████████████████████████████████| 6593/6593 [00:01<00:00, 3880.04it/s]
p16
100%|█████████████████████████████████████| 6476/6476 [00:01<00:00, 4260.46it/s]
p17
100%|█████████████████████████████████████| 6644/6644 [00:01<00:00, 3945.72it/s]
p18
100%|█████████████████████████████████████| 6543/6543 [00:01<00:00, 4143.89it/s]
p19
100%|█████████████████████████████████████| 6579/6579 [00:01<00:00, 3908.03it/s]


In [22]:
clino_raw = pd.read_csv(
    os.path.join(local_dir, "mimic_cxr_sectioned.csv")
)[["study", "indication"]]
clino_raw.rename(columns={"study": "study_id", "indication": "notes"}, inplace=True)
clino_raw["study_id"] = clino_raw["study_id"].str[1:].astype(int)
clino_raw = clino_raw.dropna().reset_index(drop=True)


After extracting the notes and applying some basic preprocessing, we use a [tokenizer provided by the authors of BioBERT](https://huggingface.co/dmis-lab/biobert-v1.1) on Hugging Face to tokenize the notes and save them as tokens instead of strings. This will save us a lot of runtime during training and inference.

In [24]:
clino = clino_raw.copy()

# Transform notes to single lower-case sentence, remove anonymised info
notes = clino["notes"]
notes = notes.str.lower()
notes.replace(r"\n", " ", regex=True, inplace=True)
notes.replace(r"[^\w\s]", "", regex=True, inplace=True)
notes.replace(r"___", " ", regex=True, inplace=True)
# The look-behind and \b prevent removing "yo" inside words such as "myocardial"
notes.replace(r" *(?<![a-z])(year old|y o|yo|yearold)\b", "", regex=True, inplace=True)
clino["notes"] = notes

# Tokenize all notes and attach to dataframe
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
tokens = tokenizer(notes.to_list(), padding="max_length", truncation=True)
df_tokens = pd.DataFrame.from_dict(tokens, orient="index").T
clino = clino.join(df_tokens)

In [25]:
clino.to_pickle(os.path.join(local_dir, 'clino.pkl'))
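
The effect of the note normalisation can be checked on single strings. The following sketch mirrors the steps of the cell above with the standard `re` module; the function `normalize_note` and the example notes are made up, and the look-behind plus word boundary in the age-phrase pattern is an assumption of this sketch.

```python
import re


def normalize_note(text):
    """Lower-case a note and strip punctuation, placeholders and age phrases."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"[^\w\s]", "", text)  # punctuation (underscores count as \w)
    text = re.sub(r"_+", " ", text)      # de-identification placeholders
    # Age phrases; the look-behind and \b are an assumption of this sketch so
    # that "yo" inside words such as "myocardial" is left alone.
    text = re.sub(r"(?<![a-z])(year old|yearold|y o|yo)\b", " ", text)
    return " ".join(text.split())        # collapse repeated whitespace


cleaned = normalize_note("___ year old woman with cough.\nR/O myocardial infarction")
short = normalize_note("75yo M with fever")
```

Without the guard around the age phrases, the two-letter pattern `yo` would also be removed from clinical terms, which would change the tokens the model sees.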

### Merge Datasets

We merge all datasets to a final multivariate dataset that can be used for all of our models, namely `mimic_cxr_mv.pkl`.

In [74]:
mimic_cxr = pd.read_csv(os.path.join(local_dir, "mimic_cxr.csv"))
clino = pd.read_pickle(os.path.join(local_dir, "clino.pkl"))
cbc_raw = pd.read_csv(os.path.join(local_dir, "cbc_raw.csv"))
vit_raw = pd.read_csv(os.path.join(local_dir, "vit_raw.csv"))

In [27]:
# In case we read from CSV, the datetime type is not preserved; run this cell to match all datetime types
mimic_cxr["study_datetime"] = pd.to_datetime(mimic_cxr["study_datetime"])
cbc_raw["study_datetime"] = pd.to_datetime(cbc_raw["study_datetime"])
vit_raw["study_datetime"] = pd.to_datetime(vit_raw["study_datetime"])

In [28]:
mimic_cxr_mv = pd.merge(mimic_cxr, cbc_raw, on=['study_id', 'subject_id', 'study_datetime'], how="left")
mimic_cxr_mv = pd.merge(mimic_cxr_mv, vit_raw, on=['study_id', 'subject_id', 'study_datetime'], how="left")
mimic_cxr_mv = pd.merge(mimic_cxr_mv, clino, on=['study_id'], how="left")

In [29]:
mimic_cxr_mv.to_pickle(os.path.join(local_dir, "mimic_cxr_mv.pkl"))
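
Because all three merges are left joins, rows without covariates or notes remain in the merged frame with missing values. The sketch below, on made-up toy frames, shows one way to inspect this using the `indicator` option of `merge` and to select rows where every modality is present; the frame and column names are illustrative assumptions.

```python
import pandas as pd

cxr = pd.DataFrame({"subject_id": [1, 1, 2], "study_id": [100, 101, 200]})
cbc = pd.DataFrame({"subject_id": [1, 2], "study_id": [100, 200], "wbc": [7.1, 12.4]})
notes = pd.DataFrame({"study_id": [100, 101], "notes": ["cough", "fever"]})

# The indicator column records whether a CBC row was found for each study.
merged = cxr.merge(cbc, on=["subject_id", "study_id"], how="left",
                   indicator="cbc_source")
merged = merged.merge(notes, on="study_id", how="left")

# Rows for which every modality is available.
complete = merged.dropna(subset=["wbc", "notes"])
```

In this toy example only study 100 has both a CBC and an indication note, so it is the only row left in `complete`.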

### Download and Downscale CXRs (JPEG)
The `mimic_cxr_mv.pkl` dataset does not include the CXR images, only relative file paths. It is therefore expected that the local file structure mirrors the remote file structure on PhysioNet. The cell below downloads CXR images in JPEG format from MIMIC-CXR-JPG, downscales them to the given resolution, and saves them at the correct location in the `files/` folder that was extracted earlier for the radiology reports.

In [95]:
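def load_and_scale(rel_path):
    """Downloads one CXR image and overwrites it with a downscaled copy."""
    # Build the URL by concatenation; os.path.join would insert backslashes on Windows
    remote_path = base_url + "mimic-cxr-jpg/2.0.0/" + rel_path
    local_path = os.path.join(local_dir, rel_path)
    !wget -q --user {user} --password {password} {remote_path} -P {os.path.dirname(local_path)}
    image = Image.open(local_path)
    image = image.resize((image_resolution, image_resolution), Image.LANCZOS)
    image.save(local_path)

with Pool(cpu_count()) as p:
    # regex=False replaces the literal ".dcm" suffix instead of treating "." as a wildcard
    # Note: [:100] restricts the download to the first 100 images; remove it to fetch all
    image_list = mimic_cxr.image_path.str.replace(".dcm", ".jpg", regex=False).to_list()[:100]
    list(tqdm(p.imap(load_and_scale, image_list), total=len(image_list)))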

100%|██████████| 100/100 [00:33<00:00,  2.95it/s]
