# Functionalities of `medbench.aml`

This notebook shows how to use the `medbench.aml` module and what it can do.

## Dependencies

In [1]:
import os
import sys

sys.path.append(os.path.join(os.getcwd(), '..'))


In [2]:
%load_ext autoreload
%autoreload 2

from medbench.aml import AzureML

## Authentication

We authenticate with `azure.identity.DefaultAzureCredential`, which tries several credential types in order, including:

- EnvironmentCredential
- ManagedIdentityCredential
- SharedTokenCacheCredential
- AzureCliCredential

To authenticate using the fallback method (AzureCliCredential), run `az login`.
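
The "try each credential in order" behaviour can be pictured with a small toy sketch. This is not how `azure.identity` is implemented; `first_available_token` and the lambda sources are hypothetical names used only for illustration, and only the final source (standing in for a session created by `az login`) yields a token.

```python
from typing import Callable, List, Optional, Tuple

# A token source returns a token string, or None when it cannot authenticate.
TokenSource = Callable[[], Optional[str]]


def first_available_token(sources: List[Tuple[str, TokenSource]]) -> Tuple[str, str]:
    """Return (source_name, token) from the first source that yields a token."""
    tried = []
    for name, get_token in sources:
        token = get_token()
        if token is not None:
            return name, token
        tried.append(name)
    raise RuntimeError(f"No credential source succeeded; tried: {', '.join(tried)}")


# Toy chain (hypothetical): every source fails except the last one.
chain = [
    ("EnvironmentCredential", lambda: None),
    ("ManagedIdentityCredential", lambda: None),
    ("SharedTokenCacheCredential", lambda: None),
    ("AzureCliCredential", lambda: "token-from-az-login"),
]

source_name, token = first_available_token(chain)
print(source_name)
```

In this toy chain the earlier sources return nothing, so the lookup falls through to the CLI entry, which mirrors why running `az login` is enough on a development machine.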

## Connect to specific registries

In [4]:
aml = AzureML.connect_to_registry(registry_name="azureml-1p")

In [8]:
# Print registry names
for registry in aml.registries:
    print(f"Registry Name: {registry.name}")

Registry Name: azureml-1p


## Retrieving datasets

### Loading data assets

In [9]:
headqa_data_asset = aml.get_dataset(name="HeadQA", version="latest")
headqa_data_asset

Data({'path': 'https://azml1p5efskuse01.blob.core.windows.net/azureml-1p-edc7e889-1df9-5873-8d5d-4e7928a2397e/UI/2024-10-10_172215_UTC', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'HeadQA', 'description': 'HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio de Sanidad, Consumo y Bienestar Social, who also provides direct access to the exams of the last 5 years (in Spanish).', 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': 'azureml://registries/azureml-1p/data/HeadQA/versions/1', 'Resource__source_path': '', 'base_path': '/home/lschettini/dev/medbench/medbench-data/notebooks', 'creation_context': <azure.ai.ml.entities._system_data.SystemData obje

### Loading .jsonl datasets as pandas DataFrames

If there's a **single .jsonl** file in the folder, `aml.get_dataset` can infer which file to load automatically.

In [10]:
headqa_df = aml.get_dataset(name="HeadQA", version="latest", read_folder_jsonl=True)
headqa_df.head()

2024-10-16 11:22:42.672 | DEBUG    | medbench.aml:get_dataset:130 - Downloading HEAD_EN.jsonl to temporary directory `/tmp/tmp8o63bnr3`...
2024-10-16 11:22:43.299 | INFO     | medbench.aml:get_dataset:135 - Reading HEAD_EN.jsonl as a pandas DataFrame.


| | question | exp | cop | opa | opb | opc | opd | subject_name | topic_name | id | choice_type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Which of the following is not true for myelina... | | 1 | Impulse through myelinated fibers is slower th... | Membrane currents are generated at nodes of Ra... | Saltatory conduction of impulses is seen | Local anesthesia is effective only when the ne... | Physiology | | 45258d3d-b974-44dd-a161-c3fccbdadd88 | multi |
| 1 | Which of the following is not true about glome... | Ans-a. The oncotic pressure of the fluid leavi... | 1 | The oncotic pressure of the fluid leaving the ... | Glucose concentration in the capillaries is th... | Constriction of afferent aeriole decreases the... | Hematocrit of the fluid leaving the capillarie... | Physiology | | b944ada9-d776-4c2a-9180-3ae5f393f72d | multi |
| 2 | A 29 yrs old woman with a pregnancy of 17 week... | | 3 | No test is required now as her age is below 35... | Ultra sound at this point of time will definit... | Amniotic fluid samples plus chromosomal analys... | blood screening at this point of time will cle... | Medicine | | b64a9cd7-d076-4c55-8be1-f9c44fece6cc | single |
| 3 | Axonal transport is: | Fast anterograde (400 mm/day) transport occurs... | 3 | Antegrade | Retrograde | Antegrade and retrograde | | Physiology | | c6365cce-507c-40f6-90a2-46b867f47b6e | multi |
| 4 | Low insulin to glucagon ratio is seen in all o... | Answer- A. Glycogen synthesisLow insulin to gl... | 1 | Glycogen synthesis | Glycogen breakdown | Gluconeogenesis | Ketogenesis | Biochemistry | | 72c1c5e0-b64f-4eef-bf22-ecfb60c5c19c | multi |


If the folder contains multiple `.jsonl` files, specify the target file with `target_jsonl`; otherwise `aml.get_dataset` returns the data asset object instead of a DataFrame.

In [11]:
medqa_dev_df = aml.get_dataset(
    name="medqa", version="latest", read_folder_jsonl=True, target_jsonl="dev.jsonl"
)
medqa_dev_df.head()

2024-10-16 11:22:55.163 | DEBUG    | medbench.aml:get_dataset:130 - Downloading dev.jsonl to temporary directory `/tmp/tmpu3es902l`...
2024-10-16 11:23:02.767 | INFO     | medbench.aml:get_dataset:135 - Reading dev.jsonl as a pandas DataFrame.


| | question | answer | options | meta_info | answer_idx | metamap_phrases |
|---|---|---|---|---|---|---|
| 0 | A 21-year-old sexually active male complains o... | Ceftriaxone | (A) Gentamicin (B) Ciprofloxacin (C) Ceftriaxo... | step1 | C | [21-year-old sexually active male, fever, pain... |
| 1 | A 5-year-old girl is brought to the emergency ... | Cyclic vomiting syndrome | (A) Cyclic vomiting syndrome (B) Gastroenterit... | step2&3 | A | [5 year old girl, brought, emergency departmen... |
| 2 | A 40-year-old woman presents with difficulty f... | Trazodone | (A) Diazepam (B) Paroxetine (C) Zolpidem (D) T... | step1 | D | [40 year old woman presents, difficulty fallin... |
| 3 | A 37-year-old female with a history of type II... | Obtain a urine analysis and urine culture | (A) Obtain an abdominal CT scan (B) Obtain a u... | step2&3 | B | [year old female, history of type II diabetes ... |
| 4 | A 19-year-old boy presents with confusion and ... | Hypoperfusion | (A) Hypoperfusion (B) Hyperglycemia (C) Metabo... | step1 | A | [year old boy presents, confusion, speak, pati... |


In [12]:
medqa_data_asset = aml.get_dataset(
    name="medqa", version="latest", read_folder_jsonl=True
)
medqa_data_asset



Data({'path': 'https://azml1p5efskuse01.blob.core.windows.net/azureml-1p-f020e80f-90fe-520a-ba61-d6d77cc1bdcf/UI/2023-09-25_205048_UTC/medqa', 'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_folder', 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'medqa', 'description': None, 'tags': {}, 'properties': {}, 'print_as_yaml': False, 'id': 'azureml://registries/azureml-1p/data/medqa/versions/1', 'Resource__source_path': '', 'base_path': '/home/lschettini/dev/medbench/medbench-data/notebooks', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f16088af290>, 'serialize': <msrest.serialization.Serializer object at 0x7f15c93ea410>, 'version': '1', 'latest_version': None, 'datastore': None})
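
The file-selection behaviour seen in the cells above can be summarised with a short sketch. This is an assumption about the logic, not the actual `medbench.aml` implementation: `choose_jsonl` is a hypothetical helper, the file names `train.jsonl` and `test.jsonl` are made up, and returning `None` for a missing target is a guess at one reasonable behaviour.

```python
from typing import List, Optional


def choose_jsonl(file_names: List[str], target_jsonl: Optional[str] = None) -> Optional[str]:
    """Return the .jsonl file to load, or None when no single file can be chosen."""
    jsonl_files = [name for name in file_names if name.endswith(".jsonl")]
    if target_jsonl is not None:
        # An explicit target wins, provided it exists in the folder (assumed behaviour).
        return target_jsonl if target_jsonl in jsonl_files else None
    if len(jsonl_files) == 1:
        # A single .jsonl file is unambiguous, so it can be inferred.
        return jsonl_files[0]
    # Zero or several candidates: the caller falls back to the data asset object.
    return None


single = choose_jsonl(["HEAD_EN.jsonl"])
explicit = choose_jsonl(["train.jsonl", "dev.jsonl", "test.jsonl"], target_jsonl="dev.jsonl")
ambiguous = choose_jsonl(["train.jsonl", "dev.jsonl", "test.jsonl"])
print(single, explicit, ambiguous)
```

Here `None` plays the role of "return the data asset instead of a DataFrame", which matches what the last cell shows for `medqa` when no target file is given.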

## Registering datasets

This section shows how to register datasets directly from HuggingFace and from local files.

### Dependencies

In [None]:
import tempfile

from datasets import load_dataset, Dataset



### Registering HuggingFace datasets (or local folders)

Helper function:

In [None]:
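from datasets import DatasetDict


def huggingface_dataset_to_jsonl(dataset: DatasetDict, output_folder: str) -> None:
    """Write each split of `dataset` to `<output_folder>/<split>.jsonl`."""
    for split in dataset.keys():
        dataset[split].to_pandas().to_json(
            os.path.join(output_folder, f"{split}.jsonl"), orient="records", lines=True
        )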
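
To make the resulting folder layout concrete, here is a standard-library sketch of the same idea: one `.jsonl` file per split, with one JSON object per line. The `splits` dictionary and its rows are made-up stand-ins for a real HuggingFace dataset, and the sketch uses `json` rather than pandas, so it only approximates what the helper writes.

```python
import json
import os
import tempfile

# Toy stand-in for a dataset with named splits; the rows are made up.
splits = {
    "test": [{"question": "Q1", "answer": 0}, {"question": "Q2", "answer": 3}],
    "dev": [{"question": "Q3", "answer": 1}],
}

with tempfile.TemporaryDirectory() as folder:
    # Write one file per split, one JSON object per line.
    for split, rows in splits.items():
        path = os.path.join(folder, f"{split}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    # Inspect the folder and read one split back before the directory is deleted.
    written_files = sorted(os.listdir(folder))
    with open(os.path.join(folder, "test.jsonl"), encoding="utf-8") as f:
        test_rows = [json.loads(line) for line in f]

print(written_files)
print(test_rows)
```

A folder shaped like this, with one file per split, is what gets passed as `folder_path` when registering the dataset.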

For this example, let's register an MMLU subset.

In [None]:
mmlu_dataset = "cais/mmlu"
mmlu_subset = "anatomy"

# The name has a suffix so we don't create an unnecessary new version of the dataset.
dataset_name = f"mmlu_{mmlu_subset}_test"

In [None]:
ds = load_dataset(mmlu_dataset, mmlu_subset)

with tempfile.TemporaryDirectory() as temp_dir_name:
    # Convert Hugging Face dataset to JSONL
    huggingface_dataset_to_jsonl(ds, temp_dir_name)

    # Register the dataset in AzureML
    aml.register_folder_as_dataset(
        folder_path=temp_dir_name,
        dataset_name=dataset_name,
        dataset_description=(
            f"MMLU {mmlu_subset.title()} dataset.\n\n"
            "This MMLU subset consists of multiple-choice questions with 4 answer options and is designed to evaluate a model's understanding of specific medical and biological domains.\n\n"
            "Source: https://huggingface.co/datasets/cais/mmlu\n"
            "Source version: c30699e8356da336a370243923dbaf21066bb9fe (commit from 20240308)"
        )
    )