## Environment Configuration
- **Purpose**: Configure GPU settings for CUDA and Triton compatibility.  
- **Key Actions**:  
  - `CUDA_VISIBLE_DEVICES="0"`: Restricts the script to use only the first GPU.  
  - `CUDA_DEVICE_ORDER="PCI_BUS_ID"`: Ensures GPUs are ordered by PCI bus ID (useful for multi-GPU setups).  
  - `TRITON_CAPABILITY="75"`: Sets the Triton compute capability to 7.5, which is the compute capability of NVIDIA Turing GPUs such as the T4.


In [1]:
import os
os.environ["TRITON_CAPABILITY"] = "75"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

## Dependency Management
- **Steps**:  
  1. Uninstalls existing `numpy` (1.26.4) to avoid conflicts.  
  2. Installs `numpy==1.24.3`, downgrading for compatibility with `autoawq`.  
  3. Installs `autoawq[kernels]` (quantization library) and updates `transformers`/`accelerate`.

In [2]:
!pip uninstall numpy -y
!pip install --no-cache-dir numpy==1.24.3
!pip install autoawq[kernels]
!pip install --upgrade transformers accelerate

Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Collecting numpy==1.24.3
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 239.1 MB/s eta 0:00:00
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.19 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
albumentations 1.4.20 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
bayesian-optimization 2.0.3 requires numpy>=1.25, but you have numpy 1.24.3 which is incompatible.
featuretools 1.31.0 requires n
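
Because pinning `numpy==1.24.3` triggers resolver conflicts like the ones above, it can help to confirm which versions actually ended up installed before continuing. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages touched by the install cell above.
for package in ("numpy", "autoawq", "transformers", "accelerate"):
    print(f"{package}: {installed_version(package)}")
```

If `numpy` does not report `1.24.3` here, restarting the kernel after the install cell is usually the first thing to try.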

## Load the data from the previous notebook

In [3]:
import os
import numpy as np
import pandas as pd
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

data = pd.read_csv("/kaggle/input/out1.csv", index_col=0)
data.head()

/kaggle/input/out1.csv
/kaggle/input/tiers.csv
/kaggle/input/__results__.html
/kaggle/input/__notebook__.ipynb
/kaggle/input/__output__.json
/kaggle/input/custom.css
/kaggle/input/__results___files/__results___14_0.png
/kaggle/input/__results___files/__results___10_0.png
/kaggle/input/__results___files/__results___12_0.png
/kaggle/input/__results___files/__results___16_0.png


| | name_of_drug | drug_tier | requirement_limits |
|---|---|---|---|
| 0 | ANTI-INFECTIVE AGENTS | ANTI-INFECTIVE AGENTS | ANTI-INFECTIVE AGENTS |
| 1 | ANTHELMINTICS | ANTHELMINTICS | ANTHELMINTICS |
| 2 | albendazole | 1 | |
| 3 | ivermectin | 1 | |
| 4 | ANTIBACTERIALS | ANTIBACTERIALS | ANTIBACTERIALS |


#### Data Filtering
- **Logic**:  
  - `idx_nonheaders`: Boolean mask that is `True` for real drug rows. Rows that repeat one value in every column are section headers and get `False`.  
  - `idx_limited`: Boolean mask that is `True` for rows with a non-null `requirement_limits`.
---
- `data_alternatives` contains the unrestricted drugs (which are potential alternatives)
- `data` now contains only the restricted drugs
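
The two masks can be tried on a toy frame first. This is a sketch with three made-up rows modeled on the preview above, not the real data. Note that `nunique` ignores missing values by default, so a drug row with an empty `requirement_limits` still counts two distinct values and is kept:

```python
import pandas as pd

# Toy rows: one section header, one unrestricted drug, one restricted drug.
toy = pd.DataFrame({
    "name_of_drug": ["ANTHELMINTICS", "albendazole", "IMBRUVICA"],
    "drug_tier": ["ANTHELMINTICS", "1", "4"],
    "requirement_limits": ["ANTHELMINTICS", None, "QL"],
})

is_data_row = toy.nunique(axis=1) != 1          # False for the header row
has_limits = toy["requirement_limits"].notna()  # True when a limit is listed

unrestricted = toy[is_data_row & ~has_limits]
restricted = toy[is_data_row & has_limits].reset_index(drop=True)

print(unrestricted["name_of_drug"].tolist())
print(restricted["name_of_drug"].tolist())
```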

In [4]:
idx_nonheaders = data.nunique(axis=1) != 1
idx_limited = data["requirement_limits"].notna()
data_alternatives = data[idx_nonheaders & (~idx_limited)]
data = data[idx_nonheaders & idx_limited].reset_index(drop=True)
data.sample(10)

| | name_of_drug | drug_tier | requirement_limits |
|---|---|---|---|
| 37 | IMBRUVICA | 4 | QL |
| 139 | VIMIZIM | 4 | QL |
| 30 | fulvestrant | 4 | QL |
| 117 | buprenorphine hcl- naloxone hcl dihydrate | 1 | QL |
| 92 | dihydroergotamine mesylate | 1, 4 | QL |
| 44 | KISQALI (200 MG DOSE) | 4 | QL |
| 138 | STRENSIQ | 4 | QL |
| 46 | LENVIMA (10 MG DAILY DOSE) | 4 | QL |
| 132 | zolpidem tartrate | 1 | QL |
| 80 | TYKERB | 2 | QL |


## LLM Initialization
- **Model**: Loads `BioMistral-7B-AWQ`, a quantized medical LLM optimized for GPU inference.
    - It is based on the well-known `Mistral-7B` LLM (popular among Kagglers in AI competitions).
    - `Mistral AI` is a French AI startup headquartered in Paris, founded by ex-Meta and ex-Google employees.
    - `BioMistral-7B` is a version of `Mistral-7B` adapted to the biomedical domain
        - further pre-trained on biomedical text (PubMed Central)
    - `BioMistral-7B-AWQ` is a "quantized" version of `BioMistral-7B`
        - the checkpoint used here stores weights in 4 bits (the `W4` in its name), which cuts GPU memory use at some cost in accuracy
        - I'm using it because this is just a demo
- **Key Steps**:  
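  - `AutoTokenizer`/`AutoModelForCausalLM`: Load the tokenizer and model.  
  - `model.generation_config.pad_token_id = tokenizer.pad_token_id`: Copies the tokenizer's padding token ID into the generation config.  
  - `model.resize_token_embeddings(len(tokenizer))`: Keeps the embedding matrix in sync with the tokenizer's vocabulary size.  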
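
To see why a quantized checkpoint suits a single-GPU demo, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It assumes roughly 7 billion parameters (the "7B" in the model name taken literally) and ignores quantization scales, activations, and the KV cache, so real usage will be higher:

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7e9  # assumption: "7B" taken literally

fp16_gib = weight_memory_gib(N_PARAMS, 16)  # half-precision baseline
int4_gib = weight_memory_gib(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f" 4-bit weights: ~{int4_gib:.1f} GiB")
```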

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name = "BioMistral/BioMistral-7B-AWQ-QGS128-W4-GEMV"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.generation_config.pad_token_id = tokenizer.pad_token_id
model.resize_token_embeddings(len(tokenizer))

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now default to True since model is quantized.

Embedding(32000, 4096)

## LLM Pipeline Setup

In [6]:
hf_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=200,  # Reduce output length to avoid GPU memory issues
)

Device set to use cuda:0


## Alternative Drug Preprocessing
- **Action**: Extracts lowercase drug names from `data_alternatives` into a set (`readily_available_alternatives`).  
- **Purpose**: Enables fast lookup for viable alternatives during enrichment.

In [7]:
readily_available_alternatives = set(data_alternatives.name_of_drug.str.lower().tolist())

## Data Transformation & Enrichment
- **Template**: Defines rules to map raw data to structured JSON (e.g., `drug_tier` → `formulary_tier`).  
- **Functions**:  
  - `viable_alternatives()`: Queries the model for alternatives (e.g., "alternatives to ivermectin") and filters against `readily_available_alternatives`.  
  - `normalize_and_enrich()`: Executes the pipeline to generate JSON and appends alternatives.  
- **Test Case**: Processes `ivermectin`, returning enriched fields like `therapeutic_category: antiparasitic_agents` and `viable_alternatives: [albendazole]`.  
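
The parsing and filtering inside `viable_alternatives()` can be tried without loading the model. The sketch below assumes the completion echoes the prompt and then lists names separated by commas; `parse_alternatives`, the sample completion, and the `available` set are illustrative and not taken from a real run:

```python
def parse_alternatives(generated_text, available):
    """Extract comma-separated names after the prompt and keep the available ones."""
    # Everything after the last "):" is the model's continuation of the prompt.
    tail = generated_text.lower().rstrip(".").split("):")[-1]
    names = [name.strip() for name in tail.split(",")]
    # Drop a leading "and " from the final item of an "a, b, and c" list.
    names = [name[4:] if name.startswith("and ") else name for name in names]
    return sorted(available & set(names))


available = {"albendazole", "praziquantel"}  # hypothetical unrestricted drugs
completion = (
    "The alternatives to ivermectin are (names only, comma-separated): "
    "Albendazole, mebendazole, and praziquantel."
)
print(parse_alternatives(completion, available))
```

Using a set for `available` keeps the membership check cheap and makes the filter a single intersection.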

In [8]:
import json

TEMPLATE = """
Transform the drug data into structured JSON using these rules:
1. Map `name_of_drug` to `name`.
2. Map `drug_tier` to `formulary_tier` (allow single integers **or** arrays like [1, 2]).
3. Map `requirement_limits` to `utilization_management`.
4. Assign `therapeutic_category` (e.g., "anticonvulsants", "antiparkinsonian_agents") based on the drug’s primary use.
5. Assign `pharmacologic_class` (e.g., "beta_lactam", "calcium_channel_blocking_agents") based on the drug’s primary use.
6. Derive `delivery_formulation` from the drug name suffix (e.g., "in_sleep" → "sleep_formulation"). Default to `null` if no suffix exists.
7. Set `combination_product` to `true` **only** if the drug name contains components separated by underscores (e.g., "carbidopa_levodopa"). 
    If true, add `combination_components` as a list of split names (e.g., ["carbidopa", "levodopa"]).
8. Set `brand_status` to "brand" for non-generic drugs (e.g., "apokyn"); otherwise, "generic".
9. If the drug is branded, add its corresponding `generic_name`
10. Assign `salt_form` (e.g., "hydrochloride", "sodium", "gluconate") based on the drug's salt structure.

(All output strings should be in lowercase. All output strings should be snake-case. `null` should be null and not the string "null")

Input:
{drug}

Output:
"""

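def viable_alternatives(drug_name):
    # Ask the model for a comma-separated list, then keep only the text after the prompt.
    prompt = f"The alternatives to {drug_name} are (names only, comma-separated):"
    generated = hf_pipeline(prompt)[0]["generated_text"].lower().rstrip(".")
    candidates = [alt.strip() for alt in generated.split("):")[-1].split(",")]
    # Drop a leading "and " from the final item (e.g. "a, b, and c").
    candidates = [alt[4:] if alt.startswith("and ") else alt for alt in candidates]
    # Keep only drugs that appear in the unrestricted formulary list.
    return sorted(readily_available_alternatives & set(candidates))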

def normalize_and_enrich(drug):
    generated = None  # raw model text, kept separately for error reporting
    try:
        prompt = TEMPLATE.format(drug=drug)
        generated = hf_pipeline(prompt)[0]["generated_text"]
        # The pipeline echoes the prompt, so parse only what follows the last "Output:".
        response = json.loads(generated.split("Output:")[-1])
        response["viable_alternatives"] = viable_alternatives(json.loads(drug)["name_of_drug"])
        return response
    except Exception as e:
        return {"error": str(e), "response": generated}

drug = '{"name_of_drug":"ivermectin","drug_tier":"4","requirement_limits":"QL"}'
result = normalize_and_enrich(drug)
result

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{'name': 'ivermectin',
 'formulary_tier': 4,
 'utilization_management': 'QL',
 'therapeutic_category': 'antiparasitic_agents',
 'pharmacologic_class': 'macrolides',
 'delivery_formulation': 'null',
 'combination_product': 'false',
 'brand_status': 'generic',
 'generic_name': 'null',
 'salt_form': 'null',
 'viable_alternatives': ['albendazole']}
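
In the output above, `null` and `false` come back as the strings `'null'` and `'false'` despite the template's instruction. One option is a small post-processing pass. This is a sketch that assumes only top-level string values need fixing; `coerce_literals` is a hypothetical helper, not part of the notebook:

```python
LITERALS = {"null": None, "true": True, "false": False}


def coerce_literals(record):
    """Replace the strings 'null'/'true'/'false' with None/True/False."""
    return {
        key: LITERALS.get(value.lower(), value) if isinstance(value, str) else value
        for key, value in record.items()
    }


sample = {
    "name": "ivermectin",
    "salt_form": "null",
    "combination_product": "false",
    "formulary_tier": 4,
}
cleaned = coerce_literals(sample)
print(cleaned)
```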

## ~~Batch Processing~~
- **Intent**: Process all rows in batches of 32 to avoid GPU overload.  
- **Status**: Disabled for demo purposes (resource constraints).  

In [9]:
# def promptify(drug):
#     return TEMPLATE.format(drug=drug)

# prompts = [promptify(row.to_json()) for _, row in data.iterrows()]

# def batch(iterable, n=1):
#     l = len(iterable)
#     for ndx in range(0, l, n):
#         yield iterable[ndx:min(ndx + n, l)]

# batches = batch(prompts, n=32)
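
For reference, the batching helper from the disabled cell works on any sliceable sequence. A runnable sketch with placeholder prompts (the strings are illustrative; the notebook would pass the formatted templates instead):

```python
def batch(items, n=1):
    """Yield consecutive slices of `items` with at most `n` elements each."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


prompts = [f"prompt {i}" for i in range(70)]  # placeholder prompts
sizes = [len(chunk) for chunk in batch(prompts, n=32)]
print(sizes)
```

The last batch is simply shorter when the number of prompts is not a multiple of the batch size.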

## Sample Processing
- **Workflow**:  
  1. Randomly selects 30 rows from `data` (sampling with replacement, so the same drug can appear more than once).  
  2. Uses `tqdm` for progress tracking.  
  3. Calls `normalize_and_enrich()` on each sample.  

In [10]:
from tqdm.notebook import tqdm

# this is a demo project so we will stick with a few samples
rows = [row.to_json() for _, row in data.iterrows()]
samples = np.random.choice(rows, 30).tolist() 
results = [normalize_and_enrich(sample) for sample in tqdm(samples)]

  0%|          | 0/30 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
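
Note that `np.random.choice(rows, 30)` draws with replacement by default, so duplicates are possible. If distinct samples are wanted, a seeded draw without replacement is one option. A sketch using the standard library; the `rows` list here is a stand-in for the JSON strings built in the cell above:

```python
import random

# Stand-in for the real rows: one JSON string per drug.
rows = [f'{{"name_of_drug": "drug_{i}"}}' for i in range(140)]

rng = random.Random(42)         # fixed seed so the demo is repeatable
samples = rng.sample(rows, 30)  # 30 distinct rows, no duplicates

print(len(samples), len(set(samples)))
```

Staying with NumPy, passing `replace=False` to `np.random.choice` should have the same effect.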

## Results Export
- **Action**: Converts results to a DataFrame and saves as `out2.csv`.  
- **Output**: Shows enriched fields like `therapeutic_category`, `pharmacologic_class`, and `viable_alternatives`.  

In [11]:
results = pd.DataFrame(results)
results.to_csv("out2.csv")
results

| | name | formulary_tier | utilization_management | therapeutic_category | pharmacologic_class | delivery_formulation | combination_product | brand_status | generic_name | salt_form | viable_alternatives | brand_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abiraterone acetate | [1, 4] | QL | androgen_suppressors | steroid_inhibitors | | False | generic | abiraterone | acetate | [] | |
| 1 | thalomid | 4 | QL | immunosuppressants | immunosuppressive_agents | | False | generic | | hydrochloride | [] | |
| 2 | zelboraf | 4 | QL | antineoplastic_agents | BRAF_inhibator | | False | generic | zelboraf | hydrochloride | [] | |
| 3 | hemofil m | 2 | QL | blood_substitutes | blood_substitutes | | False | generic | | | [] | |
| 4 | KOVALTRY | 2 | QL | antiparkinsonian_agents | dopamine_receptor_antagonists | | False | generic | Kovatry | | [] | |
| 5 | sorafenib tosylate | 4 | QL | antineoplastic_agents | tyrosine_kinase_inhibitors | | False | generic | | sulfate | [] | |
| 6 | vimizim | 4 | QL | antiparkinsonian_agents | dopamine_agonists | | False | generic | vimizim | hydrochloride | [] | |
| 7 | stivarga | 4 | QL | antiparkinsonian_agents | dopamine_agonists | | False | generic | | | [] | |
| 8 | altuviiio | 2 | QL | anticonvulsants | barbiturates | | False | generic | | | [] | |
| 9 | avonex | [2, 4] | QL | antiparkinsonian_agents | dopamine_agonists | | False | generic | | hydrochloride | [] | |
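
One thing to keep in mind for any later notebook that reads `out2.csv`: `to_csv` writes list-valued cells such as `viable_alternatives` as plain text, so they come back as strings unless they are parsed on load. A sketch with a two-row toy frame and an in-memory buffer standing in for the file:

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, like `viable_alternatives` above.
frame = pd.DataFrame({
    "name": ["ivermectin", "zelboraf"],
    "viable_alternatives": [["albendazole"], []],
})

buffer = io.StringIO()  # stands in for out2.csv
frame.to_csv(buffer)

buffer.seek(0)
naive = pd.read_csv(buffer, index_col=0)  # lists come back as strings

buffer.seek(0)
parsed = pd.read_csv(
    buffer,
    index_col=0,
    converters={"viable_alternatives": ast.literal_eval},  # rebuild the lists
)

print(type(naive.loc[0, "viable_alternatives"]).__name__)
print(parsed.loc[0, "viable_alternatives"])
```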
