# MedTrix

This project generates medical records.

## Part Three: BLOOM Fine-Tuning and MLOps Deployment

## Notebook Settings

#### Path

In [2]:
from pathlib import Path
import os

b_path = Path.cwd().parents[0]
os.chdir(b_path)
!ls

d_path = b_path / 'dataset'
n_path = b_path / 'notebook'
s_path = b_path / 'scripts'

dataset		 images    mlruns    poetry.lock     requirements.txt
Dockerfile	 LICENSE   model     pyproject.toml  scripts
environment.yml  logs.log  notebook  README.md	     src


#### Install

In [20]:
!which python

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/home/leobit/anaconda3/envs/conda_medtrix_env/bin/python
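
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed in an early cell, before any `!` shell command forks the kernel process:

```python
import os

# Disable tokenizer parallelism explicitly so that forked subprocesses
# (such as the `!` shell commands in this notebook) no longer trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```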


In [6]:
!pip install datasets==2.4.0
!pip install jmespath
!pip install git+https://github.com/bigscience-workshop/biomedical.git

Collecting git+https://github.com/bigscience-workshop/biomedical.git
  Cloning https://github.com/bigscience-workshop/biomedical.git to /tmp/pip-req-build-xnepcp9z
  Running command git clone --filter=blob:none --quiet https://github.com/bigscience-workshop/biomedical.git /tmp/pip-req-build-xnepcp9z
  Resolved https://github.com/bigscience-workshop/biomedical.git to commit 0279c7b1be1513f5ce79878f9f481f853cebc05a
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting bioc==2.0.post4
  Using cached bioc-2.0.post4-py3-none-any.whl (37 kB)
Collecting intervaltree
  Using cached intervaltree-3.1.0.tar.gz (32 kB)
  Preparing metadata (setup.py) ... done
Collecting jsonlines>=1.2.0
  Downloading jsonlines-3.1.0-py3-none-any.whl (8.6 kB)


Collecting sortedcontainers<3.0,>=2.0
  Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Building wheels for collected packages: bigbio, intervaltree
  Building wheel for bigbio (pyproject.toml) ... done
  Created wheel for bigbio: filename=bigbio-0.0.1-py3-none-any.whl size=556388 sha256=d2ca2be0599876e7c60cb8f3531f5192b6d52f50bd3eab83a5391ebaf907fc38
  Stored in directory: /tmp/pip-ephem-wheel-cache-0xqqw3tm/wheels/e3/f7/a9/0ee471a912c84b8e88d54f8fab8d0c17608c633c224598d39f
  Building wheel for intervaltree (setup.py) ... done
  Created wheel for intervaltree: filename=intervaltree-3.1.0-py2.py3-none-any.whl size=26100 sha256=04baa19212a15ddc85a9c531070019e0b6f348d3be1b3345b4f6c78563b3a2b1
  Stored in directory: /home/leobit/.cache/pip/wheels/fa/80/8c/43488a924a046b733b64de3fac99252674c892a4c3801c0a61
Successfully built bigbio intervaltree
Installing collected packages: sortedcontainers, jsonlines, intervaltree, bioc, bigbio
Successfully instal

#### Configuration

#### Import

In [3]:
import spacy
import re
import pandas as pd
from transformers import pipeline
from transformers import BloomTokenizerFast, BloomForCausalLM
from datasets import load_dataset
from bigbio.dataloader import BigBioConfigHelpers
from collections import defaultdict



## Testing Model

In [7]:
model = BloomForCausalLM.from_pretrained("bigscience/bloom-7b1")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-7b1")

Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 710/710 [00:00<00:00, 635kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████| 13.2G/13.2G [11:25<00:00, 20.6MB/s]
Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████| 222/222 [00:00<00:00, 181kB/s]
Downloading tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████| 13.8M/13.8M [00:01<00:00, 9.61MB/s]
Downloading special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 85.0/85.0 [00:00<00:00, 42.2kB/s]


In [12]:
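def generate_from_model(model, tokenizer, text, max_tokens):
    """Sample one continuation of `text` and return it as a decoded string."""
    encoded_input = tokenizer(text, return_tensors='pt')
    output_sequences = model.generate(
        input_ids=encoded_input['input_ids'],
        attention_mask=encoded_input['attention_mask'],
        max_length=max_tokens,  # cap on total length: prompt tokens plus generated tokens
        do_sample=True,         # sample instead of decoding greedily
        top_k=50,               # restrict sampling to the 50 most likely next tokens
        top_p=0.95,             # then to the smallest set covering 95% of the probability
    )
    return tokenizer.decode(output_sequences[0], skip_special_tokens=True)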

In [13]:
max_tokens = 250
text = "Anne is a patient with Diabettes and have another conditions like"

generate_from_model(model, tokenizer, text, max_tokens)

'Anne is a patient with Diabettes and have another conditions like obesity. Her main treatment is insulin therapy but not very well controlled. The insulin she has to take is really expensive because of the price increase in France recently, so she cannot afford it anymore. It will cost her 20 000 euros, for a time only! How can we help her? How can we raise enough money in order to help her?'
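
The `top_k` and `top_p` arguments used above restrict which tokens can be sampled at each step. The sketch below illustrates the idea on a plain Python list of probabilities. It is a simplification under stated assumptions: the helper name `top_k_top_p_filter` is hypothetical, and the library itself works on logit tensors rather than lists.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token_id: probability} after top-k, then top-p filtering."""
    # Top-k: keep the k most probable token ids and renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    mass_p = sum(probs[i] for i in kept)
    return {i: probs[i] / mass_p for i in kept}


# Toy next-token distribution over four token ids.
filtered = top_k_top_p_filter([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.8)
print(filtered)
```

With this toy distribution only token ids 0 and 1 survive, and sampling would then draw from the renormalized pair.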

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 131438
    })
    validation: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 7801
    })
})

In [8]:
conhelps = BigBioConfigHelpers()

In [9]:
conhelps.filtered(
    lambda x:
        x.dataset_name.startswith("n2c2")
    and not x.is_bigbio_schema
)

BigBioConfigHelper(script='/home/leobit/anaconda3/envs/conda_medtrix_env/lib/python3.10/site-packages/bigbio/biodatasets/n2c2_2006_deid/n2c2_2006_deid.py', dataset_name='n2c2_2006_deid', tasks=[<Tasks.NAMED_ENTITY_RECOGNITION: 'NER'>], languages=[<Lang.EN: 'English'>], config=BigBioConfig(name='n2c2_2006_deid_source', version=1.0.0, data_dir=None, data_files=None, description='n2c2_2006 deid source schema', schema='source', subset_id='n2c2_2006_deid'), is_local=True, is_pubmed=False, is_bigbio_schema=False, bigbio_schema_caps=None, is_large=False, is_resource=False, is_default=True, is_broken=False, bigbio_version='1.0.0', source_version='1.0.0', citation='@article{uzuner2007evaluating,\n    author = {\n        Uzuner, Özlem and\n        Luo, Yuan and\n        Szolovits, Peter\n    },\n    title     = {Evaluating the State-of-the-Art in Automatic De-identification},\n    journal   = {Journal of the American Medical Informatics Association},\n    volume    = {14},\n    number    = {5},\

## New Approach to Conditional Text Generation

In [15]:
# Load MIMIC-III
df_mimic_full = pd.read_csv(d_path / "mimic-iii.csv", index_col=0)

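# Normalize formatting quirks in the raw notes (applied in insertion order)
mimic_replace_d = {
    ":[**": ": [**",          # add the missing space after a colon
    "#:": ":",                # drop a stray '#' before a colon
    "\n\nD:": "\n\nDate:",    # expand the abbreviated 'D:' heading
}
for orig, repl in mimic_replace_d.items():
    df_mimic_full['TEXT'] = df_mimic_full['TEXT'].str.replace(orig, repl, regex=False)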
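
To see what these substitutions do, here is a small sketch on a made-up note fragment (the sample string is an assumption, not MIMIC-III data). It uses plain `str.replace` and relies on dictionaries preserving insertion order, so the rules run in the order they are written.

```python
replacements = {
    ":[**": ": [**",
    "#:": ":",
    "\n\nD:": "\n\nDate:",
}

# Made-up fragment that exhibits all three quirks.
raw = "Attending:[**Name**]\n\nD: [**2117-9-17**]\nJob#: 42"

cleaned = raw
for old, new in replacements.items():
    cleaned = cleaned.replace(old, new)

print(repr(cleaned))
```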

In [25]:
# Patterns that locate section headings ("topics"); raw strings avoid invalid-escape warnings
add_topics = [r'facility', r'HISTORY  OF  THE  PRESENT  ILLNESS(?=:)', r'Admission Date(?=:)', r'Discharge Date(?=:)', r'Sex(?=:)', r'Chief Complaint(?=:)', r'Addendum(?=:)', r'HISTORY OF PRESENT ILLNESS(?=:)']
# Case-insensitive via flags; a mid-pattern (?i) is an error on Python 3.11+
pattern = re.compile(rf"((?<=\n\n)[\w\s]+(?=:))|{'|'.join(add_topics)}", flags=re.IGNORECASE)
# Matches de-identification placeholders such as [**2117-9-11**]
hpi_p = re.compile(r"\[\*\*[^\[]*\*\*\]")
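
The heading pattern has two branches: any run of word characters and whitespace that sits between a blank line and a colon, plus a list of explicit headings that may not follow a blank line. A minimal sketch on a made-up note, assuming a shortened list of explicit headings:

```python
import re

# Shortened, illustrative list of explicit headings.
explicit = [r"Admission Date(?=:)", r"Sex(?=:)"]
heading_re = re.compile(
    rf"((?<=\n\n)[\w\s]+(?=:))|{'|'.join(explicit)}",
    flags=re.IGNORECASE,
)

note = (
    "Admission Date: [**2117-9-11**]\n\n"
    "Sex: F\n\n"
    "Chief Complaint:\nnausea, vomiting\n\n"
    "History of Present Illness:\n35F with DKA."
)

headings = [m.group() for m in heading_re.finditer(note)]
print(headings)
```

The first heading is found by the explicit branch because nothing precedes it; the others are found by the blank-line branch.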


In [23]:
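def get_topics_text(text):
    """Split a note into {normalized_heading: section_text}."""
    topics = []
    positions = []
    sections_text = {}
    for m in pattern.finditer(text):
        # Normalize the heading, e.g. "Chief Complaint" -> "chief_complaint"
        s = m.group().replace('\n', '')
        s = "_".join(s.lower().split())
        topics.append(s)
        # +2 skips the colon and the character after it
        positions.append((m.start(), m.end() + 2))
    for i, topic in enumerate(topics):
        start = positions[i][1]
        # A section runs until the next heading, or to the end of the note
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        sections_text[topic] = text[start:end].replace('\n', ' ')

    return sections_text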
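
The function above slices the note between consecutive heading matches. The same technique in isolation, as a sketch with a simplified heading pattern and a hypothetical helper name `split_sections`:

```python
import re

# Simplified stand-in for the notebook's pattern: a heading is a run of
# word characters and spaces that follows a blank line and precedes a colon.
simple_heading = re.compile(r"(?<=\n\n)[\w ]+(?=:)")


def split_sections(text):
    matches = list(simple_heading.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        key = "_".join(m.group().lower().split())
        start = m.end() + 1  # skip the colon itself
        # The section ends where the next heading starts, or at the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[key] = text[start:end].strip()
    return sections


note = "\n\nChief Complaint: nausea, vomiting\n\nPast Medical History: HTN, CKD\n"
sections = split_sections(note)
print(sections)
```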

In [9]:
en_nlp = spacy.blank('en')

In [28]:
topics_for_analysis = ['family_history', 'history_of_present_illness', 'hospital_course', 'brief_hospital_course', 'social_history', 'past_medical_history']
token_mean = defaultdict(list)  # topic -> token count of each section seen
for doc_text in df_mimic_full['TEXT'].to_list():
    sections_text = get_topics_text(doc_text)
    for topic, text in sections_text.items():
        if topic not in topics_for_analysis:
            continue
        doc = en_nlp(text)
        token_mean[topic].append(len(doc))

In [29]:
token_mean_f = {}
for topic, tokens_sizes_l in token_mean.items():
    token_mean_f[topic] = sum(tokens_sizes_l) / len(tokens_sizes_l)

In [30]:
token_mean_f

{'history_of_present_illness': 249.89652999410484,
 'past_medical_history': 77.88776049931239,
 'social_history': 44.40687888766923,
 'family_history': 15.293274121120401,
 'brief_hospital_course': 508.9598072009903,
 'hospital_course': 529.2624113475177}
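
The averages above come from a group-then-average step. A minimal sketch of the same aggregation on toy data, assuming whitespace splitting as a stand-in for the spaCy tokenizer (so counts on real notes would differ):

```python
from collections import defaultdict
from statistics import mean

# Toy (topic, text) pairs; not taken from MIMIC-III.
toy_sections = [
    ("family_history", "Father died of MI"),
    ("family_history", "Noncontributory"),
    ("social_history", "Lives at home with daughter"),
]

lengths = defaultdict(list)
for topic, text in toy_sections:
    lengths[topic].append(len(text.split()))  # whitespace tokens, an approximation

mean_tokens = {topic: mean(counts) for topic, counts in lengths.items()}
print(mean_tokens)
```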

## Construct T5 Dataset

In [9]:
## TEST, TREATMENTS, ANATOMY, MEDICATION
df_struct_new = pd.read_csv(d_path / 'df_struct_new_detect.csv')

In [16]:
## PROBLEMS
df_struct = pd.read_csv(d_path / "df_struct.csv")

In [11]:
## FAMILY HISTORY
df_struct_fam = pd.read_csv(d_path / "df_struct_fam.csv")

In [18]:
## MIMIC-III Text
df_struct_text = pd.read_csv(d_path / "df_struct_text.csv")

In [12]:
df_struct_new

Unnamed: 0,history_of_present_illness_treat,history_of_present_illness_test,history_of_present_illness_drugs,history_of_present_illness_anat,past_medical_history_treat,past_medical_history_test,past_medical_history_drugs,past_medical_history_anat,brief_hospital_course_treat,brief_hospital_course_test,hospital_course_treat,hospital_course_test,brief_hospital_course_drugs,hospital_course_drugs,brief_hospital_course_anat,hospital_course_anat
0,"['her 3rd L NS', 'Insulin srip', '30 mtroprolo...","['AG', 'bicarb', 'inital vitals', 'K', 'HCO3',...","['NS', 'Insulin', 'tartrate', 'insulin', 'NS',...",['right face'],[],['baseline Cr'],[],['vertebral'],"['Humalog SS', 'long acting Levemir', 'an insu...","['Sugars', 'glucose', 'UA', 'Her electrolytes'...",[],[],"['Humalog', 'Levemir', 'insulin', 'ativan', 'c...",[],"['blood', 'kidney']",[]
1,"['naproxen', '4L NS IV', 'protonix', 'an octre...","['EGD', 'his vitals', 'rectal exam', 'Abdomina...","['naproxen', 'NS', 'protonix', 'octreotide', '...","['esophageal varices', 'abdominal', 'abdominal...",['interferon'],[],[],['Esophageal'],"['naproxen', 'pantoprazole', 'tylenol', 'NSAID...","['Mr', 'EGD', 'repeat endoscopy', 'tbili']",[],[],"['naproxen', 'pantoprazole', 'tylenol', 'NSAIDs']",[],"['esophageal varices', 'antrum', 'pylorus', 'B...",[]
2,"['nebulizer', 'antibiotics', 'steroids', 'medi...","['heart', 'blood pressure', 'Her ABG', 'pH', '...","['nebulizer', 'antibiotics', 'steroids', 'Albu...","['myelomas', 'respiratory', 'respiratory', 'le...","['Decadron', 'intubation']",['bone marrow biopsy'],"['Decadron', 'steroid']","['myeloma', 'bone marrow', 'Pulmonary']",[],[],"['BIPAP', 'intravenous Solu-Medrol', 'nebulize...","['sating', 'her chest x-rays', 'An echocardiog...",[],"['Zithromax', 'Coumadin', 'Lovenox', 'Coumadin...",[],"['Pulmonary', 'pulmonary', 'intravenous', 'che..."
3,['multiple abdominal surgeries'],[],[],"['abdominal', 'abdominal', 'bowel', 'oral']",[],[],[],[],"['small bowel resection', 'jejunosotomy', 'lys...","['an uncomplicated exploratory laparatomy', 'C...",[],[],"['morphine', 'Cefazolin', 'Flagyl', 'Narcotics...",[],"['bowel', 'adhesions', 'nasal', 'chest', 'righ...",[]
4,"['SL NTG', 'treatment', 'coronary bypass graft...","['a cardiac catheterization', 'LC', 'LVEDP']",['NTG'],"['coronary', 'cardiac', 'vessel', 'coronary', ...","['AICD', 's/p ablation', 'radiation and excisi...","['LCx', 'EF']",['Insulin'],"['Ventricular', 'Peripheral vascular', 'Left l...","['Coronary artery bypass graft', 'the procedur...",['invasive monitoring'],[],[],"['vasopressor', 'Beta blocker', 'Lisinopril', ...",[],"['Coronary artery', 'atrial', 'epicardial wire...",[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52715,"['CABG', 'a Nitrocort drip', 'a dopamine drip'...","['Cath', 'EF']","['CAD', 'CABG', 'diaphoresis', 'pain', 'pain',...","['chest', 'chest', 'mitral', 'saphenous vein g...",['coronary artery bypass grafting'],[],['hypercholesterolemia'],"['coronary artery', 'coronary artery']",[],[],"['coronary artery bypass grafting', 'coronary ...","['Chest x-ray', 'bronchoscopy', 'Chest x-ray',...",[],"['RPL', 'pump', 'pump', 'bronchoscopy', 'stand...",[],"['heart', 'coronary artery', 'coronary artery'..."
52716,"['hemodialysis', 'hemodialysis', 'further mana...","['last echocardiogram', 'left ventricular func...","['chronic', ""'s"", 'echocardiogram', 'plus', 'C...","['heart', 'pulmonary', 'heart', 'renal', 'hear...",[],[],[],[],[],[],[],[],[],[],[],[]
52717,"['a balloon valvuloplasty', 'IV narcotics', 'i...","['serial echocardiograms', 'workup', 'The bact...","['valvuloplasty', 'Hospital', 'serial', 'Hospi...","['aortic', 'aortic valve', 'intravenous', 'hea...",['valvuloplasty'],[],"['valvuloplasty', 'IV', 'hospital', 'alcohol']",['aortic'],[],[],[],[],[],[],[],[]
52718,"['prior PCI', 'stenting', 'cardiac surgical in...","['Exercise tolerance test', 'Outside cardiac c...","['PCI', 'stenting', 'pain', 'dyspnea', 'tolera...","['coronary artery', 'chest', 'cardiac', 'vesse...","['Prior PCI', 'Stenting to Right Coronary Arte...",[],"['PCI', 'Stenting', 'Peripheral', 'Hyperlipide...","['Coronary', 'Coronary']","['surgery', 'pump coronary artery bypass graft...","['routine preoperative evaluation', 'his worku...",[],[],"['cardiac', 'preoperative', 'pump', 'CVICU', '...",[],"['cardiac', 'calcified aorta', 'coronary arter...",[]


In [13]:
df_struct

Unnamed: 0,age,sex,allergies,chief_complaint,history_of_present_illness,social_history,discharge_diagnosis,brief_hospital_course,hospital_course,past_medical_history
0,35.0,F,['Levaquin'],"['nausea', 'vomiting']","['Type 1 diabetes mellitus', 'neuropathy', 'ne...","['Patient lives', 'home', 'Location', 'daughte...","['Diabetic keotacidosis', 'Hematemesis (blood ...","['HTN', 'poorly controlled type I DM', 'c/b ne...",[],"['Type 1 diabetes mellitis', 'neuropathy', 'ne..."
1,59.0,M,[],['coffee ground emesis'],"['HepC cirrhosis', 'grade I/II esophageal vari...","['He', 'lives', 'drinking alcohol', 'session',...",['Peptic ulcer GI bleed'],"['HCV cirrhosis', 'grade II esophageal varices...",[],"['HCV Cirrhosis', 'Esophageal varices HTN']"
2,48.0,F,[],['Dyspnea'],"['multiple myelomas', 'respiratory distress', ...","['children', 'home', 'home health aid', 'years...",[],[],"['Dyspnea', 'chronic obstructive pulmonary dis...","['Multiple myeloma', 'increase protein', 'Pulm..."
3,73.0,F,"['Penicillins', 'Dilantin']","['Severe abdominal and back pain', 'Abdominal ...","['pancreatitis', 'previous SBO', 'abdominal pa...","['Married', 'lives', 'husband', 'retired pedia...","['Small bowel obstruction', 'Internal hernia',...","['adhesions', 'her pain', 'confusion', 'decrea...",[],"['Chronic Pancreatitis', 'Migraines', 'Pancrea..."
4,60.0,M,[],['Angina'],"['known coronary disease', 'AMI', 'angina', 'r...","['Race', 'caucasian', 'Last Dental Exam', 'Liv...","['Coronary Artery Disease CAD', 'AMI', 'Ventri...","['non capturing atrial lead', 'the sternal and...",[],"['CAD', 'AMI', 'Cardiomyopathy', 'Ventricular ..."
...,...,...,...,...,...,...,...,...,...,...
52715,60.0,M,[],"['shortness of breath', 'chest discomfort', 'c...","['CAD', 'shortness of breath', 'diaphoresis', ...","['No IV', 'Retired janitor']",[],[],"['a cool, mottled, left foot', 'right upper lo...","['coronary artery disease', 'MI', 'hypertensio..."
52716,58.0,F,['Demerol'],[],"['multiple medical problems', 'congestive hear...","['patient', 'tobacco', 'use', 'ethanol', 'abus...",[],[],[],[]
52717,23.0,M,[],[],"['a known congenital bicuspid aortic valve', '...",[],[],[],[],"['Congenital bicuspid aortic valve', 'Congenit..."
52718,69.0,M,['Midazolam Hcl'],['Chest pain'],"['known coronary artery disease', 'worsening c...","['Married', 'children', 'Stop smoking', '[**20...","['Coronary Artery Disease', 'Atrial Fibrillati...","['a heavily calcified aorta', 'some urinary re...",[],"['Coronary Artery Disease', 'Peripheral Vascul..."


In [20]:
df_struct_fam

| Unnamed: 0 | text | family_history |
|---|---|---|
| 0 | Both parents have HTN and T2DM. Grandfather ha... | ['parents', 'HTN', 'T2DM', 'Grandfather', 'MI'] |
| 1 | NC | ['NC'] |
| 2 | Family history shows father died of an myocar... | ['Family history', 'father', 'died', 'myocardi... |
| 3 | Father: deceased, leukemia Brother: colon canc... | ['Father', 'deceased', 'leukemia', 'colon canc... |
| 4 | Father died 50yo cirrhosis, mother died 42yo MI | ['Father', 'died', 'cirrhosis', 'mother', 'die... |
| ... | ... | ... |
| 40302 | Brother and son had MI in 50's | ['Brother', 'son', 'MI'] |
| 40303 | Nonconttributory | ['Nonconttributory'] |
| 40304 | Mother died of renal failure. Brother died of... | ['Mother', 'died', 'renal failure', 'died', 'l... |
| 40305 | Denied premature coronary artery disease | ['Denied'] |
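
List-valued columns such as `family_history` come back from `pd.read_csv` as strings, so they need parsing before use. The sketch below parses one cell with `ast.literal_eval` and pairs the keywords with the original text. The helper `make_pair` and the `section: keyword, keyword` prompt format are assumptions for illustration; the notebook does not show how the text-to-text pairs are built.

```python
import ast


def make_pair(keywords_cell, text, section="family_history"):
    """Build a (source, target) pair from a stringified keyword list and its text."""
    keywords = ast.literal_eval(keywords_cell)  # "['a', 'b']" -> ['a', 'b']
    source = f"{section}: " + ", ".join(keywords)  # hypothetical prompt format
    return source, text


# Values copied from row 40302 of the table above.
source, target = make_pair("['Brother', 'son', 'MI']", "Brother and son had MI in 50's")
print(source)
print(target)
```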


In [26]:
get_topics_text(df_struct_text.iloc[0].to_list()[0])

{'admission_date': ' [**2117-9-11**]              ',
 'discharge_date': '  [**2117-9-17**]  ',
 'date_of_birth': ' [**2082-3-21**]             ',
 'sex': '  F  ',
 'service': 'MEDICINE  ',
 'allergies': 'Levaquin  ',
 'attending': '[**First Name3 (LF) 2195**] ',
 'chief_complaint': 'nausea, vomiting  ',
 'major_surgical_or_invasive_procedure': 'none  ',
 'history_of_present_illness': '35F w/ poorly controlled Type 1 diabetes mellitus w/ neuropathy, nephropathy, HTN, gastroparesis, CKD and retinopathy, recently hospitalized for orthostatic hypotension [**2-3**] autonomic neuropathy [**Date range (1) 25088**]; DKA hospitalizations in [**6-12**] and [**7-12**], now returning w/ 5d history of worsening nausea, vomiting with coffee-ground emesis, chills, and dyspnea on exertion.  Last week she had a fall and hit her right face.  she also had 1 day of diarrhea, which resolved early last week.  Found to be in DKA with AG 30 and bicarb 11. . In the ED inital vitals were 09:00 0 98.2 113 181/99