# Medtrox

## Purpose
Generate Medical Records

## Methodology
- OpenCV
- OCR (Computer Vision)

## WIP - improvements

Notable TODOs:
- todo 1;
- todo 2;
- todo 3.

## Results
Describe and comment the most important results.

## Suggested next steps
State suggested next steps, based on results obtained in this notebook.

## Part Three: T5 Fine-tuning and MLOps Deployment

# Setup

### Notebook Information

In [2]:
print("Python Version:")
!python --version

print("\nBase Environment")
tmp = !conda info -a | grep -i "CONDA_ROOT" | cut -d ":" -f 2
base_env_path = tmp[0].strip()
print(base_env_path)

print("\nActive Environment:")
tmp = !conda info -a | grep -i "active environment" | cut -d ":" -f 2
print(tmp[0].strip())

print("\nEnvironment Python Path:")
!which python

print("")
base_jupy_path = base_env_path +  "/bin/jupyter"
!{base_jupy_path} nbextension list

Python Version:
Python 3.10.6

Base Environment
/home/leolara/miniconda3

Active Environment:
cond_medtrix_env

Environment Python Path:
/home/leolara/miniconda3/envs/cond_medtrix_env/bin/python

Config option `kernel_spec_manager_class` not recognized by `ListNBExtensionsApp`.
Known nbextensions:
  config dir: /home/leolara/.jupyter/nbconfig
    notebook section
      jupyter-js-widgets/extension enabled
      - Validating: OK
      nbextensions_configurator/config_menu/main enabled
      - Validating: problems found:
        - require? X nbextensions_configurator/config_menu/main
      execute_time/ExecuteTime enabled
      - Validating: OK
    tree section
      nbextensions_configurator/tree_tab/main enabled
      - Validating: problems found:
        - require? X nbextensions_configurator/tree_tab/main
  config dir: /home/leolara/miniconda3/etc/jupyter/nbconfig
    notebook section
      jupyter-js-w

### Path

In [3]:
import sys
from pathlib import Path
from contextlib import suppress
project_paths = ['dataset', 'notebook', 'scripts', 'model', 'src', 'images']

b_path = Path.cwd().parents[0]

for _path in project_paths:
    (b_path / _path).mkdir(exist_ok=True)
    if not globals().get(f"{_path[:1]}_path"):
        globals()[f"{_path[:1]}_path"] = b_path / _path
    else:
        globals()[f"{_path}_path"] = b_path / _path

## Additional Paths

## Add to Path
sys.path.append(str(s_path))
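For reference, the naming rule used by the loop above can be sketched without touching `globals()`. This is only an illustration: `build_project_paths` is a hypothetical helper, it assumes a fresh kernel (no alias defined yet), and it does not create any directories.

```python
from pathlib import Path


def build_project_paths(base, names):
    """Map each folder to '<first letter>_path', falling back to '<name>_path' on a clash."""
    paths = {}
    for name in names:
        short_key = f"{name[0]}_path"
        # The first folder with a given initial gets the short alias; later ones use the full name.
        key = short_key if short_key not in paths else f"{name}_path"
        paths[key] = Path(base) / name
    return paths


project_paths = ['dataset', 'notebook', 'scripts', 'model', 'src', 'images']
aliases = build_project_paths("project_root", project_paths)
for alias, folder in aliases.items():
    print(alias, "->", folder)
```

Under this rule `s_path` points at `scripts`, while `src` only gets the longer `src_path` alias, which is why the cell above appends `s_path` (the scripts folder) to `sys.path`.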

### Installation

In [4]:
%%sh
pip install dvclive
pip install joblib



In [None]:
!pip install 'protobuf<=3.19' --force-reinstall

In [None]:
!conda install cudatoolkit -y

In [16]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/leolara/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

#### Configuration

In [None]:
!git config --global credential.helper store

In [None]:
!jupyter nbextension enable

In [9]:
pd.set_option('max_colwidth', 999)
pd.set_option('display.max_columns', None)

In [3]:
import warnings, os
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')
os.environ["TOKENIZERS_PARALLELISM"] = "false"

#### Import

In [5]:
import spacy
import re
import pandas as pd
import pickle
import numpy as np
import torch
import nltk
import mlflow
import bitsandbytes as bnb
from sklearn.model_selection import StratifiedKFold
from transformers import pipeline
from transformers import BloomTokenizerFast, BloomForCausalLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names
from collections import defaultdict
from ast import literal_eval
from tqdm import tqdm
from pprint import pprint
from nltk.tokenize import wordpunct_tokenize
from huggingface_hub import notebook_login
from datasets import load_dataset, load_metric, Dataset

## Testing Model

In [7]:
model = BloomForCausalLM.from_pretrained("bigscience/bloom-7b1")
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-7b1")

Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 710/710 [00:00<00:00, 635kB/s]
Downloading pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████| 13.2G/13.2G [11:25<00:00, 20.6MB/s]
Downloading tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████| 222/222 [00:00<00:00, 181kB/s]
Downloading tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████| 13.8M/13.8M [00:01<00:00, 9.61MB/s]
Downloading special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████| 85.0/85.0 [00:00<00:00, 42.2kB/s]


In [12]:
def generate_from_model(model, tokenizer, text, max_tokens):
    encoded_input = tokenizer(text, return_tensors='pt')
    output_sequences = model.generate(
        input_ids=encoded_input['input_ids'],
        max_length=max_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        early_stopping=True
    )
    return tokenizer.decode(output_sequences[0], skip_special_tokens=True)

In [13]:
max_tokens = 250
text = "Anne is a patient with Diabettes and have another conditions like"

generate_from_model(model, tokenizer, text, max_tokens)

'Anne is a patient with Diabettes and have another conditions like obesity. Her main treatment is insulin therapy but not very well controlled. The insulin she has to take is really expensive because of the price increase in France recently, so she cannot afford it anymore. It will cost her 20 000 euros, for a time only! How can we help her? How can we raise enough money in order to help her?'

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 131438
    })
    validation: Dataset({
        features: ['personality', 'candidates', 'history', 'conv_id', 'utterance_idx'],
        num_rows: 7801
    })
})

## New Approach to Conditional Text Generation

In [11]:
# Load MIMIC-III
df_mimic_full = pd.read_csv(d_path / "mimic-iii.csv", index_col=0)

# Replace wrong texts
mimic_replace_d = {
    ":[**":": [**",
    "#:":":",
    "\n\nD:":"\n\nDate:"
}
for orig, repl in mimic_replace_d.items():
    df_mimic_full['TEXT'] = df_mimic_full['TEXT'].apply(lambda x: x.replace(orig, repl))
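A minimal sketch of what the replacement table does to a single note. The toy string and the `normalize_note` helper are made up for illustration; only the three replacement pairs come from the cell above.

```python
REPLACEMENTS = {
    ":[**": ": [**",        # add the missing space before a de-identification placeholder
    "#:": ":",              # drop the stray '#' before a colon
    "\n\nD:": "\n\nDate:",  # expand the abbreviated 'D:' header
}


def normalize_note(text, replacements=REPLACEMENTS):
    """Apply each literal replacement in insertion order."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


toy_note = "Attending:[**Name 1**]\n\nD: 2100-1-1\nJob#: 123"
print(repr(normalize_note(toy_note)))
```

Because the replacements are plain substring substitutions applied one after another, their order matters whenever one replacement could produce text that a later one matches.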

In [12]:
# Pattern to get Topics
add_topics = [r'facility', r'HISTORY  OF  THE  PRESENT  ILLNESS(?=:)', r'Admission Date(?=:)', r'Discharge Date(?=:)', r'Sex(?=:)', r'Chief Complaint(?=:)', r'Addendum(?=:)', r'HISTORY OF PRESENT ILLNESS(?=:)']
# The inline (?i) flag previously sat mid-pattern, which Python 3.11+ rejects;
# re.IGNORECASE keeps the same pattern-wide, case-insensitive behavior.
pattern = re.compile(rf"((?<=\n\n)[\w\s]+(?=:))|{'|'.join(add_topics)}", flags=re.IGNORECASE)
hpi_p = re.compile(r"\[\*\*[^\[]*\*\*\]", flags=0)


In [13]:
def get_topics_text(text):
    topics = []
    positions = []
    sections_text = {}
    for m in pattern.finditer(text):
        s = m.group().replace('\n','')
        s = "_".join(s.lower().split())
        topics.append(s)
        positions.append((m.span()[0], m.span()[1]+2))
    for i, topic in enumerate(topics):
        start = positions[i][1]
        try:
            end = positions[i+1][0]
        except IndexError:
            # Last section: run to the end of the note (the final character is dropped).
            end = len(text)-1
        sections_text[topic]=text[start:end].replace('\n',' ')
        
    return sections_text
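The header-splitting idea can be tried on a toy note with a simplified, self-contained variant. This sketch is an assumption-laden illustration, not the notebook's exact function: `split_sections` and `header_p` are hypothetical names, the extra `add_topics` alternatives are omitted, and whitespace in each section body is collapsed.

```python
import re

# A header is a run of word/whitespace characters that follows a blank line
# and is immediately followed by a colon.
header_p = re.compile(r"(?<=\n\n)[\w\s]+(?=:)")


def split_sections(text):
    """Return {normalized_header: section_body} for one note."""
    matches = list(header_p.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        name = "_".join(m.group().lower().split())
        start = m.end() + 1  # skip the colon after the header
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[name] = " ".join(text[start:end].split())
    return sections


toy_note = (
    "Service: MEDICINE\n\n"
    "Chief Complaint:\nchest pain\n\n"
    "Family History:\nFather had MI at age 60.\n\n"
    "Social History:\nLives with spouse. Non-smoker."
)
print(split_sections(toy_note))
```

Note that `Service` is not picked up because it does not follow a blank line. Headers in that position are the reason the notebook's pattern lists explicit alternatives such as `Admission Date` and `Sex`.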

In [9]:
en_nlp = spacy.blank('en')

In [28]:
topics_for_analysis = ['family_history','history_of_present_illness','hospital_course', 'brief_hospital_course', 'social_history', 'past_medical_history', ]
token_mean = defaultdict(list)
for idx, doc_text in enumerate(df_mimic_full['TEXT'].to_list()):
    sections_text = get_topics_text(doc_text)
    for topic, text in sections_text.items():
        if topic not in topics_for_analysis: continue
        doc = en_nlp(text)
        token_mean[topic]+=[len(doc)]

In [29]:
token_mean_f = {}
for topic, tokens_sizes_l in token_mean.items():
    token_mean_f[topic] = sum(tokens_sizes_l) / len(tokens_sizes_l)

In [30]:
token_mean_f

{'history_of_present_illness': 249.89652999410484,
 'past_medical_history': 77.88776049931239,
 'social_history': 44.40687888766923,
 'family_history': 15.293274121120401,
 'brief_hospital_course': 508.9598072009903,
 'hospital_course': 529.2624113475177}
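A mean alone can hide long outliers, so it may help to look at the median and maximum as well. The sketch below is illustrative only: it uses whitespace tokens as a stand-in for the spaCy tokenizer (so counts will differ from the figures above), and `section_length_stats` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def section_length_stats(notes_sections, wanted):
    """Summarize token counts per section across notes (whitespace tokens)."""
    lengths = defaultdict(list)
    for sections in notes_sections:
        for name, body in sections.items():
            if name in wanted:
                lengths[name].append(len(body.split()))
    return {
        name: {"mean": mean(values), "median": median(values), "max": max(values)}
        for name, values in lengths.items()
    }


toy_sections = [
    {"family_history": "Father had MI", "social_history": "Lives alone"},
    {"family_history": "NC"},
]
stats = section_length_stats(toy_sections, {"family_history"})
print(stats)
```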

## Construct T5 Dataset

In [11]:
## TEST, TREATMENTS, ANATOMY, MEDICATION
df_struct_new = pd.read_csv(d_path / 'df_struct_new_detect.csv')

In [12]:
## PROBLEMS
df_struct = pd.read_csv(d_path / "df_struct.csv")

In [13]:
## FAMILY HISTORY
df_struct_fam = pd.read_csv(d_path / "df_struct_fam.csv")

In [8]:
## MIMIC-III Text
df_struct_text = pd.read_csv(d_path / "df_struct_text.csv")

#### Adjusting Family Dataset

In [28]:
df_struct_fam.iloc[0]['family_history']

"['parents', 'HTN', 'T2DM', 'Grandfather', 'MI']"

In [62]:
df_struct_fam

Unnamed: 0,text,family_history
0,Both parents have HTN and T2DM. Grandfather ha...,"['parents', 'HTN', 'T2DM', 'Grandfather', 'MI']"
1,NC,['NC']
2,Family history shows father died of an myocar...,"['Family history', 'father', 'died', 'myocardi..."
3,"Father: deceased, leukemia Brother: colon canc...","['Father', 'deceased', 'leukemia', 'colon canc..."
4,"Father died 50yo cirrhosis, mother died 42yo MI","['Father', 'died', 'cirrhosis', 'mother', 'die..."
...,...,...
40302,Brother and son had MI in 50's,"['Brother', 'son', 'MI']"
40303,Nonconttributory,['Nonconttributory']
40304,Mother died of renal failure. Brother died of...,"['Mother', 'died', 'renal failure', 'died', 'l..."
40305,Denied premature coronary artery disease,['Denied']


In [70]:
idx_p = 0
struct_d = defaultdict(list)
for text in df_mimic_full['TEXT'].to_list():
    ## SECTIONS
    sections_text = get_topics_text(text)
    fam_text = sections_text.get('family_history',[])
    struct_d['text']+=[text]
    if fam_text:
        struct_d['family_history'] += [literal_eval(df_struct_fam.iloc[idx_p]['family_history'])]
        idx_p+=1
    else:
        struct_d['family_history'] += [[]]
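The loop above realigns the extracted family-history rows with the full set of notes by advancing a counter only when a note has a family-history section. A small sketch of that positional alignment, with a guard against count mismatches, could look like this. `align_by_presence` is a hypothetical helper and the toy inputs are invented.

```python
def align_by_presence(has_section_flags, extracted_rows):
    """Place one extracted row at each True flag, and an empty list elsewhere."""
    rows = iter(extracted_rows)
    missing = object()  # sentinel: the extracted rows ran out too early
    aligned = []
    for has_section in has_section_flags:
        if not has_section:
            aligned.append([])
            continue
        row = next(rows, missing)
        if row is missing:
            raise ValueError("fewer extracted rows than notes with the section")
        aligned.append(row)
    leftover = list(rows)
    if leftover:
        raise ValueError(f"{len(leftover)} extracted rows were never consumed")
    return aligned


print(align_by_presence([True, False, True], [["HTN"], ["MI"]]))
```

The guard matters because this kind of alignment fails silently otherwise: if section detection changes between runs, every later row shifts by one.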

In [71]:
df_tmp = pd.DataFrame(struct_d)

In [99]:
df_tmp.drop([13735, 13736], inplace=True)

In [100]:
df_tmp = df_tmp.reset_index(drop=True)

In [101]:
df_tmp

Unnamed: 0,text,family_history
0,{Admission Date: [**2117-9-11**] ...,"[parents, HTN, T2DM, Grandfather, MI]"
1,{Admission Date: [**2150-4-17**] ...,[NC]
2,{Admission Date: [**2108-4-6**] Dischar...,"[Family history, father, died, myocardial infa..."
3,{Admission Date: [**2145-3-31**] ...,"[Father, deceased, leukemia, colon cancer]"
4,{Admission Date: [**2162-5-16**] ...,"[Father, died, cirrhosis, mother, died, MI]"
...,...,...
52715,{Admission Date: [**2161-10-23**] Disch...,[]
52716,{Admission Date: [**2188-7-7**] Dischar...,[]
52717,{Admission Date: [**2137-12-11**] Disch...,[]
52718,{Admission Date: [**2119-2-18**] ...,[Denied]


In [102]:
df_tmp.iloc[0]['family_history']

['parents', 'HTN', 'T2DM', 'Grandfather', 'MI']

In [105]:
df_tmp.drop(columns='text', inplace=True)

In [106]:
df_struct_fam = df_tmp.copy()

In [107]:
df_struct_fam.to_csv(d_path / 'df_struct_fam.csv', index=0)

#### Join all Datasets


In [110]:
df_struct_full = pd.concat([df_struct, df_struct_new, df_struct_fam], axis=1)

In [172]:
df_struct_full.to_csv(d_path / 'df_struct_full.csv', index=0)

In [14]:
df_struct_full = pd.read_csv(d_path / 'df_struct_full.csv')

In [14]:
df_struct_full.columns

Index(['age', 'sex', 'allergies', 'chief_complaint',
       'history_of_present_illness_problem', 'social_history',
       'discharge_diagnosis', 'brief_hospital_course_problem',
       'hospital_course_problem', 'past_medical_history_problem',
       'history_of_present_illness_treat', 'history_of_present_illness_test',
       'history_of_present_illness_drugs', 'history_of_present_illness_anat',
       'past_medical_history_treat', 'past_medical_history_test',
       'past_medical_history_drugs', 'past_medical_history_anat',
       'brief_hospital_course_treat', 'brief_hospital_course_test',
       'hospital_course_treat', 'hospital_course_test',
       'brief_hospital_course_drugs', 'hospital_course_drugs',
       'brief_hospital_course_anat', 'hospital_course_anat', 'family_history'],
      dtype='object')

In [135]:
df_struct_full.columns = ['age', 'sex', 'allergies', 'chief_complaint',
       'history_of_present_illness_problem', 'social_history', 'discharge_diagnosis',
       'brief_hospital_course_problem', 'hospital_course_problem', 'past_medical_history_problem',
       'history_of_present_illness_treat', 'history_of_present_illness_test',
       'history_of_present_illness_drugs', 'history_of_present_illness_anat',
       'past_medical_history_treat', 'past_medical_history_test',
       'past_medical_history_drugs', 'past_medical_history_anat',
       'brief_hospital_course_treat', 'brief_hospital_course_test',
       'hospital_course_treat', 'hospital_course_test',
       'brief_hospital_course_drugs', 'hospital_course_drugs',
       'brief_hospital_course_anat', 'hospital_course_anat', 'family_history']

In [15]:
df_struct_full

Unnamed: 0,age,sex,allergies,chief_complaint,history_of_present_illness_problem,social_history,discharge_diagnosis,brief_hospital_course_problem,hospital_course_problem,past_medical_history_problem,...,past_medical_history_anat,brief_hospital_course_treat,brief_hospital_course_test,hospital_course_treat,hospital_course_test,brief_hospital_course_drugs,hospital_course_drugs,brief_hospital_course_anat,hospital_course_anat,family_history
0,35.0,F,['Levaquin'],"['nausea', 'vomiting']","['Type 1 diabetes mellitus', 'neuropathy', 'ne...","['Patient lives', 'home', 'Location', 'daughte...","['Diabetic keotacidosis', 'Hematemesis (blood ...","['HTN', 'poorly controlled type I DM', 'c/b ne...",[],"['Type 1 diabetes mellitis', 'neuropathy', 'ne...",...,['vertebral'],"['Humalog SS', 'long acting Levemir', 'an insu...","['Sugars', 'glucose', 'UA', 'Her electrolytes'...",[],[],"['Humalog', 'Levemir', 'insulin', 'ativan', 'c...",[],"['blood', 'kidney']",[],"['parents', 'HTN', 'T2DM', 'Grandfather', 'MI']"
1,59.0,M,[],['coffee ground emesis'],"['HepC cirrhosis', 'grade I/II esophageal vari...","['He', 'lives', 'drinking alcohol', 'session',...",['Peptic ulcer GI bleed'],"['HCV cirrhosis', 'grade II esophageal varices...",[],"['HCV Cirrhosis', 'Esophageal varices HTN']",...,['Esophageal'],"['naproxen', 'pantoprazole', 'tylenol', 'NSAID...","['Mr', 'EGD', 'repeat endoscopy', 'tbili']",[],[],"['naproxen', 'pantoprazole', 'tylenol', 'NSAIDs']",[],"['esophageal varices', 'antrum', 'pylorus', 'B...",[],['NC']
2,48.0,F,[],['Dyspnea'],"['multiple myelomas', 'respiratory distress', ...","['children', 'home', 'home health aid', 'years...",[],[],"['Dyspnea', 'chronic obstructive pulmonary dis...","['Multiple myeloma', 'increase protein', 'Pulm...",...,"['myeloma', 'bone marrow', 'Pulmonary']",[],[],"['BIPAP', 'intravenous Solu-Medrol', 'nebulize...","['sating', 'her chest x-rays', 'An echocardiog...",[],"['Zithromax', 'Coumadin', 'Lovenox', 'Coumadin...",[],"['Pulmonary', 'pulmonary', 'intravenous', 'che...","['Family history', 'father', 'died', 'myocardi..."
3,73.0,F,"['Penicillins', 'Dilantin']","['Severe abdominal and back pain', 'Abdominal ...","['pancreatitis', 'previous SBO', 'abdominal pa...","['Married', 'lives', 'husband', 'retired pedia...","['Small bowel obstruction', 'Internal hernia',...","['adhesions', 'her pain', 'confusion', 'decrea...",[],"['Chronic Pancreatitis', 'Migraines', 'Pancrea...",...,[],"['small bowel resection', 'jejunosotomy', 'lys...","['an uncomplicated exploratory laparatomy', 'C...",[],[],"['morphine', 'Cefazolin', 'Flagyl', 'Narcotics...",[],"['bowel', 'adhesions', 'nasal', 'chest', 'righ...",[],"['Father', 'deceased', 'leukemia', 'colon canc..."
4,60.0,M,[],['Angina'],"['known coronary disease', 'AMI', 'angina', 'r...","['Race', 'caucasian', 'Last Dental Exam', 'Liv...","['Coronary Artery Disease CAD', 'AMI', 'Ventri...","['non capturing atrial lead', 'the sternal and...",[],"['CAD', 'AMI', 'Cardiomyopathy', 'Ventricular ...",...,"['Ventricular', 'Peripheral vascular', 'Left l...","['Coronary artery bypass graft', 'the procedur...",['invasive monitoring'],[],[],"['vasopressor', 'Beta blocker', 'Lisinopril', ...",[],"['Coronary artery', 'atrial', 'epicardial wire...",[],"['Father', 'died', 'cirrhosis', 'mother', 'die..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52715,60.0,M,[],"['shortness of breath', 'chest discomfort', 'c...","['CAD', 'shortness of breath', 'diaphoresis', ...","['No IV', 'Retired janitor']",[],[],"['a cool, mottled, left foot', 'right upper lo...","['coronary artery disease', 'MI', 'hypertensio...",...,"['coronary artery', 'coronary artery']",[],[],"['coronary artery bypass grafting', 'coronary ...","['Chest x-ray', 'bronchoscopy', 'Chest x-ray',...",[],"['RPL', 'pump', 'pump', 'bronchoscopy', 'stand...",[],"['heart', 'coronary artery', 'coronary artery'...",[]
52716,58.0,F,['Demerol'],[],"['multiple medical problems', 'congestive hear...","['patient', 'tobacco', 'use', 'ethanol', 'abus...",[],[],[],[],...,[],[],[],[],[],[],[],[],[],[]
52717,23.0,M,[],[],"['a known congenital bicuspid aortic valve', '...",[],[],[],[],"['Congenital bicuspid aortic valve', 'Congenit...",...,['aortic'],[],[],[],[],[],[],[],[],[]
52718,69.0,M,['Midazolam Hcl'],['Chest pain'],"['known coronary artery disease', 'worsening c...","['Married', 'children', 'Stop smoking', '[**20...","['Coronary Artery Disease', 'Atrial Fibrillati...","['a heavily calcified aorta', 'some urinary re...",[],"['Coronary Artery Disease', 'Peripheral Vascul...",...,"['Coronary', 'Coronary']","['surgery', 'pump coronary artery bypass graft...","['routine preoperative evaluation', 'his worku...",[],[],"['cardiac', 'preoperative', 'pump', 'CVICU', '...",[],"['cardiac', 'calcified aorta', 'coronary arter...",[],['Denied']


#### Creating T5 Dataset

#### Building the medical-terms dataset (problems, treatments, tests, drugs, and anatomy)

In [17]:
med_types = ['problem', 'treat', 'test', 'drugs', 'anat'] # Order priority
sections_doc = ['history_of_present_illness', 'brief_hospital_course', 'hospital_course', 'past_medical_history']
num_p = re.compile(r"\d+(?:\.\d+)?")
add_sections_doc = ['social_history', 'family_history']
hpi_p = re.compile(r"\[\*\*([^\[]*)\*\*\]", flags=0)

doc_d = {}
for idx, text in tqdm(enumerate(df_struct_text['text'].to_list())):
    
    # Sections from Document
    sections_text = get_topics_text(text)
    
    ## Dictionary from Line
    lists_d = df_struct_full.iloc[idx].to_dict()
    
    ## SECTION
    sec_d = defaultdict(list)
    for section in sections_doc:
        sec_text = sections_text.get(section, '')
        for res in hpi_p.finditer(sec_text):
            sec_text = sec_text.replace(res.group(), "")
        sec_text_orig = sec_text
        if sec_text:
            blk_intervals = []
            ## MEDICAL TYPE
            for med_type in med_types:
                med_l = literal_eval(lists_d[section+'_'+med_type])
                for med_term in med_l:
                    start = sec_text.find(med_term)
                    if start!=-1:
                        if any((start in interval) for interval in blk_intervals):
                            continue
                        end = start+len(med_term)
                        sec_d[section] += [(med_type, med_term, start)]
                        blk_intervals.append(range(start,end))
                        
            ## SET NUMBERS ENTITY
            for res in num_p.finditer(sec_text_orig):
                start = res.start()
                if any((start in interval) for interval in blk_intervals):
                    continue
                sec_d[section] += [('number', res.group(), res.start())]
    
    for section in add_sections_doc:
        sec_text = sections_text.get(section, '')
        for res in hpi_p.finditer(sec_text):
            sec_text = sec_text.replace(res.group(), "")
        sec_text_orig = sec_text
        if sec_text:
            blk_intervals = []
            med_l = literal_eval(lists_d[section])
            for med_term in med_l:
                start = sec_text.find(med_term) 
                if start!=-1:
                    if any((start in interval) for interval in blk_intervals):
                        continue
                    end = start+len(med_term)
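                    # These sections have no per-type lists, so use an explicit generic label
                    # instead of the med_type variable left over from the previous loop.
                    sec_d[section] += [('gen_med', med_term, start)]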
                    blk_intervals.append(range(start,end))
                    
            ## SET NUMBERS ENTITY
            for res in num_p.finditer(sec_text_orig):
                start = res.start()
                if any((start in interval) for interval in blk_intervals):
                    continue
                sec_d[section] += [('number', res.group(), res.start())]
                    
    doc_d[idx] = sec_d

52720it [01:33, 565.00it/s]
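The core of the cell above is a priority-ordered term search in which a term is skipped when its start offset falls inside a span that an earlier term has already claimed. The following sketch isolates that logic on a toy sentence. `locate_terms` and the example data are illustrative assumptions, and the result is sorted by offset for readability.

```python
import re

NUM_P = re.compile(r"\d+(?:\.\d+)?")


def locate_terms(text, terms_by_type, type_priority):
    """Return (type, term, start) tuples; earlier types win overlapping spans."""
    found = []
    blocked = []  # ranges of characters already claimed by a term
    for med_type in type_priority:
        for term in terms_by_type.get(med_type, []):
            start = text.find(term)  # first occurrence only, as in the cell above
            if start == -1 or any(start in span for span in blocked):
                continue
            found.append((med_type, term, start))
            blocked.append(range(start, start + len(term)))
    # Numbers become entities only when they are not part of a claimed term.
    for m in NUM_P.finditer(text):
        if not any(m.start() in span for span in blocked):
            found.append(("number", m.group(), m.start()))
    return sorted(found, key=lambda item: item[2])


toy_text = "Patient with type 2 diabetes started insulin 10 units daily."
toy_terms = {"problem": ["type 2 diabetes"], "drugs": ["insulin", "diabetes"]}
entities = locate_terms(toy_text, toy_terms, ["problem", "drugs"])
print(entities)
```

Here `diabetes` and the digit `2` are dropped because both start inside the span already claimed by `type 2 diabetes`. Only the start offset is checked, so a later term that begins before a claimed span and runs into it would still be accepted.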


In [60]:
get_topics_text(df_struct_text['text'].iloc[2])['past_medical_history'][310:]

'2107-12-9**].  '

In [39]:
print(str(df_struct_text['text'].iloc[1])[42:60])

     Discharge Dat


In [63]:
doc_d[2]

defaultdict(list,
            {'history_of_present_illness': [('problem',
               'multiple myelomas',
               65),
              ('problem', 'respiratory distress', 102),
              ('problem', 'elevated INR', 306),
              ('problem', 'lower INR', 347),
              ('problem', 'increased difficulty with breathing', 433),
              ('problem', 'a cough', 488),
              ('problem', 'decreased p.o. intake', 555),
              ('problem', 'some wheezing', 673),
              ('problem', 'coughing up thick sputum', 809),
              ('problem', 'tachypneic', 988),
              ('problem', 'weak', 1423),
              ('problem', 'steroids', 282),
              ('problem', 'Albuterol', 705),
              ('problem', 'Prednisone', 762),
              ('problem', 'Solu-Medrol', 1074),
              ('problem', 'PO2', 1217),
              ('treat', 'nebulizer', 255),
              ('treat', 'antibiotics', 266),
              ('treat', 'medication', 333),

#### Splitting Dataset into Train, Validation, and Test

In [23]:
per_train = len(doc_d) * 80 / 100
per_valid_test = len(doc_d) * 10  / 100

In [24]:
per_train

42176.0

In [25]:
per_valid_test

5272.0
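These sizes correspond to an 80/10/10 split. One way to turn them into index sets is sketched below, under the assumption that a shuffled, document-level split is wanted. `three_way_split` is a hypothetical helper, and integer arithmetic is used so the sizes come out as whole numbers.

```python
import random


def three_way_split(n_items, train_pct=80, valid_pct=10, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into train/valid/test lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_items * train_pct // 100
    n_valid = n_items * valid_pct // 100
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(52720)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Splitting by document index rather than by individual source/target pair keeps all sections of one note in the same partition.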

In [18]:
hp = "hospital_course"
hpi_p = re.compile(r"\[\*\*([^\[]*)\*\*\]", flags=0)
firstn_p = re.compile(r"\[\*\*Known firstname \d+\*\*\]", flags=0)
lastn_p = re.compile(r"\[\*\*Known lastname \d+\*\*\]", flags=0)

source_array = []
target_array = []
for idx, sec_d in tqdm(doc_d.items()):
    sections_text = get_topics_text(df_struct_text.iloc[idx]['text'])
    for section, med_res in sec_d.items():
        str_train = ""
        str_text = ""
        sec_text = sections_text[section]
        for res in firstn_p.finditer(sec_text):
            sec_text = sec_text.replace(res.group(), "<PERSON_NAME>")
        for res in lastn_p.finditer(sec_text):
            sec_text = sec_text.replace(res.group(), "<PERSON_NAME>")
        for res in hpi_p.finditer(sec_text):
            sec_text = sec_text.replace(res.group(), "")
        med_res = sorted(med_res, key=lambda tup:tup[2])
        str_train+=section+" : " if hp not in section else f"{hp} : "
        for res in med_res:
            med_type, med_term = res[0], res[1]
            med_term = med_term if len(med_term.split())<2 else '_'.join(med_term.split())
            med_type = med_type.upper() if section not in ['social_history', 'family_history'] else "GEN_MED"
            str_train+="<" + str(med_type) + "> " + str(med_term.lower()) + " </" + str(med_type) + "> "
        source_array.append(str_train)
        str_text = wordpunct_tokenize(sec_text)
        str_text = " ".join(str_text)
        str_text = str_text.lower()
        target_array.append(str_text)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52720/52720 [01:02<00:00, 847.20it/s]
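The two transformations in the cell above, masking de-identification placeholders and serializing entities into a tagged source string, can be sketched in isolation as follows. `mask_placeholders` and `build_source` are hypothetical helpers; unlike the cell, this version joins the parts with single spaces and so leaves no trailing space.

```python
import re

NAME_P = re.compile(r"\[\*\*Known (?:firstname|lastname) \d+\*\*\]")
PLACEHOLDER_P = re.compile(r"\[\*\*[^\[]*\*\*\]")


def mask_placeholders(text):
    """Replace known-name placeholders with <PERSON_NAME>, then drop all other placeholders."""
    text = NAME_P.sub("<PERSON_NAME>", text)
    return PLACEHOLDER_P.sub("", text)


def build_source(section, entities, general_sections=("social_history", "family_history")):
    """Serialize (type, term, start) tuples, ordered by offset, into a tagged prompt."""
    # Both 'hospital_course' and 'brief_hospital_course' share one prefix, as in the cell above.
    label = "hospital_course" if "hospital_course" in section else section
    parts = [f"{label} :"]
    for med_type, term, _start in sorted(entities, key=lambda item: item[2]):
        tag = "GEN_MED" if section in general_sections else med_type.upper()
        token = "_".join(term.lower().split())  # multi-word terms become one token
        parts.append(f"<{tag}> {token} </{tag}>")
    return " ".join(parts)


toy_entities = [("treat", "rt. toe amputations", 40), ("problem", "CAD", 5), ("number", "3", 60)]
print(build_source("past_medical_history", toy_entities))
print(mask_placeholders("Mr. [**Known lastname 12**] seen on [**2131-10-14**]."))
```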


In [19]:
dataset_d = {}
dataset_d['source'] = source_array
dataset_d['target'] = target_array

In [23]:
dataset_d

{'source': ['history_of_present_illness : <PROBLEM> Type_1_diabetes_mellitus </PROBLEM> <PROBLEM> neuropathy </PROBLEM> <PROBLEM> nephropathy </PROBLEM> <PROBLEM> HTN </PROBLEM> <PROBLEM> gastroparesis </PROBLEM> <PROBLEM> CKD </PROBLEM> <PROBLEM> retinopathy </PROBLEM> <PROBLEM> orthostatic_hypotension </PROBLEM> <PROBLEM> autonomic_neuropathy </PROBLEM> <PROBLEM> DKA_hospitalizations </PROBLEM> <PROBLEM> worsening_nausea </PROBLEM> <PROBLEM> vomiting </PROBLEM> <PROBLEM> coffee-ground_emesis </PROBLEM> <PROBLEM> chills </PROBLEM> <PROBLEM> dyspnea </PROBLEM> <PROBLEM> a_fall_and_hit_her_right_face </PROBLEM> <PROBLEM> diarrhea </PROBLEM> <PROBLEM> DKA </PROBLEM> <PROBLEM> RA </PROBLEM> <TEST> AG </TEST> <TEST> bicarb </TEST> <TEST> inital_vitals </TEST> <TEST> K </TEST> <TEST> HCO3 </TEST> <TREAT> her_3rd_L_NS </TREAT> <TEST> Anion_Gap </TEST> <TREAT> Insulin_srip </TREAT> <TEST> Cr. </TEST> <TEST> baseline </TEST> <PROBLEM> levemir </PROBLEM> <PROBLEM> tartrate </PROBLEM> <TEST> BPs

In [85]:
idx = 1524
print(dataset_d['source'][idx])
print("")
print(dataset_d['target'][idx])

past_medical_history : <PROBLEM> diabetes_type_one </PROBLEM> <PROBLEM> insulin_dependant_copd </PROBLEM> <PROBLEM> cad </PROBLEM> <PROBLEM> s/p_mi </PROBLEM> <PROBLEM> mi </PROBLEM> <PROBLEM> pneumonia </PROBLEM> <PROBLEM> recent_treated_pvd_esophgitis </PROBLEM> <PROBLEM> hypercholestremia </PROBLEM> <PROBLEM> cva </PROBLEM> <PROBLEM> right_sided_s/p_cabgsx4 </PROBLEM> <TREAT> rt._toe_amputations </TREAT> <PROBLEM> amputations </PROBLEM> <NUMBER> 3 </NUMBER> <NUMBER> 4 </NUMBER> 

diabetes type one , insulin dependant copd cad , s / p mi pneumonia , recent treated pvd esophgitis hypercholestremia history of cva right sided s / p cabgsx4 s / p rt . toe amputations # 3 & 4


In [67]:
print(df_struct_full.iloc[368])

age                                                                                27.0
sex                                                                                   M
allergies                                                                            []
chief_complaint                                                                      []
history_of_present_illness_problem    ['obsessive-compulsive disorder', 'depression'...
social_history                        ['Social History', 'Lives', 'himself', 'Disabi...
discharge_diagnosis                                                        ['Overdose']
brief_hospital_course_problem         ['Overdose', 'Somnolence', 'opiod ingestion', ...
hospital_course_problem                                                              []
past_medical_history_problem          ['obsessive-compulsive disorder', 'depression'...
history_of_present_illness_treat      ['Narcan', '2mg IM narcan', '2nd mg IV Narcan'...
history_of_present_illness_test 

In [66]:
print(df_mimic_full.iloc[368]['TEXT'])

{Admission Date:  [**2131-10-14**]              Discharge Date:   [**2131-10-15**]

Date of Birth:  [**2104-5-24**]             Sex:   M

Service: MEDICINE

Allergies:
Patient recorded as having No Known Allergies to Drugs

Attending: [**Last Name (NamePattern4) 290**]
Chief Complaint:
found down.

Major Surgical or Invasive Procedure:
None

History of Present Illness:
patient is a 27-year-old man with history of
obsessive-compulsive disorder and depression who presents from
home after being found down by his friend's girlfriend.
According to Friend, [**Name (NI) **] (see below) they were "partying hard"
at a friend's house and then woke up the next morning to find
[**Doctor Last Name **] as well as another friend unable to wake up. [**Doctor First Name **] believes
that [**Doctor Last Name **] took too many "opiates", because "this is what opiate
overdose looks to me." Everyone was worried about [**Doctor Last Name **] so they
called Police and the ambulance which took [**Doctor Last 

In [60]:
df_struct_full[df_struct_full.history_of_present_illness_problem.str.contains("Denies rashes or skin changes")]

Unnamed: 0,age,sex,allergies,chief_complaint,history_of_present_illness_problem,social_history,discharge_diagnosis,brief_hospital_course_problem,hospital_course_problem,past_medical_history_problem,...,past_medical_history_anat,brief_hospital_course_treat,brief_hospital_course_test,hospital_course_treat,hospital_course_test,brief_hospital_course_drugs,hospital_course_drugs,brief_hospital_course_anat,hospital_course_anat,family_history
62,85.0,M,['Codeine'],['malaise'],"['Afib', 'systolic CHF', 'recent cholecystitis...","['lives', 'daily']","['Acute cholecystitis', 'Urinary tract infecti...","['atrial fibrillation', 'systolic heart failur...",[],"['ILIAL FRACTURE', 'ATRIAL FIBRILLATION', 'B12...",...,[],"['percutaneous drainage', 'antibiotics', 'perc...","['Labs', 'RUQ ultrasound', 'Culture data', 'th...",[],[],"['antibiotics', 'Vancomycin', 'Zosyn', 'ciprof...",[],"['atrial', 'heart', 'percutaneous', 'pulmonary...",[],"['Mother', 'died', 'alzheimers', 'dementia', '..."
204,81.0,M,[],['hypotension'],"['ruq pain', 'fever', 'hypotension', 'RUQ pain...","['**', 'non-smoker', 'non-drinker']",['Abdominal Pain'],"['HOCM', 'atrial fibrillation', 'RUQ pain', 'f...",[],['Hearing impaired chronic 1st degree HB HOCM'...,...,[],"['Coumadin', 'broad spectrum antibiotics', 'br...","['RUQ u/s', 'closer monitoring', 'CT chest', '...",[],[],"['Coumadin', 'broad spectrum antibiotics', 'pr...",[],"['atrial', 'abdominal', 'bowel', 'chest', 'bow...",[],[]
322,81.0,F,[],"['Chronic Blood Loss Anemia', 'respiratory dis...","['dCHF', 'afib', 'chronic blood loss from GAVE...","['Lives', 'home', 'husband', '[**Name (NI) 113...","['GI bleed', 'pulmonary embolism', 'Chronic di...","['chronic diastolic CHF', 'atrial fibrillation...",[],"['Gastric Antral Vascular Ectasia', 'Anemia', ...",...,"['Gastric', 'Vascular', 'GI', 'heart', 'Atrial...","['Coumadin', 'endoscopic banding', 'agitated s...","['EF', 'Physical exam', 'exam', 'exam', 'chest...",[],[],"['Coumadin', 'blood', 'anticoagulation', 'bloo...",[],"['atrial', 'blood', 'vascular', 'gastric AVMs'...",[],"['siblings', 'lung cancer']"
368,27.0,M,[],[],"['obsessive-compulsive disorder', 'depression'...","['Social History', 'Lives', 'himself', 'Disabi...",['Overdose'],"['Overdose', 'Somnolence', 'opiod ingestion', ...",[],"['obsessive-compulsive disorder', 'depression'...",...,[],"['methadone', 'amphetamines', 'Amphetamine', '...","['exam', 'Urine toxicology', 'His respiratory ...",[],[],"['methadone', 'cocaine', 'amphetamines', 'Adde...",[],"['Urine', 'body', 'respiratory', 'GI', 'cell l...",[],"['OMR', 'OCD', 'FH', 'eldest', 'brother', 'con..."
390,59.0,M,['Penicillins'],['Dyspnea'],"['insulin dependent DM', 'fever', 'dyspnea', '...","['Lives', 'Unemployed', 'former merchant marin...","['Community Acquired Pneumonia', 'Asthma', 'Ob...","['insulin dependent DM', 'fever', 'dyspnea', '...",[],"['DM', 'insulin dependent', 'retinopathy', 're...",...,"['renal', 'LV', 'renal', 'glaucoma', 'lung', '...","['albuterol inhaler', '15L oxygen', 'positive ...","['Several chest x-rays', 'exam', 'repeat PFTs'...",[],[],"['insulin', 'albuterol', 'oxygen', 'diuresed',...",[],"['chest', 'body', 'pulmonary', 'elbows', 'righ...",[],"['Father', 'sister', 'brothers', 'Diabetes', '..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52316,27.0,M,['Fentanyl'],"['worsening decubitus ulcer on R elbow', 'supe...","['Syndrome', 'chronic obstructive + restrictiv...","['Lives', 'grandparents', 'brother', 'Home Hea...",['Osteomyelitis of right elbow/olecranon'],"['Syndrome', 'chronic obstructive + restrictiv...",[],"['Syndrome', 'Chronic obstructive/restrictive ...",...,['lung'],"['IV Abx', 'ventilation', 'right elbow', 'broa...","['his WBC', 'A bone scan', 'goal trough levels...",[],[],"['chronic', 'trached', 'decubitus', 'R', 'IV',...",[],"['lung', 'decubitus ulcer', 'right elbow', 'ul...",[],"['Mother', 'brother', 'Location']"
52417,65.0,M,[],"['Black stools', 'relative hypotension']","['multiple CVAs', 'CAD', 'HTN', 'anemia', 'epi...","['Smoking/Tobacco', 'years', 'drugs', 'life', ...","['acute blood loss', 'anemia severe hemorrhagi...","['mechanical AVR', 'multiple CVAs', 'black tar...",[],"['CVA', 'HTN', 'CAD', 'single vessel distal LA...",...,"['vessel', 'Duodenal ulcer']","['coumadin', 'AVR', 'multiple units pRBCs', 'F...","['Hct', 'INR', 'EGD', 'Biopsy', 'GI, stool H. ...",[],[],"['b', 'CVAs', 'coumadin', 'stools', 'b', 'bloo...",[],"['blood', 'GI', 'blood', 'blood', 'nasal', 'he...",[],"['diabetes mellitus', 'hypertension', 'dyslipi..."
52457,75.0,F,"['Codeine', 'Percocet']","['Abdominal Pain', 'Chills']","['ischemic colitis', 'thyroid cancer', 'OSA', ...","['Lives', 'Retired', 'executive director', 'or...","['Sepsis', 'Nephroliathisis Sepsis']","['ischemic colitis', 'CAD', 'OSA', 'Sepsis', '...",[],"['PAF Ischemic Colitis', 'Diastolic dysfunctio...",...,"['Aortic', 'Thyroid cancer']","['home bipap', 'emergent percutaneous nephrost...","['her blood cultures', 'her blood pressure', '...",[],[],"['bipap', 'pain', 'E.coli', 'bacteremia', 'nep...",[],"['renal stone', 'urinary tract', 'urinary trac...",[],[]
52497,70.0,F,"['Penicillins', 'Sulfonamide']","['diapheresis', 'BRBPR']","['Afib', 'dCHF', 'diapheresis', 'a large blood...","['Tobacco', 'distant history', 'Alcohol', 'dri...","['GI bleeding', 'elevated INR']","['Afib', 'dCHF', 'diapheresis', 'a frankly blo...",[],"['Paroxysmal atrial fibrillation', 'Hypertensi...",...,"['atrial', 'tricuspid', 'mitral', 'Lateral wal...","['metop', 'dilt', 'coumadin']",['an INR'],[],[],"['Afib', 'metop', 'dilt', 'coumadin', 'blood',...",[],"['bowel', 'bowel', 'blood', 'blood']",[],"['Mother', 'died', 'age', '**', 'Age', '90 **'..."


In [83]:
hdataset = Dataset.from_dict(dataset_d)

In [84]:
len(hdataset)

219705

In [85]:
hdataset = hdataset.train_test_split(test_size=0.1)

In [86]:
hdataset

DatasetDict({
    train: Dataset({
        features: ['source', 'target'],
        num_rows: 197734
    })
    test: Dataset({
        features: ['source', 'target'],
        num_rows: 21971
    })
})
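
The split sizes above are consistent with rounding the test fraction up and giving the remainder to the training split. A minimal sketch of that arithmetic, assuming ceiling rounding (the helper `split_sizes` is hypothetical and is not part of `datasets`):

```python
import math


def split_sizes(n_rows: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test split is rounded up and the training split
    receives whatever is left.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(219705, 0.1)
print(n_train, n_test)
```

Because 10% of 219,705 is not a whole number, one of the two splits has to absorb the extra row; here it is the test split.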

In [4]:
with open(d_path / 'hdataset_v2.pkl', 'rb') as f:
    hdataset = pickle.load(f)

In [87]:
model_id = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /home/leolara/.cache/huggingface/hub/models--t5-base/snapshots/23aa4f41cb7c08d4b05c8f327b22bfa0eb8c7ad9/config.json
Model config T5Config {
  "_name_or_path": "t5-base",
  "architectures": [
    "T5WithLMHeadModel"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dense_act_fn": "relu",
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "relu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": false,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "n_positions": 512,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "task_specific_params": {
    "summarization": {
      "early_stopping": true,
      "length_penalty

In [193]:
prefix = "summarize: "
max_input_length = 512
max_target_length = 512

def preprocess_data(examples):
    inputs = [prefix + text for text in examples["source"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Tokenize the targets. `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager in recent transformers releases.
    labels = tokenizer(text_target=examples["target"],
                       max_length=max_target_length,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
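
To make the effect of the prefix and of `truncation=True` concrete without loading the real tokenizer, here is a toy sketch. The whitespace tokenizer `toy_tokenize`, the helper `toy_preprocess`, and the sample batch are all invented for illustration; the actual T5 tokenizer works on SentencePiece subwords and appends an end-of-sequence token, so real lengths will differ.

```python
PREFIX = "summarize: "


def toy_tokenize(text: str, max_length: int) -> list[str]:
    """Hypothetical stand-in for the tokenizer: split on whitespace, then truncate."""
    return text.split()[:max_length]


def toy_preprocess(examples: dict, max_input_length: int, max_target_length: int) -> dict:
    """Mirror the shape of preprocess_data on a batch given as a dict of lists."""
    inputs = [PREFIX + text for text in examples["source"]]
    return {
        "input_ids": [toy_tokenize(t, max_input_length) for t in inputs],
        "labels": [toy_tokenize(t, max_target_length) for t in examples["target"]],
    }


# Made-up batch, only used to show the truncation behaviour.
batch = {
    "source": ["patient admitted with chest pain and dyspnea"],
    "target": ["chest pain dyspnea"],
}
out = toy_preprocess(batch, max_input_length=4, max_target_length=2)
print(out["input_ids"][0])  # the prefix counts toward the input budget
print(out["labels"][0])
```

The point to take away is that the task prefix consumes part of the `max_input_length` budget, so long sources lose slightly more of their tail than the raw limit suggests.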

In [89]:
datasets_token = hdataset.map(preprocess_data, batched=True)

  0%|          | 0/198 [00:00<?, ?ba/s]

  0%|          | 0/22 [00:00<?, ?ba/s]

In [90]:
datasets_token

DatasetDict({
    train: Dataset({
        features: ['source', 'target', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 197734
    })
    test: Dataset({
        features: ['source', 'target', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 21971
    })
})

In [91]:
datasets_token = datasets_token.remove_columns(["source", "target"])

In [92]:
# with open(d_path / 'datasets_token_v3.pkl', 'wb') as f:
#     pickle.dump(datasets_token, f)

## Train Model  

In [9]:
# with open(d_path / 'datasets_token_full.pkl', 'rb') as f:
#     datasets_token = pickle.load(f)

In [4]:
# with open(d_path / 'datasets_token_v2.pkl', 'rb') as f:
#     datasets_token = pickle.load(f)

In [3]:
with open(d_path / 'datasets_token_v3.pkl', 'rb') as f:
    datasets_token = pickle.load(f)

In [7]:
datasets_token['train'][1]

{'input_ids': [21603,
  10,
  2833,
  834,
  19221,
  3,
  10,
  3,
  2,
  567,
  6122,
  12920,
  3155,
  2775,
  3,
  2,
  87,
  567,
  6122,
  12920,
  3155,
  3,
  2,
  567,
  6122,
  12920,
  3155,
  220,
  3,
  2,
  87,
  567,
  6122,
  12920,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  78,
  115,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  8946,
  834,
  75,
  4607,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  3,
  6471,
  88,
  32,
  13711,
  1436,
  17,
  159,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  4923,
  834,
  26836,
  834,
  15,
  1778,
  9,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  3,
  107,
  17,
  29,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,
  17618,
  8775,
  6037,
  3155,
  29867,
  3,
  2,
  87,
  17618,
  8775,
  6037,
  3155,
  3,
  2,


In [6]:
# Hold out 21,929 shuffled training rows, then keep a random 2,000 of them as the validation set.
datasets_tmp = datasets_token['train'].shuffle().train_test_split(test_size=21929)
datasets_token['train'] = datasets_tmp['train']
datasets_token['validation'] = datasets_tmp['test']
datasets_token['validation'] = datasets_token['validation'].shuffle().select(range(2000))
# datasets_token['validation'] = datasets_token['train'].shuffle().select(range(1000))
# datasets_token['train'] = datasets_token['train'].shuffle()
# datasets_token['test'] = datasets_token['test'].shuffle()
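
The resulting row counts can be checked with plain index arithmetic. This is a sketch on integer indices rather than on the real `Dataset`; `carve_validation` is a hypothetical helper, and the fixed seed is an assumption added for reproducibility (the cell above shuffles without one).

```python
import random


def carve_validation(n_train: int, holdout: int, keep: int, seed: int = 0):
    """Split row indices into train and validation, then subsample validation."""
    rng = random.Random(seed)
    indices = list(range(n_train))
    rng.shuffle(indices)
    # The first `holdout` shuffled rows leave the training split.
    val_pool, train_idx = indices[:holdout], indices[holdout:]
    # Only `keep` of the held-out rows are actually used for validation.
    val_idx = rng.sample(val_pool, keep)
    return train_idx, val_idx


train_idx, val_idx = carve_validation(197734, holdout=21929, keep=2000)
print(len(train_idx), len(val_idx))
```

Note that the 19,929 held-out rows that are not selected for validation end up in neither split, so the training set is smaller than it needs to be for a 2,000-row validation set.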

In [6]:
datasets_token

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 175805
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 21971
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [7]:
model_id = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

#### Setup Huggingface Hub

In [8]:
notebook_login()

Token is valid.
Your token has been saved to /home/leolara/.huggingface/token
Login successful


In [8]:
# MLFLOW
os.environ["MLFLOW_EXPERIMENT_NAME"]="trainer-medical-conditional-gen"
os.environ["MLFLOW_FLATTEN_PARAMS"]="1"
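
`MLFLOW_FLATTEN_PARAMS` asks the Transformers MLflow integration to flatten nested parameter dictionaries before logging them, since MLflow parameters are flat key/value pairs. A rough sketch of what such flattening looks like; `flatten_params` is a hypothetical helper written for this note, and the `.` separator is an assumption about the resulting key format.

```python
def flatten_params(params: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level of dotted keys."""
    flat = {}
    for key, value in params.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the key path along.
            flat.update(flatten_params(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Nested fragment taken from the T5 config printed earlier in this notebook.
nested = {"task_specific_params": {"summarization": {"early_stopping": True}}}
print(flatten_params(nested))
```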

In [9]:
model_name = "t5-base-medical-conditional-gen"
model_dir = m_path

In [197]:
args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=1000,
    logging_strategy="steps",
    logging_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    seed=1213,
    warmup_steps=500,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    eval_accumulation_steps=2,
    gradient_checkpointing=True,
    learning_rate=1e-3,
    weight_decay=0.01,
    save_total_limit=3,
    fp16=False,
    num_train_epochs=3,
    predict_with_generate=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    report_to=["tensorboard", "mlflow"]
)
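
The batch-related arguments above determine how many optimizer updates a run performs. The sketch below reproduces the "Total optimization steps" values that appear in the training logs of this notebook. It assumes a single device and that the last partial batch is kept; `total_optimization_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def total_optimization_steps(
    num_examples: int,
    per_device_batch_size: int,
    grad_accum_steps: int,
    num_epochs: int,
    num_devices: int = 1,
) -> int:
    """Estimate the number of optimizer updates for a full training run."""
    # Batches produced by the dataloader in one epoch (last partial batch kept).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return math.ceil(num_epochs * updates_per_epoch)


print(total_optimization_steps(197354, 4, 2, 3))
print(total_optimization_steps(175425, 4, 2, 3))
```

With `per_device_train_batch_size=4` and `gradient_accumulation_steps=2`, each update sees an effective batch of 8 examples, which matches the "Total train batch size" line in the logs.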

In [12]:
# decay_parameters = get_parameter_names(model, [nn.LayerNorm])
# decay_parameters = [name for name in decay_parameters if "bias" not in name]
# optimizer_grouped_parameters = [
#     {
#         "params": [p for n, p in model.named_parameters() if n in decay_parameters],
#         "weight_decay": args.weight_decay,
#     },
#     {
#         "params": [p for n, p in model.named_parameters() if n not in decay_parameters],
#         "weight_decay": 0.0,
#     },
# ]

# optimizer_kwargs = {
#     "betas": (args.adam_beta1, args.adam_beta2),
#     "eps": args.adam_epsilon,
# }
# optimizer_kwargs["lr"] = args.learning_rate
# adam_bnb_optim = bnb.optim.Adam8bit(
#     optimizer_grouped_parameters,
#     betas=(args.adam_beta1, args.adam_beta2),
#     eps=args.adam_epsilon,
#     lr=args.learning_rate,
# )

In [198]:
data_collator = DataCollatorForSeq2Seq(tokenizer)
metric = load_metric("rouge")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # rougeLSum expects newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results from ROUGE
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds
    ]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

# def preprocess_logits_for_metrics(logits, labels):
#     """
#     Original Trainer may have a memory leak. 
#     This is a workaround to avoid storing too many tensors that are not needed.
#     """
#     pred_ids = torch.argmax(logits[0], dim=-1)
#     return pred_ids, labels
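
For intuition about the `rouge1` value returned by `compute_metrics`, here is a simplified unigram-overlap F-measure. It is only a sketch: it lowercases and splits on whitespace, with no stemming or special tokenization, so its numbers will not match the `rouge` metric exactly. `rouge1_f` and the example strings are made up for illustration.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure based on clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f("chronic blood loss anemia", "blood loss anemia")
print(round(score * 100, 4))
```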



In [199]:
# Start TensorBoard before training to monitor it in progress
%load_ext tensorboard
%tensorboard --logdir '{model_dir}'/runs

In [14]:
!cat /home/leolara/miniconda3/envs/cond_medtrix_env/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py

# coding=utf-8
# Copyright 2018 Mesh TensorFlow authors, T5 Authors and HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch T5 model."""


import copy
import math
import os
from typing import Optional, Tuple, Union

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from torch.utils.checkpoint import checkpoint

from ...activations import ACT2FN
from ...modeling_outputs import (
    BaseModelOutput,
    BaseModelOutp

In [14]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=datasets_token["train"],
    eval_dataset=datasets_token["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [14]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 197354
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 74007
  Number of trainable parameters = 222903552
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
500,2.6526,1.908129,29.3387,15.5442,28.1086,28.8997,17.123
1000,1.9508,1.723784,32.1409,18.3835,31.0739,31.7211,17.115
1500,1.8157,1.628859,33.1003,19.5103,31.8553,32.5703,17.074
2000,1.7254,1.585357,32.963,19.4031,31.7364,32.5402,17.146
2500,1.7098,1.541533,33.0654,19.5978,31.867,32.6001,17.022
3000,1.6545,1.514103,32.8748,19.1617,31.3918,32.3067,17.17
3500,1.6266,1.489336,32.1245,19.1148,31.0063,31.633,17.241
4000,1.6039,1.475301,32.7564,19.5552,31.7834,32.3259,17.024
4500,1.5857,1.451842,32.0244,19.2718,31.1307,31.6227,17.139
5000,1.5644,1.440377,32.3229,19.7501,31.4389,32.0447,17.321
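
With `metric_for_best_model="rouge1"`, the best checkpoint is picked by ROUGE-1 (higher is better) rather than by validation loss. In the ten evaluations above the two criteria disagree, which the following sketch makes explicit. The rows are copied from the table, keeping only the columns needed here.

```python
import csv
import io

# Subset of the evaluation table above (step, losses, ROUGE-1).
LOG = """Step,Training Loss,Validation Loss,Rouge1
500,2.6526,1.908129,29.3387
1000,1.9508,1.723784,32.1409
1500,1.8157,1.628859,33.1003
2000,1.7254,1.585357,32.963
2500,1.7098,1.541533,33.0654
3000,1.6545,1.514103,32.8748
3500,1.6266,1.489336,32.1245
4000,1.6039,1.475301,32.7564
4500,1.5857,1.451842,32.0244
5000,1.5644,1.440377,32.3229
"""

rows = list(csv.DictReader(io.StringIO(LOG)))
best_rouge = max(rows, key=lambda r: float(r["Rouge1"]))
best_loss = min(rows, key=lambda r: float(r["Validation Loss"]))
print("best ROUGE-1 at step", best_rouge["Step"])
print("lowest validation loss at step", best_loss["Step"])
```

Validation loss keeps falling through step 5000 while ROUGE-1 peaks at step 1500 and then fluctuates, so the choice of selection metric changes which checkpoint is restored at the end of training.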


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-600] due 

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-4500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-4500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-4500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-4500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-4500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-4500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-3500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you 

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-8500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-8500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-8500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-8500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-8500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-8500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-7500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you 

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-12500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-12500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-12500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-12500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-12500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-12500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-11500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-16500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-16500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-16500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-16500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-16500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-16500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-15500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-20500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-20500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-20500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-20500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-20500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-20500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-19500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-24500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-24500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-24500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-24500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-24500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-24500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-23000] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-28500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-28500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-28500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-28500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-28500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-28500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-27500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-32500
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-32500/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-32500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-32500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-32500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-32500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-31000] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`

KeyboardInterrupt: 

In [13]:
trainer.train(resume_from_checkpoint=True)

Loading model from /home/leolara/Development/medtrix/model/checkpoint-34500.
The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 197354
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 74007
  Number of trainable parameters = 222903552
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 1
  Continuing training from global step 34500
  Will skip the first 1 epochs then the first 19662 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the trainin

  0%|          | 0/19662 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Gen Len
35000,1.2147,1.166388,31.5912,19.2951,30.5599,31.0977,17.307
35500,1.2423,1.166346,31.1549,19.3394,30.2318,30.7429,17.406
36000,1.2201,1.165733,31.0839,19.1082,30.1731,30.7614,17.292
36500,1.2397,1.15977,31.2699,19.371,30.199,30.8695,17.33
37000,1.2301,1.158066,30.8144,19.0157,29.7702,30.3345,17.392
37500,1.2319,1.154609,31.0135,19.1902,29.8884,30.5618,17.432
38000,1.2371,1.153603,31.2083,19.2693,30.1969,30.7897,17.308
38500,1.235,1.150211,31.2003,19.566,30.1851,30.6933,17.459
39000,1.2252,1.147245,30.9767,19.1009,29.9254,30.5488,17.412
39500,1.2192,1.148044,30.7237,18.8142,29.6685,30.2846,17.433


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-35000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-35000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-35000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-35000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-35000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-35000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoi

Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-37500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-39000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-39000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-39000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-39000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-39000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtr

Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-42500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-41500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-43000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-43000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-43000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-43000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/ch

Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-46500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-46500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-45500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-47000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-47000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-47000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/

tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-50500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-50500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-50500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-49500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-51000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-51000/config.json
Model weights saved in /home/leolara/Development/medtrix/mo

Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-54500/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-54500/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-54500/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-54500/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-53500] due to args.save_total_limit
The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: source, target. If source, target are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-55000
Configuration saved in /home/leolara/Development/medt

KeyboardInterrupt: 

In [17]:
mlflow.end_run()

In [18]:
!mlflow ui

[2022-12-01 09:36:20 -0300] [149698] [INFO] Starting gunicorn 20.1.0
[2022-12-01 09:36:20 -0300] [149698] [INFO] Listening at: http://127.0.0.1:5000 (149698)
[2022-12-01 09:36:20 -0300] [149698] [INFO] Using worker: sync
[2022-12-01 09:36:20 -0300] [149699] [INFO] Booting worker with pid: 149699
[2022-12-01 09:36:20 -0300] [149700] [INFO] Booting worker with pid: 149700
[2022-12-01 09:36:20 -0300] [149701] [INFO] Booting worker with pid: 149701
[2022-12-01 09:36:20 -0300] [149702] [INFO] Booting worker with pid: 149702
^C
[2022-12-01 21:17:18 -0300] [149698] [INFO] Handling signal: int
[2022-12-01 21:17:18 -0300] [149699] [INFO] Worker exiting (pid: 149699)
[2022-12-01 21:17:18 -0300] [149702] [INFO] Worker exiting (pid: 149702)
[2022-12-01 21:17:18 -0300] [149701] [INFO] Worker exiting (pid: 149701)
[2022-12-01 21:17:18 -0300] [149700] [INFO] Worker exiting (pid: 149700)


In [16]:
trainer.train()

***** Running training *****
  Num examples = 175425
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 65784
  Number of trainable parameters = 222903552
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 100 | No log | 2.204319 | 30.3184 | 15.8135 | 29.1973 | 29.8619 | 16.884 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-100
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-100/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-100/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-100/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-100/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-100/spiece.model


KeyboardInterrupt: 
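
The `Total optimization steps` figure in the log above can be reproduced from the other logged values. The sketch below assumes a single device and that the last partial batch of each epoch is kept. It is an illustration of the arithmetic, not the Trainer's own code, and `total_optimization_steps` is a hypothetical helper.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size,
                             grad_accum_steps, num_epochs):
    # Batches per epoch: the final partial batch is kept, hence the ceiling.
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # One optimizer update per `grad_accum_steps` batches (at least one).
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


print(total_optimization_steps(175425, 4, 2, 3))  # 65784, as logged above
print(total_optimization_steps(175805, 4, 2, 3))  # 65928, the 175805-example run
```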

In [16]:
trainer.train(resume_from_checkpoint=True)

Loading model from /home/leolara/Development/medtrix/model/checkpoint-100.
***** Running training *****
  Num examples = 175425
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 65784
  Number of trainable parameters = 222903552
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 0
  Continuing training from global step 100
  Will skip the first 0 epochs then the first 200 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/200 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1000 | 2.029 | 1.719139 | 29.6967 | 17.5678 | 28.9296 | 29.3885 | 17.085 |
| 2000 | 1.7757 | 1.583866 | 31.6119 | 18.8827 | 30.3476 | 31.1637 | 17.2695 |
| 3000 | 1.6772 | 1.53109 | 30.6637 | 18.7662 | 29.6976 | 30.304 | 17.3035 |
| 4000 | 1.6183 | 1.486046 | 31.1141 | 18.9928 | 30.1935 | 30.7254 | 17.205 |
| 5000 | 1.5606 | 1.451454 | 30.5356 | 18.6368 | 29.6076 | 30.1959 | 17.376 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-1000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-1000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-1000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-1000/spiece.model
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-2000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-2000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-2000/pytorch_model.bin
tokenizer co

KeyboardInterrupt: 
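
With `resume_from_checkpoint=True`, the log shows the Trainer loading the most recent `checkpoint-<step>` directory and then skipping the batches already consumed: 200 batches for global step 100, since each optimization step covers 2 accumulated batches. A minimal standard-library sketch of both ideas follows. The helpers `latest_checkpoint` and `batches_to_skip` are hypothetical, and the batch count only holds while the saved step is still inside the first epoch, as it is here. `transformers.trainer_utils.get_last_checkpoint` offers similar directory discovery.

```python
import re
from pathlib import Path

_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best = None
    for child in Path(output_dir).iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        if child.is_dir() and match:
            step = int(match.group(1))
            # Compare numerically: checkpoint-1000 is newer than checkpoint-50.
            if best is None or step > best[0]:
                best = (step, child)
    return None if best is None else best[1]


def batches_to_skip(global_step, grad_accum_steps):
    """Batches already consumed within the first epoch at `global_step`."""
    return global_step * grad_accum_steps


print(batches_to_skip(100, 2))  # 200, matching the log above
```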

In [15]:
trainer.train()

***** Running training *****
  Num examples = 175805
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 65928
  Number of trainable parameters = 222903552
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 50 | 5.4776 | 2.568689 | 26.0552 | 13.4821 | 25.5413 | 25.7408 | 18.881 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-50
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-50/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-50/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-50/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-50/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-50/spiece.model


KeyboardInterrupt: 

In [15]:
trainer.train(resume_from_checkpoint=True)

Loading model from /home/leolara/Development/medtrix/model/checkpoint-50.
***** Running training *****
  Num examples = 175805
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 65928
  Number of trainable parameters = 222903552
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 0
  Continuing training from global step 50
  Will skip the first 0 epochs then the first 100 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/100 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 1000 | 1.9958 | 1.649836 | 33.575 | 19.5424 | 32.7121 | 33.1484 | 17.003 |
| 2000 | 1.6959 | 1.532891 | 35.264 | 21.6419 | 34.5035 | 34.9493 | 16.933 |
| 3000 | 1.5933 | 1.456939 | 34.7654 | 21.411 | 33.8664 | 34.385 | 17.1825 |
| 4000 | 1.5463 | 1.439317 | 34.5263 | 21.2605 | 33.7307 | 34.2234 | 17.0555 |
| 5000 | 1.5046 | 1.399152 | 35.1867 | 21.779 | 34.3087 | 34.8739 | 17.025 |
| 6000 | 1.4749 | 1.371958 | 34.9534 | 22.041 | 34.1509 | 34.6185 | 17.1995 |
| 7000 | 1.4311 | 1.357506 | 35.4737 | 22.1702 | 34.621 | 35.1346 | 17.0125 |
| 8000 | 1.4262 | 1.338436 | 34.7299 | 21.6533 | 33.9163 | 34.3413 | 17.1635 |
| 9000 | 1.4106 | 1.324373 | 35.4798 | 22.3669 | 34.5563 | 35.1176 | 17.126 |
| 10000 | 1.3984 | 1.314523 | 35.3938 | 22.5142 | 34.5378 | 35.0736 | 17.1115 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-1000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-1000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-1000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-1000/spiece.model
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-2000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-2000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-2000/pytorch_model.bin
tokenizer co

  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-12000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-12000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-12000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-12000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-12000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-12000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-9000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-13000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-13000/config.json
Model weights saved in /home/leolara/Deve

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
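
The error message suggests `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of setting it from inside the notebook, assuming the cell runs before `torch` is imported (in practice, right after a kernel restart). Synchronous kernel launches slow training down, so this is meant for a short debugging run only.

```python
import os

# Must be set before the first CUDA call, so set it before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set
```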

In [15]:
trainer.train(resume_from_checkpoint=True)

Loading model from /home/leolara/Development/medtrix/model/checkpoint-16000.
***** Running training *****
  Num examples = 175805
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 2
  Total optimization steps = 65928
  Number of trainable parameters = 222903552
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 0
  Continuing training from global step 16000
  Will skip the first 0 epochs then the first 32000 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.


  0%|          | 0/32000 [00:00<?, ?it/s]

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum | Gen Len |
|---|---|---|---|---|---|---|---|
| 17000 | 1.3042 | 1.214565 | 34.5267 | 22.6801 | 33.6392 | 34.1023 | 17.2865 |
| 18000 | 1.2943 | 1.211974 | 34.2465 | 22.5321 | 33.377 | 33.8562 | 17.283 |
| 19000 | 1.292 | 1.210732 | 34.3945 | 22.5496 | 33.6036 | 34.0554 | 17.3105 |
| 20000 | 1.2875 | 1.202308 | 34.2987 | 22.2562 | 33.4316 | 33.9395 | 17.4085 |
| 21000 | 1.2801 | 1.197444 | 34.3525 | 22.6683 | 33.6384 | 34.0493 | 17.244 |
| 22000 | 1.2616 | 1.199157 | 34.4826 | 22.6383 | 33.6795 | 34.1345 | 17.309 |
| 23000 | 1.2658 | 1.188232 | 34.3028 | 22.6874 | 33.4383 | 33.9302 | 17.3955 |
| 24000 | 1.2606 | 1.186295 | 34.4494 | 22.5168 | 33.5796 | 34.0228 | 17.382 |
| 25000 | 1.2404 | 1.181284 | 34.5358 | 22.7448 | 33.7418 | 34.1946 | 17.298 |
| 26000 | 1.2396 | 1.176394 | 34.3579 | 22.4678 | 33.5047 | 33.9989 | 17.4815 |


***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-17000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-17000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-17000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-17000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-17000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-17000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-15000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-18000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-18000

Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-27000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-25000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-28000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-28000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-28000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-28000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-28000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-28000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-26000] due to args.save_total_limit
***** Running Evaluation *****

tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-38000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-38000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-38000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-36000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-39000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-39000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-39000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-39000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-39000/special_tokens_map.json
Copy vocab fi

Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-49000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-49000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-49000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-49000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-49000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-47000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-50000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-50000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-50000/pytorch_model.bin
tokenizer config file saved in /home/leolar

***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-60000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-60000/config.json
Model weights saved in /home/leolara/Development/medtrix/model/checkpoint-60000/pytorch_model.bin
tokenizer config file saved in /home/leolara/Development/medtrix/model/checkpoint-60000/tokenizer_config.json
Special tokens file saved in /home/leolara/Development/medtrix/model/checkpoint-60000/special_tokens_map.json
Copy vocab file to /home/leolara/Development/medtrix/model/checkpoint-60000/spiece.model
Deleting older checkpoint [/home/leolara/Development/medtrix/model/checkpoint-58000] due to args.save_total_limit
***** Running Evaluation *****
  Num examples = 2000
  Batch size = 4
Saving model checkpoint to /home/leolara/Development/medtrix/model/checkpoint-61000
Configuration saved in /home/leolara/Development/medtrix/model/checkpoint-61000

TrainOutput(global_step=65928, training_loss=0.8951360450006778, metrics={'train_runtime': 67254.7625, 'train_samples_per_second': 7.842, 'train_steps_per_second': 0.98, 'total_flos': 2.924692728192e+17, 'train_loss': 0.8951360450006778, 'epoch': 3.0})
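
As a sanity check, the throughput figures in `TrainOutput` are consistent with dividing the full three-epoch totals by `train_runtime`. The sketch below only redoes that arithmetic with values copied from the logs above. Because this call resumed from a checkpoint, the runtime may not reflect the wall-clock cost of the whole training.

```python
# Values copied from the training log and the TrainOutput above.
train_runtime = 67254.7625  # seconds
global_step = 65928
num_examples = 175805
num_epochs = 3

hours = train_runtime / 3600
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = global_step / train_runtime

print(f"runtime: {hours:.2f} h")
print(f"samples/s: {samples_per_second:.3f}")  # reported: 7.842
print(f"steps/s: {steps_per_second:.2f}")      # reported: 0.98
```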

## Continuous Machine Learning