### The goal of this notebook is to classify medical transcriptions into medical specialties. The scope of work is as follows:
1. Load the datasets provided by the client
2. Preprocess the data using sentence segmentation
  * Data cleaning - determining which words/sentences are important
  * Tokenize - sentence splitting
  * Lower case
  * Stemming - stripping suffixes (e.g. tense endings) to reduce words to a stem
  * Stop words - removing a, the, and, as, etc.
  * Lemmatization - reducing words to their dictionary form
3. Use feature engineering to vectorize the transcriptions
  * TF-IDF, word2vec, etc.
4. Determine and build the best model for the training set
  * Random forest, CNN, RNN, Naive Bayes, etc.
5. Evaluate the model based on an F1 score
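
As a rough illustration of steps 3 to 5, here is a minimal baseline sketch that vectorizes text with TF-IDF, fits a Naive Bayes classifier, and scores it with macro F1. The toy texts and labels are made up for the example and are not taken from the client data; the pipeline is one possible baseline, not the model chosen later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for transcriptions and their specialties.
train_texts = [
    "laparoscopic cholecystectomy performed under general anesthesia",
    "incision closed with sutures after the appendectomy",
    "chest x-ray shows no acute infiltrate",
    "mri of the brain without contrast is unremarkable",
]
train_labels = ["Surgery", "Surgery", "Radiology", "Radiology"]
test_texts = [
    "appendectomy incision closed with sutures",
    "mri of the chest without contrast",
]
test_labels = ["Surgery", "Radiology"]

# Step 3 (vectorize) and step 4 (model) chained into one estimator.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
baseline.fit(train_texts, train_labels)

# Step 5: macro F1 averages the per-class F1 scores with equal weight.
predicted = baseline.predict(test_texts)
macro_f1 = f1_score(test_labels, predicted, average="macro")
print(list(predicted), macro_f1)
```

Swapping `MultinomialNB()` for another scikit-learn classifier keeps the rest of the sketch unchanged, which makes it easy to compare candidate models on the same features.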


### 1 - Reading Data

In [None]:
import pandas as pd

train_set = 'https://raw.githubusercontent.com/jkeomany/DS_Hackathon_2023/main/new_train.csv'
test_set = 'https://raw.githubusercontent.com/jkeomany/DS_Hackathon_2023/main/new_test.csv'
train_df = pd.read_csv(train_set, index_col = 0)
test_df = pd.read_csv(test_set, index_col = 0)

train_df.head()

Unnamed: 0,medical_specialty,transcription,labels
0,Emergency Room Reports,"REASON FOR THE VISIT:, Very high PT/INR.,HIST...",0
1,Surgery,"PREOPERATIVE DIAGNOSIS:, Acetabular fracture ...",1
2,Surgery,"NAME OF PROCEDURE,1. Selective coronary angio...",1
3,Radiology,"REFERRING DIAGNOSIS: , Motor neuron disease.,P...",2
4,Emergency Room Reports,"CHIEF COMPLAINT: , Dental pain.,HISTORY OF PRE...",0


### Train Set Label Distribution

In [None]:
train_df["medical_specialty"].value_counts()

 Surgery                          863
 Consult - History and Phy.       410
 Cardiovascular / Pulmonary       309
 Orthopedic                       289
 Radiology                        213
 General Medicine                 209
 Gastroenterology                 176
 Neurology                        170
 SOAP / Chart / Progress Notes    135
 Urology                          134
 Obstetrics / Gynecology          123
 Discharge Summary                 87
 ENT - Otolaryngology              82
 Neurosurgery                      71
 Hematology - Oncology             68
 Ophthalmology                     67
 Emergency Room Reports            63
 Nephrology                        63
 Pediatrics - Neonatal             55
 Pain Management                   54
 Psychiatry / Psychology           45
 Office Notes                      38
 Podiatry                          35
 Dermatology                       21
 Dentistry                         21
 Cosmetic / Plastic Surgery        19
 Letters    

### Sample Transcription

In [None]:
from pprint import pprint
pprint(train_df.transcription[0])

('REASON FOR THE VISIT:,  Very high PT/INR.,HISTORY: , The patient is an '
 '81-year-old lady whom I met last month when she came in with pneumonia and '
 'CHF.  She was noticed to be in atrial fibrillation, which is a chronic '
 'problem for her.  She did not want to have Coumadin started because she said '
 'that she has had it before and the INR has had been very difficult to '
 'regulate to the point that it was dangerous, but I convinced her to restart '
 'the Coumadin again.  I gave her the Coumadin as an outpatient and then the '
 'INR was found to be 12.  So, I told her to come to the emergency room to get '
 'vitamin K to reverse the anticoagulation.,PAST MEDICAL HISTORY:,1.  '
 'Congestive heart failure.,2.  Renal insufficiency.,3.  Coronary artery '
 'disease.,4.  Atrial fibrillation.,5.  COPD.,6.  Recent pneumonia.,7.  '
 'Bladder cancer.,8.  History of ruptured colon.,9.  Myocardial '
 'infarction.,10.  Hernia repair.,11.  Colon resection.,12.  Carpal tunnel '
 'repair.,13

# 2 - Preprocessing

### The following uses spaCy and NLTK to make the transcription data usable. Our first preprocessing stage assumes that medical text needs to be tokenized, lemmatized, and lower cased, with optional punctuation removal. After analyzing the model, we will iteratively determine whether certain data cleaning features should be added or removed.

Preprocessing features include


1.   Lower Case
2.   Tokenization
3.   Lemmatization
4.   [Optional] Punctuation Removal -> Modify the tokenizer call marked `***` in `preprocess_document`



In [None]:
# download the punkt model from the nltk library to use the word tokenizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

lower_and_tokenize function

In [None]:
# lower_and_tokenize(document) turns all the words of the document lower case
#   and separates the string into a list of individual tokens (words,
#   punctuation, etc.)
# Returns: List of tokens (strings)
def lower_and_tokenize(document):
  lower_case_doc=document.lower()
  return nltk.word_tokenize(lower_case_doc)

Test lower_and_tokenize

In [None]:
print(lower_and_tokenize(train_df.transcription[0]))

['reason', 'for', 'the', 'visit', ':', ',', 'very', 'high', 'pt/inr.', ',', 'history', ':', ',', 'the', 'patient', 'is', 'an', '81-year-old', 'lady', 'whom', 'i', 'met', 'last', 'month', 'when', 'she', 'came', 'in', 'with', 'pneumonia', 'and', 'chf', '.', 'she', 'was', 'noticed', 'to', 'be', 'in', 'atrial', 'fibrillation', ',', 'which', 'is', 'a', 'chronic', 'problem', 'for', 'her', '.', 'she', 'did', 'not', 'want', 'to', 'have', 'coumadin', 'started', 'because', 'she', 'said', 'that', 'she', 'has', 'had', 'it', 'before', 'and', 'the', 'inr', 'has', 'had', 'been', 'very', 'difficult', 'to', 'regulate', 'to', 'the', 'point', 'that', 'it', 'was', 'dangerous', ',', 'but', 'i', 'convinced', 'her', 'to', 'restart', 'the', 'coumadin', 'again', '.', 'i', 'gave', 'her', 'the', 'coumadin', 'as', 'an', 'outpatient', 'and', 'then', 'the', 'inr', 'was', 'found', 'to', 'be', '12.', 'so', ',', 'i', 'told', 'her', 'to', 'come', 'to', 'the', 'emergency', 'room', 'to', 'get', 'vitamin', 'k', 'to', 'rev

lower_punct_tokenize function

In [None]:
from nltk.tokenize import RegexpTokenizer

# lower_punct_tokenize(document) turns all the words of the document lower
#   case, removes all punctuation, and separates the string into a list of
#   individual word tokens
# Returns: List of tokens (strings)
def lower_punct_tokenize(document):
  tokenizer = RegexpTokenizer(r'\w+')
  lower_case_doc=document.lower()
  return tokenizer.tokenize(lower_case_doc)

lower_punct_tokenize function test

In [None]:
print(lower_punct_tokenize(train_df.transcription[0]))

['reason', 'for', 'the', 'visit', 'very', 'high', 'pt', 'inr', 'history', 'the', 'patient', 'is', 'an', '81', 'year', 'old', 'lady', 'whom', 'i', 'met', 'last', 'month', 'when', 'she', 'came', 'in', 'with', 'pneumonia', 'and', 'chf', 'she', 'was', 'noticed', 'to', 'be', 'in', 'atrial', 'fibrillation', 'which', 'is', 'a', 'chronic', 'problem', 'for', 'her', 'she', 'did', 'not', 'want', 'to', 'have', 'coumadin', 'started', 'because', 'she', 'said', 'that', 'she', 'has', 'had', 'it', 'before', 'and', 'the', 'inr', 'has', 'had', 'been', 'very', 'difficult', 'to', 'regulate', 'to', 'the', 'point', 'that', 'it', 'was', 'dangerous', 'but', 'i', 'convinced', 'her', 'to', 'restart', 'the', 'coumadin', 'again', 'i', 'gave', 'her', 'the', 'coumadin', 'as', 'an', 'outpatient', 'and', 'then', 'the', 'inr', 'was', 'found', 'to', 'be', '12', 'so', 'i', 'told', 'her', 'to', 'come', 'to', 'the', 'emergency', 'room', 'to', 'get', 'vitamin', 'k', 'to', 'reverse', 'the', 'anticoagulation', 'past', 'medica
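
To make the effect of the `\w+` pattern easier to see, here is a small sketch on a short string. The sample sentence is adapted from the first transcription; note how `PT/INR` and `81-year-old` are split into separate tokens, which may or may not be desirable for medical text.

```python
from nltk.tokenize import RegexpTokenizer

# Short sample adapted from the first transcription.
sample = "Very high PT/INR. The patient is an 81-year-old lady."

# \w+ keeps runs of letters, digits, and underscores, so every
# punctuation character acts as a separator and is dropped.
word_only = RegexpTokenizer(r"\w+")
tokens = word_only.tokenize(sample.lower())
print(tokens)
# ['very', 'high', 'pt', 'inr', 'the', 'patient', 'is', 'an', '81', 'year', 'old', 'lady']
```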

Lemmatization + Lowercase + (optional punctuation remover)



In [None]:
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
from nltk.corpus import wordnet

tag_map = defaultdict(lambda : wordnet.NOUN)
# the lemmatizer needs a WordNet part of speech. We map Penn Treebank tags
#   starting with J, V, and R to adjective, verb, and adverb; every other tag
#   falls back to noun through the defaultdict.
tag_map['J'] = wordnet.ADJ 
tag_map['V'] = wordnet.VERB
tag_map['R'] = wordnet.ADV

# creating a lemmatizer object that we will use from the WordNetLemmatizer class
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
# preprocess_document(document) takes a document (string) and applies different
#   preprocessing functions such as lower case, tokenization, lemmatization, 
#   and an optional punctuation remover.
# Returns: Array of tokens (strings)
def preprocess_document(document):
  # *** removes punctuation              
  tokens = lower_punct_tokenize(document)

  # tagged tokens returns a structure with 2 parameters (tuple) with the token
  #   itself and the type of word it is
  tagged_tokens = nltk.pos_tag(tokens)

  # we will keep track of the array of lemmas
  lemmas=[]

  for token, pos in tagged_tokens:
    # taking first letter of pos to check for possible match in tag_map
    lemmatizer_tag = tag_map[pos[0]]
    # the lemmatizer takes in a token and a pos argument
    lemma = lemmatizer.lemmatize(token, pos=lemmatizer_tag)
    lemmas.append(lemma)
  return lemmas
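
The tag lookup inside `preprocess_document` can be sketched on its own. This example uses the literal strings `'n'`, `'a'`, `'v'`, and `'r'`, which are the values of `wordnet.NOUN`, `wordnet.ADJ`, `wordnet.VERB`, and `wordnet.ADV`, so it runs without downloading any NLTK data. The name `demo_tag_map` and the list of tags are chosen only for illustration.

```python
from collections import defaultdict

# Same structure as tag_map above, with the WordNet constants written out
# as their string values so no corpus download is needed.
demo_tag_map = defaultdict(lambda: "n")  # default: noun
demo_tag_map["J"] = "a"  # adjective
demo_tag_map["V"] = "v"  # verb
demo_tag_map["R"] = "r"  # adverb

# A few Penn Treebank tags as returned by nltk.pos_tag.
penn_tags = ["NN", "NNS", "VBD", "VBG", "JJ", "RB", "IN", "CD"]

# Only the first letter of the tag is used for the lookup.
wordnet_pos = {tag: demo_tag_map[tag[0]] for tag in penn_tags}
print(wordnet_pos)
# {'NN': 'n', 'NNS': 'n', 'VBD': 'v', 'VBG': 'v', 'JJ': 'a', 'RB': 'r', 'IN': 'n', 'CD': 'n'}
```

Prepositions (`IN`) and numbers (`CD`) fall through to the noun default, so the lemmatizer usually leaves them unchanged.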

preprocess_corpus function (template function)


*   Add on Machine/Deep learning model to classify processed data



In [None]:
# preprocess_corpus(corpus) takes an iterable of documents (corpus) and
#   preprocesses each of them in a for-loop.
# *** This is a template function that can be used later
# Returns: List of token lists, one per document
def preprocess_corpus(corpus):
  processed = []
  for document in corpus:
    processed.append(preprocess_document(document))
    # then do something with the preprocessed document
  return processed

Test preprocess_document

In [None]:
print(f"document itself: {train_df.transcription[0]}")
print(f"after preprocess: {preprocess_document(train_df.transcription[0])}")
# preprocess_document expects a single string, so the column is processed one
#   document at a time (passing the whole Series raises an AttributeError)
print(f"after preprocess: {[preprocess_document(doc) for doc in train_df.transcription.head(2)]}")

document itself: REASON FOR THE VISIT:,  Very high PT/INR.,HISTORY: , The patient is an 81-year-old lady whom I met last month when she came in with pneumonia and CHF.  She was noticed to be in atrial fibrillation, which is a chronic problem for her.  She did not want to have Coumadin started because she said that she has had it before and the INR has had been very difficult to regulate to the point that it was dangerous, but I convinced her to restart the Coumadin again.  I gave her the Coumadin as an outpatient and then the INR was found to be 12.  So, I told her to come to the emergency room to get vitamin K to reverse the anticoagulation.,PAST MEDICAL HISTORY:,1.  Congestive heart failure.,2.  Renal insufficiency.,3.  Coronary artery disease.,4.  Atrial fibrillation.,5.  COPD.,6.  Recent pneumonia.,7.  Bladder cancer.,8.  History of ruptured colon.,9.  Myocardial infarction.,10.  Hernia repair.,11.  Colon resection.,12.  Carpal tunnel repair.,13.  Knee surgery.,MEDICATIONS:,1.  Cou

# 3 - Feature Engineering

### Sample Training

In [None]:
from datasets import Dataset, DatasetDict
from torch import nn
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

In [None]:
unique_classes = train_df["medical_specialty"].unique()

# idx_2_class = {i: s for i, s in enumerate(unique_classes)}
# class_2_idx = {s: i for i, s in enumerate(unique_classes)}

In [None]:
# train_df["labels"] = train_df["medical_specialty"].apply(lambda s: class_2_idx[s])

In [None]:
# hold out 30% of the labelled data for validation
train_train_df, train_test_df = train_test_split(
    train_df,
    test_size=0.3,
    random_state=42
)
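
Because the label distribution above is heavily imbalanced, one option worth considering is a stratified split, which keeps class proportions roughly equal in both parts. The sketch below uses a made-up `toy_df` rather than the client data, and it is an alternative to the split above, not what this notebook runs. Note that `stratify` raises a `ValueError` if any class has fewer than two rows, so very rare classes would need to be handled first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 14 rows of class 0 and 6 rows of class 1.
toy_df = pd.DataFrame({
    "transcription": [f"note {i}" for i in range(20)],
    "labels": [0] * 14 + [1] * 6,
})

# stratify= keeps the 70/30 class ratio in both the train and validation parts.
strat_train, strat_val = train_test_split(
    toy_df,
    test_size=0.3,
    random_state=42,
    stratify=toy_df["labels"],
)

train_counts = strat_train["labels"].value_counts().to_dict()
val_counts = strat_val["labels"].value_counts().to_dict()
print(train_counts, val_counts)
```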

In [None]:
ds_dict = {
    'train': Dataset.from_pandas(train_train_df),
    'val': Dataset.from_pandas(train_test_df),
    "test": Dataset.from_pandas(test_df)
}

ds = DatasetDict(ds_dict)

In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_text(texts):
    return tokenizer(texts["transcription"], truncation=True, padding=True, max_length=256)

ds["train"] = ds["train"].map(tokenize_text, batched=True)
ds["val"] = ds["val"].map(tokenize_text, batched=True)
ds["test"] = ds["test"].map(tokenize_text, batched=True)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(unique_classes)
)

### Evaluation Metric

In [None]:
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="macro")
    return {"f1": f1}
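
To see what `average="macro"` does on imbalanced labels, here is a small sketch with hand-made logits. The `SimpleNamespace` object only mimics the two attributes that `compute_metrics` reads (`label_ids` and `predictions`); it is a stand-in for illustration, not the object the `Trainer` actually passes.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import f1_score

# Hypothetical evaluation batch: 5 examples, 3 classes.
fake_pred = SimpleNamespace(
    label_ids=np.array([0, 0, 0, 1, 2]),
    predictions=np.array([
        [2.0, 0.1, 0.1],
        [1.5, 0.3, 0.2],
        [3.0, 0.5, 0.1],
        [0.2, 1.8, 0.4],
        [0.1, 1.2, 0.9],  # true class 2, but class 1 has the highest logit
    ]),
)

# argmax over the last axis turns logits into predicted class indices.
preds = fake_pred.predictions.argmax(-1)

# Macro F1 averages per-class F1 equally, so the missed rare class counts
# as much as the frequent one; micro F1 equals accuracy here.
macro = f1_score(fake_pred.label_ids, preds, average="macro", zero_division=0)
micro = f1_score(fake_pred.label_ids, preds, average="micro", zero_division=0)
print(preds.tolist(), round(macro, 4), round(micro, 4))
# [0, 0, 0, 1, 1] 0.5556 0.8
```

The per-class F1 scores are 1.0, 0.667, and 0.0, so one missed rare class pulls macro F1 well below the 0.8 accuracy.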

In [None]:
batch_size = 32
logging_steps = len(train_train_df) // batch_size
output_dir = "hf_trainer"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    push_to_hub=False
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds['train'],
    eval_dataset=ds['val'],
    tokenizer=tokenizer
)

In [None]:
trainer.train()

### Making Inference on the Test Set

In [None]:
ds["test"]

In [None]:
pred_y = trainer.predict(ds["test"])

In [None]:
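# take the highest-scoring class index for each test row and save it
predicted_labels = pd.Series(pred_y.predictions.argmax(axis=1))
predicted_labels.name = "Expected"
predicted_labels.to_csv("predictions.csv")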
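
The saved predictions are integer label indices. If specialty names are needed for review, one way to recover them is to build an index-to-name lookup from the `labels` and `medical_specialty` columns of the training data. The sketch below uses a small made-up frame (`toy_train`) and made-up logits (`toy_logits`) that follow the label ids shown in the `train_df.head()` output; it is an illustration, not part of the submission format.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the training frame with the same two columns.
toy_train = pd.DataFrame({
    "medical_specialty": ["Emergency Room Reports", "Surgery", "Surgery", "Radiology"],
    "labels": [0, 1, 1, 2],
})

# One row per label id, then a plain dict: {0: 'Emergency Room Reports', ...}
idx_2_class = (
    toy_train.drop_duplicates("labels")
    .set_index("labels")["medical_specialty"]
    .to_dict()
)

# Hypothetical model outputs (logits) for three test documents.
toy_logits = np.array([[0.1, 2.0, 0.3], [1.7, 0.2, 0.1], [0.0, 0.4, 0.9]])
predicted_ids = pd.Series(toy_logits.argmax(axis=1), name="Expected")

# Map each predicted index back to its specialty name.
predicted_names = predicted_ids.map(idx_2_class)
print(predicted_names.tolist())
# ['Surgery', 'Emergency Room Reports', 'Radiology']
```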