### Reading Data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

# This code mounts Google Drive in Colab so that the files stored there can be read.

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
import pandas as pd

train_df = pd.read_csv("/content/gdrive/MyDrive/Hackathon/new_train.csv", index_col=0)
test_df = pd.read_csv("/content/gdrive/MyDrive/Hackathon/new_test.csv", index_col=0)

print("Train size", len(train_df))
print("Test size", len(test_df))
train_df.head(n=3)



# This code imports the pandas library and uses it to read the two CSV files into data frames.
# The paths to the dataset CSVs are given in the code.
# index_col=0 specifies that the first column of each CSV file is used as the index of the data frame.
# Running this cell prints "Train size 3969", "Test size 997" and the first three rows of new_train.csv.

Train size 3969
Test size 997


Unnamed: 0,medical_specialty,transcription,labels
0,Emergency Room Reports,"REASON FOR THE VISIT:, Very high PT/INR.,HIST...",0
1,Surgery,"PREOPERATIVE DIAGNOSIS:, Acetabular fracture ...",1
2,Surgery,"NAME OF PROCEDURE,1. Selective coronary angio...",1


In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Drop rows with a missing transcription, then define the text-cleaning function.
train_df = train_df[train_df["transcription"].notna()]

def clean_text(text):
    # Lowercase first: the second pattern keeps only a-z, so capital letters would otherwise be deleted.
    text = text.lower()
    special_char = re.compile(r'[/(){}\[\]\|@,;]')
    text = special_char.sub('', text)
    special_char2 = re.compile(r'[^0-9a-z #+_]')
    text = special_char2.sub('', text)
    tokens = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
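
The effect of the two regular expressions in `clean_text` can be checked in isolation. The sketch below uses only `re`; the helper `strip_chars` and the sample string are made up for illustration. It shows that the second pattern only keeps lowercase letters, so the position of the lowercasing step changes the result.

```python
import re

# The two patterns used in clean_text.
special_char = re.compile(r'[/(){}\[\]\|@,;]')
special_char2 = re.compile(r'[^0-9a-z #+_]')

def strip_chars(text, lowercase_first):
    """Apply both patterns, lowercasing either before or after filtering."""
    if lowercase_first:
        text = text.lower()
    text = special_char.sub('', text)
    text = special_char2.sub('', text)
    return text.lower()

sample = "Angina, Coronary artery disease."  # made-up sample string
print(strip_chars(sample, lowercase_first=False))  # capitals are deleted
print(strip_chars(sample, lowercase_first=True))   # capitals are kept as lowercase
```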


In [None]:
train_df["transcription"]=train_df["transcription"].apply(clean_text)

### Train Set Label Distribution

In [None]:
counts = train_df["medical_specialty"].value_counts()

print(counts)
# This code prints the label and the corresponding number of transcriptions for each speciality.

 Surgery                          863
 Consult - History and Phy.       410
 Cardiovascular / Pulmonary       309
 Orthopedic                       289
 Radiology                        213
 General Medicine                 209
 Gastroenterology                 176
 Neurology                        170
 SOAP / Chart / Progress Notes    135
 Urology                          134
 Obstetrics / Gynecology          123
 Discharge Summary                 87
 ENT - Otolaryngology              82
 Neurosurgery                      71
 Hematology - Oncology             68
 Ophthalmology                     67
 Emergency Room Reports            63
 Nephrology                        63
 Pediatrics - Neonatal             55
 Pain Management                   54
 Psychiatry / Psychology           45
 Office Notes                      38
 Podiatry                          35
 Dermatology                       21
 Dentistry                         21
 Cosmetic / Plastic Surgery        19
 Letters    


Exploratory data analysis observations: 

Based on the output above, 19 of the 40 medical specialities each account for less than 1% of the 3969 cases in train_df, while the top 5 categories (Surgery to Radiology) make up about 50% of the dataset. The dataset is therefore imbalanced and skewed, which limits the accuracy of any classification model trained on it. To increase the resulting F1 score, one option is to remove labels with a count below 40 (~1% of training cases), as shown below.
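
The shares quoted in this observation can be derived from the label counts. A minimal sketch in plain Python follows, with a made-up label list standing in for the `medical_specialty` column and a 20% threshold chosen only to suit the toy data (the text uses 1%).

```python
from collections import Counter

# Made-up label list standing in for train_df["medical_specialty"].
labels = ["Surgery"] * 6 + ["Radiology"] * 3 + ["Dentistry"] * 1

counts = Counter(labels)
total = sum(counts.values())

# Print count, share and cumulative share, largest class first.
cumulative = 0.0
for label, n in counts.most_common():
    share = n / total
    cumulative += share
    print(f"{label:<12}{n:>4}{share:>8.1%}{cumulative:>8.1%}")

# Labels whose share is below the threshold.
threshold = 0.20
rare = [label for label, n in counts.items() if n / total < threshold]
print(rare)
```

On the real data frame, `train_df["medical_specialty"].value_counts(normalize=True)` gives the same shares directly.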

In [None]:
train_df = train_df[~train_df["medical_specialty"].isin(counts[counts < 40].index)]

counts = train_df["medical_specialty"].value_counts()
print(counts)

 Surgery                          863
 Consult - History and Phy.       410
 Cardiovascular / Pulmonary       309
 Orthopedic                       289
 Radiology                        213
 General Medicine                 209
 Gastroenterology                 176
 Neurology                        170
 SOAP / Chart / Progress Notes    135
 Urology                          134
 Obstetrics / Gynecology          123
 Discharge Summary                 87
 ENT - Otolaryngology              82
 Neurosurgery                      71
 Hematology - Oncology             68
 Ophthalmology                     67
 Emergency Room Reports            63
 Nephrology                        63
 Pediatrics - Neonatal             55
 Pain Management                   54
 Psychiatry / Psychology           45
Name: medical_specialty, dtype: int64



Exploratory data analysis observations:
Further analysis of the train.csv file shows that, when the file is sorted by the transcription column, the same transcription occurs multiple times with different labels attached. A transcription can therefore carry multiple labels, which may confuse the algorithm. The following supersets are identified and combined in the next code block to reduce such occurrences.



1. Surgery
2. SOAP / Chart / Progress Notes
3. Emergency Room Reports
4. Discharge Summary
5. Office Notes
6. General Medicine
7. Pain Management
8. Neurology (Superset of Neurosurgery)
9. Urology (Superset of Nephrology)
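
Sorting by transcription is one way to spot the duplicates; grouping is another. The sketch below is an illustration in plain Python with made-up rows, not the analysis that was actually run. It collects the set of specialities seen for each transcription and keeps those with more than one.

```python
from collections import defaultdict

# Made-up (transcription, speciality) pairs standing in for rows of train_df.
rows = [
    ("patient underwent knee arthroscopy", "Surgery"),
    ("patient underwent knee arthroscopy", "Orthopedic"),
    ("chest x-ray shows no acute disease", "Radiology"),
]

labels_by_text = defaultdict(set)
for text, label in rows:
    labels_by_text[text].add(label)

# Keep only transcriptions that appear with more than one speciality.
conflicts = {t: sorted(l) for t, l in labels_by_text.items() if len(l) > 1}
print(conflicts)
```

With pandas, `train_df.groupby("transcription")["medical_specialty"].nunique()` returns the number of distinct labels per transcription, which can be filtered in the same way.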





In [None]:
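# The labels in this dataset start with a space (e.g. ' Surgery'), so the seven comparisons
# below match no rows and leave train_df unchanged, as the counts printed afterwards show.
# Prefixing each string with a space would make these filters take effect.
train_df = train_df[train_df["medical_specialty"] != "Surgery"]
train_df = train_df[train_df["medical_specialty"] != "SOAP / Chart / Progress Notes"]
train_df = train_df[train_df["medical_specialty"] != "Emergency Room Reports"]
train_df = train_df[train_df["medical_specialty"] != "Discharge Summary"]
train_df = train_df[train_df["medical_specialty"] != "Office Notes"]
train_df = train_df[train_df["medical_specialty"] != "General Medicine"]
train_df = train_df[train_df["medical_specialty"] != "Pain Management"]
# These two lines include the leading space, so Neurosurgery is merged into Neurology
# and Nephrology into Urology.
train_df.loc[train_df.medical_specialty == ' Neurosurgery', "medical_specialty"] = ' Neurology'
train_df.loc[train_df.medical_specialty == ' Nephrology', "medical_specialty"] = " Urology"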

counts = train_df["medical_specialty"].value_counts()
print(counts)

 Surgery                          863
 Consult - History and Phy.       410
 Cardiovascular / Pulmonary       309
 Orthopedic                       289
 Neurology                        241
 Radiology                        213
 General Medicine                 209
 Urology                          197
 Gastroenterology                 176
 SOAP / Chart / Progress Notes    135
 Obstetrics / Gynecology          123
 Discharge Summary                 87
 ENT - Otolaryngology              82
 Hematology - Oncology             68
 Ophthalmology                     67
 Emergency Room Reports            63
 Pediatrics - Neonatal             55
 Pain Management                   54
 Psychiatry / Psychology           45
Name: medical_specialty, dtype: int64


In [None]:
unique_classes = train_df["medical_specialty"].unique()
print(unique_classes)

idx_2_class = {i: s for i, s in enumerate(unique_classes)}

# idx_2_class maps each integer index to its medical speciality name (e.g. 0: ' Emergency Room Reports').

class_2_idx = {s: i for i, s in enumerate(unique_classes)}
# class_2_idx is the inverse mapping, from speciality name to index (e.g. ' Surgery': 1).

print("Number of medical specialities:", len(unique_classes))
# This prints the number of medical specialities left after the filtering above, which is 19.

[' Emergency Room Reports' ' Surgery' ' Radiology' ' Neurology'
 ' Gastroenterology' ' Orthopedic' ' Cardiovascular / Pulmonary'
 ' Urology' ' ENT - Otolaryngology' ' General Medicine'
 ' Hematology - Oncology' ' SOAP / Chart / Progress Notes'
 ' Psychiatry / Psychology' ' Consult - History and Phy.'
 ' Obstetrics / Gynecology' ' Discharge Summary' ' Ophthalmology'
 ' Pediatrics - Neonatal' ' Pain Management']
Number of medical specialities: 19


In [None]:
# The dataset is still imbalanced because a large proportion of the transcriptions
# are of type Surgery. To reduce the imbalance, the loop below draws 40 rows from each
# speciality and appends them to train_df, so the sampled rows appear more than once.

for spec in unique_classes:
  t = train_df[train_df["medical_specialty"] == spec].sample(n=40, replace=False)
  train_df = pd.concat([train_df, t], ignore_index=True).sample(frac=1)
#counts = train_df["medical_specialty"].value_counts()
#print(counts)  
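
For comparison, a different way to even out class sizes is to cap every speciality at a fixed number of rows (downsampling), instead of appending sampled rows to the existing data frame as the loop above does. The sketch below is a hypothetical alternative, not the method used in this notebook; `cap_per_class` and the toy rows are made up.

```python
import random
from collections import Counter, defaultdict

def cap_per_class(rows, cap, seed=42):
    """Keep at most `cap` randomly chosen rows for each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    kept = []
    for label_rows in by_label.values():
        if len(label_rows) > cap:
            label_rows = rng.sample(label_rows, cap)
        kept.extend(label_rows)
    rng.shuffle(kept)
    return kept

# Made-up (text, label) rows: one large class and one small class.
rows = [(f"note {i}", "Surgery") for i in range(100)]
rows += [(f"note {i}", "Podiatry") for i in range(100, 130)]

balanced = cap_per_class(rows, cap=40)
print(Counter(label for _, label in balanced))
```

Capping discards data from large classes, so whether it helps the F1 score here would have to be measured.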

### Transcription

In [None]:
from pprint import pprint
pprint(train_df.transcription[4])

# This prints the transcription of the row whose index label is given in [].

('ngina coronary artery disease ngina coronary artery disease oronary artery '
 'bypass grafting x2 left internal mammary artery left anterior descending '
 'reverse saphenous vein graft circumflex ude proximal anastomosis used vein '
 'graft ffpump edtronic technique left internal mammary artery technique '
 'circumflex eneral patient brought operating room placed supine position upon '
 'table fter adequate general anesthesia patient prepped etadine soap solution '
 'usual sterile manner lbows protected avoid ulnar neuropathy chest wall '
 'expansion avoided avoid ulnar neuropathy phrenic nerve protector used '
 'protect phrenic nerve removed end case midline sternal skin incision made '
 'carried sternum divided saw ericardial thymus fat pad divided left internal '
 'mammary artery harvested spatulated anastomosis eparin givenein resected '
 'thigh side branch secured using 40 silk emoclips thigh closed multilayer '
 'icryl exon technique ulsavac wash done drain placedhe left intern

### Training

This section trains a machine learning model to assign each transcription to one of the medical specialities, that is, single-label multi-class classification. The DistilBERT model is fine-tuned, and its hyperparameters are adjusted to obtain the highest possible F1 score.


In [None]:
!pip install datasets
# The line above installs the Hugging Face datasets library, which the next two imports need.
from datasets.dataset_dict import DatasetDict
from datasets import Dataset
from torch import nn
import torch

!pip install transformers
# The line above installs the transformers library, which the next import needs.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Using DistilBERT

In [None]:
train_df["labels"] = train_df["medical_specialty"].apply(lambda s: class_2_idx[s])

# This line creates the labels for the corresponding medical specialities.

In [None]:
train_train_df, train_test_df = \
    train_test_split(
    train_df,
    test_size=0.3,
    # test_size=0.3 splits train_df 70/30: 70% is used to train the algorithm and the remaining 30% to validate it.
    random_state=42
    # random_state fixes the split so that each run returns the same output.
    # Removing it gives a different random split on every run; that is not the same as k-fold cross-validation, which would need its own loop.
    # It is unclear whether cross-validation would change the measured accuracy much for a transformer model.
)
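
The comments above mention k-fold cross-validation. As a rough sketch of what that involves, the helper below (hypothetical, written in plain Python) splits row indices into k folds so that every row is used for validation exactly once. It is not part of the notebook's pipeline.

```python
import random

def k_fold_indices(n_rows, k, seed=42):
    """Yield (train_idx, val_idx) pairs; each row is in the validation set once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_rows=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each fold would need its own fine-tuning run, so the cost grows roughly k times compared with the single 70/30 split.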

In [None]:
ds_dict = {
    'train': Dataset.from_pandas(train_train_df),
    'val': Dataset.from_pandas(train_test_df),
    "test": Dataset.from_pandas(test_df)
}

ds = DatasetDict(ds_dict)

In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_text(texts):
    return tokenizer(texts["transcription"], truncation=True, padding=True, max_length=256)

ds["train"] = ds["train"].map(tokenize_text, batched=True)
ds["val"] = ds["val"].map(tokenize_text, batched=True)
ds["test"] = ds["test"].map(tokenize_text, batched=True)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.26.1",
  "vocab_size": 30522
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapsh

Map:   0%|          | 0/3644 [00:00<?, ? examples/s]

Map:   0%|          | 0/1562 [00:00<?, ? examples/s]

Map:   0%|          | 0/997 [00:00<?, ? examples/s]
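
The tokenizer call uses `truncation=True`, `padding=True` and `max_length=256`. The sketch below illustrates the idea in plain Python with made-up token ids: sequences longer than the limit are cut, and shorter ones are padded to the longest sequence in the batch with a matching attention mask. It is a simplification that ignores special-token handling; `pad_and_truncate` is a hypothetical helper, not the tokenizer's implementation. The pad id 0 matches `pad_token_id` in the model config.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids; max_length is 4 here instead of 256 to keep the output short.
batch = [[101, 7, 8, 9, 10, 102], [101, 5, 102]]
out = pad_and_truncate(batch, max_length=4)
print(out["input_ids"])
print(out["attention_mask"])
```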

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(unique_classes)
)

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10",
    "11": "LABEL_11",
    "12": "LABEL_12",
    "13": "LABEL_13",
    "14": "LABEL_14",
    "15": "LABEL_15",
    "16": "LABEL_16",
    "17": "LABEL_17",
    "18": "LABEL_18"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_11": 11,
    "LABEL_12": 12,
    "LABEL_13": 13,
    "L

In [None]:
# Evaluating the approach taken using DistilBERT
from sklearn.metrics import f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="macro")
    return {"f1": f1}
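
`average="macro"` means the F1 score is computed per class and then averaged with equal weight, so rare specialities count as much as Surgery. The hand-written version below is a sketch for illustration (the function name and toy labels are made up); it is meant to mirror the macro average, not to replace `f1_score`.

```python
def macro_f1(labels, preds):
    """Unweighted mean of per-class F1 over classes seen in labels or preds."""
    classes = sorted(set(labels) | set(preds))
    scores = []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: class 0 is frequent, class 1 is rare and never predicted.
labels = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]
print(macro_f1(labels, preds))
```

In this toy case the accuracy is 0.8, but the macro F1 is much lower because the missed rare class contributes a score of 0.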

In [None]:
# The best hyperparameters found for DistilBERT here are batch size 16 and learning rate 3e-5.
# For these hyperparameters, the BERT authors recommend 4 epochs, so the epoch count was modified as well.
# These values were found by trying different combinations of the three variables.

batch_size = 16
# Batch_size was changed from 32 to 16.
logging_steps = len(train_train_df) // batch_size
output_dir = "hf_trainer"

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=4,
    # epochs was changed from 5 to 4.
    learning_rate=3e-5,
    # Learning_rate was changed to 3e-5 from 2e-5.
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_steps=logging_steps,
    push_to_hub=False
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds['train'],
    eval_dataset=ds['val'],
    tokenizer=tokenizer
)

In [None]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, transcription, medical_specialty. If __index_level_0__, transcription, medical_specialty are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3644
  Num Epochs = 4
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 912
  Number of trainable parameters = 66968083
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1
1,2.421,2.001525,0.224193
2,1.8016,1.665965,0.346763
3,1.4877,1.536423,0.416806
4,1.3115,1.50198,0.448441


The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, transcription, medical_specialty. If __index_level_0__, transcription, medical_specialty are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1562
  Batch size = 16
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, transcription, medical_specialty. If __index_level_0__, transcription, medical_specialty are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1562
  Batch size = 16
Saving model checkpoint to hf_trainer/checkpoint-500
Configuration saved in hf_trainer/checkpoint-500/config.json
Model weights sa

TrainOutput(global_step=912, training_loss=1.7528277494405444, metrics={'train_runtime': 364.4352, 'train_samples_per_second': 39.996, 'train_steps_per_second': 2.503, 'total_flos': 965715089350656.0, 'train_loss': 1.7528277494405444, 'epoch': 4.0})
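
The step count reported in the training log can be reproduced from the number of examples, the batch size and the number of epochs. The arithmetic below assumes a single device and no gradient accumulation, as the log indicates, and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 3644   # "Num examples" in the training log
batch_size = 16
num_epochs = 4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
logging_steps = num_examples // batch_size  # same formula as in the notebook

print(steps_per_epoch, total_steps, logging_steps)
```

The total matches "Total optimization steps = 912", and `logging_steps` is one step short of a full epoch because of the floor division.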

### Making Inference on the Test Set

In [None]:
ds["test"]

Dataset({
    features: ['transcription', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 997
})

In [None]:
pred_y = trainer.predict(ds["test"])

The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: __index_level_0__, transcription. If __index_level_0__, transcription are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 997
  Batch size = 16


In [None]:
a = pd.Series(pred_y.predictions.argmax(axis=1))
a.name = "Expected"
a.to_csv("predictions.csv")
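
The saved predictions are class indices. If speciality names are wanted instead, the indices can be mapped back through `idx_2_class`. The sketch below uses made-up stand-ins for both the mapping and the score matrix, and whether the expected submission format uses indices or names is not stated in this notebook.

```python
# Made-up stand-ins: idx_2_class as built earlier, and a small score matrix
# shaped like pred_y.predictions (one row per transcription, one column per class).
idx_2_class = {0: " Emergency Room Reports", 1: " Surgery", 2: " Radiology"}
scores = [
    [0.1, 2.3, -0.5],
    [1.7, 0.2, 0.4],
]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Strip the leading space that the labels carry in this dataset.
predicted_names = [idx_2_class[argmax(row)].strip() for row in scores]
print(predicted_names)
```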