# Logistic Regression with ClinicalBERT Embeddings for Baseline Performance

#### Necessary Imports
If any of the required packages are missing, install them with:  
pip install pandas tensorflow transformers numpy nltk matplotlib openai scikit-learn  
Please use Python 3.10.13.  
An easy way to set all of this up is with Anaconda: create an environment with Python 3.10.13 and all of the necessary packages.  
```conda create -n env-name python=3.10.13 tensorflow transformers numpy pandas nltk matplotlib openai scikit-learn```  
Make sure env-name is selected as the environment (kernel) when running the notebook.
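
Before running the cells below, it can help to confirm that the active environment actually has everything installed. The following is a minimal sketch, not part of the original notebook: `REQUIRED_MODULES` and `find_missing` are hypothetical names, and the list mirrors the install line above plus `sklearn`, which the notebook imports later.

```python
import importlib.util
import sys

# Import names can differ from package names
# (for example, scikit-learn is imported as sklearn).
REQUIRED_MODULES = ["pandas", "tensorflow", "transformers", "numpy",
                    "nltk", "matplotlib", "openai", "sklearn"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```

Any name reported as missing can then be installed with pip or conda as described above.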

In [None]:
import utils
from transformers import AutoTokenizer
LABELS = ["ABDOMINAL",
        "ADVANCED-CAD",
        "ALCOHOL-ABUSE",
        "ASP-FOR-MI",
        "CREATININE",
        "DIETSUPP-2MOS",
        "DRUG-ABUSE",
        "ENGLISH",
        "HBA1C",
        "KETO-1YR",
        "MAJOR-DIABETES",
        "MAKES-DECISIONS",
        "MI-6MOS"]

In [23]:
def get_word_embeddings(data):
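    # Load the ClinicalBERT tokenizer
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    # Run each note through utils.tokenize_data, then tokenize to fixed-length (512) sequences
    clinical_notes = [utils.tokenize_data(note) for note in list(data['notes'])]
    tokenized_notes = tokenizer(clinical_notes, padding='max_length', max_length=512, truncation=True, return_tensors="tf")

    # Note: these are the tokenizer's input IDs (vocabulary indices), not
    # contextual embeddings from the model; they are used directly as features
    word_embeddings = tokenized_notes['input_ids'].numpy()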

    return word_embeddings
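
The tokenizer call above pads or truncates every note to exactly 512 token IDs, so each note becomes one fixed-length row of the feature matrix. As a rough illustration of that shaping step in plain Python, here is a sketch that assumes a hypothetical helper `pad_or_truncate` and a pad ID of 0. The real tokenizer also inserts special tokens and keeps the closing one when truncating, which this sketch omits.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Return a list of exactly max_length IDs (hypothetical helper)."""
    clipped = token_ids[:max_length]                   # truncation: drop the tail
    padding = [pad_id] * (max_length - len(clipped))   # right-padding
    return clipped + padding


short_note = [101, 2023, 102]      # made-up token IDs
long_note = list(range(1, 601))    # 600 made-up token IDs

rows = [pad_or_truncate(n, max_length=8) for n in (short_note, long_note)]
print(rows[0])  # [101, 2023, 102, 0, 0, 0, 0, 0]
print(rows[1])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because each row holds vocabulary indices rather than contextual vectors, any model trained on it treats the ID at each position as a plain numeric feature.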

### Retrieving Data

In [24]:
# Both will be dataframes with a 'notes' column and a column for each label
train_data = utils.get_note_data(LABELS, folder_name='train')
test_data = utils.get_note_data(LABELS, folder_name='test')

train_data.head()

| Unnamed: 0 | notes | ABDOMINAL | ADVANCED-CAD | ALCOHOL-ABUSE | ASP-FOR-MI | CREATININE | DIETSUPP-2MOS | DRUG-ABUSE | ENGLISH | HBA1C | KETO-1YR | MAJOR-DIABETES | MAKES-DECISIONS | MI-6MOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 162 | `\n\nRecord date: 2068-02-04\n\nASSOCIATED ARTH...` | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 176 | `\n\nRecord date: 2085-04-22\n\n \nThis patient...` | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 189 | `\n\nRecord date: 2090-07-07\n\nWillow Gardens ...` | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 214 | `\n\nRecord date: 2096-07-15\n\n\n\nResults01/3...` | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 200 | `\n\nRecord date: 2170-02-17\n\n \n\nReason for...` | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |


### Creating Model

In [25]:
from sklearn.linear_model import LogisticRegression

# Features are the same for every label, so compute them once
X_train = get_word_embeddings(train_data)

models = {}
for label in LABELS:
    # Define logistic regression model
    lr = LogisticRegression()

    # Binary targets for this label
    y_train = train_data[label].to_list()

    # Train the model
    lr.fit(X_train, y_train)

    models[label] = lr



### Getting Predictions

In [26]:
# Features are the same for every label, so compute them once
X_test = get_word_embeddings(test_data)

label_to_predictions = {}
for label, model in models.items():
    print(f"Predicting for model: {label}")
    # Predict
    label_to_predictions[label] = model.predict(X_test)

Predicting for model: ABDOMINAL
Predicting for model: ADVANCED-CAD
Predicting for model: ALCOHOL-ABUSE
Predicting for model: ASP-FOR-MI
Predicting for model: CREATININE
Predicting for model: DIETSUPP-2MOS
Predicting for model: DRUG-ABUSE
Predicting for model: ENGLISH
Predicting for model: HBA1C
Predicting for model: KETO-1YR
Predicting for model: MAJOR-DIABETES
Predicting for model: MAKES-DECISIONS
Predicting for model: MI-6MOS


Optionally, save the predictions to disk and reload them here.

In [27]:
utils.save_preds(label_to_predictions, "LR_predictions")

In [28]:
label_to_predictions = utils.read_preds("LR_predictions")

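
The implementations of `utils.save_preds` and `utils.read_preds` are not shown in this notebook. As a hedged illustration of the round trip, the sketch below uses hypothetical stand-ins (`save_preds_csv`, `read_preds_csv`) and assumes a CSV layout with one row per label: the label name first, followed by its predictions.

```python
import csv
import os
import tempfile


def save_preds_csv(label_to_predictions, path):
    """Write one row per label: label name, then its predictions."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for label, preds in label_to_predictions.items():
            writer.writerow([label, *preds])


def read_preds_csv(path):
    """Read the file back; predictions are parsed as ints."""
    label_to_predictions = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            label_to_predictions[row[0]] = [int(v) for v in row[1:]]
    return label_to_predictions


demo = {"ABDOMINAL": [1, 0, 1], "ENGLISH": [1, 1, 1]}  # made-up predictions
path = os.path.join(tempfile.mkdtemp(), "LR_predictions.csv")
save_preds_csv(demo, path)
restored = read_preds_csv(path)
print(restored == demo)  # True
```

Parsing the values back to integers matters here: CSV stores everything as text, and comparing string predictions against integer ground truth would silently score every prediction as wrong.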


### Performance:

In [29]:
label_to_micro_f1, overall_f1 = utils.get_f1_scores_for_labels(LABELS, test_data, label_to_predictions)
print('overall-f1:', overall_f1)

Raw f1 for ABDOMINAL 0.5731591125629509
Raw f1 for ADVANCED-CAD 0.5717948717948718
Raw f1 for ALCOHOL-ABUSE 0.49112426035502965
Raw f1 for ASP-FOR-MI 0.5592948717948718
Raw f1 for CREATININE 0.6126050420168067
Raw f1 for DIETSUPP-2MOS 0.5762237762237762
Raw f1 for DRUG-ABUSE 0.49112426035502965
Raw f1 for ENGLISH 0.42441860465116277
Raw f1 for HBA1C 0.5327777777777778
Raw f1 for KETO-1YR 0.5
Raw f1 for MAJOR-DIABETES 0.5314465408805031
Raw f1 for MAKES-DECISIONS 0.48255813953488375
Raw f1 for MI-6MOS 0.47530864197530864
overall-f1: 0.5247566076863825
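
The overall score printed above equals the unweighted mean of the 13 per-label values. How `utils.get_f1_scores_for_labels` computes each per-label value is not shown, so the sketch below is only one plausible reading: it assumes each label's score is the average of the F1 for class 1 and the F1 for class 0. The function names and data are made up for illustration.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class; defined as 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 for class 1 and the F1 for class 0."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


# Made-up ground truth and predictions for two hypothetical labels
truth = {"A": [1, 1, 0, 0], "B": [1, 0, 0, 0]}
preds = {"A": [1, 0, 0, 0], "B": [0, 0, 0, 0]}

per_label = {label: macro_f1(truth[label], preds[label]) for label in truth}
overall = sum(per_label.values()) / len(per_label)
print(per_label, overall)
```

Under this definition, a classifier that always predicts the majority class (label "B" in the sketch) scores below 0.5, because the F1 for the class it never predicts is zero.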
