# Diabetes Phenotyping: A Natural Language Processing-Driven Machine Learning Approach 

<img src="flowchart.png" alt="flowchart" />
Diagnosing diabetes with automated approaches could improve patient care and disease management and assist clinicians. In this project, supervised learning algorithms were used to develop classifiers that detect type 1 diabetes and type 2 diabetes from MIMIC-IV clinical notes. Three approaches that leverage natural language processing techniques were used: TF-IDF, word embeddings, and a pretrained large language model.


In [None]:
#packages imports 
#preprocessing 
import numpy as np
import pandas as pd 
import copy
import re
import string
from flashtext import KeywordProcessor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from random import sample
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize.treebank import TreebankWordDetokenizer
from sklearn.preprocessing import MinMaxScaler
from datasets import Dataset
import warnings
warnings.filterwarnings('ignore')
#Evaluation 
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
#Approach1&2 models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
#Approach 2
import gensim
from gensim.models import Word2Vec
#Approach 3
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
import evaluate
#Visualization
import matplotlib.pyplot as plt 
np.random.seed(42)

<h3>Dataset</h3>
    
The instructions to download the MIMIC-IV database files are in the README file.<br>
For additional information about the MIMIC data, visit https://physionet.org/content/mimiciv/2.2/

# Patients Labeling

The algorithm pseudocode and flowchart for this part can be found at: <br>
Stanford University. Type 1 and type 2 Diabetes Mellitus. PheKB; 2020. Available from: https://phekb.org/phenotype/1506 <br>
The MIMIC data is not labeled, so the labeling has to be done here; the algorithm serves as a guide for identifying the phenotypes.

<h3>Load Data</h3>
<br>
-d_items: a table that defines the item IDs referenced in other files in the ICU folder (similar to an item dictionary or metadata)<br>
-charts: a table that contains the charted observations recorded for patients in the ICU (used here for ICU glucose readings)<br>
-prescriptions: a table that contains all patients' prescriptions<br>
-diagnoses: a table that contains all the patients' diagnoses<br>
-labs: a table that contains all lab results other than ICU tests<br>
-labs_items: a table that defines the item IDs referenced in the labs table (similar to an item dictionary or metadata) <br>
-diagnoses_details: a table that defines the ICD codes referenced in the diagnoses table (similar to an item dictionary or metadata)<br>
-notes: a table that contains discharge summary notes. 

In [None]:
#Load files from MIMIC folders
#ICU folder
d_items = pd.read_csv("mimic-iv-2.2/icu/d_items.csv")
charts = pd.read_csv("mimic-iv-2.2/icu/chartevents.csv")
#HOSP folder
prescriptions = pd.read_csv("mimic-iv-2.2/hosp/prescriptions.csv")
diagnoses = pd.read_csv("mimic-iv-2.2/hosp/diagnoses_icd.csv")
labs = pd.read_csv("mimic-iv-2.2/hosp/labevents.csv")
labs_items = pd.read_csv("mimic-iv-2.2/hosp/d_labitems.csv")
diagnoses_details = pd.read_csv("mimic-iv-2.2/hosp/d_icd_diagnoses.csv")
#MIMIC Note Folder
notes = pd.read_csv("mimic-iv-note-deidentified-free-text-clinical-notes-2/note/discharge.csv")

<h3> Diabetes Items IDs Extraction </h3>

In [None]:
#load the file that contains the diabetes medications (generic names and brands).
#the medications list was acquired from the Upadhyaya et al. paper
rx_labels = pd.read_csv("Full_Rx_List.csv").astype(str)
rx_labels = rx_labels.stack().tolist()

#get the records that have a category of labs and they contain glucose (blood sugar) as labels from the 
#ICU items dictionary 
icu_labs_labels = d_items[d_items.label.str.contains('Glucose',case=False) & 
                          d_items.category.str.contains('labs',case=False)]

In [None]:
# get glucose labs  
glucose_lab_codes = [50809,50931,52569,52027] 
glucose_labs_labels = labs_items[labs_items.itemid.isin(glucose_lab_codes)]

In [None]:
#get the ICD9 and ICD10 codes for diabetes 

#ICD9 diabetes codes start with 250, so extract the codes that match this condition
#the long_title check keeps only codes whose description mentions diabetes as a whole word
icd9_label = diagnoses_details[diagnoses_details.icd_code.str.contains('^250', case=False, regex=True)
& diagnoses_details.long_title.str.contains(r'\bdiabetes\b', case=False, regex=True)]

#ICD10 diabetes codes start with E10 or E11, so extract the codes that match this condition
icd10_label1 = diagnoses_details[diagnoses_details.icd_code.str.contains('^E11', case=False, regex=True)
& diagnoses_details.long_title.str.contains(r'\bdiabetes\b', case=False, regex=True)]
icd10_label2 = diagnoses_details[diagnoses_details.icd_code.str.contains('^E10', case=False, regex=True)
& diagnoses_details.long_title.str.contains(r'\bdiabetes\b', case=False, regex=True)]

#merge all diabetes ICD codes in one dataframe
diabetes_codes_labels = pd.concat([icd9_label, icd10_label1 ,icd10_label2])

<h3>Get Patients with diabetes diagnosis</h3>

In [None]:
#filter patients diagnoses only with diabetes diagnosis codes 
diabetes_diagnoses = diagnoses[diagnoses.icd_code.isin(diabetes_codes_labels.icd_code)]

<h3> Filter diabetes medications and prescriptions </h3>

In [None]:
def Filter_meds(row ,meds):
    '''
    check if the text matches any item from the medications list 
 
    Args:
        row (str): the drug name taken from a dataframe row 
        meds (str list): a list of medications 
 
    Returns:
        1: if the medication name matches any of the list items
        0: if the medication name doesn't match any of the list items
    
    '''
    keyword_processor = KeywordProcessor()
    #add the meds as keywords to be searched for 
    keyword_processor.add_keywords_from_list(meds)
    if(keyword_processor.extract_keywords(row)):
        return 1 
    else:
        return 0
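
`Filter_meds` builds a new `KeywordProcessor` for every row, which repeats the same setup work for each prescription. The sketch below shows the build-once, reuse-many pattern using only the standard library. `build_matcher` and `contains_any` are hypothetical helper names, the medication list is a toy one, and the `re`-based matching is a stand-in that only approximates whole-word, case-insensitive keyword lookup.

```python
import re

def build_matcher(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords (hypothetical helper)."""
    # Longest keywords first so multi-word names are tried before shorter ones.
    ordered = sorted(set(keywords), key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def contains_any(text, matcher):
    """Return 1 if any keyword occurs in text as a whole word, else 0."""
    return 1 if matcher.search(text) else 0

# Toy medication list, not the Full_Rx_List.csv used in the notebook.
matcher = build_matcher(["Metformin", "Glucophage", "Insulin Glargine"])
drugs = ["MetFORMIN (Glucophage)", "insulin glargine 100 units", "Acetaminophen", "Metforminx"]
flags = [contains_any(d, matcher) for d in drugs]
print(flags)
```

The same idea carries over to flashtext: the processor can be created once outside the function and reused for every row.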

In [None]:
%%time
#fill the null values with 'unknown' and convert the drug column to string so it can be filtered
#(the result is assigned back to the column instead of relying on inplace=True)
prescriptions['drug'] = prescriptions['drug'].fillna('unknown').astype(str)

#apply the Filter_meds function to the prescriptions table to flag the diabetes prescriptions 
prescriptions['is_diabetes_med'] = prescriptions.drug.apply(Filter_meds , args=(rx_labels,))

In [None]:
#create a new dataframe that contain only diabetes prescriptions 
diabetes_prescriptions = prescriptions[prescriptions['is_diabetes_med'] == 1]

#remove insulin prescribed for hyperkalemia as it's not for diabetes
#(.copy() so that columns can be added later without a SettingWithCopyWarning)
diabetes_prescriptions = diabetes_prescriptions[~diabetes_prescriptions.drug.str.contains("Hyperkalemia")].copy()

In [None]:
#flag the prescriptions that are considered a metformin prescription (brands and generic names)
metformin_brands = ['Fortamet','Glucophage','Glumetza','Riomet','Metformin']
diabetes_prescriptions['is_metformin'] = diabetes_prescriptions.drug.apply(Filter_meds,args=(metformin_brands,))

#create a new dataframe with oral hypoglycemic prescriptions other than metformin
non_metformin_diabetes_prescriptions = diabetes_prescriptions[diabetes_prescriptions['is_metformin'] == 0]

In [None]:
#view diabetes prescriptions
diabetes_prescriptions

In [None]:
#view diabetes prescriptions which are not metformin to be used in later stages of the filtering
non_metformin_diabetes_prescriptions

<h3>Filter diabetes labs results</h3>

In [None]:
#get glucose labs results only 
diabetes_glucose_labs = labs[labs.itemid.isin(glucose_labs_labels.itemid)]

#get icu glucose labs results only
icu_charts_labs_results = charts[charts.itemid.isin(icu_labs_labels.itemid)]

#get A1C labs results only
diabetes_A1C_labs = labs[labs.itemid == 50852]

In [None]:
def Glucose_abnormal_filtering(diagnosed_patient , diabetes_glucose_labs):
    '''
    check whether any of the patient's glucose lab results is abnormal
    (abnormal range starts at 125 mg/dL)
 
    Args:
        diagnosed_patient (df row): the record of a diagnosed patient 
        diabetes_glucose_labs (df):  a dataframe that contains lab test details
 
    Returns:
        1: if the patient has an abnormal test result 
        0: if the patient doesn't have an abnormal test result
    
    '''
    patient_id = diagnosed_patient.subject_id
    #select the patient's lab results
    patient_labs_results = diabetes_glucose_labs[diabetes_glucose_labs.subject_id == patient_id]
    #check if any result is abnormal
    if(patient_labs_results.valuenum >= 125).any():
        return 1
    else:
        return 0 

def A1C_abnormal_filtering(diagnosed_patient , diabetes_A1C_labs):
    '''
    check whether any of the patient's hemoglobin A1C lab results is abnormal
    (abnormal range starts at 6.5)
 
    Args:
        diagnosed_patient (df row): the record of a diagnosed patient 
        diabetes_A1C_labs (df):  a dataframe that contains lab test details
 
    Returns:
        1: if the patient has an abnormal test result 
        0: if the patient doesn't have an abnormal test result
    
    '''
    patient_id = diagnosed_patient.subject_id
    #select the patient's lab results
    patient_labs_results = diabetes_A1C_labs[diabetes_A1C_labs.subject_id == patient_id]
    #check if any result is abnormal
    if(patient_labs_results.valuenum >= 6.5).any():
        return 1
    else:
        return 0 
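
Both functions answer the same question for each patient: does any result reach the threshold? Applied row by row, they rescan the whole lab table for every diagnosis row, which can be slow on tables the size of `labevents`. A minimal sketch of an alternative is shown here on toy tuples: compute the set of patients with an abnormal value once, then test membership. `abnormal_patients` is a hypothetical helper and the values are invented for illustration.

```python
import math

def abnormal_patients(results, threshold):
    """Return the set of subject_ids with at least one numeric value >= threshold."""
    flagged = set()
    for subject_id, value in results:
        # Missing values (None or NaN) never count as abnormal.
        if value is None or math.isnan(value):
            continue
        if value >= threshold:
            flagged.add(subject_id)
    return flagged

# Toy (subject_id, valuenum) pairs; not real MIMIC-IV data.
glucose = [(1, 98.0), (1, 131.0), (2, 110.0), (3, float("nan")), (4, 125.0)]
a1c = [(1, 5.9), (2, 7.2), (3, None)]

# Thresholds taken from the two functions above.
glucose_flagged = abnormal_patients(glucose, 125)
a1c_flagged = abnormal_patients(a1c, 6.5)

cohort = [1, 2, 3, 4]
glucose_flags = [int(p in glucose_flagged) for p in cohort]
a1c_flags = [int(p in a1c_flagged) for p in cohort]
print(glucose_flags, a1c_flags)
```

In pandas the same idea is a boolean filter on `valuenum` followed by `isin` on `subject_id`, the pattern the notebook already uses for `take_diabetes_meds`.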

In [None]:
#create a deepcopy of the diabetes diagnoses dataframe in order to create new columns 
#working on a copy ensures that pandas does not emit a "SettingWithCopyWarning"
filtered_diagnosed_patients = copy.deepcopy(diabetes_diagnoses)

#create a new column that flags whether a patient has a diabetes prescription or not 
filtered_diagnosed_patients['take_diabetes_meds'] = diabetes_diagnoses['subject_id'].isin(
    diabetes_prescriptions['subject_id']).astype(int)

#create a new column that flags whether a patient has an abnormal glucose lab or not
filtered_diagnosed_patients['abnormal_glucose_lab'] = diabetes_diagnoses.apply(Glucose_abnormal_filtering ,
    axis=1, args=(diabetes_glucose_labs, ))

#create a new column that flags whether a patient has an abnormal ICU glucose lab or not
filtered_diagnosed_patients['abnormal_ICU_lab'] = diabetes_diagnoses.apply(Glucose_abnormal_filtering ,
    axis=1, args=(icu_charts_labs_results, ))

#create a new column that flags whether a patient has an abnormal A1C lab or not
filtered_diagnosed_patients['abnormal_A1C_lab'] = diabetes_diagnoses.apply(A1C_abnormal_filtering,
    axis=1, args=(diabetes_A1C_labs,))

In [None]:
#check the labs after filtering 
filtered_diagnosed_patients

<h3>Filter diabetes patients</h3>

In [None]:
def Diabetes_cohort(patient_row):
    '''
    determine whether the patient is a diabetes patient or not 
 
    Args:
        patient_row (df row): the record of a diagnosed patient 
 
    Returns:
        1: if the patient is a diabetes patient 
        0: if the patient is not considered a diabetes patient
    
    '''
    #if the patient is taking a diabetes med return 1
    if(patient_row.take_diabetes_meds == 1):
        return 1
    #if the patient has any abnormal lab result return 1
    #(boolean `or` is used because `|` binds more tightly than `==`)
    elif(patient_row.abnormal_glucose_lab == 1 or patient_row.abnormal_A1C_lab == 1 
         or patient_row.abnormal_ICU_lab == 1):
        return 1
    else:
        return 0 
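
When several flag checks are combined in one condition, operator precedence matters. In Python `|` binds more tightly than `==`, so `a == 1 | b == 1` is parsed as the chained comparison `a == (1 | b) == 1` rather than as "a is 1 or b is 1". A small check with made-up flag values illustrates the difference:

```python
# Made-up flags: glucose lab normal, A1C lab abnormal.
abnormal_glucose_lab = 0
abnormal_a1c_lab = 1

# Parsed as: abnormal_glucose_lab == (1 | abnormal_a1c_lab) == 1
bitwise_form = abnormal_glucose_lab == 1 | abnormal_a1c_lab == 1

# Boolean `or` evaluates each comparison separately.
boolean_form = abnormal_glucose_lab == 1 or abnormal_a1c_lab == 1

print(bitwise_form, boolean_form)  # False True
```

The same applies to `&` versus `and`: for scalar values taken from a row, `and`/`or` (or explicit parentheses around each comparison) give the intended logic.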

In [None]:
#create a deepcopy of the dataframe to create a new column.
diagnosed_patients_cohort = copy.deepcopy(filtered_diagnosed_patients) 

#create a new column that flags diabetes patients through applying a function
diagnosed_patients_cohort['has_diabetes'] = filtered_diagnosed_patients.apply(Diabetes_cohort,axis=1)

In [None]:
#the dataframe of diabetes patients cohort 
diagnosed_patients_cohort

<h3> Label Type 1 diabetes or Type 2 diabetes </h3>

After identifying the diabetes patient group, the next step is to identify which diabetes phenotype each patient has.

In [None]:
#defines type 1 diabetes and type 2 diabetes ICD codes.
#reference:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10283086/
Type1DM_icd_codes =  ['^250[0-9]1$' , '^250[0-9]3$' , '^E10']
Type2DM_icd_codes =  ['^250[0-9]0$' , '^250[0-9]2$' , '^E11']
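
To make the patterns concrete: the ICD-9 codes matched here have the form `250` plus two digits, where the last digit distinguishes the type (1 or 3 for type 1, 0 or 2 for type 2), and ICD-10 uses the `E10` and `E11` prefixes. The sketch below applies the same regular expressions to a few example codes with the standard `re` module. `code_type` is a hypothetical helper, not part of the notebook.

```python
import re

TYPE1_PATTERNS = ['^250[0-9]1$', '^250[0-9]3$', '^E10']
TYPE2_PATTERNS = ['^250[0-9]0$', '^250[0-9]2$', '^E11']

# Join the alternatives the same way the notebook does before calling str.contains.
type1_re = re.compile('|'.join(TYPE1_PATTERNS))
type2_re = re.compile('|'.join(TYPE2_PATTERNS))

def code_type(icd_code):
    """Return 1 or 2 for a type 1 / type 2 diabetes code, or 0 if neither matches."""
    if type1_re.search(icd_code):
        return 1
    if type2_re.search(icd_code):
        return 2
    return 0

examples = ['25001', '25000', '25013', '25062', 'E109', 'E119', '4019']
classified = {code: code_type(code) for code in examples}
print(classified)
```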

In [None]:
#calculate the frequency of type 1 diabetes ICD code from each patient 
diagnosed_patients_cohort['DMT1_Code_Frequency'] = diagnosed_patients_cohort.subject_id.where(
    diagnosed_patients_cohort['icd_code'].str.contains('|'.join(Type1DM_icd_codes))).groupby(
    diagnosed_patients_cohort['subject_id']).transform('count')

#calculate the frequency of type 2 diabetes ICD code from each patient 
diagnosed_patients_cohort['DMT2_Code_Frequency'] =diagnosed_patients_cohort.subject_id.where(
    diagnosed_patients_cohort['icd_code'].str.contains('|'.join(Type2DM_icd_codes))).groupby(
    diagnosed_patients_cohort['subject_id']).transform('count')

#calculate the ratio of type 1 codes to type 2 codes for each patient 
diagnosed_patients_cohort['DM1_DM2_Ratio'] = (diagnosed_patients_cohort.DMT1_Code_Frequency 
                                              / diagnosed_patients_cohort.DMT2_Code_Frequency)
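
The ratio is undefined for a patient with no type 2 codes. Assuming the count columns behave like ordinary NumPy-backed numeric columns, a non-zero count divided by zero yields `inf` and zero divided by zero yields `NaN`; `inf` still compares greater than 0.5, while every comparison with `NaN` is false. The pure-Python sketch below mimics that behavior on toy rows. `type1_to_type2_ratio` is a hypothetical helper.

```python
import math
from collections import Counter

def type1_to_type2_ratio(type1_count, type2_count):
    """Mimic float division semantics: x/0 -> inf for x > 0, and 0/0 -> NaN."""
    if type2_count == 0:
        return math.inf if type1_count > 0 else math.nan
    return type1_count / type2_count

# Toy (subject_id, code_type) rows: 1 = type 1 code, 2 = type 2 code.
rows = [(10, 1), (10, 1), (10, 2), (11, 2), (11, 2), (12, 1)]
type1 = Counter(pid for pid, t in rows if t == 1)
type2 = Counter(pid for pid, t in rows if t == 2)

patients = sorted({pid for pid, _ in rows})
ratios = {pid: type1_to_type2_ratio(type1[pid], type2[pid]) for pid in patients}
print(ratios)
```

Patient 12 has only type 1 codes, so the ratio is `inf` and the "greater than 0.5" branches of the labeling rule still apply.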

In [None]:
#drop duplicate rows to keep one row per diabetes patient (unique subject_id)
diagnosed_patients_cohort.drop_duplicates(subset=['subject_id'] , inplace=True)

In [None]:
#get glucagon prescriptions only 
glucagon_rx = diabetes_prescriptions[diabetes_prescriptions.drug.str.contains
                                     (r'(?i)\bglucagon\b' ,regex=True, case=False)]

In [None]:
#flag with 1 if the patient has a glucagon prescription
diagnosed_patients_cohort['glucagon_rx'] =  np.where(diagnosed_patients_cohort.subject_id.isin(
    glucagon_rx.subject_id),1 , 0)

#flag with 1 if the patient has any oral hypoglycemic medication other than metformin  
diagnosed_patients_cohort['non_metformin_meds'] =  np.where(diagnosed_patients_cohort.subject_id.isin(
    non_metformin_diabetes_prescriptions.subject_id),1 , 0)

In [None]:
def Diabetes_Type_labeling(row):
    '''
    determine the phenotype of the diabetes patient 
 
    Args:
        row (df row): the record of a diagnosed patient 
 
    Returns:
        0: if the patient doesn't meet the requirements for the phenotypes
        1: if the patient is a type 1 diabetes patient 
        2: if the patient is a type 2 diabetes patient
    
    '''
    #boolean `and` is used because `&` binds more tightly than `==`
    if(row.DM1_DM2_Ratio <= 0.5):
        return 2 
    elif(row.DM1_DM2_Ratio > 0.5 and row.glucagon_rx == 1):
        return 1
    elif(row.DM1_DM2_Ratio > 0.5 and row.glucagon_rx == 0 and row.non_metformin_meds == 0):
        return 1
    elif(row.DM1_DM2_Ratio > 0.5 and row.glucagon_rx == 0 and row.non_metformin_meds == 1):
        return 2
    else: 
        return 0
        

In [None]:
#determine the diabetes phenotype label of each patient 
diagnosed_patients_cohort['Label'] = diagnosed_patients_cohort.apply(Diabetes_Type_labeling, axis = 1)

#get diabetes type 1 patients
diabetesT1_patients = diagnosed_patients_cohort[diagnosed_patients_cohort.Label==1]

#get diabetes type 2 patients
diabetesT2_patients = diagnosed_patients_cohort[diagnosed_patients_cohort.Label==2]

In [None]:
#view the type 1 patients (2344 patients identified in the original run)
diabetesT1_patients

In [None]:
#view the type 2 patients (34060 patients identified in the original run)
diabetesT2_patients

<h3> Filter, label and clean the discharge notes</h3>

In [None]:
def Filter_notes(note):
    '''
    filter the note based on common terms for diabetes according to diabetes.org. 

    Args:
        note (str): a patient clinical note  
 
    Returns:
        0: if the note doesn't contain any of the terms 
        1: if a term was found in the note
    
    '''
    #list of the common words related to diabetes. ref:https://diabetes.org/about-diabetes/common-terms
    words = ['diabetes','diabetic','insulin', 'sugar','A1C','glucose','hyperglycemia','hypoglycemia',
             'Euglycemia', 'diabet','Fasting', 'hypoglycem','hypoglycemic','pancreas'] 
    keyword_processor = KeywordProcessor()
    #add the words to the KeywordProcessor dictionary to be searched
    keyword_processor.add_keywords_from_list(words)
    #search the note for the common diabetes terms
    if(keyword_processor.extract_keywords(note)):
        #if found return 1
        return 1 
    else:
        #if not found return 0
        return 0 

In [None]:
#filter the notes to get only the notes of type 1 diabetes patients
#(.copy() so that new columns can be added without a SettingWithCopyWarning)
notes_T1 = notes[notes.subject_id.isin(diabetesT1_patients.subject_id)].copy()
#filter the notes again to get only the relevant patient clinical notes 
notes_T1['is_diabetes_related'] = notes_T1.text.apply(Filter_notes)
notes_T1 = notes_T1[notes_T1.is_diabetes_related == 1]
#get a sample of 7000 notes
notes_T1 = notes_T1.sample(n=7000, random_state=42)
#assign the label 1 to these notes (type 1 is the positive class in the binary classification)
notes_T1['label'] = 1
#filter the notes to get only the notes of type 2 diabetes patients
notes_T2 = notes[notes.subject_id.isin(diabetesT2_patients.subject_id)].copy()
#filter the notes again to get only the relevant patient clinical notes 
notes_T2['is_diabetes_related'] = notes_T2.text.apply(Filter_notes)
notes_T2 = notes_T2[notes_T2.is_diabetes_related == 1]
#get a sample of 7000 notes
notes_T2 = notes_T2.sample(n=7000, random_state=42)
#assign the label 0 to these notes (type 2 is the negative class in the binary classification)
notes_T2['label'] = 0 

#merge the notes 
labeled_notes = pd.concat([notes_T1, notes_T2])

In [None]:
#reset index as it was shuffled in the previous steps
labeled_notes.reset_index(inplace=True)
labeled_notes = labeled_notes[['text','label']]

In [None]:
#view the labeled notes dataframe
labeled_notes

In [None]:
def Cleaning_notes(text):
    '''
    preprocess and clean text so that it can be used in modeling

    Args:
        text (str): a note to be cleaned
 
    Returns:
        cleaned_text: preprocessed and cleaned text
    
    '''
    #replace new lines with a space so that words on adjacent lines are not joined
    text = re.sub(r'\n', ' ', text)
    #remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    #remove special characters
    text = re.sub(r"[-()\"#/&@$;×+:<>{}`'+=~|.!?,]",'', text) 
    #remove tokens that contain digits, then lowercase the text
    text = re.sub(r'\w*\d\w*', '', text)
    text = re.sub(r'\d+','', text.lower())
    #collapse runs of three or more whitespace characters into a single space
    text = re.sub(r'\s{3,}', ' ', text)
    #remove words that typically appear in the header of the notes 
    header_words = ['name', 'unit', 'admission', 'allergy', 'date', 'discharge' ,'birth', 'sex', 'service']
    for word in header_words:
        text = re.sub(re.escape(word), '', text)
    #remove leading and trailing spaces 
    text = text.strip()
    #tokenize the words
    tokenized_words = word_tokenize(text)
    #remove stop words (the set is built once per note instead of once per token)
    stop_words = set(stopwords.words("english"))
    tokenized_words = [w for w in tokenized_words if w not in stop_words]
    #lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_words]
    #keep tokens that are more than one character 
    cleaned_single = [word for word in lemmatized_words if len(word) > 1]
    #detokenize the words
    cleaned_text = TreebankWordDetokenizer().detokenize(cleaned_single)
    
    #return the cleaned note
    return cleaned_text
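
The regular-expression steps can be tried in isolation on a short string. The sketch below is a simplified version that covers only the punctuation, digit, and whitespace steps; it leaves out tokenization, stop-word removal, and lemmatization, which need NLTK resources. `basic_clean` is a hypothetical name and the sample text is invented. In this sketch newlines are replaced with a space and all whitespace runs are collapsed, which is an assumption of the example.

```python
import re
import string

def basic_clean(text):
    """Apply punctuation, digit-token, and whitespace cleanup to one string."""
    # Replace newlines with spaces so words on adjacent lines stay separate.
    text = text.replace('\n', ' ')
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop any token that contains a digit, then lowercase.
    text = re.sub(r'\w*\d\w*', '', text).lower()
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = "Pt with T2DM, HbA1c 8.2%.\nStarted metformin 500mg BID."
cleaned = basic_clean(sample)
print(cleaned)  # pt with started metformin bid
```

Note that the digit step removes every token containing a digit, so abbreviations such as "T2DM" and "HbA1c" do not survive cleaning.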

In [None]:
%%time
#clean the notes
labeled_notes.text = labeled_notes.text.apply(Cleaning_notes)

In [None]:
#view the labeled notes after cleaning 
labeled_notes

# Classification

In [None]:
#split the notes and labels 
X = labeled_notes.text
y = labeled_notes.label

<h3>Approach 1 (TF-IDF)</h3>

In [None]:
#split the data into training and testing set 
X_train, X_test, y_train, y_test = train_test_split( X,y , random_state=42,test_size=0.20, shuffle=True)

In [None]:
def tokenizer(text):
    '''    
    tokenizes a note 

    Args:
        text (str): a note to be tokenized
 
    Returns:
        tokens: tokenized note 
    
    '''
    tokens = word_tokenize(text)
    return tokens

In [None]:
#create a TF-IDF vectorizer 
vectorizer = TfidfVectorizer(ngram_range=(1,2),lowercase = False, min_df=3 ,tokenizer = tokenizer)
#fit the TF-IDF vectorizer on the training set and apply it to both sets
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
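
For reference, with its default settings (`smooth_idf=True`, `norm='l2'`, raw term counts) `TfidfVectorizer` weights a term as tf × idf with idf(t) = ln((1 + n) / (1 + df(t))) + 1, then L2-normalizes each document vector. The pure-Python sketch below reproduces that formula on a toy corpus with unigrams only. It illustrates the weighting and is not the notebook's vectorizer, which also uses bigrams and `min_df=3`; the function name `tfidf` and the documents are invented.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using smoothed idf and L2 norm."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = ["insulin pump insulin", "metformin daily", "insulin daily"]
vectors = tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second document, "metformin" appears in fewer documents than "daily", so it receives the larger weight.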

In [None]:
def SVR_clf(x_train, y_train):
    '''
    fit a linear support vector classifier (SVC) on training data

    Args:
        x_train (matrix): X training set (vectorized notes)
        y_train (int): y training set (label)
 
    Returns:
        clf: fitted SVC classifier 
    
    '''
    clf = SVC(random_state=42 ,C=0.1,kernel='linear')
    clf.fit(x_train, y_train)
    return clf

def NB_clf(x_train, y_train):
    '''
    fit Naive Bayes classifier on training data

    Args:
        x_train (matrix): X training set (vectorized notes)
        y_train (int): y training set (label)
 
    Returns:
        clf: fitted NB classifier 
    
    '''
    clf = MultinomialNB()
    clf.fit(x_train, y_train)
    return clf

def LR_clf(x_train, y_train):
    '''
    fit Logistic Regression classifier on training data

    Args:
        x_train (matrix): X training set (vectorized notes)
        y_train (int): y training set (label)
 
    Returns:
        clf: fitted LR classifier 
    
    '''
    clf = LogisticRegression(random_state=42)
    clf.fit(x_train, y_train)
    return clf 

def evaluation(X_set,y_true,clf):
    '''
    predict using the fitted classifier and evaluate the performance

    Args:
        X_set (matrix): the vectorized set (training or testing) to be used for prediction
        y_true (int): the true labels of that set 
        clf: fitted classifier
 
    Returns:
        None; prints the classification evaluation report 
    
    '''
    y_pred = clf.predict(X_set)
    report = classification_report(y_true, y_pred)
    print(report)
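
`classification_report` prints precision, recall, and F1 per class. As a reminder of what those numbers mean, this sketch computes them by hand for one class from lists of true and predicted labels. The labels are invented and `prf` is a hypothetical helper.

```python
def prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels: 1 = type 1 diabetes note, 0 = type 2 diabetes note.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]
precision, recall, f1 = prf(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

Here three of the four predicted positives are correct and three of the four actual positives are found, so precision and recall are both 0.75.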

In [None]:
%%time
#train the SVM classifier using TF-IDF vectors as input and get 
#the evaluation results on training and testing set
SVR_tdidf = SVR_clf(X_train, y_train)
print("training set report:")
evaluation(X_train,y_train,SVR_tdidf)
print("testing set report")
evaluation(X_test,y_test,SVR_tdidf)

In [None]:
%%time
#train the Naive Bayes classifier using TF-IDF vectors as input
#and get the evaluation results on training and testing set
NB_tdidf = NB_clf(X_train, y_train)
print("training set report")
evaluation(X_train,y_train,NB_tdidf)
print("testing set report")
evaluation(X_test,y_test,NB_tdidf)

In [None]:
%%time
#train the logistic regression classifier using TF-IDF vectors as input
#and get the evaluation results on training and testing set
LR_tdidf = LR_clf(X_train, y_train)
print("training set report")
evaluation(X_train,y_train,LR_tdidf)
print("testing set report")
evaluation(X_test,y_test,LR_tdidf)

In [None]:
#create a dataframe with the words from the TF-IDF vectorizer as the index
#and the coefficients of the fitted logistic regression classifier as the values in order to obtain the 
#top words that contributed to the classification of the positive class
#in this instance diabetes type 1 is the positive class. 
important_tokens = pd.DataFrame(data=LR_tdidf.coef_[0],index=vectorizer.get_feature_names_out(),
columns=['coefficient'])
#sort values 
top__coeff = important_tokens.sort_values(by=['coefficient'], ascending=False).head(15)
#reset index 
top__coeff.reset_index(inplace=True)
# figure Size
fig, ax = plt.subplots(figsize =(16, 9))
# horizontal bar Plot
ax.barh(top__coeff['index'], top__coeff.coefficient)
ax.set_title('Feature importance of the top 15-ranked using coefficients of Logistic Regression for Type 1 Diabetes',
             loc ='center', fontsize =16)
plt.yticks(fontsize=12)
plt.xlabel('Importance')
#the labels were reversed to generate the diabetes type 2 top words (can be found in the report)

<h3>Approach 2 (Word Embeddings)</h3>

In [None]:
#split the dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split( X,y , random_state=42,test_size=0.20, shuffle=True)

In [None]:
%%time
#split each training note into a list of tokens (word2vec expects tokenized text)
sentences = [sentence.split() for sentence in X_train]
#train a word2vec model to generate word embeddings 
w2v_model = Word2Vec(sentences, vector_size=300, window=50, min_count=10, workers=4 ,sg=0)

In [None]:
def vectorize(sentence):
    '''
    vectorize a note by averaging the word2vec vectors of its words

    Args:
        sentence (str): a cleaned note (text) 
 
    Returns:
        words_vectors: the mean of the word vectors of the note 
    
    '''
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        #fall back to a zero vector with the same dimension as the embeddings
        return np.zeros(w2v_model.wv.vector_size)
    words_vecs = np.array(words_vecs)
    words_vectors = words_vecs.mean(axis=0)
    return words_vectors

#apply vectorize to preprocess the notes
X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
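
The averaging step can be illustrated without a trained model. In this sketch a small dictionary stands in for the Word2Vec vocabulary, and the fallback zero vector has the same length as the embeddings so that all rows can later be stacked into one rectangular array. `mean_vector`, `DIM`, and the toy embeddings are hypothetical.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 3-dimensional embeddings standing in for a trained Word2Vec model.
DIM = 3
toy_embeddings = {
    "insulin": [1.0, 0.0, 2.0],
    "pump": [3.0, 2.0, 0.0],
}

note_vec = mean_vector("insulin pump therapy".split(), toy_embeddings, DIM)
empty_vec = mean_vector("unknown words".split(), toy_embeddings, DIM)
print(note_vec, empty_vec)
```

Words missing from the vocabulary ("therapy" here) are skipped, so the average is taken over the known words only.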

In [None]:
#create a scaler 
scaler = MinMaxScaler()
#apply min-max scaler which scale the values between 0 and 1
#as algorithms like naive bayes don't allow negative values 
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
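
Min-max scaling maps each feature to (x − min) / (max − min), with min and max taken from the training set only. One consequence is that test values outside the training range land outside [0, 1], including below zero. A one-feature sketch with invented numbers and hypothetical helper names:

```python
def fit_min_max(values):
    """Return (min, max) of the training values for one feature."""
    return min(values), max(values)

def transform_min_max(values, lo, hi):
    """Scale values with the training min and max; a constant feature maps to 0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

train_feature = [-0.5, 0.0, 1.5]
test_feature = [-1.0, 0.5, 2.0]

lo, hi = fit_min_max(train_feature)
scaled_train = transform_min_max(train_feature, lo, hi)
scaled_test = transform_min_max(test_feature, lo, hi)
print(scaled_train)  # [0.0, 0.25, 1.0]
print(scaled_test)   # [-0.25, 0.5, 1.25]
```

If out-of-range test values are a concern, scikit-learn's `MinMaxScaler` accepts `clip=True` to clamp transformed values to the feature range.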

In [None]:
%%time
#train the SVM classifier using word embeddings vectors as input
#and get the evaluation results on training and testing set
SVR_WordEmbedding = SVR_clf(X_train, y_train)
print("training set report")
evaluation(X_train,y_train,SVR_WordEmbedding)
print("testing set report")
evaluation(X_test,y_test,SVR_WordEmbedding)

In [None]:
%%time
#train the Naive Bayes classifier using word embeddings vectors as input
#and get the evaluation results on training and testing set
NB_WordEmbedding = NB_clf(X_train, y_train)
print("training set report")
evaluation(X_train,y_train,NB_WordEmbedding)
print("testing set report")
evaluation(X_test,y_test,NB_WordEmbedding)

In [None]:
%%time
#train the logistic regression classifier using word embeddings vectors as input
#and get the evaluation results on training and testing set
LR_WordEmbedding = LR_clf(X_train, y_train)
print("training set report")
evaluation(X_train,y_train,LR_WordEmbedding)
print("testing set report")
evaluation(X_test,y_test,LR_WordEmbedding)

<h3>Approach 3 (Clinical BERT)</h3>

In [None]:
#shuffle the dataset
labeled_notes = labeled_notes.sample(frac =1) 
#get a sample of 7000 notes so it will be used in the modeling 
labeled_notes_subset = labeled_notes.sample(7000)
#reset index as it was shuffled in the previous step
labeled_notes_subset.reset_index(inplace=True)
labeled_notes_subset = labeled_notes_subset[['text','label']]

In [None]:
#load the tiny clinicalBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("nlpie/tiny-clinicalbert")
#load the tiny clinicalBERT pretrained model and set it up for binary classification
model = AutoModelForSequenceClassification.from_pretrained("nlpie/tiny-clinicalbert", num_labels=2)

In [None]:
#split the data into training and testing set
train , test = train_test_split(labeled_notes_subset, shuffle=True ,test_size=0.2, random_state=42)
#create a Dataset class type for training and testing set 
#transformers require the data to be in this form
train_data = Dataset.from_pandas(train)
test_data = Dataset.from_pandas(test)

In [None]:
def BERT_tokenizer(df):
    '''
    tokenize the text using the BERT tokenizer 

    Args:
        df (dict): a batch of examples from the Dataset, with the notes under the 'text' key 
 
    Returns:
        tokenized_text: tokenized notes
    
    '''
    tokenized_text = tokenizer(df['text'], padding=True, 
    max_length = 512, truncation=True, return_tensors="pt")
    
    return tokenized_text

In [None]:
tokenized_train = train_data.map(BERT_tokenizer, batched=True)
tokenized_test = test_data.map(BERT_tokenizer, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [None]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    '''
    compute the accuracy from the model outputs during evaluation 

    Args:
        eval_pred: variable that contains the raw predictions of the transformer (logits) and the true labels 
 
    Returns:
        result: accuracy evaluation result 
    
    '''
    logits, labels = eval_pred
    #take the class with the highest logit as the predicted label
    predictions = np.argmax(logits, axis=-1)
    #compute the accuracy during the training 
    result = metric.compute(predictions=predictions, references=labels)
    
    return result
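
The model outputs one logit per class for each note. Logits are unnormalized scores rather than probabilities, but because softmax preserves their order, taking the argmax of the logits selects the same class as taking the argmax of the probabilities. The pure-Python sketch below shows the argmax and accuracy steps on invented logits; `argmax` and `accuracy_from_logits` are hypothetical helpers.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Invented logits for four notes and two classes (0 = type 2, 1 = type 1).
toy_logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, -0.5], [0.9, 0.4]]
toy_labels = [0, 1, 0, 0]
toy_predictions, toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_predictions, toy_accuracy)  # [0, 1, 1, 0] 0.75
```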

In [None]:
#set the training arguments to fine-tune the pretrained model for the classification task
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    evaluation_strategy = "epoch",
    logging_strategy="epoch"
)

#pass the model, datasets, hyperparameters, data collator and metrics into the trainer for training
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
    
)

In [None]:
%%time
#train the model
trainer.train()

In [None]:
#save the model 
trainer.save_model('BERTdiabetes_model')

In [None]:
#predict using the model on the training set
y_pred_train = trainer.predict(tokenized_train)
#take the class with the highest logit as the predicted label
y_pred_train = np.argmax(y_pred_train.predictions, axis=-1)
#convert the y_train into a list 
y_train = train['label'].tolist()
#get the evaluation metrics 
eval_report_train = classification_report(y_train, y_pred_train)
print("training set report")
print(eval_report_train)

#predict using the model on the testing set
y_pred = trainer.predict(tokenized_test)
#take the class with the highest logit as the predicted label
y_pred = np.argmax(y_pred.predictions, axis=-1)
#convert the y_test into a list 
y_test = test['label'].tolist()
#get the evaluation metrics 
eval_report = classification_report(y_test, y_pred)
print("testing set report")
print(eval_report)

<h3>Acknowledgments</h3>
I would like to express my gratitude to: <br>
1-The Stack Overflow community for their various answers to problems I faced during coding<br>
2-Hugging Face documentation & community<br>
3-Neri Van Otten for her word2vec tutorial: https://spotintelligence.com/2023/02/15/word2vec-for-text-classification <br>
4-Ray for the transformers tutorial: https://docs.ray.io/en/latest/train/getting-started-transformers.html <br> 
5-MIMIC-IV contributors: https://physionet.org/content/mimiciv/2.2/

