# Natural Language Processing - Medical Bios by CoastalCPH

This notebook documents the preprocessing, estimation and evaluation of our NLP model. It is divided into the following sections:

1. **Loading dataset**: This step contains importing relevant libraries as well as loading the dataset from HuggingFace and checking for any missing values. 

2. **Text Preprocessing**: This includes using the Twitter preprocessor to remove URLs, numbers and mentions from the text strings, as well as SpaCy to remove stopwords. The purpose is to remove any 'irrelevant' characters that could cause noise in the model.

3. **Handle Imbalanced Data**: The purpose of this is to reduce the potential bias in the model. 

4. **Text Vectorization**: This step converts the cleaned text data into a numerical format that is suitable for machine learning.

5. **Model Building**: This includes building different classification models.

6. **Model Evaluation**: This section consists of fitting the selected models and evaluating their performance on both the training set and a test set. It also covers explainability of predictions using LIME.

7. **Topic Modelling**: This part determines the main topics used in the bios of each profession.

8. **Export model components**: The purpose of this is to make our interface build and run more efficiently.

## Load dataset

In [1]:
from datasets import load_dataset # For loading datasets from HuggingFace 
import pandas as pd # For data manipulation

import preprocessor as prepro # Twitter preprocessor for removing URLs, numbers, mentions etc.
import spacy # spaCy for fast language preprocessing
nlp = spacy.load('en_core_web_sm') # instantiating English module

import altair as alt # for data visualization

from imblearn.under_sampling import RandomUnderSampler # balances the classes by randomly removing rows from the larger ones

from sklearn.pipeline import make_pipeline #pipeline creation
from sklearn.feature_extraction.text import TfidfVectorizer # transforms text to sparse matrix
from sklearn.decomposition import TruncatedSVD # dimensionality reduction (particularly well suited for sparse matrices with 1000s of columns)
from sklearn.metrics import classification_report 
from sklearn.ensemble import RandomForestClassifier # RandomForest model for classification
from sklearn.linear_model import LogisticRegression # Logistic model for classification
from xgboost import XGBClassifier # XGBoost model for classification

from lime.lime_text import LimeTextExplainer # explainability of outcome
from gensim.corpora.dictionary import Dictionary # Import the dictionary builder for topic modelling
from gensim.models import LdaMulticore # we'll use the faster multicore version of LDA

# Import pyLDAvis for interactive topic model visualization
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

import pickle # For exporting components

# Ignore all warnings to have cleaner outputs
import warnings 
warnings.simplefilter(action='ignore') 




In [2]:
#Load dataset from HuggingFace 
dataset = load_dataset("coastalcph/medical-bios", "standard")
df = pd.DataFrame(dataset['train'])

In [3]:
df.head()

# Meaning of different labels:
    # 0 = psychologist
    # 1 = surgeon
    # 2 = nurse
    # 3 = dentist
    # 4 = physician

Unnamed: 0,text,label
0,He has been a practicing Dentist for 20 years....,3
1,He was happy to return to this area with his w...,1
2,"Having counseled more than 1,700 clients, he s...",0
3,She has received a 3.5 out of 5 star rating by...,4
4,Father Percival owns his own bamboo fence busi...,2


In [4]:
# Check for any missing values:
df.info()
# No missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    8000 non-null   object
 1   label   8000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 125.1+ KB


## Text Preprocessing

In [5]:
prepro.set_options(prepro.OPT.URL, # removes URLs
                   prepro.OPT.NUMBER, # removes numbers
                   prepro.OPT.RESERVED, # removes reserved words
                   prepro.OPT.MENTION, # removes any mentions
                   prepro.OPT.SMILEY) # removes text smileys such as :-)

In [6]:
def text_prepro(texts: pd.Series) -> list:
    """
    Preprocess a series of texts.

    Parameters:
    - texts: A pandas Series containing the text to be preprocessed.

    Relies on the module-level spaCy model `nlp` loaded above.

    Returns:
    - A list of preprocessed texts.

    Steps:
    - Clean twitter-specific characters using a predefined 'prepro' method.
    - Normalize the text by lowercasing and lemmatizing.
    - Remove punctuations, stopwords, and non-alphabet characters.
    """

    # Clean specific characters and other special characters
    texts_cleaned = texts.map(prepro.clean)
    texts_cleaned = texts_cleaned.str.replace('#', '')

    # Initialize container for the cleaned texts
    clean_container = []

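    # Use spaCy's nlp.pipe for efficient batch processing.
    # Note: with the tagger disabled, the lemmatizer appears to have no POS tags to work with,
    # so lemma_ mostly stays equal to the surface form (e.g. "years", "owns" in the output below).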
    for doc in nlp.pipe(texts_cleaned, disable=["tagger", "parser", "ner"]):

        # Keep the lowercased lemmas of tokens that are alphabetic and are neither stopwords nor punctuation
        tokens = [token.lemma_.lower() for token in doc # lowercase the lemma
                  if token.is_alpha # keep alphabetic tokens only
                  and not token.is_stop # remove stopwords
                  and not token.is_punct] # remove punctuation

        clean_container.append(" ".join(tokens))

    return clean_container
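
To make the token filter concrete without loading spaCy, here is a simplified stand-in. It is only a sketch: `simple_clean` and the tiny `STOPWORDS` set are hypothetical, and whitespace splitting replaces spaCy's tokenizer, so results on real bios will differ from `text_prepro`.

```python
import string

# Hypothetical stopword list, far smaller than spaCy's English list
STOPWORDS = {"he", "has", "been", "a", "for"}

def simple_clean(text):
    """Lowercase, strip surrounding punctuation, keep alphabetic non-stopwords."""
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word.isalpha() and word not in STOPWORDS:
            tokens.append(word)
    return " ".join(tokens)

cleaned = simple_clean("He has been a practicing Dentist for 20 years.")
print(cleaned)  # practicing dentist years
```

The number `20` is dropped by the `isalpha` check, which mirrors the role of `token.is_alpha` in the function above.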

In [7]:
# Apply the preprocessing pipeline
df['text_clean'] = text_prepro(df['text'])

In [8]:
# Confirm that it worked
df.head(10)

Unnamed: 0,text,label,text_clean
0,He has been a practicing Dentist for 20 years....,3,practicing dentist years bds currently associa...
1,He was happy to return to this area with his w...,1,happy return area wife children opportunity ar...
2,"Having counseled more than 1,700 clients, he s...",0,having counseled clients says fail understand ...
3,She has received a 3.5 out of 5 star rating by...,4,received star rating patients areas expertise ...
4,Father Percival owns his own bamboo fence busi...,2,father percival owns bamboo fence business fre...
5,Dr. Prasoon Kumar Tripathi practices at Raghva...,3,prasoon kumar tripathi practices raghvansh hos...
6,While psychological assessment is her key area...,0,psychological assessment key area concentratio...
7,He is board Certified by the American Board of...,1,board certified american board orthopaedic sur...
8,She has worked for Hamilton Health Sciences si...,2,worked hamilton health sciences manager depart...
9,He has been a successful Dentist for the last ...,3,successful dentist years mds currently practis...


## Handle Imbalanced Data

In [9]:
# Check the distribution of our dataset
df.label.value_counts().reset_index()

Unnamed: 0,label,count
0,0,2200
1,2,1638
2,3,1533
3,4,1349
4,1,1280


In [10]:
# Count and reset index
data_chart = df.label.value_counts().reset_index()

# Replace numerical categories with textual descriptions
data_chart['label'] = data_chart['label'].map({0: 'psychologist', 
                                               1: 'surgeon', 
                                               2: 'nurse', 
                                               3: 'dentist', 
                                               4: 'physician'})

# Plot the chart
chart = alt.Chart(data_chart).mark_bar(filled=True).encode(
    alt.X('count:Q', title='count'),
    alt.Y('label:O', title='Category', sort='-x'),
    color=alt.Color('label:N', legend=alt.Legend(title="Label Types"), scale=alt.Scale(
        domain=['psychologist', 'surgeon', 'nurse', 'dentist', 'physician'],
        range=['red', 'orange', 'blue', 'green', 'purple']
    ))
)

chart
# We have almost twice as many psychologists (2200) as surgeons (1280)
# --> it makes sense to balance the data

In [11]:
# To balance the data, we randomly remove entries from the larger classes
rus = RandomUnderSampler(random_state=42) # fixed seed so the same rows get removed every time
df_res, y_res = rus.fit_resample(df, df['label'])
df_res['label'].value_counts()
# The dataset is now balanced 

label
0    1280
1    1280
2    1280
3    1280
4    1280
Name: count, dtype: int64
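
As a rough illustration of what random undersampling does, the sketch below trims every class to the size of the rarest one. It is a minimal pure-Python version under simple assumptions, not imbalanced-learn's implementation; the helper name `undersample` and the toy labels are made up for the example.

```python
import random
from collections import Counter, defaultdict

def undersample(labels, seed=42):
    """Return sorted row indices so that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        rows_by_class[label].append(idx)
    n_min = min(len(rows) for rows in rows_by_class.values())
    kept = []
    for rows in rows_by_class.values():
        kept.extend(rng.sample(rows, n_min))  # sample without replacement
    return sorted(kept)

toy_labels = [0] * 6 + [1] * 3 + [2] * 4
kept = undersample(toy_labels)
counts = Counter(toy_labels[i] for i in kept)
print(counts)
```

In the notebook the rarest class (surgeon) has 1280 rows, which is why every class ends up with 1280.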

## Text Vectorization

In [12]:
# Initialize the text vectorization
tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components = 100) # reduce the TF-IDF matrix to 100 components
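
For intuition about what the TF-IDF step computes, here is a small hand-rolled version on a made-up three-document corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation); the names `toy_docs` and `tfidf_vector` are hypothetical and the sketch ignores tokenization details.

```python
import math
from collections import Counter

toy_docs = ["dentist clinic tooth", "surgeon hospital clinic", "nurse hospital"]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(doc):
    """Term weights for one document: tf * smoothed idf, then L2-normalised."""
    tf = Counter(doc)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenized[0])
print(vec)
```

Terms that occur in fewer documents ("dentist", "tooth") receive a higher weight than terms shared across documents ("clinic").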

## Model Building

In [13]:
# Divide train dataset into y and X variables
X_train = df_res['text_clean']
y_train = df_res['label']

In [14]:
# First we try an XGBoost model
cls = XGBClassifier()
pipe_xg = make_pipeline(tfidf, svd, cls)
pipe_xg.fit(X_train, y_train)

In [15]:
# Then we try a RandomForest model
cls = RandomForestClassifier(n_estimators=100, random_state=42)
pipe_rf = make_pipeline(tfidf, svd, cls)
pipe_rf.fit(X_train, y_train)

In [16]:
# Lastly, we try a logistic regression model (fitted directly on the TF-IDF matrix, without SVD)
cls = LogisticRegression()
pipe_log = make_pipeline(tfidf, cls)
pipe_log.fit(X_train, y_train)
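
One thing worth checking in the three cells above: `make_pipeline` stores the estimator objects it is given rather than copying them, so all three pipelines hold the same `tfidf` instance, and `pipe_xg` and `pipe_rf` hold the same `svd`. Each later `fit` call therefore refits those shared steps, and since `TruncatedSVD` is created without a `random_state`, the projection the XGBoost classifier was trained on may no longer match the one it sees at prediction time. The sketch below only demonstrates the sharing and one possible way around it with `sklearn.base.clone`; the pipeline names are hypothetical and this is not the notebook's original code.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tfidf_step = TfidfVectorizer()

# Two pipelines built from the same vectorizer object
pipe_a = make_pipeline(tfidf_step, LogisticRegression())
pipe_b = make_pipeline(tfidf_step, LogisticRegression())
shared = pipe_a.steps[0][1] is pipe_b.steps[0][1]

# clone() returns a new, unfitted estimator with the same hyperparameters
pipe_c = make_pipeline(clone(tfidf_step), LogisticRegression())
independent = pipe_c.steps[0][1] is not tfidf_step

print(shared, independent)  # True True
```

Giving each pipeline its own cloned steps (and a fixed `random_state` for the SVD) would keep the fitted pipelines independent of each other.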

## Model Evaluation

### Evaluation on train set

In [17]:
#XGBoost model
y_pred = pipe_xg.predict(X_train)
report = classification_report(y_train, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.90      0.95      0.93      1280
           1       0.92      0.86      0.89      1280
           2       0.92      0.90      0.91      1280
           3       0.95      0.96      0.95      1280
           4       0.86      0.88      0.87      1280

    accuracy                           0.91      6400
   macro avg       0.91      0.91      0.91      6400
weighted avg       0.91      0.91      0.91      6400



In [18]:
# Random Forest model
y_pred = pipe_rf.predict(X_train)
report = classification_report(y_train, y_pred)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1280
           1       1.00      1.00      1.00      1280
           2       1.00      1.00      1.00      1280
           3       1.00      1.00      1.00      1280
           4       1.00      1.00      1.00      1280

    accuracy                           1.00      6400
   macro avg       1.00      1.00      1.00      6400
weighted avg       1.00      1.00      1.00      6400



In [19]:
# Logistic model
y_pred = pipe_log.predict(X_train)
report = classification_report(y_train, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1280
           1       0.96      0.90      0.93      1280
           2       0.97      0.94      0.96      1280
           3       0.96      0.96      0.96      1280
           4       0.90      0.94      0.92      1280

    accuracy                           0.95      6400
   macro avg       0.95      0.95      0.95      6400
weighted avg       0.95      0.95      0.95      6400



All models perform well on the training set, and none of them is noticeably worse at predicting one particular label.

The RandomForest model appears to be overfitted: it reaches 100% accuracy on the training set, so it is unlikely to perform as well on new, unseen data.
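
For readers less familiar with the columns in the reports above, the per-class numbers can be reproduced by hand. The following is a minimal sketch with made-up labels; `per_class_scores` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = per_class_scores(toy_true, toy_pred, label=1)
print(precision, recall, f1)
```

Here class 1 is always found (recall 1.0) but one class-0 row is wrongly assigned to it, which lowers precision to 2/3.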

### Evaluation on test set

In [20]:
# First we load the test set from HuggingFace
df_test = pd.DataFrame(dataset['test'])
# Apply the preprocessing to the test data
df_test['text_clean'] = text_prepro(df_test['text'])

# Divide data into X and y variables: 
X_test = df_test['text_clean']
y_test = df_test['label']


In [21]:
#XGBoost model
y_pred = pipe_xg.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.89      0.87      0.88       279
           1       0.84      0.82      0.83       159
           2       0.75      0.79      0.77       195
           3       0.98      0.93      0.95       197
           4       0.67      0.71      0.69       170

    accuracy                           0.83      1000
   macro avg       0.83      0.82      0.82      1000
weighted avg       0.83      0.83      0.83      1000



In [22]:
#RandomForest model
y_pred = pipe_rf.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.87      0.87      0.87       279
           1       0.82      0.78      0.80       159
           2       0.75      0.82      0.78       195
           3       0.97      0.95      0.96       197
           4       0.72      0.71      0.71       170

    accuracy                           0.83      1000
   macro avg       0.83      0.82      0.82      1000
weighted avg       0.83      0.83      0.83      1000



In [23]:
# Logistic model
y_pred = pipe_log.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.85      0.92      0.88       279
           1       0.86      0.85      0.85       159
           2       0.83      0.76      0.79       195
           3       0.99      0.92      0.96       197
           4       0.74      0.76      0.75       170

    accuracy                           0.85      1000
   macro avg       0.85      0.84      0.85      1000
weighted avg       0.86      0.85      0.85      1000



The logistic regression model achieves the highest accuracy on the test set (0.85, compared with 0.83 for both XGBoost and RandomForest), so we continue with this model.
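
Putting the accuracies from the reports above side by side makes the comparison explicit. The numbers are copied from the train and test reports; the dictionary names are just for illustration.

```python
# Accuracies taken from the classification reports above
train_acc = {"xgboost": 0.91, "random_forest": 1.00, "logistic": 0.95}
test_acc = {"xgboost": 0.83, "random_forest": 0.83, "logistic": 0.85}

# Drop from train to test accuracy per model
gap = {name: round(train_acc[name] - test_acc[name], 2) for name in train_acc}
best = max(test_acc, key=test_acc.get)

print(gap)
print(best)
```

The RandomForest model shows the largest drop from train to test, which is consistent with the overfitting noted earlier.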

### Test that we can make predictions using the model

In [24]:
# Run a single prediction

# Make up a random sentence
t1 = ['He had been at the hospital for many years and was quite efficient with his scalpel']

# Preprocess the sentence
t1_p = text_prepro(pd.Series(t1))

# Predict the label of the sentence
pipe_log.predict(t1_p)
# Prediction looks to be successful as 1 = surgeon

array([1])
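
Since the pipeline returns integer labels, a small lookup makes the output easier to read. This is a sketch: `LABEL_NAMES` repeats the mapping listed under `df.head()` above, and `predicted` is a stand-in for the array returned by `pipe_log.predict(t1_p)`.

```python
# Mapping from the label comments earlier in the notebook
LABEL_NAMES = {0: "psychologist", 1: "surgeon", 2: "nurse", 3: "dentist", 4: "physician"}

predicted = [1]  # stand-in for pipe_log.predict(t1_p)
names = [LABEL_NAMES[int(i)] for i in predicted]
print(names)  # ['surgeon']
```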

#### Explainability

In [25]:
class_names = ['psychologist', 'surgeon', 'nurse', 'dentist', 'physician']
explainer = LimeTextExplainer(class_names = class_names)

exp = explainer.explain_instance(t1_p[0], pipe_log.predict_proba,
                                num_features = 5, # number of most influential words to display per label
                                top_labels=3) # only explain the 3 most probable labels

exp.show_in_notebook(text=True)

## Topic Modelling

### Text Tokenization

In [26]:

tokens = []

for text in nlp.pipe(df_res['text_clean'], disable=["ner"]):
  proj_tok = [token.lemma_.lower() for token in text
              if token.pos_ in ['NOUN', 'PROPN', 'ADJ', 'ADV'] # only keep these word classes (others are hard to interpret when aggregated)
              and not token.is_stop
              and not token.is_punct]
  tokens.append(proj_tok)

df_res['tokens'] = tokens


In [27]:
# Check that it worked
df_res.head()

Unnamed: 0,text,label,text_clean,tokens
5297,She sees clients in her Northbrook office or v...,0,sees clients northbrook office telephone skype...,"[client, office, telephone, skype, session, bl..."
4869,"She has expertise in human relationships, anxi...",0,expertise human relationships anxiety managing...,"[expertise, human, relationship, anxiety, stre..."
6449,She received a master’s degree from Bryn Mawr ...,0,received master degree bryn mawr college exten...,"[master, degree, bryn, mawr, college, extensiv..."
6327,She graduated with honors in 1992. Having more...,0,graduated honors having years diverse experien...,"[honor, year, diverse, experience, especially,..."
5783,He stands at the progressive end of integratin...,0,stands progressive end integrating mindfulness...,"[progressive, end, mindfulness, therapeutic, p..."


### Model Building

In [28]:
def topic_modelling_pipe(label, df_res):
    df = df_res[df_res['label'] == label]

    # Create dictionary 
    dictionary = Dictionary(df['tokens'])

    # Filter out rare and very common tokens and cap the vocabulary at 1000 words
    dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=1000)
    # no_below=5: drop tokens that appear in fewer than 5 bios
    # no_above=0.5: drop tokens that appear in more than half of the bios
    # keep_n=1000: of the remaining tokens, keep only the 1000 most frequent

    # construct corpus using this dictionary
    corpus = [dictionary.doc2bow(doc) for doc in df['tokens']]

    # fit model
    lda_model = LdaMulticore(corpus,
                         id2word=dictionary,
                         num_topics=5,
                         workers = 4,
                         passes=5)

    lda_display = gensimvis.prepare(lda_model, corpus, dictionary)

    return lda_display
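
The vocabulary filter inside the function can be approximated in plain Python to show how the three thresholds interact. This is a sketch of the idea rather than gensim's implementation; `filter_vocab` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_vocab(token_lists, no_below=5, no_above=0.5, keep_n=1000):
    """Keep tokens found in at least `no_below` documents and in at most a
    `no_above` fraction of documents, then cap at the `keep_n` most frequent."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for doc in token_lists for tok in set(doc))
    kept = [tok for tok, freq in doc_freq.items()
            if freq >= no_below and freq / n_docs <= no_above]
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))  # most frequent first
    return kept[:keep_n]

toy_docs = [["patient", "clinic"], ["patient", "tooth"],
            ["patient", "clinic"], ["patient", "surgery"]]
vocab = filter_vocab(toy_docs, no_below=2, no_above=0.5)
print(vocab)  # ['clinic']
```

"patient" is dropped because it occurs in every document, while "tooth" and "surgery" are dropped because they occur in only one.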


### Visualizations

In [29]:
# Psychologist
lda_display = topic_modelling_pipe(label=0, df_res=df_res)
pyLDAvis.display(lda_display)
# The most used words for psychologists appear to be university, psychology and clinical
# This makes sense for this profession

In [30]:
# Surgeon
lda_display = topic_modelling_pipe(label=1, df_res=df_res)
pyLDAvis.display(lda_display)
# For surgeons, the most used words are center, patient, and school

In [31]:
# Nurse
lda_display = topic_modelling_pipe(label=2, df_res=df_res)
pyLDAvis.display(lda_display)
# The most used words in bios for nurses are medical, year and hospital

In [32]:
# Dentist
lda_display = topic_modelling_pipe(label=3, df_res=df_res)
pyLDAvis.display(lda_display)
# For dentists, the most used words are bds (Bachelor of Dental Surgery), clinic and tooth
# These are very fitting for a dentist bio

In [33]:
# Physician
lda_display = topic_modelling_pipe(label=4, df_res=df_res)
pyLDAvis.display(lda_display)
# For physicians, the most common words in bios are medical, university and blue (likely from 'Blue Cross Blue Shield', a health insurance provider in the US)

## Export model components

In [34]:
# Export the model 
with open('components/pipe_log.pkl', 'wb') as f: # the context manager closes the file after writing
    pickle.dump(pipe_log, f)

# Export dataset
df_res.to_json('components/df_res.json')