# Natural Language Processing Methodology

![png](Images/data_cycle.png)

## Business Understanding

**Objectives of the Project**

    1. Summarize provider notes
    2. Control length of summaries (50, 100, or 150 words)
    3. Label the specialty of the note
    4. Topic recognition (e.g., medication name, diagnosis)

## Data Collection

    Hospital database

## Data Cleaning

    1. Remove unnecessary columns
    2. Handle duplicate values
    3. Handle missing values
    4. Handle inconsistent values in columns that are not text data
    5. Preprocess text data
        - expand contractions
        - convert to lowercase
        - remove digits
        - remove punctuation
        - remove stop words
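
The text preprocessing steps listed above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's `preprocess_text` function: `basic_clean` is a hypothetical helper, and the `CONTRACTIONS` map and `STOP_WORDS` set are tiny made-up stand-ins for a full contraction list and the NLTK stop word list. Lowercasing is done first here so that the contraction lookup is case-insensitive.

```python
import re
import string

## Hypothetical, deliberately tiny lookup tables for illustration only
CONTRACTIONS = {"can't": "cannot", "doesn't": "does not", "it's": "it is"}
STOP_WORDS = {"the", "a", "an", "is", "it", "of", "and", "not", "does"}

def basic_clean(text):
    ## Convert to lowercase first so the contraction lookup is case-insensitive
    text = text.lower()
    ## Expand contractions
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    ## Remove digits
    text = re.sub(r'\d+', ' ', text)
    ## Remove punctuation, replacing it with a space so words stay separated
    text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    ## Remove stop words
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

print(basic_clean("The patient can't walk 2 blocks, and it's painful."))
```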

# Imports

In [1]:
import pandas as pd 
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
import spacy

# Functions

In [2]:
## Function to strip list numbering and special characters, remove stop words, and lemmatize with spaCy
## Load English model (parser and NER are disabled because only lemmas are needed)
nlp = spacy.load('en_core_web_sm', disable = ['parser','ner'])

## Build the stop word set once; run nltk.download('stopwords') first if the corpus is missing
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    ## Remove digits followed by a period and whitespace (such as list numbering like "1. ")
    text = re.sub(r'\b\d+\.\s', '', text)

    ## Remove periods that are neither preceded nor followed by a digit (decimal points are kept)
    text = re.sub(r'(?<!\d)\.(?!\d)', '', text)

    ## Remove all characters except letters, digits, whitespace, and periods
    ## Note: characters are deleted, not replaced with a space
    text = re.sub(r'[^a-zA-Z0-9\s.]', '', text)

    ## Convert to lowercase
    text = text.lower()

    ## Remove stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])

    ## Process the text with spaCy and keep the lemma of each token
    doc = nlp(text)
    lemmatized_tokens = [token.lemma_ for token in doc]

    ## Join the lemmas back into a single string
    return ' '.join(lemmatized_tokens)

# Load Data

In [3]:
## File path
path = 'Data/mtsamples.csv'

## Dataframe from .csv
data = pd.read_csv(path)

## Check if data loaded correctly
data.info()
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Unnamed: 0         4999 non-null   int64 
 1   description        4999 non-null   object
 2   medical_specialty  4999 non-null   object
 3   sample_name        4999 non-null   object
 4   transcription      4966 non-null   object
 5   keywords           3931 non-null   object
dtypes: int64(1), object(5)
memory usage: 234.5+ KB


,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."


# Data Cleaning

In [4]:
## Remove unnecessary columns
data.drop(columns=['Unnamed: 0','description','keywords'], inplace=True)

## Confirm
data.head()

,medical_specialty,sample_name,transcription
0,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr..."
1,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb..."
2,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ..."
3,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit..."
4,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...


In [5]:
## Check for duplicate values
data.duplicated().sum()

0
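
`data.duplicated()` with no arguments only flags rows that match in every column. As a hedged sketch (the toy frame below is made up, and whether such rows exist in the real data has not been checked here), repeated note text can be counted with the `subset` parameter:

```python
import pandas as pd

## Made-up toy data: the same note filed under two different sample names
toy = pd.DataFrame({
    'sample_name': ['Consult - 1', 'Consult - 2', 'Echo'],
    'transcription': ['same note text', 'same note text', 'other note'],
})

## Whole-row comparison finds no duplicates
full_row_dupes = toy.duplicated().sum()

## Comparing only the transcription column finds one repeated note
text_dupes = toy.duplicated(subset=['transcription']).sum()

print(full_row_dupes, text_dupes)
```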

In [6]:
## Check for missing values
data.isna().sum()

medical_specialty     0
sample_name           0
transcription        33
dtype: int64
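
Before dropping rows, it can help to see what share of the data is affected. A small sketch with a hypothetical helper `missing_report` (not part of the notebook) and made-up toy data:

```python
import pandas as pd

def missing_report(df):
    ## Count and percentage of missing values per column
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    return pd.DataFrame({'missing': counts, 'percent': percent})

toy = pd.DataFrame({
    'medical_specialty': ['Surgery', 'Urology', 'Neurology', 'Surgery'],
    'transcription': ['note a', None, 'note c', 'note d'],
})
report = missing_report(toy)
print(report)
```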

In [7]:
## Remove rows with missing data from the transcription column
data.dropna(subset=['transcription'], inplace=True)

## Check for missing values
data.isna().sum()

medical_specialty    0
sample_name          0
transcription        0
dtype: int64

In [8]:
## Check for inconsistent values
data['medical_specialty'].value_counts()

 Surgery                          1088
 Consult - History and Phy.        516
 Cardiovascular / Pulmonary        371
 Orthopedic                        355
 Radiology                         273
 General Medicine                  259
 Gastroenterology                  224
 Neurology                         223
 SOAP / Chart / Progress Notes     166
 Urology                           156
 Obstetrics / Gynecology           155
 Discharge Summary                 108
 ENT - Otolaryngology               96
 Neurosurgery                       94
 Hematology - Oncology              90
 Ophthalmology                      83
 Nephrology                         81
 Emergency Room Reports             75
 Pediatrics - Neonatal              70
 Pain Management                    61
 Psychiatry / Psychology            53
 Office Notes                       50
 Podiatry                           47
 Dermatology                        29
 Cosmetic / Plastic Surgery         27
 Dentistry               

In [9]:
## Check for inconsistent values
# Temporarily set the display option
with pd.option_context('display.max_rows', None):
    print(data['sample_name'].value_counts())

 Lumbar Discogram                                                        5
 Pediatric Rheumatology Consult                                          4
 Mastoiditis - Discharge Summary                                         4
 Thyroid Mass Consult                                                    4
 Radiofrequency Ablation                                                 4
 Normal ENT Exam - 1                                                     4
 Blood in Urine - ER Visit                                               4
 Normal Newborn H&P Template                                             4
 Normal Newborn Infant Physical Exam                                     4
 Neuroplasty                                                             4
 Normal Child Exam Template                                              4
 Neuroblastoma - Consult                                                 4
 Foreign Body - Right Nose                                               4
 Hematuria - ER Visit    

In [10]:
## Check for whitespace
data['sample_name'].unique()

array([' Allergic Rhinitis ', ' Laparoscopic Gastric Bypass Consult - 2 ',
       ' Laparoscopic Gastric Bypass Consult - 1 ', ..., ' Autopsy - 7 ',
       ' Autopsy - 3 ', ' Autopsy - 4 '], dtype=object)

In [11]:
## Remove whitespace from 'sample_name' column
data['sample_name'] = data['sample_name'].str.strip()

## Confirm whitespaces were removed
data['sample_name'].unique()

array(['Allergic Rhinitis', 'Laparoscopic Gastric Bypass Consult - 2',
       'Laparoscopic Gastric Bypass Consult - 1', ..., 'Autopsy - 7',
       'Autopsy - 3', 'Autopsy - 4'], dtype=object)
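
Leading and trailing spaces appear in more than one column of this dataset, so the same cleanup could be applied to each text column in a loop. A sketch under that assumption, with made-up toy values:

```python
import pandas as pd

toy = pd.DataFrame({
    'medical_specialty': [' Surgery', ' Urology'],
    'sample_name': [' Autopsy - 3 ', ' Neuroplasty '],
})

## Strip surrounding whitespace from each text column in turn
for column in ['medical_specialty', 'sample_name']:
    toy[column] = toy[column].str.strip()

print(toy['sample_name'].tolist())
```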

In [12]:
## Keep only the text before the first hyphen
data['sample_name'] = data['sample_name'].str.split('-').str[0]

## Confirm split was applied
with pd.option_context('display.max_rows', None):
    print(data['sample_name'].value_counts())

Gen Med Consult                                                        108
Consult                                                                 76
Discharge Summary                                                       74
Colonoscopy                                                             44
CT Abdomen & Pelvis                                                     36
Psych Consult                                                           30
Anterior Cervical Discectomy & Fusion                                   30
MRI Brain                                                               28
Dietary Consult                                                         27
EMG/Nerve Conduction Study                                              27
Esophagogastroduodenoscopy                                              26
Low                                                                     26
Gen Med Progress Note                                                   26
Cardiac Catheterization  

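> The sample_name column could be useful with further feature engineering. Note that splitting on the first hyphen also truncates hyphenated names (for example, '2-D Echocardiogram - 1' becomes '2').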
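
One possible refinement, sketched here as an assumption rather than the notebook's method, is to remove only a trailing numeric suffix such as ` - 2` instead of splitting on the first hyphen. This keeps names like `2-D Echocardiogram` intact. The sample values are taken from the outputs above.

```python
import pandas as pd

names = pd.Series([
    'Laparoscopic Gastric Bypass Consult - 2',
    '2-D Echocardiogram - 1',
    'Autopsy - 7',
    'Blood in Urine - ER Visit',
])

## Drop a trailing " - <number>" only; other hyphens are left alone
cleaned = names.str.replace(r'\s*-\s*\d+$', '', regex=True).str.strip()

print(cleaned.tolist())
```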

In [13]:
## Remove whitespace from 'medical_specialty' column
data['medical_specialty'] = data['medical_specialty'].str.strip()

## Confirm whitespaces were removed
data['medical_specialty'].unique()

array(['Allergy / Immunology', 'Bariatrics', 'Cardiovascular / Pulmonary',
       'Neurology', 'Dentistry', 'Urology', 'General Medicine', 'Surgery',
       'Speech - Language', 'SOAP / Chart / Progress Notes',
       'Sleep Medicine', 'Rheumatology', 'Radiology',
       'Psychiatry / Psychology', 'Podiatry', 'Physical Medicine - Rehab',
       'Pediatrics - Neonatal', 'Pain Management', 'Orthopedic',
       'Ophthalmology', 'Office Notes', 'Obstetrics / Gynecology',
       'Neurosurgery', 'Nephrology', 'Letters',
       'Lab Medicine - Pathology', 'IME-QME-Work Comp etc.',
       'Hospice - Palliative Care', 'Hematology - Oncology',
       'Gastroenterology', 'ENT - Otolaryngology', 'Endocrinology',
       'Emergency Room Reports', 'Discharge Summary',
       'Diets and Nutritions', 'Dermatology',
       'Cosmetic / Plastic Surgery', 'Consult - History and Phy.',
       'Chiropractic', 'Autopsy'], dtype=object)

In [14]:
# Filter the DataFrame for 'Letters' specialty
with pd.option_context('display.max_colwidth', None):
    letter_df = data[data['medical_specialty'] == 'Letters']
    print(letter_df.head())

     medical_specialty                 sample_name  \
3057           Letters    Pediatric Urology Letter   
3058           Letters              Urology Letter   
3059           Letters           Rolandic Epilepsy   
3060           Letters           Wilson's Disease    
3061           Letters  Suspected Seizure Activity

> Rows labeled 'Letters' in the medical_specialty column appear to be correspondence written from patient to provider or between providers. Only 23 such rows exist, so they will be removed because they are less relevant to the project.

In [15]:
## Remove rows containing 'Letters'
data = data[data['medical_specialty'] != 'Letters'].copy()
data['medical_specialty'].value_counts()

Surgery                          1088
Consult - History and Phy.        516
Cardiovascular / Pulmonary        371
Orthopedic                        355
Radiology                         273
General Medicine                  259
Gastroenterology                  224
Neurology                         223
SOAP / Chart / Progress Notes     166
Urology                           156
Obstetrics / Gynecology           155
Discharge Summary                 108
ENT - Otolaryngology               96
Neurosurgery                       94
Hematology - Oncology              90
Ophthalmology                      83
Nephrology                         81
Emergency Room Reports             75
Pediatrics - Neonatal              70
Pain Management                    61
Psychiatry / Psychology            53
Office Notes                       50
Podiatry                           47
Dermatology                        29
Cosmetic / Plastic Surgery         27
Dentistry                          27
Physical Med

In [16]:
# Filter the DataFrame for 'SOAP / Chart / Progress Notes' specialty
with pd.option_context('display.max_colwidth', None):
    soap_df = data[data['medical_specialty'] == 'SOAP / Chart / Progress Notes']
    print(soap_df.head())

                  medical_specialty                         sample_name  \
1287  SOAP / Chart / Progress Notes  Uterine Papillary Serous Carcinoma   
1289  SOAP / Chart / Progress Notes                        Wound Check    
1290  SOAP / Chart / Progress Notes          Weight Loss on Phentermine   
1293  SOAP / Chart / Progress Notes                         Wasp Sting    
1294  SOAP / Chart / Progress Notes            Tethered Cord Evaluation

In [17]:
# Filter the DataFrame for 'Office Notes' specialty
with pd.option_context('display.max_colwidth', None):
    office_df = data[data['medical_specialty'] == 'Office Notes']
    print(office_df.head())

     medical_specialty                    sample_name  \
2450      Office Notes           Telemetry Monitoring   
2451      Office Notes  Premature retina and vitreous   
2452      Office Notes          Right Hand Laceration   
2453      Office Notes                Status Post T&A   
2454      Office Notes             Shoulder Contusion

> Progress notes are a type of document, not a medical specialty, so 'Office Notes' and 'SOAP / Chart / Progress Notes' will be relabeled as 'No Specialty'.

In [18]:
## Relabel 'Office Notes' and 'SOAP / Chart / Progress Notes' to 'No Specialty'
## Dictionary for labels
relabel_dict = {'Office Notes':'No Specialty',
                'SOAP / Chart / Progress Notes':'No Specialty'}

## Replace values using dictionary
data['medical_specialty'] = data['medical_specialty'].replace(relabel_dict)

## Confirm replace worked correctly
data['medical_specialty'].value_counts()

Surgery                       1088
Consult - History and Phy.     516
Cardiovascular / Pulmonary     371
Orthopedic                     355
Radiology                      273
General Medicine               259
Gastroenterology               224
Neurology                      223
No Specialty                   216
Urology                        156
Obstetrics / Gynecology        155
Discharge Summary              108
ENT - Otolaryngology            96
Neurosurgery                    94
Hematology - Oncology           90
Ophthalmology                   83
Nephrology                      81
Emergency Room Reports          75
Pediatrics - Neonatal           70
Pain Management                 61
Psychiatry / Psychology         53
Podiatry                        47
Dermatology                     29
Cosmetic / Plastic Surgery      27
Dentistry                       27
Physical Medicine - Rehab       21
Sleep Medicine                  20
Endocrinology                   19
Bariatrics          

In [19]:
## Looking at content of 'Discharge Summary'
with pd.option_context('display.max_colwidth', None):
    discharge_df = data[data['medical_specialty'] == 'Discharge Summary']
    print(discharge_df.head())

      medical_specialty                sample_name  \
3877  Discharge Summary            Speech Therapy    
3880  Discharge Summary                       TAH    
3881  Discharge Summary  Urology Discharge Summary   
3884  Discharge Summary             Renal Disease    
3885  Discharge Summary            Speech Therapy

In [20]:
## Looking at content of 'Emergency Room Reports'
with pd.option_context('display.max_colwidth', None):
    er_df = data[data['medical_specialty'] == 'Emergency Room Reports']
    print(er_df.head())

           medical_specialty          sample_name  \
3802  Emergency Room Reports  Urgent Cardiac Cath   
3804  Emergency Room Reports      Viral Syndrome    
3806  Emergency Room Reports           Toothache    
3807  Emergency Room Reports      Testicular Pain   
3809  Emergency Room Reports             Syncope

> 'Emergency Room Reports' will be renamed to 'Emergency Department' so that the label more accurately represents a medical specialty.

In [21]:
## Relabel 'Emergency Room Reports' to 'Emergency Department'
## Dictionary for labels
relabel_dict = {'Emergency Room Reports':'Emergency Department'}

## Replace values using dictionary
data['medical_specialty'] = data['medical_specialty'].replace(relabel_dict)

## Confirm replace worked correctly
data['medical_specialty'].value_counts()

Surgery                       1088
Consult - History and Phy.     516
Cardiovascular / Pulmonary     371
Orthopedic                     355
Radiology                      273
General Medicine               259
Gastroenterology               224
Neurology                      223
No Specialty                   216
Urology                        156
Obstetrics / Gynecology        155
Discharge Summary              108
ENT - Otolaryngology            96
Neurosurgery                    94
Hematology - Oncology           90
Ophthalmology                   83
Nephrology                      81
Emergency Department            75
Pediatrics - Neonatal           70
Pain Management                 61
Psychiatry / Psychology         53
Podiatry                        47
Dermatology                     29
Cosmetic / Plastic Surgery      27
Dentistry                       27
Physical Medicine - Rehab       21
Sleep Medicine                  20
Endocrinology                   19
Bariatrics          
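
The two relabeling steps could also be expressed as a single mapping, with a check that every key actually occurs in the column so that a misspelled label does not pass silently. A sketch with made-up toy data; `RELABEL` is a hypothetical name that simply combines the two dictionaries used above:

```python
import pandas as pd

RELABEL = {
    'Office Notes': 'No Specialty',
    'SOAP / Chart / Progress Notes': 'No Specialty',
    'Emergency Room Reports': 'Emergency Department',
}

## Made-up toy column standing in for data['medical_specialty']
toy = pd.Series(['Surgery', 'Office Notes', 'Emergency Room Reports',
                 'SOAP / Chart / Progress Notes'], name='medical_specialty')

## Fail early if a key is misspelled or missing from the data
unknown = set(RELABEL) - set(toy.unique())
assert not unknown, f'Labels not found in data: {unknown}'

relabeled = toy.replace(RELABEL)
print(relabeled.tolist())
```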

In [22]:
## Save as .csv file
data.to_csv('Data/clean_data.csv')
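
Note that `to_csv` writes the index by default, so reading the saved file back with a plain `pd.read_csv` yields an extra `Unnamed: 0` column like the one dropped earlier. A sketch using an in-memory buffer in place of the file (the toy data is made up); passing `index=False` is one way to avoid this:

```python
import io
import pandas as pd

toy = pd.DataFrame({'medical_specialty': ['Surgery'], 'sample_name': ['Autopsy']})

## Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

## index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_clean = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_clean)
```

If the original row numbers are needed later, keeping the index and reading it back with `index_col=0` is an alternative.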

# Text Preprocessing

In [23]:
## Visualize a couple of transcripts
for index, text in enumerate(data['transcription'][35:40]):
    print(f'Transcript {index + 1}:\n', text)

Transcript 1:
 EXAM: , Ultrasound examination of the scrotum.,REASON FOR EXAM: , Scrotal pain.,FINDINGS:  ,Duplex and color flow imaging as well as real time gray-scale imaging of the scrotum and testicles was performed.  The left testicle measures 5.1 x 2.8 x 3.0 cm.  There is no evidence of intratesticular masses.  There is normal Doppler blood flow.  The left epididymis has an unremarkable appearance.  There is a trace hydrocele.,The right testicle measures 5.3 x 2.4 x 3.2 cm.  The epididymis has normal appearance.  There is a trace hydrocele.  No intratesticular masses or torsion is identified.  There is no significant scrotal wall thickening.,IMPRESSION:  ,Trace bilateral hydroceles, which are nonspecific, otherwise unremarkable examination.
Transcript 2:
 PREOPERATIVE DIAGNOSIS: , Bladder tumor.,POSTOPERATIVE DIAGNOSIS: , Bladder tumor.,PROCEDURE PERFORMED: , Transurethral resection of a medium bladder tumor (TURBT), left lateral wall.,ANESTHESIA: , Spinal.,SPECIMEN TO PATHOLOGY:

In [24]:
## Apply preprocessing to the transcription column
data['cleaned_transcription'] = data['transcription'].apply(preprocess_text)

In [25]:
## Save as .csv file
data.to_csv('Data/preprocessed_text_data.csv')

## Display a few of the cleaned transcripts
for index, text in enumerate(data['cleaned_transcription'][35:40]):
    print(f'Transcript {index + 1}:\n', text)

Transcript 1:
 exam ultrasound examination scrotumreason exam scrotal painfinding duplex color flow image well real time grayscale image scrotum testicle perform left testicle measure 5.1 x 2.8 x 3.0 cm evidence intratesticular masse normal doppler blood flow leave epididymis unremarkable appearance trace hydrocelethe right testicle measure 5.3 x 2.4 x 3.2 cm epididymis normal appearance trace hydrocele intratesticular masse torsion identify significant scrotal wall thickeningimpression trace bilateral hydrocele nonspecific otherwise unremarkable examination
Transcript 2:
 preoperative diagnosis bladder tumorpostoperative diagnosis bladder tumorprocedure perform transurethral resection medium bladder tumor turbt leave lateral wallanesthesia spinalspecimen pathology bladder tumor specimen base bladder tumordrain 22french 3way foley catheter 30 ml balloonestimate blood loss minimalindication procedure 74yearold male present microscopic episode gross hematuria underwent ivp demonstrate en
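
In the cleaned output, words on either side of removed punctuation are fused (`scrotumreason`, `painfinding`) because the punctuation is deleted rather than replaced by a space. A hedged sketch of the same regex steps with a space as the replacement follows; `strip_punctuation_keep_spaces` is a hypothetical helper, not the notebook's function, and stop word removal and lemmatization are left out.

```python
import re

def strip_punctuation_keep_spaces(text):
    ## Remove digits followed by a period and whitespace (e.g., list numbering)
    text = re.sub(r'\b\d+\.\s', ' ', text)
    ## Replace periods that are not part of a decimal number with a space
    text = re.sub(r'(?<!\d)\.(?!\d)', ' ', text)
    ## Replace every other unwanted character with a space
    text = re.sub(r'[^a-zA-Z0-9\s.]', ' ', text)
    ## Collapse repeated whitespace and convert to lowercase
    return re.sub(r'\s+', ' ', text).strip().lower()

sample = ('EXAM: , Ultrasound examination of the scrotum.,REASON FOR EXAM: , '
          'Scrotal pain. Measures 5.1 x 2.8 cm.')
print(strip_punctuation_keep_spaces(sample))
```

One trade-off of this variant is that hyphenated terms such as `23-year-old` are split into separate tokens instead of being joined.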

In [26]:
data.head()

,medical_specialty,sample_name,transcription,cleaned_transcription
0,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...",subjective 23yearold white female present comp...
1,Bariatrics,Laparoscopic Gastric Bypass Consult,"PAST MEDICAL HISTORY:, He has difficulty climb...",past medical history difficulty climb stair di...
2,Bariatrics,Laparoscopic Gastric Bypass Consult,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",history present illness see abc today pleasant...
3,Cardiovascular / Pulmonary,2,"2-D M-MODE: , ,1. Left atrial enlargement wit...",2d mmode leave atrial enlargement leave atrial...
4,Cardiovascular / Pulmonary,2,1. The left ventricular cavity size and wall ...,leave ventricular cavity size wall thickness a...
