## Section 3: Feature Engineering

In this section, we apply natural language processing (NLP) techniques and generative AI models to the free-text columns of the dataset to prepare features for classification. The goal is to extract meaningful features from the text that can be used to train a machine learning model to predict the target variable.

In [1]:
# import libraries
import os 
import pandas as pd

In [3]:
# load the data for preview
df = pd.read_csv('data/processed/patients.csv')

# preview the data
display(df.head())

| | patient_id | age | gender | medical_history | deterioration_label | timestamp | hear_rate | blood_pressure_sys | blood_pressure_dia | oxygen_saturation | temperature | respiratory_rate | describe_fatigue_level | describe_lifestyle | describe_mental_health |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9b04b | 65 | Male | History of hypertension and type 2 diabetes. | True | 2023-10-27T10:00:00Z | 95.5 | 160.2 | 98.7 | 90.3 | 38.5 | 22.1 | Severe fatigue, difficulty getting out of bed. | Sedentary, poor diet. | Feeling anxious and low. |
| 1 | bffd5 | 45 | Female | No significant medical history. | False | 2023-10-27T10:05:00Z | 70.2 | 120.5 | 75.0 | 98.5 | 36.8 | 16.0 | Mild fatigue occasionally. | Active, balanced diet. | Generally good. |
| 2 | fb35e | 78 | Male | Chronic obstructive pulmonary disease (COPD), ... | True | 2023-10-27T10:10:00Z | 105.0 | 150.0 | 90.0 | 88.0 | 37.9 | 25.5 | Constant exhaustion. | Very limited activity due to breathlessness. | Feeling down and worried. |
| 3 | 1e30e | 30 | Female | Mild asthma. | False | 2023-10-27T10:15:00Z | 65.0 | 110.0 | 70.0 | 99.0 | 36.5 | 14.0 | Not fatigued. | Very active, regular exercise. | Good. |
| 4 | 116a4 | 55 | Male | High cholesterol. | False | 2023-10-27T10:20:00Z | 75.5 | 135.0 | 85.0 | 97.0 | 37.0 | 17.0 | Moderate fatigue after work. | Moderately active. | Stable. |


## 3.1 Named Entity Recognition (NER)

As the preview shows, the `medical_history` column names specific conditions that may be important predictors of deterioration. We therefore use a pre-trained biomedical language model, `BioBERT`, to extract those conditions from the text. We then one-hot encode the extracted conditions into binary features, one per condition, that mark the presence or absence of that condition for each patient.

For medical term extraction, we use a `BioBERT` checkpoint from Hugging Face that was fine-tuned on the NCBI Disease dataset. For one-hot encoding, we take a list of common conditions and create a binary feature for each one using `pandas`.

### 3.1.1 BioBERT Model

In this section, we use the `BioBERT` model to extract medical terms from the text data. `BioBERT` is a language model pre-trained on a large corpus of biomedical literature for biomedical text mining tasks. The checkpoint loaded here is set up for token classification, so it labels the tokens that belong to disease mentions.

In [7]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("ugaray96/biobert_ncbi_disease_ner")
model = AutoModelForTokenClassification.from_pretrained(
    "ugaray96/biobert_ncbi_disease_ner"
)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

Device set to use cpu
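
Because the pipeline is created without an aggregation strategy, it returns one dictionary per WordPiece token rather than one per disease mention, which is why the extraction function in the next cell has to stitch tokens back together. The sketch below shows that stitching in isolation on hand-written tokens. The `Disease` and `Disease Continuation` labels follow the ones checked in the notebook's function; the `No Disease` label, the helper name `merge_wordpieces`, and the sample tokens are assumptions made for illustration, not real model output.

```python
from typing import Dict, List


def merge_wordpieces(tokens: List[Dict[str, str]]) -> List[str]:
    """Merge token-level NER output into whole entity strings."""
    entities: List[str] = []
    current = ""
    for token in tokens:
        label, word = token["entity"], token["word"]
        if label == "No Disease":
            # A non-entity token closes whatever entity is open
            if current:
                entities.append(current)
                current = ""
        elif word.startswith("##"):
            # WordPiece continuation: append without the marker or a space
            current += word[2:]
        elif label == "Disease Continuation" and current:
            # A new word that continues the same entity
            current += " " + word
        else:
            # A new entity starts; flush the previous one first
            if current:
                entities.append(current)
            current = word
    if current:
        entities.append(current)
    return entities


# Hand-written tokens (illustrative only, not real model output)
sample_tokens = [
    {"entity": "No Disease", "word": "history"},
    {"entity": "No Disease", "word": "of"},
    {"entity": "Disease", "word": "hyper"},
    {"entity": "Disease", "word": "##tension"},
    {"entity": "No Disease", "word": "and"},
    {"entity": "Disease", "word": "type"},
    {"entity": "Disease Continuation", "word": "2"},
    {"entity": "Disease Continuation", "word": "diabetes"},
]
print(merge_wordpieces(sample_tokens))  # ['hypertension', 'type 2 diabetes']
```

Unlike the notebook's function, this sketch closes an open entity as soon as a non-entity token appears, which is one possible design choice rather than the behaviour of the code that follows.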


In [14]:
def extract_diseases_improved(text):
    if pd.isna(text) or text == '':
        return []
    
    # Run NER pipeline
    result = ner_pipeline(text)
    
    diseases = []
    current_disease = ""
    
    for entity in result:
        word = entity["word"]
        if entity["entity"] in ["Disease", "B-Disease", "I-Disease"]:
            if word.startswith("##"):
                # WordPiece continuation: glue it onto the current term without the ## prefix
                current_disease += word[2:]
            else:
                # A whole-word token starts a new term; flush the previous one first
                if current_disease:
                    diseases.append(current_disease.strip())
                current_disease = word
                
        elif entity["entity"] == "Disease Continuation":
            # Handle BERT tokenization for continuation tokens
            if entity["word"].startswith("##"):
                current_disease += entity["word"][2:]  # Remove the ## prefix
            else:
                current_disease += " " + entity["word"]
    
    # Add the last disease if there is one
    if current_disease:
        diseases.append(current_disease.strip())
    
    # Add common conditions that might be missed by the model
    common_conditions = ["stroke", "diabetes", "hypertension", "cancer", "heart attack", 
                        "heart failure", "copd", "asthma", "alzheimer", "dementia"]
    
    text_lower = text.lower()
    for condition in common_conditions:
        if condition in text_lower and condition not in [d.lower() for d in diseases]:
            diseases.append(condition)
    
    return diseases

# Apply the function to the medical_history column
df['extracted_diseases'] = df['medical_history'].apply(extract_diseases_improved)

# Preview the results
print("Sample of extracted diseases:")
for i, (history, diseases) in enumerate(zip(df['medical_history'].head(3), df['extracted_diseases'].head(3))):
    print(f"\nSample {i+1}:")
    print(f"Medical History: {history}")
    print(f"Extracted Diseases: {', '.join(diseases) if diseases else 'None'}")

# Count how many records have extracted diseases
has_diseases = df['extracted_diseases'].apply(len) > 0
print(f"\nRecords with extracted diseases: {has_diseases.sum()} out of {len(df)}")

# Display the first few rows with the new column
display(df[['medical_history', 'extracted_diseases']].head())

Sample of extracted diseases:

Sample 1:
Medical History: History of hypertension and type 2 diabetes.
Extracted Diseases: hypertension, type 2 diabetes, diabetes

Sample 2:
Medical History: No significant medical history.
Extracted Diseases: None

Sample 3:
Medical History: Chronic obstructive pulmonary disease (COPD), previous heart attack.
Extracted Diseases: Chronic obstructive pulmonary disease, COPD, heart attack

Records with extracted diseases: 938 out of 1199


| | medical_history | extracted_diseases |
|---|---|---|
| 0 | History of hypertension and type 2 diabetes. | [hypertension, type 2 diabetes, diabetes] |
| 1 | No significant medical history. | [] |
| 2 | Chronic obstructive pulmonary disease (COPD), ... | [Chronic obstructive pulmonary disease, COPD, ... |
| 3 | Mild asthma. | [asthma] |
| 4 | High cholesterol. | [] |


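The output shows that `BioBERT`, backed by the keyword fallback, picks up most conditions mentioned in `medical_history`: 938 of the 1199 records have at least one extracted term. The extraction is not complete, however. "High cholesterol" in row 4 yields an empty list, and overlapping terms such as "type 2 diabetes" and "diabetes" can both appear for the same record. Next, we apply one-hot encoding to the extracted terms to create a binary feature for each condition, marking its presence or absence for every patient.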
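
The sample output lists the same condition under more than one name, for example "Chronic obstructive pulmonary disease" next to "COPD". One possible cleanup step, sketched below, maps synonyms to a single canonical name and removes duplicates while keeping order. The `CANONICAL_NAMES` map and the `normalize_diseases` helper are hypothetical and not part of the notebook; a real mapping would need clinical review.

```python
from typing import Dict, List

# Hypothetical synonym map; the keys and canonical names are illustrative only
CANONICAL_NAMES: Dict[str, str] = {
    "chronic obstructive pulmonary disease": "copd",
    "type 2 diabetes": "diabetes",
    "myocardial infarction": "heart attack",
}


def normalize_diseases(diseases: List[str]) -> List[str]:
    """Lowercase, map synonyms to a canonical name, and drop duplicates in order."""
    seen = set()
    normalized = []
    for disease in diseases:
        key = disease.strip().lower()
        name = CANONICAL_NAMES.get(key, key)
        if name not in seen:
            seen.add(name)
            normalized.append(name)
    return normalized


print(normalize_diseases(["Chronic obstructive pulmonary disease", "COPD", "heart attack"]))
# ['copd', 'heart attack']
```

Collapsing "type 2 diabetes" into "diabetes" discards the subtype, so this only suits features that are meant to be that coarse.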

### 3.1.2 One-Hot Encoding

In this section, we one-hot encode the extracted medical terms with `pandas`, creating one binary feature per condition. One-hot encoding converts categorical values into numeric indicator columns that machine learning algorithms can consume directly. Because a patient can have several conditions at once, more than one indicator can be 1 in the same row.
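
As a minimal illustration of the idea, the sketch below turns a toy column of condition lists into indicator columns using the `pandas` string methods `str.join` and `str.get_dummies`. The data is made up, and this is an alternative to the explicit function used in the notebook rather than a replacement for it.

```python
import pandas as pd

# Toy column of condition lists (values are made up for illustration)
conditions = pd.Series([["diabetes", "hypertension"], ["asthma"], []])

# Join each list into a '|'-separated string, then expand into indicator columns
indicators = conditions.str.join("|").str.get_dummies(sep="|")
print(indicators)
```

Each distinct condition becomes a column, and a patient with an empty list gets zeros everywhere. This approach creates a column for every term it sees, so it does not by itself limit dimensionality.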

To avoid high dimensionality, we keep only a short list of common conditions instead of creating a column for every extracted term.
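
The list of conditions in the next cell is written by hand. If one wanted to derive it from the data instead, a rough sketch could count how often each extracted term occurs and keep the top `k`. The `top_conditions` helper and the toy data are assumptions for illustration; in the notebook the input would be `df['extracted_diseases']`.

```python
import pandas as pd

# Toy data standing in for df['extracted_diseases'] (values are made up)
toy = pd.DataFrame({
    "extracted_diseases": [
        ["hypertension", "diabetes"],
        ["asthma"],
        [],
        ["Hypertension", "copd"],
        ["diabetes", "hypertension"],
    ]
})


def top_conditions(series: pd.Series, k: int) -> list:
    """Return the k most frequent lowercase condition names in a column of lists."""
    # explode() gives one row per list item; empty lists become NaN and are dropped
    exploded = series.explode().dropna().astype(str).str.lower()
    return exploded.value_counts().head(k).index.tolist()


print(top_conditions(toy["extracted_diseases"], k=2))  # ['hypertension', 'diabetes']
```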

In [16]:
# Define the function to create one-hot encoding
def create_disease_features(diseases_list, conditions):
    """
    Create binary features for each disease in conditions based on diseases_list
    
    Args:
        diseases_list: List of diseases detected for a patient
        conditions: List of conditions to check for
    
    Returns:
        Dictionary with condition names as keys and 0/1 as values
    """
    result = {}
    # Convert all strings to lowercase for case-insensitive matching
    diseases_lower = [disease.lower() for disease in diseases_list]
    
    for condition in conditions:
        # Check if the condition exists in the list (case insensitive)
        result[condition] = 1 if condition.lower() in diseases_lower else 0
    
    return result

# List of common conditions to check
common_conditions = ["stroke", "diabetes", "hypertension", "cancer", "heart attack", 
                    "heart failure", "copd", "asthma", "alzheimer", "dementia"]

# Apply the function to create one-hot encoded columns
disease_features = df['extracted_diseases'].apply(lambda x: create_disease_features(x, common_conditions))

# Convert the result to DataFrame and join with original dataframe
disease_df = pd.DataFrame(disease_features.tolist())

# Add prefix 'has_' to the column names for clarity
disease_df = disease_df.add_prefix('has_')

# Join the one-hot encoded features with the original dataframe
df_encoded = pd.concat([df, disease_df], axis=1)

# Display the result
print("Number of patients with each condition:")
for condition in common_conditions:
    count = disease_df[f'has_{condition}'].sum()
    print(f"{condition.title()}: {count} patients")

# Show sample of data with the new features
print("\nSample of patients with their conditions:")
display(df_encoded[['patient_id', 'medical_history'] + [f'has_{condition}' for condition in common_conditions]].head(5))

Number of patients with each condition:
Stroke: 51 patients
Diabetes: 70 patients
Hypertension: 64 patients
Cancer: 31 patients
Heart Attack: 6 patients
Heart Failure: 64 patients
Copd: 51 patients
Asthma: 40 patients
Alzheimer: 10 patients
Dementia: 44 patients

Sample of patients with their conditions:


| | patient_id | medical_history | has_stroke | has_diabetes | has_hypertension | has_cancer | has_heart attack | has_heart failure | has_copd | has_asthma | has_alzheimer | has_dementia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9b04b | History of hypertension and type 2 diabetes. | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | bffd5 | No significant medical history. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | fb35e | Chronic obstructive pulmonary disease (COPD), ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1e30e | Mild asthma. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 116a4 | High cholesterol. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
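
Two of the generated column names, `has_heart attack` and `has_heart failure`, contain a space, which can be awkward for attribute access and for some downstream tools. A small optional cleanup, sketched here on a toy frame with made-up values, replaces whitespace in column names with underscores.

```python
import pandas as pd

# Toy frame mirroring two of the generated columns (values are made up)
features = pd.DataFrame({"has_heart attack": [0, 1], "has_copd": [0, 1]})

# Collapse any whitespace in each column name into a single underscore
features = features.rename(columns=lambda name: "_".join(name.split()))
print(list(features.columns))  # ['has_heart_attack', 'has_copd']
```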
