## Section 3: Feature Engineering

In this section, we apply natural language processing (NLP) techniques and a generative AI model to the free-text columns of the dataset. The goal is to turn that text into meaningful features that can be used to train a machine learning classifier for the target variable.

In [1]:
# import libraries
import os 
import pandas as pd
import time

In [2]:
# load the data for preview
df = pd.read_csv('data/processed/patients.csv')

# preview the data
display(df.head())

Unnamed: 0,patient_id,age,gender,medical_history,deterioration_label,timestamp,hear_rate,blood_pressure_sys,blood_pressure_dia,oxygen_saturation,temperature,respiratory_rate,describe_fatigue_level,describe_lifestyle,describe_mental_health
0,9b04b,65,Male,History of hypertension and type 2 diabetes.,True,2023-10-27T10:00:00Z,95.5,160.2,98.7,90.3,38.5,22.1,"Severe fatigue, difficulty getting out of bed.","Sedentary, poor diet.",Feeling anxious and low.
1,bffd5,45,Female,No significant medical history.,False,2023-10-27T10:05:00Z,70.2,120.5,75.0,98.5,36.8,16.0,Mild fatigue occasionally.,"Active, balanced diet.",Generally good.
2,fb35e,78,Male,"Chronic obstructive pulmonary disease (COPD), ...",True,2023-10-27T10:10:00Z,105.0,150.0,90.0,88.0,37.9,25.5,Constant exhaustion.,Very limited activity due to breathlessness.,Feeling down and worried.
3,1e30e,30,Female,Mild asthma.,False,2023-10-27T10:15:00Z,65.0,110.0,70.0,99.0,36.5,14.0,Not fatigued.,"Very active, regular exercise.",Good.
4,116a4,55,Male,High cholesterol.,False,2023-10-27T10:20:00Z,75.5,135.0,85.0,97.0,37.0,17.0,Moderate fatigue after work.,Moderately active.,Stable.


## 3.1 Named Entity Recognition (NER)

As the preview shows, the `medical_history` text names specific conditions that can be critical predictors of deterioration. We therefore use a pre-trained biomedical language model, `BioBERT`, to extract those conditions from the text, and then one-hot encode the extracted conditions so that each condition becomes a binary feature marking its presence or absence for a patient.

For medical term extraction, we use a `BioBERT` checkpoint available on Hugging Face that was fine-tuned on the NCBI disease dataset. For one-hot encoding, we build a binary feature for each of a set of common conditions using `pandas`.

### 3.1.1 BioBERT Model

In this section, we will use the `BioBERT` model to extract medical terms from the text data. The `BioBERT` model is a pre-trained language model specifically designed for biomedical text mining tasks. It has been trained on a large corpus of biomedical literature and can be used to extract relevant medical terms from the text.

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("ugaray96/biobert_ncbi_disease_ner")
model = AutoModelForTokenClassification.from_pretrained(
    "ugaray96/biobert_ncbi_disease_ner"
)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

Device set to use cpu
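
Before writing the extraction function, it helps to see why post-processing is needed: the token classification pipeline returns one dict per token, and BERT-style tokenizers split rare words into sub-word pieces marked with a `##` prefix. The sketch below uses hand-written tokens (hypothetical, not actual model output; the split of "hypertension" into two pieces is an assumption for illustration) with only the `entity` and `word` keys, and shows one simple way to merge them back into full disease spans. The helper name `merge_wordpieces` is ours.

```python
# Hypothetical token-level output, shaped like the dicts returned by the "ner" pipeline.
# Only the "entity" and "word" keys are used; the values are hand-written for illustration.
tokens = [
    {"entity": "Disease", "word": "hyper"},
    {"entity": "Disease Continuation", "word": "##tension"},
    {"entity": "Disease", "word": "type"},
    {"entity": "Disease Continuation", "word": "2"},
    {"entity": "Disease Continuation", "word": "diabetes"},
]


def merge_wordpieces(tokens):
    """Join '##' sub-word pieces and continuation tokens into full spans."""
    spans = []
    for tok in tokens:
        word = tok["word"]
        if word.startswith("##") and spans:
            # Sub-word piece: glue it onto the previous token without a space
            spans[-1] += word[2:]
        elif tok["entity"] == "Disease Continuation" and spans:
            # Continuation of the current disease: append as a new word
            spans[-1] += " " + word
        else:
            # Start of a new disease mention
            spans.append(word)
    return spans


merged = merge_wordpieces(tokens)
print(merged)
```

The function in the next cell follows the same idea, with extra handling for alternative label names and a keyword fallback.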


In [4]:
def extract_diseases_improved(text):
    if pd.isna(text) or text == '':
        return []
    
    # Run NER pipeline
    result = ner_pipeline(text)
    
    diseases = []
    current_disease = ""
    
    for entity in result:
        if entity["entity"] in ["Disease", "B-Disease", "I-Disease"]:
            # Start a new disease if the previous one is completed
            if current_disease and not entity["word"].startswith("##"):
                diseases.append(current_disease.strip())
                current_disease = ""
            
            # Handle BERT tokenization
            if entity["word"].startswith("##"):
                current_disease += entity["word"][2:]  # Remove the ## prefix
            else:
                if current_disease:  # If we have a partial disease, add it
                    diseases.append(current_disease.strip())
                current_disease = entity["word"]
                
        elif entity["entity"] == "Disease Continuation":
            # Handle BERT tokenization for continuation tokens
            if entity["word"].startswith("##"):
                current_disease += entity["word"][2:]  # Remove the ## prefix
            else:
                current_disease += " " + entity["word"]
    
    # Add the last disease if there is one
    if current_disease:
        diseases.append(current_disease.strip())
    
    # Add common conditions that might be missed by the model
    common_conditions = ["stroke", "diabetes", "hypertension", "cancer", "heart attack", 
                        "heart failure", "copd", "asthma", "alzheimer", "dementia"]
    
    text_lower = text.lower()
    for condition in common_conditions:
        if condition in text_lower and condition not in [d.lower() for d in diseases]:
            diseases.append(condition)
    
    return diseases

# Apply the function to the medical_history column
df['extracted_diseases'] = df['medical_history'].apply(extract_diseases_improved)

# Preview the results
print("Sample of extracted diseases:")
for i, (history, diseases) in enumerate(zip(df['medical_history'].head(3), df['extracted_diseases'].head(3))):
    print(f"\nSample {i+1}:")
    print(f"Medical History: {history}")
    print(f"Extracted Diseases: {', '.join(diseases) if diseases else 'None'}")

# Count how many records have extracted diseases
has_diseases = df['extracted_diseases'].apply(len) > 0
print(f"\nRecords with extracted diseases: {has_diseases.sum()} out of {len(df)}")

# Display the first few rows with the new column
display(df[['medical_history', 'extracted_diseases']].head())

Sample of extracted diseases:

Sample 1:
Medical History: History of hypertension and type 2 diabetes.
Extracted Diseases: hypertension, type 2 diabetes, diabetes

Sample 2:
Medical History: No significant medical history.
Extracted Diseases: None

Sample 3:
Medical History: Chronic obstructive pulmonary disease (COPD), previous heart attack.
Extracted Diseases: Chronic obstructive pulmonary disease, COPD, heart attack

Records with extracted diseases: 938 out of 1199


Unnamed: 0,medical_history,extracted_diseases
0,History of hypertension and type 2 diabetes.,"[hypertension, type 2 diabetes, diabetes]"
1,No significant medical history.,[]
2,"Chronic obstructive pulmonary disease (COPD), ...","[Chronic obstructive pulmonary disease, COPD, ..."
3,Mild asthma.,[asthma]
4,High cholesterol.,[]


In the samples shown, `BioBERT` together with the keyword fallback recovers the conditions named in the text (hypertension, type 2 diabetes, COPD, heart attack, asthma), although "High cholesterol" yields no entity. Overall, 938 of the 1199 records have at least one extracted condition. Next, we apply one-hot encoding to the extracted terms to create a binary feature for each condition.
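
The keyword fallback in the function above uses a plain substring test (`condition in text_lower`), which would also fire inside longer words. A minimal sketch of a stricter variant, assuming whole-word matching is wanted, uses a regular expression with word boundaries. The helper name `keyword_fallback` and the second example sentence are made up for illustration; whether substring false positives actually occur in this dataset is not established here.

```python
import re


def keyword_fallback(text, found, keywords):
    """Return keywords that occur in `text` as whole words and are not already in `found`."""
    found_lower = {d.lower() for d in found}
    text_lower = text.lower()
    extra = []
    for kw in keywords:
        if kw in found_lower:
            continue  # already extracted by the model
        # \b anchors the match at word boundaries, so "stroke" does not match "heatstroke"
        if re.search(r"\b" + re.escape(kw) + r"\b", text_lower):
            extra.append(kw)
    return extra


keywords = ["stroke", "diabetes", "copd"]

# Sentence from the dataset preview: "diabetes" is added alongside "type 2 diabetes"
first = keyword_fallback(
    "History of hypertension and type 2 diabetes.",
    ["hypertension", "type 2 diabetes"],
    keywords,
)

# Made-up sentence: a substring test would flag "stroke" here, the whole-word test does not
second = keyword_fallback("Admitted with heatstroke last summer.", [], keywords)

print(first, second)
```

Like the original fallback, this still adds the canonical keyword (`diabetes`) next to the longer model span (`type 2 diabetes`), which matters because the encoding step below matches condition names exactly.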

### 3.1.2 One-Hot Encoding

In this section, we turn the extracted medical terms into binary features with `pandas`: each condition becomes a column holding 1 if that condition was extracted for the patient and 0 otherwise. Because a patient can have several conditions at once, a row may contain more than one 1.

Note that to avoid high dimensionality, we only encode a fixed list of common conditions rather than every extracted term.
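
As a compact illustration of the idea, the sketch below builds the same kind of binary columns on a three-row toy frame using `explode` and `pd.get_dummies`. It is an alternative formulation under our own assumptions (toy data, the `keep` list, and the variable names are illustrative), not the implementation used in the next cell.

```python
import pandas as pd

# Toy data shaped like the `extracted_diseases` column (values are illustrative)
toy = pd.DataFrame({
    "patient_id": ["a", "b", "c"],
    "extracted_diseases": [["hypertension", "diabetes"], [], ["COPD", "heart attack"]],
})

# Conditions to keep as features; anything else is dropped
keep = ["diabetes", "hypertension", "copd", "asthma"]

# One row per (patient, disease); empty lists become NaN and yield all-zero rows
exploded = toy["extracted_diseases"].explode().str.lower()

multi_hot = (
    pd.get_dummies(exploded)                 # one indicator column per distinct disease
    .groupby(level=0).max()                  # collapse back to one row per patient
    .reindex(columns=keep, fill_value=False) # keep only the chosen conditions, add absent ones
    .astype(int)
    .add_prefix("has_")
)

print(multi_hot)
```

Restricting the columns with `reindex` is what keeps the feature space small: conditions outside `keep` (here, `heart attack`) are discarded, and conditions in `keep` that never occur (here, `asthma`) still get an all-zero column so the schema stays fixed.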

In [5]:
# Define the function to create one-hot encoding
def create_disease_features(diseases_list, conditions):
    """
    Create binary features for each disease in conditions based on diseases_list
    
    Args:
        diseases_list: List of diseases detected for a patient
        conditions: List of conditions to check for
    
    Returns:
        Dictionary with condition names as keys and 0/1 as values
    """
    result = {}
    # Convert all strings to lowercase for case-insensitive matching
    diseases_lower = [disease.lower() for disease in diseases_list]
    
    for condition in conditions:
        # Check if the condition exists in the list (case insensitive)
        result[condition] = 1 if condition.lower() in diseases_lower else 0
    
    return result

# List of common conditions to check
common_conditions = ["stroke", "diabetes", "hypertension", "cancer", "heart attack", 
                    "heart failure", "copd", "asthma", "alzheimer", "dementia"]

# Apply the function to create one-hot encoded columns
disease_features = df['extracted_diseases'].apply(lambda x: create_disease_features(x, common_conditions))

# Convert the result to DataFrame and join with original dataframe
disease_df = pd.DataFrame(disease_features.tolist())

# Add prefix 'has_' to the column names for clarity
disease_df = disease_df.add_prefix('has_')

# Join the one-hot encoded features with the original dataframe
df_encoded = pd.concat([df, disease_df], axis=1)

# Display the result
print("Number of patients with each condition:")
for condition in common_conditions:
    count = disease_df[f'has_{condition}'].sum()
    print(f"{condition.title()}: {count} patients")

# Show sample of data with the new features
print("\nSample of patients with their conditions:")
display(df_encoded[['patient_id', 'medical_history'] + [f'has_{condition}' for condition in common_conditions]].head(5))

Number of patients with each condition:
Stroke: 51 patients
Diabetes: 70 patients
Hypertension: 64 patients
Cancer: 31 patients
Heart Attack: 6 patients
Heart Failure: 64 patients
Copd: 51 patients
Asthma: 40 patients
Alzheimer: 10 patients
Dementia: 44 patients

Sample of patients with their conditions:


Unnamed: 0,patient_id,medical_history,has_stroke,has_diabetes,has_hypertension,has_cancer,has_heart attack,has_heart failure,has_copd,has_asthma,has_alzheimer,has_dementia
0,9b04b,History of hypertension and type 2 diabetes.,0,1,1,0,0,0,0,0,0,0
1,bffd5,No significant medical history.,0,0,0,0,0,0,0,0,0,0
2,fb35e,"Chronic obstructive pulmonary disease (COPD), ...",0,0,0,0,1,0,1,0,0,0
3,1e30e,Mild asthma.,0,0,0,0,0,0,0,1,0,0
4,116a4,High cholesterol.,0,0,0,0,0,0,0,0,0,0


## 3.2 Sentiment Analysis

In this section, we apply the Gemini 2.5 Flash model to perform sentiment analysis on the questionnaire responses. Each free-text response is rated on a scale from 1 to 5, producing one numeric feature per text column as follows:

```txt
describe_fatigue_level -> fatigue_level
describe_lifestyle -> activity_level
describe_mental_health -> mental_health_level
```

### 3.2.1 Accessing the Gemini 2.5 Flash Model

We use `gemini-2.5-flash-preview-04-17`, a preview release of Google's Gemini 2.5 Flash model, accessed through the `google-genai` Python SDK with an API key read from the environment.

In [6]:
# configure api
from dotenv import load_dotenv
import os

load_dotenv()
gemini_api_key = os.getenv("GEMINI_API_KEY")

In [7]:
from google import genai
from google.genai import types

client = genai.Client(api_key=gemini_api_key)

model = [
    "gemini-2.5-flash-preview-04-17"
]

generate_content_config = types.GenerateContentConfig(
    response_mime_type="application/json",
)

### 3.2.2 Applying Sentiment Analysis

This code analyzes patient-reported text responses with Gemini using batch processing and a structured prompt. The dataset is split into batches of 30 rows (`batch_size = 30`) to reduce the number of API calls. For each batch, a single prompt asks Gemini to rate fatigue, activity, and mental health on 1-5 scales with explicit anchors and to return the results as JSON. Each raw response is saved to disk, with retries on API or file errors. The saved files are then parsed, tolerating variations in the returned field names, and the extracted scores are merged with the original dataset by `patient_id`, giving structured, machine-readable features for summary statistics and downstream modelling.
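
Because the scores come from a language model, it is worth checking that every rating is actually an integer between 1 and 5 before using it as a feature. The sketch below is one way to do that, assuming the response is a JSON array of objects keyed by `fatigue_level`, `activity_level`, and `mental_health_level` (the prompt in the next cell does not fix these key names, so they are an assumption), and falling back to the neutral value 3 that the prompt specifies for unclear answers. The sample response is hand-written, not actual model output.

```python
import json

# Assumed key names; the real responses may use different ones
RATING_KEYS = ("fatigue_level", "activity_level", "mental_health_level")


def validate_ratings(raw_text, neutral=3):
    """Parse a JSON response and coerce every rating to an int in the range 1..5."""
    records = json.loads(raw_text)
    if isinstance(records, dict):
        records = [records]  # accept a single object as well as a list
    cleaned = []
    for rec in records:
        row = {"id": rec.get("id")}
        for key in RATING_KEYS:
            try:
                score = int(rec.get(key))
            except (TypeError, ValueError):
                score = neutral  # missing or non-numeric value
            row[key] = score if 1 <= score <= 5 else neutral
        cleaned.append(row)
    return cleaned


# Hand-written sample response for illustration only
sample = """[
  {"id": "9b04b", "fatigue_level": 5, "activity_level": 1, "mental_health_level": 2},
  {"id": "bffd5", "fatigue_level": "2", "activity_level": 9}
]"""

cleaned = validate_ratings(sample)
print(cleaned)
```

In the second record, the string `"2"` is converted to an integer, the out-of-range `9` is replaced by 3, and the missing mental health score also defaults to 3.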

In [8]:
import json
from datetime import datetime

def process_sentiment_batch(batch_df):
    """
    Process a batch of rows using the Gemini model to analyze sentiment in text responses.
    Returns sentiment scores for fatigue, activity, and mental health levels.
    """
    # Combine all text responses into a single prompt for batch processing
    rows = []
    for idx, row in batch_df.iterrows():
        row_data = {
            "id": row["patient_id"],
            "fatigue": row["describe_fatigue_level"] if not pd.isna(row["describe_fatigue_level"]) else "No response",
            "lifestyle": row["describe_lifestyle"] if not pd.isna(row["describe_lifestyle"]) else "No response",
            "mental_health": row["describe_mental_health"] if not pd.isna(row["describe_mental_health"]) else "No response"
        }
        rows.append(row_data)
    
    prompt = f"""Analyze the sentiment in these patient responses and rate each on a scale from 1-5:
    - Fatigue level (1: No fatigue, 5: Extreme fatigue)
    - Activity level (1: Sedentary, 5: Very active)
    - Mental health level (1: Poor, 5: Excellent)

    For each patient, return ONLY a JSON object with their ID and the three numerical ratings.
    If a response is missing or unclear, assign a neutral value of 3.

    Patient responses:
    {json.dumps(rows, indent=2)}
    """

    response = client.models.generate_content(model=model[0], contents=prompt, config=generate_content_config)
    
    return response.text

In [None]:
# Process the dataframe in batches of 30 rows
batch_size = 30
request_count = 0
max_retries = 3

# Make sure the output directory for the raw responses exists
os.makedirs("data/raw_2", exist_ok=True)

for i in range(0, len(df), batch_size):
    print(f"Processing batch {i // batch_size + 1} of {(len(df) + batch_size - 1) // batch_size}")
    batch_df = df.iloc[i:i+batch_size]
    # The API call happens only inside the retry loop below, so failures are caught

    success = False
    retry_count = 0
    
    while not success and retry_count < max_retries:
        try:
            batch_results = process_sentiment_batch(batch_df)
            
            # Try to save the response to a file
            try:
                with open(f"data/raw_2/run_{i}.json", "w") as file:
                    file.write(batch_results)
                success = True
                print(f"Successfully processed and saved batch {i // batch_size + 1}")
            except Exception as e:
                print(f"Error saving results: {str(e)}. Retrying...")
                retry_count += 1
                time.sleep(2)  # Short delay before retry
                
        except Exception as e:
            print(f"Error processing batch: {str(e)}. Retrying...")
            retry_count += 1
            time.sleep(5)  # Slightly longer delay for API errors
    
    if not success:
        print(f"Failed to process batch starting at index {i} after {max_retries} attempts. Skipping.")
    
    # Increment request counter
    request_count += 1
    
    # Add delay after every 5 requests
    if request_count % 5 == 0 and i + batch_size < len(df):
        print(f"Completed {request_count} requests. Taking a 1-minute break to avoid rate limiting...")
        time.sleep(60)  # Sleep for 60 seconds (1 minute)
        print("Resuming processing...")

Processing batch 1 of 40
Successfully processed and saved batch 1
Processing batch 2 of 40
Successfully processed and saved batch 2
Processing batch 3 of 40
Successfully processed and saved batch 3
Processing batch 4 of 40
Successfully processed and saved batch 4
Processing batch 5 of 40
Successfully processed and saved batch 5
Completed 5 requests. Taking a 1-minute break to avoid rate limiting...
Resuming processing...
Processing batch 6 of 40
Successfully processed and saved batch 6
Processing batch 7 of 40
Successfully processed and saved batch 7
Processing batch 8 of 40
Successfully processed and saved batch 8
Processing batch 9 of 40
Successfully processed and saved batch 9
Processing batch 10 of 40
Successfully processed and saved batch 10
Completed 10 requests. Taking a 1-minute break to avoid rate limiting...
Resuming processing...
Processing batch 11 of 40
Successfully processed and saved batch 11
Processing batch 12 of 40
Successfully processed and saved batch 12
Processing 
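
The loop above retries with fixed delays (2 seconds after a file error, 5 seconds after an API error). A common alternative is exponential backoff, where the wait doubles after each failure. The sketch below is a generic helper under our own assumptions (the name `call_with_retries`, its parameters, and the `flaky` stand-in function are illustrative and not part of the notebook); the `sleep` argument is injectable so the example runs instantly.

```python
import time


def call_with_retries(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt, then retry up to max_retries times."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # broad on purpose: mirrors the notebook's catch-all
            last_error = exc
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error


# Stand-in for an API call that fails twice and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient error")
    return "ok"


delays = []  # record the requested waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)
```

In the notebook, `fn` could be a small wrapper such as `lambda: process_sentiment_batch(batch_df)`, with the default `sleep=time.sleep` left in place.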

In [None]:
import glob

# Define the directory containing the JSON files
json_dir = "data/raw_2"

# Get a list of all JSON files
json_files = glob.glob(os.path.join(json_dir, "*.json"))

# Initialize empty lists to store the data
all_ratings = []

# Process each JSON file
for file_path in json_files:
    try:
        with open(file_path, 'r') as file:
            content = file.read()

            # Try to parse the JSON
            try:
                data = json.loads(content)
                
                # Handle both list and single object formats
                if isinstance(data, list):
                    ratings = data
                else:
                    ratings = [data]
                
                # Process each rating entry
                for rating in ratings:
                    # Standardize field names
                    patient_id = rating.get("id", rating.get("patient_id"))
                    
                    # Handle different possible field names for fatigue
                    fatigue_level = rating.get("fatigue_level", 
                                     rating.get("fatigue_rating", 
                                     rating.get("fatigue", None)))
                    
                    # Handle different possible field names for activity
                    activity_level = rating.get("activity_level", 
                                     rating.get("activity_rating", 
                                     rating.get("activity", 
                                     rating.get("lifestyle", None))))
                    
                    # Handle different possible field names for mental health
                    mental_health_level = rating.get("mental_health_level", 
                                         rating.get("mental_health_rating", 
                                         rating.get("mental_health", None)))
                    
                    # Add to our collection if valid
                    if patient_id:
                        all_ratings.append({
                            "patient_id": patient_id,
                            "fatigue_level": fatigue_level,
                            "activity_level": activity_level,
                            "mental_health_level": mental_health_level
                        })
            except json.JSONDecodeError as e:
                print(f"Error parsing JSON in file {file_path}: {e}")
                
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")

# Create a dataframe from the collected ratings
ratings_df = pd.DataFrame(all_ratings)

# Print summary statistics
print(f"Successfully processed {len(ratings_df)} patient ratings")
print(f"Number of unique patients: {ratings_df['patient_id'].nunique()}")

# Display the first few rows of the ratings dataframe
display(ratings_df.head())

# Now merge with the original dataframe (df_encoded)
df_with_ratings = df_encoded.merge(ratings_df, on="patient_id", how="left")

# Check for any patients without ratings
missing_ratings = df_with_ratings[df_with_ratings['fatigue_level'].isna()]['patient_id'].count()
print(f"Patients without ratings: {missing_ratings} out of {len(df_with_ratings)}")

# Display the first few rows of the merged dataframe
display(df_with_ratings.head())

In [None]:
# Save the merged dataframe to a new file
df_with_ratings.to_csv('data/processed/patients_with_ratings.csv', index=False)
print("DataFrame with ratings saved to 'data/processed/patients_with_ratings.csv'")
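
One thing to check before relying on the merged file: a left merge silently duplicates patient rows if `ratings_df` contains the same `patient_id` more than once (for example, when a batch was requested twice or old response files remain in the directory). The sketch below shows a defensive version on toy data, using the `validate` and `indicator` options of `DataFrame.merge`. The toy frames and ratings are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_encoded and ratings_df (values are illustrative)
patients = pd.DataFrame({"patient_id": ["9b04b", "bffd5", "fb35e"]})
ratings = pd.DataFrame({
    "patient_id": ["9b04b", "9b04b", "bffd5"],  # note the duplicated ID
    "fatigue_level": [5, 5, 2],
})

# Keep one rating row per patient before merging
deduped = ratings.drop_duplicates(subset="patient_id", keep="first")

merged = patients.merge(
    deduped,
    on="patient_id",
    how="left",
    validate="one_to_one",  # raises MergeError if keys are duplicated on either side
    indicator=True,         # adds a `_merge` column showing where each row came from
)

# Patients present in the original data but absent from the ratings
missing = (merged["_merge"] == "left_only").sum()
print(len(merged), missing)
```

With `validate="one_to_one"`, an unexpected duplicate raises an error instead of quietly inflating the row count, and the `_merge` column gives the number of unrated patients directly.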