In [66]:
# Suppress Hugging Face tokenizer parallelism and common warnings for cleaner output
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

# Advanced GenAI Features Demo
This notebook demonstrates advanced GenAI, NLP, and LLM features for recruiter-ready healthcare AI/data science portfolios.

**Features Demonstrated:**
- Entity extraction and classification with BERT/Bio_ClinicalBERT (Hugging Face Transformers)
- Retrieval-Augmented Generation (RAG) pipeline
- Vector database integration (FAISS/Chroma)
- Prompt engineering and finetuning
- Bias detection, model guardrails, and safety checks
- Cloud integration (AWS, S3, cloud ML workflows)
- PEFT/SFT advanced finetuning (Hugging Face PEFT)

Each section includes code, a workflow explanation, and practical tips for production and portfolio use.

## 1. Entity Extraction & Classification with Transformers
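This section shows how to load Bio_ClinicalBERT with a token-classification head through Hugging Face Transformers and decode BIO tags into entity spans. The checkpoint does not include a trained classification head, so the head is randomly initialized here (see the warning in the output) and the predicted entities are not meaningful until the model is finetuned on labeled NER data.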

In [67]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load Bio_ClinicalBERT model and tokenizer
model_checkpoint = 'emilyalsentzer/Bio_ClinicalBERT'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=5)  # Example: 5 labels

labels = ['O', 'B-DISEASE', 'I-DISEASE', 'B-SYMPTOM', 'I-SYMPTOM']

def get_entities(text, model, tokenizer, labels):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    entities = []
    current_entity = None
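    for token, pred in zip(tokens, predictions):
        # Skip special tokens such as [CLS] and [SEP]; they are never part of an entity
        if token in tokenizer.all_special_tokens:
            continue
        label = labels[pred]
        if label.startswith('B-'):
            if current_entity:
                entities.append(current_entity)
            current_entity = {'entity': label[2:], 'text': token.replace('##', '')}
        elif label.startswith('I-') and current_entity:
            # WordPiece continuations ('##...') attach directly; a new word gets a space
            separator = '' if token.startswith('##') else ' '
            current_entity['text'] += separator + token.replace('##', '')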
        else:
            if current_entity:
                entities.append(current_entity)
                current_entity = None
    if current_entity:
        entities.append(current_entity)
    return entities

# Example clinical notes
notes = [
    'Patient reports chest pain and shortness of breath. History of hypertension.',
    'Diabetic patient with fatigue and nausea. No chest pain.'
]

for i, note in enumerate(notes):
    ents = get_entities(note, model, tokenizer, labels)
    print(f'Note {i+1}:', note)
    for ent in ents:
        print(f"  Entity: {ent['text']} | Type: {ent['entity']}")
    print()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at emilyalsentzer/Bio_ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note 1: Patient reports chest pain and shortness of breath. History of hypertension.
  Entity: of | Type: DISEASE

Note 2: Diabetic patient with fatigue and nausea. No chest pain.
  Entity: tic | Type: DISEASE
  Entity: fatigue | Type: DISEASE
  Entity: nausea | Type: DISEASE
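
The B-/I-/O grouping that `get_entities` performs is easier to check on hand-labeled input than on the output of a model whose classifier head has not been trained. The following is a minimal sketch and not part of the original notebook: `decode_bio` is a hypothetical helper, and the words and tags are assigned by hand purely for illustration.

```python
def decode_bio(words, tags):
    """Group word-level BIO tags into entity spans (illustrative helper)."""
    entities, current = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity
            if current:
                entities.append(current)
            current = {"entity": tag[2:], "text": word}
        elif tag.startswith("I-") and current and current["entity"] == tag[2:]:
            # An I- tag extends the open entity only if the type matches
            current["text"] += " " + word
        else:
            # O tags (and stray I- tags) close any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


# Hand-labeled example (assumed tags, not model output)
words = ["Patient", "reports", "chest", "pain", "and", "nausea"]
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM"]
bio_entities = decode_bio(words, tags)
print(bio_entities)
```

With these tags, "chest pain" is returned as one SYMPTOM span and "nausea" as another, which is the behavior a finetuned model's predictions would feed into.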



## 2. Retrieval-Augmented Generation (RAG) Pipeline
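This section sketches the two stages of a RAG pipeline for clinical QA, retrieval and generation, using placeholder functions: the retriever always returns the first document, and the generator returns a formatted string instead of calling an LLM.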

In [68]:
# Simple RAG pipeline demo
def simple_retriever(query, docs):
    # Return the most relevant document (here, just the first for demo)
    return docs[0]

def simple_generator(text):
    # Simulate LLM answer generation
    return f"LLM answer based on: {text}"

# Example documents and query
documents = [
    "Patient 123 has diabetes and hypertension.",
    "Patient 456 has asthma and no history of diabetes."
]
query = "What is the diagnosis for patient 123?"

# RAG workflow
retrieved_doc = simple_retriever(query, documents)
generated_answer = simple_generator(retrieved_doc)
print("Query:", query)
print("Retrieved Document:", retrieved_doc)
print("Generated Answer:", generated_answer)

Query: What is the diagnosis for patient 123?
Retrieved Document: Patient 123 has diabetes and hypertension.
Generated Answer: LLM answer based on: Patient 123 has diabetes and hypertension.
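
The retriever above ignores the query entirely. A small step toward real retrieval is to score each document by how many words it shares with the query. The following is a minimal sketch under that assumption; `tokenize_words` and `overlap_retriever` are hypothetical names, and word overlap is a stand-in for the embedding similarity a production retriever would more likely use.

```python
import re


def tokenize_words(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_retriever(query, docs):
    """Return the document sharing the most tokens with the query, plus all scores."""
    query_tokens = tokenize_words(query)
    scores = [len(query_tokens & tokenize_words(doc)) for doc in docs]
    best = max(range(len(docs)), key=scores.__getitem__)  # first doc wins ties
    return docs[best], scores


docs = [
    "Patient 123 has diabetes and hypertension.",
    "Patient 456 has asthma and no history of diabetes.",
]
best_doc, overlap_scores = overlap_retriever("Does patient 456 have asthma?", docs)
print(best_doc, overlap_scores)
```

Here the second document is selected because it shares "patient", "456", and "asthma" with the query, while the first shares only "patient".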


## 3. Vector Database Integration (FAISS)
This section covers vector search for semantic retrieval in clinical NLP workflows. FAISS was the intended library, but the demo in this section uses Annoy.

### Alternative: Vector Database Integration with Annoy
At the time this notebook was written, FAISS was not supported on Python 3.13. Annoy is a C++ library with Python bindings for approximate nearest neighbor search, and it works with recent Python versions. Below is a demo using Annoy for nearest-neighbor search over toy 2D embeddings.

In [69]:
# Annoy vector search demo (works with Python 3.13)
# Install Annoy if not already installed
import sys
try:
    from annoy import AnnoyIndex
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'annoy'])
    from annoy import AnnoyIndex
import numpy as np

# Create example embeddings (2D for demo)
embeddings = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]], dtype='float32')
f = embeddings.shape[1]
index = AnnoyIndex(f, 'euclidean')
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 trees

# Query embedding
query_embedding = [0.15, 0.15]
# With include_distances=True, Annoy returns a tuple: (indices, distances)
nearest_indices, distances = index.get_nns_by_vector(query_embedding, 2, include_distances=True)
print("Query embedding:", query_embedding)
print("Top 2 nearest indices:", nearest_indices)
print("Distances:", distances)

Query embedding: [0.15, 0.15]
Top 2 nearest indices: [0, 1]
Distances: [0.0707106739282608, 0.0707106813788414]
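
Because Annoy is approximate, it can help to compare its results with an exact brute-force search on a small sample. The following is a minimal sketch using only the standard library; `exact_nearest` is a hypothetical helper, and the vectors repeat the toy embeddings from the cell above.

```python
import math


def exact_nearest(query, vectors, k):
    """Exact k-nearest-neighbor search by Euclidean distance (brute force)."""
    ranked = sorted((math.dist(query, vec), i) for i, vec in enumerate(vectors))
    indices = [i for _, i in ranked[:k]]
    distances = [d for d, _ in ranked[:k]]
    return indices, distances


toy_vectors = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8]]
exact_ids, exact_dists = exact_nearest([0.15, 0.15], toy_vectors, k=2)
print("Exact nearest indices:", exact_ids)
print("Exact distances:", exact_dists)
```

The first two vectors are equally distant from the query (about 0.0707), so their relative order may depend on floating-point rounding; what matters is that the same pair is returned by both the exact search and the Annoy index.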


## 4. Prompt Engineering and Finetuning
This section demonstrates cloze-style prompting with a masked language model (`bert-base-uncased`) through the Hugging Face `fill-mask` pipeline. Finetuning is described only conceptually and is not executed.

In [70]:
# Prompt engineering demo with Hugging Face Transformers
from transformers import pipeline

# Use a fill-mask pipeline for prompt engineering
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
prompt = "The patient was diagnosed with [MASK]."
results = fill_mask(prompt)
print("Prompt:", prompt)
for result in results[:3]:
    print(f"Prediction: {result['token_str']} | Score: {result['score']:.4f}")

# Finetuning demo (conceptual, not executed)
print("\nFinetuning: Use Trainer API with your labeled dataset for supervised training.")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


Prompt: The patient was diagnosed with [MASK].
Prediction: cancer | Score: 0.5427
Prediction: leukemia | Score: 0.0917
Prediction: schizophrenia | Score: 0.0470

Finetuning: Use Trainer API with your labeled dataset for supervised training.
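
Prompt engineering with a masked model largely comes down to varying the text around the mask and comparing the predictions. The following is a minimal sketch of generating such variants; the templates and the `build_prompts` helper are hypothetical, and the symptom list is chosen only for illustration.

```python
MASK = "[MASK]"  # mask token used by bert-base-uncased

# Hypothetical templates: each one frames the same question differently
templates = [
    "The patient was diagnosed with {mask}.",
    "The patient reports {symptoms}. The most likely diagnosis is {mask}.",
    "Symptoms: {symptoms}. Diagnosis: {mask}.",
]


def build_prompts(symptoms, templates, mask=MASK):
    """Fill each template with the symptom list and a single mask token."""
    symptom_text = " and ".join(symptoms)
    return [t.format(symptoms=symptom_text, mask=mask) for t in templates]


prompts = build_prompts(["fatigue", "thirst"], templates)
for p in prompts:
    print(p)
```

Each generated string can be passed to the `fill_mask` pipeline from the cell above to see how added context shifts the top predictions.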


## 5. Bias Detection, Model Guardrails, and Safety Checks
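This section demonstrates a basic bias check: it compares per-group classification metrics using NumPy and scikit-learn and flags a large accuracy gap between groups. Guardrails and other safety checks are not implemented in this demo.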

In [71]:
# Bias detection demo: compare classification metrics across two groups
import numpy as np
from sklearn.metrics import classification_report

# Example predictions and true labels for two groups
y_true = np.array([1, 0, 1, 0, 1, 0])  # 1: Disease, 0: No Disease
y_pred_group1 = np.array([1, 0, 1, 0, 1, 0])  # Group 1 predictions
y_pred_group2 = np.array([0, 0, 1, 0, 0, 0])  # Group 2 predictions

print("Group 1 Classification Report:")
print(classification_report(y_true, y_pred_group1))

print("Group 2 Classification Report:")
print(classification_report(y_true, y_pred_group2))

# Simple bias check: Compare accuracy between groups
acc_group1 = np.mean(y_true == y_pred_group1)
acc_group2 = np.mean(y_true == y_pred_group2)
print(f"Accuracy Group 1: {acc_group1:.2f}")
print(f"Accuracy Group 2: {acc_group2:.2f}")
if abs(acc_group1 - acc_group2) > 0.2:
    print("Warning: Potential bias detected between groups!")

Group 1 Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         3
           1       1.00      1.00      1.00         3

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6

Group 2 Classification Report:
              precision    recall  f1-score   support

           0       0.60      1.00      0.75         3
           1       1.00      0.33      0.50         3

    accuracy                           0.67         6
   macro avg       0.80      0.67      0.62         6
weighted avg       0.80      0.67      0.62         6

Accuracy Group 1: 1.00
Accuracy Group 2: 0.67
Warning: Potential bias detected between groups!
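
Accuracy alone hides where the disparity comes from. In the reports above, Group 2's recall for the disease class is 0.33, meaning most true cases in that group are missed. The following is a minimal sketch that compares recall (true positive rate) across groups, a gap sometimes called the equal opportunity difference. The `recall` helper is hypothetical and written in plain Python; the label arrays repeat the toy data from the cell above.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else float("nan")


y_true = [1, 0, 1, 0, 1, 0]
y_pred_group1 = [1, 0, 1, 0, 1, 0]
y_pred_group2 = [0, 0, 1, 0, 0, 0]

recall_group1 = recall(y_true, y_pred_group1)
recall_group2 = recall(y_true, y_pred_group2)
recall_gap = abs(recall_group1 - recall_group2)

print(f"Recall Group 1: {recall_group1:.2f}")
print(f"Recall Group 2: {recall_group2:.2f}")
print(f"Recall gap: {recall_gap:.2f}")
```

On this toy data the recall gap (about 0.67) is twice the accuracy gap (about 0.33), which shows why a per-class metric is usually a more sensitive signal in a clinical setting where missed diagnoses are costly.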


## 6. Cloud Integration (AWS, S3, Cloud ML Workflows)
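This section defines a helper for uploading files, such as model artifacts, to AWS S3 with boto3. The upload call is left commented out because it requires AWS credentials; model deployment and full cloud ML workflows are mentioned but not shown.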

In [72]:
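# Cloud integration demo: Upload file to AWS S3 (requires AWS credentials)
import boto3
from boto3.exceptions import S3UploadFailedError
from botocore.exceptions import BotoCoreError, ClientError

# Example: Upload a file to S3 (conceptual, not executed)
def upload_to_s3(file_path, bucket, object_name):
    """Upload a local file to s3://bucket/object_name; return True on success."""
    s3 = boto3.client('s3')
    try:
        s3.upload_file(file_path, bucket, object_name)
    except (OSError, BotoCoreError, ClientError, S3UploadFailedError) as e:
        # Covers a missing local file, missing credentials, and failed transfers
        print("Error uploading to S3:", e)
        return False
    print(f"Uploaded {file_path} to s3://{bucket}/{object_name}")
    return True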

# Example usage (commented out)
# upload_to_s3('model.pt', 'my-ml-bucket', 'models/model.pt')

print("For full cloud ML workflows, use AWS SageMaker for training/deployment.")

For full cloud ML workflows, use AWS SageMaker for training/deployment.


## 7. PEFT/SFT Advanced Finetuning
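This section demonstrates parameter-efficient finetuning (PEFT) with LoRA using the Hugging Face PEFT library, as the setup step for supervised finetuning (SFT). The adapter is attached to a BERT classifier here; the same approach applies to LLMs. Training itself is not run.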

In [73]:
# PEFT/SFT advanced finetuning demo (conceptual)
# Requires: pip install peft transformers datasets
from peft import get_peft_model, LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load base model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Configure LoRA (Low-Rank Adaptation) for PEFT
# task_type=SEQ_CLS keeps the newly initialized classifier head trainable alongside the adapters
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    target_modules=["query", "value"],
    lora_dropout=0.1,
)
peft_model = get_peft_model(model, lora_config)

print("PEFT model ready for parameter-efficient finetuning.")
print("For full training, use Trainer API with your labeled dataset.")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PEFT model ready for parameter-efficient finetuning.
For full training, use Trainer API with your labeled dataset.
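
To see why LoRA counts as parameter-efficient, it helps to work out how many adapter weights the configuration above adds. The following is a back-of-the-envelope sketch, assuming the standard `bert-base-uncased` dimensions (hidden size 768, 12 encoder layers) and one `query` and one `value` projection per layer. The helper name is hypothetical, and the count excludes the classifier head.

```python
def lora_params_per_module(d_in, d_out, r):
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


hidden_size = 768       # bert-base-uncased hidden dimension
num_layers = 12         # bert-base-uncased encoder layers
targets_per_layer = 2   # "query" and "value" projections
r, lora_alpha = 8, 32   # values from the LoraConfig above

per_module = lora_params_per_module(hidden_size, hidden_size, r)
total_lora_params = per_module * targets_per_layer * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"Adapter parameters per module: {per_module:,}")
print(f"Total adapter parameters: {total_lora_params:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters add 294,912 trainable weights, a small fraction of the roughly 110 million parameters in the base model, and each low-rank update is scaled by `lora_alpha / r = 4.0`.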


## Load Synthetic Clinical Notes
Read the synthetic clinical notes CSV and display the first few entries.

In [74]:
import pandas as pd
notes_df = pd.read_csv('../notebooks/synthetic_clinical_notes.csv')
notes_df.head()

Unnamed: 0,text,label
0,Patient reports fatigue and thirst.,Diabetes
1,Patient reports wheezing and shortness of breath.,Asthma
2,Patient reports thirst and frequent urination.,Diabetes
3,Patient reports healthy and no complaints.,No Disease
4,Patient reports wheezing and cough.,Asthma


In [75]:
import sys
import os
sys.path.append(os.path.abspath('../src'))

In [76]:
# Select the first clinical note for processing
clinical_note = notes_df.iloc[0]['text']
print("Selected clinical note from column 'text':")
print(clinical_note)

# Example: Pass the note to your GenAI pipeline (replace with your actual function)
from nlp_pipeline import process_note
result = process_note(clinical_note)
print('GenAI pipeline result:')
print(result)

Selected clinical note from column 'text':
Patient reports fatigue and thirst.
GenAI pipeline result:
{'length': 35, 'preview': 'Patient reports fatigue and thirst.'}


## Advanced NLP Entity Extraction Demo
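This cell passes a sample clinical note through `process_note`, which is intended to combine spaCy named entity recognition with Bio_ClinicalBERT clinical entity extraction. In the run shown, the result contains no `spacy_entities` or `clinical_entities` keys, so the fallback messages are printed.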

In [77]:
# Example clinical note for advanced NLP demo
clinical_note = "Patient reports chest pain and shortness of breath. History of hypertension."

from nlp_pipeline import process_note
result = process_note(clinical_note)

print("Note length:", result.get('length'))
print("Preview:", result.get('preview'))
print("spaCy Named Entities:", result.get('spacy_entities', 'No spaCy entities found or error occurred'))
print("Clinical Entities (Bio_ClinicalBERT):", result.get('clinical_entities', 'No clinical entities found or error occurred'))

Note length: 76
Preview: Patient reports chest pain and shortness of breath
spaCy Named Entities: No spaCy entities found or error occurred
Clinical Entities (Bio_ClinicalBERT): No clinical entities found or error occurred
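
The source of `nlp_pipeline.process_note` is not shown in this notebook, but the outputs above are consistent with a function that returns only the note length and a 50-character preview. The following is a hypothetical stand-in inferred from those outputs, not the project's actual implementation; it may help when reproducing the results without the `src` package.

```python
def process_note_sketch(text, preview_chars=50):
    """Hypothetical stand-in: return the note length and a truncated preview."""
    return {"length": len(text), "preview": text[:preview_chars]}


sketch_result = process_note_sketch("Patient reports fatigue and thirst.")
print(sketch_result)

# Keys the function does not return fall back to the default passed to .get()
print(sketch_result.get("spacy_entities", "No spaCy entities found or error occurred"))
```

Because `dict.get` returns its default for missing keys, the fallback message appears whenever the pipeline omits the entity fields, whether or not an error actually occurred.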


## Advanced NLP Entity Extraction: Multiple Test Cases
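Run several clinical notes through `process_note` to compare how spaCy and Bio_ClinicalBERT handle different types of medical text. As in the previous cell, the run shown returns only the length and preview fields, so the entity lines print their fallback messages.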

In [81]:
# Multiple clinical note test cases for advanced NLP entity extraction
test_notes = [
    "John Doe, a 45-year-old male, was diagnosed with diabetes on 2022-03-15.",
    "Patient presents with severe headache and nausea. MRI scheduled for 10/10/2025.",
    "Jane Smith has a history of hypertension and COPD. Prescribed Lisinopril.",
    "Patient reports chest pain, shortness of breath, and cough. No prior cardiac history.",
    "Michael Johnson, DOB 1970-05-22, admitted for acute asthma exacerbation."
]

from nlp_pipeline import process_note

for i, note in enumerate(test_notes):
    print(f"\nTest Case {i+1}:")
    print("Note:", note)
    result = process_note(note)
    print("Note length:", result.get('length'))
    print("Preview:", result.get('preview'))
    print("spaCy Named Entities:", result.get('spacy_entities', 'No spaCy entities found or error occurred'))
    print("Clinical Entities (Bio_ClinicalBERT):", result.get('clinical_entities', 'No clinical entities found or error occurred'))


Test Case 1:
Note: John Doe, a 45-year-old male, was diagnosed with diabetes on 2022-03-15.
Note length: 72
Preview: John Doe, a 45-year-old male, was diagnosed with d
spaCy Named Entities: No spaCy entities found or error occurred
Clinical Entities (Bio_ClinicalBERT): No clinical entities found or error occurred

Test Case 2:
Note: Patient presents with severe headache and nausea. MRI scheduled for 10/10/2025.
Note length: 79
Preview: Patient presents with severe headache and nausea. 
spaCy Named Entities: No spaCy entities found or error occurred
Clinical Entities (Bio_ClinicalBERT): No clinical entities found or error occurred

Test Case 3:
Note: Jane Smith has a history of hypertension and COPD. Prescribed Lisinopril.
Note length: 73
Preview: Jane Smith has a history of hypertension and COPD.
spaCy Named Entities: No spaCy entities found or error occurred
Clinical Entities (Bio_ClinicalBERT): No clinical entities found or error occurred

Test Case 4:
Note: Patient reports chest pa

In [None]:
# Biomedical NER with Hugging Face Transformers (d4data/biomedical-ner-all)
# This cell demonstrates advanced biomedical entity extraction for clinical notes.
from transformers import pipeline

# Load the biomedical NER pipeline
biomed_ner = pipeline('ner', model='d4data/biomedical-ner-all', aggregation_strategy='simple')

print('Biomedical NER Results:')
for i, note in enumerate(test_notes):
    print(f"\nTest Case {i+1}:")
    print("Note:", note)
    entities = biomed_ner(note)
    if entities:
        for ent in entities:
            print(f"  Entity: {ent['word']} | Type: {ent['entity_group']} | Score: {ent['score']:.2f}")
    else:
        print("  No biomedical entities found.")

Device set to use mps:0


Biomedical NER Results:

Test Case 1:
Note: John Doe, a 45-year-old male, was diagnosed with diabetes on 2022-03-15.
  Entity: 45 - year - old | Type: Age | Score: 0.99
  Entity: diabetes | Type: Disease_disorder | Score: 1.00
  Entity: 03 | Type: Date | Score: 1.00

Test Case 2:
Note: Patient presents with severe headache and nausea. MRI scheduled for 10/10/2025.
  Entity: severe | Type: Severity | Score: 1.00
  Entity: headache | Type: Sign_symptom | Score: 1.00
  Entity: nausea | Type: Sign_symptom | Score: 1.00
  Entity: mri | Type: Diagnostic_procedure | Score: 1.00

Test Case 3:
Note: Jane Smith has a history of hypertension and COPD. Prescribed Lisinopril.
  Entity: hypertension | Type: History | Score: 0.98
  Entity: cop | Type: History | Score: 0.99
  Entity: li | Type: Medication | Score: 1.00
  Entity: ##sin | Type: Medication | Score: 0.96
  Entity: ##op | Type: Medication | Score: 1.00

Test Case 4:
Note: Patient reports chest pain, shortness of breath, and cough. No prior
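
Several entities above are WordPiece fragments rather than whole words ("cop", "li", "##sin", "##op"). One way to clean this up is to merge adjacent fragments of the same type and then widen each span to the surrounding word using character offsets. The following is a minimal sketch, assuming each entity dict carries `start` and `end` offsets as the Transformers NER pipeline returns them; `merge_and_snap` is a hypothetical helper, and the fragment offsets are computed from the note text for illustration.

```python
def merge_and_snap(text, entities):
    """Merge touching same-type fragments, then widen spans to whole words."""
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if prev and ent["entity_group"] == prev["entity_group"] and ent["start"] <= prev["end"]:
            prev["end"] = max(prev["end"], ent["end"])
            prev["score"] = min(prev["score"], ent["score"])  # keep the weakest score
        else:
            merged.append(dict(ent))
    for ent in merged:
        start, end = ent["start"], ent["end"]
        while start > 0 and text[start - 1].isalnum():
            start -= 1
        while end < len(text) and text[end].isalnum():
            end += 1
        ent.update(start=start, end=end, word=text[start:end])
    return merged


note = "Jane Smith has a history of hypertension and COPD. Prescribed Lisinopril."
cop, lis = note.index("COPD"), note.index("Lisinopril")
# Fragments modeled on the Test Case 3 output; offsets are assumed
fragments = [
    {"entity_group": "History", "word": "cop", "start": cop, "end": cop + 3, "score": 0.99},
    {"entity_group": "Medication", "word": "li", "start": lis, "end": lis + 2, "score": 1.00},
    {"entity_group": "Medication", "word": "##sin", "start": lis + 2, "end": lis + 5, "score": 0.96},
    {"entity_group": "Medication", "word": "##op", "start": lis + 5, "end": lis + 7, "score": 1.00},
]
cleaned = merge_and_snap(note, fragments)
for ent in cleaned:
    print(f"  Entity: {ent['word']} | Type: {ent['entity_group']} | Score: {ent['score']:.2f}")
```

The Transformers pipeline also accepts other `aggregation_strategy` values ("first", "average", "max") that resolve labels at the word level, which may make this post-processing unnecessary.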

## MCP Demo: Store and Retrieve Agent Context
This cell uses the project's `ModelContext` class (from `agents.mcp`), presented here as a Model Context Protocol (MCP) component, to store and retrieve analysis results per agent for agentic workflows.

In [80]:
from agents.mcp import ModelContext

# Create ModelContext object
context = ModelContext()

# Store the latest pipeline result for the agent (here, `result` holds the output for the last test note)
context.update_context('ClinicalAgent', result)

# Retrieve and display the context for the agent
retrieved = context.get_context('ClinicalAgent')
print('Retrieved context for ClinicalAgent:')
print(retrieved)

Retrieved context for ClinicalAgent:
[{'length': 72, 'preview': 'Michael Johnson, DOB 1970-05-22, admitted for acut'}]
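
The `agents.mcp` module is not shown in this notebook. Judging from the output above, `get_context` returns a list of stored entries for the named agent. The following is a hypothetical sketch of a store with that behavior. The class name `ContextStoreSketch` and its internals are assumptions, not the project's implementation.

```python
from collections import defaultdict


class ContextStoreSketch:
    """Hypothetical per-agent context store: each update appends an entry."""

    def __init__(self):
        self._store = defaultdict(list)

    def update_context(self, agent_name, entry):
        # Append rather than overwrite so earlier results stay available
        self._store[agent_name].append(entry)

    def get_context(self, agent_name):
        # Return a copy so callers cannot mutate the stored history
        return list(self._store.get(agent_name, []))


store = ContextStoreSketch()
store.update_context("ClinicalAgent", {"length": 35, "preview": "Patient reports fatigue and thirst."})
store.update_context("ClinicalAgent", {"length": 76, "preview": "Patient reports chest pain and shortness of breath"})
history = store.get_context("ClinicalAgent")
print(history)
```

Keeping a list per agent is what allows follow-up questions to refer to earlier notes, which is the conversational-memory role described for this component.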


## Analysis Overview: Synthetic Clinical Notes with GenAI Features
This section demonstrates how advanced GenAI features are applied to synthetic clinical notes. The workflow includes:
- Loading and inspecting synthetic clinical notes data.
- Selecting a clinical note for analysis.
- Processing the note using an NLP pipeline powered by large language models (LLMs).
- Using Retrieval-Augmented Generation (RAG) to fetch relevant medical codes and guidelines based on extracted entities.
- Leveraging Model Context Protocol (MCP) to maintain conversational memory and context for multi-turn agent interactions.

This analysis showcases how clinical AI agents can extract key information, ground their responses in external knowledge, and remember context for follow-up questions, supporting dynamic and explainable healthcare NLP solutions.