## Goal: Build a BERT-based sentiment classifier (DistilBERT) for patient feedback.

## Why BERT: It models the context on both sides of each word and typically reaches higher accuracy than an LSTM on sentiment tasks.

## What you do: Load cleaned text → Fine-tune DistilBERT → Evaluate → Save model.

## Outcome: A fine-tuned, modern NLP model that can feed dashboards and real-world use.


In [1]:
import pandas as pd
df=pd.read_csv(r"D:\HealthCare System\cleaned_feedback_dataset.csv")
df.head()

Unnamed: 0,clean_text,Sentiment
0,discharge instructions were very clear,1
1,i felt the doctor used too much medical jargon,1
2,i appreciated the reminders about my medicines,1
3,instructions after discharge were not helpful,0
4,instructions after discharge were not helpful,0


In [2]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)
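
The two-stage split leaves roughly 72% of the rows for training, 8% for validation, and 20% for testing. The arithmetic can be sketched as below. This assumes a 1,000-row dataset (inferred from the 720/80/200 example counts in the `map` progress output later in the notebook) and assumes the held-out size is rounded up, as scikit-learn does. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.1):
    """Return (train, val, test) row counts for the two-stage split above."""
    # First split: hold out test_size of all rows (held-out size rounded up).
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    # Second split: hold out val_size of the remaining rows for validation.
    n_val = math.ceil(n_remaining * val_size)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

# Assumed dataset size of 1,000 rows.
print(split_sizes(1000))  # (720, 80, 200)
```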


## Renaming Sentiment column to labels BEFORE tokenization

In [11]:
train_df = train_df.rename(columns={"Sentiment": "labels"})
val_df   = val_df.rename(columns={"Sentiment": "labels"})
test_df  = test_df.rename(columns={"Sentiment": "labels"})


## Tokenization function


In [12]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


In [13]:
def tokenize(batch):
    return tokenizer(
        batch["clean_text"],          # the cleaned text column
        padding="max_length",         # pad all sequences to same length
        truncation=True,              # cut off long text safely
        max_length=64                 # max token length for BERT
    )
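
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of the same idea. `pad_or_truncate`, the token ids, and `pad_id=0` are illustrative assumptions. The real tokenizer additionally inserts special tokens and keeps the closing separator when it truncates.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Force a list of token ids to exactly max_length entries."""
    ids = token_ids[:max_length]                 # truncation: drop extra tokens
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad           # padding: fill up with pad_id
    attention_mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, shortened to max_length=8 for readability.
short = pad_or_truncate([11, 12, 13, 14], max_length=8)
print(short["input_ids"])       # [11, 12, 13, 14, 0, 0, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]

long = pad_or_truncate(list(range(1, 13)), max_length=8)
print(long["input_ids"])        # [1, 2, 3, 4, 5, 6, 7, 8]
```

The attention mask is what lets the model ignore the padded positions, so every example can share the same length without the padding influencing the prediction.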


## Convert pandas → Hugging Face Dataset

In [14]:
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df)
val_ds   = Dataset.from_pandas(val_df)
test_ds  = Dataset.from_pandas(test_df)

## Applying tokenization to datasets

In [15]:
train_ds = train_ds.map(tokenize, batched=True)   # tokenize training data
val_ds   = val_ds.map(tokenize, batched=True)     # tokenize validation data
test_ds  = test_ds.map(tokenize, batched=True)    # tokenize test data

Map:   0%|          | 0/720 [00:00<?, ? examples/s]

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

## Load the DistilBERT model for classification

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,                    # 2 sentiment classes: 0 = negative, 1 = positive
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Set training arguments

In [17]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert_sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50
)
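
These settings determine how many optimizer steps the run takes and when the loss is logged. A rough sketch of that arithmetic follows, assuming a single device, no gradient accumulation, and the 720 training examples reported by the `map` output. `training_steps` and `logged_steps` are hypothetical helpers.

```python
import math

def training_steps(n_train, batch_size, epochs):
    """Total optimizer steps: batches per epoch times number of epochs."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last batch may be partial
    return steps_per_epoch * epochs

def logged_steps(total_steps, logging_steps):
    """Steps at which the training loss is reported."""
    return list(range(logging_steps, total_steps + 1, logging_steps))

total = training_steps(n_train=720, batch_size=16, epochs=3)
print(total)                    # 135
print(logged_steps(total, 50))  # [50, 100]
```

Both numbers agree with the run in this notebook: `global_step=135` in the training output and loss rows at steps 50 and 100.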



## Compute metrics to pass to the Trainer

In [21]:
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    matthews_corrcoef
)

def compute_metrics(pred):
    labels = pred.label_ids
    preds  = pred.predictions.argmax(-1)
    
    accuracy  = accuracy_score(labels, preds)
    mcc       = matthews_corrcoef(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    
    return {
        'accuracy': accuracy,
        'mcc': mcc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
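
To show what each of these metrics measures, the sketch below recomputes them from the four confusion-matrix counts using only the standard library. `binary_metrics` and the toy labels are illustrative assumptions. Returning 0.0 when a ratio is undefined is a simplification chosen here.

```python
import math

def binary_metrics(labels, preds):
    """Accuracy, precision, recall, F1 and MCC for 0/1 labels (1 = positive)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "mcc": mcc,
            "precision": precision, "recall": recall, "f1": f1}

# Toy example: 3 TP, 3 TN, 1 FP, 1 FN.
toy = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 0, 0, 0, 0, 1])
print(toy["accuracy"], toy["f1"], toy["mcc"])  # 0.75 0.75 0.5
```

MCC is the only one of the five that uses all four counts, which is why it drops to 0.5 here while the other metrics stay at 0.75.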

In [23]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)




In [24]:
trainer.train()




Step,Training Loss
50,0.0015
100,0.0003


TrainOutput(global_step=135, training_loss=0.0007368486650564053, metrics={'train_runtime': 121.3176, 'train_samples_per_second': 17.805, 'train_steps_per_second': 1.113, 'total_flos': 35766197637120.0, 'train_loss': 0.0007368486650564053, 'epoch': 3.0})

In [25]:
trainer.evaluate(test_ds)




{'eval_loss': 0.00015362749400082976,
 'eval_accuracy': 1.0,
 'eval_mcc': 1.0,
 'eval_precision': 1.0,
 'eval_recall': 1.0,
 'eval_f1': 1.0,
 'eval_runtime': 2.7116,
 'eval_samples_per_second': 73.758,
 'eval_steps_per_second': 4.794,
 'epoch': 3.0}

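## The model does not appear to be overfitting:
- Training performance is excellent (final training loss ≈ 0.0007).
- Test performance is also excellent (accuracy, F1, and MCC are all 1.0).

A gap between training and test performance is the usual sign of overfitting, and there is none here. Perfect scores do not prove that the model generalizes, however: the df.head() output shows the same sentence in rows 3 and 4, so the random split may place identical text in both the training and test sets. The safest conclusion is that the model is strong on this dataset and that the dataset is easy.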
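
Because the `df.head()` output shows a duplicated sentence, one sanity check is to measure how many test sentences also appear verbatim in the training split. The sketch below is a minimal, library-free version. `overlap_fraction` and the toy lists are hypothetical stand-ins for `train_df["clean_text"]` and `test_df["clean_text"]`.

```python
def overlap_fraction(train_texts, test_texts):
    """Fraction of test texts that also appear verbatim in the training texts."""
    train_set = set(train_texts)
    test_list = list(test_texts)
    if not test_list:
        return 0.0
    shared = sum(1 for text in test_list if text in train_set)
    return shared / len(test_list)

# Toy data; in the notebook, pass the clean_text columns of the two splits.
toy_train = ["instructions after discharge were not helpful",
             "discharge instructions were very clear"]
toy_test = ["instructions after discharge were not helpful",
            "waiting time was too long"]
print(overlap_fraction(toy_train, toy_test))  # 0.5
```

A high fraction would suggest that the test score partly measures memorization of repeated sentences rather than generalization to new feedback.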


## Testing 

In [30]:
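import torch

def predict(text):
    # Tokenize with the same max_length used during fine-tuning (64)
    inputs = tokenizer(
        text,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=64
    )
    # Move inputs to the model's device (the Trainer may have moved it to GPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    model.eval()                      # disable dropout for inference
    with torch.no_grad():             # no gradients needed for prediction
        outputs = model(**inputs)
    pred = outputs.logits.argmax(dim=1).item()

    return "Positive" if pred == 1 else "Negative"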


In [31]:
predict("The doctor was very rude and unprofessional.")


'Positive'

In [32]:
predict("The staff were extremely kind and helpful.")


'Positive'

In [33]:
predict("Waiting time was too long.")


'Negative'
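
`predict` returns only the most likely label. Applying a softmax to the two logits also gives a confidence score, which can help spot borderline predictions. The sketch below uses only the standard library. `label_with_confidence` and the logit values are made up for illustration. In the notebook the logits would come from `outputs.logits[0].tolist()`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return the predicted label and its probability (index 1 = Positive)."""
    probs = softmax(logits)
    pred = probs.index(max(probs))
    label = "Positive" if pred == 1 else "Negative"
    return label, probs[pred]

# Hypothetical logits for one sentence: [negative score, positive score].
label, confidence = label_with_confidence([-1.0, 1.0])
print(label, round(confidence, 3))  # Positive 0.881
```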

## Limitations & Future Improvements

### Occasional misclassification of strongly negative sentences

Although the model scores perfectly on the test split, it misclassifies some strongly negative sentences, for example:
- “The doctor was very rude and unprofessional.”

In the test above, this sentence was predicted as Positive. A likely cause is that the training data does not contain enough examples with harsh or explicit negative language.

Why this happens:
- The dataset may be biased toward polite or mild wording.
- Strong negative words such as "rude", "unprofessional", "terrible", and "horrible" may be under-represented.
- DistilBERT generalizes well overall but struggles with rare patterns it has not seen often enough.

Impact:
This does not break the model, but it can hide important negative feedback if the model is used in a real hospital dashboard.



## FINAL ANALYSIS

The model achieved high test-set performance with standard fine-tuning hyperparameters (learning_rate=2e-5, batch_size=16, epochs=3, weight_decay=0.01), consistent tokenization (max_length=64, padding, truncation), and DistilBERT’s strong pretrained language representations.

The Hugging Face Trainer handled optimization, scheduling, and batching, while a custom compute_metrics function reported Accuracy, Precision, Recall, F1, and MCC. Clean dataset preprocessing and renaming the target column to labels, the name the Trainer expects, kept training stable and efficient.
