## Summary Report

> **Note:** Figures and detailed observations for each task are available in `task_1.ipynb` (classification) and `task_2.ipynb` (NLP entity extraction).

### Approach

The project had two main objectives:  
1. Build a binary classification model to predict 30-day hospital readmission.  
2. Extract clinical entities from discharge notes using a large language model (LLM).

**For the classification task:**
- Cleaned and explored structured clinical data (e.g., age, number of previous admissions, length of stay).
- Engineered features based on distributional differences and handled class imbalance using stratified sampling.
- Trained a baseline random forest classifier and evaluated using ROC AUC, F1-score, and confusion matrix.
- Used SHAP values to interpret feature influence.
- Compared benchmark performance with a model trained on engineered features only.
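
The classification steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the data is synthetic, the feature values are made up, and the hyperparameters (`n_estimators=200`, a 25% test split) are assumptions. It also assumes the positive class is the minority one.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the structured data (hypothetical values, 200 rows).
rng = np.random.default_rng(0)
n_rows = 200
X = np.column_stack([
    rng.integers(18, 90, n_rows),  # age
    rng.integers(1, 15, n_rows),   # length_of_stay
    rng.integers(0, 6, n_rows),    # num_previous_admissions
])
# Assumed imbalance: roughly one positive for every two negatives.
y = (rng.random(n_rows) < 1 / 3).astype(int)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # probability of readmission
pred = model.predict(X_test)

auc = roc_auc_score(y_test, proba)
f1 = f1_score(y_test, pred, zero_division=0)
cm = confusion_matrix(y_test, pred, labels=[0, 1])

print(f"ROC AUC: {auc:.2f}  F1: {f1:.2f}")
print(cm)
```

Because the labels here are random, the scores should hover around chance level; the point is the shape of the workflow, not the numbers.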

**For the NLP task:**
- Used Flan-T5 in a zero-shot setting with custom prompts to extract clinical entities: diagnosis, treatment, symptoms, medication, and follow-up.
- Compared predictions to manually annotated labels.
- Observed risks common to LLMs in clinical contexts, including hallucination, entity ambiguity, and inconsistency across prompts.
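
A zero-shot extraction loop of this kind might look like the sketch below. The prompt wording, the helper names `build_prompt` and `extract_entities`, and the `google/flan-t5-base` checkpoint are all assumptions for illustration; the notebook may use a different model size and different prompts.

```python
from typing import Dict

ENTITY_TYPES = ["diagnosis", "treatment", "symptoms", "medication", "follow-up"]


def build_prompt(note: str, entity_type: str) -> str:
    """Build a zero-shot prompt for one entity type (hypothetical wording)."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"Unknown entity type: {entity_type}")
    return (
        f"Extract the {entity_type} mentioned in the discharge note below. "
        "Answer 'none' if it is not mentioned.\n\n"
        f"Discharge note: {note}\n"
        f"{entity_type.capitalize()}:"
    )


def extract_entities(note: str, model_name: str = "google/flan-t5-base") -> Dict[str, str]:
    """Query the model once per entity type and collect the raw answers."""
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    results = {}
    for entity_type in ENTITY_TYPES:
        prompt = build_prompt(note, entity_type)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_new_tokens=32)
        answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        results[entity_type] = answer.strip()
    return results
```

Asking one question per entity type keeps each prompt short and makes it easier to see which entity a hallucinated answer belongs to, at the cost of five model calls per note.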

---

### Key Results

#### Data Inspection

- The dataset contained 200 rows and 9 columns, including structured fields and unstructured discharge notes.
- Variables included both categorical and discrete numeric types such as `age`, `length_of_stay`, and `num_previous_admissions`.
- All records were unique, with no missing values or duplicates.

#### Feature Distribution

- The outcome variable `readmitted_30_days` showed a moderate class imbalance (~1:2 ratio).
- Most features had low cardinality; `age` had the most unique values (67), while others had 14 or fewer.
- Discrete numeric features displayed varied distribution shapes, from right-skewed to multimodal.
- Categorical features were generally well balanced.

#### Distribution Across the Outcome Variable and Feature Engineering

- No strong correlations were observed between features and the outcome.
- `age` and `num_previous_admissions` showed mild correlation with `readmitted_30_days`.
- Visual inspection suggested specific value ranges within features might be linked to outcome classes.
- New features were derived based on these patterns.

#### Classification Task

- The random forest model performed poorly, achieving results only slightly better than random guessing and below baseline accuracy.
- SHAP values identified `age`, `num_previous_admissions`, and `length_of_stay` as the most influential features, consistent with prior EDA.
- The engineered-features-only model performed similarly, suggesting that most of the signal in the raw data was retained through feature transformation, but overall predictive power remained limited.
 
#### LLM Entity Tagging Task

- The Flan-T5 model achieved high recall across most entity types, particularly for `symptoms`, `medication`, and `follow-up`, with recall scores of 1.00 for each.
- Precision was more variable: `medication` and `symptoms` achieved good precision (0.75 and 0.67 respectively), while `follow-up` (0.33) and `treatment` (0.57) showed more frequent false positives.
- `diagnosis` was not detected in any sample, despite being present in two manually labelled cases. Recall was therefore 0.00; with no predictions made, precision is strictly undefined and is reported as 0.00.
- Overall, the model demonstrated strong surface-level pattern matching, but limited ability to infer clinical meaning, especially for implied or resolved conditions.
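
For reference, per-entity precision and recall can be computed from per-note presence flags along these lines. The function name and the toy flags are hypothetical and do not reproduce the annotated data; the second example mirrors the situation where an entity is labelled but never predicted.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[bool], predicted: Sequence[bool]) -> Tuple[float, float]:
    """Return (precision, recall) for one entity type.

    gold[i] is True if the entity was annotated in note i;
    predicted[i] is True if the model reported it for note i.
    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy case: every labelled entity is found, plus one false positive.
print(precision_recall([True, False, False, True], [True, True, False, True]))

# Toy case: entity labelled in two notes but never predicted.
print(precision_recall([True, True, False], [False, False, False]))
```

In the first case recall is perfect while precision drops because of the extra prediction; in the second both values are reported as 0.0, even though precision is formally 0/0.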

---

### Practical Implications

#### Classification Model

- The weak performance of the classification model suggests that the available structured data lacks strong predictive power for 30-day readmission. 
- Despite this, SHAP analysis highlighted a consistent set of influential features (`age`, `length_of_stay`, `num_previous_admissions`), which could inform future data collection or risk stratification approaches.
- In a real-world setting, this model in its current form would not be reliable for deployment but could serve as a baseline for further refinement with additional features or richer data sources (e.g., lab results, clinical history).

#### LLM Entity Tagging

- The LLM was effective at identifying surface-level clinical information, with high recall for observable entities like `symptoms` and `medication`, which could support downstream tasks such as automated coding or triage.
- However, its failure to detect `diagnosis` entities limits its current utility in decision-making contexts.
- Precision variability across entity types also highlights the need for caution when integrating LLM outputs into clinical workflows, particularly where high accuracy and interpretability are critical.

---

### With More Time or Data

#### Data
- Expand the dataset to improve generalisability; 200 data points is relatively small for training and evaluation.
- Add more detailed structured features, such as lab results, comorbidities, and prior treatments.
- Incorporate embeddings from discharge notes into the classification model to capture unstructured clinical context.

#### Modelling

##### Classification Model
- Experiment with alternative classifiers such as logistic regression or XGBoost.
- Apply hyperparameter tuning to optimise model performance beyond the default random forest baseline.
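
One possible shape for this comparison is sketched below, using scikit-learn's `GridSearchCV` on synthetic data. The parameter grids are illustrative assumptions rather than recommended values, and XGBoost is left out to keep the example dependency-free.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, ~1:2 imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 1 / 3).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate pairs an estimator with a small, illustrative grid.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [3, None], "min_samples_leaf": [1, 5]},
    ),
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
}

best = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    best[name] = (search.best_score_, search.best_params_)
    print(f"{name}: mean CV ROC AUC = {search.best_score_:.2f}, params = {search.best_params_}")
```

In practice the search would be run on the training portion only, with a held-out test set kept back for the final comparison, so that the tuned scores are not optimistically biased.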

##### NLP Model
- Fine-tune a domain-specific language model (e.g., ClinicalBERT) to improve recognition of subtle or implied entities.
- Revise prompts with clear definitions to improve performance.

#### Evaluation

##### Classification Model
- Use cross-validation to make performance evaluation more robust and reduce reliance on a single train/test split.
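
A stratified k-fold evaluation could be sketched as follows. The data is synthetic and the choice of five folds is an assumption; the metrics match those named earlier in the report (ROC AUC and F1).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, ~1:2 imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 1 / 3).astype(int)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1"])

# Report mean and spread across folds rather than a single split's score.
for metric in ("roc_auc", "f1"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: {values.mean():.2f} +/- {values.std():.2f}")
```

With only 200 rows, the standard deviation across folds is often as informative as the mean, since it shows how much a single train/test split could mislead.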

##### NLP Model
- Increase the number of manually annotated discharge notes to improve the reliability of evaluation.
- Define clear annotation guidelines to resolve ambiguity in entity definitions (e.g., resolved vs. active diagnoses).