---
title: "ISE 625 Project Proposal"
# date: "5/22/2021"
subtitle: "Stable decision trees for suicide experience prediction"
format: 
  revealjs:
    slide-number: c/t
    theme: [default, template/custom.scss]
    toc: true
    progress: true
    mouse-wheel: true
    controls: true
  pdf:
    documentclass: scrartcl
    papersize: letter
author: "Adhithya Bhaskar, Michelle Gelman"
# lightbox: true
# embed-resources: true
# bibliography: template/references.bib
# incremental: true
# csl: template/ieee.csl
width: 1200
# height: 816
editor:
    render-on-save: true
jupyter: 
  kernelspec:
    name: "python3"
    language: "python"
    display_name: ".venv (Python 3.11.11)"
---


## Problem Context and Background
- **Aim:** Predict suicidal experiences among youth experiencing homelessness (YEH)
- The provided decision tree model is unstable under changes to the train-test split
- Can we find a robust model, invariant to distribution shifts, that consistently produces the same top features indicative of suicidal ideation and attempts?

## Dataset Considerations
- Missing data
    - After listwise deletion, 584 and 587 samples remain for the two prediction models, out of 940 original samples
    - About 4% of samples are missing values for suicide ideation and suicide attempt (36 and 40 samples, respectively)
- Imbalanced classes
    - 83% labeled 2, 16% labeled 1 for suicideidea class
    - 88% labeled 0, 11% labeled 1 for suicideattempt class 


## Stable Decision Trees
- Bertsimas et al. (2023) propose a method for creating stable decision trees
- One of the six datasets used in the paper is publicly available: the Breast Cancer dataset (UCI Machine Learning Repository)
- We use it to test and compare our implementation
- Once results are satisfactory, we will apply our implementation to the suicide dataset

## Proposed Plan - [1. Understand the instability of provided DT]{.highlighted}
- The provided model exists as two Python files (one each for `suicidea` and `suicattemp`)
- Create a simple example that deterministically tries various train-test splits
- Empirically measure how the resulting tree splits differ
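
As a rough illustration of this kind of experiment, the sketch below refits a tree on several seeded train-test splits of the public Breast Cancer dataset and counts which feature ends up at the root. The helper name `root_features_across_splits`, the `max_depth=3` setting, and the choice of the root feature as the thing to compare are assumptions made for illustration, not the provided model's configuration.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def root_features_across_splits(X, y, seeds, max_depth=3):
    """Fit one tree per seeded split and count the feature chosen at the root.

    Hypothetical helper: the root feature is only one simple proxy for
    how much the learned tree changes from split to split.
    """
    roots = []
    for seed in seeds:
        # The split seed is the only thing that varies between runs.
        X_tr, _, y_tr, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X_tr, y_tr)
        # tree_.feature[0] is the column index tested at the root node.
        roots.append(X.columns[tree.tree_.feature[0]])
    return Counter(roots)


data = load_breast_cancer(as_frame=True)
root_counts = root_features_across_splits(data["data"], data["target"], seeds=range(10))
print(root_counts)
```

If the counter holds more than one feature name, the root split already depends on the train-test split.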

## Proposed Plan - [2. Implement a stable DT (Bertsimas et al. 2023)]{.highlighted}
1. Train an initial set **(T0)** of decision trees on a subset of the data and a second set **(T)** on the full dataset
2. Compute the **average distance** of each tree in **T** to **T0**
3. Compute the performance (AUC) of each tree in **T** on a validation/test set
4. From the trees in **T**, select the Pareto optimal trees by optimizing for predictive performance and distance to **T0**
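
The final selection step can be sketched as a plain Pareto-front filter, assuming each tree in **T** has already been scored with an average distance to **T0** (lower is better) and an AUC (higher is better). The function name `pareto_front` and the numbers below are hypothetical and only illustrate the dominance rule.

```python
def pareto_front(distances, aucs):
    """Return indices of trees that no other tree dominates.

    Tree j dominates tree i if j is at least as good on both criteria
    (distance <= and AUC >=) and strictly better on at least one.
    """
    scored = list(zip(distances, aucs))
    front = []
    for i, (d_i, a_i) in enumerate(scored):
        dominated = any(
            d_j <= d_i and a_j >= a_i and (d_j < d_i or a_j > a_i)
            for j, (d_j, a_j) in enumerate(scored)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical scores for four candidate trees.
distances = [0.10, 0.25, 0.40, 0.30]
aucs = [0.80, 0.85, 0.90, 0.82]
pareto_indices = pareto_front(distances, aucs)
print(pareto_indices)
```

Here the fourth tree is dropped because the second tree is both closer to **T0** and more accurate.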

## Proposed Plan - [3. Measuring effectiveness of proposed model]{.highlighted}
1. Evaluate performance of provided DT using the stability experiment we define in step 1
2. Evaluate performance of the stable tree using the same experiment handler
3. Define metrics for assessing stability over various splits and use them to compare the two models
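
One candidate stability metric, offered only as an assumption about what such a metric might look like, is the mean pairwise Jaccard similarity between the sets of features each fitted tree uses across splits. The function name `mean_pairwise_jaccard` and the toy feature names are hypothetical.

```python
from itertools import combinations


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard similarity over all pairs of feature sets.

    A value of 1.0 means every split produced a tree using the same features;
    values near 0.0 mean the selected features barely overlap.
    """
    sets = [set(s) for s in feature_sets]
    if len(sets) < 2:
        raise ValueError("need at least two feature sets to compare")
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        # Two empty sets are treated as identical.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Toy example: features used by trees fitted on three different splits.
stability = mean_pairwise_jaccard([{"a", "b"}, {"a", "b"}, {"a", "c"}])
print(round(stability, 4))
```

The same function can score both the provided tree and the stable tree, so the two are compared on equal footing.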

## Key optimization algorithms to implement
- Distance between two trees

$$
\begin{aligned}
d\bigl(\mathcal{T}_{1}, \mathcal{T}_{2}\bigr)
= \min_{x}\;
& \sum_{p\in\mathcal{P}(\mathcal{T}_{1})}\sum_{q\in\mathcal{P}(\mathcal{T}_{2})} d(p,q)\, x_{p,q} \\
& + \sum_{p\in\mathcal{P}(\mathcal{T}_{1})} w(p)\, x_{p}
\end{aligned}
$$
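
To make the objective concrete, here is a brute-force sketch for very small trees. The slide does not state the constraints, so the code assumes that every path of $\mathcal{T}_{2}$ is matched to exactly one distinct path of $\mathcal{T}_{1}$, that each path of $\mathcal{T}_{1}$ is either matched once or left unmatched at cost $w(p)$, and that $\mathcal{T}_{1}$ has at least as many paths as $\mathcal{T}_{2}$. The function name `tree_distance` and the cost values are hypothetical, and the path distance $d(p,q)$ is taken as precomputed input.

```python
from itertools import permutations


def tree_distance(pair_cost, unmatched_cost):
    """Minimise matched-path costs plus penalties for unmatched T1 paths.

    pair_cost[p][q] plays the role of d(p, q); unmatched_cost[p] plays the role of w(p).
    Brute force over all matchings, so only suitable for tiny illustrative inputs.
    """
    n_p, n_q = len(pair_cost), len(pair_cost[0])
    if n_p < n_q:
        raise ValueError("T1 must have at least as many paths as T2")
    best = float("inf")
    # Each tuple assigns a distinct T1 path to every T2 path (x_{p,q} = 1).
    for chosen in permutations(range(n_p), n_q):
        matched = sum(pair_cost[p][q] for q, p in enumerate(chosen))
        # T1 paths not chosen are left unmatched (x_p = 1) and pay w(p).
        unmatched = sum(unmatched_cost[p] for p in range(n_p) if p not in chosen)
        best = min(best, matched + unmatched)
    return best


# Hypothetical costs: T1 has three paths, T2 has two.
pair_cost = [[0.0, 1.0], [1.0, 0.2], [0.5, 0.5]]
unmatched_cost = [1.0, 1.0, 0.3]
distance = tree_distance(pair_cost, unmatched_cost)
print(distance)
```

For realistic tree sizes, an assignment solver such as `scipy.optimize.linear_sum_assignment` on a cost matrix padded with the unmatched penalties would be a more practical choice than enumeration.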

- Pareto optimal tree

$$
\mathbb{T}^{\star} = \operatorname*{arg\,max}\; f\left(d_{b}, \alpha_{b}\right)
$$


## Project Outcomes
- A robust, stable decision tree model that minimizes the variability in tree structure due to random train-test splits
- Empirical evidence supporting the stability of the model through consistent feature selection and comparable performance metrics

- **Impact:** Better interpretability of decision trees for predicting suicide risk among YEH

## Initial implementation


```{python}
import numpy as np
import itertools
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load the public Breast Cancer dataset as pandas objects.
data_breast_cancer = load_breast_cancer(as_frame=True)
X_full = data_breast_cancer["data"]
y_full = data_breast_cancer["target"]

# Hold out 20% of the samples for testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=0.2, random_state=42)

print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
```

<!-- The project proposal and presentation will be evaluated based on the following criteria:
    Relevance to class topic
    Clarity of the project proposal (1 page) and presentation.
    Feasibility of the proposed work based on data availability and proposed plan for solving the problem.
    Potential for social impact. -->

<!-- ### Proposal Outline
## 1. Problem Context and Background
## 2. Pre-processing Data

- Dataset Considerations
- Compare/contrast old pre-processing steps with new proposal

## 3. Data Analysis      
- Criteria for ideal dataset
- Defining Stability
- Defining Generalizability
- Defining Fairness
## 3. Model Background
- Literature Review
- Hyperparameter Considerations
- Mathematical Formulation
## 1. Implementation
-  Defining Tree Pipeline  
- Mathematical Details
## 4. Training ML Model
## 5. Testing and Results 
## 6. Discussion 
## 7. Future Work -->
<!-- https://quarto.org/docs/presentations/ -->