# Assignment 3: Logistic Regression

## Instructions

* Complete the assignment as outlined below.
* Restart your kernel and rerun your cells before submission.
* Submit your completed notebook (.ipynb).

## Dataset Information

The dataset includes information on a total of 1000 burn-related hospitalizations. The outcome of interest is survival to hospital discharge (`death`).
Below are the features:

* `death`: Hospital discharge status (1 = Dead, 0 = Alive)
* `age`: Age at admission (Years)
* `gender`: Gender (1 = Male, 0 = Female)
* `tbsa`: Total burn surface area (Minor, Moderate, Severe, Critical)
* `race`: Race (1 = White, 0 = Non-White)
* `inh_inj`: Burn involved inhalation injury (1 = Yes, 0 = No)
* `flame`: Flame involved in burn injury (1 = Yes, 0 = No)

Your goal in this homework is to develop a logistic regression
model to predict the probability of survival to hospital discharge of these patients.

In [None]:
# Suggested packages, you can add more if you think they are necessary.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

# Plotting packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Download the data. Uncomment the line below if you are using Google Colab.
# !gdown https://drive.google.com/uc?id=1dGlTA6GsxDwoRjLc9lXfB368U3QMprwL

## Question 1:

1. Load the dataset `burn.csv` and display the first 5 rows.
2. Print out all columns in the dataset and identify any missing values.
3. Create a crosstab between the `tbsa` and `death` columns. Combine the least frequent category into an "other" category or with a category of similar meaning, and justify your decision. The new `tbsa` variable should contain three categories.
4. Create a crosstab between the new `tbsa` and `death` columns.
5. Create dummy variables for the new `tbsa` and display the results.

In [None]:
df = pd.read_csv('burn.csv')
print("first 5 rows:")
print(df.head())
print(f"\nshape: {df.shape}")


In [None]:
print("all columns:")
print(df.columns.tolist())
print("\nnull values per column:")
print(df.isnull().sum())
print(f"\ntotal null values: {df.isnull().sum().sum()}")


In [None]:
crosstab_original = pd.crosstab(df['tbsa'], df['death'], margins=True)
print("crosstab:")
print(crosstab_original)


In [None]:
print("original tbsa value counts:")
print(df['tbsa'].value_counts())

df['tbsa_new'] = df['tbsa'].replace({'critical': 'severe'})

print("\nnew tbsa value counts:")
print(df['tbsa_new'].value_counts())


In [None]:
crosstab_new = pd.crosstab(df['tbsa_new'], df['death'], margins=True)
print("crosstab:")
print(crosstab_new)

print("\njustification:")
print("combined critical with severe because critical has the fewest observations and both represent the most serious burn cases, with high mortality")

In [None]:
tbsa_dummies = pd.get_dummies(df['tbsa_new'], prefix='tbsa', drop_first=True)
print("dummy variables for tbsa_new:")
print(tbsa_dummies.head(10))
print(f"\nshape: {tbsa_dummies.shape}")


## Question 2:

1. Check the class distribution of the `death` variable. How many observations belong to the positive class?
2. What is the baseline accuracy for this classification problem?



In [None]:
print("class distribution of death:")
print(df['death'].value_counts())
print("\nclass distribution (percentages):")
print(df['death'].value_counts(normalize=True))


In [None]:
positive_class_count = df['death'].sum()
print(f"number of positive class (death=1): {positive_class_count}")


In [None]:
baseline_accuracy = df['death'].value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline_accuracy:.4f}")


## Question 3:

Split the data into training (70%) and testing (30%) sets. **Use your student ID** as the `random_state`. Set `stratify` to the target variable.

In [None]:
df_final = pd.concat([df.drop(['tbsa', 'tbsa_new'], axis=1), tbsa_dummies], axis=1)

X = df_final.drop('death', axis=1)
y = df_final['death']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=251280006, stratify=y)

print(f"training: {X_train.shape}")
print(f"testing: {X_test.shape}")
print("\ntraining class distribution:")
print(y_train.value_counts())


## Question 4:

1. Create a pipeline that first standardizes the non-binary variables using a z-scale transform, and then trains an instance of `LogisticRegression` with `penalty = None` and `max_iter = 10000`. **Use the same random seed you used before.**
2. Train the pipeline using the training set.

In [None]:
non_binary_features = ['age']
binary_features = ['gender', 'race', 'inh_inj', 'flame', 'tbsa_moderate', 'tbsa_severe']

preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), non_binary_features),
        ('passthrough', 'passthrough', binary_features)
    ])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(penalty=None, max_iter=10000, random_state=251280006))
])

print("pipeline created")
print(pipeline)


In [None]:
pipeline.fit(X_train, y_train)
print("pipeline trained")


## Question 5:

1. Display the training parameters and intercept of the logistic regression model.
2. Compute odds ratios for all variables.
3. Interpret the odds ratios for `age` and `inh_inj`.
4. Predict over the test set and compute the model's accuracy.


In [None]:
coefficients = pipeline.named_steps['classifier'].coef_[0]
intercept = pipeline.named_steps['classifier'].intercept_[0]

# Order matches the ColumnTransformer output: scaled columns first, then passthrough columns.
feature_names = non_binary_features + binary_features

print("logistic regression coefficients:")
for name, coef in zip(feature_names, coefficients):
    print(f"  {name}: {coef:.4f}")
    
print(f"\nintercept: {intercept:.4f}")


In [None]:
odds_ratios = np.exp(coefficients)

print("odds ratios:")
for name, or_val in zip(feature_names, odds_ratios):
    print(f"  {name}: {or_val:.4f}")


Interpret the odds ratio for `age`.

Because `age` is standardized, its odds ratio is the factor by which the odds of death are multiplied for a one-standard-deviation increase in age (not a one-year increase), holding all other variables constant. Older patients have higher odds of death.
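
Since the `age` odds ratio is per standard deviation, it can help to convert it to a per-year figure. The sketch below shows the arithmetic with made-up numbers; `or_per_sd` and `sd_age` are assumptions for illustration, not results from this notebook. In the fitted pipeline the standard deviation should be available from the scaler, for example `pipeline.named_steps['preprocessor'].named_transformers_['scaler'].scale_[0]`.

```python
import math

# Hypothetical values for illustration only; substitute the fitted numbers.
or_per_sd = 4.0   # assumed odds ratio for a one-SD increase in age
sd_age = 20.0     # assumed standard deviation of age in the training set (years)

# The standardized coefficient equals the per-year coefficient times the SD,
# so dividing the log odds ratio by the SD gives the per-year log odds ratio.
beta_per_sd = math.log(or_per_sd)
beta_per_year = beta_per_sd / sd_age
or_per_year = math.exp(beta_per_year)

print(f"odds ratio per year of age: {or_per_year:.4f}")
```

Raising the per-year odds ratio to the power `sd_age` recovers the per-SD odds ratio, which is a quick consistency check.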



Interpret the odds ratio for `inh_inj`.

The odds of death for patients with an inhalation injury are the odds of death for patients without one multiplied by the odds ratio, holding all other variables constant. In this model, inhalation injury is associated with higher odds of death.
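
An odds ratio multiplies odds, not probabilities, which is easy to misread. The following minimal sketch applies an odds ratio to a baseline probability; the helper `apply_odds_ratio` and both input values are hypothetical and chosen only to make the arithmetic visible.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Return the probability after multiplying the baseline odds by odds_ratio."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)

# Assumed numbers for illustration, not estimates from the burn data.
p_without = 0.10   # probability of death without inhalation injury
or_inh_inj = 3.0   # odds ratio for inhalation injury

p_with = apply_odds_ratio(p_without, or_inh_inj)
print(f"probability with inhalation injury: {p_with:.2f}")
```

With these assumed inputs the odds triple while the probability rises from 0.10 to 0.25, so the probability itself is not tripled.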

In [None]:
y_pred = pipeline.predict(X_test)
accuracy = pipeline.score(X_test, y_test)

print(f"model accuracy on test: {accuracy:.4f}")
print("\nconfusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nclassification report:")
print(classification_report(y_test, y_pred))

## Question 6:

1. Plot the ROC curve to see the performance over all cutoffs.
2. Compute the area under the curve (AUC). Is the AUC acceptable?

In [None]:
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label='roc curve')
plt.plot([0, 1], [0, 1], 'r--', linewidth=2, label='random classifier')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('roc curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()


In [None]:
auc = roc_auc_score(y_test, y_pred_proba)
print(f"auc: {auc:.4f}")


Is the AUC acceptable?

Yes, the AUC is acceptable. An AUC above 0.8 is generally considered good and above 0.9 excellent, so the model discriminates well between patients who died and patients who survived to discharge.
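
One way to read the AUC is as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes that quantity directly on a tiny made-up example; `pairwise_auc` is a hypothetical helper written for illustration, and the labels and scores are not from the burn data.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, assumed for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"pairwise auc: {toy_auc:.2f}")
```

Three of the four positive-negative pairs are ordered correctly here, giving 0.75. On the same inputs, `roc_auc_score` should return the same value.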