<a href="https://colab.research.google.com/github/hildj/trees_assignment/blob/main/assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment: Trees

## Do two questions in total: "Q1+Q2" or "Q1+Q3"

`! git clone https://github.com/ds3001f25/linear_models_assignment.git`

**Q1.** Please answer the following questions in your own words.
1. Why is the Gini a good loss function for categorical target variables?
2. Why do trees tend to overfit, and how can this tendency be constrained?
3. True or false, and explain: Trees only really perform well in situations with lots of categorical variables as features/covariates.
4. Why don't most versions of the classification/regression tree concept allow for more than two branches after a split?
5. What are some heuristic ways you can examine a tree and decide whether it is probably over- or under-fitting?
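
As a reference point for the first question, here is a minimal sketch of how Gini impurity is typically computed for a node and for a candidate split. The function names `gini_impurity` and `split_impurity` and the toy labels are illustrative assumptions, not part of the assignment or of any library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted average of the child impurities for a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical toy node with two balanced classes.
parent = ["A"] * 5 + ["B"] * 5

# A split that separates the classes perfectly versus one that barely helps.
pure_split = split_impurity(["A"] * 5, ["B"] * 5)
mixed_split = split_impurity(["A", "A", "A", "B", "B"], ["A", "A", "B", "B", "B"])

print(f"parent impurity: {gini_impurity(parent):.3f}")
print(f"pure split:      {pure_split:.3f}")
print(f"mixed split:     {mixed_split:.3f}")
```

A tree-growing algorithm would prefer the split with the lower weighted impurity, which in this toy case is the one that separates the classes completely.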

**Q2.** This is a case study about classification and regression trees.

1. Load the `Breast Cancer METABRIC.csv` dataset. How many observations and variables does it contain? Print out the first few rows of data.

2. We'll use a consistent set of feature/explanatory variables. For numeric variables, we'll include `Tumor Size`, `Lymph nodes examined positive`, `Age at Diagnosis`. For categorical variables, we'll include `Tumor Stage`, `Chemotherapy`, and `Cancer Type Detailed`. One-hot-encode the categorical variables and concatenate them with the numeric variables into a feature/covariate matrix, $X$.

3. Let's predict `Overall Survival Status` given the features/covariates $X$. Unfortunately, there are 528 missing values: either drop those rows from your data or treat them as an additional category to predict. Constrain the minimum samples per leaf to 10. Print a dendrogram of the tree. Print a confusion matrix of the algorithm's performance. What is the accuracy?

4. For your model in part three, compute two statistics:
    - The **true positive rate** or **sensitivity**:
        $$
        TPR = \dfrac{TP}{TP+FN}
        $$
    - The **true negative rate** or **specificity**:
        $$
        TNR = \dfrac{TN}{TN+FP}
        $$
    Does your model tend to perform better with respect to one of these metrics?

5. Let's predict `Overall Survival (Months)` given the features/covariates $X$. Use the train/test split to pick the optimal `min_samples_leaf` value that gives the highest $R^2$ on the test set (it's about 110). What is the $R^2$? Plot the test values against the predicted values. How do you feel about this model for clinical purposes?
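
The sensitivity and specificity defined in part 4 can be computed directly from the four cells of a binary confusion matrix. The sketch below uses plain Python and made-up labels (1 = positive, 0 = negative); the helper names `confusion_counts` and `sensitivity_specificity` are hypothetical and the numbers have nothing to do with the METABRIC data.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, TN, FN), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1  # a positive case the model missed
        elif pred == positive:
            fp += 1  # a negative case flagged as positive
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred, positive):
    """TPR = TP / (TP + FN) and TNR = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr


# Hypothetical labels for ten observations.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tpr, tnr = sensitivity_specificity(y_true, y_pred, positive=1)
print(f"TPR (sensitivity): {tpr:.3f}")
print(f"TNR (specificity): {tnr:.3f}")
```

Comparing the two rates side by side shows whether a model is better at catching positives or at ruling out negatives, which overall accuracy alone can hide.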

**Q3.** This is a case study about trees using bond rating data. The dataset records bond ratings for different companies, alongside business statistics and other data. Companies often have multiple reviews at different dates. We want to predict the bond rating (AAA, AA, A, BBB, BB, B, ..., C, D). Do business fundamentals predict the company's rating?

1. Load the `./data/corporate_ratings.csv` dataset. How many observations and variables does it contain? Print out the first few rows of data.

2. Plot a histogram of the `Rating` variable. It turns out that the gradations of AAA/AA/A and BBB/BB/B and so on make it hard to get good results with trees. Collapse all AAA/AA/A ratings into just A, and similarly collapse BBB/BB/B into B and CCC/CC/C into C.

3. Use all of the variables **except** Rating, Date, Name, Symbol, and Rating Agency Name. To include Sector, make a dummy/one-hot-encoded representation and include it in your features/covariates. Collect the relevant variables into a data matrix $X$.

4. Do a train/test split of the data and use a decision tree classifier to predict the bond rating. Including a `min_samples_leaf` constraint can raise the accuracy and speed up computation time. Print a confusion matrix and the accuracy of your model. How well do you predict the different bond ratings?

5. If you include the rating agency as a feature/covariate/predictor variable, do the results change? How do you interpret this?

In [30]:
! git clone https://github.com/hildj/trees_assignment

fatal: destination path 'trees_assignment' already exists and is not an empty directory.


In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay

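# Assumption: the clone above placed the repo in ./trees_assignment and the CSV sits
# in its data/ folder, as the assignment text indicates; adjust the path if not.
data = pd.read_csv('trees_assignment/data/corporate_ratings.csv')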

In [22]:



print(f"Number of observations: {data.shape[0]}")
print(f"Number of variables: {data.shape[1]}")
print("\nFirst few rows of data:")
print(data.head())

plt.figure(figsize=(8, 5))
data['Rating'].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title('Distribution of Bond Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

collapse_map = {
    'AAA': 'A', 'AA': 'A', 'A': 'A',
    'BBB': 'B', 'BB': 'B', 'B': 'B',
    'CCC': 'C', 'CC': 'C', 'C': 'C'
}

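data['Collapsed_Rating'] = data['Rating'].replace(collapse_map)

data = data.dropna(subset=['Collapsed_Rating'])

# The stratified splits below need at least two rows per class, so drop any class
# that appears only once (a defensive assumption, e.g. a lone 'D' rating).
class_counts = data['Collapsed_Rating'].value_counts()
data = data[data['Collapsed_Rating'].isin(class_counts[class_counts >= 2].index)]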

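# Drop identifiers, the raw target, and the collapsed target so no label leaks into X.
drop_cols = ['Rating', 'Collapsed_Rating', 'Date', 'Name', 'Symbol', 'Rating Agency Name']
X = data.drop(columns=drop_cols, errors='ignore')
y = data['Collapsed_Rating']

if 'Sector' in X.columns:
    # dtype=int keeps the dummies numeric; boolean dummies (the default in recent
    # pandas versions) would be removed by the select_dtypes filter below.
    X = pd.get_dummies(X, columns=['Sector'], drop_first=True, dtype=int)

# Keep numeric columns only and fill remaining gaps with 0.
X = X.select_dtypes(include=[np.number]).fillna(0)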

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = DecisionTreeClassifier(random_state=42, min_samples_leaf=5)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"\nAccuracy (without Rating Agency): {acc:.3f}")

cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
disp.plot(cmap='Blues', values_format='d')
plt.title("Confusion Matrix - Without Rating Agency")
plt.show()

# Second model: same setup, but keep the rating agency as a predictor.
# data already carries Collapsed_Rating, so the mapping need not be rebuilt.
data2 = data.copy()

X2 = data2.drop(columns=['Rating', 'Collapsed_Rating', 'Date', 'Name', 'Symbol'])
y2 = data2['Collapsed_Rating']

cat_cols = [col for col in ['Sector', 'Rating Agency Name'] if col in X2.columns]
# dtype=int keeps the dummies numeric; boolean dummies (the default in recent
# pandas versions) would be removed by the select_dtypes filter on the next line.
X2 = pd.get_dummies(X2, columns=cat_cols, drop_first=True, dtype=int)
X2 = X2.select_dtypes(include=[np.number]).fillna(0)

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42, stratify=y2)

clf2 = DecisionTreeClassifier(random_state=42, min_samples_leaf=5)
clf2.fit(X2_train, y2_train)
y2_pred = clf2.predict(X2_test)

acc2 = accuracy_score(y2_test, y2_pred)
print(f"\nAccuracy (with Rating Agency): {acc2:.3f}")

cm2 = confusion_matrix(y2_test, y2_pred, labels=clf2.classes_)
disp2 = ConfusionMatrixDisplay(confusion_matrix=cm2, display_labels=clf2.classes_)
disp2.plot(cmap='Greens', values_format='d')
plt.title("Confusion Matrix - With Rating Agency")
plt.show()

print("\n--- Interpretation ---")
print(f"Change in test accuracy from adding the rating agency: {acc2 - acc:+.3f}")
if acc2 > acc:
    print("Including the rating agency improved test accuracy, suggesting that agencies may differ systematically in how they rate.")
else:
    print("Including the rating agency did not improve test accuracy, suggesting that it adds little beyond the business fundamentals.")

