## Tree-Based Modeling for Class Target (Python & SAS Viya)

**EXAMPLE:** Tree-Based Modeling for Class Target using Python & SAS Viya  
**DATA SOURCE:**  
Training Data: adult_train.csv, Testing Data: adult_test.csv   
Becker, B. and Kohavi, R. (1996). Adult. UCI Machine Learning Repository. [Link](https://doi.org/10.24432/C5XW20)  

**DESCRIPTION:** This template demonstrates a workflow for preprocessing data in Python and building predictive models using tree-based modeling techniques in SAS Viya.  
**PURPOSE:** The goal is to predict the likelihood of a binary outcome, in this case, whether income exceeds $50K/yr.  
**DETAILS:**  
- Data preprocessing is performed in Python, including one-hot encoding of categorical variables.
- Classification models built in SAS Viya: Decision Tree, Forest, and Gradient Boosting.
- Scoring: each fitted model scores the test data.
- Model Assessment: a confusion matrix, ROC curve, and AUC score are produced for each model.
- Model Comparison: weighted-average F1 scores from the classification reports are compared across the three models.


In [None]:
# Importing necessary libraries
import os
import pandas as pd
import numpy as np
import warnings
import matplotlib.pyplot as plt
import seaborn as sn
from sasviya.ml.tree import DecisionTreeClassifier, ForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score

# Suppress warnings
warnings.filterwarnings("ignore")

### Data Loading and Preprocessing
- **Importing Data and Defining Variables**
    - Load the training and testing partitions of the dataset.
    - Define the variables needed for the analysis.
- **Perform One-Hot Encoding for Categorical Variables**
    - Encode categorical variables as one-hot vectors to prepare the data for modeling.
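
The encoding step can be illustrated on a small example. The sketch below uses two hypothetical toy frames (the values are made up for illustration and are not taken from the Adult data) to show why the test columns are reindexed against the training columns: a category that appears in only one partition would otherwise leave the two encoded frames with different columns.

```python
import pandas as pd

# Hypothetical toy frames; values are illustrative only.
toy_train = pd.DataFrame({
    "age": [25, 40, 31],
    "workclass": ["Private", "State-gov", "Self-emp"],
})
toy_test = pd.DataFrame({
    "age": [52, 29],
    "workclass": ["Private", "Never-worked"],  # category unseen in training
})

# One-hot encode each partition independently
train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Align the test columns to the training columns:
# - columns missing from the test partition are added and filled with 0
# - columns that exist only in the test partition are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print("Train columns:  ", list(train_enc.columns))
print("Test (raw):     ", list(test_enc.columns))
print("Test (aligned): ", list(test_aligned.columns))
```

After alignment both frames expose the same columns in the same order, which is what a model fitted on the training columns expects at scoring time.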

In [None]:
# Construct the workspace path
workspace = f"{os.path.abspath('')}/../../data/"

# Importing Data and Defining Variables
train_data = pd.read_csv(workspace + "adult_train.csv")
test_data = pd.read_csv(workspace + "adult_test.csv")

# Encode categorical variables as one-hot vectors
X_train_encoded = pd.get_dummies(train_data.drop(columns=['target']))
X_test_encoded = pd.get_dummies(test_data.drop(columns=['target']))

# Reindex the testing dataset with the columns from the training dataset
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Check the shape of the training data after one-hot encoding
print("Shape of X_train_encoded:", X_train_encoded.shape)

# Print first 5 rows of train dataset
print("Top 5 rows of adult_train:")
print(train_data.head(5))

### Decision Tree Model Training, Scoring and Evaluation  

For more information regarding SAS Viya Decision Tree Classifier, refer to [this link](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p14rqs4yfhf5bcn1js9nlfgzx795.htm).


In [None]:
# Initialize the SAS Viya Decision Tree classifier
sas_dtree = DecisionTreeClassifier(max_depth=5)

# Fit the model
sas_dtree.fit(X_train_encoded, train_data['target'])

# Score on the test partition
y_pred_tree = sas_dtree.predict(X_test_encoded)

# Calculate predicted probabilities for the positive class ('>50K')
y_pred_proba_tree = sas_dtree.predict_proba(X_test_encoded)['P_target_50K'].values

# Convert categorical target variable to binary labels
y_test_binary = test_data['target'].map({'<=50K': 0, '>50K': 1})


**Decision Tree Model Evaluation**  
&emsp; Generate Confusion Matrix, ROC curve, and Compute AUC Score
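
As a minimal sketch of what these three metrics measure, the example below computes them with scikit-learn on a handful of hypothetical labels and probabilities (the numbers are invented for illustration, and the 0.5 cutoff is an assumption, not necessarily the rule the SAS Viya classifiers apply).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

# Hypothetical ground truth (1 = '>50K') and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
p_pos = np.array([0.1, 0.6, 0.8, 0.4, 0.2, 0.9])

# Hard predictions from an assumed 0.5 cutoff
y_hat = (p_pos >= 0.5).astype(int)

# Rows are true labels, columns are predicted labels, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# The ROC curve sweeps the cutoff over all observed scores;
# AUC summarizes the curve in a single threshold-independent number
fpr, tpr, thresholds = roc_curve(y_true, p_pos)
auc = roc_auc_score(y_true, p_pos)

print(cm)
print(f"AUC = {auc:.3f}")
```

The confusion matrix depends on the chosen cutoff, while the ROC curve and AUC use the probabilities directly, which is why the evaluation cells below pass predicted labels to one and predicted probabilities to the other.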


In [None]:
# Generate confusion matrix, ROC curve, and AUC score for Decision Tree
# (labels fixes the row/column order so it matches the heatmap tick labels)
conf_matrix_tree = confusion_matrix(test_data['target'], y_pred_tree, labels=['<=50K', '>50K'])
fpr_tree, tpr_tree, thresholds_tree = roc_curve(y_test_binary, y_pred_proba_tree)
roc_auc_tree = roc_auc_score(y_test_binary, y_pred_proba_tree)

# Plot confusion matrix for Decision Tree using seaborn
plt.figure(figsize=(8, 6))
sn.heatmap(conf_matrix_tree, annot=True, fmt="d", cmap="Blues", xticklabels=['<=50K', '>50K'], yticklabels=['<=50K', '>50K'])
plt.title('Confusion Matrix for Decision Tree')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Plot ROC curve for Decision Tree
plt.plot(fpr_tree, tpr_tree, color='blue', lw=2, label='Decision Tree ROC curve (AUC = %0.2f)' % roc_auc_tree)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Decision Tree')
plt.show()

### Forest Model Training, Scoring and Evaluation
For more information regarding SAS Viya Forest Classifier, refer to [this link](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p04zhxjh60eutqn1t40f0104gw42.htm).


In [None]:
# Initialize the SAS Viya Forest classifier
sas_forest_model = ForestClassifier(n_estimators=100, max_depth=5, random_state=12345)

# Fit the model
sas_forest_model.fit(X_train_encoded, train_data['target'])

# Score on the test partition
y_pred_rf = sas_forest_model.predict(X_test_encoded)

# Calculate predicted probabilities for the positive class ('>50K')
y_pred_proba_rf = sas_forest_model.predict_proba(X_test_encoded)['P_target_50K'].values


**Forest Model Evaluation**  
&emsp; Generate Confusion Matrix, ROC curve, and Compute AUC Score

In [None]:
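# Generate confusion matrix, ROC curve, and AUC score for Forest
# (labels fixes the row/column order so it matches the heatmap tick labels)
conf_matrix_rf = confusion_matrix(test_data['target'], y_pred_rf, labels=['<=50K', '>50K'])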
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test_binary, y_pred_proba_rf)
roc_auc_rf = roc_auc_score(y_test_binary, y_pred_proba_rf)

# Plot confusion matrix for Forest
plt.figure(figsize=(8, 6))
sn.heatmap(conf_matrix_rf, annot=True, fmt="d", cmap="Reds", xticklabels=['<=50K', '>50K'], yticklabels=['<=50K', '>50K'])
plt.title('Confusion Matrix for Forest')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Plot ROC curve for Forest
plt.plot(fpr_rf, tpr_rf, color='red', lw=2, label='Forest ROC curve (AUC = %0.2f)' % roc_auc_rf)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Forest')
plt.legend(loc="lower right")
plt.show()

### Gradient Boosting Model Training, Scoring and Evaluation
For more information regarding SAS Viya Gradient Boosting Classifier, refer to [this link](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=n1kiea90s0276wn1xr0ig0hvkix6.htm).


In [None]:
# Initialize the SAS Viya Gradient Boosting classifier
sas_gb_model = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=12345)

# Fit the model
sas_gb_model.fit(X_train_encoded, train_data['target'])

# Score on the test partition
y_pred_gb = sas_gb_model.predict(X_test_encoded)

# Calculate predicted probabilities for the positive class ('>50K')
y_pred_proba_gb = sas_gb_model.predict_proba(X_test_encoded)['P_target_50K'].values

**Gradient Boosting Model Evaluation**  
&emsp; Generate Confusion Matrix, ROC curve, and Compute AUC Score

In [None]:
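# Generate confusion matrix, ROC curve, and AUC score for Gradient Boosting
# (labels fixes the row/column order so it matches the heatmap tick labels)
conf_matrix_gb = confusion_matrix(test_data['target'], y_pred_gb, labels=['<=50K', '>50K'])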
fpr_gb, tpr_gb, thresholds_gb = roc_curve(y_test_binary, y_pred_proba_gb)
roc_auc_gb = roc_auc_score(y_test_binary, y_pred_proba_gb)

# Plot confusion matrix for Gradient Boosting 
plt.figure(figsize=(8, 6))
sn.heatmap(conf_matrix_gb, annot=True, fmt="d", cmap="Greens", xticklabels=['<=50K', '>50K'], yticklabels=['<=50K', '>50K'])
plt.title('Confusion Matrix for Gradient Boosting')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()

# Plot ROC curve for Gradient Boosting
plt.plot(fpr_gb, tpr_gb, color='green', lw=2, label='Gradient Boosting ROC curve (AUC = %0.2f)' % roc_auc_gb)
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Gradient Boosting')
plt.legend(loc="lower right")
plt.show()
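
The per-model ROC curves above are drawn on separate figures. One way to compare them directly is to overlay them on a single set of axes. The sketch below shows the idea with synthetic scores for three hypothetical models ("Model A" to "Model C" are placeholders, not the fitted SAS Viya models), so that it runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(12345)
y_true = rng.integers(0, 2, size=200)

# Synthetic scores: a larger multiplier separates the two classes more strongly.
# These stand in for the predicted probabilities of real models.
synthetic_scores = {
    "Model A": 0.5 * y_true + rng.normal(size=200),
    "Model B": 1.0 * y_true + rng.normal(size=200),
    "Model C": 2.0 * y_true + rng.normal(size=200),
}

fig, ax = plt.subplots(figsize=(8, 6))
aucs = {}
for name, scores in synthetic_scores.items():
    fpr, tpr, _ = roc_curve(y_true, scores)
    aucs[name] = roc_auc_score(y_true, scores)
    ax.plot(fpr, tpr, lw=2, label=f"{name} (AUC = {aucs[name]:.2f})")

# Diagonal reference line: performance of a random classifier
ax.plot([0, 1], [0, 1], color="gray", linestyle="--", label="Chance")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("ROC Curves (synthetic example)")
ax.legend(loc="lower right")
```

In this notebook the same loop could iterate over the already computed pairs `(y_test_binary, y_pred_proba_tree)`, `(y_test_binary, y_pred_proba_rf)`, and `(y_test_binary, y_pred_proba_gb)` instead of the synthetic scores.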

### Overall Model Comparison
&emsp; Compare weighted-average F1 scores across the models
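
The `weighted avg` F1 score averages the per-class F1 scores, weighting each class by its number of true instances. As a small illustration on hypothetical labels (not the notebook's predictions), the sketch below shows that the value read from `classification_report` matches `f1_score(..., average="weighted")`, and how it can differ from the F1 score of the positive class alone.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels: four '<=50K' cases and two '>50K' cases
y_true = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"]
y_pred = ["<=50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"]

# Weighted-average F1 as extracted from the report dictionary
report = classification_report(y_true, y_pred, output_dict=True)
weighted_from_report = report["weighted avg"]["f1-score"]

# The same quantity computed directly
weighted_direct = f1_score(y_true, y_pred, average="weighted")

# F1 for the positive class only, for contrast
positive_only = f1_score(y_true, y_pred, pos_label=">50K")

print(f"Weighted F1 (report): {weighted_from_report:.3f}")
print(f"Weighted F1 (direct): {weighted_direct:.3f}")
print(f"F1 for '>50K' only:   {positive_only:.3f}")
```

Because the weighting follows class frequency, the weighted average leans toward the majority class. When the minority class is the one of interest, the per-class score is worth reading alongside it.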

In [None]:
# Extract weighted-average F1 scores from the classification reports
class_reports = {
    'Decision Tree': classification_report(test_data['target'], y_pred_tree, output_dict=True)['weighted avg']['f1-score'],
    'Forest': classification_report(test_data['target'], y_pred_rf, output_dict=True)['weighted avg']['f1-score'],
    'Gradient Boosting': classification_report(test_data['target'], y_pred_gb, output_dict=True)['weighted avg']['f1-score']
}

# Extract model names and F1 scores
model_names, f1_values = list(class_reports.keys()), list(class_reports.values())

# Plotting
plt.figure(figsize=(8, 6))
bars = plt.bar(model_names, f1_values, color=['blue', 'red', 'green'])
plt.title('F1 Score Comparison for All Models')
plt.xlabel('Model')
plt.ylabel('F1 Score')
plt.ylim(0, 1)  # Set y-axis limit to ensure readability

# Add text annotations for F1 scores on the bars
for bar, f1_score in zip(bars, f1_values):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() - 0.05, f'{f1_score:.2f}', ha='center', color='black')

plt.show()
