# Assignment 1 - Decision Tree

In this assignment, we train a decision tree model that predicts customer churn.
The first step is to load the dataset and show its first five rows to check that the table loaded correctly.

In [None]:
import pandas as pd
df = pd.read_parquet('https://huggingface.co/datasets/scikit-learn/churn-prediction/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet', engine='pyarrow')

df.head()

## Load default scripts
The cell below defines prepared helper code. Just hit the "play" button and move on to the next cell.

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

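def train_decision_tree(X, y):
  """Fit a preprocessing + decision tree pipeline and return it with its feature names."""
  # Split the columns into categorical (text) and numeric features
  categorical_cols = [col for col in X.columns if X[col].dtype == 'object']
  numeric_features = [col for col in X.columns if X[col].dtype in ["float64", "int64"]]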

  # Create a pipeline for preprocessing categorical variables
  categorical_transformer = Pipeline(steps=[
      ('onehot', OneHotEncoder(handle_unknown='ignore'))
  ])

  # Create a pipeline that fills missing numeric values with the column median
  numeric_transformer = Pipeline(
      steps=[("imputer", SimpleImputer(strategy="median"))]
  )

  # Combine preprocessing steps with ColumnTransformer
  preprocessor = ColumnTransformer(
      transformers=[
          ("num", numeric_transformer, numeric_features),
          ('cat', categorical_transformer, categorical_cols)
  ])

  # Append classifier to preprocessing pipeline
  # Now we have a full prediction pipeline
  clf = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', DecisionTreeClassifier(min_samples_leaf=1000))])

  # Fit the model
  clf.fit(X, y)

  # Collect the feature names in the order the preprocessor outputs them:
  # numeric columns first, then the one-hot encoded categorical columns
  onehot = clf.named_steps['preprocessor'].named_transformers_['cat'].named_steps['onehot']
  full_list = list(numeric_features) + list(onehot.get_feature_names_out())

  return clf, full_list

## Execute machine learning run
The cell below will train a decision tree model.

If you want to use additional features in your model, copy and paste the **exact** column name (mind the capital letters!) from the table above (or from Excel).
You can add as many variables as you want, as long as you wrap each one in quotes and separate them with commas.

In [None]:
inputFeatures = df[["gender", "MonthlyCharges"]]
outcome = (df["Churn"]=="Yes")
dt_model, feature_names = train_decision_tree(inputFeatures, outcome)
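
Because column names must match exactly, a quick check before training can catch typos and capitalization mistakes. The sketch below is only an illustration and not part of the assignment code: `find_unknown_columns` is a hypothetical helper, and the short `available_columns` list stands in for `list(df.columns)` (the `Contract` entry is an assumed column name used for the example).

```python
def find_unknown_columns(requested, available):
    """Return the requested names that do not exactly match an available column."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Stand-in for list(df.columns); "Contract" is an assumed column name.
available_columns = ["gender", "MonthlyCharges", "Contract", "Churn"]

# "monthlycharges" is deliberately misspelled (wrong capital letters).
requested_columns = ["gender", "monthlycharges", "Contract"]

unknown = find_unknown_columns(requested_columns, available_columns)
print(unknown)
```

If the printed list is not empty, fix those names before passing them to `df[[...]]`; otherwise pandas raises a `KeyError`.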

## Visualize decision tree
The code below visualizes the trained decision tree. Just run it (no changes needed).

In [None]:
import matplotlib.pyplot as plt
import sklearn.tree as tree
import graphviz
dot_data = tree.export_graphviz(dt_model.named_steps['classifier'], feature_names=feature_names, class_names=["False", "True"], filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png")
graph
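
Each node in the drawn tree reports a `gini` value, because `DecisionTreeClassifier` uses Gini impurity as its default split criterion. As a rough aid for reading the plot, the sketch below recomputes that number from the class distribution in a node. `gini_impurity` is a hypothetical helper written for this illustration, and the counts are made-up examples rather than values from the churn data.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


# Made-up node contents: [number of "False" samples, number of "True" samples]
evenly_mixed = gini_impurity([50, 50])   # most impure two-class node
pure_node = gini_impurity([100, 0])      # only one class present
mostly_false = gini_impurity([75, 25])

print(evenly_mixed, pure_node, mostly_false)
```

A value near 0 means the node contains almost only one class, while 0.5 is the maximum for two classes. The helper normalizes its input, so it gives the same result for counts and for proportions.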

# Model performance**Wait with the steps below when until we are addressing model performance!**!

## Confusion matrix
Below you can generate the confusion matrix for the decision tree developed above. Just execute the cell below.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(dt_model, inputFeatures, outcome)
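
To make the four cells of the plot concrete, the sketch below counts them by hand and derives a few common summary measures. It is an illustration under assumptions: `confusion_counts` is a hypothetical helper, and the two label lists are made-up toy data, not predictions from `dt_model`. The order `tn, fp, fn, tp` follows the scikit-learn layout, with true labels in rows and predicted labels in columns.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives and true positives."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


# Made-up example: True means "churned"
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, False, False, False, False, False, True, False]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
sensitivity = tp / (tp + fn)         # share of actual churners that were found
specificity = tn / (tn + fp)         # share of non-churners correctly left alone

print(tn, fp, fn, tp)
print(accuracy, sensitivity, specificity)
```

In this toy example the accuracy looks reasonable, yet only one of the three churners is detected, which is why accuracy alone can be misleading when one class is much less common than the other.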

## ROC curve (discriminative ability)
Run the code below to generate the ROC curve and determine the AUC value (see legend). How good/bad is the model?

In [None]:
from sklearn.metrics import RocCurveDisplay
roc = RocCurveDisplay.from_estimator(dt_model, inputFeatures, outcome)
plt.show()
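
One way to interpret the AUC: it equals the probability that a randomly chosen positive case (a churner) receives a higher predicted score than a randomly chosen negative case, with ties counted as half. The sketch below computes it this way on made-up scores. `auc_from_scores` is a hypothetical helper for illustration, not the routine scikit-learn uses internally, and it compares every pair, so it is only suitable for small examples.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of positive/negative pairs that are ranked correctly."""
    positives = [s for s, y in zip(scores, y_true) if y]
    negatives = [s for s, y in zip(scores, y_true) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted churn probabilities
y_true = [True, True, False, False, False]
scores = [0.8, 0.3, 0.3, 0.1, 0.6]

auc = auc_from_scores(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Ties matter for a decision tree, because all samples in the same leaf receive the same predicted probability.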