<h1>Importing</h1>

Import all the necessary modules.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.inspection import DecisionBoundaryDisplay

<h1>Data Preparation</h1>

You are given two datasets: breast-cancer-wisconsin and diabetes. Run a basic analysis on both (check shape, dtypes, and missing data), then prepare the data for modeling. Don't spend time hunting for outliers for now; you can come back to these datasets at home for more advanced data preparation.

PS: Check value_counts for the target column of both datasets as well.
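The checks listed above can be sketched as follows. This is a minimal illustration on a tiny hand-made frame: the column names and values are hypothetical stand-ins, and in the tutorial `df` would come from reading the provided CSV files instead.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical columns and values); in the tutorial,
# df would be loaded from the provided CSV files instead.
df = pd.DataFrame({
    "age": [54, 31, np.nan, 47, 62, 29],
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "blood_glucose_level": [140, 85, 200, 126, 155, 90],
    "target": [0, 0, 1, 0, 1, 0],
})

print(df.shape)                        # (number of rows, number of columns)
print(df.dtypes)                       # numeric vs. object (string) columns

missing = df.isna().sum()              # missing values per column
print(missing)

counts = df["target"].value_counts()   # class balance of the target column
print(counts)
```

Object columns such as the hypothetical `gender` above are the ones that need encoding (for example with the imported LabelEncoder or OneHotEncoder) before a tree can be fitted.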

<h1>Modeling</h1>

The following describes some of the parameters of DecisionTreeClassifier.

criterion - How do you want to measure the impurity of each node? 'gini', 'entropy', or 'log_loss'?

max_depth - How deep may the tree grow at maximum?

min_samples_split - How many samples must a node contain, at minimum, before it is allowed to split?

min_samples_leaf - How many samples do you want inside each leaf at minimum?

max_leaf_nodes - How many leaves do you want in your tree at maximum?

min_impurity_decrease - How much must a split decrease the impurity, at minimum, for the split to be made?

min_weight_fraction_leaf - What fraction of the total sample weight do you want in each leaf at minimum? When no sample weights are given, every sample has equal weight, so the fraction is number of samples in the leaf / total number of samples.

max_features - How many features do you want to consider when searching for the best split at each node?

class_weight - How much weight do you want for each class? (This is used when the number of samples per class is highly imbalanced.)

ccp_alpha - How aggressively do you want to prune branches? This is the complexity parameter of minimal cost-complexity pruning: larger values remove more branches, and the default of 0.0 performs no pruning.
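To see how a few of these parameters change the fitted tree, here is a minimal sketch. It assumes scikit-learn's built-in `load_breast_cancer` data as a stand-in for the tutorial's CSV file, and the parameter values and variable names are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in breast cancer data stands in for the tutorial's CSV.
X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_bc, y_bc, test_size=0.2, stratify=y_bc, random_state=0
)

# Tree grown with default parameters (no limits on depth or leaves).
default_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Tree constrained by some of the parameters described above.
small_tree = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_leaf=5,
    max_leaf_nodes=6,
    random_state=0,
).fit(X_tr, y_tr)

for name, tree in [("default", default_tree), ("constrained", small_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Comparing `get_depth()` and `get_n_leaves()` before plotting is a quick way to check whether a constraint actually took effect.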

We start with the breast-cancer-wisconsin dataset to get comfortable with the structure of a tree.

In this task, you are going to plot the tree of a fitted Decision Tree Classifier. Start with the default parameters for modeling and use the plot_tree function to analyze your tree. Pass your column names as **feature_names** and set **filled** to True. Call plt.figure first to set the dpi parameter, and use plt.savefig to save the plot to your computer.

In [None]:
## Split your dataset and fit it

In [None]:
plt.figure(dpi = 300)


## Plot your tree


plt.savefig('structure_of_tree.png')

Our tree is very complicated and hard to interpret. To understand what is happening in the tree, you can zoom in on the saved image to see the details, or you can create a simpler tree by setting, for instance, max_depth = 3.
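A shallow tree like the max_depth = 3 one mentioned above can also be read as plain text, which is sometimes easier than zooming into an image. The sketch below uses `export_text` from `sklearn.tree` and, as an assumption, scikit-learn's built-in breast cancer data in place of the tutorial's CSV file.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: the built-in breast cancer data stands in for the tutorial's CSV.
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# Limit the depth so the printed rules stay short enough to read.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# One line per node: the split condition, or the predicted class for a leaf.
rules = export_text(shallow, feature_names=list(X_demo.columns))
print(rules)
```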

Your next task is to set max_depth, max_leaf_nodes, and criterion to different values and see how the plotted tree changes. Try small values, such as 1-5 for max_depth and 2-10 for max_leaf_nodes (scikit-learn requires max_leaf_nodes to be at least 2). You can play around with the other parameters later.

The rest of the tutorial uses diabetes_prediction_dataset.csv, which will give us a better understanding of overfitting in Decision Trees. If you haven't done any data preparation for this dataset yet, do it now and train your model.

After modeling, check your scores on both the train and test datasets. You will probably get a score above 0.90 with default parameters, but is that really so good?
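One way to judge such a score is to compare it with a model that always predicts the majority class. The sketch below does this on synthetic data generated with `make_classification`; the roughly 91.5% / 8.5% class split is an assumption chosen to mimic an imbalanced target, and none of the numbers it prints come from the real diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption, not the tutorial's CSV).
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.915, 0.085], flip_y=0.02, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)

print("baseline test accuracy:", baseline_acc)
print("tree train accuracy:   ", train_acc)
print("tree test accuracy:    ", test_acc)
```

If the baseline is already close to the tree's test accuracy, the high score says little about how well the minority class is predicted, and a large gap between train and test accuracy points to overfitting.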

Unfortunately, despite the high score, the model overfits. To understand why, read the next cells and run the decision boundary code.

cols contains the two most important features affecting the target column. The aim is to plot the decision boundaries for these two features by fitting on the train dataset and predicting on the test dataset with **default** parameters.

In [None]:

## Choosing two most important features
cols = ['blood_glucose_level', 'HbA1c_level']

## Display Method
display = DecisionBoundaryDisplay

## Fitting
dtr = DecisionTreeClassifier()
dtr.fit(X_train[cols], y_train)


## Plotting Decision Boundaries
display.from_estimator(
    dtr,
    X_test[cols],
    response_method = 'predict',
    xlabel = cols[0],
    ylabel = cols[1],
    alpha = 0.5    
)

## Plotting Data Points
plt.scatter(X_test[cols[0]], X_test[cols[1]], c = y_test.values)
plt.show()


Purple data points are cases where no diabetes was detected, while yellow points are cases with diabetes. The plot clearly shows yellow points falling on the wrong side of the decision boundaries, whereas there is no misprediction for the non-diabetes cases (purple points).

The model obviously overfits on the cases with diabetes. Why? Remember value_counts()? It showed that 91,500 samples belong to class 0 (no diabetes), while only 8,500 samples, just 8.5% of the target column, belong to class 1. The model learns class 0 really well, but the lack of information about class 1 leads to a poor fit for that class alone. To analyze such cases, it is recommended to look at precision, recall, and f1_score, although they are out of scope for this tutorial.

How to fix this issue? Use class_weight = 'balanced' and play around with the other parameters if there is still overfitting. You can also check whether the ccp_alpha parameter helps.
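The comparison suggested above could look like the following sketch. It again uses synthetic imbalanced data as an assumption, and ccp_alpha = 0.005 is an arbitrary illustrative value rather than a tuned one, so the printed numbers only show how to run the comparison, not what the real dataset will give.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption, not the tutorial's CSV).
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.915, 0.085], flip_y=0.02, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

models = {
    "default": DecisionTreeClassifier(random_state=0),
    # ccp_alpha=0.005 is an arbitrary illustrative value, not a tuned one.
    "balanced + pruned": DecisionTreeClassifier(
        class_weight="balanced", ccp_alpha=0.005, random_state=0
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "leaves": model.get_n_leaves(),
        "train_acc": model.score(X_tr, y_tr),
        "test_acc": model.score(X_te, y_te),
        # Recall on class 1: share of true positives that the model finds.
        "recall_class_1": recall_score(y_te, model.predict(X_te), pos_label=1),
    }
    print(name, results[name])
```

Looking at the train/test gap and the class 1 recall together, rather than accuracy alone, shows whether a change really helps the minority class.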

<h1>Optional Homework</h1>

Create a function that computes the Gini index for a given input. The input can be a list, an array, a pandas Series, etc., so it is best to convert it inside the function with np.asarray(), which converts array-like objects to NumPy arrays.

Do the same for entropy computation.

Create a function that computes Information Gain based on the chosen metric, entropy or Gini.
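As a reference for these three functions, the standard definitions can be written as follows. Here $p_k$ is the fraction of samples in a node that belong to class $k$, $I$ stands for whichever impurity measure is chosen, and $n_j / n$ is the share of the parent's samples that falls into child node $j$. The base-2 logarithm is an assumption; any base works as long as it is used consistently.

$$
\mathrm{Gini} = 1 - \sum_{k} p_k^2, \qquad
\mathrm{Entropy} = -\sum_{k} p_k \log_2 p_k, \qquad
\mathrm{IG} = I(\text{parent}) - \sum_{j} \frac{n_j}{n}\, I(\text{child}_j)
$$

For example, a node with labels [0, 0, 1, 1] has $p_0 = p_1 = 0.5$, so its Gini index is $1 - (0.25 + 0.25) = 0.5$ and its entropy is $1$.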

Apply the functions you created to the dataset. For example, set a threshold of 200 for blood_glucose_level and compute the Gini index or entropy of the two resulting nodes, together with the Information Gain of the split.