# Decision Trees for Classification


This notebook shows how to train a `DecisionTreeClassifier` with scikit-learn. 
The code is structured like a Python script: the functions are defined first, then called from the function `main`, which is itself run at the very end under the guard `if __name__ == "__main__":`. A more detailed explanation of the `if __name__ == "__main__":` idiom can be found [here](https://stackoverflow.com/questions/419163/what-does-if-name-main-do).
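
As a minimal sketch of that layout (the function body here is a hypothetical placeholder, not the notebook's actual `main`), the guard only runs the code when the file is executed directly, not when it is imported from another module:

```python
def main():
    """Run the (hypothetical) pipeline and return a status message."""
    return "pipeline finished"


# __name__ equals "__main__" only when this file is executed directly
# (for example `python script.py` or a notebook cell). When the file is
# imported, __name__ is the module name and the call below is skipped.
if __name__ == "__main__":
    print(main())
```

This is why a script written this way can be imported, for example to reuse its functions, without triggering the whole pipeline.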

### **Task:**
The functions are missing docstrings, and it's your task to change that! Try to understand what each function does and give it a proper docstring. 

### **Docstrings:**

In general, docstrings serve as inline documentation for functions, classes, modules, and packages: they describe what a piece of code does, its parameters, return values, any exceptions, etc., so that readers do not have to read the code itself. 
This is especially useful when you work with large codebases or review unfamiliar code. 

Python's built-in `help()` function and various IDEs and development tools can use docstrings to provide contextual help and code introspection. This allows developers to quickly access information about function signatures, parameters, and usage.
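
A small sketch of how a docstring can be looked up at runtime. The function `scale_torque` is a hypothetical helper invented for this illustration and is not part of the notebook:

```python
import inspect


def scale_torque(weight, distance):
    """Return the torque that a weight exerts at a given distance.

    Hypothetical helper, used only to illustrate docstring lookup.
    """
    return weight * distance


# The raw docstring is stored on the function object itself.
print(scale_torque.__doc__)

# inspect.getdoc() returns the same text with the indentation cleaned up.
summary = inspect.getdoc(scale_torque).splitlines()[0]
print(summary)

# help() renders the signature together with the docstring.
help(scale_torque)
```

Tools such as IDE tooltips read the same text, so a function without a docstring has nothing to show there.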

Using docstrings is good practice for promoting code readability, facilitating collaboration among developers (including your future self), and enabling tools for code introspection and automated documentation generation.

#### "autoDocstring" VSCode extension:

At the beginning of the bootcamp, while setting up your machines, you installed the "autoDocstring" extension for VSCode (if not, you can search for it and install it now). The extension makes writing docstrings for your code easier and more consistent: 

1. **Checking docstrings:** In VS Code, you can use the built-in code suggestions feature to view docstrings of functions, methods, and classes. Simply place the cursor on the function or class name and press `Ctrl + Space`, or hover over the name with the mouse, and you'll see a tooltip with the docstring information if it's available.

2. **Writing docstrings with autoDocstring extension:** The "autoDocstring" extension for VS Code helps you generate docstrings quickly: type `"""` or `'''` (triple quotes) on the line directly below the function definition and press Enter. It will generate a template docstring with placeholders for parameters, return values, and other relevant information. You can then fill in the details accordingly. Pressing `Tab` moves you to the next placeholder where your input is needed.


When writing docstrings, it's good practice to follow the Python docstring conventions outlined in [PEP 257](https://www.python.org/dev/peps/pep-0257/). This includes using triple double quotes (`"""`) and starting with a short one-line summary. Sections such as Parameters, Returns, and Raises are not defined by PEP 257 itself; they come from style guides such as the NumPy or Google docstring conventions. Whichever style you pick, use it consistently within your codebase.
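
For illustration, here is one possible sectioned docstring, written in the NumPy style. The function `count_test_samples` is a hypothetical example and is deliberately not one of the functions from the task:

```python
import math


def count_test_samples(n_samples, test_size=0.3):
    """Compute how many samples end up in the test set.

    Parameters
    ----------
    n_samples : int
        Total number of samples in the dataset.
    test_size : float, optional
        Fraction of samples reserved for testing, strictly between
        0 and 1. Defaults to 0.3.

    Returns
    -------
    int
        Number of test samples, rounded up.

    Raises
    ------
    ValueError
        If ``test_size`` is not strictly between 0 and 1.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    return math.ceil(n_samples * test_size)


print(count_test_samples(625))
```

The one-line summary says what the function does, and each section answers one question a caller is likely to have: what goes in, what comes out, and what can go wrong.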



---
### The Data

**Dataset Source:** https://archive.ics.uci.edu/ml/datasets/Balance+Scale

Generated to model psychological experiments reported by Siegler, R. S. (1976). Three Aspects of Cognitive Development. Cognitive Psychology, 8, 481-520. For further information, you can find the paper [here](Balance-Scale-Data-Siegler-1976.pdf).

**Donor:** Tim Hume (hume@ics.uci.edu)

**Data Set Information:**
This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or stay balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is to compare `left-distance * left-weight` with `right-distance * right-weight`: the scale tips to the side with the greater product. If the two products are equal, it is balanced.

**Attribute Information:**

- Class Name: 3 (L, B, R)
- Left-Weight: 5 (1, 2, 3, 4, 5)
- Left-Distance: 5 (1, 2, 3, 4, 5)
- Right-Weight: 5 (1, 2, 3, 4, 5)
- Right-Distance: 5 (1, 2, 3, 4, 5)
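
The labelling rule above can be sketched in a few lines of plain Python. The function name `balance_class` is an assumption made for this illustration. Enumerating every combination of the four attributes shows how the three classes are distributed:

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R' or 'B' by comparing the two weight * distance products."""
    left_torque = left_weight * left_distance
    right_torque = right_weight * right_distance
    if left_torque > right_torque:
        return "L"
    if right_torque > left_torque:
        return "R"
    return "B"


# Enumerate every combination of the four attributes (values 1..5 each).
levels = range(1, 6)
class_counts = Counter(balance_class(*row) for row in product(levels, repeat=4))
print(class_counts)
```

The enumeration yields 625 combinations: 288 labelled `L`, 288 labelled `R`, and only 49 labelled `B`. The balanced class is therefore much rarer than the other two, which is worth keeping in mind when you evaluate the models.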

### The Code

In [None]:
# Importing the required packages 
import numpy as np 
import pandas as pd 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report 

# Suppress warnings 
# (sometimes you might want to ignore warnings, that's how you can achieve this)
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Function importing Dataset 
def importdata(): 
    # Balance Scale data from the UCI repository (the file has no header row)
    url = ('https://archive.ics.uci.edu/ml/machine-learning-'
           'databases/balance-scale/balance-scale.data')
    balance_data = pd.read_csv(url, sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: \n", balance_data.head())
    return balance_data

In [None]:
importdata()

In [None]:
# Function to split the dataset 
def splitdataset(balance_data): 
  
    # Separating the target variable 
    X = balance_data.values[:, 1:5] 
    Y = balance_data.values[:, 0] 
  
    # Splitting the dataset into train and test 
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
      
    return X, Y, X_train, X_test, y_train, y_test

In [None]:
# Function to perform training with the Gini index.
def train_using_gini(X_train, y_train): 
    
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(
            criterion = "gini", 
            max_depth = 3, min_samples_leaf = 5) 
    # Performing training 
    clf_gini.fit(X_train, y_train) 
    return clf_gini  

In [None]:
# Function to perform training with entropy. 
def train_using_entropy(X_train, y_train): 
  
    # Decision tree with entropy 
    clf_entropy = DecisionTreeClassifier( 
            criterion = "entropy",
            max_depth = 3, min_samples_leaf = 5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy
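
The two training functions differ only in the `criterion` argument. As a rough sketch of what the two names refer to, the textbook definitions of Gini impurity and entropy for the class counts in a single node can be written in plain Python. The helper names are illustrative and not part of scikit-learn, whose internal implementation may differ in detail:

```python
import math


def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def entropy(class_counts):
    """Entropy of a node in bits; classes with zero count contribute nothing."""
    total = sum(class_counts)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )


# A pure node (one class only) versus an evenly mixed two-class node.
print(gini_impurity([10, 0, 0]), entropy([10, 0, 0]))
print(gini_impurity([5, 5]), entropy([5, 5]))
```

Both measures are zero for a pure node and largest for an evenly mixed one. A tree chooses the split that reduces the chosen measure the most, so the two criteria often, but not always, lead to similar trees.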

In [None]:
# Function to make predictions 
def prediction(X_test, clf_object): 
  
    # Prediction on test 
    y_pred = clf_object.predict(X_test) 
    print("Predicted values:\n") 
    print(y_pred) 
    return y_pred 

In [None]:
# Function to calculate accuracy 
def cal_accuracy(y_test, y_pred): 
    
    print("-----"*15)
    print("Confusion Matrix: \n",
          confusion_matrix(y_test, y_pred))

    print("-----"*15)
    # accuracy_score returns a fraction, so multiply by 100 for percent
    print("Accuracy (%): \n",
          accuracy_score(y_test, y_pred) * 100)

    print("-----"*15)
    print("Report : \n",
          classification_report(y_test, y_pred))
  

In [None]:
# Driver code 
def main(): 
      
    # Building Phase 
    data = importdata() 
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data) 
    clf_gini = train_using_gini(X_train, y_train) 
    clf_entropy = train_using_entropy(X_train, y_train) 
      
    # Operational Phase 
    print("-----"*15)
    print("Results Using Gini Index:\n") 
      
    # Prediction using gini 
    y_pred_gini = prediction(X_test, clf_gini) 
    cal_accuracy(y_test, y_pred_gini) 
    
    print("-----"*15)
    print("Results Using Entropy:\n") 
    # Prediction using entropy 
    y_pred_entropy = prediction(X_test, clf_entropy) 
    cal_accuracy(y_test, y_pred_entropy)

In [None]:
# Calling main function 
if __name__ == "__main__": 
    main() 

Please take a moment to write down which model is best and what you see as the weak points of the models.
Well-commented code with a short conclusion about what you learned from the modeling is very important, both for your own understanding of the approach later on and for anyone else who reads your code and notebook.
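
When looking for weak points, overall accuracy can hide a class that is predicted poorly. The sketch below shows how per-class recall follows from a confusion matrix. The labels are made up for illustration and are NOT results from this notebook:

```python
from collections import Counter

# Made-up labels for illustration only; not output of the models above.
y_true = ["L", "L", "R", "R", "B", "B", "L", "R"]
y_pred = ["L", "L", "R", "L", "L", "R", "L", "R"]

labels = sorted(set(y_true))
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]

# Accuracy: correctly predicted samples (the diagonal) over all samples.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

# Recall per class: diagonal entry divided by the row total.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(labels)
print(matrix)
print(accuracy, recall)
```

In this toy example the accuracy is 0.625 even though no `B` sample is predicted correctly. Comparing the per-class values in your own classification report in the same way is one place to start your conclusion.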