# A Notebook to Use Decision Tree Classifiers

This notebook shows how to train a decision tree to classify unseen instances.

For those of you interested in understanding the code, it uses predefined functions from the [sklearn](http://scikit-learn.org) library of machine learning primitives and from the [graphviz](http://www.graphviz.org) library to generate visualizations. A few more details about the code:  
* The variable "dataset" stores the name of the text file that you input, and it is passed as an argument to the function "loadDataSet()".  
* The variable "attributes" stores the names of all features. The variable "instances" stores the values of all features in the training set. The variable "labels" stores the labels of all instances.  
* The variable "clf" stores a decision tree model, and it can be trained with "instances" and "labels". Once the model is trained, it can be used to predict unseen instances.  We use a type of decision tree algorithm called CART (Classification and Regression Trees). 
* The variable "n_foldCV" stores the number of folds for cross-validation that you input.
* The function "cross_val_score" estimates the accuracy of a decision tree model with cross-validation.  Its inputs are "clf", "instances", "labels", and "n_foldCV".
* The variable "scores" stores the accuracy score of each fold of the n-fold cross-validation.
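
The flow described in the list above can be sketched in a few lines. This is only an illustration: it assumes scikit-learn's bundled copy of the iris data (loaded with `load_iris`) instead of a text file read by "loadDataSet()", and it fixes `random_state` so that repeated runs build the same tree.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled iris data instead of a text file.
iris = load_iris()
attributes = list(iris.feature_names)  # names of all features
instances = iris.data                  # feature values, one row per instance
labels = iris.target                   # one integer class label per instance

# scikit-learn's decision trees use an optimized version of CART.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(instances, labels)

# Predict the class of one instance (here, the first training instance).
prediction = clf.predict(instances[:1])
print(iris.target_names[prediction[0]])
```

The rest of the notebook follows the same steps, but reads the data from the file name that you enter.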


In [None]:
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
import graphviz

def loadDataSet(dataset):
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        Dict={}
        for i in range(len(instances[0])):
            try:
                dataArray.append([float(instance[i]) for instance in instances])
            except ValueError:  # column is not numeric, so encode it as categorical
                encodedData,vocab=encode([instance[i] for instance in instances])
                dataArray.append(encodedData)
                Dict[i]=vocab
                print(attributes[i],': ',list(vocab.items()))
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return attributes,instances,labels,Dict

def encode(data):
    vocab={}
    uniqueVals=sorted(set(data))  # sorted so the codes are the same on every run
    for Val in uniqueVals:
        vocab[Val]=uniqueVals.index(Val)
    encodedData=list(map(uniqueVals.index,data))
    return encodedData,vocab

## Training: Building a Decision Tree Classifier ##

The cells below ask for a dataset and then train a decision tree classifier on it. 

We provide two classification datasets that can be used with the decision tree algorithm. 
* ["iris.txt"](https://archive.ics.uci.edu/ml/datasets/iris) has four attributes with continuous values describing three different iris species.
* ["lenses.txt"](https://archive.ics.uci.edu/ml/datasets/lenses) contains four attributes with discrete values and three classes.

Before training your classifier, run the cell below to take a look at the dataset.

In [None]:
import pandas as pd
dataset=input('Please Enter Your Dataset:')
df=pd.read_csv(dataset)
display(df)

Before we run the following cell, let's learn an important concept called feature encoding. Many classifiers accept only numerical data, but some datasets have features that are not numerical. For example, a feature can be the state that a person lives in. Such features are called [categorical features](https://en.wikipedia.org/wiki/Categorical_variable). To use them, we need to encode each category as a discrete numerical value. This process is called feature encoding.
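
As a rough illustration of feature encoding, the sketch below maps each distinct category in one column to an integer code. The function name `encode_column` and the sample values are hypothetical, and sorting the categories is a choice made here so that the codes are reproducible; it is not necessarily how the notebook assigns them.

```python
def encode_column(values):
    """Map each distinct category to an integer code (sorted for reproducibility)."""
    vocab = {value: code for code, value in enumerate(sorted(set(values)))}
    return [vocab[value] for value in values], vocab

# Hypothetical column of categorical values
states = ["Ohio", "Texas", "Ohio", "Iowa"]
codes, vocab = encode_column(states)
print(codes)  # [1, 2, 1, 0]
print(vocab)  # {'Iowa': 0, 'Ohio': 1, 'Texas': 2}
```

Note that integer codes impose an arbitrary order on the categories. A decision tree splits such a column with thresholds on the code values, so a different assignment of codes can lead to a differently shaped tree.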

In our notebook, if your dataset contains categorical features, the cell below prints the encoding rules (each category and the number assigned to it). In the next section, when you are prompted to input a test set for prediction, the notebook will automatically encode the categorical features according to the encoding rules shown below.

In [None]:
attributes,instances,labels,Dict=loadDataSet(dataset)
clf = tree.DecisionTreeClassifier()
clf.fit(instances,labels)

## Visualizing a Decision Tree ##

The following cell will generate a visualization of the decision tree.

In [None]:
dot_data = tree.export_graphviz(clf, out_file=None,max_depth=5,\
feature_names=attributes,class_names=clf.classes_,label='all',\
filled=True,special_characters=True) 
graph = graphviz.Source(dot_data) 
graph
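
If graphviz is not available, or the rendered tree is too large to read, a plain-text view of the same kind of tree may help. The sketch below uses scikit-learn's `export_text` function. As assumptions for the example, it trains on the bundled iris data and limits the tree to depth 2 so that the output stays short.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: bundled iris data and a shallow tree, to keep the output short.
iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the decision rules as an indented string.
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a test on one feature, and each "class:" line is a leaf with the predicted class.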

## Prediction: Classifying New Instances Using a Decision Tree Classifier ##

The cell below classifies new instances with the decision tree you created.

When you are prompted to input a test set, please create an instance that looks like the instances in the training set.  For example, if you trained the classifier with the contact lenses data, you should create an instance that has the same kinds of features, such as:

"young,myope,yes,normal"


Feature values are separated by commas, and the instance should have the same number of features as the instances in the training set. 
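
To make the conversion concrete, here is a minimal sketch of how a comma-separated instance can be turned into the numeric row that the classifier expects. The helper `encode_instance` and the vocabularies are hypothetical; the vocabularies only mimic the shape of the "Dict" returned by "loadDataSet()", and the actual codes for your dataset are the ones printed when it is loaded.

```python
import numpy as np

def encode_instance(line, vocab_by_column):
    """Turn 'a,b,c' into a 1 x n numeric array using per-column vocabularies."""
    row = []
    for i, value in enumerate(line.strip().split(",")):
        try:
            row.append(float(value))          # numeric feature: use it as is
        except ValueError:
            vocab = vocab_by_column.get(i, {})
            if value not in vocab:            # category never seen in training
                raise ValueError(f"Unknown value {value!r} for feature {i}")
            row.append(vocab[value])          # categorical feature: use its code
    return np.array(row).reshape(1, -1)

# Hypothetical vocabularies, keyed by column index
vocab_by_column = {
    0: {"young": 0, "pre-presbyopic": 1, "presbyopic": 2},
    1: {"myope": 0, "hypermetrope": 1},
    2: {"no": 0, "yes": 1},
    3: {"reduced": 0, "normal": 1},
}
encoded = encode_instance("young,myope,yes,normal", vocab_by_column)
print(encoded)
```

Unlike a bare dictionary lookup, this version reports which feature value was not seen during training, which is a common cause of errors when typing a test instance by hand.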

In [None]:
testset=input('Please Enter Your Test Set:')
testset=testset.strip().split(",")
temp=[]
for i in range(len(testset)):
    try:
        temp.append(float(testset[i]))
    except ValueError:  # not a number, so look up its categorical code
        temp.append(Dict[i][testset[i]])
testset=np.array(temp).reshape((1,len(temp)))
predictions=clf.predict(testset)

In [None]:
print(predictions)

## Evaluating the Accuracy of a Decision Tree Classifier ##

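The following cell will run cross-validation to evaluate your decision tree classifier.  It will ask you for your test data and the number of folds that you want to use. The data is split into that many folds, and each fold is held out once for testing while the model is trained on the remaining folds.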

In [None]:
dataset=input('Please Enter Your Test Data:')
n_foldCV=int(input("Please Enter the Number of Folds:"))
attributes,instances,labels,Dict=loadDataSet(dataset)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(instances,labels)
scores = cross_val_score(clf, instances, labels, cv=n_foldCV)

The following cell will output the accuracy score for each fold, followed by the mean accuracy and an approximate 95% confidence interval (plus or minus two standard deviations of the fold scores).

In [None]:
print("Scores:")
for score in scores:
    print(score)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
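
For reference, the evaluation steps above can be reproduced without any input files. The sketch below assumes the bundled iris data and 5 folds. Treating the mean plus or minus two standard deviations as a 95% interval is only a rough approximation, since it assumes the fold scores are roughly normally distributed.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

# Assumption: bundled iris data and 5 folds.
iris = load_iris()
clf = tree.DecisionTreeClassifier(random_state=0)

# cross_val_score clones and refits the model on each training split,
# so clf does not need to be fitted beforehand.
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

mean, spread = scores.mean(), 2 * scores.std()
print("Accuracy: %0.2f (+/- %0.2f)" % (mean, spread))
```

With few folds, or with folds that contain few instances, the spread between fold scores can be large, so the interval should be read as a rough indication only.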

**Question**: What is the overall accuracy of the classifier?

Now you can print this notebook as a PDF file and turn it in. Note: The decision tree may be truncated during export. 