## DecisionTreeClassifier from SAS® Viya® on Handwriting Recognition

### About the data set
The data is a copy of the test set of the [UCI ML hand-written digits dataset](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits).

This example focuses on image recognition, specifically the recognition of handwritten digits. The dataset comprises 1797 observations, each consisting of an image depicting a single handwritten digit. Each image is 64 pixels in size, with dimensions of 8 pixels in width and 8 pixels in height.

The **inputs (𝐱)** are vectors of 64 values, where each input vector represents one image. The 64 values within the input vector correspond to the pixels of the image. These input values range from 0 to 16, indicating the grayscale shade of the respective pixel. The **output (𝑦)** for each observation is an integer between 0 and 9, corresponding to the digit depicted in the image. In total, there are ten classes, each representing a different digit.
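
The relationship between an 8x8 image and its 64-value input vector can be sketched in plain Python. This is a minimal illustration that assumes the rows of the image are concatenated one after another (row-major order); the pixel values and the helper names `flatten` and `pixel_index` are made up for this example.

```python
# Hypothetical 8x8 grayscale image: each entry is a shade from 0 to 16.
# The values are invented for illustration, not taken from the dataset.
image = [[(row * 8 + col) % 17 for col in range(8)] for row in range(8)]

def flatten(img):
    """Concatenate the rows of a 2D grid into one flat list (row-major order)."""
    return [pixel for row in img for pixel in row]

def pixel_index(row, col, width=8):
    """Position of pixel (row, col) inside the flattened vector."""
    return row * width + col

x_vec = flatten(image)
print(len(x_vec))                                # 64
print(x_vec[pixel_index(3, 5)] == image[3][5])   # True
```

Under this layout, the pixel in row 3, column 5 of the image sits at position 3 * 8 + 5 = 29 of the input vector.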

### Step 1: Import Packages

You will need to import Matplotlib, several functions and classes from scikit-learn, and the DecisionTreeClassifier class from sasviya.ml.tree.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sasviya.ml.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

### Step 2a: Get Data

You can retrieve the dataset directly from scikit-learn using load_digits(). By default, this function returns a dictionary-like object whose keys include the inputs (data), the outputs (target), and the 8x8 images (images).

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

We have loaded the digits data. Let's plot the digits. 

The code below sets up a figure with a 6x6 inch size and adjusts the spacing between the subplots. It then creates an 8x8 grid of subplots, each displaying one of the first 64 images (8x8 pixels each) from the digits dataset. The code adds a text label to each subplot, showing the target value associated with the corresponding image.

In [None]:
# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

Calling load_digits(return_X_y=True) returns the feature data (flattened images) and the target data (digit labels) as a tuple. The code below unpacks that tuple into two separate arrays, x and y, ready for further processing and model training.

In [None]:
x, y = load_digits(return_X_y=True)

In [None]:
x

In [None]:
y

That is the data you will be working with. x is a two-dimensional array with 1797 rows and 64 columns, containing whole-number values from 0 to 16 (stored as floating-point numbers). y is a one-dimensional array with 1797 integers ranging from 0 to 9.

### Step 2b: Split Data

It is a common and widely adopted practice to divide the dataset you are working with into two subsets: the **training set** and the **test set**. This division is typically done randomly. The training set is used to train your model, while the test set is used to evaluate its performance. It is crucial not to use the test set during the model fitting process. This methodology allows for an unbiased assessment of the model.

One way to split your dataset into training and test sets is to apply train_test_split().

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

[train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) accepts x and y, along with test_size to determine the test set's size and random_state to set the pseudo-random number generator state, among other optional arguments. This function returns a list containing four arrays:

1. X_train: the portion of x used for model fitting
2. X_test: the portion of x used for model evaluation
3. y_train: the corresponding part of y for X_train
4. y_test: the corresponding part of y for X_test

Once your data is split, you should set aside X_test and y_test until your model has been trained.
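
To make the idea of a random, reproducible split concrete, here is a minimal pure-Python sketch. It is not how train_test_split() is implemented and will not reproduce its exact shuffle; the helper name `simple_split` is hypothetical, and rounding the test share up is an assumption made for this illustration.

```python
import math
import random

def simple_split(n_samples, test_size=0.2, seed=0):
    """Shuffle the row indices and cut them into train and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    n_test = math.ceil(test_size * n_samples)   # assumption: round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(1797, test_size=0.2, seed=0)
print(len(train_idx), len(test_idx))            # 1437 360
print(set(train_idx).isdisjoint(test_idx))      # True
```

Every row lands in exactly one of the two subsets, and reusing the same seed gives the same split, which is the role random_state plays above.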

### Step 3: Create a Model and Train it with sasviya.ml.tree DecisionTreeClassifier
Use the X_train and y_train subsets to train the model. Create an instance of DecisionTreeClassifier and invoke .fit() on it.

For details about using the `DecisionTreeClassifier` class, see the [DecisionTreeClassifier documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p14rqs4yfhf5bcn1js9nlfgzx795.htm).

In [None]:
dt = DecisionTreeClassifier(max_depth=5,
                            min_samples_leaf=1,                           
                            criterion='gini')

dt.fit(X_train, y_train)
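
The criterion='gini' setting refers to a measure of how mixed the class labels in a tree node are. The sketch below assumes the standard textbook definition of Gini impurity (one minus the sum of squared class proportions); it does not show how sasviya.ml.tree computes splits internally, and the helper names `gini` and `split_impurity` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([3, 3, 3, 3]))               # 0.0 (pure node)
print(gini([0, 0, 1, 1]))               # 0.5 (two classes, evenly mixed)
print(split_impurity([0, 0], [1, 1]))   # 0.0 (perfect split)
```

A tree grown with this criterion prefers splits whose children have low weighted impurity. With max_depth=5, a binary tree can have at most 2^5 = 32 leaves.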

Use get_params() to retrieve the parameters of the DecisionTreeClassifier object dt.

In [None]:
dt.get_params()

Obtain the predicted outputs using .predict().

In [None]:
y_pred = dt.predict(X_test)

In [None]:
y_pred

The variable y_pred is now associated with an array of the predicted outputs.
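
Conceptually, a fitted decision tree predicts by running each input vector through a sequence of threshold tests until it reaches a leaf. The sketch below illustrates that idea with a hand-written toy tree. The feature indices, thresholds, and leaf labels are invented, and the dictionary layout and `predict_one` helper are hypothetical; none of it reflects the internal structure of dt.

```python
# A hypothetical, hand-written tree of depth 2. Feature indices, thresholds,
# and leaf labels are invented for illustration and do not come from dt.
toy_tree = {
    "feature": 36, "threshold": 0.5,
    "left": {"leaf": 0},
    "right": {
        "feature": 21, "threshold": 7.5,
        "left": {"leaf": 7},
        "right": {"leaf": 3},
    },
}

def predict_one(node, x):
    """Follow threshold tests from the root until a leaf is reached."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

sample = [0] * 64    # a blank 64-pixel input vector
sample[36] = 12      # pixel 36 exceeds 0.5, so go right at the root
sample[21] = 3       # pixel 21 is at most 7.5, so go left at the second test
print(predict_one(toy_tree, sample))   # 7
```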

You can calculate the accuracy using .score().

In [None]:
print('{:.4f}'.format(dt.score(X_train, y_train)))

In [None]:
print('{:.4f}'.format(dt.score(X_test, y_test)))

You can obtain two accuracy values, one from the training set and the other from the test set. Comparing these two values is advisable; a significant difference where the training set accuracy is much higher might indicate overfitting. The test set accuracy is more relevant for evaluating performance on unseen data since it is unbiased.
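
As a minimal sketch of the comparison described above, accuracy can be computed by hand as the fraction of predictions that match the true labels. The labels and predictions below are made up, and the `accuracy` helper is hypothetical; it only assumes that the score in question is plain classification accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels and predictions, for illustration only.
train_true, train_pred = [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]
test_true, test_pred = [0, 1, 2, 3, 4], [0, 1, 2, 8, 9]

train_acc = accuracy(train_true, train_pred)
test_acc = accuracy(test_true, test_pred)
print('{:.4f} {:.4f}'.format(train_acc, test_acc))   # 1.0000 0.6000
print('gap: {:.4f}'.format(train_acc - test_acc))    # gap: 0.4000
```

A large positive gap like the one in this toy example is the pattern that suggests overfitting.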

You can get the confusion matrix using confusion_matrix().

The resulting confusion matrix is extensive, containing 100 numbers in this instance. This is a scenario where visualizing it could be highly beneficial.

In [None]:
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', fontsize=12, color='black')
ax.set_ylabel('Actual outputs', fontsize=12, color='black')
ax.xaxis.set(ticks=range(10))
ax.yaxis.set(ticks=range(10))
ax.set_ylim(9.5, -0.5)
for i in range(10):
    for j in range(10):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')
plt.show()

This heatmap visually represents the confusion matrix using numbers and colors. The shades of purple indicate smaller numbers (such as 0, 1, or 2), while green and yellow represent larger numbers (21 and above).

The values on the diagonal (24, 21, ..., 37) indicate the number of correct predictions from the test set. For instance, 24 images of zero and 21 images of one are classified correctly, and so forth down the diagonal. Other values correspond to incorrect predictions. For example, the value 1 in the third row and first column indicates one image of the number 2 incorrectly classified as 0.
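
The row and column convention can be checked on a small example. The sketch below builds a confusion matrix by hand for a made-up three-class problem, with actual classes as rows and predicted classes as columns; the `confusion` helper is hypothetical and is not the scikit-learn implementation.

```python
def confusion(y_true, y_pred, n_classes):
    """Count matrix with actual classes as rows and predicted classes as columns."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# Made-up labels for a three-class problem, for illustration only.
actual = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 0, 1, 2, 2, 2, 0]

toy_cm = confusion(actual, predicted, n_classes=3)
for row in toy_cm:
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [1, 0, 2]
correct = sum(toy_cm[k][k] for k in range(3))
print(correct)   # 5
```

Here the 1 in the third row and first column is the single sample whose actual class is 2 but which was predicted as 0, and the diagonal sums to the 5 correct predictions.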

Finally, you can obtain the classification report as a string or dictionary using classification_report().

In [None]:
print(classification_report(y_test, y_pred))

This report provides additional per-digit information: the precision, recall, F1-score, and support (the number of test images of that digit).
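
To show where precision and support come from, the sketch below derives them, together with recall, from a made-up 3x3 confusion matrix whose rows are actual classes and whose columns are predicted classes. The `per_class_metrics` helper is hypothetical and only illustrates the standard definitions.

```python
def per_class_metrics(cm, k):
    """Precision, recall, and support for class k from a count matrix
    whose rows are actual classes and whose columns are predicted classes."""
    true_pos = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column total: predicted as k
    support = sum(cm[k])                      # row total: actual members of k
    precision = true_pos / predicted_k if predicted_k else 0.0
    recall = true_pos / support if support else 0.0
    return precision, recall, support

# Made-up 3x3 confusion matrix, for illustration only.
toy_cm = [[2, 0, 0],
          [0, 1, 1],
          [1, 0, 2]]

for k in range(3):
    p, r, s = per_class_metrics(toy_cm, k)
    print(k, round(p, 2), round(r, 2), s)
# 0 0.67 1.0 2
# 1 1.0 0.5 2
# 2 0.67 0.67 3
```

Precision answers "of the images predicted as this digit, how many were right?", recall answers "of the images that really are this digit, how many were found?", and support is simply the row total.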