# DAT210x - Programming with Python for DS

## Module6 - Lab5

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.tree import export_graphviz

Useful information about the dataset used in this assignment can be [found here](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names).

Load up the mushroom dataset into dataframe `X` and verify you did it properly, and that you have not included any features that clearly shouldn't be part of the dataset.

You should not have any doubled indices. You can check out information about the headers present in the dataset using the link we provided above. Also make sure you've properly captured any NA values.

In [2]:
headers = ['label', 'cap-shape', 'cap-surface', 'cap-color', 'bruises?',
           'odor', 'gill-attachment', 'gill-spacing', 'gill-size',
           'gill-color', 'stalk-shape', 'stalk-root',
           'stalk-surface-above-ring', 'stalk-surface-below-ring',
           'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
           'veil-color', 'ring-number', 'ring-type', 'spore-print-color',
           'population', 'habitat']
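# na_values="?" converts the dataset's "?" placeholders to NaN at load time
# (an alternative is calling X.replace('?', np.nan, inplace=True) after loading).
X = pd.read_csv('Datasets/agaricus-lepiota.data', names=headers, na_values="?")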

In [3]:
# An easy way to show which rows have nans in them:
#X[pd.isnull(X).any(axis=1)]
print(f"NaN values: \n{X.isnull().sum()}")

NaN values: 
label                          0
cap-shape                      0
cap-surface                    0
cap-color                      0
bruises?                       0
odor                           0
gill-attachment                0
gill-spacing                   0
gill-size                      0
gill-color                     0
stalk-shape                    0
stalk-root                  2480
stalk-surface-above-ring       0
stalk-surface-below-ring       0
stalk-color-above-ring         0
stalk-color-below-ring         0
veil-type                      0
veil-color                     0
ring-number                    0
ring-type                      0
spore-print-color              0
population                     0
habitat                        0
dtype: int64
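
The loading and NaN check above can be tried on a tiny in-memory stand-in before touching the real file. This is only a sketch: the three rows and the shortened header list (`sample_csv`, `sample_headers`) are made up for illustration and are not part of the mushroom dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample; "?" plays the role of the dataset's missing-value marker.
sample_csv = "e,x,?\np,b,c\ne,?,b\n"
sample_headers = ["label", "cap-shape", "stalk-root"]

# Passing names= means the first row is treated as data, not as a header.
sample = pd.read_csv(io.StringIO(sample_csv), names=sample_headers, na_values="?")

# Per-column NaN counts, as in the cell above.
nan_counts = sample.isnull().sum()
print(nan_counts)

# A unique index is a quick sanity check against doubled indices.
print(sample.index.is_unique)
```

If a column count came back as zero here even though the raw text contains `?`, that would suggest the `na_values` argument was not applied.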


For this simple assignment, just drop any row with a NaN in it, then print out your dataset's shape:

In [4]:
X.dropna(axis = 0, how = 'any', inplace = True)
print (f"After dropping all rows with any NaNs, shape of X is: \n{X.shape}")

After dropping all rows with any NaNs, shape of X is: 
(5644, 23)


Copy the labels out of the dataframe into variable `y`, then remove them from `X`.

Encode the labels using the `.map()` trick we presented in Module 5, using `e:0` (edible) and `p:1` (poisonous).

In [5]:
y = X.label
X.drop('label', axis = 1, inplace = True)
y = y.map({'e': 0, 'p': 1})
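
One thing worth knowing about `.map()` with a dictionary is that any value missing from the dictionary silently becomes NaN. The sketch below, on a made-up series (`labels` is hypothetical, and the `"x"` entry is deliberately invalid), shows one way to spot such values after encoding.

```python
import pandas as pd

# Hypothetical labels; "x" is not a valid class and is included on purpose.
labels = pd.Series(["e", "p", "p", "e", "x"])

# Same mapping as in the cell above.
encoded = labels.map({"e": 0, "p": 1})

# Values that were not in the mapping end up as NaN in the encoded series.
unmapped = labels[encoded.isnull()]
print(encoded.tolist())
print(unmapped.tolist())
```

On the real data, an empty `unmapped` result would indicate that every label was covered by the mapping.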

Encode the entire dataframe using dummies:

In [6]:
X = pd.get_dummies(X, columns = ['cap-shape', 'cap-surface', 'cap-color', 'bruises?', 'odor', 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type', 'spore-print-color', 'population', 'habitat'])
print (X.head(6))


   cap-shape_b  cap-shape_c  cap-shape_f  cap-shape_k  cap-shape_s  \
0            0            0            0            0            0   
1            0            0            0            0            0   
2            1            0            0            0            0   
3            0            0            0            0            0   
4            0            0            0            0            0   
5            0            0            0            0            0   

   cap-shape_x  cap-surface_f  cap-surface_g  cap-surface_s  cap-surface_y  \
0            1              0              0              1              0   
1            1              0              0              1              0   
2            0              0              0              1              0   
3            1              0              0              0              1   
4            1              0              0              1              0   
5            1              0            
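
To see what dummy encoding does on a smaller scale, here is a minimal sketch on a made-up two-column frame (`toy` is hypothetical). Each distinct value becomes its own `column_value` indicator column. Recent pandas versions return boolean columns by default, so `dtype=int` is passed here as an assumption to get 0/1 output like the one shown above.

```python
import pandas as pd

# Hypothetical frame with two categorical columns.
toy = pd.DataFrame({"cap-shape": ["x", "b", "x"], "odor": ["n", "a", "n"]})

# With no columns= argument, every object-typed column is encoded.
toy_encoded = pd.get_dummies(toy, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Because all of the mushroom features are categorical strings, calling `pd.get_dummies(X)` without listing the columns should encode the same set of columns as the explicit list.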

Split your data into `test` and `train` sets. Your `test` size should be 30% with `random_state` 7.

Please use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [7]:
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 7)
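
The effect of `test_size` and `random_state` can be checked on synthetic data. In this sketch the arrays `features` and `targets` are made up, and the point is only that the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each, and alternating 0/1 targets.
features = np.arange(40).reshape(20, 2)
targets = np.arange(20) % 2

# First split: 30% of the samples are held out for testing.
a_train, a_test, b_train, b_test = train_test_split(
    features, targets, test_size=0.3, random_state=7
)

# Second split with the same random_state, to compare against the first.
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    features, targets, test_size=0.3, random_state=7
)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, a_test2))
```

Changing `random_state` would generally shuffle the rows differently, which is why the assignment fixes it.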

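Create a decision tree (DT) classifier. The assignment does not require any parameters; the cell below nonetheless sets `max_depth=9` and `criterion="entropy"`: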

In [8]:
#from sklearn import tree
model = tree.DecisionTreeClassifier(max_depth=9, criterion="entropy")

Train the classifier on the `training` data and labels; then, score the classifier on the `testing` data and labels:

In [9]:
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

In [10]:
print("High-Dimensionality Score: ", round((score*100), 3))

High-Dimensionality Score:  100.0


Use the code on the course's SciKit-Learn page to output a .DOT file, then render the .DOT to .PNGs.

You will need graphviz installed to do this. On macOS, you can `brew install graphviz`. On Windows 10, graphviz installs via an .msi installer that you can download from the graphviz website. The graph editor gvedit.exe can also open the exported tree.dot file directly, without having to issue a rendering call. On other systems, use analogous commands.

If you encounter issues installing graphviz or don't have the rights to, you can always visualize your .dot file on the website: http://webgraphviz.com/.

In [11]:
def get_lineage(clf, feature_names):
    """Print the decision path from the root to every leaf of a fitted tree."""
    # The parameter is named clf so it does not shadow the imported sklearn `tree` module.
    left = clf.tree_.children_left
    right = clf.tree_.children_right
    threshold = clf.tree_.threshold
    features = [feature_names[i] for i in clf.tree_.feature]

    # Leaf nodes are the ones with no left child (marked with -1).
    idx = np.argwhere(left == -1)[:, 0]

    def recurse(left, right, child, lineage=None):
        if lineage is None:
            lineage = [child]
        # Find the parent of this node and which branch leads to it.
        if child in left:
            parent = np.where(left == child)[0].item()
            split = 'l'
        else:
            parent = np.where(right == child)[0].item()
            split = 'r'

        lineage.append((parent, split, threshold[parent], features[parent]))

        # Stop at the root (node 0); otherwise keep climbing.
        if parent == 0:
            lineage.reverse()
            return lineage
        else:
            return recurse(left, right, parent, lineage)

    for child in idx:
        for node in recurse(left, right, child):
            print(node)

In [12]:
get_lineage(model, X.columns)

(0, 'l', 0.5, 'spore-print-color_h')
(1, 'l', 0.5, 'gill-size_n')
(2, 'l', 0.5, 'ring-number_o')
(3, 'l', 0.5, 'habitat_p')
(4, 'l', 0.5, 'stalk-color-below-ring_n')
5
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'l', 0.5, 'gill-size_n')
(2, 'l', 0.5, 'ring-number_o')
(3, 'l', 0.5, 'habitat_p')
(4, 'r', 0.5, 'stalk-color-below-ring_n')
6
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'l', 0.5, 'gill-size_n')
(2, 'l', 0.5, 'ring-number_o')
(3, 'r', 0.5, 'habitat_p')
7
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'l', 0.5, 'gill-size_n')
(2, 'r', 0.5, 'ring-number_o')
8
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'r', 0.5, 'gill-size_n')
(9, 'l', 0.5, 'odor_n')
(10, 'l', 0.5, 'stalk-shape_t')
11
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'r', 0.5, 'gill-size_n')
(9, 'l', 0.5, 'odor_n')
(10, 'r', 0.5, 'stalk-shape_t')
12
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'r', 0.5, 'gill-size_n')
(9, 'r', 0.5, 'odor_n')
(13, 'l', 0.5, 'population_c')
14
(0, 'l', 0.5, 'spore-print-color_h')
(1, 'r', 0.5, 'gill-size_n')
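
As an alternative way to read the same root-to-leaf rules, scikit-learn (version 0.21 and later) provides `tree.export_text`, which prints the tree as indented text. The sketch below uses a made-up four-row dataset (`toy_X`, `toy_y`) and hypothetical feature names, not the mushroom data.

```python
import numpy as np
from sklearn import tree

# Hypothetical data: the first feature alone determines the class.
toy_X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
toy_y = np.array([0, 0, 1, 1])

toy_model = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_model.fit(toy_X, toy_y)

# One line per split or leaf, indented by depth.
rules = tree.export_text(toy_model, feature_names=["f0", "f1"])
print(rules)
```

For the lab's model, the equivalent call would presumably be `tree.export_text(model, feature_names=list(X.columns))`, since the function expects one name per feature.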

In [14]:
tree.export_graphviz(model, out_file='tree.dot', feature_names=X.columns)

# To view the tree, paste the contents of tree.dot into http://webgraphviz.com/,
# or render it locally if graphviz is installed:
#from subprocess import call
#call(['dot', '-T', 'png','tree.dot', '-o', 'tree.png'])
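
The commented-out rendering step can be made a little more defensive. The sketch below is one possible approach, not the course's official code: `render_dot` is a hypothetical helper that first checks whether the Graphviz `dot` executable is on the PATH and only then tries to render, and the toy model and file names are made up for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

import numpy as np
from sklearn import tree


def render_dot(dot_path, png_path):
    """Render a .dot file to PNG; return False if `dot` is missing or fails."""
    if shutil.which("dot") is None:
        return False
    try:
        subprocess.run(
            ["dot", "-Tpng", str(dot_path), "-o", str(png_path)], check=True
        )
    except subprocess.CalledProcessError:
        return False
    return True


# Hypothetical two-sample model, just to have something to export.
toy_model = tree.DecisionTreeClassifier().fit(np.array([[0], [1]]), np.array([0, 1]))

out_dir = Path(tempfile.mkdtemp())
dot_file = out_dir / "toy_tree.dot"
tree.export_graphviz(toy_model, out_file=str(dot_file), feature_names=["f0"])

rendered = render_dot(dot_file, out_dir / "toy_tree.png")
print("PNG written" if rendered else "Graphviz not available; use a web viewer for the .dot file")
```

When `render_dot` returns `False`, the exported `.dot` text can still be pasted into the web viewer mentioned above.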
