### Assignment 2: Spring Quarter Data Science, Low-cost Sensor - Decision Trees
Ilana Zimmerman 4/12/18

#### Assignment Description
Your task for this assignment:  Design a simple, low-cost sensor that can distinguish between red wine and white wine.

Your sensor must correctly distinguish between red and white wine for at least 95% of the samples in a set of 6497 test samples of red and white wine.
Your technology is capable of sensing the following wine attributes:

    Fixed acidity        Free sulphur dioxide
    Volatile acidity     Total sulphur dioxide
    Citric acid          Sulphates
    Residual sugar       pH
    Chlorides            Alcohol
    Density

Decision Trees explained: http://www.saedsayad.com/decision_tree.htm
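The decision trees fit later in this notebook use `criterion = 'entropy'`, so each split is chosen to maximize information gain. As a rough illustration, here is a toy sketch with made-up labels (not the wine data); the helper names `entropy` and `information_gain` are hypothetical and are not part of scikit-learn:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Made-up labels for illustration: 1 = one wine class, 0 = the other
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]            # an imperfect split
pure_left, pure_right = [1, 1, 1, 1], [0, 0, 0, 0]  # a perfect split

gain_imperfect = information_gain(parent, left, right)
gain_perfect = information_gain(parent, pure_left, pure_right)
print(entropy(parent), gain_imperfect, gain_perfect)
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves both children mixed recovers only part of it.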

#### In this assignment we were asked to build an experiment using Decision Trees and answer:

   1. What is the percentage of correct classification results (using all attributes)?
   2. What is the percentage of correct classification results (using a subset of the attributes)?
   3. What is the AUC of your model?
   4. Visualize your decision tree
   5. What is the best AUC that you can achieve?
   6. What is the minimum number of attributes? Why?

### What Data Are We Working With?

In [1]:
import numpy as np
import pandas as pd
df_wine = pd.read_csv('RedWhiteWine.csv', delimiter = ',')

print(df_wine.shape)

#print first 5 rows of the dataset
df_wine.head(5)

(6497, 13)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,Class
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1


In [2]:
df_wine=df_wine.drop('quality', axis = 1)

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

### Split the Data into Test/Train Sets

In [4]:
X = df_wine.iloc[:,0:11] #X includes all rows and columns EXCEPT the 'Class' column which we are predicting
y = df_wine['Class']
#Split data into test and train data, 20:80

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)  

(5197, 11) (1300, 11) (5197,) (1300,)


### Generate the Model : Decision Tree Classifier

In [5]:
tree = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 2, random_state=5) #entropy: split on information gain; random_state is fixed so the fitted tree is reproducible
tree.fit(X_train, y_train)
y_predict = tree.predict(X_test)


In [6]:
y_predict

array([0, 0, 1, ..., 0, 1, 0])

### Evaluate the Model

##### Question 1: Model Accuracy Using ALL Features

In [7]:
def accuracy_score(y_test, y_predict):
    test_acc = np.mean([a==p for a, p in zip(y_test, y_predict)])
    return test_acc

In [8]:
# Determine accuracy of these predictions
from sklearn import metrics
accuracy_score(y_test, y_predict)

0.9830769230769231

In [9]:
print (metrics.confusion_matrix( y_test, y_predict, labels=None))

[[976  14]
 [  8 302]]
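The confusion matrix above can be read directly: in scikit-learn's convention, rows are true classes and columns are predicted classes, so the diagonal counts the correct predictions. A small sketch using the numbers printed above (the variable names are illustrative):

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = [[976, 14],
      [8, 302]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]
accuracy = correct / total

# Recall per class: fraction of each true class that was predicted correctly
recall_class0 = cm[0][0] / sum(cm[0])
recall_class1 = cm[1][1] / sum(cm[1])

print(total, accuracy, recall_class0, recall_class1)
```

The accuracy computed this way matches the value returned by `accuracy_score` for the same predictions.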


##### Question 2: Model Accuracy Using a Subset of Features

In [10]:
import random
from itertools import combinations

In [11]:
def Feature_Selection(Columns, i):
    '''Return every combination of exactly i column names from Columns
    (which must not include 'Class') as a 2-D array, one combination per row'''
    return np.asarray(list(combinations(Columns, i)))
    

In [12]:
#define columns from dataframe - list of all headers; drop target column
columns = list(df_wine)
columns.remove('Class')
columns = np.asarray(columns)


In [13]:
#Check that Feature_Selection returns usable column subsets: show the third 3-feature combination
df_wine[Feature_Selection(columns, 3)[2]]

Unnamed: 0,fixed acidity,volatile acidity,chlorides
0,7.4,0.700,0.076
1,7.8,0.880,0.098
2,7.8,0.760,0.092
3,11.2,0.280,0.075
4,7.4,0.700,0.076
5,7.4,0.660,0.075
6,7.9,0.600,0.069
7,7.3,0.650,0.065
8,7.8,0.580,0.073
9,7.5,0.500,0.071


In [14]:
#X_train[Feature_Selection(columns, 3)[0]].shape
tree.fit(X_train[Feature_Selection(columns, 3)[0]], y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=5,
            splitter='best')

In [15]:
def DecisionTree_accuracy(X):
    '''Input a list of features; output the test accuracy of the
    Decision Tree Classifier fit on those features, and the feature list'''
    #y = df_wine['Class']
    #Split data into test and train data, 20:80
    #X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)
    tree.fit(X_train[X], y_train) 
    y_predictfun = tree.predict(X_test[X])
    test_acc = accuracy_score(y_test, y_predictfun)
    
    return test_acc, list(X)




In [16]:
# selection of Features and their accuracy given number of features chosen
DecisionTree_accuracy(Feature_Selection(columns,3)[0])

(0.92, ['fixed acidity', 'volatile acidity', 'citric acid'])

##### Question 3/5: What is the AUC of our classifier (using all attributes, also the BEST AUC attainable)

In [17]:
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import accuracy_score

In [18]:
from sklearn.svm import LinearSVC

svm = LinearSVC()
X = df_wine.iloc[:,0:11] #X includes all rows and columns EXCEPT the 'Class' column which we are predicting
y = df_wine['Class']

X_train, X_test,y_train, y_test = train_test_split(X,y,test_size= 0.2)

#SVM Model
svm_model = svm.fit(X_train, y_train)
y_test_predictions = svm.predict(X_test)
y_score = svm_model.decision_function(X_test)
fpr_svm, tpr_svm, thresholds = roc_curve(y_test, y_score)
roc_auc_svm = auc(fpr_svm, tpr_svm)

#Decision Tree
tree = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 2, random_state=5) #same settings as the first model
tree_model = tree.fit(X_train, y_train)
y_proba = tree_model.predict_proba(X_test) #column 1 holds the predicted probability of Class 1
tree_acc = tree_model.score(X_test, y_test) #mean accuracy on the test set (not used for the ROC curve)

fpr_tree, tpr_tree, _ = roc_curve(y_test, y_proba[:,1])
roc_auc = auc(fpr_tree, tpr_tree)


import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
plt.plot(fpr_tree, tpr_tree, color='darkorange',
         lw=2, label='Decision Tree ROC curve (area = {0:.2f})'.format(roc_auc))
plt.plot(fpr_svm, tpr_svm, color='darkred',
         lw=2, label='SVM ROC curve (area = {0:.2f})'.format(roc_auc_svm))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC for Class Zero and One (Red and White)')
plt.legend(loc="lower right")
plt.show()
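For reference, `auc` integrates the ROC curve with the trapezoidal rule. Below is a minimal sketch of that calculation on made-up ROC points (these are not the wine results, and `trapezoid_auc` is a hypothetical helper written for this illustration):

```python
def trapezoid_auc(fpr, tpr):
    """Area under the curve through points (fpr[i], tpr[i]); fpr must be sorted ascending."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area

# Made-up ROC points, for illustration only
fpr = [0.0, 0.1, 0.5, 1.0]
tpr = [0.0, 0.8, 0.9, 1.0]

example_auc = trapezoid_auc(fpr, tpr)
chance_auc = trapezoid_auc([0.0, 1.0], [0.0, 1.0])  # the dashed diagonal in the plot
print(example_auc, chance_auc)
```

The diagonal corresponds to an area of 0.5, which is why a useful classifier should score well above that.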


##### Question 4: Visualize Decision Tree

In [None]:
#http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

In [None]:
import graphviz 
from sklearn import tree as Tree

In [None]:
#define columns from dataframe - list of all headers; drop target column
columns = list(df_wine)
columns.remove('Class')
columns = np.asarray(columns)


In [None]:
#http://scikit-learn.org/stable/modules/tree.html#tree
dot_data = Tree.export_graphviz(tree, out_file=None, 
                         feature_names=columns,  
                         class_names=['0', '1'], #one name per value of 'Class', in ascending order
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

In [None]:
dot_data = Tree.export_graphviz(tree, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("Wine") 

##### Question 6: Minimum Number of Attributes Needed to Obtain 95% Prediction Accuracy

In [None]:
#Build the candidate feature sets for the simulations
r = 11 #choose up to 11 features for the combinations
combo_matrix = []

for i in range(r):
    a = 1+i #number of features per combination: 1 through 11
    combo_matrix.append(Feature_Selection(columns, a))
len(combo_matrix)

In [None]:
combo_matrix # list of combinations choosing 1-11 features; combo_matrix[k] holds the (k+1)-feature combinations

In [None]:
combo = {}
for j in range(11):
    combo[j] = combo_matrix[j]
combo[9][0]

In [None]:
DecisionTree_accuracy(combo[9][0])

In [None]:
Acc = []
feat =[]
for idx1 in range(11):
    for idx2 in range(len(combo[idx1])):
        acc, features = DecisionTree_accuracy(combo[idx1][idx2])
        if acc >= 0.95:
            Acc.append(acc)
            feat.append(features)
    if len(feat) > 0: #stop at the smallest subset size with at least one combination at 95% or better
        break

In [None]:
Results_df = pd.DataFrame(
    {'Accuracy': Acc,
    'Features':feat})
Results_df.head(5)

In [None]:
NumFeat = []
for i in Results_df['Features']:
    NumFeat.append(np.asarray(i).shape[0])
Results_df['Number of Features']= NumFeat


In [None]:
Results_df.head(5)

In [None]:
Results_df[Results_df['Number of Features'] == 2].sort_values('Accuracy', ascending=False).head(1) #best two-feature combination

The minimum number of attributes needed to obtain 95% prediction accuracy (or greater) is two. I wrote a function (Feature_Selection) that lists the feature combinations for a chosen number of attributes, and called it for one or more attributes. For each combination, I restricted X_train and X_test to the features in question, fit a new decision tree model, and tested its accuracy (DecisionTree_accuracy). Each result with 95% or greater accuracy was appended to a list, and from those lists I created a new data frame (Results_df).
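To give a sense of the size of this exhaustive search, the number of candidate feature subsets can be counted directly. This is a sketch using only the standard library; it assumes the 11 sensed attributes listed at the top of the notebook, and `n_features`, `subsets_per_size`, and `pairs` are illustrative names:

```python
from itertools import combinations
from math import comb

n_features = 11  # sensed attributes (all columns except 'quality' and 'Class')

# Number of candidate subsets for each subset size k = 1..11
subsets_per_size = {k: comb(n_features, k) for k in range(1, n_features + 1)}
total_subsets = sum(subsets_per_size.values())

# Cross-check one size by enumerating it, as itertools.combinations does
pairs = list(combinations(range(n_features), 2))

print(subsets_per_size[1], subsets_per_size[2], len(pairs), total_subsets)
```

Stopping at the first subset size that reaches 95% keeps the search small: if it ends after the two-feature combinations, only 11 + 55 = 66 trees are fit rather than one for each of the 2047 possible subsets.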