# Project 4 - Predictive Analysis Using scikit-learn

__This assignment uses data from the mushroom dataset found at https://archive.ics.uci.edu/ml/datasets/Mushroom and scikit-learn to determine which of the two predictor columns selected in Assignment 13 most accurately predicts whether or not a mushroom is poisonous.__

__The predictive analysis techniques are adapted from [Kevin Markham's video tutorial series "Introduction to machine learning with scikit-learn."](https://github.com/justmarkham/scikit-learn-videos)__

__As in Assignment 13, first we look at the data dictionary to see what we're dealing with. We find:__

```
Attribute Information: (classes: edible=e, poisonous=p)
     1. cap-shape:                bell=b,conical=c,convex=x,flat=f,
                                  knobbed=k,sunken=s
     2. cap-surface:              fibrous=f,grooves=g,scaly=y,smooth=s
     3. cap-color:                brown=n,buff=b,cinnamon=c,gray=g,green=r,
                                  pink=p,purple=u,red=e,white=w,yellow=y
     4. bruises?:                 bruises=t,no=f
     5. odor:                     almond=a,anise=l,creosote=c,fishy=y,foul=f,
                                  musty=m,none=n,pungent=p,spicy=s
     6. gill-attachment:          attached=a,descending=d,free=f,notched=n
     7. gill-spacing:             close=c,crowded=w,distant=d
     8. gill-size:                broad=b,narrow=n
     9. gill-color:               black=k,brown=n,buff=b,chocolate=h,gray=g,
                                  green=r,orange=o,pink=p,purple=u,red=e,
                                  white=w,yellow=y
    10. stalk-shape:              enlarging=e,tapering=t
    11. stalk-root:               bulbous=b,club=c,cup=u,equal=e,
                                  rhizomorphs=z,rooted=r,missing=?
    12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
    13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
    14. stalk-color-above-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    15. stalk-color-below-ring:   brown=n,buff=b,cinnamon=c,gray=g,orange=o,
                                  pink=p,red=e,white=w,yellow=y
    16. veil-type:                partial=p,universal=u
    17. veil-color:               brown=n,orange=o,white=w,yellow=y
    18. ring-number:              none=n,one=o,two=t
    19. ring-type:                cobwebby=c,evanescent=e,flaring=f,large=l,
                                  none=n,pendant=p,sheathing=s,zone=z
    20. spore-print-color:        black=k,brown=n,buff=b,chocolate=h,green=r,
                                  orange=o,purple=u,white=w,yellow=y
    21. population:               abundant=a,clustered=c,numerous=n,
                                  scattered=s,several=v,solitary=y
    22. habitat:                  grasses=g,leaves=l,meadows=m,paths=p,
                                  urban=u,waste=w,woods=d

Missing Attribute Values: 2480 of them (denoted by "?"), all for attribute #11.
```

__We create our pandas dataframe with a subset of columns: edible/poisonous, odor, and one additional column of our choosing (in this case, ring-number).__

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
mushrooms = pd.read_csv(url, usecols = [0,5,18], names = ['Class', 'Odor', 'RingNumber'])
mushrooms.head()

Unnamed: 0,Class,Odor,RingNumber
0,p,p,o
1,e,a,o
2,e,l,o
3,p,p,o
4,e,n,o


__Next, we replace the codes in the columns with numeric values.__

In [2]:
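# assign each mapped column back to the dataframe; calling replace(..., inplace=True)
# on a selected column may not update the dataframe in newer pandas versions
mushrooms['Class'] = mushrooms['Class'].map({'p': 1, 'e': 0})
mushrooms['Odor'] = mushrooms['Odor'].map(
    {'a':0,'l':1,'c':2,'y':3,'f':4,'m':5,'n':6,'p':7,'s':8})
mushrooms['RingNumber'] = mushrooms['RingNumber'].map({'n':0, 'o':1, 't': 2})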
mushrooms.head()

Unnamed: 0,Class,Odor,RingNumber
0,1,7,1
1,0,0,1
2,0,1,1
3,1,7,1
4,0,6,1


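__Next, we use the pandas `get_dummies()` function to convert the Odor column into a set of indicator columns, one per odor code.  This matters because scikit-learn expects the feature data as a two-dimensional matrix (dataframe), and because the odor codes are category labels whose numeric order carries no meaning.__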

In [3]:
odor = pd.get_dummies(mushrooms['Odor'])
odor.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,1,0,0


__We see that the data has been converted from a list of codes (e.g. 7, 0, 1, 7, 6) to a matrix in which each of those values is represented by a 1 in the column for that code.__
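
__To make the transformation concrete, here is a minimal pure-Python sketch of one-hot encoding applied to the first five Odor codes shown above. The helper `one_hot` is a hypothetical name used only for illustration; it is not part of pandas, and `get_dummies()` does considerably more than this.__

```python
# Minimal pure-Python sketch of one-hot encoding (illustration only;
# `one_hot` is a hypothetical helper, not a pandas function).
def one_hot(codes, n_categories):
    """Return one row per code, with a 1 in the column matching the code."""
    return [[1 if col == code else 0 for col in range(n_categories)]
            for code in codes]

first_five_odors = [7, 0, 1, 7, 6]   # Odor codes from the first five rows
encoded = one_hot(first_five_odors, n_categories=9)

for row in encoded:
    print(row)
```

__Each printed row contains exactly one 1, in the position of the original code, which should line up with the rows of the `odor` dataframe above.__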

__Since this is a classification problem (i.e. deciding whether or not a mushroom is poisonous), we use Markham's techniques found in ["Evaluating a classification model"](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb).__

In [4]:
# define X and y
X = odor
y = mushrooms.Class

In [5]:
# split X and y into training and testing sets

# use sklearn.model_selection instead of Markham's sklearn.cross_validation to avoid
# deprecation warning
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [6]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

__Classification accuracy:__ proportion of correct predictions (a value between 0 and 1)
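
__As a rough illustration of the definition, the sketch below computes accuracy by hand on a handful of toy labels. The values in `y_true` and `y_guess` are invented for this example (1 = poisonous, 0 = edible) and do not come from the mushroom data.__

```python
# Toy labels, invented for illustration (1 = poisonous, 0 = edible).
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_guess = [1, 0, 1, 1, 0, 0, 0, 0]

# Accuracy = number of matching predictions / total number of predictions.
n_correct = sum(t == p for t, p in zip(y_true, y_guess))
accuracy = n_correct / len(y_true)
print(accuracy)  # 6 of 8 correct
```

__This is the same quantity that `metrics.accuracy_score` reports by default.__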

In [8]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.98670605613


__We see that the accuracy score for Odor is 0.98670605613.  Let's see what happens when we use Ring Number.__

In [9]:
# convert Ring Number to matrix
rings = pd.get_dummies(mushrooms['RingNumber'])
rings.head()

Unnamed: 0,0,1,2
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0


In [10]:
# define X and y
X = rings
y = mushrooms.Class

In [11]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [12]:
# train a logistic regression model on the training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

In [14]:
# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.531265386509


__The accuracy score for Ring Number is 0.531265386509, far lower than the accuracy score for Odor (0.98670605613), so Ring Number is a much poorer predictor of whether or not a mushroom is poisonous.  This is not a surprising result, since Assignment 13 showed that the vast majority of mushrooms in the dataset have one ring.__
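
__One way to put a score like 0.53 in context is to compare it with the null accuracy, i.e. the accuracy obtained by always predicting the most frequent class. The sketch below is one possible way to compute it; `null_accuracy` is a hypothetical helper name and the labels are invented for illustration.__

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels, invented for illustration; in the notebook one would pass y_test.
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
print(null_accuracy(toy_labels))  # 6 of 10 labels are 0
```

__A model whose accuracy is close to the null accuracy of its test set has learned little beyond the class frequencies.__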

__It would be interesting to run the same analysis on each feature in the mushroom dataset, along with various combinations of features, to find out if anything beats Odor alone as a predictor.__
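
__A per-feature comparison could be sketched roughly as follows, repeating the steps used above (one-hot encoding, train/test split, logistic regression, accuracy) for each column in turn. The function `score_features` and the small `toy` dataframe are assumptions made for this illustration; the toy data is synthetic, not the mushroom data, and this sketch has not been run against the full dataset.__

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

def score_features(df, target, features):
    """Fit one logistic regression per feature and return test accuracies."""
    scores = {}
    for feature in features:
        X = pd.get_dummies(df[feature])      # one-hot encode a single column
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, random_state=0)
        model = LogisticRegression().fit(X_train, y_train)
        scores[feature] = metrics.accuracy_score(y_test, model.predict(X_test))
    return scores

# Tiny synthetic frame, invented for illustration (not the mushroom data):
# 'A' determines Class exactly, 'B' is unrelated to it.
toy = pd.DataFrame({
    'Class': [1, 0] * 20,
    'A': ['x', 'y'] * 20,
    'B': ['u', 'u', 'v', 'v'] * 10,
})
scores = score_features(toy, 'Class', ['A', 'B'])

# List features from best to worst test accuracy.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

__For the mushroom data, one would read all 23 columns, pass the class column as `target`, and pass the remaining column names as `features`.  Combinations of features would need the one-hot columns of several features concatenated before fitting.__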