# Project 4 
**Predictive Analysis using scikit-learn**

Your assignment is to: 

• Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with sci-kit learn.” 

• Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two (numeric categorical) predictor columns into a set of columns. For one approach, see the pandas get_dummies() method.


• Clearly state your conclusions along with any recommendations for further analysis.

In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

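from sklearn import metrics
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import sklearn.model_selection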



**First, we need to read the CSV file.**

In [2]:
mushroom_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data',
                            sep = ',',
                            header = None,
                            usecols=[0, 5, 22],
                            names = ['Mushrooms', 'Odor', 'Habitat']
                           )

mushroom_data.head(10)

Unnamed: 0,Mushrooms,Odor,Habitat
0,p,p,u
1,e,a,g
2,e,l,m
3,p,p,u
4,e,n,g
5,e,a,g
6,e,a,m
7,e,l,m
8,p,p,g
9,e,a,m
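
As a hedged sketch of how the loaded columns could be one-hot encoded directly, the snippet below rebuilds the ten rows printed above as an inline sample (so it runs without network access) and passes the `columns` argument to `pd.get_dummies`, which encodes the listed columns and keeps the rest in one step. The name `sample` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Inline copy of the ten rows printed above (illustrative stand-in for mushroom_data)
sample = pd.DataFrame({
    'Mushrooms': list('peepeeeepe'),
    'Odor':      list('palpnaalpa'),
    'Habitat':   list('ugmuggmmgm'),
})

# Encode both predictors at once; new columns are named <source column>_<code>
encoded = pd.get_dummies(sample, columns=['Odor', 'Habitat'], dtype=int)
print(encoded.columns.tolist())
```

Passing `dtype=int` keeps the indicators as 0/1 integers regardless of the pandas version's default.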


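Next we will convert the 'Odor' and 'Habitat' columns using the pandas get_dummies() method. To illustrate, we first build a small nine-row DataFrame with one row per odor code.

In [4]:
# Build a small illustrative DataFrame (nine hand-entered rows, not the full dataset)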
raw_data = {'Mushrooms': ['p', 'e', 'p', 'e', 'p', 'e', 'p', 'e', 'p'], 
            'Odors': ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's'], 
            'Habitats': ['g', 'l', 'm', 'p', 'u', 'w', 'd', 'g', 'l']
           }

df = pd.DataFrame(raw_data, columns = ['Mushrooms', 'Odors', 'Habitats'])
df.head()

Unnamed: 0,Mushrooms,Odors,Habitats
0,p,a,g
1,e,l,l
2,p,c,m
3,e,y,p
4,p,f,u


Next, we will convert Odors and Habitats into 0/1 indicator columns using the get_dummies() method.

In [5]:
o_dummy = pd.get_dummies(df['Odors'])
o_dummy

Unnamed: 0,a,c,f,l,m,n,p,s,y
0,1,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1
4,0,0,1,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0
6,0,0,0,0,0,1,0,0,0
7,0,0,0,0,0,0,1,0,0
8,0,0,0,0,0,0,0,1,0


In [6]:
h_dummy = pd.get_dummies(df['Habitats'])
h_dummy

Unnamed: 0,d,g,l,m,p,u,w
0,0,1,0,0,0,0,0
1,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0
3,0,0,0,0,1,0,0
4,0,0,0,0,0,1,0
5,0,0,0,0,0,0,1
6,1,0,0,0,0,0,0
7,0,1,0,0,0,0,0
8,0,0,1,0,0,0,0


Now we will join the indicator columns to the main DataFrame.

In [7]:
df_new = pd.concat([df, o_dummy, h_dummy], axis = 1)
df_new

Unnamed: 0,Mushrooms,Odors,Habitats,a,c,f,l,m,n,p,s,y,d,g,l.1,m.1,p.1,u,w
0,p,a,g,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,e,l,l,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0
2,p,c,m,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,e,y,p,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
4,p,f,u,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
5,e,m,w,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
6,p,n,d,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0
7,e,p,g,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
8,p,s,l,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
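
The joined table carries the labels `l`, `m` and `p` twice, once from the odor codes and once from the habitat codes (shown above as `l.1`, `m.1` and `p.1`). One way to avoid the clash, sketched here on the same toy values, is the `prefix` argument of `pd.get_dummies`. The names `plain`, `named` and `dup_labels` are illustrative.

```python
import pandas as pd

# Same toy values as the DataFrame built earlier in this notebook
df = pd.DataFrame({
    'Odors':    ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's'],
    'Habitats': ['g', 'l', 'm', 'p', 'u', 'w', 'd', 'g', 'l'],
})

# Without a prefix, some labels appear once per source column
plain = pd.concat([pd.get_dummies(df['Odors'], dtype=int),
                   pd.get_dummies(df['Habitats'], dtype=int)], axis=1)
dup_labels = sorted(plain.columns[plain.columns.duplicated()])

# With a prefix, every indicator column gets a unique, self-describing name
named = pd.concat([pd.get_dummies(df['Odors'], prefix='odor', dtype=int),
                   pd.get_dummies(df['Habitats'], prefix='habitat', dtype=int)], axis=1)

print(dup_labels)               # labels shared by odor and habitat codes
print(named.columns.is_unique)  # True when no label is repeated
```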


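Now that we have joined the DataFrame, we can determine which of the two predictor columns chosen most accurately predicts whether or not a mushroom is poisonous. As a first step, we take the odor indicators (all but the last column) as X and the habitat `g` indicator as y.

In [9]:
# X: odor indicator columns a..s (the last column, y, is dropped)
X = o_dummy.iloc[:, :-1].values
# y: the second habitat indicator column (habitat code g)
y = h_dummy.iloc[:, 1].values
X,y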

(array([[1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 1]], dtype=uint8),
 array([1, 0, 0, 0, 0, 0, 0, 1, 0], dtype=uint8))
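
The cell above takes the habitat `g` indicator as `y`. Since the stated goal is to predict whether a mushroom is poisonous, one alternative is to derive the target from the `Mushrooms` column instead. This is a sketch under that assumption, not the method used in this notebook; the name `y_class` is illustrative.

```python
import pandas as pd

# Same class labels as the toy DataFrame: p = poisonous, e = edible
mushrooms = pd.Series(['p', 'e', 'p', 'e', 'p', 'e', 'p', 'e', 'p'],
                      name='Mushrooms')

# 1 where the mushroom is poisonous, 0 where it is edible
y_class = (mushrooms == 'p').astype(int)
print(y_class.tolist())
```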

Next we set up the training and testing sets.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(6L, 8L)
(6L,)
(3L, 8L)
(3L,)
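
The 6/3 split above comes from the default test fraction of 0.25 applied to nine rows. The sketch below makes that fraction explicit and checks that a fixed `random_state` reproduces the same split. The arrays are stand-ins with the same shapes as `X` and `y`, not the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays: 9 rows, 8 indicator columns, and a 0/1 target
X = np.eye(9, dtype=int)[:, :-1]
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0])

# Two calls with the same random_state should return identical splits
first = train_test_split(X, y, test_size=0.25, random_state=1)
second = train_test_split(X, y, test_size=0.25, random_state=1)

shapes = [part.shape for part in first]  # X_train, X_test, y_train, y_test
reproducible = all(np.array_equal(a, b) for a, b in zip(first, second))
print(shapes, reproducible)
```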


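We continue by fitting a linear regression on the training data and predicting y for the test data, then use scikit-learn's metrics to compare true and predicted values.

In [11]:
linreg = sklearn.linear_model.LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
# Hard-coded example values: t holds "true" labels, p holds "predicted" labels.
# The two lists are identical, so every error metric below is 0.
t = [1, 0]
p = [1, 0]


print(sklearn.metrics.mean_absolute_error(t, p))          # mean absolute error
print(sklearn.metrics.mean_squared_error(t, p))           # mean squared error
print(np.sqrt(sklearn.metrics.mean_squared_error(t, p)))  # root mean squared error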

0.0
0.0
0.0
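
The zeros printed above follow from `t` and `p` being identical hard-coded lists. To show what the three metrics measure when predictions do differ from the truth, here is a minimal hand computation in plain Python. The values of `t` and `p` are hypothetical and chosen only for illustration.

```python
import math

t = [1, 0, 1, 1]  # hypothetical true labels
p = [1, 1, 0, 1]  # hypothetical predictions (two of four are wrong)

errors = [ti - pi for ti, pi in zip(t, p)]

mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
mse = sum(e * e for e in errors) / len(errors)   # mean squared error
rmse = math.sqrt(mse)                            # root mean squared error

print(mae, mse, rmse)
```

For 0/1 labels, MAE and MSE both equal the fraction of wrong predictions, and RMSE is its square root.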


Finally, we calculate the root mean squared error (RMSE) of the model's predictions on the test set.

In [13]:
X = o_dummy.iloc[:, 2:10].values
y = h_dummy.iloc[:, 1].values
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.0


Next, we remove odor to determine whether or not it can predict edibility.

In [14]:
# NOTE: X is still taken from the odor indicator columns, as in the previous cell
X = o_dummy.iloc[:, 2:10].values
y = h_dummy.iloc[:, 1].values

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
# Use the lowercase name y_pred so the metric below scores this cell's predictions
y_pred = linreg.predict(X_test)

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.0


# Conclusion
**Based on the root mean squared error of 1.0, habitat and odor do not appear to depend on each other, so this analysis does not show that either one determines whether a mushroom is in fact poisonous.**
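
As one possible direction for further analysis, the sketch below fits a classifier on each predictor separately, with the poisonous/edible class as the target. It is an assumption about how the comparison could be set up, not the method used above: it swaps in `DecisionTreeClassifier` for the linear regression, uses only the ten rows shown at the top of the notebook, and reports training accuracy. The names `sample` and `scores` are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Inline copy of the first ten rows of the loaded data (illustrative sample)
sample = pd.DataFrame({
    'Mushrooms': list('peepeeeepe'),
    'Odor':      list('palpnaalpa'),
    'Habitat':   list('ugmuggmmgm'),
})

# Target: 1 = poisonous, 0 = edible
y = (sample['Mushrooms'] == 'p').astype(int)

scores = {}
for column in ['Odor', 'Habitat']:
    # One-hot encode a single predictor and fit a classifier on it alone
    X = pd.get_dummies(sample[column], prefix=column, dtype=int)
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    scores[column] = clf.score(X, y)  # accuracy on the training rows

print(scores)
```

On these ten rows the odor tree fits every row while the habitat tree misses one (the poisonous row with habitat `g`). Ten rows scored on their own training data say little, so a held-out split or cross-validation on the full dataset would be the meaningful comparison.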