# Project 4 – Predictive Analysis using scikit-learn

• Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with sci-kit learn.”
• Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two categorical predictor columns into a set of numeric indicator columns. One approach is the pandas get_dummies() function.
• Clearly state your conclusions along with any recommendations for further analysis.

## Import libraries

In [1]:
import pandas as pd
import numpy as np

# scikit-learn imports 
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

## Build our DataFrame

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'

# read data to DataFrame using meaningful column names
df = pd.read_csv(url, usecols=[0,5,22], header=None, names=['Edible', 'Odor','Habitat'])

# copy dataframe to preserve original
df_mushroom = df.copy()

## Display DataFrame

In [3]:
df_mushroom.head()

| | Edible | Odor | Habitat |
|---|---|---|---|
| 0 | p | p | u |
| 1 | e | a | g |
| 2 | e | l | m |
| 3 | p | p | u |
| 4 | e | n | g |


## Converting Categorical Columns to Dummy (Indicator) Columns

In [4]:
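# one-hot encode all three columns; drop_first=True drops one level per column as the baseline
# dtype=int keeps the 0/1 output shown below (pandas 2.0+ returns booleans by default)
df_expanded = pd.get_dummies(df_mushroom, columns=['Edible', 'Odor', 'Habitat'], drop_first=True, dtype=int)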

# view first 5 rows
df_expanded.head()

Unnamed: 0,Edible_p,Odor_c,Odor_f,Odor_l,Odor_m,Odor_n,Odor_p,Odor_s,Odor_y,Habitat_g,Habitat_l,Habitat_m,Habitat_p,Habitat_u,Habitat_w
0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0
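To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy column. The frame `toy` and its values are made up for illustration and are not the mushroom data. It shows that the first category loses its own column and becomes the baseline, which is why the table above has no `Odor_a` or `Habitat_d` column.

```python
import pandas as pd

# Toy frame (hypothetical values, not the mushroom data)
toy = pd.DataFrame({"Odor": ["a", "n", "p", "n"]})

# One indicator column per category
full = pd.get_dummies(toy, columns=["Odor"], dtype=int)

# drop_first=True drops the first category ("a"), which becomes the baseline:
# a row whose remaining indicators are all 0 means Odor == "a"
reduced = pd.get_dummies(toy, columns=["Odor"], drop_first=True, dtype=int)

print(list(full.columns))     # ['Odor_a', 'Odor_n', 'Odor_p']
print(list(reduced.columns))  # ['Odor_n', 'Odor_p']
```

Dropping one level per column avoids perfectly redundant columns without losing information, since the dropped level can always be recovered from the others.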


## Predictive Classifiers Training

In [5]:
# set X values: every column except the target (Edible_p is the first column)
X = df_expanded.iloc[:, 1:]

# set y values: 1 = poisonous, 0 = edible
y = df_expanded.Edible_p

# split X and y into training and testing sets using random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train the logistic regression model using training set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression()

## Predicting Y Values from X_test

In [6]:
# store predictions for the testing set
y_pred = logreg.predict(X_test)

In [7]:
# calculate accuracy of predictions
accuracy = metrics.accuracy_score(y_test, y_pred)

# display accuracy in a more readable format
print('Accuracy Percentage: {}'.format(np.format_float_positional(accuracy*100, precision=2)))

Accuracy Percentage: 98.23
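An accuracy figure is easier to judge next to a baseline. The sketch below uses scikit-learn's `DummyClassifier` with the `most_frequent` strategy, which always predicts the majority class. The arrays `y_demo` and `X_demo` are hypothetical stand-ins, not results from the mushroom data.

```python
import numpy as np
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 edible (0) and 4 poisonous (1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = metrics.accuracy_score(y_demo, baseline.predict(X_demo))

print(baseline_accuracy)  # 0.6
```

In this notebook the same idea would mean fitting the baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test`, then comparing that number with the logistic regression accuracy.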


In [8]:
# print the first 30 true and predicted responses
print('True:', y_test.values[0:30])
print('Pred:', y_pred[0:30])

True: [0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0]
Pred: [0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0]
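Comparing the first 30 rows by eye does not show which kind of error the model makes. A confusion matrix does, and for mushrooms the costly error is a poisonous mushroom predicted as edible (a false negative). The following is a minimal sketch with hypothetical labels (`y_true_demo`, `y_pred_demo`), not the notebook's actual predictions.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for illustration only (1 = poisonous, 0 = edible)
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Recall for the poisonous class: share of poisonous mushrooms that were caught
recall_poisonous = metrics.recall_score(y_true_demo, y_pred_demo, pos_label=1)

print(cm)
print("False negatives (poisonous predicted edible):", fn)
print("Recall for poisonous class:", recall_poisonous)
```

With the notebook's own variables, the call would be `metrics.confusion_matrix(y_test, y_pred)`.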


## Conclusion

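Using "Odor" and "Habitat" together as predictors of "Edible" gave us a test-set accuracy of 98.23% with logistic regression, and the first 30 test predictions all match the true labels. This suggests that the two columns carry most of the information needed to tell poisonous from edible mushrooms in this data set. Odor seems to play the major role in predicting edibility, but we only fit the combined model here, so that impression is not yet tested directly. For further analysis, we recommend fitting one model per predictor (odor only, habitat only) and comparing their accuracies, and checking how many poisonous mushrooms are misclassified as edible, since that is the costlier error.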
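The assignment asks which single predictor is more accurate, which calls for one model per predictor set. The sketch below shows one possible way to do that. The helper `accuracy_for` is a hypothetical name, the `stratify=y` argument is an addition not used in the cells above, and the toy frame is invented so that the example runs without downloading the data.

```python
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_for(df, predictors, target="Edible", positive="p", random_state=1):
    """Fit logistic regression on dummy-encoded predictors; return test accuracy."""
    X = pd.get_dummies(df[predictors], columns=predictors, drop_first=True, dtype=int)
    y = (df[target] == positive).astype(int)
    # stratify=y keeps the class balance the same in both splits (an assumption here)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    model = LogisticRegression().fit(X_train, y_train)
    return metrics.accuracy_score(y_test, model.predict(X_test))


# Toy data (hypothetical): odor determines the label, habitat is unrelated to it
toy = pd.DataFrame({
    "Edible":  ["p", "e"] * 20,
    "Odor":    ["f", "n"] * 20,
    "Habitat": ["g", "g", "u", "u"] * 10,
})

# Compare each predictor alone with the combined model
for cols in (["Odor"], ["Habitat"], ["Odor", "Habitat"]):
    print(cols, accuracy_for(toy, cols))
```

On the real data the call would be `accuracy_for(df_mushroom, ['Odor'])` and `accuracy_for(df_mushroom, ['Habitat'])`. The toy numbers say nothing about the mushroom data itself.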