I start by importing all the necessary modules to train the test set, create a linear regression model, and run the metrics to determine the margin of error.

In [1]:
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

I create the dataset from web source

In [2]:
page = 'https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
df1 = pd.read_csv(page, index_col=False, header=None, names=['Poisonous?', 'Cap Color', 'Odor'], usecols=[0,3,5])

Convert the Response column to integers

In [3]:
# poisonous = 1, edible = 0
df1.replace(to_replace={'Poisonous?':{'p':1, 'e': 0}}, inplace=True)

I use pandas.get_dummies to "Convert categorical variable into dummy/indicator variables" for the 'Cap Color' and 'Odor' Feature columns

In [4]:
C = pd.Series(df1['Cap Color'])
f = pd.get_dummies(C)

O = pd.Series(df1['Odor'])
g = pd.get_dummies(O)

create a new df and merge the dummy-feature columns and response column

In [5]:
new_df = pd.concat([f, g, df1['Poisonous?']], axis=1)
cols = list(new_df.iloc[:, :-1])

I set the X and y variables appropriately

In [6]:
X = new_df.iloc[:, :-1].values
y = new_df.iloc[:, 1].values

setup the training and testing models

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(6093, 19)
(6093,)
(2031, 19)
(2031,)


"fit the model to the training data (learn the coefficients)"

In [8]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

"print the intercept and coefficients"

In [9]:
print(linreg.intercept_)
print(linreg.coef_)

0.0794633138661
[-0.03924856  0.96075144 -0.03924856 -0.03924856 -0.03924856 -0.03924856
 -0.03924856 -0.03924856 -0.03924856 -0.03924856 -0.04021475 -0.04021475
 -0.04021475 -0.04021475 -0.04021475 -0.04021475 -0.04021475 -0.04021475
 -0.04021475]


"make predictions on the testing set"

In [10]:
y_pred = linreg.predict(X_test)

"define true and predicted response values" and print

In [11]:
true = [1, 0]
pred = [1, 0]

print(metrics.mean_absolute_error(true, pred))
print(metrics.mean_squared_error(true, pred))
print(np.sqrt(metrics.mean_squared_error(true, pred)))

0.0
0.0
0.0


Compute the margin of error with the whole dataset. Root Mean Squared Error (RMSE) = 2.4

In [12]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

2.40308753969e-15


I then test removing the 'Cap Color' data. We get a RMSE of .058. Much better!

In [13]:
# use the list to select a subset of the original DataFrame
X = new_df.iloc[:, 11:-1].values

# select a Series from the DataFrame
y = new_df.iloc[:, 1].values

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.0588025088502


Now I change it to test the model without 'Odor'. We get a RMSE of 3.2

In [14]:
# use the list to select a subset of the original DataFrame
X = new_df.iloc[:, 1:10].values

# select a Series from the DataFrame
y = new_df.iloc[:, 1].values

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

3.32421535544e-15


#### Conclusion is that the 'Cap Color' is least relevant, and 'Odor' is the most accurate in determining if the mushroom is poisonous or edible. The dataset should be 'Odor' as the feature, 'Poisonous?' as the Response.

For further anlysis I would test the other feature columns available in the original dataset to see if the RMSE goes lower with any combination of them.