# Project – Predictive Analysis using scikit-learn
## Your assignment is to:

### - Start with the mushroom data in the pandas DataFrame that you constructed in your “Assignment – Preprocessing Data with scikit-learn.”
### - Use scikit-learn to determine which of the two predictor columns that you selected (odor and one other column of your choice) most accurately predicts whether or not a mushroom is poisonous. There is an additional challenge here: to use scikit-learn’s predictive classifiers, you’ll want to convert each of your two (numeric categorical) predictor columns into a set of columns. For one approach, see the pandas get_dummies() method.
### - Clearly state your conclusions along with any recommendations for further analysis.

In [75]:
# Import the required libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

### Load the mushroom dataset used in 'Assignment – Preprocessing Data with scikit-learn.'

In [76]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', 
                            sep=',', 
                            header=None, 
                            usecols=[0,5,21], 
                            names=["Edible/Poisonous","Odor","Population"])
df.head(20)

Unnamed: 0,Edible/Poisonous,Odor,Population
0,p,p,s
1,e,a,n
2,e,l,n
3,p,p,s
4,e,n,a
5,e,a,n
6,e,a,n
7,e,l,s
8,p,p,v
9,e,a,s


### Replace letter values from data with numeric values:

In [77]:
# Assign each result back to its column so that df itself is updated
df['Edible/Poisonous'] = df['Edible/Poisonous'].replace({'e':0, 'p': 1})
df['Odor'] = df['Odor'].replace({'a':0, 'l':1, 'c':2, 'y':3, 'f':4, 'm':5, 'n':6, 'p':7, 's':8})
df['Population'] = df['Population'].replace({'a':0, 'c':1, 'n':2, 's':3, 'v':4, 'y':5})

df

Unnamed: 0,Edible/Poisonous,Odor,Population
0,1,7,3
1,0,0,2
2,0,1,2
3,1,7,3
4,0,6,0
...,...,...,...
8119,0,6,1
8120,0,6,4
8121,0,6,1
8122,1,3,4


### Use the get_dummies() method to convert each of the two predictor columns into a set of indicator columns. The "Odor" column is shown first, followed by the "Population" column:

In [78]:
o_dummy = pd.get_dummies(df['Odor'])
o_dummy

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,0,0,0,0,0,0,1,0
1,1,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...
8119,0,0,0,0,0,0,1,0,0
8120,0,0,0,0,0,0,1,0,0
8121,0,0,0,0,0,0,1,0,0
8122,0,0,0,1,0,0,0,0,0


In [79]:
p_dummy = pd.get_dummies(df['Population'])
p_dummy

Unnamed: 0,0,1,2,3,4,5
0,0,0,0,1,0,0
1,0,0,1,0,0,0
2,0,0,1,0,0,0
3,0,0,0,1,0,0
4,1,0,0,0,0,0
...,...,...,...,...,...,...
8119,0,1,0,0,0,0
8120,0,0,0,0,1,0
8121,0,1,0,0,0,0
8122,0,0,0,0,1,0


### Combine the columns together:

In [80]:
df_new = pd.concat([df, o_dummy, p_dummy], axis = 1)
df_new

Unnamed: 0,Edible/Poisonous,Odor,Population,0,1,2,3,4,5,6,7,8,0.1,1.1,2.1,3.1,4.1,5.1
0,1,7,3,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
1,0,0,2,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,0,1,2,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
3,1,7,3,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
4,0,6,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,6,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
8120,0,6,4,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
8121,0,6,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0
8122,1,3,4,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
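In the combined frame above, the indicator columns from "Odor" and "Population" share the integer labels 0 to 5, which makes them hard to tell apart. One possible way around this is the `prefix` argument of `get_dummies()`. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not the mushroom data) with the same column names:

```python
import pandas as pd

# Small made-up frame with already-encoded values, standing in for df.
toy = pd.DataFrame({
    "Edible/Poisonous": [1, 0, 0, 1],
    "Odor": [7, 0, 1, 7],
    "Population": [3, 2, 2, 3],
})

# prefix= labels each indicator column with the feature it came from,
# so the combined frame has no repeated column names.
odor_cols = pd.get_dummies(toy["Odor"], prefix="Odor", dtype=int)
pop_cols = pd.get_dummies(toy["Population"], prefix="Population", dtype=int)

combined = pd.concat([toy, odor_cols, pop_cols], axis=1)
print(combined.columns.tolist())
```

With unique names, a column such as `Odor_7` can be selected by label instead of by position.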


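### Set values for X and y, then split them into training and test sets:

In [81]:
# X: every Odor indicator column except the last one
X = o_dummy.iloc[:, :-1].values
# y: the indicator column at position 1 of p_dummy (Population code 1)
y = p_dummy.iloc[:, 1].values

# Hold out the default 25% of rows for testing; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=1)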
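The split above takes its target from `p_dummy`. If the goal is to predict the "Edible/Poisonous" label itself, one possible setup is sketched below. The `toy` frame is made up for illustration, and `test_size=0.25` and `stratify=y` are choices assumed here rather than taken from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only (not the UCI data).
toy = pd.DataFrame({
    "Edible/Poisonous": [1, 0, 0, 1, 0, 1, 0, 1],
    "Odor": [7, 0, 1, 7, 6, 4, 6, 3],
})

# Features: one indicator column per odor code.
X = pd.get_dummies(toy["Odor"], prefix="Odor", dtype=int)
# Target: the edible (0) / poisonous (1) label.
y = toy["Edible/Poisonous"]

# stratify=y keeps the edible/poisonous ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```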

### Implement linear regression for prediction and to create true/predicted response values:

In [82]:
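# Fit the model on the training rows and predict the held-out rows
linreg = LinearRegression()
linreg.fit(X_train, Y_train)
y_pred = linreg.predict(X_test)

# Toy true/predicted values to check the metric functions
true = [1, 0]
pred = [1, 0]

print(metrics.mean_absolute_error(true, pred))        # MAE
print(metrics.mean_squared_error(true, pred))         # MSE
print(metrics.mean_squared_error(true, pred) ** 0.5)  # RMSE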

0.0
0.0
0.0
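The toy lists above are identical, so every metric comes out as 0.0. As a quick check of how the metrics differ, here is a minimal sketch with one deliberate mismatch. The values are made up for illustration:

```python
import numpy as np
from sklearn import metrics

true = [1, 0, 1, 1]
pred = [1, 0, 0, 1]   # the third prediction is wrong

mae = metrics.mean_absolute_error(true, pred)   # mean of |error| = 1/4
mse = metrics.mean_squared_error(true, pred)    # mean of error^2 = 1/4
rmse = np.sqrt(mse)                             # square root of MSE = 0.5

print(mae, mse, rmse)
```

For 0/1 values MAE and MSE coincide, because each error is either 0 or 1.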


### Calculate the mean squared error on the test set:

In [83]:
print(metrics.mean_squared_error(Y_test, y_pred))

0.04807570240117666
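Mean squared error treats a 0/1 response as a continuous quantity. Since the assignment asks for a predictive classifier, the fraction of correct predictions may be easier to interpret. The sketch below is one possible approach, not the notebook's method: it swaps in `LogisticRegression` and scores it with `accuracy_score` and `confusion_matrix`. The `toy` data and the rule that odor codes 3, 4 and 7 are poisonous are made up for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only: 1 = poisonous, 0 = edible.
odor_codes = [7, 0, 1, 7, 6, 4, 6, 3, 0, 1, 4, 3] * 2
toy = pd.DataFrame({"Odor": odor_codes})
toy["Edible/Poisonous"] = toy["Odor"].isin([3, 4, 7]).astype(int)

X = pd.get_dummies(toy["Odor"], prefix="Odor", dtype=int)
y = toy["Edible/Poisonous"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)

# Fit a classifier and predict class labels (0 or 1) for the test rows.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)               # fraction correct
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])    # rows: true, cols: predicted
print(accuracy)
print(cm)
```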


### Remove individual variables to see which is more important for determining edibility: odor or population. Here I started with "odor" and then "population":

In [84]:
X = o_dummy.iloc[:, 5:6].values   # Odor indicator column 5 only
y = p_dummy.iloc[:, 1].values     # indicator column at position 1 of p_dummy
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

print(metrics.mean_squared_error(y_test, y_pred))

0.04427126327288063


In [85]:
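X = o_dummy.iloc[:, 3:4].values   # Odor indicator column 3 only
y = p_dummy.iloc[:, 1].values     # indicator column at position 1 of p_dummy

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)   # predictions from the model fitted in this cell

print(metrics.mean_squared_error(y_test, y_pred))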

0.04427126327288063


### Conclusion:

#### In summary, the two single-column models produced the same mean squared error (about 0.044), close to the 0.048 obtained with the larger set of odor columns. Because these errors are similar, both odor and population appear useful for predicting whether a mushroom is edible or poisonous.
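As a possible direction for further analysis, each predictor could be scored on its own with cross-validated classification accuracy. The sketch below is only an illustration under stated assumptions: the `toy` frame is made up, `cv_accuracy` is a hypothetical helper name, and `LogisticRegression` with 3 folds is an arbitrary choice.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up rows for illustration only (not the UCI data).
toy = pd.DataFrame({
    "Odor":       [7, 0, 1, 7, 6, 4, 6, 3, 0, 1, 4, 3] * 2,
    "Population": [3, 2, 2, 3, 0, 4, 1, 4, 3, 5, 4, 2] * 2,
})
# Assumed rule for the toy label: odor codes 3, 4 and 7 are poisonous (1).
toy["Edible/Poisonous"] = toy["Odor"].isin([3, 4, 7]).astype(int)


def cv_accuracy(frame, column, target="Edible/Poisonous", folds=3):
    """Mean cross-validated accuracy using one one-hot encoded predictor."""
    X = pd.get_dummies(frame[column], prefix=column, dtype=int)
    y = frame[target]
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()


# Score each predictor separately so the two can be compared side by side.
scores = {col: cv_accuracy(toy, col) for col in ["Odor", "Population"]}
print(scores)
```

Run on the full DataFrame instead of `toy`, the predictor with the higher mean accuracy would be the stronger candidate.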