# Task: In this Project, we’ll use scikit-learn to answer the question:<br>“Which other attribute (i.e., aside from the poisonous/edible indicator) or attributes are the best predictors of whether a particular mushroom is poisonous or edible?”

## Phase I: Data Acquisition, Data Preparation & Exploratory Data Analysis
- 1) First study the dataset and the associated description of the data (i.e. “data dictionary”).<br>
- 2) Create a pandas DataFrame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous, the column that includes odor, and at least two other columns of your choosing.<br>
- 3) Add meaningful names for each column in the DataFrame you created to store your subset.<br>
- 4) Convert the “e”/“p” indicators in the first column to digits: for example, the “e” might become 0 and “p” might become 1. For each of the other columns in your DataFrame create a set of dummy variables. This is necessary because your downstream processing in Project 4 using scikit-learn requires that values be stored as numerics. See the pandas get_dummies() method for one possible approach to doing this.<br>
- 5) Perform exploratory data analysis: show the distribution of data for each of the columns you selected, and show plots for edible/poisonous vs. odor as well as the other columns that you selected. It is up to you to decide which types of plots to use for these tasks. Include text describing your EDA findings.<br>
- 6) Include some text describing your preliminary conclusions about whether any of the other columns you’ve included in your subset (i.e., aside from the poisonous/edible indicator) could be helpful in predicting if a specific mushroom is edible or poisonous.

## Phase II: Build Predictive Models
- 1) Start with the mushroom data (including the dummy variables) in the pandas DataFrame that you constructed in Phase I.<br>
- 2) Use scikit-learn to determine which of the predictor columns that you selected (odor and the other columns of your choice) most accurately predicts whether or not a mushroom is poisonous. How you go about doing this with scikit-learn is up to you as a practitioner of data analytics.<br>
- 3) Clearly state your conclusions along with any recommendations for further analysis.

### Phase I - 1) First study the dataset and the associated description of the data (i.e. “data dictionary”).

#### Data Set Information: https://archive.ics.uci.edu/ml/datasets/mushroom
This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like “leaflets three, let it be” for Poisonous Oak and Ivy.


#### Attribute Information:
1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s 
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y 
4. bruises?: bruises=t,no=f 
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s 
6. gill-attachment: attached=a,descending=d,free=f,notched=n 
7. gill-spacing: close=c,crowded=w,distant=d 
8. gill-size: broad=b,narrow=n 
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y 
10. stalk-shape: enlarging=e,tapering=t 
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? 
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y 
16. veil-type: partial=p,universal=u 
17. veil-color: brown=n,orange=o,white=w,yellow=y 
18. ring-number: none=n,one=o,two=t 
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z 
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y 
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y 
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
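
The single-letter codes above are compact but hard to read in tables and plots. As a minimal sketch, the odor codes from item 5 can be stored in a lookup dictionary and applied with `Series.map`. The names `ODOR_LABELS` and `sample_odors` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Odor codes copied from item 5 of the attribute list above.
ODOR_LABELS = {
    "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
    "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
}

# A few odor codes in the form used by the raw file (illustrative sample).
sample_odors = pd.Series(["p", "a", "l", "p", "n"], name="odor")

# Translate each code to its readable label; a code missing from the
# dictionary would become NaN.
readable_odors = sample_odors.map(ODOR_LABELS)
print(readable_odors.tolist())
```

The same pattern works for any other attribute in the list by swapping in its code-to-label pairs.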

### Phase I - 2) Create a pandas DataFrame with a subset of the columns in the dataset. You should include the column that indicates edible or poisonous, the column that includes odor, and at least two other columns of your choosing.

In [1]:
#importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#loading the data
df = pd.read_csv("https://raw.githubusercontent.com/mhan1/analytical-programming/master/mushroom.csv", 
                 names=['edibility', 
                        'cap_shape', 
                        'cap_surface',
                        'cap_color', 
                        'bruises', 
                        'odor', 
                        'gill_attachment',
                        'gill_spacing',
                        'gill_size',
                        'gill_color',
                        'stalk_shape',
                        'stalk_root',
                        'stalk_surface_above_ring',
                        'stalk_surface_below_ring',
                        'stalk_color_above_ring',
                        'stalk-color_below_ring',
                        'veil_type',
                        'veil_color',
                        'ring_number',
                        'ring_type',
                        'spore_print_color',
                        'population',
                        'habitat'                   
                       ])

In [3]:
#sanity check
df.head()

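|  | edibility | cap_shape | cap_surface | cap_color | bruises | odor | gill_attachment | gill_spacing | gill_size | gill_color | ... | stalk_surface_below_ring | stalk_color_above_ring | stalk-color_below_ring | veil_type | veil_color | ring_number | ring_type | spore_print_color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |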


In [4]:
#checking the number of rows and columns
df.shape

(8124, 23)

In [5]:
#checking the columns
df.columns

Index(['edibility', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor',
       'gill_attachment', 'gill_spacing', 'gill_size', 'gill_color',
       'stalk_shape', 'stalk_root', 'stalk_surface_above_ring',
       'stalk_surface_below_ring', 'stalk_color_above_ring',
       'stalk-color_below_ring', 'veil_type', 'veil_color', 'ring_number',
       'ring_type', 'spore_print_color', 'population', 'habitat'],
      dtype='object')

### According to WildFoodUK (https://www.wildfooduk.com/articles/how-to-tell-the-difference-between-poisonous-and-edible-mushrooms/), we should avoid mushrooms with white gills, a skirt or ring on the stem, and a bulbous or sack-like base called a volva. The same article also advises avoiding mushrooms with red on the cap or stem. <br> Hence, for this project I will use 'cap_color' and 'gill_color', in addition to 'odor', as the attributes for predicting a mushroom's edibility.

In [6]:
#creating a pandas DataFrame with a subset of the columns in the dataset.
df1 = df[['edibility','odor','cap_color','gill_color']]
df1.head()

|  | edibility | odor | cap_color | gill_color |
|---|---|---|---|---|
| 0 | p | p | n | k |
| 1 | e | a | y | k |
| 2 | e | l | w | n |
| 3 | p | p | w | n |
| 4 | e | n | g | k |


In [7]:
#checking the data type
type(df1)

pandas.core.frame.DataFrame

In [8]:
#checking the number of rows and columns
df1.shape

(8124, 4)

In [9]:
#checking the columns
df1.columns

Index(['edibility', 'odor', 'cap_color', 'gill_color'], dtype='object')

In [10]:
#checking the summary statistics of the df1 DataFrame
df1.describe()

|  | edibility | odor | cap_color | gill_color |
|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 9 | 10 | 12 |
| top | e | n | n | b |
| freq | 4208 | 3528 | 2284 | 1728 |


### Phase I - 3) Add meaningful names for each column in the DataFrame you created to store your subset.

In [11]:
#checking the columns
df1.columns

Index(['edibility', 'odor', 'cap_color', 'gill_color'], dtype='object')

### I already named the columns when loading the dataset, using the "names" parameter of pd.read_csv, as shown above.

### Phase I - 4) Convert the “e”/“p” indicators in the first column to digits: for example, the “e” might become 0 and “p” might become 1. For each of the other columns in your DataFrame create a set of dummy variables. This is necessary because your downstream processing in Project 4 using scikit-learn requires that values be stored as numerics. See the pandas get_dummies() method for one possible approach to doing this.

In [19]:
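#creating dummy variables for the 'edibility' column (dtype=int gives 0/1 rather than booleans)
df1_with_dum = pd.get_dummies(df1['edibility'], dtype=int)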
df1_with_dum.head(7)

|  | e | p |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
| 5 | 1 | 0 |
| 6 | 1 | 0 |


In [21]:
#renaming the dummy columns to meaningful names
df1_with_dum.rename(columns={"e":"edible", "p":"poisonous"}, inplace=True)
df1_with_dum.head(3)

|  | edible | poisonous |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |


In [26]:
#combining the dummy variables with the original data
df2 = df1.join(df1_with_dum)
df2.head(7)

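|  | edibility | odor | cap_color | gill_color | edible | poisonous |
|---|---|---|---|---|---|---|
| 0 | p | p | n | k | 0 | 1 |
| 1 | e | a | y | k | 1 | 0 |
| 2 | e | l | w | n | 1 | 0 |
| 3 | p | p | w | n | 0 | 1 |
| 4 | e | n | g | k | 1 | 0 |
| 5 | e | a | y | n | 1 | 0 |
| 6 | e | a | w | g | 1 | 0 |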


In [27]:
#replacing 'p' (poisonous) with 1 and 'e' (edible) with 0 in the 'edibility' column
#assigning the result back avoids relying on an in-place replace on a column accessor
df2['edibility'] = df2['edibility'].map({'p': 1, 'e': 0})
df2.head()

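|  | edibility | odor | cap_color | gill_color | edible | poisonous |
|---|---|---|---|---|---|---|
| 0 | 1 | p | n | k | 0 | 1 |
| 1 | 0 | a | y | k | 1 | 0 |
| 2 | 0 | l | w | n | 1 | 0 |
| 3 | 1 | p | w | n | 0 | 1 |
| 4 | 0 | n | g | k | 1 | 0 |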


In [29]:
df2.shape

(8124, 6)
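
The cells above encode only the `edibility` column, while the task also asks for dummy variables for each of the other columns. One possible way to do that is to pass the predictor columns to `pd.get_dummies`, sketched here on the seven rows displayed earlier rather than on the full file. The names `sample`, `predictor_dummies`, `target`, and `model_df` are hypothetical, and `dtype=int` is an assumption made so the output is 0/1 rather than booleans.

```python
import pandas as pd

# The first seven rows of the subset, as displayed in the notebook output above.
sample = pd.DataFrame({
    "edibility":  ["p", "e", "e", "p", "e", "e", "e"],
    "odor":       ["p", "a", "l", "p", "n", "a", "a"],
    "cap_color":  ["n", "y", "w", "w", "g", "y", "w"],
    "gill_color": ["k", "k", "n", "n", "k", "n", "g"],
})

predictors = ["odor", "cap_color", "gill_color"]

# One 0/1 column per category, named like "odor_p" or "cap_color_w".
predictor_dummies = pd.get_dummies(sample[predictors], dtype=int)

# Numeric target: 1 = poisonous, 0 = edible.
target = sample["edibility"].map({"p": 1, "e": 0}).rename("poisonous")

# Model-ready frame: the target plus every dummy column, all numeric.
model_df = pd.concat([target, predictor_dummies], axis=1)
print(model_df.head())
```

Keeping the column name as a prefix (`odor_`, `cap_color_`, `gill_color_`) avoids collisions, since the same letter code can appear in more than one attribute.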

### Phase I - 5) Perform exploratory data analysis: show the distribution of data for each of the columns you selected, and show plots for edible/poisonous vs. odor as well as the other columns that you selected. It is up to you to decide which types of plots to use for these tasks. Include text describing your EDA findings.
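
As one possible starting point for this step, the sketch below counts each category with `value_counts`, cross-tabulates a predictor against edibility with `pd.crosstab`, and draws the cross-tabulation as a stacked bar chart. It runs on the seven rows shown earlier only to stay self-contained, so its counts say nothing about the full dataset. The helper name `plot_vs_edibility` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The first seven rows of the subset, as displayed in the notebook output above.
sample = pd.DataFrame({
    "edibility":  ["p", "e", "e", "p", "e", "e", "e"],
    "odor":       ["p", "a", "l", "p", "n", "a", "a"],
    "cap_color":  ["n", "y", "w", "w", "g", "y", "w"],
    "gill_color": ["k", "k", "n", "n", "k", "n", "g"],
})


def plot_vs_edibility(frame, column):
    """Stacked bar chart of edible/poisonous counts per category of `column`."""
    counts = pd.crosstab(frame[column], frame["edibility"])
    ax = counts.plot(kind="bar", stacked=True, title=f"{column} vs. edibility")
    ax.set_ylabel("number of mushrooms")
    return counts, ax


# Distribution of each selected column.
for col in sample.columns:
    print(sample[col].value_counts())

# Edible/poisonous counts per odor category, plus the matching chart.
odor_counts, odor_ax = plot_vs_edibility(sample, "odor")
print(odor_counts)
# In a script (outside a notebook), call plt.show() to display the figure.
```

Calling the helper once per predictor (`"odor"`, `"cap_color"`, `"gill_color"`) on the full DataFrame would produce the edible/poisonous comparison the task asks for.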

### Phase I - 6) Include some text describing your preliminary conclusions about whether any of the other columns you’ve included in your subset (i.e., aside from the poisonous/edible indicator) could be helpful in predicting if a specific mushroom is edible or poisonous.

### Phase II - 1) Start with the mushroom data (including the dummy variables) in the pandas DataFrame that you constructed in Phase I.

### Phase II - 2) Use scikit-learn to determine which of the predictor columns that you selected (odor and the other columns of your choice) most accurately predicts whether or not a mushroom is poisonous. How you go about doing this with scikit-learn is up to you as a practitioner of data analytics.
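
The notebook does not yet contain a model for this step. One way it could be approached, offered as a sketch and not as the author's method, is to train a separate classifier on the dummy variables of each predictor column and compare cross-validated accuracy. The choice of `DecisionTreeClassifier`, 5-fold cross-validation, and the helper name `score_each_predictor` are all assumptions. The toy rows are synthetic and were built so that odor separates the classes, so the printed scores illustrate the mechanics only and are not findings about the mushroom data.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def score_each_predictor(frame, target, predictors, cv=5):
    """Mean cross-validated accuracy of a tree trained on one predictor at a time."""
    y = frame[target]
    scores = {}
    for col in predictors:
        # Dummy variables for this column only.
        X = pd.get_dummies(frame[[col]], dtype=int)
        model = DecisionTreeClassifier(random_state=0)
        scores[col] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# SYNTHETIC toy rows (not the real mushroom data): edibility is 1 for
# poisonous and 0 for edible; odor determines the class, cap_color does not.
base_rows = [
    (1, "f", "n", "b"), (0, "n", "n", "k"),
    (1, "p", "w", "b"), (0, "a", "w", "n"),
    (1, "f", "w", "k"), (0, "n", "w", "b"),
    (1, "p", "n", "n"), (0, "a", "n", "k"),
]
toy = pd.DataFrame(
    base_rows * 10,
    columns=["edibility", "odor", "cap_color", "gill_color"],
)

scores = score_each_predictor(toy, "edibility", ["odor", "cap_color", "gill_color"])
best = max(scores, key=scores.get)
print(scores)
print("highest mean accuracy:", best)
```

On the real data, the same function could be called with the DataFrame from Phase I after the `edibility` column has been converted to 0/1. Accuracy alone may be a weak criterion here, since labeling a poisonous mushroom as edible is the costlier mistake, so recall for the poisonous class is worth checking as well.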

### Phase II - 3) Clearly state your conclusions along with any recommendations for further analysis.