# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [3]:
# Import your libraries:

import numpy as np
import pandas as pd

# Challenge 1 - Explore The Data

This lesson will explore the creation of a machine learning pipeline from beginning to end. We will save our model and use it to make predictions on data outside of our training sample. Let's start by loading the dataset.

In [4]:
# Loading the data

mushrooms = pd.read_csv('../mushrooms.csv')

#### This dataset contains information about different types of mushrooms. Our response variable indicates whether a mushroom is poisonous. 

We will create a model to predict whether a mushroom is poisonous using other information about the mushrooms. 

Let's print the `head()` of this dataset to see what columns we have in our data.

In [6]:
# Your code here:

mushrooms.head()

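| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |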


#### It looks like most of the columns in this dataset are coded. 

Let's examine the column types using `dtypes` to confirm this. 

In [7]:
# Your code here:

mushrooms.dtypes

class                       object
cap-shape                   object
cap-surface                 object
cap-color                   object
bruises                     object
odor                        object
gill-attachment             object
gill-spacing                object
gill-size                   object
gill-color                  object
stalk-shape                 object
stalk-root                  object
stalk-surface-above-ring    object
stalk-surface-below-ring    object
stalk-color-above-ring      object
stalk-color-below-ring      object
veil-type                   object
veil-color                  object
ring-number                 object
ring-type                   object
spore-print-color           object
population                  object
habitat                     object
dtype: object

#### Since there are single letter codes for each column, it would be best if we looked at the data dictionary below.

Attribute Information: (classes: edible=e, poisonous=p)

    cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s

    cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s

    cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y

    bruises: bruises=t,no=f

    odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s

    gill-attachment: attached=a,descending=d,free=f,notched=n

    gill-spacing: close=c,crowded=w,distant=d

    gill-size: broad=b,narrow=n

    gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y

    stalk-shape: enlarging=e,tapering=t

    stalk-root: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?

    stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s

    stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s

    stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

    stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y

    veil-type: partial=p,universal=u

    veil-color: brown=n,orange=o,white=w,yellow=y

    ring-number: none=n,one=o,two=t

    ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z

    spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y

    population: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y

    habitat: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d
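The data dictionary can also be applied in code to make the coded values readable. The snippet below is a minimal sketch: the two mappings are copied from the dictionary above, while the `toy` DataFrame and the variable names are made up for illustration and stand in for `mushrooms`.

```python
import pandas as pd

# Code-to-label mappings copied from the data dictionary above.
class_labels = {"e": "edible", "p": "poisonous"}
cap_shape_labels = {"b": "bell", "c": "conical", "x": "convex",
                    "f": "flat", "k": "knobbed", "s": "sunken"}

# Toy rows standing in for `mushrooms` (illustrative values only).
toy = pd.DataFrame({"class": ["p", "e", "e"], "cap-shape": ["x", "x", "b"]})

# Series.map replaces each code with its label; codes missing from the
# mapping would become NaN.
decoded = toy.copy()
decoded["class"] = toy["class"].map(class_labels)
decoded["cap-shape"] = toy["cap-shape"].map(cap_shape_labels)
print(decoded)
```

Decoding like this is only useful for reading and plotting; the modeling steps later in the notebook work on the original codes.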


#### Using correlated variables is not always great for our model. 

Correlated variables are most damaging in linear regression. They can produce an unstable and inaccurate solution. You can read more about why [here](https://en.wikipedia.org/wiki/Multicollinearity#Consequences_of_multicollinearity).

In random forests, we produce multiple trees. Each tree is produced using both a sample of rows and a sample of columns, which may help reduce the effect of correlated variables. However, even when the correlated variables do not produce a worse model, removing them can be beneficial since it makes the model simpler. A model with fewer features will take less time to compute.

Since all our data is categorical, we cannot use a correlation matrix to ensure that the variables are not correlated.

In this case, we will use a [Chi-Square test of independence](https://onlinecourses.science.psu.edu/stat500/node/56/) to check whether each pair of variables is associated.

The null hypothesis of the Chi-Square test of independence is that the two features are independent and the alternative hypothesis is that they are not independent.
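To make the test concrete, here is a small worked sketch in plain Python. The 2x2 counts are made up and are not taken from the mushroom data. It computes the counts expected under independence and the Pearson statistic by hand. Note that `scipy.stats.chi2_contingency` applies a continuity correction to 2x2 tables by default, so it would report a slightly different statistic for this table.

```python
# Toy 2x2 contingency table (made-up counts): rows = feature A, columns = feature B.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Under independence, expected count = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, round(chi2_stat, 3), dof)
```

The larger the statistic relative to its degrees of freedom, the smaller the p-value and the stronger the evidence against independence.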

In the following cell, create a contingency table of `cap-surface` and `cap-shape` using the `pd.crosstab` function. We will use this contingency table to perform the Chi-Square test of independence. Assign this table to the variable `ct`.

In [53]:
# Your code here:

ct = pd.crosstab(mushrooms['cap-surface'], mushrooms['cap-shape'])
ct

| cap-surface \ cap-shape | b | c | f | k | s | x |
|---|---|---|---|---|---|---|
| f | 52 | 0 | 1016 | 60 | 32 | 1160 |
| g | 1 | 1 | 1 | 1 | 0 | 0 |
| s | 244 | 0 | 820 | 418 | 0 | 1074 |
| y | 155 | 3 | 1315 | 349 | 0 | 1422 |


Let's import the chi-square test:

In [54]:
from scipy.stats import chi2_contingency

In the following cell, perform the chi-square test on the contingency table using the function you just imported. This function will produce a tuple with 4 values. The second value in the tuple is the p-value for our test. Return the p-value and interpret the result.

In [83]:
# Your code here:

chi2, p, dof, expected = chi2_contingency(ct)
p

4.635777687474967e-206
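One possible way to interpret this result in code is sketched below. The counts are copied from the `ct` table printed above; the 0.05 significance level and the "expected count below 5" rule of thumb are assumptions that the notebook itself does not prescribe.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the `ct` table printed above
# (rows: cap-surface f, g, s, y; columns: cap-shape b, c, f, k, s, x).
ct_values = np.array([
    [52, 0, 1016, 60, 32, 1160],
    [1, 1, 1, 1, 0, 0],
    [244, 0, 820, 418, 0, 1074],
    [155, 3, 1315, 349, 0, 1422],
])

chi2, p, dof, expected = chi2_contingency(ct_values)

alpha = 0.05  # assumed significance level; the notebook does not fix one
reject_independence = p < alpha

# The chi-square approximation is less reliable when expected counts are small,
# so count the cells whose expected frequency falls below 5.
small_cells = int((expected < 5).sum())
print(f"p = {p:.3g}, dof = {dof}, reject H0: {reject_independence}, "
      f"cells with expected < 5: {small_cells}")
```

A p-value this small means we reject the null hypothesis: `cap-surface` and `cap-shape` are not independent. The rare `g` surface and `c` shape produce very small expected counts, so the exact p-value should be read with some caution even though the conclusion is clear.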

# Bonus Challenge 1 - Create a p-value Matrix for all Variables.

In the cell below, write a for loop to create a dataframe or matrix of all pairwise tests. Print this dataframe and interpret the results.

Below is an example of what this might look like:
![corr df](../corr_df.png)

In [82]:
# Your code here:

corr_result = pd.DataFrame(index=mushrooms.columns.values, columns=mushrooms.columns.values)

for i in corr_result.index:
    for j in corr_result.columns.values:
        corr_result.loc[i, j] = chi2_contingency(pd.crosstab(mushrooms[i], mushrooms[j]))[1]
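The loop above tests every pair twice and also tests each column against itself. A possible variation, sketched here on made-up data (the `toy` DataFrame and the name `p_values` are illustrative only), visits each unordered pair once with `itertools.combinations` and mirrors the result.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy categorical data (made up): `b` copies `a`, `c` is unrelated to both.
toy = pd.DataFrame({
    "a": list("xxyy") * 15,
    "b": list("xxyy") * 15,
    "c": list("pqpq") * 15,
})

cols = toy.columns
# Float matrix of p-values; the diagonal is left as NaN on purpose.
p_values = pd.DataFrame(np.nan, index=cols, columns=cols)

for left, right in combinations(cols, 2):
    p = chi2_contingency(pd.crosstab(toy[left], toy[right]))[1]
    p_values.loc[left, right] = p
    p_values.loc[right, left] = p  # the test is symmetric in its two variables

print(p_values)
```

With many pairs, some small p-values will appear by chance alone, so a correction for multiple comparisons may be worth considering before dropping columns.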

#### The next step in model generation is to ensure there is no missing data and handle any missing data if it exists.

In the next cell, check whether there is any missing data in each column of the dataset.

In [8]:
# Your code here:

mushrooms.isna().sum()

class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-root                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64
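One caveat: the data dictionary lists `?` as the code for a missing `stalk-root`, and `isna()` does not count such placeholders. The sketch below uses a made-up column (not the real data) to show how the placeholder could be counted and, if desired, converted to `NaN`.

```python
import numpy as np
import pandas as pd

# Toy column standing in for mushrooms['stalk-root'] (values made up).
toy = pd.DataFrame({"stalk-root": ["e", "c", "?", "b", "?"]})

# isna() only counts real NaN values, so the '?' placeholders go undetected.
nan_count = int(toy["stalk-root"].isna().sum())

# Counting the placeholder directly reveals the hidden missing values.
placeholder_count = int((toy["stalk-root"] == "?").sum())

# Converting '?' to NaN makes the usual pandas missing-data tools apply.
cleaned = toy.replace("?", np.nan)
cleaned_nan_count = int(cleaned["stalk-root"].isna().sum())
print(nan_count, placeholder_count, cleaned_nan_count)
```

Whether to keep `?` as its own category, as this notebook effectively does, or to treat it as missing is a modeling choice.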

#### Since there is no work to be done to clean up missing data, the next step is to create dummy variables. 

Most machine learning algorithms cannot handle non-numeric data, so we will need to transform our data. Use the `get_dummies` function to transform the data. Make sure to remove one dummy column per variable using the `drop_first=True` option.

In [9]:
# Your code here:

mushrooms_dummy = pd.get_dummies(mushrooms, columns=mushrooms.columns.values, drop_first=True)
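To see what `drop_first=True` changes, here is a minimal sketch on a made-up two-column frame. The column names are borrowed from the dataset, but the rows are illustrative.

```python
import pandas as pd

# Toy data (made up) with two binary categorical columns.
toy = pd.DataFrame({"bruises": ["t", "f", "t"], "gill-size": ["n", "b", "b"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each variable dropped

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category is implied when all remaining dummies for that variable are zero, so no information is lost. Depending on the pandas version, the dummy columns may be boolean rather than 0/1 integers; passing `dtype=int` requests integers explicitly.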

In [85]:
mushrooms_dummy.head()

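| | class_p | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_g | cap-surface_s | cap-surface_y | cap-color_c | ... | population_n | population_s | population_v | population_y | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |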


#### Our final data exploration task is to prepare the data for modeling by splitting it into predictors and response, and into training and test sets. 

We will do this using the `train_test_split` function from scikit-learn. In the cell below, split the data to `X_train`, `X_test`, `y_train`, and `y_test` using this function. Select 80% of the data for the training sample and the rest for the test sample.

In [119]:
from sklearn.model_selection import train_test_split

# Your code here:

cols = [x for x in mushrooms_dummy.columns.values if x != "class_p"]
X_train, X_test, y_train, y_test = train_test_split(mushrooms_dummy[cols], mushrooms_dummy['class_p'], test_size=0.2)
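The split above changes every time the cell runs because no seed is set. The sketch below, on synthetic data rather than the mushroom data, shows two optional arguments of `train_test_split` that are often useful: `random_state` for a repeatable split and `stratify` to keep the class ratio equal in both parts. The seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 3 binary features, a 60/40 binary target.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # 20% of rows held out for testing
    random_state=42,   # arbitrary seed so the split is repeatable
    stratify=y_toy,    # keep the class ratio the same in both parts
)
print(X_tr.shape, X_te.shape, y_te.mean())
```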

# Challenge 2 - Creating and Saving Our Model

Determining whether a mushroom is poisonous is a classification task. There are multiple classification models we can choose from.
However, since we have determined that there are many columns that are not independent, this limits our choice of model. One model we will not consider is logistic regression. Two potential choices for this modeling task are random forest and SVM.

Let's start with Random Forest. We think of random forest as a voting algorithm. We generate many decision trees by sampling both rows and columns in our dataset. Each one of these trees produces a decision. We let all the trees "vote" and the aggregate decision that they produce gives us the final decision for our algorithm (in this case, they will vote whether each mushroom is poisonous or not). To learn more about random forests, click [here](https://onlinecourses.science.psu.edu/stat857/node/179/).
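The voting idea can be checked directly on a fitted forest. The sketch below uses synthetic data from `make_classification`, not the mushroom data, and all variable names are illustrative. In scikit-learn the "vote" is implemented as an average of each tree's predicted class probabilities, with the most probable class winning, rather than a count of hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (not the mushroom dataset), just to have something to fit.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Each fitted tree is available in `estimators_`; collect every tree's
# class probabilities into an array of shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X_toy) for tree in rf.estimators_])

# Average across trees, then pick the most probable class for each row.
manual_pred = rf.classes_[tree_probs.mean(axis=0).argmax(axis=1)]

print((manual_pred == rf.predict(X_toy)).all())
```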

In the cell below, we will import and initialize a random forest from scikit-learn. Assign the initialized model to `mushroom_rf`. For now, we will just use the default settings for the random forest classifier, so there is no need to pass any arguments to the function.

In [120]:
from sklearn.ensemble import RandomForestClassifier

# Your code here:

mushroom_rf = RandomForestClassifier()

In the cell below, fit the model to the training data.

In [121]:
# Your code here:

mushroom_rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

#### Next, let's evaluate the model. One of the most straightforward ways to evaluate a classification model is using a confusion matrix. 

The confusion matrix shows us the true positives, false positives, false negatives and true negatives in the data. Our goal is to maximize the true positives and true negatives (the observations that are correctly classified) and minimize the false positives and false negatives.
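As a small illustration of how the four cells are read, the sketch below uses made-up labels, not model output. For binary labels, scikit-learn puts the true classes on the rows and the predicted classes on the columns.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = poisonous, 0 = edible.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are the true classes and columns the predicted ones,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)      # share of poisonous mushrooms that were caught
precision = tp / (tp + fp)   # share of "poisonous" predictions that were right
print(tn, fp, fn, tp, recall, precision)
```

For this problem a false negative (a poisonous mushroom predicted as edible) is the more dangerous error, so recall on the poisonous class deserves particular attention.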

In the cell below, we'll start by generating predictions for the test data using the `predict` function. 

In [122]:
# Your code here:

y_rf = mushroom_rf.predict(X_test)

Now we'll import the `confusion_matrix` function and compute the confusion matrix by comparing the observed data (`y_test`) and the predicted data that you found in the cell above.

In [123]:
from sklearn.metrics import confusion_matrix

# Your code here:

confusion_matrix(y_test, y_rf)

array([[850,   0],
       [  0, 775]], dtype=int64)

# Bonus Challenge 2 - Use a Different ML Algorithm

Repeat the steps here to predict and evaluate the model but instead use gradient boosted classification. Your end result should be a confusion matrix comparing the predicted and observed y values for the test sample. You can read more about gradient boosted models [here](https://en.wikipedia.org/wiki/Gradient_boosting).

In [124]:
from sklearn.ensemble import GradientBoostingClassifier

# Your code here:
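One possible shape for this bonus is sketched below on synthetic data so that it runs on its own. The names `X_tr`, `X_te`, `y_tr`, `y_te`, `mushroom_gb`, `y_gb` and `cm_gb` are illustrative. In the notebook, the existing `X_train`, `X_test`, `y_train` and `y_test` would take the place of the synthetic split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dummy-encoded mushroom data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Default settings apart from the seed, mirroring the random forest cell.
mushroom_gb = GradientBoostingClassifier(random_state=0)
mushroom_gb.fit(X_tr, y_tr)

# Predict on the held-out rows and compare with the observed labels.
y_gb = mushroom_gb.predict(X_te)
cm_gb = confusion_matrix(y_te, y_gb)
print(cm_gb)
```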



# Challenge 3 - Producing Individual Predictions and Saving The Model

One of the most important goals of machine learning models is to act as something like a prediction black box. We would like to pass an observation to the model and get back a prediction as an output. Let's do this in the next cell using the `predict` function. We'll pick a random mushroom and generate a prediction that will tell us whether it is poisonous.

In [133]:
from random import seed
from random import randint
# seed random number generator
seed(1)

# randint includes both endpoints, so subtract 1 to stay within valid row positions
random_number = randint(0, X_test.shape[0] - 1)

random_mushroom = X_test.iloc[[random_number]]

In [134]:
random_mushroom

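| | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_g | cap-surface_s | cap-surface_y | cap-color_c | cap-color_e | ... | population_n | population_s | population_v | population_y | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u | habitat_w |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4452 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |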


In the cell below, use the predict function to generate a prediction for this random mushroom. Is the random mushroom poisonous? Compare this to the true y value.

Hint: use `iloc` to access the `random_number` row like in the example above. Read more about the difference between `loc` and `iloc` [here](

In [137]:
# Your code here:

mushroom_rf.predict(random_mushroom), y_test.iloc[[random_number]]

(array([1], dtype=uint8), 4452    1
 Name: class_p, dtype: uint8)
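The double brackets in `X_test.iloc[[random_number]]` matter: `predict` expects a two-dimensional input, and a list of positions keeps the row as a one-row DataFrame. The sketch below uses a made-up frame (the name `X_toy` and its values are illustrative) to show the difference, along with how `iloc` (position) and `loc` (index label) relate.

```python
import pandas as pd

# Toy feature table (made up), indexed like a shuffled test set.
X_toy = pd.DataFrame({"f1": [0, 1, 1], "f2": [1, 0, 1]}, index=[4452, 17, 903])

row_as_frame = X_toy.iloc[[0]]   # list of positions -> DataFrame with shape (1, 2)
row_as_series = X_toy.iloc[0]    # single position   -> Series with shape (2,)

# iloc selects by position, loc by index label; both return the same row here.
same_row = X_toy.loc[[4452]].equals(row_as_frame)
print(row_as_frame.shape, row_as_series.shape, same_row)
```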

#### Our final step is to save our model. 

Do this in the cell below using pickling. Import the pickle library and save the model as a pickle file. Name your file `mushrooms.sav`.

In [139]:
# Your code here:

import pickle

# Use a context manager so the file is closed once the model is written
with open("mushrooms.sav", "wb") as f:
    pickle.dump(mushroom_rf, f)
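To use the saved model later, it can be loaded back with `pickle.load`. The sketch below is self-contained: it trains a small forest on synthetic data as a stand-in for `mushroom_rf`, writes it to a temporary directory, and checks that the restored model gives the same predictions. All variable names here are illustrative.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic model standing in for `mushroom_rf`.
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mushrooms.sav")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code. A pickled model is also best loaded with the same scikit-learn version that created it.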