# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of the Module 3 materials. It covers:

* Calculus, Cost Function, and Gradient Descent
* Logistic Regression
* Decision Trees
* Ensemble Methods 

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

---
## Part 1: Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot up above can be generalized in the following equation: $$y = mx + b$$

Of all the possible lines, we can prove why that particular line was chosen using the plot down below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(actual - expected)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
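
The last line of the RSS definition translates directly into code. The sketch below is illustrative only: the function name `rss` and the toy data are assumptions, not the data behind the plots above. It evaluates RSS for several candidate slopes with the intercept fixed at 0, which is how a cost curve like the one shown is traced out.

```python
def rss(xs, ys, m, b):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on y = 2x (not the assessment's data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Evaluate the cost at several candidate slopes, holding b at 0
for m in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"m = {m}: RSS = {rss(xs, ys, m, b=0.0)}")
```

For this toy data the RSS is smallest at `m = 2.0` and grows as `m` moves away from it in either direction, giving the same bowl shape as the cost curve.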

### 1.1) What is a more generalized name for the RSS curve above? How is it related to machine learning models?

In [None]:
"""
A more general name for the RSS curve above is a cost curve: a plot of
the cost (loss) function against a model parameter. For a regression
model the cost is a continuous error measure such as RSS, while a
classification model uses a different cost function, such as log loss.
When we build machine learning models we are generally trying to find
the set of parameters that minimizes the cost. The global minimum of
the cost curve corresponds to the parameters that best fit our
training data.
"""

### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? What is the relation between the position on the cost curve, the error, and the slope of the line?

In [None]:
"""
I would choose an m value of .05 from the cost curve above, as the RSS
at .05 is ~2500 vs ~8000 at .08. Each point on the cost curve gives
an error amount and, through the slope of the curve at that point, the
rate of change of the error. For a curve shaped like the one above,
with no local minima, the higher the error at your position on the
curve, the steeper the curve is there. Essentially, the further you are
from the global minimum, the steeper the slope is going to be.
When the slope is larger, the same change in m has a greater impact
on the cost. For example, in the curve above going from m = .1 to .09
results in an RSS decrease of ~4000, while going from m = .06 to .05
results in an RSS decrease of <1000.
"""

![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

In [None]:
"""
This ties in to the answer in 1.2. As you step towards the global
minimum along the cost curve, the magnitude of the derivative (rate of
change) decreases. Your step size is the learning rate times the slope
of the cost curve at the current point, so as the slope flattens your
step size shrinks too. With a learning rate equal to 1 you simply move
by the amount of the derivative each time, so as you get closer to the
global minimum your steps get smaller. At the global minimum the rate
of change is 0, so you wouldn't move at all from there! This is also
why local minima can cause problems: the slope is 0 there as well,
which can trick gradient descent into stopping prematurely.
"""

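The shrinking steps described in 1.3 can be reproduced numerically. This is a minimal sketch under stated assumptions: the helper names `rss_gradient` and `gradient_descent`, the toy data, and the learning rate of 0.01 are all hypothetical, and the intercept is fixed at 0 so that only the slope `m` is updated.

```python
def rss_gradient(xs, ys, m):
    """Derivative of RSS with respect to m for the line y = m*x (b fixed at 0)."""
    return sum(-2 * x * (y - m * x) for x, y in zip(xs, ys))

def gradient_descent(xs, ys, m_start, learning_rate, n_steps):
    """Return every m value visited, starting from m_start."""
    path = [m_start]
    for _ in range(n_steps):
        m = path[-1]
        # Step against the gradient, scaled by the learning rate
        path.append(m - learning_rate * rss_gradient(xs, ys, m))
    return path

# Hypothetical toy data lying exactly on y = 2x (not the assessment's data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

path = gradient_descent(xs, ys, m_start=0.0, learning_rate=0.01, n_steps=5)
step_sizes = [abs(after - before) for before, after in zip(path, path[1:])]
print("m values visited:", path)
print("step sizes:", step_sizes)
```

Each printed step is smaller than the one before it because the gradient shrinks as `m` approaches the minimizing slope of 2, even though the learning rate never changes.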
### 1.4) What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

In [None]:
"""
The purpose of a learning rate in gradient descent is to scale how far
you move along the cost curve at each step. A learning rate of one
means each step equals the derivative at the current point. Usually we
want a learning rate <1 so that we move some fraction of this amount.
If the learning rate is too high, we will keep overshooting the minimum
once we get close and bounce back and forth between points on either
side of it, possibly never reaching it or even diverging. If the
learning rate is too small, the number of steps (and time) needed to
reach the minimum can become extreme. Additionally, if the cost curve
has local minima, a very small learning rate makes it more likely that
we get stuck in one of them rather than finding the true global
minimum.
"""

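The effect of the learning rate can be seen on a deliberately simple cost function. The sketch below assumes a hypothetical cost of `m**2` (minimum at `m = 0`) rather than the RSS curve from the assessment, and the function name `descend` and the three learning rates are illustrative choices. A very small rate barely moves in 20 steps, a moderate rate converges, and a rate that is too large overshoots further on every step.

```python
def descend(learning_rate, m_start=5.0, n_steps=20):
    """Run gradient descent on cost(m) = m**2, whose derivative is 2*m."""
    m = m_start
    for _ in range(n_steps):
        m = m - learning_rate * 2 * m
    return m

# Too small, reasonable, and too large (all values are illustrative)
for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: m after 20 steps = {descend(lr):.4g}")
```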
---
## Part 2: Logistic Regression [Suggested Time: 25 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 2.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer

In [1]:
# Your code here to calculate precision
# Precision matters most when a false positive is costly and a false
# negative is not. For example: an elective surgery that costs $1m is
# very costly if wrongly marked as necessary, and not a problem if missed.
# Precision = TP/(TP + FP)
prec = 30/(30+4)
print(prec)

0.8823529411764706


In [2]:
# Your code here to calculate recall
# We want to focus on recall when missing a true positive (a false
# negative) is very costly and a false positive is much less costly.
# For example, if being marked as positive would result in life-saving
# treatment with no side effects, we would want 100% recall even at
# the cost of producing a lot of false positives.
# Recall = TP/(TP + FN)
recall = 30/(30 + 12)
print(recall)

0.7142857142857143


In [4]:
# Your code here to calculate F-1 score
#F-1 score is the harmonic mean of recall and precision. It is useful
#as one overall number to evaluate your model.
#F-1 = 2 * ((precision*recall)/(precision + recall))
f1_score = 2*((prec*recall)/(prec + recall))
print(f1_score)

0.7894736842105262


<img src = "visuals/many_roc.png" width = "700">

### 2.2) Pick the best ROC curve from the above graph and explain your choice. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In [None]:
"""
I would choose the All Features model from the ROC curve above. At each
point except a false positive rate of 0 it outperforms the other two
models. Unless you were focused exclusively on specificity, for any
given true or false positive rate it performs better on the other
metric than either of the other two models.
"""

The following cell includes code to train and evaluate a model

In [5]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f'The original classifier has an area under the ROC curve of {auc}.')

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 2.3) The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [6]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

In [10]:
print(257/270)

0.9518518518518518


In [7]:

"""
The 0.956 accuracy score is so high that it is hard to believe. Looking
at the target values, we can see that a model that simply marked every
observation as 0 would have an accuracy of .952. This is due to a large
class imbalance in our target and shows that our model is doing very
little to improve upon a baseline model that only outputs 0.
"""


### 2.4) What methods would you use to address the issues in Question 2.3? 

In [None]:
"""
To adjust for this class imbalance we could over-sample the minority
class or under-sample the majority class. Under-sampling is not a good
candidate for this data, as we have very few data points and removing
any of them would lose a large amount of information. I would
recommend over-sampling to help improve our classifier. We could do
this either via random resampling or SMOTE. Resampling is more
straightforward but simply creates duplicates of minority class rows.
SMOTE instead creates new synthetic data points that are similar, but
not identical, to our minority class elements.
"""

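Random over-sampling, the simpler of the two options named in 2.4, can be sketched with the standard library alone. Everything here is an assumption made for illustration: the function name `random_oversample`, the placeholder feature rows, and the fixed seed. Only the class counts (257 and 13) mirror the `y.value_counts()` output above. In practice the duplication would normally be applied to the training split only, so that the test set keeps its original class balance.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size.

    Assumes a binary target; rows and labels are parallel lists.
    """
    rng = random.Random(seed)
    minority = [row for row, label in zip(rows, labels) if label == minority_label]
    n_majority = len(labels) - len(minority)
    # Draw (with replacement) as many extra minority rows as are needed to balance
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    new_rows = list(rows) + extra
    new_labels = list(labels) + [minority_label] * len(extra)
    return new_rows, new_labels

# Placeholder data with the same class counts as the target above
labels = [0] * 257 + [1] * 13
rows = [[float(i)] for i in range(len(labels))]

new_rows, new_labels = random_oversample(rows, labels, minority_label=1)
print("class 0:", new_labels.count(0), "class 1:", new_labels.count(1))
```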
---
## Part 3: Decision Trees [Suggested Time: 15 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 3.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

In [16]:
"""
Attribute A resulted in the best split of the data. We can see that
by splitting on A and predicting the majority class we would achieve an
accuracy of 13/15 in each node. This is significantly better than the
50/50 chance we started with. If instead we chose attribute B our
accuracy would be 8/15, which is barely better. When choosing how to
split for a classification model such as the one above, we want to
maximize information gain (the decrease in entropy) or minimize Gini
impurity, which measures how mixed the classes in a node are. Below we
can see that a node from splitting on A has a much lower Gini impurity
than a node from splitting on B.

"""
gini_A = 1 - ((13/15)**2 + (2/15)**2)
print(gini_A)
gini_B = 1 - ((8/15)**2 + (7/15)**2)
print(gini_B)

0.23111111111111104
0.49777777777777776
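
The two Gini values above each describe a single child node. A splitting criterion compares whole splits, so the child impurities are weighted by the share of samples in each branch. The sketch below uses the class counts from the diagram. The helper names (`gini`, `entropy`, `weighted_impurity`) are hypothetical and not part of scikit-learn. It also computes the entropy-based information gain mentioned in the answer.

```python
from math import log2

def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def weighted_impurity(branches, impurity):
    """Average child impurity, weighted by the number of samples in each branch."""
    n = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / n * impurity(branch) for branch in branches)

parent = (15, 15)              # 15 positive, 15 negative
split_a = [(13, 2), (2, 13)]   # (+, -) counts in each branch of A
split_b = [(8, 7), (7, 8)]     # (+, -) counts in each branch of B

gini_split_a = weighted_impurity(split_a, gini)
gini_split_b = weighted_impurity(split_b, gini)
gain_a = entropy(parent) - weighted_impurity(split_a, entropy)
gain_b = entropy(parent) - weighted_impurity(split_b, entropy)

print("A: weighted Gini", round(gini_split_a, 4), "information gain", round(gain_a, 4))
print("B: weighted Gini", round(gini_split_b, 4), "information gain", round(gain_b, 4))
```

Because both splits here are symmetric, the weighted Gini of each split equals the single-node values printed above. With unequal branch sizes the weighting would matter.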


### Decision Trees for Regression 

In this section, you will use decision trees to fit a regression model to the Combined Cycle Power Plant dataset. 

This dataset is from the UCI ML Dataset Repository, and has been included in the `raw_data` folder of this repository as an Excel `.xlsx` file, `'Folds5x2_pp.xlsx'`. 

The features of this dataset consist of hourly average ambient variables taken from various sensors located around a power plant that record the ambient variables every second.  
- Temperature (AT) 
- Ambient Pressure (AP) 
- Relative Humidity (RH)
- Exhaust Vacuum (V) 

The target to predict is the net hourly electrical energy output (PE). 

The features and target variables are not normalized.

In the cells below, we import `pandas` and `numpy` for you, and we load the data into a pandas DataFrame. We also include code to inspect the first five rows and get the shape of the DataFrame.

In [17]:
# Run this cell without changes

import pandas as pd 
import numpy as np 

# Load the data
filename = 'raw_data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

In [18]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


In [19]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(9568, 5)

Before fitting any models, you need to create training and test splits for the data.

Below, we split the data into features and target (`'PE'`) for you. 

In [20]:
# Run this cell without changes

X = df.drop(columns=['PE'], axis=1)
y = df['PE']

### 3.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [24]:
# Replace None with appropriate code  

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5,random_state =1)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(4784, 4)
(4784,)
(4784, 4)
(4784,)


### 3.3) Fit a vanilla decision tree regression model with scikit-learn to the training data. Set `random_state=1` for reproducibility. Evaluate the model on the test data.

For the rest of this section feel free to refer to the scikit-learn documentation on [decision tree regressors](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). 

In [80]:
# Your code here 
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state =1)
tree_model = tree.fit(X_train,y_train)



### 3.4) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [81]:
# Your code imports here
from sklearn.metrics import r2_score

predict_train = tree.predict(X_train)
predict_test = tree.predict(X_test)

error_train = abs(y_train - predict_train)
error_test = abs(y_test - predict_test)
square_error_base_train = error_train**2
square_error_base_test = error_test**2

mae_base_train = sum(error_train)/len(y_train)
mae_base_test = sum(error_test)/len(y_test)

mse_base_train = sum(square_error_base_train)/len(y_train)
mse_base_test = sum(square_error_base_test)/len(y_test)


r2_metric = r2_score(y_test,predict_test)
r2_calc = tree_model.score(X_test,y_test)
#The model has a very high r^2 value but also a relatively large testing
#error. I didn't specify max depth and the model went to each leaf having
#one output so it is clearly overfit. When I reduced the max depth
#it performed better on the testing set
# Replace None with appropriate code 

print('Mean Squared Error:', mse_base_test)
print('Mean Absolute Error:', mae_base_test)
print('R-squared:', r2_metric)
print('R-squared:', r2_calc)

Mean Squared Error: 22.21215773411365
Mean Absolute Error: 3.2235451505016752
R-squared: 0.9250521988398296
R-squared: 0.9250521988398297


Hint: MSE should be about 22.21 

### Hyperparameter Tuning of Decision Trees for Regression

### 3.5) Add hyperparameters to a new decision tree and fit it to our training data. Evaluate the model with the test data.

In [82]:
# I added max_depth and min_samples_leaf as hyperparameters

tree2 = DecisionTreeRegressor(random_state =1, max_depth = 10,min_samples_leaf = 4)
tree_model2 = tree2.fit(X_train,y_train)



### 3.6) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. Did this improve your previous model? (It's ok if it didn't)

In [83]:
# Your code here
predict_train2 = tree2.predict(X_train)
predict_test2 = tree2.predict(X_test)

error_train = abs(y_train - predict_train2)
error_test = abs(y_test - predict_test2)
square_error_base_train = error_train**2
square_error_base_test = error_test**2

mae_base_train = sum(error_train)/len(y_train)
mae_base_test = sum(error_test)/len(y_test)

mse_base_train = sum(square_error_base_train)/len(y_train)
mse_base_test = sum(square_error_base_test)/len(y_test)


r2_metric = r2_score(y_test,predict_test2)
r2_calc = tree_model2.score(X_test,y_test)
#This model still has a high r^2 value, and the testing error is lower
#than for the first tree. Here I limited max_depth and set
#min_samples_leaf, so the tree can no longer keep splitting until each
#leaf holds a single training sample.
# Replace None with appropriate code 

print('Mean Squared Error:', mse_base_test)
print('Mean Absolute Error:', mae_base_test)
print('R-squared:', r2_metric)
print('R-squared:', r2_calc)

Mean Squared Error: 18.342379892325802
Mean Absolute Error: 3.143917476309396
R-squared: 0.9381095228374401
R-squared: 0.9381095228374401


In [None]:
"""
Adding max_depth and min_samples_leaf improved all metrics on the
test set. MSE and MAE both went down and R^2 went up! The first model
was overfit to the training data, while the second one was constrained
and so generalized better to the testing data.
"""

---
## Part 4: Ensemble Methods [Suggested Time: 10 min]
---

### Random Forests and Hyperparameter Tuning using GridSearchCV

In this section, you will perform hyperparameter tuning for a Random Forest classifier using GridSearchCV. You will use scikit-learn's wine dataset to classify wines into one of three different classes. 

After finding the best estimator, you will interpret the best model's feature importances. 

In the cells below, we have loaded the relevant imports and the wine data for you. 

In [84]:
# Run this cell without changes

# Relevant imports 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)

In the cells below, we inspect the first five rows of the DataFrame and compute the DataFrame's shape.

In [85]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


In [86]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

We also get descriptive statistics for the dataset features, and obtain the distribution of classes in the dataset. 

In [87]:
# Run this cell without changes
# Get descriptive statistics for the features
X.describe()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


In [88]:
# Run this cell without changes
# Obtain distribution of classes
y.value_counts().sort_index()

0    59
1    71
2    48
Name: target, dtype: int64

You will now perform hyperparameter tuning for a Random Forest classifier.

In the cell below, we include the relevant imports for you.

In [89]:
# Run this cell without changes

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### 4.1) Create an instance of a Random Forest classifier estimator. Call the instance `rfc`. 

Make sure to set `random_state=42` for reproducibility. 

In [90]:
# Replace None with appropriate code
rfc = RandomForestClassifier(random_state = 42)

### 4.2) Construct a `param_grid` dictionary to pass to `GridSearchCV` when instantiating the object. 

Choose at least three hyperparameters to tune, and at least three values for each.

In [95]:
# Replace None with appropriate code 
n_estimators = [10,20,50,100]
# Note: 'auto' (equivalent to 'sqrt' for classifiers) was removed in scikit-learn 1.3
max_features = ['auto','sqrt','log2']
min_samples_leaf = [1,2,4,8]
min_samples_split = [2,4,8]
param_grid = {'n_estimators': n_estimators, 
              'max_features': max_features,
              'min_samples_leaf': min_samples_leaf,
              'min_samples_split': min_samples_split}
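
It can help to estimate how much work a grid implies before fitting it. The sketch below counts the combinations in a grid with the same values as the one above, using only `itertools.product`. The variable names `grid`, `combinations`, `n_folds` and `total_fits` are illustrative, and the fold count of 5 is taken from the instructions in 4.3.

```python
from itertools import product

# Same candidate values as the param_grid defined above
grid = {
    'n_estimators': [10, 20, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4, 8],
    'min_samples_split': [2, 4, 8],
}

# Every combination of one value per hyperparameter
combinations = list(product(*grid.values()))
n_folds = 5
total_fits = len(combinations) * n_folds

print("combinations:", len(combinations))
print("cross-validation fits:", total_fits)
```

An exhaustive grid search trains one model per combination per fold, so each value added to any list multiplies the total. With its default `refit=True`, `GridSearchCV` also fits the best combination once more on the full data.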

Now that you have created the `param_grid` dictionary of hyperparameters, let's continue performing hyperparameter optimization of a Random Forest Classifier. 

### 4.3) Create an instance of a `GridSearchCV` object and fit it to the data. Call the instance `cv_rfc`. 

- Use the random forest classification estimator you instantiated above, the parameter grid dictionary you constructed, and make sure to perform 5-fold cross validation. 
- The fitting process should take 10-15 seconds to complete. 

In [96]:
# Replace None with appropriate code 
cv_rfc = GridSearchCV(estimator = rfc,param_grid = param_grid,cv=5) 

cv_rfc.fit(X, y)




GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'

In [98]:
cv_rfc.best_params_

{'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 2,
 'n_estimators': 100}

In [None]:
#Out of the parameters I sent, GridSearch chose 'auto' for max_features,
#a minimum of 2 samples per terminal node, at least 2 samples to split
#a node, and 100 trees in the forest (n_estimators)