# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of the Module 3 materials. It covers:

* Calculus, Cost Function, and Gradient Descent
* Logistic Regression
* Decision Trees
* Ensemble Methods 

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

---
## Part 1: Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot up above can be generalized in the following equation: $$y = mx + b$$

Of all the possible lines, the plot below shows why that particular line was chosen:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual}_i - \text{predicted}_i)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 

### 1.1) What is a more generalized name for the RSS curve above? How is it related to machine learning models?

The more general name for this curve is the cost function (also called the loss function). It matters for machine learning because training a model amounts to searching for the parameter values that give the lowest value on the cost curve, that is, the model with the least error between its predicted values and the true values.
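
As a rough illustration of how a cost curve is traced out, the sketch below evaluates RSS at several candidate slopes. The data points, the fixed intercept of 0, and the candidate slopes are all made up for illustration; they are not the data behind the plot above.

```python
def rss(m, b, xs, ys):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying roughly along y = 2x (not the data in the plot)
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]

# Evaluate the cost at a few candidate slopes with the intercept fixed at 0
candidates = [1.0, 1.5, 2.0, 2.5, 3.0]
costs = {m: rss(m, 0.0, xs, ys) for m in candidates}
best_m = min(costs, key=costs.get)

for m in candidates:
    print(f"m = {m:.1f}  RSS = {costs[m]:.2f}")
print("lowest-cost slope among candidates:", best_m)
```

Plotting `costs` against `candidates` would give a bowl-shaped curve like the one above, with the lowest point at the best slope.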

### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? What is the relation between the position on the cost curve, the error, and the slope of the line?

I would choose the $m$ value of 0.05 because it has the lowest RSS value and sits at the minimum of the cost curve. This means it is the $m$ value that gives the lowest error between the model's predictions and the true values. The minimum is located by finding the point where the slope of the cost curve equals 0; a zero slope means the curve is neither ascending nor descending, so the point is a minimum or a maximum. Here it is a minimum, which is the goal in machine learning because we want the lowest error.
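
The "slope equals zero" condition can be written out explicitly. As a sketch, differentiating the RSS defined above with respect to $m$ (holding $b$ fixed) gives:

$$
\begin{align}
\frac{\partial RSS}{\partial m} &= \frac{\partial}{\partial m}\sum_{i=1}^n(y_i - (mx_i + b))^2 \\
&= -2\sum_{i=1}^n x_i\,(y_i - (mx_i + b))
\end{align}
$$

Setting this expression to zero identifies the stationary point. Because RSS is a convex quadratic in $m$ (as long as not every $x_i$ is zero), that point is the minimum rather than a maximum.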

![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

This follows from the gradient descent update rule: each step is the learning rate multiplied by the slope (gradient) of the cost curve at the current point. Far from the minimum the slope is steep, so the steps are large; as you approach the minimum the slope flattens toward zero, so the steps shrink. This is also intuitive, since you would not want to take large steps once you are close to the minimum.
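
The shrinking steps can be seen in a minimal sketch of gradient descent on a toy cost, $f(m) = (m - 3)^2$. The cost function, learning rate, and starting point here are assumptions chosen for illustration, not values taken from the visual.

```python
def gradient(m):
    """Derivative of the toy cost f(m) = (m - 3)**2."""
    return 2 * (m - 3)

learning_rate = 0.1  # assumed value for illustration
m = 0.0              # assumed starting point
step_sizes = []

for _ in range(5):
    step = learning_rate * gradient(m)  # step is proportional to the slope
    m -= step
    step_sizes.append(abs(step))

print([round(s, 4) for s in step_sizes])
```

Each recorded step is smaller than the one before it, even though the learning rate never changes, because the gradient itself gets smaller as `m` approaches 3.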

### 1.4) What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

The learning rate controls the step size so that gradient descent can find the minimum efficiently. If the learning rate is too large, each step jumps too far across the cost curve, and the algorithm may overshoot repeatedly and never settle at the point where the slope equals 0. If it is very small, reaching the minimum takes a very long time, which makes the search computationally expensive and inefficient. A moderate learning rate is therefore preferable, although the best value depends on the problem and is usually found by experimentation.
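
To make the trade-off concrete, here is a small sketch that runs the same number of gradient descent steps with three learning rates on a toy cost, $f(m) = (m - 3)^2$. The cost function and the three rates are assumptions for illustration; the numbers are only meaningful for this toy function.

```python
def run_gradient_descent(learning_rate, start=0.0, n_steps=20):
    """Minimise the toy cost f(m) = (m - 3)**2 and return the final m."""
    m = start
    for _ in range(n_steps):
        m -= learning_rate * 2 * (m - 3)  # 2 * (m - 3) is the gradient
    return m

# Very small, moderate, and very large learning rates (hypothetical values)
for lr in (0.001, 0.1, 1.1):
    final_m = run_gradient_descent(lr)
    print(f"learning rate {lr}: m after 20 steps = {final_m:.3f}")
```

With the smallest rate, `m` has barely moved from its starting point after 20 steps; with the moderate rate it is close to the minimum at 3; with the largest rate each step overshoots by more than the last, and `m` moves further away from the minimum.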

---
## Part 2: Logistic Regression [Suggested Time: 25 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 2.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer.

In [1]:
# Your code here to calculate precision
TP = 30
FP = 4
p = TP/(TP+FP)
p

0.8823529411764706

In [2]:
# Your code here to calculate recall
FN = 12
r = TP/(FN+TP)
r

0.7142857142857143

In [4]:
# Your code here to calculate F-1 score
F1 = 2 * (r*p)/(r+p)
F1

0.7894736842105262

<img src = "visuals/many_roc.png" width = "700">

### 2.2) Pick the best ROC curve from the above graph and explain your choice. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

The pink (all features) curve is the best ROC curve because it has the largest area under the curve (AUC), meaning this model achieves the highest true positive rate for a given false positive rate across thresholds. In other words, using all of the features gives the model the best ability to separate positives from negatives.
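
One way to read the area under the ROC curve is as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes AUC that way for two hypothetical models; the labels and scores are made up and are unrelated to the curves in the graph.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Hypothetical labels and scores for two made-up models
y_true = [0, 0, 1, 0, 1, 1]
scores_a = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]   # mostly ranks positives higher
scores_b = [0.5, 0.6, 0.4, 0.7, 0.45, 0.65]  # ranks positives poorly

auc_a = auc_from_scores(y_true, scores_a)
auc_b = auc_from_scores(y_true, scores_b)
print(f"model A AUC: {auc_a:.3f}")
print(f"model B AUC: {auc_b:.3f}")
```

The model with the larger AUC is the one whose scores separate the two classes better, which is the basis for preferring one ROC curve over another.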

The following cell includes code to train and evaluate a model

In [5]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f'The original classifier has an area under the ROC curve of {auc}.')

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 2.3) The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [6]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

`y` affects the accuracy score because the classes are heavily imbalanced: class 0 (the majority class) has 257 observations and class 1 (the minority class) has only 13. A model can reach high accuracy simply by predicting the majority class almost every time, since there are so few minority observations in the dataset. However, such a model has a low F1 score for the positive class because it rarely identifies positives, so it is of little use if the goal is to predict when a positive will occur.
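
A quick calculation shows how high accuracy can be without any learning at all. The sketch below uses the class counts from `y.value_counts()` above and scores a hypothetical "model" that always predicts class 0. Note that the counts describe the full dataset, while the reported accuracy comes from the test split, so the comparison is approximate.

```python
# Class counts taken from y.value_counts() above
n_negative = 257
n_positive = 13

# A "model" that always predicts class 0 gets every negative right
# and every positive wrong.
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive  # no positives are ever predicted

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"majority-class baseline recall:   {baseline_recall:.3f}")
```

The baseline accuracy comes out at roughly 0.952, so an accuracy of 0.956 is only a small improvement over always guessing the majority class.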

### 2.4) What methods would you use to address the issues in Question 2.3? 

Common methods for addressing class imbalance include resampling, SMOTE, and Tomek links. 
In this case I would oversample the minority class, which means duplicating minority observations until the majority and minority classes have the same number of observations.

Another method I would consider is SMOTE, which creates synthetic data points for the minority class based on the characteristics of the real minority observations.
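
As a minimal sketch of random oversampling, the function below duplicates minority rows until both classes are the same size. The function name, its parameters, and the toy data are hypothetical; it uses only the standard library and is not the API of any resampling package.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate minority rows at random until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [(r, l) for r, l in zip(rows, labels) if l == minority_label]
    majority = [(r, l) for r, l in zip(rows, labels) if l != minority_label]
    n_extra = len(majority) - len(minority)
    extra = rng.choices(minority, k=n_extra)  # sample with replacement
    combined = majority + minority + extra
    rng.shuffle(combined)
    new_rows = [r for r, _ in combined]
    new_labels = [l for _, l in combined]
    return new_rows, new_labels

# Hypothetical training set: 8 negatives and 2 positives
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels, minority_label=1)
print(new_labels.count(0), new_labels.count(1))
```

Resampling should be applied to the training split only, so that duplicated or synthetic observations do not end up in the test set and inflate the evaluation.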

---
## Part 3: Decision Trees [Suggested Time: 15 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 3.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

The left tree is the better split, meaning attribute A is more indicative of whether an element is positive or negative. At each node you choose the attribute whose split produces the purest child nodes, which means minimizing a splitting criterion such as Gini impurity or entropy; ideally each child node contains only one class (positive or negative in this case). As the two trees above show, the left tree's terminal nodes are much purer than the right tree's.
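
The comparison can be checked numerically. The sketch below computes the Gini impurity of each split from the class counts given in the problem statement; lower impurity means purer child nodes. The helper names `gini` and `weighted_gini` are illustrative.

```python
def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(branches):
    """Size-weighted average Gini impurity across the child nodes of a split."""
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(b) for b in branches)

parent = gini([15, 15])                      # 15 positive, 15 negative
split_a = weighted_gini([(13, 2), (2, 13)])  # attribute A
split_b = weighted_gini([(8, 7), (7, 8)])    # attribute B

print(f"parent impurity: {parent:.3f}")
print(f"split on A: {split_a:.3f} (reduction {parent - split_a:.3f})")
print(f"split on B: {split_b:.3f} (reduction {parent - split_b:.3f})")
```

Splitting on A reduces the impurity far more than splitting on B, which barely changes it from the parent node's value.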

### Decision Trees for Regression 

In this section, you will use decision trees to fit a regression model to the Combined Cycle Power Plant dataset. 

This dataset is from the UCI ML Dataset Repository, and has been included in the `raw_data` folder of this repository as an Excel `.xlsx` file, `'Folds5x2_pp.xlsx'`. 

The features of this dataset consist of hourly average ambient variables taken from various sensors located around a power plant that record the ambient variables every second.  
- Temperature (AT) 
- Ambient Pressure (AP) 
- Relative Humidity (RH)
- Exhaust Vacuum (V) 

The target to predict is the net hourly electrical energy output (PE). 

The features and target variables are not normalized.

In the cells below, we import `pandas` and `numpy` for you, and we load the data into a pandas DataFrame. We also include code to inspect the first five rows and get the shape of the DataFrame.

In [8]:
# Run this cell without changes

import pandas as pd 
import numpy as np 

# Load the data
filename = 'raw_data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

In [9]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

|   | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


In [10]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(9568, 5)

Before fitting any models, you need to create training and test splits for the data.

Below, we split the data into features and target (`'PE'`) for you. 

In [11]:
# Run this cell without changes

X = df.drop(columns=['PE'], axis=1)
y = df['PE']

### 3.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [12]:
# Replace None with appropriate code  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state=1)

### 3.3) Fit a vanilla decision tree regression model with scikit-learn to the training data. Set `random_state=1` for reproducibility. Evaluate the model on the test data.

For the rest of this section feel free to refer to the scikit-learn documentation on [decision tree regressors](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). 

In [14]:
# Your code here 
from sklearn.tree import DecisionTreeRegressor
#instantiate decision tree
reg_tree = DecisionTreeRegressor(random_state = 1)
#Fit tree on training data
tree1 = reg_tree.fit(X_train, y_train)
#Use tree to predict
pred1 = tree1.predict(X_test)
#evaluate performance
result = sum((pred1 - y_test)**2)/len(y_test)
result

22.21215773411365

### 3.4) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [15]:
# Your code imports here
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import r2_score as r2

# Replace None with appropriate code 

print('Mean Squared Error:', mse(y_test, pred1))
print('Mean Absolute Error:', mae(y_test, pred1))
print('R-squared:', r2(y_test, pred1))

Mean Squared Error: 22.212157734113713
Mean Absolute Error: 3.2235451505016726
R-squared: 0.9250521988398296


Hint: MSE should be about 22.21 
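
For reference, the three metrics can also be written out by hand. The sketch below is a plain-Python version applied to a tiny made-up example; the function name and the toy values are hypothetical and are not drawn from the power plant data.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R-squared) for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    mse_value = ss_res / n
    mae_value = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    r2_value = 1.0 - ss_res / ss_tot
    return mse_value, mae_value, r2_value

# Hypothetical targets and predictions
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]
mse_toy, mae_toy, r2_toy = regression_metrics(y_true_toy, y_pred_toy)
print(mse_toy, mae_toy, r2_toy)
```

MSE and MAE are in the (squared) units of the target, so lower is better, while R-squared compares the model's squared error with that of always predicting the mean, so values closer to 1 are better.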

### Hyperparameter Tuning of Decision Trees for Regression

### 3.5) Add hyperparameters to a new decision tree and fit it to our training data. Evaluate the model with the test data.

In [17]:
# Your code here 
new_tree = DecisionTreeRegressor(random_state = 1, min_samples_leaf=25, min_samples_split=50, max_depth=5, max_features = 'sqrt')
#Fit tree on training data
tree2 = new_tree.fit(X_train, y_train)
#Use tree to predict
pred2 = tree2.predict(X_test)
#evaluate performance
result = sum((pred2 - y_test)**2)/len(y_test)
result

27.446845815151317

### 3.6) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. Did this improve your previous model? (It's ok if it didn't)

In [18]:
# Your code here
print('Mean Squared Error:', mse(y_test, pred2))
print('Mean Absolute Error:', mae(y_test, pred2))
print('R-squared:', r2(y_test, pred2))

Mean Squared Error: 27.446845815151338
Mean Absolute Error: 4.024093555168919
R-squared: 0.9073894230694874


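It did not improve. The MSE rose from about 22.21 to 27.45, the MAE rose from about 3.22 to 4.02, and the R-squared fell from about 0.925 to 0.907. The added constraints (such as `max_depth=5`) likely made the tree too simple to capture the patterns in this data.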

---
## Part 4: Ensemble Methods [Suggested Time: 10 min]
---

### Random Forests and Hyperparameter Tuning using GridSearchCV

In this section, you will perform hyperparameter tuning for a Random Forest classifier using GridSearchCV. You will use scikit-learn's wine dataset to classify wines into one of three different classes. 

After finding the best estimator, you will interpret the best model's feature importances. 

In the cells below, we have loaded the relevant imports and the wine data for you. 

In [19]:
# Run this cell without changes

# Relevant imports 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)

In the cells below, we inspect the first five rows of the DataFrame and compute the DataFrame's shape.

In [20]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [21]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

We also get descriptive statistics for the dataset features, and obtain the distribution of classes in the dataset. 

In [22]:
# Run this cell without changes
# Get descriptive statistics for the features
X.describe()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [23]:
# Run this cell without changes
# Obtain distribution of classes
y.value_counts().sort_index()

0    59
1    71
2    48
Name: target, dtype: int64

You will now perform hyperparameter tuning for a Random Forest classifier.

In the cell below, we include the relevant imports for you.

In [24]:
# Run this cell without changes

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### 4.1) Create an instance of a Random Forest classifier estimator. Call the instance `rfc`. 

Make sure to set `random_state=42` for reproducibility. 

In [25]:
# Replace None with appropriate code
rfc = RandomForestClassifier(random_state=42)

### 4.2) Construct a `param_grid` dictionary to pass to `GridSearchCV` when instantiating the object. 

Choose at least three hyperparameters to tune, and at least three values for each.

In [26]:
# create parameter values
n_estimators = [10, 100, 200]
max_features = ['auto', 'sqrt', 'log2', None]
max_depth = [3, 6, 9, None]
min_samples_leaf = [1, 5, 10]
bootstrap = [True, False]

param_grid = {'n_estimators': n_estimators,
'max_features' : max_features,
'max_depth' : max_depth,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
param_grid

{'n_estimators': [10, 100, 200],
 'max_features': ['auto', 'sqrt', 'log2', None],
 'max_depth': [3, 6, 9, None],
 'min_samples_leaf': [1, 5, 10],
 'bootstrap': [True, False]}

Now that you have created the `param_grid` dictionary of hyperparameters, let's continue performing hyperparameter optimization of a Random Forest Classifier. 

### 4.3) Create an instance of a `GridSearchCV` object and fit it to the data. Call the instance `cv_rfc`. 

- Use the random forest classification estimator you instantiated above, the parameter grid dictionary you constructed, and make sure to perform 5-fold cross validation. 
- The fitting process should take 10-15 seconds to complete. 
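
To get a feel for how much work the search involves, the sketch below counts the hyperparameter combinations in a grid like the one built in 4.2 and multiplies by the number of folds. The grid is repeated here so the snippet runs on its own; the variable names are illustrative.

```python
from itertools import product

# Same values as the param_grid built in 4.2, repeated so this runs standalone
grid = {
    'n_estimators': [10, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'max_depth': [3, 6, 9, None],
    'min_samples_leaf': [1, 5, 10],
    'bootstrap': [True, False],
}

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*grid.values())))
n_folds = 5  # 5-fold cross validation, as requested above
n_fits = n_combinations * n_folds

print(n_combinations, "combinations,", n_fits, "model fits")
```

Because the number of fits is the product of the list lengths times the number of folds, adding one more value to any hyperparameter increases the run time multiplicatively, which is worth keeping in mind when sizing a grid.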

In [27]:
# Replace None with appropriate code 
cv_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5) 

cv_rfc.fit(X, y)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'