# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of the Module 3 materials. It covers:

* Calculus, Cost Function, and Gradient Descent
* Logistic Regression
* Decision Trees
* Ensemble Methods 

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

---
## Part 1: Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line through the scatterplot above can be written in the general form: $$y = mx + b$$

Of all the possible lines, we can show why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual}_i - \text{predicted}_i)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 

### 1.1) What is a more generalized name for the RSS curve above? How is it related to machine learning models?

The RSS curve above is more generally known as a cost curve (or loss curve): it plots a cost function, here the RSS, against a model parameter, here the slope m of the linear regression line. The goal of a supervised machine learning model, and in this case linear regression, is to predict y values with as little error as possible relative to the known actual values. To find the line of best fit, we need the slope that produces the smallest RSS, and the cost curve shows us that value of m at its lowest point. For example, according to the graph above, that m value would be approximately 0.05.

### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve above? What is the relation between the position on the cost curve, the error, and the slope of the line?

An m value of 0.05 would be better than 0.08 because the RSS is lowest at that point. That means 0.05 is the slope of the linear regression line that best fits the data and minimizes the RSS, or error, between the predicted y values of the line and the actual y values (targets) in the data. Each position on the cost curve corresponds to one candidate slope, and its height is the error that slope produces, so moving toward the bottom of the curve means choosing a slope with less error.
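The idea behind the cost curve can be sketched in a few lines of code. The data below is hypothetical (the true slope of 0.05, the intercept of 2.0, and the noise level are all assumptions chosen for illustration, not the data behind the plot), and the intercept is held fixed so that the RSS depends on the slope alone.

```python
import numpy as np

# Hypothetical data: the true slope (0.05) and intercept (2.0) are assumptions
rng = np.random.default_rng(0)
x = np.linspace(0, 100, 50)
y = 0.05 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

def rss(m, b, x, y):
    """Residual sum of squares for the line y = m*x + b."""
    residuals = y - (m * x + b)
    return float(np.sum(residuals ** 2))

# Hold the intercept fixed and sweep candidate slopes to trace the cost curve
b = 2.0
candidate_slopes = np.linspace(0.0, 0.1, 101)
costs = np.array([rss(m, b, x, y) for m in candidate_slopes])

best_m = float(candidate_slopes[np.argmin(costs)])
print(f"slope with the lowest RSS: {best_m:.3f}")
print(f"RSS at m=0.05: {rss(0.05, b, x, y):.1f}, RSS at m=0.08: {rss(0.08, b, x, y):.1f}")
```

Plotting `costs` against `candidate_slopes` would give a bowl-shaped curve like the one in the visual, with its lowest point near the slope that generated the data.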

![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

The gradient steps are getting smaller because the derivative at each point, which is multiplied by the learning rate to give the step size, is getting smaller and closer to 0. The RSS is lowest at the very bottom of the curve, where the tangent line, whose slope is the derivative (the instantaneous rate of change), is horizontal and so has a slope of 0.
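A minimal sketch of this shrinking-step behavior, assuming a simple quadratic stand-in for the RSS curve with its minimum at m = 0.05. The starting guess and the learning rate are hypothetical values picked for illustration.

```python
def gradient(m, m_opt=0.05):
    """Derivative of the stand-in cost J(m) = (m - m_opt)**2."""
    return 2.0 * (m - m_opt)

learning_rate = 0.1  # hypothetical value
m = 0.10             # hypothetical starting guess
step_sizes = []
for _ in range(10):
    step = learning_rate * gradient(m)  # step size = learning rate * derivative
    m -= step
    step_sizes.append(abs(step))

print([f"{s:.5f}" for s in step_sizes])
print(f"m after 10 steps: {m:.4f}")
```

The learning rate never changes here. The steps shrink only because the derivative shrinks as m approaches the bottom of the curve.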

### 1.4) What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

The purpose of the learning rate is to scale the step size, and therefore the number of steps it takes to reach the lowest point on the cost curve. We can't see the whole cost curve, so in a way it is like walking around in the dark to figure out where you're going. We have to make an initial guess, and the sign of the derivative at that point tells us which direction leads down the curve. If the derivative is large in magnitude, we are likely farther from the lowest point and can take a large step; as the derivative shrinks, the steps get smaller so we don't overshoot the lowest point in the curve.

If you start out with a small learning rate and very small step sizes, then it will take hundreds and maybe even thousands of steps to get to the lowest point in the curve.  This will take a long time and you may not even get to the lowest point!

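If you start out with a big learning rate and take very large steps, then you will most likely overshoot the lowest point and end up on the other side of the cost curve. If the rate is large enough, each step can land farther from the minimum than the last, and gradient descent never converges.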

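There is no single appropriate learning rate, because it depends on the shape and scale of the cost function. Small values such as 0.001 to 0.1 are common starting points, which are then adjusted based on how quickly the cost decreases.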
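The effect of the learning rate can be illustrated with a small experiment. This is a sketch under stated assumptions: the cost is a quadratic stand-in, J(m) = (m - 0.05)^2, and the three learning rates are hypothetical values chosen to show slow progress, convergence, and divergence on this particular curve. The thresholds would differ for a cost function with a different curvature.

```python
def distance_after(learning_rate, steps=20, start=0.10, m_opt=0.05):
    """Run gradient descent on J(m) = (m - m_opt)**2 and return |m - m_opt|."""
    m = start
    for _ in range(steps):
        m -= learning_rate * 2.0 * (m - m_opt)  # derivative of J is 2 * (m - m_opt)
    return abs(m - m_opt)

too_small = distance_after(0.001)   # barely moves from the starting guess
moderate = distance_after(0.4)      # settles at the minimum
too_large = distance_after(1.1)     # overshoots further on every step

for name, dist in [("0.001", too_small), ("0.4", moderate), ("1.1", too_large)]:
    print(f"learning rate {name}: distance from minimum after 20 steps = {dist:.6f}")
```

The starting guess is 0.05 away from the minimum, so a final distance larger than 0.05 means the run ended up worse off than where it began.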


---
## Part 2: Logistic Regression [Suggested Time: 25 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 2.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer.

In [1]:
# precision: of all the predicted positives, the fraction that are actually positive
# P = TP / (TP + FP)
P = 30 / (30 + 4)

In [2]:
# recall: of all the actual positives, the fraction the model predicted as positive
# R = TP / (TP + FN)
R = 30 / (30 + 12)

In [3]:
# F-1 score
F1_Score = 2 * ((P*R)/(P+R))

In [4]:
F1_Score

0.7894736842105262
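As a cross-check, the same counts used in the cells above (TP = 30, FP = 4, FN = 12) can be worked by hand. The F-1 score can also be written directly in terms of the counts, which gives the same result:

$$
\begin{align}
P &= \frac{TP}{TP + FP} = \frac{30}{34} \approx 0.882 \\
R &= \frac{TP}{TP + FN} = \frac{30}{42} \approx 0.714 \\
F_1 &= \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{60}{76} \approx 0.789
\end{align}
$$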

<img src = "visuals/many_roc.png" width = "700">

### 2.2) Pick the best ROC curve from the above graph and explain your choice. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

The best ROC curve from the graph above is the pink line (for all features), which has the biggest AUC (Area Under the Curve). ROC and AUC are metrics used to evaluate a classification model, and they help you determine how effective a model is at distinguishing between classes. A perfect AUC is 1, an AUC of 0.5 is no better than random guessing, and an AUC below 0.5 means the model ranks the classes worse than chance. In this case, the AUC closest to 1 is the pink line.
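The reference points on the AUC scale can be checked with scikit-learn's `roc_auc_score`. The labels and scores below are hypothetical, made up only to produce a perfect ranking, a fully inverted ranking, and a ranking with no information.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, made up for illustration
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
perfect_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
inverted_scores = 1.0 - perfect_scores        # ranks every positive below every negative
constant_scores = np.full(y_true.shape, 0.5)  # carries no ranking information

auc_perfect = roc_auc_score(y_true, perfect_scores)
auc_inverted = roc_auc_score(y_true, inverted_scores)
auc_constant = roc_auc_score(y_true, constant_scores)
print(auc_perfect, auc_inverted, auc_constant)
```

An AUC near 0 is therefore not a "useless" model in the same sense as 0.5: flipping its predictions would give a near-perfect ranking.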

The following cell includes code to train and evaluate a model.

In [5]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target 
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f'The original classifier has an area under the ROC curve of {auc}.')

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 2.3) The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [6]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

There is a class imbalance between 0, which contains 257 observations, and 1, which contains 13 observations. This means that even if a model predicted every observation as 0, it would still look very accurate (about 95%) because it would only get 13 wrong out of 270.
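The baseline can be made concrete with a short sketch. It rebuilds only the class counts shown by `y.value_counts()` (the feature values are not needed) and scores a hypothetical "model" that always predicts the majority class.

```python
import numpy as np

# Rebuild the class counts reported by y.value_counts(): 257 zeros and 13 ones
y_counts = np.array([0] * 257 + [1] * 13)

# A "model" that ignores the features and always predicts the majority class
always_zero = np.zeros_like(y_counts)

baseline_accuracy = float(np.mean(always_zero == y_counts))
positives_found = int(np.sum((always_zero == 1) & (y_counts == 1)))
print(f"baseline accuracy: {baseline_accuracy:.3f}")  # 257 / 270
print(f"positive cases identified: {positives_found}")
```

This baseline is computed on the full dataset rather than the test split, but it shows why an accuracy in the mid-90s says little here: the baseline never identifies a single purchaser.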

### 2.4) What methods would you use to address the issues in Question 2.3? 

There are several ways to address this misleading accuracy score. One way is to use the additional metrics of precision, recall, and F1 score, which is the harmonic mean of precision and recall. This will give you a more complete picture of how the model is truly performing.

You can also choose to sample the data differently to structurally address this imbalance. Using resampling, you can oversample the minority class until it is the same size as the majority class. You can also undersample the majority class so it is the same size as the minority class. You can use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic minority-class data points in the training set, which can improve the model's generalizability. You can also use a method such as Tomek links, which identifies pairs of nearby data points from opposite classes and removes the majority-class member of each pair, so the majority class becomes smaller and the model is better able to distinguish between the two classes.
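Random oversampling, the simplest of these options, might look like the sketch below. It uses `sklearn.utils.resample` on a hypothetical training frame: the `feature` column is random noise and only the 257-to-13 class counts mirror the data above.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Hypothetical training data with the same 257-to-13 imbalance; the feature is random noise
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "feature": rng.normal(size=270),
    "Purchased": [0] * 257 + [1] * 13,
})

majority = train[train["Purchased"] == 0]
minority = train[train["Purchased"] == 1]

# Sample the minority class with replacement until it matches the majority class
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["Purchased"].value_counts())
```

Resampling should be applied to the training split only, so the test set keeps the real class distribution. A lighter-weight alternative is to pass `class_weight='balanced'` to `LogisticRegression`, which reweights the classes without changing the data.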

---
## Part 3: Decision Trees [Suggested Time: 15 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 3.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

Attribute A resulted in the best split, which we can determine using the Gini impurity criterion. Gini impurity measures how mixed the classes are within a node. For two classes it ranges from 0, meaning the node is completely pure, to 0.5, meaning the classes are evenly mixed. At each node, we select the attribute whose split gives the lowest weighted impurity across the resulting branches (equivalently, the largest reduction in impurity from the parent node). Attribute A produces branches that are much purer (13 to 2) than those of attribute B (8 to 7), where each branch is almost evenly mixed.
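The comparison can be verified numerically from the counts given in the problem statement. The helper names `gini` and `weighted_gini` below are illustrative, not part of any library.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative items."""
    total = pos + neg
    p_pos, p_neg = pos / total, neg / total
    return 1.0 - p_pos ** 2 - p_neg ** 2

def weighted_gini(branches):
    """Impurity of a split: each branch's Gini weighted by its share of the items."""
    n = sum(pos + neg for pos, neg in branches)
    return sum((pos + neg) / n * gini(pos, neg) for pos, neg in branches)

split_a = [(13, 2), (2, 13)]  # (positives, negatives) in each branch
split_b = [(8, 7), (7, 8)]

gini_a = weighted_gini(split_a)
gini_b = weighted_gini(split_b)
print(f"parent: {gini(15, 15):.3f}, split A: {gini_a:.3f}, split B: {gini_b:.3f}")
```

The parent node has the maximum two-class impurity of 0.5. Splitting on A cuts that roughly in half, while splitting on B leaves it almost unchanged.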

### Decision Trees for Regression 

In this section, you will use decision trees to fit a regression model to the Combined Cycle Power Plant dataset. 

This dataset is from the UCI ML Dataset Repository, and has been included in the `raw_data` folder of this repository as an Excel `.xlsx` file, `'Folds5x2_pp.xlsx'`. 

The features of this dataset are hourly averages of ambient variables, collected from sensors located around a power plant that record those variables every second.  
- Temperature (AT) 
- Ambient Pressure (AP) 
- Relative Humidity (RH)
- Exhaust Vacuum (V) 

The target to predict is the net hourly electrical energy output (PE). 

The features and target variables are not normalized.

In the cells below, we import `pandas` and `numpy` for you, and we load the data into a pandas DataFrame. We also include code to inspect the first five rows and get the shape of the DataFrame.

In [7]:
# Run this cell without changes

import pandas as pd 
import numpy as np 

# Load the data
filename = 'raw_data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

In [8]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

| | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


In [9]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(9568, 5)

Before fitting any models, you need to create training and test splits for the data.

Below, we split the data into features and target (`'PE'`) for you. 

In [10]:
# Run this cell without changes

X = df.drop(columns=['PE'], axis=1)
y = df['PE']

### 3.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [11]:
# Replace None with appropriate code  
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

In [12]:
X.head()

| | AT | V | AP | RH |
|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 |


### 3.3) Fit a vanilla decision tree regression model with scikit-learn to the training data. Set `random_state=1` for reproducibility. Evaluate the model on the test data.

For the rest of this section feel free to refer to the scikit-learn documentation on [decision tree regressors](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). 

In [16]:
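from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
# Fit a decision tree with default hyperparameters on the training data
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# 10-fold cross-validated R-squared scores, computed on the full dataset (X, y)
# rather than on the held-out test set; the test-set metrics are computed in 3.4
cross_val_score(regressor, X, y, cv=10)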

array([0.92336139, 0.92529127, 0.92427708, 0.92375014, 0.93000943,
       0.93506484, 0.92595705, 0.93831008, 0.93264554, 0.93985167])

In [30]:
y_pred = regressor.predict(X_test)

In [18]:
regressor.decision_path(X_train)

<4784x9523 sparse matrix of type '<class 'numpy.int64'>'
	with 76925 stored elements in Compressed Sparse Row format>

### 3.4) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [None]:
# Import the regression metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# Compare the test-set predictions (y_pred) against the true test targets (y_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('Mean Absolute Error:', mae)
print('R-squared:', r2)

Hint: MSE should be about 22.21 

### Hyperparameter Tuning of Decision Trees for Regression

### 3.5) Add hyperparameters to a new decision tree and fit it to our training data. Evaluate the model with the test data.

In [None]:
# Your code here 

### 3.6) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. Did this improve your previous model? (It's ok if it didn't)

In [None]:
# Your code here

I blanked on how to evaluate a decision tree model by comparing `y_pred` to `y_test`, so I didn't complete this section. I need to revisit this code so I can remember how to implement these steps. I did grab the code for the metrics, and I would use those metrics to determine which model is best. Decision tree regressors predict a continuous value, which is the mean target value of the training observations in a given region, or leaf. The model with the lower MSE and MAE and the higher r2 score is the better one.
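For reference, the steps for 3.5 and 3.6 could be sketched as follows. This is not a completed answer for the power plant data: it uses a synthetic regression dataset from `make_regression` as a stand-in so it runs anywhere, and the hyperparameter values (`max_depth=6`, `min_samples_leaf=10`) are hypothetical choices rather than tuned ones.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the power plant data (an assumption, so the sketch runs anywhere)
X_demo, y_demo = make_regression(n_samples=1000, n_features=4, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=1)

# Hypothetical hyperparameter values; they limit tree growth to reduce overfitting
tuned_tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=10, random_state=1)
tuned_tree.fit(X_tr, y_tr)

# Compare predictions on the held-out test set against the true targets
y_te_pred = tuned_tree.predict(X_te)
mse_tuned = mean_squared_error(y_te, y_te_pred)
mae_tuned = mean_absolute_error(y_te, y_te_pred)
r2_tuned = r2_score(y_te, y_te_pred)
print(mse_tuned, mae_tuned, r2_tuned)
```

With the real data, the same pattern would use `X_train`, `X_test`, `y_train`, and `y_test` from 3.2, and the resulting metrics would be compared against those of the vanilla tree from 3.4.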

---
## Part 4: Ensemble Methods [Suggested Time: 10 min]
---

### Random Forests and Hyperparameter Tuning using GridSearchCV

In this section, you will perform hyperparameter tuning for a Random Forest classifier using GridSearchCV. You will use scikit-learn's wine dataset to classify wines into one of three different classes. 

After finding the best estimator, you will interpret the best model's feature importances. 

In the cells below, we have loaded the relevant imports and the wine data for you. 

In [19]:
# Run this cell without changes

# Relevant imports 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)

In the cells below, we inspect the first five rows of the DataFrame and compute the DataFrame's shape.

In [20]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [21]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

We also get descriptive statistics for the dataset features, and obtain the distribution of classes in the dataset. 

In [22]:
# Run this cell without changes
# Get descriptive statistics for the features
X.describe()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [23]:
# Run this cell without changes
# Obtain distribution of classes
y.value_counts().sort_index()

0    59
1    71
2    48
Name: target, dtype: int64

You will now perform hyperparameter tuning for a Random Forest classifier.

In the cell below, we include the relevant imports for you.

In [24]:
# Run this cell without changes

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### 4.1) Create an instance of a Random Forest classifier estimator. Call the instance `rfc`. 

Make sure to set `random_state=42` for reproducibility. 

In [26]:
# Replace None with appropriate code
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=2, random_state=42)

### 4.2) Construct a `param_grid` dictionary to pass to `GridSearchCV` when instantiating the object. 

Choose at least three hyperparameters to tune, and at least three values for each.

In [27]:
rfc.get_params

<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)>

In [29]:
# Replace None with appropriate code 

criterion = ['gini', 'entropy']
max_depth = [2, 3, 4, 5, 6, 7]
min_samples_split = [2, 3, 4, 5, 6, 7]
min_samples_leaf = [2, 3, 4, 5, 6, 7]

param_grid = {'criterion': criterion,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf}

Now that you have created the `param_grid` dictionary of hyperparameters, let's continue performing hyperparameter optimization of a Random Forest Classifier. 

### 4.3) Create an instance of a `GridSearchCV` object and fit it to the data. Call the instance `cv_rfc`. 

- Use the random forest classification estimator you instantiated above, the parameter grid dictionary you constructed, and make sure to perform 5-fold cross validation. 
- The fitting process should take 10-15 seconds to complete. 

In [None]:
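# Replace None with appropriate code 
from sklearn.model_selection import GridSearchCV

# Use the rfc estimator and param_grid defined above, with 5-fold cross-validation
cv_rfc = GridSearchCV(rfc, param_grid, cv=5)

# Fit the grid search to the wine features and target
cv_rfc.fit(X, y)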
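Once a grid search has been fit, the best parameter combination and the best model's feature importances can be read from the fitted object. The sketch below is self-contained and makes a few assumptions for speed: it reloads the wine data under its own variable names and uses a deliberately small, hypothetical grid and `n_estimators=50` rather than the grid from 4.2.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

wine_data = load_wine()
X_wine = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y_wine = pd.Series(wine_data.target, name="target")

# A deliberately small, hypothetical grid so the search finishes quickly
small_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 3, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    small_grid,
    cv=5,
)
search.fit(X_wine, y_wine)

# best_estimator_ is refit on all the data using the best parameter combination
importances = pd.Series(
    search.best_estimator_.feature_importances_, index=X_wine.columns
).sort_values(ascending=False)
print(search.best_params_)
print(importances.head())
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction across the forest.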