# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of the Module 3 materials. It covers:

* Calculus, Cost Function, and Gradient Descent
* Logistic Regression
* Decision Trees
* Ensemble Methods 

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to communicate clearly.

---
## Part 1: Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot up above can be generalized in the following equation: $$y = mx + b$$

Of all the possible lines, we can see why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual} - \text{expected})^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
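
As a rough illustration of how a cost curve like this is traced out, the sketch below computes RSS for several candidate slopes while holding the intercept fixed. The data points, the intercept value, and the helper name `rss` are made up for this example and are not taken from the plots in this notebook.

```python
def rss(m, b, xs, ys):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

# Hold the intercept fixed at 1 and sweep the slope
for m in [1.0, 1.5, 2.0, 2.5, 3.0]:
    print(m, rss(m, 1, xs, ys))
```

RSS is smallest at the slope that generated the data and grows as `m` moves away from it in either direction, which gives the curve its bowl shape.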

### 1.1) What is a more generalized name for the RSS curve above? How is it related to machine learning models?

In [None]:
"""
The RSS curve is the cost function (also called the loss function). It measures the variance that is not explained by
the model and essentially tells us how good the model is at predicting for a given m-value. When assigning coefficients,
we want to choose the values that minimize the cost function, meaning our predicted values are as close as possible to
the actual values. In the graph above, the cost curve reaches its minimum at around 0.05, which is the m-value we would
want to use in our regression model.
"""

### 1.2) Would you rather choose a $m$ value of 0.08 or 0.05 from the RSS curve up above?   What is the relation between the position on the cost curve, the error, and the slope of the line?

In [None]:
"""
In this instance, I would want to use 0.05 as my m-value or slope. This is where the cost function is at its lowest,
meaning the RSS is at its smallest and there is the minimum amount of variance between the predicted values and the 
actual values. At 0.08, there is more variance between the predicted values and the actuals: the RSS
at 0.08 is ~8,000, whereas at 0.05 the RSS is ~2,000.
"""

![](visuals/gd.png)
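
The descent pictured above can be sketched numerically. The example assumes a toy cost function f(m) = (m - 3)^2, a learning rate of 0.1, and a starting point of 0; none of these values come from the visual itself.

```python
def gradient(m):
    """Derivative of the toy cost f(m) = (m - 3)**2."""
    return 2 * (m - 3)

learning_rate = 0.1  # assumed value for illustration
m = 0.0              # assumed starting point
step_sizes = []

for _ in range(5):
    step = learning_rate * gradient(m)  # the step is proportional to the gradient
    m = m - step                        # update rule: m_new = m - alpha * gradient
    step_sizes.append(abs(step))

print(step_sizes)
print(m)
```

Each recorded step is smaller than the one before because the gradient shrinks as `m` approaches the minimum at 3.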

### 1.3) Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

In [None]:
"""
The steps get smaller as the slope of the cost function gets flatter. Each step is the gradient (how steep the curve is
at the current point) multiplied by the learning rate, and the new parameter is the current parameter less that step.
As the cost function flattens out near the minimum, the gradient shrinks, so the steps shrink until gradient descent
converges.
"""

### 1.4) What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

In [None]:
"""
The learning rate, or alpha, scales the size of the change we apply to a given parameter at each step.
Choosing a learning rate that is too big risks overshooting the minimum, or even diverging, even though we may get near
it faster. Choosing a learning rate that is too small can make it slow and computationally expensive to reach the
minimum. Gradient descent aims to reach the minimum value on the cost curve in as few steps as possible.
"""

---
## Part 2: Logistic Regression [Suggested Time: 25 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 2.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer

In [1]:
# Your code here to calculate precision

# precision is out of the number of observations deemed positive, how many actually were?
# precision = tp/ (tp +fp)

precision = 30/ (30 + 4)
precision

0.8823529411764706

In [2]:
# Your code here to calculate recall

# recall is out of all the positive cases in my data, how many did my model correctly identify.
# recall = tp/ (tp +fn)

recall = 30/ (30 + 12)
recall

0.7142857142857143

In [3]:
# Your code here to calculate F-1 score

# F1 Score is the harmonic mean between precision & recall.
# F1_Score = 2 * ((Precision * Recall)/ (Precision + Recall))

f1_score = 2 * ((precision * recall)/ (precision + recall))
f1_score

0.7894736842105262

<img src = "visuals/many_roc.png" width = "700">
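
Each point on a ROC curve comes from one classification threshold. As a small sketch, the helper below (a hypothetical `roc_point` function, applied to made-up labels and scores) computes the false positive rate and true positive rate at a few thresholds.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical labels and predicted scores
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]

for threshold in [0.75, 0.5, 0.25]:
    print(threshold, roc_point(y_true, scores, threshold))
```

Lowering the threshold moves up and to the right along the curve: more true positives are captured at the cost of more false positives.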

### 2.2) Pick the best ROC curve from the above graph and explain your choice. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In [None]:
"""
The best ROC curve above is the pink line denoting all features. The ROC curve shows the trade-off between the true
positive rate (recall) and the false positive rate across all possible thresholds. An ROC curve that hugs the upper
left-hand corner is a good indication of a model's ability to distinguish between the classes. In this
instance, the pink line captures the highest true positive rate at any given false positive rate. Depending on what
we're solving for, we can choose a threshold that admits more false positives - one where we are okay having a lower
precision score in favor of a higher recall score - or optimize precision by choosing a threshold with minimal false
positives, causing our recall to fall.
"""

The following cell includes code to train and evaluate a model

In [4]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target 
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f'The original classifier has an area under the ROC curve of {auc}.')

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 2.3) The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [5]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64
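
One way to put the accuracy score in context is to compare it with a majority-class baseline. The sketch below uses the counts printed above; note that they describe the full `y`, while the reported accuracy was measured on the test split, so the comparison is only approximate.

```python
class_counts = {0: 257, 1: 13}  # copied from the y.value_counts() output

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(majority_class, round(baseline_accuracy, 3))
```

A classifier that always predicts the majority class would score roughly this value, so it is the figure a useful model needs to beat.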

In [None]:
"""
For this data, there is a class imbalance problem. There are only 13 observations that are true versus 257
that are false, so the positive class is only 4.8% of the data. A model that always predicted 0 would already be
about 95% accurate (257/270), so a high accuracy score says little about how well the minority class is identified.
Many of the algorithms we use implicitly assume the classes are roughly balanced, which can bias a model towards
the majority class and make it a bad predictor for the minority class. If the effect size between the two classes is
minimal, a model won't be able to pick up on the nuances and correctly distinguish the minority class from the majority.
"""

### 2.4) What methods would you use to address the issues in Question 2.3? 

In [None]:
"""
1. Resampling - either undersampling the majority class or oversampling the minority class. The problem with
undersampling is that you can lose valuable data even in a massive dataset. Oversampling the minority class also comes
with its own issues. When you oversample, you are creating replicas of existing data, which can lead to overfitting and
may not prove effective when introducing new data to the model.

2. SMOTE - SMOTE generates synthetic data points for the minority class by interpolating between an existing minority
point and its nearest minority-class neighbors.

3. Tomek Links - removes pairs of points from the majority and minority classes that are very close to each other,
creating a bigger effect size between the two classes. Once these "close" data points are removed, the model has a
greater chance of distinguishing the classes.
"""

---
## Part 3: Decision Trees [Suggested Time: 15 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 3.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.
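
One common splitting criterion for classification trees is Gini impurity. As a rough sketch, the helpers below (hypothetical names `gini` and `weighted_gini`) compute the size-weighted impurity of each split from the class counts described above.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative items."""
    total = pos + neg
    p_pos = pos / total
    p_neg = neg / total
    return 1 - p_pos ** 2 - p_neg ** 2

def weighted_gini(branches):
    """Average the branch impurities, weighting each branch by its share of the items."""
    total = sum(pos + neg for pos, neg in branches)
    return sum((pos + neg) / total * gini(pos, neg) for pos, neg in branches)

split_a = [(13, 2), (2, 13)]  # (positive, negative) counts per branch for attribute A
split_b = [(8, 7), (7, 8)]    # (positive, negative) counts per branch for attribute B

print(round(weighted_gini(split_a), 3))  # 0.231
print(round(weighted_gini(split_b), 3))  # 0.498
```

The undivided dataset (15 positive, 15 negative) has an impurity of 0.5, so the split with the lower weighted impurity is the one that separates the classes more cleanly.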

In [None]:
"""
Attribute A resulted in the best split of the data. We can tell by looking at the Gini impurity of each split, since
this is a classification tree. At each node, a decision tree greedily makes the best immediate split according to the
criterion, in this case Gini, even if that is not the best choice for the tree overall. Attribute A has a
lower Gini impurity, meaning its split is purer, whereas attribute B has a Gini impurity close to 0.5, indicating the
split is barely purer than the original data.

In a regression tree, we would use the MSE to determine where to split. The tree splits where the resulting MSE
is lowest.
"""

### Decision Trees for Regression 

In this section, you will use decision trees to fit a regression model to the Combined Cycle Power Plant dataset. 

This dataset is from the UCI ML Dataset Repository, and has been included in the `raw_data` folder of this repository as an Excel `.xlsx` file, `'Folds5x2_pp.xlsx'`. 

The features of this dataset consist of hourly average ambient variables taken from various sensors located around a power plant that record the ambient variables every second.  
- Temperature (AT) 
- Ambient Pressure (AP) 
- Relative Humidity (RH)
- Exhaust Vacuum (V) 

The target to predict is the net hourly electrical energy output (PE). 

The features and target variables are not normalized.

In the cells below, we import `pandas` and `numpy` for you, and we load the data into a pandas DataFrame. We also include code to inspect the first five rows and get the shape of the DataFrame.

In [6]:
# Run this cell without changes

import pandas as pd 
import numpy as np 

# Load the data
filename = 'raw_data/Folds5x2_pp.xlsx'
df = pd.read_excel(filename)

In [7]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

|   | AT | V | AP | RH | PE |
|---|---|---|---|---|---|
| 0 | 14.96 | 41.76 | 1024.07 | 73.17 | 463.26 |
| 1 | 25.18 | 62.96 | 1020.04 | 59.08 | 444.37 |
| 2 | 5.11 | 39.4 | 1012.16 | 92.14 | 488.56 |
| 3 | 20.86 | 57.32 | 1010.24 | 76.64 | 446.48 |
| 4 | 10.82 | 37.5 | 1009.23 | 96.62 | 473.9 |


In [8]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(9568, 5)

Before fitting any models, you need to create training and test splits for the data.

Below, we split the data into features and target (`'PE'`) for you. 

In [9]:
# Run this cell without changes

X = df.drop(columns=['PE'], axis=1)
y = df['PE']

### 3.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [10]:
# Replace None with appropriate code  

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

### 3.3) Fit a vanilla decision tree regression model with scikit-learn to the training data. Set `random_state=1` for reproducibility. Evaluate the model on the test data.

For the rest of this section feel free to refer to the scikit-learn documentation on [decision tree regressors](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html). 

In [12]:
# Your code here 

from sklearn.tree import DecisionTreeRegressor

reg_tree = DecisionTreeRegressor(random_state=1)
reg_tree.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')

### 3.4) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [15]:
# Your code imports here
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

y_test_pred = reg_tree.predict(X_test)

# Replace None with appropriate code 

print('Mean Squared Error:', mean_squared_error(y_test, y_test_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_test_pred))
print('R-squared:', r2_score(y_test, y_test_pred))

Mean Squared Error: 22.212157734113713
Mean Absolute Error: 3.2235451505016726
R-squared: 0.9250521988398296


Hint: MSE should be about 22.21 

### Hyperparameter Tuning of Decision Trees for Regression

### 3.5) Add hyperparameters to a new decision tree and fit it to our training data. Evaluate the model with the test data.

In [18]:
# Your code here 

hyp_tree = DecisionTreeRegressor(random_state = 1, max_depth = 3, min_samples_split = 10)
hyp_tree.fit(X_train, y_train)

y_test_pred = hyp_tree.predict(X_test)

### 3.6) Obtain the mean squared error, mean absolute error, and coefficient of determination (r2 score) of the predictions on the test set. Did this improve your previous model? (It's ok if it didn't)

In [19]:
# Your code here

print('Mean Squared Error:', mean_squared_error(y_test, y_test_pred))
print('Mean Absolute Error:', mean_absolute_error(y_test, y_test_pred))
print('R-squared:', r2_score(y_test, y_test_pred))

Mean Squared Error: 26.619897534087666
Mean Absolute Error: 4.018921825303297
R-squared: 0.9101796947792781


In [None]:
"""
My MSE increased in my second model. My MAE also increased and my R-squared value decreased. This is because I tuned
my hyperparameters to restrict how deep my tree could grow. By doing so I limited the variance of my tree, creating
more bias in my second model. Decision trees have a tendency to overfit, resulting in high variance. While correcting
for the high variance can improve performance on the testing data, in this particular instance the
hyperparameter values I chose did not improve upon the first model.
"""

---
## Part 4: Ensemble Methods [Suggested Time: 10 min]
---

### Random Forests and Hyperparameter Tuning using GridSearchCV

In this section, you will perform hyperparameter tuning for a Random Forest classifier using GridSearchCV. You will use scikit-learn's wine dataset to classify wines into one of three different classes. 

After finding the best estimator, you will interpret the best model's feature importances. 

In the cells below, we have loaded the relevant imports and the wine data for you. 

In [20]:
# Run this cell without changes

# Relevant imports 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)

In the cells below, we inspect the first five rows of the DataFrame and compute the DataFrame's shape.

In [21]:
# Run this cell without changes
# Inspect the first five rows of the DataFrame
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [22]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

We also get descriptive statistics for the dataset features, and obtain the distribution of classes in the dataset. 

In [23]:
# Run this cell without changes
# Get descriptive statistics for the features
X.describe()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |


In [24]:
# Run this cell without changes
# Obtain distribution of classes
y.value_counts().sort_index()

0    59
1    71
2    48
Name: target, dtype: int64

You will now perform hyperparameter tuning for a Random Forest classifier.

In the cell below, we include the relevant imports for you.

In [25]:
# Run this cell without changes

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### 4.1) Create an instance of a Random Forest classifier estimator. Call the instance `rfc`. 

Make sure to set `random_state=42` for reproducibility. 

In [34]:
# Replace None with appropriate code
rfc = RandomForestClassifier(random_state=42)

### 4.2) Construct a `param_grid` dictionary to pass to `GridSearchCV` when instantiating the object. 

Choose at least three hyperparameters to tune, and at least three values for each.

In [35]:
# Replace None with appropriate code 

# GridSearchCV tries every combination of the values below and keeps the best-scoring one

# number of trees in my forest
n_estimators = [10, 50, 100]

# number of features to consider at each split ('auto' behaves like 'sqrt' for classifiers)
max_features = ['auto', 'sqrt', 'log2']

# minimum number of samples required at a leaf node
min_samples_leaf = [2, 5, 10]

# should the model use bootstrapped data?
bootstrap = [True, False]

param_grid = {'n_estimators': n_estimators,
             'max_features': max_features,
             'min_samples_leaf': min_samples_leaf,
             'bootstrap': bootstrap}

Now that you have created the `param_grid` dictionary of hyperparameters, let's continue performing hyperparameter optimization of a Random Forest Classifier. 
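
Before running a grid search it can help to estimate how much work it involves. The sketch below, which copies the grid values from the cell above into a standalone dictionary, counts the hyperparameter combinations and the number of model fits under 5-fold cross validation.

```python
from itertools import product

# Same values as the param_grid defined above
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_features': ['auto', 'sqrt', 'log2'],
    'min_samples_leaf': [2, 5, 10],
    'bootstrap': [True, False],
}

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
n_folds = 5

print(len(combinations))            # 54 combinations
print(len(combinations) * n_folds)  # 270 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best combination on the full data once the search finishes.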

### 4.3) Create an instance of a `GridSearchCV` object and fit it to the data. Call the instance `cv_rfc`. 

- Use the random forest classification estimator you instantiated above, the parameter grid dictionary you constructed, and make sure to perform 5-fold cross validation. 
- The fitting process should take 10-15 seconds to complete. 

In [37]:
# Replace None with appropriate code 
cv_rfc = GridSearchCV(estimator = rfc, param_grid = param_grid, cv = 5) 

cv_rfc.fit(X, y)



GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators='warn', n_jobs=None,
                                              oob_score=False, random_state=42,
                                              verbose=0, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'

In [38]:
cv_predict = cv_rfc.predict(X)
cv_predict

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

In [40]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y, cv_predict)
cm

array([[59,  0,  0],
       [ 0, 71,  0],
       [ 0,  0, 48]])
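
The confusion matrix above can be summarized as an accuracy score. The sketch below copies the matrix into a plain list of lists. Because `cv_predict` was produced from the same `X` that `cv_rfc` was fitted on, this is accuracy on the training data; the cross-validated score stored in `cv_rfc.best_score_` is usually a less optimistic estimate.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = [[59, 0, 0],
      [0, 71, 0],
      [0, 0, 48]]

correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal holds correct predictions
total = sum(sum(row) for row in cm)
accuracy = correct / total

print(correct, total, accuracy)
```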