<h1 align="center">Module 4 Assessment</h1>

## Overview

This assessment is designed to test your understanding of the Mod 4 material. It covers:

* Calculus, Cost Function, and Gradient Descent
* Extensions to Linear Models
* Introduction to Logistic Regression


Read the instructions carefully. You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, please use your own words. The expectation is that you have not copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to communicate clearly.

---
## Calculus, Cost Function, and Gradient Descent [Suggested Time: 25 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot up above can be generalized in the following equation: $$y = mx + b$$

Of all the possible lines, we can prove why that particular line was chosen using the plot down below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(actual - expected)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 

### 1. What is a more generalized name for the RSS function above? What are the parameters of RSS? How is it related to machine learning models?

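RSS is a cost function (also called a loss function). Its parameters are the slope $m$ and the intercept $b$ of the line. A machine learning model is fit by searching for the parameter values that minimize its cost function, here the $m$ and $b$ that give the lowest RSS.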
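
To make the idea of RSS as a function of $m$ and $b$ concrete, here is a minimal sketch. The helper name `rss` and the toy data are assumptions made for illustration; they are not the data behind the plots above.

```python
def rss(m, b, xs, ys):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data that lies exactly on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

rss_perfect = rss(2.0, 0.0, xs, ys)  # the line passes through every point
rss_worse = rss(1.5, 0.0, xs, ys)    # the slope is too shallow
print(rss_perfect, rss_worse)
```

The data stays fixed while `m` and `b` vary, which is why the cost curve is plotted against the parameter rather than against `x`.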


### 2. Would you rather choose a $m$ value of 0.09 or 0.06 from the RSS curve up above?   What is the relation between the position on the cost curve, the error, and the slope of the line?

I would choose 0.06 because it is closer to the minimum of the cost curve, which is where the error is minimized. The bottom of the parabola, between 0.04 and 0.06, marks the slope of the line of best fit (where RSS is at its lowest). By contrast, 0.09 is further from the minimum and has a higher RSS than 0.06.

![](visuals/gd.png)

### 3. Using the gradient descent visual from above, explain why the distance between each step is getting smaller as more steps occur with gradient descent.

Gradient descent is an approach to minimizing error by moving toward a local minimum of the cost curve, that is, toward the slope and y-intercept that minimize the cost function. To find the minimum we step down the cost curve, and the size of each step is proportional to the slope of the curve at the current point. When the slope is steep, we are likely far from the minimum, so we take a larger step. As the slope flattens, the steps get smaller as we approach the minimum. The goal is to reach the point where the slope of the tangent line is close to 0, since the curve is flat at the minimum.
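
The shrinking steps can be sketched on a toy cost curve. The cost $J(m) = (m - 3)^2$, the starting point, and the learning rate below are all assumptions chosen for illustration, not values taken from the visual.

```python
def gradient(m):
    # Derivative of the toy cost J(m) = (m - 3)**2
    return 2 * (m - 3)

m = 0.0
learning_rate = 0.1
step_sizes = []
for _ in range(5):
    step = learning_rate * gradient(m)  # step is proportional to the slope
    m -= step
    step_sizes.append(abs(step))

print(step_sizes)
```

The learning rate never changes here. Each step is smaller than the last only because the gradient itself shrinks as `m` gets closer to the minimum.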

### 4. What is the purpose of a learning rate in gradient descent? Explain how a very small and a very large learning rate would affect the gradient descent.

The learning rate controls how large a step we take along the downward slope toward a local minimum. A very small learning rate converges slowly and may need many iterations, but each step is unlikely to overshoot the minimum. A larger learning rate can reach the neighborhood of a minimum in fewer steps, but if it is too large the steps overshoot the minimum, so the loss can oscillate or even increase from one step to the next.
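
A rough sketch of this effect, again on an assumed toy cost $J(m) = (m - 3)^2$. The function name `distance_from_minimum` and the three learning rates are hypothetical choices for illustration only.

```python
def distance_from_minimum(learning_rate, n_steps=20, start=0.0):
    """Run gradient descent on the toy cost J(m) = (m - 3)**2 and
    report how far the final m is from the minimum at m = 3."""
    m = start
    for _ in range(n_steps):
        m -= learning_rate * 2 * (m - 3)  # gradient of J is 2 * (m - 3)
    return abs(m - 3)

too_small = distance_from_minimum(0.01)  # creeps toward the minimum
moderate = distance_from_minimum(0.4)    # converges quickly
too_large = distance_from_minimum(1.1)   # overshoots further on every step
print(too_small, moderate, too_large)
```

With the same number of steps, the smallest rate is still far from the minimum, the moderate rate has essentially arrived, and the largest rate ends up further away than where it started.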

---
## Extensions to Linear Regression [Suggested Time: 25 min]
---

In this section, you're going to be creating linear models that are more complicated than a simple linear regression. In the cells below, we are importing relevant modules that you might need later on. We also load and prepare the dataset for you.

In [1]:
import pandas as pd
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.linear_model import Lasso, Ridge
import pickle
from sklearn.metrics import mean_squared_error, roc_curve, roc_auc_score, accuracy_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('raw_data/advertising.csv').drop('Unnamed: 0',axis=1)
data.describe()

|       | TV | radio | newspaper | sales |
| --- | --- | --- | --- | --- |
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | 147.0425 | 23.264 | 30.554 | 14.0225 |
| std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
| min | 0.7 | 0.0 | 0.3 | 1.6 |
| 25% | 74.375 | 9.975 | 12.75 | 10.375 |
| 50% | 149.75 | 22.9 | 25.75 | 12.9 |
| 75% | 218.825 | 36.525 | 45.1 | 17.4 |
| max | 296.4 | 49.6 | 114.0 | 27.0 |


In [3]:
X = data.drop('sales', axis=1)
y = data['sales']

In [5]:
# split the data into training and testing set. Do not change the random state please!
X_train , X_test, y_train, y_test = train_test_split(X, y, random_state=2019)


### 1. We'd like to add a bit of complexity to the model created in the example above, and we will do it by adding some polynomial terms. Write a function to calculate train and test error for different polynomial degrees.

This function should:
* take `degree` as a parameter that will be used to create polynomial features to be used in a linear regression model
* create a PolynomialFeatures object for each degree and fit a linear regression model using the transformed data
* calculate the mean square error for each level of polynomial
* return the `train_error` and `test_error` 


In [None]:
import math

def errors(x_values, y_values, m, b):
    y_line = (b + m*x_values)
    return (y_values - y_line)

def squared_errors(x_values, y_values, m, b):
    return np.round(errors(x_values, y_values, m, b)**2, 2)

def residual_sum_squares(x_values, y_values, m, b):
    return round(sum(squared_errors(x_values, y_values, m, b)), 2)

def root_mean_squared_error(x_values, y_values, m, b):
    # RMSE is the square root of the mean squared error: sqrt(RSS / n)
    return round(math.sqrt(sum(squared_errors(x_values, y_values, m, b)) / len(x_values)), 2)


In [None]:
# didn't have time to finish this one

In [None]:
def polynomial_regression(degree):
    """
    Calculate train and test error for a linear regression with polynomial features.
    (Hint: use PolynomialFeatures)
    
    input: Polynomial degree
    output: Mean squared error for train and test set
    """
    # // your code here //
    
    train_error = None
    test_error = None
    return train_error, test_error
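
One possible way to fill in a function like this is sketched below. It is not the reference solution: the name `polynomial_errors`, the choice to pass the train/test splits in as arguments, and the synthetic demo data are all assumptions made so the example runs on its own, and it is not guaranteed to reproduce the check values for the advertising data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def polynomial_errors(degree, X_train, X_test, y_train, y_test):
    """Return (train MSE, test MSE) for a linear regression on polynomial features."""
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)  # fit the transformer on train only
    X_test_poly = poly.transform(X_test)
    model = LinearRegression().fit(X_train_poly, y_train)
    train_error = mean_squared_error(y_train, model.predict(X_train_poly))
    test_error = mean_squared_error(y_test, model.predict(X_test_poly))
    return train_error, test_error

# Hypothetical synthetic data with a quadratic relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(120, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.1, size=120)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

train_1, test_1 = polynomial_errors(1, Xtr, Xte, ytr, yte)
train_2, test_2 = polynomial_errors(2, Xtr, Xte, ytr, yte)
print(train_1, test_1)
print(train_2, test_2)
```

Fitting `PolynomialFeatures` on the training split and only transforming the test split keeps the test data out of the fitting step.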

#### Try out your new function

In [None]:
polynomial_regression(3)

#### Check your answers

MSE for degree 3:
- Train: 0.2423596735839209
- Test: 0.15281375973923944

MSE for degree 4:
- Train: 0.18179109317368244
- Test: 1.9522597174462015

### 2. What is the optimal number of degrees for our polynomial features in this model? In general, how does increasing the polynomial degree relate to the Bias/Variance tradeoff?  (Note that this graph shows RMSE and not MSE.)

<img src ="visuals/rsme_poly_2.png" width = "300">

<!---
fig, ax = plt.subplots(figsize=(7, 7))
degree = list(range(1, 10 + 1))
ax.plot(degree, error_train[0:len(degree)], "-", label="Train Error")
ax.plot(degree, error_test[0:len(degree)], "-", label="Test Error")
ax.set_yscale("log")
ax.set_xlabel("Polynomial Feature Degree")
ax.set_ylabel("Root Mean Squared Error")
ax.legend()
ax.set_title("Relationship Between Degree and Error")
fig.tight_layout()
fig.savefig("visuals/rsme_poly.png",
            dpi=150,
            bbox_inches="tight")
--->

The optimal degree is 3. At degree 3 the test MSE (about 0.15) is close to the train MSE (about 0.24), whereas at degree 4 the train MSE drops slightly (to about 0.18) but the test MSE jumps to about 1.95, which is a sign that the higher degree is overfitting.

Increasing the polynomial degree of a model helps decrease its underfitting (bias), and also its RMSE, but there comes a point when increasing the degree too much results in an increase in test RMSE (error) as the model begins to overfit and becomes too sensitive to the fluctuations/variance of the training set.

### 3. In general, what methods can you use to reduce overfitting and underfitting? Provide an example for both and explain how each technique works to reduce the problems of underfitting and overfitting.

One way to reduce overfitting is to collect more data to increase the training sample size. However, this option is not always feasible, since you often have to work with the data you already have.


Another option is to adjust the degree of polynomial features in a model to see which degree polynomial performs best.

In the case of underfitting, the model misses interaction effects or fails to capture a polynomial relationship, so adding polynomial features could benefit the model.

In the case of overfitting, the model has fit the random noise and small fluctuations of the training set, so it is too sensitive to those nuances and does not generalize to other test data. Decreasing the polynomial degree would therefore make the model less sensitive to the noise.


Another option is to use regularization techniques, specifically for overfitting. As a model's complexity (and number of features) increases, overfitting becomes more likely. So instead of deleting predictors, we can use regularization techniques, such as lasso and ridge, that use penalized estimation to reduce coefficients and make them less sensitive to noise in data. In fact, Lasso punishes coefficients so severely that it has a tendency to set coefficients to 0 and perform feature selection - which helps reduce overfitting issues.


### 4. What is the difference between the two types of regularization for linear regression?

L1 and L2 regularization correspond with Lasso and Ridge respectively. Both use penalized estimation to reduce the values of coefficients to make them less sensitive to noise, which helps reduce model complexity and prevent overfitting.

The differences are that:

1) Ridge (L2) shrinks coefficients toward zero but rarely sets them exactly to zero, so every feature stays in the model. Lasso (L1) can set coefficients exactly to zero, which makes it more useful when there are many features and only some of them matter.

2) The penalty terms are different: Ridge adds lambda times the sum of the squared coefficients to the cost function, whereas Lasso adds lambda times the sum of the coefficients' absolute values.

3) The L1 penalty grows at a constant rate even for small coefficients, while the L2 penalty flattens out near zero. This underlies the reason why Lasso has a tendency to set coefficients to 0, thus performing feature selection.
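
The feature-selection behavior can be seen in a small sketch. The synthetic data (ten features, only the first two related to the target) and the `alpha` values are assumptions for illustration; in scikit-learn, `alpha` plays the role of lambda.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 10 features, but only the first two drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that were set exactly to zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```

Lasso drops uninformative features by zeroing their coefficients, while Ridge keeps every coefficient nonzero and only shrinks it.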


### 5. Why is scaling input variables a necessary step before regularization?

Regularization penalizes large coefficients. The size of a coefficient depends on the units of its feature: a feature measured on a small scale needs a larger coefficient to have the same effect on the prediction, so it would be penalized more heavily than a feature measured on a large scale. Scaling the input variables before regularization puts all coefficients on a comparable footing, so that none is penalized just because of its units.
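
A minimal sketch of why units matter, assuming synthetic data and a hypothetical helper `ridge_predictions`. The same feature is expressed in two different units, and Ridge is fit with and without `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

# Same information, but the first feature is recorded in units 1000x smaller
X_units = X_demo.copy()
X_units[:, 0] *= 1000

def ridge_predictions(X, scale):
    """Fit Ridge (optionally after standardizing) and predict on the same data."""
    steps = [StandardScaler()] if scale else []
    model = make_pipeline(*steps, Ridge(alpha=10.0))
    return model.fit(X, y_demo).predict(X)

# Do the two unit choices give the same fitted model?
raw_match = np.allclose(ridge_predictions(X_demo, False), ridge_predictions(X_units, False))
scaled_match = np.allclose(ridge_predictions(X_demo, True), ridge_predictions(X_units, True))
print(raw_match, scaled_match)
```

Without scaling, an arbitrary change of units changes how strongly the penalty bites and therefore changes the fitted model. With scaling, the result no longer depends on the units.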

---
## Introduction to Logistic Regression [Suggested Time: 25 min]
---

<!---
# load data
ads_df = pd.read_csv("raw_data/social_network_ads.csv")

# one hot encode categorical feature
def is_female(x):
    """Returns 1 if Female; else 0"""
    if x == "Female":
        return 1
    else:
        return 0
        
ads_df["Female"] = ads_df["Gender"].apply(is_female)
ads_df.drop(["User ID", "Gender"], axis=1, inplace=True)
ads_df.head()

# separate features and target
X = ads_df.drop("Purchased", axis=1)
y = ads_df["Purchased"]

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

# preprocessing
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# save preprocessed train/test split objects
pickle.dump(X_train, open("write_data/social_network_ads/X_train_scaled.pkl", "wb"))
pickle.dump(X_test, open("write_data/social_network_ads/X_test_scaled.pkl", "wb"))
pickle.dump(y_train, open("write_data/social_network_ads/y_train.pkl", "wb"))
pickle.dump(y_test, open("write_data/social_network_ads/y_test.pkl", "wb"))

# build model
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

from sklearn.metrics import confusion_matrix

# create confusion matrix
# tn, fp, fn, tp
cnf_matrix = confusion_matrix(y_test, y_test_pred)
cnf_matrix

# build confusion matrix plot
plt.imshow(cnf_matrix,  cmap=plt.cm.Blues) #Create the basic matrix.

# Add title and Axis Labels
plt.title('Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')

# Add appropriate Axis Scales
class_names = set(y_test) #Get class labels to add to matrix
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)

# Add Labels to Each Cell
thresh = cnf_matrix.max() / 2. #Used for text coloring below
#Here we iterate through the confusion matrix and append labels to our visualization.
for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, cnf_matrix[i, j],
                 horizontalalignment="center",
                 color="white" if cnf_matrix[i, j] > thresh else "black")

# Add a Side Bar Legend Showing Colors
plt.colorbar()

# Add padding
plt.tight_layout()
plt.savefig("visuals/cnf_matrix.png",
            dpi=150,
            bbox_inches="tight")
--->

![cnf matrix](visuals/cnf_matrix.png)

### 1. Using the confusion matrix up above, calculate precision, recall, and F-1 score.

In [7]:
TP = 30
FP = 12
TN = 54
FN = 4
observations = TP + FP + TN + FN
observations

100

In [8]:
accuracy = (TP + TN) / observations
print('Accuracy:', accuracy)

Accuracy: 0.84


In [9]:
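# precision: of everything predicted positive, how much was actually positive
precision = TP / (TP + FP)
print('Precision:', precision)

Precision: 0.7142857142857143


In [10]:
# recall: of everything actually positive, how much the model found
recall = TP / (TP + FN)
print('Recall:', recall)

Recall: 0.8823529411764706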


In [11]:
f1_score = 2 * ((precision * recall) / (precision + recall))
print('F1-Score:', f1_score)

F1-Score: 0.7894736842105262
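
As a cross-check, the same metrics can be computed with scikit-learn, which defines precision as TP / (TP + FP) and recall as TP / (TP + FN). This sketch assumes the counts TP = 30, FP = 12, TN = 54, FN = 4 used above and rebuilds label arrays from them; the `sk_` variable names are hypothetical.

```python
from sklearn import metrics

TP, FP, TN, FN = 30, 12, 54, 4

# Rebuild true and predicted labels that reproduce the four counts
y_true = [1] * TP + [0] * FP + [0] * TN + [1] * FN
y_pred = [1] * TP + [1] * FP + [0] * TN + [0] * FN

sk_precision = metrics.precision_score(y_true, y_pred)
sk_recall = metrics.recall_score(y_true, y_pred)
sk_f1 = metrics.f1_score(y_true, y_pred)
print(sk_precision, sk_recall, sk_f1)
```

The F1-score is symmetric in precision and recall, so it comes out the same even if the two are accidentally swapped.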


### 2.  What is a real life example of when you would care more about recall than precision? Make sure to include information about errors in your explanation.

You would care more about recall if you were a doctor treating an outbreak of Ebola who wanted to identify every patient with the disease, even if that meant a higher rate of false positives or Type 1 errors (treating people who were incorrectly identified as having Ebola even though they did not). In this setting a false negative or Type 2 error (missing someone who is actually infected) is far more costly.

Pushing for higher recall, for example by lowering the classification threshold, generally leads to more false positives/Type 1 errors.

<!---
# load preprocessed train/test split objects
X_train = pickle.load(open("write_data/social_network_ads/X_train_scaled.pkl", "rb"))
X_test = pickle.load(open("write_data/social_network_ads/X_test_scaled.pkl", "rb"))
y_train = pickle.load(open("write_data/social_network_ads/y_train.pkl", "rb"))
y_test = pickle.load(open("write_data/social_network_ads/y_test.pkl", "rb"))

# build model
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
y_train_pred = model.predict(X_train)

labels = ["Age", "Estimated Salary", "Female", "All Features"]
colors = sns.color_palette("Set2")
plt.figure(figsize=(10, 8))
# add one ROC curve per feature
for feature in range(3):
    # female feature is one hot encoded so it produces an ROC point rather than a curve
    # for this reason, female will not be included in the plot at all since it is
    # disingenuous to call it a curve.
    if feature == 2:
        pass
    else:
        X_train_feat = X_train[:, feature].reshape(-1, 1)
        X_test_feat = X_test[:, feature].reshape(-1, 1)
        logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='lbfgs')
        model_log = logreg.fit(X_train_feat, y_train)
        y_score = model_log.decision_function(X_test_feat)
        fpr, tpr, thresholds = roc_curve(y_test, y_score)
        lw = 2
        plt.plot(fpr, tpr, color=colors[feature],
                 lw=lw, label=labels[feature])

# add one ROC curve with all the features
model_log = logreg.fit(X_train, y_train)
y_score = model_log.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
lw = 2
plt.plot(fpr, tpr, color=colors[3], lw=lw, label=labels[3])

# create foundation of the plot
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.yticks([i / 20.0 for i in range(21)])
plt.xticks([i / 20.0 for i in range(21)])
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve")
plt.legend()
plt.tight_layout()
plt.savefig("visuals/many_roc.png",
            dpi=150,
            bbox_inches="tight")
--->

### 3. Pick the best ROC curve from this graph and explain your choice. 

*Note: each ROC curve represents one model, each labeled with the feature(s) inside each model*.

<img src = "visuals/many_roc.png" width = "700">


I would pick the ROC curve for all features because it is closest to the upper left-hand corner, thus illustrating the best tradeoff between true positive rate and false positive rate for the given classifier, and it has the largest AUC (area under the curve), which indicates it is the closest of the three to perfect performance.
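
To illustrate what a larger AUC means, here is a small sketch with made-up labels and scores (all values below are hypothetical). A classifier that ranks every positive above every negative gets an AUC of 1.0, and each misordered positive/negative pair lowers it.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores_perfect = [0.1, 0.2, 0.8, 0.9]  # both positives outrank both negatives
scores_mixed = [0.1, 0.4, 0.35, 0.8]   # one negative outranks one positive

auc_perfect = roc_auc_score(y_true, scores_perfect)
auc_mixed = roc_auc_score(y_true, scores_mixed)
print(auc_perfect, auc_mixed)
```

In the second case three of the four positive/negative pairs are ordered correctly, which is where the lower AUC comes from.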

<!---
# sorting by 'Purchased' and then dropping the last 130 records
dropped_df = ads_df.sort_values(by="Purchased")[:-130]
dropped_df.reset_index(inplace=True)
pickle.dump(dropped_df, open("write_data/sample_network_data.pkl", "wb"))
--->

In [12]:
network_df = pickle.load(open("write_data/sample_network_data.pkl", "rb"))

# partition features and target 
X = network_df.drop("Purchased", axis=1)
y = network_df["Purchased"]

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver="lbfgs")
model.fit(X_train,y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f"The original classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.")

# get the area under the curve from an ROC curve
y_score = model.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = round(roc_auc_score(y_test, y_score), 3)
print(f"The original classifier has an area under the ROC curve of {auc}.")

The original classifier has an accuracy score of 0.956.
The original classifier has an area under the ROC curve of 0.836.


### 4. The model above has an accuracy score that might be too good to believe. Using `y.value_counts()`, explain how `y` is affecting the accuracy score.

In [14]:
print(y.value_counts())
print(y.value_counts(normalize=True))

0    257
1     13
Name: Purchased, dtype: int64
0    0.951852
1    0.048148
Name: Purchased, dtype: float64


Examining the value counts for the dependent variable (y) reveals a class imbalance problem. About 95% of the observations are "not purchased" (0), while only about 5% are "purchased" (1).

The formula for accuracy is:
accuracy = (number of true positives + number of true negatives) / total observations

We observed a high accuracy score because accuracy counts true negatives as well as true positives, and in this imbalanced dataset almost every observation is a negative. A model that predicted "not purchased" for every observation would already be about 95% accurate, so the high score mostly reflects how well the model predicts true negatives. However, I am assuming in this case that we would want a model that is better at predicting true positives, as figuring out why something was "purchased" would be the business case to solve.
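
The effect of the imbalance can be checked with simple arithmetic. This sketch uses the counts printed by `y.value_counts()` above and imagines a hypothetical "model" that always predicts the majority class. The counts are for the full `y` rather than the test split, so the result is only a rough benchmark for the 0.956 score.

```python
n_negative, n_positive = 257, 13  # counts from y.value_counts()

# A baseline that predicts 0 ("not purchased") for every observation
baseline_accuracy = n_negative / (n_negative + n_positive)
baseline_recall = 0 / n_positive  # it never identifies a single purchase

print(round(baseline_accuracy, 3), baseline_recall)
```

An accuracy near this baseline says little about whether the classifier can actually find the purchases.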

### 5. What methods would you use to address the issues mentioned up above in question 4? 


We can address class imbalance problems by:
1) Using the class weight parameter in SKLearn's logistic regression function,
2) Oversampling the minority group (in this case 1) with replacement to balance the 0 and 1 values,
3) Under-sampling the majority group (in this case 0) to balance the 0 and 1 values, or
4) Using SMOTE to generate synthetic samples of the minority class (in this case 1) to oversample the minority class.

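I would try all of these methods of balancing the data and then compare how each performs. Because accuracy is misleading on imbalanced data, I would compare recall, precision, and F1-score for the minority class alongside the ROC curves and AUC.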
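
As a minimal sketch of the first option, the example below compares a plain logistic regression with one that uses `class_weight="balanced"`. The synthetic one-feature dataset (950 negatives, 50 positives) is an assumption for illustration, and recall is measured on the training data only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical imbalanced data: 95% class 0, 5% class 1
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=0.0, size=(950, 1))
X_pos = rng.normal(loc=1.5, size=(50, 1))
X_demo = np.vstack([X_neg, X_pos])
y_demo = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)
balanced = LogisticRegression(solver="lbfgs", class_weight="balanced").fit(X_demo, y_demo)

# Recall for the minority class (1) under each model
recall_plain = recall_score(y_demo, plain.predict(X_demo))
recall_balanced = recall_score(y_demo, balanced.predict(X_demo))
print(recall_plain, recall_balanced)
```

Weighting the classes makes errors on the minority class cost more during fitting, which typically raises recall for that class at the price of more false positives.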