# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line through the scatterplot above can be written in the following general form: $$y = mx + b$$

Of all the possible lines, we can show why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(actual - expected)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
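
As a minimal sketch of the RSS formula above, the function below evaluates it for a candidate line. The data points are made up for illustration (they are not taken from the scatterplot), and the function name `rss` is an assumption.

```python
def rss(xs, ys, m, b):
    """Residual sum of squares for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

rss_true = rss(xs, ys, m=2.0, b=1.0)  # the line that generated the data
rss_off = rss(xs, ys, m=1.5, b=1.0)   # a slope that is too small

print(rss_true, rss_off)
```

Evaluating `rss` over a range of `m` values is what traces out a curve like the one plotted above.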

### 1.1) What is a more generalized name for the RSS curve above? How could a machine learning model use this curve?

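This curve is more generally called a cost curve, a plot of the cost (or loss) function: it shows a model's loss, in this case RSS, as a function of a weight or coefficient value. A machine learning model can use this curve to find the coefficient that minimizes the loss. Gradient descent does this by following the slope of the curve downhill until the slope is approximately $0$. Because RSS is a convex function of the coefficient, that point is the minimum of the curve, so it corresponds to the coefficient with the lowest loss.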

### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? Explain your reasoning.

$0.05$ is a better _m_ value than $0.08$ because it corresponds to a lower value of RSS, or loss. The purpose of using gradient descent is to find the _m_ value with the lowest possible loss.

![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between estimates in each step is getting smaller as more steps occur with gradient descent.

The distance between successive estimates gets smaller because the slope (gradient) of the curve gets closer to $0$ as the estimates approach the minimum. Each step is the learning rate multiplied by the gradient, so a smaller gradient produces a smaller step. This lets the algorithm take large steps while it is far from the minimum, which keeps computation time reasonable, and small steps near the minimum, which reduces the risk of overshooting it.
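
The shrinking steps can be sketched in a few lines. This is an illustration only: the data, the starting slope, and the learning rate of `0.01` are assumed values, and the intercept `b` is held fixed so that only `m` is updated.

```python
def gradient_m(xs, ys, m, b):
    """Derivative of RSS with respect to the slope m, holding b fixed."""
    return sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))

# Hypothetical data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = 0.0, 1.0        # start far from the best slope (2.0)
learning_rate = 0.01   # assumed value, small enough to converge here
step_sizes = []

for _ in range(10):
    step = learning_rate * gradient_m(xs, ys, m, b)
    m -= step                      # move against the gradient
    step_sizes.append(abs(step))   # record how far this update moved m

print([round(s, 4) for s in step_sizes])
print(round(m, 4))
```

Each recorded step is smaller than the one before it because the gradient shrinks as `m` approaches the minimum, even though the learning rate never changes.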

### 1.4) What does the learning rate do in the gradient descent algorithm? Explain how a very small and a very large learning rate would affect the gradient descent.

The learning rate is a hyperparameter of gradient descent. It is multiplied by the gradient to determine the size of each step, and therefore the next point at which loss is measured. A very large learning rate can overshoot the minimum (where the slope is $0$) and may fail to converge, while a very small learning rate might not reach it in a reasonable amount of time.
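
To make the effect of the learning rate concrete, here is a small sketch on a toy loss, $(m - 3)^2$, whose minimum is at $m = 3$. The loss, the three learning rates, and the helper name `run_gradient_descent` are all assumptions chosen for illustration.

```python
def run_gradient_descent(learning_rate, steps=20, start=0.0):
    """Minimize the toy loss (m - 3)**2 and return the final m."""
    m = start
    for _ in range(steps):
        gradient = 2 * (m - 3)         # derivative of (m - 3)**2
        m -= learning_rate * gradient  # standard gradient descent update
    return m

m_small = run_gradient_descent(0.001)  # barely moves in 20 steps
m_good = run_gradient_descent(0.1)     # ends close to the minimum at 3
m_large = run_gradient_descent(1.1)    # overshoots further on every step

print(m_small, m_good, m_large)
```

With the same number of steps, the smallest rate is still far from the minimum, the moderate rate is close to it, and the largest rate diverges.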

---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---

### 2.1) Why is logistic regression typically better than linear regression for modeling a binary target/outcome?

When predicting a binary target, you are estimating the probability of some outcome occurring. Linear regression is designed for continuous targets and its predictions are unbounded, so it can produce values below $0$ or above $1$ that cannot be interpreted as probabilities. Logistic regression is better suited to these problems because its prediction curve is nonlinear (S-shaped) and bounded between $0$ and $1$, so it fits the distribution of binary outcomes better.
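
The difference in output range can be checked directly. In this sketch the slope and intercept values are arbitrary assumptions, and the two function names are hypothetical.

```python
import math

def linear_prediction(x, m=0.5, b=0.5):
    """A straight line: the output is unbounded."""
    return m * x + b

def logistic_prediction(x, m=0.5, b=0.0):
    """Sigmoid of a linear score: the output always falls in (0, 1)."""
    return 1 / (1 + math.exp(-(m * x + b)))

inputs = [-10, -2, 0, 2, 10]
linear_outputs = [linear_prediction(x) for x in inputs]
logistic_outputs = [logistic_prediction(x) for x in inputs]

print(linear_outputs)
print([round(p, 4) for p in logistic_outputs])
```

The linear outputs fall outside the $[0, 1]$ range at the extremes, while every logistic output stays strictly between $0$ and $1$.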

### 2.2) What is one advantage that logistic regression can have over other classification methods?

One advantage of logistic regression over other models is that its results are easier to interpret: each coefficient describes a change in the log odds, which can be converted to an odds ratio or a probability. Other models, such as Random Forests, are not as easily interpreted because they combine the predictions of many trees.
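
As a sketch of that interpretation, the snippet below converts log odds to probabilities and a coefficient to an odds ratio. The coefficient and intercept are hypothetical values, not taken from any fitted model in this notebook.

```python
import math

def log_odds_to_probability(log_odds):
    """Inverse of the logit: map log odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical fitted values for a single-feature model
intercept = -1.0
coefficient = 0.7

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratio = math.exp(coefficient)

p_at_0 = log_odds_to_probability(intercept)                # feature = 0
p_at_1 = log_odds_to_probability(intercept + coefficient)  # feature = 1

print(round(odds_ratio, 4), round(p_at_0, 4), round(p_at_1, 4))
```

A one-unit increase in the feature multiplies the odds by `odds_ratio`, regardless of the starting value of the feature.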

---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 3.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer.

In [1]:
TP = 30
FP = 4
TN = 54
FN = 12

In [2]:
# Your code here to calculate precision
Precision = TP / (TP + FP)
Precision

0.8823529411764706

In [3]:
# Your code here to calculate recall
Recall = TP / (TP + FN)
Recall

0.7142857142857143

In [4]:
# Your code here to calculate F-1 score
F1 = 2 * Precision * Recall / (Precision + Recall)
F1

0.7894736842105262

<img src = "visuals/many_roc.png" width = "700">

### 3.2) Which ROC curve from the above graph is the best? Explain your reasoning. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In the ROC plot above, the model using All Features, represented by the pink curve, is the best because it has the greatest Area Under the Curve (AUC). The AUC measures how well the model separates the two classes across all classification thresholds: a larger value means the model more often ranks a positive case above a negative one.
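
One way to see what AUC measures is to compute it by hand as the fraction of (positive, negative) pairs that the model ranks correctly. This is a sketch on made-up labels and scores, and `auc_from_scores` is a hypothetical helper written for illustration rather than a library function.

```python
def auc_from_scores(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 1, 1]
informative_scores = [0.1, 0.4, 0.35, 0.8]  # mostly ranks positives higher
flat_scores = [0.5, 0.5, 0.5, 0.5]          # no ability to separate classes

auc_informative = auc_from_scores(y_true, informative_scores)
auc_flat = auc_from_scores(y_true, flat_scores)

print(auc_informative, auc_flat)
```

A model that cannot distinguish the classes scores $0.5$, which corresponds to the diagonal line on a ROC plot.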

### Logistic Regression Example

The following cell includes code to train and evaluate a model

In [5]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


### 3.3) Explain how the distribution of `y` shown below could explain the very high accuracy score.

In [6]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

The distribution of `y` is heavily imbalanced: there are very few instances of the positive class compared to the negative. The accuracy of the model may be so high simply because it predicts the majority class (negative) in almost every case. The calculation that follows shows the accuracy of always predicting the negative class on the full dataset.

In [13]:
round(257/(257+13), 3)

0.952
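
The same baseline can be sketched as an explicit always-negative classifier, which also shows what accuracy hides. The labels below are rebuilt from the counts in `y.value_counts()`. Note that this uses the full dataset, whereas the model above was scored on the test split only.

```python
# Labels rebuilt from the class counts shown above: 257 negatives, 13 positives
y_true = [0] * 257 + [1] * 13

# Baseline that always predicts the majority (negative) class
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the positive class: how many actual purchases were found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(round(accuracy, 3), recall)
```

The baseline matches the accuracy computed above while finding none of the positive cases, which is why recall or precision is more informative than accuracy here.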

### 3.4) What method could you use to address the issue discovered in Question 3.3? 

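Class imbalance can be addressed by resampling the training data or by reweighting the classes. Possible methods are upsampling the minority class, downsampling the majority class, SMOTE (synthetic oversampling), Tomek links (a form of undersampling), or setting class weights in the model.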
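
As one possible illustration, upsampling the minority class amounts to sampling its rows with replacement until the classes are the same size. This sketch works on row indices rebuilt from the class counts above. The seed and variable names are assumptions, and in practice the resampling would be applied to the training split only.

```python
import random

random.seed(0)  # assumed seed, only for reproducibility

# Labels rebuilt from the class counts shown above
labels = [0] * 257 + [1] * 13

majority = [i for i, y in enumerate(labels) if y == 0]
minority = [i for i, y in enumerate(labels) if y == 1]

# Sample minority row indices with replacement until the classes match in size
upsampled_minority = random.choices(minority, k=len(majority))

balanced_indices = majority + upsampled_minority
balanced_labels = [labels[i] for i in balanced_indices]

print(balanced_labels.count(0), balanced_labels.count(1))
```

An alternative that avoids duplicating rows is to pass `class_weight='balanced'` to scikit-learn's `LogisticRegression`, which reweights the loss instead of the data.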

---
## Part 4: Decision Trees [Suggested Time: 20 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 4.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

Attribute A resulted in the best split because it led to more homogeneous (purer) leaf nodes. In other words, attribute A is strongly associated with the target variable, whereas attribute B is a poor predictor of it. At each node, the tree evaluates candidate splits with an impurity measure such as entropy (information gain) or Gini impurity, and selects the attribute whose split reduces impurity the most.
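
The comparison can be worked out numerically from the branch counts given above. This is a sketch with hypothetical helper names: it computes the weighted Gini impurity and the information gain for each split.

```python
import math

def gini(pos, neg):
    """Gini impurity of a node with the given class counts."""
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def entropy(pos, neg):
    """Entropy (in bits) of a node with the given class counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # skip empty classes to avoid log2(0)
            p = count / total
            result -= p * math.log2(p)
    return result

def weighted_impurity(branches, impurity):
    """Average impurity of the branches, weighted by branch size."""
    total = sum(pos + neg for pos, neg in branches)
    return sum((pos + neg) / total * impurity(pos, neg) for pos, neg in branches)

split_a = [(13, 2), (2, 13)]  # (positive, negative) counts per branch
split_b = [(8, 7), (7, 8)]

gini_a = weighted_impurity(split_a, gini)
gini_b = weighted_impurity(split_b, gini)
info_gain_a = entropy(15, 15) - weighted_impurity(split_a, entropy)
info_gain_b = entropy(15, 15) - weighted_impurity(split_b, entropy)

print(round(gini_a, 3), round(gini_b, 3))
print(round(info_gain_a, 3), round(info_gain_b, 3))
```

Attribute A gives the lower weighted Gini impurity and the higher information gain, so either criterion selects it.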

### Decision Tree Example

In this section, you will use decision trees to fit a classification model to the wine dataset. The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. Thirteen measurements were taken of different constituents found in the three types of wine.

In [14]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [15]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

In [16]:
# Run this cell without changes
# Get the distribution of the target variable 
y.value_counts()

1    71
0    59
2    48
Name: target, dtype: int64

### 4.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [22]:
# Replace None with appropriate code  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

### 4.3) Fit a decision tree model with scikit-learn to the training data. Use parameter defaults, except for `random_state=1`. Use the fitted classifier to generate predictions for the test data.

You can use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

In [24]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

In [32]:
pd.DataFrame(dt_pred)[0].value_counts()

1    37
0    29
2    23
Name: 0, dtype: int64

### 4.4) Obtain the accuracy score of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [35]:
from sklearn.metrics import accuracy_score

print('Accuracy Score:', accuracy_score(y_test, dt_pred))

Accuracy Score: 0.8764044943820225


### 4.5) Produce a confusion matrix for the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [42]:
from sklearn.metrics import confusion_matrix

pd.DataFrame(confusion_matrix(y_test, dt_pred))

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 27 | 6 | 0 |
| 1 | 2 | 30 | 2 |
| 2 | 0 | 1 | 21 |


### 4.6) Do the accuracy score or confusion matrix reveal any substantial problems with this model's performance? Explain your answer.

The accuracy score and confusion matrix do not reveal any _substantial_ problems with the model. The accuracy score is quite high $(87.6\%)$ and the confusion matrix shows that the model did a reasonably good job predicting each class. The model's largest error was predicting class $1$ for $6$ cases that were actually class $0$.
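
To check this class by class, per-class recall and precision can be read off the confusion matrix. The sketch below copies the matrix values from the output above and assumes rows are true classes and columns are predicted classes, which is the `confusion_matrix` convention in scikit-learn.

```python
# Confusion matrix from the output above (rows: true class, columns: predicted)
confusion = [
    [27, 6, 0],
    [2, 30, 2],
    [0, 1, 21],
]
n_classes = len(confusion)

# Recall: correct predictions divided by the row total (all true members)
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]

# Precision: correct predictions divided by the column total (all predicted members)
precision = [
    confusion[i][i] / sum(row[i] for row in confusion) for i in range(n_classes)
]

# Accuracy: diagonal total divided by the grand total
accuracy = sum(confusion[i][i] for i in range(n_classes)) / sum(map(sum, confusion))

print([round(r, 3) for r in recall])
print([round(p, 3) for p in precision])
print(accuracy)
```

Class $0$ has the lowest recall and class $1$ the lowest precision, both driven by the same $6$ misclassified cases.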