# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to express yourself clearly.

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line that goes through the scatterplot up above can be generalized in the following equation: $$y = mx + b$$

Of all the possible lines, we can see why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual}_i - \text{predicted}_i)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
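
As a minimal sketch of the RSS definition above, the function below evaluates the residual sum of squares for a candidate slope and intercept. The toy `x` and `y` arrays and the function name `rss` are assumptions made for illustration; they are not the data behind the plots.

```python
import numpy as np

def rss(m, b, x, y):
    """Residual sum of squares for the line y_hat = m * x + b."""
    residuals = y - (m * x + b)
    return float(np.sum(residuals ** 2))

# Hypothetical toy data that lies exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

rss_perfect = rss(2.0, 1.0, x, y)  # the true line leaves no residuals
rss_worse = rss(1.0, 1.0, x, y)    # a slope that is too small
print(rss_perfect, rss_worse)
```

Evaluating `rss` over a range of `m` values would trace out a curve of the same kind as the cost curve shown above.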

### 1.1) What is a more generalized name for the RSS curve above? How could a machine learning model use this curve?

In [8]:
"""
This curve is known as a cost function.  A machine learning algorithm can find the coefficients/intercept that 
minimize the cost function in order to find the model that best represents the relationship between the predictor
and dependent variables.  One way to minimize the cost function is to find the coefficients/intercept that yield
a derivative of 0 on the cost function, which corresponds to the cost function minimum.
"""


### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? Explain your reasoning.

In [9]:
"""
I would rather choose an m value of .05 from the RSS curve above because this m value has a lower RSS, which means
that it does a better job of representing the relationship between x and y.
"""


![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between estimates in each step is getting smaller as more steps occur with gradient descent.

In [10]:
"""
The distance gets smaller as more steps occur because in gradient descent, the size of a step is determined in part 
by the derivative of the cost function at the point a step is located.  Each sequential step is closer to the
minimum, which means that it will have a smaller slope/derivative and therefore that the step size, which is 
dependent upon the derivative, will be smaller.
"""

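
The shrinking steps can be illustrated with a small sketch. It assumes a simple stand-in cost function, `cost(m) = m**2`, rather than the RSS curve from the visual, and the function name and starting point are hypothetical.

```python
def gradient_descent_steps(start, learning_rate, n_steps):
    """Run gradient descent on cost(m) = m**2 and record the size of each step."""
    m = start
    step_sizes = []
    for _ in range(n_steps):
        gradient = 2 * m                 # derivative of m**2 at the current m
        step = learning_rate * gradient  # step size is proportional to the slope
        m = m - step
        step_sizes.append(abs(step))
    return m, step_sizes

final_m, step_sizes = gradient_descent_steps(start=10.0, learning_rate=0.1, n_steps=5)
print(step_sizes)
```

With a fixed learning rate, each step is smaller than the last only because the slope flattens as `m` approaches the minimum at 0.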

### 1.4) What does the learning rate do in the gradient descent algorithm? Explain how a very small and a very large learning rate would affect the gradient descent.

In [11]:
"""
The learning rate in gradient descent scales the derivative at a given point to determine the step size from that
point.
A learning rate that is too small will lead to a gradient descent with very small steps, which will make the
algorithm computationally expensive and slow to converge.  A learning rate that is too large may lead to an
initial step to a point that is farther from the minimum, which in turn will lead to an even larger derivative and
step size, which will yield a point even farther from the minimum.  This process of stepping farther and farther from
the minimum will repeat and the algorithm will never converge on the cost curve minimum.
"""
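
A rough sketch of both failure modes, again assuming the stand-in cost function `cost(m) = m**2` (the function name `descend` and the specific learning rates are illustrative choices, not values from the assessment):

```python
def descend(learning_rate, start=1.0, n_steps=20):
    """Gradient descent on cost(m) = m**2; returns m after n_steps updates."""
    m = start
    for _ in range(n_steps):
        m -= learning_rate * 2 * m  # 2 * m is the derivative of m**2
    return m

tiny = descend(0.001)  # steps are so small that m barely moves from 1.0
good = descend(0.1)    # m shrinks steadily toward the minimum at 0
huge = descend(1.5)    # every step overshoots, so |m| grows without bound
print(tiny, good, huge)
```

For this particular cost function each update multiplies `m` by `1 - 2 * learning_rate`, so any learning rate above 1.0 makes the magnitude of `m` grow on every step.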

---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---

### 2.1) Why is logistic regression typically better than linear regression for modeling a binary target/outcome?

In [12]:
"""
Logistic regression is better than linear regression for modeling binary outcomes because linear regression uses an
unbounded line to represent the relationship between the independent and dependent variables and therefore can yield
probabilities greater than 1 or less than 0 that an observation is a member of the positive class, which is not
possible.  Logistic regression on the other hand uses a sigmoid function that is bounded by 0 and 1, which accurately
represents the range of possible probabilities for a binary outcome.
"""

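
The boundedness argument can be checked numerically. This is a minimal sketch in which `z` stands for an arbitrary linear combination of features; the values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # hypothetical linear predictor values
linear_output = z                            # a linear model would report these directly
probabilities = sigmoid(z)                   # logistic regression squashes them first

print(linear_output)
print(probabilities)
```

The raw linear values fall outside [0, 1], while the sigmoid outputs stay strictly between 0 and 1 and can be read as probabilities.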

### 2.2) What is one advantage that logistic regression can have over other classification methods?

In [13]:
"""
One advantage of logistic regression over other classification methods is that it is interpretable.  The
size of a logistic regression coefficient tells you the relative influence of its feature on the log-odds
of the positive outcome (provided the features are on comparable scales), while the sign of the coefficient tells
you whether the feature has a positive or negative impact on the log-odds.
"""
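
One common way to read logistic regression coefficients is to exponentiate them: a coefficient is a change in the log-odds of the positive class, so its exponential is a multiplicative change in the odds. The sketch below uses hypothetical coefficient values and feature names, not ones fitted in this notebook.

```python
import numpy as np

# Hypothetical fitted coefficients (change in log-odds per one-unit increase)
coefficients = {"feature_a": 0.7, "feature_b": -1.2}

# exp(coefficient) is the factor by which the odds are multiplied
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.3f} per one-unit increase")
```

An odds ratio above 1 corresponds to a positive coefficient, and one below 1 to a negative coefficient.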

---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 3.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer.

In [14]:
# Your code here to calculate precision

#Precision = TP/(TP+FP)
precision = 30/(30+4)
precision

0.8823529411764706

In [15]:
# Your code here to calculate recall

#Recall = TP/(TP+FN)
recall = 30/(30+12)
recall

0.7142857142857143

In [16]:
# Your code here to calculate F-1 score

# F_1 = 2*(precision*recall)/(precision+recall)
f_1 = 2*(precision*recall)/(precision+recall)
f_1

0.7894736842105262

<img src = "visuals/many_roc.png" width = "700">

### 3.2) Which ROC curve from the above graph is the best? Explain your reasoning. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In [17]:
"""
The All Features ROC curve is best because it has the greatest AUC.  The AUC is the Area Under the Curve and
represents the probability that if a positive and a negative observation were randomly selected from the sample, the
model would assign the positive observation a higher predicted probability than the negative observation.  AUC is
therefore a measure of model performance, with higher AUC indicating better performance.
"""
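
The ranking interpretation of AUC can be verified directly. The sketch below uses a small set of made-up labels and scores (not the models in the ROC plot) and compares a pairwise count against scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.3, 0.2, 0.9, 0.6])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = float(np.mean((diff > 0) + 0.5 * (diff == 0)))

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```

Both numbers agree: the share of positive/negative pairs that are ranked correctly equals the area under the ROC curve.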

### Logistic Regression Example

The following cell includes code to train and evaluate a model.

In [18]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

with open('write_data/sample_network_data.pkl', 'rb') as f:
    network_df = pickle.load(f)

# partition features and target 
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


### 3.3) Explain how the distribution of `y` shown below could explain the very high accuracy score.

In [19]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64
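
Given the class counts above, it can help to compute the accuracy of a naive baseline that always predicts the majority class. This sketch rebuilds the labels from the printed counts rather than from the pickled data, and it scores the baseline on all 270 rows, so the figure for the test split alone may differ slightly.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Labels reconstructed from the value counts: 257 zeros and 13 ones
y_all = np.array([0] * 257 + [1] * 13)

# A "model" that ignores the features and always predicts class 0
always_negative = np.zeros_like(y_all)

baseline_accuracy = accuracy_score(y_all, always_negative)
print(round(baseline_accuracy, 3))
```

Any reported accuracy is worth comparing against this baseline before treating it as evidence of a good model.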

In [20]:
"""
The model's high accuracy score is explained by the class imbalance.  257 out of 270 observations, or 95.2%, are
members of class 0.  Due to this class imbalance, the model can predict class 0 the vast majority of the time and
still be right the vast majority of the time, which explains why the model is right 95.6% of the time.
"""

### 3.4) What method could you use to address the issue discovered in Question 3.3? 

In [21]:
"""
In order to address this issue, one method we could use would be to calculate recall (sensitivity).  Recall tells us
the percentage of positive class observations that the model correctly predicted, which would reveal if the model is
actually able to effectively predict the minority positive observations or if it is just mostly predicting negative
and right most of the time because most observations are negative.
"""
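
To see how a class-aware metric exposes what accuracy hides, the sketch below scores an invented set of predictions with the same class counts as `y`. The prediction vector is hypothetical and is not the output of the model trained above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, balanced_accuracy_score

# 257 negatives followed by 13 positives, matching the counts of y
y_true = np.array([0] * 257 + [1] * 13)

# Hypothetical predictions: 1 false positive, and only 2 of 13 positives caught
y_pred = np.array([0] * 256 + [1] * 1 + [1] * 2 + [0] * 11)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)                   # share of positives found
balanced = balanced_accuracy_score(y_true, y_pred)      # mean of per-class recall

print(round(accuracy, 3), round(recall, 3), round(balanced, 3))
```

Accuracy stays near 95.6% while recall on the positive class is only about 15%. Beyond changing the metric, other commonly used remedies include resampling the training data or passing `class_weight='balanced'` to `LogisticRegression`.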

---
## Part 4: Decision Trees [Suggested Time: 20 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by `+`) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 4.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

In [25]:
"""
Attribute A is the better attribute to split by because it results in purer child nodes.  Both attributes split the 
dataset into 2 nodes with equal sizes of 15 but in the attribute A split 87% of observations in both child nodes are 
from the majority class, while that number is only 53% for the child nodes of the attribute B split.  At each node,
the tree selects the attribute whose split most reduces impurity.  When child node sizes are equal, both common
splitting criteria, Information Gain and Gini Impurity, favor child nodes with a higher proportion of the majority
class.
"""
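
Both splitting criteria can be computed from the class counts given in the problem statement. The helper names below (`gini`, `entropy`, `weighted_impurity`) are illustrative and are not part of scikit-learn.

```python
import numpy as np

def gini(counts):
    """Gini impurity of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - np.sum(p ** 2))

def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # skip empty classes to avoid log2(0)
    return float(-np.sum(p * np.log2(p)))

def weighted_impurity(children, impurity):
    """Average child impurity, weighted by the number of items in each child."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * impurity(child) for child in children)

parent = [15, 15]             # 15 positive, 15 negative
split_a = [[13, 2], [2, 13]]  # children after splitting on attribute A
split_b = [[8, 7], [7, 8]]    # children after splitting on attribute B

gini_a = weighted_impurity(split_a, gini)
gini_b = weighted_impurity(split_b, gini)
info_gain_a = entropy(parent) - weighted_impurity(split_a, entropy)
info_gain_b = entropy(parent) - weighted_impurity(split_b, entropy)

print(gini_a, gini_b)
print(info_gain_a, info_gain_b)
```

Attribute A gives the lower weighted Gini impurity and the higher information gain, so either criterion selects it.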

### Decision Tree Example

In this section, you will use decision trees to fit a classification model to the wine dataset. The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.

In [42]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)
df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [43]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

In [44]:
# Run this cell without changes
# Get the distribution of the target variable 
y.value_counts()

1    71
0    59
2    48
Name: target, dtype: int64


### 4.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [46]:
# Replace None with appropriate code  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

### 4.3) Fit a decision tree model with scikit-learn to the training data. Use parameter defaults, except for `random_state=1`. Use the fitted classifier to generate predictions for the test data.

You can use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

In [47]:
# Your code here 
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(random_state=1)
clf = clf.fit(X_train,y_train)
y_pred_test = clf.predict(X_test)

### 4.4) Obtain the accuracy score of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [48]:
# Your code imports here
from sklearn import metrics 

# Replace None with appropriate code 


print('Accuracy Score:', metrics.accuracy_score(y_test, y_pred_test))

Accuracy Score: 0.8764044943820225


### 4.5) Produce a confusion matrix for the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [49]:
# Your code imports here
from sklearn import metrics 

# Your code here 
confusion = metrics.confusion_matrix(y_test, y_pred_test)
confusion

array([[27,  6,  0],
       [ 2, 30,  2],
       [ 0,  1, 21]])
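
Per-class recall and precision can be read off this confusion matrix directly. The sketch below hardcodes the matrix printed above and relies on scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
confusion = np.array([[27, 6, 0],
                      [2, 30, 2],
                      [0, 1, 21]])

true_positives = np.diag(confusion)

# Recall: correct predictions divided by the number of actual members of each class
recall_per_class = true_positives / confusion.sum(axis=1)

# Precision: correct predictions divided by the number of times each class was predicted
precision_per_class = true_positives / confusion.sum(axis=0)

accuracy = true_positives.sum() / confusion.sum()

print(np.round(recall_per_class, 3))
print(np.round(precision_per_class, 3))
print(round(float(accuracy), 3))
```

The row sums also give the test set class sizes, which is a quick way to check whether accuracy is being inflated by imbalance.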

### 4.6) Do the accuracy score or confusion matrix reveal any substantial problems with this model's performance? Explain your answer.

In [50]:
"""
I do not believe that the accuracy score or confusion matrix reveal any substantial issues with this model's
performance.  The accuracy score of 87.6% is high, and it is not inflated by class imbalance because the test set
classes are relatively well balanced, with sizes of 33, 34 and 22.  The model correctly predicted 81.8%, 88.2% and
95.5% of observations for the 3 actual classes (recall) and was correct for 93.1%, 81.1% and 91.3% of observations
for the 3 predicted classes (precision).  The model performed strongly and was relatively uniform in its performance
across predicted and actual classes, so I don't see any substantial issues with performance.
"""