# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily assessed on grammatical correctness or sentence structure, do your best to communicate clearly.

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line through the scatterplot above can be written in the general form: $$y = mx + b$$

Of all the possible lines, we can show why that particular line was chosen using the plot below:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(actual - predicted)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
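
As a rough illustration of the RSS definition above, the sketch below computes RSS for a few candidate slopes. The data points and the helper name `rss` are made up for this example and are not taken from the plots.

```python
# Toy data (hypothetical): points that lie exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def rss(m, b):
    """Residual sum of squares for the line y = m*x + b on the toy data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# A slope further from the best fit gives a larger RSS
for m in (1.0, 1.5, 2.0):
    print(f"m={m}: RSS={rss(m, 0.0)}")
```

Plotting `rss(m, 0.0)` over a range of `m` values would trace out a curve with the same bowl shape as the one shown below.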

### 1.1) What is a more generalized name for the RSS curve above? How could a machine learning model use this curve?

In [1]:
"""
A more general name for this curve is the cost function (also called the loss function). A machine learning model
can use it to find the slope that gives the lowest RSS, in other words the least error.
"""

### 1.2) Would you rather choose a $m$ value of 0.08 or 0.05 from the RSS curve up above? Explain your reasoning.

In [2]:
"""
0.05, because the RSS is about 8,000 when the slope is 0.08 and clearly lower when the slope is 0.05. The slope
with the lower RSS (error) gives the better-fitting line.
"""

![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between estimates in each step is getting smaller as more steps occur with gradient descent.

In [3]:
"""
Each step is the learning rate (a hyperparameter) multiplied by the slope of the cost curve at the current estimate.
The learning rate stays fixed, but the slope of the curve flattens as the estimates approach the minimum, so each
step is smaller than the last. Here the learning rate is an appropriate size, so the steps shrink steadily without
overshooting the minimum.
"""
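
The shrinking steps can be reproduced on a toy problem. This is a minimal sketch, assuming made-up data, an intercept fixed at 0, and an arbitrary learning rate of 0.01. The helper `rss_gradient` is a hypothetical name for the derivative of RSS with respect to the slope.

```python
# Toy data (hypothetical): points that lie exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def rss_gradient(m):
    """Derivative of RSS with respect to the slope m (intercept fixed at 0)."""
    return sum(-2 * x * (y - m * x) for x, y in zip(xs, ys))


learning_rate = 0.01  # assumed value, held constant for every step
m = 0.0               # starting guess for the slope
step_sizes = []

for _ in range(6):
    step = learning_rate * rss_gradient(m)
    m -= step
    step_sizes.append(abs(step))

print(step_sizes)  # each entry is smaller than the one before
print(m)           # approaches the best-fit slope of 2
```

The learning rate never changes inside the loop. The steps shrink only because the gradient gets closer to zero near the minimum.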

### 1.4) What does the learning rate do in the gradient descent algorithm? Explain how a very small and a very large learning rate would affect the gradient descent.

In [9]:
"""
The learning rate scales the size of each step. A very small learning rate makes the algorithm take very small
steps, so it will likely need many iterations to reach the minimum. A very large learning rate can make the steps
overshoot the minimum repeatedly, moving further and further away from it.
"""
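
The effect of the learning rate can be seen by running the same number of steps with three different values. This is a sketch on made-up data. The three learning rates and the helper `run_gradient_descent` are assumptions chosen for illustration only.

```python
# Toy data (hypothetical): points that lie exactly on y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def run_gradient_descent(learning_rate, n_steps=20):
    """Return the slope estimate after n_steps of gradient descent on RSS."""
    m = 0.0
    for _ in range(n_steps):
        gradient = sum(-2 * x * (y - m * x) for x, y in zip(xs, ys))
        m -= learning_rate * gradient
    return m


# Distance from the true slope (2.0) after 20 steps for each learning rate
errors = {lr: abs(run_gradient_descent(lr) - 2.0) for lr in (0.0001, 0.01, 0.04)}

for lr, err in errors.items():
    print(f"learning_rate={lr}: distance from minimum = {err:.6f}")
```

On this data the smallest rate is still far from the minimum after 20 steps, the middle rate has essentially converged, and the largest rate overshoots on every step and diverges.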

---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---

### 2.1) Why is logistic regression typically better than linear regression for modeling a binary target/outcome?

In [10]:
"""
Logistic regression is 'better' because it models the probability of the positive class, which stays between 0 and 1
and can be thresholded into one of the two classes. Linear regression predicts an unbounded continuous value, which
can fall between or outside the binary values.
"""

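To make the contrast with linear regression concrete, the sketch below passes a linear output through the logistic (sigmoid) function. The coefficients are hypothetical and the 0.5 threshold is an assumed default. This is not a fitted model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for a single feature x
slope, intercept = 0.8, -0.5

for x in (-5.0, 0.0, 5.0):
    linear_output = slope * x + intercept      # unbounded: can be < 0 or > 1
    probability = sigmoid(linear_output)       # always strictly between 0 and 1
    predicted_class = int(probability >= 0.5)  # threshold into a binary label
    print(x, round(linear_output, 2), round(probability, 3), predicted_class)
```

The raw linear output leaves the 0 to 1 range at both ends, while the sigmoid output can always be read as a probability.
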
### 2.2) What is one advantage that logistic regression can have over other classification methods?

In [11]:
"""
No single advantage holds over every other classification model; however, logistic regression is very easy to
implement compared to most other classification methods. It is about as easy to use as KNN, yet faster and more
efficient at prediction time.
"""

---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 3.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer.

In [5]:
tp = 54
tn = 30
fp = 12
fn = 4

In [6]:
# Your code here to calculate precision
# precision = true pos / (true pos + false pos)
# left column
precision = tp / (tp + fp)
precision

0.8181818181818182

In [7]:
# Your code here to calculate recall
# recall = true pos / (true pos + false neg)
# first row
recall = tp / (tp + fn)
recall

0.9310344827586207

In [8]:
# Your code here to calculate F-1 score
# F1 = 2 * precision * recall / (precision + recall)
f1 = (2 * precision * recall) / (precision + recall)
f1

0.8709677419354839
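
As a cross-check on the cells above, the F-1 score can also be written directly in terms of the confusion matrix counts, which gives the same value:

$$
F_1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2(54)}{2(54) + 12 + 4} = \frac{108}{124} \approx 0.871
$$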

<img src = "visuals/many_roc.png" width = "700">

### 3.2) Which ROC curve from the above graph is the best? Explain your reasoning. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In [12]:
"""
The best ROC curve is the one for the model with all the features. We look for the curve with the largest area
under the curve (AUC) and a true positive rate that is acceptable at the false positive rates we can tolerate.
Since the all-features curve has the greatest area and is better at every false positive rate, we choose that one.
"""
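
One way to make the AUC comparison concrete is to compute it by hand. The sketch below uses the pairwise-ranking interpretation of AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores are invented for illustration and do not come from the models in the figure. The helper `auc` is a hypothetical name, not a library function.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores from two models
y_true = [0, 0, 1, 1]
scores_one_feature = [0.1, 0.4, 0.35, 0.8]
scores_all_features = [0.1, 0.2, 0.7, 0.8]

auc_one = auc(y_true, scores_one_feature)
auc_all = auc(y_true, scores_all_features)
print(auc_one, auc_all)
```

The model whose scores separate the two classes more cleanly gets the larger AUC, which matches the reasoning used to pick a curve from the plot.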

### Logistic Regression Example

The following cell includes code to train and evaluate a model.

In [13]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

with open('write_data/sample_network_data.pkl', 'rb') as f:
    network_df = pickle.load(f)

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


In [17]:
from sklearn.metrics import f1_score

In [19]:
f1_score(y_test, y_test_pred)

0.4

### 3.3) Explain how the distribution of `y` shown below could explain the very high accuracy score.

In [14]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

In [20]:
257/(257+13)

0.9518518518518518

In [21]:
"""
The high accuracy score can be explained by the class imbalance in the target. Simply predicting the dominant class
(0) for every observation already gives an accuracy of 0.952.
"""

### 3.4) What method could you use to address the issue discovered in Question 3.3? 

In [22]:
"""
Depending on the problem at hand, we can judge the model with a metric that is less easily inflated by class
imbalance than accuracy. For example, the F1 score is a good general-purpose choice.
"""
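
To see why a different metric helps, the sketch below scores a hypothetical baseline that always predicts the majority class, using the class counts from `y.value_counts()` above. The baseline is an assumption for illustration and is not the fitted logistic regression model.

```python
# Class counts taken from the y.value_counts() output: 257 zeros, 13 ones
y_true = [0] * 257 + [1] * 13

# Hypothetical baseline: predict the majority class for every row
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
f1 = 2 * tp / (2 * tp + fp + fn)  # count-based form of the F-1 score

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

The baseline never finds a single positive case, yet its accuracy is about as high as the model's. The F-1 score exposes this because it depends only on how the positive class is handled.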

---
## Part 4: Decision Trees [Suggested Time: 20 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by *`+`* ) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each have either one of two values, true or false. 

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 4.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

In [23]:
"""
Attribute A gives the best split. It produces much purer subsets (13/2 and 2/13), while attribute B leaves each
branch at roughly 50/50 (8/7 and 7/8). On an entropy curve, entropy is highest at an even class mix and lowest at
the far left and right, so the less even the class mix in each branch, the more information the split gains. At each
node we pick the attribute whose split gives the largest information gain.
"""
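
The comparison between the two splits can be checked numerically. This is a minimal sketch using entropy and information gain as the splitting criterion, which is one common choice (Gini impurity is another). The helper names `entropy` and `information_gain` are hypothetical. The class counts come from the problem description above.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = (15, 15)  # (+, -) counts before splitting
gain_a = information_gain(parent, [(13, 2), (2, 13)])
gain_b = information_gain(parent, [(8, 7), (7, 8)])

print(round(gain_a, 3), round(gain_b, 3))
```

Splitting on attribute A removes a large share of the parent node's entropy, while splitting on attribute B removes almost none.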

### Decision Tree Example

In this section, you will use decision trees to fit a classification model to the wine dataset. The data is the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. There are thirteen different measurements taken for different constituents found in the three types of wine.

In [24]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [25]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

In [26]:
# Run this cell without changes
# Get the distribution of the target variable 
y.value_counts()

1    71
0    59
2    48
Name: target, dtype: int64

In [32]:
X.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


### 4.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [33]:
# Split the data with test_size=0.5 and random_state=1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

### 4.3) Fit a decision tree model with scikit-learn to the training data. Use parameter defaults, except for `random_state=1`. Use the fitted classifier to generate predictions for the test data.

You can use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

In [34]:
from sklearn.tree import DecisionTreeClassifier

In [36]:
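# Fit a decision tree with default parameters, except random_state=1
dtree = DecisionTreeClassifier(random_state=1)
dtree.fit(X_train, y_train)

# Predict classes for the test set
y_predict = dtree.predict(X_test)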

### 4.4) Obtain the accuracy score of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [41]:
from sklearn.metrics import accuracy_score

In [42]:
# Accuracy of the test-set predictions
print('Accuracy Score:', accuracy_score(y_test, y_predict))

Accuracy Score: 0.898876404494382


### 4.5) Produce a confusion matrix for the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [43]:
from sklearn.metrics import confusion_matrix

In [45]:
# Confusion matrix for the test-set predictions
confusion_matrix(y_test, y_predict)

array([[31,  2,  0],
       [ 4, 28,  2],
       [ 0,  1, 21]])

In [None]:
# Rows are the actual classes and columns are the predicted classes, both ordered 0, 1, 2

### 4.6) Do the accuracy score or confusion matrix reveal any substantial problems with this model's performance? Explain your answer.

In [47]:
y_test.value_counts()

1    34
0    33
2    22
Name: target, dtype: int64

In [50]:
31+28+21

80

In [51]:
34+33+22

89

In [52]:
"""
No. The accuracy score tells us how many predictions are wrong overall: 80 of 89 are correct (about 0.90). The
confusion matrix breaks the 9 misclassifications down by class, showing which actual classes were predicted as
which. No class is badly missed (the lowest per-class recall is 28/34, about 0.82) and the classes are roughly
balanced, so overall the model performs well.
"""
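
The per-class breakdown can be read off the confusion matrix with a few lines of code. This sketch copies the matrix from the 4.5 output and assumes scikit-learn's convention that rows are actual classes and columns are predicted classes. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the 4.5 output (rows = actual, columns = predicted)
cm = [
    [31, 2, 0],
    [4, 28, 2],
    [0, 1, 21],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

# Recall divides by the row sum, precision by the column sum
recalls = [cm[k][k] / sum(cm[k]) for k in range(n_classes)]
precisions = [
    cm[k][k] / sum(cm[r][k] for r in range(n_classes)) for k in range(n_classes)
]

for k in range(n_classes):
    print(f"class {k}: precision={precisions[k]:.3f}, recall={recalls[k]:.3f}")
print(f"accuracy={accuracy:.3f}")
```

Class 1 has the lowest recall of the three, but every class is still recovered at a rate above 0.8, which is consistent with the overall accuracy.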