# Module 3 Code Challenge

## Overview

This assessment is designed to test your understanding of Module 3 material. It covers:

* Gradient Descent
* Logistic Regression
* Classification Metrics
* Decision Trees

_Read the instructions carefully._ You will be asked both to write code and respond to a few short answer questions.

### Note on the short answer questions

For the short answer questions, _please use your own words._ The expectation is that you have **not** copied and pasted from an external source, even if you consult another source to help craft your response. While the short answer questions are not necessarily being assessed on grammatical correctness or sentence structure, do your best to communicate yourself clearly.

---
## Part 1: Gradient Descent [Suggested Time: 20 min]
---

![best fit line](visuals/best_fit_line.png)

The best fit line through the scatterplot above can be written in the general form: $$y = mx + b$$

Of all the possible lines, the plot below shows why that particular line was chosen:

![](visuals/cost_curve.png)

where RSS is defined as the residual sum of squares:

$$ 
\begin{align}
RSS &= \sum_{i=1}^n(\text{actual}_i - \text{predicted}_i)^2 \\
&= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^n(y_i - (mx_i + b))^2
\end{align}
$$ 
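
As a minimal sketch of the RSS definition above, the snippet below evaluates the sum for two candidate slopes. The toy `x` and `y` arrays and the helper name `rss` are assumptions made for illustration; they are not the data behind the plots.

```python
import numpy as np

def rss(m, b, x, y):
    """Residual sum of squares for the line y_hat = m * x + b."""
    y_hat = m * x + b
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

rss_true = rss(2.0, 1.0, x, y)  # the generating line: every residual is zero
rss_off = rss(1.5, 1.0, x, y)   # slope too small: residuals are 0, 0.5, 1, 1.5

print(rss_true, rss_off)
```

Plotting `rss` over a range of `m` values would trace out a curve like the one shown above, with its lowest point at the best-fitting slope.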

### 1.1) What is a more generalized name for the RSS curve above? How could a machine learning model use this curve?

In [None]:
"""
Your written answer here

The more general name for the RSS curve above is the cost function (also called the loss function).
A machine learning model could use gradient descent on this curve to find the parameter values that minimize the cost (error).
"""

### 1.2) Would you rather choose an $m$ value of 0.08 or 0.05 from the RSS curve up above? Explain your reasoning.

In [1]:
"""
Your written answer here

I would choose an m value of 0.05 because the RSS curve reaches its lowest point at about m = 0.05.
There the RSS stops decreasing, so the loss is at its minimum.
"""


![](visuals/gd.png)

### 1.3) Using the gradient descent visual from above, explain why the distance between estimates in each step is getting smaller as more steps occur with gradient descent.

In [None]:
"""
Your written answer here
The steps get smaller toward the bottom of the curve because the slope there is smaller. Each step is the learning rate multiplied by the slope (gradient).
So the steep slope at the start produces big steps, and the flattening slope near the minimum produces small ones.
"""

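The shrinking steps described in 1.3 can be reproduced with a short sketch. It assumes hypothetical data whose best slope is 0.05, fixes the intercept at zero, and uses an illustrative learning rate of 0.01; none of these values come from the visual.

```python
import numpy as np

# Hypothetical data generated with slope 0.05 and no intercept
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.05 * x

def rss_gradient(m):
    # Derivative of sum((y - m*x)**2) with respect to m
    return -2.0 * np.sum(x * (y - m * x))

learning_rate = 0.01  # assumed value, chosen so the updates converge
m = 0.0
step_sizes = []
for _ in range(5):
    step = learning_rate * rss_gradient(m)  # step = learning rate * slope
    m = m - step
    step_sizes.append(abs(step))

print(step_sizes)  # each entry is smaller than the one before
print(m)           # approaches 0.05
```

Because the learning rate stays constant, the only thing shrinking the steps is the gradient itself, which flattens as `m` nears the minimum.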
### 1.4) What does the learning rate do in the gradient descent algorithm? Explain how a very small and a very large learning rate would affect the gradient descent.

In [None]:
"""
Your written answer here
The learning rate controls how big a step we take when moving down the cost curve toward the minimum loss point. It is a scalar that multiplies the gradient in each update of the gradient descent algorithm.
A very small learning rate would need a very large number of steps to reach the minimum. A very large learning rate could overshoot the minimum, and may oscillate around it or diverge, so it never settles on the precise minimum loss point.
"""

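To illustrate the learning-rate trade-off from 1.4, here is a rough sketch that minimizes a stand-in cost curve, f(m) = (m - 0.05)^2, with three learning rates. The function `descend`, the cost curve, and the three rates are all assumptions chosen for illustration: the tiny rate barely moves in 20 steps, the moderate rate converges, and the large rate overshoots further on every step.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Run gradient descent on f(m) = (m - 0.05)**2 and return the final m."""
    m = start
    for _ in range(steps):
        gradient = 2.0 * (m - 0.05)
        m -= learning_rate * gradient
    return m

tiny = descend(0.001)     # still far from 0.05 after 20 steps
moderate = descend(0.3)   # effectively at the minimum
huge = descend(1.1)       # each step overshoots more than the last

for name, value in [("tiny", tiny), ("moderate", moderate), ("huge", huge)]:
    print(f"{name}: m = {value:.4f}, distance from minimum = {abs(value - 0.05):.4f}")
```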
---
## Part 2: Logistic Regression [Suggested Time: 15 min]
---

### 2.1) Why is logistic regression typically better than linear regression for modeling a binary target/outcome?

In [None]:
"""
Your written answer here
Logistic regression is better suited to a binary target because it passes a linear combination of the features through the sigmoid ("S"-shaped) function, so its output always lies between 0 and 1 and can be read as a class probability. Linear regression fit with OLS predicts an unbounded continuous value, which can fall below 0 or above 1, and it assumes a linear relationship between the features and the target that a 0/1 outcome does not satisfy.
"""

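The contrast in 2.1 can be seen numerically. This sketch applies the sigmoid to a handful of hypothetical linear scores `z` (assumed values, not model output) and compares the result with the raw linear values.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores, as a linear model would produce them
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

linear_output = z            # unbounded: includes values below 0 and above 1
probabilities = sigmoid(z)   # always strictly between 0 and 1

print(linear_output)
print(probabilities.round(4))
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the default decision threshold.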
### 2.2) What is one advantage that logistic regression can have over other classification methods?

In [None]:
"""
Your written answer here
Logistic regression is efficient: it outputs class probabilities that can be thresholded into class predictions, its coefficients are easy to interpret, and training and prediction are fast.
"""

---
## Part 3: Classification Metrics [Suggested Time: 20 min]
---

![cnf matrix](visuals/cnf_matrix.png)

### 3.1) Using the confusion matrix above, calculate precision, recall, and F-1 score.

Show your work, not just your final numeric answer

In [4]:
# Your code here to calculate precision
TP=54 
TN=30
FP=12
FN=4

Precision = (TP)/(TP+FP)
Precision

0.8181818181818182

In [5]:
# Your code here to calculate recall
Recall = (TP)/(TP+FN)
Recall

0.9310344827586207

In [8]:
# Your code here to calculate F-1 score
F1=2*((Precision*Recall)/(Precision+Recall))
F1

0.8709677419354839
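
As a cross-check on the hand calculations above, the same three metrics can be computed with scikit-learn. This sketch rebuilds label arrays from the four counts used in the answer (TP=54, TN=30, FP=12, FN=4); the arrays themselves are a reconstruction for illustration, not the original data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

TP, TN, FP, FN = 54, 30, 12, 4

# Rebuild true and predicted labels that produce exactly these counts
y_true = np.array([1] * TP + [0] * TN + [0] * FP + [1] * FN)
y_pred = np.array([1] * TP + [0] * TN + [1] * FP + [0] * FN)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```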

<img src = "visuals/many_roc.png" width = "700">

### 3.2) Which ROC curve from the above graph is the best? Explain your reasoning. 

Note: each ROC curve represents one model, each labeled with the feature(s) inside each model.

In [None]:
"""
Your written answer here
The pink ROC curve is the best.
Its AUC (area under the ROC curve) is the largest, which means the pink model separates the classes best across all possible classification thresholds, not just at one chosen threshold.
"""

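The AUC comparison in 3.2 can be sketched with synthetic scores. Everything here is assumed for illustration: the labels are random, and the two score arrays stand in for a model with an informative feature and a model with a much noisier one.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)

# Hypothetical model scores: the label plus differing amounts of noise
informative_scores = y_true + rng.normal(0.0, 0.5, size=500)
noisy_scores = y_true + rng.normal(0.0, 3.0, size=500)

auc_informative = roc_auc_score(y_true, informative_scores)
auc_noisy = roc_auc_score(y_true, noisy_scores)

print(f"informative AUC: {auc_informative:.3f}")
print(f"noisy AUC:       {auc_noisy:.3f}")
```

The model whose scores separate the classes more cleanly gets the larger AUC, which corresponds to the curve that bows furthest toward the top-left corner.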
### Logistic Regression Example

The following cell includes code to train and evaluate a model

In [9]:
# Run this cell without changes

# Include relevant imports
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score

network_df = pickle.load(open('write_data/sample_network_data.pkl', 'rb'))

# partition features and target
X = network_df.drop('Purchased', axis=1)
y = network_df['Purchased']

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

# scale features
scale = StandardScaler()
scale.fit(X_train)
X_train = scale.transform(X_train)
X_test = scale.transform(X_test)

# build classifier
model = LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)

# get the accuracy score
print(f'The classifier has an accuracy score of {round(accuracy_score(y_test, y_test_pred), 3)}.')

The classifier has an accuracy score of 0.956.


### 3.3) Explain how the distribution of `y` shown below could explain the very high accuracy score.

In [10]:
# Run this cell without changes

y.value_counts()

0    257
1     13
Name: Purchased, dtype: int64

In [12]:
257/270

0.9518518518518518

In [None]:
"""
Your written answer here

The distribution of y is extremely imbalanced: the 0 class makes up about 95% of the target data (257 of 270).
A model that always predicts the majority class would therefore score about 0.952, so the 0.956 accuracy is barely better than that baseline.
Accuracy alone does not tell us the model's predictive power or what types of errors it makes.


"""

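The baseline argument in 3.3 can be checked directly. This sketch rebuilds a target with the same class counts (257 zeros, 13 ones) and scores a majority-class predictor using scikit-learn's `DummyClassifier`; the placeholder feature matrix is an assumption, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Same class counts as y.value_counts() above
y = np.array([0] * 257 + [1] * 13)
X = np.zeros((len(y), 1))  # placeholder features; the baseline never uses them

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)  # share of actual purchasers found

print(baseline_accuracy, baseline_recall)
```

The baseline reaches roughly 95% accuracy while identifying none of the 13 purchasers, which is why accuracy is a weak yardstick here.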
### 3.4) What method could you use to address the issue discovered in Question 3.3? 

In [None]:
"""
Your written answer here
The classes could be balanced with resampling methods applied to the training data, such as oversampling the minority class or undersampling the majority class, so that the distribution of y is more even. Accuracy would then give a better picture.
We could also evaluate the model with a different metric, such as the F1 score, which balances precision and recall.
"""

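One way to try the idea in 3.4 is class reweighting, a close relative of oversampling that scikit-learn exposes through `class_weight="balanced"`. Note that this is a swapped-in technique rather than the resampling named in the answer, and the data below is synthetic (a single assumed feature with about 5% positives), so treat it as a sketch of the effect on minority-class recall rather than a result for the network data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 1900 negatives, 100 positives, one feature
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 1900), rng.normal(1.5, 1.0, 100)]).reshape(-1, 1)
y = np.array([0] * 1900 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# Same model, with and without weights inversely proportional to class frequency
plain = LogisticRegression().fit(X_train, y_train)
balanced = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))

print(f"recall without weighting: {recall_plain:.2f}")
print(f"recall with class_weight='balanced': {recall_balanced:.2f}")
```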
---
## Part 4: Decision Trees [Suggested Time: 20 min]
---

### Concepts 
You're given a dataset of **30** elements, 15 of which belong to a positive class (denoted by `+`) and 15 of which do not (denoted by `-`). These elements are described by two attributes, A and B, that can each take one of two values, true or false.

The diagrams below show the result of splitting the dataset by attribute: the diagram on the left hand side shows that if we split by attribute A there are 13 items of the positive class and 2 of the negative class in one branch and 2 of the positive and 13 of the negative in the other branch. The right hand side shows that if we split the data by attribute B there are 8 items of the positive class and 7 of the negative class in one branch and 7 of the positive and 8 of the negative in the other branch.

<img src="visuals/decision_stump.png">

### 4.1) Which one of the two attributes resulted in the best split of the original data? How do you select the best attribute to split a tree at each node? 

It may be helpful to discuss splitting criteria.

In [None]:
"""
Your written answer here
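Attribute A resulted in the best split of the original data.
It produced a higher information gain (a larger decrease in entropy), meaning its branches are more homogeneous (pure).
At each node, the best attribute is the one whose split maximizes information gain, or equivalently minimizes the weighted impurity (entropy or Gini) of the child nodes.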

"""

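The comparison in 4.1 can be worked through with the counts given in the Concepts section. The helper names `entropy` and `information_gain` are illustrative, and entropy is assumed as the splitting criterion (Gini impurity would rank the two splits the same way).

```python
import numpy as np

def entropy(pos, neg):
    """Shannon entropy (in bits) of a node with the given class counts."""
    counts = np.array([pos, neg], dtype=float)
    probs = counts[counts > 0] / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - weighted

# Counts are (positive, negative), taken from the split descriptions above
gain_a = information_gain((15, 15), [(13, 2), (2, 13)])
gain_b = information_gain((15, 15), [(8, 7), (7, 8)])

print(f"gain from A: {gain_a:.4f}")
print(f"gain from B: {gain_b:.4f}")
```

Splitting on A removes roughly 0.43 bits of entropy, while splitting on B removes almost none, which matches the visual impression that B's branches are nearly as mixed as the parent.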
### Decision Tree Example

In this section, you will use decision trees to fit a classification model to the wine dataset. The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. Thirteen measurements were taken of different constituents found in the three types of wine.

In [13]:
# Run this cell without changes

# Relevant imports 
import pandas as pd 
import numpy as np 
from sklearn.datasets import load_wine

# Load the data 
wine = load_wine()
X, y = load_wine(return_X_y=True)
X = pd.DataFrame(X, columns=wine.feature_names)
y = pd.Series(y)
y.name = 'target'
df = pd.concat([X, y.to_frame()], axis=1)
df.head()

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |


In [14]:
# Run this cell without changes
# Get the shape of the DataFrame 
df.shape

(178, 14)

In [15]:
# Run this cell without changes
# Get the distribution of the target variable 
y.value_counts()

1    71
0    59
2    48
Name: target, dtype: int64

### 4.2) Split the data into training and test sets. Create training and test sets with `test_size=0.5` and `random_state=1`.

In [16]:
# Replace None with appropriate code  
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

### 4.3) Fit a decision tree model with scikit-learn to the training data. Use parameter defaults, except for `random_state=1`. Use the fitted classifier to generate predictions for the test data.

You can use the Scikit-learn DecisionTreeClassifier (docs [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html))

In [20]:
# Your code here 
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=1)
# Train the model
dtc.fit(X_train, y_train)
# Predict on the test set
y_pred_test = dtc.predict(X_test)

### 4.4) Obtain the accuracy score of the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [23]:
# Your code imports here
from sklearn import metrics 
from sklearn.metrics import accuracy_score 

# Replace None with appropriate code 

print('Accuracy Score:', accuracy_score(y_test, y_pred_test))

Accuracy Score: 0.8764044943820225


### 4.5) Produce a confusion matrix for the predictions on the test set. 

You can use the `sklearn.metrics` module.

In [27]:
# Your code imports here
from sklearn.metrics import confusion_matrix 

# Your code here 
CM = confusion_matrix(y_test, y_pred_test)
CM

array([[27,  6,  0],
       [ 2, 30,  2],
       [ 0,  1, 21]])
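
The confusion matrix above can be summarized per class. This sketch copies the printed matrix into an array (rather than refitting the tree) and relies on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[27, 6, 0],
               [2, 30, 2],
               [0, 1, 21]])

accuracy = np.trace(cm) / cm.sum()                  # correct predictions / all predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # correct / actual members of each class
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # correct / predictions of each class

print(f"accuracy: {accuracy:.4f}")
print("recall:   ", recall_per_class.round(3))
print("precision:", precision_per_class.round(3))
```

Per-class recall and precision show whether the errors are concentrated in one class, which the single accuracy figure cannot.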

### 4.6) Do the accuracy score or confusion matrix reveal any substantial problems with this model's performance? Explain your answer.

In [None]:
"""
Your written answer here
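No, the accuracy score and confusion matrix do not reveal any substantial problems. The model classifies 78 of the 89 test wines correctly (accuracy of about 0.876), and most predictions fall on the diagonal of the confusion matrix. The largest single source of error is 6 class 0 wines predicted as class 1.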
"""