# Setting Thresholds Lab

### Introduction

In this lesson, we'll explore the tradeoff between precision and recall, and see how changing the classification threshold shifts it, by working with a credit card fraud dataset.  The dataset is quite large, so you will have to download it separately.  The original dataset is [here](https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets), or it can be downloaded [by clicking here](https://github.com/jigsawlabs-student/logistic-regression/blob/master/0-classification-fundamentals/3-metrics-for-classification/compressed_creditcard.csv.zip).

### Loading our Data

Let's begin by loading our data.

In [1]:
import pandas as pd
url = "add dataset here"
# df = pd.read_csv(url)

The target data of whether there was fraud is located in the `Class` column.  Let's start by getting a sum of the number of positive cases.

In [52]:
# code here

# 492

492

Now let's calculate the mean to see the percentage of the positive cases.

In [54]:
# code here

# 0.001727485630620034

0.001727485630620034
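As a hint for the two cells above, here is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up for illustration and are not the fraud data). Because `Class` holds 0s and 1s, the sum counts the positive rows and the mean gives their share.

```python
import pandas as pd

# Hypothetical toy frame standing in for the credit card data
toy_df = pd.DataFrame({
    "Amount": [12.5, 3.0, 250.0, 40.0, 9.99],
    "Class": [0, 0, 1, 0, 0],
})

n_fraud = toy_df["Class"].sum()      # number of positive (fraud) rows
fraud_rate = toy_df["Class"].mean()  # share of rows that are positive

print(n_fraud, fraud_rate)
```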

So we can see that, as expected, not many of the transactions are labeled as fraudulent.  Let's assign the `Class` column to `y` and every other feature to `X`.

In [17]:
X = None
y = None

In [55]:
X.shape, y.shape

# ((284807, 30), (284807,))

((284807, 30), (284807,))

The data has already been transformed to be all numeric and there are no null values.  Let's scale our data and assign the scaled data to a dataframe with the appropriate column names.
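One way to approach this step is sketched below, assuming `StandardScaler` is the intended scaler (the lab does not say which one to use). The frame `toy_X` and its values are hypothetical. The scaler returns a plain NumPy array, so the column names have to be passed back in when building the dataframe.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features; the real X has 30 columns
toy_X = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0, 3.0],
    "Amount": [10.0, 20.0, 30.0, 40.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_X)  # NumPy array, column names are lost

# Rebuild a dataframe with the original column names and index
toy_X_df = pd.DataFrame(scaled, columns=toy_X.columns, index=toy_X.index)
print(toy_X_df)
```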

In [18]:
# scale data here

X_df = None

In [56]:
X_df[:2]

# Time	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V20	V21	V22	V23	V24	V25	V26	V27	V28	Amount
# 0	-1.996583	-0.694242	-0.044075	1.672773	0.973366	-0.245117	0.347068	0.193679	0.082637	0.331128	...	0.326118	-0.024923	0.382854	-0.176911	0.110507	0.246585	-0.392170	0.330892	-0.063781	0.244964
# 1	-1.996583	0.608496	0.161176	0.109797	0.316523	0.043483	-0.061820	-0.063700	0.071253	-0.232494	...	-0.089611	-0.307377	-0.880077	0.162201	-0.561131	0.320694	0.261069	-0.022256	0.044608	-0.342475
# 2 rows × 30 columns


Now split the data into training, validation, and test sets.  Each split should be stratified, and we want a 60/20/20 split of the data.  Set `random_state=1` in each split.

In [57]:
from sklearn.model_selection import train_test_split



> We can check that the data has been properly stratified by ensuring the mean of the targets is about the same across the training, validation, and test sets.

In [62]:
y_train.mean(), y_validate.mean(), y_test.mean()

# (0.0017263172678542169, 0.0017204754130018785, 0.0017380007724447878)

(0.0017263172678542169, 0.0017204754130018785, 0.0017380007724447878)

Let's also check the shape of the data.

In [63]:
X_train.shape, X_validate.shape, X_test.shape

# ((170884, 30), (56961, 30), (56962, 30))

((170884, 30), (56961, 30), (56962, 30))
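A stratified 60/20/20 split can be sketched with two calls to `train_test_split`: first hold out 40 percent, then cut that holdout in half. This is one possible approach on hypothetical toy data (`toy_X`, `toy_y`), not necessarily the split used to produce the numbers above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 10 percent positives
toy_X = pd.DataFrame({"feature": np.arange(1000)})
toy_y = pd.Series([1] * 100 + [0] * 900)

# First carve off 40% as a holdout, keeping the class proportions
X_tr, X_hold, y_tr, y_hold = train_test_split(
    toy_X, toy_y, test_size=0.4, stratify=toy_y, random_state=1)

# Then split the holdout in half: 20% validation, 20% test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=1)

print(len(X_tr), len(X_val), len(X_te))
print(y_tr.mean(), y_val.mean(), y_te.mean())
```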

### Train the model

Now let's train a logistic regression model, setting `class_weight` to `'balanced'` and `random_state=1`.

In [68]:


# LogisticRegression(class_weight='balanced', random_state=1)

LogisticRegression(class_weight='balanced', random_state=1)

> Score the model on the validation set.

In [69]:


# 0.974350871649023

0.974350871649023

Calculate the precision and recall scores of the `balanced` model, on the validation set.  Do so by using the methods in the sklearn library.

In [75]:
precision_balanced = None

precision_balanced

# 0.056026058631921824

0.056026058631921824

In [76]:
recall_balanced = None
recall_balanced

# 0.8775510204081632

0.8775510204081632

So we can see that our model captures about 88 percent of the fraud cases, but only 5.6 percent of the transactions it flags as fraud are in fact fraud.
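The training and scoring steps in this section can be sketched as follows. The data here is synthetic (generated with `make_classification`, about 5 percent positives) and the names `X_syn` and `y_syn` are hypothetical, so the scores will not match the ones above. `max_iter=1000` is an added assumption to help convergence and is not part of the lab's instructions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data; not the fraud dataset
X_syn, y_syn = make_classification(n_samples=4000, n_features=10,
                                   weights=[0.95], random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=1)

model = LogisticRegression(class_weight="balanced", random_state=1,
                           max_iter=1000)
model.fit(X_tr, y_tr)

val_predictions = model.predict(X_val)   # default 0.5 probability cut-off
accuracy = model.score(X_val, y_val)     # mean accuracy on validation data
precision = precision_score(y_val, val_predictions)
recall = recall_score(y_val, val_predictions)
print(accuracy, precision, recall)
```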

### Without Class Weights

Now let's train a logistic regression model on the data as sampled, that is, without setting `class_weight='balanced'`.

> Set the random_state = 1.

In [30]:
model_sample = None


# LogisticRegression(random_state=1)

LogisticRegression(random_state=1)

> Calculate the recall and precision scores on the validation sets.

In [78]:
recall_score_sample_data = None

# 0.6530612244897959

(0.6530612244897959, 0.8421052631578947)

In [None]:
precision_score_sample_data = None

# 0.8421052631578947

We see that the recall score is not as strong as with the balanced model, while the precision score is much stronger.
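To see why the two models behave so differently, it can help to look at the weights that `'balanced'` implies. scikit-learn documents them as `n_samples / (n_classes * np.bincount(y))`, and `compute_class_weight` returns the same values. The toy target below (`toy_y`, 1 percent positives) is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy target: 990 negatives, 10 positives
toy_y = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=toy_y)

# Same values computed by hand from the documented formula
by_hand = len(toy_y) / (2 * np.bincount(toy_y))
print(weights, by_hand)
```

With so few positives, each positive row counts far more than a negative one during training, which pushes the balanced model toward higher recall at the cost of precision.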

### Plotting the tradeoffs

Let's get a fuller picture of how our classifiers perform by plotting the `precision_recall_curve` for each classifier.  Begin with the precision_recall_curve for the model that *did not* balance the class weights.   

In [3]:
# create precision, recall, and thresholds for the precision recall plots

In [80]:
precision.shape, recall.shape, thresholds.shape

# ((41414,), (41414,), (41413,))

((41414,), (41414,), (41413,))

In [7]:
# write code for plots

<img src="./precision-recall-curve-sample.png" width="40%">
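A rough sketch of the two cells above, on synthetic stand-in data rather than the fraud dataset. It assumes the plot shows precision and recall against the threshold. `precision_recall_curve` returns one more precision and recall value than thresholds, so the last entry of each is dropped before plotting. `max_iter=1000` is an added assumption.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data; not the fraud dataset
X_syn, y_syn = make_classification(n_samples=4000, n_features=10,
                                   weights=[0.95], random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=1)
model = LogisticRegression(random_state=1, max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each validation row
positive_probs = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, positive_probs)

fig, ax = plt.subplots()
ax.plot(thresholds, precision[:-1], label="precision")
ax.plot(thresholds, recall[:-1], label="recall")
ax.set_xlabel("threshold")
ax.legend()
# Outside a notebook, call plt.show() to display the figure
```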

We can see that the model captures most of the fraud cases with a threshold around .1, perhaps lower.  Let's see how high our precision score can be if we maintain a recall score of about .9.

Begin by assigning a dataframe that has columns of our thresholds, and related precision and recall scores for our classifier.

In [4]:
df_precision_recall = None

In [6]:
# df_precision_recall[:2]

# 	threshold	precision	recall
# 0	0.000121	0.002689	0.987179
# 1	0.000121	0.002689	0.987179
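Building such a dataframe might look like the sketch below, shown on tiny hypothetical arrays (`toy_y`, `toy_scores`) instead of the model's output. The final precision and recall values have no matching threshold, so they are sliced off to keep the three columns the same length.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted probabilities
toy_y = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(toy_y, toy_scores)

toy_pr = pd.DataFrame({
    "threshold": thresholds,
    "precision": precision[:-1],  # drop the final value with no threshold
    "recall": recall[:-1],
})
print(toy_pr)
```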

Now narrow down the data by selecting those thresholds where recall is greater than .89 and lower than .90.  That is, where the classifier captures roughly 90 percent of the fraud cases.

In [82]:
df_capture_ninety = None
df_capture_ninety

# 	threshold	precision	recall
# 27559	0.002864	0.067308	0.897436
# 27560	0.002865	0.067372	0.897436
# 27561	0.002869	0.067437	0.897436
# 27562	0.002870	0.067502	0.897436
# 27563	0.002870	0.067568	0.897436
# ...	...	...	...
# 27636	0.003023	0.072690	0.897436
# 27637	0.003025	0.072765	0.897436
# 27638	0.003026	0.072841	0.897436
# 27639	0.003027	0.072917	0.897436
# 27640	0.003032	0.072993	0.897436

# 82 rows × 3 columns


In [83]:
df_capture_ninety.shape

# (82, 3)

(82, 3)

We can see that we can capture .897 of our fraudulent cases with a threshold of about .003.  Unfortunately, our precision drops to roughly 7 percent.
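The filtering step can be sketched with a boolean mask. The frame `toy_pr` below uses made-up values purely for illustration. Sorting the filtered rows by precision then shows which threshold in the recall band gives the best precision.

```python
import pandas as pd

# Hypothetical threshold table; the values are illustrative only
toy_pr = pd.DataFrame({
    "threshold": [0.001, 0.002, 0.003, 0.004, 0.010],
    "precision": [0.03, 0.05, 0.07, 0.08, 0.20],
    "recall": [0.95, 0.91, 0.897, 0.895, 0.80],
})

# Keep rows whose recall falls strictly between .89 and .90
in_band = toy_pr[(toy_pr["recall"] > 0.89) & (toy_pr["recall"] < 0.90)]

# Row with the highest precision inside that band
best = in_band.sort_values("precision", ascending=False).iloc[0]
print(in_band)
print(best["threshold"], best["precision"])
```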

### Balanced model

Now let's plot the `precision_recall_curve` for the balanced model.

In [38]:
# create precision, recall, and thresholds for the precision recall plots

In [8]:
# create plots

<img src="./balanced-model-curves.png" width="40%">

This time we can see that the model has a low precision score until the threshold is high, as it places less emphasis on correctly classifying the negative events (so it incorrectly predicts many of them as positive).  The recall score, by contrast, performs well throughout, as our model places more emphasis on correctly classifying the positive events.

Again, create a dataframe of the thresholds along with the precision and recall scores, and assign it to `df_precision_recall_balanced`.

In [40]:
df_precision_recall_balanced = None

In [84]:
df_precision_recall_balanced[:2]

# 	threshold	precision	recall
# 0	0.043482	0.003939	0.987179
# 1	0.043482	0.003939	0.987179


Let's narrow down our dataframe to results with recall scores between .89 and .92.

In [41]:


# 	threshold	precision	recall
# 19171	0.860196	0.196676	0.910256
# 19172	0.860624	0.197222	0.910256
# 19173	0.860700	0.197772	0.910256
# 19174	0.862010	0.198324	0.910256
# 19175	0.862561	0.198880	0.910256
# ...	...	...	...
# 19421	0.989690	0.630631	0.897436
# 19422	0.990370	0.636364	0.897436
# 19423	0.990801	0.642202	0.897436
# 19424	0.991059	0.648148	0.897436
# 19425	0.991368	0.654206	0.897436
# 255 rows × 3 columns


So here we can see that we can achieve a much better precision rate (about .65) if we keep recall at around .897, but raising recall to .910 drops our precision to roughly .20.

Let's set our threshold at .991 to maintain both a high recall and a reasonably high precision score.

Finally, let's train a model with `class_weight='balanced'` on our training and validation data combined.

In [42]:
X_combined = None
y_combined = None

In [43]:
X_combined.shape

# (273414, 30)

(273414, 30)
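Combining the two splits could be done with `pd.concat`, assuming the features are dataframes and the targets are series. The toy objects below are hypothetical. The features and targets are concatenated in the same order so that rows and labels stay aligned.

```python
import pandas as pd

# Hypothetical toy splits
toy_X_train = pd.DataFrame({"Amount": [1.0, 2.0, 3.0]})
toy_X_validate = pd.DataFrame({"Amount": [4.0, 5.0]}, index=[3, 4])
toy_y_train = pd.Series([0, 0, 1])
toy_y_validate = pd.Series([0, 1], index=[3, 4])

# Stack rows; features and targets must use the same order
toy_X_combined = pd.concat([toy_X_train, toy_X_validate])
toy_y_combined = pd.concat([toy_y_train, toy_y_validate])

print(toy_X_combined.shape, toy_y_combined.shape)
```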

> Create the balanced model with the combined data.

In [44]:
model_combined = None


# LogisticRegression(class_weight='balanced', random_state=1)

LogisticRegression(class_weight='balanced', random_state=1)

Now use the `model_combined` to make predictions on the test data, with the same threshold of .991, as discovered previously.

> Predict the values from the test set using the threshold below.

In [45]:
threshold = .991


> And then check the precision and recall scores.

In [46]:
precision_score(y_test, predicted_targets), recall_score(y_test, predicted_targets)

# (0.75, 0.9)

(0.75, 0.9)
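Applying a custom threshold amounts to comparing the predicted probability of the positive class with the cut-off. The sketch below uses hypothetical labels and probabilities (`toy_y`, `toy_probs`); with a fitted model, the probabilities would come from something like `model_combined.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted fraud probabilities
toy_y = np.array([0, 0, 1, 1, 1, 0])
toy_probs = np.array([0.10, 0.995, 0.999, 0.992, 0.50, 0.30])

threshold = 0.991
strict = (toy_probs >= threshold).astype(int)  # flag only very confident cases
default = (toy_probs >= 0.5).astype(int)       # roughly the usual 0.5 cut-off

strict_scores = (precision_score(toy_y, strict), recall_score(toy_y, strict))
default_scores = (precision_score(toy_y, default), recall_score(toy_y, default))
print(strict_scores, default_scores)
```

On this toy data the stricter threshold misses the positive case scored at 0.50, which lowers recall. In general, raising the threshold trades recall for precision.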

### Resources

[Credit Fraud](https://www.kaggle.com/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets)