# Precision and Recall Lab

### Introduction

In this lesson, we'll train a classifier and explore our different metrics for measuring the classifier's performance.  We'll do so by looking at customer churn data from a telecommunications company.

### Loading our Data

Let's begin by loading our data.

In [3]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/logistic-regression/master/0-classification-fundamentals/3-metrics-for-classification/coerced_customer_churn.csv"
df = pd.read_csv(url, index_col = 0)

Now let's take a look at our data.

In [4]:
df[:2]

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender,Partner,Dependents,PhoneService,MultipleLines_x0_No phone service,MultipleLines_x0_Yes,InternetService_x0_Fiber optic,...,StreamingMovies_x0_Yes,Contract_x0_One year,Contract_x0_Two year,PaperlessBilling,PaymentMethod_x0_Credit card (automatic),PaymentMethod_x0_Electronic check,PaymentMethod_x0_Mailed check,Churn,TotalCharges,TotalCharges_is_na
0,0,1,29.85,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,29.85,False
1,0,34,56.95,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1889.5,False


As we can see, our data has already been formatted so that we can train a model.  Let's get to it.  Assign every column except `Churn` to the variable `X`, and assign `Churn` to `y` as the target.

In [5]:
X = df.drop('Churn', axis = 1)
y = df['Churn']

In [6]:
X.shape, y.shape

# ((7043, 31), (7043,))

((7043, 31), (7043,))

Now scale the X data, and place the scaled data in a dataframe with the appropriate columns.

In [10]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled, columns = X.columns) 

In [12]:
X_scaled_df[:2]


Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender,Partner,Dependents,PhoneService,MultipleLines_x0_No phone service,MultipleLines_x0_Yes,InternetService_x0_Fiber optic,...,StreamingMovies_x0_No internet service,StreamingMovies_x0_Yes,Contract_x0_One year,Contract_x0_Two year,PaperlessBilling,PaymentMethod_x0_Credit card (automatic),PaymentMethod_x0_Electronic check,PaymentMethod_x0_Mailed check,TotalCharges,TotalCharges_is_na
0,-0.439916,-1.277445,-1.160323,-1.009559,1.03453,-0.654012,-3.05401,3.05401,-0.854176,-0.88566,...,-0.525927,-0.79607,-0.514249,-0.562975,0.829798,-0.525047,1.406418,-0.544807,-0.994971,-0.039551
1,-0.439916,0.066327,-0.259629,0.990532,-0.966622,-0.654012,0.327438,-0.327438,-0.854176,-0.88566,...,-0.525927,-0.79607,1.944582,-0.562975,-1.205113,-0.525047,-0.711026,1.835513,-0.173876,-0.039551


Now let's split the data into training, validation, and test sets.  The first call below splits the data into training and test datasets.  Now it's your turn to split the test dataset in half, into validation and test datasets.

> Make sure you stratify the data by the `y_test` data.  Set `random_state = 1`.

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled_df, y, 
                                                    test_size = .4,
                                                    random_state = 1, stratify = y)
X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, 
                                                          test_size = .5, 
                                                          random_state = 1, stratify = y_test)

In [14]:
X_validate[:2]

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,gender,Partner,Dependents,PhoneService,MultipleLines_x0_No phone service,MultipleLines_x0_Yes,InternetService_x0_Fiber optic,...,StreamingMovies_x0_No internet service,StreamingMovies_x0_Yes,Contract_x0_One year,Contract_x0_Two year,PaperlessBilling,PaymentMethod_x0_Credit card (automatic),PaymentMethod_x0_Electronic check,PaymentMethod_x0_Mailed check,TotalCharges,TotalCharges_is_na
2105,-0.439916,-1.196004,0.348589,-1.009559,-0.966622,-0.654012,0.327438,-0.327438,1.170719,1.129102,...,-0.525927,-0.79607,-0.514249,-0.562975,-1.205113,-0.525047,1.406418,-0.544807,-0.9013,-0.039551
3336,-0.439916,0.921455,-1.477726,-1.009559,1.03453,1.529024,0.327438,-0.327438,-0.854176,-0.88566,...,1.901403,-0.79607,-0.514249,1.776278,-1.205113,-0.525047,-0.711026,1.835513,-0.531716,-0.039551
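
One way to check that stratifying worked is to compare the share of positive labels in each split. The following is a minimal sketch on made-up labels (75 negatives and 25 positives, not the churn data), assuming the same two-step `train_test_split` pattern used above; the name `X_rest` is only illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 observations, 25% positive.
y = np.array([0] * 75 + [1] * 25)
X = np.arange(len(y)).reshape(-1, 1)

# Step 1: 60% training, 40% held out, stratified on y.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=.4, random_state=1, stratify=y)

# Step 2: split the held-out 40% in half, stratified on its own labels.
X_validate, X_test, y_validate, y_test = train_test_split(
    X_rest, y_rest, test_size=.5, random_state=1, stratify=y_rest)

# Each split should keep roughly the same share of positives as y.
print(y_train.mean(), y_validate.mean(), y_test.mean())
```

If the three proportions are close to the overall proportion of positives, the class balance was preserved across the splits.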


Now let's fit our logistic regression model.  Set the solver as `lbfgs` and the `random_state` as 1, then check the accuracy on the validation set.

In [16]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver="lbfgs",
                           random_state = 1).fit(X_train, y_train)
model.score(X_validate, y_validate)

# 0.8034066713981547

0.8034066713981547

### Breaking down our error

Next let's use the validation data to create a confusion matrix.  Use the `confusion_matrix` function from `sklearn.metrics`.   

In [17]:
from sklearn.metrics import confusion_matrix

In [18]:
y_pred_val = model.predict(X_validate)

In [19]:
mtx = confusion_matrix(y_validate.values, y_pred_val)

In [20]:
import pandas as pd
conf_mtx_df = pd.DataFrame(mtx, index = ['observed -', 'observed +'],
                           columns = ['predicted -', 'predicted +'])
conf_mtx_df.iloc[::-1, ::-1].T



Unnamed: 0,observed +,observed -
predicted +,198,101
predicted -,176,934


Now let's break down the confusion matrix.  From here, assign the number of true positives, true negatives, false positives, and false negatives to `TP`, `TN`, `FP`, and `FN` below.

In [21]:
TP = 198
TN = 934
FP = 101
FN = 176

Use the four variables declared in the cell above to calculate the accuracy.  It should match the score we saw above.

In [23]:
accuracy = (TP + TN)/(TP + TN + FP + FN)
accuracy
# 0.8034066713981547

0.8034066713981547
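
The four counts above were typed in by hand from the confusion matrix. As an alternative, they can be read directly from the array that `confusion_matrix` returns. This is a small sketch on made-up labels (not the churn data), relying on the fact that for binary labels the matrix is laid out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are observed classes, columns are predicted classes, so
# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```

Unpacking the counts this way avoids transcription mistakes when the model or the data changes.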

Let's also check that the total number of positive observations and negative observations line up with what we see in our confusion matrix.  Use the variables to make the correct calculations.

In [24]:
total_positives = TP + FN

total_positives

374

In [25]:
y_validate.sum()

374.0

In [26]:
total_negatives = TN + FP 
total_negatives

1035

In [27]:
(y_validate == 0).sum()

1035

### Working through Precision and Recall

Now let's calculate the precision and recall.  

1. Precision 

Let's start with precision.  Remember that precision is **the percentage of observations our classifier predicts** as positive that are actually positive.  

> Use the variables above to calculate the precision.

In [29]:
precision = TP/(TP + FP)
precision
# 0.6622073578595318

0.6622073578595318

> Then import `precision_score` from `sklearn.metrics` and check that you get the same number.

In [30]:
from sklearn.metrics import precision_score

In [32]:
precision_score(y_validate, y_pred_val)
# 0.6622073578595318

0.6622073578595318

So we can see that roughly one third of what our classifier flags as positive is a false positive.  In other words, about one third of the customers our classifier predicts will churn do not.
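
As a quick check on the "one third" figure, the share of false positives among the positive predictions is one minus the precision. Using the counts from the confusion matrix above:

$$1 - precision = \frac{FP}{TP + FP} = \frac{101}{198 + 101} = \frac{101}{299} \approx 0.338$$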

2. Recall

Next, let's move to recall.  Remember that recall is the **percentage of observed positive events** that were classified as positive. 

In [34]:
recall = TP/(TP + FN)
recall

# 0.5294117647058824

0.5294117647058824

Now, we import the `recall_score` from sklearn and check that we get a matching recall.

In [35]:
from sklearn.metrics import recall_score

In [37]:
recall_score(y_validate, y_pred_val)

0.5294117647058824

So we can see that many churned customers are not captured by our classifier: with a recall of about .53, it misses nearly half of the customers who actually churned.

### Balancing Data 

Let's see if we can perform a little better by balancing our data. Currently, we have an imbalanced dataset. 

In [38]:
y_train.mean()

0.26532544378698225

This means that during training, the model is optimized to perform better on the negative observations than the positive ones, as there are nearly three times as many negative observations.  We can alter this by setting `class_weight` to `balanced` when initializing the `LogisticRegression` model.  

As explained in the documentation:

```text 
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data
as ``n_samples / (n_classes * np.bincount(y))``
```

So the fewer observations a class has, the larger the weight the estimator applies to the cost of each observation in that class.
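
To make the formula from the documentation concrete, here is a minimal sketch that applies it by hand with NumPy. The labels are made up for illustration (three negatives for every positive, loosely mirroring our imbalance), and the variable names are not part of scikit-learn.

```python
import numpy as np

# Toy labels (hypothetical): three negatives for every positive.
y = np.array([0, 0, 0, 1])

n_samples = len(y)
n_classes = len(np.unique(y))

# The "balanced" formula: n_samples / (n_classes * np.bincount(y)).
# np.bincount(y) counts the observations in each class: [3, 1].
class_weights = n_samples / (n_classes * np.bincount(y))
print(class_weights)
```

Here the rarer positive class ends up with three times the weight of the negative class, which offsets the three-to-one imbalance in the counts.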

> Fit the logistic regression model with `class_weight = 'balanced'`, `solver = 'lbfgs'`, and `random_state = 1`.  Score the `balanced_model` on the validation set.

In [41]:
balanced_model = LogisticRegression(solver="lbfgs",
                           random_state = 1, class_weight='balanced').fit(X_train, y_train)
balanced_model.score(X_validate, y_validate)
# 0.7260468417317246

0.7260468417317246

> Notice that the score decreases from `0.8034066713981547` previously to `0.7260468417317246`.  This is expected, as the accuracy score measures the share of all observations classified correctly, while our balanced classifier gives more weight to fitting the positive events.

Now let's see how our balanced model changes the way it classifies the positive events.  Begin by creating a confusion matrix for the balanced model.

In [42]:
y_val_predict_balanced = balanced_model.predict(X_validate)

In [43]:
mtx_balanced = confusion_matrix(y_validate.values, y_val_predict_balanced)

In [44]:
import pandas as pd
conf_mtx_df_balanced = pd.DataFrame(mtx_balanced, index = ['observed -', 'observed +'],
                           columns = ['predicted -', 'predicted +'])
conf_mtx_df_balanced.iloc[::-1, ::-1].T



Unnamed: 0,observed +,observed -
predicted +,290,302
predicted -,84,733


Now let's compare this to the original confusion matrix.

In [59]:
conf_mtx_df.iloc[::-1, ::-1].T

Unnamed: 0,observed +,observed -
predicted +,198,101
predicted -,176,934


We can see that with a `class_weight` of `balanced`, our model performs better at identifying positive observations (more true positives), but worse at identifying negative observations (fewer true negatives).

Let's take a look at the precision and recall scores.

In [45]:
precision_balanced = precision_score(y_validate, y_val_predict_balanced)
precision_balanced

# 0.48986486486486486

0.48986486486486486

In [46]:
recall_balanced = recall_score(y_validate, y_val_predict_balanced)

recall_balanced

# 0.7754010695187166

0.7754010695187166

In [None]:
# previous precision score: 0.6622073578595318
# previous recall score: 0.5294117647058824

We can see that by changing to balanced, the recall of the classifier greatly increased (from about .53 to .78), but the precision score decreased (from about .66 to .49).

### Using the F1 score

Finally, we can use the F1 score, which is the harmonic mean of the precision and recall scores.  We'll use this metric to compare our two models.

In [47]:
from sklearn.metrics import f1_score

In [48]:
f1_score(y_validate, model.predict(X_validate))
# 0.5884101040118871

0.5884101040118871

In [51]:
f1_score(y_validate, balanced_model.predict(X_validate))
# 0.6004140786749482

0.6004140786749482

We can see that by the F1 score, which combines precision and recall into a single number, the balanced model performed slightly better (about .60 versus .59).
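
To see where these numbers come from, the F1 score can also be computed by hand as the harmonic mean $F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$. The sketch below uses the counts from the two confusion matrices in this lesson; the helper `f1_from_counts` is a hypothetical name for illustration, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts taken from the validation confusion matrices above.
f1_original = f1_from_counts(tp=198, fp=101, fn=176)
f1_balanced = f1_from_counts(tp=290, fp=302, fn=84)

print(round(f1_original, 4), round(f1_balanced, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model needs both reasonable precision and reasonable recall to score well.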

### Summary

In this lesson, we practiced calculating the precision and recall scores, and compared two logistic regression models: one trained with the default class weights, and one trained with `class_weight = 'balanced'`.

We applied the formulas of:

* $precision = \frac{TP}{TP + FP}$
* $recall = \frac{TP}{TP + FN}$

We also computed the accuracy score as the total correctly classified (TP + TN) divided by all of the observations (TP + TN + FP + FN).

### Resources

[Class weight](https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work)