# 🚯 Lecture 11 Lab: Logistic regression and spam detection

<img src="img/spam-email.png" alt="spam-email" width="500" />

## ✅ Setup and data import
In this lab, we will work with a [classic dataset](https://archive.ics.uci.edu/dataset/94/spambase) of 4,601 emails classified as spam or not spam.

In [3]:
# Load in additional functions
library(tidyverse)
library(lubridate)

# Print numbers with three significant digits,
# and don't use scientific notation.
options(digits = 3, scipen = 999)

# Format plots with a white background and dark features.
theme_set(theme_bw())

# Increase the default text size of plots.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
theme_update(text = element_text(size = 20))

# Increase the default plot width and height.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
options(repr.plot.width=12, repr.plot.height=8)

# Read in the data
spam = read_csv('https://jdgrossman.com/assets/spam.csv')

# Peek at 10 random rows
sample_n(spam, 10)

-- Attaching packages --------------------------------------- tidyverse 1.3.2 --
v ggplot2 3.5.0     v purrr   1.0.2
v tibble  3.2.1     v dplyr   1.1.4
v tidyr   1.3.0     v stringr 1.5.1
v readr   2.1.4     v forcats 1.0.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Attaching package: 'lubridate'


The following objects are masked from 'package:base':

    date, intersect, setdiff, union


Rows: 4601 Columns: 58
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (58): make, address, al

make,address,all,3d,our,over,remove,internet,order,mail,⋯,char_semicolon,char_left_paren,char_left_bracket,char_exclamation,char_dollar,char_pound,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.0,0.0,0.0,0,0.47,0.0,1.41,0,0.0,0.0,⋯,0.0,0.0,0.0,0.144,0.288,0.0,3.75,54,191,1
0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,1.05,⋯,0.0,0.0,0.335,0.0,0.0,0.0,4.13,26,124,0
0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0.69,⋯,0.0,0.228,0.114,0.0,0.0,0.114,3.65,28,157,0
0.0,0.0,0.0,0,0.0,0.0,0.68,0,0.0,0.68,⋯,0.0,0.144,0.0,0.0,0.0,0.072,3.37,19,155,0
0.0,0.0,0.44,0,0.0,0.0,0.0,0,0.0,0.0,⋯,0.0,0.0,0.061,0.0,0.0,0.0,1.95,17,230,0
0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,⋯,0.0,0.0,0.0,5.844,0.0,0.0,1.67,5,15,1
0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,5,0
0.0,0.0,0.0,0,0.0,0.0,0.0,0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,2.2,4,11,0
0.1,0.05,0.1,0,0.31,0.1,0.0,0,0.0,0.05,⋯,0.007,0.168,0.0,0.038,0.061,0.007,1.7,25,939,0
0.05,0.0,0.1,0,0.16,0.05,0.0,0,0.48,0.0,⋯,0.172,0.195,0.062,0.0,0.015,0.0,2.76,47,1073,0


## ♨️ Warm up

How many emails are in the database? 

What fraction of the emails in the database are spam? 

Which email contains the highest percentage of words matching "money"? What percentage of words in that email match "money"?

In [4]:
# Your code here!

# START answer

# 4601 emails
nrow(spam)

# 39% are spam
mean(spam$is_spam)

# 12.5% of words in one email match "money"
spam %>%
  arrange(desc(money)) %>% 
  slice(1) %>% 
  pull(money)

# END answer

## 🎲 Linear probability models (LPMs)

Fit a linear regression model to the spam data with the `lm` function. 

Use the following covariates to predict the likelihood that an email is spam:
- `char_dollar`
- `credit`
- `money`
- `re`

How would you interpret the model coefficients for the intercept and for `char_dollar`?

- Note: `char_dollar` represents the percentage of characters in the email that match `$`.

In [5]:
# Your code here!

## START answer

model = lm(is_spam ~ 1 + char_dollar + credit + money + re, data=spam)

summary(model)

# Intercept: For an email that does not mention $, credit, money, or re, the model predicts a 0.33 probability of being spam.

# char_dollar: For every unit increase in char_dollar (i.e., for every percentage point increase in the proportion of characters in the email that are $), the model predicts a 59pp increase in the probability of being spam.

## END answer


Call:
lm(formula = is_spam ~ 1 + char_dollar + credit + money + re, 
    data = spam)

Residuals:
   Min     1Q Median     3Q    Max 
-2.849 -0.335 -0.303  0.514  0.963 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  0.33459    0.00732   45.70 <0.0000000000000002 ***
char_dollar  0.58551    0.02675   21.89 <0.0000000000000002 ***
credit       0.15752    0.01285   12.26 <0.0000000000000002 ***
money        0.18794    0.01487   12.64 <0.0000000000000002 ***
re          -0.05355    0.00648   -8.27 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.443 on 4596 degrees of freedom
Multiple R-squared:  0.179,	Adjusted R-squared:  0.178 
F-statistic:  250 on 4 and 4596 DF,  p-value: <0.0000000000000002


Using your linear probability model and the `predict` function, predict the in-sample probability that each email is spam.

What is the smallest predicted probability? The largest? Do you notice any issues with these predictions?

In [6]:
# Your code here!

## START answer

# equivalent to predict(model)
predictions = predict(model, newdata=spam)

summary(predictions)

# The predictions range from -0.81 to 3.85, but probabilities must lie
# between 0 and 1, so the LPM produces impossible values.

# END answer

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -0.81    0.33    0.33    0.39    0.40    3.85 
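
To see why a linear probability model can stray outside the interval from 0 to 1, here is a minimal sketch on a small made-up dataset. The `toy` data frame and its columns `x` and `y` are hypothetical and are not part of the spam data. A straight line fit to 0/1 outcomes keeps falling or rising past the valid range at the extremes of `x`:

```r
# Hypothetical toy data: the outcome flips from 0 to 1 as x grows.
toy <- data.frame(
  x = 0:9,
  y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Fit a linear probability model with ordinary least squares.
toy_lpm <- lm(y ~ x, data = toy)

# In-sample predictions, analogous to predict(model) on the spam data.
toy_pred <- predict(toy_lpm)

# The fitted line dips below 0 at the smallest x and rises above 1 at the largest x.
range(toy_pred)

# Count how many predictions are not valid probabilities.
sum(toy_pred < 0 | toy_pred > 1)
```

The same thing happens in the spam model: emails with unusually large covariate values are pushed far along the fitted line, which has no built-in floor or ceiling.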

## 🎰 Odds functions

Write two functions:
- A function to convert probabilities to odds.
- A function to convert odds to probabilities.

Test your functions by checking that 2:1 odds convert to a 2/3 probability, and vice versa. 

Finally, suppose my probability of winning is 60%. If I double my odds of winning, what is my new probability of winning?

In [7]:
# Your code here!

# START answer

prob_to_odds = function(prob) {
  prob / (1 - prob)
}

odds_to_prob = function(odds) {
  odds / (1 + odds)
}

prob_to_odds(2/3)
odds_to_prob(2)

# 75% is my new probability of winning
odds_to_prob(2 * prob_to_odds(0.6))

# END answer
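
As a worked check of the last question, the arithmetic can be written out using the same two definitions the functions implement, odds = p / (1 - p) and p = odds / (1 + odds):

$$
\text{odds} = \frac{0.6}{1 - 0.6} = 1.5, \qquad 2 \times 1.5 = 3, \qquad p_{\text{new}} = \frac{3}{1 + 3} = 0.75
$$

Doubling the odds therefore raises the probability from 60% to 75%, not to 120%: odds scale multiplicatively, while probabilities stay below 1.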

## 🪙 Fitting a logistic regression model 

The following code fits a logistic regression model using the same covariates as above:

In [8]:
model = glm(is_spam ~ 1 + char_dollar + credit + money + re, family='binomial', data=spam)

summary(model)

"glm.fit: fitted probabilities numerically 0 or 1 occurred"



Call:
glm(formula = is_spam ~ 1 + char_dollar + credit + money + re, 
    family = "binomial", data = spam)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.785  -0.769  -0.621   0.607   3.279  

Coefficients:
            Estimate Std. Error z value             Pr(>|z|)    
(Intercept)  -1.0666     0.0432  -24.68 < 0.0000000000000002 ***
char_dollar  11.8176     0.6045   19.55 < 0.0000000000000002 ***
credit        2.3119     0.3430    6.74   0.0000000000157692 ***
money         1.9933     0.2485    8.02   0.0000000000000010 ***
re           -0.7755     0.0994   -7.81   0.0000000000000059 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6170.2  on 4600  degrees of freedom
Residual deviance: 4427.8  on 4596  degrees of freedom
AIC: 4438

Number of Fisher Scoring iterations: 7
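
To connect the coefficient table to a predicted probability, here is a minimal sketch that plugs the estimates printed above into the inverse logit by hand. The email described by `example_email` is hypothetical, and `fitted_coefs` is copied from the summary output rather than extracted from the fitted object:

```r
# Estimates copied from the summary output above.
fitted_coefs <- c(
  intercept   = -1.0666,
  char_dollar = 11.8176,
  credit      =  2.3119,
  money       =  1.9933,
  re          = -0.7755
)

# Hypothetical email: 0.1% of characters are "$" and 0.5% of words match "money".
example_email <- c(intercept = 1, char_dollar = 0.1, credit = 0, money = 0.5, re = 0)

# Linear predictor on the log-odds scale.
log_odds <- sum(fitted_coefs * example_email)

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
prob_spam <- plogis(log_odds)

log_odds
prob_spam
```

Because the inverse logit maps any real number into the open interval from 0 to 1, this calculation cannot return an impossible probability, whatever covariate values are supplied. With the fitted object available, `predict(model, newdata = ..., type = "response")` should return probabilities on the same scale.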


Interpret the intercept and `money` coefficients for the logistic regression model three different ways:
1. On the log odds scale
2. On the odds scale (by exponentiating the coefficients)
3. On the probability scale (using either the odds functions you wrote, or the divide by 4 trick).

Tip: Use the `coef` function to extract coefficients from the model.

In [11]:
# Your code here! 

# START answer

coefficients = coef(model)

coefs_to_interpret = coefficients[c('(Intercept)', 'money')]

# For an email that does not mention $, credit, money, or re, the model predicts a -1.07 log-odds of being spam. This corresponds to a probability of 26%.
# The model predicts a 1.99 increase in the log odds of being spam for every percentage point increase in the proportion of words that match "money".
print(coefs_to_interpret)

odds_to_prob(exp(-1.07))

# For an email that does not mention $, credit, money, or re, the model predicts 0.344 odds of being spam. As above, this corresponds to a probability of 26%.
# Holding the other covariates fixed, the model predicts that the odds of an email being spam are multiplied by 7.3 for every percentage point increase in the proportion of words that match "money".
print(exp(coefs_to_interpret))

odds_to_prob(0.344)

# By the divide by 4 trick, a 1pp increase in the proportion of words in the email that match 'money' changes the predicted probability of spam by at most 0.498 (about 50pp). The bound is approached when the predicted probability is near 0.5.
print(coefs_to_interpret['money']/4)

# END answer

(Intercept)       money 
      -1.07        1.99 


(Intercept)       money 
      0.344       7.340 


money 
0.498 
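
The divide by 4 trick follows from the slope of the logistic curve, which equals the coefficient times p * (1 - p) and peaks at p = 0.5. A minimal sketch, assuming the rounded `money` and intercept estimates printed above (the variable and function names here are illustrative):

```r
beta_money <- 1.99    # rounded money coefficient from the output above
intercept  <- -1.07   # rounded intercept from the output above

# Slope of the predicted probability with respect to money at probability p.
slope_at <- function(p) beta_money * p * (1 - p)

# The slope is largest at p = 0.5, where it equals beta / 4.
max_slope <- slope_at(0.5)
max_slope

# At the baseline email (all covariates zero), p is about 0.26, so the slope is smaller.
p_baseline <- plogis(intercept)
slope_at(p_baseline)

# Exact change in probability for a one-unit increase in money from the baseline.
plogis(intercept + beta_money) - p_baseline
```

Comparing the three printed values shows that dividing by 4 gives an upper bound on the effect, not the effect for any particular email.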
