# 🚯 Lecture 11 Lab: Logistic regression and spam detection

<img src="img/spam-email.png" alt="spam-email" width="500" />

## ✅ Setup and data import
In this lab, we will work with a [classic dataset](https://archive.ics.uci.edu/dataset/94/spambase) of 4,601 emails classified as spam or not spam.

In [0]:
# Load in additional functions
library(tidyverse)
library(lubridate)

# Use three digits past the decimal point,
# and don't use scientific notation.
options(digits = 3, scipen = 999)

# Format plots with a white background and dark features.
theme_set(theme_bw())

# Increase the default text size of plots.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
theme_update(text = element_text(size = 20))

# Increase the default plot width and height.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
options(repr.plot.width=12, repr.plot.height=8)

# Read in the data
spam = read_csv('https://jdgrossman.com/assets/spam.csv')

# Peek at 10 random rows
sample_n(spam, 10)

## ♨️ Warm up

How many emails are in the dataset?

What fraction of the emails in the dataset are spam?

Which email contains the highest percentage of words matching "money"? What percentage of the words in that email match "money"?
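If the relevant `dplyr` verbs are unfamiliar, the sketch below shows the general pattern on a small made-up table. The tibble `toy` and its columns (`id`, `is_flagged`, `pct_word`) are hypothetical and are not part of the spam data; adapt the column names to what you see in `spam`.

```r
library(dplyr)

# Hypothetical toy data, NOT the spam dataset.
toy = tibble(
  id = 1:5,
  is_flagged = c(1, 0, 0, 1, 0),
  pct_word = c(0.0, 2.5, 0.4, 7.1, 1.2)
)

# Count the rows, and take the mean of a 0/1 column to get a fraction.
toy_summary = toy %>%
  summarize(n_rows = n(), frac_flagged = mean(is_flagged))
toy_summary

# Keep the single row with the largest value of a column.
top_row = toy %>%
  slice_max(pct_word, n = 1)
top_row
```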

In [0]:
# Your code here!



## 🎲 Linear probability models (LPMs)

Fit a linear regression model to the spam data with the `lm` function. 

Use the following covariates to predict the probability that an email is spam:
- `char_dollar`
- `credit`
- `money`
- `re`

How would you interpret the model coefficients for the intercept and for `char_dollar`?

- Note: `char_dollar` represents the percentage of characters in the email that match `$`.

Using your linear probability model and the `predict` function, predict the in-sample probability that each email is spam.

What is the smallest predicted probability? The largest? Do you notice any issues with these predictions?
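As a point of comparison, here is a minimal sketch of the same workflow on made-up data. The data frame `toy` and its columns `x` and `y` are hypothetical, chosen only to show how `lm` and `predict` fit together; the spam model will use the covariates listed above.

```r
# Hypothetical toy data, NOT the spam dataset:
# a binary outcome y and a single covariate x.
toy = data.frame(
  x = 0:9,
  y = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# A linear probability model is ordinary least squares with a 0/1 outcome.
toy_lpm = lm(y ~ 1 + x, data = toy)
coef(toy_lpm)

# With no `newdata` argument, predict() returns the in-sample fitted values.
toy_pred = predict(toy_lpm)

# Inspect the range of fitted values; nothing in a linear model
# forces them to stay inside [0, 1].
range(toy_pred)
```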

In [0]:
# Your code here!



## 🎰 Odds functions

Write two functions:
- A function to convert probabilities to odds.
- A function to convert odds to probabilities.

Test your functions by making sure that 2:1 odds return a probability of 2/3, and vice versa.

Finally, suppose my probability of winning is 60%. If I double my odds of winning, what is my new probability of winning?
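For reference, the standard relationship between a probability $p$ and the corresponding odds is written out below; your two functions should implement these two formulas.

$$
\text{odds} = \frac{p}{1 - p}, \qquad p = \frac{\text{odds}}{1 + \text{odds}}
$$

For example, $p = 2/3$ gives odds of $\frac{2/3}{1/3} = 2$, and odds of $2$ give $p = \frac{2}{1 + 2} = 2/3$.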

In [0]:
# Your code here!



## 🪙 Fitting a logistic regression model 

Using the same covariates as above, we can fit a logistic regression model with the following code:

In [0]:
model = glm(is_spam ~ 1 + char_dollar + credit + money + re, family='binomial', data=spam)

summary(model)

Interpret the intercept and `char_dollar` coefficients for the logistic regression model in three different ways:
1. On the log odds scale.
2. On the odds scale (by exponentiating the coefficients).
3. On the probability scale (using either the odds functions you wrote or the divide-by-4 trick).

Tip: Use the `coef` function to extract coefficients from the model.
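The sketch below walks through the three scales on made-up data, so the numbers are not the answers for the spam model. The data frame `toy`, its columns `x` and `y`, and the model name `toy_logit` are hypothetical. It uses base R's `plogis` (the inverse logit) in place of hand-written odds functions.

```r
# Hypothetical toy data, NOT the spam dataset.
toy = data.frame(
  x = 0:9,
  y = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 1)
)

toy_logit = glm(y ~ 1 + x, family = 'binomial', data = toy)

# 1. Log odds scale: the raw coefficients.
b = coef(toy_logit)
b

# 2. Odds scale: exponentiate. exp(intercept) is the odds when x = 0;
#    exp(slope) is the multiplicative change in odds per one-unit increase in x.
exp(b)

# 3. Probability scale.
#    plogis() converts log odds to a probability, here the probability at x = 0.
plogis(b[['(Intercept)']])

#    Divide-by-4 trick: slope / 4 is an upper bound on the change in
#    probability per one-unit increase in x (the bound is reached at p = 0.5).
b[['x']] / 4
```

The divide-by-4 trick works because the slope of the logistic curve is the coefficient times $p(1 - p)$, and $p(1 - p)$ is at most $1/4$.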

In [0]:
# Your code here! 

