## Activity 5 - Linear Regression

**Due** on *Monday, November 13th* by 11:59 pm

You will be asked to complete a short survey on ICON about the output generated below. Additional questions to consider are sprinkled throughout the notebook. These do not need to be answered explicitly, but they can serve as a guide for thinking about and interpreting the statistical output.

## Setup

This first code cell needs to be executed ("Run") every time this notebook is opened. For example, if you stop working on this activity and come back to it later, this first code cell will need to be executed again to load the data, even though output may still show up from the prior time you worked on the activity.

The data for this activity come from the [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) project. The data contain 19,405 rows and 28 columns about tornados from around the United States between 2007 and 2022. A description of each column in the data is shown below ([see the Tidy Tuesday page for more information](https://github.com/rfordatascience/tidytuesday/tree/master/data/2022/2022-08-09)).

|variable     |class     |description  |
|:------------|:---------|:------------|
|om           |integer   |Tornado number. Effectively an ID for this tornado in this year.|
|yr           |integer   |Year, 1950-2022. |
|mo           |integer   |Month, 1-12.|
|dy           |integer   |Day of the month, 1-31. |
|date         |date      |Date. |
|time         |time      |Time. |
|tz           |character |[Canonical tz database timezone](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).|
|datetime_utc |datetime  |Date and time normalized to UTC. |
|st           |character |Two-letter postal abbreviation for the state (DC = Washington, DC; PR = Puerto Rico; VI = Virgin Islands). |
|stf          |integer   |State FIPS (Federal Information Processing Standards) number. |
|mag          |integer   |Magnitude on the F scale (EF beginning in 2007). Some of these values are estimated (see fc). |
|inj          |integer   |Number of injuries. When summing for state totals, use sn == 1 (see below). |
|fat          |integer   |Number of fatalities. When summing for state totals, use sn == 1 (see below). |
|loss         |double    |Estimated property loss information in dollars. Prior to 1996, values were grouped into ranges. The reported number for such years is the maximum of its range. |
|slat         |double    |Starting latitude in decimal degrees. |
|slon         |double    |Starting longitude in decimal degrees. |
|elat         |double    |Ending latitude in decimal degrees. |
|elon         |double    |Ending longitude in decimal degrees. |
|len          |double    |Length in miles. |
|wid          |double    |Width in yards. |
|ns           |integer   |Number of states affected by this tornado. 1, 2, or 3. |
|sn           |integer   |State number for this row. 1 means the row contains the entire track information for this state, 0 means there is at least one more entry for this state for this tornado (om + yr). |
|f1           |integer   |FIPS code for the 1st county. |
|f2           |integer   |FIPS code for the 2nd county. |
|f3           |integer   |FIPS code for the 3rd county. |
|f4           |integer   |FIPS code for the 4th county. |
|fc           |logical   |Was the mag column estimated? |
|log_loss     |double    |The log of the loss attribute. |

### Guiding question for the activity
1. How accurately can a linear regression model predict the log loss of a tornado using other attributes of the tornado?

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(rpart)
library(rpart.plot)
library(rsample)

theme_set(theme_bw(base_size = 18))

tornado <- readr::read_csv('https://raw.githubusercontent.com/lebebr01/psqf_6243/main/data/tornados.csv')

head(tornado)

## Question 1

Explore the log loss (i.e., `log_loss`) in the tornado data.

Fill in the primary attribute of interest (i.e., `log_loss`) in place of "^^". Also, fill in appropriate axis labels and a plot title in place of "%%".

In [None]:
gf_density(~ ^^, data = tornado) |>
  gf_labs(x = '%%',
          y = '%%',
          title = '%%')

Also explore the distribution of the loss attribute (i.e., `loss`) in the tornado data.

Fill in the primary attribute of interest (i.e., `loss`) in place of "^^". Also, fill in appropriate axis labels and a plot title in place of "%%".

In [None]:
gf_density(~ ^^, data = tornado) |>
  gf_labs(x = '%%',
          y = '%%',
          title = '%%')

### Questions to think about

1. What are the shape, center, and variation for each figure?
2. Identify some key differences when comparing the log loss attribute to the loss attribute.
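
As a rough illustration of why an attribute and its log can look so different, the sketch below uses simulated right-skewed values rather than the tornado data. The names `sim_loss` and `sim_log_loss` and the simulation settings are assumptions made only for this example.

```r
# Simulated right-skewed values; these are NOT the tornado data
set.seed(6243)
sim_loss <- rlnorm(1000, meanlog = 10, sdlog = 2)
sim_log_loss <- log(sim_loss)

# Raw scale: a long right tail pulls the mean well above the median
c(mean = mean(sim_loss), median = median(sim_loss))

# Log scale: the mean and median sit close together,
# which is what a roughly symmetric distribution looks like
c(mean = mean(sim_log_loss), median = median(sim_log_loss))
```

Comparing the mean to the median on each scale is one quick numeric check of shape that can be read alongside the density figures.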

## Question 2

Interpret the correlations found between log loss, length, and width of the tornados. 

Note that the code below returns a correlation matrix with 1's on the diagonal (these can be ignored). The correlations appear in the off-diagonal elements, where each value is the correlation between the row attribute and the column attribute. For example, the value in the row `log_loss` and the column `wid` is the correlation between log loss and width.

In [None]:
tornado |>
  select(log_loss, len, wid) |>
  cor(use = 'pairwise.complete.obs') |> 
  round(3)
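
To make the row-and-column reading concrete, here is a minimal sketch on simulated data. The data frame `toy` and its columns `x`, `y`, and `z` are hypothetical and stand in for any three numeric attributes.

```r
# Simulated data with three hypothetical numeric columns
set.seed(1)
n <- 200
x <- rnorm(n)
toy <- data.frame(
  x = x,
  y = 2 * x + rnorm(n),  # built to be related to x
  z = rnorm(n)           # built to be unrelated to x and y
)

cor_mat <- round(cor(toy, use = 'pairwise.complete.obs'), 3)
cor_mat

# Row 'y', column 'x' holds the correlation between y and x
cor_mat['y', 'x']

# The matrix is symmetric, so row 'x', column 'y' holds the same value
cor_mat['x', 'y']

# Three attributes give only three unique correlations (below the diagonal)
cor_mat[lower.tri(cor_mat)]
```

Because the matrix is symmetric, only the values on one side of the diagonal need to be interpreted.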

### Questions to consider

1. Interpret the 3 unique correlations between log loss, length, and width. 
2. Which attribute would help predict the log loss of a tornado given the correlation values? 

## Question 3

Fit a linear regression model to predict the log loss of a tornado using the length of the tornado (i.e., `len`).

Place the outcome (i.e., `log_loss`) in place of the "^^" and place the length of the tornado (`len`) in place of "@@".

In [None]:
tornado_mod <- lm(^^ ~ @@, data = tornado)

coef(tornado_mod) |> round(3)

### Questions to consider

1. What is the best interpretation for the term "Intercept"?
2. What is the best interpretation for the slope term, "len"?
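
One way to check an interpretation of the two coefficients is to compare them with model predictions. The sketch below does this on simulated data. The data frame `toy`, its columns `x` and `y`, and the model `toy_mod` are hypothetical and are not part of the tornado analysis.

```r
# Simulated data from a known line plus noise (hypothetical, not the tornado data)
set.seed(2)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 3 + 0.5 * toy$x + rnorm(100)

toy_mod <- lm(y ~ x, data = toy)
coef(toy_mod)

# The intercept matches the predicted y when x is 0
predict(toy_mod, newdata = data.frame(x = 0))

# The slope matches the change in predicted y for a one-unit increase in x
# (here, going from x = 4 to x = 5)
diff(predict(toy_mod, newdata = data.frame(x = c(4, 5))))
```

The same comparison works for any pair of predictor values that are one unit apart, since the fitted line has a constant slope.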