content/chapters/Relationships_among_numerical_variables.qmd

---
title: Relationships among numerical variables
subtitle: When we only have numbers
bibliography: ../references.bib
---

<!-- COMMENT NOT SHOW IN ANY OUTPUT: Code chunk below sets overall defaults for .qmd file; these inlcude showing output by default and looking for files relative to .Rpoj file, not .qmd file, which makes putting filesin different folders easier  -->

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
source("../globals.R")
```

In the last chapter we extended linear models to consider impacts of
multiple factors. Continuing that tradition, we will now explore how
numerical (and specifically continuous) predictor variables can be used
to explain variation in numerical outcome variables.

## Back to the iris data

We will motivate this with an example from our favorite iris data. So
far we have considered how species impacts measured outcomes. However,
we might also want to consider the relationship between traits. For
example, we might want to know if sepal and petal length are related. We
could plot the data:

```{r}
library(ggplot2)
sepal_petal_relationship <- ggplot(iris, aes(y=Sepal.Length, x=Petal.Length)) +
  geom_point() +
  labs(title="Sepal length increases with petal length",
       y= "Sepal length (cm)",
       x= "Petal Length (cm)")
sepal_petal_relationship

```

Our related hypothesis might be a relationship exists among the
variables; alternatively, one does not. To put this our null hypothesis
framework, we might write:

$$
\begin{split}
H_O: \textrm{there is not a relationship between sepal and petal length}\\
H_A: \textrm{there is a relationship between sepal and petal length}\\
\end{split}
$$

In other words, we need to gather enough evidence to reject the
hypothesis of no relationship. Note we will formalize these hypotheses
more in a moment, but how do we test them?

### Working with continuous predictors

For some background, consider differences between continuous and
categorical predictor variables. Unlike examples of when we transformed
continuous outcomes into binomial data, continuous predictors offers
information on order and spacing. Compared to un-ordered categorical
variables (what we've focused on), the numbers mean something! This
allows us (with caution) to estimate outcomes from un-sampled regions.

Consider: We know the mean sepal lengths for three species of irises:

```{r}
library(Rmisc)
function_output <- summarySE(iris, measurevar="Sepal.Length", groupvars =
                               c("Species"))

ggplot(function_output, aes(y=Sepal.Length, x=Species, fill=Species)) +
  geom_col(aes(fill=Species)) +
    geom_errorbar(aes(ymin=Sepal.Length-ci, ymax=Sepal.Length+ci)) +
  labs(title=expression(paste("Sepal lengths of ",italic("I. species"))),
       y= "Sepal length (cm)",
       x= "Species")
```

But if we find a new species, we actually don't know what to expect!

However, if we have this graph

```{r}
sepal_petal_relationship
```

We might have a guess of the petal length of a flower that has a sepal
length of 2.5 cm even though we didn't measure anything of that size. In
fact, you might be able to visualize the relationship:

```{r}
sepal_petal_relationship+geom_smooth(method = "lm",se = F)
```

While we can draw this "line of best fit" by sight, linear models
actually give us a way to formally define it (analyze our hypothesis).
The line of best fit minimizes the residuals (which means it also offers
a prediction any value of sepal length!). This approach, however, also
means we are considering **linear** relationships between our predictor
variables (but note our predictor variables themselves could be a
transformation of an actual measurement, such as a square root or value
cubed!). This is one assumption of linear regression. We can consider
**non-linear** relationships using other techniques (which we will cover
later).

If we wanted to carry out a sampling experiment to determine a p-value
for our hypotheses, we could sample from a population that represents
our sepal lengths and one that represents our petal lengths. If the two
populations are not connected, then arbitrary pairs could be made - this
would indicate no relationship among the variables. For each dataset, we
could note the potential relationship (now occuring by chance!) between
our variables, and then compare that null distribution to what we
actually observed. To make this approach more generalizable, we could
standardize our data points - using these to calculate our error terms
for full and reduced models would lead us to F distributions.

Overall, this means we can use a linear model approach to investigate
relationships among numerical variables. We can build the model

```{r}
iris_regression <- lm(Sepal.Length ~ Petal.Length, iris)
```

and see the impacts on our linear model.

The relationship between our two variables is noted in the model matrix,
which now includes our actual values,

```{r}
library(rmarkdown)
paged_table(data.frame(model.matrix(iris_regression)))
```

and a coefficient that shows their relationship. The first parameter is
the x-intercept, and the second (the relationship) is the slope of the
best fit line!

```{r}
matrix(as.numeric(iris_regression$coefficients), ncol=1)
```

Once the model is made, we can plot it to check assumptions

```{r}
plot(iris_regression)

```

For numerical relationships we typically see more of a cloud of data,
but we are still assuming the residuals are normally distributed (with
the same variance for all fitted values, i.e. identically distributed)
and indepent. Iassumptions are met consider outcomes.

```{r}
library(car)
Anova(iris_regression, type = "III")
summary(iris_regression)

```

Here the ANOVA table tells us there is a significant impact of sepal
length on petal length, and the summary (and graph) demonstrate that
relationship is positive. Note we do not need any post-hoc analysis (why
not?).

## Regression or correlation

This approach, with small changes in theory, can be used for two
scenarios. We typically consider (but actually rarely use) linear
regression. Linear regression technically assumes that an approach was
used to determine how one variable impacts another (so we chose which
one to vary and how). For this scenario, our sampling experiment
technically uses the set values of the predictor variable,and our
hypotheses focus on the coefficient value

$$
\begin{split}
H_O: \beta_\textrm{(coefficient between sepal and petal length)} = 0\\
H_A: \beta_\textrm{(coefficient between sepal and petal length)} \neq 0\\
\end{split}
$$

Note our coefficient will also change based on measurement units, but
this should not impact the resulting p-value.

The other approach, correlation, is more simply measuring association
between the variables. It's not specifying which, if either, is the
driver - both variables could be responding to an un-measured variable.
For example, since we simply observed flower traits, we could easily
reverse everything above (plot shown here).

```{r}
ggplot(iris, aes(x=Sepal.Length, y=Petal.Length)) +
  geom_point() +
  labs(title="Petal length increases with sepal length",
       x= "Sepal length (cm)",
       y= "Petal Length (cm)")
```

Note that doing so would change the coefficients in our linear model,but
not the direction of the relationship. For this reason, correlation
often focuses on a unitless correlation parameter, *r*, instead of a
coefficient from our $\beta$ matrix.

$$
\begin{split}
H_O: r_\textrm{(association between sepal and petal length)} = 0\\
H_A: r_\textrm{(association between sepal and petal length)} \neq 0\\
\end{split}
$$

*r* can vary from -1 (values are perfectly negatively related) to 1
(values are perfectly positively related), where 0 indicates no
associatoin. This should sound familiar (hold this thought). For a
related sampling experiment, populations for both traits are simulated
and then paired (what we described above!).

The correlation coefficient can be calculated in R using the *cor.test*
function; note the formula interface is different to reflect
association.

```{r}
cor.test(~ Sepal.Length + Petal.Length, data = iris)
```

Note that the output also gives us a confidence interval and estimate
for r in addition to a p-value. If we square the provided *r* value, we
get the R^2^ output we have previously described in our linear model
summary.

```{r}
cor.test(~ Sepal.Length + Petal.Length, data = iris)$estimate^2
summary(iris_regression)$r.squared
```

### Other options

If our assumptions are not met, we have a few options. You may have
noticed the *cor.test* function provided a *Pearson's* product-moment
correlation. This is one approach that uses the raw data. Another
approach, the Spearman rank correlation, uses (surprise) ranked data.
This relaxes the assumption of normality and only assumes monotonic
relationships (one variable increases or decreases with the other). We
can use it by updating the arguments in the *cor.test* function.

```{r}
cor.test(~ Sepal.Length + Petal.Length, data = iris, method="spearman")
```

Bootstrapping and permutation options also exist. Some of these use the
same functions we've encountered before. For example, we can do
permutation tests using the **coin** package.

```{r}
library(coin)
independence_test(Sepal.Length ~ Petal.Length, iris)
```

Bootstrapping may be done using the **boot** or other packages. For
example, the *boot* function in the **car** package gives an easy
wrapper for *boot*.

```{r}
library(car)
bootstrap_iris <- Boot(iris_regression)
summary(bootstrap_iris)
Confint(bootstrap_iris)
```

## A little more on assumptions

Although our linear models all have the same assumptions, numerical
predictors add a few new wrinkles. For numerical predictors, outliers
may be more of an issue. Outliers may be used a general term to focus on
a data point that is different from the rest of the dataset, but only
certain types of outliers matter.

For example, let's pretend we realized our *iris* dataset was missing 3
rows and add them back in.

```{r}
iris_new <- iris
iris_new$Source <- "original"
iris_new$label <- NULL
#make outlier
iris_outlier <- data.frame(Petal.Length = c(2.5,12, 12.1),
                           Sepal.Length = c(5.4,8.9, 3), 
                           Source = "new",
                           label = c("A","B", "C"))
iris_merged <- merge(iris_new, iris_outlier, all = T)
iris_merged <- iris_merged[order(iris_merged$Source, decreasing = T),]
rownames(iris_merged) <- 1:nrow(iris_merged) 

iris_merged$row_number <- 1:nrow(iris_merged)

ggplot(iris_merged, aes(x=Petal.Length, y=Sepal.Length, color=Source)) +
  geom_point() +
  geom_text(iris_merged[iris_merged$Source != "original",],mapping=aes(label=row_number,y=Sepal.Length+.2), color="black") +
  labs(title="Iris data including our missing 3 rows!",
       y= "Sepal length (cm)",
       x= "Petal Length (cm)")
```

Our 3 "new" points (rows 151-153) are all unique ins some way. **152**
is from a relatively unsampled region of the graph that is within the
range of existing data. **151** and **153** are also from un-sampled
regions, but these are outside of the original range. More importantly,
note if we fit the data *without* these points

```{r}
ggplot(iris_merged, aes(y=Sepal.Length, x=Petal.Length, color=Source)) +
  geom_point() +
  geom_text(iris_merged[iris_merged$Source != "original",],mapping=aes(label=row_number,y=Sepal.Length+.2), color="black") +
  labs(title="Iris data including our missing 3 rows!",
       subtitle = "Best fit based on original data only!",
       y= "Sepal length (cm)",
       x= "Petal Length (cm)") +
  geom_smooth(data=iris_new[iris_new$Source == "original",], method = "lm", fullrange=T, se=F)
```

We see that only row **151** appears to be different than what we would
expect given the rest of the data. Now, note if we fit a model all these
points and check assumptions

```{r}
iris_regression_new <- lm(Sepal.Length ~ Petal.Length, iris_merged)
plot(iris_regression_new)
```

several plots label row **151**. In general, R provides the row number
of the most "unusual" rows in each graph. This may mean a fairly normal
row that just so happens to be the "most" extreme in a dataset is
labelled. However, here we see row **151** falling outside the dotted
lines that are labelled *Cook's distance* in the 4th graph. Cook's
distance is one way of quantifying leverage, or how much a single point
shifts the best fit line. The measure quantifies change in the
regression coefficients if each data point is removed individually. Here
we see row **151** is a high leverage point, while **153** is not even
though the sepal length itself may be an outlier.

Although row **153** is not identified as a high leverage point, it *is*
impacting the line (as are all other points). Our best fit line will
*always* go though the point $(\bar{x}, \bar{y})$, so outliers in the
"x" variable will impact the line. This also explains why the confidence
region (which we won't calculate here), gets "smaller" in the middle of
the data range.

```{r}
ggplot(iris_merged, aes(x=Petal.Length, y=Sepal.Length)) +
  geom_point(aes(color=Source)) +
  geom_text(iris_merged[iris_merged$Source != "original",],mapping=aes(label=row_number,y=Sepal.Length+.2), color="black") +
  labs(title="Iris data including our missing 3 rows!",
       subtitle = "Grey area is confidence region, with mean point shown in purple!",
       x= "Petal length (cm)",
       y= "Sepal Length (cm)") +
  geom_smooth( method = "lm", fullrange=T, se=T)+
  geom_point(aes(x=mean(iris_merged$Petal.Length), y=mean(iris_merged$Sepal.Length)), color="purple")
```

This all the further you get from the mean, the wider your confidence
interval will be for an estimate. Even more importantly, estimating
points outside your data range is likely a bad idea.

## Plotting outcomes

As shown above, numerical data is often plotted via paired points. You
can also add regression lines with or without confidence regions.

## Next steps

In the next chapters we will carry our linear model approach to consider
the relationship between continuous outcomes and continuous predictor
variables.