# Linear regression

In this lesson we will learn how to perform a simple linear regression by examining the counts, weight, and size measurements of juvenile snowshoe hares (*Lepus americanus*) observed at the Bonanza Creek Experimental Forest from 1999 to 2012 @kielland_snowshoe_2017. 

## About the data

Size measurements, sex, and age of snowshoe hares were collected and made available by Dr. Knut Kielland and colleagues at the [Bonanza Creek Experimental Forest Long Term Ecological Research (LTER) site](https://www.lter.uaf.edu) located approximately 20 km southwest of Fairbanks, Alaska, USA. The data contains observations of 3380 snowshoe hares obtained by capture-recapture studies conducted yearly from 1999 to 2012 in three sampling sites: Bonanza Riparian, Bonanza Mature and Bonanza Black Spruce. 

## Data exploration

Let's start by loading the data and taking a very high-level look at it:


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
hares = pd.read_csv(os.path.join('data', 'knb-lter-bnz', '55_Hare_Data_2012.txt'))
hares.head()

## Examining hare age data

In this example we are interested in working with data from juvenile hares exclusively. So we will:

1. Examine the values in the `age` column
2. Filter for observations in which age is 'adult' or 'juvenile'
3. Investigate the age distributions across time

From the [dataset's metadata](https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-bnz.55.22) we know there are three allowed values in the `age` column:

- 'a' for 'adult', 
- 'j' for 'juvenile', and 
- 'm' for 'mortality'. 
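
The first two steps listed above can be sketched as follows. The small data frame here is a made-up stand-in for `hares`, and the assumption that codes may show up with inconsistent capitalization or surrounding spaces is ours; with the real data, inspect the output of `unique()` before deciding how to filter.

```python
import pandas as pd

# Hypothetical stand-in for the `hares` data frame (values are made up)
toy_hares = pd.DataFrame({
    'age': ['a', 'j', 'J', 'm', None, 'j', 'a'],
    'weight': [1400.0, 650.0, 720.0, 1350.0, 1200.0, 580.0, 1500.0],
})

# 1. Examine the values in the `age` column
print(toy_hares['age'].unique())
print(toy_hares['age'].value_counts(dropna=False))

# 2. Normalize the codes, then keep adult and juvenile observations
age_clean = toy_hares['age'].str.strip().str.lower()
adults_juveniles = toy_hares[age_clean.isin(['a', 'j'])]

# Juveniles only, for the regression later in the lesson
juveniles = toy_hares[age_clean == 'j']
```

Missing ages become `NaN` after the string operations, so they are excluded by both filters.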

## Linear regression

For our analysis we want to investigate possible relations between hind foot length and weight for juvenile hares.

Let's investigate whether a linear model is an adequate way to describe this data. To do this we will use the [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model from the [`scikit-learn`](https://scikit-learn.org/stable/index.html) library. 


The trickiest part of fitting the model is to get the data in the required shape:

- input data (the x-values, independent variable, or training data) shape should be `(n_samples, 1)`. The 1 comes from having a single feature modeling the output data.
- output data (the y-values, dependent variable, or target data) shape should be `(n_samples,)`.
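
One way to get these shapes from a data frame is sketched below. The `juveniles` data frame and its column names `hindft` and `weight` are assumptions made for illustration, and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical juvenile measurements (made-up values, assumed column names)
juveniles = pd.DataFrame({
    'hindft': [80.0, 95.0, 110.0, 120.0],
    'weight': [300.0, 520.0, 800.0, 950.0],
})

# Double brackets select a DataFrame, so the array is 2D: (n_samples, 1)
X = juveniles[['hindft']].to_numpy()

# Single brackets select a Series, so the array is 1D: (n_samples,)
y = juveniles['weight'].to_numpy()

# Equivalent route: reshape an existing 1D array into a single column
X_alt = juveniles['hindft'].to_numpy().reshape(-1, 1)

print(X.shape, y.shape)  # (4, 1) (4,)
```

`LinearRegression` does not accept missing values, so rows with `NaN` in either column need to be dropped (for example with `dropna`) before building `X` and `y`.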

Remember that the equation of the linear model is given by

$$\hat{y} = \beta_0 + \beta_1 x, $$

where

- $x$ = input variable
- $\hat{y}$ = **estimated $y$ value** at $x$ from the linear model 
- $\beta_0$ = the **$y$-intercept** of the linear model, this is interpreted as the estimated average value of $y$ when $x=0$
- $\beta_1$ = the **slope** of the linear model, this is the estimated difference in the predicted value $\hat{y}$ per unit of $x$.
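
As a minimal sketch of how $\beta_0$ and $\beta_1$ can be read from a fitted `LinearRegression` model, the example below uses made-up data that lies exactly on a known line, so the recovered values can be checked by eye. The variable names are ours, not part of the lesson's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = -500 + 12 x
x = np.array([70.0, 85.0, 100.0, 115.0])
y = -500.0 + 12.0 * x

# The model expects x with shape (n_samples, 1)
model = LinearRegression().fit(x.reshape(-1, 1), y)

beta_0 = model.intercept_  # estimated intercept
beta_1 = model.coef_[0]    # estimated slope (one entry per feature)

print(beta_0, beta_1)
```

With real measurements the points will not fall exactly on a line, and the fitted values are the least squares estimates rather than exact parameters.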

The **coefficient of determination**, $R^2$, is between 0 and 1. It is interpreted as the proportion of variation in the outcome variable $y$ that is explained by the least squares line with the variable $x$.
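
To make this interpretation concrete, the sketch below computes $R^2$ by hand as one minus the ratio of unexplained variation to total variation, and compares it with the value returned by the model's `score` method. The data is invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noisy data (not real hare measurements)
x = np.array([70.0, 80.0, 90.0, 100.0, 110.0, 120.0])
y = np.array([310.0, 400.0, 560.0, 640.0, 820.0, 890.0])
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r2_manual = 1 - ss_res / ss_tot

# For LinearRegression, score() returns the coefficient of determination
r2_sklearn = model.score(X, y)

print(r2_manual, r2_sklearn)
```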

-->

We can use all this information to plot our linear model together with our data:
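
A minimal sketch of such a plot is shown below, using invented data in place of the juvenile hare measurements. The axis labels assume hind foot length on the $x$-axis and weight on the $y$-axis, as in the rest of the lesson.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Made-up data standing in for juvenile hare measurements
x = np.array([70.0, 80.0, 90.0, 100.0, 110.0, 120.0])
y = np.array([310.0, 400.0, 560.0, 640.0, 820.0, 890.0])
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)

# Evaluate the fitted line on an evenly spaced grid of x values
x_line = np.linspace(x.min(), x.max(), 50)
y_line = model.intercept_ + model.coef_[0] * x_line

fig, ax = plt.subplots()
ax.scatter(x, y, label='observations')
ax.plot(x_line, y_line, color='red', label='linear model')
ax.set_xlabel('hind foot length (mm)')
ax.set_ylabel('weight')
ax.set_title(f'$R^2$ = {model.score(X, y):.2f}')
ax.legend()
```

In a notebook the figure displays automatically at the end of the cell; in a script, call `plt.show()` afterwards.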

:::{.callout-tip}
# Exercise
Answer the following questions:

a. Does it make sense to interpret the $y$-axis intercept as an estimated measurement of weight?

b. What is the estimated change in weight for each millimeter increase in hind foot length?

c. Does a linear model of weight with respect to hind foot length account completely for the change in the dependent variable? What other variables could be worth exploring to model the weight?

d. How would you use the linear model to estimate the weight of a juvenile hare with hind foot length of 90 mm?
:::

<!--
:::{.callout-tip}
# Exercise 2

This would be an exercise about doing linear regression while grouping by sex
:::
-->