# Yule's pauper data

Section 1.4 of Freedman describes Yule's (1899) work on modelling the cause of 
pauperism in the UK. 

Below are the data for the metropolitan unions for the 1871 to 1881 time period (one of 8 different data sets).


In [None]:
yule = read.table('http://stats203.stanford.edu/data/yule.txt')

In [None]:
yule

## Variables

The variables are:

- `paup`: percentage of paupers in a given union
- `outrelief`: ratio of paupers outside poorhouse vs. inside (expressed as percentage); written $\text{out}$ in the model
- `old`: percentage of population above 65
- `pop`: population (expressed as percentage of previous population)

## Model

Yule postulated a model for $\Delta \text{paup}$, where $\Delta \text{paup}$ denotes the variable $100 - \text{paup}$, and $\Delta \text{out}$, $\Delta \text{old}$ and $\Delta \text{pop}$ are defined in the same way. The approximation he made was
$$
\Delta \text{paup}_i \approx a + b \cdot \Delta \text{out}_i + c \cdot \Delta \text{old}_i + d \cdot \Delta \text{pop}_i
$$
with $i$ indexing the 32 metropolitan unions.

This is an example of a *regression model*, expressing a dependent variable as approximately a function 
of independent variables.

The term *independent variable* suggests that some experimenter is manipulating these variables. Much of the time this is not the case, hence we typically use the term *covariates* or perhaps *control variables*. 

## Finding `a,b,c,d`

Having postulated a model, we must try to find plausible values for `a,b,c,d`. Yule did this by 
least squares, i.e. by minimizing the function
$$
L(a,b,c,d) = \sum_{i=1}^{32} (\Delta \text{paup}_i - a - b \cdot \Delta \text{out}_i - c \cdot \Delta \text{old}_i - d \cdot \Delta \text{pop}_i)^2
$$
over $a,b,c,d$.

Yule would have done this by hand. We can use a computer.

In [None]:
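# Transform each variable as in the model: Delta x = 100 - x.
# Columns are referenced explicitly rather than through attach()/detach().
del_paup = 100 - yule$paup
del_out = 100 - yule$outrelief
del_old = 100 - yule$old
del_pop = 100 - yule$pop
# Least squares fit of del_paup on the three transformed covariates
yule.lm = lm(del_paup ~ del_out + del_old + del_pop)
yule.lm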

In [None]:
summary(yule.lm)
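
To see what `lm` is doing, here is a minimal sketch on simulated data. The numbers below are made up for illustration and are not Yule's; the names `x_out`, `x_old`, `x_pop`, `loss` and `beta_hat` are hypothetical. It compares three routes to the same estimates: solving the normal equations, calling `lm`, and minimizing the loss $L$ numerically with `optim`.

```r
# Simulated stand-in data (NOT Yule's values): 32 rows, three covariates.
set.seed(1)
n <- 32
x_out <- rnorm(n)
x_old <- rnorm(n)
x_pop <- rnorm(n)
y <- 10 + 0.75 * x_out + 0.05 * x_old - 0.3 * x_pop + rnorm(n)

# The least squares loss L(a, b, c, d), with theta = c(a, b, c, d).
loss <- function(theta) {
  fitted <- theta[1] + theta[2] * x_out + theta[3] * x_old + theta[4] * x_pop
  sum((y - fitted)^2)
}

# Route 1: closed form from the normal equations (X'X) beta = X'y.
X <- cbind(1, x_out, x_old, x_pop)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Route 2: lm() on the same data.
fit <- lm(y ~ x_out + x_old + x_pop)

# Route 3: direct numerical minimization of the loss.
opt <- optim(rep(0, 4), loss, method = "BFGS")

# The three columns should agree up to numerical error.
cbind(normal_eq = drop(beta_hat), lm = coef(fit), optim = opt$par)
```

The three columns should agree up to numerical error. In practice `lm` solves the problem with a QR decomposition rather than by forming $X^TX$, which is more stable numerically but gives the same minimizer.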

## What does the model tell us?

Yule's goal was to understand what causes poverty. The results of his modelling tell us
that there seems to be a positive association between `del_out` and `del_paup`, but does this really tell
us that making `del_out` larger will cause more poverty?

This gets at the issue of distinguishing between association (sometimes referred to as correlation) and 
causation. 

An oft-repeated phrase: [correlation does not imply causation](https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation).

Much of the modelling we will consider is really modelling of associations or correlations. In certain cases, if we are willing
to make some assumptions about the data, we can infer some causal links between variables. Generally, though,
regression models are models of association.
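
A small simulation can make the distinction concrete. This is a sketch under stated assumptions, not an analysis of Yule's data: `z` is a hypothetical unobserved common cause, and by construction `x` has no effect on `y`. The regression of `y` on `x` still shows a clear association.

```r
# Hypothetical simulated variables (not from the pauper data).
set.seed(2)
n <- 1000
z <- rnorm(n)           # common cause of both x and y
x <- z + rnorm(n)       # x depends on z only
y <- 2 * z + rnorm(n)   # y depends on z only; x plays no role

# Regressing y on x alone: the slope is far from zero,
# even though changing x would not change y.
naive <- coef(lm(y ~ x))["x"]

# Including z in the regression: the slope on x shrinks towards zero.
adjusted <- coef(lm(y ~ x + z))["x"]

c(naive = unname(naive), adjusted = unname(adjusted))
```

Here the adjustment works only because `z` was recorded. In observational data like Yule's, we rarely know whether all such variables have been measured.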

### Exercise: read Chapter 1 of Freedman