The term "regression" originated with the work of Galton on the
heights of parents and their children.  How do you suppose a child's
adult height is related to their parents' heights?  Let's explore the
data.

We'll use `pandas`, which stands for "Python Data Analysis Library" or
perhaps "panel data," an econometrics term for multidimensional
structured data sets.  In any case, `pandas` is an open source,
BSD-licensed library providing high-performance, easy-to-use data
structures and data analysis tools for Python.  It's popular, and its
central concept, the **data frame**, is widely used elsewhere (e.g., in R).
Let's begin by loading `pandas`, along with `numpy` and `matplotlib`,
which we'll use later on.



In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

## Loading Galton's data



We read Galton's data into a data frame `df`.



In [1]:
df = pd.read_csv('galton.csv')

This data includes the parents' heights and the child's eventual adult
height, all in inches.  Here's some of the data.

| mother|father|height|
|---|---|---|
| 67.0|78.5|69.2|
| 67.0|78.5|69.0|
| 67.0|78.5|69.0|
| 66.5|75.5|73.5|
| 66.5|75.5|72.5|
| 66.5|75.5|65.5|
| 66.5|75.5|65.5|
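
If you don't have `galton.csv` at hand, here is a minimal sketch that reads the seven rows shown above from a string instead of a file. The name `sample` is my own, and the real file has more rows and more columns than this excerpt (the `gender` column used later, for one).

```python
import io
import pandas as pd

# The seven rows shown in the table above, written out as CSV text.
csv_text = """mother,father,height
67.0,78.5,69.2
67.0,78.5,69.0
67.0,78.5,69.0
66.5,75.5,73.5
66.5,75.5,72.5
66.5,75.5,65.5
66.5,75.5,65.5
"""

# read_csv accepts a file-like object as well as a filename.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # the type pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a cheap way to confirm that the file was parsed the way you expected.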

You can see the "whole" data frame by evaluating `df`.



In [1]:
df

You can get a quick overview with `describe`.



In [1]:
df.describe()

## Some basic questions



Are the fathers generally taller than the mothers?



In [1]:
df.father.mean() > df.mother.mean()

Is it enough to compare the means like this?  We could also explore
this question by looking at histograms.



In [1]:
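# Semi-transparent bars keep both overlapping histograms visible
df.father.hist(alpha=0.5, label='father')
df.mother.hist(alpha=0.5, label='mother')
plt.legend()
plt.show()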
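
Another way to go beyond comparing two means is to compare father and mother within each couple. The sketch below uses only the seven rows from the table shown earlier (the name `sample` is my own), and it assumes that identical `(father, mother)` pairs belong to the same couple, which is not guaranteed in the full data.

```python
import pandas as pd

# Seven rows copied from the table shown earlier; the real `df` has many more.
sample = pd.DataFrame({
    "mother": [67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5],
    "father": [78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5],
    "height": [69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5],
})

# Each row is a child, so a couple with many children is counted many times.
# Assumption: rows with the same (father, mother) heights are the same couple.
couples = sample[["father", "mother"]].drop_duplicates()

# Fraction of couples in which the father is taller than the mother.
frac_father_taller = (couples.father > couples.mother).mean()

# Summary statistics of the within-couple difference in height.
diff_summary = (couples.father - couples.mother).describe()

print(len(couples), frac_father_taller)
print(diff_summary)
```

Two couples are far too few to conclude anything, of course; the point is that `df.father.mean()` weights each couple by its number of children, which may or may not be what you want.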

## Subsetting



A reason to love `pandas` is that it simplifies certain "data
wrangling" tasks.  A common task is taking a subset.

The gender of the child is given by the `gender` column.



In [1]:
df.gender

Let's see if daughters are taller than their mothers.  We'll filter
using the following.



In [1]:
daughters = df[df.gender == 'F']

What sort of object is `df.gender == 'F'`?  You should be feeling
that `pandas` is quite expressive!
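
To poke at that question, here is a small sketch on a toy frame. The heights come from the table shown earlier, but the `gender` labels and the names `toy`, `mask`, and `girls` are made up for illustration.

```python
import pandas as pd

# Toy frame for illustration; the gender labels here are assumed, not Galton's.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "height": [69.2, 69.0, 73.5, 72.5],
})

# Comparing a column to a value compares every entry at once.
mask = toy.gender == "F"
print(type(mask).__name__)  # Series
print(mask.tolist())        # [True, True, False, False]

# Indexing the frame with the mask keeps the rows where it is True.
girls = toy[mask]
print(len(girls), int(mask.sum()))
```

Because the mask is an ordinary object, you can store it, negate it with `~mask`, or combine it with other masks using `&` and `|`.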



In [1]:
daughters.describe()

The average height of the daughters is less than (but so close to!)
that of their mothers.  But is it enough to just look at the mean?
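
One way to look past the mean is to compare each daughter with her own mother, row by row. This is only a sketch on a toy frame: the heights are taken from the table shown earlier, while the `gender` labels and the names `toy`, `girls`, and `gap` are assumptions of mine.

```python
import pandas as pd

# Heights from the table shown earlier; the gender labels are assumed.
toy = pd.DataFrame({
    "mother": [67.0, 67.0, 66.5, 66.5],
    "height": [69.2, 69.0, 65.5, 73.5],
    "gender": ["F", "F", "F", "M"],
})

girls = toy[toy.gender == "F"]

# Daughter's height minus her own mother's height, one value per daughter.
gap = girls.height - girls.mother

print(gap.describe())     # spread of the differences, not just their mean
print((gap > 0).mean())   # fraction of daughters taller than their mother
```

Three daughters tell us nothing about the population, so try the same two lines on the real `daughters` frame and see whether the picture matches what the means suggested.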



## Scatter plots



One feature of `pandas` is how accessible it makes the usual
"exploratory" tools like scatter plots.  Indeed, when you are working
with data, **first look at your data**, and scatter plots are a
reasonable way to do this.

We could plot the father's height on the $x$-axis and the child's eventual height on the $y$-axis.



In [1]:
df.plot.scatter('father', 'height')
plt.show()

Or the mother's height on the $x$-axis and the child's eventual height on the $y$-axis.



In [1]:
df.plot.scatter('mother', 'height')
plt.show()

But `pandas` also permits various calculations to be performed across
the columns, row by row.  Let's add a column to our data frame which is
the average of the heights of the two parents.



In [1]:
df['midparent'] = (df.father + df.mother)/2
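
To see what that one line does, here is a small sketch using the two distinct parent pairs from the table shown earlier (the names `sample` and `alt` are my own).

```python
import pandas as pd

# The two distinct (mother, father) pairs from the table shown earlier.
sample = pd.DataFrame({
    "mother": [67.0, 66.5],
    "father": [78.5, 75.5],
})

# Arithmetic on columns is element-wise: row i of the result is computed
# from row i of each column, with no explicit loop.
sample["midparent"] = (sample.father + sample.mother) / 2
print(sample.midparent.tolist())  # [72.75, 71.0]

# The same values, computed as a row-wise mean over the two parent columns.
alt = sample[["father", "mother"]].mean(axis=1)
print(alt.tolist())
```

The `mean(axis=1)` form generalizes more easily if you ever want to average more than two columns.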

Now we could plot the midparent height on the $x$-axis instead.



In [1]:
df.plot.scatter('midparent', 'height')
plt.show()

I encourage you to look at the other methods available under `df.plot`
to explore this dataset further.  This data is available at [a data
repository](https://doi.org/10.7910/DVN/T0HSJ1), and if you are looking
for other interesting data sets, I also encourage you to explore such
data repositories.



## Making predictions



Can we use Galton's data to predict the height of a child based on the
average height of their parents?

There are various ways to do this.  Perhaps the first idea that comes
to mind is the following: to predict the height of a child whose
midparent height is $x$, let's look at its "neighbors," meaning rows
in the data frame where the midparent height is close to $x$.



In [1]:
def neighbors(x):
    return df[ abs(df['midparent'] - x) < 2 ]

Once we have some "neighbors," we can look at their average height.



In [1]:
def prediction(x):
    return neighbors(x).height.mean()
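
The two functions above reach for the global `df` and hard-code a window of 2 inches. Here is a sketch of a variant that takes the frame and the window as arguments; the names `neighbors_within`, `predict_height`, and `width` are hypothetical, and the data is just the seven rows from the table shown earlier with their midparent heights filled in.

```python
import numpy as np
import pandas as pd

# Seven rows from the table shown earlier, with midparent already computed.
sample = pd.DataFrame({
    "midparent": [72.75, 72.75, 72.75, 71.0, 71.0, 71.0, 71.0],
    "height":    [69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5],
})

def neighbors_within(frame, x, width=2.0):
    """Rows of `frame` whose midparent height is strictly within `width` of x."""
    return frame[(frame["midparent"] - x).abs() < width]

def predict_height(frame, x, width=2.0):
    """Mean child height among the neighbors of x; NaN if there are none."""
    return neighbors_within(frame, x, width).height.mean()

print(len(neighbors_within(sample, 72.75)))      # all rows fall in the wide window
print(predict_height(sample, 71.0, width=1.0))   # only the rows at 71.0 contribute
print(np.isnan(predict_height(sample, 80.0)))    # no neighbors, so the mean is NaN
```

The choice of `width` matters: a wide window smooths the predictions, while a narrow one follows the data more closely and can run out of neighbors entirely.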

Now let's plot our predictions.  Because plotting in `pandas` is built
on `matplotlib`, it is easy to combine plots from different sources.



In [1]:
df.plot.scatter('midparent', 'height')
xs = np.linspace( df.midparent.min(), df.midparent.max(), 100 )
plt.plot(xs, [prediction(x) for x in xs])
plt.show()

That "prediction line" is wiggly, but certainly looks like a line!
What line is it?  "Linear regression" is our next goal.

