# Re-Galtoning

In [None]:
# Don't change this cell; just run it.
import numpy as np  # The array library.
import pandas as pd
# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('regalton.ok')

In this exercise, you will very likely find yourself using

* [groupby](/useful-pandas/groupby)
* [merge](/useful-pandas/merge)

as well as some of your other Pandas skills.
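
As a reminder of what these two methods do, here is a minimal sketch on toy data. The `owners` and `pets` data frames and their column names are made up for illustration; they are not part of the Galton files.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Galton files.
owners = pd.DataFrame({'owner_id': [1, 2, 3],
                       'owner_name': ['Ana', 'Ben', 'Cy']})
pets = pd.DataFrame({'owner_id': [1, 1, 2, 3, 3, 3],
                     'pet_name': ['Rex', 'Tom', 'Kit', 'Ace', 'Bo', 'Dot']})

# groupby: count the rows (pets) for each owner_id.
# The result is a Series, indexed by owner_id.
pet_counts = pets.groupby('owner_id')['pet_name'].count()

# merge: match each pet row to its owner's row using the shared column.
pets_with_owners = owners.merge(pets, on='owner_id')
```

Here `pet_counts` has one value per owner, and `pets_with_owners` has one row per pet, with the owner's details repeated on each row.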

## The data

The data for your task relates to a very famous paper by [Francis
Galton](https://en.wikipedia.org/wiki/Francis_Galton), published in 1886.
Galton was an extraordinarily versatile scientist who laid the groundwork for
early statistics, and particularly regression and correlation.  The paper we
are interested in here is:

> Galton, F. (1886). [Regression Towards Mediocrity in Hereditary Stature](
https://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf)
Journal of the Anthropological Institute, 15, 246-263

In fact, this paper is the origin of the term *regression* for fitting
prediction lines to data.

Galton was a keen eugenicist, and was very interested in inheritance.  In this
case he studied the relationship of children's heights to the heights of their
parents.

Galton asked families to give him data about:

* The father's height
* The mother's height
* The height and gender of each adult child in the family.

You can read more about the data files at the [Galton heights datasets
page](https://github.com/odsti/datasets/tree/regalton/galtons_heights).

## Reconstructing a dataframe

First, here is the data frame that you are aiming to reconstruct.  Your task is
to rebuild this table, including its data and column names, from the component data frames you will see further below.

In [None]:
# Data frame you are aiming to reconstruct.
combined = pd.read_csv('galton_combined.csv')
combined.head()

As you can see, this combined data frame has one row per adult child, along
with their parents' heights, and a unique identifier for the family, in the
`family` column.  We will come to `midparentHeight` later.

The components you will be using to reconstruct the `combined` data frame are the following data frames:

In [None]:
# Data frame with data about families.
families = pd.read_csv('galton_families.csv')
families.head()

This data frame has information about the families, but no information about the children.  Next:

In [None]:
# Data frame with data about the children.
children = pd.read_csv('galton_children.csv')
children.head()

## Mid-parent height

Galton wanted to predict the height of the adult children from the heights of
the parents.  He wanted one number to encapsulate the height of both parents,
and this number is `midparentHeight` in the `combined` data frame.

Women are not as tall as men, on average.  To adjust for this, Galton
multiplied the mother's height by 1.08 before averaging with the father's
height, to give `midparentHeight`.
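
The arithmetic described above can be sketched as follows. The heights and the `toy` data frame are invented for illustration, and the column names are assumptions; check the column names in your own data frames before adapting this.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in inches; these are not values from the data files.
toy = pd.DataFrame({'father': [70.0, 68.0],
                    'mother': [65.0, 62.5]})

# Scale the mother's height by 1.08, then average with the father's height.
toy['midparent'] = (toy['father'] + 1.08 * toy['mother']) / 2
```

For the first row, the scaled mother's height is 1.08 * 65 = 70.2, so the mid-parent value is (70 + 70.2) / 2 = 70.1.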

## Ready, set

To recap: your task is to reconstruct the data of the `combined` data frame,
using the data from the `families` and `children` data frames.  Call the reconstructed data frame `reconstructed`.

Try to get the values in `reconstructed` to match `combined` as well as you
can.  Rename the columns to match the columns of `combined`.
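
If you need a reminder of how to rename and reorder columns, here is one possible sketch. The `toy` data frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data frame; the names are for illustration only.
toy = pd.DataFrame({'b_old': [1, 2], 'a': [3, 4]})

# rename returns a new data frame with the mapped column names changed.
renamed = toy.rename(columns={'b_old': 'b'})

# Indexing with a list of names returns the columns in that order.
reordered = renamed[['a', 'b']]
```

Neither step changes `toy` itself; each returns a new data frame.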

You will almost certainly find yourself using the `groupby` and `merge` methods from the links above.

Good luck!

In [None]:
#- Calculate midparentHeight, insert into "families"
...
# Show the result.
families

In [None]:
#- Use "groupby" to make a Series with counts of children per family.
#- Set the Series name to "children"
child_counts = ...
# Show the result.
child_counts

In [None]:
#- Merge child_counts into families
families_with_counts = ...
# Show the result.
families_with_counts

If you got a merge error, make sure you used `groupby` to create your
`child_counts` in the cell further up.
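
Another possible cause of a merge error is a Series without a name. The sketch below shows this on toy data; the data frames and the `n_members` name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the Galton files.
groups = pd.DataFrame({'group': ['x', 'y'], 'label': ['first', 'second']})
members = pd.DataFrame({'group': ['x', 'x', 'y'], 'value': [1, 2, 3]})

# size() counts rows per group, but the resulting Series has no name.
sizes = members.groupby('group').size()

try:
    groups.merge(sizes, left_on='group', right_index=True)
    merged_unnamed = True
except ValueError:
    # Pandas refuses to merge a Series that has no name.
    merged_unnamed = False

# Giving the Series a name lets merge use that name as the new column.
named_sizes = sizes.rename('n_members')
merged = groups.merge(named_sizes, left_on='group', right_index=True)
```

The Series name becomes the column name in the merged data frame, which is why the name you choose matters.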

In [None]:
#- Merge children into families.
reconstructed = ...
# Show the result
reconstructed

In [None]:
#- Make sure the columns are in the right order.
#- Make sure the columns have the right names
...
# Show the result
reconstructed

Run the cell below.  It sorts both data frames in the same way, then tests that they are very close, allowing for tiny differences in `midparentHeight` from floating point calculation error.
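
To see why the comparison below uses `np.allclose` for the calculated column rather than an exact check, consider this small sketch. The values are a standard floating point illustration, not values from the data.

```python
import numpy as np
import pandas as pd

# Two ways of calculating what should be the same number.
a = pd.Series([0.1 + 0.2])
b = pd.Series([0.3])

# Exact comparison fails because of floating point rounding.
exactly_equal = a.equals(b)

# allclose accepts differences within a small tolerance.
close_enough = np.allclose(a, b)
```

Here `exactly_equal` is `False` and `close_enough` is `True`. Columns that you copy rather than calculate should match exactly, so an exact check is reasonable for those.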

In [None]:
# Run this cell.  It tests that the data frames are very close.

def cmp_galtons(df1, df2):
    # Compare the two data frames
    # Sort in the same way, and reset index to match sort order.
    sort_keys = ['family', 'childNum']
    df1 = df1.sort_values(sort_keys).reset_index(drop=True)
    df2 = df2.sort_values(sort_keys).reset_index(drop=True)
    calc_col = 'midparentHeight'
    if not np.allclose(df1[calc_col], df2[calc_col]):
        print(calc_col, 'seems to be off')
        return
    if not df1.drop(calc_col, axis='columns').equals(
        df2.drop(calc_col, axis='columns')):
        print('Somewhere else seems to be off')
        return
    print('Result!')

# Run the comparison.
# You should see "Result!"
cmp_galtons(combined, reconstructed)

Here's a test to check that you got that result:

In [None]:
_ = ok.grade('q_01_cmp_galtons')

## Done

You're finished with the assignment!  Be sure to...

- **Run all the tests** (the next cell has a shortcut for that).
- Choose **Save and Checkpoint** from the "File" menu.
- Finally, **restart** the kernel for this notebook, and **run all the cells**,
  to check that the notebook still works without errors.  Use the
  "Kernel" menu, and choose "Restart and run all".  If you find any
  problems, go back and fix them, save the notebook, and restart / run
  all again, before submitting.  When you do this, you make sure that
  we, your humble markers, will be able to mark your notebook.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]