# Reading household income data

In [None]:
from typing import Tuple

In [None]:
import numpy as np
import pandas as pd

The distribution of income is famously skewed to the right. In this exercise, we’ll measure how strong that skew is.
The Current Population Survey (CPS) is a joint effort of the Bureau of Labor Statistics and the Census Bureau to study income and related variables. Data collected in 2013 is available from http://www.census.gov/hhes/www/cpstables/032013/hhinc/toc.htm.

I downloaded `hinc06.xls`, which is an Excel spreadsheet with information about household income, and converted it to `hinc06.csv`, a CSV file you will find in the repository for this book. You will also find `hinc2.py`, which reads this file and transforms the data.

The dataset is in the form of a series of income ranges and the number of respondents who fell in each range.

The lowest range includes respondents who reported an annual household income under \$5,000.

The highest range includes respondents who made \$250,000 or more.

To estimate the mean and other statistics from these data, we have to make some assumptions about the lower and upper bounds, and how the values are distributed in each range. `hinc2.py` provides `InterpolateSample`, which shows one way to model this data. It takes a `DataFrame` with a column, `income`, that contains the upper bound of each range, and `freq`, which contains the number of respondents in each range.

It also takes `log_upper`, which is an assumed upper bound on the highest range, expressed in `log10` dollars. The default value, `log_upper=6.0`, represents the assumption that the largest income among the respondents is $10^6$, or one million dollars.

`InterpolateSample` generates a pseudo-sample; that is, a sample of household incomes that yields the same number of respondents in each range as the actual data. It assumes that incomes in each range are equally spaced on a `log10` scale.
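The idea of equal spacing on a `log10` scale can be sketched on a toy table. This is a minimal illustration, not the code in `hinc2.py`: the function name `interpolate_sample_sketch`, the `log_lower=3.0` lower bound (\$1,000), and the toy frequencies are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def interpolate_sample_sketch(df: pd.DataFrame,
                              log_upper: float = 6.0,
                              log_lower: float = 3.0) -> np.ndarray:
    """Return a pseudo-sample of log10 incomes, evenly spaced in each range.

    Hypothetical helper for illustration; log_lower is an assumed bound.
    """
    # Upper edge of each range in log10 dollars; the open-ended top range
    # is replaced by the assumed bound log_upper.
    upper = np.log10(df['income'].to_numpy(dtype=float))
    upper[-1] = log_upper
    # The lower edge of each range is the upper edge of the previous one;
    # the first range starts at the assumed bound log_lower.
    lower = np.concatenate([[log_lower], upper[:-1]])
    # One evenly spaced value per respondent in each range.
    parts = [np.linspace(lo, hi, int(n))
             for lo, hi, n in zip(lower, upper, df['freq'])]
    return np.concatenate(parts)


# Toy data: upper bounds of four ranges and made-up respondent counts.
toy = pd.DataFrame({
    'income': [4999, 9999, 14999, np.inf],
    'freq': [4, 3, 2, 1],
})
log_sample = interpolate_sample_sketch(toy)
sample = 10 ** log_sample  # back to dollars
```

The pseudo-sample has one value per respondent, so its length equals the sum of `freq`. Statistics such as the mean or skewness can then be computed on `sample` as if it were raw data, subject to the assumed bounds.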

In [None]:
df = pd.read_csv('../data/hinc06.csv', header=None, skiprows=9)

The income level and the number of respondents are the first two columns.

In [None]:
cols = df[[0, 1]].rename(columns={0: 'income', 1: 'freq'})

In [None]:
cols.head()

In [None]:
cols.dtypes

The counts are easy to clean: strip the thousands separators and convert to integers.

In [None]:
cols['freq'] = cols.freq.apply(lambda s: s.replace(',', '')).astype(np.uint64)

In [None]:
cols.dtypes

Getting the upper and lower values from the category label takes a little more work.

In [None]:
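def extract_amount(s: str) -> int:
    """Convert one token of a range label to a dollar amount.

    'Under' maps to 0; 'over' maps to pd.NA (no upper bound).
    """
    s = s.lstrip('$').replace(',', '').lower()
    if s == 'under':
        return 0
    if s == 'over':
        return pd.NA  # open-ended top range
    return int(s)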

In [None]:
def extract_bounds(label: str) -> Tuple[int, int]:
    """Return (lower, upper) from the first and last tokens of a label."""
    t = label.split()
    return extract_amount(t[0]), extract_amount(t[-1])

In [None]:
extract_bounds('$5,000 to  $9,999')

In [None]:
extract_bounds('Under $5,000')

In [None]:
# build a list of (lower, upper) tuples, one per income range
bounds = [extract_bounds(label) for label in cols.income.values]

In [None]:
# keep the upper bound of each range and its frequency
df = pd.DataFrame(dict(
    income = [item[1] for item in bounds],
    freq = cols.freq.values
)).astype({'income': pd.UInt64Dtype()})
df.head()

In [None]:
# 'Under $5,000' parses to 5000; subtract 1 so it matches the other upper bounds (4999)
df.iloc[0, 0] -= 1

Add a cumulative sum of the frequencies.

In [None]:
df['cumsum'] = df.freq.cumsum()

In [None]:
df.dtypes

Normalize the cumulative frequencies so that they run from 0 to 1.

In [None]:
total = df['cumsum'].iloc[-1]

In [None]:
df['ps'] = df['cumsum'] / total

In [None]:
df.head()
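The `ps` column is the cumulative fraction of respondents at or below each upper bound, so it can be used to look up which range contains a given percentile. The sketch below shows this on a toy table. The helper name `range_upper_bound_at` and the numbers are assumptions for illustration. In the real table the top range has a missing upper bound, so a lookup that lands there would need separate handling.

```python
import numpy as np
import pandas as pd

# Toy table with the same columns as above; the counts are made up.
toy = pd.DataFrame({
    'income': [4999, 9999, 14999, 19999],
    'freq': [10, 30, 40, 20],
})
toy['cumsum'] = toy['freq'].cumsum()
toy['ps'] = toy['cumsum'] / toy['cumsum'].iloc[-1]


def range_upper_bound_at(df: pd.DataFrame, p: float) -> int:
    """Upper bound of the first range whose cumulative fraction reaches p."""
    idx = int(np.searchsorted(df['ps'].to_numpy(), p, side='left'))
    return int(df['income'].iloc[idx])


# The range containing the median household in the toy data.
median_bound = range_upper_bound_at(toy, 0.5)
```

This only identifies the range. Locating a percentile within a range requires an assumption about how incomes are distributed inside it.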

In [None]:
df.to_feather('../data/household_incomes.feather')