# Household Debt Inequalities



## Source

For this example we're extracting tables 11 and 12 from an xls dataset on household debt inequalities.

The example highlights using iteration to join multiple tables into a coherent whole.

In [None]:
from typing import List
from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

tables: List[XlsSelectable] = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/householddebtdataset.xls", tables="Table 11|Table 12")
for table in tables:
    preview(table)

The xls source can be [downloaded here](https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/householddebtdataset.xls).

## Requirements

- We're going to extract "Period" from the obvious dates in column A.
- We're just going to call the principal field indicated by column A "Category".
- We're going to take "Great Britain" as a constant for a column named "Area".
- We're going to take the headers on row 4 as "Financial Liability".
- As an additional exercise we're going to use a horizontal condition to create a "Unit Of Measure" column whose value is one of "Pounds Sterling", "Percent", "Ratio" or "Number", depending on the category.
- We're going to prefix "Category" as extracted from table 12 with "Education: " to make the data a little easier to understand.
- We're going to join both tables into a single tidy data output.
- We're going to de-duplicate with a printout of what we've removed. It should be the contents of row 14, as it is duplicated on both tables.
- We'll strip trailing ".0"s from the observations (which we'll call "Value" this time).
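
The two string-level requirements above (the "Education: " prefix and the trailing ".0" strip) can be sketched in plain Python, independent of tidychef. This is only an illustration: the helper names `strip_trailing_zero` and `prefix_category` are hypothetical, and the "Degree level" label is made up rather than taken from the dataset.

```python
def strip_trailing_zero(value: str) -> str:
    """Remove one trailing ".0" (e.g. "12.0" -> "12"); leave other values alone."""
    # Checking the suffix avoids touching a ".0" in the middle of a value,
    # which a plain str.replace(".0", "") would also remove.
    return value[:-2] if value.endswith(".0") else value


def prefix_category(category: str, table_name: str) -> str:
    """Prefix categories from Table 12 with "Education: "; others pass through."""
    return f"Education: {category}" if table_name == "Table 12" else category


print(strip_trailing_zero("12.0"))   # 12
print(strip_trailing_zero("10.05"))  # 10.05
print(prefix_category("Degree level", "Table 12"))  # Education: Degree level
print(prefix_category("Degree level", "Table 11"))  # Degree level
```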

In [None]:
from typing import Dict, List
from tidychef import acquire, preview
from tidychef.direction import down, right, left
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable

def unit_of_measure(line: Dict[str, str]) -> str:
    """
    Define the unit of measure based on the "Category" value of each row.
    """
    cat = line["Category"]
    if "(%)" in cat:
        return "Percent"
    elif "(£)" in cat:
        return "Pounds Sterling"
    elif "Frequency" in cat:
        return "Number"
    elif "Ratio" in cat:
        return "Ratio"
    else:
        raise ValueError(f"Cannot identify unit of measure from: {cat}")

tables: List[XlsSelectable] = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/householddebtdataset.xls", tables="Table 11|Table 12")

all_tidy_data = []
for table in tables:
    area = table.excel_ref("A").re("Great Britain").assert_one().label_as("Area")
    period = table.excel_ref("A3").fill(down).re(".*[0-9]{4}").assert_len(2).label_as("Period")
    category = area.shift(down).fill(right).label_as("Category")
    observations = category.fill(down).is_not_blank().label_as("Value")
    financial_liability = (observations.fill(left) - observations).label_as("Financial Liability")
    preview(observations, area, period, category, financial_liability)

    tidy_data = TidyData(
        observations,
        Column.constant("Area", area.lone_value()),
        Column(period.finds_observations_closest(down)),
        Column(category.finds_observations_directly(down), apply=lambda x: "Education: "+x if table.name == "Table 12" else x),
        Column(financial_liability.finds_observations_directly(right)),
        Column.horizontal_condition("Unit Of Measure", unit_of_measure),
        obs_apply=lambda x: x[:-2] if x.endswith(".0") else x
    )

    all_tidy_data.append(tidy_data)

final_tidy_data = TidyData.from_tidy_list(all_tidy_data)
final_tidy_data.drop_duplicates(print_duplicates=True)
final_tidy_data.to_csv("household-debt.csv")
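
To make the join and de-duplicate step easier to picture, here is a minimal plain-Python sketch of the idea: concatenate the rows from each table, keep the first copy of each row, and collect the repeats so they can be reported. It is not how tidychef implements `TidyData.from_tidy_list` or `drop_duplicates`; the function name `join_and_deduplicate` and the sample rows are assumptions made purely for illustration.

```python
from typing import List, Set, Tuple

Row = Tuple[str, ...]


def join_and_deduplicate(tables: List[List[Row]]) -> Tuple[List[Row], List[Row]]:
    """Concatenate row lists, keeping the first copy of each row.

    Returns the kept rows and, separately, the duplicate rows that were dropped.
    """
    seen: Set[Row] = set()
    kept: List[Row] = []
    dropped: List[Row] = []
    for rows in tables:
        for row in rows:
            if row in seen:
                dropped.append(row)  # repeat of an earlier row
            else:
                seen.add(row)
                kept.append(row)
    return kept, dropped


# Made-up rows, not taken from the dataset.
table_a = [("1", "x"), ("2", "y")]
table_b = [("2", "y"), ("3", "z")]

kept, dropped = join_and_deduplicate([table_a, table_b])
print(kept)     # [('1', 'x'), ('2', 'y'), ('3', 'z')]
print(dropped)  # [('2', 'y')]
```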

## Outputs

The tidy data can be [downloaded here](./household-debt.csv). A full inline preview of the generated tidy data is shown below for those who prefer to scroll.

In [None]:
print(final_tidy_data)