# Iterating Tables/Sheets

Here we'll look at working with iterable data sources, for example extracting and joining data from multiple tables in a single spreadsheet.


## Source Data

The data source we're using for these examples is shown below:

| <span style="color:green">Note - this particular table has some very verbose headers we don't care about, so we'll be using `bounded=` to remove them from the previews as well as to show just the subset of data we're working with.</span>|
|-----------------------------------------|

The [full data source can be downloaded here](https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx).

For this example we'll be using the following tables:

- The 4th table, named "Table 1a".
- The 5th table, named "Table 1b".

The principal difference between the tables is that 1a is "seasonally adjusted" and 1b is not.

For the sake of practicality we'll only be extracting observations down to row 12.
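
Because Python lists are zero-indexed, the 4th and 5th tables sit at positions 3 and 4. The sketch below illustrates this, and shows that selecting by name gives the same result without depending on sheet order. `StandInTable` and the placeholder sheet names are hypothetical stand-ins for illustration only (they are not datachef classes or real sheet names from the workbook); only "Table 1a", "Table 1b" and their positions come from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StandInTable:
    """Hypothetical stand-in for a selectable table; only `name` is modelled."""
    name: str


# Placeholder names are assumptions; only "Table 1a" / "Table 1b" and their
# positions (4th and 5th) are taken from this page.
tables: List[StandInTable] = [
    StandInTable("Placeholder 1"),
    StandInTable("Placeholder 2"),
    StandInTable("Placeholder 3"),
    StandInTable("Table 1a"),
    StandInTable("Table 1b"),
]

# Python lists are zero-indexed, so the 4th table is at index 3.
by_position = [tables[3], tables[4]]

# Selecting by name returns the same tables regardless of sheet order.
by_name = [t for t in tables if t.name in ("Table 1a", "Table 1b")]

print([t.name for t in by_position])
print(by_name == by_position)
```

Positional access is convenient for a quick preview, while name-based selection tends to be more robust if sheets are added or reordered in a later release of the workbook.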

In [1]:
from typing import List
from datachef import acquire, preview
from datachef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[3], bounded="A1:H12")
preview(tables[4], bounded="A1:H12")

0,1,2,3,4,5,6,7,8
,A,B,C,D,E,F,G,H
1.0,"Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector",,,,,,,
2.0,This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.,,,,,,,
3.0,Source: Construction Output and Employment from the Office for National Statistics,,,,,,,
4.0,2019=100,,,,,,,
5.0,Time period,Public new housing,Private new housing,Total new housing,Infrastructure new work,Public other new work,Private industrial new work,Private commercial new work
6.0,Dataset identifier code,MV36,MV37,MVL7,MV38,MV39,MV3A,MV3B
7.0,1997,30.8,44.8,42.6,61.2,57.6,152.1,84.3
8.0,1998,24.9,45.3,42,59.5,60.7,155,91.4
9.0,1999,21.6,40.7,37.7,57.9,68.3,159.9,102.3


0,1,2,3,4,5,6,7,8
,A,B,C,D,E,F,G,H
1.0,"Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector",,,,,,,
2.0,This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.,,,,,,,
3.0,Source: Construction Output and Employment from the Office for National Statistics,,,,,,,
4.0,2019=100,,,,,,,
5.0,Time period,Public new housing,Private new housing,Total new housing,Infrastructure new work,Public other new work,Private industrial new work,Private commercial new work
6.0,Dataset identifier code,MV3J,MV3K,MVL8,MV3L,MV3M,MV3N,MV3O
7.0,1997,30.7,45.5,43.3,60.7,56.7,149.8,82.4
8.0,1998,24.8,46,42.8,59,59.7,152.5,89.3
9.0,1999,21.6,41.5,38.5,57.6,67.6,158.1,100.4


## An Iterated Extraction

In this example we're going to:

- Iterate through the sheets.
- Extract data from the two sheets in question, adding a column to indicate whether the data is seasonally adjusted.
- Join the data into a single TidyData output.
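
The seasonally adjusted flag can be derived from each table's title in cell A1. Here is a minimal plain-Python sketch of that check, using the two titles shown in the previews. `is_seasonally_adjusted` is a hypothetical helper name, and it returns a `bool` purely for illustration. The point to note is that "seasonally adjusted" is also a substring of "non-seasonally adjusted", so testing for the positive phrase alone would match both tables.

```python
def is_seasonally_adjusted(title: str) -> bool:
    """Hypothetical helper: True unless the title contains "non-season"."""
    # Test for the negative marker, because the positive phrase
    # "seasonally adjusted" also appears inside "non-seasonally adjusted".
    return "non-season" not in title.lower()


# Titles copied from cell A1 of each table, as shown in the previews.
title_1a = (
    "Table 1a: Construction output in Great Britain, volume, "
    "seasonally adjusted, index numbers, by sector"
)
title_1b = (
    "Table 1b: Construction output in Great Britain, volume, "
    "non-seasonally adjusted, index numbers, by sector"
)

print(is_seasonally_adjusted(title_1a))
print(is_seasonally_adjusted(title_1b))

# The naive positive check cannot tell the two tables apart.
naive_1a = "seasonally adjusted" in title_1a
naive_1b = "seasonally adjusted" in title_1b
print(naive_1a, naive_1b)
```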

In [5]:
from typing import List
from datachef import acquire, preview
from datachef.direction import right, down
from datachef.output import Column, TidyData
from datachef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")

# Use a quick comprehension to get rid of the tables we don't want
tables = [x for x in tables if x.name in ["Table 1a", "Table 1b"]]

# An empty list to hold our tables
tidy_data_list = []

# Now iterate and extract
for table in tables:
    anchor = table.excel_ref("A5").label_as("Anchor Cell")

    # Note - it's bad practice to rely on Excel references too much
    # (see "Best Practice" guidance) but in this instance we need
    # to curtail the amount of data for practical purposes.
    observations = table.excel_ref("B7:H12").label_as("Observations")

    housing = anchor.fill(right).label_as("Housing")
    dataset_identifier_codes = housing.shift(down).label_as("Data Identifier Codes")
    period = anchor.shift(down(2)).expand(down).label_as("Period")

    # We're now going to set a variable based on the contents of cell A1,
    # which tells us whether the table is SA or NSA
    a1_cell_value = table.excel_ref("A1").lone_value()
    is_seasonally_adjusted = "False" if "non-season" in a1_cell_value else "True"

    # Preview selections to sanity check
    # we'll include the anchor cell
    preview(anchor, observations, housing, dataset_identifier_codes, period, bounded="A1:H12")

    tidy_data = TidyData(
        observations,
        Column(housing.finds_observations_directly(down)),
        Column(dataset_identifier_codes.finds_observations_directly(down)),
        Column(period.finds_observations_directly(right)),
        Column.constant("Seasonally Adjusted", is_seasonally_adjusted)
    )
    
    # Now append the tidy data for this sheet to our list
    tidy_data_list.append(tidy_data)
    

# concatenate the list and print our new output
all_tidy_data = TidyData.from_tidy_list(tidy_data_list)
print(all_tidy_data)


0
Anchor Cell
Observations
Housing
Data Identifier Codes
Period

0,1,2,3,4,5,6,7,8
,A,B,C,D,E,F,G,H
1.0,"Table 1a: Construction output in Great Britain, volume, seasonally adjusted, index numbers, by sector",,,,,,,
2.0,This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.,,,,,,,
3.0,Source: Construction Output and Employment from the Office for National Statistics,,,,,,,
4.0,2019=100,,,,,,,
5.0,Time period,Public new housing,Private new housing,Total new housing,Infrastructure new work,Public other new work,Private industrial new work,Private commercial new work
6.0,Dataset identifier code,MV36,MV37,MVL7,MV38,MV39,MV3A,MV3B
7.0,1997,30.8,44.8,42.6,61.2,57.6,152.1,84.3
8.0,1998,24.9,45.3,42,59.5,60.7,155,91.4
9.0,1999,21.6,40.7,37.7,57.9,68.3,159.9,102.3


0
Anchor Cell
Observations
Housing
Data Identifier Codes
Period

0,1,2,3,4,5,6,7,8
,A,B,C,D,E,F,G,H
1.0,"Table 1b: Construction output in Great Britain, volume, non-seasonally adjusted, index numbers, by sector",,,,,,,
2.0,This worksheet contains one table. Some shorthand is used in this table [R&M] = repair and maintenance.,,,,,,,
3.0,Source: Construction Output and Employment from the Office for National Statistics,,,,,,,
4.0,2019=100,,,,,,,
5.0,Time period,Public new housing,Private new housing,Total new housing,Infrastructure new work,Public other new work,Private industrial new work,Private commercial new work
6.0,Dataset identifier code,MV3J,MV3K,MVL8,MV3L,MV3M,MV3N,MV3O
7.0,1997,30.7,45.5,43.3,60.7,56.7,149.8,82.4
8.0,1998,24.8,46,42.8,59,59.7,152.5,89.3
9.0,1999,21.6,41.5,38.5,57.6,67.6,158.1,100.4


Observations,Housing,Data Identifier Codes,Period,Seasonally Adjusted
30.8,Public new housing,MV36,1997,True
44.8,Private new housing,MV37,1997,True
42.6,Total new housing,MVL7,1997,True
61.2,Infrastructure new work,MV38,1997,True
57.6,Public other new work,MV39,1997,True
152.1,Private industrial new work,MV3A,1997,True
84.3,Private commercial new work,MV3B,1997,True
24.9,Public new housing,MV36,1998,True
45.3,Private new housing,MV37,1998,True
42.0,Total new housing,MVL7,1998,True
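
Conceptually, joining the per-sheet outputs means keeping a single header row and appending every table's data rows beneath it. The sketch below illustrates that idea with plain lists. `join_tidy_tables` is a hypothetical helper and is not how datachef implements `TidyData.from_tidy_list`. The "Table 1a" rows are copied from the output above. The "Table 1b" rows are an assumption: they are what the same extraction would be expected to give for cells B7 and C7 of that table, and are not copied from printed output.

```python
from typing import List

Row = List[str]


def join_tidy_tables(tables: List[List[Row]]) -> List[Row]:
    """Hypothetical helper: keep one header, append every table's data rows."""
    if not tables:
        return []
    header = tables[0][0]
    joined: List[Row] = [header]
    for table in tables:
        # Refuse to join tables whose columns do not line up.
        if table[0] != header:
            raise ValueError("All tables must share the same header row")
        joined.extend(table[1:])
    return joined


header = ["Observations", "Housing", "Data Identifier Codes", "Period", "Seasonally Adjusted"]

# Rows copied from the printed output above.
table_1a = [
    header,
    ["30.8", "Public new housing", "MV36", "1997", "True"],
    ["44.8", "Private new housing", "MV37", "1997", "True"],
]

# Assumed rows: derived by hand from the Table 1b preview, not printed output.
table_1b = [
    header,
    ["30.7", "Public new housing", "MV3J", "1997", "False"],
    ["45.5", "Private new housing", "MV3K", "1997", "False"],
]

joined = join_tidy_tables([table_1a, table_1b])
for row in joined:
    print(",".join(row))
```

Checking that the headers match before appending is a design choice in this sketch: it makes a mismatch between sheets fail loudly rather than silently producing misaligned columns.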



