# Service Industry

A small spreadsheet published by the UK Office for National Statistics. It makes heavy use of whitespace for demarcation and has inconsistent, fairly irregular spatial relationships between concepts. There are also some semantically lackluster things scattered throughout, e.g. `<date> ->` (see cell A21 and the footnotes).

So while it's neither particularly big nor structurally complex, there are a lot of small processing steps needed to make this source legible as tidy data.

## Tutorial Structure

With these example tutorials I'm going to comment heavily and cover nuances in a follow-up section (with liberal targeted previews as needed), as it's the easiest way to grapple with new ideas. It may also be worth opening up these notebooks yourself (they're in `./jupyterbook` in the [tidychef](https://github.com/mikeAdamss/tidychef) GitHub repo) so you can run, alter and generally have a play about with them as part of your learning.

We'll cover:

- source data
- requirements, what we're aiming to do here
- show the full script (all logic commented)
- output the selection preview
- nuances (where applicable)
- view the output

This sequencing is necessary because the output for some of the examples is **really** long, which necessitates it coming last. If you're viewing this via a Jupyter Book (i.e. on the site) you can navigate between the above sections via the right-hand menu.

_Note - these tutorial scripts might seem verbose due to all the comments but that's ok (this is a tutorial after all). If you take them out you end up with a fairly succinct and human-readable encapsulation of what would otherwise (with existing tools) be a rather convoluted and fragile set of instructions to express._

In virtually all cases I'll make heavy use of `preview` and `bounded` to only look at the relevant parts of what can be quite large datasets. Download links are provided for the source data.

## Source

For this example we're extracting the table "TOPSI9" as shown below (note - preview cropped for reasons of practicality):

In [1]:
from tidychef import acquire, preview
from tidychef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/service-industry.xls", tables="TOPSI9")
preview(table, bounded="A1:Q22")


,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q
1.0,TOPSI9,,,UK Production Turnover,,,,,,,,,,,,,
2.0,,,,Turnover in Production and Services Industries,,,,,,,,,,,,,
3.0,,,,"Current price, not seasonally adjusted",,,,,,,,,,,,,£ million
4.0,Back to Contents,,,,,,,Manufacture of air and spacecraft and related machinery,,,,,,,,,
5.0,,,,,Building of ships and boats,,,,,,Manufacture of other transport equipment,,,,,,
6.0,,,,,,,,,,,,,,Manufacture of furniture,,,Other manufacturing
7.0,,,,,,,,,,,,,,,,,
8.0,,,,,30.1,,,30.3,,,30.2/4/9 (30OTHER),,,31,,,32
9.0,,,,,JQR4,,,JQS8,,,JQU4,,,JQV2,,,JQV5


From an xls source which can be [downloaded here](https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/service-industry.xls).

## Requirements

- We'll take the staggered headers in rows 4 to 6 as "Product".
- We'll take "Year" from column A and clean it up.
- We'll take "Quarter" from column B.
- We'll take row 9 as "CDID" (as I happen to know that's the name of this particular type of identifier).
- We'll call the observations column "Value".
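
The "clean it up" step for Year, and the handling of rows that have no quarter, come down to two small string transformations. The following is a minimal sketch in plain Python with no tidychef involved. The function names `clean_year` and `default_quarter` are hypothetical, the sample strings are illustrative, and keeping only the first four characters assumes every year cell starts with a four-digit year.

```python
def clean_year(raw: str) -> str:
    """Keep only the leading four characters, e.g. drop a trailing ' ->'."""
    return raw[:4]


def default_quarter(raw: str) -> str:
    """Rows without a quarter describe the whole year, so label them 'All'."""
    return "All" if raw == "" else raw


# Illustrative inputs (assumed, not read from the spreadsheet).
print(clean_year("2012 ->"))
print(default_quarter(""))
print(default_quarter("Q1"))
```

Keeping these as tiny standalone functions makes them easy to check in isolation before they are applied to a whole column.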

In [21]:
from tidychef import acquire, against, filters, preview
from tidychef.direction import down, right, up
from tidychef.output import Column, TidyData
from tidychef.selection import XlsSelectable

# Load the "TOPSI9" table from the source spreadsheet.
table: XlsSelectable = acquire.xls.http("https://raw.githubusercontent.com/mikeAdamss/tidychef/main/tests/fixtures/xls/service-industry.xls", tables="TOPSI9")

# Any column A cell containing "Average", plus everything to the right of and
# below it, is excluded from the selections that follow.
unwanted = table.excel_ref('A').filter(filters.contains_string("Average")).expand(right).expand(down)

# The product headers are staggered over several rows. Start from the cell
# containing "ships", extrude the selection down and up to take in the rows
# either side, then expand right and keep the non-blank cells.
product = table.cell_containing_string("ships", strict=False).extrude(down).extrude(up).expand(right).is_not_blank().label_as("Product")

# Years: start from the numeric cells in column A, take the non-blank cells
# from there down and drop the unwanted rows.
year = (table.column('A').is_numeric().expand(down).is_not_blank() - unwanted).label_as("Year")

# Quarters: the non-blank cells in the column to the right of the years
# (minus the unwanted rows), plus the cells directly right of each year,
# which are blank where a row describes a whole year.
quarter_month_or_neither = (
    (year.shift(right).expand(down).is_not_blank() - unwanted) | year.shift(right)
).label_as("Quarter")

# CDIDs are three capital letters followed by a digit and must all sit on a single row.
cdid = table.excel_ref('8:10').re(r"^[A-Z]{3}\d$").assert_single_row().label_as("CDID")

# Observations are the non-blank cells where a CDID column crosses a quarter row.
observations = cdid.waffle(down, quarter_month_or_neither).is_not_blank().label_as("Value")

preview(product, year, quarter_month_or_neither, cdid, observations)

tidy_data = TidyData(
    observations,
    Column(product.finds_observations_directly(down)),
    # Keep only the first four characters of the year and check the result is numeric.
    Column(year.finds_observations_closest(down), apply=lambda x: x[:4], validate=against.is_numeric),
    # A blank quarter means the row covers the whole year.
    Column(quarter_month_or_neither.finds_observations_directly(right), apply=lambda x: "All" if x == "" else x),
    Column(cdid.finds_observations_directly(down)),
)

tidy_data.to_csv("service-industry.csv")

Product
Year
Quarter
CDID
Value

,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U
1.0,TOPSI9,,,UK Production Turnover,,,,,,,,,,,,,,,,,
2.0,,,,Turnover in Production and Services Industries,,,,,,,,,,,,,,,,,
3.0,,,,"Current price, not seasonally adjusted",,,,,,,,,,,,,£ million,,,,
4.0,Back to Contents,,,,,,,Manufacture of air and spacecraft and related machinery,,,,,,,,,,,,,
5.0,,,,,Building of ships and boats,,,,,,Manufacture of other transport equipment,,,,,,,,,,
6.0,,,,,,,,,,,,,,Manufacture of furniture,,,Other manufacturing,,,,
7.0,,,,,,,,,,,,,,,,,,,,,
8.0,,,,,30.1,,,30.3,,,30.2/4/9 (30OTHER),,,31,,,32,,,,
9.0,,,,,JQR4,,,JQS8,,,JQU4,,,JQV2,,,JQV5,,,,
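
The CDID selection relies on the pattern `^[A-Z]{3}\d$`: three capital letters followed by a single digit. One way to see why it picks out row 9 and not the codes in row 8 is to try it with Python's standard `re` module against the cell values shown in the preview. This is only a sketch. How tidychef applies the pattern internally is not shown here, and the variable names are made up for illustration.

```python
import re

# Same pattern as used for the CDID selection.
CDID_PATTERN = re.compile(r"^[A-Z]{3}\d$")

# Non-blank cell values from rows 8 and 9 of the preview.
row_8 = ["30.1", "30.3", "30.2/4/9 (30OTHER)", "31", "32"]
row_9 = ["JQR4", "JQS8", "JQU4", "JQV2", "JQV5"]

# Only values made of exactly three capitals and one digit survive.
matches = [value for value in row_8 + row_9 if CDID_PATTERN.match(value)]
print(matches)
```

Because the pattern is anchored at both ends, a longer value that merely contains a CDID-like fragment would not match.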


## Outputs

The tidy data can be [downloaded here](./service-industry.csv), and a full inline preview of the generated tidy data is shown below for those who'd prefer to scroll.

In [3]:
print(tidy_data)

Value,Product,Year,Quarter,CDID
4787.6,Building of ships and boats,2012,All,JQR4
21632.8,Manufacture of air and spacecraft and related machinery,2012,All,JQS8
2162.2,Manufacture of other transport equipment,2012,All,JQU4
6722.5,Manufacture of furniture,2012,All,JQV2
8784.8,Other manufacturing,2012,All,JQV5
4484.8,Building of ships and boats,2013,All,JQR4
24556.8,Manufacture of air and spacecraft and related machinery,2013,All,JQS8
2487.0,Manufacture of other transport equipment,2013,All,JQU4
6821.2,Manufacture of furniture,2013,All,JQV2
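
Once the data is in this one-observation-per-row shape, it is easy to work with using ordinary tools. As a rough sketch, the snippet below uses only the standard library to group the rows shown above by CDID. The embedded `SAMPLE` text is copied from the preview rather than read from the real `service-industry.csv`, and the variable names are illustrative.

```python
import csv
import io
from collections import defaultdict

# The first rows of the tidy output, copied from the preview above.
SAMPLE = """Value,Product,Year,Quarter,CDID
4787.6,Building of ships and boats,2012,All,JQR4
21632.8,Manufacture of air and spacecraft and related machinery,2012,All,JQS8
2162.2,Manufacture of other transport equipment,2012,All,JQU4
6722.5,Manufacture of furniture,2012,All,JQV2
8784.8,Other manufacturing,2012,All,JQV5
4484.8,Building of ships and boats,2013,All,JQR4
24556.8,Manufacture of air and spacecraft and related machinery,2013,All,JQS8
2487.0,Manufacture of other transport equipment,2013,All,JQU4
6821.2,Manufacture of furniture,2013,All,JQV2
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Map each CDID to its {year: value} series.
by_cdid = defaultdict(dict)
for row in rows:
    by_cdid[row["CDID"]][row["Year"]] = float(row["Value"])

print(by_cdid["JQR4"])
```

To run the same grouping over the full output, swap the `io.StringIO(SAMPLE)` source for an open file handle on the downloaded CSV.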



