# Best Practice

This section covers some best practice approaches and techniques for creating robust transform recipes with datachef.

## Source Data

The data source we're using for these examples is shown below:

| <span style="color:green">Note - this particular table has some very verbose headers we don't care about, so we'll be using `bounded=` to remove them from the previews as well as to show just the subset of data we're working with.</span>|
|-----------------------------------------|

The [full data source can be downloaded here](https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx). We'll be using the 10th tab, named "Table 3c".


## Conditional Is Better Than Explicit

If you want to create a robust, repeatable recipe, conditional selections (particularly simple ones such as `.expand()` and `.fill()`) are typically better than explicit `.excel_ref()` statements.

This is because _many_ tabulated sources are intended to be extended over time.

Consider the following:

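```python
from typing import List
from datachef import acquire, preview, XlsxSelectable

# Load every tab of the workbook; index 9 is the 10th tab, "Table 3c".
tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx")

# Preview only the subset of cells we're working with.
preview(tables[9], bounded="A155:H161")
```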

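|     | A        | B    | C    | D    | E    | F    | G    | H    |
|-----|----------|------|------|------|------|------|------|------|
| 155 | Oct 2022 | -3.3 | 0.6  | 0.2  | 5.3  | -2.2 | 0.8  | 1.5  |
| 156 | Nov 2022 | 2    | -0.4 | -0.2 | 6.2  | 4    | -0.5 | 0.5  |
| 157 | Dec 2022 | 7    | -2.1 | -1.1 | 7.3  | 10   | -1.8 | -1.7 |
| 158 | Jan 2023 | 7.1  | -5.3 | -4   | 6.3  | 8.9  | 0.3  | -1.1 |
| 159 | Feb 2023 | 5    | -5.1 | -4   | 2.9  | 4.4  | 0.8  | 0.3  |
| 160 | Mar 2023 | 1.3  | -5.3 | -4.5 | -1.9 | 0.2  | 4.2  | 0.7  |
| 161 | Apr 2023 | -0.8 | -3.3 | -3   | -1.5 | 2    | 5.4  | 0.9  |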


It's pretty easy to see that the creator of this dataset will be adding more months of data in the future - and you want your script to capture this additional data on reruns.

Now consider our two different techniques of selection.

### excel_ref()

A statement such as `table.excel_ref("B155:B161")` will work, but when the publisher inevitably republishes the source with additional rows, the new data will not be extracted.

### expand()

By contrast, a statement such as `table.excel_ref("B155").expand(down).expand(right)` automatically extracts the additional data when the script is rerun, with **no code update required**.
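The difference between the two approaches can be sketched in plain Python, without datachef. The helpers `explicit_range` and `expand_down` are hypothetical stand-ins written for this illustration (they are not part of the datachef API), and the value for row 162 is made up to simulate a republished source.

```python
from typing import List, Tuple

# Column B from the preview above, as (row number, value) pairs.
published: List[Tuple[int, float]] = [
    (155, -3.3), (156, 2.0), (157, 7.0), (158, 7.1),
    (159, 5.0), (160, 1.3), (161, -0.8),
]

# The same column after a hypothetical republication with one extra month.
# The value 0.4 is invented purely for this illustration.
republished = published + [(162, 0.4)]


def explicit_range(column, first_row, last_row):
    """Hypothetical stand-in for a fixed reference such as B155:B161."""
    return [value for row, value in column if first_row <= row <= last_row]


def expand_down(column, first_row):
    """Hypothetical stand-in for anchoring on one cell and expanding down."""
    return [value for row, value in column if row >= first_row]


fixed = explicit_range(republished, 155, 161)
conditional = expand_down(republished, 155)

print(len(fixed))        # 7: the new row is silently missed
print(len(conditional))  # 8: the new row is picked up with no code change
```

The fixed range keeps returning the same seven values however far the source grows, which is exactly the failure mode you want a repeatable recipe to avoid.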

## Validation Vs Assertions

These are superficially similar things using overlapping mechanics but there is a distinction that's important to understand here:

- Assertions via the `<selectable>.assert_selections()` method run against **selections** so police your _extraction logic_.
- Validations via the `Column(validation=)` constructor keyword run against the **output** so police your _final product_.

Consider the following scenarios:

- 1.) You want to select an "anchor cell" or a selection of cells for the sole purpose of subtracting it from another selection. It could be important to confirm these selections are accurate, but because they are not directly extracted values `validation=` will never see them (just the consequence of them), so `assert_selections()` is more appropriate.

- 2.) You are using `apply=` to cleanse cell values at the point of extraction and need to make sure the correct things are happening. The `assert_selections()` method will **never see these cleansed values**, but `validation=` will.

Some nuances on where to use each follow, but the pithy version is "use both strategies, wherever possible and as much as it's practical to do so".
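As a rough illustration of the second scenario, the plain-Python sketch below shows why a check on the selection and a check on the output can disagree. It does not use datachef. The names `strip_marker` and `is_plain_number` are hypothetical helpers and the cell values are made up. The first check plays the role of an assertion (it sees raw cell values), the second plays the role of a validation (it sees values after cleansing).

```python
import re

# Raw cell values as a selection would see them (made-up sample values).
selected_cells = ["1.5", "0.9*", "2.0"]


def strip_marker(value: str) -> str:
    """Hypothetical cleansing step of the kind you might pass to apply=."""
    return value.rstrip("*")


def is_plain_number(value: str) -> bool:
    """Hypothetical check: an optionally negative decimal and nothing else."""
    return re.fullmatch(r"-?\d+(\.\d+)?", value) is not None


# Assertion-style check: runs on the selection, before any cleansing.
raw_failures = [v for v in selected_cells if not is_plain_number(v)]

# Validation-style check: runs on the output, after cleansing.
output_values = [strip_marker(v) for v in selected_cells]
output_failures = [v for v in output_values if not is_plain_number(v)]

print(raw_failures)     # ['0.9*']
print(output_failures)  # []
```

Neither check is wrong. They simply look at different stages, which is why using both gives better coverage than using either alone.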

## When to Use Validation & Assertions

If you're doing a one-off transform you have no intention of repeating (you're quickly pulling some data apart to populate a dataframe, chart etc) then you could argue that validation is overkill.

If you intend to run your code more than once against a changeable data source, you should implement some form of validation, even if it's just one of the lightweight methods documented elsewhere in this Jupyter Book.

If you're setting up any kind of RAP processes, you should almost certainly be investing time in a quality validation callable (or other mechanism) suitable for your pipeline process and use case(s).

Assertions are more light touch and as a general rule you should be getting into the habit of using them liberally.


## Select Wide and Validate Narrow

Sharp observers will have noticed you can use regex to select cells, to assert selections and even to validate extracted values - isn't this redundant?

No, because selection and validation (including assertions) have very different goals. As a general rule:

- You want to keep your **selection techniques wide** so that the cells get selected _even if they contain invalid values_ (perhaps **especially** if they contain invalid values) - provided, of course, these cells are located structurally where valid values should be.
- Your validation then needs to be **narrow**, **precise** and **unforgiving**.

You can sum this up with the following statement:

**You cannot stop human beings making human errors when publishing updated versions of data sources, but you CAN set things up so you know WHEN and precisely WHERE any such error is encountered.**

Consider this - if you use a strict regular expression to select _only valid values_ then this just means you completely skip the invalid values.

- What if it's just a typo? "Maale" in place of "Male"? Do you want "Male" to be missing from your output?
- What if you use an `is_numeric` filter but a user adds a data marker of `*` in place of an observation?

The point is that there are nuances here: you **don't** want to process invalid values, but you **do** (on balance) want to select them (if they're located where valid values should be).

The trick here is to use the provided tools to make it **obvious** where and precisely what the problem is so it can be trivially addressed. Judicious and well targeted usage of validation and assertions will get you that.
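The "Maale" typo can be used to sketch the idea in plain Python. This is an illustration only and does not use datachef. The cell references and the dictionary of header cells are made up, and the pattern `Male|Female` is an assumed definition of "valid" for this example.

```python
import re

# Made-up header cells keyed by cell reference; B5 contains a typo.
header_cells = {"B5": "Maale", "C5": "Female"}

# Assumed definition of a valid value for this example.
valid = re.compile(r"Male|Female")

# Narrow selection: only cells that already look valid are selected,
# so the typo is silently dropped and nothing tells you it happened.
narrow_selection = {
    ref: value for ref, value in header_cells.items() if valid.fullmatch(value)
}

# Wide selection: take everything located where a header should be...
wide_selection = dict(header_cells)

# ...then validate narrowly, reporting exactly what is wrong and where.
problems = [
    f"{ref}: unexpected value '{value}'"
    for ref, value in wide_selection.items()
    if not valid.fullmatch(value)
]

print(narrow_selection)  # {'C5': 'Female'}
print(problems)
```

With the narrow selection the output simply loses a column and the run succeeds. With the wide selection the run can fail loudly and point at cell B5.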