# Acquire

The acquire module is how data is loaded into datachef.

The acquire module uses dot notation to specify the format and the origin of the data (for example, a local file or an http source), following the pattern `acquire.<format>.<origin>()`.

Simple examples for common tabular data formats follow.

| <span style="color:green">Note - We're using Excel-style cell references for previews throughout these examples. This functionality will be explained in the _Preview_ page of this documentation.</span>|
|-----------------------------------------|

## CSV Data From Local

Creating a single selectable table-like object from a local csv file.

You can download a copy of the example data being used [here](https://raw.githubusercontent.com/mikeAdamss/datachef/main/tests/fixtures/csv/bands-wide.csv).

In [31]:
from datachef import acquire, preview
from datachef.selection import CsvSelectable

# Argument is the location of any csv file on your machine
# This can be a string or a python Path object.
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/bands-wide.csv")
preview(table)

0,1,2,3,4,5,6,7,8,9,10,11
,A,B,C,D,E,F,G,H,I,J,K
1.0,,,,,,,,,,,
2.0,,,Houses,Cars,Boats,,,,Houses,Cars,Boats
3.0,Beatles,,,,,,Rolling Stones,,,,
4.0,,John,1,5,9,,,Keith,2,6,10
5.0,,Paul,2,6,10,,,Mick,3,7,11
6.0,,George,2,7,11,,,Charlie,3,8,12
7.0,,Ringo,4,8,12,,,Ronnie,5,9,13
8.0,,,,,,,,,,,


## CSV Data from Http(s)

You can also load a csv via http as per the following example:

In [32]:
from datachef import acquire, preview
from datachef.selection import CsvSelectable

# Argument is any csv file accessible via http or https
table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/datachef/main/tests/fixtures/csv/bands-wide.csv")
preview(table)

0,1,2,3,4,5,6,7,8,9,10,11
,A,B,C,D,E,F,G,H,I,J,K
1.0,,,,,,,,,,,
2.0,,,Houses,Cars,Boats,,,,Houses,Cars,Boats
3.0,Beatles,,,,,,Rolling Stones,,,,
4.0,,John,1,5,9,,,Keith,2,6,10
5.0,,Paul,2,6,10,,,Mick,3,7,11
6.0,,George,2,7,11,,,Charlie,3,8,12
7.0,,Ringo,4,8,12,,,Ronnie,5,9,13
8.0,,,,,,,,,,,


## Customising Csv Loads

Both the local and http csv loaders are wrappers around [the csv reader from the standard python library csv module](https://docs.python.org/3/library/csv.html) and propagate keyword arguments.

This means you can pass any keyword arguments through to `acquire.csv.local()` and `acquire.csv.http()` that you could pass to the `csv.reader()` function.

As an example here we are loading a csv file using a `|` delimiter in place of commas. [This is the example data we're using](https://raw.githubusercontent.com/mikeAdamss/datachef/main/tests/fixtures/csv/pipe-delimited.csv).

In [33]:
from datachef import acquire, preview
from datachef.selection import CsvSelectable

# Let's specify a different delimiter
table: CsvSelectable = acquire.csv.local("../../tests/fixtures/csv/pipe-delimited.csv", delimiter="|")
preview(table)

0,1,2,3
,A,B,C
1.0,Age,Male,Female
2.0,1,3,9
3.0,14,54,12
4.0,9,0,3
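
To see what these pass-through keyword arguments do on their own, here is a minimal sketch using only the standard library `csv.reader()`. The in-memory data is a hypothetical stand-in for a pipe-delimited file (it is not read from the fixture), and the `quotechar` line is an extra illustration that is not part of the example above.

```python
import csv
import io

# Hypothetical stand-in for the contents of a pipe-delimited file
raw = "Age|Male|Female\n1|3|9\n14|54|12\n9|0|3\n"

# delimiter="|" is the same keyword passed to acquire.csv.local() above
rows = list(csv.reader(io.StringIO(raw), delimiter="|"))
print(rows[0])

# Other csv.reader() keywords work the same way, e.g. quotechar lets a
# field contain the delimiter itself
quoted = "'Smith|Jones'|3|9\n"
quoted_row = next(csv.reader(io.StringIO(quoted), delimiter="|", quotechar="'"))
print(quoted_row)
```

Any keyword that behaves as expected here should, per the description above, behave the same way when handed to the csv loaders.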


## Loading Xlsx Data

Creating a **list** of selectable table-like objects from a local xlsx file.

You can download the example xlsx data being used [here](https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx).

| <span style="color:green">Note - Some of the following examples are quite big tables so we're using the `preview()` keyword `bounded=` to limit the size of the previews. Again this functionality will be explained in detail in the _Preview_ page of this documentation.</span>|
|-----------------------------------------|

## Xlsx Data From Local

In [34]:
from typing import List
from datachef import acquire, preview
from datachef.selection import XlsxSelectable

# Note: tables (note plural) as it's now a list of tabulated data sources
tables: List[XlsxSelectable] = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx")
preview(tables[0], bounded="A1:B4")

0,1,2
,A,B
1.0,Output in the construction industry reference tables April 2023,
2.0,"This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations",
3.0,Coverage:,Great Britain
4.0,Released:,14 June 2023


Rather than indexing into the list with the somewhat clunky `tables[0]` syntax, we can also pass a `tables=` keyword to select a table by name, as per the below.

In [35]:
from datachef import acquire, preview
from datachef.selection import XlsxSelectable

# Note: table (singular) as the tables= keyword here selects a single named table
table: XlsxSelectable = acquire.xlsx.local("../../tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

0,1,2
,A,B
1.0,Output in the construction industry reference tables April 2023,
2.0,"This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations",
3.0,Coverage:,Great Britain
4.0,Released:,14 June 2023


## Xlsx Data From Http(s)

In [36]:
from datachef import acquire, preview
from datachef.selection import XlsxSelectable

# Note: table (singular) as the tables= keyword here selects a single named table
table: XlsxSelectable = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx", tables="Cover Sheet")
preview(table, bounded="A1:B4")

0,1,2
,A,B
1.0,Output in the construction industry reference tables April 2023,
2.0,"This spreadsheet contains the data tables published alongside the Office for National Statistics' Construction output in Great Britain bulletin for April 2023 . We have edited these data tables and the accompanying cover sheet, table of contents and notes worksheet to meet legal accessibility regulations",
3.0,Coverage:,Great Britain
4.0,Released:,14 June 2023


## Loading Xls Data

## Xls Data From Local

You can download the example data being used [here](https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xls/sample.xls).

In [37]:
from datachef import acquire, preview
from datachef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.local("../../tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

0,1,2,3,4,5,6,7
,A,B,C,D,E,F,G
1.0,OrderDate,Region,Rep,Item,Units,Unit Cost,Total
2.0,44202.0,East,Jones,Pencil,95.0,1.99,189.05
3.0,44219.0,Central,Kivell,Binder,50.0,19.99,999.4999999999999
4.0,44236.0,Central,Jardine,Pencil,36.0,4.99,179.64
5.0,44253.0,Central,Gill,Pen,27.0,19.99,539.7299999999999
6.0,44270.0,West,Sorvino,Pencil,56.0,2.99,167.44
7.0,44287.0,East,Jones,Binder,60.0,4.99,299.40000000000003


## Xls Data From Http(s)

In [38]:
from datachef import acquire, preview
from datachef.selection import XlsSelectable

table: XlsSelectable = acquire.xls.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xls/sample.xls", tables="SalesOrders")
preview(table, bounded="A1:G7")

0,1,2,3,4,5,6,7
,A,B,C,D,E,F,G
1.0,OrderDate,Region,Rep,Item,Units,Unit Cost,Total
2.0,44202.0,East,Jones,Pencil,95.0,1.99,189.05
3.0,44219.0,Central,Kivell,Binder,50.0,19.99,999.4999999999999
4.0,44236.0,Central,Jardine,Pencil,36.0,4.99,179.64
5.0,44253.0,Central,Gill,Pen,27.0,19.99,539.7299999999999
6.0,44270.0,West,Sorvino,Pencil,56.0,2.99,167.44
7.0,44287.0,East,Jones,Binder,60.0,4.99,299.40000000000003


## A Note On Http(s) Caching

All `.http()` methods described in this section use http caching via the python [CacheControl](https://pypi.org/project/CacheControl/) package.

This _should_ cache responses and fetch new data only when the last modified date of the data in question has changed (i.e. when the data source has been updated).

You can toggle this behaviour off as needed by passing `cache=False` to any of the `.http()` methods.

Example follows:

In [39]:
from datachef import acquire, preview
from datachef.selection import CsvSelectable

table: CsvSelectable = acquire.csv.http("https://raw.githubusercontent.com/mikeAdamss/datachef/main/tests/fixtures/csv/bands-wide.csv",
                         cache=False)
preview(table)

0,1,2,3,4,5,6,7,8,9,10,11
,A,B,C,D,E,F,G,H,I,J,K
1.0,,,,,,,,,,,
2.0,,,Houses,Cars,Boats,,,,Houses,Cars,Boats
3.0,Beatles,,,,,,Rolling Stones,,,,
4.0,,John,1,5,9,,,Keith,2,6,10
5.0,,Paul,2,6,10,,,Mick,3,7,11
6.0,,George,2,7,11,,,Charlie,3,8,12
7.0,,Ringo,4,8,12,,,Ronnie,5,9,13
8.0,,,,,,,,,,,


## Using tables=

Some of you will have wondered why it's "tables" (plural) and not "table" (singular).

It's because the string you pass to `tables=` is a [regular expression](https://regexone.com/).

We're _not_ going to go into regular expressions as part of this documentation, but for our purposes just be aware that it's a pattern matching syntax and well worth exploring for anyone working in an ETL role.

As a simple example, you can use regular expressions to create an **or** statement with the pipe (`|`) character.

The following example shows how you can do just that to select two tables from our xlsx source.

In [40]:
from typing import List
from datachef import acquire
from datachef.selection import XlsxSelectable

tables: List[XlsxSelectable] = acquire.xlsx.http("https://github.com/mikeAdamss/datachef/raw/main/tests/fixtures/xlsx/ons-oic.xlsx",
                                                 tables="Table 1a|Table 1b")
for table in tables:
    print(table.name)

Table 1a
Table 1b
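
The effect of the pipe can be reproduced with the standard library `re` module alone. This is only a sketch: the sheet names are hypothetical, and matching against the whole name with `fullmatch()` is an assumption made for illustration, not a statement of how datachef applies the pattern internally.

```python
import re

# Hypothetical sheet names, loosely modelled on the example workbook
sheet_names = ["Cover Sheet", "Contents", "Table 1a", "Table 1b", "Table 2a"]

# The same pattern string passed to tables= above
pattern = re.compile("Table 1a|Table 1b")

# Assumption for this sketch: a sheet is selected when the pattern
# matches its whole name
selected = [name for name in sheet_names if pattern.fullmatch(name)]
print(selected)
```

Trying a pattern against a plain list of names like this is a quick way to check what it will and won't select before using it in an acquire statement.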


For many users this ability to only select the tables you want to process will be all that's required, but the _real_ power here is the ability (where needed) to deal with inconsistencies in table naming.

This is crude (you can do some very clever things with regular expressions should you choose to explore them), but continuing on from our **or** example, let's imagine the following:

```
tables="Table 1a|table 1a|table1a|table 1A|Table 1A|Table1A"
```

This gives your acquire statement a fair degree of additional robustness, as minor changes or mistakes in table naming by the data publisher are accounted for.
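
As a hint of the "clever things" mentioned above, the long list of alternatives can be written more compactly. The sketch below uses only the standard library `re` module and compares the two patterns against some hypothetical table names. Whether `tables=` accepts an inline `(?i)` flag is an assumption here, as is matching against the whole name, so test it against your own source before relying on it.

```python
import re

# The explicit list of spellings from the example above
verbose = re.compile("Table 1a|table 1a|table1a|table 1A|Table 1A|Table1A")

# Compact form: (?i) ignores case and " ?" makes the space optional
compact = re.compile("(?i)table ?1a")

# Hypothetical names a data publisher might use
candidates = ["Table 1a", "table1a", "Table1A", "TABLE 1A", "Table 1b"]

for name in candidates:
    in_verbose = bool(verbose.fullmatch(name))
    in_compact = bool(compact.fullmatch(name))
    print(f"{name!r}: verbose={in_verbose}, compact={in_compact}")
```

The compact pattern accepts every spelling in the explicit list plus variants it does not cover (such as `TABLE 1A`), while still rejecting a different table such as `Table 1b`.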