## Setting up your parser

The first thing you need to know is the width of each column in your file.
There is no magic here; you have to work it out yourself.

Let's take [this file](https://raw.githubusercontent.com/nano-labs/pyfwf3/master/sample_data/humans.txt)
as an example. The first line looks like this (the two ruler lines on top are only there to help with counting):

```
  123456789-123456789-123456789-123456789-123456789-123456789-123456789-123
  --------- --------- --------- --------- --------- --------- --------- ---
  US       AR19570526Fbe56008be36eDianne Mcintosh         Whatever    Medic
```

- 9 bytes: location
- 2 bytes: state
- 8 bytes: birthday
- 1 byte: gender
- 12 bytes: don't know
- 24 bytes: name
- ... and so on
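Because the columns follow each other without gaps, the start and stop offsets can be derived from the widths with a running sum. The following is a minimal sketch in plain Python, not part of fwf_db; the helper `widths_to_slices` and the field name `unknown` are made up for illustration.

```python
from itertools import accumulate


def widths_to_slices(fields):
    """Turn [(name, width), ...] into {name: (start, stop)}, stop exclusive."""
    stops = list(accumulate(width for _, width in fields))
    starts = [0] + stops[:-1]
    return {name: (start, stop) for (name, _), start, stop in zip(fields, starts, stops)}


# Widths taken from the list above; "unknown" is a placeholder name.
layout = [
    ("location", 9),
    ("state", 2),
    ("birthday", 8),
    ("gender", 1),
    ("unknown", 12),
    ("name", 24),
]

slices = widths_to_slices(layout)
print(slices["birthday"], slices["gender"], slices["name"])
```

The resulting offsets are the values you would put into the `slice` entries of a file specification.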

To start with, we only use `name`, `birthday` and `gender`.


In [1]:
from fwf_db import fwf_open, op

class HumanFileSpec:
    FIELDSPECS = [
        {"name": "birthday", "slice": (11, 19)},
        {"name": "gender"  , "slice": (19, 20)},
        {"name": "name"    , "slice": (32, 56)},
    ]

data = fwf_open(HumanFileSpec, "./humans.txt")


Each slice holds the start (inclusive) and stop (exclusive) byte offsets of a field within the line, just like a Python slice. Alternatively you may provide combinations of `start`, `len` and `stop`.
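To double-check a specification, it can help to apply the offsets to a raw line by hand. The sketch below uses plain Python only. The helper `resolve` and its `length` parameter are hypothetical and merely illustrate how any two of start, stop and length determine the third; they are not the library's API.

```python
# First record of humans.txt, rebuilt from its padded fields (line ending omitted).
line = (
    b"US".ljust(9) + b"AR" + b"19570526" + b"F" + b"be56008be36e"
    + b"Dianne Mcintosh".ljust(24) + b"Whatever".ljust(12) + b"Medic"
)


def resolve(start=None, stop=None, length=None):
    """Hypothetical helper: return (start, stop) from any two of the three values."""
    if [start, stop, length].count(None) > 1:
        raise ValueError("need at least two of start, stop and length")
    if start is None:
        start = stop - length
    if stop is None:
        stop = start + length
    return start, stop


birthday = line[slice(*resolve(start=11, stop=19))]
gender = line[slice(*resolve(start=19, length=1))]
name = line[slice(*resolve(stop=56, length=24))]
print(birthday, gender, name.strip())
```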

The sequence of fields is relevant for exporting and (pretty) printing the dataset.

## Views

`data`, in the example above, makes all records and fields from the file available,
and is accessible almost like a standard Python list. You may consider it the
root view, as it has no parent view.

Slices, filters, etc. create views on top of their parent views.
Views are very light-weight and do not copy any data from the file.
They basically only maintain indexes into their parent view.

Views inherit the header (fields) from their parent, but maintain their
own copy. It can be modified without affecting the parent's header.
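The idea can be pictured with a toy class. This is a simplified sketch and not fwf_db's implementation: the class `View` below is hypothetical, and for brevity it indexes directly into the root records rather than into its parent view.

```python
class View:
    """Toy view: shared records, a sequence of row indexes, and a private header."""

    def __init__(self, records, indexes, header):
        self.records = records      # shared with the parent, never copied
        self.indexes = indexes      # positions into `records`
        self.header = list(header)  # own copy, safe to modify

    def __len__(self):
        return len(self.indexes)

    def __getitem__(self, item):
        if isinstance(item, slice):
            # Slicing creates another view; slicing a range yields a range.
            return View(self.records, self.indexes[item], self.header)
        return self.records[self.indexes[item]]


records = [b"Dianne Mcintosh", b"Rosalyn Clark", b"Shirley Gray",
           b"Georgia Frank", b"Virginia Lambert"]
root = View(records, range(len(records)), ["birthday", "gender", "name"])

sub = root[1:4]
sub.header.reverse()  # only affects the sub-view
print(len(sub), sub[0], root.header, sub.header)
```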

In [2]:
# slices provide a view (subset) on the full data set
data[0:5].print(pretty=True)

+----------+--------+--------------------------+
| birthday | gender |           name           |
+----------+--------+--------------------------+
| 19570526 |   F    | Dianne Mcintosh          |
| 19940213 |   M    | Rosalyn Clark            |
| 19510403 |   M    | Shirley Gray             |
| 20110508 |   F    | Georgia Frank            |
| 19930404 |   M    | Virginia Lambert         |
+----------+--------+--------------------------+
  len: 5/5


In [3]:
# The raw data look like this
data[0:5].print(pretty=False)

FWFRegion(count=5):
[('birthday', 'gender', 'name'),
  (b'19570526', b'F', b'Dianne Mcintosh         '),
  (b'19940213', b'M', b'Rosalyn Clark           '),
  (b'19510403', b'M', b'Shirley Gray            '),
  (b'20110508', b'F', b'Georgia Frank           '),
  (b'19930404', b'M', b'Virginia Lambert        ')
]



In [4]:
# You want to change the field order?
data[0:5].print("name", "birthday", "gender", pretty=True)  

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5


In [5]:
# Maybe you want to change it for the view?
x = data[0:5].set_header("name", "birthday", "gender")
x.print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5


In [6]:
# Individual lines can be requested as well
data[327].print(pretty=True)

print(data[327].name)
print(data[327].birthday)
print(data[327].gender)

print(tuple(data[327]))
print(list(data[327]))

+----------+--------+--------------------------+
| birthday | gender |           name           |
+----------+--------+--------------------------+
| 19490106 |   M    | Jack Brown               |
+----------+--------+--------------------------+
b'Jack Brown              '
b'19490106'
b'M'
(b'19490106', b'M', b'Jack Brown              ')
[b'19490106', b'M', b'Jack Brown              ']


## Filter

Filtering any view returns a new view,
which can again be filtered, and so on.
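Conceptually, a filter keeps the indexes of the records for which its conditions hold, and several conditions are combined with either "and" or "or". The following plain-Python sketch illustrates that idea. The function `filter_indexes` is hypothetical and only mimics the role of an `is_or` switch; it is not the library's code.

```python
rows = [
    {"name": b"Dianne Mcintosh".ljust(24), "birthday": b"19570526", "gender": b"F"},
    {"name": b"Rosalyn Clark".ljust(24), "birthday": b"19940213", "gender": b"M"},
    {"name": b"Shirley Gray".ljust(24), "birthday": b"19510403", "gender": b"M"},
    {"name": b"Georgia Frank".ljust(24), "birthday": b"20110508", "gender": b"F"},
    {"name": b"Virginia Lambert".ljust(24), "birthday": b"19930404", "gender": b"M"},
]


def filter_indexes(rows, indexes, *conditions, is_or=False):
    """Return the indexes whose rows satisfy all (or, with is_or, any) conditions."""
    combine = any if is_or else all
    return [i for i in indexes if combine(cond(rows[i]) for cond in conditions)]


everything = range(len(rows))

# "or" combination: male, or born in 1990 or later
male_or_young = filter_indexes(
    rows, everything,
    lambda r: r["gender"] == b"M",
    lambda r: r["birthday"] >= b"19900101",
    is_or=True,
)

# Chaining: the second filter only sees what the first one kept
ends_with_k = filter_indexes(rows, everything, lambda r: r["name"].strip().endswith(b"k"))
female_k = filter_indexes(rows, ends_with_k, lambda r: r["gender"] == b"F")
print(male_or_young, ends_with_k, female_k)
```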

In [7]:
data = fwf_open(HumanFileSpec, "./humans.txt")

# This is the original data from the file
data.set_header("name", "birthday", "gender").print(pretty=True, stop=5)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/10,012


In [8]:
# Filtered by 'gender'
data[:5].filter(op("gender") == b"F").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Georgia Frank            | 20110508 |   F    |
+--------------------------+----------+--------+
  len: 2/2


In [9]:
# Combinations of filters (and/or support)
data[:5].filter(op("gender") == b"M", op("birthday").bytes() >= b"19900101", is_or=True).print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 4/4


In [10]:
# Chained filters
data[:5].filter(op("name").str().strip().endswith("k")).filter(op("gender")==b"F").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Georgia Frank            | 20110508 |   F    |
+--------------------------+----------+--------+
  len: 1/1


In [11]:
# Filters are functions invoked for each record.
data[:5].filter(lambda line: op("birthday").str().date().get(line).year == 1957).print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
+--------------------------+----------+--------+
  len: 1/1


In [12]:
# Which could be rewritten as:
data[:5].filter(op("birthday").bytes().startswith(b"1957")).print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
+--------------------------+----------+--------+
  len: 1/1


In [13]:
# Or
data[:5].filter(op("birthday")[0:4] == b"1957").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
+--------------------------+----------+--------+
  len: 1/1


In [14]:
# Or with an additional field added to the view
x = data[:5]
x.add_field("birthday_year", start=11, len=4)   # !!!! TODO Not working !!!!
print(x.fields)
x.filter(op("birthday_year") == b"1957").print(pretty=True)

# data's view remains unmodified  (no 'birthday_year')
data[:5].print(pretty=True)

FWFFileFieldSpecs(reclen=56, fields=[birthday=(11, 19), gender=(19, 20), name=(32, 56), birthday_year=(11, 15)])
+--------------------------+----------+--------+---------------+
|           name           | birthday | gender | birthday_year |
+--------------------------+----------+--------+---------------+
| Dianne Mcintosh          | 19570526 |   F    |      1957     |
+--------------------------+----------+--------+---------------+
  len: 1/1
+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5


## Indices

As mentioned previously, the main use cases for this library are
  - (very) fast NoSQL-like access
  - datasets potentially larger than memory

The 2nd point is covered by memory-mapping the file.
The 1st one requires support for indexes, both unique and non-unique.
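For readers unfamiliar with memory-mapping, here is a small sketch using Python's standard `mmap` module. It is independent of fwf_db; the record layout (8 data bytes plus `\r\n`) and the temporary file are assumptions made up for the example.

```python
import mmap
import os
import tempfile

RECLEN = 10  # assumed toy layout: 8 data bytes + "\r\n"

# Write a small fixed-width file so there is something to map.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(b"%08d\r\n" % i)
    path = tmp.name

with open(path, "rb") as fd, mmap.mmap(fd.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    count = len(mm) // RECLEN

    def record(n):
        # Slicing the map copies just these 8 bytes; the OS pages data in on demand.
        return mm[n * RECLEN : n * RECLEN + 8]

    first, last = record(0), record(count - 1)

os.remove(path)
print(count, first, last)
```

Because every record has the same length, record `n` always starts at byte `n * RECLEN`, so random access needs no scan of the file.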

### Unique Indices

In [15]:
import fwf_db
from fwf_db import fwf_open, op

class CompleteHumanFileSpec:
    FIELDSPECS = [
            {"name": "name",       "slice": (32, 56)},
            {"name": "gender",     "slice": (19, 20)},
            {"name": "birthday",   "slice": (11, 19)},
            {"name": "location",   "slice": ( 0,  9)},
            {"name": "state",      "slice": ( 9, 11)},
            {"name": "universe",   "slice": (56, 68)},
            {"name": "profession", "slice": (68, 81)},
        ]
data = fwf_open(CompleteHumanFileSpec, "./humans.txt")

In [16]:
# Create a unique index over column 'state'.
index = fwf_db.FWFUniqueIndexDict(data)
fwf_db.FWFCythonIndexBuilder(index).index(data, "state")
index.print("name", "state", "birthday", pretty=True, stop=5)

+-----------------------------+-------+-------------+
|             name            | state |   birthday  |
+-----------------------------+-------+-------------+
| b'Paul Dash               ' | b'AR' | b'19710316' |
| b'Alex Taylor             ' | b'MI' | b'19420108' |
| b'Terry Shelton           ' | b'WI' | b'19900906' |
| b'James Clark             ' | b'MD' | b'20090909' |
| b'Margaret Radford        ' | b'PA' | b'20130316' |
+-----------------------------+-------+-------------+
  len: 5/51


In [17]:
# The index is dict-like, and the dict-value of a unique index is a 
# single line in the file. Only the index itself consumes memory.
index[b"AR"].print(pretty=True)

+--------------------------+--------+----------+-----------+-------+--------------+---------------+
|           name           | gender | birthday |  location | state |   universe   |   profession  |
+--------------------------+--------+----------+-----------+-------+--------------+---------------+
| Paul Dash                |   F    | 19710316 | US        |   AR  | Whatever     | Student       |
+--------------------------+--------+----------+-----------+-------+--------------+---------------+


If a value is not unique, the last record with that value is stored in the index.
This comes in quite handy: consider a CDC (change data capture) use case, where
the file potentially contains several records with the same ID and you only
need the last one. Or a multi-file scenario where, every month, the first file
is a full export and the remaining daily ones are delta exports. In SQL and
pandas you would need `GROUP BY` / `groupby` operations, which are much more
expensive (memory, time).
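The "last one wins" behaviour is the same as repeatedly assigning to a dict key. A minimal plain-Python sketch with made-up CDC-style records (IDs and states are invented for illustration):

```python
# (id, state) pairs in file order; later entries supersede earlier ones.
records = [
    (b"1001", b"created"),
    (b"1002", b"created"),
    (b"1001", b"updated"),
    (b"1001", b"deleted"),
]

# Map each key to the position of its record; later positions overwrite earlier ones.
unique_index = {key: pos for pos, (key, _) in enumerate(records)}

# Only the index lives in memory; the record is looked up when needed.
latest = {key: records[pos][1] for key, pos in unique_index.items()}
print(unique_index, latest)
```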

The library does not support multi-level indexes. You may have noticed
that we avoid eagerly loading all lines, parsing all values, and so on. The same
applies to multi-level indexes: because creating an index is so fast, we would
rather create the 2nd-level index if and when needed, on the relevant subset. We
found that this saves a lot of memory, and it has not shown up as a performance
bottleneck so far.
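The "second level on demand" approach can be sketched in plain Python as follows. The function `build_index` is a hypothetical stand-in for an index builder that maps each value to a list of record positions; it is not the library's API.

```python
from collections import defaultdict


def build_index(rows, positions, field):
    """Map each value of `field` to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in positions:
        index[rows[pos][field]].append(pos)
    return index


rows = [
    {"state": b"AR", "gender": b"F"},
    {"state": b"MI", "gender": b"M"},
    {"state": b"AR", "gender": b"M"},
    {"state": b"AR", "gender": b"F"},
]

# First level over the whole dataset
by_state = build_index(rows, range(len(rows)), "state")

# Second level only over the subset that is actually needed
ar_by_gender = build_index(rows, by_state[b"AR"], "gender")
print(dict(by_state), dict(ar_by_gender))
```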

### Non-unique Indices

In [18]:
# Create a non-unique index over column 'state'. The difference compared
# to the unique index is the dict-like object used to maintain the index.
index = fwf_db.FWFIndexDict(data)
fwf_db.FWFCythonIndexBuilder(index).index(data, "state")
index.print(pretty=True)

FWFIndexDict(count=51): [b'AR': len(195), b'MI': len(222), b'WI': len(191), b'MD': len(196), b'PA': len(237), b'VT': len(190), b'OK': len(195), b'NV': len(217), b'RI': len(195), b'ME': len(210) ...]



In [19]:
# The dict-values are views. Exactly the ones we've seen in the previous
# section. Only the index itself consumes memory.
index[b"AR"].print(pretty=True)

+--------------------------+--------+----------+-----------+-------+--------------+---------------+
|           name           | gender | birthday |  location | state |   universe   |   profession  |
+--------------------------+--------+----------+-----------+-------+--------------+---------------+
| Dianne Mcintosh          |   F    | 19570526 | US        |   AR  | Whatever     | Medic         |
| Karl Carney              |   M    | 19640508 | US        |   AR  | Whatever     | Shark tammer  |
| Betsy Shipley            |   M    | 19950925 | US        |   AR  | Whatever     | Super hero    |
| Elizabeth Lewis          |   F    | 20100330 | US        |   AR  | Whatever     | Time traveler |
| Rosalyn Gamache          |   M    | 20030912 | US        |   AR  | Whatever     | Artist        |
| Portia Mooneyham         |   M    | 19610606 | US        |   AR  | Whatever     |               |
| Luther Christian         |   F    | 19791216 | US        |   AR  | Whatever     | Programmer    |


## Multi-File

Events and streaming are the future, but we often still receive files
at regular intervals. Every file can be considered a partition,
and several of these files together make up a dataset. All operations
possible on a single file should transparently be possible on multi-file
datasets as well, including redelivered files and file schema evolution.

In [20]:
# Create a multi-file dataset by passing all the file names to fwf_open()
# In this example it is twice the same file, only for demonstration purposes.
data = fwf_open(HumanFileSpec, "./humans-subset.txt", "./humans.txt")

# We'll get to hidden and computed fields a little later
data[8:12].print("_lineno", "_file", pretty=True)

+---------+---------------------+
| _lineno |        _file        |
+---------+---------------------+
|    8    | ./humans-subset.txt |
|    9    | ./humans-subset.txt |
|    0    |     ./humans.txt    |
|    1    |     ./humans.txt    |
+---------+---------------------+
  len: 4/4


Everything else remains the same: views, filters, indexes.
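One way to picture how a single record number maps to a file and a line within it is a running total of record counts plus a binary search. The sketch below is plain Python and only an illustration, not the library's code. The count of 10 records for `./humans-subset.txt` is inferred from the table above, and the function `locate` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# (file name, number of records); the subset size is an assumption.
files = [("./humans-subset.txt", 10), ("./humans.txt", 10012)]

# Cumulative record counts: global index at which each file ends
ends = list(accumulate(count for _, count in files))


def locate(global_index):
    """Return (file name, line number within that file) for a global index."""
    file_no = bisect_right(ends, global_index)
    start = ends[file_no - 1] if file_no else 0
    return files[file_no][0], global_index - start


print([locate(i) for i in range(8, 12)])
```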

## More on Views

This section shows more examples of what can be done with views.

### .exclude(\*\*kwargs)

In [21]:
# Pretty much the opposite of `.filter()`

data = fwf_open(HumanFileSpec, "./humans.txt")
data.set_header("name", "birthday", "gender")
data[:5].print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5


In [22]:
data[:5].exclude(op("gender")==b"F").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 3/3


### .order_by(field_name(s))

In [23]:
# Create a new view with the field(s) being sorted. Default sorting
# is ascending. For descending sorting prepend the field name with
# '-', e.g. '-birthday'.

data = fwf_open(HumanFileSpec, "./humans.txt")
data.set_header("name", "birthday", "gender")
data[:5].print(pretty=True)

data[:5].order_by("gender").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Georgia Frank            | 20110508 |   F    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5
+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Dianne Mcintosh          | 19570526 |   F    |
| Georgia Frank            | 20110508 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Shirley Gray             | 19510403 |   M    |
| Virginia Lambert         | 19930404 |   M    |
+--------------------------+----------+--------+
  len: 5/5


In [24]:
# Ascending and descending 
data[:5].order_by("gender", "-birthday").print(pretty=True)

+--------------------------+----------+--------+
|           name           | birthday | gender |
+--------------------------+----------+--------+
| Georgia Frank            | 20110508 |   F    |
| Dianne Mcintosh          | 19570526 |   F    |
| Rosalyn Clark            | 19940213 |   M    |
| Virginia Lambert         | 19930404 |   M    |
| Shirley Gray             | 19510403 |   M    |
+--------------------------+----------+--------+
  len: 5/5
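Sorting by several keys with mixed directions can be reproduced in plain Python by sorting once per key, least significant key first, and relying on the stability of Python's sort. This is a sketch of the technique only; the function `order_by` and the `FIELDS` mapping below are hypothetical and not the library's implementation.

```python
FIELDS = {"name": 0, "birthday": 1, "gender": 2}

rows = [
    (b"Dianne Mcintosh", b"19570526", b"F"),
    (b"Rosalyn Clark", b"19940213", b"M"),
    (b"Shirley Gray", b"19510403", b"M"),
    (b"Georgia Frank", b"20110508", b"F"),
    (b"Virginia Lambert", b"19930404", b"M"),
]


def order_by(rows, *keys):
    """Sort by the given field names; a leading '-' means descending."""
    result = list(rows)
    # Least significant key first; the stable sort preserves earlier orderings.
    for key in reversed(keys):
        descending = key.startswith("-")
        column = FIELDS[key.lstrip("-")]
        result.sort(key=lambda row: row[column], reverse=descending)
    return result


ordered = order_by(rows, "gender", "-birthday")
print([name for name, _, _ in ordered])
```

With the five sample records this yields the same order as the table above.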


### .unique(*fields)

In [25]:
# Return the unique values of a field, or unique tuples for several fields.

data = fwf_open(CompleteHumanFileSpec, "./humans.txt")

print(sorted(data.unique("gender")))

print(sorted(data.unique("profession"))[0:5])

print(sorted(data.unique("state"))[0:10])

print(sorted(data.unique("profession", "state"))[0:5])


[b'F', b'M']
[b'             ', b'Artist       ', b'Berserk      ', b'Comedian     ', b'Cookie maker ']
[b'  ', b'AK', b'AL', b'AR', b'AZ', b'CA', b'CO', b'CT', b'DE', b'FL']
[(b'             ', b'AK'), (b'             ', b'AL'), (b'             ', b'AR'), (b'             ', b'AZ'), (b'             ', b'CA')]


### .count()

In [26]:
# Return how many records are in a view: `len(data) == data.count()`

data = fwf_open(HumanFileSpec, "./humans.txt")
print(data.count())

print(data[:5].count())

10012
5


## Computed fields

By default the following fields are available in all views:

- `_lineno`: The line number (record number) within the original file, excluding leading comments
- `_file`: The file name, e.g. as in a multi-file scenario
- `_line`: The unchanged and unparsed original line including newline

How to add your own computed fields is shown further below.

In [27]:
data = fwf_open(HumanFileSpec, "./humans.txt")
data[10:15].print("_lineno", "name")

+---------+--------------------------+
| _lineno |           name           |
+---------+--------------------------+
|    10   | Robert Carolina          |
|    11   | Gladys Martin            |
|    12   | Jason Stinebaugh         |
|    13   | Kenneth Provines         |
|    14   | James Mcgloster          |
+---------+--------------------------+
  len: 5/5


In [28]:
data[10:15].print("_lineno", *data.header())

+---------+----------+--------+--------------------------+
| _lineno | birthday | gender |           name           |
+---------+----------+--------+--------------------------+
|    10   | 20090527 |   M    | Robert Carolina          |
|    11   | 19990123 |   F    | Gladys Martin            |
|    12   | 19610219 |   M    | Jason Stinebaugh         |
|    13   | 19911219 |   F    | Kenneth Provines         |
|    14   | 19741114 |   M    | James Mcgloster          |
+---------+----------+--------+--------------------------+
  len: 5/5


In [29]:
# Note the trailing whitespace and the line break in _line
data[10:15].print("_lineno", "_line", pretty=True)

+---------+-------+
| _lineno | _line |
+---------+-------+
    |10   |    US       AL20090527M771b0ad5b70fRobert Carolina         Whatever    Time traveler#
|         |       |
    |11   |    US       WY19990123Fad2d64883e15Gladys Martin           Whatever    Medic        #
|         |       |
    |12   |    US       FL19610219Ma701d784bc77Jason Stinebaugh        Whatever    Comedian     #
|         |       |
    |13   |    US       HI19911219Fe301c6ea97b9Kenneth Provines        Whatever    Super hero   #
|         |       |
    |14   |    US       AL19741114M4f56d046e3b5James Mcgloster         Whatever    Programmer   #
|         |       |
+---------+-------+
  len: 5/5


In [30]:
list(data[10:15].to_list("_lineno", "_line"))

[(10,
  b'US       AL20090527M771b0ad5b70fRobert Carolina         Whatever    Time traveler#\r\n'),
 (11,
  b'US       WY19990123Fad2d64883e15Gladys Martin           Whatever    Medic        #\r\n'),
 (12,
  b'US       FL19610219Ma701d784bc77Jason Stinebaugh        Whatever    Comedian     #\r\n'),
 (13,
  b'US       HI19911219Fe301c6ea97b9Kenneth Provines        Whatever    Super hero   #\r\n'),
 (14,
  b'US       AL19741114M4f56d046e3b5James Mcgloster         Whatever    Programmer   #\r\n')]

Additional computed fields:

```
class HumanFileSpec:
    FIELDSPECS = [
            {"name": "name",       "slice": (32, 56)},
            {"name": "gender",     "slice": (19, 20)},
            {"name": "birthday",   "slice": (11, 19)},
        ]
```

A file specification is a class, like the one above, so that
methods can be added to it, e.g.:

In [31]:
from datetime import datetime
from fwf_db import FWFLine

class ExtendedHumanFileSpec:
    FIELDSPECS = [
            {"name": "name",       "slice": (32, 56)},
            {"name": "gender",     "slice": (19, 20)},
            {"name": "birthday",   "slice": (11, 19)},
        ]

    def __header__(self) -> list[str]:
        # Define the default header
        return ["name", "gender", "birthday", "birthday_year", "age"]

    def birthday_year(self, line: FWFLine):
        return int(line.birthday[0:4])

    def age(self, line: FWFLine):
        return datetime.today().year - self.birthday_year(line)

    def __validate__(self, line: FWFLine) -> bool:
        return True  # False => Error

    def my_comment_filter(self, line: FWFLine) -> bool:
        return line[0] != ord("#")

data = fwf_open(ExtendedHumanFileSpec, "./humans.txt")

In [32]:
# Filter with a user defined method
data.filter(data.filespec.my_comment_filter)
data.filter(data.filespec.my_comment_filter).print(pretty=True, stop=5)

+--------------------------+--------+----------+---------------+-----+
|           name           | gender | birthday | birthday_year | age |
+--------------------------+--------+----------+---------------+-----+
| Dianne Mcintosh          |   F    | 19570526 |      1957     |  65 |
| Rosalyn Clark            |   M    | 19940213 |      1994     |  28 |
| Shirley Gray             |   M    | 19510403 |      1951     |  71 |
| Georgia Frank            |   F    | 20110508 |      2011     |  11 |
| Virginia Lambert         |   M    | 19930404 |      1993     |  29 |
+--------------------------+--------+----------+---------------+-----+
  len: 5/10,012


In [33]:
# Print the header as defined in __header__(),
# including user-defined computed fields
data[:5].print(pretty=True)

+--------------------------+--------+----------+---------------+-----+
|           name           | gender | birthday | birthday_year | age |
+--------------------------+--------+----------+---------------+-----+
| Dianne Mcintosh          |   F    | 19570526 |      1957     |  65 |
| Rosalyn Clark            |   M    | 19940213 |      1994     |  28 |
| Shirley Gray             |   M    | 19510403 |      1951     |  71 |
| Georgia Frank            |   F    | 20110508 |      2011     |  11 |
| Virginia Lambert         |   M    | 19930404 |      1993     |  29 |
+--------------------------+--------+----------+---------------+-----+
  len: 5/5


In [34]:
# Test every line against your own criteria and list the erroneous lines
data.validate().print("_lineno", "_line", pretty=True, stop=5)

+---------+-------+
| _lineno | _line |
+---------+-------+
    |0    |    US       AR19570526Fbe56008be36eDianne Mcintosh         Whatever    Medic        #
|         |       |
    |1    |    US       MI19940213M706a6e0afc3dRosalyn Clark           Whatever    Comedian     #
|         |       |
    |2    |    US       WI19510403M451ed630accbShirley Gray            Whatever    Comedian     #
|         |       |
    |3    |    US       MD20110508F7e5cd7324f38Georgia Frank           Whatever    Comedian     #
|         |       |
    |4    |    US       PA19930404Mecc7f17c16a6Virginia Lambert        Whatever    Shark tammer #
|         |       |
+---------+-------+
  len: 5/10,012


## More on "debugging" fwf files

tbd.

## Development

We use a virtual env (`.venv`) for dependencies. Given the chosen
file structure (`./src` directory; `./tests` directory without `__init__.py`), we run
`pip install -e .` to install the project in `.` as a local package in
editable (development) mode (`-e`).

Test execution: `pytest -sx tests\...`

Build the Cython extensions only: `./build_ext.bat`
