## A TableSchema interface compatible with the pandas data structure 
-------     
The interface converts a DataFrame or Series into the JSON TableSchema format. The conversion is reversible and compatible with [TableSchema types and formats](https://specs.frictionlessdata.io/table-schema/#types-and-formats).

The interface uses the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme) - which integrates the notion of type - and its [JSON-TAB variation for tabular data](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf).    
    
This solution makes it possible to include all the types and formats defined in the TableSchema specification.

### Content
This notebook uses examples to present some key points.

*(the links below are active in Jupyter Notebook or nbviewer)*
- [1 - Simple example](#1---Simple-example)
- [2 - Example of Series](#2---Example-of-Series)
    - [Numerical Series](#Numerical-Series)
    - [Json Series](#Json-Series)
    - [Datation Series](#Datation-Series)
    - [Location Series](#Location-Series)
- [3 - DataFrame](#3---DataFrame)
        
### References
- [JSON-NTV specification](https://loco-philippe.github.io/ES/JSON%20semantic%20format%20(JSON-NTV).htm)
- [JSON-TAB specification](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf)
- [JSON-NTV classes and methods](https://loco-philippe.github.io/NTV/json_ntv.html)
- [Table Schema specification](https://specs.frictionlessdata.io/table-schema/#types-and-formats)

This Notebook can also be viewed at [nbviewer](http://nbviewer.org/github/loco-philippe/ntv-pandas/tree/main/example)

In [1]:
import sys

# Make the local ntv-pandas checkout importable (adjust the path to your own clone)
sys.path.insert(0, 'C:\\Users\\phili\\github\\ntv-pandas')

In [2]:
import math
import json
from pprint import pprint

import pandas as pd
import ntv_pandas as npd
from shapely.geometry import Point, Polygon, LineString
from json_ntv import Ntv
from datetime import date, datetime, time

## 1 - Simple example

- The example is a DataFrame with several NTV types: `date`, `point`, `email`

In [3]:
df = pd.DataFrame({
            'end february::date': [date(2023,2,28), date(2024,2,29), date(2025,2,28)],
            'coordinates::point': [Point([2.3, 48.9]), Point([5.4, 43.3]), Point([4.9, 45.8])],
            'contact::email':     ['john.doe@table.com', 'lisa.minelli@schema.com', 'walter.white@breaking.com']
            }).astype({'contact::email': 'string'})
df


Unnamed: 0,end february::date,coordinates::point,contact::email
0,2023-02-28,POINT (2.3 48.9),john.doe@table.com
1,2024-02-29,POINT (5.4 43.3),lisa.minelli@schema.com
2,2025-02-28,POINT (4.9 45.8),walter.white@breaking.com


- The example has a JSON representation that conforms to TableSchema
- The pandas JSON interface cannot read or create this representation because it does not understand the `date` type, the `geopoint` type or the `email` format

In [4]:
df_to_table = npd.to_json(df, table=True)
pprint(df_to_table, width=140, sort_dicts=False)

{'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'end february', 'type': 'date'},
                       {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'},
                       {'name': 'contact', 'type': 'string', 'format': 'email'}],
            'primaryKey': ['index'],
            'pandas_version': '1.4.0'},
 'data': [{'index': 0, 'end february': '2023-02-28', 'coordinates': [2.3, 48.9], 'contact': 'john.doe@table.com'},
          {'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 'lisa.minelli@schema.com'},
          {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]}
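
The link between the `name::type` column names of the DataFrame and the schema fields printed above can be sketched in plain Python. This is only an illustration: `NTV_TO_FIELD` and `field_from_column` are hypothetical names, the lookup covers just the three NTV types of this example, and it is not how `ntv_pandas` is actually implemented.

```python
# Hypothetical lookup, limited to the three NTV types used in this example
NTV_TO_FIELD = {
    'date':  {'type': 'date'},
    'point': {'type': 'geopoint', 'format': 'array'},
    'email': {'type': 'string', 'format': 'email'},
}

def field_from_column(column_name):
    """Split a 'name::type' column name and build a TableSchema field descriptor."""
    name, sep, ntv_type = column_name.partition('::')
    if not sep:
        raise ValueError(f'no NTV type in column name: {column_name!r}')
    return {'name': name, **NTV_TO_FIELD[ntv_type]}

columns = ['end february::date', 'coordinates::point', 'contact::email']
fields = [field_from_column(col) for col in columns]
for field in fields:
    print(field)
```

The three descriptors produced this way match the `end february`, `coordinates` and `contact` entries of the `fields` list above.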


- The JSON conversion is reversible: `df_from_table` equals the initial `df`

In [5]:
df_from_table = npd.read_json(df_to_table)
print('df created from TableSchema is equal to initial df ? ', df_from_table.equals(df))
df_from_table

df created from TableSchema is equal to initial df ?  True


Unnamed: 0,end february::date,coordinates::point,contact::email
0,2023-02-28,POINT (2.3 48.9),john.doe@table.com
1,2024-02-29,POINT (5.4 43.3),lisa.minelli@schema.com
2,2025-02-28,POINT (4.9 45.8),walter.white@breaking.com


## 2 - Example of Series

### Numerical Series
- TableSchema defines three types: `integer`, `boolean` and `number` (with the `default` format)
- additional `format` values are used to integrate pandas or NTV data:
    - `float`, `floatxx` for the `number` type, where xx is the bit length (as defined in the pandas dtype)
    - `intxx`, `uintxx` for the `integer` type, where xx is the bit length (as defined in the pandas dtype)

In [6]:
sr = pd.Series([1, 2, 3], name='value')
print('pandas object :\n' + str(sr))

json_table = npd.to_json(sr, table=True)
print('\nJson Table representation :    ')
pprint(json_table, width=100, sort_dicts=False)

print('\nIs Json Table translation reversible ? ', sr.equals(npd.read_json(json_table)))

pandas object :
0    1
1    2
2    3
Name: value, dtype: int64

Json Table representation :    
{'schema': {'fields': [{'name': 'index', 'type': 'integer'}, {'name': 'value', 'type': 'integer'}],
            'primaryKey': ['index'],
            'pandas_version': '1.4.0'},
 'data': [{'index': 0, 'value': 1}, {'index': 1, 'value': 2}, {'index': 2, 'value': 3}]}

Is Json Table translation reversible ?  True


In [7]:
list_sr = [pd.Series([1, 2, 3],   name='value'),
           pd.Series([1.1, 2, 3], name='value'),
           pd.Series([True, False, True], name='value'),
           # additional types
           pd.Series([1, 2, 3],   name='value', dtype='int32'),
           pd.Series([1, 2, 3],   name='value', dtype='uint64'),
           pd.Series([1.6, 2, 3], name='value', dtype='float32')]

print('reversibility, schema field : ')
for sr in list_sr:
    json_table = npd.to_json(sr, table=True)
    print('    ', sr.equals(npd.read_json(json_table)), ', ', json_table['schema']['fields'][1])

reversibility, schema field : 
     True ,  {'name': 'value', 'type': 'integer'}
     True ,  {'name': 'value', 'type': 'number'}
     True ,  {'name': 'value', 'type': 'boolean'}
     True ,  {'name': 'value', 'type': 'integer', 'format': 'int32'}
     True ,  {'name': 'value', 'type': 'integer', 'format': 'uint64'}
     True ,  {'name': 'value', 'type': 'number', 'format': 'float32'}
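
The field descriptors printed above follow a simple pattern, which can be sketched as below. The function `numeric_field` is a hypothetical helper written for this illustration (it is not part of `ntv_pandas`), and it assumes that `int64` and `float64` are the dtypes left without a `format`, as the output above suggests.

```python
import re

def numeric_field(name, dtype):
    """Return a TableSchema field descriptor for a pandas numeric dtype name."""
    if dtype == 'bool':
        return {'name': name, 'type': 'boolean'}
    match = re.fullmatch(r'(u?int|float)(\d+)', dtype)
    if match is None:
        raise ValueError(f'not a numeric dtype: {dtype!r}')
    kind = 'number' if match.group(1) == 'float' else 'integer'
    field = {'name': name, 'type': kind}
    # Assumption: int64 and float64 need no extra format, other widths do
    if dtype not in ('int64', 'float64'):
        field['format'] = dtype
    return field

for dtype in ['int64', 'float64', 'bool', 'int32', 'uint64', 'float32']:
    print(dtype, '->', numeric_field('value', dtype))
```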


### Json Series
- TableSchema defines two types, `object` and `array`, with the `default` format, and one type, `string`, with five formats (`default`, `uri`, `email`, `binary`, `uuid`). `binary` and `uuid` are not used.
- additional `format` values are used to integrate pandas or NTV data:
    - `file` for the `string` type
    - `null`, `object` for the `object` type (the `object` type denotes JSON data, the `object` format denotes `dict` data)

In [8]:
list_sr = [pd.Series([[1, 2], ['val1', 'val2']],   name='value::array'),
           pd.Series([[1, 2], 3, 'test', {'val1': 5, 'val2': 6}],   name='value'),
           pd.Series(['az', 'er', 'cd'], name='value', dtype='string'),
           pd.Series(["geo:13.412 ,103.866", "mailto:John.Doe@example.com"], name='value::uri', dtype='string'),
           pd.Series(["philippe@loco-labs.io", "John.Doe@example.com"], name='value::email', dtype='string'),
           # additional types
           pd.Series([{'val1': 5, 'val2': 6}, {'val1': 5.1, 'val2': 6.1}],   name='value::object'),
           pd.Series(["///path/to/file", "//host.example.com/path/to/file"], name='value::file', dtype='string'),
           pd.Series([None, None, None], name='value::null')]

print('reversibility, schema field : ')
for sr in list_sr:
    json_table = npd.to_json(sr, table=True)
    print('    ', sr.equals(npd.read_json(json_table)), ', ', json_table['schema']['fields'][1])

reversibility, schema field : 
     True ,  {'name': 'value', 'type': 'array'}
     True ,  {'name': 'value', 'type': 'object'}
     True ,  {'name': 'value', 'type': 'string'}
     True ,  {'name': 'value', 'type': 'string', 'format': 'uri'}
     True ,  {'name': 'value', 'type': 'string', 'format': 'email'}
     True ,  {'name': 'value', 'type': 'object', 'format': 'object'}
     True ,  {'name': 'value', 'type': 'string', 'format': 'file'}
     True ,  {'name': 'value', 'type': 'object', 'format': 'null'}


### Datation Series
- TableSchema defines six types: `duration`, `datetime`, `date`, `time`, `yearmonth` and `year`, with the `default` format (the `duration` type is not yet implemented). The `any` and `<PATTERN>` formats for `datetime`, `date` and `time` are not used.
- additional `format` values are used to integrate NTV data:
    - `day`, `wday`, `yday`, `week` for the `date` type
    - `hour`, `minute`, `second` for the `time` type

In [9]:
list_sr = [pd.Series(['2022-01-01', '2021-01-01'], dtype='datetime64[ns]', name='value'),
           pd.Series([date(2022,1,1), date(2021,1,1), date(2023,1,1)],   name='value::date'),
           pd.Series([time(10,20,50), time(9,20,50), time(8,20,50)],   name='value::time'),
           pd.Series([1, 2, 3], name='value::month'),
           pd.Series([2021, 2022, 2023], name='value::year'),
           # additional types
           pd.Series([1, 2, 3],   name='value::day'),
           pd.Series([1, 2, 3],   name='value::wday'),
           pd.Series([1, 2, 3],   name='value::yday'),
           pd.Series([1, 2, 3],   name='value::week'),
           pd.Series([1, 2, 3],   name='value::hour'),
           pd.Series([1, 2, 3],   name='value::minute'),
           pd.Series([1, 2, 3],   name='value::second')
          ]

print('reversibility, schema field : ')
for sr in list_sr:
    json_table = npd.to_json(sr, table=True)
    print('    ', sr.equals(npd.read_json(json_table)), ', ', json_table['schema']['fields'][1])

reversibility, schema field : 
     True ,  {'name': 'value', 'type': 'datetime'}
     True ,  {'name': 'value', 'type': 'date'}
     True ,  {'name': 'value', 'type': 'time'}
     True ,  {'name': 'value', 'type': 'yearmonth'}
     True ,  {'name': 'value', 'type': 'year'}
     True ,  {'name': 'value', 'type': 'date', 'format': 'day'}
     True ,  {'name': 'value', 'type': 'date', 'format': 'wday'}
     True ,  {'name': 'value', 'type': 'date', 'format': 'yday'}
     True ,  {'name': 'value', 'type': 'date', 'format': 'week'}
     True ,  {'name': 'value', 'type': 'time', 'format': 'hour'}
     True ,  {'name': 'value', 'type': 'time', 'format': 'minute'}
     True ,  {'name': 'value', 'type': 'time', 'format': 'second'}
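
Reversibility means that the Series name can be recovered from the field descriptor alone. The sketch below shows one way this reverse mapping could work for the datation fields listed above. `column_from_field` is a hypothetical helper, limited to this section's types, and is not taken from the `ntv_pandas` code.

```python
def column_from_field(field):
    """Rebuild the 'name::type' Series name from a datation field descriptor."""
    name, ftype, fmt = field['name'], field['type'], field.get('format')
    if fmt is not None:
        suffix = fmt            # e.g. type 'date' with format 'week' -> 'week'
    elif ftype == 'datetime':
        suffix = None           # plain pandas datetime64 Series, no NTV suffix
    elif ftype == 'yearmonth':
        suffix = 'month'        # the NTV type is named 'month'
    else:
        suffix = ftype          # 'date', 'time', 'year'
    return name if suffix is None else f'{name}::{suffix}'

print(column_from_field({'name': 'value', 'type': 'datetime'}))
print(column_from_field({'name': 'value', 'type': 'yearmonth'}))
print(column_from_field({'name': 'value', 'type': 'date', 'format': 'week'}))
```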


### Location Series
- TableSchema defines two types: `geopoint` (the `array` format is used; the `default` and `object` formats are not) and `geojson` (the `default` format is used; `topojson` is not).
- additional `format` values (`geometry`, `polygon`, `line`) for `geojson` are used to integrate NTV data.

In [10]:
list_sr = [pd.Series([Point(1, 0), Point(1, 1), Point(1, 2)], name='value::point'),
           pd.Series([Point(1, 0), Polygon([[1.0, 2.0], [1.0, 3.0], [2.0, 4.0]])], name='value::geojson'),
           # additional types
           pd.Series([Point(1, 0), Polygon([[1.0, 2.0], [1.0, 3.0], [2.0, 4.0]])], name='value::geometry'),
           pd.Series([Polygon([[1, 2], [1, 3], [2, 4]]), Polygon([[1, 2], [1, 3], [2, 5]])], name='value::polygon'),
           pd.Series([LineString([[1, 2], [1, 3], [2, 4]]), LineString([[1, 2], [1, 3], [2, 5]])], name='value::line')
          ]

print('reversibility, schema field : ')
for sr in list_sr:
    json_table = npd.to_json(sr, table=True)
    print('    ', sr.equals(npd.read_json(json_table)), ', ', json_table['schema']['fields'][1])

reversibility, schema field : 
     True ,  {'name': 'value', 'type': 'geopoint', 'format': 'array'}
     True ,  {'name': 'value', 'type': 'geojson'}
     True ,  {'name': 'value', 'type': 'geojson', 'format': 'geometry'}
     True ,  {'name': 'value', 'type': 'geojson', 'format': 'polygon'}
     True ,  {'name': 'value', 'type': 'geojson', 'format': 'line'}


## 3 - DataFrame
DataFrame conversion follows the same implementation as Series conversion. The example below reuses some of the Series from the previous examples.

In [11]:
df = pd.DataFrame({
        # numerical
        'float':       [1.1, 2, 3],
        'boolean':     [True, False, False],
        'int32':       pd.Series([1, 2, 3],   dtype='int32'),
        # json
        'ex1::array':  [[1, 2], ['val1', 'val2'], [1, {'val3': 3}]],
        'json':        [[1, 2], 'test', {'val1': 5, 'val2': 6}],
        'string':      pd.Series(['az', 'er', 'cd'], dtype='string'),
        'ex2::uri':    pd.Series(["geo:13.412 ,103.866", "mailto:John.Doe@example.com", ""], dtype='string'),
        'ex3::email':  pd.Series(["philippe@loco-labs.io", "John.Doe@example.com", ""], dtype='string'),
        'ex4::object': [{'val1': 5, 'val2': 6}, {'val1': 5.1, 'val2': 6.1}, {}],
        # datation
        'datetime':    pd.Series(['2022-01-01', '2021-01-01', '2023-01-01'], dtype='datetime64[ns]'),
        'ex5::date':   [date(2022,1,1), date(2021,1,1), date(2023,1,1)],
        'ex6::time':   [time(10,20,50), time(9,20,50), time(8,20,50)],
        'ex7::month':  [1, 2, 3],
        'ex8::hour':   [1, 2, 3],
        # location
        'ex9::point':  [Point(1, 0), Point(1, 1), Point(1, 2)],
        'ex10::geojson': [Point(1, 0), LineString([[1, 2], [1, 3]]), Polygon([[1.0, 2.0], [1.0, 3.0], [2.0, 4.0]])],
        # additional types
        'ex11::geometry': [Point(1, 0), LineString([[1, 2], [1, 3]]), Polygon([[1.0, 2.0], [1.0, 3.0], [2.0, 4.0]])],
        'ex12::polygon':  [Polygon([[1,2], [1,3], [2,4]]), Polygon([[1,2], [1,3], [2,5]]), Polygon([[1,2], [1,3], [2,6]])],
        'ex13::line':     [LineString([[1, 2], [2, 4]]), LineString([[1, 2], [2, 5]]), LineString([[1, 2], [2, 6]])]           
})
print('\nJson Table representation :    ')
pprint(npd.to_json(df, table=True), width=100, sort_dicts=False)
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df, table=True))))


Json Table representation :    
{'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'float', 'type': 'number'},
                       {'name': 'boolean', 'type': 'boolean'},
                       {'name': 'int32', 'type': 'integer', 'format': 'int32'},
                       {'name': 'ex1', 'type': 'array'},
                       {'name': 'json', 'type': 'object'},
                       {'name': 'string', 'type': 'string'},
                       {'name': 'ex2', 'type': 'string', 'format': 'uri'},
                       {'name': 'ex3', 'type': 'string', 'format': 'email'},
                       {'name': 'ex4', 'type': 'object', 'format': 'object'},
                       {'name': 'datetime', 'type': 'datetime'},
                       {'name': 'ex5', 'type': 'date'},
                       {'name': 'ex6', 'type': 'time'},
                       {'name': 'ex7', 'type': 'yearmonth'},
                       {'name': 'ex8', 'type': 'time', 'fo

In [34]:
data = {'index':        [100, 200, 300, 400, 500],
        'dates::date':  [date(1964,1,1), date(1985,2,5), date(2022,1,21), date(1964,1,1), date(1985,2,5)],
        'value':        [10, 10, 20, 20, 30],
        'value32':      pd.Series([12, 12, 22, 22, 32], dtype='int32'),
        'res':          [10, 20, 30, 10, 20],
        'coord::point': [Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4)],
        'names':        pd.Series(['john', 'eric', 'judith', 'mila', 'hector'], dtype='string'),
        'unique':       True}
df = pd.DataFrame(data).set_index('index')
df.index.name = None
df

Unnamed: 0,dates::date,value,value32,res,coord::point,names,unique
100,1964-01-01,10,12,10,POINT (1 2),john,True
200,1985-02-05,10,12,20,POINT (3 4),eric,True
300,2022-01-21,20,22,30,POINT (5 6),judith,True
400,1964-01-01,20,22,10,POINT (7 8),mila,True
500,1985-02-05,30,32,20,POINT (3 4),hector,True


In [38]:
df_to_json = npd.to_json(df)
pprint(df_to_json, compact=True, width=120, sort_dicts=False)
print(npd.read_json(df_to_json).equals(df))

{':tab': {'index': [100, 200, 300, 400, 500],
          'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05'],
          'value': [10, 10, 20, 20, 30],
          'value32::int32': [12, 12, 22, 22, 32],
          'res': [10, 20, 30, 10, 20],
          'coord::point': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [3.0, 4.0]],
          'names::string': ['john', 'eric', 'judith', 'mila', 'hector'],
          'unique': True}}
True


In [32]:
df_to_table = npd.to_json(df, table=True)
pprint(df_to_table['data'][0], sort_dicts=False)
pprint(df_to_table['schema'], sort_dicts=False)
print(npd.read_json(df_to_table).equals(df))

{'index': 100,
 'dates': '1964-01-01',
 'value': 10,
 'value32': 12,
 'res': 10,
 'coord': [1.0, 2.0],
 'names': 'john',
 'unique': True}
{'fields': [{'name': 'index', 'type': 'integer'},
            {'name': 'dates', 'type': 'date'},
            {'name': 'value', 'type': 'integer'},
            {'name': 'value32', 'type': 'integer', 'format': 'int32'},
            {'name': 'res', 'type': 'integer'},
            {'name': 'coord', 'type': 'geopoint', 'format': 'array'},
            {'name': 'names', 'type': 'string'},
            {'name': 'unique', 'type': 'boolean'}],
 'primaryKey': ['index'],
 'pandas_version': '1.4.0'}
True
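
Comparing the two outputs above, the `:tab` representation stores one list per column (and a single value for the constant `unique` column), while the TableSchema `data` stores one record per row. The sketch below shows how the first form could be turned into the second. `tab_to_records` is a hypothetical helper working on a reduced copy of the data above, not the conversion used by `ntv_pandas`, and it assumes that any value that is not a list is a constant to repeat on every row.

```python
def tab_to_records(tab):
    """Turn a column-oriented ':tab' mapping into a list of row records."""
    # Drop the '::type' suffix from each column name
    columns = {key.split('::')[0]: values for key, values in tab.items()}
    # Number of rows, taken from the columns stored as lists
    n_rows = max(len(values) for values in columns.values()
                 if isinstance(values, list))
    # Repeat scalar values so that every column has one value per row
    full = {name: values if isinstance(values, list) else [values] * n_rows
            for name, values in columns.items()}
    return [dict(zip(full, row)) for row in zip(*full.values())]

tab = {'index': [100, 200],
       'dates::date': ['1964-01-01', '1985-02-05'],
       'value32::int32': [12, 12],
       'coord::point': [[1.0, 2.0], [3.0, 4.0]],
       'unique': True}
records = tab_to_records(tab)
print(records[0])
```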
