## A TableSchema interface compatible with the pandas data structure 
-------     

### NTV-pandas
To provide a simple, compact and reversible solution, the interface uses the [JSON-NTV format (Named and Typed Value)](https://github.com/loco-philippe/NTV#readme), which integrates the notion of type, and its JSON-TAB variation for tabular data.    
This solution makes it possible to include all the types and formats defined in the TableSchema specification.

### Content
This notebook uses examples to present some key points.

*(links are active in Jupyter Notebook or nbviewer)*
- [1 - Simple example](#1---Simple-example)
- [2 - Series](#2---Series)
    - [Simple example](#Simple-example)
    - [Typed example](#Typed-example)
    - [Examples with a non-Pandas type](#Examples-with-a-non-Pandas-type)
    - [Categorical examples](#Categorical-examples)
- [3 - DataFrame](#3---DataFrame)
    - [Initial example](#Initial-example)
    - [Complete example](#Complete-example)
    - [Json data can be annotated](#Json-data-can-be-annotated)
    - [Categorical data can be included](#Categorical-data-can-be-included)
    - [Multidimensional data](#Multidimensional-data)
- [Appendix : Series tests](#Appendix-:-Series-tests)     
        
### References
- [JSON-NTV specification](https://loco-philippe.github.io/ES/JSON%20semantic%20format%20(JSON-NTV).htm)
- [JSON-TAB specification](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf)
- [JSON-NTV classes and methods](https://loco-philippe.github.io/NTV/json_ntv.html)

This Notebook can also be viewed at [nbviewer](http://nbviewer.org/github/loco-philippe/ntv-pandas/tree/main/example)

In [9]:
import math
import json
from pprint import pprint

import pandas as pd
import ntv_pandas as npd
from shapely.geometry import Point, Polygon
from json_ntv import Ntv
from datetime import date, datetime, time

## 1 - Simple example

- The example is a DataFrame with several dtypes

In [22]:
df = pd.DataFrame({
            'end february::date': [date(2023,2,28), date(2024,2,29), date(2025,2,28)],
            'coordinates::point': [Point([2.3, 48.9]), Point([5.4, 43.3]), Point([4.9, 45.8])],
            'contact::email':     ['john.doe@table.com', 'lisa.minelli@schema.com', 'walter.white@breaking.com']
            }).astype({'contact::email': 'string'})
df.dtypes


end february::date    object
coordinates::point    object
contact::email        string
dtype: object

- The example has a simple and compact JSON representation that includes the dtypes

In [23]:
df_to_table = npd.to_json(df, table=True)
pprint(df_to_table, width=140, sort_dicts=False)

{'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'end february', 'type': 'date', 'format': 'default'},
                       {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'},
                       {'name': 'contact', 'type': 'string', 'extDtype': 'string', 'format': 'email'}],
            'primaryKey': ['index'],
            'pandas_version': '1.4.0'},
 'data': [{'index': 0, 'end february': '2023-02-28', 'coordinates': [2.3, 48.9], 'contact': 'john.doe@table.com'},
          {'index': 1, 'end february': '2024-02-29', 'coordinates': [5.4, 43.3], 'contact': 'lisa.minelli@schema.com'},
          {'index': 2, 'end february': '2025-02-28', 'coordinates': [4.9, 45.8], 'contact': 'walter.white@breaking.com'}]}
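
In the output above, the part of each column name after `::` no longer appears in the field name; it is carried by the `type` and `format` entries of the schema instead. The sketch below, in plain Python, lists only the three correspondences visible in this output (the complete mapping is defined by `ntv_pandas` and is not reproduced here). The variable names are chosen for this illustration.

```python
# Schema fields copied from the output above
schema_fields = [
    {'name': 'index', 'type': 'integer'},
    {'name': 'end february', 'type': 'date', 'format': 'default'},
    {'name': 'coordinates', 'type': 'geopoint', 'format': 'array'},
    {'name': 'contact', 'type': 'string', 'extDtype': 'string', 'format': 'email'},
]
# Column names of the initial DataFrame
columns = ['end february::date', 'coordinates::point', 'contact::email']

by_name = {field['name']: field for field in schema_fields}
observed = {}
for column in columns:
    # 'name::type' -> name and NTV type
    name, _, ntv_type = column.partition('::')
    field = by_name[name]
    observed[ntv_type] = (field['type'], field.get('format'))

for ntv_type, (ts_type, ts_format) in observed.items():
    print(f'{ntv_type:6} -> type={ts_type}, format={ts_format}')
```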


- The JSON conversion is reversible: df_from_table equals the initial df

In [24]:
df_from_table = npd.read_json(df_to_table)
print('df created from TableSchema is equal to initial df ? ', df_from_table.equals(df))
df_from_table

df created from TableSchema is equal to initial df ?  True
end february::date    object
coordinates::point    object
contact::email        string
dtype: object


|   | end february::date | coordinates::point | contact::email |
|---|---|---|---|
| 0 | 2023-02-28 | POINT (2.3 48.9) | john.doe@table.com |
| 1 | 2024-02-29 | POINT (5.4 43.3) | lisa.minelli@schema.com |
| 2 | 2025-02-28 | POINT (4.9 45.8) | walter.white@breaking.com |


## 2 - Series

### Simple example

In [29]:
sr = pd.Series([1, 2, 3], name='value')
print('pandas object :\n' + str(sr))

json_table = npd.to_json(sr, table=True)
print('\nJson Table representation :    ')
pprint(json_table, width=120, sort_dicts=False)

print('\nIs Json translation reversible ? ', sr.equals(npd.read_json(json_table)))

pandas object :
0    1
1    2
2    3
Name: value, dtype: int64

Json Table representation :    
{'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'value', 'type': 'object', 'extDtype': 'Int64', 'format': 'default'}],
            'primaryKey': ['index'],
            'pandas_version': '1.4.0'},
 'data': [{'index': 0, 'value': 1}, {'index': 1, 'value': 2}, {'index': 2, 'value': 3}]}

Is Json translation reversible ?  True


### Typed example

In [35]:
sr = pd.Series([date(1964, 1, 1), date(1985, 1, 1), date(2022, 1, 1)], name='new year::date')
print('pandas object :\n' + str(sr))

json_table = npd.to_json(sr, table=True)
print('\nJson Table representation :    ')
pprint(json_table, width=120, sort_dicts=False)

print('\nIs Json translation reversible ? ', sr.equals(npd.read_json(json_table)))

pandas object :
0    1964-01-01
1    1985-01-01
2    2022-01-01
Name: new year::date, dtype: object

Json Table representation :    
{'schema': {'fields': [{'name': 'index', 'type': 'integer'}, {'name': 'new year', 'type': 'date', 'format': 'default'}],
            'primaryKey': ['index'],
            'pandas_version': '1.4.0'},
 'data': [{'index': 0, 'new year': '1964-01-01'},
          {'index': 1, 'new year': '1985-01-01'},
          {'index': 2, 'new year': '2022-01-01'}]}

Is Json translation reversible ?  True


### Examples with a non-Pandas type

In [None]:
field_data = {'dates::date': ['1964-01-01', '1985-02-05', '2022-01-21']}
sr = npd.read_json({':field': field_data})
# the pandas dtype conforms to the NTV type
print('pandas object :\n' + str(sr))
print('\nJson representation : \n    ', npd.to_json(sr))
print('\nis Json translation reversible ? ', sr.equals(npd.read_json(npd.to_json(sr))))
print('\nis pandas translation reversible ? ', json.dumps(npd.to_json(sr)) == json.dumps({':field': field_data}))

In [None]:
field_data = {'coord::point':    [[1,2], [3,4], [5,6]]}
sr = npd.read_json({':field': field_data})
# the pandas dtype conforms to the NTV type
print('pandas object :\n' + str(sr))
print('\nJson representation : \n    ', npd.to_json(sr))
print('\nis Json translation reversible ? ', sr.equals(npd.read_json(npd.to_json(sr))))

### Categorical examples
- available only with hashable data
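
In the categorical examples of this section, a field is written as a pair `[categories, codes]`. Assuming the codes are positional indexes into the list of categories (which is how the examples read), the full sequence can be rebuilt as sketched below. `expand_categorical` and `compress_categorical` are hypothetical names used for illustration, not `ntv_pandas` functions.

```python
def expand_categorical(categories, codes):
    """Rebuild the full list of values from a [categories, codes] pair."""
    return [categories[code] for code in codes]


def compress_categorical(values):
    """Inverse operation; values are used as dict keys, so they must be hashable."""
    index = {}
    codes = []
    for value in values:
        # A new value receives the next free code; a known value reuses its code
        codes.append(index.setdefault(value, len(index)))
    return list(index), codes


full = expand_categorical([1, 2], [0, 1, 1, 0])
print(full)
print(compress_categorical(full))
```

The dictionary lookup in `compress_categorical` is where the hashability requirement comes from.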

In [None]:
field_data = {"integer": [[1, 2], [0, 1, 1, 0]]}
sr = npd.read_json({':field': field_data})
# pandas dtype conform to Ntv type
print('pandas object :\n' + str(sr))
print('\nJson representation : \n    ', npd.to_json(sr))
print('\nis Json translation reversible ? ', sr.equals(npd.read_json(npd.to_json(sr))))
print('\nis pandas translation reversible ? ', json.dumps(npd.to_json(sr)) == json.dumps({':field': field_data}))

In [None]:
field_data = {'dates': [{'::date': ['1964-01-01', '1985-02-05', '2022-01-21']}, [0, 1, 0, 2]]}
sr = npd.read_json({':field': field_data})
# the pandas dtype conforms to the NTV type
print('pandas object :\n' + str(sr))
print('\nJson representation : \n    ', npd.to_json(sr))
print('\nis Json translation reversible ? ', sr.equals(npd.read_json(npd.to_json(sr))))

In [None]:
field_data = {'test_array': [{'::array': [[1,2], [3,4], [5,6]]}, [0, 1, 0, 2]]}
sr = npd.read_json({':field': field_data})
# the pandas dtype conforms to the NTV type
print('pandas object :\n' + str(sr))
print('\nJson representation : \n    ', npd.to_json(sr))
print('\nis Json translation reversible ? ', sr.equals(npd.read_json(npd.to_json(sr))))
print('\nis pandas translation reversible ? ', json.dumps(npd.to_json(sr)) == json.dumps({':field': field_data}))

## 3 - DataFrame

### Initial example

In [None]:
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

print('pandas dtype :\n' + str(df.dtypes))
print('\npandas object :\n' + str(df))
print('\nJson representation : \n    ', npd.to_json(df))
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df))))

### Complete example
- index data
- Pandas dtype (int32, bool, string)
- NTV type (date, point) -> object dtype
- unique data (one value shared by every row)

In [None]:
tab_data = {'index':           [100, 200, 300, 400, 500, 600],
            'dates::date':     ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], 
            'value':           [10, 10, 20, 20, 30, 30],
            'value32::int32':  [12, 12, 22, 22, 32, 32],
            'res':             [10, 20, 30, 10, 20, 30],
            'coord::point':    [[1,2], [3,4], [5,6], [7,8], [3,4], [5,6]],
            'names::string':   ['john', 'eric', 'judith', 'mila', 'hector', 'maria'],
            'unique':          True }
df = npd.read_json({':tab': tab_data})
print('pandas dtype :\n' + str(df.dtypes))
print('\npandas object :\n' + str(df))
print('\nJson representation :')
pprint(npd.to_json(df), width=140)
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df))))

### Json data can be annotated

In [None]:
tab_data = {'index':           [100, 200, 300, 400, 500, 600],
            'dates::date':     ['1964-01-01', '1985-02-05', '2022-01-21', '1964-01-01', '1985-02-05', '2022-01-21'], 
            'value':           [10, 10, 20, 20, {'valid?': 30}, 30],
            'value32::int32':  [12, 12, 22, 22, 32, 32],
            'res':             {'res1': 10, 'res2': 20, 'res3': 30, 'res4': 10, 'res5': 20, 'res6': 30},
            'coord::point':    [[1,2], [3,4], [5,6], [7,8], {'same as 2nd point': [3,4]}, [5,6]],
            'names::string':   ['john', 'eric', 'judith', 'mila', 'hector', 'maria'],
            'unique':          True }

df2 = npd.read_json({':tab': tab_data}, annotated=True)
print('is DataFrame identical ? ', df.equals(df2))
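
Judging from the example above, `annotated=True` lets a single value carry a name (`{'valid?': 30}`) or a whole field be written as a dictionary of named values, and those names are dropped when the DataFrame is built. A rough mental model in plain Python follows; the helper `strip_annotations` is hypothetical and is not how `ntv_pandas` implements it.

```python
def strip_annotations(field):
    """Return the bare values of an annotated JSON field (illustration only)."""
    if isinstance(field, dict):
        # A whole field given as named values: keep the values, in order
        return list(field.values())
    return [
        # A single-key dict wraps one annotated value
        next(iter(item.values())) if isinstance(item, dict) and len(item) == 1 else item
        for item in field
    ]


value = strip_annotations([10, 10, 20, 20, {'valid?': 30}, 30])
res = strip_annotations({'res1': 10, 'res2': 20, 'res3': 30, 'res4': 10, 'res5': 20, 'res6': 30})
coord = strip_annotations([[1, 2], [3, 4], [5, 6], [7, 8], {'same as 2nd point': [3, 4]}, [5, 6]])
print(value, res, coord, sep='\n')
```

This simple model would also unwrap genuine single-key objects, so it only fits fields whose values are not dictionaries themselves.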

### Categorical data can be included

In [None]:
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

print('pandas dtype :\n' + str(df.dtypes))
print('\npandas object :\n' + str(df))
print('\nJson representation : \n    ', npd.to_json(df))
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df))))

In [None]:
tab_data = {'index':           [100, 200, 300, 400, 500, 600],
            'dates':           [{'::date': ['1964-01-01', '1985-02-05', '2022-01-21']}, [0, 1, 2, 0, 1, 2]],
            'value':           [[10, 20, {'valid?': 30}], [0, 0, 1, 1, 2, 2]],
            'value32::int32':  [12, 12, 22, 22, 32, 32],
            'res':             {'res1': 10, 'res2': 20, 'res3': 30, 'res4': 10, 'res5': 20, 'res6': 30},
            'coord::point':    [[1,2], [3,4], [5,6], [7,8], {'same as 2nd point': [3,4]}, [5,6]],
            'names::string':   ['john', 'eric', 'judith', 'mila', 'hector', 'maria'],
            'unique:boolean':  True }

df = npd.read_json({':tab': tab_data}, annotated=True)
print('pandas dtype :\n' + str(df.dtypes))
print('\npandas object :\n' + str(df))
print('\nJson representation :')
pprint(npd.to_json(df), width=140)
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df))))

In [None]:
index   = pd.Series([100, 200, 300, 400, 500, 600])
dates   = pd.Series(name='dates::date',  data=[date(1964, 1, 1), date(1985, 2, 5), date(2022, 1, 21), date(1964, 1, 1),
                                               date(1985, 2, 5), date(2022, 1, 21)], dtype='object').astype('category')
value   = pd.Series(name='value',        data=[10,10,20,20,30,30], dtype='Int64').astype('category') #alias mandatory 
value32 = pd.Series(name='value32',      data=[12, 12, 22, 22, 32, 32], dtype='int32')
coord   = pd.Series(name='coord::point', data=[Point(1,2), Point(3,4), Point(5,6), Point(7,8), Point(3,4), Point(5,6)])
names   = pd.Series(name='names',        data=['john', 'eric', 'judith', 'mila', 'hector', 'maria'], dtype='string')
unique  = pd.Series(name='unique',       data=[True, True, True, True, True, True])

df = pd.DataFrame({ser.name: ser for ser in [index, dates, value, value32, coord, names, unique]}).set_index(None)

print('pandas dtype :\n' + str(df.dtypes))
print('\npandas object :\n' + str(df))
print('\nJson representation :')
pprint(npd.to_json(df), width=140)
print('\nis Json translation reversible ? ', df.equals(npd.read_json(npd.to_json(df))))

### Multidimensional data
- The JSON-TAB format also applies to multidimensional data
- Multidimensional JSON data can be translated into a pandas DataFrame or an xarray DataArray

In [None]:
data = {"quantity": ["1 kg", "1 kg", "1 kg", "1 kg", "10 kg", "10 kg", "10 kg", "10 kg"],
        "product": ["banana", "orange", "apple", "peppers", "banana", "orange", "apple", "peppers"], 
        "plants": ["fruit", "fruit", "fruit", "vegetable", "fruit", "fruit", "fruit", "vegetable"], 
        "price": [0.5, 2, 1, 1.5, 5, 20, 10, 15]}

df  = pd.DataFrame(data)
df2 = pd.DataFrame(data, dtype='category').sort_values(by=['quantity', 'product'])
df2
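
One way to see why the data above is multidimensional: `quantity` and `product` together take every combination exactly once (a 2 x 4 grid), while `plants` depends only on `product`. The check below is a plain-Python sketch written for this illustration; it does not use `ntv_pandas` or xarray.

```python
from itertools import product as cartesian

data = {"quantity": ["1 kg", "1 kg", "1 kg", "1 kg", "10 kg", "10 kg", "10 kg", "10 kg"],
        "product": ["banana", "orange", "apple", "peppers", "banana", "orange", "apple", "peppers"],
        "plants": ["fruit", "fruit", "fruit", "vegetable", "fruit", "fruit", "fruit", "vegetable"],
        "price": [0.5, 2, 1, 1.5, 5, 20, 10, 15]}

# Every (quantity, product) pair appears exactly once: the two fields form a grid
pairs = list(zip(data["quantity"], data["product"]))
axes = [sorted(set(data["quantity"])), sorted(set(data["product"]))]
is_full_grid = len(pairs) == len(set(pairs)) and set(pairs) == set(cartesian(*axes))

# 'plants' is fully determined by 'product'
plants_by_product = {}
is_derived = all(plants_by_product.setdefault(prod, plant) == plant
                 for prod, plant in zip(data["product"], data["plants"]))

# 'price' laid out along the two dimensions
grid = {}
for quantity, prod, price in zip(data["quantity"], data["product"], data["price"]):
    grid.setdefault(quantity, {})[prod] = price

print(is_full_grid, is_derived)
print(grid)
```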

In [None]:
json_df = Ntv.obj(df).to_obj()[':tab']
print('json_df is the JSON-TAB format with "full" mode\n')
pprint(json_df, width=200)

json_xar = Ntv.obj(df2).to_obj()[':tab']
print('\njson_xar is the JSON-TAB format with "optimize" mode\n')
pprint(json_xar, width=200)

df_from_xar = Ntv.obj({':tab': json_xar}).to_obj(format='obj').sort_index()
print('\nAre the DataFrames from the two JSON-TAB formats identical ? ', df.astype('object').equals(df_from_xar.astype('object')))

print('\nThe "optimize" JSON-TAB format mirrors the structure of an xarray DataArray')
from observation import Sdataset
Sdataset.ntv(json_df).setcanonorder().to_xarray(varname='price')

## Appendix : Series tests

In [None]:
# json interface ok
srs = [
       # without ntv_type, without dtype
       pd.Series([{'a': 2, 'e':4}, {'a': 3, 'e':5}, {'a': 4, 'e':6}]),  
       pd.Series([[1,2], [3,4], [5,6]]),  
       pd.Series([[1,2], [3,4], {'a': 3, 'e':5}]),  
       pd.Series([True, False, True]),
       pd.Series(['az', 'er', 'cd']),
       pd.Series(['az', 'az', 'az']),
       pd.Series([1,2,3]),
       pd.Series([1.1,2,3]),
       
       # without ntv_type, with dtype
       pd.Series([10,20,30], dtype='Int64'),
       pd.Series([True, False, True], dtype='boolean'),
       pd.Series([1.1, 2, 3], dtype='float64'), 

       # with ntv_type only in json data (not numbers)
       pd.Series([pd.NaT, pd.NaT, pd.NaT]),
       pd.Series([datetime(2022, 1, 1), datetime(2022, 1, 2)], dtype='datetime64[ns]'),
       pd.Series(pd.to_timedelta(['1D', '2D'])),
       pd.Series(['az', 'er', 'cd'], dtype='string'), 

       # with ntv_type only in json data (numbers)
       pd.Series([1,2,3], dtype='Int32'), 
       pd.Series([1,2,3], dtype='UInt64'),
       pd.Series([1,2,3], dtype='float32'),

       # with ntv_type in Series name and in json data (numbers)
       pd.Series([1,2,3], name='::int64'),
       pd.Series([1,2,3], dtype='Float64', name='::float64'), # force dtype in the JSON conversion

       # with ntv_type in Series name and in json data (not numbers)
       pd.Series([[1,2], [3,4], [5,6]], name='::array'),  
       pd.Series([{'a': 2, 'e':4}, {'a': 3, 'e':5}, {'a': 4, 'e':6}], name='::object'),  
       pd.Series([None, None, None], name='::null'), 
       pd.Series(["geo:13.412 ,103.866", "mailto:John.Doe@example.com"], name='::uri', dtype='string'),
       pd.Series(["///path/to/file", "//host.example.com/path/to/file"], name='::file', dtype='string'),

       # with ntv_type converted in object dtype (not in datetime)
       pd.Series([date(2022, 1, 1), date(2022, 1, 2)], name='::date'),
       pd.Series([time(10, 21, 1),  time(8, 1, 2)],    name='::time'),

       # with ntv_type unknown in pandas and with pandas conversion               
       pd.Series([1,2,3], dtype='int64', name='::day'),
       pd.Series([2001,2002,2003], dtype='int64', name='::year'),
       pd.Series([21,10,55], name='::minute'),

       # with ntv_type unknown in pandas and NTV conversion
       pd.Series([Point(1, 0), Point(1, 1), Point(1, 2)], name='::point'),
]
for sr in srs:
    print(npd.as_def_type(sr).equals(npd.read_json(npd.to_json(sr))), 
          npd.read_json(npd.to_json(sr)).name == sr.name, 
          npd.to_json(sr))  

In [None]:
# json interface ok
for a in [{'test::int32': [1,2,3]},
          {'test': [1,2,3]},
          [1.0, 2.1, 3.0],
          ['er', 'et', 'ez'],
          [True, False, True],
          {'::boolean': [True, False, True]},
          {'::string': ['er', 'et', 'ez']},
          {'test::float32': [1.0, 2.5, 3.0]},
          {'::int64': [1,2,3]},
          {'::datetime': ["2021-12-31T23:00:00.000","2022-01-01T23:00:00.000"] },
          {'::date': ["2021-12-31", "2022-01-01"] },
          {'::time': ["23:00:00", "23:01:00"] },
          {'::object': [{'a': 3, 'e':5}, {'a': 4, 'e':6}]},
          {'::array': [[1,2], [3,4], [5,6]]},
          True,
          {':boolean': True}
         ]:
    field = {':field': a}
    print(npd.to_json(npd.read_json(field)) == field, field)

In [None]:
# json interface ok (categorical data)
for a in [{'test': [{'::int32': [1, 2, 3]}, [0,1,2,0,1]]},
          {'test': [[1, 2, 3], [0,1,2,0,1]]},
          [[1.0, 2.1, 3.0], [0,1,2,0,1]],
          [['er', 'et', 'ez'], [0,1,2,0,1]],
          [[True, False], [0,1,0,1,0]],
          [{'::string': ['er', 'et', 'ez']}, [0,1,2,0,1]],
          {'test':[{'::float32': [1.0, 2.5, 3.0]}, [0,1,2,0,1]]},
          [{'::int64': [1, 2, 3]}, [0,1,2,0,1]],
          [{'::datetime': ["2021-12-31T23:00:00.000", "2022-01-01T23:00:00.000"] }, [0,1,0,1,0]],
          [{'::date': ["2021-12-31", "2022-01-01"] }, [0,1,0,1,0]],
          [{'::time': ["23:00:00", "23:01:00"] }, [0,1,0,1,0]],
          {'test_date': [{'::datetime': ["2021-12-31T23:00:00.000", "2022-01-01T23:00:00.000"] }, [0,1,0,1,0]]},
          [{'::boolean': [True, False]}, [0,1,0,1,0]],
          [[True], [2]], # periodic Series
          {'quantity': [['1 kg', '10 kg'], [4]]}]:  # periodic Series
    field = {':field': a}
    print(npd.to_json(npd.read_json(field)) == field, field)

In [None]:
# json interface not ok
srs = [# without ntv_type
       pd.Series([math.nan, math.nan]), # pandas JSON conversion bug: datetime NaT
       
       # without ntv_type, with dtype
       pd.Series([math.nan, math.nan], dtype='float64'), # pandas JSON conversion bug: datetime NaT
    
       # with ntv_type in Series name and in json data
       pd.Series([1,2,3], dtype='UInt64', name='::uint64'),   # name is unnecessary
       
       # with ntv_type unknown in pandas
       pd.Series([datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)], dtype='datetime64[ns, UTC]'), # to be handled
]
for sr in srs:
    print(npd.as_def_type(sr).equals(npd.read_json(npd.to_json(sr))), 
          npd.read_json(npd.to_json(sr)).name == sr.name, 
          npd.to_json(sr, text=True))  

In [None]:
# json interface not ok (categorical data)
for a in [{'test_array': [{'::array': [[1,2], [3,4], [5,6], [7,8]]}, [0, 1, 0, 2, 3]]}]: # list -> tuple to be hashable
    field = {':field': a}
    print(npd.to_json(npd.read_json(field)) == field, field)
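
The failing case above uses lists as categories. Python lists are not hashable, whereas tuples are, which is what the `list -> tuple` comment refers to. A plain-Python illustration, independent of `ntv_pandas`:

```python
categories = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Lists cannot be hashed, so they cannot serve as dictionary keys
try:
    hash(categories[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples holding the same values are hashable
as_tuples = [tuple(category) for category in categories]
lookup = {category: code for code, category in enumerate(as_tuples)}

print(list_is_hashable)
print(lookup[(3, 4)])
```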