Fails for multilevel columns pandas DataFrames #346

antoinecollet5 · 2021-01-22T22:01:06Z

jsonpickle does not support pandas DataFrames with multilevel (MultiIndex) columns.

Here is an example that fails. As you can see, a MultiIndex on the rows is supported, but multilevel columns are not.

import pandas as pd
import numpy as np
import jsonpickle as jp
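import jsonpickle.ext.pandas as jsonpickle_pd

# Register the pandas handlers so that jsonpickle uses them for DataFrames
jsonpickle_pd.register_handlers()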

### This works ###
midx = pd.MultiIndex(levels=[["zero", "one"], ["x", "y"]], 
                     codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
df = pd.DataFrame(np.random.randn(4, 2), index=midx, columns=['col1', 'col2'])
frozen = jp.encode(df)
thawed = jp.decode(frozen)

### This does not ###
iterables = [['inj', 'prod'], ['hourly', 'cum']]
names = ['first', 'second']
# build a two-level column MultiIndex from the product of the iterables
columns = pd.MultiIndex.from_product(iterables, names=names)
# build a DataFrame that uses it as multilevel columns
df2 = pd.DataFrame(np.random.randn(3, 4), index=["A", "B", "C"],
                   columns=columns)
frozen2 = jp.encode(df2)
thawed2 = jp.decode(frozen2)

Here is a modification of pandas.py that seems to fix the issue. I am new to git, so I still need to figure out how to make a clean pull request. I'll try to add an appropriate unit test.

def make_read_csv_params(meta):
    meta_dtypes = meta.get('dtypes', {})
    # The [None] default keeps this compatible with objects serialized
    # before column_level_names was introduced.
    column_level_names = meta.get('column_level_names', [None])
    # The header lists the CSV rows from which the column names are
    # retrieved (one row per column level).
    header = meta.get('header', [0])
    parse_dates = []
    converters = {}
    dtype = {}
    timedeltas = []
    for k, v in meta_dtypes.items():
        if v.startswith('datetime'):
            parse_dates.append(k)
        elif v.startswith('complex'):
            converters[k] = complex
        elif v.startswith('timedelta'):
            timedeltas.append(k)
            dtype[k] = 'object'
        else:
            dtype[k] = v

    return dict(dtype=dtype, header=header, parse_dates=parse_dates, 
                converters=converters), timedeltas, column_level_names


class PandasDfHandler(BaseHandler):
    pp = PandasProcessor()

    def flatten(self, obj, data):
        dtype = obj.dtypes.to_dict()

        meta = {'dtypes': {k: str(dtype[k]) for k in dtype}, 
                'index': encode(obj.index),
                'column_level_names': obj.columns.names,
                'header': list(range(len(obj.columns.names)))}

        data = self.pp.flatten_pandas(
            obj.reset_index(drop=True).to_csv(index=False), data, meta
        )
        return data

    def restore(self, data):
        csv, meta = self.pp.restore_pandas(data)
        params, timedeltas, column_level_names = make_read_csv_params(meta)

        df = (
            pd.read_csv(StringIO(csv), **params)
            if data['values'].strip()
            else pd.DataFrame()
        )
        for col in timedeltas:
            df[col] = pd.to_timedelta(df[col])
        df.set_index(decode(meta['index']), inplace=True)
        # restore the names of the column level(s)
        df.columns.names = column_level_names
        return df
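
For anyone who wants to check the idea without jsonpickle, here is a minimal pandas-only sketch of the CSV round-trip that the patch above appears to rely on: write the frame with `to_csv(index=False)`, read it back with one `header` row per column level, then reassign the level names. The variable names (`csv_text`, `restored`) are mine and are not part of pandas.py; this only illustrates the technique and is not the handler itself.

```python
from io import StringIO

import numpy as np
import pandas as pd

# A frame with two column levels, similar to df2 in the report above.
columns = pd.MultiIndex.from_product(
    [["inj", "prod"], ["hourly", "cum"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=columns)

# With index=False, to_csv writes one header row per column level.
csv_text = df.to_csv(index=False)

# Tell read_csv which rows hold the column labels: [0, 1] for two levels.
header = list(range(df.columns.nlevels))
restored = pd.read_csv(StringIO(csv_text), header=header)

# The level names are not stored in this CSV layout, so restore them by hand.
restored.columns.names = list(df.columns.names)
```

Reading the same text with the default `header=0` would treat the second header row as data, which is presumably why the patch stores `header` in the metadata.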

antoinecollet5 · 2021-01-22T22:29:41Z

I just opened a pull request.

ujson is needed in order to pass the pandas tests, so add it to the general "testing" section. We should work to eliminate this. ujson is now fully python3 compatible and does not need to be blocked on python3.8 anymore.
Related-to: jsonpickle#346 jsonpickle#347
Signed-off-by: David Aguilar <davvid@gmail.com>

The ujson module was narrowed down as the reason why the tests were passing on python2 and failing on newer python3 versions. Re-enable the multilevel columns test now that ujson is present.
Related-to: jsonpickle#346 jsonpickle#347
Signed-off-by: David Aguilar <davvid@gmail.com>

Flatten the dtypes meta dictionary before handing it off to the backend to ensure that special types, such as tuples in dicts, are handled properly.
Related-to: jsonpickle#346 jsonpickle#347
Signed-off-by: David Aguilar <davvid@gmail.com>
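
To illustrate why tuples in the dtypes dictionary can be a problem, here is a small sketch using only pandas and the standard-library `json` module. With multilevel columns, `df.dtypes.to_dict()` is keyed by tuples, and a plain JSON encoder rejects tuple keys. The key/value-pair workaround at the end is only an assumption made for illustration; it is not what the commit does, since the commit flattens the dictionary with jsonpickle itself.

```python
import json

import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["inj", "prod"], ["hourly", "cum"]])
df = pd.DataFrame(np.zeros((3, 4)), columns=columns)

# With multilevel columns, each key is a tuple such as ("inj", "hourly").
dtypes = {k: str(v) for k, v in df.dtypes.to_dict().items()}

# The standard json encoder only accepts str, int, float, bool or None keys.
try:
    json.dumps(dtypes)
    tuple_keys_ok = True
except TypeError:
    tuple_keys_ok = False

# Illustrative workaround (not the commit's approach): store [key, value]
# pairs and rebuild the tuple keys after decoding.
pairs = [[list(k), v] for k, v in dtypes.items()]
decoded = {tuple(k): v for k, v in json.loads(json.dumps(pairs))}
```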

davvid mentioned this issue Jan 31, 2021

Update pandas.py #347

Closed

davvid closed this as completed in 565c299 Jan 31, 2021
