Skip to content

pd.read_json ignoring encoding='utf-8-sig' #20598

@xtofs

Description

@xtofs

I work with a byte stream (from Azure DataLake https://pypi.python.org/pypi/azure-datalake-store/0.0.19 which only supports byte stream) that has a UTF-8 byte order mark, and want to read it into a data frame.

pandas.read_json fails.
For comparison, pd.read_csv(file, lines=True, encoding='utf-8-sig') works fine with a similar file

import pandas as pd

def skip_utf_8_bom(file):
    bom = file.read(3)
    # print(bom)
    if bom != b'\xef\xbb\xbf': # undo read
        file.seek(len(bom), 1) 

path = 'sample-utf-8-sig.txt'

#works
with open(path, 'rb') as file:   
    skip_utf_8_bom(file)
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')    
df

#fails
with open(path, 'rb') as file: 
    df = pd.read_json(file, lines=True, encoding='utf-8-sig')    
df

Problem description

pd.read_json seems not to be able to process the encoding='utf-8-sig' parameter.
Expected behavior is that it allows to work with byte streams with an utf-8 byte order mark

sample-utf-8-sig.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    Duplicate ReportDuplicate issue or pull requestIO JSONread_json, to_json, json_normalize

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions