Incorrect skipping of lines with inline comments and printing warnings #16472

Closed
pankajp opened this Issue May 24, 2017 · 0 comments

pankajp commented May 24, 2017

Code Sample, a copy-pastable example if possible

from io import StringIO
import numpy as np
import pandas as pd

test_input = u"""\
1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line
# comment"""

df = pd.read_table(StringIO(test_input), comment='#', header=None,
                   delimiter='\\s+', skiprows=0, error_bad_lines=False)

print(df)
# Expected: only lines with <= 2 fields should appear in the df; the others should be skipped with a warning
expected = pd.DataFrame([[1, 2], [5, 2], [6, 2], [7, np.nan], [8, np.nan]],
                        index=list(range(5)), columns=[0, 1])
# DataFrame.equals treats NaNs in the same location as equal; an elementwise == never does
assert df.equals(expected)

Problem description

Only lines with at most 2 fields should appear in the DataFrame. All other lines should be skipped, with a warning printed to stderr for each one.

Output

Skipping line 2: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 6
Skipping line 6: expected 2 fields, saw 4

   0  1
0  1  2
1  7  8

Problems:

  • Lines that end with an inline comment and are skipped for having more fields than expected are never reported as skipped on stderr (lines 3-9)
  • When an inline comment follows a space, the line is counted as having one more field than it actually has, so it is incorrectly skipped or incorrectly kept (lines 3, 5, 7, 9)
  • The incorrect accounting also joined lines 7 and 8 into a single row, which was not expected.
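
To make the off-by-one concrete, here is a minimal sketch of the field counts the report expects for each input line. `count_fields` is a hypothetical helper written for this illustration, not a pandas function. It assumes the comment character and everything after it is dropped before the line is split on whitespace.

```python
# Hypothetical helper (not part of pandas): counts the fields a line
# should contribute once its inline comment has been removed.
def count_fields(line, comment='#'):
    data = line.split(comment, 1)[0]  # drop the comment marker and the rest of the line
    return len(data.split())          # split on runs of whitespace

lines = [
    "1 2",
    "2 2 3",
    "3 2 3 # 3 fields",
    "4 2 3# 3 fields",
    "5 2 # 2 fields",
    "6 2# 2 fields",
    "7 # 1 field, NaN",
    "8# 1 field, NaN",
    "9 2 3 # skipped line",
    "# comment",
]

counts = [count_fields(line) for line in lines]
for number, n in enumerate(counts, start=1):
    print("line %d: %d field(s)" % (number, n))
```

Under this counting, a space before `#` makes no difference: lines 3 and 4 both have 3 fields, and lines 5 and 6 both have 2.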

Expected Output

Skipping line 2: expected 2 fields, saw 3
Skipping line 3: expected 2 fields, saw 3
Skipping line 4: expected 2 fields, saw 3
Skipping line 9: expected 2 fields, saw 3

   0    1
0  1  2.0
1  5  2.0
2  6  2.0
3  7  NaN
4  8  NaN
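
The expected behaviour can be sketched as a small pure-Python reference parser. This is only an illustration of the semantics requested in this report, not how pandas implements parsing. `parse_reference` and its parameters are hypothetical names, the expected field count is assumed to be 2 (taken from the first line), and `None` stands in for NaN.

```python
import sys

def parse_reference(text, comment='#', expected=2):
    """Hypothetical reference parser; not pandas code."""
    rows, warnings = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        # Remove the inline comment first, then split on whitespace.
        fields = line.split(comment, 1)[0].split()
        if not fields:
            continue  # blank or comment-only line: dropped without a warning
        if len(fields) > expected:
            warnings.append("Skipping line %d: expected %d fields, saw %d"
                            % (number, expected, len(fields)))
            continue
        # Pad short rows so every row has `expected` entries.
        rows.append([int(f) for f in fields] + [None] * (expected - len(fields)))
    return rows, warnings

test_input = """\
1 2
2 2 3
3 2 3 # 3 fields
4 2 3# 3 fields
5 2 # 2 fields
6 2# 2 fields
7 # 1 field, NaN
8# 1 field, NaN
9 2 3 # skipped line
# comment"""

rows, warnings = parse_reference(test_input)
for message in warnings:
    sys.stderr.write(message + "\n")
for row in rows:
    print(row)
```

With this input the sketch warns about lines 2, 3, 4 and 9 and keeps the rows starting with 1, 5, 6, 7 and 8, matching the expected output above.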

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.14-200.fc25.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 34.3.3
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.19.0
statsmodels: 0.8.0
xarray: None
IPython: 5.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.7.3
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.1.6
pymysql: None
psycopg2: None
jinja2: 2.9.5
boto: None
pandas_datareader: 0.2.1

pankajp changed the title from Fix skipping lines with inline comments and printing warnings to Incorrect skipping of lines with inline comments and printing warnings May 24, 2017

jreback added this to the 0.20.2 milestone May 24, 2017
