BUG: read_csv Internal Error #4991

kogolobo · 2022-09-17T19:45:57Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

(Interactive session)
>>> import modin.pandas as pd
>>> data = pd.read_csv('./file.csv',  sep=',', index_col=False)  
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init()

ParserWarning: Length of header or names does not match length of data. This leads to a loss of data with index_col=False.
>>> data
(Error happens)

Issue Description

When I read a file (below) whose header does not name every column, and pass index_col=False to turn the index column off, I cannot access the data because of the error below.

CSV file:

id, xmin, ymin, xmax, ymax, area, skinArea, x Centroid, y Centroid, forMVx, forMVy, forL1, backMVx, backMVy, backL1
     1, 378,  70, 401,  79,     142,      0,  388,   75,   1,   0, 28.72,   0,   0, 60.00,   14 
     2, 346,  95, 362, 103,      78,      0,  354,   98,   1,   0, 28.62,   0,   0, 60.00,   19 
     3, 374,  96, 388, 109,      77,      0,  380,  103,   0,   0, 25.58,   0,   0, 60.00,   30 
     4, 149, 111, 155, 156,     231,      0,  153,  133,   0,   0, 21.06,   0,   0, 60.00,   40 
     5, 225, 114, 346, 208,    6752,      0,  277,  152,   3,   3, 21.59,   0,   0, 60.00,   44   68  389  418  546  569 
     6, 405, 146, 413, 160,      69,      0,  409,  153,   0,   0, 20.09,   0,   0, 60.00,   52 
     7,  62, 160,  88, 167,     123,      0,   76,  163,  -1,   0, 17.25,   0,   0, 60.00,   58 
     8, 412, 181, 420, 188,      44,      0,  416,  184,   0,   0, 25.50,   0,   0, 60.00,   69 
     9, 311, 195, 326, 202,      70,      0,  318,  199,   5,   0, 30.49,   0,   0, 60.00,   83 
    10, 197, 221, 206, 238,      99,      0,  202,  230,   0,   0, 19.58,   0,   0, 60.00,  111 
    11,  49, 282,  61, 292,      79,      0,   56,  287,   0,   1, 20.71,   0,   0, 60.00,  161 
    12, 201, 331, 221, 335,      71,      0,  211,  333,   0,   0, 25.94,   0,   0, 60.00,  198 
    13, 710, 481, 719, 487,      47,      0,  715,  484,   0,   0, 17.45,   0,   0, 60.00,  290 
    14, 668, 511, 686, 518,     103,      0,  677,  514,   2,   0, 21.03,   0,   0, 60.00,  295 
    15, 283,   1, 295,  31,     346,      0,  289,   15,   0,   2, 22.92,   0,   0, 60.00,  305 
    16, 274,  40, 301,  88,     707,      0,  286,   63,   0,   1, 21.77,   0,   0, 60.00,  320  329 
    17, 383,  60, 393,  71,      57,      0,  389,   66,   1,   0, 47.44,   0,   0, 60.00,  321 
    18, 297,  87, 304,  95,      46,      0,  300,   91,   1,   0, 30.70,   0,   0, 60.00,  344 
    19, 279,  79, 306, 113,     616,      0,  290,   95,   1,   1, 24.10,   0,   0, 60.00,  346  379 
    20, 434,  92, 446, 103,      77,      0,  441,   99,   0,   0, 24.36,   0,   0, 60.00,  364 
    21, 264,  88, 280, 106,     161,      0,  274,   95,   1,   0, 26.31,   0,   0, 60.00,  366 
    22, 257, 104, 305, 127,     544,      0,  280,  116,   2,   2, 28.64,   0,   0, 60.00,  377  378  397  410 
    23, 298, 107, 315, 127,     153,      0,  306,  117,   0,   0, 32.88,   0,   0, 60.00,  398 
    24, 343, 134, 352, 142,      48,      0,  346,  139,   1,   0, 24.17,   0,   0, 60.00,  438 
    25,   2, 159,  96, 173,     726,      0,   48,  166,   0,   0, 17.03,   0,   0, 60.00,  487 
    26, 318, 148, 367, 179,     892,      0,  346,  163,   2,   0, 20.25,   0,   0, 60.00,  491  496  503  511  513 
    27, 294, 168, 342, 197,     962,      0,  316,  182,   3,   0, 27.87,   0,   0, 60.00,  522  543  561 
    28, 273, 178, 279, 197,      73,      0,  275,  188,   4,   4, 37.10,   0,   0, 60.00,  524 
    29, 280, 179, 289, 191,      79,      0,  284,  185,   1,   0, 50.35,   0,   0, 60.00,  527 
    30, 284, 174, 296, 184,      88,      0,  291,  179,   1,   0, 45.61,   0,   0, 60.00,  530 
    31,  95, 192, 121, 207,     286,      0,  110,  200,   0,   0, 19.36,   0,   0, 60.00,  588 
    32, 198, 190, 202, 213,      79,      0,  200,  203,   0,   1, 20.13,   0,   0, 60.00,  596 
    33, 212, 245, 226, 271,     164,      0,  222,  262,   0,   0, 22.90,   0,   0, 60.00,  692 
    34, 281, 255, 295, 261,      51,      0,  289,  258,   0,   0, 17.06,   0,   0, 60.00,  695 
    35, 545, 249, 555, 257,      60,      0,  551,  253,   0,   0, 22.33,   0,   0, 60.00,  697 
    36, 292, 256, 307, 267,     114,      0,  300,  262,   0,   1, 17.82,   0,   0, 60.00,  707 
    37, 375, 253, 385, 265,      99,      0,  380,  260,   0,   0, 24.36,   0,   0, 60.00,  713 
    38,  31, 283,  48, 288,      63,      0,   39,  286,   0,   0, 33.33,   0,   0, 60.00,  760 
    39, 662, 282, 666, 303,      70,      0,  663,  292,   0,   0, 16.60,   0,   0, 60.00,  767 
    40,  57, 294,  66, 301,      58,      0,   61,  297,   0,   0, 21.28,   0,   0, 60.00,  792 
    41, 165, 293, 178, 298,      53,      0,  171,  296,   0,   0, 18.52,   0,   0, 60.00,  793 
    42, 343, 308, 350, 324,      65,      0,  347,  318,   0,   0, 17.38,   0,   0, 60.00,  860 
    43, 562, 315, 571, 324,      61,      0,  567,  319,   0,   1, 27.57,   0,   0, 60.00,  865 
    44, 178, 333, 208, 338,     104,      0,  193,  336,   1,   0, 26.94,   0,   0, 60.00,  899 
    45, 268, 406, 288, 411,      96,      0,  277,  408,   0,   0, 19.23,   0,   0, 60.00,  1042 
    46,  99, 404, 131, 412,     173,      0,  115,  409,   1,   0, 12.55,   0,   0, 60.00,  1044 
    47, 492, 405, 576, 416,     431,      0,  536,  410,   1,   0, 26.74,   0,   0, 60.00,  1051  1059 
    48, 213, 424, 232, 429,      78,      0,  223,  426,   1,   0, 15.69,   0,   0, 60.00,  1078 
    49, 708, 432, 719, 439,      52,      0,  713,  435,   1,   0, 11.80,   0,   0, 60.00,  1084 
    50, 633, 453, 644, 460,      58,      0,  638,  457,   0,   0, 23.52,   0,   0, 60.00,  1118
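
The mismatch in the file above (15 names in the header, 16 comma-separated fields in each data row, with the trailing numbers separated only by spaces) can be confirmed without any dataframe library. The following is a minimal sketch using only the standard library; `SAMPLE` is a hypothetical three-column miniature of the file and `field_count_mismatches` is a helper name invented for illustration, not part of pandas or Modin.

```python
import csv
import io

# Hypothetical miniature of the file above: 3 header names, 4 fields per data row.
# The last row keeps its trailing values space-separated, as in the real file.
SAMPLE = """id, xmin, ymin
1, 378, 70, 14
2, 346, 95, 19
3, 374, 96, 30  68
"""


def field_count_mismatches(text, sep=","):
    """Return the header width and a list of (line_number, width) pairs
    for every non-empty data row whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    mismatches = [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=2)  # the header is line 1
        if row and len(row) != len(header)
    ]
    return len(header), mismatches


header_width, mismatches = field_count_mismatches(SAMPLE)
print("header fields:", header_width)
print("rows with a different field count:", mismatches)
```

Running the same helper on the real file's text would show whether every row carries the extra field or only some of them.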

Expected Behavior

Vanilla pandas lets me access the data: it drops the trailing data column that has no header name. It would be great if Modin matched this behavior.
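
A small sketch of the vanilla pandas behavior described above, assuming a hypothetical in-memory sample (`SAMPLE`) rather than the original file: the header names three columns, each row has four fields, and `index_col=False` makes pandas keep only the named columns. The `ParserWarning` quoted in the reproducible example is silenced here only to keep the output short.

```python
import io
import warnings

import pandas as pd  # vanilla pandas, not modin.pandas

# Hypothetical miniature of the reported file: 3 header names, 4 fields per row.
SAMPLE = """id,xmin,ymin
1,378,70,14
2,346,95,19
3,374,96,30
"""

with warnings.catch_warnings():
    # pandas warns that data is lost when header and data lengths differ.
    warnings.simplefilter("ignore", pd.errors.ParserWarning)
    df = pd.read_csv(io.StringIO(SAMPLE), sep=",", index_col=False)

# Only the named columns survive; the unnamed fourth field is dropped.
print(list(df.columns))
print(df.shape)
```

Without `index_col=False`, pandas would instead treat the first field of each row as the index, which is why the flag matters for files of this shape.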

Error Logs

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\pandas\dataframe.py", line 216, in __repr__
    result = repr(self._build_repr_df(num_rows, num_cols))
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\pandas\base.py", line 203, in _build_repr_df
    return self.iloc[indexer]._query_compiler.to_pandas()
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\core\storage_formats\pandas\query_compiler.py", line 259, in to_pandas
    return self._modin_frame.to_pandas()
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\logging\logger_metaclass.py", line 68, in log_wrap
    return method(*args, **kwargs)
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 115, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\core\dataframe\pandas\dataframe\dataframe.py", line 2841, in to_pandas
    ErrorMessage.catch_bugs_and_request_email(
  File "D:\Users\kogolobo\Miniconda3\envs\cv2\lib\site-packages\modin\error_message.py", line 70, in catch_bugs_and_request_email
    raise Exception(
Exception: Internal Error. Please visit https://github.com/modin-project/modin/issues to file an issue with the traceback and the command that caused this error. If you can't file a GitHub issue, please email bug_reports@modin.org.
Internal and external indices on axis 0 do not match.

Installed Versions

pandas : 1.4.4
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 63.4.1
pip : 22.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

vnlitvinov · 2022-09-19T10:04:28Z

@kogolobo thanks for reporting! I can see the issue on current master using your data and script.

I believe it's related to #2845 (not exactly a duplicate, but most likely caused by the same underlying assumptions not holding for this data).

kogolobo added the "bug 🦗" (Something isn't working) and "Triage 🩹" (Issues that need triage) labels on Sep 17, 2022

vnlitvinov added the "P2" (Minor bugs or low-priority feature requests) label and removed the "Triage 🩹" (Issues that need triage) label on Sep 19, 2022

anmyachev added the "External" (Pull requests and issues from people who do not regularly contribute to modin) label on Apr 19, 2023
