Segfault in pd.read_csv() using chunksize parameter #11793
Here is my repro script:
```python
import pandas as pd
import sys

# Read the CSV given on the command line in 1000-row chunks
for df in pd.read_csv(sys.argv[1], chunksize=1000):
    print(df[['sum']].sum())
```
and I am attaching small.csv.gz as the smallest data set I know of that reproduces this segfault.
I tried my best to narrow it down. If you edit this file down to under 2000 lines, the segfault does not occur. Once it goes over 2000 lines, I start to see the segfault. Adding lines 1000 at a time, I notice the segfault is intermittent (I see it again at 6002 lines). It seems to me that if there is a multiple of
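The length-dependent behavior above can be probed systematically. The sketch below is a hypothetical diagnostic (not part of the original report): it builds an in-memory CSV of a given row count, with a `sum` column matching the repro script, and runs the same chunked read for lengths around the reported thresholds.

```python
import io
import pandas as pd

def probe(n_rows, chunksize=1000):
    """Build an n_rows-line CSV in memory and sum the 'sum' column in chunks."""
    buf = io.StringIO("sum\n" + "\n".join("1" for _ in range(n_rows)))
    total = 0
    for chunk in pd.read_csv(buf, chunksize=chunksize):
        total += int(chunk["sum"].sum())
    return total

# Lengths just below, at, and above the reported thresholds (2000 and 6002 rows)
for n in (1999, 2000, 2001, 6002):
    assert probe(n) == n
```

On an affected build, a crash (or a wrong total) at specific values of `n` would help pin down the relationship between file length and chunk size.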
I installed via
referenced this issue on Dec 11, 2015
I tried to look into this a little more. I think the segfault is occurring in
```cython
...
if na_filter:
    for i in range(lines):
        COLITER_NEXT(it, word)
        k = kh_get_str(na_hashset, word)  # in the hash table
        ...
```
I thought that, given the bug is OS X-only, maybe we ran into a compiler quirk with clang. I can't reproduce it with clang on Linux, though, on a recent checkout.
edit: The symptoms described by @OEP are exactly the same as in the mentioned issue: same invalid data returned by