to_parquet can't handle mixed type columns #21228
Comments
Using the latest
The dtypes that are returned by Pandas are not as detailed as those supported and used by Parquet. For example, Pandas has only the very generic type `object`.
It is available for Linux only. I'll try to experiment on a Linux server, but it may take some time.
Why does the following code work then?

```python
import pandas as pd

data = pd.read_csv('pandas_example.csv', dtype={'A': 'int32', 'B': 'object'})
data.to_parquet('example.parquet')
```
@xhochy

```python
import pandas as pd

data = pd.read_excel('pandas_example.xlsx', sheet_name=0)
data.to_parquet('example.parquet')
```

still gives the same error.
@Ingvar-Y Finally I had some time to look at the data. The problem here is that the column contains partly strings and partly integer values. What would be the expected type when writing this column? Note that Arrow and Pandas columns can only hold a single type.
@xhochy It is a string type column that unfortunately has a lot of integer-like values, but the expected type is definitely string. IMHO, there should be an option to write a column with a string type even if all the values inside are integers, for example to maintain consistency of column types among multiple files. That is not the case for my example, though: column B can't have integer type.
re "you have partly strings, partly integer values. What would be the expected type when writing this column?"
We could have some mechanism to indicate "this column should have a string type in the final parquet file", like we have a
So unless that is something arrow would want to change (but personally I would not do that), this would not help for the specific example case in this issue. We could of course still do a conversion on the pandas side, but that would need rather custom logic (and a user can do such a conversion themselves).
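The pandas-side conversion mentioned above can be sketched as follows (the frame and column names are invented for illustration; this is one possible user-side fix, not an official pandas mechanism):

```python
import pandas as pd

# Hypothetical example: column "B" mixes ints and strings, so pandas
# stores it as dtype "object" and the parquet engines cannot pick one type.
df = pd.DataFrame({"A": [1, 2, 3], "B": [5, "text1", "text2"]})

# Force the column to str before writing, so every value has a single,
# unambiguous type.
df["B"] = df["B"].astype(str)

# Now df.to_parquet("example.parquet") should succeed with either engine,
# since the column no longer mixes Python types.
```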
In my case, I had read in multiple csv's and concatenated them, so in that case at least it may be more an issue with how the column types were inferred when the files were read.
IMHO we should close this, since it gives people the wrong impression that parquet "can't handle mixed type columns", e.g. "hey, they have an open issue with this title" (without a clear resolution at the end of the thread). As @jorisvandenbossche mentioned, the OP's problem is type inference when doing `read_excel`.
Agree here; closing as a usage issue.
I know this is a closed issue, but in case someone looks for a patch, here is what worked for me:
I needed this as I was dealing with a large dataframe (coming from openfoodfacts: https://world.openfoodfacts.org/data ), containing 1M lines and 177 columns of various types, and I simply could not manually cast each column. |
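For a wide frame like that, one possible sketch (column names invented) is to cast every `object`-dtyped column to `str` in a single pass instead of enumerating hundreds of columns by hand:

```python
import pandas as pd

# Hypothetical small frame standing in for the 177-column dataset above.
df = pd.DataFrame({
    "code": [123, "abc", 456],     # mixed ints and strings
    "name": ["foo", "bar", None],  # strings with a missing value
    "qty":  [1, 2, 3],             # clean integer column, left untouched
})

# Select all object-dtyped columns and cast them to str in one pass.
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)
```

Note that `astype(str)` also stringifies missing values (`None` becomes the literal string `'None'`), which may or may not be acceptable depending on how the nulls should round-trip.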
@titsitits you might want to have a look at |
I realize that this has been closed for a while now, but as I'm revisiting this error, I wanted to share a possible hack around it (not that it's an ideal approach), as @catawbasam mentioned:

I cast all my categorical columns into 'str' before writing as parquet (instead of specifying each column by name, which can get cumbersome for 500 columns). When I load it back into pandas, the type of the str column is `object`.

Edit: If you happen to hit an error with NA's being hardcoded into 'None' after you convert your object columns into str, make sure to convert these NA's into np.nan before converting into str (stackoverflow link)
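A NA-preserving variant of that conversion can be sketched like this (the helper name is invented): convert only the non-null values to `str` and keep missing entries as real NaN, so they are written as parquet nulls rather than the literal strings `'None'` or `'nan'`:

```python
import numpy as np
import pandas as pd

def stringify_keep_na(series: pd.Series) -> pd.Series:
    """Convert non-null values to str but keep missing values as NaN,
    so they round-trip as real nulls instead of 'None'/'nan' strings."""
    return series.map(lambda v: str(v) if pd.notna(v) else np.nan)

# Hypothetical mixed column with a missing value.
s = pd.Series([1, "a", None], dtype="object")
clean = stringify_keep_na(s)
# clean now holds ['1', 'a', NaN] and stays dtype "object"
```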
I solved this by:

First, find out the mixed type columns and convert them to string. Then find out the list type columns and convert them to string as well, otherwise you may get an error.

Reference: https://stackoverflow.com/questions/29376026/whats-a-good-strategy-to-find-mixed-types-in-pandas-columns
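The detection step can be sketched as follows (data invented), following the strategy from the linked Stack Overflow question: map each cell to its Python type and flag any column whose cells span more than one type:

```python
import pandas as pd

df = pd.DataFrame({
    "mixed": [1, "a", 2.5],    # three different Python types
    "clean": ["x", "y", "z"],  # uniformly str
    "nums":  [1, 2, 3],        # uniformly int
})

# A column is "mixed" if its cells span more than one Python type.
mixed_cols = [c for c in df.columns
              if df[c].map(type).nunique() > 1]

# Convert only the offending columns to string.
df[mixed_cols] = df[mixed_cols].astype(str)
```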
I know this issue is closed, but I found a quick fix: pass `low_memory=False` to `read_csv` when loading the data (it is a `read_csv` argument, not a `to_parquet` one). Reading the whole file in one pass avoids the chunk-by-chunk type inference that otherwise triggers the mixed-types warning.
I want to state clearly that this is not a problem for the
Code Sample, a copy-pastable example if possible
pandas_example.xlsx
Problem description

`to_parquet` tries to convert an `object` column to `int64`. This happens when using either engine but is clearly seen when using `data.to_parquet('example.parquet', engine='fastparquet')`.

You can see that it is a mixed type column issue if you use `to_csv` and `read_csv` to load the data from a csv file instead: you get a mixed-types `DtypeWarning` on import. Specifying the `dtype` option solves the issue, but it isn't convenient that there is no way to set column types after loading the data. It is also strange that `to_parquet` tries to infer column types instead of using the dtypes stated in `.dtypes` or `.info()`.
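To make the report concrete, here is a minimal reconstruction (the data is invented, standing in for the attached spreadsheet): a column mixing ints and strings is reported by pandas simply as `object`, and that single label, not the per-value types, is all the parquet engines have to work from.

```python
import pandas as pd

# Invented stand-in for the attached file: column B mixes ints and strings.
df = pd.DataFrame({"A": [1, 2, 3], "B": [5, "text1", "text2"]})

print(df.dtypes)
# A     int64
# B    object
# .dtypes only says "object" for B; the underlying values are a mix of
# int and str, which is what the parquet engine trips over when it
# samples the data to pick a physical type.

# Work-around until pandas offers a way to pin the intended type:
# df["B"] = df["B"].astype(str)
```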
Expected Output

`to_parquet` writes the parquet file using the dtypes as specified.

Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None