Column type detection issues #241

Open
sagoyal2 opened this issue Jun 24, 2019 · 2 comments

sagoyal2 commented Jun 24, 2019

I tried to upload a pandas DataFrame to Immerse, but I got the following error.

Input:

from pymapd import connect

user_str = 'API Key Name'
password_str = 'API Key Secret'
host_str = 'use2-api.mapd.cloud'
dbname_str = 'mapd'
connection = connect(user=user_str, password=password_str, host=host_str, dbname=dbname_str, port=443, protocol='https')

# DivvyUse is a pandas DataFrame loaded earlier (not shown)
table_name = 'DivvyData'
print(DivvyUse.dtypes)
connection.load_table(table_name, DivvyUse)

Output:

TRIP ID                    object
START TIME                 object
STOP TIME                  object
BIKE ID                   float64
TRIP DURATION              object
FROM STATION ID           float64
FROM STATION NAME          object
TO STATION ID             float64
TO STATION NAME            object
USER TYPE                  object
GENDER                     object
BIRTH YEAR                float64
FROM LATITUDE             float64
FROM LONGITUDE            float64
FROM LOCATION              object
TO LATITUDE               float64
TO LONGITUDE              float64
TO LOCATION                object
Boundaries - ZIP Codes    float64
Zip Codes                  object
Community Areas           float64
Wards                     float64
FROM ZIPCODE                int64
FROM REGION                object
FROM TRANSIT TYPE          object
TO ZIPCODE                  int64
TO REGION                  object
TO TRANSIT TYPE            object
GENERATION                 object
Age                       float64
dtype: object
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    126                 else:
--> 127                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    128             except Exception:

~/miniconda3/lib/python3.7/site-packages/pandas/core/nanops.py in reduction(values, axis, skipna, mask)
    741         else:
--> 742             result = getattr(values, meth)(axis)
    743 

~/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial)
     27           initial=_NoValue):
---> 28     return umr_maximum(a, axis, None, out, keepdims, initial)
     29 

TypeError: '>=' not supported between instances of 'int' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-41-38b62b814b8d> in <module>
     10 table_name = 'DivvyData'
     11 print(DivvyUse.dtypes)
---> 12 connection.load_table(table_name, DivvyUse)

~/miniconda3/lib/python3.7/site-packages/pymapd/connection.py in load_table(self, table_name, data, method, preserve_index, create)
    499 
    500         if create:
--> 501             self.create_table(table_name, data)
    502 
    503         if method == 'infer':

~/miniconda3/lib/python3.7/site-packages/pymapd/connection.py in create_table(self, table_name, data, preserve_index)
    445         """
    446 
--> 447         row_desc = build_row_desc(data, preserve_index=preserve_index)
    448         self._client.create_table(self._session, table_name, row_desc,
    449                                   TFileType.DELIMITED, TCreateParams(False))

~/miniconda3/lib/python3.7/site-packages/pymapd/_pandas_loaders.py in build_row_desc(data, preserve_index)
    201     if preserve_index:
    202         data = data.reset_index()
--> 203     dtypes = [(col, get_mapd_dtype(data[col])) for col in data.columns]
    204     # row_desc :: List<TColumnType>
    205     row_desc = [

~/miniconda3/lib/python3.7/site-packages/pymapd/_pandas_loaders.py in <listcomp>(.0)
    201     if preserve_index:
    202         data = data.reset_index()
--> 203     dtypes = [(col, get_mapd_dtype(data[col])) for col in data.columns]
    204     # row_desc :: List<TColumnType>
    205     row_desc = [

~/miniconda3/lib/python3.7/site-packages/pymapd/_pandas_loaders.py in get_mapd_dtype(data)
     28     "Get the OmniSci type"
     29     if is_object_dtype(data):
---> 30         return get_mapd_type_from_object(data)
     31     else:
     32         return get_mapd_type_from_known(data.dtype)

~/miniconda3/lib/python3.7/site-packages/pymapd/_pandas_loaders.py in get_mapd_type_from_object(data)
     74         return 'BOOL'
     75     elif isinstance(val, int):
---> 76         if data.max() >= 2147483648 or data.min() <= -2147483648:
     77             return 'BIGINT'
     78         return 'INT'

~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
  10954                                       skipna=skipna)
  10955         return self._reduce(f, name, axis=axis, skipna=skipna,
> 10956                             numeric_only=numeric_only)
  10957 
  10958     return set_function_name(stat_func, name, cls)

~/miniconda3/lib/python3.7/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   3628                                           'numeric_only.'.format(name))
   3629             with np.errstate(all='ignore'):
-> 3630                 return op(delegate, skipna=skipna, **kwds)
   3631 
   3632         # TODO(EA) dispatch to Index

~/miniconda3/lib/python3.7/site-packages/pandas/core/nanops.py in f(values, axis, skipna, **kwds)
    128             except Exception:
    129                 try:
--> 130                     result = alt(values, axis=axis, skipna=skipna, **kwds)
    131                 except ValueError as e:
    132                     # we want to transform an object array

~/miniconda3/lib/python3.7/site-packages/pandas/core/nanops.py in reduction(values, axis, skipna, mask)
    740                 result = np.nan
    741         else:
--> 742             result = getattr(values, meth)(axis)
    743 
    744         result = _wrap_results(result, dtype, fill_value)

~/miniconda3/lib/python3.7/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims, initial)
     26 def _amax(a, axis=None, out=None, keepdims=False,
     27           initial=_NoValue):
---> 28     return umr_maximum(a, axis, None, out, keepdims, initial)
     29 
     30 def _amin(a, axis=None, out=None, keepdims=False,

TypeError: '>=' not supported between instances of 'int' and 'str'

jp-harvey (Contributor) commented

@sagoyal2 it looks like this is the source of the problem:

~/miniconda3/lib/python3.7/site-packages/pymapd/_pandas_loaders.py in get_mapd_type_from_object(data)
     74         return 'BOOL'
     75     elif isinstance(val, int):
---> 76         if data.max() >= 2147483648 or data.min() <= -2147483648:
     77             return 'BIGINT'
     78         return 'INT'

We'll have to look into why this is happening. The code tests the column to determine its data type, and it may be treating the column as int when it also contains str values.

This only happens when pymapd tries to infer the data type, so as a workaround you can create the table in OmniSci first; the types will then be read from the schema instead of being inferred. For reference, the failure comes from one (or more) of the object-dtype columns in your pandas DataFrame.
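
To narrow down which object columns are involved, something like the following may help. It is only a sketch using plain pandas: `find_mixed_type_columns` is a hypothetical helper (not part of pymapd), and the small DataFrame stands in for the real data.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def find_mixed_type_columns(df):
    """Hypothetical helper: map each object column that holds more than
    one Python type to the set of type names found in it."""
    mixed = {}
    for col in df.columns:
        if not is_object_dtype(df[col]):
            continue  # only object columns go through per-value inference
        type_names = {type(v).__name__ for v in df[col].dropna()}
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed


# Stand-in data: "trip_id" mixes int and str values, like the failing column.
df = pd.DataFrame({
    "trip_id": [1, 2, "3x"],
    "station": ["a", "b", "c"],
    "bike_id": [1.0, 2.0, 3.0],
})
mixed = find_mixed_type_columns(df)
print(mixed)
```

Once the offending columns are cast to a single type (for example with `df[col].astype(str)`), the table can be created in OmniSci ahead of time. The traceback shows that `load_table` takes a `create` argument and only calls `create_table` when it is truthy, so passing `create=False` should skip the inference step; I have not verified that against every pymapd version.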

jp-harvey (Contributor) commented

Actually, after looking at the code block, it's debatable whether this is a bug at all. If a column of the pandas DataFrame has dtype object, pymapd takes the first value of the column and derives a type from it. Here the first value in the column is an int, so it assumes all the other values are too. Other rows in this column must hold str values, hence the error.
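
A minimal reproduction of that behaviour, assuming nothing beyond pandas itself: the column below is made up, and `col.iloc[0]` is a simplification of how the first value gets picked, not the actual pymapd code.

```python
import pandas as pd

# Made-up object column: the first value is an int, a later one is a str.
col = pd.Series([1, 2, "three"], dtype=object)

first = col.iloc[0]
looks_like_int = isinstance(first, int)  # the first-value check passes

# The range check from the traceback then calls max() on the whole column,
# which has to compare int with str.
raised = False
try:
    col.max()
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```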

We need to decide whether this is "just the way it works" or whether auto-detection should do something more involved, such as checking the types of the first 1000 rows, with additional rules for which data type wins when they differ.
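
As a rough sketch of what sample-based detection could look like: `infer_type_from_sample` is a hypothetical function, the precedence rules (str beats float beats int) are an assumption rather than a decision, and the `'STR'` and `'DOUBLE'` labels are assumed names; only `'BOOL'`, `'INT'` and `'BIGINT'` appear in the code quoted above.

```python
import pandas as pd


def infer_type_from_sample(col, n=1000):
    """Hypothetical: pick a type label from the first n non-null values."""
    sample = col.dropna().head(n)
    kinds = {type(v) for v in sample}
    if not kinds or str in kinds:
        return "STR"      # assumption: any str (or no data at all) wins as string
    if kinds == {bool}:
        return "BOOL"
    if float in kinds:
        return "DOUBLE"   # assumption: float wins over int
    if kinds <= {int, bool}:
        ints = [int(v) for v in sample]
        # Same bounds as the existing range check in get_mapd_type_from_object.
        if max(ints) >= 2147483648 or min(ints) <= -2147483648:
            return "BIGINT"
        return "INT"
    return "STR"          # anything unrecognised falls back to string


mixed = pd.Series([1, 2, "three"], dtype=object)
print(infer_type_from_sample(mixed))
```

Rows beyond the sample can still break the guess, so the load would need to either cast values to the chosen type or fail with a clearer message.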

If we decide to change it, perhaps the most logical alternative would be to keep the detection consistent with OmniSci's own detection, i.e. to use the same logic as the detect_column_types endpoint. An additional OmniSciDB endpoint could use that detection logic but accept data instead of a filename. That way we wouldn't have to replicate all of the column detection functionality in pymapd, and detection could be made consistent across other connectors as well.

randyzwitch changed the title from "Uploading data to Immerse using pymapd" to "Column type detection issues" on Mar 19, 2020