# Comparing type inference methods for mixed data arrays

When a data array has mixed data types, such as `[True, 0.0]`, pandas stores it with `object` dtype, and `pandas.DataFrame.infer_objects` leaves it as `object`. However, many different kinds of mixed arrays end up as `object` dtype. This "blanket" approach is useful for practical data handling, but it is not suitable for more accurate and granular type inference.
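As a small illustration of this blanket behavior, the sketch below runs `infer_objects` on three `object` columns. The frame name `toy` and its column names are chosen here for illustration only; the values mirror cases used later in this notebook.

```python
import pandas as pd

# Three object-dtype columns: two mixed ones and one that is purely integer.
toy = pd.DataFrame(
    {
        "bool_float": [True, 0.0],
        "str_int": ["a", 1],
        "int": [1, 0],
    },
    dtype=object,
)

# infer_objects attempts a soft conversion of each object column.
inferred = toy.infer_objects().dtypes
print(inferred)
```

Both mixed columns should stay `object` even though their contents differ, while the purely integer column is converted to `int64`.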

In contrast, `pandas.api.types.infer_dtype` provides much more granular type inference, and it can ignore null values (`skipna=True`). The function returns the name of the inferred type as a string. For the comprehensive list of type names, see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.types.infer_dtype.html).

This notebook compares the two pandas type inference methods (`pandas.DataFrame.infer_objects` and `pandas.api.types.infer_dtype`) on various cases of mixed arrays. I use `np.nan`, `None`, array (`list`), `str`, `bool`, `float`, and `int` values to generate various combinations of mixed arrays, and apply both inference methods to compare the results.

In [1]:
import pandas as pd
import numpy as np

## Generating toy example of mixed arrays
Here I generate a dataframe with various mixed types. I use "nan" (`np.nan`), "none", "array" (`list`), "str", "bool", "float", and "int" types to generate all pairwise combinations as 2-element arrays, which gives 28 columns. To emphasize the inference method comparison, I assign `object` dtype to all columns.

In [2]:
example = pd.DataFrame(
    {
        'nan': [np.nan, np.nan],
        'nan_none': [np.nan, None],
        'nan_array': [np.nan, []],
        'nan_str': [np.nan, "a"],
        'nan_bool': [np.nan, True],
        'nan_float': [np.nan, 1.0],
        'nan_int': [np.nan, 1],
        'none': [None, None],
        'none_array': [None, []],
        'none_str': [None, "a"],
        'none_bool': [None, True],
        'none_float': [None, 0.0],
        'none_int': [None, 1],
        'array': [[], []],
        'array_str': [[], "a"],
        'array_bool': [[], True],
        'array_float': [[], 1.0],
        'array_int': [[], 1],
        'str': ["a", "b"],
        'str_bool': ["a", True],
        'str_float': ["a", 1.0],
        'str_int': ["a", 1],
        'bool': [True, False],
        'bool_float': [True, 0.0],
        'bool_int': [True, 1],
        'float': [1.0, 0.0],
        'float_int': [1.0, 0],
        'int': [1, 0],
    },
    dtype=object
)
print(example.dtypes.value_counts())

object    28
dtype: int64


In [3]:
example.head()

Unnamed: 0,nan,nan_none,nan_array,nan_str,nan_bool,nan_float,nan_int,none,none_array,none_str,...,str,str_bool,str_float,str_int,bool,bool_float,bool_int,float,float_int,int
0,,,,,,,,,,,...,a,a,a,a,True,True,True,1.0,1.0,1
1,,,[],a,True,1.0,1.0,,[],a,...,b,True,1.0,1,False,0.0,1,0.0,0.0,0


## Type inference with `pandas.DataFrame.infer_objects`

In [4]:
example_results = example.T
example_results['pd_infer_objects'] = example.infer_objects().dtypes
example_results.head()

Unnamed: 0,0,1,pd_infer_objects
nan,,,float64
nan_none,,,float64
nan_array,,[],object
nan_str,,a,object
nan_bool,,True,object


## Type inference with `pandas.api.types.infer_dtype`
Here, type inference is run twice: once with NA values included (`skipna=False`) and once with NA values ignored (`skipna=True`).

In [5]:
# Infer each column's type with NA values included
example_results['pd_infer_dtype'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=False))
# Infer each column's type with NA values ignored
example_results['pd_infer_dtype_skipna'] = example.apply(lambda x: pd.api.types.infer_dtype(x, skipna=True))
example_results.head()

Unnamed: 0,0,1,pd_infer_objects,pd_infer_dtype,pd_infer_dtype_skipna
nan,,,float64,floating,empty
nan_none,,,float64,mixed,empty
nan_array,,[],object,mixed,mixed
nan_str,,a,object,mixed,string
nan_bool,,True,object,mixed,boolean


## Comparison: with vs. without na values in `pandas.api.types.infer_dtype`

When we don't ignore NA values, `pandas.api.types.infer_dtype` often returns `"mixed"` for arrays that are quite different from one another. For instance, in the table above, the last 3 rows (`"nan_array"`, `"nan_str"`, `"nan_bool"`) are all identified as `"mixed"` when NaN is not ignored.

However, when we ignore NA values, `"nan_array"` is still identified as `"mixed"`, but `"nan_str"` is identified as `"string"` and `"nan_bool"` as `"boolean"`, so the type of the non-null values is recovered.
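The effect of `skipna` can also be checked on individual columns outside the dataframe. The following is a minimal sketch; the `columns` and `results` names are introduced here for illustration, and the values are the `"nan_str"` and `"nan_bool"` cases from the toy example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

# Two object-dtype columns taken from the toy example above.
columns = {
    "nan_str": pd.Series([np.nan, "a"], dtype=object),
    "nan_bool": pd.Series([np.nan, True], dtype=object),
}

# For each column, record the inferred type with NA kept and with NA skipped.
results = {
    name: (infer_dtype(col, skipna=False), infer_dtype(col, skipna=True))
    for name, col in columns.items()
}

for name, (with_na, without_na) in results.items():
    print(f"{name}: skipna=False -> {with_na}, skipna=True -> {without_na}")
```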

## Comparison: `pandas.DataFrame.infer_objects` vs. `pandas.api.types.infer_dtype(skipna=True)`

As mentioned above, `pandas.DataFrame.infer_objects` takes a blanket approach to mixed data arrays. To compare the two methods, we can select the columns of the toy example that `pandas.DataFrame.infer_objects` left as `object` and examine what `pandas.api.types.infer_dtype(skipna=True)` returns for them.

In [6]:
example_results[example_results['pd_infer_objects'] == object].drop('pd_infer_dtype', axis=1).sort_values(by='pd_infer_dtype_skipna')

Unnamed: 0,0,1,pd_infer_objects,pd_infer_dtype_skipna
nan_bool,,True,object,boolean
none_bool,,True,object,boolean
none,,,object,empty
nan_array,,[],object,mixed
str_float,a,1.0,object,mixed
str_bool,a,True,object,mixed
array_float,[],1.0,object,mixed
array_bool,[],True,object,mixed
array_str,[],a,object,mixed
array,[],[],object,mixed


This shows that `pandas.DataFrame.infer_objects` infers a variety of mixed arrays as `object`, while `pandas.api.types.infer_dtype(skipna=True)` can often identify the type of the non-null values. The latter still returns `"mixed"` for many different arrays, but most of them contain non-numerical values such as strings or arrays.

One interesting observation is that `[True, 0.0]` is inferred as `"mixed"` but `[True, 1]` as `"mixed-integer"`, which implies that `pandas.api.types.infer_dtype` is designed to signal the presence of integers in the inferred type name.

Finally, we can compare the values returned by the two methods:
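This observation can be reproduced directly. The sketch below also includes the `[1.0, 0]` case, which I expect to be the source of the `"mixed-integer-float"` label; the `cases` and `labels` names are introduced here for illustration.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Object-dtype columns that differ only in whether an integer is present.
cases = {
    "bool_float": pd.Series([True, 0.0], dtype=object),
    "bool_int": pd.Series([True, 1], dtype=object),
    "float_int": pd.Series([1.0, 0], dtype=object),
}

# Collect the inferred type name for each case.
labels = {name: infer_dtype(col, skipna=True) for name, col in cases.items()}

for name, label in labels.items():
    print(f"{name}: {label}")
```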

In [7]:
for val in set(example_results['pd_infer_objects']):
    print(val, type(val))

int64 <class 'numpy.dtype[int64]'>
float64 <class 'numpy.dtype[float64]'>
bool <class 'numpy.dtype[bool_]'>
object <class 'numpy.dtype[object_]'>


In [8]:
for val in set(example_results['pd_infer_dtype_skipna']):
    print(val, type(val))

mixed <class 'str'>
floating <class 'str'>
boolean <class 'str'>
mixed-integer <class 'str'>
string <class 'str'>
integer <class 'str'>
mixed-integer-float <class 'str'>
empty <class 'str'>


This shows that `pandas.DataFrame.infer_objects` yields NumPy dtype objects that can be used directly for casting, whereas `pandas.api.types.infer_dtype` returns strings that need to be further processed or mapped if we want to cast data types.
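One way to do that mapping is sketched below. The `INFERRED_TO_DTYPE` table and the `cast_by_inferred_type` helper are hypothetical names introduced for illustration, not part of pandas, and the choice of nullable extension dtypes as cast targets is an assumption that suits columns containing NA values.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical lookup table from infer_dtype labels to pandas dtype names.
# Nullable extension dtypes are assumed so that NA values survive the cast.
INFERRED_TO_DTYPE = {
    "boolean": "boolean",
    "integer": "Int64",
    "floating": "Float64",
    "string": "string",
}


def cast_by_inferred_type(column):
    """Cast a column using its inferred label; leave unmapped labels untouched."""
    label = infer_dtype(column, skipna=True)
    target = INFERRED_TO_DTYPE.get(label)
    return column if target is None else column.astype(target)


none_bool = cast_by_inferred_type(pd.Series([None, True], dtype=object))
str_float = cast_by_inferred_type(pd.Series(["a", 1.0], dtype=object))

print(none_bool.dtype)  # label "boolean" is mapped to the nullable boolean dtype
print(str_float.dtype)  # label "mixed" has no mapping, so the column stays object
```

Labels such as `"mixed"` or `"empty"` are deliberately left out of the table, since there is no single obvious dtype to cast them to.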

## Conclusions

Pandas has two type inference methods: `pandas.DataFrame.infer_objects` and `pandas.api.types.infer_dtype`.

`pandas.DataFrame.infer_objects` is designed to return practical dtypes that can be applied to arrays directly, so its inference takes a blanket approach that favors handling the data immediately over describing it precisely.

On the other hand, `pandas.api.types.infer_dtype` performs more granular type inference and can also ignore NA values. However, it returns strings rather than dtype objects, so a further mapping step is needed before this information can be used for type casting.