## Point of Interest - Data Types

In many applications, such as Excel, you don't worry so much about what sort of data you're working with because it's either a number or a string.

In Python you typically don't need to worry about it either. However, if you're working with large data sets, the choice of data type can make a significant difference in efficiency.
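To make the efficiency point concrete, here is a minimal sketch comparing the memory used by the same values stored as two different integer types. The variable names and the one-million-row size are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# One million small integers stored two ways (illustrative data).
values = np.arange(1_000_000) % 100          # every value fits in 8 bits
wide = pd.Series(values, dtype="int64")      # 8 bytes per value
narrow = pd.Series(values, dtype="int8")     # 1 byte per value

# index=False counts only the data, not the row labels.
wide_bytes = wide.memory_usage(index=False)
narrow_bytes = narrow.memory_usage(index=False)
print(wide_bytes, narrow_bytes)
```

The narrow version holds the same numbers in one eighth of the space, which starts to matter once a data set has millions of rows.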

Under the hood, pandas builds on Python's NumPy library, which supports vectorized calculations. NumPy, in turn, relies on your C compiler's support for data types, and the compiler works in conjunction with your CPU to determine which data types it can support. Fortunately, most users can remain blissfully unaware of these complexities, while the interested few can dive in and optimize their operations.

Everyone should know how to figure out what data types are being used, what their limitations are, and how to swap back and forth when that's possible.   This section will cover the basics.


Pandas objects automatically assign a sensible data type whenever you add or change data. You can always inspect the result, and you can force your own choice if you want. Each column can hold only one type of data, so the choice is important.
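As a small sketch of "taking a look" and "forcing your own choice", the snippet below lets pandas infer a dtype for one Series and requests a specific dtype for another. The Series names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Let pandas choose: whole numbers are inferred as an integer type.
inferred = pd.Series([1, 2, 3])

# Force a choice at creation time with the dtype argument.
forced = pd.Series([1, 2, 3], dtype="float32")

print(inferred.dtype)   # the dtype pandas picked
print(forced.dtype)     # the dtype we asked for
```

The `.dtype` attribute works on a single column; `.dtypes` (plural) reports every column of a DataFrame at once.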

Here, we're creating a DataFrame from some diverse data: each row consists of an integer, a floating point number and a string.

In [22]:
import pandas as pd
from custom_utils.display_wide import display_wide

# Each row holds an integer, a float and a string (note the stray spaces)
df = pd.DataFrame(
    [
        [1, 2.0, "  3.00"],
        [6, 3.3, "5.67"],
    ],
    columns=["c1", "c2", "c3"],
)
data_types = df.dtypes
display_wide([df, data_types.to_frame()], ["Data", "'dtypes'"], spacing=3)

Data

| | c1 | c2 | c3 |
|---|---|---|---|
| 0 | 1 | 2.0 | 3.00 |
| 1 | 6 | 3.3 | 5.67 |

'dtypes'

| | 0 |
|---|---|
| c1 | int64 |
| c2 | float64 |
| c3 | object |


You'll see that pandas went ahead and assigned an integer type to c1 and a float type to c2. The "64" in the dtype name is the number of bits (8 bytes) allocated to each value. By default, each typically gets the 64-bit version, which is the widest of the commonly used integer and float sizes.
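To see what the width in a dtype name buys you, this sketch uses NumPy's `iinfo` and `finfo` helpers to print the size and limits of a few common types. It is offered as an illustration of the limits mentioned above, not as something the original notebook ran.

```python
import numpy as np

# The number in a dtype name is its width in bits; itemsize reports bytes.
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(np.dtype(int_type).name, np.dtype(int_type).itemsize, info.min, info.max)

# For floats, finfo reports the largest representable value and the
# approximate number of decimal digits of precision.
for float_type in (np.float32, np.float64):
    info = np.finfo(float_type)
    print(np.dtype(float_type).name, np.dtype(float_type).itemsize, info.max, info.precision)
```

A value outside the printed range cannot be stored in that integer type, which is the main limitation to keep in mind when choosing a narrower dtype.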

The "object" dtype is generic and it typically signals that the column holds strings. 1/

The display holds clues to the data type, but it's not reliable.  Integers never show decimal points and floats always do, but with "object" types, you never know. 
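Because the display can't be trusted for "object" columns, one way to check what is really stored is to ask each element for its Python type. The following is a minimal sketch with made-up data; `mixed` and `element_types` are illustrative names.

```python
import pandas as pd

# An object column can mix strings and numbers (illustrative data).
mixed = pd.Series(["3.00", 5.67, 7], dtype="object")

# Apply Python's type() to each element to see what is stored.
element_types = mixed.map(type)
print(element_types.value_counts())
```

If the count shows more than one type, the column probably needs cleaning before it can be converted to a numeric dtype.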

Here are a couple of ways to change dtypes, both shown in cell In [23]. In the first case, we perform an operation on an integer column that produces floating point output. Pandas automatically assigns a new data type.

In the second case, we use the astype() method to perform an explicit "type cast". Here we create a new column for the converted values and ask pandas to fail silently if problems are encountered. You'll note that a failed cast leaves the values unchanged, so the new column keeps the "object" data type.

The astype() method is particularly useful when your data is a little funky, as is the case when some of the "numbers" you're ingesting have gratuitous spaces in them and would otherwise be regarded as strings.
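As a separate sketch (not the notebook's own cell), here is one possible way to clean string "numbers" that carry stray spaces: strip the whitespace, then cast to float. It also shows `pd.to_numeric` with `errors="coerce"` as an alternative that marks unparseable values as NaN. The `"n/a"` entry and the variable names are assumptions added for illustration.

```python
import pandas as pd

# Strings that look like numbers, one with leading spaces, one unparseable.
raw = pd.Series(["  3.00", "5.67", "n/a"], dtype="object")

# Strip stray whitespace, then cast the values that are known to be numeric.
cleaned = raw.iloc[:2].str.strip().astype(float)

# to_numeric with errors="coerce" converts what it can and writes NaN for
# the rest, so the result is a float column rather than an object column.
coerced = pd.to_numeric(raw.str.strip(), errors="coerce")

print(cleaned.tolist())
print(coerced.tolist())
```

Casting to float rather than int is deliberate here: a string such as "3.00" is not a valid integer literal, so an integer cast would fail even after the spaces are removed.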

1/  The object dtype serves as a pointer to the specific objects referenced in a column. This allows bending the rule about only one type of data per column, because each row can point to a different kind of object. This flexibility comes at a high cost in terms of efficiency.

Strings, because they aren't numbers, have to be "object" types.   If you wanted to use arrays of super-high-precision numbers (possible with decimal.Decimal) you would also need to use the "object" type.  Although these are numbers, they aren't natively supported by C.
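The decimal.Decimal case can be sketched as follows. The `prices` data is hypothetical; the point is only that pandas stores the values under the "object" dtype and hands the arithmetic to the Decimal objects themselves.

```python
from decimal import Decimal

import pandas as pd

# Decimal values keep exact decimal digits, but there is no native NumPy
# dtype for them, so pandas stores them as generic Python objects.
prices = pd.Series([Decimal("0.10"), Decimal("0.20")])

# Arithmetic is delegated to each Decimal object, one element at a time.
total = prices.sum()

print(prices.dtype)
print(total)
```

The result stays exact, but each addition goes through Python objects rather than a vectorized C loop, which is the efficiency cost described above.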

In [23]:
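# Type cast via operation: dividing an integer column yields floats
df['c1'] = df['c1'] / 1

# Explicit type cast: errors='ignore' returns the original values if the cast fails
df['c3 int_or_fail'] = df['c3'].astype(int, errors='ignore')

display_wide([df, df.dtypes.to_frame()], spacing=3)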

| | c1 | c2 | c3 | c3 int_or_fail |
|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.00 | 3.00 |
| 1 | 6.0 | 3.3 | 5.67 | 5.67 |

| | 0 |
|---|---|
| c1 | float64 |
| c2 | float64 |
| c3 | object |
| c3 int_or_fail | object |
