## Point of Interest - Data Types

In many applications, such as Excel, you don't worry so much about what sort of data you're working with because it's either a number or a string.

In Python you typically don't need to worry about it either. However, if you're working with large data sets, the choice of data type can make a significant difference in efficiency.
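To give a feel for the size of that difference, here is a minimal sketch that stores the same made-up values at two different widths and compares the memory they occupy. The sample data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# One million small integers (all between 0 and 99)
values = np.arange(1_000_000) % 100

as_int64 = pd.Series(values, dtype="int64")
as_int8 = pd.Series(values, dtype="int8")

# nbytes reports the size of the underlying data buffer
print(as_int64.nbytes)  # 8 bytes per value
print(as_int8.nbytes)   # 1 byte per value
```

The values are identical in both cases. Only the storage width changes, and the narrower column needs one eighth of the memory.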

<u>Pandas Data Types</u>

Under the hood, pandas builds on the NumPy library, which supports vectorized calculations.  NumPy, in turn, relies on the data types supported by your C compiler, and the C compiler works in conjunction with your CPU to determine just what data it can support.  Fortunately, most users can remain blissfully unaware of these complexities, while the interested few can dive in and optimize their operations.

Pandas objects require that each column (Series) contain all the same data type.  That makes things much easier / faster under the hood because the compiler can make assumptions about how wide numbers are, how large vectors of these numbers are, etc.
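A quick sketch of the one-type-per-column rule in action, using throwaway Series. When the values don't share a type, pandas picks a single type that can hold all of them.

```python
import pandas as pd

ints_only = pd.Series([1, 2, 3])
ints_and_float = pd.Series([1, 2.5, 3])          # integers are upcast to float
numbers_and_text = pd.Series([1, 2.5, "three"])  # falls back to object

print(ints_only.dtype)         # int64
print(ints_and_float.dtype)    # float64
print(numbers_and_text.dtype)  # object
```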

Everyone should know how to figure out what data types are being used, what their limitations are, and how to swap back and forth when that's possible.   This section will cover the basics.
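As a starting point on limitations, NumPy can report the range of each numeric type. This is a small sketch using NumPy's iinfo() and finfo() helpers. The particular types listed are just a sample.

```python
import numpy as np

# Integer types have hard minimum and maximum values
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    print(info.dtype, info.min, info.max)

# For floats, eps is the gap between 1.0 and the next representable value
for float_type in (np.float32, np.float64):
    info = np.finfo(float_type)
    print(info.dtype, info.max, info.eps)
```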


Pandas objects automatically assign a sensible data type whenever you add or change data.  You can always take a look, and you can force your own choices if you want.  Each column can only hold one type of data, so the choice is important.

Here, we're creating a DataFrame out of some diverse data comprised of an integer, a floating point number and a string.

In [26]:
import pandas as pd
import numpy as np
#display_wide comes from a local helper module, not from pandas
from custom_utils.display_wide import display_wide

#Two rows of mixed data: an integer, a float and a numeric string
df = pd.DataFrame([
                   [1, 2.0, "  3.00"],
                   [6, 3.3, "5.67"],
                  ],
                  columns=["c1", "c2", "c3"])
data_types = df.dtypes
display_wide([df, data_types.to_frame()], ["Data", "'dtypes'"], spacing=3)

Data

| |c1|c2|c3|
|---|---|---|---|
|0|1|2.0|3.00|
|1|6|3.3|5.67|

'dtypes'

| |0|
|---|---|
|c1|int64|
|c2|float64|
|c3|object|


You'll see that pandas went ahead and assigned an integer type to c1 and a float type to c2.  The "64" in the dtype name is the width of each value in bits, so every entry in these columns occupies 8 bytes.  By default, pandas picks the 64-bit versions when it infers a type from plain Python numbers.

The "object" dtype is generic and it typically signals that the column holds strings. 

The display holds clues to the data type, but it's not reliable.  Integers never show decimal points and floats always do, but with "object" types, you never know. 
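When the display leaves you guessing, you can ask each value what it really is. A minimal sketch, assuming a made-up object column whose three entries look nearly the same on screen:

```python
import pandas as pd

# An object column whose values print alike but are stored differently
mixed = pd.Series(["3", 3, 3.0], dtype=object)

# The dtype only says "object"; the element types tell the real story
print(mixed.dtype)
element_types = mixed.map(type).tolist()
print(element_types)
```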

<u>Changing Data Types ("Type Casting")</u>

Here are a couple of ways to change dtypes.  In the first case, we perform an operation on an integer column that produces floating point output, and pandas automatically assigns a new data type.

In the second case, we use the <b>astype()</b> method to perform an explicit "type cast".  Here we create a new column for the converted values and ask pandas to fail silently if it encounters problems.  You'll note that the failed conversion leaves the "object" data type in place.

The <b>astype()</b> method is particularly useful when your data is a little funky - as is the case when some of the "numbers" you're ingesting have gratuitous spaces in them and would otherwise be regarded as strings.


In [27]:
#Type cast via operation
df['c1'] = df['c1']/1

#Explicit type cast
df['c3 int'] = df['c3'].astype(int,errors='ignore')
df['c3 coerce'] = pd.to_numeric(df['c3'],errors='coerce').astype('int')
#np.complex was removed in NumPy 1.24; the built-in complex is equivalent
df['c3 complex'] = pd.to_numeric(df['c3'],errors='coerce').astype(complex)
df['c3 int64'] = pd.to_numeric(df['c3'],errors='coerce').astype(np.int64)

display_wide([df, df.dtypes.to_frame()], spacing=3) 

| |c1|c2|c3|c3 int|c3 coerce|c3 complex|c3 int64|
|---|---|---|---|---|---|---|---|
|0|1.0|2.0|3.00|3.00|3|3.0+0.0j|3|
|1|6.0|3.3|5.67|5.67|5|5.67+0.0j|5|

| |0|
|---|---|
|c1|float64|
|c2|float64|
|c3|object|
|c3 int|object|
|c3 coerce|int32|
|c3 complex|complex128|
|c3 int64|int64|


Here are some things you can try to convert the data types column-wise (using operations on Series objects).  You can perform an operation and let pandas do the type casting under the hood - it will attempt to convert the data to a floating point type:

<b>df['c1'] = df['c1']/1</b>

...forces division where possible.  It will fail with an error if pandas can't figure out what to do.  Alternatively, you can ask for a direct type cast using the <b>astype()</b> method.  This has the advantage of allowing you to specify the data type you end up with.

The <b>astype()</b> method isn't terribly robust, though.  It offers a couple of options for handling errors: it can raise them ('raise', the default) or ignore them ('ignore'), but it isn't aggressive about forcing a change.  You can see from the example above that astype() essentially gave up - although it didn't raise an error, the data type remains 'object'.
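To see what astype() is objecting to, here is a minimal sketch using a small stand-in Series (the sample values and variable names are made up for illustration). With the default error handling, the direct cast to int raises an error, while a detour through float succeeds.

```python
import pandas as pd

# Hypothetical sample: numeric text with stray spaces and decimals
raw = pd.Series(["  3.00", "5.67"], dtype=object)

try:
    raw.astype(int)  # int("  3.00") is not valid, so this raises
    direct_cast_failed = False
except ValueError as err:
    direct_cast_failed = True
    print("astype(int) raised:", err)

# Going through float first succeeds; the int cast then truncates
as_float = raw.astype(float)
as_int = as_float.astype("int64")
print(as_int.tolist())  # [3, 5]
```

Note that the final cast drops the fractional part rather than rounding, so 5.67 becomes 5.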

To actually force a change of data type here, you might consider using the general Pandas <b>to_numeric()</b> function.

<b>df['c3 coerce'] = pd.to_numeric(df['c3'],errors='coerce').astype('int')</b>

It offers an additional way to handle conversion errors - 'coerce' - which converts whatever it can and replaces anything it can't parse with NaN.  You can see it in action in the last three statements of the example, where we've produced both integers and complex numbers out of the original strings.
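A small sketch of what 'coerce' does with an entry that can't be parsed at all. The 'n/a' entry and the variable names are invented for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one entry that is not a number at all
raw = pd.Series(["  3.00", "5.67", "n/a"], dtype=object)

numbers = pd.to_numeric(raw, errors="coerce")
print(numbers.tolist())         # the unparseable entry becomes NaN
print(numbers.isna().tolist())

# NaN cannot be stored in a NumPy integer column, so deal with the
# missing values (here: drop them) before casting to an integer type
whole = numbers.dropna().astype(np.int64)
print(whole.tolist())
```

Dropping the missing rows is only one option. Filling them with a placeholder value before the cast is another, depending on what the data means.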

If you study the explicit casts in the example, you'll see another nuance.  In the first, we asked to convert the data to one of Python's native numeric types (these are float, int, and complex).

<b>df['c3 int'] = df['c3'].astype(int,errors='ignore')</b>

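Alternatively, we can name the type generically, e.g., 'int', 'float' or complex.  That specifies the general kind of data and leaves the precise width to the platform default.  Here, the 'int' request in the 'c3 coerce' column came back as int32, the default integer width on the system that produced this output; other systems, and NumPy 2.0 and later, typically give int64.  (Older NumPy versions also offered the aliases numpy.complex and numpy.float for this purpose; they were removed in NumPy 1.24.)

<b>df['c3 complex'] = pd.to_numeric(df['c3'],errors='coerce').astype(complex)</b>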

If we want more granular control we can request the NumPy data types explicitly.  Perhaps because we anticipate using really large values later on, we've requested int64.  Other options may include int8, int16, int32 .. up to the widest integer supported by your specific system.
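Going the other way, you may want the narrowest type that still fits your data. The sketch below uses made-up values to contrast an explicit width request with the downcast option of to_numeric(), which picks the width for you.

```python
import numpy as np
import pandas as pd

small = pd.Series([1, 2, 3])
bigger = pd.Series([1, 1000, 3])

# Explicit request: exactly the width you ask for
print(small.astype(np.int16).dtype)   # int16

# downcast="integer" asks pandas for the smallest integer type that fits
small_down = pd.to_numeric(small, downcast="integer")
bigger_down = pd.to_numeric(bigger, downcast="integer")
print(small_down.dtype)    # int8
print(bigger_down.dtype)   # int16, because 1000 does not fit in int8
```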

<b>df['c3 int64'] = pd.to_numeric(df['c3'],errors='coerce').astype(np.int64)</b>



<u>More on the 'object' Data Type</u>

Every value in a Pandas Series needs to share one data type, with numbers being the most efficient.  A Series can be set up to contain all 'object' types, as is the case when it contains strings.  This data type is really the primitive ancestor of all Python data types, the class called 'object'.  Here, each entry serves as a pointer to the real data contained in the Series.

This is important in a couple of ways.  First, the data type and other specifics of the real data need to be resolved on a case-by-case basis.  This means that the substantial efficiency gains of predictable data types are lost.  But you gain flexibility, so the bargain may well be worthwhile.
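A minimal sketch of the difference, using the same three made-up numbers stored both ways. The variable names are for illustration only.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, 2, 3])              # stored as a packed integer array
boxed = pd.Series([1, 2, 3], dtype=object)  # stored as pointers to Python ints

print(type(numeric.iloc[0]))  # a NumPy integer scalar
print(type(boxed.iloc[0]))    # a plain Python int

# Both give the same answer; the object column just takes the slower,
# element-by-element route to get there
print(numeric.sum(), boxed.sum())
```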

<u>Introducing Non-native Data Types</u>

The second point is a bit more nuanced.  The real data can be anything - you're no longer constrained to just numbers and strings, and you can use specialized data types available through other Python libraries.

In [31]:
#Import the library for rational numbers
import fractions

#Produce a couple of Fraction objects
one_third = fractions.Fraction(1, 3)
one_seventh = fractions.Fraction(1, 7)

#Create a DataFrame with some nice row and column indices
frac_df = pd.DataFrame([['row_1', one_third, one_seventh],
                        ['row_2', one_third, one_seventh]], 
                         columns =['row', '1/3', '1/7'])
frac_df.set_index('row', inplace=True)  

#This displays various bits of the new DataFrame
cell = frac_df.loc['row_1', '1/3']
cell_type = type(cell)
display_wide([frac_df, frac_df.dtypes, cell , cell_type], 
             ['Data', 'Column Data Types', 'Cell Value', "Cell Data Type"],
               spacing=2) 

Data

|row|1/3|1/7|
|---|---|---|
|row_1|1/3|1/7|
|row_2|1/3|1/7|

Column Data Types

| | |
|---|---|
|1/3|object|
|1/7|object|

Cell Value

| | |
|---|---|
|value|1/3|

Cell Data Type

| | |
|---|---|
|value|&lt;class 'fractions.Fraction'&gt;|


Let's look at the bits of this data from right to left.    On the right, you can see that an individual cell value within the DataFrame is a fractions.Fraction object.   Next to that, you will observe that it displays itself as a fraction.

Next, you'll see that Pandas thinks of each of its internal columns as an 'object', in spite of the  data type actually represented. 

This is pretty cool because you can manipulate an entire column of these just as you would an entire column of an internally-supported data type.   For instance, Pandas knows what to do if it sees the "+" operator.   It looks at the object's <b>__add__()</b> method and follows the instructions there.  The fact that fractions add themselves differently than integers or floats doesn't matter.  So this works:

<b>frac_df['sums'] = frac_df['1/3'] + frac_df['1/3']</b>

<u>Working with Non-native Data Types</u>

Things get just a bit trickier if we want to tap into stuff known by the Fraction objects themselves, but invisible to Pandas.  For instance, Fraction objects have a 'numerator' and a 'denominator' attribute.  If we want to resolve the fraction into a floating-point approximation, we need to access these attributes and do some sort of division.

To access the attributes, we have to address the Series on a cell-by-cell basis.  The <b>apply()</b> method provides that capability.  Here, we supply the calculations as a local lambda function.

In [32]:
#Add and multiply columns of Fraction values
frac_df['sum_1/3'] = frac_df['1/3'] + frac_df['1/3']
frac_df['1/3 * 1/7'] = frac_df['1/3'] * frac_df['1/7']

#Use attributes of the non-native objects
#(this converts the 'sum_1/3' column, so the result approximates 2/3)
frac_df['1/3_float'] = frac_df['sum_1/3'].apply(lambda x: x.numerator/x.denominator)
frac_df

|row|1/3|1/7|sum_1/3|1/3 * 1/7|1/3_float|
|---|---|---|---|---|---|
|row_1|1/3|1/7|2/3|1/21|0.666667|
|row_2|1/3|1/7|2/3|1/21|0.666667|
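As an aside, Fraction objects also know how to turn themselves into floats, so the built-in float() can stand in for the hand-written division. A minimal sketch on a throwaway Series (the names here are illustrative and separate from frac_df):

```python
from fractions import Fraction
import pandas as pd

thirds = pd.Series([Fraction(1, 3), Fraction(2, 3)])

# float() knows how to convert a Fraction, so it can be applied directly
as_float = thirds.apply(float)
print(as_float.dtype)
print(as_float.round(6).tolist())
```

The result is an ordinary float64 column, so from here on pandas can use its fast numeric machinery again.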


<u>Miscellaneous Notes</u>

-- Your native Pandas data types ultimately depend on the C types supported by your C compiler.  One big difference between Pandas and mainstream Python is that the "width" (range and precision) of the data types may be constrained for efficiency.

-- Floating point values of any type can be inaccurate due to rounding and the challenges of representing base-10 values using base-2 hardware.  The differences are small and inconsequential for most purposes.  If you are working extensively with rational numbers (those that can be represented by fractions), consider using the fractions library.

-- If you require super-accurate decimal operations, you may be interested in the decimal library, with which you can carry numbers to arbitrary levels of precision, changing the precision as needed.
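The notes above can be seen in a few lines of plain Python. This is a minimal sketch: the classic 0.1 + 0.2 comparison stands in for the rounding issue, and the precision of 40 digits is an arbitrary choice.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

# Binary floats cannot store 0.1, 0.2, or 0.3 exactly
print(0.1 + 0.2 == 0.3)

# Fractions and Decimals built from strings keep the exact base-10 values
print(Fraction("0.1") + Fraction("0.2") == Fraction("0.3"))
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# Decimal precision can be changed for a block of calculations
with localcontext() as ctx:
    ctx.prec = 40
    one_third = Decimal(1) / Decimal(3)
print(one_third)
```

The first comparison prints False and the other two print True. The final line shows one third carried to 40 significant digits.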

The docs are here:

https://docs.python.org/3.7/library/decimal.html

https://docs.python.org/3/library/fractions.html