# DataFrame Data Type Conversion

The previous chapters in this part focused on understanding and constructing data types outside the context of real data. We mainly used numpy arrays and pandas Series to explore these data types. In this chapter, we will work with entire DataFrames of real data.

Let's see some examples by reading in a few of the columns from the college dataset. The `usecols` parameter is set to just six columns, one of which, the institution name, is placed in the index.

In [1]:
import pandas as pd
cols = ['instnm', 'hbcu', 'relaffil', 'ugds', 
        'md_earn_wne_p10', 'grad_debt_mdn_supp']
college = pd.read_csv('../data/college.csv', index_col='instnm', usecols=cols)
college.head(3)

instnm,hbcu,relaffil,ugds,md_earn_wne_p10,grad_debt_mdn_supp
Alabama A & M University,1.0,0,4206.0,30300,33888.0
University of Alabama at Birmingham,0.0,0,11383.0,39700,21941.5
Amridge University,0.0,1,291.0,40100,23370.0


From the above visual display, it appears that all of the columns are numeric. But, unlike a Series, a DataFrame does not display each column's data type in its output, so we cannot tell from the display alone. You must access the data types with the `dtypes` attribute.

In [2]:
college.dtypes

hbcu                  float64
relaffil                int64
ugds                  float64
md_earn_wne_p10        object
grad_debt_mdn_supp     object
dtype: object

In [4]:
college.describe().T

,count,mean,std,min,25%,50%,75%,max
hbcu,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
relaffil,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
ugds,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0


In [6]:
college.describe(include='object').T

,count,unique,top,freq
md_earn_wne_p10,6413,598,PrivacySuppressed,822
grad_debt_mdn_supp,7503,2038,PrivacySuppressed,1510


You might be surprised to see that the last two columns are read in as objects, since they certainly appear to be numeric. If you look closely at the values in the `grad_debt_mdn_supp` column, you'll notice that the second value has a decimal place while the others do not. A column with the float data type is never displayed like this. Every value is shown with the same number of decimal places, so the decimals always align.

When a seemingly numeric column is read in as an object, it is a clue that there are strings in this column. One of the first things we can do to investigate this issue is output some of the values from the underlying numpy array with the `values` attribute. Here, we take a look at the first five values to see that they are strings.

In [7]:
college['grad_debt_mdn_supp'].head().values

array(['33888', '21941.5', '23370', '24097', '33118.5'], dtype=object)

If the `read_csv` function initially reads in a column with the data type object, then you can guarantee that all non-missing values will be strings. We can get the exact type of an individual value by extracting it from the DataFrame and passing it to the built-in `type` function.

In [8]:
val = college.loc['Alabama A & M University', 'grad_debt_mdn_supp']
val

'33888'

In [9]:
type(val)

str
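
Checking one value at a time does not scale to a whole column. As a minimal sketch, using a small made-up Series rather than the college data, the built-in `type` function can be mapped over every value and the resulting types counted. Missing values show up as `float` because `NaN` is itself a float.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an object column read from a CSV (values are made up)
s = pd.Series(['33888', '21941.5', 'PrivacySuppressed', np.nan], dtype=object)

# Map every value to its Python type and count how often each type occurs
type_counts = s.map(type).value_counts()
print(type_counts)
```

If anything other than `str` and `float` appears in the counts, the column holds a mix of Python objects that deserves a closer look.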

The `read_csv` function won't read in a column of data as strings unless it contains non-numeric characters. One way to find non-numeric characters is to sort the string column in descending order. Numeric characters have lower Unicode code points than alphabetic characters, so this should put the alphabetic strings at the top.
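
The ordering claim can be checked directly with the built-in `ord` function. The short sketch below uses a made-up list of strings, not the actual column.

```python
# Digits occupy code points 48-57, uppercase letters 65-90, lowercase 97-122
print(ord('0'), ord('9'), ord('A'), ord('a'))

# Made-up values: strings are compared character by character, so any string
# starting with a letter sorts after every string starting with a digit
values = ['9999', '10000', 'PrivacySuppressed', '21941.5']
descending = sorted(values, reverse=True)
print(descending)
```

Note that `'9999'` sorts above `'10000'` here because the comparison is on characters, not on numeric value.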

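Sorting the column itself in ascending order (the first cell below) pushes the alphabetic strings to the bottom, with missing values placed first. A more compact alternative is to call the `value_counts` method first and then sort the index in descending order, which places all the non-numeric values at the very top. This also reveals the frequency of occurrence.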

In [14]:
(college['grad_debt_mdn_supp'].sort_values(na_position='first'))

instnm
Lincoln University                                                         NaN
Yale-New Haven Hospital Dietetic Internship                                NaN
Beauty College of America                                                  NaN
Pope St John XXIII National Seminary                                       NaN
St Francis Medical Center-School of Radiologic Technology                  NaN
                                                                   ...        
City Vision University                                       PrivacySuppressed
Hispanic American College                                    PrivacySuppressed
Avalon School of Cosmetology                                 PrivacySuppressed
Marketti Academy of Cosmetology                              PrivacySuppressed
Pacifica Graduate Institute                                  PrivacySuppressed
Name: grad_debt_mdn_supp, Length: 7535, dtype: object

In [16]:
(college['grad_debt_mdn_supp'].value_counts()
                              .sort_index(ascending=False))

grad_debt_mdn_supp
PrivacySuppressed    1510
9999                    1
9998                    1
9993                    1
9981                    1
                     ... 
10016.5                 1
10014.5                 2
10014                   3
10005                   2
10000                  22
Name: count, Length: 2038, dtype: int64

This method isn't perfect for uncovering non-numeric strings since it's possible a string can begin with a digit only to be followed by alphabetic characters. Regular expressions (which are covered in an upcoming part) are needed to search for more specific patterns.
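
Short of regular expressions, one alternative is to let `to_numeric` do the detection: any non-missing value that fails conversion is, by definition, non-numeric. The sketch below is not the approach used in this chapter, and it runs on made-up values that include a string beginning with a digit.

```python
import numpy as np
import pandas as pd

# Made-up values, including one that starts with a digit but is not a number
s = pd.Series(['33888', '21941.5', '12abc', 'PrivacySuppressed', np.nan],
              dtype=object)

# Values that fail numeric conversion become NaN; excluding the values that
# were already missing leaves only the problematic strings
converted = pd.to_numeric(s, errors='coerce')
bad_values = s[converted.isna() & s.notna()]
print(bad_values.value_counts())
```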

### Converting non-numeric values to missing

We've uncovered one string, `'PrivacySuppressed'`, in the `grad_debt_mdn_supp` column which forced pandas to use object instead of float for its data type. If we want to use this column as a float, we'll need to convert the strings to some numeric value or a missing value.

Let's go ahead and convert the string to a missing value. As we saw previously, we cannot use the `astype` method to make this conversion because its `errors` parameter cannot be set to 'coerce'. Instead, we use the `to_numeric` function with the `errors` parameter set to 'coerce', which replaces any value that cannot be converted with a missing value. One minor annoyance is that `to_numeric` converts only a single column at a time. Below, we first display the last three rows and then overwrite both of the object columns (the other column has exactly the same issue) with two separate calls to `to_numeric`.

In [19]:
college.tail(3)

instnm,hbcu,relaffil,ugds,md_earn_wne_p10,grad_debt_mdn_supp
National Personal Training Institute of Cleveland,,1,,,6333
Bay Area Medical Academy - San Jose Satellite Location,,1,,,PrivacySuppressed
Excel Learning Center-San Antonio South,,1,,,12125


In [20]:
college['grad_debt_mdn_supp'] = pd.to_numeric(college['grad_debt_mdn_supp'], 
                                              errors='coerce')
college['md_earn_wne_p10'] = pd.to_numeric(college['md_earn_wne_p10'], 
                                           errors='coerce')
college.tail(3)

instnm,hbcu,relaffil,ugds,md_earn_wne_p10,grad_debt_mdn_supp
National Personal Training Institute of Cleveland,,1,,,6333.0
Bay Area Medical Academy - San Jose Satellite Location,,1,,,
Excel Learning Center-San Antonio South,,1,,,12125.0


Notice how the decimals align. Let's verify that the new data types are float.

In [21]:
college.dtypes

hbcu                  float64
relaffil                int64
ugds                  float64
md_earn_wne_p10       float64
grad_debt_mdn_supp    float64
dtype: object
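
When many columns need the same treatment, the separate calls can be avoided by passing `to_numeric` to the DataFrame `apply` method, which calls it once per column. This is a sketch on a made-up frame. The column names `earn`, `debt`, and `flag` are placeholders, not columns of the college dataset.

```python
import pandas as pd

# Toy frame with two string columns that share a placeholder value
df = pd.DataFrame({'earn': ['30300', 'PrivacySuppressed', '40100'],
                   'debt': ['33888', '21941.5', 'PrivacySuppressed'],
                   'flag': [0, 1, 0]})

obj_cols = ['earn', 'debt']
# apply calls pd.to_numeric once per column and forwards errors='coerce'
converted = df[obj_cols].apply(pd.to_numeric, errors='coerce')
df[obj_cols] = converted
print(df.dtypes)
```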

## The `astype` method for DataFrames

The `astype` method is still useful for DataFrames. We can convert all columns at once to a different type. Below, we convert each column to a 32-bit float and immediately verify the data types.

In [22]:
college.astype('float32').dtypes

hbcu                  float32
relaffil              float32
ugds                  float32
md_earn_wne_p10       float32
grad_debt_mdn_supp    float32
dtype: object

You can change the data type of specific columns by using a dictionary to map the column name to the desired type. Here, we change `relaffil` to an 8-bit integer and `ugds` to a 32-bit float.

In [23]:
college.astype({'relaffil': 'int8', 'ugds': 'float32'}).dtypes

hbcu                  float64
relaffil                 int8
ugds                  float32
md_earn_wne_p10       float64
grad_debt_mdn_supp    float64
dtype: object
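
Keep in mind that `astype` returns a new DataFrame and leaves the original untouched, which is why the calls above do not change `college`. The sketch below shows this on a made-up frame that borrows two column names from the college data.

```python
import pandas as pd

# Toy frame; column names borrowed from the college data, values made up
df = pd.DataFrame({'relaffil': [0, 0, 1], 'ugds': [4206.0, 11383.0, 291.0]})

converted = df.astype({'relaffil': 'int8', 'ugds': 'float32'})

print(df.dtypes)         # original dtypes are unchanged
print(converted.dtypes)  # new dtypes live only on the returned DataFrame
```

To keep the conversion, assign the result back to a variable.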

## Reading in data with known missing values

You can avoid having to use `to_numeric` if you know the missing value representation in your dataset before you read in your data. Set the `na_values` parameter of the `read_csv` function to the string that represents missing values. You can use a list to specify more values and a dictionary to specify different missing values for each column. Here, we read in our college dataset again and convert every occurrence of 'PrivacySuppressed' to missing on read. You can also use the `dtype` parameter to set the data type of specific columns on read with a dictionary.

In [25]:
college = pd.read_csv('../data/college.csv', index_col='instnm', 
                      usecols=cols, na_values='PrivacySuppressed', 
                      dtype={'relaffil': 'int8', 'ugds': 'float32'})
college.head(3)

instnm,hbcu,relaffil,ugds,md_earn_wne_p10,grad_debt_mdn_supp
Alabama A & M University,1.0,0,4206.0,30300.0,33888.0
University of Alabama at Birmingham,0.0,0,11383.0,39700.0,21941.5
Amridge University,0.0,1,291.0,40100.0,23370.0


The display suggests that the last two columns were read in as floats. Let's verify this with the `dtypes` attribute.

In [26]:
college.dtypes

hbcu                  float64
relaffil                 int8
ugds                  float32
md_earn_wne_p10       float64
grad_debt_mdn_supp    float64
dtype: object
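
The list and dictionary forms of `na_values` mentioned above can be sketched with a small in-memory CSV. The text, column names, and the extra placeholder `'unknown'` are all made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV text with two different missing-value placeholders
csv_text = """name,earn,debt
A,30300,33888
B,PrivacySuppressed,unknown
C,40100,PrivacySuppressed
"""

# A list applies every placeholder to all columns
df_list = pd.read_csv(io.StringIO(csv_text),
                      na_values=['PrivacySuppressed', 'unknown'])

# A dictionary applies placeholders column by column
df_dict = pd.read_csv(io.StringIO(csv_text),
                      na_values={'earn': ['PrivacySuppressed'],
                                 'debt': ['PrivacySuppressed', 'unknown']})
print(df_list.dtypes)
print(df_dict.dtypes)
```

In both cases the `earn` and `debt` columns come back as floats, since every remaining value is numeric.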

## More data type conversion with the housing dataset

Let's see another example of converting data types on DataFrames with the housing dataset. We begin by reading in just seven of the columns by setting the `usecols` parameter to a list of those column names.

In [27]:
cols = ['MSSubClass', 'LotConfig', 'Condition1', 'YearBuilt', 
        'ExterQual', 'OverallQual', 'SalePrice']
housing = pd.read_csv('../data/housing.csv', usecols=cols)
housing.head()

,MSSubClass,LotConfig,Condition1,OverallQual,YearBuilt,ExterQual,SalePrice
0,60,Inside,Norm,7,2003,Gd,208500
1,20,FR2,Feedr,6,1976,TA,181500
2,60,Inside,Norm,7,2001,Gd,223500
3,70,Corner,Norm,7,1915,TA,140000
4,60,FR2,Norm,8,2000,Gd,250000


Let's see how pandas has chosen to read in this data.

In [28]:
housing.dtypes

MSSubClass      int64
LotConfig      object
Condition1     object
OverallQual     int64
YearBuilt       int64
ExterQual      object
SalePrice       int64
dtype: object

According to the data dictionary, 'MSSubClass', 'LotConfig', and 'Condition1' are nominal categorical data. Both 'OverallQual' and 'ExterQual' are ordinal, and 'YearBuilt' and 'SalePrice' are integers. We create a dictionary mapping each column to its new data type.

Ordered categorical data types must be constructed with the `CategoricalDtype` constructor. We also choose unsigned integers with smaller bit sizes for the 'YearBuilt' and 'SalePrice' columns. Finally, we pass this dictionary to the `astype` method to make the conversion. It is not necessary to specify every column as we did here. Columns that are not specified retain their data type.

In [29]:
oq_dtype = pd.CategoricalDtype(range(1, 11), ordered=True)
eq_dtype = pd.CategoricalDtype(['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
dtype_dict = {'MSSubClass': 'category', 
              'LotConfig': 'category', 
              'Condition1': 'category',
              'YearBuilt': 'uint16', 
              'SalePrice': 'uint32',
              'OverallQual': oq_dtype, 
              'ExterQual': eq_dtype}
housing2 = housing.astype(dtype_dict)
housing2.dtypes

MSSubClass     category
LotConfig      category
Condition1     category
OverallQual    category
YearBuilt        uint16
ExterQual      category
SalePrice        uint32
dtype: object
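
To illustrate what the `ordered=True` setting provides, the sketch below applies the same exterior-quality ordering to a handful of made-up ratings. Comparisons, minimum and maximum, and sorting all follow the declared category order instead of alphabetical order.

```python
import pandas as pd

eq_dtype = pd.CategoricalDtype(['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
# Made-up quality ratings
qual = pd.Series(['Gd', 'TA', 'Ex', 'Fa'], dtype=eq_dtype)

# Comparisons follow the category order, not alphabetical order
above_typical = qual > 'TA'
print(above_typical.tolist())
print(qual.min(), qual.max())

# Sorting also follows the declared order
print(qual.sort_values().tolist())
```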

This data type conversion may be done on read by using the same dictionary passed to the `dtype` parameter.

In [33]:
housing2 = pd.read_csv('../data/housing.csv', usecols=cols, dtype=dtype_dict)
housing2.dtypes

MSSubClass     category
LotConfig      category
Condition1     category
OverallQual    category
YearBuilt        uint16
ExterQual      category
SalePrice        uint32
dtype: object

The amount of memory saved by converting to these new data types is substantial, with most of it coming from the change to categorical. Let's get the memory usage for both the original and converted datasets.

In [34]:
orig_mem = housing.memory_usage(index=False, deep=True)
orig_mem

MSSubClass     11680
LotConfig      91921
Condition1     89237
OverallQual    11680
YearBuilt      11680
ExterQual      86140
SalePrice      11680
dtype: int64

In [35]:
new_mem = housing2.memory_usage(index=False, deep=True)
new_mem

MSSubClass     2905
LotConfig      1942
Condition1     2312
OverallQual    1592
YearBuilt      2920
ExterQual      1927
SalePrice      5840
dtype: int64

Summing each of these two Series gives the total memory used by each dataset. Dividing the new total by the original total and subtracting from one reveals a reduction in memory of about 94 percent.

In [36]:
1 - new_mem.sum() / orig_mem.sum()

np.float64(0.9380990898610908)
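
The reason categorical columns save so much memory is that each row stores a small integer code, while each distinct string is stored only once. The sketch below shows this on made-up data, not the housing dataset. The exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# 30,000 made-up rows drawn from only three distinct strings
s_obj = pd.Series(['Inside', 'Corner', 'FR2'] * 10_000, dtype=object)
s_cat = s_obj.astype('category')

obj_bytes = s_obj.memory_usage(index=False, deep=True)
cat_bytes = s_cat.memory_usage(index=False, deep=True)

# With so few categories, the integer codes fit in a single byte each
print(s_cat.cat.codes.dtype)
print(obj_bytes, cat_bytes, 1 - cat_bytes / obj_bytes)
```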

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Read in the bikes dataset and select the `tripduration` column. Find its data type and then use the `memory_usage` method to find how much memory (in bytes) it is using. Change its data type to the smallest possible type so that no information is lost. What percentage of memory has been saved?</span>

In [None]:
# Draft mapping of target data types; it is not passed to read_csv below
dtype_dict = {'gender': 'category',
              'starttime': 'datetime64[ns]',
              'stoptime': 'datetime64[ns]',
              'tripduration': 'float32',
              'from_station_name': 'category',
              'start_capacity': 'uint8',
              'to_station_name': 'category',
              'end_capacity': 'uint8',
              'events': 'category'}

bikes = pd.read_csv('../data/bikes.csv')


bikes.dtypes

gender                object
starttime             object
stoptime              object
tripduration           int64
from_station_name     object
start_capacity       float64
to_station_name       object
end_capacity         float64
temperature          float64
wind_speed           float64
events                object
dtype: object

In [43]:
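def find_categorical_candidates(df, threshold=0.05):
    """Return columns whose share of unique values is at most `threshold`.

    `threshold` is a fraction of the row count (0.05 means 5% unique).
    """
    total_rows = len(df)
    results = []

    for col in df.columns:
        unique_count = df[col].nunique(dropna=True)
        percent_unique = unique_count / total_rows

        if percent_unique <= threshold:  # good candidate
            results.append({
                'column': col,
                'unique_values': unique_count,
                '% unique': round(percent_unique * 100, 2)
            })

    # Naming the columns keeps sort_values from failing when no column qualifies
    columns = ['column', 'unique_values', '% unique']
    return pd.DataFrame(results, columns=columns).sort_values('% unique')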


In [44]:
find_categorical_candidates(bikes,0.1)

,column,unique_values,% unique
0,gender,2,0.0
8,events,11,0.02
5,end_capacity,19,0.04
3,start_capacity,20,0.04
7,wind_speed,33,0.07
6,temperature,142,0.28
4,to_station_name,595,1.19
2,from_station_name,600,1.2
1,tripduration,2442,4.88


In [64]:
bikes['events'].describe()

count            50089
unique              11
top       partlycloudy
freq             16998
Name: events, dtype: object

In [69]:
import numpy as np

np.iinfo('uint32')

iinfo(min=0, max=4294967295, dtype=uint32)

In [66]:
td = bikes['tripduration']

td.info(memory_usage=True)

<class 'pandas.core.series.Series'>
RangeIndex: 50089 entries, 0 to 50088
Series name: tripduration
Non-Null Count  Dtype
--------------  -----
50089 non-null  int64
dtypes: int64(1)
memory usage: 391.4 KB


In [70]:
td2 = bikes['tripduration'].astype('uint32')

td2.info(memory_usage='deep')

<class 'pandas.core.series.Series'>
RangeIndex: 50089 entries, 0 to 50088
Series name: tripduration
Non-Null Count  Dtype 
--------------  ----- 
50089 non-null  uint32
dtypes: uint32(1)
memory usage: 195.8 KB
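
Instead of comparing the column's range against `np.iinfo` by hand, the `downcast` parameter of `to_numeric` can choose the smallest integer type automatically. The sketch below uses made-up durations, not the bikes data, so the type chosen for the real column depends on its actual maximum. It also computes the fraction of memory saved, which the exercise asks for.

```python
import pandas as pd

# Made-up trip durations in seconds; the largest value exceeds the uint16 maximum
td = pd.Series([60, 300, 70_000], dtype='int64')

# downcast='unsigned' picks the smallest unsigned integer type that holds every value
td_small = pd.to_numeric(td, downcast='unsigned')

before = td.memory_usage(index=False)
after = td_small.memory_usage(index=False)
saved = 1 - after / before
print(td_small.dtype, saved)
```

The `'unsigned'` option assumes there are no negative values. Use `downcast='integer'` when negatives are possible.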


### Exercise 2

<span style="color:green; font-size:16px">Read in the diamonds dataset and convert the data types of each column so they use the least amount of memory without losing any information.</span>

In [76]:
diamonds = pd.read_csv('../data/diamonds.csv')

diamonds.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 12.2 MB


In [77]:
find_categorical_candidates(diamonds,0.1)

,column,unique_values,% unique
1,cut,5,0.01
2,color,7,0.01
3,clarity,8,0.01
5,table,127,0.24
4,depth,184,0.34
0,carat,273,0.51
8,z,375,0.7
7,y,552,1.02
6,x,554,1.03


In [80]:
diamonds.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [96]:
diamonds

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.20,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.70,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [117]:
diamonds['price'].max()

np.int64(18823)

In [103]:
diamonds['carat'].astype('float32').sum()

np.float32(43040.87)

In [118]:
clarity = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_cat = pd.CategoricalDtype(clarity, ordered=True)

color = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
color_cat = pd.CategoricalDtype(color, ordered=True)

cut = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut_cat = pd.CategoricalDtype(cut, ordered=True)

dtype_dict = {'carat': 'float32',
              'cut': cut_cat,
              'color': color_cat,
              'clarity': clarity_cat,
              'depth': 'float32',
              'table': 'float32',
              'price': 'uint16',
              'x': 'float32',
              'y': 'float32',
              'z': 'float32'}

diamonds2 = pd.read_csv('../data/diamonds.csv', dtype=dtype_dict)
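
Whether `float32` counts as "without losing any information" depends on how strictly that phrase is read. The sketch below, on made-up values resembling the `carat` column, compares an exact round-trip check with a tolerance-based one. It is meant as one possible way to reason about the choice, not a definitive answer to the exercise.

```python
import numpy as np
import pandas as pd

# Made-up carat-like values; 0.5 is exactly representable in binary, 0.23 is not
s64 = pd.Series([0.23, 0.21, 0.5])
s32 = s64.astype('float32')

# Exact equality after converting back to float64 fails for most decimals
exact = s32.astype('float64') == s64
# A tolerance-based comparison treats the float32 values as equivalent
close = np.isclose(s32, s64)
print(exact.tolist(), close.all())
```

If the values only need to agree to the two or three decimal places shown in the file, the tolerance-based view is usually sufficient. If bit-for-bit equality is required, the float columns would have to stay as `float64`.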