# Data Analysis in Python - X: Working with Different Data Types

## Introduction


In this lesson, we review data types and data type conversion, and introduce several functions that are useful when working with different data types in data analysis.



## Different data types

Python supports many simple and complex data types that we have worked with before. Pandas also supports several data types, including object, int64, float64, datetime64, bool, and category. Additionally, the NumPy library comes with its own data types.

When working with data, it is important to know the data types of various fields (columns) and the appropriate data types to use in function parameters and outputs. Without this knowledge, we may get unexpected or inaccurate results from the analysis. 
For the same reason, it is sometimes necessary to convert the data type of a column before performing an operation on it. 
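
As a minimal sketch of why this matters, the example below stores the same numbers once as text and once as integers. The `prices` frame and its column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical data: the same numbers stored as text and as integers
prices = pd.DataFrame({"as_text": ["10", "20", "30"], "as_int": [10, 20, 30]})

# Adding a text column to itself concatenates the strings
text_result = (prices["as_text"] + prices["as_text"]).tolist()

# Adding an integer column to itself performs arithmetic
int_result = (prices["as_int"] + prices["as_int"]).tolist()

print(text_result)  # ['1010', '2020', '3030']
print(int_result)   # [20, 40, 60]
```

The operation is the same in both cases; only the data type of the column changes the result.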

## Checking the data types of data frame columns

We can check the data types of the columns of a data frame using the `dtypes` property and/or the `info()` method.


In [15]:
# import pandas and seaborn
import pandas as pd
import seaborn as sns

# load the titanic data set from the seaborn library
titanic = sns.load_dataset("titanic")
titanic.head()

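|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |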


In [16]:
# dtypes
titanic.dtypes

survived          int64
pclass            int64
sex              object
age             float64
sibsp             int64
parch             int64
fare            float64
embarked         object
class          category
who              object
adult_male         bool
deck           category
embark_town      object
alive            object
alone              bool
dtype: object

In [17]:
# info
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
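
Once the data types are known, columns can also be selected by type. The sketch below uses the pandas `select_dtypes()` method on a small hypothetical frame (`df` and its columns are made up for the illustration).

```python
import pandas as pd

# Hypothetical frame with a mix of column types
df = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "age": [31.5, 40.0],
    "visits": [3, 7],
    "member": [True, False],
})

# Keep only numeric columns (integers and floats)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Keep only boolean columns
bool_cols = df.select_dtypes(include="bool").columns.tolist()

print(numeric_cols)  # ['age', 'visits']
print(bool_cols)     # ['member']
```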


## Converting data types

It is possible to convert a column from one data type to another as long as the corresponding type conversion is allowed.

Let's convert a numeric column to string and back using the `astype()` function.

In [18]:
# check the type of column age
titanic["age"].dtype

dtype('float64')

In [19]:
# find average age
titanic["age"].mean()

29.69911764705882

In [20]:
# convert age to string type
titanic["age"] = titanic["age"].astype(str)

In [21]:
# average age
titanic["age"].mean()

TypeError: Could not convert 22.038.026.035.035.0nan54.02.027.014.04.0... to numeric

In [22]:
# convert age back to float
titanic["age"] = titanic["age"].astype(float)
titanic["age"].dtype

dtype('float64')

In [23]:
# average age
titanic["age"].mean()

29.69911764705882

Now, let's use the `to_numeric()` function for type conversion.

In [24]:
# convert age to string
titanic["age"] = titanic["age"].astype(str)

In [25]:
# use to_numeric to convert back to float
titanic["age"] = pd.to_numeric(titanic["age"])
titanic["age"].dtype

ValueError: Unable to parse string "nan" at position 5

When `astype()` or `to_numeric()` cannot parse a string into the requested data type, it raises an error. The `to_numeric()` function has an `errors` parameter that controls what happens in that case. You can set `errors` to `raise` (raise an error, as above; this is the default), `coerce` (replace values that cannot be parsed with NaN), or `ignore` (return the input unchanged, without converting it). Note that `ignore` is deprecated as of pandas 2.2.

In [26]:
# convert age to float using coercion to NaN
titanic["age"] = pd.to_numeric(titanic["age"], errors = 'coerce')
titanic["age"].dtype

dtype('float64')

In [27]:
# display age values 
titanic["age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64
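
The difference between `raise` and `coerce` can be sketched on a small series. The `raw` series and the value "unknown" are hypothetical; they stand in for any entry that cannot be read as a number.

```python
import pandas as pd

# Hypothetical column with one entry that is not a number
raw = pd.Series(["22.0", "38.0", "unknown"])

# errors="raise" (the default) stops at the first value it cannot parse
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors="coerce" replaces values it cannot parse with NaN
coerced = pd.to_numeric(raw, errors="coerce")

print(raised)                    # True
print(coerced.isna().tolist())   # [False, False, True]
```

Coercion is convenient, but it hides bad values silently, so it is worth counting the missing values before and after the conversion.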

## The Datetime data type

The built-in `datetime` library in Python provides useful functionality for working with date and time values through the `datetime` object.

In [28]:
# import object from library
from datetime import datetime

In [29]:
# current date time
my_date_time = datetime.now()
my_date_time

datetime.datetime(2023, 11, 6, 9, 54, 39, 655734)

In [30]:
# as string
str(my_date_time)

'2023-11-06 09:54:39.655734'

In [31]:
# as a date-only string in year-month-day format
my_date_time.strftime("%Y-%m-%d")

'2023-11-06'
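
The reverse direction, from text to a `datetime` object, uses `datetime.strptime()` with the same format codes. This is a minimal sketch; the input string is made up for the illustration.

```python
from datetime import datetime

# Parse text into a datetime object; the format must match the text
parsed = datetime.strptime("2023-11-06 09:54", "%Y-%m-%d %H:%M")

# Individual components are available as attributes
print(parsed.year, parsed.month, parsed.day)  # 2023 11 6

# Format the same value in a different layout
formatted = parsed.strftime("%d/%m/%Y %H:%M")
print(formatted)  # 06/11/2023 09:54
```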

In [32]:
# create student DataFrame
students = pd.DataFrame(
    [
        [9000, 'Amir', 'A1@psu.edu'],
        [9001, 'Biko', 'b10@psu.edu'],
        [9002, 'Chen', 'C2@psu.edu'],
        [9003, 'Darren', 'd@psu.edu'],
        [9004, 'Elena', 'e@psu.edu'],
    ], 
    columns = ['ID','Name','Email']
)

students

|   | ID | Name | Email |
|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu |
| 1 | 9001 | Biko | b10@psu.edu |
| 2 | 9002 | Chen | C2@psu.edu |
| 3 | 9003 | Darren | d@psu.edu |
| 4 | 9004 | Elena | e@psu.edu |


In [33]:
# add a record created column with value 10-20-2022
students["RecordCreated"] = datetime(day = 20, month = 10, year = 2022)
students

|   | ID | Name | Email | RecordCreated |
|---|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu | 2022-10-20 |
| 1 | 9001 | Biko | b10@psu.edu | 2022-10-20 |
| 2 | 9002 | Chen | C2@psu.edu | 2022-10-20 |
| 3 | 9003 | Darren | d@psu.edu | 2022-10-20 |
| 4 | 9004 | Elena | e@psu.edu | 2022-10-20 |


In [34]:
# check data type
students["RecordCreated"].dtype

dtype('<M8[ns]')

In [35]:
# delete column
students.drop('RecordCreated', axis = 1, inplace = True)

In [36]:
# create column by assigning a string and check data type
students["RecordCreated"] = "2022-10-20"
students["RecordCreated"].dtype

dtype('O')

In [37]:
# convert to date time
students["RecordCreated"] = pd.to_datetime(students["RecordCreated"], format = "%Y-%m-%d")
students["RecordCreated"].dtype

dtype('<M8[ns]')

In [38]:
# display data
students['RecordCreated']

0   2022-10-20
1   2022-10-20
2   2022-10-20
3   2022-10-20
4   2022-10-20
Name: RecordCreated, dtype: datetime64[ns]

In [39]:
# display data in a different format
students["RecordCreated"].dt.strftime("%d-%m-%y")

0    20-10-22
1    20-10-22
2    20-10-22
3    20-10-22
4    20-10-22
Name: RecordCreated, dtype: object
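
Like `to_numeric()`, the `to_datetime()` function accepts an `errors` parameter. The sketch below, using a hypothetical `raw_dates` series, shows how an invalid date is turned into `NaT` (the datetime counterpart of NaN) instead of raising an error.

```python
import pandas as pd

# Hypothetical text column with one invalid date (month 13 does not exist)
raw_dates = pd.Series(["2022-10-20", "2022-13-45", "2022-11-01"])

# Invalid entries become NaT instead of raising an error
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())  # [False, True, False]
```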

In [40]:
# add start year and start month columns
students['StartYear'] = [2020, 2019, 2020, 2021, 2019]
students['StartMonth'] = [8,8,8,1,1]

In [41]:
# create start date column
students["StartDate"] = pd.to_datetime("1" + "-" + students["StartMonth"].astype(str) + "-" + students["StartYear"].astype(str), format = "%d-%m-%Y")
students.head()

|   | ID | Name | Email | RecordCreated | StartYear | StartMonth | StartDate |
|---|---|---|---|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 1 | 9001 | Biko | b10@psu.edu | 2022-10-20 | 2019 | 8 | 2019-08-01 |
| 2 | 9002 | Chen | C2@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 3 | 9003 | Darren | d@psu.edu | 2022-10-20 | 2021 | 1 | 2021-01-01 |
| 4 | 9004 | Elena | e@psu.edu | 2022-10-20 | 2019 | 1 | 2019-01-01 |


In [42]:
# check data types
students.dtypes

ID                        int64
Name                     object
Email                    object
RecordCreated    datetime64[ns]
StartYear                 int64
StartMonth                int64
StartDate        datetime64[ns]
dtype: object

In [43]:
# display start date in day-month-year format
students['StartDate'].dt.strftime("%d-%m-%Y")

0    01-08-2020
1    01-08-2019
2    01-08-2020
3    01-01-2021
4    01-01-2019
Name: StartDate, dtype: object
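
As an alternative to building a date string by concatenation, `to_datetime()` can also assemble dates from a data frame whose columns are named `year`, `month`, and `day`. The sketch below assumes a small hypothetical `starts` frame that mirrors the `StartYear` and `StartMonth` columns.

```python
import pandas as pd

# Hypothetical frame mirroring the StartYear and StartMonth columns
starts = pd.DataFrame({"StartYear": [2020, 2021], "StartMonth": [8, 1]})

# Rename to the column names to_datetime expects and add day 1 for every row
parts = starts.rename(columns={"StartYear": "year", "StartMonth": "month"}).assign(day=1)

# Assemble the datetime column from the year, month, and day columns
starts["StartDate"] = pd.to_datetime(parts)

print(starts["StartDate"].dt.strftime("%Y-%m-%d").tolist())  # ['2020-08-01', '2021-01-01']
```

This avoids the intermediate string conversion and the need to keep a format string in sync with the concatenation order.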

In [44]:
# current date time
current_date=datetime.now()

In [45]:
# time in program
time_in_program=current_date-students['StartDate']
time_in_program

0   1192 days 09:54:39.809274
1   1558 days 09:54:39.809274
2   1192 days 09:54:39.809274
3   1039 days 09:54:39.809274
4   1770 days 09:54:39.809274
Name: StartDate, dtype: timedelta64[ns]

In [46]:
# complete days from time in program
time_in_program.dt.days

0    1192
1    1558
2    1192
3    1039
4    1770
Name: StartDate, dtype: int64

In [47]:
# seconds in the time beyond complete days in time in program
time_in_program.dt.seconds

0    35679
1    35679
2    35679
3    35679
4    35679
Name: StartDate, dtype: int64

In [48]:
# overall seconds in the time in program
time_in_program.dt.total_seconds()

0    1.030245e+08
1    1.346469e+08
2    1.030245e+08
3    8.980528e+07
4    1.529637e+08
Name: StartDate, dtype: float64
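
A duration in days can be turned into an approximate number of years. The sketch below uses fixed, hypothetical dates (rather than `datetime.now()`) so that the result is reproducible, and assumes 365.25 days per year as a rough average.

```python
import pandas as pd

# Fixed hypothetical dates so the result is reproducible
start = pd.Series(pd.to_datetime(["2020-08-01", "2019-01-01"]))
as_of = pd.Timestamp("2023-11-06")

# Subtracting datetimes gives a timedelta series
elapsed = as_of - start

# Whole days, then an approximate number of years (365.25 days per year)
days = elapsed.dt.days
years = (days / 365.25).round(2)

print(days.tolist())  # [1192, 1770]
print(years.tolist())
```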

### Boolean subsetting with datetime values

In [49]:
students

|   | ID | Name | Email | RecordCreated | StartYear | StartMonth | StartDate |
|---|---|---|---|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 1 | 9001 | Biko | b10@psu.edu | 2022-10-20 | 2019 | 8 | 2019-08-01 |
| 2 | 9002 | Chen | C2@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 3 | 9003 | Darren | d@psu.edu | 2022-10-20 | 2021 | 1 | 2021-01-01 |
| 4 | 9004 | Elena | e@psu.edu | 2022-10-20 | 2019 | 1 | 2019-01-01 |


In [50]:
# retrieve all students who started on or after 1-1-2021 
students[students['StartDate']>= datetime(day=1, month=1, year=2021)]

|   | ID | Name | Email | RecordCreated | StartYear | StartMonth | StartDate |
|---|---|---|---|---|---|---|---|
| 3 | 9003 | Darren | d@psu.edu | 2022-10-20 | 2021 | 1 | 2021-01-01 |


In [51]:
# retrieve all students who started between 8-1-2019 and 1-1-2021 (inclusive)
students[(students['StartDate']>=datetime(day=1, month=8, year=2019))&(students['StartDate']<=datetime(day=1, month=1, year=2021))]

|   | ID | Name | Email | RecordCreated | StartYear | StartMonth | StartDate |
|---|---|---|---|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 1 | 9001 | Biko | b10@psu.edu | 2022-10-20 | 2019 | 8 | 2019-08-01 |
| 2 | 9002 | Chen | C2@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 3 | 9003 | Darren | d@psu.edu | 2022-10-20 | 2021 | 1 | 2021-01-01 |


In [52]:
# retrieve all students who started in the fall
students[students['StartMonth']==8]

|   | ID | Name | Email | RecordCreated | StartYear | StartMonth | StartDate |
|---|---|---|---|---|---|---|---|
| 0 | 9000 | Amir | A1@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
| 1 | 9001 | Biko | b10@psu.edu | 2022-10-20 | 2019 | 8 | 2019-08-01 |
| 2 | 9002 | Chen | C2@psu.edu | 2022-10-20 | 2020 | 8 | 2020-08-01 |
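
The same kinds of filters can be written directly against a datetime column. The sketch below, on a hypothetical three-row frame, uses the `.dt.month` accessor instead of a separate month column and the `between()` method for a date range.

```python
import pandas as pd

# Hypothetical frame with a datetime column
df = pd.DataFrame({
    "Name": ["Amir", "Biko", "Darren"],
    "StartDate": pd.to_datetime(["2020-08-01", "2019-08-01", "2021-01-01"]),
})

# Filter on a date component taken directly from the datetime column
fall_starts = df[df["StartDate"].dt.month == 8]

# between() includes both endpoints by default
in_range = df[df["StartDate"].between(pd.Timestamp("2019-08-01"), pd.Timestamp("2020-08-01"))]

print(fall_starts["Name"].tolist())  # ['Amir', 'Biko']
print(in_range["Name"].tolist())     # ['Amir', 'Biko']
```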


## The Categorical data type

In [53]:
# check the type of column deck
titanic['deck'].dtype

CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F', 'G'], ordered=False)

In [54]:
# what are the unique values in column deck
titanic['deck'].unique()

[NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']

In [55]:
# convert deck to a string type
titanic['deck']=titanic['deck'].astype(str)
titanic['deck'].dtype

dtype('O')

In [56]:
# convert deck back to category type
titanic['deck']=titanic['deck'].astype('category')
titanic['deck'].dtype

CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'nan'], ordered=False)

Note: In the conversion above, the string 'nan' was treated as an additional category instead of a missing value. To fix this, use the following code.

In [57]:
# convert deck to string type
titanic['deck']=titanic['deck'].astype(str)

In [61]:
# convert nan string to missing values
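import numpy as np
# assign the result back instead of using inplace; np.nan is the supported spelling (np.NaN was removed in NumPy 2.0)
titanic['deck'] = titanic['deck'].replace('nan', np.nan)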


In [59]:
# convert deck to category type
titanic['deck']=titanic['deck'].astype('category')
titanic['deck'].dtype

CategoricalDtype(categories=['A', 'B', 'C', 'D', 'E', 'F', 'G'], ordered=False)
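
Categories can also carry an order. The sketch below is an illustration on a hypothetical `classes` series: it declares the categories explicitly with `pd.CategoricalDtype` and `ordered=True`, which makes comparisons and `max()` meaningful.

```python
import pandas as pd

# Hypothetical passenger classes stored as text
classes = pd.Series(["Third", "First", "Second", "Third"])

# Declare the categories and their order explicitly (lowest to highest)
class_type = pd.CategoricalDtype(["Third", "Second", "First"], ordered=True)
classes = classes.astype(class_type)

# Ordered categories support comparisons and min/max
above_third = classes[classes > "Third"].tolist()

print(above_third)    # ['First', 'Second']
print(classes.max())  # First
```

Without `ordered=True`, the comparison `classes > "Third"` raises a `TypeError`, because unordered categories have no notion of greater or less.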