# **Guided Lab 343.3.16 - Convert date-like strings to Pandas DateTime Series using to_datetime()**

## **Lab Overview:**
In this lab, we will demonstrate how to convert a list of date-like strings to pandas DateTime format.

Pandas has a built-in function called **`to_datetime()`** that converts date-like strings to DateTime format. This function is especially useful when working with date and time data.

This lab is suitable for beginners in data analysis who want to enhance their skills in handling date and time data.

## **Lab Objective:**

By the end of this lab, you will be able to:

- Describe the role of the pandas.to_datetime() function in converting date-like strings.
- Use the function to convert a list of date-like strings to pandas DateTime objects.
- Handle different date formats and deal with common parsing challenges.

---
# **Begin:**
### **Example: Convert List of String Date to Pandas DateTime Series**

In the example below, we define a list of date-like strings and then convert it to pandas DateTime format.

In [1]:
import pandas as pd

In [2]:
# Named date_strings rather than `input`, which would shadow the Python built-in
date_strings = ['2023-01-01', '2023-01-02', '2023-02-06']
print("Original Date Strings:")
print(date_strings)
# Record the type of the original object (a plain Python list)
x = type(date_strings)

print("\n======= Before convert============")
print("datatype is " ,x)

print("\n======= after convert ============")

# Convert the list of strings to a DatetimeIndex
output = pd.to_datetime(date_strings)
print("\nConverted DateTime Objects:")

print("Output: ", output)
# Record and display the type of the converted object
y = type(output)
print(y)



Original Date Strings:
['2023-01-01', '2023-01-02', '2023-02-06']

datatype is  <class 'list'>


Converted DateTime Objects:
Output:  DatetimeIndex(['2023-01-01', '2023-01-02', '2023-02-06'], dtype='datetime64[ns]', freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
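The output above is a DatetimeIndex because the input was a list. As a minimal sketch (the variable names `single`, `from_list`, and `from_series` are illustrative and not part of the lab), the return type of `pd.to_datetime()` follows the type of its input:

```python
import pandas as pd

# A single string is converted to a pandas Timestamp
single = pd.to_datetime('2023-01-01')

# A list of strings is converted to a DatetimeIndex
from_list = pd.to_datetime(['2023-01-01', '2023-01-02'])

# A Series of strings stays a Series, now holding datetime values
from_series = pd.to_datetime(pd.Series(['2023-01-01', '2023-01-02']))

print(type(single).__name__)
print(type(from_list).__name__)
print(type(from_series).__name__, from_series.dtype)
```

This matters later in the lab: converting a DataFrame column gives back a Series, which can be assigned straight back to the column.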


**Next, let's demonstrate the flexibility of pd.to_datetime() in handling a variety of date formats within one input list. It can parse each format and convert them all to a consistent DatetimeIndex.**

In [None]:
# Create a list of dates written in several different formats
input_list = ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
print("Original Date Strings:")
print(input_list)
# Record the type of the original object (a plain Python list)
x = type(input_list)
print("\n======= Before convert============")
print("datatype is " ,x)


print("\n======= after convert ============")

# format='mixed' infers the format of each string individually
output = pd.to_datetime(input_list, format='mixed')
print("\nConverted DateTime Objects:")
print("datatype: ", output)
# Record and display the type of the converted object
y = type(output)
print(y)



Original Date Strings:
['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']

datatype is  <class 'list'>


Converted DateTime Objects:
datatype:  DatetimeIndex(['2023-01-01 00:00:00', '2023-01-02 00:00:00',
               '2020-03-10 14:30:45', '2023-10-13 00:00:00'],
              dtype='datetime64[ns]', freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


## **Example: String column to datetime**
### **Dealing with Ambiguities:**

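Date strings can be ambiguous: is '01/02/2023' January 2nd or February 1st? By default, pd.to_datetime() treats the first number as the month, and the dayfirst argument changes this. The example below converts a DataFrame column whose values use several different formats.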

In [8]:
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','Eric'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
})

print("\n====== before convert ============")
print("Original Date Strings:")
print(df)
print()
print(df.info())
print()
print("Data Type of date_of_birth column: ", df['date_of_birth'].dtype)
print(type(df["date_of_birth"]))
print(df['date_of_birth'])

print("\n======= after convert============")

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='mixed')
print()
print(df.info())
print()

print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

print("\nConverted DateTime Objects:")
df['date_of_birth']
df




Original Date Strings:
   patientID     name          date_of_birth
0        101    alice             2023-01-01
1         23      bob             2023-01-02
2         48  charlie       3/10/2020 143045
3         49     Eric  13th of October, 2023

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   patientID      4 non-null      int64 
 1   name           4 non-null      object
 2   date_of_birth  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None

Data Type of date_of_birth column:  object
<class 'pandas.core.series.Series'>
0               2023-01-01
1               2023-01-02
2         3/10/2020 143045
3    13th of October, 2023
Name: date_of_birth, dtype: object


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype         

|   | patientID | name | date_of_birth |
|---|---|---|---|
| 0 | 101 | alice | 2023-01-01 00:00:00 |
| 1 | 23 | bob | 2023-01-02 00:00:00 |
| 2 | 48 | charlie | 2020-03-10 14:30:45 |
| 3 | 49 | Eric | 2023-10-13 00:00:00 |
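To make the ambiguity concrete, here is a small sketch using the string '01/02/2023' from the text above. The variable names are illustrative; `dayfirst` and `format` are standard `pd.to_datetime()` arguments.

```python
import pandas as pd

ambiguous = '01/02/2023'

# Default behaviour: the first number is read as the month
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: the first number is read as the day
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# An explicit format removes the guesswork entirely
explicit = pd.to_datetime(ambiguous, format='%d/%m/%Y')

print(month_first.date())  # 2023-01-02
print(day_first.date())    # 2023-02-01
print(explicit.date())     # 2023-02-01
```

When the layout of the data is known in advance, an explicit format is usually the safer choice, because nothing is left to inference.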


## **Example : String column to datetime, custom format**

The **format** argument of the **to_datetime()** function lets you pass a custom format. For example, say you want to parse strings with the following timestamp format:
- YYYY-MM-DD HH:MM:SS, written as `%Y-%m-%d %H:%M:%S`.
- [Click here to see all formats](https://strftime.org/)

If a date string does not match the given format, to_datetime() raises a ValueError, as shown in the example below.


**Note:** The example below raises an error because the strings are written as YYYY-MM-DD, while the format passed is '%d/%m/%Y'.



In [9]:
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','keyla'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
})

print("\n======= before============")
#print(df.info())
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)
#print(df['date_of_birth'])
df


print("\n======= after============")

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='%d/%m/%Y')
df


#print(df.info())
print(df['date_of_birth'].dtype)
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

df


Data Type of date_of_birth column:  object



ValueError: time data "2023-01-01" doesn't match format "%d/%m/%Y", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
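For comparison, here is a sketch of calls where the format does match the data. The Series names are illustrative; the first Series reuses the date strings from the example above, and the other two are made-up sample values.

```python
import pandas as pd

# ISO-style strings need a year-month-day format
iso_dates = pd.Series(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'])
parsed_iso = pd.to_datetime(iso_dates, format='%Y-%m-%d')

# '%d/%m/%Y' only fits strings written as day/month/year
slash_dates = pd.Series(['01/01/2023', '02/01/2023'])
parsed_slash = pd.to_datetime(slash_dates, format='%d/%m/%Y')

# A timestamp with a time part needs the time directives as well
with_time = pd.to_datetime(pd.Series(['2023-01-01 14:30:45']),
                           format='%Y-%m-%d %H:%M:%S')

print(parsed_iso)
print(parsed_slash)
print(with_time)
```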

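**Handling Parsing Errors**
You can set the **errors** argument to **'coerce'** to turn values that cannot be parsed into NaT instead of raising an error. The older option **'ignore'** returns the input unchanged and is deprecated, as the comment in the example below notes.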




In [None]:
import pandas as pd
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','keyla'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
})

print("======= before============")
#print(df.info())
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)
#print(df['date_of_birth'])
df


print("\n======= after============")

# errors='ignore' is deprecated & will raise an error in a future version. Use to_datetime() without passing `errors` & catch exceptions explicitly instead
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y', errors='ignore')
df


print(df.info())
print(df['date_of_birth'].dtype)
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

df

Data Type of date_of_birth column:  object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   patientID      4 non-null      int64 
 1   name           4 non-null      object
 2   date_of_birth  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None
object
Data Type of date_of_birth column:  object


  df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y', errors='ignore')


|   | patientID | name | date_of_birth |
|---|---|---|---|
| 0 | 101 | alice | 2023-01-01 |
| 1 | 23 | bob | 2023-01-02 |
| 2 | 48 | charlie | 2023-01-03 |
| 3 | 49 | keyla | 2023-01-04 |
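The deprecation comment in the code above suggests calling to_datetime() without `errors` and catching the exception explicitly. One way to do that is sketched below; the names `dates` and `converted` are illustrative, and falling back to the original strings is an assumption about what you want on failure.

```python
import pandas as pd

dates = pd.Series(['2023-01-01', '2023-01-02'])

try:
    # The format deliberately does not match, so this raises ValueError
    converted = pd.to_datetime(dates, format='%d/%m/%Y')
except ValueError as exc:
    print("Could not parse the column:", exc)
    # Fall back to the untouched strings, as errors='ignore' used to do
    converted = dates

print(converted)
```

Unlike errors='ignore', this makes the failure visible, so a column is never left as strings without anyone noticing.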


**Another approach:** We can use the arguments **format='mixed'** and **dayfirst=True** for ambiguous dates, as shown in the following example.

In [11]:
df = pd.DataFrame({
    'patientID': [101, 23, 48, 49],
    'name': ['alice', 'bob', 'charlie', 'Eric'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
})

print("\n====== before convert ============")
print("Original Date Strings:")
print(df)
print("Data Type of date_of_birth column: ", df['date_of_birth'].dtype)
print(type(df["date_of_birth"]))
print(df['date_of_birth'])

print("\n======= after convert============")

# Use format='mixed' and dayfirst=True for ambiguous dates
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='mixed', dayfirst=True, errors='coerce')

print("Data Type of date_of_birth column: ", df['date_of_birth'].dtype)

print("\nConverted DateTime Objects:")
print(df)


Original Date Strings:
   patientID     name          date_of_birth
0        101    alice             2023-01-01
1         23      bob             2023-01-02
2         48  charlie       3/10/2020 143045
3         49     Eric  13th of October, 2023
Data Type of date_of_birth column:  object
<class 'pandas.core.series.Series'>
0               2023-01-01
1               2023-01-02
2         3/10/2020 143045
3    13th of October, 2023
Name: date_of_birth, dtype: object

Data Type of date_of_birth column:  datetime64[ns]

Converted DateTime Objects:
   patientID     name       date_of_birth
0        101    alice 2023-01-01 00:00:00
1         23      bob 2023-01-02 00:00:00
2         48  charlie 2020-10-03 14:30:45
3         49     Eric 2023-10-13 00:00:00


**Explanation of Changes**

- **format='mixed':** This tells pd.to_datetime() to infer the format of each date string individually, allowing for different formats within the column.
- **errors='coerce':** Converts any value that cannot be parsed to NaT instead of raising an error.
- **dayfirst=True:** Treats the first number in an ambiguous date string as the day, so '3/10/2020' is read as 3 October 2020 rather than March 10, 2020.
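None of the strings in the example above were invalid, so no NaT appeared in its output. The sketch below uses made-up sample values (`raw` is illustrative) to show what errors='coerce' does when parsing fails:

```python
import pandas as pd

# One valid date, one piece of text, and one impossible date (30 February)
raw = pd.Series(['2023-01-01', 'not a date', '2023-02-30'])

parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)

# Count the values that failed to parse, and list the original strings
print("Unparsed values:", parsed.isna().sum())
print(raw[parsed.isna()])
```

Checking `isna()` after a coerced conversion is a quick way to find the rows that need cleaning before any analysis.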

# **Specifying datetime format when importing a CSV file**

**When we create a DataFrame by importing a CSV file, the date/time values are considered string objects, not DateTime objects.**


Let's import the CSV dataset into a Pandas DataFrame and check the data type of the date column.

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

In [12]:
url='https://raw.githubusercontent.com/bprasad26/lwd/master/data/tesla_stock_prices.csv'
# Assuming the date format is YYYY-MM-DD
tesla_df = pd.read_csv(url, parse_dates=['Date'], date_format='%Y-%m-%d')
# Display the DataFrame

tesla_df

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 |
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 16.110001 | 6866900 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1786 | 2017-08-02 | 318.940002 | 327.119995 | 311.220001 | 325.890015 | 325.890015 | 13091500 |
| 1787 | 2017-08-03 | 345.329987 | 350.000000 | 343.149994 | 347.089996 | 347.089996 | 13535000 |
| 1788 | 2017-08-04 | 347.000000 | 357.269989 | 343.299988 | 356.910004 | 356.910004 | 9198400 |
| 1789 | 2017-08-07 | 357.350006 | 359.480011 | 352.750000 | 355.170013 | 355.170013 | 6276900 |


This is great! It looks like everything worked fine. Not so fast: let's check the data types of the columns in the dataset. We can do this using the **.info()** method and the **dtypes** attribute.

In [13]:
tesla_df.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
dtype: object

In [14]:
tesla_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1791 entries, 0 to 1790
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       1791 non-null   datetime64[ns]
 1   Open       1791 non-null   float64       
 2   High       1791 non-null   float64       
 3   Low        1791 non-null   float64       
 4   Close      1791 non-null   float64       
 5   Adj Close  1791 non-null   float64       
 6   Volume     1791 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 98.1 KB


We can see that the data type of the Date column is datetime64[ns], because we passed parse_dates to read_csv(). Without parse_dates, the column would be read as object, meaning the dates are stored as strings and you can't access the slew of DateTime functionality available in Pandas.
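To see the effect of parse_dates in isolation, the sketch below reads the same small CSV text twice. The in-memory CSV, built from the first two rows of the Tesla data shown above, and the names `csv_text`, `plain`, and `parsed` are illustrative stand-ins for the real file.

```python
import io
import pandas as pd

# A tiny CSV held in memory, standing in for the downloaded file
csv_text = "Date,Volume\n2010-06-29,18766300\n2010-06-30,17187100\n"

# Without parse_dates, the Date column is read as plain strings
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the Date column is converted while reading
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print("Without parse_dates:", plain['Date'].dtype)
print("With parse_dates:   ", parsed['Date'].dtype)
```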

## **Using to_datetime to Convert Columns to DateTime**

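The pandas to_datetime() function converts date/time values stored in a DataFrame column into DateTime objects. Having date/time values as DateTime objects makes manipulating them much easier. This is the step to use when a column was loaded as strings; on a column that is already datetime64[ns], the call leaves the values unchanged. Run the following statement and check the result: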

In [15]:
# Convert the 'Date' column to pandas DateTime format
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'], errors='coerce')
tesla_df['Date']

0      2010-06-29
1      2010-06-30
2      2010-07-01
3      2010-07-02
4      2010-07-06
          ...    
1786   2017-08-02
1787   2017-08-03
1788   2017-08-04
1789   2017-08-07
1790   2017-08-08
Name: Date, Length: 1791, dtype: datetime64[ns]

In [16]:
# Display the DataFrame with the converted 'Date' column
print("\nDataFrame with Converted 'Date' Column:")
print(tesla_df.head())


DataFrame with Converted 'Date' Column:
        Date       Open   High        Low      Close  Adj Close    Volume
0 2010-06-29  19.000000  25.00  17.540001  23.889999  23.889999  18766300
1 2010-06-30  25.790001  30.42  23.299999  23.830000  23.830000  17187100
2 2010-07-01  25.000000  25.92  20.270000  21.959999  21.959999   8218800
3 2010-07-02  23.000000  23.10  18.709999  19.200001  19.200001   5139800
4 2010-07-06  20.000000  20.00  15.830000  16.110001  16.110001   6866900


**Let's verify the data type of the Date column**

In [17]:
print(tesla_df['Date'].dtype)

datetime64[ns]


In [18]:
tesla_df.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
dtype: object

In [19]:
tesla_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1791 entries, 0 to 1790
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       1791 non-null   datetime64[ns]
 1   Open       1791 non-null   float64       
 2   High       1791 non-null   float64       
 3   Low        1791 non-null   float64       
 4   Close      1791 non-null   float64       
 5   Adj Close  1791 non-null   float64       
 6   Volume     1791 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 98.1 KB


## **Extract month and year from the 'Date' column**



In [21]:
url='https://raw.githubusercontent.com/bprasad26/lwd/master/data/tesla_stock_prices.csv'
# Assuming the date format is YYYY-MM-DD
tesla_df = pd.read_csv(url, parse_dates=['Date'], date_format='%Y-%m-%d')
# Display the DataFrame

tesla_df

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 |
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 16.110001 | 6866900 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1786 | 2017-08-02 | 318.940002 | 327.119995 | 311.220001 | 325.890015 | 325.890015 | 13091500 |
| 1787 | 2017-08-03 | 345.329987 | 350.000000 | 343.149994 | 347.089996 | 347.089996 | 13535000 |
| 1788 | 2017-08-04 | 347.000000 | 357.269989 | 343.299988 | 356.910004 | 356.910004 | 9198400 |
| 1789 | 2017-08-07 | 357.350006 | 359.480011 | 352.750000 | 355.170013 | 355.170013 | 6276900 |


### **.dt accessor**
- A `Series` with datetime-like (or period-like) values has a `.dt` accessor that succinctly returns datetime properties of those values. Each property returns a Series indexed like the original Series.

In [22]:
tesla_df['Month'] = tesla_df['Date'].dt.month
print("\n  Month:")
tesla_df['Month'].head(10)





  Month:


0    6
1    6
2    7
3    7
4    7
5    7
6    7
7    7
8    7
9    7
Name: Month, dtype: int32

In [23]:
tesla_df['Year'] = tesla_df['Date'].dt.year
print("\n  Year:")
tesla_df['Year'].head(10)



  Year:


0    2010
1    2010
2    2010
3    2010
4    2010
5    2010
6    2010
7    2010
8    2010
9    2010
Name: Year, dtype: int32

In [24]:
tesla_df['day_name_sample'] = tesla_df['Date'].dt.day_name()
tesla_df['day_name_sample'].head(20)

0       Tuesday
1     Wednesday
2      Thursday
3        Friday
4       Tuesday
5     Wednesday
6      Thursday
7        Friday
8        Monday
9       Tuesday
10    Wednesday
11     Thursday
12       Friday
13       Monday
14      Tuesday
15    Wednesday
16     Thursday
17       Friday
18       Monday
19      Tuesday
Name: day_name_sample, dtype: object
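Beyond month, year, and day name, the `.dt` accessor exposes many other properties. The sketch below shows a few of them on three dates taken from the Tesla data; the `parts` DataFrame and its column names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-06-29', '2010-07-01', '2017-08-08']))

parts = pd.DataFrame({
    'date': dates,
    'day': dates.dt.day,                   # day of the month
    'quarter': dates.dt.quarter,           # calendar quarter, 1 to 4
    'weekday': dates.dt.weekday,           # Monday=0 ... Sunday=6
    'label': dates.dt.strftime('%b %Y'),   # formatted text, e.g. month and year
})
print(parts)
```

Note that `strftime()` returns strings, so it is best kept for display; numeric properties such as `quarter` are more convenient for grouping and filtering.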

Similarly, you can access different calculated attributes. For example, you can calculate the largest and smallest dates using the .max() and .min() methods. Let’s see what this looks like:

In [25]:
# Calculating Max and Min DateTimes
print(tesla_df['Date'].max())
print(tesla_df['Date'].min())

2017-08-08 00:00:00
2010-06-29 00:00:00
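Because these values are real DateTime objects, they also support arithmetic. As a small sketch using the two dates printed above (the names `dates` and `span` are illustrative), subtracting the minimum from the maximum gives a Timedelta:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-06-29', '2017-08-08']))

# Subtracting two Timestamps yields a Timedelta
span = dates.max() - dates.min()

print(type(span).__name__)
print("Days covered:", span.days)
```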


### **Group by year and month and sum the volume for each month**

In [26]:
monthly_sales = tesla_df.groupby(['Year', 'Month'])['Volume'].sum().reset_index()
monthly_sales


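|   | Year | Month | Volume |
|---|---|---|---|
| 0 | 2010 | 6 | 35953400 |
| 1 | 2010 | 7 | 64575800 |
| 2 | 2010 | 8 | 15038200 |
| 3 | 2010 | 9 | 18045900 |
| 4 | 2010 | 10 | 6547800 |
| ... | ... | ... | ... |
| 82 | 2017 | 4 | 116950600 |
| 83 | 2017 | 5 | 148007800 |
| 84 | 2017 | 6 | 185930900 |
| 85 | 2017 | 7 | 181640000 |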
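The same monthly totals can be computed without creating separate Year and Month columns. The sketch below is one alternative, using `dt.to_period('M')` on four rows copied from the Tesla data; the names `df` and `monthly_volume` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-29', '2010-06-30', '2010-07-01', '2010-07-02']),
    'Volume': [18766300, 17187100, 8218800, 5139800],
})

# to_period('M') maps every date to its year-month, which then acts as the group key
monthly_volume = df.groupby(df['Date'].dt.to_period('M'))['Volume'].sum()
print(monthly_volume)
```

The June 2010 total here equals the 35953400 shown in the table above, since June has only these two trading days in the dataset. The July figure is smaller because this sample includes only two of the July rows.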


## **Filtering Based on Date**

Let's filter rows by date using loc[] and query().



In [31]:
# Filter data between two dates
filtered_df_twoDate = tesla_df.loc[(tesla_df['Date'] >= '2010-02-01') & (tesla_df['Date'] < '2010-08-01')]
# Display
filtered_df_twoDate

|   | Date | Open | High | Low | Close | Adj Close | Volume | Month | Year | day_name_sample |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.0 | 25.0 | 17.540001 | 23.889999 | 23.889999 | 18766300 | 6 | 2010 | Tuesday |
| 1 | 2010-06-30 | 25.790001 | 30.42 | 23.299999 | 23.83 | 23.83 | 17187100 | 6 | 2010 | Wednesday |
| 2 | 2010-07-01 | 25.0 | 25.92 | 20.27 | 21.959999 | 21.959999 | 8218800 | 7 | 2010 | Thursday |
| 3 | 2010-07-02 | 23.0 | 23.1 | 18.709999 | 19.200001 | 19.200001 | 5139800 | 7 | 2010 | Friday |
| 4 | 2010-07-06 | 20.0 | 20.0 | 15.83 | 16.110001 | 16.110001 | 6866900 | 7 | 2010 | Tuesday |
| 5 | 2010-07-07 | 16.4 | 16.629999 | 14.98 | 15.8 | 15.8 | 6921700 | 7 | 2010 | Wednesday |
| 6 | 2010-07-08 | 16.139999 | 17.52 | 15.57 | 17.459999 | 17.459999 | 7711400 | 7 | 2010 | Thursday |
| 7 | 2010-07-09 | 17.58 | 17.9 | 16.549999 | 17.4 | 17.4 | 4050600 | 7 | 2010 | Friday |
| 8 | 2010-07-12 | 17.950001 | 18.07 | 17.0 | 17.049999 | 17.049999 | 2202500 | 7 | 2010 | Monday |
| 9 | 2010-07-13 | 17.389999 | 18.639999 | 16.9 | 18.139999 | 18.139999 | 2680100 | 7 | 2010 | Tuesday |


In [33]:
filtered_df_twoDate = tesla_df.query("Date >= '2010-02-01' and Date < '2010-08-01'")
filtered_df_twoDate

|   | Date | Open | High | Low | Close | Adj Close | Volume | Month | Year | day_name_sample |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.0 | 25.0 | 17.540001 | 23.889999 | 23.889999 | 18766300 | 6 | 2010 | Tuesday |
| 1 | 2010-06-30 | 25.790001 | 30.42 | 23.299999 | 23.83 | 23.83 | 17187100 | 6 | 2010 | Wednesday |
| 2 | 2010-07-01 | 25.0 | 25.92 | 20.27 | 21.959999 | 21.959999 | 8218800 | 7 | 2010 | Thursday |
| 3 | 2010-07-02 | 23.0 | 23.1 | 18.709999 | 19.200001 | 19.200001 | 5139800 | 7 | 2010 | Friday |
| 4 | 2010-07-06 | 20.0 | 20.0 | 15.83 | 16.110001 | 16.110001 | 6866900 | 7 | 2010 | Tuesday |
| 5 | 2010-07-07 | 16.4 | 16.629999 | 14.98 | 15.8 | 15.8 | 6921700 | 7 | 2010 | Wednesday |
| 6 | 2010-07-08 | 16.139999 | 17.52 | 15.57 | 17.459999 | 17.459999 | 7711400 | 7 | 2010 | Thursday |
| 7 | 2010-07-09 | 17.58 | 17.9 | 16.549999 | 17.4 | 17.4 | 4050600 | 7 | 2010 | Friday |
| 8 | 2010-07-12 | 17.950001 | 18.07 | 17.0 | 17.049999 | 17.049999 | 2202500 | 7 | 2010 | Monday |
| 9 | 2010-07-13 | 17.389999 | 18.639999 | 16.9 | 18.139999 | 18.139999 | 2680100 | 7 | 2010 | Tuesday |
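Two further ways to select a date range are sketched below, as alternatives to the loc[] and query() calls above. The three-row DataFrame reuses Date and Close values from the Tesla data, and names such as `in_range` and `july_2010` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-29', '2010-07-13', '2017-08-02']),
    'Close': [23.889999, 18.139999, 325.890015],
})

# Option 1: Series.between(); inclusive='left' keeps start <= Date < end
start = pd.Timestamp('2010-02-01')
end = pd.Timestamp('2010-08-01')
in_range = df[df['Date'].between(start, end, inclusive='left')]
print(in_range)

# Option 2: with a sorted DatetimeIndex, a partial string selects a whole month
indexed = df.set_index('Date').sort_index()
july_2010 = indexed.loc['2010-07']
print(july_2010)
```

Setting the dates as the index tends to pay off when many range selections are needed, since partial strings such as '2010' or '2010-07' can then be used directly.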


## **Filter data for specific weekday (Wednesday)**

In [None]:
# dt.weekday numbers the days from Monday=0, so Wednesday is 2
filtered_df_week = tesla_df.loc[tesla_df['Date'].dt.weekday == 2]
filtered_df_week


|   | Date | Open | High | Low | Close | Adj Close | Volume | Month | Year | day_name_sample |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 | 6 | 2010 | Wednesday |
| 5 | 2010-07-07 | 16.400000 | 16.629999 | 14.980000 | 15.800000 | 15.800000 | 6921700 | 7 | 2010 | Wednesday |
| 10 | 2010-07-14 | 17.940001 | 20.150000 | 17.760000 | 19.840000 | 19.840000 | 4195200 | 7 | 2010 | Wednesday |
| 15 | 2010-07-21 | 20.660000 | 20.900000 | 19.500000 | 20.219999 | 20.219999 | 1252500 | 7 | 2010 | Wednesday |
| 20 | 2010-07-28 | 20.549999 | 20.900000 | 20.510000 | 20.719999 | 20.719999 | 467200 | 7 | 2010 | Wednesday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1766 | 2017-07-05 | 347.200012 | 347.239990 | 326.329987 | 327.089996 | 327.089996 | 17046700 | 7 | 2017 | Wednesday |
| 1771 | 2017-07-12 | 330.399994 | 333.100006 | 324.500000 | 329.519989 | 329.519989 | 10346100 | 7 | 2017 | Wednesday |
| 1776 | 2017-07-19 | 328.230011 | 331.649994 | 323.220001 | 325.260010 | 325.260010 | 6357000 | 7 | 2017 | Wednesday |
| 1781 | 2017-07-26 | 340.359985 | 345.500000 | 338.119995 | 343.850006 | 343.850006 | 4820800 | 7 | 2017 | Wednesday |


In [36]:
# The same filter with query(); with Monday=0, weekday 3 selects Thursday
filtered_df_week = tesla_df.query('Date.dt.weekday == 3')
filtered_df_week


|   | Date | Open | High | Low | Close | Adj Close | Volume | Month | Year | day_name_sample |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 | 7 | 2010 | Thursday |
| 6 | 2010-07-08 | 16.139999 | 17.520000 | 15.570000 | 17.459999 | 17.459999 | 7711400 | 7 | 2010 | Thursday |
| 11 | 2010-07-15 | 19.940001 | 21.500000 | 19.000000 | 19.889999 | 19.889999 | 3739800 | 7 | 2010 | Thursday |
| 16 | 2010-07-22 | 20.500000 | 21.250000 | 20.370001 | 21.000000 | 21.000000 | 957800 | 7 | 2010 | Thursday |
| 21 | 2010-07-29 | 20.770000 | 20.879999 | 20.000000 | 20.350000 | 20.350000 | 616000 | 7 | 2010 | Thursday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1767 | 2017-07-06 | 317.260010 | 320.790009 | 306.299988 | 308.829987 | 308.829987 | 19324500 | 7 | 2017 | Thursday |
| 1772 | 2017-07-13 | 330.109985 | 331.600006 | 319.970001 | 323.410004 | 323.410004 | 8594500 | 7 | 2017 | Thursday |
| 1777 | 2017-07-20 | 326.899994 | 330.220001 | 324.200012 | 329.920013 | 329.920013 | 5166200 | 7 | 2017 | Thursday |
| 1782 | 2017-07-27 | 346.000000 | 347.500000 | 326.290009 | 334.459991 | 334.459991 | 8302400 | 7 | 2017 | Thursday |
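Numeric weekday codes are easy to mix up, as the two filters above show (2 is Wednesday, 3 is Thursday). One more readable option, sketched here on four dates from the Tesla data with illustrative variable names, is to filter on the day name instead:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-06-29', '2010-06-30', '2010-07-01', '2010-07-02']))

# Compare against the day name instead of remembering that Wednesday is 2
wednesdays = dates[dates.dt.day_name() == 'Wednesday']
print(wednesdays)

# isin() selects several weekdays at once (here Wednesday and Thursday)
mid_week = dates[dates.dt.weekday.isin([2, 3])]
print(mid_week)
```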


Reference: [Official documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)