## Extracting Features from Date and Time Variables

Date and time variables contain information about dates, times, or both, and in programming, we refer to them collectively as datetime features. Date of birth, the time of an event, and the date and time of the last payment are examples of datetime variables.


Because of their nature, datetime features typically exhibit high cardinality. This means that they contain a huge number of unique values, each corresponding to a specific date and/or time combination. We don't normally use datetime variables for machine learning models in their raw format. Instead, we enrich the dataset by extracting multiple features from these variables. These new features will typically have reduced cardinality, and allow us to capture meaningful information, such as trends, seasonality, and important events and tendencies.
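
To make the cardinality argument concrete, here is a minimal sketch that compares the number of unique values in a raw datetime variable with the number of unique values in a few features derived from it. The hourly series is hypothetical example data, not part of the recipes that follow:

```python
import pandas as pd

# Hypothetical example data: one timestamp per hour over 60 days.
timestamps = pd.Series(pd.date_range("2024-01-01", periods=24 * 60, freq="h"))

# Every raw timestamp is distinct, so the cardinality equals the row count.
raw_cardinality = timestamps.nunique()

# Derived features collapse the timestamps into a few repeating values.
month_cardinality = timestamps.dt.month.nunique()
hour_cardinality = timestamps.dt.hour.nunique()
dow_cardinality = timestamps.dt.dayofweek.nunique()

print(raw_cardinality, month_cardinality, hour_cardinality, dow_cardinality)
```

The raw variable has as many unique values as rows, whereas each derived feature takes only a handful of values that repeat across the dataset.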


We'll cover the following recipes:

- Extracting features from dates with pandas
- Extracting features from time with pandas
- Capturing elapsed time between datetime variables
- Working with time in different time zones
- Automating datetime feature extraction with feature-engine

### Extracting features from dates with pandas

The values of datetime variables can be dates, times, or both. We'll begin by focusing on variables that contain dates. We rarely use raw dates with machine learning algorithms. Instead, we extract simpler features, such as the year, month, or day of the week, which allow us to capture insights such as seasonality, periodicity, and trends.


### Getting ready

The following are some of the features that we can extract from the date part of a datetime variable off the shelf using pandas:

- pandas.Series.dt.year
- pandas.Series.dt.quarter
- pandas.Series.dt.month
- pandas.Series.dt.isocalendar().week
- pandas.Series.dt.day
- pandas.Series.dt.day_of_week
- pandas.Series.dt.weekday
- pandas.Series.dt.dayofyear
- pandas.Series.dt.day_of_year
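
Several entries in this list are aliases of one another. The following sketch, built on three hypothetical dates, suggests that day_of_week and weekday return the same values (Monday is 0, Sunday is 6), and that dayofyear and day_of_year do too:

```python
import pandas as pd

# Hypothetical dates chosen to cover the start, middle, and end of a leap year.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-05-17", "2024-12-31"]))

features = pd.DataFrame({
    "date": dates,
    "day_of_week": dates.dt.day_of_week,  # Monday=0 ... Sunday=6
    "weekday": dates.dt.weekday,          # same values as day_of_week
    "dayofyear": dates.dt.dayofyear,      # 1-based day within the year
    "day_of_year": dates.dt.day_of_year,  # same values as dayofyear
})

print(features)
```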

In [41]:
import warnings
import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')

We'll start by creating 20 datetime values beginning from 2024-05-17 at midnight and followed by increments of 1 day. Then, we'll capture those values in a DataFrame instance and display the top five rows:

In [42]:
rng_ = pd.date_range("2024-05-17", periods=20, freq="D")
data = pd.DataFrame({"date": rng_})
data.head()

Unnamed: 0,date
0,2024-05-17
1,2024-05-18
2,2024-05-19
3,2024-05-20
4,2024-05-21


**NOTE**: We can check the data type of the variable by executing data["date"].dtypes. If the variable is cast as an object, we can convert it into datetime format by executing data["date_dt"] = pd.to_datetime(data["date"]).

In [43]:
data['date'].dtypes

dtype('<M8[ns]')

In [44]:
# data['date_dt'] = pd.to_datetime(data['date'])

# data.head()

In [45]:
data.dtypes

date    datetime64[ns]
dtype: object
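
In our DataFrame, the variable is already in datetime format, so no conversion is needed. As a minimal sketch of the other case, assume a hypothetical DataFrame called raw in which the dates were loaded as strings:

```python
import pandas as pd

# Hypothetical data: dates stored as plain strings, as often happens after
# reading a CSV file without date parsing.
raw = pd.DataFrame({"date": ["2024-05-17", "2024-05-18", "2024-05-19"]})

# Parse the strings into a proper datetime column.
raw["date_dt"] = pd.to_datetime(raw["date"])

# The .dt accessor is only available on the parsed column.
raw["year"] = raw["date_dt"].dt.year

print(raw.dtypes)
```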

Let's extract the year part of the date into a new column and display the top five rows:

In [46]:
data['year'] = data['date'].dt.year

data.head()

Unnamed: 0,date,year
0,2024-05-17,2024
1,2024-05-18,2024
2,2024-05-19,2024
3,2024-05-20,2024
4,2024-05-21,2024


Let's extract the quarter of the year from the date into a new column and display the top five rows:

In [47]:
data['quarter'] = data['date'].dt.quarter
data[['date','quarter']].head()

Unnamed: 0,date,quarter
0,2024-05-17,2
1,2024-05-18,2
2,2024-05-19,2
3,2024-05-20,2
4,2024-05-21,2


With quarter, we can now create the semester feature:

In [48]:
data['semester'] = np.where(data['quarter'] < 3, 1, 2)
data.head()

Unnamed: 0,date,year,quarter,semester
0,2024-05-17,2024,2,1
1,2024-05-18,2024,2,1
2,2024-05-19,2024,2,1
3,2024-05-20,2024,2,1
4,2024-05-21,2024,2,1


**NOTE**: You can explore the distinct values of the new variables with pandas' unique(), for example, by executing data["quarter"].unique() or data["semester"].unique().
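
Our 20-day sample only spans one quarter, so unique() returns a single value for each feature. The following sketch uses a hypothetical DataFrame covering a full year to show all the values and how quarters map to semesters:

```python
import numpy as np
import pandas as pd

# Hypothetical data: every day of 2024, so that all quarters are present.
year = pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-12-31", freq="D")})

year["quarter"] = year["date"].dt.quarter
year["semester"] = np.where(year["quarter"] < 3, 1, 2)

quarters = year["quarter"].unique()
semesters = year["semester"].unique()

# Which quarters fall into each semester?
mapping = year.groupby("semester")["quarter"].unique()

print(quarters, semesters)
print(mapping)
```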

Let's extract the month part of the date in a new column and display the top five rows of the DataFrame:

In [49]:
data['month'] = data['date'].dt.month
data[['date','month']].head()

Unnamed: 0,date,month
0,2024-05-17,5
1,2024-05-18,5
2,2024-05-19,5
3,2024-05-20,5
4,2024-05-21,5


Let's extract the ISO week number (an ISO year has 52 or 53 weeks) from the date:

In [50]:
data['week'] = data['date'].dt.isocalendar().week
data[['date', 'week']].head()

Unnamed: 0,date,week
0,2024-05-17,20
1,2024-05-18,20
2,2024-05-19,20
3,2024-05-20,21
4,2024-05-21,21
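
Because isocalendar() follows the ISO 8601 calendar, dates close to New Year can belong to a week of the previous or the next ISO year. The sketch below uses four hypothetical dates to illustrate this; pairing the week with the ISO year rather than the calendar year is one way to keep the two consistent:

```python
import pandas as pd

# Hypothetical dates around year boundaries.
edge = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-12-31", "2021-01-01", "2024-12-29", "2024-12-30"]
    )
})

iso = edge["date"].dt.isocalendar()  # DataFrame with year, week, and day columns
edge["calendar_year"] = edge["date"].dt.year
edge["iso_year"] = iso["year"]
edge["iso_week"] = iso["week"]

print(edge)
```

Here, 2021-01-01 falls in week 53 of ISO year 2020, and 2024-12-30 falls in week 1 of ISO year 2025.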


Let's extract the day of the month, which takes values from 1 to 31, into a new column:

In [51]:
data['day_mo'] = data['date'].dt.day
data.head()

Unnamed: 0,date,year,quarter,semester,month,week,day_mo
0,2024-05-17,2024,2,1,5,20,17
1,2024-05-18,2024,2,1,5,20,18
2,2024-05-19,2024,2,1,5,20,19
3,2024-05-20,2024,2,1,5,21,20
4,2024-05-21,2024,2,1,5,21,21


Let's extract the day of the week, where Monday is 0 and Sunday is 6:

In [52]:
data['day_week'] = data['date'].dt.dayofweek
data.head()

Unnamed: 0,date,year,quarter,semester,month,week,day_mo,day_week
0,2024-05-17,2024,2,1,5,20,17,4
1,2024-05-18,2024,2,1,5,20,18,5
2,2024-05-19,2024,2,1,5,20,19,6
3,2024-05-20,2024,2,1,5,21,20,0
4,2024-05-21,2024,2,1,5,21,21,1


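With the day of the week, we can create a binary feature that flags whether the date falls on a weekend (Saturday or Sunday):

In [53]:
data['is_weekend'] = (data['date'].dt.dayofweek > 4).astype(int)
data.head()

Unnamed: 0,date,year,quarter,semester,month,week,day_mo,day_week,is_weekend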
0,2024-05-17,2024,2,1,5,20,17,4,0
1,2024-05-18,2024,2,1,5,20,18,5,1
2,2024-05-19,2024,2,1,5,20,19,6,1
3,2024-05-20,2024,2,1,5,21,20,0,0
4,2024-05-21,2024,2,1,5,21,21,1,0
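
If numeric codes are hard to read during exploration, pandas can also return the name of the day. The sketch below is an alternative way to derive the weekend flag from day names; it rebuilds a small version of the DataFrame so that it runs on its own:

```python
import pandas as pd

data = pd.DataFrame({"date": pd.date_range("2024-05-17", periods=5, freq="D")})

# day_name() returns English names by default.
data["day_name"] = data["date"].dt.day_name()

# Flag Saturdays and Sundays by name instead of by numeric code.
data["is_weekend"] = data["day_name"].isin(["Saturday", "Sunday"]).astype(int)

print(data)
```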


### Extracting features from time with pandas

Some events occur more often at certain times of the day. For example, fraudulent activity is more likely to occur during the night or early morning. Air pollutant concentration also changes with the time of the day, with peaks at rush hour when there are more vehicles on the streets.

We can extract the following time features with pandas:

- pandas.Series.dt.hour
- pandas.Series.dt.minute
- pandas.Series.dt.second

In [54]:
import numpy as np
import pandas as pd

rng_ = pd.date_range("2024-05-01", periods=31, freq='1h15min10s')
df = pd.DataFrame({'date': rng_})

df

Unnamed: 0,date
0,2024-05-01 00:00:00
1,2024-05-01 01:15:10
2,2024-05-01 02:30:20
3,2024-05-01 03:45:30
4,2024-05-01 05:00:40
5,2024-05-01 06:15:50
6,2024-05-01 07:31:00
7,2024-05-01 08:46:10
8,2024-05-01 10:01:20
9,2024-05-01 11:16:30


Let's extract the hour, minute, and second parts and capture them in three new columns, then display the DataFrame:

In [56]:
df['hour'] = df['date'].dt.hour
df['min'] = df['date'].dt.minute
df['sec'] = df['date'].dt.second
df

Unnamed: 0,date,hour,min,sec
0,2024-05-01 00:00:00,0,0,0
1,2024-05-01 01:15:10,1,15,10
2,2024-05-01 02:30:20,2,30,20
3,2024-05-01 03:45:30,3,45,30
4,2024-05-01 05:00:40,5,0,40
5,2024-05-01 06:15:50,6,15,50
6,2024-05-01 07:31:00,7,31,0
7,2024-05-01 08:46:10,8,46,10
8,2024-05-01 10:01:20,10,1,20
9,2024-05-01 11:16:30,11,16,30
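
Once we have the hour, we can derive coarser features from it. As a minimal sketch, the code below flags observations that occur at night. The 22:00 to 05:59 window is an assumption made for illustration; the right boundaries depend on the problem:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": pd.date_range("2024-05-01", periods=31, freq="1h15min10s")}
)
df["hour"] = df["date"].dt.hour

# Assumption: "night" means 22:00 to 05:59. Adjust the limits as needed.
df["is_night"] = ((df["hour"] >= 22) | (df["hour"] < 6)).astype(int)

print(df.head(6))
```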


### Capturing the elapsed time between datetime variables

We can extract powerful features from each datetime variable individually, as we did in the previous two recipes. We can also create additional features by combining multiple datetime variables. A common example is extracting the age at the time of an event by comparing the date of birth with the date of the event.
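
The age example can be sketched as follows. The column names and dates are hypothetical, and the approach shown here (subtracting the years and then correcting for birthdays that have not yet occurred) is only one of several ways to compute age in whole years:

```python
import pandas as pd

# Hypothetical data: date of birth and date of an event for two people.
people = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-06-15", "2000-02-29"]),
    "event_date": pd.to_datetime(["2024-06-14", "2024-03-01"]),
})

birth = people["date_of_birth"].dt
event = people["event_date"].dt

# Has the birthday already occurred in the year of the event?
had_birthday = (event.month > birth.month) | (
    (event.month == birth.month) & (event.day >= birth.day)
)

# Year difference, minus one if the birthday is still to come.
people["age_years"] = event.year - birth.year - (~had_birthday).astype(int)

print(people)
```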

In [57]:
import datetime
import numpy as np
import pandas as pd

date = "2024-05-01"
rng_hr = pd.date_range(date, periods=20, freq='h')
rng_month = pd.date_range(date, periods=20, freq='ME')
df = pd.DataFrame({"date1": rng_hr, "date2": rng_month})

df.head()

Unnamed: 0,date1,date2
0,2024-05-01 00:00:00,2024-05-31
1,2024-05-01 01:00:00,2024-06-30
2,2024-05-01 02:00:00,2024-07-31
3,2024-05-01 03:00:00,2024-08-31
4,2024-05-01 04:00:00,2024-09-30


Let's capture the difference in days between the two variables in a new feature, and then display the DataFrame’s top rows:

In [58]:
df['elapse_days'] = (df['date2'] - df['date1']).dt.days
df

Unnamed: 0,date1,date2,elapse_days
0,2024-05-01 00:00:00,2024-05-31,30
1,2024-05-01 01:00:00,2024-06-30,59
2,2024-05-01 02:00:00,2024-07-31,90
3,2024-05-01 03:00:00,2024-08-31,121
4,2024-05-01 04:00:00,2024-09-30,151
5,2024-05-01 05:00:00,2024-10-31,182
6,2024-05-01 06:00:00,2024-11-30,212
7,2024-05-01 07:00:00,2024-12-31,243
8,2024-05-01 08:00:00,2025-01-31,274
9,2024-05-01 09:00:00,2025-02-28,302


### Automating the datetime feature extraction with Feature-engine

Feature-engine is a Python library for feature engineering and selection that is well suited to working with pandas DataFrames. The DatetimeFeatures() class can extract features from date and time automatically by using pandas' dt under the hood. DatetimeFeatures() allows you to extract the following features:

- Month
- Quarter
- Semester
- Year
- Week
- Day of the week
- Day of the month
- Day of the year
- Weekend
- Month start
- Month end
- Quarter start
- Quarter end
- Year start
- Year end
- Leap year
- Days in a month
- Hour
- Minute
- Second

In [59]:
import pandas as pd
from feature_engine.datetime import DatetimeFeatures

Let's create a datetime variable with 20 values, beginning from 2024-05-01 at midnight and followed by increments of 1 day. Then, we store this variable in a DataFrame:

In [61]:
rng_ = pd.date_range('2024-05-01', periods=20, freq='D')
data = pd.DataFrame({'date': rng_})

data.head()

Unnamed: 0,date
0,2024-05-01
1,2024-05-02
2,2024-05-03
3,2024-05-04
4,2024-05-05


We'll start by setting up the transformer to extract all supported datetime features:

In [62]:
dtfs = DatetimeFeatures(
    variables=None,
    features_to_extract='all'
)

DatetimeFeatures() automatically finds the variables of the datetime type, or that could be parsed as datetime when the variables parameter is set to None. Alternatively, you can pass a list with the names of the variables from which you want to extract date and time features.

In [63]:
# Let's add the date and time features to the data
dft = dtfs.fit_transform(data)

**NOTE**: By default, DatetimeFeatures() extracts the following features from each datetime variable: month, year, day_of_week, day_of_month, hour, minute, and second. We can modify this behavior through the features_to_extract parameter, as we did when we set it to "all" while setting up the transformer.

In [64]:
vars_ = [v for v in dft.columns if 'date' in v]
vars_

['date_month',
 'date_quarter',
 'date_semester',
 'date_year',
 'date_week',
 'date_day_of_week',
 'date_day_of_month',
 'date_day_of_year',
 'date_weekend',
 'date_month_start',
 'date_month_end',
 'date_quarter_start',
 'date_quarter_end',
 'date_year_start',
 'date_year_end',
 'date_leap_year',
 'date_days_in_month',
 'date_hour',
 'date_minute',
 'date_second']

In [67]:
dft[vars_].head()


Unnamed: 0,date_month,date_quarter,date_semester,date_year,date_week,date_day_of_week,date_day_of_month,date_day_of_year,date_weekend,date_month_start,date_month_end,date_quarter_start,date_quarter_end,date_year_start,date_year_end,date_leap_year,date_days_in_month,date_hour,date_minute,date_second
0,5,2,1,2024,18,2,1,122,0,1,0,0,0,0,0,1,31,0,0,0
1,5,2,1,2024,18,3,2,123,0,0,0,0,0,0,0,1,31,0,0,0
2,5,2,1,2024,18,4,3,124,0,0,0,0,0,0,0,1,31,0,0,0
3,5,2,1,2024,18,5,4,125,1,0,0,0,0,0,0,1,31,0,0,0
4,5,2,1,2024,18,6,5,126,1,0,0,0,0,0,0,1,31,0,0,0


**NOTE**: We can create specific features by passing their names to the features_to_extract parameter.

For example, to extract week and year, we set the transformer like this: dtfs = DatetimeFeatures(features_to_extract=["week", "year"]). We can also extract all supported features by setting the features_to_extract parameter to "all".

DatetimeFeatures() can also create features from variables in different time zones. Let's learn how to correctly set up the transformer in this situation.

In [69]:
df = pd.DataFrame()
df['time'] = pd.concat([
    pd.Series(
        pd.date_range(
            start='2024-08-01 09:00', 
            freq='h', 
            periods=3, 
            tz="Europe/Berlin"
        ),
    ),
    pd.Series(
        pd.date_range(
            start='2024-08-01 09:00',
            freq='h',
            periods=3,
            tz='US/Central'
        )
    )
], axis=0)

df.head()

Unnamed: 0,time
0,2024-08-01 09:00:00+02:00
1,2024-08-01 10:00:00+02:00
2,2024-08-01 11:00:00+02:00
0,2024-08-01 09:00:00-05:00
1,2024-08-01 10:00:00-05:00
