# Creating features from date and time

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

**Date and time variables contain information about dates, times, or both.**

In programming, we refer to these as datetime variables. Examples include date of birth,
time of an accident, and date of last payment.

**Datetime variables usually contain a large number of distinct values, each corresponding
to a specific combination of date and time.**

**We do not use datetime variables in their raw format when building machine learning
models.** Instead, we enrich the dataset by deriving multiple features from these variables.
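
To illustrate why the raw format is of little use, here is a minimal sketch on hypothetical toy data (the `demo` frame and its two-day range are assumptions made for illustration). It compares the number of distinct raw timestamps with the number of distinct values of one derived feature.

```python
import pandas as pd

# Hypothetical toy data: one timestamp per minute over two days.
demo = pd.DataFrame(
    {'date': pd.date_range('2019-03-05', periods=2 * 24 * 60, freq='min')}
)

# Every raw timestamp is distinct, so the raw column behaves like an ID.
n_raw = demo['date'].nunique()

# A derived part such as the hour collapses the values into a few groups.
n_hours = demo['date'].dt.hour.nunique()

print(n_raw, n_hours)
```

The raw column has as many distinct values as rows, whereas the hour takes at most 24 values and can be used like any other discrete variable.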

## Extracting date and time parts from a datetime variable

Datetime variables can take dates, times, or both as values. Rather than feeding them to a
model in their raw format, we **create additional features from them; a natural first step is
to split each timestamp into its date part and its time part.**



In [2]:
# 'min' is the minute alias; the older 'T' alias is deprecated in recent pandas
rng_ = pd.date_range('2019-03-05', periods=20, freq='min')
df = pd.DataFrame({'date': rng_})
df.head()


Unnamed: 0,date
0,2019-03-05 00:00:00
1,2019-03-05 00:01:00
2,2019-03-05 00:02:00
3,2019-03-05 00:03:00
4,2019-03-05 00:04:00


In [3]:
df.dtypes

date    datetime64[ns]
dtype: object

In [5]:
df['date_part'] = df['date'].dt.date
df.head()

Unnamed: 0,date,date_part
0,2019-03-05 00:00:00,2019-03-05
1,2019-03-05 00:01:00,2019-03-05
2,2019-03-05 00:02:00,2019-03-05
3,2019-03-05 00:03:00,2019-03-05
4,2019-03-05 00:04:00,2019-03-05


In [6]:
df['time_part'] = df['date'].dt.time
df.head()

Unnamed: 0,date,date_part,time_part
0,2019-03-05 00:00:00,2019-03-05,00:00:00
1,2019-03-05 00:01:00,2019-03-05,00:01:00
2,2019-03-05 00:02:00,2019-03-05,00:02:00
3,2019-03-05 00:03:00,2019-03-05,00:03:00
4,2019-03-05 00:04:00,2019-03-05,00:04:00
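
Note that `.dt.date` and `.dt.time` return plain Python objects, so the new columns no longer support the `.dt` accessor. As a minimal sketch (the `demo` frame and the `date_midnight` column name are assumptions for illustration), `.dt.normalize()` keeps only the date while preserving the datetime dtype:

```python
import pandas as pd

demo = pd.DataFrame(
    {'date': pd.date_range('2019-03-05', periods=5, freq='min')}
)

# .dt.date returns Python date objects, so the column dtype is object.
demo['date_part'] = demo['date'].dt.date

# .dt.normalize() resets the time to midnight but keeps a datetime64 dtype,
# so further .dt operations remain available on the result.
demo['date_midnight'] = demo['date'].dt.normalize()

print(demo.dtypes)
```

Which of the two is preferable depends on whether later steps need datetime arithmetic on the date part.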


## Deriving representations of the year and month

**Some events occur more often at certain times of the year**; for example, recruitment rates
increase after Christmas and slow down toward the summer holidays in Europe.
Businesses and organizations also evaluate performance and objectives at regular
intervals throughout the year, for example, every quarter or every semester. Deriving
these features from a date variable is therefore useful for both data analysis and
machine learning.

In [8]:
# 'ME' (month end) replaces the deprecated 'M' alias; use 'M' on pandas < 2.2
rng_ = pd.date_range('2019-03-05', periods=20, freq='ME')
df = pd.DataFrame({'date': rng_})
df.head()


Unnamed: 0,date
0,2019-03-31
1,2019-04-30
2,2019-05-31
3,2019-06-30
4,2019-07-31


In [10]:
df['year'] = df['date'].dt.year
df.head()

Unnamed: 0,date,year
0,2019-03-31,2019
1,2019-04-30,2019
2,2019-05-31,2019
3,2019-06-30,2019
4,2019-07-31,2019


In [11]:
df['month'] = df['date'].dt.month
df.head()

Unnamed: 0,date,year,month
0,2019-03-31,2019,3
1,2019-04-30,2019,4
2,2019-05-31,2019,5
3,2019-06-30,2019,6
4,2019-07-31,2019,7


In [12]:
df['quarter'] = df['date'].dt.quarter
df.head()

Unnamed: 0,date,year,month,quarter
0,2019-03-31,2019,3,1
1,2019-04-30,2019,4,2
2,2019-05-31,2019,5,2
3,2019-06-30,2019,6,2
4,2019-07-31,2019,7,3
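
The text above also mentions semesters, for which pandas has no dedicated attribute. A minimal sketch, assuming semester 1 covers January to June and semester 2 covers July to December (the `demo` frame is hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'date': pd.to_datetime(
        ['2019-03-31', '2019-06-30', '2019-07-31', '2019-12-31']
    )
})

demo['quarter'] = demo['date'].dt.quarter

# Assumption: quarters 1-2 form semester 1, quarters 3-4 form semester 2.
demo['semester'] = np.where(demo['quarter'].isin([1, 2]), 1, 2)

print(demo)
```

Organizations with a different fiscal calendar would need to adjust the mapping accordingly.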


## Creating representations of day and week

**Some events occur more often on certain days of the week**; for example, loan applications
are more likely to be submitted during the week than over weekends. Other events occur
more often during certain weeks of the year. Businesses and organizations may also want
to track key performance metrics throughout the week.


In [13]:
rng_ = pd.date_range('2019-03-05', periods=20, freq='D')
df = pd.DataFrame({'date': rng_})
df.head()

Unnamed: 0,date
0,2019-03-05
1,2019-03-06
2,2019-03-07
3,2019-03-08
4,2019-03-09


In [14]:
df['day_mo'] = df['date'].dt.day
df.head()

Unnamed: 0,date,day_mo
0,2019-03-05,5
1,2019-03-06,6
2,2019-03-07,7
3,2019-03-08,8
4,2019-03-09,9


In [15]:
df['day_week'] = df['date'].dt.dayofweek
df.head()

Unnamed: 0,date,day_mo,day_week
0,2019-03-05,5,1
1,2019-03-06,6,2
2,2019-03-07,7,3
3,2019-03-08,8,4
4,2019-03-09,9,5


In [16]:
df['day_week_name'] = df['date'].dt.day_name()
df.head()

Unnamed: 0,date,day_mo,day_week,day_week_name
0,2019-03-05,5,1,Tuesday
1,2019-03-06,6,2,Wednesday
2,2019-03-07,7,3,Thursday
3,2019-03-08,8,4,Friday
4,2019-03-09,9,5,Saturday


In [18]:
df['is_weekend'] = np.where(df['day_week_name'].isin(['Sunday', 'Saturday']), 1, 0)
df.head()

Unnamed: 0,date,day_mo,day_week,day_week_name,is_weekend
0,2019-03-05,5,1,Tuesday,0
1,2019-03-06,6,2,Wednesday,0
2,2019-03-07,7,3,Thursday,0
3,2019-03-08,8,4,Friday,0
4,2019-03-09,9,5,Saturday,1


In [19]:
df['week'] = df['date'].dt.isocalendar().week
df.head()

Unnamed: 0,date,day_mo,day_week,day_week_name,is_weekend,week
0,2019-03-05,5,1,Tuesday,0,10
1,2019-03-06,6,2,Wednesday,0,10
2,2019-03-07,7,3,Thursday,0,10
3,2019-03-08,8,4,Friday,0,10
4,2019-03-09,9,5,Saturday,1,10
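
The `is_weekend` flag above is built from day names. As an alternative sketch, the same flag can be derived from the numeric day of the week, which avoids string comparison (the `demo` frame is hypothetical, and treating Saturday and Sunday as the weekend is an assumption that does not hold in every country):

```python
import pandas as pd

demo = pd.DataFrame(
    {'date': pd.date_range('2019-03-05', periods=7, freq='D')}
)

# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are Saturday and Sunday.
demo['is_weekend'] = (demo['date'].dt.dayofweek >= 5).astype(int)

print(demo)
```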


## Extracting time parts from a time variable

**Some events occur more often at certain times of the day**; for example, fraudulent activity
is more likely to occur during the night or early morning. Organizations may also
want to track whether one event occurred shortly after another, for example, whether
sales increased right after a TV or online advertisement was displayed.
Therefore, deriving time features is extremely useful.

In [20]:
rng_ = pd.date_range('2019-03-05', periods=20, freq='1h15min10s')
df = pd.DataFrame({'date': rng_})
df.head()

Unnamed: 0,date
0,2019-03-05 00:00:00
1,2019-03-05 01:15:10
2,2019-03-05 02:30:20
3,2019-03-05 03:45:30
4,2019-03-05 05:00:40


In [21]:
df['hour'] = df['date'].dt.hour
df['min'] = df['date'].dt.minute
df['sec'] = df['date'].dt.second
df.head()

Unnamed: 0,date,hour,min,sec
0,2019-03-05 00:00:00,0,0,0
1,2019-03-05 01:15:10,1,15,10
2,2019-03-05 02:30:20,2,30,20
3,2019-03-05 03:45:30,3,45,30
4,2019-03-05 05:00:40,5,0,40


In [22]:
df['is_morning'] = np.where( (df['hour'] < 12) & (df['hour'] > 6), 1, 0 )
df.head()

Unnamed: 0,date,hour,min,sec,is_morning
0,2019-03-05 00:00:00,0,0,0,0
1,2019-03-05 01:15:10,1,15,10,0
2,2019-03-05 02:30:20,2,30,20,0
3,2019-03-05 03:45:30,3,45,30,0
4,2019-03-05 05:00:40,5,0,40,0
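
The short-time-window scenario mentioned above can be sketched as follows. The `events` frame, its column names, and the 10-minute window are all hypothetical choices made for illustration:

```python
import pandas as pd

events = pd.DataFrame({
    'ad_time': pd.to_datetime([
        '2019-03-05 10:00:00', '2019-03-05 12:00:00', '2019-03-05 18:30:00'
    ]),
    'sale_time': pd.to_datetime([
        '2019-03-05 10:04:30', '2019-03-05 13:15:00', '2019-03-05 18:39:59'
    ]),
})

# Time between the advertisement and the sale, as a Timedelta.
gap = events['sale_time'] - events['ad_time']
events['minutes_after_ad'] = gap.dt.total_seconds() / 60

# Hypothetical window: flag sales made within 10 minutes after the ad.
window = pd.Timedelta(minutes=10)
events['within_window'] = (
    (gap >= pd.Timedelta(0)) & (gap <= window)
).astype(int)

print(events)
```

Using `total_seconds()` keeps the fractional minutes, whereas `.dt.seconds` would only return the seconds component of each difference.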


## Capturing the elapsed time between datetime variables

**Datetime variables offer value individually, and they offer even more value when
combined with other datetime variables.** The most common example is deriving age from
the date of birth and a reference date, such as today, the day the customer had an
accident, or the day they requested a loan. As in these examples, we can combine several
datetime variables to derive the time elapsed between them and create more meaningful
features.

In [23]:
# 'h' and 'ME' replace the deprecated 'H' and 'M' aliases; use 'M' on pandas < 2.2
rng_hr = pd.date_range('2019-03-05', periods=20, freq='h')
rng_month = pd.date_range('2019-03-05', periods=20, freq='ME')
df = pd.DataFrame({'date1': rng_hr, 'date2': rng_month})
df.head()


Unnamed: 0,date1,date2
0,2019-03-05 00:00:00,2019-03-31
1,2019-03-05 01:00:00,2019-04-30
2,2019-03-05 02:00:00,2019-05-31
3,2019-03-05 03:00:00,2019-06-30
4,2019-03-05 04:00:00,2019-07-31


In [24]:
df['elapsed_days'] = (df['date2'] - df['date1']).dt.days
df.head()

Unnamed: 0,date1,date2,elapsed_days
0,2019-03-05 00:00:00,2019-03-31,26
1,2019-03-05 01:00:00,2019-04-30,55
2,2019-03-05 02:00:00,2019-05-31,86
3,2019-03-05 03:00:00,2019-06-30,116
4,2019-03-05 04:00:00,2019-07-31,147


In [50]:
df['months_passed'] = ((df['date2'] - df['date1']) / np.timedelta64(1, 'D')) / 30
df['months_passed'] = np.round(df['months_passed'],0)
df.head()

Unnamed: 0,date1,date2,elapsed_days,months_passed
0,2019-03-05 00:00:00,2019-03-31,26,1.0
1,2019-03-05 01:00:00,2019-04-30,55,2.0
2,2019-03-05 02:00:00,2019-05-31,86,3.0
3,2019-03-05 03:00:00,2019-06-30,116,4.0
4,2019-03-05 04:00:00,2019-07-31,147,5.0
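
The cell above approximates a month as 30 days and rounds the result. An alternative sketch counts calendar months from the year and month parts instead; note that this counts month boundaries crossed, not full months elapsed, so it is a different definition rather than a drop-in replacement (the `demo` frame is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    'date1': pd.to_datetime(['2019-03-05', '2019-03-05', '2019-12-20']),
    'date2': pd.to_datetime(['2019-03-31', '2019-07-31', '2020-01-10']),
})

# Difference in calendar months: 12 per year plus the month difference.
demo['months_passed'] = (
    (demo['date2'].dt.year - demo['date1'].dt.year) * 12
    + (demo['date2'].dt.month - demo['date1'].dt.month)
)

print(demo)
```

Two dates in the same calendar month give 0 here, while the 30-day approximation can round the same pair up to 1.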


In [51]:
df['diff_seconds'] = (df['date2'] - df['date1'])/np.timedelta64(1,'s')
df['diff_minutes'] = (df['date2'] - df['date1'])/np.timedelta64(1,'m')
df.head()

Unnamed: 0,date1,date2,elapsed_days,months_passed,diff_seconds,diff_minutes
0,2019-03-05 00:00:00,2019-03-31,26,1.0,2246400.0,37440.0
1,2019-03-05 01:00:00,2019-04-30,55,2.0,4834800.0,80580.0
2,2019-03-05 02:00:00,2019-05-31,86,3.0,7509600.0,125160.0
3,2019-03-05 03:00:00,2019-06-30,116,4.0,10098000.0,168300.0
4,2019-03-05 04:00:00,2019-07-31,147,5.0,12772800.0,212880.0


In [52]:
import datetime

df['to_today'] = (datetime.datetime.today() - df['date1'])
df.head()

Unnamed: 0,date1,date2,elapsed_days,months_passed,diff_seconds,diff_minutes,to_today
0,2019-03-05 00:00:00,2019-03-31,26,1.0,2246400.0,37440.0,1897 days 17:36:18.704810
1,2019-03-05 01:00:00,2019-04-30,55,2.0,4834800.0,80580.0,1897 days 16:36:18.704810
2,2019-03-05 02:00:00,2019-05-31,86,3.0,7509600.0,125160.0,1897 days 15:36:18.704810
3,2019-03-05 03:00:00,2019-06-30,116,4.0,10098000.0,168300.0,1897 days 14:36:18.704810
4,2019-03-05 04:00:00,2019-07-31,147,5.0,12772800.0,212880.0,1897 days 13:36:18.704810
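
The age example mentioned at the start of this section can be sketched in the same way. The `people` frame and the fixed `reference` date are assumptions for illustration; a fixed reference date keeps the result reproducible, unlike `today()`:

```python
import pandas as pd

people = pd.DataFrame({
    'date_of_birth': pd.to_datetime(
        ['1990-03-05', '2000-12-31', '1985-07-20']
    )
})

# Hypothetical reference date, fixed so that results are reproducible.
reference = pd.Timestamp('2019-03-05')

dob = people['date_of_birth']

# True when the birthday has not yet been reached in the reference year.
before_birthday = (dob.dt.month > reference.month) | (
    (dob.dt.month == reference.month) & (dob.dt.day > reference.day)
)

# Age in whole years: year difference, minus one if the birthday is pending.
people['age'] = reference.year - dob.dt.year - before_birthday.astype(int)

print(people)
```

Comparing month and day explicitly avoids the drift that comes from dividing a day count by 365.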


## Working with time in different time zones

Some organizations operate internationally; therefore, the information they collect about
events may be recorded together with the time zone of the area where the event took place.
**To be able to compare events that occurred across different time zones, we first need to
convert all of the variables to the same time zone**.

In [56]:
df = pd.DataFrame()

df['time1'] = pd.concat([
    pd.Series(pd.date_range(start='2015-06-10 09:00', freq='h', periods=3, tz='Europe/Berlin')),
    pd.Series( pd.date_range(start='2015-09-10 09:00', freq='h', periods=3, tz='US/Central'))], axis=0)

In [60]:
df['time2'] = pd.concat([
    pd.Series(pd.date_range(start='2015-07-01 09:00', freq='h', periods=3,tz='Europe/Berlin')),
    pd.Series(pd.date_range(start='2015-08-01 09:00', freq='h', periods=3, tz='US/Central'))], axis=0)

In [61]:
df

Unnamed: 0,time1,time2
0,2015-06-10 09:00:00+02:00,2015-07-01 09:00:00+02:00
1,2015-06-10 10:00:00+02:00,2015-07-01 10:00:00+02:00
2,2015-06-10 11:00:00+02:00,2015-07-01 11:00:00+02:00
0,2015-09-10 09:00:00-05:00,2015-08-01 09:00:00-05:00
1,2015-09-10 10:00:00-05:00,2015-08-01 10:00:00-05:00
2,2015-09-10 11:00:00-05:00,2015-08-01 11:00:00-05:00


In [62]:
df['time1_utc'] = pd.to_datetime(df['time1'], utc=True)
df['time2_utc'] = pd.to_datetime(df['time2'], utc=True)
df

Unnamed: 0,time1,time2,time1_utc,time2_utc
0,2015-06-10 09:00:00+02:00,2015-07-01 09:00:00+02:00,2015-06-10 07:00:00+00:00,2015-07-01 07:00:00+00:00
1,2015-06-10 10:00:00+02:00,2015-07-01 10:00:00+02:00,2015-06-10 08:00:00+00:00,2015-07-01 08:00:00+00:00
2,2015-06-10 11:00:00+02:00,2015-07-01 11:00:00+02:00,2015-06-10 09:00:00+00:00,2015-07-01 09:00:00+00:00
0,2015-09-10 09:00:00-05:00,2015-08-01 09:00:00-05:00,2015-09-10 14:00:00+00:00,2015-08-01 14:00:00+00:00
1,2015-09-10 10:00:00-05:00,2015-08-01 10:00:00-05:00,2015-09-10 15:00:00+00:00,2015-08-01 15:00:00+00:00
2,2015-09-10 11:00:00-05:00,2015-08-01 11:00:00-05:00,2015-09-10 16:00:00+00:00,2015-08-01 16:00:00+00:00


In [63]:
df['elapsed_days'] = (df['time2_utc'] - df['time1_utc']).dt.days
df['elapsed_days'].head()

0    21
1    21
2    21
0   -40
1   -40
Name: elapsed_days, dtype: int64

In [65]:
df['time1_london'] = df['time1_utc'].dt.tz_convert('Europe/London')
df['time2_berlin'] = df['time2_utc'].dt.tz_convert('Europe/Berlin')
df[['time1_london', 'time2_berlin']]

Unnamed: 0,time1_london,time2_berlin
0,2015-06-10 08:00:00+01:00,2015-07-01 09:00:00+02:00
1,2015-06-10 09:00:00+01:00,2015-07-01 10:00:00+02:00
2,2015-06-10 10:00:00+01:00,2015-07-01 11:00:00+02:00
0,2015-09-10 15:00:00+01:00,2015-08-01 16:00:00+02:00
1,2015-09-10 16:00:00+01:00,2015-08-01 17:00:00+02:00
2,2015-09-10 17:00:00+01:00,2015-08-01 18:00:00+02:00
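
The timestamps in this section already carry a time zone. When timestamps are recorded without one, a possible approach is to attach the zone with `tz_localize` before converting. A minimal sketch, assuming the naive values were recorded in Berlin local time (the `naive` series is hypothetical):

```python
import pandas as pd

# Hypothetical naive timestamps, assumed to be Berlin local time.
naive = pd.Series(pd.to_datetime(['2015-06-10 09:00', '2015-06-10 10:00']))

# tz_localize attaches a time zone without shifting the clock time.
berlin = naive.dt.tz_localize('Europe/Berlin')

# tz_convert then expresses the same instants in another zone.
utc = berlin.dt.tz_convert('UTC')

print(utc)
```

`tz_localize` labels the values and `tz_convert` translates them; calling `tz_convert` on naive timestamps raises an error.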
