# 09 Pandas Time and Date

* generating time indices
* indexing
* resampling
* plotting

## Timestamp class

Python has basic datetime support. Pandas extends this with a custom Timestamp type.

In [1]:
import datetime

In [2]:
datetime.datetime(2000, 1, 1)

datetime.datetime(2000, 1, 1, 0, 0)

Parsing dates and times is straightforward, but the variety of date layouts can still be a problem. Internally, Pandas uses the [dateutil](http://dateutil.readthedocs.io/en/stable/) library.

The traditional method centers around `strptime` and `strftime`, inherited from [POSIX](http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html) and the C standard library, respectively.

In [3]:
datetime.datetime.strptime("2000/1/1", "%Y/%m/%d")

datetime.datetime(2000, 1, 1, 0, 0)

In [4]:
datetime.datetime(2000, 1, 1, 0, 0).strftime("%Y%m%d")

'20000101'
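
The two functions are inverses of each other when the same layout string is used. The following is a minimal sketch of such a round trip; the variable names are chosen for illustration only.

```python
import datetime

# %Y = four-digit year, %m = month, %d = day of month
layout = "%Y/%m/%d"

# String -> datetime object
parsed = datetime.datetime.strptime("2000/1/1", layout)

# datetime object -> string (month and day are zero-padded on output)
formatted = parsed.strftime(layout)

print(parsed)     # 2000-01-01 00:00:00
print(formatted)  # 2000/01/01
```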

With Pandas you do not necessarily have to remember the layout specification.

In [5]:
import pandas as pd
import numpy as np

In [6]:
pd.to_datetime("4th of July 2000")

Timestamp('2000-07-04 00:00:00')

In [7]:
pd.to_datetime("2010 --- Jan / 19")

Timestamp('2010-01-19 00:00:00')
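
Not every string can be parsed, and by default `to_datetime` raises an exception in that case. A small sketch, assuming that a missing value is an acceptable result for bad input, uses the `errors="coerce"` option to get `NaT` ("not a time") instead:

```python
import pandas as pd

# errors="coerce" turns unparseable input into NaT instead of raising.
good = pd.to_datetime("2000-07-04", errors="coerce")
bad = pd.to_datetime("not a date", errors="coerce")

print(good)          # a regular Timestamp
print(pd.isna(bad))  # True, the second string could not be parsed
```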

`to_datetime` accepts a hint, `dayfirst=True`, for cases where day and month are ambiguous.

In [8]:
pd.to_datetime("09.12.2000")

Timestamp('2000-09-12 00:00:00')

In [9]:
pd.to_datetime("09.12.2000", dayfirst=True)

Timestamp('2000-12-09 00:00:00')
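
If the layout is known in advance, the ambiguity can also be removed by passing an explicit `format` string. This is a sketch of that alternative, not something the examples above require:

```python
import pandas as pd

# The same string, read with two different explicit layouts.
day_first = pd.to_datetime("09.12.2000", format="%d.%m.%Y")
month_first = pd.to_datetime("09.12.2000", format="%m.%d.%Y")

print(day_first)    # 2000-12-09 00:00:00
print(month_first)  # 2000-09-12 00:00:00
```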

Allowing various formats is useful if dates are formatted differently within a single data set (which is unfortunately not that uncommon).
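
As a sketch of the mixed-format case, the strings below (made up for illustration) are parsed one at a time, so that each value gets its own format detection. How a whole list of mixed layouts is handled in a single call may differ between Pandas versions, which is why the per-element loop is used here.

```python
import pandas as pd

# Three hypothetical layouts, as they might appear in one column.
raw = ["2000-01-05", "January 6, 2000", "07 Jan 2000"]

# Parse each value on its own, then collect the results in an index.
parsed = pd.DatetimeIndex([pd.to_datetime(value) for value in raw])

print(parsed)
```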

The resulting object is a `pd.Timestamp`.

In [10]:
dt = pd.to_datetime("02.03.2018")
type(dt)

pandas._libs.tslib.Timestamp

In [11]:
pd.Timestamp

pandas._libs.tslib.Timestamp

In [12]:
issubclass(pd.Timestamp, datetime.datetime)

True

This means that both types can be used interchangeably in many places and `pd.Timestamp` will behave like a datetime object, e.g. it will have a `weekday` method.

In [13]:
ts = pd.to_datetime("2001-01-01 5:00")

In [14]:
ts

Timestamp('2001-01-01 05:00:00')

In [15]:
ts.year, ts.month, ts.day, ts.weekday()

(2001, 1, 1, 0)
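
A short sketch of this interchangeability: a `pd.Timestamp` compares equal to the corresponding `datetime.datetime` and works with `datetime.timedelta` arithmetic.

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2001-01-01 05:00")
plain = datetime.datetime(2001, 1, 1, 5, 0)

print(isinstance(ts, datetime.datetime))  # True
print(ts == plain)                        # True

# Standard library timedeltas can be added to a Timestamp.
later = ts + datetime.timedelta(days=1)
print(later.weekday())                    # 1 (Tuesday)
```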

## DatetimeIndex class

The DatetimeIndex class is a specialized index type in Pandas.



In [16]:
index = [pd.Timestamp("2000-01-01"),
         pd.Timestamp("2000-01-02"),
         pd.Timestamp("2000-01-03")]

ts = pd.Series(np.random.randn(len(index)), index=index)

In [17]:
ts

2000-01-01   -2.426435
2000-01-02    0.728601
2000-01-03   -1.889533
dtype: float64

In [18]:
ts.index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None)

A list of timestamps (or datetime objects) is converted into a DatetimeIndex when a Series or DataFrame is created.

A list of date **strings** would not have yielded the same result.

In [19]:
index = ['2000-01-01', '2000-01-02', '2000-01-03']
ts = pd.Series(np.random.randn(len(index)), index=index)

In [20]:
ts.index

Index(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='object')

However, `to_datetime` is flexible enough to handle a **list of strings** as well.

In [21]:
index = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'])
ts = pd.Series(np.random.randn(len(index)), index=index)

In [22]:
ts.index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03'], dtype='datetime64[ns]', freq=None)
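
The same function can also repair a Series that was already created with a plain string index. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np
import pandas as pd

labels = ["2000-01-01", "2000-01-02", "2000-01-03"]
ts = pd.Series(np.random.randn(len(labels)), index=labels)

# The index holds plain strings at this point; convert it in place.
ts.index = pd.to_datetime(ts.index)

print(type(ts.index).__name__)  # DatetimeIndex
```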

## Fixed date ranges

Pandas has a handy function called `date_range` that can be used to create DatetimeIndex objects from fixed intervals. It is very flexible, and if you have ever generated intervals by hand, you might be pleasantly surprised.

The `date_range` function takes a few parameters (not all of them need to be specified):

* start
* end
* periods
* freq
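
In practice, three of these four parameters are enough to pin down a range. The sketch below shows three combinations; the dates are arbitrary examples.

```python
import pandas as pd

# start + periods + freq: count forward from the start
from_start = pd.date_range(start="2000-01-01", periods=3, freq="D")

# end + periods + freq: count backward from the end
to_end = pd.date_range(end="2000-01-03", periods=3, freq="D")

# start + end + periods: evenly spaced points between both ends
evenly = pd.date_range(start="2000-01-01", end="2000-01-05", periods=3)

print(from_start.equals(to_end))  # True
print(list(evenly.day))           # [1, 3, 5]
```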

### Start and end 

In [23]:
pd.date_range(start="2000-01-01", end="2001-01-01")

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10',
               ...
               '2000-12-23', '2000-12-24', '2000-12-25', '2000-12-26',
               '2000-12-27', '2000-12-28', '2000-12-29', '2000-12-30',
               '2000-12-31', '2001-01-01'],
              dtype='datetime64[ns]', length=367, freq='D')

The first and last dates are inclusive. Here the default frequency is chosen: D, which stands for "calendar day frequency".

### Periods and Frequency

The following creates three entries with a given frequency, here "H", which stands for hourly.

In [24]:
pd.date_range(start="2000-01-01", periods=3, freq="H")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:00:00',
               '2000-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='H')

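A full list of frequencies can be found in the Pandas docs (pandas 2.2 and later rename some of the aliases listed here, e.g. `h` instead of `H`, `min` instead of `T` and `ME` instead of `M`):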
    
* https://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases


Alias    | Description
---------|-----------------------
B        | business day frequency
C        | custom business day frequency
D        | calendar day frequency
W        | weekly frequency
M        | month end frequency
SM       | semi-month end frequency (15th and end of month)
BM       | business month end frequency
CBM      | custom business month end frequency
MS       | month start frequency
SMS      | semi-month start frequency (1st and 15th)
BMS      | business month start frequency
CBMS     | custom business month start frequency
Q        | quarter end frequency
BQ       | business quarter end frequency
QS       | quarter start frequency
BQS      | business quarter start frequency
A, Y     | year end frequency
BA, BY   | business year end frequency
AS, YS   | year start frequency
BAS, BYS | business year start frequency
BH       | business hour frequency
H        | hourly frequency
T, min   | minutely frequency
S        | secondly frequency
L, ms    | milliseconds
U, us    | microseconds
N        | nanoseconds

In [25]:
pd.date_range(start="2000-01-01", periods=3, freq="S")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:00:01',
               '2000-01-01 00:00:02'],
              dtype='datetime64[ns]', freq='S')

In [26]:
pd.date_range(start="2000-01-01", periods=3, freq="T")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')

One thing to note is the business day frequency: it skips weekends (Sat, Sun).

January 1st, 2000 was a Saturday, so the business year 2000 started on January 3rd.

In [27]:
pd.date_range(start="2000-01-01", periods=3, freq="B")

DatetimeIndex(['2000-01-03', '2000-01-04', '2000-01-05'], dtype='datetime64[ns]', freq='B')
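
A quick way to convince yourself that no weekend slipped in is to look at the weekday numbers of the generated index. This is only a sanity-check sketch:

```python
import pandas as pd

idx = pd.date_range(start="2000-01-01", periods=10, freq="B")

# weekday: Monday=0 ... Sunday=6, so business days are all below 5
only_weekdays = bool((idx.weekday < 5).all())

print(idx[0].date(), only_weekdays)  # 2000-01-03 True
```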

### Combined frequencies

A less common use case is a combination of intervals. Generating an index with intervals of 90 minutes might look like the following:

In [28]:
pd.date_range(start="2000", periods=5, freq="90min")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00'],
              dtype='datetime64[ns]', freq='90T')

In [29]:
pd.date_range(start="2000", periods=5, freq="1h30min")

DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00'],
              dtype='datetime64[ns]', freq='90T')

Again, Pandas tries to be useful and do the right thing out of the box.
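
Both spellings describe the same interval, and a `pd.Timedelta` can be passed as the frequency as well. A small sketch comparing the three variants:

```python
import pandas as pd

a = pd.date_range(start="2000", periods=5, freq="90min")
b = pd.date_range(start="2000", periods=5, freq="1h30min")

# A Timedelta object is accepted in place of a frequency string.
c = pd.date_range(start="2000", periods=5, freq=pd.Timedelta(minutes=90))

print(a.equals(b), a.equals(c))  # True True
```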

### Combining aliases

It is possible to combine offsets (e.g. business hours - BH) with intervals. To generate entries for every 2 hours of business time (which starts at 9AM and ends at 5PM), one can write the following:

In [30]:
pd.date_range(start="2000", periods=12, freq="2BH")

DatetimeIndex(['2000-01-03 09:00:00', '2000-01-03 11:00:00',
               '2000-01-03 13:00:00', '2000-01-03 15:00:00',
               '2000-01-04 09:00:00', '2000-01-04 11:00:00',
               '2000-01-04 13:00:00', '2000-01-04 15:00:00',
               '2000-01-05 09:00:00', '2000-01-05 11:00:00',
               '2000-01-05 13:00:00', '2000-01-05 15:00:00'],
              dtype='datetime64[ns]', freq='2BH')
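
The 9AM to 5PM window is only the default. As a sketch, assuming a hypothetical working day from 8AM to 4PM, an offset object can be passed instead of an alias string:

```python
import pandas as pd

# Hypothetical business hours: 08:00 to 16:00 instead of the default.
early_shift = pd.offsets.BusinessHour(start="08:00", end="16:00")

# 2000-01-03 is a Monday; midnight is rolled forward to the opening time.
idx = pd.date_range(start="2000-01-03", periods=9, freq=early_shift)

print(idx[0])   # 2000-01-03 08:00:00
print(idx[-1])  # 2000-01-04 08:00:00
```

The closing time itself does not appear in the index; after the last full hour of a day the range continues with the next opening time, just as in the `2BH` output above.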

### Anchored offsets

An anchor lets you specify date ranges with fixed points, such as:
    
* every Friday
* second Tuesday of the month

The first takes an offset "W" and anchors it on "FRI", Friday.

The full list of anchored offsets can be found in the Pandas documentation:

* https://pandas.pydata.org/pandas-docs/stable/timeseries.html#anchored-offsets

Anchor|Description
------|-------------------
W-SUN |weekly frequency (Sundays). Same as ‘W’
W-MON |weekly frequency (Mondays)
W-TUE |weekly frequency (Tuesdays)
W-WED |weekly frequency (Wednesdays)
W-THU |weekly frequency (Thursdays)
W-FRI |weekly frequency (Fridays)
W-SAT |weekly frequency (Saturdays)
(B)Q(S)-DEC |quarterly frequency, year ends in December. Same as ‘Q’
(B)Q(S)-JAN |quarterly frequency, year ends in January
(B)Q(S)-FEB |quarterly frequency, year ends in February
(B)Q(S)-MAR |quarterly frequency, year ends in March
(B)Q(S)-APR |quarterly frequency, year ends in April
(B)Q(S)-MAY |quarterly frequency, year ends in May
(B)Q(S)-JUN |quarterly frequency, year ends in June
(B)Q(S)-JUL |quarterly frequency, year ends in July
(B)Q(S)-AUG |quarterly frequency, year ends in August
(B)Q(S)-SEP |quarterly frequency, year ends in September
(B)Q(S)-OCT |quarterly frequency, year ends in October
(B)Q(S)-NOV |quarterly frequency, year ends in November
(B)A(S)-DEC |annual frequency, anchored end of December. Same as ‘A’
(B)A(S)-JAN |annual frequency, anchored end of January
(B)A(S)-FEB |annual frequency, anchored end of February
(B)A(S)-MAR |annual frequency, anchored end of March
(B)A(S)-APR |annual frequency, anchored end of April
(B)A(S)-MAY |annual frequency, anchored end of May
(B)A(S)-JUN |annual frequency, anchored end of June
(B)A(S)-JUL |annual frequency, anchored end of July
(B)A(S)-AUG |annual frequency, anchored end of August
(B)A(S)-SEP |annual frequency, anchored end of September
(B)A(S)-OCT |annual frequency, anchored end of October
(B)A(S)-NOV |annual frequency, anchored end of November

In [31]:
pd.date_range(start="2000-01-01", periods=5, freq="W-Fri")

DatetimeIndex(['2000-01-07', '2000-01-14', '2000-01-21', '2000-01-28',
               '2000-02-04'],
              dtype='datetime64[ns]', freq='W-FRI')

The "week of month" offset (WOM) is qualified by 2 (second week of the month) and anchored on Tuesday.

In [32]:
pd.date_range(start="2018-01-01", periods=12, freq="WOM-2TUE")

DatetimeIndex(['2018-01-09', '2018-02-13', '2018-03-13', '2018-04-10',
               '2018-05-08', '2018-06-12', '2018-07-10', '2018-08-14',
               '2018-09-11', '2018-10-09', '2018-11-13', '2018-12-11'],
              dtype='datetime64[ns]', freq='WOM-2TUE')
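
Both anchored ranges can be checked with the weekday numbers of their entries. The following is a sanity-check sketch; the second Tuesday of a month always falls on the 8th to the 14th.

```python
import pandas as pd

fridays = pd.date_range(start="2000-01-01", periods=5, freq="W-FRI")
second_tuesdays = pd.date_range(start="2018-01-01", periods=12, freq="WOM-2TUE")

# weekday: Monday=0, Tuesday=1, ..., Friday=4
all_fridays = bool((fridays.weekday == 4).all())
all_tuesdays = bool((second_tuesdays.weekday == 1).all())

# Day of month must be in the second week of the month.
days = second_tuesdays.day
in_second_week = bool(((days >= 8) & (days <= 14)).all())

print(all_fridays, all_tuesdays, in_second_week)  # True True True
```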

### Index unions

Indices can be combined. The API is similar to the API of the `set` type in Python.

* union
* intersection

In [33]:
# BAS: (B)usiness (A)nnual frequency, anchored at the (S)tart of January
s = pd.date_range(start="2000-01-01", periods=10, freq="BAS-JAN")

In [34]:
s

DatetimeIndex(['2000-01-03', '2001-01-01', '2002-01-01', '2003-01-01',
               '2004-01-01', '2005-01-03', '2006-01-02', '2007-01-01',
               '2008-01-01', '2009-01-01'],
              dtype='datetime64[ns]', freq='BAS-JAN')

In [35]:
t = pd.date_range(start="2001-01-01", periods=10, freq='A-FEB')

In [36]:
s.union(t)

DatetimeIndex(['2000-01-03', '2001-01-01', '2001-02-28', '2002-01-01',
               '2002-02-28', '2003-01-01', '2003-02-28', '2004-01-01',
               '2004-02-29', '2005-01-03', '2005-02-28', '2006-01-02',
               '2006-02-28', '2007-01-01', '2007-02-28', '2008-01-01',
               '2008-02-29', '2009-01-01', '2009-02-28', '2010-02-28'],
              dtype='datetime64[ns]', freq=None)

In [37]:
s.intersection(t)

DatetimeIndex([], dtype='datetime64[ns]', freq=None)
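
The intersection above is empty because the two ranges share no dates. For contrast, here is a sketch with two overlapping ranges (dates chosen for illustration):

```python
import pandas as pd

# Jan 1 .. Jan 14, 2000
daily = pd.date_range(start="2000-01-01", periods=14, freq="D")
# Jan 7, 14, 21 and 28, 2000
fridays = pd.date_range(start="2000-01-01", periods=4, freq="W-FRI")

both = daily.intersection(fridays)   # the two Fridays within the daily range
either = daily.union(fridays)        # 14 days plus the two later Fridays

print(len(both), len(either))  # 2 16
```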

## Exercises

1. Create a DatetimeIndex with at least two elements **without** using `pd.date_range` or `pd.to_datetime`
2. Use three different date strings of your choice (or ones you encountered in data sets) and test whether `pd.to_datetime` can parse them correctly.
3. Create a DatetimeIndex with 100 entries, starting January 1st, 2010 and spaced hourly.
4. Create two DatetimeIndex objects, both starting April 1st, 2010. One index should contain hourly entries, the other one entry every ten minutes. Create 100 periods each. Combine both indices. Then create an index that only contains entries found in both indices.

## Solutions

1. Create a DatetimeIndex with at least two elements **without** using `pd.date_range` or `pd.to_datetime`

**Solution**: Use default Python datetime objects.

In [38]:
pd.DatetimeIndex([datetime.datetime(2000, 1, 1), datetime.datetime(2000, 2, 1)])

DatetimeIndex(['2000-01-01', '2000-02-01'], dtype='datetime64[ns]', freq=None)

In [39]:
pd.DatetimeIndex([datetime.datetime.now(), datetime.datetime(2000, 2, 1)])

DatetimeIndex(['2018-09-07 21:38:33.350709', '2000-02-01 00:00:00'], dtype='datetime64[ns]', freq=None)

Possible alternative, using `time.time()`.

`time.time()` ...

> Return the current time in seconds since the Epoch.
Fractions of a second may be present if the system clock provides them.

However, the DatetimeIndex constructor interprets plain numbers as nanoseconds, whereas `time.time()` returns seconds (as a float).

So convert seconds to nanoseconds first; then it works as well.

In [40]:
import time

In [41]:
pd.DatetimeIndex([time.time() * 1000000000])

DatetimeIndex(['2018-09-07 19:38:33.373552896'], dtype='datetime64[ns]', freq=None)
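
Although the exercise rules it out, it is worth knowing that `pd.to_datetime` can do the unit conversion itself via its `unit` parameter. The sketch below uses fixed epoch values so that the result is reproducible. Epoch values are interpreted as UTC, which likely explains the two-hour gap between this output and the local `datetime.datetime.now()` output further up.

```python
import time
import pandas as pd

# 86400 seconds after the Epoch is exactly one day later.
one_day = pd.to_datetime(86400, unit="s")

# A list of epoch seconds yields a DatetimeIndex.
index = pd.to_datetime([0, 86400], unit="s")

# The float returned by time.time() works the same way.
now = pd.to_datetime(time.time(), unit="s")

print(one_day)  # 1970-01-02 00:00:00
```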

2. Use three different date strings of your choice (or ones you encountered in data sets) and test whether `pd.to_datetime` can parse them correctly.

**Solution**: Just use some strings. Here are some examples. Use `dayfirst` if necessary.

In [42]:
pd.to_datetime("Wed 23 May 2018")

Timestamp('2018-05-23 00:00:00')

In [43]:
pd.to_datetime("2018/23/05", dayfirst=True)

Timestamp('2018-05-23 00:00:00')

In [44]:
# pd.to_datetime("2018/23/05", dayfirst=False) # TypeError

In [45]:
# pd.to_datetime("29.02.2010") # ValueError: day is out of range for month (2010 is not a leap year)

In [46]:
pd.to_datetime("2018, Jan 3rd")

Timestamp('2018-01-03 00:00:00')

How about localized month names, e.g. März?

With plain `strptime` you could set your locale, but this does not work with Pandas.

In [47]:
import locale
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')

'de_DE.UTF-8'

In [48]:
# pd.to_datetime("2010, März 20") # TypeError

However, with the locale set, the localized month abbreviations are available in the standard library as the sequence `calendar.month_abbr`. A lookup dictionary built from it could be used to preprocess date strings.

In [49]:
import calendar

In [50]:
dict((v,k) for k,v in enumerate(calendar.month_abbr))

{'': 0,
 'Apr': 4,
 'Aug': 8,
 'Dez': 12,
 'Feb': 2,
 'Jan': 1,
 'Jul': 7,
 'Jun': 6,
 'Mai': 5,
 'Mär': 3,
 'Nov': 11,
 'Okt': 10,
 'Sep': 9}
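
Such a preprocessing step might look like the following sketch. The helper `parse_german_date` and the table `GERMAN_MONTHS` are hypothetical; the table is spelled out by hand so that the example does not depend on a German locale being installed, and only the single layout shown is supported.

```python
import re
import pandas as pd

# Hypothetical lookup table of full German month names.
GERMAN_MONTHS = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4, "Mai": 5, "Juni": 6,
    "Juli": 7, "August": 8, "September": 9, "Oktober": 10,
    "November": 11, "Dezember": 12,
}

def parse_german_date(text):
    """Parse strings laid out like '2010, März 20' (hypothetical helper)."""
    match = re.fullmatch(r"(\d{4}), (\w+) (\d{1,2})", text)
    if match is None:
        raise ValueError("unsupported layout: %r" % text)
    year, month_name, day = match.groups()
    # Translate the month name to its number, then build the Timestamp.
    month = GERMAN_MONTHS[month_name]
    return pd.Timestamp(year=int(year), month=month, day=int(day))

parsed = parse_german_date("2010, März 20")
print(parsed)  # 2010-03-20 00:00:00
```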

3. Create a DatetimeIndex with 100 entries, starting January 1st, 2010 and spaced hourly.

**Solution**: Use `pd.date_range`.

In [51]:
pd.date_range(start="2010-01-01", periods=100, freq="H")

DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 01:00:00',
               '2010-01-01 02:00:00', '2010-01-01 03:00:00',
               '2010-01-01 04:00:00', '2010-01-01 05:00:00',
               '2010-01-01 06:00:00', '2010-01-01 07:00:00',
               '2010-01-01 08:00:00', '2010-01-01 09:00:00',
               '2010-01-01 10:00:00', '2010-01-01 11:00:00',
               '2010-01-01 12:00:00', '2010-01-01 13:00:00',
               '2010-01-01 14:00:00', '2010-01-01 15:00:00',
               '2010-01-01 16:00:00', '2010-01-01 17:00:00',
               '2010-01-01 18:00:00', '2010-01-01 19:00:00',
               '2010-01-01 20:00:00', '2010-01-01 21:00:00',
               '2010-01-01 22:00:00', '2010-01-01 23:00:00',
               '2010-01-02 00:00:00', '2010-01-02 01:00:00',
               '2010-01-02 02:00:00', '2010-01-02 03:00:00',
               '2010-01-02 04:00:00', '2010-01-02 05:00:00',
               '2010-01-02 06:00:00', '2010-01-02 07:00:00',
               '2010-01-

4. Create two DatetimeIndex objects, both starting April 1st, 2010. One index should contain hourly entries, the other one entry every ten minutes. Create 100 periods each. Combine both indices. Then create an index that only contains entries found in both indices.

**Solution**: Use `union` and `intersection`.

In [52]:
# Both indices start on April 1st, 2010, as the exercise asks.
a = pd.date_range(start="2010-04-01", freq="1H", periods=100)
b = pd.date_range(start="2010-04-01", freq="10min", periods=100)

a.union(b)

DatetimeIndex(['2010-04-01 00:00:00', '2010-04-01 00:10:00',
               '2010-04-01 00:20:00', '2010-04-01 00:30:00',
               '2010-04-01 00:40:00', '2010-04-01 00:50:00',
               '2010-04-01 01:00:00', '2010-04-01 01:10:00',
               '2010-04-01 01:20:00', '2010-04-01 01:30:00',
               ...
               '2010-04-04 18:00:00', '2010-04-04 19:00:00',
               '2010-04-04 20:00:00', '2010-04-04 21:00:00',
               '2010-04-04 22:00:00', '2010-04-04 23:00:00',
               '2010-04-05 00:00:00', '2010-04-05 01:00:00',
               '2010-04-05 02:00:00', '2010-04-05 03:00:00'],
              dtype='datetime64[ns]', length=183, freq=None)

In [53]:
a.intersection(b)

DatetimeIndex(['2010-04-01 00:00:00', '2010-04-01 01:00:00',
               '2010-04-01 02:00:00', '2010-04-01 03:00:00',
               '2010-04-01 04:00:00', '2010-04-01 05:00:00',
               '2010-04-01 06:00:00', '2010-04-01 07:00:00',
               '2010-04-01 08:00:00', '2010-04-01 09:00:00',
               '2010-04-01 10:00:00', '2010-04-01 11:00:00',
               '2010-04-01 12:00:00', '2010-04-01 13:00:00',
               '2010-04-01 14:00:00', '2010-04-01 15:00:00',
               '2010-04-01 16:00:00'],
              dtype='datetime64[ns]', freq='H')
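
The sizes of both results can be verified by hand. The ten-minute index covers 100 steps, i.e. it ends at 16:30 on the first day, so it shares the 17 full hours from 00:00 to 16:00 with the hourly index. The union therefore has 100 + 100 - 17 = 183 entries. A sketch of that check, writing the hourly frequency as `60min` (an assumption made to sidestep the differing hour aliases across Pandas versions):

```python
import pandas as pd

hourly = pd.date_range(start="2010-04-01", periods=100, freq="60min")
ten_minutes = pd.date_range(start="2010-04-01", periods=100, freq="10min")

combined = hourly.union(ten_minutes)
shared = hourly.intersection(ten_minutes)

print(len(combined), len(shared))  # 183 17
```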