# Time Series Basics

In [2]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline


from datetime import datetime
from datetime import timedelta
from dateutil.parser import parse

The most basic kind of time series object in pandas is a Series indexed by timestamps,
which are often represented outside pandas as Python strings or datetime objects:

In [4]:
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5), datetime(2011, 1, 7),
        datetime(2011, 1, 8), datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [5]:
dates

[datetime.datetime(2011, 1, 2, 0, 0),
 datetime.datetime(2011, 1, 5, 0, 0),
 datetime.datetime(2011, 1, 7, 0, 0),
 datetime.datetime(2011, 1, 8, 0, 0),
 datetime.datetime(2011, 1, 10, 0, 0),
 datetime.datetime(2011, 1, 12, 0, 0)]

In [6]:
ts = Series(np.random.randn(6), index=dates)

In [7]:
ts

2011-01-02    0.333574
2011-01-05   -0.346737
2011-01-07   -1.471713
2011-01-08   -1.703889
2011-01-10    0.921323
2011-01-12   -0.733133
dtype: float64
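Since timestamps often arrive as strings rather than datetime objects, here is a minimal sketch of building the same kind of Series from string dates. The list `raw` and the values are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Hypothetical raw input: dates stored as ISO-formatted strings
raw = ['2011-01-02', '2011-01-05', '2011-01-07']

# pd.to_datetime parses the list of strings into a DatetimeIndex
index = pd.to_datetime(raw)

# Illustrative values; any sequence of matching length would do
ts_from_strings = pd.Series([1.0, 2.0, 3.0], index=index)
print(ts_from_strings)
```

Parsing once up front keeps the index typed as dates, so the date-based selection shown later works on it.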

Under the hood, these datetime objects have been put in a DatetimeIndex. The variable
ts is still an ordinary Series (older pandas releases reported a separate TimeSeries type here):

In [8]:
type(ts)

pandas.core.series.Series

In [9]:
ts.index

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
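The same two checks can be written as explicit assertions-style tests. This is a small sketch with made-up values; the name `s` is an assumption for illustration.

```python
from datetime import datetime
import pandas as pd

# Two illustrative observations indexed by datetime objects
dates = [datetime(2011, 1, 2), datetime(2011, 1, 5)]
s = pd.Series([0.5, -0.5], index=dates)

# The container is a plain Series ...
is_plain_series = type(s) is pd.Series
# ... and the datetime objects were converted into a DatetimeIndex
has_datetime_index = isinstance(s.index, pd.DatetimeIndex)

print(is_plain_series, has_datetime_index)
```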

As with other Series, arithmetic operations between differently indexed time series automatically align on the dates:

In [10]:
ts + ts[::2]

2011-01-02    0.667149
2011-01-05         NaN
2011-01-07   -2.943425
2011-01-08         NaN
2011-01-10    1.842646
2011-01-12         NaN
dtype: float64
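To make the alignment easier to follow, here is a sketch with small fixed values instead of random ones. The names `a`, `b`, `total`, and `filled` are illustrative. It also shows `Series.add` with `fill_value`, one way to avoid the NaN entries if treating a missing date as zero is acceptable for your data.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07'])
a = pd.Series([1.0, 2.0, 3.0], index=idx)

# Every other row: keeps 2011-01-02 and 2011-01-07
b = a.iloc[::2]

# Dates present in only one operand become NaN
total = a + b

# Treat a missing date as 0 instead of propagating NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```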

pandas stores timestamps using NumPy’s datetime64 data type, here at nanosecond resolution:

In [11]:
ts.index.dtype

dtype('<M8[ns]')
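As a rough illustration of what the datetime64 storage means in practice, the sketch below pulls out the underlying NumPy array and does arithmetic on it directly. The variable names are assumptions, and the exact resolution unit may differ between pandas versions, so the code avoids depending on it.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05'])

# The index dtype is a NumPy datetime64 dtype
is_datetime64 = np.issubdtype(idx.dtype, np.datetime64)

# .values exposes the underlying NumPy array
arr = idx.values
kind = arr.dtype.kind  # 'M' is NumPy's type code for datetime64

# Subtracting two elements yields a NumPy timedelta64
delta = arr[1] - arr[0]

print(is_datetime64, kind, delta == np.timedelta64(3, 'D'))
```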

Scalar values from a DatetimeIndex are pandas Timestamp objects:

In [13]:
stamp = ts.index[0]

In [14]:
stamp

Timestamp('2011-01-02 00:00:00')

A Timestamp can be substituted anywhere you would use a datetime object. Additionally, it understands how to do time zone conversions and other kinds of manipulations, and in older pandas versions it could also store frequency information (if any). More on both of these things later.
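A short sketch of the substitution claim: Timestamp is a subclass of datetime, so it compares equal to a datetime and accepts timedelta arithmetic. The names below are illustrative, and the time zone step only localizes to UTC as a minimal example.

```python
from datetime import datetime, timedelta
import pandas as pd

stamp = pd.Timestamp('2011-01-02')

# Timestamp subclasses datetime, so isinstance checks pass
is_dt = isinstance(stamp, datetime)

# It compares equal to the corresponding datetime object
same = stamp == datetime(2011, 1, 2)

# Standard-library timedelta arithmetic works and returns a Timestamp
next_day = stamp + timedelta(days=1)

# Attach a time zone to a naive Timestamp
utc_stamp = stamp.tz_localize('UTC')

print(is_dt, same, next_day, utc_stamp)
```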

## Indexing, Selection, Subsetting

A time series is simply a Series with a DatetimeIndex, and thus behaves in the same way with regard to indexing and selecting data based on label:

In [15]:
stamp = ts.index[2]

In [16]:
stamp

Timestamp('2011-01-07 00:00:00')

In [17]:
ts[stamp]

-1.4717125469335286

As a convenience, you can also pass a string that is interpretable as a date:

In [18]:
ts['1/10/2011']

0.92132316928562608

In [21]:
ts['20110110']

0.92132316928562608
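The sketch below checks that several string spellings of the same date select the same value. It uses fixed values and `.loc`, which makes the label-based intent explicit; the names are illustrative. Note that a string like '1/10/2011' is read month-first by default, so it means January 10.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series([10.0, 20.0, 30.0], index=idx)

# Three spellings of 10 January 2011
a = s.loc['1/10/2011']
b = s.loc['20110110']
c = s.loc['2011-01-10']

print(a, b, c)
```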

For longer time series, you can pass just a year, or a year and month, to select slices of data:

In [23]:
longer_ts = Series(np.random.randn(1000),
    index=pd.date_range('1/1/2000', periods=1000))


In [26]:
longer_ts.head()

2000-01-01    1.372077
2000-01-02    0.364052
2000-01-03   -0.809639
2000-01-04   -2.506893
2000-01-05    0.681060
Freq: D, dtype: float64

In [28]:
longer_ts['2001'].head()

2001-01-01   -0.906485
2001-01-02   -0.156577
2001-01-03    0.505718
2001-01-04   -1.336015
2001-01-05    1.069350
Freq: D, dtype: float64

In [29]:
longer_ts['2001-05']

2001-05-01   -1.514354
2001-05-02   -0.756711
2001-05-03   -1.507894
2001-05-04    0.538485
2001-05-05    1.278094
2001-05-06    0.265412
2001-05-07   -0.556082
2001-05-08    0.947302
2001-05-09    0.239167
2001-05-10   -1.576049
2001-05-11   -4.066289
2001-05-12    0.322240
2001-05-13    0.020418
2001-05-14   -0.005017
2001-05-15   -0.036190
2001-05-16   -1.037135
2001-05-17   -1.183014
2001-05-18    0.027053
2001-05-19   -0.655309
2001-05-20    0.892519
2001-05-21    1.052929
2001-05-22    0.235770
2001-05-23    0.962950
2001-05-24   -1.658024
2001-05-25    0.709359
2001-05-26   -1.440372
2001-05-27    1.576416
2001-05-28    0.678293
2001-05-29   -0.490487
2001-05-30    0.573801
2001-05-31    1.230159
Freq: D, dtype: float64
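One way to sanity-check year and month selection is to count the rows returned. This sketch uses deterministic values in place of random ones; the names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# 1000 consecutive days starting 1 January 2000
s = pd.Series(np.arange(1000),
              index=pd.date_range('2000-01-01', periods=1000))

# A year string selects every day in that year
year_2001 = s.loc['2001']

# A year-month string selects every day in that month
may_2001 = s.loc['2001-05']

print(len(year_2001), len(may_2001))
```

The counts match the calendar: 2001 is not a leap year, and May has 31 days.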

Slicing with dates works just like with a regular Series:

In [33]:
ts[datetime(2011, 1, 7):]

2011-01-07   -1.471713
2011-01-08   -1.703889
2011-01-10    0.921323
2011-01-12   -0.733133
dtype: float64

In [34]:
ts[:datetime(2011, 1, 7)]

2011-01-02    0.333574
2011-01-05   -0.346737
2011-01-07   -1.471713
dtype: float64
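Note that the end date 2011-01-07 appears in the result above: label-based slices include both endpoints, unlike positional slices. A minimal sketch of the difference, with illustrative names and values:

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Label slice: the end label 2011-01-07 is included
by_label = s.loc[:pd.Timestamp('2011-01-07')]

# Positional slice: the end position 2 is excluded
by_position = s.iloc[:2]

print(by_label.tolist(), by_position.tolist())
```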

Because most time series data is ordered chronologically, you can slice with timestamps
not contained in a time series to perform a range query:

In [35]:
ts

2011-01-02    0.333574
2011-01-05   -0.346737
2011-01-07   -1.471713
2011-01-08   -1.703889
2011-01-10    0.921323
2011-01-12   -0.733133
dtype: float64

In [36]:
ts['1/6/2011':'1/11/2011']

2011-01-07   -1.471713
2011-01-08   -1.703889
2011-01-10    0.921323
dtype: float64
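The range query relies on the index being sorted. The sketch below repeats the query on fixed values and then shows one possible way to handle an index that is out of order, by calling `sort_index` first. The names and the shuffled order are assumptions for illustration.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=idx)

# Neither bound exists in the index; the slice returns what lies between
window = s.loc['2011-01-06':'2011-01-11']

# An out-of-order copy of the same data (hypothetical scenario)
shuffled = s.iloc[[3, 0, 5, 1, 4, 2]]

# Sorting restores chronological order before the range query
window_sorted = shuffled.sort_index().loc['2011-01-06':'2011-01-11']

print(window.tolist(), window_sorted.tolist())
```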

As before, you can pass a string date, a datetime, or a Timestamp. Remember that slicing in this manner produces views on the source time series rather than copies, just like slicing NumPy arrays. There is an equivalent instance method, truncate, which slices a Series between two dates:



In [38]:
ts.truncate(after='1/9/2011')

2011-01-02    0.333574
2011-01-05   -0.346737
2011-01-07   -1.471713
2011-01-08   -1.703889
dtype: float64
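truncate also accepts a `before` argument, so both ends can be cut in one call. The sketch below, on fixed illustrative values, compares that call with the equivalent label slice.

```python
import pandas as pd

idx = pd.to_datetime(['2011-01-02', '2011-01-05', '2011-01-07',
                      '2011-01-08', '2011-01-10', '2011-01-12'])
s = pd.Series(range(6), index=idx)

# Drop everything before 5 January and after 9 January
middle = s.truncate(before='2011-01-05', after='2011-01-09')

# The same selection written as a label slice
sliced = s.loc['2011-01-05':'2011-01-09']

print(middle.tolist(), middle.equals(sliced))
```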

All of the above holds true for DataFrame as well, indexing on its rows:

In [39]:
dates = pd.date_range('1/1/2000', periods=100, freq='W-WED')

In [40]:
long_df = DataFrame(np.random.randn(100, 4),
            index=dates,
            columns=['Colorado', 'Texas', 'New York', 'Ohio'])

In [43]:
long_df.loc['5-2001']

            Colorado     Texas  New York      Ohio
2001-05-02  1.246874  0.258168 -0.323973 -0.761692
2001-05-09  0.594826 -0.520239  0.517430  1.641458
2001-05-16  1.467048  1.282116 -1.073815  0.169824
2001-05-23  0.722739  1.645053 -1.195086 -0.417969
2001-05-30  0.235388 -2.207366 -0.682517 -0.759445
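With a DataFrame, `.loc` can combine a date selection on the rows with a column label. This is a sketch on deterministic values rather than the random data above; the frame `df` and the chosen date range are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# 100 consecutive Wednesdays, starting with the first one in 2000
dates = pd.date_range('2000-01-01', periods=100, freq='W-WED')
df = pd.DataFrame(np.arange(400).reshape(100, 4),
                  index=dates,
                  columns=['Colorado', 'Texas', 'New York', 'Ohio'])

# All rows in May 2001
may = df.loc['2001-05']

# A date range on the rows plus a single column
texas_early_may = df.loc['2001-05-01':'2001-05-20', 'Texas']

print(may.index.tolist())
print(texas_early_may)
```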


## Time Series with Duplicate Indices

In some applications, there may be multiple data observations falling on a particular
timestamp. Here is an example:

In [44]:
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000', '1/2/2000',
        '1/3/2000'])

In [45]:
dup_ts = Series(np.arange(5), index=dates)

In [46]:
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02', '2000-01-02',
               '2000-01-03'],
              dtype='datetime64[ns]', freq=None)

We can tell that the index is not unique by checking its is_unique property:

In [47]:
dup_ts.index.is_unique

False

Indexing into this time series will now either produce scalar values or slices depending
on whether a timestamp is duplicated:

In [48]:
dup_ts['1/3/2000'] # not duplicated

4

In [49]:
dup_ts['1/2/2000'] # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64
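Because the return type depends on the data, code that consumes such lookups may want to check for duplicates first. The sketch below uses `Index.duplicated` to flag repeated timestamps and shows both return types; it indexes with Timestamp objects to keep the lookups explicit. The names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

# keep=False marks every occurrence of a repeated timestamp
dup_mask = dup.index.duplicated(keep=False)

# Unique timestamp: a scalar comes back
single = dup.loc[pd.Timestamp('2000-01-03')]

# Repeated timestamp: a Series with one row per occurrence
several = dup.loc[pd.Timestamp('2000-01-02')]

print(dup_mask.tolist(), single, len(several))
```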

Suppose you wanted to aggregate the data having non-unique timestamps. One way
to do this is to use groupby and pass level=0 (the only level of indexing!):

In [50]:
grouped = dup_ts.groupby(level=0)

In [53]:
grouped.mean()


2000-01-01    0
2000-01-02    2
2000-01-03    4
dtype: int64

In [54]:
grouped.count()


2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
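Other reductions work the same way on the grouped object. As a sketch, the code below sums the duplicates and, as an alternative to aggregating, keeps only the first observation per timestamp using a boolean mask. Which approach is appropriate depends on what the duplicates mean in your data; the names are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-02',
                          '2000-01-02', '2000-01-03'])
dup = pd.Series(np.arange(5), index=dates)

# Aggregate: one total per distinct timestamp
totals = dup.groupby(level=0).sum()

# Deduplicate instead: keep the first row for each timestamp
firsts = dup[~dup.index.duplicated(keep='first')]

print(totals.tolist(), firsts.tolist())
```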