## Creating Samples

In [23]:
import pandas as pd

In [24]:
import numpy as np

In [25]:
# When you load time series data in Python, you can set the date column as the index.

date = [pd.Timestamp("2017-01-01"),
        pd.Timestamp("2017-01-02"),
        pd.Timestamp("2017-01-03")]

timeSeries = pd.Series(np.random.randn(len(date)), index=date)

In [26]:
timeSeries

2017-01-01    1.461519
2017-01-02   -0.625641
2017-01-03   -0.054669
dtype: float64
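As a minimal sketch of setting the date column as the index at load time, the snippet below reads a small CSV with `pd.read_csv`. The CSV text and the column names `date` and `value` are made up for illustration; in practice you would pass a file path instead of an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """date,value
2017-01-01,15
2017-01-02,-6
2017-01-03,-1
"""

# index_col selects the date column as the index;
# parse_dates=True converts that index to timestamps.
loaded = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

print(type(loaded.index).__name__)
print(loaded["value"])
```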

In [27]:
# The first step created the date values. The second step created a random time series with those dates as the index.

timeSeries.index 

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'], dtype='datetime64[ns]', freq=None)

In [28]:
# Retrieving the Values
# Because the series has a datetime index, you can look up the value at a given timestamp by referencing its date.

# To get the value for Jan 1, write:

timeSeries['2017-01-01']

1.4615191320090732

In [29]:
# Retrieving a Range of Dates
# Once your time series is indexed by date, you can retrieve a single date as well as a date range.
# Pass the start and end dates as a slice to retrieve all values in that range; both endpoints are included.

timeSeries['2017-01-01':'2017-01-02']

2017-01-01    1.461519
2017-01-02   -0.625641
dtype: float64
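The lookups above can also be written with `.loc`, which makes label-based access explicit. This is a small sketch with fixed, made-up values (the names `ts`, `window`, and `single` are illustrative) so the result is reproducible.

```python
import pandas as pd

idx = pd.date_range(start="2017-01-01", periods=5, freq="D")
ts = pd.Series([10, 20, 30, 40, 50], index=idx)

# A label-based date slice includes both the start and the end date.
window = ts.loc["2017-01-02":"2017-01-04"]
print(window)

# A single timestamp returns a scalar value.
single = ts.loc[pd.Timestamp("2017-01-03")]
print(single)
```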

>> **Generating Dates in a Range**

Say you know the start date and the end date, and you want to generate a set of dates in that range.

You can do that in Python in the following way.

In [30]:
pd.date_range(start='2017-01-01',end='2017-01-19',freq='B')
# freq='B' generates business days only (Monday to Friday)

DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04', '2017-01-05',
               '2017-01-06', '2017-01-09', '2017-01-10', '2017-01-11',
               '2017-01-12', '2017-01-13', '2017-01-16', '2017-01-17',
               '2017-01-18', '2017-01-19'],
              dtype='datetime64[ns]', freq='B')
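One way to confirm that the generated range skips weekends is to inspect the weekday number of each date. This is an illustrative check, not part of the original notebook.

```python
import pandas as pd

bdays = pd.date_range(start="2017-01-01", end="2017-01-19", freq="B")

# dayofweek encodes Monday as 0 and Sunday as 6,
# so business days never exceed 4 (Friday).
weekday_numbers = bdays.dayofweek
print(len(bdays))
print(weekday_numbers.max())
```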

>> **Generating Dates in Time Intervals**

In the previous example, you saw how to generate business days within a date range.

You can go more granular and generate timestamps at fixed time steps (hours, minutes, or seconds).

Say you want to generate timestamps every hour, minute, or second, starting from Jan 1, 2017, 00:00:00. In the following examples, the parameter freq= controls how the successive values are generated.

In [31]:
pd.date_range(start="2017-01-01", periods=3, freq='H')

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 01:00:00',
               '2017-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='H')

In [32]:
pd.date_range(start="2017-01-01", periods=3, freq='T')

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:01:00',
               '2017-01-01 00:02:00'],
              dtype='datetime64[ns]', freq='T')

In [33]:
pd.date_range(start="2017-01-01", periods=3, freq='S')

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:00:01',
               '2017-01-01 00:00:02'],
              dtype='datetime64[ns]', freq='S')

>> **Varying Frequencies**

So far, you have seen how to generate datetime indices at a single fixed frequency.

Let's say you want to generate date-time values that are 1 day, 1 hour, 1 minute and 10 seconds apart.

You can combine several units in one frequency string, as in the code below.

In [34]:
pd.date_range(start="2017-01-01", periods=5, freq='1D1h1min10s')

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-02 01:01:10',
               '2017-01-03 02:02:20', '2017-01-04 03:03:30',
               '2017-01-05 04:04:40'],
              dtype='datetime64[ns]', freq='90070S')
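The `90070S` in the output is the combined step expressed in seconds. As a sketch of an equivalent approach, the step can be built as a `pd.Timedelta` and passed directly as `freq`; the variable names `step` and `stamps` are illustrative.

```python
import pandas as pd

# 1 day + 1 hour + 1 minute + 10 seconds as a single time step.
step = pd.Timedelta(days=1, hours=1, minutes=1, seconds=10)
print(step.total_seconds())  # 86400 + 3600 + 60 + 10 seconds

# date_range accepts a Timedelta as the frequency.
stamps = pd.date_range(start="2017-01-01", periods=5, freq=step)
print(stamps)
```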

>> **Generating Custom Date Ranges**

Instead of a plain daily or hourly frequency, you can anchor the frequency to a day of the week.

For example, you may want to generate a timestamp for every Friday, for five instances, from a given start date.

In [35]:
pd.date_range(start="2017-01-01", periods=5, freq='W-FRI')

DatetimeIndex(['2017-01-06', '2017-01-13', '2017-01-20', '2017-01-27',
               '2017-02-03'],
              dtype='datetime64[ns]', freq='W-FRI')
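To check that an anchored weekly frequency really lands on the chosen weekday, you can ask the index for its day names. This is a short illustrative check.

```python
import pandas as pd

fridays = pd.date_range(start="2017-01-01", periods=5, freq="W-FRI")

# day_name() returns the weekday name of every timestamp in the index.
names = fridays.day_name().tolist()
print(names)
```

The start date (Jan 1, 2017) is a Sunday, so the first generated value is the following Friday, Jan 6.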

>> **Combining Indices**

Suppose you have generated separate indices with different dates and want to combine them. How would you do that?
The code below shows the steps.

In [36]:
# The first index holds the first business day of each year, for 10 years starting in 2017 (freq='BAS-JAN').
# The second index holds the last calendar day of February of each year, for 10 years starting in 2017 (freq='A-FEB').
# The union() function combines both indices into a single sorted index.

a = pd.date_range(start="2017-01-01", periods=10, freq='BAS-JAN')
b = pd.date_range(start="2017-01-01", periods=10, freq='A-FEB')

a.union(b)

DatetimeIndex(['2017-01-02', '2017-02-28', '2018-01-01', '2018-02-28',
               '2019-01-01', '2019-02-28', '2020-01-01', '2020-02-29',
               '2021-01-01', '2021-02-28', '2022-01-03', '2022-02-28',
               '2023-01-02', '2023-02-28', '2024-01-01', '2024-02-29',
               '2025-01-01', '2025-02-28', '2026-01-01', '2026-02-28'],
              dtype='datetime64[ns]', freq=None)
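When the two indices overlap, `union()` keeps each date only once. The sketch below uses two small daily ranges (names `first` and `second` are illustrative) that share one date.

```python
import pandas as pd

first = pd.date_range(start="2017-01-01", periods=3, freq="D")   # Jan 1 to Jan 3
second = pd.date_range(start="2017-01-03", periods=3, freq="D")  # Jan 3 to Jan 5

# Jan 3 appears in both indices but only once in the result.
combined = first.union(second)
print(combined)
```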

## Resampling

>> **Downsample Scenario**

Let us take an example where customers are visiting a supermarket.

You are interested in studying the customer incidence pattern at different time steps.

You can simulate that scenario in the following way using Python.

In [37]:
# An ISO-style date string (year-month-day) avoids any day/month ambiguity.
customerArrival = pd.date_range('2017-09-18 08:00', periods=600, freq='T')

custArrivalTs = pd.Series(np.random.randint(0, 100, len(customerArrival)), index=customerArrival)
custArrivalTs.head(10)

2017-09-18 08:00:00    13
2017-09-18 08:01:00    38
2017-09-18 08:02:00    43
2017-09-18 08:03:00    93
2017-09-18 08:04:00     8
2017-09-18 08:05:00    26
2017-09-18 08:06:00    14
2017-09-18 08:07:00    80
2017-09-18 08:08:00    68
2017-09-18 08:09:00     1
Freq: T, dtype: int32

In [38]:
#The data says that 13 customers arrived at 8:00 and 26 customers arrived at 8:05. These are just hypothetical numbers.

>> **Downsample Data**

In the previous section, you saw how to create a random customer incidence scenario for every minute.
Suppose you are not interested in customer incidence every minute, but want the mean customer incidence every 10 minutes.

You can resample (downsample) your time series in the following way.

In [57]:
custArrivalTs.resample('10min').mean().head()

2017-09-18 08:00:00    38.4
2017-09-18 08:10:00    40.1
2017-09-18 08:20:00    43.7
2017-09-18 08:30:00    60.5
2017-09-18 08:40:00    54.3
Freq: 10T, dtype: float64
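Because the series above is random, the means are hard to verify by eye. The following sketch uses fixed values (0 to 19, one per minute) so each 10-minute mean can be checked by hand; the names `counts` and `ten_min_mean` are illustrative.

```python
import numpy as np
import pandas as pd

minutes = pd.date_range("2017-09-18 08:00", periods=20, freq="min")
counts = pd.Series(np.arange(20), index=minutes)

# Each bin covers 10 consecutive minutes and is labelled by its start time.
# Bin 08:00 holds 0..9 (mean 4.5); bin 08:10 holds 10..19 (mean 14.5).
ten_min_mean = counts.resample("10min").mean()
print(ten_min_mean)
```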

In [40]:
custArrival10 = custArrivalTs.resample('10min')

In [58]:
custArrivalTs.resample('10min').interpolate(method='linear').head()

2017-09-18 08:00:00    13
2017-09-18 08:10:00    22
2017-09-18 08:20:00    64
2017-09-18 08:30:00    20
2017-09-18 08:40:00    16
Freq: 10T, dtype: int32

>> **Custom Aggregation**

If you do not want to aggregate using the mean, you can apply a different aggregation function, such as the sum.
See the code below to understand that process.

In [59]:
custArrivalTs.resample('10min').sum().head()

2017-09-18 08:00:00    384
2017-09-18 08:10:00    401
2017-09-18 08:20:00    437
2017-09-18 08:30:00    605
2017-09-18 08:40:00    543
Freq: 10T, dtype: int32
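If you need several aggregations at once, one option is to pass a list of function names to `agg()` on the resampler, which returns one column per function. This sketch uses fixed values so the results are easy to check.

```python
import numpy as np
import pandas as pd

minutes = pd.date_range("2017-09-18 08:00", periods=20, freq="min")
counts = pd.Series(np.arange(20), index=minutes)

# One row per 10-minute bin, one column per aggregation.
summary = counts.resample("10min").agg(["sum", "mean", "max"])
print(summary)
```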

>> **Other Custom Aggregation Options**

You have seen how to apply different aggregation functions to downsample time series data.
In this example, you will see how to get the maximum incidence in a given time interval.

In [44]:
custArrivalTs.resample('1h').max().head()

2017-09-18 08:00:00    96
2017-09-18 09:00:00    99
2017-09-18 10:00:00    99
2017-09-18 11:00:00    99
2017-09-18 12:00:00    99
Freq: H, dtype: int32

In [45]:
#The above output is the maximum incidence within each hour.

>> **Using Lambda Function in Custom Aggregation**

When you downsample and want to write your own custom aggregation function, you can do so in the following manner.

In [65]:
import random

# random.choice expects a plain sequence, so convert each hourly group to a list first.
custArrivalTs.resample('1h').apply(lambda m: random.choice(m.tolist())).head()

2017-09-18 08:00:00    55
2017-09-18 09:00:00    47
2017-09-18 10:00:00    64
2017-09-18 11:00:00    68
2017-09-18 12:00:00     1
Freq: H, dtype: int32
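A custom aggregation does not have to be a lambda. As a sketch, the hypothetical helper `value_range` below computes the spread between the largest and smallest value in each bin, using fixed data so the result is deterministic.

```python
import pandas as pd

def value_range(group):
    """Hypothetical custom aggregation: largest value minus smallest value."""
    return group.max() - group.min()

minutes = pd.date_range("2017-09-18 08:00", periods=6, freq="min")
visits = pd.Series([3, 9, 4, 10, 2, 8], index=minutes)

# Bin 08:00 holds [3, 9, 4]; bin 08:03 holds [10, 2, 8].
spread = visits.resample("3min").apply(value_range)
print(spread)
```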

>> **Open High Low Close**

Let's say you are analyzing customer incidence data and want to see the opening, closing, high and low incidence values in each interval of time.
How will you do that?

See the code below.

In [47]:
custArrivalTs.resample('1h').ohlc().head()

                     open  high  low  close
2017-09-18 08:00:00    13    96    0      8
2017-09-18 09:00:00    20    99    0     75
2017-09-18 10:00:00    35    99    4     60
2017-09-18 11:00:00    30    99    5     72
2017-09-18 12:00:00    74    99    1     42
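To see exactly what each OHLC column means, here is a small sketch with fixed values: open is the first value in the bin, close is the last, and high and low are the maximum and minimum.

```python
import pandas as pd

minutes = pd.date_range("2017-09-18 08:00", periods=6, freq="min")
visits = pd.Series([5, 9, 2, 7, 1, 4], index=minutes)

# Bin 08:00 holds [5, 9, 2]; bin 08:03 holds [7, 1, 4].
bars = visits.resample("3min").ohlc()
print(bars)
```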


## Upsampling

In upsampling, the resampled series has a higher frequency (more data points) than the originally captured data.

For example, create ten timestamps one hour apart on a given date, each with a random value.

In [48]:
sampleRng = pd.date_range('9/18/2017 8:00', periods=10, freq='H')

sampleTs = pd.Series(np.random.randint(0, 100, len(sampleRng)), index=sampleRng)

In [49]:
sampleTs

2017-09-18 08:00:00    27
2017-09-18 09:00:00     9
2017-09-18 10:00:00    23
2017-09-18 11:00:00    95
2017-09-18 12:00:00    73
2017-09-18 13:00:00    46
2017-09-18 14:00:00     9
2017-09-18 15:00:00    59
2017-09-18 16:00:00    82
2017-09-18 17:00:00    52
Freq: H, dtype: int32

>> **Upsampling Example**

In the previous section, you saw how to create a sample time series with one value every hour.

If you want to study your data every 15 minutes, you have to perform upsampling. The new timestamps have no observed values, so they appear as NaN.

In [67]:
sampleTs.resample('15min').mean().head()

2017-09-18 08:00:00    27.0
2017-09-18 08:15:00     NaN
2017-09-18 08:30:00     NaN
2017-09-18 08:45:00     NaN
2017-09-18 09:00:00     9.0
Freq: 15T, dtype: float64
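A quick way to see how many gaps upsampling introduces is to reindex to the new frequency with `asfreq()` and count the missing values. This sketch uses three fixed hourly values; the names `hourly` and `quarter_hourly` are illustrative.

```python
import pandas as pd

hours = pd.date_range("2017-09-18 08:00", periods=3, freq="h")
hourly = pd.Series([27, 9, 23], index=hours)

# asfreq() places the original values on the 15-minute grid
# and leaves every new timestamp empty (NaN).
quarter_hourly = hourly.resample("15min").asfreq()
print(quarter_hourly)
print(quarter_hourly.isna().sum())
```

Two hours at 15-minute steps give nine timestamps, of which only the three original hours carry values.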

>> **Forward Filling**

Forward and backward filling can be used to fill missing values.

In forward filling, each missing value is filled with the last observed value before it.

In [68]:
sampleTs.resample('15min').ffill().head()

2017-09-18 08:00:00    27
2017-09-18 08:15:00    27
2017-09-18 08:30:00    27
2017-09-18 08:45:00    27
2017-09-18 09:00:00     9
Freq: 15T, dtype: int32

>> **Backward Filling**

In backward filling, each missing value is filled with the next observed value after it.

In [69]:
sampleTs.resample('15min').bfill().head()

2017-09-18 08:00:00    27
2017-09-18 08:15:00     9
2017-09-18 08:30:00     9
2017-09-18 08:45:00     9
2017-09-18 09:00:00     9
Freq: 15T, dtype: int32
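Placing both fill directions side by side makes the difference easier to see. The sketch below uses two fixed hourly values and collects the results in a small DataFrame (the column names are illustrative).

```python
import pandas as pd

hours = pd.date_range("2017-09-18 08:00", periods=2, freq="h")
hourly = pd.Series([27, 9], index=hours)

upsampled = hourly.resample("15min")
comparison = pd.DataFrame({
    "ffill": upsampled.ffill(),  # carry the previous observation forward
    "bfill": upsampled.bfill(),  # pull the next observation backward
})
print(comparison)
```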

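>> **Fill with Limitation**

When you fill the missing values, you can also limit the number of consecutive fills. With limit=2, only the first two missing values after each observation are filled.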

In [70]:
sampleTs.resample('15min').ffill(limit=2).head()

2017-09-18 08:00:00    27.0
2017-09-18 08:15:00    27.0
2017-09-18 08:30:00    27.0
2017-09-18 08:45:00     NaN
2017-09-18 09:00:00     9.0
Freq: 15T, dtype: float64
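As a further sketch of how `limit` behaves, a stricter `limit=1` fills only the first gap after each observation, and the remaining gaps can be counted with `isna()`.

```python
import pandas as pd

hours = pd.date_range("2017-09-18 08:00", periods=2, freq="h")
hourly = pd.Series([27, 9], index=hours)

# Only 08:15 is filled from 08:00; 08:30 and 08:45 stay missing.
limited = hourly.resample("15min").ffill(limit=1)
print(limited)
print(limited.isna().sum())
```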

>> **Interpolation Example**

In the example below, you will see how to use interpolation to fill the missing values.

In [54]:
print(sampleTs)

interEx = sampleTs.resample('15min')

interEx

2017-09-18 08:00:00    27
2017-09-18 09:00:00     9
2017-09-18 10:00:00    23
2017-09-18 11:00:00    95
2017-09-18 12:00:00    73
2017-09-18 13:00:00    46
2017-09-18 14:00:00     9
2017-09-18 15:00:00    59
2017-09-18 16:00:00    82
2017-09-18 17:00:00    52
Freq: H, dtype: int32


DatetimeIndexResampler [freq=<15 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]

In [55]:
interEx.interpolate().head(10)

2017-09-18 08:00:00    27.0
2017-09-18 08:15:00    22.5
2017-09-18 08:30:00    18.0
2017-09-18 08:45:00    13.5
2017-09-18 09:00:00     9.0
2017-09-18 09:15:00    12.5
2017-09-18 09:30:00    16.0
2017-09-18 09:45:00    19.5
2017-09-18 10:00:00    23.0
2017-09-18 10:15:00    41.0
Freq: 15T, dtype: float64
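Linear interpolation spreads the change between two observations evenly over the gap. The sketch below reproduces the first hour of the output above by hand: going from 27 to 9 over four 15-minute steps means a change of -4.5 per step.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2017-09-18 08:00", periods=2, freq="h")
hourly = pd.Series([27, 9], index=hours)

interpolated = hourly.resample("15min").interpolate(method="linear")

# Manual calculation: start value plus an equal share of the total change per step.
step = (9 - 27) / 4
manual = [27 + step * i for i in range(5)]

print(interpolated)
print(manual)
```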