# Date and Time

Date and time information is ubiquitous in real-world datasets. It is important to understand how dates and times are handled in Python and, more specifically, in the `pandas` package.

Some of the material discussed here was borrowed from [here](https://github.com/ResearchComputing/Meetup-Fall-2013). 

## The `datetime` type

In [None]:
from datetime import datetime, date, time
from datetime import timedelta

Note the form of the import statements here: `from module import name` imports only the specific names you need.

Here we import the classes `datetime`, `date`, `time`, and `timedelta` from the standard-library `datetime` module, so they can be used without the `datetime.` prefix.
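
As a small sketch of the difference between the two import styles (the alias `dt_module` is only an illustrative name, not something used elsewhere in this notebook):

```python
import datetime as dt_module               # import the whole module under an illustrative alias
from datetime import datetime, timedelta   # import specific classes from the module

# Both names refer to the very same class object.
same_class = dt_module.datetime is datetime

# With the module import, every use needs the module prefix.
a = dt_module.datetime(2018, 12, 4)
# With the from-import, the class name is used directly.
b = datetime(2018, 12, 4)

print(same_class)  # True
print(a == b)      # True
```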

In [None]:
now = datetime.now()
print(now)
print(now.year, now.month, now.day)
print(now.hour, now.minute, now.second)
print(now.microsecond)
print(now.hour, now.year)

In [None]:
soon = datetime(2021,day=20, month=5)
print(soon)
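
The `date` and `time` classes imported above can be used on their own as well. A minimal sketch, with made-up values, of combining them into a `datetime` and splitting a `datetime` back into its parts:

```python
from datetime import datetime, date, time

# Build the date and time parts separately, then combine them.
d = date(2021, 5, 20)
t = time(9, 30)
combined = datetime.combine(d, t)
print(combined)            # 2021-05-20 09:30:00

# Split a datetime back into its parts.
print(combined.date())     # 2021-05-20
print(combined.time())     # 09:30:00

# weekday() counts from Monday = 0.
print(combined.weekday())  # 3 (Thursday)
```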

## datetime.timedelta

In [None]:
delta = datetime(2013,12,3) - datetime(2012,12,3)
print(type(delta))
print(delta.days)
print(delta.seconds)

In [None]:
print(now)
print(now + timedelta(seconds=600)) #10 minutes
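
One detail worth checking with a small example: `delta.seconds` holds only the part of the duration left over after whole days, while `total_seconds()` gives the full duration. The dates below are made up for illustration.

```python
from datetime import datetime, timedelta

delta = datetime(2013, 12, 3, 12, 0) - datetime(2012, 12, 3, 6, 30)

print(delta.days)             # 365
print(delta.seconds)          # 19800 (the leftover 5 hours 30 minutes)
print(delta.total_seconds())  # 31555800.0 (365 days plus the leftover)

# The same split applies to a timedelta built directly.
day_and_a_half = timedelta(hours=36)
print(day_and_a_half.days, day_and_a_half.seconds)  # 1 43200
```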

## Converting between strings with strftime, strptime, and dateutil.parser

When you load data from a file, you often need to convert a column of strings to a datetime format, or to convert a time to a specific string format. 

### Converting datetime to a string

In [None]:
print(str(now))
print(str(soon))

For more control over the output, use `strftime` (string format time -> str**f**time).

The [documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) explains the format codes, such as `%Y`, `%b`, `%A`, etc. 

In [None]:
print(now.strftime('%Y-%b-%d'))
print(now.strftime('%Y/%b/%d'))
print(now.strftime('%m/%d/%y'))  # portable spelling of '%D', which is platform-specific

In [None]:
now = datetime.now()
print(now.strftime('%a, %d %b %Y %H:%M:%S %p'))  # '%P' (lowercase am/pm) is a glibc extension, so it is left out
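
Because `now` changes on every run, a fixed datetime makes the format codes easier to check. This is a sketch with a made-up moment; the month and weekday names assume the default English (C) locale.

```python
from datetime import datetime

moment = datetime(2018, 12, 4, 15, 5, 9)

print(moment.strftime('%Y-%m-%d'))      # 2018-12-04
print(moment.strftime('%d %b %Y'))      # 04 Dec 2018
print(moment.strftime('%A'))            # Tuesday
print(moment.strftime('%H:%M'))         # 15:05 (24-hour clock)
print(moment.strftime('%I:%M:%S %p'))   # 03:05:09 PM (12-hour clock)

# The same format string can be used to read the text back.
fmt = '%Y-%m-%d %H:%M:%S'
text = moment.strftime(fmt)
round_trip = datetime.strptime(text, fmt)
print(round_trip == moment)             # True
```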

### Converting string to a datetime

We will use the `strptime` method (string parse time -> str**p**time) to parse a string into a datetime.


In [None]:
print(datetime.strptime('2018-12-4', '%Y-%m-%d'))

In [None]:
date_list = ['2018-12-19 05:26:39', 
             '2018-12-19 07:00:39', 
             '2018-12-19 09:00:39']

[ datetime.strptime(x, '%Y-%m-%d %H:%M:%S') for x in date_list ]

In [None]:
date_list = ['2018-12-19 05:26:39', 
             '2018-12-19 07:00', 
             '2018-12-19 09:00']

[datetime.strptime(x, '%Y-%m-%d %H:%M:%S') for x in date_list]

The above conversion fails because the second and third strings are not formatted like the first: they contain the hour and minute but not the seconds, so they do not match the `%S` in the format string. 
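
One way to stay with `strptime` is to try several formats in turn. The sketch below is only an illustration: `CANDIDATE_FORMATS` and `parse_with_fallback` are hypothetical names, and the list of formats is an assumption about what the data contains.

```python
from datetime import datetime

# Hypothetical list of formats to try, most specific first.
CANDIDATE_FORMATS = ['%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M']

def parse_with_fallback(text, formats=CANDIDATE_FORMATS):
    """Return the datetime for the first format that matches text."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f'no candidate format matched {text!r}')

date_list = ['2018-12-19 05:26:39',
             '2018-12-19 07:00',
             '2018-12-19 09:00']

parsed = [parse_with_fallback(x) for x in date_list]
for dt in parsed:
    print(dt)
```

Strings that match none of the listed formats still raise a `ValueError`, so unexpected input is not silently accepted.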

### `dateutil.parser` to the rescue

This parser tries to infer the format on its own, so no format string is needed. 

In [None]:
from dateutil.parser import parse

print(parse('2018-04-25'))
print(parse('2018/04/25'))

In [None]:
# It smartly parses the text to convert to date time. 

print(parse('April 4th, 2018 at 11:30am'))

In [None]:
# Though it is not super smart

print(parse('tomorrow at 11:30am'))

In [None]:
date_list = ['2018-12-19 05:26:39', 
             '2018-12-19 07', 
             '2018-12-19 09:00']

dates = [parse(x) for x in date_list]
for dt in dates:
    print(dt)
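
Guessing the format has a cost: ambiguous strings are resolved by convention. A short sketch, using made-up strings, of the `dayfirst` option and of catching a string the parser cannot handle:

```python
from dateutil.parser import parse

# By default an ambiguous date is read month first.
month_first = parse('04/05/2018')
print(month_first)   # 2018-04-05 00:00:00

# dayfirst=True reads the same text as 4th May.
day_first = parse('04/05/2018', dayfirst=True)
print(day_first)     # 2018-05-04 00:00:00

# Parsing failures raise a ValueError (or a subclass of it), so they can be caught.
try:
    parse('tomorrow at 11:30am')
except ValueError as err:
    print('could not parse:', err)
```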

# Time Series with pandas

The `pandas` package has a lot of functionality for handling time-based datasets. 

pandas was originally developed by people working in the finance industry, so it is no wonder that it has very efficient methods for handling time series. 

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

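# Newer matplotlib versions name this style 'seaborn-v0_8'; fall back to the old name otherwise.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')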

In [None]:
date = pd.to_datetime("4th July, 2018")
date.strftime('%A')

In [None]:
date = pd.to_datetime("15th Nov, 2018 at 12:30:45 PM")
date + pd.to_timedelta(2, 'D')

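The `'D'` unit stands for a day. `pd.to_timedelta` accepts only fixed-length units; calendar months and years vary in length, so `'M'` and `'Y'` are rejected by current pandas versions, and `pd.DateOffset` is used for them instead. The unit codes and their descriptions:

| Code      | Description |
|-----------|-------------|
|``D``      |Calendar Day |
|``W``      |Week |
|``h``      |Hour |
|``min``    |Minute |
|``s``      |Seconds |

Older pandas versions also accept ``H``, ``T`` and ``S`` for hour, minute and second.

In [None]:
print("original date: ", date)
print(date + pd.DateOffset(months=1))   # calendar month; 'M' is not a valid timedelta unit
print(date + pd.to_timedelta(1, 'min'))
print(date + pd.to_timedelta(1, 'W'))
print(date + pd.to_timedelta(365, 'D'))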
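
The difference between a fixed-length shift and a calendar-aware shift shows up at month ends. This is a sketch with a made-up start date; `pd.Timedelta` always adds an exact duration, while `pd.DateOffset` follows the calendar.

```python
import pandas as pd

start = pd.Timestamp('2018-01-31 12:30:45')

# A fixed duration of 30 days lands in March, because February 2018 has 28 days.
fixed_shift = start + pd.Timedelta(days=30)
print(fixed_shift)      # 2018-03-02 12:30:45

# A calendar month moves to the last valid day of February.
month_shift = start + pd.DateOffset(months=1)
print(month_shift)      # 2018-02-28 12:30:45

# A calendar year keeps the same month and day.
year_shift = start + pd.DateOffset(years=1)
print(year_shift)       # 2019-01-31 12:30:45
```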


## Working with Google stock price data

This data was extracted from [Yahoo-finance](https://finance.yahoo.com/quote/GOOG/chart?p=GOOG) data. 

In [None]:
goog = pd.read_csv("./data/Google_Stock_Price.csv")
goog.head()

In [None]:
goog.dtypes

Note that the `Date` column is read as `object`, which here means it holds strings. It is much more useful to convert it to a datetime format. 

## `pd.to_datetime` 

`pd.to_datetime` converts a column that was read as strings to a datetime type. 

The `to_datetime()` function can often infer the format of the dates by itself, but sometimes it is necessary to provide the `format` parameter. [Look](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) here for more options. 
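
A minimal sketch of passing `format` explicitly, using a made-up Series with day-first dates and one bad entry. The `errors='coerce'` option turns values that do not match into `NaT` instead of raising.

```python
import pandas as pd

# Made-up day/month/year strings, plus one value that is not a date.
raw = pd.Series(['19/12/2018', '04/01/2019', 'not a date'])

converted = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(converted)

# Count the entries that could not be converted.
n_bad = int(converted.isna().sum())
print(n_bad)  # 1
```

Without `format`, a string like `04/01/2019` could be read as 1st April, so stating the format avoids a silent mix-up.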

In [None]:
# Converting the date and replacing it with the same column 
goog['Date'] = pd.to_datetime(goog['Date'])
goog.dtypes

Notice that the `Date` column is no longer an `object` column but has a `datetime64` data type. Although it looks the same when displayed, the datetime type has many advantages, as we will see later. 
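
One immediate advantage is the `.dt` accessor, which exposes the parts of each date. A short sketch on a made-up Series (the weekday names assume the default English locale):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2018-12-19', '2019-01-04', '2019-07-04']))

years = dates.dt.year.tolist()
months = dates.dt.month.tolist()
day_names = dates.dt.day_name().tolist()
print(years)      # [2018, 2019, 2019]
print(months)     # [12, 1, 7]
print(day_names)  # ['Wednesday', 'Friday', 'Thursday']

# Date arithmetic also works directly on the column.
span_days = (dates.max() - dates.min()).days
print(span_days)  # 197
```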

In [None]:
goog.head()

In [None]:
goog['Date'].max()

In [None]:
goog['Date'].min()

# Pandas Time Series

## Setting the index

You can use the `set_index` method to set a specific column as the index. 

Here we set the `Date` column as the index. 

In [None]:
goog.set_index('Date', inplace=True)
goog.head()
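
The conversion and the index can also be set up while reading the file. The sketch below uses a tiny in-memory CSV with made-up values in place of a real file; `parse_dates` and `index_col` are `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a CSV file; the values are made up.
csv_text = """Date,Open,Close
2015-12-01,100.0,101.5
2015-12-02,101.5,99.0
2015-12-03,99.0,102.0
"""

demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'], index_col='Date')

print(type(demo.index).__name__)  # DatetimeIndex
print(demo.loc['2015-12-02', 'Close'])
```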

## Indexing by Time

A datetime index is very useful for selection, grouping, and many other time series manipulations. 

### Selection based on the index

In [None]:
#Selecting the range of dates 
goog.loc['2007-09-17': '2007-09-21']

In [None]:
# The index can now parse date strings written in other formats
goog.loc['Nov 7th 2015':'Nov 16th 2015']

In [None]:
#You can select the whole month or year
goog.loc['Dec 2015']

In [None]:
# You can do the average stock price in the month of Jan in 2014

goog.loc['Jan 2014']['Close'].mean()

Notice that three things happen in the above statement when computing the mean closing price for January. 
* First, we select all the rows whose index (date) falls in January 2014.
* Then we select the 'Close' column, which is the closing price of the stock.
* Finally, we compute the mean. 
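
The three steps can be traced on a small synthetic frame. This is a sketch with made-up values (the `Close` column simply counts days from zero), not the Google data.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for the first quarter of 2014.
idx = pd.date_range('2014-01-01', '2014-03-31', freq='D')
demo = pd.DataFrame({'Close': np.arange(len(idx), dtype=float)}, index=idx)

january = demo.loc['2014-01']        # step 1: every row in January 2014
january_close = january['Close']     # step 2: only the 'Close' column
january_mean = january_close.mean()  # step 3: the mean

print(len(january))   # 31
print(january_mean)   # 15.0

# Rows and column can also be selected in a single .loc call.
one_step_mean = demo.loc['2014-01', 'Close'].mean()
print(one_step_mean)  # 15.0
```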

In [None]:
# You can also do a very smart indexing with pandas time series objects

# NOTICE HOW YOU CAN SPECIFY THE DATES IN ANY FORMAT YOU WANT

goog.loc['Nov 2011':'12/2011':4]

Notice that this selects the rows from Nov 2011 through Dec 2011, and the final `:4` is the slice step: it keeps every 4th row, that is, every 4th trading day in the data rather than every 4th calendar day. 
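
A sketch on synthetic business-day data shows that the step counts rows, not calendar days. The values are just row positions, chosen so the result is easy to read.

```python
import pandas as pd

# Business days only, like trading data without holidays.
idx = pd.bdate_range('2011-11-01', '2011-12-30')
demo = pd.Series(range(len(idx)), index=idx)

every_fourth = demo.loc['2011-11':'2011-12':4]

print(every_fourth.head(3))
# The first three rows kept are positions 0, 4 and 8:
# 2011-11-01 (Tue), 2011-11-07 (Mon), 2011-11-11 (Fri)
```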

# Activity

We will work with a dataset based on Group Project 1, but now as time series data indexed by departure time. 

In [None]:
flights = pd.read_csv("./data/flights_departures.csv")
flights.head()

In [None]:
flights.dtypes

#### Step 1

* Convert the 'DEPARTURE_DATETIME' to a datetime data type.
* Now look at the dtypes of the 'flights' DataFrame by typing flights.dtypes

#### Step 2

* Set the 'DEPARTURE_DATETIME' as an index for the flights DataFrame. 

#### Step 3: Selecting and producing summary statistics by using DateTime index

Keep in mind that all these summary statistics are computed on a sample of the data and not on the full dataset. 

* Select all the flights on Nov 15th 2015. 
* How many flights are in the dataset on July 4th 2015? 
* How many unique airlines flew on Labor Day 2015? 
    * You might have to look up the date of Labor Day in 2015, as pandas will not work it out from the name. 
* What is the median distance traveled by flights between July 15th 2015 and July 20th 2015? 

# Plotting TimeSeries data

Once a DataFrame or Series has a datetime index, it becomes really easy to plot using `matplotlib`.

In [None]:
# Extract the OHLC (Open, High, Low, Close) columns of the dataset. 

goog_new = goog[['Open','High','Low','Close']]

In [None]:
goog_new.head()

### Plotting timeseries data

In [None]:
figure, axes = plt.subplots(figsize=(12,8))

goog_new.plot(ax=axes)

The above plot shows four lines, one each for the Open, High, Low, and Close price of Google on each day. 

### Plotting only slices of the dataset

For example, let us look at the price of Google in Nov 2008 (during the financial crisis).

In [None]:
figure, axes = plt.subplots()

goog_new.loc['Nov 2008'].plot(ax = axes)

### Plotting the closing price of Google

Since all four columns (Open, High, Low, Close) are on the same scale, it makes sense to plot them together. However, the columns of a DataFrame are not always on the same scale, so you can select just the column you want to plot. 

In [None]:
figure, axes = plt.subplots()
axes.plot(goog_new['Close'], label="Close")
# goog_new['Close'].plot(ax = axes)
axes.legend()
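
When two columns on different scales do need to share a figure, one common option is a second y-axis. This is a swapped-in technique, not something the cells here do, and the `Volume` column and all values below are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data: a price-like column and a much larger volume-like column.
idx = pd.date_range('2015-01-01', periods=60, freq='D')
demo = pd.DataFrame({'Close': np.linspace(500, 560, 60),
                     'Volume': np.linspace(1e6, 3e6, 60)}, index=idx)

figure, axes = plt.subplots(figsize=(8, 4))
axes.plot(demo.index, demo['Close'], label='Close')
axes.set_ylabel('Close')

# twinx() creates a second y-axis that shares the same x-axis.
right_axes = axes.twinx()
right_axes.plot(demo.index, demo['Volume'], color='gray', alpha=0.5, label='Volume')
right_axes.set_ylabel('Volume')
```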

### Plotting a slice of closing price of Google

In [None]:
figure, axes = plt.subplots()

goog_new.loc['2007 Dec':'2009 May']['Close'].plot(ax = axes)
axes.legend()

Plot two columns if you think they both are on the same scale. 

In [None]:
goog_new.loc['2007 Dec':'2009 May'][['High', 'Low']]

In [None]:
figure, axes = plt.subplots()

goog_new.loc['2007 Dec':'2009 May'][['High', 'Low']].plot(ax = axes)
axes.legend()

# Activity

We will use the flights DataFrame for this activity. 

In [None]:
figure, axes = plt.subplots()
flights.plot(ax = axes)

This does not make sense, as not all of the columns here are on the same scale. 

### Plotting Activity

Plot the following details from the dataset
* Plot the 'AIR_TIME' for the dataset. 
* Plot the 'DEPARTURE_DELAY' for the flights on 15th August 2015
* Plot the 'TAXI_IN' and 'TAXI_OUT' for the flights on July 4th 2015

# Time Series Operations: Resampling and Windowing

Resampling and windowing are very similar to the groupby operations we learned earlier, but they are much simpler to carry out when a DataFrame or Series in `pandas` has a datetime index.  

## `resample()` and `asfreq()` methods

The `resample()` method takes the frequency to resample to. For example, if you provide the frequency 'M' for month, it behaves like a groupby on each unique month in the dataset, after which you can apply aggregate operations. 

The `asfreq()` method simply returns the value at the end of each period. For example, if the frequency is 'M' for month, it returns the value of every column on the last day of each month. 

The table lists the frequency codes with their descriptions. pandas 2.2 and later rename some of these aliases (for example `ME` instead of `M`, `YE` instead of `A`, and `h`, `min`, `s` instead of `H`, `T`, `S`) and warn about the older spellings used here. 


| Code      | Description |                         
|-----------|-------------|
|``D``      |Calendar Day |
|``W``      |Weekly  |
|``M``      |Month end |
|``Q``      |Quarter end |
|``A``      |Year end |
|``H``      |Hours |
|``T``      |Minutes |
|``S``      |Seconds |
|``B``      |Business day|
|``BM``      |Business Month end|
|``BQ``      |Business Quarter end|
|``BA``      |Business Year end|
|``BH``      |Business Hours|
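
The difference between aggregating and picking a value can be seen on two weeks of synthetic daily data. This sketch uses the weekly code `W` and made-up values 1 to 14.

```python
import pandas as pd

# Two full weeks of daily values; 2018-01-01 is a Monday.
idx = pd.date_range('2018-01-01', periods=14, freq='D')
values = pd.Series(range(1, 15), index=idx, dtype=float)

# resample groups each week (ending on Sunday) and aggregates it.
weekly_mean = values.resample('W').mean()
print(weekly_mean.tolist())  # [4.0, 11.0]

# asfreq only picks the value that falls on each week-end date.
weekly_last = values.asfreq('W')
print(weekly_last.tolist())  # [7.0, 14.0]
```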

In [None]:
# Average Google price over each calendar year, labelled by the last day of the year
goog.resample('A').mean()


You can think of the above `resample()` and `mean()` as grouping by year and then computing the mean for each year. However, this is much simpler, as you can specify any frequency you want. 

In [None]:
# asfreq gives you the exact price on the very last day of each year. 
goog.asfreq('A')

Notice that there is no price for 31st Dec 2011, because it was a Saturday, not a working day. You can get the last business day of the year by using 'BA'. 

In [None]:
goog.asfreq('BA')
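
Another option for dates with no observation is to carry the previous value forward with the `method` parameter of `asfreq()`. A sketch on synthetic business-day data, where the values are just row positions:

```python
import pandas as pd

# Business days only, from Monday 2011-12-26 to Friday 2012-01-06.
idx = pd.bdate_range('2011-12-26', '2012-01-06')
prices = pd.Series(range(len(idx)), index=idx, dtype=float)

saturday = pd.Timestamp('2011-12-31')

# Converting to calendar days leaves gaps on the weekend.
daily = prices.asfreq('D')
print(daily.loc[saturday])    # nan

# method='ffill' fills each gap with the last known value.
filled = prices.asfreq('D', method='ffill')
print(filled.loc[saturday])   # 4.0 (Friday's value)
```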

In [None]:
goog.resample('BM').median().head()

**NOTE:** You can also specify a multiple of a frequency for the resampling. For example, you can resample every 5 days using `5D`. 

In [None]:
goog.resample('5D').mean().head(10)

### Plotting by resampling the data

In [None]:
goog_close = goog['Close']

figure, axes = plt.subplots(figsize=(12,8))

goog_close.plot(alpha = 0.5, style = '-', label='original')
goog_close.resample('BA').mean().plot(style=':', label='resample')
goog_close.asfreq('BA').plot(style='--',label='asfreq')
axes.legend()

# Rolling Window using `rolling()` method

Often you do not want the average at the end of each year or month, but a rolling average (or any other summary statistic) that moves along with the data. You can compute that with the `rolling()` method. 

In [None]:
goog.rolling(7).mean()

**NOTE**: Since each row here is a day, it might seem that the window covers the last 7 calendar days, but an integer window means 7 observations (that is, 7 trading days). 
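
With a datetime index, `rolling()` also accepts a time offset such as `'2D'`, which counts calendar time instead of rows. The sketch below compares the two on three made-up observations separated by a weekend.

```python
import pandas as pd

# Friday, Monday, Tuesday: the weekend leaves a gap in the index.
idx = pd.to_datetime(['2018-01-05', '2018-01-08', '2018-01-09'])
values = pd.Series([1.0, 2.0, 3.0], index=idx)

# Integer window: the last 2 observations, whatever their dates.
by_count = values.rolling(2).mean()
print(by_count.tolist())  # [nan, 1.5, 2.5]

# Offset window: only observations within the last 2 calendar days.
by_time = values.rolling('2D').mean()
print(by_time.tolist())   # [1.0, 2.0, 2.5]
```

On Monday the integer window still reaches back to Friday, while the two-day window contains Monday alone.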

### Plotting the rolling window averages

In [None]:
goog_close = goog['Close']

figure, axes = plt.subplots(figsize=(12,8))
goog_close.plot(ax = axes, label = 'actual')
goog_close.rolling(365).mean().plot(ax = axes, label = 'rolling_365')
axes.legend()

# Activity

Again we will use flights data for this activity. 

In [None]:
flights.head()

### Activity on `resample()`, `rolling()`

* Compute the average DISTANCE travelled by flights in each month
* Compute the median AIR_TIME of the flights in each 45-day period
* Compute the rolling average of DEPARTURE_DELAY over the last 30 observations

### Activity on plotting the resampling and rolling data

Plot the following details
* Plot the rolling average DISTANCE of previous 500 observations in the dataset