# Datetime Library
Fun story time! Pandas started out in the financial world, which is why it's so great at everything related to timeseries.

Today, we're going to review datetime objects, look at timedeltas, generate basic timeseries plots, and calculate autocorrelation using Python.



## Datetime Object

In [None]:
# The datetime module is part of Python's standard library, so you already have it (with or without Anaconda).
from datetime import datetime
# Quite a few of you are probably already familiar with it.

# Let's look at the date we once believed the world would end on.
lesson_date = datetime(2012, 12, 21, 12, 21, 12, 844089)


In [None]:
print("Microsecond", lesson_date.microsecond)
print("Second", lesson_date.second)
print("Minute", lesson_date.minute)
print("Hour", lesson_date.hour)
print("Day", lesson_date.day)
print("Month", lesson_date.month)
print("Year", lesson_date.year)


## Timedelta
Say we want to add or subtract time to/from a date. Perhaps we're using time as an index and we want to get everything that happened a week before a specific observation, for example.

We can use a timedelta object to shift a Datetime object. Here's an example:

In [None]:
# Import timedelta from datetime library
from datetime import timedelta

# Time deltas represent time as an amount as opposed to a fixed position.
offset = timedelta(days=1, seconds=20)

# the time delta has attributes that allow us to extract values from it.
print('offset days', offset.days)
print('offset seconds', offset.seconds)
print('offset microseconds', offset.microseconds)

In [None]:
now = datetime.now()
print("It's now: ", now)

In [None]:
print("Future: ", now + offset)
print("Past: ", now - offset)

_The largest unit a timedelta stores is days (the constructor also accepts weeks, which it converts to days). In other words, you can't ask for an offset of 2 years, 44 days and 12 hours directly. You would have to convert those years to days yourself._
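As a rough sketch of that manual conversion, the helper below turns years into days before building the timedelta. The name `offset_from_years` and the 365-day year are assumptions made for illustration; leap years are ignored.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365  # assumption: a plain 365-day year, leap years ignored


def offset_from_years(years, days=0, hours=0):
    """Hypothetical helper: fold years into days, since timedelta has no 'years' unit."""
    return timedelta(days=years * DAYS_PER_YEAR + days, hours=hours)


# 2 years, 44 days and 12 hours expressed as a timedelta
long_offset = offset_from_years(2, days=44, hours=12)
print(long_offset.days)     # whole days stored in the timedelta
print(long_offset.seconds)  # leftover seconds (the 12 hours)
```

If calendar-aware arithmetic matters (leap years, month lengths), a fixed days-per-year constant like this will drift, so treat it as an approximation.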


In [None]:
# Get a dataset from the internets
import pandas as pd
ufo = pd.read_csv('http://bit.ly/uforeports')

In [None]:
ufo.head()

In [None]:
# We can see that the Time column is just an object.
ufo.dtypes

In [None]:
# Overwrite the original Time column with one that has been converted to a datetime series.
ufo['Time'] = pd.to_datetime(ufo.Time)

# Letting pandas guess the format can take a little while, so we can pass a few arguments to help, e.g.:
# ufo['Time'] = pd.to_datetime(ufo.Time, format='%m/%d/%Y %H:%M', errors='coerce')
# format tells pandas how to interpret the date strings (it must match how they appear in the file).
# errors tells pandas how to deal with values that fail to convert ('coerce' turns them into NaT).

In [None]:
#the time column looks a bit different now!
ufo.head()

In [None]:
#let's take a look at how the series has changed
ufo.dtypes

In [None]:
# we can also use dt to get weekday names
ufo.Time.dt.day_name().head()

In [None]:
#and what day of the year it was!
ufo.Time.dt.dayofyear.head()

#### Independent activity:
Take 10 minutes to look at the different ways you can work with timezones and timezone formatting. Try creating a few new columns for things like daylight savings adjustment, timezone name, etc.

https://docs.python.org/2/library/datetime.html
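As a possible starting point for this activity, here is a small sketch using made-up timestamps rather than the UFO data. It assumes the naive times are New York wall-clock times, which is purely an illustrative choice; the column names are hypothetical too.

```python
import pandas as pd

# Made-up, timezone-naive timestamps (one in winter, one in summer).
times = pd.Series(pd.to_datetime(['2012-01-15 12:00', '2012-07-15 12:00']))

# Attach a timezone (assumption: these are America/New_York wall-clock times).
eastern = times.dt.tz_localize('America/New_York')

# Express the same instants in UTC.
utc = eastern.dt.tz_convert('UTC')

frame = pd.DataFrame({
    'eastern': eastern,
    'utc': utc,
    'tz_name': eastern.dt.strftime('%Z'),             # timezone abbreviation
    'dst_offset': eastern.apply(lambda t: t.dst()),   # daylight saving adjustment
})
print(frame)
```

`tz_localize` attaches a timezone to naive times without moving them, while `tz_convert` re-expresses already-aware times in another zone.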

## Time Stamps

In [None]:
#let's create a timestamp of interest
ts = pd.to_datetime('9/10/1993')
#^that's the day x-files first came out, for all of you wondering
ts
# A Timestamp is pandas' version of a datetime object.
# It can be compared against an entire datetime column at once.

In [None]:
# Use the timestamp we just saved to create a new dataframe.
ufo.loc[ufo.Time >= ts, :].head()

In [None]:
#we could create a new column looking at how far away from our point of interest a particular UFO was sighted
ufo['new'] = ufo.Time - ts

In [None]:
ufo.head()

In [None]:
ufo.tail()

In [None]:
# Subtracting the earliest timestamp from the latest gives a Timedelta: the span of the timeseries.
ufo.Time.max() - ufo.Time.min()

You can also use timedelta to mess around with the silly YouTube videos you're embedding in a notebook.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("hAAlDoAtV7Y")

In [None]:
start=int(timedelta(minutes=1, seconds=2).total_seconds())
YouTubeVideo("hAAlDoAtV7Y", start=start, autoplay=1, theme="light", color="red")

#### More independent work: 

Search for .dt. on http://pandas.pydata.org/pandas-docs/stable/api.html for more information about pandas Datetime.
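To give a flavor of what that search turns up, here is a short sketch of a few `.dt` accessors on a toy Series. The two dates are just the ones used earlier in this lesson; the variable name `stamps` is made up.

```python
import pandas as pd

# Two timestamps from this lesson, used purely for illustration.
stamps = pd.Series(pd.to_datetime(['1993-09-10 20:00', '2012-12-21 12:21']))

print(stamps.dt.year.tolist())          # calendar year
print(stamps.dt.quarter.tolist())       # quarter of the year (1-4)
print(stamps.dt.hour.tolist())          # hour of the day
print(stamps.dt.is_leap_year.tolist())  # whether the year is a leap year
print(stamps.dt.day_name().tolist())    # weekday names
```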

## Plotting a timeseries using pandas

In [None]:
#let's load in a different dataset
crime = pd.read_csv('https://raw.githubusercontent.com/rufuspollock/crime-data-sf/gh-pages/data/sfpd_incidents_march_2012.tidied.csv')

In [None]:
#taking a look at our different types
crime.dtypes

In [None]:
#so do we want to mess around with the date or the time?
crime.head()

In [None]:
#let's turn date into a datetime object
crime['Date'] = pd.to_datetime(crime.Date)

In [None]:
crime.tail()

In [None]:
#I'm arbitrarily picking weekday to be how we look at our data
crime['weekday'] = crime.Date.dt.weekday

In [None]:
#let's groupby weekday on this 
crime_ts = crime.groupby('weekday').aggregate(len)['IncidntNum']
#the groupby statement automatically makes weekday the index

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

plt.plot(crime_ts.index, crime_ts.values, lw=5)
#LW = line width!
#a small stringed instrument! a classical timeseries!

In [None]:
#let's convert the date to be the index
crime.set_index('Date', inplace=True)

In [None]:
crime['Month'] = crime.index.month
crime['weekday'] = crime.index.weekday

In [None]:
#an FYI-- filtering by date becomes really easy when you're working with it as an index!
crime.loc['2012-03-04']

In [None]:
#including looking at a range of observations
crime.loc['3/3/2012':'3/4/2012']

## Quick intro to autocorrelation and window functions

In [None]:
#load data!
url = 'https://raw.githubusercontent.com/sinanuozdemir/sfdat22/master/data/rossmann.csv'
data = pd.read_csv(url, skipinitialspace=True)
import seaborn as sns

In [None]:
data.head()

In [None]:
# Most interested in date - format properly and convert to index
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date', inplace=True)

In [None]:
# create new columns for year and month 
data['Year'] = data.index.year
data['Month'] = data.index.month

data.head() 

In [None]:
# There are over a million sales data points in this dataset, so for some simple EDA we will focus on just one store.
store1_data = data[data.Store == 1]
store1_data.head()


In [None]:
'''
As we begin to study the sales from this drugstore, we also want to know both the time-dependent elements of sales as 
well as whether promotions or holidays affected these sales. To start, we can compare the average sales on those events.
To compare sales on holidays, we can use box plots, which allow us to compare the distribution of 
sales on holidays against all other days. On state holidays the store is closed (which means there are 0 sales), and 
on school holidays the sales are relatively similar. These types of insights represent the contextual knowledge needed 
to truly explain time series phenomena. Can you think of any other special considerations we should make when tracking sales?
'''

# check similarity between School Holiday and Sales
sns.catplot(
    x='SchoolHoliday',
    y='Sales',
    data=store1_data,
    kind='box'
)

In [None]:
#  We can see that there is a difference in sales on promotion days
sns.catplot(
    col='Open',
    x='Promo',
    y='Sales',
    data=store1_data,
    kind='box'
)

In [None]:
'''
Why is it important to separate out days where the store is closed? 
Because there aren't any promotions on those days either, so including 
them will bias your sales data on days without promotions! Remember: 
Data Scientists need to think about the business logic (context) as well as 
analyzing the raw data.
'''

# perhaps plot sales across day of the week
sns.catplot(
    col='Open',
    x='DayOfWeek',
    y='Sales',
    data=store1_data,
    kind='box',
)


In [None]:
# Consider sales across multiple years. How did sales change from 2014 to 2015?

# Filter to days store 1 was open
store1_open_data = store1_data[store1_data.Open==1]
store1_open_data[['Sales']].plot()          # sales over time
store1_open_data[['Customers']].plot()      # customers over time

# EXERCISE: Use filtering to show the trend in 2015 alone

store1_data_2015 = store1_data.loc['2015']
store1_data_2015[
    store1_data_2015.Open==1
][['Sales']].plot()


#### Check:

What is autocorrelation?

Autocorrelation measures the statistical correlation of a time series with a _lagged_ version of itself.
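To make "a lagged version of itself" concrete, here is a minimal sketch on a made-up series (a repeating 7-day pattern plus a small trend, not the sales data). The helper name `lagged_corr` is hypothetical; it just shifts the series and correlates it with the original, and should agree with pandas' own `autocorr`.

```python
import numpy as np
import pandas as pd

# Toy series (made up): a weekly pattern repeated 8 times, plus a gentle upward trend.
pattern = [10, 12, 14, 13, 11, 5, 4] * 8
s = pd.Series(pattern, dtype=float) + np.arange(len(pattern)) * 0.1


def lagged_corr(series, lag):
    """Correlate a series with a copy of itself shifted by `lag` rows."""
    return series.corr(series.shift(lag))


# The hand-rolled version and pandas' autocorr should give the same numbers.
print(lagged_corr(s, 1), s.autocorr(lag=1))
print(lagged_corr(s, 7), s.autocorr(lag=7))
```

Because the toy pattern repeats every 7 rows, the lag-7 correlation comes out far higher than the lag-1 correlation, which is the kind of signal to look for in seasonal data.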

In [None]:
'''
Computing Autocorrelation
To measure how much the sales are correlated with each other, we want to compute 
the autocorrelation of the 'Sales' column. In pandas, we'll do this with the 
autocorr function.
autocorr takes one argument, the lag, which is how many periods the series 
is shifted back before computing the correlation. If we set the lag to 1, we compute 
the correlation between every point and the point directly preceding it. 
If we set the lag to 10, this computes the correlation between every point 
and the point 10 days earlier:
'''

data['Sales'].resample('D').mean().autocorr(lag=1)

In [None]:
#that's a pretty small mean correlation. what if we look at the autocorrelation for 30 days
data['Sales'].resample('D').mean().autocorr(lag=30)

In [None]:
'''
If we want to investigate trends over time in sales, as always, we will 
start by computing simple aggregates. We want to know: what were the mean 
and median sales in each month and year?
In Pandas, this is performed using the resample command, which is very 
similar to the groupby command. It allows us to group over different 
time intervals.
We can use data.resample and provide as arguments: - The level on 
which to roll-up to, 'D' for day, 'W' for week, 'M' for month, 'A' 
for year - The aggregation to perform: 'mean', 'median', 'sum', etc.
'''

# Mean sales per year
data[['Sales']].resample('A').mean()

In [None]:
data.resample('A').mean(numeric_only=True)    # whole dataframe (numeric columns only)

In [None]:
# Here we can see again that December 2013 and 2014 were the highest average sale months.
data[['Sales']].resample('M').mean() 

In [None]:
# Resample to get the daily average over all stores
# Alternatively, this could be a daily total over all stores with .sum()
daily_store_sales = data[['Sales']].resample('D').mean()
daily_store_sales

In [None]:
# CHECK: What is a rolling mean? Why might it be useful?

# 3-day rolling mean of daily store sales
daily_store_sales.rolling(window=3, center=True).mean()
daily_store_sales.rolling(window=3, center=True).mean().loc['2015']   # filter to 2015 only
daily_store_sales.rolling(window=10, center=True).mean().plot()       # plot

In [None]:
# We can also use exponential moving average. CHECK: What is the difference?
data['Sales'].ewm(span=10).mean()

In [None]:
'''
WINDOW FUNCTIONS
Rolling mean and rolling median are only two examples of pandas
window function capabilities. Window functions operate on a set of N
consecutive rows (i.e. a window) and produce an output.
In addition to the rolling mean and median, there are rolling sum,
min, max... and many more.
Another common one is diff, which takes the difference over time.
diff takes one argument: periods, which sets how many rows
back to look when computing the difference.
For example, if we want to compute the difference in sales,
day by day, we could compute:
'''

daily_store_sales.diff(periods=1) # day by day difference in sales
daily_store_sales.diff(periods=7) # compare same day each week

# Difference functions allow us to identify seasonal changes when we see repeated up or downswings.
# An example from FiveThirtyEight:
# http://i2.wp.com/espnfivethirtyeight.files.wordpress.com/2015/03/casselman-datalab-wsj2.png?quality=90&strip=all&w=575&ssl=1

In [None]:
'''
Pandas Expanding Functions
In addition to the set of rolling functions, pandas also 
provides a similar collection of expanding functions, which, 
instead of using a window of N values, use all values up until 
that time.
'''


daily_store_sales.expanding().mean() # average of daily sales from the first date up to each date
daily_store_sales.expanding().sum() # running sum of average sales per store up to each date

In [None]:
'''
EXERCISES
1. Plot the distribution of sales by month and compare the effect of promotions.
Hint: try using hue in sns.
2. Are sales more correlated with the prior date, a similar date last year, or a similar date last month?
3. Identify the date with the largest drop in sales from the same date in the previous week.
4. Compute the total sales up until Dec. 2014.
5. When were the largest differences between 15-day moving/rolling averages? Hint: use a rolling mean and diff.
'''

# Plot the distribution of sales by month and compare the effect of promotions
sns.catplot(
    col='Open',
    hue='Promo',
    x='Month',
    y='Sales',
    data=store1_data,
    kind='box'
)


In [None]:
# Are sales more correlated with the prior date, a similar date last year, or a similar date last month?
# Compare the following:
average_daily_sales = data[['Sales', 'Open']].resample('D').mean()

print(average_daily_sales['Sales'].autocorr(lag=1))        # day

print(average_daily_sales['Sales'].autocorr(lag=30))       # month

print(average_daily_sales['Sales'].autocorr(lag=365))      # year


In [None]:
# Identify the date with the largest drop in average store sales from the same date in the previous week:
average_daily_sales = data[['Sales', 'Open']].resample('D').mean()
average_daily_sales['DiffVsLastWeek'] = average_daily_sales['Sales'].diff(periods=7)

average_daily_sales.sort_values('DiffVsLastWeek').head()

In [None]:
# Unsurprisingly, these days are Dec. 25 and Dec. 26 in 2014 and 2015, when the store is closed and there are many sales in the preceding week. How about when the store is open?
average_daily_sales[average_daily_sales.Open == 1].sort_values('DiffVsLastWeek')

In [None]:
# Compute the total sales up until Dec. 2014:
total_daily_sales = data[['Sales']].resample('D').sum()
total_daily_sales.expanding().sum().loc['2014-12']
# THIS IS NOT data['Sales'].expanding().sum().loc['2014-12']

In [None]:
# When were the largest differences between 15-day moving/rolling averages? Hint: use a rolling mean and diff
total_daily_sales.rolling(window=15).mean().diff(1).sort_values('Sales')

#### Programming note on using time series for Capstones:

Here's an example of using timeseries for a Capstone: https://github.com/samuel-stack/Portfolio/blob/master/Moving%20Violations%20VS.%20Speed%20Traps/Granger%20Causality%20test%20.ipynb

Note, this Capstone makes use of Granger Causality: a statistical concept that says if a signal X "Granger-causes" (or "G-causes") a signal Y, then past values of X should contain information that helps predict Y above and beyond the information contained in past values of Y alone. 

To put it another way, a time series X is said to Granger-cause Y if the past values of X provide statistically significant information about future values of Y.
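To illustrate the idea, here is a rough NumPy-only sketch on simulated data. It is not the method used in the linked Capstone, and the function name `granger_f_stat` is made up. It compares a regression of Y on its own past with one that also includes the past of X, and summarizes the improvement as an F-statistic. For real work, a library routine such as statsmodels' `grangercausalitytests` would be the more usual choice.

```python
import numpy as np


def granger_f_stat(x, y, lag=1):
    """Hypothetical helper: F-statistic for 'x Granger-causes y' with `lag` past values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y) - lag
    target = y[lag:]
    # Column k holds the values k steps before each target observation.
    y_lags = np.column_stack([y[lag - k:len(y) - k] for k in range(1, lag + 1)])
    x_lags = np.column_stack([x[lag - k:len(x) - k] for k in range(1, lag + 1)])
    ones = np.ones((n, 1))

    def rss(design):
        # Residual sum of squares of an ordinary least squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted = rss(np.hstack([ones, y_lags]))     # past of y only
    rss_full = rss(np.hstack([ones, y_lags, x_lags]))   # past of y and past of x
    dof = n - (2 * lag + 1)
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)


# Simulated signals (assumption for illustration): y follows x with a one-step delay.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.zeros(300)
y[1:] = 0.8 * x[:-1] + 0.1 * rng.normal(size=299)

f_xy = granger_f_stat(x, y)  # does the past of x help predict y?
f_yx = granger_f_stat(y, x)  # does the past of y help predict x?
print(f_xy, f_yx)
```

A large F-statistic in one direction and a small one in the other is the pattern the definition describes. A full test would also convert the statistic to a p-value and check that the series are stationary.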