# <font color='#eb3483'> Data Wrangling - Time Series </font>
In this module we explore a special type of data - time series (a.k.a. data tied to times or dates). Time series data is pretty ubiquitous. Think of the stock market, weather data, or even your own bank statements - it's all data tied to specific dates and times. Dates and times deserve some extra special attention during your data wrangling process because they have some unique properties. Like numeric data they have a natural ordering (e.g. 3pm is after 2pm), but they also have additional structure (e.g. for a given time we have an hour, a day of the week, a year, a zodiac sign, etc.).



## In this notebook: 
    
1. Quick look at base python date/time
1. Working with date/time in pandas
1. The amazingness of date/time with pandas
1. Cool pandas functionality with date/time
1. Filtering date/time with pandas
1. Aggregating (aka resampling) date/time
1. Rolling windows 



## <font color='#eb3483'> 1. Basic date/time in Python </font>
Python's basic way to deal with dates is the `datetime` object, which lives in the standard library's `datetime` module.

In [None]:
from datetime import datetime, date #importing the datetime and date types from the datetime module (I know ... it's confusing!)

print('Right now:', datetime.now())
print("Today's date:", date.today())

Datetime objects are great for individual dates (and provide a lot of flexibility/ease of use), but don't scale well to columns in a dataframe (aka vectors). For that it's time to turn to our favorite coding bears - pandas! 
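
To make that "flexibility" a bit more concrete, here is a minimal sketch of what a single `datetime` object gives you: attribute access, arithmetic with `timedelta`, and string formatting. The date itself is made up for illustration.

```python
from datetime import datetime, timedelta

# A made-up moment in time, used only for illustration
moment = datetime(2018, 1, 17, 15, 30)

# Individual components are available as attributes
print(moment.year, moment.month, moment.day, moment.hour)

# weekday() counts from Monday = 0
print("Day of week:", moment.weekday())

# Arithmetic works through timedelta objects
one_week_later = moment + timedelta(weeks=1)
print("One week later:", one_week_later)

# strftime turns a datetime into a string using format codes
formatted = moment.strftime("%d/%m/%Y %H:%M")
print(formatted)
```

All of this works on one value at a time, which is exactly why it gets clumsy once you have a whole column of dates.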

## <font color='#eb3483'> 2. Working with date/time in pandas </font>

In [None]:
import pandas as pd

### <font color='#eb3483'>Timestamps  </font>

The timestamp is the most basic form of time series data that Pandas has. It does exactly what the name describes: marks the exact moment in which the data was collected. 

While Kaggle datasets and other online datasets are normally clean "hourly" or "daily" datasets, timestamps are how most data is actually collected in the wild!

An event happens, and the time of the event is dumped into a database. 

One example of this would be... bitcoin! Now, whatever you may think about bitcoin, it is an excellent source of high-granularity data. Let's dive in! 

In [None]:
data = pd.read_csv('./data/bitcoin.csv')

In [None]:
data.head()

In [None]:
data.tail()

We have this `Timestamp` column

In [None]:
data.Timestamp.head()

What kind of data type is this? For now, the timestamps are just text.

We can kind of understand them. It looks like year, month, and day, then hours, minutes, and seconds...

We can use `pd.to_datetime` to parse the timestamps into a proper datetime data type.

In [None]:
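# note: from pandas 2.0 onwards infer_datetime_format is deprecated (inference is the default) and can be dropped
data.Timestamp = pd.to_datetime(data.Timestamp, infer_datetime_format=True)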
data.head()

Let's inspect the column again. What is it now?

In [None]:
data.Timestamp.head()

Now the column is in `datetime64[ns]` format! That means the column is a timestamp (with precision in nanoseconds).

Now we can compute statistics with it!

In [None]:
data.Timestamp.min()

In [None]:
data.Timestamp.max()
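
To see the before and after without the bitcoin file, here is a small self-contained sketch. The three timestamps are made up, and the variable names (`raw`, `parsed`, `span`) are just for illustration.

```python
import pandas as pd

# Made-up timestamps stored as plain strings, as they might look after read_csv
raw = pd.Series(["2018-01-17 03:00:00", "2017-12-18 00:08:00", "2018-01-15 12:30:00"])

# Before parsing, pandas knows nothing about dates here
print(raw.dtype)

# After parsing, the Series has a datetime dtype
parsed = pd.to_datetime(raw)
print(parsed.dtype)

# Now min and max are chronological
earliest = parsed.min()
latest = parsed.max()

# Subtracting two timestamps gives a Timedelta
span = latest - earliest
print(earliest, latest, span)
```

The subtraction at the end is a nice bonus: once the column holds real timestamps, the "how long does this dataset cover?" question is a one-liner.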

Because the column is a timestamp dtype, it has the `.dt` accessor with all of the timestamp related functions. Since pandas was created for stock trading data (which are time series), [there are a lot of timestamp specific properties!](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties)

We can extract days, months, etc.:

In [None]:
data.Timestamp.dt.month
# try changing this to day/year etc.

Let's make a toy dataset so that we can see the plethora of options...

In [None]:
new = pd.DataFrame()
new['date'] = data.Timestamp
new['day'] = new['date'].dt.day
new['month'] = new['date'].dt.month
new['year'] = new['date'].dt.year
#new['hour'] = new['date'].dt.hour
#new['minute'] = new['date'].dt.minute
#new['second'] = new['date'].dt.second
#new['day of the week'] = new['date'].dt.weekday
#new['quarter'] = new['date'].dt.quarter
#new['is it a leap year?'] = new['date'].dt.is_leap_year

new.head(10)
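
If you don't have the bitcoin file at hand, the same idea can be sketched on a few made-up dates. Note that most `.dt` entries are properties, while a few (like `day_name`) are methods and need parentheses.

```python
import pandas as pd

# Three made-up dates, chosen so the outputs differ
dates = pd.Series(pd.to_datetime(["2016-02-29", "2018-01-17", "2017-12-18"]))

features = pd.DataFrame({
    "date": dates,
    "weekday": dates.dt.weekday,            # property, Monday = 0
    "day_name": dates.dt.day_name(),        # method, returns the English day name
    "quarter": dates.dt.quarter,            # property, 1 to 4
    "is_leap_year": dates.dt.is_leap_year,  # property, boolean
})
print(features)
```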

Pandas... is a ninja!!!! ([Source](https://dribbble.com/shots/614156-Panda-ninja))

![image.png](attachment:image.png)

### <font color='#eb3483'> 3. The amazingness of date/time with pandas </font>

Now you may be thinking _"hang on, was that just because the strings were exactly in the way Pandas likes them?"_

It's a fair question, and the answer is no. Pandas' [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) has an `infer_datetime_format` argument which is amazingly good, and can for the most part figure out what you need from it. (From pandas 2.0 onwards this inference is the default behaviour and the argument itself is deprecated.)

Let's put it to the test: 

In [None]:
check = pd.to_datetime('04/05/2007', infer_datetime_format=True) 

print(check)
print("day is:", check.day)
print("month is:", check.month)
print("year is:", check.year)

# Can we separate with hyphens instead? 
# How about writing the year in a short form of 07 ?
# What about in English ? - 'April 5th, 2007'

Wow! 

But if you are from South Africa or Europe, you may be thinking to yourself: WTF!

Wait... what? It got the day and month mixed up!

It turns out pandas can infer lots of things, but Europe isn't its strength. Faced with an ambiguous string like `04/05/2007`, it assumes the month comes first. Even when other dates clearly indicate that the month is in the middle (the 13th can't be a month), it can still get confused.

And here is where line 2 of [The Zen of Python](https://www.python.org/dev/peps/pep-0020/#id3) comes in:
> Explicit is better than implicit 

In [None]:
check_eu = pd.to_datetime('04/05/2007', dayfirst=True)  # <--- explicit! 
print("day is:", check_eu.day)
print("month is:", check_eu.month)
print("year is:", check_eu.year)

By being explicit, we can parse arbitrarily crazy dates, following Python's [date string formatting syntax](http://strftime.org/):

In [None]:
# April 5th, 2007, in a made-up quack time system
# %d is day, %m is month, %Y is 4-digit year
check = pd.to_datetime('05_quack_2007$04', format='%d_quack_%Y$%m')

print(check)
print("day is:", check.day)
print("month is:", check.month)
print("year is:", check.year)
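
The same explicit `format` works on a whole column. As a minimal sketch on made-up day-first dates, the example below also passes `errors="coerce"`, which turns anything that doesn't match the format into `NaT` instead of raising an error. The last entry is deliberately not a real date.

```python
import pandas as pd

# Made-up day-first dates; the last one (31st of February) does not exist
raw = pd.Series(["13/05/2007", "04/05/2007", "31/02/2007"])

# Explicit format: day, then month, then 4-digit year
# errors="coerce" replaces unparseable entries with NaT
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)
```

Whether to coerce or let the error surface is a judgment call: coercing keeps the pipeline running, but it is worth counting the `NaT` values afterwards so bad rows don't disappear silently.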

<hr>

### <font color='#eb3483'> 4. Filtering our data with Pandas (using indexes) </font>
Where pandas really shines is when we set a datetime column as our index (it's generally good practice to do this when you have time series data for reasons that will become apparent soon). So let's start by setting our timestamp column as our index.

In [None]:
data.head(3)

In [None]:
data = data.set_index('Timestamp',    # <---- Set the index to be our timestamp data  
                      drop=True)      # <---- drop the original column

In [None]:
#Let's take a peek to make sure we did this right
data.head()

In [None]:
#We can also sort our dataframe by the time index (good practice for time series data!)
data = data.sort_index()

Now that we have our data with the timeseries index we can do some really cool indexing (pandas is ... amazing!)

In [None]:
#Let's get all the data for Jan 17th
data.loc['Jan 17th 2018'].head()   # <--- wait, you can do that???

In [None]:
#Or how about all the January data?


In [None]:
#We can even look at data between dates
data.loc['01/15/2018':'01/22/2018']  # <--- remember, American dates are less error prone in Pandas 


In [None]:
# Play around with this above for 5 mins

Essentially we can slice our data by using dates, and pandas even lets us use different date formats. The beauty of this is that it seems perfectly natural (of course we should be able to just pull all of January's data without fancy index conditions), but anyone coming from a different coding language will realize this is bonkers crazy!
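
Here is a minimal, self-contained sketch of date-based slicing on a made-up daily dataset (the `toy` dataframe and its `value` column are invented for illustration). One thing worth noticing: unlike regular Python slices, date slices with `.loc` include both endpoints.

```python
import pandas as pd

# Made-up daily data covering January and February 2018
index = pd.date_range("2018-01-01", "2018-02-28", freq="D")
toy = pd.DataFrame({"value": range(len(index))}, index=index)

# A partial date string selects everything that falls inside it
january = toy.loc["2018-01"]

# A slice between two dates keeps both the start and the end date
one_week = toy.loc["2018-01-15":"2018-01-22"]

print(len(january), len(one_week))
```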

### <font color='#eb3483'>5. Aggregating or Resampling Data </font>


Sometimes we might get data at a really granular level (e.g. microseconds) and want to take a step back and look at a larger time frequency (e.g. days).


Let's think about some of our bitcoin data fields. The price on Jan 17th at 3h00m00s makes sense (since it's an event, something that happened). But the volume "in that moment"? It's a bit nonsensical (you don't have a number of transactions in a split second, *you have them over a period*).

Counting using timestamps is like asking _how many people went into McDonalds at an exact moment_. Probably none. It doesn't tell us much.
We'd think in people "per minute", or "per hour". To resample our data at a different time frequency, we can use the <code>resample</code> function ([link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)).

Let's start by looking at our bitcoin data in 5 minute intervals. All we have to do is call the resample method on our series and specify the interval (5 min).

In [None]:
data.head()

In [None]:
data['Volume_(Currency)'].resample('5 min').sum()

Boo ya - now we have the total volume (currency) traded in 5 minute time buckets. We could have also chosen other aggregation functions (like max, mean, min...etc.) 

This is similar to the `groupby` we have learnt already, where `resample` tells us how to group the rows (into time buckets) and the function that follows tells us which calculation to apply to each bucket.
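
To make the `groupby` comparison concrete, here is a minimal sketch on ten made-up one-minute volumes. It assumes nothing about the bitcoin file; `volume` is a toy series. `pd.Grouper(freq=...)` is the `groupby` way of saying "bucket by time".

```python
import pandas as pd

# Ten made-up one-minute observations with values 1 to 10
index = pd.date_range("2018-01-17 00:00", periods=10, freq="min")
volume = pd.Series(range(1, 11), index=index)

# Bucket into 5 minute intervals and sum each bucket
five_min_total = volume.resample("5min").sum()

# The same calculation written as a groupby over time buckets
same_with_groupby = volume.groupby(pd.Grouper(freq="5min")).sum()

# Several aggregations at once, one column per function
summary = volume.resample("5min").agg(["min", "max", "mean"])

print(five_min_total)
print(summary)
```

Each bucket is labelled with its start time, so the first row covers 00:00 up to (but not including) 00:05.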

In [None]:
#Try it out yourself!
#Resample to 30 mins ?
#try a min or a max etc

We can specify our resampling windows using special characters just like our string formatting (check out the full list of frequency code names [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)). For example, let's look at the max volume in every 2 week interval.

In [None]:
data['Volume_(Currency)'].resample('2W').max().head()
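
One detail that can be surprising with weekly codes is how the buckets are labelled. As a small sketch on made-up daily data (four full weeks, starting on a Monday), weekly buckets end on a Sunday by default and carry that Sunday as their label.

```python
import pandas as pd

# Four made-up weeks of daily data; 2018-01-01 was a Monday
index = pd.date_range("2018-01-01", periods=28, freq="D")
daily = pd.Series(1, index=index)

# "W" means weekly buckets, by default ending on (and labelled with) Sunday
weekly = daily.resample("W").sum()
print(weekly)
```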

### <font color='#eb3483'> Quick Knowledge Check</font>
1. What's the highest bitcoin open price every day in January? (Hint: first get all the January data, and then apply our resampling function for days)

<hr>

### <font color='#eb3483'> 6. Stripping out volatility with Rolling Windows </font>




Time data can be very volatile over short periods.

Let's say it's December 18th 2017, in the early morning, and we are at our terminal. 

##### Midnight and a bit... over 4 mins

In [None]:
data.loc['Dec 18th 2017 00:08:00':'Dec 18th 2017 00:12:00', 'Weighted_Price'].plot(figsize=(16, 4));

<img src='https://i.imgflip.com/29iucd.jpg' width="300">

##### A few minutes pass... 

In [None]:
data.loc['Dec 18th 2017 00:12:00':'Dec 18th 2017 00:15:00', 'Weighted_Price'].plot(figsize=(16, 4));

<img src='https://i.redditmedia.com/VE5dgdjQ8FKZ47gdxJdQ07q36bsZVyhvAmllvLdtTnI.jpg?w=534&s=ce869cd0d8630cd420af7fa72b3c296d' width="300">


##### A few more minutes... 

In [None]:
data.loc['Dec 18th 2017 00:15:00':'Dec 18th 2017 00:18:00', 'Weighted_Price'].plot(figsize=(16, 4));

<img src='https://i.imgflip.com/29iucd.jpg' width="300">

I think you get the picture. What's going on is that we're being extremely reactive to noise, and missing the underlying process. What is in fact going on is that we are in a free-fall, but it might not be obvious unless we look at the slightly broader picture. 

In other words, assuming there is an underlying process, we can assume the recent past should carry some weight. How much weight? A rolling [window](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html) of weight! 

Rolling windows do what their name suggests: aggregate the previous X periods (for instance, by taking the mean). They are very useful to smooth choppy time series and be less reactive to noise.

We can choose to center the window (look back and forward), but in general we only want to take into account information from the past, so we should use `center=False` (which is the default)
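
Before applying this to bitcoin, here is a tiny sketch on five made-up prices, so that the numbers can be checked by hand. It compares a trailing window (`center=False`) with a centered one (`center=True`).

```python
import pandas as pd

# Five made-up one-minute prices
index = pd.date_range("2017-12-18 00:00", periods=5, freq="min")
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index)

# Trailing window: each value is the mean of the current row and the 2 before it
trailing = prices.rolling(window=3, center=False).mean()

# Centered window: each value is the mean of the previous, current and next row
centered = prices.rolling(window=3, center=True).mean()

print(pd.DataFrame({"raw": prices, "trailing": trailing, "centered": centered}))
```

The trailing version has no value for the first two rows (there aren't 3 rows to average yet), while the centered version is missing its first and last row. The centered version also uses the next row, which is information from the future.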

#### The first hour of Dec 18th 2017, as seen by traders

In [None]:
data.loc['Dec 18th 2017 00:00:00':'Dec 18th 2017 01:00:00', 'Weighted_Price'].plot(figsize=(16, 4));

#### The first hour of Dec 18th 2017, as seen by a rolling window of 10 minutes

In [None]:
# this is just the raw data, so we can apply a rolling window on it  
first_hour = data.loc['Dec 18th 2017 00:00:00':'Dec 18th 2017 01:00:00', 'Weighted_Price']

# notice the window size as a parameter of rolling, feel free to mess around with that parameter 
# and the center set to False. That's because we don't want to use data from the future! 
# Also notice how we use the mean. We can use many others. Try changing it! 
window_size = 10
first_hour_rolling_window = first_hour.rolling(window=window_size, center=False).mean()


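What do these look like? A rolling window of 10 calculates the average over the 10 most recent rows. Assuming one row per minute, that's the average bitcoin price over a sliding 10 minute interval (so the average price between 00:00 and 00:09, then between 00:01 and 00:10, then between 00:02 and 00:11, etc.). The first 9 values are `NaN`, because there aren't yet 10 rows to average.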

In [None]:
# Let's plot these together 
first_hour_rolling_window.plot(figsize=(16, 8), 
                               color='b',
                               label=f'rolling_window = {window_size}');
first_hour.plot(figsize=(16, 8), label='raw data', alpha=.7, ls='-', color='orange');

Useful!

# BOOM - the end!
