# Time Series Part 1 - Data Wrangling and Plotting

# Setup

In [0]:
## pin seaborn to the version used in this notebook (0.9.0)
!pip install seaborn==0.9.0


In [1]:
## setup our environment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## pandas print columns/rows option (100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

## set the styling for seaborn (dark)
sns.set_style("dark")

# Date and Time Wrangling

We saw last week that we can build datasets using `date_range` from pandas.  The `freq` argument takes a string alias for the date/time frequency we want to generate.  The image below shows these aliases and is taken from the documentation at the link below.

http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects

<img src="http://drive.google.com/uc?export=view&id=1U1oVmpmlnkyLba0jc-tGQSZRckmYej_G">
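
As a quick illustration of how `freq` changes what gets generated, here is a minimal sketch with a handful of aliases. The start date, the number of periods, and the names `aliases` and `examples` are arbitrary choices for this demo, not part of the original notebook.

```python
import pandas as pd

# A few frequency aliases (an illustrative subset, not the full list)
aliases = {
    "D": "calendar day",
    "B": "business day",
    "W": "weekly, anchored on Sunday by default",
    "MS": "month start",
}

# Build three timestamps per alias, starting from the same date
examples = {a: pd.date_range("2019-01-01", periods=3, freq=a) for a in aliases}

for alias, idx in examples.items():
    print(alias, "-", aliases[alias], "-", list(idx.strftime("%Y-%m-%d")))
```

Note that the weekly alias does not start on the date we passed in: it rolls forward to the first Sunday on or after it.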



In [2]:
## start basic: generate the days of the year for 2019 up through March 27
days19 = pd.date_range("2019-01-01", "2019-03-27", freq="D")
days19

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06', '2019-01-07', '2019-01-08',
               '2019-01-09', '2019-01-10', '2019-01-11', '2019-01-12',
               '2019-01-13', '2019-01-14', '2019-01-15', '2019-01-16',
               '2019-01-17', '2019-01-18', '2019-01-19', '2019-01-20',
               '2019-01-21', '2019-01-22', '2019-01-23', '2019-01-24',
               '2019-01-25', '2019-01-26', '2019-01-27', '2019-01-28',
               '2019-01-29', '2019-01-30', '2019-01-31', '2019-02-01',
               '2019-02-02', '2019-02-03', '2019-02-04', '2019-02-05',
               '2019-02-06', '2019-02-07', '2019-02-08', '2019-02-09',
               '2019-02-10', '2019-02-11', '2019-02-12', '2019-02-13',
               '2019-02-14', '2019-02-15', '2019-02-16', '2019-02-17',
               '2019-02-18', '2019-02-19', '2019-02-20', '2019-02-21',
               '2019-02-22', '2019-02-23', '2019-02-24', '2019-02-25',
      

In [3]:
## let's make this a dataframe, with the date as the index and a column of random values
returns = {'returns':np.random.normal(0, 2.5, size=len(days19))}
year19 = pd.DataFrame(returns, index=days19)
year19.head()

| | returns |
|---|---|
| 2019-01-01 | 0.844868 |
| 2019-01-02 | 2.785651 |
| 2019-01-03 | 0.654903 |
| 2019-01-04 | 0.151442 |
| 2019-01-05 | -0.198513 |


In [0]:
# quick plot
year19.plot(figsize=(12,5))

In [0]:
## what about generating data by minute?
min19 = pd.date_range("2019-01-01", "2019-03-27", freq="min")  # "min" = minute; the older "T" alias is deprecated in recent pandas
min19

In [0]:
min19df = pd.DataFrame({'a':np.random.normal(0,2,len(min19))}, index=min19)
min19df.head()

In [0]:
# another quick plot
min19df.plot(figsize=(15, 5))

## Quick Exercise

Generate a random weekly dataframe (or Series) and plot it.

In [0]:
## create the dataset

In [0]:
## plot it

# Additional Date Parts

We saw last week that we could extract date parts.  Let's quickly look at a few more.

In [0]:
year19['date'] = year19.index
year19.head()

In [0]:
year19['weekday'] = year19.date.dt.weekday
year19['quarter'] = year19.date.dt.quarter
## dt.weekofyear was removed in pandas 2.0; isocalendar() needs pandas >= 1.1
year19['weekyear'] = year19.date.dt.isocalendar().week
year19['dayyear'] = year19.date.dt.dayofyear

In [0]:
year19.tail()
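
The same date parts are also available directly on a `DatetimeIndex`, without first copying the index into a column. A minimal sketch, assuming a fresh index `idx` and a throwaway frame `parts` (both names are made up for this example):

```python
import pandas as pd

# Rebuild the same daily range used above
idx = pd.date_range("2019-01-01", "2019-03-27", freq="D")

# Pull date parts straight off the index (no .dt accessor needed here)
parts = pd.DataFrame(
    {
        "weekday": idx.weekday,      # Monday=0 ... Sunday=6
        "day_name": idx.day_name(),  # e.g. "Tuesday"
        "quarter": idx.quarter,
        "dayyear": idx.dayofyear,
    },
    index=idx,
)
print(parts.tail())
```

The `.dt` accessor is only needed when the datetimes live in a column (a Series) rather than in the index.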

# Filtering with Dates

Let's use the first dataset, `year19`.

## When the column is a datetime

In [0]:
## Keep just February
feb19 = year19.loc[(year19['date'] >= '2019-02-01') & (year19['date'] <= '2019-02-28')]

In [0]:
# print out the first and last date
print(feb19['date'].min())
print(feb19['date'].max())

## When the index is datetime

In [0]:
## same filter
feb19_index = year19.loc["2019-02-01":"2019-02-28"]

In [0]:
# print out the first and last date
print(feb19_index['date'].min())
print(feb19_index['date'].max())
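
When the index is a `DatetimeIndex`, pandas also accepts a partial date string, so a whole month can be selected without spelling out both endpoints. A minimal sketch on a made-up frame `demo` (a stand-in for `year19`, rebuilt here so the block runs on its own):

```python
import numpy as np
import pandas as pd

# Stand-in for year19: daily random values indexed by date
days = pd.date_range("2019-01-01", "2019-03-27", freq="D")
demo = pd.DataFrame({"returns": np.random.normal(0, 2.5, size=len(days))}, index=days)

# Explicit slice: both endpoints are included for label-based slicing
feb_slice = demo.loc["2019-02-01":"2019-02-28"]

# Partial string: everything that falls inside February 2019
feb_partial = demo.loc["2019-02"]

print(len(feb_slice), len(feb_partial))
print(feb_partial.index.min(), feb_partial.index.max())
```

Unlike positional slicing in Python, label-based slicing with `.loc` includes the end label.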

## We can also use time in the filter

Use the `min19df` dataframe and select a date and time range.

In [0]:
# quick refresher
min19df.info()

In [0]:
# filter to St. Patrick's Day (March 17)
paddy = min19df.loc['2019-03-17 00:00:00':'2019-03-17 23:59:59']

In [0]:
# check what we have
print(paddy.index.min())
print(paddy.index.max())

In [0]:
# and because it's by minute, how many rows?
len(paddy)

In [0]:
# does this make sense (minutes * hours in a day)?
60*24
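
The row count can also be checked in code. Here is a minimal, self-contained sketch on a small made-up minute-level frame `demo_min` (three days only, to keep it light); it also shows that a bare date string selects the same full day as the explicit timestamps:

```python
import numpy as np
import pandas as pd

# Stand-in for min19df: minute-level data around March 17
minutes = pd.date_range("2019-03-16", "2019-03-18", freq="min")
demo_min = pd.DataFrame({"a": np.random.normal(0, 2, len(minutes))}, index=minutes)

# Explicit start/end timestamps
day_explicit = demo_min.loc["2019-03-17 00:00:00":"2019-03-17 23:59:59"]

# Partial string: the whole calendar day
day_partial = demo_min.loc["2019-03-17"]

minutes_per_day = 60 * 24
print(len(day_explicit), len(day_partial), minutes_per_day)
```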

# Reshaping Data to fit a Timeseries Format - Melt

Generally speaking, we have seen that our data work well for time series when each observation is a row and a single column holds the date/time.  

Sometimes we get reports from our team/clients where the dates run across the columns: each column is a date and each cell is the value for that date.  While this layout is easy to read in a spreadsheet, we often need to reshape the data into the "long" format we have been using.

In pandas, we can `melt` the data.
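
To make the wide-to-long idea concrete, here is a minimal sketch on a tiny made-up table. The tickers `AAA`/`BBB`, the prices, and the date column labels are all invented for illustration and say nothing about what is in the real file.

```python
import pandas as pd

# Hypothetical "wide" report: one row per ticker, one column per date
wide = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB"],
        "1/2/19": [10.0, 20.0],
        "1/3/19": [11.0, 19.5],
    }
)

# Melt to "long": one row per (ticker, date) pair
long = wide.melt(id_vars="ticker", var_name="date", value_name="close")

# The old column headers are strings, so parse them into real datetimes
long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")

print(wide)
print(long)
```

Two tickers by two date columns become four rows, with the former column headers now stored as values in `date`.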

> Download the stocks.csv file on Questrom Tools and import it as stocks

In [0]:
## read in the file
stocks = pd.read_csv("stocks.csv")

In [0]:
## first few rows
stocks.head()

In [0]:
## melt the dataset from wide (one column per date) to long (one row per ticker and date)
stocks_long = stocks.melt(id_vars = "ticker")

In [0]:
# take a look
stocks_long.head()

In [0]:
# we can use other arguments to clean this up
stocks_long = stocks.melt(id_vars="ticker", var_name="date", value_name="close")
stocks_long.head()

In [0]:
# check the types
stocks_long.dtypes

In [0]:
# as expected, we need to change the column types
# http://strftime.org/
stocks_long['date'] = pd.to_datetime(stocks_long['date'], format="%m/%d/%y")

In [0]:
# confirm
stocks_long.dtypes

## Quick Exercise

For each ticker, calculate the min, max, and average close.
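
One possible approach, sketched on made-up data so it runs on its own (the frame `toy` and its values are hypothetical; swap in `stocks_long` to answer the exercise):

```python
import pandas as pd

# Hypothetical long-format data with the same column names as stocks_long
toy = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB"],
        "close": [10.0, 12.0, 20.0, 18.0],
    }
)

# Group by ticker, then compute several summaries of close at once
summary = toy.groupby("ticker")["close"].agg(["min", "max", "mean"])
print(summary)
```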

# Plotting multiple series

There is nothing stopping us from plotting the tickers together, with one line per ticker.

In [0]:
sns.lineplot(x="date", y="close", hue="ticker", data=stocks_long)