### Time Series Feature Engineering

Jay Urbain, PhD

Credits:  
- Introduction to Time Series Forecasting with Python, Jason Brownlee.  
- Python Data Science Handbook, Jake VanderPlas.  
- Chris Albon, https://chrisalbon.com/python/data_wrangling/pandas_time_series_basics/


In [1]:
# check the versions of key python libraries
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)

scipy: 1.0.0
numpy: 1.13.3
matplotlib: 3.0.1
pandas: 0.23.4
statsmodels: 0.9.0
sklearn: 0.19.1


Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms. 

There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps. In this tutorial, you will discover how to perform feature engineering on time series data with Python to model your time series problem with machine learning algorithms.

Transform standard time series:  
time 1, value 1  
time 2, value 2  
time 3, value 3  

To the following for ML:  
input 1, output 1  
input 2, output 2  
input 3, output 3  
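
One simple way to sketch this transformation is to pair each observation with the one that follows it. The example below uses made-up values, and the column names `input` and `output` are chosen here for illustration; it relies on pandas `shift()` to line up each value with its successor.

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Pair each observation (input) with the next one (output).
framed = pd.DataFrame({
    'input': values,
    'output': values.shift(-1),  # value at the following time step
})

# The last row has no following value, so drop it.
framed = framed.dropna()
print(framed)
```

Five observations yield four input/output pairs, because the final observation has nothing to predict.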


Classes of features that we can create from our time series dataset:
- Date Time Features: these are components of the time step itself for each observation.  
- Lag Features: these are values at prior time steps.  
- Window Features: these are a summary of values over a fixed window of prior time steps.  
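
The three classes above can be sketched on a small synthetic series. The data, the column names, and the choice of a one-step lag and a three-step window are assumptions made for illustration, not settings taken from the tutorial's dataset.

```python
import pandas as pd

# Synthetic daily series used only for illustration.
index = pd.date_range('2020-01-01', periods=6, freq='D')
features = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=index)

# Date time features: components of each observation's time stamp.
features['month'] = features.index.month
features['day'] = features.index.day

# Lag feature: the value at the previous time step.
features['lag_1'] = features['value'].shift(1)

# Window feature: mean of the three prior values.
# Shifting first keeps the current value out of its own window.
features['rolling_mean_3'] = features['value'].shift(1).rolling(window=3).mean()

print(features)
```

The first rows of the lag and window columns are NaN because no prior values exist for them; such rows are typically dropped before modeling.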


#### Load Minimum Daily Temperatures Dataset

The Minimum Daily Temperatures dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia. 

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html 

In [3]:
# create date time features of a dataset
import pandas as pd
from pandas import Series
from pandas import DataFrame
dataframe = pd.read_csv('daily-minimum-temperatures.csv', header=0, parse_dates=[0]) 
dataframe.head() 

Unnamed: 0,Date,Temp
0,1981-01-01,20.7
1,1981-01-02,17.9
2,1981-01-03,18.8
3,1981-01-04,14.6
4,1981-01-05,15.8


In [4]:
df = dataframe.copy() 
df['Date'] = pd.to_datetime(df['Date'])
df['Date'].iloc[0].month

1

Each parsed date is a pandas Timestamp, which exposes components such as the year and month as attributes:

In [5]:
t = pd.Timestamp.now()
print( type(t) )
t

<class 'pandas._libs.tslibs.timestamps.Timestamp'>

Timestamp('2018-11-04 07:21:57.624484')

In [6]:
t.month

11

In [7]:
df = dataframe.copy()
# make sure Date is a datetime column, then pull out the integer month and day
df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['temperature'] = df['Temp']
df[['month', 'day', 'temperature']].head(5)

Two features that we can start with are the integer month and day for each observation. We can imagine that supervised learning algorithms may be able to use these inputs to help tease out time-of-year or time-of-month type seasonality information. 

The supervised learning problem we are proposing is to predict the daily minimum temperature given the month and day, as follows:  
Month, Day, Temperature  
Month, Day, Temperature  
Month, Day, Temperature  

#### Exploring Time Series Data

In [None]:
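# summarize first few lines of a file
# build a temperature Series indexed by date from the dataframe loaded above
series = pd.Series(dataframe['Temp'].values, index=pd.to_datetime(dataframe['Date']), name='Temp')
series.head(10)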

In [None]:
print(series.tail(10))

In [None]:
# summarize the dimensions of a time series
from pandas import Series
print(series.size)

#### Querying By Time

In [None]:
print(series.loc['1981-01'])
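
As a minimal sketch of how this kind of query behaves, the example below builds a short synthetic series with a `DatetimeIndex` (the dates and values are made up for illustration) and selects from it with a partial date string and with a date-string slice.

```python
import pandas as pd

# Synthetic daily series spanning two months, for illustration only.
index = pd.date_range('2020-01-30', periods=5, freq='D')
demo = pd.Series([1, 2, 3, 4, 5], index=index)

# A partial date string selects every observation in that month.
january = demo.loc['2020-01']
february = demo.loc['2020-02']

# A slice of date strings selects a range that includes both end points.
window = demo.loc['2020-01-31':'2020-02-01']

print(january, february, window, sep='\n\n')
```

This only works when the index holds datetimes; a column of date strings has to be converted with `pd.to_datetime` first.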

#### Descriptive Statistics   

Calculating descriptive statistics on your time series gives an idea of the distribution and spread of values. This can inform the data scaling and data cleaning that you perform later when preparing your dataset for modeling. The describe() function creates an 8-number summary of the loaded time series: the count, mean, standard deviation, minimum, 25th, 50th (median), and 75th percentiles, and maximum of the observations.

In [None]:
print(series.describe())
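
The result of `describe()` is itself a Series, so individual statistics can be read by label and reused, for example when choosing a scaling range. The sketch below uses a small made-up sample rather than the temperature data.

```python
import pandas as pd

# Small made-up sample, for illustration only.
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
summary = sample.describe()

# Labels of the statistics that describe() returns for numeric data.
print(summary.index.tolist())

# Individual statistics are read by label.
print('median:', summary['50%'])
print('range:', summary['max'] - summary['min'])
```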

#### Using pandas

In [None]:
from datetime import datetime
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as pyplot

Create a dataframe

In [None]:
data = {'date': ['2014-05-01 18:47:05.069722', '2014-05-01 18:47:05.119994', '2014-05-02 18:47:05.178768', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.230071', '2014-05-02 18:47:05.280592', '2014-05-03 18:47:05.332662', '2014-05-03 18:47:05.385109', '2014-05-04 18:47:05.436523', '2014-05-04 18:47:05.486877'], 
        'battle_deaths': [34, 25, 26, 15, 15, 14, 26, 25, 62, 41]}
df = pd.DataFrame(data, columns = ['date', 'battle_deaths'])
print(df)

Convert df['date'] from string to datetime

In [None]:
df['date'] = pd.to_datetime(df['date'])

Set df['date'] as the index and delete the column

In [None]:
df = df.set_index('date')
df

View all observations that occurred in 2014

In [None]:
df.loc['2014']

View all observations that occurred in May 2014

In [None]:
df.loc['2014-05']

Observations on or after May 3rd, 2014

In [None]:
df.loc[datetime(2014, 5, 3):]

Count the number of observations per timestamp

In [None]:
df.groupby(level=0).count()

Mean value of battle_deaths per day

In [None]:
df.resample('D').mean()

Total value of battle_deaths per day

In [None]:
df.resample('D').sum()

Plot of the total battle deaths per day

In [None]:
df.resample('D').sum().plot()
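
To see how the daily aggregations differ, the sketch below resamples a made-up set of readings that has a gap on one day. The timestamps, values, and the column name `value` are assumptions for illustration only.

```python
import pandas as pd

# Made-up readings with uneven spacing and no observation on 2020-01-03.
index = pd.to_datetime([
    '2020-01-01 08:00', '2020-01-01 20:00',
    '2020-01-02 09:00',
    '2020-01-04 10:00',
])
readings = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

# Resample to daily bins and compute several aggregates side by side.
daily = readings['value'].resample('D').agg(['count', 'mean', 'sum'])
print(daily)
```

Resampling fills in the missing calendar day: its count is 0, its mean is NaN, and its sum is 0. This is worth keeping in mind when reading a plot of daily totals, since a zero may mean "no observations" rather than an observed value of zero.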