# Converting Strings to Dates
> When dates and times come as strings, we need to convert them into a data type
Python can understand.<br>
> - While there are a number of Python tools for converting
strings to datetimes, pandas' `to_datetime` handles the conversion.
>> One obstacle to using strings to represent dates and times is that the format of the strings can vary significantly between
data sources.
>> > **EXAMPLE:** one vector of dates might represent `March 23, 2015` as
`03-23-15` while another might use `3|23|2015`.<br>
>> > - We can use the `format` parameter
to specify the exact format of the string. Here are some common date and time
formatting codes:

> | Code | Description | Example |
> |------|-------------|---------|
> | `%Y` | Full year | 2001 |
> | `%m` | Month with zero padding | 04 |
> | `%d` | Day of the month with zero padding | 09 |
> | `%I` | Hour (12-hour clock) with zero padding | 02 |
> | `%p` | AM or PM | AM |
> | `%M` | Minute with zero padding | 05 |
> | `%S` | Second with zero padding | 09 |
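
> As a small sketch of how the `format` parameter handles the two layouts mentioned earlier (`03-23-15` and `3|23|2015`), the snippet below parses both into the same date. The code `%y` (two-digit year) is not in the table above and is used here as an additional standard formatting code; the variable names are illustrative only.

```python
import pandas as pd

# The same calendar date written in two different source formats
dash_style = pd.to_datetime("03-23-15", format="%m-%d-%y")   # %y = two-digit year
pipe_style = pd.to_datetime("3|23|2015", format="%m|%d|%Y")  # literal | separators

# Both strings should resolve to the same Timestamp
print(dash_style)
print(pipe_style)
print(dash_style == pipe_style)
```

> Characters in the format string that are not codes (here `-` and `|`) are matched literally, which is why each source needs its own format string.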

In [1]:
# If you want to transform a vector of strings representing dates and times into time series data,
# use pandas’ to_datetime with the format of the date and/or time specified in the format parameter:
# Load libraries
import numpy as np
import pandas as pd
# Create strings
date_strings = np.array(['03-04-2005 11:35 PM',
'23-05-2010 12:01 AM',
'04-09-2009 09:09 PM'])

# Convert to datetimes
[pd.to_datetime(date, format='%d-%m-%Y %I:%M %p') for date in date_strings]

[Timestamp('2005-04-03 23:35:00'),
 Timestamp('2010-05-23 00:01:00'),
 Timestamp('2009-09-04 21:09:00')]

In [2]:
# We might also want to pass a value to the errors parameter to handle problems:
# Convert to datetimes
[pd.to_datetime(date, format="%d-%m-%Y %I:%M %p", errors="coerce") for date in date_strings]

[Timestamp('2005-04-03 23:35:00'),
 Timestamp('2010-05-23 00:01:00'),
 Timestamp('2009-09-04 21:09:00')]

> - If `errors="coerce"`, a value that cannot be parsed does not raise an error (raising is the default
behavior); instead, that value is set to `NaT` (a missing value).
>> This lets you handle malformed values by filling them with null values, as opposed to
troubleshooting errors for individual records in the data.
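
> The strings in the example above all parse cleanly, so `errors="coerce"` has no visible effect there. A minimal sketch with deliberately malformed inputs (the bad strings are made up for illustration) shows the `NaT` behavior. It also passes the whole Series to `to_datetime` at once instead of looping over elements.

```python
import pandas as pd

# One valid string and two that do not match the expected format
raw = pd.Series(["03-04-2005 11:35 PM",
                 "not a date",
                 "2010/05/23 12:01 AM"])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%d-%m-%Y %I:%M %p", errors="coerce")
print(parsed)

# Count how many values could not be parsed
n_bad = int(parsed.isna().sum())
print(n_bad)
```

> Counting the `NaT` values after coercion is a quick way to check how much of the data failed to parse before deciding how to handle it.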

# Handling Time Zones

In [4]:
# When you have time series data and want to add or change time zone information.
# Unless specified, pandas objects have no time zone. We can add a time zone using tz during creation:
# Load library
import pandas as pd
# Create datetime
pd.Timestamp('2017-05-01 06:00:00', tz='Europe/London')

Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')

In [5]:
# We can add a time zone to a previously created datetime using tz_localize:
# Create datetime
date = pd.Timestamp('2017-05-01 06:00:00')
# Set time zone
date_in_london = date.tz_localize('Europe/London')
# Show datetime
date_in_london

Timestamp('2017-05-01 06:00:00+0100', tz='Europe/London')

In [6]:
# We also can convert to a different time zone:
# Change time zone
date_in_london.tz_convert('Africa/Abidjan')

Timestamp('2017-05-01 05:00:00+0000', tz='Africa/Abidjan')
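
> The difference between the two methods can be easy to miss: `tz_localize` attaches a time zone to a naive datetime without changing the clock time, while `tz_convert` re-expresses an already time zone aware datetime in another zone. The sketch below, using the same London timestamp, checks that conversion keeps the underlying instant. The variable names are illustrative.

```python
import pandas as pd

# Naive timestamp: no time zone attached
naive = pd.Timestamp("2017-05-01 06:00:00")

# Localize: same wall-clock time, now interpreted as London time
london = naive.tz_localize("Europe/London")

# Convert: same instant, expressed in UTC
utc = london.tz_convert("UTC")

print(london == utc)           # both represent the same moment
print(london.hour, utc.hour)   # wall-clock hours differ by the UTC offset

# Localizing a timestamp that already has a time zone is rejected by pandas
try:
    london.tz_localize("UTC")
except TypeError as err:
    print("tz_localize failed on an aware timestamp:", err)
```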

In [10]:
# Finally, the pandas Series objects can apply tz_localize and tz_convert to every element:
# Create three dates
dates = pd.Series(pd.date_range('2/2/2002', periods=3, freq='ME'))
# Set time zone
dates.dt.tz_localize('Africa/Abidjan')

0   2002-02-28 00:00:00+00:00
1   2002-03-31 00:00:00+00:00
2   2002-04-30 00:00:00+00:00
dtype: datetime64[ns, Africa/Abidjan]

In [11]:
# pandas supports two sets of strings representing timezones; however, I suggest using
# the pytz library strings. We can see all the strings used to represent time zones by
# importing all_timezones:
# Load library
from pytz import all_timezones
# Show two time zones
all_timezones[0:2]

['Africa/Abidjan', 'Africa/Accra']

# Selecting Dates and Times
> Whether we use boolean conditions or index slicing is situation dependent.
> > If we
wanted to do some complex time series manipulation, it might be worth the overhead
of setting `the date column as the index` of the DataFrame,<br><br>
> > but if we wanted to do
some simple data wrangling, `the boolean conditions` might be easier.

In [13]:
# When you have a vector of dates and you want to select one or more.
# Use two boolean conditions as the start and end dates:
# Load library
import pandas as pd
# Create data frame
dataframe = pd.DataFrame()
# Create datetimes
dataframe['date'] = pd.date_range('1/1/2001', periods=100000, freq='h')
# Select observations between two datetimes
dataframe[(dataframe['date'] > '2002-1-1 01:00:00') & (dataframe['date'] <= '2002-1-1 04:00:00')]

                    date
8762 2002-01-01 02:00:00
8763 2002-01-01 03:00:00
8764 2002-01-01 04:00:00


In [14]:
# Alternatively, we can set the date column as the DataFrame’s index and then slice using loc:
# Set index
dataframe = dataframe.set_index(dataframe['date'])
# Select observations between two datetimes
dataframe.loc['2002-1-1 01:00:00':'2002-1-1 04:00:00']

                                   date
date
2002-01-01 01:00:00 2002-01-01 01:00:00
2002-01-01 02:00:00 2002-01-01 02:00:00
2002-01-01 03:00:00 2002-01-01 03:00:00
2002-01-01 04:00:00 2002-01-01 04:00:00
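
> The two outputs above differ by one row because the boolean version uses a strict `>` for the start, while slicing with `loc` includes both endpoints. A small sketch on a made-up ten-day range shows the two approaches selecting the same rows once the boolean condition uses `>=`.

```python
import pandas as pd

# Ten consecutive days (illustrative data)
df = pd.DataFrame({"date": pd.date_range("2002-01-01", periods=10, freq="D")})
start, end = "2002-01-03", "2002-01-05"

# Boolean conditions: inclusive on both ends via >= and <=
mask = (df["date"] >= start) & (df["date"] <= end)
by_mask = df.loc[mask, "date"]

# Index slicing: loc label slices include both endpoints
by_slice = df.set_index("date").loc[start:end]

print(len(by_mask), len(by_slice))
print(list(by_mask) == list(by_slice.index))
```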


# Breaking Up Date Data into Multiple Features
> Sometimes it can be useful to break up a column of dates into components. 
>> **EXAMPLE:** we might want a feature that includes just the year of the observation or
we might want to consider only the month of some observations so we can compare
them regardless of year.

In [15]:
# When you have a column of dates and times and you want to create features for year, month, day, hour, and minute.
# Use the time properties in pandas Series.dt:
# Load library
import pandas as pd
# Create data frame
dataframe = pd.DataFrame()
# Create 150 weekly dates
dataframe['date'] = pd.date_range('1/1/2001', periods=150, freq='W')
# Create features for year, month, day, hour, and minute
dataframe['year'] = dataframe['date'].dt.year
dataframe['month'] = dataframe['date'].dt.month
dataframe['day'] = dataframe['date'].dt.day
dataframe['hour'] = dataframe['date'].dt.hour
dataframe['minute'] = dataframe['date'].dt.minute

# Show three rows
dataframe.head(3)

        date  year  month  day  hour  minute
0 2001-01-07  2001      1    7     0       0
1 2001-01-14  2001      1   14     0       0
2 2001-01-21  2001      1   21     0       0
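
> One way the month feature might be used to compare observations regardless of year is to group on it. The sketch below is an assumption about usage rather than part of the recipe: the `sales` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical observations from the same months in two different years
df = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-15", "2001-02-15",
                            "2002-01-15", "2002-02-15"]),
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Extract the month, ignoring the year
df["month"] = df["date"].dt.month

# Average sales per calendar month across all years
monthly_mean = df.groupby("month")["sales"].mean()
print(monthly_mean)
```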


# Calculating the Difference Between Dates
> There are times when the feature we want is the change (delta) between two points in
time.
> > **EXAMPLE:** we might have the dates a customer checks in and checks out of
a hotel, but the feature we want is the duration of the customer’s stay.
> > > pandas makes
this calculation easy using the `Timedelta` data type.

In [16]:
# When you have two datetime features and want to calculate the time between them for each observation.
# Just subtract the two date features using pandas:
# Load library
import pandas as pd
# Create data frame
dataframe = pd.DataFrame()
# Create two datetime features
dataframe['Arrived'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-04-2017')]
dataframe['Left'] = [pd.Timestamp('01-01-2017'), pd.Timestamp('01-06-2017')]
# Calculate duration between features
dataframe['Left'] - dataframe['Arrived']

0   0 days
1   2 days
dtype: timedelta64[ns]

In [17]:
# Often we will want to remove the days output and keep only the numerical value:
# Calculate duration between features
(dataframe['Left'] - dataframe['Arrived']).dt.days

0    0
1    2
dtype: int64
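
> When the datetimes include a time of day, whole days can hide most of the duration. As a sketch under made-up check-in and check-out times, the `dt.total_seconds()` accessor gives the full duration, which can then be scaled to hours.

```python
import pandas as pd

# Hypothetical stays that include a time of day
df = pd.DataFrame({
    "Arrived": pd.to_datetime(["2017-01-01 14:00", "2017-01-04 09:30"]),
    "Left": pd.to_datetime(["2017-01-01 20:00", "2017-01-06 09:30"]),
})

stay = df["Left"] - df["Arrived"]

# Whole days only: a six-hour stay counts as 0 days
df["stay_days"] = stay.dt.days

# Full duration expressed in hours
df["stay_hours"] = stay.dt.total_seconds() / 3600
print(df[["stay_days", "stay_hours"]])
```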

# Encoding Days of the Week
> - Knowing the weekday can be helpful if, for instance, we wanted to compare total
sales on Sundays for the past three years. 

In [19]:
# When you have a vector of dates and want to know the day of the week for each date.
# Use the pandas Series.dt method day_name():
# Load library
import pandas as pd
# Create dates
dates = pd.Series(pd.date_range("2/2/2002", periods=3, freq="ME"))
# Show days of the week
dates.dt.day_name()

0    Thursday
1      Sunday
2     Tuesday
dtype: object

In [20]:
# If we want the output to be a numerical value and therefore more usable as a machine
# learning feature, we can use weekday where the days of the week are represented as
# integers (Monday is 0):
# Show days of the week
dates.dt.weekday

0    3
1    6
2    1
dtype: int32
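
> Because `weekday` counts from Monday as 0, Saturday and Sunday are 5 and 6. A minimal sketch of a weekend indicator built on that fact follows. The feature name `is_weekend` is illustrative, and the dates are the same three month-end dates used above, written out explicitly.

```python
import pandas as pd

# The same three month-end dates as above, listed explicitly
dates = pd.Series(pd.to_datetime(["2002-02-28", "2002-03-31", "2002-04-30"]))

# Saturday (5) and Sunday (6) count as the weekend
is_weekend = dates.dt.weekday >= 5

# Convert the boolean flag to 0/1 for use as a numeric feature
weekend_flag = is_weekend.astype(int)
print(weekend_flag)
```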

# Creating a Lagged Feature
> Very often data is based on regularly spaced time periods (e.g., every day, every
hour, every three hours) and we are interested in using values in the past to make
predictions (often called lagging a feature).
> > **EXAMPLE:** we might want to predict a
stock’s price using the price it was the day before.
> > > With `pandas` we can use `shift` to
lag values by one row, creating a new feature containing past values.

In [21]:
# When you want to create a feature that is lagged n time periods.
# Use the pandas shift method:
# Load library
import pandas as pd
# Create data frame
dataframe = pd.DataFrame()
# Create data
dataframe["dates"] = pd.date_range("1/1/2001", periods=5, freq="D")
dataframe["stock_price"] = [1.1,2.2,3.3,4.4,5.5]
# Lagged values by one row
dataframe["previous_days_stock_price"] = dataframe["stock_price"].shift(1)
# Show data frame
dataframe

       dates  stock_price  previous_days_stock_price
0 2001-01-01          1.1                        NaN
1 2001-01-02          2.2                        1.1
2 2001-01-03          3.3                        2.2
3 2001-01-04          4.4                        3.3
4 2001-01-05          5.5                        4.4
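
> The argument to `shift` sets how many rows to lag, so several lagged features can be built in a loop. The sketch below reuses the stock price data. The column names `price_lag_1` and `price_lag_2` are made up for illustration, and dropping the incomplete rows is one possible choice, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    "dates": pd.date_range("2001-01-01", periods=5, freq="D"),
    "stock_price": [1.1, 2.2, 3.3, 4.4, 5.5],
})

# Create one lagged column per lag size
for n in (1, 2):
    df[f"price_lag_{n}"] = df["stock_price"].shift(n)

# The first n rows of each lagged column have no past value and are NaN;
# dropping them leaves only rows with a complete history
complete = df.dropna()
print(df)
print(len(complete))
```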


# Using Rolling Time Windows
> Rolling (also called moving) time windows are conceptually simple but can be difficult
to understand at first. Imagine we have monthly observations for a stock’s price. It is
often useful to have a time window of a certain number of months and then move
over the observations, calculating a statistic for all observations in the time window.
For example, with a three-month window and a rolling mean, we would average January
through March, then February through April, and so on.
>__________________________________
> Another way to put it: our three-month time window “walks” over the observations,
calculating the window’s mean at each step.

>> The pandas `rolling` method allows us to specify the size of the window by using
`window` and then quickly calculate some common statistics, including the max value
(`max()`), mean value (`mean()`), count of values (`count()`), and rolling correlation
(`corr()`).

>>> - `Rolling means` are often used to `smooth time series data` because using the mean of
the entire time window dampens the effect of short-term fluctuations.

In [23]:
# Given time series data, you want to calculate a statistic for a rolling time window.
# Use the pandas DataFrame rolling method:
# Load library
import pandas as pd
# Create datetimes
time_index = pd.date_range("01/01/2010", periods=5, freq="ME")
# Create data frame, set index
dataframe = pd.DataFrame(index=time_index)
# Create feature
dataframe["Stock_Price"] = [1,2,3,4,5]
# Calculate rolling mean
dataframe.rolling(window=2).mean()

            Stock_Price
2010-01-31          NaN
2010-02-28          1.5
2010-03-31          2.5
2010-04-30          3.5
2010-05-31          4.5
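
> The first row above is missing because a full window is not yet available. As a sketch of the three-month window described in the text, the snippet below computes a rolling mean with `window=3` and then shows `min_periods`, a `rolling` parameter that allows a result from partially filled windows. The month-end dates are written out explicitly and the variable names are illustrative.

```python
import pandas as pd

# Five month-end observations, as in the example above
idx = pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31",
                      "2010-04-30", "2010-05-31"])
prices = pd.Series([1, 2, 3, 4, 5], index=idx, name="Stock_Price")

# Three-month window: the first two rows have no complete window
strict = prices.rolling(window=3).mean()

# Allow windows with at least one observation
partial = prices.rolling(window=3, min_periods=1).mean()

print(strict)
print(partial)
```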


# Handling Missing Data in Time Series
> Interpolation is a technique for filling gaps caused by missing values by, in effect,
drawing a line or curve between the known values bordering the gap and using that
line or curve to predict reasonable values.
> > **Note:** Interpolation can be particularly useful
when the time intervals are constant, the data is not prone to noisy fluctuations, and
the gaps caused by missing values are small. 

In [25]:
# In addition to the missing data strategies previously discussed, when we have time
# series data we can use interpolation to fill gaps caused by missing values:
# Load libraries
import pandas as pd
import numpy as np
# Create date
time_index = pd.date_range("01/01/2010", periods=5, freq="ME")
# Create data frame, set index
dataframe = pd.DataFrame(index=time_index)
# Create feature with a gap of missing values
dataframe["Sales"] = [1.0,2.0,np.nan,np.nan,5.0]
# Interpolate missing values
dataframe.interpolate()

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    3.0
2010-04-30    4.0
2010-05-31    5.0


> `Back filling` and `forward filling` are forms of naive interpolation, where we draw a flat line from a known value and use it to fill in missing values.
> - One (minor) advantage
back filling and forward filling have over interpolation is that they don’t require
known values on both sides of missing values.
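
> A minimal sketch of that advantage, using made-up values with a gap at the very start of the series: with its default settings `interpolate` leaves the leading value missing because there is no earlier known value, while back filling can still fill it from the next known value.

```python
import numpy as np
import pandas as pd

# The first observation is missing, so it has a known value on one side only
sales = pd.Series([np.nan, 2.0, 3.0])

# Default (linear, forward) interpolation cannot fill the leading gap
interpolated = sales.interpolate()

# Back filling copies the next known value into the gap
backfilled = sales.bfill()

print(interpolated)
print(backfilled)
```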

In [26]:
# Alternatively, we can replace missing values with the last known value (i.e., forward filling):
# Forward fill
dataframe.ffill()

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    2.0
2010-04-30    2.0
2010-05-31    5.0


In [27]:
# We can also replace missing values with the next known value (i.e., back filling):
# Backfill
dataframe.bfill()

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    5.0
2010-04-30    5.0
2010-05-31    5.0


In [28]:
# If we believe the relationship between the known points is nonlinear, we can use
# interpolate’s method parameter to specify the interpolation method
# (methods such as "quadratic" require SciPy to be installed):
# Interpolate missing values
dataframe.interpolate(method="quadratic")

               Sales
2010-01-31  1.000000
2010-02-28  2.000000
2010-03-31  3.059808
2010-04-30  4.038069
2010-05-31  5.000000


- Finally, we may have large gaps of missing values but do not want to interpolate
values across the entire gap. In these cases we can use `limit` to restrict the number
of interpolated values and `limit_direction` to set whether to interpolate values
forward from the last known value before the gap or vice versa:

In [29]:
# Interpolate missing values
dataframe.interpolate(limit=1, limit_direction="forward")

            Sales
2010-01-31    1.0
2010-02-28    2.0
2010-03-31    3.0
2010-04-30    NaN
2010-05-31    5.0
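
> For comparison, a sketch of the "vice versa" case on the same sales values: with `limit_direction="backward"`, the single allowed value is filled next to the known value after the gap, not the one before it. The data is rebuilt here as a plain Series so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the Sales feature above, with a two-value gap
sales = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0])

# Fill at most one value, working forward from the last known value
forward = sales.interpolate(limit=1, limit_direction="forward")

# Fill at most one value, working backward from the next known value
backward = sales.interpolate(limit=1, limit_direction="backward")

print(forward)
print(backward)
```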


# END of Chapter 7 --> Handling dates and time series data