# Selecting Time Series Data

Broadly speaking, time series data are data points gathered over time (datetimes, in pandas terminology). The time order is meaningful, and there is typically only one observation per unit of time, so each timestamp often uniquely identifies a record. The points are often evenly spaced in time.

Examples of time series data include stock market closing prices, levels of CO2 in the atmosphere, unemployment rates, and airplane altitude. pandas provides rich functionality for analyzing time series data, aggregating over different time periods, sampling different periods of time, and more. Let's begin by reading in 20 years of stock market data, putting the `date` column in the index.

In [66]:
import pandas as pd
df = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], 
                 index_col='date')
df.head(3)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,


## Set the datetime column as the index

If you do have time series data where the values of one datetime column uniquely identify each row, then it's best to use this column as the index. pandas provides extra functionality to DataFrames that have a datetime index.

### DatetimeIndex

Setting a datetime column as the index creates a `DatetimeIndex`.

In [67]:
idx = df.index
type(idx)

pandas.core.indexes.datetimes.DatetimeIndex

Like other index objects, items may be selected with slice notation.

In [68]:
idx[:5]

DatetimeIndex(['1999-10-25', '1999-10-26', '1999-10-27', '1999-10-28',
               '1999-10-29'],
              dtype='datetime64[ns]', name='date', freq=None)

You can directly use datetime attributes and methods on DatetimeIndex objects, just as you can with the `dt` accessor on a datetime Series. Let's get the year, month, and day name directly from this index object. The first five values of each are returned.

In [69]:
idx.year[:5]

Index([1999, 1999, 1999, 1999, 1999], dtype='int32', name='date')

In [70]:
idx.month[:5]

Index([10, 10, 10, 10, 10], dtype='int32', name='date')

In [71]:
idx.day_name()[:5]

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], dtype='object', name='date')
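
To make the comparison with the `dt` accessor concrete, here is a minimal sketch. It builds a small, made-up index (`idx_demo` is a hypothetical name, not the stocks data) and checks that the index attributes agree with the `dt` accessor on the same values stored in a Series.

```python
import pandas as pd

# Hypothetical five-day index, used only for illustration
idx_demo = pd.date_range('1999-10-25', periods=5, freq='D', name='date')
s_demo = pd.Series(idx_demo)  # the same datetimes held as Series values

# Index attributes are available directly; a Series needs the dt accessor
same_year = list(idx_demo.year) == list(s_demo.dt.year)
same_day_name = list(idx_demo.day_name()) == list(s_demo.dt.day_name())
print(same_year, same_day_name)
```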

## Easy subset selection with a DatetimeIndex

One big advantage of a DatetimeIndex is the ability to select subsets of data without using boolean indexing. We can use strings to represent specific datetimes and pass those strings to the `loc` indexer. Here, we select the row of data for January 5th, 2017.

In [73]:
df

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.80,36.94,18.27,,
1999-10-28,29.01,2.43,16.59,71.00,,21.19,38.85,19.79,,
1999-10-29,29.88,2.50,17.21,70.62,,21.47,39.25,20.00,,
...,...,...,...,...,...,...,...,...,...,...
2019-10-18,137.41,236.41,32.31,1757.51,256.95,67.61,119.14,38.47,185.85,175.71
2019-10-21,138.43,240.51,33.59,1785.66,253.50,68.74,119.74,38.23,189.76,176.43
2019-10-22,136.37,239.96,34.82,1765.73,255.58,69.09,119.58,38.17,182.34,170.86
2019-10-23,137.24,243.18,35.33,1762.17,254.68,69.75,119.35,37.74,186.15,171.32


In [72]:
df.loc['2017-1-5']

MSFT     59.23
AAPL    111.73
SLB      76.93
AMZN    780.45
TSLA    226.75
XOM      79.11
WMT      64.88
T        36.08
FB      120.67
V        79.61
Name: 2017-01-05 00:00:00, dtype: float64

Note that we did not have to convert the string to a datetime object first. pandas implicitly understood that the string was a datetime.
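
As a small illustration of this implicit conversion, the sketch below uses a made-up three-row DataFrame (the column names `AAA` and `BBB` are placeholders) and compares selection by string with selection by an explicit `pd.Timestamp`.

```python
import pandas as pd

# Made-up prices on three consecutive days (not the real stocks data)
demo = pd.DataFrame(
    {'AAA': [1.0, 2.0, 3.0], 'BBB': [10.0, 20.0, 30.0]},
    index=pd.to_datetime(['2017-01-04', '2017-01-05', '2017-01-06']),
)

by_string = demo.loc['2017-1-5']                   # string parsed by pandas
by_timestamp = demo.loc[pd.Timestamp(2017, 1, 5)]  # explicit datetime object
print(by_string.equals(by_timestamp))
```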

### Partial string matching to select entire periods of time

You can select entire periods of time by using a string with less precision. Here, we select all of the rows from the month of February, 2017.

In [11]:
df.loc['2017-2'].head(3)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2017-02-01,60.45,123.36,75.01,832.35,249.24,74.1,62.09,35.99,133.23,80.94
2017-02-02,60.06,123.15,74.35,839.95,251.55,74.55,62.53,35.24,130.84,80.8
2017-02-03,60.54,123.68,74.4,810.2,251.33,74.63,62.34,35.3,130.98,84.51


Below, we select the entire year 2016.

In [12]:
df.loc['2016'].head(3)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2016-01-04,50.71,98.74,60.79,636.99,223.41,66.84,55.99,27.65,102.22,73.61
2016-01-05,50.94,96.27,61.07,633.79,223.43,67.41,57.32,27.85,102.73,74.16
2016-01-06,50.01,94.38,59.49,632.65,219.04,66.85,57.9,27.81,102.97,73.19
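
For comparison, the following sketch shows what partial string matching saves you from writing. It uses a made-up daily DataFrame and checks that the string `'2017-2'` selects the same rows as an explicit boolean mask on the index.

```python
import pandas as pd

# Made-up daily data spanning January through March 2017
idx = pd.date_range('2017-01-01', '2017-03-31', freq='D')
demo = pd.DataFrame({'value': range(len(idx))}, index=idx)

feb_by_string = demo.loc['2017-2']
feb_by_mask = demo[(demo.index.year == 2017) & (demo.index.month == 2)]
print(len(feb_by_string), feb_by_string.equals(feb_by_mask))
```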


### Slicing with partial string matching

Use slice notation to select a specific date range. Below, we select from March 28, 2017 through April 3, 2017. Note that the stop value is inclusive.

In [74]:
df.loc['2017-3-28':'2017-4-3']

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2017-03-28,62.45,138.38,71.06,856.0,277.45,73.78,66.41,35.56,141.76,87.66
2017-03-29,62.62,138.68,71.39,874.32,277.38,73.94,66.81,35.47,142.65,87.72
2017-03-30,62.85,138.5,70.63,876.34,277.92,75.46,67.61,35.74,142.41,87.55
2017-03-31,62.99,138.24,70.87,886.54,278.3,73.93,68.07,35.56,142.05,87.41
2017-04-03,62.7,138.28,70.5,891.51,298.52,73.99,67.83,35.57,142.28,87.9
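
The sketch below, on a made-up business-day series, illustrates two points about date slices: the stop value is included, and on a sorted index the slice endpoints do not need to be present in the index (here April 2, 2017 is a Sunday).

```python
import pandas as pd

# Made-up business-day series; 2017-04-01 and 2017-04-02 fall on a weekend
idx = pd.date_range('2017-03-27', '2017-04-07', freq='B')
demo = pd.Series(range(len(idx)), index=idx)

inclusive = demo.loc['2017-3-28':'2017-4-3']
weekend_stop = demo.loc['2017-3-28':'2017-4-2']  # stop label absent from the index
print(inclusive.index[-1].date(), weekend_stop.index[-1].date())
```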


### Selecting date ranges along with specific columns

The `loc` indexer allows you to select specific columns along with ranges of dates. Here, we select the month of May, 2017 along with three specific columns.

In [75]:
df.loc['2017-5', ['SLB', 'T', 'FB']].head()

date,SLB,T,FB
2017-05-01,65.6,33.86,152.46
2017-05-02,64.83,33.73,152.78
2017-05-03,64.91,33.25,151.8
2017-05-04,64.37,32.92,150.85
2017-05-05,65.31,33.39,150.24


## Selecting rows at specific frequencies

In addition to selecting consecutive rows, it is possible to select disjoint rows at specific frequencies of time. The `asfreq` method allows you to select very specific intervals by passing it an **offset alias** as a string. An offset alias determines the frequency at which you would like to sample the time series. The table below shows the most common offset aliases. pandas 2.2 renamed several of them; the older spellings appear in parentheses and emit a deprecation warning. Reference all of the [offset aliases in the official documentation][1].

| Alias               | Description   | Alias             | Description  |
|:--------------------|:--------------|:------------------|:-------------|
| `YE` (was `Y`/`A`)  | year end      | `D`               | day          |
| `YS` (was `AS`)     | year start    | `h` (was `H`)     | hourly       |
| `QE` (was `Q`)      | quarter end   | `min` (was `T`)   | minutes      |
| `QS`                | quarter start | `s` (was `S`)     | seconds      |
| `ME` (was `M`)      | month end     | `ms` (was `L`)    | milliseconds |
| `MS`                | month start   | `us` (was `U`)    | microseconds |
| `W`                 | weekly        | `ns` (was `N`)    | nanoseconds  |

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Let's say we are interested in selecting the last day of each year. To do so, we choose `'YE'` for the year-end frequency and pass it as a string to the `asfreq` method, which returns the very last day of each year. Note that `asfreq` requires a datetime-like index such as a DatetimeIndex.

In [79]:
df.asfreq('YE').head(8)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-12-31,37.68,3.2,18.11,76.12,,23.49,48.21,18.49,,
2000-12-31,,,,,,,,,,
2001-12-31,21.38,1.37,18.58,10.82,,24.18,40.56,15.52,,
2002-12-31,16.69,0.89,14.68,18.89,,22.03,35.79,11.13,,
2003-12-31,17.82,1.33,19.72,52.62,,26.58,37.84,11.2,,
2004-12-31,19.45,4.01,24.72,44.29,,34.04,38.03,11.62,,
2005-12-31,,,,,,,,,,
2006-12-31,,,,,,,,,,


### Business offset aliases

In this case, selecting the very last day isn't quite what we want because the stock market is only open on weekdays and December 31st falls on a weekend in some years. The `asfreq` method returns one row for each date at the requested frequency regardless of whether there is data for that date. All values for rows that do not appear in the original DataFrame will be missing.

Most of the offset aliases above can be prefixed with the character `'B'` to signify a business offset alias. Business offset aliases only consider the weekdays Monday through Friday. Let's change the offset alias to `'BYE'` to signify business year-end frequency. Doing so correctly selects the last trading day of each year.

In [82]:
df.asfreq('BYE').head(8)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-12-31,37.68,3.2,18.11,76.12,,23.49,48.21,18.49,,
2000-12-29,14.0,0.93,26.32,15.56,,25.89,37.23,18.5,,
2001-12-31,21.38,1.37,18.58,10.82,,24.18,40.56,15.52,,
2002-12-31,16.69,0.89,14.68,18.89,,22.03,35.79,11.13,,
2003-12-31,17.82,1.33,19.72,52.62,,26.58,37.84,11.2,,
2004-12-31,19.45,4.01,24.72,44.29,,34.04,38.03,11.62,,
2005-12-30,19.27,8.96,36.62,47.15,,38.05,34.12,11.64,,
2006-12-29,22.32,10.58,48.1,39.46,,52.92,34.16,17.83,,
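
The difference between the calendar and business year-end frequencies can be reproduced on a made-up series that only has business days. This sketch passes offset objects (`pd.offsets.YearEnd()` and `pd.offsets.BYearEnd()`) rather than alias strings; that is a choice made here so the example does not depend on which alias spelling your pandas version accepts.

```python
import pandas as pd

# Made-up series with one value per business day (weekends are absent)
idx = pd.date_range('1999-10-25', '2001-12-31', freq='B')
prices = pd.Series(range(len(idx)), index=idx, dtype='float64')

# Offset objects stand in for the year-end alias strings
calendar_year_end = prices.asfreq(pd.offsets.YearEnd())
business_year_end = prices.asfreq(pd.offsets.BYearEnd())

print(calendar_year_end.isna().tolist())
print(business_year_end.index.strftime('%Y-%m-%d').tolist())
```

December 31, 2000 was a Sunday, so the calendar version has a missing value for that year while the business version moves to the preceding Friday.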


### Anchored offset aliases

Let's say we would like to select every Thursday. We'll need to use a slightly different string called an **anchored offset alias**. You can anchor years and quarters to months, and weeks to days, by placing a dash and the abbreviation of the anchor after the offset alias. For example, `BYE-APR` signifies business year frequency ending in April. When anchoring the week, use the three-character abbreviation of the day. Below, we anchor weeks to Thursday. The default anchor for weeks is Sunday.

In [83]:
df.asfreq('W-THU').head()

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-28,29.01,2.43,16.59,71.0,,21.19,38.85,19.79,,
1999-11-04,29.61,2.61,17.09,63.06,,21.16,39.03,19.39,,
1999-11-11,28.93,2.88,17.57,73.0,,22.4,40.03,19.39,,
1999-11-18,27.42,2.79,18.95,77.94,,23.63,41.38,19.47,,
1999-11-25,,,,,,,,,,
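
As a quick check of the weekly anchors, this sketch runs `asfreq` on a made-up daily series with and without an anchor and looks at the day names that come back.

```python
import pandas as pd

idx = pd.date_range('1999-10-25', '1999-12-31', freq='D')  # made-up daily data
demo = pd.Series(range(len(idx)), index=idx)

thursdays = demo.asfreq('W-THU')
sundays = demo.asfreq('W')  # no anchor given, so weeks end on Sunday
print(set(thursdays.index.day_name()), set(sundays.index.day_name()))
```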


Select the last day of June of each year by anchoring the year-end alias to the three-character abbreviation of the month. In pandas 2.2 and later this is written `YE-JUN`; the older `A-JUN` spelling is deprecated and emits a warning.

In [86]:
df.asfreq('YE-JUN').head()


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2000-06-30,25.82,3.26,24.34,36.31,,23.14,40.29,16.87,,
2001-06-30,,,,,,,,,,
2002-06-30,,,,,,,,,,
2003-06-30,16.6,1.19,16.89,36.32,,22.96,38.15,10.72,,
2004-06-30,18.6,2.03,23.18,54.4,,29.16,37.62,10.67,,


## Upsampling - Increasing the number of rows

The above selections choose a specific subset of rows. pandas uses the term **downsampling** for selecting a subset of the original data (usually fewer rows than the original). Instead, we may choose to **upsample** and increase the number of rows, which can introduce many rows of missing values. Both upsampling and downsampling ensure that the rows are evenly spaced units of time. Let's return a DataFrame with a single row for each calendar day. This creates rows for all non-trading days (weekends and holidays).

In [87]:
df.asfreq('D').head(7)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,
1999-10-28,29.01,2.43,16.59,71.0,,21.19,38.85,19.79,,
1999-10-29,29.88,2.5,17.21,70.62,,21.47,39.25,20.0,,
1999-10-30,,,,,,,,,,
1999-10-31,,,,,,,,,,
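
One way to think about upsampling with `asfreq` is as a reindex onto a complete, evenly spaced range of dates. The sketch below, on a made-up business-day series, checks that the two approaches give the same result; it is an illustration of the idea, not a description of pandas internals.

```python
import pandas as pd

# Made-up series on business days only
idx = pd.date_range('1999-10-25', '1999-11-05', freq='B')
demo = pd.Series(range(len(idx)), index=idx, dtype='float64')

upsampled = demo.asfreq('D')

# Build the full daily range by hand and reindex onto it
full_range = pd.date_range(demo.index.min(), demo.index.max(), freq='D')
reindexed = demo.reindex(full_range)

print(len(demo), len(upsampled), int(upsampled.isna().sum()))
```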


## Use integers in the offset alias

You can provide more precise offsets by placing an integer in front of the offset alias. The integer represents a multiple of the offset alias. For example, `'3ME'` stands for every third month end and `'15s'` for every 15 seconds. To select every 6th Wednesday, we do the following:

In [88]:
df.asfreq('6W-WED').head()

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,
1999-12-08,29.61,3.43,16.88,88.56,,24.34,40.88,20.5,,
2000-01-19,34.54,3.32,21.25,66.81,,24.95,44.68,15.46,,
2000-03-01,29.31,4.06,25.0,65.88,,22.3,34.18,16.01,,
2000-04-12,25.62,3.41,23.76,56.38,,23.44,43.6,18.02,,
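
To confirm what the integer multiple does, this sketch applies `'6W-WED'` to a made-up daily series and measures the gap between consecutive selected dates.

```python
import pandas as pd

idx = pd.date_range('1999-10-25', '2000-04-30', freq='D')  # made-up daily data
demo = pd.Series(range(len(idx)), index=idx)

every_sixth_wed = demo.asfreq('6W-WED')

# Days between consecutive selected rows; six weeks is 42 days
gap_days = every_sixth_wed.index.to_series().diff().dropna().dt.days.unique().tolist()
print(every_sixth_wed.index[0].date(), gap_days)
```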


You can also upsample by smaller units than what is present in the index. For instance, `'4h'` makes a new row for every 4 hours of time.

In [89]:
df.asfreq('4h').head(8)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25 00:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 04:00:00,,,,,,,,,,
1999-10-25 08:00:00,,,,,,,,,,
1999-10-25 12:00:00,,,,,,,,,,
1999-10-25 16:00:00,,,,,,,,,,
1999-10-25 20:00:00,,,,,,,,,,
1999-10-26 00:00:00,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-26 04:00:00,,,,,,,,,,


You can fill in the missing values with the previous or next known values using the `method` parameter, which can be set to either `'ffill'` or `'bfill'`. Here we fill the missing values using the previously known value in the column.

In [90]:
df.asfreq('4h', method='ffill').head(8)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25 00:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 04:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 08:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 12:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 16:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-25 20:00:00,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26 00:00:00,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-26 04:00:00,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
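
The two fill directions can be compared side by side on a tiny series. This sketch reuses the first two MSFT closes from the output above and passes `pd.offsets.Hour(4)` instead of an alias string, an assumption made here only to stay independent of the pandas version.

```python
import pandas as pd

# Two daily closes copied from the MSFT column above
daily = pd.Series([29.84, 29.82],
                  index=pd.to_datetime(['1999-10-25', '1999-10-26']))

four_hourly = daily.asfreq(pd.offsets.Hour(4))                  # gaps left missing
forward = daily.asfreq(pd.offsets.Hour(4), method='ffill')     # previous known value
backward = daily.asfreq(pd.offsets.Hour(4), method='bfill')    # next known value

print(len(four_hourly), int(four_hourly.isna().sum()))
print(forward.iloc[1], backward.iloc[1])
```

The row at 04:00 on October 25 takes the October 25 close under `'ffill'` and the October 26 close under `'bfill'`.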


### No duplicates are allowed and dates must be ordered

Upsampling and downsampling only work when there are no duplicate dates and when the data is ordered. Let's take a look at the employee dataset which has a datetime column, but is not time series data.

In [91]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp = emp.set_index('hire_date')
emp.head(3)

hire_date,dept,title,salary,sex,race
2001-12-03,Police,POLICE SERGEANT,87545.38,Male,White
2010-11-15,Other,ASSISTANT CITY ATTORNEY II,82182.0,Male,Hispanic
2006-01-09,Houston Public Works,SENIOR SLUDGE PROCESSOR,49275.0,Male,Black


If we try to sample it by year, we get an error.

In [92]:
emp

hire_date,dept,title,salary,sex,race
2001-12-03,Police,POLICE SERGEANT,87545.38,Male,White
2010-11-15,Other,ASSISTANT CITY ATTORNEY II,82182.00,Male,Hispanic
2006-01-09,Houston Public Works,SENIOR SLUDGE PROCESSOR,49275.00,Male,Black
1997-05-27,Police,SENIOR POLICE OFFICER,75942.10,Male,Hispanic
2006-01-23,Police,SENIOR POLICE OFFICER,69355.26,Male,White
...,...,...,...,...,...
2001-12-03,Police,SENIOR POLICE OFFICER,75942.10,Male,Black
2016-03-28,Other,SENIOR PROCUREMENT SPECIALIST,76175.00,Female,Black
2015-09-14,Houston Public Works,WATER SERVICE INSPECTOR I,35173.00,Male,Black
2008-05-19,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,67198.00,Female,Black


In [93]:
emp.asfreq('YE')


ValueError: cannot reindex on an axis with duplicate labels

Let's try to make it more like a time series by sorting the index.

In [94]:
emp = emp.sort_index()
emp.head(3)

hire_date,dept,title,salary,sex,race
1968-12-13,Police,SENIOR POLICE OFFICER,,Male,Black
1969-03-21,Police,POLICE SERGEANT,,Male,Hispanic
1969-10-06,Other,SENIOR PUBLIC LOSS INVESTIGATOR,75067.0,Female,White


The operation will only be successful if there are no duplicate dates. The error tells us that at least one hire date is not unique.

In [95]:
emp.asfreq('ME')


ValueError: cannot reindex on an axis with duplicate labels

Selection with partial strings still works.

In [96]:
emp.loc['2012-1':'2012-2'].head()

hire_date,dept,title,salary,sex,race
2012-01-03,Other,COUNCIL MEMBER,62983.0,Female,White
2012-01-03,Other,COUNCIL MEMBER,62983.0,Male,White
2012-01-03,Other,DEPUTY DIRECTOR (EXECUTIVE LEVEL),142170.0,Female,Black
2012-01-03,Police,CRIMINAL INTELLIGENCE ANALYST,59322.0,Male,White
2012-01-03,Health & Human Services,SURVEILLANCE INVESTIGATOR-EPIDEMIOLOGY,46654.0,Female,Black
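
Before calling `asfreq`, it can help to check the index for duplicates. The sketch below uses a made-up frame standing in for the employee data. Collapsing duplicates with a mean per date is only one possible fix and is an assumption of this example, not something the employee dataset calls for.

```python
import pandas as pd

# Made-up hire dates with one duplicate, standing in for the employee data
hires = pd.DataFrame(
    {'salary': [50000.0, 60000.0, 70000.0]},
    index=pd.to_datetime(['2012-01-03', '2012-01-03', '2012-03-01']),
)

print(hires.index.is_unique)  # duplicates are present

try:
    hires.asfreq('MS')
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# Hypothetical fix: collapse to one row per date, then change frequency
per_day = hires.groupby(level=0).mean()
monthly = per_day.asfreq('MS')
print(per_day.index.is_unique, len(monthly))
```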


## Exercises

In [65]:
import pandas as pd
import numpy as np

# ------------------------------------------------------------------------------
# 1. ROBUST INGESTION MODULE
# ------------------------------------------------------------------------------
def load_weather_production(path: str) -> pd.DataFrame:
    """
    Ingests weather data with maximum safety.
    
    Standards Applied:
    1. .replace(): Non-destructive transformation. Preserves unexpected values.
    2. 'boolean': Nullable type. Allows True/False/NA (unlike numpy bool).
    3. Method Chaining: Atomic steps.
    """
    # Define mapping dictionary
    yes_no_map = {'Yes': True, 'No': False}
    
    return (pd.read_csv(path)
        .assign(
            # 1. Date: Convert and Index
            date=lambda x: pd.to_datetime(x['date']),
            
            # 2. Rain/Snow: The Safe Transformation
            # We use .replace() so we don't destroy data we didn't expect.
            # We use .astype('boolean') to handle missing values safely.
            rain=lambda x: x['rain'].replace(yes_no_map).astype('boolean'),
            snow=lambda x: x['snow'].replace(yes_no_map).astype('boolean'),
            
            # 3. Temp: Force numeric, turning errors into NaNs (safely)
            temperature=lambda x: pd.to_numeric(x['temperature'], errors='coerce')
        )
        .set_index('date')
        .sort_index()
    )

# Load Data (Simulated path for MDA repo)
weather = load_weather_production('../data/weather.csv')



In [97]:
weather

date,rain,snow,temperature
2007-01-01,True,False,68.0
2007-01-02,False,False,55.9
2007-01-03,False,False,62.1
2007-01-04,False,False,69.1
2007-01-05,True,False,72.0
...,...,...,...
2018-11-20,False,False,64.0
2018-11-21,False,False,57.0
2018-11-22,False,False,50.0
2018-11-23,False,False,45.0


### Exercise 1

<span style="color:green; font-size:16px">Read in the weather time series dataset and place the date column in the index. Then use this DataFrame for the following questions.</span>

In [55]:
weather.index

DatetimeIndex(['2007-01-01', '2007-01-02', '2007-01-03', '2007-01-04',
               '2007-01-05', '2007-01-06', '2007-01-07', '2007-01-08',
               '2007-01-09', '2007-01-10',
               ...
               '2018-11-15', '2018-11-16', '2018-11-17', '2018-11-18',
               '2018-11-19', '2018-11-20', '2018-11-21', '2018-11-22',
               '2018-11-23', '2018-11-24'],
              dtype='datetime64[ns]', name='date', length=4346, freq=None)

### Exercise 2

<span style="color:green; font-size:16px">Select all of the month of November, 2010</span>

In [56]:
weather.loc['2010-11'].head(3)

date,rain,snow,temperature
2010-11-01,False,False,63.0
2010-11-02,False,False,57.9
2010-11-03,True,False,55.9


### Exercise 3

<span style="color:green; font-size:16px">Select all of the second quarter of 2017.</span>

In [57]:
q2_2017 = weather.loc['2017-04':'2017-06']

In [100]:
weather.loc['2017Q2']

date,rain,snow,temperature
2017-04-01,False,False,75.0
2017-04-02,False,False,69.1
2017-04-03,False,False,75.9
2017-04-04,False,False,82.0
2017-04-05,False,False,78.1
...,...,...,...
2017-06-26,False,False,84.9
2017-06-27,False,False,82.9
2017-06-28,False,False,82.9
2017-06-29,False,False,87.1


### Exercise 4

<span style="color:green; font-size:16px">Select data from July 1, 2015 to the end of 2016.</span>

In [58]:
mid_15_to_16 = weather.loc['2015-07-01':'2016']
mid_15_to_16

date,rain,snow,temperature
2015-07-01,False,False,87.1
2015-07-02,False,False,87.1
2015-07-03,False,False,78.1
2015-07-04,False,False,87.1
2015-07-05,False,False,90.0
...,...,...,...
2016-12-27,False,False,68.0
2016-12-28,False,False,60.1
2016-12-29,False,False,63.0
2016-12-30,False,False,48.0


In [101]:
weather.loc['2015Q3':'2016']

date,rain,snow,temperature
2015-07-01,False,False,87.1
2015-07-02,False,False,87.1
2015-07-03,False,False,78.1
2015-07-04,False,False,87.1
2015-07-05,False,False,90.0
...,...,...,...
2016-12-27,False,False,68.0
2016-12-28,False,False,60.1
2016-12-29,False,False,63.0
2016-12-30,False,False,48.0


### Exercise 5

<span style="color:green; font-size:16px">Select just the rain and snow columns from the January 1, 2008 to January 7, 2008.</span>

In [59]:
weather.loc['2008-01-01':'2008-01-07', ['rain', 'snow']]

date,rain,snow
2008-01-01,False,False
2008-01-02,False,True
2008-01-03,False,False
2008-01-04,False,False
2008-01-05,False,False
2008-01-06,False,False
2008-01-07,False,False


### Exercise 6

<span style="color:green; font-size:16px">What was the temperature on June 11, 2011?</span>

In [60]:
weather.loc['2011-06-11','temperature']

np.float64(93.9)

### Exercise 7

<span  style="color:green; font-size:16px">How many days did it rain during the last three months of 2011?</span>

In [62]:
rainy_days = (weather
    .loc['2011-10':'2011-12', 'rain']
    .gt(0)
    .sum()
)

rainy_days

np.int64(23)

In [111]:
rainy_days = (weather
    .loc['2011Q4', 'rain']
    .sum()
)

rainy_days

np.int64(23)

### Exercise 8

<span style="color:green; font-size:16px">Which year had more snow days, 2007 or 2012?</span>

In [158]:
snow_07 = weather.loc['2007', 'snow'].sum()
snow_12 = weather.loc['2012', 'snow'].sum()
more_snow_year = 2007 if snow_07 > snow_12 else 2012

more_snow_year

2007

In [159]:
more_snow_year = (
    weather
    .loc[lambda df_: df_.index.year.isin([2007,2012]),'snow']
    .groupby(lambda idx: idx.year)  
    .sum()                        
)

more_snow_year.idxmax()

np.int64(2007)

### Exercise 9

<span style="color:green; font-size:16px">Select every other Thursday.</span>

In [162]:
biweekly_thursdays = (
    weather
    .asfreq('2W-THU')       # Selects every 2nd week, anchored on Thursday
)

### Exercise 10

<span style="color:green; font-size:16px">Select the first day of each month.</span>

In [164]:
weather.asfreq('MS')

date,rain,snow,temperature
2007-01-01,True,False,68.0
2007-02-01,True,True,34.0
2007-03-01,True,False,66.9
2007-04-01,True,False,77.0
2007-05-01,False,False,91.9
...,...,...,...
2018-07-01,False,False,95.0
2018-08-01,False,False,90.0
2018-09-01,False,False,93.0
2018-10-01,False,False,82.0


### Exercise 11

<span style="color:green; font-size:16px">Select every other October 1st.</span>

In [166]:
weather.asfreq('2YS-OCT')

date,rain,snow,temperature
2007-10-01,False,False,82.0
2009-10-01,False,False,73.0
2011-10-01,False,False,64.0
2013-10-01,False,False,82.0
2015-10-01,False,False,66.0
2017-10-01,False,False,71.1


### Use the temperature dataset for the remaining exercises

Execute the following cell to read in the temperature dataset which sets the datetime column in the index.

In [167]:
df_temp = pd.read_csv('../data/weather/temperature.csv', parse_dates=['datetime'], 
                      index_col='datetime')
df_temp.head()

datetime,Seattle,San Francisco,Los Angeles,Las Vegas,Denver,Houston,Chicago,Atlanta,Miami,New York
2013-01-01 00:00:00,2.94,11.5,11.66,7.28,-2.3,8.81,-0.19,1.92,19.41,-1.12
2013-01-01 01:00:00,2.4,10.22,10.67,5.95,-3.23,8.81,0.28,0.6,19.35,-1.69
2013-01-01 02:00:00,1.7,8.02,9.91,5.18,-3.03,8.81,0.33,-0.53,18.99,-1.96
2013-01-01 03:00:00,1.45,7.3,9.33,4.42,-3.67,8.48,0.12,-1.36,18.56,-2.08
2013-01-01 04:00:00,0.95,6.84,8.82,3.62,-5.55,8.34,0.04,-1.44,18.49,-2.32


### Exercise 12

<span style="color:green; font-size:16px">Select the temperatures for Houston between 3 and 6 p.m. on July 4, 2014.</span>

In [170]:
houston_temp = df_temp.loc['2014-07-04 15:00':'2014-07-04 18:00', 'Houston']

### Exercise 13

<span style="color:green; font-size:16px">Upsample the result from the previous exercise so that there are entries every 20 minutes.</span>

In [173]:
houston_temp.asfreq('20min')


datetime
2014-07-04 15:00:00    27.37
2014-07-04 15:20:00      NaN
2014-07-04 15:40:00      NaN
2014-07-04 16:00:00    28.85
2014-07-04 16:20:00      NaN
2014-07-04 16:40:00      NaN
2014-07-04 17:00:00    30.29
2014-07-04 17:20:00      NaN
2014-07-04 17:40:00      NaN
2014-07-04 18:00:00    31.00
Freq: 20min, Name: Houston, dtype: float64

### Exercise 14

<span style="color:green; font-size:16px">Linearly interpolate the missing values in the previous exercise to estimate the temperature at 4:40 pm on July 4, 2014.</span>

In [176]:
houston_temp.asfreq('20min').interpolate(method='linear')

datetime
2014-07-04 15:00:00    27.370000
2014-07-04 15:20:00    27.863333
2014-07-04 15:40:00    28.356667
2014-07-04 16:00:00    28.850000
2014-07-04 16:20:00    29.330000
2014-07-04 16:40:00    29.810000
2014-07-04 17:00:00    30.290000
2014-07-04 17:20:00    30.526667
2014-07-04 17:40:00    30.763333
2014-07-04 18:00:00    31.000000
Freq: 20min, Name: Houston, dtype: float64
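
To pull out just the 4:40 pm estimate, you can select that timestamp from the interpolated Series. The sketch below is self-contained: `houston_demo` is a hypothetical stand-in built from the four hourly readings shown in the Exercise 13 output.

```python
import pandas as pd

# Hourly Houston readings copied from the Exercise 13 output above
houston_demo = pd.Series(
    [27.37, 28.85, 30.29, 31.00],
    index=pd.date_range('2014-07-04 15:00', periods=4, freq='60min'),
    name='Houston',
)

interpolated = houston_demo.asfreq('20min').interpolate(method='linear')

# Select the single interpolated value at 4:40 pm
estimate = interpolated.loc[pd.Timestamp('2014-07-04 16:40')]
print(round(estimate, 2))
```

The estimate sits two thirds of the way between the 4 pm and 5 pm readings, which agrees with the 29.81 shown in the output above.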