# Grouping by Time

In previous chapters, we learned how to select a single period of time series data. In this chapter, we will group each row into independent periods of time and then perform an operation on each group. For example, we will find the average closing price of a stock for every month. This type of analysis is similar to the material presented in the **Grouping Data** part. Instead of grouping by unique values in a particular column, we will group by time periods. Each row will be placed into a single group based on its time period and then an operation will be performed on each group. Let's begin by reading in our stock dataset.

In [20]:
import pandas as pd
df = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], 
                 index_col='date')
df.head(3)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,


## Grouping with the `resample` method

The `resample` method is available to group by particular time periods. It's actually possible to use the `groupby` method to get the same result, but we will begin with `resample`, as it is a bit simpler and was built just for this purpose.

### Find the average closing price of Amazon for every month

If we are interested in finding the average closing price of Amazon for every month, then we need to group by month and aggregate the closing price with the mean function.

### Grouping column, aggregating column, and aggregating method

This procedure is very similar to how we grouped and aggregated columns in the **Grouping Data** part. The only difference is that our grouping column will now be the datetime index. The syntax is similar to the `groupby` method. Pass the `resample` method an [offset alias][1] to determine the grouping time period. As with `groupby`, calling the `resample` method does not produce a result; it just informs pandas how to create the groups. You must take action on these groups by chaining a method to the result. Here, we chain the `agg` method to perform an aggregation that renames the resulting column.

```python
df.resample('offset alias').agg(new_column=('aggregating column', 'aggregating function'))
```

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Here, we use the offset alias `'M'` to group by month end and then aggregate the AMZN and WMT columns with the mean and median, respectively. (pandas 2.2 and later spell this alias `'ME'` and deprecate `'M'`.)

In [21]:
df.resample('M').agg(AMZN_mean=('AMZN', 'mean'),
                     WMT_median=('WMT', 'median')).head(3)


date,AMZN_mean,WMT_median
1999-10-31,76.312,38.85
1999-11-30,76.240952,40.42
1999-12-31,91.07,45.575
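
To check the mechanics by hand, here is a minimal sketch on a made-up six-row DataFrame (the column names `AAA` and `BBB` and all values are hypothetical). It passes the `pd.offsets.MonthEnd()` offset object instead of a string alias. That is an assumption made only so the sketch does not depend on which spelling of the month-end alias your pandas version accepts.

```python
import pandas as pd

# Hypothetical toy data: six daily prices spanning two calendar months.
idx = pd.to_datetime(['2020-01-29', '2020-01-30', '2020-01-31',
                      '2020-02-03', '2020-02-04', '2020-02-05'])
toy = pd.DataFrame({'AAA': [10.0, 11.0, 12.0, 20.0, 22.0, 24.0],
                    'BBB': [5.0, 7.0, 6.0, 1.0, 9.0, 3.0]}, index=idx)

# Group by month end, then name each output column explicitly.
monthly = toy.resample(pd.offsets.MonthEnd()).agg(
    AAA_mean=('AAA', 'mean'),
    BBB_median=('BBB', 'median'),
)
print(monthly)
```

Each output row is labeled with the last calendar day of its month, even when no data falls on that day.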


### Other `resample` syntax

All the other groupby aggregation syntaxes that we covered previously are available with `resample`. We replicate the result from above using a dictionary to map the aggregating column to the aggregating function.

In [22]:
df.resample('M').agg({'AMZN': 'mean', 'WMT': 'median'}).head(3)


date,AMZN,WMT
1999-10-31,76.312,38.85
1999-11-30,76.240952,40.42
1999-12-31,91.07,45.575


### Map each column name to a list of aggregations

Compute multiple aggregations per column by using a list as the values part of the dictionary passed to the `agg` method.

In [23]:
df.resample('M').agg({'AMZN': ['size', 'min', 'mean', 'max'], 
                      'WMT': ['max']}).head(3)


,AMZN,AMZN,AMZN,AMZN,WMT
date,size,min,mean,max,max
1999-10-31,5,70.62,76.312,82.75,39.25
1999-11-30,21,63.06,76.240952,93.12,41.73
1999-12-31,22,76.12,91.07,106.69,48.43


### Aggregation methods

All the normal DataFrame aggregations are available directly as methods and will perform their aggregation on each column. Here, the mean of all columns for each month is taken.

In [24]:
df.resample('M').mean().tail(3).round(1)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2019-08-31,136.2,204.8,33.3,1793.6,225.1,69.1,110.2,34.2,184.5,177.5
2019-09-30,138.1,218.0,36.0,1799.1,237.3,71.5,116.8,36.7,185.7,177.2
2019-10-31,138.1,232.2,32.5,1746.8,251.3,68.4,118.7,37.4,182.9,174.9


The `size` method returns the total number of rows per group. Since this number is the same per column, a Series is returned.

In [25]:
df.resample('M').size().head()


date
1999-10-31     5
1999-11-30    21
1999-12-31    22
2000-01-31    20
2000-02-29    20
Freq: ME, dtype: int64

The `count` method returns the number of non-missing values for each time period per column. Notice that some of the stocks did not exist in 1999.

In [26]:
df.resample('M').count().head(3)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-31,5,5,5,5,0,5,5,5,0,0
1999-11-30,21,21,21,21,0,21,21,21,0,0
1999-12-31,22,22,22,22,0,22,22,22,0,0
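
The difference between `size` and `count` is easiest to see when a column has missing values. The sketch below uses a hypothetical ten-day DataFrame in which the `new` column has no data for the first week; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Ten consecutive days starting on Monday, March 1, 2021.
idx = pd.date_range('2021-03-01', periods=10, freq='D')
toy = pd.DataFrame({'old': np.arange(10.0),
                    'new': [np.nan] * 7 + [1.0, 2.0, 3.0]}, index=idx)

weekly = toy.resample('W')   # weeks ending on Sunday
sizes = weekly.size()        # rows per group, one number per group
counts = weekly.count()      # non-missing values per group and per column

print(sizes)
print(counts)
```

The first week has seven rows according to `size`, but `count` reports zero for `new` because every value in that week is missing.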


## Grouping by different time periods

Let's see several more of the offset aliases, beginning with `'W'` for weeks ending on Sunday.

In [30]:
df.resample('W').mean().head().round(1)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-31,29.6,2.4,16.8,76.3,,21.2,38.2,18.4,,
1999-11-07,29.7,2.6,17.1,65.9,,21.2,39.3,19.5,,
1999-11-14,28.7,2.9,17.7,73.8,,22.2,40.4,19.3,,
1999-11-21,27.8,2.8,18.9,77.0,,23.2,41.1,19.4,,
1999-11-28,29.2,2.9,18.2,85.7,,22.9,39.9,19.5,,


Grouping by quarter end.

In [31]:
df.resample('Q').mean().head().round(1)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-12-31,31.4,3.0,17.2,83.0,,23.0,42.2,19.4,,
2000-03-31,32.9,3.6,21.9,69.0,,23.2,39.2,15.9,,
2000-06-30,23.5,3.3,24.4,51.6,,23.8,39.5,17.2,,
2000-09-30,22.8,3.3,26.0,38.1,,24.3,37.5,16.9,,
2000-12-31,19.2,1.2,25.1,27.4,,26.5,33.9,20.7,,


Grouping by year end.

In [36]:
df.resample('Y').mean().round(1)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-12-31,31.4,3.0,17.2,83.0,,23.0,42.2,19.4,,
2000-12-31,24.6,2.8,24.4,46.5,,24.4,37.5,17.7,,
2001-12-31,20.2,1.3,19.2,12.2,,25.1,36.5,16.9,,
2002-12-31,17.6,1.2,16.4,16.5,,23.5,39.1,12.3,,
2003-12-31,16.9,1.2,16.1,37.7,,23.1,38.6,10.0,,
2004-12-31,17.9,2.2,22.8,43.6,,29.8,39.3,11.2,,
2005-12-31,19.0,5.8,29.5,39.9,,39.1,35.3,11.2,,
2006-12-31,19.5,8.8,47.3,35.9,,44.8,34.2,14.4,,
2007-12-31,22.9,16.0,64.1,67.2,,58.0,35.0,20.1,,
2008-12-31,20.3,17.7,63.4,69.9,,58.6,41.9,17.6,,15.5


### Use start of period instead of end as label for group

The single character offset aliases `'Y'`, `'Q'`, `'M'`, and `'W'` all use the end of the period as the index label for the group. Appending the character `'S'` groups by the same span of time, but uses the start of the time period as the label. Here, we group by year start.

In [38]:
df.resample('YS').mean().head().round(1)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-01-01,31.4,3.0,17.2,83.0,,23.0,42.2,19.4,,
2000-01-01,24.6,2.8,24.4,46.5,,24.4,37.5,17.7,,
2001-01-01,20.2,1.3,19.2,12.2,,25.1,36.5,16.9,,
2002-01-01,17.6,1.2,16.4,16.5,,23.5,39.1,12.3,,
2003-01-01,16.9,1.2,16.1,37.7,,23.1,38.6,10.0,,


Month start is used below.

In [39]:
df.resample('MS').mean().head(3).round(1)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-01,29.6,2.4,16.8,76.3,,21.2,38.2,18.4,,
1999-11-01,28.9,2.8,17.9,76.2,,22.4,40.2,19.5,,
1999-12-01,34.3,3.2,16.5,91.1,,24.0,44.9,19.5,,
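
A small sketch can confirm that end-labeled and start-labeled monthly groups contain the same rows and differ only in their index labels. The data below is hypothetical, and the month-end grouping uses the `pd.offsets.MonthEnd()` object rather than a string alias, an assumption made only to keep the sketch independent of alias spelling.

```python
import pandas as pd

# Forty consecutive days: January 15 through February 23, 2020.
idx = pd.date_range('2020-01-15', periods=40, freq='D')
s = pd.Series(range(40), index=idx)

by_end = s.resample(pd.offsets.MonthEnd()).sum()   # labeled by month end
by_start = s.resample('MS').sum()                  # labeled by month start

print(by_end)
print(by_start)
```

The sums are identical; only the labels move from the last day of each month to the first.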


### Grouping by anchored offset aliases

Year, quarter, and week can all be anchored to a different month or day of the week. Here, we group by quarter, where the quarter end months are Feb, May, August, and November.

In [40]:
df.resample('Q-Feb').mean().head(4).round(1)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-11-30,29.0,2.7,17.7,76.3,,22.2,39.9,19.3,,
2000-02-29,33.7,3.3,18.8,77.6,,23.6,42.1,17.0,,
2000-05-31,26.6,3.7,24.7,59.2,,23.3,38.8,16.7,,
2000-08-31,23.6,3.2,25.0,39.5,,23.9,38.6,17.0,,


Here, the year is set to run from July 1 through June 30. Note that you must use `'A'` and not `'Y'` as the offset alias.

In [41]:
df.resample('A-Jun').mean().head().round(1)


date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
2000-06-30,29.1,3.3,21.5,66.6,,23.4,40.1,17.4,,
2001-06-30,20.6,1.8,23.6,23.6,,25.5,35.9,18.1,,
2002-06-30,19.5,1.3,17.6,12.6,,24.8,39.1,15.3,,
2003-06-30,16.3,1.0,14.6,22.4,,22.1,37.0,10.3,,
2004-06-30,17.4,1.5,19.6,47.5,,25.7,40.3,10.5,,
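
Anchoring works the same way for weeks. As a minimal sketch on hypothetical data, the alias `'W-WED'` below ends each week on Wednesday instead of the default Sunday.

```python
import pandas as pd

# Fourteen consecutive days starting on Monday, March 1, 2021.
idx = pd.date_range('2021-03-01', periods=14, freq='D')
s = pd.Series(1, index=idx)

# Each group ends on a Wednesday and is labeled with that date.
sizes = s.resample('W-WED').size()
print(sizes)
```

The first group holds only Monday through Wednesday, the second holds a full seven days, and the last holds the remaining four days.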


## Grouping by more than one consecutive offset alias period

As we've learned, it's possible to place an integer before the offset alias to represent consecutive time periods. Here, we group by two consecutive months.

In [42]:
df.resample('2M').size().head()


date
1999-10-31     5
1999-12-31    43
2000-02-29    40
2000-04-30    42
2000-06-30    44
Freq: 2ME, dtype: int64

The `size` method was chosen on purpose to focus on the first time period, which spans from September 1 to October 31, 1999. While it is a span of two months, it's probably not intuitive. The very first row of data is on October 25, 1999, so you might expect the first time period to start on October 1, 1999 and end on November 30, 1999. The rest of the groups are also two-month time periods, but it is this crucial first group that often confuses users. For the time span to begin on the first month of actual data, you must use a month-start offset alias, which is exactly what we do below.

In [43]:
df.resample('2MS').size().head()

date
1999-10-01    26
1999-12-01    42
2000-02-01    43
2000-04-01    41
2000-06-01    42
Freq: 2MS, dtype: int64
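
The same contrast can be reproduced on synthetic data. The sketch below builds a hypothetical Series with one row for every calendar day (not only trading days) from October 25, 1999 through January 31, 2000. The two-month-end grouping is written as `pd.offsets.MonthEnd(2)`, an assumption made only to avoid depending on the spelling of the month-end alias.

```python
import pandas as pd

# One row per calendar day, beginning on the same date as the stock data.
idx = pd.date_range('1999-10-25', '2000-01-31', freq='D')
s = pd.Series(1, index=idx)

end_labeled = s.resample(pd.offsets.MonthEnd(2)).size()
start_labeled = s.resample('2MS').size()

print(end_labeled)    # first group ends October 31 and holds only 7 days
print(start_labeled)  # first group starts October 1 and runs through November
```

With the end-labeled grouping, the first group covers September and October, so it contains only the last week of October. With the start-labeled grouping, the first group covers October and November.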

With a month-end alias, the first time period (confusingly, in my opinion) always ends in the first month of data. Here, we group 5 consecutive months at a time.

In [44]:
df.resample('5M').size().head()


date
1999-10-31      5
2000-03-31    106
2000-08-31    106
2001-01-31    104
2001-06-30    104
Freq: 5ME, dtype: int64

Switching the offset alias to use month start, we get the more intuitive result. 

In [45]:
df.resample('5MS').size().head()

date
1999-10-01     88
2000-03-01    106
2000-08-01    106
2001-01-01    104
2001-06-01    103
Freq: 5MS, dtype: int64

The same rule applies when grouping by multiple years. Here we group together two years using the end-of-year offset alias. The first time period spans from January 1, 1998 to December 31, 1999.

In [51]:
df

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.80,36.94,18.27,,
1999-10-28,29.01,2.43,16.59,71.00,,21.19,38.85,19.79,,
1999-10-29,29.88,2.50,17.21,70.62,,21.47,39.25,20.00,,
...,...,...,...,...,...,...,...,...,...,...
2019-10-18,137.41,236.41,32.31,1757.51,256.95,67.61,119.14,38.47,185.85,175.71
2019-10-21,138.43,240.51,33.59,1785.66,253.50,68.74,119.74,38.23,189.76,176.43
2019-10-22,136.37,239.96,34.82,1765.73,255.58,69.09,119.58,38.17,182.34,170.86
2019-10-23,137.24,243.18,35.33,1762.17,254.68,69.75,119.35,37.74,186.15,171.32


In [49]:
df.resample('2A').size().head(3)


date
1999-12-31     48
2001-12-31    500
2003-12-31    504
Freq: 2YE-DEC, dtype: int64

Using the start-of-year offset alias, the first time period begins on January 1, 1999 and ends on December 31, 2000.

In [50]:
df.resample('2AS').size().head(3)


date
1999-01-01    300
2001-01-01    500
2003-01-01    504
Freq: 2YS-JAN, dtype: int64

## Grouping by time with the `groupby` method

Grouping by time is also possible with the `groupby` method. Instead of passing the offset alias directly to the method, you pass it to the `pd.Grouper` constructor through the `freq` parameter. This technically creates a `TimeGrouper` object, which you can think of as a dictionary containing information on how the time periods will be grouped. Here, we tell pandas to group by month end.

In [53]:
tg = pd.Grouper(freq='ME')
type(tg)

pandas.core.resample.TimeGrouper

Pass this newly created object to the `groupby` method and then finish the aggregation as usual.

In [54]:
df.groupby(tg).mean().round(1).head()

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-31,29.6,2.4,16.8,76.3,,21.2,38.2,18.4,,
1999-11-30,28.9,2.8,17.9,76.2,,22.4,40.2,19.5,,
1999-12-31,34.3,3.2,16.5,91.1,,24.0,44.9,19.5,,
2000-01-31,34.6,3.2,19.6,68.0,,24.2,44.0,16.1,,
2000-02-29,32.1,3.5,20.6,72.5,,22.6,37.2,15.2,,
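
As a quick check that the two approaches agree, the following sketch compares `resample` and `groupby` with `pd.Grouper` on a hypothetical 90-day DataFrame. The column names `x` and `y` are made up, and the month-start alias `'MS'` is used for both calls.

```python
import pandas as pd

# Ninety consecutive days: January 1 through March 30, 2020.
idx = pd.date_range('2020-01-01', periods=90, freq='D')
toy = pd.DataFrame({'x': range(90), 'y': range(90, 0, -1)}, index=idx)

via_resample = toy.resample('MS').mean()
via_groupby = toy.groupby(pd.Grouper(freq='MS')).mean()

print(via_groupby)
print(via_resample.equals(via_groupby))
```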


It's uncommon to assign the result of `pd.Grouper` to a variable name. You can pass it directly to `groupby`. All the normal functionality is available when using `groupby`.

In [56]:
(df.groupby(pd.Grouper(freq='4MS'))
   .agg(mean_msft=('MSFT', 'mean'), 
        max_slb=('SLB', 'max'), 
        obs=('SLB', 'size'))
   .head(3))

date,mean_msft,max_slb,obs
1999-10-01,32.350735,21.33,68
2000-02-01,27.890952,26.99,84
2000-06-01,23.028,28.66,85


### Choosing between `resample` and `groupby` with `pd.Grouper`

Because the `groupby` method has more methods you can chain to it (and more options within those methods) than `resample`, you may want to use it when grouping by time. For example, selecting the first two rows from every four-month period is only possible using `groupby`. The first `head` method below is applied to each group. The last `head` method works on the entire DataFrame to shorten the output.

In [57]:
df.groupby(pd.Grouper(freq='4MS')).head(2).head(6)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
2000-02-01,33.23,3.12,20.63,67.44,,24.25,40.93,16.16,,
2000-02-02,32.54,3.08,19.84,69.44,,24.44,40.71,16.2,,
2000-06-01,20.84,2.78,23.72,50.19,,24.45,39.81,16.87,,
2000-06-02,21.4,2.88,22.63,57.88,,23.34,41.25,17.25,,


Attempting to chain the `head` method to `resample` raises an error because resampler objects have no `head` method.

In [58]:
df.resample('4MS').head(2)

AttributeError: 'DatetimeIndexResampler' object has no attribute 'head'

## Calling `resample` on a datetime column

By default, the `resample` method works on DataFrames with datetimes, timedeltas, or periods in the index. It is possible to make it work on DataFrames that have these values in a column and not in the index. Let's place our current index as the first column by calling the `reset_index` method.

In [59]:
df2 = df.reset_index()
df2.head(3)

,date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
0,1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1,1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
2,1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,


Specify the column to be grouped with the `on` parameter. The result is exactly the same as it would be if the dates were in the index.

In [61]:
(df2.resample('W-WED', on='date')
    .agg({'AMZN': ['size', 'min']})
    .head(3))

,AMZN,AMZN
date,size,min
1999-10-27,3,75.94
1999-11-03,5,65.81
1999-11-10,5,63.06


To group on a datetime column with `groupby`, set the `key` parameter within `pd.Grouper` to the column to be grouped.

In [62]:
(df2.groupby(pd.Grouper(freq='QS', key='date'))
    .agg({'XOM': 'max', 'SLB': 'min'})
    .head())

date,XOM,SLB
1999-10-01,25.11,15.35
2000-01-01,24.95,17.44
2000-04-01,24.87,22.16
2000-07-01,26.57,22.47
2000-10-01,28.13,20.31
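
The following sketch, on a hypothetical ten-row DataFrame with a `date` column and a made-up `price` column, checks that the `on` parameter of `resample` and the `key` parameter of `pd.Grouper` both give the same result as first moving the column into the index.

```python
import pandas as pd

# The dates live in an ordinary column rather than in the index.
toy = pd.DataFrame({'date': pd.date_range('2021-03-01', periods=10, freq='D'),
                    'price': range(10)})

via_on = toy.resample('W', on='date')['price'].sum()
via_key = toy.groupby(pd.Grouper(freq='W', key='date'))['price'].sum()
via_index = toy.set_index('date')['price'].resample('W').sum()

print(via_on)
```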


## Calling `resample` on a Series

Above, we called `resample` on a DataFrame. We can also use it on a Series. Let's select Amazon's closing price as a Series.

In [64]:
amzn_close = df['AMZN']
amzn_close.head(3)

date
1999-10-25    82.75
1999-10-26    81.25
1999-10-27    75.94
Name: AMZN, dtype: float64

For a Series, the aggregating column is just the values. It's not necessary to use the `agg` method in order to aggregate. Instead, we can call aggregation methods directly. Here, we find the mean closing price by month.

In [65]:
amzn_close.resample('M').mean().head()


date
1999-10-31    76.312000
1999-11-30    76.240952
1999-12-31    91.070000
2000-01-31    68.049500
2000-02-29    72.463000
Freq: ME, Name: AMZN, dtype: float64

To compute multiple aggregations, use the `agg` method and pass it a list of the aggregating functions as strings. Here, we find the total number of trading days and the minimum and maximum closing price for every three-year period.

In [66]:
amzn_close.resample('3AS').agg(['size', 'min', 'max']).head(3)


date,size,min,max
1999-01-01,548,5.97,106.69
2002-01-01,756,9.13,59.91
2005-01-01,754,26.07,100.82
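
Passing a list to `agg` on a Series turns each function name into a column of the resulting DataFrame. A minimal sketch on a hypothetical Series of consecutive integers, grouped by quarter start, makes the values easy to verify by hand.

```python
import pandas as pd

# One value per calendar day for the first half of 2020 (182 days).
idx = pd.date_range('2020-01-01', '2020-06-30', freq='D')
s = pd.Series(range(len(idx)), index=idx)

out = s.resample('QS').agg(['size', 'min', 'max'])
print(out)
```

Select the `size` column with `out['size']`, since `out.size` refers to the DataFrame attribute that counts all elements.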


The `groupby` method is also available for Series.

In [67]:
amzn_close.groupby(pd.Grouper(freq='3AS')).agg(['size', 'min', 'max']).head(3)


date,size,min,max
1999-01-01,548,5.97,106.69
2002-01-01,756,9.13,59.91
2005-01-01,754,26.07,100.82


## Exercises

Execute the following cell that reads in 20 years of Microsoft stock data and use it for the first few exercises.

In [68]:
msft = pd.read_csv('../data/stocks/msft20.csv', parse_dates=['date'], index_col='date')
msft.head(3)

date,open,high,low,close,adjusted_close,volume,dividend_amount
1999-10-19,88.25,89.25,85.25,86.313,27.8594,69945600,0.0
1999-10-20,91.563,92.375,90.25,92.25,29.7758,88090600,0.0
1999-10-21,90.563,93.125,90.5,93.063,30.0381,60801200,0.0


### Exercise 1

<span style="color:green; font-size:16px">In which week did MSFT have the greatest number of its shares (volume) traded?</span>

In [83]:
week_max_vol = msft.resample('W').agg(max_vol=('volume','max')).idxmax()

week_max_vol


max_vol   2006-04-30
dtype: datetime64[ns]

### Exercise 2

<span style="color:green; font-size:16px">With help from the `diff` method, find the quarter containing the greatest number of "up" days. An up day is when the adjusted close of the current day is greater than that of the previous day.</span>

In [None]:
msft

date,open,high,low,close,adjusted_close,volume,dividend_amount
1999-10-19,88.250,89.250,85.2500,86.313,27.8594,69945600,0.0
1999-10-20,91.563,92.375,90.2500,92.250,29.7758,88090600,0.0
1999-10-21,90.563,93.125,90.5000,93.063,30.0381,60801200,0.0
1999-10-22,93.563,93.875,91.7500,92.688,29.9171,43650600,0.0
1999-10-25,92.000,93.563,91.1250,92.438,29.8364,30492200,0.0
...,...,...,...,...,...,...,...
2019-10-15,140.060,141.790,139.8100,141.570,141.5700,19695700,0.0
2019-10-16,140.790,140.990,139.5300,140.410,140.4100,20751600,0.0
2019-10-17,140.950,141.420,139.0200,139.690,139.6900,21460600,0.0
2019-10-18,139.760,140.000,136.5638,137.410,137.4100,27654449,0.0


In [95]:
best_quarter = (
    msft['adjusted_close']
    .diff()                # daily change in adjusted close
    .gt(0)                 # True on "up" days
    .resample('QE')        # group by quarter end ('QE' in pandas 2.2+)
    .sum()                 # count the True values (up days) per quarter
    .idxmax()              # label of the quarter with the most up days
)

best_quarter

Timestamp('2001-12-31 00:00:00')

### Exercise 3

<span style="color:green; font-size:16px">Find the mean price per year along with the minimum and maximum volume.</span>

In [109]:
(
    msft
    .resample('YE')  # group by year end ('YE' in pandas 2.2+)
    .agg(
        avg_price=('close', 'mean'),
        min_vol=('volume', 'min'),
        max_vol=('volume', 'max')
    )  # named aggregation produces flat column names
    .assign(
        vol_range=lambda df_: df_.max_vol - df_.min_vol
    )
    # Scale every column whose name contains 'vol' to millions of shares
    .pipe(lambda df_: df_.assign(**(df_.filter(like='vol') / 1e6)))
)

date,avg_price,min_vol,max_vol,vol_range
1999-12-31,96.872519,12.5176,243.8192,231.3016
2000-12-31,76.220722,15.7348,313.6458,297.911
2001-12-31,62.54229,11.7016,209.3488,197.6472
2002-12-31,54.546635,18.386,202.3078,183.9218
2003-12-31,29.238298,12.0769,210.5583,198.4814
2004-12-31,27.124718,24.3987,258.269,233.8703
2005-12-31,25.871306,27.2125,187.3843,160.1718
2006-12-31,26.290355,20.4567,591.0522,570.5955
2007-12-31,30.446745,29.6226,288.1212,258.4986
2008-12-31,26.64751,16.8804,291.1389,274.2585


### Exercise 4

<span style="color:green; font-size:16px">Find the mean of each column for every 6 month time period. The first time period should start on the month in the first row.</span>

In [None]:
# 'MS' labels each group by its start, so the first group begins in October 1999
msft.resample('6MS').mean()

### Exercise 5

<span style="color:green; font-size:16px">Repeat exercise 4 using a time span of 3 years where the year begins July 1.</span>

### Exercise 6

<span style="color:green; font-size:16px">Repeat exercise five using the `groupby` method instead of `resample`.</span>

### Use the temperature dataset for the remaining exercises

Execute the following cell to read in the temperature dataset which sets the datetime column in the index.

In [None]:
temp = pd.read_csv('../data/weather/temperature.csv', 
                   parse_dates=['datetime'], index_col='datetime')
temp.head()

### Exercise 7

<span style="color:green; font-size:16px">Find the mean temperature of every city for every 8 hour time period.</span>

### Exercise 8

<span style="color:green; font-size:16px">Verify that there are 24 rows for each day.</span>

### Exercise 9

<span style="color:green; font-size:16px">For each month, return the maximum temperature amongst all cities.</span>

### Exercise 10

<span style="color:green; font-size:16px">For each month, return the maximum temperature amongst all cities along with the city name where the maximum occurred. Return a two-column DataFrame, where the first column is the maximum temperature, and the second is the city. The index should be the month.</span>