# Grouping by Time and another Column

In this chapter, we take a look at a special scenario where we group together periods of time alongside another column. We'll use the employee dataset, which does not contain typical time series data, but does allow us to group the hire date along with other columns. Let's begin by reading it in, putting the `'hire_date'` column in the index and sorting it.

In [2]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'], 
                  index_col='hire_date').sort_index()
emp.head()

hire_date,dept,title,salary,sex,race
1968-12-13,Police,SENIOR POLICE OFFICER,,Male,Black
1969-03-21,Police,POLICE SERGEANT,,Male,Hispanic
1969-10-06,Other,SENIOR PUBLIC LOSS INVESTIGATOR,75067.0,Female,White
1970-02-02,Police,SENIOR POLICE OFFICER,,Male,White
1970-04-06,Fire,INSPECTOR FIRE,70181.28,Male,Hispanic


As a review, let's find the average salary and number of employees for each ten-year period using the `groupby` method.

In [3]:
emp.groupby(pd.Grouper(freq='10YS')).agg({'salary': ['mean', 'size']}).round()

,salary,salary
hire_date,mean,size
1968-01-01,77051.0,151
1978-01-01,73033.0,1488
1988-01-01,68804.0,4203
1998-01-01,62109.0,6338
2008-01-01,53796.0,10028
2018-01-01,40907.0,2100
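
To make the bin edges concrete, here is a minimal sketch on a hypothetical five-row DataFrame. The `toy` name and its values are made up for illustration and are not part of the employee dataset. It assumes a pandas version that accepts the `'YS'` alias, as the cells in this chapter do.

```python
import pandas as pd

# Hypothetical toy data: five hires with a DatetimeIndex
toy = pd.DataFrame(
    {'salary': [50_000, 60_000, 70_000, 80_000, 90_000]},
    index=pd.to_datetime(['2000-03-01', '2004-07-15', '2009-12-31',
                          '2010-01-01', '2015-06-30']),
).rename_axis('hire_date')

# '10YS' makes bins ten years wide, each labeled by the January 1st it starts on.
# The first bin starts at the beginning of the earliest year in the index.
decade = toy.groupby(pd.Grouper(freq='10YS')).agg({'salary': ['mean', 'size']})
print(decade)
```

The hire on 2009-12-31 should land in the bin labeled 2000-01-01 and the hire on 2010-01-01 in the next one, since each bin includes its left edge and excludes its right edge.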


## Grouping by an amount of time and another column

There are two different ways to group by time and another column. The difference is subtle, but it can change the result. The datetime column and the other column can either be grouped **together** or grouped **independently**. Let's say we wanted to find the average salary over ten-year time periods for each sex.

### Group together

To group sex and a ten-year time span together, pass a list containing both the column name and the `Grouper` object to the `groupby` method.

In [4]:
tg = pd.Grouper(freq='10YS')
groups = ['sex', tg]
emp.groupby(groups).agg({'salary':['mean', 'size']}).round()

,,salary,salary
sex,hire_date,mean,size
Female,1968-01-01,67269.0,7
Female,1978-01-01,64163.0,266
Female,1988-01-01,61457.0,1215
Female,1998-01-01,57365.0,1832
Female,2008-01-01,52962.0,3281
Female,2018-01-01,43032.0,757
Male,1968-01-01,80312.0,144
Male,1978-01-01,75817.0,1222
Male,1988-01-01,71881.0,2988
Male,1998-01-01,64041.0,4506


### Datetimes are the same

Notice how the datetimes for the female and male groups are the same. This will not be the case below.

### Group independently

To group independently, we first group the non-datetime column with the `groupby` method. The resulting GroupBy object has a `resample` method, which then groups by an amount of time **within** the groups you just created. You use it just as if it were called from a DataFrame. Notice how the hire dates for males and females are different.

In [7]:
emp.groupby('sex').resample('10YS').agg({'salary':['mean', 'size']}).round()

,,salary,salary
sex,hire_date,mean,size
Female,1969-01-01,72408.0,13
Female,1979-01-01,62609.0,313
Female,1989-01-01,60755.0,1298
Female,1999-01-01,57087.0,2074
Female,2009-01-01,50754.0,3660
Male,1968-01-01,80312.0,144
Male,1978-01-01,75817.0,1222
Male,1988-01-01,71881.0,2988
Male,1998-01-01,64041.0,4506
Male,2008-01-01,54201.0,6747


### Different results

It's important to see that you will get different results depending on whether you group together or group independently. The results differ because the earliest male and female employees don't have a hire date in the same year. The earliest hire date for female employees was 1969, while it was 1968 for males. If the first male and female employees were both hired in 1968 (or 1969), then the returned datetime index would have been the same.
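
The effect can be reproduced on a tiny example. The sketch below uses a hypothetical four-row DataFrame (the `toy` data is invented for illustration) in which the first male hire is in 1968 and the first female hire is in 1969, mirroring the situation described above.

```python
import pandas as pd

# Hypothetical data: first Male hire in 1968, first Female hire in 1969
toy = pd.DataFrame(
    {'sex': ['Male', 'Female', 'Male', 'Female'],
     'salary': [80_000, 70_000, 60_000, 50_000]},
    index=pd.to_datetime(['1968-05-01', '1969-02-01',
                          '1980-03-01', '1981-04-01']),
).rename_axis('hire_date')

# Grouped together: one set of bins, anchored at the earliest hire overall
together = toy.groupby(['sex', pd.Grouper(freq='10YS')])['salary'].mean()

# Grouped independently: bins are anchored separately within each sex
independent = toy.groupby('sex')['salary'].resample('10YS').mean()

print(together.loc['Female'].index.year.tolist())
print(independent.loc['Female'].index.year.tolist())
```

When grouped together, the female bins should start in 1968 because the bins come from the whole index. When grouped independently, they should start in 1969, the year of the first female hire.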

## Using a pivot table with `Grouper` for easier comparisons

You can pass a `Grouper` object to a pivot table to get a nice final product. This groups sex together with time.

In [8]:
emp.pivot_table(index=tg, columns='sex', values='salary', aggfunc=['mean', 'size']).round()

,mean,mean,size,size
hire_date,Female,Male,Female,Male
1968-01-01,67269.0,80312.0,7,144
1978-01-01,64163.0,75817.0,266,1222
1988-01-01,61457.0,71881.0,1215,2988
1998-01-01,57365.0,64041.0,1832,4506
2008-01-01,52962.0,54201.0,3281,6747
2018-01-01,43032.0,39710.0,757,1343
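
As a rough check that the pivot table groups sex together with time, the sketch below compares it with a `groupby` on the same `Grouper` followed by `unstack`. The `toy` DataFrame is hypothetical and exists only for this illustration.

```python
import pandas as pd

# Hypothetical data, not the employee dataset
toy = pd.DataFrame(
    {'sex': ['Male', 'Female', 'Male', 'Female'],
     'salary': [80_000, 70_000, 60_000, 50_000]},
    index=pd.to_datetime(['1968-05-01', '1969-02-01',
                          '1980-03-01', '1981-04-01']),
).rename_axis('hire_date')

tg = pd.Grouper(freq='10YS')

# Wide result straight from pivot_table
wide = toy.pivot_table(index=tg, columns='sex', values='salary', aggfunc='mean')

# Same grouping done with groupby, then pivoted by hand with unstack
long = toy.groupby(['sex', tg])['salary'].mean()
reshaped = long.unstack('sex')

print(wide)
print(reshaped)
```

Both results should share one datetime index whose first bin starts in 1968, which is the "grouped together" behavior rather than the per-group bins produced by `resample`.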


## Rolling windows within a group

Rolling window calculations within a group are also possible. In order to show this example, we'll need time series data where each date appears only once per group. With the employee dataset, multiple people may be hired on the same date. We begin by grouping by department and week, returning the size of each group. This represents the number of employees hired in each week for that department.

In [9]:
emp.groupby(['dept', pd.Grouper(freq='W')]).size().head()

dept  hire_date 
Fire  1970-04-12    1
      1972-01-02    1
      1974-06-30    1
      1977-08-07    1
      1977-12-04    1
dtype: int64

We'll move the department out of the index and give a name to the values.

In [10]:
df = emp.groupby(['dept', pd.Grouper(freq='W')]).size()
df = df.reset_index('dept', name='size')
df.head()

hire_date,dept,size
1970-04-12,Fire,1
1972-01-02,Fire,1
1974-06-30,Fire,1
1977-08-07,Fire,1
1977-12-04,Fire,1


The tail of the DataFrame has the most recent weeks of hire dates for the last department.

In [12]:
df.tail()

hire_date,dept,size
2018-10-28,Solid Waste Management,2
2018-11-11,Solid Waste Management,5
2018-12-09,Solid Waste Management,6
2018-12-23,Solid Waste Management,3
2019-01-06,Solid Waste Management,1


Using this proper time series data, let's find the total number hired in each department over a rolling 6-week period.

In [11]:
df.groupby('dept').rolling('42D')['size'].sum().tail(10)

dept                    hire_date 
Solid Waste Management  2018-08-19    16.0
                        2018-09-02    15.0
                        2018-09-16    11.0
                        2018-09-30     7.0
                        2018-10-14    10.0
                        2018-10-28     9.0
                        2018-11-11    12.0
                        2018-12-09    11.0
                        2018-12-23     9.0
                        2019-01-06    10.0
Name: size, dtype: float64
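
To see how the window is applied within each group, here is a small sketch with hypothetical weekly counts for two departments. The numbers are invented, and a 14-day window is used instead of 42 days to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical weekly hire counts; the index is sorted by date
toy = pd.DataFrame(
    {'dept': ['Fire', 'Police', 'Fire', 'Police', 'Fire'],
     'size': [1, 10, 2, 20, 4]},
    index=pd.to_datetime(['2018-01-07', '2018-01-07', '2018-01-14',
                          '2018-01-14', '2018-03-04']),
).rename_axis('hire_date')

# Each window looks back 14 days from the current row, within one department only
rolled = toy.groupby('dept').rolling('14D')['size'].sum()
print(rolled)
```

The Fire row for 2018-01-14 should sum the two January weeks for Fire only, while the 2018-03-04 row has no other Fire rows inside its window and should equal its own count. Police values never mix with Fire values.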

## Exercises

Execute the following cell to read in the energy consumption dataset.

In [13]:
energy = pd.read_csv('../data/energy_consumption.csv', parse_dates=['date'], 
                     index_col='date')
energy.head()

date,source,energy (btu)
1973-01-01,residential,1932.187
1973-02-01,residential,1687.255
1973-03-01,residential,1497.067
1973-04-01,residential,1177.661
1973-05-01,residential,1015.008


In [14]:
energy['source'].value_counts()

source
residential       548
commercial        548
industrial        548
transportation    548
Name: count, dtype: int64

### Exercise 1

<span style="color:green; font-size:16px">Find the average energy consumption per sector per 10 year time span beginning from the first year of data. Return the results using both `groupby` and `pivot_table`.</span>

In [26]:
tg = pd.Grouper(freq='10YS')

energy.groupby(['source',tg]).agg({'energy (btu)':['sum','size']}).astype(int)

,,energy (btu),energy (btu)
source,date,sum,size
commercial,1973-01-01,101931,120
commercial,1983-01-01,123438,120
commercial,1993-01-01,157480,120
commercial,2003-01-01,178566,120
commercial,2013-01-01,102779,68
industrial,1973-01-01,314461,120
industrial,1983-01-01,301236,120
industrial,1993-01-01,339833,120
industrial,2003-01-01,315776,120
industrial,2013-01-01,180058,68


In [28]:
energy.pivot_table(index=tg,columns='source', values='energy (btu)', aggfunc='sum')

date,commercial,industrial,residential,transportation
1973-01-01,101931.953,314461.638,153900.393,193286.695
1983-01-01,123438.643,301236.855,166286.507,212926.099
1993-01-01,157480.212,339833.248,193051.528,249648.823
2003-01-01,178566.533,315776.448,211705.248,274455.122
2013-01-01,102779.919,180058.795,117522.188,156173.328


### Use the bikes dataset for the remaining exercises

Execute the following cell to read in the bikes dataset. Note that it does NOT set the index to be a datetime.

In [30]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy
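
Because the datetimes live in a regular column here, a `Grouper` needs to be told which column to use through its `key` parameter. The sketch below shows the idea on a hypothetical `rides` DataFrame with a made-up `station` column. It is not the bikes dataset itself.

```python
import pandas as pd

# Hypothetical rides; 'starttime' is a regular column, not the index
rides = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-06-28 19:01', '2013-06-30 14:43',
                                 '2013-07-02 08:15', '2013-10-05 17:30']),
    'station': ['A', 'B', 'A', 'A'],
})

# key= points the Grouper at the datetime column; 'QS' bins by quarter start
quarterly = (
    rides
    .groupby(['station', pd.Grouper(key='starttime', freq='QS')])
    .size()
)
print(quarterly)
```

The same `Grouper(key=..., freq=...)` object can also be passed as the `index` of `pivot_table`, so no `set_index` call is needed first.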


### Exercise 2

<span style="color:green; font-size:16px">Filter the data so that it only contains rows from the five most frequent `from_station_name` values. Then find the mean temperature at every station for every quarter. Present the result as a pivot table.</span>

In [38]:
top5 = bikes['from_station_name'].value_counts().index[:5]

In [51]:
(
    bikes
    .loc[lambda df_: df_['from_station_name'].isin(top5),
         ['starttime', 'from_station_name', 'temperature']]
    .pivot_table(index=pd.Grouper(freq='QS', key='starttime'),
                 columns='from_station_name',
                 values='temperature', aggfunc='mean')
)

starttime,Canal St & Adams St,Canal St & Madison St,Clinton St & Madison St,Clinton St & Washington Blvd,Columbus Dr & Randolph St
2013-04-01,,,,69.1,
2013-07-01,61.866667,75.692308,74.792857,70.864516,73.644444
2013-10-01,38.666667,45.13125,41.2,44.37037,43.575
2014-01-01,35.8,27.781818,26.725,29.138462,26.0
2014-04-01,61.409091,64.22381,66.145455,62.0225,65.596429
2014-07-01,68.186207,70.714706,70.853846,70.284615,75.127273
2014-10-01,40.793548,42.725,45.77,43.11087,44.847619
2015-01-01,25.49375,24.393333,32.65,32.710345,24.685714
2015-04-01,60.884091,63.929032,63.870588,61.269492,66.561538
2015-07-01,72.725676,72.47037,70.629167,70.8625,75.126829


### Exercise 3

<span style="color:green; font-size:16px">Find the number of rides per day from each `from_station_name`.</span>

In [64]:
# Note: freq='MS' bins rides by calendar month; freq='D' would give one bin per day
exercise_3_result = bikes.groupby(['from_station_name', pd.Grouper(freq='MS', key='starttime')]).agg(ride_count=('starttime', 'size'))

### Exercise 4

<span style="color:green; font-size:16px">Reset the `from_station_name` index level from the solution in exercise 3 and then perform a 100 day rolling window of each `from_station_name` calculating the number of rides in this group.</span>

In [65]:
result = (
    exercise_3_result
    .reset_index('from_station_name')  # Move station name to a column
    .groupby('from_station_name')      # Compute the rolling window per station
    .rolling('100D')                   # 100-day time-based window
    ['ride_count']                     # Target the count column
    .sum()                             # Aggregate the window
)

In [66]:
result

from_station_name             starttime 
2112 W Peterson Ave           2016-08-01    1.0
                              2016-09-01    4.0
                              2016-10-01    5.0
                              2017-04-01    1.0
                              2017-11-01    1.0
                                           ... 
Woodlawn Ave & Lake Park Ave  2017-02-01    1.0
                              2017-03-01    2.0
                              2017-04-01    3.0
                              2017-06-01    3.0
Yates Blvd & 75th St          2015-09-01    1.0
Name: ride_count, Length: 14052, dtype: float64