## Worksheet 4 - Grouping

Run the cell below to import the necessary packages for this worksheet.

In [1]:
import pandas as pd
import numpy as np

### Q1
rubric={autograde:1}

We return to the Narrabeen beach survey dataset that we encountered in Lab 1. Read in the data from `data/beach_data.csv` and save it to a data frame named `beach_df`, making sure to set the dates as index in a pandas datetime format.

In [2]:
beach_df = pd.read_csv("data/beach_data.csv", index_col=0, parse_dates=True) # SOLUTION
beach_df

Unnamed: 0,location_1,location_2,location_3,location_4,location_5
1980-01-15,99.4,75.0,92.5,27.5,63.4
1980-02-22,98.9,77.3,92.6,24.5,63.2
1980-03-14,95.2,76.8,93.8,29.3,71.3
1980-04-11,99.9,67.7,102.5,27.8,72.4
1980-05-23,80.7,63.9,82.5,12.5,73.3
...,...,...,...,...,...
2019-10-01,127.6,80.8,78.9,12.6,39.4
2019-10-15,121.9,76.6,82.3,12.6,37.5
2019-11-01,127.4,86.6,88.9,13.1,35.6
2019-11-14,126.7,87.5,82.9,13.0,38.9


In [3]:
assert hasattr(beach_df.index, 'month'), "Did you remember to parse the dates?"
assert np.isclose(beach_df.loc['1993-01-11', 'location_3'], 76.0, atol = 0.1)

## Q2

rubric={autograde:1}

The data is quite irregularly spaced, with a frequency ranging between a few weeks to a few months. Your first tasks are to

* Resample `beach_df` from the previous step to monthly intervals with the mean as the aggregation function.

* After resampling, subtract the mean value of each resulting "location" column from the same column. This will help to see if a location on the beach is narrower (negative numbers) or wider (positive numbers) at a certain time compared to the average.

* Finally, reshape the data set so it has two columnns `location` and `width`. The `location` column should indicate which of the five locations each measurement was taken at.

The final data frame should look something like the following:


|   &nbsp;   |    location   |   width   |
| ---------  | ------        | ------    | 
|**datetime**|     &nbsp;    |  &nbsp;   |                   
|1980-01-01  |   location_1  |   4.796227|
|1980-02-01  |  location_1   |  4.296227 |
|1980-03-01  | location_1    | 0.596227  |
|1980-04-01  |location_1     |5.296227   |
|1980-05-01  |   location_1  | -13.903773|
|...         |      ...      |    ...    |
|2019-07-01  | location_5    |-8.395284  |
|2019-08-01  |location_5     |-7.578618  |
|2019-09-01  |   location_5  |-15.778618 |
|2019-10-01  |  location_5   |-16.478618 |
|2019-11-01  | location_5    |-16.895284 |




In [4]:
# BEGIN SOLUTION
beach_df = beach_df.resample("MS").mean()
beach_df -= beach_df.mean(axis=0)
beach_df.index.name = "datetime"
#beach_df = beach_df.melt(var_name="location", value_name="width", ignore_index=False)
beach_df
# END SOLUTION

Unnamed: 0_level_0,location_1,location_2,location_3,location_4,location_5
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1980-01-01,4.796227,3.947666,17.102274,-4.109321,8.471382
1980-02-01,4.296227,6.247666,17.202274,-7.109321,8.271382
1980-03-01,0.596227,5.747666,18.402274,-2.309321,16.371382
1980-04-01,5.296227,-3.352334,27.102274,-3.809321,17.471382
1980-05-01,-13.903773,-7.152334,7.102274,-19.109321,18.371382
...,...,...,...,...,...
2019-07-01,27.529560,20.380999,8.102274,-1.542654,-8.395284
2019-08-01,28.996227,11.197666,8.402274,-10.309321,-7.578618
2019-09-01,28.546227,9.547666,4.102274,-17.409321,-15.778618
2019-10-01,30.146227,7.647666,5.202274,-19.009321,-16.478618


In [5]:
assert beach_df.shape == (2395, 2), "Did you melt the data frame?"
assert beach_df.sort_index().index[105] - beach_df.sort_index().index[100] == pd.Timedelta('30 days'), "Did you resample to a monthly frequency?"


AssertionError: Did you melt the data frame?

In [None]:
beach_df

## Q3

rubric={autograde: 1}

* Extract the month from the `DatetimeIndex`, and use it to create a new `month` column in the original `beach_df` dataframe.
* Group the data by month and aggregate it based on the mean. Use this to find the three months where the beach was narrowest on average. Store your answer in a dataframe called `beach_top_3` with 3 rows, 'month' as index and 1 column for 'width'.

In [None]:
# BEGIN SOLUTION
beach_df['month'] = beach_df.index.month
beach_top_3 = beach_df.groupby('month').mean(numeric_only=True).sort_values(by='width').iloc[:3]
beach_top_3
# END SOLUTION

In [None]:
beach_top_3

In [None]:
assert beach_top_3.shape == (3, 1)
assert 'month' in beach_top_3.index.names, "Error: DataFrame does not have index 'month'"
assert 'width' in beach_top_3.columns, "Error: DataFrame does not have a column 'width'"
assert np.isclose(beach_top_3.query('month == 7')['width'].values[0], -1.678812, atol=0.1)
assert np.isclose(beach_top_3.query('month == 8')['width'].values[0], -1.766926, atol=0.1)
assert np.isclose(beach_top_3.query('month == 9')['width'].values[0], -1.935821, atol=0.1)

## Q4
rubric={autograde:1}

Perform a double `groupby()` to determine the combination of **month** *and* **location** for which the beach is the widest. This time, aggregate the data based on the median value per group.

Store your output in a dataframe called `beach_df_widest` with a single row. It should have the `month` and `location` as the index, and 1 column called `width`.

In [None]:
beach_df

In [None]:
beach_df_widest = beach_df.groupby(['month', 'location']).median().query("width == width.max()")
beach_df_widest
# SOLUTION

In [None]:
assert len(beach_df_widest) == 1
assert 'month' in beach_df_widest.index.names
assert 'location' in beach_df_widest.index.names
assert 'width' in beach_df_widest.columns
assert beach_df_widest.index[0] == (5, 'location_5')
assert np.isclose(beach_df_widest['width'].values[0], 10.471382, atol=0.1)

## Q5
rubric={autograde: 1}

Run the code cell below to create two dataframes, `dates` and `rooms`.

In [None]:
# RUN THIS CELL

dates = pd.DataFrame(
                    { 'name': ['Kate', 'Kaiyun', 'Prajeet', 'Tiffany', 'Mohit', 'Eric'],
                      'day': ['Monday', 'Tuesday', 'Tuesday', 'Wednesday', 'Thursday', 'Wednesday'],
                      'time': ['5pm', '4:30pm', '1pm', '1pm', '4pm', '4pm']
                    }
                    )

rooms = pd.DataFrame(
                    {
                      'day': ['Wednesday', 'Wednesday', 'Thursday', 'Tuesday', 'Tuesday', 'Monday'],
                      'time': ['4pm', '1pm', '4pm', '4:30pm', '1pm', '5pm'],
                      'room': ['MCML 160', 'ESB 3174', 'ICCS X153', 'ICCS X153', 'ESB 1046', 'ICCS X153']
                    }
                    )

In [None]:
rooms

In [None]:
dates

The data frame `dates` contains the dates and times for some instructor and TA office hours for DSCI 511 this term. In `rooms` you will find room booking information for these dates and times. Your task is to:

* Combine the two data frames meaningfully to make a 'time table' of office hours. You will have to decide whether `concat()` or `merge()` is more suitable for this.
* Save this time table in a data frame titled `oh_info` with index `name`
* Make sure `oh_info` has the three columns `day`, `time` and `room`.

In [None]:
oh_info = None
# BEGIN SOLUTION
oh_info = pd.merge(dates, rooms, on=['day', 'time']).set_index('name')
oh_info
# END SOLUTION

In [None]:
oh_info

In [None]:
assert oh_info.shape == (6,3)
assert oh_info.loc['Prajeet','room'] == 'ESB 1046'
assert oh_info.loc['Mohit','day'] == 'Thursday'