# Notes: GroupBy walk-through


- Load a weather CSV into a DataFrame and explore groupby patterns.
- Use built-in aggregations (`max`, `mean`) and custom key functions for bucketing.
- Inspect groups via iteration or `get_group` to verify contents before further analysis.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('weather_by_cities.csv')
df

,day,city,temperature,windspeed,event
0,1/1/2017,new york,32,6,Rain
1,1/2/2017,new york,36,7,Sunny
2,1/3/2017,new york,28,12,Snow
3,1/4/2017,new york,33,7,Sunny
4,1/1/2017,mumbai,90,5,Sunny
5,1/2/2017,mumbai,85,12,Fog
6,1/3/2017,mumbai,87,15,Fog
7,1/4/2017,mumbai,92,5,Rain
8,1/1/2017,paris,45,20,Sunny
9,1/2/2017,paris,50,13,Cloudy
10,1/3/2017,paris,54,8,Cloudy
11,1/4/2017,paris,42,10,Cloudy


### Loading data


- `pd.read_csv('weather_by_cities.csv')` builds `df` with one row per city and day and the columns `day`, `city`, `temperature`, `windspeed`, and `event`.
- Display the raw frame first to spot obvious issues (missing values, unexpected strings) before grouping.
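
The pre-grouping checks mentioned above could look like the following. This is a minimal sketch: the frame is rebuilt inline from the rows shown in these notes (an assumption, so the snippet runs without the CSV file), and the particular checks are illustrative rather than exhaustive.

```python
import pandas as pd

# Assumption: rows copied from the output shown in these notes, standing in for
# pd.read_csv('weather_by_cities.csv') so the sketch runs without the file.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'] * 3,
    'city': ['new york'] * 4 + ['mumbai'] * 4 + ['paris'] * 4,
    'temperature': [32, 36, 28, 33, 90, 85, 87, 92, 45, 50, 54, 42],
    'windspeed': [6, 7, 12, 7, 5, 12, 15, 5, 20, 13, 8, 10],
    'event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Sunny', 'Fog', 'Fog', 'Rain',
              'Sunny', 'Cloudy', 'Cloudy', 'Cloudy'],
})

# Missing values per column; all zeros means there is nothing to clean up.
missing = df.isna().sum()
print(missing)

# Column dtypes; temperature and windspeed should be numeric, not object.
print(df.dtypes)

# Distinct values of the label columns reveal typos or stray casing.
print(df['city'].unique())
print(sorted(df['event'].unique()))
```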

In [3]:
g = df.groupby('city')
g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028BFCA81E80>

### Basic group object


- `df.groupby('city')` returns a `DataFrameGroupBy`; iterating over it yields `(city, subframe)` pairs.
- Use quick iteration to sanity-check rows per city before aggregating.
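
When printing every subframe is too noisy, the same sanity check can be done with group counts. A small sketch, again assuming an inline copy of the data from these notes in place of the CSV file:

```python
import pandas as pd

# Assumption: inline copy of the rows shown in these notes.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'] * 3,
    'city': ['new york'] * 4 + ['mumbai'] * 4 + ['paris'] * 4,
    'temperature': [32, 36, 28, 33, 90, 85, 87, 92, 45, 50, 54, 42],
    'windspeed': [6, 7, 12, 7, 5, 12, 15, 5, 20, 13, 8, 10],
    'event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Sunny', 'Fog', 'Fog', 'Rain',
              'Sunny', 'Cloudy', 'Cloudy', 'Cloudy'],
})

g = df.groupby('city')

# Number of distinct groups (cities).
print(g.ngroups)

# Rows per city, returned as a Series indexed by city.
sizes = g.size()
print(sizes)

# g.groups maps each city to the index labels of its rows.
for city, labels in g.groups.items():
    print(city, list(labels))
```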

In [5]:
for city, their_data in g:
    print("City:", city)
    print(their_data)

City: mumbai
        day    city  temperature  windspeed  event
4  1/1/2017  mumbai           90          5  Sunny
5  1/2/2017  mumbai           85         12    Fog
6  1/3/2017  mumbai           87         15    Fog
7  1/4/2017  mumbai           92          5   Rain
City: new york
        day      city  temperature  windspeed  event
0  1/1/2017  new york           32          6   Rain
1  1/2/2017  new york           36          7  Sunny
2  1/3/2017  new york           28         12   Snow
3  1/4/2017  new york           33          7  Sunny
City: paris
         day   city  temperature  windspeed   event
8   1/1/2017  paris           45         20   Sunny
9   1/2/2017  paris           50         13  Cloudy
10  1/3/2017  paris           54          8  Cloudy
11  1/4/2017  paris           42         10  Cloudy


In [6]:
g.get_group('mumbai')

,day,city,temperature,windspeed,event
4,1/1/2017,mumbai,90,5,Sunny
5,1/2/2017,mumbai,85,12,Fog
6,1/3/2017,mumbai,87,15,Fog
7,1/4/2017,mumbai,92,5,Rain


### Inspecting a single group


- `get_group('mumbai')` fetches one subset for deeper inspection or debugging.
- Helpful to validate expected rows before aggregations or joins.
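
One way to validate a group is to compare it with a plain boolean filter and to guard against keys that do not exist. The sketch below assumes an inline copy of the data from these notes; the city `'tokyo'` is a hypothetical missing key used only for illustration.

```python
import pandas as pd

# Assumption: inline copy of the rows shown in these notes.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'] * 3,
    'city': ['new york'] * 4 + ['mumbai'] * 4 + ['paris'] * 4,
    'temperature': [32, 36, 28, 33, 90, 85, 87, 92, 45, 50, 54, 42],
    'windspeed': [6, 7, 12, 7, 5, 12, 15, 5, 20, 13, 8, 10],
    'event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Sunny', 'Fog', 'Fog', 'Rain',
              'Sunny', 'Cloudy', 'Cloudy', 'Cloudy'],
})

g = df.groupby('city')
mumbai = g.get_group('mumbai')

# A boolean filter should select exactly the same rows.
filtered = df[df['city'] == 'mumbai']
print(mumbai.index.equals(filtered.index))

# get_group raises KeyError for an unknown key, so check membership first.
wanted = 'tokyo'  # hypothetical key that is not in the data
if wanted in g.groups:
    print(g.get_group(wanted))
else:
    print("No group named", wanted)
```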

In [10]:
g.max()


city,day,temperature,windspeed,event
mumbai,1/4/2017,92,15,Sunny
new york,1/4/2017,36,12,Sunny
paris,1/4/2017,54,20,Sunny


In [12]:
g[['temperature', 'windspeed']].mean()

city,temperature,windspeed
mumbai,88.5,9.25
new york,32.25,8.0
paris,47.75,12.75


### Aggregations


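- `g.max()` computes the per-city maximum of every column. For numeric columns that is the true extreme; for string columns such as `day` and `event` it is the lexicographically largest string (`'1/4/2017'`, `'Sunny'`), not the latest date or the most severe event.
- Selecting columns before `mean` (e.g., `g[['temperature', 'windspeed']]`) restricts the aggregation to numeric data. On recent pandas versions (2.0 and later), calling `mean` on a group object that still contains string columns raises a `TypeError` unless `numeric_only=True` is passed.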
- Combine aggregations with `agg({'temperature': ['mean','max'], 'windspeed': 'mean'})` when you need multiple stats at once.
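
The `agg` call from the last bullet can be sketched as follows, assuming an inline copy of the data from these notes. The second form uses named aggregation, which yields flat column names; the names `mean_temp`, `max_temp`, and `mean_wind` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Assumption: inline copy of the rows shown in these notes.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'] * 3,
    'city': ['new york'] * 4 + ['mumbai'] * 4 + ['paris'] * 4,
    'temperature': [32, 36, 28, 33, 90, 85, 87, 92, 45, 50, 54, 42],
    'windspeed': [6, 7, 12, 7, 5, 12, 15, 5, 20, 13, 8, 10],
    'event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Sunny', 'Fog', 'Fog', 'Rain',
              'Sunny', 'Cloudy', 'Cloudy', 'Cloudy'],
})

# Dict form: the result has two-level columns such as ('temperature', 'mean').
summary = df.groupby('city').agg(
    {'temperature': ['mean', 'max'], 'windspeed': 'mean'}
)
print(summary)

# Named aggregation: each keyword becomes a flat output column.
flat = df.groupby('city').agg(
    mean_temp=('temperature', 'mean'),
    max_temp=('temperature', 'max'),
    mean_wind=('windspeed', 'mean'),
)
print(flat)
```

The dict form is compact, while the named form tends to be easier to work with downstream because the columns are not a MultiIndex.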

In [13]:
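def grouper(df, idx, col):
    """Return a bucket label for the value of `col` in the row labelled `idx`."""
    value = df.loc[idx, col]  # look the value up once
    if 80 <= value <= 90:  # both bounds are inclusive
        return '80-90'
    elif 50 <= value <= 60:
        return '50-60'
    else:
        return 'others'  # catch-all so every row gets a label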

In [18]:
g = df.groupby(lambda x: grouper(df, x, 'temperature'))
for temp, their_data in g:
    print("Temp:", temp)
    print(their_data)

Temp: 50-60
         day   city  temperature  windspeed   event
9   1/2/2017  paris           50         13  Cloudy
10  1/3/2017  paris           54          8  Cloudy
Temp: 80-90
        day    city  temperature  windspeed  event
4  1/1/2017  mumbai           90          5  Sunny
5  1/2/2017  mumbai           85         12    Fog
6  1/3/2017  mumbai           87         15    Fog
Temp: others
         day      city  temperature  windspeed   event
0   1/1/2017  new york           32          6    Rain
1   1/2/2017  new york           36          7   Sunny
2   1/3/2017  new york           28         12    Snow
3   1/4/2017  new york           33          7   Sunny
7   1/4/2017    mumbai           92          5    Rain
8   1/1/2017     paris           45         20   Sunny
11  1/4/2017     paris           42         10  Cloudy


In [19]:
for key, d in g:
    print("Group by Key: {}\n".format(key))
    print(d)

Group by Key: 50-60

         day   city  temperature  windspeed   event
9   1/2/2017  paris           50         13  Cloudy
10  1/3/2017  paris           54          8  Cloudy
Group by Key: 80-90

        day    city  temperature  windspeed  event
4  1/1/2017  mumbai           90          5  Sunny
5  1/2/2017  mumbai           85         12    Fog
6  1/3/2017  mumbai           87         15    Fog
Group by Key: others

         day      city  temperature  windspeed   event
0   1/1/2017  new york           32          6    Rain
1   1/2/2017  new york           36          7   Sunny
2   1/3/2017  new york           28         12    Snow
3   1/4/2017  new york           33          7   Sunny
7   1/4/2017    mumbai           92          5    Rain
8   1/1/2017     paris           45         20   Sunny
11  1/4/2017     paris           42         10  Cloudy


### Custom grouping logic


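- `groupby(lambda idx: grouper(df, idx, 'temperature'))` passes a key function; pandas calls it once per index label, and rows that return the same label form one group (here, temperature ranges).
- Useful when raw values are too granular. Both ranges are inclusive, so 90 lands in `80-90` while 92 falls through to `others`. Keep a catch-all branch so every row receives a label, and remember that `get_group` raises `KeyError` for a label that no row produced.
- You can also group by a precomputed Series of labels: `df.groupby(df['temperature'].pipe(custom_bucket))`, where `custom_bucket` is a function you define that takes the temperature Series and returns one label per row.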
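
The notes do not define `custom_bucket`, so the version below is only one possible sketch. It assumes the same inclusive ranges as `grouper`, builds the labels with `Series.between` and `numpy.select`, and uses an inline copy of the data from these notes.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the rows shown in these notes.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017'] * 3,
    'city': ['new york'] * 4 + ['mumbai'] * 4 + ['paris'] * 4,
    'temperature': [32, 36, 28, 33, 90, 85, 87, 92, 45, 50, 54, 42],
    'windspeed': [6, 7, 12, 7, 5, 12, 15, 5, 20, 13, 8, 10],
    'event': ['Rain', 'Sunny', 'Snow', 'Sunny', 'Sunny', 'Fog', 'Fog', 'Rain',
              'Sunny', 'Cloudy', 'Cloudy', 'Cloudy'],
})


def custom_bucket(temps: pd.Series) -> pd.Series:
    """Hypothetical helper: label each temperature with its range."""
    # between() is inclusive on both ends by default, matching grouper().
    conditions = [temps.between(80, 90), temps.between(50, 60)]
    labels = np.select(conditions, ['80-90', '50-60'], default='others')
    # Keep the original index so groupby aligns labels with rows.
    return pd.Series(labels, index=temps.index, name='temp_bucket')


# Vectorized labelling: the whole column is bucketed in one pass.
buckets = df['temperature'].pipe(custom_bucket)
sizes = df.groupby(buckets).size()
print(sizes)
```

Compared with a per-label key function, this computes all labels at once, and the label Series can be inspected or reused before grouping.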