# Lecture 2.6: Querying and Summarizing in Pandas

## 🎯 Learning Objectives
By the end of this lecture, you will:
- Use `.query()` to filter data more cleanly
- Refresh `.groupby()` syntax for splitting and grouping data
- Use `.agg()` to summarize grouped values


In [1]:
import pandas as pd

# Load OHIE data
OHIE = pd.read_csv('../Data/OHIE_12m.csv')
OHIE.head()
# df = pd.DataFrame({
#     'participant': range(1, 11),
#     'age': [34, 45, 67, 50, 29, 61, 73, 41, 55, 39],
#     'sex': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M', 'F'],
#     'smoker': ['Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'],
#     'bmi': [22.5, 27.4, 30.1, 25.8, 23.4, 29.9, 31.2, 21.3, 28.0, 24.7]
# })

Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_partner_12m,live_parents_12m,live_friends_12m,live_relatives_12m,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,PHQ2_sum,PHQ2_cutoff
0,64350,164350,Not selected,,Lottery Draw 6,,,2008-07-14,2008-08-08,1974,...,No,Yes,No,No,No,2.0,3.0,3.0,6.0,True
1,55655,155655,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1987,...,Yes,No,No,No,No,2.0,1.0,1.0,2.0,False
2,20087,128134,Selected,Draw 6: selected in lottery 07/01/2008,Lottery Draw 6,Submitted an Application to OHP,No,2008-07-14,2008-08-08,1963,...,No,No,No,Yes,No,7.0,0.0,1.0,1.0,False
3,70998,170998,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1954,...,Yes,No,No,No,No,2.0,3.0,2.0,5.0,True
4,8839,108839,Selected,Draw 8: selected in lottery 09/02/2008,Lottery Draw 8,Did NOT submit an application to OHP,No,2008-09-11,2008-10-08,1964,...,No,No,Yes,No,No,4.0,2.0,2.0,4.0,True


## 🔍 Filtering with `.query()`

In [2]:
# Traditional filtering
OHIE[OHIE['birthyear_list'] < 1958]

Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_partner_12m,live_parents_12m,live_friends_12m,live_relatives_12m,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,PHQ2_sum,PHQ2_cutoff
3,70998,170998,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1954,...,Yes,No,No,No,No,2.0,3.0,2.0,5.0,True
7,7491,107491,Selected,Draw 3: selected in lottery 04/08/2008,Lottery Draw 3,Submitted an Application to OHP,No,2008-04-16,2008-05-08,1952,...,No,No,No,No,No,1.0,0.0,0.0,0.0,False
16,8538,127696,Selected,Draw 7: selected in lottery 08/01/2008,Lottery Draw 7,Submitted an Application to OHP,No,2008-08-12,2008-09-08,1951,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
19,37303,158931,Selected,Draw 6: selected in lottery 07/01/2008,Lottery Draw 6,Did NOT submit an application to OHP,No,2008-07-14,2008-08-08,1946,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
20,23931,123931,Not selected,,Lottery Draw 1,,,2008-03-10,2008-03-11,1956,...,No,No,No,Yes,No,3.0,2.0,2.0,4.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3979,60711,160711,Selected,Draw 1: selected in lottery 03/05/2008,Lottery Draw 1,Did NOT submit an application to OHP,No,2008-03-10,2008-03-11,1952,...,No,No,Yes,No,No,2.0,2.0,2.0,4.0,True
3981,15291,115291,Selected,Draw 2: selected in lottery 03/27/2008,Lottery Draw 2,Submitted an Application to OHP,No,2008-04-07,2008-04-08,1957,...,Yes,No,No,No,No,3.0,0.0,0.0,0.0,False
3982,50392,150392,Not selected,,Lottery Draw 5,,,2008-06-11,2008-07-08,1956,...,No,No,Yes,No,No,2.0,1.0,0.0,1.0,False
3983,72843,172843,Not selected,,Lottery Draw 3,,,2008-04-16,2008-05-08,1955,...,No,No,No,Yes,No,2.0,2.0,1.0,3.0,True


In [4]:
# Using query
OHIE.query("birthyear_list < 1958")

Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_partner_12m,live_parents_12m,live_friends_12m,live_relatives_12m,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,PHQ2_sum,PHQ2_cutoff
3,70998,170998,Not selected,,Lottery Draw 7,,,2008-08-12,2008-09-08,1954,...,Yes,No,No,No,No,2.0,3.0,2.0,5.0,True
7,7491,107491,Selected,Draw 3: selected in lottery 04/08/2008,Lottery Draw 3,Submitted an Application to OHP,No,2008-04-16,2008-05-08,1952,...,No,No,No,No,No,1.0,0.0,0.0,0.0,False
16,8538,127696,Selected,Draw 7: selected in lottery 08/01/2008,Lottery Draw 7,Submitted an Application to OHP,No,2008-08-12,2008-09-08,1951,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
19,37303,158931,Selected,Draw 6: selected in lottery 07/01/2008,Lottery Draw 6,Did NOT submit an application to OHP,No,2008-07-14,2008-08-08,1946,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
20,23931,123931,Not selected,,Lottery Draw 1,,,2008-03-10,2008-03-11,1956,...,No,No,No,Yes,No,3.0,2.0,2.0,4.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3979,60711,160711,Selected,Draw 1: selected in lottery 03/05/2008,Lottery Draw 1,Did NOT submit an application to OHP,No,2008-03-10,2008-03-11,1952,...,No,No,Yes,No,No,2.0,2.0,2.0,4.0,True
3981,15291,115291,Selected,Draw 2: selected in lottery 03/27/2008,Lottery Draw 2,Submitted an Application to OHP,No,2008-04-07,2008-04-08,1957,...,Yes,No,No,No,No,3.0,0.0,0.0,0.0,False
3982,50392,150392,Not selected,,Lottery Draw 5,,,2008-06-11,2008-07-08,1956,...,No,No,Yes,No,No,2.0,1.0,0.0,1.0,False
3983,72843,172843,Not selected,,Lottery Draw 3,,,2008-04-16,2008-05-08,1955,...,No,No,No,Yes,No,2.0,2.0,1.0,3.0,True


In [5]:
# Multiple conditions
OHIE.query("birthyear_list < 1958 and treatment == 'Selected'")

Unnamed: 0,person_id,household_id,treatment,draw_treat,draw_lottery,applied_app,approved_app,dt_notify_lottery,dt_retro_coverage,birthyear_list,...,live_partner_12m,live_parents_12m,live_friends_12m,live_relatives_12m,live_other_12m,hhsize_12m,PHQ2_1,PHQ2_2,PHQ2_sum,PHQ2_cutoff
7,7491,107491,Selected,Draw 3: selected in lottery 04/08/2008,Lottery Draw 3,Submitted an Application to OHP,No,2008-04-16,2008-05-08,1952,...,No,No,No,No,No,1.0,0.0,0.0,0.0,False
16,8538,127696,Selected,Draw 7: selected in lottery 08/01/2008,Lottery Draw 7,Submitted an Application to OHP,No,2008-08-12,2008-09-08,1951,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
19,37303,158931,Selected,Draw 6: selected in lottery 07/01/2008,Lottery Draw 6,Did NOT submit an application to OHP,No,2008-07-14,2008-08-08,1946,...,Yes,No,No,No,No,2.0,0.0,0.0,0.0,False
27,15392,115392,Selected,Draw 4: selected in lottery 05/01/2008,Lottery Draw 4,Submitted an Application to OHP,No,2008-05-09,2008-06-09,1948,...,Yes,No,No,No,No,2.0,2.0,3.0,5.0,True
28,4104,104104,Selected,Draw 5: selected in lottery 06/02/2008,Lottery Draw 5,Submitted an Application to OHP,Yes,2008-06-11,2008-07-08,1956,...,No,No,Yes,No,No,1.0,2.0,1.0,3.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3968,46181,146181,Selected,Draw 3: selected in lottery 04/08/2008,Lottery Draw 3,Did NOT submit an application to OHP,No,2008-04-16,2008-05-08,1947,...,No,No,No,No,Yes,3.0,3.0,3.0,6.0,True
3969,13242,113242,Selected,Draw 7: selected in lottery 08/01/2008,Lottery Draw 7,Submitted an Application to OHP,No,2008-08-12,2008-09-08,1950,...,No,No,No,No,No,1.0,1.0,1.0,2.0,False
3979,60711,160711,Selected,Draw 1: selected in lottery 03/05/2008,Lottery Draw 1,Did NOT submit an application to OHP,No,2008-03-10,2008-03-11,1952,...,No,No,Yes,No,No,2.0,2.0,2.0,4.0,True
3981,15291,115291,Selected,Draw 2: selected in lottery 03/27/2008,Lottery Draw 2,Submitted an Application to OHP,No,2008-04-07,2008-04-08,1957,...,Yes,No,No,No,No,3.0,0.0,0.0,0.0,False
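
The two filtering styles above can be compared side by side on a small stand-in table. This is only a sketch: the `toy` DataFrame and its values are invented for illustration (it borrows two OHIE column names but is not OHIE data), and `cutoff` is a hypothetical variable name. It also shows that a local Python variable can be referenced inside the query string by prefixing it with `@`.

```python
import pandas as pd

# Hypothetical toy data borrowing two OHIE column names; the values are made up.
toy = pd.DataFrame({
    "birthyear_list": [1950, 1962, 1955, 1980, 1947, 1971],
    "treatment": ["Selected", "Not selected", "Selected",
                  "Selected", "Not selected", "Not selected"],
})

cutoff = 1958  # local variable, referenced inside the query string as @cutoff

# Boolean-mask version: each condition needs parentheses and the & operator
mask_result = toy[(toy["birthyear_list"] < cutoff) & (toy["treatment"] == "Selected")]

# Equivalent .query() version: plain `and`, no repeated DataFrame name
query_result = toy.query("birthyear_list < @cutoff and treatment == 'Selected'")

print(query_result)
```

Both versions return the same rows; `.query()` mainly saves typing and keeps long conditions readable.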


## 📊 Using `.groupby()` to Split Data

In [10]:
# Group by sex (stored in the female_list column)
grouped = OHIE.groupby('female_list')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10b5635c0>

In [11]:
grouped['birthyear_list'].mean()

female_list
0: Male      1964.982222
1: Female    1965.964124
Name: birthyear_list, dtype: float64

In [12]:
grouped['treatment'].value_counts()

female_list  treatment   
0: Male      Not selected     788
             Selected         787
1: Female    Selected        1224
             Not selected    1201
Name: count, dtype: int64
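
Raw counts like those above are hard to compare when the groups differ in size. As a minimal sketch, the example below uses an invented `toy` DataFrame (made-up values, not OHIE data) to show two related calls on a grouped object: `.size()` for the number of rows per group and `value_counts(normalize=True)` for within-group proportions.

```python
import pandas as pd

# Hypothetical toy data borrowing two OHIE column names; the values are made up.
toy = pd.DataFrame({
    "female_list": ["0: Male", "1: Female", "1: Female",
                    "0: Male", "1: Female", "1: Female"],
    "treatment": ["Selected", "Selected", "Not selected",
                  "Not selected", "Selected", "Selected"],
})

grouped = toy.groupby("female_list")

# Number of rows in each group
group_sizes = grouped.size()

# Share of each treatment value within each group (sums to 1 per group)
shares = grouped["treatment"].value_counts(normalize=True)

print(group_sizes)
print(shares)
```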

## 🧮 Summarizing with `.agg()`

In [13]:
# Multiple summaries per group
OHIE.groupby('treatment').agg({
    'birthyear_list': ['mean', 'min', 'max'],
    'PHQ2_sum': ['mean', 'std']
})

,birthyear_list,birthyear_list,birthyear_list,PHQ2_sum,PHQ2_sum
treatment,mean,min,max,mean,std
Not selected,1965.449975,1945,1988,2.205325,2.005201
Selected,1965.70363,1945,1988,1.933574,1.935917
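
Passing a dictionary of lists to `.agg()` produces two-level column names, which can be awkward to work with later. One alternative, sketched here on an invented `toy` DataFrame (made-up values, not OHIE data), is pandas named aggregation: each keyword argument names an output column and maps it to a `(column, function)` pair. The output column names below are hypothetical choices.

```python
import pandas as pd

# Hypothetical toy data borrowing three OHIE column names; the values are made up.
toy = pd.DataFrame({
    "treatment": ["Selected", "Selected", "Not selected", "Not selected"],
    "birthyear_list": [1950, 1960, 1970, 1980],
    "PHQ2_sum": [1.0, 3.0, 2.0, 6.0],
})

# Named aggregation: output_name=(input_column, aggregation_function)
summary = toy.groupby("treatment").agg(
    mean_birthyear=("birthyear_list", "mean"),
    min_birthyear=("birthyear_list", "min"),
    mean_phq2=("PHQ2_sum", "mean"),
)

print(summary)
```

The result has a single level of column names, so a column can be selected directly, for example `summary["mean_phq2"]`.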


## 🔗 Combine `.query()` with `.groupby()`

In [14]:
# Compare mean PHQ2_sum by treatment group among participants born before 1960
OHIE.query("birthyear_list < 1960").groupby('treatment')['PHQ2_sum'].mean()

treatment
Not selected    2.442254
Selected        2.083815
Name: PHQ2_sum, dtype: float64
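
A subgroup mean is easier to interpret when it is reported with the number of observations behind it. The sketch below chains `.query()`, `.groupby()`, and `.agg()` on an invented `toy` DataFrame (made-up values, not OHIE data) that includes one missing score. Note that `count` reports only non-missing values and `mean` skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data borrowing three OHIE column names; the values are made up.
toy = pd.DataFrame({
    "birthyear_list": [1950, 1955, 1958, 1959, 1975, 1948],
    "treatment": ["Selected", "Selected", "Not selected",
                  "Not selected", "Selected", "Not selected"],
    "PHQ2_sum": [2.0, 4.0, 1.0, np.nan, 6.0, 3.0],
})

result = (
    toy.query("birthyear_list < 1960")   # keep the older participants only
       .groupby("treatment")["PHQ2_sum"]  # split the filtered rows by treatment
       .agg(["count", "mean"])            # non-missing n and mean per group
)

print(result)
```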

## ✅ Summary
- Use `.query()` for readable filtering with logical conditions
- Use `.groupby()` to segment data by categories
- Use `.agg()` to compute custom summaries per group

These techniques are powerful for epidemiological description and subgroup comparisons.
