In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Data Aggregation and Group Operations

## Group By 

Group operations follow the split-apply-combine pattern:

* Data is split into groups based on one or more keys
* A function is applied to each group, producing a new value
* The results of the function applications are combined into a results object
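
The three steps above can be sketched by hand on a toy frame. This is only an illustration: the frame and its column names (`key`, `value`) are made up, and the manual loop is not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1, 2, 3, 4, 5],
})

# Split: build one sub-frame per unique key.
pieces = {k: df[df["key"] == k] for k in df["key"].unique()}

# Apply: reduce each piece to a single number.
sums = {k: piece["value"].sum() for k, piece in pieces.items()}

# Combine: gather the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby() carries out the same three steps in a single expression.
builtin = df.groupby("key")["value"].sum()

print(manual)
print(builtin)
```

Both Series hold the same per-key totals; the `groupby()` version is shorter and avoids the explicit Python loop.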

_include illustration of split-apply-combine_


The `groupby()` method groups the rows of a DataFrame by key so that values can be aggregated within each group:
* It can find the sum of a column for each group.
* It can get the mean of the values in a column that belong to each group.

`groupby()` returns a GroupBy object that works with aggregate functions.
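
As a rough sketch of both uses, the snippet below sums one column and averages another per group. The `sales` frame, its columns, and its numbers are invented for this example.

```python
import pandas as pd

# Hypothetical sales records; names and numbers are made up for illustration.
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south"],
    "units": [10, 20, 5, 15],
    "price": [2.0, 4.0, 3.0, 5.0],
})

by_store = sales.groupby("store")

total_units = by_store["units"].sum()   # sum of a column within each group
mean_price = by_store["price"].mean()   # mean of each group's values in a column

print(total_units)
print(mean_price)
```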

In [1]:
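## Begin Example
# Small illustrative frame (hypothetical values) so the example runs on its own
df = pd.DataFrame({"key1": ["a", "a", "b", "b"], "data1": [1.0, 2.0, 3.0, 4.0]})
grouped = df["data1"].groupby(df["key1"])
grouped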

In this example, `grouped` is a GroupBy object. Nothing has been computed yet, but `grouped` has all the information required to apply an operation to each of the groups.
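
One way to see what the GroupBy object already knows, without aggregating any values, is to inspect it. The sketch below builds its own small frame (hypothetical values) so that it runs independently of the cells above.

```python
import pandas as pd

# Hypothetical frame with the same column names used in the example above.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0],
})
grouped = df["data1"].groupby(df["key1"])

print(type(grouped).__name__)   # the kind of GroupBy object that was returned
print(grouped.ngroups)          # number of distinct keys
print(grouped.size())           # number of rows in each group

# Iterating yields (key, sub-Series) pairs, one per group.
for key, values in grouped:
    print(key, list(values))
```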

In [22]:
grouped.mean()



The data has been aggregated according to the group key, creating a new Series that is indexed by the unique values in the `key1` column.

The result index is named `key1` because that is the name of the DataFrame column used as the group key.
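
A small sketch of where the index name comes from, assuming a made-up frame: grouping by the named column carries the name over, while grouping by a bare NumPy array with the same values leaves the index unnamed.

```python
import pandas as pd

# Hypothetical frame; the values are invented for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "data1": [1.0, 3.0, 5.0, 7.0],
})

# Grouping by a named Series: the result index inherits the name "key1".
by_series = df["data1"].groupby(df["key1"]).mean()
print(by_series.index.name)

# Grouping by a plain array of the same values: there is no name to inherit.
by_array = df["data1"].groupby(df["key1"].to_numpy()).mean()
print(by_array.index.name)
```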


## Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or an array of column names has the effect of selecting those columns for aggregation.
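
As a minimal sketch on a hypothetical frame (the `data2` column is an assumption added for this example), selecting one column yields a Series per-group result, while selecting a list of columns yields a DataFrame.

```python
import pandas as pd

# Hypothetical frame; `data2` is added only to show multi-column selection.
df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data1": [1.0, 2.0, 3.0],
    "data2": [10.0, 20.0, 30.0],
})

# A single column name selects one column; the aggregate is a Series.
one_col = df.groupby("key1")["data1"].mean()

# The same result, written as grouping the column by the key Series.
same = df["data1"].groupby(df["key1"]).mean()

# A list of names selects several columns; the aggregate is a DataFrame.
two_cols = df.groupby("key1")[["data1", "data2"]].mean()

print(one_col)
print(two_cols)
```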

```python
df.groupby('key1')['data1']
```

In [5]:
file = "../data/totals-2025-05-05T21_58_15.csv"
df = pd.read_csv(file)
df.head()

Unnamed: 0,name,office,office_full,party,party_full,state,district,district_number,election_districts,election_years,...,party.1,office.1,candidate_inactive.1,individual_itemized_contributions,transfers_from_other_authorized_committee,other_political_committee_contributions,state.1,district.1,district_number.1,state_full
0,"FETAYA, ALAIN",P,President,W,WRITE-IN,US,0.0,0.0,{00},{2024},...,W,P,f,0.0,0.0,0.0,US,0.0,0.0,Other
1,"VASAPOLLI, JOSEPH A",P,President,AIP,AMERICAN INDEPENDENT PARTY,US,0.0,0.0,{00},{2024},...,AIP,P,f,0.0,0.0,0.0,US,0.0,0.0,Other
2,"TANNIRU, JOSEPH KISHORE",P,President,SEP,SOCIALIST EQUALITY PARTY,US,0.0,0.0,"{00,00}","{2020,2024}",...,SEP,P,f,0.0,0.0,0.0,US,0.0,0.0,Other
3,"CROW LOMBARDI, SAMATHA MARQUETTA LATIC",P,President,IND,INDEPENDENT,US,,,{NULL},{2024},...,IND,P,f,0.0,0.0,0.0,US,,,Other
4,"SPREWELL DE BOURBON MEDICI, MARIA ANTO",P,President,REP,REPUBLICAN PARTY,US,0.0,0.0,{00},{2024},...,REP,P,f,0.0,0.0,0.0,US,0.0,0.0,Other


In [14]:
df.iloc[55]

name                                         VILLARI, TIMOTHY MICHAEL MR.
office                                                                  P
office_full                                                     President
party                                                                 REP
party_full                                               REPUBLICAN PARTY
state                                                                  US
district                                                              0.0
district_number                                                       0.0
election_districts                                                {00,00}
election_years                                                {2020,2024}
cycles                                              {2020,2022,2024,2026}
candidate_status                                                        C
incumbent_challenge                                                     C
incumbent_challenge_full              

In [13]:
df[df["name"].str.contains("Wu")]

Unnamed: 0,name,office,office_full,party,party_full,state,district,district_number,election_districts,election_years,...,party.1,office.1,candidate_inactive.1,individual_itemized_contributions,transfers_from_other_authorized_committee,other_political_committee_contributions,state.1,district.1,district_number.1,state_full


In [16]:
netflix = pd.read_excel("../data/netflix_titles.xlsx")
netflix.head()


Unnamed: 0,duration_minutes,duration_seasons,type,title,date_added,release_year,rating,description,show_id
0,90.0,,Movie,Norm of the North: King Sized Adventure,2019-09-09 00:00:00,2019.0,TV-PG,Before planning an awesome wedding for his gra...,81145628.0
1,94.0,,Movie,Jandino: Whatever it Takes,2016-09-09 00:00:00,2016.0,TV-MA,Jandino Asporaat riffs on the challenges of ra...,80117401.0
2,,1.0,TV Show,Transformers Prime,2018-09-08 00:00:00,2013.0,TV-Y7-FV,"With the help of three human allies, the Autob...",70234439.0
3,,1.0,TV Show,Transformers: Robots in Disguise,2018-09-08 00:00:00,2016.0,TV-Y7,When a prison ship crash unleashes hundreds of...,80058654.0
4,99.0,,Movie,#realityhigh,2017-09-08 00:00:00,2017.0,TV-14,When nerdy high schooler Dani finally attracts...,80125979.0


In [27]:
grouped_counts = netflix.groupby("type")["title"].count()
grouped_counts
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6236 entries, 0 to 6235
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration_minutes  4267 non-null   object 
 1   duration_seasons  1971 non-null   object 
 2   type              6235 non-null   object 
 3   title             6235 non-null   object 
 4   date_added        6223 non-null   object 
 5   release_year      6234 non-null   float64
 6   rating            6223 non-null   object 
 7   description       6233 non-null   object 
 8   show_id           6232 non-null   float64
dtypes: float64(2), object(7)
memory usage: 438.6+ KB


In [28]:
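# Convert durations to numbers; entries that cannot be parsed become NaN
netflix["duration_minutes"] = pd.to_numeric(netflix["duration_minutes"], errors="coerce")

# Mean duration in minutes for each value of `type`
duration_average = netflix.groupby("type")["duration_minutes"].mean()
duration_average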

type
1944             NaN
Movie      99.100821
TV Show          NaN
Name: duration_minutes, dtype: float64
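
The `1944` group in this result suggests that at least one row holds a misplaced value in `type`, and the `NaN` for `TV Show` is what a mean over all-missing values produces. One possible cleanup, sketched on a toy stand-in for the data, is to keep only the expected labels before grouping. The set of valid labels (`Movie`, `TV Show`) is an assumption based on the output shown here, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for the netflix frame; values are invented for illustration.
titles = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "1944"],
    "duration_minutes": ["90", "110", None, "oops"],
})

# Non-numeric entries become NaN instead of raising an error.
titles["duration_minutes"] = pd.to_numeric(titles["duration_minutes"], errors="coerce")

# Assumption: only these two labels are valid values for `type`.
valid = titles[titles["type"].isin(["Movie", "TV Show"])]

# Group only the rows that passed the check.
per_type = valid.groupby("type")["duration_minutes"].mean()
print(per_type)
```

Rows dropped this way are worth inspecting separately, since a stray label often means the whole row was shifted or malformed at import time.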