# Grouping Data

In [1]:
import pandas as pd

#### Read CSV files

In [2]:
from pathlib import Path

# creating a relative path to the data folder 
pth = Path('../../data')

In [3]:
# read canvas csv file into a dataframe 
canvas = pd.read_csv(pth / 'canvas.csv')

In [4]:
# read graffiti csv file into a dataframe
graffiti = pd.read_csv(pth / 'graffiti.csv')

## Grouping

Use [`.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to group rows based on the entries in one or more columns.

### Exploring the GroupBy Object

The object returned by `.groupby()` is not very informative on its own, but it helps to become familiar with some of its attributes. Let's group entries in the graffiti table by their *type*.

In [5]:
# create a groupby object
grouped = graffiti.groupby('type')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023AA36DBB00>

The `.groups` attribute of a GroupBy object returns the groups it has generated as a Python [Dictionary](../python/dictionary.ipynb) that maps each group name to the row labels of its members.

In [6]:
grouped.groups

{'blockbuster': [118, 944, 952, 1420, 1505, 1993, 2021, 2027, 2052, 2056, 2064, 2068, 2072, 2086, 2091, 2125, 2804], 'edging': [711, 718, 1757, 1758, 1759, 1760, 1762, 1763, 1765, 1766, 1767, 3362, 3557, 3562, 3624], 'hollow': [3, 111, 236, 417, 425, 434, 441, 446, 447, 550, 551, 611, 625, 626, 630, 631, 632, 636, 640, 641, 642, 759, 764, 797, 809, 913, 918, 925, 942, 1046, 1049, 1082, 1182, 1281, 1348, 1349, 1449, 1462, 1544, 1567, 1568, 1589, 1596, 1612, 1628, 1644, 1671, 1710, 1711, 1712, 1713, 1715, 1719, 1727, 1728, 1730, 1731, 1737, 1768, 1787, 1809, 1811, 1819, 1820, 1821, 1822, 1823, 1826, 1828, 1829, 1830, 1831, 1835, 1836, 1838, 1841, 1842, 1846, 1882, 1887, 1900, 1932, 1960, 1965, 1997, 2003, 2012, 2024, 2065, 2155, 2199, 2201, 2202, 2370, 2373, 2375, 2376, 2385, 2390, 2395, ...], 'other': [174, 175, 177, 188, 215, 216, 219, 228, 234, 285, 495, 573, 580, 585, 743, 751, 755, 757, 1028, 1033, 1043, 1045, 1079, 1379, 1385, 1389, 1395, 1461, 1480, 1481, 1482, 1502, 1658, 1691, 1

You access its entries, keys and values as you would any Python Dictionary.

In [7]:
# names of the groups
grouped.groups.keys()

dict_keys(['blockbuster', 'edging', 'hollow', 'other', 'pasteUp', 'piece', 'stencil', 'sticker', 'tag', 'throwUp', 'wildstyle'])

In [8]:
# return the row labels of all entries in the 'throwUp' group
grouped.groups['throwUp']

Index([   7,    8,   10,   11,   12,   14,   18,   29,   54,   56,
       ...
       3447, 3450, 3452, 3453, 3523, 3525, 3575, 3580, 3596, 3616],
      dtype='int64', length=280)
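
Besides looking up entries in `.groups`, you can loop over a GroupBy object directly. The sketch below uses a small made-up table (`toy`, not the real graffiti data) to show both approaches; the column values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the graffiti table (values are made up for illustration)
toy = pd.DataFrame({
    'type': ['tag', 'piece', 'tag', 'sticker', 'piece'],
    'width': [25, 91, 53, 10, 120],
})

toy_grouped = toy.groupby('type')

# .groups maps each group name to the row labels that belong to it
group_rows = {name: list(labels) for name, labels in toy_grouped.groups.items()}
print(group_rows)

# iterating over the GroupBy object yields (name, sub-DataFrame) pairs
group_sizes = {}
for name, frame in toy_grouped:
    group_sizes[name] = len(frame)
print(group_sizes)
```

Iterating is handy for quick inspection, while the summary methods shown later are usually the better choice for actual computations.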

### Retrieving a Group

You can retrieve the entries of a particular group as a DataFrame.

In [9]:
# first three entries of the 'piece' group
grouped.get_group('piece').head(3)

Unnamed: 0,id,canvas_id,created_at,uploaded_at,created_by,title,num,date_recorded,width,height,...,technique,marker_type,nip_type,other,num_colors,colors,nature_graffiti,transcribable,message,transcription
0,0,3,2023-11-27 13:40:11-08:00,2023-11-27 13:40:35-08:00,jsomer@uw.edu,11/27/2023 “Roja”,1,11/27/2023,91,35,...,spray,,,,2,"['black', 'white']","['Image', 'Text']",Y,writter,“Roja”
1,1,3,2023-11-27 13:39:13-08:00,2023-11-27 13:40:33-08:00,jsomer@uw.edu,11/27/2023 Triangle/prism,1,11/27/2023,91,60,...,spray,,,,3-5,"['black', 'white', 'red', 'gold']",['Image'],Y,other,Triangle/prism
16,16,4,2023-11-27 13:14:51-08:00,2023-11-27 13:33:25-08:00,kbtellez@uw.edu,11/27/2023 Aim,1,11/27/2023,162,151,...,spray,,,,2,"['black', 'white']",['Text'],Y,writter,Aim


### Grouping with Multiple Columns

Pandas allows you to use more than one column when generating groups. It groups on the first column, then splits each of those groups again by the second column, and so on. You end up with information grouped in a hierarchical manner.

In [10]:
# group by canvas_location and viewing_potential
# (observed=True only affects categorical columns: it skips unobserved category combinations)
grp = canvas.groupby(['canvas_location', 'viewing_potential'], observed=True)

In [11]:
# what are the groups?
grp.groups.keys()

dict_keys([('alley', 'high'), ('alley', 'low'), ('alley', 'medium'), ('bridge', 'high'), ('bridge', 'low'), ('bridge', 'medium'), ('highway', 'low'), ('highway', 'medium'), ('indoors', 'high'), ('indoors', 'low'), ('indoors', 'medium'), ('other', 'high'), ('other', 'low'), ('other', 'medium'), ('overpass', 'high'), ('railroad_tracks', 'high'), ('railroad_tracks', 'low'), ('railroad_tracks', 'medium'), ('roof', 'high'), ('roof', 'medium'), ('street', 'high'), ('street', 'low'), ('street', 'medium')])

In [12]:
# return entries for alleys with high viewing potential
grp.get_group(('alley', 'high')).head(3)


Unnamed: 0,id,created_at,uploaded_at,created_by,title,at_canvas,coords,date_entry_canvas,property_type,property_use,surveillance_status,surveillance,canvas_location,canvas_nature,surface_material,graffiti_removal,viewing_potential,accessibility
13,13,2023-11-26 19:37:06-08:00,2023-11-26 19:52:38-08:00,mingmats@uw.edu,11/26/2023 Sign,Y,"{'latitude': 47.620266, 'longitude': -122.3197...",11/26/2023,comercial,in_use,Y,"['cameras', 'lights']",alley,sign,['metal'],N,high,['street_Level']
83,83,2023-11-21 14:24:00-08:00,2023-11-26 18:30:29-08:00,aimerino@uw.edu,11/21/2023 Wall,Y,"{'latitude': 47.610107, 'longitude': -122.3404...",11/21/2023,comercial,in_use,Y,"['cameras', 'lights', 'alarms', 'people']",alley,wall,['brick'],N,high,['street_Level']
131,131,2023-11-18 10:43:16.091000-08:00,2023-11-18 12:54:18-08:00,meribeth@uw.edu,11/18/2023 Wall,Y,"{'latitude': 47.661555, 'longitude': -122.3148...",11/18/2023,comercial,in_use,Y,"['lights', 'people', 'cameras']",alley,wall,['concrete'],N,high,['street_Level']


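Note that each group key is now a tuple. Summaries computed from such a GroupBy object are indexed by a hierarchical index (a `MultiIndex`), and handling multi-indexes is not trivial.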
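
To give a feel for what working with a hierarchical index involves, here is a minimal sketch on a made-up table (`toy_canvas`, with invented values rather than the real canvas data). It counts rows per pair of keys, then shows three common ways to get at the result: selecting with a tuple, flattening with `.reset_index()`, and pivoting with `.unstack()`. The column name `n` is an assumption introduced for this example.

```python
import pandas as pd

# Toy stand-in for the canvas table (values are made up for illustration)
toy_canvas = pd.DataFrame({
    'canvas_location': ['alley', 'alley', 'street', 'street', 'street'],
    'viewing_potential': ['high', 'low', 'high', 'high', 'low'],
})

# counting rows per (location, potential) pair gives a Series with a MultiIndex
counts = toy_canvas.groupby(['canvas_location', 'viewing_potential']).size()

# select one combination with a tuple key
alley_high = counts.loc[('alley', 'high')]

# flatten the MultiIndex back into ordinary columns
flat = counts.reset_index(name='n')

# or move the inner level into columns to get a location-by-potential table
wide = counts.unstack(fill_value=0)
print(wide)
```

Flattening with `.reset_index()` is often the simplest route when the summary will be merged or plotted afterwards.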

### Simple Summaries

Once information has been grouped, you can apply various pandas methods to generate numerical summaries.

In [13]:
# number of entries
graffiti.groupby('type')['type'].count()


type
blockbuster      17
edging           15
hollow          162
other           106
pasteUp          13
piece           121
stencil           8
sticker         442
tag            2446
throwUp         280
wildstyle        23
Name: type, dtype: int64

In [14]:
# similar to value_counts(), which sorts by count instead of by group name
graffiti['type'].value_counts()

type
tag            2446
sticker         442
throwUp         280
hollow          162
piece           121
other           106
wildstyle        23
blockbuster      17
edging           15
pasteUp          13
stencil           8
Name: count, dtype: int64
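
One detail worth knowing when counting: `.count()` counts only non-missing values in the selected column, whereas `.size()` counts rows. The sketch below illustrates the difference on a made-up table with some missing widths (the values are assumptions, not taken from the graffiti data).

```python
import numpy as np
import pandas as pd

# Toy table with missing widths (values are made up for illustration)
toy = pd.DataFrame({
    'type': ['tag', 'tag', 'piece', 'piece', 'piece'],
    'width': [25, np.nan, 91, 120, np.nan],
})

by_type = toy.groupby('type')

n_rows = by_type.size()               # rows per group, missing values included
n_widths = by_type['width'].count()   # non-missing widths per group

print(pd.DataFrame({'rows': n_rows, 'widths_recorded': n_widths}))
```

When a column has no missing values the two give the same numbers, so the choice only matters for incomplete data.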

In [15]:
# mean width and height
graffiti.groupby('type')[['width', 'height']].mean()


type,width,height
blockbuster,232.705882,78.941176
edging,63.2,24.2
hollow,226.006173,98.54321
other,58.915094,44.169811
pasteUp,22.461538,41.538462
piece,200.214876,120.958678
stencil,21.0,25.625
sticker,10.466063,9.99095
tag,40.646361,29.544971
throwUp,205.346429,108.45


### `.agg()` and `.aggregate()`

You can generate quite sophisticated summary tables with [`.agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html) or its longer name [`.aggregate()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.aggregate.html); the two are interchangeable.

In [16]:
# find the count, min, mean, std, max width for each type
graffiti.groupby('type')['width'].aggregate(['count','min', 'mean', 'std', 'max'])


type,count,min,mean,std,max
blockbuster,17,2,232.705882,215.012431,710
edging,15,10,63.2,34.090635,128
hollow,162,4,226.006173,349.30041,3003
other,106,2,58.915094,117.86248,800
pasteUp,13,1,22.461538,19.598535,70
piece,121,1,200.214876,239.848223,2000
stencil,8,5,21.0,10.042766,35
sticker,442,1,10.466063,16.350331,250
tag,2446,0,40.646361,74.071764,2020
throwUp,280,1,205.346429,186.281759,1000


In [17]:
# use a dictionary to define the operations per column:
# count for 'type'; min, mean and max for 'width'
operations = {'type': 'count',
              'width':['min', 'mean', 'max']}


In [18]:
# execute operations
graffiti.groupby('type')[['type', 'width']].agg(operations)


,type,width,width,width
type,count,min,mean,max
blockbuster,17,2,232.705882,710
edging,15,10,63.2,128
hollow,162,4,226.006173,3003
other,106,2,58.915094,800
pasteUp,13,1,22.461538,70
piece,121,1,200.214876,2000
stencil,8,5,21.0,35
sticker,442,1,10.466063,250
tag,2446,0,40.646361,2020
throwUp,280,1,205.346429,1000
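
The dictionary form produces two-level column headers. If flat column names are preferable, pandas also supports "named aggregation", where each keyword argument to `.agg()` is an output column name paired with a `(column, function)` tuple. The following is a sketch on made-up data; the output names (`n`, `min_width`, and so on) are assumptions chosen for this example.

```python
import pandas as pd

# Toy stand-in for the graffiti table (values are made up for illustration)
toy = pd.DataFrame({
    'type': ['tag', 'tag', 'piece', 'piece', 'sticker'],
    'width': [25, 53, 91, 120, 10],
})

# each keyword is an output column: name=(input column, aggregation)
summary = toy.groupby('type').agg(
    n=('width', 'count'),          # number of non-missing widths
    min_width=('width', 'min'),
    mean_width=('width', 'mean'),
    max_width=('width', 'max'),
)
print(summary)
```

The result has a single header row, which tends to be easier to work with than a hierarchical one.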


## Filtering

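Use the GroupBy [`.filter()`](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#filtration) method to select data based on **group** properties. It takes a function that receives each group as a DataFrame and returns `True` for the groups to keep. This is not the same as `DataFrame.filter()`, which selects rows or columns by label.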

In [19]:
# create a function that checks if the number of entries for each type is greater than 10
def greater10(grp):
    return grp['type'].count() > 10

In [20]:
# apply above function
result = graffiti.groupby('type')[['type', 'width']].filter(greater10)
result

Unnamed: 0,type,width
0,piece,91
1,piece,91
2,tag,25
3,hollow,365
4,tag,53
...,...,...
3628,piece,120
3629,piece,120
3630,hollow,120
3631,tag,15
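
The same kind of selection can also be written as a boolean mask built with `.transform()`, which broadcasts a per-group value back onto every row. The sketch below compares the two approaches on a made-up table; the data and the threshold of 2 are assumptions chosen to keep the example small.

```python
import pandas as pd

# Toy stand-in for the graffiti table (values are made up for illustration)
toy = pd.DataFrame({
    'type': ['tag', 'tag', 'tag', 'piece', 'piece', 'stencil'],
    'width': [25, 53, 15, 91, 120, 21],
})

# keep only rows whose group has more than 2 entries (hypothetical threshold)
kept = toy.groupby('type').filter(lambda g: len(g) > 2)

# same selection with a mask: attach each row's group size, then compare
group_size = toy.groupby('type')['width'].transform('size')
kept_mask = toy[group_size > 2]

print(kept)
```

On large tables the mask version is often faster, because it avoids calling a Python function once per group.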


## Additional References

See the Real Python [`.groupby()` tutorial](https://realpython.com/pandas-groupby/) for more examples.