# Summarizing data

Pandas has a number of functions for summarizing data.

Those with a database background will probably be most comfortable with the groupby and agg methods.

Those who are more used to working with spreadsheets might be most comfortable with the pivot_table method.

Either way, Pandas has you covered, although there is a lot of overlap between the two.
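To make the overlap concrete, here is a minimal sketch that computes the same totals both ways. The `toy` DataFrame is hypothetical data invented for illustration; only its column names mirror the layers used in this lesson.

```python
import pandas as pd

# Hypothetical toy data (not the lesson's GeoPackage)
toy = pd.DataFrame({
    "type": ["Pipeline", "Pipeline", "Flowline", "Flowline"],
    "recentspec": ["Red-tail Hawk", "Swainsons Hawk", "Red-tail Hawk", "Red-tail Hawk"],
    "area_ha": [1.0, 2.0, 0.5, 1.5],
})

# Database style: group the rows, then aggregate one column
by_group = toy.groupby("type")["area_ha"].sum()

# Spreadsheet style: a pivot table with the same index and aggregate
by_pivot = toy.pivot_table(index="type", values="area_ha", aggfunc="sum")["area_ha"]

print(by_group)
print(by_pivot)
```

Both results are indexed by `type` and hold the same totals, so the choice between the two is mostly a matter of which mental model you prefer.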

Let's read in the data we created in the lecture on intersections.

In [1]:
%matplotlib inline
import geopandas as gpd

raptor_buffer = gpd.read_file('data/intersections.gpkg', layer = 'raptor_buffer')
raptor_linear = gpd.read_file('data/intersections.gpkg', layer = 'raptor_linear')

In [2]:
raptor_buffer.sort_values('Nest_ID')

Unnamed: 0,Nest_ID,recentstat,recentspec,Project,type,length_m,area_ha,geometry
4,3,ACTIVE NEST,Swainsons Hawk,87,Pipeline,14108.354396,3.205486,"POLYGON ((522030.885 4448245.106, 522030.278 4..."
18,3,ACTIVE NEST,Swainsons Hawk,330,Electric Line,713.509839,1.054177,"POLYGON ((521905.544 4448114.292, 521893.573 4..."
14,3,ACTIVE NEST,Swainsons Hawk,138,Pipeline,3526.139806,6.771152,"POLYGON ((521909.590 4448116.717, 521893.573 4..."
21,4,FLEDGED NEST,Red-tail Hawk,82,Pipeline,2865.014487,14.059488,"POLYGON ((516406.771 4445107.660, 516400.475 4..."
32,4,FLEDGED NEST,Red-tail Hawk,500,Access Road - Confirmed,2756.526713,6.167202,"POLYGON ((517287.869 4444587.931, 517257.884 4..."
...,...,...,...,...,...,...,...,...
839,908,INACTIVE NEST,Red-tail Hawk,110,Flowline,34.438473,0.119039,"POLYGON ((505505.531 4431444.246, 505499.858 4..."
825,911,INACTIVE NEST,Red-tail Hawk,954,Electric Line,409.051112,1.752454,"POLYGON ((500977.597 4428548.425, 500979.325 4..."
54,913,INACTIVE NEST,Red-tail Hawk,489,Access Road - Confirmed,1488.756595,0.054009,"POLYGON ((508015.112 4440063.980, 508005.634 4..."
846,1001,INACTIVE NEST,Swainsons Hawk,468,Access Road - Confirmed,83.107753,0.423830,"POLYGON ((504436.145 4454658.274, 504419.285 4..."


With this data we can ask questions like "What projects are impacted by Nest 68?"

In [3]:
raptor_buffer[raptor_buffer['Nest_ID']==68]

Unnamed: 0,Nest_ID,recentstat,recentspec,Project,type,length_m,area_ha,geometry
103,68,ACTIVE NEST,Red-tail Hawk,44,Pipeline,10375.889346,5.458058,"POLYGON ((513327.213 4441031.236, 513323.851 4..."
128,68,ACTIVE NEST,Red-tail Hawk,177,Flowline,321.836809,0.054251,"POLYGON ((514200.418 4441990.047, 514246.858 4..."
131,68,ACTIVE NEST,Red-tail Hawk,233,Pipeline,508.347461,2.482518,"POLYGON ((514194.219 4440818.906, 514192.863 4..."
133,68,ACTIVE NEST,Red-tail Hawk,371,Extraction,712.185148,1.864479,"POLYGON ((514077.276 4440769.921, 514072.061 4..."
135,68,ACTIVE NEST,Red-tail Hawk,381,Extraction,2624.232121,2.760046,"POLYGON ((514146.558 4440796.193, 514133.691 4..."
151,68,ACTIVE NEST,Red-tail Hawk,595,Access Road - Confirmed,620.005032,0.23501,"POLYGON ((513809.882 4440743.018, 513769.909 4..."


## Aggregate functions 

We can also ask questions like "How many hectares of ROW are impacted by Nest 68?"

In [4]:
filtered_data = raptor_buffer[raptor_buffer['Nest_ID'] == 68]
total_area = filtered_data['area_ha'].sum()
print(total_area)

12.854362974034387
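The same filter-then-aggregate pattern can be written in one step with `.loc`, which takes a boolean mask for the rows and a column label. This is a small sketch on hypothetical toy values, not the real nest data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as raptor_buffer
toy = pd.DataFrame({
    "Nest_ID": [68, 68, 3, 68],
    "area_ha": [5.0, 0.5, 3.0, 2.5],
})

# Boolean mask: True for the rows that belong to nest 68
is_nest_68 = toy["Nest_ID"] == 68

# .loc selects the masked rows and the column together, then we sum
total_area = toy.loc[is_nest_68, "area_ha"].sum()
print(total_area)
```

Naming the mask makes the question being asked easier to read when the filter is reused for several aggregates.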


## Challenge!

How many meters of the linear Project 1107 are impacted by Raptor nests?

In [5]:
raptor_linear[raptor_linear['Project']==1107]['length_intersection'].sum()

6122.665143174445

You can also use other aggregate functions like count(), mean(), std(), min(), max(), etc. A fuller list can be found at [pandas aggregate functions](https://cmdlinetips.com/2019/10/pandas-groupby-13-functions-to-aggregate/).

In [6]:
raptor_linear[raptor_linear['Project']==1107]['length_intersection'].mean()

1020.4441905290741

You can also use the describe method to get a standard set of summary statistics in one call.

In [7]:
raptor_linear[raptor_linear['Project']==1107]['length_intersection'].describe()

count       6.000000
mean     1020.444191
std       425.806257
min       292.306049
25%       833.512511
50%      1177.690517
75%      1321.918311
max      1386.197308
Name: length_intersection, dtype: float64
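If describe returns more than you need, a Series also accepts a list of function names through agg. The sketch below uses a hypothetical three-value Series rather than the real intersection lengths.

```python
import pandas as pd

# Hypothetical lengths, for illustration only
lengths = pd.Series([2.0, 4.0, 6.0], name="length_intersection")

# Pick just the statistics of interest by name
summary = lengths.agg(["count", "mean", "min", "max"])
print(summary)

# describe() reports the median under the "50%" label
median = lengths.describe()["50%"]
print(median)
```

The result of agg is itself a Series indexed by the function names, so individual values can be looked up with `summary["mean"]`.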

## The groupby method

The groupby method in Pandas allows you to summarize a set of data aggregated over one or more groups.

The basic syntax is a column name, or a list of column names, followed by an aggregate function:

In [8]:
# numeric_only=True skips the geometry and text columns, which cannot be summed
raptor_buffer.groupby(['Project']).sum(numeric_only=True)

Note that Project is now a named index. It doesn't make much sense to sum the *Nest_ID* or *length_m* fields, but it is interesting to see the total area of each project that is impacted by raptor nests. Let's look at just the *area_ha* column, sorted by area.

In [None]:
# Select the column before summing so the geometry column is never touched
raptor_buffer.groupby(['Project'])['area_ha'].sum().sort_values()
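Because the grouping column becomes the index of the result, values are looked up by project label rather than by position. The following sketch, on hypothetical toy data, shows that lookup and how reset_index turns the index back into an ordinary column.

```python
import pandas as pd

# Hypothetical toy data; project numbers are illustrative only
toy = pd.DataFrame({
    "Project": [87, 87, 330],
    "area_ha": [1.0, 2.0, 4.0],
})

totals = toy.groupby("Project")["area_ha"].sum()
print(totals.index.name)   # the grouping column is now the index name
print(totals.loc[87])      # look up one project by its index label

# reset_index() moves Project back into a regular column
flat = totals.reset_index()
print(flat.columns.tolist())
```

The flat form is often more convenient when the summary needs to be merged with another table or written to a file.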

We might also be interested in seeing the number of nests impacted by each project:

In [None]:
raptor_buffer.groupby(['Project'])['area_ha'].count().sort_values()

Or in how many projects are impacted by each nest:

In [None]:
raptor_buffer.groupby(['Nest_ID'])['area_ha'].count().sort_values()

If you want more detail, you can add a second level of grouping, for instance to see how many hectares of each project are impacted by each species of raptor.

In [None]:
# Select area_ha before summing; sum() on the geometry column raises a TypeError
raptor_buffer.groupby(['Project', 'recentspec'])['area_ha'].sum()
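Grouping by two columns produces a result with a two-level index. As a sketch on hypothetical toy data, the example below looks up one (project, species) pair and then uses unstack to spread the species level into columns, which can be easier to read.

```python
import pandas as pd

# Hypothetical toy data with the lesson's column names
toy = pd.DataFrame({
    "Project": [87, 87, 330, 330],
    "recentspec": ["Swainsons Hawk", "Red-tail Hawk", "Swainsons Hawk", "Swainsons Hawk"],
    "area_ha": [1.0, 2.0, 0.5, 1.5],
})

by_project_species = toy.groupby(["Project", "recentspec"])["area_ha"].sum()

# Index a two-level result with a tuple of (Project, recentspec)
print(by_project_species.loc[(330, "Swainsons Hawk")])

# Move the species level into columns; missing combinations become 0
wide = by_project_species.unstack(fill_value=0)
print(wide)
```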

And we can go even further and group by nest status, just by adding a third grouping column:

In [None]:
raptor_buffer.groupby(['Project', 'recentspec', 'recentstat'])['area_ha'].count()

At this level of detail, however, it might make more sense to set the first level of grouping to a broader category like project type:

In [None]:
raptor_buffer.groupby(['type', 'recentspec', 'recentstat'])['area_ha'].count()

## The agg method

The agg method can be applied to any dataframe, including the result of a groupby.

It allows you to specify exactly which aggregate functions to compute for each column.

You can provide a list of aggregate functions, which will be applied to each column in the selection.

In [None]:
# Convert the 'area_ha' column to a numeric (float) type
raptor_buffer['area_ha'] = raptor_buffer['area_ha'].astype(float)

# Group the rows, select area_ha, and apply a list of aggregate functions
grouped_data = raptor_buffer.groupby(['type', 'recentspec', 'recentstat'])[['area_ha']].agg(['count', 'sum'])
grouped_data

Or, if you want finer control over what is included, you can provide a dictionary mapping each column to a list of aggregate functions for that column.

In [None]:
raptor_buffer.groupby(['type', 'recentspec', 'recentstat']).agg({"Nest_ID":['count'], "area_ha":['sum','mean','std']})
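A dictionary of lists yields two-level column names such as `('area_ha', 'sum')`. If flat, descriptive column names are preferred, pandas also supports named aggregation (available since pandas 0.25). The sketch below contrasts the two on hypothetical toy data; the output names `nest_count`, `total_area`, and `mean_area` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the lesson's column names
toy = pd.DataFrame({
    "type": ["Pipeline", "Pipeline", "Flowline"],
    "Nest_ID": [3, 4, 68],
    "area_ha": [1.0, 3.0, 0.5],
})

# Dictionary form: columns come back as (column, function) pairs
multi = toy.groupby("type").agg({"area_ha": ["sum", "mean"]})
print(multi.columns.tolist())

# Named aggregation: output_name=(input_column, function)
summary = toy.groupby("type").agg(
    nest_count=("Nest_ID", "count"),
    total_area=("area_ha", "sum"),
    mean_area=("area_ha", "mean"),
)
print(summary)
```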

## Challenge #2

How many electric lines have Swainson's Hawk nests within 333 meters, and how many nests are affected by each electric line?

In [None]:
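# Combine both conditions in one boolean mask; chaining two [] filters applies
# the second mask to an already-filtered frame and triggers a reindexing warning
is_target = (raptor_buffer['type'] == 'Electric Line') & (raptor_buffer['recentspec'] == 'Swainsons Hawk')
raptor_buffer[is_target].groupby(['Project', 'recentspec']).agg({"Nest_ID": ['count']})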
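The grouped table answers the second half of the question. For the first half, one possible approach is to count the distinct projects left after filtering. This is a sketch on hypothetical toy data, and it assumes each electric line corresponds to one `Project` value.

```python
import pandas as pd

# Hypothetical toy data with the lesson's column names
toy = pd.DataFrame({
    "Project": [330, 330, 954, 87],
    "type": ["Electric Line", "Electric Line", "Electric Line", "Pipeline"],
    "recentspec": ["Swainsons Hawk", "Swainsons Hawk", "Red-tail Hawk", "Swainsons Hawk"],
    "Nest_ID": [3, 7, 911, 3],
})

# Both conditions must hold, so join them with & (each side in parentheses)
mask = (toy["type"] == "Electric Line") & (toy["recentspec"] == "Swainsons Hawk")
selected = toy[mask]

# Number of distinct electric lines with at least one matching nest
n_lines = selected["Project"].nunique()

# Number of distinct nests affected by each of those lines
nests_per_line = selected.groupby("Project")["Nest_ID"].nunique()

print(n_lines)
print(nests_per_line)
```

Using nunique rather than count guards against counting the same nest twice if a nest buffer intersects a project in more than one piece.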