<a href="https://colab.research.google.com/github/EngComp-Henrique/Effective-Pandas/blob/main/Effective_Pandas_Chapter_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aggregate Methods
* Aggregations are the summary numbers everyone wants to hear, without needing to know the details
* They let us take detailed data and collapse it to a single value

## Aggregations
* To see what that means, let's look at the example dataset below

In [2]:
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(url)

In [3]:
df.head(3)

Unnamed: 0,barrels08,barrelsA08,charge120,charge240,city08,city08U,cityA08,cityA08U,cityCD,cityE,...,mfrCode,c240Dscr,charge240b,c240bDscr,createdOn,modifiedOn,startStop,phevCity,phevHwy,phevComb
0,15.695714,0.0,0.0,0.0,19,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
1,29.964545,0.0,0.0,0.0,9,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0
2,12.207778,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,...,,,0.0,,Tue Jan 01 00:00:00 EST 2013,Tue Jan 01 00:00:00 EST 2013,,0,0,0


In [4]:
city_mpg = df.city08

* Calculating the mean of a series

In [5]:
city_mpg.mean()

18.369045304297103

* Checking the `is_` properties (they are attributes, so they take no parentheses)

In [6]:
city_mpg.is_unique

False

In [7]:
city_mpg.is_monotonic_increasing

False
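* A minimal sketch of what these properties report, using a small made-up series (the values are an assumption for illustration, not taken from the vehicles data). Note that `is_monotonic_increasing` allows repeated values, so a series can be monotonic without being unique.

```python
import pandas as pd

# Hypothetical toy data, not from the vehicles dataset
s = pd.Series([1, 2, 2, 5])

unique = s.is_unique                    # the value 2 repeats
increasing = s.is_monotonic_increasing  # values never decrease
decreasing = s.is_monotonic_decreasing  # values do not stay flat or fall throughout

print(unique, increasing, decreasing)
```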

* Calculate a `quantile`. See the differences below
* Called with no argument, `.quantile` uses `q=0.5`, the 50% quantile (the median)

In [8]:
city_mpg.quantile(0.8)

21.0

In [9]:
city_mpg.quantile([.1, .8, .99])

0.10    13.0
0.80    21.0
0.99    40.0
Name: city08, dtype: float64

In [10]:
city_mpg.quantile()

17.0

In [11]:
city_mpg.quantile(.9)

24.0

In [12]:
city_mpg.quantile([.1, 0.5, .9])

0.1    13.0
0.5    17.0
0.9    24.0
Name: city08, dtype: float64
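* The same behavior can be checked on a small made-up series (the values are an assumption for illustration): a scalar argument returns a scalar, a list returns a series indexed by the quantile levels, and the default matches `.median()`.

```python
import pandas as pd

# Hypothetical toy data, not from the vehicles dataset
s = pd.Series([10, 20, 30, 40, 50])

default_q = s.quantile()            # q defaults to 0.5
median = s.median()                 # same value as the default quantile
several = s.quantile([0.25, 0.75])  # series indexed by 0.25 and 0.75

print(default_q, median)
print(several)
```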

## Count and Mean of an attribute
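* By chaining operations, we can aggregate only the values that meet some criteria
* For example, if we want the count and percent of cars with city mileage greater than 20, we can use the following code
* `.gt(20)` returns a boolean series; `True` counts as 1 and `False` as 0, so `.sum()` gives the count and `.mean()` gives the fraction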

In [13]:
(city_mpg
 .gt(20)
 .sum())

10272

In [14]:
(city_mpg
 .gt(20)
 .mul(100)
 .mean())

24.965973167412017
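* To make the arithmetic easy to follow, here is a sketch on five made-up mileage values (an assumption for illustration). Three of the five are above 20, so the count is 3 and the percent is 60.

```python
import pandas as pd

# Hypothetical toy data, not from the vehicles dataset
mpg = pd.Series([15, 18, 22, 25, 30])

above_20 = mpg.gt(20)       # False, False, True, True, True
count = above_20.sum()      # number of True values
percent = (above_20
           .mul(100)        # True becomes 100, False becomes 0
           .mean())         # average of those is the percentage

print(count, percent)
```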

## `.agg` and Strings
* `.agg` performs an aggregation given as a method name (a string) or as a function
* It is most useful for running multiple aggregations at once by passing a list

In [15]:
city_mpg.agg('mean')

18.369045304297103

In [21]:
import numpy as np

def second_to_last(s):
    return s.iloc[-2]

In [22]:
city_mpg.agg(['mean', np.var, max, second_to_last])

mean               18.369045
var                62.503036
max               150.000000
second_to_last     18.000000
Name: city08, dtype: float64
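* One detail worth checking when mixing strings and NumPy functions: pandas' `var` defaults to the sample variance (`ddof=1`), while `np.var` called directly on an array defaults to the population variance (`ddof=0`). The sketch below uses made-up values (an assumption for illustration) to show the two results side by side. Passing the string `'var'` makes the pandas behavior explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the vehicles dataset
s = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = s.agg('var')               # pandas method, ddof=1 by default
population_var = np.var(s.to_numpy())   # NumPy on a plain array, ddof=0 by default
matching_var = s.var(ddof=0)            # pandas can match NumPy when asked

print(sample_var, population_var, matching_var)
```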

## Exercises
1. Find the count of non-missing values of a series.

In [40]:
df.mfrCode.count()

10326

2. Find the number of entries of a series.

In [30]:
len(df.mfrCode)

41144

3. Find the number of unique entries of a series.

In [45]:
city_mpg.nunique()

105

4. Find the mean value of a series.

In [36]:
city_mpg.mean()

18.369045304297103

5. Find the maximum value of a series.

In [37]:
city_mpg.max()

150

6. Use the `.agg` method to find all of the above.

In [41]:
city_mpg.agg(['nunique', 'mean', 'max'])

nunique    105.000000
mean        18.369045
max        150.000000
Name: city08, dtype: float64

In [47]:
df.mfrCode.agg(['count', 'size'])

count    10326
size     41144
Name: mfrCode, dtype: int64
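* The gap between `count` and `size` above comes from missing values: `count` skips them, while `size` and `len` include them. A small sketch with made-up values (an assumption for illustration) shows how the numbers relate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries
s = pd.Series(['a', None, 'b', 'a', np.nan])

non_missing = s.count()    # skips missing values
entries = len(s)           # counts every row
size = s.size              # same as len for a series
distinct = s.nunique()     # distinct non-missing values ('a' and 'b')
missing = s.isna().sum()   # rows that are missing

print(non_missing, entries, size, distinct, missing)
```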