# Pandas Aggregation
### Aggregation Function
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

The DataFrame.aggregate() function applies one or more aggregations across one or more columns. It accepts a callable, a string, a dict, or a list of strings/callables. The most frequently used aggregations are:

sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis


Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)

Parameters:
func : callable, string, dictionary, or list of string/callables. Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column names.
axis : (default 0) {0 or ‘index’, 1 or ‘columns’} 0 or ‘index’: apply function to each column. 1 or ‘columns’: apply function to each row.

Returns: a Series when each column is reduced by a single function, or a DataFrame when several functions are applied

In [8]:
import pandas as pd
 
def main():
    # create a dictionary with
    # three fields each
    data = {
            'A':[0, 2, 3], 
            'B':[4, 15, 6], 
            'C':[47, 8, 19] }
     
    # Convert the dictionary into DataFrame 
    df = pd.DataFrame(data)  
    print('Before applying aggregation: ')
    print(df)     
  
    print('After applying aggregation: ')
    df1 = df.aggregate(['sum', 'max']) 
    print(df1)

    df1 = df.aggregate({"A":['sum', 'min'], 
              "B":['max', 'min'], 
              "C":['min', 'sum']}) 
    print('After applying aggregation on Columns: ')
    # printing the new dataframe
    print(df1)
  
if __name__ == '__main__':
    main()

Before applying aggregation: 
   A   B   C
0  0   4  47
1  2  15   8
2  3   6  19
After applying aggregation: 
     A   B   C
sum  5  25  74
max  3  15  47
After applying aggregation on Columns: 
       A     B     C
sum  5.0   NaN  74.0
min  0.0   4.0   8.0
max  NaN  15.0   NaN
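
The `func` parameter also accepts a plain callable, and `axis=1` switches the aggregation from columns to rows. The following is a minimal sketch on the same three-column data; the helper name `value_range` is made up for illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 2, 3], "B": [4, 15, 6], "C": [47, 8, 19]})

# Hypothetical helper: spread between the largest and smallest value.
def value_range(values):
    return values.max() - values.min()

# A single callable produces one value per column (a Series).
col_range = df.aggregate(value_range)

# axis=1 aggregates across the columns, producing one value per row.
row_sum = df.aggregate("sum", axis=1)

print(col_range)
print(row_sum)
```

Passing a single function collapses each column (or row) to one number, which is why both results are Series rather than DataFrames.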


### Pandas DataFrame mean() 
The pandas DataFrame.mean() function returns the mean of the values along the requested axis. Applied to a Series, it returns a scalar: the mean of all observations in the Series. Applied to a DataFrame, it returns a Series containing the mean of the values along the specified axis.

Syntax: DataFrame.mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Parameters :

axis : {index (0), columns (1)}
skipna : Exclude NA/null values when computing the result
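level : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0; use groupby(level=...) followed by mean() instead.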
numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Returns : mean : Series or DataFrame (if level specified)

#### Calculate Mean on column

In [9]:
# importing pandas as pd 
import pandas as pd 
  
# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], 
                "B":[5, 2, 54, 3, 2], 
                "C":[20, 16, 7, 3, 8], 
                "D":[14, 3, 17, 2, 6]}) 
df1 = df.mean(axis = 0) 
print(df1)

A    13.2
B    13.2
C    10.8
D     8.4
dtype: float64


#### Calculate Mean on Row

In [11]:
# importing pandas as pd 
import pandas as pd 
  
# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                "B":[7, 2, 54, 3, None], 
                "C":[20, 16, 11, 3, 8],
                "D":[14, 3, None, 2, 6]}) 
  
# skip the Na values while finding the mean 
df.mean(axis = 1, skipna = True) 

0    13.250000
1     6.250000
2    23.333333
3     2.666667
4     5.000000
dtype: float64
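
To make the effect of `skipna` concrete, the sketch below computes the row means of the same data twice. It assumes nothing beyond the `skipna` parameter described above; the variable names `skipped` and `strict` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, None, 1],
                   "B": [7, 2, 54, 3, None],
                   "C": [20, 16, 11, 3, 8],
                   "D": [14, 3, None, 2, 6]})

# skipna=True: missing values are ignored, so each row mean
# is taken over the values that are present.
skipped = df.mean(axis=1, skipna=True)

# skipna=False: any row containing a missing value yields NaN.
strict = df.mean(axis=1, skipna=False)

print(strict)
```

Only the first two rows are complete, so they are the only ones with a numeric result when `skipna=False`.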

### Pandas dataframe.sem()
The pandas DataFrame.sem() function returns the unbiased standard error of the mean over the requested axis. The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution, or an estimate of that standard deviation. If the statistic is the mean, it is called the standard error of the mean (SEM).

Syntax : DataFrame.sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)

Parameters :
axis : {index (0), columns (1)}
skipna : Exclude NA/null values. If an entire row/column is NA, the result will be NA
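level : If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0; use groupby(level=...) followed by sem() instead.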
ddof : Delta Degrees of Freedom. The divisor used in calculations is N – ddof, where N represents the number of elements.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series

Return : sem : Series or DataFrame (if level specified)

In [15]:
# importing pandas as pd 
import pandas as pd 
  
# Creating the dataframe 
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], 
                "B":[5, 2, 54, 3, 2], 
                "C":[20, 16, 7, 3, 8], 
                "D":[14, 3, 17, 2, 6]}) 
df1 = df.sem(axis = 0)  
print(df1)

A     7.908224
B    10.214695
C     3.120897
D     3.009983
dtype: float64
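
The SEM is the sample standard deviation divided by the square root of the number of observations, where the standard deviation uses the N - ddof divisor. The sketch below checks that relationship against `sem()` on the same data; the names `manual` and `builtin` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 16, 7, 3, 8],
                   "D": [14, 3, 17, 2, 6]})

n = len(df)  # number of observations in each column

# Sample standard deviation (ddof=1) divided by sqrt(N).
manual = df.std(ddof=1) / np.sqrt(n)

# Built-in standard error of the mean, column by column.
builtin = df.sem(axis=0, ddof=1)

print(np.allclose(manual, builtin))
```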


### Pandas Series.value_counts()
The pandas Series.value_counts() function returns a Series containing the counts of unique values. The result is sorted in descending order, so the first element is the most frequently occurring value. NA values are excluded by default.

Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

Parameters :
normalize : If True then the object returned will contain the relative frequencies of the unique values.
sort : Sort by values.
ascending : Sort in ascending order.
bins : Rather than count values, group them into half-open bins, a convenience for pd.cut, only works with numeric data.
dropna : Don’t include counts of NaN.

Returns : counts : Series

In [18]:
import pandas as pd 
  
# Creating the Series 
sr = pd.Series(['New York', 'Chicago', 'Toronto', 'Lisbon', 'Rio', 'Chicago', 'Lisbon']) 
  
# Print the series 
print(sr) 
print('\n\nValue Count:')
print(sr.value_counts())

0    New York
1     Chicago
2     Toronto
3      Lisbon
4         Rio
5     Chicago
6      Lisbon
dtype: object


Value Count:
Chicago     2
Lisbon      2
New York    1
Toronto     1
Rio         1
Name: count, dtype: int64
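
The `normalize`, `dropna`, and `bins` parameters change what is counted. The sketch below is a small illustration with made-up data (the extra `None` entry and the `ages` Series are assumptions added for the example).

```python
import pandas as pd

# Same cities as above, plus one missing entry for illustration.
sr = pd.Series(['New York', 'Chicago', 'Toronto', 'Lisbon',
                'Rio', 'Chicago', 'Lisbon', None])

# Relative frequencies; the missing entry is dropped first.
shares = sr.value_counts(normalize=True)

# Keep the missing entry as its own category.
with_na = sr.value_counts(dropna=False)

# Numeric data can be grouped into equal-width bins before counting.
ages = pd.Series([1, 2, 3, 8, 9, 10])
binned = ages.value_counts(bins=2, sort=False)

print(shares)
print(with_na)
print(binned)
```

With `normalize=True` the values sum to 1, which is convenient when comparing Series of different lengths.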


### Group By
A DataFrame can be grouped by a single key column or by a list of key columns.

In [3]:
# Initialize the data set
import pandas as pd
import numpy as np
# data1 and data2 are random, so the numbers differ on every run
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
print(df)

  key1 key2     data1     data2
0    a  one  0.566805 -0.100761
1    a  two  0.611113 -0.332433
2    b  one -1.107966  0.157592
3    b  two  0.287503 -0.668305
4    a  one  0.693716  0.950847


#### Calculate mean based on key1

In [24]:
# calculate the mean of data1 for each value of key1
mean = df.groupby('key1')['data1'].mean()
print(mean)

key1
a    0.623878
b   -0.410231
Name: data1, dtype: float64


#### Group by a list of columns

In [23]:
# calculate the mean of data2 for each (key1, key2) pair
mean = df.groupby(['key1', 'key2'])['data2'].mean()
print(mean)

key1  key2
a     one     0.425043
      two    -0.332433
b     one     0.157592
      two    -0.668305
Name: data2, dtype: float64
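
Grouping by two keys produces a Series with a two-level index. One way to read it as a table is `unstack()`, which moves the inner level into the columns. The sketch below uses fixed numbers instead of random data so the result is easy to verify; the values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data2': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Mean of data2 for every (key1, key2) pair: a MultiIndex Series.
means = df.groupby(['key1', 'key2'])['data2'].mean()

# Pivot the inner index level (key2) into columns.
table = means.unstack()

print(table)
```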


In [10]:
# calculate the number of rows in each group
sizes = df.groupby(df['key1']).size()
print(sizes)

key1
a    3
b    2
dtype: int64


In [14]:
print(dict(list(df.groupby('key1'))))

{'a':   key1 key2     data1     data2
0    a  one  0.566805 -0.100761
1    a  two  0.611113 -0.332433
4    a  one  0.693716  0.950847, 'b':   key1 key2     data1     data2
2    b  one -1.107966  0.157592
3    b  two  0.287503 -0.668305}
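
A GroupBy object can also be iterated directly, yielding each key together with the matching sub-frame, or queried for a single group with `get_group`. The sketch below uses small fixed data (an assumption for illustration) rather than the random frame above.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Iterating yields (group key, sub-DataFrame) pairs.
group_means = {}
for name, group in df.groupby('key1'):
    group_means[name] = group['data1'].mean()

# Fetch one group without iterating over all of them.
only_b = df.groupby('key1').get_group('b')

print(group_means)
print(only_b)
```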


#### Calculate quantile on key

In [30]:
print(df.groupby('key1')['data1'].quantile(0.9))

key1
a    0.677195
b    0.147956
Name: data1, dtype: float64
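
By default the group quantile interpolates linearly between the two nearest sorted values. The sketch below reproduces that by hand on fixed data, assuming the default `interpolation='linear'`; the numbers are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'b'],
                   'data1': [1.0, 2.0, 3.0, 10.0, 20.0]})

q90 = df.groupby('key1')['data1'].quantile(0.9)

# Group a has 3 sorted values, so the 0.9 quantile sits at position
# 0.9 * (3 - 1) = 1.8: 80% of the way from 2.0 to 3.0.
manual_a = 2.0 + 0.8 * (3.0 - 2.0)

# Group b has 2 values: position 0.9 * (2 - 1) = 0.9 between 10 and 20.
manual_b = 10.0 + 0.9 * (20.0 - 10.0)

print(q90)
```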


#### Use a custom function as the aggregation function

In [48]:
# define your own aggregation function: the range of each group
def peak_to_peak(arr):
    return arr.max() - arr.min()

result = df.groupby('key1')['data1'].agg(peak_to_peak)
print(result)


key1
a    0.126910
b    1.395469
Name: data1, dtype: float64
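
A custom function can be combined with built-in aggregations in a single `agg` call, either as a list or as keyword arguments that name the output columns. The sketch below uses fixed data, and the output names `spread` and `average` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 7.0, 6.0]})

def peak_to_peak(arr):
    return arr.max() - arr.min()

# A list of functions gives one output column per function,
# named after the string or the function's __name__.
summary = df.groupby('key1')['data1'].agg(['mean', 'max', peak_to_peak])

# Keyword arguments choose the output column names explicitly.
named = df.groupby('key1')['data1'].agg(spread=peak_to_peak, average='mean')

print(summary)
print(named)
```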


#### Mapping columns

In [49]:
# Initialize the data set
import pandas as pd
import numpy as np
people = pd.DataFrame(
    np.random.randn(5, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
)
# map the columns to colors; the unused key 'f' is simply ignored
mapping = {'a' : 'red', 'b' : 'red', 'c' : 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}
# sum the columns that share a color
# groupby(axis=1) is deprecated since pandas 2.1, so group the transposed frame instead
print(people.T.groupby(mapping).sum().T)


            blue       red
Joe     0.963148  2.772641
Steve  -0.068170  1.132276
Wes    -0.029720  0.266524
Jim     0.807887  1.358557
Travis -0.852801 -0.508133


#### Optimized groupby methods
| Function Name | Description |
| --- | --- |
| count | Number of non-NA values in the group. |
| sum | Sum of non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values. |
| std, var | Unbiased (n-1 denominator) standard deviation and variance. |
| min, max | Minimum and maximum of non-NA values. |
| prod | Product of non-NA values. |
| first, last | First and last non-NA values. |
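
The "non-NA" wording in the table matters when a group contains missing values. The sketch below applies several of the listed methods to a small made-up column with NaN entries and contrasts `count()` with `size()`, which counts every row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, 3.0, np.nan, 4.0]})

grouped = df.groupby('key')['value']

# Each method skips the NaN entries inside a group.
summary = pd.DataFrame({
    'count': grouped.count(),
    'sum': grouped.sum(),
    'mean': grouped.mean(),
    'first': grouped.first(),
    'last': grouped.last(),
})

# size() counts rows, including those holding NaN.
sizes = grouped.size()

print(summary)
print(sizes)
```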

In [76]:
# Initialize the data set
import pandas as pd
tips = pd.DataFrame({
    'total_bill' : [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    'tip' : [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    'sex' : ['Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
    'smoker' : ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'day' : ['Sun', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'time' : ['Dinner', 'Lunch', 'Dinner', 'Dinner', 'Lunch', 'Dinner'],
    'size' : [2, 3, 3, 2, 4, 4]
})
# tip as a fraction of the total bill
tips['tips_pct'] = tips['tip'] / tips['total_bill']
print(tips)
# mean tip fraction for each (sex, smoker) group, using the optimized mean()
aggr = tips.groupby(['sex', 'smoker'])['tips_pct'].mean()
print(aggr)


   total_bill   tip     sex smoker  day    time  size  tips_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sat   Lunch     3  0.160542
2       21.01  3.50    Male     No  Sat  Dinner     3  0.166587
3       23.68  3.31    Male    Yes  Sun  Dinner     2  0.139780
4       24.59  3.61  Female    Yes  Sun   Lunch     4  0.146808
5       25.29  4.71    Male    Yes  Sun  Dinner     4  0.186240
sex     smoker
Female  No        0.059447
        Yes       0.146808
Male    No        0.163564
        Yes       0.163010
Name: tips_pct, dtype: float64


### Random Sampling and Permutation

In [6]:
import pandas as pd
import numpy as np
# Spade, Heart, Diamond, Club
suits = ['S', 'H', 'D', 'C']
# card values: the ace counts as 1, face cards count as 10
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = []
for suit in suits:
    cards.extend(suit + str(name) for name in base_names)
deck = pd.Series(card_val, index=cards)

def draw(deck, n=5):
    # pick n cards at random positions, without replacement
    return deck.take(np.random.permutation(len(deck))[:n])

# draw two random cards from each suit (the first character of the label)
candidates = deck.groupby(lambda card: card[0], group_keys=False).apply(draw, 2)
print(candidates)

CQ    10
C4     4
DA     1
D4     4
H5     5
HK    10
S2     2
SJ    10
dtype: int64
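
The draw above changes on every run. When a repeatable draw is needed, one option is the group-wise `sample` method with a fixed `random_state`. This is a sketch of an alternative, not the approach used in the cell above, and the seed value 42 is arbitrary.

```python
import pandas as pd

suits = ['S', 'H', 'D', 'C']
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
card_val = (list(range(1, 11)) + [10] * 3) * 4
cards = [suit + str(name) for suit in suits for name in base_names]
deck = pd.Series(card_val, index=cards)

# Group by suit (first character of the label) and sample two
# cards per group; the fixed seed makes the draw repeatable.
by_suit = deck.groupby(lambda card: card[0])
picked = by_suit.sample(n=2, random_state=42)
picked_again = by_suit.sample(n=2, random_state=42)

print(picked)
```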


#### Group Weighted Average

In [8]:
import pandas as pd
import numpy as np
df = pd.DataFrame({'category' : ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   # weights drawn from [0, 1) so they are non-negative
                   'weights' : np.random.rand(8)})
# weighted average of data within each category
category = df.groupby('category').apply(lambda g : np.average(g['data'], weights=g['weights']))
print(category)

category
a    0.594622
b    1.272249
dtype: float64
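
The weighted average of a group is sum(weight * value) / sum(weight). The sketch below computes it by hand and with `np.average` on fixed, made-up numbers so the result can be checked; the helper name `weighted_mean` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 10.0, 20.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})

# Hypothetical helper: sum(w * x) / sum(w) for one group.
def weighted_mean(g):
    return (g['data'] * g['weights']).sum() / g['weights'].sum()

# Select the value columns explicitly so each group passed to
# the function contains only 'data' and 'weights'.
groups = df.groupby('category')[['data', 'weights']]

by_hand = groups.apply(weighted_mean)
with_numpy = groups.apply(lambda g: np.average(g['data'], weights=g['weights']))

print(by_hand)
```

Group a gives (1*1 + 3*3) / 4 = 2.5 and group b gives (10 + 20) / 2 = 15.0, matching both computations.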
