## About aggregates in pandas
An aggregate statistic summarizes a group of numbers as a single number.
Common aggregate statistics include the mean, median, and standard deviation.

This notebook covers:

- How to compute aggregate statistics over rows that share the same value, using `groupby`.
- How to rearrange a DataFrame into a pivot table, a convenient way to compare data across two dimensions.


#### Commands and descriptions
- .mean() : Average/Mean of all values in column
- .std() : Standard deviation
- .median() : Median
- .max() : Maximum value in column
- .min() : Minimum value in column
- .count() : Number of values in column
- .nunique() : Number of unique values in column
- .unique() : Array of unique values in column

In [1]:
# The general syntax for these calculations is:
# df.column_name.command()
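
As a minimal sketch of this syntax, the block below applies a few of the commands listed above to a small illustrative frame (a four-row subset of the data used later in this notebook; the variable names are only for illustration). It also shows the bracket form `df['age']`, which is equivalent to `df.age` and is needed when a column name contains spaces or clashes with a DataFrame method.

```python
import pandas as pd

# Small illustrative frame: a subset of the rows used later in this notebook.
df = pd.DataFrame(
    [['Nick', 'Arizona', 19],
     ['Sam', 'Boston', 23],
     ['Hill', 'Boston', 27],
     ['Ross', 'Arizona', 20]],
    columns=['name', 'university', 'age'],
)

mean_age = df.age.mean()                  # average of the "age" column
oldest = df['age'].max()                  # bracket access, same as df.age.max()
n_universities = df.university.nunique()  # number of distinct universities
universities = df.university.unique()     # NumPy array of the distinct values

print(mean_age, oldest, n_universities, universities)
```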

In [2]:
# Before using these commands, let's look at our DataFrame.
# For example, suppose we have something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['Nick', 'Arizona', 19, 7],
    ['Ron', 'Monclair', 21, 7],
    ['Sam', 'Boston', 23, 8],
    ['Stella', 'Missouri', 22, 9],
    ['Hill', 'Boston', 27, 9],
    ['Williams', 'Colorado', 22, 8],
    ['Miller', 'Boston', 22, 8],
    ['Ross', 'Arizona', 20, 7],
    ['Cook', 'Boston', 20, 8],
    ['Davis', 'Colorado', 29, 7],
    ['Brown', 'Boston', 22, 8]],
    columns = ['name', 'university', 'age', 'graduate_level']
)
df

Unnamed: 0,name,university,age,graduate_level
0,Nick,Arizona,19,7
1,Ron,Monclair,21,7
2,Sam,Boston,23,8
3,Stella,Missouri,22,9
4,Hill,Boston,27,9
5,Williams,Colorado,22,8
6,Miller,Boston,22,8
7,Ross,Arizona,20,7
8,Cook,Boston,20,8
9,Davis,Colorado,29,7
10,Brown,Boston,22,8


In [3]:
# Compute column statistics.
# Example with the "age" column of df:
most_age = df.age.max()
print(most_age)

# NumPy gives the same statistic:
most_age_np = np.max(df.age)
print(most_age_np)

29
29


In [4]:
# In general, we use the following syntax to calculate aggregates:
# df.groupby('column1').column2.measurement()

In [5]:
# How do we get the mean, median, etc. for each group?
# With the .groupby method: df.groupby('column1').column2.measurement()
# The rows are grouped first, then the statistic is computed within each group.

df_university = df.groupby('university').graduate_level.mean()
print(df_university)
print("####################################")

df_university_level = df.groupby(['university', 'graduate_level']).graduate_level.mean()
print(df_university_level)

university
Arizona     7.0
Boston      8.2
Colorado    7.5
Missouri    9.0
Monclair    7.0
Name: graduate_level, dtype: float64
####################################
university  graduate_level
Arizona     7                 7
Boston      8                 8
            9                 9
Colorado    7                 7
            8                 8
Missouri    9                 9
Monclair    7                 7
Name: graduate_level, dtype: int64
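
When several statistics are needed for the same groups, one possible approach is to pass a list of function names to `.agg()` instead of calling each measurement separately. The sketch below assumes the same `university` and `age` values as the DataFrame above; `summary` is just an illustrative name.

```python
import pandas as pd

# Same university/age values as the DataFrame used in this notebook.
df = pd.DataFrame({
    'university': ['Arizona', 'Monclair', 'Boston', 'Missouri', 'Boston', 'Colorado',
                   'Boston', 'Arizona', 'Boston', 'Colorado', 'Boston'],
    'age': [19, 21, 23, 22, 27, 22, 22, 20, 20, 29, 22],
})

# One row per university, one column per statistic.
summary = df.groupby('university').age.agg(['mean', 'min', 'max', 'count'])
print(summary)
```

The result is a DataFrame indexed by `university`, so a single value can be read with `summary.loc['Boston', 'mean']`.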


In [6]:
# A groupby statement is often followed by .reset_index(), which turns the
# resulting Series into a DataFrame and moves the index levels into their own columns.
# Without it, as here, the grouping columns stay in the index:
df_university_level = df.groupby(['university', 'graduate_level']).graduate_level.mean()
df_university_level

university  graduate_level
Arizona     7                 7
Boston      8                 8
            9                 9
Colorado    7                 7
            8                 8
Missouri    9                 9
Monclair    7                 7
Name: graduate_level, dtype: int64
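
To illustrate the effect of `.reset_index()`, here is a small sketch on a hypothetical five-row subset of the data. It counts names per (university, graduate_level) pair rather than averaging `graduate_level`, because when the aggregated column is also one of the grouping keys, `.reset_index()` would typically fail with a name clash (the same column name would have to be inserted twice).

```python
import pandas as pd

# Hypothetical five-row subset of the notebook's data, for illustration only.
df = pd.DataFrame(
    [['Sam', 'Boston', 8],
     ['Hill', 'Boston', 9],
     ['Miller', 'Boston', 8],
     ['Nick', 'Arizona', 7],
     ['Ross', 'Arizona', 7]],
    columns=['name', 'university', 'graduate_level'],
)

# Series with a two-level index (university, graduate_level).
counts = df.groupby(['university', 'graduate_level']).name.count()

# DataFrame with a default integer index; the former index levels are now columns.
counts_df = counts.reset_index()
print(counts_df)
```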

In [7]:
# How do we calculate something more complicated than a mean or a count?
# We can use the .apply() method with a lambda function, like this:
df.groupby('university').graduate_level.apply(lambda x: np.percentile(x, 25))

university
Arizona     7.00
Boston      8.00
Colorado    7.25
Missouri    9.00
Monclair    7.00
Name: graduate_level, dtype: float64
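
As a sanity check on the `.apply()` pattern, the sketch below (on a hypothetical seven-row subset of the data) compares the lambda-based 25th percentile with the built-in `.quantile(0.25)`, which should agree under the default linear interpolation. It also shows a statistic that has no dedicated method, the range within each group. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the notebook's data, for illustration only.
df = pd.DataFrame({
    'university': ['Arizona', 'Boston', 'Boston', 'Colorado', 'Boston', 'Arizona', 'Colorado'],
    'graduate_level': [7, 8, 9, 8, 8, 7, 7],
})

grouped = df.groupby('university').graduate_level

# 25th percentile via a lambda, as in the cell above.
q25_apply = grouped.apply(lambda x: np.percentile(x, 25))

# The same statistic via the built-in quantile method.
q25_builtin = grouped.quantile(0.25)

# A custom statistic: range (max minus min) within each group.
level_range = grouped.apply(lambda x: x.max() - x.min())

print(q25_apply, q25_builtin, level_range, sep='\n')
```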

In [8]:
# Pivot table:
# A new DataFrame to illustrate the .pivot_table() method:
# three shops with daily sales.

'''
General syntax (.pivot_table() accepts the same three arguments):
df.pivot(columns='ColumnToPivot',
         index='ColumnToBeRows',
         values='ColumnToBeValues')
'''
df_shops = pd.DataFrame([
    ['Shop 1', 'Saturday', 234.76],
    ['Shop 1', 'Wednesday', 434.66],
    ['Shop 2', 'Friday', 712.74],
    ['Shop 3', 'Thursday', 87.93],
    ['Shop 1', 'Monday', 876.12],
    ['Shop 2', 'Wednesday', 345.88],
    ['Shop 1', 'Friday', 684.16],
    ['Shop 3', 'Wednesday', 304.56],
    ['Shop 2', 'Monday', 345.71],
    ['Shop 3', 'Friday', 934.63]],
    columns = ['shop', 'day', 'sales'])
df_shops

Unnamed: 0,shop,day,sales
0,Shop 1,Saturday,234.76
1,Shop 1,Wednesday,434.66
2,Shop 2,Friday,712.74
3,Shop 3,Thursday,87.93
4,Shop 1,Monday,876.12
5,Shop 2,Wednesday,345.88
6,Shop 1,Friday,684.16
7,Shop 3,Wednesday,304.56
8,Shop 2,Monday,345.71
9,Shop 3,Friday,934.63


In [9]:
# Why apply the .pivot_table() method?
# It lays the sales out as one row per shop and one column per day, which is easier to compare.
df_shops = df_shops.pivot_table(index = 'shop', columns = 'day', values = 'sales').reset_index()
df_shops

# .reset_index() is optional; it turns 'shop' from the index into a regular column.

day,shop,Friday,Monday,Saturday,Thursday,Wednesday
0,Shop 1,684.16,876.12,234.76,,434.66
1,Shop 2,712.74,345.71,,,345.88
2,Shop 3,934.63,,,87.93,304.56
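
The empty cells above are shop/day combinations with no sales row. Two `pivot_table` parameters are worth knowing here: `aggfunc`, which controls how several rows for the same shop and day are combined (the default is the mean), and `fill_value`, which replaces missing combinations. The sketch below uses made-up sales figures, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Made-up figures for illustration; 'Shop 1' has two rows for Monday.
sales = pd.DataFrame(
    [['Shop 1', 'Monday', 100.0],
     ['Shop 1', 'Monday', 50.0],
     ['Shop 1', 'Friday', 70.0],
     ['Shop 2', 'Monday', 30.0]],
    columns=['shop', 'day', 'sales'],
)

# Default behavior: duplicate shop/day rows are averaged, missing ones are NaN.
mean_table = sales.pivot_table(index='shop', columns='day', values='sales')

# Sum duplicates instead, and show 0 where a shop has no sales for a day.
total_table = sales.pivot_table(index='shop', columns='day', values='sales',
                                aggfunc='sum', fill_value=0)

print(mean_table)
print(total_table)
```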
