## How to Aggregate Data in Pandas

In [30]:
import numpy as np
import pandas as pd

In [5]:
url = "https://raw.githubusercontent.com/nyangweso-rodgers/Data_Analytics/main/Analytics-with-Python/Exploratory-Data-Analysis-with-Python/Exploratory-Data-Analysis-for-Online-Retail-Store/grouped_daily_customer_data.csv"
online_retail_data = pd.read_csv(url, encoding='unicode_escape')

In [6]:
online_retail_data.head(3)

Unnamed: 0.1,Unnamed: 0,CustomerID,Date,Country,TotalAmount,CountOfUniqueInvoices
0,0,12346.0,2011-01-18,United Kingdom,77183.6,1
1,1,12347.0,2010-12-07,Iceland,711.79,1
2,2,12347.0,2011-01-26,Iceland,475.39,1
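The preview shows two leftover index columns (`Unnamed: 0.1` and `Unnamed: 0`), and `read_csv` does not parse dates unless asked to. The following is a minimal sketch of one way to tidy this up at load time. It assumes an in-memory CSV built from the first two preview rows, and the names `csv_text`, `raw` and `clean` are illustrative only, not part of the original notebook.

```python
import io

import pandas as pd

# Stand-in for the real file: header and first two rows from the preview above
csv_text = """Unnamed: 0.1,Unnamed: 0,CustomerID,Date,Country,TotalAmount,CountOfUniqueInvoices
0,0,12346.0,2011-01-18,United Kingdom,77183.6,1
1,1,12347.0,2010-12-07,Iceland,711.79,1
"""

# parse_dates converts the Date column to datetime64 while reading
raw = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# keep every column whose name does not start with "Unnamed"
clean = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]

print(clean.dtypes)
```

With a real datetime column, `min` and `max` compare dates rather than strings. For ISO-formatted dates like these the ordering is the same either way, so the rest of the notebook works without this step.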


### Aggregate

#### Multiple functions on the same feature
* For each __CustomerID__ and __Country__ pair, calculate the following statistics of __TotalAmount__:
  * sum
  * minimum
  * maximum
  * mean, and
  * skewness.

In [19]:
# set up
segments = ['CustomerID', 'Country']
feature = ['TotalAmount']

# Skewness is a statistical concept that captures the degree of asymmetry in a distribution.
# We can add more functions to the list.
functions = ['sum', 'min', 'max', 'mean', 'skew']

# apply it
customer_summary_data = online_retail_data.groupby(segments)[feature].agg(functions)

# optionally save to a CSV (uncomment to run)
# customer_summary_data.to_csv("customer_summary_data.csv")

# preview the data
customer_summary_data

Unnamed: 0_level_0,Unnamed: 1_level_0,TotalAmount,TotalAmount,TotalAmount,TotalAmount,TotalAmount
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,min,max,mean,skew
CustomerID,Country,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
12346.0,United Kingdom,77183.60,77183.60,77183.60,77183.600000,
12347.0,Iceland,4310.00,224.82,1294.32,615.714286,1.400805
12348.0,Finland,1797.24,227.44,892.80,449.310000,1.782797
12349.0,Italy,1757.55,1757.55,1757.55,1757.550000,
12350.0,Norway,334.40,334.40,334.40,334.400000,
...,...,...,...,...,...,...
18280.0,United Kingdom,180.60,180.60,180.60,180.600000,
18281.0,United Kingdom,80.82,80.82,80.82,80.820000,
18282.0,United Kingdom,178.05,77.84,100.21,89.025000,
18283.0,United Kingdom,2094.88,99.47,313.65,149.634286,1.585590
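The empty `skew` cells in the output above are missing values: pandas returns `NaN` for the skewness of a group with fewer than three observations. A small sketch on made-up data shows the behaviour; the `toy` frame and its values are hypothetical and not taken from the retail dataset.

```python
import pandas as pd

# Hypothetical data: customer 1 has one row, customer 2 has two, customer 3 has three
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3, 3, 3],
    "TotalAmount": [10.0, 20.0, 30.0, 5.0, 6.0, 20.0],
})

# count alongside skew makes it easy to see which groups are too small
toy_skew = toy.groupby("CustomerID")["TotalAmount"].agg(["count", "skew"])
print(toy_skew)
```

Only customer 3 gets a numeric skewness here. Including `count` in the function list is a cheap way to tell a missing skewness apart from a data problem.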


#### Multiple functions on multiple features
* There are various ways to do this. We’ll cover two different approaches. 
  * First up, __using lists__.
  * Second, using a __Dictionary__.

In [17]:
# set up
segments = ['CustomerID', 'Country']
features = ['Date', 'TotalAmount']
functions = ['min', 'max']

# apply
online_retail_data.groupby(segments)[features].agg(functions)

Unnamed: 0_level_0,Unnamed: 1_level_0,Date,Date,TotalAmount,TotalAmount
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,min,max
CustomerID,Country,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
12346.0,United Kingdom,2011-01-18,2011-01-18,77183.60,77183.60
12347.0,Iceland,2010-12-07,2011-12-07,224.82,1294.32
12348.0,Finland,2010-12-16,2011-09-25,227.44,892.80
12349.0,Italy,2011-11-21,2011-11-21,1757.55,1757.55
12350.0,Norway,2011-02-02,2011-02-02,334.40,334.40
...,...,...,...,...,...
18280.0,United Kingdom,2011-03-07,2011-03-07,180.60,180.60
18281.0,United Kingdom,2011-06-12,2011-06-12,80.82,80.82
18282.0,United Kingdom,2011-08-05,2011-12-02,77.84,100.21
18283.0,United Kingdom,2011-01-06,2011-12-06,99.47,313.65


* _Remarks_: 
   * _Here, instead of applying various functions to a single feature, we’re applying the same functions to various features. The only real difference here is that we’re specifying the features in a list._
   * _What about applying a number of different functions to various features? For that, we’ll need a dictionary._

In [23]:
# set up
segments = ['CustomerID', 'Country']
functions = {
    'TotalAmount': ['sum', 'min', 'max'],
    'Date': ['min', 'max']
}

# apply
customer_summary_data = online_retail_data.groupby(segments).agg(functions)

# save the results to a CSV
customer_summary_data.to_csv("customer_summary_data.csv")

# preview the data
customer_summary_data

Unnamed: 0_level_0,Unnamed: 1_level_0,TotalAmount,TotalAmount,TotalAmount,Date,Date
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,min,max,min,max
CustomerID,Country,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
12346.0,United Kingdom,77183.60,77183.60,77183.60,2011-01-18,2011-01-18
12347.0,Iceland,4310.00,224.82,1294.32,2010-12-07,2011-12-07
12348.0,Finland,1797.24,227.44,892.80,2010-12-16,2011-09-25
12349.0,Italy,1757.55,1757.55,1757.55,2011-11-21,2011-11-21
12350.0,Norway,334.40,334.40,334.40,2011-02-02,2011-02-02
...,...,...,...,...,...,...
18280.0,United Kingdom,180.60,180.60,180.60,2011-03-07,2011-03-07
18281.0,United Kingdom,80.82,80.82,80.82,2011-06-12,2011-06-12
18282.0,United Kingdom,178.05,77.84,100.21,2011-08-05,2011-12-02
18283.0,United Kingdom,2094.88,99.47,313.65,2011-01-06,2011-12-06


* Notice how using a dictionary with a list of functions to apply gives us much more flexibility in terms of what we apply? Neat!
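The dictionary approach produces a two-level column header (feature on top, function below), which can be awkward to work with later. One possible way to flatten it is to join the two levels into a single name. This is a sketch on hypothetical toy data; the `toy` and `summary` names are illustrative only.

```python
import pandas as pd

# Hypothetical data with the same column names as the retail dataset
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "TotalAmount": [10.0, 30.0, 5.0],
    "Date": ["2011-01-01", "2011-02-01", "2011-03-01"],
})

summary = toy.groupby("Country").agg({
    "TotalAmount": ["sum", "max"],
    "Date": ["min", "max"],
})

# each column label is a (feature, function) tuple; join the parts with "_"
summary.columns = ["_".join(col) for col in summary.columns]

# move the grouping key from the index back into a regular column
summary = summary.reset_index()
print(summary)
```

The result has plain column names such as `TotalAmount_sum` and `Date_min`, which also round-trip more cleanly through `to_csv`.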

### Using Tuples (More tuples, more flexibility)
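* Passing keyword arguments of the form __new_name = (column, function)__ to __agg()__ (named aggregation) will:
   1. Return a DataFrame with flat column labels instead of a two-level column header. (The grouping keys still form the row index, so __reset_index()__ is only needed if you want them back as regular columns.)
   2. Allow us to specify the names of each column returned.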

In [29]:
# set up
segments = ['CustomerID', 'Country']

# apply
online_retail_data.groupby(segments).agg(
    customer_first_purchase_date = ('Date', 'min'),
    customer_last_purchase_date = ('Date', 'max'),
    total_purchase_amount = ('TotalAmount', 'sum'),
    average_purchase_amount = ('TotalAmount', 'mean'),
    median_purchase_amount = ('TotalAmount', 'median'),
    count_of_delivery_dates = ('Date', 'count')
)

Unnamed: 0_level_0,Unnamed: 1_level_0,customer_first_purchase_date,customer_last_purchase_date,total_purchase_amount,average_purchase_amount,median_purchase_amount,count_of_delivery_dates
CustomerID,Country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
12346.0,United Kingdom,2011-01-18,2011-01-18,77183.60,77183.600000,77183.600,1
12347.0,Iceland,2010-12-07,2011-12-07,4310.00,615.714286,584.910,7
12348.0,Finland,2010-12-16,2011-09-25,1797.24,449.310000,338.500,4
12349.0,Italy,2011-11-21,2011-11-21,1757.55,1757.550000,1757.550,1
12350.0,Norway,2011-02-02,2011-02-02,334.40,334.400000,334.400,1
...,...,...,...,...,...,...,...
18280.0,United Kingdom,2011-03-07,2011-03-07,180.60,180.600000,180.600,1
18281.0,United Kingdom,2011-06-12,2011-06-12,80.82,80.820000,80.820,1
18282.0,United Kingdom,2011-08-05,2011-12-02,178.05,89.025000,89.025,2
18283.0,United Kingdom,2011-01-06,2011-12-06,2094.88,149.634286,116.165,14


* Using tuples allows us to name each summarised feature explicitly, which is useful when the column names of the resulting __DataFrame__ need to be specific.
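Named aggregation controls the column labels, but the grouping keys still end up in the row index, as the output above shows for CustomerID and Country. If the keys are wanted as ordinary columns, one option is `as_index=False` on `groupby`. The sketch below uses hypothetical toy data, and `purchase_count` is an illustrative column name.

```python
import pandas as pd

# Hypothetical data with the same column names as the retail dataset
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Country": ["Iceland", "Iceland", "Norway"],
    "TotalAmount": [100.0, 50.0, 30.0],
})

# as_index=False keeps the grouping keys as regular columns
flat = toy.groupby(["CustomerID", "Country"], as_index=False).agg(
    total_purchase_amount=("TotalAmount", "sum"),
    purchase_count=("TotalAmount", "count"),
)
print(flat)
```

Calling `.reset_index()` on the grouped result should give the same layout; `as_index=False` simply avoids the extra step.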

### Exotic functions

In [34]:
# bespoke function
def scaled_median(s):
    # scales Series median by ratio of Series max to Series min
    return s.median() * s.max() / s.min()

segments = ['Country']

# apply
online_retail_data.groupby(segments).agg(
    # built-in aggregations are passed by name; only the bespoke function is passed as a callable
    country_min_purchased_value = ('TotalAmount', 'min'),
    country_max_purchased_value = ('TotalAmount', 'max'),
    country_median_purchased_value = ('TotalAmount', 'median'),
    country_scaled_median_purchased_value = ('TotalAmount', scaled_median)
)

Unnamed: 0_level_0,country_min_purchased_value,country_max_purchased_value,country_median_purchased_value,country_scaled_median_purchased_value
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Australia,81.6,23426.81,467.295,134157.2
Austria,153.76,1542.08,534.7,5362.579
Bahrain,89.0,459.4,274.2,1415.365
Belgium,45.6,1491.59,350.57,11467.25
Brazil,1143.6,1143.6,1143.6,1143.6
Canada,51.56,1217.64,542.59,12813.8
Channel Islands,198.4,2060.03,914.24,9492.751
Cyprus,15.0,2876.85,641.38,123010.3
Czech Republic,277.48,549.26,413.37,818.2485
Denmark,168.9,3978.99,515.1,12134.86
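Because `scaled_median` divides by the group minimum, a group whose smallest value is zero would not give a meaningful result. The following is a sketch of a guarded variant checked by hand on toy data. `safe_scaled_median` is a hypothetical helper, not part of the original notebook, and returning `NaN` for a zero minimum is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

def safe_scaled_median(s):
    """Median scaled by max/min; returns NaN when the minimum is zero."""
    lowest = s.min()
    if lowest == 0:
        return np.nan
    return s.median() * s.max() / lowest

# Hypothetical data: group A is well behaved, group B has a zero minimum
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "TotalAmount": [2.0, 4.0, 8.0, 0.0, 5.0],
})

result = toy.groupby("Country").agg(
    scaled_median=("TotalAmount", safe_scaled_median),
)
print(result)
```

For group A the median is 4, the maximum 8 and the minimum 2, so the scaled median is 4 * 8 / 2 = 16. Group B is reported as missing instead of producing an infinite value.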
