# Summarizing Data & Chaining Commands

**These links will lead you to the guide on GitHub!**

[**Basic Summary Commands**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#basic-summary-commands)
* [Count rows, count non-NA values - `len()` and `count()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#count-rows-count-non-na-values)
* [Maximum, mean, median, and sum of a column - `max()`, `mean()`, `median()`, and `sum()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#maximum-mean-median-and-sum-of-a-column)
* [Summary Statistics - `describe()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#summary-statistics)


[**Grouping Columns - `groupby()`**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-columns---groupby)
* [Count number of occurrences of each item in a column - `value_counts()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#count-number-of-occurrences-of-each-item-in-a-column---value_counts)
* [Grouping 2 columns and count number of occurrences of each group](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-2-columns-and-count-number-of-occurrences-of-each-group)
* [Grouping 2 columns and finding the mean of another column](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-2-columns-and-finding-the-mean-of-another-column)
* [Returning results as a dataframe rather than a series](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#returning-results-as-a-dataframe-rather-than-a-series)
* [Returning the index as columns instead of an index](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#returning-the-index-as-columns-instead-of-an-index)

[**Aggregating Columns - `agg()`**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#aggregating-columns---agg)
* [Grouping 2 columns and using `agg` to count number of occurrences of each group](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-2-columns-and-using-agg-to-count-number-of-occurrences-of-each-group)
* [Grouping 2 columns and counting the number of **distinct** occurrences in each group](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-2-columns-and-counting-the-number-of-distinct-occurrences-in-each-group)
* [Group 2 columns, count distinct number of items, and rename a column](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#group-2-columns-count-distinct-number-of-items-and-rename-a-column)
* [Group 2 columns, calculate mean of another based on a single distinct customer](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#group-2-columns-calculate-mean-of-another-based-on-a-single-distinct-customer)

[**Using Multiple `agg()` Variables**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#using-multiple-agg-variables)
* [Group 2 columns, then count distinct and calculate mean of different columns](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#group-2-columns-then-count-distinct-and-calculate-mean-of-different-columns)
* [Calculate multiple aggregations on a single column](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#calculate-multiple-aggregations-on-a-single-column)

[**Chaining Commands - Putting it All Together**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#method-chaining---putting-it-all-together)
* [Preface](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#preface)
* [All-in-One: Group 2 columns, index as a column, calculate mean, rename a column for a single distinct user](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#all-in-one-group-2-columns-index-as-a-column-calculate-mean-rename-a-column-for-a-single-distinct-user)
* [All-in-One: Slicing for a group of specific users](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#all-in-one-slicing-for-a-group-of-specific-users)

[**Swapping Columns and Rows**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#swapping-columns-and-rows)
* [Turning index values into column names - `unstack()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#turning-index-values-into-column-names---unstack)
* [Turning columns names into index names - `stack()`](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#turning-columns-names-into-index-names---stack)
* [Transpose](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#transpose)


[**Grouping by Time - `resample()`**](https://github.com/kn-kn/python-guide/wiki/Summarizing-Data-&-Chaining-Commands#grouping-by-time---resample)

This section uses the [Phone Data File](https://github.com/kn-kn/python-guide/blob/master/phone_data.csv).

In [1]:
import pandas as pd
import numpy as np

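# Adjust this path to wherever you saved phone_data.csv
infile = "C:/Users/kenguyen/Downloads/phone_data.csv"
df = pd.read_csv(infile)
# Parse the day-first timestamps explicitly
# (pd.datetime was removed in pandas 2.0 and date_parser is deprecated)
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y %H:%M')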

In [2]:
df.head(5)

,user_id,date,duration,item,month,network,network_type,total_cost
0,7,2014-10-15 06:58:00,34.429,data,2014-11,data,data,2.09
1,11,2014-10-15 06:58:00,13.0,call,2014-11,Vodafone,mobile,1.93
2,28,2014-10-15 14:46:00,23.0,call,2014-11,Meteor,mobile,0.14
3,23,2014-10-15 14:48:00,4.0,call,2014-11,Tesco,mobile,1.43
4,6,2014-10-15 17:27:00,4.0,call,2014-11,Tesco,mobile,1.34


# Basic Summary Commands

In [3]:
# Count rows
len(df)

830

In [4]:
# Count number of non-NA values in a column
df['duration'].count()

830
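
Both calls return 830 here because `duration` has no missing values. The following minimal sketch uses a made-up frame (`toy` is hypothetical and not part of the phone data) to show where `len()` and `count()` differ:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing duration
toy = pd.DataFrame({"duration": [10.0, np.nan, 30.0]})

n_rows = len(toy)                   # counts every row, including the NaN row
n_non_na = toy["duration"].count()  # counts only the non-missing values

print(n_rows, n_non_na)
```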

In [5]:
# Get maximum value of a column
df['duration'].max()

10528.0

In [6]:
# Mean of a column
df['duration'].mean()

117.80403614457829

In [7]:
# Median of a column
df['duration'].median()

24.5

In [8]:
# Sum of a column
df['duration'].sum()

97777.35000000002

In [9]:
# Get summary statistics of numerical columns in your dataset
df.describe()

,user_id,duration,total_cost
count,830.0,830.0,830.0
mean,17.759036,117.804036,4.090699
std,10.041759,444.12956,3.386263
min,1.0,1.0,0.0
25%,9.0,1.0,1.3025
50%,18.0,24.5,3.17
75%,26.0,55.0,6.16
max,35.0,10528.0,14.89


# Grouping Columns - `groupby()`

## Count number of occurrences of each item in a column - `value_counts()`

In [10]:
df['network'].value_counts()

Three        215
Vodafone     215
data         150
Meteor        87
Tesco         84
landline      42
voicemail     27
world          7
special        3
Name: network, dtype: int64

## Grouping 2 columns and count number of occurrences of each group

In [11]:
df.groupby('item')['network'].value_counts()

item  network  
call  Three        128
      Tesco         71
      Vodafone      66
      Meteor        54
      landline      42
      voicemail     27
data  data         150
sms   Vodafone     149
      Three         87
      Meteor        33
      Tesco         13
      world          7
      special        3
Name: network, dtype: int64
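
As a rough cross-check, the same counts can be obtained by grouping on both columns and calling `size()`. This is only a sketch on a made-up frame (`toy` is hypothetical); the two results hold the same numbers but may be ordered differently, since `value_counts()` sorts by count within each group.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "call", "sms"],
    "network": ["Three", "Three", "Tesco", "Three"],
})

# Count network values within each item group
vc = toy.groupby("item")["network"].value_counts()

# Group on both columns and count the rows in each group
sz = toy.groupby(["item", "network"]).size()

print(vc.to_dict())
print(sz.to_dict())
```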

## Grouping 2 columns and finding the mean of another column

In [12]:
df.groupby(['item', 'network'])['duration'].mean()

item  network  
call  Meteor       133.333333
      Tesco        194.760563
      Three        284.875000
      Vodafone     221.530303
      landline     438.880952
      voicemail     65.740741
data  data          34.429000
sms   Meteor         1.000000
      Tesco          1.000000
      Three          1.000000
      Vodafone       1.000000
      special        1.000000
      world          1.000000
Name: duration, dtype: float64

## Returning results as a dataframe rather than a series

Wrap the column name in double brackets (that is, pass a list of column names) when selecting the column to aggregate (e.g. count, mean, median). The result is then a DataFrame instead of a Series.

In [13]:
df.groupby(['item', 'network'])[['duration']].mean()

item,network,duration
call,Meteor,133.333333
call,Tesco,194.760563
call,Three,284.875
call,Vodafone,221.530303
call,landline,438.880952
call,voicemail,65.740741
data,data,34.429
sms,Meteor,1.0
sms,Tesco,1.0
sms,Three,1.0
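
To see the difference between single and double brackets directly, here is a small sketch on a made-up frame (`toy` is hypothetical) that checks the type of each result:

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "sms"],
    "network": ["Three", "Three", "Tesco"],
    "duration": [10.0, 20.0, 1.0],
})

# Single brackets select a Series, so the aggregation returns a Series
as_series = toy.groupby(["item", "network"])["duration"].mean()

# Double brackets select a one-column DataFrame, so the result is a DataFrame
as_frame = toy.groupby(["item", "network"])[["duration"]].mean()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```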


## Returning the index as columns instead of an index

Add the parameter `as_index=False` to your `groupby()` so the grouping columns come back as regular columns rather than as the index.

In [14]:
df.groupby(['item', 'network'], as_index=False)[['duration']].mean()

,item,network,duration
0,call,Meteor,133.333333
1,call,Tesco,194.760563
2,call,Three,284.875
3,call,Vodafone,221.530303
4,call,landline,438.880952
5,call,voicemail,65.740741
6,data,data,34.429
7,sms,Meteor,1.0
8,sms,Tesco,1.0
9,sms,Three,1.0
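
An alternative that should give the same layout is to group normally and call `reset_index()` afterwards. The sketch below compares the two approaches on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "sms"],
    "network": ["Three", "Three", "Tesco"],
    "duration": [10.0, 20.0, 1.0],
})

# Option 1: keep the group keys as columns from the start
flat_a = toy.groupby(["item", "network"], as_index=False)[["duration"]].mean()

# Option 2: group with the keys in the index, then move them back to columns
flat_b = toy.groupby(["item", "network"])[["duration"]].mean().reset_index()

print(flat_a)
```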


# Aggregating Columns - `agg()`

## Grouping 2 columns and using `agg` to count number of occurrences of each group

In [15]:
df.groupby(['item', 'network']).agg({'user_id':len})

item,network,user_id
call,Meteor,54
call,Tesco,71
call,Three,128
call,Vodafone,66
call,landline,42
call,voicemail,27
data,data,150
sms,Meteor,33
sms,Tesco,13
sms,Three,87


## Grouping 2 columns and counting the number of **distinct** occurrences in each group

In [16]:
df.groupby(['item', 'network']).agg({'user_id':pd.Series.nunique})

item,network,user_id
call,Meteor,29
call,Tesco,31
call,Three,32
call,Vodafone,29
call,landline,28
call,voicemail,17
data,data,33
sms,Meteor,21
sms,Tesco,11
sms,Three,30


## Group 2 columns, count distinct number of items, and rename a column

In [17]:
df.groupby(['item', 'network']).agg({'user_id':pd.Series.nunique}).rename(columns={'user_id': 'Count of Unique Users'})

item,network,Count of Unique Users
call,Meteor,29
call,Tesco,31
call,Three,32
call,Vodafone,29
call,landline,28
call,voicemail,17
data,data,33
sms,Meteor,21
sms,Tesco,11
sms,Three,30
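
pandas 0.25 and later also support named aggregation, which sets the output column name inside `agg()` and so avoids the separate `rename()` step. This is a sketch on a made-up frame; `toy` and the output name `unique_users` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "call", "sms"],
    "network": ["Three", "Three", "Three", "Tesco"],
    "user_id": [1, 2, 2, 3],
})

# Named aggregation: output_name=(input_column, aggregation)
renamed = toy.groupby(["item", "network"]).agg(
    unique_users=("user_id", "nunique")
)

print(renamed)
```

The string `"nunique"` is used here in place of `pd.Series.nunique`; both request a distinct count.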


## Group 2 columns, calculate mean of another based on a single distinct customer

Combine everything above, but first filter the rows with a boolean mask. For example, to see the average duration for the customer with an ID of 15 for every item-network pair:

In [18]:
df[df['user_id'] == 15].groupby(['item', 'network'])['duration'].mean()

item  network 
call  Three        13.000
      Vodafone      4.250
      landline    251.500
data  data         34.429
sms   Meteor        1.000
      Three         1.000
      Vodafone      1.000
Name: duration, dtype: float64

In [19]:
# And if we want a dataframe version of the above, double bracket:
df[df['user_id'] == 15].groupby(['item', 'network'])[['duration']].mean()

item,network,duration
call,Three,13.0
call,Vodafone,4.25
call,landline,251.5
data,data,34.429
sms,Meteor,1.0
sms,Three,1.0
sms,Vodafone,1.0


# Using Multiple `agg()` Variables

`agg()` accepts a dictionary that maps each column name to the aggregation to apply to it. Add one entry for every column you want to summarize.

## Group 2 columns, then count distinct and calculate mean of different columns

Let's calculate the number of unique users as well as their average cost.


In [20]:
df.groupby(['item', 'network']).agg({'user_id':pd.Series.nunique, 'total_cost': 'mean'})

item,network,user_id,total_cost
call,Meteor,29,4.855
call,Tesco,31,4.016901
call,Three,32,3.823359
call,Vodafone,29,4.385
call,landline,28,3.940238
call,voicemail,17,3.917407
data,data,33,4.2732
sms,Meteor,21,4.024242
sms,Tesco,11,4.885385
sms,Three,30,3.613793


## Calculate multiple aggregations on a single column

We can calculate several aggregations on a single column by passing a list of function names for that column inside the dictionary. You can do this for as many columns as you like, adding as many metrics to each list as you need.

In [21]:
df.groupby(['item', 'network']).agg({'user_id': pd.Series.nunique,
                                     'total_cost': ['median', 'mean']})

item,network,user_id,total_cost (median),total_cost (mean)
call,Meteor,29,4.28,4.855
call,Tesco,31,3.35,4.016901
call,Three,32,2.905,3.823359
call,Vodafone,29,3.635,4.385
call,landline,28,2.475,3.940238
call,voicemail,17,2.25,3.917407
data,data,33,3.465,4.2732
sms,Meteor,21,3.02,4.024242
sms,Tesco,11,4.2,4.885385
sms,Three,30,2.95,3.613793
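
Passing lists to `agg()` produces two-level column names, which can be awkward to work with later. One possible way to flatten them is to join the two levels with an underscore. This is a sketch on a made-up frame (`toy` is hypothetical) that groups by a single column to keep it short.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "call", "sms"],
    "user_id": [1, 2, 2, 3],
    "total_cost": [2.0, 4.0, 9.0, 3.0],
})

# Lists of aggregations give (column, aggregation) pairs as column names
summary = toy.groupby("item").agg({
    "user_id": ["nunique"],
    "total_cost": ["mean", "median"],
})

# Join each (column, aggregation) pair into a single flat name
summary.columns = ["_".join(col) for col in summary.columns]

print(summary)
```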


# Method Chaining - Putting it All Together

## Preface

Let's say you were tasked with the following:

**For customer ID 15, find the following:**

* Find the mean communication time (`duration`) for each item-network pair

* Rename the duration column to "time_in_seconds"

* Ensure your results are in a DataFrame

You could complete this by applying one method at a time and assigning each result to a new variable:

In [22]:
df2 = df[df['user_id'] == 15]
df3 = df2.groupby(['item', 'network'], as_index=False)[['duration']].mean()
df4 = df3.rename(columns={'duration': 'time_in_seconds'})

This, however, creates several intermediate variables that are easy to mix up. Method chaining in pandas lets you combine all of the steps into a single statement. There are pros and cons to this.

**Pros to method chaining:**
* Fewer intermediate variables to keep track of
* Fewer leftover DataFrames cluttering your environment
* Better code readability

**Cons to method chaining:**
* Can be difficult to debug because intermediate results are not stored, especially if it's code handed down to you from another user
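
One way to soften the debugging downside is to wrap the chain in parentheses, put one method per line, and use `pipe()` to inspect intermediate results. The sketch below runs on a made-up frame; `toy` and the helper `log_shape` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "user_id": [15, 15, 15, 10],
    "item": ["call", "call", "sms", "call"],
    "network": ["Three", "Three", "Three", "Tesco"],
    "duration": [10.0, 16.0, 1.0, 99.0],
})


def log_shape(frame, label):
    """Print the shape of an intermediate result and pass it through unchanged."""
    print(f"{label}: {frame.shape}")
    return frame


result = (
    toy[toy["user_id"] == 15]
    .pipe(log_shape, "after filter")
    .groupby(["item", "network"], as_index=False)[["duration"]]
    .mean()
    .pipe(log_shape, "after mean")
    .rename(columns={"duration": "time_in_seconds"})
)

print(result)
```

With one method per line, any single step can also be commented out to narrow down where a problem starts.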

## All-in-One: Group 2 columns, index as a column, calculate mean, rename a column for a single distinct user

In [23]:
df[df['user_id'] == 15].groupby(['item', 'network'], as_index=False)[['duration']].mean().rename(
    columns={'duration': 'time_in_seconds'})

,item,network,time_in_seconds
0,call,Three,13.0
1,call,Vodafone,4.25
2,call,landline,251.5
3,data,data,34.429
4,sms,Meteor,1.0
5,sms,Three,1.0
6,sms,Vodafone,1.0


## All-in-One: Slicing for a group of specific users

If you want to analyze a specific subset of users, create a list of those users and slice your dataframe using `isin()`.

In [24]:
user_list = [10, 15]

df[df['user_id'].isin(user_list)].groupby([
    'user_id', 'item', 'network'], as_index=False)[['duration']].mean().rename(columns={'duration': 'time_in_seconds'})

,user_id,item,network,time_in_seconds
0,10,call,Meteor,9.0
1,10,call,Tesco,229.0
2,10,call,Three,142.5
3,10,call,Vodafone,199.333333
4,10,call,landline,141.0
5,10,data,data,34.429
6,10,sms,Three,1.0
7,10,sms,Vodafone,1.0
8,15,call,Three,13.0
9,15,call,Vodafone,4.25
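
`isin()` returns a boolean mask, so it can be stored, reused, or inverted with `~` to select everyone outside the list. This is a minimal sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "user_id": [10, 15, 20, 15],
    "duration": [5.0, 7.0, 9.0, 11.0],
})

user_list = [10, 15]

# True for rows whose user_id appears in user_list
mask = toy["user_id"].isin(user_list)

selected = toy[mask]    # rows for users 10 and 15
excluded = toy[~mask]   # rows for every other user

print(selected)
print(excluded)
```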


# Swapping Columns and Rows 

`stack()` turns column names into index names, and the `unstack()` method turns index values into column names. Depending on your situation, one or the other may be helpful in displaying your data in a logical manner.

## Turning index values into column names - `unstack()`

In [25]:
df.groupby(['item', 'network'])['duration'].mean()

item  network  
call  Meteor       133.333333
      Tesco        194.760563
      Three        284.875000
      Vodafone     221.530303
      landline     438.880952
      voicemail     65.740741
data  data          34.429000
sms   Meteor         1.000000
      Tesco          1.000000
      Three          1.000000
      Vodafone       1.000000
      special        1.000000
      world          1.000000
Name: duration, dtype: float64

In [26]:
df.groupby(['item', 'network'])['duration'].mean().unstack()

network,Meteor,Tesco,Three,Vodafone,data,landline,special,voicemail,world
item,,,,,,,,,
call,133.333333,194.760563,284.875,221.530303,,438.880952,,65.740741,
data,,,,,34.429,,,,
sms,1.0,1.0,1.0,1.0,,,1.0,,1.0
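
The empty cells above are `NaN` values for item-network pairs that never occur in the data. If a placeholder is preferred, `unstack()` accepts a `fill_value` argument. The sketch below uses a made-up frame (`toy` is hypothetical), and filling with 0 is only an illustration, since a zero mean may be misleading in a real analysis.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "item": ["call", "call", "sms"],
    "network": ["Three", "Three", "Tesco"],
    "duration": [10.0, 20.0, 1.0],
})

means = toy.groupby(["item", "network"])["duration"].mean()

# The innermost index level (network) becomes the columns; missing pairs are NaN
wide = means.unstack()

# Same reshape, but missing pairs are replaced by 0
wide_filled = means.unstack(fill_value=0)

print(wide)
print(wide_filled)
```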


## Turning columns names into index names - `stack()`

In [27]:
df.groupby(['item', 'network']).agg({'user_id':pd.Series.nunique, 'total_cost': 'mean'})

item,network,user_id,total_cost
call,Meteor,29,4.855
call,Tesco,31,4.016901
call,Three,32,3.823359
call,Vodafone,29,4.385
call,landline,28,3.940238
call,voicemail,17,3.917407
data,data,33,4.2732
sms,Meteor,21,4.024242
sms,Tesco,11,4.885385
sms,Three,30,3.613793


In [28]:
df.groupby(['item', 'network']).agg({'user_id':pd.Series.nunique, 'total_cost': 'mean'}).stack()

item  network              
call  Meteor     user_id       29.000000
                 total_cost     4.855000
      Tesco      user_id       31.000000
                 total_cost     4.016901
      Three      user_id       32.000000
                 total_cost     3.823359
      Vodafone   user_id       29.000000
                 total_cost     4.385000
      landline   user_id       28.000000
                 total_cost     3.940238
      voicemail  user_id       17.000000
                 total_cost     3.917407
data  data       user_id       33.000000
                 total_cost     4.273200
sms   Meteor     user_id       21.000000
                 total_cost     4.024242
      Tesco      user_id       11.000000
                 total_cost     4.885385
      Three      user_id       30.000000
                 total_cost     3.613793
      Vodafone   user_id       35.000000
                 total_cost     4.060134
      special    user_id        3.000000
                 total_cost  

## Transpose

You can transpose a DataFrame, swapping its rows and columns, by adding `.T` to the end of your statement.

In [29]:
df.groupby(['network']).agg({'total_cost': 'mean'})

network,total_cost
Meteor,4.539885
Tesco,4.15131
Three,3.738558
Vodafone,4.15986
data,4.2732
landline,3.940238
special,3.363333
voicemail,3.917407
world,4.444286


In [30]:
df.groupby(['network']).agg({'total_cost': 'mean'}).T

network,Meteor,Tesco,Three,Vodafone,data,landline,special,voicemail,world
total_cost,4.539885,4.15131,3.738558,4.15986,4.2732,3.940238,3.363333,3.917407,4.444286
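
A quick way to confirm what `.T` does is to compare shapes before and after. This is a sketch on a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
toy = pd.DataFrame({
    "network": ["Three", "Three", "Tesco"],
    "total_cost": [2.0, 4.0, 5.0],
})

mean_cost = toy.groupby("network").agg({"total_cost": "mean"})
transposed = mean_cost.T  # rows become columns and columns become rows

print(mean_cost.shape, transposed.shape)
```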


# Grouping by Time - `resample()`

[Documentation for `resample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html)

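`resample()` groups rows into time intervals based on a datetime index. In the example below, we set `date` as the index and take the mean of the numerical columns. The argument you pass to `resample()` is the frequency to group by: `W` is for week and `M` is for month end (pandas 2.2 and later spell this `ME`).


In [31]:
# numeric_only=True (pandas 1.5+) skips text columns such as item and network
df.set_index(['date']).resample('M').mean(numeric_only=True)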

date,user_id,duration,total_cost
2014-10-31,16.954955,152.525162,3.954054
2014-11-30,18.449339,78.898106,4.163744
2014-12-31,18.584337,96.333127,4.241325
2015-01-31,16.881657,117.29171,4.223846
2015-02-28,17.449153,101.83061,3.918814
2015-03-31,17.25641,387.373769,3.35641
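
For weekly grouping, here is a small sketch on a made-up frame (`ts` is hypothetical). It selects a single numeric column before resampling, and it assumes the default `W` behaviour of weeks ending on Sunday. Weeks with no rows still appear in the result, with `NaN` as their mean.

```python
import pandas as pd

# Hypothetical toy data, not the phone data set
ts = pd.DataFrame({
    "date": pd.to_datetime(["2014-10-15", "2014-10-20", "2014-11-03"]),
    "duration": [10.0, 30.0, 5.0],
    "item": ["call", "call", "sms"],
})

# Set the datetime column as the index, pick the numeric column, group by week
weekly = ts.set_index("date")["duration"].resample("W").mean()

print(weekly)
```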
