<h1><font color = #fc7cc9> Ch. 10 Data Aggregation and Group Operations
    <br>pg. 287 - 316</h1>
<p> 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In this chapter we will learn how to:
<ul>
    <li>Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names) </li>
    <li>Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function </li>
    <li>Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection </li>
    <li>Compute pivot tables and cross-tabulations </li>
    <li>Perform quantile analysis and other statistical group analyses </li>
</ul> <br>
<b>Note:</b> Aggregation of time series data is referred to as <i>resampling</i> in this book and will be addressed more in Chapter 11.

<h2> <font color = #39abed> 10.1 GroupBy Mechanics
    </h2>
<p>Group operations follow the <i>split-apply-combine</i> pattern:<br>
<p>1. <b>Split</b> the data into groups based on one or more keys that you provide. The split happens along a particular axis of the object: a DF can be grouped on its rows (axis=0) or on its columns (axis=1).
<p>2. A function is <b>applied</b> to each group, producing a new value.
<p>3. The results of those function applications are <b>combined</b> into a result object.
<br>
<br>
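
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single groupby call. The frame `toy` and its column names are hypothetical, chosen only so the means are easy to check.

```python
import pandas as pd

# Hypothetical toy data (not from the book)
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                    'value': [1.0, 2.0, 3.0, 6.0]})

# 1. Split: one sub-Series of values per distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# 2. Apply: reduce each piece to a single number
applied = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name='value').rename_axis('key')

# groupby carries out the same three steps in one expression
via_groupby = toy.groupby('key')['value'].mean()

print(manual)
print(via_groupby)
```

Both Series hold the same group means; the manual version is only meant to show what groupby does conceptually, not how pandas implements it.
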
<p> Grouping keys can take many forms, and they do not all have to be of the same type. A key can be a list or array of values with the same length as the axis being grouped, a column name in a DF, a dict or Series that maps values on the grouped axis to group names, or a function to be invoked on the axis index or on the individual labels in the index.
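
As a rough illustration of these key types, the sketch below groups the same small frame four ways. All names here (`toy`, `labels`, `mapping`, `first_half`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a labeled index
toy = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                    'score': [10, 20, 30, 40]},
                   index=['ann', 'bob', 'cat', 'dan'])

# 1. A column name
by_column = toy.groupby('team')['score'].sum()

# 2. An array (or list) with one entry per row
labels = np.array(['x', 'y', 'x', 'y'])
by_array = toy['score'].groupby(labels).sum()

# 3. A dict mapping index labels to group names
mapping = {'ann': 'x', 'bob': 'y', 'cat': 'x', 'dan': 'y'}
by_dict = toy['score'].groupby(mapping).sum()

# 4. A function called once per index label; its return value is the group name
def first_half(label):
    return 'a-b' if label[0] in 'ab' else 'c-d'

by_func = toy['score'].groupby(first_half).sum()

print(by_column, by_array, by_dict, by_func, sep='\n\n')
```

The first three results contain the same sums because all three keys describe the same grouping; only the function key splits the rows differently.
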

In [3]:
# Starting with an example data set
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

  key1 key2     data1     data2
0    a  one -0.443386 -0.524045
1    a  two  1.492145 -0.014583
2    b  one -0.475742 -0.661969
3    b  two -0.483896  0.178955
4    a  one -0.062065 -0.037307


<blockquote>Suppose you wanted to compute the mean of the data1 column using the labels from key1. There are a number of ways to do this. One is to access data1 and call groupby with the column (a Series) at key1.</blockquote>

In [5]:
grouped = df['data1'].groupby(df['key1'])
grouped  # A GroupBy object: nothing is computed until a method such as mean() is called on it

grouped.mean()

key1
a    0.328898
b   -0.479819
Name: data1, dtype: float64

In [6]:
# If you instead passed multiple arrays as a list, you get something different
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one    -0.252725
      two     1.492145
b     one    -0.475742
      two    -0.483896
Name: data1, dtype: float64

In [7]:
# means has a hierarchical index (key1, key2); unstack() pivots the inner level (key2) into columns
means.unstack()

key2       one       two
key1
a    -0.252725  1.492145
b    -0.475742 -0.483896


In the above examples, the group keys are all Series. They could also be any arrays of the right length, that is, one entry per row being grouped.

In [8]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

df['data1'].groupby([states, years]).mean()

California  2005    1.492145
            2006   -0.475742
Ohio        2005   -0.463641
            2006   -0.062065
Name: data1, dtype: float64

In [9]:
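df.groupby('key1').mean(numeric_only=True)

# key2 does not appear in the result because it is not numeric data.
# Older pandas versions dropped such columns automatically; pandas 2.0 and later
# raise a TypeError here unless numeric_only=True is passed.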

         data1     data2
key1
a     0.328898 -0.191978
b    -0.479819 -0.241507


In [10]:
df.groupby(['key1', 'key2']).mean()

              data1     data2
key1 key2
a    one  -0.252725 -0.280676
     two   1.492145 -0.014583
b    one  -0.475742 -0.661969
     two  -0.483896  0.178955


In [11]:
# GroupBy also has a size method, which returns a Series containing the group sizes
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
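
A related point that is easy to miss: size counts the rows in each group, while count reports the non-null values per column. The sketch below shows the difference on a hypothetical frame with one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'a'
toy = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'value': [1.0, np.nan, 3.0]})

sizes = toy.groupby('key').size()             # rows per group, NaN rows included
counts = toy.groupby('key')['value'].count()  # non-null values per group

print(sizes)
print(counts)
```

Group 'a' has two rows but only one non-null value, so the two results differ for that group.
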

<h3> <font color = #39abed> Iterating Over Groups
    </h3>
 <p>The GroupBy object supports iteration, which generates a sequence of 2-tuples containing the group name along with the chunk of data. See examples below.

In [12]:
for name, group in df.groupby('key1'):  # 1 tuple for 'a', and another for 'b'
    print(name)
    print(group)

a
  key1 key2     data1     data2
0    a  one -0.443386 -0.524045
1    a  two  1.492145 -0.014583
4    a  one -0.062065 -0.037307
b
  key1 key2     data1     data2
2    b  one -0.475742 -0.661969
3    b  two -0.483896  0.178955


In [13]:
# If you have multiple keys, the first element in the tuple will be a tuple of key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2     data1     data2
0    a  one -0.443386 -0.524045
4    a  one -0.062065 -0.037307
('a', 'two')
  key1 key2     data1     data2
1    a  two  1.492145 -0.014583
('b', 'one')
  key1 key2     data1     data2
2    b  one -0.475742 -0.661969
('b', 'two')
  key1 key2     data1     data2
3    b  two -0.483896  0.178955


In [15]:
# You can also compute a dict of the data pieces as a one-liner (of code)

pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2     data1     data2
2    b  one -0.475742 -0.661969
3    b  two -0.483896  0.178955


<h3> <font color = #39abed> Selecting a Column or Subset of Columns
    </h3>
 <p>

In [None]:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

# Indexing the GroupBy object as above is neater, and is equivalent to:

df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
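
The sketch below checks this equivalence on a hypothetical frame and shows what each form returns after aggregation: a single column name gives a Series, while a list of names gives a DataFrame.

```python
import pandas as pd

# Hypothetical toy data with easy-to-check means
toy = pd.DataFrame({'key1': ['a', 'a', 'b'],
                    'data1': [1.0, 3.0, 5.0],
                    'data2': [2.0, 4.0, 6.0]})

# Indexing the GroupBy object
as_series = toy.groupby('key1')['data1'].mean()   # single name -> Series
as_frame = toy.groupby('key1')[['data2']].mean()  # list of names -> DataFrame

# Long-hand equivalents: select first, then group by the key column
long_series = toy['data1'].groupby(toy['key1']).mean()
long_frame = toy[['data2']].groupby(toy['key1']).mean()

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.equals(long_series), as_frame.equals(long_frame))
```

Selecting columns on the GroupBy object is mostly a convenience, but on large frames it also means only the selected columns are aggregated.
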

<h3> <font color = #39abed> Grouping with Dicts and Series
    </h3>
 <p>

<h3> <font color = #39abed> Grouping with Functions
    </h3>
 <p>

<h3> <font color = #39abed> Grouping by Index Levels
    </h3>
 <p>

<h2> <font color = #39abed> 10.2 Data Aggregation
    </h2>
    <p> 

<h3> <font color = #39abed> Column-Wise Multiple Function Application
    </h3>
 <p>

<h3> <font color = #39abed>Returning Aggregated Data Without Row Indexes
    </h3>
 <p>

<h2> <font color = #39abed> 10.3 Apply: General split-apply-combine
    </h2>
    <p> 

<h3> <font color = #39abed> Suppressing Group Keys
    </h3>
 <p>

<h3> <font color = #39abed> Quantile and Bucket Analysis
    </h3>
 <p>

<h3> <font color = #39abed>Example: Filling Missing Values with Group-Specific Values
    </h3>
 <p>

<h3> <font color = #39abed> Example: Random Sampling and Permutation
    </h3>
 <p>

<h3> <font color = #39abed> Example: Group Weighted Average and Correlation
    </h3>
 <p>

<h3> <font color = #39abed> Example: Group-Wise Linear Regression
    </h3>
 <p>

<h2> <font color = #39abed> 10.4 Pivot Tables and Cross-Tabulations
    </h2>
    <p> 

<h3> <font color = #39abed> Cross-Tabulations: Crosstab
    </h3>
 <p>

<h2> <font color = #39abed> 10.5 Conclusion
    </h2>
    <p> 