# Ch. 10  Data Aggregation and Group Operations

In [None]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow.   
After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes.   
pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way. 

One reason for the popularity of relational databases and SQL (which stands for “structured query language”) is the ease with which data can be joined, filtered, transformed, and aggregated.  
However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed.   
As you will see, with the expressiveness of Python and pandas, we can perform quite complex group operations by utilizing any function that accepts a pandas object or NumPy array.   
In this chapter, you will learn how to: 

* Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names) 
* Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function 
* Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection 
* Compute pivot tables and cross-tabulations 
* Perform quantile analysis and other statistical group analyses


<img style="float: left;" src="pic/pic_0_2.png">

Aggregation of time series data, a special use case of groupby, is referred to as resampling in this book and will receive separate treatment in Chapter 11.

## 10.1 GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language, coined the term *split-apply-combine* for describing group operations.    
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is *split* into groups based on one or more *keys* that you provide.  
The splitting is performed on a particular axis of an object.   
For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).   
Once this is done, a function is *applied* to each group, producing a new value.  
Finally, the results of all those function applications are *combined* into a result object.   
The form of the resulting object will usually depend on what’s being done to the data.   
See Figure 10-1 for a mockup of a simple group aggregation.
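
To make the three stages concrete, here is a minimal sketch that carries them out by hand on a small made-up frame and compares the outcome with the groupby one-liner. The names toy, pieces, sums, and manual are only illustrative:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'data': [0, 5, 10, 5, 10, 15]})

# Split: collect the rows that belong to each distinct key
pieces = {key: toy.loc[toy['key'] == key, 'data']
          for key in sorted(toy['key'].unique())}

# Apply: reduce each piece to a single value
sums = {key: piece.sum() for key, piece in pieces.items()}

# Combine: assemble the per-group values into one result object
manual = pd.Series(sums, name='data')

# groupby performs the same three steps in one expression
grouped_sums = toy.groupby('key')['data'].sum()
print(manual.to_dict() == grouped_sums.to_dict())
```

You would not normally write the loop yourself; the point is only that groupby bundles these steps.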

In [None]:
df = pd.DataFrame({'key' : ['A', 'B', 'C', 'A', 'B', 'C','A', 'B', 'C'],
                   'data' : [0,5,10,5,10,15,10,15,20]})
df

In [None]:
GG= df['data'].groupby(df['key'])
GG

<img style="float: left;" src="pic/pic_10_1.png" width="700">

The variable GG is now a GroupBy object.
It has not actually computed anything yet except for some intermediate data about the group key df['key'].
The idea is that this object has all of the information needed to then apply some operation to each of the groups.
For example, to compute group sums we can call the GroupBy’s sum method:

In [None]:
GG.sum()

In [None]:
sums = df['data'].groupby(df['key']).sum()
sums

In [None]:
df['data'].groupby(df['key']).mean()

In [None]:
df['data'].groupby(df['key']).max()

In [None]:
df['data'].groupby(df['key']).min()

In [None]:
df['data'].groupby(df['key']).std()

In [None]:
df['data'].groupby(df['key']).count()

Each grouping key can take many forms, and the keys do not have to be all of the same type: 

* A list or array of values that is the same length as the axis being grouped   
* A value indicating a column name in a DataFrame  
* A dict or Series giving a correspondence between the values on the axis being grouped and the group names   
* A function to be invoked on the axis index or the individual labels in the index
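
As a quick sketch of the four forms listed above, the following groups the same made-up scores by a list, a column name, a dict, and a function. The frame, the label_to_team dict, and the team_of helper are hypothetical names chosen for this illustration:

```python
import pandas as pd

scores = pd.DataFrame({'team': ['x', 'y', 'x', 'y'],
                       'score': [1, 2, 3, 4]},
                      index=['r0', 'r1', 'r2', 'r3'])

# 1. A list (or array) as long as the axis being grouped
by_list = scores['score'].groupby(['x', 'y', 'x', 'y']).sum()

# 2. A column name
by_name = scores.groupby('team')['score'].sum()

# 3. A dict mapping index labels to group names
label_to_team = {'r0': 'x', 'r1': 'y', 'r2': 'x', 'r3': 'y'}
by_dict = scores['score'].groupby(label_to_team).sum()

# 4. A function called on each index label
def team_of(label):
    return 'x' if label in ('r0', 'r2') else 'y'

by_func = scores['score'].groupby(team_of).sum()

# All four keys describe the same split, so the sums agree
results = [by_list, by_name, by_dict, by_func]
print(all(r.to_dict() == by_list.to_dict() for r in results))
```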

Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object.   
Don’t worry if this all seems abstract.   
Throughout this chapter, I will give many examples of all these methods.   
To get started, here is a small tabular dataset as a DataFrame:

In [None]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Suppose you wanted to compute the mean of the data1 column using the labels from key1.   
There are a number of ways to do this.   
One is to access data1 and call groupby with the column (a Series) at key1:


In [None]:
grouped = df['data1'].groupby(df['key1'])
grouped

This grouped variable is now a *GroupBy* object.   
It has not actually computed anything yet except for some intermediate data about the group key df['key1'].   
The idea is that this object has all of the information needed to then apply some operation to each of the groups.   
For example, to compute group means we can call the GroupBy’s mean method:


In [None]:
grouped.mean()

Later, I’ll explain more about what happens when you call **.mean()**.   
The important thing here is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.  
The result index has the name 'key1' because the DataFrame column df['key1'] did.     
If instead we had passed multiple arrays as a list, we’d get something different:


In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

In [None]:
means.unstack()

In this example, the group keys are all Series, though they could be any arrays of the right length:

In [None]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

In [None]:
df

Frequently the grouping information is found in the same DataFrame as the data you want to work on.   
In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:


In [None]:
df.groupby('key1').mean(numeric_only=True)

In [None]:
df.groupby(['key1', 'key2']).mean()

You may have noticed that in the first case, df.groupby('key1').mean(numeric_only=True), there is no key2 column in the result. Because df['key2'] is not numeric data, it is said to be a nuisance column, which is therefore excluded from the result.   
Older pandas versions dropped such columns silently; recent versions (2.0 and later) raise a TypeError instead unless you pass numeric_only=True or select the numeric columns first.   
All of the remaining columns are aggregated, though it is possible to filter down to a subset, as you’ll see soon.

Regardless of the objective in using groupby, a generally useful GroupBy method is size, which returns a Series containing group sizes:

In [None]:
df.groupby(['key1', 'key2']).size()

Take note that any missing values in a group key are excluded from the result by default.
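
The following sketch illustrates the point on a made-up frame with one missing key. It also uses the dropna argument of groupby, which, assuming pandas 1.1 or later, keeps missing keys as a group of their own:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'key': ['a', 'b', np.nan, 'a'],
                    'value': [1, 2, 3, 4]})

# Default: the row whose key is missing belongs to no group
default_sizes = toy.groupby('key').size()

# dropna=False keeps the missing key as its own group
all_sizes = toy.groupby('key', dropna=False).size()

print(default_sizes.sum())  # 3 of the 4 rows are counted
print(all_sizes.sum())      # all 4 rows are counted
```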

### Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. 

In [None]:
df

In [None]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

In the case of multiple keys, the first element in the tuple will be a tuple of key values:


In [None]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

Of course, you can choose to do whatever you want with the pieces of data.   
A recipe you may find useful is computing a dict of the data pieces as a one-liner:


In [None]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

In [None]:
df.groupby('key1')

In [None]:
list(df.groupby('key1'))


In [None]:
dict(list(df.groupby('key1')))


### Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.   
This means that:


```python
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
```

are syntactic sugar for:


```python
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
```
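
A small check of this equivalence, using a made-up frame (toy) rather than the random data above, might look like this:

```python
import pandas as pd

toy = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'data2': [10.0, 20.0, 30.0, 40.0, 50.0]})

# Indexing the GroupBy object with a scalar name ...
sugar = toy.groupby('key1')['data1'].mean()
# ... versus grouping the selected column by the key Series
explicit = toy['data1'].groupby(toy['key1']).mean()

# A list of names keeps the result two-dimensional
sugar_2d = toy.groupby('key1')[['data2']].mean()
explicit_2d = toy[['data2']].groupby(toy['key1']).mean()

print(sugar.to_dict() == explicit.to_dict())
print(sugar_2d.to_dict() == explicit_2d.to_dict())
```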

Especially for large datasets, it may be desirable to aggregate only a few columns.   
For example, in the preceding dataset, to compute means for just the data2 column and get the result as a DataFrame, we could write:


In [None]:
df.groupby(['key1', 'key2'])[['data2']].mean()

In [None]:
type(df.groupby(['key1', 'key2'])[['data2']].mean())

In [None]:
df.groupby(['key1', 'key2'])['data2'].mean()

In [None]:
type(df.groupby(['key1', 'key2'])['data2'].mean())

The object returned by this indexing operation is a grouped DataFrame if a list or array is passed or a grouped Series if only a single column name is passed as a scalar:


In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
s_grouped.mean()

### Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:


In [None]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people

Now, suppose I have a group correspondence for the columns and want to sum together the columns by group:


In [None]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

Now, you could construct an array from this dict to pass to groupby, but instead we can just pass the dict (I included the key 'f' to highlight that unused grouping keys are OK):


In [None]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

The same functionality holds for Series, which can be viewed as a fixed-size mapping:


In [None]:
map_series = pd.Series(mapping)
map_series

In [None]:
people.groupby(map_series, axis=1).count()
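
Newer pandas releases warn that the axis argument of groupby is deprecated. One possible workaround, sketched here on a small deterministic stand-in for people, is to transpose, group the rows, and transpose back. This is an alternative formulation, not the approach used in the cells above:

```python
import numpy as np
import pandas as pd

small_people = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                            columns=['a', 'b', 'c', 'd', 'e'],
                            index=['Joe', 'Steve'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become rows, group those rows, transpose back
by_color = small_people.T.groupby(mapping).sum().T
print(by_color)
```

The unused key 'f' is ignored here as well, since the dict is only consulted for labels that actually occur.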

### Grouping with Functions

Using Python functions is a more generic way of defining a group mapping compared with a dict or Series.   
Any function passed as a group key will be called once per index value, with the return values being used as the group names.   
More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values.   
Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, it’s simpler to just pass the len function:


In [None]:
people

In [None]:
people.groupby(len).sum()

Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:

In [None]:
people

In [None]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

### Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. 

In [None]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
columns

In [None]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

To group by level, pass the level number or name using the level keyword:


In [None]:
hier_df.groupby(level='cty', axis=1).count()

## 10.2   Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.   
The preceding examples have used several of them, including mean, count, min, and sum.   
You may wonder what is going on when you invoke mean() on a GroupBy object.   
Many common aggregations, such as those found in Table 10-1, have optimized implementations.   
However, you are not limited to only this set of methods.

<img style="float: left;" src="pic/pic_10_2.png" width="500">

You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object.   
For example, you might recall that **quantile** computes sample quantiles of a Series or a DataFrame’s columns.

While **quantile** is not explicitly implemented for GroupBy, it is a Series method and thus available for use.   
Internally, GroupBy efficiently slices up the Series, calls piece.quantile(0.9) for each piece, and then assembles those results together into the result object:

In [None]:
df

In [None]:
grouped = df.groupby('key1')

In [None]:
grouped['data1'].quantile(0.9)

To use your own aggregation functions, pass any function that aggregates an array to the **aggregate** or **agg** method:


In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

# Select the numeric columns: subtraction is undefined for the strings in key2
grouped[['data1', 'data2']].agg(peak_to_peak)

You may notice that some methods like **describe** also work, even though they are not aggregations, strictly speaking:


In [None]:
grouped.describe()

I will explain in more detail what has happened here in Section 10.3, “Apply: General split-apply-combine.”


### Column-Wise and Multiple Function Application

Let’s return to the tipping dataset from earlier examples. After loading it with read_csv, we add a tipping percentage column tip_pct:

In [None]:
tips = pd.read_csv('examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips

As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std.   
However, you may want to aggregate using a different function depending on the column, or multiple functions at once.   
Fortunately, this is possible to do, which I’ll illustrate through a number of examples.   
First, I’ll group the tips by day and smoker:


In [None]:
grouped = tips.groupby(['day', 'smoker'])

Note that for descriptive statistics like those in Table 10-1, you can pass the name of the function as a string:

In [None]:
grouped_pct = grouped['tip_pct']
list(grouped_pct)

In [None]:
grouped_pct.agg('mean')

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

In [None]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

Here we passed a list of aggregation functions to agg to evaluate independently on the data groups. 

You don’t need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name '**\<lambda\>**', which makes them hard to identify (you can see for yourself by looking at a function’s **\_\_name\_\_** attribute).   
Thus, if you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names (you can think of a list of 2-tuples as an ordered mapping):

In [None]:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
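
As an aside, pandas (0.25 and later) also accepts keyword arguments for the same purpose, sometimes called named aggregation. A minimal sketch on made-up data, where the keyword names become the column names:

```python
import pandas as pd

toy_tips = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                         'tip_pct': [0.125, 0.25, 0.25, 0.5]})
toy_grouped_pct = toy_tips.groupby('day')['tip_pct']

# Each keyword names an output column; its value names the aggregation
named = toy_grouped_pct.agg(foo='mean', bar='max')
print(named)
```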

With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column.  
To start, suppose we wanted to compute the same three statistics for the tip_pct and total_bill columns:


In [None]:
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

As you can see, the resulting DataFrame has hierarchical columns, the same as you would get aggregating each column separately and using **concat** to glue the results together using the column names as the keys argument:


In [None]:
result['tip_pct']
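
The remark about concat can be checked directly. This sketch uses a tiny made-up frame (toy_tips) in place of the tips data and builds the table both ways:

```python
import pandas as pd

toy_tips = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                         'tip_pct': [0.125, 0.25, 0.25, 0.5],
                         'total_bill': [10.0, 20.0, 30.0, 50.0]})
toy_grouped = toy_tips.groupby('day')
functions = ['count', 'mean', 'max']
cols = ['tip_pct', 'total_bill']

# One call on both columns
combined = toy_grouped[cols].agg(functions)

# Each column aggregated separately, then glued side by side,
# with the column names supplied as the outer level via keys
parts = [toy_grouped[col].agg(functions) for col in cols]
glued = pd.concat(parts, axis=1, keys=cols)

# Compare the two tables cell by cell
print(combined.to_dict() == glued.to_dict())
```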

As before, a list of tuples with custom names can be passed:


In [None]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

Now, suppose you wanted to apply potentially different functions to one or more of the columns.   
To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:


In [None]:
grouped.agg({'tip' : np.max, 'size' : 'sum'})

In [None]:
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column. 

### Returning Aggregated Data Without Row Indexes

In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations.  
Since this isn’t always desirable, you can disable this behavior in most cases by passing **as_index=False** to **groupby**:


In [None]:
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)
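
The same flat layout can also be obtained by calling reset_index on the indexed result. A small sketch with made-up data (toy_tips) shows that the two routes agree:

```python
import pandas as pd

toy_tips = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                         'smoker': ['No', 'Yes', 'No', 'No'],
                         'tip': [1.0, 2.0, 3.0, 5.0]})

# Group keys returned as ordinary columns
flat = toy_tips.groupby(['day', 'smoker'], as_index=False).mean()

# Same table, built by resetting the hierarchical index afterwards
via_reset = toy_tips.groupby(['day', 'smoker']).mean().reset_index()

print(flat)
print(flat.to_dict('list') == via_reset.to_dict('list'))
```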

## 10.3  Apply: General split-apply-combine

The most general-purpose GroupBy method is apply, which is the subject of the rest of this section.   
As illustrated in Figure 10-2, apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.


<img style="float: left;" src="pic/pic_10_3.png" width="500">

Returning to the tipping dataset from before, suppose you wanted to select the top five tip_pct values by group.  
First, write a function that selects the rows with the largest values in a particular column:


In [None]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips, n=6)

Now, if we group by smoker, say, and call apply with this function, we get the following:

In [None]:
tips.groupby('smoker').apply(top)

What has happened here? The top function is called on each row group from the DataFrame, and then the results are glued together using **pandas.concat**, labeling the pieces with the group names.   
The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame. 

If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:

In [None]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

You may recall that I earlier called describe on a GroupBy object:


In [None]:
result = tips.groupby('smoker')['tip_pct'].describe()

In [None]:
result

In [None]:
result.unstack('smoker')

Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for:


```python
f = lambda x: x.describe()
grouped.apply(f)
```
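
To see the correspondence on concrete numbers, the sketch below (made-up data, illustrative names) calls describe both ways. The apply route returns a long Series indexed by group and statistic, while the method returns a wide DataFrame, but the values match:

```python
import pandas as pd

toy_tips = pd.DataFrame({'smoker': ['No', 'No', 'Yes', 'Yes'],
                         'tip_pct': [0.125, 0.25, 0.25, 0.5]})
toy_grouped = toy_tips.groupby('smoker')['tip_pct']

# describe called on every piece through apply (long form)
via_apply = toy_grouped.apply(lambda x: x.describe())

# describe called as a GroupBy method (wide form)
via_method = toy_grouped.describe()

print(via_apply.loc[('No', 'mean')], via_method.loc['No', 'mean'])
```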

### Suppressing the Group Keys

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object.   
You can disable this by passing group_keys=False to groupby:


In [None]:
tips.groupby('smoker', group_keys=False).apply(top)

In [None]:
tips.groupby('smoker').apply(top)

### Quantile and Bucket Analysis

As you may recall from Chapter 8, pandas has some tools, in particular cut and qcut, for slicing data up into buckets with bins of your choosing or by sample quantiles.   
Combining these functions with groupby makes it convenient to perform bucket or quantile analysis on a dataset.   
Consider a simple random dataset and an equal-length bucket categorization using cut:


In [None]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
frame

In [None]:
frame.head(10)

In [None]:
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

The **Categorical** object returned by **cut** can be passed directly to **groupby**.   
So we could compute a set of statistics for the data2 column like so:


In [None]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()

In [None]:
grouped.apply(get_stats)

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use **qcut**.   
I’ll pass **labels=False** to just get quantile numbers:


In [None]:
# Return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
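
The difference between equal-length and equal-size buckets is easiest to see on skewed data. In this sketch the input (the first 100 squares) is made up for the purpose; cut produces bins of equal width with very different counts, while qcut balances the counts:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(100) ** 2)  # skewed toward small values

# cut: four bins of equal width, so the counts can differ a lot
equal_width = pd.cut(values, 4)
width_counts = values.groupby(equal_width, observed=True).count()

# qcut: four bins based on sample quantiles, so the counts are balanced
equal_size = pd.qcut(values, 4, labels=False)
size_counts = values.groupby(equal_size).count()

print(width_counts.tolist())  # [50, 21, 15, 14]
print(size_counts.tolist())   # [25, 25, 25, 25]
```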

We will take a closer look at pandas’s Categorical type in Chapter 12. 

### Example: Filling Missing Values with Group-Specific Values

When cleaning up missing data, in some cases you will remove data observations using **dropna**, but in others you may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data.   
**fillna** is the right tool to use; for example, here I fill in NA values with the mean:


In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

In [None]:
s.fillna(s.mean())

Suppose you need the fill value to vary by group.   
One way to do this is to group the data and use **apply** with a function that calls **fillna** on each data chunk.   
Here is some sample data on US states divided into eastern and western regions:


In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']

In [None]:
group_key = ['East'] * 4 + ['West'] * 4
group_key

In [None]:
data = pd.Series(np.random.randn(8), index=states)
data

Note that the syntax ['East'] * 4 produces a list containing four copies of the elements in ['East'].   
Adding lists together concatenates them.  
Let’s set some values in the data to be missing:

In [None]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

In [None]:
data.groupby(group_key).mean()

We can fill the NA values using the group means like so:


In [None]:
data

In [None]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
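
An alternative that avoids apply, offered here only as a sketch and not as the approach used above, is to broadcast the group means with transform and hand the result to fillna. The data below is made up so that the fill values are easy to verify:

```python
import numpy as np
import pandas as pd

toy_data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                     index=['Ohio', 'Vermont', 'Florida',
                            'Oregon', 'Nevada', 'Idaho'])
toy_key = ['East'] * 3 + ['West'] * 3

# transform('mean') returns each group's mean, aligned to the original index
group_means = toy_data.groupby(toy_key).transform('mean')

# fillna with an aligned Series fills each gap with its own group's mean
filled = toy_data.fillna(group_means)
print(filled)
```

Vermont receives the East mean (2.0) and Nevada the West mean (20.0).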

In another case, you might have predefined fill values in your code that vary by group.   
Since the groups have a name attribute set internally, we can use that:


In [None]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)

### Example: Random Sampling and Permutation

Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application.   
There are a number of ways to perform the “draws”; here we use the sample method for Series. 

To demonstrate, here’s a way to construct a deck of English-style playing cards:

In [None]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)

In [None]:
print(cards)

So now we have a Series of length 52 whose index contains card names and whose values are the ones used in Blackjack and other games (to keep things simple, I just let the ace 'A' be 1):

In [None]:
deck

Now, based on what I said before, drawing a hand of five cards from the deck could be written as:


In [None]:
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)

Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply:

In [None]:
get_suit = lambda card: card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n=2)

In [None]:
for name, group in deck.groupby(get_suit):
    print(name)
    print(group)

In [None]:
get_rank = lambda k: k[:-1] # everything before the suit letter is the rank
deck.groupby(get_rank).apply(draw, n=2)

In [None]:
for name, group in deck.groupby(get_rank):
    print(name)
    print(group)

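Alternatively, we could pass group_keys=False to drop the outer suit level from the result index, leaving just the selected cards (the first cell repeats the default result for comparison):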

In [None]:
get_suit = lambda card: card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n=2)

In [None]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

### Example: Group Weighted Average and Correlation (Skipped)

Under the split-apply-combine paradigm of **groupby**, operations between columns in a DataFrame or two Series, such as a group weighted average, are possible.   
As an example, take this dataset containing group keys, values, and some weights:


In [None]:
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
df

The group weighted average by category would then be:


In [None]:
grouped = df.groupby('category')
# Weighted average: sum(data * weights) / sum(weights)
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
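
For readers who want to see the formula spelled out, this sketch computes the weighted mean by hand on made-up numbers and places it next to np.average. The helper weighted_mean and the frame toy are hypothetical names for this illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                    'data': [1.0, 3.0, 10.0, 20.0],
                    'weights': [1.0, 3.0, 1.0, 1.0]})

def weighted_mean(group):
    # sum(data * weights) / sum(weights), written out explicitly
    return (group['data'] * group['weights']).sum() / group['weights'].sum()

# Iterate over the groups and record both versions of the calculation
rows = {}
for name, group in toy.groupby('category'):
    rows[name] = {'by_hand': weighted_mean(group),
                  'np_average': np.average(group['data'],
                                           weights=group['weights'])}
comparison = pd.DataFrame.from_dict(rows, orient='index')
print(comparison)
```

For category a this gives (1·1 + 3·3) / (1 + 3) = 2.5, and for category b (10 + 20) / 2 = 15.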

As another example, consider a financial dataset originally obtained from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index (the SPX symbol):


In [None]:
close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates=True,
                       index_col=0)
close_px.info()

In [None]:
close_px[-4:]

In [None]:
type(close_px.index)

In [None]:
close_px.index.year

One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX.   
As one way to do this, we first create a function that computes the pairwise correlation of each column with the 'SPX' column:


In [None]:
spx_corr = lambda x: x.corrwith(x['SPX'])

Next, we compute percent change on close_px using pct_change:


In [None]:
rets = close_px.pct_change().dropna()

In [None]:
close_px.pct_change?

Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year attribute of each datetime label:


In [None]:
get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

You could also compute inter-column correlations. Here we compute the annual correlation between Apple and Microsoft:


In [None]:
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

### Example: Group-Wise Linear Regression (Skipped)

In the same theme as the previous example, you can use **groupby** to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value.   
For example, I can define the following regress function (using the statsmodels econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:


In [None]:
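import statsmodels.api as sm

def regress(data, yvar, xvars):
    Y = data[yvar]
    # Copy so that adding the intercept column does not touch the group's data
    X = data[xvars].copy()
    X['intercept'] = 1.
    # Fit ordinary least squares and return the estimated coefficients
    result = sm.OLS(Y, X).fit()
    return result.params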

Now, to run a yearly linear regression of AAPL on SPX returns, execute:


In [None]:
by_year.apply(regress, 'AAPL', ['SPX'])

## 10.4  Pivot Tables and Cross-Tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software.   
It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.  
Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter combined with reshape operations utilizing hierarchical indexing.   
DataFrame has a **pivot_table** method, and there is also a top-level **pandas.pivot_table** function.   
In addition to providing a convenience interface to **groupby**, **pivot_table** can add partial totals, also known as margins. 
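
As a rough illustration of that relationship, the sketch below builds the same small table twice on made-up data: once with pivot_table and once with groupby followed by unstack. The frame toy_tips is hypothetical:

```python
import pandas as pd

toy_tips = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
                         'smoker': ['No', 'Yes', 'No', 'Yes', 'Yes'],
                         'tip': [1.0, 2.0, 3.0, 4.0, 6.0]})

# Pivot table: day on the rows, smoker on the columns, mean of tip
pivoted = toy_tips.pivot_table('tip', index='day', columns='smoker')

# The same table from groupby followed by a reshape
reshaped = toy_tips.groupby(['day', 'smoker'])['tip'].mean().unstack()

print(pivoted)
print(pivoted.to_dict() == reshaped.to_dict())
```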

Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default **pivot_table** aggregation type) arranged by day and smoker on the rows:

In [None]:
tips

In [None]:
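# Name the numeric columns explicitly; text columns such as time cannot be averaged
tips.pivot_table(index=['day', 'smoker'],
                 values=['size', 'tip', 'tip_pct', 'total_bill'])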

<img style="float: left;" src="pic/pic_10_4.png" width="700">

In [None]:
tips.pivot_table?

This could have been produced with groupby directly.   
Now, suppose we want to aggregate only tip_pct and size, and additionally group by time.   
I’ll put smoker in the table columns and time and day in the rows:


In [None]:
tips.tail(10)

In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')

We could augment this table to include partial totals by passing **margins=True**.   
This has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier:


In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker', margins=True)

Here, the All values are means without taking into account smoker versus nonsmoker (the All columns) or either of the two levels of grouping on the rows (the All row). 

To use a different aggregation function, pass it to aggfunc.  
For example, 'count' or len will give you a cross-tabulation (count or frequency) of group sizes:

In [None]:
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
                 aggfunc='count', margins=True)

In [None]:
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
                 aggfunc=len, margins=True)

If some combinations are empty (or otherwise NA), you may wish to pass a **fill_value**:


In [None]:
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)

See Table 10-2 for a summary of **pivot_table** options.


<img style="float: left;" src="pic/pic_10_5.png" width="700">

### Cross-Tabulations: Crosstab (Skipped)

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. 

In [None]:
from io import StringIO
data = """\
Sample  Nationality  Handedness
1   USA  Right-handed
2   Japan    Left-handed
3   USA  Right-handed
4   Japan    Right-handed
5   Japan    Left-handed
6   Japan    Right-handed
7   USA  Right-handed
8   USA  Left-handed
9   Japan    Right-handed
10  USA  Right-handed"""
data = pd.read_table(StringIO(data), sep=r'\s+')

In [None]:
data

As part of some survey analysis, we might want to summarize this data by nationality and handedness.   
You could use **pivot_table** to do this, but the **pandas.crosstab** function can be more convenient:

In [None]:
pd.crosstab(data.Nationality, data.Handedness, margins=True)
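
Since a crosstab is just a table of group frequencies, the same counts can be reproduced with groupby. A minimal sketch on a shortened, made-up version of the survey (the labels are abbreviated for brevity):

```python
import pandas as pd

survey = pd.DataFrame({'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan'],
                       'Handedness': ['Right', 'Left', 'Right', 'Right', 'Left']})

# Frequencies via crosstab
freq = pd.crosstab(survey['Nationality'], survey['Handedness'])

# The same counts via groupby: group sizes, reshaped, empty cells set to 0
by_hand = (survey.groupby(['Nationality', 'Handedness'])
                 .size()
                 .unstack(fill_value=0))

print(freq)
print(freq.to_dict() == by_hand.to_dict())
```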

The first two arguments to crosstab can each be an array, a Series, or a list of arrays. For example, with the tips data:


In [None]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS

## Conclusion