# Data Aggregation and Group Operations

pandas provides a GroupBy class; its methods are available on any GroupBy object.

GroupBy objects are returned by groupby calls: pandas.DataFrame.groupby(), pandas.Series.groupby(), etc.
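As a minimal sketch of what these calls return, the example below builds a small frame and inspects the resulting objects. The `toy` frame and its column names are hypothetical and used only for illustration.

```python
import pandas as pd
from pandas.core.groupby import DataFrameGroupBy, SeriesGroupBy

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# DataFrame.groupby returns a DataFrameGroupBy
frame_groups = toy.groupby("key")
# Series.groupby returns a SeriesGroupBy
series_groups = toy["value"].groupby(toy["key"])

# .groups maps each group name to the index labels of its rows
group_labels = {name: list(labels) for name, labels in frame_groups.groups.items()}

# get_group pulls out the rows of a single group as a DataFrame
group_a = frame_groups.get_group("a")
```

Nothing is computed until a method such as `mean()` or `size()` is called on the GroupBy object; the object only records how the rows are split.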

https://pandas.pydata.org/docs/reference/groupby.html


Indexing, iteration
GroupBy.__iter__()

Groupby iterator.

GroupBy.groups

Dict {group name -> group labels}.

GroupBy.indices

Dict {group name -> group indices}.

GroupBy.get_group(name[, obj])

Construct DataFrame from group with provided name.

Grouper(*args, **kwargs)

A Grouper allows the user to specify a groupby instruction for an object.

Function application
GroupBy.apply(func, *args, **kwargs)

Apply function func group-wise and combine the results together.

GroupBy.agg(func, *args, **kwargs)

Aggregate using one or more operations; agg is an alias for aggregate.

SeriesGroupBy.aggregate([func, engine, ...])

Aggregate using one or more operations over the specified axis.

DataFrameGroupBy.aggregate([func, engine, ...])

Aggregate using one or more operations over the specified axis.

SeriesGroupBy.transform(func, *args[, ...])

Call function producing a like-indexed Series on each group and return a Series having the same indexes as the original object filled with the transformed values.

DataFrameGroupBy.transform(func, *args[, ...])

Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values.

GroupBy.pipe(func, *args, **kwargs)

Apply a function func with arguments to this GroupBy object and return the function's result.
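The difference between these entry points is easiest to see on a tiny example. The sketch below uses a hypothetical `toy` frame to contrast `agg` (one value per group), `transform` (one value per original row) and `pipe` (a function applied to the GroupBy object itself).

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"], "value": [1.0, 3.0, 10.0]})
grouped = toy.groupby("key")["value"]

# agg reduces each group to one value: the result is indexed by group name
per_group_mean = grouped.agg("mean")

# transform broadcasts the group result back onto the original index
broadcast_mean = grouped.transform("mean")

# pipe hands the GroupBy object itself to a function and returns its result
value_range = grouped.pipe(lambda g: g.max() - g.min())
```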

Computations / descriptive stats
GroupBy.all([skipna])

Return True if all values in the group are truthful, else False.

GroupBy.any([skipna])

Return True if any value in the group is truthful, else False.

GroupBy.bfill([limit])

Backward fill the values.

GroupBy.backfill([limit])

Backward fill the values.

GroupBy.count()

Compute count of group, excluding missing values.

GroupBy.cumcount([ascending])

Number each item in each group from 0 to the length of that group - 1.

GroupBy.cummax([axis])

Cumulative max for each group.

GroupBy.cummin([axis])

Cumulative min for each group.

GroupBy.cumprod([axis])

Cumulative product for each group.

GroupBy.cumsum([axis])

Cumulative sum for each group.

GroupBy.ffill([limit])

Forward fill the values.

GroupBy.first([numeric_only, min_count])

Compute first of group values.

GroupBy.head([n])

Return first n rows of each group.

GroupBy.last([numeric_only, min_count])

Compute last of group values.

GroupBy.max([numeric_only, min_count])

Compute max of group values.

GroupBy.mean([numeric_only, engine, ...])

Compute mean of groups, excluding missing values.

GroupBy.median([numeric_only])

Compute median of groups, excluding missing values.

GroupBy.min([numeric_only, min_count])

Compute min of group values.

GroupBy.ngroup([ascending])

Number each group from 0 to the number of groups - 1.

GroupBy.nth(n[, dropna])

Take the nth row from each group if n is an int, otherwise a subset of rows.

GroupBy.ohlc()

Compute open, high, low and close values of a group, excluding missing values.

GroupBy.pad([limit])

Forward fill the values.

GroupBy.prod([numeric_only, min_count])

Compute prod of group values.

GroupBy.rank([method, ascending, na_option, ...])

Provide the rank of values within each group.

GroupBy.pct_change([periods, fill_method, ...])

Calculate pct_change of each value to previous entry in group.

GroupBy.size()

Compute group sizes.

GroupBy.sem([ddof])

Compute standard error of the mean of groups, excluding missing values.

GroupBy.std([ddof, engine, engine_kwargs])

Compute standard deviation of groups, excluding missing values.

GroupBy.sum([numeric_only, min_count, ...])

Compute sum of group values.

GroupBy.var([ddof, engine, engine_kwargs])

Compute variance of groups, excluding missing values.

GroupBy.tail([n])

Return last n rows of each group.
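Several of the methods listed above return one value per row rather than one value per group. A short sketch on hypothetical data (the `toy` frame is an assumption, not part of this notebook) shows a few of them side by side.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"key": ["a", "b", "a", "a", "b"], "value": [1, 2, 3, 4, 5]})
grouped = toy.groupby("key")

# Position of each row inside its own group: 0, 1, 2, ...
position_in_group = grouped.cumcount()
# Integer label of the group each row belongs to (groups sorted by key)
group_number = grouped.ngroup()
# Cumulative sum that restarts for every group
running_total = grouped["value"].cumsum()
# First two rows of each group, kept in their original order
first_two = grouped.head(2)
```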

The following methods are available on both SeriesGroupBy and DataFrameGroupBy objects, but may differ slightly: the DataFrameGroupBy version usually accepts an axis argument and often an argument that restricts the operation to columns of a specific data type.

DataFrameGroupBy.all([skipna])

Return True if all values in the group are truthful, else False.

DataFrameGroupBy.any([skipna])

Return True if any value in the group is truthful, else False.

DataFrameGroupBy.backfill([limit])

Backward fill the values.

DataFrameGroupBy.bfill([limit])

Backward fill the values.

DataFrameGroupBy.corr

Compute pairwise correlation of columns, excluding NA/null values.

DataFrameGroupBy.count()

Compute count of group, excluding missing values.

DataFrameGroupBy.cov

Compute pairwise covariance of columns, excluding NA/null values.

DataFrameGroupBy.cumcount([ascending])

Number each item in each group from 0 to the length of that group - 1.

DataFrameGroupBy.cummax([axis])

Cumulative max for each group.

DataFrameGroupBy.cummin([axis])

Cumulative min for each group.

DataFrameGroupBy.cumprod([axis])

Cumulative product for each group.

DataFrameGroupBy.cumsum([axis])

Cumulative sum for each group.

DataFrameGroupBy.describe(**kwargs)

Generate descriptive statistics.

DataFrameGroupBy.diff

First discrete difference of element.

DataFrameGroupBy.ffill([limit])

Forward fill the values.

DataFrameGroupBy.fillna

Fill NA/NaN values using the specified method.

DataFrameGroupBy.filter(func[, dropna])

Return a copy of a DataFrame excluding filtered elements.

DataFrameGroupBy.hist

Make a histogram of the DataFrame's columns.

DataFrameGroupBy.idxmax([axis, skipna])

Return index of first occurrence of maximum over requested axis.

DataFrameGroupBy.idxmin([axis, skipna])

Return index of first occurrence of minimum over requested axis.

DataFrameGroupBy.mad

Return the mean absolute deviation of the values over the requested axis.

DataFrameGroupBy.nunique([dropna])

Return DataFrame with counts of unique elements in each position.

DataFrameGroupBy.pad([limit])

Forward fill the values.

DataFrameGroupBy.pct_change([periods, ...])

Calculate pct_change of each value to previous entry in group.

DataFrameGroupBy.plot

Class implementing the .plot attribute for groupby objects.

DataFrameGroupBy.quantile([q, interpolation])

Return group values at the given quantile, a la numpy.percentile.

DataFrameGroupBy.rank([method, ascending, ...])

Provide the rank of values within each group.

DataFrameGroupBy.resample(rule, *args, **kwargs)

Provide resampling when using a TimeGrouper.

DataFrameGroupBy.sample([n, frac, replace, ...])

Return a random sample of items from each group.

DataFrameGroupBy.shift([periods, freq, ...])

Shift each group by periods observations.

DataFrameGroupBy.size()

Compute group sizes.

DataFrameGroupBy.skew

Return unbiased skew over requested axis.

DataFrameGroupBy.take

Return the elements in the given positional indices along an axis.

DataFrameGroupBy.tshift

(DEPRECATED) Shift the time index, using the index's frequency if available.

DataFrameGroupBy.value_counts([subset, ...])

Return a Series or DataFrame containing counts of unique rows.

The following methods are available only for SeriesGroupBy objects.

SeriesGroupBy.hist

Draw histogram of the input series using matplotlib.

SeriesGroupBy.nlargest([n, keep])

Return the largest n elements.

SeriesGroupBy.nsmallest([n, keep])

Return the smallest n elements.

SeriesGroupBy.nunique([dropna])

Return number of unique elements in the group.

SeriesGroupBy.unique

Return unique values of Series object.

SeriesGroupBy.value_counts([normalize, ...])

Return a Series containing counts of unique values within each group.

SeriesGroupBy.is_monotonic_increasing

Return True if the values in each group are monotonically increasing.

SeriesGroupBy.is_monotonic_decreasing

Return True if the values in each group are monotonically decreasing.


* References:

- https://pandas.pydata.org/docs/user_guide/groupby.html

- https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/


In [12]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

## GroupBy Mechanics

In [27]:
df = pd.DataFrame({'key1' : ['a', 'a', 'a','b', 'b','b', 'a'],
                   'key2' : ['one', 'one', 'two', 'one', 'two', 'one','two'],
                   'data1' : [1,1,1,2,2,2,2],
                   'data2' : np.random.randn(7)})
df

  key1 key2  data1     data2
0    a  one      1 -0.539741
1    a  one      1  0.476985
2    a  two      1  3.248944
3    b  one      2 -1.021228
4    b  two      2 -0.577087
5    b  one      2  0.124121
6    a  two      2  0.302614


In [42]:
grouped = df[['data1','key1']].groupby('key1')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000022D3D9A5EE0>

In [29]:
grouped.mean()

      data1
key1
a      1.25
b      2.00

In [44]:
means = df[['data1','key1','key2']].groupby(['key1', 'key2']).mean()
means

           data1
key1 key2
a    one     1.0
     two     1.5
b    one     2.0
     two     2.0


In [45]:
means.unstack()

     data1
key2   one  two
key1
a      1.0  1.5
b      2.0  2.0


In [46]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio', 'Texas', 'Texas'])
years = np.array([2005, 2005, 2006, 2005, 2006, 2007, 2008])
df['data1'].groupby([states, years]).mean()

California  2005    1.0
            2006    1.0
Ohio        2005    1.5
            2006    2.0
Texas       2007    2.0
            2008    2.0
Name: data1, dtype: float64

In [47]:
df.groupby('key1').mean(numeric_only=True)  # key2 holds strings, so average only the numeric columns
df.groupby(['key1', 'key2']).mean()

           data1     data2
key1 key2
a    one     1.0 -0.031378
     two     1.5  1.775779
b    one     2.0 -0.448553
     two     2.0 -0.577087


### Use groupby(...).size() to count the rows in each group

In [48]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     2
b     one     2
      two     1
dtype: int64

In [49]:
df.groupby(['key1', 'key2'])['data1'].count()

key1  key2
a     one     2
      two     2
b     one     2
      two     1
Name: data1, dtype: int64
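`size()` and `count()` agree above only because `data1` has no missing values. A minimal sketch with a hypothetical frame containing one NaN shows where they diverge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'a'
toy = pd.DataFrame({"key": ["a", "a", "b", "b"],
                    "value": [1.0, np.nan, 3.0, 4.0]})
grouped = toy.groupby("key")

# size counts rows, whether or not they contain NaN
rows_per_group = grouped.size()
# count only counts the non-missing values of the selected column
non_null_per_group = grouped["value"].count()
```

Here `rows_per_group` reports two rows for both groups, while `non_null_per_group` reports one value for `a` and two for `b`.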

### Use groupby(...).nunique() to count the unique values of a variable in each group

In [37]:
df.groupby(['key1', 'key2'])['data1'].nunique()

key1  key2
a     one     1
      two     2
b     one     1
      two     1
Name: data1, dtype: int64

### Iterating Over Groups

In [36]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
  key1 key2  data1     data2
0    a  one      1 -0.539741
1    a  one      1  0.476985
2    a  two      1  3.248944
6    a  two      2  0.302614
b
  key1 key2  data1     data2
3    b  one      2 -1.021228
4    b  two      2 -0.577087
5    b  one      2  0.124121


In [38]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)

('a', 'one')
  key1 key2  data1     data2
0    a  one      1 -0.539741
1    a  one      1  0.476985
('a', 'two')
  key1 key2  data1     data2
2    a  two      1  3.248944
6    a  two      2  0.302614
('b', 'one')
  key1 key2  data1     data2
3    b  one      2 -1.021228
5    b  one      2  0.124121
('b', 'two')
  key1 key2  data1     data2
4    b  two      2 -0.577087


In [39]:
pieces = dict(list(df.groupby('key1')))
pieces['b']

  key1 key2  data1     data2
3    b  one      2 -1.021228
4    b  two      2 -0.577087
5    b  one      2  0.124121


In [40]:
df.dtypes
grouped = df.groupby(df.dtypes, axis=1)

In [41]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -0.204708  1.393406
1  0.478943  0.092908
2 -0.519439  0.281746
3 -0.555730  0.769023
4  1.965781  1.246435
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one


### Selecting a Column or Subset of Columns

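# Indexing a GroupBy object with a column name (or a list of names)
# selects the columns to aggregate:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]

# These are shorthand for selecting first and grouping by the key Series:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])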
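A quick way to convince yourself that the two spellings are interchangeable is to aggregate both and compare. The sketch below does that on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"key1": ["a", "a", "b"],
                    "data1": [1, 2, 3],
                    "data2": [10.0, 20.0, 30.0]})

# Selecting after grouping ...
sugar_series = toy.groupby("key1")["data1"].sum()
sugar_frame = toy.groupby("key1")[["data2"]].sum()

# ... matches grouping the already-selected column(s) by the key Series
explicit_series = toy["data1"].groupby(toy["key1"]).sum()
explicit_frame = toy[["data2"]].groupby(toy["key1"]).sum()
```

A single column name yields a Series result, while a list of names yields a DataFrame, even when the list has one element.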

In [42]:
df.groupby(['key1', 'key2'])[['data2']].mean()

              data2
key1 key2
a    one  -0.031378
     two   1.775779
b    one  -0.448553
     two  -0.577087


In [43]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
s_grouped.mean()

key1  key2
a     one    -0.031378
      two     1.775779
b     one    -0.448553
      two    -0.577087
Name: data2, dtype: float64

### Grouping with Dicts and Series

In [44]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people

               a         b         c         d         e
Joe     1.007189 -1.296221  0.274992  0.228913  1.352917
Steve   0.886429 -2.001637 -0.371843  1.669025 -0.438570
Wes    -0.539741       NaN       NaN -1.021228 -0.577087
Jim     0.124121  0.302614  0.523772  0.000940  1.343810
Travis -0.713544 -0.831154 -2.370232 -1.860761 -0.860757


In [45]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}

In [46]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

            blue       red
Joe     0.503905  1.063885
Steve   1.297183 -1.553778
Wes    -1.021228 -1.116829
Jim     0.524712  1.770545
Travis -4.230992 -2.405455


In [47]:
map_series = pd.Series(mapping)
map_series
people.groupby(map_series, axis=1).count()

        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
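Recent pandas releases deprecate the `axis=1` argument of `groupby` (this depends on the installed version; older releases accept it silently). A sketch of an equivalent column grouping that transposes the frame instead is shown below; the `scores` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
scores = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]},
                      index=["Joe", "Wes"])
# 'f' does not match any column and is simply ignored
mapping = {"a": "red", "b": "red", "c": "blue", "f": "orange"}

# Transpose so that columns become rows, group those rows by the mapping,
# aggregate, then transpose back to the original orientation
by_color = scores.T.groupby(mapping).sum().T
```

The result has one column per color (`blue`, `red`) and keeps the original row labels.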


### Grouping with Functions

In [48]:
people.groupby(len).sum()

          a         b         c         d         e
3  0.591569 -0.993608  0.798764 -0.791374  2.119639
5  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757


In [49]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757


### Grouping by Index Levels

In [50]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty          US                            JP
tenor         1         3         5         1         3
0      0.560145 -1.265934  0.119827 -1.063512  0.332883
1     -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2      0.286350  0.377984 -0.753887  0.331286  1.349742
3      0.069877  0.246674 -0.011862  1.004812  1.327195


In [51]:
hier_df.groupby(level='cty', axis=1).count()

cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3


## Data Aggregation

Use agg() to aggregate data:

- df.agg(): alias for df.aggregate()
- df.aggregate()

### Arguments:
- a single function: Series.agg('sum')

- a user-defined function or a lambda function: Series.agg(lambda x: x[x > 0].sum())

- a list of functions: Series.agg([len, 'max', 'min', 'sum', 'median', 'mean', 'std'])

- a dictionary mapping column names to functions: DataFrame.agg({'col1': 'max', 'col2': 'sum', 'col3': 'mean'})
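The four argument forms listed above can be sketched on a grouped frame as follows. The `toy` frame and the column names `col1` and `col2` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "col1": [1, 5, 2],
                    "col2": [10, 20, 30]})
grouped = toy.groupby("key")

# One function name: a single value per group
single = grouped["col1"].agg("sum")
# A user-defined function: here, the sum of the values greater than 1
custom = grouped["col1"].agg(lambda x: x[x > 1].sum())
# A list of functions: one output column per function
several = grouped["col1"].agg(["min", "max", "mean"])
# A dict: a different function for each input column
per_column = grouped.agg({"col1": "max", "col2": "sum"})
```

Passing function names as strings lets pandas dispatch to its own optimized group implementations; a lambda is called once per group and is usually slower.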

In [52]:
df
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    1.7
b    2.0
Name: data1, dtype: float64

In [53]:
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000238086978E0>

In [54]:
def peak_to_peak(arr):
    return arr.max() - arr.min()
# Select the numeric columns: subtraction is undefined for the strings in key2
grouped[['data1', 'data2']].agg(peak_to_peak)

      data1     data2
key1
a         1  3.788685
b         0  1.145349


In [55]:
grouped.describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,0.746672,1.109736,-0.204708,0.137118,0.478943,1.222362,1.965781,3.0,0.910916,0.712217,0.092908,0.669671,1.246435,1.31992,1.393406
b,2.0,-0.537585,0.025662,-0.55573,-0.546657,-0.537585,-0.528512,-0.519439,2.0,0.525384,0.344556,0.281746,0.403565,0.525384,0.647203,0.769023


In [59]:
# Restrict to the numeric columns; statistics such as median are undefined for key2
df.groupby('key1')[['data1', 'data2']].agg([len, 'max', 'min', 'sum', 'median', 'mean', 'std'])


Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,len,max,min,sum,median,mean,std,len,max,min,sum,median,mean,std
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2
a,3,1.965781,-0.204708,2.240016,0.478943,0.746672,1.109736,3,1.393406,0.092908,2.732748,1.246435,0.910916,0.712217
b,2,-0.519439,-0.55573,-1.075169,-0.537585,-0.537585,0.025662,2,0.769023,0.281746,1.050769,0.525384,0.525384,0.344556


### Column-Wise and Multiple Function Application

In [None]:
tips = pd.read_csv('examples/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

In [None]:
grouped = tips.groupby(['day', 'smoker'])

In [None]:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')

In [None]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

In [None]:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])

In [None]:
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)  # a list selects several columns
result

In [None]:
result['tip_pct']

In [None]:
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

In [None]:
grouped.agg({'tip' : np.max, 'size' : 'sum'})
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})

### Returning Aggregated Data Without Row Indexes

In [None]:
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)
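The same flat layout can be obtained by aggregating normally and then calling `reset_index()`. A minimal sketch on a hypothetical stand-in for the tips data:

```python
import pandas as pd

# Hypothetical stand-in for the tips data, used only for illustration
toy = pd.DataFrame({"day": ["Fri", "Fri", "Sat"],
                    "smoker": ["No", "No", "Yes"],
                    "tip": [1.0, 3.0, 2.0]})

# Group keys stay as ordinary columns
flat = toy.groupby(["day", "smoker"], as_index=False).mean()

# Same table: aggregate first, then move the keys out of the index
flat_via_reset = toy.groupby(["day", "smoker"]).mean().reset_index()
```

`as_index=False` avoids building the hierarchical index in the first place, which saves a step when the result is only going to be flattened again.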

## Apply: General split-apply-combine

In [None]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips, n=6)

In [None]:
tips.groupby('smoker').apply(top)

In [None]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

In [None]:
result = tips.groupby('smoker')['tip_pct'].describe()
result
result.unstack('smoker')

f = lambda x: x.describe()
grouped.apply(f)

### Suppressing the Group Keys

In [None]:
tips.groupby('smoker', group_keys=False).apply(top)
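The effect of `group_keys` is visible in the index of the result. The sketch below uses a hypothetical frame and a Series-based variant of `top` (an assumption made to keep the example small) to compare both settings.

```python
import pandas as pd

# Hypothetical stand-in for the tips data, used only for illustration
toy = pd.DataFrame({"smoker": ["No", "No", "Yes", "Yes"],
                    "tip": [1.0, 3.0, 2.0, 5.0]})

def top(values, n=1):
    # Largest n values of a Series, keeping the original row labels
    return values.sort_values().iloc[-n:]

# Default: the group key is prepended as an extra index level
with_keys = toy.groupby("smoker")["tip"].apply(top)

# group_keys=False: only the original row labels remain
without_keys = toy.groupby("smoker", group_keys=False)["tip"].apply(top)
```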

### Quantile and Bucket Analysis

In [None]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

In [None]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()

In [None]:
# Return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
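`pd.cut` and `pd.qcut` bucket the same data differently: `cut` makes bins of equal width, `qcut` makes bins with equal numbers of observations. A small sketch (the seed and the variable names are arbitrary choices for this illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
values = pd.Series(rng.standard_normal(1000))

# Four bins of equal width spanning the range of the data
equal_width = pd.cut(values, 4)
# Four bins holding equal numbers of observations (sample quartiles)
equal_count = pd.qcut(values, 4, labels=False)

width_counts = equal_width.value_counts()
quartile_counts = equal_count.value_counts()
```

With 1000 distinct values, every `qcut` bucket holds 250 observations, whereas the `cut` buckets are typically much fuller in the middle of the distribution than in the tails.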

### Example: Filling Missing Values with Group-Specific Values

In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s
s.fillna(s.mean())

In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data

In [None]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
data.groupby(group_key).mean()

In [None]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

In [None]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
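An alternative to `apply` for the group-mean case is `transform('mean')`, which returns the group mean aligned to the original index so that it can be passed straight to `fillna`. This is a sketch on hypothetical values, not the random data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration
toy = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                index=["Ohio", "Vermont", "Florida", "Oregon", "Nevada", "Idaho"])
regions = ["East"] * 3 + ["West"] * 3

# Mean of each region, repeated for every row of that region
group_means = toy.groupby(regions).transform("mean")

# Missing entries are replaced by the mean of their own region
filled = toy.fillna(group_means)
```

Because the result keeps the original index and order, this variant does not depend on how `apply` chooses to label its output.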

### Example: Random Sampling and Permutation

In [None]:
# Hearts, Spades, Clubs, Diamonds
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = pd.Series(card_val, index=cards)

In [None]:
deck[:13]

In [None]:
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)

In [None]:
get_suit = lambda card: card[-1] # last letter is suit
deck.groupby(get_suit).apply(draw, n=2)

In [None]:
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

### Example: Group Weighted Average and Correlation

In [None]:
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
df

In [None]:
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
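To make the weighted average concrete, the sketch below computes sum(w * x) / sum(w) by hand for each group and compares it with `np.average`. The numbers are hypothetical and chosen so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the arithmetic is easy to follow
toy = pd.DataFrame({"category": ["a", "a", "b", "b"],
                    "data": [1.0, 3.0, 10.0, 20.0],
                    "weights": [1.0, 3.0, 1.0, 1.0]})

def weighted_mean(group):
    # sum(w * x) / sum(w), the quantity np.average computes when given weights
    return (group["data"] * group["weights"]).sum() / group["weights"].sum()

by_hand = {name: weighted_mean(group) for name, group in toy.groupby("category")}
via_numpy = {name: np.average(group["data"], weights=group["weights"])
             for name, group in toy.groupby("category")}
```

For group `a` this gives (1*1 + 3*3) / 4 = 2.5, and for group `b` (10 + 20) / 2 = 15.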

In [None]:
close_px = pd.read_csv('examples/stock_px_2.csv', parse_dates=True,
                       index_col=0)
close_px.info()
close_px[-4:]

In [None]:
spx_corr = lambda x: x.corrwith(x['SPX'])

In [None]:
rets = close_px.pct_change().dropna()

In [None]:
get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

In [None]:
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

### Example: Group-Wise Linear Regression

In [None]:
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so that adding a column does not modify the group
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

In [None]:
by_year.apply(regress, 'AAPL', ['SPX'])
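For readers without statsmodels installed, the same group-wise least-squares idea can be sketched with `np.linalg.lstsq`. This swaps in NumPy for `sm.OLS`, and uses synthetic data with a known slope and intercept so the output can be checked; the column names `year`, `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data where y = 2 * x + 1 exactly, split across two "years"
toy = pd.DataFrame({"year": [2020] * 3 + [2021] * 3,
                    "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]})
toy["y"] = 2.0 * toy["x"] + 1.0

def regress(group, yvar, xvar):
    # Design matrix: the regressor plus a constant column for the intercept
    X = np.column_stack([group[xvar].to_numpy(), np.ones(len(group))])
    coef, *_ = np.linalg.lstsq(X, group[yvar].to_numpy(), rcond=None)
    return pd.Series(coef, index=[xvar, "intercept"])

# One row of fitted parameters per year
params = pd.DataFrame({year: regress(group, "y", "x")
                       for year, group in toy.groupby("year")}).T
```

Each row of `params` holds the slope (column `x`) and the intercept fitted on that year's rows only.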

## Pivot Tables and Cross-Tabulation

In [None]:
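# The default aggfunc is the mean, so restrict the values to numeric columns
tips.pivot_table(index=['day', 'smoker'],
                 values=['size', 'tip', 'tip_pct', 'total_bill'])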

In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker')

In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'],
                 columns='smoker', margins=True)

In [None]:
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day',
                 aggfunc=len, margins=True)

In [None]:
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
                 columns='day', aggfunc='mean', fill_value=0)
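A pivot table is a convenience over `groupby` followed by `unstack`. The sketch below builds the same table both ways on a hypothetical stand-in for the tips data.

```python
import pandas as pd

# Hypothetical stand-in for the tips data, used only for illustration
toy = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                    "smoker": ["No", "Yes", "No", "No"],
                    "tip": [1.0, 2.0, 3.0, 5.0]})

pivoted = toy.pivot_table("tip", index="day", columns="smoker",
                          aggfunc="mean", fill_value=0)

# Equivalent construction: group on both keys, average,
# then move 'smoker' from the index to the columns
via_groupby = toy.groupby(["day", "smoker"])["tip"].mean().unstack(fill_value=0)
```

The (`Sat`, `Yes`) combination never occurs in the data, so `fill_value=0` supplies the entry for that cell in both versions.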

### Cross-Tabulations: Crosstab

In [None]:
from io import StringIO
data = """\
Sample  Nationality  Handedness
1   USA  Right-handed
2   Japan    Left-handed
3   USA  Right-handed
4   Japan    Right-handed
5   Japan    Left-handed
6   Japan    Right-handed
7   USA  Right-handed
8   USA  Left-handed
9   Japan    Right-handed
10  USA  Right-handed"""
data = pd.read_table(StringIO(data), sep=r'\s+')  # raw string avoids an invalid-escape warning

In [None]:
data

In [None]:
pd.crosstab(data.Nationality, data.Handedness, margins=True)

In [None]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
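A cross-tabulation is a table of group sizes, so it can also be produced with `groupby(...).size()` and `unstack`. A minimal sketch on hypothetical survey rows:

```python
import pandas as pd

# Hypothetical survey rows, used only for illustration
toy = pd.DataFrame({"Nationality": ["USA", "Japan", "USA", "Japan", "USA"],
                    "Handedness": ["Right", "Left", "Right", "Right", "Left"]})

counts = pd.crosstab(toy["Nationality"], toy["Handedness"])

# The same frequency table from groupby: count rows per pair, then pivot
via_groupby = (toy.groupby(["Nationality", "Handedness"])
                  .size()
                  .unstack(fill_value=0))
```

`crosstab` additionally offers `margins=True` for row and column totals, which the groupby version would have to add by hand.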

In [None]:
pd.options.display.max_rows = PREVIOUS_MAX_ROWS

## Conclusion