![sslogo](https://github.com/stratascratch/stratascratch.github.io/raw/master/assets/sslogo.jpg)

# Quick Rendering Hack


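We'll create a quick class that lets us display multiple ``DataFrame``s side by side. Its arguments are strings containing Python expressions; the class evaluates each one with ``eval`` and uses the expression text as the label, so pass it only code you trust. The class relies on the special ``_repr_html_`` method, which IPython uses to implement its rich object display: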

In [0]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
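
For comparison, here is a minimal sketch of the same idea without ``eval``. The class name ``SideBySide`` and its keyword-argument interface are hypothetical and not part of this notebook; the only thing it relies on is that IPython looks for a ``_repr_html_`` method when displaying an object.

```python
import pandas as pd


class SideBySide:
    """Render named objects next to each other (hypothetical helper)."""

    template = ('<div style="float: left; padding: 10px;">'
                '<p style="font-family: monospace">{0}</p>{1}</div>')

    def __init__(self, **named_objects):
        # Objects are passed by keyword, so no string needs to be evaluated.
        self.named_objects = named_objects

    def _repr_html_(self):
        # IPython calls this method to obtain the rich HTML display.
        return '\n'.join(self.template.format(name, obj._repr_html_())
                         for name, obj in self.named_objects.items())

    def __repr__(self):
        # Plain-text fallback for front ends without HTML support.
        return '\n\n'.join(name + '\n' + repr(obj)
                           for name, obj in self.named_objects.items())


left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})
html = SideBySide(left=left, right=right)._repr_html_()
```

The trade-off is that the label is the keyword name rather than the full expression, which is why the ``eval``-based version above is more convenient for the examples in this notebook.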
    

## Import Required Modules

In [0]:
import numpy as np
import pandas as pd

# GroupBy: Split, Apply, Combine

## Pull Planets Data

In [0]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [0]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


### Split, apply, combine

#### Create a GroupBy object grouped by method for planets

In [0]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc9d8a1a668>

#### Create a GroupBy object grouped by method and select the mass column

In [0]:
planets.groupby('method')['mass']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fc9d8a1ad30>

#### Use a GroupBy object grouped by method, selecting the orbital_period column, to calculate the mean orbital period per method

In [0]:
planets.groupby('method')['orbital_period'].mean()

method
Astrometry                          631.180000
Eclipse Timing Variations          4751.644444
Imaging                          118247.737500
Microlensing                       3153.571429
Orbital Brightness Modulation         0.709307
Pulsar Timing                      7343.021201
Pulsation Timing Variations        1170.000000
Radial Velocity                     823.354680
Transit                              21.102073
Transit Timing Variations            79.783500
Name: orbital_period, dtype: float64
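
To make the three steps explicit, the sketch below performs the split, apply, and combine stages by hand and compares the result with the built-in call. The ``toy`` table is made-up data standing in for ``planets`` so the block runs without downloading anything.

```python
import pandas as pd

# Small made-up table standing in for the planets data.
toy = pd.DataFrame({'method': ['Transit', 'Imaging', 'Transit', 'Imaging'],
                    'orbital_period': [10.0, 1000.0, 30.0, 3000.0]})

# Split: iterating over a GroupBy yields (group label, sub-DataFrame) pairs.
# Apply: reduce each sub-DataFrame to a single number.
pieces = {label: group['orbital_period'].mean()
          for label, group in toy.groupby('method')}

# Combine: collect the per-group results into one labelled Series.
manual = pd.Series(pieces, name='orbital_period').rename_axis('method')

# The one-line equivalent used in the cell above.
builtin = toy.groupby('method')['orbital_period'].mean()
```

Both ``manual`` and ``builtin`` hold the same values; the built-in form simply hides the loop.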

### Aggregate, filter, transform, apply

In [0]:
rng = np.random.RandomState(4)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(1, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9


#### Aggregation

#### Get the min, median, and max aggregates grouped by key from the df dataset

In [0]:
df.groupby('key').aggregate(['min', 'median', 'max'])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,8,8.5,9
B,1,2.5,4,6,7.0,8
C,2,3.5,5,2,5.5,9
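
When only some column and function combinations are needed, named aggregation is one alternative to passing a list. The sketch below rebuilds ``df`` from the literal values printed above so that it runs on its own; the output column names (``data1_min`` and so on) are arbitrary choices for illustration.

```python
import pandas as pd

# Same values as the df shown above, written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [8, 6, 2, 9, 8, 9]})

# Named aggregation: output_column=(input_column, function name).
summary = df.groupby('key').agg(data1_min=('data1', 'min'),
                                data2_median=('data2', 'median'),
                                data2_max=('data2', 'max'))
```

This yields flat column names instead of the two-level header produced by the list form.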


#### Get the count, mean, and standard deviation aggregates of orbital period grouped by method for the planets dataset

In [0]:
planets.groupby('method')['orbital_period'].aggregate(['count', 'mean', 'std'])

Unnamed: 0_level_0,count,mean,std
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Astrometry,2,631.18,544.217663
Eclipse Timing Variations,9,4751.644444,2499.130945
Imaging,12,118247.7375,213978.177277
Microlensing,7,3153.571429,1113.166333
Orbital Brightness Modulation,3,0.709307,0.725493
Pulsar Timing,5,7343.021201,16313.265573
Pulsation Timing Variations,1,1170.0,
Radial Velocity,553,823.35468,1454.92621
Transit,397,21.102073,46.185893
Transit Timing Variations,3,79.7835,71.599884


#### Filtering

#### Create a filter function that returns True for keys whose data2 mean is greater than 4. Filter the df dataset grouped by key. Display the input table, the per-key mean values, and the filtered table

In [0]:
def filter_func(x):
    return x['data2'].mean() > 4

display('df', "df.groupby('key').mean()", "df.groupby('key').filter(filter_func)")

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,1.5,8.5
B,2.5,7.0
C,3.5,5.5

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9
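
With a threshold of 4, every key passes (the group means of data2 are 8.5, 7.0, and 5.5), so the filtered table is identical to the input. The sketch below raises the threshold to 6, an arbitrary value chosen only to show a group being dropped; ``df`` is rebuilt from the values printed above so the block is self-contained.

```python
import pandas as pd

# Same values as the df shown above, written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [8, 6, 2, 9, 8, 9]})

# Keep only the groups whose data2 mean exceeds 6.
# Key C has a mean of 5.5, so both of its rows are removed.
kept = df.groupby('key').filter(lambda g: g['data2'].mean() > 6)
```

Note that ``filter`` keeps or drops whole groups and returns the original rows, not one row per group.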


#### Create a filter function that returns True for methods with more than 10 non-null orbital_period values. Filter the planets dataset grouped by method. Display the per-method counts before and after filtering

Remember: you can chain groupby method calls to perform more complex aggregations.

In [0]:
def filter_func(x):
    return x['orbital_period'].count() > 10

display('planets.groupby("method").count().head()', 'planets.groupby("method").filter(filter_func).groupby("method").count()')

Unnamed: 0_level_0,number,orbital_period,mass,distance,year
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Astrometry,2,2,0,2,2
Eclipse Timing Variations,9,9,2,4,9
Imaging,38,12,0,32,38
Microlensing,23,7,0,10,23
Orbital Brightness Modulation,3,3,0,2,3

Unnamed: 0_level_0,number,orbital_period,mass,distance,year
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Imaging,38,12,0,32,38
Radial Velocity,553,553,510,530,553
Transit,397,397,1,224,397


#### Transformation

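#### Transform the df dataset by subtracting from each value its group's mean divided by its group's standard deviation, grouped by key. Use a lambda function

Note: `/` binds more tightly than `-`, so the lambda below computes `x - (x.mean() / x.std())`. This is not the same as the standardization `(x - x.mean()) / x.std()`.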

In [0]:
df.groupby('key').transform(lambda x: x - x.mean() / x.std())

Unnamed: 0,data1,data2
0,-0.707107,-4.020815
1,-0.178511,1.050253
2,0.350084,0.888832
3,2.292893,-3.020815
4,2.821489,3.050253
5,3.350084,7.888832
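
Because of operator precedence, the expression above is not a z-score. The sketch below contrasts it with the parenthesized form, which is an assumption about a common use case rather than what the cell above computes. ``df`` is rebuilt from the values printed earlier so the block runs on its own.

```python
import pandas as pd

# Same values as the df shown above, written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [8, 6, 2, 9, 8, 9]})

grouped = df.groupby('key')[['data1', 'data2']]

# As written in the cell above: x - (mean / std).
as_written = grouped.transform(lambda x: x - x.mean() / x.std())

# Group-wise standardization: (x - mean) / std.
standardized = grouped.transform(lambda x: (x - x.mean()) / x.std())

# After standardization, every group has a mean of (numerically) zero.
group_means = standardized.groupby(df['key']).mean()
```

Here each group has exactly two rows, so the standardized values are all roughly ±0.707; with larger groups the values would spread out more.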


#### Create a transformation function that takes a Series and checks its dtype: if the dtype is float64, it returns the Series divided by 1000; otherwise it returns the Series unchanged. Apply this transformation to the planets dataset to rescale the float64 columns. Display the input and output tables

In [0]:
def transformation_func(x):
    if x.dtype == 'float64':
        return x / 1000
    else:
        return x
    
display('planets.head()', 'planets.transform(transformation_func).head()')

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,0.2693,0.0071,0.0774,2006
1,Radial Velocity,1,0.874774,0.00221,0.05695,2008
2,Radial Velocity,1,0.763,0.0026,0.01984,2011
3,Radial Velocity,1,0.32603,0.0194,0.11062,2007
4,Radial Velocity,1,0.51622,0.0105,0.11947,2009
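
The same rescaling can also be written without a per-column function by selecting the float64 columns first. This is only a sketch of an alternative; the ``toy`` frame is made-up data with the same kinds of columns as ``planets``.

```python
import pandas as pd

# Made-up rows with a string, an integer, and a float column.
toy = pd.DataFrame({'method': ['Radial Velocity', 'Transit'],
                    'number': [1, 2],
                    'orbital_period': [269.3, 3.5],
                    'year': [2006, 2010]})

# Pick out the float64 columns, then rescale only those.
float_cols = toy.select_dtypes(include='float64').columns
scaled = toy.copy()
scaled[float_cols] = scaled[float_cols] / 1000
```

Integer columns such as ``number`` and ``year`` are left untouched, matching the behavior of ``transformation_func``.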


#### Apply

#### Define a function that takes a DataFrame and assigns data1 + data2.mean() to a new column data3. Apply that function to the df dataset grouped by key. Display the input and output tables

In [0]:
def func(x):
    # x is a DataFrame holding the rows of one group
    x = x.copy()  # work on a copy so the group passed in is not mutated
    x['data3'] = x['data1'] + x['data2'].mean()
    return x

display('df', "df.groupby('key').apply(func)")

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9

Unnamed: 0,key,data1,data2,data3
0,A,0,8,8.5
1,B,1,6,8.0
2,C,2,2,7.5
3,A,3,9,11.5
4,B,4,8,11.0
5,C,5,9,10.5
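
The same data3 column can be computed without ``apply`` by broadcasting each group's mean back onto its rows with ``transform``. This is a sketch of an equivalent formulation, with ``df`` rebuilt from the values printed above.

```python
import pandas as pd

# Same values as the df shown above, written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [8, 6, 2, 9, 8, 9]})

# transform('mean') returns one value per row: the mean of that row's group.
group_mean = df.groupby('key')['data2'].transform('mean')

# assign returns a new DataFrame and leaves df itself unchanged.
with_data3 = df.assign(data3=df['data1'] + group_mean)
```

The resulting data3 values match the table above: 8.5, 8.0, 7.5, 11.5, 11.0, 10.5.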


### Specifying the split key

#### Create a Python list providing grouping keys for the df dataset and calculate the sum based on that grouping. Display the input and output tables

In [0]:
L = [0, 1, 3, 1, 2, 0]
display('df', 'df.groupby(L).sum()')

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9

Unnamed: 0,data1,data2
0,5,17
1,4,15
2,4,8
3,2,2


#### Pull a column from the df dataset, apply any functions of your choice, and use the resulting array as the grouping key. Calculate the sums for the resulting groups. Display the input and output tables

In [0]:
x = df['data2']
x = (x ** 2) % 3

display('df', 'df.groupby(x).sum()')

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9

Unnamed: 0_level_0,data1,data2
data2,Unnamed: 1_level_1,Unnamed: 2_level_1
0,9,24
1,6,18


#### Define a Python function that groups df based on whether or not the index is even. Calculate the median value of the resulting groups. Display the input and output tables

In [0]:
def func(index):
    if index % 2 == 0:
        return 'even'
    else:
        return 'odd'

display('df', 'df.groupby(func).median()')

Unnamed: 0,key,data1,data2
0,A,0,8
1,B,1,6
2,C,2,2
3,A,3,9
4,B,4,8
5,C,5,9

Unnamed: 0,data1,data2
even,2,8
odd,3,9
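
Besides lists, arrays, and functions, ``groupby`` also accepts a dictionary that maps index values to group labels. The sketch below assumes the key column is moved into the index first; the ``mapping`` labels are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Same values as the df shown above, written out literally.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [8, 6, 2, 9, 8, 9]})

# Move key into the index so the dictionary can be matched against it.
df2 = df.set_index('key')

# Map each index value to the group it should belong to.
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
by_letter_type = df2.groupby(mapping).sum()
```

Rows with keys B and C are summed together under ``consonant``, while the two A rows form the ``vowel`` group.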
