# Categoricals and groupby
## Split-apply-combine
- `sales.groupby('weekday').count()`
    - split by `'weekday'`
    - apply the `count()` function to each group
    - combine the counts per group
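
The three steps above can be sketched on a small, made-up table. The `sales_demo` DataFrame and its values are hypothetical stand-ins for the `sales` data; the explicit loop is only there to make the split, apply and combine steps visible.

```python
import pandas as pd

# Hypothetical toy data standing in for the `sales` DataFrame
sales_demo = pd.DataFrame({
    'weekday': ['Mon', 'Tue', 'Mon', 'Wed', 'Tue', 'Mon'],
    'units': [3, 9, 13, 13, 14, 2],
})

# One call: split by 'weekday', apply count() to each group, combine the results
counts = sales_demo.groupby('weekday').count()

# The same result built by hand, one step at a time
manual = {}
for day, group in sales_demo.groupby('weekday'):   # split
    manual[day] = group['units'].count()           # apply
manual = pd.Series(manual, name='units')           # combine

print(counts)
print(manual)
```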
## Aggregation/Reduction
- Some reducing functions
    - mean()
    - std()
    - sum()
    - first(), last()
    - min(), max()
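
As a rough illustration, each of these reducing functions collapses every group to a single value. The `reduce_demo` table below is invented for the purpose and is not part of the course data.

```python
import pandas as pd

# Hypothetical data: two groups, 'a' and 'b'
reduce_demo = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b', 'b'],
    'val': [1.0, 3.0, 2.0, 4.0, 6.0],
})
grouped = reduce_demo.groupby('key')['val']

# Each reduction returns one value per group
summary = pd.DataFrame({
    'mean': grouped.mean(),
    'std': grouped.std(),
    'sum': grouped.sum(),
    'first': grouped.first(),
    'last': grouped.last(),
    'min': grouped.min(),
    'max': grouped.max(),
})
print(summary)
```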
## Categorical data
- Advantages
    - Uses less memory
    - Speeds up operations like groupby()
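
The memory advantage can be checked with a minimal sketch like the one below. The repeated region labels are made up, and the exact byte counts depend on the pandas version, so only the comparison is meaningful. The groupby speed-up is not measured here.

```python
import pandas as pd

# Hypothetical column: a few distinct labels repeated many times
regions_obj = pd.Series(['Europe', 'Asia', 'Africa', 'America'] * 25000)

# Convert to a categorical: labels are stored once, rows hold small integer codes
regions_cat = regions_obj.astype('category')

mem_obj = regions_obj.memory_usage(deep=True)
mem_cat = regions_cat.memory_usage(deep=True)
print(mem_obj, mem_cat)
print(regions_cat.cat.categories)
```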

### Grouping by multiple columns
In this exercise, you will return to working with the `Titanic` dataset and use `.groupby()` to analyze the distribution of passengers who boarded the Titanic.

The `'pclass'` column identifies which class of ticket was purchased by the passenger and the `'embarked'` column indicates at which of the three ports the passenger boarded the Titanic. `'S'` stands for Southampton, England, `'C'` for Cherbourg, France and `'Q'` for Queenstown, Ireland.

Your job is to first group by the `'pclass'` column and count the number of rows in each class using the `'survived'` column. You will then group by the `'embarked'` and `'pclass'` columns and count the number of passengers.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data_file = '../_datasets/titanic.csv'

titanic = pd.read_csv(data_file)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
# Group titanic by 'pclass'
by_class = titanic.groupby('pclass')
by_class.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
323,2,0,"Abelson, Mr. Samuel",male,30.0,1,0,P/PP 3381,24.0,,C,,,"Russia New York, NY"
324,2,1,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C,10,,"Russia New York, NY"
325,2,0,"Aldworth, Mr. Charles Augustus",male,30.0,0,0,248744,13.0,,S,,,"Bryn Mawr, PA, USA"
326,2,0,"Andrew, Mr. Edgardo Samuel",male,18.0,0,0,231945,11.5,,S,,,"Buenos Aires, Argentina / New Jersey, NJ"
327,2,0,"Andrew, Mr. Frank Thomas",male,25.0,0,0,C.A. 34050,10.5,,S,,,"Cornwall, England Houghton, MI"


In [4]:
# Aggregate 'survived' column of by_class by count
count_by_class = by_class.survived.count()

# Print count_by_class
count_by_class

pclass
1    323
2    277
3    709
Name: survived, dtype: int64

In [5]:
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked','pclass'])

# Aggregate 'survived' column of by_mult by count
count_mult = by_mult.survived.count()

# Print count_mult
count_mult

embarked  pclass
C         1         141
          2          28
          3         101
Q         1           3
          2           7
          3         113
S         1         177
          2         242
          3         495
Name: survived, dtype: int64

### Grouping by another series
In this exercise, you'll use two data sets from [Gapminder.org][1] to investigate the average life expectancy (in years) at birth in 2010 for the 6 continental regions. To do this you'll read the life expectancy data per country into one pandas DataFrame and the association between country and region into another.

By setting the index of both DataFrames to the country name, you'll then use the region information to group the countries in the life expectancy DataFrame and compute the mean value for 2010.
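
A small sketch of why the shared index matters, using invented countries `A` to `D` and invented values: pandas aligns the grouping Series to the DataFrame by index label, so the two objects do not need to be in the same row order.

```python
import pandas as pd

# Hypothetical life expectancy table indexed by country
life_demo = pd.DataFrame(
    {'2010': [80.0, 82.0, 60.0, 58.0]},
    index=pd.Index(['A', 'B', 'C', 'D'], name='Country'))

# Hypothetical region lookup, deliberately stored in a different row order
region_demo = pd.Series(
    ['South', 'South', 'North', 'North'],
    index=pd.Index(['D', 'C', 'B', 'A'], name='Country'),
    name='region')

# The Series is matched to life_demo by index label, not by position
mean_2010 = life_demo.groupby(region_demo)['2010'].mean()
print(mean_2010)
```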

The life expectancy CSV file is available to you in the variable `life_fname` and the regions filename is available in the variable `regions_fname`.

[1]: http://gapminder.org/

In [6]:
life_fname = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/life_expectancy.csv'
regions_fname = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1650/datasets/regions.csv'

In [7]:
# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')
life.head()

Country,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,...,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
Afghanistan,33.639,34.152,34.662,35.17,35.674,36.172,36.663,37.143,37.614,38.075,...,56.583,57.071,57.582,58.102,58.618,59.124,59.612,60.079,60.524,60.947
Albania,65.475,65.863,66.122,66.316,66.5,66.702,66.948,67.251,67.595,67.966,...,75.725,75.949,76.124,76.278,76.433,76.598,76.78,76.979,77.185,77.392
Algeria,47.953,48.389,48.806,49.205,49.592,49.976,50.366,50.767,51.195,51.67,...,69.682,69.854,70.02,70.18,70.332,70.477,70.615,70.747,70.874,71.0
Angola,34.604,35.007,35.41,35.816,36.222,36.627,37.032,37.439,37.846,38.247,...,48.036,48.572,49.041,49.471,49.882,50.286,50.689,51.094,51.498,51.899
Antigua and Barbuda,63.775,64.149,64.511,64.865,65.213,65.558,65.898,66.232,66.558,66.875,...,74.355,74.544,74.729,74.91,75.087,75.263,75.437,75.61,75.783,75.954


In [8]:
# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')
regions.head()

Country,region
Afghanistan,South Asia
Albania,Europe & Central Asia
Algeria,Middle East & North Africa
Angola,Sub-Saharan Africa
Antigua and Barbuda,America


In [9]:
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])
# Print the mean over the '2010' column of life_by_region
life_by_region['2010'].mean()

region
America                       74.037350
East Asia & Pacific           73.405750
Europe & Central Asia         75.656387
Middle East & North Africa    72.805333
South Asia                    68.189750
Sub-Saharan Africa            57.575080
Name: 2010, dtype: float64

It looks like the average life expectancy (in years) at birth in 2010 was highest in `Europe & Central Asia` and lowest in `Sub-Saharan Africa`.

# Groupby and aggregation
## String names
- `'sum'`
- `'mean'`
- `'count'`
### Computing multiple aggregates of multiple columns
The `.agg()` method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index.
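
To make the multi-level column index concrete, here is a minimal sketch on an invented five-row table (`agg_input` and its values are hypothetical). Each column of the result is addressed by a `(column, aggregation)` tuple.

```python
import pandas as pd

# Hypothetical passenger data
agg_input = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 2],
    'age': [30.0, 50.0, 20.0, 40.0, 60.0],
    'fare': [100.0, 200.0, 10.0, 20.0, 30.0],
})

# A list of aggregations on several columns gives two-level columns
agg_demo = agg_input.groupby('pclass')[['age', 'fare']].agg(['max', 'median'])
print(agg_demo.columns.tolist())

# Select one (column, aggregation) pair with a tuple
oldest = agg_demo.loc[:, ('age', 'max')]
median_fare = agg_demo.loc[:, ('fare', 'median')]
print(oldest)
print(median_fare)
```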

In this exercise, you're going to group passengers on the `Titanic` by `'pclass'` and aggregate the `'age'` and `'fare'` columns by the functions `'max'` and `'median'`. You'll then use multi-level selection to find the oldest passenger per class and the median fare price per class.

In [10]:
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [11]:
# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max','median'])

# Print the maximum age in each class
aggregated.loc[:, ('age','max')]

pclass
1    80.0
2    70.0
3    74.0
Name: (age, max), dtype: float64

In [12]:
# Print the median fare in each class
aggregated.loc[:, ('fare','median')]

pclass
1    60.0000
2    15.0458
3     8.0500
Name: (fare, median), dtype: float64

It isn't surprising that the highest median fare was for the 1st passenger class.

### Aggregating on index levels/fields
If you have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns.
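
A minimal sketch of grouping on index levels, assuming a made-up three-level index and invented numbers. The `level_demo` table and the `spread_demo` helper are illustrative only.

```python
import pandas as pd

# Hypothetical data with a three-level row index
idx = pd.MultiIndex.from_tuples(
    [(2000, 'East', 'X'), (2000, 'East', 'Y'), (2000, 'West', 'Z'),
     (2001, 'East', 'X'), (2001, 'East', 'Y'), (2001, 'West', 'Z')],
    names=['Year', 'region', 'Country'])
level_demo = pd.DataFrame(
    {'population': [10, 20, 30, 11, 21, 31],
     'gdp': [1.0, 5.0, 3.0, 2.0, 8.0, 4.0]},
    index=idx)

def spread_demo(series):
    """Range of the values in a group."""
    return series.max() - series.min()

# Group on two of the three index levels, then aggregate each column differently
by_level = level_demo.groupby(level=['Year', 'region'])
level_result = by_level.agg({'population': 'sum', 'gdp': spread_demo})
print(level_result)
```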

In this exercise you'll use the full Gapminder dataset which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013.

Your job is to create a DataFrame with a multi-level row index built from the columns `'Year'`, `'region'` and `'Country'`. Next you'll group the DataFrame by the `'Year'` and `'region'` levels. Finally, you'll apply a dictionary aggregation to compute the total population, spread of per capita GDP values and average child mortality rate.

In [13]:
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv('../_datasets/gapminder_tidy.csv', index_col=['Year','region','Country']).sort_index()
gapminder.head()

Year,region,Country,fertility,life,population,child_mortality,gdp
1964,America,Antigua and Barbuda,4.25,63.775,58653.0,72.78,5008.0
1964,America,Argentina,3.068,65.388,21966478.0,57.43,8227.0
1964,America,Aruba,4.059,67.113,57031.0,,5505.0
1964,America,Bahamas,4.22,64.189,133709.0,48.56,18160.0
1964,America,Barbados,4.094,62.819,234455.0,64.7,5681.0


In [14]:
# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

In [15]:
# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year','region'])

# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)

# Print the last 10 entries of aggregated
aggregated.tail(10)

Year,region,population,child_mortality,gdp
2012,Europe & Central Asia,894506500.0,10.241042,86833.0
2012,Middle East & North Africa,396327600.0,20.9035,128183.0
2012,South Asia,1677417000.0,48.1125,10744.0
2012,Sub-Saharan Africa,898385700.0,80.077347,34666.0
2013,America,962908700.0,17.745833,49634.0
2013,East Asia & Pacific,2244209000.0,22.285714,134744.0
2013,Europe & Central Asia,896878800.0,9.831875,86418.0
2013,Middle East & North Africa,403050400.0,20.2215,128676.0
2013,South Asia,1701241000.0,46.2875,11469.0
2013,Sub-Saharan Africa,920599600.0,76.94449,32035.0


### Grouping on a function of the index
Groupby operations can also be performed on transformations of the index values. In the case of a `DatetimeIndex`, we can extract portions of the datetime over which to group.
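
As a sketch, grouping on a transformed index looks like this on four invented timestamps. The `dayofweek` variant is an addition for comparison and is not part of the exercise; it avoids locale-dependent day names by using integers (0 is Monday).

```python
import pandas as pd

# Hypothetical sales records indexed by timestamp
idx = pd.to_datetime(['2015-02-02 08:30', '2015-02-03 14:00',
                      '2015-02-09 09:00', '2015-02-10 10:00'])
date_demo = pd.DataFrame({'Units': [3, 13, 19, 5]}, index=idx)

# Group by the abbreviated weekday name derived from the index
units_by_name = date_demo.groupby(date_demo.index.strftime('%a'))['Units'].sum()

# Alternative: group by the integer day of the week (0 = Monday)
units_by_dow = date_demo.groupby(date_demo.index.dayofweek)['Units'].sum()

print(units_by_name)
print(units_by_dow)
```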

In this exercise you'll read in a set of sample sales data from February 2015 and assign the `'Date'` column as the index. Your job is to group the sales data by the day of the week and aggregate the sum of the `'Units'` column.

Is there a day of the week that is more popular for customers? To find out, you're going to use `.strftime('%a')` to transform the index datetime values to abbreviated days of the week.

In [16]:
# Read file: sales
sales = pd.read_csv('../_datasets/sales-feb-2015.csv',index_col='Date',parse_dates=True)
sales.head()

Date,Company,Product,Units
2015-02-02 08:30:00,Hooli,Software,3
2015-02-02 21:00:00,Mediacore,Hardware,9
2015-02-03 14:00:00,Initech,Software,13
2015-02-04 15:30:00,Streeplex,Software,13
2015-02-04 22:00:00,Acme Coporation,Hardware,14


In [17]:
# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
units_sum = by_day['Units'].sum()

# Print units_sum
print(units_sum)

Mon    48
Sat     7
Thu    59
Tue    13
Wed    48
Name: Units, dtype: int64


# Groupby and transformation
### Detecting outliers with Z-Scores
As Dhavide demonstrated in the video with the `zscore` function, calling `.transform()` on a groupby object applies a function to each group independently. **The z-score is also useful for finding outliers: a value whose z-score is above 3 or below -3 is generally considered an outlier**.
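
The per-group z-score can be written out by hand as a rough sketch. The data are invented, and `zscore_by_hand` is a hypothetical helper that uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical life expectancy values for two regions
z_demo = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'life': [70.0, 72.0, 74.0, 50.0, 60.0, 70.0],
})

def zscore_by_hand(series):
    """Standardize a group using its own mean and population std (ddof=0)."""
    return (series - series.mean()) / series.std(ddof=0)

# transform() returns a result with the same index as the input,
# so it can be assigned straight back as a new column
z_demo['life_z'] = z_demo.groupby('region')['life'].transform(zscore_by_hand)
print(z_demo)
```

Because each group is standardized separately, both regions end up with the same z-scores here even though their raw values differ.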

In this example, you're going to normalize the Gapminder data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, you will select the countries that have a high fertility rate or a low life expectancy for their region.

In [18]:
# Read the CSV file into a DataFrame indexed by 'Year': gapminder
gapminder = pd.read_csv('../_datasets/gapminder_tidy.csv', index_col='Year')
# Keep only the rows for 2010 and index them by country
gapminder_2010 = gapminder.loc[2010]
gapminder_2010 = gapminder_2010.set_index('Country')
gapminder_2010.head()

Country,fertility,life,population,child_mortality,gdp,region
Afghanistan,5.659,59.612,31411743.0,105.0,1637.0,South Asia
Albania,1.741,76.78,3204284.0,16.6,9374.0,Europe & Central Asia
Algeria,2.817,70.615,35468208.0,27.4,12494.0,Middle East & North Africa
Angola,6.218,50.689,19081912.0,182.5,7047.0,Sub-Saharan Africa
Antigua and Barbuda,2.13,75.437,88710.0,9.9,20567.0,America


In [19]:
# Import zscore
from scipy.stats import zscore

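# Group gapminder_2010 by 'region' and standardize 'life' and 'fertility': standardized
# (a list of column names is needed here; a bare tuple is rejected by recent pandas)
standardized = gapminder_2010.groupby('region')[['life','fertility']].transform(zscore)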

# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)

# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]

# Print gm_outliers
gm_outliers

Country,fertility,life,population,child_mortality,gdp,region
Guatemala,3.974,71.1,14388929.0,34.5,6849.0,America
Haiti,3.35,45.0,9993247.0,208.8,1518.0,America
Tajikistan,3.78,66.83,6878637.0,52.6,2110.0,Europe & Central Asia
Timor-Leste,6.237,65.952,1124355.0,63.8,1777.0,East Asia & Pacific


Using z-scores like this is a great way to identify outliers in your data.

### Filling missing data (imputation) by group
**Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered**. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). In Chapter 1, you practiced using the `.dropna()` method to drop missing values. Now, you will practice imputing missing values. You can use `.groupby()` and `.transform()` to fill missing data appropriately for each group.
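
A minimal sketch of group-wise imputation on invented data: each missing age is replaced by the median of its own group rather than the overall median. The `impute_demo` table is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers with two missing ages
impute_demo = pd.DataFrame({
    'sex': ['f', 'f', 'f', 'm', 'm', 'm'],
    'age': [20.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

# Fill each gap with the median of the group it belongs to
impute_demo['age_filled'] = (
    impute_demo.groupby('sex')['age']
    .transform(lambda s: s.fillna(s.median()))
)
print(impute_demo)
```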

Your job is to fill in missing `'age'` values for passengers on the Titanic with the median age for their sex and passenger class. To do this, you'll group by the `'sex'` and `'pclass'` columns and transform each group with a custom function to call `.fillna()` and impute the median value.

In [20]:
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


In [21]:
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())

In [22]:
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex','pclass'])

# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)

# Print the output of titanic.tail(10)
titanic.tail(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1299,3,0,"Yasbeck, Mr. Antoni",male,27.0,1,0,2659,14.4542,,C,C,,
1300,3,1,"Yasbeck, Mrs. Antoni (Selini Alexander)",female,15.0,1,0,2659,14.4542,,C,,,
1301,3,0,"Youseff, Mr. Gerious",male,45.5,0,0,2628,7.225,,C,,312.0,
1302,3,0,"Yousif, Mr. Wazli",male,25.0,0,0,2647,7.225,,C,,,
1303,3,0,"Yousseff, Mr. Gerious",male,25.0,0,0,2627,14.4583,,C,,,
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,22.0,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C,,,
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S,,,


Imputing missing values intelligently is often preferable to dropping them entirely!

### Other transformations with .apply
The `.apply()` method when used on a groupby object performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The `.apply()` method will then combine the results in an intelligent way.
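
The sketch below shows how the shape of the combined result follows from what the function returns. The data and the helpers `gdp_spread` and `gdp_summary` are invented for illustration; a function returning a scalar gives one value per group, while one returning a Series gives one block of values per group.

```python
import pandas as pd

# Hypothetical per capita GDP values for two regions
apply_demo = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B'],
    'gdp': [10.0, 30.0, 20.0, 5.0, 50.0],
})
gdp_by_region = apply_demo.groupby('region')['gdp']

def gdp_spread(s):
    """Aggregation: one scalar per group."""
    return s.max() - s.min()

def gdp_summary(s):
    """Several named statistics per group, returned as a Series."""
    return pd.Series({'spread': s.max() - s.min(), 'mean': s.mean()})

spreads = gdp_by_region.apply(gdp_spread)      # indexed by region
summaries = gdp_by_region.apply(gdp_summary)   # one block of statistics per region

print(spreads)
print(summaries)
```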

In this exercise, you're going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this you'll define a function to compute the aggregate spread of per capita GDP in each region and each country's z-score of per capita GDP within its region. You'll then select three countries - United States, United Kingdom and China - to see a summary of the regional GDP and that country's z-score against the regional mean.

The following function must be defined:
```Python
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})
```

In [23]:
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp']-gr['gdp'].mean())/gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean())/gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)':z, 'regional spread(gdp)':s}
    return pd.DataFrame({'z(gdp)':z , 'regional spread(gdp)':s})

In [24]:
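# Group gapminder_2010 by 'region': regional
# (group_keys=False keeps 'Country' as the only index level in the result of .apply())
regional = gapminder_2010.groupby('region', group_keys=False)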

# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)

# Print the disparity of 'United States', 'United Kingdom', and 'China'
reg_disp.loc[['United States','United Kingdom','China']]

Country,z(gdp),regional spread(gdp)
United States,3.013374,47855.0
United Kingdom,0.572873,89037.0
China,-0.432756,96993.0


# Groupby and filtering
### Grouping and filtering with .apply()
By using `.apply()`, you can write functions that filter rows within groups. The `.apply()` method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame.

In this exercise you'll take the Titanic data set and analyze survival rates from the `'C'` deck, which contained the most passengers. To do this you'll group the dataset by `'sex'` and then use the `.apply()` method on a provided user-defined function which calculates the mean survival rate on the `'C'` deck:
```Python
def c_deck_survival(gr):

    c_passengers = gr['cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'survived'].mean()
```

In [25]:
def c_deck_survival(gr):

    c_passengers = gr['cabin'].str.startswith('C').fillna(False)

    return gr.loc[c_passengers, 'survived'].mean()

In [26]:
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby('sex')

# Call by_sex.apply with the function c_deck_survival
c_surv_by_sex = by_sex.apply(c_deck_survival)

# Print the survival rates
c_surv_by_sex

sex
female    0.913043
male      0.312500
dtype: float64

### Grouping and filtering with .filter()
You can use groupby with the `.filter()` method to remove whole groups of rows from a DataFrame based on a boolean condition.
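
A minimal sketch on invented data: the function passed to `.filter()` receives each group as a DataFrame and returns a single `True` or `False`, and all rows of a group are kept or dropped together. The `filter_demo` table is hypothetical.

```python
import pandas as pd

# Hypothetical purchases by three companies
filter_demo = pd.DataFrame({
    'Company': ['X', 'Y', 'X', 'Z', 'Y'],
    'Units': [20, 5, 30, 40, 10],
})

# Keep only the rows of companies whose total exceeds 35 units
kept = filter_demo.groupby('Company').filter(lambda g: g['Units'].sum() > 35)
print(kept)
```

Company `Y` totals 15 units, so both of its rows are removed; the remaining rows keep their original order and index.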

In this exercise, you'll take the February sales data and remove entries from companies that purchased 35 or fewer units in the whole month.

First, you'll identify how many units each company bought for verification. Next you'll use the `.filter()` method after grouping by `'Company'` to remove all rows belonging to companies whose sum over the `'Units'` column was less than or equal to 35. Finally, verify that the three companies whose total Units purchased were less than or equal to 35 have been filtered out from the DataFrame.

In [27]:
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('../_datasets/sales-feb-2015.csv', index_col='Date', parse_dates=True)

# Group sales by 'Company': by_company
by_company = sales.groupby('Company')

# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company.Units.sum()
by_com_sum

Company
Acme Coporation    34
Hooli              30
Initech            30
Mediacore          45
Streeplex          36
Name: Units, dtype: int64

In [28]:
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum()>35)
by_com_filt

Date,Company,Product,Units
2015-02-02 21:00:00,Mediacore,Hardware,9
2015-02-04 15:30:00,Streeplex,Software,13
2015-02-09 09:00:00,Streeplex,Service,19
2015-02-09 13:00:00,Mediacore,Software,7
2015-02-19 11:00:00,Mediacore,Hardware,16
2015-02-19 16:00:00,Mediacore,Service,10
2015-02-21 05:00:00,Mediacore,Software,3
2015-02-26 09:00:00,Streeplex,Service,4


### Filtering and grouping with .map()
You have seen how to group by a column, or by multiple columns. Sometimes, you may instead want to group by a function/transformation of a column. The key here is that the Series is indexed the same way as the DataFrame. You can also mix and match column grouping with Series grouping.
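
Mixing a derived Series with a column name can be sketched as follows. The six-row `mix_demo` table and the labels are made up; the derived Series works as a grouping key because it shares the DataFrame's index.

```python
import pandas as pd

# Hypothetical passengers
mix_demo = pd.DataFrame({
    'age': [5, 40, 8, 30, 9, 60],
    'pclass': [1, 1, 2, 2, 2, 1],
    'survived': [1, 0, 1, 1, 0, 0],
})

# Derived grouping key: a Series with the same index as mix_demo
is_child = (mix_demo['age'] < 10).map({True: 'under 10', False: '10 or over'})

# Group by the derived Series and by a regular column at the same time
rate = mix_demo.groupby([is_child, 'pclass'])['survived'].mean()
print(rate)
```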

In this exercise your job is to investigate survival rates of passengers on the Titanic by `'age'` and `'pclass'`. In particular, the goal is to find out what fraction of children under 10 survived in each `'pclass'`. You'll do this by first creating a boolean Series where `True` marks passengers under 10 years old and `False` marks passengers aged 10 or older. You'll use `.map()` to change these values to strings.

Finally, you'll group by the under 10 series and the `'pclass'` column and aggregate the `'survived'` column. The `'survived'` column has the value `1` if the passenger survived and `0` otherwise. The mean of the `'survived'` column is the fraction of passengers who lived.

In [29]:
# Create the Boolean Series: under10
under10 = (titanic.age<10).map({True:'under 10',False:'over 10'})
under10.head()

0     over 10
1    under 10
2    under 10
3     over 10
4     over 10
Name: age, dtype: object

In [30]:
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
survived_mean_1

age
over 10     0.366748
under 10    0.609756
Name: survived, dtype: float64

In [31]:
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10,'pclass'])['survived'].mean()
survived_mean_2

age       pclass
over 10   1         0.617555
          2         0.380392
          3         0.238897
under 10  1         0.750000
          2         1.000000
          3         0.446429
Name: survived, dtype: float64