## Grouping and Summarizing

In [None]:
import pandas as pd
import numpy as np

Our first pass will be with our old friends the penguins.

In [None]:
penguins = pd.read_csv("penguins-raw.csv")

## Some basic data cleaning

1.  Data types

In [None]:
penguins.dtypes

2. Focus on Species, Island, Sex, Culmen Length/Depth, Flipper Length, Body Mass.

In [None]:
focus = ['Species','Island','Sex','Culmen Length (mm)','Culmen Depth (mm)','Flipper Length (mm)', 'Body Mass (g)']
simplified = penguins[focus]

3. Clean up column names

In [None]:
edited_columns = ['species','island','sex','culmen_length', 'culmen_depth','flipper_length','body_mass']
simplified.columns = edited_columns

4. Simplify factor names


In [None]:
species = simplified['species'].unique()
simple_species_dict={x:x.split(' ')[0].lower() for x in species}
simplified['species'].map(simple_species_dict)
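A short aside on how `.map()` behaves with a dictionary: any value without a matching key comes back as `NaN`. The sketch below uses made-up labels (not the real file) and deliberately leaves one label out of the lookup to show the effect.

```python
import pandas as pd

# Toy labels (hypothetical); the long names only mimic the style of the raw data.
raw = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
    "Unknown bird",
])

# "Unknown bird" is intentionally missing from the lookup.
lookup = {
    "Adelie Penguin (Pygoscelis adeliae)": "adelie",
    "Gentoo penguin (Pygoscelis papua)": "gentoo",
}

mapped = raw.map(lookup)
print(mapped)
print(mapped.isna().sum())  # number of labels that had no key in the lookup
```

Building the dictionary from `.unique()`, as above, gives every label present in the column a key.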

5. Remaking a column (watch out!)

In [None]:
simplified['species'] = simplified['species'].map(simple_species_dict)

Old option: use .loc.

In [None]:
simplified.loc[:,'species'] = simplified['species'].map(simple_species_dict)

Newer option: use .assign().  Notice that .assign() *returns a new dataframe* rather than modifying the original, so the result has to be assigned back to a name.


In [None]:
simplified = simplified.assign(species = lambda x: x['species'].map(simple_species_dict))
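To see what "returns a dataframe" means in practice, here is a minimal sketch on a made-up two-row frame (the name `toy` is hypothetical). The result of `.assign()` carries the new values, while the frame it was called on is left as it was.

```python
import pandas as pd

toy = pd.DataFrame({"species": ["Adelie Penguin", "Gentoo penguin"]})

# assign() builds and returns a new DataFrame ...
result = toy.assign(
    species=lambda d: d["species"].str.split(" ").str[0].str.lower()
)

print(result["species"].tolist())  # cleaned labels
print(toy["species"].tolist())     # ... while the original frame is untouched
```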

Fix some other factor variables:

In [None]:
simplified = simplified.assign(island = lambda x: x['island'].str.lower())
simplified = simplified.assign(sex = lambda x: x['sex'].str.lower())

6.  Standardize the variables - column by column

In [None]:
simplified = simplified.assign(
    culmen_length_std = lambda x: (x.culmen_length - x.culmen_length.mean()) / x.culmen_length.std()
)

Or write a standardization function and apply it to each numeric column.

In [None]:
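def standardize(x):
    """Center a numeric Series at 0 and scale it by its standard deviation."""
    return (x - x.mean()) / x.std()
# List the measurement columns explicitly so existing *_std columns are not standardized again.
to_standardize = ['culmen_length', 'culmen_depth', 'flipper_length', 'body_mass']
simplified = simplified.assign(
    # i=i binds the current column name; without it every lambda would use the last value of i.
    **{i+'_std': (lambda x, i=i: standardize(x[i])) for i in to_standardize}
)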
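One pitfall to keep in mind whenever lambdas are created inside a comprehension like this: Python looks up the loop variable when the lambda is called, not when it is defined. The sketch below, on a made-up two-column frame (`toy`, `a`, `b` are hypothetical names), contrasts a late-bound lambda with one that pins the variable through a default argument.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

def standardize(x):
    return (x - x.mean()) / x.std()

# Late binding: each lambda reads `col` only when assign() calls it, after the
# comprehension has finished, so both new columns are computed from "b".
late = toy.assign(
    **{col + "_std": (lambda d: standardize(d[col])) for col in ["a", "b"]}
)

# Default argument: `col=col` stores the current column name in each lambda.
bound = toy.assign(
    **{col + "_std": (lambda d, col=col: standardize(d[col])) for col in ["a", "b"]}
)

print(late["a_std"].equals(late["b_std"]))    # True: same source column
print(bound["a_std"].equals(bound["b_std"]))  # False: each column standardized separately
```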

7.  Missing Values

In [None]:
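print(simplified.isna().sum())  # missing values per column (only a cell's last expression is shown automatically)
simplified = simplified.dropna(axis=0)  # drop every row that has at least one missing value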
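As a rough illustration of what these two calls do, here is a sketch on a made-up four-row frame (all values are invented). It also shows the `subset` argument of `dropna`, which restricts the check to chosen columns when dropping every incomplete row would discard too much.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", None, "female", "female"],
    "body_mass": [3750.0, 3800.0, np.nan, 3450.0],
})

# Count missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops rows; a row goes if any of its values is missing.
complete = toy.dropna(axis=0)
print(len(toy), "->", len(complete))

# subset= only looks at the listed columns when deciding which rows to drop.
has_mass = toy.dropna(subset=["body_mass"])
print(len(has_mass))
```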

8.  Grouping

Grouping is usually combined with aggregation: `groupby` splits the rows into groups, and `agg` reduces each group to a summary value.


In [None]:
numerical_variables = ['culmen_length','culmen_depth','flipper_length','body_mass']
by_sex = simplified[['sex']+numerical_variables].groupby('sex').agg('mean')
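When different columns need different summaries, `agg` also accepts named aggregations of the form `new_name=(column, function)`. The following is a sketch on invented numbers, not the penguin data; the output column names (`mean_mass`, `max_flipper`, `n`) are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "body_mass": [4000.0, 4200.0, 3400.0, 3600.0],
    "flipper_length": [195.0, 201.0, 186.0, 190.0],
})

# One row per group, one column per named aggregation.
summary = toy.groupby("sex").agg(
    mean_mass=("body_mass", "mean"),
    max_flipper=("flipper_length", "max"),
    n=("body_mass", "size"),
)
print(summary)
```

The grouping column becomes the index of the result; calling `.reset_index()` turns it back into an ordinary column.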