# Baby Boom

The file ```babyboom.csv``` contains information on all 44 babies born at one hospital in one 24-hour period.  It has 3 columns:  

* Time = time of birth, in the format hhmm
* Sex = the sex of the baby
* Weight = the weight of the baby, in grams.

Dataset: "Time of Birth, Sex, and Birth Weight of 44 Babies," submitted by Peter K. Dunn, University of Southern Queensland. Dataset obtained from the Journal of Statistics Education
(http://www.amstat.org/publications/jse). Accessed 14 July 2015. 
Used by permission of author.
Metadata:  http://www.amstat.org/publications/jse/datasets/babyboom.txt


We'll use the [pandas library](http://pandas.pydata.org/) to work with this data set.

In [None]:
import pandas as pd
import numpy as np

The pandas library includes a data structure called a ```DataFrame``` (a data frame), which is designed for working with tabular data.  We read in ```babyboom.csv``` as a data frame.

In [None]:
bb_df = pd.read_csv('babyboom.csv')
bb_df

Note that blank entries have been replaced by the symbol ```NaN```, which stands for "not a number".  This special value is a constant in the ```numpy``` package, and can be accessed as ```np.nan```.
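
As a minimal sketch of how missing entries behave, the example below reads a small hypothetical CSV from a string (the three rows are invented for illustration and are not taken from ```babyboom.csv```) and checks which cells pandas treats as missing.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample, not the contents of babyboom.csv
sample_csv = io.StringIO("Time,Sex,Weight\n0005,F,3837\n0104,,3334\n0118,M,\n")
sample_df = pd.read_csv(sample_csv)

# isna() marks every missing entry with True
missing = sample_df.isna()
print(missing)

# NaN does not compare equal to anything, including itself,
# so test for missing values with isna() rather than ==
print(np.nan == np.nan)
```

Because ```NaN``` is not equal to itself, a comparison such as ```value == np.nan``` cannot be used to find missing values.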

We can use the ```describe()``` method for data frames to get a quick summary of the numerical columns in the data frame.  The ```count``` row of the summary ignores any ```NaN``` values.

In [None]:
bb_df.describe()

We notice that the data set includes at least one question mark.  We'd like to replace all question marks by ```NaN```.  To do so, we create a function, and then apply this function to every cell in the data frame using the ```applymap``` method for data frames.  (In pandas 2.1 and later, ```applymap``` is deprecated in favor of the equivalent ```DataFrame.map```.)

In [None]:
def remove_question(s):
    """ return np.nan if the argument is a question mark """
    if s == "?":
        return np.nan
    else:
        return s

In [None]:
bb_df = bb_df.applymap(remove_question)
bb_df
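
For a single known marker such as the question mark, pandas also offers more direct routes. The sketch below uses a hypothetical two-row sample (not the real file) to compare ```replace``` after loading with the ```na_values``` argument of ```read_csv```; which one fits better depends on whether the marker is known before the file is read.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample with one question mark, not the contents of babyboom.csv
raw = "Time,Sex,Weight\n0005,F,3837\n0104,M,?\n"

# Option 1: replace the marker after loading.
# The surviving values in the Weight column are still strings.
replaced_df = pd.read_csv(io.StringIO(raw)).replace("?", np.nan)

# Option 2: treat the marker as missing while loading.
# The Weight column is then parsed as numbers.
parsed_df = pd.read_csv(io.StringIO(raw), na_values="?")

print(replaced_df["Weight"].tolist())
print(parsed_df["Weight"].dtype)
```

Declaring the marker at load time has the side benefit that a numeric column containing it is read as numbers rather than as text.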

We notice that the Sex column uses many different labels for the same categories.  We'd like to standardize this data as M or F.  The first step is to get a list of all of the different values in the column.

We access this column using its text label.  Each column of a pandas data frame is a ```Series``` object, and we can use the ```unique()``` method for series.

In [None]:
bb_df['Sex'].unique()

We want to map each of the values from the Sex column to "M" or "F".  We could do this using lots of if-then statements, but because we're describing a correspondence, it makes sense to create a dictionary.

In [None]:
sex_dict = { 'F':'F', 'Male':'M', 'boy':'M', 'M':'M', 'girl':'F', 'not recorded':np.nan, 'female':'F', np.nan:np.nan}

In [None]:
# We can use the dictionary's get method to obtain the value for a specified key
sex_dict.get("girl")

In [None]:
# We create a lambda function which calls our dictionary, and map it to every element of the Sex column
# This returns a Series object
bb_df['Sex'].map(lambda s: sex_dict.get(s))

In [None]:
# Of course, what we really want to do is REPLACE the Sex column in our data frame

bb_df['Sex'] = bb_df['Sex'].map(lambda s: sex_dict.get(s))
bb_df
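
As an aside, ```Series.map``` also accepts a dictionary directly, without a lambda. The sketch below uses hypothetical labels (not the real column) and shows one behavior worth watching for: a label with no entry in the dictionary silently becomes a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical labels for illustration
labels = pd.Series(["girl", "Male", "unknown", np.nan])
label_dict = {"girl": "F", "Male": "M"}

# Passing the dictionary itself: labels with no matching key become NaN
standardized = labels.map(label_dict)
print(standardized)
```

For this reason it can be worth calling ```unique()``` again after mapping, to confirm that only the expected values remain.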

Now we can group our data frame by M or F, and compare descriptive statistics for each group.

In [None]:
bb_group = bb_df.groupby('Sex')
bb_group.describe()
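
To compare a single statistic rather than the full summary, we can select one column from the grouped object and aggregate it. This is a small sketch on made-up values (not taken from ```babyboom.csv```); note that rows whose group key is missing are left out by default.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration, not taken from babyboom.csv
toy_df = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", np.nan],
    "Weight": [3000, 3500, 3200, 3700, 3100],
})

# Rows whose group key is NaN are left out of every group by default
mean_weight = toy_df.groupby("Sex")["Weight"].mean()
print(mean_weight)
```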

We've demonstrated using the ```map``` method on Series to transform columns of our data frame.  We can use the ```apply``` method of a data frame to do operations on rows.

In [None]:
# A function to apply to a row of our data frame

def convert_to_lb(row):
    """ convert weight in grams to weight in pounds"""
    return row['Weight']/453.59237

In [None]:
# Specifying axis=1 applies the function to each row; the default would apply the function to each column

bb_df['Weight in lb'] = bb_df.apply(convert_to_lb, axis=1)
bb_df
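
When the calculation involves only one column, arithmetic on the whole Series is a common alternative to a row-wise ```apply```. The sketch below assumes the weights might have been read as text (for example, because the column once held a question mark), so it first converts them with ```pd.to_numeric```; the sample values and the name ```GRAMS_PER_POUND``` are hypothetical.

```python
import pandas as pd

GRAMS_PER_POUND = 453.59237

# Hypothetical weights; the string mimics a column that was read as text
weights = pd.Series([3837, "3334", None])

# errors="coerce" turns anything that cannot be parsed into NaN
weights_g = pd.to_numeric(weights, errors="coerce")

# Dividing a Series by a scalar operates on every element at once
weights_lb = weights_g / GRAMS_PER_POUND
print(weights_lb.round(2))
```

Missing weights stay missing after the division, so no special handling is needed for them.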

## Saving our data

We'd like to save our data for possible analysis in R.

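Because R uses ```NA``` rather than ```NaN``` as its general missing-value marker, and may read the text ```NaN``` in a non-numeric column as an ordinary string, we replace null data with an empty string.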

In [None]:
# We use the inplace flag to change the original data frame, rather than creating a new one
bb_df.fillna("", inplace=True)
bb_df

In [None]:
# Save the data frame to a .csv file
# Specifying index=False prevents us from writing a column of row numbers

bb_df.to_csv("babyboom_clean.csv", index=False)
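
As a side note, ```to_csv``` writes missing values as empty fields by default, and its ```na_rep``` argument lets us choose a different marker. The sketch below uses a hypothetical two-row data frame and returns the CSV text as a string instead of writing a file; writing ```NA``` is shown only as one possible choice for a file meant to be read in R.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame for illustration
out_df = pd.DataFrame({"Sex": ["F", np.nan], "Weight": [3837.0, np.nan]})

# Default: missing values are written as empty fields
default_text = out_df.to_csv(index=False)

# na_rep chooses the text written for missing values
na_text = out_df.to_csv(index=False, na_rep="NA")

print(default_text)
print(na_text)
```

With this approach the numeric columns in the data frame keep their numeric type, since no empty strings are inserted into them.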