# Enoch Ikunda
### COS 184 fall 2020

## Program 6: Births data (20 points)

This assignment allows us to do some real research into actual data. Lab 8 has already paved the way by showing how to read the data file, mark data as missing, and save a nice clean <code>DataFrame</code>. Lab 8 saved the clean data as a pickled <code>DataFrame</code> so we can just read it in ("un-pickle it") and get going.

In [1]:
# Don't change this cell.
import pandas as pd, numpy as np
DataIn = 'Births2006clean.pkl'
DataOut = 'Births2006reduced.pkl'

In [2]:
# Your code goes here, to read the pickled DataFrame and un-pickle it on the way.
births = pd.read_pickle(DataIn)
#births

We're only going to ask questions about birth weight, mother's age, gestational age and health assessment, so we can drop the columns that don't relate to these questions. The code below modifies the existing <code>DataFrame</code> in place by deleting columns, leaving only <code>MAGER</code>, <code>APGAR5</code>, <code>ESTGEST</code> and <code>DBWT</code>.

In [3]:
# Don't change this cell.
colsDrop = ['DOB_MM', 'DOB_WK', 'TBO_REC', 'WTGAIN', 'SEX', 'DMEDUC', 'UPREVIS', 'DMETH_REC', 'DPLURAL']
births.drop(labels=colsDrop, axis=1, inplace=True)  # drop columns -- inplace saves storage
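
As a minimal sketch of what <code>drop()</code> with <code>axis=1</code> does, the toy frame below keeps only the columns that are not listed in <code>labels</code>. The values are made up for illustration; only the column names come from the assignment.

```python
import pandas as pd

# Toy frame with made-up values (not the births data).
toy = pd.DataFrame({
    'MAGER': [28, 18],
    'SEX': ['F', 'M'],
    'WTGAIN': [30, 25],
    'DBWT': [3625, 3650],
})

# axis=1 means "drop columns"; inplace=True modifies toy instead of returning a new frame.
toy.drop(labels=['SEX', 'WTGAIN'], axis=1, inplace=True)

remaining = list(toy.columns)
print(remaining)
```

The columns that survive keep their original order.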

We need complete data in the columns <code>MAGER</code>, <code>APGAR5</code>, <code>ESTGEST</code> and <code>DBWT</code>. Let's drop every row that has a missing value in any of these columns. Do this with the <code>DataFrame</code> method <code>dropna()</code>, described at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html. Again, modify the existing <code>DataFrame</code> by passing the argument <code>inplace=True</code>. Now you can see why we did all that work to mark every bit of missing data as <code>np.nan</code>.

In [4]:
# Your code goes here to drop every row with NA data in any column.
births.dropna(inplace=True)
#births
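
The effect of <code>dropna()</code> can be sketched on a toy frame with made-up values. Any row with a missing value in any column is removed, and the surviving rows keep their original index labels, which leaves gaps in the index.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; rows 1 and 3 each contain one missing value.
toy = pd.DataFrame({
    'MAGER': [28.0, np.nan, 21.0, 25.0],
    'APGAR5': [9.0, 9.0, 9.0, np.nan],
    'ESTGEST': [37.0, 38.0, 38.0, 40.0],
    'DBWT': [3625.0, 3650.0, 3045.0, 3827.0],
})

# Count missing values per column before dropping anything.
print(toy.isna().sum())

# Drop every row that has at least one missing value, modifying toy in place.
toy.dropna(inplace=True)

kept_index = list(toy.index)
print(kept_index)
```

Only rows 0 and 2 remain, so the index is no longer a continuous run of integers.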

Now that we have dropped some rows, the row index is no longer continuous. While we may not need the numeric row index right away, it's nicer to have it continuous.

In [5]:
# Don't change this cell.
# Since births has had some rows dropped its index is no longer continuous.
# We solve this by creating a new DataFrame from the ndarray in births (births.values)
# but with the index and column names we want.
birthsReduced = pd.DataFrame(births.values,  # ndarray
                      index=range(0, births.shape[0]), # 0..367982
                      columns=['MAGER', 'APGAR5', 'ESTGEST', 'DBWT'],
                      copy=True
                     )
birthsReduced

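| | MAGER | APGAR5 | ESTGEST | DBWT |
|---|---|---|---|---|
| 0 | 28.0 | 9.0 | 37.0 | 3625.0 |
| 1 | 18.0 | 9.0 | 38.0 | 3650.0 |
| 2 | 21.0 | 9.0 | 38.0 | 3045.0 |
| 3 | 25.0 | 10.0 | 40.0 | 3827.0 |
| 4 | 28.0 | 8.0 | 39.0 | 3090.0 |
| ... | ... | ... | ... | ... |
| 367978 | 20.0 | 8.0 | 39.0 | 2187.0 |
| 367979 | 30.0 | 9.0 | 38.0 | 3210.0 |
| 367980 | 34.0 | 9.0 | 39.0 | 3799.0 |
| 367981 | 32.0 | 9.0 | 38.0 | 4290.0 |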
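
The cell above rebuilds the frame from <code>births.values</code>. A different way to get a continuous index, shown here only as an alternative sketch on made-up values, is the <code>DataFrame</code> method <code>reset_index(drop=True)</code>, which renumbers the rows from 0 and discards the old labels.

```python
import pandas as pd

# Toy frame with made-up values and a gappy index, as left behind by dropna().
toy = pd.DataFrame(
    {'MAGER': [28.0, 21.0, 25.0], 'DBWT': [3625.0, 3045.0, 3827.0]},
    index=[0, 2, 5],
)

# drop=True discards the old index instead of keeping it as a new column.
renumbered = toy.reset_index(drop=True)

new_index = list(renumbered.index)
print(new_index)
```

The row data is unchanged; only the labels are renumbered.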


In [6]:
# Code check block - don't change this cell.
assert birthsReduced.shape == (367983, 4)
assert birthsReduced.loc[0, 'DBWT'] == 3625.0

In [7]:
# Just how much memory does this sucker take up? (don't change this cell)
print('Total memory used: {0:0.2f} MBytes'.format(sum(birthsReduced.memory_usage(deep=True))/1.0e+06))

Total memory used: 11.78 MBytes
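
The reported figure can be checked with simple arithmetic. This sketch assumes that all four columns are stored as 8-byte <code>float64</code> values (an assumption based on the decimal values shown above) and ignores the small amount of memory used by the index.

```python
# Shape asserted in the code check block above.
rows, cols = 367983, 4

# Assumption: every column is float64, i.e. 8 bytes per value.
bytes_per_value = 8

data_mbytes = rows * cols * bytes_per_value / 1.0e+06
print('Estimated data memory: {0:0.2f} MBytes'.format(data_mbytes))
```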


We know that there is no missing data in <code>birthsReduced</code>, but before we do any analysis or visualization, let's do some informal validation. We might reasonably expect that:
<ul>
<li>there are no mothers younger than 10 or older than 60 years old
<li>the APGAR scores are between 0.0 and 10.0, inclusive
<li>the period of gestation is not less than 20 nor more than 40 weeks
<li>birth weights are not less than 1 pound (454 grams) nor greater than 10 pounds (4536 grams)
</ul>
We can easily compute many descriptive statistics with the <code>DataFrame</code> method <code>describe()</code>. Look at these statistics to see if the data meets our expectations.

In [8]:
# Your code goes here to compute summary descriptive statistics.
birthsReduced.describe()
#births.min(axis=1)

| | MAGER | APGAR5 | ESTGEST | DBWT |
|---|---|---|---|---|
| count | 367983.0 | 367983.0 | 367983.0 | 367983.0 |
| mean | 27.271515 | 8.858627 | 38.434034 | 3259.400369 |
| std | 6.127856 | 0.757586 | 2.18539 | 596.681923 |
| min | 12.0 | 0.0 | 12.0 | 227.0 |
| 25% | 22.0 | 9.0 | 38.0 | 2955.0 |
| 50% | 27.0 | 9.0 | 39.0 | 3297.0 |
| 75% | 32.0 | 9.0 | 40.0 | 3629.0 |
| max | 50.0 | 10.0 | 51.0 | 8165.0 |
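
Besides reading the minimum and maximum from <code>describe()</code>, the expectations listed earlier can be checked by counting the rows that fall outside each range. The sketch below uses a toy frame with made-up rows; the <code>expected</code> dictionary is a hypothetical helper that holds the limits from the list above.

```python
import pandas as pd

# Toy frame with made-up rows (not the real births data).
toy = pd.DataFrame({
    'MAGER': [28.0, 12.0, 50.0],
    'APGAR5': [9.0, 0.0, 10.0],
    'ESTGEST': [37.0, 12.0, 51.0],
    'DBWT': [3625.0, 227.0, 8165.0],
})

# Expected (low, high) limits, inclusive, taken from the list of expectations.
expected = {
    'MAGER': (10, 60),
    'APGAR5': (0.0, 10.0),
    'ESTGEST': (20, 40),
    'DBWT': (454, 4536),
}

# Series.between() is inclusive on both ends by default; ~ negates the Boolean mask.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in expected.items()
}
print(out_of_range)
```

Applied to <code>birthsReduced</code>, the same counts would show how many rows violate each expectation, not just whether the extremes do.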


Please comment here on your observations. Any surprises?
<code>Yes, there were a few surprises.</code>
#### For example, we expected ESTGEST to be no less than 20 and no more than 40 weeks, but the minimum is 12 and the maximum is 51, both well outside that range.
#### Another surprise is birth weight (DBWT): the minimum is 227 g, below the expected lower limit of 454 g, and the maximum is 8165 g, almost twice the expected upper limit of 4536 g.

Let's take a look at the data for the maximum <code>DBWT</code> and maximum <code>ESTGEST</code> values, as well as for the minimum <code>MAGER</code> values. Do these seem legitimate?

In [9]:
# Now we retrieve three sets of rows: one containing the maximum DBWT, one containing the maximum
# ESTGEST, and one containing MAGER values <= 12. Here's an example, for the first set:
print('Rows for which the birth weight was 8165 grams.')
boolcol = birthsReduced['DBWT'] == 8165.0
print(birthsReduced[boolcol])
# Note the use of Boolean indexing to select the rows to print.
# Your code goes here to print the rows where ESTGEST is 51 and where MAGER is 12 or less.

Rows for which the birth weight was 8165 grams.
        MAGER  APGAR5  ESTGEST    DBWT
134175   33.0     9.0     40.0  8165.0
142562   34.0     9.0     38.0  8165.0


We'll investigate "outliers" further in Program 7.

We've done quite a bit of work! Let's save the <code>birthsReduced</code> <code>DataFrame</code> for later use in Program 7.

In [None]:
# Your code goes here, to save birthsReduced via pandas to_pickle().
birthsReduced.to_pickle(DataOut)
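
One way to confirm that a pickled <code>DataFrame</code> survives the round trip is to read it back and compare it with the original. This is a sketch on a toy frame with made-up values, written to a temporary directory; the file name <code>toy.pkl</code> is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with made-up values (not the real births data).
toy = pd.DataFrame({'MAGER': [28.0, 18.0], 'DBWT': [3625.0, 3650.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy.pkl')  # hypothetical file name
    toy.to_pickle(path)              # save ("pickle") the frame
    restored = pd.read_pickle(path)  # read it back ("un-pickle")

# equals() compares shape, values and dtypes.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```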