## Exploring US Births by Date
In this workbook, I will be exploring a data set of US births from 1994 to 2003. It was provided by the National Center for Health Statistics, which is part of the Centers for Disease Control and Prevention in the US.

I will not use the pandas library to explore the data; instead I will rely on functions that I define myself.

### Opening the csv file
This is to initially see what the data file looks like.

In [1]:
# read the whole file as one string; the with block closes the file afterwards
with open('US_births_1994-2003_CDC_NCHS.csv','r') as f:
    file=f.read()

file_split=file.split('\n')

file_split[0:2]


['year,month,date_of_month,day_of_week,births', '1994,1,1,6,8096']

I can see that the granularity of the data provided is: year, month, day of month and day of week, with the number of births in the last column.

### Define function to read and convert data into a list of lists
The data is currently organised into a list of strings where each string is effectively a row of data.
Here I will convert each string into a list, so the data is then a list of lists.

In [2]:
def read_csv(input_csv):

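    # read the whole file as one string; the with block closes the file
    with open(input_csv,'r') as f:
        file=f.read()

    file_split=file.split('\n')
    # drop the header row
    string_list=file_split[1:]
    
    final_list=[]
        
    for x in string_list:
        # skip blank lines, e.g. one left by a trailing newline
        if not x.strip():
            continue
        int_fields=[]
        string_fields= x.split(',')
        
        # convert every field in the row from string to int
        for y in string_fields:
            number=int(y)
            int_fields.append(number)
        final_list.append(int_fields)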
        
    return final_list
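
As a cross-check on the hand-written parser, the same list of lists can be built with the standard library's `csv` module, which still avoids pandas. This is only a sketch: `read_csv_stdlib` is a hypothetical name, and the small temporary file is made up for illustration from the header and first data row shown in the file preview above.

```python
import csv
import os
import tempfile

def read_csv_stdlib(input_csv):
    """Hypothetical variant of read_csv built on csv.reader."""
    with open(input_csv, 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        # csv.reader yields an empty list for a blank line, so skip those
        return [[int(value) for value in row] for row in reader if row]

# Illustrative file: the header and first data row from the preview above
sample_text = 'year,month,date_of_month,day_of_week,births\n1994,1,1,6,8096\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample_text)
    sample_path = tmp.name

sample_rows = read_csv_stdlib(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(sample_rows)
```

Run against the real file, this should return the same structure as `read_csv`, although that has not been verified here.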

In [3]:
# here I review the first 10 items in my list
cdc_list=read_csv('US_births_1994-2003_CDC_NCHS.csv')

cdc_list[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]

### Calculating number of births for each month

In [4]:
def month_births(l_of_l):
    births_per_month={}
    
    for x in l_of_l:
        month=x[1]
        if month in births_per_month:
            births_per_month[month]=births_per_month[month]+x[-1]
        else:
            births_per_month[month]=x[-1]
    return births_per_month

In [5]:
cdc_month_births= month_births(cdc_list)

cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

### Calculating number of births for each day of the week

In [6]:
def dow_births(l_of_l):
    births_of_dow={}
    
    for x in l_of_l:
        if x[3] in births_of_dow:
            births_of_dow[x[3]]=births_of_dow[x[3]]+x[-1]
        else:
            births_of_dow[x[3]]=x[-1]
    return births_of_dow

In [7]:
cdc_day_births=dow_births(cdc_list)

cdc_day_births

{6: 4562111,
 7: 4079723,
 1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657}
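
The data set does not say which weekday each number stands for. The first row, 1 January 1994, has day_of_week 6, and that date was a Saturday, which is consistent with ISO numbering (1 = Monday, 7 = Sunday). Assuming that coding holds for every row, the totals could be labelled roughly as below. `DAY_NAMES`, `label_days` and `dow_totals` are hypothetical names, and the totals are copied from the output above.

```python
from datetime import date

# Assumption: day_of_week follows ISO numbering, 1 = Monday ... 7 = Sunday
DAY_NAMES = {1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday',
             5: 'Friday', 6: 'Saturday', 7: 'Sunday'}

# The first data row is 1994-01-01 with day_of_week 6; compare with ISO
first_row_matches_iso = date(1994, 1, 1).isoweekday() == 6

def label_days(births_by_dow):
    """Return a new dict keyed by day name, ordered Monday to Sunday."""
    return {DAY_NAMES[day]: births_by_dow[day] for day in sorted(births_by_dow)}

# Totals copied from the day-of-week output above
dow_totals = {6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
              3: 6322855, 4: 6288429, 5: 6233657}

print(first_row_matches_iso)
print(label_days(dow_totals))
```

Under that assumption, the two lowest totals fall on Saturday and Sunday.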

### Creating a general function to apply across any time frame

In [8]:
def calc_counts(data, column):
    total_dict={}
    
    for x in data:
        if x[column] in total_dict:
            total_dict[x[column]]=total_dict[x[column]]+x[-1]
        else:
            total_dict[x[column]]=x[-1]
    return total_dict
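
The if/else inside the loop can also be collapsed with `dict.get`, which supplies a default of 0 the first time a key is seen. The sketch below is an alternative spelling, not a replacement for the function above: `calc_counts_get` is a hypothetical name, and `sample` holds the first ten rows of `cdc_list` copied from the earlier output.

```python
def calc_counts_get(data, column):
    """Hypothetical variant of calc_counts using dict.get for the default."""
    total_dict = {}
    for row in data:
        key = row[column]
        # get() returns 0 when the key has not been seen yet
        total_dict[key] = total_dict.get(key, 0) + row[-1]
    return total_dict

# First ten rows of cdc_list, copied from the output earlier in the notebook
sample = [[1994, 1, 1, 6, 8096],
          [1994, 1, 2, 7, 7772],
          [1994, 1, 3, 1, 10142],
          [1994, 1, 4, 2, 11248],
          [1994, 1, 5, 3, 11053],
          [1994, 1, 6, 4, 11406],
          [1994, 1, 7, 5, 11251],
          [1994, 1, 8, 6, 8653],
          [1994, 1, 9, 7, 7910],
          [1994, 1, 10, 1, 10498]]

sample_dow = calc_counts_get(sample, 3)    # births per day of week
sample_month = calc_counts_get(sample, 1)  # births per month
print(sample_dow)
print(sample_month)
```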


In [9]:
cdc_year_births = calc_counts(cdc_list, 0)
cdc_month_births = calc_counts(cdc_list, 1)
cdc_dom_births = calc_counts(cdc_list, 2)
cdc_dow_births = calc_counts(cdc_list, 3)

In [10]:
print(cdc_year_births)

{1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894, 1998: 3941553, 1999: 3959417, 2000: 4058814, 2001: 4025933, 2002: 4021726, 2003: 4089950}


In [11]:
print(cdc_month_births)

{1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314, 5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858, 9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860}


In [12]:
print(cdc_dom_births)

{1: 1276557, 2: 1288739, 3: 1304499, 4: 1288154, 5: 1299953, 6: 1304474, 7: 1310459, 8: 1312297, 9: 1303292, 10: 1320764, 11: 1314361, 12: 1318437, 13: 1277684, 14: 1320153, 15: 1319171, 16: 1315192, 17: 1324953, 18: 1326855, 19: 1318727, 20: 1324821, 21: 1322897, 22: 1317381, 23: 1293290, 24: 1288083, 25: 1272116, 26: 1284796, 27: 1294395, 28: 1307685, 29: 1223161, 30: 1202095, 31: 746696}


In [13]:
print(cdc_dow_births)

{6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196, 3: 6322855, 4: 6288429, 5: 6233657}
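
One way to read these dictionaries quickly is to pull out the keys with the highest and lowest totals. The helper below is a small sketch: `busiest_and_quietest`, `dow_totals` and `month_totals` are hypothetical names, and the values are copied from the printed results above.

```python
def busiest_and_quietest(totals):
    """Return (key with the highest total, key with the lowest total)."""
    return max(totals, key=totals.get), min(totals, key=totals.get)

# Values copied from the printed cdc_dow_births and cdc_month_births
dow_totals = {6: 4562111, 7: 4079723, 1: 5789166, 2: 6446196,
              3: 6322855, 4: 6288429, 5: 6233657}
month_totals = {1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314,
                5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858,
                9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860}

print(busiest_and_quietest(dow_totals))    # (2, 7)
print(busiest_and_quietest(month_totals))  # (8, 2)
```

These are raw totals, so months with fewer days (such as month 2) are expected to score lower regardless of the daily birth count.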


### Creating a general function to apply across any 2 time frames
For example, you can see the number of births in January for each year in the data set.

To use the function, the column indexes are:

'year' = 0

'month' = 1

'date_of_month' = 2

'day_of_week' = 3


In [14]:
# Sums births grouped by the values in 'column' (e.g. year), using only the
# rows where 'column_2' (e.g. month) equals 'values_c2'.
# For example, you can see the year-on-year number of births in January.

# column: index of the column to group by, e.g. year
# column_2: index of the column to filter on, e.g. month
# values_c2: the value of column_2 to keep, e.g. 1 for January
def calc_counter(data, column,column_2,values_c2): 
    total_dict={} 
    data_2=[]
    for x in data:
        if x[column_2] == values_c2:
            data_2.append(x)
        
    #print(data_2)
    
    for x in data_2:
        if x[column] in total_dict:
            total_dict[x[column]]=total_dict[x[column]]+x[-1]
        else:
            total_dict[x[column]]=x[-1]
    return total_dict
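
The filter and the sum can also be done in a single pass, without building the intermediate list `data_2`. This is a sketch of that idea under the same column conventions: `calc_counter_single_pass` is a hypothetical name, and `sample` holds the first ten rows of `cdc_list` copied from the earlier output.

```python
def calc_counter_single_pass(data, column, column_2, values_c2):
    """Hypothetical one-pass variant of calc_counter."""
    total_dict = {}
    for row in data:
        # ignore rows that do not match the filter value
        if row[column_2] != values_c2:
            continue
        key = row[column]
        total_dict[key] = total_dict.get(key, 0) + row[-1]
    return total_dict

# First ten rows of cdc_list, copied from the output earlier in the notebook
sample = [[1994, 1, 1, 6, 8096],
          [1994, 1, 2, 7, 7772],
          [1994, 1, 3, 1, 10142],
          [1994, 1, 4, 2, 11248],
          [1994, 1, 5, 3, 11053],
          [1994, 1, 6, 4, 11406],
          [1994, 1, 7, 5, 11251],
          [1994, 1, 8, 6, 8653],
          [1994, 1, 9, 7, 7910],
          [1994, 1, 10, 1, 10498]]

# births on day 6 of the week (column 3), grouped by day of month (column 2)
day_6_sample = calc_counter_single_pass(sample, 2, 3, 6)
print(day_6_sample)
```

For a data set of this size the difference is negligible; the two-step version above may be easier to debug because the filtered rows can be inspected.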


I will now apply this function to the three examples below.

##### Births in January for each year

In [15]:
# check the number of births in January for each year
jan_counter=calc_counter(cdc_list, 0,1, 1)
jan_counter

{1994: 320705,
 1995: 316013,
 1996: 314283,
 1997: 317211,
 1998: 319340,
 1999: 319182,
 2000: 330108,
 2001: 335198,
 2002: 330674,
 2003: 329803}

In [16]:
# quick check: summing the jan_counter dictionary values and
# comparing to the cdc_month_births result for month 1

# cdc_month_births result
jan= 3232517 

# jan_counter dictionary results summed
jan==(320705+ 316013+ 314283+ 317211+ 319340+ 319182+ 330108+ 
      335198+ 330674+329803)


True

##### Births on day 7 of the week for each year

In [17]:
# check the number of births on day 7 of the week for each year
day_7=calc_counter(cdc_list, 0,3, 7) 
day_7

{1994: 428752,
 1995: 425790,
 1996: 413336,
 1997: 404478,
 1998: 407129,
 1999: 401991,
 2000: 416454,
 2001: 397119,
 2002: 391375,
 2003: 393299}

In [18]:
# check day 7 dictionary with the cdc_dow_births dictionary for day 7
# using sum function and indexing this time

sum(day_7.values()) ==cdc_dow_births[7]

True

##### Births on day 31 for each month

In [19]:
# check the number of births on day 31 of the month for each month
day_31=calc_counter(cdc_list, 1 ,2 , 31) 
day_31

{1: 107281, 3: 103872, 5: 106984, 7: 116488, 8: 109017, 10: 99731, 12: 103323}

The day 31 dictionary correctly contains only the months with 31 days: months 1, 3, 5, 7, 8, 10 and 12, i.e. January, March, May, July, August, October and December.

In [20]:
# check day 31 dictionary with the cdc_dom_births dictionary for day 31

sum(day_31.values()) == cdc_dom_births[31]

True