# Guided Project #1 - U.S. Births
By [Luis Munguia](http://www.linkedin.com/in/luis-munguia) and [Dataquest](http://www.dataquest.io)

In this guided project, I'll work with a dataset on births in the U.S. compiled by FiveThirtyEight from CDC information.

The dataset contains the following columns:

* `year`: Year (`1994` to `2003`).
* `month`: Month (`1` to `12`).
* `date_of_month`: Day number of the month (`1` to `31`).
* `day_of_week`: Day of week (`1` to `7`, where `1` is Monday and `7` is Sunday).
* `births`: Number of births that day.

## 1.- Read CSV and explore data.
Use built-in functions to read the file. The dataset uses **newline delimiters and comma-separated values**.

In [1]:
with open("US_births_1994-2003_CDC_NCHS.csv") as f:
    header_database = f.read()
header_database[0:300] #The file contents are a single string with headers and values

'year,month,date_of_month,day_of_week,births\n1994,1,1,6,8096\n1994,1,2,7,7772\n1994,1,3,1,10142\n1994,1,4,2,11248\n1994,1,5,3,11053\n1994,1,6,4,11406\n1994,1,7,5,11251\n1994,1,8,6,8653\n1994,1,9,7,7910\n1994,1,10,1,10498\n1994,1,11,2,11706\n1994,1,12,3,11567\n1994,1,13,4,11212\n1994,1,14,5,11570\n1994,1,15,6,8660\n'

## 2.- Create `read_csv()` function
Define a function that takes a single required argument, the path to a CSV file, and converts the dataset into a **headerless list of integer lists**.

In [2]:
def read_csv(csvfile):
    with open(csvfile) as f:                #Close the file once it has been read
        rows = f.read().split("\n")
    database = rows[1:]                     #Remove header
    integer_database = []
    for data in database:
        if not data:                        #Skip blank lines, e.g. after a trailing newline
            continue
        fields = data.split(",")
        integer_database.append([int(field) for field in fields])
    return integer_database                 #Database items are integers

cdc_births = read_csv("US_births_1994-2003_CDC_NCHS.csv")

In [3]:
cdc_births[0:10] #This confirms it's a list of lists.

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]

In [4]:
type(cdc_births[0][0]) #This confirms value is integer and not string.

int
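
As a point of comparison, the same headerless list of integer lists can be produced with the standard library's `csv` module. This is only a minimal sketch, not the approach used in this project: the name `read_csv_with_module` is hypothetical, and the sample text is just the first rows shown in the exploration step, wrapped in `io.StringIO` so the snippet runs without the data file.

```python
import csv
import io

def read_csv_with_module(file_obj):
    """Parse an open CSV file object into a headerless list of integer lists."""
    reader = csv.reader(file_obj)
    next(reader)  # Skip the header row
    # csv.reader yields an empty list for blank lines, so those are filtered out
    return [[int(field) for field in row] for row in reader if row]

# First rows of the CDC file, copied from the exploration output
sample = io.StringIO(
    "year,month,date_of_month,day_of_week,births\n"
    "1994,1,1,6,8096\n"
    "1994,1,2,7,7772\n"
    "1994,1,3,1,10142\n"
)
sample_rows = read_csv_with_module(sample)
print(sample_rows)
```

With a real file, the same function could be called inside `with open(path, newline="") as f:`.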

## 3.- Create `month_births()` function

Define a function that takes a single required argument and calculates the total births by **month**.

In [5]:
def month_births(data): #data is the list of lists returned by read_csv()
    births_per_month = {}
    for row in data:
        month = row[1]
        births = row[-1] #Births is the last column
        if month in births_per_month:
            births_per_month[month] += births
        else:
            births_per_month[month] = births
    return births_per_month

In [6]:
cdc_month_births = month_births(cdc_births)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

The month with the most births is **August**, which seems to corroborate an old wives' saying: *"Made during the cold days"*.
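
Rather than reading the largest total off the printed dictionary, the busiest and quietest months can be looked up directly. A small sketch, assuming the monthly totals shown above; `month_totals` is a name introduced here so the snippet is self-contained.

```python
# Monthly totals copied from the month_births(cdc_births) output
month_totals = {
    1: 3232517, 2: 3018140, 3: 3322069, 4: 3185314,
    5: 3350907, 6: 3296530, 7: 3498783, 8: 3525858,
    9: 3439698, 10: 3378814, 11: 3171647, 12: 3301860,
}

# max/min iterate over the keys; key=month_totals.get ranks them by their totals
busiest_month = max(month_totals, key=month_totals.get)
quietest_month = min(month_totals, key=month_totals.get)
print(busiest_month, quietest_month)
```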

## 4.- Create `dow_births()` function

Define a function that takes a single required argument and calculates the total births by **day of week**.

In [7]:
def dow_births(data): #data is the list of lists returned by read_csv()
    births_by_dow = {}
    for row in data:
        day = row[3]
        birth = row[-1]
        if day in births_by_dow:
            births_by_dow[day] += birth
        else:
            births_by_dow[day] = birth
    return births_by_dow

In [8]:
cdc_dow_births = dow_births(cdc_births)
cdc_dow_births

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}

Is there a reason the fewest births happen on **Saturday and Sunday** (days `6` and `7`)? Is it because most doctors take those days off?
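
Before reading too much into the day numbers, it helps to confirm how `day_of_week` is encoded. One way, sketched here, is to compare a few rows against `datetime.date.isoweekday()`, which numbers Monday as 1 and Sunday as 7. `DAY_NAMES` and `sample_rows` are helpers introduced for illustration; the rows are the first three from the dataset.

```python
from datetime import date

DAY_NAMES = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}

# First three rows of the CDC data: year, month, date_of_month, day_of_week, births
sample_rows = [[1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772], [1994, 1, 3, 1, 10142]]

# True where the dataset's day_of_week agrees with the ISO weekday of that date
matches = [date(y, m, d).isoweekday() == dow for y, m, d, dow, _ in sample_rows]
print(matches)
print(DAY_NAMES[6], DAY_NAMES[7])
```

If every row matches, days `6` and `7` in the totals above are Saturday and Sunday.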

## 5.- Create `calc_counts()` function

Define a function that takes two required arguments, the data and a column index, and calculates the **total births for each unique value in that column**.

In [9]:
def calc_counts(data, column): #data is the list of lists returned by read_csv()
    births_by_column = {}
    for row in data:
        selected_column = row[column]
        birth = row[-1]
        if selected_column in births_by_column:
            births_by_column[selected_column] += birth
        else:
            births_by_column[selected_column] = birth
    return births_by_column

In [10]:
cdc_year_births = calc_counts(cdc_births,0)
cdc_year_births

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

In [11]:
cdc_month_births = calc_counts(cdc_births,1)
cdc_month_births

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

In [12]:
cdc_dom_births = calc_counts(cdc_births,2)
cdc_dom_births

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

In [13]:
cdc_dow_births = calc_counts(cdc_births,3)
cdc_dow_births

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}

This function gives more flexibility, as code can be reused to analyze different columns.
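
The `if`/`else` accumulation pattern used above can also be written with `dict.get()` and a default of `0`. The sketch below is an equivalent variant, not a replacement for `calc_counts()`; the name `calc_counts_get` is hypothetical and the sample is the first three rows of the CDC data.

```python
def calc_counts_get(data, column):
    """Total the births (last column) for each unique value found in `column`."""
    totals = {}
    for row in data:
        key = row[column]
        # get() returns 0 the first time a key is seen, so no if/else is needed
        totals[key] = totals.get(key, 0) + row[-1]
    return totals

sample_rows = [[1994, 1, 1, 6, 8096], [1994, 1, 2, 7, 7772], [1994, 1, 3, 1, 10142]]
by_month = calc_counts_get(sample_rows, 1)
by_dow = calc_counts_get(sample_rows, 3)
print(by_month)
print(by_dow)
```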

## 6.- Create `max_min()` function

Define a function that calculates the min and max values for any dictionary that's passed in.

In [14]:
def max_min(dictionary):
    max_value = None
    min_value = None
    for key in dictionary:
        value = dictionary[key]
        if max_value is None or value > max_value: #The first key initializes both values
            max_value = value
        if min_value is None or value < min_value:
            min_value = value
    print("Max value =", max_value, "\nMin value =", min_value)

In [15]:
max_min_dom = max_min(cdc_dom_births) #Day-of-month totals
max_min_dom

Max value = 1326855 
Min value = 746696
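
Knowing which key produced each extreme is often as useful as the value itself. A minimal sketch using the built-in `max()` and `min()`; `max_min_keys` is a hypothetical helper, and `dom_sample` holds only a few of the day-of-month totals shown earlier.

```python
def max_min_keys(dictionary):
    """Return ((max_key, max_value), (min_key, min_value)) for a non-empty dict."""
    max_key = max(dictionary, key=dictionary.get)
    min_key = min(dictionary, key=dictionary.get)
    return (max_key, dictionary[max_key]), (min_key, dictionary[min_key])

# A few day-of-month totals copied from the calc_counts(cdc_births, 2) output
dom_sample = {17: 1324953, 18: 1326855, 30: 1202095, 31: 746696}
highest, lowest = max_min_keys(dom_sample)
print(highest, lowest)
```

The low total for day 31 is expected, since only seven months have a 31st.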


## 7.- Create `compare_value()` function

Define a function that totals births for the same column value (for example, the same month) in two different years and reports the difference between them.

In [16]:
def compare_value(data, year1, year2, column, value): #Value is a month, day_of_week or day_of_month
    totals = {year1: 0, year2: 0}

    for row in data:
        if row[0] in totals and row[column] == value:
            totals[row[0]] += row[-1]

    comparison = totals[year2] - totals[year1] #Positive means births rose from year1 to year2

    if comparison > 0:
        text = "increased"
    else:
        text = "decreased"
        comparison = comparison * -1

    print("Births between", year1, "and", year2, text, "by", comparison, "for the selected input.")

In [17]:
compare_value(cdc_births,1994,2003,1,1) #Last input is the month of January.
compare_value(cdc_births,1994,2003,1,12) #Last input is the month of December.

Births between 1994 and 2003 increased by 9098 for the selected input.
Births between 1994 and 2003 increased by 16831 for the selected input.


In the example above, I compared how births changed from January 1994 to January 2003 and from December 1994 to December 2003.

## 8.- Combine CDC Data with SSA Data.

There is another dataset, provided by the SSA, that ranges from 2000 to 2014 and uses the same column structure as the CDC data. First I will explore the overlapping years, then decide the best way to combine both sources into a single dataset.

In [18]:
ssa_births = read_csv("US_births_2000-2014_SSA.csv")

In [19]:
compare_value(cdc_births,2000,2001,1,1)
compare_value(ssa_births,2000,2001,1,1)

Births between 2000 and 2001 increased by 5090 for the selected input.
Births between 2000 and 2001 increased by 4640 for the selected input.


In [20]:
compare_value(cdc_births,2001,2002,1,1)
compare_value(ssa_births,2001,2002,1,1)

Births between 2001 and 2002 decreased by 4524 for the selected input.
Births between 2001 and 2002 decreased by 4907 for the selected input.


In [21]:
compare_value(cdc_births,2000,2001,3,7)
compare_value(ssa_births,2000,2001,3,7)

Births between 2000 and 2001 decreased by 19335 for the selected input.
Births between 2000 and 2001 decreased by 20384 for the selected input.


In [22]:
compare_value(cdc_births,2001,2002,2,6)
compare_value(ssa_births,2001,2002,2,6)

Births between 2001 and 2002 decreased by 6887 for the selected input.
Births between 2001 and 2002 decreased by 7355 for the selected input.


In [23]:
compare_value(cdc_births,2002,2003,3,2)
compare_value(ssa_births,2002,2003,3,2)

Births between 2002 and 2003 increased by 9243 for the selected input.
Births between 2002 and 2003 increased by 8735 for the selected input.


In [24]:
5090 / 4640

1.0969827586206897

In [25]:
4524 / 4907

0.9219482372121459

In [26]:
19335 / 20384

0.9485380690737834

In [27]:
6887 / 7355

0.9363698164513936

In [28]:
9243 / 8735

1.0581568402976531

The year-over-year changes reported by the two sources are within about ±10% of each other, so it makes sense to me to average the two. So now, how should I combine them?
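
The five divisions above can be collected into a single check. This is a small sketch; the pairs are the CDC and SSA differences printed by `compare_value()` in the cells above, and `changes` and `within_10_percent` are names introduced here.

```python
# (CDC change, SSA change) pairs copied from the compare_value() outputs
changes = [(5090, 4640), (4524, 4907), (19335, 20384), (6887, 7355), (9243, 8735)]

ratios = [cdc / ssa for cdc, ssa in changes]
for ratio in ratios:
    print(round(ratio, 4))

# True only if every CDC/SSA ratio falls between 0.9 and 1.1
within_10_percent = all(0.9 <= ratio <= 1.1 for ratio in ratios)
print(within_10_percent)
```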

## 9.- Create `combine_csv_by_ave()` function

Create a function that combines data using averages from overlapping years.

In [29]:
def combine_csv_by_ave(read_csv1, read_csv2, overlapyear):
    #Assumes read_csv2 starts on the first day of overlapyear and matches read_csv1 row by row
    combined_csv = []
    count = 0
    count2 = 0
    for row in read_csv1:
        if row[0] < overlapyear:
            combined_csv.append(row)
            count += 1
    for row in read_csv1[count:]:
        ave_birth = (row[-1] + read_csv2[count2][-1])//2
        combined_csv.append(row[:-1] + [ave_birth]) #New row, so read_csv1 is not modified
        count2 += 1
    for row in read_csv2[count2:]:
        combined_csv.append(row)
    return combined_csv

In [30]:
cdc_ssa_births = combine_csv_by_ave(cdc_births,ssa_births,2000)

In [31]:
cdc_ssa_births[2188:2195]

[[1999, 12, 29, 3, 12629],
 [1999, 12, 30, 4, 11935],
 [1999, 12, 31, 5, 9335],
 [2000, 1, 1, 6, 8963],
 [2000, 1, 2, 7, 7911],
 [2000, 1, 3, 1, 11243],
 [2000, 1, 4, 2, 12867]]

In [32]:
(8843+9083)/2

8963.0

In [33]:
(7816+8006)/2

7911.0

The function successfully combines both datasets. Because the averaged rows are built as new lists, `cdc_births` is not modified and the function gives the same result if it is called again. Let's see what new information this complete dataset provides.
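
`combine_csv_by_ave()` pairs rows by position, so it relies on the second dataset starting on the first day of the overlap year and having no missing days. A hedged alternative is to match rows on their `(year, month, date_of_month)` key instead. The sketch below is an illustration only: `combine_by_date` is a hypothetical name, the sample values are the ones used in the checks above, and which of the two values belongs to which source is an assumption made for the example.

```python
def combine_by_date(rows1, rows2):
    """Average births for dates present in both lists; keep all other rows as they are."""
    births2 = {tuple(row[:3]): row[-1] for row in rows2}
    seen = set()
    combined = []
    for row in rows1:
        key = tuple(row[:3])
        seen.add(key)
        if key in births2:
            # Build a new row so the input lists are never modified
            combined.append(row[:-1] + [(row[-1] + births2[key]) // 2])
        else:
            combined.append(list(row))
    # Dates that appear only in the second list
    combined.extend(list(row) for row in rows2 if tuple(row[:3]) not in seen)
    combined.sort(key=lambda row: row[:3])
    return combined

# Assumed assignment of the two values per date to each source
first = [[1999, 12, 31, 5, 9335], [2000, 1, 1, 6, 8843], [2000, 1, 2, 7, 7816]]
second = [[2000, 1, 1, 6, 9083], [2000, 1, 2, 7, 8006]]
merged = combine_by_date(first, second)
print(merged)
```

Because the averaged rows are new lists, calling the function a second time on the same inputs returns the same result.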

In [34]:
cdc_ssa_years = calc_counts(cdc_ssa_births,0)
cdc_ssa_years

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4104119,
 2001: 4068359,
 2002: 4060428,
 2003: 4126419,
 2004: 4186863,
 2005: 4211941,
 2006: 4335154,
 2007: 4380784,
 2008: 4310737,
 2009: 4190991,
 2010: 4055975,
 2011: 4006908,
 2012: 4000868,
 2013: 3973337,
 2014: 4010532}

In [35]:
cdc_ssa_months = calc_counts(cdc_ssa_births,1)
cdc_ssa_months

{1: 6965310,
 2: 6499459,
 3: 7134617,
 4: 6838762,
 5: 7162927,
 6: 7110295,
 7: 7514008,
 8: 7610244,
 9: 7425952,
 10: 7278923,
 11: 6869491,
 12: 7139141}

In [36]:
cdc_ssa_dow = calc_counts(cdc_ssa_births,3)
cdc_ssa_dow

{1: 12696731,
 2: 14044930,
 3: 13803846,
 4: 13717707,
 5: 13497703,
 6: 9433618,
 7: 8354594}

In [37]:
compare_value(cdc_ssa_births,1994,2014,1,1)

Births between 1994 and 2014 increased by 9257 for the selected input.


I'll do some analysis of the combined data.

In [38]:
total = 0
for key in cdc_ssa_years:
    total = cdc_ssa_years[key] + total
ave_total = int(total/len(cdc_ssa_years)) #Average once, after the loop has finished
print(ave_total)

4073768


In [39]:
max_min_year_combined = max_min(cdc_ssa_years)

Max value = 4380784 
Min value = 3880894


In [40]:
4073768 - 4380784

-307016

In [41]:
4073768 - 3880894

192874

In [42]:
307016 / 4073768

0.07536413462916887

In [43]:
192874 / 4073768

0.04734535692754226
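
The four calculations above measure how far the highest and lowest yearly totals sit from the average. They can be reproduced in one place, as sketched below; `yearly_totals` is copied from the `calc_counts(cdc_ssa_births,0)` output, and the variable names are introduced here for illustration.

```python
# Yearly totals copied from the combined CDC/SSA output
yearly_totals = {
    1994: 3952767, 1995: 3899589, 1996: 3891494, 1997: 3880894,
    1998: 3941553, 1999: 3959417, 2000: 4104119, 2001: 4068359,
    2002: 4060428, 2003: 4126419, 2004: 4186863, 2005: 4211941,
    2006: 4335154, 2007: 4380784, 2008: 4310737, 2009: 4190991,
    2010: 4055975, 2011: 4006908, 2012: 4000868, 2013: 3973337,
    2014: 4010532,
}

average = sum(yearly_totals.values()) // len(yearly_totals)
highest = max(yearly_totals.values())
lowest = min(yearly_totals.values())

# Distance of the extremes from the average, as a percentage of the average
pct_above = (highest - average) / average * 100
pct_below = (average - lowest) / average * 100
print(average, round(pct_above, 2), round(pct_below, 2))
```

Note that both percentages are relative to the average over all 21 years, not to the previous year.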

## 10.- Closing Commentary

These are my findings after reviewing both the CDC and SSA data and doing some simple analysis:
* August is the month with the most births over the whole period. This is consistent with the assumption that conceptions peak around December.
* Total annual births have not grown exponentially; between 1994 and 2014 they stayed between roughly 3.88 and 4.38 million per year. The factors behind this are not identified by this analysis.
* Annual totals range from about 7.5% above to 4.7% below the 1994-2014 average of 4,073,768 births per year.
* Saturday and Sunday account for the fewest births in the week. As asked above, this may be because fewer doctors are on duty on those days.

Takeaways:
* Market birth control more heavily during the holidays and develop an awareness strategy.
* Compare U.S. birth rates to global indicators and to other countries for more insight.
* Extend the data 10 more years in both directions to depict the range more accurately.
* Provide more medical services on Saturday and Sunday, e.g. discounts for having your baby born during the weekend.