# Analyzing U.S. Birth Data

This project analyzes U.S. birth data using core **Python programming concepts**.

The dataset for the project is available [here](https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv). The code in this notebook assumes it has been saved locally as `birth_data.csv`.

The first step is to read the file into our notebook:

In [5]:
# Using the open() function in a with block so the file is closed automatically
with open("birth_data.csv", 'r') as file:
    # Reading the file into a string we can split on
    file_read = file.read()
# Splitting the string into a single list with one entry per row
string_list = file_read.split('\n')

In [6]:
#Displaying the first 10 rows of the file
string_list[:10]

['year,month,date_of_month,day_of_week,births',
 '1994,1,1,6,8096',
 '1994,1,2,7,7772',
 '1994,1,3,1,10142',
 '1994,1,4,2,11248',
 '1994,1,5,3,11053',
 '1994,1,6,4,11406',
 '1994,1,7,5,11251',
 '1994,1,8,6,8653',
 '1994,1,9,7,7910']

From the data we can see that the columns are:

- year
- month
- date_of_month
- day_of_week
- births
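The column names do not have to be read off by eye. As a minimal sketch, the header line shown in the output above can be split on the comma delimiter, which also gives the position of each column within a row. The variable names here are illustrative and are not used elsewhere in the notebook.

```python
# The header line exactly as it appears in the output above
header_line = 'year,month,date_of_month,day_of_week,births'

# Splitting on the comma delimiter yields one name per column
column_names = header_line.split(',')

# Print each column's position alongside its name
for index, name in enumerate(column_names):
    print(index, name)
```

The positions printed here (0 for year through 4 for births) are the indexes used to pick values out of each row later on.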

To load the data with a single function call, we can write a function that converts a CSV file
of integer values into a list of lists, where each inner list represents a single row of the
file:

In [10]:
#Function to return a list of lists from a csv file

def read_csv(file_name):

    #The with block closes the file once it has been read
    with open(file_name, 'r') as file:
        string_list = file.read().split('\n')[1:] #Skip the header row

    final_list = [] #Initializing an empty list to store our list of lists

    for row in string_list:

        if not row: #Skip blank lines, such as the one left by a trailing newline
            continue

        string_fields = row.split(',') #Splitting on the delimiter of the file to make each row into a list
        int_fields = [] #Initializing an empty list to store the integer values of the row

        for field in string_fields:
            int_fields.append(int(field)) #Converting each field into an integer

        final_list.append(int_fields)

    return final_list

A brief overview of the **read_csv** function:

- Reads the file into a string, splits the string on the newline character ("\n"),
  and removes the header row.
- Creates an empty list named final_list.

It then uses a for loop to:

- Iterate over string_list,
- Split each row on the comma delimiter (,) and assign the resulting list to string_fields,
- Create an empty list named int_fields,
- Convert each value in string_fields to an integer and append it to int_fields,
- Append int_fields to final_list.

Finally, it returns final_list.
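For comparison, the same steps can be sketched with Python's built-in `csv` module, which handles the splitting for us. This is an alternative and not the approach used in this notebook; the function name `read_csv_with_module` is made up for illustration.

```python
import csv

def read_csv_with_module(file_name):
    """Hypothetical variant of read_csv built on the csv module."""
    with open(file_name, newline='') as file:
        reader = csv.reader(file)
        next(reader, None)  # Skip the header row (no error if the file is empty)
        # Convert every field of every non-empty row to an integer
        return [[int(field) for field in row] for row in reader if row]
```

Calling `read_csv_with_module("birth_data.csv")` should return the same list of lists as **read_csv**, assuming every field in the file is an integer.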

In [11]:
function_test = read_csv("birth_data.csv")
function_test[:10]

[[1994, 1, 1, 6, 8096],
 [1994, 1, 2, 7, 7772],
 [1994, 1, 3, 1, 10142],
 [1994, 1, 4, 2, 11248],
 [1994, 1, 5, 3, 11053],
 [1994, 1, 6, 4, 11406],
 [1994, 1, 7, 5, 11251],
 [1994, 1, 8, 6, 8653],
 [1994, 1, 9, 7, 7910],
 [1994, 1, 10, 1, 10498]]

Testing the function confirms that it returns a list of lists in which:

- Each inner list represents a single row of the CSV file.
- Each element of an inner list is an integer.
- The header row is not included.

Now that the data is in a more usable format, we can start analyzing it.
Totaling the number of births for each month across all years in the dataset gives
us a concise picture of which months have the highest and lowest numbers of births.

In [12]:
def month_births(list_of_lists):
    
    births_per_month = {} #Initialize empty dictionary to store month and number of births 
    
    for row in list_of_lists:
        month = row[1] #Extract the month from list of lists
        births = row[4] #Extract the number of births from list of lists
        
        if month in births_per_month:
            births_per_month[month] = births_per_month[month] + births
        else:
            births_per_month[month] = births
            
    return births_per_month

The function above builds a dictionary that maps each month to the total number of births
for that month across all years.

It loops over the list of lists and, for each row, extracts the month and the number of births. Then:

- If the month value already exists as a key in births_per_month,
  the births value is added to the existing value.
- If the month value doesn't exist as a key in births_per_month, the key is created with the
  births value as its associated value.
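The "add to the existing value, or create the key" pattern described above is common enough that the standard library offers a shortcut for it. The sketch below uses `collections.defaultdict` as an alternative way to write the same logic; the function name and the toy rows are made up for illustration and are not real birth counts.

```python
from collections import defaultdict

def month_births_defaultdict(list_of_lists):
    """Hypothetical variant of month_births using defaultdict."""
    births_per_month = defaultdict(int)  # Missing keys start at 0
    for row in list_of_lists:
        births_per_month[row[1]] += row[4]  # row[1] is the month, row[4] the births
    return dict(births_per_month)  # Convert back to a plain dictionary

# Toy rows with made-up values, in the same column order as the dataset
toy_rows = [
    [2001, 1, 1, 1, 10],
    [2001, 1, 2, 2, 20],
    [2001, 2, 1, 4, 5],
]
print(month_births_defaultdict(toy_rows))  # {1: 30, 2: 5}
```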

In [14]:
#Testing the function

month_births(function_test)

{1: 3232517,
 2: 3018140,
 3: 3322069,
 4: 3185314,
 5: 3350907,
 6: 3296530,
 7: 3498783,
 8: 3525858,
 9: 3439698,
 10: 3378814,
 11: 3171647,
 12: 3301860}

We can see that:

- The function gives us the total number of births for each month.
- August has the highest number of births at 3525858.
- February has the lowest number of births at 3018140.

February could have the lowest number of births partly because it has only 28 days (29 in a leap year),
while the other months have 30 or 31 days, which gives them 1 to 3 additional days on which
births can occur.
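One way to separate the effect of month length from a real difference in birth rates is to compare average births per day instead of totals. Since each row of the dataset represents one day, counting rows gives the number of days. The sketch below is one possible approach; `average_births_per_day` is a made-up name and the toy rows are not real birth counts.

```python
def average_births_per_day(list_of_lists, column):
    """Hypothetical helper: mean births per row for each value in a column."""
    totals = {}
    day_counts = {}
    for row in list_of_lists:
        key = row[column]
        totals[key] = totals.get(key, 0) + row[4]    # Running total of births
        day_counts[key] = day_counts.get(key, 0) + 1  # Each row is one day
    return {key: totals[key] / day_counts[key] for key in totals}

# Toy rows with made-up values: both months total 60 births,
# but the second month reaches that total in fewer days
toy_rows = [
    [2001, 1, 1, 1, 10],
    [2001, 1, 2, 2, 20],
    [2001, 1, 3, 3, 30],
    [2001, 2, 1, 4, 25],
    [2001, 2, 2, 5, 35],
]
print(average_births_per_day(toy_rows, 1))  # {1: 20.0, 2: 30.0}
```

Running the same helper on the full dataset with column 1 would show whether February still ranks lowest once its shorter length is taken into account.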

Totaling the number of births for each day of the week shows which days of the week
have the highest and lowest numbers of births.

In [17]:
def dow_births(list_of_lists):
    
    births_per_dow = {} #Initialize empty dictionary to store day of week and number of births 
    
    for row in list_of_lists:
        dow = row[3] #Extract the day of week from list of lists
        births = row[4] #Extract the number of births from list of lists
        
        if dow in births_per_dow:
            births_per_dow[dow] = births_per_dow[dow] + births
        else:
            births_per_dow[dow] = births
            
    return births_per_dow

The function above returns a dictionary that maps each day of the week to the total number
of births on that day across the whole dataset.

In [18]:
dow_births(function_test)

{1: 5789166,
 2: 6446196,
 3: 6322855,
 4: 6288429,
 5: 6233657,
 6: 4562111,
 7: 4079723}

With the days numbered from 1 (Monday) to 7 (Sunday), we can observe that:

- Tuesday has the highest number of births at 6446196.
- Sunday has the lowest number of births at 4079723.
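These two observations can also be read out of the dictionary programmatically. The sketch below copies the totals from the output above and uses the built-in `max` and `min` with the dictionary's `get` method as the key. The `DAY_NAMES` mapping is an assumption (1 for Monday through 7 for Sunday), consistent with January 1, 1994, a Saturday, appearing as day 6 in the data.

```python
# Totals copied from the dow_births output above
dow_totals = {
    1: 5789166,
    2: 6446196,
    3: 6322855,
    4: 6288429,
    5: 6233657,
    6: 4562111,
    7: 4079723,
}

# Assumed numbering: 1 is Monday, 7 is Sunday
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

# max/min iterate over the keys and compare them by their totals
busiest_day = max(dow_totals, key=dow_totals.get)
quietest_day = min(dow_totals, key=dow_totals.get)

print(DAY_NAMES[busiest_day], dow_totals[busiest_day])    # Tuesday 6446196
print(DAY_NAMES[quietest_day], dow_totals[quietest_day])  # Sunday 4079723
```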

Since the two functions we created, **month_births** and **dow_births**, are nearly
identical, we can write a single generic function that returns the total number of births
for each value of any column: year, month, date of the month, or day of the week.

In [19]:
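def calc_counts(list_of_lists, column):
    
    sums_dict = {} #Initialize empty dictionary to store column value and number of births
    
    for row in list_of_lists:
        
        col_value = row[column] #Extract the value of the chosen column
        births = row[4] #Extract the number of births
        
        if col_value in sums_dict:
            sums_dict[col_value] = sums_dict[col_value] + births
        else:
            sums_dict[col_value] = births
            
    return sums_dict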

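The generic function **calc_counts** returns a dictionary that maps each distinct value in the
chosen column to its total number of births. The column argument is the index of that column
within a row: 0 for year, 1 for month, 2 for date of the month, and 3 for day of the week.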
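A quick sanity check for a grouping function like this is that the totals must add up to the same overall number no matter which column is used. The sketch below repeats **calc_counts** so that it runs on its own and applies it to the ten data rows displayed earlier in the notebook.

```python
def calc_counts(list_of_lists, column):
    sums_dict = {}
    for row in list_of_lists:
        col_value = row[column]
        births = row[4]
        if col_value in sums_dict:
            sums_dict[col_value] = sums_dict[col_value] + births
        else:
            sums_dict[col_value] = births
    return sums_dict

# The first ten data rows shown in the read_csv output above
sample_rows = [
    [1994, 1, 1, 6, 8096],
    [1994, 1, 2, 7, 7772],
    [1994, 1, 3, 1, 10142],
    [1994, 1, 4, 2, 11248],
    [1994, 1, 5, 3, 11053],
    [1994, 1, 6, 4, 11406],
    [1994, 1, 7, 5, 11251],
    [1994, 1, 8, 6, 8653],
    [1994, 1, 9, 7, 7910],
    [1994, 1, 10, 1, 10498],
]

by_month = calc_counts(sample_rows, 1)
by_dow = calc_counts(sample_rows, 3)
print(by_month)  # {1: 98029}
print(by_dow)    # {6: 16749, 7: 15682, 1: 20640, 2: 11248, 3: 11053, 4: 11406, 5: 11251}

# Every grouping must account for the same total number of births
total_births = sum(row[4] for row in sample_rows)
print(sum(by_month.values()) == sum(by_dow.values()) == total_births)  # True
```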

We can now extract the number of births by year:

In [20]:
#Number of births by year

calc_counts(function_test, 0)

{1994: 3952767,
 1995: 3899589,
 1996: 3891494,
 1997: 3880894,
 1998: 3941553,
 1999: 3959417,
 2000: 4058814,
 2001: 4025933,
 2002: 4021726,
 2003: 4089950}

The number of births is higher at the end of the period than at the start, but the increase is not steady:

- Births declined in 1995 compared to 1994.
- Births declined in 1996 compared to 1995.
- Births declined in 1997 compared to 1996.
- Births declined in 2001 compared to 2000.
- Births declined in 2002 compared to 2001.
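Year-over-year changes are easy to misread by eye, so it can help to compute them. The sketch below copies the yearly totals from the output above and pairs each year with the one before it using `zip`; the variable names are illustrative.

```python
# Totals copied from the calc_counts(function_test, 0) output above
births_by_year = {
    1994: 3952767,
    1995: 3899589,
    1996: 3891494,
    1997: 3880894,
    1998: 3941553,
    1999: 3959417,
    2000: 4058814,
    2001: 4025933,
    2002: 4021726,
    2003: 4089950,
}

years = sorted(births_by_year)
declining_years = []

# Pair each year with the previous one and record the difference
for previous_year, year in zip(years, years[1:]):
    change = births_by_year[year] - births_by_year[previous_year]
    if change < 0:
        declining_years.append(year)
    print(year, change)

print(declining_years)  # [1995, 1996, 1997, 2001, 2002]
```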

We would generally expect the number of births to rise over the years as the population grows,
so the years with a decline call for a deeper investigation of the data.

The declines could have many possible causes, such as:

- Higher mortality rate due to a certain illness.
- Families reducing the number of children they have.
- Errors in data entry.

Let's take a look at the number of births by the day of the month:

In [21]:
#Number of births by date of month

calc_counts(function_test, 2)

{1: 1276557,
 2: 1288739,
 3: 1304499,
 4: 1288154,
 5: 1299953,
 6: 1304474,
 7: 1310459,
 8: 1312297,
 9: 1303292,
 10: 1320764,
 11: 1314361,
 12: 1318437,
 13: 1277684,
 14: 1320153,
 15: 1319171,
 16: 1315192,
 17: 1324953,
 18: 1326855,
 19: 1318727,
 20: 1324821,
 21: 1322897,
 22: 1317381,
 23: 1293290,
 24: 1288083,
 25: 1272116,
 26: 1284796,
 27: 1294395,
 28: 1307685,
 29: 1223161,
 30: 1202095,
 31: 746696}

From the data above we can see that:

- The 31st has the lowest total number of births at 746696. This is to be expected, as only 7 of
  the 12 months have 31 days.
- The 18th has the highest total number of births at 1326855.
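To make the "not all months have 31 days" point concrete, we can count how many times each date of the month occurs in the period covered by the dataset. The sketch below uses `calendar.monthrange` from the standard library and assumes the data spans 1994 through 2003, as the dataset's file name and the yearly totals indicate. The function name is made up for illustration.

```python
import calendar

def count_date_occurrences(first_year, last_year):
    """Hypothetical helper: how often each date of the month occurs."""
    occurrences = {}
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the first day, number of days)
            days_in_month = calendar.monthrange(year, month)[1]
            for date in range(1, days_in_month + 1):
                occurrences[date] = occurrences.get(date, 0) + 1
    return occurrences

occurrences = count_date_occurrences(1994, 2003)
print(occurrences[18], occurrences[29], occurrences[30], occurrences[31])  # 120 112 110 70
```

Dividing the totals above by these counts gives roughly 10,667 births per occurrence for the 31st (746696 / 70) and roughly 11,057 for the 18th (1326855 / 120), which suggests that most of the gap in the totals comes from how often each date occurs.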

# Conclusion

From the analysis above, we gathered summary statistics on U.S. birth data by writing
functions that aggregate the data into a compact, readable format.