## Group Members
Anona Gupta (A13606136),  Abigail Paterson (A13370320), Martha Sheets (A14074010) 

Member Contributions:

- Anona Gupta - Data Cleaning and Processing, Data Description, Plotly geographic maps, writing
- Abigail Paterson - Normal and T-tests, Python plots, writing
- Martha Sheets - Conclusion, Plotly geographic maps, writing

# COGS 108 - Final Project: Race and Infant Mortality Data 

### Research Question
Is there a significant relationship between the distribution of race and infant mortality rates throughout the United States?

### Introduction and Background

The representation of population health statistics is a vital tool in understanding problems within health systems. One such problem is the stark disparity in maternal healthcare between white women and women of color, especially in the United States. 
Many sources report that Black mothers are as much as 3 to 4 times more likely to die in childbirth than their white counterparts, demonstrating an inequitable distribution of care across the population based on race. 

Similarly, our group decided to examine whether race factors into infant mortality rates: we ask whether this racial health gap is reflected in higher infant mortality rates in communities with more women of color. 

To do so, we assess the relationship between infant mortality and race by comparing overall infant mortality rates with the distribution of Black and White populations across each county in the U.S. 

Based on the pattern of maternal death rates among African American women, we expect that the higher the black infant population within a county, the higher the infant mortality rate. We acknowledge that many other factors, such as maternal health, socioeconomic status, educational status, and family planning, affect infant wellbeing, but infant mortality most succinctly describes the relationship between natal treatment and race in the United States. Additionally, we decided to analyze the data on a county-by-county basis to represent the population statistics in more detail than state averages allow. Although this is a bleak subject, it is important to equip the public with an understanding of the dire situations women and children of color face in the United States. 

### Project Outline 
The notebook consists of several parts, each taking the analysis a step further. The first part covers data cleaning, wrangling, and standardization; later parts move into tests for correlation and significance. By the end we hope both to establish whether a strong relationship between race and health outcomes exists and to represent that relationship geospatially through maps, graphs, and other visualizations.

### Data Description and Ethics Considerations

The data provided via the .txt file "CompressedMortality1999-2016-2.txt" includes statistics on Black infant deaths, White infant deaths, Black infant mortality rate, and White infant mortality rate within each county of every state. The dataset also includes 3 other categories for Race: Native American, Asian-Pacific Islander, and 'nan' entries. It does not have Hispanic as a category because, according to the CDC:

"Race and Hispanic origin are reported separately on the death certificate", as the federal government regards Hispanic as an ethnicity and not a race.

A secondary dataset, also from CDC Wonder, "CompressedMortalityTotals1999-2016.txt", includes overall infant mortality statistics per county, not separated by race. We use this dataset to compare the total infant mortality rate of each county to the race-specific ones. We used two datasets because, when queried to separate by race, the CDC database does not include total populations.

The dataset comes from a public source, CDC Wonder, which seeks to provide the general public and health workers with the public health data collected by the Centers for Disease Control and Prevention. The site specifically states that its purpose is “to promote information-driven decision making… with access to specific and detailed information,” which in many ways is in the spirit of this research question and project. The hope is that by statistically and quantitatively reinforcing the claims that racial injustice directly harms the health of minority populations, better programs and interventions may arise to end such systematic disadvantages. 

##### Before accessing the data provided, the CDC sets these guidelines for data usage: 

Use these data for health statistical reporting and analysis only. For sub-national geography, do not present or publish death counts of 9 or fewer or death rates based on counts of nine or fewer (in figures, graphs, maps, tables, etc.). Make no attempt to learn the identity of any person or establishment included in these data. Make no disclosure or other use of the identity of any person or establishment discovered inadvertently and advise the NCHS Confidentiality Officer of any such discovery. 

In addition, the CDC has further guidelines regarding infant information, which say not to present or publish infant death rates based on counts of 19 or fewer.

Therefore, we will be removing any county values that include 19 or fewer deaths. Although our dataset includes county names and, by extension, sub-national geographic information, which the Safe Harbor method discourages, the CDC heavily values patient and population privacy and allows this information to be analyzed. Moreover, the CDC's preprocessing of the data protects privacy, so the data can be used ethically in these circumstances as long as this project adheres to their guidelines. 

### Data Cleaning/ Preprocessing

- The original text file from CDC Wonder has a separate row for each race within a county: if Autauga County, AL reports deaths for both Black and White infants, those entries will be in two separate rows. Therefore, the biggest part of the data cleaning will be combining these entries into separate columns and ensuring that each county has only one row.

Steps: 
- 1) Deleting unnecessary columns 
- 2) Removing any county observations with 19 or fewer deaths 
- 3) Standardizing the column names and the types within them 
- 4) Removing races other than Black or White
- 5) Reshaping the dataset to represent race as separate columns
- 6) Making sure there is only one entry per county

We hope to arrange our data so that the columns end in this order: (State, County, Federal Information Processing Standards (FIPS) code, # of Black infant deaths, Black Infant Death Rate per 1000, # of White infant deaths, White Infant Death Rate per 1000, Black/African American infant population, White infant population)
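
As an illustration of the reshaping goal described above (one row per county, with each race in its own columns), here is a minimal sketch using `DataFrame.pivot`. This is an alternative technique, not the approach taken in the cells that follow, which split the data by race and merge the pieces. The toy rows and the flattened column names are assumptions made up for the example.

```python
import pandas as pd

# Toy long-format rows: one row per (county, race), as in the CDC export.
long_df = pd.DataFrame({
    "County": ["Autauga County, AL", "Autauga County, AL", "Blount County, AL"],
    "County Code": ["01001", "01001", "01009"],
    "Race": ["Black or African American", "White", "White"],
    "Deaths": [32.0, 56.0, 82.0],
    "Population": [2181.0, 8989.0, 12409.0],
})

# Keep only the two races under study, then pivot so each race gets its own columns.
keep = long_df[long_df["Race"].isin(["Black or African American", "White"])]
wide = keep.pivot(index=["County Code", "County"], columns="Race",
                  values=["Deaths", "Population"])

# Flatten the two-level column index into single names (hypothetical naming scheme).
wide.columns = [f"{race} {measure}" for measure, race in wide.columns]
wide = wide.reset_index()
print(wide)
```

Counties that report only one race end up with NaN in the other race's columns, which mirrors what an outer merge produces.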

##### Also, not all counties report populations for each race. In fact, far more counties report white populations than black: 2056 counties report white infant populations, and 759 report black infant populations, a difference of 1297 counties.


In [1]:
# #Imports needed for data cleaning, tests, and visualizations
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib as matlib
import matplotlib.pyplot as plt
import seaborn as sns

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import plotly
plotly.tools.set_credentials_file(username='mesheets', api_key='nvd9Uwfa6x1l1zNbv863')
plotly.tools.set_config_file(world_readable=True, sharing='public')


import plotly.plotly as py
import plotly.figure_factory as ff
from plotly.figure_factory._county_choropleth import create_choropleth
import plotly.graph_objs as go
import geopandas 
import shapely
import shapefile

In [2]:
#primary dataset from CDC wonder

#Crude Rate = Count / Population * 100,000

all_data = 'CompressedMortality1999-2016-2.txt'

df = pd.read_csv(all_data, sep = "\t", dtype={'County Code': 'str'}) #making sure that any leading 0's are not dropped
df.head(130)

Unnamed: 0,Notes,County,County Code,Race,Race Code,Deaths,Population,Crude Rate
0,,"Autauga County, AL",01001,Black or African American,2054-5,32.0,2181.0,1467.2
1,,"Autauga County, AL",01001,White,2106-3,56.0,8989.0,623.0
2,,"Baldwin County, AL",01003,Black or African American,2054-5,69.0,4811.0,1434.2
3,,"Baldwin County, AL",01003,White,2106-3,173.0,31807.0,543.9
4,,"Barbour County, AL",01005,Black or African American,2054-5,37.0,3123.0,1184.8
5,,"Barbour County, AL",01005,White,2106-3,20.0,2776.0,720.5
6,,"Bibb County, AL",01007,Black or African American,2054-5,19.0,1075.0,1767.4 (Unreliable)
7,,"Bibb County, AL",01007,White,2106-3,37.0,3894.0,950.2
8,,"Blount County, AL",01009,White,2106-3,82.0,12409.0,660.8
9,,"Bullock County, AL",01011,Black or African American,2054-5,24.0,1916.0,1252.6
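
The `dtype={'County Code': 'str'}` argument in the read above matters because county FIPS codes are 5-character identifiers, and states such as Alabama have codes that begin with 0. A small sketch of the difference, using an in-memory stand-in for the CDC file (the two-line `raw` string is an assumption built from the preview rows):

```python
import io
import pandas as pd

# Two-line stand-in for the tab-separated CDC export.
raw = "County\tCounty Code\tDeaths\nAutauga County, AL\t01001\t32\n"

as_number = pd.read_csv(io.StringIO(raw), sep="\t")
as_text = pd.read_csv(io.StringIO(raw), sep="\t", dtype={"County Code": "str"})

print(as_number["County Code"].iloc[0])  # parsed as an integer, leading zero lost
print(as_text["County Code"].iloc[0])    # kept as the 5-character string

# If a code has already been parsed as a number, it can be padded back to 5 digits.
repaired = as_number["County Code"].astype(str).str.zfill(5)
print(repaired.iloc[0])
```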


There are certain counties where the crude rate is given in the format "rate (Unreliable)", for example "1767.4 (Unreliable)".
This is because, according to CDC guidelines:  
"When the infant mortality measure represents ten to nineteen (10-19) infant deaths, the number of deaths and live births are shown, but rates and associated measures should not be shown."
So, we drop any entry with 19 or fewer deaths.
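
A minimal sketch of how the "(Unreliable)" marker and the death-count threshold interact, on toy rows modeled on the preview above. The helper columns `Unreliable` and `Rate` are hypothetical names introduced only for this example; the notebook itself relies on the `Deaths > 19` filter alone.

```python
import pandas as pd

# Toy rows modeled on the preview; "Crude Rate" arrives as text.
toy = pd.DataFrame({
    "County": ["Bibb County, AL", "Bibb County, AL"],
    "Race": ["Black or African American", "White"],
    "Deaths": [19.0, 37.0],
    "Crude Rate": ["1767.4 (Unreliable)", "950.2"],
})

# Flag rows the CDC marks as unreliable, then strip the marker to get a numeric rate.
toy["Unreliable"] = toy["Crude Rate"].str.contains("Unreliable", regex=False)
toy["Rate"] = toy["Crude Rate"].str.replace(" (Unreliable)", "", regex=False).astype(float)

# Keeping only Deaths > 19 drops every county-race row with 19 or fewer deaths.
kept = toy[toy["Deaths"] > 19]
print(kept[["County", "Race", "Deaths", "Rate"]])
```

In this toy data the flagged row is also the one removed by the threshold, so the remaining rate column converts cleanly to floats.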

In [3]:
df = df[df.Deaths > 19]

#Here we are getting the unique number of Codes that the CDC uses for race,
#which will be helpful later when we reformat and merge the dataset
df['Race Code'].unique()

array(['2054-5', '2106-3', '1002-5', 'A-PI'], dtype=object)

In [4]:
#Here we split the original dataset in to 4 subsets, based on the Race Codes

df_asian_pacific = df.loc[df['Race Code'] == 'A-PI']
df_black = df.loc[df['Race Code'] == '2054-5']
df_white = df.loc[df['Race Code'] == '2106-3']
df_native = df.loc[df['Race Code'] == '1002-5']


In [5]:
# Rename the general column names of the subsets, to specify race, which will help create unique columns
# for merging later.

def rename(inputDf):
    if (inputDf.iloc[0,4] == '2054-5'):
        #print('yeeet')
        inputDf.rename(columns={ 'Deaths' : '# of Black infant deaths', 'Population': 'Black Infant Population of County',
                                'Crude Rate' : 'black infant death rate per 1000'}, inplace=True)
        
    elif (inputDf.iloc[0,4] == '2106-3'):
        inputDf.rename(columns={ 'Deaths' : '# of White infant deaths',  'Population': 'white infant Population of County', 
                                'Crude Rate' : 'white infant death rate per 1000'}, inplace=True)
        #inputDf = inputDf.drop('County', axis = 1)
    elif (inputDf.iloc[0,4] == 'A-PI'):
        inputDf.rename(columns={ 'Deaths' : '# of A-PI infant deaths',  'Population': 'A-PI Population of County', 
                                'Crude Rate' : 'A-PI death rate'}, inplace=True)
        #inputDf = inputDf.drop('County', axis = 1)
    elif (inputDf.iloc[0,4] == '1002-5'):
        inputDf.rename(columns={ 'Deaths' : '# of Native infant deaths',  'Population': 'NA Population of County', 
                                'Crude Rate' : 'NA rate'}, inplace=True)
        #inputDf = inputDf.drop('County', axis = 1)
                                
    inputDf = inputDf.drop('Race Code', axis = 1)
    inputDf = inputDf.drop('Race', axis = 1)
    inputDf = inputDf.drop('Notes', axis = 1)
    return inputDf
    
df_white = rename(df_white)
df_black = rename(df_black)
df_asian_pacific = rename(df_asian_pacific)
df_native = rename(df_native)


In [28]:
#Merging Black and White using the County Code
#Now there is a single row for each county
merged_data = pd.merge(df_black,df_white,on='County Code', how = 'outer').copy(deep = True)
merged_data.rename(columns={ 'County Code': 'FIPS'}, inplace=True)


#merging counties where there are only white populations
merged_data['County'] = merged_data['County_x'].combine_first(merged_data['County_y'])
merged_data = merged_data.drop(['County_x', 'County_y'], axis = 1)

#adding column For State
def state_name(string):
    
    string = string.upper()
    string = string.strip()
    output = string[-2:]
    
    return output
merged_data['State'] = merged_data['County'].apply(state_name)

#reordering Columns
merged_data = merged_data.reindex(['FIPS', 'State', 'County','# of Black infant deaths', 'Black Infant Population of County',
                                              'black infant death rate per 1000', '# of White infant deaths',
                                              'white infant Population of County', 'white infant death rate per 1000'], axis = 1)


In [29]:
#Adding in total infant mortality rates and infant population per county

total_numbers_per_county = 'CompressedMortalityTotals1999-2016.txt'

df2 = pd.read_csv(total_numbers_per_county, sep = "\t", dtype={'County Code': 'str'})  #read into df2 only, leaving df untouched
df2 = df2.drop(['Notes', 'County Code'], axis = 1)
df2 = df2[df2.Deaths > 19]
df2 = df2.dropna()

merged_final = pd.merge(merged_data,df2,on='County', how = 'outer').copy(deep = True)
merged_final.rename(columns={ 'Deaths' : 'Total infant deaths',  'Population': 'Total Infant Population of County', 
                                'Crude Rate' : 'Crude rate per 1000'}, inplace=True)
merged_final.dropna(subset=['FIPS'], inplace=True)


In [30]:
#dividing CDC death rate by 100, so now rate = deaths per 1000.
merged_final['black infant death rate per 1000'] = merged_final['black infant death rate per 1000'].apply(float).div(100)
merged_final['white infant death rate per 1000'] = merged_final['white infant death rate per 1000'].apply(float).div(100)
merged_final['Crude infant mortality rate per 1000'] = merged_final['Crude rate per 1000'].apply(float).div(100)
merged_final = merged_final.drop(['Crude rate per 1000'], axis = 1)

merged_final.head()

Unnamed: 0,FIPS,State,County,# of Black infant deaths,Black Infant Population of County,black infant death rate per 1000,# of White infant deaths,white infant Population of County,white infant death rate per 1000,Total infant deaths,Total Infant Population of County,Crude infant mortality rate per 1000
0,1001,AL,"Autauga County, AL",32.0,2181.0,14.672,56.0,8989.0,6.23,88.0,11289.0,7.795
1,1003,AL,"Baldwin County, AL",69.0,4811.0,14.342,173.0,31807.0,5.439,244.0,37225.0,6.555
2,1005,AL,"Barbour County, AL",37.0,3123.0,11.848,20.0,2776.0,7.205,57.0,5940.0,9.596
3,1011,AL,"Bullock County, AL",24.0,1916.0,12.526,,,,28.0,2602.0,10.761
4,1013,AL,"Butler County, AL",30.0,2446.0,12.265,,,,41.0,4741.0,8.648
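
As a worked check of the unit conversion, the first row of the table above can be reproduced by hand. CDC Wonder reports the crude rate per 100,000, so dividing by 100 gives deaths per 1000. The numbers are the Autauga County values from the raw preview; the variable names are just for this sketch.

```python
# Autauga County, AL (Black or African American row from the raw preview).
deaths = 32.0
population = 2181.0

rate_per_100k = deaths / population * 100_000   # CDC Wonder crude rate
rate_per_1k = rate_per_100k / 100               # rescaled to deaths per 1000

print(round(rate_per_100k, 1))  # 1467.2
print(round(rate_per_1k, 3))    # 14.672
```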


## Data is cleaned!


## Exploratory Data Visualization

We use three types of plots: line plots, histograms, and scatter plots. The line plots and histograms show that the approximate average black infant mortality rate is higher than the approximate average white infant mortality rate, which suggests that counties with a larger African American population may have a higher overall infant mortality rate. The scatterplot shows a clear positive correlation between the percentage of African American infants and total infant mortality.

For the line plots and scatter plots we work with the infant mortality rate rather than the total number of deaths, to control for county size. Note also that many of the counties (about 2/3) report no African American infant population.  

#### Line Plots

In [9]:
only_rates = merged_final[["black infant death rate per 1000", "white infant death rate per 1000", 'County']]

fig = only_rates.set_index('County').plot.line( rot=0, title='Rates by County', figsize=(15,8), fontsize=14)
plt.savefig('LinePlot.png')
plt.close()


![title](LinePlot.png)

The line plot above demonstrates that the black infant death rate is generally about double the white infant death rate. White infant death rate per 1000 is typically above 5, while black infant death rate per 1000 is typically above 10. 

#### Histograms

In [10]:
#histogram of black infant mortality rate

hist_1 = merged_final["black infant death rate per 1000"].plot.hist()
hist_1 = plt.gcf()
plt.savefig('HistAA.png')
plt.close()

![title](HistAA.png)

The histogram above shows the distribution of black infant death rates across counties, and displays that the majority of rates are between 10 and 15 deaths per 1000. The highest rate observed is over double the majority rate at 35 deaths per 1000, while the lowest rate is less than 5 deaths. 


In [11]:
#histogram of white infant mortality rate
hist_2 = merged_final["white infant death rate per 1000"].plot.hist()
hist_2 = plt.gcf()

plt.savefig('HistWhite.png')
plt.close()

![title](HistWhite.png)

The second histogram graphs the distribution of white infant death rates across counties, and shows the average infant death rate is 6 deaths per 1000. The minimum rate observed is just above 2 deaths per 1000 and the maximum is just under 20 deaths per 1000. By comparison, the average, minimum, and maximum death rates for black infants are all about 2 times higher, or more, than those for white infants. 


In [12]:
#histogram of total infant mortality rate
hist_2 = merged_final['Crude infant mortality rate per 1000'].plot.hist()
hist_2 = plt.gcf()

plt.savefig('HistTotal.png')
plt.close()

![title](HistTotal.png)

#### Scatterplot

In [60]:
#dataframe to calculate percentage of African American infants by county
AA_pop_county = pd.DataFrame()
AA_pop_county["County"] = merged_final["County"]
AA_pop_county["Crude"] = merged_final['Crude infant mortality rate per 1000']
AA_pop_county["Percentage of black infants"] = merged_final["Black Infant Population of County"] / merged_final["Total Infant Population of County"]
AA_pop_county = AA_pop_county.dropna()


#create a scatter plot of total infant death rate by African American population


x = AA_pop_county["Percentage of black infants"]
y = AA_pop_county["Crude"]

plt.scatter(x,y)
plt.title('Infant mortality by African American population')
plt.xlabel('Percentage of black infants')
plt.ylabel('Total infant mortality rate')


plt.savefig('ScatterAA.png')  #file name matches the image displayed below
plt.close()

#AA_pop_county


![title](ScatterAA.png)

The scatterplot above graphs the percent of black infants within a county population against crude infant mortality rate. The distribution demonstrates a strong positive correlation, meaning that as the share of black infants increases, county infant mortality tends to increase.

In [14]:
#create a scatter plot of infant death rate by White population

White_pop_county = pd.DataFrame()
White_pop_county["County"] = merged_final["County"]
White_pop_county["Crude"] = merged_final['Crude infant mortality rate per 1000']
White_pop_county["Percentage of white infants"] = merged_final["white infant Population of County"] / merged_final["Total Infant Population of County"]
White_pop_county = White_pop_county.dropna()  #dropna returns a new frame, so assign it back


x = White_pop_county["Percentage of white infants"]
y = White_pop_county["Crude"]

plt.scatter(x,y)
plt.title('Infant mortality by White population')
plt.xlabel('Percentage of white infants')
plt.ylabel('Total infant mortality rate')

plt.savefig('ScatterWhite.png')
plt.close()

#White_pop_county

# we see that white population caps out at 1 which gives the sort of "wall" effect seen

![title](ScatterWhite.png)

The second scatterplot graphs percent of white infants within a county population against crude infant mortality rate. The distribution is more scattered and random, but also seems to demonstrate a slight negative relationship between white infant population and infant death rate. In other words, as white infant population increases within a county, there is a slight decrease in infant mortality. 

## Data Analysis and Results
#### The first part of data analysis is to check if the data is normal
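The function stats.normaltest() returns two values, a test statistic and a p-value. The p-value is compared against a significance level alpha: because the null hypothesis is that the data are normally distributed, a p-value below alpha is evidence against normality. 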
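
To make the direction of the test concrete, here is a small sketch on two idealized samples. `stats.normaltest` tests the null hypothesis that a sample comes from a normal distribution, so a small p-value argues against normality. The samples are built from evenly spaced quantiles rather than random draws so the outcome does not depend on a seed; the names `bell_shaped` and `right_skewed` are made up for the example.

```python
import numpy as np
from scipy import stats

n = 1000
# Evenly spaced probabilities, used to build idealized samples without random draws.
probs = (np.arange(1, n + 1) - 0.5) / n

samples = {
    "bell_shaped": stats.norm.ppf(probs),     # quantiles of a normal distribution
    "right_skewed": stats.expon.ppf(probs),   # quantiles of an exponential distribution
}

alpha = 0.01
p_values = {}
for name, sample in samples.items():
    statistic, p_value = stats.normaltest(sample)
    p_values[name] = float(p_value)
    # p < alpha means the null hypothesis of normality is rejected.
    verdict = "reject normality" if p_value < alpha else "no evidence against normality"
    print(name, verdict)
```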

In [17]:
#death rate information isolated
death_rate_AA = merged_final["black infant death rate per 1000"]
death_rate_w = merged_final["white infant death rate per 1000"]
death_rate = merged_final['Crude infant mortality rate per 1000']

#population information isolated
pop_AA = AA_pop_county["Percentage of black infants"]
pop_w = White_pop_county["Percentage of white infants"]
pop = merged_final["Total Infant Population of County"]


#### Normal test

In [18]:
#perform the normal test 


#variables are named like so: s for test statistic or p for p-value, then the first letters of the data
#so p_dra is the p_value for death_rate_AA and s_pw is the test statistic for pop_w

s_dra, p_dra = stats.normaltest(death_rate_AA, nan_policy='omit')
s_drw, p_drw = stats.normaltest(death_rate_w, nan_policy='omit')
s_dr, p_dr = stats.normaltest(death_rate, nan_policy='omit')

s_pa, p_pa = stats.normaltest(pop_AA, nan_policy='omit')
s_pw, p_pw = stats.normaltest(pop_w, nan_policy='omit')
s_p, p_p = stats.normaltest(pop, nan_policy='omit')

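#we will use an alpha value of 0.01 to test for normality. The null hypothesis of normaltest is that the data
#come from a normal distribution, so a p_value below alpha means normality is rejected

alpha = 0.01
dra = bool(p_dra < alpha)
drw = bool(p_drw < alpha)
dr = bool(p_dr < alpha)

pa = bool(p_pa < alpha)
pw = bool(p_pw < alpha)
p = bool(p_p < alpha)
print(dra, drw, dr, pa, pw, p)

# True means normality is rejected, so none of these variables is normally distributed
# (with several hundred counties per group, the t-test below should still be reasonably robust to this)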

True True True True True True


Below we are going to conduct several tests to try to answer our research question. First we will use a t-test to see if there is a significant difference between average White and African American infant mortality rate. Second we will use a Pearson's correlation test to see if there is a significant correlation between total infant deaths and percentage of the infant population that is African American.
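
A minimal sketch of the first test on made-up numbers, assuming the same options used in this notebook: `equal_var=False` requests Welch's t-test, which does not assume the two groups have equal variances, and `nan_policy='omit'` skips counties with no reported rate. The arrays `group_a` and `group_b` are illustrative only and are not drawn from the CDC data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up county rates per 1000; NaN marks a county with no reported rate for that group.
group_a = np.array([12.5, 14.7, 11.8, 13.0, np.nan, 12.3])
group_b = np.array([6.2, 5.4, 7.2, 6.0, 6.5, np.nan])

# Welch's t-test on the two groups, ignoring the NaN entries.
t_val, p_val = ttest_ind(group_a, group_b, equal_var=False, nan_policy="omit")

alpha = 0.05
significant = bool(p_val < alpha)
print(float(t_val), float(p_val), significant)
```

A positive t-statistic here means the first group's mean is the larger one.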

#### Difference between White and African American infant mortality rate

In [19]:
#calculate averages
avg_AA_infant_mortality_rate = merged_final["black infant death rate per 1000"].mean()
avg_w_infant_mortality_rate =  merged_final["white infant death rate per 1000"].mean()

#display averages
print("African American infant mortality rate is \t {:2.2f} deaths per 1000".format(avg_AA_infant_mortality_rate))
print("White infant mortality rate is \t {:2.2f} deaths per 1000".format(avg_w_infant_mortality_rate))

African American infant mortality rate is 	 12.96 deaths per 1000
White infant mortality rate is 	 6.42 deaths per 1000


#### Statistical T-test

In [21]:
#perform the t-test
t_val, p_val = ttest_ind(death_rate_AA, death_rate_w, equal_var = False, nan_policy = 'omit')


#use alpha = 0.05 to test significance. If the p_value from the test is less than 0.05, the difference between the
#two averages is greater than we would expect to see simply from chance.

significant = False
alpha = 0.05
if p_val < alpha:
    significant = True
else:
    significant = False
print(t_val, p_val)
print(significant)
#below we can see that there is in fact a significant difference between African American and white infant mortality

46.94802780819173 3.4553022954778085e-246
True


#### Correlation between percentage of African American population and total infant mortality rate

In [27]:
#we will use the pearsonr function in scipy to find the correlation (r value) and its p value
#take both variables from AA_pop_county, which already had incomplete rows dropped, so x and y stay paired
x = AA_pop_county["Percentage of black infants"]
y = AA_pop_county["Crude"]

r_cor, p_val = stats.pearsonr(x, y)

significant = False
alpha = 0.05

if p_val < alpha:
    significant = True
    print('The correlation is \t {:2.2f} and this is a significant correlation'.format(r_cor))
else:
    significant = False
    print('The correlation is \t {:2.2f} and this is NOT a significant correlation'.format(r_cor))
    
print(p_val)

The correlation is 	 0.79 and this is a significant correlation
8.760860455451559e-166
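
Pearson's r is only meaningful when the two arrays are paired, so that the i-th entry of each comes from the same county. A small sketch of one way to keep the pairing intact when some values are missing, using a made-up frame (the column names `share` and `rate` are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up counties; NaN in "share" stands for a county with no reported population.
toy = pd.DataFrame({
    "share": [0.10, 0.20, np.nan, 0.40, 0.50],
    "rate": [6.0, 7.5, 9.9, 8.5, 10.0],
})

# Drop incomplete rows from both columns together so x[i] and y[i] stay paired.
paired = toy.dropna(subset=["share", "rate"])
r, p = stats.pearsonr(paired["share"], paired["rate"])
print(len(paired), round(r, 3))
```

Dropping missing values from each column separately could leave arrays of different lengths or silently shift the pairing.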


#### Finding county with highest black infant mortality rate

In [65]:
b_max = merged_final['black infant death rate per 1000'].max()
b_max_county = merged_final[merged_final['black infant death rate per 1000'] == b_max]['County']
b_max_county

185    Jeff Davis County, GA
Name: County, dtype: object

#### Finding county with lowest black infant mortality rate

In [67]:
b_min = merged_final['black infant death rate per 1000'].min()
b_min_county = merged_final[merged_final['black infant death rate per 1000'] == b_min]['County']
b_min_county

337    Essex County, MA
Name: County, dtype: object

## Results

The histograms, scatterplots, and line plot all visualize the disparity between black and white infant populations: counties with a larger share of black infants tend to have higher infant death rates. Comparing the two histograms, the typical black infant death rate is about double the typical white infant death rate. 

These numbers are consistent with the line plot representation. 

Next, we compared the mean White and African American infant mortality rates across the counties that report them, omitting counties with null values. 

Here too, the mean infant mortality rate was roughly double for black infants (12.96 deaths per 1000) compared with white infants (6.42 deaths per 1000).

The correlation test, using Pearson's r, indicates that the relationship between the share of black infants and infant mortality is significant; the correlation coefficient was 0.79 with a p-value of 8.76e-166, indicating a strong positive linear relationship. 

Additionally, the t-test comparing white and black infant death rates gave a t-statistic of 46.948 with a p-value of 3.455e-246; this means that the difference between black and white infant death rates is very unlikely to be due to chance and most likely reflects a real difference between the two groups. 


## Limitations 
Our exclusive comparison between white and black infant populations, rather than all races and ethnicities listed in census data, skews the proportion of the population and does not accurately represent other communities of color. The data reported is also not fully representative of the black infant population, as only 759 counties report black infant populations, compared with 2056 counties that report white infant populations. Our data and analysis can by no means be seen as the full picture of infant mortality in the nation, but they do clearly point to a systematic inequality in the treatment and care of patients based on Black or White race status. 

In the future we would like to include other health indicators, to see the larger effect race and ethnicity may play on the holistic picture of health.


## Conclusion and Discussion

This project explores the relationship between race and healthcare provision as represented by infant mortality. Our findings, which demonstrate that infant mortality is markedly higher for Black infants than for White infants, build on other findings that implicate a systematic health disadvantage for people of color. Future research may include tracking the health and morbidity of all ages by race and ethnicity and including other social determinants of health such as healthcare access, education level, income status, etc. Going forward, counties such as Jeff Davis County, GA, which has the highest rate of black infant mortality, should be targeted for reform and should re-evaluate their practices to ensure equitable care is provided across all identities. Additionally, counties such as Essex County, MA, which has the lowest rate of black infant mortality, should be studied to see whether their systems actively prevent discriminatory or biased practices in ways that can be replicated in other regions. While this research is small in scale, its implications can have a large impact on the way the public evaluates how our Black mothers and Black children are being treated in this country. 

## Geographic Data Visualizations... Just for Fun!
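
The maps in this section pass `len(colorscale) - 1` binning endpoints to `create_choropleth`. The sketch below illustrates why, assuming the usual convention that n endpoints define n + 1 intervals, one per color. `np.digitize` is used here only to illustrate the interval lookup and is not necessarily how plotly assigns colors internally; the sample rates are arbitrary.

```python
import numpy as np

n_colors = 17                               # length of the colorscale used in the maps
endpts = np.linspace(1, 12, n_colors - 1)   # 16 endpoints between 1 and 12

# 16 endpoints split the number line into 17 intervals, one per color.
rates = np.array([0.5, 7.795, 12.5])
bin_index = np.digitize(rates, endpts)
print(len(endpts), bin_index.tolist())
```

Rates below 1 fall in the first interval and rates of 12 or more fall in the last, so counties outside the 1 to 12 range still receive a color.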

In [22]:
#infant mortality rates by county in Alabama

df_fips = merged_final[merged_final['State']== 'AL']

infant_death = df_fips['Crude infant mortality rate per 1000'].tolist()
fips = df_fips['FIPS'].tolist()

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]

endpts = list(np.linspace(1, 12, len(colorscale) - 1))

fig = ff.create_choropleth(
    fips=fips, values=infant_death, scope=['Alabama'], show_state_data=True,
    colorscale=colorscale, binning_endpoints=endpts, round_legend_values=True,
    plot_bgcolor='rgb(229,229,229)',
    paper_bgcolor='rgb(229,229,229)',
    legend_title='Infant Mortality Rate by County',
    title = 'Alabama'
)
py.iplot(fig, filename='choropleth_AL', world_readable = True)

py.image.save_as(fig, filename='alabama.png')



![title](alabama.png)

In [None]:
#infant mortality rates across the Northeast

NE_states = ['CT' , 'ME' ,'MA','NH','RI','VT']
df_NE = merged_final[merged_final['State'].isin(NE_states)]

infant_death_NE = df_NE['Crude infant mortality rate per 1000'].tolist()
fips = df_NE['FIPS'].tolist()

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]

endpts = list(np.linspace(1, 12, len(colorscale) - 1))

fig = ff.create_choropleth(
    fips=fips, values=infant_death_NE, colorscale = colorscale,
    binning_endpoints=endpts, show_hover = True,
    scope=NE_states, #county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    legend_title='Infant mortality rate per county',
    title = 'North East'
   
)
fig['layout']['legend'].update({'x': 0})
fig['layout']['annotations'][0].update({'x': -0.12, 'xanchor': 'left'})
py.iplot(fig, filename='choropleth_new_england', world_readable = True)

py.image.save_as(fig, filename='NE.png')

![title](NE.png)

In [None]:
#infant mortality rates in the Southeast

SE_states = ['AR' , 'LA' ,'MS','TN','KY','AL', 'WV','DC', 'VA', 'NC', 'SC', 'GA','FL']
df_SE = merged_final[merged_final['State'].isin(SE_states)]
#df_sample_r

infant_death_SE = df_SE['Crude infant mortality rate per 1000'].tolist()
fips = df_SE['FIPS'].tolist()

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]

endpts = list(np.linspace(1, 12, len(colorscale) - 1))

fig = ff.create_choropleth(
    fips=fips, values=infant_death_SE, colorscale = colorscale,
    binning_endpoints=endpts, show_hover = True,
    scope=SE_states, #county_outline={'color': 'rgb(255,255,255)', 'width': 0.5},
    legend_title='Infant mortality rate per county', title = 'South East'
   
)
fig['layout']['legend'].update({'x': 0})
fig['layout']['annotations'][0].update({'x': -0.12, 'xanchor': 'left'})
py.iplot(fig, filename='choropleth_southeast', world_readable = True)

py.image.save_as(fig, filename='SE.png')

![title](SE.png)

In [23]:
#infant mortality rates across the U.S

infant_death_USA = merged_final['Crude infant mortality rate per 1000'].tolist()
fips = merged_final['FIPS'].tolist()

colorscale = ["#f7fbff","#ebf3fb","#deebf7","#d2e3f3","#c6dbef","#b3d2e9","#9ecae1",
              "#85bcdb","#6baed6","#57a0ce","#4292c6","#3082be","#2171b5","#1361a9",
              "#08519c","#0b4083","#08306b"]

endpts = list(np.linspace(1, 12, len(colorscale) - 1))

fig = ff.create_choropleth(
    fips=fips, values=infant_death_USA, colorscale = colorscale,
    binning_endpoints=endpts, show_state_data=False,
    show_hover=True, centroid_marker={'opacity': 0},
    legend_title='Infant mortality rate per county')

fig['layout']['legend'].update({'x': 0})
fig['layout']['annotations'][0].update({'x': -0.12, 'xanchor': 'left'})
py.iplot(fig, filename='choropleth_USA', world_readable = True)

py.image.save_as(fig, filename='USA.png')



![title](USA.png)