# 1. Data in Use

### 1) U.S. Census Bureau, 2013-2017 American Community Survey 5-Year Estimates.

- Median household income by state in the U.S. 2013-2017.

https://www.census.gov/search-results.html?q=median+household+income&page=1&stateGeo=none&searchtype=web&cssp=SERP&_charset_=UTF-8

### 2) American Dental Association, Supply of Dentists in the U.S. 2001-2018 (Published in February 2019).

- Dentists per 100,000 population in each state - dentists working in dentistry 2001-2018.
- Supply of Dentists in the U.S. by practice area 2001-2018.

https://www.ada.org/en/science-research/health-policy-institute/data-center/supply-and-profile-of-dentists

# 2. Data Dictionary

1) Median household income by state in the U.S. dataframe

- Median Household Income: Income in the Past 12 Months - Income of Households: This includes the income of the householder and all other individuals 15 years old and over in the household, whether they are related to the householder or not. Because many households consist of only one person, average household income is usually less than average family income. Although the household income statistics cover the past 12 months, the characteristics of individuals and the composition of households refer to the time of interview. Thus, the income of the household does not include amounts received by individuals who were members of the household during all or part of the past 12 months if these individuals no longer resided in the household at the time of interview. Similarly, income amounts reported by individuals who did not reside in the household during the past 12 months but who were members of the household at the time of interview are included. However, the composition of most households was the same during the past 12 months as at the time of interview.

- The median divides the income distribution into two equal parts: one-half of the cases falling below the median income and one-half above the median. For households and families, the median income is based on the distribution of the total number of households and families including those with no income. The median income for individuals is based on individuals 15 years old and over with income. Median income for households, families, and individuals is computed on the basis of a standard distribution.

- Margin of error: The Fact is based on data collected in the American Community Survey (ACS) and the Puerto Rico Community Survey (PRCS) conducted annually by the U.S. Census Bureau. A sample of over 3.5 million housing unit addresses is interviewed each year over a 12 month period. This Fact (estimate) is based on five years of ACS and PRCS sample data and describes the average value of person, household and housing unit characteristics over this period of collection. <br>
Statistics from all surveys are subject to sampling and nonsampling error. Sampling error is the uncertainty between an estimate based on a sample and the corresponding value that would be obtained if the estimate were based on the entire population (as from a census). 
Measures of sampling error are provided in the form of margins of error for all estimates included with ACS and PRCS published products.<br>
 The margin of error measures the degree of uncertainty caused by sampling error. The margin of error is used with an ACS estimate to construct a confidence interval about the estimate. The interval is formed by adding the margin of error to the estimate (the upper bound) and subtracting the margin of error from the estimate (the lower bound). It is expected with 90 percent confidence that the interval will contain the full population value of the estimate. <br>
The following example is for demonstration purposes only. Suppose the ACS reported that the percentage of people in a state who were 25 years and older with a bachelor's degree was 21.3 percent and that the margin of error associated with this estimate was 0.7 percent. By adding and subtracting the margin of error from the estimate, we calculate the 90-percent confidence interval for this estimate:<br>
21.3% - 0.7% = 20.6% => Lower-bound estimate <br>
21.3% + 0.7% = 22.0% => Upper-bound estimate <br>
Therefore, we can be 90 percent confident that the percent of the population 25 years and older having a bachelor's degree in a state falls somewhere between 20.6 percent and 22.0 percent.

- Reference: https://www.census.gov/quickfacts/fact/note/US/INC110217
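
The interval arithmetic described above can be sketched in a few lines of Python. The helper name `confidence_interval` is an illustrative choice, not part of any Census tooling; the second call uses Alabama's published estimate from the income dataset ($46,472 +/- $301).

```python
def confidence_interval(estimate, margin_of_error):
    """Return the (lower, upper) bounds of estimate +/- margin_of_error."""
    return estimate - margin_of_error, estimate + margin_of_error


# Worked example from the text: 21.3 percent with a 0.7 point margin of error.
lower, upper = confidence_interval(21.3, 0.7)
print(f"{lower:.1f}% to {upper:.1f}%")  # 20.6% to 22.0%

# Alabama's median household income estimate and its margin of error.
al_lower, al_upper = confidence_interval(46472, 301)
print(al_lower, al_upper)  # 46171 46773
```

Because ACS margins of error are published at the 90 percent level, the resulting bounds are 90 percent confidence bounds without any further scaling.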

2) Dentists per 100,000 population in each state dataset

- Dentists working in dentistry: Those whose primary occupation is one of the following: private practice (full- or part-time), dental school/faculty staff member, armed forces, other federal services (e.g., Veterans' Affairs, Public Health Service), state or local government employee, hospital staff dentist, graduate student/intern/resident, or other health/dental organization staff member.
- Reference: https://www.ada.org/en/science-research/health-policy-institute/data-center/supply-and-profile-of-dentists

3) Supply of Dentists in the U.S. by practice area dataset

- Practice area: This dataset counts a single dentist toward each practice area for which they hold a degree. For example, a dentist with degrees in both orthodontics and pediatric dentistry is counted in both categories. The sum across categories will therefore exceed the number of dentists working in dentistry.

- Professionally active dentists in the second and the third datasets are those who are listed in the ADA (American Dental Association) masterfile as licensed, not retired, living in the 50 states or District of Columbia, and having a primary occupation of private practice (full- or part-time), dental school/faculty staff member, armed forces, other federal services (e.g., Veterans' Affairs, Public Health Service), state or local government employee, hospital staff dentist, graduate student/intern/resident, or other health/dental organization staff member. These datasets exclude dentists who are located in U.S. territories or serving with U.S. armed forces overseas.

- Reference: https://www.ada.org/en/science-research/health-policy-institute/data-center/supply-and-profile-of-dentists
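
To see why the practice-area totals exceed the head count, consider a toy illustration. The three dentist records below are hypothetical and exist only to show the double counting; they are not drawn from the ADA data.

```python
from collections import Counter

# Hypothetical records: each dentist mapped to the practice areas they hold degrees in.
dentists = {
    "dentist_a": ["General Practice"],
    "dentist_b": ["Orthodontics", "Pediatric Dentistry"],
    "dentist_c": ["Endodontics"],
}

# Count one entry per (dentist, practice area) pair, as the dataset does.
area_counts = Counter(area for areas in dentists.values() for area in areas)
total_by_area = sum(area_counts.values())

print(len(dentists))   # 3 dentists
print(total_by_area)   # 4 practice-area entries, because dentist_b is counted twice
```

For the same reason, summing the practice-area rows of the third dataset should not be read as a count of individual dentists.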

# 3. Coding Beachhead

- In this step, I will read each of the three datasets into its own dataframe using the pandas library.

In [1]:
#import library
import pandas as pd

### 1) Median household income by state in the U.S. 2013-2017.

In [2]:
#reading the data into dataframe
df1 = pd.read_csv("https://raw.githubusercontent.com/mhan1/Capstone-Project/master/median_household_income_by_state.csv")
df1.head(3)

Unnamed: 0,State,Income,Margin Of Error
0,Alabama,"$46,472",+/- $301
1,Alaska,"$76,114",+/- $979
2,Arizona,"$53,510",+/- $259


In [3]:
#checking the number of rows and columns
df1.shape

(53, 3)

In [4]:
df1

Unnamed: 0,State,Income,Margin Of Error
0,Alabama,"$46,472",+/- $301
1,Alaska,"$76,114",+/- $979
2,Arizona,"$53,510",+/- $259
3,Arkansas,"$43,813",+/- $401
4,California,"$67,169",+/- $192
5,Colorado,"$65,458",+/- $317
6,Connecticut,"$73,781",+/- $450
7,Delaware,"$63,036",+/- $738
8,District of Columbia,"$77,649","+/- $1,075"
9,Florida,"$50,883",+/- $140


In [5]:
#checking the last row information
df1[52:]

Unnamed: 0,State,Income,Margin Of Error
52,"Source(s): U.S. Census Bureau, 2013-2017 Ameri...",,


In [6]:
# drop the last row, which holds the source note rather than data
df1 = df1.drop(df1.index[52])
df1

Unnamed: 0,State,Income,Margin Of Error
0,Alabama,"$46,472",+/- $301
1,Alaska,"$76,114",+/- $979
2,Arizona,"$53,510",+/- $259
3,Arkansas,"$43,813",+/- $401
4,California,"$67,169",+/- $192
5,Colorado,"$65,458",+/- $317
6,Connecticut,"$73,781",+/- $450
7,Delaware,"$63,036",+/- $738
8,District of Columbia,"$77,649","+/- $1,075"
9,Florida,"$50,883",+/- $140


In [7]:
#checking the number of rows and columns
df1.shape

(52, 3)
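
Dropping the footer by position works here, but it depends on the footer sitting at index 52. A position-independent alternative is to drop any row with no `Income` value. The sketch below uses a small inline CSV with a made-up footer line rather than the real file, so the footer text is a placeholder.

```python
import io

import pandas as pd

# Inline stand-in for the income CSV; the last line mimics a source-note footer.
csv_text = '''State,Income,Margin Of Error
Alabama,"$46,472",+/- $301
Alaska,"$76,114",+/- $979
"Source(s): placeholder footer text",,
'''

raw = pd.read_csv(io.StringIO(csv_text))

# Keep only rows that actually carry an income figure.
clean = raw.dropna(subset=["Income"]).reset_index(drop=True)
print(raw.shape, clean.shape)  # (3, 3) (2, 3)
```

This keeps working if the number of data rows changes, at the cost of also dropping any genuine row with a missing income.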

In [8]:
#checking the column names
df1.columns

Index(['State', 'Income', 'Margin Of Error'], dtype='object')

In [9]:
# checking whether any column contains null values
df1.isnull().any(axis=0)

State              False
Income             False
Margin Of Error    False
dtype: bool

In [10]:
#descriptive statistics of the dataframe
df1.describe()

Unnamed: 0,State,Income,Margin Of Error
count,52,52,52
unique,52,52,49
top,New York,"$46,472",+/- $380
freq,1,1,2


In [11]:
#brief summary of the dataframe
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52 entries, 0 to 51
Data columns (total 3 columns):
State              52 non-null object
Income             52 non-null object
Margin Of Error    52 non-null object
dtypes: object(3)
memory usage: 1.6+ KB
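
The `info()` output shows `Income` and `Margin Of Error` stored as `object` (strings such as `"$46,472"` and `"+/- $301"`), so they cannot be ranked or averaged yet. One possible way to convert them is sketched below on three rows copied from the table above. The helper name `to_number` is illustrative.

```python
import pandas as pd

# Three rows copied from the income table shown above.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "District of Columbia"],
    "Income": ["$46,472", "$76,114", "$77,649"],
    "Margin Of Error": ["+/- $301", "+/- $979", "+/- $1,075"],
})


def to_number(series):
    """Strip every character except digits and the decimal point, then parse."""
    return pd.to_numeric(series.str.replace(r"[^\d.]", "", regex=True))


clean = sample.copy()
clean["Income"] = to_number(sample["Income"])
clean["Margin Of Error"] = to_number(sample["Margin Of Error"])

print(clean.dtypes)
```

After the conversion, `describe()` returns numeric summaries (mean, quartiles) for these columns instead of the count/unique/top/freq summary shown for strings.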


### 2) Dentists per 100,000 population in each state - dentists working in dentistry 2001-2018.

In [12]:
#reading the data into dataframe
df2 = pd.read_csv("https://raw.githubusercontent.com/mhan1/Capstone-Project/master/supply_of_dentists_in_us.csv")
df2.head(3)

Unnamed: 0.1,Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Alabama,42.28,41.65,41.7,41.54,40.86,41.24,41.24,41.27,41.36,42.41,42.64,43.73,44.14,43.86,43.74,43.23,40.43,41.78
1,Alaska,72.59,70.06,71.1,71.74,74.37,72.86,74.82,76.37,77.26,77.44,77.42,79.27,78.83,79.86,82.03,76.2,79.48,81.5
2,Arizona,44.54,44.83,47.13,48.3,50.09,50.64,50.98,52.67,53.54,53.83,53.62,53.85,54.45,53.58,53.69,53.66,53.85,54.42


In [13]:
#brief summary of the dataframe
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 19 columns):
Unnamed: 0    51 non-null object
2001          51 non-null float64
2002          51 non-null float64
2003          51 non-null float64
2004          51 non-null float64
2005          51 non-null float64
2006          51 non-null float64
2007          51 non-null float64
2008          51 non-null float64
2009          51 non-null float64
2010          51 non-null float64
2011          51 non-null float64
2012          51 non-null float64
2013          51 non-null float64
2014          51 non-null float64
2015          51 non-null float64
2016          51 non-null float64
2017          51 non-null float64
2018          51 non-null float64
dtypes: float64(18), object(1)
memory usage: 7.6+ KB


In [14]:
# checking whether any column contains null values
df2.isnull().any(axis=0)

Unnamed: 0    False
2001          False
2002          False
2003          False
2004          False
2005          False
2006          False
2007          False
2008          False
2009          False
2010          False
2011          False
2012          False
2013          False
2014          False
2015          False
2016          False
2017          False
2018          False
dtype: bool

In [15]:
#descriptive statistics of the dataframe
df2.describe()

Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,56.211961,55.705098,55.906667,55.98,56.167255,56.32902,56.632745,57.034706,57.373922,57.652353,58.278235,58.628235,58.884902,58.648431,59.290196,58.87549,59.192157,58.962745
std,13.884484,13.511067,12.999855,12.95083,12.831249,12.95931,13.094925,13.031859,12.927426,12.495697,12.533893,12.421279,12.285866,12.214083,12.331285,12.211331,12.402215,12.403437
min,38.16,38.14,39.12,38.77,39.12,39.38,38.68,39.0,39.56,39.64,39.82,40.34,41.09,40.81,41.1,41.2,40.43,41.78
25%,46.435,46.33,45.855,46.55,45.905,46.63,47.67,47.365,47.39,48.075,49.035,49.54,50.125,50.27,50.96,50.46,50.66,50.695
50%,53.77,51.43,52.74,53.12,53.1,51.43,51.88,52.67,53.54,53.83,54.11,54.52,54.45,55.18,55.67,55.04,55.45,54.42
75%,63.84,62.74,62.61,62.125,62.735,63.275,63.63,64.4,64.54,65.01,65.56,65.76,65.92,65.905,66.61,65.99,65.985,66.275
max,114.36,110.44,106.07,104.98,101.74,106.19,105.85,106.16,104.18,100.71,103.29,103.19,102.55,100.68,101.3,101.96,103.64,102.78


In [16]:
#checking the column names
df2.columns

Index(['Unnamed: 0', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018'],
      dtype='object')

In [17]:
#renaming the unnamed first column to 'state'
df2.rename(columns={'Unnamed: 0': 'state'}, inplace=True)
df2.head(3)

Unnamed: 0,state,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,Alabama,42.28,41.65,41.7,41.54,40.86,41.24,41.24,41.27,41.36,42.41,42.64,43.73,44.14,43.86,43.74,43.23,40.43,41.78
1,Alaska,72.59,70.06,71.1,71.74,74.37,72.86,74.82,76.37,77.26,77.44,77.42,79.27,78.83,79.86,82.03,76.2,79.48,81.5
2,Arizona,44.54,44.83,47.13,48.3,50.09,50.64,50.98,52.67,53.54,53.83,53.62,53.85,54.45,53.58,53.69,53.66,53.85,54.42


In [18]:
#checking the number of rows and columns
df2.shape

(51, 19)
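
`df1` has 52 rows after cleaning while `df2` has 51, so the two tables do not cover exactly the same places. Before joining them on state name, it may help to list the names that appear in only one table. The sketch below uses made-up state sets standing in for `df1["State"]` and `df2["state"]`. The Puerto Rico entry is an assumption, suggested by the data dictionary: the ACS figures include the Puerto Rico Community Survey, while the ADA data exclude U.S. territories.

```python
# Hypothetical stand-ins for set(df1["State"]) and set(df2["state"]).
income_states = {"Alabama", "Alaska", "Puerto Rico"}
dentist_states = {"Alabama", "Alaska"}

# Names present in one table but missing from the other.
only_in_income = sorted(income_states - dentist_states)
only_in_dentists = sorted(dentist_states - income_states)

print(only_in_income)    # ['Puerto Rico']
print(only_in_dentists)  # []
```

Running the same comparison on the real columns would confirm which row actually accounts for the difference.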

### 3) Supply of Dentists in the U.S. by practice area 2001-2018.

In [19]:
#reading the data into dataframe
df3 = pd.read_csv("https://raw.githubusercontent.com/mhan1/Capstone-Project/master/supply_of_dentists_by_practice_area.csv")
df3.head(3)

Unnamed: 0.1,Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,General Practice,130775,133213,134629,135736,137150,138000,141217,142966,145323,145980,148189,150235,152021,152153,154755,155121,156992,157676
1,Oral and Maxillofacial Surgery,6358,6285,6359,6587,6508,6576,6576,6597,6694,6922,6981,7082,7261,7374,7559,7594,7546,7509
2,Endodontics,4045,4080,4157,4333,4517,4522,4561,4658,4754,4959,5025,5118,5306,5384,5552,5631,5664,5704


In [20]:
#brief summary of the dataframe
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 19 columns):
Unnamed: 0    10 non-null object
2001          10 non-null object
2002          10 non-null object
2003          10 non-null object
2004          10 non-null object
2005          10 non-null object
2006          10 non-null object
2007          10 non-null object
2008          10 non-null object
2009          10 non-null object
2010          10 non-null object
2011          10 non-null object
2012          10 non-null object
2013          10 non-null object
2014          10 non-null object
2015          10 non-null object
2016          10 non-null object
2017          10 non-null object
2018          10 non-null object
dtypes: object(19)
memory usage: 1.6+ KB


In [21]:
# checking whether any column contains null values
df3.isnull().any(axis=0)

Unnamed: 0    False
2001          False
2002          False
2003          False
2004          False
2005          False
2006          False
2007          False
2008          False
2009          False
2010          False
2011          False
2012          False
2013          False
2014          False
2015          False
2016          False
2017          False
2018          False
dtype: bool

In [22]:
#descriptive statistics of the dataframe
df3.describe()

Unnamed: 0.1,Unnamed: 0,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
count,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10
unique,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10
top,Endodontics,9265,4278,17,135736,69,4522,5121,9726,3262,9982,97,10355,6632,152153,5686,10680,827,8033
freq,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


In [23]:
#checking the column names
df3.columns

Index(['Unnamed: 0', '2001', '2002', '2003', '2004', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018'],
      dtype='object')

In [24]:
#renaming the unnamed first column to 'practice_area'
df3.rename(columns={'Unnamed: 0': 'practice_area'}, inplace=True)
df3.head(3)
df3.head(3)

Unnamed: 0,practice_area,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,General Practice,130775,133213,134629,135736,137150,138000,141217,142966,145323,145980,148189,150235,152021,152153,154755,155121,156992,157676
1,Oral and Maxillofacial Surgery,6358,6285,6359,6587,6508,6576,6576,6597,6694,6922,6981,7082,7261,7374,7559,7594,7546,7509
2,Endodontics,4045,4080,4157,4333,4517,4522,4561,4658,4754,4959,5025,5118,5306,5384,5552,5631,5664,5704


In [25]:
#checking the number of rows and columns
df3.shape

(10, 19)
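
The earlier `df3.info()` output reports every year column as `object`, so the counts are stored as text. The notebook does not show why. The sketch below assumes thousands separators in the raw file, which is a guess, and uses a small hand-built frame rather than the real data.

```python
import pandas as pd

# Hand-built stand-in for df3; the comma-formatted strings are an assumption.
sample = pd.DataFrame({
    "practice_area": ["General Practice", "Endodontics"],
    "2001": ["130,775", "4,045"],
    "2002": ["133,213", "4,080"],
})

# Year columns are the ones whose names are all digits.
year_cols = [col for col in sample.columns if col.isdigit()]

converted = sample.copy()
for col in year_cols:
    # Remove thousands separators, then parse; unparseable cells become NaN.
    converted[col] = pd.to_numeric(
        sample[col].str.replace(",", "", regex=False), errors="coerce"
    )

print(converted.dtypes)
```

With `errors="coerce"`, any cell that still fails to parse shows up as `NaN`, which makes leftover formatting problems easy to find with `isnull()`.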

- All three dataframes are now loaded and ready for analysis. In the next step, I will use descriptive statistics to rank the variables in each dataframe, look for insights, and turn them into recommendations for the company.
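
As a preview of the ranking step, one possible approach is to sort a single year's values and attach a rank. The sketch uses the 2018 figures for the three states shown in `df2.head(3)`. The column name `rank` is an illustrative choice.

```python
import pandas as pd

# 2018 dentists per 100,000 population for the first three states in df2.
dentists = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "2018": [41.78, 81.5, 54.42],
})

# Highest ratio first, with rank 1 assigned to the largest value.
ranked = dentists.sort_values("2018", ascending=False).reset_index(drop=True)
ranked["rank"] = ranked["2018"].rank(ascending=False, method="min").astype(int)

print(ranked)
```

The same pattern would apply to the income column once it has been converted to numbers.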