# Centranz - Transform US Census Bureau Data to Star Schema

This project uses the **Income, Poverty and Health Insurance Coverage in the United States: 2015** report released by _the U.S. Census Bureau_ on Sept. 13, 2016. The findings are contained in two reports: 
1. Income and Poverty in the United States: 2015 and 
2. Health Insurance Coverage in the United States: 2015.

The Current Population Survey Annual Social and Economic Supplement was conducted nationwide and collected information about income and health insurance coverage during the 2015 calendar year. The Current Population Survey, sponsored jointly by the U.S. Census Bureau and U.S. Bureau of Labor Statistics, is conducted every month and is the primary source of labor force statistics for the U.S. population; it is used to calculate the monthly unemployment rate estimates. Supplements are added in most months; the Annual Social and Economic Supplement questionnaire is designed to give annual, national estimates of income, poverty and health insurance numbers and rates.

Another Census Bureau report, **The Supplemental Poverty Measure: 2015**, was also released. With support from the Bureau of Labor Statistics, it describes research showing a different way of measuring poverty in the United States and includes estimates for numerous demographic groups, including state-level estimates. The supplemental poverty measure serves as an additional indicator of economic well-being and provides a deeper understanding of economic conditions. The Census Bureau has published poverty estimates using this supplemental measure annually since 2011. Since September 2015, the supplemental poverty measure has been released the same day as the official poverty estimates.

In this project we extract data from the poverty reports and load the data into MySQL in a star schema with the following fact and dimensions:

* Fact - Population
* Dimension - Age Band
* Dimension - Disability Status
* Dimension - Education
* Dimension - Family Status
* Dimension - Gender
* Dimension - Nativity
* Dimension - Race
* Dimension - Region
* Dimension - Residence
* Dimension - Work Experience
* Dimension - Year

***

### Import Libraries Used for ETL Process

1. Pandas - an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
2. OS - provides a portable way of using operating system dependent functionality. In this project os.path is used to build file paths in an OS-independent manner.
3. Abbreviate - a utility which attempts to automatically and intelligently abbreviate strings. In this project abbreviate is used to derive codes for dimensions from their descriptions (long format strings).
4. OrderedDict - a specialized version of the Python dictionary which preserves insertion order.
5. SQLAlchemy - the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
6. pymysql - a pure-Python MySQL client library, used by SQLAlchemy as the database driver (pymysql.install_as_MySQLdb() lets it stand in for MySQLdb).

In [1]:
import pandas as pd
import os
import abbreviate
from collections import OrderedDict
from sqlalchemy import create_engine, inspect
import pymysql
pymysql.install_as_MySQLdb()

## Input Files
Source: https://data.world/uscensusbureau/income-poverty-health-ins

1. **Table 3**
_Poverty Status of People, by Age, Race, and Hispanic Origin: 1959 to 2015_

SOURCE: U.S. Bureau of the Census, Current Population Survey, Annual Social and Economic Supplements.

2. **Table 3** 
_People in Poverty by Selected Characteristics: 2014 and 2015_

(Numbers in thousands, margin of error in thousands or percentage points as appropriate. People as of March of the following year. For information on confidentiality protection, sampling error, nonsampling error, and definitions)



In [2]:
TBL3_POVERTY_BY_RACE = os.path.join('data', 'raw', 'Table_3_Poverty Status of People_Age-Race-Hispanic Origin-hstpov3.xls')
TBL3_POVERTY_BY_SELECTED_CHAR = os.path.join('data', 'raw', 'Table_3_People_Poverty_Selected_Characteristics_2014-2015 .xls')

### Common Variables

In [3]:
abbr = abbreviate.Abbreviate()

### Inspect Sheets in Excel Files

In [4]:
xl_pbr = pd.ExcelFile(TBL3_POVERTY_BY_RACE)
xl_pbsc = pd.ExcelFile(TBL3_POVERTY_BY_SELECTED_CHAR)

In [5]:
xl_pbr.sheet_names

['all races',
 'WHITE, NOT HISPANIC',
 'ASIAN AND PACIFIC ISLANDER',
 'ASIAN ALONE',
 'ASIAN ALONE OR IN COMBINATION',
 'BLACK',
 'BLACK ALONE ',
 'BLACK ALONE OR IN COMBO',
 'WHITE ALONE',
 'WHITE',
 'WHITE ALONE, NOT HISPANIC',
 'HISPANIC, ANY RACE']

In [6]:
xl_pbsc.sheet_names

['All People',
 'Education Attainment',
 'Work Experience',
 'Disability Status',
 'Residence',
 'Age',
 'Sex',
 'Family Status',
 'Race and Hispanic Origin',
 'Naitvity',
 'Region']

### Load Excel Files into DataFrames

This step loads the Excel files into memory as Pandas DataFrames. Passing None as the sheet_name parameter loads every worksheet and returns a dictionary of DataFrames keyed by sheet name.

In [7]:
pbr_dfs = pd.read_excel(TBL3_POVERTY_BY_RACE, sheet_name=None)
pbsc_dfs = pd.read_excel(TBL3_POVERTY_BY_SELECTED_CHAR, sheet_name=None)

In [29]:
pbr_dfs

OrderedDict([('all races',
                 ALL RACES- Year and Characteristic  Under 18 years - All People - Total   \
              0                                2015                                 73647   
              1                                2014                                 73556   
              2                           2013 (19)                                 73439   
              3                           2013 (18)                                 73625   
              4                                2012                                 73719   
              5                                2011                                 73737   
              6                           2010 (17)                                 73873   
              7                                2009                                 74579   
              8                                2008                                 74068   
              9                            

In [30]:
pbsc_dfs

OrderedDict([('All People',
                 All People - 2014 - Total  2014 - Below Poverty - Number  \
              0                     315804                          46657   
              
                 2014 - Below Poverty - Number - Margin of error1 (+/-)  \
              0                                                857        
              
                 2014 - Below Poverty - Percent  \
              0                            14.8   
              
                 2014 - Below Poverty - Percent - Margin of error1 (+/-)  2015 - Total  \
              0                                                0.3              318454   
              
                 2015 - Below Poverty - Number  \
              0                          43123   
              
                 2015 - Below Poverty - Number - Margin of error1 (+/-)  \
              0                                                926        
              
                 2015 - Below Poverty - Percen

### Review columns available in all DataFrames

This is just an exploratory step where we try to understand the contents of the Excel workbooks.


In [8]:
print("TBL3_POVERTY_BY_RACE:")
for k, df in pbr_dfs.items():
    print(f"    {k} - {df.columns}\n")
    
print("TBL3_POVERTY_BY_SELECTED_CHARS:")
for k, df in pbsc_dfs.items():
    print(f"    {k} - {df.columns}\n")

TBL3_POVERTY_BY_RACE:
    all races - Index(['ALL RACES- Year and Characteristic',
       'Under 18 years - All People - Total ',
       'Under 18 years - All People - Below Poverty - Number',
       'Under 18 years - All People - Below Poverty - Percent',
       'Under 18 years - Related Children in Families - Total',
       'Under 18 years - Related Children in Families - Below Poverty - Number',
       'Under 18 years - Related Children in Families - Below Poverty - Percent',
       '18 to 64 years - Total', '18 to 64 years - Below Poverty - Number',
       '18 to 64 years - Below Poverty - Percent', '65 years and over - Total',
       '65 years and over - Below Poverty - Number',
       '65 years and over - Below Poverty - Percent'],
      dtype='object')

    WHITE, NOT HISPANIC - Index(['WHITE, NOT HISPANIC- Year and Characteristic',
       'Under 18 years - All People - Total ',
       'Under 18 years - All People - Below Poverty - Number',
       'Under 18 years - All People - 

In [33]:
print("TBL3_POVERTY_BY_RACE:")
for k, df in pbr_dfs.items():
    print ("k: " + str (k))
    print ("df: " + str(df))
    print(f"    {k} - {df.columns}\n")

TBL3_POVERTY_BY_RACE:
k: all races
df:    ALL RACES- Year and Characteristic  Under 18 years - All People - Total   \
0                                2015                                 73647   
1                                2014                                 73556   
2                           2013 (19)                                 73439   
3                           2013 (18)                                 73625   
4                                2012                                 73719   
5                                2011                                 73737   
6                           2010 (17)                                 73873   
7                                2009                                 74579   
8                                2008                                 74068   
9                                2007                                 73996   
10                               2006                                 73727   
11           

df:    WHITE - Year and Characteristic Under 18 years - All People - Total   \
0                             2001                                56089   
1                        2000 (12)                                55980   
2                        1999 (11)                                55833   
3                             1998                                56016   
4                             1997                                55863   
5                             1996                                55606   
6                             1995                                55444   
7                             1994                                55186   
8                        1993 (10)                                54639   
9                        1992 (9 )                                54110   
10                       1991 (8 )                                52523   
11                            1990                                51929   
12                   

In [9]:
all_pbr_df = pd.DataFrame({"Year":[], "AgeBand":[], "Race":[], "Total":[], "BelowPoverty":[]})

for key, df in pbr_dfs.items():
    clean_column_names = {df.columns[0]:"Year", 
                          df.columns[1]: "Under 18 - Total", 
                          df.columns[2]: "Under 18 - Below Poverty", 
                          df.columns[3]: "Under 18 - Below Poverty %", 
                          df.columns[4]: "Under 18 - Children Total", 
                          df.columns[5]: "Under 18 - Children Below Poverty", 
                          df.columns[6]: "Under 18 - Children Below Poverty %", 
                          df.columns[7]: "18 to 64 - Total", 
                          df.columns[8]: "18 to 64 - Below Poverty", 
                          df.columns[9]: "18 to 64 - Below Poverty %", 
                          df.columns[10]: "65 and Over - Total", 
                          df.columns[11]: "65 and Over - Below Poverty", 
                          df.columns[12]: "65 and Over - Below Poverty %"}
    clean_df = df.rename(columns=clean_column_names)[["Year",
                    "Under 18 - Total", "Under 18 - Below Poverty",
                    "18 to 64 - Total", "18 to 64 - Below Poverty",
                    "65 and Over - Total", "65 and Over - Below Poverty",
                   ]]
    u18_df = clean_df[["Year", "Under 18 - Total", "Under 18 - Below Poverty"]]
    u18_normalized_columns = {"Under 18 - Total": "Total", "Under 18 - Below Poverty":"BelowPoverty"}
    u18_df = u18_df.rename(columns=u18_normalized_columns)
    u18_df["Race"] = key
    u18_df["AgeBand"] = "Under age 18"
    all_pbr_df = pd.concat([all_pbr_df, u18_df], sort=False)
    
    a1864_df = clean_df[["Year", "18 to 64 - Total", "18 to 64 - Below Poverty"]]
    a1864_normalized_columns = {"18 to 64 - Total": "Total", "18 to 64 - Below Poverty":"BelowPoverty"}
    a1864_df = a1864_df.rename(columns=a1864_normalized_columns)
    a1864_df["Race"] = key
    a1864_df["AgeBand"] = "Aged 18 to 64"
    all_pbr_df = pd.concat([all_pbr_df, a1864_df], sort=False)
    
    a65ao_df = clean_df[["Year", "65 and Over - Total", "65 and Over - Below Poverty"]]
    a65ao_normalized_columns = {"65 and Over - Total": "Total", "65 and Over - Below Poverty":"BelowPoverty"}
    a65ao_df = a65ao_df.rename(columns=a65ao_normalized_columns)
    a65ao_df["Race"] = key
    a65ao_df["AgeBand"] = "Aged 65 and older"
    all_pbr_df = pd.concat([all_pbr_df, a65ao_df], sort=False)
    
print(all_pbr_df.count())
all_pbr_df.head()

Year            933
AgeBand         933
Race            933
Total           933
BelowPoverty    933
dtype: int64


Unnamed: 0,Year,AgeBand,Race,Total,BelowPoverty
0,2015,Under age 18,all races,73647,14509
1,2014,Under age 18,all races,73556,15540
2,2013 (19),Under age 18,all races,73439,15801
3,2013 (18),Under age 18,all races,73625,14659
4,2012,Under age 18,all races,73719,16073
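
The cell above repeats the same select, rename and tag steps once per age band. A minimal sketch of the same reshaping as a loop is shown here; it is an illustration, not the notebook's code. The helper name `to_long` and the `AGE_BANDS` mapping are hypothetical, and the 18 to 64 and 65 and over values are placeholders rather than census figures.

```python
import pandas as pd

# Toy stand-in for one cleaned worksheet. The "Under 18" values come from the
# output above; the other two age bands use placeholder numbers.
clean_df = pd.DataFrame({
    "Year": ["2015", "2014"],
    "Under 18 - Total": [73647, 73556],
    "Under 18 - Below Poverty": [14509, 15540],
    "18 to 64 - Total": [100, 110],
    "18 to 64 - Below Poverty": [10, 11],
    "65 and Over - Total": [50, 55],
    "65 and Over - Below Poverty": [5, 6],
})

# Hypothetical mapping from wide-column prefix to the AgeBand label.
AGE_BANDS = {
    "Under 18": "Under age 18",
    "18 to 64": "Aged 18 to 64",
    "65 and Over": "Aged 65 and older",
}

def to_long(clean_df, race):
    """Stack the per-age-band column pairs into Year/Total/BelowPoverty rows."""
    parts = []
    for prefix, label in AGE_BANDS.items():
        total_col = f"{prefix} - Total"
        poverty_col = f"{prefix} - Below Poverty"
        part = clean_df[["Year", total_col, poverty_col]].rename(
            columns={total_col: "Total", poverty_col: "BelowPoverty"}
        )
        # Tag every row with the worksheet's race and the age band label.
        parts.append(part.assign(Race=race, AgeBand=label))
    return pd.concat(parts, ignore_index=True)

long_df = to_long(clean_df, "all races")
print(long_df)
```

Each input row yields one output row per age band, so two years become six rows here.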


In [10]:
pbsc_dfs.keys()

odict_keys(['All People', 'Education Attainment', 'Work Experience', 'Disability Status', 'Residence', 'Age', 'Sex', 'Family Status', 'Race and Hispanic Origin', 'Naitvity', 'Region'])

In [11]:
all_pbsc_df = pd.DataFrame({"Year":[], "Race":[], 
                            "Education":[], "WorkExperience":[], "DisabilityStatus":[], 
                            "Residence":[], "AgeBand":[], "Nativity":[],
                            "Gender":[], "FamilyStatus":[], "Region":[],
                            "Total":[], "BelowPoverty":[]})

char_key_to_value_map = {
    "All People": "DROP",
    "Education Attainment": "Education",
    "Work Experience": "WorkExperience",
    "Disability Status": "DisabilityStatus",
    "Residence": "Residence",
    "Age": "AgeBand",
    "Sex": "Gender",
    "Family Status": "FamilyStatus",
    "Race and Hispanic Origin": "Race",
    "Naitvity": "Nativity",
    "Region": "Region"
}

for key, df in pbsc_dfs.items():
    if key == "All People":
        continue
    clean_column_names = {df.columns[0]:char_key_to_value_map[key], 
                          df.columns[1]: "2014 - Total", 
                          df.columns[2]: "2014 - Below Poverty", 
                          df.columns[3]: "2014 - Below Poverty MOE", 
                          df.columns[4]: "2014 - Below Poverty %", 
                          df.columns[5]: "2014 - Below Poverty % MOE", 
                          df.columns[6]: "2015 - Total", 
                          df.columns[7]: "2015 - Below Poverty", 
                          df.columns[8]: "2015 - Below Poverty MOE", 
                          df.columns[9]: "2015 - Below Poverty %", 
                          df.columns[10]: "2015 - Below Poverty % MOE", 
                          df.columns[11]: "2014-15 - Below Poverty Change", 
                          df.columns[12]: "2014-15 - Below Poverty % Change"
                         }
    clean_df = df.rename(columns=clean_column_names)[[char_key_to_value_map[key],
                                                      "2014 - Total", "2014 - Below Poverty",
                                                      "2015 - Total", "2015 - Below Poverty"]]
    
    y2014_clean_df = clean_df[[char_key_to_value_map[key],"2014 - Total", "2014 - Below Poverty"]]
    y2014_normalized_columns = {"2014 - Total": "Total", "2014 - Below Poverty":"BelowPoverty"}
    y2014_clean_df = y2014_clean_df.rename(columns=y2014_normalized_columns)
    y2015_clean_df = clean_df[[char_key_to_value_map[key],"2015 - Total", "2015 - Below Poverty"]]
    y2015_normalized_columns = {"2015 - Total": "Total", "2015 - Below Poverty":"BelowPoverty"}
    y2015_clean_df = y2015_clean_df.rename(columns=y2015_normalized_columns)
    
    clean_df = pd.concat([y2014_clean_df, y2015_clean_df])
    
    for col in all_pbsc_df.columns:
        if col not in clean_df.columns:
            clean_df[col] = "" ## Add empty columns to transform to common structure
    
    all_pbsc_df = pd.concat([all_pbsc_df, clean_df], sort=False)
    
print(all_pbsc_df.count())
all_pbsc_df.head()

Year                86
Race                86
Education           86
WorkExperience      86
DisabilityStatus    86
Residence           86
AgeBand             86
Nativity            86
Gender              86
FamilyStatus        86
Region              86
Total               86
BelowPoverty        86
dtype: int64


Unnamed: 0,Year,Race,Education,WorkExperience,DisabilityStatus,Residence,AgeBand,Nativity,Gender,FamilyStatus,Region,Total,BelowPoverty
0,,,"Total, aged 25 and older",,,,,,,,,212132.0,25163.0
1,,,No high school diploma,,,,,,,,,24582.0,7098.0
2,,,"High school, no college",,,,,,,,,62575.0,8898.0
3,,,"Some college, no degree",,,,,,,,,56031.0,5719.0
4,,,Bachelor's degree or higher,,,,,,,,,68945.0,3449.0


### Combine All Data

Here we combine the DataFrames extracted from both Excel workbooks into one consolidated DataFrame.

In [12]:
for col in all_pbsc_df.columns:
    if col not in all_pbr_df.columns:
        all_pbr_df[col] = ""
        

all_data_df = pd.concat([all_pbr_df, all_pbsc_df], sort=False)
print(all_data_df.count())
all_data_df.head()

Year                1019
AgeBand             1019
Race                1019
Total               1019
BelowPoverty        1019
Education           1019
WorkExperience      1019
DisabilityStatus    1019
Residence           1019
Nativity            1019
Gender              1019
FamilyStatus        1019
Region              1019
dtype: int64


Unnamed: 0,Year,AgeBand,Race,Total,BelowPoverty,Education,WorkExperience,DisabilityStatus,Residence,Nativity,Gender,FamilyStatus,Region
0,2015,Under age 18,all races,73647,14509,,,,,,,,
1,2014,Under age 18,all races,73556,15540,,,,,,,,
2,2013 (19),Under age 18,all races,73439,15801,,,,,,,,
3,2013 (18),Under age 18,all races,73625,14659,,,,,,,,
4,2012,Under age 18,all races,73719,16073,,,,,,,,
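
The combine step pads each frame with empty-string columns so that both share one structure before concatenation. As a sketch of the same idea, `reindex` can add the missing columns in a single call. The two toy frames and the `all_columns` list below are assumptions made for illustration; the figures are taken from the outputs above.

```python
import pandas as pd

# Toy frames with different column sets, mimicking the two workbooks.
race_df = pd.DataFrame({"Year": ["2015"], "AgeBand": ["Under age 18"], "Total": [73647]})
char_df = pd.DataFrame({"Education": ["No high school diploma"], "Total": [24582]})

# Assumed target structure shared by both frames.
all_columns = ["Year", "AgeBand", "Education", "Total"]

# reindex adds any missing column and fills it with an empty string.
aligned = [df.reindex(columns=all_columns, fill_value="") for df in (race_df, char_df)]
combined = pd.concat(aligned, ignore_index=True)
print(combined)
```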


### Prepare Dimensions
This is a generic reusable function which is used to extract unique dimension values and create a dataframe with columns that match DB columns.

In [13]:
def prepare_dim_df(dim):
    """
       This function returns a dimension dataframe with unique dimension records
    """
    dim_info = []
    for index, value in enumerate(all_data_df[dim].unique()):
        if value is None or (type(value) == str and len(value) == 0):
            continue
        dim_info_row = OrderedDict()
        dim_info_row['ID'] =  index
        if (isinstance(value, str)):
            dim_info_row['CODE'] =  abbr.abbreviate(value.replace(',', '').replace(' ', '').upper(), target_len=5)
        else:
            dim_info_row['CODE'] =  str(value)
        dim_info_row['DESCRIPTION'] = value
        dim_info.append(dim_info_row)
    
    dim_df = pd.DataFrame(dim_info)

    return dim_df
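
To make the idea concrete, here is a small self-contained sketch of extracting unique dimension values from a column of labels. The helper `make_dimension` is hypothetical and omits the CODE column, so it does not depend on the abbreviate library. Unlike the function above, which takes IDs from the enumerate index and can therefore skip a number when an empty value is dropped, this sketch assigns contiguous IDs.

```python
import pandas as pd

def make_dimension(values):
    """Hypothetical helper: build an ID/DESCRIPTION table from a column of labels."""
    labels = pd.Series(values, dtype=object)
    # Drop missing values and empty strings.
    keep = labels.notna() & (labels.astype(str).str.len() > 0)
    # Keep the first occurrence of each label, in first-seen order.
    unique_labels = labels[keep].drop_duplicates().tolist()
    return pd.DataFrame({
        "ID": list(range(len(unique_labels))),
        "DESCRIPTION": unique_labels,
    })

dim = make_dimension(["Under age 18", "", "Aged 18 to 64", "Under age 18", None])
print(dim)
```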

### Prepare Year

In [24]:
# Note: the years 1959 to 2015 do not have rows for every attribute of each characteristic
year_df = prepare_dim_df('Year')
year_df.head()


Unnamed: 0,ID,CODE,DESCRIPTION
0,0,2015,2015
1,1,2014,2014
2,2,2013(19),2013 (19)
3,3,2013(18),2013 (18)
4,4,2012,2012


In [28]:
year_df

Unnamed: 0,ID,CODE,DESCRIPTION
0,0,2015,2015
1,1,2014,2014
2,2,2013(19),2013 (19)
3,3,2013(18),2013 (18)
4,4,2012,2012
5,5,2011,2011
6,6,2010(17),2010 (17)
7,7,2009,2009
8,8,2008,2008
9,9,2007,2007


In [34]:
year_df.tail()

Unnamed: 0,ID,CODE,DESCRIPTION
53,53,1963,1963
54,54,1962,1962
55,55,1961,1961
56,56,1960,1960
57,57,1959,1959


In [37]:
year_df.count()

ID             58
CODE           58
DESCRIPTION    58
dtype: int64

### Create Population DF

Here we update the combined dataframe with dimension IDs.

In [25]:
def update_value_with_id(main_df, dim, dim_df):
    """
    This function replaces dimension values with corresponding IDs
    """
    di = dim_df.set_index('DESCRIPTION').to_dict()['ID']
    main_df[dim] = pd.Series(main_df[dim].map(di), dtype=object)
    return main_df
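
The lookup-and-map step can be seen on a toy example. The frames below are assumptions made for illustration (the Total values are placeholders). The sketch shows that labels with no match in the dimension table, such as the empty strings used as padding, come back as NaN.

```python
import pandas as pd

# Toy dimension table and fact rows; one fact row has an empty AgeBand.
dim_df = pd.DataFrame({"ID": [0, 1], "DESCRIPTION": ["Under age 18", "Aged 18 to 64"]})
fact_df = pd.DataFrame({
    "AgeBand": ["Aged 18 to 64", "", "Under age 18"],
    "Total": [1, 2, 3],  # placeholder values
})

# Build a DESCRIPTION -> ID dictionary, then translate each label.
lookup = dim_df.set_index("DESCRIPTION")["ID"].to_dict()
mapped = fact_df["AgeBand"].map(lookup)
print(mapped)
```

Because one value is missing, pandas stores the result as a float column. This is why the IDs need attention again before the insert into the database.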

In [26]:
population_df = all_data_df.copy()
population_df = update_value_with_id(population_df, "AgeBand", ageband_df)
population_df = update_value_with_id(population_df, "DisabilityStatus", disabilitystatus_df)
population_df = update_value_with_id(population_df, "Education", education_df)
population_df = update_value_with_id(population_df, "FamilyStatus", familystatus_df)
population_df = update_value_with_id(population_df, "Gender", gender_df)
population_df = update_value_with_id(population_df, "Nativity", nativity_df)
population_df = update_value_with_id(population_df, "Race", race_df)
population_df = update_value_with_id(population_df, "Region", region_df)
population_df = update_value_with_id(population_df, "Residence", residence_df)
population_df = update_value_with_id(population_df, "WorkExperience", workexperience_df)
population_df = update_value_with_id(population_df, "Year", year_df)
population_df.head()

Unnamed: 0,Year,AgeBand,Race,Total,BelowPoverty,Education,WorkExperience,DisabilityStatus,Residence,Nativity,Gender,FamilyStatus,Region
0,0,0,0,73647,14509,,,,,,,,
1,1,0,0,73556,15540,,,,,,,,
2,2,0,0,73439,15801,,,,,,,,
3,3,0,0,73625,14659,,,,,,,,
4,4,0,0,73719,16073,,,,,,,,


### Connect to DB
Here we connect to the database and explore table names.

In [27]:
USERNAME="root"
PASSWORD="rootPassword"
HOST="localhost"
PORT="3357"
SCHEMA="CENTRANZ"
connection_string = f"{USERNAME}:{PASSWORD}@{HOST}:{PORT}/{SCHEMA}"
engine = create_engine(f'mysql://{connection_string}')
engine.table_names()

OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'localhost' ([Errno 61] Connection refused)") (Background on this error at: http://sqlalche.me/e/e3q8)
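
The connection string above interpolates the credentials directly, which can produce a malformed URL if the password contains characters such as @ or /. The following sketch shows one way to escape them using only the standard library. The helper `build_mysql_url` and the sample password are hypothetical. The `mysql+pymysql://` prefix names the driver explicitly and is used here as an assumption in place of the notebook's `mysql://` form. This does not address the "Connection refused" error itself, which indicates that nothing was accepting connections on that host and port.

```python
from urllib.parse import quote_plus

def build_mysql_url(username, password, host, port, schema):
    """Hypothetical helper: assemble a SQLAlchemy-style MySQL URL with escaped credentials."""
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}:{port}/{schema}"
    )

# Sample password chosen to contain characters that need escaping.
url = build_mysql_url("root", "p@ss/word", "localhost", "3357", "CENTRANZ")
print(url)
```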

### Insert Dimensions

Here we insert all dimensions into DB.

In [None]:
year_df.set_index('ID').to_sql(name='YEAR', con=engine, if_exists='append')

### Fix Population DF column names to match DB column names

Here we handle a couple of important things.

First, we change all the column names to match the DB column names.

Second, by default Pandas writes integer columns that contain null values as floats. In our case the FK columns are nullable: some rows hold non-null integers and some rows hold nulls. The default behaviour therefore converts all FK integer values into floats (i.e. 0 becomes 0.0, 1 becomes 1.0, etc.). This behaviour can be overridden by passing dtype to the to_sql call.

In the code segment below we use the SQLAlchemy inspect API to construct the dtype dictionary.

In [None]:
population_df.columns

In [None]:
inspector = inspect(engine)
population_columns = inspector.get_columns('POPULATION')
dtypes = {}
for col in population_columns:
    if col['name'] == 'ID':
        continue
    dtypes[col['name']] = col['type']
dtypes    
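
The float conversion described in this section is easy to reproduce in isolation. The sketch below is an illustration and is not part of the original pipeline. It shows an integer column with one missing value being stored as float64, and pandas' nullable "Int64" dtype as one possible way to keep integer values. The notebook itself relies on the dtype argument of to_sql, and whether a given database driver writes "Int64" columns cleanly is not tested here.

```python
import pandas as pd

# An FK-like column with one missing value is stored as float64.
fk = pd.Series([0, 1, None])
print(fk.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
fk_nullable = fk.astype("Int64")
print(fk_nullable)
```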

In [None]:
column_name_map = {}

column_name_map['Year'] = "YEAR_FK"

In [None]:
final_population_df = population_df.rename(columns=column_name_map)

In [None]:
final_population_df.dtypes

### Drop rows with (NA) in TOTAL or BELOW_POVERTY

There are some rows which have a string value (NA), where data is not available. We drop such rows.

In [None]:
final_population_df = final_population_df[final_population_df['TOTAL']!='(NA)']
final_population_df = final_population_df[final_population_df['BELOW_POVERTY']!='(NA)']
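
The filter above removes rows whose value is exactly the string (NA). As a hedged alternative sketch, `pd.to_numeric` with `errors="coerce"` flags any non-numeric entry, which would also catch other unexpected strings. The toy frame is an assumption for illustration; its figures are borrowed from the outputs above.

```python
import pandas as pd

# Toy frame: two of the three rows carry the "(NA)" placeholder.
df = pd.DataFrame({
    "TOTAL": [73647, "(NA)", 212132],
    "BELOW_POVERTY": [14509, 15540, "(NA)"],
})

# Coerce both measure columns to numbers; non-numeric entries become NaN.
numeric = df[["TOTAL", "BELOW_POVERTY"]].apply(pd.to_numeric, errors="coerce")

# Keep only rows where both measures are numeric.
cleaned = df[numeric.notna().all(axis=1)]
print(cleaned)
```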

### Insert Population DF

In [None]:
final_population_df.to_sql(name='POPULATION', con=engine, if_exists='append', index=False, dtype=dtypes)
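
For readers without a MySQL server at hand, the same append-with-dtype call can be tried against an in-memory SQLite database. This is a stand-in for illustration, not the project's target database. With a plain sqlite3 connection, pandas expects the dtype values as SQL type strings rather than SQLAlchemy types. The two-row frame is a toy assumption using figures from the outputs above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL engine used by the notebook.
conn = sqlite3.connect(":memory:")

population = pd.DataFrame({
    # object dtype keeps the ID as a Python int next to a missing value
    "YEAR_FK": pd.Series([0, None], dtype=object),
    "TOTAL": [73647, 212132],
})

# Column types are given as strings because this is a sqlite3 connection.
population.to_sql(
    name="POPULATION",
    con=conn,
    if_exists="append",
    index=False,
    dtype={"YEAR_FK": "INTEGER", "TOTAL": "INTEGER"},
)

rows = conn.execute(
    "SELECT YEAR_FK, TOTAL FROM POPULATION ORDER BY TOTAL"
).fetchall()
print(rows)
```

The missing foreign key is stored as NULL and the present one as an integer, which is the outcome the dtype dictionary is meant to guarantee on MySQL.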