# Centranz - Transform US Census Bureau Data to Star Schema

This project uses the **Income, Poverty and Health Insurance Coverage in the United States: 2015** findings released by _The U.S. Census Bureau_ on September 13, 2016. These findings are contained in two reports:
1. Income and Poverty in the United States: 2015 and 
2. Health Insurance Coverage in the United States: 2015.

The Current Population Survey Annual Social and Economic Supplement was conducted nationwide and collected information about income and health insurance coverage during the 2015 calendar year. The Current Population Survey, sponsored jointly by the U.S. Census Bureau and U.S. Bureau of Labor Statistics, is conducted every month and is the primary source of labor force statistics for the U.S. population; it is used to calculate the monthly unemployment rate estimates. Supplements are added in most months; the Annual Social and Economic Supplement questionnaire is designed to give annual, national estimates of income, poverty and health insurance numbers and rates.

Another Census Bureau report, **The Supplemental Poverty Measure: 2015**, was also released. With support from the Bureau of Labor Statistics, it describes research showing a different way of measuring poverty in the United States and includes estimates for numerous demographic groups, including state-level estimates. The supplemental poverty measure serves as an additional indicator of economic well-being and provides a deeper understanding of economic conditions. The Census Bureau has published poverty estimates using this supplemental measure annually since 2011. Since September 2015, the supplemental poverty measure has been released the same day as the official poverty estimates.

In this project we extract data from the poverty reports and load the data into MySQL as a star schema with the following fact and dimensions:

* Fact - Population
* Dimension - Age Band
* Dimension - Disability Status
* Dimension - Education
* Dimension - Family Status
* Dimension - Nativity
* Dimension - Race
* Dimension - Region
* Dimension - Residence
* Dimension - Work Experience
* Dimension - Year
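
To make the fact/dimension split above concrete, here is a minimal sketch of how one dimension and the fact table could relate. The table names, the `AGE_BAND_ID` column, and all numbers are made up for illustration; they are not taken from the project's schema or from the Census data.

```python
import pandas as pd

# Hypothetical dimension: one row per age band, keyed by a surrogate ID.
dim_age_band = pd.DataFrame({
    "ID": [0, 1, 2],
    "DESCRIPTION": ["Under age 18", "Aged 18 to 64", "Aged 65 and older"],
})

# Hypothetical fact rows: measures plus a foreign key into the dimension.
# The numbers are placeholders, not Census figures.
fact_population = pd.DataFrame({
    "AGE_BAND_ID": [0, 1, 2],
    "TOTAL": [100, 200, 50],
    "BELOW_POVERTY": [20, 25, 5],
})

# A report query joins the fact table back to its dimension.
report = fact_population.merge(dim_age_band, left_on="AGE_BAND_ID", right_on="ID")
print(report[["DESCRIPTION", "TOTAL", "BELOW_POVERTY"]])
```

The fact table stores only keys and measures, so a label can be corrected in one dimension row without touching any fact rows.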

In [1]:
import pandas as pd
import os
import abbreviate
from collections import OrderedDict

## Input Files
Source: https://data.world/uscensusbureau/income-poverty-health-ins

1. **Table 3**
_Poverty Status of People, by Age, Race, and Hispanic Origin: 1959 to 2015_

SOURCE: U.S. Bureau of the Census, Current Population Survey, Annual Social and Economic Supplements.

2. **Table 3** 
_People in Poverty by Selected Characteristics: 2014 and 2015_

(Numbers in thousands, margin of error in thousands or percentage points as appropriate. People as of March of the following year. For information on confidentiality protection, sampling error, nonsampling error, and definitions)



In [2]:
TBL3_POVERTY_BY_RACE = os.path.join('data', 'raw', 'Table_3_Poverty Status of People_Age-Race-Hispanic Origin-hstpov3.xls')
TBL3_POVERTY_BY_SELECTED_CHAR = os.path.join('data', 'raw', 'Table_3_People_Poverty_Selected_Characteristics_2014-2015 .xls')

### Common Variables

In [3]:
abbr = abbreviate.Abbreviate()

### Inspect Sheets in Excel Files

In [4]:
xl_pbr = pd.ExcelFile(TBL3_POVERTY_BY_RACE)
xl_pbsc = pd.ExcelFile(TBL3_POVERTY_BY_SELECTED_CHAR)

In [5]:
xl_pbr.sheet_names

['all races',
 'WHITE, NOT HISPANIC',
 'ASIAN AND PACIFIC ISLANDER',
 'ASIAN ALONE',
 'ASIAN ALONE OR IN COMBINATION',
 'BLACK',
 'BLACK ALONE ',
 'BLACK ALONE OR IN COMBO',
 'WHITE ALONE',
 'WHITE',
 'WHITE ALONE, NOT HISPANIC',
 'HISPANIC, ANY RACE']

In [6]:
xl_pbsc.sheet_names

['All People',
 'Education Attainment',
 'Work Experience',
 'Disability Status',
 'Residence',
 'Age',
 'Sex',
 'Family Status',
 'Race and Hispanic Origin',
 'Naitvity',
 'Region']

### Load Excel Files into DataFrames

In [7]:
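# sheet_name=None loads every sheet into a mapping keyed by sheet name
# (the keyword was spelled `sheetname` before pandas 0.21)
pbr_dfs = pd.read_excel(TBL3_POVERTY_BY_RACE, sheet_name=None)
pbsc_dfs = pd.read_excel(TBL3_POVERTY_BY_SELECTED_CHAR, sheet_name=None)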

### Review columns available in all DataFrames

In [8]:
print("TBL3_POVERTY_BY_RACE:")
for k, df in pbr_dfs.items():
    print(f"    {k} - {df.columns}\n")
    
print("TBL3_POVERTY_BY_SELECTED_CHARS:")
for k, df in pbsc_dfs.items():
    print(f"    {k} - {df.columns}\n")

TBL3_POVERTY_BY_RACE:
    all races - Index(['ALL RACES- Year and Characteristic',
       'Under 18 years - All People - Total ',
       'Under 18 years - All People - Below Poverty - Number',
       'Under 18 years - All People - Below Poverty - Percent',
       'Under 18 years - Related Children in Families - Total',
       'Under 18 years - Related Children in Families - Below Poverty - Number',
       'Under 18 years - Related Children in Families - Below Poverty - Percent',
       '18 to 64 years - Total', '18 to 64 years - Below Poverty - Number',
       '18 to 64 years - Below Poverty - Percent', '65 years and over - Total',
       '65 years and over - Below Poverty - Number',
       '65 years and over - Below Poverty - Percent'],
      dtype='object')

    WHITE, NOT HISPANIC - Index(['WHITE, NOT HISPANIC- Year and Characteristic',
       'Under 18 years - All People - Total ',
       'Under 18 years - All People - Below Poverty - Number',
       'Under 18 years - All People - 

### Process Poverty By Race

In [9]:
all_pbr_df = pd.DataFrame({"Year":[], "AgeBand":[], "Race":[], "Total":[], "BelowPoverty":[]})

for key, df in pbr_dfs.items():
    clean_column_names = {df.columns[0]:"Year", 
                          df.columns[1]: "Under 18 - Total", 
                          df.columns[2]: "Under 18 - Below Poverty", 
                          df.columns[3]: "Under 18 - Below Poverty %", 
                          df.columns[4]: "Under 18 - Children Total", 
                          df.columns[5]: "Under 18 - Children Below Poverty", 
                          df.columns[6]: "Under 18 - Children Below Poverty %", 
                          df.columns[7]: "18 to 64 - Total", 
                          df.columns[8]: "18 to 64 - Below Poverty", 
                          df.columns[9]: "18 to 64 - Below Poverty %", 
                          df.columns[10]: "65 and Over - Total", 
                          df.columns[11]: "65 and Over - Below Poverty", 
                          df.columns[12]: "65 and Over - Below Poverty %"}
    clean_df = df.rename(columns=clean_column_names)[["Year",
                    "Under 18 - Total", "Under 18 - Below Poverty",
                    "18 to 64 - Total", "18 to 64 - Below Poverty",
                    "65 and Over - Total", "65 and Over - Below Poverty",
                   ]]
    u18_df = clean_df[["Year", "Under 18 - Total", "Under 18 - Below Poverty"]]
    u18_normalized_columns = {"Under 18 - Total": "Total", "Under 18 - Below Poverty":"BelowPoverty"}
    u18_df = u18_df.rename(columns=u18_normalized_columns)
    u18_df["Race"] = key
    u18_df["AgeBand"] = "Under age 18"
    all_pbr_df = pd.concat([all_pbr_df, u18_df])
    
    a1864_df = clean_df[["Year", "18 to 64 - Total", "18 to 64 - Below Poverty"]]
    a1864_normalized_columns = {"18 to 64 - Total": "Total", "18 to 64 - Below Poverty":"BelowPoverty"}
    a1864_df = a1864_df.rename(columns=a1864_normalized_columns)
    a1864_df["Race"] = key
    a1864_df["AgeBand"] = "Aged 18 to 64"
    all_pbr_df = pd.concat([all_pbr_df, a1864_df])
    
    a65ao_df = clean_df[["Year", "65 and Over - Total", "65 and Over - Below Poverty"]]
    a65ao_normalized_columns = {"65 and Over - Total": "Total", "65 and Over - Below Poverty":"BelowPoverty"}
    a65ao_df = a65ao_df.rename(columns=a65ao_normalized_columns)
    a65ao_df["Race"] = key
    a65ao_df["AgeBand"] = "Aged 65 and older"
    all_pbr_df = pd.concat([all_pbr_df, a65ao_df])
    
print(all_pbr_df.count())
all_pbr_df.head()

AgeBand         933
BelowPoverty    933
Race            933
Total           933
Year            933
dtype: int64


Unnamed: 0,AgeBand,BelowPoverty,Race,Total,Year
0,Under age 18,14509,all races,73647,2015
1,Under age 18,15540,all races,73556,2014
2,Under age 18,15801,all races,73439,2013 (19)
3,Under age 18,14659,all races,73625,2013 (18)
4,Under age 18,16073,all races,73719,2012
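
The three near-identical blocks in the cell above each turn one age band's columns into `Total`/`BelowPoverty` rows. One possible way to express the same wide-to-long reshape as a loop is sketched below. The `to_long` helper, the `AGE_BANDS` mapping, and the sample numbers are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Placeholder wide frame using the cleaned column names from the cell above.
wide = pd.DataFrame({
    "Year": [2015, 2014],
    "Under 18 - Total": [10, 11],
    "Under 18 - Below Poverty": [1, 2],
    "18 to 64 - Total": [20, 21],
    "18 to 64 - Below Poverty": [3, 4],
})

# Hypothetical mapping: column prefix -> AgeBand label.
AGE_BANDS = {"Under 18": "Under age 18", "18 to 64": "Aged 18 to 64"}

def to_long(wide_df, race):
    """Stack one block of rows per age band, tagging each with Race and AgeBand."""
    pieces = []
    for prefix, label in AGE_BANDS.items():
        total_col = f"{prefix} - Total"
        poverty_col = f"{prefix} - Below Poverty"
        piece = (
            wide_df[["Year", total_col, poverty_col]]
            .rename(columns={total_col: "Total", poverty_col: "BelowPoverty"})
            .assign(Race=race, AgeBand=label)
        )
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)

long_df = to_long(wide, "all races")
print(long_df)
```

Adding another age band then only requires a new entry in the mapping rather than another copied block.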


### Process Poverty By Selected Characteristics

In [10]:
pbsc_dfs.keys()

odict_keys(['All People', 'Education Attainment', 'Work Experience', 'Disability Status', 'Residence', 'Age', 'Sex', 'Family Status', 'Race and Hispanic Origin', 'Naitvity', 'Region'])

In [11]:
all_pbsc_df = pd.DataFrame({"Year":[], "Race":[], 
                            "Education":[], "WorkExperience":[], "DisabilityStatus":[], 
                            "Residence":[], "AgeBand":[], "Nativity":[],
                            "Gender":[], "FamilyStatus":[], "Region":[],
                            "Total":[], "BelowPoverty":[]})

char_key_to_value_map = {
    "All People": "DROP",
    "Education Attainment": "Education",
    "Work Experience": "WorkExperience",
    "Disability Status": "DisabilityStatus",
    "Residence": "Residence",
    "Age": "AgeBand",
    "Sex": "Gender",
    "Family Status": "FamilyStatus",
    "Race and Hispanic Origin": "Race",
    "Naitvity": "Nativity",
    "Region": "Region"
}

for key, df in pbsc_dfs.items():
    if key == "All People":
        continue
    clean_column_names = {df.columns[0]:char_key_to_value_map[key], 
                          df.columns[1]: "2014 - Total", 
                          df.columns[2]: "2014 - Below Poverty", 
                          df.columns[3]: "2014 - Below Poverty MOE", 
                          df.columns[4]: "2014 - Below Poverty %", 
                          df.columns[5]: "2014 - Below Poverty % MOE", 
                          df.columns[6]: "2015 - Total", 
                          df.columns[7]: "2015 - Below Poverty", 
                          df.columns[8]: "2015 - Below Poverty MOE", 
                          df.columns[9]: "2015 - Below Poverty %", 
                          df.columns[10]: "2015 - Below Poverty % MOE", 
                          df.columns[11]: "2014-15 - Below Poverty Change", 
                          df.columns[12]: "2014-15 - Below Poverty % Change"
                         }
    clean_df = df.rename(columns=clean_column_names)[[char_key_to_value_map[key],
                                                      "2014 - Total", "2014 - Below Poverty",
                                                      "2015 - Total", "2015 - Below Poverty"]]
    
    # Split the sheet into one slice per year and normalize the measure names.
    # Note: neither slice is tagged with its year here, so "Year" is later
    # filled with "" by the empty-column loop below.
    y2014_clean_df = clean_df[[char_key_to_value_map[key],"2014 - Total", "2014 - Below Poverty"]]
    y2014_normalized_columns = {"2014 - Total": "Total", "2014 - Below Poverty":"BelowPoverty"}
    y2014_clean_df = y2014_clean_df.rename(columns=y2014_normalized_columns)
    y2015_clean_df = clean_df[[char_key_to_value_map[key],"2015 - Total", "2015 - Below Poverty"]]
    y2015_normalized_columns = {"2015 - Total": "Total", "2015 - Below Poverty":"BelowPoverty"}
    y2015_clean_df = y2015_clean_df.rename(columns=y2015_normalized_columns)
    
    clean_df = pd.concat([y2014_clean_df, y2015_clean_df])
    
    for col in all_pbsc_df.columns:
        if col not in clean_df.columns:
            clean_df[col] = "" ## Add empty columns to transform to common structure
    
    all_pbsc_df = pd.concat([all_pbsc_df, clean_df])
    
print(all_pbsc_df.count())
all_pbsc_df.head()

AgeBand             86
BelowPoverty        86
DisabilityStatus    86
Education           86
FamilyStatus        86
Gender              86
Nativity            86
Race                86
Region              86
Residence           86
Total               86
WorkExperience      86
Year                86
dtype: int64


Unnamed: 0,AgeBand,BelowPoverty,DisabilityStatus,Education,FamilyStatus,Gender,Nativity,Race,Region,Residence,Total,WorkExperience,Year
0,,25163.0,,"Total, aged 25 and older",,,,,,,212132.0,,
1,,7098.0,,No high school diploma,,,,,,,24582.0,,
2,,8898.0,,"High school, no college",,,,,,,62575.0,,
3,,5719.0,,"Some college, no degree",,,,,,,56031.0,,
4,,3449.0,,Bachelor's degree or higher,,,,,,,68945.0,,
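
In the preview above the `Year` column is empty for the selected-characteristics rows, even though each row comes from either the 2014 or the 2015 columns. A minimal sketch of one way to carry the year along when splitting is shown below. The `split_by_year` helper and the sample numbers are hypothetical and not part of the notebook.

```python
import pandas as pd

# Placeholder frame with the cleaned column names used in the cell above.
clean = pd.DataFrame({
    "Education": ["No high school diploma", "High school, no college"],
    "2014 - Total": [10, 20],
    "2014 - Below Poverty": [1, 2],
    "2015 - Total": [11, 21],
    "2015 - Below Poverty": [3, 4],
})

def split_by_year(df, dim_col, years=(2014, 2015)):
    """Return one block of rows per year, each tagged with a Year column."""
    pieces = []
    for year in years:
        total_col = f"{year} - Total"
        poverty_col = f"{year} - Below Poverty"
        piece = (
            df[[dim_col, total_col, poverty_col]]
            .rename(columns={total_col: "Total", poverty_col: "BelowPoverty"})
            .assign(Year=year)
        )
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True)

by_year = split_by_year(clean, "Education")
print(by_year)
```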


### Combine All Data

In [12]:
for col in all_pbsc_df.columns:
    if col not in all_pbr_df.columns:
        all_pbr_df[col] = ""
        

all_data_df = pd.concat([all_pbr_df, all_pbsc_df])
print(all_data_df.count())
all_data_df.head()

AgeBand             1019
BelowPoverty        1019
DisabilityStatus    1019
Education           1019
FamilyStatus        1019
Gender              1019
Nativity            1019
Race                1019
Region              1019
Residence           1019
Total               1019
WorkExperience      1019
Year                1019
dtype: int64


Unnamed: 0,AgeBand,BelowPoverty,DisabilityStatus,Education,FamilyStatus,Gender,Nativity,Race,Region,Residence,Total,WorkExperience,Year
0,Under age 18,14509,,,,,,all races,,,73647,,2015
1,Under age 18,15540,,,,,,all races,,,73556,,2014
2,Under age 18,15801,,,,,,all races,,,73439,,2013 (19)
3,Under age 18,14659,,,,,,all races,,,73625,,2013 (18)
4,Under age 18,16073,,,,,,all races,,,73719,,2012
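
Some `Year` values above carry a parenthesized suffix, such as `2013 (19)`. Assuming the suffix is a footnote reference from the source spreadsheet (an interpretation, not something the notebook states), a small helper could separate the calendar year from the marker before the Year dimension is built. The `split_year_label` function is hypothetical.

```python
import re

# Four-digit year, optionally followed by a parenthesized number.
YEAR_LABEL = re.compile(r"^\s*(\d{4})\s*(?:\((\d+)\))?\s*$")

def split_year_label(label):
    """Return (year, marker) for labels like '2013 (19)'.

    Returns (None, None) when the label does not look like a year.
    """
    match = YEAR_LABEL.match(str(label))
    if match is None:
        return None, None
    year = int(match.group(1))
    marker = int(match.group(2)) if match.group(2) else None
    return year, marker

for raw in [2015, "2013 (19)", "2013 (18)", ""]:
    print(repr(raw), "->", split_year_label(raw))
```

Keeping the marker as a separate value, instead of discarding it, preserves the distinction between the two 2013 rows.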


### Prepare Dimensions

In [13]:
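def prepare_dim_df(dim):
    """Build a dimension frame (ID, CODE, DESCRIPTION) from the unique values of a column."""
    dim_info = []
    for index, value in enumerate(all_data_df[dim].unique()):
        # Skip missing values and the "" placeholders added when combining frames
        if value is None or (isinstance(value, str) and len(value) == 0):
            continue
        dim_info_row = OrderedDict()
        dim_info_row['ID'] = index
        if isinstance(value, str):
            # Strip commas and spaces, upper-case, then abbreviate into a short code
            compact = value.replace(',', '').replace(' ', '').upper()
            dim_info_row['CODE'] = abbr.abbreviate(compact, target_len=5)
        else:
            dim_info_row['CODE'] = str(value)
        dim_info_row['DESCRIPTION'] = value
        dim_info.append(dim_info_row)

    dim_df = pd.DataFrame(dim_info)

    return dim_df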
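
A dimension frame of this shape (`ID`, `CODE`, `DESCRIPTION`) can later be used to replace descriptive text in the combined data with surrogate keys. The sketch below shows one way to do that lookup with `Series.map`. The `NativityID` column name and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for a frame returned by prepare_dim_df('Nativity').
dim_df = pd.DataFrame({
    "ID": [1, 2],
    "CODE": ["NTVBRN", "FRGNBRN"],
    "DESCRIPTION": ["Native born", "Foreign born"],
})

# Stand-in for rows of the combined data; "" marks "not applicable".
rows = pd.DataFrame({"Nativity": ["Foreign born", "", "Native born"]})

# Description -> ID lookup; values absent from the dimension map to NaN.
lookup = dim_df.set_index("DESCRIPTION")["ID"]
rows["NativityID"] = rows["Nativity"].map(lookup)
print(rows)
```

Rows whose value is the empty placeholder end up with a missing key, which a later load step would need to handle explicitly.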

### Prepare Age Band

In [14]:
ageband_df = prepare_dim_df('AgeBand')
ageband_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,0,UNDRG18,Under age 18
1,1,AGD18T64,Aged 18 to 64
2,2,AGD65NDLDR,Aged 65 and older


### Prepare Disability Status

In [15]:
disabilitystatus_df = prepare_dim_df('DisabilityStatus')
disabilitystatus_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,TTLGD18T64,"Total, aged 18 to 64"
1,2,WTHDSBLTY,With a disability
2,3,WTHNDSBLTY,With no disability


### Prepare Education

In [16]:
education_df = prepare_dim_df('Education')
education_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,TTLGD25NDLDR,"Total, aged 25 and older"
1,2,NHGHSCHLDPLMA,No high school diploma
2,3,HGHSCHLNCLLGE,"High school, no college"
3,4,SMCLLGNDGREE,"Some college, no degree"
4,5,BCHLR'SDGRRHGHR,Bachelor's degree or higher


### Prepare Family Status

In [17]:
familystatus_df = prepare_dim_df('FamilyStatus')
familystatus_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,INFMLS,In families
1,2,HSHLDR,Householder
2,3,RLTDCHLDRNNDRG18,Related children under age 18
3,4,RLTDCHLDRNNDRG6,Related children under age 6
4,5,INNRLTDSBFMLS,In unrelated subfamilies


### Prepare Nativity

In [18]:
nativity_df = prepare_dim_df('Nativity')
nativity_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,NTVBRN,Native born
1,2,FRGNBRN,Foreign born
2,3,NTRLZDCTZN,Naturalized citizen
3,4,NTCTZN,Not a citizen


### Prepare Race

In [19]:
race_df = prepare_dim_df('Race')
race_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,0,ALLRCS,all races
1,1,WHTNTHSPNC,"WHITE, NOT HISPANIC"
2,2,ASNNDPCFCSLNDR,ASIAN AND PACIFIC ISLANDER
3,3,ASNLNE,ASIAN ALONE
4,4,ASNLNRNCMBNTN,ASIAN ALONE OR IN COMBINATION


### Prepare Region

In [20]:
region_df = prepare_dim_df('Region')
region_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,NRTHST,Northeast
1,2,MDWST,Midwest
2,3,STH,South
3,4,WEST,West


### Prepare Residence

In [21]:
residence_df = prepare_dim_df('Residence')
residence_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,INSDMTRPLTNSTTSTCLRS,Inside metropolitan statistical areas
1,2,INSDPRNCPLCTS,Inside principal cities
2,3,OUTSDPRNCPLCTS,Outside principal cities
3,4,OUTSDMTRPLTNSTTSTCLRS4,Outside metropolitan statistical areas4


### Prepare Work Experience

In [22]:
workexperience_df = prepare_dim_df('WorkExperience')
workexperience_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,1,TTLGD18T64,"Total, aged 18 to 64"
1,2,ALLWRKRS,All workers
2,3,WRKDFLL-TMR-RND,"Worked full-time, year-round"
3,4,LSSTHNFLL-TMR-RND,"Less than full-time, year-round"
4,5,DDNTWRKTLST1WK,Did not work at least 1 week


### Prepare Year

In [23]:
year_df = prepare_dim_df('Year')
year_df.head()

Unnamed: 0,ID,CODE,DESCRIPTION
0,0,2015,2015
1,1,2014,2014
2,2,2013(19),2013 (19)
3,3,2013(18),2013 (18)
4,4,2012,2012
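
Once the dimension frames exist, they still need to be written to the database. The sketch below shows the general shape of that step with `DataFrame.to_sql`, using an in-memory SQLite connection as a stand-in so it runs without a server. For MySQL, a SQLAlchemy engine would be passed instead of the SQLite connection. The `DIM_YEAR` table name is an assumption.

```python
import sqlite3

import pandas as pd

# Stand-in for the first rows of year_df shown above.
year_dim = pd.DataFrame({
    "ID": [0, 1],
    "CODE": ["2015", "2014"],
    "DESCRIPTION": ["2015", "2014"],
})

# In-memory SQLite is used here only as a substitute for MySQL.
conn = sqlite3.connect(":memory:")
try:
    # Write the dimension without the DataFrame index; replace any existing table.
    year_dim.to_sql("DIM_YEAR", conn, index=False, if_exists="replace")
    # Read it back to confirm the rows landed.
    loaded = pd.read_sql_query("SELECT ID, CODE FROM DIM_YEAR ORDER BY ID", conn)
finally:
    conn.close()

print(loaded)
```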
