# Column Mapping
This notebook (plus human effort) builds a map from the as-written ACS question names to the standardized names used in our analysis.

First, this notebook builds a table of the questions asked, in order, in each year of the ACS (colnames.csv). With some human elbow grease, this becomes colnames_final.csv, which has one row per question and one column per year; each entry is the name given to that question in that year. colnames_final.csv can then be used to map the as-written names to the "official" name: the name used in the final year that question was asked.
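To make the mapping concrete, here is a minimal sketch of how a table in this layout could be turned into a lookup and applied to a list of raw column names. The `SAMPLE` text, the year labels on its header row, and the helper names `build_name_map` and `standardize` are assumptions made for illustration; only the column-name strings are taken from the real data. Unlike the loop in cell `In [4]`, this sketch skips the header row and blank cells (assumed here to mean the question was not asked that year).

```python
import csv
import io

# Made-up excerpt in the layout described above: one row per question,
# one column per year, standardized name in the last column.
# Which year used which wording is illustrative only.
SAMPLE = (
    "2015,2016,Final\n"
    '"Percent; Estimate; HOUSEHOLDS BY TYPE - Total households",'
    '"Percent; HOUSEHOLDS BY TYPE - Total households",'
    '"Percent; HOUSEHOLDS BY TYPE - Total households"\n'
    '"",'
    '"Percent; HOUSEHOLDS BY TYPE - Family households (families)",'
    '"Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)"\n'
)


def build_name_map(csvfile):
    """Map every as-written name in a row to that row's last entry."""
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row of year labels
    name_map = {}
    for row in reader:
        final_name = row[-1]
        for name in row:
            if name:  # blank cell: question not asked that year (assumed)
                name_map[name] = final_name
    return name_map


def standardize(columns, name_map):
    """Replace known names; leave unrecognized names unchanged."""
    return [name_map.get(col, col) for col in columns]


name_map = build_name_map(io.StringIO(SAMPLE))
raw_columns = [
    "Percent; Estimate; HOUSEHOLDS BY TYPE - Total households",
    "Percent; HOUSEHOLDS BY TYPE - Family households (families)",
    "Id",
]
print(standardize(raw_columns, name_map))
```

Leaving unrecognized names untouched (the `"Id"` entry here) is a design choice for the sketch; raising an error instead would surface columns that are missing from the table.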

In [1]:
import glob
import os
import pandas as pd
import re

In [2]:
yearFolders = ['2007', '2008', '2009','2010', '2011', '2012', '2013', '2014','2015','2016']
all_colnames = []

#for each year folder, collect the column names from the 4 data profile files,
    #and save one list of names per year in the all_colnames list

#Loop over years (there is one folder per year, with 4 csvs per year)
for folder in yearFolders:
    #get all csvs of the appropriate format, sorted so the file order is reproducible
    dem_files = sorted(glob.glob(os.path.join('data/raw/ACS_2007_2016',folder,'ACS_*_1YR_*with_ann.csv*')))  # trailing * allows for gzipped csv's
    year_pat = re.compile(r'(\d{2})')
    cur_year_colnames = []
    for fname in dem_files:
        #read in a table
        dem_yr_df = pd.read_csv(fname, skiprows=1, header=0)
        #ignore column names that include 'Margin of Error' or exclude 'Percent;'
        dem_yr_df = dem_yr_df[dem_yr_df.columns.drop(list(dem_yr_df.filter(regex='Margin of Error')))]
        bad_col_names = [x for x in dem_yr_df.columns.values if 'Percent;' not in x]
        dem_yr_df = dem_yr_df[dem_yr_df.columns.drop(bad_col_names)]
        #record column names
        cur_year_colnames.append(dem_yr_df.columns.values)
    #done with this year, append
    all_colnames.append(cur_year_colnames)

In [3]:
#write out the given column names
import csv
with open('data/derived/colnames.csv', 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
    spamwriter = csv.writer(csvfile)
    for cur_year_colnames in all_colnames:
        #flatten the list-of-lists so we have one list of colnames for this year instead of four
        flat_colnames = [x for sublist in cur_year_colnames for x in sublist]
        spamwriter.writerow(flat_colnames)


In [4]:
#example:
#read in the mapping from given names to standardized name
col_name2final_name={}
with open('data/derived/colnames_final.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        final_name=row[-1]
        for colname in row:
            col_name2final_name[colname]=final_name

In [5]:
col_name2final_name

{'2007': 'Final',
 '2008': 'Final',
 '2009': 'Final',
 '2010': 'Final',
 '2011': 'Final',
 '2012': 'Final',
 '2013': 'Final',
 '2014': 'Final',
 '2015': 'Final',
 '2016': 'Final',
 'Final': 'Final',
 'Percent; Estimate; HOUSEHOLDS BY TYPE - Total households': 'Percent; HOUSEHOLDS BY TYPE - Total households',
 'Percent; HOUSEHOLDS BY TYPE - Total households': 'Percent; HOUSEHOLDS BY TYPE - Total households',
 'Percent; Estimate; HOUSEHOLDS BY TYPE - Total households - Family households (families)': 'Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)',
 'Percent; HOUSEHOLDS BY TYPE - Family households (families)': 'Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)',
 'Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)': 'Percent; HOUSEHOLDS BY TYPE - Total households - Family households (families)',
 'Percent; Estimate; HOUSEHOLDS BY TYPE - Total households - Family households (families) - With own children