# GSS Data Analysis: Preliminary Modeling

## Cleaning the data

The data currently come in an all-numeric format, which makes them hard to visualize and interpret. Additionally, missing values (NAs) are coded as if they were actual values. We must clean and reformat the data.

### Converting to strings and dropping Nulls

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Loading in the data

df = pd.read_csv('data/GSS.csv')

In [3]:
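# Use the metadata as a lookup for converting the numeric codes in df into strings, and later back into numbers

df_remap = pd.read_excel('data/GSS_metadata.xlsx', sheet_name='Codes')
df_remap = df_remap.iloc[:, 1:]
# ffill() returns a new frame, so the result must be assigned back
# (fillna(method='ffill') is deprecated in recent pandas versions)
df_remap = df_remap.ffill()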
df_remap['Variable Name'] = df_remap['Variable Name'].str.upper()
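
The forward fill above only matters if the metadata sheet lists each variable name once and leaves the following rows blank, which is an assumption about the layout of `GSS_metadata.xlsx`. A minimal sketch of that case, using a made-up `sheet` frame with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Codes' sheet: the variable name
# appears only on the first row of each block of codes.
sheet = pd.DataFrame({
    'Variable Name': ['sex', np.nan, 'happy', np.nan],
    'Code': [1, 2, 1, 2],
    'Label': ['Male', 'Female', 'Very happy', 'Pretty happy'],
})

# ffill() returns a new frame and leaves `sheet` untouched,
# so the result has to be assigned to a name to be kept.
filled = sheet.ffill()
filled['Variable Name'] = filled['Variable Name'].str.upper()

print(filled)
```

Every row of `filled` now carries its variable name, which is what the grouping step later in the notebook relies on.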

In [4]:
# This dataframe will help turn the strings back into numbers
# This will help speed up recoding ordinal values for machine learning

df_remap_reverse = df_remap[['Variable Name','Label','Code']]

In [5]:
def recur_dictify(frame):
    '''Recursive function to turn the meta data dataframe into a dictionary for remapping values'''
    
    # Base Case
    if len(frame.columns) == 1:
        if frame.values.size == 1: return frame.values[0][0]
        return frame.values.squeeze()
    
    # Recursive Case
    grouped = frame.groupby(frame.columns[0])
    dictionary = {name: recur_dictify(group.iloc[:,1:]) for name,group in grouped}
    return dictionary

In [6]:
remaper = recur_dictify(df_remap)
remaper_reverse = recur_dictify(df_remap_reverse)
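
To make the shape of these dictionaries concrete, here is a small sketch that runs the same function on a toy metadata frame. The variable names, codes, and labels in `toy_meta` are invented for illustration, and the column order (`Variable Name`, `Code`, `Label`) is assumed to match `df_remap`.

```python
import pandas as pd

def recur_dictify(frame):
    '''Same function as above, repeated so the example is self-contained.'''
    if len(frame.columns) == 1:
        if frame.values.size == 1:
            return frame.values[0][0]
        return frame.values.squeeze()
    grouped = frame.groupby(frame.columns[0])
    return {name: recur_dictify(group.iloc[:, 1:]) for name, group in grouped}

# Toy metadata (hypothetical values)
toy_meta = pd.DataFrame({
    'Variable Name': ['SEX', 'SEX', 'HAPPY', 'HAPPY', 'HAPPY'],
    'Code': [1, 2, 1, 2, 3],
    'Label': ['Male', 'Female', 'Very happy', 'Pretty happy', 'Not too happy'],
})

# Forward mapping: variable -> code -> label
toy_remaper = recur_dictify(toy_meta)

# Reverse mapping: variable -> label -> code (columns reordered)
toy_remaper_reverse = recur_dictify(toy_meta[['Variable Name', 'Label', 'Code']])

print(toy_remaper['SEX'])
print(toy_remaper_reverse['HAPPY'])
```

Each column of the input becomes one level of nesting, so reordering the columns is all it takes to build the reverse lookup.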

In [7]:
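# Convert the numeric codes to strings, then mark non-responses as missing (NaN)

df_strings = df.replace(remaper)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df_strings = df_strings.replace(["No answer", "Don't know", "Not applicable"], np.nan)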
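
The cell above relies on the nested-dictionary form of `DataFrame.replace`, where the outer keys are column names and the inner dictionaries map old values to new ones. A minimal sketch on a toy frame; the codes 8 and 9 for non-responses are assumptions made for illustration, not values taken from the GSS codebook:

```python
import numpy as np
import pandas as pd

# Toy survey data with hypothetical numeric codes
toy = pd.DataFrame({'SEX': [1, 2, 9], 'HAPPY': [1, 3, 8]})

# Outer keys are column names, inner dicts map code -> label
toy_mapping = {
    'SEX': {1: 'Male', 2: 'Female', 9: 'No answer'},
    'HAPPY': {1: 'Very happy', 3: 'Not too happy', 8: "Don't know"},
}

# Step 1: codes become labels, column by column
toy_strings = toy.replace(toy_mapping)

# Step 2: non-response labels become missing values
toy_strings = toy_strings.replace(
    ["No answer", "Don't know", "Not applicable"], np.nan
)

print(toy_strings)
```

Note that the rows are kept: the non-responses become `NaN` rather than being dropped, so a later `dropna` is still needed if complete cases are required.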

In [8]:
# Convert the strings back into numbers one column at a time with Series.map,
# because the pandas 'replace' function fails for the 'remaper_reverse' dictionary.
# Only columns that appear in the metadata are carried over into the new df.

df = pd.DataFrame()
for key in remaper_reverse.keys():
    df[key] = df_strings[key].map(remaper_reverse[key])
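
As a sketch of what the loop above does, the toy example below maps labels back to codes with `Series.map`. The frame and the dictionary are hypothetical. Any value without an entry in the dictionary, including an existing `NaN`, comes out as `NaN`, and the column becomes floating point as a result.

```python
import numpy as np
import pandas as pd

# Toy string data with one missing response (hypothetical values)
toy_strings = pd.DataFrame({'SEX': ['Male', 'Female', np.nan]})

# Reverse lookup: variable -> label -> code
toy_reverse = {'SEX': {'Male': 1, 'Female': 2}}

# Build the numeric frame one column at a time
toy_numeric = pd.DataFrame()
for column, label_to_code in toy_reverse.items():
    toy_numeric[column] = toy_strings[column].map(label_to_code)

print(toy_numeric)
```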

In [None]:
# Uncomment to save the string-labelled and numeric versions of the data

# df_strings.to_csv('data/data_strings.csv')
# df.to_csv('data/data_clean.csv')
