# Cleaning

[DSLC stages]: Data cleaning and pre-processing


Start by loading in any libraries that you will use in this document.


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import sys


pd.set_option('display.max_columns', 100)


## Domain problem formulation

Write a summary of the problem.





## Data source overview

Briefly describe where the data used in this project came from.


## Step 1: Review background information {#sec-bg-info}

### Information on data collection

Write a summary of how the data was collected.

### Data dictionary

If there is a data dictionary, give some details here.


### Answering questions about the background information

Answer the recommended background information questions from the Data Cleaning chapter.

- *What does each variable measure?* 

- *How was the data collected?* 

- *What are the observational units?* 

- *Is the data relevant to my project?*




## Step 2: Loading in the data


Load in the data. 


In [2]:
data_orig = pd.read_csv(r"../data/anes_timeseries_2020_csv_20220210.csv") 




Let's print a summary of the data frame to check that it has been loaded in correctly:

In [3]:
data_orig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Columns: 1771 entries, version to V203527
dtypes: float64(3), int64(1723), object(45)
memory usage: 111.9+ MB


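The summary above already gives the dimensions of the data: 8,280 rows and 1,771 columns. We only need a subset of these columns, so the next cell defines a dictionary that maps each selected ANES variable code to a readable column name, the expected number of unique values, and the variable type (`cat`, `num`, or `rank`).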


In [4]:
dictionary = {
  "V200002" : {"column": "interviewMode", "unique_values": 3, "type":"cat"},
  "V200010b": {"column": "weights", "unique_values": sys.maxsize, "type": "num"},
  "V201033": {"column": "PRE_VotePresident", "unique_values": 5, "type": "cat"},
  "V201144x": {"column": "approvalOfPresidentCovidResponse", "unique_values": 4, "type":"rank"},
  "V201218": {"column": "PRE_RaceOutcomePrediction", "unique_values": 2, "type": "cat"},
  "V201151": {"column": "PRE_ThermoBiden", "unique_values": 100, "type": "rank"},
  "V201152": {"column": "PRE_ThermoTrump", "unique_values": 100, "type": "rank"},
  "V201153": {"column": "PRE_ThermoHarris", "unique_values": 100, "type": "rank"},
  "V201154": {"column": "PRE_ThermoPence", "unique_values": 100, "type": "rank"},
  "V201155": {"column": "PRE_ThermoObama", "unique_values": 100, "type": "rank"},
  "V201156": {"column": "PRE_ThermoDemParty", "unique_values": 100, "type": "rank"},
  "V201157": {"column": "PRE_ThermoRepParty", "unique_values": 100, "type": "rank"},
  "V201553": {"column": "PRE_ParentNativeStatus", "unique_values": 3, "type": "cat"},
  "V201587": {"column": "PRE_YearsAtAddress", "unique_values": 40, "type": "num"},
  "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
  "V201225x": {"column": "PRE_SummaryVoteDutyChoice", "unique_values": 7, "type": "rank"},
  "V201231x": {"column": "PRE_PartyID", "unique_values": 7, "type": "rank"},
  "V201246": {"column": "PRE_ScaleSpendingServices", "unique_values": 7, "type": "rank"},
  #"V201018": {"column": "PRE_PartyRegistration", "unique_values": 4, "type": "cat"},
  "V201115": {"column": "PRE_CountryDirection", "unique_values": 5, "type": "rank"},
  "V201233": {"column": "PRE_GovTrust", "unique_values": 5, "type": "rank"},
  "V201324": {"column": "PRE_EconomyView", "unique_values": 5, "type": "rank"},
  "V201340": {"column": "PRE_AbortionRightsSC", "unique_values": 3, "type": "cat"},
  "V201507x": {"column": "Age", "unique_values": 80, "type": "num"},
  "V201510": {"column": "EducationLevel", "unique_values": 8, "type": "cat"},
  "V201517": {"column": "WorkStatus", "unique_values": 2, "type": "cat"},
  "V201617x": {"column": "Income", "unique_values": 22, "type": "rank"},
  "V201549x": {"column": "Race", "unique_values": 6, "type": "cat"},
  "V202054x": {"column": "StateRegistration", "unique_values": 56, "type": "cat"},
  "V201567": {"column": "HouseholdChildren", "unique_values": 5, "type": "rank"},
  "V201630b": {"column": "PRE_Fox_Hannity", "unique_values": 2, "type": "cat"},
  "V201630c": {"column": "PRE_Fox_TuckerCarlsonTonight", "unique_values": 2, "type": "cat"},
  "V201630k": {"column": "PRE_Fox_SpecialReportBretBaier", "unique_values": 2, "type": "cat"},
  "V201630f": {"column": "PRE_Fox_TheFive", "unique_values": 2, "type": "cat"},
  "V201630g": {"column": "PRE_Fox_TheIngrahamAngle", "unique_values": 2, "type": "cat"},
  "V201630h": {"column": "PRE_Fox_TheStoryMarthaMacCallum", "unique_values": 2, "type": "cat"},
  "V201631k": {"column": "PRE_Fox_FoxAndFriends", "unique_values": 2, "type": "cat"},
  "V201634f": {"column": "PRE_Fox_FoxNewsWebsite", "unique_values": 2, "type": "cat"},
  "V201630i": {"column": "PRE_CNN_TheLeadJakeTapper", "unique_values": 2, "type": "cat"},
  "V201630j": {"column": "PRE_CNN_AndersonCooper360", "unique_values": 2, "type": "cat"},
  "V201630q": {"column": "PRE_CNN_CuomoPrimeTime", "unique_values": 2, "type": "cat"},
  "V201631b": {"column": "PRE_CNN_ErinBurnettOutFront", "unique_values": 2, "type": "cat"},
  "V201634b": {"column": "PRE_CNN_CNNWebsite", "unique_values": 2, "type": "cat"},
  "V201630n": {"column": "PRE_ABC_WorldNewsTonight", "unique_values": 2, "type": "cat"},
  "V201631d": {"column": "PRE_ABC_2020", "unique_values": 2, "type": "cat"},
  "V201631i": {"column": "PRE_ABC_GoodMorningAmerica", "unique_values": 2, "type": "cat"},
  "V201646": {"column": "PRE_PartyMoreHouseMembers", "unique_values": 2, "type": "cat"},
  "V201645": {"column": "PRE_FederalSpendingKnowledge", "unique_values": 4, "type": "cat"},
  "V201351": {"column": "PRE_VoteAccuracy", "unique_values": 5, "type": "rank"},
  "V201650": {"column": "PRE_SurveySeriousness", "unique_values": 5, "type": "rank"},
  "V201249": {"column": "PRE_ScaleDefenseSpending", "unique_values": 7, "type": "rank"},
  "V201252": {"column": "PRE_ScaleMedInsurance", "unique_values": 7, "type": "rank"},
  "V201380": {"column": "PRE_CorruptionView", "unique_values": 3, "type": "cat"},
  "V201258": {"column": "PRE_ScaleGovAssistanceBlacks", "unique_values": 7, "type": "rank"},
  "V201255": {"column": "PRE_ScaleJobIncome", "unique_values": 7, "type": "rank"},
  "V202051": {"column": "POST_RegistrationStatus", "unique_values": 3, "type": "cat"},
  "V202068x": {"column": "POST_Voted2020", "unique_values": 3, "type": "cat"},
  "V202073": {"column": "POST_VotePresident", "unique_values": 4, "type": "cat"},
  "V202219": {"column": "POST_VoteAccuracy", "unique_values": 5, "type": "rank"},
  "V202156": {"column": "POST_ThermoHarris", "unique_values": 100, "type": "rank"},
  "V202157": {"column": "POST_ThermoPence", "unique_values": 100, "type": "rank"},
  "V202143": {"column": "POST_ThermoBiden", "unique_values": 100, "type": "rank"},
  "V202144": {"column": "POST_ThermoTrump", "unique_values": 100, "type": "rank"},
  "V202123": {"column": "POST_ReasonNotVoting", "unique_values": 15, "type": "cat"},
  "V202205y1": {"column": "POST_ProblemMention", "unique_values": 82, "type": "cat"},
  #"V202580": {"column": "POST_ScaleMedInsurance", "unique_values": 7, "type": "rank"},
  #"V202624": {"column": "POST_HealthSpending", "unique_values": 7, "type": "rank"},
  "V202644": {"column": "POST_RespondentHonesty", "unique_values": 3, "type": "cat"}
}


In [5]:
column_labels = {
    "V200002":{
        "column": "interviewMode",
        "labels":{
            1: "Video", 
            2: "Telephone", 
            3: "Web"
        }
    },
    "V201033": {
        "column": "PRE_VotePresident",
        "labels": {
            1: "Joe Biden",
            2: "Donald Trump",
            3: "Jo Jorgensen",
            4: "Howie Hawkins",
            5: "Other"
        }
    },
    "V201218": {
        "column": "PRE_RaceOutcomePrediction",
        "labels": {
            1: "Will be close",
            2: "Win by quite a bit"
        }
    },
    "V201553": {
        "column": "PRE_ParentNativeStatus",
        "labels": {
            1: "Both parents born in the US",
            2: "One parent born in the US",
            3: "Both parents born in another country"
        }
    },
    "V201600": {
        "column": "Sex",
        "labels": {
            1: "Male",
            2: "Female"
        }
    },
    "V201340": {
        "column": "PRE_AbortionRightsSC",
        "labels": {
            1: "Pleased",
            2: "Upset",
            3: "Neither pleased nor upset"
        }
    },
    "V201510": {
        "column": "EducationLevel",
        "labels": {
            1: "Less than high school credential",
            2: "High school graduate",
            3: "Some college but no degree",
            4: "Associate degree - occupational/vocational",
            5: "Associate degree - academic",
            6: "Bachelor’s degree",
            7: "Master’s degree",
            8: "Professional/Doctoral degree",
        }
    },
    "V201517": {
        "column": "WorkStatus",
        "labels": {
            1: "Yes",
            2: "No, did not work (or retired)"
        }
    },
    "V201549x": {
        "column": "Race",
        "labels": {
            1: "White, non-Hispanic",
            2: "Black, non-Hispanic",
            3: "Hispanic",
            4: "Asian/Pacific Islander, non-Hispanic",
            5: "Native American/Alaska Native, non-Hispanic",
            6: "Multiple races, non-Hispanic"
        }
    },
    "V202054x": {
        "column": "StateRegistration",
        "labels": {
            1: "Alabama",
            2: "Alaska",
            4: "Arizona",
            5: "Arkansas",
            6: "California",
            8: "Colorado",
            9: "Connecticut",
            10: "Delaware",
            11: "Washington DC",
            12: "Florida",
            13: "Georgia",
            15: "Hawaii",
            16: "Idaho",
            17: "Illinois",
            18: "Indiana",
            19: "Iowa",
            20: "Kansas",
            21: "Kentucky",
            22: "Louisiana",
            23: "Maine",
            24: "Maryland",
            25: "Massachusetts",
            26: "Michigan",
            27: "Minnesota",
            28: "Mississippi",
            29: "Missouri",
            30: "Montana",
            31: "Nebraska",
            32: "Nevada",
            33: "New Hampshire",
            34: "New Jersey",
            35: "New Mexico",
            36: "New York",
            37: "North Carolina",
            38: "North Dakota",
            39: "Ohio",
            40: "Oklahoma",
            41: "Oregon",
            42: "Pennsylvania",
            44: "Rhode Island",
            45: "South Carolina",
            46: "South Dakota",
            47: "Tennessee",
            48: "Texas",
            49: "Utah",
            50: "Vermont",
            51: "Virginia",
            53: "Washington",
            54: "West Virginia",
            55: "Wisconsin",
            56: "Wyoming"
        }
    },
    "V201646": {
        "column": "PRE_PartyMoreHouseMembers",
        "labels": {
            1: "correct (D)",
            2: "incorrect (R)"
        }
    },
    "V201645": {
        "column": "PRE_FederalSpendingKnowledge",
        "labels": {
            1: "correct (Foreign aid)",
            0: "incorrect (Medicare, National defense, SS)"
        }
    },
    "V201380": {
        "column": "PRE_CorruptionView",
        "labels": {
            1: "Increased",
            2: "Decreased",
            3: "Stayed the same"
        }
    },
    "V202051": {
        "column": "POST_RegistrationStatus",
        "labels": {
            1: "Registered at this address",
            2: "Registered at a different address",
            3: "Not currently registered"
        }
    },
    "V202068x": {
        "column": "POST_Voted2020",
        "labels": {
            0: "Not registered and did not vote",
            1: "Registered and did not vote",
            2: "Voted"
        }
    },
    "V202073": {
        "column": "POST_VotePresident",
        "labels": {
            1: "Joe Biden",
            2: "Donald Trump",
            3: "Jo Jorgensen",
            4: "Howie Hawkins",
            5: "Other candidate {SPECIFY}"
        }
    },
    "V202205y1": {
        "column": "POST_ProblemMention",
        "labels": {
            1: "Defense spending",
            2: "Middle East",
            3: "Iraq",
            4: "War",
            5: "Terrorism",
            6: "Veterans",
            7: "National defense (all other)",
            8: "Foreign aid",
            9: "Foreign Trade",
            10: "Protection of US jobs",
            11: "Serbia /Balkans",
            12: "China",
            13: "International affairs (all other)",
            14: "Energy crisis",
            15: "Energy prices",
            16: "Energy (all other)",
            17: "Environment",
            18: "Natural Resources (all other)",
            19: "Education and training",
            20: "School funding",
            21: "Education (all other)",
            22: "AIDS",
            23: "Medicare",
            24: "Health (all other)",
            25: "Welfare",
            26: "Poverty",
            27: "Employment",
            28: "Housing",
            29: "Social security",
            30: "Income (all other)",
            31: "Crime",
            32: "Race relations",
            33: "Illegal drugs",
            34: "Police problems",
            35: "Guns",
            36: "Corporate Corruption",
            37: "Justice (all other)",
            38: "Budget",
            39: "Size of government",
            40: "Taxes",
            41: "Immigration",
            42: "Campaign finance",
            43: "Political corruption",
            44: "Ethics",
            45: "Government power",
            46: "Budget priorities",
            47: "Partisan politics",
            48: "Politicians",
            49: "Government (all other)",
            50: "The economy",
            51: "Stock market",
            52: "Economic inequality",
            53: "Recession",
            54: "Inflation",
            55: "Economics (all other)",
            56: "Agriculture",
            57: "Science",
            58: "Commerce",
            59: "Transportation",
            60: "Community development",
            61: "Abortion",
            62: "Child care",
            63: "Overpopulation",
            64: "Public morality",
            65: "Domestic violence",
            66: "Family",
            67: "Young people",
            68: "Sexual identity /LGBT+ issues",
            69: "The media",
            75: "Sexism /Gender issues",
            76: "Afghanistan",
            77: "Syria",
            78: "Elections",
            79: "Religion",
            80: "Civility",
            81: "Unity /division",
            82: "Health care",
            83: "Other"
        }
    }
}


In [6]:
path = r"../data/anes_timeseries_2020_csv_20220210.csv"

In [7]:
import sys
import os
import importlib

# Add the directory containing the file to the Python path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(path), 'functions')))

# Import the function
from functions import load_data

# Reload the module to reflect any updates
importlib.reload(load_data)

data_filtered = load_data.load_data(path, dictionary)



In [8]:
data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 65 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   interviewMode                     8280 non-null   int64 
 1   weights                           8280 non-null   object
 2   PRE_VotePresident                 8280 non-null   int64 
 3   approvalOfPresidentCovidResponse  8280 non-null   int64 
 4   PRE_RaceOutcomePrediction         8280 non-null   int64 
 5   PRE_ThermoBiden                   8280 non-null   int64 
 6   PRE_ThermoTrump                   8280 non-null   int64 
 7   PRE_ThermoHarris                  8280 non-null   int64 
 8   PRE_ThermoPence                   8280 non-null   int64 
 9   PRE_ThermoObama                   8280 non-null   int64 
 10  PRE_ThermoDemParty                8280 non-null   int64 
 11  PRE_ThermoRepParty                8280 non-null   int64 
 12  PRE_ParentNativeStat
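
The `load_data` helper is imported from a separate module and is not shown in this notebook. As a rough sketch only, assuming it keeps the variable codes listed in `dictionary` and renames them to the readable `column` names, it could look something like this. The function name `load_selected_columns` and the tiny in-memory CSV are hypothetical and used purely for illustration.

```python
import io

import pandas as pd


def load_selected_columns(source, dictionary):
    """Read only the columns named in `dictionary` and rename them.

    Hypothetical stand-in for the project's own `load_data.load_data`.
    """
    codes = list(dictionary.keys())
    # low_memory=False reads the file in one pass, so pandas infers a single
    # dtype per column instead of inferring it chunk by chunk.
    df = pd.read_csv(source, usecols=codes, low_memory=False)
    rename_map = {code: meta["column"] for code, meta in dictionary.items()}
    # Reorder to match the dictionary, then apply the readable names.
    return df[codes].rename(columns=rename_map)


# Toy stand-in for the real CSV file (values are made up).
toy_csv = io.StringIO("V200002,V201600,V999999\n3,1,7\n2,2,8\n")
toy_dictionary = {
    "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
    "V200002": {"column": "interviewMode", "unique_values": 3, "type": "cat"},
}
toy_loaded = load_selected_columns(toy_csv, toy_dictionary)
print(toy_loaded)
```

Passing `usecols` avoids loading all 1,771 columns when only a few dozen are needed.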



## Step 3: Examine the data

In this section we explore the common messy data traits to identify any cleaning action items.





### Finding invalid values



In [9]:
data_filtered.describe()

Unnamed: 0,interviewMode,PRE_VotePresident,approvalOfPresidentCovidResponse,PRE_RaceOutcomePrediction,PRE_ThermoBiden,PRE_ThermoTrump,PRE_ThermoHarris,PRE_ThermoPence,PRE_ThermoObama,PRE_ThermoDemParty,PRE_ThermoRepParty,PRE_ParentNativeStatus,PRE_YearsAtAddress,Sex,PRE_SummaryVoteDutyChoice,PRE_PartyID,PRE_ScaleSpendingServices,PRE_CountryDirection,PRE_GovTrust,PRE_EconomyView,PRE_AbortionRightsSC,Age,EducationLevel,WorkStatus,Income,Race,StateRegistration,HouseholdChildren,PRE_Fox_Hannity,PRE_Fox_TuckerCarlsonTonight,PRE_Fox_SpecialReportBretBaier,PRE_Fox_TheFive,PRE_Fox_TheIngrahamAngle,PRE_Fox_TheStoryMarthaMacCallum,PRE_Fox_FoxAndFriends,PRE_Fox_FoxNewsWebsite,PRE_CNN_TheLeadJakeTapper,PRE_CNN_AndersonCooper360,PRE_CNN_CuomoPrimeTime,PRE_CNN_ErinBurnettOutFront,PRE_CNN_CNNWebsite,PRE_ABC_WorldNewsTonight,PRE_ABC_2020,PRE_ABC_GoodMorningAmerica,PRE_PartyMoreHouseMembers,PRE_FederalSpendingKnowledge,PRE_VoteAccuracy,PRE_SurveySeriousness,PRE_ScaleDefenseSpending,PRE_ScaleMedInsurance,PRE_CorruptionView,PRE_ScaleGovAssistanceBlacks,PRE_ScaleJobIncome,POST_RegistrationStatus,POST_Voted2020,POST_VotePresident,POST_VoteAccuracy,POST_ThermoHarris,POST_ThermoPence,POST_ThermoBiden,POST_ThermoTrump,POST_ReasonNotVoting,POST_ProblemMention,POST_RespondentHonesty
count,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0
mean,2.896498,1.165097,2.897705,1.253261,47.812319,39.055314,49.675362,44.637077,59.907488,44.752536,43.131763,1.231643,11.610507,1.456522,3.251449,3.833816,18.079106,2.493961,3.420531,3.212802,1.977899,49.038889,5.532126,1.399758,10.221739,1.498913,23.251691,0.533213,-0.253623,-0.250242,-0.267633,-0.2593,-0.268961,-0.275362,-0.229469,-0.306039,-0.268116,-0.238164,-0.256763,-0.278382,-0.257126,-0.212319,-0.236473,-0.224758,1.086957,2.03285,3.004469,4.558092,16.971256,14.064855,1.620169,13.264734,14.743841,-1.261957,1.011715,0.238889,1.355193,46.183816,40.408454,46.919807,33.297705,-0.520652,80.242271,-1.459058
std,0.423705,1.937772,1.345693,0.992782,36.806871,40.571078,65.88754,55.784589,37.425584,35.949072,36.148125,0.969502,12.098934,1.066932,2.40667,2.39749,33.18717,1.30409,1.207423,1.261308,1.233611,20.771267,9.98756,0.823674,8.444621,1.698425,18.594514,1.313558,0.951118,0.953788,0.939846,0.946592,0.938759,0.933479,0.96977,1.008629,0.939451,0.963166,0.948621,0.930962,1.045849,0.982437,0.96446,0.973295,1.46808,2.098351,1.485154,1.447893,32.377374,29.738682,1.348386,28.716329,30.002713,1.865796,2.426249,2.56632,2.82523,48.731938,47.695825,38.600682,40.314633,3.970426,151.006446,1.684504
min,1.0,-9.0,-2.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-2.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-7.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-7.0
25%,3.0,1.0,1.0,1.0,15.0,0.0,0.0,0.0,30.0,15.0,15.0,1.0,2.0,1.0,1.0,2.0,4.0,2.0,3.0,2.0,2.0,35.0,3.0,1.0,4.0,1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,1.0,1.0,2.0,5.0,3.0,2.0,1.0,2.0,3.0,-1.0,2.0,-1.0,1.0,0.0,0.0,0.0,0.0,-1.0,32.0,-1.0
50%,3.0,1.0,4.0,1.0,50.0,30.0,50.0,50.0,70.0,50.0,40.0,1.0,7.0,2.0,2.0,4.0,5.0,3.0,4.0,3.0,2.0,51.0,5.0,1.0,11.0,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,3.0,5.0,5.0,4.0,1.0,4.0,4.0,-1.0,2.0,1.0,2.0,50.0,30.0,50.0,10.0,-1.0,50.0,-1.0
75%,3.0,2.0,4.0,2.0,85.0,85.0,85.0,75.0,100.0,70.0,70.0,1.0,19.0,2.0,6.0,6.0,7.0,3.0,4.0,4.0,3.0,65.0,6.0,2.0,17.0,2.0,39.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,5.0,6.0,7.0,3.0,6.0,7.0,-1.0,2.0,2.0,3.0,85.0,75.0,85.0,70.0,-1.0,82.0,-1.0
max,3.0,12.0,4.0,2.0,998.0,100.0,999.0,999.0,100.0,998.0,998.0,3.0,40.0,2.0,7.0,7.0,99.0,5.0,5.0,5.0,3.0,80.0,95.0,2.0,22.0,6.0,56.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,4.0,5.0,5.0,99.0,99.0,3.0,99.0,99.0,3.0,2.0,12.0,5.0,999.0,999.0,100.0,100.0,16.0,997.0,3.0
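
The summary above shows minimum values of -9 in most columns and maximum values of 998 or 999 in several thermometer columns, which are likely codes for non-responses rather than real answers (the codebook should confirm this). A minimal sketch for counting such values per column follows, assuming that anything outside a known valid range is invalid. The helper name `count_out_of_range`, the ranges, and the toy data are all hypothetical.

```python
import pandas as pd


def count_out_of_range(df, valid_ranges):
    """Count values below the minimum or above the maximum for each column.

    `valid_ranges` maps a column name to a (low, high) tuple.
    """
    rows = []
    for col, (low, high) in valid_ranges.items():
        values = df[col]
        rows.append({
            "column": col,
            "below_min": int((values < low).sum()),
            "above_max": int((values > high).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data that mimics the coding pattern seen in the summary (made up).
toy = pd.DataFrame({
    "PRE_ThermoBiden": [50, -9, 998, 100, 0],
    "Sex": [1, 2, -9, 2, 1],
})
toy_counts = count_out_of_range(
    toy, {"PRE_ThermoBiden": (0, 100), "Sex": (1, 2)}
)
print(toy_counts)
```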


#### Numeric variables



In [10]:
from functions import prepare_data

importlib.reload(prepare_data)

ranked_columns, num_columns, cat_columns = prepare_data.extract_ranked_num_and_cat_columns(dictionary)

print(ranked_columns)
print(num_columns)

['approvalOfPresidentCovidResponse', 'PRE_ThermoBiden', 'PRE_ThermoTrump', 'PRE_ThermoHarris', 'PRE_ThermoPence', 'PRE_ThermoObama', 'PRE_ThermoDemParty', 'PRE_ThermoRepParty', 'PRE_SummaryVoteDutyChoice', 'PRE_PartyID', 'PRE_ScaleSpendingServices', 'PRE_CountryDirection', 'PRE_GovTrust', 'PRE_EconomyView', 'Income', 'HouseholdChildren', 'PRE_VoteAccuracy', 'PRE_SurveySeriousness', 'PRE_ScaleDefenseSpending', 'PRE_ScaleMedInsurance', 'PRE_ScaleGovAssistanceBlacks', 'PRE_ScaleJobIncome', 'POST_VoteAccuracy', 'POST_ThermoHarris', 'POST_ThermoPence', 'POST_ThermoBiden', 'POST_ThermoTrump']
['weights', 'PRE_YearsAtAddress', 'Age']
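
The helper `extract_ranked_num_and_cat_columns` is defined outside this notebook. Judging from its output, it groups the readable column names by the `type` field of `dictionary`. A minimal sketch under that assumption follows; the function name `split_columns_by_type` is hypothetical.

```python
def split_columns_by_type(dictionary):
    """Return (ranked, numeric, categorical) lists of readable column names."""
    ranked, numeric, categorical = [], [], []
    buckets = {"rank": ranked, "num": numeric, "cat": categorical}
    for meta in dictionary.values():
        # Each entry goes into the list that matches its declared type.
        buckets[meta["type"]].append(meta["column"])
    return ranked, numeric, categorical


toy_dictionary = {
    "V200002": {"column": "interviewMode", "unique_values": 3, "type": "cat"},
    "V201507x": {"column": "Age", "unique_values": 80, "type": "num"},
    "V201231x": {"column": "PRE_PartyID", "unique_values": 7, "type": "rank"},
    "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
}
toy_ranked, toy_num, toy_cat = split_columns_by_type(toy_dictionary)
print(toy_ranked)
print(toy_num)
print(toy_cat)
```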


#### Categorical variables




In [11]:
print(cat_columns)

['interviewMode', 'PRE_VotePresident', 'PRE_RaceOutcomePrediction', 'PRE_ParentNativeStatus', 'Sex', 'PRE_AbortionRightsSC', 'EducationLevel', 'WorkStatus', 'Race', 'StateRegistration', 'PRE_Fox_Hannity', 'PRE_Fox_TuckerCarlsonTonight', 'PRE_Fox_SpecialReportBretBaier', 'PRE_Fox_TheFive', 'PRE_Fox_TheIngrahamAngle', 'PRE_Fox_TheStoryMarthaMacCallum', 'PRE_Fox_FoxAndFriends', 'PRE_Fox_FoxNewsWebsite', 'PRE_CNN_TheLeadJakeTapper', 'PRE_CNN_AndersonCooper360', 'PRE_CNN_CuomoPrimeTime', 'PRE_CNN_ErinBurnettOutFront', 'PRE_CNN_CNNWebsite', 'PRE_ABC_WorldNewsTonight', 'PRE_ABC_2020', 'PRE_ABC_GoodMorningAmerica', 'PRE_PartyMoreHouseMembers', 'PRE_FederalSpendingKnowledge', 'PRE_CorruptionView', 'POST_RegistrationStatus', 'POST_Voted2020', 'POST_VotePresident', 'POST_ReasonNotVoting', 'POST_ProblemMention', 'POST_RespondentHonesty']


### Examining missing values




In [12]:
# Count missing values for each column
missing_values_count = data_filtered.isna().sum()

# Display the counts
print("Missing values count for each column:")
print(missing_values_count)

# Filter out columns with no missing values
missing_values_greater_than_zero = missing_values_count[missing_values_count > 0]

print()
# Display the counts for columns with missing values
print("Missing values count for columns with missing values greater than 0:")
print(missing_values_greater_than_zero)


Missing values count for each column:
interviewMode                       0
weights                             0
PRE_VotePresident                   0
approvalOfPresidentCovidResponse    0
PRE_RaceOutcomePrediction           0
                                   ..
POST_ThermoBiden                    0
POST_ThermoTrump                    0
POST_ReasonNotVoting                0
POST_ProblemMention                 0
POST_RespondentHonesty              0
Length: 65, dtype: int64

Missing values count for columns with missing values greater than 0:
Series([], dtype: int64)


### Examining the data format



### Assessing column names and variable type



In [13]:
data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 65 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   interviewMode                     8280 non-null   int64 
 1   weights                           8280 non-null   object
 2   PRE_VotePresident                 8280 non-null   int64 
 3   approvalOfPresidentCovidResponse  8280 non-null   int64 
 4   PRE_RaceOutcomePrediction         8280 non-null   int64 
 5   PRE_ThermoBiden                   8280 non-null   int64 
 6   PRE_ThermoTrump                   8280 non-null   int64 
 7   PRE_ThermoHarris                  8280 non-null   int64 
 8   PRE_ThermoPence                   8280 non-null   int64 
 9   PRE_ThermoObama                   8280 non-null   int64 
 10  PRE_ThermoDemParty                8280 non-null   int64 
 11  PRE_ThermoRepParty                8280 non-null   int64 
 12  PRE_ParentNativeStat



## Step 4: Prepare the data

Don't forget to split the data into training, validation and test sets before you clean and pre-process it!
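
A split is not shown in the cells below, so here is one possible way to do it, as a sketch only. It assumes a simple random split with 60% training, 20% validation, and 20% test rows; the function name `split_train_val_test`, the fractions, and the seed are illustrative choices, not part of the original workflow.

```python
import numpy as np
import pandas as pd


def split_train_val_test(df, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split the rows of `df` into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))  # shuffled row positions
    n_test = int(round(len(df) * test_frac))
    n_val = int(round(len(df) * val_frac))
    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return df.iloc[train_idx], df.iloc[val_idx], df.iloc[test_idx]


toy = pd.DataFrame({"x": range(10)})
toy_train, toy_val, toy_test = split_train_val_test(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

Fixing the seed makes the split reproducible across notebook runs.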

In [14]:
data_temp = pd.to_numeric(data_filtered['weights'], errors='coerce')

# Identify problematic values
problematic_values = data_filtered.loc[data_temp.isna(), ['weights']]

print("Problematic values converted to NaN:")
print(problematic_values)


Problematic values converted to NaN:
     weights
52          
55          
122         
147         
151         
...      ...
8215        
8225        
8240        
8248        
8271        

[827 rows x 1 columns]


In [15]:
data_filtered['weights'] = pd.to_numeric(data_filtered['weights'], errors='coerce')

In [16]:
from functions import prepare_data

importlib.reload(prepare_data)

data_processed = prepare_data.clean_columns(data_filtered, dictionary)

In [17]:
data_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 65 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   interviewMode                     8280 non-null   int64  
 1   weights                           7453 non-null   float64
 2   PRE_VotePresident                 7049 non-null   float64
 3   approvalOfPresidentCovidResponse  8238 non-null   float64
 4   PRE_RaceOutcomePrediction         8213 non-null   float64
 5   PRE_ThermoBiden                   8060 non-null   float64
 6   PRE_ThermoTrump                   8048 non-null   float64
 7   PRE_ThermoHarris                  7980 non-null   float64
 8   PRE_ThermoPence                   8045 non-null   float64
 9   PRE_ThermoObama                   8165 non-null   float64
 10  PRE_ThermoDemParty                8152 non-null   float64
 11  PRE_ThermoRepParty                8141 non-null   float64
 12  PRE_Pa
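
After `clean_columns`, most columns contain missing values and have become `float64`, which suggests that invalid codes were replaced with `NaN`. The helper itself is not shown, so the following is only a sketch of that idea. It assumes negative values, and values above a per-column maximum, are invalid; the function name `mask_invalid_codes` and the `valid_max` argument are hypothetical.

```python
import pandas as pd


def mask_invalid_codes(df, valid_max):
    """Replace negative values and values above a known maximum with NaN.

    `valid_max` maps a column name to its largest valid value; columns that
    are not listed are only checked for negative values.
    """
    cleaned = df.copy()
    for col in cleaned.columns:
        values = cleaned[col]
        invalid = values < 0
        if col in valid_max:
            invalid = invalid | (values > valid_max[col])
        # Only touch columns that contain invalid codes, so fully valid
        # integer columns keep their integer dtype.
        if invalid.any():
            cleaned[col] = values.mask(invalid)
    return cleaned


toy = pd.DataFrame({
    "interviewMode": [3, 3, 2],
    "PRE_ThermoBiden": [50, -9, 998],
})
toy_cleaned = mask_invalid_codes(toy, {"PRE_ThermoBiden": 100})
print(toy_cleaned)
print(toy_cleaned.dtypes)
```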

In [18]:
# Minimum number of non-missing values a column needs in order to be kept (50% of rows)
threshold = len(data_processed) * 0.5

# Get the initial columns
original_columns = set(data_processed.columns)

# Drop columns with fewer than `threshold` non-missing values (more than 50% missing)
data_dropped = data_processed.dropna(axis=1, thresh=threshold)

# Get the dropped columns
dropped_columns = original_columns - set(data_dropped.columns)

print("\nDropped columns:")
print(dropped_columns)


Dropped columns:
{'POST_RespondentHonesty', 'POST_RegistrationStatus', 'POST_ReasonNotVoting'}
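
Note that the `thresh` argument of `dropna` counts non-missing values: a column is kept only if it has at least `thresh` of them. The toy example below (made-up data) illustrates this, together with the per-column missing fraction that the rule is based on.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, 3.0, np.nan],
    "half_present": [1.0, np.nan, 3.0, np.nan],
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan],
})

# Fraction of missing values in each column.
missing_fraction = toy.isna().mean()
print(missing_fraction)

# Keep columns with at least 50% non-missing values (2 of 4 rows here).
min_non_missing = int(len(toy) * 0.5)
toy_kept = toy.dropna(axis=1, thresh=min_non_missing)
print(list(toy_kept.columns))
```

A column with exactly half of its values missing is kept, because it still meets the minimum count.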


In [19]:
# Safely remove keys from the dictionary
for col in list(dictionary.keys()):  # Iterate over a list of keys to avoid modifying during iteration
    if dictionary[col]['column'] in dropped_columns:
        print(dictionary[col]['column'])
        del dictionary[col]

POST_RegistrationStatus
POST_ReasonNotVoting
POST_RespondentHonesty


In [20]:
importlib.reload(prepare_data)

# Test for optimal k
data_imputed = prepare_data.knn_impute(data_dropped, dictionary)

In [21]:
data_imputed['POST_ProblemMention'].value_counts()

POST_ProblemMention
82    2156
32     673
81     608
47     419
49     364
      ... 
15       1
23       1
1        1
22       1
76       1
Name: count, Length: 76, dtype: int64

In [22]:
importlib.reload(prepare_data)
data_imputed = prepare_data.group_top_and_other(data_imputed,'POST_ProblemMention')

In [23]:
print(data_imputed['POST_ProblemMention'].value_counts())

POST_ProblemMention
83    2806
82    2156
32     673
81     608
47     419
49     364
50     339
48     308
24     238
78     188
17     181
Name: count, dtype: int64
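
The output above suggests that `group_top_and_other` keeps the ten most frequent codes and recodes everything else as 83 ("Other"). The helper is not shown here, so this is a sketch under that assumption; the function name `keep_top_categories` and its default arguments are hypothetical.

```python
import pandas as pd


def keep_top_categories(series, n_top=10, other_code=83):
    """Keep the `n_top` most frequent values and recode the rest."""
    top = series.value_counts().index[:n_top]
    # Values in the top list are kept; everything else becomes `other_code`.
    return series.where(series.isin(top), other_code)


toy = pd.Series([82, 82, 82, 32, 32, 81, 15, 23])
toy_grouped = keep_top_categories(toy, n_top=2)
print(toy_grouped.value_counts())
```

Grouping rare categories this way keeps the number of one-hot columns manageable later on.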


In [24]:
importlib.reload(prepare_data)
data_imputed = prepare_data.replace_all_other_cols(data_imputed, 'PRE_FederalSpendingKnowledge', 1)

In [25]:
data_imputed['PRE_FederalSpendingKnowledge'].value_counts()

PRE_FederalSpendingKnowledge
0    5228
1    3052
Name: count, dtype: int64
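
Here the column ends up with only the values 0 and 1, which matches the labels defined earlier (1 for the correct answer, 0 for any incorrect answer). A minimal sketch of such a recoding follows, assuming `replace_all_other_cols` sets every value other than the given one to 0; the function name `recode_as_binary` is hypothetical.

```python
import pandas as pd


def recode_as_binary(df, column, keep_value):
    """Set `column` to 1 where it equals `keep_value` and to 0 elsewhere."""
    recoded = df.copy()
    recoded[column] = (recoded[column] == keep_value).astype(int)
    return recoded


toy = pd.DataFrame({"PRE_FederalSpendingKnowledge": [1, 2, 3, 4, 1]})
toy_recoded = recode_as_binary(toy, "PRE_FederalSpendingKnowledge", 1)
print(toy_recoded["PRE_FederalSpendingKnowledge"].value_counts())
```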

In [26]:
importlib.reload(prepare_data)
data_combined = prepare_data.combine_columns_by_group(data_imputed)

In [27]:
data_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 49 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   interviewMode                     8280 non-null   int64  
 1   weights                           8280 non-null   float64
 2   PRE_VotePresident                 8280 non-null   int64  
 3   approvalOfPresidentCovidResponse  8280 non-null   int64  
 4   PRE_RaceOutcomePrediction         8280 non-null   int64  
 5   PRE_ThermoBiden                   8280 non-null   int64  
 6   PRE_ThermoTrump                   8280 non-null   int64  
 7   PRE_ThermoHarris                  8280 non-null   int64  
 8   PRE_ThermoPence                   8280 non-null   int64  
 9   PRE_ThermoObama                   8280 non-null   int64  
 10  PRE_ThermoDemParty                8280 non-null   int64  
 11  PRE_ThermoRepParty                8280 non-null   int64  
 12  PRE_Pa
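
The column count drops from 62 to 49 here, and the final data contains `mentionFox`, `mentionABC`, and `mentionCNN`, so the 16 individual media columns appear to have been collapsed into one indicator per outlet. The helper is not shown, so the following is a sketch that assumes an indicator is 1 when any column in the group equals 1. The function name `combine_by_prefix` and the prefix mapping are hypothetical.

```python
import pandas as pd


def combine_by_prefix(df, prefix_to_name):
    """Collapse each group of 0/1 columns into a single 0/1 indicator."""
    combined = df.copy()
    for prefix, new_name in prefix_to_name.items():
        members = [c for c in combined.columns if c.startswith(prefix)]
        # 1 if the respondent mentioned at least one programme in the group.
        combined[new_name] = (combined[members] == 1).any(axis=1).astype(int)
        combined = combined.drop(columns=members)
    return combined


toy = pd.DataFrame({
    "Age": [46, 37, 40],
    "PRE_Fox_Hannity": [1, 0, 0],
    "PRE_Fox_TheFive": [0, 0, 0],
    "PRE_CNN_CNNWebsite": [0, 0, 1],
})
toy_combined = combine_by_prefix(
    toy, {"PRE_Fox_": "mentionFox", "PRE_CNN_": "mentionCNN"}
)
print(toy_combined)
```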

In [28]:
importlib.reload(prepare_data)
data_one_hot = prepare_data.one_hot_cat_cols(data_combined, dictionary,column_labels)

In [29]:
data_one_hot

Unnamed: 0,weights,approvalOfPresidentCovidResponse,PRE_ThermoBiden,PRE_ThermoTrump,PRE_ThermoHarris,PRE_ThermoPence,PRE_ThermoObama,PRE_ThermoDemParty,PRE_ThermoRepParty,PRE_YearsAtAddress,PRE_SummaryVoteDutyChoice,PRE_PartyID,PRE_ScaleSpendingServices,PRE_CountryDirection,PRE_GovTrust,PRE_EconomyView,Age,Income,HouseholdChildren,PRE_VoteAccuracy,PRE_SurveySeriousness,PRE_ScaleDefenseSpending,PRE_ScaleMedInsurance,PRE_ScaleGovAssistanceBlacks,PRE_ScaleJobIncome,POST_VoteAccuracy,POST_ThermoHarris,POST_ThermoPence,POST_ThermoBiden,POST_ThermoTrump,mentionFox,mentionABC,mentionCNN,interviewMode_Telephone,interviewMode_Web,PRE_VotePresident_Donald Trump,PRE_VotePresident_Jo Jorgensen,PRE_VotePresident_Howie Hawkins,PRE_VotePresident_Other,PRE_RaceOutcomePrediction_Win by quite a bit,PRE_ParentNativeStatus_One parent born in the US,PRE_ParentNativeStatus_Both parents born in another country,Sex_Female,PRE_AbortionRightsSC_Upset,PRE_AbortionRightsSC_Neither pleased nor upset,EducationLevel_High school graduate,EducationLevel_Some college but no degree,EducationLevel_Associate degree - occupational/vocational,EducationLevel_Associate degree - academic,EducationLevel_Bachelor’s degree,...,StateRegistration_Massachusetts,StateRegistration_Michigan,StateRegistration_Minnesota,StateRegistration_Mississippi,StateRegistration_Missouri,StateRegistration_Montana,StateRegistration_Nebraska,StateRegistration_Nevada,StateRegistration_New Hampshire,StateRegistration_New Jersey,StateRegistration_New Mexico,StateRegistration_New York,StateRegistration_North Carolina,StateRegistration_North Dakota,StateRegistration_Ohio,StateRegistration_Oklahoma,StateRegistration_Oregon,StateRegistration_Pennsylvania,StateRegistration_43,StateRegistration_Rhode Island,StateRegistration_South Carolina,StateRegistration_South Dakota,StateRegistration_Tennessee,StateRegistration_Texas,StateRegistration_Utah,StateRegistration_Vermont,StateRegistration_Virginia,StateRegistration_Washington,StateRegistration_West Virginia,StateRegistration_Wisconsin,StateRegistration_Wyoming,PRE_PartyMoreHouseMembers_incorrect (R),PRE_FederalSpendingKnowledge_correct (Foreign aid),PRE_CorruptionView_Decreased,PRE_CorruptionView_Stayed the same,POST_Voted2020_Registered and did not vote,POST_Voted2020_Voted,POST_VotePresident_Donald Trump,POST_VotePresident_Jo Jorgensen,POST_VotePresident_Howie Hawkins,POST_ProblemMention_Health (all other),POST_ProblemMention_Race relations,POST_ProblemMention_Partisan politics,POST_ProblemMention_Politicians,POST_ProblemMention_Government (all other),POST_ProblemMention_The economy,POST_ProblemMention_Elections,POST_ProblemMention_Unity /division,POST_ProblemMention_Health care,POST_ProblemMention_Other
0,1.01,1,0,100,0,85,0,0,85,10.0,1,7,1,3,5,2,46.0,21,0,3,5,7,7,7,7,2,0,85,0,100,0,0,0,False,True,True,False,False,False,True,True,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,True,False,False,False,False,False,False,False,False
1,1.16,3,0,0,0,0,50,0,50,4.0,4,4,3,1,5,3,37.0,13,1,2,5,4,4,4,5,2,0,0,15,15,0,0,0,False,True,False,True,False,False,False,False,False,True,True,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,True,False,True,False,False,False,False,False,False,False,False,True,False,False
2,0.77,4,65,0,65,0,90,60,0,11.0,4,3,6,1,4,4,40.0,17,2,3,5,1,2,3,4,1,80,0,85,0,0,0,1,False,True,False,False,False,False,True,False,False,True,True,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False
3,0.52,4,70,15,85,15,85,50,70,20.0,1,6,7,2,3,4,41.0,7,1,4,5,4,1,3,7,1,85,50,100,60,0,0,0,False,True,False,False,False,False,True,False,True,False,False,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False
4,0.97,3,15,85,15,90,10,20,70,10.0,1,4,2,4,5,4,72.0,22,0,2,5,4,5,6,7,3,0,95,0,90,1,0,0,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8275,2.54,1,0,100,0,50,0,0,100,1.0,1,7,7,3,2,2,26.0,8,2,4,5,6,1,5,2,4,22,50,40,100,0,0,0,False,False,True,False,False,False,True,False,True,True,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,True,False,False
8276,0.91,2,50,70,50,70,50,40,70,22.0,5,6,4,2,4,4,52.0,19,0,3,5,5,5,5,4,1,40,70,40,70,0,0,0,False,False,True,False,False,False,False,False,True,True,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,True,True,False,False,False,False,False,False,False,False,False,True,False,False
8277,0.65,4,70,30,85,15,70,85,50,6.0,1,1,5,3,4,2,45.0,16,0,4,5,7,4,4,4,1,70,20,60,30,0,0,0,True,False,False,False,False,False,False,False,True,False,True,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,False,False,False,False,False,False,False,False
8278,0.16,1,0,100,0,100,0,0,70,16.0,1,7,2,3,4,2,65.0,14,0,1,5,6,6,7,7,5,0,100,0,100,0,0,0,False,True,True,False,False,False,True,False,False,True,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,True


In [30]:
'Sex_Female' in data_one_hot.columns

True

In [66]:
'Race' in data_one_hot.columns

False

In [31]:
# Get all columns whose name contains 'StateRegistration'
columns_with_state_registration = data_one_hot.filter(like='StateRegistration', axis=1).columns

print(columns_with_state_registration)


Index(['StateRegistration_Alaska', 'StateRegistration_Arizona',
       'StateRegistration_Arkansas', 'StateRegistration_California',
       'StateRegistration_7', 'StateRegistration_Colorado',
       'StateRegistration_Connecticut', 'StateRegistration_Delaware',
       'StateRegistration_Washington DC', 'StateRegistration_Florida',
       'StateRegistration_Georgia', 'StateRegistration_14',
       'StateRegistration_Hawaii', 'StateRegistration_Idaho',
       'StateRegistration_Illinois', 'StateRegistration_Indiana',
       'StateRegistration_Iowa', 'StateRegistration_Kansas',
       'StateRegistration_Kentucky', 'StateRegistration_Louisiana',
       'StateRegistration_Maine', 'StateRegistration_Maryland',
       'StateRegistration_Massachusetts', 'StateRegistration_Michigan',
       'StateRegistration_Minnesota', 'StateRegistration_Mississippi',
       'StateRegistration_Missouri', 'StateRegistration_Montana',
       'StateRegistration_Nebraska', 'StateRegistration_Nevada',
       'Sta

In [32]:
train_data, val_data, test_data = prepare_data.split_data(data_one_hot)
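
`prepare_data.split_data` is defined outside this notebook, so its exact behaviour is not shown here. As a rough illustration only, a three-way split could look like the sketch below. The function name `split_data_sketch`, the 60/20/20 proportions, and the fixed seed are all assumptions, not the project's actual settings.

```python
import numpy as np
import pandas as pd


def split_data_sketch(df, val_frac=0.2, test_frac=0.2, seed=0):
    """Hypothetical stand-in for prepare_data.split_data (assumed behaviour)."""
    # Shuffle all rows reproducibly, then cut the shuffled frame into three parts
    shuffled = df.sample(frac=1, random_state=seed)
    n_test = int(len(df) * test_frac)
    n_val = int(len(df) * val_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    return train, val, test


# Toy data, not the ANES data
toy = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) % 2})
train_toy, val_toy, test_toy = split_data_sketch(toy)
print(len(train_toy), len(val_toy), len(test_toy))
```

Whatever the real implementation does, it is worth checking that the three pieces are disjoint and together cover every row of `data_one_hot`.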

In [67]:
# Save the datasets to CSV files
data_combined.to_csv('anes_data.csv', index=False)
train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

# 2016 Data

# Census Data

In [52]:
data_States = pd.read_csv(r"../data/Census Data/table04b.csv")
# Drop the first two columns and keep the rest
data_States = data_States.iloc[:, 2:]

In [53]:
# Set the first column as the index
data_States.set_index(data_States.columns[0], inplace=True)
data_States = data_States.rename_axis(None, axis=0)

In [36]:
data_States

Unnamed: 0,Total population,Total citizen population,Total registered,Percent registered\n(Total),Registered-Margin of error 1,Percent registered\n(Citizen),Registered-Margin of error 1.1,Total voted,Percent voted\n(Total),Voted-Margin of error 1,Percent voted\n(Citizen),Voted-Margin of error 1.1
US-Total,252274,231593,168308,66.7,0.4,72.7,0.4,154628,61.3,0.4,66.8,0.4
US-Male,121870,111485,79340,65.1,0.5,71.2,0.5,72474,59.5,0.5,65,0.5
US-Female,130404,120108,88968,68.2,0.5,74.1,0.5,82154,63,0.5,68.4,0.5
US-White alone,195227,181891,134889,69.1,0.4,74.2,0.4,124301,63.7,0.4,68.3,0.4
US-White non-Hispanic alone,157442,154827,118389,75.2,0.4,76.5,0.4,109830,69.8,0.4,70.9,0.4
...,...,...,...,...,...,...,...,...,...,...,...,...
WYOMING-Asian alone,2,-,-,B,B,B,B,-,B,B,B,B
WYOMING-Hispanic (of any race),40,38,23,B,B,B,B,21,B,B,B,B
WYOMING-White alone or in combination,422,416,290,68.6,3.5,69.6,3.5,273,64.7,3.6,65.7,3.6
WYOMING-Black alone or in combination,4,3,3,B,B,B,B,3,B,B,B,B


In [51]:
# Extract state names from the index and add as a new column
data_States['State'] = data_States.index.str.split('-').str[0]

# Group by state
grouped_by_state = data_States.groupby('State')

# Inspect the resulting groups
grouped_by_state.groups.keys()

{'ALABAMA': ['ALABAMA-Total', 'ALABAMA-Male', 'ALABAMA-Female', 'ALABAMA-White alone', 'ALABAMA-White non-Hispanic alone', 'ALABAMA-Black alone', 'ALABAMA-Asian alone', 'ALABAMA-Hispanic (of any race)', 'ALABAMA-White alone or in combination', 'ALABAMA-Black alone or in combination', 'ALABAMA-Asian alone or in combination'], 'ALASKA': ['ALASKA-Total', 'ALASKA-Male', 'ALASKA-Female', 'ALASKA-White alone', 'ALASKA-White non-Hispanic alone', 'ALASKA-Black alone', 'ALASKA-Asian alone', 'ALASKA-Hispanic (of any race)', 'ALASKA-White alone or in combination', 'ALASKA-Black alone or in combination', 'ALASKA-Asian alone or in combination'], 'ARIZONA': ['ARIZONA-Total', 'ARIZONA-Male', 'ARIZONA-Female', 'ARIZONA-White alone', 'ARIZONA-White non-Hispanic alone', 'ARIZONA-Black alone', 'ARIZONA-Asian alone', 'ARIZONA-Hispanic (of any race)', 'ARIZONA-White alone or in combination', 'ARIZONA-Black alone or in combination', 'ARIZONA-Asian alone or in combination'], 'ARKANSAS': ['ARKANSAS-Total', 'A

In [54]:
data_register = pd.read_csv(r"../data/Census Data/table05_1.csv") 

In [55]:
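# One-hot encode the education-level labels stored in 'Unnamed: 0'
one_hot = pd.get_dummies(data_register['Unnamed: 0'])
# Replace the original label column with the indicator columns
data_register = pd.concat([data_register.drop(columns=['Unnamed: 0']), one_hot], axis=1)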
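
The drop-and-concatenate pattern above can also be written as a single `pd.get_dummies` call. The sketch below compares the two on a toy frame. The frame `toy_register` and its values are illustrative only and are not read from `table05_1.csv`.

```python
import pandas as pd

# Toy frame with a label column named like the one in the Census file
toy_register = pd.DataFrame(
    {
        "Unnamed: 0": ["Less than 9th grade", "High school graduate", "Advanced degree"],
        "male": [1, 1, 1],
        "Total population": [4358, 37483, 14596],
    }
)

# Two-step version: encode the column, then swap it in
one_hot = pd.get_dummies(toy_register["Unnamed: 0"])
two_step = pd.concat([toy_register.drop(columns=["Unnamed: 0"]), one_hot], axis=1)

# One-step version: an empty prefix keeps the raw labels as column names
one_step = pd.get_dummies(toy_register, columns=["Unnamed: 0"], prefix="", prefix_sep="")

print(list(one_step.columns))
```

Both versions append the indicator columns after the remaining columns, so either can be used. The one-step form avoids the intermediate `one_hot` variable.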

In [56]:
data_register

Unnamed: 0,male,Total population,Total Citizen Population,Reported registered-Number,Reported registered-Percent,Reported not registered-Number,Reported not registered-Percent,No response to registration1 -Number,No response to registration1 -Percent,Reported voted-Number,Reported voted-Percent,Reported not voted-Number,Reported not voted-Percent,No response to voting2 -Number,No response to voting2 -Percent,Reported registered - total pop %,Reported voted - total pop %,"9th to 12th grade, no diploma",Advanced degree,Bachelor's degree,High school graduate,Less than 9th grade,Some college or associate's degree
0,1,4358,2403,1050,43.7,806,33.5,547,22.8,889,37.0,1007,41.9,507,21.1,24.1,20.4,False,False,False,False,True,False
1,1,7971,6770,3255,48.1,2056,30.4,1459,21.5,2663,39.3,2750,40.6,1357,20.0,40.8,33.4,True,False,False,False,False,False
2,1,37483,34460,21214,61.6,6638,19.3,6609,19.2,18374,53.3,9738,28.3,6348,18.4,56.6,49.0,False,False,False,True,False,False
3,1,31870,30656,23209,75.7,2701,8.8,4746,15.5,21113,68.9,4901,16.0,4642,15.1,72.8,66.2,False,False,False,False,False,True
4,1,25593,23974,19456,81.2,1128,4.7,3390,14.1,18521,77.3,2121,8.8,3332,13.9,76.0,72.4,False,False,True,False,False,False
5,1,14596,13222,11156,84.4,369,2.8,1697,12.8,10914,82.5,630,4.8,1677,12.7,76.4,74.8,False,True,False,False,False,False
6,0,4327,2389,1146,48.0,772,32.3,470,19.7,912,38.2,1028,43.0,449,18.8,26.5,21.1,False,False,False,False,True,False
7,0,7054,5968,3179,53.3,1478,24.8,1311,22.0,2620,43.9,2120,35.5,1228,20.6,45.1,37.1,True,False,False,False,False,False
8,0,35846,33079,21587,65.3,5192,15.7,6299,19.0,19082,57.7,8030,24.3,5966,18.0,60.2,53.2,False,False,False,True,False,False
9,0,36928,35393,26975,76.2,3143,8.9,5275,14.9,24842,70.2,5316,15.0,5234,14.8,73.0,67.3,False,False,False,False,False,True


In [57]:
data_income = pd.read_csv(r"../data/Census Data/table07.csv") 

In [58]:
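# Drop the first two columns
data_income = data_income.iloc[:, 2:]
# One-hot encode the income-bracket labels stored in 'Unnamed: 2'
one_hot = pd.get_dummies(data_income['Unnamed: 2'])
# Replace the original label column with the indicator columns
data_income = pd.concat([data_income.drop(columns=['Unnamed: 2']), one_hot], axis=1)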

In [59]:
data_income

Unnamed: 0,18-24,22-24,45-64,65-74,75+,Total population,Total Citizen Population,Reported registered-Number,Reported registered-Percent,Reported not registered-Number,Reported not registered-Percent,No response to registration1 -Number,No response to registration1 -Percent,Reported voted-Number,Reported voted-Percent,Reported not voted-Number,Reported not voted-Percent,No response to voting2 -Number,No response to voting2 -Percent,Reported registered - total pop %,Reported voted - total pop %,"10,000 to 14,999","100,000 to 149,999","15,000 to 19,999","150,000 and over","20,000 to 29,999","30,000 to 39,999","40,000 to 49,999","50,000 to 74,999","75,000 to 99,999",Income not reported,"Under 10,000"
0,0,0,0,0,0,3083,2571,1523,59.2,629,24.5,419,16.3,1212,47.1,997,38.8,362,14.1,49.4,39.3,False,False,False,False,False,False,False,False,False,False,True
1,0,0,0,0,0,3359,2762,1745,63.2,645,23.4,371,13.4,1392,50.4,1011,36.6,358,13.0,52.0,41.5,True,False,False,False,False,False,False,False,False,False,False
2,0,0,0,0,0,3327,2833,1637,57.8,787,27.8,409,14.4,1412,49.8,1048,37.0,374,13.2,49.2,42.4,False,False,True,False,False,False,False,False,False,False,False
3,0,0,0,0,0,9142,7774,5124,65.9,1661,21.4,989,12.7,4357,56.1,2475,31.8,942,12.1,56.0,47.7,False,False,False,False,True,False,False,False,False,False,False
4,0,0,0,0,0,12794,10829,7637,70.5,1848,17.1,1345,12.4,6832,63.1,2708,25.0,1290,11.9,59.7,53.4,False,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61,0,0,0,0,1,2056,2004,1772,88.4,98,4.9,133,6.7,1676,83.6,202,10.1,126,6.3,86.2,81.5,False,False,False,False,False,False,False,True,False,False,False
62,0,0,0,0,1,1147,1102,1027,93.2,34,3.1,41,3.7,982,89.1,79,7.2,41,3.7,89.5,85.6,False,False,False,False,False,False,False,False,True,False,False
63,0,0,0,0,1,989,964,834,86.5,70,7.3,61,6.3,793,82.2,108,11.2,63,6.5,84.3,80.1,False,True,False,False,False,False,False,False,False,False,False
64,0,0,0,0,1,1030,958,819,85.5,56,5.8,84,8.7,784,81.8,91,9.5,83,8.7,79.5,76.1,False,False,False,True,False,False,False,False,False,False,False


In [60]:
data_noVote = pd.read_csv(r"../data/Census Data/table10.csv") 

In [61]:
data_noVote = data_noVote.iloc[:,2:]
# Set the first column as the index
data_noVote.set_index(data_noVote.columns[0], inplace=True)
data_noVote = data_noVote.rename_axis(None, axis=0)

In [62]:
data_noVote

Unnamed: 0,Total not voting1,Out of town%,Forgot to vote%,Concerns about the coronavirus (COVID-19) pandemic%,Illness or disability%,Not interested%,"Too busy, conflicting schedule%",Transportation problems%,Did not like candidates or campaign issues%,Registration problems%,Bad weather conditions%,Inconvenient polling place%,Other reason%,Don't know or refused%
Total,12810,6.1,3.7,4.3,13.0,17.6,13.1,2.4,14.5,4.9,0.1,2.6,14.5,3.2
18 to 24 years,2017,8.8,5.7,2.3,4.5,19.2,18.4,3,11.0,4.9,-,2.5,15.7,4.0
25 to 44 years,5350,5.8,3.6,3.8,5.8,19.8,16.3,1.4,16.7,5.8,-,3.4,14.4,3.2
45 to 64 years,3364,6.3,3.2,4.1,15.8,16.7,11.7,3,15.3,4.6,0.1,2,14.3,3.0
65 years and over,2079,3.8,2.8,8.1,35.2,11.7,2.1,3.3,11.2,3.1,0.2,1.8,14.0,2.6
Male,6375,7.0,3.5,3.4,9.7,20.1,14.2,2.2,14.6,5.2,0.1,2.6,14.5,2.9
Female,6434,5.1,3.9,5.2,16.3,15.1,12.1,2.6,14.5,4.6,0.1,2.6,14.5,3.5
White alone,9926,6.3,3.5,4.3,12.7,17.8,12.8,1.9,15.9,4.9,0.1,2.6,14.1,3.1
White non-Hispanic alone,8056,6.4,3.3,4.3,13.2,17.6,12.0,2.1,16.1,4.7,-,2.4,14.5,3.4
Black alone,1800,4.7,5,3.4,14.0,18.1,10.5,5.8,10.2,4.6,-,3.4,16.0,4.4


In [46]:
# data_States.to_csv('data_States.csv', index=False)
data_register.to_csv('data_register.csv', index=False)
data_income.to_csv('data_income.csv', index=False)
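# data_noVote keeps its row labels (Total, age group, sex, race) in the index,
# so write the index as well; index=False would drop those labels
data_noVote.to_csv('data_noVote.csv')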