# Cleaning

[DSLC stages]: Data cleaning and pre-processing


Start by loading in any libraries that you will use in this document.


In [78]:
import pandas as pd
import numpy as np
import plotly.express as px

pd.set_option('display.max_columns', 100)


## Domain problem formulation

Write a summary of the problem.





## Data source overview

Briefly describe where the data used in this project came from.


## Step 1: Review background information {#sec-bg-info}

### Information on data collection

Write a summary of how the data was collected.

### Data dictionary

If there is a data dictionary, give some details here.


### Answering questions about the background information

Answer the recommended background information questions from the Data Cleaning chapter.

- *What does each variable measure?* 

- *How was the data collected?* 

- *What are the observational units?* 

- *Is the data relevant to my project?*




## Step 2: Loading in the data


Load in the data. 


In [110]:
data_orig = pd.read_csv(r"..\data\anes_timeseries_2020_csv_20220210.csv") 



Let's look at the first few rows to make sure it looks like it has been loaded in correctly:

In [167]:
data_orig.head()

And let's examine the dimension of the data.
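
As a minimal sketch of the dimension check, the snippet below reads `.shape` from a tiny stand-in frame. The name `demo` and its values are hypothetical; in this notebook the equivalent call would presumably be `data_orig.shape`.

```python
import pandas as pd

# Hypothetical stand-in for data_orig, used only to show the call
demo = pd.DataFrame({"V201033": [1, 2, -9], "V201600": [1, 2, 2]})

# .shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print(f"{n_rows} rows x {n_cols} columns")
```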


In [134]:
dictionary = {
  "V201033": {"column": "PRE_VotePresident", "unique_values": 5, "type": "cat"},
  "V201218": {"column": "PRE_RaceOutcomePrediction", "unique_values": 2, "type": "cat"},
  "V201151": {"column": "PRE_ThermoBiden", "unique_values": 100, "type": "rank"},
  "V201152": {"column": "PRE_ThermoTrump", "unique_values": 100, "type": "rank"},
  "V201153": {"column": "PRE_ThermoHarris", "unique_values": 100, "type": "rank"},
  "V201154": {"column": "PRE_ThermoPence", "unique_values": 100, "type": "rank"},
  "V201155": {"column": "PRE_ThermoObama", "unique_values": 100, "type": "rank"},
  "V201156": {"column": "PRE_ThermoDemParty", "unique_values": 100, "type": "rank"},
  "V201157": {"column": "PRE_ThermoRepParty", "unique_values": 100, "type": "rank"},
  "V201553": {"column": "PRE_ParentNativeStatus", "unique_values": 3, "type": "cat"},
  "V201587": {"column": "PRE_YearsAtAddress", "unique_values": 40, "type": "num"},
  "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
  "V201225x": {"column": "PRE_SummaryVoteDutyChoice", "unique_values": 7, "type": "rank"},
  "V201231x": {"column": "PRE_PartyID", "unique_values": 7, "type": "rank"},
  "V201246": {"column": "PRE_ScaleSpendingServices", "unique_values": 7, "type": "rank"},
  #"V201018": {"column": "PRE_PartyRegistration", "unique_values": 4, "type": "cat"},
  "V201115": {"column": "PRE_CountryDirection", "unique_values": 5, "type": "rank"},
  "V201233": {"column": "PRE_GovTrust", "unique_values": 5, "type": "rank"},
  "V201324": {"column": "PRE_EconomyView", "unique_values": 5, "type": "rank"},
  "V201340": {"column": "PRE_AbortionRightsSC", "unique_values": 3, "type": "cat"},
  "V201507x": {"column": "Age", "unique_values": 80, "type": "num"},
  "V201510": {"column": "EducationLevel", "unique_values": 8, "type": "cat"},
  "V201517": {"column": "WorkStatus", "unique_values": 2, "type": "cat"},
  "V201617x": {"column": "Income", "unique_values": 22, "type": "rank"},
  "V201549x": {"column": "Race", "unique_values": 6, "type": "cat"},
  "V202054x": {"column": "StateRegistration", "unique_values": 56, "type": "cat"},
  "V201567": {"column": "HouseholdChildren", "unique_values": 5, "type": "rank"},
  "V201630b": {"column": "PRE_Fox_Hannity", "unique_values": 2, "type": "cat"},
  "V201630c": {"column": "PRE_Fox_TuckerCarlsonTonight", "unique_values": 2, "type": "cat"},
  "V201630k": {"column": "PRE_Fox_SpecialReportBretBaier", "unique_values": 2, "type": "cat"},
  "V201630f": {"column": "PRE_Fox_TheFive", "unique_values": 2, "type": "cat"},
  "V201630g": {"column": "PRE_Fox_TheIngrahamAngle", "unique_values": 2, "type": "cat"},
  "V201630h": {"column": "PRE_Fox_TheStoryMarthaMacCallum", "unique_values": 2, "type": "cat"},
  "V201631k": {"column": "PRE_Fox_FoxAndFriends", "unique_values": 2, "type": "cat"},
  "V201634f": {"column": "PRE_Fox_FoxNewsWebsite", "unique_values": 2, "type": "cat"},
  "V201630i": {"column": "PRE_CNN_TheLeadJakeTapper", "unique_values": 2, "type": "cat"},
  "V201630j": {"column": "PRE_CNN_AndersonCooper360", "unique_values": 2, "type": "cat"},
  "V201630q": {"column": "PRE_CNN_CuomoPrimeTime", "unique_values": 2, "type": "cat"},
  "V201631b": {"column": "PRE_CNN_ErinBurnettOutFront", "unique_values": 2, "type": "cat"},
  "V201634b": {"column": "PRE_CNN_CNNWebsite", "unique_values": 2, "type": "cat"},
  "V201630n": {"column": "PRE_ABC_WorldNewsTonight", "unique_values": 2, "type": "cat"},
  "V201631d": {"column": "PRE_ABC_2020", "unique_values": 2, "type": "cat"},
  "V201631i": {"column": "PRE_ABC_GoodMorningAmerica", "unique_values": 2, "type": "cat"},
  "V201646": {"column": "PRE_PartyMoreHouseMembers", "unique_values": 2, "type": "cat"},
  "V201645": {"column": "PRE_FederalSpendingKnowledge", "unique_values": 4, "type": "cat"},
  "V201351": {"column": "PRE_VoteAccuracy", "unique_values": 5, "type": "rank"},
  "V201650": {"column": "PRE_SurveySeriousness", "unique_values": 5, "type": "rank"},
  "V201249": {"column": "PRE_ScaleDefenseSpending", "unique_values": 7, "type": "rank"},
  "V201252": {"column": "PRE_ScaleMedInsurance", "unique_values": 7, "type": "rank"},
  "V201380": {"column": "PRE_CorruptionView", "unique_values": 3, "type": "cat"},
  "V201258": {"column": "PRE_ScaleGovAssistanceBlacks", "unique_values": 7, "type": "rank"},
  "V201255": {"column": "PRE_ScaleJobIncome", "unique_values": 7, "type": "rank"},
  "V202051": {"column": "POST_RegistrationStatus", "unique_values": 3, "type": "cat"},
  "V202068x": {"column": "POST_Voted2020", "unique_values": 3, "type": "cat"},
  "V202073": {"column": "POST_VotePresident", "unique_values": 4, "type": "cat"},
  "V202219": {"column": "POST_VoteAccuracy", "unique_values": 5, "type": "rank"},
  "V202156": {"column": "POST_ThermoHarris", "unique_values": 100, "type": "rank"},
  "V202157": {"column": "POST_ThermoPence", "unique_values": 100, "type": "rank"},
  "V202143": {"column": "POST_ThermoBiden", "unique_values": 100, "type": "rank"},
  "V202144": {"column": "POST_ThermoTrump", "unique_values": 100, "type": "rank"},
  "V202123": {"column": "POST_ReasonNotVoting", "unique_values": 15, "type": "cat"},
  "V202205y1": {"column": "POST_ProblemMention", "unique_values": 82, "type": "cat"},
  #"V202580": {"column": "POST_ScaleMedInsurance", "unique_values": 7, "type": "rank"},
  #"V202624": {"column": "POST_HealthSpending", "unique_values": 7, "type": "rank"},
  "V202644": {"column": "POST_RespondentHonesty", "unique_values": 3, "type": "cat"}
}


In [231]:
column_labels = {
    "V201033": {
        "column": "PRE_VotePresident",
        "labels": {
            1: "Joe Biden",
            2: "Donald Trump",
            3: "Jo Jorgensen",
            4: "Howie Hawkins",
            5: "Other"
        }
    },
    "V201218": {
        "column": "PRE_RaceOutcomePrediction",
        "labels": {
            1: "Will be close",
            2: "Win by quite a bit"
        }
    },
    "V201553": {
        "column": "PRE_ParentNativeStatus",
        "labels": {
            1: "Both parents born in the US",
            2: "One parent born in the US",
            3: "Both parents born in another country"
        }
    },
    "V201600": {
        "column": "Sex",
        "labels": {
            1: "Male",
            2: "Female"
        }
    },
    "V201340": {
        "column": "PRE_AbortionRightsSC",
        "labels": {
            1: "Pleased",
            2: "Upset",
            3: "Neither pleased nor upset"
        }
    },
    "V201510": {
        "column": "EducationLevel",
        "labels": {
            1: "Less than high school credential",
            2: "High school graduate",
            3: "Some college but no degree",
            4: "Associate degree - occupational/vocational",
            5: "Associate degree - academic",
            6: "Bachelor’s degree",
            7: "Master’s degree",
            8: "Professional/Doctoral degree",
        }
    },
    "V201517": {
        "column": "WorkStatus",
        "labels": {
            1: "Yes",
            2: "No, did not work (or retired)"
        }
    },
    "V201549x": {
        "column": "Race",
        "labels": {
            1: "White, non-Hispanic",
            2: "Black, non-Hispanic",
            3: "Hispanic",
            4: "Asian/Pacific Islander, non-Hispanic",
            5: "Native American/Alaska Native, non-Hispanic",
            6: "Multiple races, non-Hispanic"
        }
    },
    "V202054x": {
        "column": "StateRegistration",
        "labels": {
            1: "Alabama",
            2: "Alaska",
            4: "Arizona",
            5: "Arkansas",
            6: "California",
            8: "Colorado",
            9: "Connecticut",
            10: "Delaware",
            11: "Washington DC",
            12: "Florida",
            13: "Georgia",
            15: "Hawaii",
            16: "Idaho",
            17: "Illinois",
            18: "Indiana",
            19: "Iowa",
            20: "Kansas",
            21: "Kentucky",
            22: "Louisiana",
            23: "Maine",
            24: "Maryland",
            25: "Massachusetts",
            26: "Michigan",
            27: "Minnesota",
            28: "Mississippi",
            29: "Missouri",
            30: "Montana",
            31: "Nebraska",
            32: "Nevada",
            33: "New Hampshire",
            34: "New Jersey",
            35: "New Mexico",
            36: "New York",
            37: "North Carolina",
            38: "North Dakota",
            39: "Ohio",
            40: "Oklahoma",
            41: "Oregon",
            42: "Pennsylvania",
            44: "Rhode Island",
            45: "South Carolina",
            46: "South Dakota",
            47: "Tennessee",
            48: "Texas",
            49: "Utah",
            50: "Vermont",
            51: "Virginia",
            53: "Washington",
            54: "West Virginia",
            55: "Wisconsin",
            56: "Wyoming"
        }
    },
    "V201646": {
        "column": "PRE_PartyMoreHouseMembers",
        "labels": {
            1: "correct (D)",
            2: "incorrect (R)"
        }
    },
    "V201645": {
        "column": "PRE_FederalSpendingKnowledge",
        "labels": {
            1: "correct (Foreign aid)",
            2: "incorrect (Medicare, National defense, SS)"
        }
    },
    "V201380": {
        "column": "PRE_CorruptionView",
        "labels": {
            1: "Increased",
            2: "Decreased",
            3: "Stayed the same"
        }
    },
    "V202051": {
        "column": "POST_RegistrationStatus",
        "labels": {
            1: "Registered at this address",
            2: "Registered at a different address",
            3: "Not currently registered"
        }
    },
    "V202068x": {
        "column": "POST_Voted2020",
        "labels": {
            0: "Not registered and did not vote",
            1: "Registered and did not vote",
            2: "Voted"
        }
    },
    "V202073": {
        "column": "POST_VotePresident",
        "labels": {
            1: "Joe Biden",
            2: "Donald Trump",
            3: "Jo Jorgensen",
            4: "Howie Hawkins",
            5: "Other candidate {SPECIFY}"
        }
    },
    "V202205y1": {
        "column": "POST_ProblemMention",
        "labels": {
            1: "Defense spending",
            2: "Middle East",
            3: "Iraq",
            4: "War",
            5: "Terrorism",
            6: "Veterans",
            7: "National defense (all other)",
            8: "Foreign aid",
            9: "Foreign Trade",
            10: "Protection of US jobs",
            11: "Serbia /Balkans",
            12: "China",
            13: "International affairs (all other)",
            14: "Energy crisis",
            15: "Energy prices",
            16: "Energy (all other)",
            17: "Environment",
            18: "Natural Resources (all other)",
            19: "Education and training",
            20: "School funding",
            21: "Education (all other)",
            22: "AIDS",
            23: "Medicare",
            24: "Health (all other)",
            25: "Welfare",
            26: "Poverty",
            27: "Employment",
            28: "Housing",
            29: "Social security",
            30: "Income (all other)",
            31: "Crime",
            32: "Race relations",
            33: "Illegal drugs",
            34: "Police problems",
            35: "Guns",
            36: "Corporate Corruption",
            37: "Justice (all other)",
            38: "Budget",
            39: "Size of government",
            40: "Taxes",
            41: "Immigration",
            42: "Campaign finance",
            43: "Political corruption",
            44: "Ethics",
            45: "Government power",
            46: "Budget priorities",
            47: "Partisan politics",
            48: "Politicians",
            49: "Government (all other)",
            50: "The economy",
            51: "Stock market",
            52: "Economic inequality",
            53: "Recession",
            54: "Inflation",
            55: "Economics (all other)",
            56: "Agriculture",
            57: "Science",
            58: "Commerce",
            59: "Transportation",
            60: "Community development",
            61: "Abortion",
            62: "Child care",
            63: "Overpopulation",
            64: "Public morality",
            65: "Domestic violence",
            66: "Family",
            67: "Young people",
            68: "Sexual identity /LGBT+ issues",
            69: "The media",
            75: "Sexism /Gender issues",
            76: "Afghanistan",
            77: "Syria",
            78: "Elections",
            79: "Religion",
            80: "Civility",
            81: "Unity /division",
            82: "Health care"
        }
    }
}


In [83]:
path = r"..\data\anes_timeseries_2020_csv_20220210.csv"

In [169]:
import sys
import os
import importlib

# Add the directory containing the file to the Python path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(path), 'functions')))

# Import the function
from functions import load_data

# Reload the module to reflect any updates
importlib.reload(load_data)

data_filtered = load_data.load_data(path, dictionary)
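
The source of `load_data.load_data` is not shown in this notebook. The sketch below is one plausible reading of what it does, assuming it reads the CSV, keeps the codebook columns listed in `dictionary`, and renames them to the readable names. `load_data_sketch`, `csv_text`, and `demo_dict` are hypothetical names introduced for illustration.

```python
import io
import pandas as pd

def load_data_sketch(source, dictionary):
    """Read a CSV, keep only the columns named in `dictionary`, and rename them."""
    df = pd.read_csv(source, low_memory=False)
    # Keep only codebook IDs that actually exist in the file
    keep = [code for code in dictionary if code in df.columns]
    rename_map = {code: dictionary[code]["column"] for code in keep}
    return df[keep].rename(columns=rename_map)

# Tiny in-memory CSV standing in for the real survey file
csv_text = "V201033,V201600,V999999\n1,2,7\n2,1,8\n"
demo_dict = {
    "V201033": {"column": "PRE_VotePresident", "unique_values": 5, "type": "cat"},
    "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
}

demo = load_data_sketch(io.StringIO(csv_text), demo_dict)
print(demo.columns.tolist())
```

Passing `low_memory=False` makes pandas infer each column's type from the whole file, which may avoid mixed-type warnings on wide survey files.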



In [170]:
data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 62 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   PRE_VotePresident                8280 non-null   int64
 1   PRE_RaceOutcomePrediction        8280 non-null   int64
 2   PRE_ThermoBiden                  8280 non-null   int64
 3   PRE_ThermoTrump                  8280 non-null   int64
 4   PRE_ThermoHarris                 8280 non-null   int64
 5   PRE_ThermoPence                  8280 non-null   int64
 6   PRE_ThermoObama                  8280 non-null   int64
 7   PRE_ThermoDemParty               8280 non-null   int64
 8   PRE_ThermoRepParty               8280 non-null   int64
 9   PRE_ParentNativeStatus           8280 non-null   int64
 10  PRE_YearsAtAddress               8280 non-null   int64
 11  Sex                              8280 non-null   int64
 12  PRE_SummaryVoteDutyChoice        8280 non-null  

In [21]:
data_filtered_num = data_filtered.select_dtypes(include='number')

# Compute the correlation matrix
correlation_matrix = data_filtered_num.corr()

# Keep one entry per variable pair by comparing labels alphabetically (drops the diagonal and mirrored duplicates)
upper_triangle = correlation_matrix.where(
    ~correlation_matrix.apply(lambda x: x.index.to_series() >= x.name)
)

# Unstack the matrix into a Series and sort by values
sorted_correlations = upper_triangle.unstack().dropna().sort_values()

# Get the top negative and positive correlations
top_negative = sorted_correlations.head(15)  # Adjust the number to see more
top_positive = sorted_correlations.tail(15)  # Adjust the number to see more

# Display results
print("Top Negative Correlations:\n", top_negative)
print("\nTop Positive Correlations:\n", top_positive)

Top Negative Correlations:
 PRE_ThermoTrump     PRE_ThermoObama      -0.744000
                    PRE_ThermoBiden      -0.692895
PRE_ThermoObama     POST_ThermoTrump     -0.680679
PRE_ThermoTrump     POST_ThermoBiden     -0.659753
                    PRE_ThermoDemParty   -0.642706
PRE_ThermoBiden     POST_ThermoTrump     -0.633113
PRE_ThermoDemParty  POST_ThermoTrump     -0.585299
PRE_ThermoRepParty  PRE_ThermoObama      -0.570714
PRE_ThermoTrump     POST_ThermoHarris    -0.552223
PRE_ThermoRepParty  PRE_ThermoBiden      -0.540317
                    POST_ThermoBiden     -0.519356
POST_ThermoTrump    POST_ThermoBiden     -0.513552
PRE_ThermoObama     POST_ThermoPence     -0.494797
PRE_ThermoTrump     PRE_EconomyView      -0.469436
PRE_ThermoBiden     POST_ThermoPence     -0.458362
dtype: float64

Top Positive Correlations:
 POST_VotePresident       POST_ScaleMedInsurance     0.791339
POST_VoteStatus          POST_VoteAccuracy          0.795725
PRE_ThermoObama          PRE_ThermoBiden 

In [22]:
from scipy.stats import ttest_rel, wilcoxon
# Separate PRE and POST columns
pre_columns = [col for col in data_filtered_num.columns if col.startswith("PRE_")]
post_columns = [col for col in data_filtered_num.columns if col.startswith("POST_")]

# Match PRE and POST columns (ignoring "PRE_" and "POST_" prefixes)
pre_post_pairs = {
    pre: pre.replace("PRE_", "POST_") for pre in pre_columns if pre.replace("PRE_", "POST_") in post_columns
}

# Prepare a DataFrame to store test results
results = []

# Compare each PRE-POST pair
for pre_col, post_col in pre_post_pairs.items():
    pre_data = data_filtered_num[pre_col].dropna()
    post_data = data_filtered_num[post_col].dropna()
    
    # Ensure equal length for paired testing
    paired_data = pd.DataFrame({"pre": pre_data, "post": post_data}).dropna()
    pre_values = paired_data["pre"]
    post_values = paired_data["post"]
    
    # Perform a paired t-test; fall back to Wilcoxon only if the t-test raises a ValueError (normality is not checked here)
    try:
        t_stat, p_value = ttest_rel(pre_values, post_values)
        test_type = "t-test"
    except ValueError:
        t_stat, p_value = wilcoxon(pre_values, post_values)
        test_type = "Wilcoxon"
    
    # Store results
    results.append({
        "Variable": pre_col.replace("PRE_", ""),
        "PRE_Mean": pre_values.mean(),
        "POST_Mean": post_values.mean(),
        "Test_Type": test_type,
        "T_Statistic": t_stat,
        "P_Value": p_value
    })

# Convert results to a DataFrame
comparison_results = pd.DataFrame(results)

# Display significant changes (p-value < 0.05)
significant_changes = comparison_results[comparison_results["P_Value"] < 0.05]

print("Significant Changes from Pre to Post:")
print(significant_changes)

Significant Changes from Pre to Post:
            Variable   PRE_Mean  POST_Mean Test_Type  T_Statistic  \
0      VotePresident  -0.894807   0.238889    t-test   -38.021702   
1        ThermoBiden  47.812319  46.919807    t-test     2.646178   
2        ThermoTrump  39.055314  33.297705    t-test    19.603202   
3       ThermoHarris  49.675362  46.183816    t-test     5.108380   
4        ThermoPence  44.637077  40.408454    t-test     7.077247   
5       VoteAccuracy   3.004469   1.355193    t-test    45.964388   
6  ScaleMedInsurance  14.064855  -1.560870    t-test    47.659814   

         P_Value  
0  1.051193e-291  
1   8.156082e-03  
2   1.123396e-83  
3   3.321833e-07  
4   1.590451e-12  
5   0.000000e+00  
6   0.000000e+00  


That's a lot of data!



## Step 3: Examine the data

In this section we explore the common messy data traits to identify any cleaning action items.





### Finding invalid values



In [79]:
data_filtered.describe()

Unnamed: 0,PRE_VotePresident,PRE_RaceOutcomePrediction,PRE_ThermoBiden,PRE_ThermoTrump,PRE_ThermoHarris,PRE_ThermoPence,PRE_ThermoObama,PRE_ThermoDemParty,PRE_ThermoRepParty,PRE_ParentNativeStatus,PRE_YearsAtAddress,Sex,PRE_SummaryVoteDutyChoice,PRE_PartyID,PRE_ScaleSpendingServices,PRE_PartyRegistration,PRE_CountryDirection,PRE_GovTrust,PRE_EconomyView,PRE_AbortionRightsSC,Age,EducationLevel,WorkStatus,Income,Race,StateRegistration,HouseholdChildren,PRE_Fox_Hannity,PRE_Fox_TuckerCarlsonTonight,PRE_Fox_SpecialReportBretBaier,PRE_Fox_TheFive,PRE_Fox_TheIngrahamAngle,PRE_Fox_TheStoryMarthaMacCallum,PRE_Fox_FoxAndFriends,PRE_Fox_FoxNewsWebsite,PRE_CNN_TheLeadJakeTapper,PRE_CNN_AndersonCooper360,PRE_CNN_CuomoPrimeTime,PRE_CNN_ErinBurnettOutFront,PRE_CNN_CNNWebsite,PRE_ABC_WorldNewsTonight,PRE_ABC_2020,PRE_ABC_GoodMorningAmerica,PRE_PartyMoreHouseMembers,PRE_FederalSpendingKnowledge,PRE_VoteAccuracy,PRE_SurveySeriousness,PRE_ScaleDefenseSpending,PRE_ScaleMedInsurance,PRE_CorruptionView,PRE_ScaleGovAssistanceBlacks,PRE_ScaleJobIncome,POST_RegistrationStatus,POST_VoteStatus,POST_VotePresident,POST_VoteAccuracy,POST_ThermoHarris,POST_ThermoPence,POST_ThermoBiden,POST_ThermoTrump,POST_ReasonNotVoting,POST_ProblemMention,POST_ScaleMedInsurance,POST_HealthSpending,POST_RespondentHonesty
count,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0,8280.0
mean,-0.894807,1.253261,47.812319,39.055314,49.675362,44.637077,59.907488,44.752536,43.131763,1.231643,11.610507,1.456522,3.251449,3.833816,18.079106,0.568478,1.675483,3.420531,3.212802,1.977899,49.038889,5.532126,0.298188,10.221739,1.498913,23.251691,0.533213,-0.253623,-0.250242,-0.267633,-0.2593,-0.268961,-0.275362,-0.229469,-0.306039,-0.268116,-0.238164,-0.256763,-0.278382,-0.257126,-0.212319,-0.236473,-0.224758,1.086957,2.03285,3.004469,4.558092,16.971256,14.064855,1.620169,13.264734,14.743841,-1.511111,2.453382,0.238889,1.355193,46.183816,40.408454,46.919807,33.297705,-0.520652,-2.0,-1.56087,-1.56087,-1.459058
std,0.623393,0.992782,36.806871,40.571078,65.88754,55.784589,37.425584,35.949072,36.148125,0.969502,12.098934,1.066932,2.40667,2.39749,33.18717,1.795337,0.941906,1.207423,1.261308,1.233611,20.771267,9.98756,1.480671,8.444621,1.698425,18.594514,1.313558,0.951118,0.953788,0.939846,0.946592,0.938759,0.933479,0.96977,1.008629,0.939451,0.963166,0.948621,0.930962,1.045849,0.982437,0.96446,0.973295,1.46808,2.098351,1.485154,1.447893,32.377374,29.738682,1.348386,28.716329,30.002713,1.533102,3.144927,2.56632,2.82523,48.731938,47.695825,38.600682,40.314633,3.970426,0.0,1.579909,1.579909,1.684504
min,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-2.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-7.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-9.0,-2.0,-7.0,-7.0,-7.0
25%,-1.0,1.0,15.0,0.0,0.0,0.0,30.0,15.0,15.0,1.0,2.0,1.0,1.0,2.0,4.0,-1.0,1.0,3.0,2.0,2.0,35.0,3.0,-1.0,4.0,1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,1.0,1.0,2.0,5.0,3.0,2.0,1.0,2.0,3.0,-1.0,3.0,-1.0,1.0,0.0,0.0,0.0,0.0,-1.0,-2.0,-1.0,-1.0,-1.0
50%,-1.0,1.0,50.0,30.0,50.0,50.0,70.0,50.0,40.0,1.0,7.0,2.0,2.0,4.0,5.0,1.0,2.0,4.0,3.0,2.0,51.0,5.0,-1.0,11.0,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,3.0,5.0,5.0,4.0,1.0,4.0,4.0,-1.0,4.0,1.0,2.0,50.0,30.0,50.0,10.0,-1.0,-2.0,-1.0,-1.0,-1.0
75%,-1.0,2.0,85.0,85.0,85.0,75.0,100.0,70.0,70.0,1.0,19.0,2.0,6.0,6.0,7.0,2.0,2.0,4.0,4.0,3.0,65.0,6.0,2.0,17.0,2.0,39.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,4.0,5.0,6.0,7.0,3.0,6.0,7.0,-1.0,4.0,2.0,3.0,85.0,75.0,85.0,70.0,-1.0,-2.0,-1.0,-1.0,-1.0
max,12.0,2.0,998.0,100.0,999.0,999.0,100.0,998.0,998.0,3.0,40.0,2.0,7.0,7.0,99.0,5.0,2.0,5.0,5.0,3.0,80.0,95.0,2.0,22.0,6.0,56.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,4.0,5.0,5.0,99.0,99.0,3.0,99.0,99.0,-1.0,4.0,12.0,5.0,999.0,999.0,100.0,100.0,16.0,-2.0,-1.0,-1.0,3.0


#### Numeric variables



In [171]:
from functions import prepare_data

importlib.reload(prepare_data)

ranked_columns, num_columns, cat_columns = prepare_data.extract_ranked_num_and_cat_columns(dictionary)

print(ranked_columns)
print(num_columns)

['PRE_ThermoBiden', 'PRE_ThermoTrump', 'PRE_ThermoHarris', 'PRE_ThermoPence', 'PRE_ThermoObama', 'PRE_ThermoDemParty', 'PRE_ThermoRepParty', 'PRE_SummaryVoteDutyChoice', 'PRE_PartyID', 'PRE_ScaleSpendingServices', 'PRE_CountryDirection', 'PRE_GovTrust', 'PRE_EconomyView', 'Income', 'HouseholdChildren', 'PRE_VoteAccuracy', 'PRE_SurveySeriousness', 'PRE_ScaleDefenseSpending', 'PRE_ScaleMedInsurance', 'PRE_ScaleGovAssistanceBlacks', 'PRE_ScaleJobIncome', 'POST_VoteAccuracy', 'POST_ThermoHarris', 'POST_ThermoPence', 'POST_ThermoBiden', 'POST_ThermoTrump']
['PRE_YearsAtAddress', 'Age']
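
The helper `extract_ranked_num_and_cat_columns` is defined outside this notebook. A minimal sketch of how such a split could work, assuming it simply groups the readable column names by the `type` field of `dictionary`, is shown below. `split_columns_by_type` and `demo_dict` are hypothetical names.

```python
def split_columns_by_type(dictionary):
    """Return (ranked, numeric, categorical) column-name lists from the dictionary."""
    groups = {"rank": [], "num": [], "cat": []}
    for spec in dictionary.values():
        groups[spec["type"]].append(spec["column"])
    return groups["rank"], groups["num"], groups["cat"]

# Small excerpt in the same shape as the notebook's dictionary
demo_dict = {
    "V201151": {"column": "PRE_ThermoBiden", "unique_values": 100, "type": "rank"},
    "V201507x": {"column": "Age", "unique_values": 80, "type": "num"},
    "V201600": {"column": "Sex", "unique_values": 2, "type": "cat"},
}

ranked, numeric, categorical = split_columns_by_type(demo_dict)
print(ranked, numeric, categorical)
```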


#### Categorical variables




In [172]:
print(cat_columns)

['PRE_VotePresident', 'PRE_RaceOutcomePrediction', 'PRE_ParentNativeStatus', 'Sex', 'PRE_AbortionRightsSC', 'EducationLevel', 'WorkStatus', 'Race', 'StateRegistration', 'PRE_Fox_Hannity', 'PRE_Fox_TuckerCarlsonTonight', 'PRE_Fox_SpecialReportBretBaier', 'PRE_Fox_TheFive', 'PRE_Fox_TheIngrahamAngle', 'PRE_Fox_TheStoryMarthaMacCallum', 'PRE_Fox_FoxAndFriends', 'PRE_Fox_FoxNewsWebsite', 'PRE_CNN_TheLeadJakeTapper', 'PRE_CNN_AndersonCooper360', 'PRE_CNN_CuomoPrimeTime', 'PRE_CNN_ErinBurnettOutFront', 'PRE_CNN_CNNWebsite', 'PRE_ABC_WorldNewsTonight', 'PRE_ABC_2020', 'PRE_ABC_GoodMorningAmerica', 'PRE_PartyMoreHouseMembers', 'PRE_FederalSpendingKnowledge', 'PRE_CorruptionView', 'POST_RegistrationStatus', 'POST_Voted2020', 'POST_VotePresident', 'POST_ReasonNotVoting', 'POST_ProblemMention', 'POST_RespondentHonesty']


### Examining missing values




In [50]:
# Count missing values for each column
missing_values_count = data_filtered.isna().sum()

# Display the counts
print("Missing values count for each column:")
print(missing_values_count)

# Filter out columns with no missing values
missing_values_greater_than_zero = missing_values_count[missing_values_count > 0]

print()
# Display the counts for columns with missing values
print("Missing values count for columns with missing values greater than 0:")
print(missing_values_greater_than_zero)


Missing values count for each column:
PRE_VotePresident            0
PRE_RaceOutcomePrediction    0
PRE_ThermoBiden              0
PRE_ThermoTrump              0
PRE_ThermoHarris             0
                            ..
POST_ReasonNotVoting         0
POST_ProblemMention          0
POST_ScaleMedInsurance       0
POST_HealthSpending          0
POST_RespondentHonesty       0
Length: 65, dtype: int64

Missing values count for columns with missing values greater than 0:
Series([], dtype: int64)
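
`isna()` finds nothing here, yet the `describe()` summary earlier shows minima such as -9 in most columns. A hedged sketch, assuming (not confirmed in this notebook) that negative values are codes for non-responses rather than real answers, counts them per column. The frame `demo` is a hypothetical stand-in for `data_filtered`.

```python
import pandas as pd

# Hypothetical stand-in for data_filtered
demo = pd.DataFrame({
    "PRE_ThermoBiden": [50, -9, 85, -1],
    "Age": [35, 51, -9, 65],
    "Sex": [1, 2, 2, 1],
})

# Count negative codes per column (assumed to mark non-responses)
negative_counts = (demo < 0).sum()

# Show only the columns that contain at least one negative code
print(negative_counts[negative_counts > 0])
```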


### Examining the data format



### Assessing column names



In [92]:
data_filtered.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 63 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   PRE_VotePresident                8280 non-null   int64
 1   PRE_RaceOutcomePrediction        8280 non-null   int64
 2   PRE_ThermoBiden                  8280 non-null   int64
 3   PRE_ThermoTrump                  8280 non-null   int64
 4   PRE_ThermoHarris                 8280 non-null   int64
 5   PRE_ThermoPence                  8280 non-null   int64
 6   PRE_ThermoObama                  8280 non-null   int64
 7   PRE_ThermoDemParty               8280 non-null   int64
 8   PRE_ThermoRepParty               8280 non-null   int64
 9   PRE_ParentNativeStatus           8280 non-null   int64
 10  PRE_YearsAtAddress               8280 non-null   int64
 11  Sex                              8280 non-null   int64
 12  PRE_SummaryVoteDutyChoice        8280 non-null  

### Assessing variable type



### Evaluating data completeness





### Answering any unanswered questions







## Step 4: Prepare the data

Don't forget to split the data into training, validation and test sets before you clean and pre-process it!
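
The split itself is not shown in this notebook. One simple way to do it with pandas alone is sketched below, assuming a 60/20/20 split and a fixed seed; both choices, and the name `demo`, are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data
demo = pd.DataFrame({"x": range(10)})

# Shuffle the rows once with a fixed seed so the split is reproducible
shuffled = demo.sample(frac=1, random_state=0)

# Cut points computed with integer arithmetic: 60% train, 20% validation, 20% test
n = len(shuffled)
train_end = n * 60 // 100
val_end = n * 80 // 100

train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test_set = shuffled.iloc[val_end:]

print(len(train), len(val), len(test_set))
```

Any cleaning parameters (for example, imputation settings) would then be chosen using `train` only and applied unchanged to the other two sets.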

In [173]:
from functions import prepare_data

importlib.reload(prepare_data)

data_processed = prepare_data.clean_columns(data_filtered, dictionary)
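
`prepare_data.clean_columns` is defined elsewhere. The sketch below shows one way such a step could turn code values into missing values, assuming negative codes and values above a per-column maximum (such as 998 on a 0 to 100 thermometer) are invalid. `clean_columns_sketch` and `valid_max` are hypothetical.

```python
import pandas as pd

def clean_columns_sketch(df, valid_max=None):
    """Replace negative codes, and values above a per-column maximum, with NaN."""
    valid_max = valid_max or {}
    cleaned = df.astype("float64")        # float so that NaN can be stored
    cleaned = cleaned.mask(cleaned < 0)   # negative codes become NaN
    for col, upper in valid_max.items():
        cleaned[col] = cleaned[col].mask(cleaned[col] > upper)
    return cleaned

# Hypothetical stand-in for data_filtered
demo = pd.DataFrame({"PRE_ThermoBiden": [50, -9, 998, 100], "Sex": [1, 2, -9, 2]})
cleaned = clean_columns_sketch(demo, valid_max={"PRE_ThermoBiden": 100})

print(cleaned.isna().sum())
```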

In [174]:
data_processed.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 62 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   PRE_VotePresident                7049 non-null   float64
 1   PRE_RaceOutcomePrediction        8213 non-null   float64
 2   PRE_ThermoBiden                  8060 non-null   float64
 3   PRE_ThermoTrump                  8048 non-null   float64
 4   PRE_ThermoHarris                 7980 non-null   float64
 5   PRE_ThermoPence                  8045 non-null   float64
 6   PRE_ThermoObama                  8165 non-null   float64
 7   PRE_ThermoDemParty               8152 non-null   float64
 8   PRE_ThermoRepParty               8141 non-null   float64
 9   PRE_ParentNativeStatus           8239 non-null   float64
 10  PRE_YearsAtAddress               8119 non-null   float64
 11  Sex                              8213 non-null   float64
 12  PRE_SummaryVoteDutyC

In [184]:
# Minimum number of non-missing values a column needs to be kept (50% of rows)
threshold = int(len(data_processed) * 0.5)

# Get the initial columns
original_columns = set(data_processed.columns)

# Drop columns with more than 50% missing values
data_dropped = data_processed.dropna(axis=1, thresh=threshold)

# Get the dropped columns
dropped_columns = original_columns - set(data_dropped.columns)

print("\nDropped columns:")
print(dropped_columns)


Dropped columns:
{'POST_ReasonNotVoting', 'POST_RespondentHonesty', 'POST_RegistrationStatus'}


In [205]:
# Safely remove keys from the dictionary
for col in list(dictionary.keys()):  # Iterate over a list of keys to avoid modifying during iteration
    if dictionary[col]['column'] in dropped_columns:
        print(dictionary[col]['column'])
        del dictionary[col]

In [206]:
importlib.reload(prepare_data)

# Test for optimal k
data_imputed = prepare_data.knn_impute(data_dropped, dictionary)

In [213]:
data_imputed['POST_ProblemMention'].value_counts()

POST_ProblemMention
82    2155
32     672
81     608
47     402
49     363
      ... 
15       1
23       1
1        1
22       1
76       1
Name: count, Length: 76, dtype: int64

In [228]:
importlib.reload(prepare_data)
data_imputed = prepare_data.group_top_and_other(data_imputed,'POST_ProblemMention')
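
The body of `group_top_and_other` is not shown. A minimal sketch of the idea, assuming it keeps the most frequent codes and recodes everything else to a single "other" code, follows. The defaults `top_n=9` and `other_code=83` are guesses based on the value counts printed in this notebook, not confirmed parameters.

```python
import pandas as pd

def group_top_and_other_sketch(df, column, top_n=9, other_code=83):
    """Keep the top_n most frequent values of `column`; recode the rest to other_code."""
    out = df.copy()
    # value_counts() sorts by frequency, most common first
    top_values = out[column].value_counts().index[:top_n]
    out[column] = out[column].where(out[column].isin(top_values), other_code)
    return out

# Hypothetical stand-in data: 82 appears three times, 32 twice, the rest once
demo = pd.DataFrame({"POST_ProblemMention": [82, 82, 82, 32, 32, 81, 47, 15]})
grouped = group_top_and_other_sketch(demo, "POST_ProblemMention", top_n=2)

print(grouped["POST_ProblemMention"].tolist())
```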

In [230]:
print(data_imputed['POST_ProblemMention'].value_counts())

POST_ProblemMention
83    2996
82    2155
32     672
81     608
47     402
49     363
50     347
48     311
24     238
78     188
Name: count, dtype: int64


In [224]:
importlib.reload(prepare_data)
data_imputed = prepare_data.replace_all_other_cols(data_imputed, 'PRE_FederalSpendingKnowledge', 1)
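
Judging only from the call and the 0/1 value counts printed for this column, `replace_all_other_cols` appears to turn the column into a binary indicator. A sketch under that assumption is below; `replace_all_other_sketch` and `keep_value` are hypothetical names.

```python
import pandas as pd

def replace_all_other_sketch(df, column, keep_value):
    """Recode `column` to 1 where it equals keep_value and 0 everywhere else."""
    out = df.copy()
    out[column] = (out[column] == keep_value).astype(int)
    return out

# Hypothetical stand-in data; 1 is the answer to keep
demo = pd.DataFrame({"PRE_FederalSpendingKnowledge": [1, 2, 3, 4, 1]})
binary = replace_all_other_sketch(demo, "PRE_FederalSpendingKnowledge", 1)

print(binary["PRE_FederalSpendingKnowledge"].tolist())
```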

In [225]:
data_imputed['PRE_FederalSpendingKnowledge'].value_counts()

PRE_FederalSpendingKnowledge
0    5228
1    3052
Name: count, dtype: int64

In [223]:
importlib.reload(prepare_data)
data_combined = prepare_data.combine_columns_by_group(data_imputed)
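
The final column list in this notebook contains `mentionFox`, `mentionABC`, and `mentionCNN`, which suggests that the per-program media columns are collapsed into one indicator per outlet. How they are combined is not shown; the sketch below assumes "1 if any program column equals 1". `combine_columns_by_group_sketch` is a hypothetical name.

```python
import pandas as pd

def combine_columns_by_group_sketch(df, groups=("Fox", "CNN", "ABC")):
    """Collapse PRE_<group>_* columns into a single mention<group> indicator."""
    out = df.copy()
    for group in groups:
        cols = [c for c in out.columns if c.startswith(f"PRE_{group}_")]
        if not cols:
            continue  # nothing to combine for this outlet
        # Assumption: a respondent "mentions" the outlet if any program is coded 1
        out[f"mention{group}"] = (out[cols] == 1).any(axis=1).astype(int)
        out = out.drop(columns=cols)
    return out

# Hypothetical stand-in data
demo = pd.DataFrame({
    "PRE_Fox_Hannity": [1, 0, 0],
    "PRE_Fox_TheFive": [0, 0, 1],
    "PRE_CNN_CNNWebsite": [0, 1, 0],
    "Age": [35, 51, 65],
})
combined = combine_columns_by_group_sketch(demo)

print(combined.columns.tolist())
```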

In [217]:
data_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8280 entries, 0 to 8279
Data columns (total 46 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PRE_VotePresident             8280 non-null   int32  
 1   PRE_RaceOutcomePrediction     8280 non-null   int32  
 2   PRE_ThermoBiden               8280 non-null   int32  
 3   PRE_ThermoTrump               8280 non-null   int32  
 4   PRE_ThermoHarris              8280 non-null   int32  
 5   PRE_ThermoPence               8280 non-null   int32  
 6   PRE_ThermoObama               8280 non-null   int32  
 7   PRE_ThermoDemParty            8280 non-null   int32  
 8   PRE_ThermoRepParty            8280 non-null   int32  
 9   PRE_ParentNativeStatus        8280 non-null   int32  
 10  PRE_YearsAtAddress            8280 non-null   float64
 11  Sex                           8280 non-null   int32  
 12  PRE_SummaryVoteDutyChoice     8280 non-null   int32  
 13  PRE

In [218]:
data_combined['POST_ProblemMention'].value_counts()

POST_ProblemMention
83    2815
82    2155
32     672
81     608
47     402
49     363
50     347
48     311
24     238
78     188
17     181
Name: count, dtype: int64

In [236]:
importlib.reload(prepare_data)
data_one_hot = prepare_data.one_hot_cat_cols(data_combined, dictionary,column_labels)

In [237]:
data_one_hot

Unnamed: 0,PRE_ThermoBiden,PRE_ThermoTrump,PRE_ThermoHarris,PRE_ThermoPence,PRE_ThermoObama,PRE_ThermoDemParty,PRE_ThermoRepParty,PRE_YearsAtAddress,PRE_SummaryVoteDutyChoice,PRE_PartyID,PRE_ScaleSpendingServices,PRE_CountryDirection,PRE_GovTrust,PRE_EconomyView,Age,Income,HouseholdChildren,PRE_VoteAccuracy,PRE_SurveySeriousness,PRE_ScaleDefenseSpending,PRE_ScaleMedInsurance,PRE_ScaleGovAssistanceBlacks,PRE_ScaleJobIncome,POST_VoteAccuracy,POST_ThermoHarris,POST_ThermoPence,POST_ThermoBiden,POST_ThermoTrump,mentionFox,mentionABC,mentionCNN,PRE_VotePresident_Joe Biden,PRE_VotePresident_Donald Trump,PRE_VotePresident_Jo Jorgensen,PRE_VotePresident_Howie Hawkins,PRE_VotePresident_Other,PRE_RaceOutcomePrediction_Will be close,PRE_RaceOutcomePrediction_Win by quite a bit,PRE_ParentNativeStatus_Both parents born in the US,PRE_ParentNativeStatus_One parent born in the US,PRE_ParentNativeStatus_Both parents born in another country,Sex_Male,Sex_Female,PRE_AbortionRightsSC_Pleased,PRE_AbortionRightsSC_Upset,PRE_AbortionRightsSC_Neither pleased nor upset,EducationLevel_Less than high school credential,EducationLevel_High school graduate,EducationLevel_Some college but no degree,EducationLevel_Associate degree - occupational/vocational,...,StateRegistration_New Hampshire,StateRegistration_New Jersey,StateRegistration_New Mexico,StateRegistration_New York,StateRegistration_North Carolina,StateRegistration_North Dakota,StateRegistration_Ohio,StateRegistration_Oklahoma,StateRegistration_Oregon,StateRegistration_Pennsylvania,StateRegistration_43,StateRegistration_Rhode Island,StateRegistration_South Carolina,StateRegistration_South Dakota,StateRegistration_Tennessee,StateRegistration_Texas,StateRegistration_Utah,StateRegistration_Vermont,StateRegistration_Virginia,StateRegistration_Washington,StateRegistration_West Virginia,StateRegistration_Wisconsin,StateRegistration_Wyoming,PRE_PartyMoreHouseMembers_correct (D),PRE_PartyMoreHouseMembers_incorrect (R),PRE_FederalSpendingKnowledge_correct (Foreign aid),"PRE_FederalSpendingKnowledge_incorrect (Medicare, National defense, SS)",PRE_FederalSpendingKnowledge_3,PRE_FederalSpendingKnowledge_4,PRE_CorruptionView_Increased,PRE_CorruptionView_Decreased,PRE_CorruptionView_Stayed the same,POST_Voted2020_Not registered and did not vote,POST_Voted2020_Registered and did not vote,POST_Voted2020_Voted,POST_VotePresident_Joe Biden,POST_VotePresident_Donald Trump,POST_VotePresident_Jo Jorgensen,POST_VotePresident_Howie Hawkins,POST_ProblemMention_Environment,POST_ProblemMention_Health (all other),POST_ProblemMention_Race relations,POST_ProblemMention_Partisan politics,POST_ProblemMention_Politicians,POST_ProblemMention_Government (all other),POST_ProblemMention_The economy,POST_ProblemMention_Elections,POST_ProblemMention_Unity /division,POST_ProblemMention_Health care,POST_ProblemMention_83
0,0,100,0,85,0,0,85,10.0,1,7,1,3,5,2,46.0,21,0,3,5,7,7,7,7,2,0,85,0,100,0,0,0,False,True,False,False,False,False,True,False,True,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False
1,0,0,0,0,50,0,50,4.0,4,4,3,1,5,3,37.0,13,1,2,5,4,4,4,5,2,0,0,15,15,0,0,0,False,False,True,False,False,True,False,True,False,False,False,True,False,True,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,True,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,True,False,False
2,65,0,65,0,90,60,0,11.0,4,3,6,1,4,4,40.0,17,2,3,5,1,2,3,4,1,80,0,85,0,0,0,1,True,False,False,False,False,False,True,True,False,False,False,True,False,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False
3,70,15,85,15,85,50,70,20.0,1,6,7,2,3,4,41.0,7,1,4,5,4,1,3,7,1,85,50,100,60,0,0,0,True,False,False,False,False,False,True,False,False,True,True,False,False,False,True,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False
4,15,85,15,90,10,20,70,10.0,1,4,2,4,5,4,72.0,22,0,2,5,4,5,6,7,3,0,95,0,90,1,0,0,False,True,False,False,False,True,False,True,False,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,False,True,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8275,0,100,0,50,0,0,100,1.0,1,7,7,3,2,2,26.0,8,2,4,5,6,1,5,2,4,22,50,40,100,0,0,0,False,True,False,False,False,False,True,False,False,True,False,True,False,True,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,False,True,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False
8276,50,70,50,70,50,40,70,22.0,5,6,4,2,4,4,52.0,19,0,3,5,5,5,5,4,1,40,70,40,70,0,0,0,False,True,False,False,False,True,False,False,False,True,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,False,False,True,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False
8277,70,30,85,15,70,85,50,6.0,1,1,5,3,4,2,45.0,16,0,4,5,7,4,4,4,1,70,20,60,30,0,0,0,True,False,False,False,False,True,False,False,False,True,True,False,False,True,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,True,True,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False
8278,0,100,0,100,0,0,70,16.0,1,7,2,3,4,2,65.0,14,0,1,5,6,6,7,7,5,0,100,0,100,0,0,0,False,True,False,False,False,False,True,True,False,False,False,True,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,True,False,True,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True
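
The wide table above has one indicator column per category, named `<column>_<label>`, with unlabeled codes keeping their numeric value (for example `StateRegistration_43`). A minimal sketch of this kind of encoding with `pd.get_dummies` follows. The function name and the `label_map` structure are assumptions for illustration, not the signature of `prepare_data.one_hot_cat_cols`.

```python
import pandas as pd


def one_hot_sketch(df, label_map):
    """Hypothetical one-hot helper: label_map is {column: {code: label}}."""
    out = df.copy()
    for col, mapping in label_map.items():
        # Codes without a label fall back to their numeric value as text.
        out[col] = out[col].map(lambda v: mapping.get(v, str(v)))
    # One indicator column per category, prefixed with the column name.
    return pd.get_dummies(out, columns=list(label_map))


# Toy data: the codes and labels here are illustrative assumptions.
toy = pd.DataFrame({"Sex": [1, 2, 2], "Age": [46.0, 37.0, 40.0]})
encoded = one_hot_sketch(toy, {"Sex": {1: "Male", 2: "Female"}})
print(encoded)
```

Numeric columns such as `Age` pass through unchanged; only the columns listed in `label_map` are expanded.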


In [150]:
train_data, val_data, test_data = prepare_data.split_data(data_imputed)
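
The body of `prepare_data.split_data` is not shown here. As a sketch of a typical random train/validation/test split, the example below shuffles the rows once and slices them. The 60/20/20 proportions, the seed, and the function name are assumptions rather than the project's settings.

```python
import numpy as np
import pandas as pd


def split_data_sketch(df, train_frac=0.6, val_frac=0.2, seed=0):
    """Hypothetical stand-in for prepare_data.split_data."""
    # Shuffle once with a fixed seed so the split is reproducible.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # whatever remains
    return train, val, test


toy = pd.DataFrame({"x": np.arange(10)})
train, val, test = split_data_sketch(toy)
print(len(train), len(val), len(test))
```

Because the three slices come from a single shuffle, no row can appear in more than one set.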

In [151]:
# Save the datasets to CSV files
train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)