## Data exploration and data preprocessing for the congressional voting dataset

The folder includes 3 .csv files:
* CongressionalVotingID.shuf.lrn.csv *- the training data set*
* CongressionalVotingID.shuf.sol.ex.csv *- a sample solution file in the correct format* 
* CongressionalVotingID.shuf.tes.csv *- the test dataset*


The data contains one integer ID column; all other columns have dtype object.


Side note on using the Kaggle API: https://medium.com/mcd-unison/using-the-kaggle-api-e43e902fba23

In [1]:
import pandas as pd
import numpy as np

### Import and first overview of the data

In [2]:
# load the data
congVoting_train = pd.read_csv('./data/CongressionVoting/CongressionalVotingID.shuf.lrn.csv')
congVoting_test = pd.read_csv('./data/CongressionVoting/CongressionalVotingID.shuf.tes.csv')
congVoting_sol = pd.read_csv('./data/CongressionVoting/CongressionalVotingID.shuf.sol.ex.csv')
print(f"The shape of the data looks as follows: train - {congVoting_train.shape}, test - {congVoting_test.shape}, sol - {congVoting_sol.shape}")

The shape of the data looks as follows: train - (218, 18), test - (217, 17), sol - (217, 2)


In [3]:
congVoting_train.head()
# Assumption: ID is the key of the test data, so the sol data can be merged onto the test data via ID.

# Count the IDs that the test data shares with the sol data and with the train data
shared_test_sol = len(set(congVoting_test.ID) & set(congVoting_sol.ID))
shared_test_train = len(set(congVoting_test.ID) & set(congVoting_train.ID))

print(f"The congVoting_test data shares {shared_test_sol} IDs with the congVoting_sol data and {shared_test_train} IDs with the congVoting_train data")

The congVoting_test data shares 217 IDs with the congVoting_sol data and 0 IDs with the congVoting_train data


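The previous code chunk shows that the IDs in test and sol are indeed the same, and that none of the test IDs occur in the train dataset. Thus, to create a different split, one could combine both datasets. Since sol is described above as a sample solution file, its class labels should be checked before they are treated as ground truth.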

In [4]:
congVoting = pd.concat([congVoting_train,pd.merge(congVoting_test, congVoting_sol, on='ID', how='inner')])
congVoting.head()
print(f"The new merged dataset has the following shape: {congVoting.shape}")

The new merged dataset has the following shape: (435, 18)
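
The merge-and-stack step can also be made self-checking. The following is a minimal sketch on toy data, not the notebook's actual frames: `test_df`, `sol_df` and `train_df` and their values are hypothetical stand-ins. It assumes that ID should be unique on both sides of the merge and uses the `validate` and `indicator` parameters of `pd.merge` to confirm that.

```python
import pandas as pd

# Toy stand-ins for the test, sample-solution and train frames (hypothetical values).
test_df = pd.DataFrame({"ID": [1, 2, 3], "crime": ["y", "n", "unknown"]})
sol_df = pd.DataFrame({"ID": [3, 1, 2], "class": ["democrat", "republican", "democrat"]})
train_df = pd.DataFrame({"ID": [10, 11], "class": ["republican", "democrat"], "crime": ["n", "y"]})

# validate raises pandas.errors.MergeError if ID is duplicated on either side;
# indicator adds a _merge column that records where each row was found.
labelled_test = pd.merge(test_df, sol_df, on="ID", how="outer",
                         validate="one_to_one", indicator=True)
all_matched = (labelled_test["_merge"] == "both").all()

# Stack train and labelled test rows; concat aligns the columns by name.
combined = pd.concat([train_df, labelled_test.drop(columns="_merge")],
                     ignore_index=True)
print(all_matched, combined.shape)
```

With `how="outer"`, any ID present in only one of the two files would show up as `left_only` or `right_only` in `_merge` instead of being dropped silently.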


In [5]:
congVoting_train.dtypes

ID                                         int64
class                                     object
handicapped-infants                       object
water-project-cost-sharing                object
adoption-of-the-budget-resolution         object
physician-fee-freeze                      object
el-salvador-aid                           object
religious-groups-in-schools               object
anti-satellite-test-ban                   object
aid-to-nicaraguan-contras                 object
mx-missile                                object
immigration                               object
synfuels-crporation-cutback               object
education-spending                        object
superfund-right-to-sue                    object
crime                                     object
duty-free-exports                         object
export-administration-act-south-africa    object
dtype: object

### Missing value handling

In [6]:
# congVoting_train.isna().sum() is not informative here:
# the unknowns are represented by the string 'unknown', so count that value per column instead

congVoting_train.apply(pd.Series.value_counts).loc['unknown']


ID                                         NaN
class                                      NaN
handicapped-infants                        7.0
water-project-cost-sharing                21.0
adoption-of-the-budget-resolution          4.0
physician-fee-freeze                       6.0
el-salvador-aid                            9.0
religious-groups-in-schools                5.0
anti-satellite-test-ban                    6.0
aid-to-nicaraguan-contras                 10.0
mx-missile                                12.0
immigration                                4.0
synfuels-crporation-cutback               13.0
education-spending                        16.0
superfund-right-to-sue                    14.0
crime                                      7.0
duty-free-exports                         14.0
export-administration-act-south-africa    58.0
Name: unknown, dtype: float64
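
The same per-column count can be obtained by comparing against the marker string directly. This is a minimal sketch on a toy frame (`votes` and its values are hypothetical); it returns integer counts and reports 0 rather than NaN for columns without any 'unknown' entry.

```python
import pandas as pd

# Toy frame using the same 'y' / 'n' / 'unknown' encoding (hypothetical values).
votes = pd.DataFrame({
    "class": ["democrat", "republican", "democrat"],
    "crime": ["y", "unknown", "n"],
    "immigration": ["unknown", "unknown", "y"],
})

# eq() builds a boolean mask; summing it counts the matches per column.
unknown_per_column = votes.eq("unknown").sum()
print(unknown_per_column)
```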

In [7]:
congVoting_train.set_index('ID', inplace=True)

In [9]:
# Count the 'unknown' entries per row, sorted from most to fewest
rowiseMissingValues = congVoting_train.apply(pd.Series.value_counts, axis=1).sort_values(by='unknown', ascending=False).loc[:,['unknown']]

# Drop the rows whose number of missing values exceeds a defined threshold.
# The threshold is roughly half the number of columns (the class column is still counted in shape[1]).
missingValueThreshold = round(congVoting_train.shape[1]*0.5)
rowsToDrop = rowiseMissingValues[rowiseMissingValues['unknown'] > missingValueThreshold]

congVoting_train.drop(rowsToDrop.index, inplace=True)
congVoting_train.shape



(217, 17)
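
The row-wise filter can be written more directly with a boolean mask. The sketch below uses a toy frame (`votes`, `feature_cols` and the values are hypothetical) and, unlike the cell above, assumes the threshold should be computed from the feature columns only, excluding `class`.

```python
import pandas as pd

# Toy frame indexed by ID, with one label column and four vote columns (hypothetical values).
votes = pd.DataFrame(
    {
        "class": ["democrat", "republican", "democrat"],
        "v1": ["y", "unknown", "n"],
        "v2": ["n", "unknown", "unknown"],
        "v3": ["y", "unknown", "y"],
        "v4": ["n", "y", "y"],
    },
    index=pd.Index([101, 102, 103], name="ID"),
)

# Count 'unknown' per row over the feature columns only.
feature_cols = votes.columns.drop("class")
unknown_per_row = votes[feature_cols].eq("unknown").sum(axis=1)

# Keep rows with at most half of their votes missing.
threshold = len(feature_cols) / 2
filtered = votes[unknown_per_row <= threshold]
print(filtered.index.tolist())
```

Filtering with a mask returns a new frame instead of modifying the original in place, which makes it easier to rerun the cell.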

### Convert boolean data to numeric values

The feature columns take the values 'n', 'y' and 'unknown'. These are converted to 0, 1 and NaN.

In [7]:
def convert_values_to_numeric(df):
    # Define the mapping for conversion
    mapping = {'n': 0, 'y': 1, 'unknown': np.nan}

    # Use the replace function to perform the conversion
    converted_df = df.replace(mapping)

    return converted_df

def convert_numeric_to_values(df):
    # Define the mapping for conversion
    mapping = {0: 'n', 1: 'y', np.nan:'unknown'}

    # Use the replace function to perform the conversion
    converted_df = df.replace(mapping)

    return converted_df

In [8]:
# Convert the votes to numbers and inspect the column crime
to_number_test = convert_values_to_numeric(congVoting_train)
print(f"Unique values in the column crime: {to_number_test.crime.unique()}")
print(f"Number of NaN values in the column crime: {to_number_test.crime.isna().sum()}")

# Convert back to the original labels to check the round trip
to_values_test = convert_numeric_to_values(to_number_test)
print(f"Unique values in the column crime: {to_values_test.crime.unique()}")
print(f"Number of NaN values in the column crime: {to_values_test.crime.isna().sum()}")

Unique values in the column crime: [ 0.  1. nan]
Number of NaN values in the column crime: 6
Unique values in the column crime: ['n' 'y' 'unknown']
Number of NaN values in the column crime: 0
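
As an alternative to calling `replace` on the whole frame, the conversion can be restricted to the vote columns with `Series.map`. This is a minimal sketch, not the notebook's method: the helper names `votes_to_numeric` and `numeric_to_votes`, the `feature_cols` argument and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

VOTE_TO_NUMBER = {"n": 0, "y": 1, "unknown": np.nan}
NUMBER_TO_VOTE = {0.0: "n", 1.0: "y"}

def votes_to_numeric(df, feature_cols):
    """Map 'n'/'y'/'unknown' to 0/1/NaN in the given columns only."""
    out = df.copy()
    for col in feature_cols:
        out[col] = out[col].map(VOTE_TO_NUMBER)
    return out

def numeric_to_votes(df, feature_cols):
    """Map 0/1 back to 'n'/'y'; NaN (unmapped) becomes 'unknown'."""
    out = df.copy()
    for col in feature_cols:
        out[col] = out[col].map(NUMBER_TO_VOTE).fillna("unknown")
    return out

# Toy data (hypothetical values) for a round-trip check.
votes = pd.DataFrame({
    "class": ["democrat", "republican", "democrat"],
    "crime": ["y", "unknown", "n"],
})
numeric = votes_to_numeric(votes, ["crime"])
restored = numeric_to_votes(numeric, ["crime"])
print(numeric["crime"].isna().sum(), restored["crime"].tolist())
```

Passing the column list explicitly leaves `class` and any other non-vote column untouched.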


In [12]:
congVoting_train["class"].unique()

array(['democrat', 'republican'], dtype=object)