# Exploratory Data Analysis: Ballots Data

# Table of Contents

* [Load Modules](#setup)
* [Load Data](#loaddata)
* [Data Cleaning](#datacleaning)
 * [Remove Columns with Personal Identifying Information](#removepii)
 * [Remove Duplicate Rows](#duprows)
 * [Remove Ballots With Status NaN](#ballotsstatusnan)
 * [Duplicate Voter Id Records](#dupvoterid)
 * [Final Ballots Dataset](#finaldataset)
 * [Write Dataset to CSV File](#writecsv)

<hr>

## Load Modules<a class="anchor" id="setup"></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from scipy.stats import chi2_contingency

mpl.rcParams['font.size'] = 12.0
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

<hr>

## Load Data<a class="anchor" id="loaddata"></a>

In [2]:
ballots_dataset = pd.read_csv('../../../vbm12.13.20.csv')
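
When a large CSV has columns whose values look numeric in some rows and textual in others, pandas has to guess a type for them. The sketch below uses a small in-memory stand-in rather than the real file to show one possible way to pin ID-like columns to strings through the `dtype` argument of `pd.read_csv`. The sample rows and the choice of columns to pin are assumptions made for illustration only.

```python
import io

import pandas as pd

# Tiny stand-in for the real file (hypothetical rows; column names mirror the ballots data).
sample_csv = io.StringIO(
    "voter_id,ballot_ward,ballot_status\n"
    "N001,00,Accepted\n"
    "N002,2,Rejected\n"
)

# Read ID-like columns as strings so that values such as "00" keep their leading zeros.
sample = pd.read_csv(sample_csv, dtype={"voter_id": str, "ballot_ward": str})
print(sample.dtypes)
```

Whether this is worthwhile for the real file depends on which columns are actually identifiers rather than quantities.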

<hr>

## Data Cleaning<a class="anchor" id="datacleaning"></a>

### Remove Columns with Personal Identifying Information<a class="anchor" id="removepii"></a>

In [4]:
# Drop name, address, phone, and email columns
ballots_dataset.drop(
    columns=[
        'voter_lastName', 'voter_firstName', 'voter_middleName', 'voter_suffix',
        'ballot_addr_address1', 'ballot_addr_address2', 'ballot_addr_address3', 'ballot_addr_zipcode',
        'voter_resAddr_address1', 'voter_resAddr_address2', 'voter_resAddr_address3',
        'voter_resAddr_zipcode', 'voter_phone', 'voter_email',
    ],
    inplace=True,
)
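
As a possible follow-up check, the sketch below confirms on a toy frame that none of the listed columns survive the drop. The toy data and the shortened `pii_columns` list are hypothetical. Passing `errors="ignore"` is a design choice made here so that the cell can be re-run without failing on columns that are already gone.

```python
import pandas as pd

# Shortened, hypothetical subset of the PII column list used above.
pii_columns = ["voter_lastName", "voter_phone", "voter_email"]

# Toy frame standing in for ballots_dataset.
toy = pd.DataFrame({
    "voter_id": ["N001"],
    "voter_lastName": ["Doe"],
    "voter_email": ["x@example.com"],
    "ballot_status": ["Accepted"],
})

# errors="ignore" skips labels that are absent ("voter_phone" here).
toy = toy.drop(columns=pii_columns, errors="ignore")

# Any overlap between the PII list and the remaining columns signals a missed drop.
remaining = set(pii_columns) & set(toy.columns)
print(remaining)
```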

### Remove Duplicate Rows<a class="anchor" id="duprows"></a>

In [5]:
# Remove Duplicate Rows
ballots_dataset.drop_duplicates(inplace=True)
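
It can help to know how many rows a deduplication step removes. The sketch below counts fully identical rows with `DataFrame.duplicated()` before dropping them. It runs on hypothetical toy data, not on the ballots file.

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data).
toy = pd.DataFrame({
    "voter_id": ["A", "A", "B"],
    "ballot_status": ["Accepted", "Accepted", "Rejected"],
})

# duplicated() marks every repeat of an earlier row; the first occurrence is not marked.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(f"removed {n_dupes} duplicate row(s); {len(deduped)} remain")
```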

### Remove Ballots With Status NaN<a class="anchor" id="ballotsstatusnan"></a>

In [6]:
ballots_dataset = ballots_dataset[ballots_dataset['ballot_status'].notna()].reset_index(drop=True)

In [7]:
ballots_dataset.shape

(4368985, 31)
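
The same filter can also be written with `dropna(subset=...)`, and counting the missing values first records how many ballots the step discards. This is a minimal sketch on hypothetical toy data, not the notebook's actual figures.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing ballot_status (hypothetical data).
toy = pd.DataFrame({
    "voter_id": ["A", "B", "C"],
    "ballot_status": ["Accepted", np.nan, "Rejected"],
})

# Count rows that would be dropped, then drop them and renumber the index.
n_missing = int(toy["ballot_status"].isna().sum())
cleaned = toy.dropna(subset=["ballot_status"]).reset_index(drop=True)

print(n_missing, cleaned.shape)
```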

### Duplicate Voter Id Records<a class="anchor" id="dupvoterid"></a>

In [8]:
ballots_dataset[ballots_dataset['voter_id'].duplicated(keep=False)]

Unnamed: 0,current_county,current_municipality,current_ward,current_district,ballot_requestType,voter_id,voter_party,voter_status,ballot_type,ballot_county,ballot_municipality,ballot_ward,ballot_district,ballot_vtr_party,ballot_addr_city,ballot_addr_state,ballot_addr_country,application_receivedDate,application_processedDate,application_status,ballot_mailedDate,ballot_receivedDate,ballot_countedDate,ballot_status,received_type,received_rejReason,voter_resAddr_num,voter_resAddr_street,voter_resAddr_city,voter_resAddr_state,received_bearer
955199,Camden,Collingswood Borough,0,9.0,Annual Mail-In Elections,N5918452380,Democratic,Active,Presidential / Removed Resident,Camden,Collingswood Borough,0,9,Democratic,COLLINGSWOOD,NJ,UNITED STATES,12/22/2019,12/22/2019,Accepted,10/01/2020,10/27/2020,11/19/2020,Accepted,Mail,,109,E Franklin Ave,Collingswood,NJ,
955200,Camden,Collingswood Borough,0,9.0,Annual Mail-In Elections,N5918452380,Democratic,Active,Presidential / Removed Resident,Camden,Collingswood Borough,0,9,Democratic,COLLINGSWOOD,NJ,UNITED STATES,12/22/2019,12/22/2019,Accepted,10/01/2020,10/27/2020,11/19/2020,Accepted,Drop Box,,109,E Franklin Ave,Collingswood,NJ,
1027242,Camden,Lindenwold Borough,0,7.0,Single Election,O6429403939,Unaffiliated,Active,Regular,Camden,Lindenwold Borough,0,7,Unaffiliated,LINDENWOLD,NJ,US,09/23/2020,09/23/2020,Accepted,09/28/2020,10/17/2020,11/19/2020,Rejected,Drop Box,Certificate Not Signed,101,E GIBBSBORO RD,LINDENWOLD,NJ,
1027243,Camden,Lindenwold Borough,0,7.0,Single Election,O6429403939,Unaffiliated,Active,Regular,Camden,Lindenwold Borough,0,7,Unaffiliated,LINDENWOLD,NJ,US,09/23/2020,09/23/2020,Accepted,09/28/2020,10/17/2020,11/19/2020,Accepted,Mail,,101,E GIBBSBORO RD,LINDENWOLD,NJ,
4078501,Union,Clark Township,2,4.0,Single Election,B4484851749,Republican,Active,Regular,Union,Clark Township,2,4,Republican,Clark,NJ,,08/14/2020,08/31/2020,Accepted,09/15/2020,10/27/2020,11/20/2020,Accepted,Bearer,,75,Victoria Dr,Clark,NJ,CHRISTOPHER PANDOLFO
4078502,Union,Clark Township,2,4.0,Single Election,B4484851749,Republican,Active,Regular,Union,Clark Township,2,4,Republican,Clark,NJ,,08/14/2020,08/31/2020,Accepted,09/15/2020,10/27/2020,11/20/2020,Accepted,Drop Box,,75,Victoria Dr,Clark,NJ,
4158632,Union,Linden City,6,2.0,Single Election,A4579251725,Democratic,Active,Regular,Union,Linden City,6,2,Democratic,Linden,NJ,,08/14/2020,08/31/2020,Accepted,09/22/2020,11/03/2020,11/20/2020,Rejected,Bearer,Certificate Not Signed,119,E 11th St,Linden,NJ,ROY HERMAN
4158633,Union,Linden City,6,2.0,Single Election,A4579251725,Democratic,Active,Regular,Union,Linden City,6,2,Democratic,Linden,NJ,,08/14/2020,08/31/2020,Accepted,09/22/2020,11/03/2020,11/20/2020,Accepted,Bearer,,119,E 11th St,Linden,NJ,ROY HERMAN


In [9]:
# Drop one row from each duplicated voter_id pair so that an Accepted ballot remains for every voter
ballots_dataset.drop(index=[955199, 1027242, 4078501, 4158632], inplace=True)
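
Hard-coded index labels only work for this exact snapshot of the data. As a hedged alternative, the sketch below picks one row per `voter_id` programmatically, preferring an Accepted ballot when one exists. The toy data and the `_rank` helper column are hypothetical. The tie-break also differs from the cell above: when both rows are Accepted, this sketch keeps the first in original order, whereas the cell above keeps the second.

```python
import pandas as pd

# Toy frame: V1 has a Rejected and an Accepted row, V2 has two Accepted rows (hypothetical data).
toy = pd.DataFrame({
    "voter_id": ["V1", "V1", "V2", "V2", "V3"],
    "ballot_status": ["Rejected", "Accepted", "Accepted", "Accepted", "Rejected"],
    "received_type": ["Drop Box", "Mail", "Mail", "Drop Box", "Mail"],
})

# Rank Accepted rows as 0 and all others as 1.
is_accepted = toy["ballot_status"].eq("Accepted")

deduped = (
    toy.assign(_rank=(~is_accepted).astype(int))
       # A stable sort on the single rank column keeps original order within each rank.
       .sort_values("_rank", kind="stable")
       # The first occurrence per voter is now an Accepted row whenever one exists.
       .drop_duplicates(subset="voter_id", keep="first")
       .drop(columns="_rank")
       .sort_index()
)

print(deduped)
```

Swapping `keep="first"` for a different ordering rule, for example sorting by a received date first, would change which of two Accepted rows survives.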

### Final Ballots Dataset<a class="anchor" id="finaldataset"></a>

In [10]:
ballots_dataset.shape

(4368981, 31)

### Write Dataset to CSV File<a class="anchor" id="writecsv"></a>

In [11]:
ballots_dataset.to_csv('./ballots_dataset.csv', index=False)