## Litics360 Prototype Version Pv1.0
# Data Collection
--------------------------------------------------------------------------------------------------------------------
# Voter Data - Ohio Secretary of State Voter Database

### Background on the data source:
The database used to build prototype version 1.0 of Litics360 is a collection of records of registered voters in the state of Ohio, as submitted by each county Board of Elections. The source is the Ohio Secretary of State's official voter database. These records are submitted and maintained in accordance with the Ohio Revised Code. Current files include voting histories for elections from 2000 to the present, as provided by the counties.

Source link: https://www6.ohiosos.gov/ords/f?p=VOTERFTP:STWD:::#stwdVtrFiles

### Purpose behind usage of database:
This dataset serves as the foundation of voter data for predictive prototyping because it contains reliable, publicly available records of Ohio's voters, including each voter's demographic information, registered party affiliation, and turnout in primary and general elections since 2000.

--------------------------------------------------------------------------------------------------------------------

## 1. Download Voter Dataset
### Brief scope of tasks:
    1.1 Import necessary libraries
    1.2 Initialize download function
    1.3 Extract appropriate links from Ohio SOS statewide voter files download page
    1.4 Download and decompress files
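
For step 1.3, one detail matters when filtering `href` values with a regular expression: an unescaped `?` is a quantifier, not a literal question mark. The sketch below uses only the standard library and a few hypothetical `href` strings (shaped like the download links on the Ohio SOS page) to show how a loose pattern and an escaped pattern differ. The names `loose`, `strict`, and `hrefs` are illustrative assumptions.

```python
import re

# Hypothetical href values, shaped like the links on the statewide voter files page
hrefs = [
    "f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:363",
    "f?p=VOTERFTP:STWD:::",
    "p=VOTERFTP:DOWNLOAD::FILE:NO:2",  # lacks the leading "f?"
]

# Unescaped: "f?" means "an optional f", so the pattern really starts at "p="
loose = re.compile("f?p=VOTERFTP:DOWNLOAD::FILE:")
# Escaped: "\?" matches a literal question mark
strict = re.compile(r"f\?p=VOTERFTP:DOWNLOAD::FILE:")

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

# Prefix the relative links with the site root, as the notebook does
base = "https://www6.ohiosos.gov/ords/"
full_links = [base + h for h in strict_hits]

print(len(loose_hits), len(strict_hits))
print(full_links)
```

Both patterns pick up the real download link, so the loose form works on this page, but only the escaped form rejects strings that lack the literal `f?` prefix.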

In [21]:
#------------------------------------------------------------------------------------------
    #1.1 Import necessary libraries
#------------------------------------------------------------------------------------------
print('\n----------------------------------------\n1.1 Importing necessary libraries ...')
#For html page import and to find links
import requests
from bs4 import BeautifulSoup
import re
import sys
import os
import send2trash

#For decompressing gzip files
import gzip
import shutil
print('Import --> COMPLETED')

#------------------------------------------------------------------------------------------
    #1.2 Initialize download function
#------------------------------------------------------------------------------------------
print('\n----------------------------------------\n1.2 Initalized download function')

def download(url, filename):
    with open(filename, 'wb') as f:
        print('\n','Downloading --> ',filename)
        response = requests.get(url, stream=True)
        total = response.headers.get('content-length')

        if total is None:
            f.write(response.content)
        else:
            downloaded = 0
            total = int(total)
            for data in response.iter_content(chunk_size=max(int(total / 1000), 1024 * 1024)):
                downloaded += len(data)
                f.write(data)
                done = int(50 * downloaded / total)
                sys.stdout.write('\r[{}{}]'.format('█' * done, '.' * (50 - done)))
                sys.stdout.flush()
    sys.stdout.write('\nDownload --> COMPLETED\n')
    
#------------------------------------------------------------------------------------------
    #1.3 Extract appropriate links from Ohio SOS statewide voter files download page
#------------------------------------------------------------------------------------------
print('\n----------------------------------------\n1.3 Extracting appropriate links from Ohio SOS statewide voter files download page ...')
#URL to SOS Ohio statewide voter files page
main_url = "https://www6.ohiosos.gov/ords/f?p=VOTERFTP:STWD:::#stwdVtrFiles"

#Scrape from the SOS Ohio statewide voter files page
response = requests.get(main_url)
page = BeautifulSoup(response.content, "html5lib")

link_leader = 'https://www6.ohiosos.gov/ords/'
links = []

#Get the links for statewide download files
#Note: '?' is escaped so it matches a literal question mark instead of making the 'f' optional
for link in page.find_all('a', attrs={'href': re.compile(r'f\?p=VOTERFTP:DOWNLOAD::FILE:')}):
    links.append(link_leader + str(link.get('href')))
    
#Display extracted links
print('\n---- Source:\n',main_url,
      '\n\n---- Links extracted:\n',list(links))

#Path directory text
path = 'VoterData/'
      
# Display files in directory before downloading
dir_list = os.listdir(path)  
print('\n---- Files in directory:\n', dir_list)

#Create filenames for easy download
gz_files =[]
txt_files = []
for i in range(1,5):
    gz_files.append('VoterData/voterfile'+str(i)+'.txt.gz')
    txt_files.append('VoterData/voterfile'+str(i)+'.txt')
gz_files.sort()
txt_files.sort()

print('\n\nExtract links --> COMPLETED')

#------------------------------------------------------------------------------------------
    #1.4 Download and decompress files
#------------------------------------------------------------------------------------------
print('\n----------------------------------------\n1.4 Downloading and decompressing files ...\n')
for i in range(0,4):
    
    #Download files
    download(links[i],gz_files[i])
    
    #Decompress files
    print('\nDecompressing --> ',txt_files[i])
    
    with gzip.open((gz_files[i]), 'rb') as f_in:
        with open((txt_files[i]), 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
        
    print('Decompress --> COMPLETED')

# Sending gzip files to trash
send2trash.send2trash('VoterData/voterfile1.txt.gz')
send2trash.send2trash('VoterData/voterfile2.txt.gz')
send2trash.send2trash('VoterData/voterfile3.txt.gz')
send2trash.send2trash('VoterData/voterfile4.txt.gz')

dir_list = os.listdir(path)  
print('\n---- Files in directory:\n', dir_list)

print('\n\nFile download/decompress --> COMPLETED')


----------------------------------------
1.1 Importing necessary libraries ...
Import --> COMPLETED

----------------------------------------
1.2 Initalized download function

----------------------------------------
1.3 Extracting appropriate links from Ohio SOS statewide voter files download page ...

---- Source:
 https://www6.ohiosos.gov/ords/f?p=VOTERFTP:STWD:::#stwdVtrFiles 

---- Links extracted:
 ['https://www6.ohiosos.gov/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:363', 'https://www6.ohiosos.gov/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:364', 'https://www6.ohiosos.gov/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:365', 'https://www6.ohiosos.gov/ords/f?p=VOTERFTP:DOWNLOAD::FILE:NO:2:P2_PRODUCT_NUMBER:366']

---- Files in directory:
 ['Blind', '.DS_Store', 'Model']


Extract links --> COMPLETED

----------------------------------------
1.4 Downloading and decompressing files ...


 Downloading -->  VoterData/voterfile1.txt.gz
[██████████████
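
The decompress step in 1.4 can also be wrapped in a small helper so that the output name is derived from the archive name instead of being kept in a parallel list. This is a minimal sketch using only the standard library. The helper name `decompress_gz`, the `remove_source` flag, and the demo file contents are assumptions for illustration, and it deletes with `Path.unlink` (permanent) where the notebook uses `send2trash`.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def decompress_gz(gz_path, remove_source=False):
    """Decompress '<name>.gz' next to itself and return the output path."""
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # 'voterfile1.txt.gz' -> 'voterfile1.txt'
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams in chunks, no full load into memory
    if remove_source:
        gz_path.unlink()  # permanent delete, unlike send2trash
    return out_path

# Usage on a throwaway archive (made-up content, not real voter data)
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "voterfile_demo.txt.gz"
    payload = b"SOS_VOTERID,COUNTY_NUMBER\nOH0000000001,1\n"
    with gzip.open(sample, "wb") as f:
        f.write(payload)
    out = decompress_gz(sample, remove_source=True)
    restored = out.read_bytes()
    out_name = out.name
    source_removed = not sample.exists()
```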

--------------------------------------------------------------------------------------------------------------------

## 2. Combine Voter Data

### Brief scope of tasks:
    2.1 Import necessary libraries
    2.2 Import and combine files into one dataset

In [22]:
#------------------------------------------------------------------------------------------
    #2.1 Import necessary libraries
#------------------------------------------------------------------------------------------
print('\n--------------------------------\n2.1 Importing necessary libraries ...')
import numpy as np
import pandas as pd
print('Import --> COMPLETED')

#------------------------------------------------------------------------------------------
    #2.2 Import and combine files into one dataset
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n2.2 Importing and merging all data files ...')


voterdata_file1 = pd.read_csv('VoterData/voterfile1.txt',sep=",", quotechar='"',header=0, encoding='ISO-8859-1',na_values=['NA'],low_memory=False)
print('Import file#1 --> COMPLETED')
voterdata_file2 = pd.read_csv('VoterData/voterfile2.txt',sep=",", quotechar='"',header=0, encoding='ISO-8859-1',na_values=['NA'],low_memory=False)
print('Import file#2 --> COMPLETED')
voterdata_file3 = pd.read_csv('VoterData/voterfile3.txt',sep=",", quotechar='"',header=0, encoding='ISO-8859-1',na_values=['NA'],low_memory=False)
print('Import file#3 --> COMPLETED')
voterdata_file4 = pd.read_csv('VoterData/voterfile4.txt',sep=",", quotechar='"',header=0, encoding='ISO-8859-1',na_values=['NA'],low_memory=False)
print('Import file#4--> COMPLETED')

#Combine all into one dataframe
print('\nMerging data files ...')
#Note: the join_axes argument was removed in pandas 1.0, so only the arguments still supported are passed
df = pd.concat([voterdata_file1,voterdata_file2,voterdata_file3,voterdata_file4],
               axis=0, join='outer', ignore_index=False)
print('Merge --> COMPLETED')


--------------------------------
2.1 Importing necessary libraries ...
Import --> COMPLETED

--------------------------------
2.2 Importing and merging all data files ...
Import file#1 --> COMPLETED
Import file#2 --> COMPLETED
Import file#3 --> COMPLETED
Import file#4--> COMPLETED

Merging data files ...
Merge --> COMPLETED
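
One consequence of combining files with `ignore_index=False` is that each file's row labels are kept, so the combined index contains repeated labels. The sketch below uses two tiny stand-in frames with made-up values to show the difference. Whether a fresh index is preferable depends on how later steps address rows, so treat this as an illustration and not as a required change.

```python
import pandas as pd

# Two tiny stand-in frames (made-up values) in place of the four voter files
part1 = pd.DataFrame({"SOS_VOTERID": ["OH1", "OH2"], "COUNTY_NUMBER": [1, 1]})
part2 = pd.DataFrame({"SOS_VOTERID": ["OH3", "OH4"], "COUNTY_NUMBER": [2, 2]})

kept = pd.concat([part1, part2], axis=0, ignore_index=False)   # keeps each file's labels
fresh = pd.concat([part1, part2], axis=0, ignore_index=True)   # renumbers rows 0..n-1

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```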


--------------------------------------------------------------------------------------------------------------------

## 3. Organize and prep dataset

### Brief scope of tasks:
    3.1:  Import necessary libraries
    3.2:  Preview dataset information
    3.3:  Dropping voters missing key values

In [23]:
#------------------------------------------------------------------------------------------
    #3.1 Import necessary libraries
#------------------------------------------------------------------------------------------
print('\n--------------------------------\n3.1 Importing necessary libraries ...')
import numpy as np
import pandas as pd
print('Import --> COMPLETED')

#------------------------------------------------------------------------------------------
    #3.2: Preview dataset information
#------------------------------------------------------------------------------------------

#Preview dataset -- row/column counts and party affiliation breakdown
print('\n--------------------------------\n3.2 Previewing dataset information ...')

#Current row count
row_ct1 = df.shape[0]
#Current col count
col_ct1 = df.shape[1]

#Number of rows, columns
print ("\n# of Registered Voters: ", row_ct1)
print ("\n# of Features: ", col_ct1)

#Unique types of party affiliations listed in dataset
parties = df.PARTY_AFFILIATION.unique()
print("\nParty Affiliations Listed Types: ", parties)
print("--> nan = Undeclared\n--> D = Democrat\n--> R = Republican\n--> G = Green\n--> L = Libertarian")

#Number of undeclared/declared voters
num_undeclaredvoters = sum(pd.isna(df['PARTY_AFFILIATION']))
print("\n# of Undeclared Party Affiliation: ", num_undeclaredvoters)

num_declaredvoters = len(df)-(sum(pd.isna(df['PARTY_AFFILIATION'])))
print("# of Declared Party Affiliation: ", num_declaredvoters)

print('\n---> END OF PREVIEW <---\n')

#------------------------------------------------------------------------------------------  
    #3.3: Dropping voters missing key values
#------------------------------------------------------------------------------------------  

print('\n--------------------------------\n3.3  Dropping voters missing key values...')

#3.3.1: Drop rows with missing SOS_VOTERID value
df = df[pd.notnull(df['SOS_VOTERID'])]
print('\n\t3.3.1  Drop voters missing SOS_VOTERID value --> COMPLETED')

#3.3.2: Drop rows with missing CONGRESSIONAL_DISTRICT value
df = df[pd.notnull(df['CONGRESSIONAL_DISTRICT'])]
#Convert CONGRESSIONAL_DISTRICT from float to int
df["CONGRESSIONAL_DISTRICT"]= df["CONGRESSIONAL_DISTRICT"].astype(int) 
print('\n\t3.3.2  Drop voters missing CONGRESSIONAL_DISTRICT value --> COMPLETED')

#3.3.3: Drop rows with missing COUNTY_NUMBER value
df = df[pd.notnull(df['COUNTY_NUMBER'])]
#Convert COUNTY_NUMBER from float to int
df["COUNTY_NUMBER"]= df["COUNTY_NUMBER"].astype(int) 
print('\n\t3.3.3  Drop voters missing COUNTY_NUMBER value --> COMPLETED')

#Preview changes -- Compare original #rows to current #rows
#Current row count
row_ct2 = df.shape[0]
print("\n---Original # of rows:", row_ct1,
      "\n---Current # of rows:", row_ct2)

#Current col count
col_ct2 = df.shape[1]
#Preview changes -- Compare original #columns to current #columns
print("\n---Original # of columns:", col_ct1,
      "\n---Current # of columns:", col_ct2)
print('\nClean --> COMPLETED')



--------------------------------
3.1 Importing necessary libraries ...
Import --> COMPLETED

--------------------------------
3.2 Previewing dataset information ...

# of Registered Voters:  7772371

# of Features:  109

Party Affiliations Listed Types:  ['R' nan 'D' 'G' 'L']
--> nan = Undeclared
--> D = Democrat
--> R = Republican
--> G = Green
--> L = Libertarian

# of Undeclared Party Affiliation:  4565332
# of Declared Party Affiliation:  3207039

---> END OF PREVIEW <---


--------------------------------
3.3  Dropping voters missing key values...

	3.3.1  Drop voters missing SOS_VOTERID value --> COMPLETED

	3.3.2  Drop voters missing CONGRESSIONAL_DISTRICT value --> COMPLETED

	3.3.3  Drop voters missing COUNTY_NUMBER value --> COMPLETED

---Original # of rows: 7772371 
---Current # of rows: 7771790

---Original # of columns: 109 
---Current # of columns: 109

Clean --> COMPLETED
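
The three drop-and-convert steps in 3.3 can be expressed more compactly with `DataFrame.dropna(subset=...)` followed by a single `astype` call. The sketch below runs on a small made-up frame. The column names come from the notebook, while `key_cols`, `demo`, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

key_cols = ["SOS_VOTERID", "CONGRESSIONAL_DISTRICT", "COUNTY_NUMBER"]

# Made-up rows: one is missing the voter ID, one is missing the district
demo = pd.DataFrame({
    "SOS_VOTERID": ["OH1", None, "OH3", "OH4"],
    "CONGRESSIONAL_DISTRICT": [15.0, 3.0, np.nan, 12.0],
    "COUNTY_NUMBER": [25.0, 18.0, 31.0, 9.0],
})

# Drop any row missing a key value, then convert the numeric keys from float to int
cleaned = (
    demo.dropna(subset=key_cols)
        .astype({"CONGRESSIONAL_DISTRICT": int, "COUNTY_NUMBER": int})
)
dropped = len(demo) - len(cleaned)

print("Rows dropped:", dropped)
print(cleaned.dtypes)
```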


--------------------------------------------------------------------------------------------------------------------

## 4. Import Congressional District Representatives (Party Affiliation & CPVI)

### Brief scope of tasks:
    4.1:  Import necessary libraries
    4.2:  Feature --> Party Affiliation of Reps ---- CD_REP_PARTY
    4.3:  Feature --> CPVI (Cook Partisan Voting Index) of Reps ---- CD_REP_CPVI

In [24]:
#------------------------------------------------------------------------------------------
    #4.1 Import necessary libraries
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n4.1 Importing necessary libraries ...')
import numpy as np
import pandas as pd
from pandas.io.html import read_html
print('Import --> COMPLETED')

#------------------------------------------------------------------------------------------
    #4.2: Feature --> Party Affiliation of Reps ---- CD_REP_PARTY
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n4.2 Feature --> Party Affiliation of Reps ---- CD_REP_PARTY ...')
#Pull in Congressional District Representatives for all states from official page

page_reps = 'https://www.house.gov/representatives'

#Pull in all tables containing House of Reps info
table_reps = pd.read_html(page_reps,attrs={'class':'table'})

#--FYI--Ohio district is the 38th table
#Pull in 38th table and get rid of the unnecessary columns
ohio_reps = table_reps[38].drop(axis=1, labels=['Phone', 'Office Room', 'Committee Assignment'])

#Clean OHIO_REPS dataframe
    #1. Extract only district numbers -- get rid of st/rd/th
    #2. Convert district number into numeric values
    #3. Drop all cols except district, name, party
    #4. Rename District to CONGRESSIONAL_DISTRICT, party to CD_REP_PARTY, name to CD_REP_NAME
    #5. Merge with df

#Iterate over each row
for index_label, row_series in ohio_reps.iterrows():

    # For each row update the 'District' value -- by removing the last two characters
    ohio_reps.at[index_label , 'District'] = row_series['District'][:len(row_series['District'])-2]

#Turn all values for District and CONGRESSIONAL_DISTRICT col to INT
ohio_reps['District'] = ohio_reps['District'].astype(int)

#Rename District to CONGRESSIONAL_DISTRICT, Party to CD_REP_PARTY, Name to CD_REP_NAME
ohio_reps.rename(columns={'District':'CONGRESSIONAL_DISTRICT',
                          'Party':'CD_REP_PARTY',
                          'Name': 'CD_REP_NAME'
                         }, 
                 inplace=True)

#Merge with main df -- match to CONGRESSIONAL_DISTRICT
df = pd.merge(df, ohio_reps, on='CONGRESSIONAL_DISTRICT')

#Preview changes -- first two rows of CD_REP_PARTY
print('Preview feature CD_REP_PARTY:\n',df['CONGRESSIONAL_DISTRICT'].head(2))
print('\nFeature CD_REP_PARTY --> COMPLETED')

#------------------------------------------------------------------------------------------
    #4.3:  Feature --> CPVI (Cook Partisan Voting Index) of Reps ---- CD_REP_CPVI
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n4.3 Engineering feature --> CPVI of Reps ---- CD_REP_CPVI ...')
#Pull in Congressional District Representatives for Ohio from Wiki
page_CPVI = 'https://en.wikipedia.org/wiki/United_States_congressional_delegations_from_Ohio'

#Pull in all tables
table_CPVI = pd.read_html(page_CPVI,attrs={'class':'wikitable'})

reps_CPVI = pd.DataFrame(table_CPVI[0])

reps_CPVI = reps_CPVI.drop(axis=1, labels=['Representative(Residence)', 'Party', 'Incumbency', 'District map'])

#Clean table_CPVI dataframe
    #1. Rename district values to only numeric-- get rid of st/rd/th
    #2. Turn district from STR to INT
    #3. Drop all cols except district and cpvi
    #4. Rename District to CONGRESSIONAL_DISTRICT, CPVI to CD_REP_CPVI
    #5. Merge with df

#Iterate over each row
for index_label, row_series in reps_CPVI.iterrows():

    # For each row update the 'District' value -- by removing the last two characters
    reps_CPVI.at[index_label , 'District'] = row_series['District'][:len(row_series['District'])-2]
    reps_CPVI.at[index_label , 'CPVI'] = row_series['CPVI'][2:len(row_series['CPVI'])]
    
#Turn all values for District and CONGRESSIONAL_DISTRICT col to INT
reps_CPVI['District'] = reps_CPVI['District'].astype(int)
reps_CPVI['CPVI'] = reps_CPVI['CPVI'].astype(int)

#Rename District to CONGRESSIONAL_DISTRICT, CPVI to CD_REP_CPVI
reps_CPVI.rename(columns={'District':'CONGRESSIONAL_DISTRICT',
                        'CPVI':'CD_REP_CPVI'},
               inplace=True)

#Merge with main df -- match to CONGRESSIONAL_DISTRICT
df = pd.merge(df, reps_CPVI, on='CONGRESSIONAL_DISTRICT')

print('Preview feature CD_REP_CPVI:\n',df.iloc[:2,-1:])
print('\nFeature CD_REP_CPVI --> COMPLETED')


--------------------------------
4.1 Importing necessary libraries ...
Import --> COMPLETED

--------------------------------
4.2 Feature --> Party Affiliation of Reps ---- CD_REP_PARTY ...
Preview feature CD_REP_PARTY:
 0    15
1    15
Name: CONGRESSIONAL_DISTRICT, dtype: int64

Feature CD_REP_PARTY --> COMPLETED

--------------------------------
4.3 Engineering feature --> CPVI of Reps ---- CD_REP_CPVI ...
Preview feature CD_REP_CPVI:
    CD_REP_CPVI
0            7
1            7

Feature CD_REP_CPVI --> COMPLETED
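
Slicing the first two characters off a CPVI label such as `R+7` keeps the magnitude but discards the party letter, so a Republican-leaning and a Democratic-leaning district with the same magnitude would receive the same value. A possible alternative is sketched below using only the standard library. The function names and the sign convention (positive for R, negative for D, zero for EVEN) are assumptions for illustration, not part of the original feature definition.

```python
import re

def parse_district(label):
    """'15th' -> 15: take the leading digits of an ordinal label."""
    match = re.match(r"\s*(\d+)", label)
    if match is None:
        raise ValueError(f"no district number in {label!r}")
    return int(match.group(1))

def parse_cpvi(label):
    """'R+7' -> 7, 'D+5' -> -5, 'EVEN' -> 0 (sign convention is an assumption)."""
    text = label.strip().upper()
    if text == "EVEN":
        return 0
    match = re.fullmatch(r"([RD])\+(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized CPVI label {label!r}")
    value = int(match.group(2))
    return value if match.group(1) == "R" else -value

print(parse_district("15th"), parse_cpvi("R+7"), parse_cpvi("D+5"))
```

With pandas, these could be applied column-wise via `Series.map`, which avoids the row-by-row `iterrows` loop.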


--------------------------------------------------------------------------------------------------------------------

## 5. Partition data

### Brief scope of tasks:
    5.1:  Import necessary libraries
    5.2:  Partition: Blind Data and Model Data (10/90)
    5.3:  Export partitioned files into folders

In [25]:
#------------------------------------------------------------------------------------------
    #5.1 Import necessary libraries
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n5.1 Importing necessary libraries ...')
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn import metrics
import shutil, os
import send2trash
print('Import --> COMPLETED')

#------------------------------------------------------------------------------------------
    #5.2 Partition: Blind Data and Model Data (10/90)
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n5.2 Creating partition: Blind Data and Model Data (10/90) ...')
#Partition into model and blind -- 90/10, random_state None, shuffle True
model_df, blind_df = train_test_split(
    df, train_size=0.9, test_size=0.1, random_state=None, shuffle=True)

#Calculate percentage, check # of rows in BLIND_TESTING df
print("\n10% of", len(df) ,"=",round(0.1*(len(df))),"\n BLIND dataset =",len(blind_df))

#Calculate percentage, check # of rows in MODEL df
print("\n90% of", len(df) ,"=",round(0.9*(len(df))),"\n MODEL dataset =",len(model_df))
print('\n\nParition --> COMPLETED')

#------------------------------------------------------------------------------------------
    #5.3 Export partitioned files into folders
#------------------------------------------------------------------------------------------

print('\n--------------------------------\n5.3 Exporting and saving partitioned data files')

#Function for easy export to csv
def df2csv(df,filename):
    path = 'VoterData/'+filename+'_voterdata.csv'
    #mode='w' overwrites any existing file; mode='a' would append a second header row on re-runs
    df.to_csv(path, encoding='utf-8', mode='w', header=True, index=False)

#Run function to export
df2csv(model_df, 'Model/model')
print('Creating file: model_voterdata.csv --> COMPLETED')
df2csv(blind_df, 'Blind/blind')
print('Creating file: blind_voterdata.csv --> COMPLETED')

# Sending decompressed .txt files to trash
send2trash.send2trash('VoterData/voterfile1.txt')
send2trash.send2trash('VoterData/voterfile2.txt')
send2trash.send2trash('VoterData/voterfile3.txt')
send2trash.send2trash('VoterData/voterfile4.txt')
print('Sending GZip files to local trash folder --> COMPLETED')

dir_list = os.listdir('VoterData/')  
print('\n---- Files in directory:\n', dir_list)

print('\n\nExport & Save --> COMPLETED')


--------------------------------
5.1 Importing necessary libraries ...
Import --> COMPLETED

--------------------------------
5.2 Creating partition: Blind Data and Model Data (10/90) ...

10% of 7771790 = 777179 
 BLIND dataset = 777179

90% of 7771790 = 6994611 
 MODEL dataset = 6994611


Parition --> COMPLETED

--------------------------------
5.3 Exporting and saving partitioned data files
Creating file: model_voterdata.csv --> COMPLETED
Creating file: blind_voterdata.csv --> COMPLETED
Sending GZip files to local trash folder --> COMPLETED

---- Files in directory:
 ['Blind', '.DS_Store', 'Model']


Export & Save --> COMPLETED
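
Because the partition uses `random_state=None`, each run of the notebook places different voters in the blind set. If a repeatable partition is wanted, fixing the seed is one option. The sketch below shows this on a small made-up frame. The seed value `42` and the `demo` frame are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up voter IDs standing in for the full dataset
demo = pd.DataFrame({"SOS_VOTERID": [f"OH{i}" for i in range(10)]})

# Same seed on both calls, so both calls select the same rows
model_a, blind_a = train_test_split(demo, test_size=0.1, random_state=42, shuffle=True)
model_b, blind_b = train_test_split(demo, test_size=0.1, random_state=42, shuffle=True)

same_split = blind_a.index.equals(blind_b.index)
print(len(model_a), len(blind_a), same_split)
```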
