# Preprocessing of data - Dataset 4 - Politicians

In this notebook the politician dataset, containing a [list of members](https://github.com/suneman/socialgraphs2018/blob/master/files/data_US_congress/H115.csv) of the 115th House of representatives and a [zip-file](https://github.com/suneman/socialgraphs2018/blob/master/files/data_twitter/tweets.zip) of the 200 most recent tweets of said members (collected around September 2018), is preprocessed in order to train neural network models on it. These files have been prepared and are hosted by Technical University of Denmark Associate Professor Sune Lehmann Jørgensen for the 2018 installment of his DTU Compute course 02805 Social Graphs and Interactions. This notebook adds further required extension and preprocessing for the purposes of this project. The dataset describes the relationship between [House representatives of the 115th US congress](https://en.wikipedia.org/wiki/List_of_members_of_the_United_States_House_of_Representatives_in_the_115th_Congress_by_seniority) and the purpose of the built machine learning algorithm is to classify politicians in terms of party.

This notebook assumes the presence of the following files and file structure:

- H115.csv
- tweets (make sure to unzip the provided file)
    - AustinScottGA08
    - BennieGThompson
    - ...
- legislators-current.csv (file used to translate Twitter handles to Wiki IDs, acquired [here](https://github.com/unitedstates/congress-legislators/))

This notebook adds the following files to the directory:

- H115_nodes.csv
- H115_connections.csv
- H115_nodes_oh.tsv ('oh' for OneHot)
- H115_connections.tsv

The following steps is performed in this notebook:
* Age (as of today's date) of the politicians is added as an extra feature. Sex has been handled manually.
* Connections between the politicians are extracted based on retweets in Twitter data dumps.
* Features are one-hot encoded.

NB! Please note that Wikipedia page IDs or anything else could have been subject for change in the time between finishing this project and you viewing/executing this code. If that is the case, then you *will* get errors when executing the code.

## Extra features: Age and sex

As an attempt to strengthen the generalization ability of the machine learning model, the original dataset containing `WikiPageName`, `State` and `Party` is extended with `Age` and `Sex`.

`Sex` has been manually added to the `H115.csv` file one by one by looking up the politicians. <br>
`Age` has been calculated and added, as done in the following code. Instances with names with problematic charactes have been manually added to the `H115.csv` file. <br>
The result is the file `H115_vertices.csv`.

In [170]:
from urllib.request import urlopen # request urls
from urllib import parse
import re # regex
import csv
from datetime import date

In [171]:
def getinfobox(name):
    baseurl = "https://en.wikipedia.org/w/api.php/?"
    action = "action=query"
    title = "titles=" + name
    content = "prop=revisions"
    rvprop ="rvprop=timestamp|content"
    dataformat = "format=json"
    section = "rvsection=0" # get only the infobox

    query = "%s%s&%s&%s&%s&%s&%s" % (baseurl, action, title, content, rvprop, dataformat, section)
    req = urlopen(query).read()
    #print("ok")
    return req

In [172]:
def parsefeature(infoboxstring, feature):
    #req = str(req)
    infoboxstring = infoboxstring.decode("utf-8")
    
    if feature == 'party':
        out = re.findall(r'\|party[\ ]+= \[\[(\w+)',infoboxstring)
        out = sParty[0]
    elif feature == 'bd':
        out = re.findall(r'\|[\s]*birth_date[\s]*=[\s]*\{\{[B+b][irth date and age]*[irth based on age as of date]*[\|]*[\s]*[mf=yes]*[\|]*[\s]*([\d]+)\|([\d]+)\|([\d]+)', infoboxstring) # regex
    return out

In [173]:
def calculate_age(bornyear, bornmonth, bornday):
    today = date.today()
    return today.year - bornyear - ((today.month, today.day) < (bornmonth, bornday))

In [174]:
# test function
name = "Nancy_Pelosi"
#name = 'Jim_Banks'
res = parsefeature(getinfobox(name), 'bd')
bornyear = int(res[0][0])
bornmonth = int(res[0][1])
bornday = int(res[0][2])
calculate_age(bornyear, bornmonth, bornday)

79

In [175]:
with open('H115.csv', 'r') as csvinput:
    with open('H115_nodes.csv', 'w') as csvoutput:
        reader = csv.reader(csvinput, delimiter = ',')
        next(csvinput, None)  # skip the header row
        writer = csv.writer(csvoutput)
        writer.writerow(['WikiPageName', 'Party', 'State', 'Sex', 'Age']) # write the header
        
        for row in reader:
            #print(row)
            #print(len(row))
            if len(row) == 4: 
                wikipage = row[0]
                res = parsefeature(getinfobox(wikipage), 'bd')
                #print(res)
                bornyear = int(res[0][0])
                bornmonth = int(res[0][1])
                bornday = int(res[0][2])
                age = calculate_age(bornyear, bornmonth, bornday)
                row.append(age)
            print(row[0], ':', age) # debugging 
            writer.writerow(row)
            #print(row) 
            #break

print('done')

John_Conyers : 90
Don_Young : 86
Jim_Sensenbrenner : 76
Hal_Rogers : 81
Chris_Smith_(New_Jersey_politician) : 66
Steny_Hoyer : 80
Marcy_Kaptur : 73
Sander_Levin : 88
Joe_Barton : 70
Pete_Visclosky : 70
Peter_DeFazio : 72
John_Lewis_(civil_rights_leader) : 79
Louise_Slaughter : 90
Lamar_Smith : 72
Fred_Upton : 66
Nancy_Pelosi : 79
Jimmy_Duncan_(politician) : 72
Frank_Pallone : 68
Eliot_Engel : 72
Nita_Lowey : 82
Richard_Neal : 70
Dana_Rohrabacher : 72
Ileana_Ros-Lehtinen : 67
José_E._Serrano : 67
David_Price_(American_politician) : 79
Rosa_DeLauro : 76
Collin_Peterson : 75
Maxine_Waters : 81
Sam_Johnson : 89
Jerry_Nadler : 72
Jim_Cooper : 65
Xavier_Becerra : 61
Sanford_Bishop : 72
Ken_Calvert : 66
Jim_Clyburn : 79
Anna_Eshoo : 76
Bob_Goodlatte : 67
Gene_Green : 72
Luis_Gutiérrez : 72
Alcee_Hastings : 83
Eddie_Bernice_Johnson : 84
Peter_T._King : 75
Carolyn_Maloney : 73
Lucille_Roybal-Allard : 78
Ed_Royce : 68
Bobby_Rush : 73
Bobby_Scott_(politician) : 72
Nydia_Velázquez : 72
Bennie_Thom

Dwight_Evans_(politician) : 65
Brad_Schneider : 58
Jodey_Arrington : 47
Don_Bacon_(politician) : 56
Jim_Banks : 40
Nanette_Barragán : 40
Jack_Bergman : 72
Andy_Biggs : 60
Lisa_Blunt_Rochester : 57
Anthony_G._Brown : 58
Ted_Budd : 48
Salud_Carbajal : 55
Liz_Cheney : 53
Lou_Correa : 61
Charlie_Crist : 63
Val_Demings : 62
Neal_Dunn : 66
Adriano_Espaillat : 65
John_Faso : 67
Drew_Ferguson_(politician) : 53
Brian_Fitzpatrick_(American_politician) : 45
Matt_Gaetz : 37
Mike_Gallagher_(American_politician) : 35
Tom_Garrett_(Virginia_politician) : 47
Vicente_González_(politician) : 47
Josh_Gottheimer : 44
Clay_Higgins : 58
Trey_Hollingsworth : 36
Pramila_Jayapal : 54
Mike_Johnson_(Louisiana_politician) : 47
Ro_Khanna : 43
Ruben_Kihuen : 39
Raja_Krishnamoorthi : 46
David_Kustoff : 53
Al_Lawson : 71
Jason_Lewis_(Minnesota_politician) : 64
Roger_Marshall_(politician) : 59
Brian_Mast : 39
Donald_McEachin : 58
Paul_Mitchell_(politician) : 63
Stephanie_Murphy : 41
Tom_O'Halleran : 73
Jimmy_Panetta : 

## Twitter data: Edges
### Find connections

In [176]:
import re
import os
import csv
from datetime import date
from itertools import islice
import numpy as np

In [177]:
data_dir = './tweets/'

In [178]:
# make connections based on retweets
# undirected network assumed, so if one person mentions the other, then they are both connected to each other

connections = []
#count = 0
for root, dirs, files in os.walk(data_dir):
    #print(files)
    for filename in files:
        #print(filename)
        for line in open(data_dir + filename, "r").readlines():
            if line.strip(): # if line is not empty
                #print(line)
                match = re.search(r'RT @(\w+)', line)
                if match:
                    #print("match!")
                    cited_handle = match.group(1) # get the retweet handle
                    #print(cited_handle)
                    if cited_handle in files: # if the retweeted handle belongs to a member of congress
                        #print("match:", cited_handle)
                        if (filename, cited_handle) or (cited_handle, filename) not in connections: # and the connection between the retweeter and retweeted people has not already been recorded
                            connections.append((filename, cited_handle)) # append tuple with (retweeter, retweeted) handles
                #break
        #break
        #count = count + 1
        #if count == 10:
            #break
#connections

In [179]:
print("found", len(connections), "connections!")

found 2773 connections!


In [180]:
# print a connection
for pair in connections:
    print(pair)
    break

('MarkAmodeiNV2', 'SteveScalise')


### Build dictionary to translate from Twitter id to Wikipedia id

In [181]:
# translates twitter handles to wikipage names based on a list of CURRENT legislators
# beware that any current legislator might not have been in the 115th congress

translate_dict = dict()

with open('legislators-current.csv') as csv_current:
        csvReader = csv.reader(csv_current)
        next(csvReader, None)  # skip the header row
        for row in csvReader:
            twitterhandle = row[18]
            wikipageid = row[33]
            wikipageid = wikipageid.replace(' ', '_')
            translate_dict[twitterhandle] = wikipageid
            
len(translate_dict)

530

In [182]:
connections_renamed = list()
not_translated = list() # keep track of which handles were not translated

for pair in connections:
    retweeter = pair[0]
    retweeted = pair[1]
    
    retweeter_wiki = ''
    if retweeter in translate_dict:
        retweeter_wiki = translate_dict[retweeter] 
    else:
        if retweeter not in not_translated:
            print("could not find wiki for:", retweeter)
            not_translated.append(retweeter)
    
    retweeted_wiki = ''
    if retweeted in translate_dict:
        retweeted_wiki = translate_dict[retweeted]
    else:
        if retweeted not in not_translated:
            print("could not find wiki for:", retweeted)
            not_translated.append(retweeted)
        
    if retweeter_wiki != '' and retweeted_wiki != '':
        connections_renamed.append((retweeter_wiki,retweeted_wiki))

could not find wiki for: cathymcmorris
could not find wiki for: SpeakerRyan
could not find wiki for: RepEdRoyce
could not find wiki for: RosLehtinen
could not find wiki for: RepSeanDuffy
could not find wiki for: repgregwalden
could not find wiki for: RepDaveBrat
could not find wiki for: chelliepingree
could not find wiki for: WhipHoyer
could not find wiki for: RepCummings
could not find wiki for: RepRickAllen
could not find wiki for: RepDonBacon
could not find wiki for: SamsPressShop
could not find wiki for: RepRaskin
could not find wiki for: RepRaulGrijalva
could not find wiki for: RepDavidKustoff
could not find wiki for: RepMcSally
could not find wiki for: RepLukeMesser
could not find wiki for: repmarkpocan
could not find wiki for: RepHensarling
could not find wiki for: HurdOnTheHill
could not find wiki for: CongressmanHice
could not find wiki for: RepCharlieCrist
could not find wiki for: RepTenney
could not find wiki for: RepBrianMast
could not find wiki for: RepCurbelo
could not fi

In [183]:
print("managed to rename", len(connections_renamed), "connections out of", len(connections), "total connections")

managed to rename 1298 connections out of 2773 total connections


In [184]:
print(len(set(not_translated)), "twitters could not be translated based on legislators-current.csv")

136 twitters could not be translated based on legislators-current.csv


Try to search the `H115_nodes.csv` file for wikipagenames by trying to find longest matching part between twitterhandle and wikipagename.

This does not guarantee all are found correctly. Needs manual review and correction after.

In [185]:
# function to find common substring between two strings
def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

In [186]:
# function to search the csv file for a matching wikipagename based on a twitterhandle
def searchCsvForWiki(twitterhandle):
    with open('H115_nodes.csv') as csvDataFile:
        csvReader = csv.reader(csvDataFile)
        next(csvReader, None)  # skip the header row
        tmp = ''
        for row in csvReader:
            wikipage_untouched = row[0]
            wikipage = wikipage_untouched.replace('(_politician)','')#.replace('_','')
            # check for match based on forward matching
            res = longestSubstringFinder(twitterhandle,wikipage)
            if len(res) > len(tmp):
                out = wikipage_untouched
                tmp = res
            # check for match based on backward matching
            res = longestSubstringFinder(twitterhandle[::-1], wikipage[::-1])
            res = res[::-1]
            if len(res) > len(tmp):
                out = wikipage_untouched
                tmp = res
            #if out != '':
                #print('match:', out)

        #print(out)
        return out

In [187]:
manual_translate_dict = dict()
for name in set(not_translated):
    name_cleaned = name.replace('Cong','').replace('Congressman','').replace('Rep','').replace('USRep','')
    
    manual_translate_dict[name] = searchCsvForWiki(name_cleaned) 

#manual_translate_dict

In [188]:
print("managed to translate", len(manual_translate_dict), "twitterhandles out of the remaining", len(set(not_translated)), "untranslated twitterhandles" )

managed to translate 136 twitterhandles out of the remaining 136 untranslated twitterhandles


In [189]:
# manually correcting wrong translations - based on googling
manual_translate_dict['ScottTaylor']='Scott_Taylor_(politician)'
manual_translate_dict['MikeBishop']='Mike_Bishop_(politician)'
manual_translate_dict['Russell']='Steve_Russell_(politician)'
manual_translate_dict['DWStweets']='Debbie_Wasserman_Schultz'
manual_translate_dict['DonBacon']='Don_Bacon_(politician)'
manual_translate_dict['nikiinthehouse']='Niki_Tsongas'
manual_translate_dict['DrNealDunnFL2']='Neal_Dunn'
manual_translate_dict['DanDonovan']='Dan_Donovan_(politician)'
manual_translate_dict['DaveBrat']='Dave_Brat'
manual_translate_dict['JohnDelaney']='John_Delaney_(Maryland_politician)'
manual_translate_dict['KYComer']='James_Comer_(politician)'
manual_translate_dict['TomGarrett']='Tom_Garrett_(Virginia_politician)'
manual_translate_dict['StevePearce']='Steve_Pearce_(politician)'
manual_translate_dict['SanfordSC']='Mark_Sanford'
manual_translate_dict['BrianMast']='Brian_Mast'
manual_translate_dict['HurdOnTheHill']='Will_Hurd'
manual_translate_dict['DavidYoung']='David_Young_(Iowa_politician)'
manual_translate_dict['DaveTrott']='Dave_Trott_(politician)'
manual_translate_dict['JudgeTedPoe']='Ted_Poe'
manual_translate_dict['BrianFitz']='Brian_Fitzpatrick_(American_politician)'
manual_translate_dict['SteveKnight25']='Steve_Knight_(politician)'
manual_translate_dict['boblatta']='Bob_Latta'
manual_translate_dict['Brady']='Bob_Brady'
manual_translate_dict['repdavidscott']='David_Scott_(Georgia_politician)'
manual_translate_dict['CharlieCrist']='Charlie_Crist'
manual_translate_dict['KevinYoder']='Kevin_Yoder'
manual_translate_dict['KevinYoder']='Kevin_Yoder'
translate_dict['GOPLeader'] = 'Kevin_McCarthy_(U.S._Representative)' # correct outdated info from the legislators-current.csv file
translate_dict['USRepMikeDoyle'] = 'Mike_Doyle_(American_politician)' # correct outdated info from the legislators-current.csv file
translate_dict['RepJerryNadler'] = 'Jerry_Nadler' # correct outdated info from the legislators-current.csv file
translate_dict['RepTimRyan'] = 'Tim_Ryan_(Ohio_politician)' # correct outdated info from the legislators-current.csv file
#manual_translate_dict['RepEdRoyce']='Ed_Royce' # ?

In [190]:
len(translate_dict)

530

In [191]:
len(manual_translate_dict)

153

In [192]:
complete_dict = dict(translate_dict)
complete_dict.update(manual_translate_dict)
len(complete_dict)

683

### Translate Twitter to Wikipedia and export to csv
At this point all twitter handles have been translated. Now we will translate the connections.

In [193]:
connections_renamed = list()

for pair in connections:
    retweeter = pair[0]
    retweeted = pair[1]
    
    retweeter_wiki = complete_dict[retweeter]
    retweeted_wiki = complete_dict[retweeted]
    connections_renamed.append((retweeter_wiki, retweeted_wiki))

In [194]:
# make sure that we have same number of renamed connections as original connections
assert len(connections_renamed) == len(connections)

In [195]:
# write list of connection tuples to csv
with open('H115_connections.csv','w') as out:
    csv_out=csv.writer(out)
    csv_out.writerow(['Retweeter','Retweeted'])
    for row in connections_renamed:
        csv_out.writerow(row)

## One-hot and integer encoding

The processed data consists of two files: `H115_nodes.csv` and `H115_connections.csv`. 

`H115_nodes.csv` is a simple table with the following format:

* WikiPageName (string) (id)
* Party (string) (label/target)
* State (string) (feature)
* Sex (string) (feature)
* Age (int) (feature)

Here, `WikiPageName` is the row ID, and the rest are features. `Party` in the feature we wish to predict is this problem, so this will be the label/target.

In order for the NSL library to work with the data, the features must be converted to readable formats. To do this we do the following:

* `State` will be converted to a single one-hot vector of length 50.

* `Sex` will be converted to a one-hot vector of length 2.

The NSL library will automatically process the target labels, so these do *not* need to be converted.

In [196]:
# read in the file and store content
features_wiki = []
features_state = []
features_sex = []
features_age = []
features_party = []

with open('H115_nodes.csv', 'r') as content:
    next(content) # skip header
    for line in content:
        entries = line.rstrip('\n').split(',')
        #print(entries)
        features_wiki.append(entries[0])
        features_party.append(entries[1])
        features_state.append(entries[2])
        features_sex.append(entries[3])
        features_age.append(entries[4])
        #break

In [197]:
assert len(features_state) == len(features_sex) == len(features_age) == len(features_wiki) == len(features_party)

In [198]:
# make dict of wikipages to turn them into integer IDs
values = [i+1 for i in range(len(features_wiki))]
wiki_dict = dict(zip(features_wiki, values)) 

In [199]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

def onehotencode(l):
    l = np.array(l).reshape(-1, 1) # reshape to proper format
    cat = OneHotEncoder()
    #X = np.array([['a', 'b', 'a', 'c'], [0, 1, 0, 1]], dtype=object).T
    onehotvectors = cat.fit_transform(l).toarray()
    return onehotvectors

In [200]:
# one-hot encoding of each attribute in the feature lists
features_state_oh = onehotencode(features_state)
features_sex_oh = onehotencode(features_sex)
features_age_oh = onehotencode(features_age)
features_party_oh = onehotencode(features_party)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [201]:
# we expect a one-hot vector length of 50, because there are 50 states
len(features_state_oh[0])

50

In [202]:
# we expect a one-hot vector length of 2, because there are 2 genders
len(features_sex_oh[0])

2

In [203]:
# how many unique age points do we have?
len(features_age_oh[0])

54

In [204]:
# we expect two parties
len(features_party_oh[0])

2

In [205]:
# assimilate features
features_oh = list()
for i,v in enumerate(features_wiki):
    #features_oh.append([v, features_state_oh[i], features_sex_oh[i], features_age[i], features_party[i]]) #WITHOUT age one-hot encoded 
    features_oh.append([wiki_dict[v], features_state_oh[i], features_sex_oh[i], features_age_oh[i], features_party[i]]) #WITH age one-hot encoded 
    #features_oh.append([v, features_state_oh[i], features_sex_oh[i], features_age_oh[i], features_party_oh[i]]) #WITH age one-hot encoded and WITH target (party) onehot encoded
    #features_oh.append([v, features_state_oh[i], features_sex_oh[i], features_age[i], features_party_oh[i]]) #WITHOUT age one-hot encoded and WITH target (party) onehot encoded

In [206]:
# sample
features_oh[0]

[1, array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), array([0., 1.]), array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1.]), 'Democratic']

In [207]:
#output to tab separated file
with open('H115_nodes_oh.tsv', 'w') as content:
    #content.write('WikiPageName\tState\tSex\tAge\tParty\n')
    for line in features_oh:
        content.write(str(line[0])+'\t'+
                      str(line[1].astype(np.int)).replace('\n','').replace('[','').replace(']','').replace(' ','\t')+'\t'+
                      str(line[2].astype(np.int)).replace('\n','').replace('[','').replace(']','').replace(' ','\t')+'\t'+
                      #str(line[3])+'\t'+ # age WITHOUT onehot encoding
                      str(line[3].astype(np.int)).replace('\n','').replace('[','').replace(']','').replace(' ','\t')+'\t'+ # age WITH onehot encoding
                      line[4]+'\n')
                      #str(line[4].astype(np.int)).replace('\n','').replace('[','').replace(']','').replace(' ','\t')+'\n') # party WITH onehot encoding

Convert the connections file to a tsv file

In [208]:
with open('H115_connections.csv', 'r') as infile:
    with open('H115_connections.tsv', 'w') as outfile:
        next(infile) # skip header
        for line in infile:
            entries = line.rstrip('\n').split(',')
            #print(entries)
            #outfile.write(entries[0]+'\t'+entries[1]+'\n') # with actual wiki names
            outfile.write(str(wiki_dict[entries[0]])+'\t'+str(wiki_dict[entries[1]])+'\n') # with int encoded wiki IDs
            #break