# Congressional Record Monologue and Feature Parser

This notebook takes U.S. Congressional Record session text files produced by [this scraper notebook](https://github.com/kdunn926/gov-data/blob/master/govdata-congressionalRecordScraper.ipynb) and splits them into distinct monologues, capitalizing on the semi-structured format of `Mr. NOBODY: <monologue text>` that is consistent throughout the GPO's Congressional Record archive.

Additionally, it attempts to identify and extract:
- session times (start and/or end - work in progress)
- mentions within a monologue of other people and proper nouns, both implemented using some hairy regexp
- "context" of a monologue - each monologue is assigned a unique, sequential id, to facilitate contextual forward/backward exploration of the session when starting from a single monologue

We also "join" in Congressional "roster" data, scraped (and manually cleaned) from Wikipedia - to allow annotating: 
- party affiliations
- state represented
- "role" in the Congress (whip, speaker, etc.)
- term end

The output of this notebook is a set of compressed, delimited text files, intended to be loaded into something like Neo4J, or even a SQL database. Neo4J _should_ allow for intuitive queries based on the highly connected nature of this dataset.

### Note: this process is rather compute-intensive. It is probably poorly optimized; some attempt has been made to improve this, but PRs are always welcome!

End-to-end timing for data from ~1992 to mid-2016 was **9 hours** on the following machine:

2012 Macbook, 2.3 GHz Intel Core i7, 8GB 1600 MHz DDR3, no SSD
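
The "context" ids mentioned above can be pictured with a minimal sketch. It assumes the id layout that `writeCsv` builds further down in this notebook (date, branch, Congress number, the literal `monologue`, and a running index). The helper names `make_monologue_ids` and `sequence_pairs` and the sample values are hypothetical and not part of the notebook.

```python
def make_monologue_ids(the_date, the_branch, congress_session, count):
    """Build one sequential id per monologue in a session."""
    return ["-".join([the_date, the_branch, str(congress_session), "monologue", str(i)])
            for i in range(count)]


def sequence_pairs(ids):
    """Pair each id with its successor (the SAID BEFORE relationship)."""
    return list(zip(ids[:-1], ids[1:]))


# Sample values, chosen only for illustration
ids = make_monologue_ids("1994-01-26", "House", 103, 3)
pairs = sequence_pairs(ids)
```

Because the index is the last component of the id, stepping forward or backward from any monologue only requires following one pair.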

In [1]:
import json
import re
import py2neo
import glob
import requests
import pandas as pd
import sys

In [2]:
def setSpeakers(theRecord, theSpeaker):
    proTemporeSections = re.split(r"The SPEAKER pro tempore \(((?:Mr\.|Mrs\.|Ms\.)\s?[A-Za-z]{2,}(?:\s[A-Z][a-z]+[-]?){1,2})\)", theRecord)
    
    replaced = ""
    name = ""
    for n, token in enumerate(proTemporeSections):
        if re.match("^(Mr.|Mrs.|Ms.)", token):
            #print n
            name = token
        
        # This serves double duty, replace references to speaker of the moment with a real name
        # and handle the case where speaker of the moment is a woman and subsequent monologues
        # reference her as "Madam Speaker"
        replaced = replaced + re.sub("The SPEAKER pro tempore", name, re.sub("Madam Speaker", "[ {0} ]".format(name), token))
        
    replaced = re.sub("The SPEAKER", theSpeaker, replaced)
    
    return replaced
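
`setSpeakers` relies on the fact that `re.split` with one capturing group returns the captured text interleaved with the pieces around it. The sketch below shows that behaviour on a made-up record. The pattern is a simplified stand-in (it accepts a single-word surname), and it picks out the names by position rather than by matching a title as the notebook does; both are assumptions made for brevity.

```python
import re

# Simplified stand-in for the pattern in setSpeakers: one capturing group
# around the parenthesised name.
pattern = re.compile(r"The SPEAKER pro tempore \(((?:Mr\.|Mrs\.|Ms\.) [A-Za-z]+)\)")

# Made-up record text
record = ("Opening text. The SPEAKER pro tempore (Mrs. Example). "
          "Madam Speaker, I yield. The SPEAKER pro tempore. Without objection.")

# Even positions hold surrounding text, odd positions hold captured names
parts = pattern.split(record)

name = ""
replaced = ""
for i, token in enumerate(parts):
    if i % 2 == 1:
        name = token
    token = token.replace("Madam Speaker", "[ {0} ]".format(name))
    replaced += token.replace("The SPEAKER pro tempore", name)
```

After the loop, every later reference to the presiding member carries the name captured from the most recent parenthesised introduction.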

In [3]:
monologueRegex = re.compile("[^[]\s((?:Mr.|Mrs.|Ms.|\^\^THE\^\^)\s(?:[a-z]{2,}\s)?[A-Z]{1}[A-z]{2,}(?: of (?:[A-Z][a-z]+[ ]?)+)?)\.")
def getMonologues(theRecord, theSpeaker):
        
    with open(theRecord) as theFile:
        text = theFile.read()
        
    ctext = setSpeakers(text, theSpeaker)

    # All the newlines have been removed, only operate
    # on the first element
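    # Pad each title with a leading space; the dots are escaped so that
    # "Mr." matches only the literal title and not the first three letters of "Mrs."
    ctext = re.sub(r"Mr\.", " Mr.", ctext)
    ctext = re.sub(r"Mrs\.", " Mrs.", ctext)
    ctext = re.sub(r"Ms\.", " Ms.", ctext)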
    ctext = re.sub("Roll No."," Roll No. ", ctext)
    ctext = re.sub("The SPEAKER pro tempore"," ^^THE^^ SPEAKERPROTEMPORE", ctext)
    ctext = re.sub("The SPEAKER"," ^^THE^^ SPEAKER", ctext)
    ctext = re.sub("The Clerk"," ^^THE^^ CLERK.", ctext)
    ctext = re.sub("The PRESIDING OFFICER"," ^^THE^^ PRESIDINGOFFICER.", ctext)
    ctext = re.sub("The assistant legislative clerk read"," ^^THE^^ ASSISTANTLEGISLATIVECLERK. read", ctext)
    
    #text = re.sub("Roll No.", " Roll No.", ctext)
    
    # Proto to capture roll number, then sub back in with spaces
    for rollNumber in re.findall("(Roll No.\s+\d+)", ctext):
        ctext = re.sub(rollNumber, " " + rollNumber + " ", ctext)
    ctext = re.sub("[^\x00-\x7F]",' ', ctext)

    # TODO capture the roll number and put a space after the roll number

    monologues = monologueRegex.split(ctext)
    return monologues, text
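
The split performed by `getMonologues` yields a flat list in which speaker labels and monologue text alternate. A minimal sketch of that shape follows, using a much simpler, hypothetical pattern than `monologueRegex` (title, upper-case surname, closing period) and made-up text.

```python
import re

# Hypothetical, simplified speaker label: title, upper-case surname, period
speaker_label = re.compile(r"\s((?:Mr|Mrs|Ms)\. [A-Z]{2,})\.")

text = "Preamble. Mr. NOBODY. I rise today. Mrs. SOMEONE. I thank the gentleman."
tokens = speaker_label.split(text)

# tokens[0] is whatever precedes the first speaker; after that, labels and
# monologues alternate
speakers = tokens[1::2]
monologues = [t.strip() for t in tokens[2::2]]
```

The leading element is not spoken by anyone, which is why downstream code has to locate the first speaker before pairing the rest.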

In [4]:
#getMonologues("/Users/kyledunn/Desktop/congressionalRecord/HOUSE/Merged/1994-01-26.txt", "NoOneInParticular")

In [5]:
def getNameReferences(monologue):
    cleanM = monologue.replace("]", "")
    # Ms. Vaughn or Mr. de Lugo
    titleNamePattern = "(?:Dr|Mr|Mrs|Ms|Madam)[\.]\s(?:[A-z]{1}[a-z]{,2}\s)?(?:Mc|Mac)?(?:[A-Z]{1}[a-z]+)"
    
    # Something like Mr. and Mrs. America or Mr. and Mrs. van Daan
    couplePattern = "(?:Dr|Mr|Mrs|Ms)[\.]\sand\s(?:Dr|Mr|Mrs|Ms)[\.]\s(?:[A-z]{1}[a-z]{,2}\s)?(?:Mc|Mac)?(?:[A-Z]{1}[a-z]+)"
    
    # Something like John Conyers - this is too wide of a net - moved to Proper Noun extraction
    #fullNamePattern = "(?:[A-Z]{1}[a-z]+){1}\s(?:[A-z]{1}[a-z]{,2}\s)?(?:[A-Z]{1}[a-z]+){1}"
    
    # Must match the couple pattern first otherwise "Mrs" gets interpreted as last name
    names = re.findall("({0})".format("|".join([couplePattern, titleNamePattern])), cleanM)
    
    if names:
        return names
    else:
        return ""
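
The ordering note in `getNameReferences` can be checked with a small experiment. The two patterns below are trimmed-down versions of `couplePattern` and `titleNamePattern` (they keep the optional short particle that allows names such as "de Lugo" but drop the `Mc`/`Mac` handling), so treat them as an illustration rather than the notebook's exact expressions.

```python
import re

particle = r"(?:[A-Za-z][a-z]{0,2}\s)?"   # optional short word such as "de" or "van"
title = r"(?:Mr|Mrs|Ms)\."

title_name = title + r"\s" + particle + r"[A-Z][a-z]+"
couple = title + r"\sand\s" + title + r"\s" + particle + r"[A-Z][a-z]+"

sentence = "I thank Mr. and Mrs. America, and also Ms. Vaughn."

# Alternation tries the left-hand branch first at each position
couple_first = re.findall("|".join([couple, title_name]), sentence)
title_first = re.findall("|".join([title_name, couple]), sentence)
```

With the title pattern first, "and" is consumed as a particle and "Mrs" is taken as the surname, which is the failure the comment in the function warns about.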

In [6]:
from nltk.corpus import stopwords
#from stop_words import get_stop_words

stopWords = set(stopwords.words('english'))
#stopWords.update(get_stop_words('en'))

def cleanWord(text):
    return re.sub("\W+", " ", text).strip()

def getSetOfWords(monologue):
    # Clean each word once, then drop short words and stop words
    cleanedWords = (cleanWord(word) for word in monologue.split())
    return set(w for w in cleanedWords if len(w) > 2 and w.lower() not in stopWords)
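
To see what the word-set helpers keep and discard without installing NLTK, here is a minimal sketch with a tiny hard-coded stop list. `STOP_WORDS` is a hypothetical stand-in for the NLTK English list, and the function names are illustrative.

```python
import re

# Tiny hypothetical stop list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "and", "for", "this", "that", "with"}


def clean_word(text):
    # Every run of non-word characters becomes a single space
    return re.sub(r"\W+", " ", text).strip()


def set_of_words(monologue):
    cleaned = (clean_word(w) for w in monologue.split())
    return {w for w in cleaned if len(w) > 2 and w.lower() not in STOP_WORDS}


words = set_of_words("The bill, and this amendment, protect the forests.")
hyphenated = clean_word("On-the-ground")
```

Note that a hyphenated token is not split into separate entries: its hyphens turn into spaces, so it survives as one multi-word string.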

In [7]:
def noStopWordsWithin(phrase):    
    for word in phrase.split():
        if word.lower() in stopWords:
            return False
        
    return True

# Something like John Conyers or National Academy of Sciences: a run of capitalized
# words, optionally joined by a short lowercase word such as "of" or "and"
properNounPattern = re.compile("(?:(?:Mc|Mac)?(?:[A-Z]{1}[a-z]{2,})\s(?:[a-z]{2,3}\s)?){1,}(?:(?:Mc|Mac)?(?:[A-Z]{1}[a-z]{2,}))")
def getProperNounReferences(monologue):
    cleanM = monologue.replace("]", "")
    
    refs = [p.strip() for p in properNounPattern.findall(cleanM)]
    
    if refs:
        return refs
    else:
        return ""

In [8]:
test="""Mr. Chairman, I yield myself such time as I may consume. Mr. Chairman, the Forest Recovery and Protection Act of 
1998 is the result of some 14 months of listening and learning and fact-gathering. It is the result of seven hearings in 
which we heard from a broad array of people across this Nation, including scientists, academics, State foresters, 
professional associates, environmental groups, wildlife organizations, citizens, community leaders, elected officials, 
organized labor, the forest products industry and the administration. Beyond the hearing process, the committee has worked 
exhaustively with minority Members, northeastern Republicans, hopefully all Members of this body to refine the bill to 
broaden support for what we believe is a very necessary and a very reasonable initiative. We extended a hand and we worked 
with those who have expressed concerns with the bill and we were willing to work in good faith to find solutions. I am 
delighted to stand here today and to tell my colleagues that because we have collaborated with these concerned parties 
we have a stronger bill and one that truly represents, we believe, diverse interests. Here are just a few of the groups, 
by the way, that support this bill: the AFL-CIO, the United Brotherhood of Carpenters and Joiners of America, the 
National Association of Counties, the Society of American Foresters, the National Association of State Foresters, the 
National Association of Professional Forestry Schools. But despite our best efforts to include all interests in crafting 
this legislation, there are those of course who have elected to remain outside the process rather than coming to the 
table to seek solutions. Unfortunately, because they have not been engaged, there are some misunderstandings about this 
bill, which I would like to clear up. There are a number of people who are talking about this bill, about what it is not. 
I would like to explain to them about what the bill does. It is a five-year pilot project providing a timely and organized 
and scientific strategy to address the chronic conditions of our national forests. The bill establishes an independent 
scientific panel through the National Academy of Sciences to recommend to the Secretary of Agriculture the standards and 
criteria that should be used to identify which national forests are in the worst shape and where restoration efforts are 
needed most. The public then provides input on the standards and criteria which the Secretary publishes. Based upon the 
standards and criteria, the Secretary then determines which forests have the greatest restoration needs and allocates 
amounts to those forests. On-the-ground forest managers then begin planning projects to restore degraded and deteriorating 
forest resources. I have been hearing information to the contrary, so I want to make this clear to everyone in this assembly. 
These projects must comply with all applicable environmental laws. This legislation does not in any way limit public 
participation under existing laws and regulations. More than that, a full, open, public process must be conducted by all 
recovery projects. All project planning, including analysis of environmental impacts, must comply with NEPA, the 
National Environmental Policy Act. Recovery projects must be consistent with land and resource management plans, 
plans that have been analyzed by NEPA and have been deemed consistent with environmental laws and regulations. There is 
no short-circuiting, circumventing or limiting of laws. Public process or judicial review anywhere in this bill are 
always protected. So those who oppose 2515, the original bill, must oppose current environmental laws and regulations. 
Those who oppose this bill must oppose restoring fish habitat. They must oppose reducing the threat of epidemic levels 
of insects and disease. They must oppose replanting trees and stabilizing slopes after catastrophic events, and they 
must oppose reducing the risk of wildfire. Those who oppose this bill say the forest health crisis is a myth, that forest 
health is an excuse to log our national forests. Of course, not every acre in the National Forest is degraded or 
deteriorating, but over the last decade an enormous body of scientific literature has been generated about our degraded, 
deteriorating forest resources. Scientists agree that our forests are ``outside the historic range of variability,'' and 
that active management is necessary in some areas to begin to return forests to their historic conditions. The Chief of 
the Forest Service has said that there are some 40 million acres of National Forest at unacceptable risk of destruction 
by catastrophic fire, and listed these sources: the Integrated Scientific Assessment for Ecosystem Management in the 
Interior Columbia Basin says, ``We found that forests and ecosystems have become more susceptible to severe fire and 
outbreaks of insects and disease''; the Southern [[Page H1652]] Appalachian Assessment states, ``Several tree species 
in the Southern Appalachians are at risk of extinction or significant genetic loss because of exotic pests'' and ``lack 
of active management in other stands has led to development of dense understories, and to the senescence of overstory 
trees of some species''; the Sierra Nevada Ecosystem Project states, ``Fire protection for the last half century has 
provided for the development of continuous dense forest stands which are in need of thinning to accelerate growth, reduce 
fire hazard, provide for more mid-successional forest habitat and yield of usable wood.'' Well, there is no question about 
it in my mind and all others that this is an essential bill. ``Active management'' is a term that is frequently distorted. 
Active management could be creating in-stream structure for fish habitat. It could be planting native grasses to stabilize 
the stream bed; it could be planting trees near a stream to provide shade to reduce stream temperatures; and yes, it could 
also be cutting trees to prevent the spread of insects and disease or reduce the risk of catastrophic wildfire. It seems to 
me, Mr. Chairman, that the Forest Service is in some state of catatonic immobilization in that the direction; and the goals 
of the Forest Service are somehow hidden, and direction is essential, which certainly this legislation does. The 
Forest Service, I believe, needs emergency care here to help them direct resources in this Nation to protect this very 
valuable resource. On-the-ground managers are confused and frustrated with their missions. While environmental laws, 
no question about it, have shut down logging, particularly in the Pacific Northwest, please give us an opportunity to 
nurture and care for this resource. To let it burn is huge waste; to let it burn means we lost all the environmental 
issues that we all deem important; we lost stream bank protection, we lost the resource, we lost wildlife, we lost all 
of those important issues to all of us in the West for some 250 years. Will this legislation answer all the questions? 
Of course not. This is a moderate, meager, bipartisan effort to answer some of the problems and some of the forests 
that are in the worst condition in this Nation. We think that this will give the Forest Service the direction necessary 
and again, I reiterate, abide by every environmental law in this land. Mr. Chairman, I reserve the balance of my time.
""".replace("\n", "")

#print getProperNounReferences(test)
#print getNameReferences(test)

#['Forest Recovery and Protection Act', 'United Brotherhood of Carpenters and Joiners of America', 'National Association of Counties', 'Society of American Foresters', 'National Association of State Foresters', 'National Association of Professional Forestry Schools', 'National Academy of Sciences', 'Secretary of Agriculture', 'National Environmental Policy Act', 'National Forest', 'The Chief', 'Forest Service', 'National Forest', 'Integrated Scientific Assessment for Ecosystem Management', 'Interior Columbia Basin', 'Appalachian Assessment', 'Southern Appalachians', 'Sierra Nevada Ecosystem Project', 'Forest Service', 'Forest Service', 'The Forest Service', 'Pacific Northwest', 'Forest Service']

In [9]:
#gensim????

In [10]:
def getNumberOfWords(monologue):
    return len(monologue.split())

def getWordHistogram(wordSet, monologue):
    countTuples = [(w, monologue.count(w)) for w in wordSet]
    
    sortedTuples = sorted(countTuples, key=lambda x: x[1], reverse=True)
    
    return ["{0}:{1}".format(w, str(c)) for w, c in sortedTuples]
 

def getLengthOfMonologue(monologues):

    lengthOfMonologue = [len(m.split() ) for m in monologues]
    return lengthOfMonologue
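
One thing to keep in mind about `getWordHistogram`: `str.count` counts substring occurrences, so a short word is also counted wherever it appears inside a longer one. If whole-word counts are wanted, a `collections.Counter` over tokenised text is one possible alternative. This is a sketch of that swapped-in technique, not the notebook's method, and `word_histogram` is a hypothetical name.

```python
import re
from collections import Counter


def word_histogram(monologue):
    """Count whole words (case-sensitive), most frequent first."""
    counts = Counter(re.findall(r"\w+", monologue))
    return ["{0}:{1}".format(w, c) for w, c in counts.most_common()]


sample = "the other day the bill passed"

substring_count = sample.count("the")                           # also hits "other"
whole_word_count = Counter(re.findall(r"\w+", sample))["the"]   # whole words only
histogram = word_histogram(sample)
```

A single pass with `Counter` also avoids rescanning the monologue once per word, which may help with the run time noted at the top of the notebook.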

In [1]:
# This is not implemented - we'd never finish with the latency imposed by this
# also, this implementation was sub-par, at the _very_ best
def getSentiment(monologues=[]):
    sentimentList = []

    #for m in monologues:
    #    sentimentJson = requests.post('http://sentiment.vivekn.com/api/batch/',data="text='"+m+"'")
    #    sentiment = json.loads(sentimentJson.text)['label']
    #    sentimentList.append(sentiment)
    
    
    jsonPayload = dict(zip(range(len(monologues)), monologues))
    
    #print sys.getsizeof(jsonPayload)
    
    sentimentJson = requests.post('http://sentiment.vivekn.com/api/batch/', data=jsonPayload)
    #print sentimentJson.headers
    
    print(sentimentJson.json())
    #sentiment = json.loads(sentimentJson.text)['label']
    
    
    return sentimentList

In [12]:
def getLeaders(leadersFile):
    leaders = pd.read_csv(leadersFile)  #"/home/kelvin/congression_record/101/houseLeaders.csv"
    leaders["First Name"] = leaders.Name.map(lambda s:s.split()[0])
    leaders["Last Name"] = leaders.Name.map(lambda s:s.split()[-1])
    return leaders

In [13]:
def getTheRole(leaders, lastName):
    try:
        if lastName == "The SPEAKER":
            #lastName = leaders[leaders["Role"] == lastName.capitalize() ]["Last Name"].values[0]
            role = "Speaker"
        else:
            # Strip any title first: capitalize() lowercases everything after the first character
            bareName = lastName.replace("Mr.","").replace("Mrs.","").replace("Ms.","").strip().capitalize()
            role = leaders[leaders["Last Name"] == bareName].Role.values[0]
    except:
        role = False
    
    # theRole is expected to exist already as a module-level dict
    theRole[lastName] = role
    return theRole

In [14]:
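def getTermEnd(leaders, lastName):
    try:
        if lastName == "The SPEAKER":
            # The leaders table labels this role "Speaker" (see getSpeaker)
            termEnd = leaders[leaders["Role"] == "Speaker"]["Term End"].values[0]
        else:
            # Strip any title first: capitalize() lowercases everything after the first character
            bareName = lastName.replace("Mr.","").replace("Mrs.","").replace("Ms.","").strip().capitalize()
            termEnd = leaders[leaders["Last Name"] == bareName]["Term End"].values[0]
    except:
        termEnd = False
        
    # theTermEnd is expected to exist already as a module-level dict
    theTermEnd[lastName] = termEnd
    return theTermEnd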

In [15]:
def cleanNestedWhitespace(text):
    return re.sub( '\s+', ' ', text).strip()

def truncateToSection(text):
    # str.split always returns at least one element, so [0] is safe
    return text.split("____________________")[0]
    
def cleanMonologue(text):
    p1 = truncateToSection(text)
    return cleanNestedWhitespace(p1)

def splitTheList(speakersAndMonologues):
    
    totalLength = len(speakersAndMonologues)
    speakerIndices = None
    for n, token in enumerate(speakersAndMonologues):
        # If the token starts with a title, it's a speaker,
        # rather than a monologue
        if re.match("^(Mr.|Mrs.|Ms.|\^\^THE\^\^)", token):
            #print n
            speakerIndices = range(n, totalLength, 2)
            monologueIndices = range(n+1, totalLength, 2)
            break
            
    if speakerIndices is None:
        return None, None
            
    speakers = [speakersAndMonologues[i] for i in speakerIndices]
    monologues = [cleanMonologue(speakersAndMonologues[i]) for i in monologueIndices]
            
    
    return speakers, monologues
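
The cleaning helpers in this cell do two things to each monologue: cut it off at the long underscore rule that separates sections, and collapse runs of whitespace. A minimal, self-contained sketch of the combined effect follows; the names and the sample text are illustrative only.

```python
import re

SECTION_RULE = "____________________"


def clean_monologue(text):
    # Keep only the part before the first section rule, if there is one
    head = text.split(SECTION_RULE)[0]
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space
    return re.sub(r"\s+", " ", head).strip()


raw = "  I yield   back\tmy time.  ____________________  NEXT ITEM OF BUSINESS"
cleaned = clean_monologue(raw)
untouched = clean_monologue("no rule here")
```

Text without a section rule passes through the truncation step unchanged, since `split` then returns the whole string as its only element.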

In [16]:
def getState(df, lastName, session):
    try:
        return df[(df.Last == lastName.title()) & (df.Session == int(session))]["State"].values[0]
        #return df[(df.Name.str.contains(lastName.title())) & (df.Session == int(session))]["State"].values[0]
    except IndexError:
        try:
            # So and so Of someState
            if re.match("[^\s]*\sOF\s.*", lastName.upper()):
                return re.split("[oO][fF]", lastName.title())[-1].split()[0]
            else:
                return "Unknown"
        except:
            return "Unknown"
    
def getParty(df, lastName, session):
    try:
        return df[(df.Last == lastName.title()) & (df.Session == int(session))]["Party Code"].values[0]
        #return df[(df.Name.str.contains(lastName.title())) & (df.Session == int(session))]["Party Code"].values[0]
    except IndexError:
        return "Unknown"
    
def getFullName(df, name, session):
    lastName = " ".join(name.split()[1:])
    try:
        nameOrNames = df[(df.Last == lastName.title()) & (df.Session == int(session))]["Name"].values
        if nameOrNames.shape[0] > 1:
            return "Ambiguous {0}".format(name)
        else:
            return nameOrNames[0].replace('"', "'")
        #return df[(df.Name.str.contains(lastName.title())) & (df.Session == int(session))]["Name"].values[0].replace('"', "'")       
    except IndexError:
        return name.title().replace('"', "'")
    except re.error:
        return name.replace('"', "'")
    
def getSpeaker(df, session):
    try:
        return df[(df.Role == "Speaker") & (df.Session == int(session))]["Name"].values[0]
    except IndexError:
        return "Session {0} Speaker".format(str(session))
    
def getRole(df, lastName, session):
    try:
        return df[(df.Last == lastName.title()) & (df.Session == int(session))]["Role"].values[0]
        #return df[(df.Name.str.contains(lastName.title())) & (df.Session == int(session))]["Role"].values[0]
    except IndexError:
        return None

def fullNamesFromReferences(namesInMonologue, congressSession, theBranch):
    fullNames = list(namesInMonologue)
    for n, name in enumerate(namesInMonologue):
        if name.split()[-1] == "Speaker":
            fullNames[n] = getSpeaker(rosterAndLeaders[theBranch.title()]['Leaders'], congressSession)
        
        else:
            fullNames[n] = getFullName(rosterAndLeaders[theBranch.title()]['Roster'], name, congressSession)
        
    return fullNames
    
#getState(rosterAndLeaders['Senate']['Roster'], "Young", 101)
#getParty(rosterAndLeaders['Senate']['Roster'], "Young", 101)
#getRole(rosterAndLeaders['House']['Leaders'], "Foley", 103)
#getSpeaker(rosterAndLeaders['House']['Leaders'], 110)
#getFullName(rosterAndLeaders['House']['Roster'], "Mr. Foley", 103) -> Hit
#getFullName(rosterAndLeaders['Senate']['Roster'], "Mr. Young", 103) -> Ambiguous
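
The three outcomes of `getFullName` (unique hit, ambiguous surname, no hit) can be illustrated without pandas. The roster rows below are invented, the function name is hypothetical, and the sketch leaves out the quote replacement that the notebook applies to the result.

```python
# Hypothetical roster rows; the notebook holds the real ones in a DataFrame
ROSTER = [
    {"Name": "Jane Example", "Last": "Example", "Session": 103},
    {"Name": "Ann Sample", "Last": "Sample", "Session": 103},
    {"Name": "Bob Sample", "Last": "Sample", "Session": 103},
]


def full_name(roster, name, session):
    # Drop the title ("Mr.", "Ms.", ...) and normalise the case of the surname
    last = " ".join(name.split()[1:]).title()
    hits = [row["Name"] for row in roster
            if row["Last"] == last and row["Session"] == int(session)]
    if len(hits) > 1:
        return "Ambiguous {0}".format(name)
    if hits:
        return hits[0]
    # No roster entry: fall back to the title-cased reference itself
    return name.title()


hit = full_name(ROSTER, "Ms. EXAMPLE", 103)
ambiguous = full_name(ROSTER, "Mr. SAMPLE", "103")
miss = full_name(ROSTER, "Mr. NOBODY", 103)
```

Returning an explicit "Ambiguous" marker keeps shared surnames visible in the output instead of silently attributing a reference to the wrong member.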

In [17]:
startRegex = re.compile("met at (\d+[: ]?\d*\W?(?:o'clock noon|noon|[aApP]\.[mM]\.))")
timesRegex = re.compile("{time}\W*([\d]{4})")
def getTimes(rawText):
    match = startRegex.search(rawText)
    times = timesRegex.findall(rawText)
    
    if match is not None:
        startTime = match.group(1)
        
        if len(times) > 0:
            lastTime = max(times)
        else:
            lastTime = "Unknown"
    elif len(times) > 1:
        startTime = min(times)
        
        lastTime = max(times)
    else:
        startTime = "Unknown"
        lastTime = "Unknown"
        
    return startTime, lastTime
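
`getTimes` can return the start time in two shapes: a phrase such as `10 a.m.` when the "met at" sentence is found, or a four-digit stamp taken from the `{time}` markers. If a single format is wanted downstream, a normaliser along these lines is one option. It is a sketch only: `to_hhmm` is a hypothetical helper, and it assumes the four-digit stamps are 24-hour `HHMM` values.

```python
import re


def to_hhmm(text):
    """Convert '10 a.m.', '9:30 p.m.' or '12 noon' to a 4-digit 'HHMM' string.

    Returns None when the text does not look like a clock time.
    """
    m = re.match(r"(\d{1,2})(?::(\d{2}))?\s*(a\.m\.|p\.m\.|o'clock noon|noon)$",
                 text.strip(), re.IGNORECASE)
    if m is None:
        return None
    hour, minute = int(m.group(1)), int(m.group(2) or 0)
    marker = m.group(3).lower()
    if marker == "p.m." and hour != 12:
        hour += 12          # afternoon and evening hours
    elif marker == "a.m." and hour == 12:
        hour = 0            # midnight
    return "{0:02d}{1:02d}".format(hour, minute)


morning = to_hhmm("10 a.m.")
evening = to_hhmm("9:30 p.m.")
noon = to_hhmm("12 o'clock noon")
```

Zero-padded `HHMM` strings also sort correctly as text, which matches the way `getTimes` picks the earliest and latest stamps with `min` and `max`.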

In [18]:
#test="The Senate met at 10 a.m., on the expiration of the recess, and was"
test="{time} 1015 \n\n\n\n\n\ {time} 2045 time 3342"
print(getTimes(test))

('1015', '2045')


In [19]:
from multiprocessing.dummy import current_process

from csv import writer, QUOTE_ALL, QUOTE_NONE
import gzip

def writeCsv(monologues, speakers, startTime, lastTime,
             monologueWordSets, monologueWordCounts, monologueNameReferences, 
             sentiment, theBranch, congressSession, theDate, speakerRoles, speakerParties, theTermEnd, 
             speakerStates, wordHistograms, mentionedProperNouns):
    
    threadId = current_process().ident
    
    nodeTypes = ["Date", "Session", "Congress", "Monologue", 
                 "Person", "Party", "State", "ProperNoun"]
    relTypes = ["sessionDate", "sessionCongress", "monologueDate", "monologueCongress",
                "monologueSession", "monologueSequence", "personParty", "personState",
                "personSpoke", "monologueMentions", "monologueProperNounMentions"]
    
    outFiles = [gzip.open("/Users/kyledunn/Desktop/congressionalRecord/normalized/{0}-{1}.csv.gz".format(t, threadId), 'a') for t in nodeTypes + relTypes]
    csvWriters = [writer(f, delimiter='|', quotechar='', quoting=QUOTE_NONE, doublequote=False) for f in outFiles]

    writers = dict(zip(nodeTypes + relTypes, csvWriters))
    #writers['Date'].writerow(['Spam'] * 5 + ['Baked Beans'])
    
    """
    http://neo4j.com/docs/stable/cypherdoc-importing-csv-files-with-cypher.html
    http://stackoverflow.com/questions/31639855/avoid-processing-duplicate-data-when-csv-importing-via-cypher
    
    LOAD CSV FROM "..." AS row
    OPTIONAL MATCH (f:MyLabel {id:row.uniqueId})
    WHERE f IS NULL
    MERGE (f:MyLabel {id:row.uniqueId})
    ON CREATE SET f....
    WITH f,row
    MATCH (otherNode:OtherLabel {id : row.otherNodeId})
    MERGE (f) -[:REL1] -> (otherNode)
    """;
    
    
    #graphDb = py2neo.Graph("http://{u}:{p}@{h}:7474/db/data".format(u=user, p=password, h=host))
    
    # Do the thing all in one transaction
    #tx = graphDb.cypher.begin()
  
    """
    USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///../../../../Desktop/congressionalRecord/pipe" as row 
    FIELDTERMINATOR '|' 
    CREATE (n:Date {date:row[0], branch:row[1], startTime:row[2], stopTime:row[3]})
    """;

    #dateNode = MergeNode("Date").set(date=theDate, startTime=startTime, stopTime=lastTime)
    #tx.append(dateNode)
    writers['Date'].writerow([theDate, theBranch, startTime, lastTime])
 
    """
    USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///../../../../Desktop/congressionalRecord/pipe" as row 
    FIELDTERMINATOR '|' 
    CREATE (:Session { label:row[0]} )
    """;

    thisBranchSession = "{0} Session".format(theBranch)
    #writers['Session'].writerow([thisBranchSession])

    """
    USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///../../../../Desktop/congressionalRecord/pipe" as row 
    FIELDTERMINATOR '|' 
    CREATE (:Congress { label:row[0]} )
    """;
    
    #congressYearNode = MergeNode("Congress {0}".format(congressSession))
    #tx.append(congressYearNode)
    thisCongress = "Congress {0}".format(congressSession)
    #writers['Congress'].writerow([thisCongress])
    
    #statement = """
    #MATCH (sp:{0}) 
    #MATCH (da:Date {{date: {{d}}}}) 
    #CREATE (sp)-[:`ON DATE`]->(da)
    #""".format("`{0} Session`".format(theBranch))
    #tx.append(statement, {"d": theDate})
 
    """
    USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///Users/kyledunn/Desktop/congressionalRecord/pipe" as row 
    FIELDTERMINATOR '|' 
    MATCH (br:Session)
    MATCH (da:Date {date:row[2]})
    WHERE br.label = row[0]
    CREATE (br)-[:`ON DATE`]->(da)
    """;

    writers["sessionDate"].writerow([thisBranchSession, "ON DATE", theDate])
    
    #tx.append("""
    #MATCH (s:{s})
    #MATCH (d:{d})
    #CREATE (s)-[:`PART OF`]->(d)
    #""".format(s="`{0} Session`".format(theBranch) , 
    #           d="`Congress {0}`".format(congressSession)))

    """
    USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///Users/kyledunn/Desktop/congressionalRecord/pipe" as row 
    FIELDTERMINATOR '|' 
    MATCH (br:Session)
    MATCH (cg:Congress)
    WHERE br.label = row[0] AND cg.id = row[2]
    CREATE (br)-[:`PART OF`]->(cg)
    """;
    
    writers["sessionCongress"].writerow([thisBranchSession, "PART OF", thisCongress])
    
    #tx.commit()
    
    previousMonologueId = None
    
    for index, m in enumerate(monologues):
        
        #tx = graphDb.cypher.begin()
        
        monologueId = "-".join([theDate, theBranch, str(congressSession), 'monologue', str(index)])
        """
        monologueNode = CreateNode("Monologue").set(id=monologueId,
                                                    speaker=speakers[index], 
                                                    branch=theBranch, 
                                                    congressionalYear=congressSession,
                                                    date=theDate,
                                                    text=m,
                                                    numWords=monologueWordCounts[index],
                                                    wordSet=monologueWordSets[index],
                                                    wordHistogram=wordHistograms[index],
                                                    properNouns=mentionedProperNouns[index],
                                                    party=speakerParties[index],
                                                    role=speakerRoles[index])
        tx.append(monologueNode)
        """;
        
        
        """
        LOAD CSV FROM "file:///path/to/Monologue-123145338265600.csv.gz" AS row
        CREATE (m:Monologue {id:row[0], 
                             speaker:row[1], 
                             branch:row[2], 
                             congressionalYear:row[3],
                             date:row[4],
                             text:row[5],
                             numWords:row[6],
                             wordSet:row[7],
                             wordHistogram:row[8],
                             properNouns:row[9],
                             party:row[10],
                             role:row[11]})
        """;
        
        monologueFields = [monologueId, speakers[index], theBranch, congressSession, 
                           theDate, " {0} ".format(m.replace('"', r'\"')), 
                           monologueWordCounts[index], monologueWordSets[index], 
                           wordHistograms[index], mentionedProperNouns[index], speakerParties[index],
                           speakerRoles[index]]
        writers["Monologue"].writerow(monologueFields)
        
        #tx.append("""
        #MATCH (m:Monologue {id: {i}})
        #MATCH (d:Date {date: {d}})
        #CREATE (m)-[:`ON DATE`]->(d)""", {"i": monologueId, "d": theDate})
        
        """
        CREATE CONSTRAINT ON (m:Monologue) ASSERT m.id IS UNIQUE
        USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///Users/kyledunn/Desktop/congressionalRecord/pipe" as row 
        FIELDTERMINATOR '|' 
        MATCH (m:Monologue {id:row[0]})
        MATCH (da:Date {date:row[2]})
        CREATE (m)-[:`ON DATE`]->(da)
        """;
        
        writers["monologueDate"].writerow([monologueId, "ON DATE", theDate])

        #tx.append("""
        #MATCH (m:Monologue {{id: "{i}"}})
        #MATCH (s:{s})
        #CREATE (m)-[:`PART OF`]->(s)
        #""".format(i=monologueId , s="`Congress {0}`".format(congressSession)))
        #, {"i": monologueId, "s": })

        """
        USING PERIODIC COMMIT 1000 LOAD CSV FROM "file:///Users/kyledunn/Desktop/congressionalRecord/pipe" as row 
        FIELDTERMINATOR '|' 
        MATCH (m:Monologue {id:row[0]})
        MATCH (cg:Congress)
        WHERE cg.id = row[2]
        CREATE (m)-[:`row[1]`]->(cg)
        """;

        writers["monologueCongress"].writerow([monologueId, "PART OF", thisCongress])

        #tx.append("""
        #MATCH (m:Monologue {{id: "{i}"}})
        #MATCH (s:{s})
        #CREATE (m)-[:`PART OF`]->(s)
        #""".format(i=monologueId , s="`{0} Session`".format(theBranch)))         
        #, {"i": monologueId, "s": })

        """
        LOAD CSV FROM "file:///path/to/monologueSession-*.csv.gz" AS row
        MATCH (m:Monologue {id:row[0]})
        MATCH (bs:`row[2]`)
        CREATE (m)-[:`row[1]`]->(bs)
        """;
        
        writers["monologueSession"].writerow([monologueId, "PART OF", thisBranchSession])

        
        if previousMonologueId is not None:
            #tx.append("""
            #MATCH (m:Monologue {id: {i}})
            #MATCH (n:Monologue {id: {j}})
            #CREATE (m)-[:`SAID BEFORE`]->(n)""", {"i": previousMonologueId, "j": monologueId})
            
            """
            LOAD CSV FROM "file:///path/to/monologueSequence-*.csv.gz" AS row
            MATCH (m:Monologue {id:row[0]})
            MATCH (n:Monologue {id:row[2]})
            CREATE (m)-[:`row[1]`]->(n)
            """;
            
            writers["monologueSequence"].writerow([previousMonologueId, "SAID BEFORE", monologueId])

        #tx.append(MergeNode("Person", "name", speakers[index]).set(party=speakerParties[index], 
        #                                                           role=speakerRoles[index],
        #                                                           state=speakerStates[index]))
        
        """
        LOAD CSV FROM "file:///path/to/Person-*.csv.gz" AS row
        MATCH (sp:Person {name:row[0]})
        WHERE sp.party IS NULL OR sp.state IS NULL
        MERGE (sp:Person {name:row[0], party:row[1], role:row[2], state:row[3]})
        """;
        
        writers["Person"].writerow([speakers[index], speakerParties[index], 
                                    speakerRoles[index], speakerStates[index]])

        
        #tx.append(MergeNode("Party", "name", speakerParties[index]))
        
        """
        LOAD CSV FROM "file:///path/to/Party-*.csv.gz" AS row
        MERGE (st:Party {name:row[0]})
        """;
        
        writers["Party"].writerow([speakerParties[index]])
        
        #tx.append("""
        #MATCH (speaker:Person {{name: "{s}"}})
        #MATCH (party:Party {{name: "{p}"}})
        #MERGE (speaker)-[r:`MEMBER OF`]->(party) 
        #RETURN r
        #""".format(s=speakers[index], p=speakerParties[index]))
        
        """
        LOAD CSV FROM "file:///path/to/personParty-*.csv.gz" AS row
        MATCH (sp:Person {name:row[0]})
        MATCH (pa:Party {name:row[2]})
        CREATE (sp)-[:`MEMBER OF`]->(pa)
        """;
        
        writers["personParty"].writerow([speakers[index], "MEMBER OF",
                                         speakerParties[index]])
        
        #tx.append(MergeNode("State", "name", speakerStates[index].strip()))
       
        """
        LOAD CSV FROM "file:///path/to/State-*.csv.gz" AS row
        MERGE (st:State {name:row[0]})
        """;
    
        writers["State"].writerow([speakerStates[index].strip()])
        
        #tx.append("""
        #MATCH (speaker:Person {{name: "{s}"}})
        #MATCH (state:State {{name: "{p}"}})
        #MERGE (speaker)-[r:FROM]->(state)
        #RETURN r
        #""".format(s=speakers[index], p=speakerStates[index].strip()))
        
        """
        LOAD CSV FROM "file:///path/to/personState-*.csv.gz" AS row
        MATCH (sp:Person {name:row[0]})
        MATCH (st:State {name:row[2]})
        CREATE (sp)-[:FROM]->(st)
        """;
        
        writers["personState"].writerow([speakers[index], "FROM",
                                         speakerStates[index].strip()])
        
        #tx.append("""
        #MATCH (n:Person {{name: "{n}"}})
        #MATCH (m:Monologue {{id: "{i}"}})
        #CREATE (n)-[r:SPOKE]->(m) 
        #RETURN r
        #""".format(n=speakers[index], i=monologueId))
        #, {"n": speakers[index], "i": monologueId})
        
        """
        LOAD CSV FROM "file:///path/to/personSpoke-*.csv.gz" AS row
        MATCH (sp:Person {name:row[0]})
        MATCH (m:Monologue {id:row[2]})
        CREATE (sp)-[:SPOKE]->(m)
        """;
        
        writers["personSpoke"].writerow([speakers[index], "SPOKE",  monologueId])

        #tx.commit()
        #tx = graphDb.cypher.begin()
        
        for p in monologueNameReferences[index]:
            #tx.append("""MERGE (n:Person {name: {N}}) RETURN n""", {"N": p})
            
            writers["Person"].writerow([p, None, None, None])

            #tx.append("""
            #MATCH (speaker:Person {name: {s}})
            #MATCH (person:Person {name: {p}})
            #MERGE (speaker)-[r:`MENTIONED PERSON`]->(person)
            #ON CREATE set r.count = 1, r.monologueIds = [{m}]
            #ON MATCH set r.count = r.count + 1, r.monologueIds = r.monologueIds + {m}
            #RETURN r """, {"s": speakers[index], "p": p, "m": monologueId});
            #WITH speaker.name = "{s}" AND person.name = "{p}"
        
            """
            LOAD CSV FROM "file:///path/to/monologueMentions-*.csv.gz" AS row
            MATCH (sp:Person {name:row[0]})
            MATCH (p:Person {name:row[2]})
            MERGE (sp)-[r:`MENTIONED PERSON`]->(p)
            ON CREATE SET r.count = 1, r.monologueIds = [row[3]]
            ON MATCH SET r.count = r.count + 1, r.monologueIds = r.monologueIds + row[3]
            """;
        
            # TODO: apply the MATCH + MERGE (ON CREATE / ON MATCH) pattern above at load time
            writers["monologueMentions"].writerow([speakers[index], "MENTIONED PERSON",
                                                   p, monologueId])
            
        #tx.commit()
        #tx = graphDb.cypher.begin()
            
        for noun in mentionedProperNouns[index]:
            #tx.append("""MERGE (n:`Proper Noun` {name: {N}}) RETURN n""", {"N": noun})
           
            """
            1)
            zcat uniqueProperNoun.csv > pipe
            2)
            LOAD CSV FROM "file:////Users/kyledunn/Desktop/congressionalRecord/pipe" AS row FIELDTERMINATOR '|'
            CREATE (:ProperNoun { label:row[0] })
            """;
        
            writers["ProperNoun"].writerow([noun])

            #tx.append(
            #"""
            #MATCH (speaker:Person {name: {s}})
            #MATCH (noun:`Proper Noun` {name: {n}})
            #MERGE (speaker)-[r:`MENTIONED PROPER NOUN`]->(noun)
            #ON CREATE set r.count = 1, r.monologueIds = [{m}]
            #ON MATCH set r.count = r.count + 1, r.monologueIds = r.monologueIds + {m}
            #RETURN r
            #""", {"s": speakers[index], "n": noun, "m": monologueId});
            #WITH speaker.name = "{s}" AND noun.name = "{n}"
            
            """
            LOAD CSV FROM "file:///path/to/monologueProperNounMentions-*.csv.gz" AS row
            MATCH (sp:Person {name:row[0]})
            MATCH (n:ProperNoun {label:row[2]})
            MERGE (sp)-[r:`MENTIONED PROPER NOUN`]->(n)
            ON CREATE SET r.count = 1, r.monologueIds = [row[3]]
            ON MATCH SET r.count = r.count + 1, r.monologueIds = r.monologueIds + row[3]
            """;
            
            # TODO: apply the MATCH + MERGE (ON CREATE / ON MATCH) pattern above at load time
            writers["monologueProperNounMentions"].writerow([speakers[index], "MENTIONED PROPER NOUN",
                                                             noun, monologueId])
        
        # Update the last monologue for context linkage
        previousMonologueId = monologueId
        
    #tx.commit()
    for f in outFiles:
        f.close()
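
The `MERGE ... ON CREATE SET / ON MATCH SET` notes above fold repeated mention rows into a single relationship that carries a `count` and a list of `monologueIds`. The same folding could be done in Python before loading, which might shrink the mention files. This is only a sketch, assuming rows shaped like the ones written to `monologueMentions` (`[speaker, relationship, target, monologueId]`); `foldMentions` and the sample rows are hypothetical and not part of this notebook.

```python
from collections import OrderedDict

def foldMentions(rows):
    """Collapse [speaker, relationship, target, monologueId] rows into
    one entry per (speaker, relationship, target) with a count and id list."""
    folded = OrderedDict()
    for speaker, relationship, target, monologueId in rows:
        key = (speaker, relationship, target)
        if key not in folded:
            # First sighting: the equivalent of ON CREATE
            folded[key] = {"count": 0, "monologueIds": []}
        # Every sighting: the equivalent of ON MATCH
        folded[key]["count"] += 1
        folded[key]["monologueIds"].append(monologueId)
    return folded

# Hypothetical sample rows, for illustration only
rows = [
    ["A", "MENTIONED PERSON", "B", 1],
    ["A", "MENTIONED PERSON", "B", 7],
    ["A", "MENTIONED PERSON", "C", 7],
]
folded = foldMentions(rows)
```

Each folded entry could then be written as one CSV row and loaded with a plain `CREATE`, at the cost of holding the mention table in memory.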


In [20]:
import numpy as np        

yearLut = dict()
for n, y in enumerate(np.arange(1993, 2017, 2)):
    yearLut[y] = 103+n
    yearLut[y+1] = 103+n
    
#print yearLut
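
The lookup table above relies on each Congress spanning two calendar years, with the 103rd starting in 1993. Since the 1st Congress convened in 1789, the same mapping can be written in closed form. A sketch only: `congressForYear` is a hypothetical helper, and like `yearLut` it ignores the first days of January in odd years, which still belong to the outgoing Congress.

```python
def congressForYear(year):
    """Number of the Congress sitting for most of the given calendar year."""
    # 1789 -> 1, 1791 -> 2, ...; integer division pairs each odd year
    # with the even year that follows it.
    return (year - 1789) // 2 + 1

# Rebuild the same year -> Congress table for comparison with yearLut
rebuilt = dict((y, congressForYear(y)) for y in range(1993, 2017))
```

Unlike the table, the formula does not raise `KeyError` for years outside 1993-2016, so a caller would need its own range check if that failure mode is wanted.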

In [24]:
# The roster/leader CSVs live under a different home directory on each
# machine, so try every known location and keep the first that loads.
roots = ["/Users/kdunn/Dropbox/workspace/python/congressionalRecord/",
         "/Users/kyledunn/Dropbox/workspace/python/congressionalRecord/"]

rosterAndLeaders = dict()
for root in roots:
    try:
        for branch, prefix in [("House", "h"), ("Senate", "s")]:
            rosterAndLeaders[branch] = {"Roster": pd.read_csv(root + prefix + "Roster.csv"),
                                        "Leaders": pd.read_csv(root + prefix + "Leaders.csv")}
        break
    except IOError:
        continue
else:
    raise IOError("Roster CSVs not found under any known root")

# Derive a last-name column on every table, whichever root was used
# (previously this only happened in the fallback branch).
for branch in rosterAndLeaders:
    for table in rosterAndLeaders[branch].values():
        table['Last'] = table.Name.map(lambda s: str(s).strip().split()[-1])
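
The `Last` column above is simply the final whitespace-separated token of `Name`. That is worth keeping in mind when a lookup by last name misses: generational suffixes and multi-word surnames do not survive it. A small sketch of the behavior, using made-up names rather than rows from the real roster CSVs; `lastToken` is a hypothetical wrapper around the same expression.

```python
def lastToken(name):
    """Same expression the notebook maps over the Name column."""
    return str(name).strip().split()[-1]

# Illustrative inputs only; the real roster CSVs are not shown here
samples = ["  Jane Q. Public ", "John Doe Jr.", "Maria de la Cruz"]
derived = [lastToken(s) for s in samples]
# derived == ["Public", "Jr.", "Cruz"]
```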

In [25]:
#from py2neo.core import Unauthorized
#from py2neo.packages.httpstream import SocketError

def process(theRecord):
    
    #print "Filename", theRecord
    path = theRecord.split("/")
    
    theFilename = path[-1]
    theBranch = path[-3].title()
    
    theDate = "".join(theFilename.split('.')[0].split("-")) #[-3:])
    
    try:
        congressSession = yearLut[int(theDate[:4])]
    except KeyError:
        print "Failed to lookup", theDate
        return
    
    #print "Loading congress", congressSession, "date:", theDate
    
    theSpeaker = "The SPEAKER"
    if theBranch == "House":
        theSpeaker = getSpeaker(rosterAndLeaders['House']['Leaders'], congressSession)
    
    speakersAndMonologues, rawText = getMonologues(theRecord, theSpeaker)
    
    startTime, lastTime = getTimes(rawText)
    
    speakers, monologues = splitTheList(speakersAndMonologues)
    
    if speakers is None or monologues is None:
        print "Bad splits for", theBranch, theDate, '-', theFilename
        return
    
    monologueWordSets = map(getSetOfWords, monologues)
    
    monologueWordHistograms = map(getWordHistogram, monologueWordSets, monologues)
    
    monologueWordCounts = map(getNumberOfWords, monologues)
    
    monologueNameReferences = map(getNameReferences, monologues)
    
    # Takes a *long* time
    #sentiment = getSentiment(monologues)
    #sentiment = []
    
    speakerRoles = map(lambda s: getRole(rosterAndLeaders[theBranch.title()]['Leaders'], 
                                         " ".join(s.split()[1:]), 
                                         congressSession), speakers)
    
    speakerParties = map(lambda s: getParty(rosterAndLeaders[theBranch.title()]['Roster'],
                                            " ".join(s.split()[1:]),
                                            congressSession), speakers)
    
    speakerStates = map(lambda s: getState(rosterAndLeaders[theBranch.title()]['Roster'],
                                           " ".join(s.split()[1:]),
                                           congressSession), speakers)

    speakerFullNames = map(lambda s: getFullName(rosterAndLeaders[theBranch.title()]['Roster'], s,
                                                 congressSession), speakers)
    
    mentionedFullNames = map(lambda l: fullNamesFromReferences(l, congressSession, theBranch), monologueNameReferences)
    
    mentionedProperNouns = map(getProperNounReferences, monologues)
    
    if True:
        try:
            writeCsv(monologues, speakerFullNames, startTime, lastTime,
                     monologueWordSets, monologueWordCounts, mentionedFullNames, 
                     None, theBranch,  congressSession, theDate, speakerRoles, speakerParties, 
                     None, speakerStates, monologueWordHistograms, mentionedProperNouns)
        except Exception as e:
            # Log and skip the file so one bad record does not stop the batch
            print theBranch, theDate, "failed to write CSV", "-", theFilename, repr(e)
        #print "Wrote", theDate, "to CSV.GZ"
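
`process` derives the branch and date purely from the file path, so the directory layout matters. The sketch below mirrors that path handling in isolation. The `<branch>/Merged/<file>.txt` layout comes from the glob patterns used in this notebook; the date-style filename and the `/data/...` prefix are assumptions for illustration, and `describeRecordPath` is a hypothetical helper.

```python
def describeRecordPath(theRecord):
    """Mirror the path handling at the top of process()."""
    path = theRecord.split("/")
    theFilename = path[-1]
    theBranch = path[-3].title()  # e.g. "HOUSE" -> "House"
    # Drop the extension, then the hyphens: "2015-01-06.txt" -> "20150106"
    theDate = "".join(theFilename.split(".")[0].split("-"))
    return theBranch, theDate, int(theDate[:4])

# Hypothetical path; only the last three components are inspected
example = describeRecordPath("/data/congressionalRecord/HOUSE/Merged/2015-01-06.txt")
# example == ("House", "20150106", 2015)
```

The year returned last is what gets looked up in `yearLut`, so a filename that does not start with a four-digit year would fail before any parsing happens.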

In [26]:
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
from accelerate import profiler

In [None]:
%%time

numThreads = 4
path = "/Users/k*dunn/Desktop/congressionalRecord/*/Merged/*.txt"

pool = ThreadPool(numThreads)

pool.map(process, glob.glob(path));

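# Wall time: 9h 35min 11s for everything @24 threads
# Wall time: 9h 34min 56s for everything @16
# Wall time: 9h 23min 2s for everything @6
# Wall time: 9h 26min 30s @4
# Thread count barely moves the wall time, which is consistent with
# CPU-bound work being serialized by the GIL under ThreadPool.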

#map(process, glob.glob(path)[:2]);

pool.close() 
pool.join()

In [None]:
%matplotlib inline

p = profiler.Profile()

path = "/Users/k*dunn/Desktop/congressionalRecord/HOUSE/Merged/*.txt"

p.run('map(process, glob.glob(path)[:2])')

profiler.plot(p)

In [None]:
"""

monologueHeader.csv:
id:ID,speaker,branch,congressionalYear,date,text,numWords,wordSet,wordHistogram,properNouns,party,role

./bin/neo4j-import \
--into /Users/kyledunn/neo4j-community-2.2.1/data/graph.db \
--nodes:Monologue "monologueHeader.csv,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145405571072.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145409777664.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145413984256.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145418190848.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145422397440.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145426604032.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145430810624.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145435017216.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145439223808.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145443430400.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145447636992.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145451843584.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145456050176.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145460256768.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145464463360.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145468669952.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145472876544.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145477083136.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145481289728.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145485496320.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145489702912.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145493909504.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145498116096.csv.gz,/Users/kyledunn/Desktop/congressionalRecord/normalized/Monologue-123145502322688.csv.gz" \
--skip-duplicate-nodes \
--bad-tolerance
"""