# Script for preprocessing tweets for bi-LSTM training #

**Author:** [Andrew Larkin](https://www.linkedin.com/in/andrew-larkin-525ba3b5/) <br>
**Affiliation:** [Oregon State University, College of Public Health and Human Sciences](https://health.oregonstate.edu/) <br>
**Date Created:** January 7, 2019 <br>

**Summary** <br>
Add a regional context indicator variable for words that have regional meaning.  The regional context indicator was added after the datasets were preprocessed, which is why these functions are in a separate script from the PreprocessTweets notebook.  Regional indicators flag the word indices within a tweet where a key regional indicator phrase is present.
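
As a rough illustration of the idea, the sketch below pairs the words of a made-up tweet with hand-written indicator flags and recovers the flagged word indices. The tweet text and the flags are assumptions chosen for the example, not output from this notebook.

```python
# Hypothetical tweet; "Hood River" is one of the Portland key phrases
tweet = "hiking near Hood River this weekend"
tokens = tweet.split()

# Hand-written flags for illustration: 1 marks a word inside a key phrase
locInd = [0, 0, 1, 1, 0, 0]

# word indices where the regional indicator is positive
flaggedIndices = [i for i, flag in enumerate(locInd) if flag == 1]
flaggedWords = [tokens[i] for i in flaggedIndices]
print(flaggedIndices, flaggedWords)
```

The flag list has one entry per whitespace-separated word, so it can be read side by side with the tweet.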

### setup: import packages, define filepaths and global constants ###

In [27]:
import pandas as pd
import pickle
import numpy as np

In [8]:
# define input and output filepaths
parentFolder = "C:/Users/larkinan/Desktop/DBTraining/"
dataset = parentFolder + "preprocessingOutput/"
performFolder = parentFolder + "modelTrainingPerformance/"

In [42]:
# pickled datasets to load
datasetPickleParams = { 
    "trainDictPicklePath":dataset + "trainDict.p",
    "devDictPicklePath":dataset + "devDict.p",
    "testDictPicklePath":dataset + "testDict.p",
    "allDictPicklePath":dataset + "allDict.p",
    "embeddingMatrixPicklePath":dataset + "embeddingMatrix.p",
    "word2IndexPicklePath":dataset + "word2Index.p",
    "NYC_DictPicklePath":dataset + "NYCDict.p"
}

In [12]:
# phrases in Portland tweets that have regionally unique context
keyWordsPDX = ['Enchanted Forest','Splash Mountain','Bull Mountain','Sun River','Lake Oswego','Menlo Park','Forest Heights','Salt Lake',
               'Fire on the Mountain','Lake Tahoe','Garden State','CenturyLink Field','Mountain View','Hood River','Cooper Mountain',
               'Highland Park','Columbia River','Meridian Park','Merlo Park','Space Mountain','ATandT Park','Maywood Park','Fenway Park',
               'Coors Field']

# phrases in NYC tweets that have regionally unique context
keyWordsNYC = ['Sunset Park','Belmont Park','Forest Hills','Garden City','the Garden','Borough Park',
               'Mad Square Garden','Madsquaregarden', 'square garden','Massapequa Park','Citi Field',
               'Lone Star Park','Deer park','Pearl River', 'Kings Park','Floral Park','Ozone Park',
               'rego park', 'rockaway park','Hyde Park','Fenway Park','PNC Park','Marine Park','Harlem River',
               'Hutchinson River','Lake George','Park Slope','Massapequa Park','Bronx River','Nats Park',
               'Mountain View','Nationals Park','Salt Lake','Madison Sq','Florham Park','Morris Park',
               'Forest Park','Finsbury Park','Hershey Park','Silver Lake' ]

padKey = "AAAAAAAAAAAAAAAAAA "

In [13]:
# load pickled preprocessed data
def loadDatasets(pickleParams):
    # helper that closes each file handle once its contents are loaded
    def loadPickle(paramKey):
        with open(pickleParams[paramKey],'rb') as pickleFile:
            return pickle.load(pickleFile)
    trainDict = loadPickle('trainDictPicklePath')
    devDict = loadPickle('devDictPicklePath')
    testDict = loadPickle('testDictPicklePath')
    NYC_Dict = loadPickle('NYC_DictPicklePath')
    embeddingMatrix = loadPickle('embeddingMatrixPicklePath')
    word2IndexMap = loadPickle('word2IndexPicklePath')
    return(trainDict,devDict,testDict,NYC_Dict,embeddingMatrix,word2IndexMap)

In [6]:
# test whether the subString variable is a substring of the wholeString variable (case-insensitive).  Used in a map-reduce function
def testForSubstring(wholeString,subString):
    if(subString.lower() in wholeString.lower()):
        return subString
    return("None")

In [43]:
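# replace instances of keyWords in the wholeString with the padKey string
def replaceKeywords(wholeString,keyWords,padKey):
    newString = wholeString.lower()
    for word in keyWords:
        if word.lower() in newString:
            # one padKey per word in the phrase keeps the tweet's word count unchanged
            wordLength = len(word.split())
            padding = padKey*wordLength
            startIndex = newString.find(word.lower())
            # extend the match to the next space so trailing characters (e.g. punctuation) are replaced too
            endIndex = newString.find(" ",startIndex+len(word))
            # find returns -1 when the phrase ends the tweet; slice to the end of the string in that case
            if endIndex == -1:
                endIndex = len(newString)
            stringToReplace = newString[startIndex:endIndex]
            newString = newString.replace(stringToReplace,padding[:-1])
    return(newString)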
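
The padding step can be sketched on its own for a single phrase. This is a simplified variant rather than the notebook's function: `padPhrase` and `padToken` are hypothetical names, only the first occurrence of the phrase is handled, and the replacement is done by slicing instead of `str.replace`.

```python
def padPhrase(text, phrase, padToken="AAAAAAAAAAAAAAAAAA"):
    """Replace the first case-insensitive match of phrase with one padToken per word."""
    lowered = text.lower()
    target = phrase.lower()
    start = lowered.find(target)
    if start == -1:
        # phrase not present: return the lowercased text unchanged
        return lowered
    # extend the match to the next space so trailing punctuation is absorbed
    end = lowered.find(" ", start + len(target))
    if end == -1:
        # the phrase runs to the end of the text
        end = len(lowered)
    padding = " ".join([padToken] * len(target.split()))
    return lowered[:start] + padding + lowered[end:]


# short pad token used here only to keep the printed result readable
print(padPhrase("Dinner at Fire on the Mountain!", "Fire on the Mountain", "X"))
# dinner at X X X X
```

Because each word of the phrase is swapped for exactly one pad token, the padded string keeps the same number of whitespace-separated words as the input.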

In [44]:
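# return 1 if the two words are identical, 0 otherwise.  Used to flag words that were replaced by the padKey
def compareWords(padKey,compareWord):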
    if(padKey == compareWord):
        return 1
    return 0

### identify regional indicators for one tweet ###
**Inputs**: <br>
- **wholeString** (string) - text of single tweet to screen for regional indicators <br>
- **keyWords** (string array) - phrases which if present within wholeString should be flagged with a positive indicator variable <br>
- **padKey** (string) - intermediate used to identify where within a tweet regional phrases are present <br>
**Outputs**: <br>
- **loc_indicator** (int array) - series of regional indicators, where the value at index i flags whether the word at index i belongs to a regional key phrase or not

In [3]:
# create the location indicator array for a single tweet
def createSingleLocInd(wholeString,keyWords,padKey):
    codedString = replaceKeywords(wholeString,keyWords,padKey)
    parsedCode = codedString.split()
    padKeyArray = [padKey[:-1]]*len(parsedCode)
    locResults = list(map(compareWords,parsedCode,padKeyArray))
    loc_indicator = np.asarray(locResults)
    return(loc_indicator)
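
The comparison step can be illustrated on a coded string that has already been padded. The coded string below is a made-up example, and the sketch returns a plain list to stay dependency-free, whereas the function above converts the result to a NumPy array.

```python
padKey = "AAAAAAAAAAAAAAAAAA "
padToken = padKey[:-1]  # pad key with its trailing space removed

# Hypothetical coded string, as it might look after a four-word phrase is padded
codedString = "dinner at " + " ".join([padToken] * 4) + " tonight"

# 1 where the word equals the pad token, 0 elsewhere
locInd = [1 if word == padToken else 0 for word in codedString.split()]
print(locInd)  # [0, 0, 1, 1, 1, 1, 0]
```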

### create regional indicators for all tweets in a single dataset ###
**Inputs**: <br>
- **textArray** (string array) - all texts in the dataset <br>
- **keyWords** (string array) - key phrases that are regional indicators.  Indices of the words in these phrases are where regional indicator flags should be positive <br>
- **padKey** (string) - intermediate used to communicate where regional indicators were identified within a text <br>

**Outputs**:
- **padResults** (array of int arrays) - the ith int array corresponds to the regional indicator flags for all words in the ith text string

In [4]:
def createAllLocInd(textArray,keyWords,padKey):
    keyWordMap = [keyWords]*len(textArray)
    padKeyMap = [padKey]*len(textArray)
    padResults = list(map(createSingleLocInd,textArray,keyWordMap,padKeyMap))
    return(padResults)
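
Since the ith indicator array is meant to line up word for word with the ith text, a quick length check can catch misaligned entries. This is a minimal sketch assuming whitespace splitting is the intended tokenization; `checkAlignment` and the sample data are hypothetical.

```python
def checkAlignment(texts, indicators):
    """Return the positions of texts whose indicator length differs from their word count."""
    return [
        i for i, (text, ind) in enumerate(zip(texts, indicators))
        if len(text.split()) != len(ind)
    ]


# made-up texts and indicator arrays for illustration
texts = ["hiking near hood river", "rainy day"]
goodIndicators = [[0, 0, 1, 1], [0, 0]]
badIndicators = [[0, 0, 1, 1], [0]]

print(checkAlignment(texts, goodIndicators))  # []
print(checkAlignment(texts, badIndicators))   # [1]
```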

### add regional indicator to multiple datasets ### 
**Inputs**: <br>
- **savePaths** (string array) - absolute filepaths where modified datasets should be stored <br>
- **dictArray** (dict array) - input dictionaries to add regional indicator variable to <br>
- **keyWords** (string array) - keyStrings that are positive indicators <br>
- **padKey** (string) - intermediate product - used to identify where in a text keyWords occur and positive regional indicators should be written

In [38]:
# add regional location indicator to datasets 
def addLocToAllDatasets(savePaths,dictArray,keyWords,padKey):

    #testString = "I'm testing to see if Fire on the mountain and merlo Park are picked up by the map function"

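    for savePath, tempDict in zip(savePaths,dictArray):
        tempDictLocInd = createAllLocInd(tempDict['sent'],keyWords,padKey)
        tempDict['loc_ind'] = tempDictLocInd
        # use a context manager so the output file is flushed and closed after writing
        with open(savePath,"wb") as saveFile:
            pickle.dump(tempDict,saveFile)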
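
The save step can be checked with a small round trip: write a dictionary that carries a `loc_ind` entry and reload it. The sketch below uses a temporary directory and a made-up dictionary, so the file name and contents are assumptions, not the notebook's data.

```python
import os
import pickle
import tempfile

# made-up dataset with a regional indicator already attached
exampleDict = {
    "sent": ["hiking near hood river"],
    "loc_ind": [[0, 0, 1, 1]],
}

with tempfile.TemporaryDirectory() as tmpDir:
    savePath = os.path.join(tmpDir, "exampleDict.p")  # hypothetical file name
    # write the dictionary, then read it back from disk
    with open(savePath, "wb") as saveFile:
        pickle.dump(exampleDict, saveFile)
    with open(savePath, "rb") as loadFile:
        reloaded = pickle.load(loadFile)

print(sorted(reloaded.keys()))  # ['loc_ind', 'sent']
```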

In [40]:
def main():
    trainDict, devDict, testDict, NYC_Dict,embeddingMatrix, word2IndexMap = loadDatasets(datasetPickleParams)
    PDXDicts = [trainDict,devDict,testDict]
    PDXSavePaths = [
        parentFolder + 'trainDictv2t.p',
        parentFolder + 'devDictv2t.p',
        parentFolder + 'testDictv2t.p'
    ]
    addLocToAllDatasets(PDXSavePaths,PDXDicts,keyWordsPDX,padKey)
    addLocToAllDatasets([parentFolder + 'NYCDictv2t.p'],[NYC_Dict],keyWordsNYC,padKey)
main()

dict_keys(['x', 'y', 'seqlens', 'hash', 'emot', 'loc_ind', 'sent', 'labels', 'seqLens'])
dict_keys(['x', 'y', 'seqlens', 'hash', 'emot', 'loc_ind', 'sent', 'labels', 'seqLens'])
dict_keys(['x', 'y', 'seqlens', 'hash', 'emot', 'loc_ind', 'sent', 'labels', 'seqLens'])
dict_keys(['labels', 'seqLens', 'sent', 'hash', 'emot', 'loc_ind'])
