# Script for preprocessing tweets for bi-LSTM training #

**Author:** [Andrew Larkin](https://www.linkedin.com/in/andrew-larkin-525ba3b5/) <br>
**Affiliation:** [Oregon State University, College of Public Health and Human Sciences](https://health.oregonstate.edu/) <br>
**Date Created:** October 14th, 2018 <br>

**Summary** <br>
This script contains functions used to preprocess tweets and corresponding metadata into features for deep learning model input. Preprocessing includes:

- tokenization
- partitioning multiple word hashtags (e.g. #lastchildinthewoods) into multiple words and adjusting hashtag indicators and sentence length metadata
- encoding words as 300-dimensional GloVe word vectors created by Stanford
- creating trainable, randomly initialized word encodings for unknown and pad token words
- replacing unknown words with an UNKNOWN tag
- filling the end of sentences with a pad tag so all tweet vectors are the same length
- applying QA/QC inclusion criteria
- creating train, dev, and test datasets
- calculating descriptive statistics

**Note**: text was already partially processed at the time of record insertion into the SQL database.  Input text has already had emojis replaced with word descriptions, hashtag and emoji indicators created, and hyperlinks removed.  

### Setup ###

**Import libraries, define model and dataset parameters, and set filepaths**

In [None]:
# libraries
import zipfile
import numpy as np
import pandas as ps
import re, string
import pickle

In [None]:
# filepaths
parentFolder =  "C:/Users/larkinan/Desktop/DBTraining/"
outputFolder = parentFolder + "preprocessingOutput/" # folder to store train, dev, and test datasets

In [None]:
# hyperparameters and constants for the deep learning model.  
# While only the word_vec_dim is needed for this script, 
# the entire dictionary is copied here to facilitate ease of search and comparison.
modelParams = {
    'word_vec_dim':300, # dimension of each word vector
    'mini_batch_size':256, 
    'learning_rate':0.01,
    'momentum':0.9,
    'num_outcomes':1, # whether testing for just 1 or multiple outcomes,
    'hidden_layer size':32,
    'num_epochs':10000,
    'num_dev':5000,
    'num_test':5000,
    'hidden_layer_activation':'tanh'
}

twitterCSVParams = {
    'filepath':parentFolder + "trainingDatav8.csv",
    'classifyFilepath':parentFolder + "allScores__Oct14_18.csv",
    'text':'clean_text',  # whether to use cleaned or raw (i.e. with emojis, hashtags, etc) twitter text
    'nature_ind':'nature',
    'safety_ind':'safety',
    'beauty_ind':'beauty',
    'exercise_ind':'exercise',
    'social_ind':'social',
    'stress_ind':'stress',
    'air_ind':'air',
    'hashtag_ind':'hash_ind',
    'emoticon_ind':'emot_ind' 
}

wordVecParams = {  # word encodings created by Stanford
    'zipFilename':parentFolder + "glove.840B.300d.zip",
    'txtFilename':"glove.840B.300d.txt"
}

unknownWordSubs = { # identified using the 'countUnknownWords' function
    "'04":'4',
    "'10":'10',
    "'12":'12',
    "'14":'14',
    "'15":'15',
    "'16":'16',
    "'17":'17',
    "'18":'18',
    "'72":'72',
    "'93":'93',
    "'96":'96',
    "'cause":'because',
    "'ll":'will',
    "'m":'am',
    "'re":'are',
    "'s":'is',
    "'ve":'have',
    "'Ve":'have',
    "'em":'them'
}

picleParams = { # where to store datasets for model training on hard disk
    "trainDictPicklePath":outputFolder + "trainDict.p",
    "devDictPicklePath":outputFolder + "devDict.p",
    "testDictPicklePath":outputFolder + "testDict.p",
    "allDictPicklePath":outputFolder + "allDict.p",
    "embeddingMatrixPicklePath":outputFolder + "embeddingMatrix.p",
    "word2IndexPicklePath":outputFolder + "word2Index.p",
}

### convert binary variable to two binary variables ###
Classification in Tensorflow requires a unique vector for each classification level.  This function converts a single binary classification vector into two binary classification vectors: one for positive classification (value = 1 in the original input vector) and one for negative classification (value = 0 in the original input vector).

**TODO:** <br>
[  ] consider optimizing by converting for loop into vector operation.  But, it runs very fast even with a loop.  Maybe Python is unrolling the loop?

**Inputs:** <br>
- **inputLabels** (n x 1 numpy matrix) - Values are 1 for positive classification and 0 for negative classification.  
- **debug** (boolean) - whether or not to print debug statements for testing/validation

**Outputs:** <br>
- **twoClassLabels** (n x 2 numpy matrix).  First column values are 1 for positive classification, 0 otherwise.  Second column values are 1 for negative classification, 0 otherwise.
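
The transformation described above can be illustrated on a toy vector. This is only a minimal sketch: the name `expand_labels_sketch` is hypothetical and not part of this script, and unlike `expandOneClassLabel` it does not require both classes to be present in the input.

```python
import numpy as np

def expand_labels_sketch(input_labels):
    """Hypothetical helper: turn a length-n 0/1 vector into an n x 2 matrix."""
    positive = np.asarray(input_labels, dtype=np.int32).reshape(-1)
    # every label must be either 0 or 1
    assert np.isin(positive, (0, 1)).all()
    # the negative column is the complement of the positive column
    negative = 1 - positive
    return np.column_stack((positive, negative))

example_labels = expand_labels_sketch([1, 0, 0, 1])
print(example_labels)
```

Each row contains exactly one 1, so the row sums can serve as a quick sanity check on the output.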

In [None]:
def expandOneClassLabel(inputLabels,debug=False):
    assert(max(inputLabels) == 1)
    assert(min(inputLabels) == 0)
    positiveClass = np.reshape(np.array(inputLabels),(len(inputLabels),1))
    negativeClass = np.ones((len(inputLabels),1))
    negativeClass = np.subtract(negativeClass, positiveClass)
    twoClassLabels = np.concatenate((positiveClass,negativeClass),axis=1)
    if(debug):
        print(
            "expandOneClassLabel - input dim = %s, output dim = %s" 
            % (positiveClass.shape, twoClassLabels.shape)
        )
    assert(sum(np.sum(twoClassLabels,axis=1)) == len(inputLabels))
    return(twoClassLabels)

In [None]:
def sumInputs(inputLabels):
    return sum(inputLabels)

### transforming class labels for Tensorflow compatibility ###
Tensorflow requires one vector for each potential class. This function transforms a binary vector of class labels into two numpy arrays, one for positive classification and one for negative classification.  Finally, arrays for all 7 outcomes are combined to create a single nx14 numpy matrix. 

**Inputs:** <br>
- **inputDict** (dict) - dict containing class label arrays, with one array for each outcome of interest <br>
- **twitterParamDict** (dict) - contains names for each class label <br>
- **debug** (boolean) - whether or not to print debug statements

**Outputs** <br>
- **LabelMatrix** (numpy matrix) - n x 14 numpy matrix of expanded outcomes, with outcomes listed in the same column order as inputs.


![Alt text](https://raw.githubusercontent.com/larkinandy/GreenTweet_MultivariateBiLSTM/master/DataPreprocessing/images/methods_images.jpg)
>> **Figure 1.** Diagram depicting the input and output of expandAllClassLabels function. 

In [None]:
def expandAllClassLabels(inputDict,twitterParamDict,debug=False):
    inputLabels = getClassLabelsFromDict(inputDict,twitterParamDict)
    sumClassifiers = list(map(sumInputs,inputLabels))
    tempLabels = list(map(expandOneClassLabel,inputLabels))
    outLabels = np.concatenate(tempLabels,axis=1).astype(np.int32)
    if(debug):
        print("sample input Labels before expanding labels: %s" % str(inputLabels[1][0:10]))
        print("sample temp labels after expanding labels: %s" % str(tempLabels[1][0:10]))
        print("shape of output matrix: %s" % str(outLabels.shape))
    # verify num positive records for nature class in output is same as input
    assert(np.sum(outLabels,axis=0)[0] == sumClassifiers[0]) 
    assert(outLabels.shape[0] == len(inputLabels[0]) and outLabels.shape[1] == len(inputLabels)*2)
    return(outLabels)

### remove records that do not meet inclusion criteria ###
Some records have too little context or are too noisy to classify accurately.  This step removes records with too few words (both including and excluding unknown words) or too high a percentage of unknown words.  This reduces bias in the model and gives a better understanding of the Bayes optimal error for the target classifiers.

**Inputs:** <br>
- **inputData** (dict) - contains sentences to be screened along with len of sentences.  At this point sentences may already be padded with end tokens, so we can't simply use the len function to calculate sentence length <br>
- **minLength** (int) - minimum number of words (including unknowns) for inclusion in the dataset <br>
- **minNumVals** (int) - minimum number of words (excluding unknowns) for inclusion in the dataset <br>
- **maxPercUnk** (float) - max percentage of words that can be unknown for inclusion in the dataset <br>

**Outputs** <br>
- **scrDict** (dict) - revised version of inputData, with sentences removed that failed to meet inclusion criteria
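
The three criteria can be sketched for a single, already tokenized sentence as follows. The helper name `meets_inclusion_criteria` is hypothetical; the default thresholds and the strict comparisons are assumed to mirror `applyExclusionCriteria`.

```python
def meets_inclusion_criteria(words, min_length=5, min_num_vals=3,
                             max_perc_unk=0.25, unknown_tag="UNK"):
    """Hypothetical helper: True if a list of words passes all three checks."""
    num_words = len(words)
    if num_words == 0:
        return False  # avoid dividing by zero for empty sentences
    num_unk = sum(word == unknown_tag for word in words)
    perc_unk = num_unk / num_words
    return (num_words > min_length                      # enough words overall
            and (num_words - num_unk) > min_num_vals    # enough known words
            and perc_unk < max_perc_unk)                # not too noisy

kept = meets_inclusion_criteria("the park was green and quiet today".split(" "))
dropped_short = meets_inclusion_criteria("nice UNK day".split(" "))
dropped_noisy = meets_inclusion_criteria("UNK UNK UNK walk in the park".split(" "))
print(kept, dropped_short, dropped_noisy)
```

The second sentence fails on length alone, while the third has enough known words but too large a share of unknown tokens.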

In [None]:
def applyExclusionCriteria(inputData,minLength =5,minNumVals=3,maxPercUnk=0.25):
    numValid, index = (0 for i in range(2))
    scrSents, scrLengs, scrLabels, scrHash, scrEmot = ([] for i in range(5))
    for sent in inputData['paddedSents']:
        splitVals = sent.split(" ")
        b = inputData['sentLengths'][index]
        splitVals = splitVals[0:b]
        numWords = len(splitVals)
        numUnk = sum(word == 'UNK' for word in splitVals)
        percUnk = (numUnk*1.0)/(numWords*1.0)
        if(numWords > minLength and (numWords - numUnk) > minNumVals and percUnk < maxPercUnk):
            numValid+=1
            scrSents.append(inputData['paddedSents'][index])
            scrLengs.append(inputData['sentLengths'][index])
            scrLabels.append(inputData['twoClassLabels'][index])
            scrHash.append(inputData['paddedHashtags'][index])
            scrEmot.append(inputData['paddedEmots'][index])
        index +=1
    scrDict = {'sent':scrSents,'seqLens':scrLengs,'labels':scrLabels,'hash':scrHash,'emot':scrEmot}
    return(scrDict)

In [None]:
# create a dictionary from multiple arrays based on a set of indices
def getDataDict(indices,data_x,data_y,data_seqlens,data_hash,data_emot,labels):
    subset_x = [data_x[i] for i in indices]
    subset_y = np.asarray([data_y[i] for i in indices]).reshape((len(indices), 14))
    subset_seqlens = [data_seqlens[i] for i in indices]
    subset_hash = [data_hash[i] for i in indices]
    subset_emot = [data_emot[i] for i in indices]
    dataDict = {labels[0]:subset_x,labels[1]:subset_y,labels[2]:subset_seqlens,labels[3]:subset_hash,labels[4]:subset_emot}
    return(dataDict)

In [None]:
# partition datasets into train, dev, and test subsets and store in dict format
def splitTrainDevTest(modelParams,screenedDict):
    
    numDev = modelParams['num_dev']
    numTest = modelParams['num_test']
    data_x = screenedDict['sent']
    data_y = screenedDict['labels']
    data_seqlens = screenedDict['seqLens']
    hashInd = screenedDict['hash']
    emotInd = screenedDict['emot']
    
#def splitTrainDevTest(numDev,numTest,data_x,data_y,data_seqlens,hashInd,emotInd):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    devBatch = instance_indices[0:numDev]
    devDict = getDataDict(devBatch,data_x,data_y,data_seqlens,hashInd,emotInd,
               ['x','y','seqlens','hash','emot'])
    testBatch = instance_indices[numDev:numDev+numTest]
    testDict = getDataDict(testBatch,data_x,data_y,data_seqlens,hashInd,emotInd,
               ['x','y','seqlens','hash','emot'])
    trainBatch = instance_indices[numDev+numTest:]
    trainDict = getDataDict(trainBatch,data_x,data_y,data_seqlens,hashInd,emotInd,
               ['x','y','seqlens','hash','emot'])
    allDict = getDataDict(instance_indices,data_x,data_y,data_seqlens,hashInd,emotInd,
                          ['x','y','seqlens','hash','emot'])
    assert(len(devDict['x']) == numDev)
    assert(len(testDict['x'])==numTest)
    assert(len(trainDict['x'])==(len(allDict['x'])-(numDev+numTest)))
    return devDict,testDict,trainDict,allDict

In [None]:
def getClassLabelsFromDict(twitterData,twitterParamDict):
    # look up column names in the dict passed as an argument rather than the global twitterCSVParams
    natureInd = twitterData[twitterParamDict['nature_ind']]
    safetyInd = twitterData[twitterParamDict['safety_ind']]
    beautyInd = twitterData[twitterParamDict['beauty_ind']]
    exerciseInd = twitterData[twitterParamDict['exercise_ind']]
    socialInd = twitterData[twitterParamDict['social_ind']]
    stressInd = twitterData[twitterParamDict['stress_ind']]
    airInd = twitterData[twitterParamDict['air_ind']]
    return(natureInd,safetyInd,beautyInd,exerciseInd,socialInd,stressInd,airInd)

### create mapping dictionaries for all words in the twitter dataset ###
The mapping dictionaries are important for converting words into vectors and vice versa in later functions. Note that the dictionaries include punctuation.

**Inputs:** <br>
- **inputSents** (list) - twitter sentences containing all candidate words to map to

**Outputs:** <br>
- **index2word_map** (dict) - dictionary with index as the key and word as the value
- **word2index_map** (dict) - dictionary with words as the key and index as the value

**TODO:** consider vectorizing, but it runs very fast even with nested loops (< 1s for 100,000 sentences)
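
A small sketch of the two-way mapping on toy data is shown below. The helper name `build_word_maps` is hypothetical; it assumes, as the description above states, that each sentence is a list of words and that a single leading or trailing apostrophe is stripped before a word is registered.

```python
def build_word_maps(sentences):
    """Hypothetical helper: build word->index and index->word dictionaries."""
    word2index = {}
    for sent in sentences:
        for word in sent:
            # strip one leading and one trailing apostrophe, if present
            if word.startswith("'"):
                word = word[1:]
            if word.endswith("'"):
                word = word[:-1]
            if word and word not in word2index:
                word2index[word] = len(word2index)
    # invert the mapping so that indices become the keys
    index2word = {index: word for word, index in word2index.items()}
    return index2word, word2index

index2word, word2index = build_word_maps(
    [["i", "love", "'parks'"], ["parks", "love", "me"]]
)
print(word2index)
print(index2word)
```

Indices are assigned in order of first appearance, so repeated words keep the index they received the first time they were seen.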

In [None]:
def generateWordMap(inputSents):
    word2index_map = {}
    index = 0
    for sent in inputSents:
        reducedPunc = string.punctuation.replace("'", " ")
        for word in sent:
            if(len(word)>0 and word[0] == "'"):                
                word = word[1:]
            if(len(word)>0 and word[-1] == "'"):
                word = word[:-1]
            if(len(word)>0 and word not in word2index_map):
                    word2index_map[word] = index
                    index+=1
    index2word_map = {word:index for word, index in word2index_map.items()}
    return index2word_map, word2index_map

In [None]:
# load class scores and twitter data from csv files 
def loadCSV(inputCSVFile,debug=False,inDelim=',',inQuoteChar='"'):
    rawData = ps.read_csv(inputCSVFile,encoding='utf-8',delimiter = inDelim,quotechar = inQuoteChar)
    if(debug):
        print(rawData.head())
    rawData.tweet_id = rawData.tweet_id.astype(str)
    return(rawData)

In [None]:
# extract csv filepaths from param dictionary 
def getTwitterCSVFilepaths(twitterDict):
    CSVFilepath = twitterDict['filepath']
    classifyFilepath = twitterDict['classifyFilepath']
    return CSVFilepath, classifyFilepath

In [None]:
# load twitter data features 
# Get the names of features and class labels from the twitter params dictionary
def getTwitterCSVParams(twitterDict):
    CSVFilepath = twitterDict['filepath']
    classifyFilepath = twitterDict['classifyFilepath']
    textVar = twitterDict['text']
    hashtagVar = twitterDict['hashtag_ind']
    emoticonVar = twitterDict['emoticon_ind']
    natureInd = twitterDict['nature_ind']
    safetyInd = twitterDict['safety_ind']
    beautyInd = twitterDict['beauty_ind']
    exerciseInd = twitterDict['exercise_ind']
    socialInd = twitterDict['social_ind']
    stressInd = twitterDict['stress_ind']
    return (
        [CSVFilepath,classifyFilepath, textVar, hashtagVar, emoticonVar],
        [natureInd,safetyInd,beautyInd,exerciseInd,socialInd,stressInd]
    )

### load class labels and twitter data from csv files and merge ###

**Inputs** <br>
- **twitterCSVParams** (dict) - contains column names and filepaths.  See Setup section above for more details
- **debug** (boolean) - whether to print debug statements in subfunctions

**Outputs** <br>
- **tweetWithScores** (pandas DataFrame) - contains class labels and twitter data merged into a single table.  Column names are documented in the Setup section above

**TODO** <br>
[ ] - combine csv files and simplify the data loading process

In [None]:
def loadWordList(twitterCSVParams,debug=False):
    TwitterFilepath, classifyFilepath = getTwitterCSVFilepaths(twitterCSVParams)
    classScores = loadCSV(classifyFilepath,debug)
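    # all class scores must be binary; the keyword form of drop also works on recent pandas versions
    scoreCols = classScores.drop(columns='tweet_id')
    assert(scoreCols.max().max() == 1 and scoreCols.min().min() == 0)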
    tweetData = loadCSV(TwitterFilepath,debug,inQuoteChar='|')
    tweetWithScores = tweetData.merge(classScores,  how='inner', left_on='tweet_id', right_on = 'tweet_id')
    return tweetWithScores

### load the word vectors ###
Read word vectors from file and load into memory.  Note that this file is large and compressed.  Reading from an SSD on a computer with at least 16GB of memory is recommended (this script was run on Intel 900p storage with 32GB RAM).

**Inputs:** <br>
- **wordVecParams** (dict) - contains the entire filepath to the zip file containing word vectors ('zipFilename') and the name of the text file inside the zip file ('txtFilename') 

**Outputs:** <br>
- **embeddingWeights** (dict) - dictionary, where words are the keys and vectors are the values 
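
Each line of the word vector text file is assumed to hold a word followed by its space-separated vector components. The sketch below parses such lines and scales each vector to unit length. It uses two in-memory byte strings with 3 components as a stand-in for the zipped 300-dimensional file, and the helper name `parse_vector_line` is hypothetical.

```python
import numpy as np

def parse_vector_line(line):
    """Hypothetical helper: split one raw line into (word, unit-length vector)."""
    vals = line.decode("utf-8").split()
    word = vals[0]
    coefs = np.asarray(vals[1:], dtype="float32")
    # scale to unit length so all word vectors share the same magnitude
    coefs /= np.linalg.norm(coefs)
    return word, coefs

# toy stand-in for lines read from the zipped text file
toy_lines = [b"park 3.0 4.0 0.0\n", b"tree 0.0 0.0 2.0\n"]
toy_embeddings = dict(parse_vector_line(line) for line in toy_lines)
print(toy_embeddings["park"])
```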

In [None]:
def loadWordVec(wordVecParams):
    embedding_weights = {}
    with zipfile.ZipFile(wordVecParams['zipFilename']) as z:
        with z.open(wordVecParams['txtFilename']) as f:
            for line in f:
                vals = line.split()
                word = str(vals[0].decode("utf-8"))
                coefs = np.asarray(vals[1:], dtype = 'float32')
                coefs/=np.linalg.norm(coefs)
                embedding_weights[word] = coefs
    return embedding_weights

### find the longest word at the start of a string of characters.  Return -1 if no word exists ###
This function is for partitioning multiple words in tweets that are compacted together without a space (e.g. lastchildinthewoods)

**Inputs:** <br>
- **wordSpace** (string) - string that potentially contains multiple words concatenated without spaces
- **inDict** (dictionary) - the word search space, i.e. list of all words we consider to be 'valid'
- **debug** (boolean) - whether to print debug statements

**Outputs:** <br>
- (int) index at which wordSpace can be partitioned so that the leading characters form a valid word.  -1 if no such partition exists 
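
As an illustration of the search, the sketch below tries progressively shorter prefixes until one is found in the vocabulary. The name `longest_known_prefix` and the toy vocabulary are hypothetical.

```python
def longest_known_prefix(word_space, vocabulary):
    """Hypothetical helper: length of the longest prefix found in vocabulary, else -1."""
    for end in range(len(word_space), 0, -1):
        if word_space[:end] in vocabulary:
            return end
    return -1

toy_vocab = {"la", "last", "child", "in", "the", "woods"}
split_at = longest_known_prefix("lastchildinthewoods", toy_vocab)
no_match = longest_known_prefix("xyz", toy_vocab)
print(split_at, no_match)
```

Because longer prefixes are tried first, "last" is preferred over the shorter "la" in this example.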

In [None]:
def searchForFirstWord(wordSpace,inDict,debug=False):
    lastWordLen = len(wordSpace)
    while(lastWordLen > 0):
        candidateWord = wordSpace[0:lastWordLen]
        if(candidateWord in inDict):
            endInd = lastWordLen
            if(debug):
                print(
                    "found long word in wordspace: \n word: %s \n index: %i" 
                    % (candidateWord,endInd)
                )
            return endInd
        lastWordLen -=1
    return -1

### for a single string of an unknown word, partition the string into smaller words ###
This function also extends the hashtag and emoticon indicator vectors corresponding to the increase in word length

**Inputs:** <br>
- **inWord** (string) - word to partition
- **inDict** (dict) - contains all acceptable or 'legal' words
- **isHashtag** (int array) - binary vector indicating whether each word in a sentence is from a hashtag. Update to reflect new words
- **isEmot** (int array) - binary vector indicating whether each word in a sentence is from an emoticon.  Update to reflect new words
- **debug** (boolean) - whether to print debug statements 

**Outputs:** <br>
- **wordVec** (string array) - list of words, in sequential order of original sentence
- **outHashtag** (int array) - binary hashtag vector, updated to reflect any added words
- **outEmot** (int array) - binary emot vector, updated to reflect any added words
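
The partitioning can be sketched as a greedy loop that repeatedly peels off the longest known prefix and copies the hashtag and emoticon indicators to every word it produces. The helper name `split_compacted_word` and the toy vocabulary are hypothetical, and any leftover characters that match no known word are kept as a single token.

```python
def split_compacted_word(word, vocabulary, is_hashtag, is_emot):
    """Hypothetical helper: greedily split a word and extend its indicators."""
    words, hashtags, emots = [], [], []
    remaining = word
    while remaining:
        # longest prefix of the remaining characters that is a known word
        end = next(
            (e for e in range(len(remaining), 0, -1) if remaining[:e] in vocabulary),
            -1,
        )
        if end == -1:
            end = len(remaining)  # no known prefix: keep the rest as one token
        words.append(remaining[:end])
        hashtags.append(is_hashtag)  # each new word inherits the indicators
        emots.append(is_emot)
        remaining = remaining[end:]
    return words, hashtags, emots

toy_vocab = {"last", "child", "in", "the", "woods", "park"}
parts, hash_inds, emot_inds = split_compacted_word("lastchildinthewoods", toy_vocab, 1, 0)
leftover_parts, _, _ = split_compacted_word("parkzzz", toy_vocab, 1, 0)
print(parts, hash_inds, emot_inds)
print(leftover_parts)
```

A greedy split is not guaranteed to find the best segmentation for every input, but it keeps the word list and both indicator vectors the same length, which is what later padding steps rely on.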

In [None]:
def checkOneCompactedWord(inWord,inDict,isHashtag,isEmot,debug=False):
    wordVec, outHashtag, outEmot = ([] for i in range(3))
    origWord = inWord
    partitionIndex = searchForFirstWord(inWord,inDict,debug)
    while(partitionIndex >0):
        outHashtag.append(isHashtag)
        outEmot.append(isEmot)
        wordVec.append(inWord[0:partitionIndex])
        inWord = inWord[partitionIndex:]
        partitionIndex = searchForFirstWord(inWord,inDict,debug)
    if(len(inWord)>0):
        wordVec.append(inWord)
        outHashtag.append(isHashtag)
        outEmot.append(isEmot)
    if(len(wordVec)==0):
        wordVec = [origWord]
        # keep indicators as single-element lists so the length check below holds
        outHashtag = [isHashtag]
        outEmot = [isEmot]
    if(debug):
        print(
            "%s was partitioned into %i words: %s \n with hashtag: %s"
            % (origWord,len(wordVec),wordVec,outHashtag)
             )
    assert (len(wordVec) == len(outHashtag) and len(outHashtag) == len(outEmot))
    return(wordVec,outHashtag,outEmot)

### for a single sentence, split the sentence into individual words, and split compacted words that are missing a space ###

This function also adjusts hashtag and emoji indicator vectors to accommodate the expanded list of words

**Inputs:** <br>
- **inSent** (string) - entire sentence <br>
- **inDict** (dict) - dictionary of allowed, or 'legal' words <br>
- **inHashtag** (int array) - indicator of which words in the sentence, if any, belong to a hashtag <br>
- **inEmot** (int array) - indicator of which words in the sentence, if any, belong to an emoji description <br>
- **debug** (boolean) - whether debug statements should be printed <br>

**Outputs:** <br>
- **splitSent** (string array) - revised sentence in array format, with each word an element in the array <br>
- **outHashtag** (int array) - revised hashtag indicator array <br>
- **outEmot** (int array) - revised emot indicator array <br>

In [None]:
def screenForCompactedWordsInSentence(inSent,inDict,inHashtag,inEmot,debug=False):
    splitSent, outHashtag, outEmot = ([] for i in range(3))
    wordIndex = 0
    for word in inSent.split(" "):
        if(len(word)>0):
            try:
                currHashtag = inHashtag[wordIndex]
                currEmot = inEmot[wordIndex]
            except Exception as e:
                if(debug):
                    print(str(e))
            if not word in inDict:
                splitWords,hashtags,emotes = checkOneCompactedWord(word,inDict,currHashtag,currEmot,debug)
                splitSent += splitWords
                outHashtag += hashtags 
                outEmot += emotes
            else:
                splitSent.append(word)
                outHashtag.append(currHashtag)
                outEmot.append(currEmot)
            if(word not in string.punctuation and word not in ['...']):
                wordIndex +=1
    if(debug):
        print(
            "sentence after splitting: %s \n hashtag: %s \n emote: %s"
            % (splitSent,outHashtag,outEmot)
        )
    assert(len(splitSent) == len(outHashtag) and len(outHashtag) == len(outEmot))
    return splitSent, outHashtag, outEmot

### for all sentences in the dataset, split the sentences into individual words, and split compacted words that are missing a space ###

**Inputs:** <br>
- **TweetSentences** (string array) - sentences to process <br>
- **inDict** (dict) - dictionary of allowable or 'legal' words <br>
- **inHashtag** (list of integer arrays) - one integer array for each sentence, indicating which words belong to a hashtag <br>
- **inEmot** (list of integer arrays) - one integer array for each sentence, indicating which words belong to an emoticon description <br>
- **debug** (boolean) - whether to print debug statements <br>

**Outputs:**<br>
- **screenedSents** (list of string arrays) - revised version of Tweet sentences, with compacted words partitioned into multiple words.  Each sentence is now an array of strings instead of a single string <br>
- **screenedHashtags** (list of integer arrays) - revised version of inHashtags, adjusted for word insertions <br>
- **screenedEmots** (list of integer arrays) - revised version of inEmot, adjusted for word insertions <br>


In [None]:
def screenForCompactedWordsAllTweets(TweetSentences,inDict,inHashtag,inEmot,debug=False):
    screenedSents, screenedHashtags, screenedEmots = ([] for i in range(3))
    for currIndex in range(len(TweetSentences)):
        splitSent, updatedHashtag, updatedEmot = screenForCompactedWordsInSentence(
            TweetSentences[currIndex],
            inDict,
            inHashtag[currIndex],
            inEmot[currIndex],
            debug = False
        )
        screenedSents.append(splitSent)
        screenedHashtags.append(updatedHashtag)
        screenedEmots.append(updatedEmot)
    if(debug):
        print(
            "Example screening for compact words: \n %s, %s, %s"
            % (screenedSents[0],screenedHashtags[0],screenedEmots[0])
        )
        print(
            "Number of tweets screened: %i" %len(screenedSents)
        )
    assert(len(screenedSents) == len(screenedHashtags) and len(screenedHashtags) == len(screenedEmots))
    return(screenedSents,screenedHashtags,screenedEmots)

In [None]:
# remove word encodings whose corresponding word isn't in the Twitter dataset
def reduceWordEmbeddings(embeddedWeights,word2indexMap):
    reducedEmbeddingWeights = {}
    for key, value in embeddedWeights.items():
        if key in word2indexMap:
            reducedEmbeddingWeights[key] = embeddedWeights[key]
    return reducedEmbeddingWeights

In [None]:
# replace a single word not found in the dictionary with an unknown token
def tagSingleWord(word,unknownTag,word2indexMap,subWords):
    if(len(word)>0):
        if word not in word2indexMap:
            if word in subWords:
                word = subWords[word]
            else:
                word = unknownTag
        return(word)
    return(None)

In [None]:
# for all sentences, replace all words not found in the dictionary with an unknown token
def substituteAllUnknownTags(inputSentences,unknownTag,word2index_map,subWords):
    taggedSentences = []
    for sent in inputSentences:
        screenedSentence = ""  
        for word in sent:
            taggedWord = tagSingleWord(word,unknownTag,word2index_map,subWords)
            if taggedWord is not None: screenedSentence += taggedWord + " "
        screenedSentence = screenedSentence[:-1]
        taggedSentences.append(screenedSentence)
    return(taggedSentences)

In [None]:
# map word encoding to corresponding word, and create a lookup dictionary.
def convertVectorsToMatrix(word2embeddingDict,vecDim):
    embeddingMatrix = np.zeros((len(word2embeddingDict)+2,vecDim))
    wordToIndex = {}
    index = 0
    for word, vector in word2embeddingDict.items():
        wordEmbedding = word2embeddingDict[word]
        embeddingMatrix[index,:] = wordEmbedding
        wordToIndex[word] = index
        index +=1
    return(embeddingMatrix,wordToIndex)

In [None]:
# identify unknown words and their corresponding frequency
def countUnknownWords(embeddingWeights,screenedSents):
    unknownWordDict = {}
    for sent in screenedSents:
        for word in sent:
            if word not in embeddingWeights:
                if word in unknownWordDict:
                    unknownWordDict[word] = unknownWordDict[word] + 1
                else:
                    unknownWordDict[word] = 1
    return(unknownWordDict)

In [None]:
# replace unknown words with list of known substitutions
# for example, replace "'ve" with "have"
def substituteUnknownWords(screenedSents,unknownWordDict):
    for sentIndex in range(len(screenedSents)):
        for wordIndex in range(len(screenedSents[sentIndex])):
            if screenedSents[sentIndex][wordIndex] in unknownWordDict:
                screenedSents[sentIndex][wordIndex] = unknownWordDict[screenedSents[sentIndex][wordIndex]]
    return screenedSents

### find the number of words in the longest tweet (by word count) in the database ###
To do this we find the max length of the emoticon indicator vectors, since each vector contains one element for one word in a sentence

**Inputs:** <br>
- **inputEmoticons** (list of integer arrays) - one integer array for each sentence.  Len of array corresponds to number of words in the sentence <br>

**Outputs:** <br>
- **maxSentenceLength** (int) - number of words in the longest tweet (by word count)

In [None]:
def findMaxSentenceLength(inputEmoticons):
    maxSentenceLength = 0
    for emot in inputEmoticons:
        maxSentenceLength = max(len(emot), maxSentenceLength)
    maxSentenceLength +=1
    return(maxSentenceLength)

### pad shorter sentences with pad tokens to ensure all sentences have the same vector length ###
This allows data to be stored in matrices, which makes it possible to vectorize deep learning operations and to train with minibatch rather than stochastic gradient descent 

**Inputs:**<br>
- **inputSentences** (list of strings) - sentences that need to be padded, with words separated by single spaces <br>
- **maxSentenceLength** (int) - number of words that each sentence must contain.  Add pad tokens to the end of shorter sentences to increase their length <br>
- **padToken** (string) - specific word to add to sentences to indicate end of sentence has already been reached <br>

**Outputs:**<br>
- **paddedSents** (list of strings) - revised version of inputSentences, with padding added so all sentences have maxSentenceLength number of words <br>
- **sentLengths** (integer array) - number of words in each sentence **before** adding padding <br>
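
A minimal sketch of padding one space-separated sentence is shown below. The helper name `pad_sentence` is hypothetical, and the default pad token is assumed to be the `PAD_TOKEN` string used elsewhere in this script.

```python
def pad_sentence(sentence, max_len, pad_token="PAD_TOKEN"):
    """Hypothetical helper: pad a space-separated sentence to max_len words."""
    words = sentence.split(" ")
    original_len = len(words)
    # append pad tokens only when the sentence is shorter than max_len
    padded_words = words + [pad_token] * max(0, max_len - original_len)
    return " ".join(padded_words), original_len

padded_sentence, original_length = pad_sentence("green space UNK", 5)
print(padded_sentence, original_length)
```

Returning the original length alongside the padded sentence lets later steps ignore the pad tokens when counting words.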

In [None]:
def padSentences(inputSentences,maxSentenceLength,padToken):
    paddedSents = []
    sentLengths = []
    for sent in inputSentences:
        sentLength = len(sent.split(" "))
        sentLengths.append(sentLength)
        while(sentLength < maxSentenceLength):
            sent = sent + " " + padToken
            sentLength +=1
        paddedSents.append(sent)
    assert(min(sentLengths) >0)
    assert(max(sentLengths)<=maxSentenceLength)
    return(paddedSents,sentLengths)

### extend length of an indicator vector so that all indicator vectors are the same as the longest vector ###
This is needed so that all indicator vectors can be combined into a matrix, letting tensorflow vectorize deep learning operations <br>

**Inputs:** <br>
- **inputVectors** (list of integer arrays) - indicator vectors that need to be padded
- **maxSentenceLength** (int) - number of elements each vector must contain.  Add zeros to the end of shorter sentences to increase their length <br>

**Outputs:** <br>
- **paddedVectors** (list of integer arrays) - revised version of inputVectors, with padding added so all vectors have maxSentenceLength number of elements <br>
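
The same idea can be sketched for a single indicator vector using `np.pad`, which appends zeros on the right. The helper name `pad_indicator` is hypothetical and this is not necessarily how the function below is implemented.

```python
import numpy as np

def pad_indicator(vector, max_len):
    """Hypothetical helper: right-pad a 0/1 indicator vector with zeros."""
    values = np.asarray([int(v) for v in vector], dtype=np.int32)
    num_missing = max(0, max_len - len(values))
    # pad 0 elements on the left and num_missing zeros on the right
    return np.pad(values, (0, num_missing), mode="constant")

padded_indicator = pad_indicator([0, 1, 1, 0], 6)
print(padded_indicator)
```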


In [None]:
def padIndicatorVector(inputVectors,maxSentenceLength):
    paddedVectors = []
    for vector in inputVectors:
        vectorLength = len(vector)
        numVector = np.asarray(list(map(int, vector)),dtype=np.int32).reshape((vectorLength))
        if(maxSentenceLength-vectorLength >0):
            padding = np.zeros((maxSentenceLength-vectorLength,1),dtype=np.int32)
            numVector = np.append(numVector,padding)
            numVector = numVector.reshape(maxSentenceLength)
        paddedVectors.append(numVector)
    return(paddedVectors)

### add padding to all input features to the tensorflow model ###
This ensures all vectors are the same length, greatly increasing tensorflow's ability to vectorize operations and reduce training time.

**Inputs:** <br>
- **inputSentences** (list of strings) - sentences that need to be padded, with words separated by single spaces <br>
- **inputHashtagInd** (list of integer arrays) - hashtag vectors that need to be padded <br>
- **inputEmoticonInd** (list of integer arrays) - emoticon vectors that need to be padded <br>
- **padToken** (string) - token word to add at end of sentences to make their length the same <br>

**Outputs:** <br>
- **paddedDict** (dict) - dictionary containing the following entries: <br>
    - **paddedSents** (list of strings) - revised version of inputSentences, with padding added so all sentences have maxSentenceLength number of words <br>
    - **sentLengths** (int array) - length of each sentence in inputSentences **before** adding padding <br>
    - **paddedHashtags** - revised version of inputHashtagInd, with padding added so all vectors have maxSentenceLength number of elements <br>
    - **paddedEmots** - revised version of inputEmoticonInd, with padding added so all vectors have maxSentenceLength number of elements <br>
    - **maxSentLength** - number of elements each vector contains after padding <br>

In [None]:
def addPadding(inputSentences,inputHashtagInd,inputEmoticonInd,padToken):
    # compute the target length once and reuse it for sentences and indicator vectors
    maxSentenceLength = findMaxSentenceLength(inputEmoticonInd)
    paddedDict = {'maxSentLength': maxSentenceLength}
    paddedDict['paddedSents'], paddedDict['sentLengths'] = padSentences(inputSentences,maxSentenceLength,padToken)
    paddedDict['paddedHashtags'] = padIndicatorVector(inputHashtagInd,maxSentenceLength)
    paddedDict['paddedEmots'] = padIndicatorVector(inputEmoticonInd,maxSentenceLength)
    return(paddedDict)

In [None]:
# save output files in pickled format
def pickleOutputFiles(pickleParams,devDict,testDict,trainDict,allDict,embeddingMatrix,word2Index):
    outputs = {
        'devDictPicklePath':devDict,
        'testDictPicklePath':testDict,
        'trainDictPicklePath':trainDict,
        'allDictPicklePath':allDict,
        'embeddingMatrixPicklePath':embeddingMatrix,
        'word2IndexPicklePath':word2Index
    }
    for pathKey, outputObject in outputs.items():
        # the context manager closes each file once it has been written
        with open(pickleParams[pathKey],"wb") as outFile:
            pickle.dump(outputObject,outFile)
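
For reference, a pickled dataset can be read back with `pickle.load`. The following is a minimal round-trip sketch; the dictionary contents and the file name are placeholders rather than the values used in this project.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for one of the dictionaries written by pickleOutputFiles
dev_dict = {'x': [[1, 2, 0]], 'seqlens': [2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    dev_path = os.path.join(tmp_dir, "devDict.p")   # placeholder file name
    with open(dev_path, "wb") as out_file:
        pickle.dump(dev_dict, out_file)
    # files must be opened in binary mode for both writing and reading
    with open(dev_path, "rb") as in_file:
        restored = pickle.load(in_file)
```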

### main function ###

In [None]:
embeddingWeights = loadWordVec(wordVecParams)
tweetDict = loadWordList(twitterCSVParams)
# print(tweetDict.head()) # for debug/ validation purposes only
screenedSents,screenedHashtags,screenedEmots = screenForCompactedWordsAllTweets(tweetDict['clean_text'],embeddingWeights,tweetDict['hash_ind'],tweetDict['emot_ind'],True)
index2WordMap, word2IndexMap = generateWordMap(screenedSents)
word2embeddingDict = reduceWordEmbeddings(embeddingWeights,index2WordMap)
#countUnknownWords(word2embeddingDict,screenedSents) # for debug/ validation purposes only
word2embeddingDict['UNK'] = np.random.randn(modelParams['word_vec_dim'])*0.001 # initialize to small but non-zero random weight vector
word2embeddingDict['PAD_TOKEN'] = np.random.randn(modelParams['word_vec_dim'])*0.001 # initialize to small but non-zero random weight vector
embeddingMatrix,word2Index = convertVectorsToMatrix(word2embeddingDict,modelParams['word_vec_dim'])
taggedSentences = substituteAllUnknownTags(screenedSents,"UNK",word2Index)
paddedDict = addPadding(taggedSentences, screenedHashtags, screenedEmots, "PAD_TOKEN")
paddedDict['twoClassLabels'] = expandAllClassLabels(tweetDict, twitterCSVParams)
screenedDict = applyExclusionCriteria(paddedDict)
devDict,testDict,trainDict,allDict = splitTrainDevTest(modelParams,screenedDict)
# commented out to prevent overwriting the original datasets during code optimization
#pickleOutputFiles(pickleParams,devDict,testDict,trainDict,allDict,embeddingMatrix,word2Index)

### functions to calculate descriptive statistics of train, dev, and test sets ###

In [None]:
# descriptive statistics for sentence length
def calcSentLengthStats(inDict,indicatorKey):
    sentDict = {'avgSentLength':np.mean(np.array(inDict[indicatorKey])) }
    sentDict['stdDevSentLength'] = np.std(np.array(inDict[indicatorKey]))
    sentDict['maxSentLength'] = max(inDict[indicatorKey])
    sentDict['minSentLength'] = min(inDict[indicatorKey])
    sentDict['percentiles'] = np.percentile(np.array(inDict[indicatorKey]),[5,25,50,75,95])
    return sentDict

In [None]:
# descriptive statistics for word indicators (e.g. if a word is in a hashtag)
def calcStatsWordIndicator(inDict,indicatorKey):
    tempDict = {'numPositive_' + indicatorKey + 'Records':sum(np.max(np.array(inDict[indicatorKey]),axis=1)) }
    tempDict['max_' + indicatorKey + 'Words'] = max(np.sum(np.array(inDict[indicatorKey]),axis=1))
    tempDict['min_' + indicatorKey + 'Words'] = min(np.sum(np.array(inDict[indicatorKey]),axis=1))
    subsetIndices = np.where(np.sum(np.array(inDict[indicatorKey]),axis=1)>0)
    subsetRecords = np.take(inDict[indicatorKey],subsetIndices,axis=0).reshape(
        subsetIndices[0].size,len(inDict[indicatorKey][0]))
    tempDict['avg_' + indicatorKey + 'Words'] = np.mean(np.sum(subsetRecords,axis=1))
    tempDict['stdDev_' + indicatorKey + 'Words'] = np.std(np.sum(subsetRecords,axis=1))
    tempDict[indicatorKey + '_percentiles'] = np.percentile(np.sum(subsetRecords,axis=1),[5,25,50,75,95])
    return tempDict
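
The quantities computed above can be checked by hand on a small matrix. This sketch uses a made-up hashtag indicator matrix and assumes, as the function above does, that averages are taken only over tweets containing at least one indicator word.

```python
import numpy as np

# toy hashtag indicator matrix: one row per tweet, one column per word position
hash_ind = np.array([
    [0, 1, 1, 0],   # two hashtag words
    [0, 0, 0, 0],   # no hashtag words
    [1, 0, 0, 0],   # one hashtag word
])

words_per_tweet = hash_ind.sum(axis=1)                    # hashtag words in each tweet
num_positive = int(hash_ind.max(axis=1).sum())            # tweets with at least one hashtag word
positive_counts = words_per_tweet[words_per_tweet > 0]    # drop tweets without hashtags
avg_words = positive_counts.mean()                        # mean over positive tweets only
```

Restricting the mean to positive tweets keeps the many tweets without hashtags from pulling the average toward zero.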

In [None]:
# descriptive statistics for outcome class labels
def calcOutcomeStats(inDict,outcomes):
    outcomeStatDict = {'outcomes':outcomes}
    numRecords = len(inDict['x'])
    outcomeIndices = [0,2,4,6,8,10,12] # columns of y holding the label for each of the seven outcomes
    outcomeStatDict['frequencies'] = inDict['y'][:,outcomeIndices].sum(axis=0)
    outcomeStatDict['percent'] = (outcomeStatDict['frequencies']*100.0) / numRecords
    numVals = inDict['y'][:,outcomeIndices].sum(axis=1) # number of outcome labels assigned to each record
    # count the records that have exactly i outcome labels (0 through 7)
    for i in range(8):
        outcomeStatDict[str(i) + "labels"] = sum(numVals==i)
    return(outcomeStatDict)
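
A small worked example may help when reading the column indexing above. It assumes a label layout in which each outcome occupies two adjacent columns of `y`, with the even column marking the outcome as present; this layout is inferred from the even indices used above and is not confirmed elsewhere in this section. Only two outcomes are shown to keep the matrix readable.

```python
import numpy as np

# toy label matrix with two outcomes; each outcome occupies two columns
# (assumed layout: even column = outcome present, odd column = outcome absent)
y = np.array([
    [1, 0, 1, 0],   # both outcomes present
    [1, 0, 0, 1],   # first outcome only
    [0, 1, 0, 1],   # neither outcome
])
positive_cols = [0, 2]

frequencies = y[:, positive_cols].sum(axis=0)         # number of tweets per outcome
percent = frequencies * 100.0 / y.shape[0]            # share of tweets per outcome
labels_per_tweet = y[:, positive_cols].sum(axis=1)    # outcomes assigned to each tweet
label_counts = {k: int((labels_per_tweet == k).sum())
                for k in range(len(positive_cols) + 1)}
```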

In [None]:
# subset a dataset to the records whose nature label equals valToKeep
def subsetDictionaryByNatureVal(inDict,valToKeep):
    natureLabels = inDict['y'][:,0] # first column of y holds the nature label
    natureIndices = np.where(natureLabels==valToKeep)[0]
    totalRecordsToKeep = len(natureIndices)
    subsetDict = {}
    for key in inDict.keys():
        subsetDict[key] = np.array([inDict[key][i] for i in natureIndices])
        assert(len(subsetDict[key]) == totalRecordsToKeep)
    return subsetDict
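
The same kind of subsetting can also be expressed with a boolean mask. The sketch below is an alternative formulation on made-up data, not the function above; it assumes every value in the dictionary can be converted to a NumPy array with one row per record.

```python
import numpy as np

# toy dataset: column 0 of y is assumed to hold the nature label
data = {
    'y': np.array([[1, 0], [0, 1], [1, 0]]),
    'seqlens': np.array([5, 8, 3]),
}

keep = data['y'][:, 0] == 1    # True for records labeled as nature
subset = {key: np.asarray(vals)[keep] for key, vals in data.items()}
```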

In [None]:
# all descriptive statistics for one dataset
def calcDescriptiveStatsOneDataset(inDict):
    outcomes = ['nature','safety','beauty','exercise','social','stress','air']
    outcomeStat = calcOutcomeStats(inDict,outcomes)
    sentStats = calcSentLengthStats(inDict,'seqlens')
    hashStats = calcStatsWordIndicator(inDict,'hash')
    emotStats = calcStatsWordIndicator(inDict,'emot')
    statDict = {'outcomeStats':outcomeStat,'sentStats':sentStats,'hashStats':hashStats,'emotStats':emotStats}
    return statDict

In [220]:
# calc descriptive stats for all datasets
allDictStats = calcDescriptiveStatsOneDataset(allDict)
allDictNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(allDict,1))
allDictNotNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(allDict,0))
trainDictStats = calcDescriptiveStatsOneDataset(trainDict)
trainDictNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(trainDict,1))
trainDictNotNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(trainDict,0))
devDictStats = calcDescriptiveStatsOneDataset(devDict)
devDictNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(devDict,1))
devDictNotNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(devDict,0))
testDictStats = calcDescriptiveStatsOneDataset(testDict)
testDictNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(testDict,1))
testDictNotNatureStats = calcDescriptiveStatsOneDataset(subsetDictionaryByNatureVal(testDict,0))