# Personality Detection Using Emotion Frequencies
*Written by Mark Endo*
## Introduction: Capturing an Author's Expression
In class, we determined the author of Federalist Paper 49 using the "Bag of Words" technique, which compared word probabilities for Madison and Hamilton. This approach produced promising results, but it only looked at one aspect of writing. In reality, people express (and therefore identify) themselves through much more than the frequency of the words they use. Writers have personalities that come through in their writing, and these cannot be captured solely by a count of individual words. I'm interested in finding other ways to identify the *person* behind the text. In particular, **can we determine the author of a document based solely on the writing's emotional expression?**

The concept of decoding emotional expression from text has countless applications. Here are a few that seem particularly interesting and impactful:
 - Solving disputes about who wrote a particular part of a body of work (focus of this project).
 - Advanced plagiarism detection that goes beyond word/sentence structure and examines tones of writing.
 - Writing tools to help users understand their tone, like Grammarly's new [tone detector](https://www.grammarly.com/blog/tone-detector/).
 - Speech analytics tools to help speakers strengthen their "voice."

In this project, I tackle the first problem of detecting authors given training and test data, but the underlying concepts of the work can be applied to the other applications. With a refined model of emotional extraction, the possibilities are endless.
## How To Capture Expression?

The first step in solving any of these problems is finding some way of capturing a text's emotional expression. For this, we cannot simply use the "Bag of Words" model. The tone of a piece relies on more than the frequencies of the words used. It also depends on the ordering of the words and on how they relate to each other. For example, consider the sentence:

> "The pizza is looking awfully tasty."

With our understanding of language, we know that the tone of the sentence is positive, as the phrase "awfully tasty" means very tasty. However, if we were to ignore the relation between words, we could think the tone was neutral to negative, since one sense of "awfully" is "very badly." So, we must have a method that captures the *context* of the words. To do this, we will employ an NLP technique called [context analysis](https://www.lexalytics.com/lexablog/context-analysis-nlp). Using context analysis, we can break the text up into words and phrases that carry meaning. For example, the phrase "awfully tasty" in the sentence above is a context with its own textual meaning.
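To make the difference concrete, here is a minimal sketch using a hypothetical toy lexicon. The scores are invented for illustration and are not SenticNet values, and the greedy phrase matching is a simplification rather than the extraction method used later in this notebook. Scoring word by word lets the negative sense of "awfully" pull the sentence down, while looking up the phrase as a unit gives the intended positive reading.

```python
# Hypothetical toy lexicon; the scores are invented for illustration only.
toy_lexicon = {
    "awfully": -0.8,
    "tasty": 0.7,
    "awfully_tasty": 0.9,
}

words = "the pizza is looking awfully tasty".split()

# Word-by-word scoring ignores how the words relate to each other.
word_score = sum(toy_lexicon.get(w, 0.0) for w in words)

# Phrase-level scoring: prefer a two-word context when the lexicon has one.
phrase_score = 0.0
i = 0
while i < len(words):
    pair = "_".join(words[i:i + 2])
    if pair in toy_lexicon:
        phrase_score += toy_lexicon[pair]
        i += 2
    else:
        phrase_score += toy_lexicon.get(words[i], 0.0)
        i += 1

print("word-by-word score is negative:", word_score < 0)
print("phrase-level score is positive:", phrase_score > 0)
```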

So now that we have a way to capture the relation between words, how can we figure out what emotions the contexts represent? To do this, we will use a database provided by [SenticNet](https://sentic.net/) that maps contexts to emotional scores. The scores represent the positivity (or negativity) of a context: a score above zero marks a positive context, and anything else is treated as negative. We will then use how often each score appears in an author's known writing to estimate the probability that the same author produced the scores observed across an entire unknown text.

## Step 1: Creating a dictionary of all contexts from SenticNet

First, we create a dictionary that maps all contexts from SenticNet to their corresponding emotional scores. We do this so we can later easily access them when extracting a text's contexts.

In [87]:
import math
contexts_filename = 'senticnet5.txt'
contexts_dict = {}

with open(contexts_filename) as f:
    for line in f:
        info = line.split()
        contexts_dict[info[0]] = info[2]

## Step 2: Extracting contexts from a file

In order to analyze the tone of documents, we first have to extract all of the contexts from the documents in question. To do so, we will use n-grams. According to Lexalytics,
> N-grams are combinations of one or more words that represent entities, phrases, concepts, and themes that appear in text.

One benefit of using n-grams is that they do not rely on deep learning libraries and are therefore simple to implement and use. The main downside of this method is that it only captures contiguous phrases. In other words, it cannot capture contexts whose words are split up. For example, if we are working with the phrase "sad, old man", the context "sad man" will not be captured. Typically, n-grams also include a lot of "noise," since many phrases are made up of words with little meaning, such as "with that." However, our implementation reduces this noise, since we only keep contexts that appear in the SenticNet library.
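The contiguity limitation can be seen in a minimal sketch. The helper `contiguous_ngrams` is a hypothetical name introduced here for illustration. It joins words with underscores to mirror the way context keys are built in this notebook.

```python
def contiguous_ngrams(words, max_n):
    """Return every contiguous n-gram (joined with '_') for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append("_".join(words[i:i + n]))
    return grams

words = ["sad", "old", "man"]
grams = contiguous_ngrams(words, max_n=3)

print(grams)
print("sad_man" in grams)  # the split-up context is never generated
```

Because "sad" and "man" are not adjacent, no value of `max_n` will produce `sad_man`.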

In [88]:
def getContextsFromFile(filename):
    text_list = []
    with open(filename) as f:
        text_list = f.read().split()
    preprocessed_text_list = list(map(lambda word: word.lower().strip('!:;,'), text_list))
    contexts = []
    for i in range(len(preprocessed_text_list)):
        num_words_considering = 1
        while True:
            # Break out if the n-gram would run past the end of the text.
            if i + num_words_considering > len(preprocessed_text_list):
                break
            # Join the next num_words_considering words with underscores.
            potential_context = '_'.join(preprocessed_text_list[i:i + num_words_considering])
            if potential_context in contexts_dict:
                contexts.append((potential_context, float(contexts_dict[potential_context])))
                num_words_considering += 1
            else:
                break
    return contexts

madisonContexts = getContextsFromFile('madison.txt')
hamiltonContexts = getContextsFromFile('hamilton.txt')
unknownContexts = getContextsFromFile('unknown.txt')

## Aside: Getting Overall Tone Score
One thing we can do with our contexts is find an overall tone score for a document. This is done by totaling all of the scores for each context in the document.

In [89]:
def findOverallMood(contexts):
    total = 0
    for context, score in contexts:
        total += score
    return total


madison_score = findOverallMood(madisonContexts)
hamilton_score = findOverallMood(hamiltonContexts)
unknown_score = findOverallMood(unknownContexts)

print('Madison overall mood: ' + str(madison_score))
print('Hamilton overall mood: ' + str(hamilton_score))
print('Unknown doc. overall mood: ' + str(unknown_score))

Madison overall mood: 630.9989999999962
Hamilton overall mood: 536.512
Unknown doc. overall mood: 487.19800000000083


From here, we may be tempted to use our results as a form of author detection. However, I strongly warn against this. The overall tone score only captures the general sentiment (positive or negative) of the document as a whole, and because it is a plain sum it also grows with the number of contexts in the document. This takes away all nuances of particular emotional expression. For example, a writer who uses both extremely positive and extremely negative language can get the same score as a writer who uses neutral language throughout. In order to find the more likely author, we will instead use the probability of each writer expressing each exact emotion.
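A small sketch with hypothetical scores (invented for illustration, not taken from any of the documents) shows how two very different writers can share the same total:

```python
from collections import Counter

# Hypothetical context scores, invented for illustration.
polarized_writer = [0.9, -0.9, 0.8, -0.8]
neutral_writer = [0.0, 0.0, 0.0, 0.0]

# The overall tone score cannot tell these two writers apart...
print(sum(polarized_writer) == sum(neutral_writer))

# ...but the counts of individual scores are clearly different.
print(Counter(polarized_writer))
print(Counter(neutral_writer))
```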

## Step 3: Generating emotion probability lookups from known writings
Instead of looking at an overall emotion score, we will look at the count of how many times each emotion appears. Here, we are defining an emotion as the score corresponding to a context from the text. From here, we can create a probability lookup `emotionProbMap` that stores $P(emotion|writer)$. Then, we can implement `getEmotionProb(emotionProbMap, emotion)`.

In [90]:
EPSILON = 0.000001

def makeEmotionProbMap(contexts):
    emotionProbMap = {}
    num_contexts = 0
    for context, emotion in contexts:
        if emotion not in emotionProbMap:
            emotionProbMap[emotion] = 0
        emotionProbMap[emotion] += 1
        num_contexts += 1
    for emotion in emotionProbMap:
        emotionProbMap[emotion] /= num_contexts
    return emotionProbMap
        
madisonEmotionProb = makeEmotionProbMap(madisonContexts)
hamiltonEmotionProb = makeEmotionProbMap(hamiltonContexts)
    
def getEmotionProb(emotionProbMap, emotion):
    if emotion in emotionProbMap:
        return emotionProbMap[emotion]
    return EPSILON

print("P(.12|madison) = ", getEmotionProb(madisonEmotionProb, .12))
print("P(.12|hamilton) = ", getEmotionProb(hamiltonEmotionProb, .12))

P(.12|madison) =  0.0006257822277847309
P(.12|hamilton) =  0.0007479431563201197


## Step 4: Generating the emotion counts from the unknown document

Here, we map the emotions (context scores) to the number of times each appears in the unknown text.

In [93]:
def makeEmotionCountMap(contexts):
    emotionCountMap = {}
    for context, emotion in contexts:
        if emotion not in emotionCountMap:
            emotionCountMap[emotion] = 0
        emotionCountMap[emotion] += 1
    return emotionCountMap

unknownDocCount = makeEmotionCountMap(unknownContexts)
print('# unique emotions in unknown.txt:', len(unknownDocCount))
print('# of times .12 appears in unknown.txt:', unknownDocCount[.12])

# unique emotions in unknown.txt: 239
# of times .12 appears in unknown.txt: 3


## Step 5: Using Bayes' Theorem and Computing Log Probabilities

From Bayes' Theorem, we can say:

$$P(writer|unknownDoc) = \frac{P(unknownDoc|writer)P(writer)}{P(unknownDoc)}$$

Because we only compare the two authors, we work with the ratio of their posteriors. The denominator $P(unknownDoc)$ is the same for both and cancels, so only the numerators matter.
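Written out for the two candidate authors, the comparison looks like this:

$$\frac{P(Madison|unknownDoc)}{P(Hamilton|unknownDoc)} = \frac{P(unknownDoc|Madison)P(Madison)}{P(unknownDoc|Hamilton)P(Hamilton)}$$

The code in this notebook never uses a prior, which amounts to assuming $P(Madison) = P(Hamilton)$. Under that assumption the ratio reduces to the likelihood ratio:

$$\frac{P(Madison|unknownDoc)}{P(Hamilton|unknownDoc)} = \frac{P(unknownDoc|Madison)}{P(unknownDoc|Hamilton)}$$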

We will model the distribution of emotion counts in the unknown document (conditioned on the writer) as a Multinomial random variable. Let $c_i$ be the number of appearances of emotion $i$ in the unknown document and $p_{M, i}$ the probability that Madison expresses emotion $i$. Then

$$P(unknownDoc|Madison) \propto \prod_{i=1}^{m} p_{M, i}^{c_i}$$

and likewise for Hamilton with $p_{H, i}$. The omitted multinomial coefficient depends only on the counts, so it is the same for both authors and cancels in the ratio.

Since this is a product of many very small probabilities, computing it directly would underflow to zero. To avoid this, we compare log probabilities instead:

$$\log P(unknownDoc|Madison) - \log P(unknownDoc|Hamilton) > 0 \implies \text{Madison wrote the document},$$

where $\log P(unknownDoc|Madison) = \sum_{i=1}^{m} c_i \log p_{M, i} + \text{const}$, and the constant is the same for both authors.
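A quick way to see why the logs are needed is to try both computations on hypothetical numbers (the probability and count below are invented for illustration). The direct product is too small for a floating-point number and underflows to zero, while the sum of logs stays an ordinary value.

```python
import math

# Hypothetical values: 2000 observed contexts, each with probability 0.001.
p = 0.001
count = 2000

direct_product = p ** count    # too small for a float, underflows to 0.0
log_sum = count * math.log(p)  # remains a finite, usable number

print(direct_product)
print(log_sum)
```

Once both authors' products underflow to zero they can no longer be compared, whereas their log sums can still be subtracted.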

In [92]:
def calcLogProbDoc(emotionProbMap, countMap):
    logprob = 0
    for emotion in countMap:
        c_i = countMap[emotion]
        p_i = getEmotionProb(emotionProbMap, emotion)
        logprob += c_i * math.log(p_i)
    return logprob


logpMadison = calcLogProbDoc(madisonEmotionProb, unknownDocCount)
logpHamilton = calcLogProbDoc(hamiltonEmotionProb, unknownDocCount)
print('log madison: \t\t',logpMadison)
print('log hamilton: \t\t', logpHamilton)
print('log madison/hamilton:\t',logpMadison - logpHamilton)


log madison: 		 -5960.401251604811
log hamilton: 		 -6171.8374006506065
log madison/hamilton:	 211.4361490457959


## Future Research
There are many areas in which this research can be further explored. First, we could consider more than the positivity/negativity of contexts. To do so, each context score could store a range of emotions, and two scores would be considered "the same" if their individual emotions are within some range of each other. Second, we could weight the different contexts, since some contexts seem to be more critical than others. For example, we could place more emphasis on the emotions of 2-grams than on those of 1-grams, as 2-grams generally hold more meaning. These are just a few suggestions, but there are many other areas of research and application that can be explored.
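As one possible starting point for the weighting idea, here is a rough sketch. It is not part of the analysis above: the function `weighted_log_prob`, the `weight_by_length` mapping, and the example weights are all hypothetical, and how to choose good weights is left open. It works on (context, score) pairs and weights each context's log probability by the number of words in the context.

```python
import math

def weighted_log_prob(contexts, emotion_prob, weight_by_length, epsilon=1e-6):
    """Sum of log P(emotion | writer), weighted by the context's word count.

    contexts: list of (context, emotion_score) pairs, words joined by '_'.
    emotion_prob: dict mapping emotion score -> probability for one writer.
    weight_by_length: dict mapping number of words -> weight (default 1.0).
    """
    total = 0.0
    for context, emotion in contexts:
        n_words = context.count("_") + 1
        weight = weight_by_length.get(n_words, 1.0)
        total += weight * math.log(emotion_prob.get(emotion, epsilon))
    return total

# Hypothetical data: one 1-gram and one 2-gram with the same score.
example_contexts = [("good", 0.5), ("very_good", 0.5)]
example_probs = {0.5: 0.5}
example_weights = {1: 1.0, 2: 2.0}  # 2-grams count twice as much

result = weighted_log_prob(example_contexts, example_probs, example_weights)
print(result)
```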