#Developing the Western Business Sentiment Dictionary - Report

##1. Introduction

To power an artificial intelligence program able to analyze vast amounts of text and assess the related sentiment, we needed human subjects to report their reactions to business-related textual excerpts. A total of 190 participants signed up for one of the available time slots to complete the survey at Ivey's Behavioural Lab. Participants went to the Lab at the scheduled time and, for 60 minutes (with an 8-minute break in between), assessed as many sentences as they could using the PyBossa crowdsourcing platform.

The main goal of this project was to create a reliable sentiment classifier for the business domain. Our working hypothesis was that, due to the complexity and rhetoric of the business world, current industry solutions for sentiment analysis would perform poorly. We used the renowned LIWC (Linguistic Inquiry and Word Count) as the de facto standard to evaluate these texts and compared its results with the assessments of our pool of participants.

This work was performed by the [CulturePlex Lab](http://www.cultureplex.ca/) with the contributions of Antonio Jiménez-Mavillard, Javier de la Rosa, Adriana Soto-Corominas, and Juan Luis Suárez. The code of this work can be found [here](http://nbviewer.ipython.org/github/mavillard/liwc/blob/master/comparison.ipynb).

##2. LIWC's results vs participants' results

###a) Dataset

The dataset consisted of 8996 sentences obtained by querying The New York Times API for specific terms (see [Appendix A](#Appendix-A)) and retrieving the sentences that contained those terms from articles published in 2013.

###b) Sentence scores

####Participants' assessment
The sentences were evaluated by 190 participants on a five-point scale: -2 (very negative), -1 (negative), 0 (neutral), +1 (positive), +2 (very positive). Not every participant evaluated all the sentences, but every sentence was evaluated at least three times (by three different participants).

The final score per sentence was the average of its individual evaluations. Therefore, every sentence got a final score between -2 and +2. Finally, this score was normalized to the range [-1, +1].
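The averaging and normalization step can be illustrated with a minimal sketch. The function name `participant_score` and the sample ratings are hypothetical, and the sketch assumes that the normalization is a plain division by 2, which the report does not state explicitly.

```python
from statistics import mean

def participant_score(ratings):
    """Average one sentence's ratings (each in -2..+2) and rescale to [-1, +1]."""
    # Assumption: the normalization is a plain division by 2.
    return mean(ratings) / 2

# Hypothetical ratings given to one sentence by three different participants.
print(participant_score([2, 1, 0]))  # 0.5
```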

####LIWC scores
In order to estimate the soundness of LIWC at analyzing the sentiment of our dataset, this software was set up with the following options:
- Metrics:
    - posemo: % of words that express a positive emotion
    - negemo: % of words that express a negative emotion
- Segment delimiter:
    - newline: the original file was split into individual lines, i.e., the analysis was performed sentence by sentence
- Tokens:
    - words: the words were the only type of token taken into account; numerals and punctuation marks were ignored

Upon completing the analysis, every sentence got a value for posemo, between 0 and 100 (%), and another value for negemo, also between 0 and 100 (%). These two metrics were used to calculate the final score per sentence by considering two options:
- Option 1) Subtraction:
    - score = posemo - negemo
- Option 2) Maximum:
    - if posemo > negemo, then score = posemo
    - if posemo < negemo, then score = -negemo
    - if posemo = negemo, then score = 0

In both cases, the final score for every sentence was between -100 and +100.
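The two scoring options can be written down directly. The following is a small sketch with hypothetical function names and made-up input values, not the code used in the analysis.

```python
def score_subtraction(posemo, negemo):
    """Option 1): difference between the positive and negative percentages."""
    return posemo - negemo

def score_maximum(posemo, negemo):
    """Option 2): keep only the dominant emotion, signed by its polarity."""
    if posemo > negemo:
        return posemo
    if posemo < negemo:
        return -negemo
    return 0.0

# Hypothetical sentence with 12.5% positive and 25% negative emotion words.
print(score_subtraction(12.5, 25.0))  # -12.5
print(score_maximum(12.5, 25.0))      # -25.0
```

For the same sentence, the maximum option always yields a score at least as far from 0 as the subtraction option does, because the weaker emotion is ignored instead of subtracted.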

In order to normalize, two different criteria were followed:
- Option a) Normalizing from the full range:
    - [-100, +100] -> [-1, +1]
- Option b) Normalizing from the maximum range observed. Let mpe be the maximum value observed for posemo and mne the maximum value observed for negemo among all the sentences, and let m be the larger of mpe and mne; then:
    - [-m, +m] -> [-1, +1]
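Both normalization criteria amount to a division by a constant. The sketch below uses hypothetical function names and made-up values; the guard for m = 0 is an added assumption for the degenerate case in which no sentence contains any emotion word.

```python
def normalize_full_range(scores):
    """Option a): map [-100, +100] onto [-1, +1]."""
    return [s / 100.0 for s in scores]

def normalize_observed_range(scores, posemo_values, negemo_values):
    """Option b): map [-m, +m] onto [-1, +1], with m = max(mpe, mne)."""
    mpe = max(posemo_values)  # maximum posemo observed among all sentences
    mne = max(negemo_values)  # maximum negemo observed among all sentences
    m = max(mpe, mne)
    if m == 0:
        # Assumption: with no emotion words anywhere, every score is neutral.
        return [0.0 for _ in scores]
    return [s / m for s in scores]

# Hypothetical LIWC output for three sentences and their "maximum" scores.
posemo = [0.0, 20.0, 10.0]
negemo = [25.0, 0.0, 10.0]
scores = [-25.0, 20.0, 0.0]

print(normalize_full_range(scores))                      # [-0.25, 0.2, 0.0]
print(normalize_observed_range(scores, posemo, negemo))  # [-1.0, 0.8, 0.0]
```

Option b) stretches the scores so that the most extreme sentence reaches -1 or +1, which is why it spreads LIWC's values over more of the normalized range than option a).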

###c) Tests

Note:
- norm_liwc = LIWC normalized score
- norm_score = Participants normalized score

####Histograms
The next images show the participants' frequency distribution for the 8996 sentences compared to LIWC's distribution under each of the four combinations of options (a or b, with 1 or 2). The normalized range ([-1, +1]) was split into 15 bins.

<div align="center">
    <figure>
        <img src="histogram_a1_sep.png"/>
        <figcaption>Fig.1 Participants (right) vs LIWC (left) with options a) full range 1) subtraction</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_a1_tog.png"/>
        <figcaption>Fig.2 Participants vs LIWC (overlaid) with options a) full range 1) subtraction</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_a2_sep.png"/>
        <figcaption>Fig.3 Participants (right) vs LIWC (left) with options a) full range 2) maximum</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_a2_tog.png"/>
        <figcaption>Fig.4 Participants vs LIWC (overlaid) with options a) full range 2) maximum</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_b1_sep.png"/>
        <figcaption>Fig.5 Participants (right) vs LIWC (left) with options b) maximum range observed 1) subtraction</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_b1_tog.png"/>
        <figcaption>Fig.6 Participants vs LIWC (overlaid) with options b) maximum range observed 1) subtraction</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_b2_sep.png"/>
        <figcaption>Fig.7 Participants (right) vs LIWC (left) with options b) maximum range observed 2) maximum</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="histogram_b2_tog.png"/>
        <figcaption>Fig.8 Participants vs LIWC (overlaid) with options b) maximum range observed 2) maximum</figcaption>
    </figure>
</div>
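The binning behind histograms of this kind can be sketched without any plotting library. The function below is a hypothetical illustration that counts normalized scores in 15 equal-width bins over [-1, +1]; how the original figures treated values on bin edges is not stated, so the edge handling here is an assumption.

```python
def histogram(values, bins=15, lo=-1.0, hi=1.0):
    """Count how many values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        # Assumption: the right edge (+1) belongs to the last bin, and
        # out-of-range values are clamped to the nearest bin.
        idx = min(max(idx, 0), bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical normalized scores: with 15 bins, 0 falls in the central bin.
counts = histogram([-1.0, 0.0, 0.0, 0.4, 1.0])
print(counts)
```

With an odd number of bins, the neutral score 0 sits in the middle of the central bin, so a bias towards 0 shows up as a single tall central bar.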

The previous histograms show how different LIWC's results are from the participants' results. LIWC is strongly biased towards 0 (it evaluates a large number of sentences as neutral). The second remarkable point is that LIWC rarely classifies a sentence as very negative or very positive, in contrast to the participants.

The rest of the analysis was done considering only option b) 2), the most favorable to LIWC, as it is the one that provides the best accuracy.

####Normality tests
Although the images above seem to show normal distributions, normality tests reveal the opposite. The following table shows the normality tests performed on both distributions.

<table>
    <caption>Tab.1 Normality tests</caption>
    <thead>
        <tr><th>name</th><th>distribution</th><th>p-value</th></tr>
    </thead>
    <tbody>
        <tr><td>Anderson-Darling</td><td>participants</td><td>0.0</td></tr>
        <tr><td>Anderson-Darling</td><td>LIWC</td><td>0.0</td></tr>
        <tr><td>Kolmogorov-Smirnov</td><td>participants</td><td>4.63e-121</td></tr>
        <tr><td>Kolmogorov-Smirnov</td><td>LIWC</td><td>0.0</td></tr>
        <tr><td>Pearson</td><td>participants</td><td>1.39e-26</td></tr>
        <tr><td>Pearson</td><td>LIWC</td><td>4.83e-181</td></tr>
    </tbody>
</table>

For every test, the p-value is below 0.05, i.e., the hypothesis of normality is rejected for both distributions. This is not a surprising result as "data from the human social world is often not normally distributed." (John Canning, "[Statistics for the Humanities](http://statisticsforhumanities.net/book/)", 2014)
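To give an idea of what such a test measures, the following dependency-free sketch computes the one-sample Kolmogorov-Smirnov statistic against a normal distribution fitted to the sample. It is only an illustration: it is not the library routine behind Tab.1, it does not compute a p-value, and the function name and sample data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_vs_normal(sample):
    """Largest gap between the sample's empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    # Assumption: the reference normal uses the sample's own mean and deviation.
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical sample concentrated at 0, similar in spirit to LIWC's scores.
spiky = [0.0] * 8 + [-1.0, 1.0]
print(ks_statistic_vs_normal(spiky))
```

The statistic lies between 0 and 1; the larger it is, the further the sample is from the fitted normal distribution.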

In order to quantify how different the two distributions are, further analyses (KDE, QQ plot, and the two-sample Kolmogorov-Smirnov test) were performed. The results are shown in the next subsections:

####Kernel density estimation

<div align="center">
    <figure>
        <img src="histogram_b2_den.png"/>
        <figcaption>Fig.9 KDE: participants' vs LIWC's kernel density estimation</figcaption>
    </figure>
</div>

Figure 9 shows a result very similar to the histograms: LIWC is strongly biased towards 0, whereas natural language tends to be biased towards positive values ("[Human language is biased towards happiness, say computational linguists](https://medium.com/the-physics-arxiv-blog/data-mining-reveals-how-human-language-is-biased-towards-happiness-773df682c4a7)", 2014). In addition, LIWC rarely classifies a sentence as very negative or very positive, in contrast to the participants.
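A kernel density estimate of this kind can be sketched in a few lines. The example below assumes a Gaussian kernel and a hand-picked bandwidth; the report does not state which kernel or bandwidth was used for Fig.9, and the function name and data are hypothetical.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a density function built from one Gaussian bump per data point."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density

# Hypothetical normalized scores, mostly neutral.
density = gaussian_kde([0.0, 0.0, 0.0, 0.5, -0.5], bandwidth=0.2)
print(density(0.0) > density(1.0))  # True: the mass is concentrated around 0
```

A smaller bandwidth produces a spikier curve and a larger one a smoother curve, so the visual strength of the peak at 0 depends partly on that choice.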

####Quantile-quantile plot

<div align="center">
    <figure>
        <img src="qqplot-students.png"/>
        <figcaption>Fig.10 QQPlot: participants' distribution (blue dots) against the normal distribution (red line)</figcaption>
    </figure>
</div>

<div align="center">
    <figure>
        <img src="qqplot-liwc.png"/>
        <figcaption>Fig.11 QQPlot: LIWC's distribution (blue dots) against the normal distribution (red line)</figcaption>
    </figure>
</div>

Figures 10 and 11 show how different the two distributions are: the participants' distribution lies very close to the normal distribution, while LIWC's lies very far from it.
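The points of a QQ plot can be computed as pairs of theoretical and observed quantiles. The sketch below is a hypothetical illustration that uses a normal distribution fitted to the sample and the plotting positions (i - 0.5) / n; both choices are assumptions, since the report does not describe how Figs. 10 and 11 were produced.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Pair each sorted observation with the matching normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    return [(fitted.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

# Hypothetical normalized scores.
for theoretical, observed in qq_points([-0.5, -0.1, 0.0, 0.2, 0.6]):
    print(round(theoretical, 3), observed)
```

If the sample were normally distributed, the pairs would fall close to the line where the theoretical and observed quantiles are equal.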

####Kolmogorov-Smirnov (two samples)
The two distributions were compared with the two-sample Kolmogorov-Smirnov test provided by scipy, which returned a p-value of 0.0, i.e., the participants' and LIWC's results come from different distributions.
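The statistic behind the two-sample test is the largest vertical distance between the two empirical distribution functions. The following dependency-free sketch computes that statistic only; it is not the scipy routine used in the analysis, it does not compute a p-value, and the function name and data are hypothetical.

```python
from bisect import bisect_right

def ks_two_sample_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    # The gap can only change at observed values, so checking them suffices.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)  # share of a that is <= x
        fb = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(fa - fb))
    return d

# Hypothetical scores: one sample stuck at 0, the other spread out.
print(ks_two_sample_statistic([0, 0, 0, 0], [-1, -0.5, 0.5, 1]))  # 0.5
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all.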

###d) Accuracy
The accuracy of LIWC's result with respect to the polarity assigned by the participants is:

<div>
    <table>
        <caption>LIWC accuracy</caption>
        <tr><td>Total sentences</td><td>8996</td></tr>
        <tr><td>Sentences evaluated with same polarity</td><td>3202</td></tr>
        <tr><td>Sentences evaluated with different polarity</td><td>5794</td></tr>
        <tr><td>LIWC accuracy</td><td>35.59%</td></tr>
    </table>
</div>

LIWC's accuracy (35.59%) is far below the minimum admissible value for any classifier (70%).
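The accuracy figure is the share of sentences for which both sources agree on polarity. The sketch below assumes that polarity is the sign of the normalized score (negative, neutral, or positive), which the report does not define explicitly; the function names and sample scores are hypothetical.

```python
def polarity(score):
    """Return -1, 0 or +1 according to the sign of a normalized score."""
    return (score > 0) - (score < 0)

def accuracy(liwc_scores, participant_scores):
    """Share of sentences for which both sources assign the same polarity."""
    matches = sum(
        polarity(l) == polarity(p)
        for l, p in zip(liwc_scores, participant_scores)
    )
    return matches / len(participant_scores)

# Hypothetical scores for four sentences: only the first and last agree.
print(accuracy([0.2, 0.0, 0.0, -0.4], [0.5, 0.3, -0.6, -0.1]))  # 0.5

# The reported figure follows from the counts in the table above.
print(round(3202 / 8996 * 100, 2))  # 35.59
```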

###e) Conclusions
A variety of tests were performed to compare how similar or different the polarities assigned to the sentences by the participants and by LIWC were. Both distributions are slightly biased towards a positive polarity (as the histograms and the density estimations show); however, there are far more important differences:
1. The histograms show that LIWC is strongly biased towards 0 (it evaluates a large number of sentences as neutral) and that it rarely classifies a sentence as very negative or very positive, in contrast to the participants.
2. The KDE reinforces the results obtained from the histograms.
3. The QQ plots show how different the two distributions are with respect to the normal distribution: the participants' distribution is very close to it, while LIWC's is very far from it.
4. Kolmogorov-Smirnov gives a definitive result: participants' and LIWC's results come from different distributions.
5. LIWC's accuracy (35.59%) is far below the minimum admissible value for any classifier (70%).

Therefore, the conclusion is that it is necessary to create a more suitable classifier for this case study.

##3. Demographic data

The next image shows information on demographics. As each participant evaluated a different subset of sentences, it is not possible to establish a direct correlation between demographic data (participants' country, gender, or age) and the sentences' polarity.

Figure 12 shows the participants' evaluations by country and gender.

<div align="center">
    <figure>
        <img src="demo.png"/>
        <figcaption>Fig.12 Participants' sentence evaluation by country and gender</figcaption>
    </figure>
</div>

##Appendix A