# Detecting Emotions in Headlines

## Exploratory Data Analysis

Author: Kelly Epley


In this notebook, I explore the label distributions and word frequencies in the dataset. In the process, I illustrate some of the challenges associated with doing sentiment and emotion analysis well.

### About the Data

Link to the data: http://web.eecs.umich.edu/~mihalcea/affectivetext/


### Organization

I used a custom function to organize my data into a corpus DataFrame and two sets of targets: emotions and valences. There is also a separate validation set. See the file named "get_labeled_dfs" for details.

The emotion target is a set of intensity ratings for six emotion categories on a scale of 0 to 100. The categories are: anger, disgust, fear, joy, sadness, and surprise. The valence target is a single valence rating between -100 and 100. 

I added a "label" column to each target DataFrame. The labels in "emotion_df" represent the emotion with the highest intensity rating for each headline. The labels in "valence_df" represent negative ratings between -100 and -15, positive ratings between 15 and 100, and low valence/neutral ratings in the middle.
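
The labeling rules above can be sketched as follows. This is a minimal illustration, not the code in "get_labeled_dfs": the function names are hypothetical, the integer codes are inferred from the rows displayed below (emotions in alphabetical order; 0 = negative, 1 = positive, 2 = neutral), and treating the -15 and 15 boundaries as inclusive is an assumption.

```python
import pandas as pd

EMOTIONS = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]


def add_emotion_labels(ratings: pd.DataFrame) -> pd.DataFrame:
    """Add 'max' (name of the top-rated emotion) and 'label' (its integer code)."""
    out = ratings.copy()
    # idxmax returns the first column holding the row maximum, so ties go to
    # the alphabetically earlier emotion (an assumption about tie handling)
    out["max"] = out[EMOTIONS].idxmax(axis=1)
    out["label"] = out["max"].map({name: code for code, name in enumerate(EMOTIONS)})
    return out


def valence_label(score: int, threshold: int = 15) -> int:
    """Return 0 for negative, 1 for positive, 2 for neutral/low valence."""
    if score <= -threshold:  # inclusive boundary is an assumption
        return 0
    if score >= threshold:
        return 1
    return 2


# ratings copied from rows 0, 1, and 4 of the emotion_df preview
sample = pd.DataFrame(
    [[0, 0, 15, 38, 9, 11], [24, 26, 16, 13, 38, 5], [1, 0, 23, 8, 11, 8]],
    columns=EMOTIONS,
)
labeled = add_emotion_labels(sample)
valence_labels = [valence_label(v) for v in (32, -48, -6)]
print(labeled[["max", "label"]])
print(valence_labels)
```

On these sample rows the sketch reproduces the "max" and "label" values shown in the DataFrame previews.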


In [1]:
# import all necessary packages and custom functions, classes
import pandas as pd
import numpy as np

from nltk.util import bigrams

from wordcloud import WordCloud
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

from get_labeled_dfs import *
from process_text import *
from get_emotion_wordcount import *

In [2]:
# use custom function to get the corpus df, label dfs, and validation dfs
corpus_df, val_corpus_df, emotion_df, val_emotion_df, valence_df, val_valence_df = get_labeled_dfs()


In [6]:
corpus_df.head()
# corpus_df.to_csv('corpus_df.csv')

Unnamed: 0,text
0,Test to predict breast cancer relapse is approved
1,"Two Hussein allies are hanged, Iraqi official ..."
2,Sights and sounds from CES
3,Schuey sees Ferrari unveil new car
4,Closings and cancellations top advice on flu o...


In [7]:
valence_df.head()


Unnamed: 0,valence,label
0,32,1
1,-48,0
2,26,1
3,40,1
4,-6,2


In [8]:
emotion_df.head()


Unnamed: 0,anger,disgust,fear,joy,sadness,surprise,max,label
0,0,0,15,38,9,11,joy,3
1,24,26,16,13,38,5,sadness,4
2,0,0,0,17,0,4,joy,3
3,0,0,0,46,0,31,joy,3
4,1,0,23,8,11,8,fear,2


## Presence of Emotion Words

One obvious strategy for detecting emotions in text is to look for the presence of emotion words such as "happy," "sad," and "afraid."

To explore this strategy, I used a list of emotion words from WordNet ([described here](http://wndomains.fbk.eu/wnaffect.html)) and counted the instances of these words in each headline to see whether the words in the list correspond with their emotion ratings.

Here are a few examples of what I found:

* Sometimes the words accurately track a headline's emotion rating: The words "tirade" and "outrage" from the anger word list appear in the headline "Israeli woman's tirade spurs PM outrage," which received a score of 71 for anger.

* Sometimes the words do not accurately track a headline's emotion rating: The word "aggressive" from the anger list appears in the headline "Bigger, more aggressive rats infesting UK," which received only a 4 for anger. Its highest scores were 56 for fear and 54 for disgust.

* Sometimes words appearing in the emotion word lists have alternative meanings that are neutral or have different emotional import: The word "offensive" from the disgust list appears in the headline "US to urge Nato Afghan spring offensive." In this headline, "offensive" means an action taken against an opponent. It likely appears in the disgust list for its other meaning: causing someone to feel insulted by a slight or a breach of social expectations. The headline received its highest rating, 54, for fear.

This illustrates the fact that the emotions expressed by a text are highly contextual. 

The lists' authors also caution: "All words can potentially convey affective meaning."

Notably, 904 headlines in the dataset (the count computed below) contain no words from the WordNet-Affect emotion word lists, and yet most of these are rated as having detectable emotions and emotional valence.
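
The word-list lookup described above can be sketched as follows. This is only an illustration: `TOY_EMOTION_WORDS` and `count_emotion_words` are hypothetical names, the toy lists contain only the four words mentioned in the examples above, and the regular-expression tokenizer is an assumption rather than the processing used in "get_emotion_wordcount".

```python
import re

# Toy word lists for illustration only; the real lists are far larger
TOY_EMOTION_WORDS = {
    "anger": {"tirade", "outrage", "aggressive"},
    "disgust": {"offensive"},
}


def count_emotion_words(headline, word_lists=TOY_EMOTION_WORDS):
    """Count how many tokens of a headline appear in each emotion word list."""
    # lowercase the headline and keep alphabetic runs as tokens
    tokens = re.findall(r"[a-z]+", headline.lower())
    return {
        emotion: sum(token in words for token in tokens)
        for emotion, words in word_lists.items()
    }


tirade_counts = count_emotion_words("Israeli woman's tirade spurs PM outrage")
rats_counts = count_emotion_words("Bigger, more aggressive rats infesting UK")
no_match_counts = count_emotion_words("Sights and sounds from CES")
print(tirade_counts, rats_counts, no_match_counts)
```

A plain lookup like this counts "aggressive" toward anger no matter how the headline was actually rated, which is the limitation the examples above illustrate.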


The authors also draw a distinction between "direct" and "indirect" affective words.

In [9]:
# get df containing counts of emotion-list words appearing in each headline
emotion_wordcount = get_emotion_wordcount()

In [10]:
emotion_wordcount.loc[emotion_wordcount['anger_count']>0]['text'].head()

216      Israeli woman's tirade spurs PM outrage
267    Bigger, more aggressive rats infesting UK
648                  One search does not fit all
759            'WarioWare,' Wii make perfect fit
987            Roddick, Murray score in San Jose
Name: text, dtype: object

In [13]:
emotion_wordcount.loc[emotion_wordcount['disgust_count']>0]['text'].head()

109    US to urge Nato Afghan spring offensive
Name: text, dtype: object

In [23]:
emotion_df.iloc[109]

anger         42
disgust       14
fear          54
joy            0
sadness       33
surprise       0
max         fear
label          2
Name: 109, dtype: object

In [14]:
emotion_wordcount.loc[emotion_wordcount['fear_count']>0]['text'].head()

33        TBS to pay $2M fine for ad campaign bomb scare
140                      Firms on alert for letter bombs
234    Memo from Frankfurt: Germany relives 1970s ter...
292    Freed Muslim terror suspect says Britain is "p...
300    In rigorous test, talk therapy works for panic...
Name: text, dtype: object

In [15]:
emotion_wordcount.loc[emotion_wordcount['joy_count']>0]['text'].head()

0     Test to predict breast cancer relapse is approved
16                    Asian nations urge Myanmar reform
35                     Discovered boys bring shock, joy
43    Protesters end strike as Nepal PM concedes dem...
67              Google executive acts as goodwill envoy
Name: text, dtype: object

In [16]:
emotion_wordcount.loc[emotion_wordcount['sadness_count']>0]['text'].head()

31     Really?: The claim: the pill can make you put ...
89                Walters on Trump: 'Poor, pathetic man'
150                   BP CEO Browne to step down in June
191                 Snow brings travel misery to England
295                  Why gas follows oil up but not down
Name: text, dtype: object

In [17]:
emotion_wordcount.loc[emotion_wordcount['surprise_count']>0]['text'].head()

602              Thousands line up to get late flu shots
608    Whether to save cord blood can be puzzle for p...
672                'Grey's,' 'Betty,' 'Scrubs' get boost
714             Area should get 3-5 inches of snow today
Name: text, dtype: object

In [18]:
# count the headlines that contain no words from any of the emotion word lists
count_cols = ['anger_count', 'disgust_count', 'fear_count', 'joy_count', 'sadness_count', 'surprise_count']
len(emotion_wordcount.loc[(emotion_wordcount[count_cols] == 0).all(axis=1)])


904

## Label Distributions 

In [None]:
# use custom class method to process text
processor = Process_Text_Data()
processor.transform(corpus_df)


In [None]:
# create bar charts showing headline counts for each category
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
emotion_df['max'].value_counts().plot(kind='bar')
plt.title("Number of Headlines per Emotion Category")

plt.subplot(1,2,2)
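# sort by label code so each bar lines up with its tick name (0=negative, 1=positive, 2=neutral)
valence_df['label'].value_counts().sort_index().plot(kind='bar')
plt.xticks([0,1,2], ["negative", "positive", "neutral"])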
plt.title("Number of Headlines per Valence Category")


## Label Correlations

In [None]:
# correlations among emotion categories (exclude the non-numeric 'max' column and the 'label' column)
plt.figure(figsize=(12,10))
corr = emotion_df.iloc[:,:-2].corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True)

## Intensity Distributions

In [None]:
# boxplot showing the distribution of intensity scores for each emotion category
plt.figure(figsize=(12,10))
emotion_df.iloc[:,:-2].boxplot()

In [None]:
# note that there are no rows with rating 0 for all emotion categories
emotion_df.loc[(emotion_df.anger==0)&(emotion_df.disgust==0)&(emotion_df.fear==0)&(emotion_df.joy==0)&(emotion_df.sadness==0)& (emotion_df.surprise==0)]

In [None]:
# boxplots showing the distribution of valence scores for each valence label
plt.figure(figsize=(10,5))
plt.subplot(1,3,1)
valence_df.loc[valence_df['label']==0]['valence'].plot(kind='box')
plt.xticks([1], labels=["Negative"])
plt.subplot(1,3,2)
valence_df.loc[valence_df['label']==1]['valence'].plot(kind='box')
plt.xticks([1], labels=["Positive"])
plt.subplot(1,3,3)
valence_df.loc[valence_df['label']==2]['valence'].plot(kind='box')
plt.xticks([1], labels=["Neutral"])


In [None]:
# note that there are a few rows with rating 0 for valence
valence_df.loc[(valence_df.valence==0)]

## Word Frequencies

### Corpus Top Words

In [None]:
# make a dictionary with words as keys and word counts as values
word_count_dict = {}

voc = set()
for row in corpus_df['text']:
    for word in row.split():
        voc.add(word)
        
for word in voc:
    word_count_dict[word]=0
    
for row in corpus_df['text']:
    for word in row.split():
        word_count_dict[word]+=1
        
# make a df of word counts        
word_count_df = pd.DataFrame({"word": [key for key in word_count_dict.keys()], "count": [val for val in word_count_dict.values()]})
word_count_df.sort_values('count', ascending=False, inplace=True)
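
The same counts can be built more compactly with `collections.Counter`. The sketch below is an alternative, not the code used in this notebook. It runs on four headlines copied from the outputs above as a stand-in for `corpus_df['text']`, and `toy_word_count_df` is a hypothetical name chosen so that `word_count_df` is not overwritten.

```python
from collections import Counter

import pandas as pd

# stand-in for corpus_df['text']
headlines = pd.Series([
    "Test to predict breast cancer relapse is approved",
    "US to urge Nato Afghan spring offensive",
    "Thousands line up to get late flu shots",
    "Area should get 3-5 inches of snow today",
])

# count every whitespace-separated token across all headlines
word_counts = Counter(word for row in headlines for word in row.split())

# most_common() yields (word, count) pairs sorted by count, highest first
toy_word_count_df = pd.DataFrame(word_counts.most_common(), columns=["word", "count"])
print(toy_word_count_df.head())
```

Because `most_common()` already sorts by count, no separate `sort_values` call is needed.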

In [None]:
# plot the 30 most used words in the corpus
word_count_df[:30].plot(kind='barh', figsize=(10,15))
labels = [i for i in word_count_df[:30]['word']]
plt.yticks(ticks = range(30), labels = labels)
plt.title("Most Frequently Used Words")


### Corpus Top Bigrams

In [None]:
# make a dictionary with bigrams as keys and bigram counts as values
bigram_count_dict = {}

bigrams_set = set()
for row in corpus_df['text']:
    for bigram in list(bigrams(row.split())):
        bigrams_set.add(bigram)
        
        
for bigram in bigrams_set:
    bigram_count_dict[bigram]=0
    
for row in corpus_df['text']:
    for bigram in list(bigrams(row.split())):
        bigram_count_dict[bigram]+=1
        
# make a df of bigram counts        
bigram_count_df = pd.DataFrame({"bigram": [key for key in bigram_count_dict.keys()], "count": [val for val in bigram_count_dict.values()]})
bigram_count_df.sort_values('count', ascending=False, inplace=True)

In [None]:
# plot the 15 most used bigrams in the corpus
bigram_count_df[:15].plot(kind='barh', figsize=(8,10))
labels = [i for i in bigram_count_df[:15]['bigram']]
plt.yticks(ticks = range(15), labels = labels)
plt.title("Most Frequently Used Bigrams")

### Top Words Per Emotion Label

In [None]:
# draw a word cloud from the headlines assigned to each emotion label
for i in list(emotion_df.columns[0:-2]):
    
    indexes = []
    for index, val in enumerate(emotion_df['max']):
        if val==i:
            indexes.append(index)
    
    single_emotion = corpus_df['text'][indexes] 
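    # join the headlines into one string (the loop variable i is not reused here, so the title below stays correct)
    word_string = ' '.join([str(headline) for headline in single_emotion])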

    wordcloud = WordCloud(width=800, height=500, random_state=42).generate(word_string)
    plt.figure(figsize=(10, 7))
    plt.title(i)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()


### Top Words Per Valence Label

In [None]:
# draw a word cloud from the headlines assigned to each valence label
for i in [0,1,2]:
    
    indexes = []
    for index, val in enumerate(valence_df['label']):
        if val==i:
            indexes.append(index)
    
    single_valence = corpus_df['text'][indexes] 
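    # join the headlines into one string (the loop variable i is not reused here, so the title checks below stay correct)
    word_string = ' '.join([str(headline) for headline in single_valence])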

    wordcloud = WordCloud(width=800, height=500, random_state=42).generate(word_string)
    plt.figure(figsize=(10, 7))
    
    if i==0:
        plt.title("Negative")
    elif i==1:
        plt.title("Positive")
    else:
        plt.title("Neutral/Low Valence")
    
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis('off')
    plt.show()

## Length of Headlines

In [None]:
# compute the length of each headline in words
lengths = []
for i in corpus_df['text']:
    lengths.append(len(i.split()))
    
print("The average headline length is {0} words.".format(np.round(sum(lengths)/len(lengths), 2)))
print("The shortest headline length is {0} words.".format(min(lengths)))
print("The longest headline length is {0} words.".format(max(lengths)))