# Emotionality & sentiment assessment

## Assessing emotionality with SpaCy & NLTK

In this Jupyter notebook, I assess the emotionality of each article, as well as its negativity and positivity. I also include sentiment functions, although I ended up not using their results.
I use two different sentiment dictionaries so that the resulting measurements can be compared.   
The first dictionary is taken from the SentiWS corpus. The second is the Rauh sentiment dictionary which, unlike the SentiWS dictionary, accounts for negations (to an extent).

This notebook performs the following steps:   
   
1) Loading necessary packages
   
2) Loading the cleaned data
   
3) Assessing emotionality with the SentiWS corpus   
    - Creating lists of positive and negative words   
    - Creating functions to count positive and negative words   
    - Applying counting functions to get the overall number as well as the ratio of positive and negative words   
    - Calculating overall emotionality   
    
4) Computing overall article sentiment according to the SentiWS corpus    
    - Creating a sentiment function   
    - Applying the function to the dataframe   
    - Inspecting the data   
   
5) Assessing emotionality with the Rauh sentiment dictionary     
    - Creating lists of positive and negative words   
    - Creating functions to count positive and negative words   
    - Applying counting functions to get the overall number as well as the ratio of positive and negative words   
    - Calculating overall emotionality   
    
6) Computing overall article sentiment according to Rauh sentiment dictionary   
    - Creating a sentiment function that accounts for negations   
    - Applying the function to the dataframe   
    - Inspecting the data   
    
7) Saving the data with the newly created variables

### 1) Loading necessary packages

In [1]:
#import pandas
import pandas as pd
from pandas import read_excel
#import numpy
import numpy as np
#load SpaCy
import spacy
#import German language model
import de_core_news_md
#define nlp pipe
nlp = de_core_news_md.load()

### 2) Loading the cleaned data

In [2]:
df = read_excel("complete_data_cleaned.xlsx")

### 3) Assessing emotionality with the SentiWS corpus

#### a) Creating lists of positive and negative words

In [3]:
#read in the SentiWS text file with positive words
pos = pd.read_csv("SentiWS_v1.8c_Positive.txt",sep='\t', names=["word", "value", "forms"])
#split the "word|wordtype" entries into two separate columns
pos[['word','wordtype']] = pos.word.str.split("|",expand=True)
#create a list with all positive words
pos_list = pos["word"].tolist()

In [4]:
#same as above, this time for the negative words
neg = pd.read_csv("SentiWS_v1.8c_Negative.txt",sep='\t', names=["word", "value", "forms"])
neg[['word','wordtype']] = neg.word.str.split("|",expand=True)
neg_list = neg["word"].tolist()
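
To make the file layout concrete, here is a minimal sketch of how a SentiWS-style file is parsed. The two rows are made up for illustration (they are not copied from the corpus), and `pos_demo` and `pos_set_demo` are hypothetical names. The sketch also builds a `set` of words, which is an assumption on my part rather than what the notebook does: membership tests on a set do not have to scan the whole word list.

```python
import io
import pandas as pd

# Illustrative rows in the SentiWS layout: word|POS <tab> value <tab> inflected forms.
# The words and values are invented for this sketch.
sample = "Freude|NN\t0.5\tFreuden\ngut|ADJX\t0.4\tgute,guter,gutes\n"

pos_demo = pd.read_csv(io.StringIO(sample), sep="\t", names=["word", "value", "forms"])
# Split "Freude|NN" into the bare word and its part-of-speech tag
pos_demo[["word", "wordtype"]] = pos_demo["word"].str.split("|", expand=True)

# A set supports fast membership tests; a list is searched element by element
pos_set_demo = set(pos_demo["word"])
print(pos_demo)
print("gut" in pos_set_demo)
```

Only the base form in the `word` column is used for matching here, as in the cells above; the inflected forms in the `forms` column are read in but not used.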

#### b) Creating functions to count positive and negative words

In [5]:
#function to detect positive words
def count_positivewords(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    poscount = 0
    for w in tokens:
        if w in pos_list:
            poscount+=1
    return poscount

#function to detect negative words
def count_negativewords(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    negcount = 0
    for w in tokens:
        if w in neg_list:
            negcount+=1
    return negcount
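
The counting logic itself does not depend on spaCy. The sketch below shows the same idea on a plain token list, with one generic helper instead of two near-identical functions. `count_matches` is a hypothetical name, the word lists are invented, and splitting on whitespace is an assumption that stands in for the spaCy tokeniser.

```python
def count_matches(tokens, vocabulary):
    """Count how many tokens appear in the vocabulary (repeats count each time)."""
    vocab = set(vocabulary)  # set lookup avoids scanning the full word list per token
    return sum(1 for tok in tokens if tok in vocab)

# Whitespace splitting stands in for the spaCy tokeniser in this sketch
tokens = "das wetter ist gut und die stimmung ist gut".split()

n_pos = count_matches(tokens, ["gut", "schön"])
n_neg = count_matches(tokens, ["schlecht"])
print(n_pos, n_neg)
```

As in the notebook's functions, a word that occurs twice in the text is counted twice.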

#### c) Applying counting functions to get the overall number as well as the ratio of positive and negative words

In [6]:
#apply functions in order to...

#... create a column with the number of positive words
df["positive words"] = [count_positivewords(text) for text in df["clean text"]]

#... create a column with the number of negative words
df["negative words"] = [count_negativewords(text) for text in df["clean text"]]

In [7]:
#creating a new column for the negativity ratio (share of all words that are negative)
df["negativity ratio"] = df["negative words"]/df["words in clean text"]
#creating a new column for the positivity ratio (share of all words that are positive)
df["positivity ratio"] = df["positive words"]/df["words in clean text"]

#### d) Use the number of positive and negative words per article to calculate overall emotionality

In [8]:
#creating a new column for overall emotionality (number of words that are positive or negative)
df["emotionality"] = (df["negative words"]+df["positive words"])
#creating a new column for the overall emotionality ratio (share of all words that are positive or negative)
df["emotionality ratio"] = (df["negative words"]+df["positive words"])/df["words in clean text"]
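
As a worked example of these formulas, the snippet below recomputes the values for the first article in the `df.head(3)` output (8 positive words, 6 negative words, 224 words in the clean text). The variable names are only for illustration.

```python
# Counts taken from the first row of the df.head(3) output
positive_words = 8
negative_words = 6
words_in_clean_text = 224

negativity_ratio = negative_words / words_in_clean_text
positivity_ratio = positive_words / words_in_clean_text
emotionality = negative_words + positive_words
emotionality_ratio = emotionality / words_in_clean_text

print(round(negativity_ratio, 6))  # 0.026786
print(round(positivity_ratio, 6))  # 0.035714
print(emotionality)                # 14
print(emotionality_ratio)          # 0.0625
```

The ratios are fractions between 0 and 1, not percentages, and the emotionality ratio is the sum of the negativity and positivity ratios.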

In [9]:
#inspect data
df.head(3)

Unnamed: 0.1,Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,...,clean text,words in clean text,reach_dummy,modality_dummy,positive words,negative words,negativity ratio,positivity ratio,emotionality,emotionality ratio
0,6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,367,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,...,"das wichtig coronavirus . berufstat mutt vat ,...",224,1,0,8,6,0.026786,0.035714,14,0.0625
1,8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,200,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",...,"livestream stadtrat , klingt transparent erstr...",104,1,0,2,0,0.0,0.019231,2,0.019231
2,24,100024,aachener zeitung (www),2020-05-28T03:01:52,512,Politik,,Länder planen Öffnung: Streit über Schulen und...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der Streit über die Wiederöffnung von Schulen ...,...,der streit wiederoffn schul kindergart kris ve...,318,0,0,11,6,0.018868,0.034591,17,0.053459


### 4) Computing overall article sentiment according to the SentiWS corpus
    
#### a) Creating a sentiment function

In [None]:
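def compute_sentiment(text):
    #tokenise the text
    doc = nlp(text)
    tokens = [token.text for token in doc]
    #keep only the tokens that appear in one of the two SentiWS word lists
    word_list = [e for e in tokens if e in pos_list or e in neg_list]
    #look up the polarity values of the matched words
    #note: isin() matches each distinct word once, so repeated occurrences in the text do not add to the score
    df1 = pos[pos['word'].isin(word_list)]
    df2 = neg[neg['word'].isin(word_list)]
    df_final = pd.concat([df1, df2])
    #sum the polarity values to get the sentiment score
    sentimentscore = df_final["value"].sum()
    return sentimentscore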
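
Because `compute_sentiment` looks words up with `isin()`, each distinct dictionary word contributes its value once, no matter how often it occurs in the article. The sketch below contrasts that with a frequency-weighted variant. This variant is not what the notebook computes; the function names and the polarity values are invented for illustration.

```python
def unique_word_score(tokens, polarity):
    """Each distinct dictionary word contributes once (mirrors the isin() lookup)."""
    return sum(polarity[w] for w in set(tokens) if w in polarity)

def frequency_weighted_score(tokens, polarity):
    """Every occurrence of a dictionary word contributes its polarity value."""
    return sum(polarity.get(tok, 0.0) for tok in tokens)

# Hypothetical polarity values, chosen to keep the arithmetic easy to follow
polarity = {"gut": 0.5, "schlecht": -0.25}
tokens = ["gut", "gut", "schlecht", "haus"]

print(unique_word_score(tokens, polarity))         # 0.5 - 0.25
print(frequency_weighted_score(tokens, polarity))  # 0.5 + 0.5 - 0.25
```

Which of the two is preferable depends on whether repeated use of a word should count as stronger sentiment.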

#### b) Applying the function to the dataframe

In [11]:
#first compute the overall sentiment score
df["sentimentscore"] = [compute_sentiment(text) for text in df["clean text"]]
#then calculate a ratio to account for article length
df["sentimentscore ratio"] = df["sentimentscore"]/df["words in clean text"]

#### c) Data overview

In [10]:
#overview of how many articles are in the dataset
df.groupby("Newspaper").count()

Newspaper,Unnamed: 0,ID,Date,Length,Category,Author,Headline,Teaser,Article,Modality,...,clean text,words in clean text,reach_dummy,modality_dummy,positive words,negative words,negativity ratio,positivity ratio,emotionality,emotionality ratio
Aachener Zeitung,970,970,963,970,970,0,970,0,970,970,...,970,970,970,970,970,970,970,970,970,970
Der Tagesspiegel,1286,1286,1281,1286,1286,397,1286,0,1286,1286,...,1286,1286,1286,1286,1286,1286,1286,1286,1286,1286
Die Welt,831,831,831,831,831,657,831,0,831,831,...,831,831,831,831,831,831,831,831,831,831
Rheinische Post,2375,2375,2365,2375,2375,1384,2375,0,2375,2375,...,2375,2375,2375,2375,2375,2375,2375,2375,2375,2375
Stuttgarter Zeitung,1237,1237,1237,1237,1237,1111,1237,0,1237,1237,...,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237
Süddeutsche Zeitung (inkl. Regionalausgaben),3720,3720,3708,3720,3720,3613,3720,0,3720,3720,...,3720,3720,3720,3720,3720,3720,3720,3720,3720,3720
aachener zeitung (www),168,168,168,168,168,0,168,168,168,168,...,168,168,168,168,168,168,168,168,168,168
der tagesspiegel (www),264,264,264,264,264,0,264,264,264,264,...,264,264,264,264,264,264,264,264,264,264
die welt (www),177,177,177,177,177,0,177,177,177,177,...,177,177,177,177,177,177,177,177,177,177
rheinische post (www),173,173,173,173,173,0,173,173,173,173,...,173,173,173,173,173,173,173,173,173,173


In [16]:
#overview of average values for the newly created columns
df.groupby("Newspaper").mean(numeric_only=True)

Newspaper,Unnamed: 0,ID,Length,words in clean text,reach_dummy,modality_dummy,positive words,negative words,negativity ratio,positivity ratio,emotionality,emotionality ratio,positive words rauh
Aachener Zeitung,492.615464,493.615464,484.008247,294.691753,0.0,1.0,6.880412,2.341237,0.009237,0.027133,9.221649,0.03637,16.123711
Der Tagesspiegel,10927.010886,10928.010886,574.946345,345.008554,1.0,1.0,9.789269,3.784603,0.010418,0.027761,13.573872,0.038179,23.438569
Die Welt,1407.464501,1408.464501,774.438026,458.566787,1.0,1.0,14.0,5.649819,0.011864,0.029807,19.649819,0.041671,32.687124
Rheinische Post,3486.029474,3487.029474,377.456,227.909895,0.0,1.0,6.619789,1.952,0.007886,0.028257,8.571789,0.036143,15.503158
Stuttgarter Zeitung,5749.415521,5750.415521,394.241714,236.194826,0.0,1.0,6.800323,2.523848,0.010065,0.027775,9.324171,0.037839,16.225546
Süddeutsche Zeitung (inkl. Regionalausgaben),8341.211828,8342.211828,529.366129,312.950806,1.0,1.0,9.069355,3.546774,0.010704,0.028514,12.616129,0.039217,21.301075
aachener zeitung (www),4215.535714,104215.535714,401.488095,247.35119,0.0,0.0,6.654762,2.839286,0.011044,0.027246,9.494048,0.03829,15.52381
der tagesspiegel (www),4147.094697,104147.094697,574.189394,340.988636,1.0,0.0,9.102273,4.329545,0.01272,0.025923,13.431818,0.038643,21.935606
die welt (www),3980.39548,103980.39548,578.751412,346.711864,1.0,0.0,9.548023,4.214689,0.011909,0.026389,13.762712,0.038298,22.19209
rheinische post (www),4260.479769,104260.479769,338.699422,206.323699,0.0,0.0,5.248555,2.427746,0.011159,0.024648,7.676301,0.035807,12.294798


### 5) Assessing emotionality with the Rauh sentiment dictionary

In [12]:
#read in the rauh dictionary
rauh = pd.read_csv("rauh_sentiment.csv")
#remove whitespaces
rauh["feature"] = rauh["feature"].str.strip()
#inspect the dictionary
rauh.head(3)

Unnamed: 0,feature,sentiment
0,aalen,1
1,aalglatt,-1
2,aasen,-1


In [13]:
#split the dictionary into a positive and a negative one
pos_rauh = rauh.loc[rauh["sentiment"] == 1]
neg_rauh = rauh.loc[rauh["sentiment"] == -1]

#create a list of positive words
pos_list_rauh = pos_rauh["feature"].tolist()
#and a list of negative words
neg_list_rauh = neg_rauh["feature"].tolist()

In [14]:
#function to detect positive words
def count_positivewords_rauh(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    poscount = 0
    for w in tokens:
        if w in pos_list_rauh:
            poscount+=1
    return poscount

#function to detect negative words
def count_negativewords_rauh(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    negcount = 0
    for w in tokens:
        if w in neg_list_rauh:
            negcount+=1
    return negcount

In [17]:
#... create a column with the number of positive words
df["positive words rauh"] = [count_positivewords_rauh(text) for text in df["clean text"]]

#... create a column with the number of negative words
df["negative words rauh"] = [count_negativewords_rauh(text) for text in df["clean text"]]

In [18]:
#inspect the df
df.head(3)

Unnamed: 0.1,Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,...,reach_dummy,modality_dummy,positive words,negative words,negativity ratio,positivity ratio,emotionality,emotionality ratio,positive words rauh,negative words rauh
0,6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,367,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,...,1,0,8,6,0.026786,0.035714,14,0.0625,16,16
1,8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,200,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",...,1,0,2,0,0.0,0.019231,2,0.019231,3,3
2,24,100024,aachener zeitung (www),2020-05-28T03:01:52,512,Politik,,Länder planen Öffnung: Streit über Schulen und...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der Streit über die Wiederöffnung von Schulen ...,...,0,0,11,6,0.018868,0.034591,17,0.053459,23,18


In [19]:
#creating a new column for overall emotionality (number of words that are positive or negative)
df["emotionality rauh"] = (df["negative words rauh"]+df["positive words rauh"])
#creating a new column for the overall emotionality ratio (share of all words that are positive or negative)
df["emotionality ratio rauh"] = (df["negative words rauh"]+df["positive words rauh"])/df["words in clean text"]
#creating a new column for the negativity ratio (share of all words that are negative)
df["negativity ratio rauh"] = df["negative words rauh"]/df["words in clean text"]
#creating a new column for the positivity ratio (share of all words that are positive)
df["positivity ratio rauh"] = df["positive words rauh"]/df["words in clean text"]

### 6) Computing overall sentiment with the Rauh sentiment dictionary   
   
#### a) Writing a function to compute sentiment

In [20]:
#Rauh designed an extra dictionary that also includes negations.
#However, those negations are always the same.
#I incorporated Rauh's approach to negations by creating a separate negation list and computing sentiment accordingly.

#create a list of negations
negations = ["nicht","nichts","kein","keine","keinen"]
#remove negations from negative list
neg_rauh = neg_rauh[~neg_rauh["feature"].isin(negations)]

def compute_sentiment_rauh(text):
    #tokenise the text
    doc = nlp(text)
    #create a list of sentences
    sentences = [sent.text for sent in doc.sents]
    #words that do not directly follow a negation
    wordlist = []
    #words that directly follow a negation
    negationlist = []
    #analyse a document sentence by sentence
    for sentence in sentences:
        #tokenise the sentence and save the tokens as a list
        tokens = [token.text for token in nlp(sentence)]
        #skip sentences without tokens
        if not tokens:
            continue
        #the first token of a sentence cannot follow a negation
        wordlist.append(tokens[0])
        #check whether each remaining token directly follows a negation
        for last_item, e in zip(tokens, tokens[1:]):
            if last_item in negations:
                negationlist.append(e)
            else:
                wordlist.append(e)
    #keep only the non-negated words that appear in the dictionary
    wordlist_new = [e for e in wordlist if e in pos_list_rauh or e in neg_list_rauh]
    #flip the sentiment value of every negated dictionary word
    sentimentscore_negations = 0
    for e in negationlist:
        if e in pos_list_rauh:
            sentimentscore_negations -= 1
        elif e in neg_list_rauh:
            sentimentscore_negations += 1
    #look up the sentiment values of the non-negated words
    #note: isin() matches each distinct word once, so repeated occurrences do not add to the score
    df1 = pos_rauh[pos_rauh['feature'].isin(wordlist_new)]
    df2 = neg_rauh[neg_rauh['feature'].isin(wordlist_new)]
    df_final = pd.concat([df1, df2])
    #compute sentimentscore without negations
    sentimentscore = df_final["sentiment"].sum()
    #add the two sentiment scores together for a final sentiment value
    sentimentscore_final = sentimentscore + sentimentscore_negations
    return sentimentscore_final
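
The core idea, flipping the sign of a dictionary word that directly follows a negation, can be shown on plain token lists. The sketch below is a simplified illustration and not the function above: it counts every occurrence of a word, skips the negation tokens themselves, and uses an invented two-word polarity dictionary. `negation_aware_score` and `NEGATIONS` are hypothetical names.

```python
NEGATIONS = {"nicht", "nichts", "kein", "keine", "keinen"}

def negation_aware_score(sentences, polarity):
    """Sum word polarities, flipping the sign of a word that directly follows a negation."""
    score = 0
    for tokens in sentences:
        negate = False  # negations never carry over across a sentence boundary
        for tok in tokens:
            if tok in NEGATIONS:
                negate = True
                continue
            value = polarity.get(tok, 0)
            score += -value if negate else value
            negate = False  # only the word right after the negation is flipped
    return score

# Hypothetical polarity values in the +1 / -1 style of the Rauh dictionary
polarity = {"gut": 1, "schlecht": -1}
sentences = [["das", "ist", "nicht", "gut"], ["das", "ist", "schlecht"]]

print(negation_aware_score(sentences, polarity))  # -1 (negated "gut") + -1 ("schlecht")
```

Like the notebook's function, this only looks at the word immediately after a negation, so a negation separated from its target by other words ("nicht besonders gut") is not captured.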

#### b) Applying the function to calculate both overall sentiment and sentiment ratio

In [None]:
#compute the sentiment
df["sentimentscore_rauh"] = [compute_sentiment_rauh(text) for text in df["clean text"]]
#calculate the sentiment ratio to account for length
df["sentimentscore_rauh_ratio"] = df["sentimentscore_rauh"]/df["words in clean text"]

#### c) Data overview

In [20]:
#inspect data
df.groupby("Newspaper").mean(numeric_only=True)

Newspaper,Unnamed: 0,ID,Length,words in clean text,reach_dummy,modality_dummy,positive words,negative words,negativity ratio,positivity ratio,emotionality,emotionality ratio,positive words rauh,negative words rauh,emotionality rauh,emotionality ratio rauh,negativity ratio rauh,positivity ratio rauh
Aachener Zeitung,492.615464,493.615464,484.008247,294.691753,0.0,1.0,6.880412,2.341237,0.009237,0.027133,9.221649,0.03637,16.123711,11.380412,27.504124,0.107716,0.044877,0.062839
Der Tagesspiegel,10927.010886,10928.010886,574.946345,345.008554,1.0,1.0,9.789269,3.784603,0.010418,0.027761,13.573872,0.038179,23.438569,17.987558,41.426128,0.116717,0.050711,0.066006
Die Welt,1407.464501,1408.464501,774.438026,458.566787,1.0,1.0,14.0,5.649819,0.011864,0.029807,19.649819,0.041671,32.687124,25.820698,58.507822,0.125908,0.056057,0.069851
Rheinische Post,3486.029474,3487.029474,377.456,227.909895,0.0,1.0,6.619789,1.952,0.007886,0.028257,8.571789,0.036143,15.503158,9.561263,25.064421,0.105901,0.039766,0.066135
Stuttgarter Zeitung,5749.415521,5750.415521,394.241714,236.194826,0.0,1.0,6.800323,2.523848,0.010065,0.027775,9.324171,0.037839,16.225546,11.839935,28.065481,0.11409,0.047511,0.06658
Süddeutsche Zeitung (inkl. Regionalausgaben),8341.211828,8342.211828,529.366129,312.950806,1.0,1.0,9.069355,3.546774,0.010704,0.028514,12.616129,0.039217,21.301075,15.922312,37.223387,0.115467,0.048909,0.066559
aachener zeitung (www),4215.535714,104215.535714,401.488095,247.35119,0.0,0.0,6.654762,2.839286,0.011044,0.027246,9.494048,0.03829,15.52381,14.327381,29.85119,0.118968,0.056615,0.062353
der tagesspiegel (www),4147.094697,104147.094697,574.189394,340.988636,1.0,0.0,9.102273,4.329545,0.01272,0.025923,13.431818,0.038643,21.935606,20.412879,42.348485,0.123896,0.060853,0.063043
die welt (www),3980.39548,103980.39548,578.751412,346.711864,1.0,0.0,9.548023,4.214689,0.011909,0.026389,13.762712,0.038298,22.19209,19.99435,42.186441,0.118053,0.057265,0.060788
rheinische post (www),4260.479769,104260.479769,338.699422,206.323699,0.0,0.0,5.248555,2.427746,0.011159,0.024648,7.676301,0.035807,12.294798,12.104046,24.398844,0.116459,0.058407,0.058051


### 7) Saving the data

In [21]:
df.to_excel("complete_data_cleaned_with_emotionality.xlsx")
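
The `Unnamed: 0` and `Unnamed: 0.1` columns in the outputs above most likely come from earlier save and load round trips: by default pandas writes the row index to the file, and on reading it back the index becomes an unnamed regular column. The sketch below demonstrates the effect with an in-memory CSV, which is an assumption made to keep the example self-contained; `to_excel` accepts the same `index=False` argument. `demo` and its values are only for illustration.

```python
import io
import pandas as pd

demo = pd.DataFrame({"ID": [100006, 100008], "emotionality": [14, 2]})

# Default behaviour: the index is written and comes back as "Unnamed: 0"
buf_with_index = io.StringIO()
demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# index=False: only the actual data columns are written
buf_without_index = io.StringIO()
demo.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without_index = list(pd.read_csv(buf_without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Passing `index=False` when saving would keep further `Unnamed` columns from accumulating in later notebooks.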