# Contribution of Lexicon Expansion to Datasets of Lexicon Evaluation

This notebook investigates how many of the words added by the lexicon expansion actually occur in the datasets used for evaluation. If the expanded lexicon shows no performance increase over the base Sentida2, one possible explanation is that only a few of the added words appear in those datasets.
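The comparison comes down to a few set operations on the corpus vocabulary. The following is a minimal sketch of that logic, assuming toy data: the function name `lexicon_overlap` and the Danish example words are hypothetical and are not taken from the actual lexicons or datasets.

```python
def lexicon_overlap(lemmas, base_words, expanded_words):
    """Return the corpus lemmas found in the base lexicon, in the
    expanded lexicon, and only in the expansion."""
    vocab = set(lemmas)                      # unique lemmas in the corpus
    in_base = vocab & set(base_words)        # covered by the base lexicon
    in_expanded = vocab & set(expanded_words)  # covered by the expanded lexicon
    added = in_expanded - in_base            # covered only thanks to the expansion
    return in_base, in_expanded, added


# Toy data (hypothetical), standing in for the real lexicons and corpus
base_words = ["god", "dårlig", "glad"]
expanded_words = base_words + ["fantastisk", "elendig"]
corpus_lemmas = ["film", "være", "fantastisk", "og", "god", "god"]

in_base, in_expanded, added = lexicon_overlap(corpus_lemmas, base_words, expanded_words)
print(len(in_base), len(in_expanded), sorted(added))
```

The size of `added` is the number this notebook reports per dataset as the contribution of the expansion.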

### Dependencies

In [1]:
# basics
import os
import sys
import pandas as pd

# nlp
import spacy
nlp = spacy.load("da_core_news_lg")
from danlp.datasets import EuroparlSentiment1, LccSentiment, TwitterSent

# utils
sys.path.append(os.path.join("..", ".."))
from utils.twittertokens import setup_twitter

2022-01-02 12:13:04.720259: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-01-02 12:13:04.720296: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


### Lexicons

In [3]:
# base sentida2
sentida2 = pd.read_csv("../../lexicons/sentida2_lexicon.csv")
# expanded with neural network, word2vec
expansion = pd.read_csv("../../output/sentiment_prediction/neural_network/word2vec/sentiment_predictions.csv")
# use the restricted sentiment predictions as the lexicon score
expansion = expansion.rename(columns={"restricted_sentiments": "score"})
expansion = expansion.drop(columns=["predicted_sentiment"])
# stack the base lexicon and the expansion into one lexicon
sentida2_expanded = pd.concat([sentida2, expansion])

In [41]:
len(sentida2)

6592

In [4]:
# get words of lexica
sentida2_words = sentida2.word.tolist()
sentida2_expanded_words = sentida2_expanded.word.tolist()
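Since `pd.concat` simply stacks the two frames, any word present in both the base lexicon and the expansion would appear twice in the combined lexicon. The set operations used in this notebook are not affected by such duplicates, but lookups of scores could be. A minimal sketch of a duplicate check follows, assuming toy frames whose words and scores are hypothetical and not taken from the actual files.

```python
import pandas as pd

# Toy lexicons (hypothetical words and scores)
base = pd.DataFrame({"word": ["god", "dårlig"], "score": [2.0, -2.0]})
extra = pd.DataFrame({"word": ["fantastisk", "god"], "score": [2.5, 1.0]})

combined = pd.concat([base, extra], ignore_index=True)

# count words that occur more than once after stacking
n_dupes = int(combined.word.duplicated().sum())

# keep the first occurrence, so the base lexicon score takes precedence
deduped = combined.drop_duplicates(subset="word", keep="first")

print("duplicated words:", n_dupes)
print("lexicon size after deduplication:", len(deduped))
```

Keeping the first occurrence is one possible design choice; it lets the base score win over the predicted one.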

### Lexicon Evaluation Datasets

In [8]:
# europarl and lcc
euro1 = EuroparlSentiment1().load_with_pandas()
lcc = LccSentiment().load_with_pandas()

# twitter
setup_twitter()
twitter_val, _ = TwitterSent().load_with_pandas()

Downloading file /tmp/tmpo3pdppez


100% |########################################################################|


### EuroparlSentiment1

In [9]:
# lemmatize every text and collect all lemmas in one flat list
lemma_list = []
for doc in nlp.pipe(euro1.text):
    lemma_list.extend(token.lemma_ for token in doc)

In [17]:
# total number of lemmas in corpus
print("Total:", len(lemma_list))
# number of unique lemmas in corpus
print("Unique:", len(set(lemma_list)))

Total: 3359
Unique: 945


In [23]:
# overlap with base sentida2
in_sentida2 = set(lemma_list) & set(sentida2_words)
print("Unique words in sentida2 and europarl dataset:", len(in_sentida2))

Unique words in sentida2 and europarl dataset: 413


In [25]:
# overlap with expanded sentida2
in_sentida2_expanded = set(lemma_list) & set(sentida2_expanded_words)
print("Unique words in sentida2 expanded and europarl dataset:", len(in_sentida2_expanded))

Unique words in sentida2 expanded and europarl dataset: 451


In [28]:
# contribution of expansion
len(in_sentida2_expanded - in_sentida2)

38
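The counts above are over unique lemmas, so a frequent lemma and a rare one weigh the same. Sentiment scoring is driven by token occurrences, so a frequency-weighted view may be a useful complement. The following is a minimal sketch of such a measure, assuming toy data: the helper `token_coverage` and the example words are hypothetical and are not part of the original analysis.

```python
from collections import Counter


def token_coverage(lemmas, lexicon_words):
    """Share of lemma occurrences (tokens) that are found in the lexicon."""
    counts = Counter(lemmas)
    lexicon = set(lexicon_words)
    covered = sum(n for lemma, n in counts.items() if lemma in lexicon)
    total = sum(counts.values())
    return covered / total if total else 0.0


# Toy data (hypothetical)
toy_lemmas = ["god", "god", "film", "fantastisk"]
toy_base = ["god"]
toy_expanded = ["god", "fantastisk"]

base_cov = token_coverage(toy_lemmas, toy_base)
expanded_cov = token_coverage(toy_lemmas, toy_expanded)
print(base_cov, expanded_cov)
```

With the notebook's variables, the same function could be called as `token_coverage(lemma_list, sentida2_words)` and `token_coverage(lemma_list, sentida2_expanded_words)`.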

### LccSentiment

In [29]:
# lemmatize every text and collect all lemmas in one flat list
lemma_list = []
for doc in nlp.pipe(lcc.text):
    lemma_list.extend(token.lemma_ for token in doc)

In [30]:
# total number of lemmas in corpus
print("Total:", len(lemma_list))
# number of unique lemmas in corpus
print("Unique:", len(set(lemma_list)))

Total: 10584
Unique: 3530


In [31]:
# overlap with base sentida2
in_sentida2 = (set(lemma_list) & set(sentida2_words))
print("Unique words in sentida2 and lcc dataset:", len(in_sentida2))

Unique words in sentida2 and lcc dataset: 1016


In [32]:
# overlap with expanded sentida2
in_sentida2_expanded = set(lemma_list) & set(sentida2_expanded_words)
print("Unique words in sentida2 expanded and lcc dataset:", len(in_sentida2_expanded))

Unique words in sentida2 expanded and lcc dataset: 1192


In [33]:
# contribution of expansion
len(in_sentida2_expanded - in_sentida2)

176

### Twitter Sentiment

In [35]:
# lemmatize every text and collect all lemmas in one flat list
lemma_list = []
for doc in nlp.pipe(twitter_val.text):
    lemma_list.extend(token.lemma_ for token in doc)

In [36]:
# total number of lemmas in corpus
print("Total:", len(lemma_list))
# number of unique lemmas in corpus
print("Unique:", len(set(lemma_list)))

Total: 19001
Unique: 4724


In [38]:
# overlap with base sentida2
in_sentida2 = (set(lemma_list) & set(sentida2_words))
print("Unique words in sentida2 and twitter dataset:", len(in_sentida2))

Unique words in sentida2 and twitter dataset: 1354


In [39]:
# overlap with expanded sentida2
in_sentida2_expanded = set(lemma_list) & set(sentida2_expanded_words)
print("Unique words in sentida2 expanded and twitter dataset:", len(in_sentida2_expanded))

Unique words in sentida2 expanded and twitter dataset: 1557


In [40]:
# contribution of expansion
len(in_sentida2_expanded - in_sentida2)

203
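To compare the three datasets side by side, the counts printed in the cells above can be collected in one table. The sketch below copies those counts by hand; the column names and the derived share are illustrative choices, not part of the original analysis.

```python
import pandas as pd

# Counts copied from the cell outputs of this notebook
summary = pd.DataFrame({
    "dataset": ["EuroparlSentiment1", "LccSentiment", "TwitterSent"],
    "unique_lemmas": [945, 3530, 4724],
    "in_sentida2": [413, 1016, 1354],
    "in_expanded": [451, 1192, 1557],
})

# lemmas covered only because of the expansion
summary["added_by_expansion"] = summary.in_expanded - summary.in_sentida2

# the same number as a percentage of all unique lemmas in the dataset
summary["added_share_pct"] = (
    100 * summary.added_by_expansion / summary.unique_lemmas
).round(1)

print(summary.to_string(index=False))
```

In each dataset the expansion covers roughly 4 to 5 percent of the unique lemmas beyond what the base Sentida2 already covers.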