In [None]:
import matplotlib.pyplot as plt
from nltk.sentiment import vader
import numpy as np



## Data preparation

Our real data consists of news article abstracts, so we apply the following cleaning steps:

1) convert the entire sentence to lowercase
2) handle negations: we use WordNet, which lists antonyms for many words. Whenever the word "not" or "n't" is encountered, we replace the next word with its antonym. For example, "not good" becomes "bad". This lets us remove stopwords afterwards without changing the sentiment of the sentence.
3) remove punctuation
4) remove stopwords
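
The four steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: `ANTONYMS` and `STOPWORDS` are small hand-written stand-ins (hypothetical) for the WordNet antonym lookup and a full stopword list, and `clean_sentence` is a name chosen for this example. Words without a known antonym are left unchanged, which is an assumption.

```python
import string

# Hypothetical stand-ins: the real pipeline looks antonyms up in WordNet and
# uses a full stopword list. These tiny tables keep the sketch self-contained.
ANTONYMS = {"good": "bad", "happy": "unhappy", "like": "dislike"}
STOPWORDS = {"the", "a", "an", "is", "was", "this", "i", "do", "did", "it"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    # 1) lowercase
    text = sentence.lower()

    # 2) handle negations: normalise "n't" to a separate "not" token, then
    #    replace "not <word>" with the antonym of <word> when one is known
    tokens = text.replace("n't", " not").split()
    negated = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "not" and i + 1 < len(tokens):
            following = tokens[i + 1].translate(_STRIP_PUNCT)
            if following in ANTONYMS:
                negated.append(ANTONYMS[following])
                i += 2
                continue
        negated.append(tokens[i])
        i += 1

    # 3) remove punctuation, dropping tokens that become empty
    stripped = [tok.translate(_STRIP_PUNCT) for tok in negated]
    stripped = [tok for tok in stripped if tok]

    # 4) remove stopwords
    return [tok for tok in stripped if tok not in STOPWORDS]


print(clean_sentence("This movie is not good."))  # ['movie', 'bad']
print(clean_sentence("I didn't like it!"))        # ['dislike']
```

Handling negation before stopword removal matters because "not" typically appears in stopword lists, so removing stopwords first would silently flip the meaning of "not good".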

## Single-embedding model

We'll start with a simple model which gives each word a sentiment score. The sentence sentiment will be determined by the *average* of the sentiment scores of the words in the sentence.

For this model, we used the lexicon of NLTK's VADER sentiment analyzer to get the sentiment score for each word.
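
A minimal sketch of the averaging rule, under stated assumptions: `toy_lexicon` holds made-up scores rather than real VADER values, `sentence_score` is a name chosen for this example, and words missing from the lexicon are counted as neutral (0.0). The text does not say how unknown words are treated, so that choice is only one possibility.

```python
def sentence_score(tokens, lexicon):
    """Average of per-word sentiment scores for a tokenised sentence."""
    if not tokens:
        return 0.0  # assumption: an empty sentence is neutral
    # assumption: words absent from the lexicon contribute a score of 0.0
    scores = [lexicon.get(tok, 0.0) for tok in tokens]
    return sum(scores) / len(scores)


# Made-up illustrative scores, not taken from the VADER lexicon.
toy_lexicon = {"good": 2.0, "bad": -2.0, "terrible": -3.0}

print(sentence_score(["good", "movie"], toy_lexicon))    # 1.0
print(sentence_score(["bad", "terrible"], toy_lexicon))  # -2.5
```

With the real data, the same function could be called with `sia.lexicon` in place of `toy_lexicon`.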

## Synthetic data generation

We generated synthetic data using the following process:
1) obtain a vocabulary of words from the lexicon of nltk's sentiment analyzer
2) clean that vocabulary so that it contains only alphabetic words, filtering out punctuation, numbers, etc.
3) generate sentences of random length from random words in the vocabulary
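
Step 3 can be sketched as below. The function name `generate_sentences`, the length bounds `min_len` and `max_len`, the `seed` parameter, and sampling with replacement are all assumptions made for illustration; the text does not specify them. `toy_vocab` is a stand-in for the cleaned VADER vocabulary.

```python
import random


def generate_sentences(vocab, n_sentences, min_len=3, max_len=10, seed=0):
    """Build random 'sentences' (lists of words) drawn from vocab."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    sentences = []
    for _ in range(n_sentences):
        length = rng.randint(min_len, max_len)  # both bounds inclusive
        # sample with replacement, so a word may repeat within a sentence
        sentences.append(rng.choices(vocab, k=length))
    return sentences


# Stand-in vocabulary; the notebook builds the real one from the lexicon.
toy_vocab = ["good", "bad", "happy", "sad", "great", "awful"]
for sentence in generate_sentences(toy_vocab, n_sentences=3):
    print(" ".join(sentence))
```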

In [None]:
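# Initialize VADER (requires the 'vader_lexicon' resource; if it is missing,
# run nltk.download('vader_lexicon') once beforehand)
sia = vader.SentimentIntensityAnalyzer()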

# build a vocabulary from the lexicon, excluding non-alphabetic tokens
vocab = sorted(token for token in sia.lexicon if token.isalpha())

# sentiment score of each vocabulary word, in the same order as vocab
values = np.array([sia.lexicon[word] for word in vocab])

# show a histogram of vocab sentiment scores
plt.hist(values, bins='auto')  # 'auto' automatically determines the number of bins
plt.title('Histogram of VADER Lexicon Values')
plt.xlabel('Sentiment score')
plt.ylabel('Frequency')
plt.show()