# Get Sentiment Data

We tried vaderSentiment (https://github.com/cjhutto/vaderSentiment) as one approach to generating sentiment data that could serve as features for our model.

In this notebook, we use vaderSentiment to generate a sentiment label for every post and store it in a `sentiment` column.

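For each sentiment label derived from the vaderSentiment compound score (-2, -1, 0, 1, 2), we collected up to 100 posts so that we could manually verify whether the results are acceptable. Note that the collection code keeps the last 100 posts encountered for each label rather than a random sample.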

Sentiment is inherently subjective. Unfortunately, a cursory look showed that the sampled posts (which this notebook writes to `sentiment_df.csv`) usually did not match the sentiments assigned to them.

In [2]:
import numpy as np
import pandas as pd
from collections import deque
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

Mounted at /content/gdrive


In [3]:
# Read data: load each CSV file and concatenate them into one DataFrame
files = [f"0{i}-updated.csv" for i in range(1, 7)]
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

In [4]:
# Score indication: -2: strongly negative, -1: negative, 0: neutral, 1: positive, 2: strongly positive
def sentiment_analyzer_scores(row):
    # VADER's compound score lies in [-1, 1]; scale it to [-2, 2]
    new_score = analyser.polarity_scores(row['p'])['compound'] * 2
    # Shift by 0.5 away from zero so that int() truncation rounds to the nearest label
    if new_score > 0:
        new_score += 0.5
    elif new_score < 0:
        new_score -= 0.5
    return int(new_score)
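
To make the mapping above easier to check, here is a minimal sketch of the same arithmetic applied directly to a compound score, without calling vaderSentiment. The helper name `compound_to_label` is hypothetical and only used for illustration.

```python
def compound_to_label(compound):
    """Map a compound score in [-1, 1] to an integer label in {-2, ..., 2}."""
    scaled = compound * 2
    if scaled > 0:
        scaled += 0.5
    elif scaled < 0:
        scaled -= 0.5
    return int(scaled)  # int() truncates toward zero


# Print the label assigned to a few representative compound scores.
for compound in (-1.0, -0.75, -0.3, -0.1, 0.0, 0.1, 0.3, 0.75, 1.0):
    print(compound, compound_to_label(compound))
```

With this arithmetic, a compound score whose absolute value is below 0.25 maps to 0, a value from 0.25 up to (but not including) 0.75 maps to ±1, and a value of 0.75 or more maps to ±2.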

In [5]:
df['sentiment'] = df.apply(sentiment_analyzer_scores, axis=1)

In [6]:
max_sample_len = 100

neg2 = deque(maxlen=max_sample_len)
neg1 = deque(maxlen=max_sample_len)
zero = deque(maxlen=max_sample_len)
pos1 = deque(maxlen=max_sample_len)
pos2 = deque(maxlen=max_sample_len)

for _, row in df.iterrows():
    if row['sentiment'] == -2:
        neg2.append(row['p'])
    elif row['sentiment'] == -1:
        neg1.append(row['p'])
    elif row['sentiment'] == 0:
        zero.append(row['p'])
    elif row['sentiment'] == 1:
        pos1.append(row['p'])
    elif row['sentiment'] == 2:
        pos2.append(row['p'])
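
Because each `deque` has `maxlen=100`, it discards its oldest entries and ends up holding the last 100 posts seen for each label. If a uniform random sample per label is preferred for manual review, one possible alternative is sketched below using only the standard library. The function name `sample_per_label` and the toy data are hypothetical; in the notebook, the inputs would be `df['p']` and `df['sentiment']`.

```python
import random
from collections import defaultdict


def sample_per_label(posts, labels, n=100, seed=0):
    """Return {label: up to n posts chosen uniformly at random}."""
    by_label = defaultdict(list)
    for post, label in zip(posts, labels):
        by_label[label].append(post)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        label: rng.sample(group, min(n, len(group)))
        for label, group in sorted(by_label.items())
    }


# Toy data standing in for df['p'] and df['sentiment'] (hypothetical values).
toy_posts = [f"post {i}" for i in range(10)]
toy_labels = [-2, -2, -2, 0, 0, 0, 0, 1, 2, 2]
samples = sample_per_label(toy_posts, toy_labels, n=2)
print({label: len(group) for label, group in samples.items()})
```

Labels with fewer than `n` posts simply contribute all of their posts, so the function does not fail on rare labels.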

In [7]:
deque_list = [neg2, neg1, zero, pos1, pos2]
len_list = [len(neg2), len(neg1), len(zero), len(pos1), len(pos2)]
print(len_list)
merged_deque = deque()
for i in deque_list:
  merged_deque += i

sentiment_df = pd.DataFrame(list(merged_deque), columns=["text"])

sentiment_list = []

count = 0
for i in range(-2, 3):
  sentiment_list += [i] * len_list[count]
  count += 1
sentiment_df['sentiment'] = sentiment_list

sentiment_df.to_csv("sentiment_df.csv")

[100, 100, 100, 100, 100]


## Next: verify the sentiment data, check whether it makes sense, and then use it as the ground truth for training
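
One way to make the manual check quantitative is to record a reviewer's label next to each vaderSentiment label and compute the agreement rate per label. The sketch below assumes the reviewer's labels are available as a list (for example, from a hypothetical `manual` column added to `sentiment_df.csv`); the function name `agreement_by_label` and the example values are illustrative only and are not results from our data.

```python
from collections import Counter


def agreement_by_label(vader_labels, manual_labels):
    """Return {vader_label: fraction of posts where the manual label agrees}."""
    total = Counter()
    agree = Counter()
    for predicted, manual in zip(vader_labels, manual_labels):
        total[predicted] += 1
        if predicted == manual:
            agree[predicted] += 1
    return {label: agree[label] / total[label] for label in sorted(total)}


# Hypothetical labels for six posts, for illustration only.
vader = [-2, -2, 0, 1, 2, 2]
manual = [-2, 0, 0, 0, 2, 1]
print(agreement_by_label(vader, manual))
```

A low agreement rate for a given label would suggest that the label should not be used as ground truth without further correction.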