# 3-4 Custom Sentiments

In this notebook we will explore how to take the "standard" sentiment scores that ship with NLTK and use them to tease out "hidden" sentiments in the Gab corpus. The notebook uses Real Python's [Sentiment Analysis: First Steps With Python's NLTK Library](https://realpython.com/python-nltk-sentiment-analysis/#using-nltks-pre-trained-sentiment-analyzer) as a starting point.

Note that the first import line pulls in three modules at once. PEP 8 recommends one import per line, so some do not consider this "pythonic," but it is legal Python and will not raise an error. I don't do this often, but for well-known libraries I sometimes put them all on the same line.

In [1]:
# IMPORTS
import re, numpy as np, random
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

In [2]:
# MPL IMPORT & SETTINGS
# import matplotlib.pyplot as plt
# plt.rcParams['figure.dpi'] = 300
# plt.rcParams["figure.figsize"] = (10,5)

In [3]:
# import nltk
# I only need to do this once, so I put it in a cell all its own
# and then comment it out once it's done so if/when I re-run this notebook,
# I do not re-download the lexicon.

# nltk.download("vader_lexicon")

In [4]:
# DATA
# Load the gabs:
with open("../queue/gabs.txt", "r") as f:
    gabs = f.readlines()

# How many do we have?
print(len(gabs))

70596


At some point in working through this notebook, I realized that there were enough posts that were simply links that I wanted to remove them. It may well be that attending to what people linked to in Gab will be something to look into later, but for now I want to focus only on what people write.

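The code below replaces anything that looks like an HTML tag (for example, the `<a href=...>` wrapper around a link) with a space. Note that it leaves the link text, and any bare URLs, in place.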

In [5]:
html = re.compile(r'<[^>]*>')
html_free = [ re.sub(html, " ", gab) for gab in gabs ]

<div class="alert alert-block alert-warning">
<b>NOTE:</b> This simple change shifted the numbers in the section below rather significantly. It had a large impact on the longest texts, perhaps suggesting that longer texts were made up of lots of links.
</div>
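To check what the pattern does and does not remove, here is a minimal sketch on two made-up strings (they are not from the corpus). The `url` pattern is a hypothetical addition, not part of the original notebook, for the case where bare URLs should go too. It is applied after the tag pattern because `\S+` would otherwise swallow part of an `href` attribute.

```python
import re

# Same pattern as in the cell above: anything between angle brackets.
html = re.compile(r'<[^>]*>')
# Hypothetical extra pattern (an assumption, not in the notebook) for bare URLs.
url = re.compile(r'https?://\S+')

# Made-up examples, not taken from the corpus.
samples = [
    'Read <a href="https://example.com/story">this story</a> now',
    'Bare link: https://example.com/story',
]

# Step 1: strip tags. The link text survives; a bare URL is untouched.
tags_removed = [html.sub(" ", s) for s in samples]
# Step 2 (optional): strip whatever bare URLs remain.
urls_removed = [url.sub(" ", s) for s in tags_removed]

for before, after in zip(samples, urls_removed):
    print(repr(before), "->", repr(after))
```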

## Getting a Better Grip on the Data

I think I want to try something a little different with the Gab data: I am not interested in status checks or short posts. In general, I am only really interested in the longer posts. In the cells that follow I want to explore the lengths of the gabs to determine whether there is a threshold, a floor in particular, that I can use to exclude statuses, quick back-and-forths, and link-and-runs.

My first impulse was to do a simple character count, and I may yet do that, but I think I will try to be more "textual" and start with a word count. 

I'll start with some simple calculations and then try a histogram.

In [6]:
# This is in a cell by itself because tokenizing takes time, 
# and we only want to do it once.
tokenized = []
for i in html_free:
    tokens = word_tokenize(i)
    tokenized.append(tokens)

The cell below is an example of a list comprehension -- the code inside the square brackets, [] -- embedded inside two functions, first a length and then a print. It's simple, compact, and it allows me to change the number at the end and re-run the cell to "map" out the data a bit in my mind.

In [7]:
print(len([i for i in tokenized if len(i) < 5]))
print(len([i for i in tokenized if len(i) < 10]))

18458
35309


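These two simple lines reveal that about half the gabs (35,309 of 70,596) have fewer than 10 tokens and about a quarter (18,458) have fewer than five. Because `word_tokenize` treats punctuation marks as tokens, the number of actual words in each gab is a little lower than these counts suggest.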
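The same "mapping" can be expressed as proportions rather than raw counts. The sketch below uses a hypothetical helper, `share_below`, and a toy list in place of `tokenized`. The last two lines simply redo the arithmetic on the counts reported above.

```python
def share_below(token_lists, threshold):
    """Return the fraction of token lists shorter than `threshold` tokens."""
    if not token_lists:
        return 0.0
    # True counts as 1, so summing the comparisons counts the short posts.
    short = sum(len(tokens) < threshold for tokens in token_lists)
    return short / len(token_lists)

# Toy stand-in for `tokenized`: four posts of 1, 4, 7, and 12 tokens.
toy = [["Hi"], ["How", "are", "you", "?"], ["a"] * 7, ["b"] * 12]
for threshold in (5, 10):
    print(threshold, share_below(toy, threshold))

# The same arithmetic on the counts printed by the cell above.
total = 70596
share_under_5 = round(18458 / total, 3)
share_under_10 = round(35309 / total, 3)
print(share_under_5, share_under_10)
```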

If we want a somewhat more nuanced "mapping" of the corpus, we can use numpy's `histogram`:

In [8]:
lengths = [ len(i) for i in tokenized ]
counts, bins = np.histogram(lengths)
print(counts)
print(bins)

[62486  5779  1310   462   225   114    71    61    65    23]
[  0.   38.7  77.4 116.1 154.8 193.5 232.2 270.9 309.6 348.3 387. ]


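`histogram` defaults to 10 bins, so it divides the range between the shortest and the longest post (here 0 to 387 tokens) into ten equal-width bins of 38.7 tokens each. The overwhelming majority of the posts, 62,486 of 70,596, land in the first bin: they are shorter than 39 tokens.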
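Equal-width bins are not very informative when the lengths are this skewed. As a sketch, `np.histogram` also accepts explicit bin edges. The lengths and the edges below are made up for illustration and would need tuning against the real `lengths` list.

```python
import numpy as np

# Made-up post lengths standing in for `lengths`.
toy_lengths = [3, 4, 8, 12, 25, 40, 90, 387]

# Default behavior: 10 equal-width bins between the min and the max.
counts, bins = np.histogram(toy_lengths)
print(counts)
print(bins)

# Explicit edges (an assumption, chosen for illustration): each bin is
# half-open, [left, right), except the last, which includes its right edge.
edges = [0, 5, 10, 50, 100, 400]
custom_counts, custom_edges = np.histogram(toy_lengths, bins=edges)
for left, right, n in zip(custom_edges[:-1], custom_edges[1:], custom_counts):
    print(f"{left:>3} to {right:<3}: {n}")
```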

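The 1500-3000 and 300-600 word ranges I noted on an earlier pass no longer appear in this output, which tops out at 387 tokens, presumably because of the HTML stripping above. What intrigues me now are the 88 posts in the top two bins, roughly 310 to 387 tokens long. I am bookmarking those as ones to return to.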

**Conclusions**: dropping posts of fewer than five words is pretty straightforward. The real question is: how interesting are the posts that are 5-10 words long?

In [9]:
# Posts of 5 to 10 tokens, inclusive at both ends
shorts = [post for post in tokenized if 5 <= len(post) <= 10]

In [10]:
for item in random.sample(shorts, 5):
    print(" ".join(item))

Viva Cristo Rey , brother .
My Q day was awesome , Thank You ❤️❤️
Kunde inte klicka på följ , fanns ingen sådan .
Hi , how are you today ?
Lets get keep maga movement going Patriots_Unite


After running that cell 3-5 times, I did not see any gabs that looked terribly interesting, so I am going to keep only the gabs longer than 10 tokens, eliminating a little more than half my corpus from consideration. *Sigh*.
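Re-running a random sample gives different posts every time, which is fine for browsing but makes it hard to return to a particular draw. One option, sketched here on a toy list standing in for `shorts`, is a seeded `random.Random` instance. The seed value is arbitrary.

```python
import random

# Toy stand-in for `shorts`: a few tokenized posts.
toy_shorts = [
    ["Hi", ",", "how", "are", "you", "today", "?"],
    ["Viva", "Cristo", "Rey", ",", "brother", "."],
    ["Good", "morning", "to", "all", "of", "you"],
    ["See", "you", "all", "again", "tomorrow", "."],
]

SEED = 42  # arbitrary; any fixed integer makes the draw repeatable

# Two generators built from the same seed produce the same sample.
first_draw = random.Random(SEED).sample(toy_shorts, 2)
second_draw = random.Random(SEED).sample(toy_shorts, 2)

for item in first_draw:
    print(" ".join(item))
```

Changing `SEED` between runs gives a fresh look at the data while keeping each look reproducible.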

In [11]:
# Keep only the gabs with more than 10 tokens
texts = [ post for post in tokenized if len(post) > 10 ]

# Join the tokens back into strings: polarity_scores() takes a string, not a list
joins = [ " ".join(text) for text in texts ]

# Let's see a random one of them:
print(random.choice(joins))

CONFIRMATION OF REGISTRATION what caption do we see in site header : Tricky question
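Joining tokens with spaces leaves gaps around punctuation (`I 'm`, `haha .`). An alternative, sketched below, is to use the token counts only as a filter and keep the original strings. The two posts are made up, `str.split` stands in for `word_tokenize` so the example has no NLTK dependency, and `MIN_TOKENS` is a hypothetical name for the floor.

```python
# Toy stand-ins for `html_free` (cleaned strings); not from the corpus.
posts = [
    "Hi, how are you today?",
    "I'm just wasting some time at work right now haha. How are you doing?",
]

# str.split stands in for word_tokenize here (an assumption for portability).
token_lists = [post.split() for post in posts]

MIN_TOKENS = 10  # hypothetical name for the floor chosen above

# Pair each string with its tokens, filter on token count, keep the string.
kept = [post for post, tokens in zip(posts, token_lists)
        if len(tokens) > MIN_TOKENS]

print(len(kept))
print(kept[0])
```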


## Sentiment the NLTK Way

In [12]:
# How to get a score
sia.polarity_scores("Python is the best programming language.")

{'neg': 0.0, 'neu': 0.543, 'pos': 0.457, 'compound': 0.6369}

In [13]:
# What it looks like for our corpus:
sample = random.choice(joins)
print(sample)
print(sia.polarity_scores(sample))

Hoping this day greets you with good health and peace of mind .
{'neg': 0.0, 'neu': 0.426, 'pos': 0.574, 'compound': 0.8689}


In [14]:
# And now just getting the compound:
sample = random.choice(joins)
print(sample)
print(sia.polarity_scores(sample)["compound"])

Hey there ! I 'm just wasting some time at work right now haha . How are you doing ?
0.1511
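The compound score is a single number between -1 and 1. To turn it into a label, a common convention, suggested in the VADER project's documentation, is to treat scores of 0.05 and above as positive and -0.05 and below as negative. The sketch below applies that convention to the three compound scores shown above. The function name `label_compound` is hypothetical, and the cutoff is a convention rather than anything NLTK enforces.

```python
def label_compound(compound, cutoff=0.05):
    """Map a VADER compound score to a coarse label.

    The 0.05 cutoff is a widely used convention, not a fixed rule.
    """
    if compound >= cutoff:
        return "positive"
    if compound <= -cutoff:
        return "negative"
    return "neutral"

# The three compound scores printed in the cells above.
scores = [0.6369, 0.8689, 0.1511]
labels = [label_compound(score) for score in scores]
print(labels)
```

All three examples come out positive, including the work-break post at 0.1511, so a coarse label like this hides a good deal of variation that the raw score keeps.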
