# VADER lexicon
#### A lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media

Key aspects that the lexicon and its rules take into account:
* <b>Punctuation</b>: An exclamation mark (!) increases the magnitude of the intensity without modifying the semantic orientation.
* <b>Capitalization</b>: Writing a sentiment-relevant word in upper case, in the presence of other non-capitalized words, increases the magnitude of the sentiment intensity.
* <b>Degree modifiers</b>: Also called intensifiers, they either increase or decrease the sentiment intensity.
* <b>Conjunctions</b>: A contrastive conjunction such as "but" signals a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant.
* <b>Preceding tri-gram</b>: By examining the tri-gram preceding a sentiment-laden lexical feature, VADER catches nearly 90% of cases where negation flips the polarity of the text. An example of a negated sentence is "The food here isn't really all that great".
* <b>Emojis, slang and emoticons</b>: These are handled as well.
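To make four of the heuristics above concrete (punctuation, capitalization, degree modifiers and the preceding tri-gram negation check), here is a toy sketch. It is not VADER's implementation: the mini-lexicon, the word lists and every weight are hypothetical values chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon and weights, chosen only for illustration.
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5}
BOOSTERS = {"really": 0.3, "very": 0.3, "slightly": -0.3}
NEGATIONS = {"not", "never", "isn't", "wasn't", "don't"}
CAPS_BONUS = 0.7
EXCLAMATION_BONUS = 0.3
NEGATION_FACTOR = -0.75


def toy_valence(text):
    """Sum the rule-adjusted valences of the lexicon words found in `text`."""
    tokens = re.findall(r"[A-Za-z']+", text)
    # Upper case only counts as emphasis when the text is not entirely upper case.
    has_mixed_case = (any(t.isupper() for t in tokens)
                      and not all(t.isupper() for t in tokens))
    total = 0.0
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.lower())
        if valence is None:
            continue
        sign = 1 if valence > 0 else -1
        # Capitalization: emphasize an upper-case sentiment word.
        if token.isupper() and has_mixed_case:
            valence += sign * CAPS_BONUS
        # Preceding tri-gram: look at the three tokens before the sentiment word.
        preceding = [t.lower() for t in tokens[max(0, i - 3):i]]
        # Degree modifiers increase or decrease the magnitude.
        for word in preceding:
            valence += sign * BOOSTERS.get(word, 0.0)
        # Negation flips (and dampens) the polarity.
        if any(word in NEGATIONS for word in preceding):
            valence *= NEGATION_FACTOR
        total += valence
    # Punctuation: exclamation marks amplify the magnitude, capped at four.
    exclamations = min(text.count("!"), 4)
    if total > 0:
        total += exclamations * EXCLAMATION_BONUS
    elif total < 0:
        total -= exclamations * EXCLAMATION_BONUS
    return total


for sentence in ["The food is great",
                 "The food is great!",
                 "The food is GREAT",
                 "The food is really great",
                 "The food here isn't really that great"]:
    print(f"{toy_valence(sentence):+.2f}  {sentence}")
```

The first four sentences keep a positive score of growing or equal magnitude, while the last one turns negative because a negation sits inside the three-token window before "great".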

In [21]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
#nltk.download('vader_lexicon')
import pandas as pd

In [72]:
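sid = SentimentIntensityAnalyzer()
# Note: the typographic apostrophe (’) is kept exactly as in the original input.
# It appears to be why the negation goes undetected and the score below is positive:
# VADER's negation check looks for the straight-apostrophe form "n't".
sid.polarity_scores("The food here isn’t really that great")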

{'neg': 0.0, 'neu': 0.578, 'pos': 0.422, 'compound': 0.6573}

The compound score is computed by summing the valence scores of each word of the text that appears in the lexicon,
adjusted according to the rules, and then normalizing the sum to lie between -1 (most extreme negative)
and +1 (most extreme positive). It is the most useful metric if you want a single unidimensional
measure of sentiment for a given sentence.
It can accurately be described as a "normalized, weighted composite score".
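The normalization step can be sketched as follows. To my understanding, the reference implementation maps the summed valence x to x / sqrt(x² + α) with α = 15; treat both the formula and the constant as assumptions to check against the library source. The function name `normalize_sketch` is made up for this example.

```python
import math

ALPHA = 15  # assumed normalization constant; verify against the library source


def normalize_sketch(raw_sum, alpha=ALPHA):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# A raw sum of 0 stays at 0; larger magnitudes approach -1 or +1 without reaching them.
for raw in (-10.0, -3.0, 0.0, 3.0, 10.0):
    print(raw, round(normalize_sketch(raw), 4))
```

Because the mapping is monotonic and odd, it preserves the sign and ordering of the raw sums while compressing long, strongly worded texts toward the extremes.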

In [33]:
df = pd.read_pickle('df_model.p')
df_tweet_en_neutral = df[df['sentiment'] == 'neutral'].sample(n=100)
df_tweet_en_neutral
df_tweet_en_positive = df[df['sentiment'] == 'positive'].sample(n=100)
df_tweet_en_positive
df_tweet_en_negative = df[df['sentiment'] == 'negative'].sample(n=100)
df_tweet_en_negative

Unnamed: 0,count_words,count_mentions,count_hashtags,count_capital_words,count_excl_quest_marks,count_urls,count_emojis,sentiment,origin_text,likes,retweets,clean_text
3149586,21,1,1,1,0,1,0,negative,"Lazy IoT, router makers reuse skeleton keys ov...",1,1,lazi iot router maker reus skeleton key thousa...
4583687,23,0,1,0,0,1,0,negative,#cyber Combatting the evil Internet of Things ...,0,1,cyber combat evil internet thing read peter re...
1097716,15,0,0,0,0,1,0,negative,How the Internet of Things Puts Our Medical Id...,1,0,internet thing put medic ident risk
3919585,23,1,4,1,0,1,0,negative,Learn why #IoT is an opportunity #manufacturer...,0,0,learn iot opportun manufactur can’t afford pas...
3697538,28,0,1,1,0,1,0,negative,#IoT is such a kerfuffle for the out of touch ...,1,0,iot kerfuffl touch boardroom uk say world co t...
4278234,20,0,9,2,0,1,0,negative,"IoT Is A Security Nightmare,Warns EFF https:/...",0,0,iot secur nightmar warn eff iot tech cybersecu...
5859429,11,0,1,0,0,1,0,negative,TheIoT: IoT security issues and vulnerabilitie...,0,0,theiot iot secur issu vulner iot
381105,29,0,0,0,1,0,0,negative,The “Internet of Things”: The real deal for 20...,0,0,“internet things” real deal new way sell app t...
3989705,19,0,2,0,0,1,0,negative,Why a Faith-Based Approach to the IoT Is a Rea...,0,0,faith base approach iot realli bad idea iot in...
2260539,12,0,1,0,0,1,0,negative,The #IoT is an opportunity not a threat http:/...,2,0,iot opportun not threat


In [34]:
dff = pd.concat([df_tweet_en_neutral,df_tweet_en_positive])
dff = pd.concat([dff,df_tweet_en_negative])

In [35]:
dff['sentiment'].value_counts()
for s in dff['origin_text']:
    print(s)

Trains, planes and miners are all part of #Cisco’s #internetofthings vision http://t.co/Oz6cDCsdQw @om #mobile
The minds at @UMengineering are creating low-power circuits &amp; systems to efficiently fuel the #InternetOfThings → http://t.co/rh2VyK4TzP
RT @superflux: Great day at a school in Manchester w @Forum4theFuture @techwillsaveus. Internet of Things Academy pilot is a go! #iota #iot
Smart move MSFT. Now software company for Internet of things http://t.co/ydFiyacAue
How the Internet of Things Will Change Cloud Computing → http://t.co/fw8uRxstHe #IoT #InternetOfThings #Cloud
Internet of things Panopticon  Tactics That Enhance Panopticon Success https://t.co/gYPYxPILrG
What Is The Key Differentiator In Your Business by @SeanPMcCauley1 on @LinkedIn https://t.co/iFTUiIlbt8 #SaaS #IoT #SMB #SocialSelling
RT @ThreatBrief: How to secure industrial systems for the Internet of Things http://t.co/WStOGTEtVm
@senetco and #PaigeAG leverage low-power, wide-area #networks to monitor weath

In [37]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()  # not used below
ps = PorterStemmer()
tok = WordPunctTokenizer()

pat1 = r'@[A-Za-z0-9_]+'  # mentions
pat2 = r'https?://[^ ]+'  # URLs
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'   # URLs starting with "www." (dot escaped to match a literal ".")
# Contraction expansions; only needed by the commented-out neg_handled step below
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not","I'm":"I am","you're":"you are"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tweet_cleaner_updated(text):
    """Strip HTML, mentions, URLs and every non-letter character from a tweet."""
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    # In Python 3 get_text() already returns str, so no decode step is needed;
    # only the Unicode replacement character is substituted.
    bom_removed = souped.replace("\ufffd", "?")
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    # lower_case = stripped.lower()
    # neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], stripped)
    letters_only = re.sub("[^a-zA-Z]", " ", stripped)
    # The substitution above leaves unnecessary white space behind,
    # so tokenize and re-join to collapse it (single-letter tokens are dropped too)
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()

print(len(df))
testing = dff['origin_text']

# NOTE: PorterStemmer.stem() expects a single word. Applied to a whole tweet, it
# lowercases the string and stems only its last token (hence 'mobil' and
# 'internetofth' in the output below). Together with letters_only, this also removes
# the punctuation and capitalization cues that VADER's rules rely on.
test_result = [ps.stem(tweet_cleaner_updated(t)) for t in testing]
test_result



49469


['trains planes and miners are all part of cisco internetofthings vision mobil',
 'the minds at are creating low power circuits systems to efficiently fuel the internetofth',
 'rt great day at school in manchester internet of things academy pilot is go iota iot',
 'smart move msft now software company for internet of th',
 'how the internet of things will change cloud computing iot internetofthings cloud',
 'internet of things panopticon tactics that enhance panopticon success',
 'what is the key differentiator in your business by on saas iot smb socialsel',
 'rt how to secure industrial systems for the internet of th',
 'and paigeag leverage low power wide area networks to monitor weather and soil conditions iot',
 'intel security warns window to address internet of things security is rapidly closing via',
 'the world top most innovative companies in the internet of things by fast company logmein via',
 'blackberry announces project ion internet of things initiative eweek',
 'internet
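Since the cleaned strings above have lost their punctuation and capitalization, one alternative worth trying is a lighter cleaning step that removes only mentions and URLs before scoring. The sketch below is an assumption about what such a step could look like, not part of the original pipeline; `light_clean` and the pattern names are hypothetical, and `html.unescape` stands in for the BeautifulSoup step.

```python
import html
import re

MENTION_PAT = r'@[A-Za-z0-9_]+'      # same mention pattern as above
URL_PAT = r'https?://\S+|www\.\S+'   # http(s) and www URLs


def light_clean(text):
    """Remove mentions and URLs but keep case, punctuation and hashtags."""
    text = html.unescape(text)            # e.g. "&amp;" becomes "&"
    text = re.sub(MENTION_PAT, '', text)
    text = re.sub(URL_PAT, '', text)
    return re.sub(r'\s+', ' ', text).strip()  # collapse leftover white space


print(light_clean("Smart move MSFT. Now software company for Internet of things http://t.co/ydFiyacAue"))
```

Whether this improves the scores on this dataset would need to be measured; the results reported in this notebook were produced with `tweet_cleaner_updated`.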

In [39]:
y_pred = []
analyzer = SentimentIntensityAnalyzer()
for sentence in test_result:
    vs = analyzer.polarity_scores(sentence)
    y_pred.append(vs['compound'])

In [40]:
y_pred

[0.25,
 0.4215,
 0.6249,
 0.4019,
 0.0,
 0.5719,
 0.0,
 0.34,
 -0.2732,
 0.5267,
 0.6115,
 0.0,
 0.5574,
 0.4019,
 0.0,
 0.1779,
 0.0,
 0.3612,
 0.0,
 0.0,
 0.0772,
 0.34,
 -0.4404,
 0.0516,
 0.2023,
 0.25,
 0.5994,
 0.3612,
 0.34,
 0.0,
 0.2732,
 0.4404,
 -0.5994,
 0.0,
 0.1779,
 0.3182,
 0.0,
 0.0,
 0.4939,
 0.0,
 0.0,
 -0.1531,
 0.0,
 -0.296,
 0.0,
 0.0,
 0.0,
 -0.3182,
 -0.3089,
 -0.296,
 0.2732,
 0.3612,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.4588,
 0.1027,
 0.0,
 0.0,
 0.0,
 0.0,
 0.4404,
 0.0,
 0.4019,
 0.743,
 0.0,
 0.0,
 0.4404,
 0.5574,
 0.4019,
 0.0,
 0.5994,
 0.0,
 0.3182,
 0.2732,
 0.4215,
 0.6249,
 -0.2263,
 0.4588,
 0.5574,
 0.5994,
 0.2023,
 0.7906,
 0.0,
 0.4588,
 0.0,
 0.0,
 0.3612,
 -0.3612,
 0.6124,
 0.0,
 -0.0772,
 0.0,
 0.0,
 0.0,
 0.4939,
 0.5106,
 0.4927,
 0.93,
 0.7264,
 0.875,
 0.4767,
 0.2732,
 0.5267,
 0.765,
 0.3612,
 0.7351,
 0.0,
 0.6369,
 0.34,
 0.6369,
 0.5719,
 0.34,
 0.7717,
 0.6478,
 0.5994,
 0.0,
 0.7906,
 0.6249,
 0.6369,
 0.7717,
 0.0,

In [64]:

#positive sentiment: compound score >= 0.05
#neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
#negative sentiment: compound score <= -0.05

In [65]:
#dff['sentiment'][dff['sentiment']=='negative']=-1

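y_pr = []
for score in y_pred:
    # Same thresholds as listed above; if/elif/else guarantees exactly one label
    # per score, including scores equal to 0.05 or -0.05.
    if score >= 0.05:
        y_pr.append(1)
    elif score <= -0.05:
        y_pr.append(-1)
    else:
        y_pr.append(0)
y_pr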

[1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 -1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 -1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 -1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 -1,
 0,
 -1,
 0,
 0,
 0,
 -1,
 -1,
 -1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 -1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 -1,
 1,
 0,
 -1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 -1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 -1,
 -1,
 -1,
 1,
 -1,
 1,
 1,
 0,
 -1,
 1,
 -1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 -1,
 -1,
 0,
 -1,
 -1,
 -1,
 -1,
 -1,
 -1,
 0,
 1,
 0,
 -1,
 -1,
 -1,
 -1,
 -1,
 -1,


In [66]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [67]:
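# Map the string labels to the integer codes used in y_pr. This mapping is an assumption:
# the cell that converted the labels is not shown in this notebook. Values that are
# already integers pass through unchanged.
label_map = {'positive': 1, 'neutral': 0, 'negative': -1}
Y = [label_map.get(x, x) for x in dff['sentiment'].values]
len(Y)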

In [68]:
cm = confusion_matrix(Y, y_pr, labels=[1,-1,0])

In [69]:
cm

array([[83,  1, 16],
       [27, 44, 29],
       [48, 11, 41]], dtype=int64)

In [70]:
accuracy_score(Y, y_pr)

0.56
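To see where the 0.56 accuracy comes from, the confusion matrix can be broken down per class. The sketch below copies the matrix from the output above (rows are true labels, columns are predictions, both in the order `labels=[1,-1,0]`) and computes recall and precision in plain Python; the helper name `per_class_metrics` is made up for this example.

```python
# Confusion matrix copied from the output above.
# Rows: true labels, columns: predicted labels, order: positive, negative, neutral.
labels = ["positive", "negative", "neutral"]
cm = [[83, 1, 16],
      [27, 44, 29],
      [48, 11, 41]]


def per_class_metrics(matrix, names):
    """Return overall accuracy and per-class recall and precision."""
    n = len(names)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    metrics = {}
    for i, name in enumerate(names):
        true_count = sum(matrix[i])                          # row sum
        predicted_count = sum(matrix[r][i] for r in range(n))  # column sum
        metrics[name] = {
            "recall": matrix[i][i] / true_count,
            "precision": matrix[i][i] / predicted_count,
        }
    return accuracy, metrics


accuracy, metrics = per_class_metrics(cm, labels)
print(f"accuracy: {accuracy:.2f}")
for name in labels:
    print(f"{name}: recall={metrics[name]['recall']:.2f} "
          f"precision={metrics[name]['precision']:.2f}")
```

Positive tweets are recovered well (83 of 100), but 48 neutral and 27 negative tweets are also predicted as positive, which is where most of the errors come from.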

# References
* <b>Publications page:</b> http://comp.social.gatech.edu/papers/
* <b>VADER paper:</b> http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
* <b>vaderSentiment package:</b> https://pypi.org/project/vaderSentiment/
* <b>Medium article:</b> https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f
* <b>How to change the VADER lexicon:</b> https://stackoverflow.com/questions/40481348/is-it-possible-to-edit-nltks-vader-sentiment-lexicon
* <b>To check later:</b> https://www.kaggle.com/pekyewfina/debate-wordclouds-with-sentiment-analysis