This notebook maps every tweet to the VAD dictionary. The VAD dictionary has almost 14,000 lemmas, each assigned a Valence, Arousal, and Dominance score. When a lemma in the tweet matches a dictionary entry, its scores are added to the corresponding vectors (valence, arousal, dominance). If the lemma appears more than once in the tweet, its values are summed. For example, if 'be' has a valence score of 2 and appears twice in the tweet, the 'be' entry of the valence vector will be 4.
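
The accumulation rule can be sketched on a toy example. The two-lemma lexicon and the names `toy_vad` and `accumulate_vad` are made up for illustration and are not part of the notebook; only the rule itself (sum a lemma's scores over its occurrences) comes from the description above.

```python
from collections import defaultdict

# Hypothetical two-entry lexicon: lemma -> (valence, arousal, dominance).
toy_vad = {"be": (2.0, 1.0, 3.0), "happy": (8.0, 6.0, 7.0)}

def accumulate_vad(lemmas, lexicon):
    """Sum the VAD scores of each matched lemma over all its occurrences."""
    totals = defaultdict(lambda: [0.0, 0.0, 0.0])
    for lemma in lemmas:
        if lemma in lexicon:
            for i, score in enumerate(lexicon[lemma]):
                totals[lemma][i] += score
    return {lemma: tuple(scores) for lemma, scores in totals.items()}

scores = accumulate_vad(["I", "be", "happy", "to", "be", "here"], toy_vad)
print(scores["be"])     # (4.0, 2.0, 6.0): 'be' occurs twice
print(scores["happy"])  # (8.0, 6.0, 7.0): 'happy' occurs once
```

Lemmas that are not in the lexicon ("I", "to", "here") contribute nothing.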

<br>

The goal is to compare the Valence Vector for the Sarcastic Responses to the Valence Vector for the Sarcastic Contexts, and then to compare the Valence Vector for the Not Sarcastic Responses to the Valence Vector for the Not Sarcastic Contexts. The same comparison is repeated for the Dominance and Arousal vectors.
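
As a rough illustration of what "comparing" two vectors means here, the sketch below computes a Pearson correlation between one response vector and one context vector over a shared vocabulary. The four-lemma vocabulary and most of the scores are invented for the example; the notebook itself does this row by row with `DataFrame.corrwith`.

```python
import numpy as np

# Hypothetical valence vectors over a shared four-lemma vocabulary.
vocabulary = ["be", "good", "bad", "game"]
response_valence = np.array([6.18, 7.9, 0.0, 0.0])   # 'be' once, 'good' once
context_valence = np.array([12.36, 7.9, 0.0, 6.0])   # 'be' twice, 'good', 'game'

# Pearson correlation between the two vectors (off-diagonal of the 2x2 matrix).
r = np.corrcoef(response_valence, context_valence)[0, 1]
print(round(r, 3))
```

A value near 1 would mean the response and its context load on the same lemmas with similar weights; a value near 0 means they share little scored vocabulary.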




In [2]:
import pandas as pd
data = pd.read_csv('Reddit/Reddit_Training2Contexts.csv', index_col=0) 

In [3]:
data

Unnamed: 0,label,response,context/0,context/1
0,SARCASM,"Yeah I mean there's only one gender anyways, w...",When gender is unknown he/him is default.,LPT: If you're worried about hurting someone's...
1,SARCASM,"Sounds like you don't like science, you theist...",I wouldn't let that robot near me.,Promotional images for some guy's Facebook page
2,SARCASM,"Ofc play them in try mode, Blizzard were so ge...",And if i want to play a chimp that isn't on fr...,My friends won't play Dota2; I won't play LoL;...
3,SARCASM,"I don't understand, Reddit told me that Hillar...",+11 in PA +3 in AZ +15 in NH +9 in MI +1 in MO...,Poll: Convention boosts Clinton to 11-point le...
4,SARCASM,"yeh, they're the reigning triple premiers, why...","Live in the moment mate, it's not healthy to d...",Wayne Ludbey: Jordan Lewis has the ultimate co...
...,...,...,...,...
4395,NOT_SARCASM,well you could've been adulting if you hadn't ...,I want to let you know that your one comment h...,Nephelim?
4396,NOT_SARCASM,Also they'll have to join the euro,A real border might be a turn off for Scottish...,I think Scotland may actually leave this time ...
4397,NOT_SARCASM,plot: AI assists a cyborg in freelance investi...,"Honestly, this is a good idea for a pinoy cybe...",Mag-ingat sa riding in tandem
4398,NOT_SARCASM,Some airlines proposed this but too much publi...,So a fit person should be allowed to take extr...,Not to mention the people it's carrying as well.


In [57]:
import nltk

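def clean_column(col):   # get rid of punctuation
    # raw string + regex=True: recent pandas versions treat the pattern literally by default
    return col.str.replace(r'[^\w\s]', '', regex=True)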
data[['context/1', 'context/0', 'response']] = data[['context/1', 'context/0', 'response']].apply(clean_column)

columns = ['context/1','context/0','response']  #tokenize
for column in columns:
    data[(column+'_tokenized')] = data.apply(lambda row: nltk.word_tokenize(row[column]), axis=1) #tokenize
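
The effect of the cleaning step can be checked on a toy frame. This is only a sketch: the two sentences are invented, and a whitespace split stands in for `nltk.word_tokenize` so that the example runs without NLTK data.

```python
import pandas as pd

# Hypothetical one-column frame standing in for the Reddit data.
toy = pd.DataFrame({"response": ["Yeah, I mean... sure!", "Don't @ me."]})

# Strip everything that is neither a word character nor whitespace.
toy["response"] = toy["response"].str.replace(r"[^\w\s]", "", regex=True)

# Whitespace split used here in place of nltk.word_tokenize.
toy["response_tokenized"] = toy["response"].str.split()
print(toy["response_tokenized"].tolist())
# [['Yeah', 'I', 'mean', 'sure'], ['Dont', 'me']]
```

Note that removing apostrophes turns "Don't" into "Dont", so contractions will generally not match any lemma in the dictionary.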



In [58]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


class LemmatizationWithPOSTagger(object):
    def __init__(self):
        pass
    def get_wordnet_pos(self,treebank_tag):
        """
        Return the WordNet POS constant (a, n, r, v) expected by the WordNet lemmatizer for a Penn Treebank tag.
        """
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            # As default pos in lemmatization is Noun
            return wordnet.NOUN

    def pos_tag(self,tokens):
        # POS-tag each tokenized row, e.g. [('What', 'WP'), ('can', 'MD'), ('I', 'PRP'), ...]
        pos_tokens = [nltk.pos_tag(token) for token in tokens]

        # Lemmatize each word using its POS tag, keeping only the lemma
        pos_tokens = [ [(lemmatizer.lemmatize(word,self.get_wordnet_pos(pos_tag))) for (word,pos_tag) in pos] for pos in pos_tokens]

        return pos_tokens

lemmatizer = WordNetLemmatizer()
lemmatization_using_pos_tagger = LemmatizationWithPOSTagger()

#step 2 lemmatization using pos tagger 

columns = ['context/1_tokenized','context/0_tokenized','response_tokenized']
for column in columns:
    data[(column)] = lemmatization_using_pos_tagger.pos_tag((data[column]))#tokenize



In [59]:
data.head()

Unnamed: 0,label,response,context/0,context/1,context/1_tokenized,context/0_tokenized,response_tokenized
0,NOT_SARCASM,I was elected to golf not to uh got nothing,I was elected to LEAD not to READ,Calling it a decision is pretty generous,"[Calling, it, a, decision, be, pretty, generous]","[I, be, elect, to, LEAD, not, to, READ]","[I, be, elect, to, golf, not, to, uh, got, not..."
1,NOT_SARCASM,I thought those kids were in a very bad spot T...,God damn Thai Navy SEALs and the crew that pla...,Thailand cave rescue All 12 boys coach freed l...,"[Thailand, cave, rescue, All, 12, boy, coach, ...","[God, damn, Thai, Navy, SEALs, and, the, crew,...","[I, think, those, kid, be, in, a, very, bad, s..."
2,NOT_SARCASM,Nothing gives off that hipster low budget star...,They have an entire spot on the Venice pier bo...,The whole thing seemed like a way to trick inv...,"[The, whole, thing, seem, like, a, way, to, tr...","[They, have, an, entire, spot, on, the, Venice...","[Nothing, give, off, that, hipster, low, budge..."
3,NOT_SARCASM,A major corporation would run a kickstarter so...,Nickelodeon could Kickstart that shit The peop...,It needs the intro Outro not so much,"[It, need, the, intro, Outro, not, so, much]","[Nickelodeon, could, Kickstart, that, shit, Th...","[A, major, corporation, would, run, a, kicksta..."
4,SARCASM,Yup scott accidentally added a last name to a ...,Yeah so we shouldnt hate on MatPat because he ...,To be honest the whole lore of fnaf has become...,"[To, be, honest, the, whole, lore, of, fnaf, h...","[Yeah, so, we, shouldnt, hate, on, MatPat, bec...","[Yup, scott, accidentally, add, a, last, name,..."
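
The lemmas above depend on mapping Penn Treebank tags to WordNet part-of-speech letters before lemmatizing. The mapping can be sketched without NLTK, since WordNet's POS constants are the single letters 'a', 'v', 'n' and 'r'. The names `TREEBANK_PREFIX_TO_WORDNET` and `to_wordnet_pos` are illustrative, not from the notebook.

```python
# First letter of a Penn Treebank tag -> WordNet POS letter.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], "n")

for tag in ["JJ", "VBD", "NNS", "RB", "MD", "PRP"]:
    print(tag, "->", to_wordnet_pos(tag))
```

Tags outside the four families (for example `MD` or `PRP`) fall back to noun, so those words are usually left unchanged by the lemmatizer.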


In [61]:
"""VAD dictionary"""
import csv
reader = csv.reader(open('Desktop/UPDATED_NLP_COURSE/vad1.csv'))

dictionary = {}
for row in reader:
    key = row[1]
    value = row[2:]
    dictionary[key] = value
    
dictionary.pop('Word')      
print('VAD Dictionary Contains', len(dictionary), 'lemmas')


VAD Dictionary Contains 13915 lemmas
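
The loading cell assumes a CSV whose second column is the word and whose following columns hold valence, arousal and dominance, in that order. A small sketch of the same parsing on an in-memory file follows; the header names other than `Word` and all scores are made up for the example.

```python
import csv
import io

# Hypothetical file contents: index, Word, valence, arousal, dominance.
sample = io.StringIO(
    ",Word,Valence,Arousal,Dominance\n"
    "1,alpha,6.0,2.5,4.0\n"
    "2,beta,2.75,3.5,3.25\n"
)

reader = csv.reader(sample)
next(reader)  # skip the header row instead of popping 'Word' afterwards
vad = {row[1]: [float(x) for x in row[2:5]] for row in reader}

print(len(vad))
print(vad["beta"])
```

Converting to `float` at load time is a design choice for this sketch; the notebook keeps the values as strings and converts them later.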


In [62]:
#This will create a dataframe with a column for each word that is found in the response, context1 or context0 columns and is also in the VAD dictionary

vector_df = pd.DataFrame() 
def create_vector(column):
    for row in column:
        for token in row:
            if token in dictionary:
                if token not in vector_df:
                    vector_df[token] = {}
for column in columns:
    create_vector(data[column]) 

print("vector_df is now a dataframe with", len(vector_df.columns), 'columns')

vector_df is now a dataframe with 4490 columns


In [63]:
import numpy as np

columnslist = vector_df.columns.tolist()
row_by_column = np.zeros(shape=(len(data),len(columnslist))) #create 1800*4490 df
vector_df = pd.DataFrame(row_by_column,columns=columnslist) 
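
The last two cells (collect the matched lemmas, then build a zero-filled frame) could also be done in one pass. The following is a sketch under assumed toy data; `lexicon`, `rows` and `vocab` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical lexicon and tokenized rows.
lexicon = {"be": [6.18, 2.0, 5.0], "good": [7.9, 5.4, 6.4]}
rows = [["I", "be", "good"], ["be", "be", "here"], ["nothing", "matches"]]

# Ordered, de-duplicated list of matched lemmas (dict keys keep insertion order).
vocab = list(dict.fromkeys(tok for row in rows for tok in row if tok in lexicon))

# One zero-filled row per document, one column per matched lemma.
zeros = pd.DataFrame(np.zeros((len(rows), len(vocab))), columns=vocab)
print(zeros.shape)  # (3, 2)
```

Building the column list first avoids adding columns to a DataFrame one at a time, which gets slow for vocabularies with thousands of lemmas.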

In [64]:
#make 9 copies of the vector
vector_response_valence,vector_response_arousal,vector_response_dominance = [vector_df.copy() for i in range(3)]
vector_context0_valence,vector_context0_arousal,vector_context0_dominance = [vector_df.copy() for i in range(3)]
vector_context1_valence,vector_context1_arousal,vector_context1_dominance = [vector_df.copy() for i in range(3)]

In [65]:
def assign_values(text_df, vector_valence, vector_arousal,vector_dominance):   #one df has tweets, the other will have values assigned
    for idx,row in enumerate(text_df):
        val_vader_dict = {}
        aro_vader_dict = {}
        dom_vader_dict = {}
        for token in row:
            if token in dictionary:
                val_value = float(dictionary[token][0])
                aro_value = float(dictionary[token][1])
                dom_value = float(dictionary[token][2])
                if token not in val_vader_dict:   #if its in one, its in all
                    val_vader_dict[token] = val_value
                    aro_vader_dict[token] = aro_value
                    dom_vader_dict[token] = dom_value
                else:
                    val_vader_dict[token] = val_vader_dict[token] + val_value
                    aro_vader_dict[token] = aro_vader_dict[token] + aro_value
                    dom_vader_dict[token] = dom_vader_dict[token] + dom_value
        for key in val_vader_dict:
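            # .at writes one cell directly; chained indexing (df[key][idx] = ...) is unreliable in recent pandas
            vector_valence.at[idx, key] = val_vader_dict[key]
            vector_arousal.at[idx, key] = aro_vader_dict[key]
            vector_dominance.at[idx, key] = dom_vader_dict[key]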


In [66]:
assign_values(data['response_tokenized'],vector_response_valence,vector_response_arousal,vector_response_dominance)
assign_values(data['context/0_tokenized'],vector_context0_valence,vector_context0_arousal,vector_context0_dominance)
assign_values(data['context/1_tokenized'],vector_context1_valence,vector_context1_arousal,vector_context1_dominance)
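
A compact variant of the fill step, shown for valence only, counts each matched lemma with `collections.Counter` and multiplies by its score. This is a sketch, not the notebook's function: the lexicon is a hypothetical valence-only dictionary and the 7.9 score is invented.

```python
from collections import Counter
import pandas as pd

lexicon = {"be": 6.18, "good": 7.9}  # hypothetical valence-only lexicon
docs = [["I", "be", "good"], ["be", "or", "not", "be"]]

valence = pd.DataFrame(0.0, index=range(len(docs)), columns=list(lexicon))
for idx, tokens in enumerate(docs):
    counts = Counter(tok for tok in tokens if tok in lexicon)
    for lemma, n in counts.items():
        # .at sets a single cell by row label and column label.
        valence.at[idx, lemma] = n * lexicon[lemma]

print(valence)
```

The second document contains 'be' twice, so its 'be' cell holds 12.36, matching the doubled values visible in the notebook's output.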

In [67]:
len(vector_response_valence.columns)

4490

In [68]:
vector_response_valence

Unnamed: 0,decision,be,pretty,generous,cave,rescue,boy,coach,free,late,...,dibs,veritable,overlap,sheepdog,negotiator,depressing,zoo,evolutionary,biologist,prospective
0,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
1,0.0,12.36,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
2,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
3,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
4,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1795,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
1796,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
1797,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00
1798,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.76


In [69]:
vector_context0_valence

Unnamed: 0,decision,be,pretty,generous,cave,rescue,boy,coach,free,late,...,dibs,veritable,overlap,sheepdog,negotiator,depressing,zoo,evolutionary,biologist,prospective
0,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,18.54,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,6.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1795,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1796,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1797,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1798,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [70]:
vector_response_valence['tweet_label']= data['label'] #append the label column
vector_response_arousal['tweet_label']= data['label'] #append the label column
vector_response_dominance['tweet_label']= data['label'] #append the label column

vector_context0_valence['tweet_label']= data['label'] #append the label column
vector_context0_arousal['tweet_label']= data['label'] #append the label column
vector_context0_dominance['tweet_label']= data['label'] #append the label column


vector_context1_valence['tweet_label']= data['label'] #append the label column
vector_context1_arousal['tweet_label']= data['label'] #append the label column
vector_context1_dominance['tweet_label']= data['label'] #append the label column



In [73]:
response_valence_sarc = vector_response_valence[vector_response_valence["tweet_label"]=="SARCASM"].copy()
response_arousal_sarc = vector_response_arousal[vector_response_arousal["tweet_label"]=="SARCASM"].copy()
response_dominance_sarc = vector_response_dominance[vector_response_dominance["tweet_label"]=="SARCASM"].copy()

response_valence_notsarc = vector_response_valence[vector_response_valence["tweet_label"]=="NOT_SARCASM"].copy()
response_arousal_notsarc = vector_response_arousal[vector_response_arousal["tweet_label"]=="NOT_SARCASM"].copy()
response_dominance_notsarc = vector_response_dominance[vector_response_dominance["tweet_label"]=="NOT_SARCASM"].copy()


context0_valence_sarc = vector_context0_valence[vector_context0_valence["tweet_label"]=="SARCASM"].copy()
context0_arousal_sarc = vector_context0_arousal[vector_context0_arousal["tweet_label"]=="SARCASM"].copy()
context0_dominance_sarc = vector_context0_dominance[vector_context0_dominance["tweet_label"]=="SARCASM"].copy()

context0_valence_notsarc = vector_context0_valence[vector_context0_valence["tweet_label"]=="NOT_SARCASM"].copy()
context0_arousal_notsarc = vector_context0_arousal[vector_context0_arousal["tweet_label"]=="NOT_SARCASM"].copy()
context0_dominance_notsarc = vector_context0_dominance[vector_context0_dominance["tweet_label"]=="NOT_SARCASM"].copy()


context1_valence_sarc = vector_context1_valence[vector_context1_valence["tweet_label"]=="SARCASM"].copy()
context1_arousal_sarc = vector_context1_arousal[vector_context1_arousal["tweet_label"]=="SARCASM"].copy()
context1_dominance_sarc = vector_context1_dominance[vector_context1_dominance["tweet_label"]=="SARCASM"].copy()

context1_valence_notsarc = vector_context1_valence[vector_context1_valence["tweet_label"]=="NOT_SARCASM"].copy()
context1_arousal_notsarc = vector_context1_arousal[vector_context1_arousal["tweet_label"]=="NOT_SARCASM"].copy()
context1_dominance_notsarc = vector_context1_dominance[vector_context1_dominance["tweet_label"]=="NOT_SARCASM"].copy()
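
Because the same two-way split is applied to every frame, it could be wrapped in a small helper, which also removes the risk of copy-paste slips between frames. A sketch on toy data follows; `split_by_label` and the `frames` dictionary are hypothetical names.

```python
import pandas as pd

labels = pd.Series(["SARCASM", "NOT_SARCASM", "SARCASM"], name="tweet_label")
frames = {
    "response_valence": pd.DataFrame({"be": [6.18, 0.0, 12.36]}),
    "context0_valence": pd.DataFrame({"be": [0.0, 6.18, 6.18]}),
}

def split_by_label(frame, labels):
    """Return (sarcastic, not_sarcastic) row subsets of frame."""
    mask = labels == "SARCASM"
    # ~mask assumes every row is either SARCASM or NOT_SARCASM.
    return frame[mask].copy(), frame[~mask].copy()

splits = {name: split_by_label(df, labels) for name, df in frames.items()}
sarc, notsarc = splits["response_valence"]
print(sarc.index.tolist(), notsarc.index.tolist())  # [0, 2] [1]
```

The original row index is preserved in both subsets, which is what lets `corrwith(..., axis=1)` align each response row with its own context row.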


In [74]:
rvs =response_valence_sarc.corrwith(context0_valence_sarc, axis = 1, method = 'pearson') 
print("VALENCE SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(rvs)
print()
print(rvs.mean(axis=0))
print()

rvns = response_valence_notsarc.corrwith(context0_valence_notsarc, axis = 1, method = 'pearson') 
print("VALENCE NOT SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(rvns)
print()
print(rvns.mean(axis=0))

VALENCE SARCASTIC - RESPONSE & CONTEXT 0

4       0.513365
6       0.066922
9      -0.001058
11     -0.000422
12     -0.000487
          ...   
1793    0.116657
1795         NaN
1797   -0.000816
1798   -0.000877
1799   -0.000737
Length: 900, dtype: float64

0.1589732230361785

VALENCE NOT SARCASTIC - RESPONSE & CONTEXT 0

0       0.825202
1       0.377769
2       0.124362
3       0.395440
5      -0.001119
          ...   
1786   -0.000907
1790    0.326219
1792   -0.000315
1794    0.418991
1796   -0.000736
Length: 900, dtype: float64

0.15217070029435123
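
Two patterns in this output have a simple explanation, sketched below with a hand-written Pearson function (an illustration, not pandas' implementation). A row with no matched lemma is constant, so its correlation is undefined (NaN). Two sparse rows with no lemma in common give a tiny negative value rather than exactly zero.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(vx * vy)

n = 4490  # vocabulary size reported earlier in the notebook
response = [0.0] * n
context = [0.0] * n
response[0] = 6.18  # toy case: response matches only lemma 0
context[1] = 5.76   # toy case: context matches only lemma 1

disjoint = pearson(response, context)
empty = pearson(response, [0.0] * n)
print(disjoint)  # roughly -1 / (n - 1)
print(empty)     # nan
```

With one non-zero entry on each side, the value works out to -1/(n-1), which is about -0.0002 for 4490 columns. Rows with a few more non-overlapping lemmas give slightly larger magnitudes, in line with the values around -0.001 printed above.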


In [75]:
rds = response_dominance_sarc.corrwith(context0_dominance_sarc, axis = 1, method = 'pearson') 
print("DOMINANCE SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(rds)
print()
print(rds.mean(axis=0))
print()

rdns = response_dominance_notsarc.corrwith(context0_dominance_notsarc, axis = 1, method = 'pearson') 
print("DOMINANCE NOT SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(rdns)
print()
print(rdns.mean(axis=0))

DOMINANCE SARCASTIC - RESPONSE & CONTEXT 0

4       0.413302
6       0.061027
9      -0.001080
11     -0.000435
12     -0.000496
          ...   
1793    0.100078
1795         NaN
1797   -0.000857
1798   -0.000925
1799   -0.000761
Length: 900, dtype: float64

0.15148968414873049

DOMINANCE NOT SARCASTIC - RESPONSE & CONTEXT 0

0       0.799235
1       0.353741
2       0.103885
3       0.437330
5      -0.001133
          ...   
1786   -0.000913
1790    0.403467
1792   -0.000315
1794    0.407785
1796   -0.000764
Length: 900, dtype: float64

0.14787701204021936


In [76]:
ras = response_arousal_sarc.corrwith(context0_arousal_sarc, axis = 1, method = 'pearson') 
print("AROUSAL SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(ras)
print()
print(ras.mean(axis=0))
print()

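# Compare arousal with arousal (the output shown below came from a run that passed context0_dominance_notsarc here)
rans = response_arousal_notsarc.corrwith(context0_arousal_notsarc, axis = 1, method = 'pearson') 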
print("AROUSAL NOT SARCASTIC - RESPONSE & CONTEXT 0")
print()
print(rans)
print()
print(rans.mean(axis=0))

AROUSAL SARCASTIC - RESPONSE & CONTEXT 0

4       0.363098
6       0.043259
9      -0.001071
11     -0.000442
12     -0.000489
          ...   
1793    0.073634
1795         NaN
1797   -0.000815
1798   -0.000917
1799   -0.000762
Length: 900, dtype: float64

0.144011414200297

AROUSAL NOT SARCASTIC - RESPONSE & CONTEXT 0

0       0.816939
1       0.299081
2       0.107093
3       0.379875
5      -0.001121
          ...   
1786   -0.000916
1790    0.220933
1792   -0.000306
1794    0.410509
1796   -0.000752
Length: 900, dtype: float64

0.14030036439189694
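
The group means printed above are finite even though some rows are NaN, because pandas skips missing values when averaging. A small sketch with invented correlations shows the behaviour and how to count the undefined rows:

```python
import numpy as np
import pandas as pd

# Hypothetical per-row correlations, including one undefined value.
r = pd.Series([0.51, 0.07, np.nan, -0.001])

print(r.mean())              # NaN is skipped by default
print(r.isna().sum())        # number of undefined rows
print(r.mean(skipna=False))  # nan
```

Reporting the NaN count next to each mean would show how many response and context pairs actually contribute to the comparison between the sarcastic and not sarcastic groups.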
