In [1]:
!pip install nltk



How VADER calculates its scores: https://github.dev/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py
- compound: float(sum(sentiments_of_words)) +/- a coefficient that depends on the number of "!" and "?" in the sentence; the result is then normalized
- pos: pos_sum +/- a coefficient that depends on the number of "!" and "?", divided by the total sentiment (pos+abs(neg)+neutral), as an absolute value
- neg: neg_sum +/- a coefficient that depends on the number of "!" and "?", divided by the total sentiment (pos+abs(neg)+neutral), as an absolute value
- neu: neu_count, divided by the total sentiment (pos+abs(neg)+neutral), as an absolute value
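
The "later normalized" step for the compound score can be sketched as below. This is a minimal re-implementation for illustration, not the library code itself: the function name `normalize_compound` is hypothetical, and the constant `alpha=15` is the default used in the linked source file. The raw valences 1.3 and 2.9 are assumptions inferred by inverting the formula from the compound values 0.3182 and 0.5994 that appear later in this notebook for 'cool' and 'super'.

```python
import math

def normalize_compound(score, alpha=15):
    """Squash an unbounded summed valence into the range [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    # Clip in case of floating-point overshoot.
    return max(-1.0, min(1.0, norm_score))

# Assumed raw valences (see the note above).
print(round(normalize_compound(1.3), 4))
print(round(normalize_compound(2.9), 4))
```

Because the function saturates slowly, a single strongly rated word still stays well below 1, which is why the compound values of single words in this notebook are fractions rather than -1 or 1.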


# Sentiment analysis - graph based

Sentiment analysis - graph based - general idea:

For each text in the dataset, the sentiment is obtained as follows:
1. Extract nouns, verbs, adjectives and NERs (e.g. with the nltk library); they will be the nodes of the graph. Only single words, no bigrams.
2. The weights of the edges between nodes are based on the distance between two given words in the original text. The weight of an edge is the sum of the inverses of the distances between the two words over the whole text.
3. A sentiment is calculated for the word in each node (e.g. with the VADER model). The final outcome is -1 for negative, 0 for neutral and 1 for positive.
4. The sentiment is then "weighted": for each node, the sentiment from VADER is multiplied by scaled_sum_of_weights_of_edges_to_this_node (mean of the node's edge weights * number of occurrences of the word in the original text).
5. The sentiment of the whole text is the normalized sum of the weighted sentiments of all nodes. Scores within a certain threshold around zero are treated as neutral. In general, if sum_of_sentiment > 0 then positive, < 0 negative, ~0 neutral.
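
Step 2 can be illustrated with a small standalone sketch. It is only an assumption of how the idea could look in plain Python (the helper name `inverse_distance_weights` and the dict-based output are hypothetical; the notebook itself uses a NumPy matrix further down).

```python
from collections import defaultdict

def inverse_distance_weights(tokens, nodes, max_distance=20):
    """Sum 1/distance for every ordered pair of node words within max_distance."""
    weights = defaultdict(float)
    for i, word1 in enumerate(tokens):
        if word1 not in nodes:
            continue
        # Look ahead at most max_distance tokens.
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            word2 = tokens[j]
            if word2 in nodes:
                weights[(word1, word2)] += 1.0 / (j - i)
    return dict(weights)

tokens = "what be super cool in NewYork".split()
weights = inverse_distance_weights(tokens, {"super", "cool", "NewYork"})
print(weights)
```

Adjacent words get weight 1, and the contribution shrinks as the words move apart, so 'super' and 'NewYork' (three tokens apart) contribute 1/3.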



More details:
- it will be possible to choose:
    - the maximum number of nodes in the graph: the top-k words with the most occurrences in the text are selected
    - DONE max_distance: the maximum distance between words for which the inverse is added to the edge weight
    - DONE NER_list: NERs provided as the important ones, so that the sentiment score is calculated and returned for them instead of for the whole text
    - DONE calculate_overall_score: whether the user wants the score of the whole text or just of the given NERs
    - DONE (can be as output) maybe just an available dict with words and their sentiment
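
The first option above is the only one not marked DONE. A minimal sketch of how the top-k selection might work is shown here; the helper `select_top_k_nodes` and its parameters are hypothetical and not part of the notebook's code.

```python
from collections import Counter

def select_top_k_nodes(tokens, candidate_nodes, k):
    """Keep the k candidate words that occur most often in the token list."""
    counts = Counter(token for token in tokens if token in candidate_nodes)
    return [word for word, _ in counts.most_common(k)]

tokens = "what be super cool in NewYork and Abu_Dhabi , this be NewYork".split()
candidates = {"super", "cool", "be", "NewYork", "Abu_Dhabi"}
top_nodes = select_top_k_nodes(tokens, candidates, k=2)
print(top_nodes)
```

In this example only 'be' and 'NewYork' occur twice, so they are the two nodes kept. Selecting purely by frequency may favour words without sentiment, so this would be worth checking in the experiments.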

# Ideas used:
- for NER https://spacy.io/api/entityrecognizer

- https://www.nltk.org/book/ch05.html

- "NLTK offers flexible algorithms for tasks like tokenization and part-of-speech tagging, while spaCy is renowned for its speed and performance, ideal for efficient NLP solutions."

- https://spacy.io/usage/linguistic-features

In [21]:
import numpy as np

In [22]:
import nltk
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('tagsets')
import spacy

In [23]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [24]:
def find_ents(doc):
  list_of_ents=[]
  if doc.ents:
    for ent in doc.ents:
      list_of_ents.append(ent.text)
  return list_of_ents

In [132]:
def get_words_for_nodes(text, nlp, list_ners=[], lemmatization=False):
  doc1 = nlp(text)
  # find ners
  if type(list_ners)==list and len(list_ners)>0:
    ners=list_ners
  else:
    ners = find_ents(doc1) #maybe TO DO: find better way to get NER (what was Actaware idea for that? they are happy with it, so...)
  # if I put lemmatization before ners, they do not catch everything, e.g. "NY"
  if lemmatization:
    #what about substitutions in the text: what if it returns nodes containing "be" while the text contains "is"? done
    #TO DO: one of the experiments: does lemmatization change accuracy? and what is the influence on the performance?
    lemmatized_tokens=[token.lemma_ for token in doc1]
    text = ' '.join(lemmatized_tokens)
    doc1=nlp(text)
  tags=['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
  nouns_verbs_etc = [token.text for token in doc1 if token.tag_ in tags]
  text2=" ".join([str(word).replace(" ", "_") for word in list(doc1)])
  all_nodes=[str(word).replace(" ", "_") for word in list(set(ners+nouns_verbs_etc))]
  return all_nodes,text2, ners

In [26]:
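nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")  # merges multi-word chunks such as "Abu Dhabi" into a single token
all, text3, ners = get_words_for_nodes("what is super cool in NewYork and Abu Dhabi, this is NewYork", nlp, lemmatization=True)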

In [27]:
all

['super', 'cool', 'be', 'NewYork', 'Abu_Dhabi']

In [28]:
text3

'what be super cool in NewYork and Abu_Dhabi , this be NewYork'

Links:
- https://spacy.io/universe/project/video-spacys-ner-model
- https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

## weights of edges

In [29]:
def get_weights_of_edges(text, words, max_distance=20, ner_list=[]):
  # if ner_list is not empty, distances are calculated only for pairs that
  # include at least one NER, and not between two other words

  list_from_text = text.split()
  weight_matrix=np.zeros((len(words), len(words)))
  occurences_list=np.zeros(len(words))
  for i, word1 in enumerate(list_from_text):
    if word1 not in words:
      continue
    index1=words.index(word1)
    occurences_list[index1]+=1
    # look ahead at most max_distance tokens
    last=min(i+max_distance, len(list_from_text)-1)
    for j in range(i+1, last+1):
      word2=list_from_text[j]
      if word2 not in words:
        continue
      if len(ner_list)==0 or word1 in ner_list or word2 in ner_list:
        # j > i, so the distance is always at least 1
        weight_matrix[index1][words.index(word2)]+=1/(j-i)
  # upper_right_ones=np.triu(np.ones(len(words)))
  # return (weight_matrix+weight_matrix.T)*upper_right_ones, occurences_list
  return weight_matrix, occurences_list

In [30]:
text3

'what be super cool in NewYork and Abu_Dhabi , this be NewYork'

In [31]:
all

['super', 'cool', 'be', 'NewYork', 'Abu_Dhabi']

In [32]:
weight_matrix, occ_list=get_weights_of_edges(text3, all, max_distance=20, ner_list=[])

In [33]:
occ_list

array([1., 1., 2., 2., 1.])

In [34]:
weight_matrix

array([[0.        , 1.        , 0.125     , 0.44444444, 0.2       ],
       [0.        , 0.        , 0.14285714, 0.625     , 0.25      ],
       [1.        , 0.5       , 0.11111111, 1.35      , 0.16666667],
       [0.        , 0.        , 0.2       , 0.16666667, 0.5       ],
       [0.        , 0.        , 0.33333333, 0.25      , 0.        ]])

In [35]:
weight_matrix2, occ_list2=get_weights_of_edges(text3, all, max_distance=20, ner_list=['super'])
weight_matrix2

array([[0.        , 1.        , 0.125     , 0.44444444, 0.2       ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])

# sentiment for each node

'compound': the sum of all lexicon ratings, normalized so that it takes values from -1 to 1

In [36]:
def calculate_sentiment_for_nodes(words, compound=False):
  analyzer = SentimentIntensityAnalyzer()
  # Loop through the node words from step 1 and get the sentiment scores for each one
  sentiment_scores=np.zeros(len(words))
  for i, word in enumerate(words):  # separate loop variable, so the argument is not shadowed
    scores=analyzer.polarity_scores(word)
    if compound:
      sentiment_scores[i]=scores['compound']
    else:
      # -1/1 only when VADER rates the node as fully negative/positive; otherwise it stays 0 (neutral)
      if scores['neg']==1.0:
        sentiment_scores[i]=-1
      elif scores['pos']==1.0:
        sentiment_scores[i]=1
  return sentiment_scores

In [37]:
calculate_sentiment_for_nodes(all, compound=False)

array([1., 1., 0., 0., 0.])

In [38]:
sentiment_scores=calculate_sentiment_for_nodes(all, compound=True)

# weighting of the sentiment

In [39]:
weight_matrix

array([[0.        , 1.        , 0.125     , 0.44444444, 0.2       ],
       [0.        , 0.        , 0.14285714, 0.625     , 0.25      ],
       [1.        , 0.5       , 0.11111111, 1.35      , 0.16666667],
       [0.        , 0.        , 0.2       , 0.16666667, 0.5       ],
       [0.        , 0.        , 0.33333333, 0.25      , 0.        ]])

In [40]:
sentiment_scores

array([0.5994, 0.3182, 0.    , 0.    , 0.    ])

In [41]:
occ_list

array([1., 1., 2., 2., 1.])

In [64]:
def weighted_sentiment_func(weight_matrix, occ_list, sentiment_scores, words, ner_list=[]):
  # column sums: edges from earlier words; row sums: edges to later words
  sum_columns_weights=np.sum(weight_matrix, axis=0)
  sum_rows_weights=np.sum(weight_matrix, axis=1)
  if len(ner_list)!=0: #if ners are predefined, calculate only for them
    for i, word in enumerate(words):
      if word not in ner_list:
        sum_columns_weights[i]=0
        sum_rows_weights[i]=0
  sum_all_weights=sum_columns_weights+sum_rows_weights
  # number of non-zero edges per node, used to turn the sum into a mean
  count_nonzero_weights_columns=np.count_nonzero(weight_matrix, axis=0)
  count_nonzero_weights_rows=np.count_nonzero(weight_matrix, axis=1)
  count_nonzero_weights=count_nonzero_weights_columns+count_nonzero_weights_rows
  count_nonzero_weights[count_nonzero_weights==0]=1 # avoid division by zero for nodes without edges
  mean_columns_weights=sum_all_weights/count_nonzero_weights
  # scale the mean edge weight by the number of occurrences, then apply the sentiment
  total_weight=mean_columns_weights*occ_list
  return total_weight*sentiment_scores

In [43]:
weight_matrix2

array([[0.        , 1.        , 0.125     , 0.44444444, 0.2       ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])

In [44]:
weight_matrix

array([[0.        , 1.        , 0.125     , 0.44444444, 0.2       ],
       [0.        , 0.        , 0.14285714, 0.625     , 0.25      ],
       [1.        , 0.5       , 0.11111111, 1.35      , 0.16666667],
       [0.        , 0.        , 0.2       , 0.16666667, 0.5       ],
       [0.        , 0.        , 0.33333333, 0.25      , 0.        ]])

In [45]:
weighted_sentiment=weighted_sentiment_func(weight_matrix, occ_list, sentiment_scores, all)

In [46]:
weighted_sentiment

array([0.332001  , 0.16023643, 0.        , 0.        , 0.        ])
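
The first value above can be checked by hand. The sketch below repeats the arithmetic of `weighted_sentiment_func` for the node 'super' only, using the numbers from the `weight_matrix`, `occ_list` and `sentiment_scores` outputs shown earlier; the variable names are chosen here for illustration.

```python
# Row 0 of weight_matrix: edges from 'super' to later words
# ('cool', 'be', 'NewYork', 'Abu_Dhabi').
row_weights = [1.0, 0.125, 1/3 + 1/9, 0.2]
# Column 0 of weight_matrix: the single edge from an earlier 'be' to 'super'.
column_weights = [1.0]

occurrences = 1      # occ_list[0]
compound = 0.5994    # sentiment_scores[0]

# Mean over the non-zero edges touching the node.
mean_weight = (sum(row_weights) + sum(column_weights)) / (len(row_weights) + len(column_weights))
weighted = mean_weight * occurrences * compound
print(round(weighted, 6))  # 0.332001
```

The same steps for 'cool' give the second value; the remaining nodes have a compound score of 0 and therefore a weighted sentiment of 0.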

In [47]:
weighted_sentiment=weighted_sentiment_func(weight_matrix2, occ_list, sentiment_scores, all, ner_list=["super"])

In [48]:
all #'super' is first on list

['super', 'cool', 'be', 'NewYork', 'Abu_Dhabi']

In [49]:
weighted_sentiment #value of sentiment for 'super' is the first one - the only non zero here

array([0.332001, 0.      , 0.      , 0.      , 0.      ])

# sentiment of the whole text

Open question: should the summed score be normalized?

In [50]:
np.sum(weighted_sentiment)

0.332001

In [51]:
def calculate_sentiment_of_text(weighted_sentiment, threshold=0.05, output_number=False):
  sum_sentiment=np.sum(weighted_sentiment)
  # sums within [-threshold, threshold] are treated as neutral
  if sum_sentiment>threshold:
    label=1
  elif sum_sentiment<-threshold:
    label=-1
  else:
    label=0
  if output_number:
    return label
  return {1: "positive", -1: "negative", 0: "neutral"}[label]

In [52]:
calculate_sentiment_of_text(weighted_sentiment)

'positive'

In [53]:
calculate_sentiment_of_text([-1,2,-4])

'negative'

In [54]:
calculate_sentiment_of_text([0.5, -0.45], output_number=True)

0

# run all at once

In [86]:
def graph_sentiment_analysis(text, nlp, lemmatization=False, max_distance=20, ner_list=[], compound=False, output_number=False, calculate_overall_score=1, threshold=0.05):
  words_all, text_all, ners_all = get_words_for_nodes(text, lemmatization=lemmatization, nlp=nlp)
  weight_matrix_all, occ_list_all=get_weights_of_edges(text_all, words_all, max_distance=max_distance, ner_list=ner_list)
  sentiment_scores_all=calculate_sentiment_for_nodes(words_all, compound=compound)
  weighted_sentiment_all=weighted_sentiment_func(weight_matrix_all, occ_list_all, sentiment_scores_all, words=words_all, ner_list=ner_list)
  if calculate_overall_score==1:
    return calculate_sentiment_of_text(weighted_sentiment_all, output_number=output_number, threshold=threshold)
  elif calculate_overall_score==0:
    return dict(zip(words_all, weighted_sentiment_all))
  else:
    words_all.append("overall_sentiment")
    weighted_sentiment_all=list(weighted_sentiment_all)
    weighted_sentiment_all.append(calculate_sentiment_of_text(weighted_sentiment_all, output_number=True, threshold=threshold))
    return dict(zip(words_all, weighted_sentiment_all))

In [56]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
graph_sentiment_analysis("you are liar, Tom", nlp, calculate_overall_score=1)

'negative'

In [57]:
graph_sentiment_analysis("you are liar, Tom", nlp, calculate_overall_score=0)

{'are': 0.0, 'liar': -0.75, 'Tom': 0.0}

In [58]:
graph_sentiment_analysis("you are liar, Tom", nlp, calculate_overall_score=3)

{'are': 0.0, 'liar': -0.75, 'Tom': 0.0, 'overall_sentiment': -1}

In [59]:
graph_sentiment_analysis("you are liar, Tom", nlp, calculate_overall_score=3, ner_list=['Tom', 'liar'])

{'are': 0.0, 'liar': -0.75, 'Tom': 0.0, 'overall_sentiment': -1}

# tests for twitter sentiment extraction dataset

https://www.kaggle.com/competitions/tweet-sentiment-extraction/data

In [1]:
import pandas as pd

In [111]:
test_tse_data=pd.read_csv("/content/drive/MyDrive/Stuuudia/Magisterka/other_datasets_for_tests/tweet-sentiment-extraction/test.csv")

In [112]:
test_tse_data

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive
...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative
3530,416863ce47,All alone in this old house again. Thanks for...,positive
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive


In [113]:
test_tse_data["predicted_sentiment"]=''

In [16]:
test_tse_data

Unnamed: 0,textID,text,sentiment,predicted_sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,
3,01082688c6,happy bday!,positive,
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative,
3530,416863ce47,All alone in this old house again. Thanks for...,positive,
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative,
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive,


In [17]:
test_tse_data.shape

(3534, 4)

In [114]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_tse_data.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_tse_data['text'][i]
  # .at avoids chained assignment, which may write to a temporary copy
  test_tse_data.at[i, 'predicted_sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp)

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400


In [115]:
test_tse_data

Unnamed: 0,textID,text,sentiment,predicted_sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,negative
3,01082688c6,happy bday!,positive,neutral
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,positive
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative,negative
3530,416863ce47,All alone in this old house again. Thanks for...,positive,positive
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive,positive


In [77]:
accuracy=np.sum(test_tse_data['sentiment']==test_tse_data['predicted_sentiment'])/test_tse_data.shape[0]

In [81]:
from sklearn.metrics import balanced_accuracy_score

In [85]:
ba_tse=balanced_accuracy_score(test_tse_data['sentiment'], test_tse_data['predicted_sentiment'])

In [84]:
accuracy, ba_tse

(0.6069609507640068, 0.6016598994840155)
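
The two numbers differ because balanced accuracy is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. A minimal pure-Python sketch of both metrics follows; the helper names and the toy labels are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def plain_accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)

# Toy labels: three positive samples, one negative sample.
y_true = ["positive", "positive", "positive", "negative"]
y_pred = ["positive", "positive", "negative", "negative"]
print(plain_accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Here the recall is 2/3 for 'positive' and 1 for 'negative', so the balanced accuracy (about 0.83) is higher than the plain accuracy (0.75).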

# tests for news sentiment

https://www.kaggle.com/datasets/hoshi7/news-sentiment-dataset

In [117]:
test_ns_data2=pd.read_csv("/content/drive/MyDrive/Stuuudia/Magisterka/other_datasets_for_tests/news_sentiment_dataset/Sentiment_dataset.csv")
test_ns_data2

Unnamed: 0,news_title,reddit_title,sentiment,text,url
0,Mark Cuban launches generic drug company,Billionaire Mark Cuban just launched a drug co...,1.0,Billionaire investor and Shark Tank star Mark ...,https://www.beckershospitalreview.com/pharmacy...
1,From Defendant to Defender: One Wrongfully Con...,"Man falsely imprisoned for 10 years, uses pris...",1.0,Attorney Jarrett Adams recently helped overtur...,https://www.nbcnews.com/news/us-news/defendant...
2,"Amazon Tribe Wins Lawsuit Against Big Oil, Sav...",Amazon tribe wins legal battle against oil com...,1.0,The Amazon Rainforest is well known across the...,https://www.disclose.tv/amazon-tribe-wins-laws...
3,Newark police: No officer fired a single shot ...,Newark police: No officer fired a single shot ...,1.0,Newark police: No officer fired a single shot ...,https://newjersey.news12.com/newark-police-no-...
4,Ingen barn døde i trafikken i 2019,No children died in traffic accidents in Norwa...,1.0,I 1970 døde det 560 mennesker i den norske tra...,https://www.nrk.no/trondelag/ingen-barn-dode-i...
...,...,...,...,...,...
843,Dee Why attack: Man allegedly choked and threa...,Dee Why attack: Man allegedly choked and threa...,0.0,Frightening details have emerged about a toile...,https://www.9news.com.au/2018/11/30/17/55/sydn...
844,Africa: Children and HIV/Aids - 'We Need to Ta...,Africa: Children and HIV/Aids - 'We Need to Ta...,0.0,"interview\n\nJohannesburg — 360,000 adolescent...",https://allafrica.com/stories/201811300567.html
845,Terrorism suspected in Eilat attack,Terrorism suspected in Eilat attack,0.0,A violent attack in the southern Israeli port ...,http://www.israelnationalnews.com/News/News.as...
846,Anti-Semitism never disappeared in Europe. It'...,Anti-Semitism never disappeared in Europe. It'...,0.0,"It's a 17-year-old boy, too frightened to wear...",https://edition.cnn.com/2018/11/27/europe/anti...


In [118]:
test_ns_data2['sentiment'].unique() #1 positive, 0 negative

array([1., 0.])

In [121]:
def map_values(x):
  if x==1:
    return 'positive'
  elif x==0:
    return 'negative'
  else:
    return 'neutral'

test_ns_data2['sentiment']=test_ns_data2['sentiment'].apply(map_values)

In [122]:
test_ns_data2["predicted_sentiment"]=''
test_ns_data2

Unnamed: 0,news_title,reddit_title,sentiment,text,url,predicted_sentiment
0,Mark Cuban launches generic drug company,Billionaire Mark Cuban just launched a drug co...,positive,Billionaire investor and Shark Tank star Mark ...,https://www.beckershospitalreview.com/pharmacy...,
1,From Defendant to Defender: One Wrongfully Con...,"Man falsely imprisoned for 10 years, uses pris...",positive,Attorney Jarrett Adams recently helped overtur...,https://www.nbcnews.com/news/us-news/defendant...,
2,"Amazon Tribe Wins Lawsuit Against Big Oil, Sav...",Amazon tribe wins legal battle against oil com...,positive,The Amazon Rainforest is well known across the...,https://www.disclose.tv/amazon-tribe-wins-laws...,
3,Newark police: No officer fired a single shot ...,Newark police: No officer fired a single shot ...,positive,Newark police: No officer fired a single shot ...,https://newjersey.news12.com/newark-police-no-...,
4,Ingen barn døde i trafikken i 2019,No children died in traffic accidents in Norwa...,positive,I 1970 døde det 560 mennesker i den norske tra...,https://www.nrk.no/trondelag/ingen-barn-dode-i...,
...,...,...,...,...,...,...
843,Dee Why attack: Man allegedly choked and threa...,Dee Why attack: Man allegedly choked and threa...,negative,Frightening details have emerged about a toile...,https://www.9news.com.au/2018/11/30/17/55/sydn...,
844,Africa: Children and HIV/Aids - 'We Need to Ta...,Africa: Children and HIV/Aids - 'We Need to Ta...,negative,"interview\n\nJohannesburg — 360,000 adolescent...",https://allafrica.com/stories/201811300567.html,
845,Terrorism suspected in Eilat attack,Terrorism suspected in Eilat attack,negative,A violent attack in the southern Israeli port ...,http://www.israelnationalnews.com/News/News.as...,
846,Anti-Semitism never disappeared in Europe. It'...,Anti-Semitism never disappeared in Europe. It'...,negative,"It's a 17-year-old boy, too frightened to wear...",https://edition.cnn.com/2018/11/27/europe/anti...,


In [123]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_ns_data2.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_ns_data2['text'][i]
  # .at avoids chained assignment, which may write to a temporary copy
  test_ns_data2.at[i, 'predicted_sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, threshold=0.0)

0
200
400
600
800


In [124]:
accuracy_ns=np.sum(test_ns_data2['sentiment']==test_ns_data2['predicted_sentiment'])/test_ns_data2.shape[0]

In [125]:
ba_ns=balanced_accuracy_score(test_ns_data2['sentiment'], test_ns_data2['predicted_sentiment'])



In [126]:
accuracy_ns, ba_ns

(0.7570754716981132, 0.7626737967914439)

In [129]:
len(max(test_ns_data2['text'], key=len).split()) #12706 words

12706

In [131]:
len(max(test_tse_data['text'], key=len).split()) #20 words

20