In [1]:
!pip install nltk



VADER score calculation (source: https://github.dev/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py):
- compound: float(sum(sentiments_of_words)), plus or minus a coefficient tied to the number of "!" and "?" in the sentence; the result is then normalized
- pos: pos_sum, plus or minus the same punctuation coefficient, divided by the total sentiment (pos + abs(neg) + neutral); the absolute value is taken
- neg: neg_sum, plus or minus the same punctuation coefficient, divided by the total sentiment (pos + abs(neg) + neutral); the absolute value is taken
- neu: neu_count divided by the total sentiment (pos + abs(neg) + neutral); the absolute value is taken
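
To make the "then normalized" step concrete, here is a minimal standalone sketch. It assumes the normalization in the linked source is x / sqrt(x^2 + alpha) with alpha = 15; the function name `normalize_compound` is made up for this note and is not a call into VADER itself. The raw ratings 2.9 and 1.3 are assumed lexicon values for 'super' and 'cool'.

```python
import math

def normalize_compound(score, alpha=15.0):
    """Squash an unbounded sentiment sum into [-1, 1] (sketch, assumed formula)."""
    norm = score / math.sqrt(score * score + alpha)
    # clamp defensively so the result never leaves [-1, 1]
    return max(-1.0, min(1.0, norm))

# assumed raw ratings; the results can be compared with the compound
# scores printed further down in this notebook
for raw in (2.9, 1.3, 0.0, -4.0):
    print(raw, round(normalize_compound(raw), 4))
```

Under these assumptions 2.9 maps to roughly 0.5994 and 1.3 to roughly 0.3182, which agrees with the compound values that appear later for 'super' and 'cool'.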


# Sentiment analysis - graph based

Sentiment analysis - graph based - general idea:

To get the sentiment of each text in the dataset:
1. DONE Get nouns, verbs, adjectives and NERs (e.g. with the nltk library); they will be the nodes of the graph. Originally only single words were planned; update: bigrams are also used, in the format word_word.
2. DONE The weights of the edges between nodes are based on the distance between the two words in the original text. The weight is the sum of the inverses of the distances between them over the whole text.
3. DONE For each node word the sentiment is calculated (e.g. with the VADER model). The outcome is -1 for negative, 0 for neutral and 1 for positive. Update: a parameter switches between this discrete score and the compound score.
4. DONE The sentiment is then "weighted": for each node, the sentiment from VADER is multiplied by scaled_sum_of_weights_of_edges_to_this_node (the mean of the edge weights * the number of occurrences of the word in the original text).
5. DONE The sentiment of the whole text is the normalized sum of the weighted sentiments of all nodes. Scores within a certain threshold around zero are treated as neutral. In general, if sum_of_sentiment > 0 then positive, < 0 negative, ~0 neutral.
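
Step 2 can be sketched for a single pair of words in plain Python. This is only an illustration of the "sum of inverse distances" idea; the helper name `inverse_distance_weight` is hypothetical, and the sentence is the lemmatized example used later in this notebook.

```python
def inverse_distance_weight(tokens, source, target, max_distance=20):
    """Sum 1/distance over every occurrence of target that follows source."""
    total = 0.0
    for i, token in enumerate(tokens):
        if token != source:
            continue
        # only look at the next max_distance tokens
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] == target:
                total += 1.0 / (j - i)
    return total

tokens = "what be super cool in NewYork and Abu_Dhabi , this be NewYork".split()
# 'be' at positions 1 and 10, 'NewYork' at positions 5 and 11:
# 1/4 + 1/10 + 1/1, i.e. about 1.35
print(inverse_distance_weight(tokens, "be", "NewYork"))
```

The weight is directed: only occurrences of the target after the source are counted, so the weight from 'NewYork' to 'be' is different (1/5 here).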



More details:
- there will be a possibility to choose
    - DONE the maximum number of nodes in the graph: the top-k words with the most occurrences in the text are selected
    - DONE max_distance: the maximum distance between words for which the inverse is added to the edge weight
    - DONE NER_list: NERs given as the important ones, so that the sentiment score is calculated and returned for them instead of for the whole text
    - DONE calculate_overall_score: whether the user wants the score of the whole text or just of the given NERs
    - DONE (can be as output) maybe just make available a dict with words and their sentiment
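
The "maximum number of nodes" option can be sketched with `collections.Counter`. This is an alternative formulation, not the code used below; `top_k_nodes` is a hypothetical name. Ties are broken by first appearance in the text, so for tied counts the selection may differ from a version that sorts a set-derived dict.

```python
from collections import Counter

def top_k_nodes(tokens, candidates, k):
    """Keep the k candidate words that occur most often in the token list."""
    counts = Counter(token for token in tokens if token in candidates)
    return [word for word, _ in counts.most_common(k)]

tokens = "what be super cool in NewYork and Abu_Dhabi , this be NewYork".split()
candidates = {"be", "NewYork", "Abu_Dhabi", "super", "cool"}
print(top_k_nodes(tokens, candidates, 2))
```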

# Ideas used:
- for NER https://spacy.io/api/entityrecognizer

- https://www.nltk.org/book/ch05.html

- "NLTK offers flexible algorithms for tasks like tokenization and part-of-speech tagging, while spaCy is renowned for its speed and performance, ideal for efficient NLP solutions."

- https://spacy.io/usage/linguistic-features

In [3]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import spacy
import numpy as np
import pandas as pd
import time
from sklearn.metrics import balanced_accuracy_score

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [4]:
def find_ents(doc):
  list_of_ents=[]
  if doc.ents:
    for ent in doc.ents:
      list_of_ents.append(ent.text)
  return list_of_ents

In [5]:
def get_words_for_nodes(text, nlp, list_ners=[], lemmatization=False, max_nodes=0):
  doc1 = nlp(text)
  # find NERs: use the provided list if there is one, otherwise detect them with spaCy
  if isinstance(list_ners, list) and len(list_ners)>0:
    ners=list_ners
  else:
    ners = find_ents(doc1)
  if lemmatization:
    # re-parse the lemmatized text so that the nodes below are lemmas
    lemmatized_tokens=[token.lemma_ for token in doc1]
    text = ' '.join(lemmatized_tokens)
    doc1=nlp(text)
  # Penn Treebank tags for adjectives, nouns, adverbs and verbs
  tags=['JJ', 'JJR', 'JJS', 'NN', 'NNP', 'NNS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
  nouns_verbs_etc = [token.text for token in doc1 if token.tag_ in tags]
  # multi-word tokens (merged noun chunks) are joined with "_" so that text2 can be split on spaces
  text2=" ".join([str(word).replace(" ", "_") for word in list(doc1)])
  all_nodes=[str(word).replace(" ", "_") for word in list(set(ners+nouns_verbs_etc))]
  if max_nodes !=0:
    # keep only the max_nodes most frequent nodes
    word_dict = {word: 0 for word in all_nodes}
    for word in text2.split():
      if word in word_dict.keys():
        word_dict[word] +=1
    word_dict_sorted=sorted(word_dict.items(), key=lambda x: x[1], reverse=True)[:max_nodes]
    all_nodes = [item[0] for item in word_dict_sorted]
  else:
    # no counting needed; return an empty dict so the return type stays consistent
    word_dict={}
  return all_nodes,text2, ners, word_dict

In [6]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
all, text3, ners, word_dict = get_words_for_nodes("what is super cool in NewYork and Abu Dhabi, this is NewYork", lemmatization=True, max_nodes=2, nlp=nlp)

In [7]:
word_dict

{'Abu_Dhabi': 1, 'be': 2, 'super': 1, 'cool': 1, 'NewYork': 2}

In [8]:
all

['be', 'NewYork']

In [9]:
all, text3, ners, word_dict = get_words_for_nodes("what is super cool in NewYork and Abu Dhabi, this is NewYork", lemmatization=True, max_nodes=3, nlp=nlp)

In [10]:
all

['be', 'NewYork', 'Abu_Dhabi']

In [11]:
all, text3, ners, word_dict = get_words_for_nodes("what is super cool in NewYork and Abu Dhabi, this is NewYork", lemmatization=True, max_nodes=5, nlp=nlp)

In [12]:
all

['be', 'NewYork', 'Abu_Dhabi', 'super', 'cool']

In [13]:
word_dict

{'Abu_Dhabi': 1, 'be': 2, 'super': 1, 'cool': 1, 'NewYork': 2}

In [14]:
text3

'what be super cool in NewYork and Abu_Dhabi , this be NewYork'

Links:
- https://spacy.io/universe/project/video-spacys-ner-model
- https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk

## weights of edges

In [15]:
def get_weights_of_edges(text, words, max_distance=20, ner_list=[]):
  # if ner_list is not empty, only pairs in which at least one word is in ner_list
  # contribute to the weights; pairs of two other words are skipped
  list_from_text = text.split()
  weight_matrix=np.zeros((len(words), len(words)))
  occurences_list=np.zeros(len(words))
  for i, word1 in enumerate(list_from_text):
    if word1 not in words:
      continue
    index1=words.index(word1)
    occurences_list[index1]+=1
    # look at the next max_distance tokens after word1
    last=min(i+max_distance, len(list_from_text)-1)
    for j in range(i+1, last+1):
      word2=list_from_text[j]
      if word2 not in words:
        continue
      if len(ner_list)==0 or word1 in ner_list or word2 in ner_list:
        # directed edge word1 -> word2, weighted by the inverse of the token distance
        weight_matrix[index1][words.index(word2)]+=1/(j-i)
  return weight_matrix, occurences_list

In [16]:
text3

'what be super cool in NewYork and Abu_Dhabi , this be NewYork'

In [17]:
all

['be', 'NewYork', 'Abu_Dhabi', 'super', 'cool']

In [18]:
weight_matrix, occ_list=get_weights_of_edges(text3, all, max_distance=20, ner_list=[])

In [19]:
occ_list

array([2., 2., 1., 1., 1.])

In [20]:
weight_matrix

array([[0.11111111, 1.35      , 0.16666667, 1.        , 0.5       ],
       [0.2       , 0.16666667, 0.5       , 0.        , 0.        ],
       [0.33333333, 0.25      , 0.        , 0.        , 0.        ],
       [0.125     , 0.44444444, 0.2       , 0.        , 1.        ],
       [0.14285714, 0.625     , 0.25      , 0.        , 0.        ]])

In [21]:
weight_matrix2, occ_list2=get_weights_of_edges(text3, all, max_distance=20, ner_list=['super'])
weight_matrix2

array([[0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.125     , 0.44444444, 0.2       , 0.        , 1.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])

# sentiment for each node

'compound' - the normalized compound score: the sum of all lexicon ratings, normalized so that it takes values from -1 to 1

In [22]:
def calculate_sentiment_for_nodes(text, compound=False):
  # `text` is the list of node words/bigrams from step 1
  analyzer = SentimentIntensityAnalyzer()
  # Loop through the words/bigrams and get the sentiment score for each one
  sentiment_scores=np.zeros(len(text))
  for i, word in enumerate(text):
    scores=analyzer.polarity_scores(word)
    if compound:
      # continuous score in [-1, 1]
      sentiment_scores[i]=scores['compound']
    else:
      # discrete score: -1 or 1 only when the word is fully negative/positive, otherwise 0
      if scores['neg']==1.0:
        sentiment_scores[i]=-1
      elif scores['pos']==1.0:
        sentiment_scores[i]=1
  return sentiment_scores

In [23]:
calculate_sentiment_for_nodes(all, compound=False)

array([0., 0., 0., 1., 1.])

In [24]:
sentiment_scores=calculate_sentiment_for_nodes(all, compound=True)

# weighting of the sentiment

In [25]:
weight_matrix

array([[0.11111111, 1.35      , 0.16666667, 1.        , 0.5       ],
       [0.2       , 0.16666667, 0.5       , 0.        , 0.        ],
       [0.33333333, 0.25      , 0.        , 0.        , 0.        ],
       [0.125     , 0.44444444, 0.2       , 0.        , 1.        ],
       [0.14285714, 0.625     , 0.25      , 0.        , 0.        ]])

In [26]:
sentiment_scores

array([0.    , 0.    , 0.    , 0.5994, 0.3182])

In [27]:
occ_list

array([2., 2., 1., 1., 1.])

In [28]:
def weighted_sentiment_func(weight_matrix, occ_list, sentiment_scores, words, ner_list=[]):
  # for each node: total weight of incoming (column) and outgoing (row) edges
  sum_columns_weights=np.sum(weight_matrix, axis=0)
  sum_rows_weights=np.sum(weight_matrix, axis=1)
  if len(ner_list)!=0: #if ners are predefined, calculate only for them
    for i, word in enumerate(words):
      if word not in ner_list:
        sum_columns_weights[i]=0
        sum_rows_weights[i]=0
  sum_all_weights=sum_columns_weights+sum_rows_weights
  # number of non-zero edges per node; set to 1 where there are none to avoid division by zero
  count_nonzero_weights_columns=np.count_nonzero(weight_matrix, axis=0)
  count_nonzero_weights_rows=np.count_nonzero(weight_matrix, axis=1)
  count_nonzero_weights=count_nonzero_weights_columns+count_nonzero_weights_rows
  count_nonzero_weights[count_nonzero_weights==0]=1
  # mean edge weight per node, scaled by the number of occurrences of the word
  mean_columns_weights=sum_all_weights/count_nonzero_weights
  total_weight=mean_columns_weights*occ_list
  return total_weight*sentiment_scores

In [29]:
weight_matrix2

array([[0.        , 0.        , 0.        , 1.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.125     , 0.44444444, 0.2       , 0.        , 1.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ]])

In [30]:
weight_matrix

array([[0.11111111, 1.35      , 0.16666667, 1.        , 0.5       ],
       [0.2       , 0.16666667, 0.5       , 0.        , 0.        ],
       [0.33333333, 0.25      , 0.        , 0.        , 0.        ],
       [0.125     , 0.44444444, 0.2       , 0.        , 1.        ],
       [0.14285714, 0.625     , 0.25      , 0.        , 0.        ]])

In [31]:
weighted_sentiment=weighted_sentiment_func(weight_matrix, occ_list, sentiment_scores, all)

In [32]:
weighted_sentiment

array([0.        , 0.        , 0.        , 0.332001  , 0.16023643])

In [33]:
weighted_sentiment=weighted_sentiment_func(weight_matrix2, occ_list, sentiment_scores, all, ner_list=["super"])

In [34]:
all #'super' is fourth on list

['be', 'NewYork', 'Abu_Dhabi', 'super', 'cool']

In [35]:
weighted_sentiment #value of sentiment for 'super' is the fourth one - the only non zero here

array([0.      , 0.      , 0.      , 0.332001, 0.      ])
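
The value for 'super' can be checked by hand. The sketch below redoes the arithmetic in plain Python, with the numbers copied from the outputs above (weight matrix, one occurrence, compound score 0.5994); it is a verification aid, not a replacement for `weighted_sentiment_func`.

```python
# Values copied from the outputs above; 'super' is index 3 in the node list.
weight_matrix = [
    [0.11111111, 1.35, 0.16666667, 1.0, 0.5],
    [0.2, 0.16666667, 0.5, 0.0, 0.0],
    [0.33333333, 0.25, 0.0, 0.0, 0.0],
    [0.125, 0.44444444, 0.2, 0.0, 1.0],
    [0.14285714, 0.625, 0.25, 0.0, 0.0],
]
idx, occurrences, compound = 3, 1, 0.5994

row = weight_matrix[idx]                  # outgoing edges of 'super'
col = [r[idx] for r in weight_matrix]     # incoming edges of 'super'
edge_sum = sum(row) + sum(col)
edge_count = sum(1 for w in row if w != 0) + sum(1 for w in col if w != 0)
mean_weight = edge_sum / max(edge_count, 1)   # guard against nodes without edges
weighted = mean_weight * occurrences * compound
print(round(weighted, 6))
```

Row and column of 'super' are identical in both weight matrices, which is why the full run and the run restricted to ner_list=["super"] give the same value for this node.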

# sentiment of the whole text

Open question: should the sum be normalized? For now the plain sum of the weighted node sentiments is used.

In [36]:
np.sum(weighted_sentiment)

0.332001

In [37]:
def calculate_sentiment_of_text(weighted_sentiment, threshold=0.05, output_number=False):
  sum_sentiment=np.sum(weighted_sentiment)
  if output_number:
    if sum_sentiment>threshold:
      return 1
    elif sum_sentiment<-threshold:
      return -1
    else:
      return 0
  else:
    if sum_sentiment>threshold:
      return "positive"
    elif sum_sentiment<-threshold:
      return "negative"
    else:
      return "neutral"

In [38]:
calculate_sentiment_of_text(weighted_sentiment)

'positive'

In [39]:
calculate_sentiment_of_text([-1,2,-4])

'negative'

In [40]:
calculate_sentiment_of_text([0.5, -0.45], output_number=True)

0
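
The last call returns 0 because the comparisons are strict: a sum that only reaches the threshold (here about 0.05) still falls into the neutral band. A minimal sketch of the same rule, with the hypothetical name `classify`:

```python
def classify(total, threshold=0.05):
    """Map a summed score to 1, -1 or 0 using a symmetric neutral band."""
    if total > threshold:
        return 1
    if total < -threshold:
        return -1
    return 0

total = 0.5 + -0.45           # about 0.05 in floating point
print(classify(total))        # inside the neutral band
print(classify(-1 + 2 - 4))   # clearly below -threshold
```

With threshold=0.0 the neutral band shrinks to sums that are exactly zero.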

# run all at once

In [41]:
def graph_sentiment_analysis(text, nlp, lemmatization=False, max_distance=20, ner_list=[], compound=False, output_number=False, calculate_overall_score=1, threshold=0.05, max_nodes=0):
  words_all, text_all, ners_all, words_dict = get_words_for_nodes(text, lemmatization=lemmatization, nlp=nlp, max_nodes=max_nodes)
  weight_matrix_all, occ_list_all=get_weights_of_edges(text_all, words_all, max_distance=max_distance, ner_list=ner_list)
  sentiment_scores_all=calculate_sentiment_for_nodes(words_all, compound=compound)
  weighted_sentiment_all=weighted_sentiment_func(weight_matrix_all, occ_list_all, sentiment_scores_all, words=words_all, ner_list=ner_list)
  if calculate_overall_score==1:
    return calculate_sentiment_of_text(weighted_sentiment_all, output_number=output_number, threshold=threshold)
  elif calculate_overall_score==0:
    return dict(zip(words_all, weighted_sentiment_all))
  else:
    words_all.append("overall_sentiment")
    weighted_sentiment_all=list(weighted_sentiment_all)
    weighted_sentiment_all.append(calculate_sentiment_of_text(weighted_sentiment_all, output_number=True))
    return dict(zip(words_all, weighted_sentiment_all))

In [42]:
graph_sentiment_analysis("you are liar, Tom", calculate_overall_score=1, nlp=nlp)

'negative'

In [43]:
graph_sentiment_analysis("you are liar, Tom", calculate_overall_score=0, nlp=nlp)

{'liar': -0.75, 'Tom': 0.0, 'are': 0.0}

In [44]:
graph_sentiment_analysis("you are liar, Tom", calculate_overall_score=3, nlp=nlp, max_nodes=2)

{'liar': -0.5, 'Tom': 0.0, 'overall_sentiment': -1}

In [45]:
graph_sentiment_analysis("you are liar, Tom", calculate_overall_score=3, nlp=nlp)

{'liar': -0.75, 'Tom': 0.0, 'are': 0.0, 'overall_sentiment': -1}

In [46]:
graph_sentiment_analysis("you are liar, Tom", calculate_overall_score=3, ner_list=['Tom', 'liar'], nlp=nlp)

{'liar': -0.75, 'Tom': 0.0, 'are': 0.0, 'overall_sentiment': -1}

# tests for twitter sentiment extraction dataset

https://www.kaggle.com/competitions/tweet-sentiment-extraction/data

In [None]:
import pandas as pd

In [None]:
test_tse_data=pd.read_csv("/source_repository/other_datasets_for_tests/tweet-sentiment-extraction/test.csv")

In [None]:
test_tse_data

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive
...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative
3530,416863ce47,All alone in this old house again. Thanks for...,positive
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive


In [None]:
test_tse_data["predicted_sentiment"]=''

In [None]:
test_tse_data

Unnamed: 0,textID,text,sentiment,predicted_sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,
3,01082688c6,happy bday!,positive,
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative,
3530,416863ce47,All alone in this old house again. Thanks for...,positive,
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative,
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive,


In [None]:
test_tse_data.shape

(3534, 4)

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_tse_data.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_tse_data['text'][i]
  # .loc avoids chained assignment, which may silently write to a copy
  test_tse_data.loc[i, 'predicted_sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp)

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400


In [None]:
test_tse_data

Unnamed: 0,textID,text,sentiment,predicted_sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative,negative
3,01082688c6,happy bday!,positive,neutral
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive,positive
...,...,...,...,...
3529,e5f0e6ef4b,"its at 3 am, im very tired but i can`t sleep ...",negative,negative
3530,416863ce47,All alone in this old house again. Thanks for...,positive,positive
3531,6332da480c,I know what you mean. My little dog is sinkin...,negative,negative
3532,df1baec676,_sutra what is your next youtube video gonna b...,positive,positive


In [None]:
accuracy=np.sum(test_tse_data['sentiment']==test_tse_data['predicted_sentiment'])/test_tse_data.shape[0]

In [None]:
from sklearn.metrics import balanced_accuracy_score

In [None]:
ba_tse=balanced_accuracy_score(test_tse_data['sentiment'], test_tse_data['predicted_sentiment'])

In [None]:
accuracy, ba_tse

(0.6069609507640068, 0.6016598994840155)
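
For reference, balanced accuracy is the mean of the per-class recalls, so it can differ from plain accuracy when the classes are imbalanced. The sketch below recomputes it in plain Python on hypothetical toy labels (not taken from the dataset); `balanced_accuracy` is an illustrative helper, not the scikit-learn function.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)

# Hypothetical toy labels, only to show the difference between the two metrics.
y_true = ["positive", "positive", "positive", "negative", "neutral", "neutral"]
y_pred = ["positive", "positive", "neutral", "negative", "positive", "neutral"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, balanced_accuracy(y_true, y_pred))
```

Here the recalls are 2/3, 1 and 1/2, so the balanced accuracy is 13/18 while the plain accuracy is 4/6.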

## compound=True

In [None]:
test_tse_data["predicted_sentiment_compound"]=''

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_tse_data.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_tse_data['text'][i]
  # .loc avoids chained assignment, which may silently write to a copy
  test_tse_data.loc[i, 'predicted_sentiment_compound']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, compound=True)

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400


In [None]:
accuracy_compound=np.sum(test_tse_data['sentiment']==test_tse_data['predicted_sentiment_compound'])/test_tse_data.shape[0]

In [None]:
ba_tse_compound=balanced_accuracy_score(test_tse_data['sentiment'], test_tse_data['predicted_sentiment_compound'])

In [None]:
test_tse_data.to_csv('/source_repository/test_tse_data_graph.csv', index=False)

In [None]:
accuracy_compound, ba_tse_compound

(0.6078098471986417, 0.5959738357743798)

# tests for news sentiment

https://www.kaggle.com/datasets/hoshi7/news-sentiment-dataset

In [None]:
test_ns_data2=pd.read_csv("/source_repository/other_datasets_for_tests/news_sentiment_dataset/Sentiment_dataset.csv")
test_ns_data2

Unnamed: 0,news_title,reddit_title,sentiment,text,url
0,Mark Cuban launches generic drug company,Billionaire Mark Cuban just launched a drug co...,1.0,Billionaire investor and Shark Tank star Mark ...,https://www.beckershospitalreview.com/pharmacy...
1,From Defendant to Defender: One Wrongfully Con...,"Man falsely imprisoned for 10 years, uses pris...",1.0,Attorney Jarrett Adams recently helped overtur...,https://www.nbcnews.com/news/us-news/defendant...
2,"Amazon Tribe Wins Lawsuit Against Big Oil, Sav...",Amazon tribe wins legal battle against oil com...,1.0,The Amazon Rainforest is well known across the...,https://www.disclose.tv/amazon-tribe-wins-laws...
3,Newark police: No officer fired a single shot ...,Newark police: No officer fired a single shot ...,1.0,Newark police: No officer fired a single shot ...,https://newjersey.news12.com/newark-police-no-...
4,Ingen barn døde i trafikken i 2019,No children died in traffic accidents in Norwa...,1.0,I 1970 døde det 560 mennesker i den norske tra...,https://www.nrk.no/trondelag/ingen-barn-dode-i...
...,...,...,...,...,...
843,Dee Why attack: Man allegedly choked and threa...,Dee Why attack: Man allegedly choked and threa...,0.0,Frightening details have emerged about a toile...,https://www.9news.com.au/2018/11/30/17/55/sydn...
844,Africa: Children and HIV/Aids - 'We Need to Ta...,Africa: Children and HIV/Aids - 'We Need to Ta...,0.0,"interview\n\nJohannesburg — 360,000 adolescent...",https://allafrica.com/stories/201811300567.html
845,Terrorism suspected in Eilat attack,Terrorism suspected in Eilat attack,0.0,A violent attack in the southern Israeli port ...,http://www.israelnationalnews.com/News/News.as...
846,Anti-Semitism never disappeared in Europe. It'...,Anti-Semitism never disappeared in Europe. It'...,0.0,"It's a 17-year-old boy, too frightened to wear...",https://edition.cnn.com/2018/11/27/europe/anti...


In [None]:
test_ns_data2['sentiment'].unique() #1 positive, 0 negative

array([1., 0.])

In [None]:
def map_values(x):
  if x==1:
    return 'positive'
  elif x==0:
    return 'negative'
  else:
    return 'neutral'

test_ns_data2['sentiment']=test_ns_data2['sentiment'].apply(map_values)

In [None]:
test_ns_data2["predicted_sentiment"]=''
test_ns_data2

Unnamed: 0,news_title,reddit_title,sentiment,text,url,predicted_sentiment
0,Mark Cuban launches generic drug company,Billionaire Mark Cuban just launched a drug co...,positive,Billionaire investor and Shark Tank star Mark ...,https://www.beckershospitalreview.com/pharmacy...,
1,From Defendant to Defender: One Wrongfully Con...,"Man falsely imprisoned for 10 years, uses pris...",positive,Attorney Jarrett Adams recently helped overtur...,https://www.nbcnews.com/news/us-news/defendant...,
2,"Amazon Tribe Wins Lawsuit Against Big Oil, Sav...",Amazon tribe wins legal battle against oil com...,positive,The Amazon Rainforest is well known across the...,https://www.disclose.tv/amazon-tribe-wins-laws...,
3,Newark police: No officer fired a single shot ...,Newark police: No officer fired a single shot ...,positive,Newark police: No officer fired a single shot ...,https://newjersey.news12.com/newark-police-no-...,
4,Ingen barn døde i trafikken i 2019,No children died in traffic accidents in Norwa...,positive,I 1970 døde det 560 mennesker i den norske tra...,https://www.nrk.no/trondelag/ingen-barn-dode-i...,
...,...,...,...,...,...,...
843,Dee Why attack: Man allegedly choked and threa...,Dee Why attack: Man allegedly choked and threa...,negative,Frightening details have emerged about a toile...,https://www.9news.com.au/2018/11/30/17/55/sydn...,
844,Africa: Children and HIV/Aids - 'We Need to Ta...,Africa: Children and HIV/Aids - 'We Need to Ta...,negative,"interview\n\nJohannesburg — 360,000 adolescent...",https://allafrica.com/stories/201811300567.html,
845,Terrorism suspected in Eilat attack,Terrorism suspected in Eilat attack,negative,A violent attack in the southern Israeli port ...,http://www.israelnationalnews.com/News/News.as...,
846,Anti-Semitism never disappeared in Europe. It'...,Anti-Semitism never disappeared in Europe. It'...,negative,"It's a 17-year-old boy, too frightened to wear...",https://edition.cnn.com/2018/11/27/europe/anti...,


In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_ns_data2.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_ns_data2['text'][i]
  # .loc avoids chained assignment, which may silently write to a copy
  test_ns_data2.loc[i, 'predicted_sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, threshold=0.0)

0
200
400
600
800


In [None]:
accuracy_ns=np.sum(test_ns_data2['sentiment']==test_ns_data2['predicted_sentiment'])/test_ns_data2.shape[0]

In [None]:
ba_ns=balanced_accuracy_score(test_ns_data2['sentiment'], test_ns_data2['predicted_sentiment'])



In [None]:
accuracy_ns, ba_ns

(0.7570754716981132, 0.7626737967914439)

In [None]:
test_ns_data2['predicted_sentiment'].value_counts()

predicted_sentiment
positive    585
negative    254
neutral       9
Name: count, dtype: int64

## compound=True

In [None]:
test_ns_data2["predicted_sentiment_compound"]=''

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")
for i in range(test_ns_data2.shape[0]):
  if i%200==0:
    print(i)
  text_to_check=test_ns_data2['text'][i]
  # .loc avoids chained assignment, which may silently write to a copy
  test_ns_data2.loc[i, 'predicted_sentiment_compound']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, threshold=0.0, compound=True)

0
200
400
600
800


In [None]:
accuracy_ns_compound=np.sum(test_ns_data2['sentiment']==test_ns_data2['predicted_sentiment_compound'])/test_ns_data2.shape[0]

In [None]:
ba_ns_compound=balanced_accuracy_score(test_ns_data2['sentiment'], test_ns_data2['predicted_sentiment_compound'])



In [None]:
accuracy_ns_compound,ba_ns_compound

(0.7382075471698113, 0.7649732620320856)

In [None]:
test_ns_data2.to_csv('/source_repository/test_ns_data2.csv', index=False)

In [None]:
test_ns_data2['predicted_sentiment_compound'].value_counts()

predicted_sentiment_compound
positive    563
negative    276
neutral       9
Name: count, dtype: int64

# check which ones are neutral

Most of them are not in English, so I should check the data at the beginning and keep only the English texts for my analysis, since the Actaware data are all in English.

https://github.com/fedelopez77/langdetect

In [None]:
test_ns_data2[test_ns_data2['predicted_sentiment_compound']=='neutral']

Unnamed: 0,news_title,reddit_title,sentiment,text,url,predicted_sentiment,predicted_sentiment_compound
63,Görme Engelli Kızına 4 Yıl Boyunca Notlarını O...,Turkish mom who read lecture notes for four ye...,positive,Üniversite birçok öğrenci için kazandıktan son...,https://listelist.com/sakarya-gorme-engelli-an...,neutral,neutral
104,Incendie à Notre-Dame : la famille Pinault déb...,French billionaire François-Henri Pinault pled...,positive,La famille Pinault va débloquer cent millions ...,http://www.lefigaro.fr/flash-actu/notre-dame-d...,neutral,neutral
159,Монголия приняла меры по облегчению положения ...,"Mongolia will pay for electricity, water, heat...",positive,13 декабря состоялось внеочередное заседание к...,http://www.mongolnow.com/mongoliya-prinyala-me...,neutral,neutral
187,Natale: 94enne solo a casa chiama Cc per fare ...,"94 years old man calls the police: ""I got ever...",positive,"(ANSA) - BOLOGNA, 25 DIC - Ha telefonato ai Ca...",https://www.ansa.it/amp/emiliaromagna/notizie/...,neutral,neutral
237,Costco raising minimum wage to $14 an hour,Costco raising minimum wage to $14 an hour,positive,Retail giant Costco said Thursday that it woul...,http://thehill.com/blogs/blog-briefing-room/39...,neutral,neutral
500,"Volunteers remove 9,208 lbs. of trash from Ten...","Volunteers remove 9,208 lbs. of trash from Ten...",positive,"HUMPHREYS COUNTY, Tenn. – Keep the Tennessee R...",https://whnt.com/news/volunteers-remove-9208-l...,neutral,neutral
752,Больше половины краж в Казахстане не раскрываю...,More than half of thefts in Kazakstan are not ...,negative,"По данным Министерства внутренних дел, в 2017 ...",https://tengrinews.kz/crime/bolshe-polovinyi-k...,neutral,neutral
766,Мать бросила младенца на обочине дороги в спор...,Mother threw the baby on the side of the road ...,negative,В Туркестанской области женщина бросила новоро...,https://www.nur.kz/1746113-mat-brosila-mladenc...,neutral,neutral
771,Издевательства мальчика над женщиной в Капшага...,Bullying of a boy over a woman in Kapshagai: t...,negative,В ДВД Алматинской области прокомментировали ви...,https://tengrinews.kz/kazakhstan_news/izdevate...,neutral,neutral


## check if english

In [None]:
pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993227 sha256=2d7c2685c02d4e286fe70834b2cdd5a9ea6feb6c3c6be642a756373faceaf858
  Stored in directory: /root/.cache/pip/wheels/95/03/7d/59ea870c70ce4e5a370638b5462a7711ab78fba2f655d05106
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [None]:
from langdetect import detect

In [None]:
def is_english(text):
  try:
    lang = detect(text)
    return lang == 'en'
  except Exception: # e.g. empty text or text without detectable features
    return False

### ns dataset

In [None]:
test_ns_data2.loc[63]['text']

'Üniversite birçok öğrenci için kazandıktan sonra rahatlayacağını düşündüğü bir aşama. Gençliğin tam olarak yaşandığı, özgürlüğün tadıldığı ve derslerin biraz ikinci planda kaldığı bu evre bazıları için hedeflerine ulaşacakları zorlu bir yol. Kocaeli’nde yaşayan 22 yaşındaki Berru Merve Kul da 4 yıl önce Sakarya Üniversitesi’ni kazandı ancak görme engelli olduğu için önünde zorlu ve uzun bir yol vardı. Neyse ki annesi 4 yıl boyunca olduğu gibi mezun olurken de yanındaydı…\n\n22 yaşındaki Berru Merve Kul da 4 yıl önce Sakarya Üniversitesi Hukuk Fakültesi’ni kazandı. Ancak görme engelli olduğu için üniversite yaşamı onun için diğer öğrencilere göre biraz daha zordu\n\n4 yıllık eğitim hayatı boyunca onun eli ayağı olan kişi ise annesiydi. Annesi Havva Kul 4 yıl boyunca kızına tüm notlarını, kitaplarını okuyarak ödevlerini yapmasını, sınavlardan başarıyla geçmesini sağladı\n\nVe 4 yılın ardından Berru Merve Kul okuldan başarıyla mezun oldu. Sakarya Üniversitesi Hukuk Fakültesi Binasında ya

In [None]:
is_english(test_ns_data2.loc[63]['text'])

False

In [None]:
test_ns_data_english=test_ns_data2[test_ns_data2['text'].apply(is_english)]

In [None]:
test_ns_data_english

Unnamed: 0,news_title,reddit_title,sentiment,text,url,predicted_sentiment,predicted_sentiment_compound
0,Mark Cuban launches generic drug company,Billionaire Mark Cuban just launched a drug co...,positive,Billionaire investor and Shark Tank star Mark ...,https://www.beckershospitalreview.com/pharmacy...,positive,positive
1,From Defendant to Defender: One Wrongfully Con...,"Man falsely imprisoned for 10 years, uses pris...",positive,Attorney Jarrett Adams recently helped overtur...,https://www.nbcnews.com/news/us-news/defendant...,negative,positive
2,"Amazon Tribe Wins Lawsuit Against Big Oil, Sav...",Amazon tribe wins legal battle against oil com...,positive,The Amazon Rainforest is well known across the...,https://www.disclose.tv/amazon-tribe-wins-laws...,positive,positive
3,Newark police: No officer fired a single shot ...,Newark police: No officer fired a single shot ...,positive,Newark police: No officer fired a single shot ...,https://newjersey.news12.com/newark-police-no-...,negative,negative
5,"Budweiser will sit out Super Bowl, funneling m...","Budweiser will sit out Super Bowl, funneling m...",positive,Budweiser will not be running a commercial dur...,https://www.cnbc.com/2021/01/25/super-bowl-bud...,positive,positive
...,...,...,...,...,...,...,...
843,Dee Why attack: Man allegedly choked and threa...,Dee Why attack: Man allegedly choked and threa...,negative,Frightening details have emerged about a toile...,https://www.9news.com.au/2018/11/30/17/55/sydn...,negative,negative
844,Africa: Children and HIV/Aids - 'We Need to Ta...,Africa: Children and HIV/Aids - 'We Need to Ta...,negative,"interview\n\nJohannesburg — 360,000 adolescent...",https://allafrica.com/stories/201811300567.html,positive,positive
845,Terrorism suspected in Eilat attack,Terrorism suspected in Eilat attack,negative,A violent attack in the southern Israeli port ...,http://www.israelnationalnews.com/News/News.as...,negative,negative
846,Anti-Semitism never disappeared in Europe. It'...,Anti-Semitism never disappeared in Europe. It'...,negative,"It's a 17-year-old boy, too frightened to wear...",https://edition.cnn.com/2018/11/27/europe/anti...,negative,negative


In [None]:
accuracy_ns_compound_english=np.sum(test_ns_data_english['sentiment']==test_ns_data_english['predicted_sentiment_compound'])/test_ns_data_english.shape[0]

In [None]:
ba_ns_compound_english=balanced_accuracy_score(test_ns_data_english['sentiment'], test_ns_data_english['predicted_sentiment_compound'])



In [None]:
accuracy_ns_compound_english, ba_ns_compound_english

(0.7458233890214797, 0.7801174228195389)

In [None]:
test_ns_data_english['predicted_sentiment_compound'].value_counts()

predicted_sentiment_compound
positive    562
negative    274
neutral       2
Name: count, dtype: int64

In [None]:
test_ns_data2[~test_ns_data2['text'].apply(is_english)] # seems to perform well for longer texts

Unnamed: 0,news_title,reddit_title,sentiment,text,url,predicted_sentiment,predicted_sentiment_compound
4,Ingen barn døde i trafikken i 2019,No children died in traffic accidents in Norwa...,positive,I 1970 døde det 560 mennesker i den norske tra...,https://www.nrk.no/trondelag/ingen-barn-dode-i...,positive,positive
63,Görme Engelli Kızına 4 Yıl Boyunca Notlarını O...,Turkish mom who read lecture notes for four ye...,positive,Üniversite birçok öğrenci için kazandıktan son...,https://listelist.com/sakarya-gorme-engelli-an...,neutral,neutral
104,Incendie à Notre-Dame : la famille Pinault déb...,French billionaire François-Henri Pinault pled...,positive,La famille Pinault va débloquer cent millions ...,http://www.lefigaro.fr/flash-actu/notre-dame-d...,neutral,neutral
159,Монголия приняла меры по облегчению положения ...,"Mongolia will pay for electricity, water, heat...",positive,13 декабря состоялось внеочередное заседание к...,http://www.mongolnow.com/mongoliya-prinyala-me...,neutral,neutral
187,Natale: 94enne solo a casa chiama Cc per fare ...,"94 years old man calls the police: ""I got ever...",positive,"(ANSA) - BOLOGNA, 25 DIC - Ha telefonato ai Ca...",https://www.ansa.it/amp/emiliaromagna/notizie/...,neutral,neutral
458,Coronavirus-Pandemie: Bosch erfindet eigenen C...,German company Bosch produces 95% accurate tes...,positive,Schneller und sicherer auf das Virus testen – ...,https://www.faz.net/aktuell/wirtschaft/digitec...,negative,negative
538,"Coronavirus, a Rimini guarisce anziano di 101 ...","101 year old man, born during the Spanish flu,...",positive,Nemmeno a 101 anni il futuro è scritto.Non lo ...,http://www.today.it/attualita/coronavirus-guar...,negative,negative
752,Больше половины краж в Казахстане не раскрываю...,More than half of thefts in Kazakstan are not ...,negative,"По данным Министерства внутренних дел, в 2017 ...",https://tengrinews.kz/crime/bolshe-polovinyi-k...,neutral,neutral
766,Мать бросила младенца на обочине дороги в спор...,Mother threw the baby on the side of the road ...,negative,В Туркестанской области женщина бросила новоро...,https://www.nur.kz/1746113-mat-brosila-mladenc...,neutral,neutral
771,Издевательства мальчика над женщиной в Капшага...,Bullying of a boy over a woman in Kapshagai: t...,negative,В ДВД Алматинской области прокомментировали ви...,https://tengrinews.kz/kazakhstan_news/izdevate...,neutral,neutral


### tse dataset

In [None]:
test_tse_data_english=test_tse_data[test_tse_data['text'].apply(is_english)]

In [None]:
accuracy_tse_compound_english=np.sum(test_tse_data_english['sentiment']==test_tse_data_english['predicted_sentiment_compound'])/test_tse_data_english.shape[0]

In [None]:
ba_tse_compound_english=balanced_accuracy_score(test_tse_data_english['sentiment'], test_tse_data_english['predicted_sentiment_compound'])

In [None]:
accuracy_tse_compound_english, ba_tse_compound_english #in this dataset nothing changes

(0.6047370039987696, 0.59453124805803)

# Test on ACTAWARE data - chosen subset

In [None]:
path_list='/source_repository/chosen_articles.txt'
df_my=pd.read_csv("/source_repository/articles_categories_my_gt_2.csv", sep=';')
# One article per line; readlines() keeps the trailing newline of each line
with open(path_list, 'r') as file:
  list_of_contents_new = file.readlines()

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")

In [None]:
actaware_df=pd.DataFrame({'Text': list_of_contents_new, 'Sentiment': ""})
actaware_df.head()

Unnamed: 0,Text,Sentiment
0,People work in the Amazon Fulfillment Center i...,
1,A federal agency is seeking to force Starbucks...,
2,You might have seen a new energy drink on Amaz...,
3,The BBC's director-general has tried to calm t...,
4,Amazon is running a competition to give its br...,


In [None]:
for i in range(len(list_of_contents_new)):
  if i%200==0:
    print(i)  # progress indicator
  text_to_check=list_of_contents_new[i]
  # .loc avoids chained assignment, which may silently write to a copy
  actaware_df.loc[i, 'Sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, threshold=0.0, compound=True)

0


In [None]:
actaware_df.to_csv('/source_repository/actaware_df_with_sentiment_graph_chosen_articles.csv', index=False)

In [None]:
actaware_df.head()

Unnamed: 0,Text,Sentiment
0,People work in the Amazon Fulfillment Center i...,negative
1,A federal agency is seeking to force Starbucks...,positive
2,You might have seen a new energy drink on Amaz...,positive
3,The BBC's director-general has tried to calm t...,positive
4,Amazon is running a competition to give its br...,positive


In [None]:
balanced_accuracy_score(df_my['Sentiment'], actaware_df['Sentiment'])

0.5584921614333379

# Experiments

In [49]:
def run_actaware(file_name, df_my, nlp):
  """Predict sentiment for every article in {file_name}.txt, save the predictions, and return balanced accuracy against df_my."""
  path_list=f'/source_repository/{file_name}.txt'
  with open(path_list, 'r') as file:
    list_of_contents_new = file.readlines()
  actaware_df=pd.DataFrame({'Text': list_of_contents_new, 'Sentiment': ""})
  for i in range(len(list_of_contents_new)):
    text_to_check=list_of_contents_new[i]
    # .loc avoids chained assignment, which may silently write to a copy
    actaware_df.loc[i, 'Sentiment']=graph_sentiment_analysis(text_to_check, calculate_overall_score=1, nlp=nlp, threshold=0.0, compound=True)
  actaware_df.to_csv(f'/source_repository/actaware_df_with_sentiment_graph_{file_name}.csv', index=False)
  return balanced_accuracy_score(df_my['Sentiment'], actaware_df['Sentiment'])

In [50]:
df_my=pd.read_csv("/source_repository/articles_categories_my_gt_2.csv", sep=';')
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("merge_noun_chunks")

In [51]:
list_names_scores_iter_time=[]
file_names=['chosen_articles', 'chosen_articles_cleaned_4o', 'chosen_articles_cleaned_by_me', 'chosen_articles_cleaned_regex']
# Run every input variant three times to check score stability and to time each run
for file_name in file_names:
  for i in range(3):
    print(i, file_name)
    start_time=time.time()
    ba_1=run_actaware(file_name, df_my, nlp)
    end_time=time.time()
    print(ba_1)
    list_names_scores_iter_time.append([file_name, ba_1, i, end_time-start_time])


0 chosen_articles
0.5584921614333379
1 chosen_articles
0.5584921614333379
2 chosen_articles
0.5584921614333379
0 chosen_articles_cleaned_4o
0.6049179578591343
1 chosen_articles_cleaned_4o
0.6049179578591343
2 chosen_articles_cleaned_4o
0.6049179578591343
0 chosen_articles_cleaned_by_me
0.5660679190090955
1 chosen_articles_cleaned_by_me
0.5660679190090955
2 chosen_articles_cleaned_by_me
0.5660679190090955
0 chosen_articles_cleaned_regex
0.5660679190090955
1 chosen_articles_cleaned_regex
0.5660679190090955
2 chosen_articles_cleaned_regex
0.5660679190090955


In [52]:
list_names_scores_iter_time

[['chosen_articles', 0.5584921614333379, 0, 20.226020574569702],
 ['chosen_articles', 0.5584921614333379, 1, 18.058598279953003],
 ['chosen_articles', 0.5584921614333379, 2, 17.96317481994629],
 ['chosen_articles_cleaned_4o', 0.6049179578591343, 0, 14.947876214981079],
 ['chosen_articles_cleaned_4o', 0.6049179578591343, 1, 13.61849570274353],
 ['chosen_articles_cleaned_4o', 0.6049179578591343, 2, 13.132906436920166],
 ['chosen_articles_cleaned_by_me', 0.5660679190090955, 0, 16.332794904708862],
 ['chosen_articles_cleaned_by_me', 0.5660679190090955, 1, 15.651582479476929],
 ['chosen_articles_cleaned_by_me', 0.5660679190090955, 2, 15.521533966064453],
 ['chosen_articles_cleaned_regex', 0.5660679190090955, 0, 16.535138607025146],
 ['chosen_articles_cleaned_regex', 0.5660679190090955, 1, 15.682967185974121],
 ['chosen_articles_cleaned_regex', 0.5660679190090955, 2, 15.5935959815979]]
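
The raw list above is easier to compare once it is grouped per input file. A minimal sketch that does this with the standard library only; the rows are copied from the output above, while the variable names (`results`, `scores`, `times`, `summary`) are illustrative assumptions:

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the output above: [file_name, balanced_accuracy, iteration, seconds]
results = [
    ['chosen_articles', 0.5584921614333379, 0, 20.226020574569702],
    ['chosen_articles', 0.5584921614333379, 1, 18.058598279953003],
    ['chosen_articles', 0.5584921614333379, 2, 17.96317481994629],
    ['chosen_articles_cleaned_4o', 0.6049179578591343, 0, 14.947876214981079],
    ['chosen_articles_cleaned_4o', 0.6049179578591343, 1, 13.61849570274353],
    ['chosen_articles_cleaned_4o', 0.6049179578591343, 2, 13.132906436920166],
    ['chosen_articles_cleaned_by_me', 0.5660679190090955, 0, 16.332794904708862],
    ['chosen_articles_cleaned_by_me', 0.5660679190090955, 1, 15.651582479476929],
    ['chosen_articles_cleaned_by_me', 0.5660679190090955, 2, 15.521533966064453],
    ['chosen_articles_cleaned_regex', 0.5660679190090955, 0, 16.535138607025146],
    ['chosen_articles_cleaned_regex', 0.5660679190090955, 1, 15.682967185974121],
    ['chosen_articles_cleaned_regex', 0.5660679190090955, 2, 15.5935959815979],
]

scores = defaultdict(set)   # distinct balanced-accuracy values per file
times = defaultdict(list)   # run times in seconds per file
for file_name, ba, _iteration, seconds in results:
    scores[file_name].add(ba)
    times[file_name].append(seconds)

# One entry per file: (distinct scores, mean run time)
summary = {name: (scores[name], mean(times[name])) for name in times}
for name, (ba_values, mean_seconds) in summary.items():
    print(f"{name}: balanced accuracy {sorted(ba_values)}, mean time {mean_seconds:.2f} s")
```

Each file yields a single distinct score across its three runs, so the repetitions only add information about run time.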

In [4]:
with open('/source_repository/graph_sentiment_results_chosen_actaware.txt', 'w') as f:
    for line in list_names_scores_iter_time:
        f.write(f"{line}\n")
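
Because each row is written as the `repr` of a Python list, the file can be parsed back with `ast.literal_eval`. A minimal round-trip sketch, assuming a temporary file instead of the notebook's path and a two-row sample of the results (`rows`, `path` and `loaded` are illustrative names):

```python
import ast
import os
import tempfile

# Two sample rows in the same shape as list_names_scores_iter_time
rows = [
    ['chosen_articles', 0.5584921614333379, 0, 20.226020574569702],
    ['chosen_articles_cleaned_4o', 0.6049179578591343, 0, 14.947876214981079],
]

# Write one list per line, exactly as in the cell above (temporary path is an assumption)
path = os.path.join(tempfile.mkdtemp(), 'results.txt')
with open(path, 'w') as f:
    for line in rows:
        f.write(f"{line}\n")

# Parse each non-empty line back into a Python list
with open(path, 'r') as f:
    loaded = [ast.literal_eval(line) for line in f if line.strip()]

print(loaded == rows)
```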