# [ATTENTION] PoC notebook. The final version is the aspect_based_sentiment_analysis notebook.

# Imports

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForTokenClassification, pipeline
from lib.sentiment_analysis_utils import combine_lede_and_text, remove_text_formatting, read_all_news_in_dir
import os

In [None]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

For efficiency, computations run on the GPU when one is available, and fall back to the CPU otherwise.

# Load and preprocess data

In [None]:
df_en_raw = read_all_news_in_dir(os.getcwd() + "/../data_preparation/raw_data/en/")
df_en_raw

The above function iterates over the files in the specified directory, reading them and combining their content into a single dataframe.

In [None]:
df_en_raw = combine_lede_and_text(df_en_raw)
df_en_raw = remove_text_formatting(df_en_raw)
df_en_raw

The above functions perform two tasks.

The first function combines the lede with the rest of the article text. Data from STA is organized so that the first paragraph of an article is stored separately from the rest of the text: the first paragraph is free to read, while the rest is usually displayed only under a paid subscription. We want to analyze both, so we combine them.

The second function removes the text formatting attributes we observed in the data (e.g. the \n\n that separates article paragraphs, or HTML formatting tags like <b> </b> used to display text in bold).
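
The two preprocessing steps can be illustrated at the level of a single article. This is only a minimal sketch, not the code from `lib.sentiment_analysis_utils`: the helper names `combine_lede_and_body` and `strip_formatting` are hypothetical, and the sketch assumes the lede and the body are plain strings.

```python
import re


def combine_lede_and_body(lede, body):
    """Join the free first paragraph (lede) with the rest of the article."""
    return f"{lede.strip()} {body.strip()}".strip()


def strip_formatting(text):
    """Remove HTML tags and collapse paragraph breaks into single spaces."""
    text = re.sub(r"<[^>]+>", "", text)  # drop tags such as <b> and </b>
    text = re.sub(r"\s+", " ", text)     # "\n\n" and runs of spaces become one space
    return text.strip()


# Made-up example text, not taken from the STA data.
whole_text = strip_formatting(
    combine_lede_and_body("<b>NATO</b> summit opens.", "Leaders met.\n\nTalks continue.")
)
```

Applied row by row (for example with `DataFrame.apply`), helpers like these would produce a single cleaned text column such as `whole_text`.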

# Classify polarity (keywords + named entities as aspects)

In [None]:
# Take an explicit copy so that columns added later do not trigger SettingWithCopyWarning
df_en_raw = df_en_raw.head(10).copy()

As this is just a PoC of the project, we limit the number of articles to reduce computation time. We want to test on a smaller batch of data before running our solution on something computationally expensive.

The model selected for the initial ABSA is available at: https://huggingface.co/yangheng/deberta-v3-large-absa-v1.1

In [None]:
# Prepare the classification pipeline on the selected device (GPU if available)
tokenizer = AutoTokenizer.from_pretrained("yangheng/deberta-v3-large-absa-v1.1")
model = AutoModelForSequenceClassification.from_pretrained("yangheng/deberta-v3-large-absa-v1.1")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)

The model selected for the initial NER is available at: https://huggingface.co/dslim/bert-base-NER

In [None]:
# Prepare the NER pipeline on the selected device (GPU if available)
tokenizer_ner = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model_ner = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

ner_pipeline = pipeline("ner", model=model_ner, tokenizer=tokenizer_ner, device=device)

The ABSA model requires us to specify the aspects towards which sentiment is measured. These must be provided as input along with the text to analyze. For this purpose, we use the keywords associated with each article, which are available in its metadata. We have seen that these are often reasonable aspects: sentiment towards them can be investigated and can provide interesting insights for the end user. However, we also noticed that they might not fully cover the issues mentioned in the article that employees from SLA might be interested in. To bring extra value to the project and potentially enrich the final results, we decided to extend it by using NER to extract additional aspects from the texts, towards which polarity is also scored.

In [None]:
# Score the sentiment towards every keyword and every named entity of each article.
keywords_sentiment = []
ner_sentiment = []
for _, row in df_en_raw.iterrows():
    # Aspects taken from the keywords in the article metadata.
    keywords = [keyword.lower() for keyword in row.keywords]
    keywords_sentiment.append(
        {aspect: classifier(row.whole_text, text_pair=aspect) for aspect in keywords}
    )

    # Aspects extracted from the article text by NER (duplicates dropped, order kept).
    ner_results = ner_pipeline(row.whole_text)
    ner_aspects = list(dict.fromkeys(result['word'] for result in ner_results))
    ner_sentiment.append(
        {aspect: classifier(row.whole_text, text_pair=aspect) for aspect in ner_aspects}
    )

# Assign whole columns at once instead of writing dicts cell by cell with .loc.
df_en_raw['keywords_sentiment'] = keywords_sentiment
df_en_raw['ner_sentiment'] = ner_sentiment


In [None]:
df_en_raw

In [None]:
df_en_raw.to_csv('example_results2.csv')
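
Dict-valued columns are written to CSV as their string representation, which is awkward to inspect or reload. One possible alternative is to flatten each dictionary into one row per aspect before saving. The sketch below is only an illustration: the helper name `flatten_aspect_sentiment` and the `article_id` field are hypothetical, the example values are made up, and it assumes each aspect maps to the list of `{'label', 'score'}` dicts that the text-classification pipeline returns.

```python
def flatten_aspect_sentiment(article_id, aspect_sentiment):
    """Turn {aspect: [{'label': ..., 'score': ...}]} into one flat row per aspect."""
    rows = []
    for aspect, predictions in aspect_sentiment.items():
        # Keep the highest-scoring prediction for the aspect.
        best = max(predictions, key=lambda prediction: prediction["score"])
        rows.append(
            {
                "article_id": article_id,
                "aspect": aspect,
                "label": best["label"],
                "score": best["score"],
            }
        )
    return rows


# Made-up values for illustration only, not real model output.
example = {
    "nato": [{"label": "Positive", "score": 0.91}],
    "inflation": [{"label": "Negative", "score": 0.78}],
}
rows = flatten_aspect_sentiment(0, example)
```

The resulting list of rows could be passed to `pd.DataFrame(rows)` and saved with `to_csv`, giving a long-format table with one line per article and aspect.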

The results, checked manually on a few examples, seem reasonable in the context of aspect-based sentiment analysis. (As with the document analysis, we are not permitted to share samples of the articles on which we evaluated.) However, the preliminary results revealed some errors in the NER step that we added. Some extracted entities are erroneous and do not even appear as words in the article's text, e.g. ##lo, Go, ##e, V. The ## prefix is the WordPiece continuation marker, so these are subword fragments returned token by token rather than whole entities. We will pay attention to this and try to improve it. (Still, some entities are recognized correctly, e.g. NATO and Slovenia.)
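
One possible way to reduce the fragment problem is to merge the token-level NER output back into whole entities before using them as aspects. The `transformers` NER pipeline offers this through its `aggregation_strategy` parameter (for example `aggregation_strategy="simple"`). The sketch below shows the idea in plain Python instead. It is a rough illustration rather than the approach used in this project: the helper name `merge_wordpieces` is hypothetical, the example tokens are made up, and it assumes the raw pipeline output with `word`, `entity` and `index` fields.

```python
def merge_wordpieces(ner_results):
    """Merge token-level NER output into whole entity strings."""
    entities = []
    prev_index = None  # token index of the last token that was kept
    for token in ner_results:
        word = token["word"]
        follows_previous = prev_index is not None and token["index"] == prev_index + 1
        if word.startswith("##"):
            if not follows_previous:
                # Orphan fragment with nothing to attach to: drop it.
                prev_index = None
                continue
            # Continuation of the previous word: glue it on without a space.
            entities[-1] += word[2:]
        elif token["entity"].startswith("I-") and follows_previous:
            # Next word of the same multi-word entity.
            entities[-1] += " " + word
        else:
            entities.append(word)
        prev_index = token["index"]
    return entities


# Made-up token-level output for illustration only.
tokens = [
    {"word": "Slovenia", "entity": "B-LOC", "index": 1},
    {"word": "Go", "entity": "B-PER", "index": 5},
    {"word": "##lo", "entity": "I-PER", "index": 6},
    {"word": "##b", "entity": "I-PER", "index": 7},
    {"word": "New", "entity": "B-LOC", "index": 10},
    {"word": "York", "entity": "I-LOC", "index": 11},
    {"word": "##e", "entity": "I-MISC", "index": 20},
]
merged = merge_wordpieces(tokens)
```

Here the fragments `Go`, `##lo` and `##b` are joined into a single entity, and the orphan `##e` is discarded. Filtering on the `score` field would be a further option for removing low-confidence entities.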