# Coursework IDA

## Task 1

## 1.1. 
Implement and train a method for automatically classifying texts in the FiQA sentiment analysis
dataset as positive, neutral or negative. Refer to the labs, lecture materials and textbook to identify
a suitable method. In your report:
• Briefly explain how your chosen method works and its main strengths and limitations;
• Describe the preprocessing steps and the features you use to represent each text instance;
• Explain why you chose those features and preprocessing steps and hypothesise how they
will affect your results;
• Briefly describe your software implementation.
(10 marks)



In [1]:
%load_ext autoreload
%autoreload 2

# HuggingFace's datasets library (the FiQA data itself is loaded from local JSON files below)
from datasets import load_dataset
import pandas as pd
import numpy as np
import json

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *

# pre trained analyser for comparison
from nltk.sentiment import SentimentIntensityAnalyzer

import matplotlib.pyplot as plt

# for negation
import re


In [2]:
train_files = [
    'data_cache/FiQA_ABSA_task1/task1_headline_ABSA_train.json',
    'data_cache/FiQA_ABSA_task1/task1_post_ABSA_train.json'
]

In [4]:


def load_fiqa_sa_from_json(json_files):
    train_text = []
    train_labels = []

    # iterate through each JSON file
    for file in json_files:
        # open the file in read mode; the with statement closes it once the data has been read
        with open(file, 'r', encoding = 'utf8') as handle:
            # parse the JSON content into a Python dictionary
            dataf = json.load(handle)
        
        
        dataf_text = [dataf[k]["sentence"] for k in dataf.keys()]
        # print(len(dataf_text))
        train_text.extend(dataf_text)

        dataf_labels = [float(dataf[k]["info"][0]["sentiment_score"]) for k in dataf.keys()]
        # print(len(dataf_labels))
        train_labels.extend(dataf_labels)

    train_text = np.array(train_text)
    train_labels = np.array(train_labels)
    
    return train_text, train_labels


def threshold_scores(scores):
    """
    Convert sentiment scores to discrete labels.
    0 = negative.
    1 = neutral.
    2 = positive.
    """
    labels = []
    for score in scores:
        if score < -0.25:
            labels.append(0)
        elif score > 0.32:
            labels.append(2)
        else:
            labels.append(1)
            
    return np.array(labels)


all_text, all_labels = load_fiqa_sa_from_json(train_files)
    
print(f'Number of instances: {len(all_text)}')
print(f'Number of labels: {len(all_labels)}')

all_labels = threshold_scores(all_labels)
print(f'Number of negative labels: {np.sum(all_labels==0)}')
print(f'Number of neutral labels: {np.sum(all_labels==1)}')
print(f'Number of positive labels: {np.sum(all_labels==2)}')

FileNotFoundError: [Errno 2] No such file or directory: 'data_cache/FiQA_ABSA_task1/task1_headline_ABSA_train.json'
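The thresholding rule in `threshold_scores` can also be written with NumPy boolean masks instead of a Python loop. The sketch below is only an illustration: the name `threshold_scores_vectorised` and its keyword arguments are assumptions, while the cut-offs are copied from the function above. It also makes the boundary behaviour visible: scores exactly equal to a cut-off stay neutral because both comparisons are strict.

```python
import numpy as np


def threshold_scores_vectorised(scores, neg_cutoff=-0.25, pos_cutoff=0.32):
    """Map sentiment scores to 0 (negative), 1 (neutral) or 2 (positive)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.ones(scores.shape, dtype=int)  # start with every instance as neutral
    labels[scores < neg_cutoff] = 0            # strictly below the negative cut-off
    labels[scores > pos_cutoff] = 2            # strictly above the positive cut-off
    return labels


example = threshold_scores_vectorised([-0.5, -0.25, 0.0, 0.32, 0.4])
print(example)  # [0 1 1 1 2]
```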

In [None]:
print(all_labels[0])

In [None]:
type(load_fiqa_sa_from_json(train_files))

In [None]:
print(len(load_fiqa_sa_from_json(train_files)[0]))

In [None]:

# Split test data from training data
train_documents, test_documents, train_labels, test_labels = train_test_split(
    all_text, 
    all_labels, 
    test_size=0.2, 
    stratify=all_labels  # make sure the same proportion of labels is in the test set and training set
)

# Split validation data from training data
train_documents, val_documents, train_labels, val_labels = train_test_split(
    train_documents, 
    train_labels, 
    test_size=0.15, 
    stratify=train_labels  # make sure the same proportion of labels is in the validation set and training set
)

print(f'Number of training instances = {len(train_documents)}')
print(f'Number of validation instances = {len(val_documents)}')
print(f'Number of test instances = {len(test_documents)}')
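Because the two `train_test_split` calls are chained, the second `test_size` applies to what remains after the first split rather than to the full dataset. A small sketch of the resulting proportions follows; the helper name `split_fractions` is hypothetical and not part of the notebook.

```python
def split_fractions(test_size=0.2, val_size=0.15):
    """Return the (train, validation, test) fractions of the full dataset."""
    remaining = 1.0 - test_size         # what is left after removing the test set
    val_fraction = remaining * val_size  # validation is taken from the remainder
    train_fraction = remaining - val_fraction
    return train_fraction, val_fraction, test_size


train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# train=0.68, val=0.12, test=0.20
```

So roughly 68% of the instances are used for fitting, 12% for model selection and 20% are held out for the final test.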


In [None]:
print(f'What does one instance look like from the training set? \n\n{train_documents[233]}')
print(f'...and here is its corresponding label \n\n{train_labels[233]}')

In [None]:


# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using NLTK's word_tokenize. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)

    
# load the English stop word list from nltk
stop_words = set(stopwords.words('english'))
vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer
# with stop word removal (scikit-learn expects a list here, not a set)
#vectorizer = CountVectorizer(tokenizer=Tokenizer(), stop_words=list(stop_words))  # construct the vectorizer

vectorizer.fit(train_documents)  # Learn the vocabulary
X_train = vectorizer.transform(train_documents)  # extract training set bags of words
X_val = vectorizer.transform(val_documents)  # extract validation set bags of words
X_test = vectorizer.transform(test_documents)  # extract test set bags of words
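To make the bag-of-words representation concrete, here is a minimal standard-library sketch of the fit/transform idea used above. It is not how `CountVectorizer` is implemented; `fit_vocabulary` and `to_count_vector` are hypothetical helpers, and whitespace splitting stands in for `word_tokenize`.

```python
from collections import Counter


def fit_vocabulary(tokenised_docs):
    """Assign a column index to every distinct token seen in the training documents."""
    vocab = sorted({token for doc in tokenised_docs for token in doc})
    return {token: idx for idx, token in enumerate(vocab)}


def to_count_vector(tokens, vocabulary):
    """Count how often each vocabulary token occurs; unseen tokens are ignored."""
    counts = Counter(token for token in tokens if token in vocabulary)
    return [counts[token] for token in vocabulary]  # dicts keep insertion order


train_docs = ["profit rises", "profit falls as costs rise"]
vocabulary = fit_vocabulary(doc.lower().split() for doc in train_docs)
print(vocabulary)
# {'as': 0, 'costs': 1, 'falls': 2, 'profit': 3, 'rise': 4, 'rises': 5}

vector = to_count_vector("profit profit surges".split(), vocabulary)
print(vector)  # [0, 0, 0, 2, 0, 0]
```

The word "surges" never appeared in the toy training documents, so it contributes nothing to the vector. The same happens to out-of-vocabulary words in the validation and test sets when the vocabulary is learned from the training split only.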

In [None]:
# print(nltk.data.path)

In [None]:
# see count vector from training set
counts = pd.DataFrame(X_train.toarray(), columns = vectorizer.get_feature_names_out())

In [None]:
X_train

In [None]:
counts.columns


## 1.2. Evaluate Method

Evaluate your method, then interpret and discuss your results. Include the following points:
• Define your performance metrics and state their limitations;
• Describe the testing procedure (e.g., how you used each split of the dataset);
• Show your results using suitable plots or tables;
• How could you improve the method or experimental process? Consider the errors that your
method makes.
(9 marks)
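As a reference for the metric definitions, the sketch below computes accuracy and macro-averaged precision, recall and F1 directly from a confusion matrix (rows are true labels, columns are predictions). The matrix values are made up for illustration and `macro_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision/recall/F1 from a square confusion matrix."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(n_classes)) / total

    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        true_pos = cm[c][c]
        predicted_c = sum(cm[r][c] for r in range(n_classes))  # column sum
        actual_c = sum(cm[c])                                   # row sum
        p = true_pos / predicted_c if predicted_c else 0.0
        r = true_pos / actual_c if actual_c else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    # Macro averaging gives every class the same weight, regardless of its size
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


toy_cm = [[5, 2, 0],
          [1, 6, 1],
          [0, 2, 3]]
scores = macro_metrics(toy_cm)
print(scores["accuracy"])  # 0.7
```

Accuracy can look good when the majority class dominates, which is why the macro averages are reported alongside it: a class that is rarely predicted correctly pulls the macro scores down even if it is small.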


## Naive Bayes Classifier

In [None]:
# initialise and fit classifier
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

In [None]:
y_val_pred = classifier.predict(X_val)

In [None]:
cm = metrics.confusion_matrix(val_labels, y_val_pred)


In [None]:
# Define class labels
classes = ['0', '1', '2']


In [None]:
# Plot confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=classes, yticklabels=classes,
       xlabel='Predicted label', ylabel='True label',
       title='Confusion matrix')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Add counts to each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2. else "black")

# Show plot
plt.show()

In [None]:

acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

In [None]:
vocabulary = vectorizer.vocabulary_

# Use exponential to convert the log-likelihoods of the trained classifier back to probabilities
feat_likelihoods = np.exp(classifier.feature_log_prob_)

# Per-class likelihoods P(word | class) for two common function words
print(feat_likelihoods[:, vocabulary['a']])
print(feat_likelihoods[:, vocabulary['it']])
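The likelihoods printed above are smoothed estimates of P(word | class). The following standard-library sketch shows the usual add-alpha (Laplace) estimate on a toy corpus; `smoothed_likelihoods` and the toy documents are hypothetical, and `alpha=1.0` is assumed to mirror the default smoothing of `MultinomialNB`.

```python
from collections import Counter


def smoothed_likelihoods(docs_by_class, alpha=1.0):
    """Estimate P(word | class) = (count + alpha) / (total + alpha * |V|)."""
    vocab = sorted({tok for docs in docs_by_class.values() for doc in docs for tok in doc})
    likelihoods = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        denominator = sum(counts.values()) + alpha * len(vocab)
        likelihoods[label] = {w: (counts[w] + alpha) / denominator for w in vocab}
    return likelihoods


toy_corpus = {
    "negative": [["profit", "falls"], ["shares", "drop"]],
    "positive": [["profit", "rises"], ["profit", "jumps"]],
}
probs = smoothed_likelihoods(toy_corpus)
print(probs["positive"]["profit"])  # (2 + 1) / (4 + 6) = 0.3
print(probs["negative"]["profit"])  # (1 + 1) / (4 + 6) = 0.2
```

A word that occurs with similar likelihood in every class, as function words such as "a" or "it" tend to, does little to separate the classes, which is one motivation for trying stop word removal.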

## Logistic Regression Classifier

In [None]:
classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

In [None]:
y_val_pred = classifier.predict(X_val)

In [None]:
cm = metrics.confusion_matrix(val_labels, y_val_pred)


In [None]:

# Define class labels
classes = ['0', '1', '2']

# Plot confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=classes, yticklabels=classes,
       xlabel='Predicted label', ylabel='True label',
       title='Confusion matrix')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Add counts to each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2. else "black")

# Show plot
plt.show()

In [None]:
acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

## With Data Processing

In [None]:
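def add_negation(sentence):
    # match the whole words "not" or "never", or the contraction "n't", followed by
    # whitespace and the next word (word boundaries avoid matching inside e.g. "knot")
    pattern = r"(?:\bnot\b|n't|\bnever\b)\s+(\w+)"
    
    # replace the negation cue and the following word with a single "not_<word>" token;
    # matching is case-insensitive because lower-casing is only applied afterwards
    result = re.sub(pattern, r" not_\1", sentence, flags=re.IGNORECASE)
    
    return result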
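A quick check of the negation-marking idea on toy sentences. `mark_negation` is a stand-alone stand-in for the function above: the word boundaries and case-insensitive matching are assumptions of this sketch, and the example sentences are invented.

```python
import re

# Negation cue ("not", "never" or the contraction "n't") followed by the next word
NEGATION_PATTERN = re.compile(r"(?:\bnot\b|n't|\bnever\b)\s+(\w+)", flags=re.IGNORECASE)


def mark_negation(sentence):
    """Fuse a negation cue and the following word into one 'not_<word>' token."""
    return NEGATION_PATTERN.sub(r" not_\1", sentence)


examples = [
    "Sales did not improve this quarter",
    "Shares won't recover",
    "Never bet against AAPL",
]
marked = [mark_negation(text).split() for text in examples]
for tokens in marked:
    print(tokens)
# ['Sales', 'did', 'not_improve', 'this', 'quarter']
# ['Shares', 'wo', 'not_recover']
# ['not_bet', 'against', 'AAPL']
```

Only the single word after the cue is marked, so a phrase like "not very good" becomes "not_very good" and the sentiment word itself is left unchanged. That is a known limitation of this simple scheme.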

In [None]:
# apply add_negation to each tweet in the array using a list comprehension
all_text_negated = np.array([add_negation(text) for text in all_text])


In [None]:
# convert all tweets to lower case
all_text_negated = np.char.lower(all_text_negated)

In [None]:

# Split test data from training data
train_documents, test_documents, train_labels, test_labels = train_test_split(
    all_text_negated, 
    all_labels, 
    test_size=0.2, 
    stratify=all_labels,  # make sure the same proportion of labels is in the test set and training set
    random_state= 43
)

# Split validation data from training data
train_documents, val_documents, train_labels, val_labels = train_test_split(
    train_documents, 
    train_labels, 
    test_size=0.15, 
    stratify=train_labels,  # make sure the same proportion of labels is in the validation set and training set
    random_state= 43
)

print(f'Number of training instances = {len(train_documents)}')
print(f'Number of validation instances = {len(val_documents)}')
print(f'Number of test instances = {len(test_documents)}')


In [None]:
print(f'What does one instance look like from the training set? \n\n{train_documents[233]}')
print(f'...and here is its corresponding label \n\n{train_labels[233]}')

In [None]:


# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using NLTK's word_tokenize. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)

    
# load the English stop word list from nltk
stop_words = set(stopwords.words('english'))
#vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer

# with stop word removal (scikit-learn expects a list here, not a set)
vectorizer = CountVectorizer(tokenizer=Tokenizer(), stop_words=list(stop_words))  # construct the vectorizer

vectorizer.fit(train_documents)  # Learn the vocabulary
X_train = vectorizer.transform(train_documents)  # extract training set bags of words
X_val = vectorizer.transform(val_documents)  # extract validation set bags of words
X_test = vectorizer.transform(test_documents)  # extract test set bags of words

In [None]:
# see count vector from training set
counts = pd.DataFrame(X_train.toarray(), columns = vectorizer.get_feature_names_out())

In [None]:
X_train

In [None]:
counts.columns


## Naive Bayes Classifier With Data Processing

In [None]:
# initialise and fit classifier on the preprocessed features
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

In [None]:
y_val_pred = classifier.predict(X_val)

In [None]:
cm = metrics.confusion_matrix(val_labels, y_val_pred)


In [None]:
# Define class labels
classes = ['0', '1', '2']


In [None]:
# Plot confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=classes, yticklabels=classes,
       xlabel='Predicted label', ylabel='True label',
       title='Confusion matrix')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Add counts to each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2. else "black")

# Show plot
plt.show()

In [None]:
# Validation metrics for Naive Bayes with preprocessing (metric functions are imported in the first cell)

acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

In [None]:
vocabulary = vectorizer.vocabulary_

# Use exponential to convert the log-likelihoods of the trained classifier back to probabilities
feat_likelihoods = np.exp(classifier.feature_log_prob_)

# 'a' and 'it' are in the NLTK stop word list, so after stop word removal they are
# no longer in the vocabulary and these lookups would raise a KeyError
#print(feat_likelihoods[:, vocabulary['a']])
#print(feat_likelihoods[:, vocabulary['it']])

## Logistic Regression Classifier With Data Processing



In [None]:


classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

In [None]:
y_val_pred = classifier.predict(X_val)

In [None]:
cm = metrics.confusion_matrix(val_labels, y_val_pred)


In [None]:

# Define class labels
classes = ['0', '1', '2']

# Plot confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=classes, yticklabels=classes,
       xlabel='Predicted label', ylabel='True label',
       title='Confusion matrix')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Add counts to each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2. else "black")

# Show plot
plt.show()

## Multi Logistic Regression

In [None]:

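# Note: with the default lbfgs solver, LogisticRegression already fits a multinomial model
# for multi-class labels, so this is expected to behave like the classifier above
classifier = LogisticRegression(multi_class='multinomial')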
classifier.fit(X_train, train_labels)
y_val_pred = classifier.predict(X_val)

In [None]:
cm = metrics.confusion_matrix(val_labels, y_val_pred)

In [None]:

# Define class labels
classes = ['0', '1', '2']

# Plot confusion matrix
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
ax.figure.colorbar(im, ax=ax)
ax.set(xticks=np.arange(cm.shape[1]),
       yticks=np.arange(cm.shape[0]),
       xticklabels=classes, yticklabels=classes,
       xlabel='Predicted label', ylabel='True label',
       title='Confusion matrix')
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Add counts to each cell
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, format(cm[i, j], 'd'),
                ha="center", va="center",
                color="white" if cm[i, j] > cm.max() / 2. else "black")

# Show plot
plt.show()


In [None]:
# Validation metrics for the multinomial logistic regression classifier

acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

## 1.3. Common Themes & Topics

Can you identify common themes or topics associated with negative sentiment or positive
sentiment in this dataset?
• Explain the method you use to identify themes or topics;
• Show your results (e.g., by listing or visualising example topics or themes);
• Interpret the results and summarise the limitations of your approach.
(12 marks) 

In [None]:
n_feats_to_show = 10

# Flip the index so that values are keys and keys are values:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

for c, weights_c in enumerate(classifier.coef_):
    print(f'\nWeights for class {c}:\n')
    strongest_idxs = np.argsort(weights_c)[-n_feats_to_show:]

    for idx in strongest_idxs:
        print(f'{vocab_inverted[idx]} with weight {weights_c[idx]}')

### Topics

In [None]:
pos_index = all_labels == 2  # boolean mask selecting the positive instances
neg_index = all_labels == 0  # boolean mask selecting the negative instances
# get the text of the tweets with positive and negative gold labels:
pos_tweets = np.array(all_text)[pos_index]
neg_tweets = np.array(all_text)[neg_index]

In [None]:
#type(pos_tweets)
print(pos_tweets[0])
print(neg_tweets[0])

In [None]:
processed_pos = []
processed_neg = []

In [None]:
from nltk.stem import WordNetLemmatizer 
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS # find stopwords

np.random.seed(400)  # We fix the random seed to ensure we get consistent results when we repeat the lab.

lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it for every token

# Tokenize and lemmatize
def preprocess(text):
    result = []
    for token in simple_preprocess(text):  # Tokenize, remove very short and very long words, convert to lower case, remove words containing non-letter characters
        if token not in STOPWORDS:
            result.append(lemmatizer.lemmatize(token, 'v'))  # lemmatize each token as a verb
            
    return result

# Create lists of preprocessed documents
for tweet in pos_tweets:
    processed_pos.append(preprocess(tweet))
    
for tweet in neg_tweets:
    processed_neg.append(preprocess(tweet))

In [None]:
# nltk.download('wordnet')

In [None]:
print(processed_pos[0])
print(processed_neg[0])

In [None]:
from gensim.corpora import Dictionary

dictionary_pos = Dictionary(processed_pos) # construct word<->id mappings for the positive tweets
print(dictionary_pos)

pos_bow_corpus = [dictionary_pos.doc2bow(tweet) for tweet in processed_pos]

dictionary_neg = Dictionary(processed_neg) # construct word<->id mappings for the negative tweets
print(dictionary_neg)

neg_bow_corpus = [dictionary_neg.doc2bow(tweet) for tweet in processed_neg]

In [None]:
len(pos_bow_corpus)

In [None]:
len(neg_bow_corpus)

In [None]:
from gensim.models import LdaModel

lda_pos_model =  LdaModel(pos_bow_corpus, 
                      num_topics=10, 
                      id2word=dictionary_pos,                                    
                      passes=10,
                    ) 

lda_neg_model =  LdaModel(neg_bow_corpus, 
                      num_topics=10, 
                      id2word=dictionary_neg,                                    
                      passes=10,
                    ) 

In [None]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_pos_model.print_topics(-1):
    print("Pos Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")
    
for idx, topic in lda_neg_model.print_topics(-1):
    print("Neg Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

### Individual Topic Distribution

In [None]:
test_document_idx = 10
unseen_document = pos_tweets[test_document_idx]
print(unseen_document)

# Data preprocessing step for the example document - It is the same preprocessing we have performed for the training data
# (the document comes from the positive tweets, so we use the positive dictionary)
bow_vector = dictionary_pos.doc2bow(preprocess(unseen_document))

for idx, count in bow_vector:
    print(f'{dictionary_pos[idx]}: {count}')

In [None]:
# Infer the topic distribution of the example document with the positive-sentiment model
topic_distribution = lda_pos_model[bow_vector]

for index, probability in sorted(topic_distribution, key=lambda tup: -1*tup[1]):
    print("Index: {}\nProbability: {}\t Topic: {}".format(index, probability, lda_pos_model.print_topic(index, 5)))

In [None]:
# make list of tuples ready for model training

train_set = list(zip(list_a, list_b))

## Task 2: Named Entity Recognition (max. 19%)  

In scientific research, information extraction can help researchers to discover relevant findings from
across a wide body of literature. As a first step, your task is to build a tool for named entity
recognition in scientific journal article abstracts. We will be working with the BioNLP 2004 dataset of
abstracts from MEDLINE, a database containing journal articles from fields including medicine and
pharmacy. The data was collected by searching for the terms ‘human’, ‘blood cells’ and
‘transcription factors’, and then annotated with five entity types: DNA, protein, cell type, cell line,
RNA. 
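Named entity annotations of this kind are commonly stored as one BIO tag per token, where `B-` opens an entity, `I-` continues it and `O` marks tokens outside any entity. The sketch below shows how entity spans could be recovered from such tags. The tag names (for example `B-DNA`, `I-cell_type`) and the example sentence are assumptions for illustration and should be checked against the label set of the loaded dataset; `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tokens, tags):
    """Group tokens into (entity_text, entity_type) pairs based on BIO tags."""
    spans = []
    current_tokens, current_type = [], None

    def close_span():
        # Store the entity collected so far, if there is one
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" tag, or an "I-" tag that does not continue the open entity
            # (treated as outside here, which is a simplification)
            close_span()
            current_tokens, current_type = [], None

    close_span()
    return spans


tokens = ["IL-2", "gene", "expression", "in", "T", "cells"]
tags = ["B-DNA", "I-DNA", "O", "O", "B-cell_type", "I-cell_type"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('IL-2 gene', 'DNA'), ('T cells', 'cell_type')]
```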

More information can be found in the paper: https://aclanthology.org/W04-1213.pdf .
We provide a cache of the data and code for loading the data in ‘data_loader_demo’ in our Github
repository, https://github.com/uob-TextAnalytics/intro-labs-public. This script downloaded the data
from HuggingFace, where you can also find more information about the dataset:
https://huggingface.co/datasets/tner/bionlp2004 .


The data is presented in this paper:
Nigel Collier, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Jin-Dong Kim. 2004. Introduction
to the Bio-entity Recognition Task at JNLPBA. In Proceedings of the International Joint Workshop on
Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pages 73–78,
Geneva, Switzerland. COLING.