# Coursework IDA

## Task 1

## 1.1. 
Implement and train a method for automatically classifying texts in the FiQA sentiment analysis
dataset as positive, neutral or negative. Refer to the labs, lecture materials and textbook to identify
a suitable method. In your report:
• Briefly explain how your chosen method works and its main strengths and limitations;
• Describe the preprocessing steps and the features you use to represent each text instance;
• Explain why you chose those features and preprocessing steps and hypothesise how they
will affect your results;
• Briefly describe your software implementation.
(10 marks)



In [1]:
%load_ext autoreload
%autoreload 2

# HuggingFace's datasets library (imported here; the FiQA data below is read from cached JSON files instead)
from datasets import load_dataset

import numpy as np

In [2]:
train_files = [
    # 'data_cache/FiQA_ABSA_task1/task1_headline_ABSA_train.json',
    'data_cache/FiQA_ABSA_task1/task1_post_ABSA_train.json'
]

In [3]:
import json

def load_fiqa_sa_from_json(json_files):
    train_text = []
    train_labels = []

    # iterate through each tweet file
    for file in json_files:
        # open the file in read mode; the with statement closes it once the data has been read
        with open(file, 'r', encoding = 'utf8') as handle:
            # parse the JSON content into a Python dict
            dataf = json.load(handle)
        
        
        dataf_text = [dataf[k]["sentence"] for k in dataf.keys()]
        # print(len(dataf_text))
        train_text.extend(dataf_text)

        dataf_labels = [float(dataf[k]["info"][0]["sentiment_score"]) for k in dataf.keys()]
        # print(len(dataf_labels))
        train_labels.extend(dataf_labels)

    train_text = np.array(train_text)
    train_labels = np.array(train_labels)
    
    return train_text, train_labels


def threshold_scores(scores):
    """
    Convert sentiment scores to discrete labels.
    0 = negative.
    1 = neutral.
    2 = positive.
    """
    labels = []
    for score in scores:
        if score < -0.2:
            labels.append(0)
        elif score > 0.2:
            labels.append(2)
        else:
            labels.append(1)
            
    return np.array(labels)


all_text, all_labels = load_fiqa_sa_from_json(train_files)
    
print(f'Number of instances: {len(all_text)}')
print(f'Number of labels: {len(all_labels)}')

all_labels = threshold_scores(all_labels)
print(f'Number of negative labels: {np.sum(all_labels==0)}')
print(f'Number of neutral labels: {np.sum(all_labels==1)}')
print(f'Number of positive labels: {np.sum(all_labels==2)}')

Number of instances: 675
Number of labels: 675
Number of negative labels: 203
Number of neutral labels: 74
Number of positive labels: 398


In [4]:
type(load_fiqa_sa_from_json(train_files))

tuple

In [5]:
print(len(load_fiqa_sa_from_json(train_files)[0]))

675


In [6]:
from sklearn.model_selection import train_test_split

# Split test data from training data
train_documents, test_documents, train_labels, test_labels = train_test_split(
    all_text, 
    all_labels, 
    test_size=0.2, 
    stratify=all_labels,  # make sure the same proportion of labels is in the test set and training set
    random_state = 43
)

# Split validation data from training data
train_documents, val_documents, train_labels, val_labels = train_test_split(
    train_documents, 
    train_labels, 
    test_size=0.15, 
    stratify=train_labels,  # make sure the same proportion of labels is in the validation set and training set
    random_state = 43
)

print(f'Number of training instances = {len(train_documents)}')
print(f'Number of validation instances = {len(val_documents)}')
print(f'Number of test instances = {len(test_documents)}')


Number of training instances = 459
Number of validation instances = 81
Number of test instances = 135
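
The split sizes printed above follow from applying the two `test_size` fractions in sequence. Here is a minimal sketch of that arithmetic, assuming `train_test_split` rounds the held-out size up and gives the remainder to training. The helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_total, test_fraction, val_fraction):
    """Return (train, validation, test) sizes for two successive hold-out splits."""
    # First split: hold out the test set from all instances.
    n_test = math.ceil(n_total * test_fraction)
    n_remaining = n_total - n_test
    # Second split: hold out the validation set from what is left.
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(675, 0.2, 0.15)
print(n_train, n_val, n_test)
```

Note that the validation set is 15% of the remaining 80%, which is 12% of the full dataset rather than 15% of it.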


In [7]:
print(f'What does one instance look like from the training set? \n\n{train_documents[233]}')
print(f'...and here is its corresponding label \n\n{train_labels[233]}')

What does one instance look like from the training set? 

Facebook, near a buy point last week, faces a different technical test today https://t.co/c72LLMpiNM $FB $AAPL $NFLX https://t.co/fPFbYTYPuY
...and here is its corresponding label 

1


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk import word_tokenize

# CountVectorizer can do its own tokenization, but for consistency we want to
# carry on using NLTK's word_tokenize. We write a small wrapper class to enable this:
class Tokenizer(object):
    def __call__(self, tweets):
        return word_tokenize(tweets)

vectorizer = CountVectorizer(tokenizer=Tokenizer())  # construct the vectorizer

vectorizer.fit(train_documents)  # Learn the vocabulary
X_train = vectorizer.transform(train_documents)  # extract training set bags of words
X_val = vectorizer.transform(val_documents)  # extract validation set bags of words
X_test = vectorizer.transform(test_documents)  # extract test set bags of words



## Naive Bayes Classifier

In [9]:
# WRITE YOUR CODE HERE

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

MultinomialNB()

In [10]:
y_val_pred = classifier.predict(X_val)

In [11]:
# WRITE YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

Accuracy = 0.654320987654321
Precision (macro average) = 0.4567099567099568
Recall (macro average) = 0.4236111111111111
F1 score (macro average) = 0.40661824051654566
              precision    recall  f1-score   support

           0       0.73      0.33      0.46        24
           1       0.00      0.00      0.00         9
           2       0.64      0.94      0.76        48

    accuracy                           0.65        81
   macro avg       0.46      0.42      0.41        81
weighted avg       0.60      0.65      0.59        81
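
The macro-averaged figures above can be reproduced by hand from per-class counts. The sketch below assumes counts inferred from the rounded report (true positives, number of predictions and support for each class); the notebook does not print them, and the helper `per_class_prf` is hypothetical. A class that is never predicted is given precision 0, which matches the value shown for class 1.

```python
def per_class_prf(tp, n_predicted, n_true):
    """Precision, recall and F1 for one class; 0.0 where a ratio is undefined."""
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# (true positives, predicted count, support) per class, inferred from the
# Naive Bayes report above. This is an assumption, not notebook output.
nb_counts = {0: (8, 11, 24), 1: (0, 0, 9), 2: (45, 70, 48)}

scores = [per_class_prf(*nb_counts[c]) for c in sorted(nb_counts)]
n_classes = len(scores)
macro_precision = sum(p for p, _, _ in scores) / n_classes
macro_recall = sum(r for _, r, _ in scores) / n_classes
macro_f1 = sum(f for _, _, f in scores) / n_classes
accuracy = (sum(tp for tp, _, _ in nb_counts.values())
            / sum(n for _, _, n in nb_counts.values()))

print(f'accuracy={accuracy:.4f} macro P={macro_precision:.4f} '
      f'R={macro_recall:.4f} F1={macro_f1:.4f}')
```

The macro average weights class 1 (9 instances) as heavily as class 2 (48 instances), which is why macro F1 sits well below accuracy here.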





In [12]:
import numpy as np

vocabulary = vectorizer.vocabulary_

### CHANGE THE NAME OF THE CLASSIFIER VARIABLE BELOW TO USE YOUR TRAINED CLASSIFIER
feat_likelihoods = np.exp(classifier.feature_log_prob_)  # Use exponential to convert the logs back to probabilities
###

# WRITE YOUR CODE HERE
print(feat_likelihoods[:, vocabulary['a']])
print(feat_likelihoods[:, vocabulary['it']])

[0.00563792 0.00534283 0.01019968]
[0.00480267 0.00356189 0.00330412]
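
The values in `feature_log_prob_` are smoothed per-class word likelihoods. Below is a rough sketch of how additive (Laplace) smoothing produces such values, assuming the default `alpha=1.0` of `MultinomialNB`. The counts are made-up toy numbers rather than the FiQA vocabulary, and `smoothed_log_likelihoods` is a hypothetical helper.

```python
import math

def smoothed_log_likelihoods(word_counts, alpha=1.0):
    """Log P(word | class) with additive (Laplace) smoothing.

    word_counts maps each vocabulary word to its total count in one class.
    """
    vocab_size = len(word_counts)
    total = sum(word_counts.values())
    return {
        word: math.log((count + alpha) / (total + alpha * vocab_size))
        for word, count in word_counts.items()
    }

# Hypothetical counts for a single class over a four-word vocabulary.
toy_counts = {'short': 3, 'down': 2, 'buy': 0, 'a': 5}
log_probs = smoothed_log_likelihoods(toy_counts)
# Exponentiate to recover probabilities, as done for feat_likelihoods above.
probs = {word: math.exp(lp) for word, lp in log_probs.items()}
print(probs)
```

Because of the smoothing, a word that never occurs in a class ('buy' in this toy example) still receives a small non-zero likelihood.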


## Logistic Regression Classifier



In [13]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

LogisticRegression()

In [14]:
y_val_pred = classifier.predict(X_val)

## 1.2. Evaluate Method

Evaluate your method, then interpret and discuss your results. Include the following points:
• Define your performance metrics and state their limitations;
• Describe the testing procedure (e.g., how you used each split of the dataset);
• Show your results using suitable plots or tables;
• How could you improve the method or experimental process? Consider the errors that your
method makes.  
(9 marks)


In [15]:
# WRITE YOUR CODE HERE
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

acc = accuracy_score(val_labels, y_val_pred)
print(f'Accuracy = {acc}')

prec = precision_score(val_labels, y_val_pred, average='macro')
print(f'Precision (macro average) = {prec}')

rec = recall_score(val_labels, y_val_pred, average='macro')
print(f'Recall (macro average) = {rec}')

f1 = f1_score(val_labels, y_val_pred, average='macro')
print(f'F1 score (macro average) = {f1}')

# We can get all of these with a per-class breakdown using classification_report:
print(classification_report(val_labels, y_val_pred))

Accuracy = 0.6790123456790124
Precision (macro average) = 0.7813620071684587
Recall (macro average) = 0.4953703703703704
F1 score (macro average) = 0.5116883116883116
              precision    recall  f1-score   support

           0       0.67      0.50      0.57        24
           1       1.00      0.11      0.20         9
           2       0.68      0.88      0.76        48

    accuracy                           0.68        81
   macro avg       0.78      0.50      0.51        81
weighted avg       0.71      0.68      0.64        81
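
One limitation of accuracy on this split is class imbalance: 48 of the 81 validation instances are positive. A quick baseline check, sketched with a hypothetical `majority_baseline` helper, shows what a classifier that always predicts the largest class would score.

```python
def majority_baseline(support):
    """Accuracy and macro F1 of always predicting the most frequent class.

    support maps each class label to its number of instances.
    """
    n_total = sum(support.values())
    majority = max(support, key=support.get)
    accuracy = support[majority] / n_total
    # Only the majority class has non-zero F1: its precision equals the
    # accuracy and its recall is 1.
    f1_majority = 2 * accuracy * 1.0 / (accuracy + 1.0)
    macro_f1 = f1_majority / len(support)
    return majority, accuracy, macro_f1

# Validation-set support taken from the classification reports above.
val_support = {0: 24, 1: 9, 2: 48}
label, baseline_acc, baseline_macro_f1 = majority_baseline(val_support)
print(f'always predict {label}: accuracy={baseline_acc:.3f}, '
      f'macro F1={baseline_macro_f1:.3f}')
```

Against this baseline, the reported accuracies look like modest gains, while macro F1 makes the difference on the minority classes more visible.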



## 1.3. Common Themes & Topics

1.3. Can you identify common themes or topics associated with negative sentiment or positive
sentiment in this dataset?
• Explain the method you use to identify themes or topics;
• Show your results (e.g., by listing or visualising example topics or themes);
• Interpret the results and summarise the limitations of your approach.
(12 marks) 

In [16]:
n_feats_to_show = 10

# Flip the index so that values are keys and keys are values:
keys = vectorizer.vocabulary_.values()
values = vectorizer.vocabulary_.keys()
vocab_inverted = dict(zip(keys, values))

for c, weights_c in enumerate(classifier.coef_):
    print(f'\nWeights for class {c}:\n')
    strongest_idxs = np.argsort(weights_c)[-n_feats_to_show:]

    for idx in strongest_idxs:
        print(f'{vocab_inverted[idx]} with weight {weights_c[idx]}')


Weights for class 0:

downgraded with weight 0.5185444501971485
might with weight 0.5521056340933332
lower with weight 0.5545015307535974
tsla with weight 0.5561943080907948
model with weight 0.5848685720546261
downside with weight 0.6024350157845776
sbux with weight 0.6975188496435217
spy with weight 0.7108861954239818
down with weight 1.0078950248461631
short with weight 1.0895618849349944

Weights for class 1:

but with weight 0.5098350676115879
not with weight 0.5126206823906297
under with weight 0.5203803357645186
here with weight 0.5223132194115013
today with weight 0.5589399753330635
sells with weight 0.5635830752801967
aapl with weight 0.6562792693583221
rt with weight 0.6977507971335787
nvda with weight 0.7028193840768459
sideways with weight 0.7028193840768459

Weights for class 2:

upgrades with weight 0.43920196388803384
positive with weight 0.4531631501095449
stocks with weight 0.4826728950064811
good with weight 0.510436117458946
calls with weight 0.5506516322243667
bull

### Topics

In [17]:
pos_index = all_labels == 2  # boolean mask selecting the positive instances
neg_index = all_labels == 0  # boolean mask selecting the negative instances
# get the text of the tweets with positive and negative gold labels:
pos_tweets = np.array(all_text)[pos_index]
neg_tweets = np.array(all_text)[neg_index]

In [18]:
#type(pos_tweets)
print(pos_tweets[0])
print(neg_tweets[0])

Slowly adding some $FIO here but gotta be careful. This will be one of biggest winners in 2012
I am not optimistic about $amzn both fundementals and charts look like poopoo this quarter.


In [19]:
processed_pos = []
processed_neg = []

In [20]:
from nltk.stem import WordNetLemmatizer 
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS # find stopwords

np.random.seed(400)  # We fix the random seed to ensure we get consistent results when we repeat the lab.

# Tokenize and lemmatize
lemmatizer = WordNetLemmatizer()  # create once and reuse for every token

def preprocess(text):
    result = []
    # simple_preprocess tokenizes, removes very short and very long words,
    # converts to lower case and removes words containing non-letter characters
    for token in simple_preprocess(text):
        if token not in STOPWORDS:
            result.append(lemmatizer.lemmatize(token, 'v'))  # lemmatize as a verb

    return result

# Create lists of preprocessed documents
for tweet in pos_tweets:
    processed_pos.append(preprocess(tweet))
    
for tweet in neg_tweets:
    processed_neg.append(preprocess(tweet))

In [21]:
print(processed_pos[0])
print(processed_neg[0])

['slowly', 'add', 'fio', 'gotta', 'careful', 'biggest', 'winners']
['optimistic', 'amzn', 'fundementals', 'chart', 'look', 'like', 'poopoo', 'quarter']


In [22]:
from gensim.corpora import Dictionary

dictionary_pos = Dictionary(processed_pos) # construct word<->id mappings - ids are assigned document by document, with each document's new tokens in sorted order
print(dictionary_pos)

pos_bow_corpus = [dictionary_pos.doc2bow(tweet) for tweet in processed_pos]

dictionary_neg = Dictionary(processed_neg) # construct word<->id mappings - ids are assigned document by document, with each document's new tokens in sorted order
print(dictionary_neg)

neg_bow_corpus = [dictionary_neg.doc2bow(tweet) for tweet in processed_neg]

Dictionary(1514 unique tokens: ['add', 'biggest', 'careful', 'fio', 'gotta']...)
Dictionary(887 unique tokens: ['amzn', 'chart', 'fundementals', 'like', 'look']...)
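
Conceptually, a dictionary maps each token to an integer id, and `doc2bow` turns a token list into sparse `(id, count)` pairs. Here is a plain-Python sketch of that idea using the two preprocessed tweets printed earlier, combined into one toy vocabulary. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the id ordering simply mirrors what the printed dictionaries suggest.

```python
from collections import Counter

def build_vocab(documents):
    """Assign ids document by document, adding each document's new tokens in sorted order."""
    token2id = {}
    for tokens in documents:
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping out-of-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

docs = [
    ['slowly', 'add', 'fio', 'gotta', 'careful', 'biggest', 'winners'],
    ['optimistic', 'amzn', 'fundementals', 'chart', 'look', 'like', 'poopoo', 'quarter'],
]
vocab = build_vocab(docs)
# 'unknown' is not in the vocabulary, so it is dropped from the result.
bow = to_bow(['add', 'add', 'fio', 'unknown'], vocab)
print(bow)
```

Tokens outside the vocabulary are silently dropped, so a document made up mostly of unseen words yields a very short bag of words.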


In [23]:
len(pos_bow_corpus)

398

In [24]:
len(neg_bow_corpus)

203

In [25]:
from gensim.models import LdaModel

lda_pos_model =  LdaModel(pos_bow_corpus, 
                      num_topics=10, 
                      id2word=dictionary_pos,                                    
                      passes=10,
                    ) 

lda_neg_model =  LdaModel(neg_bow_corpus, 
                      num_topics=10, 
                      id2word=dictionary_neg,                                    
                      passes=10,
                    ) 

In [26]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_pos_model.print_topics(-1):
    print("Pos Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")
    
for idx, topic in lda_neg_model.print_topics(-1):
    print("Neg Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Pos Topic: 0 
Words: 0.026*"http" + 0.026*"stks" + 0.015*"strong" + 0.012*"today" + 0.011*"look" + 0.009*"https" + 0.009*"aapl" + 0.006*"resistance" + 0.006*"year" + 0.006*"outperform"


Pos Topic: 1 
Words: 0.027*"http" + 0.026*"stks" + 0.020*"https" + 0.019*"long" + 0.015*"aapl" + 0.010*"close" + 0.010*"break" + 0.009*"stock" + 0.009*"day" + 0.009*"spy"


Pos Topic: 2 
Words: 0.022*"stks" + 0.022*"http" + 0.015*"nice" + 0.015*"bounce" + 0.013*"today" + 0.013*"aapl" + 0.011*"earn" + 0.011*"buy" + 0.009*"fb" + 0.008*"long"


Pos Topic: 3 
Words: 0.020*"higher" + 0.016*"stks" + 0.016*"http" + 0.015*"aapl" + 0.013*"https" + 0.011*"low" + 0.011*"buy" + 0.011*"high" + 0.009*"breakout" + 0.009*"tomorrow"


Pos Topic: 4 
Words: 0.013*"buy" + 0.013*"long" + 0.013*"bbry" + 0.010*"today" + 0.008*"call" + 0.008*"pop" + 0.008*"report" + 0.008*"yhoo" + 0.008*"eps" + 0.005*"stock"


Pos Topic: 5 
Words: 0.024*"stks" + 0.024*"http" + 0.014*"long" + 0.011*"day" + 0.009*"aapl" + 0.008*"good" + 0.008*"

### Individual Topic Distribution

In [27]:
test_document_idx = 10
unseen_document = pos_tweets[test_document_idx]
print(unseen_document)

# Data preprocessing step for the unseen document - It is the same preprocessing we have performed for the training data
# The positive-tweet dictionary is used because the document comes from pos_tweets
bow_vector = dictionary_pos.doc2bow(preprocess(unseen_document))

for idx, count in bow_vector:
    print(f'{dictionary_pos[idx]}: {count}')

$TZOO a close above 28.64 and we are ready to rock and roll

In [None]:
# Use the positive-tweet LDA model, which matches the dictionary used to build bow_vector
topic_distribution = lda_pos_model[bow_vector]

for index, probability in sorted(topic_distribution, key=lambda tup: -1*tup[1]):
    print("Index: {}\nProbability: {}\t Topic: {}".format(index, probability, lda_pos_model.print_topic(index, 5)))

In [None]:
# make list of tuples ready for model training

train_set = list(zip(list_a, list_b))

## pyLDAvis for Visualisation

## Task 2: Named Entity Recognition (max. 19%)  

In scientific research, information extraction can help researchers to discover relevant findings from
across a wide body of literature. As a first step, your task is to build a tool for named entity
recognition in scientific journal article abstracts. We will be working with the BioNLP 2004 dataset of
abstracts from MEDLINE, a database containing journal articles from fields including medicine and
pharmacy. The data was collected by searching for the terms ‘human’, ‘blood cells’ and
‘transcription factors’, and then annotated with five entity types: DNA, protein, cell type, cell line,
RNA. 
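
NER datasets of this kind are commonly distributed as token sequences with IOB-style tags (for example `B-protein`, `I-protein`, `O`). Assuming that tagging scheme, here is a minimal sketch of collapsing token-level tags into entity spans. The example sentence and its tags are invented for illustration, and `decode_entities` is a hypothetical helper.

```python
def decode_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # 'O' (or an inconsistent I- tag) closes any open entity;
            # the token itself is not added to any span.
            if current_tokens:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities

# Invented example, not taken from the dataset.
tokens = ['IL-2', 'gene', 'expression', 'in', 'T', 'cells']
tags = ['B-DNA', 'I-DNA', 'O', 'O', 'B-cell_type', 'I-cell_type']
entities = decode_entities(tokens, tags)
print(entities)
```

Working with whole spans like these, rather than individual tags, is one common way to inspect and score NER output.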

More information can be found in the paper: https://aclanthology.org/W04-1213.pdf .
We provide a cache of the data and code for loading the data in ‘data_loader_demo’ in our GitHub
repository, https://github.com/uob-TextAnalytics/intro-labs-public. This script downloaded the data
from HuggingFace, where you can also find more information about the dataset:
https://huggingface.co/datasets/tner/bionlp2004 .


The data is presented in this paper:
Nigel Collier, Tomoko Ohta, Yoshimasa Tsuruoka, Yuka Tateisi, and Jin-Dong Kim. 2004. Introduction
to the Bio-entity Recognition Task at JNLPBA. In Proceedings of the International Joint Workshop on
Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), pages 73–78,
Geneva, Switzerland. COLING.