University of Zagreb\
Faculty of Electrical Engineering and Computing

## Text Analysis and Retrieval 2021/2022
https://www.fer.unizg.hr/predmet/apt/

------------------------------

### Semantics
### LAB 2


*Version: 1.0*

(c) 2022 Josip Jukić, Jan Šnajder

Submission deadline: **May 4, 2022, 23:59 CET** 

------------------------------

### Instructions

Hello visitor, this lab assignment consists of three parts. Your task boils down to filling out the missing parts of code and evaluating the cells. These parts are indicated by the "YOUR CODE HERE" template.

Each subtask is supplemented by several tests that you can run. Apart from that, there are additional test that will be executed after submition. If your solution is valid and it passes all of the visible tests, there shouldn't be any problems with the additional tests.

**IMPORTANT: Don't change the names of the predefined methods or random seeds**, because the tests won't execute properly.

You're required to do this assignment **on your own**.

If you stumble upon problems, please refer to josip.jukic@fer.hr for office hours.

## Tasks

### 1. Paraphrase identification

The goal of paraphrase identification is to determine whether a pair of sentences have the same meaning. In this assignment, we will use the [MRPC](https://paperswithcode.com/dataset/mrpc) dataset, a part of the [GLUE benchmark](https://gluebenchmark.com/).

Load the data frames (train & test) and explore their structure. The column `label` indicates whether a given pair of sentences is semantically equivalent (1) or not (0).

In [1]:
import numpy as np
import pandas as pd


# Load CSV files.
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.head()

Unnamed: 0,sentence1,sentence2,label
0,"Amrozi accused his brother , whom he called "" ...","Referring to him as only "" the witness "" , Amr...",1
1,Yucaipa owned Dominick 's before selling the c...,Yucaipa bought Dominick 's in 1995 for $ 693 m...,0
2,They had published an advertisement on the Int...,"On June 10 , the ship 's owners had published ...",1
3,"Around 0335 GMT , Tab shares were up 19 cents ...","Tab shares jumped 20 cents , or 4.6 % , to set...",0
4,"The stock rose $ 2.11 , or about 11 percent , ...",PG & E Corp. shares jumped $ 1.63 or 8 percent...,1


The library that will do the heavy lifting for this task is [Podium](https://takelab.fer.hr/podium/), TakeLab's data loading and preprocessing tool for NLP. We advise you to take a look at the official documenation as well as the [walkthrough](https://takelab.fer.hr/podium/walkthrough.html) examples. Additionally, go through the main primitives: [Vocab](https://takelab.fer.hr/podium/package_reference/vocab_and_fields.html#vocab), [Field](https://takelab.fer.hr/podium/package_reference/vocab_and_fields.html#field), and [Dataset](https://takelab.fer.hr/podium/package_reference/datasets.html).

In [2]:
from podium import Vocab, Field, LabelField
from podium.datasets import TabularDataset
from podium.vectorizers import GloVe

  "class": algorithms.Blowfish,


#### (a)

Construct three fields: `sentence1` (Field), `sentence2` (Field), and `label` (LabelField). Transform the sentences to lower case using the attribute `pretokenize_hooks`. Refer to the [documentation](https://takelab.fer.hr/podium/preprocessing.html#hooks) to see how this can be achieved.

In order to model semantics, we will utilize [GloVe](https://nlp.stanford.edu/projects/glove/) distributional word vectors.

In [3]:
max_vocab_size = 10_000
vocab = Vocab(max_size=max_vocab_size, min_freq=2)

S1 = None # Sentence1 field
S2 = None # Sentence2 field
LABEL = None # Label field

S1 = Field('sentence1', numericalizer=vocab, pretokenize_hooks=str.lower)
S2 = Field('sentence2', numericalizer=vocab, pretokenize_hooks=str.lower)
LABEL = LabelField('label')

fields = [
    S1,
    S2,
    LABEL,
]

train = TabularDataset.from_pandas(df_train, fields)
test = TabularDataset.from_pandas(df_test, fields)
train.finalize_fields()

glove = GloVe()
# Load only the vectors of vocab words.
embeddings = glove.load_vocab(vocab)

# Generate padded batch.
train_batch = train.batch(add_padding=True)
test_batch = test.batch(add_padding=True)



In [4]:
assert "washington" in vocab.stoi

Generated batch stores values for the corresponding fields, which can be accessed with the dot notation. See the example below.

In [5]:
train_batch.sentence1, train_batch.sentence2, train_batch.label

(array([[1319,  471,   33, ...,    1,    1,    1],
        [5929, 1799, 5930, ...,    1,    1,    1],
        [  40,   36,  669, ...,    1,    1,    1],
        ...,
        [  10,   50,   34, ...,    1,    1,    1],
        [   2, 5757,   16, ...,    1,    1,    1],
        [   2,  916,  469, ...,    1,    1,    1]]),
 array([[2482,    5,  138, ...,    1,    1,    1],
        [5929, 1968, 5930, ...,    1,    1,    1],
        [  13,  184,  148, ...,    1,    1,    1],
        ...,
        [   7,    8,  140, ...,    1,    1,    1],
        [   0,   57,  153, ...,    1,    1,    1],
        [   2,  916,  469, ...,    1,    1,    1]]),
 array([[0],
        [1],
        [0],
        ...,
        [0],
        [0],
        [1]]))

#### (b)

Implement `cosine_similarity`, which computes cosine similarities of two 2D arrays across their second axis (index = 1). Refer to the tests below to see a concrete example. [np.einsum](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html) and [np.linalg.norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html) may be useful for this task. However, you are not obliged to use those methods.

In [6]:
def cosine_similarity(a, b):
    """
    Receives two 2D numpy arrays and calculates cosine similarity across the second axis.
    For examples, if `a` and `b` have shape (32, 10), the resulting array should have shape (32,).
    
    Returns:
        1D numpy array with cosine similarities
    """
    return np.einsum('ij,ij->i', a, b)/(np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))    

In [7]:
a = np.array(
    [
        [0.79674043, -0.21774995, 0.24626283, -1.92507862, -0.44471655],
        [-0.83365243, -1.05258529, -0.69114343, 1.94794818, 0.81859483],
        [-1.1742791, 0.39978046, -1.14265924, 1.50492221, 0.99339915],
        [0.58896543, 0.8214453, -0.27131406, 0.45817815, -0.21055904],
    ]
)

b = np.array(
    [
        [0.20286607, 0.34114231, -1.14127489, 0.11783557, -0.43729267],
        [0.34177672, -1.66142734, 1.13159559, 0.07148497, 0.24896589],
        [-0.10376178, 0.30639966, 0.54675361, -0.04626362, 0.1408809],
        [-0.6056932, -1.24619744, -0.2720515, 1.26427211, 1.47021337],
    ]
)

sol = np.array([-0.08127636, 0.1919786, -0.19251815, -0.37211115])

assert np.isclose(cosine_similarity(a, b), sol).all()


#### (c)
Implement `top_n`, which retrieves the sorted indices of top `n` values in a numpy array. Refer to the test below.

In [8]:
def top_n(sims, n=10):
    """
    Receives a numpy array `sims` and finds the indices of the top `n` highest similarities.
    The indices are returned in the ascending order (from lowest to highest index).
    """
    return np.argsort(sims)[-n:]

In [9]:
assert (top_n(np.array([0.1, 0.5, 0.2, 0.3, 0.9]), 2) == np.array([1, 4])).all()

#### (d)
We need to somehow transform the sequences to fixed-size vectors. We will explore several simple approaches.

Extract word embeddings for sentence1 and sentence2 (both for train and test), using `train_batch`, `test_batch`, and `embeddings`. Additionally, compute the mean embedding for both fields and both sets. Store them to `sentence1_train_mean`, `sentence2_train_mean`, `sentence1_test_mean`, and `sentence2_test_mean`.

Once you compute the means, retrieve and print out the top 10 most similar sentence pairs from the train set using the previously implemented methods `cosine_similarity` and `top_n`.

In [10]:
sentence1_train, sentence2_train = None, None
sentence1_test, sentence2_test = None, None
sentence1_train_mean, sentence2_train_mean = None, None
sentence1_test_mean, sentence2_test_mean = None, None
    
sentence1_train = embeddings[train_batch.sentence1]
sentence2_train = embeddings[train_batch.sentence2]

sentence1_test =  embeddings[test_batch.sentence1]
sentence2_test =  embeddings[test_batch.sentence2]

sentence1_train_mean = np.mean(sentence1_train, axis=1)
sentence2_train_mean = np.mean(sentence2_train, axis=1)

sentence1_test_mean = np.mean(sentence1_test, axis=1)
sentence2_test_mean = np.mean(sentence2_test, axis=1)

similiarities = cosine_similarity(sentence1_train_mean, sentence2_train_mean)
top = top_n(similiarities)

print(similiarities[top])
print(top)
df_train.iloc[top]

[0.99968619 0.99969057 0.99969104 0.99969878 0.99972877 0.99973616
 0.99980307 0.99981289 0.99985794 0.99992784]
[1748 1660 2282  368  172 2385 1903 3219 2824  506]


Unnamed: 0,sentence1,sentence2,label
1748,An attempt last month in the Senate to keep th...,An attempt to keep the fund open for another y...,1
1660,The United States and Britain are seeking back...,"At the United Nations , the United States and ...",1
2282,No pill is ever expected to replace earplugs a...,Nobody is saying such a pill could replace ear...,1
368,Moore had no immediate comment Tuesday .,Moore did not have an immediate response Tuesd...,1
172,ConAgra stock closed Monday on the New York St...,ConAgra shares closed Monday at $ 21.63 a shar...,1
2385,"Other , more traditional tests are also availa...",Traditional tests also are available at no cos...,0
1903,South Africa has the world 's highest caseload...,With 4.7 million people infected with HIV or A...,1
3219,"Druce will face murder charges , Conte said .",Conte said Druce will be charged with murder .,1
2824,"Justices Stephen Breyer , Sandra Day O 'Connor...","Justices Anthony Kennedy , Sandra Day O 'Conno...",1
506,The rest said they belonged to another party o...,The rest said they had no affiliation or belon...,1


In [11]:
assert sentence1_train_mean.shape == (len(train), 300)

#### (e)
Now use the computed means to create representations. You should make two variants: `*_mul`**multiplies the values element-wise** and `*_cat` simply **concatenates** the means from sentence1 and sentence2.
Load the train and test labels from their corresponding batches. Make sure that the label array is 1D.

In [12]:
X_train_mul, X_test_mul = None, None
X_train_cat, X_test_cat = None, None
y_train, y_test = None, None


X_train_mul = sentence1_train_mean * sentence2_train_mean
X_test_mul = sentence1_test_mean * sentence2_test_mean

X_train_cat = np.concatenate((sentence1_train_mean, sentence2_train_mean), axis=1)
X_test_cat = np.concatenate((sentence1_test_mean, sentence2_test_mean), axis=1)

y_train = train_batch.label.flatten()
y_test = test_batch.label.flatten()

In [13]:
assert X_train_mul.shape == (len(train), 300)
assert X_train_cat.shape == (len(train), 600)
assert y_train.shape == (len(train),)

#### (f)
Finally, let's exploit the fruits of our labor by building a classifier that will identify paraphrases. Implement `train_model` that trains an [`sklearn`](https://scikit-learn.org/stable/) model and returns the fitted model. You may use any `sklearn` model, but you need to achieve **binary $F_1$ generalization score higher than 0.4** for both represenations (multiplied and concatenated) to pass the tests. To calculate $F_1$ with `sklearn.metrics.f1_score`, set `average="binary"`. See the example bellow. For additional information, we advise you to use `classification_report`. Remember that the MRPC dataset is quite challenging and we're using a simple approach, so don't be surprised with low binary $F_1$ scores.

In [14]:
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

def train_model(X_train, y_train):
    """
    Fit and return the fitted sklearn model.
    """
    model = LogisticRegression(C=0.5, class_weight='balanced')
    model.fit(X_train, y_train)
    return model

model = train_model(X_train_mul, y_train)
y_pred = model.predict(X_test_mul)
print(classification_report(y_pred, y_test))

model = train_model(X_train_cat, y_train)
y_pred = model.predict(X_test_cat)
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.47      0.77      0.58       698
           1       0.72      0.41      0.52      1027

    accuracy                           0.56      1725
   macro avg       0.60      0.59      0.55      1725
weighted avg       0.62      0.56      0.55      1725

              precision    recall  f1-score   support

           0       0.34      0.76      0.47       516
           1       0.79      0.38      0.51      1209

    accuracy                           0.49      1725
   macro avg       0.56      0.57      0.49      1725
weighted avg       0.65      0.49      0.50      1725



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
