# Document structure

## Requirements

Run this cell only if NLTK and the trial data are not already available on the current machine.
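A minimal sketch of how the setup could be skipped automatically when the data is already present. The helper name `fetch_and_extract` is hypothetical and not part of the original notebook; it uses only the standard library instead of `wget`.

```python
import os
import tarfile
import urllib.request

def fetch_and_extract(url, archive, target_dir):
    """Download and unpack `archive` only when `target_dir` is missing.

    Hypothetical helper: returns True if it extracted anything, False otherwise.
    """
    if os.path.isdir(target_dir):
        return False  # data already unpacked, nothing to do
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)  # plays the role of wget
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall()
    return True
```

Under these assumptions, `fetch_and_extract('https://gebakx.github.io/ihlt/s2/resources/trial.tgz', 'trial.tgz', 'trial')` would download only on the first run.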

In [0]:
import nltk
nltk.download()  # interactive prompt: enter d (download), then book, then q (quit)

import tarfile

!wget https://gebakx.github.io/ihlt/s2/resources/trial.tgz

with tarfile.open('trial.tgz', "r:gz") as tar:
  tar.extractall()

!pip install beautifulsoup4
!pip install lxml

!ls

## Input data

In [3]:
# Each row holds: pair id, first sentence, second sentence (tab-separated)
with open('trial/STS.input.txt') as f:
  trial_data = [row.rstrip().split('\t') for row in f]

trial_data

[['id1',
  'The bird is bathing in the sink.',
  'Birdie is washing itself in the water basin.'],
 ['id2',
  'In May 2010, the troops attempted to invade Kabul.',
  'The US army invaded Kabul on May 7th last year, 2010.'],
 ['id3',
  'John said he is considered a witness but not a suspect.',
  '"He is not a suspect anymore." John said.'],
 ['id4',
  'They flew out of the nest in groups.',
  'They flew into the nest together.'],
 ['id5',
  'The woman is playing the violin.',
  'The young lady enjoys listening to the guitar.'],
 ['id6',
  'John went horse back riding at dawn with a whole group of friends.',
  'Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.']]

## Calculating the Jaccard distance

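The Jaccard distance measures how far apart two sentences are in terms of their tokens: it is one minus the ratio of shared tokens to all distinct tokens. The cell below converts it into a similarity (1 - distance), so higher values mean more overlap. Because it only compares surface tokens, it is not a robust way to decide whether two sentences are equivalent.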

It is a weak model.
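To make the measure concrete, here is a small dependency-free sketch. The function `jaccard_similarity` and the two example sentences are illustrative assumptions, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B|; one minus this value is the Jaccard distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty token sets
    return len(a & b) / len(a | b)

# Whitespace split is a stand-in for nltk.word_tokenize (assumption for this sketch).
s1 = "the cat sat on the mat".split()
s2 = "the cat lay on the rug".split()

sim = jaccard_similarity(s1, s2)  # 3 shared tokens out of 7 distinct ones
print(sim)       # 3/7, about 0.4286
print(1 - sim)   # the corresponding Jaccard distance
```

Only exact token matches count, so paraphrases built from different words score low even when they mean the same thing.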

In [4]:
import numpy as np
import nltk

from nltk.metrics import jaccard_distance

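# Jaccard distance between the token sets of each sentence pair
result = []
for in_data in trial_data:
  result.append(jaccard_distance(set(nltk.word_tokenize(in_data[1])),
                                 set(nltk.word_tokenize(in_data[2]))))

# Convert distance to similarity so that higher means more alike
result = 1 - np.array(result)
result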

array([0.30769231, 0.26315789, 0.46666667, 0.45454545, 0.23076923,
       0.13793103])

## Calculating the Pearson correlation

First we load the gold standard scores (the reference) and compare them with our previous results. The gold standard file appears to be in reverse order relative to the input, so it is reversed before comparing. The previous results are also multiplied by 5 to put them on the same scale as the gold standard, although the Pearson correlation is invariant to positive scaling.

For this example the correlation is 0.3962, which is low (the maximum is 1.0). The positive sign means that both lists tend to move in the same direction, but they are only weakly correlated.
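The scale invariance can be checked with a short sketch. The `pearson` function below is a hand-written stand-in for `scipy.stats.pearsonr` (an assumption made to keep the example dependency-free); the two lists are the values shown in this notebook's outputs.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gold = [5, 4, 3, 2, 1, 0]
sims = [0.30769231, 0.26315789, 0.46666667, 0.45454545, 0.23076923, 0.13793103]

r_raw = pearson(gold, sims)
r_scaled = pearson(gold, [5 * s for s in sims])  # same values on the 0-5 scale
print(r_raw, r_scaled)  # both are about 0.3962
```

Multiplying by 5 scales the covariance and the spread of the second list by the same factor, so the ratio is unchanged.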



In [5]:
import pandas as pd

from scipy.stats import pearsonr

gs = pd.read_csv('trial/STS.gs.txt', sep='\t', header=None)
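refs = list(reversed(gs[1].values))  # gold file order is reversed relative to the input
print(f'Gold standard: {refs}')
tsts = result * 5  # similarity (1 - Jaccard distance) rescaled to the 0-5 gold range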
print(f'Jaccard distance: {tsts}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')


Gold standard: [5, 4, 3, 2, 1, 0]
Jaccard distance: [1.53846154 1.31578947 2.33333333 2.27272727 1.15384615 0.68965517]
Pearson correlation: 0.39623897761192317


## Optional: Evaluate against training data set

In [6]:
url_train = 'https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz'
!wget $url_train

with tarfile.open('train.tgz', "r:gz") as tar:
  tar.extractall()
!ls

--2019-09-26 08:33:19--  https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz
Resolving www.cs.york.ac.uk (www.cs.york.ac.uk)... 144.32.128.40
Connecting to www.cs.york.ac.uk (www.cs.york.ac.uk)|144.32.128.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125822 (123K) [application/x-gzip]
Saving to: ‘train.tgz.1’


2019-09-26 08:33:20 (401 KB/s) - ‘train.tgz.1’ saved [125822/125822]

sample_data  train  train.tgz  train.tgz.1  trial  trial.tgz


In [10]:
import pandas as pd

train_data = pd.read_csv('train/STS.input.MSRpar.txt', sep='\t', names=['sent1', 'sent2'])
train_data = train_data.astype(str)                
train_data.head(5)

| | sent1 | sent2 |
|---|---|---|
| 0 | But other sources close to the sale said Viven... | But other sources close to the sale said Viven... |
| 1 | Micron has declared its first quarterly profit... | Micron's numbers also marked the first quarter... |
| 2 | The fines are part of failed Republican effort... | Perry said he backs the Senate's efforts, incl... |
| 3 | The American Anglican Council, which represent... | The American Anglican Council, which represent... |
| 4 | The tech-loaded Nasdaq composite rose 20.96 po... | The technology-laced Nasdaq Composite Index <.... |


## Pre-process the input data and compute the Jaccard distance


In [15]:
result = []

for index, row in train_data.iterrows():
  result.append(jaccard_distance(set(nltk.word_tokenize(row['sent1'])),
                             set(nltk.word_tokenize(row['sent2']))))
  
result = 1 - np.array(result)
result[:10]

array([0.5483871 , 0.42105263, 0.33333333, 0.63333333, 0.19354839,
       0.1875    , 0.38888889, 0.57142857, 0.47826087, 0.42307692])

## Pearson correlation

In [16]:
gs = pd.read_csv('train/STS.gs.MSRpar.txt', names=['gs'])
gs['gs'] = gs['gs'].astype(float)
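refs = list(gs['gs'][:728].values)  # truncated to the 728 rows parsed into train_data
print(f'Gold standard: {refs[:10]}')
tsts = result * 5  # similarity (1 - Jaccard distance) rescaled to the 0-5 gold range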
print(f'Jaccard distance: {tsts[:10]}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')

Gold standard: [4.0, 3.75, 2.8, 3.4, 2.4, 1.3330000000000002, 4.6, 3.8, 4.2, 2.6]
Jaccard distance: [2.74193548 2.10526316 1.66666667 3.16666667 0.96774194 0.9375
 1.94444444 2.85714286 2.39130435 2.11538462]
Pearson correlation: 0.167166888344058
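
The `[:728]` slice above implies that the parsed input has 728 rows. If the gold file has more rows than that, some input lines were probably merged during parsing, and the sentence pairs would then no longer line up with their scores. One possible cause, stated here as an assumption and not verified against this dataset, is that double quotes inside the sentences are treated as field quoting. The sketch below shows the effect with the standard `csv` module on a made-up three-line sample.

```python
import csv
import io

# Hypothetical sample: the second and third lines each contain one stray quote.
sample = 'a1\ta2\nb1\t"b2\nc1\tc2"\n'

# Default quoting: the text between the two quotes becomes one multi-line field.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))

# QUOTE_NONE: quotes are ordinary characters, so every line stays a separate row.
literal_rows = list(csv.reader(io.StringIO(sample), delimiter='\t',
                               quoting=csv.QUOTE_NONE))

print(len(default_rows), len(literal_rows))  # 2 3
```

`pd.read_csv` accepts the same `quoting=csv.QUOTE_NONE` argument, so comparing `len(train_data)` with `len(gs)` before and after adding it would show whether this applies here.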
