# STS Project
## Introduction
*Jupyter Notebook* for the STS (Semantic Textual Similarity) project of the **Introduction to Human Language Technologies** (IHLT) course, part of the MAI (Master of Artificial Intelligence) at UPC.

This project was carried out by:
- David Dueñas Gaviria
- Kevin David Rosales Santana

The statement is as follows:
- Use the data set and task description of Semantic Textual Similarity from SemEval 2012.

- Implement some approaches to detect paraphrases using sentence similarity metrics.

    - Explore some lexical dimensions.
    - Explore the syntactic dimension alone.
    - Explore the combination of both.

- Add new components of your choice (optional).

- Neither word nor sentence embeddings are allowed.

- Compare and comment on the results achieved by these approaches, both against each other and against the official results.

- Send files to raco in IHLT STS Project before the oral presentation:

    - Jupyter notebook: `sts-[Student1]-[Student2].ipynb`

    - Slides: `sts-[Student1]-[Student2].pdf`

## Imports

In [30]:
import nltk, re
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

from nltk.tag import CRFTagger

from nltk.corpus import wordnet as wn

import dill

## 1. Data Preparation
This section covers the preparation of the input data. The data used throughout the IHLT course come mostly from the *trial* set. In this project, however, the *test* set is used to compute the similarities and measure the performance of the proposed models.

The input data is composed of five files:
- `STS.input.MSRpar.txt`
- `STS.input.MSRvid.txt`
- `STS.input.SMTeuroparl.txt`
- `STS.input.surprise.OnWN.txt`
- `STS.input.surprise.SMTnews.txt`

The sentence pairs are therefore obtained by concatenating these five input files.

In [2]:
pairs = list()
input_files = ['STS.input.MSRpar.txt',
               'STS.input.MSRvid.txt',
               'STS.input.SMTeuroparl.txt',
               'STS.input.surprise.OnWN.txt',
               'STS.input.surprise.SMTnews.txt']

# Read every input file and split each line on the tab that separates the two sentences.
for file in input_files:
    with open('inputs/test-gold/' + file, 'r') as f:
        for line in f:
            line = nltk.TabTokenizer().tokenize(line.strip())
            pairs.append((line[0], line[1]))

# Print the pairs with 1-based numbering.
for index, pair in enumerate(pairs, 1):
    print(str(index) + ".", pair)

1. ('The problem likely will mean corrective changes before the shuttle fleet starts flying again.', 'He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.')
2. ('The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.', "The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.")
3. ('"It\'s a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896.', '"It\'s a huge black eye," Arthur Sulzberger, the newspaper\'s publisher, said of the scandal.')
4. ('SEC Chairman William Donaldson said there is a "building confidence out there that the cop is on the beat."', '"I think there\'s a building confidence that the cop is on the beat."')
5. ('Vivendi shares closed 1.9 percent at 15.80 euros in Paris after falling 3.6 percent on Monday.', 'In New York, Vivendi shares were 1.4 percent down at $18.29.')
6. ("Myanmar's pro-democr

1079. ('A woman is slicing a pumpkin.', 'A woman is riding an elephant.')
1080. ('A woman is dancing.', 'A woman is kneading dough.')
1081. ('A woman is dancing.', 'A woman is playing violin.')
1082. ('A woman is cracking eggs into a bowl.', 'A woman is placing skewers onto a rack.')
1083. ('A car is moving through a road.', 'A car is driving down the road.')
1084. ('A woman is slicing some tomatoes.', 'A woman is chopping a potato.')
1085. ('Someone is greasing a pan.', 'Smeone is laying down.')
1086. ('A man is playing guitar.', 'A man is draining pasta.')
1087. ('A woman is drinking vodka.', 'A woman is exercising.')
1088. ('A woman is water skiing.', 'A woman is slicing fish.')
1089. ('A man is playing an electronic keyboard.', 'A man is playing a flute.')
1090. ('A woman is cutting meat into pieces.', 'A woman is cutting up some meat.')
1091. ('A man is playing a guitar.', 'A man is eating a banana.')
1092. ('A man is playing a flute.', 'A man is riding a scooter.')
1093. ('A man 

2194. ('forbid, prohibit', 'forbid the public distribution of ( a movie or a newspaper).')
2195. ('computer text files', '(computer science) a computer file that contains text (and possibly formatting instructions) using seven-bit ASCII characters.')
2196. ('a speech act of acknowledging gratitude for something', 'an acknowledgment of appreciation.')
2197. ('a state of pretending to be someone, something else', 'an outward semblance that misrepresents the true nature of something.')
2198. ('Become widespread as an idea or feeling.', 'cause to become widely known.')
2199. ('connect with; reach a target or goal', 'perceive with the senses quickly, suddenly, or momentarily.')
2200. ('income collected as tax by a government', 'government income due to taxation.')
2201. ('be the deciding factor in a state of affairs or an outcome', 'shape or influence; give direction to.')
2202. ('end', 'put an end to.')
2203. ('a region allocated to hold something', 'the particular portion of space occupie

In [3]:
print(pairs[0][0])

The problem likely will mean corrective changes before the shuttle fleet starts flying again.


In [4]:
print("Number of pairs of sentences:", len(pairs))

Number of pairs of sentences: 3108


It is also necessary to read the *Gold Standard* file, which contains the reference similarity score for each sentence pair. These scores are used to measure the performance of the proposed models.

In [5]:
with open('inputs/test-gold/STS.gs.ALL.txt','r') as f:
    gs = [float(line) for line in f.readlines()]

print("Gold standard size:", len(gs))

Gold standard size: 3108
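
Since the evaluation relies on the pairs and the gold-standard scores being aligned line by line, a small consistency check can be useful. The following is only a sketch: `load_sts` is a hypothetical helper that is not part of the notebook, the two sentence pairs are copied from the output above, and the two scores are placeholder values for illustration, not values read from `STS.gs.ALL.txt`.

```python
def load_sts(pair_lines, score_lines):
    """Attach a gold-standard score to each tab-separated sentence pair."""
    pairs = []
    for line in pair_lines:
        # Each input line holds two sentences separated by a tab.
        first, second = line.rstrip("\n").split("\t")[:2]
        pairs.append((first, second))
    scores = [float(line) for line in score_lines if line.strip()]
    # The two files must describe the same number of examples.
    if len(pairs) != len(scores):
        raise ValueError(f"{len(pairs)} pairs but {len(scores)} scores")
    return [(a, b, s) for (a, b), s in zip(pairs, scores)]


# Sentence pairs taken from the output above.
sample_pairs = [
    "A woman is dancing.\tA woman is playing violin.\n",
    "A car is moving through a road.\tA car is driving down the road.\n",
]
# Placeholder scores, for illustration only.
sample_scores = ["0.5\n", "4.0\n"]

dataset = load_sts(sample_pairs, sample_scores)
print(len(dataset))
```

Raising an error on a length mismatch makes a missing or extra line visible immediately, instead of letting it silently shift every score that follows.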


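## 2. Paraphrase detection using different approaches
The following preprocessing variants are considered:
- Lower case
- Stop-word removal
- Regular expression keeping only tokens that contain a word character (`\w`)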
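
The three variants listed above can be sketched on a single tokenized sentence as follows. This is a minimal illustration, not the notebook's implementation: the function names are hypothetical and `STOP_WORDS` is a tiny stand-in for `stopwords.words('english')`, so that the example runs without downloading any NLTK data.

```python
import re

# Tiny stand-in for NLTK's English stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "is", "the"}


def lower_case(tokens):
    """Variant 1: lower-case every token."""
    return [t.lower() for t in tokens]


def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Variant 2: remove tokens whose lower-cased form is a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


def only_words(tokens):
    """Variant 3: keep tokens containing at least one word character."""
    return [t for t in tokens if re.search(r"\w", t)]


tokens = ['A', 'woman', 'is', 'dancing', '.']
print(lower_case(tokens))
print(drop_stop_words(tokens))
print(only_words(tokens))
```

Note that the regular-expression variant removes pure punctuation such as `'.'` but keeps tokens like `"'s"` or `'0.11'`, because they contain at least one word character.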

### 2.1 Lexical Dimensions

### 2.1.1 Words tokenization

In [6]:
# Kevin

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.1.2 Lemma tokenization (PoS)

In [7]:
print(pairs[0][0])

The problem likely will mean corrective changes before the shuttle fleet starts flying again.


In [8]:
pairs_tokenized = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

for index, pair in enumerate(pairs_tokenized):
    print(str(index + 1) + ".", pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'changes', 'before', 'the', 'shuttle', 'fleet', 'starts', 'flying', 'again', '.'], ['He', 'said', 'the', 'problem', 'needs', 'to', 'be', 'corrected', 'before', 'the', 'space', 'shuttle', 'fleet', 'is', 'cleared', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'Nasdaq', 'Composite', 'Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'said', 'publisher', 'Arthur', 'Ochs', 'Sulzberger', 'Jr.', ',', 'whose', 'family', 'has', 'controlled', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'Arthur', 'Sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'said', 'of', 'the', 'scandal', '.'

576. (['The', 'veteran', 'Malyasian', 'diplomat', 'met', 'Suu', 'Kyi', 'Wednesday', 'at', 'the', 'lakeside', 'home', 'in', 'Yangon', 'where', 'she', 'is', 'under', 'house', 'arrest', '.'], ['Razali', 'Ismail', 'met', 'for', '90', 'minutes', 'with', 'Suu', 'Kyi', ',', 'a', '1991', 'winner', 'of', 'the', 'Nobel', 'Peace', 'Prize', ',', 'at', 'her', 'lakeside', 'home', ',', 'where', 'she', 'is', 'under', 'house', 'arrest', '.']) 

577. (['The', 'flamboyant', 'entrepreneur', 'flagged', 'the', 'plan', 'after', 'a', 'meeting', 'in', 'London', 'with', 'Australian', 'Tourism', 'Minister', 'Joe', 'Hockey', '.'], ['Sir', 'Richard', 'was', 'speaking', 'after', 'a', 'meeting', 'in', 'London', 'with', 'Australian', 'Tourism', 'Minister', 'Joe', 'Hockey', '.']) 

578. (['Government', 'bonds', 'sold', 'off', 'sharply', 'after', 'Greenspan', 'told', 'Congress', 'on', 'Tuesday', 'that', 'the', 'U.S.', 'economy', '``', 'could', 'very', 'well', 'be', 'embarking', 'on', 'a', 'period', 'of', 'extended', 'g

1154. (['The', 'man', 'is', 'shooting', 'an', 'automatic', 'rifle', '.'], ['A', 'man', 'is', 'shooting', 'a', 'gun', '.']) 

1155. (['A', 'rabbit', 'is', 'playing', 'with', 'a', 'toy', 'rabbit', '.'], ['A', 'bunny', 'is', 'playing', 'with', 'a', 'stuffed', 'bunny', '.']) 

1156. (['A', 'plane', 'rides', 'on', 'a', 'road', '.'], ['An', 'airplane', 'moves', 'along', 'a', 'runway', '.']) 

1157. (['A', 'man', 'is', 'crushing', 'garlic', 'with', 'the', 'back', 'of', 'a', 'knife', '.'], ['A', 'woman', 'is', 'slicing', 'meat', 'with', 'a', 'knife', '.']) 

1158. (['A', 'woman', 'is', 'filing', 'her', 'fingernails', 'with', 'an', 'emery', 'board', '.'], ['A', 'woman', 'is', 'feeding', 'her', 'baby', 'with', 'a', 'bottle', '.']) 

1159. (['A', 'man', 'is', 'riding', 'a', 'skateboard', '.'], ['A', 'woman', 'is', 'cutting', 'a', 'potato', '.']) 

1160. (['A', 'woman', 'is', 'taking', 'a', 'shower', '.'], ['A', 'man', 'is', 'riding', 'a', 'white', 'horse', '.']) 

1161. (['A', 'man', 'is', 'danci

1917. (['Consumers', 'will', 'lose', 'out', ',', 'employees', 'will', 'lose', 'out', ',', 'Europe', 'will', 'lose', 'competitive', 'strength', 'and', 'growth', '.'], ['The', 'users', 'will', 'be', 'the', 'losers', ',', 'with', 'employees', ',', 'and', 'European', 'competitiveness', 'and', 'growth', 'will', 'diminish', '.']) 

1918. (['As', 'I', 'already', 'explained', 'during', 'second', 'reading', ',', 'there', 'is', 'a', 'crisis', 'underlying', 'this', 'directive', 'amendment', '.'], ['As', 'I', 'have', 'already', 'explained', 'in', 'second', 'reading', ',', 'a', 'crisis', 'is', 'on', 'the', 'basis', 'of', 'this', 'amendment', 'of', 'directive', '.']) 

1919. (['Consumers', 'will', 'lose', 'out', ',', 'employees', 'will', 'lose', 'out', ',', 'Europe', 'will', 'lose', 'competitive', 'strength', 'and', 'growth', '.'], ['The', 'users', 'will', 'be', 'the', 'losers', ',', 'with', 'employees', ',', 'and', 'European', 'competitiveness', 'and', 'growth', 'will', 'diminish', '.']) 

1920. ([

2764. (['None', 'of', 'this', 'absolves', 'rich', 'countries', 'of', 'their', 'responsibility', 'to', 'help', '.'], ['This', 'does', 'not', 'however', 'absolve', 'the', 'rich', 'countries', 'of', 'their', 'obligation', 'to', 'help', '.']) 

2765. (['Indeed', ',', 'the', 'French', 'philosopher', 'Marcel', 'Gauchet', 'entitled', 'a', 'recent', 'book', 'Democracy', 'Against', 'Itself', '.'], ['We', 'can', 'also', 'note', 'that', 'the', 'French', 'philosopher', 'Marcel', 'Gauchet', 'has', 'called', 'one', 'of', 'these', 'recent', 'books', 'democracy', 'against', 'itself', '.']) 

2766. (['After', 'all', ',', 'they', 'recommend', 'the', 'policies', 'that', 'politicians', 'may', 'or', 'may', 'not', 'want', 'to', 'consider', '.'], ['After', 'all', ',', 'they', 'have', 'recommended', 'the', 'policies', 'that', 'the', 'politicians', 'or', 'not', 'consideration', '.']) 

2767. (['That', 'is', 'the', 'spirit', 'in', 'which', 'the', 'European', 'Constitution', 'must', 'be', 'written', '.'], ['This

Lemmas are computed with `WordNetLemmatizer` from `nltk`, which relies on WordNet's built-in `morphy` function.

In [9]:
wnl = WordNetLemmatizer()

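def lemmatize(p, lower=False):
    """Lemmatize a (word, tag) pair, falling back to the word itself on failure."""
    try:
        # The lower-cased first letter of the Penn Treebank tag is used as WordNet PoS.
        # Note that a successfully lemmatized word is always lower-cased.
        return wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
    except Exception:
        # In the run shown, tags with no WordNet counterpart (e.g. DT) end up here;
        # `lower` only affects this fallback branch.
        if lower:
            return p[0].lower()
        return p[0]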
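
The `lemmatize` function passes the lower-cased first letter of the Penn Treebank tag as the WordNet part of speech. This matches WordNet's letters for nouns (`NN*` gives `n`), verbs (`VB*` gives `v`) and adverbs (`RB*` gives `r`), but adjectives are tagged `JJ*` whereas WordNet uses `a`, so in the outputs of this notebook adjectives appear to be returned unlemmatized. A possible explicit mapping is sketched below, under the assumption that one would also want adjectives lemmatized; `penn_to_wordnet` is a hypothetical helper and is not part of the notebook or of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet PoS letter, or None if there is none."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("N"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    return None  # determiners, prepositions, punctuation, ...


for tag in ["NN", "VBD", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

The returned letter could then be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`, with tokens mapped to `None` kept unchanged. Doing so would change the lemmas, and therefore the similarities and correlation reported in this notebook.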

In [10]:
pairs_pos = [(pos_tag(p[0]), pos_tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs_pos):
    print(str(index + 1) + ".", pair, '\n')

1. ([('The', 'DT'), ('problem', 'NN'), ('likely', 'RB'), ('will', 'MD'), ('mean', 'VB'), ('corrective', 'JJ'), ('changes', 'NNS'), ('before', 'IN'), ('the', 'DT'), ('shuttle', 'JJ'), ('fleet', 'NN'), ('starts', 'NNS'), ('flying', 'VBG'), ('again', 'RB'), ('.', '.')], [('He', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('problem', 'NN'), ('needs', 'VBZ'), ('to', 'TO'), ('be', 'VB'), ('corrected', 'VBN'), ('before', 'IN'), ('the', 'DT'), ('space', 'NN'), ('shuttle', 'NN'), ('fleet', 'NN'), ('is', 'VBZ'), ('cleared', 'VBN'), ('to', 'TO'), ('fly', 'VB'), ('again', 'RB'), ('.', '.')]) 

2. ([('The', 'DT'), ('technology-laced', 'JJ'), ('Nasdaq', 'NNP'), ('Composite', 'NNP'), ('Index', 'NNP'), ('.IXIC', 'NNP'), ('inched', 'VBD'), ('down', 'RB'), ('1', 'CD'), ('point', 'NN'), (',', ','), ('or', 'CC'), ('0.11', 'CD'), ('percent', 'NN'), (',', ','), ('to', 'TO'), ('1,650', 'CD'), ('.', '.')], [('The', 'DT'), ('broad', 'JJ'), ('Standard', 'NNP'), ('&', 'CC'), ('Poor', 'NNP'), ("'s", 'POS'), ('500', '


496. ([('Other', 'JJ'), ('changes', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('plan', 'NN'), ('refine', 'VB'), ('his', 'PRP$'), ('original', 'JJ'), ('vision', 'NN'), (',', ','), ('Libeskind', 'NNP'), ('said', 'VBD'), ('.', '.')], [('Many', 'JJ'), ('of', 'IN'), ('the', 'DT'), ('changes', 'NNS'), ('are', 'VBP'), ('improvements', 'NNS'), ('to', 'TO'), ('the', 'DT'), ('original', 'JJ'), ('plan', 'NN'), (',', ','), ('Libeskind', 'NNP'), ('said', 'VBD'), ('.', '.')]) 

497. ([('Revenue', 'NN'), ('rose', 'VBD'), ('to', 'TO'), ('$', '$'), ('616.5', 'CD'), ('million', 'CD'), ('from', 'IN'), ('$', '$'), ('610.6', 'CD'), ('million', 'CD'), ('a', 'DT'), ('year', 'NN'), ('earlier', 'RBR'), ('.', '.')], [('Revenue', 'NN'), ('was', 'VBD'), ('up', 'RB'), ('a', 'DT'), ('tad', 'NN'), (',', ','), ('from', 'IN'), ('$', '$'), ('610.6', 'CD'), ('million', 'CD'), ('to', 'TO'), ('$', '$'), ('616.5', 'CD'), ('million', 'CD'), ('.', '.')]) 

498. ([('The', 'DT'), ('broad', 'JJ'), ('Standard', 'NNP'), ('&', 'CC'), 

1107. ([('A', 'DT'), ('woman', 'NN'), ('is', 'VBZ'), ('cutting', 'VBG'), ('an', 'DT'), ('onion', 'NN'), ('.', '.')], [('A', 'DT'), ('woman', 'NN'), ('is', 'VBZ'), ('cleaning', 'VBG'), ('a', 'DT'), ('garden', 'NN'), ('.', '.')]) 

1108. ([('Butter', 'NNP'), ('is', 'VBZ'), ('melting', 'VBG'), ('in', 'IN'), ('a', 'DT'), ('pan', 'NN'), ('.', '.')], [('A', 'DT'), ('man', 'NN'), ('is', 'VBZ'), ('melting', 'VBG'), ('butter', 'NN'), ('in', 'IN'), ('a', 'DT'), ('pan', 'NN'), ('.', '.')]) 

1109. ([('Some', 'DT'), ('men', 'NNS'), ('are', 'VBP'), ('sawing', 'VBG'), ('.', '.')], [('Men', 'NNS'), ('are', 'VBP'), ('sawing', 'VBG'), ('logs', 'NNS'), ('.', '.')]) 

1110. ([('Three', 'CD'), ('boys', 'NNS'), ('are', 'VBP'), ('dancing', 'VBG'), ('.', '.')], [('Kids', 'NNS'), ('are', 'VBP'), ('dancing', 'VBG'), ('.', '.')]) 

1111. ([('A', 'DT'), ('woman', 'NN'), ('is', 'VBZ'), ('playing', 'VBG'), ('a', 'DT'), ('game', 'NN'), ('with', 'IN'), ('a', 'DT'), ('man', 'NN'), ('.', '.')], [('A', 'DT'), ('man', '


1694. ([('Maij-Weggen', 'JJ'), ('report', 'NN'), ('(', '('), ('A5-0323/2000', 'NNP'), (')', ')')], [('Maij-Weggen', 'JJ'), ('report', 'NN'), ('(', '('), ('A5-0323', 'NNP'), ('/', 'NNP'), ('2000', 'CD'), (')', ')')]) 

1695. ([('Mr', 'NNP'), ('President', 'NNP'), (',', ','), ('the', 'DT'), ('Cashman', 'NNP'), ('report', 'NN'), ('can', 'MD'), ('be', 'VB'), ('summarised', 'VBN'), ('in', 'IN'), ('four', 'CD'), ('words', 'NNS'), (':', ':'), ('citizens', 'NNS'), ("'", 'POS'), ('power', 'NN'), ('over', 'IN'), ('bureaucracy', 'NN'), ('.', '.')], [('Mr', 'NNP'), ('President', 'NNP'), (',', ','), ('the', 'DT'), ('Cashman', 'NNP'), ('report', 'NN'), ('can', 'MD'), ('be', 'VB'), ('summarized', 'VBN'), ('in', 'IN'), ('two', 'CD'), ('words', 'NNS'), (':', ':'), ('the', 'DT'), ('power', 'NN'), ('of', 'IN'), ('the', 'DT'), ('people', 'NNS'), ('on', 'IN'), ('the', 'DT'), ('tape', 'NN'), ('.', '.')]) 

1696. ([('Then', 'RB'), ('perhaps', 'RB'), ('we', 'PRP'), ('could', 'MD'), ('have', 'VB'), ('avoided'

2279. ([('a', 'DT'), ('coefficient', 'NN'), ('assigned', 'VBN'), ('to', 'TO'), ('elements', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('frequency', 'NN'), ('distribution', 'NN'), ('in', 'IN'), ('order', 'NN'), ('to', 'TO'), ('indicate', 'VB'), ('the', 'DT'), ('relative', 'JJ'), ('importance', 'NN'), ('of', 'IN'), ('each', 'DT'), ('element', 'NN')], [('(', '('), ('statistics', 'NNS'), (')', ')'), ('a', 'DT'), ('coefficient', 'NN'), ('assigned', 'VBN'), ('to', 'TO'), ('elements', 'NNS'), ('of', 'IN'), ('a', 'DT'), ('frequency', 'NN'), ('distribution', 'NN'), ('in', 'IN'), ('order', 'NN'), ('to', 'TO'), ('represent', 'VB'), ('their', 'PRP$'), ('relative', 'JJ'), ('importance', 'NN'), ('.', '.')]) 

2280. ([('(', '('), ('cause', 'NN'), ('to', 'TO'), (')', ')'), ('move', 'NN'), ('around', 'IN')], [('cause', 'NN'), ('to', 'TO'), ('move', 'VB'), ('round', 'NN'), ('and', 'CC'), ('round', 'NN'), ('.', '.')]) 

2281. ([('become', 'NN'), ('due', 'JJ')], [('become', 'NN'), ('due', 'JJ'), ('for', 'IN'), (

2948. ([('Moreover', 'RB'), (',', ','), ('the', 'DT'), ('main', 'JJ'), ('oil', 'NN'), ('exporters', 'NNS'), ('are', 'VBP'), ('unwilling', 'JJ'), ('to', 'TO'), ('subordinate', 'VB'), ('their', 'PRP$'), ('investment', 'NN'), ('policies', 'NNS'), ('to', 'TO'), ('market', 'NN'), ('requirements', 'NNS'), ('.', '.')], [('Moreover', 'RB'), (',', ','), ('the', 'DT'), ('major', 'JJ'), ('oil', 'NN'), ('exporters', 'NNS'), ('are', 'VBP'), ('not', 'RB'), ('prepared', 'JJ'), ('to', 'TO'), ('subject', 'VB'), ('their', 'PRP$'), ('policies', 'NNS'), ('of', 'IN'), ('investment', 'NN'), ('to', 'TO'), ('market', 'NN'), ('rates', 'NNS'), ('.', '.')]) 

2949. ([('The', 'DT'), ('book', 'NN'), ("'s", 'POS'), ('success', 'NN'), ('is', 'VBZ'), ('itself', 'PRP'), ('a', 'DT'), ('sign', 'NN'), ('of', 'IN'), ('a', 'DT'), ('kind', 'NN'), ('of', 'IN'), ('``', '``'), ('malaise', 'NN'), ('.', '.'), ("''", "''")], [('The', 'DT'), ('success', 'NN'), ('of', 'IN'), ('the', 'DT'), ('book', 'NN'), ('is', 'VBZ'), ('in', 'IN'

In [27]:
sw = set(stopwords.words('english'))

pairs_lemmas = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs_pos:
    pairs_lemmas.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(pairs_lemmas):
    print(str(index + 1) + ".", t_pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'change', 'before', 'the', 'shuttle', 'fleet', 'start', 'fly', 'again', '.'], ['He', 'say', 'the', 'problem', 'need', 'to', 'be', 'correct', 'before', 'the', 'space', 'shuttle', 'fleet', 'be', 'clear', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inch', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'standard', '&', 'poor', "'s", '500', 'index', '.spx', 'inch', 'up', '3', 'point', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'say', 'publisher', 'arthur', 'ochs', 'sulzberger', 'jr.', ',', 'whose', 'family', 'have', 'control', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'arthur', 'sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'say', 'of', 'the', 'scandal', '.']) 

4. (['sec', 'ch

683. (['sterling', 'be', 'down', '0.8', 'percent', 'against', 'the', 'dollar', 'at', '$', '1.5875', 'gbp=', '.'], ['The', 'dollar', 'rise', '0.15', 'percent', 'against', 'the', 'Japanese', 'currency', 'to', '115.97', 'yen', '.']) 

684. (['The', 'dow', 'jones', 'industrial', 'average', '.dji', 'rise', '41.61', 'point', ',', 'or', '0.44', 'percent', ',', 'to', '9,415.82', '.'], ['The', 'dow', 'jones', 'rise', '41.61', 'point', 'friday', ',', 'a', 'gain', 'of', '0.4', '%', 'for', 'the', 'day', 'and', '0.7', '%', 'for', 'the', 'week', '.']) 

685. (['``', 'right', 'from', 'the', 'beginning', ',', 'we', 'do', "n't", 'want', 'to', 'see', 'anyone', 'take', 'a', 'cut', 'in', 'pay', '.'], ['But', 'mr.', 'crosby', 'tell', 'The', 'associated', 'press', ':', '``', 'right', 'from', 'the', 'beginning', ',', 'we', 'do', "n't", 'want', 'to', 'see', 'anyone', 'take', 'a', 'cut', 'in', 'pay', '.']) 

686. (['``', 'There', 'be', 'a', 'number', 'of', 'bureaucratic', 'and', 'administrative', 'miss', 'sign

1222. (['A', 'monkey', 'pull', 'the', 'tail', 'of', 'a', 'dog', '.'], ['A', 'monkey', 'be', 'tease', 'a', 'dog', 'at', 'the', 'zoo', '.']) 

1223. (['A', 'dog', 'lie', 'on', 'his', 'back', 'on', 'a', 'wooden', 'floor', '.'], ['A', 'dog', 'be', 'lay', 'on', 'the', 'floor', '.']) 

1224. (['A', 'man', 'be', 'play', 'a', 'flute', '.'], ['A', 'dog', 'be', 'bark', 'at', 'a', 'fly', '.']) 

1225. (['A', 'woman', 'be', 'fry', 'something', 'in', 'the', 'pan', '.'], ['A', 'man', 'be', 'play', 'his', 'guitar', '.']) 

1226. (['A', 'boy', 'be', 'crawl', 'into', 'a', 'dog', 'house', '.'], ['A', 'boy', 'be', 'play', 'a', 'wooden', 'flute', '.']) 

1227. (['A', 'man', 'put', 'season', 'in', 'a', 'bowl', 'of', 'water', '.'], ['The', 'man', 'add', 'season', 'to', 'water', 'in', 'a', 'bowl', '.']) 

1228. (['An', 'elderly', 'woman', 'be', 'pour', 'oil', 'into', 'a', 'frying', 'pan', '.'], ['A', 'tied', 'man', 'be', 'put', 'into', 'water', '.']) 

1229. (['A', 'baby', 'rhino', 'be', 'walk', 'around', 'h

1820. (['It', 'be', 'our', 'job', 'to', 'continue', 'to', 'support', 'latvia', 'with', 'the', 'integration', 'of', 'the', 'Russian', 'population', '.'], ['It', 'be', 'our', 'duty', 'to', 'continue', 'to', 'press', 'for', 'latvia', 'on', 'the', 'issue', 'of', 'the', 'integration', 'of', 'the', 'Russian', 'people', '.']) 

1821. (['Unanimous', 'decision', ',', 'and', 'hence', 'an', 'inherent', 'incapacity', 'to', 'act', ',', 'remain', 'largely', 'the', 'norm', 'in', 'the', 'council', '.'], ['We', 'maintain', 'unanimity', 'in', 'the', 'council', 'and', 'therefore', 'a', 'latent', 'inability', 'to', 'act', '.']) 

1822. (['As', 'I', 'already', 'explain', 'during', 'second', 'reading', ',', 'there', 'be', 'a', 'crisis', 'underlie', 'this', 'directive', 'amendment', '.'], ['As', 'I', 'have', 'already', 'explain', 'at', 'second', 'reading', ',', 'a', 'crisis', 'be', 'at', 'the', 'root', 'of', 'this', 'amendment', 'to', 'directive', '.']) 

1823. (['The', 'standard', 'be', 'scarcely', 'compara

2568. (['give', 'an', 'incentive', 'or', 'reason', 'for', 'action', '.'], ['give', 'an', 'incentive', 'for', 'action', '.']) 

2569. (['the', 'day', 'before', 'today'], ['the', 'day', 'immediately', 'before', 'today', '.']) 

2570. (['take', 'or', 'capture', 'by', 'force', 'or', 'authority'], ['take', 'or', 'capture', 'by', 'force', '.']) 

2571. (['(', 'medicine', ')', 'apply', 'a', 'plaster', 'cast', 'to', '.'], ['apply', 'a', 'plaster', 'cast', 'to', '.']) 

2572. (['interfere', 'with', 'someone', 'else', "'s", 'activity'], ['interfere', 'in', 'someone', 'else', "'s", 'activity', '.']) 

2573. (['assign', 'a', 'new', 'time', 'for', 'an', 'event'], ['assign', 'a', 'new', 'time', 'and', 'place', 'for', 'an', 'event', '.']) 

2574. (['value', 'measure', 'by', 'what', 'must', 'be', 'give', 'or', 'do', 'to', 'obtain', 'something', '.'], ['value', 'measure', 'by', 'what', 'must', 'be', 'give', 'or', 'do', 'or', 'undergone', 'to', 'obtain', 'something', '.']) 

2575. (['touch', 'or', 'hold

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

Once the sentences have been reduced to lemmas, the similarities can be computed following the same steps as in the last laboratory session: the [*Jaccard distance*](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.jaccard_distance) between the two sets of lemmas is computed and turned into a similarity.

$\text{Similarity} = 1 - \text{Jaccard}_{\text{distance}}$
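
Equivalently, the similarity is the size of the intersection of the two token sets divided by the size of their union. The sketch below spells this out in plain Python, without NLTK, on the lemmas of the first pair as printed above; `jaccard_similarity` is a hypothetical helper, and returning `0.0` for two empty inputs is a convention chosen here for illustration.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for the sets of tokens A and B."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen for this sketch
    return len(set_a & set_b) / len(union)


# Lemmas of the first sentence pair, copied from the output above.
first = ['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'change',
         'before', 'the', 'shuttle', 'fleet', 'start', 'fly', 'again', '.']
second = ['He', 'say', 'the', 'problem', 'need', 'to', 'be', 'correct',
          'before', 'the', 'space', 'shuttle', 'fleet', 'be', 'clear',
          'to', 'fly', 'again', '.']

score = jaccard_similarity(first, second)
print(round(score, 4))  # 8 shared lemmas out of 23 distinct ones -> 0.3478
```

Because sets are used, repeated tokens count only once and word order is ignored. `'The'` and `'the'` are counted as different tokens here, which is what the lower-case variant addresses.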

In [28]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in pairs_lemmas]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)

Similarities (considering lemmas):

1. 0.34782608695652173
2. 0.3214285714285714
3. 0.5555555555555556
4. 0.5909090909090908
5. 0.19999999999999996
6. 0.4878048780487805
7. 0.34782608695652173
8. 0.5405405405405406
9. 0.6428571428571428
10. 0.631578947368421
11. 0.4117647058823529
12. 0.46153846153846156
13. 0.5
14. 0.3846153846153846
15. 0.31999999999999995
16. 0.4516129032258065
17. 0.4054054054054054
18. 0.45833333333333337
19. 0.4444444444444444
20. 0.42307692307692313
21. 0.4285714285714286
22. 0.4242424242424242
23. 0.76
24. 0.64
25. 0.5
26. 0.6296296296296297
27. 0.65625
28. 0.2857142857142857
29. 0.38888888888888884
30. 0.4
31. 0.6538461538461539
32. 0.4
33. 0.47619047619047616
34. 0.3913043478260869
35. 0.4782608695652174
36. 0.84
37. 0.46153846153846156
38. 0.33333333333333337
39. 0.5161290322580645
40. 0.5666666666666667
41. 0.5333333333333333
42. 0.8518518518518519
43. 0.4545454545454546
44. 0.4482758620689655
45. 0.6551724137931034
46. 0.30000000000000004
47. 0.38095238095

1342. 0.25
1343. 0.5555555555555556
1344. 0.30000000000000004
1345. 0.3076923076923077
1346. 0.3846153846153846
1347. 0.30000000000000004
1348. 0.36363636363636365
1349. 0.5
1350. 0.3846153846153846
1351. 0.4285714285714286
1352. 0.4444444444444444
1353. 0.18181818181818177
1354. 0.2222222222222222
1355. 0.25
1356. 0.36363636363636365
1357. 0.3571428571428571
1358. 0.2857142857142857
1359. 0.3076923076923077
1360. 0.3846153846153846
1361. 0.09090909090909094
1362. 0.25
1363. 0.2727272727272727
1364. 0.33333333333333337
1365. 0.2727272727272727
1366. 0.4285714285714286
1367. 0.41666666666666663
1368. 0.33333333333333337
1369. 0.3846153846153846
1370. 0.33333333333333337
1371. 0.2222222222222222
1372. 0.2941176470588235
1373. 0.375
1374. 0.25
1375. 0.33333333333333337
1376. 0.1875
1377. 0.36363636363636365
1378. 0.3076923076923077
1379. 0.33333333333333337
1380. 0.23076923076923073
1381. 0.33333333333333337
1382. 0.3846153846153846
1383. 0.4
1384. 0.23076923076923073
1385. 0.272727272727

2657. 0.75
2658. 0.7333333333333334
2659. 0.9166666666666666
2660. 0.625
2661. 0.5454545454545454
2662. 0.7
2663. 0.4666666666666667
2664. 0.6666666666666667
2665. 0.7272727272727273
2666. 0.7
2667. 0.6666666666666667
2668. 0.7647058823529411
2669. 0.6666666666666667
2670. 0.7692307692307692
2671. 0.8333333333333334
2672. 0.5
2673. 0.5833333333333333
2674. 0.625
2675. 0.6363636363636364
2676. 0.5
2677. 0.6666666666666667
2678. 0.7857142857142857
2679. 0.7142857142857143
2680. 0.8333333333333334
2681. 0.7142857142857143
2682. 0.7058823529411764
2683. 0.5555555555555556
2684. 0.6875
2685. 0.8333333333333334
2686. 0.6111111111111112
2687. 0.6666666666666667
2688. 0.5555555555555556
2689. 0.625
2690. 0.5555555555555556
2691. 0.5714285714285714
2692. 0.6
2693. 0.9
2694. 0.6923076923076923
2695. 0.8181818181818181
2696. 0.625
2697. 0.6666666666666667
2698. 0.625
2699. 0.7
2700. 0.9230769230769231
2701. 0.6666666666666667
2702. 0.6363636363636364
2703. 0.625
2704. 0.8823529411764706
2705. 0.5

In [14]:
print("Gold standard size:", len(gs))
# print("Gold standard:", gs)
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])

Gold standard size: 3108
Pearson correlation (lemmas): 0.4066630148703153
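
The Pearson coefficient measures the linear association between two lists of values and does not depend on their scale, which is why Jaccard similarities in the range 0 to 1 can be compared directly with gold-standard scores on the task's 0 to 5 scale. The sketch below computes it by hand to make the definition explicit; `pearson` is a hypothetical helper, and the two short lists are toy numbers rather than data from the task.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("both sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values only: the prediction is a rescaled copy of the reference,
# so the correlation is (up to rounding) 1 despite the different scales.
reference = [0.0, 1.0, 2.0, 3.0]
predicted = [0.1, 0.3, 0.5, 0.7]
r = pearson(reference, predicted)
print(r)
```

The cell above reports the correlation for the plain lemma variant only; the other lists computed earlier (`similarities_l_lower`, `similarities_l_lower_no_sw`, `similarities_l_lower_ow`) can be passed to `pearsonr` together with `gs` in the same way to compare the variants.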


### 2.1.3 Lexical Semantics

In [33]:
# David
synsets = None
# for lemma, category in pairs:
#     try:
#         synset = wn.synsets(lemma, category[0].lower())[0]
#         print("Appending most frequent synset for lemma '" + lemma + "' (category: '" + category + "'):\n" + str(synset) + "\n")
#         synsets.append(synset)
#     except:
#         print("Lemma '" + lemma + "' cannot be a synset because its category is " + category + ".\n")


Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.2 Syntactic Dimension

### 2.2.1 Word Sense Disambiguation

In [None]:
# David

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

### 2.2.2 Word Sequences

In [None]:
# Kevin

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

### 2.3 Combination of Lexical & Syntactic Dimensions

In [None]:
# Next Week

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

## 3. Other proposed approaches

### 3.1 Utilizing other PoS taggers

The default PoS tags are recomputed and the tokenized pairs are displayed once again:

In [None]:
pairs_pos = [(pos_tag(p[0]), pos_tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs_tokenized):
    print(str(index + 1) + ".", pair, '\n')

Perceptron Model

In [None]:
# Load the perceptron PoS tagger serialized with dill.
with open("models/per_treebank_pos_tagger_3000", "rb") as f:
    PER = dill.load(f)

In [None]:
pairs__PER = [(PER.tag(p[0]), PER.tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs__PER):
    print(str(index + 1) + ".", pair, '\n')

In [None]:
sw = set(stopwords.words('english'))

t_pairs = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs__PER:
    t_pairs.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(t_pairs):
    print(str(index + 1) + ".", t_pair, '\n')

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

In [None]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities with Perceptron model (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)
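
Since the similarity is defined as one minus the Jaccard distance, it equals the size of the intersection of the two token sets divided by the size of their union. The sketch below spells this out in plain Python. `jaccard_similarity` is a hypothetical helper written for illustration, not part of NLTK, and its handling of two empty inputs is an assumption.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of tokens."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption for this sketch: two empty token lists count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


s1 = ["a", "man", "be", "play", "guitar"]
s2 = ["a", "man", "play", "a", "guitar"]

# 4 shared tokens out of 5 distinct tokens overall.
print(jaccard_similarity(s1, s2))  # 0.8
```

Because the inputs are converted to sets, repeated tokens (the second "a" in `s2`) do not affect the score.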

In [None]:
print("Gold standard size:", len(gs))
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])

CRF Model

In [None]:
# Load the pre-trained CRF POS tagger from its model file.
CRF = CRFTagger()
CRF.set_model_file('models/crf_treebank_pos_tagger_3000')

In [None]:
pairs__CRF = [(CRF.tag(p[0]), CRF.tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs__CRF):
    print(str(index + 1) + ".", pair, '\n')

In [None]:
sw = set(stopwords.words('english'))

t_pairs = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs__CRF:
    t_pairs.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(t_pairs):
    print(str(index + 1) + ".", t_pair, '\n')

Compare and comment on the results achieved by these approaches, both against each other and against the official results (similarities and *Pearson Correlation Coefficient*).

In [None]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities with CRF model (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)

In [None]:
print("Gold standard size:", len(gs))
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])
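
The cell above reports the correlation for the plain lemma variant only; the other similarity lists can be scored against the gold standard in the same way. The sketch below shows one way to loop over several variants. It uses a hand-written `pearson` function so that it runs without SciPy, and `toy_gs` and `toy_variants` are made-up values, not results from this project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: gold scores on the 0-5 scale and two made-up
# similarity lists standing in for the preprocessing variants.
toy_gs = [0.0, 1.0, 2.5, 4.0, 5.0]
toy_variants = {
    "lemmas": [0.1, 0.2, 0.5, 0.7, 0.9],
    "lemmas, lowercase": [0.0, 0.2, 0.5, 0.8, 1.0],
}

for name, sims in toy_variants.items():
    print(f"Pearson correlation ({name}): {pearson(toy_gs, sims):.4f}")
```

Pearson correlation is invariant to linear rescaling, so similarities in [0, 1] can be compared directly with gold scores in [0, 5].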

HMM Model

In [None]:
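HMM = None
# Assumption: the HMM model is stored under the same naming pattern as the
# other taggers (per_, crf_, tnt_).
with open("models/hmm_treebank_pos_tagger_3000", "rb") as f:
    HMM = dill.load(f)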

In [None]:
pairs__HMM = [(HMM.tag(p[0]), HMM.tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs__HMM):
    print(str(index + 1) + ".", pair, '\n')

In [None]:
sw = set(stopwords.words('english'))

t_pairs = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs__HMM:
    t_pairs.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(t_pairs):
    print(str(index + 1) + ".", t_pair, '\n')

Compare and comment on the results achieved by these approaches, both against each other and against the official results (similarities and *Pearson Correlation Coefficient*).

In [None]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities with HMM model (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)

In [None]:
print("Gold standard size:", len(gs))
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])

TnT Model

In [None]:
TnT=None
with open("models/tnt_treebank_pos_tagger_3000", "rb") as f:
    TnT = dill.load(f)

In [None]:
pairs__TnT = [(TnT.tag(p[0]), TnT.tag(p[1])) for p in pairs_tokenized]

for index, pair in enumerate(pairs__TnT):
    print(str(index + 1) + ".", pair, '\n')

In [None]:
sw = set(stopwords.words('english'))

t_pairs = list()
t_pairs_lower = list()
t_pairs_lower_no_sw = list()
t_pairs_lower_ow = list()

for pair in pairs__TnT:
    t_pairs.append(([lemmatize(word) for word in pair[0]], [lemmatize(word) for word in pair[1]]))
    t_pairs_lower.append(([lemmatize(word, True) for word in pair[0]],
                          [lemmatize(word, True) for word in pair[1]]))
    t_pairs_lower_no_sw.append(([lemmatize(word, True) for word in pair[0] if word[0].lower() not in sw],
                                [lemmatize(word, True) for word in pair[1] if word[0].lower() not in sw]))
    t_pairs_lower_ow.append(([lemmatize(word, True) for word in pair[0] if re.search(r"\w", word[0])],
                             [lemmatize(word, True) for word in pair[1] if re.search(r"\w", word[0])]))
    
for index, t_pair in enumerate(t_pairs):
    print(str(index + 1) + ".", t_pair, '\n')

Compare and comment on the results achieved by these approaches, both against each other and against the official results (similarities and *Pearson Correlation Coefficient*).

In [None]:
similarities_l = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs]
similarities_l_lower = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower]
similarities_l_lower_no_sw = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_no_sw]
similarities_l_lower_ow = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in t_pairs_lower_ow]

print("Similarities with TnT model (considering lemmas):\n")
for index, similarity in enumerate(similarities_l):
    print(str(index + 1) + ".", similarity)

In [None]:
print("Gold standard size:", len(gs))
print("Pearson correlation (lemmas):", pearsonr(gs, similarities_l)[0])
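
The Perceptron, CRF, HMM and TnT sections repeat the same steps: tag each sentence, normalize the tagged tokens, and compute the Jaccard similarity per pair. One possible way to factor this out is sketched below. `evaluate_tagger`, `toy_tag` and `toy_normalize` are hypothetical names; in the notebook, the tagging function would be something like `PER.tag` and the normalizer would be the `lemmatize` function.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B|; two empty inputs are treated as identical."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def evaluate_tagger(tag, tokenized_pairs, normalize):
    """Tag both sentences of each pair, normalize every tagged token,
    and return one Jaccard similarity per pair."""
    sims = []
    for left, right in tokenized_pairs:
        left_norm = [normalize(t) for t in tag(left)]
        right_norm = [normalize(t) for t in tag(right)]
        sims.append(jaccard_similarity(left_norm, right_norm))
    return sims


# Hypothetical stand-ins so the sketch runs without trained models.
def toy_tag(tokens):
    # Labels every token as a noun; a real tagger returns (token, tag) pairs too.
    return [(tok, "NN") for tok in tokens]


def toy_normalize(tagged_token):
    # Lowercases the token; the notebook would lemmatize using the tag instead.
    return tagged_token[0].lower()


toy_pairs = [(["A", "dog", "runs"], ["a", "dog", "sleeps"]),
             (["Hello"], ["hello"])]
toy_taggers = {"toy": toy_tag}

for name, tag in toy_taggers.items():
    print(name, evaluate_tagger(tag, toy_pairs, toy_normalize))  # toy [0.5, 1.0]
```

With a dictionary of real taggers in place of `toy_taggers`, the same loop would yield one similarity list per model, ready to be correlated with the gold standard.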

## 4. Conclusions