# STS Project
## Introduction
*Jupyter Notebook* for the STS (Semantic Textual Similarity) Project of the **Introduction to Human Language Technologies** course, taught at UPC within the MAI (Master of Artificial Intelligence).

This project has been done by:
- David Dueñas Gaviria
- Kevin David Rosales Santana

The statement is as follows:
- Use the data set and the task description of Semantic Textual Similarity in SemEval 2012.

- Implement some approaches to detect paraphrases using sentence similarity metrics.

    - Explore some lexical dimensions.
    - Explore the syntactic dimension alone.
    - Explore the combination of both.

- Add new components of your choice (optional).

- Neither word nor sentence embeddings are allowed.

- Compare and comment on the results achieved by these approaches, both among themselves and against the official results.

- Send files to raco in IHLT STS Project before the oral presentation:

    - Jupyter notebook: `sts-[Student1]-[Student2].ipynb`

    - Slides: `sts-[Student1]-[Student2].pdf`

#### Quick notes about similarities and performance measure.
    
- In order to measure the similarity between each pair of sentences, the [*Jaccard distance*](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.jaccard_distance) will be used:

    - $ Similarity = 1 - Jaccard_{Distance} $

    - $ Jaccard_{Distance} = 1 - \frac{|A \cap B |}{|A \cup B|} $

- The [*Pearson correlation coefficient*](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) will be used to measure the relation between our similarities and the proposed similarities from the *Gold Standard*.  The coefficient varies between -1 and +1, with 0 implying no correlation. Correlations of -1 and +1 imply an exact linear relationship.
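Both measures can be illustrated with a minimal, dependency-free sketch. The helper names `jaccard_similarity` and `pearson` are hypothetical and only mirror the definitions above; the notebook itself relies on `nltk.metrics.jaccard_distance` and `scipy.stats.pearsonr`. The token lists come from one of the MSRvid pairs, while the `system` and `gold` values are toy numbers, not data set results.

```python
import math

def jaccard_similarity(a, b):
    """Return |A & B| / |A | B|, i.e. 1 minus the Jaccard distance."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets (an assumption)
    return len(a & b) / len(union)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

s1 = ['The', 'man', 'is', 'playing', 'the', 'guitar', '.']
s2 = ['A', 'man', 'is', 'playing', 'a', 'guitar', '.']
sim = jaccard_similarity(s1, s2)  # 5 shared tokens out of 9 distinct ones

# Toy values (not taken from the data set)
system = [0.1, 0.4, 0.5, 0.9]
gold = [0.5, 2.0, 3.5, 4.5]
r = pearson(system, gold)
r_scaled = pearson([5 * s for s in system], gold)
```

Because the Pearson coefficient is unchanged by a positive rescaling of either variable, `r` and `r_scaled` coincide up to rounding. This is why similarities in the range 0 to 1 can be correlated directly with gold values in the range 0 to 5.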

## Imports

In [22]:
import nltk, re

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords
from nltk.parse.corenlp import CoreNLPDependencyParser
from scipy.stats import pearsonr

## 1. Data Preparation
This section covers the preparation of the input data. The data used throughout the IHLT course come mostly from the *trial* set. In this project, however, the *test* set is used to compute the similarities and to measure the performance of the proposed models.

The input data consists of five different files:
- `STS.input.MSRpar.txt`
- `STS.input.MSRvid.txt`
- `STS.input.SMTeuroparl.txt`
- `STS.input.surprise.OnWN.txt`
- `STS.input.surprise.SMTnews.txt`

Therefore, the proposed pairs of sentences will be formed by the concatenation of the five different proposed inputs.

The variable `sw` stores the set of English stopwords, which is used in several of the approaches.

In [2]:
pairs = list()

sw = set(stopwords.words('english'))

input_files = ['STS.input.MSRpar.txt',
               'STS.input.MSRvid.txt',
               'STS.input.SMTeuroparl.txt',
               'STS.input.surprise.OnWN.txt',
               'STS.input.surprise.SMTnews.txt']

for file in input_files:
    with open('inputs/test-gold/' + file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            line = nltk.TabTokenizer().tokenize(line.strip())
            pairs.append((line[0], line[1]))
        
for index, pair in enumerate(pairs, 1):
    print(str(index) + ".", pair)

1. ('The problem likely will mean corrective changes before the shuttle fleet starts flying again.', 'He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.')
2. ('The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.', "The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.")
3. ('"It\'s a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896.', '"It\'s a huge black eye," Arthur Sulzberger, the newspaper\'s publisher, said of the scandal.')
4. ('SEC Chairman William Donaldson said there is a "building confidence out there that the cop is on the beat."', '"I think there\'s a building confidence that the cop is on the beat."')
5. ('Vivendi shares closed 1.9 percent at 15.80 euros in Paris after falling 3.6 percent on Monday.', 'In New York, Vivendi shares were 1.4 percent down at $18.29.')
6. ("Myanmar's pro-democr

1654. ('Van Orden Report (A5-0241/2000)', 'Van Orden report (5 / 2000)')
1655. ('As I already explained during second reading, there is a crisis underlying this directive amendment.', 'As I have already explained at second reading, a crisis is on the basis of this amendment of directive.')
1656. ('Maij-Weggen report (A5-0323/2000)', 'Report Maij-Weggen (A5-0323 / 2000)')
1657. ('Consumers will lose out, employees will lose out, Europe will lose competitive strength and growth.', 'The consumers are the losers, with the employees, and the competitiveness and the growth European régresseront.')
1658. ('Van Orden Report (A5-0241/2000)', 'Report Horsebox Orden (A5-0241 / 2000)')
1659. ('Unfortunately, others separate on the basis of accumulated hatred.', 'Some separate themselves unfortunately in view of a grudge gained.')
1660. ('Tunisia', 'Tunisia')
1661. ('We often pontificate here about being the representatives of the citizens of Europe.', 'We ourselves often represent European citizen

2956. ('Today there are two plausible ways to proceed against a deposed tyrant.', 'Today there are two plausible ways to a tyrant deposed.')
2957. ('He did, but the initiative did not get very far.', 'What he has done without the initiative goes too far.')
2958. ('None of this absolves rich countries of their responsibility to help.', 'This does not release but rich countries of their obligation to help.')
2959. ('None of this absolves rich countries of their responsibility to help.', 'This does not, however, rich countries emerging from their obligation to help.')
2960. ('Today there are two plausible ways to proceed against a deposed tyrant.', 'Today there are two ways to proceed plausible towards a tyrant deposed.')
2961. ('But they were necessary.', 'But they were necessary.')
2962. ('Ahmadinejad is embarking on an adventure; Bernanke is not.', 'Ahmadinejad has embarque in an adventure, not Bernanke.')
2963. ('But, like the Union itself, it will be built and it will be done.', 'But

In [3]:
print("Number of pairs of sentences:", len(pairs))

Number of pairs of sentences: 3108


The *Gold Standard* file mentioned above must also be read. It contains the reference similarity (on a scale from 0 to 5) for each pair of sentences, in the same order as the pairs. These values are used to measure the performance of the proposed models.

In [4]:
with open('inputs/test-gold/STS.gs.ALL.txt','r') as f:
    gs = [float(line) for line in f.readlines()]

print("Gold standard size:", len(gs))

Gold standard size: 3108
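Since gold values are matched to sentence pairs purely by position, a small sanity check can catch a missing or misordered file early. The following is only a sketch: `check_alignment` is a hypothetical helper, and the sample pairs and scores are illustrative values, assuming gold scores lie between 0 and 5.

```python
def check_alignment(pairs, gold, low=0.0, high=5.0):
    """Raise ValueError if pairs and gold values cannot be matched by position."""
    if len(pairs) != len(gold):
        raise ValueError("pairs: %d, gold values: %d" % (len(pairs), len(gold)))
    # 1-based positions of gold values outside the expected range
    out_of_range = [i for i, g in enumerate(gold, 1) if not low <= g <= high]
    if out_of_range:
        raise ValueError("gold values out of range at positions %s" % out_of_range)
    return True

# Illustrative data only (not read from the STS files)
sample_pairs = [('Tunisia', 'Tunisia'),
                ('But they were necessary.', 'But they were necessary.')]
sample_gold = [5.0, 5.0]
aligned = check_alignment(sample_pairs, sample_gold)
```

In the notebook, the same check would be called as `check_alignment(pairs, gs)` once both lists have been loaded.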


## 2. Paraphrase detection using different approaches

The paraphrase detection will be performed using different approaches:

- First, different **Lexical Dimension** approaches will be analyzed such as *Words Tokenization*, *Lemma Tokenization* and *Lexical Semantics*. 

- After that, the **Syntactic Dimension** will be studied with *Word Sense Disambiguation* and *Word Sequences*.

- Finally, some systems based on **combinations of both dimensions** will be built taking into account the results of the mentioned previous approaches.

### 2.1 Lexical Dimensions

### 2.1.1 Words tokenization

In this basic approach, the word tokenization will be utilized for each pair of sentences:

- `wt_pairs`: contains the basic word tokenization approach.
- `l_wt_pairs`: contains the `wt_pairs` approach in lower case.
- `l_sw_wt_pairs`: contains the `l_wt_pairs` approach without stopwords.
- `l_sw_jw_wt_pairs`: contains the `l_sw_wt_pairs` approach using only words.

The word tokenization is performed with the `nltk` function `word_tokenize(sentence)`. The loaded `sw` set is used to filter the stopwords in the `l_sw_*_pairs` approaches. The package `re` is used in the `l_sw_jw_wt_pairs` approach to keep only tokens that contain at least one word character: pure punctuation tokens are discarded, while tokens such as `0.11` or `.ixic` are kept.
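The cumulative effect of the three filters can be sketched on a short token list. This is an illustration only: `normalize_tokens` is a hypothetical helper, and `SAMPLE_STOPWORDS` is a tiny stand-in for NLTK's English stopword list so that the snippet runs without any downloaded corpus.

```python
import re

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's list)
SAMPLE_STOPWORDS = {"the", "or", "to", "down"}

def normalize_tokens(tokens, lowercase=False, stopwords=None, just_words=False):
    """Apply the optional lowercase, stopword and punctuation filters."""
    result = []
    for token in tokens:
        key = token.lower()
        if stopwords is not None and key in stopwords:
            continue  # drop stopwords, compared in lower case
        if just_words and not re.search(r"\w", token):
            continue  # drop tokens without any word character
        result.append(key if lowercase else token)
    return result

tokens = ['The', 'Index', 'inched', 'down', ',', 'or', '0.11',
          'percent', ',', 'to', '1,650', '.']

lowered = normalize_tokens(tokens, lowercase=True)
no_stopwords = normalize_tokens(tokens, lowercase=True,
                                stopwords=SAMPLE_STOPWORDS)
only_words = normalize_tokens(tokens, lowercase=True,
                              stopwords=SAMPLE_STOPWORDS, just_words=True)
```

Each variant is a subset of the previous one, which matches how `l_wt_pairs`, `l_sw_wt_pairs` and `l_sw_jw_wt_pairs` build on each other.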

In [5]:
wt_pairs = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

l_wt_pairs = list()
l_sw_wt_pairs = list()
l_sw_jw_wt_pairs = list()

for pair in wt_pairs:
    l_wt_pairs.append(([w.lower() for w in pair[0]],
                       [w.lower() for w in pair[1]]))
    l_sw_wt_pairs.append(([w.lower() for w in pair[0] if w.lower() not in sw],
                          [w.lower() for w in pair[1] if w.lower() not in sw]))
    l_sw_jw_wt_pairs.append(([w.lower() for w in pair[0] if w.lower() not in sw and re.search(r"\w", w)],
                             [w.lower() for w in pair[1] if w.lower() not in sw and re.search(r"\w", w)]))

# Basic approach visualization
for index, pair in enumerate(wt_pairs, 1):
    print(str(index) + ".", pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'changes', 'before', 'the', 'shuttle', 'fleet', 'starts', 'flying', 'again', '.'], ['He', 'said', 'the', 'problem', 'needs', 'to', 'be', 'corrected', 'before', 'the', 'space', 'shuttle', 'fleet', 'is', 'cleared', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'Nasdaq', 'Composite', 'Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'said', 'publisher', 'Arthur', 'Ochs', 'Sulzberger', 'Jr.', ',', 'whose', 'family', 'has', 'controlled', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'Arthur', 'Sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'said', 'of', 'the', 'scandal', '.'

797. (['A', 'girl', 'is', 'flying', 'a', 'kite', '.'], ['A', 'girl', 'running', 'is', 'flying', 'a', 'kite', '.']) 

798. (['A', 'man', 'is', 'riding', 'a', 'mechanical', 'bull', '.'], ['A', 'man', 'rode', 'a', 'mechanical', 'bull', '.']) 

799. (['The', 'man', 'is', 'playing', 'the', 'guitar', '.'], ['A', 'man', 'is', 'playing', 'a', 'guitar', '.']) 

800. (['The', 'man', 'is', 'playing', 'the', 'guitar', '.'], ['A', 'man', 'is', 'playing', 'a', 'guitar', '.']) 

801. (['A', 'woman', 'is', 'dancing', 'and', 'singing', 'with', 'other', 'women', '.'], ['A', 'woman', 'is', 'dancing', 'and', 'singing', 'in', 'the', 'rain', '.']) 

802. (['A', 'man', 'is', 'finding', 'something', '.'], ['A', 'woman', 'is', 'slicing', 'something', '.']) 

803. (['A', 'man', 'is', 'slicing', 'a', 'bun', '.'], ['A', 'man', 'is', 'slicing', 'an', 'onion', '.']) 

804. (['A', 'man', 'is', 'pouring', 'oil', 'into', 'a', 'pan', '.'], ['A', 'man', 'is', 'pouring', 'oil', 'into', 'a', 'skillet', '.']) 

805. (['A',


1797. (['As', 'I', 'already', 'explained', 'during', 'second', 'reading', ',', 'there', 'is', 'a', 'crisis', 'underlying', 'this', 'directive', 'amendment', '.'], ['As', 'I', 'have', 'already', 'said', 'in', 'the', 'second', 'reading', ',', 'a', 'crisis', 'is', 'at', 'the', 'basis', 'of', 'this', 'amendment', 'of', 'directive', '.']) 

1798. (['(', 'Parliament', 'adopted', 'the', 'legislative', 'resolution', ')'], ['In', 'particular', 'Parliament', 'adopted', 'the', 'legislative', 'resolution', ')']) 

1799. (['Let', 'me', 'remind', 'you', 'that', 'our', 'allies', 'include', 'fervent', 'supporters', 'of', 'this', 'tax', '.'], ['I', 'would', 'like', 'to', 'remind', 'you', 'that', 'among', 'our', 'allies', ',', 'there', 'are', 'ardent', 'supporters', 'of', 'this', 'tax', '.']) 

1800. (['As', 'I', 'already', 'explained', 'during', 'second', 'reading', ',', 'there', 'is', 'a', 'crisis', 'underlying', 'this', 'directive', 'amendment', '.'], ['As', 'I', 'have', 'already', 'explained', 'in'

2797. (['This', 'is', 'a', 'clear', 'if', 'implicit', 'repudiation', 'of', 'Mubarak', ',', 'the', 'sole', 'ruler', 'for', '24', 'years', '.'], ['This', 'is', 'an', 'implicit', 'but', 'clear', 'repudiation', 'of', 'Mubarak', ',', 'only', 'Head', 'of', 'State', 'for', '24', 'years', '.']) 

2798. (['Foremost', 'among', 'these', 'is', 'that', 'economic', 'development', 'is', 'largely', 'in', 'the', 'hands', 'of', 'poor', 'nations', 'themselves', '.'], ['At', 'the', 'forefront', 'of', 'these', 'lessons', ',', 'we', 'learn', 'that', 'economic', 'development', 'is', 'largely', 'left', 'in', 'the', 'hands', 'of', 'poor', 'nations', 'themselves', '.']) 

2799. (['Some', 'results', 'are', 'remarkable', '.'], ['Some', 'are', 'noteworthy', 'results', '.']) 

2800. (['Will', 'it', 'give', 'us', 'the', 'right', 'to', 'divorce', 'the', 'husbands', 'who', 'abandon', 'us', '?'], ['They', 'would', 'give', 'the', 'right', 'to', 'divorce', 'the', 'husbands', 'who', 'would', 'have', 'abandoned', 'them', '

**Comparison and discussion of the results achieved by these approaches, among themselves and against the official results (Similarities & *Pearson Correlation Coefficient*).**

In [6]:
wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in wt_pairs]
l_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_wt_pairs]
l_sw_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_wt_pairs]
l_sw_jw_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_jw_wt_pairs]

print("Similarities (word tokenization):\n")
for index, similarity in enumerate(wt_similarities):
    print(str(index + 1) + ".", similarity)

Similarities (word tokenization):

1. 0.28
2. 0.27586206896551724
3. 0.5555555555555556
4. 0.5909090909090908
5. 0.19999999999999996
6. 0.4878048780487805
7. 0.34782608695652173
8. 0.5405405405405406
9. 0.6428571428571428
10. 0.55
11. 0.4117647058823529
12. 0.40740740740740744
13. 0.5
14. 0.3846153846153846
15. 0.3076923076923077
16. 0.4516129032258065
17. 0.4054054054054054
18. 0.5217391304347826
19. 0.4444444444444444
20. 0.3571428571428571
21. 0.30434782608695654
22. 0.4242424242424242
23. 0.7307692307692308
24. 0.64
25. 0.43999999999999995
26. 0.5714285714285714
27. 0.65625
28. 0.2857142857142857
29. 0.3421052631578947
30. 0.4736842105263158
31. 0.6538461538461539
32. 0.4
33. 0.47619047619047616
34. 0.3913043478260869
35. 0.4782608695652174
36. 0.6785714285714286
37. 0.46153846153846156
38. 0.33333333333333337
39. 0.5161290322580645
40. 0.5666666666666667
41. 0.5333333333333333
42. 0.8214285714285714
43. 0.4545454545454546
44. 0.4482758620689655
45. 0.6551724137931034
46. 0.3000000

455. 0.4571428571428572
456. 0.3846153846153846
457. 0.3571428571428571
458. 0.42307692307692313
459. 0.4347826086956522
460. 0.5454545454545454
461. 0.7241379310344828
462. 0.6538461538461539
463. 0.4347826086956522
464. 0.28125
465. 0.34782608695652173
466. 0.5357142857142857
467. 0.34782608695652173
468. 0.34782608695652173
469. 0.6
470. 0.4
471. 0.4
472. 0.5416666666666667
473. 0.53125
474. 0.44999999999999996
475. 0.7241379310344828
476. 0.8947368421052632
477. 0.5217391304347826
478. 0.25806451612903225
479. 0.5517241379310345
480. 0.31999999999999995
481. 0.3157894736842105
482. 0.4516129032258065
483. 0.5483870967741935
484. 0.5
485. 0.6666666666666667
486. 0.7619047619047619
487. 0.4
488. 0.6451612903225806
489. 0.5555555555555556
490. 0.5333333333333333
491. 0.5384615384615384
492. 0.34782608695652173
493. 0.43999999999999995
494. 0.5
495. 0.36363636363636365
496. 0.4444444444444444
497. 0.5625
498. 0.29032258064516125
499. 0.38888888888888884
500. 0.2962962962962963
501. 0.2

1383. 0.4
1384. 0.23076923076923073
1385. 0.2727272727272727
1386. 0.30000000000000004
1387. 0.2727272727272727
1388. 0.23076923076923073
1389. 0.1428571428571429
1390. 0.16666666666666663
1391. 0.2222222222222222
1392. 0.2727272727272727
1393. 0.3571428571428571
1394. 0.23076923076923073
1395. 0.16666666666666663
1396. 0.23076923076923073
1397. 0.23076923076923073
1398. 0.2222222222222222
1399. 0.3076923076923077
1400. 0.3571428571428571
1401. 0.2857142857142857
1402. 0.2727272727272727
1403. 0.4444444444444444
1404. 0.25
1405. 0.3076923076923077
1406. 0.1428571428571429
1407. 0.3846153846153846
1408. 0.0625
1409. 0.2857142857142857
1410. 0.33333333333333337
1411. 0.19999999999999996
1412. 0.2666666666666667
1413. 0.33333333333333337
1414. 0.25
1415. 0.25
1416. 0.23529411764705888
1417. 0.3076923076923077
1418. 0.07692307692307687
1419. 0.1875
1420. 0.23076923076923073
1421. 0.15384615384615385
1422. 0.17647058823529416
1423. 0.33333333333333337
1424. 0.33333333333333337
1425. 0.22222

1804. 0.5
1805. 0.6086956521739131
1806. 0.33333333333333337
1807. 0.4285714285714286
1808. 0.5
1809. 0.7857142857142857
1810. 0.5
1811. 0.8333333333333334
1812. 0.42105263157894735
1813. 0.5
1814. 0.25
1815. 0.5
1816. 0.4545454545454546
1817. 0.5
1818. 1.0
1819. 0.30000000000000004
1820. 0.5238095238095238
1821. 0.29166666666666663
1822. 0.5416666666666667
1823. 0.3125
1824. 0.23809523809523814
1825. 0.4375
1826. 0.36363636363636365
1827. 0.9
1828. 0.7222222222222222
1829. 0.5
1830. 1.0
1831. 1.0
1832. 0.29166666666666663
1833. 0.36363636363636365
1834. 0.9
1835. 0.4444444444444444
1836. 0.13636363636363635
1837. 0.25
1838. 0.6956521739130435
1839. 0.52
1840. 0.5454545454545454
1841. 0.3529411764705882
1842. 0.5
1843. 0.5
1844. 0.3529411764705882
1845. 0.7222222222222222
1846. 0.25
1847. 0.4
1848. 0.368421052631579
1849. 0.55
1850. 0.6363636363636364
1851. 0.6363636363636364
1852. 1.0
1853. 0.4285714285714286
1854. 0.19999999999999996
1855. 0.30434782608695654
1856. 0.2608695652173913

In [7]:
print("Pearson correlation (word tokenization):", pearsonr(gs, wt_similarities)[0])
print("Pearson correlation (word tokenization + lowercase):", pearsonr(gs, l_wt_similarities)[0])
print("Pearson correlation (word tokenization + lowercase + no stopwords):", pearsonr(gs, l_sw_wt_similarities)[0])
print("Pearson correlation (word tokenization + lowercase + no stopwords + just words):", pearsonr(gs, l_sw_jw_wt_similarities)[0])

Pearson correlation (word tokenization): 0.3576889958413405
Pearson correlation (word tokenization + lowercase): 0.4030921572160367
Pearson correlation (word tokenization + lowercase + no stopwords): 0.4823471916155675
Pearson correlation (word tokenization + lowercase + no stopwords + just words): 0.5605613209285878


As can be observed, the *Pearson Correlation* improves with each additional preprocessing step (lowercasing, stopword removal and punctuation filtering with regular expressions).

For example, consider how the second pair of sentences (index `1`) is handled by each approach:

In [8]:
print("Word tokenization approach (similarity: " + str(round(wt_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", wt_pairs[1][0])
print("Second sentence:", wt_pairs[1][1])

Word tokenization approach (similarity: 1.379/5.0, gs: 0.8/5.0):

First sentence: ['The', 'technology-laced', 'Nasdaq', 'Composite', 'Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.']
Second sentence: ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']


Some tokens appear in both sentences (*e.g.* 'Index', 'inched', 'percent', 'to'...). As a result, the computed similarity is higher than the *Gold Standard* value shown in the `print`: the sentences deal with different topics, even though both belong to the field of economics.

Nevertheless, if other approaches are studied:

In [9]:
print("Word tokenization + lowercase approach (similarity: " + str(round(l_wt_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", l_wt_pairs[1][0])
print("Second sentence:", l_wt_pairs[1][1])

print("\nWord tokenization + lowercase + no stopwords approach (similarity: " + str(round(l_sw_wt_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", l_sw_wt_pairs[1][0])
print("Second sentence:", l_sw_wt_pairs[1][1])

print("\nWord tokenization + lowercase + no stopwords + just words approach (similarity: " + str(round(l_sw_jw_wt_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", l_sw_jw_wt_pairs[1][0])
print("Second sentence:", l_sw_jw_wt_pairs[1][1])

Word tokenization + lowercase approach (similarity: 1.379/5.0, gs: 0.8/5.0):

First sentence: ['the', 'technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.']
Second sentence: ['the', 'broad', 'standard', '&', 'poor', "'s", '500', 'index', '.spx', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']

Word tokenization + lowercase + no stopwords approach (similarity: 1.042/5.0, gs: 0.8/5.0):

First sentence: ['technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inched', '1', 'point', ',', '0.11', 'percent', ',', '1,650', '.']
Second sentence: ['broad', 'standard', '&', 'poor', "'s", '500', 'index', '.spx', 'inched', '3', 'points', ',', '0.32', 'percent', ',', '970', '.']

Word tokenization + lowercase + no stopwords + just words approach (similarity: 0.714/5.0, gs: 0.8/5.0):

First sentence: ['technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inch

The basic approach improves with minor changes such as filtering stopwords and punctuation marks.

For example, in the *Word tokenization + lowercase + no stopwords* approach, stopwords like 'to' are not counted, and the *Jaccard similarity* is lower than in the basic approach. This technique assumes that stopwords do not add any relevant meaning to the sentence, so they should be removed.

Finally, if the punctuation marks are also removed using regular expressions, the similarity is lower than in the previous approaches. For this pair, the computed similarity (0.714) is the closest to the *Gold Standard* value (*i.e.* 0.8). This technique assumes that punctuation marks should not be taken into account when detecting paraphrases.

Notice that this last approach (*Word tokenization + lowercase + no stopwords + just words*) obtains the best *Pearson Correlation Coefficient* of this section **(*i.e.* 0.561)**. The same variants will be studied in the following sections in order to analyze how much they help paraphrase detection.
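The similarities reported for the second pair can be reproduced by hand from the token lists printed above. The sketch below uses a hypothetical pure-Python `jaccard_similarity` helper instead of NLTK, so that it runs on its own.

```python
import re

def jaccard_similarity(a, b):
    """Return |A & B| / |A | B| for two non-empty token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Tokens of the second pair after lowercasing and stopword removal
s1 = ['technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inched',
      '1', 'point', ',', '0.11', 'percent', ',', '1,650', '.']
s2 = ['broad', 'standard', '&', 'poor', "'s", '500', 'index', '.spx',
      'inched', '3', 'points', ',', '0.32', 'percent', ',', '970', '.']

no_sw = jaccard_similarity(s1, s2)        # 5 shared / 24 distinct tokens

# Keep only tokens that contain at least one word character
w1 = [t for t in s1 if re.search(r"\w", t)]
w2 = [t for t in s2 if re.search(r"\w", t)]
just_words = jaccard_similarity(w1, w2)   # 3 shared / 21 distinct tokens

print(round(no_sw * 5, 3), round(just_words * 5, 3))  # prints 1.042 0.714
```

Removing ',' and '.' takes two shared tokens out of the intersection, which is why the similarity drops from 5/24 to 3/21 and moves closer to the gold value of 0.8 on the 0 to 5 scale.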

### 2.1.2 Lemma tokenization (PoS)

In [10]:
# David

Compare and comment on the results achieved by these approaches, among themselves and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.1.3 Lexical Semantics

In [11]:
# David 

Compare and comment on the results achieved by these approaches, among themselves and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.2 Syntactic Dimension

### 2.2.1 Word Sense Disambiguation

In [12]:
# David

Compare and comment on the results achieved by these approaches, among themselves and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.2.2 Word Sequences

This approach reuses the word sequences technique (Words + Named Entities) from the practical sessions:

- `ws_pairs`: contains the basic word sequences (Words + Named Entities) approach.
- `l_ws_pairs`: contains the `ws_pairs` approach in lower case.
- `l_sw_ws_pairs`: contains the `l_ws_pairs` approach without stopwords.
- `l_sw_jw_ws_pairs`: contains the `l_sw_ws_pairs` approach using only words.

The option `binary=True` is passed to NLTK's NE chunker so that named entities are only recognized, without classifying them into classes such as PERSON, LOCATION or ORGANIZATION. Moreover, the function `tree2conlltags(ne_chunk_res)` is used to flatten the resulting tree into `(word, PoS tag, IOB tag)` triples, which makes it easy to iterate over the sentence and collect the NEs.

The function `transform_sentence(ne_chunk_res)` iterates over these triples and returns the sentence in the proposed form (Words + NEs), where each multi-word named entity becomes a single element.

In [13]:
def transform_sentence(ne_chunk_res, lowercase=False, remove_sw=False, just_words=False):
    # Flatten the chunk tree into (word, PoS tag, IOB tag) triples
    conlltags = nltk.chunk.tree2conlltags(ne_chunk_res)
    transformed_sentence = []
    index = 0
    while index < len(conlltags):
        if conlltags[index][2] == 'B-NE':
            # Start of a named entity: join it with the I-NE tokens that follow.
            # Named entities are appended as they are (no lowercasing or filtering).
            ne = conlltags[index][0]
            index += 1
            # The bounds check also covers entities that end the sentence
            while index < len(conlltags) and conlltags[index][2] == 'I-NE':
                ne += " " + conlltags[index][0]
                index += 1
            transformed_sentence.append(ne)
        else:
            word = conlltags[index][0]
            index += 1
            if lowercase:
                word = word.lower()
            if remove_sw and word.lower() in sw:
                continue
            if just_words and not re.search(r"\w", word):
                continue
            transformed_sentence.append(word)

    return transformed_sentence
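The merging of `B-NE` and `I-NE` tokens can be checked in isolation on hand-written tags, without running the NLTK chunker. In this sketch, `merge_named_entities` is a hypothetical simplified helper (no lowercasing or filtering), and the tags are written by hand for illustration rather than produced by `ne_chunk`. A named entity that ends the sentence is a useful edge case to include.

```python
def merge_named_entities(conlltags):
    """Join each B-NE token with the I-NE tokens that directly follow it."""
    merged = []  # list of (text, is_named_entity) items
    for word, _pos, iob in conlltags:
        if iob == 'I-NE' and merged and merged[-1][1]:
            # Continue the named entity started by the previous token(s)
            merged[-1] = (merged[-1][0] + " " + word, True)
        else:
            merged.append((word, iob == 'B-NE'))
    return [text for text, _is_ne in merged]

# Hand-written (word, PoS tag, IOB tag) triples, assumed for illustration
tags = [('said', 'VBD', 'O'),
        ('publisher', 'NN', 'O'),
        ('Arthur', 'NNP', 'B-NE'),
        ('Ochs', 'NNP', 'I-NE'),
        ('Sulzberger', 'NNP', 'I-NE')]

result = merge_named_entities(tags)
```

Here `result` holds three elements, the last one being the full name, so the trailing `I-NE` token is not emitted a second time as a separate word.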

In [14]:
# Run the tokenize -> PoS tag -> NE chunk pipeline only once per sentence
ne_pairs = [(ne_chunk(pos_tag(word_tokenize(p[0])), binary=True),
             ne_chunk(pos_tag(word_tokenize(p[1])), binary=True))
            for p in pairs]

ws_pairs = [(transform_sentence(a), transform_sentence(b)) for a, b in ne_pairs]

l_ws_pairs = [(transform_sentence(a, lowercase=True),
               transform_sentence(b, lowercase=True))
              for a, b in ne_pairs]

l_sw_ws_pairs = [(transform_sentence(a, lowercase=True, remove_sw=True),
                  transform_sentence(b, lowercase=True, remove_sw=True))
                 for a, b in ne_pairs]

l_sw_jw_ws_pairs = [(transform_sentence(a, lowercase=True, remove_sw=True, just_words=True),
                     transform_sentence(b, lowercase=True, remove_sw=True, just_words=True))
                    for a, b in ne_pairs]

In [15]:
# Basic word sequences approach visualization
for index, pair in enumerate(ws_pairs, 1):
    print(str(index) + ".", pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'changes', 'before', 'the', 'shuttle', 'fleet', 'starts', 'flying', 'again', '.'], ['He', 'said', 'the', 'problem', 'needs', 'to', 'be', 'corrected', 'before', 'the', 'space', 'shuttle', 'fleet', 'is', 'cleared', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'Nasdaq Composite Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'said', 'publisher', 'Arthur Ochs Sulzberger', 'Jr.', ',', 'whose', 'family', 'has', 'controlled', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'Arthur Sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'said', 'of', 'the', 'scandal', '.']) 

4. (['SEC'

733. (['About', '1,557', 'genes', 'on', 'chromosome', '6', 'are', 'thought', 'to', 'be', 'functional', '.'], ['The', 'remaining', '1,557', 'genes', 'are', 'believed', 'to', 'be', 'all', 'functional', '.']) 

734. (['MessageLabs', ',', 'which', 'runs', 'outsourced', 'e-mail', 'servers', 'for', '700,000', 'customers', 'around', 'the', 'world', ',', 'said', 'it', 'had', 'filtered', 'out', '27,000', 'infected', 'e-mails', 'in', '115', 'countries', 'as', 'of', 'Thursday', 'morning', '.'], ['Messagelabs', ',', 'which', 'runs', 'outsourced', 'e-mail', 'servers', 'for', '700,000', 'customers', 'around', 'the', 'world', ',', 'has', 'labeled', 'the', 'worm', '``', 'high', 'risk', "''", 'and', 'reports', 'more', 'than', '31,000', 'infections', 'in', '120', 'countries', '.']) 

735. (['In', 'the', 'second', 'quarter', ',', 'Anadarko', 'now', 'expects', 'volume', 'of', '46', 'million', 'BOE', ',', 'down', 'from', '48', 'million', 'BOE', '.'], ['Production', 'for', 'the', 'second', 'quarter', 'was',


1447. (['A', 'girl', 'is', 'playing', 'a', 'piano', '.'], ['A', 'boy', 'is', 'doing', 'push', 'ups', '.']) 

1448. (['The', 'lady', 'stirred', 'up', 'raw', 'eggs', 'in', 'the', 'bowl', '.'], ['A', 'woman', 'is', 'pouring', 'eyes', 'into', 'a', 'bowl', '.']) 

1449. (['A', 'cheetah', 'is', 'running', 'behind', 'its', 'prey', '.'], ['A', 'cheetah', 'chases', 'prey', 'on', 'across', 'a', 'field', '.']) 

1450. (['The', 'lady', 'sliced', 'a', 'tomatoe', '.'], ['Someone', 'is', 'cutting', 'a', 'tomato', '.']) 

1451. (['Someone', 'is', 'holding', 'a', 'hedgehog', '.'], ['A', 'onion', 'is', 'being', 'sliced', '.']) 

1452. (['A', 'woman', 'peels', 'a', 'potato', '.'], ['A', 'woman', 'is', 'grilling', 'pineapples', '.']) 

1453. (['A', 'cat', 'is', 'playing', 'on', 'the', 'floor', '.'], ['A', 'man', 'is', 'slicing', 'garlic', '.']) 

1454. (['A', 'boy', 'is', 'playing', 'violin', 'on', 'stage', '.'], ['A', 'person', 'is', 'mixing', 'a', 'pot', '.']) 

1455. (['A', 'cow', 'is', 'eating', 'gra

2207. (['close', 'within', 'bounds', ';', 'deprive', 'of', 'freedom'], ['to', 'close', 'within', 'bounds', ',', 'limit', 'or', 'hold', 'back', 'from', 'movement', '.']) 

2208. (['to', 'collect', ',', 'acquire', 'or', 'gather'], ['get', 'or', 'gather', 'together', '.']) 

2209. (['deliver', 'a', 'formal', 'talk', 'or', 'reprimand', 'at', 'length'], ['censure', 'severely', 'or', 'angrily', '.']) 

2210. (['a', 'pair', 'of', 'mated', 'people', ',', 'e.g.', ',', 'married'], ['a', 'pair', 'of', 'people', 'who', 'live', 'together', '.']) 

2211. (['rot', ';', 'become', 'unfit', 'for', 'consumption'], ['become', 'unfit', 'for', 'consumption', 'or', 'use', '.']) 

2212. (['lock', 'with', 'one', 'another'], ['become', 'engaged', 'or', 'intermeshed', 'with', 'one', 'another', '.']) 

2213. (['take', 'vows', ',', 'join', 'or', 'allow', 'to', 'join', 'a', 'religious', 'order'], ['take', 'vows', ',', 'as', 'in', 'religious', 'order', '.']) 

2214. (['activity', 'of', 'selling'], ['the', 'general',


2934. (['But', ',', 'like', 'the', 'Union', 'itself', ',', 'it', 'will', 'be', 'built', 'and', 'it', 'will', 'be', 'done', '.'], ['But', 'just', 'as', 'the', 'European Union', 'itself', ',', 'and', 'that', 'it', 'will', 'be', 'done', '.']) 

2935. (['This', 'gross', 'error', 'is', 'leading', 'Russia', 'to', 'political', 'ruin', '.'], ['And', 'this', 'serious', 'error', 'is', 'taking', 'to', 'Russia', 'to', 'its', 'ruin', 'policy', '.']) 

2936. (['But', 'America', "'s", 'interest', 'in', 'Iraqi', 'oil', 'was', 'not', 'driven', 'either', 'by', 'economics', 'or', 'energy', 'policy', '.'], ['But', 'America', "'s", 'interest', 'for', 'Iraqi', 'oil', 'is', 'not', 'dictated', 'by', 'the', 'economy', 'or', 'by', 'energy', 'policy', '.']) 

2937. (['Some', 'results', 'are', 'remarkable', '.'], ['Some', 'results', 'are', 'remarkable', '.']) 

2938. (['A', 'Europe', 'for', 'All'], ['A', 'Europe', 'for', 'all']) 

2939. (['Those', 'who', 'know', 'little', 'about', 'modern', 'factory', 'farming',

**Comparison and discussion of the results achieved by these approaches, among themselves and against the official results (Similarities & *Pearson Correlation Coefficient*).**

In [16]:
ws_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in ws_pairs]
l_ws_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_ws_pairs]
l_sw_ws_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_ws_pairs]
l_sw_jw_ws_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_jw_ws_pairs]

print("Similarities (considering Words + NEs):\n")
for index, similarity in enumerate(ws_similarities, 1):
    print(str(index) + ".", similarity)

Similarities (considering Words + NEs):

1. 0.28
2. 0.25
3. 0.5
4. 0.6190476190476191
5. 0.20833333333333337
6. 0.4473684210526315
7. 0.33333333333333337
8. 0.5405405405405406
9. 0.6428571428571428
10. 0.55
11. 0.33333333333333337
12. 0.40740740740740744
13. 0.5
14. 0.3846153846153846
15. 0.2222222222222222
16. 0.4516129032258065
17. 0.3939393939393939
18. 0.5217391304347826
19. 0.4444444444444444
20. 0.33333333333333337
21. 0.26086956521739135
22. 0.40625
23. 0.7307692307692308
24. 0.625
25. 0.43999999999999995
26. 0.5714285714285714
27. 0.65625
28. 0.21052631578947367
29. 0.33333333333333337
30. 0.4736842105263158
31. 0.6538461538461539
32. 0.35
33. 0.44999999999999996
34. 0.40909090909090906
35. 0.4782608695652174
36. 0.6666666666666667
37. 0.46153846153846156
38. 0.33333333333333337
39. 0.5172413793103448
40. 0.5666666666666667
41. 0.5517241379310345
42. 0.8148148148148149
43. 0.4545454545454546
44. 0.4642857142857143
45. 0.6551724137931034
46. 0.30000000000000004
47. 0.28571428571

446. 0.3142857142857143
447. 0.45833333333333337
448. 0.5517241379310345
449. 0.33333333333333337
450. 0.42307692307692313
451. 0.5652173913043479
452. 0.6153846153846154
453. 0.33333333333333337
454. 0.5
455. 0.4571428571428572
456. 0.368421052631579
457. 0.3571428571428571
458. 0.3913043478260869
459. 0.4347826086956522
460. 0.5454545454545454
461. 0.7241379310344828
462. 0.64
463. 0.4347826086956522
464. 0.25
465. 0.34782608695652173
466. 0.5185185185185186
467. 0.34782608695652173
468. 0.34782608695652173
469. 0.6
470. 0.4
471. 0.4
472. 0.5238095238095238
473. 0.5483870967741935
474. 0.42105263157894735
475. 0.6896551724137931
476. 0.8888888888888888
477. 0.5
478. 0.20833333333333337
479. 0.5517241379310345
480. 0.31999999999999995
481. 0.3157894736842105
482. 0.25806451612903225
483. 0.5172413793103448
484. 0.5294117647058824
485. 0.6538461538461539
486. 0.7619047619047619
487. 0.3571428571428571
488. 0.6333333333333333
489. 0.5555555555555556
490. 0.5333333333333333
491. 0.538461

1083. 0.4545454545454546
1084. 0.4
1085. 0.2222222222222222
1086. 0.5
1087. 0.5714285714285714
1088. 0.5
1089. 0.5
1090. 0.6
1091. 0.5555555555555556
1092. 0.5555555555555556
1093. 0.5555555555555556
1094. 0.4444444444444444
1095. 0.4444444444444444
1096. 0.4444444444444444
1097. 0.4
1098. 0.5
1099. 0.5
1100. 0.5
1101. 0.4444444444444444
1102. 0.30000000000000004
1103. 0.5
1104. 0.4285714285714286
1105. 0.4
1106. 0.5
1107. 0.4
1108. 0.6
1109. 0.4285714285714286
1110. 0.5
1111. 0.5
1112. 0.25
1113. 0.25
1114. 0.4545454545454546
1115. 0.5
1116. 0.4444444444444444
1117. 0.5
1118. 0.4545454545454546
1119. 0.41666666666666663
1120. 0.30000000000000004
1121. 0.6363636363636364
1122. 0.30000000000000004
1123. 0.33333333333333337
1124. 0.5555555555555556
1125. 0.5
1126. 0.625
1127. 0.5
1128. 0.36363636363636365
1129. 0.36363636363636365
1130. 0.4545454545454546
1131. 0.41666666666666663
1132. 0.5714285714285714
1133. 0.36363636363636365
1134. 0.375
1135. 0.3076923076923077
1136. 0.444444444444

1577. 0.25
1578. 1.0
1579. 0.5454545454545454
1580. 0.3529411764705882
1581. 0.3529411764705882
1582. 0.4375
1583. 0.4285714285714286
1584. 1.0
1585. 0.4285714285714286
1586. 1.0
1587. 0.5
1588. 0.4
1589. 0.6428571428571428
1590. 1.0
1591. 0.25
1592. 0.5555555555555556
1593. 0.16666666666666663
1594. 0.6
1595. 0.2857142857142857
1596. 0.4117647058823529
1597. 0.4
1598. 0.5
1599. 0.7222222222222222
1600. 0.34782608695652173
1601. 0.2727272727272727
1602. 0.4117647058823529
1603. 1.0
1604. 0.5
1605. 0.5555555555555556
1606. 0.2666666666666667
1607. 0.33333333333333337
1608. 1.0
1609. 0.5
1610. 0.31818181818181823
1611. 0.47058823529411764
1612. 0.9
1613. 1.0
1614. 0.33333333333333337
1615. 0.5652173913043479
1616. 1.0
1617. 0.2272727272727273
1618. 0.9
1619. 0.29166666666666663
1620. 0.4782608695652174
1621. 0.625
1622. 0.7222222222222222
1623. 0.5
1624. 0.7142857142857143
1625. 0.2222222222222222
1626. 0.6
1627. 0.19999999999999996
1628. 0.5416666666666667
1629. 0.4285714285714286
1630.

2163. 0.2727272727272727
2164. 0.09999999999999998
2165. 0.0714285714285714
2166. 0.25
2167. 0.09999999999999998
2168. 0.09999999999999998
2169. 0.15384615384615385
2170. 0.06666666666666665
2171. 0.30000000000000004
2172. 0.2777777777777778
2173. 0.05882352941176472
2174. 0.16666666666666663
2175. 0.15384615384615385
2176. 0.1875
2177. 0.19999999999999996
2178. 0.30000000000000004
2179. 0.2222222222222222
2180. 0.11111111111111116
2181. 0.06666666666666665
2182. 0.06666666666666665
2183. 0.23529411764705888
2184. 0.19999999999999996
2185. 0.15000000000000002
2186. 0.16666666666666663
2187. 0.06666666666666665
2188. 0.09999999999999998
2189. 0.19999999999999996
2190. 0.09999999999999998
2191. 0.0
2192. 0.25
2193. 0.09523809523809523
2194. 0.0714285714285714
2195. 0.10526315789473684
2196. 0.08333333333333337
2197. 0.10526315789473684
2198. 0.07692307692307687
2199. 0.125
2200. 0.18181818181818177
2201. 0.052631578947368474
2202. 0.19999999999999996
2203. 0.0714285714285714
2204. 0.0714

2628. 0.7692307692307692
2629. 0.6363636363636364
2630. 0.6666666666666667
2631. 0.5714285714285714
2632. 0.375
2633. 0.6428571428571428
2634. 0.625
2635. 0.8181818181818181
2636. 0.7272727272727273
2637. 0.5555555555555556
2638. 0.625
2639. 0.6
2640. 0.5555555555555556
2641. 0.5333333333333333
2642. 0.75
2643. 0.4444444444444444
2644. 0.8421052631578947
2645. 0.41666666666666663
2646. 0.5
2647. 0.5714285714285714
2648. 0.7
2649. 0.7
2650. 0.5333333333333333
2651. 0.6666666666666667
2652. 0.8571428571428572
2653. 0.6666666666666667
2654. 0.6666666666666667
2655. 0.5
2656. 0.75
2657. 0.75
2658. 0.625
2659. 0.7692307692307692
2660. 0.625
2661. 0.5454545454545454
2662. 0.7
2663. 0.5714285714285714
2664. 0.6666666666666667
2665. 0.7272727272727273
2666. 0.7
2667. 0.6666666666666667
2668. 0.6666666666666667
2669. 0.5384615384615384
2670. 0.7692307692307692
2671. 0.8333333333333334
2672. 0.5
2673. 0.46153846153846156
2674. 0.625
2675. 0.6363636363636364
2676. 0.5
2677. 0.6666666666666667
267

In [17]:
print("Pearson correlation (words + NEs):", pearsonr(gs, ws_similarities)[0])
print("Pearson correlation (words + NEs + lowercase):", pearsonr(gs, l_ws_similarities)[0])
print("Pearson correlation (words + NEs + lowercase + no stopwords):", pearsonr(gs, l_sw_ws_similarities)[0])
print("Pearson correlation (words + NEs + lowercase + no stopwords + just words):", pearsonr(gs, l_sw_jw_ws_similarities)[0])

Pearson correlation (words + NEs): 0.35193061002587833
Pearson correlation (words + NEs + lowercase): 0.378475124223706
Pearson correlation (words + NEs + lowercase + no stopwords): 0.4545719952567138
Pearson correlation (words + NEs + lowercase + no stopwords + just words): 0.5276485437381505
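
Note that the similarities lie in [0, 1] while the Gold Standard lies in [0, 5]. This does not affect the coefficients above, since the Pearson correlation is invariant to a positive linear rescaling of either variable. A minimal sketch with made-up toy values (the `pearson` helper is a hypothetical pure-Python stand-in for `scipy.stats.pearsonr`, not the code used in this notebook):

```python
import math


def pearson(xs, ys):
    """Plain-Python Pearson correlation (illustrative stand-in for scipy.stats.pearsonr)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values chosen for illustration only; they are NOT taken from the data set.
toy_gs = [4.4, 0.8, 3.6, 2.0, 5.0]
toy_similarities = [0.28, 0.25, 0.5, 0.62, 0.9]

# Correlation with similarities in [0, 1] and with the same values rescaled to [0, 5].
r_unit = pearson(toy_gs, toy_similarities)
r_scaled = pearson(toy_gs, [5 * s for s in toy_similarities])
print(r_unit, r_scaled)
```

Both printed values agree up to floating-point error, which is why the similarities only need to be multiplied by 5 when a single pair is compared directly against its Gold Standard score.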


Each of the minor changes explained in section 2.1.1 (with *Word Tokenization*) improves the *Pearson Correlation Coefficient*. Nevertheless, the results of this approach are worse than those of the *Word Tokenization* approach. Consequently, adding Named Entities to the *Word Tokenization* technique can reduce the performance of the paraphrase detection system.

To illustrate this, the second pair of sentences is compared with its result in the *Word Tokenization* approach:

In [18]:
print("Word tokenization + lowercase + no stopwords + just words approach (similarity: " + str(round(l_sw_jw_wt_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", l_sw_jw_wt_pairs[1][0])
print("Second sentence:", l_sw_jw_wt_pairs[1][1])

print("\nWord + Named Entities + lowercase + no stopwords + just words approach (similarity: " + str(round(l_sw_jw_ws_similarities[1]*5, 3)) + "/5.0, gs: " + str(gs[1]) + "/5.0):")
print("\nFirst sentence:", l_sw_jw_ws_pairs[1][0])
print("Second sentence:", l_sw_jw_ws_pairs[1][1])

Word tokenization + lowercase + no stopwords + just words approach (similarity: 0.714/5.0, gs: 0.8/5.0):

First sentence: ['technology-laced', 'nasdaq', 'composite', 'index', '.ixic', 'inched', '1', 'point', '0.11', 'percent', '1,650']
Second sentence: ['broad', 'standard', 'poor', "'s", '500', 'index', '.spx', 'inched', '3', 'points', '0.32', 'percent', '970']

Word + Named Entities + lowercase + no stopwords + just words approach (similarity: 0.5/5.0, gs: 0.8/5.0):

First sentence: ['technology-laced', 'Nasdaq Composite Index', '.ixic', 'inched', '1', 'point', '0.11', 'percent', '1,650']
Second sentence: ['broad', 'standard', 'Poor', "'s", '500', 'index', '.spx', 'inched', '3', 'points', '0.32', 'percent', '970']


The similarity value drops because the detected Named Entity 'Nasdaq Composite Index' is merged into a single token, so the system can no longer match 'Index' against the token 'index' in the other sentence. Consequently, the *Word Sequences* approach gets a worse value than the *Word Tokenization* technique in this case (and in general, as the *Pearson Correlation Coefficient* shows).

The underlying reason for this performance of the words + NEs approach is the difficulty of obtaining correct Named Entities. For example, if 'Nasdaq Composite Index' is recognized as a Named Entity in the first sentence, 'Standard & Poor's Index' should be recognized as another Named Entity in the second one. In that case, 'Index' would not remain in the second sentence as a single token with no counterpart in the first one.

**The best *Pearson Correlation Coefficient* obtained with *Word Sequences* in this section is: 0.528**

### 2.2.3 Dependency Triples

In this syntactic dimension approach, the `CoreNLPDependencyParser` is used to get the dependency triples of each sentence. The similarity (based on *Jaccard Distance*) is then computed over these triples.

Note that the following command must be executed in the *Stanford CoreNLP* folder before running this cell:

`java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000`

In [30]:
parser = CoreNLPDependencyParser(url='http://localhost:9000')

# Notice that if the sentence contains a "%", it gets removed 
# as the parser does not allow this symbol.
dt_pairs = [([t for t in next(parser.raw_parse(p[0].replace("%", ""))).triples()], 
             [t for t in next(parser.raw_parse(p[1].replace("%", ""))).triples()]) 
            for p in pairs]

for index, pair in enumerate(dt_pairs, 1):
    print(str(index) + ".", pair, '\n')

1. ([(('mean', 'VB'), 'nsubj', ('problem', 'NN')), (('problem', 'NN'), 'det', ('The', 'DT')), (('mean', 'VB'), 'advmod', ('likely', 'RB')), (('mean', 'VB'), 'aux', ('will', 'MD')), (('mean', 'VB'), 'obj', ('changes', 'NNS')), (('changes', 'NNS'), 'amod', ('corrective', 'JJ')), (('changes', 'NNS'), 'acl', ('starts', 'VBZ')), (('starts', 'VBZ'), 'mark', ('before', 'IN')), (('starts', 'VBZ'), 'nsubj', ('fleet', 'NN')), (('fleet', 'NN'), 'det', ('the', 'DT')), (('fleet', 'NN'), 'compound', ('shuttle', 'NN')), (('starts', 'VBZ'), 'xcomp', ('flying', 'VBG')), (('flying', 'VBG'), 'advmod', ('again', 'RB')), (('mean', 'VB'), 'punct', ('.', '.'))], [(('said', 'VBD'), 'nsubj', ('He', 'PRP')), (('said', 'VBD'), 'ccomp', ('needs', 'VBZ')), (('needs', 'VBZ'), 'nsubj', ('problem', 'NN')), (('problem', 'NN'), 'det', ('the', 'DT')), (('needs', 'VBZ'), 'xcomp', ('corrected', 'VBN')), (('corrected', 'VBN'), 'mark', ('to', 'TO')), (('corrected', 'VBN'), 'aux:pass', ('be', 'VB')), (('corrected', 'VBN'), '

1418. ([(('share', 'VBP'), 'nsubj', ('robots', 'NNS')), (('robots', 'NNS'), 'nummod', ('Two', 'CD')), (('share', 'VBP'), 'obj', ('kiss', 'NN')), (('kiss', 'NN'), 'det', ('a', 'DT')), (('share', 'VBP'), 'punct', ('.', '.'))], [(('kissing', 'VBG'), 'nsubj', ('robot', 'NN')), (('robot', 'NN'), 'det', ('A', 'DT')), (('robot', 'NN'), 'amod', ('male', 'JJ')), (('male', 'JJ'), 'conj', ('female', 'JJ')), (('female', 'JJ'), 'cc', ('and', 'CC')), (('kissing', 'VBG'), 'aux', ('are', 'VBP')), (('kissing', 'VBG'), 'punct', ('.', '.'))]) 

1419. ([(('trying', 'VBG'), 'nsubj', ('cat', 'NN')), (('cat', 'NN'), 'det', ('A', 'DT')), (('trying', 'VBG'), 'aux', ('is', 'VBZ')), (('trying', 'VBG'), 'xcomp', ('touch', 'VB')), (('touch', 'VB'), 'mark', ('to', 'TO')), (('touch', 'VB'), 'obj', ('dog', 'NN')), (('dog', 'NN'), 'det', ('a', 'DT')), (('trying', 'VBG'), 'punct', ('.', '.'))], [(('teased', 'VBD'), 'nsubj', ('cat', 'NN')), (('cat', 'NN'), 'det', ('The', 'DT')), (('teased', 'VBD'), 'obj', ('dog', 'NN'))



**Comparison and comments of the results achieved by this approach among them and among the official results (Similarities & *Pearson Correlation Coefficient*).**

In [39]:
# One pair is 'Tunisia' vs 'Tunisia'. As the triples are built from the direct relations
# between the words of a sentence, a single-word sentence yields no triples. A condition is
# therefore added to the computation of the similarities to avoid dividing by zero.
dt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) 
                   if len(set(p[0]).union(set(p[1]))) != 0 else 1.0
                   for p in dt_pairs]

print("Similarities (considering Dependency Triples):\n")
for index, similarity in enumerate(dt_similarities, 1):
    print(str(index) + ".", similarity)

Similarities (considering Dependency Triples):

1. 0.06666666666666665
2. 0.02777777777777779
3. 0.2941176470588235
4. 0.3214285714285714
5. 0.0357142857142857
6. 0.21999999999999997
7. 0.16000000000000003
8. 0.5
9. 0.2666666666666667
10. 0.30434782608695654
11. 0.19047619047619047
12. 0.17142857142857137
13. 0.18518518518518523
14. 0.07894736842105265
15. 0.18518518518518523
16. 0.23684210526315785
17. 0.23809523809523814
18. 0.05882352941176472
19. 0.19999999999999996
20. 0.12121212121212122
21. 0.0
22. 0.15000000000000002
23. 0.5
24. 0.5
25. 0.1470588235294118
26. 0.5
27. 0.5882352941176471
28. 0.07407407407407407
29. 0.2142857142857143
30. 0.30000000000000004
31. 0.4193548387096774
32. 0.03448275862068961
33. 0.3076923076923077
34. 0.09677419354838712
35. 0.16129032258064513
36. 0.4
37. 0.17142857142857137
38. 0.06896551724137934
39. 0.34285714285714286
40. 0.38888888888888884
41. 0.45945945945945943
42. 0.6764705882352942
43. 0.19999999999999996
44. 0.10810810810810811
45. 0.4
46.

2393. 0.16666666666666663
2394. 0.0
2395. 0.0
2396. 0.0
2397. 0.04347826086956519
2398. 0.4545454545454546
2399. 0.0714285714285714
2400. 0.11111111111111116
2401. 0.1428571428571429
2402. 0.11111111111111116
2403. 0.33333333333333337
2404. 0.08333333333333337
2405. 0.1428571428571429
2406. 0.0
2407. 0.0
2408. 0.125
2409. 0.25
2410. 0.19999999999999996
2411. 0.07692307692307687
2412. 0.33333333333333337
2413. 0.0
2414. 0.25
2415. 0.0
2416. 0.11111111111111116
2417. 0.25
2418. 0.1428571428571429
2419. 0.04761904761904767
2420. 0.11111111111111116
2421. 0.19999999999999996
2422. 0.11111111111111116
2423. 0.0
2424. 0.23529411764705888
2425. 0.0625
2426. 0.16000000000000003
2427. 0.1428571428571429
2428. 0.36363636363636365
2429. 0.19999999999999996
2430. 0.19999999999999996
2431. 0.0
2432. 0.15384615384615385
2433. 0.06666666666666665
2434. 0.18181818181818177
2435. 0.0
2436. 0.09999999999999998
2437. 0.2142857142857143
2438. 0.08108108108108103
2439. 0.5
2440. 0.0
2441. 0.0
2442. 0.06666

In [40]:
print("Pearson correlation (Dependency Triples):", pearsonr(gs, dt_similarities)[0])

Pearson correlation (Dependency Triples): 0.3183824763521411


When dependency triples were studied in the Lab Sessions, they gave the worst result obtained along the course on the *Trial* set (*i.e.* -0.133). Even though the *Pearson Correlation Coefficient* improves to 0.318 on the *Test* set, it is still low compared with the other coefficients computed in this notebook, which indicates the poor performance of this approach when computing similarities.

Nevertheless, as in every approach, some cases will be studied separately in order to gain insight into the underlying reason for this performance:

In [63]:
print("Original pair of sentences (Gold standard: " + str(gs[0]) + "/5)")
print(pairs[0])

Original pair of sentences (Gold standard: 4.4/5)
('The problem likely will mean corrective changes before the shuttle fleet starts flying again.', 'He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.')


In [45]:
print("Dependency Triples from first sentence:")
dt_pairs[0][0]

Dependency Triples from first sentence:


[(('mean', 'VB'), 'nsubj', ('problem', 'NN')),
 (('problem', 'NN'), 'det', ('The', 'DT')),
 (('mean', 'VB'), 'advmod', ('likely', 'RB')),
 (('mean', 'VB'), 'aux', ('will', 'MD')),
 (('mean', 'VB'), 'obj', ('changes', 'NNS')),
 (('changes', 'NNS'), 'amod', ('corrective', 'JJ')),
 (('changes', 'NNS'), 'acl', ('starts', 'VBZ')),
 (('starts', 'VBZ'), 'mark', ('before', 'IN')),
 (('starts', 'VBZ'), 'nsubj', ('fleet', 'NN')),
 (('fleet', 'NN'), 'det', ('the', 'DT')),
 (('fleet', 'NN'), 'compound', ('shuttle', 'NN')),
 (('starts', 'VBZ'), 'xcomp', ('flying', 'VBG')),
 (('flying', 'VBG'), 'advmod', ('again', 'RB')),
 (('mean', 'VB'), 'punct', ('.', '.'))]

In [46]:
print("Dependency Triples from second sentence:")
dt_pairs[0][1]

Dependency Triples from second sentence:


[(('said', 'VBD'), 'nsubj', ('He', 'PRP')),
 (('said', 'VBD'), 'ccomp', ('needs', 'VBZ')),
 (('needs', 'VBZ'), 'nsubj', ('problem', 'NN')),
 (('problem', 'NN'), 'det', ('the', 'DT')),
 (('needs', 'VBZ'), 'xcomp', ('corrected', 'VBN')),
 (('corrected', 'VBN'), 'mark', ('to', 'TO')),
 (('corrected', 'VBN'), 'aux:pass', ('be', 'VB')),
 (('corrected', 'VBN'), 'advcl', ('cleared', 'VBN')),
 (('cleared', 'VBN'), 'mark', ('before', 'IN')),
 (('cleared', 'VBN'), 'nsubj:pass', ('fleet', 'NN')),
 (('fleet', 'NN'), 'det', ('the', 'DT')),
 (('fleet', 'NN'), 'compound', ('shuttle', 'NN')),
 (('shuttle', 'NN'), 'compound', ('space', 'NN')),
 (('cleared', 'VBN'), 'aux:pass', ('is', 'VBZ')),
 (('cleared', 'VBN'), 'xcomp', ('fly', 'VB')),
 (('fly', 'VB'), 'mark', ('to', 'TO')),
 (('fly', 'VB'), 'advmod', ('again', 'RB')),
 (('said', 'VBD'), 'punct', ('.', '.'))]

As already explained, *Jaccard similarity* only takes into account, in this case, the triples located in both sentences. In this example:

```
(('fleet', 'NN'), 'det', ('the', 'DT'))
(('fleet', 'NN'), 'compound', ('shuttle', 'NN'))
```

Note that `(('problem', 'NN'), 'det', ('The', 'DT'))` is not an exact match, because the second sentence contains `('the', 'DT')` in lowercase.

Therefore, the shared Dependency Triples in this pair suggest that both sentences have a meaning related to 'fleet' and 'shuttle' (and to 'problem', if capitalization is ignored). Moreover, these terms have the same dependency in their respective sentences (*e.g.* 'fleet' is related to 'shuttle').

Nevertheless, the number of Dependency Triples located in both sentences is too low with respect to the total number of Dependency Triples computed for each sentence. Consequently, the similarity is much lower than expected:

In [60]:
print("Similarity using Dependency Triples: " + str(round(dt_similarities[0]*5, 3)) + "/5.")

Similarity using Dependency Triples: 0.333/5.
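
The value above can be reproduced by intersecting the two lists of triples printed earlier. A minimal sketch using plain Python sets instead of `nltk.metrics.jaccard_distance` (the variable names are illustrative and are not taken from the notebook):

```python
# Triples copied from the outputs shown above for the first pair of sentences.
first_triples = [
    (('mean', 'VB'), 'nsubj', ('problem', 'NN')),
    (('problem', 'NN'), 'det', ('The', 'DT')),
    (('mean', 'VB'), 'advmod', ('likely', 'RB')),
    (('mean', 'VB'), 'aux', ('will', 'MD')),
    (('mean', 'VB'), 'obj', ('changes', 'NNS')),
    (('changes', 'NNS'), 'amod', ('corrective', 'JJ')),
    (('changes', 'NNS'), 'acl', ('starts', 'VBZ')),
    (('starts', 'VBZ'), 'mark', ('before', 'IN')),
    (('starts', 'VBZ'), 'nsubj', ('fleet', 'NN')),
    (('fleet', 'NN'), 'det', ('the', 'DT')),
    (('fleet', 'NN'), 'compound', ('shuttle', 'NN')),
    (('starts', 'VBZ'), 'xcomp', ('flying', 'VBG')),
    (('flying', 'VBG'), 'advmod', ('again', 'RB')),
    (('mean', 'VB'), 'punct', ('.', '.')),
]
second_triples = [
    (('said', 'VBD'), 'nsubj', ('He', 'PRP')),
    (('said', 'VBD'), 'ccomp', ('needs', 'VBZ')),
    (('needs', 'VBZ'), 'nsubj', ('problem', 'NN')),
    (('problem', 'NN'), 'det', ('the', 'DT')),
    (('needs', 'VBZ'), 'xcomp', ('corrected', 'VBN')),
    (('corrected', 'VBN'), 'mark', ('to', 'TO')),
    (('corrected', 'VBN'), 'aux:pass', ('be', 'VB')),
    (('corrected', 'VBN'), 'advcl', ('cleared', 'VBN')),
    (('cleared', 'VBN'), 'mark', ('before', 'IN')),
    (('cleared', 'VBN'), 'nsubj:pass', ('fleet', 'NN')),
    (('fleet', 'NN'), 'det', ('the', 'DT')),
    (('fleet', 'NN'), 'compound', ('shuttle', 'NN')),
    (('shuttle', 'NN'), 'compound', ('space', 'NN')),
    (('cleared', 'VBN'), 'aux:pass', ('is', 'VBZ')),
    (('cleared', 'VBN'), 'xcomp', ('fly', 'VB')),
    (('fly', 'VB'), 'mark', ('to', 'TO')),
    (('fly', 'VB'), 'advmod', ('again', 'RB')),
    (('said', 'VBD'), 'punct', ('.', '.')),
]

# Exact matches only: ('The', 'DT') and ('the', 'DT') count as different elements.
shared = set(first_triples) & set(second_triples)
union = set(first_triples) | set(second_triples)
similarity = len(shared) / len(union)

print(sorted(shared))
print(len(shared), "shared of", len(union), "distinct triples ->", round(similarity * 5, 3))
```

Only the two 'fleet' triples are shared among 30 distinct triples, which yields the 0.333/5 shown above.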


Moreover, inspecting the Dependency Triples shows that some of them are not counted as equal because a single element of the triple differs, which illustrates how difficult it is to get exact matches. For example:

```
(('flying', 'VBG'), 'advmod', ('again', 'RB'))
(('fly', 'VB'), 'advmod', ('again', 'RB'))
```

For that reason, combining this approach with lemmatization could be considered in order to fix these cases. However, the initial *Pearson Correlation Coefficient* is too low, and it is not expected to increase to the point of exceeding the values already obtained.

Furthermore, this approach involves other problems, such as the possibility of getting empty sets when the sentences are too short to contain any dependency.

The second pair of sentences is studied next:
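
The empty-set case can be isolated in a small helper. This is a minimal sketch: the function name and the `empty_value` parameter are hypothetical, and returning 1.0 for two empty sets simply mirrors the choice made for the 'Tunisia' vs 'Tunisia' pair in this notebook.

```python
def safe_jaccard_similarity(a, b, empty_value=1.0):
    """Jaccard similarity that does not divide by zero when both sets are empty.

    `empty_value` is an assumption: 1.0 treats two sentences without triples
    as identical, which holds for 'Tunisia' vs 'Tunisia' but would be wrong
    for two different single-word sentences.
    """
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return empty_value
    return len(a & b) / len(union)


print(safe_jaccard_similarity([], []))            # both sentences have no triples
print(safe_jaccard_similarity(['x'], []))         # only one sentence has triples
print(safe_jaccard_similarity(['x', 'y'], ['y', 'z']))
```

A more conservative design would fall back to a lexical similarity for such pairs instead of a constant, since the triples carry no information in that case.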

In [62]:
print("Original pair of sentences (Gold standard: " + str(gs[1]) + "/5)")
print(pairs[1])

Original pair of sentences (Gold standard: 0.8/5)
('The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.', "The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.")


In [50]:
print("Dependency Triples from first sentence:")
dt_pairs[1][0]

Dependency Triples from first sentence:


[(('Index', 'NNP'), 'det', ('The', 'DT')),
 (('Index', 'NNP'), 'amod', ('laced', 'VBN')),
 (('laced', 'VBN'), 'obl', ('technology', 'NN')),
 (('laced', 'VBN'), 'punct', ('-', 'HYPH')),
 (('Index', 'NNP'), 'compound', ('Nasdaq', 'NNP')),
 (('Index', 'NNP'), 'compound', ('Composite', 'NNP')),
 (('Index', 'NNP'), 'punct', ('.', '.')),
 (('Index', 'NNP'), 'dep', ('inched', 'VBD')),
 (('inched', 'VBD'), 'nsubj', ('IXIC', 'NNP')),
 (('inched', 'VBD'), 'advmod', ('down', 'RB')),
 (('down', 'RB'), 'obl:npmod', ('point', 'NN')),
 (('point', 'NN'), 'nummod', ('1', 'CD')),
 (('point', 'NN'), 'punct', (',', ',')),
 (('point', 'NN'), 'conj', ('percent', 'NN')),
 (('percent', 'NN'), 'cc', ('or', 'CC')),
 (('percent', 'NN'), 'nummod', ('0.11', 'CD')),
 (('point', 'NN'), 'punct', (',', ',')),
 (('inched', 'VBD'), 'obl', ('1,650', 'CD')),
 (('1,650', 'CD'), 'case', ('to', 'IN')),
 (('Index', 'NNP'), 'punct', ('.', '.'))]

In [51]:
print("Dependency Triples from second sentence:")
dt_pairs[1][1]

Dependency Triples from second sentence:


[(('Index', 'NN'), 'det', ('The', 'DT')),
 (('Index', 'NN'), 'amod', ('broad', 'JJ')),
 (('Index', 'NN'), 'nmod:poss', ('Standard', 'NNP')),
 (('Standard', 'NNP'), 'conj', ('Poor', 'NNP')),
 (('Poor', 'NNP'), 'cc', ('&', 'CC')),
 (('Standard', 'NNP'), 'case', ("'s", 'POS')),
 (('Index', 'NN'), 'nummod', ('500', 'CD')),
 (('Index', 'NN'), 'punct', ('.', '.')),
 (('Index', 'NN'), 'dep', ('inched', 'VBD')),
 (('inched', 'VBD'), 'nsubj', ('SPX', 'NNP')),
 (('inched', 'VBD'), 'compound:prt', ('up', 'RP')),
 (('inched', 'VBD'), 'obj', ('points', 'NNS')),
 (('points', 'NNS'), 'nummod', ('3', 'CD')),
 (('points', 'NNS'), 'punct', (',', ',')),
 (('points', 'NNS'), 'conj', ('percent', 'NN')),
 (('percent', 'NN'), 'cc', ('or', 'CC')),
 (('percent', 'NN'), 'nummod', ('0.32', 'CD')),
 (('points', 'NNS'), 'punct', (',', ',')),
 (('inched', 'VBD'), 'obl', ('970', 'CD')),
 (('970', 'CD'), 'case', ('to', 'IN')),
 (('Index', 'NN'), 'punct', ('.', '.'))]

In [64]:
print("Similarity using Dependency Triples: " + str(round(dt_similarities[1]*5, 3)) + "/5.")

Similarity using Dependency Triples: 0.139/5.


In this case, the number of computed Dependency Triples is greater. Nevertheless, if the triples that look alike in both sentences are collected:

```
(('Index', 'NNP'), 'det', ('The', 'DT'))
(('Index', 'NNP'), 'punct', ('.', '.'))
(('Index', 'NN'), 'dep', ('inched', 'VBD'))
(('percent', 'NN'), 'cc', ('or', 'CC'))
```

Only the last one is an exact match: in the other three, 'Index' is tagged as `NNP` in the first sentence and as `NN` in the second one. The same situation as in the previous example occurs: the proportion of matching triples is low with respect to the total number of returned Dependency Triples. Therefore, the resulting similarity is lower than the one expected from the Gold Standard (*i.e.* 0.139 vs. 0.8), even though this expected similarity is already low.

In conclusion, all the described difficulties make the use of Dependency Triples very complex in paraphrase detection. Nevertheless, if they are treated with care (*i.e.* using lemmas or proposing solutions to the exact-match problem), they could provide better results.
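
One possible way to relax the exact-match requirement is to normalize each triple before comparing it. The sketch below is not part of the evaluated system: `normalize_triple`, `toy_lemmatize` and the `TOY_LEMMAS` lookup are hypothetical names, and the lookup only stands in for a real lemmatizer. Whether this normalization improves the *Pearson Correlation Coefficient* has not been tested here.

```python
def normalize_triple(triple, lemmatize=str.lower):
    """Drop the PoS tags and normalize both words of a dependency triple."""
    (head, _head_tag), relation, (dependent, _dependent_tag) = triple
    return (lemmatize(head), relation, lemmatize(dependent))


# Toy lookup standing in for a real lemmatizer (assumption, for illustration only).
TOY_LEMMAS = {'flying': 'fly'}


def toy_lemmatize(word):
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


# Near-miss triples taken from the outputs of this section (first vs second sentence).
near_misses = [
    ((('problem', 'NN'), 'det', ('The', 'DT')), (('problem', 'NN'), 'det', ('the', 'DT'))),
    ((('Index', 'NNP'), 'det', ('The', 'DT')), (('Index', 'NN'), 'det', ('The', 'DT'))),
    ((('flying', 'VBG'), 'advmod', ('again', 'RB')), (('fly', 'VB'), 'advmod', ('again', 'RB'))),
]

exact = [a == b for a, b in near_misses]
lowered = [normalize_triple(a) == normalize_triple(b) for a, b in near_misses]
lemmatized = [normalize_triple(a, toy_lemmatize) == normalize_triple(b, toy_lemmatize)
              for a, b in near_misses]

print("Exact match:        ", exact)
print("Lowercase, no tags: ", lowered)
print("Lemmas, no tags:    ", lemmatized)
```

Dropping the tags recovers the capitalization and `NNP`/`NN` mismatches, while the 'flying'/'fly' case additionally needs lemmas. The trade-off is that discarding PoS tags can also produce matches between words that play different roles.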

### 2.3 Combination of Lexical & Syntactic Dimensions

### 2.3.1 Weighting Combined Similarities Approach

Compare and comment the results achieved by these approaches among them and among the official results (Similarities & Pearson Correlation Coefficient).

## 3. Other proposed approaches

### 3.1 Utilizing other PoS taggers.

In [20]:
# David

Compare and comment the results achieved by these approaches among them and among the official results (Similarities & *Pearson Correlation Coefficient*).

## 4. Conclusions