# STS Project
## Introduction
This *Jupyter Notebook* contains the STS (Semantic Textual Similarity) project for the **Introduction to Human Language Technologies** (IHLT) course of the MAI (Master of Artificial Intelligence) at UPC.

This project was carried out by:
- David Dueñas Gaviria
- Kevin David Rosales Santana

The statement is as follows:
- Use the data set and task description of Semantic Textual Similarity from SemEval 2012.

- Implement some approaches to detect paraphrases using sentence similarity metrics.

    - Explore some lexical dimensions.
    - Explore the syntactic dimension alone.
    - Explore the combination of both.

- Add new components of your choice (optional).

- Neither word nor sentence embeddings are allowed.

- Compare and comment on the results achieved by these approaches, both against each other and against the official results.

- Submit the files to raco in IHLT STS Project before the oral presentation:

    - Jupyter notebook: `sts-[Student1]-[Student2].ipynb`

    - Slides: `sts-[Student1]-[Student2].pdf`
    
To measure the similarity between the two sentences of each pair, the [*Jaccard distance*](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.jaccard_distance) between their sets of tokens is used:

$ Similarity = 1 - Jaccard_{Distance} $
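
As a quick worked example, the similarity can be computed by hand for two sentences taken from the data set. This is a minimal pure-Python sketch: the helper `jaccard_similarity` is a hypothetical name introduced here, and it is assumed to match `1 - jaccard_distance(a, b)` for non-empty sets. The handling of two empty sets is also an assumption made for this illustration.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sentences are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


sentence_a = ['Three', 'men', 'are', 'playing', 'guitars', '.']
sentence_b = ['Three', 'men', 'are', 'on', 'stage', 'playing', 'guitars', '.']

# 6 shared tokens out of 8 distinct tokens overall.
print(jaccard_similarity(sentence_a, sentence_b))  # 0.75
```

Since both sentences are converted to sets, repeated tokens and word order have no influence on the score.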

The [*Pearson correlation coefficient*](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) will be used to measure the linear relationship between our similarities and the reference similarities of the *Gold Standard*. The coefficient varies between -1 and +1, with 0 implying no linear correlation. Correlations of -1 and +1 imply an exact linear relationship.
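
The behaviour of the coefficient can be illustrated with a small pure-Python sketch. The helper `pearson` and the score lists below are hypothetical and only meant to show the two extreme cases; the notebook itself relies on `scipy.stats.pearsonr`. Because the coefficient is unaffected by a positive linear rescaling, similarities in [0, 1] can be compared directly with reference scores expressed on another scale.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


gold = [0.0, 1.0, 2.5, 4.0, 5.0]      # hypothetical reference scores
scaled = [g / 5 for g in gold]        # same ranking, rescaled to [0, 1]
inverted = [1 - s for s in scaled]    # exactly the opposite trend

print(pearson(gold, scaled))    # close to +1
print(pearson(gold, inverted))  # close to -1
```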

## Imports

In [1]:
import nltk, re

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.metrics import jaccard_distance
from nltk.corpus import stopwords
from scipy.stats import pearsonr

## 1. Data Preparation
This section covers the preparation of the input data. The data used throughout the IHLT course mostly come from the *trial* set. In this project, however, the *test* set is used to compute the similarities and measure the performance of the proposed models.

The input data consists of five files:
- `STS.input.MSRpar.txt`
- `STS.input.MSRvid.txt`
- `STS.input.SMTeuroparl.txt`
- `STS.input.surprise.OnWN.txt`
- `STS.input.surprise.SMTnews.txt`

Therefore, the list of sentence pairs is formed by concatenating the contents of the five input files.

The variable `sw` stores the set of English stopwords.

In [2]:
pairs = list()

sw = set(stopwords.words('english'))

input_files = ['STS.input.MSRpar.txt',
               'STS.input.MSRvid.txt',
               'STS.input.SMTeuroparl.txt',
               'STS.input.surprise.OnWN.txt',
               'STS.input.surprise.SMTnews.txt']

for file in input_files:
    with open('inputs/test-gold/' + file, 'r') as f:
        lines = f.readlines()
        for line in lines:
            line = nltk.TabTokenizer().tokenize(line.strip())
            pairs.append((line[0], line[1]))
        
for index, pair in enumerate(pairs, 1):
    print(str(index) + ".", pair)

1. ('The problem likely will mean corrective changes before the shuttle fleet starts flying again.', 'He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.')
2. ('The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650.', "The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.")
3. ('"It\'s a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896.', '"It\'s a huge black eye," Arthur Sulzberger, the newspaper\'s publisher, said of the scandal.')
4. ('SEC Chairman William Donaldson said there is a "building confidence out there that the cop is on the beat."', '"I think there\'s a building confidence that the cop is on the beat."')
5. ('Vivendi shares closed 1.9 percent at 15.80 euros in Paris after falling 3.6 percent on Monday.', 'In New York, Vivendi shares were 1.4 percent down at $18.29.')
6. ("Myanmar's pro-democr

1352. ('Paper is being cut with scissors.', 'A piece of paper is being cut.')
1353. ('The lady peeled the potatoe.', 'A woman is peeling a potato.')
1354. ('A person peels shrimp.', 'The lady peeled the shrimp.')
1355. ('A man is playing a guitar.', 'A man tries to read the paper.')
1356. ('A man making a bed in a hotel.', 'A man is holding a animal.')
1357. ('A woman is slicing some tofu.', 'A woman is cutting a block of tofu into small cubes.')
1358. ('A woman is peeling some fish.', 'A woman is pouring a yellow mixture on a frying pan.')
1359. ('The dog pulled the dogs tail and then his leg.', 'A monkey pulled a dogs tail.')
1360. ('A woman beats two eggs in a bowl.', 'A person is mixing ingredients in a bowl.')
1361. ('People are playing baseball.', 'The cricket player hit the ball.')
1362. ('A person is boiling soup.', 'A woman is placing eggs into a pan.')
1363. ('A man is jumping rope outside.', 'A woman is slicing a cucumber.')
1364. ('A person is slicing some onions.', 'A hams

2431. ('forked form or shape', 'a part of a forked or branching shape.')
2432. ('insert or close with a plug', 'fill or close tightly with or as if with a plug.')
2433. ('incite some act of insubordination', 'incite, move, or persuade to some act of lawlessness or insubordination.')
2434. ('formally announce the termination of an agreement', 'announce the termination of, as of treaties.')
2435. ('beat or pound rapidly', 'cause to throb or beat rapidly.')
2436. ('Make a line or marks on a surface; copy by          following the lines of', 'make a mark or lines on a surface.')
2437. ('recite as a chant, intone', 'recite with musical intonation; recite as a chant or a psalm.')
2438. ('the ability of computers to exchange digital information between them and make use of it', '(computer science) the ability to exchange and use information (usually in a large heterogeneous network made up of several local area networks).')
2439. ('act of applying force', 'the act of applying force suddenly.'

In [3]:
print("Number of pairs of sentences:", len(pairs))

Number of pairs of sentences: 3108


In addition, the *Gold Standard* file mentioned above must be read. This file contains the reference similarity for each pair of sentences that was read. These values are used to measure the performance of the proposed models.

In [4]:
with open('inputs/test-gold/STS.gs.ALL.txt','r') as f:
    gs = [float(line) for line in f.readlines()]

print("Gold standard size:", len(gs))

Gold standard size: 3108


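## 2. Paraphrase detection using different approaches
The following preprocessing options are combined in the approaches below:
- Lowercasing
- Stopword removal
- Regular expression filter that keeps only 'word' tokens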

### 2.1 Lexical Dimensions

### 2.1.1 Word tokenization

In this basic approach, both sentences of each pair are split into word tokens. Four variants are built, each one on top of the previous one:

- `wt_pairs`: contains the plain word tokenization of each pair.
- `l_wt_pairs`: contains the tokens of `wt_pairs` in lower case.
- `l_sw_wt_pairs`: contains the tokens of `l_wt_pairs` without stopwords.
- `l_sw_jw_wt_pairs`: contains the tokens of `l_sw_wt_pairs` that include at least one word character (regular expression `\w`), which discards pure punctuation tokens.
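
The effect of each variant can be traced on a single pair. The sketch below uses the tokens of pair 777 as they appear in the tokenization output further down. The three-word stopword set is a hypothetical stand-in for NLTK's English list, and `variants` and `jaccard_similarity` are helper names introduced only for this illustration.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')).
STOPWORDS = {'a', 'is', 'the'}


def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for two non-empty token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


def variants(tokens):
    """Build the four token lists described above for one sentence."""
    lower = [w.lower() for w in tokens]
    no_stopwords = [w for w in lower if w not in STOPWORDS]
    only_words = [w for w in no_stopwords if re.search(r"\w", w)]
    return {'wt': tokens, 'l_wt': lower,
            'l_sw_wt': no_stopwords, 'l_sw_jw_wt': only_words}


first = variants(['A', 'woman', 'is', 'writing', '.'])
second = variants(['A', 'woman', 'is', 'swimming', '.'])

scores = {name: jaccard_similarity(first[name], second[name]) for name in first}
for name, score in scores.items():
    print(name, round(score, 3))
```

For this pair the score drops from 4/6 with plain tokens to 1/3 with the last variant, because the shared tokens `a`, `is` and `.` no longer count as overlap.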

In [7]:
# Kevin

wt_pairs = [(nltk.word_tokenize(p[0]), nltk.word_tokenize(p[1])) for p in pairs]

l_wt_pairs = list()
l_sw_wt_pairs = list()
l_sw_jw_wt_pairs = list()

for pair in wt_pairs:
    l_wt_pairs.append(([w.lower() for w in pair[0]],
                       [w.lower() for w in pair[1]]))
    l_sw_wt_pairs.append(([w.lower() for w in pair[0] if w.lower() not in sw],
                          [w.lower() for w in pair[1] if w.lower() not in sw]))
    l_sw_jw_wt_pairs.append(([w.lower() for w in pair[0] if w.lower() not in sw and re.search(r"\w", w)],
                             [w.lower() for w in pair[1] if w.lower() not in sw and re.search(r"\w", w)]))

for index, pair in enumerate(wt_pairs, 1):
    print(str(index) + ".", pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'changes', 'before', 'the', 'shuttle', 'fleet', 'starts', 'flying', 'again', '.'], ['He', 'said', 'the', 'problem', 'needs', 'to', 'be', 'corrected', 'before', 'the', 'space', 'shuttle', 'fleet', 'is', 'cleared', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'Nasdaq', 'Composite', 'Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'said', 'publisher', 'Arthur', 'Ochs', 'Sulzberger', 'Jr.', ',', 'whose', 'family', 'has', 'controlled', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'Arthur', 'Sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'said', 'of', 'the', 'scandal', '.'

773. (['A', 'dog', 'is', 'trying', 'to', 'get', 'bacon', 'off', 'his', 'back', '.'], ['A', 'dog', 'is', 'trying', 'to', 'eat', 'the', 'bacon', 'on', 'its', 'back', '.']) 

774. (['A', 'woman', 'is', 'carrying', 'a', 'boy', '.'], ['A', 'woman', 'is', 'carrying', 'her', 'baby', '.']) 

775. (['A', 'girl', 'is', 'styling', 'her', 'hair', '.'], ['A', 'girl', 'is', 'brushing', 'her', 'hair', '.']) 

776. (['The', 'polar', 'bear', 'is', 'sliding', 'on', 'the', 'snow', '.'], ['A', 'polar', 'bear', 'is', 'sliding', 'across', 'the', 'snow', '.']) 

777. (['A', 'woman', 'is', 'writing', '.'], ['A', 'woman', 'is', 'swimming', '.']) 

778. (['Three', 'men', 'are', 'playing', 'guitars', '.'], ['Three', 'men', 'are', 'on', 'stage', 'playing', 'guitars', '.']) 

779. (['A', 'cat', 'is', 'rubbing', 'against', 'baby', "'s", 'face', '.'], ['A', 'cat', 'is', 'rubbing', 'against', 'a', 'baby', '.']) 

780. (['The', 'man', 'is', 'riding', 'a', 'horse', '.'], ['A', 'man', 'is', 'riding', 'on', 'a', 'horse',

1525. (['(', 'Parliament', 'adopted', 'the', 'legislative', 'resolution', ')'], ['(', 'Parliament', 'adopted', 'the', 'legislative', 'resolution', ')']) 

1526. (['Let', 'me', 'remind', 'you', 'that', 'our', 'allies', 'include', 'fervent', 'supporters', 'of', 'this', 'tax', '.'], ['I', 'want', 'to', 'say', 'to', 'you', 'that', 'among', 'our', 'allies', ',', 'there', 'are', 'fervent', 'of', 'this', 'tax', '.']) 

1527. (['Let', 'me', 'remind', 'you', 'that', 'our', 'allies', 'include', 'fervent', 'supporters', 'of', 'this', 'tax', '.'], ['I', 'insist', 'on', 'reminding', 'you', 'that', 'among', 'our', 'allies', ',', 'there', 'are', 'devotees', 'of', 'this', 'tax', '.']) 

1528. (['Then', 'perhaps', 'we', 'could', 'have', 'avoided', 'a', 'catastrophe', '.'], ['Perhaps', 'we', 'should', 'have', 'been', 'able', 'to', 'prevent', 'a', 'disaster', '.']) 

1529. (['Selective', 'aid', ',', 'such', 'as', 'market', 'support', 'and', 'a', 'grass', 'subsidy', ',', 'are', 'essential', '.'], ['Specif


2289. (['the', 'act', 'of', 'shoving', 'something', 'away'], ['the', 'act', 'of', 'applying', 'force', 'in', 'order', 'to', 'move', 'something', 'away', '.']) 

2290. (['move', 'or', 'drive', 'forcefully', 'as', 'if', 'by', 'a', 'punch', '.'], ['drive', 'forcibly', 'as', 'if', 'by', 'a', 'punch', '.']) 

2291. (['an', 'intense', 'surprise', ',', 'often', 'unpleasant'], ['an', 'unpleasant', 'or', 'disappointing', 'surprise', '.']) 

2292. (['Make', 'less', 'emotionally', 'hostile', ';', 'win', 'over', 'mentally', 'or', 'emotionally', '.'], ['make', 'less', 'hostile', ';', 'win', 'over', '.']) 

2293. (['informal', 'usage', 'for', 'a', 'domestic', 'cat', ',', 'often', 'young'], ['informal', 'terms', 'referring', 'to', 'a', 'domestic', 'cat', '.']) 

2294. (['The', 'act', 'of', 'sorting', 'one', 'thing', 'from', 'others', '.'], ['sorting', 'one', 'thing', 'from', 'others', '.']) 

2295. (['the', 'act', 'of', 'physically', 'affixing', 'or', 'connecting', 'things'], ['the', 'act', 'of', 'f


3085. (['Indeed', ',', 'intolerance', 'goes', 'right', 'to', 'the', 'top', 'of', 'the', 'Turkish', 'government', '.'], ['It', 'is', 'undeniable', 'that', 'intolerance', 'reached', 'summits', 'of', 'the', 'Turkish', 'Government', '.']) 

3086. (['The', 'Political', 'Stock', 'Market'], ['The', 'politization', 'of', 'the', 'markets', 'of', 'the', 'transferable', 'securities']) 

3087. (['Iraq', "'s", 'future', 'depends', 'directly', 'on', 'the', 'fate', 'of', 'Iraqi', 'oil', 'production', '.'], ['The', 'future', 'of', 'Iraq', 'is', 'linked', 'to', 'oil', 'production', '.']) 

3088. (['This', 'tendency', 'extends', 'deeper', 'than', 'headscarves', '.'], ['This', 'trend', 'goes', 'well', 'beyond', 'simple', 'scarves', '.']) 

3089. (['Indeed', ',', 'intolerance', 'goes', 'right', 'to', 'the', 'top', 'of', 'the', 'Turkish', 'government', '.'], ['It', 'is', 'undeniable', 'that', 'intolerance', 'reached', 'until', 'the', 'heights', 'of', 'the', 'Turkish', 'Government', '.']) 

3090. (['Indeed

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

In [8]:
wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in wt_pairs]
l_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_wt_pairs]
l_sw_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_wt_pairs]
l_sw_jw_wt_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in l_sw_jw_wt_pairs]

print("Similarities (word tokenization):\n")
for index, similarity in enumerate(wt_similarities, 1):
    print(str(index) + ".", similarity)

Similarities (word tokenization):

1. 0.28
2. 0.27586206896551724
3. 0.5555555555555556
4. 0.5909090909090908
5. 0.19999999999999996
6. 0.4878048780487805
7. 0.34782608695652173
8. 0.5405405405405406
9. 0.6428571428571428
10. 0.55
11. 0.4117647058823529
12. 0.40740740740740744
13. 0.5
14. 0.3846153846153846
15. 0.3076923076923077
16. 0.4516129032258065
17. 0.4054054054054054
18. 0.5217391304347826
19. 0.4444444444444444
20. 0.3571428571428571
21. 0.30434782608695654
22. 0.4242424242424242
23. 0.7307692307692308
24. 0.64
25. 0.43999999999999995
26. 0.5714285714285714
27. 0.65625
28. 0.2857142857142857
29. 0.3421052631578947
30. 0.4736842105263158
31. 0.6538461538461539
32. 0.4
33. 0.47619047619047616
34. 0.3913043478260869
35. 0.4782608695652174
36. 0.6785714285714286
37. 0.46153846153846156
38. 0.33333333333333337
39. 0.5161290322580645
40. 0.5666666666666667
41. 0.5333333333333333
42. 0.8214285714285714
43. 0.4545454545454546
44. 0.4482758620689655
45. 0.6551724137931034
46. 0.3000000

709. 0.30000000000000004
710. 0.4117647058823529
711. 0.6060606060606061
712. 0.52
713. 0.25
714. 0.6176470588235294
715. 0.4814814814814815
716. 0.3793103448275862
717. 0.5217391304347826
718. 0.5
719. 0.7272727272727273
720. 0.3928571428571429
721. 0.6799999999999999
722. 0.5652173913043479
723. 0.6333333333333333
724. 0.3214285714285714
725. 0.2962962962962963
726. 0.36363636363636365
727. 0.6666666666666667
728. 0.64
729. 0.42307692307692313
730. 0.2857142857142857
731. 0.23076923076923073
732. 0.4
733. 0.4375
734. 0.34090909090909094
735. 0.34782608695652173
736. 0.38888888888888884
737. 0.3846153846153846
738. 0.42307692307692313
739. 0.47058823529411764
740. 0.18181818181818177
741. 0.368421052631579
742. 0.48
743. 0.42105263157894735
744. 0.3076923076923077
745. 0.7407407407407407
746. 0.43999999999999995
747. 0.4
748. 0.29166666666666663
749. 0.4
750. 0.2857142857142857
751. 0.8
752. 0.625
753. 0.875
754. 0.7272727272727273
755. 0.875
756. 0.6153846153846154
757. 0.71428571428

1199. 0.3571428571428571
1200. 0.4285714285714286
1201. 0.5
1202. 0.33333333333333337
1203. 0.33333333333333337
1204. 0.36363636363636365
1205. 0.33333333333333337
1206. 0.36363636363636365
1207. 0.4545454545454546
1208. 0.36363636363636365
1209. 0.3846153846153846
1210. 0.2222222222222222
1211. 0.30000000000000004
1212. 0.36363636363636365
1213. 0.23076923076923073
1214. 0.33333333333333337
1215. 0.4
1216. 0.3571428571428571
1217. 0.25
1218. 0.4444444444444444
1219. 0.18181818181818177
1220. 0.4
1221. 0.2727272727272727
1222. 0.46153846153846156
1223. 0.3846153846153846
1224. 0.36363636363636365
1225. 0.23076923076923073
1226. 0.41666666666666663
1227. 0.5384615384615384
1228. 0.1875
1229. 0.3571428571428571
1230. 0.5
1231. 0.2727272727272727
1232. 0.5
1233. 0.4444444444444444
1234. 0.5
1235. 0.4
1236. 0.2666666666666667
1237. 0.4
1238. 0.30000000000000004
1239. 0.46153846153846156
1240. 0.4285714285714286
1241. 0.33333333333333337
1242. 0.25
1243. 0.25
1244. 0.4
1245. 0.3076923076923

1747. 0.4
1748. 0.29166666666666663
1749. 0.6363636363636364
1750. 0.4375
1751. 0.29166666666666663
1752. 0.4285714285714286
1753. 0.25
1754. 0.5
1755. 0.38888888888888884
1756. 0.21739130434782605
1757. 0.2727272727272727
1758. 0.5
1759. 0.17647058823529416
1760. 0.3913043478260869
1761. 1.0
1762. 0.5
1763. 0.4285714285714286
1764. 0.17647058823529416
1765. 0.6
1766. 0.30000000000000004
1767. 0.10526315789473684
1768. 0.11111111111111116
1769. 0.11111111111111116
1770. 0.4347826086956522
1771. 0.4
1772. 0.631578947368421
1773. 0.25
1774. 0.2666666666666667
1775. 0.36363636363636365
1776. 0.052631578947368474
1777. 1.0
1778. 0.47058823529411764
1779. 0.4285714285714286
1780. 0.4285714285714286
1781. 0.4117647058823529
1782. 0.33333333333333337
1783. 0.5
1784. 0.52
1785. 0.4
1786. 0.5
1787. 0.5555555555555556
1788. 0.30000000000000004
1789. 0.2857142857142857
1790. 0.4285714285714286
1791. 0.16666666666666663
1792. 0.7
1793. 0.4375
1794. 0.19999999999999996
1795. 0.30000000000000004
179

2198. 0.07692307692307687
2199. 0.125
2200. 0.18181818181818177
2201. 0.052631578947368474
2202. 0.19999999999999996
2203. 0.0714285714285714
2204. 0.0714285714285714
2205. 0.125
2206. 0.16666666666666663
2207. 0.1875
2208. 0.2222222222222222
2209. 0.08333333333333337
2210. 0.33333333333333337
2211. 0.4444444444444444
2212. 0.33333333333333337
2213. 0.3846153846153846
2214. 0.5
2215. 0.4285714285714286
2216. 0.3076923076923077
2217. 0.33333333333333337
2218. 0.36363636363636365
2219. 0.4
2220. 0.41666666666666663
2221. 0.375
2222. 0.5
2223. 0.6153846153846154
2224. 0.4666666666666667
2225. 0.36363636363636365
2226. 0.4285714285714286
2227. 0.4444444444444444
2228. 0.2727272727272727
2229. 0.375
2230. 0.30000000000000004
2231. 0.30000000000000004
2232. 0.5454545454545454
2233. 0.33333333333333337
2234. 0.36363636363636365
2235. 0.4
2236. 0.33333333333333337
2237. 0.5454545454545454
2238. 0.41666666666666663
2239. 0.33333333333333337
2240. 0.2727272727272727
2241. 0.2142857142857143
2242

2960. 0.8571428571428572
2961. 1.0
2962. 0.4285714285714286
2963. 0.4444444444444444
2964. 0.47058823529411764
2965. 0.3157894736842105
2966. 0.12121212121212122
2967. 0.5652173913043479
2968. 0.5
2969. 0.23529411764705888
2970. 0.4
2971. 0.2941176470588235
2972. 0.41666666666666663
2973. 0.6
2974. 0.23809523809523814
2975. 0.5714285714285714
2976. 0.5833333333333333
2977. 0.5652173913043479
2978. 0.6
2979. 0.4444444444444444
2980. 0.2777777777777778
2981. 0.33333333333333337
2982. 0.21739130434782605
2983. 0.6
2984. 0.7142857142857143
2985. 0.5454545454545454
2986. 0.3846153846153846
2987. 0.6086956521739131
2988. 0.4
2989. 0.21052631578947367
2990. 0.4
2991. 0.7777777777777778
2992. 0.3846153846153846
2993. 0.6
2994. 0.625
2995. 0.43333333333333335
2996. 0.6666666666666667
2997. 0.4375
2998. 0.15384615384615385
2999. 0.4444444444444444
3000. 0.5652173913043479
3001. 0.7777777777777778
3002. 0.4
3003. 0.6
3004. 0.33333333333333337
3005. 0.25
3006. 0.23076923076923073
3007. 0.545454545

In [11]:
print("Pearson correlation (word tokenization):", pearsonr(gs, wt_similarities)[0])
print("Pearson correlation (word tokenization + lower):", pearsonr(gs, l_wt_similarities)[0])
print("Pearson correlation (word tokenization + lower + stopwords):", pearsonr(gs, l_sw_wt_similarities)[0])
print("Pearson correlation (word tokenization + lower + stopwords + just words):", pearsonr(gs, l_sw_jw_wt_similarities)[0])

Pearson correlation (word tokenization): 0.3576889958413405
Pearson correlation (word tokenization + lower): 0.4030921572160367
Pearson correlation (word tokenization + lower + stopwords): 0.4823471916155675
Pearson correlation (word tokenization + lower + stopwords + just words): 0.5605613209285878


### 2.1.2 Lemma tokenization (PoS)

In [76]:
# David

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.1.3 Lexical Semantics

In [77]:
# David 

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

### 2.2 Syntactic Dimension

### 2.2.1 Word Sense Disambiguation

In [78]:
# David

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

### 2.2.2 Word Sequences

In this approach, each sentence is represented as a sequence of words and named entities (Words + NEs).

The option `binary=True` is passed to NLTK's NERC (`ne_chunk`) so that named entities are only detected, without classifying them into NE classes such as PERSON, LOCATION or ORGANIZATION. Moreover, the function `tree2conlltags(ne_chunk_res)` is used to iterate over the result and extract the NEs.

The function `transform_sentence(ne_chunk_res)` iterates over the tree using `tree2conlltags(ne_chunk_res)` and returns the sentence as a list in which the tokens of each named entity are merged into a single element (Words + NEs).

In [79]:
def transform_sentence(ne_chunk_res):
    # Flatten the chunk tree into (token, PoS tag, IOB tag) triples.
    conlltags = nltk.chunk.tree2conlltags(ne_chunk_res)
    transformed_sentence = []
    index = 0
    while index < len(conlltags):
        if conlltags[index][2] == 'B-NE':
            # Start of a named entity: merge it with the 'I-NE' tokens that follow.
            ne = conlltags[index][0]
            index += 1
            # The bounds check avoids re-adding the last token when the
            # sentence ends inside a named entity.
            while index < len(conlltags) and conlltags[index][2] == 'I-NE':
                ne += " " + conlltags[index][0]
                index += 1
            transformed_sentence.append(ne)
        else:
            transformed_sentence.append(conlltags[index][0])
            index += 1
    return transformed_sentence
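
The merging step can be checked without running the tagger, on hand-written triples in the `(token, PoS tag, IOB tag)` format that `tree2conlltags` returns. The following is only a sketch: `merge_named_entities` is a hypothetical standalone helper, and the tags in the samples were written by hand rather than produced by NLTK.

```python
def merge_named_entities(conlltags):
    """Join each 'B-NE' token with the 'I-NE' tokens that follow it."""
    merged = []
    index = 0
    while index < len(conlltags):
        token, _pos, iob = conlltags[index]
        index += 1
        if iob == 'B-NE':
            # Consume the continuation tokens of the same entity.
            while index < len(conlltags) and conlltags[index][2] == 'I-NE':
                token += " " + conlltags[index][0]
                index += 1
        merged.append(token)
    return merged


# Hand-written samples: an entity in the middle and one at the very end.
middle = [('The', 'DT', 'O'), ('European', 'NNP', 'B-NE'),
          ('Union', 'NNP', 'I-NE'), ('has', 'VBZ', 'O'), ('got', 'VBN', 'O')]
ending = [('Join', 'VB', 'O'), ('the', 'DT', 'O'),
          ('European', 'NNP', 'B-NE'), ('Union', 'NNP', 'I-NE')]

print(merge_named_entities(middle))  # ['The', 'European Union', 'has', 'got']
print(merge_named_entities(ending))  # ['Join', 'the', 'European Union']
```

The second sample covers a sentence that ends inside a named entity, a boundary case worth testing for any implementation of this merge.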

In [80]:
# Kevin

# TODO: Take the best pos_tag method.
# TODO: For 'words' part, try with lowercase, stopwords and regular expressions.
ws_pairs = [(transform_sentence(ne_chunk(pos_tag(word_tokenize(p[0])), binary=True)), 
             transform_sentence(ne_chunk(pos_tag(word_tokenize(p[1])), binary=True))) 
            for p in pairs]

for index, pair in enumerate(ws_pairs, 1):
    print(str(index) + ".", pair, '\n')

1. (['The', 'problem', 'likely', 'will', 'mean', 'corrective', 'changes', 'before', 'the', 'shuttle', 'fleet', 'starts', 'flying', 'again', '.'], ['He', 'said', 'the', 'problem', 'needs', 'to', 'be', 'corrected', 'before', 'the', 'space', 'shuttle', 'fleet', 'is', 'cleared', 'to', 'fly', 'again', '.']) 

2. (['The', 'technology-laced', 'Nasdaq Composite Index', '.IXIC', 'inched', 'down', '1', 'point', ',', 'or', '0.11', 'percent', ',', 'to', '1,650', '.'], ['The', 'broad', 'Standard', '&', 'Poor', "'s", '500', 'Index', '.SPX', 'inched', 'up', '3', 'points', ',', 'or', '0.32', 'percent', ',', 'to', '970', '.']) 

3. (['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'said', 'publisher', 'Arthur Ochs Sulzberger', 'Jr.', ',', 'whose', 'family', 'has', 'controlled', 'the', 'paper', 'since', '1896', '.'], ['``', 'It', "'s", 'a', 'huge', 'black', 'eye', ',', "''", 'Arthur Sulzberger', ',', 'the', 'newspaper', "'s", 'publisher', ',', 'said', 'of', 'the', 'scandal', '.']) 

4. (['SEC'

1615. (['As', 'I', 'already', 'explained', 'during', 'second', 'reading', ',', 'there', 'is', 'a', 'crisis', 'underlying', 'this', 'directive', 'amendment', '.'], ['As', 'I', 'already', 'explained', 'in', 'second', 'reading', ',', 'a', 'crisis', 'is', 'at', 'the', 'base', 'of', 'this', 'modification', 'of', 'directive', '.']) 

1616. (['Thank', 'you', ',', 'Commissioner', '.'], ['Thank', 'you', ',', 'Commissioner', '.']) 

1617. (['The', 'European Union', 'has', 'got', 'to', 'do', 'something', 'and', 'do', 'it', 'quickly', '.'], ['It', 'is', 'right', 'that', 'the', 'European Union', 'is', 'involved', ',', 'and', 'for', 'this', 'to', 'be', 'done', 'quickly', '.']) 

1618. (['The', 'vote', 'will', 'take', 'place', 'today', 'at', '5.30', 'p.m', '.'], ['The', 'vote', 'will', 'take', 'place', 'at', '5.30', 'p.m', '.']) 

1619. (['Unanimous', 'decisions', ',', 'and', 'hence', 'an', 'inherent', 'incapacity', 'to', 'act', ',', 'remain', 'largely', 'the', 'norm', 'in', 'the', 'Council', '.'], [

3080. (['But', 'from', 'the', 'American', 'point', 'of', 'view', ',', 'the', 'international', 'role', 'of', 'the', 'dollar', 'was', 'a', 'trap', '.'], ['But', 'the', 'US', 'point', 'of', 'view', ',', 'the', 'international', 'role', 'of', 'the', 'dollar', 'was', 'a', 'trap', '.']) 

3081. (['The', 'old', 'version', 'of', 'the', 'European', 'response', '--', 'what', 'psychologists', 'might', 'call', '``', 'dollar', 'envy', "''", '--', 'will', 'only', 'become', 'more', 'acute', '.'], ['The', 'former', 'version', 'of', 'the', 'European', 'reaction', '(', 'the', 'desire', 'called', '``', 'the', 'desire', 'of', 'the', "''", ')', '”', '.']) 

3082. (['Pro-market', 'economists', 'do', "n't", 'object', 'to', 'corporations', 'that', 'blatantly', 'use', 'snob', 'appeal', 'to', 'promote', 'their', 'products', '.'], ['The', 'economists', 'pro-market', 'do', 'not', 'oppose', 'the', 'companies', 'which', 'openly', 'use', 'the', 'attraction', 'of', 'the', 'luxury', 'to', 'promote', 'their', 'products'

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

In [82]:
ws_similarities = [1 - jaccard_distance(set(p[0]), set(p[1])) for p in ws_pairs]

print("Similarities (considering Words + NEs):\n")
for index, similarity in enumerate(ws_similarities, 1):
    print(str(index) + ".", similarity)

Similarities (considering Words + NEs):

1. 0.28
2. 0.25
3. 0.5
4. 0.6190476190476191
5. 0.20833333333333337
6. 0.4473684210526315
7. 0.33333333333333337
8. 0.5405405405405406
9. 0.6428571428571428
10. 0.55
11. 0.33333333333333337
12. 0.40740740740740744
13. 0.5
14. 0.3846153846153846
15. 0.2222222222222222
16. 0.4516129032258065
17. 0.3939393939393939
18. 0.5217391304347826
19. 0.4444444444444444
20. 0.33333333333333337
21. 0.26086956521739135
22. 0.40625
23. 0.7307692307692308
24. 0.625
25. 0.43999999999999995
26. 0.5714285714285714
27. 0.65625
28. 0.21052631578947367
29. 0.33333333333333337
30. 0.4736842105263158
31. 0.6538461538461539
32. 0.35
33. 0.44999999999999996
34. 0.40909090909090906
35. 0.4782608695652174
36. 0.6666666666666667
37. 0.46153846153846156
38. 0.33333333333333337
39. 0.5172413793103448
40. 0.5666666666666667
41. 0.5517241379310345
42. 0.8148148148148149
43. 0.4545454545454546
44. 0.4642857142857143
45. 0.6551724137931034
46. 0.30000000000000004
47. 0.28571428571

2577. 0.625
2578. 0.6363636363636364
2579. 0.625
2580. 0.7142857142857143
2581. 0.8181818181818181
2582. 0.375
2583. 0.6428571428571428
2584. 0.6666666666666667
2585. 0.8666666666666667
2586. 0.5454545454545454
2587. 0.4444444444444444
2588. 0.6666666666666667
2589. 0.6428571428571428
2590. 0.6666666666666667
2591. 0.6666666666666667
2592. 0.7
2593. 0.5714285714285714
2594. 0.6428571428571428
2595. 0.7
2596. 0.6666666666666667
2597. 0.75
2598. 0.7
2599. 0.5
2600. 0.625
2601. 0.5
2602. 0.5714285714285714
2603. 0.5555555555555556
2604. 0.75
2605. 0.7222222222222222
2606. 0.6666666666666667
2607. 0.5384615384615384
2608. 0.7096774193548387
2609. 0.5
2610. 0.5
2611. 0.5555555555555556
2612. 0.7
2613. 0.6666666666666667
2614. 0.75
2615. 0.75
2616. 0.625
2617. 0.7333333333333334
2618. 0.8
2619. 0.5833333333333333
2620. 0.6111111111111112
2621. 0.8
2622. 0.6
2623. 0.5
2624. 0.5
2625. 0.6363636363636364
2626. 0.6363636363636364
2627. 0.625
2628. 0.7692307692307692
2629. 0.6363636363636364
2630

In [83]:
print("Pearson correlation (words + NEs):", pearsonr(gs, ws_similarities)[0])

Pearson correlation (words + NEs): 0.35193061002587833


### 2.3 Combination of Lexical & Syntactic Dimensions

In [None]:
# Next Week

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & Pearson Correlation Coefficient).

## 3. Other proposed approaches

### 3.1 Using other PoS taggers

In [None]:
# David

Compare and comment on the results achieved by these approaches, both against each other and against the official results (Similarities & *Pearson Correlation Coefficient*).

## 4. Conclusions