# Morphology


## Requirements

Run the following cells only if the NLTK resources and the trial data set are not already present on this machine.

In [0]:
import tarfile
import nltk
nltk.download()  # interactive downloader: enter d, then book, then q

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')



!wget https://gebakx.github.io/ihlt/s2/resources/trial.tgz

with tarfile.open('trial.tgz', "r:gz") as tar:
  tar.extractall()
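
Re-running the download cell fetches the archive again (wget then saves a copy such as `trial.tgz.1`). A minimal sketch of a guard that skips work already done is shown here; the helper name `fetch_and_extract` and its parameters are hypothetical and not part of the original notebook, and it uses `urllib.request` in place of `wget`.

```python
import os
import tarfile
import urllib.request


def fetch_and_extract(url, archive, extracted_dir, dest='.'):
    """Download and unpack a .tgz archive, skipping steps already done.

    `extracted_dir` is the top-level folder we expect inside the archive
    (an assumption about its layout, e.g. 'trial' for trial.tgz).
    """
    # Download only when the archive is not on disk yet.
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)
    # Extract only when the expected folder is missing.
    target = os.path.join(dest, extracted_dir)
    if not os.path.isdir(target):
        with tarfile.open(archive, 'r:gz') as tar:
            tar.extractall(path=dest)
    return target


# Possible usage (not executed here because it needs network access):
# fetch_and_extract('https://gebakx.github.io/ihlt/s2/resources/trial.tgz',
#                   'trial.tgz', 'trial')
```

Calling the helper twice should leave a single archive and a single extracted folder.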



In [2]:
!ls

sample_data  trial  trial.tgz


# Lab Session 3

## Input data

In [19]:
import pandas as pd

trial_data = pd.read_csv('trial/STS.input.txt', sep='\t', names=['id', 'sent1', 'sent2'])
trial_data = trial_data.astype(str)                
trial_data

| | id | sent1 | sent2 |
|---|---|---|---|
| 0 | id1 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. |
| 1 | id2 | In May 2010, the troops attempted to invade Ka... | The US army invaded Kabul on May 7th last year... |
| 2 | id3 | John said he is considered a witness but not a... | He is not a suspect anymore. John said. |
| 3 | id4 | They flew out of the nest in groups. | They flew into the nest together. |
| 4 | id5 | The woman is playing the violin. | The young lady enjoys listening to the guitar. |
| 5 | id6 | John went horse back riding at dawn with a who... | Sunrise at dawn is a magnificent view to take ... |


## Preprocessing: Tagging and lemmatization

Steps:

- Tokenize the sentence.
- Tag each token with its part of speech (POS).
- Lemmatize each token, that is, reduce it to its dictionary form using the POS tag. Unlike stemming, the result is a real word that keeps its sense.


In [0]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

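def lemmatize(pairs):
  """Lemmatize a list of (token, Penn Treebank tag) pairs."""
  result = []
  for pair in pairs:
    # Only nouns (N*) and verbs (V*) are lemmatized; the first letter of
    # the tag, lowercased, is the WordNet POS code ('n' or 'v').
    if pair[1][0] in {'N','V'}:
      result.append(wnl.lemmatize(pair[0].lower(),
                                     pos=pair[1][0].lower()))
    else:
      # Any other token is kept unchanged, including its original case.
      result.append(pair[0])
  return result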
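
The function above relies on the first letter of the Penn Treebank tag to pick the WordNet part of speech, and it covers nouns and verbs only. As an illustration, that mapping could be made explicit and extended to adjectives and adverbs. This is a sketch under that assumption: `penn_to_wordnet` is a hypothetical helper, not something the notebook defines, and using it would change the results reported here.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    if tag.startswith('N'):   # NN, NNS, NNP, NNPS
        return 'n'
    if tag.startswith('V'):   # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('J'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):  # RB, RBR, RBS (but not RP, the particle tag)
        return 'r'
    return None


for tag in ['NNS', 'VBG', 'JJ', 'RB', 'DT']:
    print(tag, penn_to_wordnet(tag))
```

Tokens whose tag maps to `None` would be left as they are, as in the `else` branch of `lemmatize`.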

In [21]:
for col in ['sent1', 'sent2']:
  trial_data[col+'_processed'] = trial_data[col].apply(nltk.word_tokenize)
  trial_data[col+'_processed'] = trial_data[col+'_processed'].apply(nltk.pos_tag)
  trial_data[col+'_processed'] = trial_data[col+'_processed'].apply(lemmatize)

trial_data

| | id | sent1 | sent2 | sent1_processed | sent2_processed |
|---|---|---|---|---|---|
| 0 | id1 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. | [The, bird, be, bath, in, the, sink, .] | [birdie, be, wash, itself, in, the, water, bas... |
| 1 | id2 | In May 2010, the troops attempted to invade Ka... | The US army invaded Kabul on May 7th last year... | [In, may, 2010, ,, the, troop, attempt, to, in... | [The, u, army, invade, kabul, on, may, 7th, la... |
| 2 | id3 | John said he is considered a witness but not a... | He is not a suspect anymore. John said. | [john, say, he, be, consider, a, witness, but,... | [He, be, not, a, suspect, anymore, ., john, sa... |
| 3 | id4 | They flew out of the nest in groups. | They flew into the nest together. | [They, fly, out, of, the, nest, in, group, .] | [They, fly, into, the, nest, together, .] |
| 4 | id5 | The woman is playing the violin. | The young lady enjoys listening to the guitar. | [The, woman, be, play, the, violin, .] | [The, young, lady, enjoy, listen, to, the, gui... |
| 5 | id6 | John went horse back riding at dawn with a who... | Sunrise at dawn is a magnificent view to take ... | [john, go, horse, back, rid, at, dawn, with, a... | [sunrise, at, dawn, be, a, magnificent, view, ... |


## Calculating the Jaccard distance

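The Jaccard distance measures how dissimilar two sets are: it is one minus the size of their intersection divided by the size of their union. Here each sentence is treated as the set of its processed tokens, and the code converts the distance into a similarity (1 - distance). Because the measure ignores word order, repeated words, and synonyms, it is a weak model for deciding whether two sentences are equivalent.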
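
To make the definition concrete, here is a minimal pure-Python sketch of the same computation applied to the processed tokens of pair id4 from the table above. The function name `jaccard_similarity` is hypothetical, and returning 1.0 for two empty sets is a convention chosen for this sketch, not something taken from NLTK.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Return |A ∩ B| / |A ∪ B| for the token sets of two sentences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Processed tokens of pair id4.
sent1 = ['They', 'fly', 'out', 'of', 'the', 'nest', 'in', 'group', '.']
sent2 = ['They', 'fly', 'into', 'the', 'nest', 'together', '.']

similarity = jaccard_similarity(sent1, sent2)
distance = 1 - similarity
print(similarity, distance)
```

The two sets share 5 tokens out of 11 distinct ones, so the similarity is 5/11, which matches the fourth value in the result array computed with `jaccard_distance`.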

In [22]:
import numpy as np
import nltk

from nltk.metrics import jaccard_distance

result = []

for index, row in trial_data.iterrows():
  result.append(jaccard_distance(set(row['sent1_processed']),
                             set(row['sent2_processed'])))

result = 1 - np.array(result)
result

array([0.30769231, 0.33333333, 0.53846154, 0.45454545, 0.23076923,
       0.13793103])

## Calculating the Pearson correlation

First we load the gold standard scores (the reference) and compare them with our results. The gold standard file appears to list the scores in reverse order with respect to the input pairs, so the list is reversed before the comparison. The similarities are also multiplied by 5 to put them on the same scale as the gold standard, although this does not change the result because the Pearson correlation is invariant to positive rescaling.

For this example the correlation is 0.455, which is a low score (the maximum is 1.0) but better than the one obtained with plain words in the previous session. The positive value means that both lists tend to increase together, but they are only moderately correlated.
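
The scale-invariance claim can be checked with a small pure-Python sketch of the Pearson coefficient. The helper `pearson` is hypothetical and stands in for `scipy.stats.pearsonr` for illustration only; the input values are the rounded similarities printed earlier and the reversed gold standard.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


refs = [5, 4, 3, 2, 1, 0]  # reversed gold standard
sims = [0.30769231, 0.33333333, 0.53846154,
        0.45454545, 0.23076923, 0.13793103]  # Jaccard similarities

r_raw = pearson(refs, sims)
r_scaled = pearson(refs, [5 * s for s in sims])
print(r_raw, r_scaled)
```

Both calls should give the same coefficient up to floating-point error, since multiplying one series by a positive constant scales the covariance and the standard deviation by the same factor.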



In [23]:
import pandas as pd

from scipy.stats import pearsonr

gs = pd.read_csv('trial/STS.gs.txt', sep='\t', header=None)
refs = list(reversed(gs[1].values))
print(f'Gold standard: {refs}')
tsts = result * 5
print(f'Jaccard distance: {tsts}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')


Gold standard: [5, 4, 3, 2, 1, 0]
Jaccard distance: [1.53846154 1.66666667 2.69230769 2.27272727 1.15384615 0.68965517]
Pearson correlation: 0.45509668497522504


## Optional: Evaluate against training data set


### Download and read training data set

In [24]:
url_train = 'https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz'
!wget $url_train

with tarfile.open('train.tgz', "r:gz") as tar:
  tar.extractall()
!ls

--2019-09-26 10:55:58--  https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz
Resolving www.cs.york.ac.uk (www.cs.york.ac.uk)... 144.32.128.40
Connecting to www.cs.york.ac.uk (www.cs.york.ac.uk)|144.32.128.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125822 (123K) [application/x-gzip]
Saving to: ‘train.tgz.1’


2019-09-26 10:56:00 (210 KB/s) - ‘train.tgz.1’ saved [125822/125822]

sample_data  train  train.tgz  train.tgz.1  trial  trial.tgz  trial.tgz.1


In [38]:
import pandas as pd

train_data = pd.read_csv('train/STS.input.MSRvid.txt', sep='\t', names=['sent1', 'sent2'])
print(f'# rows: {len(train_data)}')
train_data = train_data.astype(str)                
train_data.head(5)

# rows: 750


| | sent1 | sent2 |
|---|---|---|
| 0 | A man is riding a bicycle. | A man is riding a bike. |
| 1 | A woman and man are dancing in the rain. | A man and woman are dancing in rain. |
| 2 | Someone is drawing. | Someone is dancing. |
| 3 | A man and a woman are kissing each other. | A man and a woman are talking to each other. |
| 4 | A woman is slicing an onion. | A woman is cutting an onion. |


## Pre-process the input data

In [39]:
for col in ['sent1', 'sent2']:
  train_data[col+'_processed'] = train_data[col].apply(nltk.word_tokenize)
  train_data[col+'_processed'] = train_data[col+'_processed'].apply(nltk.pos_tag)
  train_data[col+'_processed'] = train_data[col+'_processed'].apply(lemmatize)

train_data.head(5)

| | sent1 | sent2 | sent1_processed | sent2_processed |
|---|---|---|---|---|
| 0 | A man is riding a bicycle. | A man is riding a bike. | [A, man, be, rid, a, bicycle, .] | [A, man, be, rid, a, bike, .] |
| 1 | A woman and man are dancing in the rain. | A man and woman are dancing in rain. | [A, woman, and, man, be, dance, in, the, rain, .] | [A, man, and, woman, be, dance, in, rain, .] |
| 2 | Someone is drawing. | Someone is dancing. | [someone, be, draw, .] | [someone, be, dance, .] |
| 3 | A man and a woman are kissing each other. | A man and a woman are talking to each other. | [A, man, and, a, woman, be, kiss, each, other, .] | [A, man, and, a, woman, be, talk, to, each, ot... |
| 4 | A woman is slicing an onion. | A woman is cutting an onion. | [A, woman, be, slice, an, onion, .] | [A, woman, be, cut, an, onion, .] |


## Jaccard distance

In [40]:
result = []

for index, row in train_data.iterrows():
  result.append(jaccard_distance(set(row['sent1_processed']),
                             set(row['sent2_processed'])))

result = 1 - np.array(result)
result[:10]

array([0.75      , 0.9       , 0.6       , 0.75      , 0.75      ,
       0.71428571, 0.71428571, 0.55555556, 0.875     , 0.7       ])

## Pearson correlation

In [42]:
gs = pd.read_csv('train/STS.gs.MSRvid.txt', names=['gs'])
print(f'# rows: {len(gs)}')
gs['gs'] = gs['gs'].astype(float)
refs = list(gs['gs'].values)
print(f'Gold standard: {refs[:10]}')
tsts = result * 5
print(f'Jaccard distance: {tsts[:10]}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')

# rows: 750
Gold standard: [5.0, 5.0, 0.3, 0.6, 4.2, 3.6, 5.0, 2.75, 5.0, 3.75]
Jaccard distance: [3.75       4.5        3.         3.75       3.75       3.57142857
 3.57142857 2.77777778 4.375      3.5       ]
Pearson correlation: 0.4941575471981352


In [35]:
# !ls train

00-readme.txt	   STS.gs.MSRvid.txt	   STS.input.MSRvid.txt
correlation.pl	   STS.gs.SMTeuroparl.txt  STS.input.SMTeuroparl.txt
STS.gs.MSRpar.txt  STS.input.MSRpar.txt    STS.output.MSRpar.txt


In [44]:
# gss = pd.read_csv('train/STS.gs.MSRpar.txt', names=['gs'])
# print(f'# rows: {len(gss)}')

# rows: 750


In [45]:
# len(pd.read_csv('train/STS.input.MSRpar.txt', sep='\t', names=['sent1', 'sent2']))

728