# Word Sense Disambiguation


# Lab Session 6

### Mandatory exercise

https://gebakx.github.io/ihlt/s6/index.html#7

**Statement:**

- Read all pairs of sentences of the trial set within the evaluation framework of the project.

- Apply Lesk’s algorithm to the words in the sentences.

- Compute their similarities by considering senses and Jaccard coefficient.

- Compare the results with those in session 2 (document) and 3 (morphology) in which words and lemmas were considered.

- Compare the results with the gold standard by giving the Pearson correlation between them.

# Solution

## Requirements

Run this cell only if NLTK and the trial data are not already available on the current machine.

In [0]:
import tarfile
import nltk
nltk.download()  # interactive downloader: type "d", then "book", then "q"
nltk.download('wordnet')


!wget https://gebakx.github.io/ihlt/s6/resources/trial.tgz

with tarfile.open('trial.tgz', "r:gz") as tar:
  tar.extractall()

!ls


## Input data

In [2]:
import pandas as pd

trial_data = pd.read_csv('trial/STS.input.txt', sep='\t', 
                         names=['id', 'sent1', 'sent2'])
trial_data = trial_data.astype(str)                
trial_data

|   | id | sent1 | sent2 |
|---|----|-------|-------|
| 0 | id1 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. |
| 1 | id2 | In May 2010, the troops attempted to invade Ka... | The US army invaded Kabul on May 7th last year... |
| 2 | id3 | John said he is considered a witness but not a... | He is not a suspect anymore. John said. |
| 3 | id4 | They flew out of the nest in groups. | They flew into the nest together. |
| 4 | id5 | The woman is playing the violin. | The young lady enjoys listening to the guitar. |
| 5 | id6 | John went horse back riding at dawn with a who... | Sunrise at dawn is a magnificent view to take ... |


## Preprocessing: Tagging and lemmatization

Steps:

- Tokenize the sentence.
- Tag each token with its part of speech (POS).
- Disambiguate word senses with the Lesk algorithm.



## Mapping POS tags to WordNet tags

`nltk.pos_tag` returns Penn Treebank tags, whereas WordNet uses its own, much smaller set of POS tags.

This mapping is needed whenever a WordNet-based tool takes a POS argument, for example the lemmatizer or Lesk.


In [0]:
from nltk.corpus import wordnet

def map_pos_wordnet(pos):
  d = {"N": wordnet.NOUN, # 'n'
       "V": wordnet.VERB, # 'v'
       "J": wordnet.ADJ, #  'a'
       "R": wordnet.ADV} #  'r'


  return d[pos[0]]
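
A minimal sketch of how such a mapping behaves on a few Penn Treebank tags. It assumes the WordNet constants are the one-letter strings shown in the comments above; the helper name `penn_to_wordnet` and the `None` fallback are illustrative choices and are not part of the notebook:

```python
# Hypothetical variant of map_pos_wordnet that uses plain strings
# ('n', 'v', 'a', 'r') instead of the nltk.corpus.wordnet constants.
PENN_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def penn_to_wordnet(penn_tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if there is none."""
    return PENN_PREFIX_TO_WORDNET.get(penn_tag[:1])


for tag in ["NNS", "VBD", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Returning `None` instead of raising a `KeyError` lets the caller decide what to do with tags such as `DT` that have no WordNet counterpart.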

## Apply the Lesk algorithm to each sentence

Let's find the sense of each word given its context by using the Lesk algorithm.

Lesk does not always return a synset. WordNet may have no entry for the word with the requested POS, or no entry for the word at all; in those cases `lesk` returns `None`.
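
To make the selection rule concrete, here is a simplified Lesk sketch in plain Python. It does not use WordNet: the sense labels and glosses for "bank" are made up for illustration, and `simplified_lesk` is a hypothetical helper, not the NLTK function. It picks the sense whose gloss shares the most words with the context:

```python
def simplified_lesk(context_tokens, senses):
    """Pick the sense whose gloss overlaps most with the context.

    senses maps a sense label to its gloss (toy data, not WordNet).
    Returns None when there are no candidate senses.
    """
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:  # ties keep the first sense seen
            best_sense, best_overlap = sense, overlap
    return best_sense


# Toy sense inventory (assumption: invented glosses for illustration only)
bank_senses = {
    "bank.river": "sloping land beside a body of water",
    "bank.finance": "an institution that accepts deposits and lends money",
}

print(simplified_lesk("he sat on the land beside the water".split(), bank_senses))
print(simplified_lesk("she deposits money at the institution".split(), bank_senses))
print(simplified_lesk(["bank"], {}))
```

The first call selects `bank.river` and the second `bank.finance`, because those glosses share three words with the respective context. With no candidate senses the function returns `None`, which mirrors the situation handled in the next cell.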


In [0]:
from nltk.wsd import lesk


def wsd_lesk(pairs, if_lesk_null='no_pos'):
  # pairs: list of (token, Penn Treebank tag) tuples for one sentence
  result = []
  # Lesk context: the distinct tokens of the sentence
  context = dict(pairs).keys()
  for (token, pos) in pairs:
    # only nouns, verbs, adjectives and adverbs have a WordNet POS
    if pos[0] in {'N', 'V', 'J', 'R'}:
      # first try Lesk restricted to the tagged POS
      synset = lesk(context, token.lower(), pos=map_pos_wordnet(pos))
      # if nothing was found, optionally retry without the POS restriction
      if synset is None and if_lesk_null == 'no_pos':
        synset = lesk(context, token.lower())

      if synset is not None:
        # keep the lemma part of the synset name, e.g. 'bird.n.01' -> 'bird'
        result.append(synset.name().split('.')[0])
      else:
        # no synset at all: keep the original token
        result.append(token)
    else:
      result.append(token)

  return result

In [0]:
# nltk.help.upenn_tagset()

In [7]:
for col in ['sent1', 'sent2']:
  trial_data[col+'_processed'] = trial_data[col].apply(nltk.word_tokenize)
  trial_data[col+'_processed'] = trial_data[col+'_processed'].apply(nltk.pos_tag)
  trial_data[col+'_processed'] = trial_data[col+'_processed'].apply(wsd_lesk, if_lesk_null='nothing')

trial_data

|   | id | sent1 | sent2 | sent1_processed | sent2_processed |
|---|----|-------|-------|-----------------|-----------------|
| 0 | id1 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. | [The, bird, be, bathe, in, the, sinkhole, .] | [shuttlecock, be, wash, itself, in, the, body_... |
| 1 | id2 | In May 2010, the troops attempted to invade Ka... | The US army invaded Kabul on May 7th last year... | [In, whitethorn, 2010, ,, the, troop, undertak... | [The, uranium, united_states_army, invade, kab... |
| 2 | id3 | John said he is considered a witness but not a... | He is not a suspect anymore. John said. | [whoremaster, suppose, he, embody, view, a, wi... | [He, embody, not, a, defendant, anymore, ., wh... |
| 3 | id4 | They flew out of the nest in groups. | They flew into the nest together. | [They, fly, out, of, the, nest, in, group, .] | [They, fly, into, the, nest, together, .] |
| 4 | id5 | The woman is playing the violin. | The young lady enjoys listening to the guitar. | [The, woman, be, play, the, violin, .] | [The, young, lady, love, heed, to, the, guitar... |
| 5 | id6 | John went horse back riding at dawn with a who... | Sunrise at dawn is a magnificent view to take ... | [toilet, plump, knight, back, ride, at, dawn, ... | [sunrise, at, dawn, be, a, magnificent, view, ... |


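## Calculating the Jaccard distance

The Jaccard distance measures how far apart two sets are. The similarity used here is `1 - distance`, that is, the size of the intersection of the two token sets divided by the size of their union. It is not a very robust way to decide whether two sentences are equivalent, because it only counts exact matches between tokens.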
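
As a worked example, the following sketch computes the Jaccard similarity in plain Python for the processed tokens of pair `id4` shown in the table above. The function name `jaccard_similarity` and the choice to return 0.0 for two empty sets are assumptions made for this illustration:

```python
def jaccard_similarity(tokens_a, tokens_b):
    """|A intersect B| / |A union B| over the distinct tokens of each sentence."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # assumption: two empty sentences are given similarity 0.0
        return 0.0
    return len(set_a & set_b) / len(union)


sent1 = ["They", "fly", "out", "of", "the", "nest", "in", "group", "."]
sent2 = ["They", "fly", "into", "the", "nest", "together", "."]

# 5 shared tokens out of 11 distinct tokens overall
print(jaccard_similarity(sent1, sent2))
```

The two sentences share 5 tokens out of 11 distinct ones, so the similarity is 5/11, which is about 0.4545.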


In [8]:
import numpy as np
import nltk

from nltk.metrics import jaccard_distance

result = []

for index, row in trial_data.iterrows():
  result.append(jaccard_distance(set(row['sent1_processed']),
                             set(row['sent2_processed'])))

result = 1 - np.array(result)
result

array([0.30769231, 0.33333333, 0.53846154, 0.45454545, 0.23076923,
       0.13793103])

## Calculating the Pearson correlation




In [9]:
import pandas as pd

from scipy.stats import pearsonr

gs = pd.read_csv('trial/STS.gs.txt', sep='\t', header=None)
refs = list(reversed(gs[1].values))
print(f'Gold standard: {refs}')
tsts = result * 5
print(f'Jaccard similarity (scaled to 0-5): {tsts}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')


Gold standard: [5, 4, 3, 2, 1, 0]
Jaccard similarity (scaled to 0-5): [1.53846154 1.66666667 2.69230769 2.27272727 1.15384615 0.68965517]
Pearson correlation: 0.45509668497522504
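
The Pearson correlation is the covariance of the two score lists divided by the product of their standard deviations, so multiplying the similarities by 5 does not change it. The sketch below checks this in plain Python on the six trial values printed above; the helper `pearson` is a hypothetical stand-in for `scipy.stats.pearsonr` and returns only the coefficient:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


gold = [5, 4, 3, 2, 1, 0]
sims = [0.30769231, 0.33333333, 0.53846154, 0.45454545, 0.23076923, 0.13793103]

r_raw = pearson(gold, sims)
r_scaled = pearson(gold, [5 * s for s in sims])
print(r_raw, r_scaled)
```

Both values agree with each other and with the correlation of about 0.455 reported above, so the scaling to the 0-5 range only matters for readability.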


## Evaluate against training data set


### Download and read training data set

In [10]:
url_train = 'https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz'
!wget $url_train

with tarfile.open('train.tgz', "r:gz") as tar:
  tar.extractall()
!ls

--2019-10-19 17:24:52--  https://www.cs.york.ac.uk/semeval-2012/task6/data/uploads/datasets/train.tgz
Resolving www.cs.york.ac.uk (www.cs.york.ac.uk)... 144.32.128.40
Connecting to www.cs.york.ac.uk (www.cs.york.ac.uk)|144.32.128.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125822 (123K) [application/x-gzip]
Saving to: ‘train.tgz’


2019-10-19 17:24:53 (1.13 MB/s) - ‘train.tgz’ saved [125822/125822]

sample_data  train  train.tgz  trial  trial.tgz


In [11]:
import pandas as pd

train_data = pd.read_csv('train/STS.input.MSRvid.txt', sep='\t', names=['sent1', 'sent2'])
print(f'# rows: {len(train_data)}')
train_data = train_data.astype(str)                
train_data.head(5)

# rows: 750


|   | sent1 | sent2 |
|---|-------|-------|
| 0 | A man is riding a bicycle. | A man is riding a bike. |
| 1 | A woman and man are dancing in the rain. | A man and woman are dancing in rain. |
| 2 | Someone is drawing. | Someone is dancing. |
| 3 | A man and a woman are kissing each other. | A man and a woman are talking to each other. |
| 4 | A woman is slicing an onion. | A woman is cutting an onion. |


## Pre-process the input data

In [12]:
for col in ['sent1', 'sent2']:
  train_data[col+'_processed'] = train_data[col].apply(nltk.word_tokenize)
  train_data[col+'_processed'] = train_data[col+'_processed'].apply(nltk.pos_tag)
  train_data[col+'_processed'] = train_data[col+'_processed'].apply(wsd_lesk, if_lesk_null='no_pos')

train_data.head(5)

|   | sent1 | sent2 | sent1_processed | sent2_processed |
|---|-------|-------|-----------------|-----------------|
| 0 | A man is riding a bicycle. | A man is riding a bike. | [A, man, embody, ride, a, bicycle, .] | [A, man, embody, ride, a, bicycle, .] |
| 1 | A woman and man are dancing in the rain. | A man and woman are dancing in rain. | [A, woman, and, man, be, dance, in, the, rain, .] | [A, man, and, woman, be, dance, in, rain, .] |
| 2 | Someone is drawing. | Someone is dancing. | [person, exist, draw, .] | [person, exist, dance, .] |
| 3 | A man and a woman are kissing each other. | A man and a woman are talking to each other. | [A, man, and, a, woman, embody, snog, each, ot... | [A, valet, and, a, woman, equal, lecture, to, ... |
| 4 | A woman is slicing an onion. | A woman is cutting an onion. | [A, woman, exist, slit, an, onion, .] | [A, woman, exist, cut, an, onion, .] |


## Jaccard distance

In [13]:
result = []

for index, row in train_data.iterrows():
  result.append(jaccard_distance(set(row['sent1_processed']),
                             set(row['sent2_processed'])))

result = 1 - np.array(result)
result[:10]

array([1.        , 0.9       , 0.6       , 0.5       , 0.75      ,
       0.71428571, 0.71428571, 0.4       , 0.875     , 0.7       ])

## Pearson correlation

In [14]:
gs = pd.read_csv('train/STS.gs.MSRvid.txt', names=['gs'])
print(f'# rows: {len(gs)}')
gs['gs'] = gs['gs'].astype(float)
refs = list(gs['gs'].values)
print(f'Gold standard: {refs[:10]}')
tsts = result * 5
print(f'Jaccard similarity (scaled to 0-5): {tsts[:10]}')
print(f'Pearson correlation: {pearsonr(refs, tsts)[0]}')

# rows: 750
Gold standard: [5.0, 5.0, 0.3, 0.6, 4.2, 3.6, 5.0, 2.75, 5.0, 3.75]
Jaccard similarity (scaled to 0-5): [5.         4.5        3.         2.5        3.75       3.57142857
 3.57142857 2.         4.375      3.5       ]
Pearson correlation: 0.3620718615944819


## Previous and current results on training sets

- Session 2 (Words), Pearson score: 0.167
- Session 3 (Morphology), Pearson score: 0.494
- Session 6 (Word Sense Disambiguation), Pearson score: 0.362

# Conclusions

- It is important to map POS tags to WordNet tags: WordNet accepts only a small subset of tags, and passing any other tag raises an error.

- Applying the Lesk (WSD) algorithm to a sentence has to handle three cases:

  - The word exists in WordNet with its POS tag. This is the ideal case: knowing the POS makes the sense found from the context more precise.
  - The word exists in WordNet, but not with its POS tag. The result is less precise, because the lexical category of the word is ignored.
  - The word does not exist in WordNet at all. A fallback has to be chosen; here the original token is simply kept.

- The results show that the performance (0.362) is worse than in the lemma-based experiment (Session 3, 0.494) but better than when just considering words (Session 2, 0.167).

- One reason for the lower score is that the Lesk algorithm is weaker than other approaches. It picks the sense whose WordNet definition shares the most words with the sentence, so a few incidental overlaps can select a meaning that has no link with the context and distort the word. In the trial output above, for example, "May" becomes `whitethorn`, "US" becomes `uranium`, and "John" becomes `whoremaster` or `toilet`.

- WSD is challenging: a good training corpus for a robust supervised model is hard to obtain, because labeling the right meaning of every word in all possible sentences is costly. That is why semi-supervised or unsupervised models might behave better.
