**TODO:**
* Convert numbers to words before tokenization

# Testing fastText for semantic similarity

Install [fasttext](https://fasttext.cc/) if needed

In [None]:
!pip install fasttext

In [2]:
import pandas as pd
import numpy as np
import nltk
import fasttext
import fasttext.util

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/emilnuutinen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## 1. Get fastText ready to use

We are using the normal English model.

More info about the fastText [models](https://fasttext.cc/docs/en/crawl-vectors.html) for different languages.

> We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.

In [3]:
fasttext.util.download_model('en', if_exists='ignore')  # English
ft = fasttext.load_model('cc.en.300.bin')
ft.get_dimension()



300

Example about the word "learning".

In [4]:
print(ft.get_word_vector('learning').shape)
print(ft.get_word_vector('learning')[:20])

ft.get_nearest_neighbors('learning') # may take some time

(300,)
[-3.1326819e-02  5.8432957e-03  3.5721278e-05  3.2791961e-02
 -9.6422508e-03 -5.0007503e-02  1.6288273e-02  3.5059921e-02
 -6.6784739e-02 -1.8172603e-03 -1.8895891e-02 -5.0050311e-02
  5.2792020e-02  3.0742858e-02  1.2085622e-02 -1.8491376e-03
  5.5508241e-02 -9.5799835e-03  3.2117605e-02  1.1655847e-02]


[(0.7456761598587036, 'learing'),
 (0.6895476579666138, 'Learning'),
 (0.6878188848495483, 'learning.This'),
 (0.6796225309371948, 'learning.The'),
 (0.6753033399581909, 'learning.It'),
 (0.6706692576408386, 'learning.So'),
 (0.6673312187194824, 'learning.What'),
 (0.6648250222206116, 'learning.But'),
 (0.664309024810791, 'learning-'),
 (0.6633586883544922, 'learning.As')]

## 2. Load the sts-benchmark data and remove lines that contain errors. 

In [5]:
# Remove "warn_bad_lines=False" to print the lines that have errors.
train_df = pd.read_csv('stsbenchmark/sts-train.csv', sep='\t', engine='python', header=None, encoding='utf-8', error_bad_lines=False, warn_bad_lines=False)

## 3. A quick look at the dataset we are using

In [6]:
print(train_df[0].value_counts())
print('\n')
print('Train dataset shape: ' + str(train_df.shape))
print('\n')
train_df.head()

main-news        2976
main-captions    2000
main-forum        438
Name: 0, dtype: int64


Train dataset shape: (5414, 7)




Unnamed: 0,0,1,2,3,4,5,6
0,main-captions,MSRvid,2012test,1,5.0,A plane is taking off.,An air plane is taking off.
1,main-captions,MSRvid,2012test,4,3.8,A man is playing a large flute.,A man is playing a flute.
2,main-captions,MSRvid,2012test,5,3.8,A man is spreading shreded cheese on a pizza.,A man is spreading shredded cheese on an uncoo...
3,main-captions,MSRvid,2012test,6,2.6,Three men are playing chess.,Two men are playing chess.
4,main-captions,MSRvid,2012test,9,4.25,A man is playing the cello.,A man seated is playing the cello.


## 4. Comparing two sentence paires with fastText as an example

In [7]:
print(train_df.loc[0])
print('\n')
print(train_df.loc[45])

0                  main-captions
1                         MSRvid
2                       2012test
3                              1
4                              5
5         A plane is taking off.
6    An air plane is taking off.
Name: 0, dtype: object


0                     main-captions
1                            MSRvid
2                          2012test
3                                68
4                                 1
5       A man is playing the piano.
6    A woman is playing the violin.
Name: 45, dtype: object


In [8]:
s1 = train_df.loc[0][5]
s2 = train_df.loc[0][6]
s3 = train_df.loc[45][5]
s4 = train_df.loc[45][6]

print(f's1 = {s1}')
print(f's2 = {s2}')
print('\n')
print(f's3 = {s3}')
print(f's4 = {s4}')

s1 = A plane is taking off.
s2 = An air plane is taking off.


s3 = A man is playing the piano.
s4 = A woman is playing the violin.


In [10]:
from scipy.spatial import distance

s1_vec = ft.get_sentence_vector(s1)
s2_vec = ft.get_sentence_vector(s2)
s3_vec = ft.get_sentence_vector(s3)
s4_vec = ft.get_sentence_vector(s4)

print(f's1 vs s2 = {distance.cosine(s1_vec,s2_vec)}')
print(f'Human score = {train_df.loc[0][4]}')
print(f'fastText score = {round((1-distance.cosine(s1_vec,s2_vec))*5,1)}')

print(f's3 vs s4 = {distance.cosine(s3_vec,s4_vec)}')
print(f'Human score = {train_df.loc[45][4]}')
print(f'fastText score = {round((1-distance.cosine(s3_vec,s4_vec))*5,1)}')

print(f's1 vs s3 = {distance.cosine(s1_vec,s3_vec)}')
print(f's1 vs s4 = {distance.cosine(s1_vec,s4_vec)}')

s1 vs s2 = 0.10187679529190063
Human score = 5.0
fastText score = 4.5
s3 vs s4 = 0.03787785768508911
Human score = 1.0
fastText score = 4.8
s1 vs s3 = 0.28191500902175903
s1 vs s4 = 0.28155577182769775


## 5. Getting the human scores and the fasttext scores and comparing them

**https://ixa2.si.ehu.es/stswiki/index.php/STS_benchmark_reproducibility**

> The averaged word embedding baselines compute a sentence embedding by averaging word embeddings and then using cosine to compute pairwise sentence similarity scores. 

> FastText: Since, to our knowledge, the tokenizer and preprocessing used for the pre-trained FastText embeddings is not publicly described. We use the following heuristics to preprocess and tokenize sentences for Fast-Text: numbers are converted into words, text is lowercased, and finally prefixed, suffixed and infixed punctuation is recursively removed from each token that does not match an entry in the model’s lexicon;

### 5.1 Load the data and preprocess it

In [10]:
import nltk

data = []
with open('stsbenchmark/sts-dev.csv') as f:
    for line in f.read().splitlines():
        splits = line.split('\t')
        data.append({
            'score': float(splits[4]),
            's1': splits[5],
            's2': splits[6]
        })

# removes punctuation from sentences
tokenizer = nltk.RegexpTokenizer(r"\w+")

# lowercase, tokenize and remove punctuation from sentences
for x in data:
    x['s1'].lower()
    x['s2'].lower()
    x['s1'] = tokenizer.tokenize(x['s1'])
    x['s2'] = tokenizer.tokenize(x['s2'])
    x['s1'] = ' '.join(x['s1'])
    x['s2'] = ' '.join(x['s2'])

In [11]:
data[3]

{'score': 2.4,
 's1': 'A woman is playing the guitar',
 's2': 'A man is playing guitar'}

### 5.2 Get the scores and normalize them

In [12]:
score_human = []

for x in data:
    score = x['score']/5
    score_human.append(score)

In [13]:
score_machine = []

for x in data:
    s1_vec = ft.get_sentence_vector(x['s1'])
    s2_vec = ft.get_sentence_vector(x['s2'])
    score = (1-distance.cosine(s1_vec,s2_vec))
    score_machine.append(score)

### 5.3 Compare human and fastText scores

In [14]:
from scipy.spatial import pearsonr

result, _ = pearsonr(score_machine, score_human)
print('Pearsonr:', end=' ')
print("%.1f" % (result*100))

Pearsonr: 56.1


## 6. Numbers and written numbers are expressed very differently in fasttext
Converting numbers to written form could produce better results.

In [15]:
vec_1 = ft.get_word_vector('40')
vec_2 = ft.get_word_vector('forty')
vec_3 = ft.get_word_vector('41')
vec_4 = ft.get_word_vector('forty-one')
print(f'40 and forty        =   {distance.cosine(vec_1,vec_2)}')
print(f'41 and forty-one    =   {distance.cosine(vec_3,vec_4)}')
print(f'40 and 41           =   {distance.cosine(vec_1,vec_3)}')
print(f'forty and forty-one =   {distance.cosine(vec_2,vec_4)}')

40 and forty        =   0.4287061095237732
41 and forty-one    =   0.534650593996048
40 and 41           =   0.11887586116790771
forty and forty-one =   0.18680739402770996


In [16]:
print(data[653])

{'score': 1.0, 's1': 'Why is the speed of light 299 792 458 m s and not for instance 3 1 or 4 3 x 10 44 m s', 's2': 'Speed of light being finite is one of the fundamentals of our Universe'}


In [None]:
!pip install num2word

In [17]:
from num2word import word

def numbers_to_words(x):
    y = []
    for i in x:
        if i.isdigit():
            y.append(word(i))
        else:
            y.append(i)
    return y

for x in data:
    x['s1'] = nltk.word_tokenize(x['s1'])
    x['s1'] = numbers_to_words(x['s1'])
    x['s1'] = ' '.join(x['s1'])
    
    x['s2'] = nltk.word_tokenize(x['s2'])
    x['s2'] = numbers_to_words(x['s2'])
    x['s2'] = ' '.join(x['s2'])

In [18]:
print(data[653])

{'score': 1.0, 's1': 'Why is the speed of light Two Hundred Ninety Nine Seven Hundred Ninety Two Four Hundred Fifty Eight m s and not for instance Three One or Four Three x Ten Fourty Four m s', 's2': 'Speed of light being finite is one of the fundamentals of our Universe'}


In [19]:
score_machine = []

for x in data:
    s1_vec = ft.get_sentence_vector(x['s1'])
    s2_vec = ft.get_sentence_vector(x['s2'])
    score = (1-distance.cosine(s1_vec,s2_vec))
    score_machine.append(score)

In [20]:
result, _ = pearsonr(score_machine, score_human)
print("%.1f" % (result*100))

55.9


Converting numbers to words after tokenization creates wrong values as you can see from the example above