Hi there! In this notebook we'll review different methods of searching for similar texts. At the end of the notebook we'll consider the problem of fast similarity score calculation.

<img src='https://vene.ro/images/wmd-obama.png' height='600' width='600'>

Let's review our data!

In [None]:
import csv, pandas, tqdm, random
import numpy as np
from numpy.linalg import norm

In [None]:
data = pandas.read_csv('./training.1600000.processed.noemoticon.csv',names=['id', 'date', 'type', 'author', 'tweet'])

In [None]:
data.head()

These are tweets: short pieces of text. Let's tokenize them. In particular, we have to remove hashtags and user names.

In [None]:
import tokenize_tweets

In [None]:
texts = data['tweet'][:100000].tolist() 

In [None]:
for i in range(len(texts)):
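    # Lowercase, tokenize, and drop non-ASCII characters (assumes Python 3 and that tokenize() returns str).
    texts[i] = tokenize_tweets.tokenize(texts[i].lower()).encode("ascii", errors="ignore").decode("ascii")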
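
The `tokenize_tweets` module is not shown in this notebook, so its exact behavior is unknown. As a rough illustration of the cleanup described above, here is a minimal sketch that removes @mentions and #hashtags with regular expressions. The helper name `clean_tweet` and both patterns are assumptions made for this example, not the module's actual implementation.

```python
import re

# Hypothetical patterns; the real tokenize_tweets module may behave differently.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean_tweet(text):
    """Lowercase a tweet and drop @mentions and #hashtags."""
    text = text.lower()
    for pattern in (MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())

print(clean_tweet("@bob Loving the #weather today!"))
```

Replacing matches with a space rather than an empty string keeps neighboring words from being glued together.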

# Pretrain word2vec

Let's pretrain a word2vec model for computing similarity scores.

In [None]:
import gensim
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

In [None]:
# Punctuation tokens to drop before training.
stop = set(";!,.?")

In [None]:
text = " ".join(texts)

In [None]:
sentences = []
for snt in tqdm.tqdm(sent_tokenize(text)):
    snt = [w for w in word_tokenize(snt) if w not in stop]
    if len(snt) > 0:
        sentences.append(snt)

In [None]:
for snt in sentences[:10]:
    print(" ".join(snt))

In [None]:
from gensim.models.word2vec import Word2Vec

In [None]:
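# Note: `size` and `iter` are the gensim < 4.0 argument names (renamed to `vector_size` and `epochs` in gensim 4.0).
w2v = Word2Vec(sentences, size=100, max_vocab_size=200000, workers=4, iter=10)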
w2v.init_sims(replace=True)

In [None]:
w2v.similar_by_word('good')

In [None]:
#query_snt = "worry about passing my test"
query_snt = "i'm ill today"
print("Query: " + query_snt)

# Document vectors

The easiest way to compare two documents is to average the word2vec embeddings of their words and compare the resulting vectors!

In [None]:
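def text2vec(text):
    # Convert every in-vocabulary word to its word2vec embedding and average them.
    vecs = [w2v[w] for w in word_tokenize(text) if w in w2v]
    # Texts with no known words map to the zero vector.
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine_similarity(a, b):
    # Return 0 for zero vectors instead of dividing by zero.
    denom = norm(a) * norm(b)
    return np.dot(a, b) / denom if denom > 0 else 0.0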

In [None]:
database = [(txt, text2vec(txt)) for txt in texts]

In [None]:
%%time
query_vec = text2vec(query_snt)
results = sorted([(snt, -cosine_similarity(vec, query_vec)) for snt, vec in database], key=lambda x:x[1])

In [None]:
for snt, score in results[:10]:
    print("{}: {}".format(snt, str(score)))
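
The loop above scores one document at a time in Python. Since fast similarity scoring is one of the goals of this notebook, here is a minimal sketch of the same cosine score computed for all documents at once with NumPy. The function name `cosine_scores` and the toy arrays are assumptions made for illustration; they stand in for the stacked document vectors and the query vector.

```python
import numpy as np

def cosine_scores(matrix, query):
    """Cosine similarity between each row of `matrix` and the vector `query`."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    row_norms = np.linalg.norm(matrix, axis=1)
    denom = row_norms * np.linalg.norm(query)
    # Rows (or a query) with zero norm get a score of 0 instead of NaN.
    safe = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, matrix @ query / safe, 0.0)

# Toy stand-in for the stacked document vectors.
toy_matrix = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]])
toy_query = np.array([1.0, 0.0])

scores = cosine_scores(toy_matrix, toy_query)
top = np.argsort(-scores)[:2]  # indices of the two best matches
print(top, scores[top])
```

In this notebook the matrix could presumably be built once with `np.stack([vec for _, vec in database])` and reused for every query.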

# Let's try WMD

Here we'll try Word Mover's Distance (WMD) to find sentences that are semantically similar to the query.

In [None]:
database = [word_tokenize(txt) for txt in texts[:10000]]

In [None]:
%%time
# Tokenize the query the same way as the database so that the tokens match.
query_snt_words = word_tokenize(query_snt)
results = sorted([(snt, w2v.wmdistance(query_snt_words, snt)) for snt in database], key=lambda x:x[1])

In [None]:
for snt, score in results[:10]:
    print("{}: {}".format(" ".join(snt), str(score)))
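
To build some intuition for what WMD measures, here is a minimal sketch of a relaxed variant in which every word simply travels to its nearest word in the other text. This is not what `wmdistance` computes: it is only a lower bound on the true WMD, which solves a full transport problem. The function name `relaxed_wmd` and the toy embedding table are assumptions made for illustration.

```python
import numpy as np

def relaxed_wmd(tokens_a, tokens_b, embeddings):
    """Lower bound on WMD: each word moves entirely to its nearest word in the other text."""
    a = [w for w in tokens_a if w in embeddings]
    b = [w for w in tokens_b if w in embeddings]
    if not a or not b:
        # No known words on one side: treat the texts as infinitely far apart.
        return float("inf")
    vecs_a = np.array([embeddings[w] for w in a], dtype=float)
    vecs_b = np.array([embeddings[w] for w in b], dtype=float)
    # Pairwise Euclidean distances between every word in a and every word in b.
    dists = np.linalg.norm(vecs_a[:, None, :] - vecs_b[None, :, :], axis=2)
    # Averaging over tokens equals weighting unique words by their normalized counts.
    a_to_b = dists.min(axis=1).mean()
    b_to_a = dists.min(axis=0).mean()
    # Both directions are lower bounds, so the larger one is the tighter bound.
    return max(a_to_b, b_to_a)

# Toy 2-d embeddings, hypothetical values for illustration only.
toy_embeddings = {"cat": [0.0, 0.0], "dog": [0.0, 1.0], "car": [3.0, 4.0]}
print(relaxed_wmd(["cat", "dog"], ["cat"], toy_embeddings))
print(relaxed_wmd(["cat"], ["car"], toy_embeddings))
```

The sketch only needs a distance matrix and two `min` reductions, which is why it is much cheaper than the exact distance.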

WMD seems to give promising results, but it's too slow. Let's try to train a siamese network!

# Siamese network

<img src='http://slideplayer.com/12063757/69/images/9/Siamese+network+architecture.jpg' height='600' width='600'>

In [None]:
texts_preprocessed = []
for txt in texts[:10000]:
    texts_preprocessed.append([w for w in word_tokenize(txt) if w not in stop])

In [None]:
from keras.models import Model
from keras.layers import Input, Embedding, Dense, LSTM, GRU
from keras.layers.core import Lambda
from keras.layers.wrappers import Bidirectional
import keras.backend as K

In [None]:
# Named `inputs` to avoid shadowing the built-in input().
inputs = Input(shape=(None, w2v.vector_size))
out = GRU(256, return_sequences=True)(inputs)
out = LSTM(256)(out)
encoder = Model(inputs, out)

In [None]:
# Exercise: implement the siamese neural network as `siamese_model` (it is used in the training loop below).
# Assume the text embeddings v1 and v2 are compared as follows: similarity = 1 / (1 + |v1 - v2|)
input = ...

In [None]:
def encode_text(snt, seq_size):
    assert isinstance(snt, str)
    snt = snt.split()
    result = np.zeros((seq_size, w2v.vector_size))
    for i,w in enumerate(snt[:seq_size]):
        if w in w2v:
            result[i,:] = w2v[w]
    return result

In [None]:
def get_batches(texts, batch_count=64, batch_size=64, seq_size=20):
    batch_a = np.zeros((batch_size, seq_size, w2v.vector_size))
    batch_b = np.zeros((batch_size, seq_size, w2v.vector_size))
    labels = np.zeros((batch_size,))
    
    for bi in range(batch_count):
        for seq_index in range(batch_size):
            seq_a, seq_b = random.sample(texts,2)
            batch_a[seq_index, :] = encode_text(seq_a, seq_size)
            batch_b[seq_index, :] = encode_text(seq_b, seq_size)
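            # wmdistance expects token lists; a raw string would be iterated character by character.
            dist = w2v.wmdistance(seq_a.split(), seq_b.split())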
            labels[seq_index] = 1.0/(1+dist)
        yield batch_a, batch_b, labels

In [None]:
from sklearn.model_selection import train_test_split
sentences_train, sentences_test = train_test_split(texts, test_size=0.3)

In [None]:
for epoch in range(100):
    print("epoch: {}".format(epoch))
    losses = []
    for batch_a, batch_b, labels in get_batches(sentences_train):
        loss = siamese_model.train_on_batch([batch_a, batch_b], labels)
        losses.append(loss)
    print("train_loss: {}".format(np.mean(losses)))
    
    losses = []
    for batch_a, batch_b, labels in get_batches(sentences_test):
        loss = siamese_model.test_on_batch([batch_a, batch_b], labels)
        losses.append(loss)
    print("test_loss: {}".format(np.mean(losses)))

Let's check our model!

In [None]:
batch = np.array([encode_text(snt, 20) for snt in sentences_test])

Let's prepare our index

In [None]:
index_database = encoder.predict(batch)

In [None]:
query_vec = encoder.predict(np.expand_dims(encode_text("let's go to the cinema !", 20),0))[0]

In [None]:
index_database.shape

In [None]:
# 1. calculate similarity scores vector
# 2. select top5 sentences with maximum scores
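
The two steps above could be sketched as follows. This is only one possible approach, shown on toy arrays rather than on the real `index_database`. It assumes that `|v1 - v2|` means the Euclidean norm, and the function name `top_k_similar` is hypothetical.

```python
import numpy as np

def top_k_similar(index_vectors, query_vector, k=5):
    """Return (indices, scores) of the k rows most similar to the query."""
    # Assumption: |v1 - v2| is read as the Euclidean norm.
    distances = np.linalg.norm(index_vectors - query_vector, axis=1)
    scores = 1.0 / (1.0 + distances)
    k = min(k, len(scores))
    # argpartition picks the k best without sorting everything; only those k are then sorted.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]

# Toy stand-ins for the encoded index and the encoded query.
toy_index = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
toy_query = np.array([0.0, 0.0])

idx, sc = top_k_similar(toy_index, toy_query, k=3)
print(idx, sc)
```

With the real data, the returned indices would be used to look up the matching entries of `sentences_test`.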

# Homework

1. **3 points** Tune the siamese network. Feel free to do your own research. New tricks that were not covered in this course are highly appreciated!
2. **7 points** Implement, train, and test two extra models with contrastive and triplet losses. Compare their outputs. What are your thoughts about these models?
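
As a starting point for the second task, here is a minimal NumPy sketch of the two loss formulas in their common textbook form. It works on precomputed distances and is not a Keras implementation. The function names and the default `margin=1.0` are assumptions. Note also that both losses need binary similar/dissimilar pairs (or anchor/positive/negative triplets), which this notebook does not construct, since its labels are continuous.

```python
import numpy as np

def contrastive_loss(dist, is_similar, margin=1.0):
    """Mean contrastive loss for pair distances `dist` and 0/1 labels `is_similar`."""
    dist = np.asarray(dist, dtype=float)
    is_similar = np.asarray(is_similar, dtype=float)
    # Similar pairs are pulled together: penalize any distance.
    pull = is_similar * dist ** 2
    # Dissimilar pairs are pushed apart: penalize only distances below the margin.
    push = (1.0 - is_similar) * np.maximum(0.0, margin - dist) ** 2
    return float(np.mean(pull + push))

def triplet_loss(dist_pos, dist_neg, margin=1.0):
    """Mean triplet loss for anchor-positive and anchor-negative distances."""
    dist_pos = np.asarray(dist_pos, dtype=float)
    dist_neg = np.asarray(dist_neg, dtype=float)
    # The positive should be closer to the anchor than the negative by at least `margin`.
    return float(np.mean(np.maximum(0.0, dist_pos - dist_neg + margin)))

print(contrastive_loss([0.5, 0.5], [1, 0]))
print(triplet_loss([0.2, 1.0], [1.5, 0.5]))
```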