<center><h1>Semantic Search</h1></center>
<center> - </center>
<center> Exploring Embedding techniques and Similarity Measures </center>

The task is to develop a semantic search engine. We have a corpus of texts. Since there is a large amount of documents, we want to run small search queries so that is returns texts that are semantically close to the query.

The process can be illustrated the following way :

<img src="images/process.png">

# Imports

In [1]:
### General ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt; plt.rcdefaults()
import re

from sklearn.model_selection import train_test_split

### Text processing ###
from nltk import wordpunct_tokenize, WordNetLemmatizer, sent_tokenize, pos_tag
from nltk.corpus import stopwords as sw, wordnet as wn
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import string 
import spacy
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer

### Pre-trained model ###
import tensorflow as tf
import tensorflow_hub as hub

### Live Search Engine ###
from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [2]:
EN = spacy.load('en')

# The data

### Elon Musk's Tweets

To replicate the kind of messages we would observe in a message app, I have chosen to explore Elon Musk's tweets data set : https://data.world/adamhelsinger/elon-musk-tweets-until-4-6-17

In [185]:
tweets = pd.read_csv('tweets.csv')['text']
tweets.head(10)

0    b'And so the robots spared humanity ... https:...
1    b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa...
2        b'@waltmossberg @mims @defcon_5 Et tu, Walt?'
3                  b'Stormy weather in Shortville ...'
4    b"@DaveLeeBBC @verge Coal is dying due to nat ...
5    b"@Lexxxzis It's just a helicopter in helicopt...
6                            b"@verge It won't matter"
7                        b'@SuperCoolCube Pretty good'
8    b"Why did we waste so much time developing sil...
9    b'Technology breakthrough: turns out chemtrail...
Name: text, dtype: object

In [186]:
tweets.shape

(2819,)

### South Park Series Data

In [198]:
south = pd.read_csv('All-seasons.csv')['Line']

0           You guys, you guys! Chef is going away. \n
1                          Going away? For how long?\n
2                                           Forever.\n
3                                    I'm sorry boys.\n
4    Chef said he's been bored, so he joining a gro...
Name: Line, dtype: object

In [None]:
south.head(15)

# I . Pre-processing 

The first step is to tokenize the text :

In [197]:
def preprocess(document, max_features = 150, max_sentence_len = 300):
    """
    Returns a normalized, lemmatized list of tokens from a document by
    applying segmentation (breaking into sentences), then word/punctuation
    tokenization, and finally part of speech tagging. It uses the part of
    speech tags to look up the lemma in WordNet, and returns the lowercase
    version of all the words, removing stopwords and punctuation.
    """
    
    def lemmatize(token, tag):
        """
        Converts the tag to a WordNet POS tag, then uses that
        tag to perform an accurate WordNet lemmatization.
        """
        tag = {
        'N': wn.NOUN,
        'V': wn.VERB,
        'R': wn.ADV,
        'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return WordNetLemmatizer().lemmatize(token, tag)

    def vectorize(doc, max_features, max_sentence_len):
        """
        Converts a document into a sequence of indices of length max_sentence_len retaining only max_features unique words
        """
        tokenizer = Tokenizer(num_words=max_features)
        tokenizer.fit_on_texts(doc)
        doc = tokenizer.texts_to_sequences(doc)
        doc_pad = pad_sequences(doc, padding = 'pre', truncating = 'pre', maxlen = max_sentence_len)
        return np.squeeze(doc_pad), tokenizer.word_index

    lemmatized_tokens = []

    # Clean the text using a few regular expressions
    document = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", document)
    document = re.sub(r"what's", "what is ", document)
    document = re.sub(r"\'", " ", document)
    document = re.sub(r"@", " ", document)
    document = re.sub(r"\'ve", " have ", document)
    document = re.sub(r"can't", "cannot ", document)
    document = re.sub(r"n't", " not ", document)
    document = re.sub(r"i'm", "i am ", document)
    document = re.sub(r"\'re", " are ", document)
    document = re.sub(r"\'d", " would ", document)
    document = re.sub(r"\'ll", " will ", document)
    document = re.sub(r"(\d+)(k)", r"\g<1>000", document)
    document = re.sub(r"\n", " ", document)
    
    cleaned_document = []
    
    # Break the document into sentences
    for sent in sent_tokenize(document):

        # Break the sentence into part of speech tagged tokens
        for token, tag in pos_tag(wordpunct_tokenize(sent)):

            # Apply preprocessing to the tokens
            token = token.lower()
            token = token.strip()
            token = token.strip('_')
            token = token.strip('*')

            # If punctuation or stopword, ignore token and continue
            if token in set(sw.words('english')) or all(char in set(string.punctuation) for char in token):
                continue

            # Lemmatize the token
            lemma = lemmatize(token, tag)
            lemmatized_tokens.append(lemma)

        cleaned_document.append(' '.join(lemmatized_tokens))
    
    
    #vectorized_document, word_index = vectorize(cleaned_document, max_features, max_sentence_len)
    #return vectorized_document, word_index
    return cleaned_document

In [188]:
tweets = preprocess(str(list(tweets)))

In [None]:
south = preprocess(str(list(south)))

In [None]:
south

# II . Embedding

## 1. Word Embedding

## 2. Rolling Window Embedding

## 3. Sentence Embedding

### a. Test the embedding

In [None]:
# Download the USE module
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3" 
embed = hub.Module(module_url)

In [32]:
# Generate the embedding and print out some descriptive data
def generate_embedding(messages):
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        message_embeddings = session.run(embed(messages))
    for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
        print("Message: {}".format(messages[i]))
        print("Embedding size: {}".format(len(message_embedding)))
        message_embedding_snippet = ", ".join(
            (str(x) for x in message_embedding[:3]))
        print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
    return(message_embeddings[0])

Let's try to generate a first embedding :

In [38]:
emb = generate_embedding(["How can I reset my password"])

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0328 13:54:23.967081 4613125568 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


Message: How can I reset my password
Embedding size: 512
Embedding: [0.025553924962878227, -0.034720830619335175, 0.0020717347506433725, ...]



### b. Embed the whole file

In [71]:
session = tf.InteractiveSession()
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())



In [145]:
sts_input1 = tf.placeholder(tf.string, shape=(None))

# For evaluation we use exactly normalized rather than
# approximately normalized.
sts_encode1 = tf.nn.l2_normalize(embed(sts_input1), axis=1)

def get_embeds(session, text):
    """Returns the similarity scores"""
    embed = session.run(
        [sts_encode1],
        feed_dict={
            sts_input1: text
        })
    return(embed[0][0])

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0328 16:38:48.509640 4613125568 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


In [146]:
%time
embed_tweets = []

for message in range(len(tweets)) :
    embed_tweets.append(get_embeds(session, [tweets[message]]))

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 8.82 µs


In [159]:
tweets_df = pd.DataFrame(tweets)
tweets_df['embed'] = embed_tweets
tweets_df.head()

Unnamed: 0,text,embed
0,b'And so the robots spared humanity ... https:...,"[0.05683216, -0.019019788, 0.11236075, -0.0346..."
1,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...","[0.01255778, 0.019348582, 0.09177675, -0.01283..."
2,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'","[0.008331334, -0.010440288, -0.011671914, -0.0..."
3,b'Stormy weather in Shortville ...',"[-0.046504013, 0.01973891, 0.0043140077, -0.02..."
4,"b""@DaveLeeBBC @verge Coal is dying due to nat ...","[0.028700273, -0.027458383, 0.09349278, 0.0341..."


In [174]:
tweets_df['embed'] = tweets_df['embed'].apply(lambda x : x.astype('float'))

In [175]:
tweets_df.head()

Unnamed: 0,text,embed
0,b'And so the robots spared humanity ... https:...,"[0.05683216080069542, -0.01901978813111782, 0...."
1,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exa...","[0.012557780370116234, 0.01934858225286007, 0...."
2,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'","[0.008331334218382835, -0.010440288111567497, ..."
3,b'Stormy weather in Shortville ...',"[-0.04650401324033737, 0.01973891071975231, 0...."
4,"b""@DaveLeeBBC @verge Coal is dying due to nat ...","[0.028700273483991623, -0.02745838277041912, 0..."


### c. Define semantic similarity metric

In [176]:
sts_input1 = tf.placeholder(tf.string, shape=(None))
sts_encode2 = tf.placeholder(tf.float32)

# For evaluation we use exactly normalized rather than
# approximately normalized.
sts_encode1 = tf.nn.l2_normalize(embed(sts_input1), axis=1)

cosine_similarities = tf.reduce_sum(tf.multiply(sts_encode1, sts_encode2), axis=1)
clip_cosine_similarities = tf.clip_by_value(cosine_similarities, 0.0, 1.0)
sim_scores = 1.0 - tf.divide(tf.acos(clip_cosine_similarities), 3.14)

def get_scores(session, text_a, text_b):
    """Returns the similarity scores"""
    scores= session.run(
        [sim_scores],
        feed_dict={
            sts_input1: text_a,
            sts_encode2: text_b
        })
    return(scores)

INFO:tensorflow:Saver not created because there are no variables in the graph to restore


I0328 16:55:13.300189 4613125568 tf_logging.py:115] Saver not created because there are no variables in the graph to restore


In [180]:
def get_results(sessions, sentence, num):
    examples = [e for e in tweets_df['embed']]
    scores = get_scores(session, [sentence], examples)
    tweets_df['cosine'] = scores[0].tolist()
    return(tweets_df.sort_values('cosine', ascending=False).head(n=num))

In [181]:
def print_res(test, num=20):
    res = get_results(session, test, num).round(4)
    res = (res.set_index('cosine'))
    print('{}\n'.format(test))
    print('\x1b[31mScore{:<1} \x1b[0m: \x1b[34m Matching sentence\x1b[0m'.format(''))
    for i in res.iterrows():
        print('\x1b[31m{:<6} \x1b[0m: \x1b[0m \x1b[34m{}\x1b[0m'.format(i[0], i[1][0]))

In [183]:
print_res("space launch")

space launch

[31mScore  [0m: [34m Matching sentence[0m
[31m0.774  [0m: [0m [34mb"Can't delay any longer. Must proceed with primary mission to launch the Deep Space Climate Observatory spacecraft."[0m
[31m0.7663 [0m: [0m [34mb'High velocity reentry (2700 lbs/sqft) appeared to succeed, but, as expected, not enough propellant to land for this and the next mission.'[0m
[31m0.7613 [0m: [0m [34mb'Primary mission on target. Spacecraft head towards the sun! All good there.'[0m
[31m0.7563 [0m: [0m [34mb'Dragon captured by the International Space Station! Just awesome ... http://t.co/ihoZqgj7'[0m
[31m0.7562 [0m: [0m [34mb'Drone spaceport ship heads to its hold position in the Atlantic to prepare for a rocket landing http://t.co/kXYHGVKTfE'[0m
[31m0.7542 [0m: [0m [34mb'Jeff maybe unaware SpaceX suborbital VTOL flight began 2013. Orbital water landing 2014. Orbital land landing next. https://t.co/S6WMRnEFY5'[0m
[31m0.7524 [0m: [0m [34mb'Counting down to the f

### d. Live search prediction

In [193]:
@register_cell_magic
def search(line, cell):
    return print_res(cell)

In [192]:
%%search
"Messages on twitter"

"Messages on twitter"

[31mScore  [0m: [34m Matching sentence[0m
[31m0.7404 [0m: [0m [34mb'@apple_defense @samabuelsamid Exactly! I love Twitter.'[0m
[31m0.7214 [0m: [0m [34mb"@tonykatz Don't like having a zillion tweets in the log. Makes it tough to wade through if someone wants to read my tweet history."[0m
[31m0.7182 [0m: [0m [34mb'Single character Tweets are the ulitmate extension of the Twitmeme...'[0m
[31m0.7155 [0m: [0m [34mb'Not easy to convey irony in a tweet'[0m
[31m0.7091 [0m: [0m [34mb'@SwiftOnSecurity I like your tweets!'[0m
[31m0.7001 [0m: [0m [34mb'Please ignore prior tweets, as that was someone pretending to be me :)  This is actually me.'[0m
[31m0.6972 [0m: [0m [34mb'Signing off now. That was more than enough Twitter trouble for one morning!'[0m
[31m0.6838 [0m: [0m [34mb'@BullFlags Yeah. And Twitter is a hater Hellscape.'[0m
[31m0.6716 [0m: [0m [34mb'@YostRobert @StephenAtHome Yeah, and several others at various times. My

Sources :
- https://github.com/choran/sentence_embeddings