# Doc2Vec Searching of Lang Database

Use `gensim` and a simple `doc2vec` model trained on stories from the Lang coloured fairy books to support semantic retrieval of fairy stories.

The approach can be summarised as follows:

- generate a vocabulary of terms representative of the search corpus;
- learn a vector space in which words from the vocabulary are embedded as dense vectors (the dimensions are learned features, not individual words);
- generate a vector for each document or search phrase;
- retrieve documents based on similarity between document vector and search phrase vector.
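
The final retrieval step can be illustrated without a trained model. The sketch below ranks a few hand-made, three-dimensional document vectors against a query vector by cosine similarity, which is the measure gensim's `most_similar` reports. The vectors, the tags and the helper names are invented for illustration; real Doc2Vec vectors are learned and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query, vectors):
    """Rank document tags by decreasing cosine similarity to the query."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy document vectors (hypothetical values, not model output)
doc_vectors = {
    "story_a": [1.0, 0.0, 0.0],
    "story_b": [0.6, 0.8, 0.0],
    "story_c": [0.0, 0.0, 1.0],
}
query_vector = [0.9, 0.1, 0.0]

ranked = rank_documents(query_vector, doc_vectors)
for tag, score in ranked:
    print(f"{tag}: {score:.3f}")
```

The query points mostly along the first axis, so `story_a` ranks first and the orthogonal `story_c` ranks last.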

The following recipe is inspired by [How to make a search engine on Movies Description](https://github.com/ppontisso/Text-Search-Engine-using-Doc2Vec-and-TF-IDF/blob/master/notebook.ipynb).

## Connecting to the Database

We're going to work with our Lang fairy story database, so let's set up a connection to it:

In [1]:
from sqlite_utils import Database

db_name = "demo.db"

db = Database(db_name)

Let's remind ourselves of the database structure:

In [2]:
print(db.schema)

CREATE TABLE [books] (
   [book] TEXT,
   [title] TEXT,
   [text] TEXT,
   [last_para] TEXT,
   [first_line] TEXT,
   [provenance] TEXT,
   [chapter_order] INTEGER,
   PRIMARY KEY ([book], [title])
);
CREATE TABLE [books_metadata] (
   [title] TEXT,
   [year] INTEGER
);
CREATE VIRTUAL TABLE [books_fts] USING FTS5 (
    [title], [text],
    content=[books]
);
CREATE TABLE 'books_fts_data'(id INTEGER PRIMARY KEY, block BLOB);
CREATE TABLE 'books_fts_idx'(segid, term, pgno, PRIMARY KEY(segid, term)) WITHOUT ROWID;
CREATE TABLE 'books_fts_docsize'(id INTEGER PRIMARY KEY, sz BLOB);
CREATE TABLE 'books_fts_config'(k PRIMARY KEY, v) WITHOUT ROWID;
CREATE TRIGGER [books_ai] AFTER INSERT ON [books] BEGIN
  INSERT INTO [books_fts] (rowid, [title], [text]) VALUES (new.rowid, new.[title], new.[text]);
END;
CREATE TRIGGER [books_ad] AFTER DELETE ON [books] BEGIN
  INSERT INTO [books_fts] ([books_fts], rowid, [title], [text]) VALUES('delete', old.rowid, old.[title], old.[text]);
END;
CREATE TRIGGER 

Recall that we can perform a full text search:

In [3]:
#q = 'king "three sons" gold'
q = 'hansel witch'
_q = f'SELECT title FROM books_fts WHERE books_fts MATCH {db.quote(q)} ;'

for row in db.query(_q):
    print(row["title"])

Hansel And Grettel


We can randomly sample a selection of rows with a query of the following form:

In [4]:
# Via https://gist.github.com/alecco/9976dab8fda8256ed403054ed0a65d7b

_q_random_sample = """
SELECT * FROM books
WHERE rowid IN (SELECT rowid FROM books
                ORDER BY random() LIMIT {});
"""

for row in db.query(_q_random_sample.format(5)):
    print(row["title"])

The White Cat
A Tale Of The Tontlawald
The Death Of The Sun-Hero
The Prince Who Would Seek Immortality
Graciosa And Percinet
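
The sample size can also be passed as a bound parameter rather than formatted into the SQL string. The sketch below shows the same `ORDER BY random() LIMIT` pattern using the standard-library `sqlite3` module against a throwaway in-memory table. The table contents and the `random_sample` helper are made up for illustration and are not part of the Lang database.

```python
import sqlite3

# Throwaway in-memory database with a minimal stand-in for the books table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (book TEXT, title TEXT, PRIMARY KEY (book, title))")
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("Demo Book", f"Story {i}") for i in range(20)],
)


def random_sample(conn, n):
    """Return the titles of up to n randomly chosen rows from books."""
    sql = """
    SELECT title FROM books
    WHERE rowid IN (SELECT rowid FROM books
                    ORDER BY random() LIMIT ?);
    """
    return [row[0] for row in conn.execute(sql, (n,))]


sample = random_sample(conn, 5)
print(sample)
```

Asking for more rows than the table holds simply returns every row.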


## Simple Model

We could use an off-the-shelf model to process documents, or we can train our own model from our own documents so that the word vectors are aligned to our dataset. With a large corpus, we can train on a sample of documents, provided the sample is representative of the whole.

If we train against the whole dataset, we can search the dataset directly from the model. If we train the model on only part of the collection, we can only compare search phrases against documents for which we have generated vectors.

To create the model, it helps if we clean the documents first, for example by lowercasing them and removing punctuation, numbers and stopwords:

In [5]:
from gensim.parsing.preprocessing import preprocess_string
from gensim.parsing.preprocessing import strip_tags, strip_punctuation, strip_numeric, remove_stopwords


def clean_text(text):
    """Generate a cleaned, tokenised version of a text."""
    # Filters are applied in order: lowercase first, then strip markup,
    # punctuation and digits, and finally drop stopwords
    CUSTOM_FILTERS = [lambda x: x.lower(),
                      strip_tags, strip_punctuation,
                      strip_numeric, remove_stopwords]

    return preprocess_string(text, CUSTOM_FILTERS)
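
To make the individual cleaning steps concrete, here is a rough standard-library approximation of the pipeline. It is not gensim's implementation: the `TINY_STOPWORDS` set is a small hypothetical stand-in for gensim's much longer stopword list, and the regular expressions only approximate the behaviour of the gensim filters.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list (gensim's own list is much longer)
TINY_STOPWORDS = {"a", "an", "and", "the", "in", "of", "was", "there", "once", "upon"}


def simple_clean_text(text):
    """Lowercase, drop tags, punctuation and digits, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)                            # strip HTML-style tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # strip punctuation
    text = re.sub(r"[0-9]+", "", text)                              # strip digits
    return [word for word in text.split() if word not in TINY_STOPWORDS]


tokens_demo = simple_clean_text("Once upon a time, there was a <b>King</b> and 3 sons.")
print(tokens_demo)  # ['time', 'king', 'sons']
```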

Apply the cleaning function to each text as we build the training corpus:

In [6]:
sample_corpus = db.query(_q_random_sample.format(9999))

sample_docs = [(clean_text(r['text']),
               f"{r['book']}::{r['title']}", #create a unique tag
               r['title'])
               for r in sample_corpus]

# For the first doc, preview the first 5 cleaned words and title
sample_docs[0][0][:5], sample_docs[0][1], sample_docs[0][2]

(['time', 'certain', 'country', 'lived', 'king'],
 'The Blue Fairy Book::The Bronze Ring',
 'The Bronze Ring')

In [7]:
# The gensim model needs named tuples
# including at least a words and tags dimension
# Naively we could just use a document index count as the tag;
# here we use the unique book::title string created above
from collections import namedtuple

StoryDoc = namedtuple('StoryDoc',
                      'words tags title')

sample_docs_training = []

for sample_doc in sample_docs:
    sample_docs_training.append(StoryDoc(sample_doc[0],
                                         [sample_doc[1]],
                                         sample_doc[2]))

In [8]:
from gensim.models import Doc2Vec

# Define the parameters for building the model.
# We can also pass a list of documents
# via the first "documents" parameter
# and the model will be trained against those.
# Alternatively, create an empty model and train it later.
model = Doc2Vec(
                # dm: training algorithm;
                # 1: distributed memory/PV-DM;
                # 0: distributed bag of words (PV-DBOW)
                dm=1,
                # vector_size: size of feature vectors
                vector_size=300,
                # window: max dist between current & predicted word
                window=10,
                # hs: 1: hierarchical softmax;
                # hs: 0: negative sampling (if `negative` is non-zero)
                hs=0,
                # min_count: ignore words w/ lower frequency
                # There is a risk to setting this too high
                # particularly if a search term is likely unique,
                # as it might be with a name. On the other hand,
                # for such situations, a simple search might be better?
                min_count=5,
                # sample: randomly downsample high-frequency words
                # useful range: (0, 1e-5)
                sample=1e-5,
                )

The model is built around a vocabulary extracted from the training document corpus.

In [9]:
# Build the model vocabulary
model.build_vocab(sample_docs_training)
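
The effect of `min_count` on the vocabulary can be previewed before training by counting tokens across the cleaned corpus. The sketch below uses a tiny made-up corpus and a hypothetical `vocabulary_split` helper. It mirrors the documented behaviour of `min_count` (words whose total frequency falls below the threshold are ignored) rather than calling gensim itself.

```python
from collections import Counter

# Tiny made-up corpus of already-tokenised documents
toy_corpus = [
    ["king", "forest", "witch", "king"],
    ["king", "princess", "forest"],
    ["hansel", "forest", "witch"],
]


def vocabulary_split(tokenised_docs, min_count):
    """Split tokens into those kept and those dropped at a given min_count."""
    counts = Counter(token for doc in tokenised_docs for token in doc)
    kept = {token for token, count in counts.items() if count >= min_count}
    dropped = set(counts) - kept
    return kept, dropped


kept, dropped = vocabulary_split(toy_corpus, min_count=2)
print(sorted(kept))     # tokens that would survive
print(sorted(dropped))  # rare tokens, such as a name that appears only once
```

With the real corpus, the same check could be run over `[d.words for d in sample_docs_training]` to see whether likely search terms, such as character names, fall below the threshold.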

We can now train the model (this may take some time for a large corpus):

In [10]:
# It would be useful if we could display a progress bar for this...
model.train(sample_docs_training,
            total_examples=model.corpus_count,
            epochs=100, start_alpha=0.01, end_alpha=0.01)
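
One lightweight way to monitor training, short of a progress bar, is to turn on logging. gensim reports its progress through Python's standard `logging` module, so enabling INFO-level output before calling `train` should print periodic progress messages. A minimal sketch follows; the format string is just one reasonable choice.

```python
import logging

# Show timestamped log messages at INFO level and above
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

# Make sure the gensim logger itself is not filtered more strictly
gensim_logger = logging.getLogger("gensim")
gensim_logger.setLevel(logging.INFO)
```

Run this before the training cell; setting the level back to `logging.WARNING` silences the messages again.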

Rather than creating a model each time we want to use it, we can save the model and then load it as required:

In [11]:
# Save a model
model.save("lang_model.gensim")

# Load in a model
model = Doc2Vec.load("lang_model.gensim")

To retrieve a document matching a search phrase, we need to encode the search phrase and then try to find a matching document:

In [12]:
search_phrase = """
hansel and his sister were cast out by their wicked stepmother
and went into forest and met an evil witch
"""

# Preprocess the search phrase
tokens = clean_text(search_phrase)

tokens

['hansel',
 'sister',
 'cast',
 'wicked',
 'stepmother',
 'went',
 'forest',
 'met',
 'evil',
 'witch']

Generate a vector for the tokens:

In [13]:
# Generate the vector representation from the model
# (`epochs` is the current name for the argument older gensim releases called `steps`)
search_vector = model.infer_vector(tokens, alpha=0.001, epochs=5)

We can now search for related documents from the original training set based on how well their vectors match the vector generated for the search phrase:

In [14]:
# Find the top 10 matches
matches = model.docvecs.most_similar([search_vector], topn=10)
# To rank every document from the training corpus
# set: topn=model.docvecs.count

# The response gives the original training document ids and match scores
matches

[('The Blue Fairy Book::Hansel And Grettel', 0.7299070954322815),
 ('The Yellow Fairy Book::The Witch', 0.5531911849975586),
 ('The Red Fairy Book::Mother Holle', 0.4939098060131073),
 ('The Red Fairy Book::The Twelve Brothers', 0.47860634326934814),
 ('The Red Fairy Book::Brother And Sister', 0.450879842042923),
 ('The Red Fairy Book::The Three Dwarfs', 0.391923725605011),
 ('The Yellow Fairy Book::The White Duck', 0.3815954923629761),
 ('The Pink Fairy Book::The House In The Wood', 0.3764336109161377),
 ('The Lilac Fairy Book::The Fairy Nurse', 0.36403894424438477),
 ('The Blue Fairy Book::Snow-White And Rose-Red', 0.3379814028739929)]
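
Because each tag was built as `book::title`, a match can be split back into the two values that make up the compound primary key of the `books` table. A small sketch using the first three results above, with scores rounded for brevity; the `split_tag` helper is hypothetical.

```python
# First three results copied from the output above (scores rounded)
matches = [
    ("The Blue Fairy Book::Hansel And Grettel", 0.7299),
    ("The Yellow Fairy Book::The Witch", 0.5532),
    ("The Red Fairy Book::Mother Holle", 0.4939),
]


def split_tag(tag):
    """Split a 'book::title' tag into its (book, title) parts."""
    book, title = tag.split("::", 1)
    return book, title


for tag, score in matches:
    book, title = split_tag(tag)
    print(f"{score:.2f}  {title} ({book})")
```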

Let's try another one:

In [15]:
search_phrase = """
a poor orphan girl lives with her wicked stepmother and sisters
but then her fairy godmother appears and she goes to a ball in a 
pumpkin carriage and leaves at midnight
but loses her slipper then finally marries the prince
"""

# Preprocess the search phrase
tokens = clean_text(search_phrase)
search_vector = model.infer_vector(tokens, alpha=0.001, epochs=5)
model.docvecs.most_similar([search_vector], topn=10)

[('The Blue Fairy Book::Cinderella, Or The Little Glass Slipper',
  0.5139161348342896),
 ('The Blue Fairy Book::The Sleeping Beauty In The Wood', 0.3240140974521637),
 ('The Blue Fairy Book::Toads And Diamonds', 0.3050570487976074),
 ('The Violet Fairy Book::The Child Who Came From An Egg', 0.2949081063270569),
 ('The Yellow Fairy Book::The Swineherd', 0.27728572487831116),
 ('The Lilac Fairy Book::The Brown Bear Of Norway', 0.2717766761779785),
 ('The Yellow Fairy Book::The Yellow Fairy Book', 0.24191826581954956),
 ('The Red Fairy Book::The Twelve Dancing Princesses', 0.21802163124084473),
 ('The Blue Fairy Book::Little Thumb', 0.2116871178150177),
 ('The Pink Fairy Book::Peter Bull', 0.1903074085712433)]