# Introduction

This notebook does the following:

- Sets up an ElasticTransformers class
- Creates an index and loads the Million Headlines dataset into it
- Previews search results that compare lexical and semantic search
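
For orientation, the two search styles being compared can be sketched as Elasticsearch request bodies. This is a minimal sketch, not the ElasticTransformers implementation: the function names are hypothetical, and the `cosineSimilarity(params.query_vector, 'field')` form assumes Elasticsearch 7.6 or later (earlier 7.x releases passed `doc['field']` instead).

```python
def lexical_query(field, text):
    """Keyword search: a plain `match` query on a text field."""
    return {"query": {"match": {field: text}}}


def semantic_query(vector_field, query_vector):
    """Vector search: score every document by cosine similarity to the query."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps the score non-negative, which Elasticsearch requires
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }


# Hypothetical usage; a real query vector would come from the sentence embedder
lexical_body = lexical_query("headline_text", "air nz strike")
semantic_body = semantic_query("headline_text_embedding", [0.1, 0.2, 0.3])
```

The lexical query only matches documents that share terms with the query text, while the semantic query ranks all documents by the similarity of their stored embedding to the query embedding.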


## Loading requirements

In [1]:
%load_ext autoreload
import os
# Switch from the notebooks/ folder to the project root so that src/ and data/ resolve
os.chdir(os.path.abspath(os.curdir).replace('notebooks',''))

In [2]:
%autoreload 2
from src.database import ElasticTransformers


## Sentence Transformers

This creates the sentence transformer object, along with a small helper function that simplifies the embedding call and makes loading data into Elasticsearch easier.

In [3]:
from sentence_transformers import SentenceTransformer
bert_embedder = SentenceTransformer('bert-base-nli-mean-tokens')


100%|██████████| 405M/405M [02:04<00:00, 3.26MB/s] 


In [24]:
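def embed_wrapper(ls):
    """Embed a list of strings and return one plain list of floats per string."""
    results = bert_embedder.encode(ls)
    print(type(results))  # debug output: the container type returned by encode()
    # Convert each embedding to a list so it can be serialized to JSON
    results = [r.tolist() for r in results]
    return results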
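
The helper's contract is simple: a list of strings goes in, and one list of floats per string comes out. To exercise the rest of the pipeline without downloading the model, a stand-in with the same contract can be useful. The sketch below is an assumption for illustration only: `fake_embed_wrapper` is a hypothetical name, and its vectors are deterministic pseudo-random values with no semantic meaning.

```python
import hashlib


def fake_embed_wrapper(texts, dim=768):
    """Hypothetical stand-in for embed_wrapper: same shape, no semantics."""
    vectors = []
    for text in texts:
        values = []
        counter = 0
        while len(values) < dim:
            # Hash the text with a counter to get as many bytes as needed
            digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
            # Map each byte (0..255) to a float in [-1, 1]
            values.extend(b / 127.5 - 1.0 for b in digest)
            counter += 1
        vectors.append(values[:dim])
    return vectors


vectors = fake_embed_wrapper(["air nz strike to affect australian travellers"])
print(len(vectors), len(vectors[0]))
```

Because the output is a list of lists of floats, it can be passed wherever `embed_wrapper` is expected, as long as the dimension matches the index mapping.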

## Quick Preview of the raw data

The dataset contains 1.15 million news headlines (all in lower case), each with its publication date.

In [25]:
import pandas as pd
df=pd.read_csv('data/abcnews-date-text.csv')

In [26]:
df.head()

| Unnamed: 0 | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


# A tiny example

Let's first try this on a tiny sample of 1,000 headlines (the full dataset has about 1.1 million).

In [27]:
df.head(1000).to_csv('data/tiny_sample.csv')


# Setting up ElasticTransformers

The lines below initialize the class by setting the Elasticsearch URL and the index name.

In [28]:
et=ElasticTransformers(url='http://localhost:9200',index_name='et-tiny')
_ = et.ping()


2020-08-31 17:00:35,684 - src.logger - DEBUG - ping:32 - Ping successful
DEBUG:src.logger:Ping successful


Next, we define the index specification (the Elasticsearch index mapping).

In [29]:
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)

2020-08-31 17:00:37,125 - src.logger - DEBUG - create_index_spec:93 - Index spec index_spec/spec_et-tiny.json created
DEBUG:src.logger:Index spec index_spec/spec_et-tiny.json created


{'settings': {'number_of_shards': 3, 'number_of_replicas': 1},
 'mappings': {'dynamic': 'true',
  '_source': {'enabled': 'true'},
  'properties': {'publish_date': {'type': 'text'},
   'headline_text': {'type': 'text'},
   'headline_text_embedding': {'type': 'dense_vector', 'dims': 768}}}}
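
The mapping above follows directly from the three arguments: each text field becomes a `text` property and each dense field becomes a `dense_vector` of the given dimension. A minimal sketch of that construction is shown below. `build_index_spec` is a hypothetical function, not the actual `create_index_spec` method, and the shard and replica defaults are simply copied from the output above.

```python
def build_index_spec(text_fields, dense_fields, dense_fields_dim,
                     shards=3, replicas=1):
    """Hypothetical re-creation of the mapping printed by create_index_spec."""
    # One `text` property per text field
    properties = {name: {"type": "text"} for name in text_fields}
    # One `dense_vector` property per embedding field
    for name in dense_fields:
        properties[name] = {"type": "dense_vector", "dims": dense_fields_dim}
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "dynamic": "true",
            "_source": {"enabled": "true"},
            "properties": properties,
        },
    }


spec = build_index_spec(
    text_fields=["publish_date", "headline_text"],
    dense_fields=["headline_text_embedding"],
    dense_fields_dim=768,
)
```

The `dims` value has to match the length of the vectors returned by the embedder, which is 768 in this notebook.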

In [30]:
et.create_index()


Creating 'et-tiny' index.


In [32]:
et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')




0it [00:00, ?it/s]

<class 'numpy.ndarray'>

2020-08-31 17:01:48,149 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-tiny
DEBUG:src.logger:Successfully wrote 1000 docs to et-tiny

1it [00:13, 13.15s/it]

One sample looks like this

## Indexing the entire dataset

Let's now do this with the full 1.1 million records.

In [33]:
# Initialize
et=ElasticTransformers(url='http://localhost:9200',index_name='et-large')
_ = et.ping()
# Create index mapping
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
# Create index
et.create_index()

2020-08-31 17:01:48,217 - src.logger - DEBUG - ping:32 - Ping successful
DEBUG:src.logger:Ping successful
2020-08-31 17:01:48,226 - src.logger - DEBUG - create_index_spec:93 - Index spec index_spec/spec_et-large.json created
DEBUG:src.logger:Index spec index_spec/spec_et-large.json created


Creating 'et-large' index.


### Indexing with sentence-transformers... 

This takes around 6 hours on CPU. The embedding process uses 4 CPUs and 2 GB of RAM, and Elasticsearch uses about 2 GB of RAM.
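
As a rough cross-check of the runtime, the total can be estimated from the per-chunk time. The sketch below assumes the 13.15 s/it reported for the 1000-row tiny sample earlier in this notebook and a constant rate for all chunks. `estimate_hours` is a hypothetical helper, and the result is only an order-of-magnitude estimate, not a replacement for the measured figure.

```python
import math


def estimate_hours(n_rows, chunksize, seconds_per_chunk):
    """Estimate total indexing time in hours from a per-chunk timing."""
    n_chunks = math.ceil(n_rows / chunksize)
    return n_chunks * seconds_per_chunk / 3600


# 1.1 million rows in chunks of 1000, at the tiny-sample rate of 13.15 s per chunk
print(round(estimate_hours(1_100_000, 1000, 13.15), 1))  # 4.0
```

The estimate lands in the same range as the measured runtime. Slower chunks or extra load on the machine would push the real figure higher.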

In [None]:
et.write_large_csv('data/abcnews-date-text.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')





0it [00:00, ?it/s]

<class 'numpy.ndarray'>

2020-08-31 17:02:15,086 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

1it [00:14, 14.63s/it]

<class 'numpy.ndarray'>

2020-08-31 17:02:35,218 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

2it [00:34, 16.28s/it]

<class 'numpy.ndarray'>

2020-08-31 17:02:49,734 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

3it [00:49, 15.75s/it]

<class 'numpy.ndarray'>

2020-08-31 17:03:05,452 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

4it [01:04, 15.74s/it]

<class 'numpy.ndarray'>

2020-08-31 17:03:22,616 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

5it [01:22, 16.17s/it]

<class 'numpy.ndarray'>

2020-08-31 17:03:36,878 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

6it [01:36, 15.60s/it]

<class 'numpy.ndarray'>

2020-08-31 17:03:51,890 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

7it [01:51, 15.42s/it]

<class 'numpy.ndarray'>

2020-08-31 17:04:07,947 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

8it [02:07, 15.61s/it]

<class 'numpy.ndarray'>

2020-08-31 17:04:23,395 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

9it [02:22, 15.56s/it]

<class 'numpy.ndarray'>

2020-08-31 17:04:38,449 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

10it [02:37, 15.41s/it]

<class 'numpy.ndarray'>

2020-08-31 17:04:52,891 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

11it [02:52, 15.12s/it]

<class 'numpy.ndarray'>

2020-08-31 17:05:07,959 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

12it [03:07, 15.10s/it]

<class 'numpy.ndarray'>

2020-08-31 17:05:22,624 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

13it [03:22, 14.97s/it]

<class 'numpy.ndarray'>

2020-08-31 17:05:37,441 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

14it [03:36, 14.93s/it]

<class 'numpy.ndarray'>

2020-08-31 17:05:52,954 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-large
DEBUG:src.logger:Successfully wrote 1000 docs to et-large

15it [03:52, 15.10s/it]