# Introduction

This notebook does the following:

- Set up an ElasticTransformers class
- Create an index and load the Million Headlines dataset into it
- Preview search results that compare lexical and semantic search


## Loading requirements

In [70]:
%load_ext autoreload
import os
# Move from the notebooks folder up to the repository root so that `src` and `data/` resolve
os.chdir(os.path.abspath(os.curdir).replace('notebooks',''))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [71]:
%autoreload 2
from src.database import ElasticTransformers


## Sentence Transformers

This creates the sentence transformer object as well as a small helper function that simplifies the embedding call and makes loading data into Elasticsearch easier

In [8]:
from sentence_transformers import SentenceTransformer
bert_embedder = SentenceTransformer('bert-base-nli-mean-tokens')


In [24]:
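def embed_wrapper(ls):
    """Embed a list of strings and return plain Python lists of floats."""
    results = bert_embedder.encode(ls)
    # Each embedding is a NumPy array; ndarray exposes tolist(), not to_list()
    results = [r.tolist() for r in results]
    return results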
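
Because the index mapping fixes the vector size, it can help to validate embeddings before they are sent to Elasticsearch. The following is a minimal sketch, not part of ElasticTransformers: `checked_embedder` and `fake_embedder` are hypothetical names, and the stand-in embedder only mimics the call shape of `embed_wrapper` so that the example runs without downloading a model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def checked_embedder(embed: Callable[[List[str]], List[Vector]], expected_dim: int):
    """Wrap an embedder so every returned vector is validated before indexing."""

    def wrapper(texts: Sequence[str]) -> List[Vector]:
        texts = list(texts)
        vectors = embed(texts)
        if len(vectors) != len(texts):
            raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
        for position, vector in enumerate(vectors):
            if len(vector) != expected_dim:
                raise ValueError(
                    f"vector {position} has {len(vector)} dims, expected {expected_dim}"
                )
        # Force plain Python floats so the result is JSON-serializable
        return [[float(x) for x in vector] for vector in vectors]

    return wrapper


def fake_embedder(texts: List[str]) -> List[Vector]:
    """Hypothetical stand-in for embed_wrapper: 4 dims derived from text length."""
    return [[float(len(text))] * 4 for text in texts]


safe_embed = checked_embedder(fake_embedder, expected_dim=4)
vectors = safe_embed(["air nz strike", "pay rise"])
print(len(vectors), len(vectors[0]))
```

With the real model, the same wrapper would be applied as `checked_embedder(embed_wrapper, expected_dim=768)`, assuming the embedder keeps the list-in, list-out contract shown above.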

## Quick Preview of the raw data

The data contains about 1.15mn news headlines (all in lower case), each with its publication date

In [65]:
import pandas as pd
df=pd.read_csv('data/abcnews-date-text.csv')

In [27]:
df.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


# A tiny example

Let's first do this with a tiny example of 1000 headlines (the full dataset is 1.1mn headlines)

In [68]:
df.head(1000).to_csv('data/tiny_sample.csv')


# Setting up ElasticTransformers

The lines below initialize the class by setting the Elasticsearch URL and the index name

In [74]:
et=ElasticTransformers(url='http://localhost:9200',index_name='et-tiny')
_ = et.ping()


2020-08-31 12:05:54,800 - src.logger - DEBUG - ping:32 - Ping successful
DEBUG:src.logger:Ping successful


Next, we define the index specification (Elasticsearch index mapping)

In [76]:
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)

2020-08-31 12:07:00,626 - src.logger - DEBUG - create_index_spec:93 - Index spec index_spec/spec_et-tiny.json created
DEBUG:src.logger:Index spec index_spec/spec_et-tiny.json created


{'settings': {'number_of_shards': 3, 'number_of_replicas': 1},
 'mappings': {'dynamic': 'true',
  '_source': {'enabled': 'true'},
  'properties': {'publish_date': {'type': 'text'},
   'headline_text': {'type': 'text'},
   'headline_text_embedding': {'type': 'dense_vector', 'dims': 768}}}}
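
To make the structure of that mapping explicit, here is a minimal sketch of how such a spec could be assembled from the three arguments. It is an illustration only and not the actual `create_index_spec` implementation in `src.database`; the function name `build_index_spec` and the `shards` and `replicas` parameters are assumptions, with defaults copied from the output above.

```python
import json


def build_index_spec(text_fields, dense_fields, dense_fields_dim, shards=3, replicas=1):
    """Assemble an Elasticsearch index body with text and dense_vector fields."""
    # Text fields are analyzed and searched lexically
    properties = {name: {"type": "text"} for name in text_fields}
    # Dense fields store fixed-length embeddings for vector scoring
    for name in dense_fields:
        properties[name] = {"type": "dense_vector", "dims": dense_fields_dim}
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "dynamic": "true",
            "_source": {"enabled": "true"},
            "properties": properties,
        },
    }


spec = build_index_spec(
    text_fields=["publish_date", "headline_text"],
    dense_fields=["headline_text_embedding"],
    dense_fields_dim=768,
)
print(json.dumps(spec, indent=2))
```

The `dims` value has to match the length of the vectors produced by the embedder, which is why 768 is passed for `bert-base-nli-mean-tokens`.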

In [77]:
et.create_index()


Creating 'et-tiny' index.


In [78]:
et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')

0it [00:00, ?it/s]2020-08-31 12:07:47,044 - src.logger - DEBUG - write_large_csv:181 - Successfully wrote 1000 docs to et-tiny
DEBUG:src.logger:Successfully wrote 1000 docs to et-tiny
1it [00:18, 18.21s/it]
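
The call above reads the CSV in chunks, embeds one field per chunk, and writes the resulting documents to the index. The sketch below illustrates that general pattern with the standard library only. It is not the `write_large_csv` implementation: `iter_chunks`, `build_bulk_actions`, and `toy_embedder` are hypothetical helpers, the `_embedding` suffix is inferred from the field names used in this notebook, and the actions are only built, not sent to Elasticsearch.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield successive lists of at most `chunksize` rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


def build_bulk_actions(chunk, index_name, embedder, field_to_embed):
    """Embed one field for a whole chunk and attach each vector to its row."""
    vectors = embedder([row[field_to_embed] for row in chunk])
    actions = []
    for row, vector in zip(chunk, vectors):
        doc = dict(row)
        doc[field_to_embed + "_embedding"] = vector  # assumed naming convention
        actions.append({"_index": index_name, "_source": doc})
    return actions


def toy_embedder(texts):
    """Hypothetical stand-in for embed_wrapper: two cheap numeric features."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


sample_csv = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,air nz staff in aust strike for pay rise\n"
    "20030219,air nz strike to affect australian travellers\n"
    "20030219,a g calls for infrastructure protection summit\n"
)
reader = csv.DictReader(sample_csv)
batches = [
    build_bulk_actions(chunk, "et-tiny", toy_embedder, "headline_text")
    for chunk in iter_chunks(reader, chunksize=2)
]
print([len(batch) for batch in batches])
```

Embedding a whole chunk in one call lets the model batch its work, and keeping chunks small bounds memory use regardless of file size. Dictionaries of this `_index` / `_source` shape are what the bulk helper in the official Elasticsearch Python client accepts, should the sketch be extended to write to a live cluster.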


One sample looks like this

## Indexing the entire dataset

Let's now do this with the full 1.1mn records

In [None]:
# Initialize
et=ElasticTransformers(url='http://localhost:9200',index_name='et-large')
_ = et.ping()
# Create index mapping
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
# Create index
et.create_index()

### Indexing with sentence-transformers

On CPU this run took a little over 6 hours. The embedding process used 4 CPUs and 2GB RAM, and Elasticsearch used about 2GB RAM

In [59]:
et.write_large_csv('data/abcnews-date-text.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')


1187it [6:15:07, 18.96s/it]
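
As a quick sanity check, the progress bar figures can be turned into a total runtime and a rough throughput. This is a back-of-the-envelope sketch that assumes the reported 18.96 s average applies to every chunk and that every chunk holds the full 1000 rows; the last chunk may be smaller, so the document count is an upper bound.

```python
chunks = 1187              # iterations reported by the progress bar
seconds_per_chunk = 18.96  # average time per iteration reported by the progress bar
chunksize = 1000           # rows per chunk passed to write_large_csv

total_seconds = chunks * seconds_per_chunk
hours, remainder = divmod(int(total_seconds), 3600)
minutes, seconds = divmod(remainder, 60)

max_docs = chunks * chunksize
docs_per_second = chunksize / seconds_per_chunk

print(f"estimated runtime: {hours}h {minutes}m {seconds}s")
print(f"at most {max_docs} documents, about {docs_per_second:.1f} docs/s")
```

The estimate lands within a few seconds of the 6:15:07 shown by the progress bar, which suggests the per-chunk time was stable across the run.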
