# Introduction

This notebook does the following:

- Set up an ElasticTransformers class
- Create an index and load the Million Headlines dataset into it
- Preview search results that compare lexical and semantic search


## Loading requirements

In [36]:
%load_ext autoreload
import os
os.chdir(os.path.abspath(os.curdir).replace('notebooks',''))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [37]:
%autoreload 2
from src.database import ElasticTransformers


## Sentence Transformers

This creates the sentence transformer object, as well as a small helper function that simplifies the embedding call and makes loading data into Elasticsearch easier.

In [38]:
from sentence_transformers import SentenceTransformer
bert_embedder = SentenceTransformer('bert-base-nli-mean-tokens')


In [39]:
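def embed_wrapper(ls):
    """
    Embed a list of strings and return plain Python lists,
    which makes loading the vectors into Elasticsearch easier.
    """
    # encode() returns one embedding row per input string
    results = bert_embedder.encode(ls, convert_to_tensor=True)
    # convert each row to a list of floats so it can be serialized to JSON
    results = [r.tolist() for r in results]
    return results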
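
Before indexing, it can help to confirm that an embedder returns what the index expects: one vector per input string, each a plain list of floats with the right dimension. The sketch below is only an illustration. `check_embedder` and `fake_embedder` are hypothetical names that are not part of ElasticTransformers or sentence-transformers, and `fake_embedder` produces random vectors so the example runs without downloading a model.

```python
import json
import random


def fake_embedder(texts, dim=768):
    """Hypothetical stand-in for embed_wrapper: one random vector per text."""
    rng = random.Random(0)
    return [[rng.random() for _ in range(dim)] for _ in texts]


def check_embedder(embed_fn, dim=768):
    """Return True if embed_fn maps a list of strings to JSON-ready vectors."""
    sample = ["air nz strike to affect australian travellers", "a second headline"]
    vectors = embed_fn(sample)
    # One vector per input text.
    if len(vectors) != len(sample):
        return False
    for vec in vectors:
        # Plain Python lists with the dimension the index mapping expects.
        if not isinstance(vec, list) or len(vec) != dim:
            return False
        if not all(isinstance(x, float) for x in vec):
            return False
    # json.dumps raises TypeError if a value (e.g. a tensor) is not serializable.
    json.dumps(vectors)
    return True


print(check_embedder(fake_embedder))
```

In the notebook itself, the same check could be called as `check_embedder(embed_wrapper)` once the model is loaded.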

## Quick Preview of the raw data

The data contains 1.15mn news headlines (all in lower case) and their publication dates.

In [40]:
import pandas as pd
df=pd.read_csv('data/abcnews-date-text.csv')

In [41]:
df.head()

|   | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |


# A tiny example

Let's first do this with a tiny example of 1000 headlines (the full dataset is 1.1mn headlines)

In [42]:
df.head(1000).to_csv('data/tiny_sample.csv')


# Setting up ElasticTransformers

The lines below initialize the class by setting the Elasticsearch URL and the index name.

In [32]:
et=ElasticTransformers(url='http://localhost:9200',index_name='et-tiny')
_ = et.ping()



Next, we define the index specification (Elasticsearch index mapping)

In [33]:
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)

{'settings': {'number_of_shards': 3, 'number_of_replicas': 1},
 'mappings': {'dynamic': 'true',
  '_source': {'enabled': 'true'},
  'properties': {'publish_date': {'type': 'text'},
   'headline_text': {'type': 'text'},
   'headline_text_embedding': {'type': 'dense_vector', 'dims': 768}}}}
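
For readers curious how a specification like the one printed above could be assembled, here is a minimal sketch. `build_index_spec` and its `shards` and `replicas` parameters are hypothetical and are not the ElasticTransformers API. The function simply reproduces the printed dictionary from the same three arguments.

```python
def build_index_spec(text_fields, dense_fields, dense_fields_dim,
                     shards=3, replicas=1):
    """Hypothetical helper that assembles a mapping like the one printed above."""
    properties = {}
    # Plain text fields are analyzed for lexical (keyword) search.
    for name in text_fields:
        properties[name] = {"type": "text"}
    # Dense vector fields hold the embeddings used for semantic search.
    for name in dense_fields:
        properties[name] = {"type": "dense_vector", "dims": dense_fields_dim}
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": {
            "dynamic": "true",
            "_source": {"enabled": "true"},
            "properties": properties,
        },
    }


spec = build_index_spec(
    text_fields=["publish_date", "headline_text"],
    dense_fields=["headline_text_embedding"],
    dense_fields_dim=768,
)
print(spec["mappings"]["properties"]["headline_text_embedding"])
```

The `dims` value has to match the length of the vectors returned by the embedder, which is 768 in this notebook.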

In [34]:
et.create_index()


Creating 'et-tiny' index.


In [35]:
et.write_large_csv('data/tiny_sample.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')

1it [00:08,  8.52s/it]
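
The call above reads the CSV in chunks, embeds one field per chunk, and writes the result to the index. The sketch below illustrates that general pattern with the standard library only. It is not the ElasticTransformers implementation: `iter_chunks`, `make_actions`, and `toy_embedder` are hypothetical names, the `<field>_embedding` naming is an assumption taken from the mapping in this notebook, and nothing is sent to Elasticsearch. The actions use the `_index`/`_source` dictionary shape accepted by `elasticsearch.helpers.bulk`.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


def make_actions(csv_file, index_name, embedder, field_to_embed, chunksize=1000):
    """Yield one list of bulk-style actions per chunk of the CSV."""
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunksize):
        # Embed the whole chunk in one call; the embedder takes a list of strings.
        vectors = embedder([row[field_to_embed] for row in chunk])
        actions = []
        for row, vector in zip(chunk, vectors):
            doc = dict(row)
            # Assumed naming convention: "<field>_embedding".
            doc[field_to_embed + "_embedding"] = vector
            actions.append({"_index": index_name, "_source": doc})
        yield actions


def toy_embedder(texts):
    """Hypothetical 2-d embedder so the sketch runs without a model."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


sample = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,air nz staff in aust strike for pay rise\n"
    "20030219,air nz strike to affect australian travellers\n"
    "20030219,a g calls for infrastructure protection summit\n"
)
batches = list(make_actions(sample, "et-tiny", toy_embedder, "headline_text", chunksize=2))
print([len(b) for b in batches])  # → [2, 1]
```

Processing chunk by chunk keeps memory use bounded, since only `chunksize` rows and their vectors are held at any one time.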


One sample looks like this

## Indexing the entire dataset

Let's now do this with all 1.1mn records.

In [44]:
# Initialize
et=ElasticTransformers(url='http://localhost:9200',index_name='et-large')
_ = et.ping()
# Create index mapping
et.create_index_spec(
    text_fields=['publish_date','headline_text'],
    dense_fields=['headline_text_embedding'],
    dense_fields_dim=768
)
# Create index
et.create_index()

Creating 'et-large' index.


### Indexing with sentence-transformers... 

This takes about 3 hours on CPU. The embedding process uses 4 CPUs and 2GB RAM, and Elasticsearch uses about another 2GB RAM.

In [None]:
et.write_large_csv('data/abcnews-date-text.csv',
                  chunksize=1000,
                  embedder=embed_wrapper,
                  field_to_embed='headline_text')


188it [30:19,  9.96s/it]
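
As a rough cross-check of the 3-hour figure, the rate shown by the progress bar can be extrapolated to the full dataset. This is a back-of-the-envelope sketch that assumes about 1.1mn rows and that the per-chunk rate stays near the 9.96 s reported above.

```python
import math

rows = 1_100_000          # approximate dataset size quoted in this notebook
chunksize = 1000          # same chunksize as the write_large_csv call
seconds_per_chunk = 9.96  # rate reported by the progress bar

chunks = math.ceil(rows / chunksize)
total_hours = chunks * seconds_per_chunk / 3600
print(chunks, round(total_hours, 1))  # → 1100 3.0
```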