# Text embedding

A text embedding is a vector (list) of floating-point numbers that represents a piece of text. The distance between two vectors measures how related the underlying texts are: small distances suggest high relatedness, and large distances suggest low relatedness.
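One common way to turn "distance between two vectors" into a number is cosine distance (1 minus cosine similarity). The sketch below is a minimal pure-Python illustration; the three-dimensional vectors and their names are made up for the example and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Small values mean the vectors point in similar directions."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors (assumed values, for illustration only)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# The related pair should be closer than the unrelated pair
print(cosine_distance(cat, kitten) < cosine_distance(cat, invoice))  # True
```

Real embeddings have far more dimensions (1536 for the model described below), but the calculation is the same.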

Embeddings are commonly used for:

* Search (where results are ranked by relevance to a query string)
* Clustering (where text strings are grouped by similarity)
* Recommendations (where items with related text strings are recommended)
* Anomaly detection (where outliers with little relatedness are identified)
* Diversity measurement (where similarity distributions are analyzed)
* Classification (where text strings are classified by their most similar label)

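A typical embedding-based text analysis pipeline has three steps:

* Tokenization (where the text is split into tokens)
* Embedding (where the tokens are mapped to a vector)
* Similarity/distance calculation (where vectors are compared)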
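The three steps above can be sketched end to end with a toy example. Here the "embedding" is just a word-count vector, which is an assumption made so the code runs without any model; a real pipeline would call a learned embedding model at step 2. The documents and query are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Step 1: split text into lowercase word tokens (stand-in for a real tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(tokens, vocabulary):
    """Step 2: toy embedding, one count per vocabulary word (stand-in for a learned model)."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Step 3: compare two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


documents = [
    "The central bank raised interest rates",
    "The cat sat on the mat",
    "Interest rates affect bank lending",
]
query = "bank interest rates"

doc_tokens = [tokenize(d) for d in documents]
query_tokens = tokenize(query)
vocabulary = sorted(set(query_tokens).union(*doc_tokens))

doc_vectors = [embed(t, vocabulary) for t in doc_tokens]
query_vector = embed(query_tokens, vocabulary)

# Rank documents from most to least similar to the query
ranked = sorted(
    zip(documents, doc_vectors),
    key=lambda pair: cosine_similarity(query_vector, pair[1]),
    reverse=True,
)
for text, vector in ranked:
    print(f"{cosine_similarity(query_vector, vector):.3f}  {text}")
```

Word counts only capture exact word overlap; learned embeddings are used precisely because they can also place differently worded but related texts close together.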


Reference: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

## OpenAI embedding model

* MODEL: text-embedding-ada-002
* TOKENIZER: cl100k_base
* MAX INPUT TOKENS: 8191
* OUTPUT DIMENSIONS: 1536

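In [1]:
import ast
import os

import numpy as np
import openai
import pandas as pd

# Note: openai.Embedding.create is the legacy interface (openai<1.0).
# In openai>=1.0 the equivalent call is OpenAI().embeddings.create(...).
def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before embedding
    text = text.replace("\n", " ")
    response = openai.Embedding.create(input=[text], model=model)
    return response['data'][0]['embedding']


# df is assumed to be an existing DataFrame whose `text` column holds the input strings

# 1. Get the embedding for each text
df['ada_embedding'] = df.text.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))

# 2. Save them for future reuse (create the output directory if it is missing)
os.makedirs('output', exist_ok=True)
df.to_csv('output/embedded_text.csv', index=False)

# 3. Load them; the CSV stores each vector as a string, so parse it back
#    with ast.literal_eval, which is safer than eval
df = pd.read_csv('output/embedded_text.csv')
df['ada_embedding'] = df.ada_embedding.apply(ast.literal_eval).apply(np.array)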
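Saving embeddings to CSV turns each vector into a string, so it has to be parsed on load. As a hedged alternative to the pandas cell above, the sketch below shows the same round trip using only the standard library, with each vector stored as a JSON string. The rows and the in-memory buffer are assumptions for illustration; a real script would open a file on disk instead.

```python
import csv
import io
import json

# Made-up rows standing in for texts and their embeddings
rows = [
    {"text": "hello world", "embedding": [0.1, -0.2, 0.3]},
    {"text": "goodbye", "embedding": [0.0, 0.5, -0.5]},
]

# Write: serialize each vector to a JSON string so it fits in one CSV cell
buffer = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
for row in rows:
    writer.writerow({"text": row["text"], "embedding": json.dumps(row["embedding"])})

# Read: parse each JSON string back into a list of floats
buffer.seek(0)
loaded = [
    {"text": r["text"], "embedding": json.loads(r["embedding"])}
    for r in csv.DictReader(buffer)
]

print(loaded == rows)  # True
```

Parsing with `json.loads` only accepts data, so a tampered file cannot execute code the way a string passed to `eval` could.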

In [None]:
#View the number of tokens

import tiktoken

# Load the cl100k_base tokenizer which is designed to work with the ada-002 model
tokenizer = tiktoken.get_encoding("cl100k_base")

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()
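Once token counts are known, texts longer than the model's input limit (8191 tokens for the model above) need to be shortened or split before embedding. The following is a minimal sketch of fixed-size splitting, not a recommended chunking strategy. `split_into_chunks` and `WhitespaceTokenizer` are hypothetical names introduced here; the function assumes only that the tokenizer has `encode` and `decode` methods, as the tiktoken encoding loaded above does.

```python
def split_into_chunks(text, tokenizer, max_tokens=8191):
    """Split text into pieces of at most max_tokens tokens each.

    `tokenizer` is any object with encode(str) -> list and decode(list) -> str.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = tokenizer.encode(text)
    return [
        tokenizer.decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]


class WhitespaceTokenizer:
    """Hypothetical stand-in so the sketch runs without tiktoken."""

    def encode(self, text):
        return text.split()

    def decode(self, tokens):
        return " ".join(tokens)


chunks = split_into_chunks("one two three four five", WhitespaceTokenizer(), max_tokens=2)
print(chunks)  # ['one two', 'three four', 'five']
```

With a subword tokenizer, cutting at an arbitrary token boundary can split a word in the middle, so splitting on sentence or paragraph boundaries first may give cleaner chunks.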

## BERT

KeyBERT is a minimal, easy-to-use keyword extraction technique that uses BERT embeddings to find the keywords and keyphrases most similar to a document.

* Ref: https://maartengr.github.io/KeyBERT/
* Use cases: apps/central_bank_speech_BERT
