# Test notebook for generating embeddings
- https://huggingface.co/spaces/mteb/leaderboard

## Structure
- Get text
- Chunk text
- Embed text
- Retrieve text

## TODO
- Improve the embedding model. The current model (all-MiniLM-L6-v2) is ranked 42nd on the MTEB leaderboard.
- Improve the vector store. Qdrant?
- Write a benchmarking function to see how long each embedding model takes.
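
A minimal sketch of the benchmarking item, assuming each embedding model can be wrapped as a callable that takes a list of strings. The names `benchmark_embedders` and `dummy_embed` are hypothetical, and the dummy embedder only stands in for a real model so the timing loop can run without any downloads.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[Sequence[str]], List[List[float]]]


def benchmark_embedders(
    embedders: Dict[str, Embedder],
    texts: Sequence[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the mean wall-clock seconds each embedder takes to embed `texts`."""
    timings = {}
    for name, embed in embedders.items():
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            embed(texts)
            runs.append(time.perf_counter() - start)
        timings[name] = mean(runs)
    return timings


# Hypothetical stand-in for a real model: one "dimension" holding the text length.
def dummy_embed(texts: Sequence[str]) -> List[List[float]]:
    return [[float(len(t))] for t in texts]


timings = benchmark_embedders({"dummy": dummy_embed}, ["foo", "bar baz"])
print(timings)
```

With LangChain, an entry such as `{"all-MiniLM-L6-v2": embedding_model.embed_documents}` could be passed in place of the dummy, given that `embed_documents` accepts a list of strings.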

## Get text

In [None]:
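import glob

# Load every markdown file under ./data as a raw string.
# sorted() makes documents[0] deterministic across runs.
documents = []
for name in sorted(glob.glob('./data/*.md')):
    print(name)
    # Use a context manager so each file handle is closed after reading
    with open(name, 'r', encoding='utf-8') as f:
        documents.append(f.read())

len(documents)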

## Chunk texts

In [None]:
import re

text = documents[0]

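# Find the distinct header levels ('#', '##', ...) used in the text.
# MarkdownHeaderTextSplitter expects (marker, name) pairs such as ('#', 'Header 1'),
# not the full header lines.
header_pattern = r'^(#+) .+'
levels = sorted(set(re.findall(header_pattern, text, re.MULTILINE)), key=len)
headers_to_split_on = [(level, f'Header {len(level)}') for level in levels]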

headers_to_split_on

In [None]:
from langchain.text_splitter import CharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# from langchain.vectorstores import Qdrant
# from langchain.document_loaders import TextLoader

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(text)

# Optional: We can do recursive splitting within each document
# # Char-level splits

# chunk_size = 250
# chunk_overlap = 30
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=chunk_size, chunk_overlap=chunk_overlap
# )

# # Split
# splits = text_splitter.split_documents(md_header_splits)
# splits

In [6]:
# Putting it together as a function
# TODO: Document loading seems to remove the markdown headers by default. Is there any way to keep the headers?
from langchain.text_splitter import CharacterTextSplitter, MarkdownHeaderTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
import re

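def chunk(doc):
    """
    Args:
    doc - a LangChain Document (e.g. from DirectoryLoader), not a raw string

    Output:
    headers_to_split_on - (marker, name) pairs such as ('#', 'Header 1'), one per
    header level found in the document. Returned on its own for debugging while
    the markdown split below is commented out.
    """
    text = doc.page_content
    # Capture only the leading '#' run of each header line
    header_pattern = r'^(#+) .+'
    levels = sorted(set(re.findall(header_pattern, text, re.MULTILINE)), key=len)
    # MarkdownHeaderTextSplitter expects markers like '#', not full header lines
    headers_to_split_on = [(level, f'Header {len(level)}') for level in levels]
    return headers_to_split_on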
    # markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    # md_header_splits = markdown_splitter.split_text(text)

    # # Optional char-level splits within each document 
    # chunk_size = 250
    # chunk_overlap = 30
    # text_splitter = RecursiveCharacterTextSplitter(
    #     chunk_size=chunk_size, chunk_overlap=chunk_overlap
    # )
    # splits = text_splitter.split_documents(md_header_splits)
    # return splits


chunk(documents[0])

Lesley Castle: an unfinished novel in letters

✍ Jane Austen

To Henry Thomas Austen Esqre.

Sir

I am now availing myself of the Liberty you have frequently honoured me with of dedicating one of my Novels to you. That it is unfinished, I greive; yet fear that from me, it will always remain so; that as far as it is carried, it should be so trifling and so unworthy of you, is another concern to your obliged humble Servant

The Author

Messrs Demand and Co—please to pay Jane Austen Spinster the sum of one hundred guineas on account of your Humble Servant.

H. T. Austen

L105. 0. 0.

Letter the First: from Miss Margaret Lesley to Miss Charlotte

Lutterell. Lesley Castle Janry 3rd—1792.

My Brother has just left us. “Matilda (said he at parting) you and Margaret will I am certain take all the care of my dear little one, that she might have received from an indulgent, and affectionate and amiable Mother.” Tears rolled down his cheeks as he spoke these words—the remembrance of her, who had s

[]
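
The empty list above suggests that no `#` header lines survived document loading, which matches the TODO in the cell. One possible workaround, sketched here with the standard library only, is to split the raw markdown string (as read in the "Get text" cell) and keep each header together with its body. The function name `split_on_headers` and the sample text are hypothetical; this is not how `MarkdownHeaderTextSplitter` works internally.

```python
import re
from typing import Dict, List

# A header line is one or more '#', a space, then the title text.
HEADER_RE = re.compile(r"^(#+) (.+)$", re.MULTILINE)


def split_on_headers(text: str) -> List[Dict[str, str]]:
    """Split markdown into sections, keeping each header line with its body."""
    matches = list(HEADER_RE.finditer(text))
    sections = []

    # Text before the first header (if any) becomes a header-less section.
    preamble_end = matches[0].start() if matches else len(text)
    preamble = text[:preamble_end].strip()
    if preamble:
        sections.append({"header": "", "content": preamble})

    # Each section runs from the end of its header to the start of the next one.
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append(
            {"header": match.group(0), "content": text[match.end():end].strip()}
        )
    return sections


sample = "intro\n# Title\nbody one\n## Sub\nbody two\n"
sections = split_on_headers(sample)
print([s["header"] for s in sections])
```

A text with no headers at all comes back as a single header-less section, so the function can be applied to every document without special-casing.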

## Other embedding models

In [None]:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
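
Because the embeddings are L2-normalized, the matrix product in `scores` is the cosine similarity between the first text and each of the others, scaled by 100. The sketch below shows the same computation in plain Python on made-up 2-D vectors; `l2_normalize`, `cosine_scores`, and the toy vectors are illustrative names and values, not part of the model code above.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_scores(query: Sequence[float],
                  candidates: Sequence[Sequence[float]]) -> List[float]:
    """Dot product of unit vectors, which equals cosine similarity."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(c))) for c in candidates]


# Toy vectors: same direction, orthogonal, and opposite to the query.
toy_query = [3.0, 4.0]
toy_candidates = [[6.0, 8.0], [4.0, -3.0], [-3.0, -4.0]]

toy_scores = cosine_scores(toy_query, toy_candidates)
best_index = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(toy_scores, best_index)
```

The candidate pointing in the same direction as the query scores close to 1, the orthogonal one close to 0, and the opposite one close to -1, regardless of vector length.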


## Retriever

In [None]:
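from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Qdrant

# from_documents expects a list of Documents and an embedding model object,
# not the precomputed tensor from the cell above.
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
qdrant = Qdrant.from_documents(
    md_header_splits, embedding_model,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="my_documents",
)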


## TF-IDF

In [None]:
from langchain.retrievers import TFIDFRetriever
retriever = TFIDFRetriever.from_texts(["foo", "bar", "world", "hello", "foo bar"])
result = retriever.get_relevant_documents("foo")
result
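
To make the ranking idea behind a TF-IDF retriever concrete, here is a minimal standard-library sketch. It is an illustration of the general technique (smoothed IDF weights plus cosine similarity), not LangChain's implementation, and `tokenize`, `build_tfidf`, `cosine`, and `retrieve` are hypothetical helper names.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_tfidf(texts: Sequence[str]) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Return IDF weights and one sparse TF-IDF vector per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF keeps the weight positive even for very common terms.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, texts: Sequence[str], k: int = 2) -> List[str]:
    """Return the k texts most similar to the query."""
    idf, vectors = build_tfidf(texts)
    # Query terms never seen in the corpus carry no weight.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    ranked = sorted(range(len(texts)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return [texts[i] for i in ranked[:k]]


corpus = ["foo", "bar", "world", "hello", "foo bar"]
top = retrieve("foo", corpus)
print(top)  # ['foo', 'foo bar']
```

The exact match ranks first because its vector points in the same direction as the query, while "foo bar" shares only part of its weight with the query term.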

## Main

In [35]:
%%time
from langchain.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

# Step 1: Load the document(s) and split them into chunks
loader = DirectoryLoader("./data", glob="*.md")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)

# Step 2: Create the embedding model
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Chroma.from_documents embeds the chunks itself, so the manual call is not needed:
# chunks2 = [x.page_content for x in chunks] # HFEmbeddings only accepts str
# embeddings = embedding_model.embed_documents(chunks2)

# Step 3: Embed the chunks, store them in ChromaDB, and save locally.
# db = Chroma.from_embeddings(embeddings)
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db")

# Step 4: Create a retriever
retriever = db.as_retriever()


KeyboardInterrupt



In [36]:
# Step 5: Define the prompt template
# source: https://chat.openai.com/share/c85e64f6-4dd2-4920-b82e-78f128898cbb
template = """Engage in a conversation as though you were a character in a Jane Austen novel.\
 Use formal language and manners, and discuss topics that would be of interest in the early 19th century,\
 such as society, love, and decorum. 

Here is an excerpt of text you may refer to:
{context}

# Question: {question}
# """
prompt = ChatPromptTemplate.from_template(template)

# Step 6: Generate a query
question = "What is love? Baby don't hurt me, don't hurt me, no more."
context = db.similarity_search(question)[0].page_content
print(prompt.format_messages(context=context, question=question)[0].content)

Engage in a conversation as though you were a character in a Jane Austen novel. Use formal language and manners, and discuss topics that would be of interest in the early 19th century, such as society, love, and decorum. 

Here is an excerpt of text you may refer to:
“But that expression of ‘violently in love’ is so hackneyed, so doubtful, so indefinite, that it gives me very little idea. It is as often applied to feelings which arise from a half-hour’s acquaintance, as to a real, strong attachment. Pray, how violent was Mr. Bingley’s love?”

“I never saw a more promising inclination; he was growing quite inattentive to other people, and wholly engrossed by her. Every time they met, it was more decided and remarkable. At his own ball he offended two or three young ladies, by not asking them to dance; and I spoke to him twice myself, without receiving an answer. Could there be finer symptoms? Is not general incivility the very essence of love?”

# Question: What is love? Baby don't hurt

In [None]:
# Step 7: Use Llama.cpp as a prototype. 