## Convert PDF to Markdown

In [None]:
!pip install -q marker-pdf

In [6]:
!DEFAULT_LANG="en" marker data/pdfs/readings data/general/readings/

2024-05-10 17:25:08,913	INFO worker.py:1749 -- Started a local Ray instance.
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Converting 22 pdfs in chunk 1/1 with 5 processes, and storing in /nfshome/tchen307/aqua_rewrite/data/general/readings
100%|███████████████████████████████████████████| 22/22 [02:49<00:00,  7.72s/it]

In [1]:
!DEFAULT_LANG="en" marker data/pdfs/textbook data/general/textbook/

2024-05-10 17:29:32,073	INFO worker.py:1749 -- Started a local Ray instance.
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Converting 13 pdfs in chunk 1/1 with 5 processes, and storing in /nfshome/tchen307/aqua_rewrite/data/general/textbook
100%|███████████████████████████████████████████| 13/13 [01:59<00:00,  9.16s/it]

## Chunk Data

- [LangChain Document Class](https://api.python.langchain.com/en/v0.0.339/schema/langchain.schema.document.Document.html)
- [LangChain MarkdownTextSplitter Class](https://api.python.langchain.com/en/latest/markdown/langchain_text_splitters.markdown.MarkdownTextSplitter.html)

In [23]:
from langchain.text_splitter import MarkdownTextSplitter

md_split = MarkdownTextSplitter(chunk_size=256, chunk_overlap=0)

In [25]:
with open('data/final_exam_logistics.md') as f:
    md_text = f.read()

In [26]:
docs = md_split.create_documents([md_text])

In [27]:
docs[0].page_content

'# Final Exam Logistics\n\n## Time and Location\n- The location of the final exam is to be announced\n- The final exam will be on June 14th 8am - 11am'
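To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window chunker over whitespace-separated words. It is not how `MarkdownTextSplitter` works internally (that class splits on Markdown separators and, by default, measures length in characters); `chunk_words` is a hypothetical helper written only for illustration.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size words.

    Consecutive windows share chunk_overlap words.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError('chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size')
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = 'one two three four five six seven'
print(chunk_words(sample, chunk_size=4, chunk_overlap=2))
# ['one two three four', 'three four five six', 'five six seven']
```

With `chunk_overlap=0`, as in the cell above, the windows do not share any words, so a sentence that straddles a boundary is cut in two.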

### Llama Index Splitter

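LangChain's `MarkdownTextSplitter` did not chunk these documents well, so this section tries LlamaIndex's `SentenceSplitter` instead.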

In [6]:
from pathlib import Path
from langchain_core.documents import Document
from llama_index.core.node_parser import SentenceSplitter

In [4]:
parser = SentenceSplitter(chunk_size=256, chunk_overlap=128)

In [7]:
doc_path = Path('data/general/readings/A_case_for_partially_TAgged_Geometric_history_length branch_prediction/A_case_for_partially_TAgged_Geometric_history_length branch_prediction.md')
with open(doc_path) as f:
    doc = f.read()

In [8]:
txt_chunks = parser.split_text(doc)
chunks = [Document(page_content=chunk, metadata={'source': doc_path.name}) for chunk in txt_chunks]
chunks[0]

Document(page_content='# A Case For (Partially) Tagged Geometric History Length Branch Prediction ⇤\n\nAndr´e Seznec seznec@irisa.fr Pierre Michaud pmichaud@irisa.fr IRISA/INRIA/HIPEAC Campus de Beaulieu, 35042 Rennes Cedex, France\n\n## Abstract\n\nIt is now widely admitted that in order to provide state-of-the-art accuracy, a conditional branch predictor must combine several predictions. Recent research has shown that an adder tree is a very e↵ective approach for the prediction combination function.\n\nIn this paper, we present a more cost e↵ective solution for this prediction combination function for predictors relying on several predictor components indexed with di↵erent history lengths. Using GEometric history length as the O-GEHL predictor, the TAGE\npredictor uses (partially) tagged components as the PPM-like predictor. TAGE relies on\n(partial) hit-miss detection as the prediction computation function. TAGE provides stateof-the-art prediction accuracy on conditional branches.',

## Embed Data

- [BGE Instructions](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding/baai_general_embedding)

In [28]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en')
model = AutoModel.from_pretrained('BAAI/bge-base-en')

In [60]:
import torch
import torch.nn.functional as F

@torch.no_grad()
def bge_embed(text):
    input_toks = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    output = model(**input_toks)  # Last hidden state: [1, seq_len, 768]
    embd = F.normalize(output.last_hidden_state[:, 0, :], p=2, dim=1)
    return embd.reshape(-1)

In [61]:
bge_embed(docs[0].page_content).size()

torch.Size([768])
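The two steps inside `bge_embed` (take the first token's hidden state, then L2-normalize it) can be sketched without a model. The snippet below is a pure-Python illustration on toy numbers; `cls_pool` and `l2_normalize` are hypothetical helper names, and the 2-dimensional vectors stand in for the 768-dimensional hidden states.

```python
import math


def l2_normalize(vec):
    """Scale vec to unit Euclidean length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def cls_pool(hidden_states):
    """Return the first token's vector from a [seq_len][hid_dim] list of lists."""
    return hidden_states[0]


# Toy hidden states: 3 tokens, 2 dimensions
toy_hidden = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
embd = l2_normalize(cls_pool(toy_hidden))
print(embd)  # [0.6, 0.8]
```

Because every embedding ends up with unit length, distances between embeddings depend only on the angle between them.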

## Create Index

- FAISS could not be installed in this environment, so the index uses Annoy instead
- [Spotify Annoy](https://github.com/spotify/annoy)

[Tuning `n_trees` and `search_k`](https://github.com/spotify/annoy?tab=readme-ov-file#tradeoffs)

In [56]:
from annoy import AnnoyIndex

In [73]:
bge_idx = AnnoyIndex(768, 'euclidean')

In [74]:
for i, doc in enumerate(docs):
    embd = bge_embed(doc.page_content)
    bge_idx.add_item(i, embd)
bge_idx.build(3)
bge_idx.save('bge_idx.ann')

True

In [75]:
bge_idx = AnnoyIndex(768, 'euclidean')
bge_idx.load('bge_idx.ann')
query = 'Can I bring my own scratch paper?'
q_embd = bge_embed(query)
nn_vs = bge_idx.get_nns_by_vector(q_embd, 3, include_distances=True)

In [76]:
for i, dist in zip(*nn_vs):
    print(f'doc={docs[i].page_content}\n{dist=:.4f}\n===========')

doc=Anything beyond your own pens and calculators are prohibited during the exam. Therefore, you cannot use any of the following during the exam:
- You cannot use your own scratch paper
- You cannot have a cheatsheet
dist=0.5346
===========
doc=## Items Allowed
You may use your own pens and calculators (not those ones on mobile phones).
dist=0.6098
===========
doc=- You cannot borrow/exchange anything during the exam
dist=0.6457
===========
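Since the embeddings are L2-normalized, the Euclidean distances reported above map directly to cosine similarity: for unit vectors, dist² = 2 − 2·cos. The sketch below checks that identity on toy vectors; the helper names are hypothetical and nothing here calls Annoy itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dist_to_cosine(dist):
    """For unit vectors: dist^2 = 2 - 2*cos, so cos = 1 - dist^2 / 2."""
    return 1.0 - dist * dist / 2.0


u = [0.6, 0.8]  # unit length
v = [1.0, 0.0]  # unit length
d = euclidean(u, v)
print(round(cosine(u, v), 4), round(dist_to_cosine(d), 4))  # 0.6 0.6
```

Under this identity, the closest hit above (`dist=0.5346`) corresponds to a cosine similarity of roughly 0.857.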


## Build BGE Index

In [4]:
import os
from pathlib import Path
import torch
import torch.nn.functional as F
from annoy import AnnoyIndex
from langchain.text_splitter import MarkdownTextSplitter
from transformers import AutoTokenizer, AutoModel

In [5]:
# Must be the literal string 'true' or 'false'; 'false' avoids tokenizer fork warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [6]:
def create_docs(md_text_dir, chunk_size, chunk_overlap):
    md_texts = []
    md_split = MarkdownTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    md_text_paths = list(Path(md_text_dir).rglob('*.md'))
    for md_text_path in md_text_paths:
        with open(md_text_path) as f:
            md_text = f.read()
        md_texts.append(md_text)
    
    doc_metadatas = [{'source': md_text_path.name} for md_text_path in md_text_paths]
    docs = md_split.create_documents(md_texts, metadatas=doc_metadatas)

    return docs

docs_ = create_docs('data/quiz', 512, 128)
len(docs_)

551

In [8]:
def init_bge_idx():
    model_name = 'BAAI/bge-base-en'
    hid_dim = 768

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to('cuda')
    bge_idx = AnnoyIndex(hid_dim, 'euclidean')
    
    @torch.no_grad()
    def bge_embed(text):
        input_toks = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
        output = model(**{k: v.to(model.device) for k, v in input_toks.items()})  # Last hidden state: [1, seq_len, 768]
        embd = F.normalize(output.last_hidden_state[:, 0, :], p=2, dim=1)
        return embd.to('cpu').reshape(-1)

    return bge_embed, bge_idx

In [21]:
def create_bge_idx(docs, n_trees):
    bge_embed, bge_idx = init_bge_idx()

    for i, doc in enumerate(docs):
        embd = bge_embed(doc.page_content)
        bge_idx.add_item(i, embd)
    
    bge_idx.build(n_trees)
    bge_idx.save('bge_idx.ann')

In [9]:
def init_query_bge_idx(docs, top_k):
    bge_embed, bge_idx = init_bge_idx()

    file_name = 'bge_idx.ann'
    assert Path(file_name).exists(), f'Index {file_name} not found.'
    bge_idx.load(file_name)

    def query_bge_idx(query):
        '''Return the top_k nearest Documents, sorted farthest first, each carrying its distance as `dist`.'''
        q_embd = bge_embed(query)
        idxs, dists = bge_idx.get_nns_by_vector(q_embd, top_k, include_distances=True)

        top_docs = []
        for dist, idx in sorted(zip(dists, idxs), reverse=True):
            top_doc = docs[idx].copy(update={'dist': dist})
            top_docs.append(top_doc)

        return top_docs

    return query_bge_idx
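One detail of `query_bge_idx` worth seeing in isolation is the result order. Sorting `(dist, idx)` pairs with `reverse=True` puts the farthest neighbour first and the best match last. The values below are made up for illustration and only mimic the `(indices, distances)` shape that `get_nns_by_vector(..., include_distances=True)` returns.

```python
# Hypothetical search result: document indices and their distances
idxs = [12, 7, 30]
dists = [0.45, 0.52, 0.61]

# Same expression as in query_bge_idx: largest distance first
farthest_first = sorted(zip(dists, idxs), reverse=True)
# Dropping reverse=True would put the best match first instead
closest_first = sorted(zip(dists, idxs))

print(farthest_first)  # [(0.61, 30), (0.52, 7), (0.45, 12)]
print(closest_first)   # [(0.45, 12), (0.52, 7), (0.61, 30)]
```

If the chunks are later joined in this order to form a prompt, farthest-first ordering places the best match nearest to the query text. Whether that is the intent here is an assumption; the notebook does not say.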

In [31]:
create_bge_idx(docs_, 1024)
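A value such as `n_trees=1024` trades build time and index size for accuracy. One way to judge whether a smaller value would do is to compare the approximate neighbours with exact brute-force neighbours for a few queries. The sketch below shows that comparison on toy 2-D vectors; `exact_top_k` and `recall_at_k` are hypothetical helpers, and the "approximate" result is hard-coded rather than produced by Annoy.

```python
import math


def exact_top_k(query, vectors, k):
    """Indices of the k vectors closest to query by Euclidean distance (brute force)."""
    dists = [(math.dist(query, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(dists)[:k]]


def recall_at_k(approx_idxs, exact_idxs):
    """Fraction of the exact neighbours that the approximate search also returned."""
    if not exact_idxs:
        return 1.0
    return len(set(approx_idxs) & set(exact_idxs)) / len(exact_idxs)


vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
exact = exact_top_k([0.9, 0.1], vectors, k=2)
print(exact)                       # [1, 0]
print(recall_at_k([1, 3], exact))  # 0.5
```

In practice the first argument to `recall_at_k` would be the index list returned by the Annoy query for the same vector.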

In [17]:
query_bge_idx = init_query_bge_idx(docs_, 5)
query = 'how does TAGE predictor match branch with a table.'
top_docs_ = query_bge_idx(query)
for top_doc in top_docs_:
    print(f'doc: {top_doc.page_content}\nsource: {top_doc.metadata["source"]}\ndist={top_doc.dist:.4f}')
    print('='*30)

doc: or if the anticipated branch address were found to be incorrect, a small *gshare* table would be consulted for a quick prediction. The study shows that a similar predictor, using two *gshare* tables, is able to use the larger table 47% of the time.
source: dynamic_branch_prediction_with_perceptrons.md
dist=0.5237
==============================
doc: # A Study Of Branch Prediction Strategies

JAMES E. SMITH 
Control Data Corporation Arden Hills, Minnesota 

## Abstract
source: a_study_of_branch_prediction_strategies.md
dist=0.5221
==============================
doc: However, in Section 4, we present the simulation results of the TAGE predictor strictly respecting the 1st Championship Branch Prediction Rules.

## 2.3 Information Used For Indexing The Branch Predictor

For computing the indexes for global history predictors, most studies consider either hashing the conditional branch history with the branch address or hashing the path history with the branch address [22]. Both these solutions lead to consider distinct paths as equal.
source: a_c

## Build GTE Index

In [45]:
# Sanity check: each chunk is tokenized on its own (no batching), so nothing is padded
# and the attention mask should be all ones. If so, an unmasked mean over tokens is safe.
model_name = 'thenlper/gte-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to('cuda')
for doc_ in docs_:
    input_toks = tokenizer(doc_.page_content, padding=True, truncation=True, return_tensors='pt')
    if not torch.allclose(input_toks['attention_mask'], torch.ones_like(input_toks['attention_mask'])):
        print('Gotcha!')

In [4]:
def init_gte_idx():
    model_name = 'thenlper/gte-base'
    hid_dim = 768

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to('cuda')
    gte_idx = AnnoyIndex(hid_dim, 'euclidean')
    
    @torch.no_grad()
    def gte_embed(text):
        input_toks = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
        output = model(**{k: v.to(model.device) for k, v in input_toks.items()})  # Last hidden state: [1, seq_len, 768]
        embd = F.normalize(output.last_hidden_state.mean(dim=1), p=2, dim=1)
        return embd.to('cpu').reshape(-1)

    return gte_embed, gte_idx
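`gte_embed` averages over every token position, which equals masked average pooling only while the attention mask is all ones, as the check cell above confirms for single, unpadded inputs. If texts were ever batched with padding, the padded positions would need to be excluded. The following pure-Python sketch shows the difference on toy numbers; `masked_mean_pool` is a hypothetical helper, not part of any library.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask is 1; hidden_states is [seq_len][hid_dim]."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError('attention_mask selects no tokens')
    # zip(*kept) iterates over hidden dimensions, so each col holds one dimension's values
    return [sum(col) / len(kept) for col in zip(*kept)]


# The last row stands in for a padding token with arbitrary values
toy_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]

print(masked_mean_pool(toy_hidden, [1, 1, 0]))  # [2.0, 3.0]
print(masked_mean_pool(toy_hidden, [1, 1, 1]))  # padding leaks into the average
```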

In [11]:
def create_gte_idx(docs, n_trees):
    gte_embed, gte_idx = init_gte_idx()

    for i, doc in enumerate(docs):
        embd = gte_embed(doc.page_content)
        gte_idx.add_item(i, embd)
    
    gte_idx.build(n_trees)
    gte_idx.save('gte_idx.ann')

In [16]:
def init_query_gte_idx(docs, top_k):
    gte_embed, gte_idx = init_gte_idx()

    file_name = 'gte_idx.ann'
    assert Path(file_name).exists(), f'Index {file_name} not found.'
    gte_idx.load(file_name)

    def query_gte_idx(query):
        '''Return the top_k nearest Documents, sorted farthest first, each carrying its distance as `dist`.'''
        q_embd = gte_embed(query)
        idxs, dists = gte_idx.get_nns_by_vector(q_embd, top_k, include_distances=True)

        top_docs = []
        for dist, idx in sorted(zip(dists, idxs), reverse=True):
            top_doc = docs[idx].copy(update={'dist': dist})
            top_docs.append(top_doc)

        return top_docs

    return query_gte_idx

In [12]:
create_gte_idx(docs_, 1024)

In [17]:
query_gte_idx = init_query_gte_idx(docs_, 5)
top_docs_ = query_gte_idx('how does TAGE predictor match branch with a table.')
for top_doc in top_docs_:
    print(f'doc: {top_doc.page_content}\nsource: {top_doc.metadata["source"]}\ndist={top_doc.dist:.4f}')
    print('='*30)

doc: The last stage in the prediction computation on the TAGE predictor consists in the tag match followed by the prediction selection. The tag match computations are performed in parallel on the tags flowing out from the tagged components.
source: a_case_for_partially_tagged_geometric_history_length_branch_prediction.md
dist=0.4527
==============================
doc: However, in Section 4, we present the simulation results of the TAGE predictor strictly respecting the 1st Championship Branch Prediction Rules.

## 2.3 Information Used For Indexing The Branch Predictor

For computing the indexes for global history predictors, most studies consider either hashing the conditional branch history with the branch address or hashing the path history with the branch address [22]. Both these solutions lead to consider distinct paths as equal.
source: a_case_for_partially_tagged_geometric_history_length_branch_prediction.md
dist=0.4506
==============================
doc: We present the TAGE conditional branch predictor. TAGE stands for TAgged GEometric his

## LLM

[StableLM Zephyr 3B](https://huggingface.co/stabilityai/stablelm-zephyr-3b)

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [3]:
model_name = 'stabilityai/stablelm-zephyr-3b'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [16]:
context = '\n---\n'.join(top_doc.page_content for top_doc in top_docs_)
print(context)

or if the anticipated branch address were found to be incorrect, a small *gshare* table would be consulted for a quick prediction. The study shows that a similar predictor, using two *gshare* tables, is able to use the larger table 47% of the time.
---
# A Study Of Branch Prediction Strategies

JAMES E. SMITH 
Control Data Corporation Arden Hills, Minnesota 

## Abstract
---
However, in Section 4, we present the simulation results of the TAGE predictor strictly respecting the 1st Championship Branch Prediction Rules.

## 2.3 Information Used For Indexing The Branch Predictor

For computing the indexes for global history predictors, most studies consider either hashing the conditional branch history with the branch address or hashing the path history with the branch address [22]. Both these solutions lead to consider distinct paths as equal.
---
For a 8-component TAGE predictor, we use respectively 9-bit tags for T1 and T2, 10-bit tags for T3 and T4, 11-bit tags for T5 and T6, 12-bit ta

In [28]:
instr = '''\
Context information is below.
===
{context}
===
Given the context information above and not prior knowledge, answer the query.
Respond "Sorry I cannot answer that" if no relevant information is in the context.
Query: {query}'''
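To see what the model actually receives, the template can be filled in without loading any model. The sketch below repeats the template under the name `PROMPT_TEMPLATE` so that it runs on its own; `build_prompt` is a hypothetical helper, and the two chunk strings are placeholders rather than real retrieval results.

```python
PROMPT_TEMPLATE = '''\
Context information is below.
===
{context}
===
Given the context information above and not prior knowledge, answer the query.
Respond "Sorry I cannot answer that" if no relevant information is in the context.
Query: {query}'''


def build_prompt(chunks, query):
    """Join retrieved chunks with a '---' separator line and fill in the template."""
    context = '\n---\n'.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)


prompt = build_prompt(
    ['Placeholder chunk one.', 'Placeholder chunk two.'],
    'How does TAGE match a branch?',
)
print(prompt)
```

Braces inside a chunk are safe here because the chunks are passed as values to `str.format` and are never part of the format string itself.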

In [32]:
prompt = [{'role': 'user', 'content': instr.format(context=context, query=query)}]
in_toks = tokenizer.apply_chat_template(prompt, add_generation_prompt=True, return_tensors='pt').to(model.device)
# Single unpadded prompt: attend to every token. do_sample=False gives greedy decoding (replaces temperature=0.0).
out_toks = model.generate(in_toks, attention_mask=in_toks.new_ones(in_toks.shape), pad_token_id=tokenizer.eos_token_id,
                          max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
answer = tokenizer.decode(out_toks[0, in_toks.shape[1]:], skip_special_tokens=True)
