In [62]:
import json 
import numpy as np
import matplotlib.pyplot as plt 
from tqdm import tqdm

import os
from sentence_transformers import util
import torch 
import torch.nn.functional as F 

Motivation: combine T5 with a rerank step and pre-computed embeddings to build a Hoffbot that can answer some of the questions we ask and recognize when it can't answer them.

## Issues:
* long-range dependencies between ideas throughout a video can't be captured strictly through retrieval of fixed sentence blocks
    * might be able to resolve this by embedding a whole doc? But we can't pass whole-doc-level context to T5
    * possibly use another model to summarize each video and concatenate the summary to the queried collection
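
One way to try the summary idea is to keep track of which video each chunk came from and prefix the chunk with that video's summary before it reaches T5. This is only a sketch: the helper name `attach_summaries`, the prompt template, and the toy summaries are all hypothetical, and producing the summaries with a second model is not shown.

```python
def attach_summaries(chunks_by_video, summaries,
                     template="Summary: {s}\n\nExcerpt: {c}"):
    """Prefix every chunk with the summary of the video it came from.

    chunks_by_video: list of lists of chunk strings, one inner list per video
    summaries:       list of summary strings, aligned with chunks_by_video
    Returns a flat list of (video_index, augmented_chunk) pairs.
    """
    if len(chunks_by_video) != len(summaries):
        raise ValueError("need exactly one summary per video")

    augmented = []
    for video_idx, (chunks, summary) in enumerate(zip(chunks_by_video, summaries)):
        for chunk in chunks:
            augmented.append((video_idx, template.format(s=summary, c=chunk)))
    return augmented


# Toy data standing in for real transcripts and model-written summaries
chunks_by_video = [["Grind finer.", "Use 15 g of coffee."], ["Descale monthly."]]
summaries = ["A V60 brewing guide.", "Espresso machine maintenance."]

for video_idx, text in attach_summaries(chunks_by_video, summaries):
    print(video_idx, repr(text))
```

The summary takes up part of T5's input budget, so it would need to stay short.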
    

In [63]:
from sentence_transformers import SentenceTransformer, CrossEncoder

semb_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
xenc_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [68]:
#load corpus 
with open('./clean/hoff.json') as f:
    hoff = json.loads(f.read())
    
hoff = hoff['data']

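# Group sentences into blocks ("contexts") of n sentences each.
# Blocks are separated by `delimiter` (assumed to be a single character).
def paragraphize(text, n=5, delimiter='\n'):
    sentences = text.split('. ')  # naive split on '. '
    assert len(sentences) > n, 'need more than n sentences'

    # NOTE: i starts at n, so the first n sentences are dropped;
    # use i = 0 to keep them.
    i = n
    blocks = []
    while i < len(sentences):
        block = '. '.join(sentences[i:i+n])
        block += f'.{delimiter}'
        blocks.append(block)
        i += n

    # Drop the trailing delimiter from the last block
    blocks[-1] = blocks[-1][:-1]
    return ''.join(blocks)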

#print(paragraphize(hoff[0]['text'], n = 3))
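
For comparison, here is a variant that returns a list directly, starts from the first sentence, and can overlap consecutive chunks. It is a sketch, not the notebook's method: `chunk_sentences` and its `stride` parameter are hypothetical names, and it uses the same naive `'. '` split.

```python
def chunk_sentences(text, n=5, stride=None):
    """Split text into chunks of n sentences; stride < n gives overlapping chunks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    stride = n if stride is None else stride
    if stride < 1:
        raise ValueError("stride must be at least 1")

    # Naive sentence split on '. ', dropping empty fragments
    sentences = [s.strip() for s in text.split('. ') if s.strip()]

    chunks = []
    for start in range(0, len(sentences), stride):
        chunk = '. '.join(sentences[start:start + n])
        # Restore the full stop that the split removed from the last sentence
        if not chunk.endswith(('.', '?', '!')):
            chunk += '.'
        chunks.append(chunk)
        if start + n >= len(sentences):
            break  # this window already reached the end of the text
    return chunks


print(chunk_sentences("A. B. C. D. E.", n=2))
print(chunk_sentences("A. B. C. D. E.", n=2, stride=1))
```

Overlapping chunks let an idea that straddles a block boundary appear whole in at least one chunk, at the cost of a larger index.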

In [74]:
joined = ' '.join([h['text'] for h in hoff])
contexts = paragraphize(joined, n=10, delimiter='\t').split('\t')
videos = [v['text'] for v in hoff]

len(contexts)

2795

In [75]:
print(contexts[1])

Now, today is minimally my opinions. What I want to do is give you the bare bones, the layouts, the differences between the certifications, but I'm not gonna weigh in too heavily on my own personal opinions until maybe towards the end. The questions and the topics are as follows. Firstly, price for coffee. Is there a guaranteed premium or price? How much is it? Then labour practices. What do these certifications consider essential to prevent things like exploitation? And then farming practices. What is considered important in how plants are treated or soil is treated? Then eligibility, who actually can attain or get that certification? And then the cost. Who is on the hook for the cost of the certification? Who pays? And finally, we're gonna touch on how these certifications impact coffee quality. I'm not gonna lie. This video was tricky to make.


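We should probably limit the context size. The tokenizer call below uses `truncation=True`, so anything past the model's maximum input length is simply cut off. Whether other pipelines pool the contexts obtained from separate windows when the limit is exceeded is unclear.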
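
One rough way to keep a context within budget is to cut it at a sentence boundary before it reaches the prompt. The sketch below counts whitespace-separated words as a stand-in for tokens; real T5 token counts are usually higher. The helper name `trim_to_budget` and the default of 300 words are arbitrary assumptions.

```python
def trim_to_budget(context, max_words=300):
    """Keep whole sentences from the start of context until max_words is reached.

    The first sentence is always kept, even if it alone exceeds the budget.
    """
    sentences = context.split('. ')
    kept, used = [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if kept and used + n_words > max_words:
            break
        kept.append(sentence)
        used += n_words

    trimmed = '. '.join(kept)
    # Re-add the full stop if the text was cut at a sentence boundary
    if len(kept) < len(sentences) and not trimmed.endswith('.'):
        trimmed += '.'
    return trimmed


print(trim_to_budget("one two. three four five. six", max_words=5))
```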

In [78]:
import os
import pickle

# Define embeddings cache path
embeddings_cache_path = './embeddings/hoff10_embeddings_cache.pkl'


# Load cache if available
if os.path.exists(embeddings_cache_path):
    print('Loading embeddings cache')
    with open(embeddings_cache_path, 'rb') as f:
        corpus_embeddings = pickle.load(f)
# Else compute embeddings
else:
    print('Computing embeddings')
    corpus_embeddings = semb_model.encode(contexts, convert_to_tensor=True, show_progress_bar=True)
    # Save the embeddings to a file for future loading
    print(f'Saving index to: \'{embeddings_cache_path}\'')
    with open(embeddings_cache_path, 'wb') as f:
        pickle.dump(corpus_embeddings, f)

Computing embeddings


Batches:   0%|          | 0/88 [00:00<?, ?it/s]

Saving index to: './embeddings/hoff10_embeddings_cache.pkl'
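
The load-or-compute pattern in this cell can be factored into a small helper. This is a sketch with a hypothetical name, `load_or_compute`; it also creates the parent directory, which the cell above assumes already exists.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Return the pickled object at cache_path, computing and saving it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)

    result = compute_fn()
    # Create the parent directory so the first run does not fail on a missing folder
    parent = os.path.dirname(cache_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result


# Demo with a throwaway directory and a plain list standing in for embeddings
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'embeddings', 'demo_cache.pkl')
    first = load_or_compute(path, lambda: [0.1, 0.2, 0.3])  # computed and saved
    second = load_or_compute(path, lambda: [9.9])           # read from the cache
    print(first == second)
```

In the notebook this would be called as `load_or_compute(embeddings_cache_path, lambda: semb_model.encode(contexts, convert_to_tensor=True))`. The cache is keyed only by its path, so the filename has to change whenever the chunking changes, otherwise stale embeddings are loaded.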


In [79]:
corpus_embeddings.shape

torch.Size([2795, 384])

In [53]:
# with only a few thousand contexts we could skip the ANN index and just take the top-k inner products
query_embedding = semb_model.encode("what is the best cheap espresso machine?", convert_to_tensor=True)

indices = torch.topk(corpus_embeddings@query_embedding, k=10)[1]
for i in indices:
    print(contexts[i.item()]+'\n')

I have three options for you at three different price points. And just so you know, this cost about 25 pounds, this cost 50 pounds, and this cost 70 pounds. All of them are cheaper than any decent espresso machine that might provide you some sort of steam with which you could steam milk. So while they do involve spending some money, they involve spending less money. Let's start with this one because it might look a little bit familiar.

So as we luxuriate here on this roof, I think I should probably wrap up and tell you which I think is the best portable espresso maker, 'cause it's not a simple question to answer. Because I would say probably the best espresso maker is maybe the Flair, but it's not the best portable espresso maker. And I would say the best sort of most portable thing is clearly the Wacaco Picopresso, but it needs a few things, it needs a few tweaks, it needs a new basket to be better. But out of the box, it's not the best espresso maker. The Handpresso is just out, I d
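
The brute-force search above ranks by inner product, which equals cosine similarity only when the vectors have unit length. The dependency-free sketch below makes the normalization explicit. The name `top_k_cosine` and the toy vectors are hypothetical; in the notebook the same thing would be done on tensors, for example with `F.normalize` and `torch.topk`.

```python
import math


def top_k_cosine(query, corpus, k=3):
    """Return (index, cosine_similarity) for the k corpus vectors closest to query."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in v]

    q = normalize(query)
    scores = []
    for i, vec in enumerate(corpus):
        v = normalize(vec)
        scores.append((i, sum(a * b for a, b in zip(q, v))))

    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


corpus = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
print(top_k_cosine([1.0, 1.0], corpus, k=2))
```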

In [54]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", 
                                                   device_map="auto", 
                                                   torch_dtype=torch.float32
                                                  )



In [55]:
# let's build an ANN index for searching, since the preliminary cosine similarity search was causing trouble
import os
import hnswlib

# Create empty index
index = hnswlib.Index(space='cosine', dim=384)

# Define hnswlib index path
!mkdir './ann_index/'
index_path = './ann_index/hoff5.index'

# Load index if available
if os.path.exists(index_path):
    print('Loading index...')
    index.load_index(index_path)
# Else index data collection
else:
    # Initialise the index
    print('Start creating HNSWLIB index')
    index.init_index(max_elements=corpus_embeddings.size(0), ef_construction=400, M=64)
    #  Compute the HNSWLIB index (it may take a while)
    index.add_items(corpus_embeddings.cpu(), list(range(len(corpus_embeddings))))
    # Save the index to a file for future loading
    print(f'Saving index to: {index_path}')
    index.save_index(index_path)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
mkdir: ./ann_index/: File exists
Loading index...
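
HNSW search is approximate, and with a corpus this small the exact top-k is cheap to compute, so the two can be compared directly. The sketch below measures how many of the exact neighbours the approximate search recovered. `recall_at_k` is a hypothetical helper and the id lists are toy values.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & exact) / len(exact)


# Toy ids; in the notebook these would come from index.knn_query(...) and torch.topk(...)
print(recall_at_k([4, 8, 15, 16], [4, 8, 15, 23]))
```

If recall turns out low, raising the query-time `ef` via `index.set_ef(...)` is the usual knob to try in hnswlib.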


In [56]:
# reranked answering: retrieve with the ANN index, rerank with the cross-encoder, answer with T5
def answer_bot_skeleton(query, corpus, corpus_embeddings, n_responses = 1):
    #make sure it's a question (also safe for an empty query)
    query = query.strip()
    if not query.endswith('?'):
        query += '?'

    #embed
    query_embedding = semb_model.encode(query, convert_to_tensor=True)

    #exact cosine similarity -- might be computationally intensive
    #indices = torch.topk(corpus_embeddings@query_embedding, k = 10)[1]
    indices, distances = index.knn_query(query_embedding.cpu(), k=32)

    #rerank the 32 candidates and keep the best n_responses
    relevant_subset = [corpus[i] for i in indices[0]]
    reranked_indices = np.argsort(-xenc_model.predict([[query, c] for c in relevant_subset]))
    reranked = [relevant_subset[r] for r in reranked_indices[0:n_responses]]
    #print(reranked)

    #t5
    # If unanswerable, return \"I don\'t know\".
    prompt_eng = lambda x,y: f'Answer the below question with the context provided.\n\nQuestion: {x}\n\nContext: {y}'
    #prompt_eng = lambda x,y: f'Give me a long answer to the following question: {x}, using information from this context: {y}'
    input_texts = [prompt_eng(query, context) for context in reranked]
    # keep the attention mask so padded positions are ignored when n_responses > 1
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True).to(device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    output_texts = [tokenizer.decode(o, skip_special_tokens=True) for o in output_ids]

    return output_texts
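
The function above always returns an answer, even when none of the retrieved passages is relevant. One possible way to make the bot aware of that is to treat the cross-encoder score of the best passage as a confidence signal and abstain below a cutoff. This is only a sketch: `answer_or_abstain`, the `threshold` value, and the fallback text are hypothetical. The score scale depends on the reranker, so the cutoff would have to be calibrated on real questions.

```python
def answer_or_abstain(answers, rerank_scores, threshold=0.0, fallback="I don't know."):
    """Return the answer from the best-scoring passage, or fallback if it scores below threshold."""
    if len(answers) != len(rerank_scores):
        raise ValueError("answers and rerank_scores must have the same length")
    if not answers:
        return fallback

    best = max(range(len(answers)), key=lambda i: rerank_scores[i])
    if rerank_scores[best] < threshold:
        return fallback
    return answers[best]


# Toy scores standing in for xenc_model.predict(...) outputs
print(answer_or_abstain(["first crack", "second crack"], [4.2, 1.3]))
print(answer_or_abstain(["between 50 and 100"], [-6.0]))
```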

In [61]:
queries = ["what is the crack called when roasting",
            "how much caffeine is in instant coffee?", 
           "what is the ideal ratio for coffee grounds to water for the v60?",
           "how expensive is a good espresso machine?",
           "how long to wait after roasting coffee before i can drink it?",
           "how many cups of coffee is too much?",
           "if i underextracted my coffee how does it taste"
          ]

for q in queries:
    print(f'Q: {q}')
    print(f'A: {answer_bot_skeleton(q, contexts, corpus_embeddings, n_responses = 1)[0]}\n')

Q: what is the crack called when roasting
A: first crack

Q: how much caffeine is in instant coffee?
A: just under 40 milligrams

Q: what is the ideal ratio for coffee grounds to water for the v60?
A: 30 to 500

Q: how expensive is a good espresso machine?
A: $575

Q: how long to wait after roasting coffee before i can drink it?
A: the next day

Q: how many cups of coffee is too much?
A: between 50 and 100

Q: if i underextracted my coffee how does it taste
A: sourness and also bitterness



some decent questions
- "what is the ideal ratio for coffee grounds to water for the v60?"

### `chromadb` instead of the ANN
- needed to specify `duckdb+parquet` as the backend to store data locally; to load it again afterwards, create a client with the same persist directory
- need to call `client.persist()` manually after updating the collection
- https://github.com/chroma-core/chroma/blob/main/examples/local_persistence.ipynb

In [81]:
from chromadb import Client
from chromadb.config import Settings
client = Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory=".chromadb/" # Optional, defaults to .chromadb/ in the current directory
))

In [82]:
#yea this is like the same as the one from tutorial 
collection = client.create_collection(name="hoff10", 
                                      metadata={"hnsw:space": "cosine"})

In [80]:
# we already have the embeddings so I don't wanna re-encode; just provide some ids and docs 
docs = list(contexts)

# import re 
# title_format = lambda x: '_'.join(re.findall(r'\b\w+\b', x))

# ids = []
# for h in hoff:
#     pre = title_format(h['title'])
    
#     paragraphs = paragraphize(h['text'], n=5, delimiter='\t').split('\t') 
    
#     for i, p in enumerate(paragraphs):
#         ids.append(pre+f'_{i}')
        
# assert len(docs) == len(ids)

ids = [f'id{i}' for i in range(len(contexts))]
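
The commented-out code above points at more descriptive ids built from video titles. A runnable sketch of that idea follows. It assumes the chunks are built per video rather than from the joined transcript, and that each entry has a `title` field as the commented code implies. `make_chunk_ids` is a hypothetical helper.

```python
import re


def make_chunk_ids(titles, chunks_per_video):
    """Build ids like 'Some_Title_0' from video titles and per-video chunk counts."""
    ids = []
    seen = set()
    for title, n_chunks in zip(titles, chunks_per_video):
        # Keep only word characters, joined by underscores
        slug = '_'.join(re.findall(r'\b\w+\b', title)) or 'untitled'
        for i in range(n_chunks):
            chunk_id = f'{slug}_{i}'
            if chunk_id in seen:
                # Ids must be unique within a collection; two videos with the same title would clash
                raise ValueError(f'duplicate id: {chunk_id}')
            seen.add(chunk_id)
            ids.append(chunk_id)
    return ids


print(make_chunk_ids(["The Best V60!", "Espresso 101"], [2, 1]))
```

Ids of this form make it possible to trace a retrieved chunk back to its video, which the plain `id{i}` scheme does not.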

In [9]:
#client.delete_collection('hoff5')

In [83]:
#okay 
collection.add(
    documents = docs,
    embeddings= corpus_embeddings.tolist(),
    ids=ids,
)
client.persist()

True

In [85]:
queries = ["how much caffeine is in instant coffee?", 
           "what is the ideal ratio for coffee grounds to water for the v60?",
           "how expensive is a good espresso machine?",
           "how long to wait after roasting coffee before i can drink it?",
           "how many cups of coffee is too much?",
           "how much pressure is needed to make espresso",
           "what is the ratio of coffee grounds to water in the v60"
          ]

# encode all queries in one batch; .tolist() gives one embedding (list of floats) per query
embs = semb_model.encode(queries, convert_to_tensor=True).tolist()

In [86]:
#query 
result = collection.query(
    query_embeddings=embs,
    n_results = 10,
)

In [87]:
from pprint import pprint
pprint(result)

{'distances': [[0.25181931257247925,
                0.26629960536956787,
                0.33985745906829834,
                0.3504807949066162,
                0.3748343586921692,
                0.38581812381744385,
                0.3911476731300354,
                0.3974166512489319,
                0.42317086458206177,
                0.4299890995025635],
               [0.3496313691139221,
                0.38709378242492676,
                0.3889116644859314,
                0.3953595757484436,
                0.4018215537071228,
                0.40209364891052246,
                0.4050033688545227,
                0.42367053031921387,
                0.4250156879425049,
                0.43084603548049927],
               [0.2579856514930725,
                0.30593645572662354,
                0.3154538869857788,
                0.326404333114624,
                0.3477988839149475,
                0.35346847772598267,
                0.3607497215270996,
                

In [47]:
len(result['documents'])

5
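
The result dict holds one inner list per query under each key, which is why `len(result['documents'])` equals the number of queries rather than the number of hits. The sketch below flattens that nesting into rows. It assumes the cosine space configured via `hnsw:space`, where distance is 1 minus cosine similarity. `flatten_results` is a hypothetical helper and the toy values are made up.

```python
def flatten_results(queries, result):
    """Turn a Chroma-style query result into (query, document, similarity) rows."""
    rows = []
    for query, docs, dists in zip(queries, result['documents'], result['distances']):
        for doc, dist in zip(docs, dists):
            # With a cosine space, distance = 1 - cosine similarity
            rows.append((query, doc, 1.0 - dist))
    return rows


# Toy result with the same nesting as the output above
toy_result = {
    'documents': [['doc a', 'doc b'], ['doc c', 'doc d']],
    'distances': [[0.25, 0.5], [0.125, 0.75]],
}
for row in flatten_results(['q1', 'q2'], toy_result):
    print(row)
```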