# Build the Semantic Search Index
Following the <a href="https://www.sbert.net/examples/applications/semantic-search/README.html" target="_blank">sbert</a> documentation, I will use an MS MARCO model:

*"For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense."
"Suitable models for asymmetric semantic search: Pre-Trained MS MARCO Models"*

**In the Training-Transformers folder, I built a notebook to fine-tune this model.**

In [3]:
corpus = [
    'A mining pool is a joint group of cryptocurrency miners who combine their computational resources over a network to strengthen the probability of finding a block or otherwise successfully mining for cryptocurrency.',
    'Individually, participants in a mining pool contribute their processing power toward the effort of finding a block. If the pool is successful in these efforts, they receive a reward, typically in the form of the associated cryptocurrency.',
    'Mining is the process of extracting useful materials from the earth. Some examples of substances that are mined include coal, gold, or iron ore. Iron ore is the material from which the metal iron is produced.',
    'A cryptocurrency is a digital or virtual currency that is secured by cryptography, which makes it nearly impossible to counterfeit or double-spend.',
    'Bitcoin, which was made available to the public in 2009, remains the most widely-traded coin',
    'The best crypto credit cards',
    'Best Crypto & Blockchain Right Now',
    'There is no single best cryptocurrency, but there are best cryptocurrencies for certain use cases. For example, Bitcoin is the best cryptocurrency to use as a reserve asset because it has the most widespread adoption and a finite supply.',
    'The best cryptocurrency exchanges are those that offer secure, easy-to-use platforms, with high trading volumes, and on which customers can trade multiple cryptos and pay in multiple payment options.',
    'Another one of the easiest cryptocurrencies to mine is Vertcoin.',
    'In 2021, Litecoin is still considered one of the best cryptocurrencies, despite the strong competition.',
    'The best cryptocurrency to buy right now in 2021 is Ethereum.',
    'Following are some of the best cryptocurrencies to Mine with GPU:',
    'It’s valuable the same way Bitcoin is valuable but in a more personable and practical way that involves more humans and less computer machines. ',
    'So I personally think Bitcoin is valuable as a measure of value, but it’s not very effective in terms of a cryptocurrency. There has been many better versions created which process faster, are more affordable to transfer, and are safer. ',
    'First and foremost, Bitcoin has value due to the same reason the paper and digital cash does – it’s a handy form of money commonly accepted by people. It is used to transfer value and buy or sell things. Yet, unlike the US dollars, whose value and legal status are enforced by the government, Bitcoin’s value comes from its code, infrastructure, scarcity, and adoption.'
]

# Query sentences:
queries = ['what is mining pool', 'what is the best cryptocurrency', 'which crypto is worth the investment?', 'best way to mine minerals', 'what makes bitcoin valuable']

## Testing MSMarco model

In [4]:
from sentence_transformers import SentenceTransformer, util

In [5]:
embedder = SentenceTransformer('msmarco-distilbert-dot-v5')

# Encode all sentences
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)

# Find the closest 5 sentences of the corpus for each query sentence.
# util.semantic_search scores with cosine similarity by default.
top_k = min(5, len(corpus))
for query in queries:
    query_embedding = embedder.encode(query, convert_to_tensor=True)

    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]      # Get the hits for the first (and only) query

    print("\n\n======================\n\n")
    print("Query:", query, "\n")

    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))





Query: what is mining pool 

A mining pool is a joint group of cryptocurrency miners who combine their computational resources over a network to strengthen the probability of finding a block or otherwise successfully mining for cryptocurrency. (Score: 0.9194)
Individually, participants in a mining pool contribute their processing power toward the effort of finding a block. If the pool is successful in these efforts, they receive a reward, typically in the form of the associated cryptocurrency. (Score: 0.8327)
Mining is the process of extracting useful materials from the earth. Some examples of substances that are mined include coal, gold, or iron ore. Iron ore is the material from which the metal iron is produced. (Score: 0.8253)
Another one of the easiest cryptocurrencies to mine is Vertcoin. (Score: 0.7226)
Following are some of the best cryptocurrencies to Mine with GPU: (Score: 0.7210)




Query: what is the best cryptocurrency 

The best crypto credit cards (Score: 0.9253)
Bes
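
The model name ends in `dot`, which suggests it was tuned for dot-product scoring, while `util.semantic_search` uses cosine similarity unless another score function is passed. The two can rank the same passages differently because the dot product also rewards vector length. A minimal sketch with made-up 2-D vectors (not real embeddings) shows the effect:

```python
import math


def dot_score(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_score(a, b):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    norm_a = math.sqrt(dot_score(a, a))
    norm_b = math.sqrt(dot_score(b, b))
    return dot_score(a, b) / (norm_a * norm_b)


# Toy vectors chosen for illustration only
query = [1.0, 0.0]
docs = {
    "long_vector": [2.0, 2.0],     # large norm, 45 degrees away from the query
    "aligned_vector": [1.0, 0.0],  # unit norm, same direction as the query
}

best_by_dot = max(docs, key=lambda name: dot_score(query, docs[name]))
best_by_cos = max(docs, key=lambda name: cos_score(query, docs[name]))

print("best by dot product:", best_by_dot)
print("best by cosine:", best_by_cos)
```

If the rankings above look off for a dot-product model, passing `score_function=util.dot_score` to `util.semantic_search` is one thing to try; the scores printed in this notebook were produced with the default cosine scoring.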

## Data cleaning and preparation

In [6]:
import json

import pandas as pd
from db.conn_db import *

# Read the connection settings
with open("db.json") as f:
    data = json.load(f)

corpus = pd.read_sql_table('index', data['connection'] + 'hnsw')
corpus

FileNotFoundError: [Errno 2] No such file or directory: 'db.json'

In [None]:
# Drop duplicates and reset the index without keeping the old one as a column
corpus = corpus.drop_duplicates(subset=['text'])
corpus = corpus.reset_index(drop=True)
corpus

In [None]:
# function to remove \n and \t
def clean(text):
    text = text.replace('\n', '')
    text = text.replace('\t', '')

    return text

In [None]:
# use map to apply the function to all rows in the text column
corpus['text'] = corpus['text'].map(clean)
corpus['id'] = corpus.index
corpus
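
One side effect of replacing `\n` and `\t` with an empty string is that words separated only by a line break get glued together. A possible alternative, shown here as a sketch and not as the cleaning used for the stored index, collapses every run of whitespace into a single space. The function name `clean_keep_spaces` is made up for this example:

```python
import re


def clean_keep_spaces(text: str) -> str:
    """Collapse any run of whitespace (spaces, \\n, \\t) into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Proof of stake\nis a consensus\tmechanism."

# Behaviour of the original clean(): separators are removed entirely
glued = sample.replace("\n", "").replace("\t", "")

print(glued)
print(clean_keep_spaces(sample))
```

With pandas it would be applied the same way as `clean`, through `corpus['text'].map(clean_keep_spaces)`.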

### Store it in a database to use it as index

In [None]:
corpus.to_sql(name='hnsw_index', con=db_connection('hnsw'), index=False, chunksize=500)
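
The `id` column equals each row's position, which is what later lets the labels returned by the HNSW index be mapped back to passages. The real database and the `query_hnsw` helper live in `db.conn_db` and are not shown here, so the following is only a sketch of what such a lookup might look like, using an in-memory SQLite database. The `link` column, the sample rows, and the `fetch_passages` name are assumptions made for illustration:

```python
import sqlite3

# In-memory stand-in for the real database (assumption: columns id, text, link)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hnsw_index (id INTEGER PRIMARY KEY, text TEXT, link TEXT)")

# Made-up rows for illustration only
rows = [
    (0, "A mining pool is a joint group of miners.", "https://example.com/pools"),
    (1, "A cryptocurrency is a digital currency.", "https://example.com/crypto"),
    (2, "Mining extracts useful materials from the earth.", "https://example.com/mining"),
]
conn.executemany("INSERT INTO hnsw_index VALUES (?, ?, ?)", rows)


def fetch_passages(connection, ids):
    """Return (text, link) rows whose id is in ids, using bound parameters."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT text, link FROM hnsw_index WHERE id IN ({placeholders})"
    return connection.execute(sql, list(ids)).fetchall()


results = fetch_passages(conn, (0, 2))
for text, link in results:
    print(text, "->", link)
```

Note that an `IN (...)` query does not return rows in the order of the ids passed in, so any ranking has to be re-established afterwards.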


## Embedding the corpus file

In [18]:
from sentence_transformers import SentenceTransformer

model_embedder = SentenceTransformer('./models/msmarco-distilbert-dot-v5')

In [19]:
corpus_text = corpus['text'].tolist()

corpus_embeddings = model_embedder.encode(corpus_text, show_progress_bar=True, convert_to_numpy=True)

Batches:   0%|          | 0/3254 [00:00<?, ?it/s]

### Save the embeddings into a pickle file

In [20]:
# Save it
import pickle

with open('./embeddings/embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': corpus_text, 'embeddings': corpus_embeddings}, fOut)

In [3]:
# Import it
import pickle

with open('./embeddings/embeddings.pkl', "rb") as fIn:
    cache_data = pickle.load(fIn)
    corpus_sentences = cache_data['sentences']
    corpus_embeddings = cache_data['embeddings']
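
Because the sentences and the embeddings are stored as two parallel sequences, it can be worth checking that they still line up when saving and loading. A small sketch of that round trip, using plain lists as a stand-in for the numpy array and a temporary directory in place of `./embeddings/` (the helper names `save_cache` and `load_cache` are made up here):

```python
import os
import pickle
import tempfile

# Stand-ins for corpus_text and corpus_embeddings
sentences = ["first passage", "second passage"]
embeddings = [[0.1, 0.2], [0.3, 0.4]]


def save_cache(path, sentences, embeddings):
    """Pickle sentences and embeddings together after a length check."""
    if len(sentences) != len(embeddings):
        raise ValueError("sentences and embeddings must have the same length")
    with open(path, "wb") as f_out:
        pickle.dump({"sentences": sentences, "embeddings": embeddings}, f_out)


def load_cache(path):
    """Load the pickle and verify that both sequences are still aligned."""
    with open(path, "rb") as f_in:
        cache = pickle.load(f_in)
    if len(cache["sentences"]) != len(cache["embeddings"]):
        raise ValueError("cached sentences and embeddings are misaligned")
    return cache["sentences"], cache["embeddings"]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.pkl")
    save_cache(path, sentences, embeddings)
    loaded_sentences, loaded_embeddings = load_cache(path)

print(len(loaded_sentences), "sentences loaded")
```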

## Create HNSW index
https://github.com/nmslib/hnswlib/blob/master/README.md

In [21]:
import hnswlib

embedding_size = 768
top_k_hits = 50
len_corpus = len(corpus_embeddings)

# We use the cosine space: hnswlib normalizes the vectors and returns distances equal to 1 - cosine similarity
index_path = "./hnswlib.bin"
index = hnswlib.Index(space='cosine', dim=embedding_size)

In [22]:
# Create the HNSWLIB index
# ef_construction - controls index search speed/build speed tradeoff
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction
index.init_index(max_elements=len_corpus, ef_construction=400, M=64)

# Then we add the embeddings to the index, using each row position as its label
index.add_items(corpus_embeddings, list(range(len(corpus_embeddings))))
index.save_index(index_path)

# Controlling the recall by setting ef:
index.set_ef(90) # ef should always be > top_k_hits

In [23]:
len(corpus_embeddings)

104120
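
HNSW is an approximate search, so the neighbours it returns can differ from the exact ones. One way to judge settings such as `ef` and `M` is to compare the index results against a brute-force search on a small sample and compute the recall. The sketch below does this in plain Python with toy vectors; the `approx_labels` list stands in for what `index.knn_query` would return and is made up for the example:

```python
import math


def cosine_distance(a, b):
    """Cosine distance, i.e. 1 - cosine similarity, as in the cosine space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force search: labels of the k vectors closest to the query."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_labels, exact_labels):
    """Fraction of the exact neighbours that the approximate search found."""
    exact = set(exact_labels)
    return len(exact & set(approx_labels)) / len(exact)


# Toy data for illustration only
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

exact_labels = exact_top_k(query, vectors, k=2)
approx_labels = [0, 2]  # hypothetical output of an approximate index

print("exact neighbours:", exact_labels)
print("recall:", recall_at_k(approx_labels, exact_labels))
```

On the real corpus, the brute-force side would only be run for a handful of queries, since comparing against all 104120 vectors for every query is exactly the cost the index is meant to avoid.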

## Testing

In [1]:
import hnswlib

embedding_size = 768
top_k_hits = 50
len_corpus = 104120

index = hnswlib.Index(space='cosine', dim=embedding_size)

index.load_index("hnswlib.bin", max_elements=len_corpus)

# ef is not saved with the index, so set it again after loading
index.set_ef(90)

In [2]:
from sentence_transformers import SentenceTransformer

model_embedder = SentenceTransformer('./models/msmarco-distilbert-dot-v5')

In [3]:
from sentence_transformers import CrossEncoder

model_cross = CrossEncoder('models/ms-marco-MiniLM-L-12-v2')

In [4]:
from transformers import pipeline

model_sum = './models/t5-base'
sum_pipe = pipeline('summarization', model=model_sum, tokenizer=model_sum,
                   framework='pt', device=0)

In [5]:
from db.conn_db import query_hnsw

In [17]:
query = "is proof of stake safe?"

In [18]:
# Embed the query into vector space
question_embedding = model_embedder.encode([query]).tolist()

# Search with HNSW for the best passages
corpus_ids, distances = index.knn_query(question_embedding, k=top_k_hits)
# Cosine distance -> cosine similarity
hits = [{'corpus_id': int(idx), 'score': 1 - dist} for idx, dist in zip(corpus_ids[0], distances[0])]

# Collect the passage ids
passages_id = [hit['corpus_id'] for hit in hits[0:top_k_hits]]

db_results = query_hnsw(tuple(passages_id))
passages = [i[0] for i in db_results]

# Use cross encoder to rank the best passages
model_inputs = [[query, passage] for passage in passages]
scores = model_cross.predict(model_inputs)

results = [{'input': inp, 'score': score} for inp, score in zip(model_inputs, scores)]
results = sorted(results, key=lambda x: x['score'], reverse=True)

# Append the best contexts
context = []
links = set()

for hit in results[0:3]:
    context.append(hit['input'][1])
    for i in db_results:
        if i[0] == hit['input'][1]:
            links.add(i[1])

# Summarize the context
answer = sum_pipe(
    f"question: {query} context: {' '.join(context)}",
    num_beams=4,
    do_sample=True,
    temperature=1.6,
    min_length=60,
    max_length=300
)
print('question: ' + query +'\n' + answer[0]['summary_text'].replace(' .', '.'))

print("\nFor more information: ")
for link in links:
    print(link)

Your max_length is set to 300, but you input_length is only 148. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=74)


question: is proof of stake safe?
Proof of Stake consensus is not only more environmentally friendly, but it significantly increases transaction throughput while simultaneously reducing transaction fees. of the many cryptocurrencies today, proof of stake is the most popular in a recent period. the probability of users forming the next block in the blockchain is proportional to share units of cryptocurrency belonging to participant and to their total number

For more information: 
https://www.benzinga.com/money/is-cryptocurrency-really-the-future/
https://crypto.bi/profit-pos/
https://en.bitcoinwiki.org//wiki/Ouroboros
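
The step that collects the links scans all of `db_results` for every selected passage and stores the links in a set, so the printed order is arbitrary. Since the corpus was deduplicated on `text`, a dictionary from passage to link is one possible simplification that also keeps the links in ranking order. A sketch with made-up rows in the same `(text, link)` shape and made-up cross-encoder scores:

```python
query = "is proof of stake safe?"

# Hypothetical (text, link) rows, shaped like db_results
db_results = [
    ("Passage about staking security.", "https://example.com/a"),
    ("Passage about mining hardware.", "https://example.com/b"),
    ("Passage about validator slashing.", "https://example.com/a"),
]

# Hypothetical cross-encoder scores, one per passage
scores = [7.2, -3.1, 5.4]

results = [
    {"input": [query, text], "score": score}
    for (text, _), score in zip(db_results, scores)
]
results = sorted(results, key=lambda x: x["score"], reverse=True)

# One lookup table instead of a nested loop over db_results
link_by_passage = {text: link for text, link in db_results}

context = []
links = []
for hit in results[0:2]:
    passage = hit["input"][1]
    context.append(passage)
    link = link_by_passage[passage]
    if link not in links:  # keep ranking order, skip repeated links
        links.append(link)

print(context)
print(links)
```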
