### WORKING ON COLAB ENVIRONMENT

This section covers how to:
* Mount Google Drive, where both the working notebook and the `.env` file containing the Git token are saved.
* Clone the online Git repository.
* Copy the Colab notebook into the cloned repository.
* Run the Git commands.

In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
!apt-get install git

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git is already the newest version (1:2.34.1-1ubuntu1.11).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


**COLAB ENVIRONMENT VARIABLE MANAGER**

In [3]:
# for loading environment variable
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [4]:
# Set your GitHub personal access token
from dotenv import load_dotenv
import os

# Load the environment variables from the file on Google Drive
load_dotenv('/content/drive/My Drive/file2.env')

token = os.getenv('GIT_TOKEN')
GOOGLE_API_KEY = os.getenv("GOOGLE_AI_API_KEY")

**CLONE THE REPOSITORY**

In [5]:
# Clone the repository using the token
!git clone https://{token}@github.com/omogbolahan94/Vector-Database.git

Cloning into 'Vector-Database'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
Receiving objects: 100% (6/6), done.
Resolving deltas: 100% (1/1), done.
remote: Total 6 (delta 1), reused 5 (delta 0), pack-reused 0


**COPY THE NOTEBOOK TO THE REPOSITORY**

In [13]:
!pwd

/content


In [14]:
!ls

drive  sample_data  Vector-Database


In [None]:
# copy the notebook into the repository cloned above (listed by `!ls` as Vector-Database)
!cp '/content/drive/MyDrive/Colab Notebooks/LLM_vector_semantic_search.ipynb' /content/Vector-Database/

### WORKING ON LOCAL ENVIRONMENT

**PREPARE THE DATASET**

In [1]:
import json

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

documents[1]

{'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
 'section': 'General course-related questions',
 'question': 'Course - What are the prerequisites for this course?',
 'course': 'data-engineering-zoomcamp'}

**CREATE EMBEDDINGS WITH PRE-TRAINED MODEL**

In [2]:
!pip install sentence_transformers==2.7.0 --quiet

In [1]:
from sentence_transformers import SentenceTransformer

In [4]:
model = SentenceTransformer("all-mpnet-base-v2")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [5]:
# testing the model with a simple text:
sample_sentence = "This is a simple sentence"

# create embeddings for the sentence above:
embeded_sample = model.encode(sample_sentence)
print(f"length of embedded sample: {len(embeded_sample)}\n")
embeded_sample[0:5]

length of embedded sample: 768


array([ 0.00444875, -0.07613144, -0.00037748,  0.00752524, -0.03809796],
      dtype=float32)

In [6]:
# create the dense vector for each text in the documents using the pre-trained model
operations = []
for doc in documents:
    # Transform the text into an embedding using the model
    doc["text_vector"] = model.encode(doc["text"]).tolist()
    operations.append(doc)

**CONNECT TO RUNNING ELASTICSEARCH**

In [7]:
from elasticsearch import Elasticsearch

In [9]:
es_client = Elasticsearch('http://localhost:9200')

es_client.info()

ObjectApiResponse({'name': 'bb42b2cfb03b', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'J1vkYNj5T4mqPBZongUw3A', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

**CREATE MAPPING AND INDEX FOR THE DATABASE**

Each document is a collection of fields, each with its own data type.

Mapping: the process of defining how a document and the fields it contains are stored and indexed.



In [10]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} ,
            "text_vector": {"type": "dense_vector", "dims": 768, "index": True, "similarity": "cosine"},
        }
    }
}

In [11]:
# use the index settings above to configure Elasticsearch
index_name = "course-questions"

# delete the index if it already exists
es_client.indices.delete(index=index_name, ignore_unavailable=True)

# create it again
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

**ADD DOCUMENT INTO INDEX**

In [12]:
for doc in operations:
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(e)

**USER QUERY**

In [13]:
search_term = "windows or mac?"
vector_search_term = model.encode(search_term)

query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,  # return the 5 nearest neighbours of the query vector
    "num_candidates": 10000,  # candidates considered per shard before the top k are chosen
}

res = es_client.search(index=index_name,
                       knn=query, 
                       source=["text", "section", "question", "course"])
res["hits"]["hits"]

[{'_index': 'course-questions',
  '_id': 'KxjsppABeHJdJ6B_zo3Q',
  '_score': 0.71479183,
  '_source': {'question': 'Environment - Is the course [Windows/mac/Linux/...] friendly?',
   'course': 'data-engineering-zoomcamp',
   'section': 'General course-related questions',
   'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully'}},
 {'_index': 'course-questions',
  '_id': 'PhjtppABeHJdJ6B_T5A7',
  '_score': 0.61347336,
  '_source': {'question': 'WSL instructions',
   'course': 'mlops-zoomcamp',
   'section': 'Module 1: Introduction',
   'text': 'If you wish to use WSL on your windows machine, here are the setup instructions:\nCommand: Sudo apt install wget\nGet Anaconda download address here. wget <download address>\nTurn on Docker Desktop WFree Download | AnacondaSL2\nCommand: git clone <github repository address>\nVSCODE on WSL\nJupyter: pip3 install jupyter\nAdded by Gregory Morris (gwm1980@gmail.com)\nAll in all softwares 

**Note:** If we had not embedded the user query before searching the `elasticsearch` DB, this would be a **keyword search** rather than a **semantic search**, as used in my GitHub chatbot **[repo](https://github.com/omogbolahan94/LLM-QA-Chatbot)**.
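For contrast, a keyword search sends the raw text to Elasticsearch with no embedding step. The sketch below only builds such a query body; the helper name `build_keyword_query` and the choice of fields are assumptions for illustration, not code from the linked repo.

```python
def build_keyword_query(search_term, fields=("question", "text", "section")):
    """Build a plain keyword query body; no embedding model is involved."""
    return {
        "multi_match": {
            "query": search_term,     # raw text, matched on tokens
            "fields": list(fields),   # text fields from the index mapping above
            "type": "best_fields",
        }
    }


keyword_query = build_keyword_query("windows or mac?")
print(keyword_query)

# With a running cluster this could be sent as:
# es_client.search(index=index_name, query=keyword_query, size=5)
```

Here the match depends on shared words between the query and the documents, whereas the kNN search above compares vectors.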

**ADVANCED SEMANTIC SEARCH: FILTER**

* The documents cover three (3) courses. We can restrict the search to a particular course with the `filter` option of the kNN query (it is part of the search request, not of the index mapping).

The query below instead combines the kNN search with a `match` query on a particular section of the vector DB.

In [14]:
knn_query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000
}

response = es_client.search(
    index=index_name,
    query={
        "match": {"section": "General course-related questions"},
    },
    knn=knn_query,
    size=5
)
response["hits"]["hits"]

[{'_index': 'course-questions',
  '_id': 'KxjsppABeHJdJ6B_zo3Q',
  '_score': 11.614713,
  '_source': {'text': 'Yes! Linux is ideal but technically it should not matter. Students last year used all 3 OSes successfully',
   'section': 'General course-related questions',
   'question': 'Environment - Is the course [Windows/mac/Linux/...] friendly?',
   'course': 'data-engineering-zoomcamp',
   'text_vector': [-0.026965485885739326,
    -0.0006261198432184756,
    -0.016629479825496674,
    0.052851513028144836,
    0.054765306413173676,
    -0.03133990615606308,
    0.029942603781819344,
    -0.04808563366532326,
    0.04467551410198212,
    0.005839459598064423,
    0.016233060508966446,
    0.012001175433397293,
    -0.03122228942811489,
    0.01660051941871643,
    -0.04886902868747711,
    -0.06496305018663406,
    0.04643420875072479,
    -0.009297742508351803,
    -0.06425285339355469,
    -0.013732698746025562,
    -0.01597622223198414,
    0.008629552088677883,
    -0.024479001760
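The request above pairs kNN with a `match` query on `section`. To limit the search to a single course, a `filter` clause can be placed inside the kNN body. This is a minimal sketch that only builds the request body; the helper name `build_filtered_knn` is hypothetical, the vector is a dummy stand-in for `model.encode(search_term)`, and `filter` support should be checked against the Elasticsearch version in use.

```python
def build_filtered_knn(query_vector, course, k=5, num_candidates=10000):
    """kNN search body restricted to one course via a term filter."""
    return {
        "field": "text_vector",
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
        # `course` is mapped as `keyword`, so an exact-match term filter applies
        "filter": {"term": {"course": course}},
    }


# a dummy 768-dimensional vector stands in for the encoded search term
knn_filtered = build_filtered_knn([0.1] * 768, "data-engineering-zoomcamp")
print(knn_filtered["filter"])

# With a running cluster this could be sent as:
# es_client.search(index=index_name, knn=knn_filtered,
#                  source=["text", "section", "question", "course"])
```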

* In the basic search, the score was between 0 and 1, where 1 is the best score and 0 is the poorest.
* In the advanced search, the score is not within the 0 to 1 range, because Elasticsearch adds the `match` query score to the kNN score when `query` and `knn` are used together.
* To see how a score was computed, pass `explain=True` to `es_client.search`.
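The 0 to 1 range of the basic search follows from how Elasticsearch documents scoring for `cosine` similarity on `dense_vector` fields: `_score = (1 + cosine) / 2`. The sketch below reproduces that conversion in plain Python on toy 2-dimensional vectors; the function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2


print(es_cosine_score([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(es_cosine_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> 0.0
print(es_cosine_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.5
```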

### EVALUATING METRICS FOR RETRIEVAL

1. **Precision at k (P@k)**:
   - Measures the proportion of relevant documents among the top k results.
   - Formula: `P@k = (Number of relevant documents in top k results) / k`

2. **Recall**:
   - Measures the proportion of all available relevant documents that were retrieved.
   - Formula: `Recall = (Number of relevant documents retrieved) / (Total number of relevant documents)`

3. **Mean Average Precision (MAP)**:
   - Computes the average precision for each query and then averages these values over all queries.
   - Formula: `MAP = (1 / |Q|) * Σ (Average Precision(q))` for q in Q

4. **Normalized Discounted Cumulative Gain (NDCG)**:
   - Measures the usefulness, or gain, of a document based on its position in the result list.
   - Formula: `NDCG = DCG / IDCG`
     - `DCG = Σ ((2^rel_i - 1) / log2(i + 1))` for i = 1 to p
     - `IDCG` is the ideal DCG, where documents are perfectly ranked by relevance.

5. **Mean Reciprocal Rank (MRR)**:
   - Averages, over all queries, the reciprocal of the rank of the first relevant document.
   - Formula: `MRR = (1 / |Q|) * Σ (1 / rank_i)` for i = 1 to |Q|, where `rank_i` is the rank of the first relevant document for query i.

6. **F1 Score**:
   - Harmonic mean of precision and recall.
   - Formula: `F1 = 2 * (Precision * Recall) / (Precision + Recall)`

7. **Area Under the ROC Curve (AUC-ROC)**:
   - Measures the ability of the model to distinguish between relevant and non-relevant documents.
   - AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots true positive rate (TPR) against false positive rate (FPR).

8. **Mean Rank (MR)**:
   - The average rank of the first relevant document across all queries.
   - Lower values indicate better performance.

9. **Hit Rate (HR) or Recall at k**:
   - Measures the proportion of queries for which at least one relevant document is retrieved in the top k results. It equals Recall at k when each query has exactly one relevant document.
   - Formula: `HR@k = (Number of queries with at least one relevant document in top k) / |Q|`

10. **Expected Reciprocal Rank (ERR)**:
    - Measures the probability that a user finds a relevant document at each position in the ranked list, assuming a cascading model of user behavior.
    - Formula: `ERR = Σ_{i=1..n} (1 / i) * r_i * Π_{j=1..i-1} (1 - r_j)`, where n is the length of the ranked list.
      - Where `r_i` is the relevance probability of the document at position i.
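Several of the metrics above can be sketched in a few lines of Python. This is a minimal illustration on made-up relevance lists (1 = relevant, 0 = not relevant, ordered by rank), not an evaluation of the index built earlier; all function names are illustrative.

```python
import math


def precision_at_k(relevance, k):
    """Share of the top k results that are relevant."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance, k, total_relevant):
    """Share of all relevant documents found in the top k."""
    return sum(relevance[:k]) / total_relevant


def hit_rate(all_relevance, k):
    """Share of queries with at least one relevant result in the top k."""
    hits = sum(1 for rel in all_relevance if any(rel[:k]))
    return hits / len(all_relevance)


def mrr(all_relevance):
    """Mean reciprocal rank of the first relevant result (0 if none)."""
    total = 0.0
    for rel in all_relevance:
        for rank, is_rel in enumerate(rel, start=1):
            if is_rel:
                total += 1 / rank
                break
    return total / len(all_relevance)


def ndcg(gains, p):
    """NDCG at p; the ideal ordering is taken from the returned list only."""
    def dcg(g):
        return sum((2 ** rel - 1) / math.log2(i + 1)
                   for i, rel in enumerate(g[:p], start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# toy results for three queries (hypothetical data)
toy = [
    [0, 1, 0, 0, 0],  # first relevant document at rank 2
    [1, 0, 0, 0, 0],  # first relevant document at rank 1
    [0, 0, 0, 0, 0],  # nothing relevant retrieved
]

print(precision_at_k(toy[0], 5))   # 1 relevant out of 5 -> 0.2
print(hit_rate(toy, 5))            # 2 of 3 queries have a hit
print(mrr(toy))                    # (1/2 + 1 + 0) / 3 -> 0.5
print(ndcg(toy[1], 5))             # already ideally ordered -> 1.0
```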

### GROUND TRUTH DATASET GENERATION
* Generate an `id` key for each document by hashing the combination of its course, its question and the first 10 characters of its text. The id depends on the content and not on the document's position, which is why it is derived from a hash of existing strings.

In [2]:
import hashlib

def generate_doc_id(doc):
    combined_text = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined_text.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id

In [3]:
for doc in documents:
    doc['id'] = generate_doc_id(doc)

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [4]:
from collections import defaultdict

hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

In [5]:
len(documents), len(hashes)

(948, 947)

In [6]:
# check whether any hash is shared by more than one document
for k, v in hashes.items():
    if len(v) > 1:
        print(k, len(v))

593f7569 2


There are two documents with the same hash key:

In [7]:
hashes['593f7569']

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]

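They share the same course, section and question, and their texts differ only in the trailing attribution line, so they are effectively duplicates.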
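The collision comes from hashing only the first 10 characters of the text. The sketch below replays it on the two records shown above; `doc_id` is an illustrative variant of `generate_doc_id` with an optional `text_chars` argument, which is an assumption added here for the comparison.

```python
import hashlib


def doc_id(doc, text_chars=None):
    """MD5-based id; text_chars=None hashes the full text."""
    text = doc["text"] if text_chars is None else doc["text"][:text_chars]
    combined = f"{doc['course']}-{doc['question']}-{text}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


base = {
    "course": "machine-learning-zoomcamp",
    "question": "Does it matter if we let the Python file create the server "
                "or if we run gunicorn directly?",
}
doc_a = {**base, "text": "They both do the same, it's just less typing from the script.\n"
                         "Asked by Andrew Katoch, Added by Edidiong Esu"}
doc_b = {**base, "text": "They both do the same, it's just less typing from the script."}

# identical course, question and first 10 characters -> identical id
print(doc_id(doc_a, 10) == doc_id(doc_b, 10))
# hashing the full text separates the two records
print(doc_id(doc_a) == doc_id(doc_b))
```

Whether such near-duplicates should share an id is a design choice; here sharing it lets the later loop skip the second copy.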

**Save the new document with an `id` key as a json file**

In [15]:
import json

with open('documents-with-ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)

In [16]:
!head documents-with-ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


**Generate 5 questions from the documents**

In [8]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as fewer words as possible from the record. 

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [9]:
formated_prompt = prompt_template.format(**doc)

formated_prompt

'You emulate a student who\'s taking our course.\nFormulate 5 questions this student might ask based on a FAQ record. The record\nshould contain the answer to the questions, and the questions should be complete and not too short.\nIf possible, use as fewer words as possible from the record. \n\nThe record:\n\nsection: Module 6: Best practices\nquestion: How to destroy infrastructure created via GitHub Actions\nanswer: Problem description\nInfrastructure created in AWS with CD-Deploy Action needs to be destroyed\nSolution description\nFrom local:\nterraform init -backend-config="key=mlops-zoomcamp-prod.tfstate" --reconfigure\nterraform destroy --var-file vars/prod.tfvars\nAdded by Erick Calderin\n\nProvide the output in parsable JSON without using code blocks:\n\n["question1", "question2", ..., "question5"]'

In [12]:
import textwrap

import google.generativeai as genai

import os

from IPython.display import display
from IPython.display import Markdown

In [13]:
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [14]:
GOOGLE_API_KEY = os.getenv("GOOGLE_AI_API_KEY")

genai.configure(api_key=GOOGLE_API_KEY)

In [16]:
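def generate_questions(doc, model_name="gemini-1.0-pro-latest"):
    # fill the prompt template with this document's fields
    formated_prompt = prompt_template.format(**doc)

    # use a separate name so the global embedding `model` is not shadowed
    llm = genai.GenerativeModel(model_name)
    response = llm.generate_content(formated_prompt)

    # the generated text is available as response.text
    return response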

In [17]:
test_doc = documents[3]
test_doc

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '0bbf41ec'}

In [18]:
questions = generate_questions(test_doc)

RetryError: Timeout of 600.0s exceeded, last exception: 503 DNS resolution failed for generativelanguage.googleapis.com:443: C-ares status is not ARES_SUCCESS qtype=A name=generativelanguage.googleapis.com is_balancer=0: Timeout while contacting DNS servers

In [None]:
to_markdown(questions.text)

**Once a single call succeeds, the same function can be run for each of the documents in the JSON file:**

In [22]:
from tqdm.auto import tqdm

In [23]:
# generate 5 questions for each document: id as key, the model's reply text as value

results = {}
for doc in tqdm(documents):
    doc_id = doc['id']
    # skip documents whose hash key has already been processed
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    # store the reply text (a plain string) rather than the whole response object
    results[doc_id] = questions.text

  0%|          | 0/948 [00:00<?, ?it/s]

RetryError: Timeout of 600.0s exceeded, last exception: 503 DNS resolution failed for generativelanguage.googleapis.com:443: C-ares status is not ARES_SUCCESS qtype=A name=generativelanguage.googleapis.com is_balancer=0: Timeout while contacting DNS servers

In [None]:
# preview the first 5 results
for i, (k, v) in enumerate(results.items()):
    if i == 5:
        break
    print(f"{k} -> {v}\n\n")

In [None]:
import pickle

In [None]:
# save the result as a pickle file
with open('results.bin', 'wb') as f:
    pickle.dump(results, f)

In [None]:
# read the pickle file
with open('results.bin', 'rb') as f_in:
    results = pickle.load(f_in)
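Because the prompt asks for a JSON array, each stored reply can be turned into (question, document id) records for the ground truth dataset. A minimal sketch, assuming every value in `results` is the reply text (a JSON string such as `response.text`) and not the response object; the sample replies below are made up.

```python
import json


def parse_questions(results):
    """Flatten {doc_id: JSON string} into a list of question records."""
    rows = []
    for doc_id, raw in results.items():
        try:
            questions = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip replies that are not valid JSON
        for q in questions:
            rows.append({"question": q, "document": doc_id})
    return rows


# hypothetical replies keyed by document id
sample_results = {
    "c02e79ef": '["When does the course begin?", "How do I register?"]',
    "0bbf41ec": "not valid json",
}

rows = parse_questions(sample_results)
print(rows)
```

Malformed replies are skipped silently here; logging their ids for a retry would be a reasonable alternative.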

### SCRAPE MEDICAL DATA

In [19]:
from bs4 import BeautifulSoup

In [20]:
import requests

In [21]:
url = 'https://nimedhealth.com.ng/2019/07/24/list-of-diseases-and-their-yoruba-equivalents/'

In [25]:
# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    print('Successful [200]')
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


Successful [200]


In [26]:
cls = "td-post-content"

# all paragraphs inside the post body
content = soup.find("div", class_=cls).find_all("p")

In [67]:
result = []
for i, paragraph in enumerate(content):
    # keep paragraphs 3 to 71 only
    if 2 < i < 72:
        # normalise '-' and ':' to an en dash, then split on it
        text = paragraph.get_text().replace('-', '–').replace(':', '–').strip().split('–')
        result.append((text[0].strip(), text[1].strip()))
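The separator handling in the loop above can be isolated and guarded against lines with no separator, which would otherwise raise an `IndexError` on `text[1]`. A minimal sketch on made-up input strings; `split_term` is a hypothetical helper and the exact layout of the scraped page is an assumption.

```python
def split_term(line):
    """Return (english, yoruba), or None when no usable separator is found."""
    # normalise '-' and ':' to an en dash, as in the loop above
    normalized = line.replace("-", "–").replace(":", "–")
    parts = [p.strip() for p in normalized.split("–")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        return None
    return parts[0], parts[1]


print(split_term("Fever: Ibà"))          # ('Fever', 'Ibà')
print(split_term("Cough – Ikọ"))         # ('Cough', 'Ikọ')
print(split_term("No separator here"))   # None
```

Like the original loop, this splits at the first separator, so an English term that itself contains a hyphen is cut at that hyphen; that may be what produces a pair such as `('Craw', 'craw')`.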

In [68]:
result

[('Abdominal pain (Belly ache)', 'Inú rírun'),
 ('AIDS', 'ààrùn ìsọdọ̀lẹ àjẹsára'),
 ('Amnesia', 'Ìgbàgbé'),
 ('Anaemia', 'Àìsàn àìlẹjẹtó'),
 ('Peptic ulcer/stomach ulcer', 'Ogbé inù'),
 ('Jaundice', 'Iba ponju'),
 ('Asthma', 'Ikọ efée/Ikó sémísèmí'),
 ('Backache', 'Ẹ̀yín ríro'),
 ('Small pox', 'sopona/shopona'),
 ('Body ache', 'Ara ríro'),
 ('Boil', 'Eéwo'),
 ('Cancer', 'Àrùn Jẹjẹrẹ'),
 ('Cholera', 'Àrùn Onígbáméjì'),
 ('Communicable disease', 'Arun aranni'),
 ('Congenital disease', 'Àrùn abínibí'),
 ('Cough', 'Ikọ'),
 ('Craw', 'craw'),
 ('Dental caries', 'Eyín kíkẹ'),
 ('Dental plaque', 'Gẹdẹgẹdẹ eyín'),
 ('Diabetes', 'Àtọgbẹ, Ito suga'),
 ('Diarrhea', 'Ìgbẹ gbuuru'),
 ('Dry cough', 'Ikọ gbígbẹ'),
 ('Dysentry', 'Ìgbẹ ọrìn'),
 ('Elephantiasis', 'Jàbùtẹ, Òkè'),
 ('Epidemic', 'Àjàkálẹ'),
 ('Epilepsy', 'Wárápá'),
 ('Fever', 'Ibà'),
 ('Furuncle/Boil', 'Eéwo'),
 ('Genetic disease', 'Àrùn àfijogún; Àrùn ìdílé'),
 ('Gonorrhea', 'Àtọsí'),
 ('Guineaworm', 'Sòbìyà'),
 ('HIV', 'Kòkòrò Apa Sójà A

In [69]:

import pandas as pd


In [71]:
df = pd.DataFrame(result, columns=['English Medical Terms', 'Yoruba Medical Terms'])

In [72]:
df

| | English Medical Terms | Yoruba Medical Terms |
|---|---|---|
| 0 | Abdominal pain (Belly ache) | Inú rírun |
| 1 | AIDS | ààrùn ìsọdọ̀lẹ àjẹsára |
| 2 | Amnesia | Ìgbàgbé |
| 3 | Anaemia | Àìsàn àìlẹjẹtó |
| 4 | Peptic ulcer/stomach ulcer | Ogbé inù |
| ... | ... | ... |
| 64 | Rheumatism | Aromóleégun |
| 65 | Sickler/ sickle cell anemia | Fi ònìkú fòla dìde |
| 66 | Tooth ache | Eyín ríro |
| 67 | Goitre | Gbegbe |


In [79]:
df.to_csv('online-medical-term-eng-yor.csv', index=False)


In [80]:
pip install openpyxl --quiet

Note: you may need to restart the kernel to use updated packages.


In [81]:
df.to_excel('online-medical-term-eng-yor.xlsx', index=False)