<a href="https://colab.research.google.com/github/leakydishes/AppTruckSharing/blob/main/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="chatbot-title"><b>AI Chatbot: </b> Alcohol and Drug Foundation (ADF)</div>

<div class="chatbot-authors"><b>Project Manager: </b>Dotahn</div>

<div class="chatbot-authors"><b>Authored by (interns): </b>Te' Claire and Khuzaima Jamil</div>

<div class="chatbot-dates"><b>Dates: </b>December 2023/ January, February 2024</div>

<div class="chatbot-github"><b>GitHub Repo: </b>
<a href="https://github.com/Dotahn/ADFAIChatbot-Internship/tree/main">GitHub Link</a></div>

<br>

<div class="chatbot-section-header"><b>Outline: </b>Utilising an API to integrate natural language capabilities (LLMs) into a chat application customised for the Alcohol and Drug Foundation (ADF) website (https://adf.org.au/), supporting Flowise Hub for database, scripts and model testing while respecting the ADF Artificial Intelligence Ethical Framework.
<br>
</div>

---

<div class="chatbot-sub-section"><b>embeddings.ipynb: </b></div>

<div class="chatbot-sub-section">
  <i>Python script #3</i>
</div>

##### Overview:
1. Set-up: loads the **'cleandata'** table (JSON) from Supabase and examines the dataset.
2. Embeddings and tokenizing: tokenizes text into individual words, embeds it and uploads the result to Supabase.
3. Creates the knowledge base for the generative component.
4. Exports embeddings to Supabase (vecs schema) as the **'text_embeddings'** table (JSON).

##### RAM Usage Rate (Google Colab)
- Usage rate: approximately 5.53 per hour, V100 (High-RAM)
- Usage rate: approximately 1.96 per hour, T4 GPU (Low-RAM)
- Usage rate: approximately 2.05 per hour, T4 GPU (High-RAM)




---



## Research

### Embeddings
The Washington Post's analysis of Google's C4 dataset found that the quality and quantity of embeddings are equally important to LLM training [3], [4]. When fine-tuning models, the industry typically uses high-quality datasets to protect users from unwanted content.

*   The size of the embedding depends on the model that we choose.
*   Higher-cost models generally produce embeddings with more dimensions, which tends to give more accurate results, e.g. Ada (1024), Babbage (2048), Curie (4096), Davinci (12288) [5].
<br>

### Model: all-mpnet-base-v2
Model [25], [27]
*   Performance, sentence embeddings (14 datasets): 69.57
*   Performance, semantic search (6 datasets): 57.02
*   Average performance: 63.30
*   Speed: 2800 sentences/sec
*   Model size: 420 MB



### Supabase (Vector Storage)

##### Supabase (a PostgreSQL relational database) is used to store vectors, treating them as arrays [3], [9], [10].
*   The `vecs` Python client manages unstructured vector stores in PostgreSQL.
*   It is a tool for creating and querying collections in PostgreSQL using the pgvector extension.
*   Given a connection string, vecs handles setting up the database to store and query vectors with associated metadata [5].
*   Collection/table: db_vectors_e5_mistral_7b_instruct

##### Metrics
*   The most typical metric used in similarity learning models is the cosine metric [6].
*   The metric is computed in 2 steps:

1.   Normalize the vector when adding it to the collection (once for each vector).
2.   Compare vectors with the dot product, which is equivalent to cosine similarity for normalized vectors.
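
The two steps can be sketched in plain Python (no dependencies). This is a minimal illustration on toy 3-dimensional vectors, not the code used by Supabase or Qdrant; the helper names `normalize`, `dot` and `cosine_similarity` are hypothetical.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (done once, when the vector is stored)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Reference definition: dot(a, b) / (|a| * |b|)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings
stored = normalize([1.0, 2.0, 2.0])   # step 1: normalize on insert
query = normalize([2.0, 1.0, 2.0])

similarity = dot(stored, query)        # step 2: comparison is a dot product
distance = 1.0 - similarity            # cosine distance (0.0 means same direction)
```

Because both vectors are already unit length, the dot product gives the same value as the full cosine formula, so the division by the norms is paid once per stored vector rather than once per comparison.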


##### Benefits of Vector Databases
*   Efficient storage and indexing of high-dimensional data.
*   Ability to handle large-scale datasets with billions of data points.
*   Support for real-time analytics and queries.
*   Ability to handle vectors derived from complex data types such as images, videos, and natural language text.
*   Improved performance and reduced latency in machine learning and AI applications.
*   Reduced development and deployment time and cost compared to building a custom solution.

<br>

#### Limitations and Considerations
1. Vectors (Supabase)
*   Nearest neighbor search, which databases like Qdrant provide natively, is not directly supported in core PostgreSQL; in Supabase these search algorithms have to be added through an extension (pgvector).
*   Performance: PostgreSQL (Supabase) is not optimized for vector search operations.
*   Functionality: there are no built-in vector search functions (like cosine similarity search) without that extension.
*   Storage and Retrieval: large vectors can consume significant storage and bandwidth.

<br>
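
To put the storage point in numbers, here is a rough back-of-the-envelope sketch. It assumes 4 bytes per dimension plus 8 bytes of per-vector overhead, which is the figure the pgvector documentation gives for its `vector` type. The 100,000-row count is a made-up example, and indexes and metadata are ignored.

```python
def vector_storage_bytes(n_vectors, dimension, bytes_per_value=4, overhead_bytes=8):
    """Approximate raw storage for n_vectors embeddings of a given dimension."""
    return n_vectors * (dimension * bytes_per_value + overhead_bytes)

# Compare a 384-dimensional model with the larger sizes listed under Embeddings
for dim in (384, 1024, 12288):
    megabytes = vector_storage_bytes(100_000, dim) / 1_000_000
    print(f"{dim:>6} dims x 100,000 vectors: about {megabytes:,.1f} MB")
```

The estimate scales linearly with dimension, so a 12288-dimensional embedding costs roughly 32 times the storage and bandwidth of a 384-dimensional one.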

#### *Alternative Vector Storing*:
*   Qdrant functions as a database and search engine for vectors, storing neural embeddings and their metadata (payload) [6], [8].
*   It uses an API to store, search, and manage vectors with an additional payload (metadata). This enables faster and more accurate retrieval of unstructured data.
*   An open-source alternative to Pinecone.
*   Qdrant stores data in document/JSON format.
*   API https://cloud.qdrant.io/
*   Create a new 'cluster' using the basic free tier version.
*   Cluster: 'adf_chatbot_embeddings'
*   Note: a cluster can have several Collections, each collection can contain one or more points (vectors).
*   Authorised with GitHub; API key added to the JSON secrets file (qdrantKey).
*   Qdrant and FlowiseAI [7]
<br>
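
As an illustration of the point/payload structure described above, the sketch below builds the JSON body for a batch of points using only the standard library. The `id` / `vector` / `payload` fields follow the shape of Qdrant's points upsert request; the helper names and the sample texts are hypothetical, and nothing is sent to a cluster.

```python
import json
import uuid

def make_point(vector, payload):
    """Build one point: an id, the embedding vector, and its payload (metadata)."""
    return {"id": str(uuid.uuid4()), "vector": list(vector), "payload": dict(payload)}

def make_upsert_body(points):
    """Wrap points in a {"points": [...]} body for a single collection."""
    return {"points": list(points)}

# Toy 3-dimensional vectors standing in for real embeddings
points = [
    make_point([0.1, 0.2, 0.3], {"text": "help and support by state"}),
    make_point([0.3, 0.1, 0.2], {"text": "talking about drugs"}),
]
body = json.dumps(make_upsert_body(points))
```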




---



References
<br>
[1] https://medium.com/simform-engineering/revolutionizing-conversational-ai-with-openai-embeddings-b3fda3de6ed4
<br>
[2] https://huggingface.co/spaces/mteb/leaderboard
<br>
[3] https://supabase.com/docs/guides/ai/vector-columns
<br>
[4] https://supabase.com/docs/guides/ai/google-colab
<br>
[5] https://supabase.com/docs/guides/ai/python-clients
<br>
[6] https://qdrant.tech/documentation/concepts/search/
<br>
[7] https://supabase.github.io/vecs/0.4/api/
<br>
[8] https://docs.flowiseai.com/integrations/vector-stores/qdrant
<br>
[9] https://huggingface.co/sentence-transformers/all-mpnet-base-v2
<br>
[10] https://www.analyticsvidhya.com/blog/2023/11/rag-langchain-and-vector-databases/
<br>
[11] https://www.sbert.net/docs/pretrained_models.html#:~:text=The%20all%2Dmpnet%2Dbase%2D,and%20still%20offers%20good%20quality

## Set Up
#### Install Dependencies

In [None]:
import warnings
warnings.filterwarnings('ignore')
!pip install -qU vecs sentence-transformers tqdm



In [None]:
!nvidia-smi -L # Check GPU

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-b5b3a00c-8c7a-3ad4-b1cd-c5f4417892e8)


#### Import Libraries

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import os, warnings, json, uuid, requests, vecs
from sentence_transformers import SentenceTransformer
from huggingface_hub import notebook_login
from google.colab import drive

#### Drive

In [None]:
# Mount Google Drive in Colab
drive.mount('/content/drive', force_remount=True)
%cd /content/drive/MyDrive/ADFAIChatbot
!ls

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1OP0r8cO6DFjxo5iF0yrlyXaRPCme93DI/ADFAIChatbot
category_counts_per_year.csv  embeddings	output_stats_latest_50_updated_urls.csv
data			      embeddings.ipynb	output_stats_oldest_50_updated_urls.csv
data_cleaned		      models		raw_data_from_apify
data_clean.ipynb	      model_train	webcrawler_data_upsert.ipynb
Docker			      output_stats


#### Load API Keys


In [None]:
file_path = "/content/drive/MyDrive/ADFAIChatbot/model_train/secrets.json"
with open(file_path, "r") as file:  # Read JSON
    keys = json.load(file)

huggingfaceKey = keys["huggingfaceKey"]  # Hugging Face
supabase_token = keys["supabaseKey"]     # Supabase API key
supabase_url = keys["supabaseUrl"]       # Supabase project URL
supabase_db = keys["supabaseDBPooler"]   # Supabase database connection (pooler)
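
If a key is missing from the secrets file, the cell above stops with a bare `KeyError`. One possible way to fail earlier with a clearer message is sketched below. `load_secrets` and `REQUIRED_KEYS` are hypothetical names, and the demo uses a throwaway temp file with placeholder values rather than real credentials.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("huggingfaceKey", "supabaseKey", "supabaseUrl", "supabaseDBPooler")

def load_secrets(path, required=REQUIRED_KEYS):
    """Read a JSON secrets file and report every missing or empty key at once."""
    with open(path, "r") as file:
        keys = json.load(file)
    missing = [name for name in required if not keys.get(name)]
    if missing:
        raise ValueError(f"secrets file is missing: {', '.join(missing)}")
    return keys

# Demo: a file that lacks the database connection key
incomplete = {
    "huggingfaceKey": "hf_xxx",
    "supabaseKey": "sb_xxx",
    "supabaseUrl": "https://example.supabase.co",
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(incomplete, tmp)

message = None
try:
    load_secrets(tmp.name)
except ValueError as error:
    message = str(error)  # names the missing key
finally:
    os.remove(tmp.name)
```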

### Supabase Set Up
- API request
- GET Request (retrieve the data from table)
- Convert Data to Dataframe

In [None]:
# Supabase
supabase_api_key = supabase_token
table_name = "cleandata"

# Set up Headers
headers = {
    "Content-Type": "application/json",
    "apikey": supabase_api_key,
    "Authorization": f"Bearer {supabase_api_key}"
}

api_endpoint = f"{supabase_url}/rest/v1/{table_name}" # API endpoint
response = requests.get(api_endpoint, headers=headers) # Fetch data

# Request Check
if response.status_code == 200:
    data = response.json()
    # Convert JSON data to dataFrame
    cleaned_data = pd.DataFrame(data)
    print("Data loaded into DataFrame successfully.")
else:
    print(f"Error: {response.status_code} - {response.text}")
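
The request above fetches the table in a single GET. If the table ever grows beyond the row limit configured for the project's REST API, one option is to page through it with the `limit` and `offset` query parameters that PostgREST accepts. The sketch below only builds the URLs and sends nothing; the project URL, the row count and the `page_urls` helper are hypothetical.

```python
from urllib.parse import urlencode

def page_urls(base_url, table, total_rows, page_size=1000):
    """Yield one REST URL per page, using limit/offset query parameters."""
    for offset in range(0, total_rows, page_size):
        query = urlencode({"select": "*", "limit": page_size, "offset": offset}, safe="*")
        yield f"{base_url}/rest/v1/{table}?{query}"

# Example: 2,500 rows fetched in pages of 1,000
urls = list(page_urls("https://example.supabase.co", "cleandata", total_rows=2500))
```

Each URL can then be passed to `requests.get` with the same headers as above, and the resulting JSON lists concatenated before building the DataFrame.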

### Model

In [None]:
# Load Sentence Transformer model (produces 384-dimensional embeddings)
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_id)

# Optional: Hugging Face Inference API endpoint for the same model (not used below)
api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
hf_headers = {"Authorization": f"Bearer {huggingfaceKey}"}

# Texts to embed.
# NOTE: the column name "text" is an assumption; use the column of
# cleaned_data that holds the page text.
texts = cleaned_data["text"].astype(str).tolist()

# Generate embeddings locally
text_embeddings = model.encode(texts, show_progress_bar=True)


### Supabase



In [None]:
DB_CONNECTION = supabase_db # Supabase
vx = vecs.create_client(DB_CONNECTION) # Client
text_collection = vx.get_or_create_collection(name="text_embeddings", dimension=384) # Dimension depends on your model


### Data Insertion

In [None]:
# Data insertion
records = [
    (
        str(uuid.uuid4()),               # id
        embedding,                       # vec numpy array or list
        {'text': text}                   # metadata: plain dict (stored as JSON)
    )
    for embedding, text in zip(text_embeddings, texts)
]

# Upsert operation
text_collection.upsert(records)
text_collection.create_index()
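
Before upserting, it can help to check that every embedding matches the collection's dimension (384 here), since a mismatch would otherwise only surface as a database error. The sketch below is one hedged way to do that in plain Python. `build_records` is a hypothetical helper, the metadata is kept as a plain dict (an assumption about the form vecs expects), and the toy vectors stand in for `text_embeddings`.

```python
import uuid

EXPECTED_DIMENSION = 384  # matches dimension=384 used when creating the collection

def build_records(embeddings, texts, dimension=EXPECTED_DIMENSION):
    """Pair each embedding with its text as (id, vector, metadata) tuples."""
    if len(embeddings) != len(texts):
        raise ValueError("embeddings and texts must have the same length")
    records = []
    for embedding, text in zip(embeddings, texts):
        vector = [float(x) for x in embedding]
        if len(vector) != dimension:
            raise ValueError(f"expected {dimension} dimensions, got {len(vector)}")
        records.append((str(uuid.uuid4()), vector, {"text": text}))
    return records

# Toy data standing in for text_embeddings and texts
demo_records = build_records([[0.0] * 384, [1.0] * 384], ["first page", "second page"])
```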


### Test Embeddings

In [None]:
# Test embedding
sample_vector = text_embeddings[0]

# If Embeddings loaded
# sample_vector = embeddings[0]

# Perform query
try:
    query_results = text_collection.query(
        data=sample_vector,
        limit=5,
        measure="cosine_distance",
        include_value=True,  # distance values
        include_metadata=True  # metadata
    )
    print("Query Results:")
    for result in query_results:
        print(result)
except Exception as e:
    print(f"An error occurred: {e}")


Query Results:
('5644e116-b3d8-4477-8ec4-5770630e88cc', 0.0, {'text': 'help and support by state need state specific help and support services? especially during the covid19 pandemic, it is important that we pro ... (881 characters truncated) ... ces tas pdf  33.6 kb  pamphlet covid support services vic pdf  33.6 kb  pamphlet covid support services wa pdf  34.2 kb  last updated 06 nov 2023  x'})
('4d970aff-7c46-4da9-9a66-3127eafd787e', 0.343814790248871, {'text': 'talking about drugs seeking help alcohol and other drug treatment services aim to assist people with problems around their drug use. their g ... (6302 characters truncated) ... for? try our intuitive path2help tool and be matched with support information and services tailored to you.  references  last updated 05 nov 2021  x'})
('c48f2d2a-733d-463d-a65a-677ea3be23eb', 0.419903287899098, {'text': 'help  support use path2help for tailored recommendations answer a few quick questions to be matched to drug and alcohol services and