<a href="https://colab.research.google.com/github/juyalm/AITest/blob/main/vector_hello_world.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/supabase/supabase/blob/master/examples/ai/vector_hello_world.ipynb)

Explain:

1. What is a vector database?
2. What are embeddings?
3. How does a vector database work?
4. OpenAI?
5. Open-source embeddings?
6. SBERT (Sentence Transformers)?
7. Other

**Vector Database** - A vector database stores embeddings, the numeric vectors that represent unstructured information such as text, images, and audio, and retrieves items by how similar their vectors are to a query vector.
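
To make the idea concrete, here is a minimal in-memory sketch of what a vector database does: it stores (id, vector, metadata) records and returns the records whose vectors are closest to a query vector. The class name `TinyVectorStore`, its methods, and the toy 3-dimensional vectors are hypothetical and chosen only for illustration; production systems such as Pinecone typically use approximate nearest-neighbor indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical brute-force store: keeps every record in a dict."""

    def __init__(self):
        self.records = {}  # id -> (vector, metadata)

    def upsert(self, record_id, vector, metadata):
        # Insert a new record or overwrite an existing one with the same id
        self.records[record_id] = (vector, metadata)

    def query(self, vector, top_k=1):
        # Score every stored vector against the query, highest score first
        scored = [
            (cosine_similarity(vector, stored), record_id, metadata)
            for record_id, (stored, metadata) in self.records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


store = TinyVectorStore()
# Toy vectors (assumed values, not real embeddings)
store.upsert("cat", [0.9, 0.1, 0.0], {"content": "a text about cats"})
store.upsert("dog", [0.8, 0.3, 0.1], {"content": "a text about dogs"})
store.upsert("car", [0.0, 0.2, 0.9], {"content": "a text about cars"})

matches = store.query([1.0, 0.0, 0.0], top_k=2)
for score, record_id, metadata in matches:
    print(f"{record_id}: {score:.3f} {metadata['content']}")
```

The rest of the notebook follows the same upsert-then-query pattern, with real embeddings in place of the toy vectors.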




### Calculate Embeddings

In [None]:
from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = ["Govind"]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
# len(embeddings) counts the encoded sentences, not the vector dimension
print(f"Number of embeddings: {len(embeddings)}")
print(embeddings)
# embeddings.shape is (1, 384): one sentence, 384 dimensions

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# With a single sentence this is a 1x1 tensor: the sentence compared with itself


Number of embeddings: 1
[[-7.23612458e-02  9.58128273e-03  3.29899341e-02  1.33510614e-02
  -8.06161016e-03 -3.86873186e-02  8.66838768e-02  1.50415152e-02
  -2.75545437e-02  2.23062076e-02 -8.90007988e-03 -3.39011587e-02
   1.64142426e-03 -4.63389568e-02 -4.69612591e-02 -4.24349345e-02
  -1.54521791e-02 -3.14539634e-02 -3.98297831e-02 -3.08344979e-02
   1.07656889e-01  8.92074704e-02 -5.64672239e-02 -5.23667410e-03
  -3.77349705e-02  5.52422293e-02 -6.81892410e-02  3.91672179e-03
   2.24090815e-02 -1.44242663e-02  3.99211384e-02  7.05085788e-03
   1.45064788e-02 -7.61716906e-03  2.28933934e-02 -4.99336235e-03
   4.55425791e-02 -4.64929342e-02  5.92937022e-02 -1.52060296e-02
   3.05485949e-02 -9.33179632e-03  3.54366936e-02 -1.71512160e-02
   1.33418534e-02 -1.73805188e-02 -5.00551332e-03  3.19467001e-02
  -5.25505980e-03 -4.92046180e-04 -5.00572510e-02 -5.03150094e-03
   7.59648904e-02  6.96488395e-02  6.60534874e-02 -6.49355203e-02
  -5.45739308e-02 -1.16346143e-02  1.07741635e-03

### Semantic or similarity search

Given a question or search query, these models can find the relevant text passages.

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")

query_embedding = model.encode("How big is London")
passage_embeddings = model.encode([
    "London is known for its financial district",
    "London has 9,787,426 inhabitants at the 2011 census",
    "The United Kingdom is the fourth largest exporter of goods in the world",
])

similarity = model.similarity(query_embedding, passage_embeddings)
# One score per passage; the highest score marks the best match
print(similarity)

tensor([[0.4656, 0.6142, 0.2697]])
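
The similarity tensor holds one score per passage, so choosing the answer comes down to sorting. A small sketch using the scores printed above, copied into a plain Python list (in the notebook, `similarity[0].tolist()` should give the same list):

```python
passages = [
    "London is known for its financial district",
    "London has 9,787,426 inhabitants at the 2011 census",
    "The United Kingdom is the fourth largest exporter of goods in the world",
]
# Scores copied from the printed output above
scores = [0.4656, 0.6142, 0.2697]

# Pair each passage with its score and sort from most to least similar
ranked = sorted(zip(scores, passages), reverse=True)

for score, passage in ranked:
    print(f"{score:.4f}  {passage}")

best_score, best_passage = ranked[0]
```

The census sentence ranks first, which matches the intent of the query "How big is London".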


### Get similarity score between 2 sentences using cosine similarity

In [None]:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my black cat?")
# Get the cosine similarity score between the two sentences
cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.4389]])
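
For reference, `util.cos_sim` computes the standard cosine similarity between the two embedding vectors. Written for vectors $a$ and $b$ of dimension $n$:

$$
\operatorname{cos\_sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The value ranges from -1 to 1, and higher values mean the vectors point in more similar directions.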


### Semantic Search 2

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two lists of sentences
sentences1 = [
    "India is a big country",
    "India is diverified",
    "Australia is even bigger than India",
]

sentences2 = [
    "India is amongst the vast countries in the world",
    "Australia is a continent",
    "Nepal is a tiney country",
]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute cosine similarities
similarities = model.similarity(embeddings1, embeddings2)

# Output the pairs with their score
for idx_i, sentence1 in enumerate(sentences1):
    print(sentence1)
    for idx_j, sentence2 in enumerate(sentences2):
        print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")

India is a big country
 - India is amongst the vast countries in the world: 0.8375
 - Australia is a continent      : 0.3187
 - Nepal is a tiney country      : 0.3991
India is diverified
 - India is amongst the vast countries in the world: 0.6734
 - Australia is a continent      : 0.2808
 - Nepal is a tiney country      : 0.3424
Australia is even bigger than India
 - India is amongst the vast countries in the world: 0.6213
 - Australia is a continent      : 0.6686
 - Nepal is a tiney country      : 0.2844


# HuggingFace Embeddings
Hugging Face sentence-transformers is a Python framework for state-of-the-art sentence, text and image embeddings. In LangChain, you can use these embedding models through the `HuggingFaceEmbeddings` class from the `langchain-huggingface` package.

In [None]:
%pip install -qU langchain-huggingface

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
text = "This is a test document."

hf_embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
query_result = hf_embeddings.embed_query(text)
print(query_result)

[-0.038338541984558105, 0.12346471846103668, -0.02864297851920128, 0.05365270376205444, 0.008845366537570953, -0.03983934596180916, -0.07300589233636856, 0.04777132719755173, -0.030462471768260002, 0.05497974902391434, 0.08505292981863022, 0.03665666654706001, -0.005319973453879356, -0.002233141800388694, -0.06071099638938904, -0.027237920090556145, -0.01135166734457016, -0.042437683790922165, 0.00912993960082531, 0.10081552714109421, 0.07578728348016739, 0.06911715865135193, 0.009857431054115295, -0.0018377641681581736, 0.02624903991818428, 0.03290243074297905, -0.07177437096834183, 0.028384247794747353, 0.06170954555273056, -0.052529532462358475, 0.033661652356386185, 0.07446812838315964, 0.07536034286022186, 0.03538404777646065, 0.06713404506444931, 0.010798045434057713, 0.08167017996311188, 0.016562897711992264, 0.03283063694834709, 0.036325663328170776, 0.0021727988496422768, -0.09895738214254379, 0.0050467848777771, 0.05089650675654411, 0.009287580847740173, 0.024507684633135796,

# Similarity Search with a document

In [4]:
# Python packages for PDF loading, parsing, and OCR
%pip install -qU langchain-community unstructured unstructured_inference unstructured_pytesseract pdfminer.six pi_heif
# System packages used by unstructured for PDF rendering and OCR
!apt-get install -y -qq poppler-utils tesseract-ocr libtesseract-dev



Import libraries and load environment variables

In [5]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pi_heif import register_heif_opener
from dotenv import load_dotenv
import os
# PDF parsing is delegated to the unstructured library and its supporting
# packages, so nothing needs to be imported from pdfminer directly.
# All required dependencies were installed in the previous cell.

In [6]:


load_dotenv('.env')
pinecone_api_key = os.getenv('PINECONE_API_KEY')
open_ai_key = os.getenv('OPENAI_API_KEY')
pinecone_env_key = os.getenv('PINECONE_ENV_KEY')
# Never print secret values; only confirm that they were loaded
print("PINECONE_API_KEY set:", bool(pinecone_api_key))
print("OPENAI_API_KEY set:", bool(open_ai_key))
print("PINECONE_ENV_KEY:", pinecone_env_key)

PINECONE_API_KEY set: True
OPENAI_API_KEY set: True
PINECONE_ENV_KEY: us-east-1
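
Notebook outputs are often shared, so it is safer not to echo API keys in full when checking that they loaded. A minimal sketch of a masking helper; the function name `mask_secret` and the example key are hypothetical and not part of any library:

```python
def mask_secret(value, visible=4):
    """Return a log-safe version of a secret: only the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Hypothetical key used only for this demonstration
print(mask_secret("example-key-1234"))
print(mask_secret(None))
```

Showing only the last few characters is usually enough to tell which key is in use without exposing it.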


Load the data

In [None]:
loader = UnstructuredPDFLoader("/content/sample_data/Java_8_in_Action.pdf")
# loader = PyPDFLoader("\sample_data\Java_8_in_Action.pdf")
data = loader.load()

In [None]:
print(f'You have {len(data)} document(s) in your data')
print(f'There are {len(data[0].page_content)} characters in your sample document')
print(f'Here is a sample: {data[0].page_content[:300]}')

You have 1 document(s) in your data
There are 833480 characters in your sample document
Here is a sample: IN ACTION

ld

Mario Fusco Alan My

p

Raoul-Gabr

Ba MANNING

www. it-ebooks.info

Java 8 in Action: Lambdas, streams, and functional-style programming

Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft

(BY Mannine pusticarions

2

www. it-ebooks.info

Copyright

For online information and orderin


Chunking of data

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(data)

In [None]:
# Let's see how many small chunks we have
print(f'Now you have {len(texts)} documents')

Now you have 2337 documents
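
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to break on natural separators such as paragraphs and newlines, which this hypothetical `split_with_overlap` function ignores.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.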


In [7]:
!pip install pinecone-client -q


In [2]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")




# Pinecone integration

In [8]:
from pinecone import Pinecone

# Create a client instance (this replaces the older pinecone.init call).
# Naming it `pc` avoids shadowing the `pinecone` module.
pc = Pinecone(api_key=pinecone_api_key, environment=pinecone_env_key)

# To create the index first, call pc.create_index("demo", dimension=384, ...);
# recent client versions also require a `spec` argument there.
index = pc.Index("demo")  # Connect to an existing index

# Upsert the data

In [None]:
# Upload in batches: sending the entire list in a single upsert call creates
# a very large request payload, exceeding the 4MB limit.
batch_size = 100  # Adjust this value as needed

# Read the current vector count once so new IDs continue after existing ones.
# Querying it inside the loop costs one network call per chunk and makes the
# IDs skip ahead once earlier batches have been stored.
start_id = index.describe_index_stats()['total_vector_count']

upserted_data = []
for i, item in enumerate(texts):
    upserted_data.append(
        (
            str(start_id + i),                          # unique vector ID
            model.encode(item.page_content).tolist(),   # 384-dimensional embedding
            {'content': item.page_content},             # original text as metadata
        )
    )

    # Upsert in batches
    if len(upserted_data) >= batch_size:
        index.upsert(vectors=upserted_data)
        upserted_data = []  # Reset for the next batch

# Upsert any remaining data
if upserted_data:
    index.upsert(vectors=upserted_data)
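
The batching logic can also be factored into a small generator. This is a sketch with a hypothetical helper name, `batched`, and a placeholder list standing in for the 2337 prepared vectors:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the 2337 chunks produced earlier (placeholder values)
fake_vectors = list(range(2337))
batches = list(batched(fake_vectors, 100))
print(len(batches), len(batches[-1]))
# 24 37
```

With such a helper, the upload loop reduces to `for batch in batched(vectors, 100): index.upsert(vectors=batch)`, where `vectors` is the full list of (id, embedding, metadata) tuples.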

# Query result
This cell runs a similarity search against the Pinecone index populated earlier. It encodes the text query into an embedding, asks Pinecone for the `top_k` most similar vectors, and prints the stored text of each match.

In [13]:
query = "Whats new in java 8"
query_em = model.encode(query).tolist()
result = index.query(vector=query_em, top_k=4, include_metadata=True)

# Iterate through the matches and print only the 'content'
for match in result['matches']:
    print(match['metadata']['content'])

So far we’ve summarized the concepts of Java 8. We now turn to the thornier subject of what future enhancements and great new features may be in Java’s pipeline beyond Java 8.

434

www. it-ebooks.info

16.2. What’s ahead for Java?
material is positioned toward the end of the book to provide additional insight into why the new Java 8 features were added.
This chapter covers

e New Java 8 features and their evolutionary effect on programming style e A few unfinished-business ideas started by Java 8

e¢ What Java 9 and Java 10 might bring
e¢ What Java 9 and Java 10 might bring

We covered a lot of material in this book, and we hope you now feel that you’re ready to start using the new Java 8 features in your own code, perhaps building on our examples and quizzes. In this chapter we review the journey of learning about Java 8 and the gentle push toward functional-style programming. In addition, we speculate on what future enhancements and great new features may be in Java’s pipeline beyon

In [None]:
!apt-get install -qq git

Let's augment the Pinecone search results with an open-source LLM.
Install the required libraries:

In [9]:
!pip install -q torch transformers accelerate bitsandbytes langchain sentence-transformers faiss-cpu openpyxl pacmap datasets langchain-community ragatouille
!pip install -q --upgrade bitsandbytes accelerate
# For multi-platform bitsandbytes (CPU support)
!pip install bitsandbytes[cpu]




# Load the LLM and Tokenizer
This code loads the Flan-T5 model and tokenizer and creates a pipeline for text generation.

In [40]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Model name
READER_MODEL_NAME = "google/flan-t5-small"

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer. The name reader_model keeps the
# SentenceTransformer stored in `model` available for encoding text.
reader_model = AutoModelForSeq2SeqLM.from_pretrained(READER_MODEL_NAME).to(device)
tokenizer = AutoTokenizer.from_pretrained(READER_MODEL_NAME)

READER_LLM = pipeline(
    task="text2text-generation",  # Flan-T5 is a sequence-to-sequence model
    model=reader_model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.1,
    max_new_tokens=500,
    device=device,
)
print(READER_LLM)

# Quick smoke test of the pipeline
test_prompt = "Translate this: Hello, how are you? to French."
test_output = READER_LLM(test_prompt)
print(test_output)

Device set to use cpu


<transformers.pipelines.text2text_generation.Text2TextGenerationPipeline object at 0x7e464a178340>
[{'generated_text': "Beaucoup, c'est-ce que vous?"}]


This prompt template will be used to structure the input to the LLM. It includes a system message, the context (retrieved documents), and the user's question.

In [41]:
# Define the prompt template. The <|system|>, <|user|> and <|assistant|> markers
# come from a chat-model prompt format; they are not special tokens for Flan-T5,
# so the model reads them as ordinary text.
chat_template = """<s><|system|>
    Using the information contained in the context, give a comprehensive answer to the question.
    Respond only to the question asked, response should be concise and relevant to the question.
    Provide the number of the source document when relevant.
    If the answer cannot be deduced from the context, do not give an answer.
    <|user|>
    Context:
    {context}
    ---
    Now here is the question you need to answer.

    Question: {question}
    <|assistant|>"""

# {question} and {context} stay as placeholders and are filled in at query time
RAG_PROMPT_TEMPLATE = chat_template

# Print the template (optional)
print(RAG_PROMPT_TEMPLATE)

<s><|system|>
    Using the information contained in the context, give a comprehensive answer to the question.
    Respond only to the question asked, response should be concise and relevant to the question.
    Provide the number of the source document when relevant.
    If the answer cannot be deduced from the context, do not give an answer.
    <|user|>
    Context:
    {context}
    ---
    Now here is the question you need to answer.

    Question: {question}
    <|assistant|>


In [12]:
index_stats = index.describe_index_stats()
print(index_stats)

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 2337}},
 'total_vector_count': 2337}


In [54]:
from sentence_transformers import SentenceTransformer

# Dedicated name for the embedding model, distinct from the reader LLM
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Perform Pinecone query
query = "Streams vs. collections in Java"
query_em = embedding_model.encode(query).tolist()
result = index.query(vector=query_em, top_k=2, include_metadata=True)

# Concatenate the text of the retrieved chunks, separated by blank lines
context = ""
for match in result['matches']:
    context += match['metadata']['content'] + "\n\n"

# Fall back to a placeholder message when nothing was retrieved
if not context:
    context = "No relevant documents found."

# Format the final prompt
final_prompt = RAG_PROMPT_TEMPLATE.format(question=query, context=context)

# Generate answer using the LLM
generated_answer = READER_LLM(final_prompt)[0]['generated_text']
print(generated_answer)

Using the information contained in the context of this passage, we'll just say that the new Streams API behaves very similarly to Java’s existing collections API: both provide access to sequences of data items. But it’s useful for now to keep in mind that Collections is mostly about storing and accessing data, whereas Streams is mostly about describing computations on data.
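
The prompt asks the model to provide the number of the source document, but the context assembled above does not number the passages. One possible way to label them is sketched below; the helper name `build_context`, the shortened template, and the placeholder passages are assumptions made for illustration.

```python
def build_context(documents):
    """Join retrieved passages into one string, labelling each with its index."""
    return "\n\n".join(
        f"Document {i}:\n{text}" for i, text in enumerate(documents)
    )


# Placeholder passages standing in for the retrieved chunks
docs = ["Streams describe computations on data.", "Collections store data."]
template = "Context:\n{context}\n---\nQuestion: {question}"
prompt = template.format(
    context=build_context(docs),
    question="Streams vs. collections?",
)
print(prompt)
```

In the notebook, the list passed to `build_context` would be the `content` metadata of each match returned by `index.query`.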
