<a href="https://colab.research.google.com/github/jggomez/demos-embeddings-gemini/blob/main/Demo_Embeddings_Q%26A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
!pip install --upgrade google-cloud-aiplatform

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.55.0-py2.py3-none-any.whl (5.1 MB)
Installing collected packages: google-cloud-aiplatform
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 1.54.1
    Uninstalling google-cloud-aiplatform-1.54.1:
      Successfully uninstalled google-cloud-aiplatform-1.54.1
Successfully installed google-cloud-aiplatform-1.55.0


In [1]:
!pip install rich



In [2]:
from google.colab import auth
auth.authenticate_user()

In [3]:
import pandas as pd
from rich.console import Console
console = Console()

In [6]:
so_df = pd.read_csv('questions_answers_db.csv')
so_df.head()

Unnamed: 0,input_text,output_text,category
0,"python's inspect.getfile returns ""<string>""<p>...",<p><code>&lt;string&gt;</code> means that the ...,python
1,Passing parameter to function while multithrea...,<p>Try this and note the difference:</p>\n<pre...,python
2,How do we test a specific method written in a ...,"<p>Duplicate of <a href=""https://stackoverflow...",python
3,how can i remove the black bg color of an imag...,<p>The alpha channel &quot;disappears&quot; be...,python
4,How to extract each sheet within an Excel file...,<p>You need to specify the <code>index</code> ...,python


In [7]:
import pickle

In [8]:
with open('embeddings.pkl', 'rb') as file:
  question_embeddings = pickle.load(file)

In [9]:
so_df['embeddings'] = question_embeddings.tolist()

In [10]:
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances_argmin as distances_argmin

In [11]:
query = ['How to concat dataframes pandas']

In [26]:
import vertexai
from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

vertexai.init(project="devhack-3f0c2", location="us-central1")

model_name = "text-embedding-004"
# Note: `task` (and TextEmbeddingInput) are not applied below, because the
# query is passed to get_embeddings as plain strings.
task = "SEMANTIC_SIMILARITY"
dimensionality: int = 768
model = TextEmbeddingModel.from_pretrained(model_name)
kwargs = dict(output_dimensionality=dimensionality) if dimensionality else {}

# get_embeddings returns one embedding per input text; keep the first (only) one.
query_embedding = model.get_embeddings(query, **kwargs)[0].values

In [13]:
cos_sim_array = cosine_similarity([query_embedding],list(so_df.embeddings.values))
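
The cell above scores the query against every stored question embedding. As a rough illustration of what that score is, here is a minimal NumPy sketch of cosine similarity between one query vector and each row of a matrix. The helper `cosine_scores` and the toy vectors are hypothetical and only stand in for `query_embedding` and `so_df.embeddings`.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Dot product of the query with every document row.
    dots = doc_matrix @ query_vec
    # Product of the row norms and the query norm.
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return dots / norms

# Toy data (assumption): three 2-dimensional "documents" and one query.
toy_docs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = [1.0, 1.0]

scores = cosine_scores(toy_query, toy_docs)
best = int(np.argmax(scores))  # index of the most similar row
```

Only the direction of each vector matters here, so the third row wins even though it is the longest vector.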

In [14]:
index_doc_cosine = np.argmax(cos_sim_array)
index_doc_distances = distances_argmin([query_embedding],
                                       list(so_df.embeddings.values))[0]

In [15]:
console.print(so_df.input_text[index_doc_cosine])

In [16]:
console.print(so_df.output_text[index_doc_cosine])

In [17]:
from vertexai.generative_models import GenerativeModel

# Use a separate name so the embedding model bound to `model` stays available
# for the later cells that call model.get_embeddings.
gen_model = GenerativeModel(model_name="gemini-1.5-flash-001")

In [18]:
context = "Question: " + so_df.input_text[index_doc_cosine] +\
"\n Answer: " + so_df.output_text[index_doc_cosine]

In [19]:
# `query` is a one-element list, so interpolate the string itself.
prompt = f"""Here is the context: {context}
             Using the relevant information from the context,
             provide an answer to the query: {query[0]}.
             If the context doesn't provide \
             any relevant information, \
             answer with \
             [I couldn't find a good match in the \
             document database for your query]
             """

In [20]:
response = gen_model.generate_content([prompt])
print(response.text)

The provided context shows how to concatenate two dataframes in pandas, repeating values from the smaller dataframe to match the dimensions of the larger one. 

Here's a breakdown of the code and the steps involved:

1. **Import pandas:** 
   ```python
   import pandas as pd
   ```

2. **Create sample dataframes:**
   ```python
   df1 = pd.DataFrame({'Field1': [0.5, 2, 3, 4], 'Field2': [0.7, 1, 0.1, 0.4]})
   df2 = pd.DataFrame({'Date': ['2022-08-01', '2022-08-01'], 'Time': [1, 2]})
   ```

3. **Repeat the smaller dataframe:**
   ```python
   n = int(df1.size / df2.size)  # Calculate the repetition factor
   df3 = pd.concat([df2] * n, axis=0).reset_index(drop=True) 
   ```
   * `df1.size / df2.size` calculates how many times `df2` needs to be repeated to match the size of `df1`.
   * `pd.concat([df2] * n, axis=0)` concatenates `df2` with itself `n` times vertically (axis=0).
   * `reset_index(drop=True)` resets the index to avoid any unintended issues.

4. **Concatenate the dataframes:

In [21]:
!pip install scann



In [22]:
import scann


def create_index(embedded_dataset,
                 num_leaves,
                 num_leaves_to_search,
                 training_sample_size):
    """Build a ScaNN searcher over the given embedding matrix."""

    # Unit-normalize each row so that the dot product equals cosine similarity.
    normalized_dataset = embedded_dataset / np.linalg.norm(embedded_dataset, axis=1)[:, np.newaxis]

    searcher = (
        # 10 is the default number of neighbors returned per query.
        scann.scann_ops_pybind.builder(normalized_dataset, 10, "dot_product")
        .tree(
            num_leaves = num_leaves,
            num_leaves_to_search = num_leaves_to_search,
            training_sample_size = training_sample_size,
        )
        .score_ah(2, anisotropic_quantization_threshold = 0.2)
        .reorder(100)
        .build()
    )
    return searcher
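
`create_index` normalizes the dataset and then searches by dot product. The sketch below, on random toy data (an assumption, not the notebook's embeddings), checks the reasoning behind that: once both sides are unit length, the dot product equals cosine similarity, and normalizing only the dataset still preserves the top match.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_dataset = rng.normal(size=(50, 8))  # stand-in for question_embeddings
toy_query = rng.normal(size=8)          # stand-in for query_embedding

# Same normalization as in create_index, plus a unit-length query.
unit_dataset = toy_dataset / np.linalg.norm(toy_dataset, axis=1)[:, np.newaxis]
unit_query = toy_query / np.linalg.norm(toy_query)

# Cosine similarity computed directly from the raw vectors.
cosine = (toy_dataset @ toy_query) / (
    np.linalg.norm(toy_dataset, axis=1) * np.linalg.norm(toy_query)
)

# Dot product on unit vectors reproduces the cosine scores.
same_scores = np.allclose(cosine, unit_dataset @ unit_query)

# With an unnormalized query the scores are scaled by a positive constant,
# so the best match is unchanged even though the values are not cosines.
same_top = int(np.argmax(cosine)) == int(np.argmax(unit_dataset @ toy_query))
```

One consequence is that the distances returned by the index are true cosine similarities only if the query vector is also unit length.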

In [23]:
# Create the index using ScaNN
index = create_index(embedded_dataset = question_embeddings,
                     num_leaves = 25,
                     num_leaves_to_search = 10,
                     training_sample_size = 2000)


In [None]:
query = ["how to concat dataframes pandas"]

In [40]:
import time

# The timed region includes the embedding request as well as the index search.
start = time.time()
query_embedding = model.get_embeddings(query, **kwargs)[0].values
neighbors, distances = index.search(query_embedding, final_num_neighbors = 1)
end = time.time()

# Use doc_id rather than id to avoid shadowing the built-in.
for doc_id, dist in zip(neighbors, distances):
    console.print(f"[docid:{doc_id}] [{dist}] -- {so_df.input_text[int(doc_id)][:125]}...")

console.print("Latency (ms):", 1000 * (end - start))

In [37]:
start = time.time()
query_embedding = model.get_embeddings(query, **kwargs)[0].values
cos_sim_array = cosine_similarity([query_embedding], list(so_df.embeddings.values))
index_doc = np.argmax(cos_sim_array)
end = time.time()

print(f"[docid:{index_doc}] [{np.max(cos_sim_array)}] -- {so_df.input_text[int(index_doc)][:125]}...")

print("Latency (ms):", 1000 * (end - start))

[docid:35] [0.6432657949636755] -- Concatenate 2 dataframes and repeat values from small one with pandas<p>I have these two dataframes:</p>
<div class="s-table-...
Latency (ms): 251.06334686279297
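
Both timed cells above wrap the `get_embeddings` call and the search in the same timer, so the reported latency mixes the embedding request with the lookup itself. A minimal sketch of timing only the search step is shown below. The helpers `time_ms` and `brute_force_top1` and the random toy data are hypothetical and are not part of the notebook.

```python
import time
import numpy as np

def time_ms(fn, *args, repeats=5, **kwargs):
    """Run fn a few times and return (last result, best wall-clock time in ms)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, 1000 * (time.perf_counter() - start))
    return result, best

def brute_force_top1(query_vec, unit_docs):
    """Exact nearest neighbor by dot product over unit-length rows."""
    return int(np.argmax(unit_docs @ query_vec))

# Toy data (assumption): 2000 random unit vectors of dimension 768.
rng = np.random.default_rng(0)
toy_docs = rng.normal(size=(2000, 768)).astype(np.float32)
toy_docs /= np.linalg.norm(toy_docs, axis=1, keepdims=True)
toy_query = toy_docs[42]  # querying with a stored row should return that row

top1, search_ms = time_ms(brute_force_top1, toy_query, toy_docs)
print(f"top-1 index: {top1}, search latency (ms): {search_ms:.3f}")
```

The same wrapper could be applied to `index.search` with a precomputed `query_embedding`, which would compare the two search methods without the embedding request in the measurement.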
