requests
: This library is used for making HTTP requests in Python. It provides a simple and intuitive way to interact with web services and retrieve data from URLs. With requests, you can send GET, POST, PUT, DELETE, and other types of requests to web servers and handle the responses.

---
urllib3
: This library is a powerful HTTP client for Python. It provides a higher-level interface for making HTTP requests compared to the built-in urllib module. urllib3 supports features like connection pooling, SSL/TLS verification, and automatic retries, making it a popular choice for handling HTTP requests in Python.

``` 
*Connection pooling* is a technique used to manage a pool of reusable connections to a server. In the context of HTTP requests, connection pooling allows you to reuse existing connections instead of creating a new connection for each request. This can significantly improve the performance of your application by reducing the overhead of establishing a new connection for every request.

When you make an HTTP request using a library like urllib3, it maintains a pool of connections to the server. When you send a request, the library checks if there is an available connection in the pool. If there is, it reuses that connection instead of creating a new one. This avoids the overhead of establishing a new TCP connection, which involves a handshake process.

*SSL/TLS verification*  is a security feature that ensures the authenticity and integrity of the server you are communicating with over HTTPS. When you make an HTTPS request, the server presents a digital certificate that contains its public key. The client (your application) verifies the authenticity of the certificate to ensure that it is issued by a trusted certificate authority (CA) and that it is valid for the server's domain.

SSL/TLS verification helps protect against man-in-the-middle attacks, where an attacker intercepts the communication between the client and the server. By verifying the server's certificate, you can ensure that you are communicating with the intended server and that the data exchanged is encrypted and secure.

In the context of urllib3, it provides built-in support for SSL/TLS verification. It automatically verifies the server's certificate using a set of trusted CA certificates bundled with the library. This helps ensure that your HTTPS requests are secure and that you are communicating with trusted servers.
```
---
tqdm
: This library stands for "taqaddum" in Arabic, which means "progress" in English. It is a handy tool for adding progress bars to your Python loops. With tqdm, you can easily visualize the progress of your loops, making it easier to track the execution and estimate the remaining time


In [10]:
!pip install requests
!pip install urllib3
!pip install tqdm

795.33s - pydevd: Sending message related to process being replaced timed-out after 5 seconds




802.40s - pydevd: Sending message related to process being replaced timed-out after 5 seconds




809.54s - pydevd: Sending message related to process being replaced timed-out after 5 seconds




In [13]:
!pip install langchain

1584.04s - pydevd: Sending message related to process being replaced timed-out after 5 seconds


Collecting langchain
  Downloading langchain-0.1.11-py3-none-any.whl.metadata (13 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.28-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.25 (from langchain)
  Downloading langchain_community-0.0.25-py3-none-any.whl.metadata (8.1 kB)
Collecting langchain-core<0.2,>=0.1.29 (from langchain)
  Downloading langchain_core-0.1.29-py3-

In [1]:
!pip install upstash-vector

Collecting upstash-vector
  Downloading upstash_vector-0.2.0-py3-none-any.whl.metadata (5.4 kB)
Collecting httpx<0.26.0,>=0.25.0 (from upstash-vector)
  Downloading httpx-0.25.2-py3-none-any.whl.metadata (6.9 kB)
Downloading upstash_vector-0.2.0-py3-none-any.whl (9.7 kB)
Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.0/75.0 kB[0m [31m721.5 kB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hInstalling collected packages: httpx, upstash-vector
  Attempting uninstall: httpx
    Found existing installation: httpx 0.26.0
    Uninstalling httpx-0.26.0:
      Successfully uninstalled httpx-0.26.0
Successfully installed httpx-0.25.2 upstash-vector-0.2.0


In [7]:
!pip install google-cloud-aiplatform

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.43.0-py2.py3-none-any.whl.metadata (27 kB)
Collecting google-api-core!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.34.1 (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.34.1->google-cloud-aiplatform)
  Downloading google_api_core-2.17.1-py3-none-any.whl.metadata (2.7 kB)
Collecting google-auth<3.0.0dev,>=2.14.1 (from google-cloud-aiplatform)
  Downloading google_auth-2.28.1-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting proto-plus<2.0.0dev,>=1.22.0 (from google-cloud-aiplatform)
  Downloading proto_plus-1.23.0-py3-none-any.whl.metadata (2.2 kB)
Collecting protobuf!=3.20.0,!=3.20.1,!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.19.5 (from google-cloud-aiplatform)
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Collecting google-cloud-storage<3.0.0dev,>=1.32.0 (from 

In [11]:
import requests
import urllib.parse
from tqdm import tqdm

When you make an HTTP request and receive a response, the response object contains various information such as the status code, headers, and the response body. In this case, it seems that the response object is an HTTP response object.

The json() method is used to extract the response body as JSON (JavaScript Object Notation) data. JSON is a lightweight data interchange format that is commonly used for transmitting data between a server and a web application.

In [12]:
def extract_papers(query: str):
    query = urllib.parse.quote(query)
    url = f"https://paperswithcode.com/api/v1/papers/?q={query}"
    response = requests.get(url)
    response = response.json()
    count = response["count"]
    results = []
    results+=results

    num_pages = count // 50
    for page in tqdm(range(2, num_pages)):
        url = f"https://paperswithcode.com/api/v1/papers/?page={page}&q={query}"
        response = requests.get(url)
        response = response.json()
        results += response["results"]
    return results

In [9]:
query = "Large Language Models"
results = extract_papers(query)
print(len(results))


100%|██████████| 179/179 [05:09<00:00,  1.73s/it]

8950





Document objects have two parameters:

page_content (str): to store the text of the paper's abstract
metadata (dict): to store additional information. In our use case we'll keep: id, arxiv_id, url_pdf, title, authors, published

In [15]:
from langchain.docstore.document import Document

documents = [
    Document(
        page_content = result["abstract"],
        metadata={
            "id": result["id"] if result["id"] else "",
            "arxiv_id": result["arxiv_id"] if result["arxiv_id"] else "", 
            "url_pdf": result["url_pdf"] if result["url_pdf"] else "",
            "title": result["title"] if result["title"] else "",
            "authors": result["authors"] if result["authors"] else "",
            "published": result["published"] if result["published"] else ""
            },
    ) for result in results
]

we need to chunk them into smaller pieces. This helps overcome LLMs' limitations in terms of input tokens and provides fine-grained information per chunk.

Example: After chunking the documents with a chunk_size of 1200 characters and a chunk_overlap of 200, we end up with over 11K splits.

In [17]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap = 200,
    separators =["."],
)

splits = text_splitter.split_documents(documents)

print(len(splits))

14016


![Working](https://miro.medium.com/v2/resize:fit:700/1*DcFteAgc_cK37rCbMCxRMg.png)

In [3]:
from upstash_vector import Index
import os

index = Index(url="https://normal-mammal-32964-eu1-vector.upstash.io", token=os.get_env("UPSTASH_TOKEN"))

simple class that imitates LangChain Vectorstore implementation.

- An __init__ constructor that expects an Upstash Index and an Embeddings object

- add_documents to embed documents and index them in batches

- similarity_search_with_score to query the index and retrieve the top_k most relevant documents along with their corresponding scores

The `typing` library in Python provides support for type hints, which allow you to specify the expected types of variables, function parameters, and return values. This can help improve code readability and catch potential type-related errors during development.

In your code snippet, you are importing specific types from the `typing` library: `List`, `Optional`, `Tuple`, and `Union`.

- `List` represents a list of elements of a specific type. For example, `List[int]` represents a list of integers.
- `Optional` is used to indicate that a variable or function parameter can be of a specific type or `None`. For example, `Optional[str]` represents a string or `None`.
- `Tuple` represents an immutable sequence of elements of different types. For example, `Tuple[int, str]` represents a tuple containing an integer followed by a string.
- `Union` is used to indicate that a variable or function parameter can be of multiple types. For example, `Union[int, float]` represents a variable that can be either an integer or a float.

These type hints are not enforced at runtime, but they can be checked using static type checkers like `mypy` or integrated development environments (IDEs) with type checking capabilities. They provide valuable information to developers and tools for better understanding and maintaining the codebase.

Moving on to the next import statement, `from uuid import uuid4`, you are importing the `uuid4` function from the `uuid` module. The `uuid` module provides functions for generating and working with universally unique identifiers (UUIDs). The `uuid4` function specifically generates a random UUID.

UUIDs are commonly used to uniquely identify objects or entities in distributed systems or databases. They are 128-bit numbers represented as strings in a specific format. The `uuid4` function generates a random UUID using a combination of the system's MAC address, timestamp, and random numbers.

By importing `uuid4`, you can use it in your code to generate unique identifiers for various purposes, such as generating session IDs, creating unique filenames, or ensuring uniqueness in database records.

In [4]:
from typing import List, Optional, Tuple, Union
from uuid import uuid4
from langchain.docstore.document import Document
from langchain.embeddings.base import Embeddings
from tqdm import tqdm
from upstash_vector import Index

In [5]:
class UpstashVectorStore:
    def __init__(self, index: Index, embeddings: Embeddings):
        self.index = index
        self.embeddings = embeddings

    def delete_vectors(self, ids: Union[str, List[str]], delete_all: bool = None,):
        if delete_all:
            self.index.rest()
        else:
            self.index.delete(ids)

    def add_documents(self, documents: List[Document], ids: Optional[List[str]] = None, batch_size: int = 32):
        texts = []
        metadatas = []
        all_ids = []

        for document in documents:
            text = document.page_content
            metadata = document.metadata
            metadata = {"context": text, **metadata}
            texts.append(text)
            metadatas.append(metadata)

            if len(texts) >= batch_size:
                ids = [str(uuid4()) for _ in range(len(texts))]
                all_ids += ids
                embeddings = self.embeddings.embed_documents(texts, batch_size=250)
                self.index.upsert(
                    vectors = zip(ids, embeddings, metadatas)
                )
                texts = []
                metadatas = []

        if len(texts) > 0:
            ids = [str(uuid4()) for _ in range(len(texts))]
            all_ids += ids
            embeddings = self.embeddings.embed_documents(texts, batch_size=250)
            self.index.upsert(
                vectors = zip(ids, embeddings, metadatas)
            )

        n = len(all_ids)
        print(f"Sucessfully indexed {n} dense vectors to Upstash")
        print(self.index.stats())
        return all_ids
    

    def similarity_search_with_score(self, query: str, k: int = 4
                                     ) -> List[Tuple[Document, float]]:
        query_embedding = self.embeddings.embed_query(query)
        query_results = self.index.query(
            query_embedding, 
            top_k = k,
            include_metadata=True
        )

        output = []
        for query_result in query_results:
            score = query_result.score
            metadata = query_result.metadata
            context = metadata.pop("context")
            doc = Document(page_content=context, metadata=metadata)
            output.append((doc, score))

        return output



In [None]:
%pip install --upgrade --quiet langchain langchain-google-vertexai

In [8]:
from langchain.embeddings import VertexAIEmbeddings
from upstash_vector import Index

index = Index(url="https://normal-mammal-32964-eu1-vector.upstash.io", token="=")

embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@003")

upstash_vector_store = UpstashVectorStore(index=index, embeddings=embeddings)

ids = upstash_vector_store.add_documents(splits, batch_size=25)

GoogleAuthError: 
Unable to authenticate your request.
Depending on your runtime environment, you can complete authentication by:
- if in local JupyterLab instance: `!gcloud auth login` 
- if in Colab:
    -`from google.colab import auth`
    -`auth.authenticate_user()`
- if in service account or other: please follow guidance in https://cloud.google.com/docs/authentication