## Install a Pinecone client

Pinecone exposes a simple REST API for interacting with your vector database. You can use the API directly, or you can use one of the official Pinecone clients. This notebook uses the Python client.

## Initialize your connection

Using your API key and environment, initialize your client connection to Pinecone:

In [9]:
import os
import pinecone
from dotenv import load_dotenv, find_dotenv
import tqdm

load_dotenv(find_dotenv())
API_KEY=os.getenv("PINECONE_API_KEY")

pinecone.init(api_key=API_KEY, environment="asia-southeast1-gcp")
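If the `.env` file is missing or the variable is misspelled, `os.getenv` silently returns `None`, and the problem only surfaces later as an authentication error. As a minimal sketch, a small helper can fail early with a clearer message. The name `require_env` is hypothetical and is not part of the Pinecone client:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not part of the Pinecone client.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Possible usage in place of os.getenv in the cell above:
# API_KEY = require_env("PINECONE_API_KEY")
```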

## Create an index

In Pinecone, you store vector embeddings in indexes. In each index, the vectors share the same dimensionality and distance metric for measuring similarity.

Create an index named "quickstart" that performs nearest-neighbor search using the Euclidean distance metric for 8-dimensional vectors:

In [10]:
pinecone.create_index("quickstart", dimension=8, metric="euclidean")
pinecone.describe_index("quickstart")

IndexDescription(name='quickstart', metric='euclidean', replicas=1, dimension=8.0, shards=1, pods=1, pod_type='p1.x1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [11]:
index = pinecone.Index("quickstart")

## Insert vectors

Now that you've created your index, insert sample vectors into 2 distinct namespaces.

Namespaces let you partition vectors within a single index. Although optional, they are a best practice for speeding up queries, which can be filtered by namespace, and for complying with multi-tenancy requirements.

The index object created above is a client instance that targets the "quickstart" index. Use its upsert operation to write eight 8-dimensional vectors into 2 distinct namespaces, 4 vectors per namespace:

When upserting larger amounts of data, upsert data in batches of 100 vectors or fewer over multiple upsert requests.



In [13]:
index.upsert(
  vectors=[
    {"id": "vec1", "values": [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]},
    {"id": "vec2", "values": [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]},
    {"id": "vec3", "values": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]},
    {"id": "vec4", "values": [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]}
  ],
  namespace="ns1"
)

index.upsert(
  vectors=[
    {"id": "vec5", "values": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]},
    {"id": "vec6", "values": [0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]},
    {"id": "vec7", "values": [0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]},
    {"id": "vec8", "values": [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]}
  ],
  namespace="ns2"
)

{'upserted_count': 4}
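The advice above about upserting in batches of 100 vectors or fewer can be sketched with a small generator. The helper name `chunked` and the 250 sample vectors are assumptions made for illustration. Only the batching logic runs here; the actual upsert call is left as a comment because it needs a live index.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


# 250 made-up 8-dimensional vectors in the same format as the cell above.
vectors = [{"id": f"vec{i}", "values": [i / 1000.0] * 8} for i in range(250)]

batch_sizes = [len(batch) for batch in chunked(vectors, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]

# With the `index` object from this notebook, each batch would be sent as:
# for batch in chunked(vectors, batch_size=100):
#     index.upsert(vectors=batch, namespace="ns1")
```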

## Check the index

Pinecone is eventually consistent, so there can be a delay before your vectors are visible to queries. Use the describe_index_stats operation to check if the current vector count matches the number of vectors you inserted:

In [14]:
index.describe_index_stats()

{'dimension': 8,
 'index_fullness': 0.0,
 'namespaces': {'ns1': {'vector_count': 4}, 'ns2': {'vector_count': 4}},
 'total_vector_count': 8}
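Because of the delay described above, a single call may report fewer vectors than you inserted. One way to handle this is to poll until the count catches up or a timeout expires. The sketch below assumes the stats response can be read like a dictionary with a `total_vector_count` key, as the output above suggests. The helper `wait_for_vector_count` and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_vector_count(
    get_stats: Callable[[], Mapping[str, Any]],
    expected: int,
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll `get_stats` until it reports at least `expected` vectors.

    Returns True once the count is reached, or False if `timeout_s` elapses.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        stats = get_stats()
        if stats["total_vector_count"] >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Possible usage with the `index` object from this notebook:
# ready = wait_for_vector_count(index.describe_index_stats, expected=8)
```

Passing the stats function as an argument keeps the helper independent of the client, so it can be exercised with a stub before pointing it at a real index.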

## Run a similarity search

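Query each namespace in your index for the 3 vectors that are most similar to an example 8-dimensional vector, using the Euclidean distance metric you specified for the index. Because a notebook cell displays only the value of its last expression, the output below shows the result of the second query (namespace ns2) only: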

In [15]:
index.query(
  namespace="ns1",
  vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],
  top_k=3,
  include_values=True
)

index.query(
  namespace="ns2",
  vector=[0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7],
  top_k=3,
  include_values=True
)

{'matches': [{'id': 'vec7',
              'score': 0.0,
              'values': [0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]},
             {'id': 'vec8',
              'score': 0.0799999237,
              'values': [0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]},
             {'id': 'vec6',
              'score': 0.0799999237,
              'values': [0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]}],
 'namespace': 'ns2'}
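The scores above are 0.0 for the exact match and roughly 0.08 for the two neighbors. These values are consistent with the squared Euclidean distance between the query and each stored vector, not the plain distance, which would be about 0.283. The following sketch recomputes them locally. The function `squared_euclidean` is written here for illustration and is not a Pinecone API.

```python
from typing import Sequence


def squared_euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared element-wise differences between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [0.7] * 8

print(squared_euclidean(query, [0.7] * 8))            # 0.0  (vec7)
print(round(squared_euclidean(query, [0.8] * 8), 6))  # 0.08 (vec8)
print(round(squared_euclidean(query, [0.6] * 8), 6))  # 0.08 (vec6)
```

The small difference between 0.08 and the reported 0.0799999237 is in line with floating-point rounding.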

## Clean up

The Starter plan allows only a single index, so once you're done with the "quickstart" index, use the delete_index operation to delete it:

In [16]:
pinecone.delete_index("quickstart")

## Retrieval Augmentation in LangChain

LLMs have a data freshness problem. Even the most powerful LLMs, such as GPT-4, know nothing about events that occurred after their training data was collected.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as captured in its training data.

A solution to this problem is retrieval augmentation. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
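Before turning to the real libraries, the retrieve-then-prompt idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two-sentence knowledge base, the function names, and the word-overlap scoring, which stands in for the vector similarity search a real system would use.

```python
from typing import List


def retrieve(query: str, knowledge_base: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by the number of words they share with the query.

    Word overlap is a toy stand-in for embedding-based similarity search.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, contexts: List[str]) -> str:
    """Place the retrieved passages ahead of the question in the prompt."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )


knowledge_base = [
    "Pinecone stores vector embeddings in indexes",
    "LangChain chains LLM calls together",
]
question = "where are vector embeddings stored"

contexts = retrieve(question, knowledge_base, top_k=1)
print(build_prompt(question, contexts))
```

The prompt that reaches the LLM now carries the retrieved passage, so the model can answer from supplied text instead of relying only on its training data.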

To begin, we must install the prerequisite libraries that we will be using in this notebook.

!pip install -qU langchain==0.0.162 openai tiktoken "pinecone-client[grpc]" datasets apache_beam mwparserfromhell

## Building the Knowledge Base

In [2]:
from datasets import load_dataset

data = load_dataset("wikipedia", "20220301.simple", split='train[:10000]')
data

ModuleNotFoundError: No module named 'datasets'