# Pinecone

>[Pinecone](https://docs.pinecone.io/docs/overview) is a vector database with broad functionality.

This notebook shows how to use functionality related to the `Pinecone` vector database.

To use Pinecone, you must have an API key.
Here are the [installation instructions](https://docs.pinecone.io/docs/quickstart).

# Poetry Integration

This Notebook demonstrates how to use the Python dependency and virtualenv manager, Poetry.

Scenarios it demonstrates:

* Installing packages via Poetry
* Installing a package from a GitHub branch of a project that itself uses Poetry
* Installing the above correctly when the target Python package resides in a subdirectory of that project
* Configuring Poetry to create a virtualenv directly in the project working directory - simplifying use within a Jupyter Notebook

# Original project requirements
Prior to cutting this Notebook / demo over to use Poetry, its dependencies were:
* pinecone-client
* openai
* tiktoken
* langchain


In [None]:
# Install Poetry which is a dependency and virtualenv manager
# Read more at https://python-poetry.org/docs
! curl -sSL https://install.python-poetry.org | python3 -

In [None]:
# Store the full path to the Poetry executable in a variable so that subsequent commands can simply call `$poetry <cmd>`
# This is admittedly a bit of a hack - what we really want here is Poetry on the PATH, or a unix alias that will persist throughout the Notebook
poetry = "/root/.local/bin/poetry"

# Sanity check that poetry installation and alias hack succeeded
! $poetry --version
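
As an alternative to the variable hack above, the kernel's own `PATH` can be extended so that later shell commands find `poetry` by name. This is a minimal sketch, assuming the installer placed Poetry under `~/.local/bin`; the helper `prepend_to_path` is a hypothetical name, not part of Poetry or Jupyter.

```python
import os
from pathlib import Path


def prepend_to_path(directory: str, env: dict) -> str:
    """Return a PATH value with `directory` first, without duplicating it."""
    current = env.get("PATH", "")
    parts = [p for p in current.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory] + parts)


# Assumption: Poetry was installed to ~/.local/bin (adjust if your installer reports another location).
poetry_bin_dir = str(Path.home() / ".local" / "bin")

# Processes started by the kernel after this point inherit the updated PATH.
os.environ["PATH"] = prepend_to_path(poetry_bin_dir, dict(os.environ))
```

With this in place, `! poetry --version` should work without the `$poetry` prefix, provided the install location assumption holds.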

In [None]:
# Create a new pyproject.toml file in the root, which signifies that this "project" within Jupyter Notebook is using Poetry
! $poetry init --no-interaction

In [None]:
# To keep things simple for the purposes of testing, tell poetry not to create virtualenvs
! $poetry config virtualenvs.create false
! $poetry config virtualenvs.in-project false

In [None]:
# We've disabled virtualenv creation to keep things simple for this test notebook, but the code below
# demonstrates how you could handle a Poetry virtualenv within a Jupyter notebook

# Run this next command to get poetry to tell you about the virtualenv settings
#! $poetry env info

# This is how you could extract the virtualenv's path
# VENV_PATH = ! $poetry env info --path

# Sanity check the output - note that, if successful, the return value will be a Python list containing
# a single string
# print(VENV_PATH)

# VENV_PATH = VENV_PATH[0]
# print(VENV_PATH)

# Activate the virtualenv
# !source {VENV_PATH}/bin/activate

# Ensure the virtualenv set up above is also the active one - the output of this command is the list of
# virtualenvs that Poetry is managing as well as which is currently active
# ! $poetry env list
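
The same virtualenv lookup can be done from plain Python with `subprocess`, which avoids relying on the list that IPython returns for `!` commands. This is a sketch under the assumption that `poetry` is callable; `first_nonempty_line` and `poetry_venv_path` are hypothetical helper names. Note also that each `!` command runs in its own subshell, so `!source .../bin/activate` is unlikely to keep the virtualenv active for later cells.

```python
import subprocess
from typing import List


def first_nonempty_line(lines: List[str]) -> str:
    """Pick the first non-blank line, e.g. from the list IPython returns for `!cmd`."""
    for line in lines:
        if line.strip():
            return line.strip()
    raise ValueError("command produced no output")


def poetry_venv_path(poetry: str = "poetry") -> str:
    """Ask Poetry for the virtualenv path; raises CalledProcessError if the command fails."""
    result = subprocess.run(
        [poetry, "env", "info", "--path"],
        capture_output=True,
        text=True,
        check=True,
    )
    return first_nonempty_line(result.stdout.splitlines())


# Example (not executed here because it needs Poetry and a virtualenv):
# VENV_PATH = poetry_venv_path("/root/.local/bin/poetry")
```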

In [None]:
# If you don't pass the --no-ansi flag, then the poetry add and install commands will run, but
# will not persist the changes to the pyproject.toml file.
# The --no-ansi flag appears to be required for at least the poetry add and poetry install commands
# See: https://stackoverflow.com/questions/75245758/how-to-use-poetry-in-google-colab for more info

# Install base dependencies and persist them to the pyproject.toml and poetry.lock files
! $poetry --no-ansi add pinecone-client openai tiktoken

# Install the Jupyter package as a dev dependency
! $poetry --no-ansi add --group dev jupyter

In [None]:
# Install the langchain branch from the smartcat fork that we're testing - there are a few things going on here:
# 1. We are installing langchain from a fork, not from its default location on GitHub or via pip
# 2. We are furthermore pinning a specific git ref (in this case the branch pinecone-optimization, given after the @)
# 3. This repository is a monorepo that contains the actual Python package in a subdirectory, so we pass the
# subdirectory param at the end, pointing to the subdirectory that contains the Python package's own pyproject.toml,
# since that is the primary file driving packages managed by Poetry (langchain itself also happens to be a Poetry-managed project)
! $poetry add --no-ansi git+https://github.com/smartcat-labs/langchain.git@pinecone-optimization#subdirectory=libs/langchain
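
The dependency string passed to `poetry add` above has three parts: the repository URL, an optional git ref after `@`, and an optional `#subdirectory=` fragment. The sketch below only assembles that string so the pieces are easier to see; `git_requirement` is a hypothetical helper, not a Poetry API.

```python
def git_requirement(repo_url: str, ref: str = "", subdirectory: str = "") -> str:
    """Build a `git+<url>[@ref][#subdirectory=<path>]` dependency string."""
    spec = f"git+{repo_url}"
    if ref:
        spec += f"@{ref}"  # branch, tag, or commit to check out
    if subdirectory:
        spec += f"#subdirectory={subdirectory}"  # where the package's pyproject.toml lives
    return spec


spec = git_requirement(
    "https://github.com/smartcat-labs/langchain.git",
    ref="pinecone-optimization",
    subdirectory="libs/langchain",
)
print(spec)
```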

In [None]:
# Sanity check that our package installs have been persisted to this Jupyter Notebook's pyproject.toml file correctly:
!cat pyproject.toml

# Install the dependencies to the virtualenv and write the poetry.lock file
# TEST if this is necessary or not
#! $poetry --no-ansi install

In [None]:
# Import langchain, which will now be our modified version installed from our target fork's branch
import langchain

In [None]:
import tiktoken
import openai
!ls /usr/lib/python3/dist-packages | grep -i lang

In [None]:
import os
import getpass

os.environ["PINECONE_API_KEY"] = getpass.getpass("Pinecone API Key:")

In [None]:
os.environ["PINECONE_ENV"] = getpass.getpass("Pinecone Environment:")

We want to use `OpenAIEmbeddings` so we have to get the OpenAI API Key.

In [None]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
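
When re-running the notebook, it can be convenient to prompt only for keys that are not already set. A minimal sketch of that pattern follows; `ensure_env` is a hypothetical helper name.

```python
import getpass
import os


def ensure_env(name: str, prompt: str) -> str:
    """Return os.environ[name], prompting (without echo) only if it is unset or empty."""
    if not os.environ.get(name):
        os.environ[name] = getpass.getpass(prompt)
    return os.environ[name]


# Example usage (prompts only when the variable is missing):
# ensure_env("PINECONE_API_KEY", "Pinecone API Key:")
# ensure_env("PINECONE_ENV", "Pinecone Environment:")
# ensure_env("OPENAI_API_KEY", "OpenAI API Key:")
```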

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.document_loaders import TextLoader

In [None]:
from urllib.request import urlopen

# Fetch the state of the union text file from GitHub
target_url = 'https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/extras/modules/state_of_the_union.txt'

data = urlopen(target_url).read().decode('utf-8')

target_filepath = '/content/state_of_the_union.txt'

with open(target_filepath, 'w') as writer:
    writer.write(data)

loader = TextLoader(target_filepath)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
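
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window chunker. It is not how `CharacterTextSplitter` works internally (that class, as we understand it, splits on a separator and merges pieces up to the size limit); `fixed_size_chunks` is a hypothetical function for illustration only.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


print(fixed_size_chunks("abcdefghij", chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(fixed_size_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as used in this notebook, consecutive chunks share no text.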

In [None]:
! rm -rf /content/sample_data

In [None]:
import pinecone

# initialize pinecone
pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY"),  # find at app.pinecone.io
    environment=os.getenv("PINECONE_ENV"),  # next to api key in console
)

index_name = "langchain-demo"

# First, check if our index already exists. If it doesn't, we create it
if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,
    )
# The OpenAI embedding model `text-embedding-ada-002` uses 1536 dimensions
docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)

# if you already have an index, you can load it like this
# docsearch = Pinecone.from_existing_index(index_name, embeddings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

In [None]:
print(docs[0].page_content)
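
The index is created with `metric='cosine'`, so matches are ranked by the cosine similarity between the query embedding and each stored vector. The pure-Python sketch below shows that metric on toy vectors; it is for illustration only (Pinecone computes the score on its side), and `cosine_similarity` is a name chosen here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```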

### Adding More Text to an Existing Index

More text can be embedded and upserted to an existing Pinecone index using the `add_texts` function.


In [None]:
index = pinecone.Index("langchain-demo")
vectorstore = Pinecone(index, embeddings.embed_query, "text")

# add_texts expects an iterable of strings, so wrap a single text in a list
vectorstore.add_texts(["More text!"])
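
`add_texts` takes an iterable of strings rather than a single string. Python strings are themselves iterable, so a bare string would be consumed one character at a time, and each character would be treated as its own text. The sketch below shows the difference; `as_text_list` is a hypothetical helper, not part of langchain.

```python
from typing import Iterable, List, Union


def as_text_list(texts: Union[str, Iterable[str]]) -> List[str]:
    """Wrap a lone string in a list; otherwise materialize the iterable."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)


print(list("Hi!"))          # ['H', 'i', '!'] - iterating a bare string yields characters
print(as_text_list("Hi!"))  # ['Hi!'] - the whole string is kept as one text
```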

### Maximal Marginal Relevance Searches

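In addition to plain similarity search, the retriever object can use maximal marginal relevance (`mmr`) as its search type, which trades off relevance to the query against diversity among the returned documents.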


In [None]:
retriever = docsearch.as_retriever(search_type="mmr")
matched_docs = retriever.get_relevant_documents(query)
for i, d in enumerate(matched_docs):
    print(f"\n## Document {i}\n")
    print(d.page_content)

Or use `max_marginal_relevance_search` directly:

In [None]:
found_docs = docsearch.max_marginal_relevance_search(query, k=2, fetch_k=10)
for i, doc in enumerate(found_docs):
    print(f"{i + 1}.", doc.page_content, "\n")
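
For intuition, maximal marginal relevance can be sketched as a greedy loop: at each step, pick the candidate that is most similar to the query while being least similar to what has already been picked. The code below is a simplified pure-Python illustration on toy vectors, not langchain's implementation; `mmr_select` is a hypothetical name, and the `lambda_mult` weighting parameter is an assumption made for this sketch.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily choose up to k candidate indices, balancing relevance and diversity."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = _cosine(query, candidates[i])
            # Similarity to the closest already-selected candidate (0.0 if none yet)
            redundancy = max(
                (_cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 1.0]
candidate_vecs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr_select(query_vec, candidate_vecs, k=2))                   # [0, 2]
print(mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

In this toy example, candidates 0 and 1 are near-duplicates, so the balanced setting skips candidate 1 in favor of the more distinct candidate 2, while `lambda_mult=1.0` reduces to ranking by relevance alone.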