# Creating a LangChain application with Streamlit and OpenAI to talk to your own text documents, using Pinecone as the vector DB

## Some basic concepts and terminology 

### LangChain

LangChain is a framework designed to simplify the creation of applications that use large language models (LLMs). It is used for chatbots, document analysis and summarization, code analysis and more. LangChain connects LLMs to other sources of data, such as the internet or personal files, and lets developers chain multiple steps together to build more complex applications.

### Streamlit

Streamlit is a framework that makes it easy to create a user interface to test your Python apps. No more console apps!

### Vector Database

A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. Each vector has a certain number of dimensions, which can range from tens to thousands, depending on the complexity and granularity of the data. Vector databases are purpose-built databases that are specialized to tackle the problems that arise when managing vector embeddings in production scenarios. For that reason, they offer significant advantages over traditional scalar-based databases and standalone vector indexes.

### OpenAI

OpenAI is an AI research and deployment company. Their mission is to ensure that artificial general intelligence benefits all of humanity. OpenAI conducts AI research with the declared intention of promoting and developing a friendly AI. They have developed several models such as GPT-4 which can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.


## Context

Basically, our customer (Acme Corp) wants to be able to chat with their own documents: ask questions in plain English and get back a nicely formatted response.

For this we are going to create a simple app with the following functionalities:
1.  Create a Pinecone index.
2.  Add free text and index it into the Pinecone index; basically, this converts the text to vectors.
3.  Create a user interface to be able to chat with this index.
    - For the user interface we will use Streamlit for testing.
    - We will use OpenAI to instantiate an OpenAI LLM.
    - We will use LangChain to show its benefits when developing LLM apps. LangChain will use an embedding model to convert the questions into embeddings and an LLM to return the responses in plain English.
    - We will use the Pinecone SDK to get results from the vector DB.



## Requirements

- streamlit
- langchain
- openai
- python-dotenv
- pinecone-client
- streamlit-chat

Note: Streamlit is intended for UI apps and is not meant to be used within Jupyter notebooks. For this reason, and to make the explanation easier to follow, I have created 2 versions of this application:
1. The current Jupyter notebook, which doesn't use Streamlit and which can be executed cell by cell.
2. A Streamlit version, which you can see in the files app/home.py and app/Langchain.txt.
In order to run this version, clone the repo or download the files to your environment and run
``` streamlit run app/home.py```



In [1]:
#this will install all requirements
!pip install -r requirements.txt



In [2]:
# Create a .env file and add the following lines of code

OPENAI_DEPLOYMENT_ENDPOINT = "https://yourresource.openai.azure.com/"
OPENAI_API_KEY = "yourapikey"
OPENAI_DEPLOYMENT_NAME = "yourdeploymentname"
OPENAI_API_VERSION = "2023-03-15-preview"
OPENAI_MODEL_NAME="gpt-35-turbo"
OPENAI_API_TYPE = "azure"
OPENAI_EMBEDDING_DEPLOYMENT_NAME = "text-embedding-ada-002"
OPENAI_EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

PINECONE_API_KEY = "yourpineconeapikey"
PINECONE_ENV ="us-west4-gcp-free"



### Imports

In [3]:
import os

import openai
import pinecone
import streamlit as st

from dotenv import load_dotenv
from langchain.chains.question_answering import load_qa_chain
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Pinecone
from streamlit_chat import message
from streamlit_jupyter import StreamlitPatcher, tqdm



## Load environment variables.

This will load your secrets and configuration settings from the .env file.

In [4]:
#load environment variables
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_DEPLOYMENT_ENDPOINT = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
OPENAI_DEPLOYMENT_NAME = os.getenv("OPENAI_DEPLOYMENT_NAME")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")
OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("OPENAI_EMBEDDING_DEPLOYMENT_NAME")
OPENAI_EMBEDDING_MODEL_NAME = os.getenv("OPENAI_EMBEDDING_MODEL_NAME")
OPENAI_API_VERSION = os.getenv("OPENAI_API_VERSION")
OPENAI_API_TYPE = os.getenv("OPENAI_API_TYPE")
PINECONE_API_KEY =  os.getenv("PINECONE_API_KEY")
PINECONE_ENV = os.getenv("PINECONE_ENV")
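
Since `os.getenv` silently returns `None` for a missing variable, it can help to check the settings before going further. The following is a minimal sketch, not part of the original notebook; the helper name `missing_settings` and the `REQUIRED_SETTINGS` list are assumptions made for illustration.

```python
import os

# Names of some of the settings this notebook reads from the .env file
REQUIRED_SETTINGS = [
    "OPENAI_API_KEY",
    "OPENAI_DEPLOYMENT_ENDPOINT",
    "OPENAI_DEPLOYMENT_NAME",
    "PINECONE_API_KEY",
    "PINECONE_ENV",
]


def missing_settings(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit dictionary instead of the real environment
example_env = {"OPENAI_API_KEY": "yourapikey", "PINECONE_API_KEY": ""}
print(missing_settings(["OPENAI_API_KEY", "PINECONE_API_KEY"], example_env))
# prints ['PINECONE_API_KEY']
```

In the notebook, calling `missing_settings(REQUIRED_SETTINGS)` right after `load_dotenv()` would list any variable that still needs to be added to the .env file.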

In [5]:
#init Azure OpenAI
openai.api_type = OPENAI_API_TYPE
openai.api_version = OPENAI_API_VERSION
openai.api_base = OPENAI_DEPLOYMENT_ENDPOINT
openai.api_key = OPENAI_API_KEY

### Let's code

Here we first define our filename; for this case we will simply use a text file.
Then we load the file with the ```TextLoader``` class, and with the ```loader.load()``` method we store its contents in a variable called ```documents```.

In order to put the data into our vector database, we have to split it into chunks with the ```CharacterTextSplitter```. Here we can configure the chunk size and the chunk overlap. When testing this functionality with OpenAI, I got errors when setting a chunk size higher than 1. Note the warnings in the output of the cell: the splitter produces chunks of around 200 characters, which are longer than the requested size of 1.

Then we need to instantiate ```OpenAIEmbeddings()``` from LangChain, which calls the OpenAI embeddings API. We use this embedding to convert the chunks of our text file into vectors.

In [6]:
# File available in the profile
filename = "profiles.txt"
loader = TextLoader(filename, encoding='utf8')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(deployment="text-embedding-ada-002", model="text-embedding-ada-002", chunk_size=1)

Created a chunk of size 205, which is longer than the specified 1
Created a chunk of size 212, which is longer than the specified 1
Created a chunk of size 218, which is longer than the specified 1
Created a chunk of size 230, which is longer than the specified 1
Created a chunk of size 216, which is longer than the specified 1
Created a chunk of size 214, which is longer than the specified 1
Created a chunk of size 218, which is longer than the specified 1
Created a chunk of size 222, which is longer than the specified 1
Created a chunk of size 230, which is longer than the specified 1
Created a chunk of size 222, which is longer than the specified 1
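
The warnings above suggest that the splitter first cuts the text on a separator and only then tries to respect the chunk size, so a piece that is already longer than the limit is kept whole. The following is a simplified sketch of that idea, not LangChain's actual implementation; the function name `split_on_separator` and the blank-line separator are assumptions made for illustration, and chunk overlap is left out.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split text on a separator, then greedily merge pieces up to chunk_size."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece may itself be longer than chunk_size
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Name: A\nAge: 1\n\nName: B\nAge: 2"
print(len(split_on_separator(sample, chunk_size=1)))    # prints 2
print(len(split_on_separator(sample, chunk_size=100)))  # prints 1
```

With a chunk size of 1, every profile becomes its own chunk, which matches one chunk per profile in the output above. With a larger chunk size, several profiles would be merged into the same chunk.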


Then we just initialize the Pinecone client.

In [7]:
import pinecone
pinecone.init(
    api_key=PINECONE_API_KEY,  # find at app.pinecone.io
    environment=PINECONE_ENV  # next to api key in console
)

And finally we convert our text to vectors and store them in Pinecone.

In [8]:
index_name = "default"

In [9]:
docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)
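
The first functionality listed in the Context section, creating the Pinecone index, is not shown in the cells above, and as far as I know ```Pinecone.from_documents``` expects the index to exist already. The following is a minimal sketch of that step. It assumes the pinecone-client 2.x functions `list_indexes()` and `create_index(name, dimension=..., metric=...)`, and a dimension of 1536, which is the vector size of text-embedding-ada-002. The `ensure_index` helper and the `FakeClient` class are hypothetical and exist only so the example runs without a Pinecone account.

```python
def ensure_index(client, name, dimension=1536, metric="cosine"):
    """Create the index if it is missing; return True when it was created."""
    if name in client.list_indexes():
        return False
    client.create_index(name, dimension=dimension, metric=metric)
    return True


class FakeClient:
    """Hypothetical in-memory stand-in for the pinecone module."""

    def __init__(self):
        self.indexes = {}

    def list_indexes(self):
        return list(self.indexes)

    def create_index(self, name, dimension, metric):
        self.indexes[name] = {"dimension": dimension, "metric": metric}


client = FakeClient()
print(ensure_index(client, "default"))  # prints True, the index was created
print(ensure_index(client, "default"))  # prints False, it already exists
```

In the notebook, the `pinecone` module itself could be passed as the client after `pinecone.init(...)`, for example `ensure_index(pinecone, index_name)`, assuming the installed client version matches the one described above.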
       

### Similarity search

Similarity search, also known as similarity retrieval or similarity querying, is a fundamental problem in information retrieval and database systems. It refers to the task of finding objects or data points that are similar to a given query object based on a specified similarity measure.

In similarity search, the goal is to identify items in a dataset that are most similar to a particular query item. The similarity can be measured using various distance metrics, such as Euclidean distance, cosine similarity, Jaccard similarity, or edit distance, depending on the nature of the data.

Basically, it returns results based on the distance measured between the embedded query and the embedded vectors in the index, so the closest documents come back first.
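
To make the idea concrete, here is a small sketch of a similarity search using cosine similarity in plain Python. The three-dimensional vectors and their labels are made up for illustration; real embeddings from text-embedding-ada-002 have 1536 dimensions, and Pinecone performs this search with its own index structures rather than a full scan like this one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy "embeddings", invented for this example
stored = {
    "gardening": [1.0, 0.0, 0.1],
    "plants": [0.8, 0.3, 0.0],
    "golf": [0.0, 1.0, 0.1],
}
print(top_k([1.0, 0.0, 0.0], stored))  # prints ['gardening', 'plants']
```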

In [10]:
# Ask a question and retrieve the most similar documents from the index
query = "How many people likes gardening?"
docs = docsearch.similarity_search(query)
print(docs)

[Document(page_content='Name: Michael Johnson\nAge: 72\nGender: Male\nAddress: 654 Oak St\nPhone: 555-6789\nEmail: michaeljohnson@example.com\nInterests: Gardening, Golf, Traveling\nLikes: Tea\nDislikes: Snakes, Pollen\nAllergies: None\nLocation: Miami', metadata={'source': 'file.txt'}), Document(page_content='Name: Michael Johnson\nAge: 72\nGender: Male\nAddress: 654 Oak St\nPhone: 555-6789\nEmail: michaeljohnson@example.com\nInterests: Gardening, Golf, Traveling\nLikes: Tea\nDislikes: Snakes, Pollen\nAllergies: None\nLocation: Miami', metadata={'source': 'profiles.txt'}), Document(page_content='Name: Michael Johnson\nAge: 72\nGender: Male\nAddress: 654 Oak St\nPhone: 555-6789\nEmail: michaeljohnson@example.com\nInterests: Gardening, Golf, Traveling\nLikes: Tea\nDislikes: Snakes, Pollen\nAllergies: None\nLocation: Miami', metadata={'source': 'profiles.txt'}), Document(page_content='Name: David Smith\nAge: 77\nGender: Male\nAddress: 456 Oak Ln\nPhone: 555-1234\nEmail: davidsmith@exampl

As you can see, it returns the documents that best match our question. However, we want to go a bit deeper and get responses in plain English, not just a raw list of ```Document``` objects that is hard to read, right?

For this we use LangChain: we instantiate an ```AzureChatOpenAI```, which lets us reuse a pre-trained model deployed in Azure OpenAI, for example GPT-3.5 Turbo, and we set the other relevant configuration settings.

In [11]:
from langchain.chat_models import AzureChatOpenAI
llm = AzureChatOpenAI(
    openai_api_base=OPENAI_DEPLOYMENT_ENDPOINT,
    openai_api_version=OPENAI_API_VERSION ,
    deployment_name=OPENAI_DEPLOYMENT_NAME,
    openai_api_key=OPENAI_API_KEY,
    openai_api_type = OPENAI_API_TYPE ,
    model_name=OPENAI_MODEL_NAME,
    temperature=0)
chain = load_qa_chain(llm, chain_type="stuff")
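
As I understand it, the "stuff" chain type simply places all the retrieved documents into a single prompt together with the question. The following is a simplified illustration of that idea; the function name `build_stuff_prompt` and the prompt wording are made up and are not LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all documents into one context block, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short, made-up documents standing in for the retrieved profiles
retrieved = [
    "Name: Michael Johnson\nAge: 72",
    "Name: David Smith\nAge: 77",
]
print(build_stuff_prompt(retrieved, "How old is Michael?"))
```

Because everything goes into one prompt, this approach only works while the retrieved documents fit within the context window of the model.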

And then we can just ask questions about our documents:

In [12]:
response = chain.run(input_documents=docs, question=query)
print(response)

I'm sorry, I cannot answer that question as there is no information provided about the number of people who like gardening. The context only provides information about Michael Johnson and David Smith and their interests.


In [13]:
response = chain.run(input_documents=docs, question="How old is Michael?")
print(response)

Michael is 72 years old.


Pretty interesting and easy to do. In future Portfolio projects we will cover more advanced concepts like Chat Memory, Chains, Retrievers, Agents, etc.


