Weaviate Vector Database :
Weaviate is another vector database: text is converted into numeric vectors (embeddings), and similarity between texts is computed with mathematical operations on those vectors. Like Pinecone, it is used here as a hosted online service.
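
To make "mathematical operations on the numbers" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embedding vectors. The three-dimensional vectors are made-up values for illustration only; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, car))
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are unrelated.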

In [None]:
# install the required packages with pip
!pip install weaviate-client
!pip install langchain
!pip install openai

In [2]:
# read the API keys saved in Google Colab's secrets (userdata)
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
WEAVIATE_API_KEY = userdata.get('WEAVIATE_API_KEY')
WEAVIATE_CLUSTER = userdata.get('WEAVIATE_CLUSTER')

In [3]:
# expose the OpenAI key as an environment variable
# (uses the variable defined in the cell above)
import os
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
# install pypdf with pip (required by PyPDFDirectoryLoader)
!pip install pypdf

In [None]:
# create a folder for the PDF files (upload your PDFs into it before loading)
!mkdir pdfs

In [6]:
# this is a RAG (retrieval-augmented generation) pipeline
# Step-01 : load the dataset
# PyPDFDirectoryLoader is a LangChain component that loads every PDF in a folder
from langchain.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("pdfs")
data = loader.load()

In [7]:
# Step-02 : split the text into chunks, because models can only process a limited number of tokens
# (a token is a sub-word unit, not necessarily a whole word)
from langchain.text_splitter import RecursiveCharacterTextSplitter
# chunk_size and chunk_overlap are measured in characters by default
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
docs = text_splitter.split_documents(data)
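
The idea of overlapping chunks can be sketched in plain Python. This is a simplified fixed-window splitter, not the algorithm used by RecursiveCharacterTextSplitter (which first tries to split on separators such as paragraphs, lines and spaces). The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last two characters of the previous one, so a sentence cut at a chunk boundary still appears partly in both chunks.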

In [8]:
# check the number of chunks produced by the splitter
len(docs)

39

In [None]:
# print the page content of every chunk
for i in docs:
  print(i.page_content)

In [10]:
# Step-03 : create embeddings for the chunks
# OpenAIEmbeddings is used here; a model from the Hugging Face Hub would also work
from langchain.embeddings.openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

  warn_deprecated(


In [None]:
# the import above is deprecated; use the langchain-openai package instead
# first install or update it with this command
# !pip install -U langchain-openai
# then import the class from the new package
# from langchain_openai import OpenAIEmbeddings
# embeddings = OpenAIEmbeddings()

In [12]:
# no deprecation warning is shown now
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
# print the embeddings object (its configuration, not the vectors)
print(embeddings)

In [14]:
# the embeddings are stored in a vector database
# here we use the Weaviate vector database

In [15]:
# Step-04 : store the data in the vector database
import weaviate
from langchain.vectorstores import Weaviate

# connect to the Weaviate cluster
auth_config = weaviate.auth.AuthApiKey(api_key = WEAVIATE_API_KEY)
WEAVIATE_URL = WEAVIATE_CLUSTER

client = weaviate.Client(
    url = WEAVIATE_URL,
    additional_headers = {"X-OpenAI-Api-key": OPENAI_API_KEY},
    auth_client_secret = auth_config,
    startup_period = 10
)

In [16]:
# call is_ready() on the client
# True means the Weaviate instance is reachable and ready
client.is_ready()

True

In [17]:
# define the schema (input structure)
# warning: delete_all() removes every existing class in the cluster
client.schema.delete_all()
client.schema.get()
schema = {
    "classes": [
        {
            "class": "Chatbot",
            "description": "Documents for chatbot",
            "vectorizer": "text2vec-openai",
            "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
            "properties": [
                {
                    "dataType": ["text"],
                    "description": "The content of the paragraph",
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                    "name": "content",
                },
            ],
        },
    ]
}

client.schema.create(schema)
vectorstore = Weaviate(client, "Chatbot", "content", attributes=["source"])

In [None]:
# load text into the vectorstore
text_meta_pair = [(doc.page_content, doc.metadata) for doc in docs]
texts, meta = list(zip(*text_meta_pair))
vectorstore.add_texts(texts, meta)

In [24]:
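# define the query
query_01 = "what is a transformer?"

# retrieve the text most related to the query
# note: similarity_search takes `k` for the number of results; `top_k` is not one of its parameters
docs = vectorstore.similarity_search(query_01, k=1)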

In [None]:
# print the content of the retrieved chunks
for j in docs:
  print(j.page_content)
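
What a similarity search does can be sketched with a small in-memory store: score every stored vector against the query vector and keep the best `k`. This is a brute-force illustration under made-up vectors; a real vector database such as Weaviate uses an approximate nearest-neighbour index instead of scanning every entry. The names `top_k` and `store` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, store, k=1):
    """Return the k (text, score) pairs most similar to query_vector."""
    scored = [(text, cosine_similarity(query_vector, vector)) for text, vector in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return scored[:k]

# Hypothetical stored chunks with toy 3-dimensional vectors
store = [
    ("Transformers rely on self-attention.", [0.9, 0.1, 0.0]),
    ("RNNs process tokens sequentially.", [0.5, 0.5, 0.1]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]
query_vector = [0.8, 0.2, 0.0]  # stand-in for the embedded query

results = top_k(query_vector, store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```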

### Custom ChatBot :

In [28]:
# importing the libraries
from langchain.chains.question_answering import load_qa_chain
# from langchain.llms import OpenAI
from langchain_openai import OpenAI

In [29]:
# define chain
chain = load_qa_chain(
    OpenAI(),
    chain_type="stuff"
    )

In [38]:
# create the answer
chain.run(input_documents=docs, question=query_01)
# LangChainDeprecationWarning: The function `run` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead
# invoke takes a single dict of inputs and returns a dict; the answer is under "output_text":
# chain.invoke({"input_documents": docs, "question": query_01})["output_text"]

' The Transformer is a type of neural sequence transduction model that relies entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. It has been shown to perform well on simple-language question answering and language modeling tasks and outperforms even previously reported ensembles. The model has an encoder-decoder structure, where the encoder maps an input sequence of symbols to a sequence of continuous representations, and the decoder generates an output sequence of symbols one element at a time. The Transformer also plans to explore using attention-based models for tasks involving input and output modalities other than text, such as images, audio, and video.'
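
The `chain_type="stuff"` setting used above means that all retrieved documents are placed ("stuffed") into a single prompt together with the question. A rough sketch of that idea follows; the prompt wording and the function `build_stuff_prompt` are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate all documents into one context block followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks
retrieved = [
    "The Transformer relies entirely on self-attention.",
    "It has an encoder-decoder structure.",
]
prompt = build_stuff_prompt(retrieved, "what is a transformer?")
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits within the model's token limit, which is one reason to keep the number of retrieved chunks small.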