<a href="https://colab.research.google.com/github/parthdasawant/LLM-Pinecone-OpenAI/blob/main/QnA_app_for_custom_doc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Setup

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!pip install -r /content/drive/MyDrive/LLM/requirements.txt -q

In [4]:
!cp /content/drive/MyDrive/LLM/env /content/

In [5]:
import os

old_name = r"/content/env"
new_name = r"/content/.env"
os.rename(old_name, new_name)

In [6]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

print(os.environ["PINECONE_ENV"])

gcp-starter


### Actual Business

In [7]:
!pip install pypdf -q
!pip install docx2txt -q
!pip install wikipedia -q

In [8]:
def load_document(file):
  import os
  name, extension = os.path.splitext(file)

  if extension == '.pdf':
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file}')
    loader = PyPDFLoader(file)
  elif extension == '.docx':
    from langchain.document_loaders import Docx2txtLoader
    print(f'Loading {file}')
    loader = Docx2txtLoader(file)
  else:
    print(f'{extension} is not supported yet')
    return None

  data = loader.load()
  return data

# Load pages from Wikipedia
def load_from_wikipedia(query, lang='en', load_max_docs=2):
  from langchain.document_loaders import WikipediaLoader
  loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
  data = loader.load()
  return data
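
`load_document` compares the extension returned by `os.path.splitext` against lowercase literals, so a file named `Report.PDF` would fall through to the "not supported" branch. A minimal sketch of one way to normalize the extension before comparing; `normalized_extension` is a hypothetical helper, not part of the notebook:

```python
import os

def normalized_extension(path):
    """Return the file extension in lowercase, including the leading dot."""
    _, extension = os.path.splitext(path)
    return extension.lower()

# The extension is taken from the last dot only, and case is folded.
print(normalized_extension("/content/drive/MyDrive/LLM/Constitution of Bharat.pdf"))
print(normalized_extension("Report.PDF"))
print(normalized_extension("notes"))  # no extension at all gives an empty string
```

Comparing against `normalized_extension(file)` instead of the raw extension would let the same `if`/`elif` branches accept mixed-case file names.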


In [9]:
def chunk_data(data, chunk_size = 256):
  from langchain.text_splitter import RecursiveCharacterTextSplitter
  text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap = 0)
  chunks = text_splitter.split_documents(data)
  return chunks
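
To see what `chunk_size` and `chunk_overlap` control, here is a simplified character-window splitter. It is only an illustration: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch does not attempt, and `split_fixed` is a hypothetical name.

```python
def split_fixed(text, chunk_size=256, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 60  # 600 characters of dummy text
pieces = split_fixed(sample, chunk_size=256, chunk_overlap=0)
print([len(piece) for piece in pieces])
```

With `chunk_overlap = 0`, as in `chunk_data`, the pieces do not share any text, so a sentence that straddles a boundary is split between two chunks.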


In [10]:
def print_embedding_cost(texts):
  import tiktoken
  enc = tiktoken.encoding_for_model('text-embedding-ada-002')
  total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
  print(f'Total Tokens: {total_tokens}')
  print(f'Embedding Cost in USD: {total_tokens/1000 * 0.0004:.6f}')
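
The cost line is plain arithmetic: the token count divided by 1,000, times the price per 1,000 tokens. The sketch below isolates that step. The default of 0.0004 USD per 1K tokens is simply the rate hard-coded in `print_embedding_cost` and may not match current pricing; `embedding_cost` is a hypothetical helper.

```python
def embedding_cost(total_tokens, usd_per_1k_tokens=0.0004):
    """Estimate the embedding cost in USD for a given token count."""
    return total_tokens / 1000 * usd_per_1k_tokens

# 214,890 tokens is the count this notebook reports for the sample PDF.
print(f"{embedding_cost(214_890):.6f}")
```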


Embedding & uploading to the vector database (Pinecone)

In [11]:
def insert_or_fetch_embbedings(index_name):
  import pinecone
  from langchain.vectorstores import Pinecone
  from langchain.embeddings.openai import OpenAIEmbeddings

  embeddings = OpenAIEmbeddings()

  pinecone.init(api_key = os.environ.get('PINECONE_API_KEY'), environment = os.environ.get('PINECONE_ENV'))

  if index_name in pinecone.list_indexes():
    print(f'Index {index_name} already exists. Loading embeddings ... ')
    vector_store = Pinecone.from_existing_index(index_name, embeddings)
    print('OK')
  else:
    print(f'Creating index {index_name} and embeddings ...')
    pinecone.create_index(index_name, dimension = 1536, metric = 'cosine')
    # NOTE: relies on the global `chunks` created by chunk_data() in an earlier cell
    vector_store = Pinecone.from_documents(chunks, embeddings, index_name = index_name)
    print('OK')

  return vector_store
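
The index is created with `metric = 'cosine'`, so vectors are compared by the angle between them rather than by their length, and `dimension = 1536` matches the vector size produced by `text-embedding-ada-002`. A small pure-Python sketch of the metric itself (Pinecone computes this on its side; `cosine_similarity` here is only an illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # perpendicular vectors
```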

In [12]:
def delete_pinecone_index(index_name = 'all'):
  import pinecone
  pinecone.init(api_key = os.environ.get('PINECONE_API_KEY'), environment = os.environ.get('PINECONE_ENV'))

  if index_name == 'all':
    indexes = pinecone.list_indexes()
    print('Deleting all indexes')
    for index in indexes:
      pinecone.delete_index(index)
    print('OK')
  else:
    print(f'Deleting index {index_name}...', end='')
    pinecone.delete_index(index_name)
    print('OK')

Asking & Getting Answers

In [61]:
def ask_and_get_answer(vector_store, q):
  from langchain.chains import RetrievalQA
  from langchain.chat_models import ChatOpenAI

  llm = ChatOpenAI(model = 'gpt-3.5-turbo', temperature =1)

  retriever = vector_store.as_retriever(search_type = 'similarity', search_kwargs = {'k': 3})

  chain = RetrievalQA.from_chain_type(llm = llm, chain_type='stuff', retriever = retriever)

  answer = chain.run(q)

  return answer


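def ask_with_memory(vector_store, question, chat_history=None):
  # Avoid a shared mutable default: start a fresh history when none is passed in
  if chat_history is None:
    chat_history = []
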
  from langchain.chains import ConversationalRetrievalChain
  from langchain.chat_models import ChatOpenAI

  llm = ChatOpenAI(temperature = 1)

  retriever = vector_store.as_retriever(search_type = 'similarity', search_kwargs = {'k': 3})

  crc = ConversationalRetrievalChain.from_llm(llm, retriever)

  result = crc({'question': question, 'chat_history': chat_history})
  chat_history.append((question, result['answer']))

  return result, chat_history
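
Both functions build a retriever with `search_type = 'similarity'` and `k = 3`, meaning the three chunks whose embeddings are closest to the question are handed to the model. A toy sketch of that selection step over a few in-memory vectors; the vectors, `toy_index`, and `top_k` are made up for illustration, and a real vector store performs this search over the index:

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, items, k=3):
    """Return the texts of the k items most similar to query_vec.

    items is a list of (text, vector) pairs.
    """
    ranked = sorted(items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

toy_index = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [0.9, 0.1]),
    ("chunk C", [0.0, 1.0]),
    ("chunk D", [0.7, 0.7]),
]
print(top_k([1.0, 0.0], toy_index, k=3))
```

A larger `k` gives the model more context per question at the price of a longer prompt.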


Running Code

In [52]:
# @markdown ---
# @markdown ### Enter a file path:
file_path = "/content/drive/MyDrive/LLM/Constitution of Bharat.pdf" # @param {type:"string"}
# @markdown ---

data = load_document(file_path)

print(f'You have {len(data)} pages in your data')
print(f'{data[0].page_content}')

Loading /content/drive/MyDrive/LLM/Constitution of Bharat.pdf
You have 404 pages in your data
 
 
 
 
 
 THE CONSTITUTION OF INDIA 
[As on       May , 2022] 
2022 
 


##### Chunking the data & Testing

In [53]:
chunks = chunk_data(data)
print(len(data))
print(chunks[102].page_content)

404
THE MUNICIPALITIES 
243P. Definitions. 
243Q. Constitution of Municipalities. 
243R. Composition of Municipalities. 
243S. Constitution and composition of Wards Committees, etc. 
243T. Reservation of seats. 
243U. Duration of Municipalities, etc.


In [18]:
print_embedding_cost(chunks)

Total Tokens: 214890
Embedding Cost in USD: 0.085956


In [57]:
delete_pinecone_index()

Deleting all indexes
OK


In [58]:
index_name = 'askadocument'
vector_store = insert_or_fetch_embbedings(index_name)

Creating index askadocument and embeddings ...




OK


In [22]:
q = 'What is the whole document about?'
answer = ask_and_get_answer(vector_store, q)
print(answer)

The document is about the Constitution of India, which outlines the fundamental principles, structures, and procedures of the Indian government. It provides a framework for the governance of the country and protects the fundamental rights and freedoms of its citizens.


In [25]:
import time
i = 1

print('Write Quit or Exit to quit')

while True:
  q = input(f'Question #{i}: ')
  i += 1
  if q.lower() in ['quit', 'exit']:
    print('Quitting... bye bye')
    time.sleep(2)
    break
  else:
    answer = ask_and_get_answer(vector_store, q)
    print(f'\nAnswer: {answer}')
    print(f'\n {"-" * 50} \n')

Write Quit or Exit to quit
Question #1: What is the first amendment of the Constitution of India

Answer: The first amendment of the Constitution of India was added by the Constitution (First Amendment) Act, 1951. The specific content of the first amendment is not provided in the given context.

 -------------------------------------------------- 

Question #2: List the Fundamental Rights

Answer: The Fundamental Rights in India are as follows:

1. Right to Equality (Article 14-18)
- Equality before the law (Article 14)
- Prohibition of discrimination on grounds of religion, race, caste, sex or place of birth (Article 15)
- Equality of opportunity in matters of public employment (Article 16)
- Abolition of Untouchability (Article 17)
- Abolition of titles (Article 18)

2. Right to Freedom (Article 19-22)
- Protection of certain rights regarding freedom of speech, etc. (Article 19)
- Protection in respect of conviction for offenses (Article 20)
- Protection of life and personal liberty 

In [51]:
delete_pinecone_index()

Deleting all indexes
OK


##### From Wikipedia

In [47]:
# Loaded directly from Wikipedia, so the answer is definitely not coming from GPT's own knowledge:
# at this time GPT is trained on data up to 2021, and this Wikipedia article was created in March 2023.
# 'mr' selects the Marathi-language Wikipedia.
data = load_from_wikipedia('2023 Cricket World Cup', 'mr')
# print(data[0].page_content)
chunks = chunk_data(data)
index_name = "asiacup"
vector_store = insert_or_fetch_embbedings(index_name)

Creating index asiacup and embeddings ...
OK


In [49]:
# Marathi for: "Which country is the host of this World Cup?"
q = 'या विश्वचषकाचा यजमान देश कोण आहे?'
answer = ask_and_get_answer(vector_store, q)
print(answer)

विश्वचषकाच्या आयोजनाचा यजमान देश भारत आहे.


##### With Memory

In [73]:
chat_history = []
question = 'What is the article number of Right to Constitutional Remedies'
result, chat_history = ask_with_memory(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

The article number of the Right to Constitutional Remedies in the Constitution of India is Article 32.
[('What is the article number of Right to Constitutional Remedies', 'The article number of the Right to Constitutional Remedies in the Constitution of India is Article 32.')]


In [74]:
question = 'Multiply that article number by 2'
result, chat_history = ask_with_memory(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)



The result when you multiply the article number of the Right to Constitutional Remedies (Article 32) by 2 is 64.
[('What is the article number of Right to Constitutional Remedies', 'The article number of the Right to Constitutional Remedies in the Constitution of India is Article 32.'), ('Multiply that article number by 2', 'The result when you multiply the article number of the Right to Constitutional Remedies (Article 32) by 2 is 64.')]
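
The follow-up question works because each turn is appended to `chat_history` as a `(question, answer)` tuple and passed back in on the next call. A sketch of that bookkeeping with a stand-in for the chain; `fake_chain` and `ask` are hypothetical, and `fake_chain` returns a canned string rather than calling an LLM:

```python
def fake_chain(inputs):
    """Stand-in for a conversational chain: reports how much history it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} after {turns} earlier turn(s)"}

def ask(chain, question, chat_history=None):
    """Call the chain with the history so far, then record the new turn."""
    if chat_history is None:
        chat_history = []
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result, chat_history

history = []
_, history = ask(fake_chain, "first question", history)
_, history = ask(fake_chain, "second question", history)
for question_text, answer_text in history:
    print(question_text, "->", answer_text)
```

Starting a new conversation is then just a matter of passing a new empty list.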
