In [6]:
!pip install langchain openai unstructured pdf2image pinecone-client tiktoken python-dotenv






### Problem Statement

Create embeddings from custom data and store them in a vector store.

### Solution Approach
To solve this, we will use:
- **LangChain** to load and split our data and create vector embeddings from it.
- **Pinecone** to store the vector embeddings in a cloud-based vector database.

In [7]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.callbacks import get_openai_callback
from dotenv import load_dotenv

import os

In [8]:
# Load variables from a local .env file (if present) into the environment
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
PINECONE_API_ENV = os.getenv('PINECONE_API_ENV')
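
`os.getenv` returns `None` when a variable is unset, so a missing key tends to surface only later as an authentication error. A minimal sketch of a fail-fast alternative follows; `require_env` is a hypothetical helper name, not part of the original notebook or of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, `OPENAI_API_KEY = require_env('OPENAI_API_KEY')` would stop the notebook at the configuration cell instead of at the first API call.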

### Add your data

In [7]:
loader = DirectoryLoader("data/", glob="**/*.pdf")

In [8]:
documents = loader.load()

In [9]:
print(len(documents))

1


In [10]:
# Chunk your data up into smaller documents

text_splitter = CharacterTextSplitter(chunk_size=2500, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

In [11]:
print(f'Now you have {len(texts)} documents')

Now you have 126 documents
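
To make the chunking step more concrete, here is a simplified sketch of separator-based splitting: the text is cut on a separator and the pieces are greedily merged until the next piece would push a chunk past `chunk_size` characters. This is an illustration of the idea only, not LangChain's actual implementation; `split_text` is a hypothetical name, and the `"\n\n"` separator is assumed to mirror the splitter's default.

```python
def split_text(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new (possibly oversized) chunk
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because splitting only happens at separators, chunk boundaries fall between paragraphs rather than mid-sentence, at the cost of chunks that vary in length.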


In [12]:
texts[2]

Document(page_content='This FAQ on GST compiled by NACEN and vetted by the Source Trainers is based on the CGST/SGST/UTGST/IGST Act(s). This FAQ is for training and academic purposes only. The information in this booklet is intended only to provide a general overview and is not intended to be treated as legal ad vice or opinion. For greater details, you are respective refer CGST/SGST/UTGST/IGST Acts. The FAQs refer to CGST and SGST Acts as CGST/SGST as CGST Act and SGST Act are identical in most of the provisions. CGST Act has been in the Parliament. The SGST Acts will be passed by respective state legislatures. A few provisions may be specific to state and may not be in CGST Act.\n\nintroduced\n\nrequested\n\nthe\n\nto\n\nto\n\n1. Overview of Goods and Services Tax (GST)\n\nQ 1. What is Goods and Services Tax\n\n(GST)?\n\nAns. It is a destination based tax on consumption of goods and services. It is proposed to be levied at all stages right from manufacture up to final consumption wit

### Create embeddings of your documents to get ready for semantic search

In [13]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone, Chroma

import pinecone



In [16]:
# Constructing the embeddings client makes no API call, so the callback reports zero usage
with get_openai_callback() as cb:
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
    print(cb)

Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0


In [17]:
# Initialize pinecone

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV
)

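index_name = "gstdata"

# Create the index only if it does not already exist; the dimension must match
# the embedding size (1536 for OpenAI's text-embedding-ada-002)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")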
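
The index above is created with `metric="cosine"`, meaning vectors are ranked by the cosine of the angle between them. As a rough illustration of what that metric computes (a plain-Python sketch, not Pinecone's implementation; `cosine_similarity` is a name chosen here):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The score depends only on direction, not magnitude, so two vectors pointing the same way score 1 regardless of their lengths.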

In [18]:
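# Note: the callback reports zero usage here (see the output below), which suggests
# it does not track embedding requests
with get_openai_callback() as cb:
    vecstore = Pinecone.from_texts([t.page_content for t in texts], embeddings, index_name=index_name)
    print(cb)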

Tokens Used: 0
	Prompt Tokens: 0
	Completion Tokens: 0
Successful Requests: 0
Total Cost (USD): $0.0


In [17]:
question = 'What are the main issues with GST?'
docs = vecstore.similarity_search(question)

In [18]:
# Here's an example of one of the returned documents (index 1 is the second match)
docs[1]

Document(page_content='The GST Council shall make recommendations to the Union and States on the taxes, cesses and surcharges levied by the Centre, the States and the local bodies which may be subsumed in the GST.\n\nQ 4. What principles were adopted for subsuming\n\nthe above taxes under GST?\n\nAns. The various Central, State and Local levies were examined to identify their possibility of being subsumed under GST. While identifying, the following principles were kept in mind:\n\n(i) Taxes or levies to be subsumed should be primarily in the nature of indirect taxes, either on the supply of goods or on the supply of services.\n\n(ii) Taxes or levies to be subsumed should be part of the transaction chain which commences with import/ manufacture/ production of goods or provision of services at one end and the consumption of goods and services at the other.\n\n(iii) The subsumation should result in free flow of tax credit in intra and inter-State levels. The taxes, levies and fees that ar
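
Conceptually, a similarity search embeds the question and returns the stored vectors that score highest against it. The sketch below shows that ranking step over a small in-memory dictionary. It assumes the vectors are already unit-normalized, so the dot product equals cosine similarity; `top_k`, the default `k=4`, and the toy data are illustrative choices, and a real vector database uses approximate indexes rather than a full scan.

```python
import heapq


def top_k(
    query: list[float], vectors: dict[str, list[float]], k: int = 4
) -> list[tuple[str, float]]:
    """Return the k (id, score) pairs with the highest dot product against the query."""
    scored = (
        (vec_id, sum(q * v for q, v in zip(query, vec)))
        for vec_id, vec in vectors.items()
    )
    # nlargest returns the results sorted from best to worst score
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy usage with hypothetical 2-dimensional unit vectors
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.8]}
best = top_k([1.0, 0.0], store, k=2)
```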

### Query your docs to get an answer back

In [19]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.callbacks import get_openai_callback

In [20]:
qa = RetrievalQA.from_chain_type(
    llm = OpenAI(),
    chain_type="stuff",
    retriever=vecstore.as_retriever()
)
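
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. A minimal sketch of that idea is shown here; the template wording and the `build_stuff_prompt` name are hypothetical and do not reproduce LangChain's actual prompt.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate all retrieved chunks into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

This approach is simple and needs only one LLM call, but the prompt grows with every retrieved chunk, so chunk size and the number of retrieved documents together have to fit within the model's context window.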

In [21]:
def query(q):
    """Run a question through the QA chain and print the answer and token usage."""
    print("Question: ", q)
    with get_openai_callback() as cb:
        response = qa.run(q)
        print("Answer: ", response)
        print(cb)

In [23]:
query('What are the main pillars of GST?')

Question:  What are the main pillars of GST?
Answer:   The main pillars of GST are Central GST (CGST), State GST (SGST), Integrated GST (IGST), and Union Territory GST (UTGST). CGST and SGST will be applicable in intra-state transactions, i.e. within the state, while IGST will be applicable in inter-state transactions, i.e. between states. UTGST will be applicable in Union Territories. GST is a destination-based tax, which means that taxes will be collected at the point of consumption, and not the point of origin.
Tokens Used: 2150
	Prompt Tokens: 2038
	Completion Tokens: 112
Successful Requests: 1
Total Cost (USD): $0.043
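
The reported figures are consistent with a flat price per token: 2,150 tokens costing $0.043 works out to $0.02 per 1,000 tokens. That rate is inferred from the output above rather than stated anywhere in the notebook, and pricing differs by model. A small sketch of the arithmetic, with `estimate_cost` as a hypothetical helper:

```python
def estimate_cost(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimate the cost in USD of a request billed at a flat rate per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# Rate inferred from the output above (an assumption, not an official price)
cost = estimate_cost(2150, 0.02)
```

Most of the 2,150 tokens are prompt tokens (2,038), which reflects the retrieved chunks being included in the prompt.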


In [None]:
query('What are the main issues with GST?')

Question:  What are the main issues with GST?
Answer:   The main issues with Goods and Services Tax (GST) are related to the complexity of implementation, and the challenges of developing a unified system of taxation across multiple states. In addition, some states have not yet fully adopted the GST system due to political and regional differences.


In [None]:
query('What are the drawbacks of GST?')

Question:  What are the drawbacks of GST?
Answer:  

The introduction of GST may lead to certain drawbacks, such as: 

1. Difficulty in filing taxes - GST is a complex tax system, and it can be difficult to understand the rules and regulations. This can lead to confusion and difficulty in filing taxes.

2. Increased cost of compliance - GST requires businesses to file multiple returns and submit various forms, which can lead to increased costs in terms of time, money, and manpower.

3. Impact on small businesses - Small businesses may find it difficult to comply with the GST requirements due to the lack of resources and expertise.

4. Increased costs for consumers - As GST is a consumption-based tax, it may lead to an increase in the cost of goods and services, which can be a burden on consumers.
