<a href="https://colab.research.google.com/github/mad0907/GAN/blob/main/langchain_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
#!pip install langchain
#!pip install transformers
#!pip install pypdf
#!pip install faiss-cpu
#!pip install textract
#!pip install sentence-transformers

from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import ConversationalRetrievalChain
from google.colab import userdata
from langchain import HuggingFaceHub
import sentence_transformers
import os

# 1. Loading PDFs and chunking with **LangChain**

In [50]:
loader = PyPDFLoader("/The_Concept_of_Sustainable_Development_From_its_Be.pdf")
pages = loader.load_and_split()
print(pages[0])

# Simple method: treat each page as one chunk.
# If you use this method, skip ahead to section 2 (embedding).
chunks = pages

page_content='67Zagreb International Review of Economics & Business, V ol. 21, No. 1, pp. 67-94, 2018© 2018 Faculty of Economics and Business, University of Zagreb and De Gruyter OpenAll rights reserved. Printed in CroatiaISSN 1331-5609; UDC: 33+65DOI: 10.2478/zireb-2018-0005\n* Tomislav Klarin is at University of Zadar, Department of Tourism and Communication, Zadar, Croatia. This paper is part of the PhD work.The Concept of Sustainable Development: From its Beginning to the Contemporary IssuesTomislav Klarin *Abstract: The concept of sustainable development has undergone various developmental phases since its introduction. The historical development of the concept saw participation of vari-ous organizations and institutions, which nowadays work intensely on the implementation of its principles and objectives. The concept has experienced different critiques and inter-pretations over the time while being accepted in different areas of human activity, and the definition of sustainable d

In [51]:
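# Alternative method: extract the raw text and split it by token count.
# Step 1: Convert the PDF to text
import textract
doc = textract.process("/The_Concept_of_Sustainable_Development_From_its_Be.pdf")

# Step 2: Save to .txt and reopen (helps prevent issues)
with open('conceptdoc.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open('conceptdoc.txt', 'r', encoding='utf-8') as f:
    text = f.read()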

# Step 3: Create function to count tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Step 4: Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    # chunk_size and chunk_overlap are measured in GPT-2 tokens here,
    # because length_function is count_tokens rather than the default len.
    chunk_size=512,
    chunk_overlap=24,
    length_function=count_tokens,
)

chunks = text_splitter.create_documents([text])
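
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on a list of separators and then merges the pieces); it is a simpler fixed-window illustration. The function `split_with_overlap` and the toy token list are hypothetical names introduced only for this example.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Hypothetical toy input: ten whitespace-separated "tokens".
toy_tokens = "t0 t1 t2 t3 t4 t5 t6 t7 t8 t9".split()
toy_chunks = split_with_overlap(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.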

# 2. Embed text and store embeddings


In [52]:
# Get embedding model
embeddings = HuggingFaceEmbeddings()

# Create vector database
db = FAISS.from_documents(chunks, embeddings)

# 3. Set up search function

In [53]:
# Check similarity search is working
query = "When did the concept of sustainable development evolve?"
docs = db.similarity_search(query)
docs[0].page_content

'History of the Concept of Sustainable Development\n\nIn the 18th century economic theoreticians such as Adam Smith pointed out issues \nof development, in the 19th century Karl Marx and classical economists Malthus, \nRicardo and Mill also argued about certain elements of sustainable development, \nwhile later neoclassical economic theory emphasized the importance of pure air and \nwater and renewable resources (fossil fuels, ores) as well as the need for government \nintervention in the case of externalities and public goods (Willis, 2005: 147; Bâc, \n2008: 576; Črnjar & Črnjar, 2009: 79). Previous periods, and even the following cen-\ntury, saw the dominance of the economic doctrine with focus on human as a ruler of \nnatural resources (Šimleša, 2003: 404; Črnjar & Črnjar, 2009: 61). The term sustain-\nable development was originally introduced in the field of forestry, and it included \nmeasures of afforestation and harvesting of interconnected forests which should not \nundermine 
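
The search above can be pictured as "embed the query, score every stored chunk, return the best matches". The sketch below illustrates that idea with a toy word-count embedding and cosine similarity. Both are assumptions made for illustration: `HuggingFaceEmbeddings` produces dense vectors from a neural model, and the FAISS index may use a different distance metric. The names `embed`, `cosine`, `toy_similarity_search`, and `toy_texts` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a neural model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_similarity_search(query, texts, k=2):
    """Return the k texts most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(texts, key=lambda t: cosine(q, embed(t)), reverse=True)
    return ranked[:k]

# Hypothetical stand-ins for the stored chunks.
toy_texts = [
    "history of the concept of sustainable development",
    "tourism and communication in zadar",
    "renewable resources and government intervention",
]
toy_hits = toy_similarity_search("concept of sustainable development", toy_texts, k=1)
print(toy_hits[0])
```

A real vector store avoids re-embedding every chunk per query: the chunk vectors are computed once when the index is built, and only the query is embedded at search time.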

In [54]:
# huggingfaceToken=userdata.get('huggingfacetoken')
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = huggingfaceToken

In [55]:
# # Create QA chain to integrate similarity search with user queries (answer query from knowledge base)
# llm = HuggingFaceHub(repo_id="google/flan-t5-xxl", model_kwargs={"temperature":0.8, "max_length": 1024})

# chain = load_qa_chain(llm, chain_type="stuff")
# query = "the year of Sharpley?"
# docs = db.similarity_search(query)

# chain.run(input_documents=docs, question=query)
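
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea follows. The prompt wording and the names `build_stuff_prompt`, `toy_docs`, and `toy_prompt` are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(doc_texts, question):
    """Join every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(doc_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical stand-ins for the page_content of the retrieved documents.
toy_docs = ["Chunk A about forestry.", "Chunk B about economics."]
toy_prompt = build_stuff_prompt(toy_docs, "Where was the term first introduced?")
print(toy_prompt)
```

Because every chunk goes into one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window. This is one reason to keep `chunk_size` modest.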

# 5. Create chatbot with chat memory (OPTIONAL)

In [None]:
# from IPython.display import display
# import ipywidgets as widgets

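# # Create a conversation chain that uses our vector DB as the retriever; this also allows for chat history management.
# # Note: OpenAI is not imported above. Add `from langchain.llms import OpenAI`, or pass the HuggingFaceHub `llm` instead.
# qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())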

In [None]:
# chat_history = []

# def on_submit(_):
#     query = input_box.value
#     input_box.value = ""

#     if query.lower() == 'exit':
#         print("Thank you for using the Transformers chatbot!")
#         return

#     result = qa({"question": query, "chat_history": chat_history})
#     chat_history.append((query, result['answer']))

#     display(widgets.HTML(f'<b>User:</b> {query}'))
#     display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

# print("Welcome to the Transformers chatbot! Type 'exit' to stop.")

# input_box = widgets.Text(placeholder='Please enter your question:')
# input_box.on_submit(on_submit)

# display(input_box)
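
The chat memory in the cell above is just a list of `(question, answer)` tuples that is passed back in on every call. The sketch below shows that bookkeeping without any widgets or model calls. `toy_answer` is a hypothetical stand-in for the `qa` chain, and `ask` and `toy_history` are names introduced only for this example.

```python
def toy_answer(question, chat_history):
    """Hypothetical stand-in for the chain: reports how much history it saw."""
    return f"(answer to {question!r} given {len(chat_history)} earlier turns)"

def ask(question, chat_history):
    """Answer one question and record the turn as a (question, answer) tuple."""
    answer = toy_answer(question, chat_history)
    chat_history.append((question, answer))
    return answer

toy_history = []
first = ask("What is sustainable development?", toy_history)
second = ask("Who introduced it?", toy_history)
print(second)
```

Passing the accumulated history lets the chain resolve follow-up questions such as "Who introduced it?", where "it" only makes sense in light of the previous turn.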