<center><h1>LangChain Demo - Text Analytics QA System</h1></center>

**Notebook Overview:** This notebook builds a question-answering (QA) system with LangChain. The system calls OpenAI's GPT-3.5 model to answer questions about text analytics, using a corpus of relevant web pages whose links were collected via SerpAPI.

<h2>Import OpenAI Model</h2>

In [1]:
# import needed libraries
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load environment variables from .env file
load_dotenv()

# Access the API key
openai_api_key = os.getenv("OPENAI_API_KEY")

# initialize the OpenAI model
llm = ChatOpenAI(openai_api_key=openai_api_key)

<h2> Read in Data from Scraped URLs</h2>

In [3]:
# read data from txt file 
with open('../corpus/web_data/all_urls.txt') as file:
    data = file.readlines()

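# strip whitespace/newlines and skip blank lines (slicing off the last character
# would truncate a final URL that has no trailing newline)
clean_urls = [line.strip() for line in data if line.strip()]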

print(f'Count of urls: {len(clean_urls)}')

Count of urls: 62


In [4]:
clean_urls[0:10]

['https://monkeylearn.com/text-analysis/',
 'https://www.lexalytics.com/technology/text-analytics/',
 'https://azure.microsoft.com/en-us/products/ai-services/ai-language',
 'https://monkeylearn.com/blog/what-is-text-analytics/',
 'https://www.qualtrics.com/experience-management/research/text-analysis/',
 'https://getthematic.com/insights/5-text-analytics-approaches/',
 'https://www.spotfire.com/glossary/what-is-text-analytics',
 'https://en.wikipedia.org/wiki/Text_mining',
 'https://www.techtarget.com/searchbusinessanalytics/definition/text-mining',
 'https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)']

<h2>Create the Corpus</h2>

The code below creates a corpus and stores it in a vector store. Each link is loaded into a LangChain document object using the WebBaseLoader class. The documents are then split into chunks, embedded with OpenAI's embedding model, and stored in a FAISS vector store. This database of vectors allows for efficient retrieval of the chunks most similar to a given prompt/question. Note that LangChain's FAISS wrapper ranks by Euclidean (L2) distance by default, which gives the same ordering as cosine similarity when the embeddings have unit length, as OpenAI's do.
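
To make the idea of similarity-based retrieval concrete, here is a minimal sketch that ranks documents by cosine similarity to a query. It is not how FAISS works internally (FAISS uses optimized index structures); the 3-dimensional vectors, document names, and the `top_k` helper are all hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions.
toy_vectors = {
    "sentiment analysis guide": [0.9, 0.1, 0.0],
    "topic modeling tutorial": [0.7, 0.3, 0.1],
    "cooking recipes": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```

The unrelated "cooking recipes" vector points in a different direction from the query, so it is ranked last and dropped when `k=2`.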

In [5]:
# library to load in web pages 
from langchain_community.document_loaders import WebBaseLoader

# load in documents via their links 
loader = WebBaseLoader(clean_urls)

# load web page as langchain doc 
docs = loader.load()

In [6]:
# all documents have been loaded into doc objects
print(f'Count of documents in loader: {len(docs)}')

docs[0]

Count of documents in loader: 62


Document(page_content='What is Text Analysis? A Beginner’s GuideTry MonkeyLearnWhat is Text Analysis? A Beginner’s GuideThe BasicsHow Does It work?Use Cases & ApplicationsOnline ToolsIf you receive huge amounts of unstructured data in the form of text (emails, social media conversations, chats), you’re probably aware of the challenges that come with analyzing this data.  Manually processing and organizing text data takes time, it’s tedious, inaccurate, and it can be expensive if you need to hire extra staff to sort through text. Automate text analysis with a no-code toolTRY NOWIn this guide, learn more about what text analysis is, how to perform text analysis using AI tools, and why it’s more important than ever to automatically analyze your text in real time. Text Analysis BasicsMethods & TechniquesHow Does Text Analysis Work?How to Analyze Text DataUse Cases and ApplicationsTools and ResourcesTutorialWhat Is Text Analysis?Text analysis (TA) is a machine learning technique used to aut

In [7]:
# LangChain library for embedding models
from langchain_openai import OpenAIEmbeddings

# create embedding model
embedding_model = OpenAIEmbeddings()

In [108]:
# libraries to create vector stores 
from langchain_community.vectorstores import FAISS # Facebook AI Similarity Search 
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split the docs into chunks - chunk_size and chunk_overlap are hyperparameters to tune
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(docs)

# create vectors of the documents 
vector_store = FAISS.from_documents(documents, embedding_model)
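
As a rough illustration of what `chunk_size` and `chunk_overlap` control, the sketch below splits a string into fixed-width character windows. This is a simplification and not the algorithm used by RecursiveCharacterTextSplitter, which tries to break on separators such as paragraph and line breaks first; the helper name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size, chunk_overlap=0):
    """Fixed-width character chunking; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"

no_overlap = split_into_chunks(sample, chunk_size=10)
with_overlap = split_into_chunks(sample, chunk_size=10, chunk_overlap=3)

print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With an overlap of 0, as in the cell above, a sentence that straddles a chunk boundary is cut in two; a small positive overlap repeats the boundary text in both chunks at the cost of more chunks to embed.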

In [109]:
print(documents[0])
print(documents[1])
print(f'count of document chunks: {len(documents)}')

page_content='What is Text Analysis? A Beginner’s GuideTry MonkeyLearnWhat is Text Analysis? A Beginner’s GuideThe BasicsHow Does It work?Use Cases & ApplicationsOnline ToolsIf you receive huge amounts of unstructured data in the form of text (emails, social media conversations, chats), you’re probably aware of the challenges that come with analyzing this data.  Manually processing and organizing text data takes time, it’s tedious, inaccurate, and it can be expensive if you need to hire extra staff to sort through text. Automate text analysis with a no-code toolTRY NOWIn this guide, learn more about what text analysis is, how to perform text analysis using AI tools, and why it’s more important than ever to automatically analyze your text in real time. Text Analysis BasicsMethods & TechniquesHow Does Text Analysis Work?How to Analyze Text DataUse Cases and ApplicationsTools and ResourcesTutorialWhat Is Text Analysis?Text analysis (TA) is a machine learning technique used to automaticall

In [110]:
vector_store

<langchain_community.vectorstores.faiss.FAISS at 0x7f7a7a3c4b20>

In [111]:
# save the vector store for later use 
vector_store.save_local("faiss_index")

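# to reload the index later, pass the same embedding model used to build it
# (newer langchain_community releases may also require allow_dangerous_deserialization=True)
# new_db = FAISS.load_local("faiss_index", embedding_model)

# docs = new_db.similarity_search(query)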

<h2>Create Prompts and Document Chain</h2>

The code below creates a document chain from a model (GPT-3.5) and a prompt. The prompt is user-defined, and the ChatPromptTemplate class lets you write it once as a template with placeholders: here, {context} for the retrieved documents and {input} for the question. Using a template is recommended so that every question is processed efficiently and with consistent instructions.

In [112]:
# library for doc retrieval chain 
from langchain.chains.combine_documents import create_stuff_documents_chain

# library for prompt creation 
from langchain_core.prompts import ChatPromptTemplate

# create prompt that will take question and relevant doc 
prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

# create the chain with model and prompt
document_chain = create_stuff_documents_chain(llm, prompt)
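
Conceptually, a "stuff" chain places the text of every supplied document into the prompt's context slot and sends the filled-in prompt to the model. The sketch below mimics only the prompt-filling step with plain string formatting. The `stuff_documents` helper and the blank-line separator are assumptions made for illustration, not LangChain's implementation, and no model is called.

```python
PROMPT_TEMPLATE = """Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}"""

def stuff_documents(doc_texts, question, separator="\n\n"):
    """Join all document texts into one context string and fill in the template."""
    context = separator.join(doc_texts)
    return PROMPT_TEMPLATE.format(context=context, input=question)

# Hypothetical document texts standing in for retrieved chunks.
example_prompt = stuff_documents(
    ["Text mining extracts patterns from text.", "Sentiment analysis labels opinions."],
    "what is text mining?",
)
print(example_prompt)
```

Because every document is placed in a single prompt, the combined chunks must fit within the model's context window, which is one reason the chunk size and the number of retrieved documents matter.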

<h2>Create Retrieval Chain from VectorStore</h2>

The code below augments the document chain with a retrieval component. The retriever is backed by our vector store and returns the document chunks most similar to a given question. Combining the retriever with the document chain above yields the full QA chain.

In [113]:
# library for making retrieval chain 
from langchain.chains import create_retrieval_chain

# create retriever object from vector store - k is the number of returned docs 
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# create retrieval chain with document chain 
retrieval_chain = create_retrieval_chain(retriever, document_chain)
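
The retrieve-then-answer flow can be sketched without any external services. In the toy version below, the keyword-overlap retriever and the echoing answer function are hypothetical stand-ins for the vector store retriever and the LLM-backed document chain; only the shape of the result (a dict with `input`, `context`, and `answer`) mirrors what the real chain returns in this notebook.

```python
def make_retrieval_chain(retrieve, answer_from_context):
    """Compose a retriever and an answering step into a single callable."""
    def invoke(inputs):
        question = inputs["input"]
        context = retrieve(question)                     # step 1: fetch relevant chunks
        answer = answer_from_context(question, context)  # step 2: answer using only those chunks
        return {"input": question, "context": context, "answer": answer}
    return invoke

# Hypothetical two-chunk corpus.
toy_corpus = [
    "text analytics has applications in marketing and customer feedback",
    "transformers are a deep learning architecture",
]

def keyword_retriever(question, k=1):
    """Rank chunks by the number of words they share with the question."""
    words = set(question.lower().split())
    ranked = sorted(
        toy_corpus,
        key=lambda text: len(words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def echo_answerer(question, context):
    """Stand-in for the LLM: reports what was retrieved."""
    return f"Based on {len(context)} retrieved chunk(s): {context[0]}"

toy_chain = make_retrieval_chain(keyword_retriever, echo_answerer)
result = toy_chain({"input": "what are applications of text analytics"})
print(result["answer"])
```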

In [114]:
# generates good responses for questions within the domain of the corpus 
response = retrieval_chain.invoke({"input": "what are applications of text analytics?"})
print(response["answer"])

The applications of text analytics include clarifying information and providing better search experiences for online media applications, improving predictive analytics models for customer churn in business and marketing, extracting meaning from unstructured data to improve customer satisfaction and conduct market research, analyzing customer feedback to evaluate company performance, responding to business problems by discovering and presenting knowledge locked in textual form, and identifying patterns and providing actionable insights for companies to make improvements.


In [115]:
response

{'input': 'what are applications of text analytics?',
 'context': [Document(page_content='Online media applications[edit]\nText mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.\n\nBusiness and marketing applications[edit]\nText analytics is being used in business, particularly, in marketing, such as in customer relationship management.[29] Coussement and Van den Poel (2008)[30][31] apply it to improve predictive analytics models for customer churn (customer attrition).[30] Text mining is also being applied in stock returns prediction.[32]', metadata={'source': 'https://en.wikipedia.org/wiki/Text_mining', 'title': 'Text mining - Wikipedia', 'langua

In [116]:
# questions outside the domain of the corpus are declined rather than answered
response = retrieval_chain.invoke({"input": "who is Joe Biden?"})
print(response["answer"])

The provided context does not contain any information about Joe Biden.


In [118]:
# gaps in the corpus: an in-domain question goes unanswered when no relevant document is retrieved
response = retrieval_chain.invoke({"input": "how do you remove duplicate documents from a corpus?"})
print(response["answer"])

The provided context does not contain information on how to remove duplicate documents from a corpus.


In [119]:
response

{'input': 'how do you remove duplicate documents from a corpus?',
 'context': [Document(page_content='corpora (documents).Machine translation. Automatic translation of text or speech from one language to another.', metadata={'source': 'https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html', 'title': 'Natural Language Processing (NLP): What it is and why it matters | SAS', 'description': 'Natural language processing (NLP) makes it possible for humans to talk to machines. Find out how our devices understand language and how to apply this technology.', 'language': 'en'}),
  Document(page_content='Easily select the topics and check the recall on each one\nCheck the rules for each topic\nCheck the verbatims each topic is part of speech tagging\nMake changes to the rules and check the changes the edit has made to the count of verbatim tagged', metadata={'source': 'https://www.qualtrics.com/experience-management/research/text-analysis/', 'title': 'Text Anal

<h2>Agents for Retrieval Augmentation</h2>

The examples above show that a RAG system can give an LLM better context for answering questions about a given domain. Retrieval grounds the language model by supplying it with the information relevant to a question. Because the prompt restricts the model to the retrieved context, the system also reduces the chance that the model will hallucinate an answer to a question outside the domain of the corpus. However, the model will also be unable to answer domain-related questions that it could presumably answer on its own if no relevant document is in the corpus. Agents can augment this retrieval process by allowing the model to search the web when it cannot find pertinent information in the corpus.
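
One simple way to picture this augmentation is a fallback rule: try the corpus first, and consult a web search only when the corpus-based answer is a refusal. The sketch below hard-codes that routing, whereas a real agent lets the LLM decide which tool to call. The function names, the stub answerers, and the refusal check on the phrase "does not contain" (taken from the refusals shown above) are all assumptions for illustration, and string matching of this kind is brittle.

```python
def answer_with_fallback(question, corpus_qa, web_search_qa, refusal_marker="does not contain"):
    """Answer from the corpus when possible; otherwise fall back to web search."""
    corpus_answer = corpus_qa(question)
    if refusal_marker in corpus_answer:
        # the corpus had nothing relevant, so use the second tool
        return {"source": "web", "answer": web_search_qa(question)}
    return {"source": "corpus", "answer": corpus_answer}

# Hypothetical stubs standing in for the retrieval chain and a web search tool.
def stub_corpus_qa(question):
    if "text analytics" in question:
        return "Text analytics is used to analyze customer feedback."
    return "The provided context does not contain information about this question."

def stub_web_search_qa(question):
    return f"(web search result for: {question})"

in_corpus = answer_with_fallback("what are applications of text analytics?", stub_corpus_qa, stub_web_search_qa)
out_of_corpus = answer_with_fallback("how do you remove duplicate documents from a corpus?", stub_corpus_qa, stub_web_search_qa)

print(in_corpus["source"], "|", in_corpus["answer"])
print(out_of_corpus["source"], "|", out_of_corpus["answer"])
```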