<a href="https://colab.research.google.com/github/leaderman77/RAG-AcademicSearch/blob/main/RAG_with_Arxiv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing necessary libraries

In [1]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken faiss-cpu

# Providing an OpenAI API key to use the GPT-3.5 large language model in this project

In [2]:
import os
import openai
from getpass import getpass

# Prompt for the key at runtime so the secret is never stored in the notebook
openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

# Getting papers from arXiv
## Here **load_max_docs** is the number of papers to load; this value can be changed.

## The user asks a general question to load a list of papers relevant to the query. For example: "k-Means cluster algorithm"

In [3]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="k-Means cluster algorithm", load_max_docs=10).load()
len(base_docs)

10

# Check the papers' metadata

In [4]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2014-10-26', 'Title': 'Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering', 'Authors': 'Apoorv Agarwal, Anna Choromanska, Krzysztof Choromanski', 'Summary': 'In this paper, we compare three initialization schemes for the KMEANS\nclustering algorithm: 1) random initialization (KMEANSRAND), 2) KMEANS++, and\n3) KMEANSD++. Both KMEANSRAND and KMEANS++ have a major that the value of k\nneeds to be set by the user of the algorithms. (Kang 2013) recently proposed a\nnovel use of determinantal point processes for sampling the initial centroids\nfor the KMEANS algorithm (we call it KMEANSD++). They, however, do not provide\nany evaluation establishing that KMEANSD++ is better than other algorithms. In\nthis paper, we show that the performance of KMEANSD++ is comparable to KMEANS++\n(both of which are better than KMEANSRAND) with KMEANSD++ having an additional\nthat it can automatically approximate the value of k.'}
{'Published': '201

In [5]:
base_docs[0].page_content

'Notes on Using Determinantal Point Processes for Clustering with Applications to\nText Clustering\nApoorv Agarwal\nColumbia University\nNew York, NY, USA\napoorv@cs.columbia.edu\nAnna Choromanska\nCourant Institute of Mathematical Sciences\nNew York, NY, USA\nachoroma@cims.nyu.edu\nKrzysztof Choromanski\nGoogle Research\nNew York, NY, USA\nkchoro@gmail.com\nAbstract\nIn this paper, we compare three initialization schemes\nfor the KMEANS clustering algorithm: 1) random ini-\ntialization (KMEANSRAND), 2) KMEANS++, and 3)\nKMEANSD++. Both KMEANSRAND and KMEANS++\nhave a major that the value of k needs to be set by the\nuser of the algorithms. (Kang 2013) recently proposed a\nnovel use of determinantal point processes for sampling\nthe initial centroids for the KMEANS algorithm (we call\nit KMEANSD++). They, however, do not provide any\nevaluation establishing that KMEANSD++ is better than\nother algorithms. In this paper, we show that the perfor-\nmance of KMEANSD++ is comparable to KMEA

In [6]:
# from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# docs = text_splitter.split_documents(base_docs)
# vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# Indexing
1.   Split each paper's text into chunks and convert them into embeddings
2.   Store the embeddings in a vector database (FAISS)
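
The two indexing steps above can be illustrated without any external service. The following is a minimal sketch, not the notebook's actual pipeline: `split_into_chunks`, `toy_embed`, and `ToyIndex` are hypothetical stand-ins for the text splitter, the OpenAI embedding model, and the FAISS store, and a bag-of-words count replaces a real embedding.

```python
import math
from collections import Counter

def split_into_chunks(text, chunk_size=500):
    """Fixed-width split; a simplification of what a recursive splitter does."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyIndex:
    """Hypothetical in-memory index: stores (chunk, vector) pairs."""
    def __init__(self, chunks):
        self.entries = [(chunk, toy_embed(chunk)) for chunk in chunks]

    def search(self, query, k=2):
        """Return the k chunks most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

paper_text = (
    "k-means assigns each point to its nearest centroid. "
    "Determinantal point processes sample diverse initial centroids. "
    "Naive Bayes is a supervised classifier."
)
chunks = split_into_chunks(paper_text, chunk_size=60)  # step 1: split
index = ToyIndex(chunks)                               # steps 1-2: embed and store
top = index.search("nearest centroid", k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, so this sketch only shows the shape of the data flow: text in, chunks out, one vector per chunk, similarity search over the vectors.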

In [7]:
from langchain_community.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(base_docs)
# vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

vectorstore = FAISS.from_documents(docs, OpenAIEmbeddings())


In [9]:
len(docs)
print(max([len(chunk.page_content) for chunk in docs]))

499


# Retrieving relevant documents. We use top-k retrieval to fetch the k most relevant chunks and "mmr" (Maximal Marginal Relevance) to diversify the results
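
To make the idea behind Maximal Marginal Relevance concrete, here is a minimal sketch of the standard greedy formulation on toy 2-D vectors. The function `mmr_select` and the toy vectors are illustrative assumptions; the exact scoring inside the vector store may differ. Each step picks the candidate that maximizes `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected

query_vec = [1.0, 0.0]
toy_vectors = [
    [1.0, 0.1],    # very relevant
    [1.0, 0.12],   # almost a duplicate of the first vector
    [1.0, -1.0],   # less relevant, but different
]

picked = mmr_select(query_vec, toy_vectors, k=2, lambda_mult=0.5)
plain_top_k = mmr_select(query_vec, toy_vectors, k=2, lambda_mult=1.0)
print(picked)       # [0, 2]
print(plain_top_k)  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain top-k by similarity, which returns the near-duplicate pair. With `lambda_mult=0.5` the second pick skips the near-duplicate in favor of the more distinct vector.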

In [10]:
base_retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k":5})
relevant_docs = base_retriever.get_relevant_documents("k-Means cluster algorithm")
len(relevant_docs)
for doc in relevant_docs:
  print(doc.metadata)

{'Published': '2022-05-09', 'Title': 'A Hybrid Approach: Utilising Kmeans Clustering and Naive Bayes for IoT Anomaly Detection', 'Authors': 'Lincoln Best, Ernest Foo, Hui Tian', 'Summary': 'The proliferation and variety of Internet of Things devices means that they\nhave increasingly become a viable target for malicious users. This has created\na need for anomaly detection algorithms that can work across multiple devices.\nThis thesis suggests a potential alternative to the current anomaly detection\nalgorithms to be implemented within IoT systems that can be applied across\ndifferent types of devices. This algorithm is comprised of both unsupverised\nand supervised machine areas of machine learning combining the strongest facet\nof each. The algorithm involves the initial k-means clustering of attacks and\nassigns them to clusters. Next, the clusters are then used by the AdaBoosted\nNaive Bayes supervised learning algorithm in order to teach itself which piece\nof data should be clust

# Create prompt and response templates

In [14]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

In [16]:
template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

def generate_summary_for_doc(doc, language_model, template):
    # Extracting information from the Document object
    title = doc.metadata.get('Title', 'No Title')
    authors = doc.metadata.get('Authors', 'Unknown Authors')
    published = doc.metadata.get('Published', 'Unknown Date')
    page_content = doc.page_content

    # Forming the context with the document information
    context = f"**Title**: {title}\n**Authors**: {authors}\n**Published**: {published}\n**Content**: {page_content}"

    # Forming the full prompt by substituting context into the template
    full_prompt = template.format(context=context, question="Summarize the content in no more than two lines.")

    # Sending the prompt to the language model for generating the summary
    summary = language_model.invoke(full_prompt)

    # Updating the document's metadata with the summary
    doc.metadata['LLM_Summary'] = summary

    return doc.metadata

# Initialize the language model
primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


# Test RAG app: for each retrieved paper we can see the Title, Authors, Published date, Summary (the paper's abstract), and the LLM-generated summary in the "LLM_Summary" field, based on the question above

In [20]:
# Example usage
for doc in relevant_docs:
  summary_output = generate_summary_for_doc(doc, primary_qa_llm, template)
  print(summary_output)

{'Published': '2022-05-09', 'Title': 'A Hybrid Approach: Utilising Kmeans Clustering and Naive Bayes for IoT Anomaly Detection', 'Authors': 'Lincoln Best, Ernest Foo, Hui Tian', 'Summary': 'The proliferation and variety of Internet of Things devices means that they\nhave increasingly become a viable target for malicious users. This has created\na need for anomaly detection algorithms that can work across multiple devices.\nThis thesis suggests a potential alternative to the current anomaly detection\nalgorithms to be implemented within IoT systems that can be applied across\ndifferent types of devices. This algorithm is comprised of both unsupverised\nand supervised machine areas of machine learning combining the strongest facet\nof each. The algorithm involves the initial k-means clustering of attacks and\nassigns them to clusters. Next, the clusters are then used by the AdaBoosted\nNaive Bayes supervised learning algorithm in order to teach itself which piece\nof data should be clust

In [17]:
for doc in relevant_docs:
  summary_output = generate_summary_for_doc(doc, primary_qa_llm, template)
  print("Title:", summary_output['Title'])
  print("LLM generated summary:", summary_output['LLM_Summary'])

Title: A Hybrid Approach: Utilising Kmeans Clustering and Naive Bayes for IoT Anomaly Detection
LLM generated summary: content='The content discusses the use of Kmeans clustering and Naive Bayes for IoT anomaly detection, explaining the iterative process of Kmeans clustering in three steps.'
Title: Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering
LLM generated summary: content='Answer: The content discusses the performance of different clustering algorithms, specifically KMEANSD++, KMEANS++, and KMEANSRAND, in the context of text clustering.'
Title: Kernel KMeans clustering splits for end-to-end unsupervised decision trees
LLM generated summary: content='The authors propose a Kernel KMeans clustering algorithm for unsupervised decision trees, aiming to minimize the cluster sum of squares in a Hilbert space with Kmax centroids.'
Title: MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers
LLM generated summary:
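
Each summary printed above starts with `content=`, most likely because the value stored in `LLM_Summary` is the chat model's reply object rather than a plain string. The sketch below shows one way to store only the text. `FakeMessage` and `FakeChatModel` are hypothetical stand-ins used so the example runs without an API key, and `summary_text` is an illustrative helper, not part of the notebook.

```python
from dataclasses import dataclass

@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat model reply that carries a `content` field."""
    content: str

class FakeChatModel:
    """Hypothetical stand-in for the language model; no network call is made."""
    def invoke(self, prompt):
        return FakeMessage(content=f"Summary of a {len(prompt)}-character prompt.")

def summary_text(reply):
    """Return plain text whether the reply is a message object or already a string."""
    return getattr(reply, "content", reply)

reply = FakeChatModel().invoke("Summarize the content in no more than two lines.")

metadata = {"Title": "Example paper"}
metadata["LLM_Summary"] = summary_text(reply)  # store the text, not the object

print("LLM generated summary:", metadata["LLM_Summary"])
```

Applied to the function above, the same idea would mean storing the reply's `content` attribute in the metadata instead of the reply itself.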

# Create another template, this time wrapped in a ChatPromptTemplate

In [21]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [22]:
primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
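
The data flow of the chain above can be traced with plain functions. This is a rough sketch under stated assumptions: `toy_retriever`, `toy_llm`, `run_chain`, and the two-sentence corpus are hypothetical stand-ins, and the real chain passes Document and message objects rather than strings.

```python
TEMPLATE = """### CONTEXT
{context}

### QUESTION
Question: {question}
"""

# Hypothetical two-chunk corpus standing in for the vector store
CORPUS = [
    "k = 2 appeared to have produced the best clustering results",
    "Naive Bayes is trained on the resulting clusters",
]

def toy_retriever(question):
    """Return chunks that share at least one word with the question."""
    words = set(question.lower().split())
    return [c for c in CORPUS if words & set(c.lower().split())]

def toy_llm(prompt_text):
    """Stand-in for the model: reports the prompt size instead of answering."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"

def run_chain(inputs, retriever, llm, template):
    # Step 1: build "context" by retrieving on the question; keep the question
    step1 = {
        "context": retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass both keys through unchanged
    step2 = dict(step1)
    # Step 3: format the prompt, call the model, and return the context alongside
    prompt_text = template.format(**step2)
    return {"response": llm(prompt_text), "context": step2["context"]}

result = run_chain(
    {"question": "best clustering results for k"}, toy_retriever, toy_llm, TEMPLATE
)
print(result["context"])
print(result["response"])
```

Returning the context next to the response makes it possible to inspect which chunks the answer was grounded in.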

# Ask a specific question to get a response based on the papers we retrieved.
For example: "What is recommended value for number of clusters in k-Means?"

In [23]:
question = "What is recommended value for number of clusters in k-Means?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

{'response': AIMessage(content='Answer: k = 2'), 'context': [Document(page_content='cluster.\nAccording to [13] and our experiments, k = 2 appeared to have produced the\nbest clustering results comparing to using 3 or more clusters, when some clusters\ncould end up containing too many extreme values and aﬀect the calculation of\nmean and standard deviation values. According to step 2 in Algorithm 1, the\nintra-cluster distance, from each data point to the centroid of the cluster it\nbelongs to, is calculated. All intra-cluster distances are sorted into descending', metadata={'Published': '2019-10-15', 'Title': 'MSD-Kmeans: A Novel Algorithm for Efficient Detection of Global and Local Outliers', 'Authors': 'Yuanyuan Wei, Julian Jang-Jaccard, Fariza Sabrina, Timothy McIntosh', 'Summary': 'Outlier detection is a technique in data mining that aims to detect unusual\nor unexpected records in the dataset. Existing outlier detection algorithms\nhave different pros and cons and exhibit differe