### In the data science and machine learning world, we often have to read papers to keep up with the state of the art, but working through many long papers can be exhausting. Here we'll use LangChain to build an AI tutor that takes a list of machine learning papers and answers specific questions about them.

First, we'll build a **FAISS** index (FAISS is a fast similarity-search library from Meta for exact and approximate nearest-neighbor search) over **OpenAI Embeddings** of the paper text. Then, given a query representing a specific question, FAISS returns the **semantically nearest text chunks** by comparing the query embedding with the stored chunk embeddings.
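
The retrieval step described above can be sketched without any external service. This is a minimal illustration, not how FAISS is implemented: the three-dimensional vectors, the chunk texts, and the `nearest_chunks` helper are all made up for the example, and a brute-force L2 search stands in for the index.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values); real OpenAI
# embeddings have far more dimensions.
chunk_texts = [
    "GANs train a generator and a discriminator",
    "WGAN uses the Wasserstein distance",
    "Representation learning extracts useful features",
]
chunk_vectors = [
    [0.9, 0.1, 0.0],
    [0.7, 0.7, 0.1],
    [0.0, 0.2, 0.9],
]

def nearest_chunks(query_vector, vectors, k=2):
    """Return the indices of the k vectors closest to the query (L2 distance)."""
    distances = [math.dist(query_vector, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: distances[i])[:k]

# Hypothetical embedding of a question about WGANs
query_vector = [0.6, 0.8, 0.0]
top = nearest_chunks(query_vector, chunk_vectors, k=2)
print([chunk_texts[i] for i in top])
```

A real index does the same comparison over thousands of high-dimensional vectors, which is where a dedicated library pays off.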

In [None]:
!pip install langchain
!pip install openai
!pip install requests transformers faiss-cpu
!pip install PyPDF2
!pip install tiktoken

In [3]:
import os

from google.colab import drive
from PyPDF2 import PdfReader

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.docstore.document import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS

In [70]:
%env OPENAI_API_KEY = <OpenAI API key here>

env: OPENAI_API_KEY=<OpenAI API key here>


In [5]:
drive.mount('/content/drive')
gdrive_path = '/content/drive/MyDrive/Papers/' # or whatever your Google Folder path is

Mounted at /content/drive


In [71]:
os.listdir(gdrive_path) # some ml papers

['Representation Learning.pdf', 'GAN.pdf', 'WGAN.pdf']

In [6]:
def get_pdf_data(file_path):
    """Read every page of a PDF in gdrive_path and return it as one Document."""
    reader = PdfReader(os.path.join(gdrive_path, file_path))
    # Fall back to "" in case a page yields no extractable text
    full_doc_text = "".join(page.extract_text() or "" for page in reader.pages)

    return Document(
        page_content=full_doc_text,
        metadata={"source": file_path},  # used later to cite the paper
    )

In [7]:
def source_docs():
    return [get_pdf_data(file) for file in os.listdir(gdrive_path)]

In [27]:
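def search_index(docs):
    """Split each document into chunks, embed them, and save a FAISS index to disk."""
    source_chunks = []
    # Split on spaces into chunks of roughly 1024 characters, with no overlap
    splitter = CharacterTextSplitter(separator=" ", chunk_size=1024, chunk_overlap=0)

    for source in docs:
        for chunk in splitter.split_text(source.page_content):
            # Each chunk keeps its paper's metadata so answers can cite a source
            source_chunks.append(Document(page_content=chunk, metadata=source.metadata))

    # Embed all chunks with OpenAI and persist the index for later queries
    vectorindex_openai = FAISS.from_documents(source_chunks, OpenAIEmbeddings())
    vectorindex_openai.save_local("semantic_chunks/")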
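
To make the chunking step concrete, here is a rough pure-Python sketch of splitting on spaces with a character budget. It only approximates the idea behind `CharacterTextSplitter(separator=" ", chunk_size=..., chunk_overlap=0)` and is not the library's implementation; `split_on_spaces` is a hypothetical helper, and the small `chunk_size` is chosen just to keep the output readable.

```python
def split_on_spaces(text, chunk_size=1024):
    """Greedily pack space-separated words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for word in text.split(" "):
        if not word:
            continue  # skip empty pieces produced by repeated spaces
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)  # budget exceeded: close the chunk
            current = word
    if current:
        chunks.append(current)
    return chunks

demo_chunks = split_on_spaces("the quick brown fox jumps over the lazy dog", chunk_size=15)
print(demo_chunks)
```

With no overlap, a sentence that straddles a chunk boundary is cut in two, so a larger `chunk_overlap` may be worth trying if answers seem to miss context.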

In [28]:
sources = source_docs()
search_index(sources)

In [63]:
stuff_chain = load_qa_with_sources_chain(OpenAI(temperature=0), verbose=False, chain_type="stuff")
mp_reduce_chain = load_qa_with_sources_chain(OpenAI(temperature=0), verbose=False, chain_type="map_reduce")

def print_answer(question, chain):
    # Load the index saved by search_index(); named `index` to avoid shadowing that function
    index = FAISS.load_local("semantic_chunks", OpenAIEmbeddings())

    try:
        print(
            chain(
                {
                    # Retrieve the 3 chunks closest to the question
                    "input_documents": index.similarity_search(question, k=3),
                    "question": question,
                },
                return_only_outputs=True,
            )["output_text"]
        )
    except Exception as e:
        print(f"Unexpected error: {e}")


In [69]:
sources = source_docs()
search_index(sources)

In [67]:
print_answer("How does the Wasserstein distance improve traditional GANs?", stuff_chain)

 The Wasserstein distance improves traditional GANs by providing a meaningful loss metric and improved stability of the optimization process.
SOURCES: WGAN.pdf


# The response is good. Now let's compare it with the answer from the map-reduce chain.

In [65]:
print_answer("How does the Wasserstein distance improve traditional GANs?", mp_reduce_chain)

 The Wasserstein distance allows for cleaner gradients and leverages the geometry of the underlying space, leading to improved evaluation of generative models. 
SOURCES: WGAN.pdf


# Map-reduce applies the LLM chain to each retrieved document separately and then combines the partial results, whereas the stuff chain passes all documents to the LLM in a single prompt. In this example the map-reduce answer is more detailed, but map-reduce chains are also more expensive because they make more LLM calls.
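
The cost difference can be illustrated by counting calls with a stand-in for the model. This is only a sketch under simplifying assumptions: `fake_llm`, `stuff`, and `map_reduce` are hypothetical helpers, and the prompts are placeholders rather than the ones LangChain actually uses.

```python
calls = []

def fake_llm(prompt):
    """Stand-in for a real LLM call: records the prompt and returns a placeholder."""
    calls.append(prompt)
    return f"<answer #{len(calls)}>"

def stuff(docs, question):
    # One call: all documents are concatenated into a single prompt
    context = "\n\n".join(docs)
    return fake_llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(docs, question):
    # Map step: one call per document
    partial = [fake_llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    # Reduce step: one more call to combine the partial answers
    summaries = "\n".join(partial)
    return fake_llm(f"Combine these partial answers:\n{summaries}\n\nQuestion: {question}")

docs = ["chunk A", "chunk B", "chunk C"]
question = "How does the Wasserstein distance improve traditional GANs?"

stuff(docs, question)
stuff_calls = len(calls)

calls.clear()
map_reduce(docs, question)
map_reduce_calls = len(calls)

print(stuff_calls, map_reduce_calls)
```

With `k=3` retrieved chunks, the sketch makes one call for the stuff approach and four for map-reduce, which is the extra expense mentioned above.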