# Chat to your data

Chat to your data! Select a document and chat with GPT about its content. Ask it to summarize the text, give details on a specific point, or explain concepts discussed in the document. 

Using the LangChain and OpenAI libraries (GPT-3.5, GPT-4), you can load data from a wide range of sources (PDF, doc, spreadsheet, URL, audio). In this demo, we show how to work with a PDF, a website, and a YouTube link. 

We show how the LangChain framework simplifies data loading, data manipulation, and interaction with OpenAI. First, we split large documents into chunks that Large Language Models (LLMs) can handle easily, compute embeddings, and create a vector store from them. Finally, we create a chain with a memory feature that keeps track of the entire conversation. 

This notebook has been inspired by the DeepLearning.AI course ["LangChain: Chat with Your Data"](https://www.deeplearning.ai/short-courses/). For more details check: 
- [OpenAI docs](https://platform.openai.com/docs/introduction) 
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html)

Access to GPT requires creating an OpenAI account and getting an API key, which can be done at [OpenAI](https://platform.openai.com/account/api-keys). Fees are low: for instance, running several tests over a month and loading tens of documents may cost less than $1. You can also set a spending limit and track your usage. 

*LangChain* is a framework that facilitates the development of applications using LLMs. It comprises 
- **components**, modular abstractions to easily interact with LLM libraries, and
- **off-the-shelf chains** that assemble several components for specific higher-level tasks. 

Ready-made chains let you get started with only a few lines of code, while components allow for customization. 

LangChain comprises the following modules: **model I/O** (interface with LLMs), **data connection** (interface with data sources), **chains** (sequence of calls), **agents** (automatic selection of chains), **memory** (track state between chain calls), and **callbacks** (logging). 


## Requirements

Install required libraries

In [None]:
# !pip install -r requirements.txt

Import required modules and libraries.

In [None]:
import os
import json
import re
from getpass import getpass

import openai

### API key from OpenAI

You need an OpenAI account and an API key, which you can create on the [API keys page](https://platform.openai.com/account/api-keys). Provide the organization and API key from your OpenAI account, either in a JSON file or interactively. 

In [None]:
# Retrieve and set the API key
mode_key_json = True
if mode_key_json:
    # Load the key from a JSON file, openai_key.json:
    #     {"organization": <org_key>,
    #      "api_key": <api_key>}
    user = 'abascal'
    path_file_key = f'/home/{user}/Projects/openai'
    name_file_key = "openai_key.json" 

    def read_key_from_file(path_file, name_file_key):
        with open(os.path.join(path_file, name_file_key), 'r') as f:
            org_data = json.load(f)
            
        openai.organization = org_data['organization']
        openai.api_key = org_data['api_key']
        # Return the key so the caller does not receive None
        return org_data['api_key']

    # Read the OpenAI key from the file
    openai_key = read_key_from_file(path_file_key, name_file_key)
else: 
    # Provide the key in the notebook
    if os.getenv("OPENAI_API_KEY") is None:
        if any('VSCODE' in x for x in os.environ):
            print('Please enter password in the VS Code prompt at the top of your VS Code window!')
        os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")

    assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
    # Set the key whether it came from the environment or from the prompt
    openai_key = os.environ["OPENAI_API_KEY"]
    openai.api_key = openai_key
    print("OpenAI API key configured")
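
A variant that avoids a user-specific path is to prefer the `OPENAI_API_KEY` environment variable and fall back to the JSON file only when the variable is unset. The sketch below shows one way to do this, under assumptions: the helper name `load_openai_key` is hypothetical (it is not part of openai or LangChain), and the JSON layout is the one described above, with an `api_key` field.

```python
import json
import os
from pathlib import Path


def load_openai_key(json_path=None, env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, else from a JSON file.

    Hypothetical helper for illustration; not part of openai or LangChain.
    """
    key = os.getenv(env_var)
    if key:
        return key
    if json_path is not None and Path(json_path).is_file():
        with open(json_path, "r") as f:
            return json.load(f)["api_key"]
    raise RuntimeError(f"No API key found in ${env_var} or in {json_path}")
```

It could then be used as `openai.api_key = load_openai_key("openai_key.json")`, which keeps the key out of the notebook itself.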

### API key from wandb

W&B (Weights & Biases) is a platform that helps build better AI models faster by tracking experiments, versioning and iterating on datasets, facilitating model evaluation and reproducibility, and sharing results on dashboards. 

For more details on MLOps (Machine Learning Operations) -- a set of practices for collaboration and communication between data scientists and operations professionals -- check [MLOps](https://neptune.ai/blog/mlops) and [best experiment tracking tools](https://neptune.ai/blog/best-ml-experiment-tracking-tools). 

You can open a free account and create a key, following the steps provided at [W&B](https://docs.wandb.ai/quickstart). 
 

In [None]:
mode_wandb = True
if mode_wandb:
    import wandb
    # Automated logging W&B with Langchain
    os.environ["LANGCHAIN_WANDB_TRACING"] = "true"
    project_name = "chat-to-your-data"
    os.environ["WANDB_PROJECT"] = project_name
    if False:
        from wandb.integration.openai import autolog
        autolog({"project":"llmapps", "job_type": "introduction"})

    # Log in (the first time)
    # Alternatively run on cmd 'wandb login'
    # wandb.login()

## Read and process data

Select the document type and the path to the document. We provide the following examples:
- url: URL of a Le Monde news article or a Wikipedia page
- pdf: Path to a PDF document
- youtube: Link to a YouTube video (song or news). 

In [None]:
example_type = "url"                # Document type: "pdf" or "url" or "youtube"

if example_type == "url":
    doc_type = "url" # "pdf" or "url" or "youtube"
    doc_path = "https://en.wikipedia.org/wiki/Cinque_Terre"
elif example_type == "pdf":
    doc_type = "pdf" 
    doc_path = "../data/Tan_EfficientDetScalableEfficientObjectDetection20.pdf"
elif example_type == "youtube":
    doc_type = "youtube" 
    #doc_path = "https://www.youtube.com/watch?v=PNVgh4ZSjCw"
    doc_path = "https://www.youtube.com/watch?v=W0DM5lcj6mw"
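
Instead of setting the document type by hand, it can often be guessed from the path. The following is a minimal sketch, assuming only the three types used in this notebook; `infer_doc_type` is a hypothetical helper, and a link that points to a PDF file on the web is treated as "url" here.

```python
from urllib.parse import urlparse


def infer_doc_type(doc_path):
    """Guess "youtube", "url" or "pdf" from a path or link (hypothetical helper)."""
    parsed = urlparse(doc_path)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        # YouTube links need the audio loader rather than the web page loader
        if host.endswith("youtube.com") or host.endswith("youtu.be"):
            return "youtube"
        return "url"
    if doc_path.lower().endswith(".pdf"):
        return "pdf"
    raise ValueError(f"Cannot infer document type from: {doc_path!r}")


print(infer_doc_type("https://en.wikipedia.org/wiki/Cinque_Terre"))  # url
```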


### Read the data

Read the data and convert it to text. 

LangChain provides **document loaders** (load different data sources), **document transformers** (split, convert and select documents), **text embedding models** (map unstructured text into a vector space with a distance metric), **vector stores** (store data and search over embedded data), and **retrievers** (query your data).

First, we load data from a given source into a *Document*, which is a piece of text with associated metadata. Examples of document loaders from the module [*langchain.document_loaders*](https://python.langchain.com/docs/modules/data_connection/document_loaders/) are *TextLoader*, *CSVLoader*, *DirectoryLoader*, *PythonLoader*, *UnstructuredHTMLLoader*, *BSHTMLLoader*, *JSONLoader*, *UnstructuredMarkdownLoader*, *PyPDFLoader*, *MathpixPDFLoader*, *UnstructuredPDFLoader*, among others.

In [None]:
# Collapse runs of blank lines (common in web pages)
def clear_blank_lines(docs):
    for doc in docs:
        doc.page_content = re.sub(r"\n\n\n+", "\n\n", doc.page_content)
    return docs

# Read document with langchain.document_loaders
def read_doc(doc_type, doc_path):
    if doc_type == "pdf":
        from langchain.document_loaders import PyPDFLoader
        loader = PyPDFLoader(doc_path)
        docs = loader.load()
    elif doc_type == "url":
        from langchain.document_loaders import WebBaseLoader
        url = doc_path
        loader = WebBaseLoader(url)
        docs = loader.load()
    elif doc_type == "youtube":
        from langchain.document_loaders.blob_loaders.youtube_audio import \
            YoutubeAudioLoader
        from langchain.document_loaders.generic import GenericLoader
        from langchain.document_loaders.parsers import OpenAIWhisperParser
        save_path = "./downloads"
        url = doc_path
        loader = GenericLoader(YoutubeAudioLoader([url], save_path), OpenAIWhisperParser())
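        docs = loader.load()
    else:
        # Fail early instead of hitting a NameError on `docs` below
        raise ValueError(f"Unsupported doc_type: {doc_type!r}")

    # Collapse runs of blank lines (common in web pages)
    clear_blank_lines(docs)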

    print(f"Loaded {len(docs)} pages/documents")
    print(f"First page: {docs[0].metadata}")
    print(docs[0].page_content[:500])
    return docs

def pretty_print_docs(docs, question = None):
    print(f"\n{'-' * 100}\n")
    if question:
        print(f"Question: {question}")

    for i, doc in enumerate(docs):
        print(f"Document {i+1}:\n\nMetadata: {doc.metadata}\n")
        print(doc.page_content)
    print("\n")        

# Read document with langchain.document_loaders
docs = read_doc(doc_type, doc_path)
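
To see what the blank-line cleanup does without loading a real document, the sketch below applies an equivalent regular expression to a stand-in object. `FakeDoc` is a hypothetical substitute for a LangChain *Document* (only the `page_content` and `metadata` fields are mimicked), and the sample text is made up.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collapse_blank_lines(docs):
    # Replace three or more consecutive newlines with exactly two
    for doc in docs:
        doc.page_content = re.sub(r"\n{3,}", "\n\n", doc.page_content)
    return docs


docs_demo = collapse_blank_lines([FakeDoc("Title\n\n\n\n\nFirst paragraph.\n\nSecond.")])
print(repr(docs_demo[0].page_content))
```

Paragraph breaks (two newlines) are preserved, which matters later because the splitter uses them as its preferred split points.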

### Split the data into chunks

After loading, the data must be split into smaller chunks that fit into the LLM calls. Ideally, chunks keep semantically related pieces together and overlap slightly. Several text splitters are available in the module [*langchain.text_splitter*](https://docs.langchain.com/docs/components/indexing/text-splitters): 
- *CharacterTextSplitter*: The most basic splitter; it splits on a single separator, *"\n\n"* by default.
- *RecursiveCharacterTextSplitter*: Recommended splitter for generic text. It tries to split into blocks using the separators *["\n\n", "\n", " ", ""]*, in that order. You can provide *chunk_size*, *chunk_overlap*, and *add_start_index* (chunk position in the original document). 
- *MarkdownHeaderTextSplitter*: Splits on the headers you specify. 
- *TokenTextSplitter*: Splits by tokens, given a maximum number of tokens per chunk. Token-based splitting is available for several tokenizers, such as *tiktoken* from OpenAI, *spaCy*, and *NLTK*.
- *SentenceTransformersTokenTextSplitter*: Splits by tokens using the tokenizer of a sentence-transformers model, so that chunks fit the token window of that model. 

Specific texts can benefit from specialized splitters, such as those for code (Python, Java, ...), LaTeX, or Markdown. 

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Parameters for splitting documents into chunks
chunk_size = 1500                   
chunk_overlap = 150
add_start_index = True

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, 
    chunk_overlap=chunk_overlap,
    add_start_index=add_start_index)

docs_split = text_splitter.split_documents(docs)
print(f"Split into {len(docs_split)} chunks")
print(f"First chunk: {docs_split[0].metadata}")
print(docs_split[0].page_content)
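
The roles of *chunk_size*, *chunk_overlap* and *add_start_index* can be illustrated with a much simpler splitter. The sketch below uses fixed-size character windows; it is not the algorithm used by *RecursiveCharacterTextSplitter*, which also looks for separators, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character windows with overlap (simplified illustration)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.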

Other transformers filter documents. For example, *EmbeddingsRedundantFilter* drops documents whose embeddings are nearly identical to those of other documents. Reordering documents may also be needed when more than 10 documents are retrieved, to avoid performance degradation (see *get_relevant_documents*).

### Create a vector database

It is common to store unstructured data by projecting it into an **embedding vector space**. Then, at query time, the query is also embedded and the 'closest' embedding vectors are retrieved. A **vector store** is responsible for storing and retrieving embedded data. 

Embeddings are computed by an embedding model such as [*OpenAIEmbeddings*](https://js.langchain.com/docs/api/embeddings_openai/classes/OpenAIEmbeddings) and stored in a vector store such as [*Chroma*](https://python.langchain.com/docs/integrations/vectorstores/chroma). In this notebook we use *DocArrayInMemorySearch*, an in-memory vector store, to index and search the embeddings; *Chroma* is shown as a commented alternative. 

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DocArrayInMemorySearch
#from langchain.vectorstores import Chroma

# Define embedding
embedding = OpenAIEmbeddings(openai_api_key=openai.api_key)    

# Create vector database from data    
db = DocArrayInMemorySearch.from_documents(
    docs_split, 
    embedding=embedding)

#db = Chroma.from_documents(docs_split, embedding=embedding)
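
The idea behind a vector store can be shown without any API call. The sketch below is a toy version under stated assumptions: word counts stand in for learned embeddings, cosine similarity is the distance, and `TinyVectorStore` is a hypothetical class, not a LangChain one.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class TinyVectorStore:
    """Hypothetical in-memory store; not a LangChain class."""

    def __init__(self, texts):
        # Embed every text once, when the store is built
        self.items = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=2):
        # Embed the query and return the k most similar texts
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore([
    "Cinque Terre is a coastal area in Liguria.",
    "EfficientDet is an object detection model.",
    "The five villages are linked by hiking trails.",
])
print(store.similarity_search("Which model does object detection?", k=1))
```

A real embedding model also places texts with similar meaning but different wording close together, which word counts cannot do.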

Documents can be filtered in the embedded space. For instance, given a query, we can perform a similarity search to retrieve a given number of documents, using *similarity_search* or *similarity_search_by_vector*. Another possibility that may work better is to compress documents using *ContextualCompressionRetriever*. 

In [None]:
mode_similarity = False
if mode_similarity:
    k = 5   # Number of documents to return
    query = "Documents that provide details about the artists."
    # filter = {"source": "file.pdf"}
    # Keep the result so it can be inspected
    docs_similar = db.similarity_search(query, k=k)
    pretty_print_docs(docs_similar, query)
    # docs_similar = db.max_marginal_relevance_search(query, k=k)

    # Or select docs by embedded vector
    # embedding_vector = embedding.embed_query(query)
    # docs_similar = db.similarity_search_by_vector(embedding_vector)

Vector stores and similarity searches can also be created asynchronously, for example using *Qdrant*.

## LLM and retrievers

We select *gpt-3.5-turbo* as the LLM, with *temperature=0*. The lower the temperature, the more deterministic the output; the higher its value, the more random the result (for the OpenAI API, $temperature\in[0,2]$). Use values below 0.3 for text summarization or grammar correction and higher values for creative text generation. You may choose a different model. We then initialize the LLM.

In [None]:
from langchain.chat_models import ChatOpenAI
#from langchain.llms import OpenAI

# Model name
llm_name = "gpt-3.5-turbo"

# Init the LLM
# llm = OpenAI(temperature=0, openai_api_key=openai.api_key)
llm = ChatOpenAI(model_name=llm_name,
                 temperature=0,
                 openai_api_key=openai.api_key)
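
The effect of the temperature can be illustrated with a softmax over token scores. This is a conceptual sketch only: the scores are made up, `softmax_with_temperature` is a hypothetical helper, and nothing here is taken from how the OpenAI API is implemented.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; temperature must be > 0 here."""
    if temperature <= 0:
        # At temperature 0 the choice is deterministic: pick the argmax instead
        raise ValueError("use argmax for temperature == 0")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass goes to the top-scoring token, while at high temperature the distribution flattens and sampling becomes more varied.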


Retrievers are interfaces that return documents given a query. Unlike vector stores, they do not need to store documents. Although there are many retrievers, we focus on those built from vector stores. 

Retrieval quality may vary depending on the query and its embedding. There are several retrievers: *SelfQueryRetriever*, *MultiQueryRetriever*, *ContextualCompressionRetriever*, among others. Below, we show an example of *ContextualCompressionRetriever*, which compresses the retrieved documents so that only the content relevant to the query is kept. 

In [None]:
# Select documents using compression
mode_compression = False 
if mode_compression:
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    # Wrap the vector store retriever with an LLM-based compressor
    compressor = LLMChainExtractor.from_llm(llm)
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=db.as_retriever()
    )
    query = "Documents that provide details about the artists."
    compressed_docs = compression_retriever.get_relevant_documents(query)
    pretty_print_docs(compressed_docs, query)
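
Contextual compression keeps only the parts of each retrieved chunk that relate to the query. *LLMChainExtractor* uses an LLM for this; the sketch below swaps in plain keyword overlap so that it runs offline. The function names and the sample chunk are hypothetical.

```python
import re


def content_words(text):
    """Lowercase words longer than three characters (crude stop-word filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def extract_relevant_sentences(text, query):
    """Keep only the sentences that share a content word with the query."""
    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if content_words(s) & query_words]
    return " ".join(kept)


# Made-up chunk for illustration
chunk = ("Cinque Terre has five villages. A railway links the villages. "
         "Tickets are sold online.")
compressed = extract_relevant_sentences(chunk, "How are the villages connected?")
print(compressed)
```

The last sentence is dropped because it shares no content word with the query, which shortens the context passed to the LLM.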

## Chain

[Chains](https://python.langchain.com/docs/modules/chains/) combine components to provide higher-level, task-specific applications. A chat chain must be initialized with a memory object, which persists data across multiple calls. 

Question answering involves creating an index, creating a retriever from this index, creating a question answering chain, and asking questions. 

In [None]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# QA CHAIN
qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=db.as_retriever(),
    memory=memory
)
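
The role of the memory object can be sketched in a few lines. This is a simplified illustration, not how *ConversationalRetrievalChain* works internally (the real chain also retrieves documents for each question): `BufferMemory`, `chat_turn` and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real LLM call.

```python
class BufferMemory:
    """Hypothetical minimal chat memory: a list of (role, content) pairs."""

    def __init__(self):
        self.messages = []

    def save_turn(self, question, answer):
        # Messages alternate: human question first, then the AI answer
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self):
        return "\n".join(f"{role}: {content}" for role, content in self.messages)


def chat_turn(question, memory, answer_fn):
    """Build a prompt from the history plus the new question, then store the turn."""
    history = memory.as_text()
    prompt = f"{history}\nhuman: {question}" if history else f"human: {question}"
    answer = answer_fn(prompt)
    memory.save_turn(question, answer)
    return answer


def fake_llm(prompt):
    # Stand-in for a real LLM call: reports how many lines of context it received
    return f"(saw {len(prompt.splitlines())} lines of context)"


memory_demo = BufferMemory()
answer_1 = chat_turn("What is Cinque Terre?", memory_demo, fake_llm)
answer_2 = chat_turn("How many villages does it have?", memory_demo, fake_llm)
print(answer_1)
print(answer_2)
```

The second call receives the first question and answer as context, which is what lets a follow-up such as "it" be resolved.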

## Chat

In [None]:
# Start interaction
qa_on = False   # Set to True to ask questions interactively
while qa_on:
    # Prompt the user to enter a question
    question = input("Ask a question or type 'end chat': ")
    
    if question.lower() == "end chat":
        break

    # Run QA chain
    result = qa_chain({"question": question})
    print(f"Answer: {result['answer']}")


## Build a chat app with Gradio

Gradio is a Python library for quickly building interfaces for your machine learning models. Use it to build interactive apps to test your demos. 

Import the required Python libraries.

In [None]:
import gradio as gr

### Simple interface

We first build a simple interface using [*gr.Interface*](https://www.gradio.app/docs/interface), which wraps any Python function with an interface. In this case we wrap a text function: either *qa_answer*, which takes a text as input and returns the answer as text, or *qa_history*, which returns a string containing the whole conversation history. You can customize the text fields and provide a description and examples.

In [None]:
def qa_call(question):
    # QA call: returns a dict with the answer and the chat history
    output = qa_chain({"question": question})
    return output

def qa_answer(question):
    # Return only the answer from the QA call
    return qa_call(question)['answer']

def qa_history(question):
    # Return the whole conversation as formatted text
    response = qa_call(question)
    output = ""
    response_history = response['chat_history']
    # History alternates question and answer messages
    num_qa = len(response_history)//2
    for i in range(num_qa):
        output += "Q: " + response_history[2*i].content + "\n"
        output += "A: " + response_history[2*i+1].content + "\n"
    return output
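
The history formatting can be tried without calling the chain. The sketch below assumes, as in *qa_history*, that the history alternates question and answer messages and that each message exposes a `content` field; `Msg` and `format_history` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Hypothetical stand-in for a chat message object with a `content` field."""
    content: str


def format_history(messages):
    """Pair alternating question/answer messages into 'Q: ...' and 'A: ...' lines."""
    lines = []
    for question, answer in zip(messages[0::2], messages[1::2]):
        lines.append(f"Q: {question.content}")
        lines.append(f"A: {answer.content}")
    return "\n".join(lines)


history_demo = [Msg("What is this page about?"), Msg("It describes Cinque Terre."),
                Msg("Summarize it.")]  # trailing unanswered question is ignored
print(format_history(history_demo))
```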

We build an interface, providing the inputs, the outputs and the function *qa_history*.

In [None]:
demo = gr.Interface(fn=qa_history, 
                    inputs=[gr.Textbox(label="User question", 
                                       lines=2)],
                    outputs=[gr.Textbox(label="Chat answer", 
                                        lines=4)],
                    title="Chat to your data",
                    description=f"Ask questions about your data to {llm_name}!",
                    allow_flagging="never",
                    examples=["Summarize the document", "Can you provide details about ...", "Can you explain what is ...?"]
                   )
demo.launch()

A newer alternative is *ChatInterface*, which provides a minimalistic interface for chatbot UIs and requires only the function parameter. This function should take the prompt and the history as input and return the answer; the history is managed internally. Other parameters allow customizing the interface: [chatinterface](https://www.gradio.app/docs/chatinterface). 

In [None]:
def qa_input_msg_history(message, history):
    # QA function that takes the user message and the history
    # The history is managed internally by ChatInterface and is not used here
    answer = qa_answer(message)
    return answer

In [None]:
# Init memory and QA chain
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)
qa_chain = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=db.as_retriever(),
    memory=memory
)

demo = gr.ChatInterface(fn=qa_input_msg_history, 
                    title="Chat to your data",
                    description=f"Ask questions about your data to {llm_name}!",
                    examples=["Summarize the document", "Can you provide details about ...", "Can you explain what is ...?"]
                   )
demo.launch()

In [None]:
if mode_wandb:
    wandb.finish()