## Chatbot with LlamaIndex and OpenAI

LlamaIndex serves as a bridge between your data and Large Language Models (LLMs), providing a toolkit that enables you to establish a query interface around your data for a variety of tasks, such as question-answering and summarization.

**1- Ingest Data**

In [1]:
import os
import openai

# Placeholder key: replace it with your own, or set OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "sk-openai_api_key"
openai.api_key = os.environ["OPENAI_API_KEY"]

import nest_asyncio

nest_asyncio.apply()

In [5]:
from llama_hub.file.unstructured.base import UnstructuredReader
from pathlib import Path

In [6]:
!mkdir data
!wget "https://www.dropbox.com/s/948jr9cfs7fgj99/UBER.zip?dl=1" -O data/UBER.zip
!unzip data/UBER.zip -d data

Next, we load the 10-K filing for each year and attach the year as metadata to its documents.

In [9]:
years = [2022, 2021, 2020, 2019]

loader = UnstructuredReader()
doc_set = {}
all_docs = []
for year in years:
    year_docs = loader.load_data(
        file=Path(f"./data/UBER/UBER_{year}.html"), split_documents=False
    )
    # attach the year as metadata to each document loaded for that year
    for d in year_docs:
        d.metadata = {"year": year}
    doc_set[year] = year_docs
    all_docs.extend(year_docs)

[nltk_data] Downloading package punkt to /Users/smail/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/smail/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
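
Note that the loop above assigns a fresh dictionary to `d.metadata`, which replaces any metadata the reader may already have set. The following is a minimal sketch of a merging variant. The `Doc` class and `tag_with_year` helper are hypothetical stand-ins introduced here for illustration, not part of LlamaIndex.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document (not the LlamaIndex class)."""
    text: str
    metadata: dict = field(default_factory=dict)


def tag_with_year(docs, year):
    """Add a 'year' key to each document's metadata, keeping existing keys."""
    for d in docs:
        d.metadata = {**d.metadata, "year": year}
    return docs


docs = tag_with_year([Doc("10-K text", {"filename": "UBER_2022.html"})], 2022)
print(docs[0].metadata)
```

The same merge expression could be used inside the loading loop if the existing metadata keys matter for later filtering.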


**2- Setting up Vector Indices for each year:**

In [16]:
# initialize simple vector indices
from llama_index import VectorStoreIndex, ServiceContext, StorageContext

In [17]:
index_set = {}
service_context = ServiceContext.from_defaults(chunk_size=512)

In [19]:


for year in years:
    storage_context = StorageContext.from_defaults()
    cur_index = VectorStoreIndex.from_documents(
        doc_set[year],
        service_context=service_context,
        storage_context=storage_context,
    )
    index_set[year] = cur_index
    storage_context.persist(persist_dir=f"./storage/{year}")

In [20]:
# Load indices from disk
from llama_index import load_index_from_storage

index_set = {}
for year in years:
    storage_context = StorageContext.from_defaults(
        persist_dir=f"./storage/{year}"
    )
    cur_index = load_index_from_storage(
        storage_context, service_context=service_context
    )
    index_set[year] = cur_index
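
Building the indices calls the embedding API, so it can be worth skipping the build when a persisted copy already exists. Below is a minimal sketch of that "build or load" pattern. The `build_or_load` helper and the demo callables are assumptions made for illustration. In this tutorial, `build` would wrap the `VectorStoreIndex.from_documents(...)` and `persist(...)` calls, and `load` would wrap `load_index_from_storage(...)`.

```python
import tempfile
from pathlib import Path


def build_or_load(persist_dir, build, load):
    """Load from persist_dir if it already holds files; otherwise build into it."""
    path = Path(persist_dir)
    if path.is_dir() and any(path.iterdir()):
        return load(path)
    path.mkdir(parents=True, exist_ok=True)
    return build(path)


def demo_build(path):
    # Stand-in for building an index and persisting it to disk.
    (path / "index.txt").write_text("built")
    return "built"


def demo_load(path):
    # Stand-in for loading a previously persisted index.
    return "loaded:" + (path / "index.txt").read_text()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "storage" / "2022"
    first = build_or_load(target, demo_build, demo_load)   # directory missing: builds
    second = build_or_load(target, demo_build, demo_load)  # directory populated: loads

print(first, second)
```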

**3- Setting up a Sub Question Query Engine to Synthesize Answers**

In [22]:
from llama_index.tools import QueryEngineTool, ToolMetadata

individual_query_engine_tools = [
    QueryEngineTool(
        query_engine=index_set[year].as_query_engine(),
        metadata=ToolMetadata(
            name=f"vector_index_{year}",
            description=f"useful for when you want to answer queries about the {year} SEC 10-K for Uber",
        ),
    )
    for year in years
]

In [23]:
from llama_index.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=individual_query_engine_tools,
    service_context=service_context,
)
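
To make the idea concrete, here is a toy sketch of the sub-question pattern: break a question into one sub-question per tool, answer each, then combine the results. It is not the `SubQuestionQueryEngine` implementation. In the real engine an LLM generates the sub-questions and chooses the tools, whereas this sketch simply fans out to every tool. All names below (`answer_with_sub_questions`, `toy_tools`, `join_answers`) are hypothetical.

```python
def answer_with_sub_questions(question, tools, synthesize):
    """Ask every tool a tool-specific sub-question, then combine the answers."""
    sub_answers = {}
    for name, tool in tools.items():
        sub_question = f"[{name}] {question}"
        sub_answers[name] = tool(sub_question)
    return synthesize(question, sub_answers)


# Toy tools: each one stands in for a per-year query engine.
toy_tools = {
    f"vector_index_{year}": (lambda q, y=year: f"risk summary for {y}")
    for year in (2022, 2021)
}


def join_answers(question, sub_answers):
    """Toy synthesizer: list each tool's answer as a bullet point."""
    return "\n".join(f"- {name}: {ans}" for name, ans in sub_answers.items())


result = answer_with_sub_questions(
    "What are the risk factors?", toy_tools, join_answers
)
print(result)
```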

**4- Setting up the Chatbot Agent:**

We use a LlamaIndex Data Agent to set up the outer chatbot agent, which has access to a set of Tools. Specifically, we use an OpenAIAgent, which takes advantage of OpenAI API function calling. The agent receives the separate Tools defined previously for each index (one per year), plus a Tool for the sub question query engine defined above.

In [24]:
query_engine_tool = QueryEngineTool(
    query_engine=query_engine,
    metadata=ToolMetadata(
        name="sub_question_query_engine",
        description="useful for when you want to answer queries that require analyzing multiple SEC 10-K documents for Uber",
    ),
)

In [26]:
tools = individual_query_engine_tools + [query_engine_tool]

In [27]:
from llama_index.agent import OpenAIAgent

agent = OpenAIAgent.from_tools(tools, verbose=True)
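
With function calling, the model replies with the name of a tool and a JSON string of arguments, and the agent runs the matching tool. The sketch below illustrates only that dispatch step, under simplifying assumptions. The `dispatch` helper, the `registry` dictionary, and the hand-written `call` message are hypothetical and do not reflect the internals of `OpenAIAgent`.

```python
import json


def dispatch(function_call, registry):
    """Run the tool named in a function-call message and return its output."""
    name = function_call["name"]
    if name not in registry:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(function_call["arguments"])
    return registry[name](**kwargs)


def vector_index_2020(input):
    # Stand-in for the 2020 query engine tool.
    return f"2020 answer to: {input}"


registry = {"vector_index_2020": vector_index_2020}

# A hand-written message shaped like a model's function-call reply.
call = {
    "name": "vector_index_2020",
    "arguments": '{"input": "biggest risk factors"}',
}
output = dispatch(call, registry)
print(output)
```

The tool descriptions passed in `ToolMetadata` are what the model reads when deciding which name to return, so they are worth keeping specific.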

**5- Testing the Agent**

In [28]:
response = agent.chat("hi, i am abdo")
print(str(response))

STARTING TURN 1
---------------

Hello Abdo! How can I assist you today?


In [29]:
response = agent.chat(
    "What were some of the biggest risk factors in 2020 for Uber?"
)
print(str(response))

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: vector_index_2020 with args: {
  "input": "biggest risk factors"
}
Got output: The biggest risk factors mentioned in the context are as follows:

1. The COVID-19 pandemic and the impact of actions to mitigate the pandemic.
2. The classification of Drivers as employees, workers, or quasi-employees instead of independent contractors.
3. Intense competition in the mobility, delivery, and logistics industries.
4. The need to lower fares or service fees and offer incentives and discounts to remain competitive.
5. Significant losses incurred and the uncertainty of achieving profitability.
6. The challenge of attracting and maintaining a critical mass of platform users.
7. Operational, compliance, and cultural challenges related to the workplace culture.
8. Inquiries, investigations, and requests for information from government agencies.
9. Risks related to data collection, use, transfer, disclosure, and other process

In [30]:
cross_query_str = "Compare/contrast the risk factors described in the Uber 10-K across years. Give answer in bullet points."

response = agent.chat(cross_query_str)
print(str(response))

STARTING TURN 1
---------------

=== Calling Function ===
Calling function: sub_question_query_engine with args: {
  "input": "Compare and contrast risk factors in Uber's 10-K across years"
}
Generated 4 sub questions.
[vector_index_2022] Q: What are the risk factors mentioned in Uber's 2022 SEC 10-K?
[vector_index_2021] Q: What are the risk factors mentioned in Uber's 2021 SEC 10-K?
[vector_index_2020] Q: What are the risk factors mentioned in Uber's 2020 SEC 10-K?
[vector_index_2019] Q: What are the risk factors mentioned in Uber's 2019 SEC 10-K?
[vector_index_2022] A: Some of the risk factors mentioned in Uber's 2022 SEC 10-K include the potential adverse effect on their business if drivers were classified as employees instead of independent contractors, the highly competitive nature of the mobility, delivery, and logistics industries, the need to lower fare

In [33]:
agent = OpenAIAgent.from_tools(tools)  # verbose=False by default

while True:
    text_input = input("User: ")
    if text_input == "exit":
        break
    response = agent.chat(text_input)
    print(f"Agent: {response}")
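
The loop above stops only on the exact string `exit`. A slightly more forgiving variant is sketched below. It ignores empty lines, accepts `exit` in any letter case, and ends cleanly on end of input. The `chat_loop` helper and its `read` and `write` parameters are hypothetical, added so the loop can be exercised without a live agent. In the notebook, `chat` would be `agent.chat`.

```python
def chat_loop(chat, read=input, write=print):
    """Relay user lines to `chat` until 'exit' or end of input; return the turn count."""
    turns = 0
    while True:
        try:
            text = read("User: ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if text.lower() == "exit":
            break
        if not text:
            continue  # skip empty input instead of sending it to the agent
        write(f"Agent: {chat(text)}")
        turns += 1
    return turns


# Scripted run with a toy "agent" that upper-cases its input.
scripted = iter(["hi", "", "EXIT"])
replies = []
n_turns = chat_loop(
    lambda t: t.upper(),
    read=lambda prompt: next(scripted),
    write=replies.append,
)
print(n_turns, replies)
```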