# Using LangChain and Deep Lake to Explore Amazon's Revenue Growth Pre- and Post-Pandemic

In this tutorial, we load Amazon's quarterly financial reports, embed them with OpenAI's API, store the embeddings in Deep Lake, and then explore the data by asking questions.

Installation:
To install the necessary packages, you can use the following command:

In [None]:
!pip3 install langchain deeplake pypdf openai tiktoken

The next cell imports the classes we need from the langchain library: OpenAI text embeddings, the Deep Lake vector store for storing and retrieving those embeddings, a text splitter, the retrieval-based question-answering chain, the OpenAI chat model wrapper, and a PDF loader that loads and splits documents.

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain import OpenAI 
from langchain.chains import RetrievalQA
from langchain.llms import OpenAIChat
from langchain.document_loaders import PagedPDFSplitter

Next, we need to get API tokens from OpenAI and Activeloop (learn [how to receive an API token](https://docs.activeloop.ai/getting-started/using-activeloop-storage) for Activeloop after registering).

In [None]:
import os
os.environ['OPENAI_API_KEY']='OPEN_AI_KEY_HERE'
os.environ['ACTIVELOOP_TOKEN']='ACTIVELOOP_TOKEN'
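
Hard-coding keys in a notebook makes them easy to leak. As a minimal sketch, assuming you have already exported the tokens in your shell, a small helper can check that a variable is set before the rest of the notebook runs. The helper `require_env` and the variable `EXAMPLE_TOKEN` are hypothetical names used only for illustration; they are not part of LangChain or Deep Lake.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo only: EXAMPLE_TOKEN is a made-up variable with a placeholder value.
os.environ.setdefault("EXAMPLE_TOKEN", "dummy-value")
token = require_env("EXAMPLE_TOKEN")
print("token loaded:", bool(token))
```

In the tutorial itself, you could call `require_env('OPENAI_API_KEY')` and `require_env('ACTIVELOOP_TOKEN')` instead of assigning the keys inline.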

First, let's download financial reports from Amazon and load them into pages:

In [None]:
import requests
import tqdm
from typing import List

# Amazon's financial reports; any list of PDF URLs can be substituted
urls = ['https://s2.q4cdn.com/299287126/files/doc_financials/Q1_2018_-_8-K_Press_Release_FILED.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/Q2_2018_Earnings_Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q318-Amazon-Earnings-Press-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/AMAZON.COM-ANNOUNCES-FOURTH-QUARTER-SALES-UP-20-TO-$72.4-BILLION.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/Q119_Amazon_Earnings_Press_Release_FINAL.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q2-2019-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Q3-2019-Amazon-Financial-Results.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_news/archive/Amazon-Q4-2019-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/Q1/AMZN-Q1-2020-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q2/Q2-2020-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2020/q4/Amazon-Q4-2020-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q1/Amazon-Q1-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q2/AMZN-Q2-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q3/Q3-2021-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2021/q4/business_and_financial_update.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q1/Q1-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q2/Q2-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q3/Q3-2022-Amazon-Earnings-Release.pdf',
        'https://s2.q4cdn.com/299287126/files/doc_financials/2022/q4/Q4-2022-Amazon-Earnings-Release.pdf'
        ]

def load_reports(urls: List[str]) -> list:
    """Download each PDF and return its pages as one flat list of documents."""
    pages = []

    for url in tqdm.tqdm(urls):
        r = requests.get(url)
        r.raise_for_status()  # fail early instead of saving an error page as a PDF
        path = url.split('/')[-1]  # use the last URL segment as the local file name
        with open(path, 'wb') as f:
            f.write(r.content)
        loader = PagedPDFSplitter(path)
        local_pages = loader.load_and_split()
        pages.extend(local_pages)
    return pages

pages = load_reports(urls)

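We now use `CharacterTextSplitter` to split the pages into chunks of up to about 1,000 characters, embed the chunks with OpenAI, and store them in a Deep Lake dataset. Replace `davitbun` in the dataset path with your own Activeloop username or organization.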

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(pages)

embeddings = OpenAIEmbeddings()

db = DeepLake(dataset_path="hub://davitbun/amazon_earnings_6", embedding_function=embeddings, token=os.environ['ACTIVELOOP_TOKEN'])
db.add_documents(texts)
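
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sketch of fixed-window chunking in plain Python. It is not LangChain's implementation: `CharacterTextSplitter` splits on a separator and then merges pieces up to the size limit, whereas the hypothetical `chunk_text` helper below simply slides a window over the characters. The sample sentence is made up for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Illustrative sample text, not taken from the downloaded reports.
sample = "Net sales increased 15% to $127.1 billion in the third quarter."
chunks_no_overlap = chunk_text(sample, chunk_size=20)
chunks_overlap = chunk_text(sample, chunk_size=20, chunk_overlap=5)
print(chunks_no_overlap)
print(chunks_overlap)
```

With `chunk_overlap=0`, as in the cell above, the chunks partition the text. A nonzero overlap repeats the tail of each chunk at the start of the next, which can help when a relevant sentence straddles a chunk boundary.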

Since the dataset is stored on Activeloop, you can load it later without recomputing the embeddings, which saves both time and compute. LangChain provides a wrapper around Deep Lake that lets you use it as a vector store, so we can pass it as the retriever of a question-answering chain:

In [None]:
qa = RetrievalQA.from_chain_type(llm=OpenAIChat(model='gpt-3.5-turbo'), chain_type='stuff', retriever=db.as_retriever())

## How to answer questions using ML?

One practical approach to question answering combines LangChain with Activeloop Deep Lake. You store your documents in Deep Lake's unified, streamable format and connect it to LangChain, which retrieves the chunks most relevant to a question and passes them to a language model such as GPT or BERT. With the retrieved context in the prompt, the model can ground its answer in the source documents rather than rely only on what it learned during training. You may also add techniques like data augmentation, document retrieval tuning, and summarization to improve the system's performance.
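
The retrieve-then-answer flow can be sketched in plain Python. This is only a toy illustration of the idea behind a retriever feeding a `stuff` chain, not LangChain's or Deep Lake's actual code: the three-dimensional vectors, the stored texts, the prompt wording, and the helper names `retrieve` and `build_stuff_prompt` are all assumptions made for the example. In the real pipeline the vectors come from `OpenAIEmbeddings` and the prompt is sent to the chat model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: Sequence[float],
             store: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_stuff_prompt(question: str, contexts: List[str]) -> str:
    """'Stuff' every retrieved chunk into a single prompt for the model."""
    context_block = "\n\n".join(contexts)
    return ("Use the following context to answer the question.\n\n"
            f"{context_block}\n\nQuestion: {question}\nAnswer:")


# Toy store: made-up 3-d "embeddings" standing in for real ones.
store = [
    ("Q3 2021 net sales were $110.8 billion.", [0.9, 0.1, 0.0]),
    ("An unrelated paragraph about something else.", [0.1, 0.9, 0.0]),
    ("Q3 2022 net sales were $127.1 billion.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the question

top = retrieve(query_vec, store, k=2)
prompt = build_stuff_prompt("What is the revenue in 2021 Q3?", top)
print(prompt)
```

The `stuff` strategy is the simplest option: it works as long as the retrieved chunks fit in the model's context window.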

Let's now explore Amazon's revenue change pre- and post-pandemic.

In [None]:
qa.run("What is the revenue in 2021 Q3?")
# The net sales for Q3 2021 was $110.8 billion, with a 15% increase to $127.1 billion in Q3 2022.
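
Because the answer comes from a language model, it is worth sanity-checking the numbers it reports. The short sketch below recomputes the year-over-year change from the two figures quoted in the answer above; `pct_change` is a hypothetical helper written for this check.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    if old == 0:
        raise ValueError("old value must be nonzero")
    return (new - old) / old * 100.0


q3_2021 = 110.8  # net sales in $ billions, as quoted in the answer
q3_2022 = 127.1  # net sales in $ billions, as quoted in the answer

growth = pct_change(q3_2021, q3_2022)
print(f"Q3 2021 -> Q3 2022 growth: {growth:.1f}%")
```

The result is roughly 14.7%, which is consistent with the 15% increase stated in the answer once rounded.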

## Concluding Remarks: Use LangChain & Activeloop for efficient LLM training

In conclusion, LangChain is a robust tool for productizing large language models. It provides a simple workflow for generating output by chaining prompts, utilities, and language models together. Its benefits include streamlined LLM development, better use-case customization, and easy integration with data sources and other Python libraries, including Activeloop Deep Lake. Deep Lake, in turn, enables rapid LangChain prototyping, because you can access your stored data immediately without recomputing the embeddings. If you're working with LLMs and looking to streamline your development process, LangChain, powered by Deep Lake as the data store, is worth exploring.