This notebook is based on the SEC filing titled **"MGT CAPITAL INVESTMENTS, INC."**, filed on **2021-04-15**.

The original data file can be found in this repository under:
`./data/mgt_capital.json`

For the sake of specificity, this notebook focuses on a section common to many SEC filings:

*item 7: Management’s Discussion and Analysis of Financial Condition and Results of Operations*

To simplify the code, item 7 is extracted from the page and stored in 
`sample_data/sec_filing1/sec1.txt`
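
The extraction step itself is not shown in this notebook. A minimal sketch of how it could look follows, assuming the JSON file is a flat object whose sections are keyed by names such as `item_7`. That key name and the helpers `extract_section` and `save_section` are illustrative assumptions, not something confirmed by the data file.

```python
import json
from pathlib import Path


def extract_section(filing: dict, key: str = "item_7") -> str:
    """Return one section of a filing, stripped of surrounding whitespace.

    The key name "item_7" is an assumption about the JSON layout.
    """
    if key not in filing:
        raise KeyError(f"section {key!r} not found; available keys: {sorted(filing)}")
    return filing[key].strip()


def save_section(json_path: str, out_path: str, key: str = "item_7") -> None:
    """Read a filing from json_path and write the chosen section to out_path."""
    filing = json.loads(Path(json_path).read_text(encoding="utf-8"))
    out = Path(out_path)
    # Create the target directory if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(extract_section(filing, key), encoding="utf-8")
```

Under those assumptions, the call would be `save_section("./data/mgt_capital.json", "sample_data/sec_filing1/sec1.txt")`.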

### Load in the libraries to initialize the model

*Make sure you pull the `nomic-embed-text` embedding model first:*

`ollama pull nomic-embed-text`

In [21]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# Use the nomic-embed-text model for embeddings via the Ollama embedding wrapper.
# Use the Ollama LLM wrapper to load the Llama3 model.
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model='llama3', request_timeout=360.0, seed=42)

First, test the setup on a short historical paragraph about the Battle of Yorktown during the American Revolutionary War.

In [22]:
documents = SimpleDirectoryReader("./sample_data/history/").load_data()
index = VectorStoreIndex.from_documents(
    documents,
)

In [23]:
query_engine = index.as_query_engine()
response = query_engine.query("How many soldiers did Washington move to Virginia?")
print(response)

Washington moved his force of almost 8,000 men south to Virginia.


### Moving to the SEC filing regarding MGT Capital

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser

# load the entire contents of the directory
documents = SimpleDirectoryReader("./sample_data/sec_filing1/").load_data()

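# Split the text into chunks of at most 500 characters with no overlap.
# Note: the splitter is configured twice, here on Settings and again through
# `transformations` below; the explicit `transformations` argument is likely
# sufficient on its own.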
Settings.text_splitter = LangchainNodeParser(RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
))

# Initialize the vector store index with the transformed text
index2 = VectorStoreIndex.from_documents(
    documents,
    transformations=[
        LangchainNodeParser(RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=0
    ))]
)
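
To give a feel for what `chunk_size=500` and `chunk_overlap=0` mean, here is a simplified sketch of paragraph-aware chunking that uses only the standard library. It is not LangChain's actual implementation (the real `RecursiveCharacterTextSplitter` falls back through several separators). The function `split_into_chunks` is a hypothetical stand-in that merges paragraphs up to a character budget, with no text shared between chunks.

```python
def split_into_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Greedily pack paragraphs into chunks of at most chunk_size characters.

    Paragraphs longer than chunk_size are cut into fixed-size pieces.
    No text is shared between chunks (the equivalent of an overlap of 0).
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-cut any paragraph that cannot fit in a single chunk.
        pieces = [paragraph[i:i + chunk_size] for i in range(0, len(paragraph), chunk_size)]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= chunk_size:
                # The piece still fits, so keep growing the current chunk.
                current = candidate
            else:
                # Close the current chunk and start a new one with this piece.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

With a budget of 500 characters, two 300-character paragraphs end up in separate chunks, because joining them would exceed the limit.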

### Test the model for accuracy. 

If the answer below states *$450*, it is correct.

In [25]:
query_engine = index2.as_query_engine()
response = query_engine.query("what was the end of the year revenue for 2019?")
print(response)

$450.


The answer to the query below should be *$1,440*.

In [26]:
response = query_engine.query("what was the end of the year revenue for 2020?")
print(response)

According to the provided data, our revenues for the year ended December 31, 2020 increased by $990, or 220%, to $1,440 as compared to $450 for the year ended December 31, 2019.


Answer below should be similar to *6 acres in Lafayette, Georgia*

In [27]:
response = query_engine.query("How many acres of land were purchased in Georgia for the facility?")
print(response)

6 acres of land were purchased in Lafayette, Georgia.


Answer below should be similar to:

Accretion of debt discount of \$5,605, partially offset by a gain on extinguishment of debt of \$3,540, interest income of \$10, a gain on sale of property and equipment of \$599, and a change in the fair value of the liability associated with the termination of the management agreements of \$176

In [28]:
response = query_engine.query("What were the non-operating expenses in 2019?")
print(response)

Non-operating expense for the year ended December 31, 2019 consisted of accretion of debt discount of $5,605, partially offset by a gain on extinguishment of debt of $3,540, interest income of $10, a gain on sale of property and equipment of $599, and a change in the fair value of the liability associated with the termination of the management agreements of $176.
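
The checks above are done by eye. A small helper can make them repeatable. The sketch below assumes it is enough to confirm that the expected figure appears somewhere in the response text. The function names `normalize` and `answer_contains` are hypothetical, and the sample strings are the responses printed above.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def answer_contains(response: str, expected: str) -> bool:
    """Return True if the expected phrase occurs in the response (case-insensitive)."""
    return normalize(expected) in normalize(response)


# Responses copied from the cells above.
revenue_2020 = (
    "According to the provided data, our revenues for the year ended "
    "December 31, 2020 increased by $990, or 220%, to $1,440 as compared "
    "to $450 for the year ended December 31, 2019."
)
land = "6 acres of land were purchased in Lafayette, Georgia."

print(answer_contains(revenue_2020, "$1,440"))      # True
print(answer_contains(land, "6 acres"))             # True
print(answer_contains(land, "Lafayette, Georgia"))  # True
```

In the notebook, the first argument would be `str(response)`. A substring match is deliberately loose: the 2020 response also mentions \$450, so this kind of check catches missing figures, not misattributed ones.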


Persist the generated vector store index to disk so it can be reused later.

In [29]:
index2.storage_context.persist(persist_dir='./llamaindex/')

Reload from disk

In [30]:
from llama_index.core import load_index_from_storage
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

# State the settings for the model (if not already done)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
Settings.llm = Ollama(model='llama3', request_timeout=360.0, seed=42)

# load from disk
re_load = load_index_from_storage(StorageContext.from_defaults(persist_dir='./llamaindex/'))

# Query (answer should be June 2019)
query_engine = re_load.as_query_engine()
response = query_engine.query("What dates did any equity purchase agreements occur on?")
print(response)

June 2019 was the date when an equity purchase agreement was entered into, which allowed for the issuance and sale of shares to an investor from time to time up to a certain amount.
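
Rebuilding the index on every run re-embeds every chunk, so it can be convenient to reload when a persisted copy exists and to build only otherwise. A minimal sketch of that pattern follows. The helper names `has_persisted_index` and `load_or_build` are hypothetical, and treating any non-empty directory as a valid persisted index is a simplifying assumption. The LlamaIndex calls are the same ones used in the cells above.

```python
import os


def has_persisted_index(persist_dir: str) -> bool:
    """Assumption: a non-empty directory means an index was persisted there."""
    return os.path.isdir(persist_dir) and len(os.listdir(persist_dir)) > 0


def load_or_build(persist_dir: str, build_index):
    """Reload the index from persist_dir if present, else build and persist it.

    build_index is a zero-argument callable returning a new index,
    for example a lambda wrapping VectorStoreIndex.from_documents(...).
    """
    # Imported lazily so the directory check above works without LlamaIndex installed.
    from llama_index.core import StorageContext, load_index_from_storage

    if has_persisted_index(persist_dir):
        storage = StorageContext.from_defaults(persist_dir=persist_dir)
        return load_index_from_storage(storage)

    index = build_index()
    index.storage_context.persist(persist_dir=persist_dir)
    return index
```

As in the reload cell above, `Settings.embed_model` and `Settings.llm` still need to be set before the index is loaded or built.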


Add another document to the store. This document can also be found at `./data/nuance_comm.json`. We will be analyzing the same financial condition section of the SEC filing, item 7. The formatting and style of these filings vary, and this particular document contains many more tables, which makes preprocessing and cleaning the data for the LLM difficult.

In [31]:
from llama_index.core import SimpleDirectoryReader, Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index.core.node_parser import LangchainNodeParser

# change the variable name to be more relevant in this context
vector_index = re_load 

new_docs = SimpleDirectoryReader("./sample_data/sec_filing2/").load_data()

splitter = LangchainNodeParser(RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0
))
Settings.text_splitter = splitter

new_nodes = splitter.get_nodes_from_documents(new_docs)
vector_index.insert_nodes(new_nodes)

Query the index, which now includes the new document. The answer should be $1,362.4 million.

In [32]:
query_engine = vector_index.as_query_engine()
response = query_engine.query("What was the revenue for the end of the year in 2021?")
print(response)

$1,362.4 million.


The issue with this index stems from simply pulling the financial condition sections out of the documents. If we query documents where multiple SEC filings report revenues for the same dates, we are unable to tell which revenue belongs to which organization.

When running the query below, the model struggles to provide an accurate answer because both SEC filings have data from 2020. 

In [33]:
query_engine = vector_index.as_query_engine()
response = query_engine.query("What was the revenue for the end of the year in 2020?")
print(response)

According to the provided data, the total revenues for Fiscal Year 2020 were not explicitly stated. However, we can infer that the geographic split for Fiscal Year 2020 was 79% in the United States and 21% internationally, with a hosting and professional services revenue that is not specified. Additionally, maintenance and support revenue for Fiscal Year 2020 was $256.7 million.

To answer your query about the end-of-year revenue for 2020, we can look at the maintenance and support revenue, as it provides a percentage of total revenues. According to the table, maintenance and support revenue for Fiscal Year 2020 was $256.7 million, which is equivalent to 20% of total revenues.

Using this information, we can estimate that the end-of-year revenue for 2020 would be approximately $1.26 billion (assuming a total revenue of $6.3 billion, calculated by dividing $256.7 million by 20%).
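
The estimate in this response does not hold together arithmetically, which is easy to verify. Taking the two figures the model itself quotes (maintenance and support revenue of \$256.7 million, said to be 20% of total revenues), the implied total is about \$1.28 billion, not \$6.3 billion. The snippet below only checks the model's own arithmetic and makes no claim about what either filing actually reports.

```python
# Figures quoted in the model's response above (in millions of USD).
maintenance_revenue = 256.7
share_of_total = 0.20

# If maintenance revenue is 20% of the total, the total is revenue / share.
implied_total = maintenance_revenue / share_of_total
print(f"Implied total revenue: ${implied_total:,.1f} million")
```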


The model will need to be given additional context when answering questions about multiple organizations that all have financial activity with similar topics and dates.
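
One possible way to supply that context is to tag every chunk with the organization it came from and to restrict retrieval to a single organization before ranking. The sketch below illustrates the idea with plain Python data structures only. The `Chunk` class, the `company` metadata key, and the helper names are hypothetical, the company labels are taken from the data file names, and assigning the second excerpt to the second filing is an assumption. A real implementation would attach the metadata to the indexed nodes instead.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of filing text plus metadata describing where it came from."""
    text: str
    metadata: dict = field(default_factory=dict)


def tag_chunks(texts: list[str], company: str) -> list[Chunk]:
    """Attach the source company to every chunk of one filing."""
    return [Chunk(text=t, metadata={"company": company}) for t in texts]


def filter_by_company(chunks: list[Chunk], company: str) -> list[Chunk]:
    """Keep only chunks from the given company, before any similarity ranking."""
    return [c for c in chunks if c.metadata.get("company") == company]


# Short excerpts taken from the responses shown earlier in this notebook.
store = (
    tag_chunks(
        ["revenues for the year ended December 31, 2020 increased by $990, or 220%, to $1,440"],
        "mgt_capital",
    )
    + tag_chunks(
        ["maintenance and support revenue for Fiscal Year 2020 was $256.7 million"],
        "nuance_comm",
    )
)

# Only the chunk tagged with the requested company survives the filter.
for chunk in filter_by_company(store, "mgt_capital"):
    print(chunk.metadata["company"], "->", chunk.text)
```

With the filter applied first, a question about 2020 revenue for one organization can no longer be answered from the other organization's passages.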