# Paper Savior

Auto-explorative research with **Retrieval Augmented Generation (RAG)** and **AutoFollowup**, using the agent framework `lionagi` with `llama_index` as the toolbox

**WARNING**: This notebook uses `gpt-4-turbo-preview` for the workflow and can get ***very expensive***

To reduce cost, you can:
1. Change the workflow model, but you will then need to manage context length yourself
2. Reduce the number of steps and the number of queries

Let us conduct a research session. We will:
- download 20 papers from arxiv as our primary inspiration
- give our researcher 5 abstracts to **read** and propose some ideas to explore
- ask researcher to **look up** relevant terms and information from **sources**
- draft plans and points and present final research proposal 

In [1]:
# %pip install lionagi llama-index llama_hub unstructured pypdf arxiv wikipedia google-search 'unstructured[pdf]'

In [2]:
# if you would like to ignore logging

import logging
logging.getLogger().setLevel(logging.ERROR)

In [3]:
import lionagi

lionagi.__version__

'0.0.312'

In [4]:
import llama_index.core

llama_index.core.__version__

'0.10.25.post1'

In [5]:
topic = "Large Language Model applications in blockchain"
question = "Research on using LLM for blockchain data time series analysis on high frequency decentralized finance data"
num_papers = 20

persist_dir = ".storage/"

## 1. Setup Data QA Tools

#### a. ArXiv Index

We will download papers on our research topic from **ArXiv**, a popular preprint platform for research papers, and build them into a searchable index with `LlamaIndex`.

- [ArXiv Official Website](https://arxiv.org)

- [LlamaIndex Official Website](https://www.llamaindex.ai)

- [ArXivReader on LlamaHub](https://llamahub.ai/l/papers-arxiv?from=all) 

**The llama-index ArXiv reader is currently having some issues; please download the papers directly from arxiv.org.**
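If you would rather script the manual download, one option is the `arxiv` package from the install cell. This is a minimal sketch and not part of the original workflow: the helper names `safe_filename` and `download_papers` are hypothetical, the `./papers` directory is chosen to match the `SimpleDirectoryReader('./papers', ...)` cell, and the calls assume the 2.x client interface of the `arxiv` package.

```python
import re
from pathlib import Path


def safe_filename(title: str, max_len: int = 80) -> str:
    """Turn a paper title into a filesystem-friendly PDF file name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"{slug[:max_len]}.pdf"


def download_papers(topic: str, num_papers: int, out_dir: str = "./papers") -> list[str]:
    """Download the top `num_papers` arXiv results for `topic` into `out_dir`."""
    import arxiv  # imported lazily so safe_filename works without the package

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    search = arxiv.Search(
        query=topic,
        max_results=num_papers,
        sort_by=arxiv.SortCriterion.Relevance,
    )
    paths = []
    for result in arxiv.Client().results(search):
        name = safe_filename(result.title)
        result.download_pdf(dirpath=out_dir, filename=name)
        paths.append(str(Path(out_dir) / name))
    return paths
```

Calling `download_papers(topic, num_papers)` should then fill `./papers` with PDFs that the commented-out indexing cell can read.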

In [6]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding, OpenAIEmbeddingModelType

Settings.llm = OpenAI(model="gpt-4-turbo-preview")
Settings.embed_model = OpenAIEmbedding(model=OpenAIEmbeddingModelType.TEXT_EMBED_3_LARGE)

In [7]:
# from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# from llama_index.core.text_splitter import SentenceSplitter

# reader = SimpleDirectoryReader('./papers', required_exts=['.pdf'])
# documents = reader.load_data()

# splitter = SentenceSplitter(chunk_size=2048, chunk_overlap=50)
# nodes = splitter.get_nodes_from_documents(documents)

# index = VectorStoreIndex(nodes)
# index.storage_context.persist(persist_dir="./arxiv_index")

If you have already built and persisted an index by running the code above, you can load the `index` object from storage instead of rebuilding it.

To do so, use the following; you only need to find the `index_id` in the index_store file.

- You **still need** some paper abstracts or other material as the starting context for the `Session`
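To locate the `index_id`, one option is to read the persisted index store directly. This is a minimal sketch that assumes the default on-disk layout, in which `persist()` writes an `index_store.json` whose top-level collections map index IDs to serialized index structs. The helper name `list_index_ids` is hypothetical.

```python
import json
from pathlib import Path


def list_index_ids(persist_dir: str, filename: str = "index_store.json") -> list[str]:
    """Return the index IDs found in a persisted index store file."""
    data = json.loads((Path(persist_dir) / filename).read_text())
    ids = []
    # Assumed layout: {"<collection>": {"<index_id>": {...serialized struct...}}}
    for collection in data.values():
        if isinstance(collection, dict):
            ids.extend(collection.keys())
    return ids
```

For example, `list_index_ids("./arxiv_index")` should return the ID to pass as `index_id` below, provided the layout assumption holds for your installed version.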

In [8]:
from llama_index.core import load_index_from_storage, StorageContext

index_id = "dff2e0fe-2b51-4043-ba7c-498029c22bbd"

storage_context = StorageContext.from_defaults(persist_dir="./arxiv_index")
index = load_index_from_storage(storage_context, index_id=index_id)

In [9]:
from llama_index.core.postprocessor import LLMRerank

reranker = LLMRerank(choice_batch_size=10, top_n=5)
arxiv_engine = index.as_query_engine(node_postprocessors=[reranker], similarity_top_k=3, response_mode= "tree_summarize")

#### b. Textbooks

We use a couple of PDF textbooks as references and build a Knowledge Graph Index from them.

- [Dive into Deep Learning](https://d2l.ai)

- [Blockchain for Dummies - IBM edition](http://gunkelweb.com/coms465/texts/ibm_blockchain.pdf)

- [KnowledgeGraphIndex](https://docs.llamaindex.ai/en/stable/examples/index_structs/knowledge_graph/KnowledgeGraphDemo.html) 


In [10]:
from llama_index.core import KnowledgeGraphIndex
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.core.storage.storage_context import StorageContext

storage_context = StorageContext.from_defaults(graph_store=SimpleGraphStore())

In [11]:
# from llama_index.core import SimpleDirectoryReader, Document

# reader = SimpleDirectoryReader(input_files=['d2l-en.pdf'])
# docs1 = reader.load_data()
# documents1 = [Document(text="".join([x.text for x in docs1]))]

# reader = SimpleDirectoryReader(input_files=['ibm_blockchain.pdf'])
# docs2 = reader.load_data()
# documents2 = [Document(text="".join([x.text for x in docs2]))]

##### Build KG

This will take quite a while. I suggest leaving the notebook running in the background for 0.5-1 hours, or switching to a shorter source.

In [12]:
# d2l_index = KnowledgeGraphIndex.from_documents(
#     documents1,
#     max_triplets_per_chunk=2,
#     storage_context=storage_context,
#     include_embeddings=True,
# )

# d2l_index.storage_context.persist(f'{persist_dir}/d2l/')

In [13]:
# bc_index = KnowledgeGraphIndex.from_documents(
#     documents2,
#     max_triplets_per_chunk=2,
#     storage_context=storage_context,
#     include_embeddings=True,
# )

# bc_index.storage_context.persist(f'{persist_dir}/bc_ibm/')

##### From Storage

In [14]:
from llama_index.core import KnowledgeGraphIndex, load_index_from_storage
from llama_index.core.graph_stores import SimpleGraphStore
from llama_index.core.storage.storage_context import StorageContext

storage_context = StorageContext.from_defaults(graph_store=SimpleGraphStore(), persist_dir=f'{persist_dir}/d2l/')
index_id = '7c52b76a-bd85-4aa8-a167-ab4f468b7dc9' 
d2l_index = load_index_from_storage(storage_context=storage_context, index_id=index_id)

In [15]:
storage_context = StorageContext.from_defaults(graph_store=SimpleGraphStore(), persist_dir=f'{persist_dir}/bc_ibm/')
index_id = '4850f9e9-158c-44d3-a3f0-1bfec72dfc6e'
bc_index = load_index_from_storage(storage_context=storage_context, index_id=index_id)

In [16]:
from llama_index.core.postprocessor import LLMRerank

reranker = LLMRerank(choice_batch_size=10, top_n=5)
d2l_engine = d2l_index.as_query_engine(node_postprocessors=[reranker], similarity_top_k=3, response_mode= "tree_summarize")
bc_engine = bc_index.as_query_engine(node_postprocessors=[reranker], similarity_top_k=3, response_mode= "tree_summarize")

#### c. Google and Wikipedia

We will use Google and Wikipedia to clarify certain domain-specific terms, via a LlamaIndex `OpenAI agent`.

To use the Google search engine, you have to register with Google and get an `API_KEY` and a `google_engine` (the search engine ID).

- [Instruction on Getting Google Search API and Engine](https://developers.google.com/custom-search/v1/overview)
- [Google Tools on LlamaHub](https://llamahub.ai/l/tools-google_search?from=all)
- [Wiki Tools on LlamaHub](https://llamahub.ai/l/tools-wikipedia?from=all)

Once you have the API key and the search engine ID, save them to a `.env` file:

GOOGLE_API_KEY='...'

GOOGLE_CSE_ID='...'
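Before creating the agents, it may help to confirm that both variables are visible to the notebook. This is a minimal sketch: it loads `.env` with `python-dotenv` if that package happens to be installed (an assumption, since it is not in the install cell) and then reports missing keys. The helper `missing_env_keys` is a hypothetical name.

```python
import os

REQUIRED_KEYS = ("GOOGLE_API_KEY", "GOOGLE_CSE_ID")

try:
    # Optional: python-dotenv copies entries from .env into os.environ
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    pass  # fall back to whatever is already in the environment


def missing_env_keys(keys=REQUIRED_KEYS, environ=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [key for key in keys if not environ.get(key)]
```

An empty list from `missing_env_keys()` means the Google agent has the credentials it needs.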

In [17]:
# %pip install llama-index-tools-google
# %pip install llama-index-tools-wikipedia

In [18]:
import os

google_key_scheme = 'GOOGLE_API_KEY'
google_engine_scheme = 'GOOGLE_CSE_ID'

In [19]:
llm = OpenAI(model='gpt-4-turbo-preview', temperature=0.1)

# we will create agents for google search and wikipedia querying
def create_google_agent(google_api_key=None, google_engine=None, verbose=False):
    from llama_index.agent.openai import OpenAIAgent
    from llama_index.core.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec
    from llama_index.tools.google import GoogleSearchToolSpec

    # Resolve credentials at call time, so variables loaded after this cell runs are picked up
    api_key = google_api_key or os.getenv(google_key_scheme)
    search_engine = google_engine or os.getenv(google_engine_scheme)
    google_spec = GoogleSearchToolSpec(key=api_key, engine=search_engine)

    # Wrap the google search tool as it returns large payloads
    tools = LoadAndSearchToolSpec.from_defaults(
        google_spec.to_tool_list()[0],
    ).to_tool_list()

    agent = OpenAIAgent.from_tools(tools, verbose=verbose, llm=llm)
    return agent

def create_wiki_agent(verbose=False):
    from llama_index.tools.wikipedia import WikipediaToolSpec
    from llama_index.agent.openai import OpenAIAgent

    tool_spec = WikipediaToolSpec()

    agent = OpenAIAgent.from_tools(tool_spec.to_tool_list(), verbose=verbose, llm=llm)
    return agent

# 2. Tools

Now that we have set up the needed components, let us turn them into tools for the LLM to use.


1. Write a function definition for each tool, with a Google-style docstring
2. Keep track of the responses for source checking
3. Make these functions into `lionagi.Tool` objects

We will define the functions as asynchronous functions so that we can run queries concurrently.
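To see why asynchronous functions help, here is a minimal sketch of concurrent execution with `asyncio.gather`. The coroutine `fake_query` is a hypothetical stand-in that sleeps instead of calling a real index or agent.

```python
import asyncio


async def fake_query(name: str, delay: float) -> str:
    """Stand-in for a tool such as query_arxiv; sleeps instead of querying."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_all() -> list[str]:
    # gather starts all coroutines together and returns results in call order
    return await asyncio.gather(
        fake_query("arxiv", 0.02),
        fake_query("d2l", 0.01),
        fake_query("bc", 0.01),
    )


results = asyncio.run(run_all())
```

Inside a notebook, where an event loop is already running, use `results = await run_all()` instead of `asyncio.run`. The total wait is roughly the longest single delay, not the sum of all delays.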

In [20]:
responses_arxiv = []
responses_d2l = []
responses_bc = []

async def query_arxiv(query: str):
    """
    Query a vector index built with papers from arXiv. It takes a
    natural language query and gives a natural language response.

    Args:
        query (str): The natural language query to get an answer from the index

    Returns:
        str: The query response from index
    """
    response = await arxiv_engine.aquery(query)
    responses_arxiv.append(response)
    
    return str(response.response)

async def query_d2l(query: str):
    """
    Query an index built from machine learning textbooks. It takes a
    natural language query and gives a natural language response.

    Args:
        query (str): The natural language query to get an answer from the index

    Returns:
        str: The query response from index
    """
    response = await d2l_engine.aquery(query)
    responses_d2l.append(response)
    
    return str(response.response)
        
async def query_bc(query: str):
    """
    Query an index built from blockchain textbooks. It takes a
    natural language query and gives a natural language response.

    Args:
        query (str): The natural language query to get an answer from the index

    Returns:
        str: The query response from index
    """
    response = await bc_engine.aquery(query)
    responses_bc.append(response)
    
    return str(response.response)

In [21]:
responses_google = []
responses_wiki = []

# ask gpt to write you google format docstring
async def query_google(query: str):
    """
    Search Google and retrieve a natural language answer to a given query.

    Args:
        query (str): The search query to find an answer for.

    Returns:
        str: A natural language answer obtained from Google search results.

    Raises:
        Exception: If there is an issue with making the request or parsing the response.
    """
    google_agent = create_google_agent()
    response = await google_agent.achat(query)
    responses_google.append(response)
    return str(response.response)

async def query_wiki(query: str):
    """
    Search Wikipedia and retrieve a natural language answer to a given query.

    Args:
        query (str): The search query to find an answer for.

    Returns:
        str: A natural language answer obtained from Wikipedia.

    Raises:
        Exception: If there is an issue with making the request or parsing the response.
    """
    wiki_agent = create_wiki_agent()
    response = await wiki_agent.achat(query)
    responses_wiki.append(response)
    return str(response.response)
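The `responses_*` lists collected above can later be used for source checking. This is a minimal sketch. It assumes each stored query response exposes a `source_nodes` list whose items carry a `score` and a `node` with `metadata` and `get_content()`, and that the metadata contains a `file_name` key. Agent chat responses may expose sources differently, so the hypothetical helper `summarize_sources` skips anything without `source_nodes`.

```python
def summarize_sources(responses, max_chars: int = 80) -> list[dict]:
    """Flatten the source nodes of stored responses into simple rows."""
    rows = []
    for i, response in enumerate(responses):
        # Responses without source_nodes contribute no rows
        for item in getattr(response, "source_nodes", None) or []:
            node = item.node
            rows.append(
                {
                    "response": i,
                    "score": item.score,
                    "file": node.metadata.get("file_name"),
                    "snippet": node.get_content()[:max_chars],
                }
            )
    return rows
```

For example, `summarize_sources(responses_arxiv)` would list which paper chunks backed each answer once the session has run.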

# 3. Workflow and Outputs

Make the functions into LionAGI `Tool` objects.

The `func_to_tool` function converts a function with a Google or reST style docstring into a `Tool` object, which can be used during a `Session`.
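To illustrate the idea, here is a stdlib-only sketch of how a Google-style docstring can be turned into an OpenAI-style function schema. This is not lionagi's implementation: the helper `docstring_to_schema`, the type mapping, and the exact schema layout are assumptions for illustration, and multi-line argument descriptions are not handled.

```python
import inspect
import re

_TYPES = {"str": "string", "int": "integer", "float": "number", "bool": "boolean"}


def docstring_to_schema(func) -> dict:
    """Build a function-calling schema from a Google-style docstring."""
    doc = inspect.getdoc(func) or ""
    # Everything before the first section header is the description
    head = re.split(r"^(?:Args|Returns|Raises):[ \t]*$", doc, maxsplit=1, flags=re.M)[0]
    description = " ".join(head.split())

    properties = {}
    # The Args section runs until the next unindented line or the end
    section = re.search(r"^Args:[ \t]*\n(.*?)(?=^\S|\Z)", doc, flags=re.M | re.S)
    if section:
        # Each argument line looks like "name (type): description"
        pattern = r"^[ \t]+(\w+) \(([^)]+)\): (.+)$"
        for name, typ, desc in re.findall(pattern, section.group(1), flags=re.M):
            properties[name] = {
                "type": _TYPES.get(typ, "string"),
                "description": desc.strip(),
            }

    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(properties),
            },
        },
    }
```

This is why the docstrings above matter: the description and the `Args` entries are what the model sees when it decides which tool to call and with what arguments.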

In [22]:
# prompts
system = {
    "persona": "you are a helpful assistant, perform as a researcher",
    "notice": f"your research topic is on {topic}, and the question is {question}",
    "requirements": "Think step by step",
    "responsibilities": "Researching a specific topic and question, explore and provide specific findings and insights",
    "deliverable": "technical only, ~ 1000-1500 words, briefly explain the core concepts and rationale behind, refrain from being vague or general, target audience is highly sophisticated and can judge your work. "
}

instruct = """
read a few paper abstracts, carefully propose a few unique, creative, practical and achievable solutions to
solve the research question on the specific topics, notice you can use the query tools in parallel, but your questions need to be all different and specific to the tools. Your final deliverable needs to be highly specific and technical. you have to
use every tool at least once, but as extensively as you can.
"""

In [23]:
context = """
LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters
Ching Chang , Wei-Yao Wang , Wen-Chih Peng and Tien-Fu Chen
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
blacksnail789521.cs10@nycu.edu.tw, sf1638.cs05@nctu.edu.tw, {wcpeng, tfchen}@cs.nycu.edu.tw
Abstract
Multivariate time-series forecasting is vital in vari-
ous domains, e.g., economic planning and weather
prediction. Deep train-from-scratch models have
exhibited effective performance yet require large
amounts of data, which limits real-world applica-
bility. Recently, researchers have leveraged the rep-
resentation learning transferability of pre-trained
Large Language Models (LLMs) to handle limited
non-linguistic datasets effectively. However, incor-
porating LLMs with time-series data presents chal-
lenges of limited adaptation due to different com-
positions between time-series and linguistic data,
and the inability to process multi-scale temporal in-
formation. To tackle these challenges, we propose
LLM4TS, a framework for time-series forecasting
with pre-trained LLMs. LLM4TS consists of a two-
stage fine-tuning strategy: the time-series align-
ment stage to align LLMs with the nuances of time-
series data, and the forecasting fine-tuning stage for
downstream time-series forecasting tasks. Further-
more, our framework features a novel two-level ag-
gregation method that integrates multi-scale tempo-
ral data within pre-trained LLMs, enhancing their
ability to interpret time-specific information. In ex-
periments across 7 time-series forecasting datasets,
LLM4TS is superior to existing state-of-the-art
methods compared with trained-from-scratch mod-
els in full-shot scenarios, and also achieves an av-
erage improvement of 6.84% in MSE in few-shot
scenarios. In addition, evaluations compared with
different self-supervised learning approaches high-
light LLM4TS’s effectiveness with representation
learning in forecasting tasks.
1 Introduction
Forecasting is a vital task in multivariate time-series analy-
sis, not only for its ability to operate without manual label-
ing but also for its importance in practical applications such
as economic planning [Lai et al., 2018] and weather predic-
tion [Zhou et al., 2021]. Recently, numerous deep train-from-
scratch models have been developed for time-series forecast-
ing [Nie et al., 2023], although some lean towards unsuper-
vised representation learning [Chang et al., 2023] and transfer
learning [Zhang et al., 2022; Zhou et al., 2023]. Generally,
these approaches aim to employ adept representation learn-
ers: first extracting rich representations from the time-series
data and then using these representations for forecasting.
Achieving an adept representation learner requires suffi-
cient training data [Hoffmann et al., 2022], yet in real-world
scenarios, there is often a lack of large-scale time-series
datasets. For instance, in industrial manufacturing, the sen-
sor data for different products cannot be combined for further
analysis, leading to limited data for each product type [Yeh
et al., 2019]. Recent research has pivoted towards pre-trained
LLMs in Natural Language Processing (NLP) [Radford et al.,
2019; Touvron et al., 2023] , exploiting their robust represen-
tation learning and few-shot learning capabilities. Moreover,
these LLMs can adapt to non-linguistic datasets (e.g., images
[Lu et al., 2021], audio [Ghosal et al., 2023], tabular data
[Hegselmann et al., 2023], and time-series data [Zhou et al.,
2023]) by fine-tuning with only a few parameters and limited
data. While LLMs are renowned for their exceptional trans-
fer learning capabilities across various fields, the domain-
specific nuances of time-series data introduce two challenges
in leveraging these models for time-series forecasting.
The first challenge of employing LLMs for time-series
forecasting is their limited adaptation to the unique charac-
teristics of time-series data due to LLMs’ initial pre-training
focus on the linguistic corpus. While LLMs have been both
practically and theoretically proven [Zhou et al., 2023] to be
effective in transfer learning across various modalities thanks
to their data-independent self-attention mechanism, their pri-
mary focus on general text during pre-training causes a short-
fall in recognizing key time-series patterns and nuances cru-
cial for accurate forecasting. This limitation is evident in ar-
eas such as meteorology and electricity forecasting [Zhou et
al., 2021], where failing to account for weather patterns and
energy consumption trends leads to inaccurate predictions.
The second challenge lies in the limited capacity to process
multi-scale temporal information. While LLMs are adept at
understanding the sequence and context of words, they strug-
gle to understand temporal information due to the lack of uti-
lizing multi-scale time-related data such as time units (e.g.,
seconds, minutes, hours, etc.) and specific dates (e.g., holi-
days, significant events). This temporal information is vital
in time-series analysis for identifying and predicting patterns
arXiv:2308.08469v5  [cs.LG]  18 Jan 2024
[Wu et al., 2021]; for instance, in energy management, it is
used to address consumption spikes during daytime and in
summer/winter, in contrast to the lower demand during the
night and in milder seasons [Zhou et al., 2021]. This under-
scores the importance of models adept at interpreting multi-
scale temporal patterns (hourly to seasonal) for precise energy
demand forecasting. However, most LLMs (e.g., [Radford et
al., 2019; Touvron et al., 2023]) built on top of the Trans-
former architecture do not naturally incorporate multi-scale
temporal information, leading to models that fail to capture
crucial variations across different time scales.
To address the above issues, we propose LLM4TS,
a framework for time-series forecasting with pre-trained
LLMs. Regarding the first challenge, our framework intro-
duces a two-stage fine-tuning approach: the time-series align-
ment stage and the forecasting fine-tuning stage. The first
stage focuses on aligning the LLMs with the characteristics of
time-series data by utilizing the autoregressive objective, en-
abling the fine-tuned LLMs to adapt to time-series represen-
tations. The second stage is incorporated to learn correspond-
ing time-series forecasting tasks. In this manner, our model
supports effective performance in full- and few-shot scenar-
ios. Notably, throughout both stages, most parameters in the
pre-trained LLMs are frozen, thus preserving the model’s in-
herent representation learning capability. To overcome the
limitation of LLMs in integrating multi-scale temporal infor-
mation, we introduce a novel two-level aggregation strategy.
This approach embeds multi-scale temporal information into
the patched time-series data, ensuring that each patch not only
represents the series values but also encapsulates the critical
time-specific context. Consequently, LLM4TS emerges as
a data-efficient time-series forecaster, demonstrating robust
few-shot performance across various datasets (Figure 1).
In summary, the paper’s main contributions are as follows:
• Aligning LLMs Toward Time-Series Data: To the
best of our knowledge, LLM4TS is the first method that
aligns pre-trained Large Language Models with time-
series characteristics, effectively utilizing existing rep-
resentation learning and few-shot learning capabilities.
• Multi-Scale Temporal Information in LLMs: To
adapt to time-specific information, a two-level aggrega-
tion method is proposed to integrate multi-scale tempo-
ral data within pre-trained LLMs.
• Robust Performance in Forecasting: LLM4TS ex-
cels in 7 real-world time-series forecasting benchmarks,
outperforming state-of-the-art methods, including those
trained from scratch. It also demonstrates strong few-
shot capabilities, particularly with only 5% of data,
where it surpasses the best baseline that uses 10% of
data. This efficiency makes LLM4TS highly relevant for
practical, real-world forecasting applications
"""

### a. First attempt

In [27]:
from lionagi import Session

# set up a researcher session
researcher = Session(system, tools=[query_arxiv, query_bc, query_d2l, query_google, query_wiki])

# invoke the task for researcher
out = await researcher.followup(instruct, context=context, max_followup=3, temperature=0.7, auto=True)

In [28]:
from IPython.display import Markdown
Markdown(out)

Leveraging Large Language Models (LLMs) for blockchain data time series analysis, particularly in high-frequency decentralized finance (DeFi) data, presents a novel approach to addressing the unique challenges and exploiting the opportunities within this rapidly evolving sector. This technical exploration synthesizes insights from recent research, blockchain technology fundamentals, machine learning techniques, and current advancements in the field to propose practical, achievable solutions for enhancing time series forecasting in DeFi.

### 1. **Enhanced Forecasting and Trend Analysis**

#### Rationale:
LLMs, with their superior ability to understand complex patterns and generate natural language, can be trained on historical DeFi transaction data to forecast future market trends and asset prices. This capability is crucial in the volatile and fast-paced DeFi market, where accurate predictions can significantly impact investment decisions and risk management strategies.

#### Application:
Develop a fine-tuned LLM specifically for interpreting high-frequency DeFi data. This model should be trained on a comprehensive dataset encompassing transaction histories, token price movements, liquidity pool dynamics, and smart contract interactions, among other relevant metrics. By treating future values in the time series as "text" to be predicted, the LLM can generate forecasts with a higher degree of accuracy.

### 2. **Real-time Anomaly Detection and Fraud Prevention**

#### Rationale:
Anomaly detection in high-frequency trading environments is critical for identifying market manipulation, fraud, or system errors in real-time. LLMs' capability to monitor and analyze transaction data for unusual patterns can enhance the security and integrity of DeFi platforms.

#### Application:
Implement an LLM-based monitoring system that continuously scans the blockchain for anomalies in transaction patterns, smart contract executions, and liquidity pool behaviors. Utilizing NLP techniques, this system can also analyze social media and news articles for potential market manipulation signals, providing a comprehensive anomaly detection mechanism.

### 3. **Sentiment Analysis for Market Dynamics Understanding**

#### Rationale:
Market sentiment, driven by news articles, social media, and community discussions, plays a significant role in influencing DeFi asset prices and trends. LLMs can analyze textual data to gauge market sentiment towards specific DeFi projects or the overall market, offering valuable insights into market dynamics.

#### Application:
Create a sentiment analysis tool that leverages LLMs to process and interpret vast amounts of textual data related to DeFi. This tool should analyze news headlines, social media posts, forum discussions, and other relevant sources to provide real-time sentiment scores. These scores can then be integrated into trading algorithms to inform investment strategies.

### 4. **Automating Regulatory Compliance and Reporting**

#### Rationale:
Ensuring compliance with evolving regulatory standards is a significant challenge for DeFi platforms. LLMs can automate the compliance process by analyzing transaction data against regulatory requirements, generating reports, and alerting for suspicious activities.

#### Application:
Develop a compliance automation tool that uses LLMs to interpret transaction data in the context of current regulatory frameworks. This tool should be capable of generating compliance reports, identifying potential non-compliance issues, and flagging transactions that require further investigation.

### 5. **Multi-Scale Temporal Information Integration**

#### Rationale:
Incorporating multi-scale temporal information is vital for accurately predicting market behaviors that depend on various time frames, from minutes to seasons. Most existing LLMs struggle to integrate such temporal information effectively.

#### Application:
Introduce a novel two-level aggregation strategy within the LLM framework to embed multi-scale temporal information into the analysis process. This involves creating temporal embeddings that represent different time scales and integrating these embeddings with transaction data, allowing the LLM to better understand and forecast based on time-specific information.

### Conclusion

The application of Large Language Models in analyzing high-frequency decentralized finance data offers promising avenues for enhancing forecasting accuracy, detecting anomalies, understanding market sentiment, ensuring regulatory compliance, and integrating multi-scale temporal information. By addressing the unique challenges of DeFi data analysis with these innovative solutions, stakeholders can unlock significant value, driving forward the evolution of decentralized financial markets.

In [29]:
researcher.messages

| | node_id | timestamp | role | sender | recipient | content |
|---|---|---|---|---|---|---|
| 0 | 3f85d42fcbe567d6bcd39768c86a5c50 | 2024_03_27T22_55_36_369373+00_00 | system | system | assistant | `{"system_info": {"persona": "you are a helpful...` |
| 1 | 26c4659dd2a8b97137d0121068ef6bc5 | 2024_03_27T22_55_36_370134+00_00 | user | user | main | `{"instruction": {"NOTICE": "\n In the curre...` |
| 2 | 72f4240550e3630c7c9b1bc9bd67addb | 2024_03_27T22_55_41_908084+00_00 | assistant | action_request | action | `{"action_request": [{"action": "action_query_a...` |
| 3 | 0db110123358c7eb67f1ba3d2d5e1b7f | 2024_03_27T22_56_33_181238+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_arxiv"...` |
| 4 | 448c3883920226c1f515578b23175f76 | 2024_03_27T22_56_33_182386+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_bc", "...` |
| 5 | 27b44d79932bb22aecffd294a4542329 | 2024_03_27T22_56_33_182878+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_d2l", ...` |
| 6 | 4a1b547dee925a2c8a99217beafd9f7c | 2024_03_27T22_56_33_183354+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_google...` |
| 7 | d274218cff51754e6c67cfd612ffd3b8 | 2024_03_27T22_56_33_183807+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_wiki",...` |
| 8 | 338097ed5e2c413a2c6993fd0ad08adb | 2024_03_27T22_56_33_184357+00_00 | user | user | main | `{"instruction": "\n In the current task, yo...` |
| 9 | a3b901fc55534386d263830bb59ace85 | 2024_03_27T22_57_05_794459+00_00 | assistant | main | user | `{"response": "Leveraging Large Language Models...` |


It seems the assistant gave its answer during the ReAct step instead of presenting the outcome at the end. Let us check what it did.
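One way to check is to count which tools were actually called. This is a minimal sketch. It assumes the message contents are JSON strings shaped like the `action_response` rows in the output above, and that the rows are available as plain dicts (for example via `researcher.messages.to_dict("records")`, if `messages` is a pandas DataFrame). The helper `called_tools` is a hypothetical name.

```python
import json
from collections import Counter


def called_tools(records) -> Counter:
    """Count tool calls by function name from message records."""
    counts = Counter()
    for row in records:
        if row.get("sender") != "action_response":
            continue
        try:
            payload = json.loads(row["content"])
        except (TypeError, json.JSONDecodeError):
            continue  # skip rows whose content is not a JSON string
        if not isinstance(payload, dict):
            continue
        function = payload.get("action_response", {}).get("function")
        if function:
            counts[function] += 1
    return counts
```

If every tool appears exactly once, the researcher did a single round of queries before answering, which matches the short message log above.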

### b. Improve work

In [30]:
out1 = await researcher.chat(
    """
    you asked a lot of good questions and got plenty answers, please integrate your 
    conversation, be a lot more technical, you will be rewarded with 500 dollars for 
    great work, and punished for subpar work, take a deep breath, you can do it
    """
)

In [31]:
Markdown(out1)

Leveraging Large Language Models (LLMs) in the analysis of high-frequency blockchain data, particularly within the decentralized finance (DeFi) domain, presents a multifaceted opportunity to enhance predictive analytics, anomaly detection, sentiment analysis, and regulatory compliance. This technical exposition delves into the practical application of LLMs for time series forecasting of DeFi data, drawing upon insights from recent research, blockchain fundamentals, and advancements in machine learning techniques. The proposed solutions aim to address the unique challenges inherent in the DeFi ecosystem, such as high volatility, complex smart contract interactions, and the rapid pace of transactions.

### Enhanced Forecasting and Trend Analysis with LLMs

#### Technical Framework:

1. **Data Preprocessing and Representation**: Convert high-frequency DeFi transaction data into a structured format amenable to sequence modeling. This involves encoding numerical features (e.g., transaction amounts, token prices) and categorical features (e.g., token types, transaction types) into embeddings that capture the underlying temporal dynamics.

2. **Model Architecture**: Utilize a Transformer-based LLM, fine-tuned for the specific nuances of DeFi data. The model should incorporate attention mechanisms to weigh the importance of different parts of the input data sequence, allowing it to capture long-range dependencies and complex patterns in the data.

3. **Training Strategy**: Employ a combination of supervised learning for known historical trends and unsupervised or self-supervised learning to uncover latent patterns within the data. Techniques like contrastive learning can be particularly useful in self-supervised settings to enhance the model's ability to differentiate between normal and anomalous patterns.

4. **Forecasting Approach**: Frame the forecasting task as a sequence generation problem, where the model predicts future values of the time series based on past observations. This can be facilitated by using an autoregressive model structure where predictions for time \(t\) are conditioned on observations up to time \(t-1\).

### Real-time Anomaly Detection and Fraud Prevention

#### Technical Framework:

1. **Anomaly Detection Module**: Integrate an anomaly detection component within the LLM framework that specifically targets irregularities in transaction patterns, smart contract executions, and liquidity pool behaviors. This module should leverage the model's ability to understand the normal sequence of transactions and flag deviations in real-time.

2. **Adaptive Thresholding**: Implement dynamic thresholding mechanisms based on statistical modeling of transaction data distributions. These thresholds can adapt to changing market conditions and transaction volumes, improving the sensitivity and specificity of anomaly detection.

3. **Integration with External Data Sources**: Augment the LLM's capabilities by incorporating external data sources, such as social media feeds, news headlines, and regulatory alerts. This can enhance the model's ability to detect coordinated fraud attempts or market manipulation schemes that may not be evident from transaction data alone.

### Sentiment Analysis for Market Dynamics Understanding

#### Technical Framework:

1. **NLP-Based Sentiment Analysis**: Leverage the LLM's natural language processing capabilities to analyze textual data related to DeFi projects and the broader market. This involves training the model on a dataset of financial news articles, social media posts, and forum discussions, labeled with sentiment scores.

2. **Contextual Sentiment Scoring**: Develop a scoring system that accounts for the context and relevance of the textual data to specific DeFi projects or market trends. This requires the model to not only assess the sentiment of the text but also its significance in relation to current market dynamics.

3. **Integration with Time Series Analysis**: Combine sentiment analysis outputs with traditional time series forecasting models to provide a more holistic view of market trends. Sentiment scores can serve as additional features in the forecasting model, potentially improving prediction accuracy.

### Automating Regulatory Compliance and Reporting

#### Technical Framework:

1. **Regulatory Rule Encoding**: Encode regulatory requirements and compliance rules as structured data that can be interpreted by the LLM. This includes rules for anti-money laundering (AML), know your customer (KYC) procedures, and transaction reporting obligations.

2. **Compliance Monitoring Module**: Design a module within the LLM framework that continuously monitors transaction data for compliance with encoded regulatory rules. This module should be capable of generating real-time alerts for potential non-compliance incidents and automating the generation of compliance reports.

3. **Smart Contract Analysis**: Utilize the LLM to analyze and verify the security and compliance of smart contracts underpinning DeFi transactions. This involves interpreting the contract's code and logic in the context of regulatory requirements and identifying potential vulnerabilities or compliance issues.
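
One way to picture "rules as structured data" from points 1 and 2 is a list of rule records, each carrying a predicate over a transaction. The sketch below is illustrative only: the `Rule` class, the rule IDs, the transaction fields, and the 10,000 reporting threshold are assumptions, not actual regulatory values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    violated: Callable[[dict], bool]  # returns True when the rule is broken


def _unreported_large_tx(tx: dict) -> bool:
    # Hypothetical threshold; real reporting limits vary by jurisdiction.
    return tx["amount"] >= 10_000 and not tx.get("reported", False)


def _missing_kyc(tx: dict) -> bool:
    return not tx.get("sender_kyc_verified", False)


RULES = [
    Rule("REPORT_LARGE_TX", "Large transactions must be reported", _unreported_large_tx),
    Rule("KYC_REQUIRED", "Sender must have passed KYC", _missing_kyc),
]


def check_compliance(tx: dict, rules=RULES) -> list:
    """Return the IDs of all rules the transaction violates."""
    return [rule.rule_id for rule in rules if rule.violated(tx)]


# Hypothetical transactions.
ok_tx = {"amount": 500, "sender_kyc_verified": True}
bad_tx = {"amount": 25_000, "sender_kyc_verified": False}

print(check_compliance(ok_tx))
print(check_compliance(bad_tx))
```

Keeping the rules in a plain list means a monitoring loop can run every rule against every incoming transaction and emit an alert for each returned ID.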

### Multi-Scale Temporal Information Integration

#### Technical Framework:

1. **Temporal Embeddings**: Develop temporal embeddings that capture multi-scale time-related information, such as time of day, day of the week, and seasonality. These embeddings should be integrated with the transaction data embeddings to provide the LLM with a richer context for analysis.

2. **Two-Level Aggregation Strategy**: Implement a two-level aggregation approach within the LLM framework. The first level aggregates data at a finer granularity (e.g., minute-level), capturing short-term patterns and anomalies. The second level aggregates data at a coarser granularity (e.g., daily or weekly), allowing the model to understand longer-term trends and seasonality.

3. **Dynamic Time Warping (DTW)**: Employ DTW techniques to align time series data of different lengths and scales, facilitating the comparison and analysis of temporal patterns across different time frames. This can enhance the LLM's ability to forecast based on multi-scale temporal information.
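
Points 2 and 3 can be made concrete with a small sketch: bucket-mean aggregation for the two granularities, and the classic dynamic-programming form of DTW for comparing series of different lengths. The function names and sample series are assumptions for illustration; the absolute difference is used as the local cost, which is one common choice among several.

```python
import math


def aggregate(values, bucket_size):
    """Average consecutive buckets of `bucket_size` points.
    A trailing partial bucket is dropped."""
    n_buckets = len(values) // bucket_size
    return [
        sum(values[i * bucket_size:(i + 1) * bucket_size]) / bucket_size
        for i in range(n_buckets)
    ]


def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matches an earlier point of b
                cost[i][j - 1],      # b[j-1] also matches an earlier point of a
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]


# Hypothetical fine-grained series and its coarser aggregate.
fine = [1, 2, 3, 4, 5]
coarse = aggregate(fine, bucket_size=2)
print(coarse)

# Two series with the same shape but different lengths align at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

This implementation is quadratic in time and memory, so for high-frequency data a windowed variant or a dedicated library would be the more realistic choice.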

In conclusion, applying LLMs to the analysis of high-frequency DeFi data offers a comprehensive framework for enhancing forecasting accuracy, detecting anomalies, understanding market sentiment, ensuring regulatory compliance, and integrating multi-scale temporal information. By addressing the unique challenges of DeFi data analysis with these advanced technical solutions, stakeholders can unlock significant value and drive forward the evolution of decentralized financial markets.

### c. Check the queries

In [32]:
branch = researcher.default_branch

In [35]:
branch.action_response

| | node_id | timestamp | role | sender | recipient | content |
|---|---|---|---|---|---|---|
| 0 | 0db110123358c7eb67f1ba3d2d5e1b7f | 2024_03_27T22_56_33_181238+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_arxiv"...` |
| 1 | 448c3883920226c1f515578b23175f76 | 2024_03_27T22_56_33_182386+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_bc", "...` |
| 2 | 27b44d79932bb22aecffd294a4542329 | 2024_03_27T22_56_33_182878+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_d2l", ...` |
| 3 | 4a1b547dee925a2c8a99217beafd9f7c | 2024_03_27T22_56_33_183354+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_google...` |
| 4 | d274218cff51754e6c67cfd612ffd3b8 | 2024_03_27T22_56_33_183807+00_00 | assistant | action_response | main | `{"action_response": {"function": "query_wiki",...` |


The researcher made 5 tool calls during this ReAct session. Let us take a look at two of them.

In [36]:
from lionagi.libs import convert, nested

res_dict1 = convert.to_dict(branch.action_response.content.iloc[1])

func_ = nested.nget(res_dict1, ['action_response', 'function'])
args_ = nested.nget(res_dict1, ['action_response', 'arguments'])
output_ = nested.nget(res_dict1, ['action_response', 'output'])

print(f"Tool used is: {func_}")
print(f"question asked: {args_}")
Markdown(output_)

Tool used is: query_bc
question asked: {'query': 'challenges and solutions in analyzing high frequency decentralized finance data'}


Analyzing high-frequency decentralized finance (DeFi) data presents several challenges, primarily due to the nature of blockchain technology and the characteristics of DeFi markets. These challenges include:

1. **Data Volume and Velocity**: DeFi platforms generate vast amounts of data with each transaction. The high frequency of transactions adds to the complexity, requiring robust systems for data capture and storage.

2. **Data Transparency and Accessibility**: While blockchain is inherently transparent, accessing and interpreting the data can be challenging. The decentralized nature means data is spread across numerous nodes, making aggregation and analysis complex.

3. **Data Integrity and Security**: Ensuring the integrity and security of data is paramount. Blockchain technology is secure by design, but the analysis process must maintain this security, especially when integrating data from multiple sources.

4. **Smart Contract Complexity**: DeFi transactions often involve complex smart contracts. Understanding and analyzing the logic and outcomes of these contracts require specialized knowledge and tools.

5. **Regulatory and Compliance Issues**: Navigating the evolving regulatory landscape of DeFi and ensuring compliance add another layer of complexity to data analysis.

Solutions to these challenges leverage blockchain technology's inherent strengths and innovative approaches to data analysis:

1. **Distributed Ledger Technology (DLT)**: Utilizing DLT for data storage and management can help manage the volume and velocity of DeFi data, ensuring integrity and security.

2. **Smart Contracts for Data Management**: Implementing smart contracts can automate data aggregation and analysis processes, reducing the complexity and increasing efficiency.

3. **Advanced Analytics and Machine Learning**: Employing advanced analytics techniques and machine learning can help in interpreting complex data patterns, making sense of vast datasets generated by DeFi platforms.

4. **Decentralized Data Marketplaces**: Participating in or creating decentralized data marketplaces can improve access to DeFi data, allowing for more comprehensive analysis.

5. **Collaboration with Regulatory Bodies**: Working closely with regulators to understand compliance requirements can guide the development of analysis tools that automatically check for compliance, reducing the regulatory burden.

6. **Educational and Training Programs**: Developing educational programs to improve understanding of DeFi and blockchain analytics can help analysts and organizations better navigate the complexities of DeFi data.

By addressing these challenges with targeted solutions, stakeholders can unlock the full potential of DeFi data, driving innovation and growth in the decentralized finance sector.

In [37]:
res_dict2 = convert.to_dict(branch.action_response.content.iloc[3])

func_ = nested.nget(res_dict2, ['action_response', 'function'])
args_ = nested.nget(res_dict2, ['action_response', 'arguments'])
output_ = nested.nget(res_dict2, ['action_response', 'output'])

print(f"Tool used is: {func_}")
print(f"question asked: {args_}")
Markdown(output_)

Tool used is: query_google
question asked: {'query': 'recent advancements in time series forecasting using LLMs in decentralized finance'}


Recent advancements in time series forecasting using Large Language Models (LLMs) in decentralized finance (DeFi) have been significant, reflecting the broader trend of applying cutting-edge AI techniques to financial markets. Key advancements include:

1. **Improved Accuracy**: LLMs have been fine-tuned to better understand and predict market trends by analyzing vast amounts of historical data, leading to more accurate forecasts of asset prices, interest rates, and market movements.

2. **Natural Language Processing (NLP) Enhancements**: The integration of NLP techniques allows LLMs to analyze news articles, social media posts, and other textual data to gauge market sentiment and predict its impact on future market trends. This is particularly useful in the volatile DeFi space, where market sentiment can significantly influence asset prices.

3. **Anomaly Detection**: LLMs are employed to identify unusual patterns or anomalies in time series data that could indicate market manipulation, fraud, or emerging trends, which is crucial for risk management and regulatory compliance in DeFi platforms.

4. **Real-time Analysis**: The ability to process and analyze data in real-time has improved, enabling DeFi platforms to make quicker decisions based on the latest market information. This is important in the fast-paced DeFi sector, where market conditions can change rapidly.

5. **Customization and Personalization**: LLMs have been adapted to create personalized investment strategies for users by analyzing their preferences, risk tolerance, and past behavior, enhancing the user experience and potentially leading to higher returns for individual investors.

6. **Interoperability and Integration**: Progress has been made in making LLMs more interoperable with various blockchain platforms and DeFi applications, allowing for seamless integration and more comprehensive analysis across different data sources and platforms.

These advancements demonstrate the potential of LLMs to transform time series forecasting in decentralized finance, making it more accurate, efficient, and personalized. As these technologies continue to evolve, they are likely to play an increasingly central role in the DeFi ecosystem.
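
Rather than opening each response one at a time, all tool calls can be summarized in a single pass. The sketch below uses only the standard library and runs on stand-in data: the `sample_contents` strings are hypothetical and merely mimic the `action_response` / `function` / `arguments` shape seen above, and it assumes each entry of `branch.action_response.content` is a JSON string of that shape.

```python
import json
from collections import Counter


def summarize_tool_calls(contents):
    """Return a (function_name, query) pair for every action response."""
    summary = []
    for raw in contents:
        record = json.loads(raw)["action_response"]
        summary.append((record["function"], record["arguments"].get("query")))
    return summary


# Hypothetical stand-ins for the JSON strings stored in the content column.
sample_contents = [
    json.dumps({"action_response": {
        "function": "query_bc",
        "arguments": {"query": "challenges in analyzing DeFi data"},
        "output": "...",
    }}),
    json.dumps({"action_response": {
        "function": "query_google",
        "arguments": {"query": "time series forecasting using LLMs"},
        "output": "...",
    }}),
]

summary = summarize_tool_calls(sample_contents)
calls_per_tool = Counter(name for name, _ in summary)

for name, query in summary:
    print(f"{name}: {query}")
```

If the assumption about the column holds, passing `branch.action_response.content` in place of `sample_contents` would list every query the researcher issued.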

### d. Save the work

In [38]:
researcher.to_csv_file()
researcher.log_to_csv()

12 messages saved to data/logs/main_messages_20240327190213.csv
3 logs saved to data/logs/main_log_20240327190213.csv


In [39]:
from lionagi.libs import convert, func_call

responses = convert.to_list([responses_arxiv, responses_d2l, responses_bc, responses_google, responses_wiki], flatten=True)
responses_dicts = func_call.lcall(responses, lambda x: {"response": str(x)})

In [40]:
responses_df = convert.to_df(responses_dicts)
responses_df.to_csv("research_query_responses.csv")

Reference:

LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters

Ching Chang, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen

National Yang Ming Chiao Tung University, Hsinchu, Taiwan

blacksnail789521.cs10@nycu.edu.tw, sf1638.cs05@nctu.edu.tw, {wcpeng, tfchen}@cs.nycu.edu.tw