### **Contributors**

This project was carried out by:
1. **Navid Pourhadi Hasanabad** (2080655), Computer Engineering, AI and Robotics
2. **Seyed Ali Amir Khorasani** (2081302), Computer Engineering, High Performance Computing and Big Data


as the final project of the **Natural Language Processing** course held by **Prof. Giorgio Satta** at the **University of Padova**, Department of Information Engineering (**DEI**).

# **Crypto-RAG: RAG-based Crypto analysis through the recent news**

Crypto-RAG is a system that retrieves the latest news about the cryptocurrency market and analyses it with an LLM. Based on the news published on news websites (our case study is TradingView), we use the LLM to analyse the news and predict the future trends of the coins.

\

More technically, we use **Retrieval-Augmented Generation (RAG)** to retrieve the articles collected through web scraping of the news website and to generate an analysis for a given prompt. The procedure is as follows:

\

1. Collect the news articles related to bitcoin in different categories by scraping https://www.tradingview.com/news/markets/?category=crypto.
2. Split the collected articles into chunks.
3. Embed each chunk and store it in a vector store.
4. Retrieve the relevant chunks from the vector store for the given query.
5. Create a prompt consisting of the query as the question and the retrieved documents as external relevant information.
6. Run the Llama2 chat LLM on the generated prompt to generate an answer based on the information provided.

\

We also use the ***LangChain*** package to combine the different parts of the RAG system into a single pipeline.



In [1]:
!pip install -U -q "langchain_core" "langchain" "transformers==4.31.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" "langchain-community" "faiss-cpu" "tiktoken" "sentence-transformers" "huggingface-hub" "langchain_experimental" "langchain-google-vertexai" "rouge_score"

### Package Importing and HuggingFace token retrieving

In this section, we import the packages required for this project. Since we use **HuggingFace** as the source of our models, we also need a user **token** for some of the models used in this project.

\
Note: the models published by **Meta** require an access request and authentication. You must complete a form in your HuggingFace account to request access to them. Since Llama2 is one of the models used in this project, follow the procedure below to obtain access before running the project with this model.

\
Note2: If you use an LLM other than `meta-llama/Llama-2-13b-chat-hf`, you can skip the following procedure and comment out the login step, since it is no longer required.

\
## **Guideline: adding the HuggingFace token key to Colab's secrets**

\
**Procedure for getting the token key:**

1. Log in to your HuggingFace account.
2. Navigate to the page of the `meta-llama/Llama-2-13b-chat-hf` model.
3. Complete the form to request access to the model (Meta usually grants access within about 5 minutes).
4. Navigate to `setting > Access Tokens` in your profile.
5. Create a new token.
6. Click on `manage > Edit Permissions` for the created token.
7. In the `Repositories permissions` section, click on the search bar, select the `meta-llama/Llama-2-13b-chat-hf` model, and enable the `Read access to contents of selected repos` option.
8. Copy the **token key** so that it can be used for the login step of the project.

\
**Procedure for adding the token key as a Colab secret:**

1. Copy the **token key** from HuggingFace.
2. In the Colab project, click on the **secret key** sign in the **left panel**.
3. Add a new secret with **name: HF_KEY** and **value: {your token}**.
4. You are done with the token key.

** If you are not using Colab, set the `google_colab` variable to `False` and enter the token key when the cell asks for it at runtime.

At this point, if Meta has granted access to the Llama2 models, there should be no problem in the following steps.

In [5]:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.prompts import PromptTemplate, ChatPromptTemplate

from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.chat_models import ChatOpenAI

from huggingface_hub import login as hf_login
from huggingface_hub import notebook_login

from rouge_score import rouge_scorer

import torch
import transformers

import requests
import pandas as pd
import bs4
import os
from datetime import datetime

from bs4 import BeautifulSoup
import json



# in case you are using colab: copy the token to secret key of the project with the name of 'HF_KEY' (the 'key' sign on the left vertical bar)
# turn it to False in case you are not using colab
google_colab = True
if google_colab:
  # imported here because the google.colab module only exists inside Colab
  from google.colab import drive, userdata

  drive.mount("/content/drive/", force_remount=True)

  HF_KEY = userdata.get('HF_KEY')
  hf_login(HF_KEY)

else:
  notebook_login()



Mounted at /content/drive/
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [6]:
if google_colab:

    if not os.path.exists("/content/drive/MyDrive/crypto_RAG/"):
        os.mkdir("/content/drive/MyDrive/crypto_RAG")

    os.chdir("/content/drive/MyDrive/crypto_RAG/")
    !ls



### **Data acquisition (Web Scraping)**

In this section, the data for our RAG system is collected by scraping bitcoin-related cryptocurrency news from TradingView, a well-known website for trading markets: https://www.tradingview.com/news/markets/?category=crypto .

\
We chose TradingView as the news source because it is one of the main references traders use to follow prices, news, analyses, etc. Another benefit of its news section is that it republishes news from established outlets such as Reuters, NewsBTC, Cointelegraph, etc.

\
Because of the structure of the website, it was not possible to retrieve the news through the APIs it uses, so we scrape them from the HTML source instead. Each captured record has the following fields:

- **id**: the unique id of the news item
- **title**: the title of the news item
- **time**: the timestamp at which the news item was published on the website
- **source**: the news agency that published the news item
- **description**: the content of the news item

\
Finally, after scraping the news and their corresponding articles from the website, we store all of them in the ***./crypto_news.csv*** file.

**Note**: since we work with text input only, only the \<p\> elements of each article (which contain only text) are extracted.
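The paragraph-only extraction described in the note can be illustrated in isolation. The following is a minimal sketch that uses the standard-library `html.parser` instead of BeautifulSoup, so it runs without extra dependencies; the class `ParagraphExtractor`, the function `extract_description`, and the sample HTML are hypothetical and are not part of the project code.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._buffer = []      # text pieces of the paragraph being read
        self.paragraphs = []   # finished paragraphs, in document order

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_paragraph = False

    def handle_data(self, data):
        # text outside <p> (headlines, captions, ...) is ignored
        if self._in_paragraph:
            self._buffer.append(data)


def extract_description(html):
    """Return the paragraph texts of a page, separated by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return "\n\n".join(parser.paragraphs)


sample_html = (
    "<div class='js-news-story-container'>"
    "<h1>Headline</h1>"
    "<p>First <b>paragraph</b>.</p>"
    "<img src='chart.png'>"
    "<p>Second paragraph.</p>"
    "</div>"
)
description = extract_description(sample_html)
print(description)
```

The headline and the image are dropped and only the two paragraph texts remain, which is the behaviour the note describes.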

\
#### **This cell updates the news database and should be run at least once a day to keep it in sync with the website.**

In [7]:

# Define the URL
base_url = "https://www.tradingview.com"
url = "https://www.tradingview.com/news/markets/?category=crypto"

added_news = 0

# read the csv file and store it in a pd dataframe
try:
    csv_file = pd.read_csv("./crypto_news.csv")
except FileNotFoundError:
    print("csv file not found. create an empty one with columns id, title, time, source, description")
    csv_file = pd.DataFrame(columns=["id", "title", "time", "source", "description"])

response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    script_tags = soup.find_all("script", {"type": "application/prs.init-data+json"})
    target_json_data = None
    for script_tag in script_tags:
        if script_tag.string and '{"title":"Market news"' in script_tag.string:
            target_json_data = script_tag.string.strip()
            break

    if target_json_data:
        data_dict = json.loads(target_json_data)

        for key in data_dict:
            news_data = data_dict[key]["blocks"][0]["news"]["items"]

        new_news = []

        for news_item in news_data:
            if csv_file["id"].isin([news_item["id"]]).any():
                continue

            try:
                response2 = requests.get(base_url+news_item["storyPath"])
                # skip the story if its page could not be fetched, so that no
                # undefined or stale time/description ends up in the record
                if response2.status_code != 200:
                    print(f"Failed to retrieve data from {base_url+news_item['storyPath']}")
                    continue

                soup2 = BeautifulSoup(response2.content, "html.parser")
                time = soup2.find_all("time")[0]["datetime"]
                div_tag = soup2.find_all("div", {"class": "js-news-story-container"})
                description = "".join([p.text+" \n\n " for p in div_tag[0].find_all("p")])

                news = {
                    "id": news_item["id"],
                    "title": news_item["title"],
                    "time": time,
                    "source": news_item["source"],
                    "description": description
                }

                print(news)

                new_news.append(news)
                added_news = added_news + 1
            except Exception:
                print(f"Failed to retrieve data from {base_url+news_item['storyPath']}")
                continue

        if new_news:
            new_df = pd.DataFrame(new_news)
            csv_file = pd.concat([csv_file, new_df], ignore_index=True)
            csv_file.to_csv("./crypto_news.csv", index=False)
            print(added_news," new data appended and saved to csv.")


    else:
        print('Script tag with {"title": "Market news"} not found.')

else:
    print(f"Failed to retrieve data from {url}. Status code: {response.status_code}")



csv file not found. create an empty one with columns id, title, time, source, description
{'id': 'tag:reuters.com,2024:newsml_L4N3IJ1CO:0', 'title': 'VanEck sets 0.20% fee for proposed spot ethereum ETF', 'time': 'Fri, 21 Jun 2024 20:54:32 GMT', 'source': 'Reuters', 'description': 'VanEck plans to charge a fee of 0.20% on their proposed spot ethereum exchange-traded fund (ETF), according to a  U.S. Securities and Exchange Commission filing on Friday. \n\n '}
{'id': 'DJN_DN20240621005985:0', 'title': 'CoinDesk Bitcoin Price Index Lost 1.23% to $64173.16 — Data Talk', 'time': 'Fri, 21 Jun 2024 20:31:00 GMT', 'source': 'Dow Jones Newswires', 'description': 'CoinDesk Bitcoin Price Index is down $799.45 today or 1.23% to $64173.16 \n\n Note: CoinDesk Bitcoin Price Index (XBX) at 4 p.m. ET close \n\n Data compiled by Dow Jones Market Data \n\n '}
{'id': 'tag:reuters.com,2024:newsml_L4N3IJ1AO:0', 'title': 'Trump campaign refunds Winklevoss twins after bitcoin donations exceed limit, Bloomberg

### **1. RAG system: Data Loading**



We load the generated CSV file **"crypto_news.csv"** using the LangChain CSVLoader, with the following settings:

- **description** of the news as the content of the data
- **id** column as the source_column (it is used as a reference when caching the data)
- **id**, **title**, **source**, **time** as the metadata of the news
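To make the content/metadata split concrete, here is a minimal sketch that mimics it with the standard-library `csv` module. It is a simplification and not the CSVLoader implementation: the function `load_news_rows` and the sample row are hypothetical, and the `column: value` form of the content is an assumption based on the `description:` prefix visible in the chain outputs further down.

```python
import csv
import io


def load_news_rows(csv_text, metadata_columns=("id", "title", "source", "time")):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # every column that is not metadata becomes part of the content
        content = "\n".join(
            f"{name}: {value}" for name, value in row.items()
            if name not in metadata_columns
        )
        metadata = {name: row[name] for name in metadata_columns}
        documents.append({"page_content": content, "metadata": metadata})
    return documents


sample_csv = (
    "id,title,time,source,description\n"
    'n1,"Fee set for ETF","Fri, 21 Jun 2024 20:54:32 GMT",Reuters,"A short article body."\n'
)
docs = load_news_rows(sample_csv)
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```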

\
### **Important Note**
Since the news is scraped on the fly whenever the scraping cell is executed, the number of records in our database is not fixed and changes from run to run.

In [8]:
crypto_loader = CSVLoader(file_path="./crypto_news.csv", source_column="id", metadata_columns=["id", "title", "source", "time"])
crypto_data = crypto_loader.load()
print("number of the news data exist inside the database file is " ,len(crypto_data))



number of the news data exist inside the database file is  58


### **2. RAG system: Chunk transformation**

We split the content of each news article into **chunks** using the LangChain RecursiveCharacterTextSplitter. Intuitively, we want to break large documents into smaller pieces of data.

\
To create the chunks, we split the data based on two criteria:

1. Split by paragraph: since each paragraph usually has its own distinct context that can be understood on its own, we first split the news article into paragraphs.

2. Split by length: some paragraphs are too long, which probably means that several contexts are discussed in the same paragraph, so we split such paragraphs into several chunks as well.


\
Since the publisher and the publication time are useful for the analysis (news published a long time ago is less useful for some analyses), we add them at the beginning of each chunk's content.

The format of each chunk is as follows:

Document content:
\
`Source newsletter: {newsletter}, Time of publication of news: {date of publish}, Content of news:{news content} `

Metadata:
\
`id, title, time, source`
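The two splitting criteria and the chunk format above can be sketched in plain Python. This is an illustration only: it does not reproduce the merging of short pieces or the overlap handling of RecursiveCharacterTextSplitter, and the names `split_into_chunks` and `add_header` are hypothetical.

```python
def split_into_chunks(text, chunk_size=1000):
    """Split on blank lines first, then cut over-long paragraphs by length."""
    chunks = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # second criterion: a paragraph longer than chunk_size is cut into pieces
        for start in range(0, len(paragraph), chunk_size):
            chunks.append(paragraph[start:start + chunk_size])
    return chunks


def add_header(chunk, source, time):
    """Prefix a chunk with its publisher and publication time."""
    return (f"Source newsletter: {source}, "
            f"Time of publication of news: {time}, "
            f"Content of news:{chunk}")


article = "Short one.\n\n" + "x" * 25
chunks = split_into_chunks(article, chunk_size=10)
print(chunks)
print(add_header(chunks[0], "Reuters", "Fri, 21 Jun 2024 20:54:32 GMT"))
```

With a `chunk_size` of 10, the short paragraph stays whole while the 25-character paragraph is cut into pieces of 10, 10 and 5 characters.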

In [9]:
chunk_paragraph_to_sub = RecursiveCharacterTextSplitter(separators=["\n\n", "\n"], chunk_size=1000, chunk_overlap=50)
crypto_documents_chunks = chunk_paragraph_to_sub.transform_documents(crypto_data)

# add the time and source from metadata to the document of each chunk
for doc in crypto_documents_chunks:
    doc.page_content = "Source newsletter: " + doc.metadata["source"] + ", Time of publication of news: " + doc.metadata["time"] + ", Content of news:" + doc.page_content

print("The number of chunks: ", len(crypto_documents_chunks))


The number of chunks:  207


### **3. RAG system: Indexing**

Now, using a sentence embedding model, we embed each chunk and store the embeddings in a vector store.
Numerous models can be used for embedding sentences and paragraphs; the following two (with their attributes) are among the most popular on HuggingFace:

- `sentence-transformers/all-mpnet-base-v2`: about 109M params | 768 dim | `microsoft/mpnet-base` fine-tuned with contrastive learning on 1B data | more commonly used for longer paragraphs
- `sentence-transformers/all-MiniLM-L6-v2`: 22.7M params | 384 dim | `nreimers/MiniLM-L6-H384-uncased` fine-tuned with contrastive learning on 1B data | more commonly used for sentences and short paragraphs

\
Since our chunks are paragraphs of up to 1000 characters, we use the `sentence-transformers/all-mpnet-base-v2` embedder, which maps sentences and paragraphs to a 768-dimensional space.

\
In this code, we also use CacheBackedEmbeddings, a caching mechanism that stores the computed embeddings so that they do not have to be recomputed the next time the same text is embedded.

\
Additionally, the vector store used in this project is FAISS (Facebook AI Similarity Search), a simple and efficient library for similarity search in vector spaces over large-scale datasets.
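Conceptually, the vector store answers one question: which stored vectors are closest to the query vector? The sketch below shows this with a brute-force search over a toy index. It uses cosine similarity as a stand-in, whereas the metric actually used by the FAISS index may differ (for example Euclidean distance); the 3-dimensional vectors and the names `cosine_similarity`, `top_k` and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        indexed_vectors,
        key=lambda doc_id: cosine_similarity(query_vector, indexed_vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# toy "embeddings" of three chunks; real ones have hundreds of dimensions
toy_index = {
    "btc_price": [0.9, 0.1, 0.0],
    "eth_etf": [0.1, 0.9, 0.0],
    "regulation": [0.0, 0.2, 0.9],
}
best = top_k([1.0, 0.2, 0.0], toy_index, k=2)
print(best)
```

A library such as FAISS performs the same kind of lookup with optimized index structures, which matters once the number of chunks grows.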

In [10]:

store = LocalFileStore("./cache/")

embedding_model_name = 'sentence-transformers/all-mpnet-base-v2'

embeddings_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
embedder = CacheBackedEmbeddings.from_bytes_store(embeddings_model, store, namespace=embedding_model_name)

vector_store = FAISS.from_documents(crypto_documents_chunks, embedder)

retriever = vector_store.as_retriever()




In this section we test the query embedding and the similarity search in the vector store. To this end, we define a few queries and look up the chunks in the database that are most similar to them, which would be used as the retrieved content.
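The cell below over-fetches candidates by similarity and then keeps the most recent ones. One detail worth illustrating is the date handling: timestamps such as `Fri, 21 Jun 2024 20:03:49 GMT` do not sort chronologically as plain strings, so they should be parsed first. The following is a minimal sketch of such a recency re-ranking, assuming the timestamps follow this RFC 2822 style; the function `most_recent` and the candidate records are hypothetical.

```python
from email.utils import parsedate_to_datetime


def most_recent(candidates, k):
    """Keep the k newest candidates; each one carries a 'time' string."""
    ordered = sorted(
        candidates,
        key=lambda c: parsedate_to_datetime(c["time"]),  # parse before comparing
        reverse=True,
    )
    return ordered[:k]


candidates = [
    {"id": "a", "time": "Fri, 21 Jun 2024 20:03:49 GMT"},
    {"id": "b", "time": "Mon, 24 Jun 2024 09:00:00 GMT"},
    {"id": "c", "time": "Wed, 19 Jun 2024 12:30:00 GMT"},
]
newest_ids = [c["id"] for c in most_recent(candidates, k=2)]
print(newest_ids)
```

Sorting the raw strings in descending order would instead put the Wednesday item first, because the comparison would start at the weekday name.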



In [11]:
from email.utils import parsedate_to_datetime

def search_top_similarities_byScoreAndTime(query, embeddings_model, vector_store, k=4):
    print("\n*****************************************************")
    print("reference query:", query)
    embedding_vector = embeddings_model.embed_query(query)
    # fetch 2*k candidates by similarity, then keep the k most recent ones
    docs = vector_store.similarity_search_by_vector(embedding_vector, k=2*k)

    # the "time" metadata is a string such as 'Fri, 21 Jun 2024 20:54:32 GMT';
    # parse it so that the sort is chronological rather than alphabetical
    docs.sort(key=lambda doc: parsedate_to_datetime(doc.metadata["time"]), reverse=True)
    for doc in docs[:k]:
      print(doc)
    # return docs[:k]

# Example usage
search_top_similarities_byScoreAndTime("How much would be the peak of bitcoin chart in 2025?", embeddings_model, vector_store, k=2)
search_top_similarities_byScoreAndTime("bitcoin price analysis by the end of 2024?", embeddings_model, vector_store, k=3)
search_top_similarities_byScoreAndTime("what is going on between India and Binance?", embeddings_model, vector_store,k = 5)


*****************************************************
reference query: How much would be the peak of bitcoin chart in 2025?
page_content="Source newsletter: Cointelegraph, Time of publication of news: Fri, 21 Jun 2024 20:03:49 GMT, Content of news:Related: Why is Bitcoin price down today? \n\n The worsening sentiment was reinforced after retail data provider Syntun stated that China’s annual mid-year e-commerce festival saw sales drop for the first time in eight years. The event celebrates the founding date of Chinese giant JD.com which is the region’s second-largest in terms of annual sales, according to CNBC. Gross sales reached $102.3 billion in 2024, a 7% drop compared to 2023. \n\n Under this scenario, the U.S. dollar Strength Index (DXY) rose to its highest level in fifty days at 105.85, indicating that investors are moving away from the euro, British pound, Swiss franc, and similar currencies. While the S&P 500 index remained unchanged on June 21, traders viewed Bitcoin's 52% g

### **4. RAG System: LLM model Loading**

In this part, we load the `Llama-2-13b-chat-hf` model as the chat LLM of our RAG system. We load the model config, the tokenizer, and the pretrained weights through the HuggingFace `transformers` library.

\
An important step is quantization: the model has 13 billion parameters, which is too heavy to load as-is, so it needs to be quantized to reduce its memory footprint.

In this code, 4-bit quantization is configured through `BitsAndBytesConfig`.

Finally, the Llama2 chat model is loaded with this quantization configuration and its own model config.
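A rough back-of-the-envelope calculation shows why quantization matters here. The sketch below estimates the memory needed for the weights alone, assuming a nominal 13 billion parameters; it ignores the quantization constants, activations and the key/value cache, so real usage is higher. The function `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9


params_13b = 13e9  # nominal parameter count, an approximation

fp16_gb = weight_memory_gb(params_13b, 16)
int4_gb = weight_memory_gb(params_13b, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions, 4-bit storage needs about a quarter of the memory of 16-bit storage, which is what makes a model of this size practical on a single GPU.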

In [12]:
Llama2_chat_model = "meta-llama/Llama-2-13b-chat-hf"
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# we have to use a token key for the Llama2 model since its access is restricted by Meta (don't forget to set the permissions for this model)
model_config = transformers.AutoConfig.from_pretrained(Llama2_chat_model)

tokenizer = transformers.AutoTokenizer.from_pretrained(Llama2_chat_model)

model = transformers.AutoModelForCausalLM.from_pretrained(
    Llama2_chat_model,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

model.eval()





LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


The LLM pipeline for the **text generation** task is created using the tokenizer and the model loaded in the previous cell.

The important settings of the pipeline are as follows:
1. temperature: set to 0, since we want the model to stick to the information in the news and not to hallucinate.

   **Note**: a higher temperature could be used to let the LLM predict more freely from the news, but since chart data is crucial for prediction and we do not have it yet, we only expect the model to draw conclusions from the news without hallucinating.

2. max_new_tokens: the answer is limited to at most 500 generated tokens, since we do not want overly long answers.
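The effect of the temperature setting can be seen on a toy example. The sketch below applies a temperature-scaled softmax to three hypothetical token scores: as the temperature decreases, the probability mass concentrates on the highest-scoring token. A temperature of exactly 0 cannot be plugged into this formula (it would divide by zero); the limit corresponds to always picking the top token, i.e. greedy decoding. The function name and the scores are illustrative assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores of three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```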

In [13]:
# this model is used only for inference based on the retrieved news, without any hallucination
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    temperature=0.0,
    max_new_tokens=500
)

llm = HuggingFacePipeline(pipeline=generate_text)




### **5. Prompt engineering and Retrieval Chain creating**

In this cell we define the prompt template for our domain (cryptocurrency news analysis) and create the retrieval chain needed for the RAG model.

\
#### Prompt Template
Through iterative refinement of the prompt template, we found the following elements useful for making the LLM's answers more efficient and effective:

1. Describe its **task** as a crypto market analyst working from the referenced news
2. Ask it to focus on the **impacts on the trends** (which is our goal)
3. **Restrict** the model from answering if the provided news is not useful or too old
4. Provide the **current timestamp** at which the query is executed; this is one of the most useful parts of the prompt, because the model should know how old each news item is in order to generate an effective answer
5. Provide the **retrieved news**
6. Provide the question, identical to the given **query**


The data that must be passed to create the prompt are:
1. question
2. context of the retrieved news
3. current timestamp (which is passed as a partial variable)
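The role of the three inputs can be sketched with plain string formatting. This is not LangChain's `PromptTemplate`, only an illustration of the idea; the shortened `TEMPLATE`, the function `build_prompt` and its `clock` parameter are hypothetical. The point it shows is that the timestamp should be read at the moment the prompt is built, so that it reflects when the query is actually executed.

```python
from datetime import datetime

# shortened, hypothetical version of the prompt layout
TEMPLATE = (
    "Current time: {current_time}\n"
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, context, clock=datetime.now):
    """Fill the template; the clock is called when the prompt is built."""
    return TEMPLATE.format(
        current_time=clock().strftime("%Y-%m-%d %H:%M:%S"),
        context=context,
        question=question,
    )


def fixed_clock():
    # fixed time so that the demo output is reproducible
    return datetime(2024, 6, 21, 21, 16, 16)


prompt = build_prompt(
    "What moved bitcoin today?",
    "Source newsletter: Reuters, Content of news: example text",
    clock=fixed_clock,
)
print(prompt)
```

Passing the clock as a function rather than a precomputed string is the design choice that keeps the timestamp fresh for every query.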

\
#### Retrieval Chain

The chain used for the RAG system is the `RetrievalQA` chain (a built-in chain of the LangChain package) with the following parameters:

1. llm: the LLM pipeline defined above
2. retriever: the retriever of the FAISS vector store
3. prompt: the custom prompt template defined in this cell
4. callbacks: the `StdOutCallbackHandler` for printing to standard output
5. return_source_documents: False

In [14]:
# callback handler needed for printing in the std out
handler = StdOutCallbackHandler()

# current timestamp with format of year, month, day, hour, minute, second
def current_timestamp():
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")


custom_prompt_template = PromptTemplate(
    input_variables=["question", "context"],
    template="""As a knowledgeable analyst assistant on the market of crypto-currency, use the following context news on the market to analyze the trends of crypto-currencies and explain the possible impacts of the news on the market. If you think the provided news are not sufficient or too old, just say that you can not provide reliable information.\n\n
- The current time which this query is generated is {current_time}, so be aware of the difference between current time and the time of the news when you want to induce some analysis based on the news.\n\n
- Context : {context}\n\n
- Question: {question}\n\n
- Answer:""",
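    # pass the function itself (not its result) so that the timestamp is
    # computed each time the prompt is formatted, not once at definition time
    partial_variables={"current_time": current_timestamp},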
)


QA_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever= retriever,
    chain_type_kwargs={"prompt": custom_prompt_template},
    callbacks=[handler],
    return_source_documents=False)


### **6. Testing Phase**

In this section we test some queries related to news recently published by the news outlets and on TradingView, to see whether our model can draw reasonable conclusions from them.

In [15]:
# the chain only reads the "query" key; the timestamp comes from the prompt template
query1 = QA_chain({"query" : "How was the market of bitcoin in the month of may 2024?"})
print("***************************************************************\n",query1["result"])


> Entering new RetrievalQA chain...

> Finished chain.
***************************************************************
 As a knowledgeable analyst assistant on the market of crypto-currency, use the following context news on the market to analyze the trends of crypto-currencies and explain the possible impacts of the news on the market. If you think the provided news are not sufficient or too old, just say that you can not provide reliable information.


- The current time which this query is generated is 2024-06-21 21:16:16, so be aware of the difference between current time and the time of the news when you want to induce some analysis based on the news.


- Context : Source newsletter: Cointelegraph, Time of publication of news: Fri, 21 Jun 2024 20:03:49 GMT, Content of news:Related: Why is Bitcoin price down today? 

 The worsening sentiment was reinforced after retail data provider Syntun stated that China’s annual mid-year e-commerce festival saw sales drop for 

In [17]:
query2 = QA_chain({"query" : "Based on the Halving of the bitcoin the market experienced in 2024, what is the expectation of the price of bitcoin in 2025?"})
print("*****************************************\n", query2["result"])



> Entering new RetrievalQA chain...

> Finished chain.
*****************************************
 As a knowledgeable analyst assistant on the market of crypto-currency, use the following context news on the market to analyze the trends of crypto-currencies and explain the possible impacts of the news on the market. If you think the provided news are not sufficient or too old, just say that you can not provide reliable information.


- The current time which this query is generated is 2024-06-21 21:16:16, so be aware of the difference between current time and the time of the news when you want to induce some analysis based on the news.


- Context : Source newsletter: Cointelegraph, Time of publication of news: Fri, 21 Jun 2024 17:01:38 GMT, Content of news:description: Bitcoin BTCUSD is gradually moving down toward the support of the $56,552 to $73,777 range it has been stuck in for the past several months. Glassnode lead analyst James Check cautioned traders in a Jun

#### **Testing using user input**

This section is for trying your own queries: run the cell, type your query, and see the magic output!! ✨🔥

In [22]:
user_query = input("Enter your query: ")

query3 = QA_chain({"query" : user_query})
print("***************************************************************\n",query3["result"])

Enter your query: what is going on between India and Binance?


> Entering new RetrievalQA chain...

> Finished chain.
***************************************************************
 As a knowledgeable analyst assistant on the market of crypto-currency, use the following context news on the market to analyze the trends of crypto-currencies and explain the possible impacts of the news on the market. If you think the provided news are not sufficient or too old, just say that you can not provide reliable information.


- The current time which this query is generated is 2024-06-21 21:16:16, so be aware of the difference between current time and the time of the news when you want to induce some analysis based on the news.


- Context : Source newsletter: Reuters, Time of publication of news: Fri, 21 Jun 2024 14:17:57 GMT, Content of news:Houlahan said during the prison visit, they found Gambaryan living in difficult conditions and "he was also clearly under a lot of stress and his health is not very good." 

 Binance has previously said Gambaryan had malaria and pneumonia. 

 Ga

### **Evaluation: ROUGE**

To evaluate our model, we use the ROUGE score. Specifically, we assess how much the RAG process helps in generating reasonable and up-to-date answers. For each query, we take the chunk in the vector store that is most similar to the query as the reference, and compute the ROUGE score of our model (the LLM after the RAG process) against it. We then compute the ROUGE score of the base LLM against the same reference.

\
By comparing these two scores, we can measure the impact of the RAG process on the generated responses. Identifying a correct reference answer is challenging, so for this evaluation we use the retrieved chunk described above as a proxy. This lets us compare the base model and the RAG-enhanced model systematically, against the same reference.
\
\
**ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) scores help evaluate the quality of text generated by natural language processing models. These scores compare the generated text to a reference text, focusing on three common metrics: ROUGE-1, ROUGE-2, and ROUGE-L.

ROUGE-1: Measures the overlap of unigrams (single words) between the generated text and the reference text. This indicates how many individual words from the reference text are present in the generated text.

ROUGE-2: Measures the overlap of bigrams (two consecutive words) between the generated text and the reference text. This provides a sense of how well the generated text captures the context and flow of the reference text by considering pairs of words.

ROUGE-L: Measures the longest common subsequence (LCS) between the generated text and the reference text. This captures the longest sequence of words that appear in both texts in the same order, assessing the overall structure and coherence of the generated text.
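To make the three definitions concrete, here is a minimal pure-Python sketch of ROUGE-1, ROUGE-2, and ROUGE-L as F-measures. It is a simplified illustration, not the implementation in the `rouge_score` package used in this notebook: it assumes lowercase whitespace tokenization and applies no stemming, and the helper names and the two example sentences are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, pred_total, ref_total):
    """F-measure from a match count and the prediction/reference totals."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1: overlap of n-grams, with counts clipped to the smaller side."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    matches = sum((pred & ref).values())  # Counter '&' keeps the minimum counts
    return f1(matches, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1: based on the longest in-order (not necessarily contiguous) match."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


# Hypothetical example sentences, not taken from the collected news
reference = "bitcoin price falls toward support"
prediction = "bitcoin falls toward key support"

print(f"ROUGE-1: {rouge_n(prediction, reference, 1):.2f}")
print(f"ROUGE-2: {rouge_n(prediction, reference, 2):.2f}")
print(f"ROUGE-L: {rouge_l(prediction, reference):.2f}")
```

In this example four of the five words are shared and appear in the same order, so ROUGE-1 and ROUGE-L are both high, while only one bigram ("falls toward") is shared, so ROUGE-2 is much lower.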

\
The RAG-enhanced model scores higher than the base model on all three ROUGE metrics (F-measure, averaged over the five evaluation queries):

- ROUGE-1 improved from 0.2005 to 0.2309, indicating better unigram overlap with the reference text.
- ROUGE-2 improved the most, from 0.0181 to 0.2295, meaning the output shares far more pairs of consecutive words with the reference text.
- ROUGE-L improved from 0.1136 to 0.2309, indicating that the RAG-enhanced output follows the word order of the reference text more closely.

\
\
In summary, the RAG-enhanced model generates text that is more similar to the reference text in terms of individual words, pairs of consecutive words, and overall sequence, indicating better performance compared to the base model.
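One detail in the reported numbers is worth a closer look: for the RAG-enhanced model, ROUGE-1 and ROUGE-L are identical and ROUGE-2 is only slightly lower. This pattern would be consistent with the reference chunk appearing verbatim inside the scored output, which may be the case here because the printed `result` echoes the prompt together with the retrieved context. If so, part of the gain reflects that overlap rather than the quality of the generated answer alone. The sketch below is a hedged, closed-form illustration of this effect under that assumption; the function name and the token lengths are hypothetical and are not measured from the notebook.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def scores_when_reference_is_contained(ref_len, pred_len):
    """ROUGE F1 values when the reference is a contiguous span of the prediction.

    Assumes 2 <= ref_len <= pred_len (lengths are token counts). In that case
    every reference unigram and bigram is matched, so recall is 1.0 and
    precision is the matched share of the prediction.
    """
    rouge1 = f1(ref_len / pred_len, 1.0)
    rouge2 = f1((ref_len - 1) / (pred_len - 1), 1.0)
    rouge_l = rouge1  # the LCS is the whole reference, same count as the unigram overlap
    return {"rouge1": rouge1, "rouge2": rouge2, "rougeL": rouge_l}


# Hypothetical lengths, chosen only for illustration
example = scores_when_reference_is_contained(ref_len=30, pred_len=230)
print(example)
```

Under this assumption ROUGE-1 and ROUGE-L coincide exactly and ROUGE-2 sits just below them, the same shape as the RAG-enhanced scores. Scoring only the generated answer, with the echoed prompt removed, would be one way to check how much of the gain remains.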

In [21]:
# ROUGE implementation used for scoring (harmless if already imported in an earlier cell)
from rouge_score import rouge_scorer


# Function to generate answers using the base model (no retrieval)
def generate_base_answers(prompts):
    base_answers = []
    for prompt in prompts:
        response = llm(prompt)  # response is a single string
        base_answers.append(response)
    return base_answers


# Function to retrieve, for each query, the most similar chunk as the reference
def retrieve_reference_data(queries, vector_store, embeddings_model):
    references = []
    for query in queries:
        embedding_vector = embeddings_model.embed_query(query)
        docs = vector_store.similarity_search_by_vector(embedding_vector, k=1)
        if docs:
            references.append(docs[0].page_content)
        else:
            references.append("")  # no chunk found: fall back to an empty reference
    return references


# Function to compute ROUGE scores (F-measure, averaged over all pairs)
def compute_rouge_scores(predictions, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}
    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)  # argument order is (target, prediction)
        for key in scores:
            scores[key].append(score[key].fmeasure)
    # Average each metric; return 0.0 instead of dividing by zero on empty input
    avg_scores = {key: sum(vals) / len(vals) if vals else 0.0 for key, vals in scores.items()}
    return avg_scores

# Define your prompts for evaluation
prompts = [
    "What is the current price of Bitcoin?",
    "What are the predictions for Bitcoin's price in 2025?",
    "What is the impact of Bitcoin halving in 2024?",
    "How was the Bitcoin market in May 2024?",
    "What are the recent news on Bitcoin and Binance?"
]

# Generate answers using the base Llama2 model
base_answers = generate_base_answers(prompts)



# Retrieve reference answers
references = retrieve_reference_data(prompts, vector_store, embeddings_model)

# Compute ROUGE scores for base model
base_rouge_scores = compute_rouge_scores(base_answers, references)

# Generate answers using the RAG-enhanced model
rag_answers = []
for prompt in prompts:
    response = QA_chain({"query": prompt})
    rag_answers.append(response["result"])
# Compute ROUGE scores for RAG-enhanced model
rag_rouge_scores = compute_rouge_scores(rag_answers, references)

print("ROUGE Scores for Base Model:", base_rouge_scores)
print("ROUGE Scores for RAG-Enhanced Model:", rag_rouge_scores)





> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.

> Entering new RetrievalQA chain...

> Finished chain.
ROUGE Scores for Base Model: {'rouge1': 0.20048120219596327, 'rouge2': 0.01813402375892777, 'rougeL': 0.11364115414783929}
ROUGE Scores for RAG-Enhanced Model: {'rouge1': 0.23094321353784708, 'rouge2': 0.22952736276369556, 'rougeL': 0.23094321353784708}
