# **RAG-based bitcoin analysis through the recent news**

In this project, we carry out a fundamental analysis of the **bitcoin** cryptocurrency based on recent news about bitcoin and the financial markets published on news websites and in newsletters.

\

More technically, we use **Retrieval-Augmented Generation (RAG)** to retrieve articles collected by scraping news websites and to generate an analysis for a given prompt. We follow the procedure below:

\

1. Collect news articles related to bitcoin in different categories by scraping the https://news.bitcoin.com/ website.
2. Split the collected articles into chunks.
3. Embed each chunk and store it in a vector store.
4. Retrieve the relevant chunks from the vector store for a given query.
5. Create a prompt consisting of the query as the question and the retrieved documents as external context.
6. Run the Llama 2 chat LLM on the generated prompt to produce an answer based on the provided information.

\

We also use the ***LangChain*** package to combine the different parts of the RAG system into a single pipeline.

\
In the following sections, we will describe the components in detail.

In [1]:
!pip install -U -q "langchain" "transformers==4.31.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" "langchain-community" "faiss-cpu" "tiktoken" "sentence-transformers" "huggingface-hub" "newsapi-python"

### Package Imports and HuggingFace Token Retrieval

In this section, we import the packages required for this project. Since we use **HuggingFace** as the source of our models, we also need a user **token** for some of the models used in this project.

\
Note: models published by **Meta** require an access request and authentication. You must complete a form in your HuggingFace account to request access to them. Since one of the models used in this project is Llama 2, follow the procedure below to get access to it from the code before running the project with this model.

\
Note 2: If you use an LLM other than `meta-llama/Llama-2-13b-chat-hf`, you can skip the procedure below and comment out the login step, since it is no longer required.

\
Procedure for retrieving the token:

1. Log in to your HuggingFace account.
2. Navigate to the `meta-llama/Llama-2-13b-chat-hf` model's page.
3. Complete the form to request access to the model (Meta usually grants access in about 5 minutes).
4. Navigate to `Settings > Access Tokens` in your profile.
5. Create a new token.
6. Click on `manage > Edit Permissions` for the created token.
7. In the `Repositories permissions` part, click on the search bar, select the `meta-llama/Llama-2-13b-chat-hf` model, and enable the `Read access to contents of selected repos` option.
8. Copy the **token key** into the project for the login step.

In [2]:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.prompts import PromptTemplate

from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.chat_models import ChatOpenAI

from huggingface_hub import login as hf_login
from huggingface_hub import notebook_login

import torch
import transformers

import requests
import pandas as pd
import bs4


# in case you are using colab: copy the token to secret key of the project with the name of 'HF_TOKEN' (the 'key' sign on the left vertical bar)
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')
hf_login(HF_TOKEN)

# in case you are running locally (comment the colab part, and uncomment the code below)
# notebook_login()

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### **Data acquisition (Web Scraping)**

In this section, the data for our RAG system is collected by scraping bitcoin-related news from the https://news.bitcoin.com/ website.

\
The website has several news categories. For this task, we use only those related to the bitcoin market and the events that affect it.

categories: `market-updates, markets-and-prices, finance, economics, mining`

Finally, after scraping the news items and their corresponding articles from the website, we store all of them in the *./data.csv* file.

**Note**: since we work with text input only, only the \<p\> elements of each article (which contain only text) are extracted.
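To make the effect of keeping only \<p\> elements concrete, here is a minimal sketch that uses Python's standard `html.parser` instead of BeautifulSoup. The `ParagraphText` class and the `sample_html` string are hypothetical and exist only for illustration; the notebook itself relies on `bs4`.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside a <p> element
        self.paragraphs = []    # one string per <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <b>) is still part of the paragraph
        if self.depth > 0:
            self.paragraphs[-1] += data


# Hypothetical article fragment: a heading, two paragraphs and an image
sample_html = (
    "<h2>Markets</h2><p>Bitcoin rose <b>3%</b> today.</p>"
    "<img src='x.png'><p>Miners held steady.</p>"
)

parser = ParagraphText()
parser.feed(sample_html)
text = " ".join(parser.paragraphs)
print(text)
```

The heading and the image are dropped, while inline formatting inside a paragraph is flattened into plain text.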

In [3]:
def extract_content(slug, headers, base_url):
    article_url = base_url + slug
    article_resp = requests.get(article_url, headers=headers)
    article_data = article_resp.json()

    soup = bs4.BeautifulSoup(article_data["content"], "html.parser")
    paragraphs = soup.find_all("p")
    article_content = ' '.join([p.get_text() for p in paragraphs])

    return article_content

categories = ["market-updates", "markets-and-prices", "finance", "economics", "mining"]

base_url = "https://api.news.bitcoin.com/wp-json/bcn/v1/post?slug="
url = "https://api.news.bitcoin.com/wp-json/bcn/v1/posts?offset=0&per_page=100&s=bitcoin&filter_by=category&filter="

# Read the existing CSV file
try:
    existing_data = pd.read_csv("./data.csv")
except FileNotFoundError:
    existing_data = pd.DataFrame(columns=["id", "date", "categories", "content"])

for category in categories:

    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url+category, headers=headers)
    data = pd.DataFrame(response.json()["posts"])
    data = data[["id", "date", "categories", "slug"]]

    data["categories"] = data["categories"].apply(lambda x: x[0]["name"])

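    # Keep only posts that are not stored yet; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
    new_data = data[~data["id"].isin(existing_data["id"])].copy()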

    new_data["content"] = new_data["slug"].apply(extract_content, args=(headers, base_url))
    new_data = new_data[["id", "date", "categories", "content"]]

    existing_data = pd.concat([existing_data, new_data], ignore_index=True)

existing_data.to_csv("./data.csv", index=False)


### **1. RAG system: Data Loading**



Load the generated CSV file **"data.csv"** using the LangChain `CSVLoader`, with the **id** column as the `source_column`.

In [4]:
bitcoin_loader = CSVLoader(file_path="data.csv", source_column="id")
bitcoin_data = bitcoin_loader.load()
print("number of the news data exist inside the database file is " ,len(bitcoin_data))


number of the news data exist inside the database file is  500


### **2. RAG system: Chunk transformation**

Split the content of each news article into **chunks** using the LangChain `RecursiveCharacterTextSplitter`. More intuitively, we want to break large documents into smaller chunks of data. The proposed chunking process has two steps:

1. Split by paragraph: since each paragraph usually has its own distinct context that makes sense on its own, we first split the news article into paragraphs.

2. Split by length: in some cases a paragraph becomes too long, which probably means it covers more than one context. We therefore split such paragraphs into several smaller chunks as well.
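The two steps above can be sketched in plain Python as follows. This is a simplified illustration and not how `RecursiveCharacterTextSplitter` works internally; the helper names and the sample `article` are hypothetical, and the sizes of 600 and 50 characters mirror the values used in this notebook.

```python
def split_paragraphs(text):
    """Step 1: split on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def split_by_length(paragraph, chunk_size=600, chunk_overlap=50):
    """Step 2: cut a long paragraph into overlapping fixed-size windows."""
    if len(paragraph) <= chunk_size:
        return [paragraph]
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(paragraph), step):
        chunks.append(paragraph[start:start + chunk_size])
        if start + chunk_size >= len(paragraph):
            break  # the last window already reaches the end of the paragraph
    return chunks


def two_stage_chunks(text, chunk_size=600, chunk_overlap=50):
    """Apply the paragraph split first, then the length split on each paragraph."""
    chunks = []
    for paragraph in split_paragraphs(text):
        chunks.extend(split_by_length(paragraph, chunk_size, chunk_overlap))
    return chunks


# Hypothetical article: one short paragraph and one 1300-character paragraph
article = "Short paragraph about the halving.\n\n" + "x" * 1300
chunks = two_stage_chunks(article)
print([len(c) for c in chunks])
```

The short paragraph stays intact, while the long one is cut into windows that share 50 characters with their neighbor, so a sentence that straddles a boundary is still visible in at least one chunk.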

In [5]:
chunk_article_to_paragraph = RecursiveCharacterTextSplitter(separators=["\n\n", "\n",],)
chunk_paragraph_to_sub = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50,
    length_function = len,
)

In [6]:
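# Only the paragraph-level splitter is applied here; chunk_paragraph_to_sub (defined above) is not used in this cell.
bitcoin_documents = chunk_article_to_paragraph.transform_documents(bitcoin_data)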
len(bitcoin_documents)

2268

### **3. RAG system: Indexing**

Now, using a sentence embedding model, we embed each chunk and store it in a vector store.
Although many models can be used to embed sentences and paragraphs, the following two (with their attributes) are among the most popular on HuggingFace:

- `sentence-transformers/all-mpnet-base-v2` : 109M params | 768 dim | fine-tuned from `microsoft/mpnet-base` with contrastive learning on 1B sentence pairs | more commonly used for longer paragraphs
- `sentence-transformers/all-MiniLM-L6-v2` : 22.7M params | 384 dim | fine-tuned from `nreimers/MiniLM-L6-H384-uncased` with contrastive learning on 1B sentence pairs | more commonly used for sentences and short paragraphs

\
In the code below, we use the `sentence-transformers/all-mpnet-base-v2` embedder, which maps sentences and paragraphs to a 768-dimensional space.

\
In this code, we also use `CacheBackedEmbeddings`, a caching mechanism that stores the computed embeddings so that they can be reused later instead of being recomputed.
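The idea behind embedding caching can be sketched with a small in-memory example. This is only an illustration of the concept and not LangChain's implementation; `CachedEmbedder` and `toy_embed` are hypothetical names, and `toy_embed` stands in for a real embedding model.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function and reuse results for texts seen before."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}   # maps a hash of the text to its embedding
        self.misses = 0   # how many times the underlying model was called

    def embed(self, text):
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


def toy_embed(text):
    # Hypothetical stand-in for a real model: two crude text statistics
    return [float(len(text)), float(text.count(" "))]


toy_cached = CachedEmbedder(toy_embed)
first = toy_cached.embed("bitcoin halving")
second = toy_cached.embed("bitcoin halving")   # served from the cache
print(first, toy_cached.misses)
```

The second call does not reach the embedding function at all, which is what saves time when the same chunks are indexed again.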

\
Additionally, the vector store used in this project is FAISS (Facebook AI Similarity Search), a simple and efficient library for similarity search over vectors in large-scale datasets.
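Conceptually, a similarity search ranks the stored vectors by their distance to the query vector. The sketch below does this by brute force with Euclidean distance in plain Python; it only illustrates the idea and says nothing about how FAISS is implemented. The three-dimensional vectors and the `toy_store` entries are made up for the example.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, stored, k=2):
    """Return the texts of the k stored vectors closest to the query."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded chunks
toy_store = [
    ("halving cuts block reward", [1.0, 0.0, 0.0]),
    ("ETF inflows rise", [0.0, 1.0, 0.0]),
    ("miners' revenue after halving", [0.9, 0.1, 0.0]),
]

nearest = top_k([1.0, 0.05, 0.0], toy_store, k=2)
print(nearest)
```

A dedicated library becomes useful once the number of stored vectors is too large for this kind of exhaustive scan to stay fast.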

In [7]:
store = LocalFileStore("./cache/")

embedding_model_name = 'sentence-transformers/all-mpnet-base-v2'

# The model name must be passed as a keyword argument
embeddings_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
embedder = CacheBackedEmbeddings.from_bytes_store(embeddings_model, store, namespace=embedding_model_name)

vector_store = FAISS.from_documents(bitcoin_documents, embedder)


Testing the query embedding and similarity search in the vector store.



In [8]:
def search_top_similarities(query, core_embeddings_model, vector_store, k=4):
    print("\n*****************************************************")
    print("reference query:", query)
    embedding_vector = core_embeddings_model.embed_query(query)
    docs = vector_store.similarity_search_by_vector(embedding_vector, k=k)
    for page in docs:
        print(page.page_content)

# Example usage (embeddings_model is the HuggingFaceEmbeddings instance created above)
search_top_similarities("How much would be the peak of bitcoin chart in 2025?", embeddings_model, vector_store, k=2)
search_top_similarities("bitcoin price analysis by the end of 2024?", embeddings_model, vector_store, k=3)


*****************************************************
reference query: How much would be the peak of bitcoin chart in 2025?
content: Periodically, the product comparison website finder.com releases a new price prediction survey focusing on key cryptocurrencies and gathers a wide array of crypto and fintech experts for their perspectives. According to the most recent report on bitcoin forecasts, the panel of experts anticipates that bitcoin could attain a value in the six-figure range by 2024. Finder’s latest survey on bitcoin’s (BTC) price involved a unique panel of 31 crypto and fintech experts. The results suggest that BTC could reach a peak of $122,000 in 2024, and then settle at $109,000 by year’s end. Nick Ranga, the senior crypto and forex analyst at Forextraders, attributes the current uptick in BTC’s value to growing institutional engagement. Ranga is among those experts who foresee BTC reaching its zenith once more in 2024, ultimately closing the year around the $100,000 mark

Load the Llama 2 chat model from HuggingFace with 4-bit quantization. The model is gated, so access must be requested and the login step above must be completed before it can be downloaded.

In [9]:
Llama2_chat_model = "meta-llama/Llama-2-13b-chat-hf"
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# A token is required for the Llama 2 model because its access is restricted by Meta (remember to set the token permissions for this model)
model_config = transformers.AutoConfig.from_pretrained(Llama2_chat_model)

tokenizer = transformers.AutoTokenizer.from_pretrained(Llama2_chat_model)

model = transformers.AutoModelForCausalLM.from_pretrained(
    Llama2_chat_model,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

model.eval()




LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


In [10]:
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,
    temperature=0.1,
    max_new_tokens=256
)



In [11]:
llm = HuggingFacePipeline(pipeline=generate_text)



In [12]:
retriever = vector_store.as_retriever()

In [13]:
# callback handler needed for printing in the std out
handler = StdOutCallbackHandler()

custom_prompt_template = PromptTemplate(
    input_variables=["question", "context"],
    template="""
    As an knowledgeable analyst assistant on the market of bitcoin, use the following context news to analyze the market of bitcoin based on the question and answer it.\n
    Context: {context} \n\n
    Question: {question}\n\n
    Answer:
    """
)

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": custom_prompt_template},
    callbacks=[handler],
    return_source_documents=False)
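To make the role of the prompt template concrete, here is a minimal sketch of how the retrieved chunks and the question could be placed into the `{context}` and `{question}` slots. `build_prompt`, `toy_template` and `toy_chunks` are hypothetical names, and joining the chunks with a blank line is an assumption made for this example rather than something read from the chain's configuration.

```python
def build_prompt(template, question, retrieved_chunks, separator="\n\n"):
    """Fill the template's {context} and {question} slots."""
    context = separator.join(retrieved_chunks)
    return template.format(context=context, question=question)


# Shortened, hypothetical version of the template used in this notebook
toy_template = "Context: {context}\n\nQuestion: {question}\n\nAnswer:"

# Hypothetical retrieved chunks
toy_chunks = [
    "The block reward fell from 6.25 BTC to 3.125 BTC.",
    "Eleven spot bitcoin ETFs were approved in January 2024.",
]

prompt_text = build_prompt(toy_template, "What changed at the halving?", toy_chunks)
print(prompt_text)
```

The LLM only ever sees this single string, so the quality of the answer depends directly on which chunks the retriever returns.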

In [14]:
qa_with_sources_chain({"query": "How is the situation of bitcoin after the halving of april 2024?"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How is the situation of bitcoin after the halving of april 2024?',
 'result': '\n    As an knowledgeable analyst assistant on the market of bitcoin, use the following context news to analyze the market of bitcoin based on the question and answer it.\n\n    Context: anticipate a decline to $42K following the halving event. What do you think about the future of bitcoin’s price trajectory? Share your thoughts and opinions about this subject in the comments section below.\n\nWhat do you think about bitcoin’s market action post-halving? Share your thoughts and opinions about this subject in the comments section below.\n What do you think about bitcoin’s market action post-halving? Share your thoughts and opinions about this subject in the comments section below.\n\nCEO of Bitwise Asset Management, explained on social media platform X this week why he believes the upcoming bitcoin halving will be the most impactful one so far. “The April 2024 Bitcoin halving may be the most impact
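In the output above, the `result` field starts with the prompt itself, followed by the generated text. One possible way to keep only the model's answer is to cut everything up to the `Answer:` marker that ends the prompt template. The helper below is a hypothetical sketch; it assumes the marker does not already occur inside the retrieved context, and `sample_result` is a made-up string shaped like the outputs in this notebook.

```python
def extract_answer(result_text, marker="Answer:"):
    """Return the text after the first occurrence of the marker.

    If the marker is missing, the whole text is returned unchanged
    (apart from surrounding whitespace).
    """
    _, found, tail = result_text.partition(marker)
    return tail.strip() if found else result_text.strip()


# Made-up result string: echoed prompt followed by a generated answer
sample_result = (
    "\n    As an analyst, use the following context.\n"
    "    Context: some news text\n\n"
    "    Question: What happened?\n\n"
    "    Answer:\n     Bitcoin traded sideways.\n"
)

answer = extract_answer(sample_result)
print(answer)
```

It could be applied to the chain output as `extract_answer(response["result"])`, where `response` is the dictionary returned by the chain.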

In [15]:
qa_with_sources_chain({"query" : "when was the halving of bitcoin in 2024? And was it the last halving?"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'when was the halving of bitcoin in 2024? And was it the last halving?',
 'result': "\n    As an knowledgeable analyst assistant on the market of bitcoin, use the following context news to analyze the market of bitcoin based on the question and answer it.\n\n    Context: explained: “The next bitcoin halving is expected to occur around April 20, 2024. After this, the amount of bitcoin created with each new block will fall to 3.125 from 6.25, and daily issuance will fall to about 450 bitcoin from about 900. This process is scheduled to continue until the last bitcoin is mined around 2140.” Jacobs detailed in a video about the Bitcoin halving: “One of the reasons some people find bitcoin valuable is its scarcity. Unlike fiat currencies that can be printed at the discretion of governments or central banks, bitcoin cannot be created endlessly. In fact, bitcoin has a capped supply of 21 million coins.” Besides noting that the Bitcoin halving “creates more scarcity,” Jacobs stressed

In [16]:
qa_with_sources_chain({"query" : "How was the market of bitcoin in the month of may 2024?"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How was the market of bitcoin in the month of may 2024?',
 'result': "\n    As an knowledgeable analyst assistant on the market of bitcoin, use the following context news to analyze the market of bitcoin based on the question and answer it.\n\n    Context: aims to determine the price of bitcoin by the end of 2024. As an expert in the field of bitcoin and crypto assets, you will evaluate the likelihood of bitcoin’s price at the end of 2024 and provide an explanation for your prediction. The global macroeconomic landscape remains uncertain. The current date is April 28, 2024, and bitcoin is currently trading at $62,900 per unit. A total of 11 spot bitcoin exchange-traded funds were approved three months ago in the United States on Jan. 10, 2024. The block reward halving occurred on April 19, 2024, reducing the subsidy from 6.25 BTC to 3.125 BTC. In your expert opinion, what will be the price of bitcoin on December 31, 2024? We asked Anthropic’s Claude 3 Sonnet, Openai’s Chatgp

In [17]:
qa_with_sources_chain({"query" : "Based on the Halving of the bitcoin the market experienced in 2024, what is the expectation of the price of bitcoin in 2025?"})

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Based on the Halving of the bitcoin the market experienced in 2024, what is the expectation of the price of bitcoin in 2025?',
 'result': '\n    As an knowledgeable analyst assistant on the market of bitcoin, use the following context news to analyze the market of bitcoin based on the question and answer it.\n\n    Context: aims to determine the price of bitcoin by the end of 2024. As an expert in the field of bitcoin and crypto assets, you will evaluate the likelihood of bitcoin’s price at the end of 2024 and provide an explanation for your prediction. The global macroeconomic landscape remains uncertain. The current date is April 28, 2024, and bitcoin is currently trading at $62,900 per unit. A total of 11 spot bitcoin exchange-traded funds were approved three months ago in the United States on Jan. 10, 2024. The block reward halving occurred on April 19, 2024, reducing the subsidy from 6.25 BTC to 3.125 BTC. In your expert opinion, what will be the price of bitcoin on Dec