<div id = 'intro'
     style="font-size: 42px; font-weight: bold; color: #353935; text-shadow: 1px 2px 3px rgba(0,0,0,0.1);">
        <b>Introduction</b>
    <hr style="border: none;
               border-top: 1.25px solid #E0E0E0;
               width: 100%;
               margin-top: 20px;
               margin-bottom: 20px;
               margin-left: 0px;
               margin-right: auto;">
</div>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">Large language models (LLMs) are powerful tools that can answer questions, summarize long texts, translate documents from one language to another, and help us write code, among other tasks.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">However, LLMs have one major issue: <b>hallucinations</b>. A hallucination happens when an LLM generates text that sounds plausible but is false or unsupported, sometimes with no real connection to the user's prompt. Large language models have a hard time saying <i>"I don't know"</i> to questions they cannot answer.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;"><b>Retrieval-augmented generation (RAG)</b> is an AI framework with two main objectives: to improve the quality of the generated responses by connecting the model to an external source of knowledge, and to give users access to the model's sources so they can fact-check its answers.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">With RAG, we can also give the large language model access to <b>proprietary data</b> by connecting it to custom data sources from which it can retrieve information.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">The image below shows how RAG works. First, a user asks a question. Before reaching the model, the question goes to a <b>retriever</b>, which looks up documents relevant to the question in the <b>knowledge base</b>. The question and the retrieved documents are then sent to the LLM, which generates an answer grounded in the documents it received.</p>

<center>
    <img src = "https://assets-global.website-files.com/63f3993d10c2a062a4c9f13c/64593ba041a4ff8dfef73f30_1*LYApKuxzzmvFECqwYk61wg.png">
<p style = "font-size: 16px;
            font-family: 'Roboto', sans-serif;
            text-align: center;
            margin-top: 10px;">Source: <a href = "https://www.ml6.eu/blogpost/leveraging-llms-on-your-domain-specific-knowledge-base">ml6.eu</a></p>
</center>
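
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">The flow in the image can be sketched in a few lines of plain Python. This is only a toy illustration: the three-entry knowledge base, the word-overlap scoring, and the function names <code>retrieve</code> and <code>build_prompt</code> are assumptions made for this example, not the components used later in the notebook.</p>

```python
import re

# Hypothetical three-document knowledge base, for illustration only
KNOWLEDGE_BASE = {
    "Ethereum": "Ethereum was conceived in 2013 by programmer Vitalik Buterin.",
    "NEO (cryptocurrency)": "Neo was founded in 2014 by Da HongFei and Erik Zhang.",
    "Crypto.com": "Crypto.com is a cryptocurrency exchange company based in Singapore.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, knowledge_base, k=1):
    """Return the k (title, text) pairs sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        knowledge_base.items(),
        key=lambda item: len(question_words & tokenize(item[0] + " " + item[1])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, documents):
    """Combine the retrieved documents and the question into a single prompt."""
    context = "\n".join(f"[{title}] {text}" for title, text in documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "Who conceived Ethereum?"
documents = retrieve(question, KNOWLEDGE_BASE)
prompt = build_prompt(question, documents)
print(prompt)  # this prompt, not the bare question, is what the LLM would receive
```

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">A real retriever compares embedding vectors instead of counting shared words, but the order of the steps is the same: retrieve, build the prompt, generate.</p>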

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">In this notebook, we will implement a retrieval-augmented generation system for an LLM using the <a href="https://www.kaggle.com/datasets/lusfernandotorres/wikipedia-crypto-articles">Wikipedia Crypto Articles</a>, a dataset that I uploaded a few days ago here on Kaggle.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">Before getting our hands dirty with code, let's install some relevant packages. These are:</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">• <b><a href = "https://www.trychroma.com/">Chromadb</a>:</b> An open-source embedding database that lets us connect LLMs to knowledge bases by storing and querying embeddings together with their metadata.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">• <b><a href = "https://www.langchain.com/">LangChain</a>:</b> A framework for developing applications powered by LLMs.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">• <b><a href = "https://pypi.org/project/sentence-transformers/">Sentence Transformers</a>:</b> A framework that provides an easy method to compute dense vector representations for sentences, paragraphs, and images by leveraging pre-trained transformer models.</p>

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">• <b><a href = "https://github.com/TimDettmers/bitsandbytes">bitsandbytes</a>:</b> A library that reduces the memory footprint of large models by quantizing their weights to lower precision, such as the 4-bit format used in this notebook.</p>

In [1]:
# Auto DataViz tool
!pip install ydata-profiling

Collecting ydata-profiling
  Downloading ydata_profiling-4.7.0-py2.py3-none-any.whl (357 kB)
Collecting visions[type_image_path]<0.7.7,>=0.7.5 (from ydata-profiling)
  Downloading visions-0.7.6-py3-none-any.whl (104 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0.11.1 (from ydata-profiling)
  Downloading phik-0.12.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (686 kB)
Collecting seaborn<0.13,>=0.10.1 (from ydata-profiling)
  Downloading seaborn-0.12.2-py3-none-any.whl (293 kB)

In [2]:
# Installing Required Packages
!pip install chromadb
!pip install langchain
!pip install sentence_transformers
!pip install bitsandbytes

Collecting chromadb
  Downloading chromadb-0.4.24-py3-none-any.whl (525 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.110.2-py3-none-any.whl (91 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
  Downloading uvicorn-0.29.0-py3-none-any.whl (60 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.5.0-py2

In [3]:
# Importing Required modules

# Data Handling
import pandas as pd
import numpy as np

from ydata_profiling import ProfileReport

import torch
from torch import bfloat16
import transformers
from transformers import AutoTokenizer

from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

import warnings
warnings.filterwarnings("ignore")

In [4]:
# Checking if GPU is available
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(torch.cuda.current_device())
    total_memory = torch.cuda.get_device_properties(0).total_memory
    total_memory_gb = total_memory / (1024**3) # Converting bytes to GB
    print("GPU is available. \nUsing GPU")
    print("\nGPU Name:", gpu_name)
    print(f"Total GPU Memory: {total_memory_gb:.2f} GB")

    device = torch.device('cuda')
else:
    print("GPU is not available. \nUsing CPU")
    device = torch.device('cpu')

GPU is not available. 
Using CPU


In [5]:
# Loading data
df = pd.read_csv('/content/Wikipedia Crypto Articles.csv') #Select Path for Dataset

report = ProfileReport(df, title = 'Wikipedia Crypto Articles')

In [7]:
print('Dataframe Length:', len(df), 'rows')

df = df.dropna() # Handling Missing Values

print('Length After Dropping Empty Values:', len(df), 'rows')

Dataframe Length: 227 rows
Length After Dropping Empty Values: 218 rows


In [8]:
print('Title:', df.title.iloc[-1])
print('\n\n\n')
print(df.article.iloc[-1])

Title: NEO (cryptocurrency)




Neo is a blockchain-based cryptocurrency and application platform used to run smart contracts and decentralized applications. The project, originally named Antshares, was founded in 2014 by Da HongFei and Erik Zhang and rebranded as Neo in 2017. In 2017 and 2018, the cryptocurrency maintained some success in the Chinese market despite the recently-enacted prohibition on cryptocurrency in that country.


== Technical specifications ==
The Neo network runs on a proof-of-stake decentralized Byzantine fault tolerant (dBFT) consensus mechanism between a number of centrally approved nodes, and can support up to 10,000 transactions per second. The base asset of the Neo blockchain is the non-divisible Neo token which generates GAS tokens. These GAS tokens, a separate asset on the network, can be used to pay for transaction fees, and are divisible with the smallest unit of 0.00000001. The inflation rate of GAS is controlled with a decaying half-life algorithm tha

In [9]:
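# Wrapping the dataframe in a loader: each row's title becomes the page content,
# and the remaining columns (including the article text) are kept as metadata
articles = DataFrameLoader(df,
                           page_content_column = "title")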

In [10]:
# Loading entire dataframe into document format
document = articles.load()

In [11]:
# Splitting document into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                chunk_overlap = 20)
splitted_texts = splitter.split_documents(document)
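
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">To make <code>chunk_size</code> and <code>chunk_overlap</code> concrete, here is a simplified sketch that cuts a text into fixed-size windows. It is not the algorithm used by <code>RecursiveCharacterTextSplitter</code>, which prefers to break on separators such as paragraph and line breaks; the function name <code>split_with_overlap</code> and the placeholder text are assumptions made for this example.</p>

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters, where each
    window starts chunk_overlap characters before the previous one ended."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2,500 placeholder characters
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=20)

print([len(chunk) for chunk in chunks])   # [1000, 1000, 540]
print(chunks[0][-20:] == chunks[1][:20])  # True: neighbouring chunks share 20 characters
```

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in both neighbouring chunks.</p>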

In [12]:
# Loading model to create the embeddings
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
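
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">An embedding model maps each text to a vector (384 dimensions for <code>all-MiniLM-L6-v2</code>), and texts with similar meaning end up with vectors pointing in similar directions. The sketch below shows how that closeness is commonly measured with cosine similarity. The 3-dimensional vectors are made-up values for illustration, not real model outputs.</p>

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; the values are made up for illustration
query_vector = [0.9, 0.1, 0.0]
similar_text_vector = [0.8, 0.2, 0.1]
unrelated_text_vector = [0.0, 0.1, 0.9]

# The first score is close to 1, the second close to 0
print(cosine_similarity(query_vector, similar_text_vector))
print(cosine_similarity(query_vector, unrelated_text_vector))
```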

In [13]:
# Creating an indexed database of the embedded data with ChromaDB
chroma_database = Chroma.from_documents(splitted_texts,
                                        embedding_model,
                                        persist_directory = 'chroma_db')

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">You can see below that <code>chroma_database</code> is a vector store.</p>

In [14]:
# Visualizing the database
chroma_database

<langchain_community.vectorstores.chroma.Chroma at 0x7cd4f454c550>

In [15]:
# Defining retriever
retriever = chroma_database.as_retriever()
retriever

VectorStoreRetriever(tags=['Chroma', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7cd4f454c550>)
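
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">Conceptually, a vector store keeps every text next to its vector and answers a query by returning the <code>k</code> closest entries. The sketch below mimics that behavior with word-count vectors in plain Python. The class <code>ToyVectorStore</code> and the <code>embed</code> function are hypothetical stand-ins written for this example; Chroma uses the neural embeddings and an approximate nearest-neighbor index instead.</p>

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a sparse word-count vector (a stand-in for a neural encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store that keeps each text next to its vector."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=4):
        """Return the k stored texts whose vectors are closest to the query vector."""
        query_vector = embed(query)
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query_vector, entry[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]

# Article titles taken from the outputs shown in this notebook
store = ToyVectorStore([
    "Ethereum", "Ethereum Classic", "History of bitcoin",
    "Cryptocurrency exchange", "Crypto.com", "Initial exchange offering",
    "NEO (cryptocurrency)",
])
print(store.similarity_search("Who conceived Ethereum?", k=2))
# ['Ethereum', 'Ethereum Classic']
```

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">The default of <code>k=4</code> in the sketch mirrors the four documents that the searches later in this notebook return.</p>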

In [16]:
# Configuring BitsAndBytesConfig for loading model in an optimal way
quantization_config = transformers.BitsAndBytesConfig(load_in_4bit = True,
                                        bnb_4bit_quant_type = 'nf4',
                                        bnb_4bit_use_double_quant = True,
                                        bnb_4bit_compute_dtype = bfloat16)
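
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">A quick back-of-the-envelope calculation shows why 4-bit loading matters. The sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of roughly 7.24 billion for Mistral 7B is an approximate figure assumed for this example, and the estimate ignores quantization constants, activations, and the key-value cache.</p>

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Memory needed to hold the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1024**3

# Assumption: roughly 7.24 billion parameters for Mistral 7B (approximate figure)
NUM_PARAMETERS = 7.24e9

for label, bits in [("16-bit (bfloat16)", 16), ("8-bit", 8), ("4-bit (nf4)", 4)]:
    print(f"{label:>18}: {weight_memory_gib(NUM_PARAMETERS, bits):.1f} GiB")
```

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">Going from 16 bits to 4 bits per weight cuts the weight memory by a factor of four, which is what allows a 7B model to fit on a single consumer or free-tier GPU.</p>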

In [None]:
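# Loading Mistral 7B
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bitsandbytes 4-bit loading needs a CUDA GPU, so quantization_config
# is applied only when one is available; otherwise load unquantized
if torch.cuda.is_available():
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config = quantization_config,
                                                 device_map = "auto")
else:
    model = AutoModelForCausalLM.from_pretrained(model_id)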

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)


llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [None]:
# Defining the QnA chain
QnA = RetrievalQA.from_chain_type(llm = llm,
                                 chain_type = 'stuff',
                                 retriever = retriever,
                                 verbose = False)
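
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">The <code>'stuff'</code> chain type places ("stuffs") all retrieved documents into a single prompt together with the question. The sketch below shows the idea in plain Python. The template wording is modeled on LangChain's default question-answering prompt and should be treated as an approximation; the function name <code>stuff_documents</code> and the two snippets are assumptions made for this example.</p>

```python
# Wording modeled on LangChain's default question-answering prompt; treat it as an approximation
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)

def stuff_documents(question, page_contents, separator="\n\n"):
    """'Stuff' every retrieved text into one context block and fill in the prompt."""
    context = separator.join(page_contents)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Hypothetical retrieved snippets, paraphrasing article text shown in this notebook
retrieved = [
    "Ethereum was conceived in 2013 by programmer Vitalik Buterin.",
    "Ethereum Classic is a blockchain-based distributed computing platform.",
]
prompt = stuff_documents("Who conceived Ethereum?", retrieved)
print(prompt)
```

<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">An instruction of this kind is a plausible reason why the chain can reply <i>"I don't know"</i> when the retrieved context does not contain the answer. Because every document goes into one prompt, this approach only works while the combined text fits in the model's context window.</p>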

In [None]:
# Defining a function that runs a query through the QnA chain and prints the answer
def get_answers(QnA, query):
    answer = QnA.run(query)
    print(f"\033[1mQuery:\033[0m {query}\n")
    print(f"\033[1mAnswer:\033[0m ", answer)

In [None]:
query = """Who created the Bitcoin? When was it created?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: Who created the Bitcoin? When was it created?

Answer: Bitcoin was created by an unknown person or group of people using the name Satoshi Nakamoto. It was created in 2009.


In [None]:
query = """What was the biggest scam in the history of cryptocurrencies?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: What was the biggest scam in the history of cryptocurrencies?

Answer: The biggest scam in the history of cryptocurrencies was the 2021 Squid Game cryptocurrency scam.


In [None]:
query = """How much will one Bitcoin cost in 2030?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: How much will one Bitcoin cost in 2030?

Answer: I don't know.


In [None]:
query = """Cite the names of five relevant people in crypto?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: Cite the names of five relevant people in crypto?

Answer: 1. Andreas Antonopolous 2. Brian Armstrong 3. Changpeng Zhao 4. Andreas Antonopolous 5. Andreas Antonopolous


In [None]:
query = """What exchanges can I use to buy crypto?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: What exchanges can I use to buy crypto?

Answer: Crypto.com is a cryptocurrency exchange that offers an initial exchange offering (IEO) for various cryptocurrencies. It is available in Europe and other parts of the world.


In [None]:
query = """Who conceived Ethereum?"""
get_answers(QnA, query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

Query: Who conceived Ethereum?

Answer: Vitalik Buterin conceived Ethereum.


In [None]:
# Obtaining the source and documents searched
docs = chroma_database.similarity_search(query)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    # The title is the page content; the article text lives in the metadata
    print("\nSource (Article Title):", doc.page_content)
    print("\nText", doc.metadata['article'][:350] + ". . .")
    print('\n\n\n')

Query: Who conceived Ethereum?
Retrieved documents: 4

Source (Article Title): Ethereum

Text Ethereum is a decentralized blockchain with smart contract functionality. Ether (Abbreviation: ETH;  sign: Ξ) is the native cryptocurrency of the platform. Among cryptocurrencies, ether is second only to bitcoin in market capitalization. It is open-source software.
Ethereum was conceived in 2013 by programmer Vitalik Buterin. Additional founders of. . .





Source (Article Title): History of bitcoin

Text Bitcoin is a cryptocurrency, a digital asset that uses cryptography to control its creation and management rather than relying on central authorities. Originally designed as a medium of exchange, Bitcoin is now primarily regarded as a store of value. The history of bitcoin started with its invention and implementation by Satoshi Nakamoto, who integ. . .





Source (Article Title): Ethereum Classic

Text Ethereum Classic is a blockchain-based distributed computing platform that offers smart 
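
<p style="font-family: 'Roboto', Arial, sans-serif; font-size: 20px; color: #353935;">When showing sources to a user for fact-checking, it can help to list each source once and to shorten long texts safely. The sketch below does both in plain Python. The <code>RetrievedDocument</code> class is a hypothetical stand-in that mirrors the layout used in this notebook (title as page content, article text in the metadata), and the helper names are assumptions made for this example.</p>

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedDocument:
    """Hypothetical stand-in for a retrieved document: title as page content,
    article text in the metadata, mirroring the layout used in this notebook."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def preview(text, limit=350):
    """Return short texts unchanged, longer ones cut to `limit` characters plus an ellipsis."""
    return text if len(text) <= limit else text[:limit] + ". . ."

def unique_sources(docs):
    """List each source title once, in retrieval order."""
    seen = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.append(doc.page_content)
    return seen

docs = [
    RetrievedDocument("Ethereum", {"article": "Ethereum is a decentralized blockchain with smart contract functionality."}),
    RetrievedDocument("Ethereum Classic", {"article": "Ethereum Classic is a blockchain-based distributed computing platform."}),
    RetrievedDocument("Ethereum", {"article": "Ethereum is a decentralized blockchain with smart contract functionality."}),
]

print(unique_sources(docs))  # ['Ethereum', 'Ethereum Classic']
print(preview(docs[0].metadata["article"], limit=40))
```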

In [None]:
# Repeating the document search with a different query
query = """What exchanges can I use to buy crypto?"""
docs = chroma_database.similarity_search(query)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    # The title is the page content; the article text lives in the metadata
    print("\nSource (Article Title):", doc.page_content)
    print("\nText", doc.metadata['article'][:350] + ". . .")
    print('\n\n\n')

Query: What exchanges can I use to buy crypto?
Retrieved documents: 4

Source (Article Title): Cryptocurrency exchange

Text A cryptocurrency exchange, or a digital currency exchange (DCE), is a business that allows customers to trade cryptocurrencies or digital currencies for other assets, such as conventional fiat money or other digital currencies. Exchanges may accept credit card payments, wire transfers or other forms of payment in exchange for digital currencies or . . .





Source (Article Title): Crypto.com

Text Crypto.com is a cryptocurrency exchange company based in Singapore. As of June 2023, the company reportedly had 80 million customers and 4,000 employees. The exchange issues its own exchange token named Cronos (CRO).


== History ==
The company was initially founded in Hong Kong by Bobby Bao, Gary Or, Kris Marszalek, and Rafael Melo in 2016 as "Mon. . .





Source (Article Title): Initial exchange offering

Text An Initial exchange offering (IEO) is the cryptocurrency