# Article Retrieval System using Mistral7b, LangChain and ChromaDB

In this notebook, I'll build an Article Retrieval System using Mistral7b, LangChain and ChromaDB. I'll use the Chroma vector database to store the data, perform chunking and apply Retrieval Augmented Generation (RAG) in practice. I'll build both article search based on a query and a question-answering (QA) system. The data is a comprehensive collection of blog posts sourced from Medium, focusing specifically on articles published under the "Towards Data Science" publication. I'll also give five example queries, both for searching articles and for answering questions.


# Setup

Firstly, it's necessary to install and then import all of the libraries and packages that we'll use in the project.

In [1]:
import warnings
warnings.filterwarnings('ignore')

!pip install -q -U transformers accelerate bitsandbytes langchain tiktoken sentence-transformers chromadb

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.8.2 requires keras-core, which is not installed.
keras-nlp 0.8.2 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is in

In [2]:
import pandas as pd
import torch
from torch import bfloat16
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import langchain
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA, LLMChain
from langchain.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/1300-towards-datascience-medium-articles-dataset/medium.csv
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00002-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer_config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model.bin.index.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00001-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/special_tokens_map.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/.gitattributes
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.model
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/generation_config.json


* <b>torch.backends.cuda.enable_mem_efficient_sdp(False)</b>: This line disables PyTorch's memory-efficient kernel for scaled dot product (SDP) attention on CUDA. Here SDP refers to the attention operation itself (softmax(QK^T / sqrt(d)) V), not to a scheme for parallelizing work across GPUs.

* <b>torch.backends.cuda.enable_flash_sdp(False)</b>: Similarly, this line disables the FlashAttention kernel for scaled dot product attention on CUDA. With both fused kernels disabled, PyTorch falls back to its default math implementation, which is typically slower and uses more memory but is the most widely compatible.

I disabled these features because in my case it was necessary in order to work with the Mistral7b model later on.

In [3]:
#configure torch
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
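
To make concrete what these two switches affect, here is a minimal pure-Python sketch of scaled dot-product attention, the operation that the fused kernels accelerate. It is only an illustration of the underlying math, not PyTorch's implementation, and the toy matrices `Q`, `K` and `V` are made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Plain 'math' attention: softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(keys[0])
    output = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

# Toy inputs (hypothetical): two queries, two keys/values of dimension 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Each output row is a weighted average of the value rows, with more weight on the value whose key best matches the query.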

# Data loading

My system indexes articles from the "1300 Towards Data Science Medium Articles" dataset, so I'll import the data and load it.

In [4]:
#import csv
df = pd.read_csv('/kaggle/input/1300-towards-datascience-medium-articles-dataset/medium.csv')

The data is stored as a pandas DataFrame, so I use the DataFrameLoader. The page_content_column parameter tells the loader which column becomes each document's page_content; here it's the article "Title". The remaining columns, including the article "Text", are stored in each document's metadata.

In [5]:
#load data
articles = DataFrameLoader(df, page_content_column = "Title")
document = articles.load()
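
As a rough illustration of that row-to-document mapping, the sketch below mimics it with plain dictionaries. The function `rows_to_documents` and the two sample rows are hypothetical stand-ins, not LangChain code; only the column names "Title" and "Text" come from the notebook.

```python
def rows_to_documents(rows, page_content_column):
    """Mimic the mapping: one column becomes the content, the rest becomes metadata."""
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items() if key != page_content_column}
        documents.append({"page_content": row[page_content_column], "metadata": metadata})
    return documents

# Hypothetical rows standing in for the CSV (columns "Title" and "Text")
rows = [
    {"Title": "Logistic Regression", "Text": "Contrary to its name..."},
    {"Title": "Working With Time Series Data", "Text": "Data scientists study..."},
]
docs = rows_to_documents(rows, page_content_column="Title")
```

With "Title" as the content column, `docs[0]["page_content"]` is the title and the article body is reachable through `docs[0]["metadata"]["Text"]`.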

# Model

In [6]:
model_path="/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"

#initialize the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    trust_remote_code = True
)

#initialize the tokenizer
tokenizer=AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


# Chunking

Chunking is the process of splitting each document into smaller pieces, or "chunks", so that every piece fits within the embedding model's input limit and can be indexed and retrieved on its own.

In my case, chunking helps with text preprocessing, content selection and context segmentation.

I use a token-based approach because I want chunk sizes measured in tokens rather than characters.

* <b>chunk_size:</b> This parameter specifies the maximum size of each chunk in tokens (the subword units produced by the tokenizer). Here it's 500.
* <b>chunk_overlap:</b> This parameter specifies how many tokens consecutive chunks share. A value of 0 means no overlap, so each chunk starts exactly where the previous one ends. Here it's 75, so neighbouring chunks share 75 tokens.

In [7]:
#Split text into smaller chunks
splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=75)
splitted_texts = splitter.split_documents(document)
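
The interaction between chunk_size and chunk_overlap can be sketched as a sliding window. This is a simplified illustration rather than the splitter's actual code: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real subword tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks

# Whitespace "tokens" are a stand-in (assumption); a real splitter uses a subword tokenizer
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
# chunks == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

With the settings used in this notebook (chunk_size=500, chunk_overlap=75), the same logic would advance the window by 425 tokens at a time.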

# Storing the data

Now we need to put the chunks into an index so that we can retrieve them easily when we want to find something in the documents or answer questions. We use an embedding model and a vector database for this purpose.

In [8]:
#initialize an embedding model 
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

#create chroma database from provided documents
chroma_database = Chroma.from_documents(splitted_texts,
                                      embedding_model,
                                      persist_directory = 'chroma_db')

#convert chroma database into retriever
retriever = chroma_database.as_retriever()


# Searching the articles

I'll run a similarity search against the Chroma database to retrieve the documents most similar to a query. I'll print the query and the retrieved documents, including their titles and snippets of text. Each search returns three highly related articles. The number of retrieved articles can be changed through the 'k' parameter of similarity_search().

In [9]:
def search_articles(query):
    docs = chroma_database.similarity_search(query, k=3)
    print(f'Query: {query}')
    print(f'Retrieved documents: {len(docs)}')
    for doc in docs:
        # page_content holds the title; the article body lives in the metadata
        print("\nSource (Article Title):", doc.page_content)
        print("\nText: ", doc.metadata['Text'][:350] + ". . .")
        print('\n\n\n')
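
Conceptually, a similarity search scores every stored vector against the query vector and keeps the k best. The sketch below shows this with cosine similarity on hand-made 3-dimensional vectors. The helper names and the vectors are hypothetical, and the distance metric that Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, index, k=3):
    """Return the k (title, score) pairs most similar to the query vector."""
    scored = [(title, cosine_similarity(query_vector, vector))
              for title, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hand-made vectors (hypothetical); a real embedding model outputs hundreds of dimensions
index = {
    "Logistic Regression": [0.9, 0.1, 0.0],
    "Understanding Backpropagation Algorithm": [0.1, 0.9, 0.1],
    "Working With Time Series Data": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

Raising `k` simply keeps more of the sorted list, which is the same knob as the `k` argument in `search_articles`.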

In [10]:
query = "What is Word2Vec?"
search_articles(query)

Query: What is Word2Vec?
Retrieved documents: 3

Source (Article Title): A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model

Text:  1. Introduction of Word2vec

Word2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to d. . .





Source (Article Title): Analogies from Word Vectors?

Text:  Another common example of analogies is the derivation of capitals. Let’s see what are the candidates for the capital of Austria:

Candidates for Austria-Germany+Berlin from German Wikipedia corpus

This time we have a hit, indeed Vienna is to Austria as Berlin is to Germany. The list of other candidates also makes sense, we find five capitals of Au. . .





Source (Article Title): Word Clouds in Python: Comprehensive Example

Te

In [11]:
query = "How to implement Gradient Descent and Backpropagation?"
search_articles(query)

Query: How to implement Gradient Descent and Backpropagation?
Retrieved documents: 3

Source (Article Title): A Step-by-Step Implementation of Gradient Descent and Backpropagation

Text:  A Step-by-Step Implementation of Gradient Descent and Backpropagation

The original intention behind this post was merely me brushing upon mathematics in neural network, as I like to be well versed in the inner workings of algorithms and get to the essence of things. I then think I might as well put together a story rather than just revisiting the . . .





Source (Article Title): Understanding Backpropagation Algorithm

Text:  Backpropagation algorithm is probably the most fundamental building block in a neural network. It was first introduced in 1960s and almost 30 years later (1989) popularized by Rumelhart, Hinton and Williams in a paper called “Learning representations by back-propagating errors”.

The algorithm is used to effectively train a neural network through a. . .





Source (Article Ti

In [12]:
query = "Describe Logistic Regression"
search_articles(query)

Query: Describe Logistic Regression
Retrieved documents: 3

Source (Article Title): Logistic Regression

Text:  Logistic Regression

Contrary to its name logistic regression is a classification algorithm. Given an input example, a logistic regression model assigns the example to a relevant class.

A note on the notation. x_{i} means x subscript i and x_{^th} means x superscript th.

Quick Review of Linear Regression

Linear Regression is used to predict a re. . .





Source (Article Title): Logistic Regression in Machine Learning using Python

Text:  Logistic Regression in Machine Learning using Python

Learn how logistic regression works and how you can easily implement it from scratch using python as well as using sklearn. Adarsh Menon · Follow Published in Towards Data Science · 7 min read · Dec 27, 2019 -- 2 Listen Share

In statistics logistic regression is used to model the probability of. . .





Source (Article Title): Logistic Regression from Scratch in R

Text:  Introductio

In [13]:
query = "Why should I use Cython?"
search_articles(query)

Query: Why should I use Cython?
Retrieved documents: 3

Source (Article Title): Use Cython to get more than 30X speedup on your Python code

Text:  Cython will give your Python code super-car speed

Want to be inspired? Come join my Super Quotes newsletter. 😎

Python is a community favourite programming language! It’s by far one of the easiest to use as code is written in an intuitive, human-readable way.

Yet you’ll often hear the same complaint about Python over and over again, especially fr. . .





Source (Article Title): PyTorch 1.3 — What’s new?. Support for Android and iOS, Named…

Text:  PyTorch 1.3 — What’s new?

Facebook just released PyTorch v1.3 and it is packed with some of the most awaited features. The three most attractive ones are:

Named Tensor — Something that would make the life of machine learning practitioners much easier. Quantization — For performance critical systems like IoT devices and embedded systems. Mobile Su. . .





Source (Article Title): Why is Pyth

In [14]:
query = "Give me a list of examples where I can apply time series clustering"
search_articles(query)

Query: Give me a list of examples where I can apply time series clustering
Retrieved documents: 3

Source (Article Title): Time Series Clustering and Dimensionality Reduction

Text:  Time Series must be handled with care by data scientists. This kind of data contains intrinsic information about temporal dependency. it’s our work to extract these golden resources, where it is possible and useful, in order to help our model to perform the best.

With Time Series I see confusion when we face a problem of dimensionality reduction o. . .





Source (Article Title): Working With Time Series Data

Text:  Working With Time Series Data

NYC’s daily temperature chart (November 1, 2019 to December 11, 2019) produced with Matplotlib

Data scientists study time series data to determine if a time based trend exists. We can analyze hourly subway passengers, daily temperatures, monthly sales, and more to see if there are various types of trends. These trend. . .





Source (Article Title): Cluster a

# Question-answering (QA) system

My QA system uses a prompt template for generating prompts, a language model chain for generating answers, and a retrieval-augmented generation chain to combine retrieval and generation approaches in the QA process.

* <b>llm_chain:</b> sets up a generation-only question-answering chain using the LLMChain class. It uses the language model (llm) and the prompt template, but not the retriever.
* <b>rag_chain:</b> sets up a retrieval-augmented generation (RAG) chain. It passes the documents returned by the retriever into the same prompt template as context and then generates the answer with the language model.

As the logistic regression example suggests, the answers with RAG may be more precise.

In [15]:
#define a configuration for 4-bit quantization (note: it is not passed to from_pretrained above, so the model stays in bfloat16)
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                        bnb_4bit_quant_type = 'nf4',
                                        bnb_4bit_use_double_quant = True,
                                        bnb_4bit_compute_dtype = bfloat16)

#create a text generation pipeline using Mistral7b
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
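
To see why a 4-bit configuration like the one defined in this cell can be attractive, here is a back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count is an assumption (a round 7 billion, suggested by the model name), and the estimate ignores quantization constants, activations and the KV cache.

```python
def weight_memory_gb(num_parameters, bits_per_parameter):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_parameters * bits_per_parameter / 8 / 1e9

# Assumption: roughly 7 billion parameters, a round figure suggested by the model name
NUM_PARAMETERS = 7e9
bf16_gb = weight_memory_gb(NUM_PARAMETERS, 16)  # bfloat16 uses 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMETERS, 4)    # 4-bit quantization uses 4 bits per weight
print(f"bfloat16: ~{bf16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their bfloat16 size, provided the configuration is actually passed in when the model is loaded.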



In [16]:
# build prompt
prompt_template = """
### [INST] 
Instruction: Answer the question based on your data science knowledge. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]"""

# create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# create llm chain 
llm_chain = LLMChain(llm=llm, prompt=prompt)

# create rag chain 
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

def generate_answer(query):
    print('Query: '+query)
    print('\n')
    print('Answer: '+llm_chain.invoke({"context":"", 
                      "question": query})['text'].split('[/INST]')[1].strip())
    print('\n\n')
    print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())
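
The difference between the two chains comes down to what ends up in the {context} slot of the prompt, plus a final step that cuts the echoed prompt off the generated text. The sketch below walks through both steps with plain strings. `build_prompt`, `extract_answer`, the shortened template and the sample strings are all hypothetical and only mirror the structure of the cell above.

```python
# Shortened, hypothetical template with the same two slots as the notebook's prompt
PROMPT_TEMPLATE = "[INST] Context: {context}\nQuestion: {question} [/INST]"

def build_prompt(question, context_chunks):
    """Fill the template; an empty list of chunks reproduces the no-retrieval case."""
    context = "\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

def extract_answer(full_text):
    """Keep only what follows the closing instruction tag, as generate_answer does."""
    return full_text.split("[/INST]")[1].strip()

question = "Describe Logistic Regression"
plain_prompt = build_prompt(question, [])
rag_prompt = build_prompt(question, ["Logistic regression is a classification algorithm."])

# With return_full_text=True the pipeline output starts with the prompt itself
simulated_output = rag_prompt + " It assigns an input to a class."
answer = extract_answer(simulated_output)
```

Here `answer` contains only the text after `[/INST]`, which is why `generate_answer` splits on that tag before printing.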

In [17]:
query = "What is Word2Vec?"
generate_answer(query)

Query: What is Word2Vec?


Answer: Word2Vec is a natural language processing technique used for representing words as vectors in a high-dimensional space. It is a type of neural network that learns word embeddings, which are dense, low-dimensional representations of words that capture semantic relationships between them. The idea behind Word2Vec is to map words to continuous vectors such that the similarity between two words can be measured by the distance between their corresponding vectors in the vector space. This allows for more efficient and accurate text classification, sentiment analysis, machine translation, and other NLP tasks.



Answer with RAG: Word2Vec is a neural network architecture that is commonly used for learning word embeddings, which are dense vector representations of words that capture semantic relationships between them. Word2Vec uses a two-layer neural network to map words to vectors in a high-dimensional space. The input to the network is a corpus of text, and

In [18]:
query = "How to implement Gradient Descent and Backpropagation?"
generate_answer(query)

Query: How to implement Gradient Descent and Backpropagation?


Answer: Gradient descent and backpropagation are two fundamental techniques used in machine learning for optimizing models. Here's a step-by-step guide on how to implement them:

1. Define the objective function: The first step is to define the objective function that you want to minimize or maximize. This could be the cost function of a linear regression model, the cross-entropy loss of a neural network, or any other function that measures the performance of your model.
2. Initialize the weights: Next, initialize the weights of your model randomly or using some pre-trained values. The weights represent the parameters of your model and will be updated during training.
3. Choose the learning rate: The learning rate is a hyperparameter that determines the step size at each iteration of gradient descent. A small learning rate may result in slow convergence, while a large learning rate may cause the model to diverge. You can e

In [19]:
query = "Describe Logistic Regression"
generate_answer(query)

Query: Describe Logistic Regression


Answer: Logistic regression is a type of supervised learning algorithm used for classification problems, where the goal is to predict the probability of an event occurring based on a set of input features. It is a statistical method that uses a logistic function to model the relationship between the input variables and the output variable. The logistic function maps any input value to a value between 0 and 1, which can be interpreted as the probability of the event occurring.

In logistic regression, the model is trained by minimizing the cross-entropy loss function, which measures the difference between the predicted probabilities and the true labels. The coefficients of the logistic function are learned during training, and they represent the impact of each input feature on the probability of the event occurring. Once the model is trained, it can be used to make predictions on new data by applying the logistic function to the input values and int

In [20]:
query = "Why should I use Cython?"
generate_answer(query)

Query: Why should I use Cython?


Answer: Cython is a language that allows you to write Python code that can be compiled into machine code, which can run much faster than pure Python code. This can be useful for performance-critical applications where speed is important, such as scientific computing or data analysis.

One of the main advantages of using Cython is that it allows you to take advantage of the performance benefits of C and C++ without having to learn those languages. Cython is similar to Python in many ways, but it also has some features that are specific to C and C++, such as support for pointers and low-level memory management.

Another advantage of using Cython is that it can be used with popular Python libraries such as NumPy and Pandas, which can significantly improve the performance of these libraries when working with large datasets.

Overall, Cython can be a powerful tool for improving the performance of Python code, especially for applications that require high le

In [21]:
query = "Give me a list of examples where I can apply time series clustering"
generate_answer(query)

Query: Give me a list of examples where I can apply time series clustering


Answer: Time series clustering is a technique used in data science to group similar time series data together based on their patterns and trends over time. Here are some examples of where you can apply time series clustering:

1. Stock market analysis: You can use time series clustering to group stocks with similar price movements together, allowing for better investment strategies.
2. Customer behavior analysis: By analyzing customer purchase patterns over time, you can cluster customers into different segments based on their buying habits, which can be useful for marketing and sales purposes.
3. Weather forecasting: Time series clustering can be used to group weather data from different locations together based on their similarities, which can help improve weather forecasting accuracy.
4. Healthcare monitoring: By analyzing patient health data over time, you can cluster patients with similar health condition