# Article Retrieval System using Mistral7b and LangChain

In this notebook, I'll build an article retrieval system using Mistral7b and LangChain. I'll also use the Chroma vector database to store the data, split the documents into chunks, and apply Retrieval-Augmented Generation (RAG) in practice. The system covers two tasks: searching for articles that match a query, and question answering (QA). The data is a collection of blog posts from Medium, specifically articles published under the "Towards Data Science" publication. I'll run five example queries for article search and the same five for question answering.


# Setup

First, we need to install and then import all of the libraries and packages used in the project.

In [27]:
import warnings
warnings.filterwarnings('ignore')

!pip install -q -U transformers accelerate bitsandbytes langchain tiktoken sentence-transformers chromadb


In [28]:
import pandas as pd
import torch
from torch import bfloat16
import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import langchain
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA, LLMChain
from langchain.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00002-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer_config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model.bin.index.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00001-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/special_tokens_map.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/.gitattributes
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.model
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/generation_config.json
/kaggle/input/1300-towards-datascience-medium-articles-dataset/medium.csv


* <b>torch.backends.cuda.enable_mem_efficient_sdp(False)</b>: This line tells PyTorch whether it may use the memory-efficient kernel for scaled dot-product attention (SDP) on CUDA. SDP here refers to the attention computation inside the transformer, not to any form of multi-GPU parallelism. Passing False disables the memory-efficient kernel.

* <b>torch.backends.cuda.enable_flash_sdp(False)</b>: Similarly, this line tells PyTorch whether it may use the FlashAttention kernel for scaled dot-product attention on CUDA. It is another fused implementation of the same computation with different trade-offs. Passing False disables it.

I disabled these features because, in my case, this was necessary to work with the Mistral7b model later on. With both kernels disabled, PyTorch falls back to its standard (math) attention implementation.

In [29]:
#configure torch
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)
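
For readers unfamiliar with the abbreviation, SDP in these flag names stands for scaled dot-product attention. The sketch below is a plain-Python reference of that computation, softmax(QK^T / sqrt(d))V, on toy lists. It is only an illustration of what the fused CUDA kernels compute; the function names and the toy inputs are assumptions made for this example, not PyTorch code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Reference attention: softmax(Q K^T / sqrt(d)) V, one query row at a time."""
    d = len(K[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted average of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# toy inputs (hypothetical): one query, two keys, two values
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = scaled_dot_product_attention(Q, K, V)
print(result)
```

The query matches the first key more closely, so the first value row receives the larger weight. The flash and memory-efficient kernels aim to produce the same result faster or with less memory.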

# Data loading

My system indexes articles from the "1300 Towards Data Science Medium Articles" dataset, so I'll import the data and load it.

In [30]:
#import csv
df = pd.read_csv('/kaggle/input/1300-towards-datascience-medium-articles-dataset/medium.csv')

The data is stored as a pandas DataFrame, so I use the DataFrameLoader. Note that page_content_column tells the loader which column becomes each document's page_content; here that is "Title". The remaining columns, including the article "Text", are stored in the document's metadata.

In [31]:
#load data
articles = DataFrameLoader(df, page_content_column = "Title")
document = articles.load()
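
To make the page_content versus metadata split concrete, here is a minimal plain-Python sketch of the mapping the loader performs. The `Doc` class and `rows_to_docs` function are hypothetical stand-ins written for this illustration, and the two rows are made-up placeholder data, not entries from the dataset.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def rows_to_docs(rows, page_content_column):
    """Turn each row into a Doc: one column is the content, the rest is metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(Doc(page_content=row[page_content_column], metadata=metadata))
    return docs

# placeholder rows with the same column names as the notebook's DataFrame
rows = [
    {"Title": "Title A", "Text": "Body of article A"},
    {"Title": "Title B", "Text": "Body of article B"},
]
docs = rows_to_docs(rows, page_content_column="Title")
print(docs[0].page_content)
print(docs[0].metadata)
```

One consequence of this choice is that the text that gets split and embedded later is the title, while the article body is reachable through metadata["Text"], which is how the search cells print it.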

# Model

In [32]:
model_path="/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1"

#initialize the model
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    trust_remote_code = True
)

#initialize the tokenizer
tokenizer=AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


# Chunking

In a RAG pipeline, chunking means splitting each document into smaller passages before embedding, so that the retriever can return focused pieces of text that fit into the LLM prompt. (This is different from chunking in the linguistic sense, which groups the words of a sentence into noun, verb, or prepositional phrases.)

In my case, chunking helps with text preprocessing, content selection, and context segmentation.

I use a token-based approach (TokenTextSplitter), because I want chunk length to be measured in tokens rather than characters.

* <b>chunk_size:</b> This parameter specifies the maximum size of each chunk in tokens (words or subwords).
* <b>chunk_overlap:</b> This parameter specifies how many tokens consecutive chunks share. A value of 0 means no overlap, so each chunk starts exactly where the previous one ends; here I use 75.

In [33]:
#alternative: character-based splitting (kept for reference, not used)
#splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
#                                chunk_overlap = 20)
#splitted_texts = splitter.split_documents(document)

#split text into smaller chunks
splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=75)
splitted_texts = splitter.split_documents(document)
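
The effect of chunk_size and chunk_overlap can be sketched as a sliding window over a token list. This is a simplified illustration, not the library's implementation: `split_tokens` is a hypothetical helper, and the tokens are placeholder strings rather than the output of a real tokenizer.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks

# 12 placeholder tokens, windows of 5 tokens that share 2 tokens with the previous one
tokens = [f"t{i}" for i in range(12)]
chunks = split_tokens(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.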

# Storing the data

Now we need to put the chunks into an index so that we can retrieve them easily when we want to find something in the documents or answer questions. We use an embedding model and a vector database for this purpose.

In [34]:
#initialize an embedding model 
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

#create chroma database from provided documents
chroma_database = Chroma.from_documents(splitted_texts,
                                      embedding_model,
                                      persist_directory = 'chroma_db')

#convert chroma database into retriever
retriever = chroma_database.as_retriever()
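
The idea behind an embedding index can be shown with a toy example: every text becomes a vector, and a query returns the k texts whose vectors are closest to the query vector. Everything below is an assumption made for illustration: the word-count "embedding" stands in for a real sentence-embedding model, cosine similarity is one possible metric (the metric Chroma actually uses depends on its configuration), and the three texts are placeholders.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (stand-in for a real sentence-embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_search(index, query, k=3):
    """Return the k indexed texts most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# placeholder texts; the index stores each text next to its vector
texts = [
    "time series clustering",
    "logistic regression in python",
    "speed up python with cython",
]
index = [(t, embed(t)) for t in texts]
top = similarity_search(index, "why use cython", k=2)
print(top)
```

A real vector database adds persistence and approximate nearest-neighbour search on top of this idea, so that lookups stay fast on large collections.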

# Searching the articles

I'll run a similarity search against the Chroma database to retrieve the documents most similar to each query. For every query, I print the query, the number of retrieved documents, and each document's title together with the first 350 characters of its text. Each search returns three results; to change that number, adjust the 'k' parameter of similarity_search().

In [35]:
query = "What is Word2Vec?"

docs = chroma_database.similarity_search(query, k=3)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    print("\nSource (Article Title):", doc.page_content)
    print("\nText: ", doc.metadata['Text'][:350] + ". . .")
    print('\n\n\n')

Query: What is Word2Vec?
Retrieved documents: 3

Source (Article Title): A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model

Text:  1. Introduction of Word2vec

Word2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to d. . .





Source (Article Title): A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model

Text:  1. Introduction of Word2vec

Word2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to d. . .





Source (Article Title): Analogies fro

In [36]:
query = "How to implement Gradient Descent and Backpropagation?"

docs = chroma_database.similarity_search(query, k=3)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    print("\nSource (Article Title):", doc.page_content)
    print("\nText: ", doc.metadata['Text'][:350] + ". . .")
    print('\n\n\n')

Query: How to implement Gradient Descent and Backpropagation?
Retrieved documents: 3

Source (Article Title): A Step-by-Step Implementation of Gradient Descent and Backpropagation

Text:  A Step-by-Step Implementation of Gradient Descent and Backpropagation

The original intention behind this post was merely me brushing upon mathematics in neural network, as I like to be well versed in the inner workings of algorithms and get to the essence of things. I then think I might as well put together a story rather than just revisiting the . . .





Source (Article Title): A Step-by-Step Implementation of Gradient Descent and Backpropagation

Text:  A Step-by-Step Implementation of Gradient Descent and Backpropagation

The original intention behind this post was merely me brushing upon mathematics in neural network, as I like to be well versed in the inner workings of algorithms and get to the essence of things. I then think I might as well put together a story rather than just revisiting the

In [37]:
query = "Describe Logistic Regression"

docs = chroma_database.similarity_search(query, k=3)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    print("\nSource (Article Title):", doc.page_content)
    print("\nText: ", doc.metadata['Text'][:350] + ". . .")
    print('\n\n\n')

Query: Describe Logistic Regression
Retrieved documents: 3

Source (Article Title): Logistic Regression

Text:  Logistic Regression

Contrary to its name logistic regression is a classification algorithm. Given an input example, a logistic regression model assigns the example to a relevant class.

A note on the notation. x_{i} means x subscript i and x_{^th} means x superscript th.

Quick Review of Linear Regression

Linear Regression is used to predict a re. . .





Source (Article Title): Logistic Regression

Text:  Logistic Regression

Contrary to its name logistic regression is a classification algorithm. Given an input example, a logistic regression model assigns the example to a relevant class.

A note on the notation. x_{i} means x subscript i and x_{^th} means x superscript th.

Quick Review of Linear Regression

Linear Regression is used to predict a re. . .





Source (Article Title): Logistic Regression in Machine Learning using Python

Text:  Logistic Regression in Machin

In [38]:
query = "Why should I use Cython?"

docs = chroma_database.similarity_search(query, k=3)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    print("\nSource (Article Title):", doc.page_content)
    print("\nText: ", doc.metadata['Text'][:350] + ". . .")
    print('\n\n\n')

Query: Why should I use Cython?
Retrieved documents: 3

Source (Article Title): Use Cython to get more than 30X speedup on your Python code

Text:  Cython will give your Python code super-car speed

Want to be inspired? Come join my Super Quotes newsletter. 😎

Python is a community favourite programming language! It’s by far one of the easiest to use as code is written in an intuitive, human-readable way.

Yet you’ll often hear the same complaint about Python over and over again, especially fr. . .





Source (Article Title): Use Cython to get more than 30X speedup on your Python code

Text:  Cython will give your Python code super-car speed

Want to be inspired? Come join my Super Quotes newsletter. 😎

Python is a community favourite programming language! It’s by far one of the easiest to use as code is written in an intuitive, human-readable way.

Yet you’ll often hear the same complaint about Python over and over again, especially fr. . .





Source (Article Title): PyTorch 1.3 — 

In [39]:
query = "Give me a list of examples where I can apply time series clustering"

docs = chroma_database.similarity_search(query, k=3)
print(f'Query: {query}')
print(f'Retrieved documents: {len(docs)}')
for doc in docs:
    print("\nSource (Article Title):", doc.page_content)
    print("\nText: ", doc.metadata['Text'][:350] + ". . .")
    print('\n\n\n')

Query: Give me a list of examples where I can apply time series clustering
Retrieved documents: 3

Source (Article Title): Time Series Clustering and Dimensionality Reduction

Text:  Time Series must be handled with care by data scientists. This kind of data contains intrinsic information about temporal dependency. it’s our work to extract these golden resources, where it is possible and useful, in order to help our model to perform the best.

With Time Series I see confusion when we face a problem of dimensionality reduction o. . .





Source (Article Title): Time Series Clustering and Dimensionality Reduction

Text:  Time Series must be handled with care by data scientists. This kind of data contains intrinsic information about temporal dependency. it’s our work to extract these golden resources, where it is possible and useful, in order to help our model to perform the best.

With Time Series I see confusion when we face a problem of dimensionality reduction o. . .





Source (Art
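
Several of the outputs above list the same article twice. A plausible but unverified cause is that the persisted chroma_db directory already contained the same documents from an earlier run. Whatever the cause, one way to show each article once is to filter the hits by title. The sketch below assumes hits arrive in ranked order; `Doc` and `unique_by_title` are hypothetical names, and the titles are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def unique_by_title(docs, k):
    """Keep the first hit for each title, preserving rank order, up to k hits."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content in seen:
            continue  # same title already returned by a better-ranked hit
        seen.add(doc.page_content)
        unique.append(doc)
        if len(unique) == k:
            break
    return unique

# placeholder hits in ranked order, with one duplicate title
hits = [Doc("Title A"), Doc("Title A"), Doc("Title B"), Doc("Title C")]
top = unique_by_title(hits, k=3)
print([d.page_content for d in top])
```

With the notebook's objects, this would mean requesting more candidates than needed, for example similarity_search(query, k=10), and passing the result through the filter.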

# Question-answering (QA) system

My QA system uses a prompt template to build prompts, an LLM chain to generate answers, and a retrieval-augmented generation (RAG) chain that feeds retrieved documents into the generation step.

* <b>llm_chain:</b> sets up a generation-only question-answering chain using the LLMChain class. It combines the language model (llm) with the prompt template. It doesn't use the retriever, so the caller supplies the context (an empty string in the examples below).
* <b>rag_chain:</b> sets up a retrieval-augmented generation (RAG) chain. It fills the context with documents returned by the retriever, passes the question through unchanged, and then calls llm_chain to generate the answer.

As the logistic regression example shows, answers with RAG can be more precise.

In [40]:
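#define a configuration for 4-bit quantization
#note: this config is not passed to from_pretrained() above, so the model stays in bfloat16;
#to apply it, pass quantization_config=quantization_config when loading the model
quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                        bnb_4bit_quant_type = 'nf4',
                                        bnb_4bit_use_double_quant = True,
                                        bnb_4bit_compute_dtype = bfloat16)

#create a text generation pipeline using Mistral7b
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,         #only takes effect when sampling is enabled (do_sample=True)
    repetition_penalty=1.1,
    return_full_text=True,   #output includes the prompt; the answer is cut out after [/INST] later
    max_new_tokens=300,
)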

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

In [41]:
# build prompt
prompt_template = """
### [INST] 
Instruction: Answer the question based on your data science knowledge. Here is context to help:

{context}

### QUESTION:
{question} 

[/INST]"""

# create prompt from prompt template 
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# create llm chain 
llm_chain = LLMChain(llm=llm, prompt=prompt)

# create rag chain 
rag_chain = ( 
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
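
The difference between the two chains can be sketched with plain functions. This is a simplified model of the data flow, not LangChain code: `fake_retriever` and `fake_llm` are hypothetical stubs, the template is a shortened version of the one above, and the stub model simply echoes the prompt the way a pipeline with return_full_text=True would.

```python
# shortened template with the same two slots as the notebook's prompt
PROMPT_TEMPLATE = "[INST] Context: {context}\nQuestion: {question} [/INST]"

def fake_retriever(question):
    """Hypothetical retriever: returns a list of passages for the question."""
    return [f"Passage related to: {question}"]

def fake_llm(prompt):
    """Hypothetical model: returns the prompt followed by a canned answer."""
    return prompt + " canned answer"

def llm_chain(inputs):
    """Fill the template, call the model, return the inputs plus the generated text."""
    prompt = PROMPT_TEMPLATE.format(**inputs)
    return {**inputs, "text": fake_llm(prompt)}

def rag_chain(question):
    """Retrieve context for the question, then delegate to llm_chain."""
    inputs = {"context": fake_retriever(question), "question": question}
    return llm_chain(inputs)

plain = llm_chain({"context": "", "question": "What is Word2Vec?"})
augmented = rag_chain("What is Word2Vec?")
print(plain["text"])
print(augmented["text"])
```

Only the augmented prompt contains retrieved passages. Note that the list returned by the retriever is converted to a string when the template is filled, so joining the passages into clean text before formatting would likely give the model a tidier context.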

In [42]:
query = "What is Word2Vec?"
print('Query: '+query)
print('\n')
print('Answer: '+llm_chain.invoke({"context":"", 
                  "question": query})['text'].split('[/INST]')[1].strip())
print('\n\n')
print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())

Query: What is Word2Vec?




KeyboardInterrupt: 

In [None]:
query = "How to implement Gradient Descent and Backpropagation?"
print('Query: '+query)
print('\n')
print('Answer: '+llm_chain.invoke({"context":"", 
                  "question": query})['text'].split('[/INST]')[1].strip())
print('\n\n')
print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())

In [None]:
query = "Describe Logistic Regression"
print('Query: '+query)
print('\n')
print('Answer: '+llm_chain.invoke({"context":"", 
                  "question": query})['text'].split('[/INST]')[1].strip())
print('\n\n')
print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())

In [None]:
query = "Why should I use Cython?"
print('Query: '+query)
print('\n')
print('Answer: '+llm_chain.invoke({"context":"", 
                  "question": query})['text'].split('[/INST]')[1].strip())
print('\n\n')
print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())

In [None]:
query = "Give me a list of examples where I can apply time series clustering"
print('Query: '+query)
print('\n')
print('Answer: '+llm_chain.invoke({"context":"", 
                  "question": query})['text'].split('[/INST]')[1].strip())
print('\n\n')
print('Answer with RAG: '+rag_chain.invoke(query)['text'].split('[/INST]')[1].strip())
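
The five cells above repeat the same post-processing step: because the pipeline returns the full text, the answer is whatever follows the [/INST] marker. A small helper could hold that logic in one place. This is a sketch with a hypothetical name, `extract_answer`; unlike split('[/INST]')[1], it takes the text after the last marker and does not raise an error when the marker is missing.

```python
def extract_answer(generated_text, marker="[/INST]"):
    """Return the text after the last marker, or the whole text if the marker is absent."""
    # rpartition returns ('', '', text) when the marker is not found
    return generated_text.rpartition(marker)[2].strip()

# placeholder model output in the same shape as the notebook's prompt
full_text = "### [INST] Instruction: ... ### QUESTION: q [/INST] An answer. "
answer = extract_answer(full_text)
print(answer)
```

With the notebook's chains, the call would look like extract_answer(rag_chain.invoke(query)['text']).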