# Introduction

<img src="https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F769452%2Fb18d0513200d426e556b2b7b7c825981%2FRAG.png?generation=1695504022336680&alt=media"></img>

## Objective

Use Llama 2.0, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system for exploring the Enron emails. We index the emails in ChromaDB using HuggingFace embeddings, then use Langchain and Llama 2 to query the indexed content.

## Definitions

* LLM - Large Language Model  
* Llama 2.0 - LLM from Meta 
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation 

## Model details

* **Model**: Llama 2  
* **Variation**: 7b-chat-hf  (7b: 7 billion parameters; hf: HuggingFace build)
* **Version**: V1  
* **Framework**: PyTorch  

Llama 2 is pretrained on 2 trillion tokens and is available in sizes from 7 to 70 billion parameters, which makes it one of the more powerful openly available models. It is a substantial improvement over Llama 1.

# Installations, imports, utils

In [1]:
!pip install transformers accelerate einops langchain xformers bitsandbytes chromadb sentence_transformers


In [2]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
import chromadb
from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma


# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [5]:
# model_id = '/kaggle/input/llama-2/pytorch/7b-chat-hf/1'
model_id = '/kaggle/input/llama2-7b-hf/Llama2-7b-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
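As a rough back-of-the-envelope check on why 4-bit loading helps, the weights-only footprint can be estimated from the parameter count and the bits stored per weight. The sketch below assumes a round 7 billion parameters and ignores quantization constants, activations and the KV cache, so real GPU usage will be higher; `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7e9  # assumption: a round 7 billion parameters

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized weights
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of 16-bit weights.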

Prepare the model and the tokenizer.

In [6]:
time_1 = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_2 = time()
print(f"Prepare model, tokenizer: {round(time_2-time_1, 3)} sec.")

Prepare model, tokenizer: 133.786 sec.


Define the query pipeline.

In [8]:
time_1 = time()
query_pipeline = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,
        device_map="auto",)
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

Prepare pipeline: 0.0 sec.


The pipeline is tested in the next section by wrapping it in a Langchain `HuggingFacePipeline`.

# Query Enron Emails

## Check the model with a HuggingFace pipeline


We check the model with a HF pipeline, using a query about the Enron scandal.

In [9]:
llm = HuggingFacePipeline(pipeline=query_pipeline)
# checking again that everything is working fine
llm(prompt="Please explain what was the Enron scandal. Keep it in 100 words.")

"Please explain what was the Enron scandal. Keep it in 100 words.\nThe Enron scandal was a major corporate scandal that occurred in the United States in the early 2000s. It involved the collapse of the Enron Corporation, a major energy company, and the subsequent investigation and prosecution of several individuals involved in the company's accounting practices.\nThe scandal began when it was discovered that Enron had been using complex accounting practices to hide massive losses and debt, while at the same time reporting record profits. The company had also been involved in a number of questionable business practices, including the creation of off-balance-sheet partnerships and the manipulation of energy prices.\nThe scandal led to the collapse of Enron, the largest bankruptcy in U.S. history at the time, and the prosecution of several individuals, including former Enron CEO Jeffrey Skilling and former Enron CFO Andrew Fastow. The scandal also led to the passage of the Sarbanes-Oxley 
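Note that the returned string starts with the prompt itself. A minimal sketch of removing that echo is shown below; `strip_prompt` is a hypothetical helper and the sample string is shortened from the output above. Passing `return_full_text=False` when calling the text-generation pipeline may be an alternative worth trying.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Return only the newly generated text, dropping an echoed prompt prefix."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated

prompt = "Please explain what was the Enron scandal. Keep it in 100 words."
# Shortened sample of a model response that echoes the prompt first.
generated = prompt + "\nThe Enron scandal was a major corporate scandal."
answer = strip_prompt(generated, prompt)
print(answer)
```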

## Ingestion of data using CSV loader

We will ingest a selection of Enron emails, already processed.

In [10]:
loader = CSVLoader("/kaggle/input/parse-and-process-enron-emails-dataset/proc_email.csv",
                    encoding="utf8",source_column="to_index")
documents = loader.load()

We will use just a subset (the first 1000 documents) of the collection.

In [11]:
sel_documents = documents[0:1000]

In [18]:
print(sel_documents[0])

page_content="To: frozenset({'robert.walker@enron.com'})\nFrom: frozenset({'daren.farmer@enron.com'})\nX-To: Robert Walker\nX-From: Daren J Farmer\ncontent: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F\nto_index: From Daren J Farmer to Robert Walker: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F" metadata={'source': 'From Daren J Farmer to Robert Walker: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F', 'row': 0}
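The printed document shows how each CSV row is flattened: one `column: value` line per column, with the value of `source_column` copied into the metadata. The sketch below mimics that layout with the standard `csv` module. It is an illustration only, not Langchain's implementation; `row_to_document` and the two-column sample are hypothetical.

```python
import csv
import io

def row_to_document(row: dict, row_index: int, source_column: str) -> dict:
    """Flatten one CSV row into a document-like dict (illustrative only)."""
    page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
    metadata = {"source": row[source_column], "row": row_index}
    return {"page_content": page_content, "metadata": metadata}

# Hypothetical two-column sample; the real file has more columns.
sample_csv = (
    "X-From,to_index\n"
    "Daren J Farmer,From Daren J Farmer to Robert Walker: ENA Contact\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
doc = row_to_document(rows[0], 0, source_column="to_index")
print(doc["page_content"])
print(doc["metadata"])
```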


## Split data into chunks

We split the data into chunks of at most 1000 characters, with a 20-character overlap, using a recursive character text splitter.

In [13]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
all_splits = text_splitter.split_documents(sel_documents)

In [20]:
print(all_splits[0])

page_content="To: frozenset({'robert.walker@enron.com'})\nFrom: frozenset({'daren.farmer@enron.com'})\nX-To: Robert Walker\nX-From: Daren J Farmer\ncontent: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F\nto_index: From Daren J Farmer to Robert Walker: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F" metadata={'source': 'From Daren J Farmer to Robert Walker: ENA Contact\n\nDaren Farmer\nPhone # 713-853-6905\nFax# 713-646-2391\n\nEB3211F', 'row': 0}
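The first split is identical to the first document because that email is shorter than `chunk_size`. For longer texts, the overlap means consecutive chunks share some characters. The sketch below uses a plain fixed-width sliding window, which is a simplification: the recursive splitter also tries to break on separators such as paragraph and line breaks. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

With these toy values, each chunk starts with the last two characters of the previous one.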


## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.

In [25]:
!pip install sentence-transformers==2.2.2


In [26]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [27]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [28]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)
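With `chain_type="stuff"`, all retrieved chunks are concatenated ("stuffed") into a single prompt ahead of the question. The sketch below approximates that step. The instruction sentence is the one visible in the chain results in this notebook; the `Question:` / `Helpful Answer:` tail and the blank-line separator are assumptions about the default template, and `build_stuff_prompt` is a hypothetical helper.

```python
INSTRUCTION = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)

def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    # Assumed layout of the prompt tail; the real template may differ.
    return f"{INSTRUCTION}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"

prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "Who was Sheila Chang?",
)
print(prompt)
```

Because every chunk lands in one prompt, the prompt length grows with the number and size of the retrieved chunks.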

## Test the Retrieval-Augmented Generation with the indexed embeddings of Enron emails


We define a test function that runs the query and times it.

In [29]:
def test_rag(qa, query):
    print(f"Query: {query}\n")
    time_1 = time()
    result = qa.run(query)
    time_2 = time()
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

Let's check a few queries.

In [30]:
query = "Who was Sheila Chang?"
test_rag(qa, query)

Query: Who was Sheila Chang?

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 305.973 sec.

Result:  Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

per the request of Mark Haedicke/Elizabeth Sager, please see this 
attachment:
to_index: From Sara Shackleton to Tana Jones: ----- Forwarded by Sara Shackleton/HOU/ECT on 09/20/2000 02:47 PM -----

Please reply as soon as possible if you are going to attend this lunch 
meeting (for catering purposes).  Thanks.
to_index: From Sara Shackleton to Kaye Ellis: Per my voice mail.  I think Suzanne already reserved a room.
---------------------- Forwarded by Sara Shackleton/HOU/ECT on 04/18/2000 
01:40 PM ---------------------------

Emily Sellers
---------------------- Forwarded by Emily Sellers/ET&S/Enron on 02/06/2001 
01:14 PM ---------------------------


"Tina Shelton" <TShelton@phrusa.org> on 02/06/2001 01:09:00 PM
To: esellers@enron.com
cc:  

Subject: follow 

Optionally, we can query the database directly to get the documents used to build the context for the answer.

In [31]:
def query_database(query):
    docs = vectordb.similarity_search(query)
    print(f"Query: {query}")
    print(f"Retrieved documents: {len(docs)}")
    for doc in docs:
        doc_details = doc.to_json()['kwargs']
        print("Source: ", doc_details['metadata']['source'])
        print("Text: ", doc_details['page_content'], "\n")
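Conceptually, `similarity_search` returns the chunks whose embeddings lie closest to the embedding of the query; each direct query in this notebook returns 4 documents. The sketch below ranks toy 3-dimensional vectors by cosine similarity to illustrate the idea. It is not how ChromaDB is implemented (a real store uses model embeddings and an index, and its distance metric may differ); `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=4):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions.
indexed = [
    ("gas storage memo", [0.9, 0.1, 0.0]),
    ("lunch meeting", [0.0, 1.0, 0.0]),
    ("pipeline capacity", [0.8, 0.0, 0.2]),
]
results = top_k([1.0, 0.0, 0.0], indexed, k=2)
print([text for text, _ in results])
```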

Expand the following cell to see the documents selected from the vector DB by direct query.

In [32]:
query_database(query)

Query: Who was Sheila Chang?
Retrieved documents: 4
Source:  From Sara Shackleton to Tana Jones: ----- Forwarded by Sara Shackleton/HOU/ECT on 09/20/2000 02:47 PM -----

	Brenda Whitehead
	09/11/2000 09:16 AM
		 
		 To: Alan Aronowitz/HOU/ECT@ECT, Peggy Banczak/HOU/ECT@ECT, Sandi M 
Braband/HOU/ECT@ECT, Teresa G Bushman/HOU/ECT@ECT, Bob Carter/HOU/ECT@ECT, 
Michelle Cash/HOU/ECT@ECT, Barton Clark/HOU/ECT@ECT, Harry M 
Collins/HOU/ECT@ECT, Shonnie Daniel/HOU/ECT@ECT, Peter del 
Vecchio/HOU/ECT@ECT, Stacy E Dickson/HOU/ECT@ECT, Shawna Flynn/HOU/ECT@ECT, 
Barbara N Gray/HOU/ECT@ECT, Wayne Gresham/HOU/ECT@ECT, Mark E 
Haedicke/HOU/ECT@ECT, Leslie Hansen/HOU/ECT@ECT, Jeffrey T Hodge/HOU/ECT@ECT, 
Dan J Hyvl/HOU/ECT@ECT, Dan Lyons/HOU/ECT@ECT, Travis McCullough/HOU/ECT@ECT, 
Lisa Mellencamp/HOU/ECT@ECT, Janet H Moore/HOU/ECT@ECT, Janice R 
Moore/HOU/ECT@ECT, Julia Murray/HOU/ECT@ECT, Gerald Nemec/HOU/ECT@ECT, David 
Portz/HOU/ECT@ECT, Elizabeth Sager/HOU/ECT@ECT, Richard B 
Sanders/HOU/ECT@E

In [33]:
query = "In what context is mentioned Natural Gas Storage Overview?"
test_rag(qa, query)

Query: In what context is mentioned Natural Gas Storage Overview?

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 289.318 sec.

Result:  Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Leslie
to_index: From Dan J Hyvl to Leslie Hansen: Leslie,
 I have reviewed the product descriptions for the natural gas products on 
Dynegydirect and they appear to be standard descriptions which should be 
easily understood by our traders.  As such, I am comfortable with such 
descriptions.



	Leslie Hansen
	12/22/2000 11:28 AM
		 
		 To: Brent Hendry/NA/Enron@Enron, Dan J Hyvl/HOU/ECT@ECT, Marcus 
Nettelton/NA/Enron@ENRON
		 cc: Sheri Thomas/HOU/ECT@ECT
		 Subject: DynegyDirect Product Approvals

Please notify both me and Sheri Thomas via e-mail as soon as you are 
comfortable with the product descriptions for your respective commodities on 
Dynegydirect so that Sheri can set up the respective traders to trade via 
DynegyDirect.  Sheri has been receiving c

Expand the following cell to see the documents selected from the vector DB by direct query.

In [34]:
query_database(query)

Query: In what context is mentioned Natural Gas Storage Overview?
Retrieved documents: 4
Source:  From Dan J Hyvl to Leslie Hansen: Leslie,
 I have reviewed the product descriptions for the natural gas products on 
Dynegydirect and they appear to be standard descriptions which should be 
easily understood by our traders.  As such, I am comfortable with such 
descriptions.



	Leslie Hansen
	12/22/2000 11:28 AM
		 
		 To: Brent Hendry/NA/Enron@Enron, Dan J Hyvl/HOU/ECT@ECT, Marcus 
Nettelton/NA/Enron@ENRON
		 cc: Sheri Thomas/HOU/ECT@ECT
		 Subject: DynegyDirect Product Approvals

Please notify both me and Sheri Thomas via e-mail as soon as you are 
comfortable with the product descriptions for your respective commodities on 
Dynegydirect so that Sheri can set up the respective traders to trade via 
DynegyDirect.  Sheri has been receiving calls from traders anxious to use the 
system.

Thank you so much.  Have a happy holiday.

Leslie

Text:  Leslie
to_index: From Dan J Hyvl to Leslie H

In [35]:
query = "Summarize email correspondents of Vince J Kaminski. Limit the list of correspondents to 10."
test_rag(qa, query)

Query: Summarize email correspondents of Vince J Kaminski. Limit the list of correspondents to 10.

> Entering new RetrievalQA chain...

> Finished chain.
Inference time: 251.337 sec.

Result:  Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

To: frozenset({'amy.fitzpatrick@enron.com'})
From: frozenset({'vince.kaminski@enron.com'})
X-To: Amy FitzPatrick
X-From: Vince J Kaminski
content: Amy,

Yes, I am interested. I am in London now, but I shall contact him on 
Thuirsday.

Vince




Amy FitzPatrick
02/21/2000 03:34 AM
To: Vince J Kaminski/HOU/ECT@ECT
cc:  
Subject: Re: CV of Rodney Greene re quantitative positions.

Vince - 

Would you have any interest in this candidate?  

Kind regards -
Amy
---------------------- Forwarded by Amy FitzPatrick/LON/ECT on 21/02/2000 
09:34 ---------------------------


Bryan Seyfried
18/02/2000 19:50
To: Amy FitzPatrick/LON/ECT@ECT
cc:  

Subject: Re: CV of Rodney Greene re quantitative positions.  

probably a bit to techy for me but maybe a good fit for Vince Kam

Expand the following cell to see the documents selected from the vector DB by direct query.

In [36]:
query_database(query)

Query: Summarize email correspondents of Vince J Kaminski. Limit the list of correspondents to 10.
Retrieved documents: 4
Source:  From Vince J Kaminski to Amy FitzPatrick: Amy,

Yes, I am interested. I am in London now, but I shall contact him on 
Thuirsday.

Vince




Amy FitzPatrick
02/21/2000 03:34 AM
To: Vince J Kaminski/HOU/ECT@ECT
cc:  
Subject: Re: CV of Rodney Greene re quantitative positions.

Vince - 

Would you have any interest in this candidate?  

Kind regards -
Amy
---------------------- Forwarded by Amy FitzPatrick/LON/ECT on 21/02/2000 
09:34 ---------------------------


Bryan Seyfried
18/02/2000 19:50
To: Amy FitzPatrick/LON/ECT@ECT
cc:  

Subject: Re: CV of Rodney Greene re quantitative positions.  

probably a bit to techy for me but maybe a good fit for Vince Kaminski in 
Houston Research.

bs



Amy FitzPatrick
17/02/2000 12:52
To: David Port/Corp/Enron@ENRON, David Weekes/LON/ECT@ECT, Steve W 
Young/LON/ECT@ECT, Bryan Seyfried/LON/ECT@ECT
cc:  

Subject: CV of 

# References  

[1] Murtuza Kazmi, Using LLaMA 2.0, FAISS and LangChain for Question-Answering on Your Own Data, https://medium.com/@murtuza753/using-llama-2-0-faiss-and-langchain-for-question-answering-on-your-own-data-682241488476