# Email Search AI

This AI application searches your emails for a given query and returns the most relevant answers. It also cites the source emails behind each answer, so that you, as the end user, can verify it.

The input is a set of CSV files containing emails (subject, body, from, to, timestamp). A second dataset summarizes the information present in each email thread.

### Importing the necessary libraries

In [1]:
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import openai
import chromadb
from sentence_transformers import CrossEncoder, util

### Loading the dataset

In [2]:
base_df = pd.read_csv('emails_dataset/CSV/email_thread_details.csv')
summary_df = pd.read_csv('emails_dataset/CSV/email_thread_summaries.csv')

### Data Understanding

In [3]:
# The two dataframes are merged on thread_id.
# One dataframe holds the individual emails, while the other holds a summary of each email thread.

main_df = pd.merge(base_df, summary_df, on='thread_id')
main_df.iloc[5]

thread_id                                                    2
subject                                     Credit Group Lunch
timestamp                                  2000-01-12 05:26:00
from                                                Tana Jones
to                                           ['Suzanne Adams']
body                                          I'll be there...
summary      A lunch meeting has been scheduled for May 5th...
Name: 5, dtype: object

In [4]:
base_df

Unnamed: 0,thread_id,subject,timestamp,from,to,body
0,1,FW: Master Termination Log,2002-01-29 11:23:42,"Gossett, Jeffrey C. JGOSSET","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...
1,1,FW: Master Termination Log,2002-01-31 12:50:00,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Gossett', 'Jeff...",\n\n -----Original Message-----\nFrom: =09Panu...
2,1,FW: Master Termination Log,2002-02-05 15:03:35,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Anderson', 'Dia...",Note to Stephanie Panus....\n\nStephanie...ple...
3,1,FW: Master Termination Log,2002-02-05 15:06:25,"Theriot, Kim S. KTHERIO","['Hall', 'D. Todd Thall', 'Sweeney', 'Kevin Ks...",\n\n -----Original Message-----\nFrom: =09Panu...
4,1,FW: Master Termination Log,2002-05-28 07:20:35,"Kelly, Katherine L. KKELLY","['Germany', 'Chris Cgerman']",\n\n -----Original Message-----\nFrom: =09McMi...
...,...,...,...,...,...,...
21679,4166,vacation,2000-10-04 11:32:00,Sara Shackleton,"['Gary Hickerson', 'Sheila Glover', 'Laurel Ad...",I will be on vacation from October 6- 13. Als...
21680,4167,web file,2001-03-18 22:57:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nCan you put this file in the approp..."
21681,4167,web file,2001-03-19 04:42:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nPlease move the file i sent you fro..."
21682,4167,web file,2001-03-19 09:57:00,Matt Smith,['Amanda Huble <Amanda Huble/NA/Enron@Enron'],"Amanda,\n\nCan you put this file in the approp..."


In [75]:
summary_df

Unnamed: 0,thread_id,summary
0,1,The email thread discusses the Master Terminat...
1,2,A lunch meeting has been scheduled for May 5th...
2,3,Ben is updating a friend on his progress with ...
3,4,The recipient of the email thread initially ex...
4,5,The email thread discusses the long form confi...
...,...,...
4162,4163,Peter Thompson has sent a memo to Kay Mann and...
4163,4164,The email thread revolves around the sharing a...
4164,4165,Susan asks Emily about her plans for the weeke...
4165,4166,Several employees will be on vacation during d...


In [6]:
main_df

Unnamed: 0,thread_id,subject,timestamp,from,to,body,summary
0,1,FW: Master Termination Log,2002-01-29 11:23:42,"Gossett, Jeffrey C. JGOSSET","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...,The email thread discusses the Master Terminat...
1,1,FW: Master Termination Log,2002-01-31 12:50:00,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Gossett', 'Jeff...",\n\n -----Original Message-----\nFrom: =09Panu...,The email thread discusses the Master Terminat...
2,1,FW: Master Termination Log,2002-02-05 15:03:35,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Anderson', 'Dia...",Note to Stephanie Panus....\n\nStephanie...ple...,The email thread discusses the Master Terminat...
3,1,FW: Master Termination Log,2002-02-05 15:06:25,"Theriot, Kim S. KTHERIO","['Hall', 'D. Todd Thall', 'Sweeney', 'Kevin Ks...",\n\n -----Original Message-----\nFrom: =09Panu...,The email thread discusses the Master Terminat...
4,1,FW: Master Termination Log,2002-05-28 07:20:35,"Kelly, Katherine L. KKELLY","['Germany', 'Chris Cgerman']",\n\n -----Original Message-----\nFrom: =09McMi...,The email thread discusses the Master Terminat...
...,...,...,...,...,...,...,...
21679,4166,vacation,2000-10-04 11:32:00,Sara Shackleton,"['Gary Hickerson', 'Sheila Glover', 'Laurel Ad...",I will be on vacation from October 6- 13. Als...,Several employees will be on vacation during d...
21680,4167,web file,2001-03-18 22:57:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nCan you put this file in the approp...",Mat has sent an email to Amanda requesting her...
21681,4167,web file,2001-03-19 04:42:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nPlease move the file i sent you fro...",Mat has sent an email to Amanda requesting her...
21682,4167,web file,2001-03-19 09:57:00,Matt Smith,['Amanda Huble <Amanda Huble/NA/Enron@Enron'],"Amanda,\n\nCan you put this file in the approp...",Mat has sent an email to Amanda requesting her...


### Data Preprocessing

In [7]:
# Aggregate the emails of each thread into a single row: senders and recipients are joined with commas, bodies with newline characters.
cleaned_df = main_df.groupby('thread_id').agg({
    'subject': 'first',
    'from': lambda x: ','.join(x),
    'to': lambda x: ','.join(x),
    'body': lambda x: '\n'.join(x),
    'summary': 'first'
}).reset_index()

In [8]:
cleaned_df['Text_Length'] = cleaned_df['body'].apply(lambda x: len(x.split(' ')))

In [9]:
cleaned_df

Unnamed: 0,thread_id,subject,from,to,body,summary,Text_Length
0,1,FW: Master Termination Log,"Gossett, Jeffrey C. JGOSSET,Theriot, Kim S. KT...","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...,The email thread discusses the Master Terminat...,1345
1,2,Credit Group Lunch,"Tana Jones,Tana Jones,Carol St Clair,Carol St ...","['Suzanne Adams'],['Suzanne Adams'],['Suzanne ...",I'll be there...\nI will attend.\nSuzanne:\nHe...,A lunch meeting has been scheduled for May 5th...,443
2,3,New Address,"Benjamin Rogers,Bruce Rudy,Brian Hendon,Gerald...","['""CHOBY', 'C."" <G7PWC3@stennis.navy.mil'],['B...","Hey there; \n""Do you know who your ""big toe"" i...",Ben is updating a friend on his progress with ...,4833
3,4,EOL Data,"Phillip M Love,Phillip M Love,Phillip M Love,P...","['Julie Ferrara'],['Julie Ferrara'],['Julie Fe...",thanks for the update.\nPL\nthat is ok. Thank...,The recipient of the email thread initially ex...,67
4,5,RE: long form confirm/MDEA,"Kay Mann,Reagan Rorschach,Kay Mann,Edward Sack...","['Reagan Rorschach'],['Kay Mann'],['Reagan Ror...",I think you can send it just so he has the for...,The email thread discusses the long form confi...,1970
...,...,...,...,...,...,...,...
4162,4163,ltr to Kay Mann: Site specific references in G...,"Kay Mann,Kay Mann,Kay Mann,Kay Mann","['Sheila Tweed', 'Dale Rasmussen', 'Stuart Zis...",FYI.\n---------------------- Forwarded by Kay ...,Peter Thompson has sent a memo to Kay Mann and...,344
4163,4164,presentation,"Elizabeth Sager,Mike McConnell,Mike McConnell,...","['Genia FitzGerald'],['Rick Bergsieker'],['Geo...",Can you send him a hard copy (He is w Constell...,The email thread revolves around the sharing a...,655
4164,4165,this weekend,"Watson, Kimberly KWATSON,Scott, Susan M. SSCOT...","[""'john.watson@pdq.net'""],['Corey Leahy (E-mai...","I don't see you on MSN, but I am on the phone ...",Susan asks Emily about her plans for the weeke...,214
4165,4166,vacation,"Susan Scott,Scott Neal,Peter F Keavey,Sandra M...","['Drew Fossum@ENRON', 'Janet Cones', 'Audrey R...",I'm planning to be out August 28 - Sept 1.\nI ...,Several employees will be on vacation during d...,108


In [10]:
cleaned_df['Text_Length'].describe(percentiles=[.01, .25,.5,.75,.99])

count     4167.000000
mean      1304.195104
std       2075.267246
min          5.000000
1%          52.000000
25%        396.500000
50%        810.000000
75%       1519.000000
99%       7486.180000
max      45707.000000
Name: Text_Length, dtype: float64

In [11]:
# As the summary above shows, most threads are between 52 and 7486 words long (the 1st and 99th percentiles).
# Threads outside this range are treated as outliers and dropped from the dataset.

cleaned_df = cleaned_df.loc[cleaned_df['Text_Length'] > 51]
cleaned_df = cleaned_df.loc[cleaned_df['Text_Length'] < 7486]
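
The cut-offs 51 and 7486 in the cell above are read off the `describe` output by hand. As a minimal sketch, assuming only the standard library and toy word counts rather than the real `Text_Length` column, the same percentile bounds could be derived programmatically. `trim_by_percentile` is a hypothetical helper, not part of the notebook:

```python
from statistics import quantiles

def trim_by_percentile(lengths, low_pct=1, high_pct=99):
    """Keep values strictly between the low and high percentile cut-offs."""
    # quantiles(..., n=100) returns the 99 cut points for percentiles 1..99;
    # method="inclusive" interpolates linearly between data points
    cuts = quantiles(lengths, n=100, method="inclusive")
    low, high = cuts[low_pct - 1], cuts[high_pct - 1]
    kept = [x for x in lengths if low < x < high]
    return kept, (low, high)

# Toy data (assumption): word counts 1..101 stand in for the real column
kept, (low, high) = trim_by_percentile(list(range(1, 102)))
print(low, high, len(kept))
```

Deriving the bounds this way keeps the filter valid if the dataset changes, at the cost of hiding the actual numbers from the reader.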

In [12]:
cleaned_df

Unnamed: 0,thread_id,subject,from,to,body,summary,Text_Length
0,1,FW: Master Termination Log,"Gossett, Jeffrey C. JGOSSET,Theriot, Kim S. KT...","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...,The email thread discusses the Master Terminat...,1345
1,2,Credit Group Lunch,"Tana Jones,Tana Jones,Carol St Clair,Carol St ...","['Suzanne Adams'],['Suzanne Adams'],['Suzanne ...",I'll be there...\nI will attend.\nSuzanne:\nHe...,A lunch meeting has been scheduled for May 5th...,443
2,3,New Address,"Benjamin Rogers,Bruce Rudy,Brian Hendon,Gerald...","['""CHOBY', 'C."" <G7PWC3@stennis.navy.mil'],['B...","Hey there; \n""Do you know who your ""big toe"" i...",Ben is updating a friend on his progress with ...,4833
3,4,EOL Data,"Phillip M Love,Phillip M Love,Phillip M Love,P...","['Julie Ferrara'],['Julie Ferrara'],['Julie Fe...",thanks for the update.\nPL\nthat is ok. Thank...,The recipient of the email thread initially ex...,67
4,5,RE: long form confirm/MDEA,"Kay Mann,Reagan Rorschach,Kay Mann,Edward Sack...","['Reagan Rorschach'],['Kay Mann'],['Reagan Ror...",I think you can send it just so he has the for...,The email thread discusses the long form confi...,1970
...,...,...,...,...,...,...,...
4162,4163,ltr to Kay Mann: Site specific references in G...,"Kay Mann,Kay Mann,Kay Mann,Kay Mann","['Sheila Tweed', 'Dale Rasmussen', 'Stuart Zis...",FYI.\n---------------------- Forwarded by Kay ...,Peter Thompson has sent a memo to Kay Mann and...,344
4163,4164,presentation,"Elizabeth Sager,Mike McConnell,Mike McConnell,...","['Genia FitzGerald'],['Rick Bergsieker'],['Geo...",Can you send him a hard copy (He is w Constell...,The email thread revolves around the sharing a...,655
4164,4165,this weekend,"Watson, Kimberly KWATSON,Scott, Susan M. SSCOT...","[""'john.watson@pdq.net'""],['Corey Leahy (E-mai...","I don't see you on MSN, but I am on the phone ...",Susan asks Emily about her plans for the weeke...,214
4165,4166,vacation,"Susan Scott,Scott Neal,Peter F Keavey,Sandra M...","['Drew Fossum@ENRON', 'Janet Cones', 'Audrey R...",I'm planning to be out August 28 - Sept 1.\nI ...,Several employees will be on vacation during d...,108


In [13]:
# The functions below clean the recipient, sender and body fields: they deduplicate names and strip encoding artifacts, email headers and extra whitespace.

import ast
import re

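def clean_to_list(text):
  # The 'to' field is a comma-joined string of list literals, e.g. "['A'],['B', 'C']"
  parsed = ast.literal_eval(text)
  # A thread with a single email parses to one list instead of a tuple of lists;
  # wrap it so that the flattening below iterates over names, not characters
  if isinstance(parsed, list):
    parsed = [parsed]
  flattened = [name for sublist in parsed for name in sublist]
  flattened = list(set(flattened))
  flattened  = ", ".join(map(str, flattened))
  return flattened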

def clean_from_list(input_string):
    # Note: splitting on ',' also splits "Last, First" style names into two entries
    # Strip whitespace before deduplicating so that ' Kelly' and 'Kelly' count as the same name
    names = [name.strip() for name in input_string.split(',')]
    sorted_names = sorted(set(names))
    return ", ".join(sorted_names)

def clean_email_text(text):
  # Step 1: Remove encoding artifacts like =09, =20, etc.
  cleaned_text = re.sub(r'=09|=20', ' ', text)

  # Step 2: Remove "-----Original Message-----" and other unnecessary separators
  cleaned_text = re.sub(r'-----Original Message-----', '', cleaned_text)

  # Step 3: Remove email header lines (From, Sent, To, Subject, etc.)
  cleaned_text = re.sub(r'From:.*\n|Sent:.*\n|To:.*\n|Cc:.*\n|Subject:.*\n|Forwarded by.*\n|cc:.*\n', '', cleaned_text)

  # Step 4: Normalize line breaks and whitespace
  cleaned_text = re.sub(r'\n+', '\n', cleaned_text)  # Consolidate multiple newlines into one
  cleaned_text = re.sub(r'^\s+|\s+?$', '', cleaned_text)  # Trim whitespace at the start and end of the text
  cleaned_text = re.sub(r'\s{2,}', ' ', cleaned_text)  # Replace runs of whitespace with a single space

  # Step 5: Remove any remaining blank lines, stray '=' characters and newlines
  cleaned_text = re.sub(r'\n\s*\n', '\n', cleaned_text)
  cleaned_text = re.sub(r' ?=', '', cleaned_text)  # Drop '=' (and a preceding space) left over from the encoding
  cleaned_text = re.sub(r'\n', '', cleaned_text)  # Note: lines are joined without a separating space
  return cleaned_text
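
The `=09` and `=20` artifacts handled in Step 1 are quoted-printable escapes for a tab and a space. As an alternative to replacing them one pattern at a time, the standard library's `quopri` module can decode every such escape. This is a minimal sketch, not the notebook's method; `decode_quoted_printable` is a hypothetical helper and the sample string is made up:

```python
import quopri

def decode_quoted_printable(text):
    """Decode quoted-printable escapes such as =09 (tab) and =20 (space)."""
    raw = quopri.decodestring(text.encode("utf-8"))
    # errors="replace" guards against decoded bytes that are not valid UTF-8
    return raw.decode("utf-8", errors="replace")

sample = "From: =09Theriot,=20Kim"
print(repr(decode_quoted_printable(sample)))
```

Decoding first would turn the escapes into ordinary whitespace, which the later whitespace-normalization steps could then collapse.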

In [14]:
clean_from_list(cleaned_df['from'][0])

'Gossett, Jeffrey C. JGOSSET, Katherine L. KKELLY, Kelly, Kim S. KTHERIO, Theriot'

In [15]:
cleaned_df['to'] = cleaned_df['to'].apply(clean_to_list)
cleaned_df['from'] = cleaned_df['from'].apply(clean_from_list)
cleaned_df['body'] = cleaned_df['body'].apply(clean_email_text)

In [16]:
display(cleaned_df)

Unnamed: 0,thread_id,subject,from,to,body,summary,Text_Length
0,1,FW: Master Termination Log,"Gossett, Jeffrey C. JGOSSET, Katherine L. KKEL...","Stacey W. Swhite, Chris Cgerman, Bryce Bbaxter...","ey W.; Murphy, Melissa; Hall, D. Todd; Sweeney...",The email thread discusses the Master Terminat...,1345
1,2,Credit Group Lunch,"Carol St Clair, Mark Taylor, Sara Shackleton, ...","Suzanne Adams, Kaye Ellis",I'll be there...I will attend.Suzanne:Here is ...,A lunch meeting has been scheduled for May 5th...,443
2,3,New Address,"Benjamin Rogers, Brian Hendon, Bruce Rudy, Ger...",Brian.Hendon@ENRONCOMMUNICATIONS.nt.ect.enron....,"Hey there; ""Do you know who your ""big toe"" is ...",Ben is updating a friend on his progress with ...,4833
3,4,EOL Data,Phillip M Love,Julie Ferrara,thanks for the update.PLthat is ok. Thanks for...,The recipient of the email thread initially ex...,67
4,5,RE: long form confirm/MDEA,"Edward Sacks, Kay Mann, Reagan Rorschach","Reagan Rorschach, Kay Mann, kay.mann@worldnet....",I think you can send it just so he has the for...,The email thread discusses the long form confi...,1970
...,...,...,...,...,...,...,...
4162,4163,ltr to Kay Mann: Site specific references in G...,Kay Mann,"kent.shoemaker@ae.ge.com, Ben Jacoby, Stuart Z...",FYI.---------------------- PM ----------------...,Peter Thompson has sent a memo to Kay Mann and...,344
4163,4164,presentation,"David W Delainey, Elizabeth Sager, Eric Groves...","Trevor Twoods, Per Sekse, Elliot Mainzer <Elli...",Can you send him a hard copy (He is w Constell...,The email thread revolves around the sharing a...,655
4164,4165,this weekend,"Kimberly KWATSON, Sara SSHACKL, Scott, Shackle...","emily boon (E-mail) <emily.boon@msdw.com, 'joh...","I don't see you on MSN, but I am on the phone ...",Susan asks Emily about her plans for the weeke...,214
4165,4166,vacation,"Peter F Keavey, Sandra McCubbin, Sara Shacklet...","John Greene, Scott Sefton, Jorge A Garcia, Mel...",I'm planning to be out August 28 - Sept 1.I am...,Several employees will be on vacation during d...,108


In [17]:
cleaned_df['Metadata'] = cleaned_df.apply(lambda x: {'Threadid': x['thread_id'],'From': x['from'], 'To':x['to'], 'Subject':x['subject']}, axis=1)

In [19]:
# This is the cleaned dataframe that will serve as an input dataset for our generative AI model.
display(cleaned_df)

Unnamed: 0,thread_id,subject,from,to,body,summary,Text_Length,Metadata
0,1,FW: Master Termination Log,"Gossett, Jeffrey C. JGOSSET, Katherine L. KKEL...","Stacey W. Swhite, Chris Cgerman, Bryce Bbaxter...","ey W.; Murphy, Melissa; Hall, D. Todd; Sweeney...",The email thread discusses the Master Terminat...,1345,"{'Threadid': 1, 'From': 'Gossett, Jeffrey C. J..."
1,2,Credit Group Lunch,"Carol St Clair, Mark Taylor, Sara Shackleton, ...","Suzanne Adams, Kaye Ellis",I'll be there...I will attend.Suzanne:Here is ...,A lunch meeting has been scheduled for May 5th...,443,"{'Threadid': 2, 'From': 'Carol St Clair, Mark ..."
2,3,New Address,"Benjamin Rogers, Brian Hendon, Bruce Rudy, Ger...",Brian.Hendon@ENRONCOMMUNICATIONS.nt.ect.enron....,"Hey there; ""Do you know who your ""big toe"" is ...",Ben is updating a friend on his progress with ...,4833,"{'Threadid': 3, 'From': 'Benjamin Rogers, Bria..."
3,4,EOL Data,Phillip M Love,Julie Ferrara,thanks for the update.PLthat is ok. Thanks for...,The recipient of the email thread initially ex...,67,"{'Threadid': 4, 'From': 'Phillip M Love', 'To'..."
4,5,RE: long form confirm/MDEA,"Edward Sacks, Kay Mann, Reagan Rorschach","Reagan Rorschach, Kay Mann, kay.mann@worldnet....",I think you can send it just so he has the for...,The email thread discusses the long form confi...,1970,"{'Threadid': 5, 'From': 'Edward Sacks, Kay Man..."
...,...,...,...,...,...,...,...,...
4162,4163,ltr to Kay Mann: Site specific references in G...,Kay Mann,"kent.shoemaker@ae.ge.com, Ben Jacoby, Stuart Z...",FYI.---------------------- PM ----------------...,Peter Thompson has sent a memo to Kay Mann and...,344,"{'Threadid': 4163, 'From': 'Kay Mann', 'To': '..."
4163,4164,presentation,"David W Delainey, Elizabeth Sager, Eric Groves...","Trevor Twoods, Per Sekse, Elliot Mainzer <Elli...",Can you send him a hard copy (He is w Constell...,The email thread revolves around the sharing a...,655,"{'Threadid': 4164, 'From': 'David W Delainey, ..."
4164,4165,this weekend,"Kimberly KWATSON, Sara SSHACKL, Scott, Shackle...","emily boon (E-mail) <emily.boon@msdw.com, 'joh...","I don't see you on MSN, but I am on the phone ...",Susan asks Emily about her plans for the weeke...,214,"{'Threadid': 4165, 'From': 'Kimberly KWATSON, ..."
4165,4166,vacation,"Peter F Keavey, Sandra McCubbin, Sara Shacklet...","John Greene, Scott Sefton, Jorge A Garcia, Mel...",I'm planning to be out August 28 - Sept 1.I am...,Several employees will be on vacation during d...,108,"{'Threadid': 4166, 'From': 'Peter F Keavey, Sa..."


### Initialize OPENAI and Create Vector DB

Now that the data is cleaned, the next step is to generate the embeddings and store them in the vector database.

- Create a persistent ChromaDB client.
- Set up an embedding function backed by OpenAI.
- Create a collection to store the embeddings.
- Add the thread summaries to the collection in batches; the collection converts them into embeddings as they are added.
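
The last step adds rows to the collection in fixed-size batches rather than all at once. A minimal sketch of the index arithmetic behind such batching, using only plain Python; `batch_ranges` is a hypothetical helper, and 4086 is the row count that appears in the notebook's own batch log:

```python
def batch_ranges(total, batch_size=500):
    """Yield (start, end) index pairs covering range(total) in fixed-size batches."""
    for start in range(0, total, batch_size):
        # The final batch is shorter when total is not a multiple of batch_size
        yield start, min(start + batch_size, total)

ranges = list(batch_ranges(4086))
print(len(ranges), ranges[0], ranges[-1])
```

Each `(start, end)` pair can then be used to slice the documents, metadata and IDs for one insert call.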

In [20]:
# Set the API key
filepath = "OPENAI_API_Key.txt"

with open(filepath, "r") as f:
  openai.api_key = f.read().strip()  # strip surrounding whitespace such as a trailing newline

In [21]:
# Import the OpenAI Embedding Function into chroma
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import chromadb

In [22]:
# Define the path where chroma collections will be stored
chroma_data_path = 'chromadb'

In [23]:
# Create a persistent client that stores its data under chroma_data_path
perstclient = chromadb.PersistentClient(path=chroma_data_path)

In [24]:
# Set up the embedding function using the OpenAI embedding model
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name=model)

In [25]:
# Initialise a collection in chroma and pass the embedding_function to it so that it uses OpenAI embeddings to embed the documents
emails_collection = perstclient.get_or_create_collection(name='Learn_Rag_email_modified', embedding_function=embedding_function)

In [26]:
# This function adds the thread summaries to the vector store in batches of 500 rows; the collection embeds each summary as it is added.

def return_chunks(cleaned_df):
    df_len = len(cleaned_df)
    for i in range(0, df_len, 500):
        start = i
        end = i + 500 if i+500 < df_len else df_len
        documents_list = cleaned_df["summary"][start:end].astype('str').tolist()
        metadata_list = cleaned_df['Metadata'][start:end].tolist()
        docs_ids = cleaned_df['thread_id'][start:end].astype('str').tolist()
        if documents_list and metadata_list and docs_ids:
            emails_collection.add(documents=documents_list, ids=docs_ids, metadatas=metadata_list)
            print(f"Start = {start}.....End = {end}")
        else:
            break
    return

In [27]:
return_chunks(cleaned_df)

Start = 0.....End = 500
Start = 500.....End = 1000
Start = 1000.....End = 1500
Start = 1500.....End = 2000
Start = 2000.....End = 2500
Start = 2500.....End = 3000
Start = 3000.....End = 3500
Start = 3500.....End = 4000
Start = 4000.....End = 4086


In [28]:
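# Now let's fetch a few entries from the collection to see if the data has been captured correctly.
# The IDs are thread_id values, which start at 1, so '0' matches nothing and only two entries are returned.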

emails_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['1', '2'],
 'embeddings': array([[-0.0050267 ,  0.0048146 , -0.00124077, ..., -0.00390966,
         -0.02675251, -0.01411153],
        [-0.00561031, -0.00979695,  0.00220803, ..., -0.00475341,
         -0.01313682, -0.01265102]], shape=(2, 1536)),
 'documents': ["The email thread discusses the Master Termination Log and the need to investigate a CNG LDC (Hope Gas) termination and a $66 million settlement offer. Stephanie Panus sends out the Daily List and Master Termination Log for various dates. Kim Theriot requests her name and Melissa Murphy's name to be removed from the distribution list and adds several names to it. The thread also includes updates on terminations and valid terminations for various companies.",
  'A lunch meeting has been scheduled for May 5th from 12:00 p.m. to 1:30 p.m. to discuss the ISDA and CSA Masters and Schedules. Attendees are asked to RSVP for catering purposes. Carol requests confirmation of attendees and adds three new members to the group. Jo

In [29]:
# Create a cache collection that is checked first for every query.
# The main collection is queried only when the cache holds no sufficiently close match.

cache_collection = perstclient.get_or_create_collection(name='Email_Cache', embedding_function=embedding_function)

In [30]:
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.embeddings: 'embeddings'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [91]:
# Let's take a query from the end user
query = input()

 All email from Susan to Emily asking her plan for the weekend


In [92]:
# On a first run the cache collection is empty, so this query returns no results.
# On later runs, it can return results that were cached from earlier queries.

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)
cache_results

{'ids': [['multiple individuals discussing their plans to visit Houston and meet with each other. They discuss potential dates and times for meetings, as well as locations for dinner']],
 'embeddings': None,
 'documents': [['multiple individuals discussing their plans to visit Houston and meet with each other. They discuss potential dates and times for meetings, as well as locations for dinner']],
 'uris': None,
 'data': None,
 'metadatas': [[{'distances0': '0.15828947722911835',
    'distances1': '0.20246121287345886',
    'distances2': '0.26013675332069397',
    'distances3': '0.26889169216156006',
    'distances4': '0.28665289282798767',
    'distances5': '0.28878647089004517',
    'distances6': '0.2895488739013672',
    'distances7': '0.2993606925010681',
    'distances8': '0.30021142959594727',
    'distances9': '0.30209222435951233',
    'documents0': 'The email thread consists of various individuals discussing their upcoming visits to Houston and arranging meetings. They also di

In [93]:
# Now let's query from the main collection which hosts the entire dataset in it.

results = emails_collection.query(
    query_texts=query,
    n_results=10
)
results

{'ids': [['4165',
   '1578',
   '3690',
   '2584',
   '961',
   '3268',
   '1835',
   '3031',
   '3841',
   '1570']],
 'embeddings': None,
 'documents': [["Susan asks Emily about her plans for the weekend, mentioning a last-minute trip to Boston. Sara provides her contact information. Susan asks about a game and mentions that Emily may also be in town. Karen cancels plans for a lake trip due to a friend's back problems and suggests staying home to be with Herschel. She suggests going to her mom's later in the week and going to the Hyatt Hill Country the following week.",
   "The email thread consists of various conversations about weekend plans and activities. The first email asks if the recipient will be in town and suggests a visit. The second email mentions potential layoffs and the impact on the economy. The third email discusses the recipient's house search. The fourth email asks if the recipient closed on a deal. The fifth email discusses weekend plans and suggests saving a trip 

### Implementing Cache in Semantic Search

Now that we have the results for the query, we populate the cache with them. This helps reduce the wait time for similar queries in the future.
The cache collection is usually much smaller than the main collection, so when the required results are already cached, the search is faster and the wait time is lower.
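
The cache lookup boils down to one rule: reuse stored results when the new query is within a distance threshold of a previously cached query, and otherwise fall back to the main search. A minimal in-memory sketch of that rule follows. The `SemanticCache` class, the two-dimensional toy embeddings and the use of cosine distance are all assumptions made for illustration; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

class SemanticCache:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_results)

    def lookup(self, embedding):
        """Return the results of the nearest cached query, or None on a miss."""
        if not self.entries:
            return None
        dist, results = min(
            ((cosine_distance(embedding, e), r) for e, r in self.entries),
            key=lambda pair: pair[0],
        )
        return results if dist <= self.threshold else None

    def store(self, embedding, results):
        self.entries.append((embedding, results))

cache = SemanticCache(threshold=0.2)
first = cache.lookup([1.0, 0.0])          # empty cache: miss
cache.store([1.0, 0.0], ["thread 4165"])  # cache the main-search results
near = cache.lookup([0.9, 0.1])           # close to the cached query: hit
far = cache.lookup([0.0, 1.0])            # unrelated query: miss
print(first, near, far)
```

A lower threshold makes the cache stricter (fewer hits, fewer wrong reuses); a higher one trades accuracy for speed.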

In [94]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = emails_collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query itself as the document in cache_collection so that it is embedded and can be matched against later queries
      # Store the retrieved ids, documents, distances and metadatas as the metadata of that entry, so that they can be fetched directly on a cache hit
      Keys = []
      Values = []

      for key, val in results.items():
        # Skip fields that were not returned (e.g. embeddings) and the 'included' bookkeeping entry
        if val is None or key == 'included':
          continue
        # Flatten each field into keys such as 'ids0', 'documents0', 'distances0', ...
        for i in range(len(val[0])):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))

      cache_collection.add(
          documents= [query],
          ids = [query],  # The query text doubles as the ID; str(cache_collection.count()) would give sequential IDs instead
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })


Not found in cache. Found in main collection.


In [95]:
results_df['Documents'][0]

"Susan asks Emily about her plans for the weekend, mentioning a last-minute trip to Boston. Sara provides her contact information. Susan asks about a game and mentions that Emily may also be in town. Karen cancels plans for a lake trip due to a friend's back problems and suggests staying home to be with Herschel. She suggests going to her mom's later in the week and going to the Hyatt Hill Country the following week."

### Re-Ranking with Cross Encoder Model

**Why do we use a cross-encoder model?**  <br>
Cross-encoder models refine the results obtained from a bi-encoder model, such as the embedding model used for the vector search above. A cross-encoder takes a pair of texts (here, the query and a retrieved document) as input and returns a relevance score that is typically more accurate than the bi-encoder's similarity.

This step does not scale to the full dataset, because cross-encoder models are comparatively slow and require more compute as the amount of data grows.
Hence, in this step we pass only the top 10 documents that were retrieved from the vector search.
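
The effect of re-ranking can be checked with plain Python. The sketch below takes the distances and cross-encoder scores from this notebook's results table and compares the top 3 by vector distance with the top 3 by re-ranked score; `top_k` is a hypothetical helper introduced for illustration:

```python
# (thread id, vector distance, cross-encoder score) copied from the results table
rows = [
    ("4165", 0.203655, 7.546918),
    ("1578", 0.238565, 2.065117),
    ("3690", 0.245788, -1.400059),
    ("2584", 0.249903, -0.886570),
    ("961", 0.254354, 5.844401),
    ("3268", 0.263253, -0.586846),
    ("1835", 0.268738, 3.225266),
    ("3031", 0.273342, 1.950657),
    ("3841", 0.276414, 0.911572),
    ("1570", 0.279297, -0.120574),
]

def top_k(rows, k, key, reverse=False):
    """Return the thread ids of the k best rows under the given sort key."""
    return [r[0] for r in sorted(rows, key=key, reverse=reverse)[:k]]

# Lower distance is better; higher cross-encoder score is better
by_distance = top_k(rows, 3, key=lambda r: r[1])
by_rerank = top_k(rows, 3, key=lambda r: r[2], reverse=True)
print(by_distance)
print(by_rerank)
```

Only thread 4165 appears in both lists, which shows how much the cross-encoder can reorder candidates that the vector search ranked close together.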

In [96]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [97]:
cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [98]:
cross_rerank_scores

array([ 7.546918  ,  2.065117  , -1.4000587 , -0.88656974,  5.8444014 ,
       -0.5868465 ,  3.2252655 ,  1.9506568 ,  0.911572  , -0.12057365],
      dtype=float32)

In [99]:
results_df['Reranked_scores'] = cross_rerank_scores
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'From': 'Kimberly KWATSON, Sara SSHACKL, Scot...",Susan asks Emily about her plans for the weeke...,0.203655,4165,7.546918
1,"{'From': 'Chris CDORLAN, Dorland, Grigsby, Len...",The email thread consists of various conversat...,0.238565,1578,2.065117
2,"{'From': 'Kate Symes, Kay Mann, Mark - ECT Leg...",There is a request for risk management people ...,0.245788,3690,-1.400059
3,"{'From': 'Carol St Clair, Kay KMANN, Kay Mann,...",The email thread consists of various conversat...,0.249903,2584,-0.88657
4,"{'From': 'Bailey, Blair, Corman, Fernandez, Ge...",The email thread consists of various individua...,0.254354,961,5.844401
5,"{'From': 'Chris CGERMAN, Germany, Kim S (Houst...",The email thread consists of a conversation be...,0.263253,3268,-0.586846
6,"{'From': 'Bailey, Barry BTYCHOL, Fernandez, Jo...",Susan and Dona are excited to see each other a...,0.268738,1835,3.225266
7,"{'From': 'Matthew Lenhart, Susan M Scott', 'Su...",Julie asks Susan if she can go to the spa on S...,0.273342,3031,1.950657
8,"{'From': 'Kay Mann, Mike Carson, Phillip M Lov...",The email thread discusses plans for a weekend...,0.276414,3841,0.911572
9,"{'From': 'Benjamin Rogers, Kay Mann, Matthew L...",The email thread discusses weekend plans and a...,0.279297,1570,-0.120574


In [100]:
top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'From': 'Kimberly KWATSON, Sara SSHACKL, Scot...",Susan asks Emily about her plans for the weeke...,0.203655,4165,7.546918
1,"{'From': 'Chris CDORLAN, Dorland, Grigsby, Len...",The email thread consists of various conversat...,0.238565,1578,2.065117
2,"{'From': 'Kate Symes, Kay Mann, Mark - ECT Leg...",There is a request for risk management people ...,0.245788,3690,-1.400059


In [101]:
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'From': 'Kimberly KWATSON, Sara SSHACKL, Scot...",Susan asks Emily about her plans for the weeke...,0.203655,4165,7.546918
4,"{'From': 'Bailey, Blair, Corman, Fernandez, Ge...",The email thread consists of various individua...,0.254354,961,5.844401
6,"{'From': 'Bailey, Barry BTYCHOL, Fernandez, Jo...",Susan and Dona are excited to see each other a...,0.268738,1835,3.225266


In [102]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

### Generating Response Using RAG

Now that the semantic search results have been re-ranked and narrowed down to the top 3, we can pass them to the LLM through a well-engineered prompt. 
Guided by the prompt, the LLM turns the retrieved email summaries into a human-readable answer for the user. 

In [103]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, results_df):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    messages = [
                {"role": "system", "content":  "You are a helpful assistant who can analyze emails and effectively answer user queries about any information in the emails."},
                {"role": "user", "content": f"""You have a question asked by the user in '{query}' and you have some search results from a corpus of company email data in the dataframe '{results_df}'. Each search result is a summary of an email thread that may be relevant to the user query.

                                                The column 'Documents' inside this dataframe contains the summarized text from the email thread and the column 'Metadatas' contains the email details such as 'From', 'To', 'Subject', 'IDs' and many more.

                                                Use the documents in '{results_df}' to answer the query '{query}'. Frame an informative answer and also use the dataframe to cite the relevant email details such as 'From', 'To', 'IDs' and 'Subject'.
                                                
                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, include that information in your answer.
                                                4. Use the 'Metadatas' column in the dataframe to retrieve and cite the email's Subject, From and To as the citation.
                                                5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')
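
A note on how the retrieved results reach the model: placing a dataframe inside an f-string inserts its printed form, and pandas by default shortens long cell values in that printed form, so the model may only see the beginning of each summary and metadata entry. One possible alternative, shown here only as a sketch, is to render each result as plain text before building the prompt. The helper name `build_context`, the output layout, and the sample record are all assumptions made for illustration; they are not part of the original notebook.

```python
def build_context(records):
    """Render retrieved results as a numbered plain-text block for a prompt.

    `records` is a list of dicts with a 'Documents' key (summary text) and a
    'Metadatas' key (dict of email details), mirroring the columns of
    top_3_RAG. The name and layout are illustrative, not from the notebook.
    """
    blocks = []
    for i, rec in enumerate(records, start=1):
        meta = rec.get("Metadatas") or {}
        # One "key: value" line per metadata field, e.g. From / To / Subject
        meta_lines = [f"{key}: {value}" for key, value in meta.items()]
        lines = [f"[Result {i}]", *meta_lines, f"Summary: {rec['Documents']}"]
        blocks.append("\n".join(lines))
    # A blank line separates consecutive results
    return "\n\n".join(blocks)


# Made-up sample record, for illustration only (not taken from the dataset)
sample = [
    {
        "Documents": "A placeholder summary of one email thread.",
        "Metadatas": {"From": "Sender A", "To": "Recipient B", "Subject": "Placeholder subject"},
    },
]
print(build_context(sample))
```

With the notebook's data this could be called as `build_context(top_3_RAG.to_dict("records"))`, assuming each `Metadatas` cell is a dict, and the returned string could then be placed in the prompt instead of the dataframe itself.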

In [104]:
response = generate_response(query, top_3_RAG)

In [105]:
print("\n".join(response))

Based on the provided documents, there is 1 relevant email involving Susan and Emily regarding plans for the weekend.

From: Susan
To: Emily
Subject: Weekend Plans

In the email, Susan is asking Emily about her plans for the weekend. The context suggests that Susan is interested in knowing what Emily has planned for the upcoming weekend.

Response:
Susan has reached out to Emily inquiring about her plans for the weekend.

Citation:
- Subject: Weekend Plans
- From: Susan
- To: Emily


----------------------------------------------------**END**---------------------------------------------------------------------