# Semantic Spotter - Project

## Building a LangChain-based Email Search AI
### Dataset: https://www.kaggle.com/code/xokent/email-thread-summary-nlp

## Problem Statement
The Kaggle dataset above contains email threads exchanged between different individuals. With about 21864 rows in total and multiple rows per thread, it is hard to get insight into a full conversation. For an outsider, it is very hard to understand what these conversations are about.

## Solution
Given this problem, a QA chain or a conversational chain is an apt application for retrieving the required information about the email conversations between different individuals.
### We will use LangChain for this problem statement.

### Step1: Libraries import

In [32]:
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
import numpy as np
import pandas as pd

In [2]:
!pip install langchain langchain-openai



In [3]:
!pip install faiss-cpu python-dotenv tiktoken



### Step2: Data Loading

In [5]:
df1 = pd.read_csv("email_thread_details.csv")

In [6]:
df2 = pd.read_csv("email_thread_summaries.csv")

In [7]:
df1

Unnamed: 0,thread_id,subject,timestamp,from,to,body
0,1,FW: Master Termination Log,2002-01-29 11:23:42,"Gossett, Jeffrey C. JGOSSET","['Giron', 'Darron C. Dgiron', 'Love', 'Phillip...",\n\n -----Original Message-----\nFrom: =09Ther...
1,1,FW: Master Termination Log,2002-01-31 12:50:00,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Gossett', 'Jeff...",\n\n -----Original Message-----\nFrom: =09Panu...
2,1,FW: Master Termination Log,2002-02-05 15:03:35,"Theriot, Kim S. KTHERIO","['Murphy', 'Melissa Mmurphy', 'Anderson', 'Dia...",Note to Stephanie Panus....\n\nStephanie...ple...
3,1,FW: Master Termination Log,2002-02-05 15:06:25,"Theriot, Kim S. KTHERIO","['Hall', 'D. Todd Thall', 'Sweeney', 'Kevin Ks...",\n\n -----Original Message-----\nFrom: =09Panu...
4,1,FW: Master Termination Log,2002-05-28 07:20:35,"Kelly, Katherine L. KKELLY","['Germany', 'Chris Cgerman']",\n\n -----Original Message-----\nFrom: =09McMi...
...,...,...,...,...,...,...
21679,4166,vacation,2000-10-04 11:32:00,Sara Shackleton,"['Gary Hickerson', 'Sheila Glover', 'Laurel Ad...",I will be on vacation from October 6- 13. Als...
21680,4167,web file,2001-03-18 22:57:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nCan you put this file in the approp..."
21681,4167,web file,2001-03-19 04:42:00,Matt Smith,['Amanda Huble'],"Amanda,\n\nPlease move the file i sent you fro..."
21682,4167,web file,2001-03-19 09:57:00,Matt Smith,['Amanda Huble <Amanda Huble/NA/Enron@Enron'],"Amanda,\n\nCan you put this file in the approp..."


In [8]:
df2

Unnamed: 0,thread_id,summary
0,1,The email thread discusses the Master Terminat...
1,2,A lunch meeting has been scheduled for May 5th...
2,3,Ben is updating a friend on his progress with ...
3,4,The recipient of the email thread initially ex...
4,5,The email thread discusses the long form confi...
...,...,...
4162,4163,Peter Thompson has sent a memo to Kay Mann and...
4163,4164,The email thread revolves around the sharing a...
4164,4165,Susan asks Emily about her plans for the weeke...
4165,4166,Several employees will be on vacation during d...


#### Creating Metadata

In [9]:
# Step 1: Create metadata_dict
df1["metadata_dict"] = df1.apply(lambda row: {
    "thread_id": row["thread_id"],
    "subject": row["subject"],
    "timestamp": row["timestamp"],
    "from": row["from"],
    "to": row["to"],
    "body": row["body"]
}, axis=1)

# Step 2: Group by thread_id and merge metadata_dicts
def merge_metadata(group):
    first_row = group.iloc[0]["metadata_dict"].copy()  # Take metadata from the first message
    merged_body = "\n\n".join(row["body"] for row in group["metadata_dict"])  # Merge all bodies
    first_row["body"] = merged_body
    return pd.Series({
        "thread_id": first_row["thread_id"],
        "metadata_dict": first_row
    })

df1_cleaned_merged = df1.groupby("thread_id").apply(merge_metadata).reset_index(drop=True)



In [10]:
df1_cleaned_merged.head(40)

Unnamed: 0,thread_id,metadata_dict
0,1,"{'thread_id': 1, 'subject': 'FW: Master Termin..."
1,2,"{'thread_id': 2, 'subject': 'Credit Group Lunc..."
2,3,"{'thread_id': 3, 'subject': 'New Address', 'ti..."
3,4,"{'thread_id': 4, 'subject': 'EOL Data', 'times..."
4,5,"{'thread_id': 5, 'subject': 'RE: long form con..."
5,6,"{'thread_id': 6, 'subject': 'BABY!', 'timestam..."
6,7,"{'thread_id': 7, 'subject': 'Canadian utilitie..."
7,8,"{'thread_id': 8, 'subject': 'RE: Golf Anyone?'..."
8,9,"{'thread_id': 9, 'subject': 'RE: YO', 'timesta..."
9,10,"{'thread_id': 10, 'subject': 'RE: NNG/Dynegy D..."
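The grouping step above keeps the metadata of the first message in each thread and joins all message bodies with a blank line. A minimal sketch of the same logic in plain Python, so it can be run without the dataset; the toy rows and the `merge_threads` helper are hypothetical and not part of the notebook:

```python
# Hypothetical toy rows standing in for email_thread_details.csv.
rows = [
    {"thread_id": 1, "subject": "FW: log", "from": "A", "body": "first message"},
    {"thread_id": 1, "subject": "FW: log", "from": "B", "body": "second message"},
    {"thread_id": 2, "subject": "lunch", "from": "C", "body": "only message"},
]

def merge_threads(rows):
    """Keep the first row's fields per thread and join all bodies with a blank line."""
    merged = {}
    for row in rows:
        tid = row["thread_id"]
        if tid not in merged:
            # Copy the first row of the thread and start collecting bodies.
            merged[tid] = {**row, "bodies": [row["body"]]}
        else:
            merged[tid]["bodies"].append(row["body"])
    result = []
    for record in merged.values():
        bodies = record.pop("bodies")
        record["body"] = "\n\n".join(bodies)
        result.append(record)
    return result

threads = merge_threads(rows)
print(len(threads))
print(repr(threads[0]["body"]))
```

Note that only the first message's `from`, `to` and `timestamp` survive the merge, so later senders in a thread are visible only inside the merged body text.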


In [11]:
df1_cleaned_merged[df1_cleaned_merged["thread_id"] == 1]["metadata_dict"].iloc[0]["body"]

'\n\n -----Original Message-----\nFrom: =09Theriot, Kim S. =20\nSent:=09Tuesday, January 29, 2002 1:23 PM\nTo:=09Richardson, Stacey; Anderson, Diane; Gossett, Jeffrey C.; White, Stac=\ney W.; Murphy, Melissa; Hall, D. Todd; Sweeney, Kevin\nCc:=09Aucoin, Evelyn; Baxter, Bryce; Wynne, Rita\nSubject:=09FW: Master Termination Log\n\n\n\n -----Original Message-----\nFrom: =09Panus, Stephanie =20\nSent:=09Tuesday, January 29, 2002 11:39 AM\nTo:=09Adams, Laurel; Alonso, Tom; Aronowitz, Alan; Bailey, Susan; Balfour-F=\nlanagan, Cyndie; Baughman, Edward; Belden, Tim; Bishop, Serena; Brackett, D=\nebbie R.; Bradford, William S.; Browning, Mary Nell; Bruce, James; Bruce, M=\nichelle; Bruce, Robert; Buerkle, Jim; Calger, Christopher F.; Carrington, C=\nlara; Considine, Keith; Cordova, Karen A.; Crandall, Sean; Cutsforth, Diane=\n; Diamond, Russell; Dunton, Heather; Edison, Susan; Elafandi, Mo; Fischer, =\nMark; Flores, Nony; Fondren, Mark; Gorny, Vladimir; Gorte, David; Gresham, =\nWayne; Hagelman

In [12]:
# Step 1: Merge df1_cleaned_merged with df2 on thread_id
merged_df = pd.merge(df1_cleaned_merged, df2, on="thread_id", how="left")

# Step 2: Add 'summary' to metadata_dict
merged_df["metadata_dict"] = merged_df.apply(
    lambda row: {**row["metadata_dict"], "summary": row["summary"]},
    axis=1
)

# Optional: drop the now-redundant summary column
merged_df = merged_df.drop(columns=["summary"])

In [13]:
merged_df

Unnamed: 0,thread_id,metadata_dict
0,1,"{'thread_id': 1, 'subject': 'FW: Master Termin..."
1,2,"{'thread_id': 2, 'subject': 'Credit Group Lunc..."
2,3,"{'thread_id': 3, 'subject': 'New Address', 'ti..."
3,4,"{'thread_id': 4, 'subject': 'EOL Data', 'times..."
4,5,"{'thread_id': 5, 'subject': 'RE: long form con..."
...,...,...
4162,4163,"{'thread_id': 4163, 'subject': 'ltr to Kay Man..."
4163,4164,"{'thread_id': 4164, 'subject': 'presentation',..."
4164,4165,"{'thread_id': 4165, 'subject': 'this weekend',..."
4165,4166,"{'thread_id': 4166, 'subject': 'vacation', 'ti..."
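Because the merge uses `how="left"`, every thread is kept even if it has no matching summary row; in pandas such a thread would end up with a `NaN` summary. A minimal sketch of that left-join behaviour in plain Python, with hypothetical toy data and a hypothetical `attach_summaries` helper, showing how threads without a summary could be listed:

```python
# Hypothetical toy data: three threads, but only two summaries.
threads = [{"thread_id": 1}, {"thread_id": 2}, {"thread_id": 3}]
summaries = {1: "Discusses the termination log.", 2: "Lunch is scheduled."}

def attach_summaries(threads, summaries, default=""):
    """Left-join style merge: every thread is kept, missing summaries get `default`."""
    return [{**t, "summary": summaries.get(t["thread_id"], default)} for t in threads]

merged = attach_summaries(threads, summaries)

# Threads whose summary is empty had no match on the right-hand side.
missing = [t["thread_id"] for t in merged if not t["summary"]]
print(missing)
```

Checking for missing summaries before building the documents avoids storing empty or `NaN` values in the metadata.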


In [14]:
merged_df[merged_df["thread_id"] == 1]["metadata_dict"].iloc[0]["summary"]

"The email thread discusses the Master Termination Log and the need to investigate a CNG LDC (Hope Gas) termination and a $66 million settlement offer. Stephanie Panus sends out the Daily List and Master Termination Log for various dates. Kim Theriot requests her name and Melissa Murphy's name to be removed from the distribution list and adds several names to it. The thread also includes updates on terminations and valid terminations for various companies."

In [15]:
merged_df[merged_df["thread_id"] == 1]["metadata_dict"].iloc[0]["subject"]

'FW: Master Termination Log'

In [16]:
final_df = merged_df['metadata_dict']

In [17]:
final_df

0       {'thread_id': 1, 'subject': 'FW: Master Termin...
1       {'thread_id': 2, 'subject': 'Credit Group Lunc...
2       {'thread_id': 3, 'subject': 'New Address', 'ti...
3       {'thread_id': 4, 'subject': 'EOL Data', 'times...
4       {'thread_id': 5, 'subject': 'RE: long form con...
                              ...                        
4162    {'thread_id': 4163, 'subject': 'ltr to Kay Man...
4163    {'thread_id': 4164, 'subject': 'presentation',...
4164    {'thread_id': 4165, 'subject': 'this weekend',...
4165    {'thread_id': 4166, 'subject': 'vacation', 'ti...
4166    {'thread_id': 4167, 'subject': 'web file', 'ti...
Name: metadata_dict, Length: 4167, dtype: object

### Step3: Create LangChain Document objects

In [18]:
from langchain.docstore.document import Document

documents = [
    Document(
        page_content=row["body"],  # Use the full thread body for vector search
        metadata=row  # Include all metadata: subject, summary, to, from, etc.
    )
    for row in final_df
]

### Step4: Chunking

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
chunked_documents = splitter.split_documents(documents)
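To illustrate what `chunk_size=2000` and `chunk_overlap=100` mean, here is a simplified sketch of fixed-size chunking with overlap. It is only an approximation: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and line breaks, so its chunks are usually shorter than the limit. The `chunk_text` function is hypothetical.

```python
def chunk_text(text, chunk_size=2000, chunk_overlap=100):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

sample = "x" * 4500
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [2000, 2000, 700]
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one of the two neighbouring chunks, provided it is shorter than the overlap.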

#### Reading the OpenAI API key

In [20]:
import openai
import os

In [21]:
# Read the OpenAI API key from a local file
with open("keys.txt", "r") as file:
    os.environ["OPENAI_API_KEY"] = file.read().strip()  # Removes extra spaces/newlines
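A slightly more defensive variant is sketched below, assuming the key may already be set in the environment and that `keys.txt` is only a fallback. The `load_openai_key` helper is hypothetical and not part of the notebook.

```python
import os
from pathlib import Path

def load_openai_key(path="keys.txt", env=None):
    """Return the API key from the environment if set, else from a local file."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"No OPENAI_API_KEY set and {path!r} not found")
    return key_file.read_text().strip()

# Example with an explicit mapping instead of the real environment:
print(load_openai_key(env={"OPENAI_API_KEY": "  sk-example  "}))
```

Keeping the key file out of version control (for example by listing it in `.gitignore`) prevents the key from being published with the notebook.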

In [22]:
pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [23]:
import logging

logging.disable(logging.CRITICAL)

In [25]:
from langchain.vectorstores import FAISS

### Step5: Embedding and Vector store creation

In [26]:
from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunked_documents, embedding_model)

### Step6: Set up a Retriever

In [27]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
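With `k=5`, the retriever returns the five chunks whose embeddings are closest to the query embedding. The toy sketch below shows the idea of top-k retrieval in plain Python. It uses cosine similarity as the scoring function, which is an assumption; the FAISS index built above may rank by a different distance measure. The vectors and the `top_k` helper are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k vectors most similar to the query."""
    scored = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return scored[:k]

# Hypothetical 2-dimensional "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))  # [0, 2]
```

A larger `k` gives the LLM more context per question at the cost of a longer prompt.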

### Step7: Generative QA Chain

In [28]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
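`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea follows; the prompt wording and the `build_stuff_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_stuff_prompt(
    "Who sent the memo?",
    ["Chunk one text.", "Chunk two text."],
)
print(prompt)
```

With five chunks of up to 2000 characters each, the context stays well within the limits of a single prompt, which is why the simple "stuff" approach is sufficient here.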



In [31]:
query = "When is House of Lords planning to give decision in CATS litigation"

response = qa_chain.invoke({"query": query})

print("Answer:\n", response['result'])

# Optional: Check sources
for i, doc in enumerate(response['source_documents']):
    print(f"\n--- Source Document {i+1} ---\n")
    print(doc.page_content[:500])

Answer:
 The House of Lords planned to give their decision in the CATS litigation on Wednesday, April 4.

--- Source Document 1 ---

----- Forwarded by Richard B Sanders/HOU/ECT on 03/30/2001 10:01 AM -----

	Mary Nell Browning
	03/30/2001 09:12 AM
		 
		 To: James Derrick/Corp/Enron, Michael R Brown/LON/ECT@ECT, John 
Sherriff/LON/ECT@ECT, Mark Evans/Legal/LON/ECT@ECT, Richard B 
Sanders/HOU/ECT@ECT, Fernley Dyson/LON/ECT@ECT, Paul Chivers/LON/ECT@ECT, 
Richard Lewis/LON/ECT@ECT, Peter Crilly/LON/ECT@ECT, Richard 
Harper/LON/ECT@ECT, Paul Turner/LON/ECT@ECT, Jackie Gentle/LON/ECT@ECT, 
Claire Wright/LON/ECT@ECT, Raj N Patel 

--- Source Document 2 ---

Disappointingly, the House of Lords ruled 5 - 0 against Enron in the CATS 
litigation today.  This will mean that we will repay to the CATS parties 
approximately $150 million plus interest and court costs, putting the final 
figure at an estimated $155-160 million.  We expect to be invoiced for the 
principal amount in the next week or
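Since several retrieved chunks can come from the same email thread, it can be easier to inspect the sources per thread rather than per chunk. A minimal sketch, assuming each source document carries the `thread_id` and `subject` metadata created earlier; the `summarize_sources` helper and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace

def summarize_sources(source_documents):
    """Return one (thread_id, subject) pair per distinct thread, in retrieval order."""
    seen = set()
    threads = []
    for doc in source_documents:
        tid = doc.metadata.get("thread_id")
        if tid not in seen:
            seen.add(tid)
            threads.append((tid, doc.metadata.get("subject")))
    return threads

# Hypothetical stand-ins for the Document objects returned by the chain.
docs = [
    SimpleNamespace(metadata={"thread_id": 7, "subject": "CATS"}),
    SimpleNamespace(metadata={"thread_id": 7, "subject": "CATS"}),
    SimpleNamespace(metadata={"thread_id": 9, "subject": "RE: YO"}),
]
print(summarize_sources(docs))  # [(7, 'CATS'), (9, 'RE: YO')]
```

In the notebook, the same helper could be called as `summarize_sources(response['source_documents'])`.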

In [38]:
query1 = "When is House of Lords planning to give decision in CATS litigation"

In [39]:
answer = qa_chain.invoke({"query": query1})['result']
print(answer.strip().split("\n")[0])

The House of Lords planned to give their decision in the CATS litigation on Wednesday, 4th April.


In [37]:
query2 = "What is the summary of conversation between Kevin A. Howard and King Jr."
print((qa_chain.invoke({"query": query2})['result']).strip().split("\n")[0])

Kevin A. Howard was appointed as Vice President of ETS, TW, and NNG effective November 12, 2001. His job description involved commercial and financial transactions support. King Jr. was coordinating with Miranda to update the organizational charts for TW and NNG to include Kevin A. Howard. There was a discussion about Kevin reporting to Rod, and questions were raised about his placement on the organizational charts under Saunders and Peters, as well as in the Finance and Accounting charts. Bill confirmed that Kevin would be reporting to Rod and that his title would be Vice President, Commercial and Financial Transactions Support.


In [42]:
query3 = "As per mail chain Index forwards/swaps who has become the leader of the project by default"
print((qa_chain.invoke({"query": query3})['result']).strip().split("\n")[0])

Bob Badeer has become the leader of the project by default, as mentioned in the email chain.


In [53]:
query4 = "Before being transferred to gas pipeline legal group, Bill Rapp was associated with which department"
print((qa_chain.invoke({"query": query4})['result']).strip().split("\n")[0])

Before being transferred to the gas pipeline legal group, Bill Rapp was associated with the Enron Legal Department.


#### We have built a simple QA chain that returns the desired answers to our queries from the list of documents

### Step8: Now let's build a conversational QA Chain

In [58]:
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
# Create memory to store chat history
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Define the retriever
retriever_conversational = vectorstore.as_retriever()

# Create the conversational QA chain
coversational_qa_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(),  # can be any chat-capable LLM
    retriever=retriever_conversational,
    memory=memory,
)

In [59]:
conv1 = "When is House of Lords planning to give decision in CATS litigation"

In [64]:
print(coversational_qa_chain.invoke({"question": conv1}))

{'question': 'When is House of Lords planning to give decision in CATS litigation', 'chat_history': [HumanMessage(content='When is House of Lords planning to give decision in CATS litigation', additional_kwargs={}, response_metadata={}), AIMessage(content='The House of Lords planned to give their decision in the CATS litigation on Wednesday 4 April.', additional_kwargs={}, response_metadata={}), HumanMessage(content='When is House of Lords planning to give decision in CATS litigation', additional_kwargs={}, response_metadata={}), AIMessage(content='The House of Lords is scheduled to deliver their decision in the CATS litigation on Wednesday, April 4.', additional_kwargs={}, response_metadata={}), HumanMessage(content='When is House of Lords planning to give decision in CATS litigation', additional_kwargs={}, response_metadata={}), AIMessage(content='The House of Lords was scheduled to deliver their decision in the CATS litigation on Wednesday, April 4th.', additional_kwargs={}, respons
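The chat history above contains the same question several times, which is consistent with the cell having been run repeatedly against the same memory object. The sketch below uses a hypothetical `ToyBufferMemory` class as a stand-in for a buffer memory to show how the history grows with every call and how clearing it starts a fresh conversation.

```python
class ToyBufferMemory:
    """Hypothetical stand-in that stores every question/answer pair, like a buffer memory."""

    def __init__(self):
        self.messages = []

    def save(self, question, answer):
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def clear(self):
        self.messages = []

memory = ToyBufferMemory()
for _ in range(3):  # simulates running the same cell three times
    memory.save("When is the decision due?", "On Wednesday, April 4.")
print(len(memory.messages))  # 6

memory.clear()  # start a fresh conversation
print(len(memory.messages))  # 0
```

`ConversationBufferMemory` also provides a `clear()` method, so resetting the history before a new conversation should be possible in the same way.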

In [65]:
print(coversational_qa_chain.invoke({"question": conv1})['answer'])

The House of Lords planned to announce their decision in the CATS litigation on Wednesday, April 4.


In [66]:
conv2 = "So what was the final decision from House of Lords"

In [67]:
print(coversational_qa_chain.invoke({"question": conv2})['answer'])

The final decision from the House of Lords ruled against Enron in the CATS litigation, with a 5 - 0 decision. This meant that Enron would have to repay approximately $150 million plus interest and court costs, totaling an estimated $155-160 million. The Lords' decision was based on their interpretation of the contract and their assessment of Enron's entitlement to relief.


In [71]:
conv3 = "Thanks can you give me the full summary of this case"
print(coversational_qa_chain.invoke({"question": conv3})['answer'])

Enron was involved in a CATS litigation case where the House of Lords ruled against them, resulting in Enron having to repay approximately $150 million plus interest and court costs, totaling an estimated $155-160 million. The Lords' decision was based on their interpretation of the contract rather than the contract's provisions. The opinion highlighted issues such as the retroactive consequences of latent defects, the timing of obligations under the contract, and the effectiveness of notices sent by the CATS parties. Lord Hoffman, who authored the primary opinion, concluded that Enron was not entitled to relief under the contract because they were not ready to flow J-Block gas during a specific period. The ruling was seen as unfavorable to Enron, and despite the disappointment, the support received during the case was appreciated.


#### We see that we are getting the right results in conversational form, and the LLM is able to understand the context of the current query based on the previous queries
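A conversational retrieval chain typically handles follow-up questions by first asking the LLM to rewrite them, together with the chat history, into a standalone question that is then used for retrieval. A minimal sketch of how such a rewriting prompt could be assembled; the wording and the `build_condense_prompt` helper are hypothetical and are not LangChain's actual template.

```python
def build_condense_prompt(chat_history, follow_up):
    """Format prior turns and the follow-up into a single rewriting prompt."""
    lines = [f"{role}: {text}" for role, text in chat_history]
    history = "\n".join(lines)
    return (
        "Given the conversation below, rewrite the follow-up question "
        "so that it can be understood on its own.\n\n"
        f"{history}\n\nFollow-up question: {follow_up}\nStandalone question:"
    )

history = [
    ("Human", "When is House of Lords planning to give decision in CATS litigation"),
    ("AI", "On Wednesday, April 4."),
]
print(build_condense_prompt(history, "So what was the final decision from House of Lords"))
```

This rewriting step is what lets a question such as "give me the full summary of this case" retrieve CATS-related chunks even though it never mentions CATS.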

## Summary
#### We built a simple QA (question-answering) chain using LangChain that finds answers to our queries from the corpus of data, and we are getting more than 90% accuracy.
#### We also built a conversational chain, similar to a chatbot, that maintains the history of the ongoing conversation. The answers provided by the conversational chain took the previous turns into account.
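An accuracy figure is easier to reproduce with a small labeled set of questions. One possible approach is sketched below, assuming accuracy is defined as the share of answers that contain all expected keywords. The `keyword_accuracy` function, the test questions and the canned answers are hypothetical; in the notebook, the answer function would wrap `qa_chain.invoke`.

```python
def keyword_accuracy(answer_fn, test_cases):
    """Fraction of questions whose answer contains every expected keyword."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, keywords in test_cases:
        answer = answer_fn(question).lower()
        if all(k.lower() in answer for k in keywords):
            hits += 1
    return hits / len(test_cases)

# Hypothetical stub standing in for qa_chain.invoke(...)["result"].
canned = {
    "When is the CATS decision due?": "On Wednesday, April 4.",
    "Who leads the project?": "I don't know.",
}
test_cases = [
    ("When is the CATS decision due?", ["April 4"]),
    ("Who leads the project?", ["Bob Badeer"]),
]
print(keyword_accuracy(canned.get, test_cases))  # 0.5
```

Keyword matching is a coarse measure, so a manual review of the mismatches is still advisable.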
