## Initial processing, summarization of the posts, and preparation of the Reddit dataset for a RAG pipeline

### Thread reconstruction

In [1]:
import pandas as pd
import json

In [2]:
with open('FedEmployees.json', 'r', encoding='utf-8') as f:
    bbw = json.load(f)
df = pd.DataFrame(bbw)

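Let us count how many distinct Reddit threads this dataset contains. By a thread we mean a unique submission post together with all of its comments. Every comment carries the `reddit_link_id` of the submission it belongs to, so the distinct non-null values of that column identify the threads (a submission without any comments does not appear in this count).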

In [4]:
lis = list(set(list(df.reddit_link_id)))

In [7]:
list_of_reddit_link_ids = [e for e in lis if e is not None]
len(list_of_reddit_link_ids)

49

There are just 49 distinct threads in this dataset. The function below reconstructs a thread from its position in the list of unique `reddit_link_id` values.
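Since threads are addressed below by their position in this list, it can help to make the ordering reproducible. A minimal sketch in plain Python, using a few sample ID values rather than the real dataframe column (`unique_thread_ids` is a hypothetical helper): `dict.fromkeys` keeps first-seen order, whereas the order of a `set` of strings can change between Python sessions.

```python
# Hypothetical sample of a reddit_link_id column: submissions carry None,
# comments carry the ID of the submission they belong to.
link_ids = [None, "t3_lq2m17", "t3_lq2m17", None, "t3_rj3a8g", "t3_lq2m17"]

def unique_thread_ids(values):
    """Return the distinct non-null IDs in first-seen order (deterministic)."""
    return list(dict.fromkeys(v for v in values if v is not None))

thread_ids = unique_thread_ids(link_ids)
print(thread_ids)       # ['t3_lq2m17', 't3_rj3a8g']
print(len(thread_ids))  # 2
```

With the dataframe itself, `df['reddit_link_id'].dropna().unique()` should give the same first-seen ordering.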

In [52]:
# Reconstruct an entire thread from its position in the list of unique reddit_link_ids.
# Rows are sorted by reddit_created_utc to preserve the temporal order of the posts.
def retrive_entire_post(dataframe, id, comment_level = -1):
    '''Reconstruct the Reddit page of one thread from reddit_name/reddit_link_id/reddit_parent_id.
       comment_level=-1 returns the submission with all comments; comment_level=1 restricts the
       reconstruction to level 1 comments. id is the index location in list_of_reddit_link_ids.'''

    # Build the list of unique reddit_link_id values of the given dataframe, dropping None.
    # Note: the order of a set of strings is not stable across Python sessions, so a given
    # index refers to a particular thread only within the current session.
    lis = list(set(list(dataframe.reddit_link_id)))
    lis = [ele for ele in lis if ele is not None]
    id_name = lis[id]

    if(comment_level == -1):
        # Submission plus every comment of the thread, in chronological order (the reply hierarchy is not kept)
        return pd.concat([dataframe[dataframe['reddit_name']==id_name], dataframe[dataframe['reddit_link_id']==id_name]]).sort_values(by=['reddit_created_utc'])
    if(comment_level == 1):
        # Submission plus only the comments whose parent is the submission itself, in chronological order
        arr = []
        arr.append(dataframe[dataframe['reddit_name']==id_name])
        arr.append(dataframe[dataframe['reddit_parent_id']==id_name])
        return pd.concat(arr).sort_values(by=['reddit_created_utc'])

In [28]:
#example
retrive_entire_post(df, 31) ## this gives the entire thread

Unnamed: 0,aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission
84,submission,2021-02-22T18:00:59,lq2m17,t3_lq2m17,1614034859,troyf1,Does anyone have a chart outlining the current...,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,NO vs GS pay schedule?,https://www.reddit.com/r/FedEmployees/comments...,FedEmployees,,,
85,comment,2021-02-23T11:48:32,goh7hg5,t1_goh7hg5,1614098912,katzeye007,OPM.gov should have that data,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t3_lq2m17,lq2m17
87,comment,2021-02-23T11:51:35,goh7zc6,t1_goh7zc6,1614099095,troyf1,Unfortunately it’s not posted anywhere on ther...,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t1_goh7hg5,lq2m17
86,comment,2021-02-23T17:16:14,goijknf,t1_goijknf,1614118574,,[deleted],/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t3_lq2m17,lq2m17
88,comment,2021-03-08T22:12:05,gqadxji,t1_gqadxji,1615259525,ProveItAllNite,This is not true. There are several pay system...,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t1_goijknf,lq2m17


Note that the text of some comments has been deleted (`reddit_text` is `[deleted]`), as in the fourth row above.

In [30]:
#example
retrive_entire_post(df, 31, 1) ## this gives the thread with only level 1 comments (the two replies to comments in this thread are dropped)

Unnamed: 0,aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission
84,submission,2021-02-22T18:00:59,lq2m17,t3_lq2m17,1614034859,troyf1,Does anyone have a chart outlining the current...,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,NO vs GS pay schedule?,https://www.reddit.com/r/FedEmployees/comments...,FedEmployees,,,
85,comment,2021-02-23T11:48:32,goh7hg5,t1_goh7hg5,1614098912,katzeye007,OPM.gov should have that data,/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t3_lq2m17,lq2m17
86,comment,2021-02-23T17:16:14,goijknf,t1_goijknf,1614118574,,[deleted],/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_s...,,,FedEmployees,t3_lq2m17,t3_lq2m17,lq2m17


In [31]:
# cross-check with the actual web-link
retrive_entire_post(df, 31, 1).iloc[0].reddit_url

'https://www.reddit.com/r/FedEmployees/comments/lq2m17/no_vs_gs_pay_schedule/'

In [32]:
# same as the function retrive_entire_post above but for id it uses the actual reddit_link_id ... can be handy later on
def retrive_entire_post_by_id(dataframe, id, comment_level = -1):
    '''reconstruct the Reddit thread (containing all posts if comment_level=-1) from the given reddit_name/reddit_link_id/reddit_parent_id. 
       If comment_level=1 then the reconstruction is restricted only to level 1 comments''' 
    id_name = id
    if(comment_level == -1):
        # Reconstructs the entire post with the given id with all the comments in a somewhat unstructured manner. A sorting is done with respect to reddit_created_utc to keep the temporal flow of information
        return pd.concat([dataframe[dataframe['reddit_name']==id_name], dataframe[dataframe['reddit_link_id']==id_name]]).sort_values(by=['reddit_created_utc'])
    if(comment_level == 1):
        # Reconstructs the entire post with the given id keeping only level 1 comments. A sorting is done with respect to reddit_created_utc to keep the temporal flow of information
        arr = []
        arr.append(dataframe[dataframe['reddit_name']==id_name])
        arr.append(dataframe[dataframe['reddit_parent_id']==id_name])
        return pd.concat(arr).sort_values(by=['reddit_created_utc'])

### Extracting the conversation of each thread into a list

In [54]:
# Extract all the posts of a given thread and return their text as a list.
# comment_level limits the output to the top-level comments (1) or keeps all the comments (-1); in the latter case
# the hierarchical structure is not preserved in the output ... need to think about how to do it.
# Despite its name, id_name is the positional index expected by retrive_entire_post.
def conversation(dataframe, id_name, comment_level=-1, user=False):
    temp = retrive_entire_post(dataframe, id_name, comment_level=comment_level)
    if user:
        # dict of author -> text; an author who posts more than once keeps only their last post
        return temp[["reddit_author", "reddit_text"]].set_index("reddit_author").to_dict()['reddit_text']
    return list(temp.reddit_text)

In [53]:
# example
print(conversation(df,31,-1))

['Does anyone have a chart outlining the current NO pay schedule for civilians?  Can’t seem to find it anywhere, and trying to compare it to the GS pay schedule for a job opportunity.  My locality is Washington, DC.  For example:\n\nThe opportunity is NO-5 with a salary range of $120,577 - $172,500\n\nGS-14 is $122,530 - $159,286, and GS-15 is $144,128 - $172,500\n\nIs there anything higher than NO-5 on the NO pay schedule?', 'OPM.gov should have that data', 'Unfortunately it’s not posted anywhere on there, which I found odd...', '[deleted]', 'This is not true. There are several pay systems and/or agencies whose  pay plan exceeds the GS-15 level. Here are a few:\n\nhttps://www.consumerfinance.gov/about-us/careers/pay-scales/\nhttps://careers.occ.gov/pay-and-benefits/salary/index-occ-salary-structure.html\nhttps://www.va.gov/OHRM/Pay/2021/PhysicianDentist/PayTables_20210103.pdf']
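The flat list above loses the reply hierarchy. One possible way to keep it is to group the comments by parent and walk the thread depth-first. The sketch below does this on plain dictionaries that mirror the `reddit_name`, `reddit_parent_id`, and `reddit_created_utc` columns of the example thread; the texts are abbreviated, and `thread_as_tree` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Rows mirroring the example thread t3_lq2m17 (texts abbreviated).
rows = [
    {"reddit_name": "t3_lq2m17", "reddit_parent_id": None,
     "reddit_created_utc": 1614034859, "reddit_text": "Does anyone have a chart..."},
    {"reddit_name": "t1_goh7hg5", "reddit_parent_id": "t3_lq2m17",
     "reddit_created_utc": 1614098912, "reddit_text": "OPM.gov should have that data"},
    {"reddit_name": "t1_goh7zc6", "reddit_parent_id": "t1_goh7hg5",
     "reddit_created_utc": 1614099095, "reddit_text": "Unfortunately it is not posted..."},
    {"reddit_name": "t1_goijknf", "reddit_parent_id": "t3_lq2m17",
     "reddit_created_utc": 1614118574, "reddit_text": "[deleted]"},
    {"reddit_name": "t1_gqadxji", "reddit_parent_id": "t1_goijknf",
     "reddit_created_utc": 1615259525, "reddit_text": "This is not true. ..."},
]

def thread_as_tree(rows, root_name):
    """Return (depth, text) pairs in depth-first order, replies sorted by time."""
    by_name = {r["reddit_name"]: r for r in rows}
    children = defaultdict(list)
    for r in rows:
        if r["reddit_parent_id"] is not None:
            children[r["reddit_parent_id"]].append(r)
    for replies in children.values():
        replies.sort(key=lambda r: r["reddit_created_utc"])

    out = []

    def walk(name, depth):
        out.append((depth, by_name[name]["reddit_text"]))
        for child in children.get(name, []):
            walk(child["reddit_name"], depth + 1)

    walk(root_name, 0)
    return out

tree = thread_as_tree(rows, "t3_lq2m17")
for depth, text in tree:
    print("    " * depth + text)
```

The indented lines could be joined into a single string before being passed to the summarization prompt, so that the model sees which comment answers which.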


### Next we summarize the conversation using an LLM (this is not an important step for implementing RAG)

In [55]:
import os
from getpass import getpass

In [56]:
huggingfacehub_api_token = getpass() # paste your Hugging Face Hub API token here (it is freely available on the Hugging Face website)

 ········


In [57]:
os.environ['HUGGINGFACEHUB_API_TOKEN'] = huggingfacehub_api_token  # upper-case name is the one LangChain looks up

In [58]:
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub

In [60]:
# get the LLMs (I have chosen two here just to see if one is better than the other)
llm_mistral = HuggingFaceHub(repo_id='mistralai/Mistral-7B-Instruct-v0.2', huggingfacehub_api_token=huggingfacehub_api_token)
llm_falcon_7 = HuggingFaceHub(repo_id='tiiuae/falcon-7b-instruct', huggingfacehub_api_token=huggingfacehub_api_token)

In [64]:
def gen_prompt(d_frame, id):
    # template = """You are a conversation summarizing AI agent. The conversation is given in the form of a Python list: {list}
    #               Write a summary capturing key highlights of the above conversation.
    #            """
    template = """Summarize the following conversation: 
                  {list} 
                  Paraphrase your output."""
    prompt = PromptTemplate(template=template, input_variables=['list'])
    prompt_formatted_str: str = prompt.format(list=conversation(d_frame, id))
    return prompt_formatted_str

In [65]:
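def llm_response(df, id, llm=llm_mistral):
    prompt = gen_prompt(df, id)
    out = llm.invoke(prompt)  # calling llm(prompt) directly is deprecated in recent LangChain versions
    # return out.split('Write a summary capturing key highlights of the above conversation.')[-1]
    # the returned text starts with the prompt itself, so keep only what follows the prompt's last line
    return out.split('Paraphrase your output.')[-1]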

In [67]:
print(conversation(df,31))

['Does anyone have a chart outlining the current NO pay schedule for civilians?  Can’t seem to find it anywhere, and trying to compare it to the GS pay schedule for a job opportunity.  My locality is Washington, DC.  For example:\n\nThe opportunity is NO-5 with a salary range of $120,577 - $172,500\n\nGS-14 is $122,530 - $159,286, and GS-15 is $144,128 - $172,500\n\nIs there anything higher than NO-5 on the NO pay schedule?', 'OPM.gov should have that data', 'Unfortunately it’s not posted anywhere on there, which I found odd...', '[deleted]', 'This is not true. There are several pay systems and/or agencies whose  pay plan exceeds the GS-15 level. Here are a few:\n\nhttps://www.consumerfinance.gov/about-us/careers/pay-scales/\nhttps://careers.occ.gov/pay-and-benefits/salary/index-occ-salary-structure.html\nhttps://www.va.gov/OHRM/Pay/2021/PhysicianDentist/PayTables_20210103.pdf']


In [68]:
# sample thread summarization outputs from the LLMs
i=31
print("llm_mistral response: ", llm_response(df, i))
print()
print("llm_falcon_7 response: ", llm_response(df, i, llm=llm_falcon_7))


llm_mistral response:  

A user is inquiring about the current NO pay schedule for civilians in Washington, DC, specifically for the NO-5 position with a salary range of $120,577 - $172,500. They are trying to compare it to the GS pay schedule for a job opportunity. The user mentions that they have been unable to find this information on OPM.gov and that GS-14 and GS-15

llm_falcon_7 response:  
The conversation revolves around the difficulty of finding a chart outlining the current NO pay schedule for civilians in Washington, DC, and the speaker is wondering if there is anything higher than NO-5 on the NO pay schedule. The speaker also questions the accuracy of the information provided by OPM.gov.


In [70]:
from tqdm import tqdm

In [72]:
list_of_summaries_via_falcon = []
for i in tqdm(range(len(list_of_reddit_link_ids))):
    list_of_summaries_via_falcon.append({list_of_reddit_link_ids[i]: llm_response(df, i, llm=llm_falcon_7)})

list_of_summaries_via_mistral = []
for i in tqdm(range(len(list_of_reddit_link_ids))):
    list_of_summaries_via_mistral.append({list_of_reddit_link_ids[i]: llm_response(df, i, llm=llm_mistral)})

100%|███████████████████████████████████████████| 49/49 [01:36<00:00,  1.96s/it]
100%|███████████████████████████████████████████| 49/49 [01:05<00:00,  1.33s/it]


In [77]:
with open('Falcon_summarization_FedEmp.json', 'w') as fout:
    json.dump(list_of_summaries_via_falcon , fout)

In [78]:
with open('Mistral_summarization_FedEmp.json', 'w') as fout:
    json.dump(list_of_summaries_via_mistral , fout)

In [82]:
# example: note that the LLM output is abruptly truncated. This needs to be fixed somehow. 
list_of_summaries_via_mistral[0]

{'t3_rj3a8g': ' \n\nThe employee expresses satisfaction with their job and the federal government as a whole, but is growing increasingly unhappy with their supervisor, who has adopted a micromanaging management style from their private sector background. The supervisor is targeting a coworker for closer scrutiny and discipline, and has implied that the employee and other subordinates may be next. The employee feels threatened and wants to take precautions, including increasing distance and being aware of their rights and options'}

In [80]:
# for comparison, here is the full text of the submission post
retrive_entire_post_by_id(df, 't3_rj3a8g').iloc[0].reddit_text

'Entering my third year of employment with fed gov. Overall I’m glad I switched from private sector, it’s been what I had hoped and I plan to stay a fed gov employee until retirement.\n\nHere’s the rub: I love the work I do but I’m beginning to hate my job more with each passing week. Specifically, I have an overbearing supervisor who was hired several months ago from the private sector, and while this individual is a decent, well meaning person (I think), they’ve been steadily morphing into an egomaniac and their proclivity for bringing their apparent 35 years of private sector management style to bear on me and other fellow subordinates is becoming increasingly intolerable by the week.\n\nToday we had a face to face, one on one and I was left with the impression that one of my coworkers/fellow subordinates is being targeted for discipline/write up of a Performance Improvement Plan so they can be micromanaged by this supervisor even more closely and that much easier to either discipli
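
The truncated Mistral summary noted above probably reflects a generation-length limit on the hosted endpoint rather than a property of the thread. One thing to try (an assumption, not verified here) is passing a larger `max_new_tokens` value through the `model_kwargs` argument of `HuggingFaceHub`. Independently of the cause, summaries that were cut off can be flagged so that they are regenerated. A minimal heuristic sketch, where `looks_truncated` is a hypothetical helper:

```python
def looks_truncated(summary: str) -> bool:
    """Heuristic: a complete summary should end with sentence-final punctuation."""
    text = summary.strip()
    return not text.endswith((".", "!", "?", '."', ".'"))

# Endings of the Mistral and Falcon summaries shown in this notebook.
mistral_tail = "including increasing distance and being aware of their rights and options"
falcon_tail = "They suggest researching the PIP process and union representation."

print(looks_truncated(mistral_tail))  # True
print(looks_truncated(falcon_tail))   # False
```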

## Converting summarization into Langchain Documents

We use the Falcon summaries as our dataset.

In [83]:
dict_list_f = []
for e in list_of_summaries_via_falcon:
    key = list(e.keys())[0]
    val = list(e.values())[0]
    dict_list_f.append({'id' : key, 'text' : val})

In [84]:
dict_list_f[0]

{'id': 't3_rj3a8g',
 'text': " \nThe conversation with the supervisor is becoming increasingly hostile, and the employee is feeling threatened. They are concerned about the supervisor's micromanaging and the potential for disciplinary action. The employee is looking for resources and ways to protect themselves. They suggest researching the PIP process and union representation."}

In [85]:
# overwrite the earlier file with flat {'id', 'text'} records, the layout read by JSONLoader below
with open('Falcon_summarization_FedEmp.json', 'w') as fout:
    json.dump(dict_list_f , fout)

In [86]:
from langchain_community.document_loaders import JSONLoader

In [87]:
def metadata_func(record: dict, metadata: dict) -> dict:
    
    metadata['reddit_link_id'] = record.get('id')
    # we can add more metadata relevant to the threads here

    return metadata

In [88]:
loader = JSONLoader(
    file_path='Falcon_summarization_FedEmp.json',
    jq_schema='.[]',
    content_key='text',
    metadata_func=metadata_func
)

In [89]:
docs = loader.load() 

In [94]:
print(type(docs))
print(type(docs[0]))

<class 'list'>
<class 'langchain_core.documents.base.Document'>


`docs` is a list of LangChain `Document` objects. No chunking has been done up to this point; `JSONLoader` simply turns each summary into one `Document` for `langchain`.

In [90]:
# example
docs[0]

Document(page_content=" \nThe conversation with the supervisor is becoming increasingly hostile, and the employee is feeling threatened. They are concerned about the supervisor's micromanaging and the potential for disciplinary action. The employee is looking for resources and ways to protect themselves. They suggest researching the PIP process and union representation.", metadata={'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'seq_num': 1, 'reddit_link_id': 't3_rj3a8g'})

In [95]:
# add this extra metadata for RAGAS (will come up later)
for document in docs:
    document.metadata['filename'] = document.metadata['source']

In [96]:
docs[0]

Document(page_content=" \nThe conversation with the supervisor is becoming increasingly hostile, and the employee is feeling threatened. They are concerned about the supervisor's micromanaging and the potential for disciplinary action. The employee is looking for resources and ways to protect themselves. They suggest researching the PIP process and union representation.", metadata={'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'seq_num': 1, 'reddit_link_id': 't3_rj3a8g', 'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'})

## Synthetic dataset for evaluation using RAGAS 

We still need to figure out how to generate the synthetic test dataset using only open-source LLMs.

In [97]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [98]:
OPENAI_API_KEY = getpass()

 ········


In [102]:
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [101]:
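# generator with open source models
# Does not work yet! It is not clear where exactly the problem is. Perhaps the embedding model needs to be consistent with the LLM -- I don't know.
# The ValueError below has the wording NumPy uses when asked to sample from an empty sequence, which may mean that no usable nodes were left to draw questions from.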

from langchain_community.embeddings import HuggingFaceEmbeddings

generator_llm = llm_falcon_7
critic_llm = llm_falcon_7
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/bert-base-nli-mean-tokens')

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# generate testset
testset = generator.generate_with_langchain_docs(docs, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

embedding nodes:   0%|          | 0/98 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


ValueError: a cannot be empty unless no samples are taken

In [None]:
# generator with openai models
# works but costs money

generator_llm = ChatOpenAI(model="gpt-3.5-turbo")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# generate testset
testset = generator.generate_with_langchain_docs(docs, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

In [148]:
testset.to_pandas()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What does the Hatch Act regulate in terms of p...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act regulates political activity by ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
1,What is the recommended timeframe for staying ...,[\nThe conversation revolves around the ideal ...,"According to SES leaders, it is recommended to...",simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
2,What is the recommended timeframe for staying ...,[\nThe conversation revolves around the ideal ...,"According to SES leaders, it is recommended to...",simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
3,How does the Hatch Act regulate social media a...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act regulates social media activity ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
4,How is job hopping viewed in the federal gover...,[\nThe conversation revolves around the ideal ...,The conversation touches on the idea that job ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
5,How are salaries determined for different job ...,[\nThe conversation revolves around the discre...,,reasoning,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
6,Why do SES leaders recommend staying in a role...,[\nThe conversation revolves around the ideal ...,SES leaders recommend staying in a role for 4-...,reasoning,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
7,What are the Hatch Act restrictions for govern...,[\nThe Hatch Act is a federal law that regulat...,,multi_context,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
8,What law regulates government employees' polit...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act is a federal law that regulates ...,multi_context,[{'source': '/Users/hraj/Documents/Erdos/aware...,True


In [151]:
testset.to_pandas().iloc[3]["metadata"]

[{'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json',
  'seq_num': 25,
  'reddit_link_id': 't3_gxewnv',
  'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'}]

In [179]:
testset.test_data[0]

DataRow(question='What does the Hatch Act regulate in terms of political activity by government employees?', contexts=['\nThe Hatch Act is a federal law that regulates political activity by government employees. According to the Office of Special Counsel website, the definition of political activity includes expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. It is important to note that social media activity may fall under this definition, and it is recommended to avoid mentioning your agency or town on personal social media profiles.'], ground_truth='The Hatch Act regulates political activity by government employees, including expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. Social media activity m

## Building an end-to-end RAG pipeline

### Chunked LangChain documents

In a full production-level product, we would need to chunk the LangChain Documents properly. However, since we generated our evaluation test set without chunking, we will work with the unchunked documents in the rest of this section.

In [105]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [106]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100, 
    chunk_overlap=20)
docs_split = text_splitter.split_documents(docs)

`docs_split` contains the chunked version of `docs`.

In [107]:
print(len(docs_split))
print(len(docs))

196
49


In [110]:
print(docs[0].page_content)

 
The conversation with the supervisor is becoming increasingly hostile, and the employee is feeling threatened. They are concerned about the supervisor's micromanaging and the potential for disciplinary action. The employee is looking for resources and ways to protect themselves. They suggest researching the PIP process and union representation.


In [114]:
print(docs_split[0].page_content)
print(docs_split[1].page_content)
print(docs_split[3].page_content)

The conversation with the supervisor is becoming increasingly hostile, and the employee is feeling
employee is feeling threatened. They are concerned about the supervisor's micromanaging and the
and ways to protect themselves. They suggest researching the PIP process and union representation.
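
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width splitter in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break at separators such as newlines and spaces; `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

summary = (
    "The conversation with the supervisor is becoming increasingly hostile, and the employee "
    "is feeling threatened. They are concerned about the supervisor's micromanaging and the "
    "potential for disciplinary action."
)
chunks = sliding_chunks(summary)
for c in chunks:
    print(repr(c))
```

Each chunk after the first begins with the last 20 characters of the previous one, which is the same idea behind the repeated phrase "employee is feeling" in the real splitter output above.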


### Embedding

In [28]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [213]:
# embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/bert-base-nli-mean-tokens') # earlier we used this
# both 'sentence-transformers/bert-base-nli-mean-tokens' and allenai/longformer-base-4096'  produce 768 dimensional embeddings

In [116]:
embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/bert-base-nli-mean-tokens')

In [117]:
embeddings.dict

<bound method BaseModel.dict of HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='sentence-transformers/bert-base-nli-mean-tokens', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)>

### Indexing (ChromaDB) 

In [118]:
# Load Documents (already done above: docs)

# Chunk the documents (we have already done it above: docs_split)

# Embed

from langchain_community.vectorstores import Chroma

In [120]:
vectorstore = Chroma.from_documents(documents=docs_split, embedding=embeddings)

In [121]:
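# Caution: neither call sets collection_name, so both use Chroma's default collection. Within one session the
# unchunked docs may be added alongside the chunks indexed above instead of replacing them, which would explain
# the chunk-sized results returned below. Passing a distinct collection_name to each call keeps them separate.
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)
retriever = vectorstore.as_retriever()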

### Retrieval

In [122]:
retrive_documents = retriever.invoke("are Fed employees happy with their salary?")

In [124]:
for docx in retrive_documents:
    print(docx)
    print()

page_content='is concerned about the salary and whether it is typical for a fed job. The HR contact provided the' metadata={'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'reddit_link_id': 't3_t0p7o1', 'seq_num': 45, 'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'}

page_content="The conversation revolves around the applicant's expectations for compensation and job benefits" metadata={'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'reddit_link_id': 't3_r5r4fc', 'seq_num': 4, 'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'}

page_content='job and their retirement benefits. They inquire about the possibility of receiving retirement pay,' metadata={'filename': '/Users/hraj/Documents/Erdos/a

Some checks based on the `ragas`-generated test set

In [264]:
testset.test_data[0].question

'What does the Hatch Act regulate in terms of political activity by government employees?'

In [265]:
question = testset.test_data[0].question
retrive_documents = retriever.invoke(question)

In [268]:
retrive_documents[0]

Document(page_content='The Hatch Act is a federal law that regulates political activity by government employees. According to the Office of Special Counsel website, the definition of political activity includes expressing', metadata={'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'reddit_link_id': 't3_gxewnv', 'seq_num': 25, 'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'})

In [269]:
testset.test_data[0]

DataRow(question='What does the Hatch Act regulate in terms of political activity by government employees?', contexts=['\nThe Hatch Act is a federal law that regulates political activity by government employees. According to the Office of Special Counsel website, the definition of political activity includes expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. It is important to note that social media activity may fall under this definition, and it is recommended to avoid mentioning your agency or town on personal social media profiles.'], ground_truth='The Hatch Act regulates political activity by government employees, including expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. Social media activity m
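
The manual comparison above can be turned into a simple hit check: a question counts as a hit if the `reddit_link_id` of the document it was generated from appears among the retrieved documents' metadata. The sketch below works on plain lists of IDs; `hit_rate` is a hypothetical helper, and the values for the second question are made up for illustration.

```python
def hit_rate(expected_ids, retrieved_ids_per_question):
    """Fraction of questions whose source thread is among the retrieved threads."""
    if len(expected_ids) != len(retrieved_ids_per_question):
        raise ValueError("need one list of retrieved IDs per question")
    if not expected_ids:
        return 0.0
    hits = sum(exp in got for exp, got in zip(expected_ids, retrieved_ids_per_question))
    return hits / len(expected_ids)

# First question: the Hatch Act thread t3_gxewnv was retrieved (as seen above).
# Second question: illustrative values only.
expected = ["t3_gxewnv", "t3_t0p7o1"]
retrieved = [["t3_gxewnv", "t3_r5r4fc"], ["t3_r5r4fc", "t3_rj3a8g"]]
print(hit_rate(expected, retrieved))  # 0.5
```

With the notebook objects, the retrieved IDs would come from `doc.metadata['reddit_link_id']` for each retrieved document, and the expected ID from the `metadata` column of the test set.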

### Generation (this is optional; we do not really need it for our project)

This part is not required in the project.
The open question is whether we need generation for evaluation purposes. In any case, let us run the generation step below.

In [125]:
# Prompt
from langchain.prompts import ChatPromptTemplate

template = """ Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context: 
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [126]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# LLM
llm = llm_mistral

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [129]:
# Question
print(rag_chain.invoke("are Fed employees happy with their salary?"))

Human:  Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context: 
is concerned about the salary and whether it is typical for a fed job. The HR contact provided the

The conversation revolves around the applicant's expectations for compensation and job benefits

job and their retirement benefits. They inquire about the possibility of receiving retirement pay,

current Agency will compensate them for this time. The main concern is whether the employee will

Question:
are Fed employees happy with their salary?

Answer:
I don't know. The context provided does not include any information about the overall satisfaction or happiness of Fed employees with their salaries.
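
As the output shows, the hosted model returns the formatted prompt followed by its answer. A minimal sketch of a post-processing step that keeps only the answer; `strip_prompt_echo` is a hypothetical helper, and the strings are shortened stand-ins for the output above.

```python
def strip_prompt_echo(output: str, prompt_text: str) -> str:
    """Remove a leading copy of the prompt from the model output, if present."""
    if output.startswith(prompt_text):
        output = output[len(prompt_text):]
    answer = output.strip()
    # Some models also prepend an "Answer:" label of their own.
    if answer.startswith("Answer:"):
        answer = answer[len("Answer:"):].strip()
    return answer

# Shortened stand-ins for the formatted prompt and the raw model output.
prompt_text = "Human:  Answer the question ...\n\nQuestion:\nare Fed employees happy with their salary?\n"
raw_output = prompt_text + "\nAnswer:\nI don't know. The context provided does not include ..."
print(strip_prompt_echo(raw_output, prompt_text))
```

Inside the chain this would need access to the formatted prompt; a simpler variant splits on the last line of the template, as `llm_response` does for the summaries.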


In [128]:
retrive_documents = retriever.invoke("are Fed employees happy with their salary?")
for d in retrive_documents:
    print(d)
    print()

page_content='is concerned about the salary and whether it is typical for a fed job. The HR contact provided the' metadata={'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'reddit_link_id': 't3_t0p7o1', 'seq_num': 45, 'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'}

page_content="The conversation revolves around the applicant's expectations for compensation and job benefits" metadata={'filename': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json', 'reddit_link_id': 't3_r5r4fc', 'seq_num': 4, 'source': '/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/Falcon_summarization_FedEmp.json'}

page_content='job and their retirement benefits. They inquire about the possibility of receiving retirement pay,' metadata={'filename': '/Users/hraj/Documents/Erdos/a

In [293]:
# another question (based on the test dataset)

question = testset.test_data[4].question
output = rag_chain.invoke(question)
print(output)

Human:  Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context: 
The Hatch Act is a federal law that regulates political activity by government employees. According to the Office of Special Counsel website, the definition of political activity includes expressing

The conversation revolves around the creation of a subreddit for federal workers to share workplace-related thoughts, ideas, and discussions. The speaker hopes to attract more federal workers to the


The Hatch Act is a federal law that regulates political activity by government employees. According to the Office of Special Counsel website, the definition of political activity includes expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. It is important to note that social media activity 

In [295]:
testset.test_data[4].ground_truth

'The conversation touches on the idea that job hopping may be viewed negatively in the federal government, according to SES leaders.'

## RAG evaluation using RAGAS

### First, prepare the dataset on which we want to evaluate

In [312]:
test_df = testset.to_pandas()

In [313]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What does the Hatch Act regulate in terms of p...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act regulates political activity by ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
1,What is the recommended timeframe for staying ...,[\nThe conversation revolves around the ideal ...,"According to SES leaders, it is recommended to...",simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
2,What is the recommended timeframe for staying ...,[\nThe conversation revolves around the ideal ...,"According to SES leaders, it is recommended to...",simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
3,How does the Hatch Act regulate social media a...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act regulates social media activity ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
4,How is job hopping viewed in the federal gover...,[\nThe conversation revolves around the ideal ...,The conversation touches on the idea that job ...,simple,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
5,How are salaries determined for different job ...,[\nThe conversation revolves around the discre...,,reasoning,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
6,Why do SES leaders recommend staying in a role...,[\nThe conversation revolves around the ideal ...,SES leaders recommend staying in a role for 4-...,reasoning,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
7,What are the Hatch Act restrictions for govern...,[\nThe Hatch Act is a federal law that regulat...,,multi_context,[{'source': '/Users/hraj/Documents/Erdos/aware...,True
8,What law regulates government employees' polit...,[\nThe Hatch Act is a federal law that regulat...,The Hatch Act is a federal law that regulates ...,multi_context,[{'source': '/Users/hraj/Documents/Erdos/aware...,True


In [316]:
test_questions = test_df["question"].values.tolist()
test_questions

['What does the Hatch Act regulate in terms of political activity by government employees?',
 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'How does the Hatch Act regulate social media activity for government employees?',
 'How is job hopping viewed in the federal government, according to SES leaders?',
 'How are salaries determined for different job levels?',
 'Why do SES leaders recommend staying in a role for 4-5 years before seeking a new one in the federal government, and how is job hopping viewed in this situation?',
 'What are the Hatch Act restrictions for government employees in politics, considering their future federal agency employment and leave policies?',
 "What law regulates government employees' political involvement and social media use?"]

In [130]:
test_questions = ['What does the Hatch Act regulate in terms of political activity by government employees?',
 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'How does the Hatch Act regulate social media activity for government employees?',
 'How is job hopping viewed in the federal government, according to SES leaders?',
 'How are salaries determined for different job levels?',
 'Why do SES leaders recommend staying in a role for 4-5 years before seeking a new one in the federal government, and how is job hopping viewed in this situation?',
 'What are the Hatch Act restrictions for government employees in politics, considering their future federal agency employment and leave policies?',
 "What law regulates government employees' political involvement and social media use?"]

In [317]:
test_ground_truth = test_df["ground_truth"].values.tolist()
test_ground_truth

['The Hatch Act regulates political activity by government employees, including expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. Social media activity may also fall under this definition, and it is advised to avoid mentioning your agency or town on personal social media profiles.',
 'According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. By year 4, individuals should be demonstrating their value and making a positive contribution to their current position.',
 'According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. By year 4, individuals should be demonstrating their value and making a positive contribution to their current position.',
 'The Hatch Act regulates social media activity for government employees by incl

In [131]:
test_ground_truth = ['The Hatch Act regulates political activity by government employees, including expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. Social media activity may also fall under this definition, and it is advised to avoid mentioning your agency or town on personal social media profiles.',
 'According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. By year 4, individuals should be demonstrating their value and making a positive contribution to their current position.',
 'According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. By year 4, individuals should be demonstrating their value and making a positive contribution to their current position.',
 'The Hatch Act regulates social media activity for government employees by including it under the definition of political activity. Government employees are advised to avoid mentioning their agency or town on personal social media profiles to comply with the regulations of the Hatch Act.',
 'The conversation touches on the idea that job hopping may be viewed negatively in the federal government, according to SES leaders.',
 'nan',
 'SES leaders recommend staying in a role for 4-5 years before seeking a new one in the federal government because by year 4, individuals should be demonstrating their value and making a positive contribution to their current position. Job hopping may be viewed negatively as it may suggest a lack of commitment or stability.',
 'nan',
 'The Hatch Act is a federal law that regulates political activity by government employees, including social media use. It defines political activity as expressing opinions about candidates and issues, as well as activity directed at the success or failure of a political party, candidate for partisan political office, or partisan political group. Social media activity may fall under this definition, and it is advised to avoid mentioning your agency or town on personal social media profiles.']
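
Two of the entries above are the string `'nan'`, meaning the test-set generator produced no reference answer for those questions. Metrics that compare against the ground truth, such as `context_recall`, cannot score such rows meaningfully. A minimal sketch of filtering them out, assuming that dropping them is acceptable; `drop_missing_ground_truth` is a hypothetical helper, not part of the notebook or of ragas:

```python
import math


def drop_missing_ground_truth(questions, ground_truths):
    """Keep only (question, ground_truth) pairs with a usable reference answer.

    Hypothetical helper: treats None, float NaN, empty strings and the
    literal string 'nan' as missing.
    """
    kept_questions, kept_truths = [], []
    for question, truth in zip(questions, ground_truths):
        if truth is None:
            continue
        if isinstance(truth, float) and math.isnan(truth):
            continue
        if str(truth).strip().lower() in {"", "nan"}:
            continue
        kept_questions.append(question)
        kept_truths.append(truth)
    return kept_questions, kept_truths


# Made-up example data
qs = ["q1", "q2", "q3", "q4"]
gts = ["a reference answer", "nan", float("nan"), None]
print(drop_missing_ground_truth(qs, gts))
```

If rows are dropped this way, the answers and contexts collected later need to be filtered with the same mask so that all columns stay aligned.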

In [132]:
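def rag_response(question):
    # The chain returns the full prompt followed by the completion, so we
    # split on the template markers to recover the answer and the context.
    response = rag_chain.invoke(question).split("\nAnswer:")
    ans = response[-1]  # text after the last "Answer:" marker
    cxt = response[0].split("\nQuestion:")[0].split("\nContext:")[-1]  # text between "Context:" and "Question:"
    return ans, cxt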
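
The parsing in `rag_response` can be tried on a plain string without calling the chain. The sketch below assumes the prompt template contains the markers `Context:`, `Question:` and `Answer:`, each at the start of a line and in that order; the template itself is not shown in this section, and the sample text is made up:

```python
def split_rag_output(full_text):
    """Split an echoed prompt plus completion into (answer, context).

    Assumes newline-prefixed 'Context:', 'Question:' and 'Answer:' markers
    appear in that order (an assumption about the prompt template).
    """
    parts = full_text.split("\nAnswer:")
    answer = parts[-1]
    context = parts[0].split("\nQuestion:")[0].split("\nContext:")[-1]
    return answer, context


# Made-up example text, not real pipeline output
sample = (
    "Human: Answer the question based only on the following context."
    "\nContext: chunk one\n\nchunk two"
    "\nQuestion: What is discussed?"
    "\nAnswer: Two chunks."
)
answer, context = split_rag_output(sample)
print(repr(answer))
print(repr(context))
```

Note that when a marker is absent, `str.split` returns the whole string as a single element, so both return values silently fall back to the full text. Checking for the markers first would make such failures visible.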

In [133]:
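from tqdm import tqdm

# Collect the generated answer and the retrieved context for every test question
answers = []
context = []
for question in tqdm(test_questions):
    x, y = rag_response(question)
    answers.append(x)
    context.append(y)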

100%|█████████████████████████████████████████████| 9/9 [00:08<00:00,  1.04it/s]


In [134]:
# need to fix this properly (right in the generation section) later
context1 = [ctc.split('\n\n') for ctc in context]
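
Splitting on blank lines alone leaves leading whitespace and whitespace-only entries in the chunk lists, as seen in the `contexts` values printed in this section. A minimal sketch of a tidier split; `split_context` is a hypothetical helper name:

```python
def split_context(context_text):
    """Split a retrieved-context string into chunks on blank lines,
    stripping surrounding whitespace and dropping empty chunks."""
    chunks = [chunk.strip() for chunk in context_text.split("\n\n")]
    return [chunk for chunk in chunks if chunk]


# Made-up example string
raw = " \nfirst chunk\n\n\nsecond chunk\n\n \n\nthird chunk\n"
print(split_context(raw))
```

This still assumes that no chunk contains a blank line of its own; returning the retrieved documents directly from the generation step would avoid that assumption.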

In [135]:
from datasets import Dataset

In [136]:
response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : context1, # the key names must be spelled exactly as shown; a misspelled key will cause an error in the evaluate function
    "ground_truth" : test_ground_truth
})
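
Since a misspelled key only surfaces once `evaluate` is called, it can help to check the columns first. A minimal sketch, assuming the four column names used in the cell above; `check_ragas_columns` is a hypothetical helper, not a ragas function:

```python
def check_ragas_columns(data):
    """Validate a dict of columns before building the evaluation dataset.

    Hypothetical helper; the expected names are the ones used in the
    notebook cell above.
    """
    expected = {"question", "answer", "contexts", "ground_truth"}
    missing = expected - set(data)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    lengths = {name: len(data[name]) for name in expected}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for row in data["contexts"]:
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            raise TypeError("each 'contexts' entry must be a list of strings")
    return True


# Made-up example data
good = {
    "question": ["q"],
    "answer": ["a"],
    "contexts": [["chunk 1", "chunk 2"]],
    "ground_truth": ["g"],
}
print(check_ragas_columns(good))
```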

In [137]:
response_dataset[2]

{'question': 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'answer': '\nAccording to the context, it is recommended to stay in a position for at least 4-5 years before considering a new role in the federal government.',
 'contexts': [' \nin the federal government. According to SES leaders, it is recommended to stay in a position for at',
  '\nThe conversation revolves around the ideal frequency of changing positions in the federal government. According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. They suggest that by year 4, individuals should be demonstrating their value and making a positive contribution to their current position. The conversation also touches on the idea that job hopping may be viewed negatively.',
  "The conversation revolves around the speaker's upcoming employment at a federal agency, their",
  "\nThe conversation revolve

### Next we perform evaluation on the dataset

In [138]:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    answer_correctness,
    context_recall,
    context_precision,
)

In [139]:
metrics = [context_recall, context_precision]

In [140]:
results = evaluate(response_dataset, metrics) # this requires OpenAI to run

Evaluating:   0%|          | 0/18 [00:00<?, ?it/s]

In [142]:
# these are the final metrics against which we will evaluate our RAG pipeline
results

{'context_recall': 0.6667, 'context_precision': 0.4241}

### Trying open-source models for RAG evaluation

In [387]:
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_community.llms import HuggingFacePipeline

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [43]:
embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='sentence-transformers/bert-base-nli-mean-tokens', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [44]:
llm_mistral

HuggingFaceHub(client=<InferenceClient(model='mistralai/Mistral-7B-Instruct-v0.2', timeout=None)>, repo_id='mistralai/Mistral-7B-Instruct-v0.2', task='text-generation', huggingfacehub_api_token='<redacted>')

In [None]:
# evaluator
# model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id)

# This appears to load the LLM locally. Do not run again.

In [None]:
pipe = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1, 
    repetition_penalty=1.1  # without this output begins repeating
)

evaluator = HuggingFacePipeline(pipeline=pipe)

In [48]:
# ragas
result = evaluate(
    dataset=response_dataset,
    llm=llm_mistral,
    embeddings=embeddings,
    metrics=metrics
)

Evaluating:   0%|          | 0/18 [00:00<?, ?it/s]

Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid response format. Expected a list of dictionaries with keys 'verdict'

In [57]:
response_dataset[2]

{'question': 'What is the recommended timeframe for staying in a position within the federal government before considering a new role?',
 'answer': ' Based on the context, it is recommended to stay in a position for at least 4-5 years before considering a new role.',
 'contexts': [' ',
  'The conversation revolves around the ideal frequency of changing positions in the federal government. According to SES leaders, it is recommended to stay in a position for at least 4-5 years before considering a new role. They suggest that by year 4, individuals should be demonstrating their value and making a positive contribution to their current position. The conversation also touches on the idea that job hopping may be viewed negatively.',
  "\nThe conversation revolves around an employee's transfer from one Federal Government Agency to another, discussing their accrued travel and comp time, and whether their current Agency will compensate them for this time. The main concern is whether the employ

In [58]:
result = evaluate(
    dataset=response_dataset,
    metrics=metrics
)

Evaluating:   0%|          | 0/18 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
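
The warning above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed near the top of the notebook, before any tokenizer is used:

```python
import os

# Set this before the tokenizers library is first used in the process
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the value to `"false"` trades tokenizer parallelism for a quiet, fork-safe run, which is usually acceptable for an evaluation of this size.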

In [59]:
result

{'context_recall': 0.7667, 'context_precision': 0.5000}

In [49]:
result

{'context_recall': nan, 'context_precision': nan}
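
The `nan` scores come from the Mistral-based run, whose log shows "Invalid JSON response" messages, which suggests the judge model's replies could not be parsed into scores. A minimal sketch of checking for this before reporting numbers; `nan_metrics` is a hypothetical helper that works on a plain dict, and the two dicts simply copy the values printed in the cells above:

```python
import math


def nan_metrics(scores):
    """Return the names of metrics whose score is NaN."""
    return [
        name
        for name, value in scores.items()
        if isinstance(value, float) and math.isnan(value)
    ]


# Values copied from the notebook outputs above
default_scores = {"context_recall": 0.7667, "context_precision": 0.5000}
mistral_scores = {"context_recall": float("nan"), "context_precision": float("nan")}

print(nan_metrics(default_scores))
print(nan_metrics(mistral_scores))
```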

In [391]:
# from llama_index.llms import HuggingFaceInferenceAPI
# falcon_llm = HuggingFaceInferenceAPI(
#     model_name="tiiuae/falcon-7b-instruct",
#     token=huggingfacehub_api_token
# )

In [47]:
help(evaluate)

Help on function evaluate in module ragas.evaluation:

evaluate(dataset: 'Dataset', metrics: 'list[Metric] | None' = None, llm: 't.Optional[BaseRagasLLM | LangchainLLM]' = None, embeddings: 't.Optional[BaseRagasEmbeddings | LangchainEmbeddings]' = None, callbacks: 'Callbacks' = None, is_async: 'bool' = False, run_config: 't.Optional[RunConfig]' = None, raise_exceptions: 'bool' = True, column_map: 't.Optional[t.Dict[str, str]]' = None) -> 'Result'
    Run the evaluation on the dataset with different metrics
    
    Parameters
    ----------
    dataset : Dataset[question: list[str], contexts: list[list[str]], answer: list[str], ground_truth: list[list[str]]]
        The dataset in the format of ragas which the metrics will use to score the RAG
        pipeline with
    metrics : list[Metric] , optional
        List of metrics to use for evaluation. If not provided then ragas will run the
        evaluation on the best set of metrics to give a complete view.
    llm: BaseRagasLLM, optio