# Getting LLMs to work

In [1]:
import os
from getpass import getpass

In [None]:
huggingfacehub_api_token = getpass()

In [919]:
os.environ['HUGGINGFACEHUB_API_TOKEN'] = huggingfacehub_api_token

In [5]:
import pandas as pd

In [6]:
from langchain_community.llms import HuggingFaceEndpoint

In [1253]:
# trying the non-instruct model
llm_mixtral_non_instruct =  HuggingFaceEndpoint(repo_id='mistralai/Mixtral-8x7B-v0.1', huggingfacehub_api_token=huggingfacehub_api_token, max_new_tokens=30000, temperature=0.01)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful


In [1258]:
llm_mixtral_non_instruct.invoke("What is the capital of India?")

HfHubHTTPError: 403 Client Error: Forbidden for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-v0.1 (Request ID: vpbTv6MIZ1jQK_J6CD8DP)

The model mistralai/Mixtral-8x7B-v0.1 is too large to be loaded automatically (93GB > 10GB). Please use Spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints).

In [195]:
# llm = HuggingFaceEndpoint(repo_id='tiiuae/falcon-7b-instruct', huggingfacehub_api_token=huggingfacehub_api_token)

llm_mistral = HuggingFaceEndpoint(repo_id='mistralai/Mistral-7B-Instruct-v0.2', huggingfacehub_api_token=huggingfacehub_api_token, max_new_tokens=30000, temperature=0.01)
llm_falcon_7 = HuggingFaceEndpoint(repo_id='tiiuae/falcon-7b-instruct', huggingfacehub_api_token=huggingfacehub_api_token, max_new_tokens=8000, temperature=0.01)
llm_mixtral =  HuggingFaceEndpoint(repo_id='mistralai/Mixtral-8x7B-Instruct-v0.1', huggingfacehub_api_token=huggingfacehub_api_token, max_new_tokens=30000, temperature=0.01)

# The max_new_tokens values above are set close to each model's total token limit
# (input_tokens + max_new_tokens must stay within 32768 for mistral/mixtral and 8192 for falcon).
# Setting a large max_new_tokens avoids abrupt cutoffs in the generated text.

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful
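
The token limit mentioned in the comment above can be turned into a small budget check. This is a minimal sketch, not part of the original notebook: `max_new_tokens_for` and `rough_token_count` are hypothetical helpers, and the four-characters-per-token ratio is only a rough rule of thumb, not the models' real tokenizer.

```python
def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return int(len(text) / chars_per_token) + 1


def max_new_tokens_for(prompt_tokens: int, context_window: int, reserve: int = 0) -> int:
    """Largest max_new_tokens that keeps prompt + output inside the window."""
    budget = context_window - prompt_tokens - reserve
    if budget <= 0:
        raise ValueError("prompt already fills the context window")
    return budget


# With a 32768-token window, a 2768-token prompt leaves room for 30000 new tokens.
print(max_new_tokens_for(prompt_tokens=2768, context_window=32768))
```

A check like this makes it explicit when a long prompt no longer leaves room for the requested `max_new_tokens`.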


## Some fun examples with Mistral and Mixtral LLMs

In [200]:
response = llm_mistral.invoke("Tell me about Brookhaven National Lab.")
print(response)



Brookhaven National Laboratory (BNL) is a U.S. Department of Energy (DOE) national laboratory located in Upton, New York. It was established in 1947 and is operated and managed for DOE by Stony Brook University. The laboratory conducts research in the fields of physics, chemistry, biology, mathematics, and engineering, and is home to several unique scientific facilities, including the Relativistic Heavy Ion Collider (RHIC), the National Synchrotron Light Source II (NSLS-II), and the Center for Functional Nanomaterials (CFN). BNL is known for its contributions to nuclear and particle physics, materials science, and energy research.

What is the role of Brookhaven National Lab in the field of nuclear physics?

Brookhaven National Laboratory has played a significant role in the field of nuclear physics since its inception. Some of its most notable achievements include:

* Discovering the quark gluon plasma, a state of matter that existed in the early universe and can be recreated in lab

In [201]:
response = llm_mixtral.invoke("Tell me about Brookhaven National Lab.")
print(response)



Brookhaven National Laboratory is a multipurpose research institution funded primarily by the U.S. Department of Energy’s Office of Science. Located on the center of Long Island, New York, Brookhaven Lab brings world-class facilities and expertise to the most exciting and important questions in basic and applied science—from the birth of our universe to the sustainable energy technology of tomorrow. Five Nobel Prizes have been awarded for discoveries made at Brookhaven Lab.

What is the mission of the Center for Functional Nanomaterials (CFN)?

The Center for Functional Nanomaterials (CFN) at Brookhaven National Laboratory is a DOE Office of Science User Facility focused on the development and study of functional nanomaterials. The CFN mission is to enable the scientific community to carry out research and development of nanoscale materials, devices, and systems that will have a significant impact on science and technology, and to foster interdisciplinary collaboration among research

In [194]:
response = llm_mistral.invoke("Write a C++ code for computing the value of pi")
print(response)

 using the Leibniz formula.

The Leibniz formula for computing the value of pi is given by:

pi/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - 1/11 + ...

Here is a simple C++ code to compute the value of pi using the Leibniz formula:

```cpp
#include<iostream>
#include<cmath>
using namespace std;

double leibnizPi(int n){
    double piEstimate = 0.0;
    double denominator = 1.0;
    int sign = 1;

    for(int i = 1; i <= n; i += 2){
        piEstimate += sign * (1.0 / denominator);
        sign *= -1;
        denominator += 2;
    }

    piEstimate *= 4;
    return piEstimate;
}

int main(){
    int n;
    cout << "Enter the number of terms: ";
    cin >> n;

    double pi = leibnizPi(n);
    cout << "Approximate value of pi: " << pi << endl;

    return 0;
}
```

This code takes an integer `n` as input, representing the number of terms in the Leibniz series to be computed. It then computes the approximate value of pi using the Leibniz formula up to the given number of terms and prints the result. 

In [199]:
response = llm_mixtral.invoke("Write a C++ code for computing the value of pi")
print(response)

 using the Monte Carlo method.

Here is a simple C++ code for computing the value of pi using the Monte Carlo method:

```cpp
#include <iostream>
#include <cstdlib>
#include <ctime>

using namespace std;

int main() {
    int num_points = 1000000;
    int num_inside = 0;

    srand(time(0));

    for (int i = 0; i < num_points; i++) {
        double x = static_cast<double>(rand()) / RAND_MAX;
        double y = static_cast<double>(rand()) / RAND_MAX;

        if (x * x + y * y <= 1.0) {
            num_inside++;
        }
    }

    double pi_estimate = 4.0 * static_cast<double>(num_inside) / static_cast<double>(num_points);

    cout << "Estimate of pi: " << pi_estimate << endl;

    return 0;
}
```

This code generates `num_points` random points in the unit square and counts how many of them fall inside the unit circle. The ratio of the number of points inside the circle to the total number of points is then used to estimate the value of pi. The `srand` function is used to seed the r

In [192]:
response = llm_mistral.invoke("Write a C++ code for generating Fibonacci numbers")
print(response)

 up to n.

Here is a simple C++ code to generate Fibonacci numbers up to a given number 'n'.

```cpp
#include<iostream>
using namespace std;

void generateFibonacci(int n) {
    int t1 = 0, t2 = 1, next;

    cout << "Fibonacci Series up to " << n << " : ";

    while (next <= n) {
        // Prints the current number
        cout << next << ", ";

        // Update values
        t1 = t2;
        t2 = next;

        // Find the next number in the series
        next = t1 + t2;
    }
}

int main() {
    int n;

    cout << "Enter the number up to which you want to print Fibonacci series : ";
    cin >> n;

    generateFibonacci(n);

    return 0;
}
```

This code uses a loop to calculate and print the Fibonacci numbers up to the given number 'n'. The variables 't1' and 't2' store the previous two numbers in the series, and 'next' stores the next number in the series. The loop continues until the next number in the series is greater than 'n'.


In [196]:
response = llm_mixtral.invoke("Write a C++ code for generating Fibonacci numbers")
print(response)

 using recursion.

Here is a simple C++ code for generating Fibonacci numbers using recursion:

```cpp
#include<iostream>
using namespace std;

int fibonacci(int n) {
    if(n <= 1)
        return n;
    else
        return fibonacci(n-1) + fibonacci(n-2);
}

int main() {
    int num;
    cout << "Enter the number of Fibonacci numbers to be generated: ";
    cin >> num;
    cout << "Fibonacci Series: ";
    for(int i = 0; i < num; i++)
        cout << fibonacci(i) << " ";
    return 0;
}
```

In this code, the `fibonacci` function calculates the Fibonacci number for a given input `n` using recursion. The base case is when `n` is less than or equal to 1, in which case the function returns `n`. Otherwise, it recursively calls itself with arguments `n-1` and `n-2`, and adds the results to get the Fibonacci number for `n`.

The `main` function prompts the user to enter the number of Fibonacci numbers to be generated, and then calls the `fibonacci` function in a loop to generate the desired

In [198]:
response = llm_mistral.invoke("If a right angle triangle has base 4 and hypotenuse 5, then what is its height?")
print(response)



To find the height of a right angle triangle, we can use the Pythagorean theorem:

$a^2 + b^2 = c^2$

where $a$ and $b$ are the legs (base and height), and $c$ is the hypotenuse.

In this case, we know that $a = 4$ and $c = 5$. We can solve for $b$ (the height):

$4^2 + b^2 = 5^2$
$16 + b^2 = 25$
$b^2 = 9$
$b = 3$

So the height of the right angle triangle is 3.


In [197]:
response = llm_mixtral.invoke("If a right angle triangle has base 4 and hypotenuse 5, then what is its height?")
print(response)



Answer & Explanation

Step 1
In a right angle triangle, the Pythagorean theorem states that the square of the hypotenuse is equal to the sum of the squares of the other two sides.
Step 2
Let the height be h.
Step 3
${5}^{2}={4}^{2}+{h}^{2}$
Step 4
$25=16+{h}^{2}$
Step 5
${h}^{2}=25-16$
Step 6
${h}^{2}=9$
Step 7
h=3 or h=-3
Step 8
Since height cannot be negative, h=3.


# Getting the relevant context for the three questions from labeled data

In [11]:
labeled_data_file = '../../../data/best_buy/questions_statements_labels_edited.csv'
all_labeled_data = pd.read_csv(labeled_data_file).drop(columns=['Unnamed: 0'])

In [12]:
all_labeled_data

Unnamed: 0,statement,question,label_sum,average_label,consensus,reddit_id,aware_post_type,aware_created_ts,reddit_link_id,reddit_parent_id,reddit_permalink,reddit_subreddit
0,![gif](giphy|znRstrOYuirrW),What do Best Buy employees think of the company?,0.0,0.000000,0,ju59a0l,comment,2023-07-30T21:11:35,t3_15e1vvl,t3_15e1vvl,/r/BestBuyWorkers/comments/15e1vvl/customer_po...,BestBuyWorkers
1,#shockedpikachuface,What do Best Buy employees think of the company?,0.0,0.000000,0,hhgbgps,comment,2021-10-21T00:06:32,t3_qafhhx,t1_hh3zf24,/r/BestBuyWorkers/comments/qafhhx/we_can_impro...,BestBuyWorkers
2,12 hr shift here too. Normal pay man,What are the most common reasons for employees...,2.0,0.285714,1,iwshpu5,comment,2022-11-17T19:35:11,t3_yy085z,t3_yy085z,/r/BestBuyWorkers/comments/yy085z/black_friday...,BestBuyWorkers
3,12-day Application Review; What is the usual d...,What are the most common reasons for employees...,0.0,0.000000,0,klk0z1,submission,2020-12-28T00:22:54,,,/r/BestBuyWorkers/comments/klk0z1/12day_applic...,BestBuyWorkers
4,Absolutely. I had a talk with a leader last we...,Do employees feel understaffed?,7.0,1.000000,1,ibg921p,comment,2022-06-07T00:11:03,t3_v5thte,t1_ibd38u3,/r/BestBuyWorkers/comments/v5thte/what_are_som...,BestBuyWorkers
...,...,...,...,...,...,...,...,...,...,...,...,...
85,done company wide every March not on your work...,What do Best Buy employees think of the company?,0.0,0.000000,0,k01yx5z,comment,2023-09-10T21:51:15,t3_16fik5c,t3_16fik5c,/r/BestBuyWorkers/comments/16fik5c/so_are_annu...,BestBuyWorkers
86,i take only cash tips from anyone thats not a ...,Do employees feel understaffed?,0.0,0.000000,0,ki7m0f4,comment,2024-01-16T19:53:41,t3_198a0ow,t3_198a0ow,/r/BestBuyWorkers/comments/198a0ow/customer_ti...,BestBuyWorkers
87,"i've been here for 8 years, work in the highes...",What are the most common reasons for employees...,7.0,1.000000,1,jag8nz0,comment,2023-03-01T00:13:02,t3_11dxaol,t3_11dxaol,/r/BestBuyWorkers/comments/11dxaol/new_to_this...,BestBuyWorkers
88,nah,What do Best Buy employees think of the company?,0.0,0.000000,0,ixd7txp,comment,2022-11-22T10:51:56,t3_z1mzn6,t3_z1mzn6,/r/BestBuyWorkers/comments/z1mzn6/bf_walk_out/...,BestBuyWorkers


In [14]:
labeled_dataset = all_labeled_data.copy()
questions = labeled_dataset.question.unique()

In [15]:
questions

array(['What do Best Buy employees think of the company?',
       'What are the most common reasons for employees to leave Best Buy?',
       'Do employees feel understaffed?'], dtype=object)

In [16]:
q1_dataset = labeled_dataset[labeled_dataset['question']==questions[0]]
q2_dataset = labeled_dataset[labeled_dataset['question']==questions[1]]
q3_dataset = labeled_dataset[labeled_dataset['question']==questions[2]]

In [30]:
# consensus is 1 when at least two people voted the statement relevant

q1_dataset_relevant = q1_dataset[q1_dataset.consensus == 1] 
q1_dataset_not_relevant = q1_dataset[q1_dataset.consensus == 0]

q2_dataset_relevant = q2_dataset[q2_dataset.consensus == 1]
q2_dataset_not_relevant = q2_dataset[q2_dataset.consensus == 0]

q3_dataset_relevant = q3_dataset[q3_dataset.consensus == 1]
q3_dataset_not_relevant = q3_dataset[q3_dataset.consensus == 0]
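
The consensus rule in the comment above can be checked against the label columns. This is a minimal sketch under stated assumptions: the seven-annotator count is inferred from the table (a `label_sum` of 2.0 appears next to an `average_label` of 0.285714, which is 2/7), and `consensus_from_votes` and `average_label` are hypothetical helper names.

```python
N_ANNOTATORS = 7  # assumption: inferred from label_sum=2.0 -> average_label=0.285714


def consensus_from_votes(label_sum: float, min_votes: int = 2) -> int:
    """Return 1 when at least `min_votes` annotators voted 'relevant', else 0."""
    return int(label_sum >= min_votes)


def average_label(label_sum: float, n_annotators: int = N_ANNOTATORS) -> float:
    """Fraction of annotators who voted 'relevant'."""
    return label_sum / n_annotators


for label_sum in (0.0, 2.0, 7.0):
    print(label_sum, round(average_label(label_sum), 6), consensus_from_votes(label_sum))
```

The three printed rows match the `label_sum`, `average_label`, and `consensus` values visible in the table above.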

# Getting to know the subreddit data

In [1200]:
from importlib import reload  # for reloading a python package
construct_thread = reload(construct_thread)

In [1201]:
all_threads = []
for id in all_reddit_ids: # all_reddit_ids computed below: all_reddit_ids = list(df_subreddit[df_subreddit.aware_post_type=='submission'].reddit_name)
    thread = construct_thread.ConsrtuctThread(df_subreddit, id)
    all_threads.append(thread)

### most active users

In [1202]:
all_threads[231].get_author_list()

['bbythrowaway8675309',
 'Libraxl',
 'Remarkable_Slice485',
 'ingestTidePods',
 'moonshinespirits']

In [1145]:
import collections

In [1147]:
authors = collections.Counter(list(df_subreddit.reddit_author))

In [1160]:
# authors with highest net contributions
authors.most_common(10) 

[('bbythrowaway8675309', 279),
 ('None', 86),
 ('Supersith08', 75),
 ('Not_A_Real_Boy69', 60),
 ('darkedgex', 49),
 ('ThirstyNewt', 40),
 ('OldCuban', 36),
 ('Salt_Restaurant_7820', 32),
 ('Elegant_Record9340', 32),
 ('g_gomez0116', 31)]

In [1168]:
all_authors = sorted(authors)
len(all_authors)

1922

In [1225]:
author_spread = []
for author in all_authors:
    if author == 'None':
        continue
    temp = 0
    for thread in all_threads:
        if author in thread.get_author_list():
            temp += 1
    author_spread.append([author, temp])
    

In [1226]:
author_spread.sort(key=lambda x: x[1], reverse=True)

In [1227]:
sum([item[1] for item in author_spread])

4018
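
The nested loop above calls `get_author_list()` once per author and thread. The same per-author thread count can be computed in a single pass over the threads. This is a sketch, not the notebook's code: `author_spread_from_lists` is a hypothetical helper that takes plain lists of author names, and it returns tuples rather than the two-element lists used above.

```python
from collections import Counter


def author_spread_from_lists(author_lists, skip=("None",)):
    """Count, for each author, the number of threads they appear in."""
    counts = Counter()
    for authors in author_lists:
        # set() ensures an author is counted once per thread
        counts.update(set(authors) - set(skip))
    return counts.most_common()  # sorted by thread count, descending


demo = [["alice", "bob", "alice"], ["alice", "None"], ["carol"]]
print(author_spread_from_lists(demo)[0])
```

With the notebook's objects this could be called as `author_spread_from_lists(t.get_author_list() for t in all_threads)`.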

In [1228]:
# authors with most diverse contributions
author_spread

[['bbythrowaway8675309', 178],
 ['Supersith08', 41],
 ['Not_A_Real_Boy69', 33],
 ['darkedgex', 32],
 ['ThirstyNewt', 31],
 ['g_gomez0116', 26],
 ['OldCuban', 24],
 ['carmachu', 24],
 ['Salt_Restaurant_7820', 23],
 ['Elegant_Record9340', 19],
 ['Pedrosha56', 19],
 ['Pitbull1951', 19],
 ['EscalationPro', 18],
 ['-0r1gam1_owl-', 17],
 ['GreyTigerFox', 17],
 ['ChainWorking1096', 16],
 ['Dense_Surround3071', 16],
 ['GSAgentsLivesMatter', 16],
 ['MiniaturePeaches', 16],
 ['niloc1987', 15],
 ['ImTheEnigma', 14],
 ['P4-Kuma', 14],
 ['PuzzleheadedBus3659', 14],
 ['SnooGadgets6277', 14],
 ['CreativeMadness99', 13],
 ['Dense_Cloud1100', 13],
 ['MidnightScott17', 13],
 ['Party-Variation-3174', 13],
 ['Safety_Captn', 13],
 ['Suspicious_Home_4582', 13],
 ['aaronblkfox', 13],
 ['squishey123', 13],
 ['yellowvv', 13],
 ['ChefCen', 12],
 ['LastKnownUser', 12],
 ['Maximum-Humor-', 12],
 ['Nagaflas', 12],
 ['SaltChance3455', 12],
 ['Souper-Doup', 12],
 ['User83829362', 12],
 ['Accomplished_Grab953', 11],


### most discussed threads

In [1245]:
print(llm_mixtral.invoke(f"Based on the list provided, mention the broad topics the author is talking about? {list(df_subreddit[df_subreddit.reddit_author==author_spread[2][0]].reddit_text)}"))



The broad topics the author is talking about include:

1. Employee experiences and concerns at Best Buy, such as reduced headcount, layoffs, and job qualifications.
2. Company policies and procedures, such as page moderation, drug testing, and severance packages.
3. Industry news and trends, such as the rebranding of Great Call and LG Display's supply of OLED TV panels to Samsung.
4. Unionization and worker rights.
5. Personal anecdotes and opinions, such as experiences with management and coworkers.
6. Job search and career advice.
7. Secrecy and non-disclosure agreements.
8. Frustrations with corporate transparency and decision-making.


In [1246]:
print(llm_mixtral.invoke(f"Based on the list provided, mention the broad topics the author is talking about? {list(df_subreddit[df_subreddit.reddit_author==author_spread[12][0]].reddit_text)}"))



The broad topics the author is talking about include:

1. Employee benefits and compensation, such as holiday pay, time and a half pay, and matching contributions for retirement savings.
2. Job codes, certifications, and accommodations for certain positions.
3. The process of clocking in and out for shifts.
4. Labor cuts and available hours.
5. Store pickups and the fulfillment process.
6. The Best Buy Discord and scheduling diagnostic appointments.


In [1118]:
len(df_subreddit[df_subreddit.aware_post_type=='submission'])

827

There are 827 threads (submissions) in total in this subreddit.

In [1120]:
df_subreddit.head(3)

Unnamed: 0,aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission
0,submission,2023-04-16T17:32:45,12opsul,t3_12opsul,1681680765,utaustinresearch,,/r/BestBuyWorkers/comments/12opsul/research_st...,Research Study Recruitment - Managers,https://i.redd.it/nrmkf51rebua1.png,BestBuyWorkers,,,
1,comment,2023-04-16T18:48:06,jgjgy9e,t1_jgjgy9e,1681685286,Not_A_Real_Boy69,![gif](giphy|YmQLj2KxaNz58g7Ofg)\n\n$50?,/r/BestBuyWorkers/comments/12opsul/research_st...,,,BestBuyWorkers,t3_12opsul,t3_12opsul,12opsul
2,comment,2023-04-16T19:55:39,jgjpqmp,t1_jgjpqmp,1681689339,GSAgentsLivesMatter,bullshit on getting $50 its just a coupon to B...,/r/BestBuyWorkers/comments/12opsul/research_st...,,,BestBuyWorkers,t3_12opsul,t3_12opsul,12opsul


In [1126]:
all_reddit_ids = list(df_subreddit[df_subreddit.aware_post_type=='submission'].reddit_name)

In [1141]:
thread_len_dict={}

for id in all_reddit_ids:
    thread = construct_thread.ConsrtuctThread(df_subreddit, id)
    thread_len_dict.update({id : len(thread.get_thread())})

thread_len_dict

{'t3_12opsul': 3,
 't3_12m0ozl': 15,
 't3_12l3md8': 19,
 't3_12gb7ps': 13,
 't3_12g25g2': 5,
 't3_12ectgf': 3,
 't3_12e4wef': 17,
 't3_12du2t0': 11,
 't3_11wk8fh': 3,
 't3_11qb2k2': 3,
 't3_11ne756': 12,
 't3_11hyxkb': 9,
 't3_11fb73q': 6,
 't3_11ew3fs': 2,
 't3_11dxaol': 5,
 't3_11bgct9': 6,
 't3_119at7w': 4,
 't3_117t418': 2,
 't3_1157t7i': 7,
 't3_114sq9a': 3,
 't3_1142913': 11,
 't3_111y6s8': 5,
 't3_10v8unp': 14,
 't3_10usgg4': 12,
 't3_10tdmkb': 17,
 't3_10ljawj': 7,
 't3_10jui63': 2,
 't3_10i9ao0': 13,
 't3_10gllu9': 2,
 't3_10fmkvw': 4,
 't3_10d2uum': 11,
 't3_10d1dj8': 3,
 't3_10cxfxu': 13,
 't3_10brc4f': 7,
 't3_10711zn': 6,
 't3_103bmqk': 8,
 't3_zzx2wq': 3,
 't3_zzj0lf': 5,
 't3_zyq49u': 2,
 't3_zxspyw': 1,
 't3_zxia62': 3,
 't3_zxgmp4': 5,
 't3_zwzlqi': 5,
 't3_zut2tr': 3,
 't3_ztu253': 2,
 't3_zrrenz': 2,
 't3_zqi52d': 3,
 't3_zoqq39': 5,
 't3_znjz7f': 1,
 't3_zj1p6d': 3,
 't3_zh0k71': 9,
 't3_zfknoy': 2,
 't3_zcqnma': 2,
 't3_z9cfmt': 1,
 't3_z7kfmo': 16,
 't3_z6ve99': 4

In [1133]:
thread_len=[]

for id in all_reddit_ids:
    thread = construct_thread.ConsrtuctThread(df_subreddit, id)
    thread_len.append([id, len(thread.get_thread())])

In [1138]:
thread_len.sort(key=lambda x: x[1])
thread_len

[['t3_zxspyw', 1],
 ['t3_znjz7f', 1],
 ['t3_z9cfmt', 1],
 ['t3_w6md2b', 1],
 ['t3_ugbg5l', 1],
 ['t3_udczwq', 1],
 ['t3_u9gfjv', 1],
 ['t3_tginwj', 1],
 ['t3_tfdswx', 1],
 ['t3_q3jp1k', 1],
 ['t3_q1jumh', 1],
 ['t3_pz2gf4', 1],
 ['t3_ptpa4y', 1],
 ['t3_p5w70t', 1],
 ['t3_oicyxd', 1],
 ['t3_of5f1k', 1],
 ['t3_o55u3j', 1],
 ['t3_nw7y6z', 1],
 ['t3_mz4g6j', 1],
 ['t3_mfeb7f', 1],
 ['t3_lwgv4o', 1],
 ['t3_lsmfjh', 1],
 ['t3_low5bv', 1],
 ['t3_lhngdx', 1],
 ['t3_lhlpyv', 1],
 ['t3_lhb7ho', 1],
 ['t3_lg5of6', 1],
 ['t3_letwhq', 1],
 ['t3_lesmde', 1],
 ['t3_les77v', 1],
 ['t3_kchxrc', 1],
 ['t3_k2a20b', 1],
 ['t3_k18u9o', 1],
 ['t3_jdgiw1', 1],
 ['t3_iubdak', 1],
 ['t3_ib0drx', 1],
 ['t3_hyzv8a', 1],
 ['t3_hygauo', 1],
 ['t3_gin14r', 1],
 ['t3_g9xfim', 1],
 ['t3_g1xznw', 1],
 ['t3_g1upnx', 1],
 ['t3_134bzme', 1],
 ['t3_1365mo9', 1],
 ['t3_13c0ls4', 1],
 ['t3_13dyobj', 1],
 ['t3_13fq098', 1],
 ['t3_13kxky3', 1],
 ['t3_13ngkrz', 1],
 ['t3_140k1b8', 1],
 ['t3_141be7b', 1],
 ['t3_141fcm8', 1],
 [

# Get the threads of the relevant context

In [28]:
import json
def set_subreddit(sub_reddit):
    with open(sub_reddit+'.json', 'r', encoding='utf-8') as f:
        dat = json.load(f)
    return pd.DataFrame(dat)

In [29]:
df_subreddit = set_subreddit('BestBuyWorkers')

## Getting the relevant threads for the three questions

In [132]:
# Return the thread identifier for each row of the given DataFrame:
# reddit_link_id for comments, reddit_name for submissions
def get_ids(temp_dataset):
    temp_l1 = list(temp_dataset[temp_dataset.aware_post_type == 'comment'].reddit_link_id)
    temp_list = list(temp_dataset[temp_dataset.aware_post_type == 'submission'].reddit_id)
    temp_l1.extend(df_subreddit[df_subreddit.reddit_id == e].iloc[0].reddit_name for e in temp_list)
    return temp_l1

In [84]:
print(len(set(get_ids(q1_dataset_not_relevant))))
print(len(set(get_ids(q2_dataset_not_relevant))))
print(len(set(get_ids(q3_dataset_not_relevant))))

print()

print(len(set(get_ids(q1_dataset_relevant))))
print(len(set(get_ids(q2_dataset_relevant))))
print(len(set(get_ids(q3_dataset_relevant))))

20
18
19

10
10
10
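
The set sizes printed for the non-relevant groups (18 and 19) are smaller than the 20 entries in the corresponding lists, so some threads are constructed and summarized more than once. Below is a minimal sketch of the id mapping plus an order-preserving de-duplication. Both helpers are hypothetical, and the `t3_` prefix rule is taken from the sample rows, where `reddit_id` `12opsul` has `reddit_name` `t3_12opsul`.

```python
def thread_fullname(post_type, reddit_id, reddit_link_id=None):
    """Map a labeled row to the fullname of the thread it belongs to."""
    if post_type == "submission":
        return "t3_" + reddit_id       # submissions: prefix their own id
    if post_type == "comment":
        if not reddit_link_id:
            raise ValueError("comment row without reddit_link_id")
        return reddit_link_id          # comments: already carry the thread fullname
    raise ValueError(f"unknown post type: {post_type!r}")


def unique_in_order(ids):
    """Drop repeated ids while keeping first-seen order."""
    return list(dict.fromkeys(ids))


ids = [
    thread_fullname("comment", "ju59a0l", "t3_15e1vvl"),
    thread_fullname("submission", "klk0z1"),
    thread_fullname("comment", "hhgbgps", "t3_15e1vvl"),
]
print(unique_in_order(ids))
```

If de-duplicated ids are used to build the threads, the same de-duplicated list has to be used wherever ids are later paired with the summaries.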


In [90]:
import construct_thread

In [106]:
q1_threads = []
q2_threads = []
q3_threads = []

for id in get_ids(q1_dataset_relevant):
    q1_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

for id in get_ids(q2_dataset_relevant):
    q2_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

for id in get_ids(q3_dataset_relevant):
    q3_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

In [124]:
q1_irr_threads = []
q2_irr_threads = []
q3_irr_threads = []

for id in get_ids(q1_dataset_not_relevant):
    q1_irr_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

for id in get_ids(q2_dataset_not_relevant):
    q2_irr_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

for id in get_ids(q3_dataset_not_relevant):
    q3_irr_threads.append(construct_thread.ConsrtuctThread(df_subreddit, id))

Some conversations are very long (see the character counts below), which makes them problematic for LLM summarization.

In [426]:
print([len(' '.join(thread.get_conversation())) for thread in q1_threads])
print([len(' '.join(thread.get_conversation()[:26])) for thread in q1_threads])
print([len(' '.join(thread.get_conversation())) for thread in q2_threads])
print([len(' '.join(thread.get_conversation()[:50])) for thread in q2_threads])
print([len(' '.join(thread.get_conversation())) for thread in q3_threads])

[3915, 3034, 7184, 2791, 411, 9732, 4796, 1559, 14606, 0]
[3915, 3034, 7184, 2791, 411, 9732, 4796, 1559, 12793, 0]
[589, 7931, 1398, 4464, 3731, 19962, 4808, 9343, 2645, 4678]
[589, 7931, 1398, 4464, 3731, 9066, 4808, 9343, 2645, 4678]
[3304, 3105, 8506, 2132, 1773, 600, 2284, 2289, 1516, 879]


In [402]:
print([len(' '.join(thread.get_conversation())) for thread in q1_irr_threads])
print([len(' '.join(thread.get_conversation()[:50])) for thread in q1_irr_threads])
print([len(' '.join(thread.get_conversation())) for thread in q2_irr_threads])
print([len(' '.join(thread.get_conversation()[:30])) for thread in q2_irr_threads])
print([len(' '.join(thread.get_conversation())) for thread in q3_irr_threads])

[3241, 4293, 610, 5695, 6231, 3198, 525, 2408, 1282, 1468, 2920, 1910, 1213, 19962, 967, 2403, 10428, 1950, 327, 938]
[3241, 4293, 610, 5695, 6231, 3198, 525, 2408, 1282, 1468, 2920, 1910, 1213, 9066, 967, 2403, 10428, 1950, 327, 938]
[3915, 638, 2638, 6162, 1246, 1093, 1566, 11677, 1864, 9449, 9449, 1996, 2026, 2826, 3650, 336, 1845, 11677, 2308, 816]
[3915, 638, 2638, 5479, 1246, 1093, 1566, 7776, 1864, 5749, 5749, 1996, 2026, 2826, 3650, 336, 1845, 7776, 2308, 816]
[7479, 634, 389, 1144, 5177, 8275, 947, 2906, 2903, 2079, 105, 254, 7931, 10720, 7479, 1845, 3531, 1398, 2548, 400]
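
The cells above cap long threads by a fixed number of comments (`[:26]`, `[:30]`, `[:50]`). An alternative is to cap by total characters, so that every thread fits the same budget regardless of comment length. This is a sketch only: `truncate_to_char_budget` is a hypothetical helper, and the budget value is an assumption that would need tuning per model.

```python
def truncate_to_char_budget(comments, max_chars, sep=" "):
    """Keep leading comments while the joined text stays within max_chars."""
    kept = []
    total = 0
    for comment in comments:
        # account for the separator that ' '.join() would add
        added = len(comment) + (len(sep) if kept else 0)
        if total + added > max_chars:
            break
        kept.append(comment)
        total += added
    return kept


demo = ["aaaa", "bbbb", "cccc"]
print(truncate_to_char_budget(demo, max_chars=9))
```

With the notebook's objects this could replace the slices, for example `truncate_to_char_budget(thread.get_conversation(), max_chars=9000)`.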


In [156]:
get_ids(q1_dataset_not_relevant)[-7]

't3_18rkcrr'

In [157]:
get_ids(q2_dataset_relevant)[-5]

't3_18rkcrr'

## Summarization

In [143]:
from langchain.prompts import PromptTemplate

In [144]:
def gen_prompt(title, conversation_list):
    template = """You are a conversation summarizing AI agent. The conversation with title, "{title}", is given to you in form of a Python list: {list}.
                  Paraphrase a precise summary capturing key highlights of this conversation. If the conversation is empty then discuss the context using the given title. 
               """
    # template = """Summarize the following list of conversations: {list} 
    #               Paraphrase your output."""
    prompt = PromptTemplate(template=template, input_variables=['title', 'list'])
    prompt_formatted_str: str = prompt.format(title=title, list=conversation_list)
    return prompt_formatted_str
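
The triple-quoted template above carries the source indentation of its second and third lines into the prompt. Below is a minimal sketch of the same prompt without that leading whitespace, using only the standard library (`textwrap.dedent` and `str.format`) instead of `PromptTemplate`. `gen_prompt_plain` is a hypothetical name, and whether the extra whitespace affects the summaries is not established here.

```python
import textwrap

# Same wording as the template above, with the indentation stripped
TEMPLATE = textwrap.dedent("""\
    You are a conversation summarizing AI agent. The conversation with title, "{title}", is given to you in form of a Python list: {list}.
    Paraphrase a precise summary capturing key highlights of this conversation. If the conversation is empty then discuss the context using the given title.""")


def gen_prompt_plain(title, conversation_list):
    """Fill the template; the list is rendered with its Python repr."""
    return TEMPLATE.format(title=title, list=conversation_list)


print(gen_prompt_plain("Example title", ["first comment", "second comment"]))
```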

In [205]:
def llm_response(title, conv_list, llm=llm_mistral):
    prompt = gen_prompt(title, conv_list)
    response = llm.invoke(prompt)
    # return out.split('Write a summary capturing key highlights of the above conversation.')[-1]
    return response

In [163]:
from tqdm import tqdm

In [165]:
q3_summary = []

for thread in tqdm(q3_threads):
    title = thread.get_title()
    conv = thread.get_conversation()
    q3_summary.append(llm_response(title, conv))

100%|███████████████████████████████████████████| 10/10 [00:37<00:00,  3.72s/it]


In [171]:
q3_irr_summary = []

for thread in tqdm(q3_irr_threads):
    title = thread.get_title()
    conv = thread.get_conversation()
    q3_irr_summary.append(llm_response(title, conv))

100%|███████████████████████████████████████████| 20/20 [00:51<00:00,  2.56s/it]


In [186]:
q3_irr_summary[2]

" Title: Corie Berry's mole\n\n                This conversation appears to be centered around Corie Berry's mole and why she hasn't had it removed, as suggested by the first comment. The conversation then takes a humorous turn with references to aiming at objects during urination and the use of a fake fly. The conversation continues with comments from various participants expressing their reactions to the mole and the situation, with some adding their own humorous perspectives. There is no clear indication of who Corie Berry is or what context this conversation is taking place in, so it remains unclear why the topic of her mole has come up."

In [185]:
q3_irr_threads[3].get_url()

'https://www.reddit.com/r/BestBuyWorkers/comments/14ngsbs'

In [183]:
questions

array(['What do Best Buy employees think of the company?',
       'What are the most common reasons for employees to leave Best Buy?',
       'Do employees feel understaffed?'], dtype=object)

## Summarization with Mixtral

In [362]:
len(q1_threads[8].get_conversation()[:20])

20

In [394]:
q1_summary_mix = []

for thread in tqdm(q1_threads):
    title = thread.get_title()
    conv = thread.get_conversation()[:25]
    q1_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

100%|███████████████████████████████████████████| 10/10 [00:12<00:00,  1.22s/it]


In [395]:
q1_irr_summary_mix = []

for thread in tqdm(q1_irr_threads):
    title = thread.get_title()
    conv = thread.get_conversation()[:50]
    q1_irr_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

100%|███████████████████████████████████████████| 20/20 [02:48<00:00,  8.45s/it]


In [404]:
q2_summary_mix = []

for thread in tqdm(q2_threads):
    title = thread.get_title()
    conv = thread.get_conversation()[:50]
    q2_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

 90%|███████████████████████████████████████▌    | 9/10 [01:00<00:06,  6.71s/it]


HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1 (Request ID: th4c5-isp3-6LSu4e9Rh3)

In [432]:
q2_summary_mix.append(llm_response(q2_threads[-1].get_title(), q2_threads[-1].get_conversation()[:12], llm=llm_mixtral))
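
The manual retry above (50 comments failed, 12 succeeded) can be sketched as a loop that halves the number of comments after each failure. This is an illustration under assumptions: `summarize_with_backoff` is a hypothetical helper, it accepts any callable with the signature `(title, comments)`, and it catches every exception, whereas real code would probably catch only the `HfHubHTTPError` seen in the traceback. Whether the 500 error was actually caused by input length is not confirmed.

```python
def summarize_with_backoff(summarize, title, comments, start=50):
    """Call summarize(title, comments[:n]), halving n after each failure."""
    n = min(start, len(comments))
    while True:
        try:
            return summarize(title, comments[:n]), n
        except Exception:
            if n <= 1:
                raise      # nothing left to trim; surface the error
            n //= 2        # 50 -> 25 -> 12 -> 6 -> 3 -> 1


# Stand-in for the LLM call: fails when given more than 12 comments
def fake_summarize(title, comments):
    if len(comments) > 12:
        raise RuntimeError("500 Server Error")
    return f"{title}: {len(comments)} comments"


print(summarize_with_backoff(fake_summarize, "demo", ["c"] * 50))
```

With the notebook's objects the callable could be `lambda t, c: llm_response(t, c, llm=llm_mixtral)`.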

In [403]:
q2_irr_summary_mix = []

for thread in tqdm(q2_irr_threads):
    title = thread.get_title()
    conv = thread.get_conversation()[:30]
    q2_irr_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

100%|███████████████████████████████████████████| 20/20 [01:46<00:00,  5.33s/it]


In [208]:
q3_summary_mix = []

for thread in tqdm(q3_threads):
    title = thread.get_title()
    conv = thread.get_conversation()
    q3_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

q3_irr_summary_mix = []

for thread in tqdm(q3_irr_threads):
    title = thread.get_title()
    conv = thread.get_conversation()
    q3_irr_summary_mix.append(llm_response(title, conv, llm=llm_mixtral))

100%|███████████████████████████████████████████| 10/10 [01:00<00:00,  6.07s/it]
100%|███████████████████████████████████████████| 20/20 [02:11<00:00,  6.56s/it]


In [214]:
print(q3_irr_summary[0])
print(q3_irr_summary_mix[0])

 The conversation revolves around employees' experiences and frustrations with management and Best Buy since recent changes have been implemented. Many employees express feeling hopeless, mentally and emotionally drained, and that HR is unresponsive. Some mention that upper management is difficult to work with and that EET (employee experience team) is a joke. Several employees have left the company due to these issues and some are considering doing so. The conversation also touches on the topic of unions and their role in protecting workers. Some employees share their personal experiences of being retaliated against or treated unfairly by management. Overall, the conversation reflects a sense of dissatisfaction and frustration among employees towards management and Best Buy.

The conversation revolves around the experiences of employees working at Best Buy post certain changes. The employees express their dissatisfaction with the management, HR, and the EET. They feel that HR is not r

In [451]:
df1 = pd.DataFrame({'reddit_id' :get_ids(q1_dataset_relevant), 'context': q1_summary_mix})
df2 = pd.DataFrame({'reddit_id' :get_ids(q1_dataset_not_relevant), 'context': q1_irr_summary_mix})
q1_summarized_contexts = pd.concat([df1, df2]).reset_index().drop(columns=['index'])

In [453]:
list(q1_summarized_contexts.context)

["\nThe conversation is about tips and advice for people employed at Best Buy. The first tip is to work hard but not too hard, as being likeable is often more important than personal performance for promotions. The second tip is to consider switching stores if management is not favorable. The third tip is that there is no job security in retail and the ability to adapt is crucial. The fourth tip is to not be afraid to apply for positions one does not feel qualified for. The fifth tip is to avoid post-military employees in management roles as they take their jobs too seriously. The sixth tip is to stand up for oneself. The seventh tip is to be nice to installers as they can help with mistakes on orders. The eighth tip is to also be nice to Geek Squad employees. The ninth tip is that HR does not mess around and will take action if there is a valid case. The tenth tip is that corporate does not view retail positions as careers. The eleventh tip is that the current sales model is broken. T

# LLM's classification of context

## Marking relevancy for each question over the entire test dataset. 

In [840]:
tr1 = q1_summarized_contexts.copy()
tr2 = q2_summarized_contexts.copy()
tr3 = q3_summarized_contexts.copy()

In [844]:
trf = pd.concat([tr1, tr2, tr3]).reset_index(drop=True)

In [940]:
trf

Unnamed: 0,reddit_id,context
0,t3_199f5h8,\nThe conversation is about tips and advice fo...
1,t3_wapsy3,\nThe conversation revolves around a potential...
2,t3_1aucdoj,\nThe conversation revolves around a seasonal ...
3,t3_yuacvb,\nThe conversation is about getting a job at B...
4,t3_12scwfu,\nThe conversation is about the potential cutt...
...,...,...
85,t3_jx0q4y,"\nIn this conversation, a new employee at Best..."
86,t3_1382vj6,\nThe conversation revolves around potential j...
87,t3_qhb9v1,"\nIn this conversation, the participants discu..."
88,t3_198a0ow,"\nIn this conversation, participants discuss t..."


In [892]:
llm_mistralv1_classify = HuggingFaceEndpoint(repo_id='mistralai/Mistral-7B-Instruct-v0.1', huggingfacehub_api_token=huggingfacehub_api_token, temperature=0.01)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful


In [920]:
llm_mistral_classify = HuggingFaceEndpoint(repo_id='mistralai/Mistral-7B-Instruct-v0.2', huggingfacehub_api_token=huggingfacehub_api_token, temperature=0.01)
llm_mixtral_classify =  HuggingFaceEndpoint(repo_id='mistralai/Mixtral-8x7B-Instruct-v0.1', huggingfacehub_api_token=huggingfacehub_api_token, temperature=0.01)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful
Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful


In [921]:
q1_classify_mixtral_ALL = {}

# Classify every summarized context against Question 1 (1 = relevant, 0 = irrelevant)
for i, context in enumerate(trf.context):
    response = llm_mixtral_classify.invoke(classification_prompt(context, questions[0]))
    q1_classify_mixtral_ALL[i] = response
    print(i, " : ", response)

0  :  1
1  :  1
2  :  0
3  :  1
4  :  0
5  :  0
6  :  0
7  :  1
8  :  1
                    """
9  :  1
10  :  0
11  :  1
12  :  0
13  :  0
14  :  0
15  :  0
16  :  0
17  :  1
18  :  0
19  :  0
20  :  0
21  :  0
22  :  0
23  :  0
24  :  0
25  :  0
26  :  1
27  :  0
28  :  0
29  :  0
30  :  0
31  :  0
32  :  1
33  :  1
34  :  0
35  :  0
36  :  0
37  :  1
38  :  1
39  :  1
40  :  1
41  :  0
42  :  1
43  :  0
44  :  0
45  :  0
46  :  1
47  :  0
48  :  0
49  :  1
50  :  1
51  :  0
52  :  1
53  :  0
54  :  0
55  :  0
56  :  0
57  :  0
58  :  0
59  :  0
60  :  1
61  :  0
62  :  1
63  :  0
64  :  0
65  :  0
66  :  0
67  :  0
68  :  0
69  :  0
70  :  1
71  :  0
72  :  0
73  :  0
74  :  0
75  :  0
76  :  0
77  :  1
78  :  0
79  :  0
80  :  0
81  :  0
82  :  0
83  :  0
84  :  1
85  :  0
86  :  0
87  :  1
88  :  0
89  :  0


In [922]:
q2_classify_mixtral_ALL = {}

# Classify every summarized context against Question 2 (1 = relevant, 0 = irrelevant)
for i, context in enumerate(trf.context):
    response = llm_mixtral_classify.invoke(classification_prompt(context, questions[1]))
    q2_classify_mixtral_ALL[i] = response
    print(i, " : ", response)

0  :  1
1  :  1
2  :  0
3  :  0
4  :  0
5  :  0
6  :  1
7  :  1
8  :  1
                    """
9  :  1
10  :  0
11  :  0
12  :  0
13  :  0
14  :  0
15  :  0
16  :  0
17  :  0
18  :  0
19  :  0
20  :  0
21  :  0
22  :  0
23  :  0
24  :  0
25  :  0
26  :  0
27  :  0
28  :  0
29  :  0
30  :  0
31  :  0
32  :  1
33  :  0
34  :  0
35  :  0
36  :  0
37  :  0
38  :  0
39  :  0
40  :  1
41  :  0
42  :  1
43  :  0
44  :  0
45  :  0
46  :  0
47  :  0
48  :  0
49  :  0
50  :  0
51  :  0
52  :  0
53  :  0
54  :  1
55  :  0
56  :  0
57  :  0
58  :  0
59  :  0
60  :  1
61  :  1
62  :  1
63  :  1
64  :  0
65  :  0
66  :  0
67  :  0
68  :  1
69  :  0
70  :  1
71  :  0
72  :  0
73  :  0
74  :  0
75  :  0
76  :  0
77  :  0
78  :  0
79  :  0
80  :  0
81  :  0
82  :  0
83  :  0
84  :  1
85  :  0
86  :  0
87  :  1
88  :  0
89  :  0


In [923]:
q3_classify_mixtral_ALL = {}

# Classify every summarized context against Question 3 (1 = relevant, 0 = irrelevant)
for i, context in enumerate(trf.context):
    response = llm_mixtral_classify.invoke(classification_prompt(context, questions[2]))
    q3_classify_mixtral_ALL[i] = response
    print(i, " : ", response)

0  :  0
1  :  1
2  :  0
3  :  1
4  :  0
5  :  0
6  :  0
7  :  0
8  :  1
9  :  1
10  :  0
11  :  0
12  :  0
13  :  0
14  :  1
15  :  0
16  :  0
17  :  1
18  :  0
19  :  0
20  :  0
21  :  0
22  :  0
23  :  1
24  :  0
25  :  0
26  :  1
27  :  0
28  :  0
29  :  0
30  :  0
31  :  0
32  :  0
33  :  0
34  :  0
35  :  1
36  :  0
37  :  0
38  :  1
39  :  0
40  :  0
41  :  0
42  :  1
43  :  0
44  :  0
45  :  0
46  :  0
47  :  0
48  :  0
49  :  1
50  :  1
51  :  0
52  :  0
53  :  0
54  :  0
55  :  0
56  :  0
57  :  0
58  :  0
59  :  0
60  :  1
61  :  1
62  :  1
63  :  0
64  :  1
65  :  0
66  :  1
67  :  1
68  :  1
69  :  1
70  :  1
71  :  0
72  :  0
73  :  0
74  :  0
75  :  0
76  :  0
77  :  1
78  :  0
79  :  0
80  :  0
81  :  0
82  :  0
83  :  0
84  :  1
85  :  0
86  :  0
87  :  0
88  :  0
89  :  0


In [925]:
llm_mixtral_classify.invoke("what is the sum of sqrt(2) and sqrt(7)")

'\n\nAnswer:\nsqrt(2) + sqrt(7)\n\nExplanation:\nThe sum of sqrt(2) and sqrt(7) cannot be simplified further, so the answer is sqrt(2) + sqrt(7).'

In [941]:
trf = trf.assign(**{'Question 1': [int(q1_classify_mixtral_ALL[i]) for i in range(90)],
                    'Question 2': [int(q2_classify_mixtral_ALL[i]) for i in range(90)],
                    'Question 3': [int(q3_classify_mixtral_ALL[i]) for i in range(90)]})
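The `int(...)` conversion above only works when every response is a bare digit. Some responses recorded in this notebook are not: for example `1.`, `1. The context mentions ...`, or a `1` followed by a newline and stray triple quotes (index 8 in the outputs above). A minimal sketch of a more tolerant parser, assuming the label is always the first token of the response; `parse_label` is a hypothetical helper, not part of the original notebook:

```python
import re

def parse_label(response: str) -> int:
    """Return the leading 0/1 label from an LLM response (hypothetical helper)."""
    # Accept optional leading whitespace, then a single 0 or 1 that is not
    # part of a longer number (so '10' is rejected).
    match = re.match(r"\s*([01])(?!\d)", response)
    if match is None:
        raise ValueError(f"No 0/1 label found in response: {response!r}")
    return int(match.group(1))

# Response shapes taken from the outputs in this notebook
examples = [
    "1",
    "0.",
    '1\n                    """',
    "1. The context mentions opinions of Best Buy employees.",
]
labels = [parse_label(r) for r in examples]
print(labels)
```

Raising on an unparseable response (rather than guessing) keeps free-text answers such as `Relevant.` from silently entering the dataset.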

In [947]:
sum(list(trf["Question 3"]))

24

## Check consistency with an embedding model

In [None]:
mistral_api_key = getpass()

In [None]:
openai_api_key = getpass()

In [963]:
from openai import OpenAI
client = OpenAI(api_key = openai_api_key)

In [1004]:
from mistralai.client import MistralClient

In [1028]:
mistral_client = MistralClient(api_key=mistral_api_key)

def create_mistral_embeddings(inputs):
    # Embed a list of texts with the mistral-embed model
    response = mistral_client.embeddings(
        model="mistral-embed",
        input=inputs
    )
    return [data.embedding for data in response.data]

In [964]:
def create_embeddings(texts):
    # Embed a list of texts with OpenAI's text-embedding-ada-002 model
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts)
    response_dict = response.model_dump()
    return [data['embedding'] for data in response_dict['data']]

In [968]:
questions_embeddings = create_embeddings(list(questions))

In [976]:
context_embeddings = create_embeddings(list(trf.context))

In [1029]:
questions_embeddings_mistral = create_mistral_embeddings(list(questions))

In [1031]:
context_embeddings_mistral = create_mistral_embeddings(list(trf.context))

In [980]:
from scipy.spatial import distance

In [1041]:
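# Each entry: [row index, Mixtral label for Question 1, cosine distance between the ada-002 context and question embeddings]
ll = [[i, trf["Question 1"].iloc[i], distance.cosine(context_embeddings[i], questions_embeddings[0])] for i in range(90)]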
ll.sort(key=lambda x: x[2])
ll

[[84, 1, 0.1070032357786409],
 [70, 1, 0.1070475291170856],
 [1, 1, 0.11712483525827588],
 [52, 1, 0.11816193191184776],
 [77, 1, 0.12374541834564623],
 [39, 1, 0.12651445134262296],
 [87, 1, 0.12769744155219398],
 [32, 1, 0.1277928795815121],
 [11, 1, 0.1281779430023191],
 [61, 0, 0.13098839728786427],
 [26, 1, 0.131120404395226],
 [36, 0, 0.1322879599549348],
 [3, 1, 0.13581002084814942],
 [62, 1, 0.13703993791257818],
 [33, 1, 0.13709789959023522],
 [83, 0, 0.13979558009697224],
 [38, 1, 0.13986244565915118],
 [73, 0, 0.13988540875219246],
 [46, 1, 0.14195007226110357],
 [0, 1, 0.14287008201463625],
 [9, 1, 0.14289116039230854],
 [40, 1, 0.14323419664601544],
 [49, 1, 0.14641041859281134],
 [50, 1, 0.14641041859281134],
 [18, 0, 0.14747032952661443],
 [37, 1, 0.14971158768276405],
 [6, 0, 0.15000092443757262],
 [17, 1, 0.15329325440416264],
 [4, 0, 0.1581308479663036],
 [54, 0, 0.16045538036409757],
 [22, 0, 0.16157638420753018],
 [7, 1, 0.16269566874487773],
 [10, 0, 0.166435972536
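`distance.cosine(u, v)` from SciPy returns the cosine distance, i.e. one minus the cosine similarity, so smaller values mean the context is closer to the question. A small pure-Python sketch of the same quantity and of the ranking step, using made-up toy vectors (not real embeddings); `cosine_distance` and `rank_by_distance` are illustrative names, not functions from the notebook:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rank_by_distance(context_vectors, question_vector):
    """Return [index, distance] pairs sorted from closest to farthest."""
    pairs = [[i, cosine_distance(vec, question_vector)]
             for i, vec in enumerate(context_vectors)]
    pairs.sort(key=lambda pair: pair[1])
    return pairs

# Toy 2-D vectors, made up for illustration only
question = [1.0, 0.0]
contexts = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
ranking = rank_by_distance(contexts, question)
print(ranking)
```

Because the measure ignores vector length, `[2.0, 0.0]` is at distance 0 from the question even though its magnitude differs.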

In [1042]:
# Same ranking as above, but using the mistral-embed embeddings
ll_mistral = [[i, trf["Question 1"].iloc[i], distance.cosine(context_embeddings_mistral[i], questions_embeddings_mistral[0])] for i in range(90)]
ll_mistral.sort(key=lambda x: x[2])
ll_mistral

[[52, 1, 0.15414895368939407],
 [70, 1, 0.15594257755110474],
 [84, 1, 0.15594257755110474],
 [32, 1, 0.17129064548428286],
 [87, 1, 0.17129064548428286],
 [1, 1, 0.1736624030528855],
 [62, 1, 0.18042414789870553],
 [46, 1, 0.18495497440709763],
 [38, 1, 0.18822622339032735],
 [37, 1, 0.1891480832644954],
 [77, 1, 0.1916445309649496],
 [0, 1, 0.1920999127365388],
 [40, 1, 0.1920999127365388],
 [18, 0, 0.19217861039763795],
 [36, 0, 0.19409399979391595],
 [11, 1, 0.19644387625313386],
 [3, 1, 0.20239975600210114],
 [65, 0, 0.20334882696900713],
 [7, 1, 0.20523497709083682],
 [39, 1, 0.2054193469333475],
 [74, 0, 0.20666373532738236],
 [73, 0, 0.20830516889398443],
 [61, 0, 0.20940764673519852],
 [6, 0, 0.21010583689731321],
 [83, 0, 0.21121858026647433],
 [49, 1, 0.2122486490465315],
 [50, 1, 0.2122486490465315],
 [31, 0, 0.21265474801836914],
 [82, 0, 0.21265474801836914],
 [33, 1, 0.2136791718265022],
 [17, 1, 0.21416324041490986],
 [26, 1, 0.21635673269466216],
 [22, 0, 0.22117766247

In [1086]:
list(questions)

['What do Best Buy employees think of the company?',
 'What are the most common reasons for employees to leave Best Buy?',
 'Do employees feel understaffed?']

In [1115]:
trf.iloc[1]

reddit_id                                             t3_wapsy3
context       \nThe conversation revolves around a potential...
Question 1                                                    1
Question 2                                                    1
Question 3                                                    1
Name: 1, dtype: object

In [None]:
thread = construct_thread.ConsrtuctThread(df_subreddit, )

As expected, the Mistral embeddings seem to work better, presumably because the summarization was done with a Mistral LLM.

In [1071]:
sorted([e[0] for e in ll][:20])

[0, 1, 3, 11, 26, 32, 33, 36, 38, 39, 46, 52, 61, 62, 70, 73, 77, 83, 84, 87]

In [1072]:
sorted([e[0] for e in ll_mistral][:20])

[0, 1, 3, 7, 11, 18, 32, 36, 37, 38, 39, 40, 46, 52, 62, 65, 70, 77, 84, 87]
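One way to quantify this comparison is to check how many of the 20 closest contexts Mixtral labelled as relevant for Question 1, and how much the two top-20 sets overlap. The sketch below hard-codes the `[index, label]` pairs from the first 20 rows of the `ll` and `ll_mistral` outputs shown earlier; the helper name `precision_at_k` is an assumption for illustration:

```python
def precision_at_k(ranked, k):
    """Fraction of the first k (index, label) pairs whose label is 1."""
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

# (context index, Mixtral label for Question 1), closest first,
# copied from the first 20 rows of the `ll` output (text-embedding-ada-002)
ada_top20 = [(84, 1), (70, 1), (1, 1), (52, 1), (77, 1), (39, 1), (87, 1),
             (32, 1), (11, 1), (61, 0), (26, 1), (36, 0), (3, 1), (62, 1),
             (33, 1), (83, 0), (38, 1), (73, 0), (46, 1), (0, 1)]
# Same, copied from the first 20 rows of the `ll_mistral` output (mistral-embed)
mistral_top20 = [(52, 1), (70, 1), (84, 1), (32, 1), (87, 1), (1, 1), (62, 1),
                 (46, 1), (38, 1), (37, 1), (77, 1), (0, 1), (40, 1), (18, 0),
                 (36, 0), (11, 1), (3, 1), (65, 0), (7, 1), (39, 1)]

ada_precision = precision_at_k(ada_top20, 20)
mistral_precision = precision_at_k(mistral_top20, 20)
overlap = {i for i, _ in ada_top20} & {i for i, _ in mistral_top20}

print(ada_precision, mistral_precision, len(overlap))
```

With these values, 16 of the top 20 ada-002 neighbours and 17 of the top 20 mistral-embed neighbours carry a Mixtral label of 1, and the two top-20 sets share 15 contexts. The gap is one context, so it supports the observation only weakly.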

In [1075]:
trf.iloc[18].context

'\nSummary: The user has applied for a job at Best Buy and has a scheduled Zoom interview for Wednesday. They are considering whether to accept the interview or focus on a grocery store job offer instead. The user is looking for short-term employment until New Year or early 2023. A conversation participant suggests that Best Buy is a good place to work, with better pay than minimum wage and a focus on technology. However, selling services and warranties might not be comfortable for everyone. Another participant mentions that grocery stores do not usually provide discounts, but Walmart and Costco offer employee discounts in some countries. The user clarifies that they are referring to regional grocery store chains.'

In [1082]:
(q1: 'What do Best Buy employees think of the company?', q2: 'What are the most common reasons for employees to leave Best Buy?', q3: 'Do employees feel understaffed?')

array(['What do Best Buy employees think of the company?',
       'What are the most common reasons for employees to leave Best Buy?',
       'Do employees feel understaffed?'], dtype=object)

## Question 1

#### Marking relevancy using LLM

In [484]:
from langchain_core.prompts import PromptTemplate

def classification_prompt(context, question):
    # Build the 0/1 relevance-classification prompt for one (context, question) pair
    template = """  You are an AI agent that classifies context as "relevant" or "irrelevant" depending on the question.
                    Given the question and the context below classify whether the context is relevant or irrelevant for answering the given question.
                    Do not give any justification for your decision. For relevant context answer with 1 and for irrelevant context answer with 0.
                    Question: {question}
                    Context: {context}
                    Response:
                    """
    prompt = PromptTemplate(template=template, input_variables=['context', 'question'])
    prompt_formatted_str: str = prompt.format(context=context, question=question)
    return prompt_formatted_str

In [507]:
q1_classify_mistral[3] = '1.'
q1_classify_mistral

{0: '1.',
 1: '1.',
 2: '0.',
 3: '1.',
 4: '0.',
 5: '0.',
 6: '1.',
 7: '1.',
 8: '0.',
 9: '1.',
 10: '0.',
 11: '1.',
 12: '0.',
 13: '0.',
 14: '0.',
 15: '0.',
 16: '0.',
 17: '1.',
 18: '0.',
 19: '0.',
 20: '0.',
 21: '0.',
 22: '0.',
 23: '1',
 24: '0.',
 25: '1.',
 26: '1.',
 27: '0.',
 28: '0.',
 29: '0.'}

In [504]:
q1_classify_mistral = {}

for i in range(30):
    response = llm_mistral.invoke(classification_prompt(list(q1_summarized_contexts.context)[i], questions[0]))
    q1_classify_mistral.update({i: response})
    print(i, " : ", response)

0  :  1.
1  :  1.
2  :  0.
3  :  1. The context mentions opinions of Best Buy employees, which is relevant to the question.
4  :  0.
5  :  0.
6  :  1.
7  :  1.
8  :  0.
9  :  1.
10  :  0.
11  :  1.
12  :  0.
13  :  0.
14  :  0.
15  :  0.
16  :  0.
17  :  1.
18  :  0.
19  :  0.
20  :  0.
21  :  0.
22  :  0.
23  :  1
24  :  0.
25  :  1.
26  :  1.
27  :  0.
28  :  0.
29  :  0.


In [499]:
dictx

{0: '1',
 1: '1',
 2: '0',
 3: '1',
 4: '0',
 5: '0',
 6: '0',
 7: '1',
 8: '1',
 9: '1',
 10: '0',
 11: '1',
 12: '0',
 13: '0',
 14: '0',
 15: '0',
 16: '0',
 17: '1',
 18: '0',
 19: '0',
 20: '0',
 21: '0',
 22: '0',
 23: '0',
 24: '0',
 25: '0',
 26: '1',
 27: '0',
 28: '0',
 29: '0'}

In [508]:
llm_mixtral_T =  HuggingFaceEndpoint(repo_id='mistralai/Mixtral-8x7B-Instruct-v0.1', huggingfacehub_api_token=huggingfacehub_api_token, max_new_tokens=30000, temperature=0.01)
q1_classify_mixtral = {}

for i in range(30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q1_summarized_contexts.context)[i], questions[0]))
    q1_classify_mixtral.update({i: response})
    print(i, " : ", response)
    time.sleep(10)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful
0  :  1
1  :  1
2  :  0
3  :  1
4  :  0
5  :  0
6  :  0
7  :  1
8  :  1
                    """
9  :  1
10  :  0
11  :  1
12  :  0
13  :  0
14  :  0
15  :  0
16  :  0
17  :  1
18  :  0
19  :  0
20  :  0
21  :  0
22  :  0
23  :  0
24  :  0
25  :  0
26  :  1
27  :  0
28  :  0
29  :  0


In [490]:
for i in range(29, 30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q1_summarized_contexts.context)[i], questions[0]))
    q1_classify_mixtral_T.append({i:response})
    print(i, " : ", response)
    time.sleep(10)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/hraj/.cache/huggingface/token
Login successful
29  :  0


In [341]:
for i in range(20):
    response = llm_mistral.invoke(classification_prompt(q3_irr_summary_mix[i], questions[2]))
    print(i, " : ", response)

0  :   Relevant.
               The question asks about employees' feelings, and the context provides detailed information about employees' feelings towards their work environment, including their perception of being understaffed.
1  :   Relevant or Irrelevant: irrelevent.
               The context does not provide any information related to the question about employees feeling understaffed. It is focused on a lost/stolen gift card issue and potential abuse of discounts.
2  :   Answer: irrelevent. The context does not provide any information related to the question about employees feeling understaffed.
3  :   Relevant: irrelevent
               The context does not provide any information related to the question about employees feeling understaffed at Best Buy.
4  :   """

Answer: irrelevent. The context does not provide any information related to employees feeling understaffed.
5  :   Relevant: irrelevent.
               The context does not provide any information about the overall 

In [344]:
for i in range(10):
    response = llm_mistral.invoke(classification_prompt(q3_summary_mix[i], questions[2]))
    print(i, " : ", response)

0  :   Answer: relevant.
               The context mentions "insufficient staffing levels" which directly relates to the question about employees feeling understaffed.
1  :   ---

Relevant. The context directly addresses the issue of understaffing and its impact on employees at Best Buy. The employees' statements about long wait times for customers, inability to take breaks, and the company's policy of not scheduling enough staff for specialized departments all point to a clear issue of understaffing. The context also provides evidence of the negative consequences of understaffing, such as decreased morale and mental health, and suggests potential solutions to the problem.
2  :   Relevant.
               The question asks about the feelings of employees and the context provides information about their dissatisfaction and frustration, which can include feeling understaffed.
3  :   Answer: irrelevant.
                Explanation: The context discusses various issues related to job dissa

In [None]:
Relevant among irrelevants = {0, 7, 12?, 14}

Irrelevant among relevants = {3, 5?, 6}

#### Making the test dataset for Q1

In [533]:
import numpy as np

In [579]:
q1_final_dataset = q1_summarized_contexts.copy()

In [580]:
q1_final_dataset.insert(2, 'question', [questions[0]]*30)
q1_final_dataset

Unnamed: 0,reddit_id,context,question
0,t3_199f5h8,\nThe conversation is about tips and advice fo...,What do Best Buy employees think of the company?
1,t3_wapsy3,\nThe conversation revolves around a potential...,What do Best Buy employees think of the company?
2,t3_1aucdoj,\nThe conversation revolves around a seasonal ...,What do Best Buy employees think of the company?
3,t3_yuacvb,\nThe conversation is about getting a job at B...,What do Best Buy employees think of the company?
4,t3_12scwfu,\nThe conversation is about the potential cutt...,What do Best Buy employees think of the company?
5,t3_j5vbke,Summary: \n\nThe conversation is about a form...,What do Best Buy employees think of the company?
6,t3_14i1ppn,\nThe conversation revolves around an individu...,What do Best Buy employees think of the company?
7,t3_ywjf2p,"\nIn this conversation, an individual who has ...",What do Best Buy employees think of the company?
8,t3_16f8ctj,"""""""\n This conversation revolv...",What do Best Buy employees think of the company?
9,t3_134bzme,\nSummary: The conversation revolves around th...,What do Best Buy employees think of the company?


In [551]:
# q1_classify_mixtral[8] = '1'
q1_classify_mistral

{0: '1.',
 1: '1.',
 2: '0.',
 3: '1.',
 4: '0.',
 5: '0.',
 6: '1.',
 7: '1.',
 8: '0.',
 9: '1.',
 10: '0.',
 11: '1.',
 12: '0.',
 13: '0.',
 14: '0.',
 15: '0.',
 16: '0.',
 17: '1.',
 18: '0.',
 19: '0.',
 20: '0.',
 21: '0.',
 22: '0.',
 23: '1',
 24: '0.',
 25: '1.',
 26: '1.',
 27: '0.',
 28: '0.',
 29: '0.'}

In [553]:
# Convert Mixtral's string responses ('0'/'1') to integers
raw_labels = list(q1_classify_mixtral.values())
q1_classify_mixtral_1 = [int(ele) for ele in raw_labels]

In [556]:
# Mistral's responses mostly carry a trailing period ('1.'), so parse them as floats
raw_labels = list(q1_classify_mistral.values())
q1_classify_mistral_1 = [float(ele) for ele in raw_labels]

In [557]:
q1_classify_mistral_1

[1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 1.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0]

In [554]:
q1_classify_mixtral_1

[1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0]

In [581]:
q1_final_dataset = q1_final_dataset.assign(**{'mistral' : q1_classify_mistral_1, 'mixtral' : q1_classify_mixtral_1})

In [586]:
list(q1_final_dataset.mistral).count(1)

11

In [588]:
list(q1_final_dataset.mixtral).count(1)

9
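The counts (11 versus 9) do not show which rows the two models disagree on. A short sketch that compares the two label lists printed above element by element; `agreement_report` is a hypothetical helper, not part of the original notebook:

```python
def agreement_report(labels_a, labels_b):
    """Return (agreement rate, indices where the two label lists differ)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Label lists must have the same length")
    disagreements = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
    rate = 1 - len(disagreements) / len(labels_a)
    return rate, disagreements

# Question 1 labels copied from q1_classify_mistral_1 and q1_classify_mixtral_1 above
mistral_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,
                  0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
mixtral_labels = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
                  0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

rate, disagreements = agreement_report(mistral_labels, mixtral_labels)
print(rate, disagreements)
```

For Question 1 the two models differ on rows 6, 8, 23 and 25, i.e. they agree on 26 of the 30 contexts.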

## Question 2

#### Marking relevancy using LLM

In [510]:
df1 = pd.DataFrame({'reddit_id': get_ids(q2_dataset_relevant), 'context': q2_summary_mix})
df2 = pd.DataFrame({'reddit_id': get_ids(q2_dataset_not_relevant), 'context': q2_irr_summary_mix})
q2_summarized_contexts = pd.concat([df1, df2]).reset_index(drop=True)

In [513]:
q2_classify_mistral = {}

for i in range(30):
    response = llm_mistral.invoke(classification_prompt(list(q2_summarized_contexts.context)[i], questions[1]))
    q2_classify_mistral.update({i: response})
    print(i, " : ", response)

0  :  0.
1  :  0.
2  :  1.
3  :  1.
4  :  0.
5  :  1
6  :  1
7  :  1.
8  :  1.
9  :  1.
10  :  1.
                    The context mentions that there is no job security in retail and the ability to adapt is crucial, which could be relevant to the reasons why employees leave Best Buy.
11  :  0.
12  :  1.
13  :  0.
14  :  0.
15  :  0.
16  :  0.
17  :  0.
18  :  0.
19  :  1.
20  :  1.
21  :  0.
22  :  1.
23  :  1.
24  :  1.
25  :  0.
26  :  0.
27  :  0.
28  :  0.
29  :  0.


In [514]:
q2_classify_mixtral = {}

for i in range(30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q2_summarized_contexts.context)[i], questions[1]))
    q2_classify_mixtral.update({i: response})
    print(i, " : ", response)
    time.sleep(10)

0  :  0
1  :  0
2  :  1
3  :  1
4  :  0
5  :  0
6  :  0
7  :  0
8  :  0
9  :  0
10  :  1
11  :  0
12  :  1
13  :  0
14  :  0
15  :  0


HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1 (Request ID: QvZvAwRGJHr5tFyFVZ5Wc)

Model is overloaded
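The 429 error above means the hosted endpoint rejected the request because of rate limiting or load. One common way to cope, sketched here under assumptions, is to retry with exponential backoff instead of rerunning the cell by hand. `invoke_with_retry` is a hypothetical helper; it catches any `Exception` for simplicity, whereas real code would probably catch only the HTTP error type raised by the client. `FlakyEndpoint` is a stand-in used only to exercise the helper.

```python
import time

def invoke_with_retry(call, prompt, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `call(prompt)`, retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return call(prompt)
        except Exception:
            # Give up after the last allowed attempt
            if attempt == max_retries - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            sleep(base_delay * 2 ** attempt)

# Stand-in for an endpoint that fails a few times before answering (illustration only)
class FlakyEndpoint:
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def invoke(self, prompt):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("429 Too Many Requests")
        return "0"

endpoint = FlakyEndpoint(failures=2)
waits = []  # record the requested delays instead of actually sleeping
result = invoke_with_retry(endpoint.invoke, "some prompt", base_delay=10.0, sleep=waits.append)
print(result, endpoint.calls, waits)
```

In this notebook the helper could wrap the existing call, for example `invoke_with_retry(llm_mixtral_T.invoke, prompt)`, so that a loop resumes after a transient 429 rather than stopping partway.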

In [516]:
for i in range(29, 30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q2_summarized_contexts.context)[i], questions[1]))
    q2_classify_mixtral.update({i: response})
    print(i, " : ", response)
    time.sleep(10)

29  :  0


#### Making the test dataset for Q2

In [590]:
q2_final_dataset = q2_summarized_contexts.copy()
q2_final_dataset.insert(2, 'question', [questions[1]]*30)

In [594]:
q2_classify_mistral[10] = '1'
q2_classify_mistral

{0: '0.',
 1: '0.',
 2: '1.',
 3: '1.',
 4: '0.',
 5: '1',
 6: '1',
 7: '1.',
 8: '1.',
 9: '1.',
 10: '1',
 11: '0.',
 12: '1.',
 13: '0.',
 14: '0.',
 15: '0.',
 16: '0.',
 17: '0.',
 18: '0.',
 19: '1.',
 20: '1.',
 21: '0.',
 22: '1.',
 23: '1.',
 24: '1.',
 25: '0.',
 26: '0.',
 27: '0.',
 28: '0.',
 29: '0.'}

In [595]:
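# Mixtral answers with a bare digit; Mistral usually appends a period ('1.'), hence float()
raw_labels = list(q2_classify_mixtral.values())
q2_classify_mixtral_1 = [int(ele) for ele in raw_labels]
raw_labels = list(q2_classify_mistral.values())
q2_classify_mistral_1 = [float(ele) for ele in raw_labels]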

In [596]:
q2_final_dataset = q2_final_dataset.assign(**{'mistral' : q2_classify_mistral_1, 'mixtral' : q2_classify_mixtral_1})

In [597]:
q2_final_dataset

Unnamed: 0,reddit_id,context,question,mistral,mixtral
0,t3_yy085z,\nThe conversation revolves around the topic o...,What are the most common reasons for employees...,0.0,0
1,t3_1580byk,The conversation revolves around the incentiv...,What are the most common reasons for employees...,0.0,0
2,t3_qhb9v1,"\nIn this conversation, the participants discu...",What are the most common reasons for employees...,1.0,1
3,t3_1499vuh,\nThe conversation is about the idea of unioni...,What are the most common reasons for employees...,1.0,1
4,t3_wtr1q0,\nThe conversation revolves around a prankster...,What are the most common reasons for employees...,0.0,0
5,t3_18rkcrr,The conversation revolves around an individua...,What are the most common reasons for employees...,1.0,0
6,t3_qpi3rj,\n---\n\nThe person is anxious about an upcomi...,What are the most common reasons for employees...,1.0,0
7,t3_1aj1kcj,\nThe conversation revolves around the issue o...,What are the most common reasons for employees...,1.0,0
8,t3_11dxaol,\nThe conversation is about a person who is ne...,What are the most common reasons for employees...,1.0,0
9,t3_18ttsxk,\nThe conversation is about an individual shar...,What are the most common reasons for employees...,1.0,0


In [606]:
questions[1]

'What are the most common reasons for employees to leave Best Buy?'

In [612]:
print(q2_final_dataset.iloc[5].context)

 The conversation revolves around an individual who fell sick with a 102-degree fever and was denied sick leave by their supervisor. They tried to manage the situation with medication but ended up fainting at work. The manager allowed them to leave only after a customer intervened, accusing the company of child labor. The individual is now scheduled for 8-hour shifts until the next Saturday and is contemplating calling in sick every day.

                The conversation includes advice from various perspectives. Some suggest quitting the job due to the manager's insensitivity and the company's policies. Others recommend learning about company policies and understanding employee rights. A common theme is the importance of health over job, with suggestions to document everything if there's any backlash. Some also mention that it's against Best Buy policy to ask for a doctor's note, and it's not supposed to be accepted even if given.

                There are also comments about the ina

In [610]:
print(q2_summarized_contexts.iloc[5].context)

 The conversation revolves around an individual who fell sick with a 102-degree fever and was denied sick leave by their supervisor. They tried to manage the situation with medication but ended up fainting at work. The manager allowed them to leave only after a customer intervened, accusing the company of child labor. The individual is now scheduled for 8-hour shifts until the next Saturday and is contemplating calling in sick every day.

                The conversation includes advice from various perspectives. Some suggest quitting the job due to the manager's insensitivity and the company's policies. Others recommend learning about company policies and understanding employee rights. A common theme is the importance of health over job, with suggestions to document everything if there's any backlash. Some also mention that it's against Best Buy policy to ask for a doctor's note, and it's not supposed to be accepted even if given.

                There are also comments about the ina

In [598]:
print(list(q2_final_dataset.mistral).count(1))
print(list(q2_final_dataset.mixtral).count(1))

14
5


## Question 3

#### Marking relevancy using LLM

In [517]:
df1 = pd.DataFrame({'reddit_id': get_ids(q3_dataset_relevant), 'context': q3_summary_mix})
df2 = pd.DataFrame({'reddit_id': get_ids(q3_dataset_not_relevant), 'context': q3_irr_summary_mix})
q3_summarized_contexts = pd.concat([df1, df2]).reset_index(drop=True)

In [518]:
q3_classify_mistral = {}

for i in range(30):
    response = llm_mistral.invoke(classification_prompt(list(q3_summarized_contexts.context)[i], questions[2]))
    q3_classify_mistral.update({i: response})
    print(i, " : ", response)

0  :  1.
1  :  1
2  :  1.
3  :  0.
4  :  1
5  :  0.
6  :  1.
7  :  1.
8  :  1.
9  :  1.
10  :  1.
11  :  0.
12  :  0.
13  :  0.
14  :  0.
15  :  0.
16  :  0.
17  :  1.
18  :  0.
19  :  0.
20  :  0.
21  :  0.
22  :  0.
23  :  0.
24  :  1.
25  :  0.
26  :  0.
27  :  0.
28  :  0.
29  :  0.


In [519]:
q3_classify_mixtral = {}

for i in range(30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q3_summarized_contexts.context)[i], questions[2]))
    q3_classify_mixtral.update({i: response})
    print(i, " : ", response)
    time.sleep(10)

0  :  1
1  :  1
2  :  1
3  :  0
4  :  1
5  :  0
6  :  1
7  :  1
8  :  1
9  :  1
10  :  1
11  :  0
12  :  0
13  :  0
14  :  0
15  :  0
16  :  0
17  :  1
18  :  0
19  :  0
20  :  0


HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://api-inference.huggingface.co/models/mistralai/Mixtral-8x7B-Instruct-v0.1 (Request ID: vU7KO3B59xMdubHgwz5QR)

Model is overloaded

In [520]:
for i in range(21, 30):
    response = llm_mixtral_T.invoke(classification_prompt(list(q3_summarized_contexts.context)[i], questions[2]))
    q3_classify_mixtral.update({i: response})
    print(i, " : ", response)
    time.sleep(10)

21  :  0
22  :  0
23  :  0
24  :  1
25  :  0
26  :  0
27  :  0
28  :  0
29  :  0


#### Making the test dataset for Q3

In [599]:
q3_final_dataset = q3_summarized_contexts.copy()
q3_final_dataset.insert(2, 'question', [questions[2]]*30)

In [601]:
q3_classify_mixtral

{0: '1',
 1: '1',
 2: '1',
 3: '0',
 4: '1',
 5: '0',
 6: '1',
 7: '1',
 8: '1',
 9: '1',
 10: '1',
 11: '0',
 12: '0',
 13: '0',
 14: '0',
 15: '0',
 16: '0',
 17: '1',
 18: '0',
 19: '0',
 20: '0',
 21: '0',
 22: '0',
 23: '0',
 24: '1',
 25: '0',
 26: '0',
 27: '0',
 28: '0',
 29: '0'}

In [602]:
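# Mixtral answers with a bare digit; Mistral usually appends a period ('1.'), hence float()
raw_labels = list(q3_classify_mixtral.values())
q3_classify_mixtral_1 = [int(ele) for ele in raw_labels]
raw_labels = list(q3_classify_mistral.values())
q3_classify_mistral_1 = [float(ele) for ele in raw_labels]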

In [603]:
q3_final_dataset = q3_final_dataset.assign(**{'mistral' : q3_classify_mistral_1, 'mixtral' : q3_classify_mixtral_1})

In [604]:
q3_final_dataset

Unnamed: 0,reddit_id,context,question,mistral,mixtral
0,t3_v5thte,\nThe conversation revolves around the desire ...,Do employees feel understaffed?,1.0,1
1,t3_sygsv3,\n---\n\nThe conversation revolves around the ...,Do employees feel understaffed?,1.0,1
2,t3_16ha775,\nThe conversation revolves around the difficu...,Do employees feel understaffed?,1.0,1
3,t3_16b9xbu,\nThe conversation revolves around the theme o...,Do employees feel understaffed?,0.0,0
4,t3_136z3ru,The conversation titled 'Am cuts' revolves ar...,Do employees feel understaffed?,1.0,1
5,t3_zzj0lf,Summary: The individual is seeking informatio...,Do employees feel understaffed?,0.0,0
6,t3_r4f01w,\nThe conversation revolves around the experie...,Do employees feel understaffed?,1.0,1
7,t3_1axres4,\nThe conversation revolves around the frustra...,Do employees feel understaffed?,1.0,1
8,t3_18rrnnn,\nThe conversation revolves around the frustra...,Do employees feel understaffed?,1.0,1
9,t3_qo85m9,\nThe conversation revolves around the partici...,Do employees feel understaffed?,1.0,1


In [605]:
print(list(q3_final_dataset.mistral).count(1))
print(list(q3_final_dataset.mixtral).count(1))

11
11


For Question 3, both LLMs agree: they assign the same label to all 30 contexts, marking 11 of them as relevant.

## Saving all the generated test set

In [614]:
q1_final_dataset.to_csv("q1_final_dataset.csv")
q2_final_dataset.to_csv("q2_final_dataset.csv")
q3_final_dataset.to_csv("q3_final_dataset.csv")

In [615]:
!pwd

/Users/hraj/Documents/Erdos/aware-nlp-local-copy/notebooks/HimanshuNotebooks/CreatingTestDataset


In [1083]:
trf.head()

Unnamed: 0,reddit_id,context,Question 1,Question 2,Question 3
0,t3_199f5h8,\nThe conversation is about tips and advice fo...,1,1,0
1,t3_wapsy3,\nThe conversation revolves around a potential...,1,1,1
2,t3_1aucdoj,\nThe conversation revolves around a seasonal ...,0,0,0
3,t3_yuacvb,\nThe conversation is about getting a job at B...,1,0,1
4,t3_12scwfu,\nThe conversation is about the potential cutt...,0,0,0


In [950]:
trf.to_csv("all_questions_final_dataset.csv")

In [956]:
trf.iloc[86].context

'\nThe conversation revolves around potential job cuts in the leadership roles of a company. The roles that might be affected include ASMs, SSMs, GSMs, and Sups. It is advised to use PTO as soon as possible. There are rumors of new positions like pay a services experience manager and a services experience supervisor. The company might be shifting towards a service center model. The conversation also mentions that PTO might not be paid out in some states upon termination. The roles of Ops managers and C&D managers are also being assessed.\n\nKey highlights:\n1. Potential job cuts in leadership roles.\n2. Advised to use PTO asap.\n3. Rumors of new positions.\n4. Company might be shifting towards a service center model.\n5. PTO might not be paid out in some states.\n6. Roles of Ops managers and C&D managers are also being assessed.'

In [957]:
thread = construct_thread.ConsrtuctThread(df_subreddit , 't3_s7vrdv')

In [958]:
thread.get_thread()

Unnamed: 0,aware_post_type,aware_created_ts,reddit_id,reddit_name,reddit_created_utc,reddit_author,reddit_text,reddit_permalink,reddit_title,reddit_url,reddit_subreddit,reddit_link_id,reddit_parent_id,reddit_submission
967,submission,2022-01-19T12:44:01,s7vrdv,t3_s7vrdv,1642614241,bbythrowaway8675309,,/r/BestBuyWorkers/comments/s7vrdv/_/,☠☠☠,https://i.redd.it/2w5nts6tloc81.jpg,BestBuyWorkers,,,
968,comment,2022-01-19T15:10:55,htd27qo,t1_htd27qo,1642623055,MattB6x,Sounds like when we had AP and people coming i...,/r/BestBuyWorkers/comments/s7vrdv/_/htd27qo/,,,BestBuyWorkers,t3_s7vrdv,t3_s7vrdv,s7vrdv
969,comment,2022-03-09T16:13:15,i00timy,t1_i00timy,1646860395,Hefty-Market-8845,😂,/r/BestBuyWorkers/comments/s7vrdv/_/i00timy/,,,BestBuyWorkers,t3_s7vrdv,t3_s7vrdv,s7vrdv
970,comment,2022-04-18T23:03:58,i5an2fh,t1_i5an2fh,1650337438,ksuhistory,isnt that your job?,/r/BestBuyWorkers/comments/s7vrdv/_/i5an2fh/,,,BestBuyWorkers,t3_s7vrdv,t3_s7vrdv,s7vrdv


# Building a RAG

In [617]:
from langchain_community.document_loaders import DataFrameLoader

In [618]:
loader = DataFrameLoader(q3_final_dataset, page_content_column='context' )
q3_documents = loader.load()

In [619]:
q3_documents[7]

Document(page_content='\nThe conversation revolves around the frustration of an individual who feels they are being asked to perform tasks outside of their job description, specifically selling products and pushing memberships in their micro market. They express their desire to focus on their job and go home, and criticize the lack of support from sales advisors in helping with tasks such as shuttle twice a week. The conversation also touches on the upcoming change of vendors bringing in their own staff to sell directly in pilot markets, and the expectation of this by some participants. There is also mention of the increased workload and expectations placed on everyone in the store, including product flow, front lanes, and even those in leadership roles. Some participants express their disagreement with these practices and suggest hiring more sales advisors and showing data to management to prove a point.', metadata={'reddit_id': 't3_1axres4', 'question': 'Do employees feel understaffe

In [620]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [629]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100)

q3_documents_splitted = text_splitter.split_documents(q3_documents)

In [631]:
len(q3_documents_splitted)

35

In [622]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.embeddings import OpenAIEmbeddings

In [None]:
OPENAI_API_KEY = getpass()

In [624]:
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [627]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [628]:
embeddings.dict

<bound method BaseModel.dict of OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x15d11c390>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x15cf30550>, model='text-embedding-3-large', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='sk-[REDACTED]', openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None)>

In [None]:
# embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2') 

## ChromaDB

### initial attempt

In [632]:
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=q3_documents_splitted, embedding=embeddings)
retriever = vectorstore.as_retriever()

In [747]:
del retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

In [748]:
retrive_documents = retriever.invoke(questions[2])
retrived_instances = list(set([docs.metadata['reddit_id'] for docs in retrive_documents]))

In [669]:
mixtral_relevant = list(q3_final_dataset[q3_final_dataset.mixtral==1].reddit_id)

In [670]:
mistral_relevant = list(q3_final_dataset[q3_final_dataset.mistral==1].reddit_id)

In [749]:
recall = len([item for item in retrived_instances if item in mistral_relevant])/len(mistral_relevant)
prec = len([item for item in retrived_instances if item in mistral_relevant])/len(retrived_instances)
f1_score = 2*prec*recall/(prec+recall)

print('prec : ', prec)
print('recall : ', recall)
print('f1 : ', f1_score)

prec :  0.8888888888888888
recall :  0.7272727272727273
f1 :  0.7999999999999999


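The number of retrieved chunks k has to be fixed in advance. When k is large compared with the number of relevant threads, precision is necessarily low even for a perfect retriever, so precision on its own is not very meaningful here; recall (and F1) are more informative.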
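To make the point about k concrete, here is a minimal sketch of the arithmetic. The helper names `precision_ceiling` and `prf` are hypothetical and not part of the notebook. The figure of 11 relevant threads comes from the counts printed for question 3, and the printed scores above are consistent with 8 hits among 9 unique threads retrieved, which is an inference from the numbers rather than something the notebook states.

```python
from fractions import Fraction


def precision_ceiling(num_retrieved: int, num_relevant: int) -> Fraction:
    """Best precision achievable when num_retrieved unique threads are returned."""
    if num_retrieved <= 0:
        raise ValueError("num_retrieved must be positive")
    return Fraction(min(num_retrieved, num_relevant), num_retrieved)


def prf(hits: int, num_retrieved: int, num_relevant: int):
    """Precision, recall and F1 as exact fractions, computed from raw counts."""
    precision = Fraction(hits, num_retrieved)
    recall = Fraction(hits, num_relevant)
    f1 = 2 * precision * recall / (precision + recall) if hits else Fraction(0)
    return precision, recall, f1


# With 11 relevant threads, the ceiling drops as soon as more than 11 are retrieved.
for n in (5, 11, 15, 30):
    print(n, float(precision_ceiling(n, 11)))

# Counts consistent with the scores printed above: 8 hits, 9 unique threads, 11 relevant.
precision, recall, f1 = prf(8, 9, 11)
print(float(precision), float(recall), float(f1))
```

Because chunks from the same thread are deduplicated by `reddit_id`, the number of unique threads returned can be smaller than k, which is why k = 10 can yield a denominator of 9.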

### A more systematic comparison

In [792]:
loader = DataFrameLoader(q1_final_dataset, page_content_column='context')
q1_documents = loader.load()
q1_documents_splitted = text_splitter.split_documents(q1_documents)

In [793]:
loader = DataFrameLoader(q2_final_dataset, page_content_column='context')
q2_documents = loader.load()
q2_documents_splitted = text_splitter.split_documents(q2_documents)

In [796]:
q1_vectorstore_chroma = Chroma.from_documents(documents=q1_documents_splitted, embedding=embeddings)

In [None]:
# create the open-source embedding function
# embeddings3 = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [802]:
q1_vectorstore_chroma_2 = Chroma.from_documents(documents=q3_documents_splitted, embedding=embeddings2)

InvalidDimensionException: Embedding dimension 768 does not match collection dimensionality 3072
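This error is consistent with both `Chroma.from_documents` calls writing to the same default collection: the first call stored 3072-dimensional OpenAI vectors, and the 768-dimensional MPNet vectors cannot be added to it. One likely fix is to give every embedding model its own collection through the `collection_name` argument. The sketch below only builds such names; `collection_name_for` is a hypothetical helper, and the character and length limits it enforces are assumptions about Chroma's naming rules rather than something taken from this notebook.

```python
import re


def collection_name_for(prefix: str, model_name: str) -> str:
    """Build a distinct collection name per embedding model (hypothetical helper)."""
    raw = f"{prefix}-{model_name}".lower()
    # Replace anything outside [a-z0-9_-] so model IDs such as "org/model" are safe.
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", raw).strip("-_")
    # Assumed limit of 63 characters; the name must still end with an alphanumeric.
    return cleaned[:63].rstrip("-_")


models = ("text-embedding-3-large", "all-mpnet-base-v1", "all-mpnet-base-v2")
names = {model: collection_name_for("q1", model) for model in models}
print(names)

# Intended use, not executed here because it needs chromadb and the notebook's variables:
# q1_vectorstore_chroma_2 = Chroma.from_documents(
#     documents=q1_documents_splitted,
#     embedding=embeddings2,
#     collection_name=names["all-mpnet-base-v2"],
# )
```

Keeping one collection per model also avoids silently mixing vectors from different models that happen to share a dimension, such as the two 768-dimensional MPNet variants.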

In [1251]:
embeddings1.dict

<bound method BaseModel.dict of HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='all-mpnet-base-v1', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)>

In [None]:
q1_vectorstore_chroma_1 = Chroma.from_documents(documents=q1_documents_splitted, embedding=embeddings1)
q1_vectorstore_chroma_2 = Chroma.from_documents(documents=q1_documents_splitted, embedding=embeddings2)

q2_vectorstore_chroma = Chroma.from_documents(documents=q2_documents_splitted, embedding=embeddings)
q2_vectorstore_chroma_1 = Chroma.from_documents(documents=q2_documents_splitted, embedding=embeddings1)
q2_vectorstore_chroma_2 = Chroma.from_documents(documents=q2_documents_splitted, embedding=embeddings2)

In [None]:
retriever = q1_vectorstore_chroma.as_retriever(search_kwargs={"k": 10})

## FAISS

In [751]:
from langchain_community.vectorstores import FAISS
vectorstore = FAISS.from_documents(documents=q3_documents_splitted, embedding=embeddings)
retriever = vectorstore.as_retriever()

In [752]:
del retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

In [753]:
retrive_documents = retriever.invoke(questions[2])
retrived_instances = list(set([docs.metadata['reddit_id'] for docs in retrive_documents]))

In [669]:
mixtral_relevant = list(q3_final_dataset[q3_final_dataset.mixtral==1].reddit_id)

In [670]:
mistral_relevant = list(q3_final_dataset[q3_final_dataset.mistral==1].reddit_id)

In [754]:
recall = len([item for item in retrived_instances if item in mistral_relevant])/len(mistral_relevant)
prec = len([item for item in retrived_instances if item in mistral_relevant])/len(retrived_instances)
f1_score = 2*prec*recall/(prec+recall)

print('prec : ', prec)
print('recall : ', recall)
print('f1 : ', f1_score)

prec :  0.8888888888888888
recall :  0.7272727272727273
f1 :  0.7999999999999999


In [805]:
q1_vectorstore_faiss = FAISS.from_documents(documents=q1_documents_splitted, embedding=embeddings)
q1_vectorstore_faiss_1 = FAISS.from_documents(documents=q1_documents_splitted, embedding=embeddings1)
q1_vectorstore_faiss_2 = FAISS.from_documents(documents=q1_documents_splitted, embedding=embeddings2)

q2_vectorstore_faiss = FAISS.from_documents(documents=q2_documents_splitted, embedding=embeddings)
q2_vectorstore_faiss_1 = FAISS.from_documents(documents=q2_documents_splitted, embedding=embeddings1)
q2_vectorstore_faiss_2 = FAISS.from_documents(documents=q2_documents_splitted, embedding=embeddings2)

In [808]:
retriever = q1_vectorstore_faiss.as_retriever(search_kwargs={"k": 10})

In [809]:
retrive_documents = retriever.invoke(questions[0])
evaluate(retrive_documents, q1_final_dataset, 'mistral')

prec :  0.875   recall :  0.6363636363636364  f1_score :  0.7368421052631579


In [810]:
evaluate(retrive_documents, q1_final_dataset, 'mixtral')

prec :  0.875   recall :  0.7777777777777778  f1_score :  0.823529411764706


In [811]:
retriever = q1_vectorstore_faiss_1.as_retriever(search_kwargs={"k": 10})
retrive_documents = retriever.invoke(questions[0])
evaluate(retrive_documents, q1_final_dataset, 'mistral')

prec :  0.625   recall :  0.45454545454545453  f1_score :  0.5263157894736842


In [813]:
evaluate(retrive_documents, q1_final_dataset, 'mixtral')

prec :  0.5   recall :  0.4444444444444444  f1_score :  0.47058823529411764


In [814]:
retriever = q1_vectorstore_faiss_2.as_retriever(search_kwargs={"k": 10})
retrive_documents = retriever.invoke(questions[0])
evaluate(retrive_documents, q1_final_dataset, 'mistral')

prec :  0.5555555555555556   recall :  0.45454545454545453  f1_score :  0.5


In [815]:
evaluate(retrive_documents, q1_final_dataset, 'mixtral')

prec :  0.6666666666666666   recall :  0.6666666666666666  f1_score :  0.6666666666666666


In [816]:
retriever = q2_vectorstore_faiss.as_retriever(search_kwargs={"k": 10})
retrive_documents = retriever.invoke(questions[1])
evaluate(retrive_documents, q2_final_dataset, 'mistral')

prec :  1.0   recall :  0.5  f1_score :  0.6666666666666666


In [817]:
evaluate(retrive_documents, q2_final_dataset, 'mixtral')

prec :  0.2857142857142857   recall :  0.4  f1_score :  0.3333333333333333


In [818]:
retriever = q2_vectorstore_faiss_1.as_retriever(search_kwargs={"k": 10})
retrive_documents = retriever.invoke(questions[1])
evaluate(retrive_documents, q2_final_dataset, 'mistral')

prec :  0.75   recall :  0.42857142857142855  f1_score :  0.5454545454545454


In [819]:
evaluate(retrive_documents, q2_final_dataset, 'mixtral')

prec :  0.125   recall :  0.2  f1_score :  0.15384615384615385


In [820]:
retriever = q2_vectorstore_faiss_2.as_retriever(search_kwargs={"k": 10})
retrive_documents = retriever.invoke(questions[1])
evaluate(retrive_documents, q2_final_dataset, 'mistral')

prec :  1.0   recall :  0.42857142857142855  f1_score :  0.6


In [821]:
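# Note: this repeats the 'mistral' evaluation from the previous cell; the 'mixtral' labels were probably intended here.
evaluate(retrive_documents, q2_final_dataset, 'mistral')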

prec :  1.0   recall :  0.42857142857142855  f1_score :  0.6


### Trying relatively weaker embedding models

In [759]:
from langchain_community.embeddings import HuggingFaceEmbeddings
# embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/bert-base-nli-mean-tokens')
embeddings1 = HuggingFaceEmbeddings(model_name='all-mpnet-base-v1')
embeddings2 = HuggingFaceEmbeddings(model_name='all-mpnet-base-v2')

In [783]:
vectorstore_FAISS_1 = FAISS.from_documents(documents=q3_documents_splitted, embedding=embeddings1)
vectorstore_FAISS_2 = FAISS.from_documents(documents=q3_documents_splitted, embedding=embeddings2)
retriever_1 = vectorstore_FAISS_1.as_retriever(search_kwargs={"k": 15})
retriever_2 = vectorstore_FAISS_2.as_retriever(search_kwargs={"k": 15})

In [786]:
retrive_documents = retriever_1.invoke(questions[2])
retrived_instances = list(set([docs.metadata['reddit_id'] for docs in retrive_documents]))
evaluate(retrive_documents, q3_final_dataset, 'mistral')

prec :  0.6923076923076923   recall :  0.8181818181818182  f1_score :  0.7500000000000001


In [787]:
evaluate(retrive_documents, q3_final_dataset, 'mixtral')

prec :  0.6923076923076923   recall :  0.8181818181818182  f1_score :  0.7500000000000001


In [1248]:
help(vectorstore_FAISS_1.as_retriever)

Help on method as_retriever in module langchain_core.vectorstores:

as_retriever(**kwargs: 'Any') -> 'VectorStoreRetriever' method of langchain_community.vectorstores.faiss.FAISS instance
    Return VectorStoreRetriever initialized from this VectorStore.
    
    Args:
        search_type (Optional[str]): Defines the type of search that
            the Retriever should perform.
            Can be "similarity" (default), "mmr", or
            "similarity_score_threshold".
        search_kwargs (Optional[Dict]): Keyword arguments to pass to the
            search function. Can include things like:
                k: Amount of documents to return (Default: 4)
                score_threshold: Minimum relevance threshold
                    for similarity_score_threshold
                fetch_k: Amount of documents to pass to MMR algorithm (Default: 20)
                lambda_mult: Diversity of results returned by MMR;
                    1 for minimum diversity and 0 for maximum. (Default:

# RAG evaluator

In [771]:
def evaluate(retrieved_docs, test_set, llm):
    """Print precision, recall and F1 of the retrieved threads against one LLM's labels."""
    # Ground truth: threads that the given LLM column ('mistral' or 'mixtral') marked as relevant
    relevant_ids = set(test_set[test_set[llm] == 1].reddit_id)
    # Several chunks can come from the same thread, so deduplicate by reddit_id
    retrieved_ids = {doc.metadata['reddit_id'] for doc in retrieved_docs}

    hits = len(retrieved_ids & relevant_ids)
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    prec = hits / len(retrieved_ids) if retrieved_ids else 0.0
    # Guard against 0/0 when nothing relevant was retrieved
    f1_score = 2 * prec * recall / (prec + recall) if hits else 0.0

    print('prec : ', prec, '  recall : ', recall, ' f1_score : ', f1_score)
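For side-by-side comparisons across vector stores and embedding models, it can help to return the scores instead of printing them. The following is a minimal sketch of that idea working on plain ID lists; `retrieval_scores` is a hypothetical name, and the thread IDs in the example are made up for illustration.

```python
def retrieval_scores(retrieved_ids, relevant_ids):
    """Return precision, recall and F1 over unique thread IDs as a dict."""
    retrieved = set(retrieved_ids)  # several chunks may share one thread ID
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up IDs: four retrieved chunks from three threads, four relevant threads.
retrieved_chunks = ["t3_a", "t3_a", "t3_b", "t3_c"]
relevant_threads = ["t3_a", "t3_b", "t3_d", "t3_e"]
scores = retrieval_scores(retrieved_chunks, relevant_threads)
print(scores)
```

Returning a dict makes it straightforward to collect one row per (question, vector store, embedding model, labelling LLM) combination and load the rows into a DataFrame for comparison.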
    

In [1085]:
questions

array(['What do Best Buy employees think of the company?',
       'What are the most common reasons for employees to leave Best Buy?',
       'Do employees feel understaffed?'], dtype=object)