### Transformers Project - MiniLM

The purpose of this notebook is to use MiniLM to embed each candidate (news article) and query (SASB industry topic), then compute cosine similarity scores between them for information retrieval.

Because we use MiniLM out of the box to get the embeddings, we only need the test file, test_df_cleaned_no_stopwords.csv, to compute the model metrics.

In [70]:
import pandas as pd
df = pd.read_csv("test_df_cleaned_no_stopwords.csv")


In [71]:
df

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,firm_or_not,human_label_sentiment,url,articleId,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text
0,New York Cements Itself as the Gold Mining Cap...,NEM,Metals & Mining,Newmont Corp,{'Tailings Storage Facilities Management': 'Th...,Minor,Major,Positive,"Community Relations, Business Ethics & Transpa...",No,No,No,https://www.newsmax.com/newsmax-tv/fitzgerald-...,c12355d81050473e89f4163372441061,Rep. Fitzgerald to Newsmax: DirecTV Dropping N...,Tailings Storage Facilities Management - The M...,new york cements itself as the gold mining cap...,tailings storage facilities management - the m...,new york cements gold mining capital world new...,tailings storage facilities management metals ...
1,"Shareholders v. Tesla, Nasdaq's diversity rule...",NDAQ,Security & Commodity Exchanges,Nasdaq Inc,{'Managing Conflicts of Interest': 'Security a...,Major,Major,Negative,"Managing Conflicts of Interest, Promoting Tran...",Major,Major,Positive,https://www.axios.com/pro/media-deals/2023/05/...,fcbd16768c584451912d7121a259ad9d,YouTube praises AI transformation at Brandcast,Managing Conflicts of Interest - Security and ...,"shareholders v. tesla, nasdaq's diversity rule...",managing conflicts of interest - security and ...,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...
2,"FedEx closing more locations, planning to furl...",FDX,Air Freight & Logistics,FedEx Corp,{'Greenhouse Gas Emissions': 'Air Freight & Lo...,Minor,Major,Negative,"Employee Health & Safety, Labour Practices, Su...",Minor,Major,Negative,https://www.theguardian.com/technology/2023/ju...,3cb0ea7cb1cb40608c1cfc1e172ebc3e,Nick Clegg defends release of open-source AI m...,Greenhouse Gas Emissions - Air Freight & Logis...,"fedex closing more locations, planning to furl...",greenhouse gas emissions - air freight & logis...,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...
3,Modelo Maker Profits From Bud Light‚Äö√Ñ√¥s De...,STZ,Alcoholic Beverages,Constellation Brands Inc A,{'Water Management': 'Water management include...,Minor,Minor,Positive,"Water Management, Packaging Lifecycle Manageme...",No,No,No,https://www.washingtonexaminer.com/restoring-a...,7b188eebdd7c42ed9ca51237d0989674,Conservative group targets Bank of America in ...,Water Management - Water management includes a...,modelo maker profits from bud light‚äö√ñ√¥s de...,water management - water management includes a...,modelo maker profits bud light‚äö√ñ√¥s decline...,water management water management includes ent...
4,Med tech investors paying up for patents - Med...,ILMN,Medical Equipment & Supplies,Illumina Inc,{'Product Safety': 'Information on product saf...,Minor,Major,Negative,Business Ethics,No,No,No,https://www.cleveland.com/business/2023/01/goo...,14b0ee5d771844c7838718faf0905545,"Google slashes 12,000 jobs to cope with shrink...",Product Safety - Information on product safety...,med tech investors paying up for patents - med...,product safety - information on product safety...,med tech investors paying patents med tech sta...,product safety information product safety effe...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1036,Lockheed Martin Stumbles on Supply Chain - WSJ...,LMT,Aerospace & Defence,Lockheed Martin,{'Product Safety': 'Product safety is an impor...,Major,Major,Negative,Materials Sourcing,Major,Major,Negative,https://www.wsj.com/articles/lockheed-martin-s...,427327b6dfec433aa34c28dcd842fb74,Lockheed Martin Stumbles on Supply Chain - WSJ,Product Safety - Product safety is an importan...,lockheed martin stumbles on supply chain - wsj...,product safety - product safety is an importan...,lockheed martin stumbles supply chain wsj dema...,product safety product safety important consid...
1037,Banks Rush To Borrow Record-Breaking $165 Bill...,BAC,Commercial Banks,Bank of America Corp,{'Factors in Credit Analysis': 'As financial i...,Major,Major,Negative,"Systemic Risk Management, Business Ethics",Major,Major,Negative,https://www.forbes.com/sites/nicholasreimann/2...,036ad564b4484076822ca227e4101c85,Banks Rush To Borrow Record-Breaking $165 Bill...,Factors in Credit Analysis - As financial inte...,banks rush to borrow record-breaking $165 bill...,factors in credit analysis - as financial inte...,banks rush borrow record breaking $ 165 billio...,factors credit analysis financial intermediari...
1038,Ohio train derailment: Norfolk Southern CEO sa...,NSC,Rail Transportation,Norfolk Southern Corp,{'Greenhouse Gas Emissions': 'The Rail Transpo...,Major,Major,Negative,"Accident & Safety Management, Employee Health ...",Major,Major,Negative,https://www.washingtonexaminer.com/news/ohio-t...,48278805ac6b4850b3692fa5b8f45a80,Ohio train derailment: Norfolk Southern CEO sa...,Greenhouse Gas Emissions - The Rail Transporta...,ohio train derailment: norfolk southern ceo sa...,greenhouse gas emissions - the rail transporta...,ohio train derailment norfolk southern ceo say...,greenhouse gas emissions rail transportation i...
1039,"AT&T, Verizon, and T-Mobile could avoid $200 m...",T,Telecommunication Services,AT&T Inc,{'Competitive Behaviour & Open Internet': 'The...,Major,Major,Negative,"Data Privacy, Data Security",Major,Major,Negative,https://www.theverge.com/2022/12/27/23527884/a...,31bec3ac87a6422eb00ee062128b162a,"AT&T, Verizon, and T-Mobile could avoid $200 m...",Competitive Behaviour & Open Internet - The Te...,"at&t, verizon, and t-mobile could avoid $200 m...",competitive behaviour & open internet - the te...,at&t verizon t mobile avoid $ 200 million fine...,competitive behaviour open internet telecommun...


In [72]:
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

In [73]:
# Function for mean pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
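
To illustrate what `mean_pooling` computes, here is a minimal sketch in plain Python (no torch) that averages token vectors while skipping padded positions. The helper name `masked_mean` and the toy values are assumptions for illustration only and are not part of the notebook's pipeline.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Same guard as torch.clamp(..., min=1e-9): avoid dividing by zero
    return [t / max(count, 1e-9) for t in total]


# Two real tokens followed by one padding token
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` does not affect the result, which is why the attention mask is passed alongside the model output.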


In [74]:
# Load tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')


In [75]:
# Function to encode text to embeddings
def encode_texts(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    return mean_pooling(model_output, encoded_input['attention_mask'])


In [76]:
# Work on a copy so the columns added below do not modify df
df_test = df.copy()

In [77]:
# Generate embeddings for both queries and candidates
query_embeddings = encode_texts(df_test['cw_sasb_query_text'].tolist())
candidate_embeddings = encode_texts(df_test['cw_text'].tolist())

In [78]:
# Normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
candidate_embeddings = F.normalize(candidate_embeddings, p=2, dim=1)

In [79]:
df_test['query_embedding'] = [emb.numpy().tolist() for emb in query_embeddings]
df_test['candidate_embedding'] = [emb.numpy().tolist() for emb in candidate_embeddings]

In [80]:
df_test

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,...,url,articleId,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text,query_embedding,candidate_embedding
0,New York Cements Itself as the Gold Mining Cap...,NEM,Metals & Mining,Newmont Corp,{'Tailings Storage Facilities Management': 'Th...,Minor,Major,Positive,"Community Relations, Business Ethics & Transpa...",No,...,https://www.newsmax.com/newsmax-tv/fitzgerald-...,c12355d81050473e89f4163372441061,Rep. Fitzgerald to Newsmax: DirecTV Dropping N...,Tailings Storage Facilities Management - The M...,new york cements itself as the gold mining cap...,tailings storage facilities management - the m...,new york cements gold mining capital world new...,tailings storage facilities management metals ...,"[0.03913487493991852, 0.004532219842076302, 0....","[0.05467555671930313, -0.11273455619812012, 0...."
1,"Shareholders v. Tesla, Nasdaq's diversity rule...",NDAQ,Security & Commodity Exchanges,Nasdaq Inc,{'Managing Conflicts of Interest': 'Security a...,Major,Major,Negative,"Managing Conflicts of Interest, Promoting Tran...",Major,...,https://www.axios.com/pro/media-deals/2023/05/...,fcbd16768c584451912d7121a259ad9d,YouTube praises AI transformation at Brandcast,Managing Conflicts of Interest - Security and ...,"shareholders v. tesla, nasdaq's diversity rule...",managing conflicts of interest - security and ...,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...,"[0.011969239450991154, -0.05236460268497467, -...","[-0.025900591164827347, -0.028664009645581245,..."
2,"FedEx closing more locations, planning to furl...",FDX,Air Freight & Logistics,FedEx Corp,{'Greenhouse Gas Emissions': 'Air Freight & Lo...,Minor,Major,Negative,"Employee Health & Safety, Labour Practices, Su...",Minor,...,https://www.theguardian.com/technology/2023/ju...,3cb0ea7cb1cb40608c1cfc1e172ebc3e,Nick Clegg defends release of open-source AI m...,Greenhouse Gas Emissions - Air Freight & Logis...,"fedex closing more locations, planning to furl...",greenhouse gas emissions - air freight & logis...,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...,"[0.042068369686603546, -0.03888951241970062, 0...","[0.06030776724219322, -0.12659691274166107, 0...."
3,Modelo Maker Profits From Bud Light‚Äö√Ñ√¥s De...,STZ,Alcoholic Beverages,Constellation Brands Inc A,{'Water Management': 'Water management include...,Minor,Minor,Positive,"Water Management, Packaging Lifecycle Manageme...",No,...,https://www.washingtonexaminer.com/restoring-a...,7b188eebdd7c42ed9ca51237d0989674,Conservative group targets Bank of America in ...,Water Management - Water management includes a...,modelo maker profits from bud light‚äö√ñ√¥s de...,water management - water management includes a...,modelo maker profits bud light‚äö√ñ√¥s decline...,water management water management includes ent...,"[0.05901496112346649, 0.011376441456377506, -0...","[0.03617096692323685, -0.036672115325927734, -..."
4,Med tech investors paying up for patents - Med...,ILMN,Medical Equipment & Supplies,Illumina Inc,{'Product Safety': 'Information on product saf...,Minor,Major,Negative,Business Ethics,No,...,https://www.cleveland.com/business/2023/01/goo...,14b0ee5d771844c7838718faf0905545,"Google slashes 12,000 jobs to cope with shrink...",Product Safety - Information on product safety...,med tech investors paying up for patents - med...,product safety - information on product safety...,med tech investors paying patents med tech sta...,product safety information product safety effe...,"[0.008400478400290012, -0.03179089352488518, 0...","[0.0538497194647789, -0.08131517469882965, 0.0..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1036,Lockheed Martin Stumbles on Supply Chain - WSJ...,LMT,Aerospace & Defence,Lockheed Martin,{'Product Safety': 'Product safety is an impor...,Major,Major,Negative,Materials Sourcing,Major,...,https://www.wsj.com/articles/lockheed-martin-s...,427327b6dfec433aa34c28dcd842fb74,Lockheed Martin Stumbles on Supply Chain - WSJ,Product Safety - Product safety is an importan...,lockheed martin stumbles on supply chain - wsj...,product safety - product safety is an importan...,lockheed martin stumbles supply chain wsj dema...,product safety product safety important consid...,"[0.03700163587927818, -0.0315348319709301, -0....","[-0.09254308044910431, -0.04089051112532616, 0..."
1037,Banks Rush To Borrow Record-Breaking $165 Bill...,BAC,Commercial Banks,Bank of America Corp,{'Factors in Credit Analysis': 'As financial i...,Major,Major,Negative,"Systemic Risk Management, Business Ethics",Major,...,https://www.forbes.com/sites/nicholasreimann/2...,036ad564b4484076822ca227e4101c85,Banks Rush To Borrow Record-Breaking $165 Bill...,Factors in Credit Analysis - As financial inte...,banks rush to borrow record-breaking $165 bill...,factors in credit analysis - as financial inte...,banks rush borrow record breaking $ 165 billio...,factors credit analysis financial intermediari...,"[0.07744689285755157, -0.0477789081633091, -0....","[0.0825621709227562, -0.11504682898521423, -0...."
1038,Ohio train derailment: Norfolk Southern CEO sa...,NSC,Rail Transportation,Norfolk Southern Corp,{'Greenhouse Gas Emissions': 'The Rail Transpo...,Major,Major,Negative,"Accident & Safety Management, Employee Health ...",Major,...,https://www.washingtonexaminer.com/news/ohio-t...,48278805ac6b4850b3692fa5b8f45a80,Ohio train derailment: Norfolk Southern CEO sa...,Greenhouse Gas Emissions - The Rail Transporta...,ohio train derailment: norfolk southern ceo sa...,greenhouse gas emissions - the rail transporta...,ohio train derailment norfolk southern ceo say...,greenhouse gas emissions rail transportation i...,"[0.049445442855358124, -0.03318081423640251, 0...","[-0.00031995910103432834, -0.05394741520285606..."
1039,"AT&T, Verizon, and T-Mobile could avoid $200 m...",T,Telecommunication Services,AT&T Inc,{'Competitive Behaviour & Open Internet': 'The...,Major,Major,Negative,"Data Privacy, Data Security",Major,...,https://www.theverge.com/2022/12/27/23527884/a...,31bec3ac87a6422eb00ee062128b162a,"AT&T, Verizon, and T-Mobile could avoid $200 m...",Competitive Behaviour & Open Internet - The Te...,"at&t, verizon, and t-mobile could avoid $200 m...",competitive behaviour & open internet - the te...,at&t verizon t mobile avoid $ 200 million fine...,competitive behaviour open internet telecommun...,"[0.012098213657736778, -0.012996641919016838, ...","[-0.04518311470746994, -0.04662075266242027, 0..."


In [81]:
from sklearn.metrics.pairwise import cosine_similarity
# Compute cosine similarity
similarity_scores = cosine_similarity(candidate_embeddings, query_embeddings)

In [82]:
similarity_scores

array([[0.40214288, 0.42126387, 0.31923607, ..., 0.31424257, 0.27241468,
        0.24858631],
       [0.28479084, 0.5311348 , 0.3055724 , ..., 0.33699417, 0.32223922,
        0.32994804],
       [0.20681112, 0.3036417 , 0.38422555, ..., 0.27731782, 0.21124437,
        0.16820611],
       ...,
       [0.29766354, 0.30126536, 0.32963973, ..., 0.437491  , 0.1749703 ,
        0.2810679 ],
       [0.16222706, 0.28442267, 0.18814938, ..., 0.16412656, 0.26750275,
        0.16986234],
       [0.53071606, 0.37750873, 0.590927  , ..., 0.5144143 , 0.5996069 ,
        0.6496229 ]], dtype=float32)

In [83]:
import numpy as np
# Extract the diagonal elements from the similarity_scores matrix
# These elements correspond to the cosine similarity between each query and its corresponding candidate
diagonal_similarity_scores = np.diag(similarity_scores)

# Add these scores as a new column in df_test
df_test['cosine_similarity_miniLM'] = diagonal_similarity_scores
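
Only the diagonal of the similarity matrix is used here, that is, the score between row i's candidate and row i's query. As a small sketch in plain Python (toy vectors and the helper name `cosine` are assumptions for illustration), the same values can be obtained by scoring each pair directly, without building the full N x N matrix:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
queries = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]

# Full matrix: every candidate against every query, then keep the diagonal
full = [[cosine(c, q) for q in queries] for c in candidates]
diagonal = [full[i][i] for i in range(len(full))]

# Paired version: only candidate i against query i
paired = [cosine(c, q) for c, q in zip(candidates, queries)]

print(diagonal == paired)
```

For about a thousand rows the full matrix is cheap, so this is a design note rather than a required change. The off-diagonal entries would matter if each query were scored against every article instead of only its own row.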


In [84]:
df_test

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,...,articleId,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text,query_embedding,candidate_embedding,cosine_similarity_miniLM
0,New York Cements Itself as the Gold Mining Cap...,NEM,Metals & Mining,Newmont Corp,{'Tailings Storage Facilities Management': 'Th...,Minor,Major,Positive,"Community Relations, Business Ethics & Transpa...",No,...,c12355d81050473e89f4163372441061,Rep. Fitzgerald to Newsmax: DirecTV Dropping N...,Tailings Storage Facilities Management - The M...,new york cements itself as the gold mining cap...,tailings storage facilities management - the m...,new york cements gold mining capital world new...,tailings storage facilities management metals ...,"[0.03913487493991852, 0.004532219842076302, 0....","[0.05467555671930313, -0.11273455619812012, 0....",0.402143
1,"Shareholders v. Tesla, Nasdaq's diversity rule...",NDAQ,Security & Commodity Exchanges,Nasdaq Inc,{'Managing Conflicts of Interest': 'Security a...,Major,Major,Negative,"Managing Conflicts of Interest, Promoting Tran...",Major,...,fcbd16768c584451912d7121a259ad9d,YouTube praises AI transformation at Brandcast,Managing Conflicts of Interest - Security and ...,"shareholders v. tesla, nasdaq's diversity rule...",managing conflicts of interest - security and ...,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...,"[0.011969239450991154, -0.05236460268497467, -...","[-0.025900591164827347, -0.028664009645581245,...",0.531135
2,"FedEx closing more locations, planning to furl...",FDX,Air Freight & Logistics,FedEx Corp,{'Greenhouse Gas Emissions': 'Air Freight & Lo...,Minor,Major,Negative,"Employee Health & Safety, Labour Practices, Su...",Minor,...,3cb0ea7cb1cb40608c1cfc1e172ebc3e,Nick Clegg defends release of open-source AI m...,Greenhouse Gas Emissions - Air Freight & Logis...,"fedex closing more locations, planning to furl...",greenhouse gas emissions - air freight & logis...,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...,"[0.042068369686603546, -0.03888951241970062, 0...","[0.06030776724219322, -0.12659691274166107, 0....",0.384226
3,Modelo Maker Profits From Bud Light‚Äö√Ñ√¥s De...,STZ,Alcoholic Beverages,Constellation Brands Inc A,{'Water Management': 'Water management include...,Minor,Minor,Positive,"Water Management, Packaging Lifecycle Manageme...",No,...,7b188eebdd7c42ed9ca51237d0989674,Conservative group targets Bank of America in ...,Water Management - Water management includes a...,modelo maker profits from bud light‚äö√ñ√¥s de...,water management - water management includes a...,modelo maker profits bud light‚äö√ñ√¥s decline...,water management water management includes ent...,"[0.05901496112346649, 0.011376441456377506, -0...","[0.03617096692323685, -0.036672115325927734, -...",0.421684
4,Med tech investors paying up for patents - Med...,ILMN,Medical Equipment & Supplies,Illumina Inc,{'Product Safety': 'Information on product saf...,Minor,Major,Negative,Business Ethics,No,...,14b0ee5d771844c7838718faf0905545,"Google slashes 12,000 jobs to cope with shrink...",Product Safety - Information on product safety...,med tech investors paying up for patents - med...,product safety - information on product safety...,med tech investors paying patents med tech sta...,product safety information product safety effe...,"[0.008400478400290012, -0.03179089352488518, 0...","[0.0538497194647789, -0.08131517469882965, 0.0...",0.456808
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1036,Lockheed Martin Stumbles on Supply Chain - WSJ...,LMT,Aerospace & Defence,Lockheed Martin,{'Product Safety': 'Product safety is an impor...,Major,Major,Negative,Materials Sourcing,Major,...,427327b6dfec433aa34c28dcd842fb74,Lockheed Martin Stumbles on Supply Chain - WSJ,Product Safety - Product safety is an importan...,lockheed martin stumbles on supply chain - wsj...,product safety - product safety is an importan...,lockheed martin stumbles supply chain wsj dema...,product safety product safety important consid...,"[0.03700163587927818, -0.0315348319709301, -0....","[-0.09254308044910431, -0.04089051112532616, 0...",0.347897
1037,Banks Rush To Borrow Record-Breaking $165 Bill...,BAC,Commercial Banks,Bank of America Corp,{'Factors in Credit Analysis': 'As financial i...,Major,Major,Negative,"Systemic Risk Management, Business Ethics",Major,...,036ad564b4484076822ca227e4101c85,Banks Rush To Borrow Record-Breaking $165 Bill...,Factors in Credit Analysis - As financial inte...,banks rush to borrow record-breaking $165 bill...,factors in credit analysis - as financial inte...,banks rush borrow record breaking $ 165 billio...,factors credit analysis financial intermediari...,"[0.07744689285755157, -0.0477789081633091, -0....","[0.0825621709227562, -0.11504682898521423, -0....",0.549043
1038,Ohio train derailment: Norfolk Southern CEO sa...,NSC,Rail Transportation,Norfolk Southern Corp,{'Greenhouse Gas Emissions': 'The Rail Transpo...,Major,Major,Negative,"Accident & Safety Management, Employee Health ...",Major,...,48278805ac6b4850b3692fa5b8f45a80,Ohio train derailment: Norfolk Southern CEO sa...,Greenhouse Gas Emissions - The Rail Transporta...,ohio train derailment: norfolk southern ceo sa...,greenhouse gas emissions - the rail transporta...,ohio train derailment norfolk southern ceo say...,greenhouse gas emissions rail transportation i...,"[0.049445442855358124, -0.03318081423640251, 0...","[-0.00031995910103432834, -0.05394741520285606...",0.437491
1039,"AT&T, Verizon, and T-Mobile could avoid $200 m...",T,Telecommunication Services,AT&T Inc,{'Competitive Behaviour & Open Internet': 'The...,Major,Major,Negative,"Data Privacy, Data Security",Major,...,31bec3ac87a6422eb00ee062128b162a,"AT&T, Verizon, and T-Mobile could avoid $200 m...",Competitive Behaviour & Open Internet - The Te...,"at&t, verizon, and t-mobile could avoid $200 m...",competitive behaviour & open internet - the te...,at&t verizon t mobile avoid $ 200 million fine...,competitive behaviour open internet telecommun...,"[0.012098213657736778, -0.012996641919016838, ...","[-0.04518311470746994, -0.04662075266242027, 0...",0.267503


In [85]:
df_test.to_csv('miniLM_output.csv')  # Optionally, save to a new CSV file

# Evaluation

In this section, after some additional preparation of the model results, we compute the following metrics:
* Success at K - Whether at least one relevant ESG article appears in the top K positions of the model's ranking for a query.
* Mean Reciprocal Rank (MRR) - The reciprocal of the rank at which the first relevant ESG article appears, averaged over queries. It shows how well the model returns relevant items at high ranks: the closer the value is to 1, the better the system is at giving the right answers up front.
* Precision at K - The proportion of the top K retrieved documents that are relevant, calculated by dividing the number of relevant documents in the top K by K.
* Recall at K - The proportion of all relevant documents available that are retrieved in the top K positions.
* F1 Score at K - The harmonic mean of precision and recall at K, which combines them into a single metric and balances the trade-off between them so that neither is disproportionately favored.
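
The metric definitions above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's own implementation: it assumes each query's results are already sorted by similarity and encoded as a list of 0/1 relevance flags, and the function names and toy data are hypothetical.

```python
def success_at_k(relevance, k):
    """1.0 if any of the top-k results is relevant, else 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(relevance, k):
    """Relevant results in the top k, divided by k."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance, k):
    """Relevant results in the top k, divided by all relevant results."""
    total_relevant = sum(relevance)
    return sum(relevance[:k]) / total_relevant if total_relevant else 0.0


def f1_at_k(relevance, k):
    """Harmonic mean of precision@k and recall@k."""
    p = precision_at_k(relevance, k)
    r = recall_at_k(relevance, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy rankings: 1 = relevant article, 0 = not relevant, best score first
ranked = {
    "query_a": [0, 1, 1, 0, 1],
    "query_b": [1, 0, 0, 0, 0],
}
k = 3

# MRR is the mean of the per-query reciprocal ranks
mrr = sum(reciprocal_rank(r) for r in ranked.values()) / len(ranked)
print(f"MRR = {mrr:.2f}")  # MRR = 0.75

for name, rel in ranked.items():
    print(name, success_at_k(rel, k), precision_at_k(rel, k),
          recall_at_k(rel, k), f1_at_k(rel, k))
```

In this notebook the relevance flags would come from the mapped ESG labels within each `cw_sasb_query_text` group, and the per-query values would then be averaged across groups.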

In [86]:
df_test['GPT_ESG_or_not']

0       Minor
1       Major
2       Minor
3       Minor
4       Minor
        ...  
1036    Major
1037    Major
1038    Major
1039    Major
1040    Major
Name: GPT_ESG_or_not, Length: 1041, dtype: object

In [87]:
# Prep work for Success at K
# Sort articles by cosine similarity score for each cw_sasb_query_text group
top_sorted_df = df_test.groupby('cw_sasb_query_text', group_keys=False) \
                  .apply(lambda x: x.sort_values('cosine_similarity_miniLM', ascending=False))
top_sorted_df = top_sorted_df.reset_index(drop=True)
test_df_relevant = df_test[['cw_text', 'GPT_ESG_or_not']].drop_duplicates()
merged_df_final = pd.merge(top_sorted_df, test_df_relevant, on='cw_text', how='left')
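
One thing to check in a merge like this is whether the join key is unique in the lookup table. The toy sketch below (hypothetical data, not the notebook's) shows that when the same `cw_text` carries two different labels, `drop_duplicates()` keeps both rows and the left join returns more rows than the left frame had. This may be why the merged frame has a few more rows than `df_test`, but that is an assumption that should be verified against the data.

```python
import pandas as pd

# Toy frame: "article a" appears under two queries with different labels
left = pd.DataFrame({
    "cw_text": ["article a", "article a", "article b"],
    "cw_sasb_query_text": ["query 1", "query 2", "query 1"],
    "GPT_ESG_or_not": ["Major", "Minor", "No"],
})

# Same construction as test_df_relevant: unique (text, label) pairs
lookup = left[["cw_text", "GPT_ESG_or_not"]].drop_duplicates()

merged = pd.merge(left, lookup, on="cw_text", how="left")
print(len(left), len(lookup), len(merged))  # 3 3 5

# validate="many_to_one" makes pandas raise if the key repeats in `lookup`
try:
    pd.merge(left, lookup, on="cw_text", how="left", validate="many_to_one")
    key_is_unique = True
except pd.errors.MergeError:
    key_is_unique = False
print(key_is_unique)  # False
```

Because both frames contain `GPT_ESG_or_not`, pandas also adds the `_x` and `_y` suffixes seen in the merged output.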

In [88]:
merged_df_final

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not_x,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,...,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text,query_embedding,candidate_embedding,cosine_similarity_miniLM,GPT_ESG_or_not_y
0,Top Ad Firm Suggests ‚Äö√Ñ√≤Pause‚Äö√Ñ√¥ on Tw...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Major,Major,Negative,"Advertising Integrity, Data Privacy",Major,...,Democratic Senator Discussed Social Media Cens...,Advertising Integrity - Entities have a legal ...,top ad firm suggests ‚äö√ñ√≤pause‚äö√ñ√¥ on tw...,advertising integrity - entities have a legal ...,ad firm suggests äö√ñ√≤pause‚äö√ñ√¥ twitter ad...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.0016523862723261118, -0.07466538995504379, ...",0.589782,Major
1,Ad industry tries to quash proposed data broke...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Major,Major,Negative,"Advertising Integrity, Data Privacy",Major,...,Ron Howard and Brian Grazer‚Äôs Impact Closes ...,Advertising Integrity - Entities have a legal ...,ad industry tries to quash proposed data broke...,advertising integrity - entities have a legal ...,ad industry tries quash proposed data broker r...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[-0.004404451698064804, -0.10143864899873734, ...",0.574550,Major
2,Advertisers pull back from Twitter amid 'uncer...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Major,Minor,Negative,"Advertising Integrity, Data Privacy",Major,...,Researchers find the age people make their bes...,Advertising Integrity - Entities have a legal ...,advertisers pull back from twitter amid 'uncer...,advertising integrity - entities have a legal ...,advertisers pull twitter amid uncertainty new ...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.017361894249916077, -0.03441184014081955, 0...",0.540461,Major
3,Credera launches global cross-functional AI co...,OMC,Advertising & Marketing,Omnicom Group,{'Advertising Integrity': 'Entities have a leg...,Major,Major,Positive,"Business Model Resilience, Data Security",No,...,Apple AirPods for $99 (that's $30 off) ‚Äî plu...,Advertising Integrity - Entities have a legal ...,credera launches global cross-functional ai co...,advertising integrity - entities have a legal ...,credera launches global cross functional ai co...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.0028907866217195988, -0.09370002150535583, ...",0.516890,Major
4,Michael Solomon Named CEO of PHD USA - NEW YOR...,OMC,Advertising & Marketing,Omnicom Group,{'Advertising Integrity': 'Entities have a leg...,Minor,Major,Positive,Workforce Diversity & Inclusion,No,...,Advertisers return to Fox News primetime after...,Advertising Integrity - Entities have a legal ...,michael solomon named ceo of phd usa - new yor...,advertising integrity - entities have a legal ...,michael solomon named ceo phd usa new york 15 ...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.017328303307294846, -0.03523758426308632, -...",0.509162,Minor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1040,"""Violent"" incidents are on the rise at Target ...",TGT,Multiline and Specialty Retailers & Distributors,Target Corp,{'Workforce Diversity & Inclusion': 'The Multi...,Major,Major,Negative,"Labour Practices, Data Security",Major,...,Google's AI chatbot Bard is still being rushed,Workforce Diversity & Inclusion - The Multilin...,"""violent"" incidents are on the rise at target ...",workforce diversity & inclusion - the multilin...,violent incidents rise target stores costing m...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.015229632146656513, -0.0707889199256897, 0....",0.402097,Major
1041,Walmart earnings call foreshadows challenges a...,WMT,Multiline and Specialty Retailers & Distributors,Walmart Inc.,{'Workforce Diversity & Inclusion': 'The Multi...,Minor,Major,Neutral,"Workforce Diversity & Inclusion, Product Sourc...",No,...,"Amid an onslaught of tech layoffs, here are 12...",Workforce Diversity & Inclusion - The Multilin...,walmart earnings call foreshadows challenges a...,workforce diversity & inclusion - the multilin...,walmart earnings foreshadows challenges ahead ...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.0041684615425765514, -0.09523602575063705, ...",0.371712,Minor
1042,Lowe's Cos. expands Triad footprint with distr...,LOW,Multiline and Specialty Retailers & Distributors,Lowe's Cos Inc,{'Workforce Diversity & Inclusion': 'The Multi...,Minor,Major,Positive,Labour Practices,Major,...,Here are the companies that have laid off empl...,Workforce Diversity & Inclusion - The Multilin...,lowe's cos. expands triad footprint with distr...,workforce diversity & inclusion - the multilin...,lowe cos expands triad footprint distribution ...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.005951021332293749, -0.06923587620258331, -...",0.338688,Minor
1043,Dollar General cuts annual profit outlook on h...,DG,Multiline and Specialty Retailers & Distributors,Dollar General Corp,{'Workforce Diversity & Inclusion': 'The Multi...,Minor,Major,Negative,"Product Sourcing, Packaging & Marketing, Energ...",Minor,...,American Airlines employee killed in 'industri...,Workforce Diversity & Inclusion - The Multilin...,dollar general cuts annual profit outlook on h...,workforce diversity & inclusion - the multilin...,dollar general cuts annual profit outlook high...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[-0.033035244792699814, -0.020411347970366478,...",0.263484,Minor


In [89]:
# Mapping - applying the Minor and Major as Yes assumption
mapping = {'Minor': 'Yes', 'Major': 'Yes', 'No': 'No'}
merged_df_final['GPT_ESG_or_not_x'] = merged_df_final['GPT_ESG_or_not_x'].map(mapping)
# Adding in the ground truth labels - checking that it was done successfully
merged_df_final

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not_x,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,...,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text,query_embedding,candidate_embedding,cosine_similarity_miniLM,GPT_ESG_or_not_y
0,Top Ad Firm Suggests ‚Äö√Ñ√≤Pause‚Äö√Ñ√¥ on Tw...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Yes,Major,Negative,"Advertising Integrity, Data Privacy",Major,...,Democratic Senator Discussed Social Media Cens...,Advertising Integrity - Entities have a legal ...,top ad firm suggests ‚äö√ñ√≤pause‚äö√ñ√¥ on tw...,advertising integrity - entities have a legal ...,ad firm suggests äö√ñ√≤pause‚äö√ñ√¥ twitter ad...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.0016523862723261118, -0.07466538995504379, ...",0.589782,Major
1,Ad industry tries to quash proposed data broke...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Yes,Major,Negative,"Advertising Integrity, Data Privacy",Major,...,Ron Howard and Brian Grazer‚Äôs Impact Closes ...,Advertising Integrity - Entities have a legal ...,ad industry tries to quash proposed data broke...,advertising integrity - entities have a legal ...,ad industry tries quash proposed data broker r...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[-0.004404451698064804, -0.10143864899873734, ...",0.574550,Major
2,Advertisers pull back from Twitter amid 'uncer...,IPG,Advertising & Marketing,Interpublic Group Cos,{'Advertising Integrity': 'Entities have a leg...,Yes,Minor,Negative,"Advertising Integrity, Data Privacy",Major,...,Researchers find the age people make their bes...,Advertising Integrity - Entities have a legal ...,advertisers pull back from twitter amid 'uncer...,advertising integrity - entities have a legal ...,advertisers pull twitter amid uncertainty new ...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.017361894249916077, -0.03441184014081955, 0...",0.540461,Major
3,Credera launches global cross-functional AI co...,OMC,Advertising & Marketing,Omnicom Group,{'Advertising Integrity': 'Entities have a leg...,Yes,Major,Positive,"Business Model Resilience, Data Security",No,...,Apple AirPods for $99 (that's $30 off) ‚Äî plu...,Advertising Integrity - Entities have a legal ...,credera launches global cross-functional ai co...,advertising integrity - entities have a legal ...,credera launches global cross functional ai co...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.0028907866217195988, -0.09370002150535583, ...",0.516890,Major
4,Michael Solomon Named CEO of PHD USA - NEW YOR...,OMC,Advertising & Marketing,Omnicom Group,{'Advertising Integrity': 'Entities have a leg...,Yes,Major,Positive,Workforce Diversity & Inclusion,No,...,Advertisers return to Fox News primetime after...,Advertising Integrity - Entities have a legal ...,michael solomon named ceo of phd usa - new yor...,advertising integrity - entities have a legal ...,michael solomon named ceo phd usa new york 15 ...,advertising integrity entities legal responsib...,"[0.011685706675052643, -0.08869925141334534, -...","[0.017328303307294846, -0.03523758426308632, -...",0.509162,Minor
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1040,"""Violent"" incidents are on the rise at Target ...",TGT,Multiline and Specialty Retailers & Distributors,Target Corp,{'Workforce Diversity & Inclusion': 'The Multi...,Yes,Major,Negative,"Labour Practices, Data Security",Major,...,Google's AI chatbot Bard is still being rushed,Workforce Diversity & Inclusion - The Multilin...,"""violent"" incidents are on the rise at target ...",workforce diversity & inclusion - the multilin...,violent incidents rise target stores costing m...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.015229632146656513, -0.0707889199256897, 0....",0.402097,Major
1041,Walmart earnings call foreshadows challenges a...,WMT,Multiline and Specialty Retailers & Distributors,Walmart Inc.,{'Workforce Diversity & Inclusion': 'The Multi...,Yes,Major,Neutral,"Workforce Diversity & Inclusion, Product Sourc...",No,...,"Amid an onslaught of tech layoffs, here are 12...",Workforce Diversity & Inclusion - The Multilin...,walmart earnings call foreshadows challenges a...,workforce diversity & inclusion - the multilin...,walmart earnings foreshadows challenges ahead ...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.0041684615425765514, -0.09523602575063705, ...",0.371712,Minor
1042,Lowe's Cos. expands Triad footprint with distr...,LOW,Multiline and Specialty Retailers & Distributors,Lowe's Cos Inc,{'Workforce Diversity & Inclusion': 'The Multi...,Yes,Major,Positive,Labour Practices,Major,...,Here are the companies that have laid off empl...,Workforce Diversity & Inclusion - The Multilin...,lowe's cos. expands triad footprint with distr...,workforce diversity & inclusion - the multilin...,lowe cos expands triad footprint distribution ...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[0.005951021332293749, -0.06923587620258331, -...",0.338688,Minor
1043,Dollar General cuts annual profit outlook on h...,DG,Multiline and Specialty Retailers & Distributors,Dollar General Corp,{'Workforce Diversity & Inclusion': 'The Multi...,Yes,Major,Negative,"Product Sourcing, Packaging & Marketing, Energ...",Minor,...,American Airlines employee killed in 'industri...,Workforce Diversity & Inclusion - The Multilin...,dollar general cuts annual profit outlook on h...,workforce diversity & inclusion - the multilin...,dollar general cuts annual profit outlook high...,workforce diversity inclusion multiline specia...,"[0.044459979981184006, -0.03579284995794296, 0...","[-0.033035244792699814, -0.020411347970366478,...",0.263484,Minor
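
The mapping cell above uses `Series.map` with a dictionary. One caveat is that any label without a key in the dictionary becomes `NaN` rather than raising an error. A minimal sketch on a toy Series (the values, including `'Unknown'`, are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy labels, invented for illustration; 'Unknown' is deliberately absent from the mapping.
labels = pd.Series(['Minor', 'Major', 'No', 'Unknown'])
mapping = {'Minor': 'Yes', 'Major': 'Yes', 'No': 'No'}

mapped = labels.map(mapping)
# Values without a key in the dict come back as NaN, so count them before computing metrics.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist(), n_unmapped)
```

For the same reason, re-running the mapping cell on already-mapped data would turn every `'Yes'` into `NaN`, because `'Yes'` is not a key in `mapping`.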


In [90]:
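def calculate_success_at_k(merged_df, k):
    # Group by 'cw_sasb_query_text' so each query is evaluated separately
    grouped_df = merged_df.groupby('cw_sasb_query_text')
    hit_count = 0
    total_groups = len(grouped_df)
    for name, group in grouped_df:
        # Rank candidates by similarity so head(k) does not depend on the incoming row order
        top_k = group.sort_values('cosine_similarity_miniLM', ascending=False).head(k)
        # Count a hit if at least one of the top k candidates is relevant
        if 'Yes' in top_k['GPT_ESG_or_not_x'].values:
            hit_count += 1
    hit_rate = hit_count / total_groups
    return hit_rate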

In [97]:
# Compute Success@k (hit rate) on merged_df_final for k = 1 to 5
results = []
for k in range(1, 6):
    hit_rate = calculate_success_at_k(merged_df_final, k)
    # Store each result as a dictionary in the list
    results.append({'k': k, 'hit_rate': hit_rate})

# Convert the list of dictionaries to a DataFrame
success_k = pd.DataFrame(results)
# Display the results
print(success_k)


   k  hit_rate
0  1  0.869565
1  2  0.945652
2  3  0.956522
3  4  0.967391
4  5  0.978261
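
To make the hit-rate numbers above easier to interpret, here is a minimal sketch of Success@k on a hypothetical toy table. The column names `query`, `score`, and `relevant`, the helper `success_at_k`, and all values are assumptions made for illustration; they are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: two queries with three scored candidates each.
toy = pd.DataFrame({
    'query': ['q1', 'q1', 'q1', 'q2', 'q2', 'q2'],
    'score': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    'relevant': ['No', 'Yes', 'No', 'No', 'No', 'Yes'],
})

def success_at_k(df, k):
    # Fraction of queries with at least one 'Yes' among their k highest-scoring rows.
    ranked = df.sort_values('score', ascending=False)
    hits = ranked.groupby('query')['relevant'].apply(lambda s: (s.head(k) == 'Yes').any())
    return hits.mean()

for k in (1, 2, 3):
    print(k, success_at_k(toy, k))
```

In this toy case neither query has a relevant item at rank 1, only `q1` has one within the top 2, and both have one within the top 3, so the score can only stay level or rise as k increases.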


# MRR

In [92]:
def calculate_mrr(merged_df):
    # Group by 'cw_sasb_query_text' to process each query group separately
    grouped_df = merged_df.groupby('cw_sasb_query_text')
    total_queries = len(grouped_df)  # Total number of queries
    sum_reciprocal_rank = 0  # Initialize the sum of reciprocal ranks
    for name, group in grouped_df:
        # Sort each group just in case it's not sorted by relevance (similarity score)
        group = group.sort_values('cosine_similarity_miniLM', ascending=False)
        # Find the index (rank) of the first 'Yes' in the sorted group
        first_relevant_index = group['GPT_ESG_or_not_x'].eq('Yes').idxmax()
        if group.loc[first_relevant_index, 'GPT_ESG_or_not_x'] == 'Yes':
            rank = group.index.get_loc(first_relevant_index) + 1  # Get rank (1-based)
            sum_reciprocal_rank += 1 / rank  # Add the reciprocal of the rank to the sum
    mrr = sum_reciprocal_rank / total_queries  # Calculate the mean of the reciprocal ranks
    return mrr

In [93]:
# Call the function with your DataFrame and print the MRR
mrr_score = calculate_mrr(merged_df_final)
print(f"The Mean Reciprocal Rank (MRR) is: {mrr_score}")


The Mean Reciprocal Rank (MRR) is: 0.9170289855072463
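
As a sanity check on how this score is built, the sketch below computes reciprocal ranks by hand on a hypothetical toy table. The column names, the helper `reciprocal_rank`, and the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: the first relevant item is at rank 2 for q1 and rank 3 for q2.
toy = pd.DataFrame({
    'query': ['q1', 'q1', 'q1', 'q2', 'q2', 'q2'],
    'score': [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
    'relevant': ['No', 'Yes', 'No', 'No', 'No', 'Yes'],
})

def reciprocal_rank(group):
    # Rank rows by score, then return 1 / (1-based position of the first 'Yes'), or 0 if none.
    ranked = group.sort_values('score', ascending=False)['relevant'].tolist()
    return 1 / (ranked.index('Yes') + 1) if 'Yes' in ranked else 0.0

rr = [reciprocal_rank(g) for _, g in toy.groupby('query')]
mrr = sum(rr) / len(rr)
print(rr, mrr)
```

Here the reciprocal ranks are 1/2 and 1/3, so the mean is 5/12. By the same logic, an MRR close to 1 means the first relevant article is usually at or near the top of each query's ranking.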


In [101]:
def calculate_precision_recall_at_k_per_query(group, k):
    # Work on a copy so the caller's DataFrame is not modified
    group = group.copy()
    # Convert 'Yes'/'No' in 'GPT_ESG_or_not_x' to 1/0 for calculation
    group['is_correct'] = (group['GPT_ESG_or_not_x'] == 'Yes').astype(int)
    # Sort the group by cosine similarity in descending order and take the top K
    top_k = group.sort_values('cosine_similarity_miniLM', ascending=False).head(k)
    # Count how many of the top K are relevant
    correct_in_top_k = top_k['is_correct'].sum()
    # Precision at K: relevant items in the top K divided by K
    precision_at_k = correct_in_top_k / k
    # Recall at K: relevant items in the top K divided by all relevant items in the group
    total_relevant = group['is_correct'].sum()
    recall_at_k = correct_in_top_k / total_relevant if total_relevant > 0 else 0
    # Calculate F1 at K
    if precision_at_k + recall_at_k > 0:
        f1_at_k = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)
    else:
        f1_at_k = 0
    return precision_at_k, recall_at_k, f1_at_k

In [102]:
# Apply the function to each query group to get per-query Precision, Recall, and F1 at K (K = 3)
# Use merged_df_final to see the results for the full initial test run
results = merged_df_final.groupby('cw_sasb_query_text').apply(calculate_precision_recall_at_k_per_query, k=3)

In [103]:
# Overall Precision, Recall, and F1 at K, averaged over all queries
average_precision_at_k = results.map(lambda x: x[0]).mean()
average_recall_at_k = results.map(lambda x: x[1]).mean()
average_f1_at_k = results.map(lambda x: x[2]).mean()
print(f"Average Precision at K: {average_precision_at_k}")
print(f"Average Recall at K: {average_recall_at_k}")
print(f"Average F1 at K: {average_f1_at_k}")

Average Precision at K: 0.7971014492753624
Average Recall at K: 0.39725890769766
Average F1 at K: 0.45564135905756903
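
The average F1 above is the mean of the per-query F1 scores, which is generally not the same as the harmonic mean of the averaged precision and averaged recall. The sketch below illustrates the difference on two hypothetical rankings; the helper name `precision_recall_f1_at_k` and the relevance lists are assumptions for illustration only.

```python
def precision_recall_f1_at_k(relevance, k):
    # relevance: list of 0/1 flags, already ordered from highest to lowest similarity.
    hits = sum(relevance[:k])
    total_relevant = sum(relevance)
    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical rankings for two queries (1 = relevant), invented for illustration.
per_query = [precision_recall_f1_at_k(r, k=2) for r in ([1, 1, 1, 1], [0, 1, 0, 0])]

# Average each metric over queries, as the notebook does.
mean_p, mean_r, mean_f1 = (sum(col) / len(col) for col in zip(*per_query))
# For comparison: F1 computed from the averaged precision and recall.
f1_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
print(mean_f1, f1_of_means)
```

On this toy input the mean of per-query F1 is 2/3 while the F1 of the averaged precision and recall is 0.75. Recall at K is also capped by K divided by the number of relevant items in a group, so queries with many relevant articles pull the average recall down at small K.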
