### Transformers - text-embedding-3-small - Metrics Calculation
The purpose of this notebook is to calculate retrieval metrics for text-embedding-3-small, using the cosine similarities that were computed in the openai_embeddings_pipeline notebook.

This notebook calculates the following metrics:
* Success at K - Establishes whether we get a hit, i.e. whether at least one relevant ESG article appears in the top K positions of the model's ranking.
* Mean Reciprocal Rank (MRR) - Provides insight into the model's ability to return relevant items at higher ranks. It measures how early the first relevant ESG article appears: for each industry, the reciprocal rank is 1 divided by the rank of the first relevant article, and MRR is the mean of these values across industries. The closer the final number is to 1, the better the system is at giving you the right answers upfront.
* Precision at K - Measures the proportion of retrieved documents that are relevant among the top K documents retrieved. It's calculated by dividing the number of relevant documents in the top K by K.
* Recall at K - Measures the proportion of relevant documents retrieved in the top K positions out of all relevant documents available. 
* F1 Score at K - Combines precision and recall into a single metric, offering a more comprehensive evaluation of the model's performance. It helps balance the trade-off between precision and recall, ensuring that neither is disproportionately favored.
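
To make the definitions above concrete, here is a minimal sketch that computes each metric for a single ranked list. The helper names (`metrics_at_k`, `reciprocal_rank`) and the toy labels are illustrative assumptions, not part of the original pipeline; the list is assumed to be sorted by descending similarity, with 1 marking a relevant article.

```python
def metrics_at_k(labels, k):
    """Compute Success, Precision, Recall and F1 at K for one ranked list.

    labels: 0/1 relevance flags, already sorted by descending score (assumed).
    """
    top_k = labels[:k]
    hits = sum(top_k)                 # relevant items inside the top K
    total_relevant = sum(labels)      # relevant items in the whole list
    success = 1.0 if hits > 0 else 0.0
    precision = hits / k
    recall = hits / total_relevant if total_relevant > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {'success': success, 'precision': precision, 'recall': recall, 'f1': f1}


def reciprocal_rank(labels):
    """Return 1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking: the first relevant article sits at rank 2
toy_labels = [0, 1, 1, 0, 1]
toy_metrics = metrics_at_k(toy_labels, k=3)
toy_rr = reciprocal_rank(toy_labels)
print(toy_metrics)
print(toy_rr)
```

In this notebook the same quantities are computed per industry and then averaged across industries.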

Files needed from the openai_embeddings_pipeline notebook:
 - test_set.csv - this file is the direct output from the openai_embeddings_pipeline notebook

In [1]:
import pandas as pd
from pathlib import Path
import spacy
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

In [2]:
# Define the directory path
directory_path = Path(r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5999 - Capstone\Two Towers\Comparison')


# Define file paths
test_path = directory_path / 'test_set.csv'

# Read in the cleaned up CSV from other file
test_data = pd.read_csv(test_path, na_filter=False)

In [3]:
test_data.head()

Unnamed: 0,index,title & content,sentiment_perigon,summary,description,Ticker,Sector,Industry,Company,SASB,...,human_label_sentiment,pubDate,url,keywords,categories,entities,content,articleId,title,Unnamed: 27
0,33685.0,New York Cements Itself as the Gold Mining Cap...,"{'positive': 0.46477953, 'negative': 0.0338994...","This week, top-five producer AngloGold Ashanti...",(Bloomberg) -- The momentum has been building ...,NEM,Extractives & Minerals Processing,Metals & Mining,Newmont Corp,{'Tailings Storage Facilities Management': 'Th...,...,No,2023-02-08T22:16:21+00:00,https://www.newsmax.com/newsmax-tv/fitzgerald-...,"[{'name': 'Newsmax', 'weight': 0.09317307}, {'...",[{'name': 'Politics'}],"[{'data': 'Fitzgerald', 'type': 'PERSON', 'men...","Rep. Scott Fitzgerald, R-Wis., told Newsmax We...",c12355d81050473e89f4163372441061,Rep. Fitzgerald to Newsmax: DirecTV Dropping N...,
1,12072.0,"Shareholders v. Tesla, Nasdaq's diversity rule...","{'positive': 0.02043453, 'negative': 0.6323841...",\n\nThe case is In re Tesla Inc Securities Lit...,Some of the biggest securities cases of 2023 a...,NDAQ,Financials,Security & Commodity Exchanges,Nasdaq Inc,{'Managing Conflicts of Interest': 'Security a...,...,Positive,2023-05-18T14:28:52+00:00,https://www.axios.com/pro/media-deals/2023/05/...,"[{'name': 'Google AI', 'weight': 0.09959001}, ...",[{'name': 'Tech'}],"[{'data': 'YouTube', 'type': 'ORG', 'mentions'...",YouTube has embraced AI for causing a massive ...,fcbd16768c584451912d7121a259ad9d,YouTube praises AI transformation at Brandcast,
2,10731.0,"FedEx closing more locations, planning to furl...","{'positive': 0.02241792, 'negative': 0.9375396...",FedEx announced on Monday that it will close 2...,FedEx announced on Monday that it will close 2...,FDX,Transportation,Air Freight & Logistics,FedEx Corp,{'Greenhouse Gas Emissions': 'Air Freight & Lo...,...,Negative,2023-07-19T09:44:20+00:00,https://www.theguardian.com/technology/2023/ju...,"[{'name': 'AI models', 'weight': 0.13278106}, ...",[{'name': 'Tech'}],"[{'data': 'Nick Clegg', 'type': 'PERSON', 'men...",Nick Clegg has defended the release of an open...,3cb0ea7cb1cb40608c1cfc1e172ebc3e,Nick Clegg defends release of open-source AI m...,
3,34976.0,Modelo Maker Profits From Bud Light’s De...,"{'positive': 0.03527576, 'negative': 0.9478646...",Constellation Brands reported an 11% increase ...,Constellation Brands reported an 11% increase ...,STZ,Food & Beverage,Alcoholic Beverages,Constellation Brands Inc A,{'Water Management': 'Water management include...,...,No,2023-06-08T09:30:00+00:00,https://www.washingtonexaminer.com/restoring-a...,"[{'name': 'ESG', 'weight': 0.078042254}, {'nam...",[{'name': 'Finance'}],"[{'data': 'Bank of America', 'type': 'ORG', 'm...",Consumers’ Research has launched a campaign ...,7b188eebdd7c42ed9ca51237d0989674,Conservative group targets Bank of America in ...,
4,15045.0,Med tech investors paying up for patents - Med...,"{'positive': 0.24512091, 'negative': 0.606532,...",A new PitchBook report has found that Med tech...,Med tech startups with patents or patent appli...,ILMN,Health Care,Medical Equipment & Supplies,Illumina Inc,{'Product Safety': 'Information on product saf...,...,No,2023-01-20T15:49:30.229000+00:00,https://www.cleveland.com/business/2023/01/goo...,"[{'name': 'shrinking pandemic growth bubble', ...",[{'name': 'Business'}],"[{'data': 'Google', 'type': 'ORG', 'mentions':...","MOUNTAIN VIEW, California -- Google will lay o...",14b0ee5d771844c7838718faf0905545,"Google slashes 12,000 jobs to cope with shrink...",


In [4]:
columns_to_keep = ['index', 'title & content', 'Industry', 'max_cosine_similarities', 'GPT_ESG_or_not']
test_data = test_data[columns_to_keep]

# Now, sort the filtered DataFrame in descending order by max_cosine_similarities
test_data = test_data.sort_values(by='max_cosine_similarities', ascending=False)

In [5]:
# Reset the index of the DataFrame and drop the old one
test_data = test_data.reset_index(drop=True)

Unnamed: 0,index,title & content,Industry,max_cosine_similarities,GPT_ESG_or_not
847,2027.0,"Unprecedented Storms Upend US Towns, Insurance...",Insurance,0.890576,Major
848,2030.0,Severe storms lead to unprecedented $34 billio...,Insurance,0.878662,Major
726,65826.0,Carbon-neutral manufacturing is the next step ...,Automobiles,0.875713,Major
962,174150.0,Banking turmoil was not a crisis but 'the down...,Commercial Banks,0.873954,Minor
731,35533.0,It may not be 2008 all over again – but this...,Commercial Banks,0.867614,Minor
...,...,...,...,...,...
264,83047.0,Live Nation posts record $3 billion revenue am...,Leisure Facilities,0.735082,Major
346,83374.0,Taylor Swift's mega-producer Jack Antonoff bre...,Leisure Facilities,0.734845,Major
705,83454.0,Quanta Services (PWR) Amaze Investors With 14%...,Engineering & Construction Services,0.734790,Minor
392,88263.0,Grand jury decision on deadly Astroworld crowd...,Leisure Facilities,0.731009,Major


In [6]:
test_data = test_data.rename(columns={'max_cosine_similarities': 'similarity_score'})

Unnamed: 0,index,title & content,Industry,similarity_score,GPT_ESG_or_not
847,2027.0,"Unprecedented Storms Upend US Towns, Insurance...",Insurance,0.890576,Major
848,2030.0,Severe storms lead to unprecedented $34 billio...,Insurance,0.878662,Major
726,65826.0,Carbon-neutral manufacturing is the next step ...,Automobiles,0.875713,Major
962,174150.0,Banking turmoil was not a crisis but 'the down...,Commercial Banks,0.873954,Minor
731,35533.0,It may not be 2008 all over again – but this...,Commercial Banks,0.867614,Minor


In [9]:
test_data = test_data.rename(columns={'title & content': 'cw_text'})

final_results_df = test_data

In [12]:
# Mapping - applying the Minor and Major as Yes assumption
mapping = {'Minor': 'Yes', 'Major': 'Yes', 'No': 'No'}

final_results_df['GPT_ESG_or_not'] = final_results_df['GPT_ESG_or_not'].map(mapping)
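
# A note on the mapping above: `Series.map` with a dictionary turns any label
# that is not a key into NaN. The minimal sketch below shows one way to check
# for that; the sample labels (including 'Unclear') and the variable names are
# made up for illustration and do not come from the real test set.

```python
import pandas as pd

# Hypothetical labels; 'Unclear' is deliberately absent from the mapping
sample_labels = pd.Series(['Major', 'Minor', 'No', 'Unclear'])
label_mapping = {'Minor': 'Yes', 'Major': 'Yes', 'No': 'No'}

mapped_labels = sample_labels.map(label_mapping)

# Labels that were not covered by the mapping end up as NaN after .map()
unmapped_labels = sample_labels[mapped_labels.isna()].unique().tolist()
print(unmapped_labels)
```

Because the CSV is read with `na_filter=False`, an empty label would arrive as an empty string and would also become NaN here, so a check like this may be worth running before the metrics are computed.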


In [14]:
final_results_df = final_results_df.reset_index(drop=True)

In [23]:
final_results_df.head()

Unnamed: 0,index,cw_text,Industry,similarity_score,GPT_ESG_or_not
0,2027.0,"Unprecedented Storms Upend US Towns, Insurance...",Insurance,0.890576,Yes
1,2030.0,Severe storms lead to unprecedented $34 billio...,Insurance,0.878662,Yes
2,65826.0,Carbon-neutral manufacturing is the next step ...,Automobiles,0.875713,Yes
3,174150.0,Banking turmoil was not a crisis but 'the down...,Commercial Banks,0.873954,Yes
4,35533.0,It may not be 2008 all over again – but this...,Commercial Banks,0.867614,Yes


In [24]:
top_sorted_df

Unnamed: 0,index,cw_text,Industry,similarity_score,GPT_ESG_or_not
0,20303.0,Ad industry tries to quash proposed data broke...,Advertising & Marketing,0.785364,Yes
1,25809.0,Credera launches global cross-functional AI co...,Advertising & Marketing,0.779233,Yes
2,34332.0,Advertisers pull back from Twitter amid 'uncer...,Advertising & Marketing,0.771291,Yes
3,34869.0,"DDB Chicago Promotes Kiska Howell to EVP, Head...",Advertising & Marketing,0.770837,Yes
4,37542.0,OMNICOM MEDIA GROUP RANKED #1 FOR 2022 INCREME...,Advertising & Marketing,0.768638,No
...,...,...,...,...,...
1041,6492.0,Waste Management (NYSE:WM) shareholders have e...,Waste Management,0.809619,No
1042,12565.0,"Waste Management, Inc. (NYSE:WM) Pays A US$0.7...",Waste Management,0.796485,No
1043,12656.0,Waste Management's (NYSE:WM) Returns Have Hit ...,Waste Management,0.796305,No
1044,17926.0,Recycling Collection Not Changing In Horsham -...,Waste Management,0.788378,Yes


In [25]:
# Sort articles by similarity_score for each Industry group
top_sorted_df = final_results_df.groupby('Industry', group_keys=False) \
                  .apply(lambda x: x.sort_values('similarity_score', ascending=False))
top_sorted_df = top_sorted_df.reset_index(drop=True)


def calculate_success_at_k(df, k):
    # Group by 'Industry'
    grouped_df = df.groupby('Industry')
    hit_count = 0
    total_groups = len(grouped_df)

    for name, group in grouped_df:
        # Check if 'Yes' is within the top k rows for 'GPT_ESG_or_not'
        if 'Yes' in group.head(k)['GPT_ESG_or_not'].values:
            hit_count += 1

    hit_rate = hit_count / total_groups
    return hit_rate

# Compute Success at K for k = 1..5, collecting one dictionary per k
results = []
for k in range(1, 6):
    hit_rate = calculate_success_at_k(top_sorted_df, k)
    results.append({'k': k, 'hit_rate': hit_rate})

# Convert the list of dictionaries to a DataFrame
success_k = pd.DataFrame(results)

# Display the results
print(success_k)

   k  hit_rate
0  1  0.885246
1  2  1.000000
2  3  1.000000
3  4  1.000000
4  5  1.000000
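
As a cross-check on the loop-based function, Success at K can also be sketched without an explicit Python loop. The function name `success_at_k_vectorized` and the toy DataFrame below are assumptions for illustration; the sketch assumes the rows are already sorted by descending `similarity_score` within each industry, as `top_sorted_df` is.

```python
import pandas as pd


def success_at_k_vectorized(df, k):
    """Fraction of industries with at least one 'Yes' in their first k rows."""
    # groupby().head(k) keeps the first k rows of every industry
    top_k = df.groupby('Industry', sort=False).head(k)
    # One boolean per industry: does any of its top-k rows carry a 'Yes'?
    hit_per_industry = top_k['GPT_ESG_or_not'].eq('Yes').groupby(top_k['Industry']).any()
    return hit_per_industry.mean()


# Hypothetical data: A hits at rank 1, B hits at rank 2, C never hits
toy_df = pd.DataFrame({
    'Industry': ['A', 'A', 'B', 'B', 'C', 'C'],
    'similarity_score': [0.9, 0.8, 0.9, 0.7, 0.6, 0.5],
    'GPT_ESG_or_not': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
})
toy_df = toy_df.sort_values(['Industry', 'similarity_score'], ascending=[True, False])

success_at_1 = success_at_k_vectorized(toy_df, 1)
success_at_2 = success_at_k_vectorized(toy_df, 2)
print(success_at_1, success_at_2)
```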


In [26]:
# Sort articles by similarity_score for each Industry group
top_sorted_df = final_results_df.groupby('Industry', group_keys=False) \
                  .apply(lambda x: x.sort_values('similarity_score', ascending=False))
top_sorted_df = top_sorted_df.reset_index(drop=True)

def calculate_mrr(df):
    # Group by 'Industry' to process each industry group separately
    grouped_df = df.groupby('Industry')
    total_industries = len(grouped_df)  # Total number of industry groups
    sum_reciprocal_rank = 0  # Initialize the sum of reciprocal ranks
    
    for name, group in grouped_df:
        # Find the index (rank) of the first 'Yes' in the sorted group
        first_relevant_index = group['GPT_ESG_or_not'].eq('Yes').idxmax()
        # Check if the first relevant index actually contains 'Yes'
        if group.loc[first_relevant_index, 'GPT_ESG_or_not'] == 'Yes':
            rank = group.index.get_loc(first_relevant_index) + 1  # Get rank (1-based)
            sum_reciprocal_rank += 1 / rank  # Add the reciprocal of the rank to the sum
    
    mrr = sum_reciprocal_rank / total_industries  # Calculate the mean of the reciprocal ranks
    return mrr

# Call the function with your DataFrame and print the MRR
mrr_score = calculate_mrr(top_sorted_df)
print(f"The Mean Reciprocal Rank (MRR) is: {mrr_score}")


The Mean Reciprocal Rank (MRR) is: 0.9426229508196722
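
A small hand-checkable case can help confirm the MRR logic. The sketch below is an alternative formulation, not the function used above: `mrr_vectorized` and the toy data are hypothetical, and the rows are assumed to be sorted by descending similarity within each industry. Industries with no relevant article contribute 0 to the sum, which matches the behaviour of `calculate_mrr`.

```python
import pandas as pd


def mrr_vectorized(df):
    """Mean reciprocal rank of the first 'Yes' per industry (0 if none)."""
    # 1-based position of each row inside its industry
    rank = df.groupby('Industry').cumcount() + 1
    relevant = df['GPT_ESG_or_not'].eq('Yes')
    # Smallest rank among the relevant rows of each industry
    first_rank = rank[relevant].groupby(df.loc[relevant, 'Industry']).min()
    n_industries = df['Industry'].nunique()
    # Industries without any 'Yes' are missing from first_rank and add nothing
    return (1.0 / first_rank).sum() / n_industries


# Hypothetical data: first hit at rank 1 (A), rank 2 (B), never (C)
toy_df = pd.DataFrame({
    'Industry': ['A', 'A', 'B', 'B', 'C', 'C'],
    'similarity_score': [0.9, 0.8, 0.9, 0.7, 0.6, 0.5],
    'GPT_ESG_or_not': ['Yes', 'No', 'No', 'Yes', 'No', 'No'],
})

toy_mrr = mrr_vectorized(toy_df)
print(toy_mrr)  # (1/1 + 1/2 + 0) / 3 = 0.5
```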


In [37]:
def calculate_precision_recall_at_k_per_group(group, k):
    # Convert 'Yes'/'No' in 'GPT_ESG_or_not' to 1/0 for calculation.
    # assign() returns a copy, so the group passed in by groupby is not mutated.
    group = group.assign(is_correct=group['GPT_ESG_or_not'].eq('Yes').astype(int))

    # Sort the group by similarity_score in descending order and take top K
    top_k = group.sort_values('similarity_score', ascending=False).head(k)

    # Calculate how many of the top K are correct
    correct_in_top_k = top_k['is_correct'].sum()
    
    # Calculate Precision at K
    precision_at_k = correct_in_top_k / k

    # Calculate Recall at K
    total_relevant = group['is_correct'].sum()  # This should consider the whole group, not just top K
    recall_at_k = correct_in_top_k / total_relevant if total_relevant > 0 else 0
    
    # Calculate F1 at K
    if precision_at_k + recall_at_k > 0:
        f1_at_k = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)
    else:
        f1_at_k = 0

    return precision_at_k, recall_at_k, f1_at_k

# Apply the function to each industry group and calculate the mean Precision, Recall, and F1 at K
results = top_sorted_df.groupby('Industry').apply(calculate_precision_recall_at_k_per_group, k=3)

# Convert the results into a DataFrame and then calculate the mean
results_df = pd.DataFrame(results.tolist(), index=results.index, columns=['Precision', 'Recall', 'F1'])

average_precision_at_k = results_df['Precision'].mean()
average_recall_at_k = results_df['Recall'].mean()
average_f1_at_k = results_df['F1'].mean()

print(f"Average Precision at K: {average_precision_at_k}")
print(f"Average Recall at K: {average_recall_at_k}")
print(f"Average F1 at K: {average_f1_at_k}")


Average Precision at K: 0.8688524590163933
Average Recall at K: 1.0
Average F1 at K: 0.9114754098360652
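
One design choice in the function above is that Precision at K always divides by K, even for an industry with fewer than K articles. The sketch below illustrates the effect on a toy list and shows one possible alternative that caps the denominator at the group size. The `precision_at_k` helper and its `cap_at_group_size` flag are hypothetical and are not used elsewhere in this notebook.

```python
def precision_at_k(labels, k, cap_at_group_size=False):
    """Precision at K for 0/1 labels sorted by descending score (assumed).

    With cap_at_group_size=True the denominator is min(k, len(labels)),
    so a short list is not penalised for having fewer than k items.
    """
    top_k = labels[:k]
    denominator = min(k, len(labels)) if cap_at_group_size else k
    return sum(top_k) / denominator if denominator > 0 else 0.0


# Hypothetical industry with only two articles, both relevant
small_group = [1, 1]
fixed_k_precision = precision_at_k(small_group, 3)
capped_precision = precision_at_k(small_group, 3, cap_at_group_size=True)
print(fixed_k_precision, capped_precision)
```

Dividing by a fixed K keeps the metric comparable across industries, while capping avoids understating precision for small groups; which one is appropriate depends on how the results will be read.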


In [30]:
results_df

Industry,Precision,Recall,F1
Advertising & Marketing,1.000000,0.500000,0.666667
Aerospace & Defence,1.000000,0.272727,0.428571
Agricultural Products,1.000000,0.600000,0.750000
Air Freight & Logistics,1.000000,0.428571,0.600000
Airlines,1.000000,0.088235,0.162162
...,...,...,...
Solar Technology & Project Developers,1.000000,0.428571,0.600000
Telecommunication Services,0.666667,0.142857,0.235294
Tobacco,1.000000,0.333333,0.500000
Toys & Sporting Goods,0.666667,0.666667,0.666667
