### Transformers Final Project - Two Towers - roBERTa Encoder

This notebook tests a Two Towers information-retrieval architecture built on a roBERTa encoder with an additional dense layer on top. Each tower is therefore a 13-layer network (the 12 encoder layers plus one dense layer, per the code below). The towers are trained on articles that GPT labeled as ESG or not, with the goal of finding ESG-relevant articles for a given industry, since each industry has its own SASB query.

**Note**: The training and testing data use the assumption that **Major and Minor are both 'YES' for ESG and No is 'No' for ESG**.

The notebook uses roberta-base (12 layers, hidden size 768, 12 attention heads) as the encoder:
- roberta: RoBERTa (Robustly Optimized BERT Approach) is an adaptation and improvement of BERT, developed by Facebook AI. It is designed for natural language processing tasks and enhances BERT by optimizing key hyperparameters, training with larger batch sizes, removing the next sentence prediction objective, and training on a larger dataset.
- base: Standard RoBERTa model size.
- L-12: Specifies that the model has 12 layers (or transformer blocks). Model depth impacts the model's ability to understand complex language features.
- H-768: The hidden size is 768, which is the dimensionality of the vector that represents each token in every layer. This affects the model's capacity to learn and represent language data.
- A-12: Denotes the model uses 12 attention heads. Attention heads allow the model to focus on different parts of the sentence to better understand the context and the relationships between words.

Thus, the roberta-base model employs a robustly optimized version of the BERT architecture, trained to better understand natural language through its 12-layer structure, with a hidden size of 768 and 12 attention heads per layer.
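
The sizes listed above also determine how wide each attention head is, because multi-head attention splits the hidden vector evenly across the heads. The following is a minimal sketch of that arithmetic; `EncoderShape` is a hypothetical helper written for illustration and is not part of the transformers library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """Illustrative container for the sizes listed above (not a library class)."""
    num_layers: int = 12
    hidden_size: int = 768
    num_heads: int = 12

    @property
    def head_dim(self) -> int:
        # Multi-head attention divides the hidden vector evenly across heads,
        # so the hidden size must be a multiple of the number of heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


shape = EncoderShape()
print(f"Each of the {shape.num_heads} heads works on {shape.head_dim} dimensions")
```

With a hidden size of 768 and 12 heads, each head operates on a 64-dimensional slice of the token representation.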

Additional files needed to run the notebook:
- Train and test data:
    - train_df_cleaned_no_stopwords.csv
    - test_df_cleaned_no_stopwords.csv
- Trained model weights (trained on GPT-labelled data):
    - roBERTa-two_tower_model_weights.h5
- Saved model results for further processing (to avoid recomputing the test results and go directly to Success at K and the other metrics):
    - Transformers-roBERTa-final_results_df.csv

In [145]:
import pandas as pd
from transformers import RobertaTokenizer, TFRobertaModel
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
import numpy as np

from pathlib import Path
import spacy
tf.get_logger().setLevel('ERROR')

In [146]:
# Define the directory path
directory_path = Path(r'C:\Users\tiffa.TIFFANY\OneDrive\Documents\DS 5690 - Transformers\Final Project\Two Tower')


# Define file paths
train_path = directory_path / 'train_df_cleaned_no_stopwords.csv'
test_path = directory_path / 'test_df_cleaned_no_stopwords.csv'

# Read in the cleaned up CSV from other file
df_original = pd.read_csv(train_path, na_filter=False)
df_original.head()

Unnamed: 0,title_and_content,Ticker,Industry,Company,SASB,GPT_ESG_or_not,GPT_firm_or_not,GPT_sentiment,GPT_topics,ESG_or_not,firm_or_not,human_label_sentiment,url,articleId,title,Concatenated_SASB,lower_title_and_content,lower_Concatenated_SASB,cw_text,cw_sasb_query_text
0,A Bright Spot in Commercial Real Estate: Retai...,CBRE,Real Estate Services,CBRE Group Inc.,{'Sustainability Services': 'In the Real Estat...,Minor,Minor,Neutral,,Minor,Minor,Neutral,https://www.dallasnews.com/business/2023/08/18...,0c2744a7d8ab41f4b81a2dee8b36bb45,"American Airlines sues Skiplagged, claiming ch...",Sustainability Services - In the Real Estate S...,a bright spot in commercial real estate: retai...,sustainability services - in the real estate s...,bright spot commercial real estate retail shop...,sustainability services real estate services i...
1,UPS Trains Non-Union Staff to Deliver Packages...,UPS,Air Freight & Logistics,United Parcel Service Inc B,{'Greenhouse Gas Emissions': 'Air Freight & Lo...,Major,Major,Negative,Labour Practices,Minor,Major,Neutral,https://www.thesun.co.uk/tech/22883701/dangero...,c5810a1c9d26437999ef02bc39eedf1b,Urgent warning over two apps to delete from yo...,Greenhouse Gas Emissions - Air Freight & Logis...,ups trains non-union staff to deliver packages...,greenhouse gas emissions - air freight & logis...,ups trains non union staff deliver packages ca...,greenhouse gas emissions air freight logistics...
2,"A Clean Energy Fund's Challenges, How Twitter ...",STZ,Alcoholic Beverages,Constellation Brands Inc A,{'Water Management': 'Water management include...,Minor,Minor,Positive,"Energy Management, None",Major,Minor,Positive,https://www.theverge.com/2022/12/22/23522535/y...,1bf4ceb04a194274ba6fccbceac0194d,YouTube’s NFL Sunday Ticket deal is a brilli...,Water Management - Water management includes a...,"a clean energy fund's challenges, how twitter ...",water management - water management includes a...,clean energy fund challenges twitter antics hu...,water management water management includes ent...
3,Live Nation posts 73% jump in revenue and reco...,LYV,Leisure Facilities,Live Nation Entertainment Inc.,{'Customer Safety': 'Leisure facility entities...,Minor,Major,Negative,"Customer Safety, Governance",Minor,Major,Neutral,https://www.ksl.com/article/50656933/ford-reca...,1899eab638d6461796e1d93bb98e4177,Ford recall over discouraging use of seat belts,Customer Safety - Leisure facility entities op...,live nation posts 73% jump in revenue and reco...,customer safety - leisure facility entities op...,live nation posts 73 jump revenue record atten...,customer safety leisure facility entities oper...
4,Corporate Responsibility at T-Mobile: Reaching...,TMUS,Telecommunication Services,T-Mobile US Inc,{'Competitive Behaviour & Open Internet': 'The...,Major,Major,Positive,"Competitive Behaviour & Open Internet, Product...",Major,Major,Positive,https://www.indiatimes.com/technology/news/goo...,5976d28c358745b7a44d50558f3fd4c6,Indian Regulator Says Google's Data Hegemony I...,Competitive Behaviour & Open Internet - The Te...,corporate responsibility at t-mobile: reaching...,competitive behaviour & open internet - the te...,corporate responsibility t mobile reaching new...,competitive behaviour open internet telecommun...


In [147]:
test_data = pd.read_csv(test_path, na_filter=False)

In [150]:
df_original.shape

(4175, 22)

In [154]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = TFRobertaModel.from_pretrained('roberta-base')

Some layers from the model checkpoint at roberta-base were not used when initializing TFRobertaModel: ['lm_head']
- This IS expected if you are initializing TFRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFRobertaModel were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


## Create the Two Towers Architecture

In [155]:
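# Build one tower of the Two Tower architecture.
# Both towers call the same roberta_model instance, so the encoder weights are shared;
# only the Dense layer on top is separate for each tower.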
def build_tower(name):
    input_ids = Input(shape=(None,), dtype=tf.int32, name=f"{name}_input_ids")
    attention_mask = Input(shape=(None,), dtype=tf.int32, name=f"{name}_attention_mask")
    
    roberta_output = roberta_model(input_ids=input_ids, attention_mask=attention_mask)[1]  # Using pooled output
    dense_layer = Dense(768, activation='tanh')(roberta_output)
    tower_model = Model(inputs=[input_ids, attention_mask], outputs=dense_layer, name=f"{name}_tower")
    return tower_model

candidate_tower = build_tower("candidate")
query_tower = build_tower("query")


In [156]:
def TwoTowerModel(candidate_tower, query_tower):
    candidate_embedding = candidate_tower.output
    query_embedding = query_tower.output
    
    # Normalize the embeddings to unit length
    candidate_embedding = Lambda(lambda x: K.l2_normalize(x, axis=1))(candidate_embedding)
    query_embedding = Lambda(lambda x: K.l2_normalize(x, axis=1))(query_embedding)
    
    # Compute cosine similarity
    dot_product = tf.matmul(candidate_embedding, query_embedding, transpose_b=True)
    model = Model(inputs=[candidate_tower.input, query_tower.input], outputs=dot_product)
    return model

two_tower_model = TwoTowerModel(candidate_tower, query_tower)

# Load model weights if you are rerunning
# two_tower_model.load_weights('roBERTa-two_tower_model_weights.h5')
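
The model output above is a matrix rather than a single score: after L2 normalization, the matrix product of the two embedding batches holds the cosine similarity of every candidate with every query, and the diagonal holds the scores of the matched pairs. The NumPy sketch below illustrates this with random stand-in embeddings; `l2_normalize` is a hypothetical helper that mirrors what `K.l2_normalize` does, and the array shapes are assumptions chosen for readability.

```python
import numpy as np


def l2_normalize(x, axis=1, eps=1e-12):
    """Scale each row to unit length (illustrative stand-in for K.l2_normalize)."""
    norm = np.sqrt(np.sum(np.square(x), axis=axis, keepdims=True))
    return x / np.maximum(norm, eps)


rng = np.random.default_rng(0)
candidates = rng.normal(size=(3, 4))  # 3 made-up candidate embeddings
queries = rng.normal(size=(3, 4))     # 3 made-up query embeddings

# Entry [i, j] is the cosine similarity between candidate i and query j.
scores = l2_normalize(candidates) @ l2_normalize(queries).T

# The diagonal pairs candidate i with query i, i.e. the matched pairs.
paired_scores = np.diag(scores)

print(scores.shape)
print(paired_scores)
```

This is why the training step further down reads only `tf.linalg.diag_part(logits)`: the off-diagonal entries compare articles with queries they were not paired with.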

In [157]:
candidate_texts = df_original['cw_text'].tolist()
query_texts = df_original['cw_sasb_query_text'].tolist()

# Every (article, query) pair in the training data is treated as a positive match, so create a labels list of all 1s
labels = [1] * len(candidate_texts)

# Encode the texts using the HuggingFace's Transformers tokenizer
candidate_encoded = tokenizer(candidate_texts, padding=True, truncation=True, return_tensors="tf")
query_encoded = tokenizer(query_texts, padding=True, truncation=True, return_tensors="tf")

# Convert labels to a numpy array
labels = np.array(labels)

In [158]:
# Apply the training
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
accuracy_metric = tf.keras.metrics.BinaryAccuracy()

@tf.function
def train_step(model, candidate_inputs, query_inputs, labels):
    with tf.GradientTape() as tape:
        # Forward pass
        logits = model([candidate_inputs, query_inputs], training=True)
        loss = loss_fn(labels, tf.linalg.diag_part(logits))
    
    # Compute gradients and update model weights
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    
    # Update the metrics
    accuracy_metric.update_state(labels, tf.linalg.diag_part(logits))
    
    return loss

# Training loop: each epoch runs a single full-batch gradient step over the entire training set
epochs = 3
for epoch in range(epochs):
    print(f"\nStart of Epoch {epoch + 1}")
    loss = train_step(two_tower_model, 
                      {'candidate_input_ids': candidate_encoded['input_ids'], 'candidate_attention_mask': candidate_encoded['attention_mask']},
                      {'query_input_ids': query_encoded['input_ids'], 'query_attention_mask': query_encoded['attention_mask']},
                      labels)
    
    # Display metrics at the end of each epoch
    print(f"Epoch {epoch + 1}, Loss: {loss}")


Start of Epoch 1
Epoch 1, Loss: 0.700639009475708

Start of Epoch 2
Epoch 2, Loss: 0.6412165760993958

Start of Epoch 3
Epoch 3, Loss: 0.5842878222465515
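
To help read the loss values above, the sketch below evaluates binary cross-entropy for a positive label when the score is treated as a logit, which is what `BinaryCrossentropy(from_logits=True)` computes per example. It assumes the score is a cosine similarity and therefore lies in [-1, 1]; `bce_with_logits_positive` is a hypothetical helper name, not a TensorFlow function.

```python
import math


def bce_with_logits_positive(score):
    """Binary cross-entropy for label 1 when `score` is used as a logit.

    Equivalent to -log(sigmoid(score)) = log(1 + exp(-score)).
    """
    return math.log1p(math.exp(-score))


# Cosine similarities are bounded, so the per-example loss is bounded too.
for s in (-1.0, 0.0, 1.0):
    print(f"score={s:+.1f} -> loss={bce_with_logits_positive(s):.4f}")
```

Under these assumptions the loss for an all-positive batch can only range from about 0.313 (every score equal to 1) to about 1.313 (every score equal to -1), and it equals ln 2, about 0.693, when the scores are zero. The first-epoch loss of roughly 0.70 is consistent with similarity scores close to zero before training.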


In [159]:
# Save model weights if running this for the first time
#two_tower_model.save_weights('roBERTa-two_tower_model_weights.h5')

In [160]:
test_candidate_texts = test_data['cw_text'].tolist()
test_query_texts = test_data['cw_sasb_query_text'].tolist()

# Encode the texts
test_candidate_encoded = tokenizer(test_candidate_texts, padding=True, truncation=True, return_tensors="tf")
test_query_encoded = tokenizer(test_query_texts, padding=True, truncation=True, return_tensors="tf")

In [161]:
candidate_embeddings = candidate_tower.predict([test_candidate_encoded['input_ids'], test_candidate_encoded['attention_mask']])
query_embeddings = query_tower.predict([test_query_encoded['input_ids'], test_query_encoded['attention_mask']])



### Compute Cosine Similarity Scores for the test set

In [162]:
# candidate_embeddings and query_embeddings are in the same row order as test_data, so row i of each
# forms one (article, query) pair. Scoring only these pairs avoids computing combinations across
# different industries, since we assume the industry is a known label.
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarities = []
for i in range(len(test_data)):
    candidate_emb = candidate_embeddings[i]
    query_emb = query_embeddings[i]
    
    # Reshape embeddings for sklearn's cosine_similarity function
    sim_score = cosine_similarity(candidate_emb.reshape(1, -1), query_emb.reshape(1, -1))
    
    # Append to your results list
    cosine_similarities.append({
        'cw_text': test_data.iloc[i]['cw_text'],
        'cw_sasb_query_text': test_data.iloc[i]['cw_sasb_query_text'],
        'similarity_score': sim_score[0, 0]
    })

# Convert to DataFrame for easier handling
results_df = pd.DataFrame(cosine_similarities)
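
The loop above scores one pair at a time. Since only matched rows are compared, the same scores can be computed in one vectorized step. The following is a sketch of that alternative, not the method used in this notebook; `rowwise_cosine` is a hypothetical helper and the small arrays are made-up stand-ins for the embeddings.

```python
import numpy as np


def rowwise_cosine(a, b, eps=1e-12):
    """Cosine similarity between row i of `a` and row i of `b`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Guard against division by zero for all-zero rows.
    return dot / np.maximum(norms, eps)


# Toy stand-ins for candidate_embeddings and query_embeddings.
toy_candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
toy_queries = np.array([[1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])

toy_scores = rowwise_cosine(toy_candidates, toy_queries)
print(toy_scores)
```

The first pair is identical in direction, the second is 45 degrees apart, and the third is orthogonal, so the scores are 1, about 0.707, and 0 respectively.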

In [163]:
results_df

Unnamed: 0,cw_text,cw_sasb_query_text,similarity_score
0,new york cements gold mining capital world new...,tailings storage facilities management metals ...,0.287166
1,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...,0.292012
2,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...,0.277035
3,modelo maker profits bud light’s decline...,water management water management includes ent...,0.296565
4,med tech investors paying patents med tech sta...,product safety information product safety effe...,0.288988
...,...,...,...
1036,lockheed martin stumbles supply chain wsj dema...,product safety product safety important consid...,0.275177
1037,banks rush borrow record breaking $ 165 billio...,factors credit analysis financial intermediari...,0.295587
1038,ohio train derailment norfolk southern ceo say...,greenhouse gas emissions rail transportation i...,0.284609
1039,at&t verizon t mobile avoid $ 200 million fine...,competitive behaviour open internet telecommun...,0.286689


In [164]:
test_df_copy = test_data.copy()

# Remove "NA" and get unique queries and candidates
unique_queries = test_df_copy.loc[test_df_copy['cw_sasb_query_text'] != "NA", ['cw_sasb_query_text', 'Industry']].drop_duplicates().reset_index(drop=True)

# Get unique candidates
unique_candidates = test_df_copy['cw_text'].unique()

In [165]:
merged_df = pd.merge(results_df, unique_queries[['cw_sasb_query_text', 'Industry']], 
                     on='cw_sasb_query_text', how='left')

merged_df

Unnamed: 0,cw_text,cw_sasb_query_text,similarity_score,Industry
0,new york cements gold mining capital world new...,tailings storage facilities management metals ...,0.287166,Metals & Mining
1,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...,0.292012,Security & Commodity Exchanges
2,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...,0.277035,Air Freight & Logistics
3,modelo maker profits bud light’s decline...,water management water management includes ent...,0.296565,Alcoholic Beverages
4,med tech investors paying patents med tech sta...,product safety information product safety effe...,0.288988,Medical Equipment & Supplies
...,...,...,...,...
1036,lockheed martin stumbles supply chain wsj dema...,product safety product safety important consid...,0.275177,Aerospace & Defence
1037,banks rush borrow record breaking $ 165 billio...,factors credit analysis financial intermediari...,0.295587,Commercial Banks
1038,ohio train derailment norfolk southern ceo say...,greenhouse gas emissions rail transportation i...,0.284609,Rail Transportation
1039,at&t verizon t mobile avoid $ 200 million fine...,competitive behaviour open internet telecommun...,0.286689,Telecommunication Services


In [166]:
final_results_df = pd.merge(merged_df, test_df_copy[['cw_text', 'url', 'articleId', 'title']], 
                    on='cw_text', how='left')
final_results_df.head()

Unnamed: 0,cw_text,cw_sasb_query_text,similarity_score,Industry,url,articleId,title
0,new york cements gold mining capital world new...,tailings storage facilities management metals ...,0.287166,Metals & Mining,https://www.newsmax.com/newsmax-tv/fitzgerald-...,c12355d81050473e89f4163372441061,Rep. Fitzgerald to Newsmax: DirecTV Dropping N...
1,shareholders v. tesla nasdaq diversity rule se...,managing conflicts interest security commodity...,0.292012,Security & Commodity Exchanges,https://www.axios.com/pro/media-deals/2023/05/...,fcbd16768c584451912d7121a259ad9d,YouTube praises AI transformation at Brandcast
2,fedex closing locations planning furlough empl...,greenhouse gas emissions air freight logistics...,0.277035,Air Freight & Logistics,https://www.theguardian.com/technology/2023/ju...,3cb0ea7cb1cb40608c1cfc1e172ebc3e,Nick Clegg defends release of open-source AI m...
3,modelo maker profits bud light’s decline...,water management water management includes ent...,0.296565,Alcoholic Beverages,https://www.washingtonexaminer.com/restoring-a...,7b188eebdd7c42ed9ca51237d0989674,Conservative group targets Bank of America in ...
4,med tech investors paying patents med tech sta...,product safety information product safety effe...,0.288988,Medical Equipment & Supplies,https://www.cleveland.com/business/2023/01/goo...,14b0ee5d771844c7838718faf0905545,"Google slashes 12,000 jobs to cope with shrink..."


In [167]:
# Count NaNs across the whole final_results_df
total_nans = final_results_df.isna().sum().sum()

# Count NaNs in each column of merged_df (the table before url, articleId, and title were joined)
nans_per_column = merged_df.isna().sum()

print("Total NaNs in the DataFrame:", total_nans)
print("\nNaNs per column:\n", nans_per_column)

Total NaNs in the DataFrame: 0

NaNs per column:
 cw_text               0
cw_sasb_query_text    0
similarity_score      0
Industry              0
dtype: int64


In [168]:
# final_results_df.to_csv("Transformers-roBERTa-final_results_df.csv", index=False)

# Uncomment to read in the CSV when skipping straight to Success at K and the other metrics
# final_results_df = pd.read_csv('Transformers-roBERTa-final_results_df.csv', na_filter=False)

## Calculating Model Comparison Metrics

In this section, after some preparation work, we compute the following metrics:
* Success at K - Indicates whether at least one relevant ESG article appears in the top K positions of the model's ranking.
* Mean Reciprocal Rank (MRR) - Measures how early the first relevant ESG article appears in the ranking, averaged across queries. The closer the value is to 1, the more often a relevant article is ranked at the top.
* Precision at K - Measures the proportion of the top K retrieved documents that are relevant. It is calculated by dividing the number of relevant documents in the top K by K.
* Recall at K - Measures the proportion of all relevant documents that are retrieved in the top K positions.
* F1 Score at K - Combines precision and recall into a single metric (their harmonic mean), balancing the trade-off so that neither is disproportionately favored.

**Note**: Before calculating the metrics in this section, the user should already have reloaded final_results_df from Transformers-roBERTa-final_results_df.csv. This section also applies the mapping so that the testing data follows the assumption that **Major and Minor are both 'YES' for ESG and No is 'No' for ESG**.
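
The metric definitions above can be sketched on a single toy ranking before applying them to the real results. In the example below, `ranked` is a made-up list of relevance flags (1 = ESG relevant) already sorted from highest to lowest similarity, and the function names are illustrative rather than the ones used later in this notebook.

```python
def success_at_k(relevance, k):
    """1.0 if any of the top-k items is relevant, otherwise 0.0."""
    return 1.0 if any(relevance[:k]) else 0.0


def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def precision_recall_f1_at_k(relevance, k):
    """Precision, recall, and F1 computed over the top-k items."""
    hits = sum(relevance[:k])
    total_relevant = sum(relevance)
    precision = hits / k
    recall = hits / total_relevant if total_relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Toy ranking for one query: the first relevant article sits at rank 2.
ranked = [0, 1, 1, 0, 1]

print(success_at_k(ranked, 1), success_at_k(ranked, 2))
print(reciprocal_rank(ranked))
print(precision_recall_f1_at_k(ranked, 3))
```

For this ranking, Success at 1 is 0 and Success at 2 is 1, the reciprocal rank is 0.5, and precision, recall, and F1 at 3 are all 2/3. Averaging each value over all queries (industries) gives the reported metric. Note that Recall at K can never exceed K divided by the number of relevant items, so it can be low for a query with many relevant articles even when Precision at K is high.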

In [None]:
# Preparing data to do Success at K
# Sort articles by cosine similarity score for each Industry group
top_sorted_df = final_results_df.groupby('Industry', group_keys=False) \
                  .apply(lambda x: x.sort_values('similarity_score', ascending=False))

top_sorted_df = top_sorted_df.reset_index(drop=True)

test_df_relevant = test_df_copy[['cw_text', 'GPT_ESG_or_not']].drop_duplicates()
merged_df_final = pd.merge(top_sorted_df, test_df_relevant, on='cw_text', how='left')

# Mapping - applying the Minor and Major as Yes assumption
mapping = {'Minor': 'Yes', 'Major': 'Yes', 'No': 'No'}

merged_df_final['GPT_ESG_or_not'] = merged_df_final['GPT_ESG_or_not'].map(mapping)

# Adding in the ground truth labels and checking if it looks correct
merged_df_final

In [169]:
# Get Success at K Metrics
# Success at K asks: within the top K results for a query, is there at least one relevant item?

def calculate_success_at_k(merged_df, k):
    # Group by 'Industry'; rows in each group are assumed to be sorted by similarity_score (descending)
    grouped_df = merged_df.groupby('Industry')
    hit_count = 0
    total_groups = len(grouped_df)

    for name, group in grouped_df:
        if 'Yes' in group.head(k)['GPT_ESG_or_not'].values:
            hit_count += 1

    hit_rate = hit_count / total_groups
    return hit_rate

# Collect one result per k value
results = []

# Loop through k values from 1 to 5, as we don't expect going past 5 to be necessary
for k in range(1, 6):
    hit_rate = calculate_success_at_k(merged_df_final, k)
    # Store the result as a dictionary in the list
    results.append({'k': k, 'hit_rate': hit_rate})

# Convert the list of dictionaries to a DataFrame
success_k = pd.DataFrame(results)

# Display the results
print(success_k)

   k  hit_rate
0  1  0.704918
1  2  0.901639
2  3  0.950820
3  4  0.983607
4  5  1.000000


In [170]:
# Calculate MRR
# MRR measures how well the model ranks the first relevant item, on average across all queries.
# A k value does not play a role here.
def calculate_mrr(merged_df):
    # Group by 'Industry' to process each query group separately
    grouped_df = merged_df.groupby('Industry')
    total_queries = len(grouped_df)  # Total number of queries
    sum_reciprocal_rank = 0  # Initialize the sum of reciprocal ranks
    
    for name, group in grouped_df:
        # Sort each group just in case it's not sorted by relevance (similarity score)
        group = group.sort_values('similarity_score', ascending=False)
        # Find the index (rank) of the first 'Yes' in the sorted group
        first_relevant_index = group['GPT_ESG_or_not'].eq('Yes').idxmax()
        if group.loc[first_relevant_index, 'GPT_ESG_or_not'] == 'Yes':
            rank = group.index.get_loc(first_relevant_index) + 1  # Get rank (1-based)
            sum_reciprocal_rank += 1 / rank  # Add the reciprocal of the rank to the sum
    
    # Calculate the mean of the reciprocal ranks
    mrr = sum_reciprocal_rank / total_queries  
    return mrr

mrr_score = calculate_mrr(merged_df_final)
print(f"The Mean Reciprocal Rank (MRR) is: {mrr_score}")

The Mean Reciprocal Rank (MRR) is: 0.8311475409836067


In [172]:
# Calculate the precision, recall, and F1 at K
# We set k to the value chosen from the Success at K results
def calculate_precision_recall_at_k_per_query(group, k):
    # Convert 'Yes'/'No' in 'GPT_ESG_or_not' to 1/0 for calculation
    # (assign returns a new frame, so the caller's group is not modified in place)
    group = group.assign(is_correct=(group['GPT_ESG_or_not'] == 'Yes').astype(int))

    # Sort the group by similarity_score in descending order and take top K
    top_k = group.sort_values('similarity_score', ascending=False).head(k)

    # Calculate how many of the top K are correct
    correct_in_top_k = top_k['is_correct'].sum()

    # Calculate Precision at K
    precision_at_k = correct_in_top_k / k

    # Calculate Recall at K
    total_relevant = group['is_correct'].sum()
    recall_at_k = correct_in_top_k / total_relevant if total_relevant > 0 else 0
    
    # Calculate F1 at K
    if precision_at_k + recall_at_k > 0:
        f1_at_k = 2 * (precision_at_k * recall_at_k) / (precision_at_k + recall_at_k)
    else:
        f1_at_k = 0

    return precision_at_k, recall_at_k, f1_at_k

# Apply the function to each Industry group and calculate the mean Precision, Recall, and F1 at K
# merged_df_final holds all of the initial test run results
results = merged_df_final.groupby('Industry').apply(calculate_precision_recall_at_k_per_query, k=3)

# To see the overall average Precision, Recall, and F1 at K
average_precision_at_k = results.map(lambda x: x[0]).mean()
average_recall_at_k = results.map(lambda x: x[1]).mean()
average_f1_at_k = results.map(lambda x: x[2]).mean()

print(f"Average Precision at K: {average_precision_at_k}")
print(f"Average Recall at K: {average_recall_at_k}")
print(f"Average F1 at K: {average_f1_at_k}")


Average Precision at K: 0.792349726775956
Average Recall at K: 0.22519218937948737
Average F1 at K: 0.33208917862745946
