# Recipebook-Approved SQName Generator
This notebook explores [Sentence Transformers](https://huggingface.co/sentence-transformers) to suggest a **Recipebook-Approved SQName**, that is, one in the _us-gaap_ namespace, for a given Name, so that redundant derived SQNames are not generated from custom tickers. The goal is to improve comparability across financial reports. Sentence Transformers are state-of-the-art models for semantic textual similarity; here one serves as a representative high-dimensional embedding model that can capture the semantic similarity between Names and existing SQNames.

## Finetuning a Sentence Transformer
A lightweight pre-trained Hugging Face Sentence Transformer is fine-tuned on the provided dataset to serve this specific use case.

### Data Pre-processing 
To capture the semantics of a Name, noise characters are removed and standard pre-processing steps are applied, in this order:
- Punctuation and special characters are removed
- The sentence is tokenized
- Tokens are converted to lowercase
- Stopwords are removed
- Tokens are lemmatized (reduced to root words)

In [1]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess_sentence(sentence):
    # Remove punctuation
    translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))
    sentence = sentence.translate(translator)
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    # Tokenize the sentence
    tokens = nltk.word_tokenize(sentence)
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Join the tokens back into a sentence
    preprocessed_sentence = ' '.join(tokens)
    return preprocessed_sentence

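def camel_case_split(text):
    # Split a CamelCase tag into space-separated words (argument renamed to avoid shadowing the built-in str)
    return " ".join(re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', text))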

[nltk_data] Downloading package punkt to /Users/meg-kumar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/meg-
[nltk_data]     kumar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/meg-
[nltk_data]     kumar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
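
To make the behaviour of `camel_case_split` concrete, the short check below re-declares the same regular expression so that the snippet is self-contained and runs it on a few tags. It uses only the standard library. The inputs `EBITDAMargin` and `incomeTaxExpense` are made-up examples, not rows from the dataset; the last one shows that a leading lowercase word is dropped by this pattern.

```python
import re

def camel_case_split(text):
    # Same pattern as in the cell above: a capital followed by lowercase
    # letters, or a run of capitals that stops before the next capitalised word.
    return " ".join(re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', text))

print(camel_case_split("NetIncomeLoss"))     # Net Income Loss
print(camel_case_split("EBITDAMargin"))      # EBITDA Margin
print(camel_case_split("incomeTaxExpense"))  # Tax Expense (leading "income" is lost)
```

Since us-gaap tags start with a capital letter, the dropped-prefix case should not normally matter for the recipebook, but it may affect custom tags.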


### Pre-process data and store it in the Names column

In [2]:
import pandas as pd
data = "data/NLP Task Dataset - 23rd June 2023.xlsx"
df = pd.read_excel(data)
df['Names'] = df['Names'].apply(lambda x: x.split('&&'))
df['Names'] = df['Names'].apply(lambda x: [preprocess_sentence(name) for name in x])
df['Names'].head()

0    [net income net income, net income loss attrib...
1    [cash flow investing activity capital expendit...
2    [equity apache shareholder equity, stockholder...
3    [net income tax, provision benefit income tax,...
4    [equity total liability equity, liability equi...
Name: Names, dtype: object

### SQName (URI)
This class implements SQName as a Uniform Resource Identifier (URI) for financial concepts. Because SQNames are used throughout, parsing the namespace and tag once scales better than repeatedly formatting a string to extract that information.

In [3]:
class SQName(str):
    def __init__(
        self,
        uri: str | None = None,
    ):
        # Split "namespace_tag" at the last underscore; anything else is invalid.
        if not isinstance(uri, str) or "_" not in uri:
            raise ValueError(f"Invalid URI: {uri!r}")
        namespace, tag = uri.rsplit("_", 1)

        if any(atom == "" for atom in (namespace, tag)):
            raise ValueError(f"Invalid URI: namespace = {namespace}, tag = {tag}")
        self._namespace = namespace
        self._tag = tag
        super().__init__()

    @property
    def namespace(self):
        """Get the URI namespace."""
        return self._namespace

    @property
    def tag(self):
        """Get the URI tag."""
        return self._tag
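
The splitting rule used by the class can be illustrated in isolation. The sketch below is a hypothetical helper (`split_sqname` is not part of the notebook) that separates the namespace from the tag at the last underscore, assuming every SQName has the form `namespace_Tag`. The identifier `abc_Some_Tag` is invented to show how an extra underscore is handled.

```python
def split_sqname(uri):
    """Hypothetical helper: split 'namespace_tag' at the last underscore."""
    namespace, sep, tag = uri.rpartition("_")
    # rpartition returns an empty separator when no underscore is present
    if not sep or not namespace or not tag:
        raise ValueError(f"Invalid URI: {uri!r}")
    return namespace, tag

print(split_sqname("us-gaap_NetIncomeLoss"))  # ('us-gaap', 'NetIncomeLoss')
print(split_sqname("abc_Some_Tag"))           # ('abc_Some', 'Tag')
```

Splitting at the last underscore means that any underscore inside a tag ends up in the namespace; whether that is the desired behaviour depends on how custom SQNames are formed in the dataset.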

### Data for Finetuning 
- (tag, name): Transformer models are fine-tuned on pairs of positive examples, i.e. texts that are similar to each other, and these pairs need to be extracted from the dataset. The recipebook is taken to be the set of SQNames with _us-gaap_ as their namespace. Positive pairs are formed by pairing the tag of each such SQName with each pre-processed name in its Names list.
- (name1, name2): Additionally, pairs are sampled from within each Names list, since those names are also similar to _each other_. The number of sampled pairs is the item's Frequency multiplied by a sampling factor of 0.5 (scaled down for the scope of this experiment). Sampling is used because enumerating every pair combination would be computationally expensive, and scaling by Frequency gives higher-frequency SQNames a greater weight.

In [5]:
import random
from tqdm import tqdm
pairs = []
SAMPLING_FACTOR = 0.5
for i, row in tqdm(df.iterrows()):
    entity = SQName(row["SQName"])
    if entity.namespace == "us-gaap":
        # Pairs of positive examples (tag, name)
        entity_name = preprocess_sentence(camel_case_split(entity.tag))
        for name in row["Names"]:
            pairs.append((entity_name, name))
        # Pairs of positive examples (name1, name2) sampled from the same tag
        k_samples = int(row["Frequency"] * SAMPLING_FACTOR)
        for _ in range(k_samples):
            # random.choices samples with replacement, so a name may be paired with itself
            pair = tuple(random.choices(row["Names"], k=2))
            pairs.append(pair)
pairs[:10]

15815it [00:01, 9522.36it/s] 


[('net income loss', 'net income net income'),
 ('net income loss', 'net income loss attributable parent'),
 ('net income loss', 'net earnings'),
 ('net income loss', 'net earnings attributable mohawk industry inc'),
 ('net income loss', 'net loss'),
 ('net income loss', 'net income loss'),
 ('net income loss', 'net income omnicom group inc'),
 ('net income loss', 'net income omnicom group inc'),
 ('net income loss', 'net income attributed omnicom group inc'),
 ('net income loss', 'net income loss attributable groupon inc')]
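
The pair-building rule can be checked on a toy row. The sketch below uses only the standard library; `build_pairs`, the toy names, and the frequency of 5 are illustrative assumptions rather than values from the dataset. It produces one (tag, name) pair per name plus `int(Frequency * SAMPLING_FACTOR)` sampled (name1, name2) pairs.

```python
import random

SAMPLING_FACTOR = 0.5

def build_pairs(tag_text, names, frequency, rng):
    """Hypothetical helper mirroring the loop above for a single row."""
    # One positive pair per (tag, name)
    pairs = [(tag_text, name) for name in names]
    # Extra positive pairs sampled from within the Names list
    k_samples = int(frequency * SAMPLING_FACTOR)
    for _ in range(k_samples):
        # choices() samples with replacement, so (x, x) pairs are possible
        pairs.append(tuple(rng.choices(names, k=2)))
    return pairs

rng = random.Random(0)  # seeded for repeatability
toy_names = ["net income", "net earnings", "net loss"]
pairs = build_pairs("net income loss", toy_names, frequency=5, rng=rng)
print(len(pairs))  # 3 tag pairs + int(5 * 0.5) = 2 sampled pairs -> 5
```

If self-pairs are undesirable, `random.sample(names, 2)` would avoid them, at the cost of requiring at least two names per row.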

### Sentence Transformer 
The pre-trained model is [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

In [6]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model_id = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_id)

### Loss Function - Multiple Negatives Ranking Loss
The [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) is suited to training data that contains only positive pairs, such as pairs of paraphrases or pairs of duplicate questions. In this case the pairs are (SQName tag, Name).

In [None]:
# SKIP THIS CELL IF YOU ONLY WANT TO REPRODUCE THE RESULTS; TRAINING TAKES A LONG TIME TO RUN
# THE FINE-TUNED MODEL IS ALREADY SAVED AT PATH_TO_MODEL AND CAN BE LOADED DIRECTLY

train_examples = [InputExample(texts=[pair[0], pair[1]]) for pair in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=10) 
PATH_TO_MODEL = "data/sqname_fine_tuned_model_with_sampling.bin"
model.save(PATH_TO_MODEL)
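
The idea behind this loss can be sketched without any deep-learning library. Within a batch, each anchor is scored against every positive in the batch; its own positive should win, and the other positives act as negatives. The loss is then a cross-entropy over those scores. The pure-Python sketch below is an illustration under assumptions, not the library's implementation: `mnr_loss` is a hypothetical name, the scores are cosine similarities multiplied by a scale of 20 (assumed here), and the 2-dimensional vectors are toy embeddings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i should select positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positive i is close to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positive i is close to the other anchor
print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

Because the other items in a batch serve as negatives, larger batches give each anchor more negatives to discriminate against, which is one reason the batch size matters for this loss.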

In [7]:
PATH_TO_MODEL = "data/sqname_fine_tuned_model_with_sampling.bin"


In [8]:
model = SentenceTransformer(PATH_TO_MODEL)

### Recipebook as embeddings
Using the model above, the recipebook is augmented with embeddings from the fine-tuned model. The tag of each SQName in the _us-gaap_ namespace is pre-processed and then encoded by the model into a 384-dimensional vector.

In [9]:
recipe_df = df[df.apply(lambda x: "us-gaap" in x["SQName"], axis=1)].copy()  # copy so the column assignments below do not write to a view of df
recipe_df["text"]  = recipe_df["SQName"].apply(lambda x: preprocess_sentence(camel_case_split(SQName(x).tag)))
recipe_df['code_embedding'] = recipe_df['text'].apply(lambda x: model.encode([x]))
recipe_df['code_embedding'].head()

0    [[-0.01032155, 0.060474418, 0.053759195, -0.01...
1    [[0.00035809193, 0.0012922849, -0.021274941, 0...
2    [[-0.013603674, -0.06935441, -0.09837998, 0.02...
3    [[-0.017173182, -0.01779158, 0.016376264, -0.0...
4    [[-0.00584583, -0.082059816, -0.032502823, 0.0...
Name: code_embedding, dtype: object

### Similarity - Vector search
The recipebook is searched for the SQNames closest to a given query using vector similarity search. Similarity between vectors is measured with cosine similarity.
- Results are currently sorted by similarity score.
- Only results with a similarity > 0.8 are kept.
- In the future, results could also be sorted by Frequency as a secondary key after similarity.

In [29]:
from sklearn.metrics.pairwise import cosine_similarity
def search_functions(recipe_df, code_query, n=5):
    # Embed the pre-processed query
    embedding = model.encode([preprocess_sentence(code_query)])
    # cosine_similarity returns a 1x1 array here, so take the scalar value
    recipe_df['similarities'] = recipe_df.code_embedding.apply(lambda x: cosine_similarity(x, embedding)[0][0])
    # keep only the rows with similarity > 0.8
    matches = recipe_df[recipe_df["similarities"] > 0.8].dropna()
    res = matches.sort_values(['similarities'], ascending=[False]).head(n)
    suggestions = res["SQName"].tolist()
    if len(suggestions):
        return suggestions
    else:
        return "You are free to create your own extension, there isn't a direct match!"
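
The filtering and ranking logic, including the possible future tie-break on Frequency, can be sketched with plain Python lists. Everything in this sketch is hypothetical: `rank_candidates`, the `(sqname, vector, frequency)` tuple layout, the `us-gaap_Example*` names, and the 2-dimensional toy vectors are assumptions made for illustration only.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_candidates(query_vec, candidates, threshold=0.8, n=5):
    """candidates: list of (sqname, vector, frequency) tuples (hypothetical layout)."""
    scored = []
    for sqname, vec, freq in candidates:
        sim = cosine(query_vec, vec)
        if sim > threshold:  # same cut-off as in the search above
            scored.append((sim, freq, sqname))
    # Sort by similarity first, then by Frequency to break ties
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [sqname for _, _, sqname in scored[:n]]

toy = [
    ("us-gaap_ExampleA", [1.0, 0.0], 10),
    ("us-gaap_ExampleB", [1.0, 0.0], 50),  # same similarity as A, higher Frequency
    ("us-gaap_ExampleC", [0.0, 1.0], 99),  # orthogonal to the query, filtered out
]
print(rank_candidates([1.0, 0.0], toy))  # ['us-gaap_ExampleB', 'us-gaap_ExampleA']
```

An empty list from such a function corresponds to the "no direct match" message returned by the search.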
       

### Examples
Three examples using names randomly sampled from the dataset:
1. Amount of judgment or settlement awarded to (against) the entity in respect of litigation__ which will be fully absorbed by the Company's insurance carrier
2. Income derived from investments and income not otherwise specified in the income statement. Interest income represents earnings which reflect the time value of money or transactions in which the payments are for the use or forbearance of money. Dividend income represents a distribution of earnings to shareholders by investee companies
3. Food__ beverage__ racing and other revenue includes revenue from food and beverage sales__ racing revenue earned from pari-mutuel wagering on live harness racing and simulcast signals from other tracks and miscellaneous income

In [11]:
search_functions(recipe_df, "Amount of judgment or settlement awarded to (against) the entity in respect of litigation__ which will be fully absorbed by the Company's insurance carrier")

['us-gaap_LitigationSettlementAmount',
 'us-gaap_LitigationSettlementAmountAwardedFromOtherParty',
 'us-gaap_LitigationSettlementExpense',
 'us-gaap_PaymentsForLegalSettlements',
 'us-gaap_GainLossRelatedToLitigationSettlement']

In [12]:
search_functions(recipe_df, "Income derived from investments and income not otherwise specified in the income statement. Interest income represents earnings which reflect the time value of money or transactions in which the payments are for the use or forbearance of money. Dividend income represents a distribution of earnings to shareholders by investee companies")

['us-gaap_InvestmentIncomeInterestAndDividend',
 'us-gaap_OtherInterestAndDividendIncome',
 'us-gaap_InterestAndDividendIncomeOperating',
 'us-gaap_InvestmentIncomeDividend',
 'us-gaap_DividendIncomeOperating']

In [13]:
search_functions(recipe_df, "Food__ beverage__ racing and other revenue includes revenue from food and beverage sales__ racing revenue earned from pari-mutuel wagering on live harness racing and simulcast signals from other tracks and miscellaneous income")

['us-gaap_FoodAndBeverageRevenue']

The search also works for Names that are not in the dataset, since it is a similarity search rather than a direct lookup.

In [30]:
search_functions(recipe_df, "Money spent on restaurants and travel")

"You are free to create your own extension, there isn't a direct match!"

In [31]:
search_functions(recipe_df, "Equipment and software includes the cost of computer hardware and software__ and the cost of equipment used in the Company's operations")

['us-gaap_CapitalizedComputerSoftwareGross', 'us-gaap_PaymentsForSoftware']

### Qualitative Analysis
500 "custom" ticker items are sampled, and recipebook-approved SQNames are suggested for each.

In [14]:
random_df = df[df.apply(lambda x: "us-gaap" not in x["SQName"], axis=1)].sample(500)
random_df["text"]  = random_df["SQName"].apply(lambda x: preprocess_sentence(camel_case_split(SQName(x).tag)))
random_df["Recipebook SQName"] = random_df["text"].apply(lambda x: search_functions(recipe_df, x, n=3))


In [16]:
PATH_TO_SAMPLE_XLSX = "data/sqname_suggestions_sample.xlsx"
random_df.to_excel(PATH_TO_SAMPLE_XLSX, index=False)

### Conclusion
All predictions for **Recipebook SQName** are in the _us-gaap_ namespace and have a similarity of more than 0.8 to the query. An empty result suggests that the recipebook contains no obvious match, in which case the item could be considered a **valid** or **appropriate** Extension.