# Company classification task

In [None]:
import os
from scripts.embedding_rf import train_and_eval_embedding_rf
from scripts.bert_eval import evaluate_on_test_set, load_model_and_predict, create_label_encoder
from scripts.utils import load_json, visualize_classification_report, plot_model_comparisons

from scripts.business_classifier import BusinessClassifier

In [2]:
# Initial variables

train_csv_path = 'resources/train.csv'
test_csv_path = 'resources/test.csv'

bert_folder = 'bert_classifier/'

enriched_train_companies_path = 'resources/train_companies.json'
enriched_test_companies_path = 'resources/test_companies.json'

enriched_train_companies = load_json(enriched_train_companies_path)
enriched_test_companies = load_json(enriched_test_companies_path)

labels = [
        "Technology hardware & equipment", "Food drink & tobacco", "Construction",
        "Health care equipment & services", "Conglomerates", "Capital goods",
        "Retailing", "Utilities", "Food markets", "Aerospace & defense",
        "Materials", "Banking", "Oil & gas operations", "Business services & supplies",
        "Insurance", "Semiconductors", "Trading companies", "Diversified financials",
        "Drugs & biotechnology", "Telecommunications services", "Software & services",
        "Household & personal products", "Hotels restaurants & leisure", "Transportation",
        "Consumer durables", "Media", "Chemicals"
    ]

# Graph classifier paths
model_names=["all-MiniLM-L6-v2", "BAAI/bge-m3"]
business_db_path="./chroma_db_biz"
yelp_category_db_path="./chroma_db_yelp_cat"
forbes_category_db_path="./chroma_db_forbes_cat"
business_collection_name="businesses_multi"
category_collection_name="categories_multi"
forbes_path="./resources/train_companies.json"
yelp2forbes_path="./resources/yelp2forbes_business_data.json"
yelp_graph_path="./resources/yelp2forbes_categories.graphml"
forbes_graph_path="./resources/forbes_categories.graphml"
enrichment_db_path="./resources/test_companies.json"

## Building and enriching the dataset

We began with the raw `dataset.csv` of Forbes 2000 companies (columns `COMPANY`, `CATEGORY`) and performed a stratified split per category: 70% of each category's entries went into `train.csv` and the remaining 30% into `test.csv`. This keeps every category represented in the same proportion in both sets, so the class distribution of the full dataset is preserved across our supervised experiments.
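
The split described above can be sketched as follows. This is a minimal illustration, not the script used to produce `train.csv` and `test.csv`: the function name `stratified_split`, the seed, and the rounding rule are assumptions, and the rows are toy data.

```python
import random
from collections import defaultdict

def stratified_split(rows, train_frac=0.7, seed=42):
    """Split (company, category) rows so each category keeps ~train_frac in train."""
    by_category = defaultdict(list)
    for company, category in rows:
        by_category[category].append((company, category))

    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):
        items = by_category[category]
        rng.shuffle(items)
        cut = round(len(items) * train_frac)
        # Assumption: keep at least one example on each side when possible.
        if len(items) > 1:
            cut = min(max(cut, 1), len(items) - 1)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Toy rows, not taken from the real dataset.
rows = [(f"Bank {i}", "Banking") for i in range(10)]
rows += [(f"Chip {i}", "Semiconductors") for i in range(4)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
```

Splitting inside each category, rather than over the whole file, is what guarantees that small categories still appear in both sets.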

## BigPicture API enrichment

Next, for each company in `train.csv` (and later in `test.csv`), we invoked the BigPicture Company Enrichment API in two steps:

1. Search by name to retrieve the highest-confidence domain.
2. Find by that domain to obtain detailed metadata (tags, category taxonomy, description, LinkedIn handle, etc.).

We extracted and saved only the fields we needed (name, domain, confidence, tags, category, linkedin, description, and id) into enriched JSON datasets (`train_companies.json` / `test_companies.json`).

## Subcategory graph construction

### Forbes subcategory graph 

Using the enriched Forbes data, we treated each returned tag (e.g., "B2B", "Alumni", "Professional Networking") as a Forbes subcategory. We then:

1. Prompted a compact `Meta-Llama-3.2-8B-Instruct` model with examples of businesses for each tag to generate a brief “Focus on …” keyword description (up to 5 keywords).
2. Embedded these generated descriptions along with the official Forbes category names using a MultiModelEmbedder (concatenated, normalized embeddings from multiple SentenceTransformer models).
3. Built a bipartite graph (`resources/forbes_categories.graphml`) where:
    - Nodes are subcategories (tags) and Forbes categories.
    - Edge weights combine:
        - Frequency weight of tag occurrences under each category (normalized per tag), and
        - Cosine similarity between tag-description embeddings and category-name embeddings.

This graph captures both empirical co-occurrence and semantic relatedness.
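
To make the edge weighting concrete, here is a small sketch of the two signals. The text only says the weights "combine" frequency and similarity, so the linear blend in `edge_weight` and its `mix` parameter are assumptions, and the vectors are toy values rather than real embeddings.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def frequency_weights(tag_category_pairs):
    """Count tag occurrences per category and normalize per tag."""
    counts = defaultdict(Counter)
    for tag, category in tag_category_pairs:
        counts[tag][category] += 1
    return {
        tag: {cat: n / sum(per_cat.values()) for cat, n in per_cat.items()}
        for tag, per_cat in counts.items()
    }

def edge_weight(freq, similarity, mix=0.5):
    # Assumption: a linear blend; the actual combination rule may differ.
    return mix * freq + (1 - mix) * similarity

# Toy data: one tag seen three times across two categories.
pairs = [("Loans", "Banking"), ("Loans", "Banking"), ("Loans", "Diversified financials")]
freq = frequency_weights(pairs)
tag_vec, category_vec = [1.0, 0.0], [1.0, 1.0]
weight = edge_weight(freq["Loans"]["Banking"], cosine(tag_vec, category_vec))
print(round(weight, 4))
```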


### Yelp subcategory graph

In parallel, we processed a cleaned copy of the Yelp dataset (sourced from the Yelp Open Dataset) to extract each business's granular Yelp categories (e.g., "Bubble Tea", "Hair Salons"). We:

- Sampled up to five businesses per Yelp category and prompted the same Meta-Llama model to produce concise "Focus on ..." keyword descriptions.
- Computed embeddings and constructed a second bipartite graph (`resources/yelp2forbes_categories.graphml`) linking Yelp subcategories to the same set of Forbes categories, with embedding-based cosine similarity as the edge weights.

## Vector databases for retrieval

Finally, we created local ChromaDB vector stores for:

1. Forbes subcategory embeddings (tag descriptions) - `chroma_db_forbes_cat/`,
2. Yelp subcategory embeddings (category descriptions) - `chroma_db_yelp_cat/`,
3. Company name embeddings (from the enriched datasets) - `chroma_db_biz/`

These vector DBs support fast nearest-neighbor retrieval for downstream RAG pipelines, similarity-based classification, and interactive analysis. They are the point where the enriched metadata, the semantic graphs, and the classifiers are tied together.
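
As an illustration of what nearest-neighbor retrieval does, here is a brute-force stand-in written in plain Python. It is not the ChromaDB API: the `top_k` helper, the store layout, and the vectors are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, store, k=3):
    """Return the k (id, similarity) pairs closest to query_vec."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional vectors standing in for stored embeddings.
store = {
    "company_a": [0.9, 0.1, 0.0],
    "company_b": [0.0, 0.2, 0.9],
    "company_c": [0.8, 0.3, 0.1],
}
neighbors = top_k([1.0, 0.2, 0.0], store, k=2)
print([item_id for item_id, _ in neighbors])
```

A vector database returns the same kind of ranked list, but uses an index so that it does not have to score every stored vector.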


Example of bipartite graph:

![](./docs/Graph.drawio.png)

## Random forest with embeddings classifier

Using concatenated embeddings from several `SentenceTransformer` models, I embedded the company names and trained a simple random forest classifier on them.
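
The idea of a concatenated, normalized multi-model embedding can be sketched like this. The two toy encoders are stand-ins for real models, and normalizing each part before concatenation and then once more afterwards is an assumption, not necessarily what `MultiModelEmbedder` does.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def multi_embed(text, encoders):
    """Normalize each model's vector, concatenate, then normalize the result."""
    combined = []
    for encode in encoders:
        combined.extend(l2_normalize(encode(text)))
    return l2_normalize(combined)

# Toy stand-ins for real embedding models (assumption: each returns a list of floats).
def toy_encoder_a(text):
    return [float(len(text)), float(text.count(" ") + 1)]

def toy_encoder_b(text):
    return [float(sum(c.isupper() for c in text)), 1.0, float(text.lower().count("o"))]

vector = multi_embed("ExxonMobil", [toy_encoder_a, toy_encoder_b])
print(len(vector))
```

Normalizing each part first keeps a model with large raw values from dominating the concatenated vector.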

In [None]:
y_true_rf, y_pred_rf = train_and_eval_embedding_rf(train_csv_path,test_csv_path)

In [None]:
visualize_classification_report(y_true_rf, y_pred_rf, [l.lower() for l in labels], output_dir='results_rf')

Since banking examples are likely overrepresented in the training set, the classifier is clearly biased towards the banking category: it often misclassifies other companies as banking, which hurts performance. It still achieves relatively good accuracy because it classifies the largest category correctly, but it falls short in overall F1 score.

![](./results_rf/category_accuracy.png)

![](./results_rf/confusion_matrix.png)
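
The gap between accuracy and F1 on an imbalanced set can be reproduced with a toy example. The labels below are made up for illustration and are not the random forest's actual predictions; the sketch uses macro-averaged F1, which is an assumption about which average is meant.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# A classifier that always answers "banking" on a banking-heavy toy set.
y_true = ["banking"] * 8 + ["media", "chemicals"]
y_pred = ["banking"] * 10
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(round(acc, 3), round(f1, 3))
```

Accuracy rewards getting the majority class right, while macro F1 weights every class equally, so the two minority classes with zero F1 pull the average down sharply.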

## BERT based sequence classifier

Using a BERT encoder for sequence classification, I fine-tuned the pretrained `bert-base-uncased` model for the task of business category classification. With company names and descriptions as input and the category labels as output, the model can be trained to perform relatively well on the task.

In [None]:
y_true_bert, y_pred_bert = evaluate_on_test_set(train_csv_path, test_csv_path, enriched_test_companies_path, bert_folder)

In [None]:
visualize_classification_report(y_true_bert, y_pred_bert, [l.lower() for l in labels], output_dir='results_bert')

In [5]:
# Prediction on single example from test set

COMPANY_NAME = 'ExxonMobil'

enriched_data = {entry["name"].strip().lower(): entry for entry in enriched_test_companies if entry["name"] is not None}

le = create_label_encoder(train_csv_path)

res = load_model_and_predict(bert_folder, COMPANY_NAME, enriched_data, le)

print(f'Predicted: {res}')

Predicted: oil & gas operations


From the results we can observe that the model classifies some specific categories very well, achieving high recall. It works less well on categories whose companies tend to span several industries, such as conglomerates, trading companies, and diversified financials. The confusion matrix also shows that diversified financials were often confused with other categories, for example with banking, which has a very similar semantic meaning.

![](./results_bert/category_accuracy.png)

![](./results_bert/confusion_matrix.png)

## Graph based classifier

The graph based classifier works based on precomputed company embeddings from `train.csv` and precomputed company category subcategories. 

Steps:
1. The queried company name string is first embedded using multiple `SentenceTransformer` models (`all-MiniLM-L6-v2`, `BAAI/bge-m3`) to create a vector representation of the string.
2. We query the company-name embedding database, which stores the embedded company names from `train.csv`. The result is the top-k companies most similar to the queried name. If the similarity to another company name is very high (above a threshold $\alpha = 0.9$), I store the category of the most similar company above the threshold.
3. Optionally, I can use the BigPicture API, which provides up-to-date details and a description of the company. This step can be skipped, but it helps when the company name alone is too ambiguous to classify.

> In our case, I queried the API once and got results for most of the companies in `dataset.csv`, so I can reuse the stored results without hitting the API limit. The BigPicture API also provides its own category classification, but I use only the company description it returns.


4. Using the enriched company description (or just the company name), I query the vector databases for the closest subcategories in both the Yelp and Forbes subcategory collections. This returns the top-k subcategory nodes whose description embeddings are closest to the company description or company name.
5. The category scores are then derived from the bipartite graphs created earlier. For each of the two graphs:

    1. Take the subcategory nodes retrieved in step 4 (e.g. Finance, Wealth Management, Loans...).
    2. Each subcategory is connected to the classification nodes by weighted edges.
    3. For each subcategory node, collect the edge weights to its connected classification nodes, sum them per category, and normalize.
    4. This yields a dictionary of scores over all categories (e.g. Banking, Diversified Financials...).

    The results from both graphs are then combined with a factor $\beta$: $\text{Scores} = \beta \cdot \text{Yelp} + (1-\beta) \cdot \text{Forbes}$

6. If a high-similarity company was stored in step 2, its category is merged into the final scores, boosting the category of the company that is very similar to the queried one. The prediction is the category with the highest score.
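
Steps 5 and 6 can be sketched with plain dictionaries in place of the GraphML graphs. This is not the `BusinessClassifier` implementation: the helper names, the additive form of the boost, and all weights below are assumptions chosen for illustration.

```python
from collections import defaultdict

def graph_scores(subcategories, graph):
    """Sum edge weights per category over the retrieved subcategories, then normalize."""
    totals = defaultdict(float)
    for sub in subcategories:
        for category, weight in graph.get(sub, {}).items():
            totals[category] += weight
    norm = sum(totals.values())
    return {cat: score / norm for cat, score in totals.items()} if norm else {}

def combine(yelp, forbes, beta=0.2):
    """Scores = beta * Yelp + (1 - beta) * Forbes."""
    categories = set(yelp) | set(forbes)
    return {c: beta * yelp.get(c, 0.0) + (1 - beta) * forbes.get(c, 0.0) for c in categories}

def boost(scores, similar_category, similarity, alpha=0.9):
    # Assumption: add the similarity to the matched category; the text only says "boost".
    boosted = dict(scores)
    if similar_category is not None and similarity > alpha:
        boosted[similar_category] = boosted.get(similar_category, 0.0) + similarity
    return boosted

# Toy graphs: subcategory -> {category: edge weight}.
forbes_graph = {
    "Loans": {"Banking": 0.6, "Diversified financials": 0.4},
    "Wealth Management": {"Diversified financials": 0.7, "Banking": 0.3},
}
yelp_graph = {"Banks & Credit Unions": {"Banking": 1.0}}

forbes = graph_scores(["Loans", "Wealth Management"], forbes_graph)
yelp = graph_scores(["Banks & Credit Unions"], yelp_graph)
combined = combine(yelp, forbes, beta=0.2)
final = boost(combined, "Banking", similarity=0.95)
prediction = max(final, key=final.get)
print(prediction)
```

With $\beta = 0.2$, the Forbes graph carries most of the weight, which matches the value passed to `classifier.evaluate` in this notebook.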



![](./docs/Graph%20classifier.drawio.png)

In [6]:
classifier = BusinessClassifier(
    model_names=model_names,
    business_db_path=business_db_path,
    yelp_category_db_path=yelp_category_db_path,
    forbes_category_db_path=forbes_category_db_path,
    business_collection_name=business_collection_name,
    category_collection_name=category_collection_name,
    forbes_path=forbes_path,
    yelp2forbes_path=yelp2forbes_path,
    yelp_graph_path=yelp_graph_path,
    forbes_graph_path=forbes_graph_path,
    enrichment_db_path=enrichment_db_path
)

Loading embedding models: all-MiniLM-L6-v2, BAAI/bge-m3
Loading business database from ./chroma_db_biz
Successfully loaded business collection 'businesses_multi'
Loading Yelp category database from ./chroma_db_yelp_cat
Successfully loaded Yelp category collection 'categories_multi'
Loading Forbes category database from ./chroma_db_forbes_cat
Successfully loaded Forbes category collection 'categories_multi'
Loading Yelp category graph from ./resources/yelp2forbes_categories.graphml
Successfully loaded graph with 631 nodes and 16386 edges
Loading Forbes category graph from ./resources/forbes_categories.graphml
Successfully loaded graph with 579 nodes and 14904 edges
Loaded 27 classification categories
Loaded 604 Yelp subcategories
Loaded 552 Forbes subcategories
Loading company data from ./resources/train_companies.json
Loaded 1112 company records
Loading Yelp to Forbes mapping from ./resources/yelp2forbes_business_data.json
Loaded 114023 Yelp-Forbes mappings


In [None]:

# Evaluation with raw business names
evaluation_results = classifier.evaluate(
    test_file_path=test_csv_path,
    results_dir="results",
    use_enrichment=False,
    top_k_businesses=3,
    top_k_yelp=10,
    top_k_forbes=10,
    beta=0.2
)

# Evaluation with enriched business descriptions
evaluation_results = classifier.evaluate(
    test_file_path=test_csv_path,
    results_dir="results_enriched",
    use_enrichment=True,
    top_k_businesses=3,
    top_k_yelp=10,
    top_k_forbes=10,
    beta=0.2
)

In [9]:
# Prediction on single example from test set

COMPANY_NAME = 'ExxonMobil'

res_g = classifier.predict(
    # Business name
    business_name=COMPANY_NAME,

    # Whether to add more context to business name from API (precomputed on test set)
    use_enrichment=True,
    
    # Other variables...
    beta=0.2
)

In [10]:
print(f'Predicted: {res_g["predictions"][0]["category"]}')

Predicted: Oil & gas operations


### Raw business name classifier results

Below is the category-wise accuracy on the test set. Because the dataset is unbalanced, some classification nodes are underrepresented among the subcategories, which makes them less likely to be predicted. Retailing, for example, was frequently misclassified into other categories, most likely because it is a more general term that is closely connected to many other categories.

![](./results/category_accuracy.png)

![](./results/confusion_matrix.png)



### Enriched business name with description classifier results

Overall, enriching the business names with descriptions boosted performance, although some categories remained underrepresented and less likely to be classified correctly. In some cases the enrichment also made things worse: one category (`Food markets`) that was previously classified correctly at least some of the time was misclassified completely.

![](./results_enriched/category_accuracy.png)

![](./results_enriched/confusion_matrix.png)


# Comparisons between classifiers

Here we can see the overall performance comparison between the models created for the task. The enriched graph classifier and the enriched BERT classifier performed best, both reaching ~47% accuracy. However, the graph classifier produced its classifications faster, since it works with precomputed embeddings rather than running the heavier inference of the BERT model.

![](./docs/model_comparison.png)

In [4]:
base_models_evals = [
    'results_rf/classification_report.json',
    'results_bert/classification_report.json',
]

graph_results = [
    'results/evaluation_results.json',
    'results_enriched/evaluation_results.json'
]

plot_model_comparisons(base_models_evals, graph_results, labels=['Random Forest', 'Enriched BERT Classifier', 'Graph classifier', 'Enriched graph classifier'])

# Possible improvements

- Since some models perform better on some categories, we could use the basic models for the faster and simpler classifications (e.g. banking or telecommunication service companies). Categories that are harder to classify would be handled by other models fine-tuned specifically on them. This way the models complement each other and help get better results (a kind of ensemble of classifiers).

- There is also the problem of ambiguity: some companies are not easy to identify from the name alone, so it helps to provide more context and relevant up-to-date information about the company, which gives the models more to work with. If we were working with unseen business names, we could also clean the text and normalize the names.

- We could also use a very large LLM (or, for example, OpenAI embeddings) to help with predictions, perhaps as a last step that produces the most up-to-date and accurate final prediction.
    - In that case, we could pass the predictions from the other models and the closest retrieved subcategories into the prompt as RAG (retrieval-augmented generation) context for the LLM.
    - Furthermore, the LLM could help disambiguate harder examples, such as conglomerates and trading companies, by narrowing down the exact industry, perhaps with few-shot examples of those categories in the prompt.
    - In our case, we observed that many companies were incorrectly classified into broad categories like `Business Services & Supplies` and `Retailing`. To address this, whenever the base classifier predicts one of these "risky" classes, we can trigger a second-opinion model, such as an LLM using RAG. The LLM would re-evaluate the prediction using additional context and company details. We could guide it by offering a limited set of alternative classes to choose from, based on common misclassification patterns.
```
Given this company and its description, decide between Trading Company, Retailing, or Conglomerate.
Company: [ENRICHED COMPANY INFORMATION]
Here are some examples of (business name, category) pairs: [EXAMPLES]
You may only output these classes: Trading Company, Retailing, or Conglomerate.
```
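
The routing around such a prompt could look roughly like the sketch below. It only builds the prompt and does not call any LLM; `RISKY_CLASSES`, both helper functions, and the example inputs are hypothetical names and values introduced for illustration.

```python
# Classes the text above reports as frequently over-predicted.
RISKY_CLASSES = {"business services & supplies", "retailing"}

def needs_second_opinion(predicted_category):
    """Return True when the base prediction falls into a risky class."""
    return predicted_category.strip().lower() in RISKY_CLASSES

def build_second_opinion_prompt(company_info, candidates, examples):
    """Fill the prompt template with company details and a restricted class list."""
    options = ", ".join(candidates[:-1]) + f", or {candidates[-1]}"
    example_text = "; ".join(f"({name}, {category})" for name, category in examples)
    return (
        f"Given this company and its description, decide between {options}.\n"
        f"Company: {company_info}\n"
        f"Here are some examples of (business name, category) pairs: {example_text}\n"
        f"You may only output these classes: {options}."
    )

base_prediction = "Retailing"
prompt = None
if needs_second_opinion(base_prediction):
    prompt = build_second_opinion_prompt(
        company_info="Example Co: imports and resells consumer goods",  # made-up input
        candidates=["Trading companies", "Retailing", "Conglomerates"],
        examples=[("Example Mart", "Retailing")],  # made-up few-shot pair
    )
print(prompt)
```

Restricting the second model to a short candidate list keeps its output easy to validate against the known label set.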

# Conclusion

Improving business classification could involve balancing accuracy with efficiency. One approach is a multi-model ensemble, where simpler models handle easy-to-classify categories (e.g., Banking, Telecommunications) quickly and cost-effectively, while specialized models fine-tuned for harder categories handle the more complex cases. This strategy aims to deliver both speed and accuracy.

Additionally, incorporating contextual information, such as company descriptions or enriched data, helps address ambiguity in company names, especially for businesses with complex or overlapping industries.

By leveraging these techniques, we can optimize accuracy and reduce resource consumption, ensuring a balance between performance and cost efficiency, crucial when dealing with large-scale classification tasks.