# Multiclass Text Classification Development

Purpose:
This model predicts a company's business category from the text of its website homepage.

Hypothesis: 
The implicit hypothesis is that websites within each category will use distinctive language that can be used to classify them.

Overall process:
1. Normalize Text (done during eda.ipynb to complete EDA)
2. Label Encoding
3. Feature Extraction (TFIDF & BERT)
4. Model Training
5. Evaluate the best-performing model and vectorization method

In [3]:
import pandas as pd
import os 
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline

# from transformers import BertTokenizer, BertModel
from transformers import DistilBertTokenizer, DistilBertModel #smaller and faster than BERT
import torch

from sklearn.base import TransformerMixin, BaseEstimator

# read in data

In [4]:
# read data back in from pickle file created with eda.ipynb

# Dynamically get the current working directory
current_dir = os.getcwd()
text_path = os.path.abspath(os.path.join(current_dir, '..', 'output','combined_data.pkl'))

# read data back in 
df_clean = pd.read_pickle(text_path)
df_clean.head()

Unnamed: 0,Company_ID,CompanyName,Website,Industry,Size_Range,Locality,Country,Current_Employee_Estimate,Total_Employee_Estimate,Category,...,h3,nav_link_text,meta_keywords,meta_description,len_homepage_text,Full_Text,len_Full_Text,clean_text,len_clean_text,clean_text_str
0,99,crinan hotel,crinanhotel.com,hospitality,1 - 10,"ardchonell, argyll and bute, united kingdom",united kingdom,1,3,Corporate Services,...,Accommodation#sep#Activities#sep#Experience Cr...,,"Crinan hotel, country house hotel, boutique ho...",Crinan Hotel - on waterfront overlooking Loch ...,3467,01546 830261 Crinan · by Lochgilp...,3665,"[crinan, lochgilphead, pa, sr, hotel, history,...",2012,crinan lochgilphead pa sr hotel history ryan f...
1,222,"spot on productions, llc",spotonproductionsllc.com,entertainment,1 - 10,"jackson, mississippi, united states",united states,2,3,"Media, Marketing & Sales",...,,,,"We're Philip Scarborough and Tom Beck, the for...",45,...,75,"[reels, work, storytelling, brought, life, phi...",38,reels work storytelling brought life philip sc...
2,535,akhand jyoti eye hospital,akhandjyoti.in,hospital & health care,11 - 50,"saran, bihar, india",india,8,11,Healthcare,...,Our Girls Help#sep#Donate In Specific Programs...,"why blindness,women empowerment,our impact,abo...",Akhand Jyoti - the largest eye hospital in eas...,"Akhandjyoti, akhand jyoti eye hospital, non-pr...",909,Donate ...,1015,"[donate, gift, someone, sight, support, girl, ...",628,donate gift someone sight support girl child b...
3,642,lasercare eye center,dfweyes.com,medical practice,1 - 10,"irving, texas, united states",united states,4,11,Healthcare,...,,"home,why choose us,new patient information,pat...",,Call 214.574.9600 TODAY for an appointment! Th...,1633,...,1820,"[lasik, hotline, main, number, toll, free, irv...",1210,lasik hotline main number toll free irving tx ...
4,675,compumachine inc,compumachine.com,machinery,1 - 10,"danvers, massachusetts, united states",united states,4,9,Industrials,...,,"home,machines,automation,mastercam,services,ab...",,Compumachine is proud to offer CNC Machine Too...,192,MACHINES & AUTOMATION HOME MACHINE...,228,"[machines, automation, machines, automation, m...",170,machines automation machines automation master...


# Trim the Dataset

There are about 71k rows in this dataset. Because I am training on my local machine, I have decided to take a random sample for training. 

The limitation is that I could miss out on key information, but it's my best option since I am not using a GPU. 

As the dataset has a decent mix of categories, a random sample should roughly maintain the same category distribution. However, the industries represented
within each category might change. 
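One way to check the assumption above is to compare category shares in the full data and in the sample. This is a minimal sketch on toy data: `compare_category_shares` is a hypothetical helper, and the toy frame only stands in for `df_clean`.

```python
import pandas as pd

def compare_category_shares(full: pd.DataFrame, sample: pd.DataFrame, col: str = "Category") -> pd.DataFrame:
    """Return per-category shares in the full data and the sample, plus their absolute difference."""
    shares = pd.DataFrame({
        "full_share": full[col].value_counts(normalize=True),
        "sample_share": sample[col].value_counts(normalize=True),
    }).fillna(0.0)  # a category missing from the sample gets a share of 0
    shares["abs_diff"] = (shares["full_share"] - shares["sample_share"]).abs()
    return shares.sort_values("abs_diff", ascending=False)

# Toy stand-in for df_clean (hypothetical data, not the real dataset)
toy = pd.DataFrame({"Category": ["Healthcare"] * 600 + ["Financials"] * 300 + ["Materials"] * 100})
toy_sample = toy.sample(n=200, random_state=42)

report = compare_category_shares(toy, toy_sample)
print(report)
```

If the differences turn out to be large for small categories, a stratified sample (for example `df.groupby('Category').sample(frac=...)`) would be one alternative to consider.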


In [5]:
# Get a random sample of 5k rows
df_sample = df_clean.sample(n=5000)

In [139]:
# pickle out results for using later in precomputed embeddings
text_path2 = os.path.abspath(os.path.join(current_dir, '..', 'output','sample_data.pkl'))
df_sample.to_pickle(text_path2)

In [6]:
import plotly_express as px

# Group by Category 
grouped_df = df_sample.groupby(['Category'], as_index=False)['Website'].nunique()

# Create bar plot
fig = px.bar(
    grouped_df,
    x='Category',
    y='Website',
    color='Category',
    title='Unique Company Websites by Category',
    barmode='group'  # Group bars by category
)

# Adjust the axes to scale automatically per group
fig.update_yaxes(matches=None)  # This ensures y-axes are independent

# Show the plot
fig.show()

# Label Encoding

In [8]:
#Turning the labels into numbers
label_encoder = LabelEncoder()
df_sample['Category_encoded'] = label_encoder.fit_transform(df_sample['Category'])
print(df_sample['Category'].unique())
print(df_sample['Category_encoded'].unique())

['Industrials' 'Media, Marketing & Sales' 'Professional Services'
 'Information Technology' 'Consumer Staples' 'Energy & Utilities'
 'Transportation & Logistics' 'Materials' 'Commercial Services & Supplies'
 'Healthcare' 'Corporate Services' 'Consumer Discretionary' 'Financials']
[ 7 10 11  8  2  4 12  9  0  6  3  1  5]


# Feature Extraction

I'm going to use k-fold cross-validation to evaluate my models later on. 
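As a quick illustration of what 10-fold cross-validation does, the sketch below splits 20 toy samples with the same `KFold` settings used later. The toy array is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)  # 20 toy samples (not the real features)
kf_toy = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
seen_in_validation = []
for train_idx, val_idx in kf_toy.split(X_toy):
    # Each fold trains on 9/10 of the data and validates on the remaining 1/10
    fold_sizes.append((len(train_idx), len(val_idx)))
    seen_in_validation.extend(val_idx.tolist())

print(fold_sizes[0])
# Every sample lands in a validation fold exactly once
print(sorted(seen_in_validation) == list(range(20)))
```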

In [10]:
# split the data into features (X) and labels (y)
X = df_sample['clean_text_str']
y = df_sample['Category_encoded']

print (X.shape)
print(y.shape)

(5000,)
(5000,)


In [11]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42) # using the normal 10 folds

In [12]:
# Define the classification models to be tested
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(multi_class='ovr', max_iter=1000),
    'SGD Classifier': SGDClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100)
}

## Notes:

I've chosen the following models to test: 
1. Naive Bayes
    - Pros:
        - This model is extremely fast and in production can be used as an 'online' model (i.e. can be updated in real time)
        - MultinomialNB is usually very good with discrete features like word counts. 
        - Works well with text data and can handle high-dimensional data well (low memory usage)
    - Cons:
        - Assumes independence (words are not independent in real life)
        - Might be too simplistic --> might not work as well for certain industries 
2. Logistic Regression (ovr)
    - Pros:
        - Easy to interpret
        - Works well when the relationship between features and classes is roughly linear (--> frequency of terms correlates with business category)
        - regularization helps with overfitting
    - Cons:
        - works best when the classes are roughly linearly separable
        - sensitive to outliers
        - not the best with a large number of features (works best with small or medium-sized datasets)
3. Stochastic Gradient Descent (SGD)
    - Pros:
        - Can scale well for large datasets
        - supports regularization to reduce overfitting
        - Can learn incrementally in streams or batches
        - can be used with different loss functions
    - Cons:
        - requires careful hyperparameter tuning 
        - might require a lot of iterations
        - can be unstable
        - not as interpretable as other algorithms

4. Random Forest
    - Pros:
        - robust to overfitting
        - can handle non-linear relationships
        - gives some insight into which features to select
        - tolerates missing data and outliers
    - Cons:
        - slow and computationally expensive
        - not as interpretable
        - can struggle with text data or data with high-dimensionality (if chosen would benefit from dimensionality reduction)
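The "online" property mentioned for Naive Bayes (and the incremental learning noted for SGD) can be sketched with `partial_fit`. This is a toy example, not part of the notebook's pipeline: the texts and labels are made up, and `HashingVectorizer` is swapped in for TF-IDF because it needs no fitted vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# alternate_sign=False keeps hashed features non-negative, which MultinomialNB requires
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)
clf = MultinomialNB()
all_classes = np.array([0, 1])  # the full label set must be declared on the first partial_fit call

# First batch of (toy) data: 0 = healthcare-like text, 1 = finance-like text
batch_1 = ["clinic patients doctor care", "bank loans mortgage savings"]
clf.partial_fit(vectorizer.transform(batch_1), [0, 1], classes=all_classes)

# A later batch updates the same model without retraining from scratch
batch_2 = ["hospital surgery patients", "investment bank savings account"]
clf.partial_fit(vectorizer.transform(batch_2), [0, 1])

pred = clf.predict(vectorizer.transform(["patients doctor hospital", "savings bank loans"]))
print(pred)
```

`SGDClassifier` exposes the same `partial_fit` interface, so the update loop would look the same for that model.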


## DistilBERT embeddings

Because DistilBERT is bidirectional, it accounts for the context of words, while TF-IDF does not. BERT uses deep neural networks, so it will be
more computationally expensive than TF-IDF, but it's still worth testing. I'm using DistilBERT rather than BERT because I'm not working with a GPU. 
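A small example of what "TF-IDF does not account for context" means in practice: two sentences with the same words in a different order get the same TF-IDF vector. The sentences are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Same bag of words, opposite meaning
docs = ["dog bites man", "man bites dog"]
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(docs).toarray()

# TF-IDF only sees which words occur and how often, so both rows are identical
same_vector = np.allclose(vectors[0], vectors[1])
print(same_vector)
```

A contextual model such as DistilBERT produces token representations that depend on the surrounding words, so it can in principle tell these two sentences apart.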

In [59]:
class DistilBERTEmbeddingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name='distilbert-base-uncased', max_length=512): #use distilbert because it's smaller and faster than BERT
        self.model_name = model_name  # Explicitly set model_name
        self.tokenizer = DistilBertTokenizer.from_pretrained(self.model_name)
        self.model = DistilBertModel.from_pretrained(self.model_name)
        self.max_length = max_length

    def fit(self, X, y=None):
        # Fit method required for scikit-learn compatibility
        return self

    def transform(self, X):
        embeddings = []
        for text in X:
            # Tokenize and convert text to tensors
            tokens = self.tokenizer(text, return_tensors='pt', padding='max_length',
                                    truncation=True, max_length=self.max_length)
            
            # Ensure no gradient computation for embeddings
            with torch.no_grad():
                output = self.model(**tokens)
                # Take the CLS token embedding
                cls_embedding = output.last_hidden_state[:, 0, :].numpy()
            
            embeddings.append(cls_embedding[0])  # Append as numpy array
            
        return np.array(embeddings)


In [71]:
# Precompute BERT embeddings for X_train and X_test --> results already precomputed and saved in model folder

# instead of recalculating the BERT embeddings during cross-validation, precompute them
# once and store them 
# this should greatly reduce time for model training and evaluation
bert_transformer = DistilBERTEmbeddingTransformer()
# X_train_embeddings = bert_transformer.transform(X_train)
# X_test_embeddings = bert_transformer.transform(X_test)



`clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884



In [72]:
# # Save the embeddings to disk for future use because it's too computationally expensive to run these
# np.save('X_train_distilbert.npy', X_train_embeddings)
# np.save('X_test_distilbert.npy', X_test_embeddings)


In [93]:
# # load embeddings
X_train_embeddings2 = np.load(os.path.abspath(os.path.join(current_dir, '..', 'model','X_train_distilbert.npy')))
X_test_embeddings2 = np.load(os.path.join(current_dir, '..', 'model','X_test_distilbert.npy'))
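The compute-once-then-load pattern above can be wrapped in a small helper. This is a sketch under assumptions: `load_or_compute` is a hypothetical function, and `fake_embed` stands in for the slow `bert_transformer.transform(X_train)` call.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute_fn):
    """Load an array from `path` if it exists; otherwise compute it, save it, and return it."""
    if os.path.exists(path):
        return np.load(path)
    array = np.asarray(compute_fn())
    np.save(path, array)
    return array

# Stand-in for the slow embedding step (hypothetical)
calls = {"count": 0}

def fake_embed():
    calls["count"] += 1
    return np.ones((3, 768), dtype=np.float32)  # 768 is the DistilBERT-base hidden size

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "X_train_distilbert.npy")
    first = load_or_compute(cache_path, fake_embed)   # computes and saves
    second = load_or_compute(cache_path, fake_embed)  # loads from disk

print(calls["count"], second.shape)
```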

# Model Evaluation

In [67]:
# Initialize lists to store results for comparison
results = []
tf_report_list = []

# Iterate over the models and compare pipelines (TF-IDF vs BERT embeddings)
for model_name, model in models.items():
    print(f"\n=== {model_name} ===")
    ### TF-IDF Pipeline
    tfidf_model_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(max_features=100, stop_words='english')),
        (model_name, model)
    ])

    # Cross-Validation for TF-IDF
    tfidf_scores = cross_val_score(tfidf_model_pipeline, X_train, y_train, cv=kf, scoring='accuracy')
    
    # Train on TF-IDF
    tfidf_model_pipeline.fit(X_train, y_train)
    y_pred_tfidf = tfidf_model_pipeline.predict(X_test)
    tfidf_accuracy = accuracy_score(y_test, y_pred_tfidf)
    tfidf_report = classification_report(y_test, y_pred_tfidf, target_names=label_encoder.classes_, output_dict=True)

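    # Note: despite the label, the value printed here is the test-set accuracy;
    # the dict printed on the next line is the classification report
    print(f"TF-IDF {model_name} Confusion Matrix of Category Performance: {tfidf_accuracy}")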
    print(tfidf_report)

    # tfidf_report['Commercial Services & Supplies']
    tf_df = pd.DataFrame([tfidf_report]).T
    tf_df = tf_df.reset_index().rename(columns={'index':'Category'})
    tf_df_report = tf_df[0].apply(pd.Series)

    tf_final = pd.merge(tf_df['Category'], tf_df_report, left_index=True,right_index=True)

    # filter dataframe: keep only the per-category rows
    summary_rows = ['weighted avg', 'macro avg', 'accuracy']
    tf_clean = tf_final.loc[~tf_final['Category'].isin(summary_rows)].copy()  # .copy() avoids SettingWithCopyWarning
    tf_clean['model'] = model_name
    tf_report_list.append(tf_clean)

    # Store TF-IDF results
    results.append({
        'Model': model_name,
        'Pipeline': 'TF-IDF',
        'Cross_Val_Accuracy': tfidf_scores.mean(),
        'Test_Accuracy': tfidf_accuracy,
        'Precision': tfidf_report['weighted avg']['precision'],
        'Recall': tfidf_report['weighted avg']['recall'],
        'F1-Score': tfidf_report['weighted avg']['f1-score']
    })

# Create DataFrame for all the results
results_df = pd.DataFrame(results)

# Display results for comparison
print(results_df)



=== Naive Bayes ===
TF-IDF Naive Bayes Confusion Matrix of Category Performance: 0.616
{'Commercial Services & Supplies': {'precision': 0.578125, 'recall': 0.5441176470588235, 'f1-score': 0.5606060606060606, 'support': 136}, 'Consumer Discretionary': {'precision': 0.6666666666666666, 'recall': 0.06060606060606061, 'f1-score': 0.1111111111111111, 'support': 66}, 'Consumer Staples': {'precision': 0.5223880597014925, 'recall': 0.5882352941176471, 'f1-score': 0.5533596837944663, 'support': 119}, 'Corporate Services': {'precision': 0.5141242937853108, 'recall': 0.6363636363636364, 'f1-score': 0.56875, 'support': 143}, 'Energy & Utilities': {'precision': 0.75, 'recall': 0.711864406779661, 'f1-score': 0.7304347826086958, 'support': 118}, 'Financials': {'precision': 0.8145161290322581, 'recall': 0.7062937062937062, 'f1-score': 0.7565543071161049, 'support': 143}, 'Healthcare': {'precision': 0.7087912087912088, 'recall': 0.86, 'f1-score': 0.7771084337349398, 'support': 150}, 'Industrials': {'p



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



TF-IDF Logistic Regression Confusion Matrix of Category Performance: 0.6226666666666667
{'Commercial Services & Supplies': {'precision': 0.624, 'recall': 0.5735294117647058, 'f1-score': 0.5977011494252874, 'support': 136}, 'Consumer Discretionary': {'precision': 0.5925925925925926, 'recall': 0.24242424242424243, 'f1-score': 0.3440860215053763, 'support': 66}, 'Consumer Staples': {'precision': 0.5504587155963303, 'recall': 0.5042016806722689, 'f1-score': 0.5263157894736842, 'support': 119}, 'Corporate Services': {'precision': 0.558282208588957, 'recall': 0.6363636363636364, 'f1-score': 0.5947712418300654, 'support': 143}, 'Energy & Utilities': {'precision': 0.7410714285714286, 'recall': 0.7033898305084746, 'f1-score': 0.7217391304347825, 'support': 118}, 'Financials': {'precision': 0.7555555555555555, 'recall': 0.7132867132867133, 'f1-score': 0.7338129496402878, 'support': 143}, 'Healthcare': {'precision': 0.7777777777777778, 'recall': 0.84, 'f1-score': 0.8076923076923077, 'support': 15






TF-IDF SGD Classifier Confusion Matrix of Category Performance: 0.6026666666666667
{'Commercial Services & Supplies': {'precision': 0.583941605839416, 'recall': 0.5882352941176471, 'f1-score': 0.5860805860805861, 'support': 136}, 'Consumer Discretionary': {'precision': 0.2916666666666667, 'recall': 0.10606060606060606, 'f1-score': 0.15555555555555556, 'support': 66}, 'Consumer Staples': {'precision': 0.48091603053435117, 'recall': 0.5294117647058824, 'f1-score': 0.504, 'support': 119}, 'Corporate Services': {'precision': 0.5228758169934641, 'recall': 0.5594405594405595, 'f1-score': 0.5405405405405406, 'support': 143}, 'Energy & Utilities': {'precision': 0.7213114754098361, 'recall': 0.7457627118644068, 'f1-score': 0.7333333333333334, 'support': 118}, 'Financials': {'precision': 0.7692307692307693, 'recall': 0.6993006993006993, 'f1-score': 0.7326007326007327, 'support': 143}, 'Healthcare': {'precision': 0.7621951219512195, 'recall': 0.8333333333333334, 'f1-score': 0.7961783439490446, 's






TF-IDF Random Forest Confusion Matrix of Category Performance: 0.6206666666666667
{'Commercial Services & Supplies': {'precision': 0.5945945945945946, 'recall': 0.4852941176470588, 'f1-score': 0.5344129554655871, 'support': 136}, 'Consumer Discretionary': {'precision': 0.5925925925925926, 'recall': 0.24242424242424243, 'f1-score': 0.3440860215053763, 'support': 66}, 'Consumer Staples': {'precision': 0.5130434782608696, 'recall': 0.4957983193277311, 'f1-score': 0.5042735042735041, 'support': 119}, 'Corporate Services': {'precision': 0.5584415584415584, 'recall': 0.6013986013986014, 'f1-score': 0.5791245791245792, 'support': 143}, 'Energy & Utilities': {'precision': 0.6717557251908397, 'recall': 0.7457627118644068, 'f1-score': 0.7068273092369477, 'support': 118}, 'Financials': {'precision': 0.722972972972973, 'recall': 0.7482517482517482, 'f1-score': 0.7353951890034365, 'support': 143}, 'Healthcare': {'precision': 0.7514792899408284, 'recall': 0.8466666666666667, 'f1-score': 0.7962382445






In [70]:
# TF-IDF Results By Model
results_df.sort_values(by='F1-Score',ascending=False)

Unnamed: 0,Model,Pipeline,Cross_Val_Accuracy,Test_Accuracy,Precision,Recall,F1-Score
1,Logistic Regression,TF-IDF,0.615143,0.622667,0.624543,0.622667,0.616532
3,Random Forest,TF-IDF,0.599714,0.620667,0.616474,0.620667,0.610828
0,Naive Bayes,TF-IDF,0.596,0.616,0.625253,0.616,0.60012
2,SGD Classifier,TF-IDF,0.598,0.602667,0.586275,0.602667,0.589548


In [66]:
tfidf_df = pd.concat(tf_report_list)
tfidf_df.sort_values(by=['model','f1-score'], ascending=False)

Unnamed: 0,Category,0,f1-score,precision,recall,support,model
6,Healthcare,,0.785047,0.736842,0.84,150.0,SGD Classifier
5,Financials,,0.754579,0.792308,0.72028,143.0,SGD Classifier
4,Energy & Utilities,,0.745763,0.745763,0.745763,118.0,SGD Classifier
11,Professional Services,,0.723247,0.771654,0.680556,144.0,SGD Classifier
8,Information Technology,,0.707692,0.657143,0.766667,120.0,SGD Classifier
10,"Media, Marketing & Sales",,0.639175,0.537572,0.788136,118.0,SGD Classifier
0,Commercial Services & Supplies,,0.577075,0.623932,0.536765,136.0,SGD Classifier
3,Corporate Services,,0.561151,0.577778,0.545455,143.0,SGD Classifier
2,Consumer Staples,,0.485106,0.491379,0.478992,119.0,SGD Classifier
12,Transportation & Logistics,,0.446097,0.387097,0.526316,114.0,SGD Classifier


# TF-IDF Results
The best-performing model is the multiclass logistic regression (test accuracy of 0.623, weighted F1 of 0.617). 

Looking at per-category performance, the top three categories for all models are:
1. Healthcare
2. Financials
3. Energy & Utilities

Performance for the other classes drops quickly, with the worst results coming from businesses in
Industrials, Consumer Discretionary and Materials.

This modeling exercise is based on the hypothesis that websites within each business category will be distinct
enough to identify their characteristics. This clearly works for some, like the top three categories, but the language
and industries represented in the other categories are not as distinct. 

Next steps:
1. Increase the dataset size (but this needs more memory to process)
2. Test out different models and hyperparameter tuning
3. Try category-specific models (e.g. one model for healthcare, one model for IT)
4. Return to text normalization and see if any important information was accidentally removed. 
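For the hyperparameter tuning step, a grid search over the TF-IDF and classifier settings is one option. The notebook fixes `max_features=100`, so that is a natural parameter to vary. The sketch below runs on a made-up toy corpus, and the grid values are assumptions rather than recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 0 = healthcare-like text, 1 = finance-like text
texts = [
    "clinic patients doctor care", "hospital surgery patients nurse",
    "dental clinic patients care", "doctor nurse hospital care",
    "surgery clinic doctor patients", "nurse care hospital patients",
    "bank loans mortgage savings", "investment bank savings account",
    "mortgage loans bank account", "savings investment loans bank",
    "account bank mortgage investment", "loans savings account bank",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "tfidf__max_features": [5, 50],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the validation folds never leak into the vocabulary.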


In [94]:
# Store BERT results
bert_results = []
bert_report_list=[]

# Iterate over models using the precomputed BERT embeddings
for model_name, model in models.items():
    if isinstance(model, MultinomialNB): # MultinomialNB requires non-negative input values but embeddings can include negative values
        print(f"Skipping {model_name} as it is not compatible with BERT embeddings.")
        continue
    
    print(f"\n=== {model_name} ===")
    
    # Train the model using precomputed BERT embeddings
    model.fit(X_train_embeddings2, y_train)
    y_pred_bert = model.predict(X_test_embeddings2)
    
    # Cross-Validation with precomputed BERT embeddings
    bert_scores = cross_val_score(model, X_train_embeddings2, y_train, cv=kf, scoring='accuracy')
    
    # Calculate accuracy and classification report
    bert_accuracy = accuracy_score(y_test, y_pred_bert)
    bert_report = classification_report(y_test, y_pred_bert, target_names=label_encoder.classes_, output_dict=True)


    bert_df = pd.DataFrame([bert_report]).T
    bert_df = bert_df.reset_index().rename(columns={'index':'Category'})
    bert_df_report = bert_df[0].apply(pd.Series)

    bert_final = pd.merge(bert_df['Category'], bert_df_report, left_index=True,right_index=True)

    # filter dataframe: keep only the per-category rows
    summary_rows = ['weighted avg', 'macro avg', 'accuracy']
    bert_clean = bert_final.loc[~bert_final['Category'].isin(summary_rows)].copy()  # .copy() avoids SettingWithCopyWarning
    bert_clean['model'] = model_name
    bert_report_list.append(bert_clean)
    
    # Append BERT results to the results list
    bert_results.append({
        'Model': model_name,
        'Cross_Val_Accuracy': bert_scores.mean(),
        'Test_Accuracy': bert_accuracy,
        'Precision': bert_report['weighted avg']['precision'],
        'Recall': bert_report['weighted avg']['recall'],
        'F1-Score': bert_report['weighted avg']['f1-score']
    })

# Convert BERT results to DataFrame
bert_results_df = pd.DataFrame(bert_results)
print("BERT Results:")
print(bert_results_df)


Skipping Naive Bayes as it is not compatible with BERT embeddings.

=== Logistic Regression ===








=== SGD Classifier ===








=== Random Forest ===
BERT Results:
                 Model  Cross_Val_Accuracy  Test_Accuracy  Precision  \
0  Logistic Regression            0.770000       0.810667   0.811114   
1       SGD Classifier            0.700000       0.753333   0.788897   
2        Random Forest            0.637714       0.662667   0.675663   

     Recall  F1-Score  
0  0.810667  0.809744  
1  0.753333  0.756673  
2  0.662667  0.653985  







In [87]:
bert_results_df

Unnamed: 0,Model,Cross_Val_Accuracy,Test_Accuracy,Precision,Recall,F1-Score
0,Logistic Regression,0.77,0.810667,0.811114,0.810667,0.809744
1,SGD Classifier,0.721429,0.704,0.775288,0.704,0.698077
2,Random Forest,0.644,0.670667,0.674981,0.670667,0.661995


In [88]:
dbert_df = pd.concat(bert_report_list)
dbert_df.sort_values(by=['model','f1-score'], ascending=False)

Unnamed: 0,Category,0,f1-score,precision,recall,support,model
6,Healthcare,,0.89769,0.888889,0.906667,150.0,SGD Classifier
12,Transportation & Logistics,,0.768116,0.654321,0.929825,114.0,SGD Classifier
11,Professional Services,,0.727869,0.689441,0.770833,144.0,SGD Classifier
3,Corporate Services,,0.722689,0.905263,0.601399,143.0,SGD Classifier
2,Consumer Staples,,0.716129,0.581152,0.932773,119.0,SGD Classifier
4,Energy & Utilities,,0.715867,0.633987,0.822034,118.0,SGD Classifier
10,"Media, Marketing & Sales",,0.712766,0.957143,0.567797,118.0,SGD Classifier
5,Financials,,0.690583,0.9625,0.538462,143.0,SGD Classifier
8,Information Technology,,0.645963,0.514851,0.866667,120.0,SGD Classifier
9,Materials,,0.601307,0.455446,0.884615,52.0,SGD Classifier


# DistilBERT Results
Logistic regression with DistilBERT far outperforms the same model with TF-IDF (test accuracy of 0.811 vs. 0.623).

Unlike with the TF-IDF vectorization, the top-performing categories are:
1. Healthcare
2. Transportation & Logistics
3. Consumer Staples

Accuracy does drop for the remaining classes, but even the worst-performing category (Consumer Discretionary), with only 66 test examples, far outperforms any of the TF-IDF models. 

DistilBERT captures context and not just word frequency, but we can absolutely improve with other features.

Next steps:
1. Increase the dataset size (but this needs more memory to process)
2. Test out different models and hyperparameter tuning
3. Try category-specific models (e.g. one model for healthcare, one model for IT)
4. Return to text normalization and see if any important information was accidentally removed. 
    - Remove contact information & addresses
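Removing contact information could start with a couple of regular expressions. This is a rough sketch, not a complete solution: `strip_contact_info` is a hypothetical helper, the phone pattern assumes North American style numbers (it would miss formats such as UK numbers), and street addresses are not handled at all.

```python
import re

# Simple email pattern: local part, "@", then a dotted domain
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Assumed North American phone format: optional country code, 3-3-4 digits,
# separated by spaces, dots, or hyphens
PHONE_RE = re.compile(r"(?<!\w)(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)")

def strip_contact_info(text: str) -> str:
    """Remove email addresses and phone numbers, then collapse extra whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = PHONE_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = strip_contact_info("Call 214.574.9600 TODAY or email info@example.com for an appointment")
print(cleaned)
```

This would need to run on the raw text, before the normalization step strips punctuation and digits.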

# Model Winner

Logistic Regression (ovr) with DistilBERT 

The best-performing model is use case-, data-, and application-specific. You have to balance the model's performance against how and where it will be deployed. 

In order to ship this application, I need to either share the precomputed DistilBERT embeddings or use TF-IDF so the model can be tested. I will not use BERT embeddings in the application because of the time required
to compute them. 

Criteria used to select model:
1. Accuracy: the global behavior of the model across all category predictions
2. Per-class metrics (from the classification report):
    - Precision 
    - Recall
    - F1
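To make these criteria concrete, here is a small worked example on made-up labels. It shows how accuracy, the confusion matrix, and the per-class precision and recall relate to each other.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy labels (hypothetical), two true examples per class
y_true = ["Healthcare", "Healthcare", "Financials", "Financials", "Materials", "Materials"]
y_pred = ["Healthcare", "Healthcare", "Financials", "Healthcare", "Materials", "Financials"]
class_names = ["Financials", "Healthcare", "Materials"]

# Accuracy: share of all predictions that are correct (4 of 6 here)
accuracy = accuracy_score(y_true, y_pred)

# Confusion matrix: rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred, labels=class_names)

# Precision, recall and F1 per class
report = classification_report(y_true, y_pred, labels=class_names, output_dict=True, zero_division=0)

print(accuracy)
print(matrix)
print(report["Healthcare"])
```

In this toy case Healthcare has perfect recall (both true Healthcare rows are found) but a precision of 2/3, because one Financials row was also predicted as Healthcare.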

In [104]:
import pickle 
# Define the classification models to be tested
models = {
    'Logistic Regression': LogisticRegression(multi_class='ovr', max_iter=1000)
}
# Iterate over models using the precomputed BERT embeddings
for model_name, model in models.items():
    if isinstance(model, MultinomialNB): # MultinomialNB requires non-negative input values but embeddings can include negative values
        print(f"Skipping {model_name} as it is not compatible with BERT embeddings.")
        continue
    
    print(f"\n=== {model_name} ===")
    
    # Train the model using precomputed BERT embeddings
    model.fit(X_train_embeddings2, y_train)
    y_pred_bert = model.predict(X_test_embeddings2)

    # Save the model
    with open('logistic_dbert_model.pkl', 'wb') as model_file:
        pickle.dump(model, model_file)

    # To load the model later
    with open('logistic_dbert_model.pkl', 'rb') as model_file:
        loaded_model = pickle.load(model_file)


=== Logistic Regression ===


In [133]:
# select cases not in our sample of 5000 cases
df_test = df_clean[~df_clean.index.isin(df_sample.index)].sample(n=3)
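# Use transform (not fit_transform) so the labels keep the encoding learned on df_sample.
# Note: the output recorded below came from a run that called fit_transform here, which
# refit the encoder on these three rows and renumbered their labels as 0 and 1.
df_test['Category_encoded'] = label_encoder.transform(df_test['Category'])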
for i in range(len(df_test)):
    print(df_test['Category'].iloc[i])
    print(df_test['clean_text_str'].iloc[i])


Corporate Services
weight loss high intensity bodybuilding fitness strength training trainers care trainers take fitness goals seriously understand losing weight gaining muscle challenging goal overcome hurdles reach goals analyze body type bmi metabolism create individualized fitness plan founded larry reynolds built first gym basement church troy york years old moved phoenix arizona age work `` real health club soon became head trainer gym swim east gym swim west year later started glendale arizona longest running bodybuilding gym westside barbell club enjoyed fantastic -year run soon started arizona successful personal training business peak trainers well clients thousands success stories `` making arizona years larry semi retired passed reins lrpt mike fox longtime assistant general manager mike continuing larry reynolds legacy helping dozens clients reach health fitness goals larry still trains private clients three afternoons week plans fully retiring anytime soon success goal un

In [134]:
# New, unseen data (text) to classify
new_X = df_test['clean_text_str']
new_y_train = df_test['Category_encoded']
print(df_test['Category'])
print(df_test['Category_encoded'])

new_embeddings = bert_transformer.transform(new_X)

40475    Corporate Services
35561             Materials
12376    Corporate Services
Name: Category, dtype: object
40475    0
35561    1
12376    0
Name: Category_encoded, dtype: int32


In [135]:
# Make predictions
# Load the saved model
with open('logistic_dbert_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
    
predictions = loaded_model.predict(new_embeddings)

In [136]:
# Output the predictions
print(f"Predicted classes for the new data: {predictions}")

Predicted classes for the new data: [3 0 3]
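The predicted integers can be mapped back to category names with `inverse_transform`. This sketch assumes an encoder fitted on the 13 categories printed in the Label Encoding section. It rebuilds that encoder from the category names rather than reusing the notebook's `label_encoder` object.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The 13 categories shown in the Label Encoding output
categories = [
    "Industrials", "Media, Marketing & Sales", "Professional Services",
    "Information Technology", "Consumer Staples", "Energy & Utilities",
    "Transportation & Logistics", "Materials", "Commercial Services & Supplies",
    "Healthcare", "Corporate Services", "Consumer Discretionary", "Financials",
]
encoder = LabelEncoder().fit(categories)  # classes are stored in sorted order

predictions = np.array([3, 0, 3])  # the predictions printed above
decoded = encoder.inverse_transform(predictions)
print(decoded.tolist())
```

Under that mapping, the two Corporate Services websites are classified correctly, and the Materials website is predicted as Commercial Services & Supplies.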
