Totally unreliable results. What am I doing wrong? #490

Open
pySilver opened this issue Feb 12, 2024 · 2 comments

Comments

@pySilver

pySilver commented Feb 12, 2024

I'm evaluating SetFit to predict one of 2 labels using ~500 training samples across both classes, and the results are far from satisfactory.

A little bit of background:

  • I have an e-commerce website with fashion products for male and female customers in Poland (so the Polish language is used)
  • There are top level categories for both genders: Accessories, Underwear, Shoes, Clothing
  • Each top-level category has multiple more specific subcategories. For instance, Clothing has Jackets, T-Shirts, Trousers, etc. These subcategories are not always identical across gender branches; for example, high heels are only present in the female branch
  • I've built a synthetic training dataset by using specific category names in singular and plural forms, e.g. "Men's T-Shirt", "Men's T-Shirts", resulting in ~500 rows (as it is fairly small, I've attached it here: synthetic-train.csv); see the loading sketch after this list
  • Class counts for the training dataset: Counter({0: 333, 1: 217})
  • As eval_dataset I'm using real-world products, manually selected and verified. I've attached a few test rows here: test_sample.csv
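
A minimal sketch of how the two attached CSVs could be loaded for training and evaluation (assuming "text" and "label" columns; the actual column names in the attachments may differ):

import pandas as pd
from datasets import Dataset

# Hypothetical loading of the attached files; column names are assumptions.
train_df = pd.read_csv("synthetic-train.csv")  # ~500 synthetic category-name rows
eval_df = pd.read_csv("test_sample.csv")       # manually verified real products

train_ds = Dataset.from_pandas(train_df)
eval_ds = Dataset.from_pandas(eval_df)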

I've trained the model using sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and observed that the model is never in doubt and the predictions are totally useless. Next I tried allegro/herbert-base-cased with the following args and metrics:

from setfit import TrainingArguments

train_args = TrainingArguments(
    num_iterations=20,
    seed=42,
)
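
The rest of the run follows the standard SetFit pattern, roughly like this (a sketch rather than the exact script; the metric callable is one way to produce the four numbers reported below, and train_ds/eval_ds are the datasets loaded above):

from setfit import SetFitModel, Trainer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metric(y_pred, y_true):
    # Report the same four metrics listed below.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
    }

model = SetFitModel.from_pretrained("allegro/herbert-base-cased")

trainer = Trainer(
    model=model,
    args=train_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    metric=metric,
)
trainer.train()
print(trainer.evaluate())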

Metrics:

accuracy: 0.7290384207174697
f1: 0.5987741631305987
recall: 0.43080054274084123
precision: 0.98145285935085

Let's see how it performs:

import numpy as np

from utils import clean_text

# Invert the label name -> class id mapping (gender_labels is defined elsewhere)
class_labels = {v: k for k, v in gender_labels.items()}

# Inference
model = trainer.model
input_texts = ['lajsfhlasfuaer usiyfsf jsdfu', 'szpilki', 'sukienka', 'koszula', 'Spodnie męskie',
               'Koszula damska', "frak czarny elegancki", "buty do garnituru", "nerka"]

for input_text, pred in zip(input_texts, model.predict_proba([clean_text(t) for t in input_texts])):
    # Convert tensor to numpy array
    pred = pred.detach().numpy()

    # Get the index of the maximum probability which corresponds to the predicted class
    predicted_class = np.argmax(pred)

    # Get the maximum probability
    predicted_probability = np.max(pred)

    # Class to label mapping
    predicted_class = class_labels[predicted_class].upper()

    # Print the input text, predicted class and the probability of the predicted class (rounded to 3 decimal places)
    print(f"{clean_text(input_text)} - {predicted_class}, Probability: {predicted_probability:.3f}")
lajsfhlasfuaer usiyfsf jsdfu - FEMALE, Probability: 1.000 (BAD; random string that does not mean anything!)
szpilki - FEMALE, Probability: 1.000 (OK, high heels, typical female product)
sukienka - FEMALE, Probability: 1.000 (OK; dress; typical female product)
koszula - FEMALE, Probability: 1.000 (BAD; "Shirt" can be both male and female)
spodnie męskie - MALE, Probability: 1.000 (OK, gender is mentioned)
koszula damska - FEMALE, Probability: 1.000 (OK, gender is mentioned)
frak czarny elegancki - FEMALE, Probability: 0.983, (BAD; "Black Elegant Tailcoat"; Tailcoat only mentioned in male data)
buty do garnituru - FEMALE, Probability: 1.000 (BAD; `shoes for a suit`; Suit mentioned for male only)
nerka - MALE, Probability: 1.000 (OK; typical male product in dataset)

What am I doing wrong? Why is the model so sure of its predictions? What can I do to improve? Should the training data look like sentences, or does punctuation not matter at all?

Thank you for your help!

@pySilver
Author

I've tried to change the model head this way:

from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from setfit import SetFitModel

model_id = "allegro/herbert-base-cased"
# model = SetFitModel.from_pretrained(model_id)

model_body = SentenceTransformer(model_id)
model_head = LogisticRegression(class_weight="balanced")
model = SetFitModel(model_body=model_body, model_head=model_head)
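
For context, class_weight="balanced" reweights each class inversely to its frequency, which matters here given the 333/217 split. A rough standalone illustration of the weights it implies, using sklearn's documented formula n_samples / (n_classes * class_count) (not SetFit-specific code):

from collections import Counter

counts = Counter({0: 333, 1: 217})
n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight = n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(weights)  # {0: 0.826..., 1: 1.267...}

The head itself is only fit when trainer.train() runs, on the embeddings produced by the fine-tuned body.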

Training with this head resulted in:

{'accuracy': {'accuracy': 0.9120144343026958},
 'f1': {'f1': 0.9101744501029364},
 'recall': {'recall': 0.9497964721845319},
 'precision': {'precision': 0.8737258165175785}}

and a little bit better test results (same inputs as in the original message):

lajsfhlasfuaer usiyfsf jsdfu - MALE, Probability: 0.999
szpilki - FEMALE, Probability: 0.665
sukienka - FEMALE, Probability: 0.995
koszula - FEMALE, Probability: 0.816
spodnie męskie - FEMALE, Probability: 0.990
koszula damska - FEMALE, Probability: 0.999
frak czarny elegancki - MALE, Probability: 0.997
buty do garnituru - MALE, Probability: 0.581
nerka - MALE, Probability: 0.792

but it's still far from being good.

@rolandtannous

You're getting great F1 and accuracy scores. Why is it far from being good? What am I missing?
