I'm evaluating SetFit to predict one of 2 labels using ~500 training samples across both classes, and the results are far from satisfying.
A little bit of background:
I run an e-commerce website with fashion products for male and female customers in Poland (so the Polish language is used).
There are top-level categories for both genders: Accessories, Underwear, Shoes, Clothing.
Each top-level category has multiple more specific categories; for instance, Clothing has Jackets, T-Shirts, Trousers, etc. These subcategories are not always identical across the gender branches: for example, high heels are only present in the female branch. A sketch of the structure is below.
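To make the structure concrete, here is a hypothetical slice of the tree (the category names below are illustrative, not my full taxonomy):

```python
# Illustrative slice of the category tree; the real taxonomy is larger
taxonomy = {
    "male": {
        "Clothing": ["Jackets", "T-Shirts", "Trousers"],
        "Shoes": ["Suit Shoes", "Sneakers"],
    },
    "female": {
        "Clothing": ["Jackets", "T-Shirts", "Trousers", "Dresses"],
        "Shoes": ["High Heels", "Sneakers"],  # high heels exist only in this branch
    },
}
```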
I've built a synthetic training dataset by taking the specific category names with singular and plural variants, e.g. "Men's T-Shirt", "Men's T-Shirts", resulting in ~500 rows (a generation sketch follows below; as the dataset is fairly small, I've attached it here: synthetic-train.csv).
Class counts for the training dataset: Counter({0: 333, 1: 217})
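The generation logic is roughly the following (the name lists and the label assignment below are illustrative; the real script walks the full category tree):

```python
import csv
from collections import Counter

# Illustrative name lists; the real script derives these from the category tree.
# Which gender maps to class 0 vs. class 1 is an assumption here.
female_names = ["Sukienka", "Sukienki", "Szpilki damskie"]
male_names = ["T-shirt męski", "T-shirty męskie", "Spodnie męskie"]

rows = [(name, 0) for name in female_names] + [(name, 1) for name in male_names]

with open("synthetic-train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)

print(Counter(label for _, label in rows))
```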
As eval_dataset I'm using real-world products, manually selected and verified. I've attached a few testing rows here: test_sample.csv
I trained the model using sentence-transformers/paraphrase-multilingual-mpnet-base-v2 and observed that the model is never in doubt and its predictions are totally useless. Next I tried allegro/herbert-base-cased.
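The training setup looks roughly like this (the hyperparameter values below are placeholders, not the exact args from my runs):

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Load the synthetic training set and the hand-verified eval set
dataset = load_dataset("csv", data_files={"train": "synthetic-train.csv",
                                          "eval": "test_sample.csv"})

model = SetFitModel.from_pretrained("allegro/herbert-base-cased")

trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["eval"],
    loss_class=CosineSimilarityLoss,
    batch_size=16,        # placeholder
    num_iterations=20,    # placeholder; contrastive pairs generated per sample
    num_epochs=1,         # placeholder
    column_mapping={"text": "text", "label": "label"},  # adjust to the CSV column names
)
trainer.train()
print(trainer.evaluate())
```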
Let's see how it performs:

```python
import numpy as np

from utils import clean_text

# gender_labels (defined elsewhere in my code) maps label names to class ids;
# invert it so predicted ids can be turned back into label names
class_labels = {v: k for k, v in gender_labels.items()}

# Inference
model = trainer.model
input_texts = ['lajsfhlasfuaer usiyfsf jsdfu', 'szpilki', 'sukienka', 'koszula', 'Spodnie męskie',
               'Koszula damska', "frak czarny elegancki", "buty do garnituru", "nerka"]

for input_text, pred in zip(input_texts, model.predict_proba([clean_text(t) for t in input_texts])):
    # Convert tensor to numpy array
    pred = pred.detach().numpy()
    # The index of the maximum probability corresponds to the predicted class
    predicted_class = np.argmax(pred)
    # Get the maximum probability
    predicted_probability = np.max(pred)
    # Map the class id back to its label name
    predicted_class = class_labels[predicted_class].upper()
    # Print the input text, predicted class, and probability (rounded to 3 decimal places)
    print(f"{clean_text(input_text)} - {predicted_class}, Probability: {predicted_probability:.3f}")
```
```
lajsfhlasfuaer usiyfsf jsdfu - FEMALE, Probability: 1.000  (BAD; a random string that does not mean anything!)
szpilki - FEMALE, Probability: 1.000  (OK; high heels, a typical female product)
sukienka - FEMALE, Probability: 1.000  (OK; dress, a typical female product)
koszula - FEMALE, Probability: 1.000  (BAD; "shirt" can be both male and female)
spodnie męskie - MALE, Probability: 1.000  (OK; gender is mentioned)
koszula damska - FEMALE, Probability: 1.000  (OK; gender is mentioned)
frak czarny elegancki - FEMALE, Probability: 0.983  (BAD; "elegant black tailcoat"; tailcoat is only mentioned in the male data)
buty do garnituru - FEMALE, Probability: 1.000  (BAD; "shoes for a suit"; suit is mentioned for male only)
nerka - MALE, Probability: 1.000  (OK; "bum bag", a typical male product in the dataset)
```
What am I doing wrong? Why is the model so sure of its predictions? What can I do to improve? Should the training data look like full sentences, or does punctuation not matter at all?
Thank you for your help!