# Fluff Detection: A Step Toward Concise Writing

## Objective
This project aims to develop a text classifier, specifically **a fluff detector**, to identify and reduce verbosity in English writing. By categorizing and annotating instances of fluff, we can later fine-tune language models to rewrite text with clarity and precision.

> **In this notebook, we will focus on creating a small synthetic dataset to train a first version of the text classifier.**

## What Is Fluff?
Fluff refers to superfluous elements in writing that increase length but do not enhance meaning.
→ It weakens clarity and reduces communication effectiveness.


Below are examples illustrating common types of fluff, which often overlap:

#### Example 1
- **Fluffy**: *It is absolutely and completely necessary for us to thoroughly and carefully evaluate all aspects of the situation.* ❌
- **Concise**: *We must evaluate all aspects of the situation.* ✅

#### Example 2
- **Fluffy**: *As I mentioned earlier, we actually need to start working on this project sooner rather than later to ensure that we meet the deadline.* ❌
- **Concise**: *We need to start this project soon to meet the deadline.* ✅

#### Example 3
- **Fluffy**: *At the end of the day, we need to think outside the box to ensure that this project reaches its maximum potential.* ❌
- **Concise**: *We must think creatively to maximize this project's potential.* ✅



## How Will We Build The Dataset?
We want to minimize the time it takes to build the dataset, so we'll:
- Define our labels and describe them
- Choose the language
- Describe the type of text entries we would like to generate
- Use a dataset of `Persona` entries to achieve great diversity

### 1. Labels
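We will use the following two simple labels:
- **`fluffy`** - (also labelled `1`)
- **`concise`** - (also labelled `0`)

These numeric labels are class IDs, not positions in the `LABELS` list below, where `fluffy` happens to come first.

We'll also provide a description and a generation instruction for each label.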

In [1]:
LABELS = [
    {
        "name": "fluffy",
        "description": "Fluff in text is often includes redundancy, filler words, excessive qualifiers, unnecessary adjectives or adverbs, irrelevant information, and repetition of known context. It features generic statements, clichés, excessive formalities, lengthy introductions or conclusions, self-evident statements, overuse of the passive voice, and overly elaborate sentences. Each dilutes clarity, wastes space, and reduces reader engagement.",
        "instruction": "Make sure that the text is quite verbose and contain fluff."
    },
    {
        "name": "concise",
        "description": "Concise text is characterized by clarity, precision, and brevity. It communicates ideas directly, using only the words necessary to convey the intended message. Concise text avoids redundancy, filler, and irrelevant details, focusing instead on impactful and purposeful language. It is specific, avoids overused phrases or qualifiers, and presents information in a straightforward manner, making it easy for the reader to understand and retain.",
        "instruction": "Make sure that the text is concise."
    }
]

### 2. Language
We will choose the language:
- For simplicity, we'll stick to **English**.

In [2]:
LANGUAGES = {"en": "English"}
# Could also be (kept as a dict, since pick_one() below expects one):
# LANGUAGES = {"fr": "French", "es": "Spanish", "en": "English"}  ---> each text is written in one of these languages

### 3. Description of Text Entries
We would like to focus on classifying:
- `sentence-level`: each text is a standalone sentence
- `persona`: text that could be written by a specific persona (more on that later)
- `styles`: text written in different styles
- `context`: text that could be written in different everyday contexts
- `medium`: text that could be written on different mediums
- `intent`: text written with different intents

All this will help us construct a procedural prompt for high-diversity, yet realistic texts.

**Example Generation Prompt**:

*Imagine 3 different but realistic sentences which could have been written by the following person:*

*Persona: Leo, A small wood manufacturing business owner who just got his first kid.*

*Writing style: A tendency for exaggeration*

*Context: In the middle of the night*

*Medium: Email*

*Intent: To warn somebody about something.*

I'll use HuggingChat to generate ideas for styles, contexts, mediums, and intents.

I'll use this prompt: `Can you give me 5 ideas to describe different writing styles somebody could use?`

> ⚠ These properties should be generic enough not to conflict with each other, and especially not with any of the labels.

In [3]:
STYLES = {
    "formal": "Formal, and structured, often used in academic or professional contexts.",
    "casual": "Casual, as if talking to a friend",
    "narrative": "Narrative, like telling a story, often with a chronological flow and a focus on events and characters.",
    "humorous": "Humorous, uses wit, irony, and playful language to entertain and amuse the reader.",
    "technical": "Technical, precise and factual, often used in scientific or technical writing to provide clear, detailed information"
}

In [4]:
CONTEXTS = {
    "professional_meeting": "During a meeting at work",
    "family_gathering": "During a family setting, such as a meal",
    "public_event": "During a public event, like a concert or a seminar.",
    "social_media": "While scrolling on Twitter",
    "night_time": "In the middle of the night",
    "early_morning": "Early in the morning",
    "travel_scenario": "While traveling.",
    "celebratory_event": "In the context of a joyful or celebratory moment, like a birthday or an achievement."
}

In [5]:
MEDIUMS = {
    "email": "email",
    "text_message": "A quick message sent via SMS or instant messaging apps.",
    "social_media_post": "A post or comment shared on social media platforms.",
    "technical_document": "Technical document",
    "personal_note": "Personal notes",
}

In [6]:
INTENTS = {
    "inform": "To provide information or knowledge to the reader.",
    "warn": "To alert or caution the reader about potential risks or dangers.",
    "entertain": "To amuse or captivate the reader through engaging content.",
    "motivate": "To inspire or encourage the reader to take action or feel uplifted.",
    "persuade": "To convince the reader to agree with a point of view or take a specific action.",
    "reflect": "To ponder or explore thoughts, emotions, or experiences.",
    "request": "To ask for information, action, or assistance from the reader.",
    "express_gratitude": "To show appreciation or thanks to the reader."
}
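
To get a feel for how much variety these dictionaries provide, the sketch below counts the attribute combinations available for each persona. The sizes are copied from the dictionaries above (5 styles, 8 contexts, 5 mediums, 8 intents); the variable names are only illustrative.

```python
from math import prod

# Sizes of the attribute dictionaries defined above
attribute_sizes = {"styles": 5, "contexts": 8, "mediums": 5, "intents": 8}

# Each prompt picks one value per attribute, so the counts multiply
combinations_per_persona = prod(attribute_sizes.values())
print(combinations_per_persona)  # 1600

# Each combination can be paired with either label (fluffy or concise)
labelled_combinations = combinations_per_persona * 2
print(labelled_combinations)  # 3200
```

Even before personas are varied, this leaves plenty of room for each prompt to differ from the others.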

In [28]:
PROMPT_TEMPLATE = """Imagine {num_entries} differents and realistic sentences which could have been written by the following person:
Persona: {persona}
Writting style: {style}
Context: {context}
Medium: {medium}
Intent: {intent}

Instructions:
Each text entry must be short (1 to 2) sentences max.
The core idea in each text entry must be different from the others.
The language must be {language}
"""

In [29]:
generation_prompt = PROMPT_TEMPLATE.format(
    persona = "Leo, A small wood manufacturing business owner who just got his first kid.",
    num_entries = 3,
    style = STYLES['humorous'],
    context = CONTEXTS['night_time'],
    medium = MEDIUMS["social_media_post"],
    intent = INTENTS["request"],
    language=LANGUAGES['en']
)

In [9]:
print(generation_prompt)

Imagine 3 differents and realistic sentences which could have been written by the following person:
Persona: Leo, A small wood manufacturing business owner who just got his first kid.
Writting style: Humorous, uses wit, irony, and playful language to entertain and amuse the reader.
Context: In the middle of the night
Medium: A post or comment shared on social media platforms.
Intent: To ask for information, action, or assistance from the reader.

Instructions:
Each text entry (sentence) must be only 1 sentence.
The core idea in each text entry (sentece) must be different from the others.
The language must be English



## Generation
To generate samples we will:
- Choose an LLM
- Configure structured outputs
- Configure the generation parameters

In [10]:
import google.generativeai as genai
import os
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [11]:
genai.configure(api_key=GOOGLE_API_KEY)

In [12]:
model = genai.GenerativeModel(model_name="gemini-1.5-flash")
response = model.generate_content(generation_prompt)
print(response.text)

1.  So, the tiny human decided 3 AM was the perfect time for a symphony of screams – anyone know where I can get a lifetime supply of caffeine AND earplugs?

2.  My meticulously crafted rocking horse is now less "charming heirloom" and more "chew toy," so if anyone knows a good wood sealant that's also baby-safe, hit me up!

3.  Apparently, sleep is a myth now that I'm a dad. Send coffee...or whiskey...or maybe both.  Just kidding...unless...?



In [13]:
!pip install instructor textstat datasets --quiet

In [14]:
from pydantic import BaseModel

class TextEntry(BaseModel):
    content: str

class TextEntries(BaseModel):
    entries : list[TextEntry]

In [15]:
import instructor

google_client = genai.GenerativeModel(
    model_name="gemini-1.5-flash")

client = instructor.from_gemini(
    client=google_client,
    mode=instructor.Mode.GEMINI_JSON,
)
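
As a rough illustration of what structured output provides, the sketch below validates a JSON payload against the same two Pydantic models, without any API call. The JSON strings are made up for the example and are only an assumption about the shape a model reply would take; `model_validate_json` requires Pydantic v2.

```python
from pydantic import BaseModel, ValidationError

class TextEntry(BaseModel):
    content: str

class TextEntries(BaseModel):
    entries: list[TextEntry]

# Hypothetical payload, shaped like the schema above
raw_reply = '{"entries": [{"content": "Send coffee."}, {"content": "Any baby-safe wood sealant?"}]}'

parsed = TextEntries.model_validate_json(raw_reply)
texts = [entry.content for entry in parsed.entries]
print(texts)

# A payload that does not match the schema raises instead of passing silently
try:
    TextEntries.model_validate_json('{"entries": [{"text": "wrong key"}]}')
    schema_error_caught = False
except ValidationError:
    schema_error_caught = True
print(schema_error_caught)
```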

### Persona
As for the persona, we'll be using the **`FinePersonas`** dataset, an open dataset of 21 million detailed personas, built specifically for diverse and controllable synthetic text generation.

👏 Big kudos to the [Argilla](https://argilla.io/) team for this dataset.
You can find this dataset [here](https://huggingface.co/datasets/argilla/FinePersonas-v0.1)

In [17]:
from datasets import load_dataset
fine_personas_dataset = load_dataset("argilla/FinePersonas-v0.1", "default")

[Output: download progress for 12 parquet shards, `train-00000-of-00012.parquet` to `train-00011-of-00012.parquet` (220M to 226M each), followed by generation of the train split (21071228 examples)]

We will not need all 21 million personas 😅 We'll sample:
- 4000 for the train set
- 1000 for the test set

In [18]:
my_personas = fine_personas_dataset['train'].train_test_split(
    train_size=4000,
    test_size=1000,
    shuffle=True,
    seed=13)

Let's check the reduced dataset

In [19]:
my_personas

DatasetDict({
    train: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 1000
    })
})

In [20]:
# check an example persona from the dataset
my_personas['train'][0]

{'id': '<urn:uuid:5b02d9ba-b011-4ebf-a9a3-0b39f1fb919e>',
 'persona': 'An English literature professor or academic researcher with a focus on medieval studies, particularly on the works of Geoffrey Chaucer and his Canterbury Tales.',
 'labels': '["Education", "Academia", "Specialized Expertise"]'}
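
Note that `labels` in the record above is a JSON-encoded string rather than a Python list. If the tags are needed later, one way to decode them is sketched below, using the value from the example record.

```python
import json

# 'labels' value copied from the example persona above
raw_labels = '["Education", "Academia", "Specialized Expertise"]'

persona_labels = json.loads(raw_labels)
print(persona_labels)       # ['Education', 'Academia', 'Specialized Expertise']
print(len(persona_labels))  # 3
```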

# Generation

In [30]:
import random
def pick_one(variations):
    """ Randomly pick a variation among several possible styles, intents, mediums, contexts, languages """
    random_variation_key = random.choice(list(variations.keys()))
    random_variation_value = variations[random_variation_key]
    return random_variation_value
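
`pick_one` relies on the global random state, so two runs of the notebook will generally draw different attributes, even though the persona split itself is seeded. If reproducible prompts matter, a seeded variant could look like the sketch below; `pick_one_seeded` and its `rng` parameter are hypothetical names, not part of the notebook.

```python
import random

def pick_one_seeded(variations, rng):
    """Pick one description from a {key: description} dict using the given RNG."""
    return rng.choice(list(variations.values()))

# Small stand-in for the STYLES dictionary
styles = {"formal": "Formal", "casual": "Casual", "humorous": "Humorous"}

# Two generators created with the same seed produce the same sequence of picks
rng_a = random.Random(13)
rng_b = random.Random(13)
first_run = [pick_one_seeded(styles, rng_a) for _ in range(5)]
second_run = [pick_one_seeded(styles, rng_b) for _ in range(5)]
print(first_run == second_run)  # True
```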

In [31]:
def generate_prompt(num_entries, persona, style, context, medium, intent, language, label_idx):
    generation_prompt = PROMPT_TEMPLATE.format(
        persona = persona,
        num_entries = num_entries,
        style = style,
        context = context,
        medium = medium,
        intent = intent,
        language = language
    )

    final_prompt = generation_prompt + f"\n{LABELS[label_idx]['instruction']} {LABELS[label_idx]['description']}"

    return final_prompt

In [39]:
counter = 0

for item in my_personas['train']:

    fluff_prompt = generate_prompt(
        num_entries=2,
        persona=item['persona'],
        style=pick_one(STYLES),
        context=pick_one(CONTEXTS),
        medium=pick_one(MEDIUMS),
        intent=pick_one(INTENTS),
        language=pick_one(LANGUAGES),  # only english in this example notebook
        label_idx=0
    )

    concise_prompt = generate_prompt(
        num_entries=2,
        persona=item['persona'],
        style=pick_one(STYLES),
        context=pick_one(CONTEXTS),
        medium=pick_one(MEDIUMS),
        intent=pick_one(INTENTS),
        language=pick_one(LANGUAGES),  # only english in this example notebook
        label_idx=1
    )

    fluff_response = client.messages.create(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful AI Assistant"},
            {
                "role": "user",
                "content": fluff_prompt},
        ],
        response_model=TextEntries
    )

    concise_response = client.messages.create(
        messages=[
            {
                "role": "system",
                "content": "You are a helpful AI Assistant"},
            {
                "role": "user",
                "content": concise_prompt},
        ],
        response_model=TextEntries
    )


    print("=======================")
    print(fluff_prompt, '\n')
    print(fluff_response)

    print("\n\n")

    print(concise_prompt, '\n')
    print(concise_response)

    print("\n\n")

    counter += 1
    if counter >= 10:
        break

Imagine 2 differents and realistic sentences which could have been written by the following person:
Persona: An English literature professor or academic researcher with a focus on medieval studies, particularly on the works of Geoffrey Chaucer and his Canterbury Tales.
Writting style: Formal, and structured, often used in academic or professional contexts.
Context: Early in the morning
Medium: Technical document
Intent: To ask for information, action, or assistance from the reader.

Instructions:
Each text entry must be short (1 to 2) sentences max.
The core idea in each text entry must be different from the others.
The language must be English

Make sure that the text is quite verbose and contain fluff. Fluff in text is often includes redundancy, filler words, excessive qualifiers, unnecessary adjectives or adverbs, irrelevant information, and repetition of known context. It features generic statements, clichés, excessive formalities, lengthy introductions or conclusions, self-evident



InstructorRetryException: 429 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: Resource has been exhausted (e.g. check quota).
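
The 429 error above means the API quota was exhausted partway through the loop. One common way to cope is to retry with exponential backoff. The sketch below is a generic helper under that assumption, not the notebook's implementation: `call_with_backoff`, its parameters, and `flaky_call` are hypothetical, and a real version would catch the client's specific exception (here `InstructorRetryException`) rather than `Exception`.

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice before succeeding
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429: quota exhausted")
    return "ok"

# Record the waits instead of actually sleeping
waits = []
result = call_with_backoff(flaky_call, base_delay=1.0, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

In the loop, each `client.messages.create(...)` call could then be wrapped as `call_with_backoff(lambda: client.messages.create(...))`.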

In [34]:
fluff_prompt

'Imagine 2 differents and realistic sentences which could have been written by the following person:\nPersona: An English literature professor or academic researcher with a focus on medieval studies, particularly on the works of Geoffrey Chaucer and his Canterbury Tales.\nWritting style: Humorous, uses wit, irony, and playful language to entertain and amuse the reader.\nContext: In the middle of the night\nMedium: Technical document\nIntent: To provide information or knowledge to the reader.\n\nInstructions:\nEach text entry must be short (1 to 2) sentences max.\nThe core idea in each text entry must be different from the others.\nThe language must be English\n\nMake sure that the text is quite verbose and contain fluff. Fluff in text is often includes redundancy, filler words, excessive qualifiers, unnecessary adjectives or adverbs, irrelevant information, and repetition of known context. It features generic statements, clichés, excessive formalities, lengthy introductions or conclusi

In [35]:
concise_prompt

'Imagine 2 differents and realistic sentences which could have been written by the following person:\nPersona: An English literature professor or academic researcher with a focus on medieval studies, particularly on the works of Geoffrey Chaucer and his Canterbury Tales.\nWritting style: Formal, and structured, often used in academic or professional contexts.\nContext: During a meeting at work\nMedium: Technical document\nIntent: To alert or caution the reader about potential risks or dangers.\n\nInstructions:\nEach text entry must be short (1 to 2) sentences max.\nThe core idea in each text entry must be different from the others.\nThe language must be English\n\nMake sure that the text is concise. Concise text is characterized by clarity, precision, and brevity. It communicates ideas directly, using only the words necessary to convey the intended message. Concise text avoids redundancy, filler, and irrelevant details, focusing instead on impactful and purposeful language. It is speci

In [27]:
fluff_response

TextEntries(entries=[TextEntry(content='In the quaint, charming, and utterly delightful year of 1387, amidst a flurry of activity and a cacophony of sounds that characterized the bustling, vibrant, and incredibly lively city of London, a rather significant event took place, a truly momentous occasion, the writing of what would eventually become the incredibly well-known and widely celebrated Canterbury Tales by the one and only, the incomparable and unforgettable Geoffrey Chaucer, a man of letters who was, as we all know and quite possibly more than agree on, an extraordinarily talented writer'), TextEntry(content='The pilgrims, a motley crew of diverse characters, each one unique and wonderfully distinct, with their own stories to tell, their own personalities to reveal, and their own secrets to share, embarked on their journey to Canterbury Cathedral, a journey filled with excitement, adventure, and yes even a sprinkle of danger along the road')])

In [36]:
concise_response

TextEntries(entries=[TextEntry(content='The proposed methodology for analyzing the manuscript presents significant challenges in terms of data preservation and authenticity verification.'), TextEntry(content='Failure to address the identified inconsistencies in the source material may lead to inaccurate interpretations and conclusions.')])

In [None]:
# Column-oriented container: one list per dataset column
dataset_dict = {
    column: []
    for column in ['uuid', 'persona_id', 'persona', 'text', 'label', 'model', 'prompt']
}

In [None]:
import uuid

# Add to dataset dictionary
dataset_dict['uuid'].append(str(uuid.uuid4()))  # Generate a unique ID
dataset_dict['label'].append(1)  # 1 = fluffy
dataset_dict['model'].append("gemini-1.5-flash")

In [None]:
dataset_dict

{'uuid': ['8838c79c-87d6-470b-818b-2a6067854134'],
 'persona_id': [],
 'persona': [],
 'text': [],
 'label': [1],
 'model': ['gemini-1.5-flash'],
 'prompt': []}
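
In the output above only three of the seven columns received a value, so the columns no longer line up. A possible way to keep them aligned is to append one complete row at a time. `add_sample` below is a hypothetical helper, and the persona ID, text, and prompt are placeholder values.

```python
import uuid

COLUMNS = ["uuid", "persona_id", "persona", "text", "label", "model", "prompt"]

def add_sample(dataset_dict, **fields):
    """Append one row, generating the uuid and requiring every other column."""
    row = {"uuid": str(uuid.uuid4()), **fields}
    missing = set(COLUMNS) - set(row)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for column in COLUMNS:
        dataset_dict[column].append(row[column])

dataset_dict = {column: [] for column in COLUMNS}
add_sample(
    dataset_dict,
    persona_id="placeholder-id",
    persona="An English literature professor",
    text="Send coffee.",
    label=0,  # 0 = concise, 1 = fluffy
    model="gemini-1.5-flash",
    prompt="placeholder prompt",
)
column_lengths = {column: len(values) for column, values in dataset_dict.items()}
print(column_lengths)
```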

In [None]:
import uuid

# Generate a random UUID
random_uuid = uuid.uuid4()
random_uuid

UUID('d244acce-55ea-4a94-b336-8357f0e22a53')

# Next steps
- Write a function to generate fluffy sentences
- Write a function to generate concise sentences
- Add try / except to catch issues
- Proofread and improve the prompt
- Generate some examples
- Verify, and improve the prompt if necessary
- Think about what is relevant to track
- Convert to an HF dataset object and push to the Hub
- Test on some more samples
- If good, launch for a full dataset

# Another notebook maybe:
- upload to an Argilla space as pre-annotated
- human to review a sample
- compute an agreement score
- Compare agreement score with expected threshold
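
For the agreement score, one option is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. This is only one possible choice of metric; the function name and the toy labels below are illustrative.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    n = len(labels_a)
    # Fraction of items on which both annotators agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both annotated at random with their own label rates
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

model_labels = [1, 1, 0, 0]  # toy pre-annotations
human_labels = [1, 0, 0, 0]  # toy human review
kappa = cohen_kappa(model_labels, human_labels)
print(kappa)  # 0.5
```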

# Another notebook: SetFit model training.

In [None]:
import textstat

In [None]:
s1 = "What if we go on this lovely day, perhapse, swimming in the bank of the river?"
s2 = "What if we go simming in the river?"

In [None]:
textstat.flesch_reading_ease(s1), textstat.flesch_reading_ease(s2)

(80.62, 88.74)

In [None]:
textstat.flesch_kincaid_grade(s1), textstat.flesch_kincaid_grade(s2)

(6.0, 2.9)

In [None]:
textstat.dale_chall_readability_score_v2(s1), textstat.dale_chall_readability_score_v2(s2)

(5.42, 6.01)

In [None]:
textstat.avg_syllables_per_word(s1), textstat.avg_syllables_per_word(s2)

(1.3, 1.3)
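
The reading-ease scores above can be checked by hand with the standard Flesch formula, `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`. The sketch below plugs in the word counts of `s1` (16 words) and `s2` (8 words), each a single sentence, and assumes that the rounded 1.3 syllables per word reported above is the value that enters the formula.

```python
def flesch_reading_ease(words, sentences, syllables_per_word):
    """Standard Flesch reading-ease formula."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * syllables_per_word

score_s1 = flesch_reading_ease(words=16, sentences=1, syllables_per_word=1.3)
score_s2 = flesch_reading_ease(words=8, sentences=1, syllables_per_word=1.3)
print(score_s1, score_s2)  # roughly 80.6 and 88.7
```

With the syllable term identical for both sentences, the whole gap between the two scores comes from sentence length. This suggests that readability metrics of this kind reward shorter sentences rather than detect fluff directly.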

In [None]:
!pip install datasets


In [None]:
from datasets import load_dataset
ds = load_dataset("argilla/FinePersonas-v0.1", "default")

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).



In [None]:
num_labels = 14 # @param {"type":"integer","placeholder":"Number of classes"}
multi_label_flag = True # @param {"type":"boolean","placeholder":"False"}

In [None]:
!pip install textblob --quiet

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from textblob import Word

word = Word("happy")
synonyms = [syn.lemma_names() for syn in word.synsets]
print(synonyms)

[['happy'], ['felicitous', 'happy'], ['glad', 'happy'], ['happy', 'well-chosen']]


In [None]:
antonyms = []
for syn in word.synsets:
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print(antonyms)

['unhappy']


In [None]:
[syn.definition() for syn in word.synsets]

['enjoying or showing or marked by joy or pleasure',
 'marked by good fortune',
 'eagerly disposed to act or to be of service',
 'well expressed and to the point']

In [None]:
# Note: this rebinds LABELS, replacing the fluffy/concise label definitions used earlier
LABELS = ["propulsion", "orbit", "thermal", "power", "operations"]

In [None]:
i = 1
word = Word(LABELS[i])
[syn.definition() for syn in word.synsets]

['the (usually elliptical) path described by one celestial body in its revolution about another',
 'a particular environment or walk of life',
 'an area in which something acts or operates or has power or control:',
 'the path of an electron around the nucleus of an atom',
 'the bony cavity in the skull containing the eyeball',
 'move in an orbit']

In [None]:
i = 0
word = Word(LABELS[i])
[syn.lemma_names() for syn in word.synsets]

[['propulsion'], ['propulsion', 'actuation']]

In [None]:
label = "hit"

word = Word(label)

# Use a set comprehension to collect unique synonyms and ensure consistent casing
similar_words = {w.lower() for syn in word.synsets for w in syn.lemma_names()}

# Add the label to the set if not already present
similar_words.add(label.lower())

# Convert the set back to a list
similar_words = list(similar_words)

similar_words

['dispatch',
 'gain',
 'polish_off',
 'striking',
 'smash',
 'come_to',
 'bump_off',
 'bang',
 'score',
 'attain',
 'rack_up',
 'impinge_on',
 'slay',
 'hit',
 'make',
 'smasher',
 'tally',
 'murder',
 'pip',
 'collision',
 'arrive_at',
 'strike',
 'stumble',
 'run_into',
 'reach',
 'collide_with',
 'off',
 'hitting',
 'shoot',
 'remove']
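
WordNet lemma names join multi-word expressions with underscores (`polish_off`, `come_to`). If the synonyms are meant to be dropped into sentence templates, they could be normalized first. The helper name below is hypothetical and the input is a small excerpt of the list above.

```python
def normalize_lemmas(lemma_names):
    """Replace underscores with spaces, lowercase, de-duplicate, and sort."""
    return sorted({name.replace("_", " ").lower() for name in lemma_names})

sample = ["polish_off", "come_to", "hit", "Hit", "strike"]
cleaned = normalize_lemmas(sample)
print(cleaned)  # ['come to', 'hit', 'polish off', 'strike']
```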

In [None]:

TEXT_CLASSIFICATION_SYNONYM_SEEDS = [
    "This sentence discusses {label}",
    "Let's talk {label}",
    "This text is classified under {label}",
    "The primary focus here is {label}",
    "It's all about {label}",
    "This content relates to {label}",
    "Does this text talk about {label}?",
    "Assign the {label} category to this sentence",
    "This illustrates the concept of '{label}'",
    "This document pertains to {label}",
    "{label} is the core of the content"
]

def create_starting_sentences(label: str):
    sentences = [s.format(label=label) for s in TEXT_CLASSIFICATION_SYNONYM_SEEDS]
    return sentences

In [None]:
label = "propulsion"

In [None]:
sent = create_starting_sentences(label='propulsion')
print(sent)

['This sentence discusses propulsion', "Let's talk propulsion", 'This text is classified under propulsion', 'The primary focus here is propulsion', "It's all about propulsion", 'This content relates to propulsion', 'Does this text talk about propulsion?', 'Assign the propulsion category to this sentence', "This illustrates the concept of 'propulsion'", 'This document pertains to propulsion', 'propulsion is the core of the content']


In [None]:
str(label)

'propulsion'

In [None]:
!pip install --upgrade gensim --quiet

# How can we generate instances
### Without LLMs

- synonyms
- seed sentence templates
- internet search in a source where we are more or less sure to get sentences that match the label's semantics
- top_n most similar embedding in an embedding space
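
The last idea, retrieving the `top_n` nearest neighbours in an embedding space, can be sketched with plain cosine similarity. The three-dimensional vectors below are toy values standing in for real embeddings, and `top_n_similar` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(query, candidates, n):
    """Return the n candidate keys whose vectors are closest to the query."""
    ranked = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)
    return ranked[:n]

# Toy embeddings (made-up numbers)
embeddings = {
    "thrust": [0.9, 0.1, 0.0],
    "engine": [0.8, 0.3, 0.1],
    "battery": [0.1, 0.9, 0.2],
    "orbit": [0.0, 0.2, 0.9],
}
label_vector = [1.0, 0.0, 0.0]  # stand-in for the embedding of "propulsion"
nearest = top_n_similar(label_vector, embeddings, n=2)
print(nearest)  # ['thrust', 'engine']
```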