# Fluff Detection: A Step Toward Concise Writing

## Objective
This project aims to develop a text classifier, specifically **a fluff detector**, to identify and reduce verbosity in English writing. By categorizing and annotating instances of fluff, we can later fine-tune language models to rewrite text with clarity and precision.

> **In this notebook we will focus on creating a small synthetic dataset to train a first version of the text classifier**

## What Is Fluff?
Fluff refers to superfluous elements in writing that increase length but do not enhance meaning.
- It weakens clarity and reduces communication effectiveness.


Below are examples illustrating common types of fluff and how they can overlap:

#### Example 1
- **Fluffy**: *It is absolutely and completely necessary for us to thoroughly and carefully evaluate all aspects of the situation.* ❌
- **Concise**: *We must evaluate all aspects of the situation.* ✅

#### Example 2
- **Fluffy**: *As I mentioned earlier, we actually need to start working on this project sooner rather than later to ensure that we meet the deadline.* ❌
- **Concise**: *We need to start this project soon to meet the deadline.* ✅

#### Example 3
- **Fluffy**: *At the end of the day, we need to think outside the box to ensure that this project reaches its maximum potential.* ❌
- **Concise**: *We need creative ideas to make this project succeed.* ✅



## How Will We Build The Dataset?
We want to minimize the time it takes to build the dataset, so we'll:
- Define our labels and describe them.
- Choose a language.
- Describe the type of text entries we would like to generate, including prompt variations to create different styles, scenarios, and intents.
- Use a dataset of personas to maximize diversity across the potential writers of verbose / concise text.

### 1. Labels
We will use the following two simple labels:
- **`fluffy`** - (stored as `0`, its index in the `LABELS` list below)
- **`concise`** - (stored as `1`)

We'll also provide a description for each label:
- `name`: the name of the label.
- `instruction`: a specific instruction telling the LLM which of the two labels to generate.
- `description`: a description the generative model uses to understand what we expect for that label.


In [1]:
LABELS = [
    {
        "name": "fluffy",
        "instruction": "Make sure that the text is realistic, but a bit verbose and with a bit of fluff.",
        "description": "Fluff in text often includes some redundancy, filler words, excessive qualifiers, unnecessary adjectives or adverbs, irrelevant information, and repetition of known context."
    },
    {
        "name": "concise",
        "instruction": "Make sure that the text is very concise, and compact.",
        "description": "Concise text is characterized by clarity, precision, and brevity. It communicates ideas directly, in a very compact manner using only the words necessary to convey the intended message.",
    }
]

### 2. Language
We will choose a language:
- For simplicity we'll stick to **English**.

In [2]:
LANGUAGES = {"en": "English"}
# Could also be
# LANGUAGES = {
#     "fr": "French",
#     "es": "Spanish",
#     "en": "English"
# }

### 3. Description of Text Entries
We would like to focus on classifying:
- `sentence-level`: each text is short (1 or 2 sentences)

To create a **realistic** dataset we need to inject **variations**.
> For this, **we will combine different styles, contexts, communication mediums and intents, together with unique personas.**
- `persona`: text that could be written by a specific persona (more on that later)
- `styles`: phrases written in different styles
- `context`: phrases that can be written in different everyday contexts
- `medium`: text that can be written on different mediums
- `intent`: different flavors of intent

All this will help us construct a procedural prompt for high-diversity, yet realistic texts.

**Example Generation Prompt**:

*Imagine 3 different but realistic sentences which could have been written by the following person:*

*Persona: Leo, A small wood manufacturing business owner who just got his first kid.*

*Writing style: A tendency for exaggeration*

*Context: In the middle of the night*

*Medium: Email*

*Intent: To warn somebody about something.*

Next, I would use HuggingChat, Claude or ChatGPT to brainstorm ideas for styles, contexts, mediums, and intents.

> ⚠ These properties should be generic enough not to conflict with each other, and especially not with either of the labels.

In [3]:
STYLES = {
    "formal1": "Structured and professional, suitable for official contexts.",
    "formal2": "Polished and respectful, adhering to formal conventions.",
    "formal3": "Clear and authoritative, used for professional communication.",
    "formal4": "Impersonal and precise, avoiding casual language.",
    "formal5": "Serious and proper, often found in legal or academic writing.",
    "technical1": "Detailed and precise, explaining concepts clearly.",
    "technical2": "Jargon-heavy and factual, tailored for experts.",
    "technical3": "Objective and structured, presenting data or instructions.",
    "technical4": "Focused and methodical, designed for problem-solving.",
    "technical5": "Concise and systematic, delivering technical details.",
    "casual": "Relaxed and conversational, like chatting with a friend.",
    "narrative": "Story-like, focused on events or characters in a sequence.",
    "humorous": "Playful and witty, aimed at entertaining or amusing.",
    "technical": "Precise and factual, used for scientific or detailed explanations.",
    "persuasive": "Convincing and influential, designed to sway opinions or actions.",
    "descriptive": "Vivid and detailed, creating a clear mental picture.",
    "emotional": "Expressive and heartfelt, focused on feelings or personal views.",
    "instructional": "Clear and step-by-step, aimed at guiding or teaching.",
    "poetic": "Artistic and rhythmic, often using metaphors and imagery.",
    "journalistic": "Objective and concise, focusing on facts and current events.",
    "rhetorical": "Argumentative and impactful, designed to provoke thought."
}

In [4]:
CONTEXTS = {
    "professional_meeting": "During a meeting at work.",
    "family_gathering": "During a family gathering.",
    "public_event": "At a public event, like a concert or seminar.",
    "social_media": "While scrolling through social media.",
    "night_time": "Late at night, during quiet hours.",
    "early_morning": "Early in the morning, starting the day.",
    "evening": "In the evening, winding down after the day.",
    "overtime": "Working late hours, beyond regular time.",
    "travel_scenario": "While traveling or preparing for a trip.",
    "celebratory_event": "During a joyful moment, like a party or achievement.",
    "emergency_situation": "In an urgent or crisis situation requiring immediate action.",
    "outdoor_activity": "While engaging in an activity outdoors, like hiking or picnicking.",
    "quiet_reflection": "In a moment of calm reflection or introspection.",
    "holiday_season": "During festive holidays or seasonal celebrations.",
    "classroom_setting": "In a classroom or educational environment.",
    "waiting_room": "While waiting at a doctor’s office or similar space.",
    "sports_event": "At a sports event, watching or participating.",
    "work_from_home": "Working remotely from home.",
    "traffic_jam": "Stuck in traffic or commuting.",
    "coffee_shop": "Relaxing or working in a coffee shop.",
    "shopping": "While shopping in a mall or store.",
    "childcare": "Taking care of children or helping with homework.",
    "fitness_activity": "During a workout or physical exercise session.",
    "friend_gathering": "Hanging out with friends in a casual setting.",
    "study_session": "Focused on studying or group learning.",
    "online_meeting": "During a video conference or virtual call."
}

In [5]:
MEDIUMS = {
    "email": "email",
    "text_message": "A quick text message",
    "social_media_post": "A post or comment shared on social media platforms.",
    "technical_document": "Technical document",
    "paper": "Scientific paper",
    "book": "Book",
    "personal_note": "Personal notes",
}

In [6]:
INTENTS = {
    "inform": "To provide information or knowledge to the reader.",
    "warn": "To alert or caution the reader about potential risks or dangers.",
    "entertain": "To amuse or captivate the reader through engaging content.",
    "motivate": "To inspire or encourage the reader to take action or feel uplifted.",
    "persuade": "To convince the reader to agree with a point of view or take a specific action.",
    "reflect": "To ponder or explore thoughts, emotions, or experiences.",
    "request": "To ask for information, action, or assistance from the reader.",
    "express_gratitude": "To show appreciation or thanks to the reader."
}

Then we are ready to put everything together for a prompt template.
> *With the large diversity of personas and the contextual variations set above, this will produce many different prompts, which maximizes diversity.*
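To get a feel for how much variety the four dictionaries provide before personas are even considered, the sketch below multiplies their sizes (21 styles, 26 contexts, 7 mediums, 8 intents). The counts are hard-coded so the snippet runs on its own; inside the notebook one could use `len(STYLES)` and so on instead. The name `variation_sizes` is illustrative only.

```python
import math

# Sizes of the variation dictionaries defined above (hard-coded for this sketch).
variation_sizes = {
    "styles": 21,
    "contexts": 26,
    "mediums": 7,
    "intents": 8,
}

# Each prompt picks one value per dictionary, so the counts multiply.
n_combinations = math.prod(variation_sizes.values())
print(n_combinations)  # 30576
```

Each persona is then paired with a random draw from these combinations, so repeated prompts are unlikely even in a larger run.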

In [7]:
PROMPT_TEMPLATE = """Imagine {num_entries} different and realistic sentences which could have been written by the following person:
Persona: {persona}
Writing style: {style}
Context: {context}
Medium: {medium}
Intent: {intent}

Instructions:
Each text entry must be short (1 to 2 sentences max).
The core idea in each text entry must be different from the others.
The language must be {language}
"""

Let's see the prompt construction in action:

In [8]:
generation_prompt = PROMPT_TEMPLATE.format(
    persona="Leo, A small wood manufacturing business owner who just got his first kid.",
    num_entries=2,
    style=STYLES['humorous'],
    context=CONTEXTS['night_time'],
    medium=MEDIUMS["social_media_post"],
    intent=INTENTS["request"],
    language=LANGUAGES['en']
)

print(generation_prompt)

Imagine 2 different and realistic sentences which could have been written by the following person:
Persona: Leo, A small wood manufacturing business owner who just got his first kid.
Writing style: Playful and witty, aimed at entertaining or amusing.
Context: Late at night, during quiet hours.
Medium: A post or comment shared on social media platforms.
Intent: To ask for information, action, or assistance from the reader.

Instructions:
Each text entry must be short (1 to 2 sentences max).
The core idea in each text entry must be different from the others.
The language must be English



## Connect to Google Gemini
Now that we can generate many different prompts, it's time to generate responses.

To generate samples we will:
- Choose an LLM: we'll use *`gemini-1.5-flash`*, which has a quota of **15 RPM (requests per minute) and 1,500 RPD (requests per day)**.
    - It's not huge, but given that it is free, it's great!

- Configure structured outputs: we'll use Pydantic and the `instructor` library to make Gemini answer in a specific format.
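A per-minute quota translates into a minimum spacing between requests: 15 RPM means at most one call every 4 seconds. The sketch below shows one way to enforce that spacing. It is an alternative to the counter-and-pause approach used later in this notebook, not what the notebook actually runs, and the `Throttle` class is a hypothetical helper.

```python
import time


class Throttle:
    """Space out calls so that at most `rpm` of them happen per minute."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm  # seconds required between two calls
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


throttle = Throttle(rpm=15)
print(throttle.min_interval)  # 4.0
```

Calling `throttle.wait()` right before each API request would keep the request rate under the quota without a fixed 60-second pause.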

In [9]:
!pip install instructor datasets --quiet

In [10]:
import google.generativeai as genai
from google.colab import userdata
from datasets import load_dataset
from pydantic import BaseModel
import instructor
import random
import uuid
import time
import os

If you want to know more about working with Gemini, I have a few tutorial notebooks [here](https://patrickfleith.github.io/datapipes/?utm_source=notebooks&utm_medium=colab&utm_campaign=notebooks) which I try to keep up-to-date.

In [11]:
# Setup your API in Colab Secrets and read it here. Pass it to genai to interact with Gemini.
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [12]:
# let's instantiate a model and try it
MODEL_ID = "gemini-1.5-flash"
model = genai.GenerativeModel(model_name=MODEL_ID)
response = model.generate_content(generation_prompt)
print(response.text)

1.  "So, the tiny human finally decided sleep is overrated. Anyone got tips on surviving on 3 hours of sleep AND running a woodworking business? Send coffee (and maybe a nap pod)."

2.  "My sawdust-covered workshop is officially a baby-proofing nightmare.  Seriously, any recommendations for child-safe woodworking clamps?  Asking for a sleep-deprived, slightly frantic, father-carpenter."



### Create the class for structured output


Creating Pydantic models lets us automatically derive a JSON schema, which is described to the LLM at inference time and which `instructor` uses to validate and parse the model's response. If you want to better understand how to generate structured outputs, check out my guides [here](https://patrickfleith.github.io/datapipes/?utm_source=notebooks&utm_medium=colab&utm_campaign=notebooks)

In [13]:
from pydantic import BaseModel, Field

class TextEntries(BaseModel):
    entries: list[str] = Field(
        ...,
        description="List of texts"
    )

In [14]:
# we can access the JSON Schema
TextEntries.model_json_schema()

{'properties': {'entries': {'description': 'List of texts',
   'items': {'type': 'string'},
   'title': 'Entries',
   'type': 'array'}},
 'required': ['entries'],
 'title': 'TextEntries',
 'type': 'object'}
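To make the role of the schema more concrete, here is a minimal sketch of Pydantic validation on its own, without any LLM call. The two JSON strings are hand-written stand-ins for a model response, and the model class is repeated so the snippet is self-contained.

```python
from pydantic import BaseModel, Field, ValidationError


class TextEntries(BaseModel):
    entries: list[str] = Field(..., description="List of texts")


# A well-formed response parses into a typed Python object.
good = TextEntries.model_validate_json('{"entries": ["First text.", "Second text."]}')
print(good.entries)

# A response that does not match the schema raises a ValidationError.
rejected = False
try:
    TextEntries.model_validate_json('{"texts": ["wrong key"]}')
except ValidationError as exc:
    rejected = True
    print(f"Rejected with {exc.error_count()} validation error(s)")
```

With a structured-output client, this kind of check happens for us, and we get a `TextEntries` instance back instead of raw text.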

We wrap the Gemini model (*the google_client*) in an `instructor` client so that responses follow the desired structured output.

In [15]:
google_client = genai.GenerativeModel(
    model_name=MODEL_ID)

client = instructor.from_gemini(
    client=google_client,
    mode=instructor.Mode.GEMINI_JSON,
)

### Persona
As for the persona, we'll be using the **`FinePersonas`** dataset, an open dataset of 21 million detailed personas, which has specifically been built for diverse and controllable synthetic text generation.

👏 Big kudos to the [Argilla](https://argilla.io/) team for this dataset.
You can find this dataset [here](https://huggingface.co/datasets/argilla/FinePersonas-v0.1)

In [16]:
fine_personas_dataset = load_dataset("argilla/FinePersonas-v0.1", "default")

We will not need all 21 million personas 😅, only:
- 400 for the train set.
- 100 for the test set.

For each persona we will generate *`N`* texts (here `N = 2`) for each of the 2 classes (fluffy or concise). This means:
- 800 Gemini API requests (1,600 generated datapoints) in the train set
- 200 Gemini API requests (400 generated datapoints) in the test set

✅ This complies with the quota of 1,500 RPD (1,000 requests in total).

✅ This will result in a dataset of 2k examples (1,600 train + 400 test).
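The arithmetic behind these numbers can be written out as a small sketch. The constant and function names are illustrative; the values come from the quota and split sizes quoted above, assuming two texts per request.

```python
RPD_QUOTA = 1500          # requests per day, as quoted above
N_LABELS = 2              # fluffy and concise
ENTRIES_PER_REQUEST = 2   # texts requested per API call (the N above)


def budget(n_personas):
    """Return (api_requests, generated_texts) for a given number of personas."""
    n_requests = n_personas * N_LABELS  # one request per persona and label
    return n_requests, n_requests * ENTRIES_PER_REQUEST


train_requests, train_texts = budget(400)
test_requests, test_texts = budget(100)
total_requests = train_requests + test_requests

print(train_requests, train_texts)   # 800 1600
print(test_requests, test_texts)     # 200 400
print(total_requests <= RPD_QUOTA)   # True
```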

> Below I actually use 40 train personas and 10 test personas to run the notebook faster.

In [17]:
personas = fine_personas_dataset['train'].train_test_split(
    train_size=40,
    test_size=10,
    shuffle=True,
    seed=15)

Let's check the reduced dataset

In [18]:
personas

DatasetDict({
    train: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 40
    })
    test: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 10
    })
})

In [19]:
# check an example persona from the dataset
personas['train'][0]

{'id': '<urn:uuid:b5cbc9da-10e6-4be1-bc90-ef8ed3cc99c5>',
 'persona': 'An environmentally conscious middle school student who is interested in science and is doing a project on deforestation, its causes, effects and potential solutions.',
 'labels': '["Environmental", "Scientific", "Professional"]'}

### Utilities
- One function that randomly picks one variation among styles, intents, languages, etc.
- One function that, given the prompt template and the randomized variables, constructs the prompt.
    - This function also adds a special instruction for Gemini to generate either **fluffy** or **concise** texts.

In [20]:
def pick_one(variations):
    """ Randomly pick a variation among several possible styles, intents, mediums, contexts, languages """
    random_variation_key = random.choice(list(variations.keys()))
    random_variation_value = variations[random_variation_key]
    return random_variation_value
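Because `pick_one` relies on the global `random` state, two runs of the notebook will draw different variations. If reproducible prompts are wanted, one option is to pass an explicit seeded generator, as in the sketch below. `pick_one_seeded` and `demo_styles` are hypothetical names used only for illustration.

```python
import random


def pick_one_seeded(variations, rng):
    """Pick one value from a dict of variations using the given random generator."""
    key = rng.choice(list(variations.keys()))
    return variations[key]


# Hypothetical mini-dictionary standing in for STYLES.
demo_styles = {
    "casual": "Relaxed and conversational.",
    "poetic": "Artistic and rhythmic.",
    "formal": "Structured and professional.",
}

# Two generators created with the same seed produce the same sequence of picks.
rng_a = random.Random(15)
rng_b = random.Random(15)
picks_a = [pick_one_seeded(demo_styles, rng_a) for _ in range(5)]
picks_b = [pick_one_seeded(demo_styles, rng_b) for _ in range(5)]
print(picks_a == picks_b)  # True
```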

In [21]:
def generate_prompt(num_entries, persona, style, context, medium, intent, language, label_idx):
    generation_prompt = PROMPT_TEMPLATE.format(
        persona = persona,
        num_entries = num_entries,
        style = style,
        context = context,
        medium = medium,
        intent = intent,
        language = language
    )

    final_prompt = generation_prompt + f"\n{LABELS[label_idx]['instruction']} {LABELS[label_idx]['description']}"

    return final_prompt

## Generation Time ⏰
- We loop over each persona
- We generate a unique prompt for each label (fluffy, concise)
- We generate a model response for each label
- We extract the generated text from our structured outputs and store them in a dictionary
- We add counters to avoid exceeding the Gemini quota, and try/except blocks to catch potential errors.

In [27]:
# we'll store our data in this dictionary
dataset_dict = {
    'uuid': [],
    'persona': [],
    'text': [],
    'label': [],
    'model': []
    }

In [28]:
# we set two counters to make sure we don't exceed the quota; we pause before exceeding it.
rpd_counter = 0
rpm_counter = 0

for persona in personas['train']:

    for label_idx, label in enumerate(LABELS):

        prompt = generate_prompt(
            num_entries=2,  # we will generate two texts for each LLM call
            persona=persona['persona'],
            style=pick_one(STYLES),
            context=pick_one(CONTEXTS),
            medium=pick_one(MEDIUMS),
            intent=pick_one(INTENTS),
            language=pick_one(LANGUAGES),  # only English in this example notebook
            label_idx=label_idx,  # index of the current label in LABELS
        )

        try:
            model_response = client.messages.create(
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful AI Assistant"},
                    {
                        "role": "user",
                        "content": prompt},
                ],
                response_model=TextEntries
            )

            try:
                for entry in model_response.entries:
                    dataset_dict['uuid'].append(str(uuid.uuid4()))  # Generate a unique ID
                    dataset_dict['persona'].append(persona['persona'])
                    dataset_dict['text'].append(entry)
                    dataset_dict['label'].append(label_idx)
                    dataset_dict['model'].append(MODEL_ID)
            except Exception as exc:
                print(f"Error saving record in dataset dictionary: {exc}")

        except Exception as exc:
            print(f"Error calling model: {exc}")
            time.sleep(10)  # wait for 10s if an exception occurred.

    rpd_counter += 2
    if rpd_counter > 1200:
        break

    print(f"{rpd_counter//2} personas --- Generated a total of {rpd_counter*2} instances so far")

    rpm_counter += 2
    if rpm_counter >= 14:
        print("Quota pause...")
        time.sleep(60)
        rpm_counter = 0

1 personas --- Generated a total of 4 instances so far
2 personas --- Generated a total of 8 instances so far
3 personas --- Generated a total of 12 instances so far
4 personas --- Generated a total of 16 instances so far
5 personas --- Generated a total of 20 instances so far
6 personas --- Generated a total of 24 instances so far
7 personas --- Generated a total of 28 instances so far
Quota pause...
8 personas --- Generated a total of 32 instances so far
9 personas --- Generated a total of 36 instances so far
10 personas --- Generated a total of 40 instances so far
11 personas --- Generated a total of 44 instances so far
12 personas --- Generated a total of 48 instances so far
13 personas --- Generated a total of 52 instances so far
14 personas --- Generated a total of 56 instances so far
Quota pause...
15 personas --- Generated a total of 60 instances so far
16 personas --- Generated a total of 64 instances so far
17 personas --- Generated a total of 68 instances so far
18 personas 
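The loop above prints an error and moves on when a call fails, so that persona/label pair is simply lost. One possible refinement, sketched here under the assumption that failures are transient, is to retry with an increasing delay. `call_with_retries` and `flaky_call` are hypothetical helpers, not part of the notebook or of any library.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=10.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)


# Demo with a stand-in function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated API error")
    return "ok"


# The sleep is replaced by a no-op so the demo runs instantly.
result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result)  # ok
```

In the notebook, `fn` could be a small lambda wrapping the `client.messages.create(...)` call; retries also consume quota, so the request counters would need to account for them.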

In [29]:
from datasets import Dataset
ds = Dataset.from_dict(dataset_dict)

In [30]:
ds

Dataset({
    features: ['uuid', 'persona', 'text', 'label', 'model'],
    num_rows: 160
})
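Before publishing, it can be worth checking that both classes are equally represented, since failed calls may leave one label with fewer rows. The sketch below counts labels with `collections.Counter`; `demo_labels` is a made-up stand-in for `dataset_dict['label']`.

```python
from collections import Counter

# Made-up stand-in for dataset_dict['label']: two texts per label per persona.
demo_labels = [0, 0, 1, 1, 0, 0, 1, 1]

counts = Counter(demo_labels)
is_balanced = len(set(counts.values())) == 1  # every label has the same count

print(dict(counts))  # {0: 4, 1: 4}
print(is_balanced)   # True
```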

🥳 **We have our dataset and we can push it to the hub!**

In [31]:
HF_USERNAME="patrickfleith" # <--- replace with yours otherwise this won't work!
PRIVATE = False # choose if you want a private dataset. If False it will be public.

ds.push_to_hub(
    repo_id=f'{HF_USERNAME}/fluff-vs-concise-demo',
    private=PRIVATE
)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/patrickfleith/fluff-vs-concise-demo/commit/628ac0a95a7d9a85946f3504ef6d72a3b972e79d', commit_message='Upload dataset', commit_description='', oid='628ac0a95a7d9a85946f3504ef6d72a3b972e79d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/patrickfleith/fluff-vs-concise-demo', endpoint='https://huggingface.co', repo_type='dataset', repo_id='patrickfleith/fluff-vs-concise-demo'), pr_revision=None, pr_num=None)