# Fluff Detection: A Step Toward Concise Writing

## Objective
This project aims to develop a text classifier, specifically **a fluff detector**, to identify and reduce verbosity in English writing. By categorizing and annotating instances of fluff, we can later fine-tune language models to rewrite text with clarity and precision.

> **In this notebook we will focus on creating a small synthetic dataset to train a first version of the text classifier**

## What Is Fluff?
Fluff refers to superfluous elements in writing that increase length but do not enhance meaning.
→ It weakens clarity and reduces communication effectiveness.


Below are examples illustrating common types of fluff, which often overlap:

#### Example 1
- **Fluffy**: *It is absolutely and completely necessary for us to thoroughly and carefully evaluate all aspects of the situation.* ❌
- **Concise**: *We must evaluate all aspects of the situation.* ✅

#### Example 2
- **Fluffy**: *As I mentioned earlier, we actually need to start working on this project sooner rather than later to ensure that we meet the deadline.* ❌
- **Concise**: *We need to start this project soon to meet the deadline.* ✅

#### Example 3
- **Fluffy**: *At the end of the day, we need to think outside the box to ensure that this project reaches its maximum potential.* ❌
- **Concise**: *We need creative ideas to make this project succeed.* ✅



## How Will We Build The Dataset?
We want to minimize the time it takes to build the dataset, so we'll:
- Define our labels and describe them.
- Choose a language.
- Describe the type of text entries we would like to generate.
- Use a dataset of personas to achieve great diversity.

### 1. Labels
We will use the following two simple labels:
- **`fluffy`** - (also labelled `0`, its index in the `LABELS` list below)
- **`concise`** - (also labelled `1`)

We'll also provide a description for each label.

In [1]:
LABELS = [
    {
        "name": "fluffy",
        "description": "Fluff in text often includes redundancy, filler words, excessive qualifiers, unnecessary adjectives or adverbs, irrelevant information, and repetition of known context.",
        "instruction": "Make sure that the text is realistic, but a bit verbose and with a bit of fluff."
    },
    {
        "name": "concise",
        "description": "Concise text is characterized by clarity, precision, and brevity. It communicates ideas directly, in a very compact manner using only the words necessary to convey the intended message.",
        "instruction": "Make sure that the text is very concise, and compact."
    }
]
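
Note that the integer stored with each generated text later in this notebook is the position of the label in `LABELS`, so the list order fixes the encoding. A minimal sketch that makes the mapping explicit (the names `label2id` and `id2label` are illustrative and not part of the original notebook; `LABELS` is abbreviated here to keep the snippet self-contained):

```python
# Abbreviated copy of LABELS: only the names matter for the mapping.
LABELS = [{"name": "fluffy"}, {"name": "concise"}]

# Hypothetical helper mappings derived from the list order.
label2id = {label["name"]: idx for idx, label in enumerate(LABELS)}
id2label = {idx: name for name, idx in label2id.items()}

print(label2id)  # {'fluffy': 0, 'concise': 1}
```

Keeping such a mapping next to the dataset may help avoid mixing up the two classes when training the classifier.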

### 2. Language
We will choose a language:
- For simplicity we'll stick to **English**.

In [4]:
LANGUAGES = {"en": "English"}
# Could also be
# LANGUAGES = {
#     "fr": "French",
#     "es": "Spanish",
#     "en": "English"
# }

### 3. Description of Text Entries
We would like to focus on classifying:
- `sentence-level`: each text is a short entry of one or two sentences
- `persona`: text that could be written by a specific persona (more on that later)
- `styles`: text written in different styles
- `context`: text that can be written in different everyday contexts
- `medium`: text that can be written on different mediums
- `intent`: different flavors of intent

All this will help us construct a procedural prompt for high-diversity, yet realistic texts.

**Example Generation Prompt**:

*Imagine 3 different but realistic sentences which could have been written by the following person:*

*Persona: Leo, A small wood manufacturing business owner who just got his first kid.*

*Writing style: A tendency for exaggeration*

*Context: In the middle of the night*

*Medium: Email*

*Intent: To warn somebody about something.*

I'll use HuggingChat to generate some ideas for styles, contexts, mediums, and intents.

I'll use this prompt: `Can you give me 5 ideas to describe different writing styles somebody could use?`

> ⚠ These properties should be generic enough not to conflict with each other, and especially not to conflict with one of the labels.

In [6]:
STYLES = {
    "formal1": "Structured and professional, suitable for official contexts.",
    "formal2": "Polished and respectful, adhering to formal conventions.",
    "formal3": "Clear and authoritative, used for professional communication.",
    "formal4": "Impersonal and precise, avoiding casual language.",
    "formal5": "Serious and proper, often found in legal or academic writing.",
    "technical1": "Detailed and precise, explaining concepts clearly.",
    "technical2": "Jargon-heavy and factual, tailored for experts.",
    "technical3": "Objective and structured, presenting data or instructions.",
    "technical4": "Focused and methodical, designed for problem-solving.",
    "technical5": "Concise and systematic, delivering technical details.",
    "casual": "Relaxed and conversational, like chatting with a friend.",
    "narrative": "Story-like, focused on events or characters in a sequence.",
    "humorous": "Playful and witty, aimed at entertaining or amusing.",
    "technical": "Precise and factual, used for scientific or detailed explanations.",
    "persuasive": "Convincing and influential, designed to sway opinions or actions.",
    "descriptive": "Vivid and detailed, creating a clear mental picture.",
    "emotional": "Expressive and heartfelt, focused on feelings or personal views.",
    "instructional": "Clear and step-by-step, aimed at guiding or teaching.",
    "poetic": "Artistic and rhythmic, often using metaphors and imagery.",
    "journalistic": "Objective and concise, focusing on facts and current events.",
    "rhetorical": "Argumentative and impactful, designed to provoke thought."
}
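
The warning above about conflicts with the labels can be checked mechanically. The sketch below is a crude keyword scan, and an assumption on how one might do it: `find_label_conflicts` is a hypothetical helper, and the sample dictionary holds three entries copied from `STYLES`. A hit only means a description mentions a label name; whether that biases generation in practice is a judgment call.

```python
def find_label_conflicts(variations, label_names):
    """Return {key: [label names found]} for descriptions that mention a label name."""
    conflicts = {}
    for key, description in variations.items():
        hits = [name for name in label_names if name.lower() in description.lower()]
        if hits:
            conflicts[key] = hits
    return conflicts


# Three entries copied from STYLES; the full dictionary works the same way.
sample_styles = {
    "technical5": "Concise and systematic, delivering technical details.",
    "journalistic": "Objective and concise, focusing on facts and current events.",
    "casual": "Relaxed and conversational, like chatting with a friend.",
}

conflicts = find_label_conflicts(sample_styles, ["fluffy", "concise"])
print(conflicts)  # {'technical5': ['concise'], 'journalistic': ['concise']}
```

Entries flagged this way could be reworded or dropped if they turn out to pull the `fluffy` examples toward concise text.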

In [7]:
CONTEXTS = {
    "professional_meeting": "During a meeting at work.",
    "family_gathering": "During a family gathering.",
    "public_event": "At a public event, like a concert or seminar.",
    "social_media": "While scrolling through social media.",
    "night_time": "Late at night, during quiet hours.",
    "early_morning": "Early in the morning, starting the day.",
    "evening": "In the evening, winding down after the day.",
    "overtime": "Working late hours, beyond regular time.",
    "travel_scenario": "While traveling or preparing for a trip.",
    "celebratory_event": "During a joyful moment, like a party or achievement.",
    "emergency_situation": "In an urgent or crisis situation requiring immediate action.",
    "outdoor_activity": "While engaging in an activity outdoors, like hiking or picnicking.",
    "quiet_reflection": "In a moment of calm reflection or introspection.",
    "holiday_season": "During festive holidays or seasonal celebrations.",
    "classroom_setting": "In a classroom or educational environment.",
    "waiting_room": "While waiting at a doctor’s office or similar space.",
    "sports_event": "At a sports event, watching or participating.",
    "work_from_home": "Working remotely from home.",
    "traffic_jam": "Stuck in traffic or commuting.",
    "coffee_shop": "Relaxing or working in a coffee shop.",
    "shopping": "While shopping in a mall or store.",
    "childcare": "Taking care of children or helping with homework.",
    "fitness_activity": "During a workout or physical exercise session.",
    "friend_gathering": "Hanging out with friends in a casual setting.",
    "study_session": "Focused on studying or group learning.",
    "online_meeting": "During a video conference or virtual call."
}

In [8]:
MEDIUMS = {
    "email": "email",
    "text_message": "A quick text message",
    "social_media_post": "A post or comment shared on social media platforms.",
    "technical_document": "Technical document",
    "paper": "Scientific paper",
    "book": "Book",
    "personal_note": "Personal notes",
}

In [10]:
INTENTS = {
    "inform": "To provide information or knowledge to the reader.",
    "warn": "To alert or caution the reader about potential risks or dangers.",
    "entertain": "To amuse or captivate the reader through engaging content.",
    "motivate": "To inspire or encourage the reader to take action or feel uplifted.",
    "persuade": "To convince the reader to agree with a point of view or take a specific action.",
    "reflect": "To ponder or explore thoughts, emotions, or experiences.",
    "request": "To ask for information, action, or assistance from the reader.",
    "express_gratitude": "To show appreciation or thanks to the reader."
}

We are now ready to put everything together in a prompt template.
> *With the large diversity of personas and the contextual variations set above, this will create many different prompts, which helps optimize for diversity.*
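
To get a feel for the size of the variation space, the sketch below multiplies the number of entries in each dictionary. The counts are written as literals to keep the snippet self-contained; in the notebook they would be `len(STYLES)`, `len(CONTEXTS)`, `len(MEDIUMS)` and `len(INTENTS)`. The variable names are illustrative.

```python
import math

# Number of entries in each dictionary defined above.
variation_sizes = {"styles": 21, "contexts": 26, "mediums": 7, "intents": 8}

# Distinct (style, context, medium, intent) combinations for a single persona.
combinations_per_persona = math.prod(variation_sizes.values())
print(combinations_per_persona)  # 30576
```

Since only one combination is drawn per persona and label, the generated prompts cover a small random sample of this space.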

In [11]:
PROMPT_TEMPLATE = """Imagine {num_entries} different and realistic sentences which could have been written by the following person:
Persona: {persona}
Writing style: {style}
Context: {context}
Medium: {medium}
Intent: {intent}

Instructions:
Each text entry must be short (1 to 2 sentences max).
The core idea in each text entry must be different from the others.
The language must be {language}
"""

Let's see the prompt construction in action

In [12]:
generation_prompt = PROMPT_TEMPLATE.format(
    persona = "Leo, A small wood manufacturing business owner who just got his first kid.",
    num_entries = 2,
    style = STYLES['humorous'],
    context = CONTEXTS['night_time'],
    medium = MEDIUMS["social_media_post"],
    intent = INTENTS["request"],
    language=LANGUAGES['en']
)

In [13]:
print(generation_prompt)

Imagine 2 different and realistic sentences which could have been written by the following person:
Persona: Leo, A small wood manufacturing business owner who just got his first kid.
Writing style: Playful and witty, aimed at entertaining or amusing.
Context: Late at night, during quiet hours.
Medium: A post or comment shared on social media platforms.
Intent: To ask for information, action, or assistance from the reader.

Instructions:
Each text entry must be short (1 to 2 sentences max).
The core idea in each text entry must be different from the others.
The language must be English



## Generation
Now that we can generate many different prompts, it's time to generate responses.

To generate samples we will:
- Choose an LLM: we'll use `gemini-1.5-flash`, which has a quota of 15 RPM (requests per minute) and 1,500 RPD (requests per day). It's not huge, but given that it's free, it's great!

- Configure structured outputs: we'll use `pydantic` and the `instructor` library to force Gemini to answer in a specific format.

In [14]:
!pip install instructor datasets --quiet

In [15]:
import google.generativeai as genai
from google.colab import userdata
from datasets import load_dataset
from pydantic import BaseModel
import instructor
import random
import uuid
import time
import os

If you want to know more about working with Gemini, I have a few tutorial notebooks [here](https://patrickfleith.github.io/datapipes/?utm_source=notebooks&utm_medium=colab&utm_campaign=notebooks) which I try to keep up-to-date.

In [17]:
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [18]:
# let's instantiate a model
MODEL_ID = "gemini-1.5-flash"
model = genai.GenerativeModel(model_name=MODEL_ID)
response = model.generate_content(generation_prompt)
print(response.text)

1.  So, the tiny human is finally asleep (miracle!), which means I can finally tackle the mountain of sawdust in my workshop.  Anyone know a good, kid-safe way to vacuum up miniature wood shavings?

2.  Sleep-deprived, sawdust-covered, and utterly charmed.  Anyone got recommendations for baby-wearing backpacks that won't completely obliterate my already precarious posture?



### Create the class for structured output


Creating Pydantic models lets us automatically derive a JSON schema, which is described to the LLM at inference time and which `instructor` uses to parse and validate the response. If you want to better understand how to generate structured outputs, check out my guides [here](https://patrickfleith.github.io/datapipes/?utm_source=notebooks&utm_medium=colab&utm_campaign=notebooks)

In [19]:
# class for one text / row in the final dataset
class TextEntry(BaseModel):
    content: str

# class for the list of texts to be generated per Gemini API call.
# This way we generate multiple texts per API call ;)
class TextEntries(BaseModel):
    entries : list[TextEntry]
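
To make the schema idea concrete, here is a minimal sketch, assuming Pydantic v2 (`model_json_schema` and `model_validate_json` are v2 method names). The two classes are repeated so the snippet runs on its own, and the JSON strings are hand-written stand-ins for model responses, not real Gemini output.

```python
from pydantic import BaseModel, ValidationError


class TextEntry(BaseModel):
    content: str


class TextEntries(BaseModel):
    entries: list[TextEntry]


# JSON schema derived from the model definition.
schema = TextEntries.model_json_schema()
print(sorted(schema["properties"]))  # ['entries']

# A well-formed response parses into typed objects.
raw = '{"entries": [{"content": "Ship it today."}, {"content": "Call me."}]}'
parsed = TextEntries.model_validate_json(raw)
print([entry.content for entry in parsed.entries])

# A response missing the required "content" field is rejected.
try:
    TextEntries.model_validate_json('{"entries": [{"text": "wrong key"}]}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

This is the kind of check that makes malformed responses surface as exceptions rather than as silently broken rows in the dataset.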

### Instantiate a client with instructor to follow the desired structured outputs

In [20]:
google_client = genai.GenerativeModel(
    model_name=MODEL_ID)

client = instructor.from_gemini(
    client=google_client,
    mode=instructor.Mode.GEMINI_JSON,
)

### Persona
For the personas, we'll be using the **`FinePersonas`** dataset, an open dataset of 21 million detailed personas, built specifically for diverse and controllable synthetic text generation.

👏 Big kudos to the [Argilla](https://argilla.io/) team for this dataset.
You can find this dataset [here](https://huggingface.co/datasets/argilla/FinePersonas-v0.1)

In [21]:
fine_personas_dataset = load_dataset("argilla/FinePersonas-v0.1", "default")


We will not need all 21 million personas 😅. We'll sample:
- 400 for the train set.
- 100 for the test set.

For each persona, we will generate *`N`* texts for each of the 2 classes (fluffy or concise), here with `N = 2`. This means:
- 800 Gemini API requests (1,600 generated datapoints) in the train set
- 200 Gemini API requests (400 generated datapoints) in the test set

✅ This complies with the quota of 1,500 RPD.
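
The arithmetic above can be written out as a short sketch. The constant and function names are illustrative; the numbers come from the text (two labels, two entries per request, 15 RPM and 1,500 RPD).

```python
import math

N_LABELS = 2              # fluffy and concise
ENTRIES_PER_REQUEST = 2   # texts requested per API call
RPM_QUOTA, RPD_QUOTA = 15, 1500


def budget(n_personas):
    """Return (API requests, generated datapoints) for a split."""
    requests = n_personas * N_LABELS
    return requests, requests * ENTRIES_PER_REQUEST


train_requests, train_points = budget(400)   # (800, 1600)
test_requests, test_points = budget(100)     # (200, 400)
total_requests = train_requests + test_requests

print(total_requests, total_requests <= RPD_QUOTA)  # 1000 True

# Lower bound on wall-clock time imposed by the per-minute quota alone.
min_minutes = math.ceil(total_requests / RPM_QUOTA)
print(min_minutes)  # 67
```

So both splits fit in a single day of quota, but the per-minute limit means generation takes over an hour at best.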

In [22]:
personas = fine_personas_dataset['train'].train_test_split(
    train_size=400,
    test_size=100,
    shuffle=True,
    seed=15)

Let's check the reduced dataset

In [23]:
personas

DatasetDict({
    train: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 400
    })
    test: Dataset({
        features: ['id', 'persona', 'labels'],
        num_rows: 100
    })
})

In [24]:
# check an example persona from the dataset
personas['train'][0]

{'id': '<urn:uuid:b7463b96-14f6-4a84-aef7-f63f525734b5>',
 'persona': 'An educational psychologist or an instructional designer with a strong interest in cognitive learning theories and their applications in education.',
 'labels': '["Education", "Academia", "Specialized Expertise"]'}

### Some utility functions for generation

- One function that randomly picks one variation among styles, intents, languages, etc.
- One function that, given the prompt template and the randomized variables, constructs the prompt.
    - This function also adds a special instruction for Gemini to generate either **fluffy** or **concise** texts.

In [25]:
def pick_one(variations):
    """ Randomly pick a variation among several possible styles, intents, mediums, contexts, languages """
    random_variation_key = random.choice(list(variations.keys()))
    random_variation_value = variations[random_variation_key]
    return random_variation_value
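
If reproducible prompts matter, one option is to draw from an explicit, seeded `random.Random` instance instead of the module-level generator. This is a sketch of that variation, not what the notebook does: `pick_one_seeded` is a hypothetical name, and the sample dictionary holds two entries copied from `INTENTS`.

```python
import random


def pick_one_seeded(variations, rng):
    """Like pick_one, but draws from an explicit random.Random instance."""
    key = rng.choice(list(variations.keys()))
    return variations[key]


# Two entries copied from INTENTS, to keep the snippet self-contained.
sample_intents = {
    "inform": "To provide information or knowledge to the reader.",
    "warn": "To alert or caution the reader about potential risks or dangers.",
}

# Two generators with the same seed produce the same sequence of picks.
rng_a = random.Random(15)
rng_b = random.Random(15)
run_a = [pick_one_seeded(sample_intents, rng_a) for _ in range(5)]
run_b = [pick_one_seeded(sample_intents, rng_b) for _ in range(5)]
print(run_a == run_b)  # True
```

With a fixed seed, rerunning the notebook would rebuild the same prompts, although the model responses themselves may still differ.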

In [26]:
def generate_prompt(num_entries, persona, style, context, medium, intent, language, label_idx):
    generation_prompt = PROMPT_TEMPLATE.format(
        persona = persona,
        num_entries = num_entries,
        style = style,
        context = context,
        medium = medium,
        intent = intent,
        language = language
    )

    final_prompt = generation_prompt + f"\n{LABELS[label_idx]['instruction']} {LABELS[label_idx]['description']}"

    return final_prompt

In [30]:
# we'll store our data in this dictionary
dataset_dict = {
    'uuid': [],
    'persona': [],
    'text': [],
    'label': [],
    'model': []
    }

## Generation Time ⏰
- We loop over each persona
- We generate a unique prompt for each label (fluffy, concise)
- We generate a model response for each label
- We extract the generated text from our structured outputs and store them in a dictionary
- We add counters to avoid exceeding the Gemini quota, and try/except blocks to catch potential errors.
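
The quota handling in the last bullet can also be factored into a small helper. The sketch below is an alternative to the counters used in this notebook, not the notebook's own method: `RateLimiter` is a hypothetical class that is not part of any Gemini library. It uses a sliding window over call timestamps, and the clock and sleep functions are injectable so it can be tried out without actually waiting.

```python
import time
from collections import deque


class RateLimiter:
    """Block until a new call fits within `max_calls` per `period` seconds."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.calls = deque()    # timestamps of recent calls, oldest first

    def wait(self):
        now = self.clock()
        # Forget calls that have left the sliding window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call leaves the window, then drop it.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)


class FakeClock:
    """Stand-in clock whose sleep() advances time instantly."""

    def __init__(self):
        self.t = 0.0
        self.slept = []

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.t += seconds


fake = FakeClock()
limiter = RateLimiter(max_calls=2, period=60.0, clock=fake.now, sleep=fake.sleep)
for _ in range(3):
    limiter.wait()
print(fake.slept)  # [60.0]
```

In the generation loop, creating `RateLimiter(max_calls=14)` once and calling `limiter.wait()` before each API call would play the role of the `rpm_counter` bookkeeping.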

In [37]:
# we set two counters to make sure we don't exceed the quota; we pause before exceeding it.
rpd_counter = 0
rpm_counter = 0

for persona in personas['train']:

    for label_idx, label in enumerate(LABELS):

        prompt = generate_prompt(
            num_entries=2,
            persona=persona['persona'],
            style=pick_one(STYLES),
            context=pick_one(CONTEXTS),
            medium=pick_one(MEDIUMS),
            intent=pick_one(INTENTS),
            language=pick_one(LANGUAGES),  # only English in this example notebook
            label_idx=label_idx,  # index of the label in LABELS (0: fluffy, 1: concise)
        )

        try:
            model_response = client.messages.create(
                messages=[
                    {
                        "role": "system",
                        "content": "You are a helpful AI Assistant"},
                    {
                        "role": "user",
                        "content": prompt},
                ],
                response_model=TextEntries
            )

            try:
                for entry in model_response.entries:
                    dataset_dict['uuid'].append(str(uuid.uuid4()))  # Generate a unique ID
                    dataset_dict['persona'].append(persona['persona'])
                    dataset_dict['text'].append(entry.content)
                    dataset_dict['label'].append(label_idx)
                    dataset_dict['model'].append(MODEL_ID)
            except Exception as exc:
                print(f"Error saving record in dataset dictionary: {exc}")

        except Exception as exc:
            print(f"Error calling model: {exc}")
            time.sleep(10)  # wait 10s if an exception occurred

    rpd_counter += 2
    if rpd_counter > 1200:
        break

    print(f"{rpd_counter//2} personas --- Generated a total of {rpd_counter*2} instances so far")

    rpm_counter += 2
    if rpm_counter >= 14:
        print("Quota pause...")
        time.sleep(60)
        rpm_counter = 0

In [32]:
from datasets import Dataset
ds = Dataset.from_dict(dataset_dict)

In [33]:
ds

Dataset({
    features: ['uuid', 'persona', 'text', 'label', 'model'],
    num_rows: 1600
})
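
Before publishing, a quick sanity check on class balance and duplicate texts can be worthwhile. The sketch below runs on a tiny made-up stand-in with the same `text` and `label` columns as `dataset_dict`; in the notebook the same two lines would be applied to `dataset_dict` itself. Labels follow the `LABELS` order (`0` for fluffy, `1` for concise).

```python
from collections import Counter

# Tiny stand-in with the same columns as dataset_dict (values are made up).
sample = {
    "text": ["Ship it today.", "We really, truly must ship it today.", "Ship it today."],
    "label": [1, 0, 1],
}

label_counts = Counter(sample["label"])
duplicates = [text for text, n in Counter(sample["text"]).items() if n > 1]

print(dict(label_counts))  # {1: 2, 0: 1}
print(duplicates)          # ['Ship it today.']
```

A strongly unbalanced count would suggest that some API calls failed for one of the labels.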

In [41]:
ds.push_to_hub(
    repo_id='patrickfleith/fluff-vs-concise',
    private=False
)


CommitInfo(commit_url='https://huggingface.co/datasets/patrickfleith/fluff-vs-concise/commit/e04806999a54c2e3205399b1984a287df3dcf7c2', commit_message='Upload dataset', commit_description='', oid='e04806999a54c2e3205399b1984a287df3dcf7c2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/patrickfleith/fluff-vs-concise', endpoint='https://huggingface.co', repo_type='dataset', repo_id='patrickfleith/fluff-vs-concise'), pr_revision=None, pr_num=None)