# Chapter 2 FastText Classifier Exercise
Create a quality classifier using fasttext. Your positive examples can be drawn from Wikipedia and the negative examples can be randomly drawn from the unclean version of C4. Once trained, feed documents from the realnewslike subset of C4 to this classifier. Is this classifier able to do a good job?

https://huggingface.co/datasets/allenai/c4

### This Notebook:
- Install & Import: Installs fasttext and datasets; imports them.
- Positive Data: We use wikitext-2-raw-v1 as a small stand-in for “high-quality” text.
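- Negative Data: Drawn from allenai/c4, streaming the 'train' split and taking the first 3,000 documents longer than 30 characters. Note that the code loads the `en` config, which is the cleaned variant; the unclean variant is `en.noclean`.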
- Splits: 90% train, 10% validation; saved to disk in FastText-style format.
- Train: A simple supervised classifier with fasttext.train_supervised(...).
- Evaluate on validation.
- Inference on the “realnewslike” subset (also from allenai/c4), taking the first 100 documents of at least 30 characters and counting how many are classified as positive vs. negative.


In [1]:
# Install necessary libraries
!pip install fasttext datasets

import os
import random
import fasttext
from datasets import load_dataset



In [2]:
# Load Positive Examples (Wikipedia)
# We use wikitext-2-raw-v1 as a small stand-in for “high-quality” text
# and filter out lines of 30 characters or fewer.

wiki = load_dataset("wikitext", "wikitext-2-raw-v1")
wiki_texts = [t.strip() for t in wiki["train"]["text"] if len(t.strip()) > 30]



In [6]:
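# Load Negative Examples (C4 from allenai/c4)
# Note: the 'en' config loaded here is the cleaned English variant; the unclean
# variant is 'en.noclean'. The results below were produced with 'en'.
# We take the first 3,000 documents from the stream rather than a random sample.
# The dataset is large, so use streaming on the Google Colab free tier.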

unclean_c4_stream = load_dataset('allenai/c4', 'en', split="train", streaming=True)

neg_samples = []
for example in unclean_c4_stream:
    txt = example["text"].strip()
    if len(txt) > 30:
        neg_samples.append(txt)
    if len(neg_samples) >= 3000:  # sample size for demonstration
        break
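
The loop above keeps the first 3,000 qualifying documents from the stream, which is not a random sample. One way to draw a uniform sample from a stream is reservoir sampling; the sketch below is a minimal version on toy data. The helper name `reservoir_sample` and the `max_items` cap are assumptions for illustration and are not part of the notebook.

```python
import random
from itertools import islice


def reservoir_sample(stream, k, seed=0, max_items=None):
    """Uniformly sample up to k items from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    if max_items is not None:
        # Cap how much of the stream is read, so a huge dataset is not fully scanned.
        stream = islice(stream, max_items)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy usage: a range stands in for the streamed documents.
sample = reservoir_sample(range(1000), k=5, seed=42)
print(sample)
```

In the notebook, the stream could be a generator over the filtered `example["text"]` values, with `max_items` bounding how many documents are read from C4.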



In [7]:
# Prepare Train/Validation Splits
# We'll do a simple 90/10 split for each category (pos vs. neg).

random.shuffle(wiki_texts)
random.shuffle(neg_samples)

val_ratio = 0.1
pos_val_count = int(len(wiki_texts) * val_ratio)
neg_val_count = int(len(neg_samples) * val_ratio)

pos_train = wiki_texts[pos_val_count:]
pos_val = wiki_texts[:pos_val_count]
neg_train = neg_samples[neg_val_count:]
neg_val = neg_samples[:neg_val_count]
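
The split above shuffles with the global random state, so the train and validation sets change on every run. A minimal sketch of a seeded variant follows; the helper name `split_train_val` and the seed value are assumptions, and the toy list stands in for `wiki_texts` or `neg_samples`.

```python
import random


def split_train_val(items, val_ratio=0.1, seed=0):
    """Shuffle a copy of items with a fixed seed and return (train, val)."""
    rng = random.Random(seed)
    shuffled = list(items)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


# Toy usage with 20 placeholder documents.
docs = [f"doc {i}" for i in range(20)]
train, val = split_train_val(docs, val_ratio=0.1, seed=13)
print(len(train), len(val))
```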


In [9]:
neg_val[0]

'If you are interested in the new Ram 1500, Dodge Grand Caravan, Jeep Grand Cherokee or Chrysler 300, the Crown Auto World Bristow new car dealership is your connection for a superb inventory of new cars. To discover the many stylish, versatile vehicles we provide for drivers in the Tulsa, Oklahoma City and Broken Arrow area, explore our inventory online, or visit us in person to get a closer look at all of the amazing features.\nWhen you have made the decision on which new Chrysler, Dodge, Jeep or Ram model is the perfect match for you, be sure to speak to one of our auto financing specialists for a great car loan or lease in Bristow! We will help you get a customized car finance program that fits your individual needs. If you are still searching for a vehicle, be sure to also check out our superb inventory of quality used cars and fill out our CarFinder form if you need help searching for a particular model.\nProudly Serving Tulsa, Oklahoma City, Broken Arrow & Okmulgee and Beyond wi

In [11]:
pos_val[0]

'Chandler , David P. The Tragedy of Cambodian History . New Haven CT : Yale University Press , 1991 .'

In [12]:
pos_train[0]

'= = = Commercial recordings = = ='
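
The sample above is a section heading rather than prose: it passes the length filter only because it is longer than 30 characters. A minimal sketch of an extra filter follows, assuming (from this one sample) that wikitext headings are lines wrapped in `=` signs. The helper name `is_wikitext_heading` is hypothetical.

```python
def is_wikitext_heading(line):
    """Return True for lines that look like wikitext headings, e.g. '= = Title = ='."""
    s = line.strip()
    return len(s) >= 2 and s.startswith("=") and s.endswith("=")


# The two strings below are copied from the notebook outputs.
samples = [
    "= = = Commercial recordings = = =",
    "Chandler , David P. The Tragedy of Cambodian History . New Haven CT : Yale University Press , 1991 .",
]
prose_only = [s for s in samples if not is_wikitext_heading(s)]
print(len(prose_only))
```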

In [13]:
neg_train[0]

'This time, players can use communication features in the game to show each other the unique houses they have made.\nHH Showcase where you can check out homes belonging to people you have encountered via the StreetPass feature, just as you would view a show home. When we came up with this idea, we wanted to make it so it was as little trouble as possible for players to make use of this feature.\nExactly. In the end, we made it so all you need to do is have StreetPass switched on.\nDream Suite run by Luna, which we touched upon earlier.\nWell, a fair while ago, when you originally announced the Wii, you explained the concept of WiiConnect249 and gave the example of a friend coming over to play while you’re asleep.9. WiiConnect24: A Wii network service that is always connected to the internet, allowing automatic downloads of the latest news and other messages. Please note, the WiiConnect24 service is not available in South Africa.\nAh, yes. That’s right. That was the first thing I wanted

In [14]:
# Write training and validation data in FastText format
# FastText expects lines like "__label__positive This is text"
train_file = "train_fasttext.txt"
val_file = "val_fasttext.txt"

def write_fasttext_data(filename, pos, neg):
    with open(filename, "w", encoding="utf-8") as f:
        for t in pos:
            t = t.replace("\n", " ")
            f.write(f"__label__positive {t}\n")
        for t in neg:
            t = t.replace("\n", " ")
            f.write(f"__label__negative {t}\n")

write_fasttext_data(train_file, pos_train, neg_train)
write_fasttext_data(val_file, pos_val, neg_val)
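
The writer above replaces only `\n`, while the inference cell later also replaces `\r` and strips the text. Using one shared formatting helper for both training and inference keeps the preprocessing identical. The sketch below is one way to do that; the helper name `to_fasttext_line` is an assumption.

```python
def to_fasttext_line(label, text):
    """Format one example as a single fastText line: '__label__<label> <text>'."""
    # str.split() with no arguments splits on any whitespace run (spaces, \n, \r, \t),
    # so joining with single spaces collapses the text onto one line.
    clean = " ".join(text.split())
    return f"__label__{label} {clean}"


line = to_fasttext_line("positive", "First line.\nSecond line.\r\n  Third\tline.")
print(line)
```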


In [15]:
# Train the FastText model
model = fasttext.train_supervised(
    input=train_file,
    lr=0.1,
    epoch=5,
    wordNgrams=1,
    dim=100,
    loss='softmax'
)

# Evaluate on validation set
val_result = model.test(val_file)
print("[Validation Results]")
print(f"  Examples: {val_result[0]}")
print(f"  Precision (P@1): {val_result[1]:.3f}")
print(f"  Recall (R@1): {val_result[2]:.3f}\n")


[Validation Results]
  Examples: 2126
  Precision (P@1): 0.994
  Recall (R@1): 0.994
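
With exactly one label per example and one prediction per example, P@1 and R@1 both reduce to overall accuracy, which is why the two numbers above match. An aggregate figure can hide weak performance on the smaller class, so per-label metrics are worth checking. The sketch below computes them from gold and predicted labels; the helper name `per_label_metrics` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def per_label_metrics(gold, pred):
    """Compute precision and recall for each label from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # gold label g was missed
    metrics = {}
    for label in set(gold) | set(pred):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / p_den if p_den else 0.0,
            "recall": tp[label] / r_den if r_den else 0.0,
        }
    return metrics


# Toy data: 90% accuracy overall, but only half of the negatives are recovered.
gold = ["pos"] * 8 + ["neg"] * 2
pred = ["pos"] * 8 + ["pos", "neg"]
metrics = per_label_metrics(gold, pred)
print(metrics["neg"])
```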



In [17]:
# Test on the "realnewslike" subset of allenai/c4
# We take the first N documents of at least 30 characters and count how many are labeled positive vs. negative.
realnews_c4_stream = load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)
n_samples = 100
count = 0
pos_count = 0
neg_count = 0

for example in realnews_c4_stream:
    # Replace newlines with spaces: fastText's predict rejects text containing "\n"
    text = example["text"].replace("\n", " ").replace("\r", " ").strip()
    if len(text) < 30:
        continue
    prediction = model.predict(text)  # returns (labels, probabilities)
    label = prediction[0][0]  # e.g., "__label__positive" or "__label__negative"
    if label == "__label__positive":
        pos_count += 1
    else:
        neg_count += 1
    count += 1
    if count >= n_samples:
        break

print(f"[RealNewslike Subset - First {n_samples} Non-trivial Samples]")
print(f"  Predicted Positive: {pos_count}")
print(f"  Predicted Negative: {neg_count}")

# Result explanation:
# Getting 0 positives out of 100 suggests the model does not generalize beyond its training distribution: it treats news text as negative.
# By broadening the notion of “quality,” balancing the data, and tuning hyperparameters, we can likely achieve better performance when classifying real news text.


[RealNewslike Subset - First 100 Non-trivial Samples]
  Predicted Positive: 0
  Predicted Negative: 100
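
Counting hard labels hides how confident the model is on each document. A common alternative is to treat the probability of the positive label as a quality score and pick a threshold. The sketch below extracts that score from a `(labels, probs)` pair shaped like the return value of `model.predict(text, k=2)`; the helper name `positive_score`, the toy values, and the 0.5 threshold are assumptions.

```python
def positive_score(labels, probs, positive_label="__label__positive"):
    """Return the probability assigned to the positive label, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0


# Toy stand-in for the output of model.predict(text, k=2); the values are made up.
labels = ("__label__negative", "__label__positive")
probs = (0.93, 0.07)
score = positive_score(labels, probs)
print(score >= 0.5)
```

Inspecting the distribution of these scores over the realnewslike sample would show whether the documents are narrowly or overwhelmingly rejected.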


**Explanation of the “All Negative” Result in Our FastText Quality Classifier**

In this notebook, we trained a FastText classifier to distinguish “quality” text from “low-quality” text:

- **Positive Data** (High Quality): Wikipedia excerpts  
- **Negative Data** (Low Quality): C4 excerpts (the `en` config as loaded above)  
- **Evaluation**: We tested on the “realnewslike” subset of the same C4 dataset, expecting at least some examples to be classified as positive.

**Observed Result**

When predicting on the first 100 non-trivial documents of “realnewslike,” the classifier labeled **all 100 as negative**. This means our model did **not** recognize any sample in that subset as “quality.”

**Why Is This Happening?**

1. **Narrow Definition of “Positive”**  
   - Our positive examples come **only** from Wikipedia. Wikipedia has a very distinct, encyclopedic style, so the model might learn that *only* that specific style is “quality.”  
   - Real news text can be formal, but it still differs in tone, structure, and vocabulary from Wikipedia. The classifier sees enough differences to lump it into the “negative” category.

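2. **Data Imbalance or Mismatch**  
   - The classes are imbalanced, but toward the positive label: the validation set has 2,126 examples, of which 300 are negative (10% of the 3,000 C4 samples), leaving about 1,826 positives. Imbalance alone therefore does not explain a bias toward “negative.”  
   - The negative data from C4 includes a wide variety of text: some could be spam, some legitimate writing, and so on. If the model primarily learns “anything that doesn’t look exactly like Wikipedia is negative,” it will default to that label for most real-world text.  
   - The wikitext samples printed earlier separate punctuation with spaces (`Press , 1991 .`), while C4 text does not. With `wordNgrams=1`, the model may be keying on such tokenization artifacts rather than on writing quality.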

3. **Minimal Tuning**  
   - We used only 5 epochs and basic hyperparameters. The validation P@1 of 0.994 indicates the model fits its training distribution well, so underfitting is unlikely to be the main problem; still, we did not explore additional epochs, other learning rates, or a larger dimension.  
   - We also haven’t done any advanced filtering or domain adaptation.
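
Some of these hypotheses can be checked directly. The sketch below counts labels in fastText-format lines and measures the share of whitespace-separated tokens that are bare punctuation, a rough proxy for wikitext-style tokenization, in which punctuation is split off from words. The helper names `label_counts` and `bare_punct_ratio` are hypothetical; the two sample strings are copied from the `pos_val[0]` and `neg_val[0]` outputs earlier in the notebook.

```python
import string
from collections import Counter


def label_counts(lines):
    """Count examples per label in fastText-format lines."""
    return Counter(
        line.split(" ", 1)[0] for line in lines if line.startswith("__label__")
    )


def bare_punct_ratio(text):
    """Fraction of whitespace-separated tokens that consist only of punctuation."""
    tokens = text.split()
    if not tokens:
        return 0.0
    bare = sum(1 for tok in tokens if all(ch in string.punctuation for ch in tok))
    return bare / len(tokens)


wiki_like = "Chandler , David P. The Tragedy of Cambodian History . New Haven CT : Yale University Press , 1991 ."
web_like = "If you are interested in the new Ram 1500, Dodge Grand Caravan, Jeep Grand Cherokee or Chrysler 300"

print(bare_punct_ratio(wiki_like), bare_punct_ratio(web_like))
print(label_counts(["__label__positive a b", "__label__negative c", "__label__positive d"]))
```

A large gap in this ratio between the two classes would suggest the classifier can separate them on formatting alone, without learning anything about quality.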

**How to Improve the Classifier**

1. **Include More Representative Positive Data**  
   - Add other sources of reputable text (e.g., high-quality blogs, published articles, genuine news) to broaden the classifier’s notion of “quality,” or look through other open-source datasets on Hugging Face.  
   - Having just Wikipedia may be too narrow.

2. **Curate Negative Data More Carefully**  
   - Ensure the negative samples are truly low quality (spam, nonsense, incomplete sentences) rather than random text from the web.  
   - Otherwise, “negative” might include decent text, making the boundary between “positive” and “negative” too fuzzy.

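3. **Balance the Dataset**  
   - Aim for roughly the same number of positive and negative examples.  
   - If the classifier sees far more examples of one class, it may learn to lean on that label. Here the validation counts (2,126 examples, 300 of them negative) imply that positives outnumber negatives about six to one.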

4. **Adjust Training Parameters**  
   - Increase the `epoch` count (e.g., 10 or 15 instead of 5).  
   - Try different `lr` and `wordNgrams`.  
   - Experiment with subword (character n-gram) features via `minn` and `maxn`.

5. **Check Preprocessing & Domain**  
   - Ensure the data used at inference is processed similarly to the training data.  
   - If realnewslike text contains many links, HTML, or other elements not common in Wikipedia, it may throw off the model.
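
As one illustration of the balancing suggestion, the sketch below downsamples the larger class so both classes have the same number of examples. The helper name `downsample_to_balance` and the toy lists are assumptions; in the notebook the inputs would be `pos_train` and `neg_train`.

```python
import random


def downsample_to_balance(pos, neg, seed=0):
    """Randomly downsample the larger list so both classes have equal size."""
    rng = random.Random(seed)
    n = min(len(pos), len(neg))
    # sample() draws without replacement, so no example is duplicated.
    return rng.sample(pos, n), rng.sample(neg, n)


# Toy data with a 3:1 imbalance.
pos = [f"wiki paragraph {i}" for i in range(12)]
neg = [f"web page {i}" for i in range(4)]
pos_bal, neg_bal = downsample_to_balance(pos, neg, seed=7)
print(len(pos_bal), len(neg_bal))
```

Downsampling discards data; collecting more examples of the smaller class is an alternative when that is feasible.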


