<a href="https://colab.research.google.com/github/happyahluwalia/llm_journey/blob/main/week-01-data-pipeline/Data_Cleaning_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [28]:
# 📚 LLM Data Cleaning Pipeline - Week 1
# Full tutorial: https://github.com/happyahluwalia/llm_journey

# Install dependencies (only needed in Colab)
!pip install trafilatura langdetect -q

print("✅ Setup complete! Ready to start.")

✅ Setup complete! Ready to start.


## 📚 What This Notebook Covers

**Concepts explained:** All steps of a production data cleaning pipeline

**Code implemented:** Core foundational steps (collection through quality filtering)

**Advanced topics:** Deduplication and content filtering are explained conceptually; implementations will be covered in advanced tutorials

*This approach lets you understand the full pipeline without getting overwhelmed
by complex algorithms like MinHash or toxicity classifiers.*

Preprocessing is a critical step before training on, or otherwise using, any data. Raw data is seldom usable as-is; it has to be cleaned before we can ingest it into our system.

## Core Pipeline

1. **Data Collection** — Gather from sources
2. **Extraction** — Get plain text (method varies by source)
3. **Language Detection** — Filter to target language(s)
4. **Quality Filtering (Heuristics)** — Remove obvious junk
   - Length filters
   - Gibberish detection
   - High repetition
   - Spam patterns
5. **Exact Deduplication** — Remove identical documents
6. **Fuzzy Deduplication** — Remove near-duplicates (modern pipelines)

## Optional Steps (Depends on Your Goals)

7. **URL/Domain Filtering** — Block known bad domains (RefinedWeb style)
8. **Toxicity Filtering** — Remove harmful content (if needed)
9. **Quality Scoring** — Score each document (Dolma style)
10. **PII Removal** — Remove personal info (usually left to users)

Remember: **"There's no standard pipeline; it depends on your goals."**
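As a rough illustration of how these steps chain together, the sketch below treats each stage as a function that returns the (possibly modified) document, or `None` to drop it. The names `strip_whitespace`, `drop_short`, and `run_pipeline` are hypothetical helpers made up for this example, and the 20-character threshold is an arbitrary assumption.

```python
from typing import Callable, Iterable, Iterator, Optional

# A step takes a document and returns it (possibly modified), or None to drop it.
Step = Callable[[str], Optional[str]]

def strip_whitespace(doc: str) -> Optional[str]:
    """Toy stand-in for an extraction/normalization step."""
    return doc.strip()

def drop_short(doc: str, min_chars: int = 20) -> Optional[str]:
    """Toy stand-in for a quality filter (threshold is an arbitrary assumption)."""
    return doc if len(doc) >= min_chars else None

def run_pipeline(docs: Iterable[str], steps: list[Step]) -> Iterator[str]:
    """Apply steps in order; a document is dropped as soon as a step returns None."""
    for doc in docs:
        current: Optional[str] = doc
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is not None:
            yield current

raw_docs = ["   A short but complete sentence about data cleaning.  ", "  junk ", ""]
cleaned = list(run_pipeline(raw_docs, [strip_whitespace, drop_short]))
print(cleaned)
```

Keeping every stage behind the same signature makes it easy to reorder, add, or remove steps, which matters because the right pipeline depends on your goals.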


---

### 🧩 Real-World Pipelines Using These Steps

The kinds of sources we sample from in this notebook (wiki pages, StackExchange, GitHub) also feed many large-scale open corpora, usually alongside Common Crawl:

- **Dolma (Allen AI)**: Large, primarily English corpus built from Common Crawl, Wikipedia, code, and other curated sources, with per-document metadata
- **RefinedWeb (TII) and FineWeb (Hugging Face)**: Filtered Common Crawl with heavy quality filtering and aggressive deduplication
- **The Pile (EleutherAI)**: Foundational dataset used for GPT-NeoX; includes Books3, ArXiv, PubMed, StackExchange

**Our goal:** Simulate a smaller-scale version of these pipelines, understanding each step deeply.

---

## 📋 What We'll Cover

Below, we'll walk through each step with:
- ✅ **Why it matters** — The problem it solves
- ✅ **How it works** — The techniques used
- ✅ **Trade-offs** — Precision vs. recall, speed vs. quality
- ✅ **Real examples** — How production pipelines implement it

Let's start with **Data Collection**...

# **1. Data Collection**

One of the first steps when you start any data project is to decide **what data sources** to use.

This naturally raises two key questions:

> **Question 1:** "What do I want my model or chatbot to *do*?"

- Should it be **general-purpose**, able to talk about anything?  
- Or should it be **domain-specialized**, e.g., a Legal assistant, Medical expert, or Coding Ninja?

> **Question 2:** "What kind of *attitude* or *voice* do I want my model to have?"

- Chatty and conversational, like Reddit?  
- Or factual and formal, like Wikipedia?

---

## 🧭 Before You Jump In — Ask These Questions

1. **What is my end goal?** (Chat, Q&A, summarization, code generation)
2. **How wide should my data coverage be?** (Broad vs. deep)
3. **How much domain knowledge vs. general reasoning is required?**
4. **What's my tolerance for noise, bias, and licensing complexity?**
5. **How much data do I need?** (1GB for fine-tuning vs. 1TB for pre-training)

---

## ⚙️ When to Use What

| Model Type | Data Sources | Description |
|-------------|---------------|--------------|
| **General Purpose** | Web + Books + Dialog + Code | Broad, diverse mix (Wikipedia, Reddit, GitHub, Books3) |
| **Domain Specialized**<br>(Finance, Legal, Medical) | Curated licensed datasets | Higher precision and quality, less noise<br>(e.g., PubMed, Legal archives, Financial reports) |
| **Code Model** | GitHub, StackOverflow, CodeSearchNet | Code and explanations; focus on structure and correctness |

---

## 🌐 Common Data Sources (With Trade-offs)

| Category | Examples | Notes |
|-----------|-----------|-------|
| **Web Crawls** | • **Common Crawl** – 400+ TB of raw web data<br>• **RefinedWeb** – Filtered Common Crawl (TII)<br>• **FineWeb** – Even cleaner version (Hugging Face)<br>• **Dolma** – Metadata-rich (Allen AI) | **Trade-off:** Scale vs. quality<br>Raw Common Crawl is mostly noise; prefer pre-filtered versions |
| **Code** | • GitHub<br>• StackOverflow<br>• CodeSearchNet<br>• The Stack (Hugging Face) | **Best for:** Models that reason about code, logic, debugging<br>**Watch out for:** Licensing (some repos are proprietary) |
| **Community Q&A** | • Reddit<br>• Quora<br>• StackExchange | **Adds:** Conversational tone, diverse phrasing, slang<br>**Watch out for:** Toxicity, misinformation |
| **Curated Knowledge** | • Wikipedia<br>• ArXiv (academic papers)<br>• Books3<br>• Project Gutenberg | **Best for:** Factual grounding, high-quality prose<br>**Trade-off:** Low noise but limited diversity |

---

## 🔄 How to Get the Data

### **Option 1: Use Pre-Built Datasets**

✅ **Pros:**
- Instant access via Hugging Face, ready to use
- Pre-filtered for quality
- Well-documented

❌ **Cons:**
- Limited customization
- May be outdated
- Can't control the collection process

### **Option 2: Web APIs (For structured, real-time data)**
✅ Pros:

- Clean, structured data (JSON)
- Officially supported, legal clarity
- Rate limits are transparent

❌ Cons:

- Requires API keys
- Rate limits (e.g., Reddit: 60 req/min)
- May have usage costs

Best for: Reddit, GitHub, Twitter/X, StackExchange

### **Option 3: Web Scraping (For custom sources without APIs)**
✅ Pros:

- Full control over what you collect
- Free (no API costs)
- Can access any public website

❌ Cons:

- Fragile (breaks when HTML changes)
- Legal gray area (check robots.txt, ToS)
- Risk of IP bans if too aggressive
- Requires more technical work

Best for: Niche forums, blogs, specialized websites
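Since scraping means checking `robots.txt` first, here is a minimal sketch of that check using Python's standard-library `urllib.robotparser`. The robots.txt content, the `DataPipelineBot` user agent, and the `example.com` URLs are made-up values for illustration; a real crawler would load the file from the target site (for example with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "DataPipelineBot"  # assumed bot name for this example

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/blog/post-1"))   # public page
print(allowed("https://example.com/private/data"))  # disallowed path
print(parser.crawl_delay(USER_AGENT))               # seconds to wait between requests
```

Passing this check does not settle the legal question on its own; the site's Terms of Service still apply.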



---



## **⚖️ Legal & Ethical Considerations**

⚠️ Always check the license of each dataset.

---


In [None]:
# --- Cell: Fetch HTML content from multiple sources for data cleaning demo ---

import requests
from pathlib import Path
import time

# Config: Base directory and per-source byte counters
BASE_DIR = Path("data/raw")
source_sizes = {"wiki": 0, "github": 0, "stackexchange": 0}

# Create directories for each source
for sub in ["wiki", "github", "stackexchange"]:
    (BASE_DIR / sub).mkdir(parents=True, exist_ok=True)

# Headers to avoid rate-limiting
headers = {"User-Agent": "Mozilla/5.0 (compatible; DataPipelineBot/1.0)"}

# 1️⃣ Wikisource: Fetch 4 HTML public domain pages (stories/essays)
print("Fetching 4 Wikisource HTML pages...")
wiki_urls = [
    ("The Raven by Edgar Allan Poe", "https://en.wikisource.org/wiki/The_Raven"),
    ("Rip Van Winkle by Washington Irving", "https://en.wikisource.org/wiki/Rip_Van_Winkle"),
    ("The Fall of the House of Usher by Edgar Allan Poe", "https://en.wikisource.org/wiki/The_Fall_of_the_House_of_Usher"),
    ("A Modest Proposal by Jonathan Swift", "https://en.wikisource.org/wiki/A_Modest_Proposal")
]
try:
    sample_text = ""  # Store last text for sample
    for i, (title, url) in enumerate(wiki_urls):
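        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:
            print(f"Error fetching '{title}': {resp.status_code}")
            continue
        html_text = resp.text  # Raw HTML of the page (was undefined before)
        text_bytes = html_text.encode("utf-8")
        # Replace ' by ' before the generic space replacement, otherwise it can never match
        filename = f"wiki_{i+1}_{title.replace(' by ', '_').replace(' ', '_')[:50]}.html"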
        (BASE_DIR / "wiki" / filename).write_bytes(text_bytes)
        source_sizes["wiki"] += len(text_bytes)
        sample_text = html_text  # Update sample
        print(f"  - Downloaded: {title} ({len(text_bytes)/1024:.1f} KB)")
        time.sleep(1)  # Polite delay
except Exception as e:
    print(f"Wikisource fetch failed: {e}")

# 2️⃣ GitHub: Fetch up to 4 raw Markdown docs from the transformers repo and wrap them in minimal HTML
print("Fetching up to 4 GitHub docs...")
GITHUB_BASE = "https://github.com/huggingface/transformers/raw/main"  # Raw file base URL
GITHUB_FILES = [
    ("README", "/README.md"),  # Main README
    ("BERT Docs", "/docs/source/en/model_doc/bert.md"),  # Model doc
    ("GPT-2 Docs", "/docs/source/en/model_doc/gpt2.md"),
    ("T5 Docs", "/docs/source/en/model_doc/t5.md")
]
try:
    sample_text = ""  # Reset for GitHub
    files_downloaded = 0  # Track to ensure 3-4 files
    for i, (name, path) in enumerate(GITHUB_FILES):
        url = GITHUB_BASE + path
        resp = requests.get(url, headers=headers)
        if resp.status_code != 200:
            print(f"Error fetching '{name}': {resp.status_code}")
            continue
        # Wrap the raw Markdown (capped at 50,000 chars) in a minimal HTML shell for the cleaning demo
        html_text = f"<html><body><h1>{name}</h1><pre>{resp.text[:50_000]}</pre></body></html>"
        text_bytes = html_text.encode("utf-8")
        filename = f"github_{i+1}_{name.replace(' ', '_')}.html"
        (BASE_DIR / "github" / filename).write_bytes(text_bytes)
        source_sizes["github"] += len(text_bytes)
        files_downloaded += 1
        sample_text = html_text
        print(f"  - Downloaded: {name} ({len(text_bytes)/1024:.1f} KB)")
        time.sleep(1)
except Exception as e:
    print(f"GitHub fetch failed: {e}")

# 3️⃣ StackExchange: Fetch 5 Python-tagged Q&A posts
print("Fetching 5 StackExchange Q&A posts...")
so_url = "https://api.stackexchange.com/2.3/questions?order=desc&sort=activity&tagged=python&site=stackoverflow&filter=withbody"
try:
    resp = requests.get(so_url, headers=headers)
    if resp.status_code == 200:
        posts = resp.json().get("items", [])[:5]  # Limit to 5
        for i, post in enumerate(posts):
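            # The API returns the title and body as HTML; wrap them in a minimal page so the .html file parses cleanly
            content = f"<html><body><h1>{post.get('title', '')}</h1>{post.get('body', '')}</body></html>"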
            content_bytes = content.encode("utf-8")
            (BASE_DIR / "stackexchange" / f"post_{i+1}.html").write_bytes(content_bytes)
            source_sizes["stackexchange"] += len(content_bytes)
            print(f"  - Downloaded: Post {i+1} ({len(content_bytes)/1024:.1f} KB)")
            time.sleep(1)
    else:
        print(f"StackExchange API error: {resp.status_code}")
except Exception as e:
    print(f"StackExchange fetch failed: {e}")

# Summary
print(f"✅ Done. Files in 'data/raw/*'")
print(f"Total sizes: Wiki={source_sizes['wiki']/1_000_000:.2f} MB, "
      f"GitHub={source_sizes['github']/1_000_000:.2f} MB, "
      f"StackExchange={source_sizes['stackexchange']/1_000_000:.2f} MB")


## **🚀 What's Next?**
Once you've collected your data, it's rarely in a usable format. Raw data contains:

- HTML tags, scripts, CSS
- Duplicate content
- Low-quality or spam content
- Multiple languages mixed together
- Encoding issues

**Next steps in the pipeline:**

- **Extraction** — Get plain text from raw formats
- **Cleaning** — Remove noise, duplicates, low-quality content
- **Filtering** — Keep only relevant, high-quality data

Let's dive into **Extraction** next...

# **2. Data Extraction**

Collected data is rarely in a usable format. We need to pull out the actual content and discard HTML tags, scripts, and other non-text elements so that only the relevant text remains.


---

## 🔧 The Extraction Pipeline

Most modern systems use a **multi-step approach**:

1. **Rule-based extraction** → Remove all HTML tags, scripts, styles
2. **Heuristic techniques** → Score content blocks using:
   - Text density (ratio of text to HTML)
   - Link density (ratio of links to text)
   - DOM structure analysis
3. **Optional ML-based extraction** → Refine results for edge cases
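To make the heuristic step more concrete, here is a small sketch that scores an HTML block by text density and link density using only the standard library. `BlockStats` and `score_block` are hypothetical names for this example, and the way the two ratios are computed is one simple interpretation, not how Trafilatura or any specific library does it.

```python
from html.parser import HTMLParser

class BlockStats(HTMLParser):
    """Collects visible-text length and the share of it that sits inside <a> tags."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_chars = 0   # visible text characters
        self.link_chars = 0   # visible text characters inside links
        self._link_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style contents
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth:
            self.link_chars += n


def score_block(html_block):
    """Return (text_density, link_density) for one HTML block."""
    stats = BlockStats()
    stats.feed(html_block)
    stats.close()
    text_density = stats.text_chars / len(html_block) if html_block else 0.0
    link_density = stats.link_chars / stats.text_chars if stats.text_chars else 0.0
    return text_density, link_density


nav = '<div><a href="/">Home</a> | <a href="/about">About</a> | <a href="/contact">Contact</a></div>'
article = ('<div><p>Deduplication removes repeated documents before training. '
           'See the <a href="/faq">FAQ</a>.</p></div>')

for name, block in [("nav", nav), ("article", article)]:
    text_density, link_density = score_block(block)
    print(f"{name}: text density={text_density:.2f}, link density={link_density:.2f}")
```

A navigation bar is almost all link text with little text per byte of markup, while an article paragraph shows the opposite pattern, which is what lets a scorer separate the two.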

### Popular Tool: **[Trafilatura](https://github.com/adbar/trafilatura)**

A production-ready Python library that combines rules + heuristics for fast, accurate extraction.

**What it does:**
- Removes HTML boilerplate (headers, footers, ads, navigation)
- Extracts main article/post content
- Preserves paragraph structure
- Supports 50+ languages
- Fast enough for large-scale processing

**Used by:** RefinedWeb, FineWeb, various research projects, production web scrapers

---

## 🎯 Key Concepts to Understand

### **1. Precision vs. Recall Tradeoff**

| Approach | Strategy | Pros | Cons | Example |
|----------|----------|------|------|---------|
| **High Precision**<br>(Conservative) | Extract only what you're confident about | ✅ Low noise<br>✅ High quality | ❌ Miss good content | Extract only from `<article>` tags |
| **High Recall**<br>(Aggressive) | Extract anything that *might* be content | ✅ Capture more content<br>✅ Better coverage | ❌ More noise | Extract all `<p>` tags |

**Modern Approach:**  
✨ **Go with High Recall** → Extract more, filter later  

💾 Storage is cheap, missing good data is expensive


---



### **2. Extraction vs. Cleaning**

These are **different stages** in the pipeline:

| Stage | Focus | What It Does | Example |
|-------|-------|--------------|---------|
| **Extraction** | Structure | Get text from raw data format | Remove `<script>` tags from HTML |
| **Cleaning** | Content | Improve text quality | Remove duplicates, fix encoding, filter profanity |

    
**Order matters:**

*Raw Data -> Extraction -> Cleaning -> Training Ready Text*


- **Extraction happens first** → It's about the *structure* of the data
- **Cleaning happens next** → It's about the *content* of the data


---

## 💡 Key Insight

> **The best extraction technique depends on the source of the data.**

Different sources need different strategies:
- Wikipedia dumps → Already clean, minimal extraction needed
- Reddit/Forums → Extract posts/comments via API (structured)
- General web → Aggressive HTML extraction + heavy filtering
- Code repositories → Syntax-aware extraction

---
### **⚠️ Why Extraction Quality Matters**
Poor extraction leads to:

 - ❌ Training on navigation menus ("Home | About | Contact" appears thousands of times)
 - ❌ Learning ad language ("Buy now! Limited time offer!")
 - ❌ Memorizing boilerplate ("Copyright 2024. All rights reserved.")
 - ❌ Wasted compute on junk tokens

Good extraction means:

- ✅ Model learns from actual content
- ✅ Better reasoning and generation quality
- ✅ Less memorization of meaningless patterns
- ✅ More efficient training (fewer junk tokens)
---




In [None]:
# Text Extraction
%pip install trafilatura
import trafilatura
from pathlib import Path

# Config: Input/output directories and size limit (4 MB per source)
BASE_DIR = Path("data/raw")
OUT_DIR = Path("data/extracted")
MAX_SIZE_BYTES = 4_000_000  # 4 MB in bytes
source_sizes = {"wiki": 0, "github": 0, "stackexchange": 0}

# Create output directories for each source
for sub in ["wiki", "github", "stackexchange"]:
    (OUT_DIR / sub).mkdir(parents=True, exist_ok=True)

# Process each source directory
for source in ["wiki", "github", "stackexchange"]:
    print(f"Extracting content from {source} HTML files...")
    input_dir = BASE_DIR / source
    output_dir = OUT_DIR / source

    # Get all .html files in source directory
    html_files = list(input_dir.glob("*.html"))
    if not html_files:
        print(f"No HTML files found in {input_dir}")
        continue

    try:
        for html_file in html_files:
            # Check size limit before processing
            if source_sizes[source] > MAX_SIZE_BYTES * 0.9:
                print(f"Stopping {source} extraction: Approaching 4 MB limit")
                break

            # Read HTML content
            html_content = html_file.read_text(encoding="utf-8")

            # Extract main content with Trafilatura
            extracted_text = trafilatura.extract(
                html_content,
                include_formatting=True,  # Keep structural formatting (headings, lists); set False for bare text
                include_links=False,       # Exclude hyperlinks
                include_tables=False,      # Exclude table content
                deduplicate=False           # Keep repeated segments; set True to drop duplicates
            ) or ""  # extract() returns None when nothing is extracted

            # Cap each document at 50,000 characters (about 50 KB for mostly-ASCII text)
            extracted_text = extracted_text[:50_000]
            text_bytes = extracted_text.encode("utf-8")

            if source_sizes[source] + len(text_bytes) > MAX_SIZE_BYTES:
                print(f"Skipping {html_file.name}: Exceeds 4 MB limit for {source}")
                continue

            # Save extracted text
            output_file = output_dir / f"{html_file.stem}.txt"
            output_file.write_bytes(text_bytes)
            source_sizes[source] += len(text_bytes)

            print(f"Extracted: {html_file.name} ({len(text_bytes)/1024:.1f} KB)")

        # Print sample from last extracted file (if any)
        if extracted_text:
            print(f"Sample extracted text from {source} (first 500 chars): {extracted_text[:500]}...")

    except Exception as e:
        print(f"Error processing {source}: {e}")

# Summary
print(f"✅ Done. Extracted files in 'data/extracted/*'")
print(f"Total sizes: Wiki={source_sizes['wiki']/1_000_000:.2f} MB, "
      f"GitHub={source_sizes['github']/1_000_000:.2f} MB, "
      f"StackExchange={source_sizes['stackexchange']/1_000_000:.2f} MB")

## 🚀 What's Next?

After extraction, we have plain text — but it's still messy!

**Next step**: **Data Cleaning**
- Remove duplicates
- Filter low-quality content
- Handle encoding issues
- Language detection
- Quality scoring

# **3. Data Cleaning**

We now have text extracted from all the sources, but it still needs cleaning before it can be used.

## ❓ Why Clean Data?

Raw extracted text still contains major issues:

| Issue | Problem | Impact |
|-------|---------|--------|
| **Duplicates** | Same content repeated multiple times | Leads to memorization and overfitting |
| **Near-duplicates** | Similar but not exact copies | Model learns same patterns repeatedly |
| **Multiple languages** | Content not in target language | Wasted training on irrelevant data |
| **Low quality content** | Spam, gibberish, auto-generated text | Degrades model performance |
| **Encoding issues** | `Caf�`, `â€™`, mojibake | Creates noise in vocabulary |
| **Harmful content** | Toxicity, profanity | Safety concerns |
| **PII data** | Personal information | Privacy and legal issues |

---

## 🔄 The Cleaning Pipeline

Modern systems use this multi-step approach:
```
Extracted Text
     ↓
Text Normalization
     ↓
Language Detection & Filtering
     ↓
Quality Filtering
     ↓
URL-level Deduplication
     ↓
Exact Deduplication
     ↓
Fuzzy Deduplication
     ↓
(Optional) Content Filtering
     ↓
Clean Training Data

```

---

## 🛠️ Step 1: Text Normalization

**What it does:** Fix basic text issues before processing.

**Common fixes:**
- Remove excessive whitespace and normalize line breaks
- Fix encoding errors (convert all to UTF-8)
- Remove special/invisible characters
- Standardize punctuation and formatting

**Why it matters:** Makes downstream steps (especially deduplication) more accurate.
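One of the fixes above, repairing encoding errors, is worth a tiny example. The sketch below undoes the most common kind of mojibake (UTF-8 bytes that were wrongly decoded as Windows-1252) and then applies NFKC normalization. `repair_mojibake` is a hypothetical helper for illustration; it only handles this one failure mode, and dedicated libraries such as `ftfy` cover many more cases.

```python
import unicodedata

def repair_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as Windows-1252', if that is what happened."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; leave it unchanged.
        return text

broken = "Itâ€™s a cafÃ©"
fixed = repair_mojibake(broken)
print(fixed)

# Already-clean text passes through untouched.
print(repair_mojibake("café"))

# NFKC folds compatibility characters, e.g. the 'ﬁ' ligature becomes 'fi'.
print(unicodedata.normalize("NFKC", "ﬁne"))
```

Note that Unicode normalization alone does not repair mojibake; the two fixes address different problems and are usually applied together.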

---

## 🌍 Step 2: Language Detection & Filtering

**What it does:** Identify and keep only content in your target language(s).

**How:** Use a library such as the **fastText language identification model**
- Recognizes 176 languages (trained on Wikipedia, Tatoeba, and SETimes data)
- Fast and accurate
- Returns language + confidence score

**Decision:** Set confidence threshold (e.g., keep only documents with >70% confidence in target language)

---

## ✂️ Step 3: Quality Filtering

**What it does:** Remove low-quality content using rule-based filters.

### Filter by Size:
- **Too short:** < 200 characters (likely fragments)
- **Too long:** > 100K words (likely corrupted or auto-generated)

### Filter by Repetition:
- **Line-level:** Remove if same line repeats 3+ times
- **Word-level:** Remove excessive duplication ("the the the")

### Filter by Content Patterns:
- Copyright notices: "© 2024 All rights reserved"
- Navigation: "Home | About | Contact"
- Boilerplate: "This site uses cookies"
- Placeholder: "Lorem ipsum", "Coming soon"
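The core pipeline also lists gibberish detection, which the filters above do not cover. Here is a minimal sketch of two cheap signals: the share of alphabetic characters and the mean word length. The function names and all thresholds (`0.6`, `2.0`, `12.0`) are assumptions picked for illustration, not values taken from any published pipeline.

```python
def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0

def mean_word_length(text: str) -> float:
    """Average length of whitespace-separated tokens."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def looks_like_gibberish(text: str, min_alpha: float = 0.6,
                         min_len: float = 2.0, max_len: float = 12.0) -> bool:
    """Flag text that is mostly symbols/digits or has implausible word lengths."""
    mwl = mean_word_length(text)
    return alpha_ratio(text) < min_alpha or not (min_len <= mwl <= max_len)

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "$$$ 1234 #### @@@@ 99%!!",
    "asdkjhqwlekjhasdlkjhqwelkjhasd",
]
for s in samples:
    print(f"{looks_like_gibberish(s)!s:5}  {s}")
```

Thresholds like these are language- and domain-dependent (code and tables have very different ratios from prose), so tune them on a sample of your own data.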

---

## 🔄 Step 4: Deduplication (Exact)

**What it does:** Remove duplicate URLs and identical documents.

### URL-level:
- Hash each URL
- Store in a set
- Skip if already seen

### Content-level:
- Hash the full document
- Use set or Bloom filter to track
- Remove if hash exists

**Impact:** Removes 15-30% of data typically.
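Both levels of exact deduplication fit in a few lines. The sketch below assumes records are `(url, text)` pairs and uses a plain Python set rather than a Bloom filter; `canonical_url`, `content_hash`, and `dedupe_exact` are hypothetical helper names. Lowercasing and collapsing whitespace before hashing is an assumption of this example; hashing the raw bytes would catch only byte-identical copies.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url: str) -> str:
    """Lowercase scheme/host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

def content_hash(text: str) -> str:
    """SHA-256 of lowercased, whitespace-collapsed text."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedupe_exact(records):
    """Keep the first record for each canonical URL and each content hash."""
    seen_urls, seen_hashes, unique = set(), set(), []
    for url, text in records:
        u, h = canonical_url(url), content_hash(text)
        if u in seen_urls or h in seen_hashes:
            continue
        seen_urls.add(u)
        seen_hashes.add(h)
        unique.append((url, text))
    return unique

records = [
    ("https://example.com/a", "Hello world"),
    ("https://EXAMPLE.com/a/#top", "Hello world, updated"),  # same URL after canonicalization
    ("https://example.com/b", "hello   WORLD"),               # same content after canonicalization
    ("https://example.com/c", "Something else"),
]
unique_records = dedupe_exact(records)
for url, text in unique_records:
    print(url, "->", text)
```

At web scale the in-memory sets would be swapped for a Bloom filter or a distributed key-value store, trading a small false-positive rate for much lower memory use.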

---

### 💡 Key Insight:

> **Exact deduplication is necessary but not sufficient.**

Many documents are *similar* but not *identical*:
- Same article with minor edits
- Different formatting of same content
- Paraphrased versions

This is where **fuzzy deduplication** becomes critical.

---

## 🔍 Step 5: Fuzzy Deduplication (Near-Duplicates)

**What it does:** Remove documents that are similar but not exactly the same.

### Why This Is Critical:

Near-duplicates cause:
- ❌ Memorization of repeated patterns
- ❌ Wasted compute training on same content
- ❌ Overfitting to specific phrasings
- ❌ Reduced effective dataset diversity

**Impact:** Modern pipelines remove 50-70% of remaining data!
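To show what "similar but not identical" means in practice, the sketch below compares documents by the Jaccard similarity of their word 3-shingles and greedily drops anything too close to a document already kept. This is the exact, quadratic-time version of the idea; MinHash (for example via `datasketch`) approximates the same similarity so that it scales to billions of documents. The helper names and the `0.7` threshold are assumptions for this example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word tuples from lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe_fuzzy(docs, threshold: float = 0.7, k: int = 3):
    """Keep a document only if it is below the threshold against all kept documents."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, other) >= threshold for other in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(s)
    return kept

d1 = "the quick brown fox jumps over the lazy dog near the river bank"
d2 = "the quick brown fox jumps over the lazy dog near the river shore"
d3 = "data cleaning pipelines remove duplicates before model training starts"

print(f"similarity(d1, d2) = {jaccard(shingles(d1), shingles(d2)):.2f}")
print(dedupe_fuzzy([d1, d2, d3]))
```

Exact hashing would treat `d1` and `d2` as different documents because one word changed; shingle overlap catches them as near-duplicates.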

---

### 🎯 Major Insight (2023-2024):

**Old thinking:**
- More data = better models

**New understanding:**
- More **unique** data = better models
- Training on 500GB deduplicated > 1TB raw data
- Aggressive fuzzy dedup often **improves** model quality

**Key learning:** Quality > Quantity

---

## 🛡️ Step 6: Content Filtering (Optional)

**What it does:** Remove harmful, toxic, or sensitive content.

### Common Filters:

#### **Toxicity & Profanity:**
- Use word lists or ML classifiers
- Trade-off: Safety vs. over-filtering

#### **PII (Personal Information):**
- Remove emails, phone numbers, addresses
- Use regex patterns or NER models
- Challenge: Slow at scale, many false positives
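A regex-based PII pass can be sketched as follows. The patterns are deliberately simple assumptions for illustration: the email pattern is loose, and the phone pattern only covers North-American style numbers such as `555-123-4567` or `(555) 123-4567`. Expect both false positives and misses on real data.

```python
import re

# Simple illustrative patterns, not production-grade PII detection.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")

def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

sample = "Contact jane.doe@example.com or call (555) 123-4567 / 555-987-6543."
print(redact_pii(sample))
```

Replacing matches with placeholder tokens instead of deleting them keeps sentences grammatical, which tends to matter for training text.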

#### **Adult Content:**
- Domain blocklists
- Content classifiers

**Note:** Many datasets skip this and leave filtering to end users.

---

## 📊 Real-World Impact

### Example: Common Crawl → FineWeb-Style Pipeline (Illustrative Numbers)

```
10 TB    Common Crawl (raw)
  ↓ Extraction
7 TB     (30% removed: HTML, scripts)
  ↓ Language Detection
5 TB     (~29% removed: non-English)
  ↓ Quality Filtering
3.5 TB   (30% removed: low quality)
  ↓ Exact Deduplication
2.5 TB   (~29% removed: duplicates)
  ↓ Fuzzy Deduplication
1 TB     (60% removed: near-duplicates!)

```

**Result:** 90% removed, but the remaining 10% is far higher quality!
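The per-stage percentages are relative to the previous stage, so they do not simply add up to the overall reduction. This short sketch recomputes them from the sizes in the example above; `stage_report` is a hypothetical helper, and the sizes are the example's round numbers rather than measured values.

```python
# (stage name, size in TB after that stage), taken from the example above
stages = [
    ("Common Crawl (raw)", 10.0),
    ("Extraction", 7.0),
    ("Language Detection", 5.0),
    ("Quality Filtering", 3.5),
    ("Exact Deduplication", 2.5),
    ("Fuzzy Deduplication", 1.0),
]

def stage_report(stages):
    """Return (name, size_tb, % removed at this stage, % of raw data kept so far)."""
    raw_tb = stages[0][1]
    rows = []
    for (_, prev_tb), (name, tb) in zip(stages, stages[1:]):
        removed_pct = 100 * (prev_tb - tb) / prev_tb
        kept_pct = 100 * tb / raw_tb
        rows.append((name, tb, round(removed_pct, 1), round(kept_pct, 1)))
    return rows

rows = stage_report(stages)
for name, tb, removed_pct, kept_pct in rows:
    print(f"{name:<20} {tb:>4} TB  removed {removed_pct:>4}%  kept {kept_pct:>5}% of raw")
```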

---

## 🎯 Key Takeaways

1. **Cleaning is not optional** — Raw data will hurt model performance
2. **Deduplication is the most impactful step** — Removes 50-70% of data but improves quality
3. **Fast heuristics > slow ML** — Rule-based filters work well and scale
4. **No one-size-fits-all** — Different use cases need different cleaning strategies

---




### **Step 1: Text Normalization**

In [None]:
# Step 1: Text Normalization
import re
import unicodedata
from pathlib import Path

# Config: Input/output directories and per-source byte counters
IN_DIR = Path("data/extracted")
OUT_DIR = Path("data/cleaned")
source_sizes = {"wiki": 0, "github": 0, "stackexchange": 0}

# Create output directories
for sub in ["wiki", "github", "stackexchange"]:
    (OUT_DIR / sub).mkdir(parents=True, exist_ok=True)

# Normalize text: whitespace, encoding, punctuation, special chars
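def normalize_text(text):
    # Normalize Unicode to NFKC (folds ligatures, full-width characters, etc.).
    # Note: this does not repair mojibake such as 'â€™'; that needs a separate fix-up step.
    text = unicodedata.normalize("NFKC", text)
    # Standardize punctuation (e.g., curly quotes to straight)
    text = re.sub(r'[‘’]', "'", text)
    text = re.sub(r'[“”]', '"', text)
    # Normalize line breaks, collapse runs of spaces/tabs, and cap consecutive blank lines.
    # Newlines are kept so the line-level repetition filter still has lines to compare.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r'[^\S\n]+', ' ', text)
    text = re.sub(r' ?\n ?', '\n', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Remove control and invisible format characters, keeping newlines and legitimate non-ASCII text
    text = "".join(ch for ch in text if ch == "\n" or unicodedata.category(ch) not in ("Cc", "Cf"))
    return text.strip()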

# Process each source
for source in ["wiki", "github", "stackexchange"]:
    print(f"Normalizing {source} text files...")
    input_dir = IN_DIR / source
    output_dir = OUT_DIR / source

    # Get all .txt files
    txt_files = list(input_dir.glob("*.txt"))

    try:
        sample_text = ""  # For sample output
        for txt_file in txt_files:
            # Read and normalize text
            text = txt_file.read_text(encoding="utf-8")
            normalized_text = normalize_text(text)
            text_bytes = normalized_text.encode("utf-8")

            # Save normalized text
            output_file = output_dir / f"{txt_file.stem}_normalized.txt"
            output_file.write_bytes(text_bytes)
            source_sizes[source] += len(text_bytes)
            sample_text = normalized_text

            print(f"   - Normalized: {txt_file.name} ({len(text_bytes)/1024:.1f} KB)")

    except Exception as e:
        print(f"Error normalizing {source}: {e}")

print(f"✅ Normalized files in 'data/cleaned/*'")
print(f"Total sizes: Wiki={source_sizes['wiki']/1_000_000:.2f} MB, "
      f"GitHub={source_sizes['github']/1_000_000:.2f} MB, "
      f"StackExchange={source_sizes['stackexchange']/1_000_000:.2f} MB")


### **Step 2: Language Detection**

In [None]:
# Step 2: Language Detection
# Using langdetect in place of fastText because it's easier to demo in a notebook

%pip install langdetect
import sys
from pathlib import Path
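from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # langdetect is randomized; fix the seed for reproducible results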

# Config: Read and write in the same directory; outputs get an "_en" suffix
IN_DIR = Path("data/cleaned")
OUT_DIR = Path("data/cleaned")
source_sizes = {"wiki": 0, "github": 0, "stackexchange": 0}

# Process each source
for source in ["wiki", "github", "stackexchange"]:
    print(f"Filtering {source} for English text...")
    input_dir = IN_DIR / source
    output_dir = OUT_DIR / source

    txt_files = list(input_dir.glob("*_normalized.txt"))
    if not txt_files:
        print(f"No normalized files found in {input_dir}")
        continue

    try:
        sample_text = ""
        for txt_file in txt_files:

            text = txt_file.read_text(encoding="utf-8")
            # Detect language with confidence
            try:
                langs = detect_langs(text)
                is_english = any(lang.lang == "en" and lang.prob >= 0.7 for lang in langs)
                if not is_english:
                    print(f"Skipping {txt_file.name}: Non-English or low confidence")
                    continue
            except Exception as e:
                print(f"Skipping {txt_file.name}: Language detection failed ({e})")
                continue

            text_bytes = text.encode("utf-8")

            # Save English text
            output_file = output_dir / f"{txt_file.stem}_en.txt"
            output_file.write_bytes(text_bytes)
            source_sizes[source] += len(text_bytes)
            sample_text = text

            print(f"  - Filtered: {txt_file.name} ({len(text_bytes)/1024:.1f} KB, English)")

    except Exception as e:
        print(f"Error filtering {source}: {e}")

print(f"✅ English-filtered files in 'data/cleaned/*'")
print(f"Total sizes: Wiki={source_sizes['wiki']/1_000_000:.2f} MB, "
      f"GitHub={source_sizes['github']/1_000_000:.2f} MB, "
      f"StackExchange={source_sizes['stackexchange']/1_000_000:.2f} MB")

### **Step 3: Quality Filtering**

In [None]:
# Step 3: Quality Filtering
from pathlib import Path
import re

# Config: Read and write in the same directory; outputs get a "_quality" suffix
IN_DIR = Path("data/cleaned")
OUT_DIR = Path("data/cleaned")
source_sizes = {"wiki": 0, "github": 0, "stackexchange": 0}

# Quality filters
def is_high_quality(text):
    # Too short: < 200 chars
    if len(text) < 200:
        return False, "Too short (< 200 chars)"
    # Too long: > 100K words (count whitespace-separated tokens directly)
    if len(text.split()) > 100_000:
        return False, "Too long (> 100K words)"
    # Repetitive lines: same line 3+ times
    lines = text.split("\n")
    if any(lines.count(line) >= 3 for line in set(lines) if line.strip()):
        return False, "Repetitive lines"
    # Repetitive words: "the the the" patterns
    if re.search(r'\b(\w+)\s+\1\s+\1\b', text, re.IGNORECASE):
        return False, "Repetitive words"
    # Boilerplate patterns
    boilerplate = [
        r"copyright.*all rights reserved",
        r"home\s*\|\s*about\s*\|\s*contact",
        r"this site uses cookies",
        r"lorem ipsum"
    ]
    if any(re.search(pattern, text, re.IGNORECASE) for pattern in boilerplate):
        return False, "Boilerplate content"
    return True, "Passed"

# Process each source
for source in ["wiki", "github", "stackexchange"]:
    print(f"Quality filtering {source} files...")
    input_dir = IN_DIR / source
    output_dir = OUT_DIR / source

    txt_files = list(input_dir.glob("*_en.txt"))
    if not txt_files:
        print(f"No English files found in {input_dir}")
        continue

    try:
        sample_text = ""
        for txt_file in txt_files:

            text = txt_file.read_text(encoding="utf-8")
            is_quality, reason = is_high_quality(text)
            if not is_quality:
                print(f"Skipping {txt_file.name}: {reason}")
                continue

            text_bytes = text.encode("utf-8")

            output_file = output_dir / f"{txt_file.stem}_quality.txt"
            output_file.write_bytes(text_bytes)
            source_sizes[source] += len(text_bytes)
            sample_text = text

            print(f"  - Filtered: {txt_file.name} ({len(text_bytes)/1024:.1f} KB)")

    except Exception as e:
        print(f"Error filtering {source}: {e}")

print(f"✅ Quality-filtered files in 'data/cleaned/*'")
print(f"Total sizes: Wiki={source_sizes['wiki']/1_000_000:.2f} MB, "
      f"GitHub={source_sizes['github']/1_000_000:.2f} MB, "
      f"StackExchange={source_sizes['stackexchange']/1_000_000:.2f} MB")

## 🚀 Next Steps: Going Deeper

Now that you understand the data cleaning pipeline, here are ways to expand:

### **Implement Deduplication (Intermediate)**
- **Exact dedup:** Use Python's `hashlib` to hash documents
- **Fuzzy dedup:** Explore libraries like `datasketch` (MinHash) or `text-dedup`
- **Project idea:** Build a dedup tool and measure impact on a small dataset

### **Add Content Filtering (Intermediate)**
- **Toxicity:** Use `detoxify` library or Perspective API
- **PII:** Regex patterns for emails/phones, or `presidio-analyzer`
- **Project idea:** Create a safety filter for your specific use case

### **Scale Up (Advanced)**
- Process full Common Crawl snapshots
- Implement distributed deduplication with Spark
- Build quality scoring models
