### Package install

In [None]:
# Note: PyMuPDF is published on PyPI as "pymupdf" (and imported as `fitz`); the PyPI package named "fitz" is unrelated.
! pip3 install trafilatura requests beautifulsoup4 pymupdf pytesseract pillow surya-ocr faster-whisper openai-whisper datasketch pandas

# Install ffmpeg so Whisper can decode your audio files.
# On macOS (with Homebrew):
! brew install ffmpeg
# On Ubuntu/Debian:
# ! sudo apt-get update -y
# ! sudo apt-get install -y ffmpeg
# On Windows (native; under WSL use the Ubuntu/Debian commands above):
# Download it from https://ffmpeg.org/download.html
# or use a package manager such as Chocolatey:
# ! choco install ffmpeg



# Week 3: Pretraining Data Collection & Extraction - Hands-on Notebook

## 1. Clean Web Page Text Using trafilatura

In [18]:
# ✅ Install dependencies if not already installed
# !pip install trafilatura

import trafilatura
import requests

# Example: An arXiv paper abstract page
url = "https://arxiv.org/abs/2404.00001"

# Step 1: Fetch raw HTML
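response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early on HTTP errors (4xx/5xx)
html = response.text

# Step 2: Use Trafilatura to extract clean text
downloaded_text = trafilatura.extract(html, include_comments=False, include_tables=False)

# Step 3: Display the result (extract() returns None when no main content is found)
print("📄 Extracted Text Preview:\n")
print((downloaded_text or "[no text extracted]")[:1000])  # Show first 1000 characters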


📄 Extracted Text Preview:

Physics > Physics Education
[Submitted on 3 Feb 2024]
Title:Uso de herramientas digitales matemáticas en la Educación Secundaria
View PDF HTML (experimental)Abstract:Information and Community Technologies (ICT) are very present in our society nowadays and particularly in the educative field. In just two decades, we have passed from a learning based, in many cases, on the master lessons to one such that methodologies like the flipped classroom or the gamification are stronger than ever. Along this work, we have done a study to teachers and students with the main objective to compare the knowledge on digital tools, their use and their acceptation. We use WxMaxima and Geogebra in order to solve an exercise of \textit{Evaluación de Bachillerato para el Acceso a la Universidad} (EBAU) related with Geometry, comparing their ins and outs with the manual solution. Finally, we expose some conclusions and some possible research lines about digital tools, as well as a p

Explanation:
trafilatura.extract() pulls main article content while removing headers, menus, and boilerplate.

This works great on academic websites like arXiv, blog posts, or news articles.

No need to write custom HTML parsers.
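
To make "removing headers, menus, and boilerplate" concrete, here is a deliberately naive sketch that uses only the standard library. It is not how trafilatura works internally (trafilatura uses far more robust heuristics); the class name `NaiveMainTextExtractor`, the list of skipped tags, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser

class NaiveMainTextExtractor(HTMLParser):
    """Toy extractor: keeps text that sits outside typical boilerplate tags."""
    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while we are inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def naive_extract(html):
    parser = NaiveMainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample_html = """
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article><h1>Title</h1><p>Main paragraph of the article.</p></article>
<footer>Copyright notice</footer>
</body></html>
"""
print(naive_extract(sample_html))
```

Even this toy version needs a tag list that will not fit every site, which is the maintenance burden a dedicated extractor saves you.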

## 2. OCR – Convert Images to Text
### Option A: Tesseract OCR (Offline)

In [None]:
# pytesseract is only a Python wrapper; if the Tesseract binary itself is not installed, install it first:
# ! sudo apt-get update -y
# ! sudo apt-get install -y tesseract-ocr

In [20]:
# Requires both the Tesseract binary (sudo apt install tesseract-ocr) and the Python packages (!pip install pytesseract Pillow)
import pytesseract
from PIL import Image

# Load and preprocess image (convert to grayscale)
image = Image.open("./test_data/image/image.png").convert("L")  # grayscale
text = pytesseract.image_to_string(image)

print("📄 Tesseract OCR Output (first 500 chars):")
print(text[:500])


📄 Tesseract OCR Output (first 500 chars):
a 4: Leading intelligence.

Unrivaled speed and efficiency.

The most accessible and scalable generation of Llama is here.
Native multimodality, mixture-of-experts models, super long
context windows, step changes in performance, and
unparalleled efficiency. All in easy-to-deploy sizes custom fit for
how you want to use it.



* Good for: simple single-column text, PDFs converted to images
* Struggles with complex layouts, math, or low-resolution scans
    * In the Tesseract output above, the first line came out as "a 4: Leading intelligence." and the "Download models" line was not extracted at all.

### Option B: Surya OCR (Fast, PyTorch-based, layout-aware)
https://github.com/VikParuchuri/surya

#### Usage
To perform OCR on an image, a PDF, or a folder containing them:

In [131]:
! surya_ocr ./test_data/image/image.png --langs en --images --output_dir results/


Loaded detection model vikp/surya_det3 on device mps with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device mps with dtype torch.float16
Detecting bboxes: 100%|███████████████████████████| 1/1 [00:02<00:00,  2.02s/it]
Recognizing Text: 100%|███████████████████████████| 1/1 [00:11<00:00, 11.26s/it]
Wrote results to /Users/scottlai/Library/Mobile Documents/com~apple~CloudDocs/Desktop/work/inferenceAI/Class3/results/image


Where:

**DATA_PATH** (the first positional argument, `./test_data/image/image.png` above) is the path to your image, PDF, or folder.

**--langs** specifies the language(s) for OCR (e.g., `en` for English).

**--images** saves images of the pages and detected text lines (optional).

**--output_dir** specifies the directory to save results.

This command generates a `results.json` file containing the detected text and bounding boxes.

Sample output structure

The **results.json** file will have entries like:

{
  "image": [
    {
      "text_lines": [
        {
          "polygon": [
            [
              13,
              48
            ],
            [
              538,
              51
            ],
            [
              538,
              87
            ],
            [
              12,
              84
            ]
          ],
          "confidence": 0.9970703125,
          "text": "Llama 4: Leading intelligence.",
          "bbox": [
            12,
            48,
            538,
            87
          ]
        },
        ...
        {
          "polygon": [
            [
              47,
              364
            ],
            [
              176,
              364
            ],
            [
              176,
              378
            ],
            [
              47,
              378
            ]
          ],
          "confidence": 0.9716796875,
          "text": "Download models",
          "bbox": [
            47,
            364,
            176,
            378
          ]
        }
      ],
      "languages": [
        "en"
      ],
      "image_bbox": [
        0,
        0,
        600,
        471
      ],
      "page": 1
    }
  ]
}
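
A minimal sketch of how this structure could be turned into plain text, assuming the layout shown above (a top-level key per input, a list of pages, and `text_lines` carrying `text`, `confidence`, and `bbox`). The helper name `lines_from_results` and the `min_confidence` cutoff are illustrative choices, not part of Surya.

```python
import json

def lines_from_results(results, min_confidence=0.9):
    """Collect recognized text per input, top-to-bottom, skipping low-confidence lines."""
    collected = {}
    for name, pages in results.items():
        texts = []
        for page in pages:
            kept = [ln for ln in page["text_lines"] if ln["confidence"] >= min_confidence]
            # bbox is [x_min, y_min, x_max, y_max]; sort by top edge, then left edge
            kept.sort(key=lambda ln: (ln["bbox"][1], ln["bbox"][0]))
            texts.extend(ln["text"] for ln in kept)
        collected[name] = texts
    return collected

# Trimmed copy of the sample above (polygons omitted, lines deliberately out of order).
# In practice you would json.load() the results.json written under your --output_dir.
sample = json.loads("""
{
  "image": [
    {
      "text_lines": [
        {"confidence": 0.9716796875, "text": "Download models", "bbox": [47, 364, 176, 378]},
        {"confidence": 0.9970703125, "text": "Llama 4: Leading intelligence.", "bbox": [12, 48, 538, 87]}
      ],
      "languages": ["en"],
      "image_bbox": [0, 0, 600, 471],
      "page": 1
    }
  ]
}
""")

print(lines_from_results(sample))
print(lines_from_results(sample, min_confidence=0.98))
```

Sorting by the top edge of `bbox` is only adequate for single-column pages; multi-column layouts need a real reading-order step.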

#### Or in Python code

In [127]:
from PIL import Image
from surya.detection import DetectionPredictor
from surya.recognition import RecognitionPredictor

# Load the image
image = Image.open("./test_data/image/image.png")  # Replace with your image path
langs = ["en"]  # Specify the language(s)

# Initialize predictors
detection_predictor = DetectionPredictor()
recognition_predictor = RecognitionPredictor()

# Perform OCR
predictions = recognition_predictor([image], [langs], detection_predictor)

# Display results with polygon coordinates
for page in predictions:
    for line in page.text_lines:
        print(f"Text: {line.text}")
        print(f"Confidence: {line.confidence}")
        print(f"Polygon: {line.polygon}\n")


Loaded detection model vikp/surya_det3 on device mps with dtype torch.float16
Loaded recognition model vikp/surya_rec2 on device mps with dtype torch.float16


Detecting bboxes: 100%|██████████| 1/1 [00:00<00:00,  2.03it/s]
Recognizing Text: 100%|██████████| 1/1 [00:03<00:00,  3.06s/it]

Text: Llama 4: Leading intelligence.
Confidence: 0.9970703125
Polygon: [[13.0, 48.0], [538.0, 51.0], [538.0, 87.0], [12.0, 84.0]]

Text: Unrivaled speed and efficiency.
Confidence: 0.99462890625
Polygon: [[12.0, 116.0], [564.0, 113.0], [565.0, 148.0], [13.0, 151.0]]

Text: The most accessible and scalable generation of Llama is here.
Confidence: 0.9990234375
Polygon: [[13.0, 186.0], [565.0, 186.0], [565.0, 204.0], [13.0, 204.0]]

Text: Native multimodality, mixture-of-experts models, super long
Confidence: 0.99462890625
Polygon: [[12.0, 214.0], [557.0, 212.0], [558.0, 230.0], [13.0, 231.0]]

Text: context windows, step changes in performance, and
Confidence: 0.99853515625
Polygon: [[13.0, 240.0], [481.0, 240.0], [481.0, 258.0], [13.0, 258.0]]

Text: unparalleled efficiency. All in easy-to-deploy sizes custom fit for
Confidence: 0.9990234375
Polygon: [[13.0, 268.0], [586.0, 268.0], [586.0, 285.0], [13.0, 285.0]]

Text: how you want to use it.
Confidence: 0.9765625
Polygon: [[13.0, 295.0




* Good for: structured layouts like academic papers
* Fast inference and easy to integrate with PDF workflows

### Option C: OpenAI GPT-4o Vision OCR (Highly Accurate & Multicolumn)
Don't forget to add your `OPENAI_API_KEY`.

In [129]:
import base64
import os
import requests

def vision_extract(b64_image, prompt, api_key, mime_type="image/png"):
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4o-mini",  # smaller GPT-4o variant; switch to "gpt-4o" for the full model
        "temperature": 0.0,
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                # The MIME type in the data URL should match the actual file format (PNG here)
                {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{b64_image}"}}
            ]}
        ],
        "max_tokens": 3000
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # surface HTTP/API errors instead of failing later with a KeyError
    return response.json()

# Load image and run GPT-4o OCR
with open("test_data/image/image.png", "rb") as f:
    b64_img = base64.b64encode(f.read()).decode("utf-8")

# Read the API key from the environment rather than hard-coding it in the notebook
api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
result = vision_extract(b64_img, "Extract all the readable text from this document.", api_key=api_key)
print(result["choices"][0]["message"]["content"])


**Llama 4: Leading intelligence. Unrivaled speed and efficiency.**

The most accessible and scalable generation of Llama is here. Native multimodality, mixture-of-experts models, super long context windows, step changes in performance, and unparalleled efficiency. All in easy-to-deploy sizes custom fit for how you want to use it.

**Download models**


* Good for: complex, multi-column documents and natural layout reasoning
* Great fallback when you need accuracy over speed
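
The request above embeds the image as a base64 data URL, and the MIME type in that URL should match the file being sent. As a small sketch, the helper below derives it from the file name with the standard-library `mimetypes` module. The name `to_data_url` is hypothetical, and the helper takes raw bytes so it can be tried without a file on disk.

```python
import base64
import mimetypes

def to_data_url(filename, raw_bytes):
    """Build a data URL whose MIME type is guessed from the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Not a recognized image file name: {filename!r}")
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Tiny placeholder payload; real use would pass open(path, "rb").read()
print(to_data_url("image.png", b"abc"))
print(to_data_url("scan.jpg", b"abc"))
```

The returned string can be used directly as the `url` value of the `image_url` entry in the request payload.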

## 3. Automatic Speech Recognition (ASR)
### Option A: Whisper by OpenAI

In [132]:
# ! brew install ffmpeg


In [43]:
# Install: pip install openai-whisper
import whisper

# Load model
model = whisper.load_model("base")  # or "small", "medium", "large"

# Transcribe audio
result = model.transcribe("./test_data/audio/sample-1.mp3")
print("📄 Whisper Transcription:")
print(result["text"])




📄 Whisper Transcription:
 So we pay for that, they didn't, you know, there are some people, they said, it's on, Kren the Benzo, instead of caught in a musical, I was just not talking about your own interest and just turning child, that's the...


* Great for: balanced speed and accuracy
* Supports many audio formats: mp3, wav, m4a, webm

### Option B: Faster-Whisper (Fast & Lightweight)

In [2]:
# ! pip install faster-whisper

In [None]:
from faster_whisper import WhisperModel

# Load the model with int8 quantization, which suits CPU inference (float16 is the usual choice on GPU)
model = WhisperModel("base", device="cpu", compute_type="int8")  # For CPUs

# Transcribe
segments, _ = model.transcribe("./test_data/audio/sample-1.mp3")

print("📄 Faster-Whisper Transcription:")
for segment in segments:
    print(f"[{segment.start:.2f} - {segment.end:.2f}] {segment.text}")




📄 Faster-Whisper Transcription:
[0.00 - 4.00]  If we pay for that, they don't, you know, there are some people who say that it's on.
[4.00 - 8.00]  Cren the Benzo instead of caught in a musical or there's not a story of what you're interested in.
[8.00 - 10.00]  It's just turning child, that's the...


* Runs on GPU or CPU; int8 quantization (as used above) keeps CPU inference fast and light on memory
* Useful when batch-processing long audio datasets
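
When batch-processing, it usually helps to store timestamped segments in a machine-readable form rather than printing them. The sketch below writes one JSON record per segment. It assumes only that each segment exposes `.start`, `.end`, and `.text`, as in the loop above; `Segment` is a hypothetical stand-in so the example runs without a model, and `segments_to_jsonl` is an illustrative name.

```python
import json
from collections import namedtuple

# Hypothetical stand-in for faster-whisper segments (.start, .end, .text)
Segment = namedtuple("Segment", ["start", "end", "text"])

def segments_to_jsonl(segments):
    """Serialize segments as JSON Lines: one {"start", "end", "text"} record per line."""
    lines = []
    for seg in segments:
        record = {
            "start": round(seg.start, 2),
            "end": round(seg.end, 2),
            "text": seg.text.strip(),  # Whisper text usually starts with a space
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

demo = [
    Segment(0.0, 4.0, " If we pay for that, they don't, you know, there are some people who say that it's on."),
    Segment(8.0, 10.0, " It's just turning child, that's the..."),
]
print(segments_to_jsonl(demo))
```

Because faster-whisper returns `segments` as a generator, it can be consumed only once; pass it straight to a helper like this or materialize it with `list(segments)` first.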

## 4. Pretraining Data Cleaning Pipeline
### Step 1: Remove duplicates using MinHash

In [119]:
from datasketch import MinHash, MinHashLSH

def minhash_deduplication(texts, threshold=0.7):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_texts = []
    for i, doc in enumerate(texts):
        m = MinHash(num_perm=128)
        for word in set(doc.split()):
            m.update(word.encode('utf8'))
        if not lsh.query(m):
            lsh.insert(f"doc{i}", m)
            unique_texts.append(doc)
    return unique_texts
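
To see why a 0.7 threshold treats near-identical documents as duplicates, the sketch below computes the exact Jaccard similarity of two word sets and a MinHash estimate of it. This is a from-scratch illustration built on salted SHA-1 hashes, not datasketch's implementation; the function names are made up, and the two sentences are the near-duplicate pair from the sample dataset used later in this notebook.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(tokens, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash over the tokens."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + tok.encode("utf8")).digest()[:8], "big")
            for tok in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "Python 3.14 introduces several improvements including better error messages. Learn more on the official site."
doc_b = "Python 3.14 introduces several improvements including better error messages. Learn more on the official docs."
set_a, set_b = set(doc_a.split()), set(doc_b.split())

exact = jaccard(set_a, set_b)  # 14 shared words out of 16 distinct words
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact Jaccard = {exact:.3f}, MinHash estimate = {estimate:.3f}")
```

The exact similarity here is 14/16 = 0.875, above the 0.7 threshold, so the second sentence counts as a near-duplicate of the first. The estimate fluctuates around that value and tightens as `num_perm` grows.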


### Step 2: Filter for language and strip HTML noise

In [1]:
! pip install langdetect



In [5]:
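from bs4 import BeautifulSoup
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic by default; fix the seed for reproducible results

def clean_html_and_filter_lang(texts, lang='en'):
    filtered = []
    for txt in texts:
        # get_text() joins text nodes with no separator, so adjacent tags can fuse words;
        # pass separator=" " if you want them kept apart
        txt = BeautifulSoup(txt, 'html.parser').get_text().strip()
        try:
            if detect(txt) == lang:
                filtered.append(txt)
        except LangDetectException:
            # Raised when the text has no detectable features (e.g., empty or digits only)
            continue
    return filtered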

### Step 3: Strip PII using regex

In [6]:
import re

def strip_pii(text):
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '[EMAIL]', text)
    text = re.sub(r'\b\d{12,19}\b', '[CREDIT_CARD]', text)
    text = re.sub(r'\b(?:\d{3}-){2}\d{4}\b', '[PHONE]', text)
    return text
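
These three patterns are intentionally simple, so it is worth checking what they catch and what they miss. The sketch below repeats the same regexes (compiled once, so the block runs on its own) and applies them to a few strings. Apart from the Chinese phone number and the 15-digit card number, which come from the sample dataset, the test strings are made up for illustration.

```python
import re

# Same three patterns as strip_pii above, compiled once
EMAIL = re.compile(r'[\w\.-]+@[\w\.-]+')
CARD = re.compile(r'\b\d{12,19}\b')
PHONE = re.compile(r'\b(?:\d{3}-){2}\d{4}\b')

def strip_pii(text):
    text = EMAIL.sub('[EMAIL]', text)
    text = CARD.sub('[CREDIT_CARD]', text)
    text = PHONE.sub('[PHONE]', text)
    return text

samples = [
    "Call 123-456-7890 or mail support@data.org today",  # both caught
    "Card number: 378282246310005.",                      # caught (15 digits)
    "Call (123) 456-7890",                                # missed: parentheses and space
    "Card 4111 1111 1111 1111",                           # missed: digits split by spaces
    "电话号码：010-12345678",                               # missed: not in ddd-ddd-dddd form
    "Mail a@b.com.",                                      # over-match: trailing period is swallowed
]
for s in samples:
    print(f"{s!r} -> {strip_pii(s)!r}")
```

For real pretraining data you would likely need broader patterns or a dedicated PII detection tool; this check only makes the gaps visible.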

### Step 4: Remove repetitive n-grams

In [99]:
import re
from collections import Counter

def remove_repetitive_ngrams(text, n=3, threshold=3):
    words = text.split()
    ngrams = [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

    counts = Counter(ngrams)
    repetitive = [ngram for ngram, count in counts.items() if count >= threshold]

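    for phrase in repetitive:
        # regex-safe version of the phrase
        escaped_phrase = re.escape(phrase)
        # match the phrase repeated `threshold`+ times in a row (optional whitespace between copies)
        # and collapse the run to a single copy; the lambda keeps any backslashes in `phrase` literal
        text = re.sub(rf'(?:{escaped_phrase}\s*){{{threshold},}}', lambda _m, p=phrase: p + ' ', text)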

    # Remove extra spaces
    text = re.sub(r'\s{2,}', ' ', text).strip()
    return text
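
To see which phrases cross the threshold, the sketch below reproduces just the counting step on the "Buy now!" sample from the dataset used later in this notebook. `ngram_counts` is an illustrative helper name; the counting logic mirrors the function above.

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Count overlapping word n-grams, as remove_repetitive_ngrams does internally."""
    words = text.split()
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

sample = "Buy now! Best product ever. Best product ever. Best product ever."
counts = ngram_counts(sample)
for ngram, count in counts.most_common():
    print(count, repr(ngram))

# Only n-grams seen at least `threshold` (default 3) times are treated as repetitive
repetitive = [ngram for ngram, count in counts.items() if count >= 3]
print(repetitive)
```

Only "Best product ever." reaches a count of 3. The overlapping n-grams "product ever. Best" and "ever. Best product" appear twice each, so they stay below the default threshold.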


### Step 5: Prepare the text data
Load `Fake_Pretraining_Texts.csv`.

In [130]:
import pandas as pd
fake_texts = pd.read_csv("test_data/data/Fake_Pretraining_Texts.csv")
raw_dataset = fake_texts["Raw Text"]
print(raw_dataset)

0    Hello! Contact us at support@data.org or call ...
1    Hola! Este artículo está completamente en espa...
2    <html><body><div><h1>Breaking News</h1><p>This...
3    Buy now! Best product ever. Best product ever....
4    Python 3.14 introduces several improvements in...
5    Python 3.14 introduces several improvements in...
6    <div>For inquiries, email jane_doe@example.com...
7    Large Language Models are transforming the AI ...
8                  这是一个包含有用技术信息的中文段落。电话号码：010-12345678
9    Buy now! Best product ever. Best product ever....
Name: Raw Text, dtype: object


### Step 6: Apply the Cleaning Pipeline

In [121]:
# Step 1: Remove HTML + Language Filter
step1 = clean_html_and_filter_lang(raw_dataset)
display(step1)

['Hello! Contact us at support@data.org or call 123-456-7890. Your credit card 4111111111111111 was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official docs.',
 'For inquiries, email jane_doe@example.com or visit our site. Card number: 378282246310005.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.',
 'Buy now! Best product ever. Best product ever. Best product ever.']

In [122]:
# Step 2: Deduplicate Paragraphs
step2 = minhash_deduplication(step1)
display(step2)


['Hello! Contact us at support@data.org or call 123-456-7890. Your credit card 4111111111111111 was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'For inquiries, email jane_doe@example.com or visit our site. Card number: 378282246310005.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [123]:
# Step 3: Strip PII
step3 = [strip_pii(t) for t in step2]
display(step3)

['Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'For inquiries, email [EMAIL] or visit our site. Card number: [CREDIT_CARD].',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [124]:
# Step 4: Remove Repetitive N-grams
cleaned_data = [remove_repetitive_ngrams(t) for t in step3]
display(cleaned_data)

['Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'For inquiries, email [EMAIL] or visit our site. Card number: [CREDIT_CARD].',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [125]:
# Done!
print("✅ Cleaned dataset sample:")
for idx, text in enumerate(cleaned_data):
    print(f"--- Article {idx + 1} ---")
    print(text)


✅ Cleaned dataset sample:
--- Article 1 ---
Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.
--- Article 2 ---
Breaking NewsThis is a major event!Contact us
--- Article 3 ---
Buy now! Best product ever.
--- Article 4 ---
Python 3.14 introduces several improvements including better error messages. Learn more on the official site.
--- Article 5 ---
For inquiries, email [EMAIL] or visit our site. Card number: [CREDIT_CARD].
--- Article 6 ---
Large Language Models are transforming the AI landscape with few-shot capabilities.
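
When a pipeline like this grows, it helps to see how many documents survive each stage. The sketch below shows one possible way to chain stages and record the counts. `run_pipeline` and the three stand-in stages are hypothetical and use only the standard library, so the block runs without the packages above.

```python
def run_pipeline(texts, stages):
    """Apply (name, function) stages in order and record how many documents survive each."""
    report = [("input", len(texts))]
    for name, stage in stages:
        texts = stage(texts)
        report.append((name, len(texts)))
    return texts, report

# Stand-in stages (hypothetical, not the cleaning functions defined earlier)
def drop_empty(texts):
    return [t for t in texts if t.strip()]

def exact_dedup(texts):
    return list(dict.fromkeys(texts))  # preserves first-seen order

def normalize_spaces(texts):
    return [" ".join(t.split()) for t in texts]

docs = ["Buy now!", "", "Buy now!", "Large  Language Models"]
cleaned, report = run_pipeline(docs, [
    ("drop_empty", drop_empty),
    ("exact_dedup", exact_dedup),
    ("normalize_spaces", normalize_spaces),
])
for name, count in report:
    print(f"{name}: {count} docs")
print(cleaned)
```

The notebook's own functions fit the same shape: `clean_html_and_filter_lang` and `minhash_deduplication` already take and return lists, while `strip_pii` and `remove_repetitive_ngrams` would need a small list-comprehension wrapper.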
