### Package install

In [1]:
! pip3 install trafilatura requests bs4 fitz pytesseract pillow surya-ocr faster-whisper openai-whisper datasketch

# Install ffmpeg so Whisper can decode your audio
# On macOS (with Homebrew):
! brew install ffmpeg
# On Ubuntu/Debian:
# ! sudo apt-get update -y
# ! sudo apt-get install -y ffmpeg
# 👉 On Windows (if using WSL or native):
# You can download it from:
# 🔗 https://ffmpeg.org/download.html
# Or use a package manager like Chocolatey:
# ! choco install ffmpeg

Collecting trafilatura
  Downloading trafilatura-2.0.0-py3-none-any.whl.metadata (12 kB)
Collecting bs4
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting fitz
  Downloading fitz-0.0.1.dev2-py2.py3-none-any.whl.metadata (816 bytes)
Collecting surya-ocr
  Downloading surya_ocr-0.15.2-py3-none-any.whl.metadata (32 kB)
Collecting faster-whisper
  Downloading faster_whisper-1.2.0-py3-none-any.whl.metadata (16 kB)
Collecting openai-whisper
  Downloading openai_whisper-20250625.tar.gz (803 kB)
     ---------------------------------------- 0.0/803.2 kB ? eta -:--:--
     ---------------------------------------- 803.2/803.2 kB 8.6 MB/s  0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'

'brew' is not recognized as an internal or external command,
operable program or batch file.


# Week 3: Pretraining Data Collection & Extraction - Hands-on Notebook

## 1. Clean Web Page Text Using trafilatura

In [3]:
# ✅ Install dependencies if not already installed
# !pip install trafilatura

import trafilatura
import requests

# Example: An arXiv paper abstract page
url = "https://arxiv.org/abs/2404.00001"

# Step 1: Fetch raw HTML
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.text

# Step 2: Use Trafilatura to extract clean text (returns None if nothing is extracted)
downloaded_text = trafilatura.extract(html, include_comments=False, include_tables=False)

# Step 3: Display the result
print("📄 Extracted Text Preview:\n")
print((downloaded_text or "")[:1000])  # Show first 1000 characters
print()

# Repeat on a non-article page for comparison
url = "https://www.apple.com/"
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.text
downloaded_text = trafilatura.extract(html, include_comments=False, include_tables=False)
print("📄 Extracted Text Preview:\n")
print((downloaded_text or "")[:1000])  # Show first 1000 characters

📄 Extracted Text Preview:

Physics > Physics Education
[Submitted on 3 Feb 2024]
Title:Uso de herramientas digitales matemáticas en la Educación Secundaria
View PDF HTML (experimental)Abstract:Information and Community Technologies (ICT) are very present in our society nowadays and particularly in the educative field. In just two decades, we have passed from a learning based, in many cases, on the master lessons to one such that methodologies like the flipped classroom or the gamification are stronger than ever. Along this work, we have done a study to teachers and students with the main objective to compare the knowledge on digital tools, their use and their acceptation. We use WxMaxima and Geogebra in order to solve an exercise of \textit{Evaluación de Bachillerato para el Acceso a la Universidad} (EBAU) related with Geometry, comparing their ins and outs with the manual solution. Finally, we expose some conclusions and some possible research lines about digital tools, as well as a p

Explanation:
trafilatura.extract() pulls main article content while removing headers, menus, and boilerplate.

This works great on academic websites like arXiv, blog posts, or news articles.

No need to write custom HTML parsers.
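
To see the kind of work `trafilatura.extract()` saves you, here is a deliberately naive standard-library sketch that drops text nested inside a few boilerplate tags. It is only an illustration and not trafilatura's actual algorithm; the `SKIP_TAGS` set, the `MainTextParser` class, and the `naive_main_text` helper are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Tags whose contents this toy example treats as boilerplate (an assumption).
SKIP_TAGS = {"header", "nav", "footer", "script", "style", "aside"}

class MainTextParser(HTMLParser):
    """Collect text that is not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def naive_main_text(html):
    """Return the non-boilerplate text nodes, one per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

sample = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <article><h1>Article Title</h1><p>This is the main body of the text.</p></article>
  <footer>Copyright notice</footer>
</body></html>
"""
print(naive_main_text(sample))
```

Real pages rarely mark boilerplate this cleanly, which is why a dedicated extractor is the more practical choice.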

## 2. OCR – Convert Images to Text
### Option A: Tesseract OCR (Offline)

In [None]:
# If the Tesseract binary is not installed, install it first (Ubuntu/Debian):
# ! sudo apt-get update -y
# ! sudo apt-get install -y tesseract-ocr

In [4]:
# Requires both the Tesseract binary (sudo apt install tesseract-ocr) and the Python packages (pip install pytesseract Pillow)
import pytesseract
from PIL import Image

# Load and preprocess image (convert to grayscale)
image = Image.open("./test_data/image/image.png").convert("L")  # grayscale
text = pytesseract.image_to_string(image)

print("📄 Tesseract OCR Output (first 500 chars):")
print(text[:500])


📄 Tesseract OCR Output (first 500 chars):
a 4: Leading intelligence.

Unrivaled speed and efficiency.

The most accessible and scalable generation of Llama is here.
Native multimodality, mixture-of-experts models, super long
context windows, step changes in performance, and
unparalleled efficiency. All in easy-to-deploy sizes custom fit for
how you want to use it.



* Good for: simple single-column text, PDFs converted to images
* Struggles with layout, math, or low-res scans
    * As the Tesseract output above shows, "Llama" in the headline is misread as "a" and the "Download models" text is not extracted.

### Option B: Surya OCR (Fast PyTorch-based layout-aware tool)
https://github.com/VikParuchuri/surya

### Usage
To perform OCR on an image, PDF, or a folder containing them:

In [8]:
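# The --langs flag is rejected by the installed surya-ocr 0.15.2 (see the error below)
! surya_ocr ./test_data/image/image.png --langs en --images --output_dir results/

# Without --langs the command is accepted
! surya_ocr ./test_data/image/image.png --images --output_dir results/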


Usage: surya_ocr [OPTIONS] INPUT_PATH
Try 'surya_ocr --help' for help.

Error: No such option: --langs Did you mean --images?

Downloading text_recognition model to C:\Users\johnny\AppData\Local\datalab\datalab\Cache\models\text_recognition/2025_08_04:   0%|          | 0/11 [00:00<?, ?it/s]2025-08-06 15:28:53,700 [ERROR] surya: Download error for file https://models.datalab.to/text_recognition/2025_08_04/model.safetensors: ('Connection broken: IncompleteRead(2184376320 bytes read, 173371606 more expected)', IncompleteRead(2184376320 bytes read, 173371606 more expected))
2025-08-06 15:28:53,704 [ERROR] surya: Error downloading model from text_recognition/2025_08_04. Attempt 1 of 3. Error: ('Connection broken: IncompleteRead(2184376320 bytes read, 173371606 more expected)', IncompleteRead(2184376320 bytes read, 173371606 more expected))
2025-08-06 15:28:53,704 [INFO] surya: Retrying in 5 seconds...


Downloading text_recognition model to C:\Users\johnny\AppData\Local\datalab\datalab\Cach

Where:

**INPUT_PATH** is the path to your image, PDF, or folder.

**--langs** specified the language(s) for OCR (e.g., en for English) in earlier releases; the surya-ocr 0.15.2 CLI used here no longer accepts it.

**--images** saves images of the pages and detected text lines (optional).

**--output_dir** specifies the directory to save results.

This command will generate a results.json file containing the detected text and bounding boxes.

Sample Output Structure
The **results.json** will have entries like:

{
  "image": [
    {
      "text_lines": [
        {
          "polygon": [
            [
              13,
              48
            ],
            [
              538,
              51
            ],
            [
              538,
              87
            ],
            [
              12,
              84
            ]
          ],
          "confidence": 0.9970703125,
          "text": "Llama 4: Leading intelligence.",
          "bbox": [
            12,
            48,
            538,
            87
          ]
        },
        ...
        {
          "polygon": [
            [
              47,
              364
            ],
            [
              176,
              364
            ],
            [
              176,
              378
            ],
            [
              47,
              378
            ]
          ],
          "confidence": 0.9716796875,
          "text": "Download models",
          "bbox": [
            47,
            364,
            176,
            378
          ]
        }
      ],
      "languages": [
        "en"
      ],
      "image_bbox": [
        0,
        0,
        600,
        471
      ],
      "page": 1
    }
  ]
}
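
A structure like this is straightforward to post-process. The following is a minimal sketch, assuming the layout shown above (a top-level key per input document, each holding a list of pages with `text_lines`). The `extract_lines` helper and the `min_confidence` cutoff of 0.98 are hypothetical choices for illustration, and the embedded JSON is a trimmed stand-in built from two of the lines above rather than a real `results.json`.

```python
import json

# A trimmed stand-in for results.json, using two of the lines shown above.
sample_results = json.loads("""
{
  "image": [
    {
      "text_lines": [
        {"text": "Llama 4: Leading intelligence.", "confidence": 0.9970703125, "bbox": [12, 48, 538, 87]},
        {"text": "Download models", "confidence": 0.9716796875, "bbox": [47, 364, 176, 378]}
      ],
      "page": 1
    }
  ]
}
""")

def extract_lines(results, min_confidence=0.0):
    """Return (document, page, text) tuples, top to bottom, for lines at or above min_confidence."""
    rows = []
    for doc_name, pages in results.items():
        for page in pages:
            # bbox is [x1, y1, x2, y2]; sort by the top edge to get reading order
            lines = sorted(page["text_lines"], key=lambda line: line["bbox"][1])
            for line in lines:
                if line["confidence"] >= min_confidence:
                    rows.append((doc_name, page["page"], line["text"]))
    return rows

for doc_name, page_no, text in extract_lines(sample_results, min_confidence=0.98):
    print(f"{doc_name} p{page_no}: {text}")
```

Sorting by the top edge of the bounding box only approximates reading order for single-column pages; multi-column layouts would need a smarter ordering.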

#### Or in Python code

In [21]:
from PIL import Image
from surya.foundation import FoundationPredictor
from surya.detection import DetectionPredictor
from surya.recognition import RecognitionPredictor

# Load the image
image = Image.open("./test_data/image/image.png")  # Replace with your image path

# Initialize predictors
foundation_predictor = FoundationPredictor()
recognition_predictor = RecognitionPredictor(foundation_predictor)
detection_predictor = DetectionPredictor()


# Perform OCR
predictions = recognition_predictor([image], det_predictor=detection_predictor)

# Display results with polygon coordinates
for page in predictions:
    for line in page.text_lines:
        print(f"Text: {line.text}")
        print(f"Confidence: {line.confidence}")
        print(f"Polygon: {line.polygon}\n")


Detecting bboxes: 100%|██████████| 1/1 [00:00<00:00,  2.96it/s]
Recognizing Text: 100%|██████████| 8/8 [00:03<00:00,  2.21it/s]

Text: Llama 4: Leading intelligence.
Confidence: 0.9846809168656667
Polygon: [[14.0, 52.0], [535.0, 52.0], [535.0, 86.0], [14.0, 86.0]]

Text: Unrivaled speed and efficiency.
Confidence: 0.9874766815093255
Polygon: [[13.0, 118.0], [560.0, 118.0], [560.0, 150.0], [13.0, 150.0]]

Text: The most accessible and scalable generation of Llama is here.
Confidence: 0.9974610502602624
Polygon: [[13.0, 187.0], [564.0, 187.0], [564.0, 204.0], [13.0, 204.0]]

Text: Native multimodality, mixture-of-experts models, super long
Confidence: 0.9975904054560903
Polygon: [[14.0, 214.0], [556.0, 214.0], [556.0, 231.0], [14.0, 231.0]]

Text: context windows, step changes in performance, and
Confidence: 0.9984168057539025
Polygon: [[12.0, 241.0], [480.0, 241.0], [480.0, 258.0], [12.0, 258.0]]

Text: unparalleled efficiency. All in easy-to-deploy sizes custom fit for
Confidence: 0.9977069648344126
Polygon: [[13.0, 268.0], [585.0, 268.0], [585.0, 285.0], [13.0, 285.0]]

Text: how you want to use it.
Confidence:




* Good for: structured layouts like academic papers
* Fast inference and easy to integrate with PDF workflows

### Option C: OpenAI GPT-4o Vision OCR (Highly Accurate & Multicolumn)
Don't forget to provide your `OPENAI_API_KEY`.

In [None]:
import base64
import os
import requests

def vision_extract(b64_image, prompt, api_key, mime_type="image/png"):
    """Send a base64-encoded image to the Chat Completions API and return the parsed JSON."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    payload = {
        "model": "gpt-4o-mini",
        "temperature": 0.0,
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                # The MIME type must match the file: the sample image is a PNG, not a JPEG
                {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{b64_image}"}}
            ]}
        ],
        "max_tokens": 3000
    }

    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of failing later on a missing "choices" key
    return response.json()

# Load image and run GPT-4o OCR
with open("test_data/image/image.png", "rb") as f:
    b64_img = base64.b64encode(f.read()).decode("utf-8")

# Read the API key from the OPENAI_API_KEY environment variable rather than hard-coding it
result = vision_extract(b64_img, "Extract all the readable text from this document.", api_key=os.environ["OPENAI_API_KEY"])
print(result["choices"][0]["message"]["content"])


**Llama 4: Leading intelligence. Unrivaled speed and efficiency.**

The most accessible and scalable generation of Llama is here. Native multimodality, mixture-of-experts models, super long context windows, step changes in performance, and unparalleled efficiency. All in easy-to-deploy sizes custom fit for how you want to use it.

**Download models**


* Good for: complex, multi-column documents and natural layout reasoning
* Great fallback when you need accuracy over speed

## 3. Automatic Speech Recognition (ASR)
### Option A: Whisper by OpenAI

In [132]:
# ! brew install ffmpeg


In [23]:
# Install: pip install openai-whisper
import whisper

# Load model
model = whisper.load_model("base")  # or "small", "medium", "large"

# Transcribe audio
result = model.transcribe("./test_data/audio/sample-1.mp3")
print("📄 Whisper Transcription:")
print(result["text"])


100%|███████████████████████████████████████| 139M/139M [00:07<00:00, 18.9MiB/s]


FileNotFoundError: [WinError 2] The system cannot find the file specified

* Great for: balanced speed and accuracy
* Supports many audio formats: mp3, wav, m4a, webm
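
The `FileNotFoundError: [WinError 2]` in the output above is most likely a missing ffmpeg rather than a missing audio file: the earlier `brew install ffmpeg` failed on this Windows machine, and Whisper launches ffmpeg as a subprocess to decode audio. A small pre-flight check can make that failure easier to diagnose. This is a sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name.

```python
import shutil

def ffmpeg_available(executable="ffmpeg"):
    """Return True if the executable can be found on PATH."""
    return shutil.which(executable) is not None

if ffmpeg_available():
    print("ffmpeg found at:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found on PATH; install it before calling model.transcribe().")
```

Running this before `model.transcribe()` turns an opaque subprocess error into an explicit message about the missing dependency.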

### Option B: Faster-Whisper (Fast & Lightweight)

In [2]:
# ! pip install faster-whisper

In [24]:
from faster_whisper import WhisperModel

# Load model with int8 quantization, which suits CPU inference
model = WhisperModel("base", device="cpu", compute_type="int8")  # For CPUs

# Transcribe
segments, _ = model.transcribe("./test_data/audio/sample-1.mp3")

print("📄 Faster-Whisper Transcription:")
for segment in segments:
    print(f"[{segment.start:.2f} - {segment.end:.2f}] {segment.text}")


  import pkg_resources


model.bin:   0%|          | 0.00/145M [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

vocabulary.txt: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

📄 Faster-Whisper Transcription:
[0.00 - 4.44]  So, we pay for that, they didn't, you know, there are some people who said it's on Cren the
[4.44 - 8.04]  Benzo, instead of court and a musical or there's not a story of what you're interested
[8.04 - 9.04]  in just turning child.
[9.04 - 10.04]  That's the...


* Optimized for GPU, and also usable on CPU (as in the int8 example above)
* Useful when batch-processing long audio datasets

## 4. Pretraining Data Cleaning Pipeline
### Step 1: Remove duplicates using MinHash

In [25]:
from datasketch import MinHash, MinHashLSH

def minhash_deduplication(texts, threshold=0.7):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique_texts = []
    for i, doc in enumerate(texts):
        m = MinHash(num_perm=128)
        for word in set(doc.split()):
            m.update(word.encode('utf8'))
        if not lsh.query(m):
            lsh.insert(f"doc{i}", m)
            unique_texts.append(doc)
    return unique_texts
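
MinHash estimates the Jaccard similarity between the word sets of two documents, and `threshold=0.7` asks the LSH index to flag pairs at roughly that similarity or higher. To build intuition for the threshold, the sketch below computes the exact Jaccard similarity for two near-duplicate sentences that differ only in their last word. The `jaccard` helper is a hypothetical name for illustration; `datasketch` does not need it, and LSH lookups are approximate, so borderline pairs may be treated differently in practice.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between the word sets of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    # Two empty documents are treated as identical
    return len(set_a & set_b) / len(union) if union else 1.0

doc_a = "Python 3.14 introduces several improvements including better error messages. Learn more on the official site."
doc_b = "Python 3.14 introduces several improvements including better error messages. Learn more on the official docs."

similarity = jaccard(doc_a, doc_b)
print(f"Jaccard similarity: {similarity:.3f}")  # 14 shared words / 16 distinct words = 0.875
```

At 0.875 the pair sits comfortably above the 0.7 threshold, so the second sentence would normally be dropped as a near-duplicate.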


### Step 2: Filter for language and strip HTML noise

In [26]:
! pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ---------------------------------------- 0.0/981.5 kB ? eta -:--:--
     ---------------------------------------- 981.5/981.5 kB 23.2 MB/s  0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py): started
  Building wheel for langdetect (setup.py): finished with status 'done'
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993363 sha256=f284fa92a1cdab41276628913f22728836e148910e3ead6c39236a57a7ec4f44
  Stored in directory: c:\users\johnny\appdata\local\pip\cache\wheels\0a\f2\b2\e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


  DEPRECATION: Building 'langdetect' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'langdetect'. Discussion can be found at https://github.com/pypa/pip/issues/6334


In [61]:
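from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException
from bs4 import BeautifulSoup

def clean_html_and_filter_lang(texts, lang='en'):
    """Strip HTML tags, then keep only the texts detected as `lang`."""
    filtered = []
    for txt in texts:
        txt = BeautifulSoup(txt, 'html.parser').get_text().strip()
        try:
            if detect(txt) == lang:
                filtered.append(txt)
        except LangDetectException:
            # Raised when there is nothing to detect (e.g. empty text); skip the item
            continue
    return filtered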

html_doc =[ """
<html>
  <body>
    <div id="header">...</div>
    <div id="main-content">
      <h1>Article Title</h1>
      <p>This is the main body of the text.</p>
      <img src="..."/>
    </div>
    <div id="footer">...</div>
  </body>
</html>
""" ,
"""
<html>
  <body>
    <h1>Test Header</h1>
    <p class="intro">This is an intro paragraph.</p>
    <a href="/link1">Link 1</a>
    <a href="/link2">Link 2</a>
  </body>
</html>
""" ]


clean_html = clean_html_and_filter_lang(html_doc)
for html in clean_html:
    print(html)
print()

url = "http://apple.com"

# Make the request and get the content
response = requests.get(url)

# Check for a successful response before proceeding
if response.status_code == 200:
    # Create the BeautifulSoup object
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.get_text(separator=' ', strip=True))
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")





...

Article Title
This is the main body of the text.


...
Test Header
This is an intro paragraph.
Link 1
Link 2

Apple Apple Apple Store Mac iPad iPhone Watch Vision AirPods TV & Home Entertainment Accessories Support 0 + Buy Mac or iPad for college with education savings Choose AirPods or an eligible accessory 1 Shop MacBook Air Sky blue color. Sky high performance with M4. Learn more Buy Built for Apple Intelligence. iPhone Meet the iPhone 16 family. Learn more Shop iPhone Built for Apple Intelligence. iPad Pro Unbelievably thin. Incredibly powerful. Learn more Buy Built for Apple Intelligence. Apple Intelligence Point. Shoot. Cook. With visual intelligence. 2 Watch the clip Learn more AirPods Pro 2 Now with a Hearing Aid feature. 3 Learn more Buy Apple Watch Series 10 Thinstant classic. Learn more Buy Apple Trade In Get $160-$600 in credit when you trade in iPhone 12 or higher. 4 Get your estimate Apple Card Get up to 3% Daily Cash back with every purchase. Learn more Apply now Ap

### Step 3: Strip PII using regex

In [34]:
import re

def strip_pii(text):
    text = re.sub(r'[\w\.-]+@[\w\.-]+', '[EMAIL]', text)
    text = re.sub(r'\b\d{12,19}\b', '[CREDIT_CARD]', text)
    text = re.sub(r'\b(?:\d{3}-){2}\d{4}\b', '[PHONE]', text)
    return text

my_doc = " this is sent by abc@beanai.com     your credit card number is 1234567890123   your phone number is 123-456-3456"
print("before: " + my_doc)
print("after: " + strip_pii(my_doc))

before:  this is sent by abc@beanai.com     your credit card number is 1234567890123   your phone number is 123-456-3456
after:  this is sent by [EMAIL]     your credit card number is [CREDIT_CARD]   your phone number is [PHONE]
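
The `\b\d{12,19}\b` pattern masks any long digit run, including order numbers or IDs. One possible refinement, sketched here as an assumption rather than as part of the pipeline above, is to mask only digit runs that pass the Luhn checksum used by payment card numbers. `luhn_valid` and `mask_cards` are hypothetical helper names.

```python
import re

def luhn_valid(number):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def mask_cards(text):
    """Replace 12-19 digit runs with [CREDIT_CARD] only when they pass the Luhn check."""
    def replace(match):
        return "[CREDIT_CARD]" if luhn_valid(match.group()) else match.group()
    return re.sub(r"\b\d{12,19}\b", replace, text)

print(mask_cards("card 4111111111111111, order id 1234567890123"))
```

The trade-off is recall: a mistyped card number fails the checksum and stays in the text, so the broader pattern may still be the safer default for pretraining data.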


### Step 4: Remove repetitive n-grams

In [57]:
import re
from collections import Counter

def remove_repetitive_ngrams(text, n=3, threshold=3):
    words = text.split()
    ngrams = [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

    counts = Counter(ngrams)
    repetitive = [ngram for ngram, count in counts.items() if count >= threshold]

    for phrase in repetitive:
        # regex-safe version of the phrase
        escaped_phrase = re.escape(phrase)
        # match the phrase repeated `threshold` or more times in a row, with optional whitespace between repeats
        text = re.sub(rf'(?:{escaped_phrase}\s*){{{threshold},}}', phrase + ' ', text)

    # Remove extra spaces
    text = re.sub(r'\s{2,}', ' ', text).strip()
    return text
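
To see what the function counts, the sketch below reproduces only its first half (the sliding-window n-gram count) on a short spam-like string. The input string is an illustrative example, and the variable names mirror those in `remove_repetitive_ngrams`.

```python
from collections import Counter

text = "Buy now! Best product ever. Best product ever. Best product ever."
words = text.split()
n = 3

# Slide a window of n words across the text
ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
counts = Counter(ngrams)

for ngram, count in counts.most_common(3):
    print(f"{count}x  {ngram!r}")
```

Only `'Best product ever.'` reaches the default `threshold=3`, so it is the one phrase the regex step collapses to a single occurrence. The overlapping windows that straddle two repeats occur only twice and are left alone.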


### Step 5: Prepare the text data
Load `Fake_Pretraining_Texts.csv`.

In [58]:
import pandas as pd
fake_texts = pd.read_csv("test_data/data/Fake_Pretraining_Texts.csv")
raw_dataset = fake_texts["Raw Text"]
print(raw_dataset)

0    Hello! Contact us at support@data.org or call ...
1    Hola! Este artículo está completamente en espa...
2    <html><body><div><h1>Breaking News</h1><p>This...
3    Buy now! Best product ever. Best product ever....
4    Python 3.14 introduces several improvements in...
5    Python 3.14 introduces several improvements in...
6    <div>For inquiries, email jane_doe@example.com...
7    Large Language Models are transforming the AI ...
8                  这是一个包含有用技术信息的中文段落。电话号码：010-12345678
9    Buy now! Best product ever. Best product ever....
Name: Raw Text, dtype: object


### Step 6: Apply the Cleaning Pipeline

In [59]:
# Step 1: Remove HTML + Language Filter
step1 = clean_html_and_filter_lang(raw_dataset)
display(step1)

['Hello! Contact us at support@data.org or call 123-456-7890. Your credit card 4111111111111111 was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official docs.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.',
 'Buy now! Best product ever. Best product ever. Best product ever.']
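
Note the fused text in `'Breaking NewsThis is a major event!Contact us'`: calling `get_text()` with no separator concatenates adjacent text nodes directly. BeautifulSoup's `get_text()` accepts a `separator` argument, as used in the `soup.get_text(separator=' ', strip=True)` call earlier. The standard-library sketch below illustrates the difference; `TextCollector` and `html_to_text` are hypothetical names, and the HTML fragment is a guess at the shape of the truncated "Breaking News" row, not the actual CSV content.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html, separator):
    """Join the text nodes of `html` with the given separator."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return separator.join(collector.chunks)

# Hypothetical fragment shaped like the "Breaking News" row above.
fragment = (
    "<html><body><div><h1>Breaking News</h1>"
    "<p>This is a major event!</p><a href='#'>Contact us</a></div></body></html>"
)

print(html_to_text(fragment, separator=""))   # text nodes fused together
print(html_to_text(fragment, separator=" "))  # word boundaries preserved
```

Preserving word boundaries matters downstream, since both the MinHash step and the n-gram step split text on whitespace.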

In [62]:
# Step 2: Deduplicate Paragraphs
step2 = minhash_deduplication(step1)
display(step2)


['Hello! Contact us at support@data.org or call 123-456-7890. Your credit card 4111111111111111 was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [63]:
# Step 3: Strip PII
step3 = [strip_pii(t) for t in step2]
display(step3)

['Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever. Best product ever. Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [64]:
# Step 4: Remove Repetitive N-grams
cleaned_data = [remove_repetitive_ngrams(t) for t in step3]
display(cleaned_data)

['Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.',
 'Breaking NewsThis is a major event!Contact us',
 'Buy now! Best product ever.',
 'Python 3.14 introduces several improvements including better error messages. Learn more on the official site.',
 'Large Language Models are transforming the AI landscape with few-shot capabilities.']

In [65]:
# Done!
print("✅ Cleaned dataset sample:")
for idx, text in enumerate(cleaned_data):
    print(f"--- Article {idx + 1} ---")
    print(text)


✅ Cleaned dataset sample:
--- Article 1 ---
Hello! Contact us at [EMAIL] or call [PHONE]. Your credit card [CREDIT_CARD] was declined. This message is intended only for the recipient. Visit our site for more.
--- Article 2 ---
Breaking NewsThis is a major event!Contact us
--- Article 3 ---
Buy now! Best product ever.
--- Article 4 ---
Python 3.14 introduces several improvements including better error messages. Learn more on the official site.
--- Article 5 ---
Large Language Models are transforming the AI landscape with few-shot capabilities.
