# Dataset Inspection Notebook

This notebook inspects the train and test splits of a dataset saved to disk with the Hugging Face `datasets` library.
It loads each split from disk and prints a few examples from it.


In [11]:
from pathlib import Path
from datasets import load_from_disk


In [17]:
# Define paths relative to the project root.
# Path().resolve() is the current working directory, so this assumes the notebook is run from the "notebooks" folder.
PROJECT_ROOT = Path().resolve().parent  # Parent directory of "notebooks"
TRAIN_SPLIT_DIR = PROJECT_ROOT / "data/split/train"
TEST_SPLIT_DIR = PROJECT_ROOT / "data/split/test"
# Check if the train and test split directories exist
if not TRAIN_SPLIT_DIR.exists():
    print(f"Train split not found at {TRAIN_SPLIT_DIR}")
if not TEST_SPLIT_DIR.exists():
    print(f"Test split not found at {TEST_SPLIT_DIR}")
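
The check above only prints a message, so later cells still run when a directory is missing. A minimal sketch of a stricter check follows, assuming a hypothetical helper named `find_missing_splits` that is not part of the original notebook:

```python
from pathlib import Path


def find_missing_splits(split_dirs):
    """Return the names of splits whose directory does not exist.

    `split_dirs` maps a split name (for example "train") to its path.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    return [name for name, path in split_dirs.items() if not Path(path).is_dir()]
```

In this notebook it could be called as `find_missing_splits({"train": TRAIN_SPLIT_DIR, "test": TEST_SPLIT_DIR})`, raising `FileNotFoundError` when the returned list is not empty so that execution stops before any loading is attempted.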


In [19]:
def load_and_inspect(split_dir, num_examples=5):
    """
    Load and display examples from a dataset split.

    Args:
        split_dir (Path): Path to the dataset split directory.
        num_examples (int): Number of examples to display.
    """
    print(f"Loading dataset from: {split_dir}")
    dataset = load_from_disk(split_dir)

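    num_shown = min(num_examples, len(dataset))
    print(f"Dataset contains {len(dataset)} examples.")
    print(f"Displaying {num_shown} examples:")
    # Slice rows first so only the requested rows are read, not the whole "text" column
    for i, example in enumerate(dataset[:num_examples]["text"]):
        print(f"Example {i + 1}:\n{example}")
        print("=" * 50)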
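
Examples in this dataset can span several paragraphs, which makes the printed output hard to scan. A minimal sketch of a truncation helper follows, assuming a hypothetical function named `preview` and an arbitrary default of 200 characters, neither of which appears in the original notebook:

```python
def preview(text, max_chars=200):
    """Return `text` unchanged if it fits, else its first `max_chars` characters plus "...".

    Hypothetical helper for illustration; the 200-character default is an arbitrary choice.
    """
    if len(text) <= max_chars:
        return text
    # Drop trailing whitespace left by the cut before appending the ellipsis
    return text[:max_chars].rstrip() + "..."
```

Inside the loop of `load_and_inspect`, printing `preview(example)` instead of `example` would keep each displayed example to a few lines.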


In [20]:
# Inspect the train split
print("Inspecting Train Split:")
if TRAIN_SPLIT_DIR.exists():
    load_and_inspect(TRAIN_SPLIT_DIR, num_examples=5)
else:
    print("Train split not found.")


Inspecting Train Split:
Loading dataset from: C:\Users\prite\Desktop\custom-bert-pretraining\data\split\train
Dataset contains 637416 examples.
Displaying 5 examples:
Example 1:
Follow Jeff Add to circle
Rolls-Royce is embracing its dark side. The British automaker is on a quest to expand its bespoke offerings, and the Black Badge cars are part of this journey. Now there's a handful of new members in the group and you can refer to them as the Adamas Collection.
There will be 40 Wraiths and 30 Dawns that receive the Adamas treatment. What that entails is a whole lot of darkness, and the results are stunning. Right away you'll find a Spirit of Ecstasy that's not the shining example of luxury you expect to find. Instead, the lovely lady has been machined from carbon fiber. This is the first time that Rolls has ever crafted its iconic leading woman in the light yet strong material. She's comprised of 294 layers of aerospace grade carbon fiber and it takes 68 hours for her form to take shap

In [21]:
# Inspect the test split
print("\nInspecting Test Split:")
if TEST_SPLIT_DIR.exists():
    load_and_inspect(TEST_SPLIT_DIR, num_examples=5)
else:
    print("Test split not found.")



Inspecting Test Split:
Loading dataset from: C:\Users\prite\Desktop\custom-bert-pretraining\data\split\test
Dataset contains 70825 examples.
Displaying 5 examples:
Example 1:
The Masontown Water Works have issued an emergency announcement for part of Preston County.
The Lumberport/Shinnston Gas Company hit a 10" main water line that services the area East of Elkins Ave.
This main line affects all customers from Elkins Ave., along Veterans Memorial Hwy to and including the Bretz, McKinney Cave Road and Giuliani Road area will be without water.
Officials are currently working to fix the leak. There is no estimated time for the line to be fixed due to the size of the line.
Example 2:
April 12 Henan Province Xixia Auto-Pump Co Ltd :
* Sees H1 FY 2017 net profit to increase by 50 percent to 90 percent, or to be 77.7 million yuan to 98.5 million yuan
* Says H1 FY 2016 net profit was 51.8 million yuan
* Says increased ratio of high-end products and decreasing investment scale as main reasons