<a href="https://colab.research.google.com/github/nikolina-p/NLP-with-Transformers/blob/main/Making_my_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Gutenberg preprocessing**
Original dataset: manu/project_gutenberg

It took ~15 min in a Colab CPU high-RAM environment to download the manu/project_gutenberg dataset, clean it, and write 39 parquet files.

It took ~5 min to upload the parquet files to the Hub.


---
It took 40 min to download the manu/project_gutenberg dataset, clean it (with 2 additional cleaning functions), tokenize it, and save it into 39 parquet files.

The tokenized dataset is 15 GB.

Uploading takes about 5-10 min.


- Total number of tokens (from the main notebook):

100%|██████████| 38026/38026 [23:34<00:00, 26.88it/s]
Total number of tokens: 3638561697
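
As a rough sanity check on the figures above, the shard count and the average book length follow from simple arithmetic. This sketch assumes that the 38026 items in the progress bar are the deduplicated books and that each shard holds 1000 books (the `shard_size` set in the cleaning cell of this notebook); the variable names are illustrative.

```python
import math

total_tokens = 3_638_561_697   # reported by the main notebook
num_books = 38_026             # assumption: items counted by the progress bar
shard_size = 1_000             # books per parquet shard, as in the cleaning cell

# The last shard may be partial, so round up
num_shards = math.ceil(num_books / shard_size)
avg_tokens_per_book = total_tokens / num_books

print(f"shards: {num_shards}")
print(f"average tokens per book: {avg_tokens_per_book:,.0f}")
```

Under these assumptions the shard count comes out to 39, which matches the number of parquet files noted above.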


In [None]:
%%capture
!pip install --upgrade huggingface_hub

In [None]:
#@title Huggingface Hub login
from huggingface_hub import notebook_login

notebook_login()


In [None]:
#@title Upload parquet dataset to Huggingface Hub
from huggingface_hub import HfApi

hf_repo_name = "nikolina-p/gutenberg_clean_en"
local_dir = "deduplicated_dataset_by_id"

def upload_to_hf(repo_name, local_dir):
    # Create the dataset repo on the Hub (no-op if it already exists)
    api = HfApi()
    api.create_repo(repo_name, repo_type="dataset", exist_ok=True)

    # Upload the local parquet files into the repo's "data" folder in one commit.
    # upload_folder replaces the clone/copy/push flow of the deprecated Repository class.
    api.upload_folder(
        folder_path=local_dir,
        path_in_repo="data",
        repo_id=repo_name,
        repo_type="dataset",
        commit_message="Initial upload of dataset",
    )
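
Before uploading, it can be worth confirming that the local directory holds a gap-free run of shard files. The helper below is a hypothetical sketch that assumes the `shard-000.parquet`, `shard-001.parquet`, ... naming used by the cleaning cell in this notebook; `missing_shard_indices` is an illustrative name and is not part of `huggingface_hub`.

```python
import os
import re
import tempfile


def missing_shard_indices(directory):
    """Return shard indices absent from the range 0..max index found in `directory`."""
    pattern = re.compile(r"shard-(\d+)\.parquet$")
    found = set()
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [i for i in range(max(found) + 1) if i not in found]


# Demo on a temporary directory with empty placeholder files (shard 1 is missing)
with tempfile.TemporaryDirectory() as tmp:
    for idx in (0, 2, 3):
        open(os.path.join(tmp, f"shard-{idx:03d}.parquet"), "w").close()
    gaps = missing_shard_indices(tmp)

print(gaps)
```

An empty result means every index up to the highest shard is present; it does not check that the files themselves are readable.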


## **Gutenberg: clean, tokenized, splits**

In [None]:
import os
from datasets import load_dataset
import pyarrow as pa
import pyarrow.parquet as pq
from tqdm import tqdm

In [None]:
#@title text cleaning functions
import re

def strip_gutenberg_header_footer(text):
    # remove header before ***START ...*** or *** START ... ***
    start_match = re.search(r"\*{3}\s*START[^*]*\*{3}", text, re.IGNORECASE)
    if start_match:
        text = text[start_match.end():]

    # remove footer after ***END ...*** or *** END ...***
    end_match = re.search(r"\*{3}\s*END[^*]*\*{3}", text, re.IGNORECASE)
    if end_match:
        text = text[:end_match.start()]
    return text.strip()

def clean_newlines(text):
    # replace 3 or more newlines with 2 newlines (paragraph break)
    return re.sub(r'\n{3,}', '\n\n', text)

def replace_single_newlines(text):
    # replace exactly one newline (\n) with a single space
    return re.sub(r'(?<!\n)\n(?!\n)', ' ', text)

def clean_spaces(text):
    # replace 2 or more spaces with a single space
    return re.sub(r' {2,}', ' ', text)

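def preprocess(text):
    text = strip_gutenberg_header_footer(text)
    text = clean_newlines(text)
    # join wrapped lines first, then collapse spaces, so that a trailing space
    # before a line break does not leave a double space behind
    text = replace_single_newlines(text)
    text = clean_spaces(text)
    return text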
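
To see what the two less obvious regular expressions above do, here is a small self-contained check on a made-up snippet. The sample text and variable names are invented for illustration; the patterns are copied from the functions above.

```python
import re

# Made-up sample mimicking the Project Gutenberg layout (not taken from the dataset)
sample = (
    "Title page and license notes\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "First line of a paragraph\n"
    "wrapped onto a second line.\n"
    "\n"
    "Second paragraph.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text"
)

start_pattern = r"\*{3}\s*START[^*]*\*{3}"
end_pattern = r"\*{3}\s*END[^*]*\*{3}"

# Drop everything up to and including the START marker, then everything from the END marker on
body = sample[re.search(start_pattern, sample, re.IGNORECASE).end():]
body = body[:re.search(end_pattern, body, re.IGNORECASE).start()].strip()

# A newline with no newline on either side is a line wrap; a blank line is left alone
joined = re.sub(r"(?<!\n)\n(?!\n)", " ", body)

print(repr(joined))
```

The wrapped line is joined with a space while the paragraph break (`\n\n`) survives, because the lookbehind and lookahead reject any newline that sits next to another newline.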

In [None]:
#@title Download dataset, apply cleaning and save parquet files

import tiktoken

ds_gutenberg = load_dataset("manu/project_gutenberg", split="en", streaming=True)

seen_ids = set()
buffer = []
shard_size = 1000
shard_idx = 0
counter = 0
total = 0

output_dir = "cleaned_dataset"
os.makedirs(output_dir, exist_ok=True)

tokenizer = tiktoken.get_encoding("gpt2")
try:
    for example in tqdm(ds_gutenberg):
        total += 1
        row_id = example["id"]
        if row_id not in seen_ids:
            seen_ids.add(row_id)

            # inform about extra large books
            no_char = len(example["text"])
            if no_char >= 1_500_000:
                print(f"Book {total}, id {example['id']}, length {no_char} characters.")

            example["text"] = preprocess(example["text"])
            example["tokenized"] = tokenizer.encode(example["text"], allowed_special={"<|endoftext|>"})
            buffer.append(example)

            counter += 1

            if len(buffer) >= shard_size:
                table = pa.Table.from_pylist(buffer)
                pq.write_table(table, os.path.join(output_dir, f"shard-{shard_idx:03d}.parquet"))
                buffer = []
                shard_idx += 1
except KeyboardInterrupt:
    print("Keyboard interrupted")

# Write the remaining buffer as the final (possibly partial) shard
if buffer:
    table = pa.Table.from_pylist(buffer)
    pq.write_table(table, os.path.join(output_dir, f"shard-{shard_idx:03d}.parquet"))
    print("END OF DATASET")

print(f"Total examples {total}")
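
The deduplicate-then-shard pattern used in the loop above can be isolated from the download and parquet code. The sketch below is a simplified stand-in, not the notebook's code: `iter_unique_shards` is a hypothetical helper, and it yields plain Python lists instead of writing parquet files.

```python
from typing import Dict, Iterable, Iterator, List


def iter_unique_shards(examples: Iterable[Dict], shard_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `shard_size` examples, skipping repeated ids."""
    seen_ids = set()
    buffer = []
    for example in examples:
        if example["id"] in seen_ids:
            continue  # duplicate id: keep only the first occurrence
        seen_ids.add(example["id"])
        buffer.append(example)
        if len(buffer) >= shard_size:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # final, possibly partial shard


# Toy input with one repeated id
rows = [{"id": i} for i in ["a", "b", "a", "c", "d", "e"]]
shards = list(iter_unique_shards(rows, shard_size=2))
print([[row["id"] for row in shard] for shard in shards])
```

Keeping only one shard's worth of examples in memory at a time is what lets the loop run over a streaming dataset; the set of seen ids is the only state that grows with the number of books.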



Resolving data files:   0%|          | 0/52 [00:00<?, ?it/s]

67it [00:04, 35.20it/s]

Book 70, id 56673-0, length 1650422 characters.


79it [00:05, 20.89it/s]

Book 81, id 65236-0, length 1501473 characters.


182it [00:08, 32.19it/s]

Book 184, id 43990-8, length 1563602 characters.


336it [00:13, 22.49it/s]

Book 339, id 32699-8, length 1649986 characters.


400it [00:15, 34.98it/s]

Book 405, id 61937-0, length 1812255 characters.


450it [00:17, 25.56it/s]

Book 450, id 44437-8, length 1550665 characters.


461it [00:17, 30.04it/s]

Book 464, id 45737-0, length 2192673 characters.


510it [00:20, 20.66it/s]

Book 515, id 47263-8, length 1647517 characters.


543it [00:21, 36.54it/s]

Book 546, id 38194-8, length 2500505 characters.


547it [00:21, 21.41it/s]

Book 548, id 55195-0, length 2305678 characters.


592it [00:23, 21.67it/s]

Book 596, id 6603-8, length 1638310 characters.


936it [00:36, 36.82it/s]

Book 938, id 59553-8, length 5087364 characters.


955it [00:37, 20.10it/s]

Book 962, id 55059-0, length 4400386 characters.


1045it [00:41, 22.18it/s]

Book 1047, id 53571-0, length 2750982 characters.


1151it [00:45, 32.82it/s]

Book 1158, id 45368-0, length 2785138 characters.


1200it [00:49, 13.69it/s]

Book 1204, id 967-0, length 1867837 characters.


1213it [00:50, 14.66it/s]

Book 1213, id 58834-0, length 1524713 characters.


1245it [00:52, 14.71it/s]

Book 1248, id 57241-0, length 2894611 characters.


1259it [00:53, 21.70it/s]

Book 1260, id 53625-0, length 1783705 characters.


1290it [00:54, 32.72it/s]

Book 1291, id 54225-0, length 1556356 characters.


1294it [00:54, 21.93it/s]

Book 1297, id 36722-8, length 1671487 characters.


1678it [01:18, 29.63it/s]

Book 1686, id 32938-8, length 1821692 characters.


1772it [01:22, 25.12it/s]

Book 1777, id 24869-8, length 2295380 characters.


1815it [01:24, 21.44it/s]

Book 1817, id 42410-0, length 1722796 characters.


1867it [01:26, 25.43it/s]

Book 1870, id 19831-8, length 1812631 characters.


1870it [01:26, 20.06it/s]

Book 1872, id 57111-0, length 1709021 characters.


1898it [01:27, 28.71it/s]

Book 1901, id 20587-8, length 2219540 characters.


1993it [01:30, 24.68it/s]

Book 1998, id 53275-0, length 3417917 characters.


2106it [01:35, 27.90it/s]

Book 2107, id 61626-0, length 1836830 characters.


2189it [01:39, 28.05it/s]

Book 2194, id 60692-0, length 1843173 characters.


2253it [01:41, 32.77it/s]

Book 2254, id 35120-8, length 1855336 characters.


2378it [01:47, 13.92it/s]

Book 2381, id 58529-0, length 5139511 characters.


2596it [01:54, 36.58it/s]

Book 2598, id 37683-8, length 2249026 characters.


2818it [02:00, 34.36it/s]

Book 2819, id 1438-0, length 1508263 characters.


2851it [02:02, 27.72it/s]

Book 2852, id 18879-8, length 1642862 characters.


3068it [02:09, 34.71it/s]

Book 3069, id 58216-0, length 2243380 characters.


3349it [02:30, 30.18it/s]

Book 3351, id 55330-0, length 1635065 characters.


3435it [02:33, 26.42it/s]

Book 3436, id 38895-8, length 1564628 characters.


3454it [02:33, 29.53it/s]

Book 3460, id 42719-8, length 1721731 characters.


3674it [02:43, 31.44it/s]

Book 3679, id 7297-8, length 1654409 characters.


3816it [02:49, 21.45it/s]

Book 3818, id 28556-8, length 3782570 characters.


3844it [02:50, 30.51it/s]

Book 3848, id 41983-8, length 2550449 characters.


3924it [02:53, 23.53it/s]

Book 3926, id 56955-0, length 2875497 characters.


4078it [02:58, 32.63it/s]

Book 4081, id 19217-8, length 4886750 characters.


4184it [03:02, 27.41it/s]

Book 4185, id 15191-8, length 2300148 characters.


4191it [03:03, 21.12it/s]

Book 4193, id 40619-8, length 2076450 characters.


4282it [03:05, 44.96it/s]

Book 4289, id 56190-0, length 2245011 characters.


4379it [03:09, 30.53it/s]

Book 4384, id 20484-0, length 2065624 characters.


4405it [03:10, 39.68it/s]

Book 4409, id 7671-0, length 1896688 characters.


4468it [03:12, 24.83it/s]

Book 4470, id 47692-8, length 1768856 characters.


4649it [03:18, 30.16it/s]

Book 4653, id 49266-0, length 1669998 characters.


4925it [03:40, 25.42it/s]

Book 4931, id 54210-8, length 2084519 characters.


4962it [03:41, 28.99it/s]

Book 4965, id 55870-0, length 2741586 characters.


5010it [03:43, 22.49it/s]

Book 5015, id 61649-0, length 1988043 characters.


5025it [03:44, 22.00it/s]

Book 5032, id 7521-8, length 2065141 characters.


5202it [03:50, 31.39it/s]

Book 5204, id 41766-8, length 1744101 characters.


5315it [03:55, 23.30it/s]

Book 5315, id 22483-8, length 1565220 characters.


5425it [03:58, 24.23it/s]

Book 5430, id 45259-0, length 1620317 characters.


5430it [03:58, 21.57it/s]

Book 5431, id 57868-0, length 2013062 characters.


5459it [04:00, 28.08it/s]

Book 5461, id 56634-0, length 1598504 characters.


5493it [04:01, 22.64it/s]

Book 5494, id 9670-8, length 1570642 characters.


5735it [04:09, 34.26it/s]

Book 5739, id 41957-8, length 4938081 characters.


5754it [04:10, 25.08it/s]

Book 5757, id 53884-0, length 6165567 characters.


5766it [04:12, 15.35it/s]

Book 5773, id 39580-0, length 1962341 characters.


5800it [04:13, 27.13it/s]

Book 5804, id 58280-0, length 1762682 characters.


5816it [04:13, 28.95it/s]

Book 5820, id 31412-8, length 2364945 characters.


5913it [04:20,  7.31it/s]

Book 5913, id 21500-8, length 1598134 characters.


5979it [04:22, 36.60it/s]

Book 5982, id 34736-0, length 1689257 characters.


6057it [04:25, 25.96it/s]

Book 6059, id 66116-0, length 1554719 characters.


6178it [04:29, 32.08it/s]

Book 6182, id 26849-0, length 1703776 characters.


6325it [04:34, 27.07it/s]

Book 6326, id 23000-8, length 1684593 characters.


6344it [04:35, 30.76it/s]

Book 6345, id 57119-8, length 1897266 characters.


6596it [04:54, 29.57it/s]

Book 6603, id 32600-8, length 1982228 characters.


6659it [04:56, 32.42it/s]

Book 6660, id 49710-0, length 1661850 characters.


6666it [04:57, 18.09it/s]

Book 6669, id 57439-8, length 3013641 characters.


6676it [04:58, 15.88it/s]

Book 6680, id 18997-8, length 2082235 characters.


6893it [05:04, 28.46it/s]

Book 6895, id 51649-0, length 2555168 characters.


7209it [05:17, 30.71it/s]

Book 7213, id 9334-8, length 1871629 characters.


7308it [05:21, 32.54it/s]

Book 7315, id 50657-0, length 1532074 characters.


7444it [05:26, 32.22it/s]

Book 7448, id 29084-8, length 1850683 characters.


7890it [05:40, 36.84it/s]

Book 7892, id 30708-8, length 1818187 characters.


7929it [05:42, 25.33it/s]

Book 7939, id 11934-8, length 1925396 characters.


7945it [05:43, 25.49it/s]

Book 7945, id 64943-0, length 1604882 characters.


8063it [05:56, 27.41it/s]

Book 8065, id 56631-8, length 2965461 characters.


8111it [05:58, 29.72it/s]

Book 8116, id 19447-0, length 2638070 characters.


8254it [06:03, 23.16it/s]

Book 8256, id 44608-0, length 2638423 characters.


8286it [06:07, 16.60it/s]

Book 8288, id 46804-8, length 1979036 characters.


8512it [06:14, 27.32it/s]

Book 8515, id 47312-8, length 2521891 characters.


8526it [06:15, 24.32it/s]

Book 8528, id 44525-8, length 2976231 characters.


8529it [06:16, 14.86it/s]

Book 8530, id 63912-0, length 3257193 characters.


8532it [06:16, 11.64it/s]

Book 8535, id 669-0, length 3120374 characters.


8582it [06:18, 36.63it/s]

Book 8586, id 4200-0, length 6658807 characters.


8620it [06:21, 20.09it/s]

Book 8622, id 42770-0, length 2250674 characters.


8659it [06:22, 29.14it/s]

Book 8661, id 48480-0, length 1524044 characters.


8711it [06:24, 32.53it/s]

Book 8715, id 42506-8, length 1957699 characters.


8895it [06:31, 29.72it/s]

Book 8895, id 46016-0, length 1502974 characters.


9002it [06:34, 32.59it/s]

Book 9006, id 11615-8, length 6266628 characters.


9201it [06:41, 20.69it/s]

Book 9202, id 49352-8, length 2859404 characters.


9324it [06:46, 24.49it/s]

Book 9324, id 44202-8, length 1535255 characters.


9504it [06:54, 30.73it/s]

Book 9506, id 39733-8, length 4919134 characters.


9511it [06:55, 12.77it/s]

Book 9514, id 63127-0, length 4566178 characters.


9581it [07:09,  2.34it/s]

Book 9583, id 36984-8, length 1646703 characters.


9607it [07:10, 12.34it/s]

Book 9607, id 38607-8, length 1650256 characters.


9831it [07:17, 31.29it/s]

Book 9833, id 1301-0, length 2074757 characters.


9864it [07:19, 27.23it/s]

Book 9862, id 30745-8, length 1517606 characters.


10007it [07:23, 30.07it/s]

Book 10011, id 57159-0, length 2297231 characters.


10017it [07:24, 20.62it/s]

Book 10020, id 48661-0, length 5991627 characters.


10088it [07:27, 39.02it/s]

Book 10092, id 51291-0, length 1775953 characters.


10193it [07:32, 16.58it/s]

Book 10194, id 51789-0, length 2778535 characters.


10307it [07:36, 28.29it/s]

Book 10308, id 29092-8, length 1608220 characters.


10336it [07:37, 28.11it/s]

Book 10338, id 45085-8, length 1540163 characters.


10408it [07:39, 39.89it/s]

Book 10411, id 39308-8, length 1958034 characters.


10428it [07:40, 28.04it/s]

Book 10427, id 60758-8, length 1585109 characters.


10451it [07:41, 20.95it/s]

Book 10452, id 46509-8, length 1505244 characters.


10548it [07:44, 39.22it/s]

Book 10550, id 48889-0, length 2955724 characters.


10705it [07:53, 18.27it/s]

Book 10704, id 32510-8, length 1549512 characters.


10771it [07:55, 29.01it/s]

Book 10774, id 51489-0, length 1698404 characters.


11037it [08:04, 29.96it/s]

Book 11037, id 56600-0, length 1559609 characters.


11051it [08:04, 37.02it/s]

Book 11052, id 58268-8, length 1546136 characters.


11112it [08:06, 27.14it/s]

Book 11114, id 12606-8, length 3000265 characters.


11341it [08:24, 33.53it/s]

Book 11343, id 28148-8, length 2244631 characters.


11418it [08:27, 25.16it/s]

Book 11419, id 29363-0, length 1894080 characters.


11694it [08:35, 38.55it/s]

Book 11697, id 33224-8, length 2599244 characters.


11699it [08:35, 25.09it/s]

Book 11700, id 58031-0, length 2569679 characters.


11735it [08:37, 27.33it/s]

Book 11743, id 1662-8, length 3005270 characters.


11844it [08:43, 21.73it/s]

Book 11847, id 51104-0, length 1551325 characters.


12194it [08:56, 27.60it/s]

Book 12194, id 12565-8, length 1505337 characters.


12205it [08:56, 27.95it/s]

Book 12205, id 46998-8, length 1520052 characters.


12224it [08:57, 16.78it/s]

Book 12230, id 50710-8, length 2523211 characters.


12332it [09:01, 32.84it/s]

Book 12335, id 20450-8, length 2758557 characters.


12352it [09:02, 22.08it/s]

Book 12354, id 3913-0, length 1561553 characters.


12402it [09:04, 37.23it/s]

Book 12406, id 59837-0, length 2743698 characters.


12657it [09:14, 20.45it/s]

Book 12659, id 61502-0, length 2643957 characters.


13066it [09:42, 26.00it/s]

Book 13068, id 43111-8, length 1970668 characters.


13132it [09:44, 26.13it/s]

Book 13132, id 8737-8, length 1645374 characters.


13156it [09:45, 28.70it/s]

Book 13158, id 56644-0, length 1934508 characters.


13189it [09:46, 37.85it/s]

Book 13191, id 22400-8, length 1554024 characters.


13301it [09:50, 32.93it/s]

Book 13303, id 15263-8, length 2407254 characters.


13312it [09:51, 20.43it/s]

Book 13317, id 54554-0, length 1795276 characters.


13355it [09:53, 17.30it/s]

Book 13357, id 19218-0, length 7111867 characters.


13464it [09:57, 22.99it/s]

Book 13464, id 36031-8, length 1557221 characters.
Book 13470, id 46752-0, length 1548685 characters.


13541it [10:00, 40.48it/s]

Book 13547, id 43032-8, length 3292440 characters.


13604it [10:03, 30.51it/s]

Book 13605, id 51294-0, length 2473439 characters.


13656it [10:05, 25.79it/s]

Book 13658, id 44318-0, length 1687648 characters.


13665it [10:06, 22.62it/s]

Book 13668, id 47289-8, length 4615433 characters.


13691it [10:07, 25.77it/s]

Book 13693, id 51960-0, length 1889132 characters.


13844it [10:13, 24.04it/s]

Book 13846, id 48105-0, length 3931259 characters.


13928it [10:16, 39.63it/s]

Book 13932, id 39597-0, length 1784596 characters.


13969it [10:17, 26.61it/s]

Book 13970, id 18637-8, length 3727572 characters.


14034it [10:20, 30.17it/s]

Book 14039, id 60557-0, length 1518291 characters.


14118it [10:22, 39.75it/s]

Book 14119, id 51514-0, length 2200393 characters.


14205it [10:28, 26.03it/s]

Book 14208, id 736-0, length 1568332 characters.


14234it [10:29, 33.73it/s]

Book 14238, id 56162-8, length 1813807 characters.


14389it [10:34, 40.33it/s]

Book 14391, id 59386-0, length 1635447 characters.


14414it [10:46,  1.91it/s]

Book 14416, id 24518-0, length 1656683 characters.


14454it [10:47, 18.14it/s]

Book 14456, id 13606-8, length 1642724 characters.


14485it [10:49, 20.71it/s]

Book 14484, id 2158-8, length 1534294 characters.


14626it [10:54, 34.11it/s]

Book 14629, id 60440-0, length 3153856 characters.


14633it [10:54, 19.97it/s]

Book 14637, id 44007-8, length 2070388 characters.


14682it [10:56, 30.24it/s]

Book 14684, id 8187-8, length 2127676 characters.


14773it [10:59, 36.48it/s]

Book 14778, id 44008-8, length 1715003 characters.


14812it [11:00, 35.10it/s]

Book 14816, id 36298-8, length 1712858 characters.


14870it [11:02, 40.26it/s]

Book 14876, id 58669-0, length 1871469 characters.


14978it [11:06, 30.14it/s]

Book 14979, id 15475-0, length 3961795 characters.


15021it [11:08, 21.83it/s]

Book 15021, id 40540-8, length 1595441 characters.


15072it [11:10, 19.62it/s]

Book 15072, id 16050-8, length 1662804 characters.


15214it [11:14, 30.08it/s]

Book 15220, id 52811-0, length 1988533 characters.


15294it [11:17, 35.46it/s]

Book 15296, id 27889-8, length 3123046 characters.


15355it [11:23,  9.17it/s]

Book 15358, id 6478-0, length 1696308 characters.


15369it [11:23, 16.61it/s]

Book 15373, id 51634-0, length 1772457 characters.


15377it [11:24, 17.00it/s]

Book 15380, id 39136-8, length 1767706 characters.


15436it [11:26, 30.64it/s]

Book 15437, id 1200-0, length 1853051 characters.


15485it [11:29, 20.56it/s]

Book 15485, id 61809-0, length 1512355 characters.


15512it [11:30, 27.22it/s]

Book 15512, id 44010-8, length 1532630 characters.


15557it [11:31, 27.53it/s]

Book 15559, id 44562-8, length 6573855 characters.


15890it [11:45, 28.84it/s]

Book 15891, id 52160-8, length 1551922 characters.


15977it [11:48, 32.18it/s]

Book 15979, id 1366-0, length 1544821 characters.


15991it [11:49, 21.20it/s]

Book 15995, id 5231-8, length 1927820 characters.


16038it [12:01, 15.85it/s]

Book 16042, id 57630-0, length 4026047 characters.


16387it [12:14, 21.34it/s]

Book 16387, id 15863-8, length 1552969 characters.


16433it [12:16, 32.10it/s]

Book 16434, id 54824-0, length 1970343 characters.


16489it [12:17, 32.13it/s]

Book 16488, id 2613-8, length 1640839 characters.


16528it [12:22,  7.21it/s]

Book 16528, id 57055-0, length 1625009 characters.


16531it [12:22,  8.69it/s]

Book 16533, id 39225-8, length 4775045 characters.


16726it [12:29, 20.00it/s]

Book 16727, id 59075-0, length 1533615 characters.


16861it [12:34, 26.72it/s]

Book 16864, id 26134-8, length 1924446 characters.


17055it [12:40, 31.33it/s]

Book 17057, id 55455-0, length 1749809 characters.


17080it [12:41, 29.93it/s]

Book 17082, id 6483-8, length 1699586 characters.


17111it [12:42, 22.79it/s]

Book 17111, id 52654-0, length 1651672 characters.


17286it [12:48, 24.30it/s]

Book 17291, id 53264-0, length 1685160 characters.


17387it [12:51, 36.32it/s]

Book 17397, id 44104-8, length 1519605 characters.


17397it [12:51, 34.68it/s]

Book 17399, id 20065-8, length 1854550 characters.


17424it [12:52, 35.65it/s]

Book 17428, id 27829-0, length 1585010 characters.


17488it [12:55, 35.49it/s]

Book 17489, id 40223-8, length 1732470 characters.


17523it [12:56, 27.35it/s]

Book 17527, id 32902-8, length 2548871 characters.


17580it [12:58, 31.65it/s]

Book 17583, id 59468-0, length 3746590 characters.


17976it [13:24, 27.96it/s]

Book 17979, id 41070-8, length 2399871 characters.


18070it [13:27, 25.70it/s]

Book 18073, id 45313-8, length 4506369 characters.


18457it [13:42, 28.46it/s]

Book 18458, id 18500-8, length 2127554 characters.


18541it [13:45, 25.74it/s]

Book 18542, id 31087-8, length 1952425 characters.


18548it [13:46, 20.88it/s]

Book 18554, id 13615-8, length 1785404 characters.


18791it [13:54, 29.07it/s]

Book 18792, id 35508-8, length 1655594 characters.


18905it [14:02, 16.04it/s]

Book 18910, id 61249-0, length 1759395 characters.


18922it [14:02, 19.29it/s]

Book 18924, id 63274-0, length 1847636 characters.


19008it [14:05, 38.64it/s]

Book 19013, id 30051-8, length 2127241 characters.


19013it [14:06, 26.88it/s]

Book 19015, id 42315-8, length 1964594 characters.


19077it [14:08, 32.24it/s]

Book 19080, id 42709-0, length 2323166 characters.


19081it [14:08, 19.45it/s]

Book 19086, id 41047-8, length 1779576 characters.


19097it [14:09, 17.04it/s]

Book 19097, id 51161-0, length 1604539 characters.


19103it [14:10, 15.87it/s]

Book 19107, id 52045-0, length 2620842 characters.


19138it [14:12, 23.74it/s]

Book 19139, id 44125-0, length 1513352 characters.


19386it [14:30, 37.25it/s]

Book 19389, id 7852-8, length 1627242 characters.


19391it [14:31, 25.28it/s]

Book 19393, id 62023-0, length 1754953 characters.


19612it [14:38, 30.88it/s]

Book 19613, id 45283-8, length 2014475 characters.


19727it [14:42, 27.32it/s]

Book 19726, id 44495-8, length 1520975 characters.


19736it [14:42, 31.64it/s]

Book 19741, id 56200-0, length 4417571 characters.


19745it [14:44, 14.97it/s]

Book 19747, id 26000-0, length 1989031 characters.


19888it [14:49, 24.82it/s]

Book 19888, id 22762-8, length 1520295 characters.


20060it [14:54, 38.46it/s]

Book 20062, id 47746-8, length 2179520 characters.


20119it [14:59, 26.60it/s]

Book 20120, id 26706-8, length 1742181 characters.


20143it [15:00, 29.07it/s]

Book 20145, id 21851-8, length 1723023 characters.


20343it [15:06, 28.74it/s]

Book 20349, id 11275-8, length 5816510 characters.


20401it [15:09, 31.81it/s]

Book 20406, id 48244-0, length 2163932 characters.


20439it [15:10, 24.39it/s]

Book 20441, id 13310-0, length 1634368 characters.


20505it [15:12, 31.01it/s]

Book 20505, id 46808-8, length 1516922 characters.
Book 20510, id 57724-0, length 4007524 characters.


20656it [15:17, 31.76it/s]

Book 20662, id 28500-8, length 2867861 characters.


20738it [15:20, 31.53it/s]

Book 20741, id 44209-8, length 1692893 characters.


20766it [15:21, 38.17it/s]

Book 20768, id 14754-8, length 1590159 characters.


20782it [15:22, 23.60it/s]

Book 20783, id 65464-0, length 4576712 characters.


20827it [15:24, 19.60it/s]

Book 20831, id 32987-8, length 1776355 characters.


20835it [15:24, 21.02it/s]

Book 20837, id 247-0, length 1506901 characters.


20962it [15:40, 27.92it/s]

Book 20962, id 14517-8, length 1578722 characters.


21064it [15:43, 28.82it/s]

Book 21067, id 44969-8, length 2008044 characters.


21416it [15:57, 29.08it/s]

Book 21418, id 64037-0, length 2923967 characters.


21430it [15:58, 18.43it/s]

Book 21429, id 35169-8, length 1518822 characters.


21436it [15:58, 23.45it/s]

Book 21437, id 48451-0, length 2385354 characters.


21514it [16:01, 40.91it/s]

Book 21516, id 62704-0, length 1516903 characters.


21519it [16:01, 29.87it/s]

Book 21521, id 43869-0, length 2476150 characters.


21761it [16:10, 31.26it/s]

Book 21764, id 38390-0, length 2153729 characters.


21765it [16:11, 16.06it/s]

Book 21767, id 48539-0, length 1620646 characters.


21855it [16:14, 26.90it/s]

Book 21857, id 8419-8, length 3575875 characters.


21966it [16:17, 42.44it/s]

Book 21967, id 57634-0, length 1675554 characters.


22087it [16:21, 29.84it/s]

Book 22092, id 57343-0, length 2214570 characters.


22133it [16:23, 30.18it/s]

Book 22135, id 55736-8, length 2976680 characters.


22205it [16:26, 27.91it/s]

Book 22208, id 24238-8, length 3290914 characters.


22218it [16:27, 20.05it/s]

Book 22218, id 34350-8, length 1558091 characters.


22430it [16:36,  7.69it/s]

Book 22434, id 29090-0, length 2717922 characters.


22497it [16:48, 13.42it/s]

Book 22503, id 22036-8, length 3018680 characters.


23000it [17:04, 33.76it/s]

Book 23001, id 29233-8, length 9637237 characters.


23035it [17:07, 19.98it/s]

Book 23035, id 44280-8, length 1597841 characters.


23226it [17:12, 27.78it/s]

Book 23229, id 38015-8, length 1887389 characters.


23344it [17:16, 34.58it/s]

Book 23345, id 41710-0, length 1569171 characters.
Book 23346, id 15000-0, length 1960038 characters.


23352it [17:17, 23.07it/s]

Book 23356, id 44638-0, length 1778777 characters.


23372it [17:18, 22.38it/s]

Book 23383, id 43524-8, length 3077741 characters.


23508it [17:23, 17.05it/s]

Book 23508, id 55581-0, length 1550054 characters.


23572it [17:25, 26.84it/s]

Book 23577, id 22542-8, length 1792959 characters.


23601it [17:29,  6.45it/s]

Book 23606, id 58360-0, length 1665963 characters.


23771it [17:35, 28.64it/s]

Book 23773, id 23403-8, length 1748973 characters.


23810it [17:36, 32.95it/s]

Book 23817, id 55368-0, length 2892899 characters.


23841it [17:37, 29.44it/s]

Book 23844, id 59843-0, length 1627923 characters.


23917it [17:40, 27.89it/s]

Book 23921, id 60766-0, length 1810916 characters.


23938it [17:41, 32.16it/s]

Book 23939, id 55192-0, length 1929591 characters.


24014it [17:43, 34.79it/s]

Book 24016, id 11119-8, length 1639875 characters.


24068it [17:45, 29.55it/s]

Book 24067, id 36299-8, length 1668988 characters.


24347it [18:03, 32.82it/s]

Book 24348, id 17216-8, length 2082647 characters.


24430it [18:07, 24.70it/s]

Book 24432, id 58658-0, length 1973275 characters.


24451it [18:08, 29.18it/s]

Book 24455, id 44772-0, length 2255324 characters.


24595it [18:13, 25.36it/s]

Book 24599, id 3875-0, length 2406073 characters.


24628it [18:14, 41.23it/s]

Book 24632, id 52177-0, length 2266048 characters.


24655it [18:15, 29.51it/s]

Book 24659, id 62958-0, length 1885134 characters.


24783it [18:28,  1.98it/s]

Book 24788, id 54899-0, length 2706832 characters.


24793it [18:28,  4.32it/s]

Book 24799, id 34122-0, length 1692135 characters.


24806it [18:29,  9.09it/s]

Book 24807, id 55390-0, length 1895301 characters.


24898it [18:34, 25.08it/s]

Book 24903, id 56880-0, length 1526412 characters.


24946it [18:36, 29.77it/s]

Book 24950, id 44035-8, length 1885792 characters.


25062it [18:40, 37.77it/s]

Book 25064, id 59134-0, length 2381173 characters.


25090it [18:41, 30.76it/s]

Book 25094, id 29870-8, length 3160218 characters.


25274it [18:49, 24.29it/s]

Book 25277, id 2016-8, length 3374068 characters.


25477it [18:58, 23.04it/s]

Book 25478, id 54345-0, length 4455931 characters.


25547it [19:01, 20.90it/s]

Book 25551, id 52709-0, length 1705172 characters.


25618it [19:04, 37.31it/s]

Book 25620, id 42172-0, length 1808672 characters.


25622it [19:04, 26.38it/s]

Book 25628, id 55228-0, length 2260785 characters.


25636it [19:05, 18.31it/s]

Book 25635, id 31278-8, length 1542612 characters.


25660it [19:06, 21.69it/s]

Book 25666, id 968-0, length 1899703 characters.


25768it [19:23, 28.65it/s]

Book 25771, id 53750-0, length 5137932 characters.


25864it [19:27, 31.10it/s]

Book 25867, id 40068-8, length 3242496 characters.


25910it [19:29, 36.45it/s]

Book 25912, id 5140-8, length 1872933 characters.


26028it [19:39, 26.56it/s]

Book 26032, id 15474-0, length 3694015 characters.


26032it [19:40, 14.42it/s]

Book 26037, id 47757-8, length 2004198 characters.


26118it [19:42, 35.39it/s]

Book 26122, id 46446-0, length 1801156 characters.


26149it [19:43, 33.86it/s]

Book 26151, id 3623-8, length 2288584 characters.


26200it [19:45, 25.83it/s]

Book 26204, id 48448-0, length 2961028 characters.


26362it [19:51, 34.44it/s]

Book 26366, id 1365-0, length 1904655 characters.


26810it [20:06, 32.26it/s]

Book 26812, id 56386-0, length 2897168 characters.


26928it [20:10, 26.13it/s]

Book 26932, id 53277-0, length 3811892 characters.


26957it [20:12, 26.02it/s]

Book 26958, id 44005-8, length 2315247 characters.


26978it [20:13, 24.10it/s]

Book 26982, id 54977-0, length 2041469 characters.


27082it [20:16, 27.75it/s]

Book 27083, id 731-0, length 1786043 characters.


27153it [20:21,  9.71it/s]

Book 27157, id 43671-8, length 2518516 characters.


27314it [20:27, 30.09it/s]

Book 27315, id 55186-0, length 2632026 characters.


27388it [20:40, 26.67it/s]

Book 27390, id 56022-0, length 2870776 characters.


27515it [20:45, 32.61it/s]

Book 27516, id 51470-0, length 2225292 characters.


27534it [20:45, 29.89it/s]

Book 27537, id 46585-8, length 2068548 characters.


27768it [20:52, 34.38it/s]

Book 27768, id 43967-0, length 1526797 characters.


27811it [20:54, 37.34it/s]

Book 27812, id 57779-0, length 2869729 characters.


27871it [20:56, 24.26it/s]

Book 27872, id 30896-8, length 2160973 characters.


27874it [20:56, 16.98it/s]

Book 27881, id 57336-0, length 2255546 characters.


27945it [20:59, 33.68it/s]

Book 27946, id 57713-0, length 2534621 characters.


27954it [20:59, 22.78it/s]

Book 27957, id 55998-0, length 2891848 characters.


27984it [21:01, 22.77it/s]

Book 27986, id 41357-8, length 2103000 characters.


28114it [21:05, 33.47it/s]

Book 28116, id 54279-0, length 1612234 characters.


28173it [21:07, 33.02it/s]

Book 28175, id 39423-0, length 2496267 characters.


28177it [21:07, 17.44it/s]

Book 28183, id 40499-8, length 4724588 characters.


28205it [21:09, 19.19it/s]

Book 28205, id 39502-8, length 1546030 characters.


28216it [21:09, 25.16it/s]

Book 28220, id 31270-8, length 2334979 characters.


28405it [21:20, 30.77it/s]

Book 28407, id 43195-8, length 1678400 characters.


28559it [21:25, 19.46it/s]

Book 28558, id 44012-8, length 1590190 characters.


28638it [21:28, 23.08it/s]

Book 28637, id 33857-8, length 1597852 characters.


28657it [21:29, 21.06it/s]

Book 28656, id 8076-8, length 1606744 characters.


28694it [21:30, 31.19it/s]

Book 28697, id 22815-8, length 2001338 characters.


28729it [21:31, 40.40it/s]

Book 28731, id 5668-8, length 3002634 characters.


28765it [21:32, 34.62it/s]

Book 28771, id 44621-8, length 5497297 characters.


28890it [21:38, 34.06it/s]

Book 28891, id 27988-8, length 1690849 characters.


28939it [21:51,  2.70it/s]

Book 28941, id 580-0, length 1765553 characters.


29213it [22:00, 22.94it/s]

Book 29213, id 24898-8, length 1651228 characters.


29222it [22:01, 28.32it/s]

Book 29226, id 4084-0, length 1875324 characters.


29378it [22:06, 41.28it/s]

Book 29382, id 38699-8, length 1831397 characters.


29383it [22:06, 27.50it/s]

Book 29385, id 732-0, length 1966049 characters.


29541it [22:14, 27.89it/s]

Book 29543, id 34010-8, length 2006346 characters.


29573it [22:16, 29.39it/s]

Book 29576, id 34916-8, length 2022019 characters.


29716it [22:21, 39.79it/s]

Book 29720, id 52105-0, length 2080044 characters.


29836it [22:25, 28.92it/s]

Book 29835, id 8874-8, length 1543978 characters.


30038it [22:31, 44.84it/s]

Book 30039, id 48790-0, length 2905289 characters.


30050it [22:31, 26.67it/s]

Book 30053, id 41073-8, length 1714916 characters.


30161it [22:35, 36.98it/s]

Book 30165, id 44001-8, length 1762180 characters.


30170it [22:35, 27.73it/s]

Book 30176, id 46994-8, length 1984374 characters.


30184it [22:36, 25.27it/s]

Book 30189, id 51326-8, length 2410749 characters.


30680it [23:02, 36.78it/s]

Book 30682, id 59433-0, length 2817464 characters.


30923it [23:13, 35.66it/s]

Book 30925, id 39665-8, length 1509300 characters.


31393it [23:29, 35.59it/s]

Book 31395, id 46807-8, length 2034348 characters.


31531it [23:33, 29.30it/s]

Book 31537, id 62149-0, length 1769701 characters.


31710it [23:39, 27.21it/s]

Book 31712, id 40521-0, length 1535355 characters.


31800it [23:42, 28.36it/s]

Book 31803, id 54653-0, length 2012192 characters.


31826it [23:44, 22.05it/s]

Book 31827, id 25545-8, length 2226602 characters.


31958it [23:51, 34.53it/s]

Book 31963, id 47790-0, length 4127370 characters.


31990it [23:53, 20.30it/s]

Book 31989, id 43296-8, length 1608638 characters.


32041it [23:54, 40.76it/s]

Book 32044, id 42220-8, length 1864811 characters.


32081it [23:56, 26.81it/s]

Book 32084, id 49948-8, length 2989664 characters.


32294it [24:14, 24.27it/s]

Book 32294, id 53635-0, length 1563865 characters.


32346it [24:15, 32.36it/s]

Book 32353, id 733-0, length 1664329 characters.


32436it [24:18, 33.14it/s]

Book 32440, id 58767-0, length 3097027 characters.


32531it [24:22, 29.54it/s]

Book 32536, id 46595-8, length 2247730 characters.


32558it [24:23, 29.50it/s]

Book 32556, id 45654-8, length 1591077 characters.


32610it [24:24, 38.70it/s]

Book 32611, id 3221-8, length 1699394 characters.


32700it [24:27, 26.36it/s]

Book 32705, id 11273-8, length 2091866 characters.


33232it [24:50, 35.75it/s]

Book 33235, id 43211-8, length 1686154 characters.


33347it [24:54, 31.36it/s]

Book 33351, id 53935-8, length 3331447 characters.


33368it [24:56, 22.51it/s]

Book 33372, id 50721-0, length 2934239 characters.


33488it [25:00, 29.37it/s]

Book 33489, id 59364-0, length 2937643 characters.


33623it [25:05, 26.09it/s]

Book 33625, id 56441-0, length 1565097 characters.


33797it [25:22,  7.39it/s]

Book 33797, id 37285-8, length 1576396 characters.


33860it [25:24, 29.72it/s]

Book 33861, id 45116-0, length 2404330 characters.


33943it [25:27, 38.18it/s]

Book 33945, id 57383-8, length 6341436 characters.


33958it [25:29, 17.96it/s]

Book 33959, id 10800-0, length 3311784 characters.


34015it [25:31, 21.31it/s]

Book 34013, id 32423-8, length 1559049 characters.
Book 34016, id 29878-8, length 2263950 characters.


34031it [25:32, 23.98it/s]

Book 34034, id 54488-8, length 2407631 characters.


34116it [25:35, 31.84it/s]

Book 34120, id 44011-8, length 1750062 characters.


34339it [25:44, 37.42it/s]

Book 34349, id 44851-8, length 4266988 characters.


34415it [25:47, 39.63it/s]

Book 34417, id 32352-8, length 1542500 characters.
Book 34420, id 62250-0, length 2849843 characters.


34552it [25:51, 32.95it/s]

Book 34558, id 54578-0, length 1749347 characters.


34585it [25:52, 35.38it/s]

Book 34588, id 64677-0, length 1826696 characters.


34710it [25:56, 27.52it/s]

Book 34712, id 49843-0, length 2781337 characters.


34737it [25:57, 33.97it/s]

Book 34739, id 62891-0, length 1709262 characters.


34792it [25:59, 25.73it/s]

Book 34794, id 19699-8, length 1914883 characters.


34836it [26:01, 32.69it/s]

Book 34837, id 62972-0, length 1514358 characters.


35103it [26:10, 24.87it/s]

Book 35101, id 63183-0, length 1595265 characters.


35112it [26:10, 27.50it/s]

Book 35114, id 14300-8, length 2308254 characters.


35265it [26:16, 29.87it/s]

Book 35270, id 54811-0, length 1503344 characters.


35314it [26:17, 38.89it/s]

Book 35318, id 59506-0, length 1607098 characters.


35425it [26:35,  9.35it/s]

Book 35429, id 51491-0, length 4738241 characters.


35432it [26:36,  9.34it/s]

Book 35436, id 60786-0, length 1642217 characters.


35480it [26:38, 26.39it/s]

Book 35480, id 13097-8, length 1698625 characters.


35494it [26:38, 32.89it/s]

Book 35496, id 65090-0, length 1729212 characters.


35648it [26:43, 40.93it/s]

Book 35650, id 135-0, length 3250633 characters.


35684it [26:44, 36.84it/s]

Book 35689, id 13552-8, length 1543269 characters.


35737it [26:46, 34.84it/s]

Book 35738, id 45268-0, length 2086372 characters.


35791it [26:48, 27.40it/s]

Book 35793, id 63467-0, length 1689794 characters.


35815it [26:49, 21.77it/s]

Book 35817, id 58447-0, length 2950756 characters.


35821it [26:50, 14.23it/s]

Book 35821, id 24596-8, length 1634538 characters.


35839it [26:51, 18.88it/s]

Book 35839, id 60230-0, length 1508436 characters.


36002it [26:56, 36.91it/s]

Book 36007, id 42686-0, length 1735314 characters.


36259it [27:05, 23.30it/s]

Book 36260, id 31543-0, length 1761431 characters.


36443it [27:12, 30.84it/s]

Book 36443, id 21853-8, length 1508283 characters.


36533it [27:14, 24.67it/s]

Book 36533, id 64748-0, length 1724543 characters.


36684it [27:21, 25.91it/s]

Book 36688, id 50354-0, length 1623217 characters.


36766it [27:24, 38.29it/s]

Book 36770, id 53186-0, length 1882649 characters.


36841it [27:27, 37.58it/s]

Book 36843, id 48428-8, length 1782150 characters.


36892it [27:28, 30.43it/s]

Book 36900, id 40438-8, length 1562080 characters.


36972it [27:31, 29.43it/s]

Book 36974, id 47703-8, length 1590733 characters.


37122it [27:47, 35.35it/s]

Book 37126, id 25889-8, length 1787115 characters.


37133it [27:47, 30.22it/s]

Book 37137, id 44009-8, length 1717814 characters.


37288it [27:52, 23.12it/s]

Book 37290, id 59465-0, length 1859152 characters.


37359it [27:55, 31.76it/s]

Book 37362, id 40074-0, length 5449399 characters.


37432it [27:58, 34.49it/s]

Book 37433, id 52056-0, length 2504122 characters.


37492it [28:00, 33.18it/s]

Book 37493, id 21128-8, length 1731240 characters.


37503it [28:01, 28.26it/s]

Book 37505, id 31885-0, length 1760077 characters.


37839it [28:13, 19.76it/s]

Book 37842, id 4300-0, length 1539056 characters.


37906it [28:16, 29.55it/s]

Book 37910, id 28496-8, length 2571971 characters.


37984it [28:19, 30.17it/s]

Book 37991, id 54617-0, length 2864961 characters.


38023it [28:20, 24.88it/s]

Book 38023, id 42680-8, length 1573924 characters.


38147it [28:24, 25.95it/s]

Book 38148, id 46471-8, length 2098124 characters.


38157it [28:25, 19.58it/s]

Book 38159, id 46324-8, length 3127861 characters.


38364it [28:33, 26.52it/s]

Book 38369, id 53626-0, length 1536698 characters.


38403it [28:34, 19.59it/s]

Book 38402, id 34868-8, length 1532915 characters.


38454it [28:36, 34.31it/s]

Book 38458, id 37240-0, length 2461508 characters.


38498it [28:38, 28.58it/s]

Book 38499, id 59290-0, length 3931980 characters.


38550it [28:40, 46.00it/s]

Book 38553, id 32066-8, length 1552257 characters.


38775it [29:04, 23.21it/s]

Book 38776, id 54503-0, length 2036469 characters.


38824it [29:06, 26.71it/s]

Book 38830, id 57441-0, length 1791077 characters.


38948it [29:17,  4.22it/s]

Book 38952, id 25851-8, length 2626881 characters.


38964it [29:17, 10.48it/s]

Book 38968, id 883-0, length 1832162 characters.


38977it [29:18, 15.58it/s]

Book 38979, id 43016-8, length 1505822 characters.


39096it [29:22, 31.30it/s]

Book 39102, id 46853-8, length 2099965 characters.


39234it [29:27, 34.71it/s]

Book 39238, id 3610-0, length 1658746 characters.


39421it [29:34, 23.90it/s]

Book 39422, id 59469-0, length 3709568 characters.


39520it [29:38, 28.77it/s]

Book 39522, id 47476-8, length 2550026 characters.


39539it [29:39, 24.39it/s]

Book 39544, id 44555-0, length 2008734 characters.


39853it [29:48, 45.18it/s]

Book 39856, id 3361-0, length 1645832 characters.


40374it [30:19, 34.95it/s]

Book 40375, id 33294-8, length 1630857 characters.


40383it [30:19, 26.11it/s]

Book 40385, id 44315-8, length 1650613 characters.


40413it [30:20, 28.55it/s]

Book 40415, id 51032-0, length 2927642 characters.


40456it [30:22, 24.96it/s]

Book 40457, id 43880-8, length 1556465 characters.


40703it [30:30, 22.82it/s]

Book 40703, id 61652-0, length 1545327 characters.


40896it [30:37, 29.34it/s]

Book 40898, id 50503-0, length 3868413 characters.


41117it [30:45, 19.98it/s]

Book 41117, id 58857-0, length 1562107 characters.


41223it [30:48, 25.89it/s]

Book 41228, id 39537-8, length 2024526 characters.


41228it [30:49, 20.38it/s]

Book 41230, id 6593-8, length 1980513 characters.


41398it [30:57, 36.68it/s]

Book 41404, id 55841-8, length 2972911 characters.


41601it [31:05, 22.75it/s]

Book 41601, id 53353-8, length 1693486 characters.


41664it [31:07, 34.23it/s]

Book 41665, id 2333-0, length 1743436 characters.


41680it [31:07, 22.45it/s]

Book 41681, id 24365-8, length 1768659 characters.


41705it [31:08, 29.08it/s]

Book 41709, id 62231-0, length 3078292 characters.


41771it [31:11, 28.55it/s]

Book 41775, id 3332-0, length 2531859 characters.


41811it [31:13, 33.75it/s]

Book 41816, id 32736-8, length 2705577 characters.


41897it [31:28,  8.46it/s]

Book 41898, id 43123-0, length 1914736 characters.


42081it [31:34, 41.13it/s]

Book 42082, id 2334-0, length 1717196 characters.


42120it [31:35, 37.22it/s]

Book 42129, id 53881-0, length 1827761 characters.


42171it [31:37, 37.73it/s]

Book 42173, id 56213-0, length 2849105 characters.


42200it [31:38, 38.90it/s]

Book 42202, id 54475-0, length 2327933 characters.


42378it [31:43, 26.68it/s]

Book 42379, id 40851-8, length 4573986 characters.


42430it [31:46, 20.83it/s]

Book 42434, id 65341-0, length 2419071 characters.


42740it [31:57, 36.12it/s]

Book 42744, id 10136-8, length 3049521 characters.


42769it [31:59, 30.03it/s]

Book 42773, id 26716-8, length 1794879 characters.


42814it [32:00, 27.22it/s]

Book 42818, id 46228-8, length 1729449 characters.


42881it [32:03, 24.93it/s]

Book 42887, id 62369-0, length 3684823 characters.


42935it [32:05, 33.02it/s]

Book 42940, id 46450-0, length 2822489 characters.


43162it [32:12, 24.14it/s]

Book 43169, id 11010-8, length 1832717 characters.


43266it [32:15, 40.57it/s]

Book 43267, id 3374-0, length 2183419 characters.


43375it [32:19, 24.91it/s]

Book 43375, id 14415-8, length 1638430 characters.


43382it [32:20, 26.77it/s]

Book 43385, id 60708-8, length 1695137 characters.


43530it [32:34,  3.83it/s]

Book 43530, id 54612-8, length 1548578 characters.


43777it [32:43, 40.53it/s]

Book 43778, id 62148-0, length 2395026 characters.


43875it [32:47, 29.65it/s]

Book 43879, id 45978-8, length 1610275 characters.


43879it [32:47, 23.29it/s]

Book 43884, id 54587-0, length 1788181 characters.


43959it [32:50, 32.31it/s]

Book 43961, id 32573-8, length 1648889 characters.


44375it [33:03, 26.53it/s]

Book 44377, id 24586-8, length 1899255 characters.


44378it [33:03, 20.72it/s]

Book 44385, id 34827-8, length 1904269 characters.


44450it [33:06, 21.76it/s]

Book 44451, id 27604-8, length 2669044 characters.


44469it [33:07, 24.58it/s]

Book 44474, id 42766-8, length 3071414 characters.


44642it [33:12, 37.00it/s]

Book 44645, id 44002-8, length 1833776 characters.


44658it [33:13, 28.22it/s]

Book 44661, id 37404-0, length 1608064 characters.


44810it [33:18, 36.18it/s]

Book 44813, id 54377-0, length 2862767 characters.


44877it [33:34, 21.45it/s]

Book 44881, id 50724-0, length 1830206 characters.


45136it [33:55,  3.01it/s]

Book 45135, id 30310-8, length 1591868 characters.


45151it [33:56,  8.05it/s]

Book 45157, id 53527-0, length 3858938 characters.


45172it [33:57, 16.82it/s]

Book 45173, id 50191-0, length 1559262 characters.


45219it [33:59, 19.87it/s]

Book 45228, id 8896-8, length 2000530 characters.


45264it [34:01, 23.49it/s]

Book 45265, id 2981-0, length 6710245 characters.


45267it [34:02,  9.16it/s]

Book 45269, id 57500-0, length 2012129 characters.


45498it [34:09, 28.75it/s]

Book 45499, id 44837-8, length 4143642 characters.


45519it [34:10, 26.64it/s]

Book 45520, id 1349-0, length 1721382 characters.


45592it [34:13, 36.48it/s]

Book 45596, id 50987-0, length 2098258 characters.


45712it [34:17, 30.74it/s]

Book 45714, id 24527-8, length 1719336 characters.


45857it [34:21, 33.09it/s]

Book 45861, id 39367-8, length 3245079 characters.


46108it [34:35, 21.73it/s]

Book 46112, id 735-0, length 1749219 characters.


46218it [34:38, 35.22it/s]

Book 46219, id 39316-8, length 1574241 characters.
Book 46221, id 60852-0, length 1643576 characters.


46279it [34:40, 33.69it/s]

Book 46281, id 49140-0, length 1536443 characters.


46345it [34:42, 48.46it/s]

Book 46348, id 43572-0, length 1802153 characters.


46502it [34:48, 28.71it/s]

Book 46506, id 2988-0, length 2949787 characters.


46627it [34:51, 37.04it/s]

Book 46629, id 64176-0, length 2282859 characters.


46657it [34:53, 35.16it/s]

Book 46662, id 7714-0, length 2558524 characters.


46686it [34:54, 27.22it/s]

Book 46690, id 39157-8, length 3641985 characters.


46708it [34:55, 17.73it/s]

Book 46710, id 63415-0, length 2573057 characters.


46711it [34:56, 14.17it/s]

Book 46712, id 11100-0, length 1523591 characters.


46808it [35:12, 35.08it/s]

Book 46810, id 21006-8, length 2012060 characters.


46819it [35:12, 26.19it/s]

Book 46821, id 3350-0, length 3031514 characters.


46831it [35:13, 17.07it/s]

Book 46831, id 45634-8, length 1766109 characters.


46873it [35:15, 21.74it/s]

Book 46872, id 44494-8, length 1595838 characters.


46918it [35:16, 41.11it/s]

Book 46919, id 44438-8, length 1807426 characters.
Book 46921, id 53672-0, length 1506905 characters.


46969it [35:18, 25.31it/s]

Book 46969, id 44860-8, length 1576681 characters.


46982it [35:18, 33.85it/s]

Book 46984, id 61178-0, length 1681731 characters.


47084it [35:22, 33.43it/s]

Book 47086, id 64111-0, length 2022907 characters.


47091it [35:22, 23.55it/s]

Book 47095, id 28272-8, length 1727479 characters.


47099it [35:23, 22.75it/s]

Book 47101, id 59563-0, length 3674785 characters.


47107it [35:23, 16.51it/s]

Book 47111, id 45851-8, length 2409202 characters.


47282it [35:32, 16.01it/s]

Book 47284, id 49171-0, length 1633634 characters.


47467it [35:38, 34.49it/s]

Book 47469, id 47767-0, length 2330099 characters.


47513it [35:39, 34.86it/s]

Book 47517, id 19846-8, length 2083398 characters.


47699it [35:46, 28.87it/s]

Book 47703, id 42817-8, length 1835144 characters.


47825it [35:49, 35.70it/s]

Book 47826, id 28020-8, length 2888613 characters.


48263it [36:04, 36.01it/s]

Book 48268, id 6634-0, length 1822548 characters.


48319it [36:06, 37.66it/s]

Book 48324, id 57376-0, length 1610706 characters.


48324it [36:06, 27.79it/s]

Book 48326, id 54889-0, length 1735876 characters.


48414it [36:32, 13.67it/s]

Book 48417, id 62474-0, length 1919062 characters.


48497it [36:35, 31.83it/s]

Book 48500, id 43868-0, length 2369497 characters.


48627it [36:40, 18.22it/s]

Book 48627, id 40686-8, length 1572458 characters.


48797it [36:45, 28.05it/s]

Book 48797, id 45733-8, length 1504777 characters.


48857it [36:46, 32.80it/s]

Book 48858, id 41680-8, length 2060687 characters.


48912it [36:48, 22.99it/s]

Book 48913, id 43097-8, length 1598361 characters.


48932it [36:49, 28.18it/s]

Book 48933, id 44006-8, length 1561721 characters.


48951it [36:50, 28.58it/s]

Book 48955, id 58030-0, length 3002676 characters.


48955it [36:51, 15.60it/s]

Book 48958, id 63929-0, length 1959557 characters.


49229it [36:59, 34.76it/s]

Book 49230, id 53702-0, length 2610025 characters.


49282it [37:01, 30.23it/s]

Book 49282, id 13266-8, length 1659710 characters.


49341it [37:03, 35.04it/s]

Book 49338, id 21091-8, length 1541031 characters.


49459it [37:06, 30.65it/s]

Book 49463, id 58124-0, length 2417309 characters.


49541it [37:09, 39.80it/s]

Book 49542, id 64185-0, length 1611972 characters.


49546it [37:09, 24.11it/s]

Book 49545, id 24561-8, length 1543161 characters.


49617it [37:14, 27.20it/s]

Book 49617, id 8145-8, length 1822981 characters.


49621it [37:14, 27.77it/s]

Book 49623, id 41617-8, length 1624467 characters.


49649it [37:15, 34.20it/s]

Book 49652, id 44700-8, length 3498923 characters.


49716it [37:19, 20.56it/s]

Book 49716, id 3208-8, length 1606485 characters.


49810it [37:22, 24.75it/s]

Book 49810, id 30612-8, length 1676196 characters.


49936it [37:26, 30.69it/s]

Book 49937, id 65011-0, length 1575282 characters.


49950it [37:37,  1.40it/s]

Book 49952, id 62657-0, length 2248908 characters.


50082it [37:41, 31.79it/s]

Book 50082, id 34238-8, length 1582659 characters.


50235it [37:46, 42.38it/s]

Book 50240, id 53305-8, length 3898753 characters.


50275it [37:47, 25.99it/s]

Book 50278, id 31125-8, length 1710526 characters.


50284it [37:48, 24.66it/s]

Book 50285, id 7727-0, length 1513646 characters.


50670it [37:59, 37.21it/s]

Book 50676, id 36375-8, length 1852196 characters.


50730it [38:03,  6.04it/s]

Book 50732, id 36206-0, length 1687806 characters.


50772it [38:05, 23.25it/s]

Book 50774, id 59656-0, length 4200918 characters.


50847it [38:07, 40.11it/s]

Book 50848, id 16478-8, length 1671322 characters.


51179it [38:19, 24.93it/s]

Book 51184, id 56805-0, length 1800390 characters.


51192it [38:19, 26.99it/s]

Book 51195, id 61985-0, length 5922950 characters.


51321it [38:25, 25.64it/s]

Book 51327, id 53276-0, length 3316783 characters.


51472it [38:31, 39.99it/s]

Book 51474, id 450-8, length 1717730 characters.


51677it [38:49, 22.97it/s]

Book 51680, id 45849-8, length 1818916 characters.


51721it [38:51, 37.26it/s]

Book 51723, id 48507-0, length 1501149 characters.


51800it [38:54, 20.10it/s]

Book 51805, id 52106-8, length 4054576 characters.


51980it [39:09, 16.15it/s]

Book 51986, id 44578-8, length 1950576 characters.


52026it [39:11, 24.31it/s]

Book 52028, id 47753-8, length 1846236 characters.


52072it [39:13, 28.52it/s]

Book 52076, id 4022-0, length 1716977 characters.


52149it [39:16, 22.58it/s]

Book 52153, id 248-0, length 1530330 characters.


52169it [39:17, 27.84it/s]

Book 52170, id 44213-8, length 2098810 characters.


52350it [39:22, 38.59it/s]

Book 52357, id 42808-8, length 1928112 characters.


52718it [39:35, 20.17it/s]

Book 52721, id 21880-8, length 1521516 characters.


52789it [39:38, 26.97it/s]

Book 52792, id 47759-0, length 1964173 characters.


52916it [39:43, 21.75it/s]

Book 52919, id 42557-0, length 1928560 characters.


52919it [39:43, 17.87it/s]

Book 52920, id 41633-0, length 2983410 characters.


53488it [40:23, 21.04it/s]

Book 53493, id 62587-8, length 2827819 characters.


53560it [40:26, 26.63it/s]

Book 53561, id 22094-0, length 1581386 characters.


53635it [40:29, 54.78it/s]

Book 53638, id 44526-8, length 2479836 characters.


53641it [40:29, 29.33it/s]

Book 53644, id 54905-0, length 2260924 characters.


54009it [40:42, 33.09it/s]

Book 54015, id 53143-0, length 3571864 characters.


54120it [40:45, 41.15it/s]

Book 54123, id 2332-0, length 1744150 characters.


54232it [40:49, 26.19it/s]

Book 54231, id 27479-8, length 1569994 characters.


54335it [40:57, 30.64it/s]

Book 54340, id 44443-0, length 2870499 characters.


54400it [41:00, 34.12it/s]

Book 54402, id 50436-0, length 2210968 characters.


54429it [41:01, 31.49it/s]

Book 54432, id 14380-8, length 1697426 characters.


54485it [41:03, 20.13it/s]

Book 54485, id 60744-0, length 1517835 characters.


54562it [41:06, 31.31it/s]

Book 54566, id 36450-8, length 2796262 characters.


54693it [41:10, 27.69it/s]

Book 54693, id 3199-0, length 1632524 characters.


55013it [41:31, 30.82it/s]

Book 55014, id 58358-0, length 2592896 characters.


55046it [41:32, 30.08it/s]

Book 55047, id 38538-8, length 2453394 characters.


55111it [41:34, 30.08it/s]

Book 55112, id 3090-0, length 2730110 characters.


55134it [41:35, 29.91it/s]

Book 55137, id 47444-0, length 1673683 characters.


55178it [41:37, 28.82it/s]

Book 55180, id 41032-8, length 3127761 characters.


55287it [41:40, 42.66it/s]

Book 55290, id 57374-0, length 1706173 characters.


55368it [41:43, 19.71it/s]

Book 55368, id 4973-8, length 1652601 characters.
Book 55374, id 63489-0, length 2279255 characters.


55518it [41:50, 47.85it/s]

Book 55521, id 1895-0, length 1659796 characters.


55658it [41:55, 31.07it/s]

Book 55665, id 49930-0, length 1931337 characters.


55719it [41:57, 25.36it/s]

Book 55718, id 17265-8, length 1514677 characters.


55734it [41:58, 24.89it/s]

Book 55736, id 28039-8, length 3634365 characters.


55802it [42:00, 37.39it/s]

Book 55805, id 55620-0, length 2513807 characters.


55919it [42:04, 35.06it/s]

Book 55921, id 34439-0, length 1535696 characters.


55927it [42:05, 23.15it/s]

Book 55930, id 4397-0, length 1994790 characters.


56031it [42:08, 41.95it/s]

Book 56032, id 63116-0, length 1771229 characters.


56045it [42:08, 23.99it/s]

Book 56044, id 20758-8, length 1581716 characters.


56218it [42:15, 41.13it/s]

Book 56221, id 59133-0, length 1715701 characters.


56231it [42:16, 20.45it/s]

Book 56230, id 35685-8, length 1505365 characters.


56562it [42:38, 25.97it/s]

Book 56564, id 19488-8, length 1813999 characters.


56658it [42:46, 15.53it/s]

Book 56661, id 59745-0, length 2539923 characters.


56738it [42:49, 44.31it/s]

Book 56740, id 49027-0, length 1919301 characters.


56752it [42:49, 28.24it/s]

Book 56754, id 11272-8, length 1863250 characters.


56815it [42:52, 27.49it/s]

Book 56820, id 2848-8, length 3017178 characters.


57032it [42:59, 26.63it/s]

Book 57032, id 2612-8, length 1507872 characters.


57120it [43:01, 45.48it/s]

Book 57125, id 62687-0, length 2663525 characters.


57136it [43:02, 29.28it/s]

Book 57139, id 51393-0, length 1622403 characters.


57228it [43:06, 22.18it/s]

Book 57229, id 53565-0, length 2835844 characters.


57689it [43:21, 24.40it/s]

Book 57686, id 62291-0, length 1546040 characters.


57696it [43:21, 26.03it/s]

Book 57700, id 46223-8, length 1835904 characters.


57715it [43:22, 21.10it/s]

Book 57717, id 57060-0, length 2511790 characters.


57754it [43:23, 29.36it/s]

Book 57755, id 63593-0, length 2295334 characters.


57764it [43:24, 18.70it/s]

Book 57766, id 50801-0, length 2116721 characters.


57820it [43:33,  5.55it/s]

Book 57823, id 61176-0, length 1921697 characters.


58036it [43:40, 33.62it/s]

Book 58039, id 57628-0, length 1669410 characters.


58043it [43:40, 22.16it/s]

Book 58045, id 30312-8, length 2472280 characters.


58057it [43:41, 26.17it/s]

Book 58061, id 31824-8, length 1703915 characters.


58238it [43:59, 31.25it/s]

Book 58242, id 65855-0, length 6384596 characters.


58298it [44:02, 23.63it/s]

Book 58301, id 49351-8, length 3449458 characters.


58383it [44:05, 22.74it/s]

Book 58383, id 60174-0, length 1522199 characters.


58407it [44:06, 32.48it/s]

Book 58408, id 44004-8, length 1740186 characters.


58604it [44:12, 38.70it/s]

Book 58607, id 12667-8, length 1789573 characters.


58769it [44:17, 31.41it/s]

Book 58772, id 65909-0, length 2170162 characters.


59312it [44:47, 41.19it/s]

Book 59314, id 8676-8, length 1580659 characters.


59422it [44:51, 26.82it/s]

Book 59420, id 3220-8, length 1979694 characters.


59649it [44:57, 36.17it/s]

Book 59655, id 35589-8, length 2049818 characters.


59655it [44:58, 27.90it/s]

Book 59657, id 55497-0, length 2299826 characters.


59772it [45:12, 30.26it/s]

Book 59776, id 51836-0, length 10095626 characters.


59817it [45:15, 24.92it/s]

Book 59824, id 54846-0, length 1801391 characters.


59882it [45:18, 24.87it/s]

Book 59883, id 58237-0, length 5369789 characters.


59909it [45:19, 26.68it/s]

Book 59911, id 16528-8, length 1810048 characters.


60009it [45:23, 35.61it/s]

Book 60013, id 50640-0, length 2026102 characters.


60039it [45:24, 31.18it/s]

Book 60046, id 38700-8, length 2840680 characters.


60068it [45:25, 28.10it/s]

Book 60072, id 41873-8, length 1789308 characters.


60084it [45:26, 24.39it/s]

Book 60082, id 4952-8, length 1589978 characters.


60137it [45:28, 30.47it/s]

Book 60137, id 16997-8, length 1517032 characters.


60141it [45:28, 31.69it/s]

Book 60145, id 58062-0, length 2621083 characters.


60212it [45:35, 28.92it/s]

Book 60214, id 55191-0, length 2015341 characters.


60239it [45:36, 32.43it/s]

Book 60241, id 57375-0, length 1810751 characters.


60337it [45:39, 29.76it/s]

Book 60340, id 50883-0, length 2020879 characters.


60427it [45:42, 38.36it/s]

Book 60428, id 60968-0, length 2099507 characters.


60502it [45:45, 28.64it/s]

Book 60506, id 7795-0, length 2334613 characters.


60530it [45:46, 38.50it/s]

Book 60532, id 28540-8, length 1967830 characters.


60855it [45:57, 26.27it/s]

Book 60859, id 44003-8, length 1766901 characters.


61031it [46:03, 38.05it/s]

Book 61033, id 46242-8, length 1777215 characters.


61036it [46:03, 28.38it/s]

Book 61037, id 37137-8, length 1679059 characters.


61068it [46:04, 25.08it/s]

Book 61068, id 56812-0, length 1625333 characters.


61124it [46:08,  6.95it/s]

Book 61130, id 8123-0, length 1998208 characters.


61281it [46:13, 40.43it/s]

Book 61284, id 53360-0, length 3487078 characters.


61296it [46:26,  1.27it/s]

Book 61297, id 10900-8, length 4363592 characters.


61340it [46:28, 21.99it/s]


END OF DATASET
Total examples 61340


In [None]:
#@title Organize parquet files in train, validation and test splits
import os
import shutil

parquet_dir = "cleaned_dataset"
parquet_files = sorted(f for f in os.listdir(parquet_dir) if f.endswith(".parquet"))

n = len(parquet_files)
train_files = parquet_files[:int(0.85 * n)]
val_files   = parquet_files[int(0.85 * n):int(0.95 * n)]
test_files  = parquet_files[int(0.95 * n):]

split_dir = "split_dataset"
for split in ["train", "validation", "test"]:
    os.makedirs(os.path.join(split_dir, split), exist_ok=True)

# Copy files
for fname in train_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(split_dir, "train", fname))
for fname in val_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(split_dir, "validation", fname))
for fname in test_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(split_dir, "test", fname))
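
The split above works on whole shard files, so the split sizes depend on how `int()` truncates the two cut points. A minimal sketch of that arithmetic for 39 shards follows. The helper name `split_shards` and the shard names are illustrative assumptions, not part of the notebook's pipeline:

```python
def split_shards(files, train_end=0.85, val_end=0.95):
    """Cut a sorted list of shard names into contiguous train/validation/test slices.

    train_end and val_end are cumulative fractions, mirroring the 0.85 / 0.95
    cut points used in the cell above.
    """
    files = sorted(files)
    n = len(files)
    a = int(train_end * n)  # index where validation starts
    b = int(val_end * n)    # index where test starts
    return files[:a], files[a:b], files[b:]


# Hypothetical shard names, used only to illustrate the slice sizes.
shards = [f"shard-{i:03d}.parquet" for i in range(39)]
train, val, test = split_shards(shards)
print(len(train), len(val), len(test))  # 33 4 2
```

Because the unit is a shard rather than a book, the effective proportions are 33/39, 4/39 and 2/39 of the files, not exactly 85/10/5 percent of the examples.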



In [None]:
!du -sh split_dataset/

15G	split_dataset/


In [None]:
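# Inspect one training shard: schema, row count and the first rows
import os
import pyarrow.parquet as pq

parquet_dir = "split_dataset/train"
# os.listdir returns names in arbitrary order, so parquet_files[0] is not necessarily the first shard
parquet_files = [f for f in os.listdir(parquet_dir) if f.endswith(".parquet")]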

table = pq.read_table(os.path.join(parquet_dir, parquet_files[0]))
print(table.schema)
print(table.num_rows)
print(table.to_pandas().head())

id: string
text: string
tokenized: list<element: int64>
  child 0, element: int64
1000
        id                                               text  \
0  31764-8  Produced by David Garcia, Joseph R. Hauser and...   
1  11805-8  Produced by Michael Dyck, Charles Franks, pour...   
2  55473-0  Produced by Marc D'Hooghe at Free Literature (...   
3  42787-8  Produced by Charlene Taylor, Jonathan Ingram, ...   
4  28177-8  Produced by Chris Curnow, Lindy Walsh, Greg Be...   

                                           tokenized  
0  [11547, 771, 416, 3271, 18555, 11, 7212, 371, ...  
1  [11547, 771, 416, 3899, 23524, 694, 11, 7516, ...  
2  [11547, 771, 416, 13067, 360, 6, 28900, 519, 2...  
3  [11547, 771, 416, 6258, 1734, 8121, 11, 11232,...  
4  [11547, 771, 416, 5180, 327, 700, 322, 11, 932...  


In [None]:
# Load the local test shards; a plain list of files is placed under the default "train" split
from datasets import load_dataset

parquet_files_test = [f for f in os.listdir("split_dataset/test")]
ds = load_dataset("parquet", data_files=[os.path.join("split_dataset/test", f) for f in parquet_files_test])
print(ds)

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'tokenized'],
        num_rows: 1026
    })
})
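
The output above shows the test shards reported under a `train` split, because `data_files` was given as a plain list. `load_dataset` also accepts a dict that maps split names to file lists. The sketch below builds such a dict from the `split_dataset/` layout; the helper names `build_data_files` and `load_local_splits` are assumptions introduced here for illustration:

```python
import os


def build_data_files(split_dir, splits=("train", "validation", "test")):
    """Map each split name to the sorted parquet paths found in its subdirectory."""
    data_files = {}
    for split in splits:
        folder = os.path.join(split_dir, split)
        if not os.path.isdir(folder):
            continue  # skip splits that were never created
        paths = sorted(
            os.path.join(folder, name)
            for name in os.listdir(folder)
            if name.endswith(".parquet")
        )
        if paths:
            data_files[split] = paths
    return data_files


def load_local_splits(split_dir):
    """Load every split found under split_dir as one DatasetDict."""
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset("parquet", data_files=build_data_files(split_dir))


# Usage in the notebook (not executed here):
# ds = load_local_splits("split_dataset")
# print(ds)
```

With this mapping the returned `DatasetDict` should carry `train`, `validation` and `test` keys instead of a single `train` key.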


In [None]:
#@title Loading uploaded tokenized dataset gutenberg_clean_tokenized_en_splits
from datasets import load_dataset

ds = load_dataset("nikolina-p/gutenberg_clean_tokenized_en_splits", split="train", streaming=True)

stream = iter(ds)
print(next(stream)['text'][:50])

Resolving data files:   0%|          | 0/33 [00:00<?, ?it/s]

E-text prepared by the Online Distributed Proofrea


In [None]:
from datasets import load_dataset

ds = load_dataset("nikolina-p/gutenberg_clean_tokenized_en", split="train", streaming=True)

print(next(iter(ds)))

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

{'id': '41496-8', 'text': 'E-text prepared by the Online Distributed Proofreading Team (http://www.pgdp.net) from page images generously made available by Internet Archive (http://archive.org)\n\nNote: Images of the original pages are available through  Internet Archive. See  http://archive.org/details/addison_00cour\n\nTranscriber\'s note:\n\n Text enclosed by underscores is in italics (_italics_).\n\n Text enclosed by curly brackets is superscripted  (example: y{e}).\n\nEnglish Men of Letters\n\nEdited by John Morley\n\nADDISON\n\nby\n\nW. J. COURTHOPE\n\nHarper & Brothers Publishers New York and London 1902\n\n * * * * *\n\nENGLISH MEN OF LETTERS.\n\nEDITED BY JOHN MORLEY.\n\n JOHNSON Leslie Stephen.  GIBBON J. C. Morison.  SCOTT R. H. Hutton.  SHELLEY J. A. Symonds.  HUME T. H. Huxley.  GOLDSMITH William Black.  DEFOE William Minto.  BURNS J. C. Shairp.  SPENSER R. W. Church.  THACKERAY Anthony Trollope.  BURKE John Morley.  MILTON Mark Pattison.  HAWTHORNE Henry James, Jr.  SOUTHE

## **Gutenberg: clean and splits**

In [None]:
from datasets import load_dataset, DownloadConfig
import os
from tqdm import tqdm
import shutil

In [None]:
#@title download shards
os.makedirs("my_dataset_cache", exist_ok=True)

for i in range(39):
    url = f"https://huggingface.co/datasets/nikolina-p/gutenberg_clean_en/resolve/main/data/shard-{i:03d}.parquet"
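    # Colab's working directory is /content, so this is the directory created above
    !wget -P /content/my_dataset_cache/ {url}
    # Plain-Python alternative: os.system(f"wget -P /content/my_dataset_cache/ {url}")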


--2025-08-19 21:16:59--  https://huggingface.co/datasets/nikolina-p/gutenberg_clean_en/resolve/main/data/shard-000.parquet
Resolving huggingface.co (huggingface.co)... 3.170.185.33, 3.170.185.14, 3.170.185.35, ...
Connecting to huggingface.co (huggingface.co)|3.170.185.33|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.hf.co/repos/43/94/4394ce560c8ef6e208a04840c798f42f6b223d7d3bc3d76c5345ef772ccb7d70/e2c3d2eb968e35413d0a889bee25b60cdeb534a7de4881d31534d2414c19d193?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27shard-000.parquet%3B+filename%3D%22shard-000.parquet%22%3B&Expires=1755641820&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc1NTY0MTgyMH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zLzQzLzk0LzQzOTRjZTU2MGM4ZWY2ZTIwOGEwNDg0MGM3OThmNDJmNmIyMjNkN2QzYmMzZDc2YzUzNDVlZjc3MmNjYjdkNzAvZTJjM2QyZWI5NjhlMzU0MTNkMGE4ODliZWUyNWI2MGNkZWI1MzRhN2RlNDg4MWQzMTUzNGQyNDE0YzE5
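
The `wget` loop above works, but the same shards can also be fetched through `huggingface_hub`, which handles the redirect and caching itself. A minimal sketch, assuming the `data/shard-NNN.parquet` layout seen in the URLs above; `shard_filenames` and `download_shards` are hypothetical helper names:

```python
def shard_filenames(num_shards=39, prefix="data"):
    """Repo-relative file names, following the shard-NNN.parquet pattern used above."""
    return [f"{prefix}/shard-{i:03d}.parquet" for i in range(num_shards)]


def download_shards(repo_id, local_dir, num_shards=39):
    """Download every shard of a dataset repo into local_dir and return the local paths."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network access

    return [
        hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            repo_type="dataset",
            local_dir=local_dir,
        )
        for filename in shard_filenames(num_shards)
    ]


# Usage in the notebook (not executed here):
# paths = download_shards("nikolina-p/gutenberg_clean_en", "my_dataset_cache")
```

Note that `local_dir` keeps the repo's folder structure, so the files would land in `my_dataset_cache/data/` rather than directly in `my_dataset_cache/`; any later `parquet_dir` would need to point there.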

In [None]:
#@title split shards into train, validation and test splits
parquet_dir = "my_dataset_cache"
parquet_files = sorted(f for f in os.listdir(parquet_dir) if f.endswith(".parquet"))
n = len(parquet_files)
train_files = parquet_files[:int(0.85 * n)]
val_files   = parquet_files[int(0.85 * n):int(0.95 * n)]
test_files  = parquet_files[int(0.95 * n):]

# os.makedirs returns None, so keep the path in its own variable
new_dir = "new_dir"
os.makedirs(new_dir, exist_ok=True)
for split in ["train", "validation", "test"]:
    os.makedirs(os.path.join(new_dir, split), exist_ok=True)

# Copy files
for fname in train_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "train", fname))
for fname in val_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "validation", fname))
for fname in test_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "test", fname))
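
The split directory is uploaded next with `upload_to_hf`, which clones the repo through `Repository`. That git-based class is deprecated in `huggingface_hub` in favor of the HTTP-based methods. The sketch below shows one possible HTTP route using `HfApi.upload_folder`. The function names `planned_repo_paths` and `upload_folder_to_hf` are assumptions introduced here; the `data/` prefix mirrors the layout produced by `upload_to_hf`:

```python
import os


def planned_repo_paths(local_dir, path_in_repo="data"):
    """Dry run: list the repo-relative paths an upload of local_dir would create."""
    out = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), local_dir)
            out.append("/".join([path_in_repo] + rel.split(os.sep)))
    return sorted(out)


def upload_folder_to_hf(repo_id, local_dir, path_in_repo="data"):
    """Create the dataset repo if needed and upload local_dir over HTTP (no git clone)."""
    from huggingface_hub import HfApi  # imported lazily; requires a logged-in session

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    return api.upload_folder(
        repo_id=repo_id,
        folder_path=local_dir,
        path_in_repo=path_in_repo,
        repo_type="dataset",
        commit_message="Initial upload of dataset",
    )


# Usage in the notebook (not executed here):
# print(planned_repo_paths("new_dir")[:3])
# upload_folder_to_hf("nikolina-p/gutenberg_clean_en_splits", "new_dir")
```

Checking `planned_repo_paths` first is a cheap way to confirm that the files will land under `data/train/`, `data/validation/` and `data/test/` before any bytes are sent.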

In [None]:
#@title upload dataset to HF Hub (LOGIN TO HUB FIRST)
upload_to_hf("nikolina-p/gutenberg_clean_en_splits", "new_dir")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/datasets/nikolina-p/gutenberg_clean_en_splits into local empty directory.


Upload file data/train/shard-008.parquet:   0%|          | 1.00/235M [00:00<?, ?B/s]

Upload file data/train/shard-009.parquet:   0%|          | 1.00/235M [00:00<?, ?B/s]

Upload file data/train/shard-015.parquet:   0%|          | 1.00/234M [00:00<?, ?B/s]

Upload file data/train/shard-000.parquet:   0%|          | 1.00/234M [00:00<?, ?B/s]

Upload file data/train/shard-016.parquet:   0%|          | 1.00/232M [00:00<?, ?B/s]

Upload file data/train/shard-003.parquet:   0%|          | 1.00/233M [00:00<?, ?B/s]

Upload file data/train/shard-005.parquet:   0%|          | 1.00/234M [00:00<?, ?B/s]

Upload file data/train/shard-006.parquet:   0%|          | 1.00/232M [00:00<?, ?B/s]

Upload file data/test/shard-037.parquet:   0%|          | 1.00/232M [00:00<?, ?B/s]

Upload file data/train/shard-028.parquet:   0%|          | 1.00/230M [00:00<?, ?B/s]

Upload file data/train/shard-011.parquet:   0%|          | 1.00/229M [00:00<?, ?B/s]

Upload file data/train/shard-022.parquet:   0%|          | 1.00/229M [00:00<?, ?B/s]

Upload file data/train/shard-010.parquet:   0%|          | 1.00/229M [00:00<?, ?B/s]

Upload file data/train/shard-020.parquet:   0%|          | 1.00/229M [00:00<?, ?B/s]

Upload file data/train/shard-007.parquet:   0%|          | 1.00/227M [00:00<?, ?B/s]

Upload file data/train/shard-021.parquet:   0%|          | 1.00/227M [00:00<?, ?B/s]

Upload file data/train/shard-013.parquet:   0%|          | 1.00/227M [00:00<?, ?B/s]

Upload file data/train/shard-012.parquet:   0%|          | 1.00/226M [00:00<?, ?B/s]

Upload file data/train/shard-025.parquet:   0%|          | 1.00/226M [00:00<?, ?B/s]

Upload file data/train/shard-014.parquet:   0%|          | 1.00/226M [00:00<?, ?B/s]

Upload file data/train/shard-017.parquet:   0%|          | 1.00/226M [00:00<?, ?B/s]

Upload file data/train/shard-023.parquet:   0%|          | 1.00/226M [00:00<?, ?B/s]

Upload file data/train/shard-002.parquet:   0%|          | 1.00/225M [00:00<?, ?B/s]

Upload file data/train/shard-018.parquet:   0%|          | 1.00/224M [00:00<?, ?B/s]

Upload file data/train/shard-029.parquet:   0%|          | 1.00/222M [00:00<?, ?B/s]

Upload file data/train/shard-024.parquet:   0%|          | 1.00/221M [00:00<?, ?B/s]

Upload file data/validation/shard-035.parquet:   0%|          | 1.00/221M [00:00<?, ?B/s]

Upload file data/train/shard-019.parquet:   0%|          | 1.00/220M [00:00<?, ?B/s]

Upload file data/train/shard-001.parquet:   0%|          | 1.00/219M [00:00<?, ?B/s]

Upload file data/validation/shard-034.parquet:   0%|          | 1.00/218M [00:00<?, ?B/s]

Upload file data/train/shard-004.parquet:   0%|          | 1.00/217M [00:00<?, ?B/s]

Upload file data/train/shard-027.parquet:   0%|          | 1.00/217M [00:00<?, ?B/s]

Upload file data/train/shard-030.parquet:   0%|          | 1.00/216M [00:00<?, ?B/s]

Upload file data/train/shard-031.parquet:   0%|          | 1.00/213M [00:00<?, ?B/s]

Upload file data/train/shard-026.parquet:   0%|          | 1.00/213M [00:00<?, ?B/s]

Upload file data/validation/shard-033.parquet:   0%|          | 1.00/213M [00:00<?, ?B/s]

Upload file data/validation/shard-036.parquet:   0%|          | 1.00/211M [00:00<?, ?B/s]

Upload file data/train/shard-032.parquet:   0%|          | 1.00/207M [00:00<?, ?B/s]

Upload file data/test/shard-038.parquet:   0%|          | 1.00/8.95M [00:00<?, ?B/s]

To https://huggingface.co/datasets/nikolina-p/gutenberg_clean_en_splits
   e6c5a59..5e2d89a  main -> main




In [None]:
dss = load_dataset("nikolina-p/gutenberg_clean_en_splits", split="test", streaming=True)

README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/33 [00:00<?, ?it/s]

In [None]:
books_test = []
for book in tqdm(dss):
    books_test.append(book['id'])

1026it [00:04, 212.20it/s]


In [None]:
len(books_test)

1026

In [None]:
valid = load_dataset("nikolina-p/gutenberg_clean_en_splits", split="validation", streaming=True)

Resolving data files:   0%|          | 0/33 [00:00<?, ?it/s]

In [None]:
books_valid = []
for book in tqdm(valid):
    books_valid.append(book['id'])

4000it [00:12, 326.02it/s]
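
The ID lists collected above can be used to check that no book ended up in more than one split. A minimal sketch, assuming the lists fit in memory; `shared_ids` is an illustrative helper, not part of the notebook:

```python
def shared_ids(*id_lists):
    """Return the set of IDs that occur in more than one of the given lists."""
    seen, shared = set(), set()
    for ids in id_lists:
        unique = set(ids)        # ignore repeats within a single split
        shared |= seen & unique  # IDs already seen in an earlier split
        seen |= unique
    return shared


# Toy data standing in for books_test and books_valid
print(shared_ids(["a-1", "b-2"], ["c-3", "b-2"]))
```

In the notebook this would be called as `shared_ids(books_test, books_valid)`; an empty set means the two splits do not share any book ID.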


## **Mini Gutenberg: clean and tokenized**

In [None]:
#load the dataset
from datasets import load_dataset

ds = load_dataset('nikolina-p/gutenberg_clean_en', split='train', streaming=True)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

In [None]:
#@title Making Mini Gutenberg clean
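# Cut each book into 5 consecutive pieces in the ratio 2 : 3 : 1.5 : 5 : 2.5 (14 parts in total)
# and write the pieces to parquet shards, 3 pieces per shard
stream = iter(ds)
splits = [(0, 2), (2, 5), (5, 6.5), (6.5, 11.5), (11.5, 14)]

output_dir = "local_data"
os.makedirs(output_dir, exist_ok=True)

buffer = []
shard_idx = 0
while shard_idx < 39:
    if len(buffer) < 3:
        # refill the buffer with the 5 pieces of the next book
        book = next(stream)
        # size of 1/14 of the text, computed once so that every piece
        # starts exactly where the previous one ends
        unit = (len(book['text']) + 13) / 14

        buffer.extend(
            {
                'id': book['id'],
                'text': book['text'][int(s * unit):int(e * unit)]
            }
            for (s, e) in splits
        )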

    table = pa.Table.from_pylist(buffer[:3])
    del buffer[:3]
    pq.write_table(table, os.path.join(output_dir, f"shard-{shard_idx:03d}.parquet"))
    shard_idx += 1
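
The slicing in the cell above can be tried in isolation. The sketch below is a simplified stand-alone version (the names `SPLITS` and `slice_text` are illustrative); it uses a single `unit` value for both ends of every slice and checks that the pieces reassemble the original text:

```python
SPLITS = [(0, 2), (2, 5), (5, 6.5), (6.5, 11.5), (11.5, 14)]


def slice_text(text, splits=SPLITS, parts=14):
    """Cut text into consecutive pieces at the given fractional boundaries."""
    # Equivalent to (length + 13) / 14 when parts == 14; the last boundary
    # may overshoot the text length, which Python slicing simply clamps.
    unit = (len(text) + parts - 1) / parts
    return [text[int(s * unit):int(e * unit)] for s, e in splits]


text = "x" * 1000
pieces = slice_text(text)
print([len(p) for p in pieces])
print("".join(pieces) == text)  # True when nothing is lost or duplicated
```

Because each piece starts at the index where the previous one ends, the five pieces of a book always add up to the full text.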


In [None]:
mini = load_dataset('local_data')
mini

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/39 [00:00<?, ?files/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 117
    })
})

In [None]:
for e in mini['train']:
    print(f"{e['id']} : {len(e['text'])} characters")

41496-8 : 47843 characters
41496-8 : 71764 characters
41496-8 : 35883 characters
41496-8 : 119608 characters
41496-8 : 59791 characters
7529-0 : 43360 characters
7529-0 : 65041 characters
7529-0 : 32520 characters
7529-0 : 108401 characters
7529-0 : 54188 characters
25056-8 : 4696 characters
25056-8 : 7045 characters
25056-8 : 3522 characters
25056-8 : 11742 characters
25056-8 : 5858 characters
48941-8 : 85943 characters
48941-8 : 128915 characters
48941-8 : 64458 characters
48941-8 : 214859 characters
48941-8 : 107417 characters
35338-8 : 122240 characters
35338-8 : 183362 characters
35338-8 : 91680 characters
35338-8 : 305602 characters
35338-8 : 152789 characters
21060-8 : 81730 characters
21060-8 : 122596 characters
21060-8 : 61298 characters
21060-8 : 204326 characters
21060-8 : 102151 characters
22782-8 : 15054 characters
22782-8 : 22582 characters
22782-8 : 11291 characters
22782-8 : 37636 characters
22782-8 : 18806 characters
38207-0 : 24966 characters
38207-0 : 37450 character

In [None]:
# upload to the hub
upload_to_hf("nikolina-p/mini_gutenberg", "local_data")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/datasets/nikolina-p/mini_gutenberg into local empty directory.


Upload file data/shard-007.parquet:   9%|9         | 32.0k/354k [00:00<?, ?B/s]

Upload file data/shard-009.parquet:  15%|#4        | 32.0k/220k [00:00<?, ?B/s]

Upload file data/shard-026.parquet:  10%|#         | 32.0k/314k [00:00<?, ?B/s]

Upload file data/shard-025.parquet:  12%|#1        | 32.0k/272k [00:00<?, ?B/s]

Upload file data/shard-008.parquet:  15%|#4        | 32.0k/217k [00:00<?, ?B/s]

Upload file data/shard-034.parquet:  15%|#5        | 32.0k/212k [00:00<?, ?B/s]

Upload file data/shard-006.parquet:  12%|#1        | 32.0k/272k [00:00<?, ?B/s]

Upload file data/shard-019.parquet:  17%|#6        | 32.0k/192k [00:00<?, ?B/s]

Upload file data/shard-005.parquet:  18%|#8        | 32.0k/175k [00:00<?, ?B/s]

Upload file data/shard-014.parquet:  20%|##        | 32.0k/159k [00:00<?, ?B/s]

Upload file data/shard-029.parquet:  21%|##        | 32.0k/156k [00:00<?, ?B/s]

Upload file data/shard-033.parquet:  21%|##        | 32.0k/153k [00:00<?, ?B/s]

Upload file data/shard-022.parquet:  22%|##1       | 32.0k/148k [00:00<?, ?B/s]

Upload file data/shard-018.parquet:  22%|##2       | 32.0k/145k [00:00<?, ?B/s]

Upload file data/shard-036.parquet:  23%|##3       | 32.0k/138k [00:00<?, ?B/s]

Upload file data/shard-017.parquet:  24%|##3       | 32.0k/135k [00:00<?, ?B/s]

Upload file data/shard-001.parquet:  24%|##3       | 32.0k/135k [00:00<?, ?B/s]

Upload file data/shard-032.parquet:  24%|##4       | 32.0k/132k [00:00<?, ?B/s]

Upload file data/shard-031.parquet:  25%|##5       | 32.0k/127k [00:00<?, ?B/s]

Upload file data/shard-002.parquet:  26%|##5       | 32.0k/124k [00:00<?, ?B/s]

Upload file data/shard-024.parquet:  28%|##7       | 32.0k/116k [00:00<?, ?B/s]

Upload file data/shard-023.parquet:  31%|###       | 32.0k/105k [00:00<?, ?B/s]

Upload file data/shard-021.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/shard-035.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/shard-013.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/shard-028.parquet:  33%|###2      | 32.0k/97.3k [00:00<?, ?B/s]

Upload file data/shard-000.parquet:  33%|###3      | 32.0k/95.8k [00:00<?, ?B/s]

Upload file data/shard-037.parquet:  34%|###3      | 32.0k/94.3k [00:00<?, ?B/s]

Upload file data/shard-030.parquet:  39%|###8      | 32.0k/82.3k [00:00<?, ?B/s]

Upload file data/shard-016.parquet:  41%|####      | 32.0k/79.0k [00:00<?, ?B/s]

Upload file data/shard-012.parquet:  48%|####8     | 32.0k/66.1k [00:00<?, ?B/s]

Upload file data/shard-020.parquet:  51%|#####     | 32.0k/63.1k [00:00<?, ?B/s]

Upload file data/shard-011.parquet:  66%|######6   | 32.0k/48.3k [00:00<?, ?B/s]

Upload file data/shard-038.parquet:  74%|#######3  | 32.0k/43.5k [00:00<?, ?B/s]

Upload file data/shard-015.parquet:  75%|#######5  | 32.0k/42.6k [00:00<?, ?B/s]

Upload file data/shard-003.parquet:  78%|#######8  | 32.0k/41.0k [00:00<?, ?B/s]

Upload file data/shard-027.parquet:  94%|#########3| 32.0k/34.1k [00:00<?, ?B/s]

Upload file data/shard-010.parquet: 100%|##########| 31.3k/31.3k [00:00<?, ?B/s]

Upload file data/shard-004.parquet: 100%|##########| 13.7k/13.7k [00:00<?, ?B/s]

To https://huggingface.co/datasets/nikolina-p/mini_gutenberg
   4d8f7c9..9018eea  main -> main




In [None]:
mini_hf = load_dataset('nikolina-p/mini_gutenberg', split='train', streaming=True)
mini_stream = iter(mini_hf)

next(mini_stream)

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

{'id': '41496-8',
 'text': 'E-text prepared by the Online Distributed Proofreading Team (http://www.pgdp.net) from page images generously made available by Internet Archive (http://archive.org)\n\nNote: Images of the original pages are available through  Internet Archive. See  http://archive.org/details/addison_00cour\n\nTranscriber\'s note:\n\n Text enclosed by underscores is in italics (_italics_).\n\n Text enclosed by curly brackets is superscripted  (example: y{e}).\n\nEnglish Men of Letters\n\nEdited by John Morley\n\nADDISON\n\nby\n\nW. J. COURTHOPE\n\nHarper & Brothers Publishers New York and London 1902\n\n * * * * *\n\nENGLISH MEN OF LETTERS.\n\nEDITED BY JOHN MORLEY.\n\n JOHNSON Leslie Stephen.  GIBBON J. C. Morison.  SCOTT R. H. Hutton.  SHELLEY J. A. Symonds.  HUME T. H. Huxley.  GOLDSMITH William Black.  DEFOE William Minto.  BURNS J. C. Shairp.  SPENSER R. W. Church.  THACKERAY Anthony Trollope.  BURKE John Morley.  MILTON Mark Pattison.  HAWTHORNE Henry James, Jr.  SOUTH

In [None]:
from datasets import load_dataset

mini_hf = load_dataset('nikolina-p/mini_gutenberg', split='train')
books = set()
for book in mini_hf:
    books.add(book['id'])

len(books)

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

24

In [None]:
mini_hf

Dataset({
    features: ['id', 'text'],
    num_rows: 117
})
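
The row and book counts follow from the sharding loop: 39 shards with 3 pieces each give 117 rows, and since every book contributes 5 pieces, the stream has to be read into a 24th book to fill them. A quick check of that arithmetic (variable names are illustrative):

```python
import math

shards, pieces_per_shard, pieces_per_book = 39, 3, 5

rows = shards * pieces_per_shard                    # pieces written in total
books_touched = math.ceil(rows / pieces_per_book)   # books needed to supply them
leftover = books_touched * pieces_per_book - rows   # pieces read but never written

print(rows, books_touched, leftover)
```

This prints `117 24 3`: the last book read contributes only its first two pieces, and its remaining three stay in the buffer when the loop ends.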

In [None]:
#@title Making Mini Gutenberg TOKENIZED
# Cut each book into 5 consecutive pieces in the ratio 2 : 3 : 1.5 : 5 : 2.5 (14 parts in total),
# tokenize each piece with the GPT-2 encoding and write the pieces to parquet shards, 3 per shard
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

stream = iter(ds)
splits = [(0, 2), (2, 5), (5, 6.5), (6.5, 11.5), (11.5, 14)]

output_dir = "local_data"
os.makedirs(output_dir, exist_ok=True)

buffer = []
shard_idx = 0
while shard_idx < 39:
    if len(buffer) < 3:
        # add pieces of next book to buffer
        book = next(stream)
        # size of 1/14 of the text, computed once so that every piece
        # starts exactly where the previous one ends
        unit = (len(book['text']) + 13) / 14
        for (s, e) in splits:
            text = book['text'][int(s * unit):int(e * unit)]
            buffer.append(
                {
                    'id': book['id'],
                    'text': text,
                    'tokenized': tokenizer.encode(text)
                }
            )

    table = pa.Table.from_pylist(buffer[:3])
    del buffer[:3]
    pq.write_table(table, os.path.join(output_dir, f"shard-{shard_idx:03d}.parquet"))
    shard_idx += 1
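
The buffer logic in the loop above amounts to grouping a stream of pieces into fixed-size shards. A minimal sketch of the same idea with `itertools.islice`; `batched_rows` is a hypothetical helper, and unlike the loop above it stops quietly when the source runs out instead of raising `StopIteration`:

```python
from itertools import islice


def batched_rows(rows, rows_per_shard=3, max_shards=39):
    """Yield up to max_shards lists of rows_per_shard consecutive rows each."""
    it = iter(rows)
    for _ in range(max_shards):
        shard = list(islice(it, rows_per_shard))
        if len(shard) < rows_per_shard:
            return  # source exhausted: drop the incomplete shard and stop
        yield shard


# Toy rows standing in for the {'id', 'text', 'tokenized'} dictionaries
for shard_idx, shard in enumerate(batched_rows(range(10))):
    print(f"shard-{shard_idx:03d}", shard)
```

Each yielded list could then be passed to `pa.Table.from_pylist` and written with `pq.write_table`, as in the cell above.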


In [None]:
mini_tok = load_dataset("local_data")
mini_tok

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'text', 'tokenized'],
        num_rows: 117
    })
})

In [None]:
# upload to the hub
upload_to_hf("nikolina-p/mini_gutenberg_tokenized", "local_data")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/datasets/nikolina-p/mini_gutenberg_tokenized into local empty directory.


Upload file data/shard-007.parquet:   5%|5         | 32.0k/622k [00:00<?, ?B/s]

Upload file data/shard-025.parquet:   7%|6         | 32.0k/492k [00:00<?, ?B/s]

Upload file data/shard-006.parquet:   6%|6         | 32.0k/495k [00:00<?, ?B/s]

Upload file data/shard-026.parquet:   6%|5         | 32.0k/567k [00:00<?, ?B/s]

Upload file data/shard-019.parquet:   9%|8         | 32.0k/359k [00:00<?, ?B/s]

Upload file data/shard-008.parquet:   8%|8         | 32.0k/396k [00:00<?, ?B/s]

Upload file data/shard-009.parquet:   8%|8         | 32.0k/383k [00:00<?, ?B/s]

Upload file data/shard-034.parquet:   8%|8         | 32.0k/398k [00:00<?, ?B/s]

Upload file data/shard-005.parquet:  10%|9         | 32.0k/325k [00:00<?, ?B/s]

Upload file data/shard-029.parquet:  11%|#         | 32.0k/303k [00:00<?, ?B/s]

Upload file data/shard-014.parquet:  11%|#         | 32.0k/292k [00:00<?, ?B/s]

Upload file data/shard-033.parquet:  11%|#         | 32.0k/292k [00:00<?, ?B/s]

Upload file data/shard-018.parquet:  12%|#1        | 32.0k/273k [00:00<?, ?B/s]

Upload file data/shard-022.parquet:  12%|#1        | 32.0k/268k [00:00<?, ?B/s]

Upload file data/shard-017.parquet:  13%|#2        | 32.0k/255k [00:00<?, ?B/s]

Upload file data/shard-036.parquet:  13%|#2        | 32.0k/252k [00:00<?, ?B/s]

Upload file data/shard-001.parquet:  13%|#3        | 32.0k/246k [00:00<?, ?B/s]

Upload file data/shard-032.parquet:  13%|#3        | 32.0k/238k [00:00<?, ?B/s]

Upload file data/shard-031.parquet:  14%|#3        | 32.0k/233k [00:00<?, ?B/s]

Upload file data/shard-002.parquet:  14%|#3        | 32.0k/232k [00:00<?, ?B/s]

Upload file data/shard-024.parquet:  15%|#5        | 32.0k/213k [00:00<?, ?B/s]

Upload file data/shard-013.parquet:  15%|#5        | 32.0k/207k [00:00<?, ?B/s]

Upload file data/shard-028.parquet:  16%|#6        | 32.0k/194k [00:00<?, ?B/s]

Upload file data/shard-021.parquet:  17%|#6        | 32.0k/194k [00:00<?, ?B/s]

Upload file data/shard-023.parquet:  17%|#6        | 32.0k/193k [00:00<?, ?B/s]

Upload file data/shard-035.parquet:  17%|#6        | 32.0k/192k [00:00<?, ?B/s]

Upload file data/shard-000.parquet:  18%|#8        | 32.0k/176k [00:00<?, ?B/s]

Upload file data/shard-037.parquet:  19%|#8        | 32.0k/172k [00:00<?, ?B/s]

Upload file data/shard-016.parquet:  21%|##        | 32.0k/155k [00:00<?, ?B/s]

Upload file data/shard-030.parquet:  22%|##2       | 32.0k/145k [00:00<?, ?B/s]

Upload file data/shard-012.parquet:  26%|##6       | 32.0k/122k [00:00<?, ?B/s]

Upload file data/shard-020.parquet:  28%|##7       | 32.0k/115k [00:00<?, ?B/s]

Upload file data/shard-011.parquet:  34%|###3      | 32.0k/95.1k [00:00<?, ?B/s]

Upload file data/shard-038.parquet:  39%|###8      | 32.0k/82.7k [00:00<?, ?B/s]

Upload file data/shard-015.parquet:  40%|###9      | 32.0k/80.0k [00:00<?, ?B/s]

Upload file data/shard-003.parquet:  41%|####1     | 32.0k/77.4k [00:00<?, ?B/s]

Upload file data/shard-027.parquet:  50%|####9     | 32.0k/64.5k [00:00<?, ?B/s]

Upload file data/shard-010.parquet:  51%|#####1    | 32.0k/62.4k [00:00<?, ?B/s]

Upload file data/shard-004.parquet: 100%|##########| 26.4k/26.4k [00:00<?, ?B/s]

To https://huggingface.co/datasets/nikolina-p/mini_gutenberg_tokenized
   21bb2c7..b60b62f  main -> main




In [None]:
mini_tokenized = load_dataset("nikolina-p/mini_gutenberg_tokenized", split="train")
mini_tokenized

README.md:   0%|          | 0.00/964 [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

Downloading data:   0%|          | 0/39 [00:00<?, ?files/s]

shard-013.parquet:   0%|          | 0.00/212k [00:00<?, ?B/s]

shard-007.parquet:   0%|          | 0.00/637k [00:00<?, ?B/s]

shard-009.parquet:   0%|          | 0.00/393k [00:00<?, ?B/s]

shard-012.parquet:   0%|          | 0.00/125k [00:00<?, ?B/s]

shard-006.parquet:   0%|          | 0.00/507k [00:00<?, ?B/s]

shard-015.parquet:   0%|          | 0.00/82.0k [00:00<?, ?B/s]

shard-008.parquet:   0%|          | 0.00/405k [00:00<?, ?B/s]

shard-003.parquet:   0%|          | 0.00/79.2k [00:00<?, ?B/s]

shard-000.parquet:   0%|          | 0.00/180k [00:00<?, ?B/s]

shard-002.parquet:   0%|          | 0.00/237k [00:00<?, ?B/s]

shard-011.parquet:   0%|          | 0.00/97.3k [00:00<?, ?B/s]

shard-005.parquet:   0%|          | 0.00/332k [00:00<?, ?B/s]

shard-004.parquet:   0%|          | 0.00/27.0k [00:00<?, ?B/s]

shard-014.parquet:   0%|          | 0.00/299k [00:00<?, ?B/s]

shard-010.parquet:   0%|          | 0.00/63.9k [00:00<?, ?B/s]

shard-001.parquet:   0%|          | 0.00/252k [00:00<?, ?B/s]

shard-017.parquet:   0%|          | 0.00/261k [00:00<?, ?B/s]

shard-016.parquet:   0%|          | 0.00/159k [00:00<?, ?B/s]

shard-019.parquet:   0%|          | 0.00/368k [00:00<?, ?B/s]

shard-020.parquet:   0%|          | 0.00/117k [00:00<?, ?B/s]

shard-022.parquet:   0%|          | 0.00/275k [00:00<?, ?B/s]

shard-023.parquet:   0%|          | 0.00/198k [00:00<?, ?B/s]

shard-026.parquet:   0%|          | 0.00/581k [00:00<?, ?B/s]

shard-021.parquet:   0%|          | 0.00/199k [00:00<?, ?B/s]

shard-028.parquet:   0%|          | 0.00/199k [00:00<?, ?B/s]

shard-024.parquet:   0%|          | 0.00/218k [00:00<?, ?B/s]

shard-025.parquet:   0%|          | 0.00/504k [00:00<?, ?B/s]

shard-027.parquet:   0%|          | 0.00/66.1k [00:00<?, ?B/s]

shard-029.parquet:   0%|          | 0.00/310k [00:00<?, ?B/s]

shard-018.parquet:   0%|          | 0.00/280k [00:00<?, ?B/s]

shard-030.parquet:   0%|          | 0.00/148k [00:00<?, ?B/s]

shard-031.parquet:   0%|          | 0.00/239k [00:00<?, ?B/s]

shard-033.parquet:   0%|          | 0.00/299k [00:00<?, ?B/s]

shard-035.parquet:   0%|          | 0.00/197k [00:00<?, ?B/s]

shard-037.parquet:   0%|          | 0.00/176k [00:00<?, ?B/s]

shard-038.parquet:   0%|          | 0.00/84.7k [00:00<?, ?B/s]

shard-032.parquet:   0%|          | 0.00/244k [00:00<?, ?B/s]

shard-036.parquet:   0%|          | 0.00/258k [00:00<?, ?B/s]

shard-034.parquet:   0%|          | 0.00/408k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/117 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'text', 'tokenized'],
    num_rows: 117
})

In [None]:
tok_num = 0
for e in mini_tokenized:
    tok_num += len(e["tokenized"])

print(f"NUMBER OF TOKENS IN MINI_GUTENBERG_TOKENIZED: {tok_num:,}")

NUMBER OF TOKENS IN MINI_GUTENBERG_TOKENIZED: 2,110,010


## **Mini Gutenberg: clean and splits**

In [None]:
# import
from datasets import load_dataset, DownloadConfig
import os
from tqdm import tqdm
import shutil

In [None]:
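# download the 39 shard files into my_dataset_cache (assumes the working directory is /content)
# note: wget does not overwrite an existing file but saves the new copy as shard-XXX.parquet.1,
# so the directory should not already contain shards from an earlier run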
os.makedirs("my_dataset_cache", exist_ok=True)

for i in range(39):
    url = f"https://huggingface.co/datasets/nikolina-p/mini_gutenberg/resolve/main/data/shard-{i:03d}.parquet"
    !wget -P /content/my_dataset_cache/ {url}
    # os.system(f"wget -P /content {url}")

--2025-08-20 16:49:18--  https://huggingface.co/datasets/nikolina-p/mini_gutenberg/resolve/main/data/shard-000.parquet
Resolving huggingface.co (huggingface.co)... 18.239.50.16, 18.239.50.103, 18.239.50.49, ...
Connecting to huggingface.co (huggingface.co)|18.239.50.16|:443... connected.
HTTP request sent, awaiting response... 302 Found

In [None]:
# split shards into train/validation/test at the file level:
# first 85% of shards -> train, next 10% -> validation, remainder -> test
parquet_dir = "my_dataset_cache"
parquet_files = sorted(f for f in os.listdir(parquet_dir) if f.endswith(".parquet"))
n = len(parquet_files)
train_files = parquet_files[:int(0.85 * n)]
val_files   = parquet_files[int(0.85 * n):int(0.95 * n)]
test_files  = parquet_files[int(0.95 * n):]

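# Output root; os.makedirs returns None, so its result is not assigned.
# The per-split sub-directories (and new_dir itself) are created in the loop below.
new_dir = "new_dir"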
for split in ["train", "validation", "test"]:
    os.makedirs(os.path.join(new_dir, split), exist_ok=True)

# Copy files
for fname in train_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "train", fname))
for fname in val_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "validation", fname))
for fname in test_files:
    shutil.copy(os.path.join(parquet_dir, fname), os.path.join(new_dir, "test", fname))

In [None]:
# upload dataset to HF Hub (LOGIN TO HUB FIRST)
upload_to_hf("nikolina-p/mini_gutenberg_splits", "new_dir")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/datasets/nikolina-p/mini_gutenberg_splits into local empty directory.


Upload file data/train/shard-007.parquet:   9%|9         | 32.0k/354k [00:00<?, ?B/s]

Upload file data/train/shard-009.parquet:  15%|#4        | 32.0k/220k [00:00<?, ?B/s]

Upload file data/train/shard-008.parquet:  15%|#4        | 32.0k/217k [00:00<?, ?B/s]

Upload file data/validation/shard-034.parquet:  15%|#5        | 32.0k/212k [00:00<?, ?B/s]

Upload file data/train/shard-019.parquet:  17%|#6        | 32.0k/192k [00:00<?, ?B/s]

Upload file data/train/shard-025.parquet:  12%|#1        | 32.0k/272k [00:00<?, ?B/s]

Upload file data/train/shard-026.parquet:  10%|#         | 32.0k/314k [00:00<?, ?B/s]

Upload file data/train/shard-006.parquet:  12%|#1        | 32.0k/272k [00:00<?, ?B/s]

Upload file data/train/shard-005.parquet:  18%|#8        | 32.0k/175k [00:00<?, ?B/s]

Upload file data/train/shard-014.parquet:  20%|##        | 32.0k/159k [00:00<?, ?B/s]

Upload file data/train/shard-029.parquet:  21%|##        | 32.0k/156k [00:00<?, ?B/s]

Upload file data/validation/shard-033.parquet:  21%|##        | 32.0k/153k [00:00<?, ?B/s]

Upload file data/train/shard-022.parquet:  22%|##1       | 32.0k/148k [00:00<?, ?B/s]

Upload file data/train/shard-018.parquet:  22%|##2       | 32.0k/145k [00:00<?, ?B/s]

Upload file data/validation/shard-036.parquet:  23%|##3       | 32.0k/138k [00:00<?, ?B/s]

Upload file data/train/shard-017.parquet:  24%|##3       | 32.0k/135k [00:00<?, ?B/s]

Upload file data/train/shard-001.parquet:  24%|##3       | 32.0k/135k [00:00<?, ?B/s]

Upload file data/train/shard-032.parquet:  24%|##4       | 32.0k/132k [00:00<?, ?B/s]

Upload file data/train/shard-031.parquet:  25%|##5       | 32.0k/127k [00:00<?, ?B/s]

Upload file data/train/shard-002.parquet:  26%|##5       | 32.0k/124k [00:00<?, ?B/s]

Upload file data/train/shard-024.parquet:  28%|##7       | 32.0k/116k [00:00<?, ?B/s]

Upload file data/train/shard-023.parquet:  31%|###       | 32.0k/105k [00:00<?, ?B/s]

Upload file data/train/shard-021.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/validation/shard-035.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/train/shard-013.parquet:  31%|###       | 32.0k/104k [00:00<?, ?B/s]

Upload file data/train/shard-028.parquet:  33%|###2      | 32.0k/97.3k [00:00<?, ?B/s]

Upload file data/train/shard-000.parquet:  33%|###3      | 32.0k/95.8k [00:00<?, ?B/s]

Upload file data/test/shard-037.parquet:  34%|###3      | 32.0k/94.3k [00:00<?, ?B/s]

Upload file data/train/shard-030.parquet:  39%|###8      | 32.0k/82.3k [00:00<?, ?B/s]

Upload file data/train/shard-016.parquet:  41%|####      | 32.0k/79.0k [00:00<?, ?B/s]

Upload file data/train/shard-012.parquet:  48%|####8     | 32.0k/66.1k [00:00<?, ?B/s]

Upload file data/train/shard-020.parquet:  51%|#####     | 32.0k/63.1k [00:00<?, ?B/s]

Upload file data/train/shard-011.parquet:  66%|######6   | 32.0k/48.3k [00:00<?, ?B/s]

Upload file data/test/shard-038.parquet:  74%|#######3  | 32.0k/43.5k [00:00<?, ?B/s]

Upload file data/train/shard-015.parquet:  75%|#######5  | 32.0k/42.6k [00:00<?, ?B/s]

Upload file data/train/shard-003.parquet:  78%|#######8  | 32.0k/41.0k [00:00<?, ?B/s]

Upload file data/train/shard-027.parquet:  94%|#########3| 32.0k/34.1k [00:00<?, ?B/s]

Upload file data/train/shard-010.parquet: 100%|##########| 31.3k/31.3k [00:00<?, ?B/s]

Upload file data/train/shard-004.parquet: 100%|##########| 13.7k/13.7k [00:00<?, ?B/s]

To https://huggingface.co/datasets/nikolina-p/mini_gutenberg_splits
   0749a40..12e35fd  main -> main




In [None]:
from datasets import load_dataset
ds = load_dataset("nikolina-p/mini_gutenberg", split="train", streaming=True)
print(next(iter(ds)))

README.md:   0%|          | 0.00/863 [00:00<?, ?B/s]

Resolving data files:   0%|          | 0/39 [00:00<?, ?it/s]

{'id': '41496-8', 'text': 'E-text prepared by the Online Distributed Proofreading Team (http://www.pgdp.net) from page images generously made available by Internet Archive (http://archive.org)\n\nNote: Images of the original pages are available through  Internet Archive. See  http://archive.org/details/addison_00cour\n\nTranscriber\'s note:\n\n Text enclosed by underscores is in italics (_italics_).\n\n Text enclosed by curly brackets is superscripted  (example: y{e}).\n\nEnglish Men of Letters\n\nEdited by John Morley\n\nADDISON\n\nby\n\nW. J. COURTHOPE\n\nHarper & Brothers Publishers New York and London 1902\n\n * * * * *\n\nENGLISH MEN OF LETTERS.\n\nEDITED BY JOHN MORLEY.\n\n JOHNSON Leslie Stephen.  GIBBON J. C. Morison.  SCOTT R. H. Hutton.  SHELLEY J. A. Symonds.  HUME T. H. Huxley.  GOLDSMITH William Black.  DEFOE William Minto.  BURNS J. C. Shairp.  SPENSER R. W. Church.  THACKERAY Anthony Trollope.  BURKE John Morley.  MILTON Mark Pattison.  HAWTHORNE Henry James, Jr.  SOUTHE

## **Test: check for duplicate books in manu/project_gutenberg**

In [None]:
from datasets import load_dataset

ds = load_dataset("manu/project_gutenberg", split="en", streaming=True)

stream = iter(ds)

Resolving data files:   0%|          | 0/52 [00:00<?, ?it/s]

In [None]:
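# Collect the first 20 cases in which two consecutive records share an ID,
# and record whether their texts are identical as well
count = []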
previous_id, previous_text = None, ""
stream_b = iter(ds)

while len(count) < 20:
    book = next(stream_b)
    book_id, book_text = book["id"], book["text"]

    if book_id == previous_id:
        dupl = {
            "id": book_id,
            "dupl": book_text == previous_text,
        }
        count.append(dupl)

    previous_id, previous_text = book_id, book_text

In [None]:
count

[{'id': '41496-8', 'dupl': True},
 {'id': '25056-8', 'dupl': True},
 {'id': '48941-8', 'dupl': True},
 {'id': '35338-8', 'dupl': True},
 {'id': '21060-8', 'dupl': True},
 {'id': '22782-8', 'dupl': True},
 {'id': '59467-8', 'dupl': True},
 {'id': '18503-8', 'dupl': True},
 {'id': '30837-8', 'dupl': True},
 {'id': '14456-8', 'dupl': True},
 {'id': '18449-8', 'dupl': True},
 {'id': '36297-8', 'dupl': True},
 {'id': '31200-8', 'dupl': True},
 {'id': '57826-8', 'dupl': True},
 {'id': '24551-8', 'dupl': True},
 {'id': '28570-8', 'dupl': True},
 {'id': '13504-8', 'dupl': True},
 {'id': '39477-8', 'dupl': True},
 {'id': '26954-8', 'dupl': True},
 {'id': '49411-8', 'dupl': True}]
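
All 20 repeated IDs found above carry identical text, so a streaming pass that keeps only the first record of each run of equal IDs would remove them. A minimal sketch under that assumption; the helper `drop_consecutive_duplicates` is illustrative and not necessarily how the cleaned dataset was produced, and it only catches duplicates that are adjacent in the stream:

```python
def drop_consecutive_duplicates(records, key="id"):
    """Yield records, skipping any whose key equals the previous record's key."""
    previous = object()  # sentinel that never equals a real ID
    for record in records:
        if record[key] != previous:
            yield record
        previous = record[key]


# Toy records standing in for the streamed books
books = [
    {"id": "41496-8", "text": "a"},
    {"id": "41496-8", "text": "a"},
    {"id": "7529-0", "text": "b"},
]
print([b["id"] for b in drop_consecutive_duplicates(books)])
```

Because it is a generator, the function can wrap a streaming dataset without loading it into memory.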

## ...