<a href="https://colab.research.google.com/github/piesauce/llm-playbooks/blob/ateng%2Fc4_analysis/Chapter2/C4_Dataset_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# C4 Dataset Vocabulary Analysis

Using the ‘realnewslike’ subset of C4, build a word frequency counter that records how many times each word appears in the dataset. To keep things simple, define a word as a contiguous sequence of characters separated by whitespace. Exclude frequent function words (called stop words in NLP), such as ‘the’ and ‘is’, from the analysis. In your opinion, which topics seem to be underrepresented or overrepresented?
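
As a minimal sketch of the counting rule described above, the snippet below lowercases the text, splits it on whitespace, and drops stop words. The tiny `STOP_WORDS` set and the `count_words` helper are hypothetical names introduced for illustration only; the notebook itself uses NLTK's English stop word list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word set for illustration;
# the notebook itself relies on NLTK's English list.
STOP_WORDS = {"the", "is", "a", "on", "and"}


def count_words(texts):
    """Count whitespace-separated, lowercased words, skipping stop words."""
    counter = Counter()
    for text in texts:
        # str.split() with no arguments splits on any run of whitespace
        words = text.lower().split()
        counter.update(w for w in words if w not in STOP_WORDS)
    return counter


counts = count_words(["The market is open", "the market rallied on Monday"])
print(counts.most_common(1))  # the single most frequent remaining word
```

Because the definition of a word is purely whitespace-based, tokens such as `market,` and `market` would be counted separately unless punctuation is stripped first, which is what the notebook does below.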

In [1]:
!pip install datasets nltk



In [2]:
import nltk
from nltk.corpus import stopwords
from collections import Counter
from datasets import load_dataset
import string

nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
def main():
    # Load the 'realnewslike' subset of the C4 dataset
    try:
        dataset = load_dataset('allenai/c4', 'realnewslike', split='train')
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return

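    # Define a translation table to remove ASCII punctuation
    translator = str.maketrans('', '', string.punctuation)

    # Initialize stop words, stripping punctuation from them as well so that
    # contractions such as "don't" still match once the text has been stripped ("dont")
    stop_words = {word.translate(translator) for word in stopwords.words('english')}

    # Initialize a counter for word frequencies
    word_counter = Counter()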

    # Iterate through the dataset and count word frequencies
    for idx, example in enumerate(dataset):
        text = example['text']
        # Remove punctuation
        text = text.translate(translator)
        # Tokenize by whitespace
        words = text.lower().split()
        # Remove stop words
        filtered_words = [word for word in words if word not in stop_words]
        # Update the counter
        word_counter.update(filtered_words)

        # Print progress every 10000 examples
        if (idx + 1) % 10000 == 0:
            print(f"Processed {idx + 1} examples...")

    # Get the top 100 most common words
    top_n = 100
    most_common = word_counter.most_common(top_n)

    # Display the results
    print(f"\nTop {top_n} most frequent words (excluding stop words):")
    for rank, (word, freq) in enumerate(most_common, start=1):
        print(f"{rank}. {word}: {freq}")

    # Analyze topic representation using the full counter, since many topic
    # keywords never reach the top 100 most common words
    analyze_topics(word_counter)

def analyze_topics(word_freqs):
    """
    Provide a basic analysis of overrepresented or underrepresented topics
    based on the frequencies of a few keywords per topic.

    word_freqs maps each word to its count (for example, a Counter).
    """
    # Define some illustrative categories and associated keywords. This could be done
    # with ML (e.g. LDA topic modeling), but because this runs on the free tier of
    # Google Colab, we just use a set of predefined words for each topic.
    topics = {
        'Politics': ['government', 'president', 'election', 'policy', 'congress'],
        'Economy': ['market', 'economy', 'stock', 'trade', 'investment'],
        'Technology': ['technology', 'software', 'internet', 'device', 'computer'],
        'Health': ['health', 'medical', 'doctor', 'disease', 'hospital'],
        'Entertainment': ['movie', 'music', 'television', 'celebrity', 'show'],
        'Sports': ['game', 'team', 'score', 'player', 'season'],
        'Environment': ['environment', 'climate', 'energy', 'pollution', 'conservation'],
        'Education': ['education', 'school', 'university', 'student', 'teacher'],
        'Science': ['science', 'research', 'study', 'experiment', 'theory'],
        'International': ['country', 'international', 'foreign', 'diplomacy', 'global']
    }

    # Sum the counts of each topic's keywords
    topic_freq = {
        topic: sum(word_freqs.get(keyword, 0) for keyword in keywords)
        for topic, keywords in topics.items()
    }

    # Determine overrepresented and underrepresented topics
    # For simplicity, we'll define overrepresented as top 3 and underrepresented as bottom 3
    sorted_topics = sorted(topic_freq.items(), key=lambda x: x[1], reverse=True)
    overrepresented = sorted_topics[:3]
    underrepresented = sorted_topics[-3:]

    print("\nTopic Representation Analysis:")
    print("\nOverrepresented Topics:")
    for topic, freq in overrepresented:
        print(f"- {topic}: {freq} mentions")

    print("\nUnderrepresented Topics:")
    for topic, freq in underrepresented:
        print(f"- {topic}: {freq} mentions")

if __name__ == "__main__":
    main()
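
Loading the full split downloads all 512 training shards before counting begins, which can be slow on a free Colab instance. One possible alternative, sketched below under the assumption that a sample of the data is enough for this exercise, is to pass `streaming=True` to `load_dataset` and process only the first `n` examples. The helper names `count_first_n` and `stream_realnewslike` are hypothetical and not part of the original notebook.

```python
from collections import Counter
from itertools import islice


def count_first_n(examples, n, stop_words=frozenset()):
    """Count words in the first `n` examples of any iterable of {'text': ...} dicts."""
    counter = Counter()
    for example in islice(examples, n):
        words = example["text"].lower().split()
        counter.update(w for w in words if w not in stop_words)
    return counter


def stream_realnewslike():
    """Return a lazily streamed view of the train split (assumes network access)."""
    # Imported here so the helper above works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("allenai/c4", "realnewslike", split="train", streaming=True)


# Usage on Colab (not executed here because it needs network access):
# counts = count_first_n(stream_realnewslike(), n=100_000)

# Small offline check with made-up records
sample = [
    {"text": "Election results announced"},
    {"text": "Election night"},
    {"text": "ignored"},
]
print(count_first_n(sample, n=2)["election"])  # prints 2
```

A sample taken from the start of the stream is not guaranteed to be representative of the whole subset, so any conclusions about topic balance drawn from it should be treated as tentative.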


c4-train.00356-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00357-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00358-of-00512.json.gz:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

c4-train.00359-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00360-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00361-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00362-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00363-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00364-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00365-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00366-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00367-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00368-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00369-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00370-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00371-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00372-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00373-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00374-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00375-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00376-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00377-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00378-of-00512.json.gz:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

c4-train.00379-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00380-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00381-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00382-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00383-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00384-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00385-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00386-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00387-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00388-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00389-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00390-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00391-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00392-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00393-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00394-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00395-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00396-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00397-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00398-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00399-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00400-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00401-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00402-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00403-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00404-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00405-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00406-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00407-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00408-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00409-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00410-of-00512.json.gz:   0%|          | 0.00/30.5M [00:00<?, ?B/s]

c4-train.00411-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00412-of-00512.json.gz:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

c4-train.00413-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00414-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00415-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00416-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00417-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00418-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00419-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00420-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00421-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00422-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00423-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00424-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00425-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00426-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00427-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00428-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00429-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00430-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00431-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00432-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00433-of-00512.json.gz:   0%|          | 0.00/30.7M [00:00<?, ?B/s]

c4-train.00434-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00435-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00436-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00437-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00438-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00439-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00440-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00441-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00442-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00443-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00444-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00445-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00446-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00447-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00448-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00449-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00450-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00451-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00452-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00453-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00454-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00455-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00456-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00457-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00458-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00459-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00460-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00461-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00462-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00463-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00464-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00465-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00466-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00467-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00468-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00469-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00470-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00471-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00472-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00473-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00474-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00475-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00476-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00477-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00478-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00479-of-00512.json.gz:   0%|          | 0.00/30.6M [00:00<?, ?B/s]

c4-train.00480-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00481-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00482-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00483-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00484-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00485-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00486-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00487-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00488-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00489-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00490-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00491-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00492-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00493-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00494-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00495-of-00512.json.gz:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

c4-train.00496-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00497-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00498-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00499-of-00512.json.gz:   0%|          | 0.00/29.8M [00:00<?, ?B/s]

c4-train.00500-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00501-of-00512.json.gz:   0%|          | 0.00/30.1M [00:00<?, ?B/s]

c4-train.00502-of-00512.json.gz:   0%|          | 0.00/30.4M [00:00<?, ?B/s]

c4-train.00503-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00504-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00505-of-00512.json.gz:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

c4-train.00506-of-00512.json.gz:   0%|          | 0.00/30.3M [00:00<?, ?B/s]

c4-train.00507-of-00512.json.gz:   0%|          | 0.00/30.2M [00:00<?, ?B/s]

c4-train.00508-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00509-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-train.00510-of-00512.json.gz:   0%|          | 0.00/29.7M [00:00<?, ?B/s]

c4-train.00511-of-00512.json.gz:   0%|          | 0.00/29.9M [00:00<?, ?B/s]

c4-validation.00000-of-00001.json.gz:   0%|          | 0.00/15.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13799838 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13863 [00:00<?, ? examples/s]

Loading dataset shards:   0%|          | 0/76 [00:00<?, ?it/s]

Processed 10000 examples...
Processed 20000 examples...
Processed 30000 examples...
Processed 40000 examples...
Processed 50000 examples...
Processed 60000 examples...
Processed 70000 examples...
Processed 80000 examples...
Processed 90000 examples...
Processed 100000 examples...
Processed 110000 examples...
Processed 120000 examples...
Processed 130000 examples...
Processed 140000 examples...
Processed 150000 examples...
Processed 160000 examples...
Processed 170000 examples...
Processed 180000 examples...
Processed 190000 examples...
Processed 200000 examples...
Processed 210000 examples...
Processed 220000 examples...
Processed 230000 examples...
Processed 240000 examples...
Processed 250000 examples...
Processed 260000 examples...
Processed 270000 examples...
Processed 280000 examples...
Processed 290000 examples...
Processed 300000 examples...
Processed 310000 examples...
Processed 320000 examples...
Processed 330000 examples...
Processed 340000 examples...
Processed 350000 exampl