## Creating the training corpus

This notebook downloads the most popular Catalan datasets from projecte-aina and builds a training corpus from them. The datasets are:
- OSCAR
- Catalan Wikipedia

Although these are among the best options for training an LLM, they are still not sufficiently filtered and preprocessed, so the file best_catalan_dataset
will try to create a better dataset for training. This is crucial for small LLMs, which need a good dataset to perform well.

In [1]:
from datasets import load_dataset

# Load the dataset
dataset = load_dataset('projecte-aina/catalan_general_crawling', trust_remote_code=True)

# Join all documents of the 'train' split into a single string (no further preprocessing is done here)
corpus_text = " ".join(dataset['train']['text'])

# Storing different data sizes from the full dataset

In [1]:
def limit_dataset_size(corpus_text, size_mb):
    max_bytes = int(size_mb * 1024 * 1024)  # Convert MB to bytes (int so fractional sizes can be used as a slice index)
    encoded_text = corpus_text.encode('utf-8')
    limited_text = encoded_text[:max_bytes].decode('utf-8', errors='ignore')
    return limited_text
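
Because the cut is made at a byte offset, it can land in the middle of a multi-byte UTF-8 character, and Catalan text contains many of them (`ç`, `à`, `í`). With `errors='ignore'` the incomplete trailing bytes are dropped instead of raising an error. A minimal sketch of that behaviour follows, using a hypothetical helper `truncate_utf8` (not part of this notebook) that takes the byte budget directly so the effect is visible on a tiny string:

```python
def truncate_utf8(text, max_bytes):
    """Hypothetical helper: keep at most max_bytes bytes of UTF-8,
    silently dropping a character that was cut in half."""
    encoded = text.encode("utf-8")
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

sample = "\u00e7a"  # 'ç' takes 2 bytes in UTF-8, 'a' takes 1

one_byte = truncate_utf8(sample, 1)     # the cut falls inside 'ç'
two_bytes = truncate_utf8(sample, 2)    # exactly the bytes of 'ç'
three_bytes = truncate_utf8(sample, 3)  # the whole string fits

print(repr(one_byte), repr(two_bytes), repr(three_bytes))
```

The result can therefore be slightly shorter than the requested size, but it is always valid text.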

In [2]:
def saving_text_to_file(text, filename):
    with open(f"../data/{filename}.txt", "w", encoding="utf-8") as f:
        f.write(text)


# Limit the dataset size to X MB
# corpus_text = limit_dataset_size(corpus_text, 10)
# saving_text_to_file(corpus_text, "tiny_corpus")

In [None]:
# Define the function to save dataset texts to a .txt file
def save_text_to_file(texts, file_path):
    with open(file_path, "w", encoding="utf-8") as f:
        for line in texts:
            if not line:  # Skip missing (None) or empty entries
                continue
            # Replace newlines within articles so each document occupies exactly one line
            f.write(line.replace("\n", " ") + "\n")

In [3]:
# Load the original dataset
with open('../data/catalan_oscar.txt', 'r', encoding='utf-8') as f:
    corpus_text = f.read()

# Limit the dataset size to 50 MB
limited_text = limit_dataset_size(corpus_text, 50)
saving_text_to_file(limited_text, "small_catalan_oscar")
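
Reading the file with `f.read()` holds the full corpus in memory before truncating it. A possible alternative, sketched here under the assumption that cutting at a line boundary is acceptable, copies whole lines until a byte budget is reached. The helper name `copy_first_bytes` and the temporary demo files are illustrative and not part of this notebook:

```python
import os
import tempfile

def copy_first_bytes(src_path, dst_path, max_bytes):
    """Hypothetical helper: copy whole lines from src to dst
    without exceeding max_bytes; returns the bytes written."""
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # a binary file yields one line at a time
            if written + len(line) > max_bytes:
                break
            dst.write(line)
            written += len(line)
    return written

# Tiny demonstration on temporary files
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "corpus.txt")
dst_path = os.path.join(tmp_dir, "small_corpus.txt")
with open(src_path, "wb") as f:
    f.write("primera línia\nsegona línia\ntercera línia\n".encode("utf-8"))

copied = copy_first_bytes(src_path, dst_path, max_bytes=35)
with open(dst_path, "rb") as f:
    small_text = f.read().decode("utf-8")

print(copied, repr(small_text))
```

For the 50 MB case the budget would be `50 * 1024 * 1024`. Since only whole lines are copied, no character is ever split.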

In [2]:
# Load the OSCAR dataset for Catalan
oscar_dataset = load_dataset("oscar", "unshuffled_deduplicated_ca", split="train")
oscar_text = [example['text'] for example in oscar_dataset]
save_text_to_file(oscar_text, "../data/catalan_oscar.txt")  # Same path that the size-limiting cell reads from

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Generating train split:   0%|          | 0/2458067 [00:00<?, ? examples/s]

In [5]:
import bz2
import xml.etree.ElementTree as ET

import requests

# Define the function to save dataset texts to a .txt file
def save_text_to_file(texts, file_path):
    with open(file_path, "w", encoding="utf-8") as f:
        for line in texts:
            if not line:  # Skip pages whose <text> element is empty (elem.text is None)
                continue
            # Replace newlines within articles so each article occupies exactly one line
            f.write(line.replace("\n", " ") + "\n")

# Function to download the Wikipedia dump
def download_wikipedia_dump(url, output_file):
    # Stream the response to disk instead of holding the whole dump in memory
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(output_file, 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                file.write(chunk)
    print(f"Downloaded Wikipedia dump to {output_file}")

# Function to stream the raw text of each page from the compressed dump
def parse_wikipedia_dump(dump_file):
    with bz2.open(dump_file, 'rb') as f:
        context = ET.iterparse(f, events=('end',))
        for event, elem in context:
            # Empty <text> elements have elem.text == None, so skip them
            if elem.tag.endswith('text') and elem.text:
                yield elem.text
            elem.clear()

# Download Wikipedia dump
wikipedia_dump_url = 'https://dumps.wikimedia.org/cawiki/latest/cawiki-latest-pages-articles.xml.bz2'
wikipedia_dump_file = 'cawiki-latest-pages-articles.xml.bz2'
download_wikipedia_dump(wikipedia_dump_url, wikipedia_dump_file)

# Parse Wikipedia dump, writing pages as they are parsed instead of building a list first
save_text_to_file(parse_wikipedia_dump(wikipedia_dump_file), "catalan_wikipedia.txt")

Downloaded Wikipedia dump to cawiki-latest-pages-articles.xml.bz2
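
Pages in a MediaWiki dump can contain an empty `<text>` element, for which `elem.text` is `None`, so any code that calls string methods on the parsed text has to skip those entries. A minimal sketch of the streaming parse follows, run on a made-up in-memory sample rather than the real dump. The sample XML, its namespace version and the helper name `iter_page_texts` are assumptions for illustration only:

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of a MediaWiki export; page B has an empty <text> element
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>A</title><revision><text>Primer article</text></revision></page>
  <page><title>B</title><revision><text /></revision></page>
  <page><title>C</title><revision><text>Tercer article</text></revision></page>
</mediawiki>"""

def iter_page_texts(xml_file):
    """Yield the non-empty content of every <text> element, one at a time."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        # Tags carry the namespace ('{http://...}text'), so compare the local name
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text" and elem.text:
            yield elem.text
        elem.clear()  # release the element once it has been handled

texts = list(iter_page_texts(io.BytesIO(SAMPLE_XML)))
print(texts)
```

The yielded strings are still raw wikitext. Stripping the markup is a separate cleaning step that this sketch does not attempt.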