# Wikisource Dataset Processing
---
We'll use the [English Wikisource Dataset](https://wikimedia.bringyour.com/enwikisource/20240320/?C=S&O=D) (Pages & Articles, 2.9GB compressed) to train a tokenizer and word embedding model. Unzipped, the dataset is 13GB, which is too large for my laptop to work with efficiently. The data also unzips to XML, which needs processing and cleaning before we can use it for our model.

This workbook lets us visualize the data, define our text cleaning operation, and convert the larger dataset into Parquet chunks of a specified size (e.g., 50MB Parquet files). We're using Parquet files because they allow efficient data access and streaming, which we'll handle with PyTorch Dataset and DataLoader objects during training.

## Import Dependencies
---

In [1]:
import xml.etree.ElementTree as ET
from collections import defaultdict
import re
import pandas as pd
import os
import random

## Understand Structure of Our Data
---
Because our data is stored in XML, we're using the 'ElementTree' module from Python's XML library. Its 'iterparse' function reads the file incrementally, so we never have to load the entire file at once (doing so would cause an OOM error).   
The data we want to extract is in the 'text' child of each 'revision' tag.
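
As a minimal sketch of that incremental pattern, the snippet below parses a tiny in-memory document instead of the real dump. The sample XML and the helper name `iter_page_texts` are made up for illustration; only the tag layout and the export-0.10 namespace match what the notebook uses further down.

```python
import io
import xml.etree.ElementTree as ET

# Namespace used by the dump's tags (same value as in the cells below)
NS = {"mw": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical two-page stand-in for the 13GB file
SAMPLE_XML = f"""<mediawiki xmlns="{NS['mw']}">
  <page><title>A</title><revision><text>First text</text></revision></page>
  <page><title>B</title><revision><text>Second text</text></revision></page>
</mediawiki>""".encode("utf-8")


def iter_page_texts(source):
    """Yield (title, text) for each <page>, one page at a time."""
    for _, elem in ET.iterparse(source, events=("end",)):
        # An "end" event fires once the element and its children are fully read
        if elem.tag == f"{{{NS['mw']}}}page":
            title = elem.findtext("mw:title", namespaces=NS)
            text = elem.findtext("mw:revision/mw:text", namespaces=NS)
            yield title, text
            elem.clear()  # drop the page's children once we are done with it


pages = list(iter_page_texts(io.BytesIO(SAMPLE_XML)))
print(pages)
```

Listening for "end" events matters here: on a "start" event the element's text and children may not have been read yet.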

In [2]:
# First - define the path to your data file (in our case, an XML)
path = "enwikisource-20240320-pages-articles.xml"

In [3]:
def explore_xml_structure_modified(file_path, skip_elements=5000, max_elements=1000):
    """
    Explore the XML structure by printing out the names of different element tags
    and their counts, skipping the first 'skip_elements' and then considering
    the next 'max_elements' within the file.

    :param file_path: Path to the XML file.
    :param skip_elements: Number of elements to skip before starting to count.
    :param max_elements: Maximum number of elements to explore after skipping.
    """
    tag_counts = defaultdict(int)
    total_elements_processed = 0  # Total elements processed, including those skipped

    for _, elem in ET.iterparse(file_path, events=("start",)):
        # Skip the first 'skip_elements'
        if total_elements_processed < skip_elements:
            total_elements_processed += 1
            elem.clear()
            continue  # Skip the rest of the loop and proceed to the next element
        
        # Start counting after skipping
        tag_counts[elem.tag] += 1
        total_elements_processed += 1
        
        # Break the loop if we have processed 'max_elements' after skipping
        if total_elements_processed >= skip_elements + max_elements:
            break
        
        elem.clear()  # Clear the element to save memory

    # Print the discovered tags and their counts
    for tag, count in tag_counts.items():
        print(f"Tag: {tag}, Count: {count}")

# Uncomment to run code
# --------------------------------------------
# explore_xml_structure_modified(path)

        
## --TAKEAWAY-- ## 
# From this, we can see the different tags that are in play. Printing out the whole element shows that what we're looking for is in the 'revision' tag, under its 'text' child element. 
# From here, we'll need to further process our data so that we can pull just the Wikisource text.

## Define and Test Extraction and Processing Functions
---
Let's split this into a few steps.  
1. **Random Samples**: Let's take a semi-random sample of 100 examples to test our Processing Function. (Note: a proper random sample would require looping through the whole dataset, which is costly, so we'll take a semi-random sample from the first ~100k articles.)
2. **Processing Function**: Let's write a processing function (optimized to run quickly) and test it on our random samples.
3. **Extraction Function**: A generator function that loops through our dataset and yields lists of articles containing approximately a defined size (in MB) of text. ('yield' works like 'return', except the function resumes where it left off on the next request.) Each list is then saved as a Parquet file.  
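
To make the 'yield' behaviour in step 3 concrete, here is a toy sketch that groups strings into batches by a character budget. The function name `batch_by_chars` and the sample strings are illustrative assumptions; the real extraction function below applies the same idea to cleaned articles.

```python
def batch_by_chars(texts, chars_per_batch):
    """Group strings into lists whose total length reaches chars_per_batch."""
    batch, chars = [], 0
    for text in texts:
        batch.append(text)
        chars += len(text)
        if chars >= chars_per_batch:
            yield batch           # hand the batch to the caller, then resume here
            batch, chars = [], 0  # start a fresh batch
    if batch:                     # leftover texts that did not fill a batch
        yield batch


batches = list(batch_by_chars(["aaaa", "bb", "cccccc", "d"], chars_per_batch=5))
print(batches)  # [['aaaa', 'bb'], ['cccccc'], ['d']]
```

Because the batches are produced lazily, only one batch needs to sit in memory at a time when the caller consumes them in a loop.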

### 1) Random Samples

In [4]:
def rand_sampler(path, num_samples):
    """
        Function that iterates through our dataset. If a random number comes below a threshold and the article being currently processed is not None, it will append it to a sample list.
        Returns said sample list.

        Inputs:
            path:        Filepath to dataset
            num_samples: Length of list to be returned (number of samples sampled)

        Returns list of len=num_samples of sampled articles
    """
    threshold = 0.05
    samples = []
    
    # Namespace dict to handle the XML namespace in tags
    ns = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

    for event, elem in ET.iterparse(path, events=("end",)):
        if len(samples) >= num_samples:
            break
        if elem.tag == f"{{{ns['mw']}}}page":
            for revision in elem.findall(f"{{{ns['mw']}}}revision"):
                text_element = revision.find(f"{{{ns['mw']}}}text")
                if text_element is not None and text_element.text is not None:
                    if random.random() < threshold:
                        samples.append(text_element.text)
            elem.clear()  # This frees up memory

    return samples
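
For comparison, a uniform sample can be drawn in a single pass with bounded memory using reservoir sampling (Algorithm R). This is not what the notebook does, since it would still require reading the whole dump once; it is only a sketch of the alternative, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace an entry with probability k/(i+1)
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
print(len(sample))
```

The threshold-based sampler above stops early instead, which is why its samples all come from the start of the file.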

In [5]:
# Let's generate our random samples
text_samples = rand_sampler(path, num_samples = 100)

In [6]:
# Check our data
print(f"Num Samples: {len(text_samples)}\n{'-'*60}")
print(text_samples[0])

# We can see the structure. Some articles are short (we'll want to ignore these). There is also a 'header' at the start we want to remove. We'll experiment with our 'process_text' function
# on this dataset until we get a function that cleans our text properly

Num Samples: 100
------------------------------------------------------------
{{other versions|Much Ado About Nothing (Shakespeare)}}
{{header
 | title      = [[../]]
 | author     = William Shakespeare (1564-1616) | override_author = [[Author:William Shakespeare (1564-1616)|William Shakespeare]]
 | translator = 
 | section    = Much adoe about Nothing
 | previous   = [[Shakespeare - First Folio facsimile (1910)/The Comedy of Errors|The Comedie of Errors]]
 | next       = [[Shakespeare - First Folio facsimile (1910)/Loves Labour's lost|Loues Labour's lost]]
 | notes      = 
}}
{{AuxTOC|title=Acts|width=22em|
*[[Shakespeare - First Folio facsimile (1910)/Much adoe about Nothing/Act 1|Act I]]
*[[Shakespeare - First Folio facsimile (1910)/Much adoe about Nothing/Act 2|Act II]]
*[[Shakespeare - First Folio facsimile (1910)/Much adoe about Nothing/Act 3|Act III]]
*[[Shakespeare - First Folio facsimile (1910)/Much adoe about Nothing/Act 4|Act IV]]
*[[Shakespeare - First Folio facsimile (1910

### 2) Processing Function

In [7]:
def text_preprocessing(text):
    # First, handle special cases by removing lines starting with *
    text = '\n'.join(line for line in text.split('\n') if not line.strip().startswith('*'))
    
    cleaned_text = []
    stack = []
    for char in text:
        if char in '{[<':
            stack.append(char)
        elif char in '}]>':
            if stack:
                last_open = stack.pop()
                if (char == '}' and last_open != '{') or \
                   (char == ']' and last_open != '[') or \
                   (char == '>' and last_open != '<'):
                    pass    # Handle syntax error or unbalanced brackets if needed
        else:
            if not stack:
                cleaned_text.append(char)
    
    cleaned_text_str = ''.join(cleaned_text)
    
    # Removing content between '__' delimiters
    while '__' in cleaned_text_str:
        open_index = cleaned_text_str.find('__')
        close_index = cleaned_text_str.find('__', open_index + 2)
        if close_index == -1:
            break
        cleaned_text_str = cleaned_text_str[:open_index] + cleaned_text_str[close_index + 2:]
        
    return cleaned_text_str


def clean_processed_text(text):
    # Replace custom heading formats with HTML tags
    text = re.sub(r'====(.*?)====', r'<h4>\1</h4>', text)      # Replace '====Heading 4====' with '<h4>Heading 4</h4>'
    text = re.sub(r'===(.*?)===', r'<h3>\1</h3>', text)        # Replace '===Heading 3===' with '<h3>Heading 3</h3>'
    text = re.sub(r'==(.*?)==', r'<h2>\1</h2>', text)          # Replace '==Heading 2==' with '<h2>Heading 2</h2>'
    text = re.sub(r'=(.*?)=', r'<h1>\1</h1>', text)            # Replace '=Heading 1=' with '<h1>Heading 1</h1>'
    
    text = text.strip('\n ')                                   # Strip leading and trailing newline characters and spaces
    text = [line.lstrip(': ') for line in text.split('\n') ]   # Remove leading colons and spaces from each line
    text = '\n'.join(text)                                     # Rejoin cleaned lines
    text = re.sub(r'\n{3,}', '\n\n', text)                     # Replace three or more consecutive newlines with two newlines
    return text

def process_text(text):
    """ 
        Parent function - pass in a string and returns your cleaned_text and a boolean that can be used to discard short passages
        *Note* If you set the threshold for passage length to be 700, you discard ~1/3 of the passages
    """
    if text is None:
        return "", False
    preprocessed_text = text_preprocessing(text)
    cleaned_text = clean_processed_text(preprocessed_text)
    if len(cleaned_text) > 700:
        return cleaned_text, True
    else:
        return cleaned_text, False
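
The heading substitutions in 'clean_processed_text' run from the longest marker ('====') down to the shortest ('='), and that order matters. The sketch below, using a made-up two-line sample and a hypothetical helper `convert_headings`, applies the same four patterns in both orders to show what goes wrong when the single '=' rule runs first.

```python
import re

# Same patterns as clean_processed_text, longest marker first
HEADING_RULES = [
    (r'====(.*?)====', r'<h4>\1</h4>'),
    (r'===(.*?)===', r'<h3>\1</h3>'),
    (r'==(.*?)==', r'<h2>\1</h2>'),
    (r'=(.*?)=', r'<h1>\1</h1>'),
]


def convert_headings(text, rules):
    """Apply each (pattern, replacement) pair in the order given."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


sample = "==Title==\n===Sub==="
good = convert_headings(sample, HEADING_RULES)
bad = convert_headings(sample, list(reversed(HEADING_RULES)))

print(good)  # <h2>Title</h2> on the first line, <h3>Sub</h3> on the second
print(bad)   # the '=' rule fires first and leaves empty <h1></h1> tags
```

With the shortest rule first, each '==' is consumed as an empty level-1 heading before the longer patterns get a chance to match.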

In [8]:
# Function to clean our samples and print out a random cleaned sample from the list
cleaned_samples = []
for text in text_samples:
    cleaned_text, append_text = process_text(text)
    if append_text:
        cleaned_samples.append(cleaned_text)

print(f"Samples remaining: {len(cleaned_samples)}/{len(text_samples)}")
rand_idx = random.randrange(0, len(cleaned_samples))
print(f"Printing: cleaned_samples[{rand_idx}]:\n{'-'*60}")
print(cleaned_samples[rand_idx])

Samples remaining: 59/100
Printing: cleaned_samples[24]:
------------------------------------------------------------
<h1>Character</h1>

'', , ''

It is said of the ermine that it will suffer capture rather than allow pollution to touch its glossy coat, but take away that coat and the animal is worthless.

We have ermines in higher life&mdash;those who love display.  The desire to seem, rather than to be, is one of the faults which our age, as well as other ages, must deplore.

Appearance too often takes the place of reality&mdash;the stamp of the coin is there, and the glitter of the gold, but, after all, it is but a worthless wash.  Sham is carried into every department of life, and we are being corrupted by show and surface.  We are too apt to judge people by what they have, rather than by what they are; we have too few Hamlets who are bold enough to proclaim, "I know not seem!"

The counterfeit, however, only proves the value of the coin, and, although reputation may in some degre

### 3) Extraction Function

In [9]:
def extract_clean_texts(file_path, start_num=0, batch_size_MB=50, testing_fn=True):
    """
    A generator that iterates through an XML file and yields batches of cleaned text strings.

    Input: 
        file_path:     Path to XML file.
        start_num:     Specify which passage to start at (i.e., discards first 'start_num' passages).
        batch_size_MB: Appx. size per batch in MB.
        testing_fn:    If True, stop after 10,000 pages so a test run doesn't iterate over the whole dataset.

    Yields lists of cleaned strings totalling roughly 'batch_size_MB' each until the end of the file is reached.
    """
    chars_per_batch = batch_size_MB*(1024**2) / 0.6     # From quick empirical analysis - imperfect but close
    chars_processed = 0
    pages_processed = 0
    extracted_texts = []

    # Namespace dict to handle the XML namespace in tags
    ns = {'mw': 'http://www.mediawiki.org/xml/export-0.10/'}

    for event, elem in ET.iterparse(file_path, events=("end",)):
        if testing_fn:
            if pages_processed == 10000:
                break      # Ensures you don't iterate over the whole dataset
        if elem.tag == f"{{{ns['mw']}}}page":
            pages_processed += 1

            if pages_processed > start_num:
                for revision in elem.findall(f"{{{ns['mw']}}}revision"):
                    text_element = revision.find(f"{{{ns['mw']}}}text")
                    if text_element is not None:
                        clean_text, append_text = process_text(text_element.text)  # Assuming process_text returns a tuple
                        if append_text:
                            extracted_texts.append(clean_text)
                            chars_processed += len(clean_text)
                            if chars_processed >= chars_per_batch:
                                chars_processed = 0
                                yield extracted_texts
                                extracted_texts = []  # Reset for the next batch
                
            elem.clear()  # This helps in freeing up memory
    
    # Yield any remaining texts if they don't fill up a complete batch
    if extracted_texts:
        yield extracted_texts
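
As a worked example of the 'chars_per_batch' line, the sketch below reads the 0.6 as an approximate ratio of bytes on disk per character of cleaned text. That reading is my interpretation of the "quick empirical analysis" comment, not something the notebook states. The helper name `chars_for_target_size` is hypothetical.

```python
def chars_for_target_size(batch_size_mb, bytes_per_char=0.6):
    """Number of characters expected to produce a file of about batch_size_mb."""
    target_bytes = batch_size_mb * 1024**2  # MB -> bytes
    return target_bytes / bytes_per_char


chars_50mb = chars_for_target_size(50)
print(f"{chars_50mb:,.0f} characters")  # 87,381,333 characters
```

So a 50MB target corresponds to roughly 87 million characters of cleaned text per batch under this assumption.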

In [10]:
def save_chunk_as_parquet(chunk, batch_number):
    """ 
        Function that takes a list of cleaned texts and saves it as a Parquet file
    """
    directory = 'chunked_data'
    os.makedirs(directory, exist_ok=True)
    filename = f'{directory}/enwiki_20240320_{batch_number}.parquet'
    structured_chunk = [[text] for text in chunk]  # Convert each string to a list
    chunk_df = pd.DataFrame(structured_chunk, columns=['text'])
    chunk_df.to_parquet(filename)
    print(f'Saved {filename}')

In [14]:
# Run to test if our code is working properly
# for i, clean_list in enumerate(extract_clean_texts(path, start_num=0, batch_size_MB=10, testing_fn=True)):
#     print(f"Batch {i+1} has {len(clean_list)} articles in it.")
#     if i == 1:
#         save_chunk_as_parquet(clean_list, "testbatch")    # Test export of our parquet file 

Batch 1 has 779 articles in it.
Batch 2 has 831 articles in it.
Saved chunked_data/enwiki_20240320_testbatch.parquet
Batch 3 has 545 articles in it.
Batch 4 has 821 articles in it.
Batch 5 has 1193 articles in it.
Batch 6 has 558 articles in it.


## Export and Process our Full Dataset
---
Now that we know our functions are working, let's process our full dataset and save it as Parquet files.

In [None]:
# Run when ready to clean the full dataset (this iterates over the entire file)
for i, chunk in enumerate(extract_clean_texts(path, start_num=0, batch_size_MB=50, testing_fn=False)):
    save_chunk_as_parquet(chunk, i+1)

Saved chunked_data/enwiki_20240320_1.parquet
Saved chunked_data/enwiki_20240320_2.parquet
Saved chunked_data/enwiki_20240320_3.parquet
Saved chunked_data/enwiki_20240320_4.parquet
Saved chunked_data/enwiki_20240320_5.parquet
Saved chunked_data/enwiki_20240320_6.parquet
Saved chunked_data/enwiki_20240320_7.parquet
Saved chunked_data/enwiki_20240320_8.parquet
Saved chunked_data/enwiki_20240320_9.parquet
Saved chunked_data/enwiki_20240320_10.parquet
Saved chunked_data/enwiki_20240320_11.parquet
Saved chunked_data/enwiki_20240320_12.parquet
Saved chunked_data/enwiki_20240320_13.parquet
Saved chunked_data/enwiki_20240320_14.parquet
Saved chunked_data/enwiki_20240320_15.parquet
Saved chunked_data/enwiki_20240320_16.parquet
Saved chunked_data/enwiki_20240320_17.parquet
Saved chunked_data/enwiki_20240320_18.parquet
Saved chunked_data/enwiki_20240320_19.parquet
Saved chunked_data/enwiki_20240320_20.parquet
Saved chunked_data/enwiki_20240320_21.parquet
Saved chunked_data/enwiki_20240320_22.parqu

## View and Analyze Parquet Files
---
Quick code to open a random Parquet file and view its contents to check that the export worked correctly.

In [27]:
directory = 'chunked_data'
parquet_files = [f for f in os.listdir(directory) if f.endswith('.parquet')]

if parquet_files:
    random_file = random.choice(parquet_files)
    # file_path = os.path.join(directory, "enwiki_20240320_1.parquet")
    file_path = os.path.join(directory, random_file)
    df = pd.read_parquet(file_path)
    
    df.info()
    print(f"\n{'-'*100}\n")
    random_row = df.sample(n=1).iloc[0]
    print(random_row['text']) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 831 entries, 0 to 830
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    831 non-null    object
dtypes: object(1)
memory usage: 6.6+ KB

----------------------------------------------------------------------------------------------------

TO what a cumbersome unwieldiness
And burdenous corpulence my love had grown,
But that I did, to make it less,
And keep it in proportion,
Give it a diet, made it feed upon
That which love worst endures, discretion

Above one sigh a day I allow'd him not,
Of which my fortune, and my faults had part;
And if sometimes by stealth he got
A she sigh from my mistress' heart,
And thought to feast upon that, I let him see
'Twas neither very sound, nor meant to me.

If he wrung from me a tear, I brined it so
With scorn and shame, that him it nourish'd not;
If he suck'd hers, I let him know
'Twas not a tear which he had got;
His drink was counterfe