# Preprocessing the data

The data is downloaded from https://www.kaggle.com/datasets/ltcmdrdata/plain-text-wikipedia-202011?resource=download

Before we can use it to train vectors, we need to do some preprocessing, which includes throwing away things that we do not need.

We also need a tokenization process that splits unknown words into known subword pieces, so that we have (almost) no out-of-vocabulary terms.

In [1]:
import os
import json
import re

## File management

Because the Wikipedia data is quite big (8 GB when zipped), I store it on an external hard disk. This requires some file management.

In [2]:
# set the path to the folder where the wiki dump is located.

wikidump_folder = "/media/hugo/Seagate Expansion Drive/wiki_dump"
if not (os.path.exists(wikidump_folder)):
    raise Exception(f"Folder {wikidump_folder} does not exist!")

In [3]:
# set the filename of the wiki dump

wikidump_filename = f"{wikidump_folder}/wikidump_2020_11.zip"
if not (os.path.exists(wikidump_filename)):
    raise Exception(f"File {wikidump_filename} does not exist!")

In [4]:
# Show number of files in wikidump zip file
import zipfile
with zipfile.ZipFile(wikidump_filename, 'r') as zip_ref:
    print(len(zip_ref.namelist()))

605


In [5]:
# wiki_texts_filename = f"{wikidump_folder}/wiki_texts.txt"

## Preprocessing and tokenization

First preprocess the file. Remove
* digits
* punctuation
* stop words

Also all text will be lowercased.

In [6]:
from nltk.corpus import stopwords

# build the stop word set once instead of on every call
stop_words = set(stopwords.words('english'))

def preprocess(text):

    # remove numbers
    text = re.sub(r'\d+', '', text)

    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # lowercase
    text = text.lower()

    # remove stopwords
    word_tokens = text.split()
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    text = " ".join(filtered_sentence)

    return text
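
To see what these steps do to a sentence, here is a small self-contained sketch of the same pipeline. It assumes a tiny, made-up stop word list (`STOP_WORDS`) instead of the NLTK list, and `preprocess_demo` is a hypothetical name used only for this illustration.

```python
import re

# Hypothetical, tiny stop word list for illustration only; the notebook uses
# NLTK's much larger English list.
STOP_WORDS = {"the", "in", "is", "a", "of", "don't"}

def preprocess_demo(text: str) -> str:
    text = re.sub(r"\d+", "", text)       # remove digits
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = text.lower()                   # lowercase
    # drop stop words and re-join on single spaces
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(preprocess_demo("The Eiffel Tower, built in 1889, is 330 m tall."))
# eiffel tower built m tall
print(preprocess_demo("Don't panic"))
# dont panic
```

The second example shows a side effect of the step order: punctuation is stripped before the stop word filter runs, so a contraction such as "don't" becomes "dont" and no longer matches an entry that is stored with its apostrophe. The same can happen with any stop word list that keeps apostrophes in its entries.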

For tokenization I use the WordPiece tokenizer. See https://huggingface.co/course/chapter6/6?fw=pt for a good resource on it.
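
To illustrate why this kind of tokenizer rarely produces out-of-vocabulary terms, here is a minimal sketch of greedy longest-match-first splitting. It is a simplified illustration, not the implementation inside the `tokenizers` library; `wordpiece_split` and `toy_vocab` are hypothetical, and the real vocabulary of `bert-base-cased` is far larger.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary for illustration only
toy_vocab = {"token", "##ization", "##ize", "##s", "word", "##piece"}

print(wordpiece_split("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_split("wordpieces", toy_vocab))    # ['word', '##piece', '##s']
print(wordpiece_split("xyz", toy_vocab))           # ['[UNK]']
```

A word only falls back to `[UNK]` when no sequence of pieces covers it, which is rare when the vocabulary also contains single characters.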

In [7]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")

def tokenize(text: str) -> list:
    # the returned tokens include the special [CLS] and [SEP] tokens
    return tokenizer.encode(text).tokens

## Applying the preprocessing

The following functions apply the preprocessing to all files.

In [8]:
# from tqdm import tqdm_notebook as tqdm
from tqdm.notebook import tqdm
import hashlib
from multiprocessing import Pool
# import sentence tokenizer
from nltk.tokenize import sent_tokenize

# get available cores (os.cpu_count() can return None)
cores = os.cpu_count() or 1
print(f"Available cores: {cores}")
# leave one core free, but always use at least one
use_cores = max(1, cores - 1)
print(f"Using {use_cores} cores")

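# current nice value (os.nice() adds its argument to the niceness and returns
# the new value, so an increment of 0 only reads it)
current_nice = os.nice(0)

# lower priority
# NOTE: an increment of 0 leaves the priority unchanged; a positive increment
# (for example os.nice(10)) is needed to actually lower it
os.nice(0)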

preprocessed_text_folder = f"{wikidump_folder}/preprocessed_texts"

# create the output folder if it does not exist yet
os.makedirs(preprocessed_text_folder, exist_ok=True)

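# helper for manual inspection: preprocess and tokenize one text and print the
# result (it is not called by the functions below)

def process_text(text):
    preprocessed_text = preprocess(text)
    # drop the first and last token, i.e. the special tokens added by the tokenizer
    tokens = tokenize(preprocessed_text)[1:-1]
    tokenized_text = " ".join(tokens)
    print(tokenized_text)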


# def save_preprocessed_text(text, zip_out):
#     # get hash of text
#     text_hash = hashlib.md5(text.encode()).hexdigest()

#     text_filename = f"{text_hash}.txt"

#     # save text to zip file


def process_json_file(json_filename, zip_ref):
    with zip_ref.open(json_filename) as f:
        items = json.loads(f.read())
        # for item in items:
        #     text = item["text"]
        #     preprocessed_text = preprocess(text)

        # preprocessed_texts = [preprocess(item["text"]) for item in tqdm(items, desc="Preprocessing texts")]

        # texts = [item["text"] for item in items]
        # split every article into sentences
        texts = [sent_tokenize(item["text"]) for item in items]
        # print(len(texts))
        # flatten list
        texts = [item for sublist in texts for item in sublist]
        # print(len(texts))

        with Pool(use_cores) as myPool:
            preprocessed_texts = myPool.map(preprocess, texts)

        # get hash of preprocessed texts
        preprocessed_texts_string = json.dumps(preprocessed_texts)
        preprocessed_texts_hash = hashlib.md5(preprocessed_texts_string.encode()).hexdigest()

        preprocessed_texts_filename = f"{preprocessed_text_folder}/{preprocessed_texts_hash}.json"
        if not (os.path.exists(preprocessed_texts_filename)):
            # use a context manager so the file is flushed and closed
            with open(preprocessed_texts_filename, 'w') as out_file:
                json.dump(preprocessed_texts, out_file)

        # print(f"Saved preprocessed texts to {preprocessed_texts_filename}")
        assert os.path.exists(preprocessed_texts_filename)
        # print(tokenized_text)


def process_archive(filename):
    # open the archive passed as argument (not the global variable)
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        json_filenames = zip_ref.namelist()
        for i, json_filename in tqdm(enumerate(json_filenames), total=len(json_filenames), desc="Processing json files"):
            # print(i,json_filename)
            # print(f"Processing {json_filename}")
            process_json_file(json_filename, zip_ref)

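try:
    process_archive(wikidump_filename)
finally:
    # os.nice() takes an increment, not an absolute value, so pass the
    # difference back to the original niceness (0 if the priority was never
    # changed; a negative increment needs elevated privileges)
    os.nice(current_nice - os.nice(0))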

Available cores: 4
Using 3 cores


Processing json files:   0%|          | 0/605 [00:00<?, ?it/s]

Processing enwiki20201020/00c2bfc7-57db-496e-9d5c-d62f8d8119e3.json
9982
264720
Processing enwiki20201020/00e58afe-3ef5-42a6-92f3-8ee7abf868e1.json
14679
265292
Processing enwiki20201020/0104b39c-7aa4-45cd-8d28-b05dc6bafdf2.json
15059
270475
Processing enwiki20201020/01472aab-d8c2-43aa-b510-e259d58cd9a4.json
12061
268622
Processing enwiki20201020/017a0674-613b-428d-8e3a-7dcf86b72edb.json
4534
265817
Processing enwiki20201020/0203e66f-4fda-4f79-b454-ef353914810b.json
15392
275273
Processing enwiki20201020/02343952-78b0-4e51-9d0f-ccfc6afd1eff.json
11824
245750
Processing enwiki20201020/0266c1a8-c9e9-4fe5-b271-85269772e88e.json
14197
262495
Processing enwiki20201020/034254a6-d1ba-4619-b7a8-4b3f2098f37f.json
13474
266731
Processing enwiki20201020/034c6cd6-4e94-4be4-96a0-9e077b2ed089.json
6299
262918
Processing enwiki20201020/03a621b9-c2e8-4a8a-a992-e6353ebd8ddc.json


KeyboardInterrupt: 
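
The run above was stopped by hand after a few files. Because `process_json_file` names its output after a hash of the preprocessed texts, it can only tell that a file was already handled after redoing the work. Below is a minimal sketch of one way to make reruns cheap. It assumes output files are named after the input file instead, which is an assumption and not what the code above does; `output_path_for`, `process_archive_resumable` and the `transform` parameter are hypothetical names.

```python
import io
import json
import os
import tempfile
import zipfile

def output_path_for(json_filename, out_folder):
    # Derive the output name from the input name, so it is known before any work is done
    return os.path.join(out_folder, os.path.basename(json_filename))

def process_archive_resumable(zip_path, out_folder, transform):
    """Process every JSON member of the zip, skipping members already written."""
    os.makedirs(out_folder, exist_ok=True)
    processed = []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for name in zip_ref.namelist():
            out_path = output_path_for(name, out_folder)
            if os.path.exists(out_path):
                continue  # finished in an earlier run
            with zip_ref.open(name) as f:
                items = json.load(f)
            result = [transform(item["text"]) for item in items]
            # write to a temporary name first, so an interrupted run never
            # leaves a half-written file under the final name
            tmp_path = out_path + ".tmp"
            with open(tmp_path, "w") as out:
                json.dump(result, out)
            os.replace(tmp_path, out_path)
            processed.append(name)
    return processed

# Demo with a small in-memory archive instead of the real dump
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("dump/a.json", json.dumps([{"text": "Hello World"}]))
    z.writestr("dump/b.json", json.dumps([{"text": "Second File"}]))

with tempfile.TemporaryDirectory() as out_folder:
    first = process_archive_resumable(buffer, out_folder, str.lower)
    second = process_archive_resumable(buffer, out_folder, str.lower)

print(first)   # ['dump/a.json', 'dump/b.json']
print(second)  # []
```

The second call does nothing because both outputs already exist, which is the behaviour one would want after a `KeyboardInterrupt`.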

## Zipping

Every JSON file is about 30 MB. Given that we have roughly 600 files, this adds up to about 18 GB.
To save space, I archive everything into a zip file and remove the original files.

In [None]:
# zip the preprocessed texts folder

import shutil

# base name of the archive; make_archive appends the ".zip" extension itself
preprocessed_zipfile = f"{wikidump_folder}/preprocessed_texts"

shutil.make_archive(preprocessed_zipfile, 'zip', preprocessed_text_folder)

'/media/hugo/Seagate Expansion Drive/wiki_dump/preprocessed_texts.zip'

When zipped, it is about **X** GB.
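
The zipped and unzipped sizes can be read from the archive itself. The following is a small sketch, assuming a zip produced by `shutil.make_archive`; `archive_sizes` is a hypothetical helper, and the demo folder is a throwaway stand-in for the real preprocessed folder.

```python
import os
import shutil
import tempfile
import zipfile

def archive_sizes(zip_path):
    """Return (uncompressed_bytes, compressed_bytes) summed over all members."""
    with zipfile.ZipFile(zip_path) as z:
        infos = z.infolist()
        raw = sum(info.file_size for info in infos)
        zipped = sum(info.compress_size for info in infos)
    return raw, zipped

# Demo on a temporary folder with one highly repetitive file
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "src")
    os.mkdir(src)
    with open(os.path.join(src, "texts.json"), "w") as f:
        f.write("wikipedia " * 10_000)  # 100,000 bytes of text
    zip_path = shutil.make_archive(os.path.join(work, "archive"), "zip", src)
    raw_bytes, zipped_bytes = archive_sizes(zip_path)

print(f"unzipped: {raw_bytes} bytes, zipped: {zipped_bytes} bytes")
```

Dividing the two totals by `1024 ** 3` gives the sizes in GB.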

In [None]:
# # remove all json files from preprocessed texts folder

# import glob
# import os

# files = glob.glob(f"{preprocessed_text_folder}/*.json")
# for f in files:
#     os.remove(f)
#     assert not os.path.exists(f), f"File {f} still exists!"

## Selection

Because of the limitations of my computer, I want my corpus to have a maximum size of **X** GB when unzipped, so I have to make a selection. I will simply make a random selection.
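
A random selection under a size budget could look like the sketch below. It assumes the selection is made per file and that a fixed seed is wanted for reproducibility; `select_within_budget`, the file names and the sizes are made up for illustration.

```python
import random

def select_within_budget(file_sizes, max_size, seed=0):
    """Pick files in random order until adding another would exceed max_size.

    file_sizes maps a file name to its size; max_size uses the same unit.
    """
    rng = random.Random(seed)   # fixed seed, so the selection is reproducible
    names = sorted(file_sizes)  # sort first, so the shuffle does not depend on dict order
    rng.shuffle(names)
    selected, total = [], 0
    for name in names:
        size = file_sizes[name]
        if total + size > max_size:
            continue  # this file does not fit; a smaller one later might
        selected.append(name)
        total += size
    return selected, total

# Pretend there are ten files of 30 MB each and a budget of 100 MB
sizes = {f"file_{i:03d}.json": 30 for i in range(10)}
selected, total = select_within_budget(sizes, max_size=100)
print(len(selected), total)  # 3 90
```

Selecting whole files keeps the bookkeeping simple; sampling individual sentences would give a more even spread over articles, but requires reading every file.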