Copyright (c) 2023 Graphcore Ltd. All rights reserved.

# Convert WikiText-103 (raw) dataset

This script runs two operations on the [WikiText-103 (raw)](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/) dataset:

 - Normalise file names, e.g. `wiki.train.raw => train.txt`
 - Construct the set of all Unicode characters present in {train, valid, test} to form the vocabulary `vocab.json`

In [1]:
from pathlib import Path
import json

In [3]:
root = Path("data/wikitext103_raw")
root.mkdir(exist_ok=False, parents=True)

In [5]:
!wget -nv https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip -O {root / "raw.zip"}
!unzip -oj {root / "raw.zip"} -d {root}
!mv {root}/wiki.train.raw {root}/train.txt
!mv {root}/wiki.valid.raw {root}/valid.txt
!mv {root}/wiki.test.raw {root}/test.txt
!tree -lh {root}

2023-05-09 10:54:29 URL:https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip [191984949/191984949] -> "data/wikitext103_raw/raw.zip" [1]
Archive:  data/wikitext103_raw/raw.zip
  inflating: data/wikitext103_raw/wiki.test.raw  
  inflating: data/wikitext103_raw/wiki.valid.raw  
  inflating: data/wikitext103_raw/wiki.train.raw  
data/wikitext103_raw
├── [183M]  raw.zip
├── [1.2M]  test.txt
├── [516M]  train.txt
└── [1.1M]  valid.txt

0 directories, 4 files


In [6]:
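def get_vocab():
    """Return the sorted list of unique characters across all three splits."""
    vocab = set()
    for part in ["train", "valid", "test"]:
        # Each split is read fully into memory; str is iterable, so update() adds every character
        vocab.update((root / f"{part}.txt").read_text(encoding="utf8"))
    return sorted(vocab)

vocab = get_vocab()
# write_text returns the number of characters written to vocab.json (the cell output)
(root / "vocab.json").write_text(json.dumps(vocab))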

49723
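
Note that the value printed above is the return value of `write_text`, that is, the number of characters written to `vocab.json`, not the vocabulary size. As a minimal sketch of how the resulting file might be consumed, the snippet below loads a vocabulary list and builds character-level encode/decode helpers. The names `load_vocab` and `make_codec` are hypothetical and not part of the original notebook, and a tiny stand-in vocabulary is written to a temporary directory so the example runs without the downloaded dataset.

```python
import json
import tempfile
from pathlib import Path


def load_vocab(path):
    """Read a JSON list of characters, as written by the notebook (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf8"))


def make_codec(vocab):
    """Build encode/decode functions from a list of unique characters (hypothetical helper)."""
    char_to_id = {ch: i for i, ch in enumerate(vocab)}

    def encode(text):
        # Raises KeyError for any character outside the vocabulary.
        return [char_to_id[ch] for ch in text]

    def decode(ids):
        return "".join(vocab[i] for i in ids)

    return encode, decode


# Stand-in for data/wikitext103_raw/vocab.json so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "vocab.json"
    demo_path.write_text(json.dumps(sorted(set("hello world"))), encoding="utf8")
    demo_vocab = load_vocab(demo_path)

encode, decode = make_codec(demo_vocab)
ids = encode("hello")
print(ids, decode(ids))
```

With the real dataset, `load_vocab(root / "vocab.json")` would take the place of the temporary file, and `len(vocab)` gives the number of distinct characters.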