# Wiki Corpus Generation

This notebook is a part of the replication package for my ICSE'22 paper.

There are two main types of automatic topic evaluation: **intrinsic** and **extrinsic**. **Intrinsic** evaluation uses the original dataset that the topic model was trained on, while **extrinsic** evaluation uses a different dataset. **Intrinsic** evaluation shows how well a model learned the underlying dataset, and **extrinsic** evaluation indicates human interpretability of topics.

For extrinsic evaluation, I follow conventional practice and use a [Wikipedia dump](https://dumps.wikimedia.org/), which provides a complete collection of English Wikipedia articles. The dump includes 5 million articles with an average length of 133 words per article, packed into a 16 GB *JSON* file. Such rich semantic information allows extrinsic evaluation to produce results that are very similar to human evaluation.

In [1]:
import gensim.downloader as api
import json
info = api.info()
print(json.dumps(info, indent=4))

{
    "corpora": {
        "semeval-2016-2017-task3-subtaskBC": {
            "num_records": -1,
            "record_format": "dict",
            "file_size": 6344358,
            "reader_code": "https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py",
            "license": "All files released for the task are free for general research use",
            "fields": {
                "2016-train": [
                    "..."
                ],
                "2016-dev": [
                    "..."
                ],
                "2017-test": [
                    "..."
                ],
                "2016-test": [
                    "..."
                ]
            },
            "description": "SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collect

The code below downloads the Wiki dump and saves it to disk. Only basic preprocessing is applied, as opposed to the full preprocessing I used for the reviews, for the following reasons:
- Wiki articles are semantically rich, long texts that contain virtually no misspellings or colloquialisms.
- Running the complete preprocessing pipeline, especially lemmatization, is very computationally expensive.

In [None]:
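# download wiki dump (api.load returns an iterable over the articles)
model = api.load('wiki-english-20171001')
# specify correct path
with open('C:\\wiki\\wiki_dump_raw.txt', 'w', encoding='UTF-8') as f:
    for i, entry in enumerate(model):
        # report progress every 10,000 articles
        if i % 10000 == 0:
            print(f"Document {i}")
        # preprocess and save: one lowercased article per line;
        # newlines become spaces so that words on adjacent lines are not glued together
        f.write(" ".join(entry['section_texts']).replace("\n", " ").lower() + '\n')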
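
To make the basic preprocessing step concrete, here is a minimal sketch applied to a single made-up entry. The helper name `flatten_entry` and the `sample` dictionary are hypothetical and only mimic the `section_texts` field used above. Note that the sketch normalizes whitespace with `split()` instead of the `replace` call in the cell above, which is an assumption of this illustration.

```python
def flatten_entry(entry):
    """Join the section texts of one article into a single lowercase line.

    Hypothetical helper for illustration; not part of gensim.
    """
    text = " ".join(entry["section_texts"])
    # split() without arguments collapses every run of whitespace, newlines included
    return " ".join(text.split()).lower()


# Made-up entry that mimics the field used in the cell above
sample = {
    "title": "Example",
    "section_texts": ["First line\nsecond line", "Another  Section"],
}

line = flatten_entry(sample)
print(line)
```

Each article ends up on exactly one line, which is what allows the later cells to treat every line of the file as one document.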

As I will show in subsequent notebooks, the main purpose of the Wikipedia dump is to quickly check whether a given word is present in an article and to count the number of articles it appears in. To retrieve that information efficiently, a binary matrix is generated. Each column in the matrix represents a word, and each row represents a document; an entry is 1 if the word occurs in the document and 0 otherwise.

In [None]:
# generate a sparse binary matrix
from sklearn.feature_extraction.text import CountVectorizer

with open('C:\\wiki\\wiki_dump_raw.txt', 'r', encoding='UTF-8') as f:
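    wiki_corpus = f.readlines()

# binary=True stores word presence (0/1) rather than counts;
# optionally pass min_df to CountVectorizer to drop rare words and save space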
v = CountVectorizer(binary=True)
X = v.fit_transform(wiki_corpus)
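
The following sketch shows, on a three-document toy corpus, how such a binary matrix answers the question "in how many documents does a word occur?". The names `toy_corpus`, `toy_vectorizer`, `toy_X`, and `document_frequency` are made up for this illustration; the real matrix is built from the Wiki dump as in the cell above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the Wiki corpus: one document per list element
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "birds sing",
]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix, rows = documents

# Because every entry is 0 or 1, column sums are document frequencies
doc_freq = np.asarray(toy_X.sum(axis=0)).ravel()


def document_frequency(word):
    """Number of toy documents that contain `word` (0 if the word is unknown)."""
    col = toy_vectorizer.vocabulary_.get(word)
    return 0 if col is None else int(doc_freq[col])


print(document_frequency("cat"))    # 2
print(document_frequency("the"))    # 2, repeated occurrences in a document count once
print(document_frequency("zebra"))  # 0
```

With `binary=True`, a word that appears several times in one document still contributes only 1 to its column, so the sums count documents rather than occurrences.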

Because Wikipedia contains millions of unique words, a dense version of this matrix would be far too large to store. Moreover, the matrix consists mostly of zeros, because each article contains only a small fraction of the vocabulary. `CountVectorizer` therefore returns the matrix in a SciPy sparse format, which stores only the non-zero entries, and the matrix is saved to disk in that format.

In [None]:
# save matrix on disk
from scipy import sparse
sparse.save_npz('C:\\wiki\\wiki_dump_binary_matrix.npz', X)

The features (words) are also saved to disk, because the matrix itself does not record which word each column corresponds to.

In [None]:
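# save features (words); line i of the file corresponds to column i of the matrix
with open('C:\\wiki\\features.txt', 'w', encoding='UTF-8') as f:
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    for feature in v.get_feature_names_out():
        f.write(feature + '\n')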
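
For completeness, here is a hedged sketch of how the two saved files could be loaded back and queried. It uses a tiny hand-made matrix and a temporary directory instead of the real files, and the names `words`, `X_small`, `word_to_col`, `articles_containing`, and `is_in_article` are hypothetical. The later notebooks may load and query the data differently.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Tiny stand-in for the real data: 3 documents x 4 words (made-up values)
words = ["apple", "banana", "cherry", "date"]
X_small = sparse.csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]))

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "matrix.npz")
    features_path = os.path.join(tmp, "features.txt")

    # Save in the same way as the cells above
    sparse.save_npz(matrix_path, X_small)
    with open(features_path, "w", encoding="UTF-8") as f:
        for word in words:
            f.write(word + "\n")

    # Load both files back; line i of the features file names column i
    X_loaded = sparse.load_npz(matrix_path)
    with open(features_path, "r", encoding="UTF-8") as f:
        word_to_col = {line.rstrip("\n"): i for i, line in enumerate(f)}


def articles_containing(word):
    """Number of documents in which `word` occurs (0 if the word is unknown)."""
    col = word_to_col.get(word)
    return 0 if col is None else int(X_loaded[:, col].sum())


def is_in_article(word, doc_index):
    """True if `word` occurs in the document at row `doc_index`."""
    col = word_to_col.get(word)
    return col is not None and bool(X_loaded[doc_index, col])


print(articles_containing("apple"))  # 2
print(is_in_article("banana", 0))    # False
```

Keeping a word-to-column dictionary next to the sparse matrix makes both lookups cheap, since neither requires converting the matrix to a dense array.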