Get Wikinews
===

First, we need a corpus of news to work with. 
We'll use the `textacy` library, which conveniently provides an interface for getting a full, cleaned dump of Wikinews ([ref](https://chartbeat-labs.github.io/textacy/api_reference/datasets.html#textacy.datasets.wikimedia.Wikinews)).
You can pick any language, but you first need to download the corresponding spaCy language model, for example:


```bash
python -m spacy download en_core_web_md
```

(Note: use the spaCy [`md` or `lg` models](https://spacy.io/models/en); the `sm` models do not include word vectors.)
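
To confirm that a downloaded model really ships with word vectors, one option is to inspect the shape of its vector table. The following is a minimal sketch, not part of the original notebook: `has_word_vectors` and `check_model` are hypothetical helper names, and `check_model` assumes that `spacy` and the named model are installed.

```python
def has_word_vectors(nlp) -> bool:
    """Return True if the pipeline's vocabulary has a non-empty vector table."""
    n_rows, n_dims = nlp.vocab.vectors.shape  # (number of vectors, vector width)
    return n_rows > 0 and n_dims > 0


def check_model(model_name: str = "en_core_web_md") -> bool:
    """Load a spaCy model by name and report whether it carries word vectors."""
    import spacy  # imported lazily so has_word_vectors stays dependency-free

    nlp = spacy.load(model_name)
    return has_word_vectors(nlp)
```

Calling `check_model("en_core_web_md")` should return `True` once the model is installed, while an `sm` model is expected to return `False`.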

```python
import textacy
from textacy.datasets.wikimedia import Wikinews

# Point the dataset at the current English Wikinews dump, stored under ./data
wikinews = Wikinews(lang="en", version="current", data_dir="./data")
wikinews.download()
```
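
Before building the corpus, it can help to look at a few raw articles. The sketch below assumes that `wikinews.records()` yields `(text, metadata)` pairs and that the metadata dictionary carries a `"title"` key; `preview_records` is a hypothetical helper, not part of `textacy`.

```python
from itertools import islice


def preview_records(records, n=3, width=80):
    """Return (title, snippet) pairs for the first `n` (text, metadata) records."""
    previews = []
    for text, meta in islice(records, n):
        title = meta.get("title", "<untitled>")  # assumption: metadata has a "title" key
        snippet = " ".join(text.split())[:width]  # collapse whitespace, then truncate
        previews.append((title, snippet))
    return previews


# Usage with the dataset downloaded above (needs textacy and the dump on disk):
# for title, snippet in preview_records(wikinews.records(limit=3)):
#     print(f"{title}: {snippet}")
```

Because the helper only consumes an iterable of pairs, it reads just the first `n` records instead of the whole dump.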

textacy.Corpus
-----
We'll then transform our `wikinews` dataset into a `textacy.Corpus`, which the documentation describes as:

> *An ordered collection of spacy.tokens.Doc* ([source](https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus))

This allows us to extract advanced representations from the article content using spaCy's `Doc` objects; see the [spaCy `Doc` API reference](https://spacy.io/api/doc).

Note that we need to pass a previously downloaded spaCy model (i.e., either `en_core_web_md` or `en_core_web_lg`) when initializing the `Corpus`, to ensure we have (GloVe) vectors for our words and documents.

```python
corpus = textacy.Corpus("en_core_web_md", data=wikinews.records())  # Convert to textacy.Corpus
corpus.save("./data/enwikinews/textacy_corpus.bin.gz")              # Save to disk
```
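
Since processing the full dump takes a while, a later session can reload the saved file instead of rebuilding the corpus. The sketch below assumes a `textacy` version where `Corpus.load(lang, filepath)` is the counterpart of the single-file `corpus.save(filepath)` used above; `reload_corpus` and `corpus_summary` are hypothetical helper names.

```python
def reload_corpus(path="./data/enwikinews/textacy_corpus.bin.gz", lang="en_core_web_md"):
    """Load a corpus previously written with `corpus.save(path)`."""
    import textacy  # imported lazily; requires textacy and the spaCy model

    # Assumption: same model name as the one used when the corpus was built
    return textacy.Corpus.load(lang, path)


def corpus_summary(corpus):
    """Collect the basic size statistics exposed by a textacy.Corpus."""
    return {
        "n_docs": corpus.n_docs,
        "n_sents": corpus.n_sents,
        "n_tokens": corpus.n_tokens,
    }


# Usage (needs the file saved above):
# corpus = reload_corpus()
# print(corpus_summary(corpus))
```

Comparing the summary before saving and after reloading is a quick sanity check that the round trip preserved every document.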