In [None]:
import textacy
from textacy.datasets.wikimedia import Wikinews

Get Wikinews
===

First, we need a corpus of news to work with. 
We'll use the `textacy` library, which conveniently provides an interface for getting a full, cleaned dump of Wikinews ([ref](https://chartbeat-labs.github.io/textacy/api_reference/datasets.html#textacy.datasets.wikimedia.Wikinews)).
You can pick any language you like, but you first need to download the corresponding spaCy language model, using:

```bash
python -m spacy download <lang>
```

In [None]:
wikinews = Wikinews(lang="en", version="current", data_dir="./data")
wikinews.download()

Wikinews.records
--------
`wikinews.records()` yields tuples of article text and corresponding metadata. For example, `for doc, meta in wikinews.records(limit=5):` yields:

1. `meta`: A dictionary containing metadata, e.g.:
```python
{'page_id': '37488', 
 'title': 'News briefs:March 30, 2006', 
 'headings': ('Audio Wikinews transcript, 2006-03-30 0730 UTC', ...), 
 'wiki_links': ('Cyclone_Glenda_closes_in_on_Western_Australia', ...), 
 'ext_links': (), 
 'categories': ('March 30, 2006', 'Brief'), 
 'dt_created': '2006-03-30T07:31:33Z', 
 'n_incoming_links': 5, 
 'popularity_score': 1.985466386054084e-06}
```

2. `doc`: The parsed/cleaned article content:
```
"I'm Phillip Hong. The time is 0730 UTC on Wednesday the 30th of March 2006, and this is Audio Wikinews: News Briefs. UK public sector workers strike over pension rights Government workers in the UK withdrew [...]"
```
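
To make the record structure concrete, here is a minimal sketch of filtering records by category. It runs without downloading the dump because it uses a small hand-written stand-in for `wikinews.records()`: the `sample_records` list and the `records_in_category` helper are hypothetical and only mimic the `(doc, meta)` shape shown above.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, Dict]

# Hypothetical stand-in for wikinews.records(): (text, metadata) tuples.
sample_records = [
    (
        "I'm Phillip Hong. The time is 0730 UTC [...]",
        {"title": "News briefs:March 30, 2006",
         "categories": ("March 30, 2006", "Brief")},
    ),
    (
        "Placeholder article text.",
        {"title": "Placeholder title",
         "categories": ("Placeholder category",)},
    ),
]


def records_in_category(records: Iterable[Record], category: str) -> Iterator[Record]:
    """Yield only the (doc, meta) pairs whose metadata lists `category`."""
    for doc, meta in records:
        if category in meta.get("categories", ()):
            yield doc, meta


briefs = list(records_in_category(sample_records, "Brief"))
for doc, meta in briefs:
    print(meta["title"], "|", doc[:30])
```

With the real dataset, the same helper could presumably be fed `wikinews.records(limit=...)` directly, since it only relies on iterating over `(doc, meta)` pairs.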

textacy.Corpus
-----
We'll now transform our `wikinews` dataset into a `textacy.Corpus`, described as *an ordered collection of spacy.tokens.Doc* ([source](https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus)). This lets us extract richer representations from the article content through spaCy's `Doc` objects. For example, each doc has:

```python
doc.ents       # named entities, e.g. (Phillip Hong, 0730, Wednesday the 30th of March 2006, ...)
doc.vector     # average of the tokens' word embeddings, e.g. [ 1.1235747e+00 -1.4028672e+00 -1.1004950e+00 ... ]
doc.sentiment  # the article's sentiment score, e.g. 0.0
```

([And more...](https://spacy.io/api/doc))
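
As a rough illustration of what "average of tokens' word embeddings" means for `doc.vector`, the sketch below averages a few toy vectors component by component. The three-dimensional `toy_token_vectors` and the `average_vectors` helper are made up for this example. Real spaCy vectors are much longer and are computed by the library itself.

```python
from typing import List, Sequence


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy "token embeddings" (hypothetical values, one vector per token).
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]

doc_vector = average_vectors(toy_token_vectors)
print(doc_vector)  # [2.0, 1.0, 1.0]
```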

In [None]:
# Run once to build the corpus and cache it on disk; afterwards, load it in the next cell instead.
#corpus = textacy.Corpus("en", data=wikinews.records())
#corpus.save("./data/enwikinews/textacy_corpus.bin.gz")

In [None]:
corpus = textacy.Corpus.load("en", "./data/enwikinews/textacy_corpus.bin.gz")  # takes about 3min on my laptop...

Since `corpus` only stores the article text and the `wikinews` object doesn't support indexing, we'll store our metadata alongside the corpus for easy retrieval.

In [None]:
meta = [m for _, m in wikinews.records()]  # keep only the metadata dicts

In [None]:
import json
with open('./data/enwikinews/meta.json', 'w') as out_file:
    json.dump(meta, out_file)
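
Reading the metadata back is the mirror image of the dump above. The sketch below round-trips a single hand-written metadata dictionary through a temporary file; the `sample_meta` list and the temporary path are illustrative and not part of the notebook. It also shows one thing to watch for: JSON has no tuple type, so tuple-valued fields such as `categories` come back as lists.

```python
import json
import os
import tempfile

# Illustrative metadata entry, shaped like the `meta` dictionaries above.
sample_meta = [
    {
        "page_id": "37488",
        "title": "News briefs:March 30, 2006",
        "categories": ("March 30, 2006", "Brief"),
        "n_incoming_links": 5,
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "meta.json")

    # Write the list of metadata dicts, as in the cell above.
    with open(path, "w") as out_file:
        json.dump(sample_meta, out_file)

    # Read it back into a list of dicts.
    with open(path) as in_file:
        loaded_meta = json.load(in_file)

# Tuples are serialized as JSON arrays, so they are loaded as lists.
print(type(loaded_meta[0]["categories"]).__name__)
```

Assuming the metadata list was built in the same order as the corpus, `loaded_meta[i]` would then describe the i-th document.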