# Cleaning Te Ara

The first text source we will be handling is Te Ara, the Encyclopedia of New Zealand.

From Wikipedia:
> Te Ara: The Encyclopedia of New Zealand is an online encyclopedia created by the Ministry for Culture and Heritage of the New Zealand Government.

In [31]:
import collections
import itertools
import os
import re

import taumahi
from nltk.tokenize import sent_tokenize

In [2]:
te_ara_path = "../sources/teara-mi-content.txt"
with open(te_ara_path, "r", encoding="utf-8") as f:
    te_ara = f.read()

In [36]:
"Te Ara has {} characters".format(len(te_ara))

'Te Ara has 8873533 characters'

We start off by printing the first 1000 characters of the text:

In [7]:
print(te_ara[:1000])

### New article
https://teara.govt.nz/mi/te-mahi-kai
Ko te kāinga te pokapū o ngā mahi kai a te Māori. Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi. Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.
### New article
https://teara.govt.nz/mi/te-mahi-kai/page-1
Ngā kaihōpara me te hunga tauhokohoko
        Nō te takiwā o ngā tau 1250 – 1300 AD ka tae ngā tīpuna o te Māori ki Aotearoa. Ko te iwi Māori te whakamutunga o ngā iwi hōpara i Te Moananui-a-Kiwa. Ka tauhokohoko ngā tīpuna o te Māori ki tēnā iwi ki tēnā iwi i ngā moutere o Te Moananui-a-Kiwa. Ko Aotearoa te whenua rahi rawa i nōhia e ngā tāngata o Te Moananui-a-Kiwa. Hāunga te pāmamao o te whenua hou, taea noatia ai e te waka haere moana.
        Ngā moutere tango kai
        Noho ai ngā iwi o Te Moananui-a-Kiwa ki ngā moutere tūtata, ka hūpeke i tēnā moutere, i tēnā moutere ki te mahi kai. Ko te whakapae, i pērā te noho a te Māori ki Aotearoa i te taenga

A few comments:
- Te Ara contains several kinds of text: URLs, te reo Māori, and English.
- It will be worthwhile to run through `te_ara` and remove the non-Māori text.

Fortunately, the `taumahi` library has the tools we need to do this.

## Cleaning the text

First we split `te_ara` into lines, then split each line into sentences with NLTK's `sent_tokenize`:

In [51]:
te_ara_sents = [s.strip() for t in te_ara.split("\n") for s in sent_tokenize(t)]
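The nested comprehension above flattens "lines, then sentences within each line" into one list, and blank lines drop out because they yield no sentences. A minimal sketch of that pattern follows; `simple_split` is a hypothetical regex-based stand-in for `sent_tokenize` (used only so the sketch runs without NLTK's tokenizer data), and it will not split text exactly as NLTK does.

```python
import re

def simple_split(line):
    """Hypothetical stand-in for sent_tokenize: split after '.', '!' or '?'
    when followed by whitespace. Not NLTK's algorithm."""
    return [s for s in re.split(r"(?<=[.!?])\s+", line) if s]

# Sample built from the first lines of the corpus shown earlier,
# with one blank line added to show that it is dropped.
sample = (
    "### New article\n"
    "https://teara.govt.nz/mi/te-mahi-kai\n"
    "\n"
    "Ko te kāinga te pokapū o ngā mahi kai a te Māori. "
    "Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.\n"
)

# Outer loop: one line at a time. Inner loop: the sentences of that line.
sample_sents = [s.strip() for line in sample.split("\n") for s in simple_split(line)]

for s in sample_sents:
    print(repr(s))
```

Splitting on newlines first means a heading with no final full stop (such as `Ngā moutere tango kai`) stays separate from the paragraph that follows it.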

In [52]:
# Print the number of sentences in te_ara
print("There are {} sentences in te_ara".format(len(te_ara_sents)))

There are 109617 sentences in te_ara


Here are the first 5 sentences:

In [53]:
te_ara_sents[:5]

['### New article',
 'https://teara.govt.nz/mi/te-mahi-kai',
 'Ko te kāinga te pokapū o ngā mahi kai a te Māori.',
 'Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.',
 'Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.']

Now we inspect the comment lines, i.e. the sentences that start with `#`:

In [54]:
collections.Counter(sent for sent in te_ara_sents if sent.startswith("#"))

Counter({'### New article': 5407})

The only comment line in `te_ara` is `'### New article'`, which occurs 5407 times in the corpus. Because every comment is the same string, these lines are easy to remove.

In [55]:
te_ara_sents = [sent for sent in te_ara_sents if sent != "### New article"]

In [59]:
# Print the number of sentences in te_ara
print("There are {} sentences in te_ara".format(len(te_ara_sents)))

There are 104210 sentences in te_ara
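As a quick sanity check, the counts reported so far should be consistent: the sentence count before filtering, minus the number of `### New article` markers, should equal the count after filtering. The variable names below are illustrative only; the numbers are copied from the outputs above.

```python
n_before = 109617   # sentences before filtering
n_markers = 5407    # occurrences of '### New article'
n_after = 104210    # sentences after filtering

# Filtering removes exactly the marker lines and nothing else.
assert n_before - n_markers == n_after
print(n_before - n_markers)
```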


Next, we remove the URLs from the text:

In [60]:
url_regex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

In [73]:
te_ara_urls = {url_regex.match(sent).group(0) for sent in te_ara_sents if url_regex.match(sent)}

In [74]:
# Print the number of distinct urls in te_ara
print("There are {} urls in te_ara".format(len(te_ara_urls)))

There are 3 urls in te_ara


Likewise, there are 5411 URLs in `te_ara`, so we remove these too.

In [75]:
te_ara_sents = [sent for sent in te_ara_sents if sent not in te_ara_urls]
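Two details of this step are easy to miss: `url_regex.match` only looks at the start of a sentence, and the membership test only drops sentences that are exactly equal to a collected URL. The sketch below shows both on a small sample; the third sample sentence is made up for illustration and does not come from the corpus.

```python
import re

url_regex = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)

sample_sents = [
    "https://teara.govt.nz/mi/te-mahi-kai",
    "Ko te kāinga te pokapū o ngā mahi kai a te Māori.",
    "See https://teara.govt.nz/mi for more.",  # hypothetical sentence
]

# match() anchors at the start, so only sentences that begin with a URL count.
matches = (url_regex.match(s) for s in sample_sents)
sample_urls = {m.group(0) for m in matches if m}

# Only sentences that consist entirely of a collected URL are removed.
kept = [s for s in sample_sents if s not in sample_urls]

# search() would be needed to find a URL embedded inside a sentence.
embedded = url_regex.search(sample_sents[2]).group(0)

print(sample_urls)
print(kept)
print(embedded)
```

If URLs embedded in running text ever need to be stripped as well, `url_regex.sub("", sent)` would be one way to do it, though that is not what the notebook does here.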

In [77]:
te_ara_sents[:10]

['Ko te kāinga te pokapū o ngā mahi kai a te Māori.',
 'Ko te maramataka ka tohu i te wā ki tēnā mahi, ki tēnā mahi.',
 'Ka tauhokohoko ngā iwi i ngā kai mai i ngā māra, te hī ika, te mahi tuna, te tāwhiti manu, te kohikohi kai hoki.',
 'Ngā kaihōpara me te hunga tauhokohoko',
 'Nō te takiwā o ngā tau 1250 – 1300 AD ka tae ngā tīpuna o te Māori ki Aotearoa.',
 'Ko te iwi Māori te whakamutunga o ngā iwi hōpara i Te Moananui-a-Kiwa.',
 'Ka tauhokohoko ngā tīpuna o te Māori ki tēnā iwi ki tēnā iwi i ngā moutere o Te Moananui-a-Kiwa.',
 'Ko Aotearoa te whenua rahi rawa i nōhia e ngā tāngata o Te Moananui-a-Kiwa.',
 'Hāunga te pāmamao o te whenua hou, taea noatia ai e te waka haere moana.',
 'Ngā moutere tango kai']

Next, we want to detect any kupu pākehā (English words) in `te_ara`. The `taumahi` library provides a `kupu_pākehā` function for this.

In [91]:
kupu_pākehā = {kupu for sent in te_ara_pākehā_sents for kupu in taumahi.kupu_pākehā(sent)}

Inspecting the result, we can see that the kupu pākehā in the corpus are sometimes names of people (e.g. 'James Belich'), sometimes names of organisations (e.g. 'Peoples of the Pacific'), and sometimes out-of-vocabulary terms such as 'AD'.
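To give a rough intuition for why a token such as 'AD' gets flagged, here is a minimal orthographic heuristic. It is not `taumahi`'s implementation (which is not shown in this notebook); `non_maori_looking` and `MAORI_WORD` are hypothetical names, and the sketch assumes only that Māori words use the vowels a, e, i, o, u (with or without macrons), the consonants h, k, m, n, p, r, t, w, ng and wh, and that every consonant is followed by a vowel.

```python
import re
import unicodedata

# One or more units, each an optional consonant followed by a single vowel.
MAORI_WORD = re.compile(r"(?:(?:ng|wh|[hkmnprtw])?[aeiouāēīōū])+")

def non_maori_looking(sentence):
    """Hypothetical helper: return the tokens that break the pattern above."""
    # Normalise so that macronised vowels are single precomposed characters.
    text = unicodedata.normalize("NFC", sentence)
    # Tokens are runs of letters; digits and punctuation are skipped.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if not MAORI_WORD.fullmatch(t.lower())]

sent = "Nō te takiwā o ngā tau 1250 – 1300 AD ka tae ngā tīpuna o te Māori ki Aotearoa."
print(non_maori_looking(sent))            # ['AD']
print(non_maori_looking("James Belich"))  # ['James', 'Belich']
```

A purely orthographic check like this misses English words that happen to fit Māori spelling rules (for example 'me' or 'one'), which is one reason to prefer a dedicated library function over a hand-rolled pattern.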

In [92]:
kupu_pākehā

{'possession',
 'Reggae',
 'Stanislaus',
 'Milroy',
 "'Rōre'",
 'venus',
 'Claims',
 'natives',
 'food',
 'Gilbert',
 'Survey',
 'BEM',
 'nohopuku’',
 "kaitā'",
 'Stanton',
 "Hick's",
 'Tuckett',
 "'forage",
 'Impressive',
 'January',
 'mio',
 'Barry',
 'translated',
 'kiwi’',
 "'Puku-tohe-ki-te-riri'",
 'governor’',
 'pitpito',
 'Nu',
 'Harvey',
 'raft',
 'category',
 'towelling',
 'Mirror',
 'Puckey',
 'Crosby',
 'Von',
 'Micronesian',
 'Donna',
 'Onslow',
 'Peninsula',
 'Eyre',
 'includes',
 'Riversdale',
 'Tip',
 'Normanby',
 "'Ō",
 'system',
 'Collectors',
 'Seymour',
 'Aborigines',
 'weight',
 'zucchini',
 'Shore',
 'Harold',
 'Lucy',
 'raining',
 'Railway',
 'Score',
 'Auckland',
 'Think',
 'Louis-Auguste',
 'essence',
 'away',
 'lions',
 'sunshine',
 'free',
 'Gunnah',
 "'kūpapa'",
 'Rennie',
 'Moses',
 'Matthew’s',
 "'koia",
 "turupa'",
 'ana’',
 'Spence',
 'Lillian',
 'minenga’',
 'Ū’',
 'cumulus',
 'selection',
 'Southland',
 'race',
 'night',
 'Neudorf',
 'Sorensen',
 'Camb