# Collecting text
This notebook contains the code needed to collect text data and process it
so that we can use it in our NLP exercises.

## 1. "Māori Names" text data

We're going to grab a dubious list of "Māori names" from a website that doesn't
seem to have any concept of Māori data sovereignty or any concern about cultural appropriation:
`https://momlovesbest.com/maori-names`

We'll convert this data to a list of tuples of the form `('name', 'gender')` and save it to file.

In [None]:
# Import selenium to load content from the dynamic website
# Import BeautifulSoup4 to process the html
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# Grab the webpage content via starting a browser
browser = webdriver.Chrome()
browser.get("https://momlovesbest.com/maori-names")
body = browser.find_element(By.TAG_NAME, 'body')

# need to move around on the page a bit to force content to load
body.send_keys(Keys.END)
body.send_keys(Keys.PAGE_UP)
time.sleep(3)

# now grab all the html from the page and close the browser
html = browser.page_source
browser.quit()

# pull the name and gender records from the html
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_='name-list' )

# map boy->male and girl->female to match name data from other sources
map_name_gender = {'boy':'male', 'girl':'female', 'unisex':'unisex'}
name_list = []

# iterate through all the `div` sections and get the names and genders
# skip any record with a missing name or a gender label we don't recognise
for div in divs:
  name = div.get('data-name')
  gender = div.get('data-gender')
  if name and gender in map_name_gender:
    name_list.append((name, map_name_gender[gender]))

print(name_list[:3]) #print the first few tagged names

# write the list of ('name', 'gender') tuples as a text file that we can use later
with open('maoriNames.txt', 'w', encoding='utf8') as f:
  for item in name_list:
    f.write(f"{item}\n")
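
Because each line of `maoriNames.txt` is the printed form of a Python tuple, the file can be read back later with `ast.literal_eval`. The following is a minimal sketch of that round trip. The helper name `loadTaggedList`, the sample names and the temporary file are assumptions for illustration; they are not part of the scraped data or the original notebook.

```python
import ast
import os
import tempfile

def loadTaggedList(path):
  # hypothetical helper: read one printed tuple per line back into a list
  tagged = []
  with open(path, encoding='utf8') as f:
    for line in f:
      line = line.strip()
      if line:
        tagged.append(ast.literal_eval(line))
  return tagged

# illustrative sample data, written in the same format as the notebook uses
sample = [('Aroha', 'female'), ('Tāne', 'male')]
path = os.path.join(tempfile.mkdtemp(), 'sampleNames.txt')
with open(path, 'w', encoding='utf8') as f:
  for item in sample:
    f.write(f"{item}\n")

reloaded = loadTaggedList(path)
print(reloaded == sample)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for reading these lines back.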

## Intermission - defining a function for later use
In the next few sections we're going to repeat the same task several times:
pulling text out of a PDF document, tokenizing it into words, cleaning it up a bit and then tagging the words.
We'll write a function that we can reuse for this task.

In [None]:
def getTextAndTagIt(pdfPath, tag):
  # pdfPath should be the path to the PDF doc that we're going to pull text from
  # tag should be a string that we want to tag all the words in the text with
  from PyPDF2 import PdfReader
  reader = PdfReader(pdfPath)

  # get all the text from all the pages, joining pages with a space so that
  # the last word on one page doesn't run into the first word on the next
  # (the `or ""` guards against a page that yields no text)
  text = " ".join(page.extract_text() or "" for page in reader.pages)
  textList = text.split() #tokenize on whitespace

  # do some quick and dirty cleaning by stripping punctuation and numbers at the start/end of words
  # then get the set of unique words, dropping anything that was stripped down to nothing
  textList = [word.strip('0123456789$?!()/%,.;-<>') for word in textList]
  taggedText = [(word, tag) for word in set(textList) if word != ""]

  # count the tagged words (not the raw set) so the empty string isn't included
  print("Extracted", len(taggedText), "unique words from", pdfPath,
        "\nTagged them all as", tag)
  print(taggedText[:3])

  return taggedText
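
To see what the tokenizing and cleaning steps do without needing a PDF, here is a small sketch that applies the same `split` and `strip` calls to a made-up sentence. The sample text and the `cleanAndTag` helper name are assumptions for illustration only.

```python
def cleanAndTag(text, tag):
  # hypothetical helper mirroring the cleaning steps in getTextAndTagIt
  words = [w.strip('0123456789$?!()/%,.;-<>') for w in text.split()]
  # sorted() just makes the output order predictable for this demo
  return [(w, tag) for w in sorted(set(words)) if w != ""]

# made-up sample text with punctuation, numbers and a web address
sample = "Kia ora! (Total: $1,250.00) see <www.example.com> -- kia ora."
tagged = cleanAndTag(sample, "demo")
print(tagged)
```

A few things stand out in the result. Tokens made up entirely of stripped characters (like the dollar amount) disappear. `Total:` keeps its colon because `:` isn't in the strip list. `Kia` and `kia` count as two different words because nothing is lower-cased, and punctuation inside a token (like the dots in the web address) is left alone.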

## 2. Kupu Māori text data

In the next section we're going to process some text data from a PDF report that's entirely
in te reo Māori. 

The report comes from `https://www.tematawai.maori.nz/assets/Corporate-Documents/WEB3-Singles-Te-Matawai-Annual-Report-22-23-Maori-v9.pdf` but we've dropped some of the title pages to get to the text-heavy sections.

In [None]:
# get text, tag it
taggedKupu = getTextAndTagIt("teMatawai.pdf", "maori")

# save the tagged text to file
with open('kupuMaori.txt', 'w', encoding='utf8') as f:
  for item in taggedKupu:
    f.write(f"{item}\n")


The resulting list of tagged words isn't perfect. The input data contains some
English kupu (in names of institutions and web addresses, for example), so words
like 'Limited' end up in our corpus tagged as 'maori'.

We won't worry too much about trying to fix all of these here. But it's worth thinking
about how inaccuracies like these might manifest in the classifier training that we 
will be doing later.
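
If we did want a rough idea of how many mis-tagged words there are, one option is to flag words that can't be written with the Māori alphabet. The sketch below is a hypothetical heuristic, not something the notebook relies on: the `looksMaori` function and the sample list are assumptions for illustration.

```python
import re

# letters used in written te reo Māori (vowels with and without macrons, plus consonants)
# 'g' only appears as part of the digraph 'ng'
MAORI_LETTERS = set("aeiouāēīōūhkmnprtwg")
VOWELS = "aeiouāēīōū"

def looksMaori(word):
  # hypothetical heuristic: True if the word is spelled only with Māori letters
  w = word.lower()
  if not w or any(ch not in MAORI_LETTERS for ch in w):
    return False
  # every 'g' must be preceded by 'n'
  if re.search(r"(?<!n)g", w):
    return False
  # kupu Māori end in a vowel
  return w[-1] in VOWELS

# made-up sample standing in for the real tagged list
sample = [("Limited", "maori"), ("whānau", "maori"),
          ("rangatahi", "maori"), ("www.tematawai.maori.nz", "maori")]
suspect = [word for word, tag in sample if not looksMaori(word)]
print(suspect)
```

This only catches words that are impossible as kupu Māori. English words such as "make" or "more" pass the check, so it is a spot-check at best, not a fix.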

## 3. Kupu Pākehā text data

Now we're going to follow the same process as above to build and tag a corpus of Pākehā words.
(If we wanted to, we could just use an English corpus from the NLTK library, but let's make our own for fun.)

The data source we'll use as input is a policy document from Thames-Coromandel District Council: `https://docs.tcdc.govt.nz/store/default/8021091.pdf`.
At first glance it contains only English words, most of them relating to finance-y things. How might this affect the results
of training our classifier?

In [None]:
# get text, tag it
taggedWords = getTextAndTagIt("thamesTreasury.pdf", "english")

# save the tagged text to file
with open('englishWords.txt', 'w', encoding='utf8') as f:
  for item in taggedWords:
    f.write(f"{item}\n")
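
One quick way to gauge how cleanly the two corpora separate is to look for words that ended up with both tags. The sketch below uses small made-up lists in place of `taggedKupu` and `taggedWords`; the sample words and variable names are assumptions for illustration.

```python
# illustrative stand-ins for taggedKupu and taggedWords
sampleKupu = [("reo", "maori"), ("Limited", "maori"),
              ("me", "maori"), ("rautaki", "maori")]
sampleWords = [("policy", "english"), ("Limited", "english"),
               ("me", "english"), ("interest", "english")]

# compare in lower case so 'Limited' and 'limited' count as the same word
kupuSet = {word.lower() for word, tag in sampleKupu}
wordSet = {word.lower() for word, tag in sampleWords}

# words that carry both tags
overlap = sorted(kupuSet & wordSet)
print(overlap)
```

Some overlap is noise (like 'Limited' leaking in from an institution's name), and some is genuine (like 'me', which is a word in both languages). Either way, the classifier will see the same word with two different labels.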


## 4. Get some bilingual text data
We would also like to have some text data with a mix of Māori and English words in it
that we can test our classifier on. We'll use the function we wrote earlier to get some text from a PDF
that includes a mix of reo: the Māori Health Action Plan from Manatū Hauora, `https://www.health.govt.nz/system/files/documents/publications/whakamaua-maori-health-action-plan-2020-2025-2.pdf` (CC-BY).

If we want to use these as a test set with any sort of quantitative metrics, we'll need to tag these words too. 
But for now we'll put that in the too hard basket and just eyeball the results when it comes to applying our classifier.

In [None]:
# get text data, tag it with a junk tag that we won't use.

taggedWords = getTextAndTagIt("whakamaua-maori-health-action-plan.pdf","no-tag")

# save the text to file
with open('mixedWords.txt', 'w', encoding='utf8') as f:
  for item in taggedWords:
    f.write(f"{item}\n")