<b> Name: </b> MANAY, Justin Gabrielle A.

## Programming Exercise # 01: Regular Expressions-based Tokenizer

### A. HOW IT WORKS

Import the `re` package, which allows us to use regular expressions in Python.

In [65]:
import re

We first load the article into a string, excluding the byline at the top of the file (the code below skips the first five lines). We also convert the whole article to lowercase, since we do not want to distinguish between, say, "According" and "according."

In [66]:
# Load article into a string, skipping the byline (lines with index 0-4).
with open("S18 Text File.txt", "r", encoding="utf8") as article_txt:
    article_lines = article_txt.readlines()

article = "".join(article_lines[5:])

# Set to lowercase.
article = article.lower()

Note that the apostrophe used in the text is `’` and not `'`, so we replace it using `re.sub` for convenience. We then use a regular expression to extract all words with apostrophes.

In [67]:
# Replace apostrophe.
article = re.sub("’", "'", article)

# Find all words with apostrophes.
withApostrophe = re.findall(r"\b\w+\'\w+\b", article)

In [68]:
print(withApostrophe)

["it's", "can't", "gonzalez's", "everyone's", "there's", "isn't", "isn't", "filipino's", "it'll"]


Some of these words will be contractions, and some will be possessives. We thus change the words accordingly. 

`re.findall` returns a plain list of strings, so we can loop over the matches. Because contractions and possessives need different replacements, a single fixed replacement string will not do, and we handle the words one by one.

We do this with `str.replace`:

In [69]:
# Modify words with apostrophes.
for word in withApostrophe:
    if word == "can't":
        article = article.replace("can't", "can not")
    elif word == "isn't":
        article = article.replace("isn't", "is not")
    elif "'ll" in word:
        article = article.replace(word, word[:len(word) - 3] + " will")
    elif word == "it's" or word == "there's":
        article = article.replace(word, word[:len(word) - 2] + " is")
    else:
        article = article.replace(word, word[:len(word) - 2])

The code above expands the contractions and removes the `'s` from possessives. We base the conditions on the words in `withApostrophe` that we extracted earlier using regular expressions.

We could instead replace the apostrophe with ` no` whenever a word ends in `'t`, which works for `can't` (`can not`) but would turn `isn't` into `isn not`. Since only two words in the article end in `'t`, we simply handle each one separately.

The `'ll` branch works for every word ending in `'ll`. Words ending in `'s`, however, can be either contractions or possessives. Telling them apart requires the part of speech of the word, so we simply add one case for the contractions that we know are in the article (`it's` and `there's`) and treat the rest as possessives.
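
As a possible alternative to the loop, `re.sub` accepts a function as its replacement argument, which allows the same case-by-case logic to run in a single pass. The following is only a minimal sketch on a made-up sentence; the `CONTRACTIONS` table and the `expand` helper are hypothetical names, not part of the exercise, and the table only covers contractions seen in this article.

```python
import re

# Hypothetical lookup table: only the contractions found in this article.
CONTRACTIONS = {
    "can't": "can not",
    "isn't": "is not",
    "it's": "it is",
    "there's": "there is",
}

def expand(match):
    """Return the replacement for one apostrophe word."""
    word = match.group(0)
    if word in CONTRACTIONS:
        return CONTRACTIONS[word]
    if word.endswith("'ll"):
        return word[:-3] + " will"
    # Anything else is assumed to be a possessive: drop the 's.
    return word[:-2]

# Made-up sample sentence (already lowercased, straight apostrophes).
sample = "it's true that gonzalez's idea isn't new, and it'll stay"
expanded = re.sub(r"\b\w+'\w+\b", expand, sample)
print(expanded)
```

Because the pattern is anchored with `\b`, each word is matched and replaced exactly once, so words that occur several times (like `isn't`) do not need to be handled repeatedly.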

We also do not want to differentiate between certain terms, like `conyo` and `conyo-tic`, or `come` and `came`. Thus, we simply replace all instances of the latter with the former.

In [70]:
# Replace words
article = re.sub("conyo-tic", "conyo", article)
article = re.sub("came", "come", article)
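
One thing to keep in mind is that a pattern without word boundaries (and likewise `str.replace`) matches the text anywhere, including inside longer words. The sketch below, on a made-up sentence, compares a plain replacement with a whole-word one; `replace_word` is a hypothetical helper introduced only for illustration.

```python
import re

def replace_word(text, old, new):
    """Replace `old` with `new` only where `old` stands as a whole word."""
    return re.sub(r"\b" + re.escape(old) + r"\b", new, text)

# Made-up sample sentence.
sample = "he came home and became tired"

anywhere = re.sub("came", "come", sample)           # also changes "became"
whole_word = replace_word(sample, "came", "come")   # leaves "became" alone

print(anywhere)
print(whole_word)
```

Which behavior is preferable depends on the text; the whole-word version is simply more predictable when the word to replace can appear inside other words.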

Other such cases arise where words end in `-s` (plural or present tense singular), `-ing` or `-ed`. Take the word `take`, for example.

In [71]:
# Check for multiple instances of the words "take" and "emerge."
print(re.findall(r"\btak(?:es?|ing)?\b", article))
print(re.findall(r"\bemerg(?:e[sd]?|ing)?\b", article))

['takes', 'take']
['emerged', 'emerging']


From the output above, we see that both `take` and `takes`, as well as both `emerged` and `emerging`, occur in the article, but we would like to count these variants as `take` and `emerge`, respectively. Thus, we have to "stem" some of the nouns and verbs, particularly those ending in `-s`, `-ing` and `-ed`. As we will see later on, some of this has to be done manually, so we ignore nouns and verbs whose plural, present tense or past tense forms are irregular.

In [72]:
# Find all words with five letters or more ending in -s, -ing and -ed.
toStem = re.findall(r"\b(\w\w\w\w+)(s|ing|ed)\b", article)

We first review the regular expression above. We use parentheses to split each match into two groups, the root and the suffix. The `\w\w\w\w+` part requires a root of at least four letters, so only words with five letters or more can match; this avoids words like `is`, `as` and `this`.
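
To see how the two groups behave, here is a small sketch that runs the same pattern on a made-up sentence. When a pattern has more than one group, `re.findall` returns one tuple per match, with one entry per group.

```python
import re

# Made-up sample sentence, not taken from the article.
sample = "this class talks and started"

# Same pattern as above: (root of 4+ letters)(suffix).
pairs = re.findall(r"\b(\w\w\w\w+)(s|ing|ed)\b", sample)
print(pairs)

# Each tuple unpacks directly into root and suffix.
for root, suffix in pairs:
    print(root, "+", suffix)
```

Here `this` is skipped because its root would have only three letters, while `class` is split into `clas` and `s`, which is one of the incorrect cuts that has to be handled separately.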

In [73]:
print(toStem)

[('shoe', 's'), ('talk', 's'), ('alway', 's'), ('seem', 's'), ('associat', 'ed'), ('dres', 's'), ('sometime', 's'), ('speak', 'ing'), ('start', 'ed'), ('origin', 's'), ('student', 's'), ('idea', 's'), ('mean', 's'), ('take', 's'), ('talk', 'ing'), ('teache', 's'), ('englishe', 's'), ('linguistic', 's'), ('happen', 's'), ('research', 'ing'), ('search', 'ed'), ('mean', 'ing'), ('acros', 's'), ('refer', 's'), ('point', 'ed'), ('mean', 'ing'), ('mean', 'ing'), ('intertwin', 'ing'), ('alter', 'ing'), ('statement', 's'), ('replac', 'ing'), ('verb', 's'), ('equivalent', 's'), ('insert', 'ing'), ('result', 'ing'), ('phrase', 's'), ('fishball', 's'), ('accord', 'ing'), ('emerg', 'ed'), ('cite', 's'), ('start', 'ed'), ('teacher', 's'), ('start', 'ing'), ('teacher', 's'), ('hypothesize', 's'), ('evolv', 'ed'), ('1940', 's'), ('1950', 's'), ('origin', 's'), ('continue', 's'), ('filipino', 's'), ('know', 'ing'), ('language', 's'), ('speak', 'ing'), ('nowaday', 's'), ('seem', 's'), ('associat', 'ed'

From the output from the regular expression, we note the following:
- Some words (such as `associated`, `emerging`, etc.) have had their final `e` cut. Note that there is no particular rule for this case, so we would have to handle it manually (or with a word database).
- Some words ending in `-s` have been cut incorrectly.
    - In some cases (like `class`), the word ends in two `s`'s.
    - In others (like `conscious`), the letter before the final `s` is one that English words rarely end in.
    - In others (like `teaches`), the base word ends in one of `s`, `x`, `z` and `h`, so the plural/present tense is formed by adding `-es` rather than `-s`.
    - In others (like `linguistics`), the ending `s` need not be cut at all.
- For the rest of the cases, the method worked perfectly.

Thus, based on the rules above, we again use `string.replace` to stem the words in the article. We also create a list of exceptions (`exceptions`), and a list of the verbs that have had their final `e`'s cut (`cut_verbs`), based on the output of the regular expression.

In [74]:
es_plurals = ["s", "x", "z", "h"]
cut_verbs = ["associated", "intertwining", "hypothesizes", "replacing", "emerged", "evolved", "involved", "perceived", "dedicated", "excited", "emerging", "evolving", "enthuses"]
exceptions = ["always", "sometimes", "linguistics", "according", "nowadays", "belongings", "clothes", "towards", "themselves", "fledged"]

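for root, suffix in toStem:
    # Skip words such as "class" and "conscious", and the listed exceptions.
    if (root[-1] in ["u", "s"] and suffix == "s") or root + suffix in exceptions:
        continue
    # Roots whose second-to-last letter is s, x, z or h
    # (e.g., "teache" from "teaches"): drop the trailing letter.
    elif root[-2] in es_plurals and suffix == "s":
        article = article.replace(root + suffix, root[:-1])
    # Verbs whose final "e" was cut by the regular expression: restore it.
    elif root + suffix in cut_verbs:
        article = article.replace(root + suffix, root + "e")
    else:
        article = article.replace(root + suffix, root)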

There are other words in the article which can be stemmed, such as adverbs (ending in `-ly`) and negations in adjectives (`un-`, `in-`). But they do not seem to affect the word counts too much, so we ignore them for now.

We now check the article for any inconsistencies.

In [75]:
print(article)

“it is so init here, diba?” “those shoe are so mahal talaga!” “can you make para here, boss?”

do you know anyone who talk like this?

de la salle university always seem to be humorously associate with the conyo subculture, so one can not help but wonder just how the typical conyo kid come into existence. the term can be used to describe a person, their language, or even the way they act or dress. as a language, some describe it as a form of taglish, taglish spoken in a “maarte” way, or sometimes just speak in filipino with a heavy foreign accent.

in spite of all this, though, does anyone really know what the exact definition of conyo is, where it start, and how it come to be?

to get a better look at the origin and evolution of the word, some student give their own idea on what the term conyo mean to them. the menagerie take it a step further by also talk to dr. ariane macalinga borlongan, a professor who teach world english and english linguistics in tokyo, japan, and who also happe

We can now process the entire article and create a frequency list (or a frequency dictionary).

In [76]:
# Finding all words...
wordList = re.findall(r"\b(\w+\-\w+|\w+)\b", article)

# Create frequency "list"
freqDict = dict((word, 0) for word in set(wordList))
for word in wordList:
    freqDict[word] += 1
for key in freqDict.keys():
    print(key, ":", freqDict[key], "\n")

evidently : 1 

result : 2 

put : 1 

imply : 1 

had : 1 

completely : 1 

it : 18 

unlike : 1 

during : 2 

would : 1 

many : 1 

simply : 1 

init : 1 

average : 1 

through : 1 

able : 1 

give : 1 

normal : 1 

quite : 1 

belong : 1 

has : 1 

endless : 1 

define : 1 

designer : 1 

step : 1 

fuck : 1 

time : 1 

act : 1 

help : 1 

form : 2 

did : 1 

foundation : 2 

seem : 3 

insert : 1 

tusok-tusok : 1 

humorously : 1 

status : 1 

they : 5 

potential : 1 

linguistics : 1 

despite : 1 

because : 5 

user : 1 

talaga : 1 

dictionary : 1 

search : 1 

intertwine : 1 

group : 1 

exactly : 1 

that : 7 

answer : 1 

not : 5 

people : 1 

with : 6 

start : 4 

well : 2 

being : 5 

menagerie : 1 

equivalent : 1 

basically : 1 

themselves : 1 

lack : 1 

there : 2 

typical : 1 

take : 2 

existence : 1 

english : 22 

happen : 1 

among : 2 

spanish : 1 

american : 2 

always : 2 

borlongan : 6 

come : 3 

but : 3 

macalinga : 1 

may : 8
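
The counting step above can also be sketched with `collections.Counter` from the standard library, which builds the frequency dictionary in one call. This is only an illustrative alternative on a made-up sentence, not the code used to produce the output above.

```python
import re
from collections import Counter

# Made-up sample sentence, already lowercased.
sample = "the conyo kid come and the tusok-tusok come"

# Same word pattern as above: hyphenated words first, then plain words.
words = re.findall(r"\b(\w+\-\w+|\w+)\b", sample)

# Counter maps each word to the number of times it occurs.
counts = Counter(words)

# Print the words from most to least frequent.
for word, count in counts.most_common():
    print(word, ":", count)
```

`most_common()` also sorts the entries by frequency, which makes the list easier to read than the arbitrary order of a plain dictionary built from a set.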

### B. ASSESSMENT

To assess the tokenizer, we make a list of its strengths and weaknesses.

#### STRENGTHS:
- The tokenizer stems nouns and verbs, so multiple forms of some words (e.g., `emerge`, `emerges`, `emerging`, `emerged`) are counted as the same word.
- The tokenizer is case-insensitive.

#### WEAKNESSES:
- The tokenizer is tailored to this particular article (for example, the contraction cases and the `exceptions` and `cut_verbs` lists are hard-coded), so it would have to be adapted to fit other articles.
- The tokenizer does not stem adjectives and adverbs.