## Comment Analysis Tool

In this notebook, we'll demonstrate text processing techniques using regular expressions and Python's `re` module. These techniques let us examine the characteristics of text such as user comments and perform detailed linguistic analysis on specific examples. [SOURCE](https://docs.python.org/3/howto/regex.html)

In [1]:
# Import the re module; we'll use it throughout the notebook
import re

<br>

## First, an example

Let's say we want to create a function that validates newschool email addresses.

The function will return True if the address has the following features:<br>
`five letters` + `three digits` + `@newschool.edu`

Otherwise, it will return False.

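In [2]:
# Without regular expressions

def validateAddress(address):
    # Split the address into the part before '@' and the domain after it
    local, _, domain = address.partition('@')
    # newschool.edu template: five letters, three digits, then the domain
    if (len(local) == 8 and local[:5].isalpha() and local[5:].isdigit()
            and domain == 'newschool.edu'):
        print('match')
    else:
        print('no match')

validateAddress('xxxxxxx@sookmyung.ac.kr')

no match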

In [3]:
# With regular expressions
def validateAddress(address):
    # newschool.edu template: five letters, three digits, then the domain
    if re.match(r"^[A-Za-z]{5}[0-9]{3}@newschool\.edu$", address):
        print('match')
    else:
        print('no match')

validateAddress('xxxxxxx@sookmyung.ac.kr')

no match


In [4]:
# Regular expressions become very important when the pattern
# increases in flexibility. For example, creating a generic
# email validator.
def validateAddress(address):
    # Rough email template: local part, '@', then dot-separated domain labels
    if re.match(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$", address):
        print('match')
    else:
        print('no match')

validateAddress('xxxxxxx@sookmyung.ac.kr')

match


<br>

## Writing regular expressions

Regular expressions use `metacharacters`:

`. ^ $ * + ? { } [ ] \ | ( )`

These metacharacters hold special meaning, giving regular expressions powerful flexibility compared to strings.
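
To make the difference concrete, here is a small sketch comparing a plain string comparison with a pattern in which `.` acts as a metacharacter. The sample string is made up for illustration.

```python
import re

text = "cat"

# As a plain string, "c.t" is just three literal characters
plain_equal = (text == "c.t")

# As a pattern, "." stands for any single character
pattern_match = re.fullmatch(r"c.t", text) is not None

# Escaping the dot turns it back into a literal period
escaped_match = re.fullmatch(r"c\.t", text) is not None

print(plain_equal, pattern_match, escaped_match)  # False True False
```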

In [5]:
# Square brackets -> Sets of characters

# Letters
pattern = r"[a-z]" # All lowercase letters
pattern = r"[A-Z]" # All uppercase letters
pattern = r"[A-Za-z]" # All letters ([A-z] would also match [ \ ] ^ _ and `)

# Digits
pattern = r"[0-9]" # All digits
pattern = r"[0-5]" # All digits from 0 to 5

# Custom set of characters
pattern = r"[A-Za-z0-9]" # All letters and digits
pattern = r"[AEIOUaeiou]" # All vowels

# NOT in set
pattern = r"[^aeiou]" # Any character that is NOT a lowercase vowel
pattern = r"[^0-9]" # Any character that is NOT a digit
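
One way to see what a character set covers is to collect every match with `re.findall()`. The sketch below uses a made-up sample string; it also shows why the range `[A-z]` is wider than it looks, since it spans every ASCII character between `A` and `z`.

```python
import re

sample = "Room_42 [B]"

# re.findall returns every non-overlapping match as a list
letters = re.findall(r"[A-Za-z]", sample)
digits = re.findall(r"[0-9]", sample)
others = re.findall(r"[^A-Za-z0-9]", sample)

# [A-z] covers the ASCII range from 'A' to 'z', which also
# contains the six symbols [ \ ] ^ _ `
too_wide = re.findall(r"[A-z]", sample)

print(letters)   # ['R', 'o', 'o', 'm', 'B']
print(digits)    # ['4', '2']
print(others)    # ['_', ' ', '[', ']']
print(too_wide)  # ['R', 'o', 'o', 'm', '_', '[', 'B', ']']
```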

In [6]:
# Caret and dollar sign -> Beginning and end

# At the beginning
pattern = r"^[A-Za-z]" # A letter at the beginning
pattern = r"^[0-9]" # A digit at the beginning

# At the end
pattern = r"[A-Za-z]$" # A letter at the end
pattern = r"[0-9]$" # A digit at the end

# Defining both beginning and end
pattern = r"^[A-Za-z]$" # Exactly one letter
pattern = r"^[0-9]$" # Exactly one digit
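
A short sketch of how the anchors behave, using `re.search()` on a few made-up strings:

```python
import re

samples = ["7 wonders", "apollo 11", "7"]

# re.search scans the whole string, so only the anchors restrict
# where the digit may appear
starts_with_digit = [bool(re.search(r"^[0-9]", s)) for s in samples]
ends_with_digit = [bool(re.search(r"[0-9]$", s)) for s in samples]
is_single_digit = [bool(re.search(r"^[0-9]$", s)) for s in samples]

print(starts_with_digit)  # [True, False, True]
print(ends_with_digit)    # [False, True, True]
print(is_single_digit)    # [False, False, True]
```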

In [7]:
# Curly brackets, asterisk, and question mark -> Repetition

# Curly brackets
pattern = r"[A-Za-z]{3}" # Exactly three letters
pattern = r"[A-Za-z]{3,5}" # Between three and five letters

# Asterisk
pattern = r"[A-Za-z]*" # Zero or more letters

# Question mark
pattern = r"[A-Za-z]?" # Zero or one letter
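
The quantifiers are easiest to compare when the whole string has to fit the pattern. The sketch below uses `re.fullmatch()` for that; `whole_string_matches` is a hypothetical helper, not part of the notebook. It also includes `+` (one or more) from the metacharacter list.

```python
import re

def whole_string_matches(pattern, s):
    """Hypothetical helper: True if the entire string fits the pattern."""
    return re.fullmatch(pattern, s) is not None

results = [
    whole_string_matches(r"[a-z]{3}", "abc"),     # exactly three: True
    whole_string_matches(r"[a-z]{3}", "abcd"),    # four letters: False
    whole_string_matches(r"[a-z]{3,5}", "abcd"),  # three to five: True
    whole_string_matches(r"[a-z]*", ""),          # zero or more: True
    whole_string_matches(r"[a-z]?", "ab"),        # zero or one: False
    whole_string_matches(r"[a-z]+", ""),          # one or more: False
    whole_string_matches(r"[a-z]+", "ab"),        # one or more: True
]

print(results)
```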

In [8]:
# Dot - Any character
pattern = r"." # Any character except a newline
pattern = r".{3}" # Three of any character
pattern = r"[A-Za-z].?" # Any letter followed by zero or one of any character

In [9]:
# Backslash - Escape character
pattern = r"\." # A literal period symbol
pattern = r"[A-Za-z]?\?" # Zero or one letter followed by a literal question mark
pattern = r"\[\]" # A literal set of square brackets
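
When the text to match contains several metacharacters, `re.escape()` can add the backslashes for you. A minimal sketch, with made-up sample strings:

```python
import re

price = "3.50?"

# Unescaped, "." matches any character and "?" makes the "0" optional,
# so the pattern also matches text that merely looks similar
loose = re.search(r"3.50?", "3x5") is not None

# re.escape() adds the backslashes needed to match the text literally
literal = re.escape(price)
strict_miss = re.search(literal, "3x5") is not None
strict_hit = re.search(literal, "Is it 3.50?") is not None

print(loose, strict_miss, strict_hit)  # True False True
```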

In [10]:
# Parentheses - Grouping
pattern = r"anna( banana)*" # 'anna' followed by zero or more 'banana's
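
Parentheses also capture what they matched, which can be read back from the match object. A small sketch using the pattern above:

```python
import re

pattern = r"anna( banana)*"

captured = []
for s in ["anna", "anna banana", "anna banana banana"]:
    m = re.fullmatch(pattern, s)
    # group(0) is the whole match; group(1) is the last repetition
    # captured by the parentheses, or None if the group never matched
    captured.append((m.group(0), m.group(1)))

print(captured)

# Without the space, the string does not fit the pattern as a whole
no_space = re.fullmatch(pattern, "annabanana") is None
```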

In [11]:
# Bar - Logical or
pattern = r"(anna|banana)" # Either 'anna' or 'banana'
pattern = r"\.(com|edu|org)$" # Ends in a literal dot followed by 'com', 'edu', or 'org'
pattern = r"([aeiou]|[02468])" # Either a lowercase vowel or an even digit
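
Where the `$` sits relative to the parentheses matters: inside the group it belongs to a single alternative, outside it applies to all of them. A sketch with a made-up URL:

```python
import re

inside = r"\.(com|edu|org$)"   # "$" applies to the "org" branch only
outside = r"\.(com|edu|org)$"  # "$" applies to every branch

url = "example.com/page"

# ".com" appears in the middle of the string
inside_hit = re.search(inside, url) is not None
outside_hit = re.search(outside, url) is not None

print(inside_hit, outside_hit)  # True False
```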

<br>

## Function for testing regular expression matching

In [12]:
# The function accepts two parameters:
# a regular expression pattern and a string.
# If the pattern matches at the beginning of the string, the function
# will print MATCH; else it will print NO MATCH.

def testRegexMatch(pattern, string):
    if re.match(pattern, string):
        print('MATCH')
    else:
        print('NO MATCH')
    
    print('pattern: %s\nstring: %s' %(pattern, string))

In [13]:
# Change these
my_pattern = r"^anna (banana|bobana)$"
my_string = 'anna bobana'

# Test the pattern against the string by calling the function
testRegexMatch(my_pattern, my_string)

MATCH
pattern: ^anna (banana|bobana)$
string: anna bobana
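
Note that `re.match()` only looks at the beginning of the string. The sketch below contrasts it with `re.search()`, which scans the whole string, and `re.fullmatch()`, which requires the entire string to fit:

```python
import re

pattern = r"banana"
s = "anna banana"

at_start = re.match(pattern, s) is not None   # 'banana' is not at the start
anywhere = re.search(pattern, s) is not None  # found later in the string
whole = re.fullmatch(pattern, s) is not None  # the string has extra text

print(at_start, anywhere, whole)  # False True False
```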


<br>

## Going further

During class, we introduced regular expression `metacharacters` and `re.search()`.

There is more you can do with regular expressions. The following cells show pattern substitution (`re.sub()`), which can be used, for example, to write fun little micro-programs that manipulate text.

[This how-to](https://docs.python.org/3/howto/regex.html) and [the `re` module reference](https://docs.python.org/3/library/re.html) are thorough documents covering even more that you can do with regular expressions, if you're interested in independent study.

### Micro-programs using `re.sub()`

In [14]:
def eggify(s):
    """Add 'egg' before every vowel cluster"""
    print(re.sub(r"([aeiouAEIOU]+)", r"egg\1", s))


eggify('well hello!')

weggell heggelleggo!
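
The `\1` in the replacement string refers back to whatever the first pair of parentheses matched. With more than one group, `\2`, `\3`, and so on work the same way. A minimal sketch, where `swap_name` is a hypothetical helper:

```python
import re

def swap_name(s):
    """Hypothetical helper: turn 'Last, First' into 'First Last'."""
    # \1 is the first parenthesized group, \2 the second
    return re.sub(r"(\w+), (\w+)", r"\2 \1", s)

print(swap_name("Lovelace, Ada"))  # Ada Lovelace
```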


In [15]:
def elooooongate(s):
    """Elongate all vowels"""
    print(re.sub(r"([aeiouAEIOU])", r"\1\1\1\1", s))


elooooongate('hello, how are you?')

heeeelloooo, hoooow aaaareeee yoooouuuu?


In [16]:
def numberify(s):
    """Change some letters to their 'corresponding' digits"""
    print(re.sub(r"[Aa]", r"4",
                re.sub(r"[Ee]", r"3",
                      re.sub(r"[Ii]", r"1",
                            re.sub(r"[Oo]", r"0",
                                  re.sub(r"B", r"8", s))))))


numberify('hello, how are you?')

h3ll0, h0w 4r3 y0u?


In [17]:
from random import choice

def pausify(s):
    """Insert a filler (chosen once per call) at every space."""
    print(re.sub(r" ", choice([' um... ', ' uh... ']), s))


pausify('hello, how are you?')

hello, um... how um... are um... you?
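
Because `choice()` runs once before `re.sub()` is called, every space in a given call gets the same filler. If a separate pick per space is wanted, `re.sub()` also accepts a function as the replacement. The sketch below assumes that behavior is desired; `pausify_varied` and `FILLERS` are hypothetical names.

```python
import re
from random import choice

FILLERS = [' um... ', ' uh... ']

def pausify_varied(s):
    """Hypothetical variant: pick a filler separately for each space."""
    # re.sub calls the function once per match and uses its return value
    return re.sub(r" ", lambda m: choice(FILLERS), s)

print(pausify_varied('hello, how are you?'))
```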


In [18]:
def hashtagify(s):
    """Convert string to hashtag formatting"""
    print("#" + re.sub(r"[^\w]", "", s))


hashtagify('hello, how are you?')

#hellohowareyou


In [19]:
def cleanSpaces(s):
    """Collapse each run of whitespace (spaces, tabs, newlines) into a single space"""
    print(re.sub(r"\s+", " ", s))


cleanSpaces(' hello,   how\tare     you? ')

 hello, how are you? 
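
The output above still carries one leading and one trailing space. A possible variant, sketched here under the assumption that those should go too, returns the result and trims it with `str.strip()`; `clean_and_trim` is a hypothetical name.

```python
import re

def clean_and_trim(s):
    """Hypothetical variant: collapse whitespace runs, then trim the ends."""
    return re.sub(r"\s+", " ", s).strip()

print(repr(clean_and_trim(' hello,   how\tare     you? ')))
# 'hello, how are you?'
```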
