# Recap
---

## Classes & Objects

### Scope
Scope is the set of rules that defines where variables and methods are accessible within a program, a library or a package.

| ![scope.png](attachment:c68c4b75-7d46-489e-b563-8395551577a4.png) |
|:---:|
| **Python's 4 layers of Scope** |

* **Local Scope** - variables that exist only within a function (this also applies to recursive and `lambda` functions).
* **Enclosing Scope** - only applicable to nested functions (not covered).
* **Global Scope** - variables defined at the top level of a module; these have the most reach within a program. Reading them needs nothing special, the `global` keyword is required only to reassign one from inside a function.
* **Built-in Scope** - special Python scope that holds the names that are always available without an import, such as `print`, `len` and `range`.
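
The local, global and built-in layers can be sketched as follows. This is a minimal illustration only; the variable `counter` and the three function names are hypothetical and not part of the original material.

```python
counter = 0  # global scope: defined at the top level of the module


def read_counter():
    # Reading a global variable needs no special keyword.
    return counter


def increment_counter():
    # Reassigning a global variable from inside a function needs `global`.
    global counter
    counter += 1


def shadow_counter():
    # Without `global`, the assignment creates a new local variable instead.
    counter = 100
    return counter


increment_counter()
print(read_counter())    # the global value after one increment
print(shadow_counter())  # the local value; the global is left untouched
print(len('abc'))        # `len` is looked up in the built-in scope
```

Python searches these layers from the innermost outwards, so the local `counter` in `shadow_counter()` hides the global one only inside that function.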

### Classes

**General syntax:**
```python
class ClassName:
    "optional class docstring"
    class_suite (consisting of attributes and methods)
```

**Classes contain:**
* Variables 
 * Instance - has **`self.` in front**, used for the vast majority of processing.
 * Static (also called class variables) - located **before all functions in a class**, used mainly for class wide variables.
 * Local - **within functions**, not accessible outside the function.
* Methods
 * Instance - has **`self` as the very first parameter**, ahead of all the other parameters, used for the vast majority of processing.
 * Class - uses the **`@classmethod` decorator**, has **`cls` as the very first parameter**, used mainly for creating factory methods. Eg: to load data to populate the class.
 * Static - uses the **`@staticmethod` decorator**, **no special** first parameter, used mainly for helper or utility methods. Eg: method for checking the validity of input values for that particular class.
* 1 special method called the **constructor**, which can be identified by the **`__init__(self)` signature**
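
The different kinds of variables and methods listed above can be seen together in one small class. This is a sketch under assumptions: the `Student` class, its attributes and the `'name,score'` row format are hypothetical and chosen only for illustration.

```python
class Student:
    school = 'Hypothetical High'  # static (class) variable shared by all instances

    def __init__(self, name, score):
        # constructor: sets up the instance variables
        self.name = name
        self.score = score

    def describe(self):
        # instance method: works on the data of one object through `self`
        label = 'pass' if self.score >= 50 else 'fail'  # local variable
        return f'{self.name} ({self.school}): {label}'

    @classmethod
    def from_csv_row(cls, row):
        # class method: a factory that builds an object from a 'name,score' string
        name, score = row.split(',')
        return cls(name.strip(), int(score))

    @staticmethod
    def is_valid_score(score):
        # static method: a helper that needs neither `self` nor `cls`
        return 0 <= score <= 100


student = Student.from_csv_row('Alice, 72')
print(student.describe())
print(Student.is_valid_score(120))
```

Note that `self` and `cls` are naming conventions rather than reserved words, but following them keeps the code readable for others.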

### Objects
**Everything in Python is an object!** Therefore classes are also treated as such. 
* Classes can be passed to other classes. Eg: `Student` class can be passed to a `Subject` class for processing.
* Classes can have inner classes (also called nested classes), used mainly for logical grouping of classes.

---
# Regular Expressions

Regular expressions (also known as regex) are special sequences of characters that define a pattern for complex string matching. In Python, regular expressions are processed using the `re` library.

**Problem Statement:** How do we extract an email address from a document or test that a given string has the right format for credit card numbers?

**Program:**

In [None]:
doc_lst = ['My email address is John@nospamplease.com, do not send any spam',
           'You can contact us at gongmoon@sunmoonenterprise.net, we are pleased to help you.',
           'Please send for a quotation using cement4us@cementworld.org. We will reply you within 3 working days.'
          ]

cc_num_lst = ['856-444-888-966', '7774-2664-8872-3222', '854-4547-2114-2282', '1115-8881-4552-6333',
              '8751-961-5454-2122', '2212-2224-9961-1482', '5557-6639-1117-2255', '2023-55536-2121-998'
             ]

## Topics Covered

* Raw Strings
* Syntax
* Library Functions

---

## Raw Strings

These are strings prefixed with an `r` or `R`. For example:

```python
r'this is a raw string\n'
```

The main difference between a normal string and a raw string is how the backslash (`\`) character is treated. In a normal string, `'\n'` is treated as a newline character, but in a raw string `r'\n'` it is literally treated as a backslash followed by the letter `n`. In other words, raw strings apply the "What You See Is What You Get" idea to strings.

**Example 2: Normal Strings vs Raw Strings**

In [None]:
norm_str = 'This is a \n normal string'
raw_str = r'This is a \n normal string'

print(norm_str)
print('-' * 30)
print(raw_str)

Raw strings are used for regular expressions because the special sequences of characters that define a pattern rely heavily on backslashes. It is important to note that no single pattern fits 100% of the input strings encountered; therefore it is advisable to construct patterns based on a set of well-defined conditions and leave the outliers for manual or program logic checks.
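
To see why the raw string prefix matters for patterns, the sketch below compares the same pattern written as a normal string and as a raw string. It uses `re.search()` and the `\b` (word boundary) expression, both of which are covered later in this chapter; the variable names are hypothetical.

```python
import re

text = 'Hello World'

# In a normal string, '\b' is the backspace character (a single character).
normal_pattern = '\bHello'
# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the regex engine reads as a word boundary.
raw_pattern = r'\bHello'

print(len(normal_pattern), len(raw_pattern))
print(re.search(normal_pattern, text))  # the text has no backspace, so nothing is found
print(re.search(raw_pattern, text))
```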

For example, the email address convention is briefly outlined in the [wiki article](https://en.wikipedia.org/wiki/Email_address) as 2 parts, a local part and a domain part. It then goes into further detail about the valid characters for each part. For example, emails looking like

* `"John..Doe"@example.com`
* `" "@example.org`
* `user.name+tag+sorting@example.com`

are all valid email addresses but mail servers may restrict the use of some characters (depending on the rules implemented). Therefore depending on the data that you are working with, adhering to commonly known naming conventions and company recognized naming conventions will help with constructing the raw string patterns.

---
## Syntax

The regex syntax starts simple but it can get complicated very fast depending on the matching criteria used to construct the pattern. We start with single expressions, then move on to strategies for constructing patterns, and lastly look at how to read patterns constructed by others.

### Single expressions

If you look at Python's documentation for the [`re`](https://docs.python.org/3/library/re.html) library, it details ALL the expressions that can be used to construct a regex pattern. The amount of information can be overwhelming, therefore we are going to start small.

We are going to use 1 of the `re` functions to help us understand how regex works. The function is 
* `finditer()` - this function will return an iterator of `match` objects, one for each non-overlapping match of the pattern.

<br>

Before we start, we have to remember to import the `re` library and in addition, we will be using the custom function `test_regex` to test the regex expressions.

In [2]:
import re

In [3]:
def test_regex(pattern, text, flag=0):
    '''
    Function to test the regular expression.
    Inputs:
        pattern - regular expression
        text - the text upon which the regular expression is to work upon
        flag - flags for the regular expressions
    '''
    matched = re.finditer(pattern, text, flag)
    
    for item in matched:
        print(item)

**Example 3: Basic expression**

The most basic pattern of any regex is to match 1 or more letters in its entirety (case sensitivity included). Let's say we want to match the string `rst` from the text contained in the variable `text_to_search`.

In [None]:
text_to_search = 'abcdefghijklmnopqrstuvwxyz'

# match the literal characters 'rst'
test_regex(r'rst', text_to_search)


From the result, we can see that the returned object is indeed a `match` object and it carries several pieces of information:
* `span=(start, end)` - this shows the start and end index of where the pattern was found within the text.
* `match=<something>` - this shows the character/s that match the criteria defined by the regex.

The order of the characters in a regex pattern matters (this will be more apparent as we progress). This means the pattern `rst` is different from `srt`. Patterns are also read from left to right.

---
Below is a table of expressions that are used for matching characters (known as character classes).

| Character | Description |
|:---:|:---|
| `\w` | Matches alphanumeric characters, which means `a-z`, `A-Z` and `0-9`. It also matches the underscore `_`. |
| `\W` | Matches non alphanumeric characters. |
| `\d` | Matches digits, which means `0-9`. |
| `\D` | Matches any non-digits. |
| `\s` | Matches whitespace characters, which include the `\t`, `\n`, `\r` and space characters. |
| `\S` | Matches non-whitespace characters. |
| `\b` | Matches the boundary (an empty string) at the start or end of a word, that is, between `\w` and `\W`. |
| `\B` | Matches where `\b` does not, that is, at positions that are not at the start or end of a word. |

From the descriptions of `\w`, `\W`, `\d`, `\D`, `\s` and `\S`, we can deduce that each upper case expression matches exactly the opposite of its lower case counterpart.
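
The character classes in the table can be tried out quickly with `re.findall()`, a function covered later in this chapter that returns the matches as a list. This is a minimal sketch; the `sample` string is made up for illustration.

```python
import re

sample = 'Room 42, Level_3\tEast'

print(re.findall(r'\d', sample))   # digits only
print(re.findall(r'\s', sample))   # the two spaces and the tab character
print(re.findall(r'\W', sample))   # everything that is not a letter, digit or underscore
print(re.findall(r'\w+', sample))  # runs of word characters
```

Note how the underscore in `Level_3` is kept by `\w+`, while the comma and the whitespace only show up for `\W`.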

**Example 4: Matching alphanumeric characters**

Difference between `\w` and `\W`.

In [None]:
text_to_search = '''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
123456789
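'''

# \w matches each letter and digit, one character at a time
test_regex(r'\w', text_to_search)
print('-' * 30)
# \W matches what is left over, here only the newline characters
test_regex(r'\W', text_to_search)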

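Next up are the `\b` and `\B` expressions. The description mentions a "boundary". This boundary is the position between a word character (`\w`) and a non-word character (`\W`, such as whitespace or punctuation), or the very start or end of the string. It is a position rather than a character, so the boundary itself is not part of the matched text.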

**Example 5: Matching the word "Hello"**

In [None]:
sentence = 'Hello World HelloHello'

# match "Hello" only where a word starts
test_regex(r'\bHello', sentence)

The second "Hello" in the results comes from the text "HelloHello". Only the first "Hello" of "HelloHello" is matched because the second one has no preceding whitespace or other non-alphanumeric character in front of it.

---
The next set of regular expressions are called *Anchors* because they deal with pattern matching from either the start or the end of a string.

| Character | Description |
|:---:|:---|
| `^` (caret) | Matches the expression to its right at the start of a string. In multiline mode (`re.MULTILINE`), it also matches immediately after every newline (`\n`) character, that is, at the start of every line. |
| `$` (dollar) | Matches the expression to its left at the end of a string. In multiline mode (`re.MULTILINE`), it also matches just before every newline (`\n`) character, that is, at the end of every line. |
| `\A` | Matches the expression to its right at the absolute start of a string whether in single or multi-line mode. |
| `\Z` | Matches the expression to its left at the absolute end of a string whether in single or multi-line mode. |

<br>

The pairs `^`, `$` and `\A`, `\Z` have very similar functionality; the difference shows up in multiline mode (`re.MULTILINE`). With that flag set, `^` and `$` match at the start and end of every line in the text, but `\A` and `\Z` still match only at the very start and the very end of the text, regardless of the number of lines. Without the flag, `^` and `$` also match only at the start and end of the whole text.
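
The effect of the `re.MULTILINE` flag on the end anchors can be sketched as follows, again using `re.findall()` (covered later) to list the matches. The `text` value is made up for illustration.

```python
import re

text = 'first line\nsecond line'

# Without re.MULTILINE, `$` only matches at the end of the whole text.
print(re.findall(r'line$', text))
# With re.MULTILINE, `$` also matches at the end of every line.
print(re.findall(r'line$', text, re.MULTILINE))
# `\Z` is not affected by the flag and matches only at the very end of the text.
print(re.findall(r'line\Z', text, re.MULTILINE))
# `^` with re.MULTILINE picks up the first word of every line.
print(re.findall(r'^\w+', text, re.MULTILINE))
```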

**Example 6: Differences between `^` and `\A`**

In [None]:
# text for matching
sentence = 'Some are different\nSome the same\nSome are short\nSome are long'
print(sentence)

print('\n')
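# multiline text are denoted by the newline characters in them
# in multiline mode, ^ matches at the start of every line
test_regex(r'^Some', sentence, re.MULTILINE)

print('-' * 30)
# \A matches only at the very start of the text, even in multiline mode
test_regex(r'\ASome', sentence, re.MULTILINE)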


Now that we have covered *Character Classes* and *Anchors*, Figure 3 below illustrates some of the expressions pictorially.

| ![word_boundary.png](attachment:word_boundary.png) |
|:---:|
| **Figure 3:** Some *Character Classes* and *Anchors* illustration from [Python Course](https://www.python-course.eu/re.php) |


---
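Next we have the *Quantifiers*. These characters specify how many times the expression to their left should be matched. The period (`.`) is not a quantifier itself, but it is listed here because it is most often used together with one.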

| Character | Description |
|:---:|:---|
| `.` (period) | Matches any character except line terminators like the newline (`\n`) character. |
| `+` (plus) | Greedily matches the expression to its left 1 or more times. |
| `*` (asterisk) | Greedily matches the expression to its left 0 or more times. |
| `?` | Greedily matches the expression to its left 0 or 1 times. |
| `{m}` | Matches the expression to its left exactly `m` times, no more and no less. |
| `{m,n}` | Greedily matches the expression to its left between `m` and `n` times, as many times as possible. |
| `{m,n}?` | Non-greedy version of `{m,n}`: matches the expression to its left between `m` and `n` times, as few times as possible. |

<br>

Quantifiers are generally used together with all the above expressions to form a regular expression.
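
The difference between exact, greedy and non-greedy repetition can be sketched as follows. The sample strings are made up for illustration, and `re.findall()` (covered later) is used to list the matches.

```python
import re

digits = '1234567'

print(re.findall(r'\d{3}', digits))     # exactly 3 digits per match
print(re.findall(r'\d{2,4}', digits))   # greedy: takes up to 4 digits where it can
print(re.findall(r'\d{2,4}?', digits))  # non-greedy: stops as soon as it has 2

html = '<b>bold</b> and <i>italic</i>'

print(re.findall(r'<.+>', html))   # greedy: one long match from the first < to the last >
print(re.findall(r'<.+?>', html))  # non-greedy: each tag is matched separately
```

Adding `?` after any quantifier (`+?`, `*?`, `{m,n}?`) switches it from greedy to non-greedy matching.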

**Example 7: Match all words starting with `B`**

In [None]:
# text for matching
sentence = 'Some are different\nBeginning and end\nSome are short\nButter and almonds\nBanana blend'
print(sentence)

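print('\n')
# one possible pattern (an assumption): a word boundary, 'B', then 1 or more word characters
test_regex(r'\bB\w+', sentence)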


**Example 8: Extract all valid credit card numbers**

Valid credit card numbers are in groups of 4 digits separated by dashes (`-`).

In [None]:
cc_num_lst = ['856-444-888-966', '7774-2664-8872-3222', '854-4547-2114-2282', '1115-8881-4552-6333',
              '8751-961-5454-2122', '2212-2224-9961-1482', '5557-6639-1117-2255', '2023-55536-2121-998']

for cc_num in cc_num_lst:
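    # can this be done with a different regular expression? stay tuned!
    # one possible pattern (an assumption): 4 groups of exactly 4 digits
    test_regex(r'^\d{4}-\d{4}-\d{4}-\d{4}$', cc_num)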
    

---
Before we can demonstrate the rest of the *Quantifiers*, let's first learn about the final set of regular expressions: the logical OR, the escape character, the set and the group.

| Character | Description |
|:---:|:---|
| `A\|B` (pipe) | Matches expression `A` or `B` where *A* and *B* can be any valid regex. If `A` is matched first, `B` is left untested. |
| `\` (backslash) | Escapes special characters (like `*` or `?`) so that they can be used literally. |
| `[...]` | Contains a set of characters to match. |
| `(...)` | Matches the expression inside the parentheses and groups it. |

There are several more regular expressions in the Python documentation but the ones highlighted here are generally enough for most applications, as we will see in a bit.

<br>

Just like in strings, the backslash character (`\`) is used as an escape character. That means if a pattern needs to match a literal period, a backslash has to be added in front of it (`\.`) so that it is treated as a literal. The other way to have characters treated as literals is to place them within square brackets `[]`. 

Any characters placed within square brackets `[]` are matched individually in the text. But there are certain things to note:

1. If there is a dash (`-`) between 2 letters (eg: `[a-g]`), it means that it is matching any letter from `a` to `g`.
2. Special characters like `(`, `)`, `*`, `/`, `+`, etc lose their special meaning inside the brackets and are treated as literals, while character classes like `\w`, `\W`, etc keep their meaning. **Tip**: If a dash is to be matched as a literal, place the dash as the very last character within the `[]` so that it is not read as a range.
3. The caret symbol (`^`) as the first character within the square brackets means the **NOT operation**, that is, match any character that is not in the set.
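
The three notes above can be checked with a short sketch. The `text` value is made up for illustration, and `re.findall()` (covered later) is used to list the matches.

```python
import re

text = 'a-g: (b+c)*2 = 14'

print(re.findall(r'[a-g]', text))    # a range: any letter from a to g
print(re.findall(r'[()*+]', text))   # special characters are literals inside []
print(re.findall(r'[a-g-]', text))   # a trailing dash is matched as a literal dash
print(re.findall(r'[^\w\s]', text))  # negation: anything that is not \w or \s
```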

**Example 9: Constructing a regex pattern for an email address**

In [None]:
email = 'johnny_darko@storks-enterprise.com'

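# one possible pattern (an assumption, other patterns also work):
# letters/underscores, '@', letters/dashes, then a literal '.com'
pattern = r'[a-zA-Z_]+@[a-z-]+\.com'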

test_regex(pattern, email)

That works for 1 email, but what if we have plenty of email addresses in several different formats? Look back to the top of the chapter, where we have a list of text with email addresses embedded in it. Let's say that we would like to extract only the email addresses that are **NOT** from non-profit organizations (that is, we do not want email addresses ending with `.org`).

This condition can be met by constructing a pattern that includes the matches for `.com` and `.net`. We do this with the round brackets `()` and the pipe `|` operator (meaning the OR operation). The round brackets group expressions meant to be matched together.

**Example 10: Extracting email addresses**

In [None]:
doc_lst = ['My email address is John.Tan@nospamplease.com, do not send any spam',
           'You can contact us at gong-moon@sunmoonenterprise.net, we are pleased to help you.',
           'Please send for a quotation using cement4us@cementworld.org. We will reply you within 3 working days.'
          ]

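# the same pattern that is compiled in Example 17:
# the local part, '@', the domain name, then '.com' or '.net'
pattern = r'[a-zA-Z.-]+@[a-z-]+\.(com|net)'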

for doc in doc_lst:
    test_regex(pattern, doc)

Going back to *Example 8*, let's see how the expression can be optimized.    
What are the characteristics of credit card numbers?
* they come in groups of 4 digits
* dashes are used to separate the groups, so each of the first 3 groups is followed by a dash

**Example 11: Different version of the regular expression compared to Example 8**

In [None]:
cc_num_lst = ['856-444-888-966', '7774-2664-8872-3222', '854-4547-2114-2282', '1115-8881-4552-6333',
              '8751-961-5454-2122', '2212-2224-9961-1482', '5557-6639-1117-2255', '2023-55536-2121-998']

for cc_num in cc_num_lst:
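    # optimized expression (one possible answer, an assumption):
    # 3 repetitions of "4 digits and a dash", followed by the last 4 digits
    test_regex(r'^(\d{4}-){3}\d{4}$', cc_num)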
    

---
Now that we have a rough idea of how regular expressions are constructed, let's do an exercise by taking some regular expressions from the Internet and deciphering them.

### Exercise

Decipher the following expressions (items 1 and 2) and write an expression that matches the given strings (item 3).
1. `[+-]?[1-9]\d*|0`
2. `\w+\.(gif|png|jpg|jpeg)`
3. `['tom.jones-887@las-vegas.us', 'tan_ah_kau@longlong-uni.edu.sg', 'wilson@remote-island.co.uk']`

---
## Library Functions

Now that we have an overview of the regex syntax, we can now look at some of the more popular functions from the `re` library upon which we can apply the regex string patterns. A suggestion is to use [this website](https://regex101.com/) to test your regular expressions as it has a debugger to help explain the different parts of your regular expression.

The `re` library function that we have been using thus far is `finditer()`, which returns an iterator of `match` objects. A `match` object holds the result of a match made using the regular expression. We have seen from results like `<re.Match object; span=(19, 28), match='Beginning'>` that it shows a `span` and a `match` value. To retrieve information from the object, we can use the functions available to the `match` object such as:

* `group([group_number, ...])` - Returns one or more subgroups of the match, provided that the regular expression has been constructed with groups. If there is no *group_number*, it returns the whole matched result.
* `groups()` - Returns a tuple containing the matched strings of all the groups in the regular expression.
* `span([group_number])` - Returns a 2-tuple consisting of the start and end index of the substring matched by the regular expression.
* `start([group_number])` or `end([group_number])` - Returns the start/end index of the substring matched by the regular expression.

**Example 12: Match objects behaviours and attributes** 

In [None]:
str_float = '24.1632'

pattern = r'(\d+)\.(\d+)'

result = re.finditer(pattern, str_float)

for item in result:
    print(f'span: {item.span()}')
    print(f'start index: {item.start()}')
    print(f'end index: {item.end()}')
    print(f'extracted string: {item.group()}')
    print(f'extracted groups: {item.groups()}')
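
Example 12 calls the functions without arguments. The sketch below shows what happens when a group number is passed in; the input string `'price: 24.1632'` is made up for illustration, and `re.search()` is covered further down.

```python
import re

match = re.search(r'(\d+)\.(\d+)', 'price: 24.1632')

print(match.group())      # the whole match
print(match.group(1))     # first group: the digits before the period
print(match.group(2))     # second group: the digits after the period
print(match.group(1, 2))  # several groups at once, returned as a tuple
print(match.span(2))      # start and end index of the second group
```

Group numbers start at 1 and follow the order of the opening brackets in the pattern; group 0 is the whole match.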

---
`match` objects are returned by most `re` library functions, and most `re` library functions have the form `function_name(pattern, string, flags=0)`. This is not the first time we have seen a flag argument (our `test_regex` function takes one too). So what is this argument? It is an argument where a constant is used to modify the behaviour of an expression. Commonly used flags are listed in the table below.

| Abbreviation | Full Name | Description |
|:---:|:---:|:---|
| re.I | re.IGNORECASE | Makes the regular expression case-insensitive. |
| re.L | re.LOCALE | The behaviour of some special sequences like `\w`, `\W`, `\b` and `\B` will be made dependent on the current locale, i.e. the user's language, country etc. In Python 3 this flag can only be used with bytes patterns. |
| re.M | re.MULTILINE | `^` and `$` will match at the beginning and at the end of each line, separated by the newline character (`\n`), and not just at the beginning and the end of the string. |
| re.S | re.DOTALL | The dot/period (`.`) character will match every character **including the newline character**. |
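
The flags in the table can be sketched in action as follows. The `text` value is made up for illustration; note that several flags can be combined with the `|` operator.

```python
import re

text = 'Regex\nregex\nREGEX'

print(re.findall(r'regex', text))                 # case-sensitive by default
print(re.findall(r'regex', text, re.I))           # ignore case
print(re.findall(r'^regex$', text, re.I | re.M))  # two flags combined with |
print(re.findall(r'Regex.regex', text))           # `.` does not match the newline
print(re.findall(r'Regex.regex', text, re.S))     # with re.S, `.` matches the newline too
```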


So what are some of the popular `re` library functions?

1. **`search()` / `match()`** - the purpose of these functions is to scan through the string looking for **only** the first occurrence of the regex pattern. Both return a `match` object, or `None` if nothing is found. The main difference is that
 * the `match()` function will start the scan from the **beginning** of the string (aka it's like having the `^` caret functionality built-in) and if the pattern is **not found** at the beginning of the string, it is deemed *not found*
 * the `search()` function scans through the whole string *looking* for the first match.

 **Example 13: Differences between `match()` and `search()`**

In [None]:
# list of fake addresses
list_addr = ['3631  Hurry Street, Stone Mountain, Virginia 24533',
             '1036  Jefferson Street, Lightfoot, Maryland 22070',
             '4984  Willis Avenue, Palatka, Florida 32077',
             '85695 Tasmania, COLEBROOK, 70895 South Street']

# the pattern matches 5 digits numbers in strings
pattern = r'\d{5}'

In [None]:
# 'match()' function where it starts its match from the beginning of the string
for s in list_addr:
    result = re.match(pattern,s)
    print(result)

In [None]:
# 'search()' function where it scans the whole string looking for the first match.
for s in list_addr:
    result = re.search(pattern,s)
    print(result)

2. **`findall()` / `finditer()`** - Both these functions are used to search for all matched patterns from the string. The only difference is the format of the returned results.
 * `findall()` - Returns a list of strings of all non-overlapping matches of the regex pattern (if the pattern contains groups, the groups are returned instead of the whole match)
 * `finditer()` - Returns an iterator of `match` objects of all non-overlapping matches of the regex pattern
 
 Non-overlapping means that the string is searched from left to right, and the next match attempt starts immediately after the previous matched character.

 **Example 14: Differences between `findall()` and `finditer()`**

In [None]:
wall_of_text = 'Account Number: 1-88455-96 Name: John Doe Contact: 84599612 ' + \
               'Account Number: 1-85222-82 Name: Tom Ellis Contact: +49-518-524-9155 ' + \
               'Account Number: 2-74112-70 Name: James Dean Contact: +65-69825412 ' + \
               'Account Number: 8-99251-02 Name: Sarah Anne Contact: +65-98421622 ' + \
               'Account Number: 5-84587-12 Name: Emma Smith Contact: +65 68542845'

In [None]:
# extracting all the account numbers
list_acct_nums = re.findall(r'\d{1}-\d{5}-\d{2}', wall_of_text)

print(type(list_acct_nums))
print(list_acct_nums)

In [None]:
# extracting all the account numbers as match objects
list_acct_nums = re.finditer(r'\d{1}-\d{5}-\d{2}', wall_of_text)

print(type(list_acct_nums))
for item in list_acct_nums:
    print(item.group())
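
Two details of `findall()` and `finditer()` are easier to see on small inputs: what "non-overlapping" means, and how groups change the result of `findall()`. This is a minimal sketch; the `contacts` string is made up for illustration.

```python
import re

# Non-overlapping: after 'aa' is matched at index 0, the search resumes at index 2.
print(re.findall(r'aa', 'aaaaa'))

contacts = 'Tom: +49-518, James: +65-698'

# No groups in the pattern: findall() returns the whole matches.
print(re.findall(r'\+\d+-\d+', contacts))
# Groups in the pattern: findall() returns one tuple of groups per match.
print(re.findall(r'\+(\d+)-(\d+)', contacts))
# finditer() returns match objects, so the whole match, groups and span are all available.
for m in re.finditer(r'\+(\d+)-(\d+)', contacts):
    print(m.group(), m.groups(), m.span())
```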

3. **`split()`** - this function works similarly to the `split()` function of the string object but it uses a regular expression pattern instead. The return value is a list of strings split in accordance with the regular expression pattern, and the **remainder** of the string after the final split is always returned as the **last element** of the list.

 Its strength shows when there is a need to clear data of superfluous and redundant text. The example below shows how the data is cleaned of text descriptions such as surname, firstname and profession.

 **Example 15: `split()` function usage**

In [None]:
# removing unnecessary data
lines = ["surname: Putin, firstname: Vladimir, profession: president", 
         "surname: Merkel, firstname: Angela, profession: chancellor",
         "surname: Ramsay, firstname: Gordon, profession: chef",
         "surname: Puck, firstname: Wolfgang, profession: chef",
         "surname: Spielberg, firstname: Steven, profession: Director",
         "surname: Tarantino, firstname: Quentin, profession: Director"
         ]

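# split on the field labels ("surname: ", ", firstname: ", ...) to keep only the values
# one possible pattern (an assumption): optional comma and spaces, a word, then ': '
pattern = r',? *\w+: '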
for line in lines:
    result = re.split(pattern, line)[1:]
    print(result)
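
The "remainder as the last element" behaviour is easiest to see when the number of splits is limited with the `maxsplit` argument of `re.split()`. This is a minimal sketch; the `record` string is made up for illustration.

```python
import re

record = 'apples;  pears,oranges ; plums'

# Split on a comma or semicolon with optional whitespace around it.
print(re.split(r'\s*[;,]\s*', record))
# maxsplit=2 stops after two splits; the rest comes back as the last element.
print(re.split(r'\s*[;,]\s*', record, maxsplit=2))
# A group in the pattern keeps the separators in the result.
print(re.split(r'\s*([;,])\s*', record))
```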

4. **`sub()`** - this function works similarly to the `replace()` function of the string object but it uses a regular expression pattern instead.

 The signature of the `sub()` function is as follows:
 `re.sub(pattern, repl, string, count=0, flags=0)`

 * `pattern` - regex pattern
 * `repl` - replacement string or function
 * `string` - string for which the regular expression is to be worked on
 * `count` - the maximum number of pattern matches to replace (`0` means replace all matches)
 * `flags` - regular expression flag

 **Example 16: Replacing all Gender Values to a consistent form**

In [None]:
gender = ['male', 'm', 'f', 'fem', 'Male', 'Female', 'f', 'm']

def replace_gender(gen):
    '''
    Function to return the correct string replacement for the gender
    Input: 
        gen - match object of the subtext
    Return:
        replacement string
    '''
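#     print(gen)
    # one possible rule (an assumption): values starting with 'm' are male,
    # everything else is treated as female
    if gen.group().lower().startswith('m'):
        return 'male'
    return 'female'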

cleaned_data = []
for gen in gender:
    cleaned_data.append(re.sub(r'\w+', replace_gender, gen))
    
print(cleaned_data)
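
Example 16 uses a function for `repl`. The sketch below shows the other parameters of `sub()` on a made-up string: a plain replacement string, the `count` argument, and a small function written as a `lambda`.

```python
import re

text = 'Call 8459 9612 or 6982 5412 today'

# A plain replacement string, applied to every match.
print(re.sub(r'\d{4}', 'XXXX', text))
# count=2 replaces only the first two matches.
print(re.sub(r'\d{4}', 'XXXX', text, count=2))
# A function replacement receives the match object; here it reverses the digits.
print(re.sub(r'\d{4}', lambda m: m.group()[::-1], text))
```

Passing `count` by keyword, as done here, keeps the call readable and avoids mixing it up with `flags`.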

5. **`compile()` / `purge()`** - these functions enable the reuse of frequently used regular expressions within a program. `compile()` compiles a regular expression pattern into a regular expression object, which can then be used for calling the `re` library functions directly. `purge()` clears the internal cache for regular expressions.

 The steps that Python follows whenever a regular expression function is used are:
 
  1. compile the expressions
  2. cache the result of the compilation
  3. invoke the function
  
 These steps increase efficiency when the same regular expression is used several times in a single program. The caveat is that the internal cache holds only a limited number of compiled regular expression objects (512 in CPython 3, which is an implementation detail). The cache can be cleared manually using the `purge()` function; otherwise entries are discarded automatically when it is full.
 
 **Example 17: Compiling, using then purging a regex (same code as Example 10)**

In [None]:
doc_lst = ['My email address is John.Tan@nospamplease.com, do not send any spam',
           'You can contact us at gong-moon@sunmoonenterprise.net, we are pleased to help you.',
           'Please send for a quotation using cement4us@cementworld.org. We will reply you within 3 working days.'
          ]

# compiling the regex
pattern = re.compile(r'[a-zA-Z.-]+@[a-z-]+\.(com|net)')

for doc in doc_lst:
    matched = pattern.search(doc)
    if matched:
        print(matched.group())

In [None]:
re.purge()
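
A compiled regular expression object offers the same functions as the `re` library, minus the pattern argument. The sketch below reuses the account number pattern from Example 14 on a shortened, made-up text; the name `account_re` is hypothetical.

```python
import re

# Compile once, then reuse the object for several operations.
account_re = re.compile(r'\d-\d{5}-\d{2}')

text = 'Account Number: 1-88455-96 Name: John Doe Account Number: 1-85222-82'

print(account_re.findall(text))          # same as re.findall(pattern, text)
print(account_re.sub('<hidden>', text))  # same as re.sub(pattern, '<hidden>', text)
print(account_re.pattern)                # the original pattern string
```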