<a href="https://colab.research.google.com/github/jcl353/Final-Project-Bootcamp/blob/main/info2950_regex_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INFO 2950: Regular Expressions

During Friday's section, we will introduce and run through a tutorial on regular expressions. After this tutorial, you should be able to:


*   Know when and why you might use regular expressions
*   Generate a regular expression to match a substring within a string
*   Use the `regex` module to match, return, and replace substrings within a string



## Part 1: Regular expression basics

A **regular expression** (also often referred to as "regex" or "regexp") is a special type of text string that can be used to search for patterns in text. Regular expressions are useful for a variety of tasks, including:
* Learning if a specific word occurs in a document
* Finding all strings that match a general pattern or characteristic 
* Changing all instances of a certain word to a different word

**In what scenarios might you want to use a regular expression? Have you already used any regular expressions?**

In [None]:
import regex as re

The simplest type of regular expression matches a character in a string with a literal character (or group of characters). For example, you might check whether the group of characters `Ithaca` occurs in the string `Ithaca is gorges!`. 

When using regular expressions in Python, we will often use the `regex` module. This module has a handy function, `search()`, that searches for the first match of a regular expression in a string. You can call `group()` on the value returned by `search()` to get the matched string. 

Let's use `search()` to find whether the string `Ithaca` occurs in the string `Ithaca is gorges!`, and print out the match (if it is found). 

In [None]:
match = re.search(r'Ithaca', 'Ithaca is gorges!')

print(type(match)) # re.search() returns a re.Match object if found else None

if match:
  print(match.group())
else:
  print('not found !!')

<class '_regex.Match'>
Ithaca


In [None]:
match

<regex.Match object; span=(0, 6), match='Ithaca'>

In [None]:
'Ithaca is gorges!'[0:6]

'Ithaca'

In [None]:
match.group()

'Ithaca'

Because we knew exactly what we were looking for (the string `Ithaca`) we were able to use a literal pattern to find the match in the string.

**Why is there an `r` before the regex pattern?**

An `r` before a string tells the Python interpreter to treat backslashes as literal (raw) characters. 

Normally, Python uses backslashes as escape characters. Prefacing the string definition with `r` is a useful way to define a string where you need the backslash to be an actual backslash and not part of an escape code.
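
To make the difference concrete, here is a minimal sketch comparing a regular string with a raw string. The variable names `regular` and `raw` are just for illustration.

```python
# In a regular string, "\n" is an escape code for ONE newline character.
regular = '\n'
# In a raw string, the backslash and the "n" stay as TWO separate characters.
raw = r'\n'

print(len(regular))  # 1
print(len(raw))      # 2

# Without the r prefix, you would need to double the backslash to get the same text.
print(raw == '\\n')  # True
```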

Let's write a function for checking if our regular expression successfully finds a match. This function will print the matched string if one is found, and give us a clear message (`not found !!`) if there is no match.  

In [None]:
def check(match:re.Match):
  if match:
    print(match.group())
  else:
    print('not found !!')

And let's make sure that this function works on the match we just found:

In [None]:
check(re.search(r'Ithaca', 'Ithaca is gorges!'))

Ithaca


In [None]:
match = re.search(r"Columbia", "Ithaca is gorges!")
type(match)

NoneType

## Part 2: Special characters

So far, we've only looked at examples of regular expressions where the raw characters *literally* map onto the string. 

However, the real power of regular expressions comes from characters that symbolize special properties in a string. For example, are we looking for a string of numbers (regardless of the specific numbers)? Does the string always contain an `@` symbol in the middle? Do we only want to find strings that occur directly after a tab? 

To express patterns like these, regular expressions provide a number of **special characters** that match specific kinds of text rather than one literal character. 

**Here are some common special characters:**

|Regular Expression | Pattern Matched|
| --- | --- |
|`^`| start of string|
|`$`| end of string|
|`.`| any character (except a newline) |
|`\n`| newline|
|`\t`| tab|
|`\w`| word character (letter, digit, or underscore)|
|`\d` | decimal digit character `[0-9]` |
|`\s`| single whitespace character|


In [None]:
check(re.search(r'^I', 'Ithaca'))
check(re.search(r'^x', 'Ithaca'))

check(re.search(r'y$', 'Ithaca'))
check(re.search(r'a$', 'Ithaca'))

I
not found !!
not found !!
a
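
The cell above only exercises `^` and `$`. As a rough sketch of the remaining special characters, the example below searches a made-up sample string (`text` is hypothetical). It uses the standard-library `re` module so that it runs without extra installs; these basic calls should behave the same with `import regex as re`.

```python
import re  # standard library; assumed interchangeable with `regex` for these patterns

text = 'Room 42\tis open'  # hypothetical sample string containing a tab

digit = re.search(r'\d', text)      # first decimal digit
word_char = re.search(r'\w', text)  # first word character
space = re.search(r'\s', text)      # first whitespace character
any_two = re.search(r'..', text)    # any two characters in a row
tab = re.search(r'\t', text)        # a tab character

print(digit.group())        # 4
print(word_char.group())    # R
print(repr(space.group()))  # ' '
print(any_two.group())      # Ro
print(repr(tab.group()))    # '\t'
```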


You might also want to provide *options* for a single character. For example, you may be looking for the word "grey" but know it could also be spelled "gray", which differs by one character. You can use brackets to list the characters allowed at that position.

|Regular Expression | Pattern Matched|
| --- | --- |
|`[ae]`| `a` or `e`|

In [None]:
check(re.search(r'gr[ae]y', 'gray'))
check(re.search(r'gr[ae]y', 'grey'))
check(re.search(r'gr[ae]y', 'gryy'))


gray
grey
not found !!
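
Brackets are not limited to two options. As a small sketch (the word list is made up for illustration, and the standard-library `re` module stands in for `regex`), the pattern below accepts any one of three vowels in the middle position:

```python
import re  # standard library; assumed interchangeable with `regex` for this pattern

pattern = r'h[aiu]t'  # "h", then exactly one of "a", "i", or "u", then "t"

results = {}
for word in ['hat', 'hit', 'hut', 'hot']:
    match = re.search(pattern, word)
    # Store the matched text, or None when the middle letter is not an allowed option.
    results[word] = match.group() if match else None
    print(word, '->', results[word] if match else 'not found !!')
```

Only `hot` fails, because `o` is not listed inside the brackets.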


### Repetition

Things get more interesting when you use `+`, `*`, and `?` to specify repetition in the pattern:

* `+` : 1 or more occurrences of the pattern to its left, e.g. `i+` = one or more `i`'s
* `*` : 0 or more occurrences of the pattern to its left
* `?` : match 0 or 1 occurrences of the pattern to its left
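
The three repetition characters listed above can be compared side by side. This is a minimal sketch with made-up sample strings, using the standard-library `re` module in place of `regex`:

```python
import re  # standard library; assumed interchangeable with `regex` for these patterns

plus = re.search(r'go+d', 'gooood')        # `+`: one or more "o"
plus_fail = re.search(r'go+d', 'gd')       # `+` needs at least one "o", so no match
star = re.search(r'go*d', 'gd')            # `*`: zero "o"s is fine
opt_yes = re.search(r'colou?r', 'colour')  # `?`: the "u" is present
opt_no = re.search(r'colou?r', 'color')    # `?`: the "u" is absent

print(plus.group())     # gooood
print(plus_fail)        # None
print(star.group())     # gd
print(opt_yes.group())  # colour
print(opt_no.group())   # color
```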

In [None]:
# b at the start of the string, followed by 1 or more word characters
check(re.search(r'^b\w+', 'foobar'))

# b (not necessarily at the start), followed by 1 or more word characters
check(re.search(r'b\w+', 'foobar'))

not found !!
bar


**Why did the second expression work, but not the first?** 

## Part 3: Practice finding regular expressions in strings
In small groups, complete the following exercises. 

Write a regular expression to find each of the following patterns in the string `Cornell University`:


```
re.search(r'write your answer regex here', 'Cornell University')

```



1. Search the string to see if it starts with "Corn" and ends with "College"

2. Search the string to see if it starts with "C", followed by 1 or more characters and a single whitespace

3. Find one or more decimal digits in this string:   
`Cornell Uni. was established in 1865`

4. Find the email address inside the string `'xyz alice-b@google.com purple cat'`.

**📣 New function alert! 📣** You can also use the regex function `sub()` to replace a string with a different string. 

`re.sub("og_string", "new_string", "string_containing_og_string")`
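
As a rough illustration of that call, the sketch below swaps one spelling for another in a made-up sentence (`sentence` and `updated` are hypothetical names). It uses the standard-library `re` module, whose `sub()` takes the same three arguments in the same order.

```python
import re  # standard library; assumed interchangeable with `regex` for this call

sentence = 'The sky is grey and the sea is grey.'  # hypothetical sample sentence

# Arguments: pattern to find, replacement text, string to search.
updated = re.sub(r'grey', 'gray', sentence)

print(updated)   # The sky is gray and the sea is gray.
print(sentence)  # The sky is grey and the sea is grey.
```

Note that `sub()` replaces every match, not just the first, and returns a new string rather than changing the original.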

5. In the string `Ithaca is gorgeous`, replace the word `gorgeous` with `gorges`. 

**📣 (Another) new function alert! 📣** 

So far, you've only been identifying the first occurrence of a regular expression in a string. There's another function, `findall()`, that returns a list of every non-overlapping match within a given string. This function takes two arguments: (1) the pattern to look for and (2) the string that contains the pattern(s):

`re.findall("string", "longer_string")`
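
To see how this differs from `search()`, here is a minimal sketch on a made-up string (`text` is hypothetical), again using the standard-library `re` module in place of `regex`:

```python
import re  # standard library; assumed interchangeable with `regex` for these calls

text = 'grey skies, gray clouds, and a grey cat'  # hypothetical sample string

first = re.search(r'gr[ae]y', text).group()  # only the first match
every = re.findall(r'gr[ae]y', text)         # a list of all matches, in order

print(first)  # grey
print(every)  # ['grey', 'gray', 'grey']
```

Unlike `search()`, `findall()` returns plain strings in a list, so there is no need to call `group()`; if nothing matches, the list is empty.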

6. Find all single digit characters (0-9) in the string `Today is the 14th day of October, in the year 2022.`

7. Find all **words** (understood as multiple alphanumeric characters in a row) in the string `The leaves are turning red and orange!`.