# `REGEX` - [regular expression](https://en.wikipedia.org/wiki/Regular_expression) matching operations
First we import the Python [`re` library](https://docs.python.org/3/library/re.html):

In [12]:
import re

### Search (`search`)
The following example searches for the word `letter` in a string.
If the pattern is not found, `re.search` returns `None`, which evaluates as false in an `if` test.

In [13]:
# our string (named `text` so that the built-in `str` is not shadowed)
text = 'An example of a three letter word: cat, and a four letter word: dogs!!'

match = re.search(r'letter', text)

if match:
    print('found', match.group() )
else:
    print('match not found')

found letter
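
Besides `group()`, the match object also records where the match was found. A minimal sketch, reusing the example sentence; the names `hit` and `missing` are illustrative only:

```python
import re

text = 'An example of a three letter word: cat, and a four letter word: dogs!!'

hit = re.search(r'letter', text)
start, end = hit.span()          # character offsets of the first match
print(start, end, text[start:end])

missing = re.search(r'zebra', text)
print(missing)                   # None, because 'zebra' does not occur in the text
```

Slicing the string with the offsets from `span()` returns exactly the matched text, which is handy when the surrounding context is needed.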


The following example searches for any three letter word.

In [14]:
# our string (named `text` so that the built-in `str` is not shadowed)
text = 'An example of a three letter word: cat, and a four letter word: dogs!!'

match = re.search(r'\b\w{3}\b', text)

if match:
    print('found', match.group() )
else:
    print('match not found')

found cat


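Notice the `r` prefix, which marks a raw string: Python then passes backslashes such as `\b` and `\w` to the regex engine unchanged instead of interpreting them as string escape sequences.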
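
The effect of the `r` prefix can be checked directly. This is a small sketch with an illustrative string; without the prefix, `'\b'` is the single backspace character rather than a word boundary:

```python
import re

text = 'a cat sat'

plain = '\b'     # ordinary string: a single backspace character
raw = r'\b'      # raw string: a backslash followed by the letter b
print(len(plain), len(raw))

no_match = re.search('\bcat\b', text)    # looks for backspace + 'cat' + backspace
match = re.search(r'\bcat\b', text)      # looks for the whole word 'cat'
print(no_match, match.group())
```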

### Basic patterns

*    `a, X, 9` ordinary characters just match themselves exactly
*    `. ^ $ * + ? { [ ] \ | ( )` meta-characters which do not match themselves because they have special meanings
*    Square brackets indicate a set of characters, so `[abc]` matches `a` or `b` or `c`.
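
To match a meta-character literally it has to be escaped with a backslash. The sketch below uses a made-up receipt string; `re.escape` is the standard library helper that adds the backslashes automatically:

```python
import re

receipt = 'Total: $3.50 (approx.)'

# '$' and '.' are meta-characters, so both are escaped to match them literally
prices = re.findall(r'\$[0-9]+\.[0-9]+', receipt)

# re.escape adds the backslashes when the literal text comes from a variable
literal = '(approx.)'
found = re.search(re.escape(literal), receipt)

# a character set matches any one of the listed characters
vowels = re.findall(r'[aeiou]', 'regex')
print(prices, found.group(), vowels)
```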

### Common tokens
| Token | Meaning |
| --- | --- |
| `[abc]` | A single character out of `a`, `b` or `c` |
| `[^abc]` | A single character other than `a`, `b` or `c` |
| `[a-z]` | A character in the range `a` to `z` |
| `[^a-z]` | A character not in the range `a` to `z` |
| `[a-zA-Z]` | A character in the range `a` to `z` or `A` to `Z` |
| `.` | Any single character (except newline) |
| `a\|b` | Match either `a` or `b` |
| `\s` | Any whitespace character, *i.e.* space, tab, newline, carriage return, form feed or vertical tab: `[ \t\n\r\f\v]` |
| `\S` | Any non-whitespace character |
| `\d` | Any decimal digit |
| `\D` | Any non-digit |
| `\w` | Any word character, *i.e.* for ASCII text a letter, digit or underscore: `[a-zA-Z0-9_]` |
| `\W` | Any non-word character |
| `^` | Start of string |
| `$` | End of string |
| `\b` | Boundary between word and non-word |
| `\B` | Non-word boundary |
| `\n` | Newline |
| `\r` | Carriage return |
| `\t` | Tab |
| `\0` | Null character |
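
A few of these tokens in action. This is only a sketch on an invented sample string, and the variable names are illustrative:

```python
import re

sample = 'Order 66 shipped on 2024-05-17\tto Bob_42'

digits = re.findall(r'\d+', sample)         # runs of decimal digits
fields = re.split(r'\s+', sample)           # split on any whitespace, including the tab
starts_with_order = re.search(r'^Order', sample) is not None
ends_with_digit = re.search(r'\d$', sample) is not None
either = re.findall(r'\bon\b|\bto\b', sample)   # whole word 'on' or whole word 'to'
print(digits, fields, starts_with_order, ends_with_digit, either)
```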

### Repetition

You can use the `+`, `*` and `?` quantifiers to specify repetition in the pattern:

* `+` -- 1 or more occurrences of the pattern to its left, e.g. `i+` = one or more i's
* `*` -- 0 or more occurrences of the pattern to its left
* `?` -- 0 or 1 occurrences of the pattern to its left
* `{n}` -- exactly `n` occurrences of the pattern to its left, as in `\w{3}` in the three letter word example above
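
The quantifiers can be tried out on short strings. A minimal sketch with made-up inputs:

```python
import re

ones = re.findall(r'i+', 'piiig hi')                # one or more i's
stars = re.findall(r'ab*', 'a ab abbb')             # an a followed by zero or more b's
optional = re.findall(r'colou?r', 'color colour')   # the u may or may not be present
print(ones, stars, optional)
```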

### Example
Find the (first) email address in a string

In [15]:
text = 'Yesterday alice@amazon.com, was working with bob@spacex.com on the project'

match = re.search(r'[\w.-]+@[\w.-]+', text)
if match:
    print(match.group())

alice@amazon.com


But what if the string contains more than one match that we want to find?

### Find all (`findall`)
`search` returns only the first match. To find all of the matches, use `findall`:

In [16]:
text = 'Yesterday alice@amazon.com, was working with bob@spacex.com on the project'

emails = re.findall(r'[\w.-]+@[\w.-]+', text)
for email in emails:
    print(email)

alice@amazon.com
bob@spacex.com
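
When the pattern contains parenthesised groups, `findall` returns the groups rather than the whole match, as a list of tuples. A short sketch that splits each address into user and domain (the names `pairs`, `user` and `domain` are illustrative):

```python
import re

text = 'Yesterday alice@amazon.com, was working with bob@spacex.com on the project'

# two groups: the part before the @ and the part after it
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
for user, domain in pairs:
    print(user, 'at', domain)
```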


### Replace (substitute) (`sub`)
A very common use of regex is replacement: `re.sub(pattern, replacement, string)`

(Note that replacing with `''` amounts to deletion.)
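
A few small substitutions, as a sketch on invented strings. The replacement may refer back to groups of the pattern with `\1`, `\2`, and so on, and the optional `count` argument limits the number of replacements:

```python
import re

squeezed = re.sub(r'\s+', ' ', 'too    many   spaces')   # collapse runs of whitespace
letters_only = re.sub(r'\d', '', 'a1b22c')               # delete every digit

# \1, \2, \3 in the replacement refer to the parenthesised groups of the pattern
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', 'Due 2024-05-17')

first_only = re.sub(r'a', 'A', 'banana', count=1)        # replace only the first match
print(squeezed, letters_only, reordered, first_only)
```
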
#### Example of cleaning a tweet

In [17]:
def clean_tweet(tweet):
    tweet = re.sub(r'http\S+\s*', '', tweet)  # remove URLs
    tweet = re.sub(r'\b(?:RT|cc)\b', '', tweet)  # remove the standalone tokens RT and cc (not "cc" inside words such as "occasion")
    tweet = re.sub(r'#\S+', '', tweet)  # remove hashtags
    tweet = re.sub(r'@\S+', '', tweet)  # remove mentions
    tweet = re.sub(r'[%s]' % re.escape(r"""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""), '', tweet)  # remove punctuation
    tweet = re.sub(r'\s+', ' ', tweet)  # remove extra whitespace
    return tweet
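
When deleting short tokens such as `RT` or `cc`, it matters whether the pattern is anchored with `\b`. The sketch below, on a made-up tweet, compares an unanchored pattern with an anchored one:

```python
import re

tweet = 'RT on occasion we cc the accountant'

# unanchored: also deletes the letters "cc" inside longer words
careless = re.sub(r'RT|cc', '', tweet)

# anchored with \b: deletes RT and cc only when they stand alone as words
careful = re.sub(r'\b(?:RT|cc)\b', '', tweet)

print(repr(careless))
print(repr(careful))
```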

In [18]:
document = """Just warning you... this book has real life dialog.
              The characters drop the F-bomb     on occasion 🙂 COWBOY
              TAKE ME AWAY http://ow.ly/lKwx5 (@PenelopeChilds)
              #books #stuff"""

In [19]:
clean_tweet(document)



#### Deleting all emojis

In [20]:
# to strip the emojis that fall in the Unicode ranges listed below
emoji_pattern = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+")

In [21]:
emoji_pattern.sub(r'', document) # no emojis
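
A range-based pattern of this kind only removes code points inside the listed ranges. The sketch below rebuilds the same four ranges under the illustrative name `emoji_ranges` and shows one emoji that is not covered:

```python
import re

emoji_ranges = re.compile('['
                          '\U0001F600-\U0001F64F'  # emoticons
                          '\U0001F300-\U0001F5FF'  # symbols & pictographs
                          '\U0001F680-\U0001F6FF'  # transport & map symbols
                          '\U0001F1E0-\U0001F1FF'  # flags
                          ']+')

covered = 'Good job \U0001F44D\U0001F642'    # thumbs up, slightly smiling face
not_covered = 'Hmm \U0001F914'               # thinking face (U+1F914), outside the ranges

stripped = emoji_ranges.sub('', covered)
kept = emoji_ranges.sub('', not_covered)
print(repr(stripped), repr(kept))
```

If full coverage matters, the ranges need to be extended or a dedicated package used instead.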



#### Emoji to text
Example of converting emojis to text using the [emoji](https://github.com/carpedm20/emoji/) package (which is pre-installed in Kaggle)

In [22]:
!pip install emoji
import emoji

result = emoji.demojize('As text this will be a 👍')
result

Collecting emoji
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.15.0-py3-none-any.whl (608 kB)
Installing collected packages: emoji
Successfully installed emoji-2.15.0


'As text this will be a :thumbs_up:'

In [23]:
document = emoji.demojize(document)
document = clean_tweet(document)
document



# Exercise
Use `regex` to normalize (*i.e.* clean up) this web page: [https://ai.stanford.edu/~amaas/data/sentiment/](https://ai.stanford.edu/~amaas/data/sentiment/)

In [24]:
import urllib.request
url = 'https://ai.stanford.edu/~amaas/data/sentiment/'
# and an alternative, more challenging, page (this assignment overrides the one above):
url = 'https://elpais.com'
page = urllib.request.urlopen(url).read().decode('utf-8')
print(page)

<!DOCTYPE html><html lang="es-ES"><head><meta charSet="UTF-8"/><meta name="viewport" content="width=device-width, initial-scale=1.0"/><link rel="preconnect" href="//static.elpais.com"/><link rel="preconnect" href="//assets.adobedtm.com"/><link rel="preconnect" href="//sdk.privacy-center.org"/><link rel="preload" href="https://imagenes.elpais.com/resizer/v2/E5S3ZKNOJBARBLSELYUVG4S6GA.jpg?auth=d3cde50971446b9d78cb879cc45989336d01354555132196cbe5963440e76117&amp;width=414&amp;height=233&amp;smart=true" imageSrcSet="https://imagenes.elpais.com/resizer/v2/E5S3ZKNOJBARBLSELYUVG4S6GA.jpg?auth=d3cde50971446b9d78cb879cc45989336d01354555132196cbe5963440e76117&amp;width=414&amp;height=233&amp;smart=true 414w,https://imagenes.elpais.com/resizer/v2/E5S3ZKNOJBARBLSELYUVG4S6GA.jpg?auth=d3cde50971446b9d78cb879cc45989336d01354555132196cbe5963440e76117&amp;width=828&amp;height=466&amp;smart=true 828w,https://imagenes.elpais.com/resizer/v2/E5S3ZKNOJBARBLSELYUVG4S6GA.jpg?auth=d3cde50971446b9d78cb879cc4598

In [25]:
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(page, 'html.parser')
# soup.get_text(strip=True)
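
One possible starting point for the exercise, offered as a rough sketch rather than a full solution. The helper name `strip_html` and the tiny `sample_page` are assumptions made for illustration, and real pages will need more rules than these:

```python
import html
import re

def strip_html(page):
    """Reduce an HTML string to plain text using only regular expressions."""
    # drop <script> and <style> blocks together with their contents
    text = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', ' ', page)
    # drop HTML comments
    text = re.sub(r'(?s)<!--.*?-->', ' ', text)
    # drop all remaining tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # turn entities such as &amp; back into characters
    text = html.unescape(text)
    # collapse runs of whitespace
    return re.sub(r'\s+', ' ', text).strip()

sample_page = ('<html><head><style>p {color: red}</style></head>'
               '<body><p>Tom &amp; Jerry</p><script>var x = 1;</script></body></html>')
cleaned = strip_html(sample_page)
print(cleaned)
```

The same function can then be applied to the downloaded `page` string from the cell above.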