# NLP Basics: Learning how to use regular expressions

### Using regular expressions in Python

Python's `re` module, part of the standard library, is the usual starting point for working with regular expressions. See the [official documentation](https://docs.python.org/3/library/re.html) for details.

In [2]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Splitting a sentence into a list of words

In [7]:
a = re.split(r'\s', re_test)  # raw string, so '\s' reaches the regex engine unchanged
print(type(a), a)

<class 'list'> ['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']


In [9]:
# r'\s' matches a single whitespace character, so each extra space
# in a run produces an empty string in the result
b = re.split(r'\s', re_test_messy)
b

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [11]:
# r'\s+' treats a run of one or more whitespace characters as a single separator
b = re.split(r'\s+', re_test_messy)
b

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [15]:
# This doesn't split at all: the string contains no whitespace
c = re.split(r'\s+', re_test_messy1)
c

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

**Capitalising a shorthand character class inverts it.**   
For example,  
`\s` matches a whitespace character, `\S` matches any NON-whitespace character  

`\w` matches a word character (a letter, digit, or underscore), `\W` matches any NON-word character.
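
To make the lowercase/uppercase pairing concrete, here is a minimal sketch that runs each class and its inverse over the same string. The string `sample` is made up for illustration and is not part of the notebook above.

```python
import re

# Hypothetical sample string, chosen to contain letters, digits,
# whitespace and punctuation.
sample = 'PEP8, line 42!'

# Lowercase classes match the named kind of character...
digits = re.findall(r'\d', sample)       # single digit characters
word_runs = re.findall(r'\w+', sample)   # runs of letters, digits, underscore
spaces = re.findall(r'\s', sample)       # single whitespace characters

# ...and the uppercase versions match everything else.
non_digits = re.findall(r'\D+', sample)
non_word = re.findall(r'\W+', sample)
non_space = re.findall(r'\S+', sample)

print(digits, word_runs, spaces)
print(non_digits, non_word, non_space)
```

Note that `\S+` keeps the punctuation attached to each token (`'PEP8,'`, `'42!'`), while `\w+` drops it.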

In [19]:
# r'\W+' matches a run of non-word characters, so it splits on
# punctuation as well as whitespace
c = re.split(r'\W+', re_test_messy1)
c

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [21]:
# Instead of splitting on separators, collect the runs of non-whitespace characters
d = re.findall(r'\S+', re_test)
d

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [22]:
e = re.findall(r'\S+', re_test_messy)
e

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [23]:
# Doesn't work: with no whitespace, the whole string is one non-whitespace run
f = re.findall(r'\S+', re_test_messy1)
f

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [24]:
# r'\w+' collects runs of word characters and skips the punctuation between them
g = re.findall(r'\w+', re_test_messy1)
g

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
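
`re.split` and `re.findall` gave the same lists above, but they can differ at the edges of a string. The sketch below uses a made-up string, `padded`, that starts and ends with whitespace to show the difference.

```python
import re

# Hypothetical input with leading and trailing whitespace.
padded = '  leading and trailing spaces  '

# Splitting on separators leaves an empty string wherever the text
# begins or ends with a separator.
split_result = re.split(r'\s+', padded)

# Collecting the tokens directly never produces empty strings.
findall_result = re.findall(r'\S+', padded)

print(split_result)
print(findall_result)
```

For plain whitespace tokenising, the built-in `padded.split()` gives the same result as the `findall` call here.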

### Replacing a specific string

In [25]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [28]:
print(re.findall('[A-Z]+[0-9]+', pep7_test))
print(re.findall('[A-Z]+[0-9]+', peep8_test))

['PEP7']
['PEEP8']


In [30]:
print(re.sub('[A-Z]+[0-9]+', 'PEP8', pep7_test))
print(re.sub('[A-Z]+[0-9]+', 'PEP8', peep8_test))

I try to follow PEP8 guidelines
I try to follow PEP8 guidelines
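
The pattern `[A-Z]+[0-9]+` replaces any run of capitals followed by digits, which may be broader than intended. As a sketch of one possible tightening, the example below compares it with a narrower pattern; the string `text` and the narrower pattern are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical sentence mixing misspellings of "PEP8" with an unrelated token.
text = 'PEP7 and PEEP8 are typos, but ISO9001 is a real standard'

# The broad pattern rewrites every capitals-plus-digits token.
broad = re.sub(r'[A-Z]+[0-9]+', 'PEP8', text)

# A narrower pattern: "P", one or more "E", "P", then digits, anchored
# at word boundaries so it cannot match inside a longer token.
narrow_pattern = r'\bPE+P[0-9]+\b'
narrow = re.sub(narrow_pattern, 'PEP8', text)

# re.subn returns the new string together with the number of replacements.
_, n_replaced = re.subn(narrow_pattern, 'PEP8', text)

print(broad)
print(narrow)
print(n_replaced)
```

How strict the pattern should be depends on the data: the broad version is fine for the three test strings above, while the narrow one is safer on text that contains other codes.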


### Other useful `re` functions

- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
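
The functions listed above can be sketched briefly as follows. The strings reuse values from earlier cells, redefined here so the example runs on its own; the `literal` variable is introduced only for illustration.

```python
import re

text = 'I try to follow PEP8 guidelines'
pattern = r'[A-Z]+[0-9]+'

# re.search scans the whole string and returns the first match, or None.
found = re.search(pattern, text)
print(found.group(), found.span())

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match(pattern, text)          # None: the text starts with "I "
print(at_start)

# re.fullmatch requires the entire string to match the pattern.
whole_ok = re.fullmatch(pattern, 'PEP8')    # a match object
whole_fail = re.fullmatch(pattern, text)    # None
print(whole_ok.group(), whole_fail)

# re.finditer yields match objects one at a time, which gives access
# to positions as well as the matched text.
positions = [(m.group(), m.start()) for m in re.finditer(r'\w+', text)]
print(positions)

# re.escape backslash-escapes regex metacharacters so that a literal
# string can be used safely as a pattern.
messy = 'This-is-a-made/up.string*to>>>>test'
literal = 'up.string*to'                    # contains "." and "*"
unescaped = re.search(literal, messy)       # None: "*" acts as a quantifier
escaped = re.search(re.escape(literal), messy)
print(unescaped, escaped.group())
```

Unlike `re.findall`, which returns plain strings, `re.search`, `re.match`, `re.fullmatch` and `re.finditer` work with match objects, so check for `None` before calling `.group()`.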