# NLP Basics: Learning how to use regular expressions

Python's built-in re module is the most commonly used regex resource.

In [1]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

Before a model can learn how words relate to the response variable, each string has to be split into tokens (words).

# Splitting a sentence into a list of words

In [17]:
#split the string on each single whitespace character (raw string keeps \s intact)
re.split(r'\s', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [4]:
#a single whitespace split leaves empty strings between repeated spaces
re.split(r'\s', re_test_messy)

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [6]:
#one or more whitespace characters to split the string
re.split(r'\s+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [8]:
#one or more whitespace characters to split the string (no whitespace here, so nothing is split)
re.split(r'\s+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

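When the words are separated by special characters rather than whitespace, \s+ finds nothing to split on and returns the whole string as a single token. Splitting on one or more non-word characters (\W+) handles this case.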

In [9]:
#one or more non-word characters to split the string
re.split(r'\W+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
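One caveat with split() is worth a quick sketch: when the string begins or ends with a separator, the result contains empty strings at the edges. The sample string `edge_case` below is made up for illustration and is not part of the original test strings.

```python
import re

# Hypothetical sample string that starts and ends with punctuation
edge_case = '...This-is--a test!'

# split() returns an empty string wherever a separator touches the edge
raw_tokens = re.split(r'\W+', edge_case)

# Dropping the empty strings leaves only the words
clean_tokens = [token for token in raw_tokens if token]

print(raw_tokens)    # ['', 'This', 'is', 'a', 'test', '']
print(clean_tokens)  # ['This', 'is', 'a', 'test']
```

Filtering afterwards works, but it is an extra step that has to be remembered every time.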

Instead of searching for what splits the words, just search for the actual words themselves and ignore the characters that split the words.

This can be done using the findall() method.

In [11]:
#look for runs of one or more non-whitespace characters
re.findall(r'\S+', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [13]:
re.findall(r'\S+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [15]:
#look for runs of one or more word characters
re.findall(r'\w+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

Useful methods for tokenizing:
1. findall() : searches for the words themselves and ignores whatever separates them

2. split() : searches for the separators and returns the text between them

Useful regexes for tokenizing:
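1. \w matches a word character (letter, digit, or underscore); \W matches any non-word character

2. \s matches a whitespace character; \S matches any non-whitespace character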
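The difference between the four character classes shows up most clearly on text with punctuation. The following is a small sketch on a made-up sentence (`sample` is hypothetical, not one of the test strings above):

```python
import re

# Hypothetical sentence with punctuation attached to some words
sample = 'Hello, world! It is 2 late.'

# \S+ keeps any punctuation that touches a word
non_space_runs = re.findall(r'\S+', sample)

# \w+ keeps only letters, digits, and underscores
word_runs = re.findall(r'\w+', sample)

# Splitting on \s+ gives the same tokens as \S+ here
split_on_space = re.split(r'\s+', sample)

# Splitting on \W+ leaves a trailing empty string because of the final '.'
split_on_nonword = re.split(r'\W+', sample)

print(non_space_runs)    # ['Hello,', 'world!', 'It', 'is', '2', 'late.']
print(word_runs)         # ['Hello', 'world', 'It', 'is', '2', 'late']
print(split_on_nonword)  # ['Hello', 'world', 'It', 'is', '2', 'late', '']
```

For tokenizing ordinary text, findall() with \w+ is usually the least surprising of the four, since it neither keeps punctuation nor produces empty strings.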

# Replacing a specific string

In [18]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [19]:
import re

re.findall('[a-z]+', pep8_test)

['try', 'to', 'follow', 'guidelines']

In [20]:
re.findall('[A-Z]+', pep8_test)

['I', 'PEP']

In [21]:
re.findall('[0-9]+', pep8_test)

['8']

In [22]:
re.findall('[A-Z]+[0-9]+', pep8_test)

['PEP8']

In [23]:
re.findall('[A-Z]+[0-9]+', pep7_test)

['PEP7']

In [24]:
re.findall('[A-Z]+[0-9]+', peep8_test)

['PEEP8']

In [26]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep8_test)

'I try to follow PEP8 Python Styleguide guidelines'

In [27]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep7_test)

'I try to follow PEP8 Python Styleguide guidelines'

In [28]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8_test)

'I try to follow PEP8 Python Styleguide guidelines'
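The three substitutions above can be sketched as one reusable step with a compiled pattern. Note that [A-Z]+[0-9]+ is broad: it replaces any run of capitals followed by digits, not only misspellings of PEP8. The sentence containing `HTTP2` and the narrower pattern `PE+P[0-9]+` are assumptions added for illustration.

```python
import re

# Same pattern as above, compiled once so it can be reused
code_pattern = re.compile(r'[A-Z]+[0-9]+')

sentences = [
    'I try to follow PEP8 guidelines',
    'I try to follow PEP7 guidelines',
    'I try to follow PEEP8 guidelines',
]
normalized = [code_pattern.sub('PEP8 Python Styleguide', s) for s in sentences]

# The broad pattern also rewrites unrelated tokens such as HTTP2
unrelated = code_pattern.sub('PEP8 Python Styleguide', 'I use HTTP2 daily')

# Hypothetical narrower pattern: P, one or more E, P, then digits
narrow_pattern = re.compile(r'PE+P[0-9]+')
narrow_normalized = [narrow_pattern.sub('PEP8 Python Styleguide', s) for s in sentences]
narrow_unrelated = narrow_pattern.sub('PEP8 Python Styleguide', 'I use HTTP2 daily')

print(normalized[0])     # I try to follow PEP8 Python Styleguide guidelines
print(unrelated)         # I use PEP8 Python Styleguide daily
print(narrow_unrelated)  # I use HTTP2 daily
```

How tight the pattern should be depends on what else appears in the text being cleaned.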