# Module 7: Text Processing

- How to parse and/or pattern match within text data?


- __String Object Methods__ in Python are covered in the PfDA text and are easy to understand. Refer to pages 215 - 217.


- Pattern matching is possible via "regular expressions", aka "regex"


- The regular expression "language" provides a universal method for parsing/pattern matching of text data. Regex processing is often used for extracting useful information from HTML, JSON, JS, and other heavily formatted types of data sources.


- The idea behind regex is to use special codes/character sequences as "wildcards" for purposes of finding character strings within a text that conform to a particular syntactical structure.


- Multiple regex reference docs are provided in the Module 7 reading materials within Canvas. You should refer to them for a detailed explanation of regex logic + syntax.


- In Python, we rely on the 're' module for regex functionality
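
Before turning to regex, here is a minimal sketch of the kind of built-in string object methods mentioned above. The sample string `record` and the variable names are hypothetical and are not taken from the PfDA text; the methods themselves (`split`, `strip`, `join`, `replace`) are standard Python `str` methods.

```python
# hypothetical sample: comma-separated values with inconsistent spacing
record = "  a, b,  guido ,d  "

# split on a fixed delimiter, then trim whitespace from each piece
pieces = [piece.strip() for piece in record.split(",")]
print(pieces)

# join the cleaned pieces back together with a different delimiter
joined = "::".join(pieces)
print(joined)

# simple substring test and fixed-string replacement
has_guido = "guido" in record
replaced = record.strip().replace(",", ";")
print(has_guido, replaced)
```

String methods like these work well when the delimiter is fixed and known in advance; regex becomes useful when the pattern varies.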

### Regular Expressions

We'll start with a very simple example: splitting a string containing words that are separated by a varying number of whitespace characters.

In [1]:
# load the re module
import re

# how to split a string containing a varying number of whitespace characters
# start by generating some data to use
# NOTE: '\t' is the escape sequence for a horizontal tab character
text = "foo    bar\t baz  \tqux"
text

'foo    bar\t baz  \tqux'

In [2]:
# apply re.split() using the regex r'\s+', which describes a run of
# one or more whitespace characters. The r'' (raw string) prefix keeps
# Python from treating '\s' as a string escape sequence.
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']
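
As a small sketch of how the same split could be organized with a compiled pattern, the block below reuses the sample string from above. The variable names `whitespace`, `words`, and `gaps` are illustrative assumptions. It also shows that, for this particular whitespace-only case, the built-in `str.split()` with no arguments gives the same result.

```python
import re

text = "foo    bar\t baz  \tqux"

# compile the pattern once so it can be reused
whitespace = re.compile(r'\s+')

# split on runs of whitespace
words = whitespace.split(text)
print(words)

# findall() returns the whitespace runs themselves rather than the words
gaps = whitespace.findall(text)
print(len(gaps))

# str.split() with no arguments also splits on runs of whitespace
print(text.split() == words)
```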

#### Extracting Components of a String as Separate Groupings of Data

Suppose we are given the text string shown below. The content represents information related to courses being offered by a university. We are provided with the course number, a three-letter course code that uniquely identifies the academic department responsible for the course, and the course name. (Sourced from Module 7 Recommended Readings: https://www.machinelearningplus.com/python/python-regex-tutorial-examples/)

In [3]:
text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""  

Let's say we want to __extract all of the course numbers__ from the string. How might we accomplish this using a regular expression?

In [4]:
# 1. extract all course numbers
re.findall('[0-9]+', text)

['101', '205', '189']

As we discussed above, regular expressions are used for parsing/pattern matching within text data. To extract the course numbers (and only the course numbers) from the string shown above, we've defined a regular expression:

- __[0-9]+__ 

The parsing behavior of this regular expression can be explained as follows:

- __[0-9]__ = Match any digit within the range of 0, 1, 2, .., 9


- __'+'__   = A plus sign ('+') within a regular expression indicates that the regular expression parser should __match one or more of the preceding regex characters/codes__. In this example, the preceding characters/codes are comprised of the digits ranging from 0 through 9 ('[0-9]')
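
To see what the '+' contributes, the following sketch compares the pattern with and without it on the same course text. The variable names `single_digits` and `digit_runs` are illustrative only.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# without '+', each digit is matched on its own
single_digits = re.findall(r'[0-9]', text)
print(single_digits)

# with '+', each run of consecutive digits is returned as one match
digit_runs = re.findall(r'[0-9]+', text)
print(digit_runs)
```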



What if we'd like to __extract all of the three letter academic department codes__ from the string?

In [5]:
# 2. extract all course codes
re.findall('[A-Z]{3}', text)

['COM', 'MAT', 'ENG']

The parsing behavior of this regular expression can be explained as follows:

- __[A-Z]__ = Match any uppercase letter within the range of A, B, C, .., Z


- __{3}__ = Match __exactly__ three of the preceding regex characters/codes. In this example, the preceding characters/codes are the uppercase letters within the range of A, B, C, .., Z
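
One caveat worth sketching: '{3}' limits the length of each match, not the length of the word it is found in. In the hypothetical string below (not part of the original example), an all-uppercase course name is carved into three-letter chunks. Adding the word-boundary code '\b' on both sides is one possible way to require a standalone three-letter word.

```python
import re

# hypothetical input where the course name is written in uppercase
sample = "205   MAT   MATHEMATICS"

# matches any three consecutive uppercase letters, even inside a longer word
loose = re.findall(r'[A-Z]{3}', sample)
print(loose)

# '\b' requires a word boundary before and after the three letters
strict = re.findall(r'\b[A-Z]{3}\b', sample)
print(strict)
```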

Now let's __extract the course names__ from the provided string:

In [6]:
# 3. extract all course names
re.findall('[A-Za-z]{4,}', text)

['Computers', 'Mathematics', 'English']

The parsing behavior of this regular expression can be explained as follows:

- __[A-Za-z]__ = Match any uppercase OR lowercase letter within the range of A, B, C, .., Z and a, b, c, .., z


- __{4,}__ = Match the preceding regex characters/codes __4 or more times__. In this example, the preceding characters/codes are any uppercase or lowercase letter within the range of A, B, C, .., Z and a, b, c, .., z
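
The three patterns above can also be combined. The sketch below is one possible approach, not the method used in the source tutorial: it wraps each piece in parentheses (capture groups) so that `re.findall()` returns one tuple per course. The name `course_pattern` is an assumption made for illustration.

```python
import re

text = """101   COM   Computers
205   MAT   Mathematics
189   ENG    English"""

# three capture groups separated by runs of whitespace:
# course number, department code, course name
course_pattern = re.compile(r'([0-9]+)\s+([A-Z]{3})\s+([A-Za-z]{4,})')

courses = course_pattern.findall(text)
for number, code, name in courses:
    print(number, code, name)
```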

#### Extracting Email Addresses From a String

Now a more complicated (and more realistic) example: __Extracting some simple email addresses__ from a text.

In [4]:
# start by compiling the regex you will use to parse out the email addresses:
# Here you are creating a compiled regex and assigning it the name 'email'
# The regex can then be applied repeatedly throughout your Python code.
# The regex we will use is r'\w+@\w+\.[a-z]{3}', which we will decipher a bit
# later. The r'' (raw string) prefix keeps the backslashes intact.
email = re.compile(r'\w+@\w+\.[a-z]{3}')

In [5]:
# define some sample text containing email addresses
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# now apply the 'email' regex to the sample text to extract ONLY the email
# addresses
email.findall(text)

['guido@python.org', 'guido@google.com']

So what exactly is the regular expression __'\w+@\w+\\.[a-z]{3}'__ going to "pattern match" to? 

Let's dissect the individual components. *(NOTE: For a detailed description of regex syntax, please refer to https://docs.python.org/3/howto/regex.html and https://www.machinelearningplus.com/python/python-regex-tutorial-examples/)*

- __'\w'__ = Match any "word" character: a letter, a digit, or an underscore (for ASCII text, equivalent to the regex [a-zA-Z0-9_])


- __'+'__ = Match one or more of the preceding regex character/code.


- __'@'__ = @ sign


- __'\w'__ = Match any word character (letter, digit, or underscore)


- __'+'__ = Match one or more of the preceding regex character/codes


- __'\\.'__ = A literal period. An unescaped '.' in a regex matches any character except a newline, so to match only a period it MUST be preceded by the escape character '\\'


- __'[a-z]{3}'__ = exactly 3 lowercase characters in the range of [a-z].
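
To check the dissection above against a few inputs, here is a small sketch. The sample addresses are hypothetical and chosen only to exercise each component of the regex: an underscore in the user name, a two-letter and a four-letter ending, and an all-uppercase address.

```python
import re

email = re.compile(r'\w+@\w+\.[a-z]{3}')

# hypothetical sample addresses
samples = [
    "guido@python.org",        # matches in full
    "first_last@example.com",  # '\w' also matches the underscore
    "user@site.io",            # only 2 letters after the '.', so no match
    "user@site.info",          # '{3}' stops after 3 letters
    "USER@SITE.ORG",           # '[a-z]' does not match uppercase letters
]

results = {sample: email.findall(sample) for sample in samples}
for sample, found in results.items():
    print(sample, '->', found)
```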

So how might this fail to match an email address like 'joe.smith@python.net'?

- Since '\w' does not match a period, the portion of the regex before the '@' character cannot span a '.' separator. For 'joe.smith@python.net', the regex returns only the partial address 'smith@python.net'.


- Solution: Add a __'\\.\w+'__ to the regex: __'\w+\\.\w+@\w+\\.[a-z]{3}'__. Note that this version requires a '.' before the '@' character, so it no longer matches addresses such as 'guido@python.org'.
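
The sketch below compares the two regexes on a hypothetical sentence containing both styles of address. It also tries a third variant, `[\w.]+` before the '@', which is offered here only as one possible way to accept user names with or without periods; it is not taken from the course readings.

```python
import re

# hypothetical sample text with a dotted and an undotted user name
text = "Write to joe.smith@python.net or guido@python.org today"

original = re.compile(r'\w+@\w+\.[a-z]{3}')
dotted = re.compile(r'\w+\.\w+@\w+\.[a-z]{3}')
# assumption: a character class that allows word characters or periods
flexible = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

original_found = original.findall(text)
dotted_found = dotted.findall(text)
flexible_found = flexible.findall(text)

print(original_found)
print(dotted_found)
print(flexible_found)
```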

Constructing an effective regex typically requires a fair amount of iterative trial & error effort, e.g., construct what you think will work -> test it -> modify it -> test it again, etc.


#### Removing Unwanted Characters From a String

Another example: What if we wanted to remove everything __except__ alphanumeric characters from a string?

In [6]:
# create a regex to remove non-alphanumerics from a string
text2 = '**//Regex Exercises// + 42 - 37:?'

# our regex looks for runs of non-alphanumeric characters: '[\W_]+' matches one
# or more characters that are either non-word characters ('\W') or an underscore
pattern = re.compile(r'[\W_]+')

# now use the regex to find the non-alphanumerics. When found, replace them
# by placing an empty string '' in their location.
# so the call to pattern.sub() is saying: "replace anything in text2 matching the
# regex defined as 'pattern' with an empty string ('')".
print(pattern.sub('', text2))

RegexExercises4237
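
Replacing with an empty string also removes the spaces between words. If the word boundaries should be kept, one possible variation (a sketch, not part of the original example) is to replace each run of unwanted characters with a single space and then trim the ends. The name `cleaned` is illustrative.

```python
import re

text2 = '**//Regex Exercises// + 42 - 37:?'
pattern = re.compile(r'[\W_]+')

# replace each run of non-alphanumerics with one space, then trim the ends
cleaned = pattern.sub(' ', text2).strip()
print(cleaned)
```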


### Vectorized String Functions in Pandas

Pandas has some built-in string manipulation capabilities that allow the user to automatically apply a regex or other string function to each item of a Series or DataFrame column.

In [7]:
# import Pandas library
import pandas as pd
import numpy as np

# define some simple data to use
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [8]:
# search the 'data' object for the string 'gmail': the result is a boolean
# vector
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
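
Because of the missing value for 'Wes', the result above contains a NaN and cannot be used directly as a boolean mask. A minimal sketch of one way to handle this is to pass `na=False` to `str.contains()`, which treats missing values as non-matches. The names `mask` and `gmail_users` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# na=False fills the result for missing values with False
mask = data.str.contains('gmail', na=False)
print(mask)

# the mask can now be used to select only the matching rows
gmail_users = data[mask]
print(gmail_users)
```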

In [30]:
# define a regex to separate the email user's name from the web domain name
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)'
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)'

In [31]:
# Use the findall() function to separate each user's email name from the 
# web domain name
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google.com)]
Steve    [(steve, gmail.com)]
Rob        [(rob, gmail.com)]
Wes                       NaN
dtype: object
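
Each non-missing entry in this result is a list holding one tuple. As a rough sketch of how the pieces might be pulled out, the block below uses the `.str` accessor to index into the lists and tuples. The variable names are assumptions, and the behavior shown relies on the result being an object-dtype Series.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)'

# take the first (and only) tuple from each list
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]

# element 1 of each tuple is the web domain name
domains = matches.str.get(1)
print(domains)
```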

In [32]:
# extract the user name + web domain into separate data frame columns
data.str.extract(pattern, flags=re.IGNORECASE)

           0           1
Dave    dave  google.com
Steve  steve   gmail.com
Rob      rob   gmail.com
Wes      NaN         NaN
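
The columns in the extracted data frame are labeled 0 and 1 by default. As a sketch of one way to get descriptive column labels, the regex groups can be given names using the `(?P<name>...)` syntax. The group names `user` and `domain` are assumptions chosen for illustration.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# same regex as above, but each group is given a name
named_pattern = r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)'

# the group names become the column labels of the resulting data frame
parts = data.str.extract(named_pattern, flags=re.IGNORECASE)
print(parts)
```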
