# Introduction to Text Data Processing

-----

In this notebook, we explore how to extract text data of interest from unstructured data sets. First, we review basic Python tools that can be used for initial data exploration and, in many cases, for more advanced data processing tasks. Next, we review another important tool, regular expressions, which can simplify the task of finding and selecting specific data in a large document. Python provides a native implementation of [regular expressions][re] through the `re` module.

-----
[re]: https://docs.python.org/3/library/re.html

## Table of Contents


[Text Data Processing](#Text-Data-Processing)

- [Sequence Operators](#Sequence-Operators)

- [String Functions](#String-Functions)

- [Data Collection Classes](#Data-Collection-Classes)

[Regular Expressions](#Regular-Expressions)

-----

Before proceeding with the rest of this notebook, we first define our _data_ directory and load a sample text document, in this case an email.

----

In [1]:
# First we find our HOME directory
home_dir = !echo $HOME

# Define data directory
data_dir = home_dir[0] +'/data/'

In [2]:
print(data_dir)

/home/data_scientist/data/


In [3]:
# Read in sample email document
with open (data_dir + 'email.txt', 'r') as myfile:
    msg = myfile.read()

FileNotFoundError: [Errno 2] No such file or directory: '/home/data_scientist/data/email.txt'
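
The `!echo $HOME` syntax only works inside IPython or Jupyter, and `open` raises `FileNotFoundError` when the file is absent. The following is a minimal sketch of a plain Python alternative, assuming the same `data/email.txt` layout under the home directory. The helper `read_text_or_none` and the variable `text` are hypothetical names introduced only for this illustration.

```python
from pathlib import Path

# Build the data directory from the user's home directory
# (assumed layout: <home>/data/, as in the cells above).
data_dir = Path.home() / 'data'


def read_text_or_none(path):
    """Return the contents of path, or None if the file does not exist."""
    try:
        return path.read_text()
    except FileNotFoundError:
        return None


# Hypothetical usage: read the sample email if it is present.
text = read_text_or_none(data_dir / 'email.txt')
if text is None:
    print(f'No email.txt found in {data_dir}')
else:
    print(f'Read {len(text)} characters')
```

Returning `None` is only one possible design; re-raising the exception with a clearer message would work equally well.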

-----

[[Back to TOC]](#Table-of-Contents)

## Text Data Processing

In many cases, we will be presented with unstructured or even semi-structured text data. For example, Tweet messages, email messages, or other documents can often be considered as character sequences. In these cases, we often can perform basic data processing by employing built-in Python data structures and collections. 

The main tool we can use for text processing is the Python string type, `str`, and its associated methods. One important point to remember is that in Python, a `str` is immutable; thus, any change creates a new `str`. This affects the performance of Python when processing large text data sets, and often motivates other solutions, several of which are presented later in this notebook. 
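
As a small illustration of immutability, the sketch below uses a short made-up string (not the email message). It shows that item assignment fails, that string methods return new objects, and that a common way to build a large string is to collect the pieces in a list and join them once.

```python
s = 'text data'

# Item assignment is not allowed on an immutable str.
try:
    s[0] = 'T'
except TypeError as err:
    print(f'TypeError: {err}')

# String methods return a new string; the original is unchanged.
t = s.upper()
print(s, '|', t)

# To build a string from many pieces, collect the pieces in a list
# and join them once instead of concatenating repeatedly.
parts = [str(i) for i in range(5)]
joined = ','.join(parts)
print(joined)
```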

In the following two Code cells, we first display the message's data type, which is class `str`. We then display the message itself, which, since the message is a sequence of characters, will wrap as the size of the browser window is changed.

-----

In [3]:
# Message data type
print(f'Message is encoded as type: {type(msg)}')

Message is encoded as type: <class 'str'>


In [4]:
# Display message
print(msg)

From: ACCY 570 Instructor <no-reply@illinois.edu>
Subject: [Instr Note] New Docker Course Image
Date: September 29, 2017 at 5:54:37 PM CDT
To: a.student@gmail.com

Instructor Robert J. Brunner posted a new Note. 

New Docker Course Image

We generated a new Docker course image. If you want to follow along on your laptop or work on the course Notebooks offline, you should download this new image by issuing a 

docker pull lcdm/rppdm-standalone

command at a Unix command line prompt. On the other hand, if you simply use the JupyterHub Server, no action is required on your part (we have already updated the server).

Let us know if you have any questions.

Robert

You're receiving this email because a.student@gmail.com is enrolled in ACCY 570 at University of Illinois. Sign in to manage your email preferences or un-enroll from this class. 


-----

[[Back to TOC]](#Table-of-Contents)

### Sequence Operators

Since a Python _string_ is a standard sequence type, the standard [Python sequence operators][pso] can be used to quickly perform basic text data processing. Given a value `v`, integer `n`, and similarly typed sequences `s` and `t`:

| Operation | Description |
| ----- | ----- |
| `v in s`| `True` if `v` is in the sequence `s`, otherwise `False`|
| `v not in s`| `False` if `v` is in the sequence `s`, otherwise `True`|
| `s + t`| concatenation of `s` and `t`|
| `s * n` or `n * s`| `n` shallow copies of `s` concatenated|
| `len(s)`| the number of elements in the sequence `s`|
| `min(s)`| the smallest element in the sequence `s`|
| `max(s)`| the largest element in the sequence `s`|
| `s.count(v)`| number of times `v` appears in `s`|
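
The operators in the table can be tried on two short, made-up strings before applying them to a longer document. This is a minimal sketch; the values `'data'` and `'base'` are chosen only for illustration.

```python
s = 'data'
t = 'base'

print('at' in s)         # membership also works for substrings
print('z' not in s)      # True, since 'z' does not occur in s
print(s + t)             # concatenation
print(s * 2)             # repetition
print(len(s))            # number of characters
print(min(s), max(s))    # ordered by Unicode code point
print(s.count('a'))      # occurrences of 'a'
```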

In the following Code cells, we demonstrate the use of most of these operators on our email message. First, we count and display the number of characters in the message via the `len` function. Next, we test if a character sequence `Brunner` is in the message, which returns `True`. Third, we count and display the number of times a sequence of characters (`ou`) appears in the message.

Finally, we find and display the maximum and minimum characters in the message. Note that characters are ordered by their [Unicode][pu] code point values. The last of these Code cells also demonstrates the `!r` conversion flag, which applies `repr()` to the value of `max(msg)` before formatting. The result is the string's printable representation, in which characters such as the newline and the non-breaking space appear as the escape sequences `\n` and `\xa0` rather than being rendered directly (we would not see them otherwise).


-----

[pso]: https://docs.python.org/3/library/stdtypes.html#common-sequence-operations
[pu]: https://docs.python.org/3/howto/unicode.html

In [5]:
print(f'{len(msg)} characters in email.')

847 characters in email.


In [6]:
print('Brunner' in msg)

True


In [7]:
print(f'"ou" appears {msg.count("ou")} times in the message.')

"ou" appears 13 times in the message.


In [8]:
print(f'(Max : Min) characters = ({max(msg)!r} : {min(msg)!r})')

(Max : Min) characters = ('\xa0' : '\n')
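
To see what the `!r` conversion contributes, the following sketch formats a newline character with and without it, and uses `ord` to show the code points that determine the ordering. The variable names are illustrative only.

```python
ch = '\n'

plain = f'[{ch}]'        # inserts the newline character itself
escaped = f'[{ch!r}]'    # inserts repr(ch): quotes plus the escape sequence

print(plain)
print(escaped)

# Code points used when comparing characters
print(ord('\n'), ord('\xa0'))
```

The first `print` spreads the brackets over two lines, while the second shows `['\n']` on a single line.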


-----

[[Back to TOC]](#Table-of-Contents)

### String Functions

The `str` type also has a number of built-in, [useful methods][pystm]:

- `split`: return a list of token strings that are delimited by a character sequence, such as a space (by default, any run of whitespace).

- `find`: return the lowest index in the string where a substring is located, or `-1` if it is not found.

- `replace`: return a new string with all occurrences of a substring replaced.

- `join`: return a string that concatenates the input strings, separated by the string on which the method is called.

- `count`: return the number of non-overlapping instances of a substring.

- `lower`: return a copy of the text converted to lowercase characters.

- `lstrip` / `rstrip`: return a string with the specified leading/trailing characters removed (whitespace by default).

These methods are demonstrated in the following Code cells, where we apply them to the email message to replace text, split the message into tokens, join tokens by using a character sequence, change the case of a token, strip characters from tokens, and find a specific character sequence.

-----

[pystm]: https://docs.python.org/3/library/stdtypes.html#string-methods

In [9]:
# Replace text, in this case newline with space  
msg_text = msg.replace('\n', ' ')
print(msg_text)  

From: ACCY 570 Instructor <no-reply@illinois.edu> Subject: [Instr Note] New Docker Course Image Date: September 29, 2017 at 5:54:37 PM CDT To: a.student@gmail.com  Instructor Robert J. Brunner posted a new Note.   New Docker Course Image  We generated a new Docker course image. If you want to follow along on your laptop or work on the course Notebooks offline, you should download this new image by issuing a   docker pull lcdm/rppdm-standalone  command at a Unix command line prompt. On the other hand, if you simply use the JupyterHub Server, no action is required on your part (we have already updated the server).  Let us know if you have any questions.  Robert  You're receiving this email because a.student@gmail.com is enrolled in ACCY 570 at University of Illinois. Sign in to manage your email preferences or un-enroll from this class. 


In [10]:
# Tokenize message
words = msg.split()

# Pretty print last ten tokens
import pprint as pp
pp.pprint(words[-10:], indent=3)

[  'to',
   'manage',
   'your',
   'email',
   'preferences',
   'or',
   'un-enroll',
   'from',
   'this',
   'class.']


In [11]:
# Join last ten tokens with *-* character sequence
print('*-*'.join(words[-10:]))

to*-*manage*-*your*-*email*-*preferences*-*or*-*un-enroll*-*from*-*this*-*class.


In [12]:
# Extract last word and change to uppercase
print(words[-1])
print(words[-1].upper())

class.
CLASS.


In [13]:
# Strip any leading 'y' or 'o' characters from last ten tokens
# (lstrip treats its argument as a set of characters, not a prefix)
# Display token and stripped token
for word in words[-10:]:
    print(word, word.lstrip('yo'))

to to
manage manage
your ur
email email
preferences preferences
or r
un-enroll un-enroll
from from
this this
class. class.
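
The output above shows that `or` becomes `r`: `lstrip('yo')` removes any leading characters drawn from the set `{'y', 'o'}`, not the exact prefix `yo`. The sketch below contrasts this with exact prefix removal. The helper `strip_prefix` is a hypothetical name written for this illustration; in Python 3.9 and later, the built-in `str.removeprefix` method does the same job.

```python
def strip_prefix(text, prefix):
    """Remove prefix only when text starts with that exact sequence."""
    return text[len(prefix):] if text.startswith(prefix) else text


for word in ['your', 'or', 'to']:
    # Compare set-based stripping with exact prefix removal
    print(word, word.lstrip('yo'), strip_prefix(word, 'yo'))
```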


In [14]:
# Find string sequence
msg_text.find('ACCY')

6

In [15]:
# Extract sequence
msg[6:10]

'ACCY'

In [16]:
# Convert sequence to lowercase
msg[6:10].lower()

'accy'
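
One detail worth noting is that `find` returns `-1` rather than raising an exception when the substring is absent, so the result should be checked before it is used as a slice index. A minimal sketch on a short, made-up header line:

```python
text = 'From: ACCY 570 Instructor'

# Locate a substring and slice it out by using its length
target = 'ACCY'
start = text.find(target)
print(start, text[start:start + len(target)])

# find returns -1 when the substring is not present
missing = text.find('STAT')
if missing == -1:
    print('STAT not found')
```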

-----

Since `words` is a list of tokens, we can employ standard Python `list` methods. For example, we can sort the list of tokens, either in ascending order (the default), or descending order. Both approaches are shown in the following two Code cells, where we sort in both orders and display the last ten tokens.

-----

In [17]:
# Sort tokens and display last ten
words.sort()
pp.pprint(words[-10:], indent=3)

['use', 'want', 'work', 'you', 'you', 'you', 'you', 'your', 'your', 'your']


In [18]:
# Reverse sort tokens and display last ten
words.sort(reverse=True)
pp.pprint(words[-10:], indent=3)

[  'Brunner',
   'ACCY',
   'ACCY',
   '<no-reply@illinois.edu>',
   '5:54:37',
   '570',
   '570',
   '29,',
   '2017',
   '(we']
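
The sorted output places tokens such as `ACCY` and `Brunner` apart from lowercase tokens because uppercase letters have smaller code points than lowercase letters. The sketch below, using a short hypothetical token list, shows the built-in `sorted` function (which returns a new list rather than sorting in place) together with a `key` function for case-insensitive ordering.

```python
tokens = ['your', 'Docker', 'a', 'ACCY', 'new']

# sorted returns a new list; tokens itself is left unchanged.
default_order = sorted(tokens)

# Compare tokens by their lowercase form to ignore case.
folded_order = sorted(tokens, key=str.lower)

print(default_order)
print(folded_order)
```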


-----

[[Back to TOC]](#Table-of-Contents)

### Data Collection Classes

Python provides additional [data collection classes][cl] in the `collections` module, which is part of the standard Python distribution. Currently, this module provides the `namedtuple`, `deque`, `ChainMap`, `Counter`, `OrderedDict`, `defaultdict`, `UserDict`, `UserList`, and `UserString` classes. In the following code example, we demonstrate the use of a `Counter` object to perform a simple word count.

-----
[cl]: https://docs.python.org/3/library/collections.html

In [19]:
# Count tokens by using a collection
import collections as cl

# Find and display ten most common tokens 
mr = cl.Counter(words)
pp.pprint(mr.most_common(10))

[('you', 4),
 ('the', 4),
 ('a', 4),
 ('your', 3),
 ('this', 3),
 ('on', 3),
 ('new', 3),
 ('at', 3),
 ('Docker', 3),
 ('to', 2)]
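
A `Counter` behaves like a dictionary with a few conveniences. The following sketch, on a small hypothetical token list, shows that a missing key yields a count of zero instead of raising `KeyError`, and that `update` adds further observations to existing counts.

```python
from collections import Counter

counts = Counter(['new', 'image', 'new', 'docker'])

print(counts['new'])       # tokens seen twice
print(counts['missing'])   # unseen tokens count as zero

# Add more observations to the existing counts
counts.update(['image', 'image'])
top_two = counts.most_common(2)
print(top_two)
```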


-----

[[Back to TOC]](#Table-of-Contents)

## Regular Expressions

Regular expressions (also called REs or regexes) are expressions that can be used to match one or more occurrences of a particular pattern. Regular expressions are not unique to Python; they are used in many programming languages and in many Unix command line tools like sed, grep, or awk. [Regular expressions][re] are used in Python through the `re` module. To build a regular expression, you need to understand the syntax of the RE language. Once a regular expression is developed, it is compiled and executed by an engine written in C in order to provide fast execution.

To begin, most characters in a regular expression simply match themselves. For example, `python` would match any occurrence of the six letters `python`, either alone or embedded in another word. There are several special characters, known as metacharacters, that control the behavior of the rest of the regular expression. These metacharacters are listed in the following table.

| Metacharacter | Meaning | Example |
| ---- | ----- | ----- |
| . |  Matches any character except a newline | `1.3` matches `123`, `1a3`, and `1#3` among others |
| ^ | Matches sequence at the beginning of the string (or of each line in multiline mode) | `^Python` matches `Python` at the beginning of a line |
| $ | Matches sequence at the end of the string (or of each line in multiline mode) | `Python$` matches `Python` at the end of a line |
| * | Matches zero or more occurrences of a pattern | `12*3` matches `13`, `123`, `1223`, etc. |
| + | Matches one or more occurrences of a pattern | `12+3` matches `123`, `1223`, etc. |
| ? | Matches zero or one occurrences of a pattern | `12?3` matches `13` and `123` |
| { } | Repetition qualifier | `{m,n}` (with no space after the comma) means match at least `m` and at most `n` occurrences |
| [ ] | Used to specify a character class | `[a-z]` means match any lowercase character |
| \ | Escape character | `\w` means match an alphanumeric character or underscore, `\d` means match a digit, `\s` means match any whitespace character, and `\\` means match a backslash |
| &#124; | or operator | `A ` &#124; ` B` matches either `A` or `B` |
| ( ) | Grouping operator | `(ab)+` matches `ab`, `abab`, etc. |

One additional point to remember is that inside a character class (i.e., `[ ]`) many of these metacharacters lose their special meaning, and thus can be used to match themselves; for example, `[.]` matches only a literal period. A few characters instead take on a different meaning: when `^` is the first character inside a character class it means _not_, so `[^\w]` means match any character that is not alphanumeric or an underscore.
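
The behavior of the metacharacters can be checked directly. The sketch below pairs several patterns with a test string and the expected outcome, and uses `re.fullmatch` so that the entire string must match the pattern. The specific pattern and string pairs are chosen only for illustration.

```python
import re

# (pattern, test string, whether the whole string should match)
checks = [
    (r'1.3', '1a3', True),
    (r'12*3', '13', True),
    (r'12+3', '13', False),
    (r'12?3', '1223', False),
    (r'1{2,3}', '111', True),
    (r'[a-z]+', 'python', True),
    (r'[^\w]', '#', True),
    (r'[.]', 'a', False),
    (r'A|B', 'B', True),
    (r'(ab)+', 'abab', True),
]

results = []
for pattern, text, expected in checks:
    matched = re.fullmatch(pattern, text) is not None
    results.append(matched == expected)
    print(f'{pattern!r:12} {text!r:10} {matched}')
```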

To master regular expressions requires a lot of practice, but the investment is well worth it as they are used in many different contexts and can greatly simplify otherwise complex tasks. Given a regular expression, there are several functions that can be used to process text data.

- `compile`: compiles a regular expression for faster evaluation.
- `search`: finds the first match of a regular expression anywhere in a string.
- `match`: finds a match of a regular expression at the start of a string.
- `split`: splits the string by matches of a regular expression.
- `sub`: replaces substrings that match a regular expression with a different string.
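
The functions listed above can be sketched on a single short line of text. The sample string is modeled on the `Date` header of the email, and the variable names are illustrative only.

```python
import re

text = 'Date: September 29, 2017'
digits = re.compile(r'\d+')

# search scans the whole string for the first match
first_number = digits.search(text).group(0)

# match only succeeds at the start of the string
at_start = digits.match(text)
label = re.match(r'Date', text).group(0)

# split on a comma or colon followed by optional whitespace
fields = re.split(r'[,:]\s*', text)

# sub replaces every match
masked = digits.sub('#', text)

print(first_number, at_start, label)
print(fields)
print(masked)
```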

In the following Code cells, we repeat several of our previous string processing examples by using regular expressions to find text sequences, transform text sequences, and find the most common tokens. Note that the token counts differ slightly from the earlier result because punctuation is now removed before counting, and tokens with equal counts are listed in the order in which they are first encountered.

-----
[re]: https://docs.python.org/3/howto/regex.html

In [20]:
# Find and iterate over instances of ACCY,
# extract starting and ending indices
# and entire matching token

import re
for match in re.finditer(r'ACCY', msg):
    print(f'{match.start():03d}-{match.end():03d}: {match.group(0)}')

006-010: ACCY
740-744: ACCY


In [21]:
# Find and iterate over instances of either on, On, or ON,
# extract starting and ending indices
# and entire matching token

for match in re.finditer(r'on|On|ON', msg):
    print(f'{match.start():03d}-{match.end():03d}: {match.group(0)}')

303-305: on
307-309: on
330-332: on
443-445: on
487-489: On
554-556: on
569-571: on
655-657: on


In [22]:
# Find and iterate over instances of one or more 
# numerical values, extract starting and ending indices
# and entire matching token

for match in re.finditer(r'\d+', msg):
    print(f'{match.start():03d}-{match.end():03d}: {match.group(0)}')

011-014: 570
112-114: 29
116-120: 2017
124-125: 5
126-128: 54
129-131: 37
745-748: 570
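
The digit matches above split the time `5:54:37` into three separate results. As a sketch of how the grouping operator can keep related pieces together, the following example captures the three fields of a time in a single match. The sample line copies the `Date` header from the email, and the variable names are illustrative.

```python
import re

line = 'Date: September 29, 2017 at 5:54:37 PM CDT'

# Each parenthesized group captures one field of the time
time_match = re.search(r'(\d+):(\d+):(\d+)', line)

if time_match:
    print(time_match.group(0))     # the entire matched text
    print(time_match.groups())     # the captured fields as strings
    hour, minute, second = (int(g) for g in time_match.groups())
    print(hour, minute, second)
```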


In [23]:
# Replace any six alpha character sequence with
# six asterisk characters
re.sub(r'[a-zA-Z]{6}', '******', msg)

"From: ACCY 570 ******ctor <no-reply@******is.edu>\n******t: [Instr Note] New ****** ****** Image\nDate: ******ber 29, 2017 at 5:54:37 PM CDT\nTo: a.******t@gmail.com\n\n******ctor ****** J. ******r ****** a new Note. \n\nNew ****** ****** Image\n\nWe ******ted a new ****** ****** image. If you want to ****** along on your ****** or work on the ****** ******oks ******e, you ****** ******ad this new image by ******g a\xa0\n\n****** pull lcdm/rppdm-******lone\n\n******d at a Unix ******d line ******. On the other hand, if you ****** use the ******rHub ******, no ****** is ******ed on your part (we have ******y ******d the ******).\n\nLet us know if you have any ******ons.\n\n******\n\nYou're ******ing this email ******e a.******t@gmail.com is ******ed in ACCY 570 at ******sity of ******is. Sign in to ****** your email ******ences or un-****** from this class. "

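The substitution above also masks the first six letters of longer words (for example, `Instructor` becomes `******ctor`). If the intent is to mask only words that are exactly six letters long, one option is to add the `\b` word boundary anchor on both sides of the pattern, as in this sketch on a short, made-up string.

```python
import re

text = 'Instructor Robert posted'

# Without boundaries, any run of six letters is replaced
loose = re.sub(r'[a-zA-Z]{6}', '******', text)

# With \b on both sides, only whole six-letter words are replaced
strict = re.sub(r'\b[a-zA-Z]{6}\b', '******', text)

print(loose)
print(strict)
```
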
In [24]:
# Match any character that is neither alphanumeric (or underscore)
# nor whitespace, i.e., punctuation. We replace each such character
# with a single space, and split on whitespace.
pattern = re.compile(r'[^\w\s]')
words = re.sub(pattern, ' ', msg).split()

# Find and display top ten tokens
mr = cl.Counter(words)
pp.pprint(mr.most_common(10))

[('a', 6),
 ('you', 4),
 ('the', 4),
 ('Docker', 3),
 ('at', 3),
 ('new', 3),
 ('on', 3),
 ('your', 3),
 ('this', 3),
 ('ACCY', 2)]
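
Because punctuation is replaced by spaces, a token such as `a.student@gmail.com` is split into `a`, `student`, `gmail`, and `com`, which is why the count for `a` is higher than in the earlier `Counter` example. An alternative way to obtain the same tokens, sketched below on a short, made-up sentence, is to collect every run of word characters with `re.findall`.

```python
import re

text = "You're enrolled in ACCY 570, a.student@gmail.com."

# Approach used above: replace punctuation with spaces, then split
via_sub = re.sub(r'[^\w\s]', ' ', text).split()

# Alternative: collect every maximal run of word characters
via_findall = re.findall(r'\w+', text)

print(via_sub)
print(via_findall == via_sub)
```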


-----

<font color='red' size = '5'> Student Exercise </font>

Earlier in this notebook, we used text data processing techniques to manipulate unstructured text data. Now that you have run the cells in this notebook, go back to the relevant cells and make these changes. Be sure to understand how your changes impact the text processing results.

1. Modify the string tokenization code to convert all text to lowercase characters before accumulating the word counts.

2. Use a Python `set` to obtain the unique words in the text message.

3. Use Regular Expressions to extract the email header information (i.e., sender, receiver, date) from the email message text.

As a challenge problem:

1. Save several emails from within your mail reader and modify the Python code to process them in bulk to extract out the sender, date sent, and subject.

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [Dive Into Python3][1] regular expression chapter.
2. The [Python Collections][pycol] documentation.

-----

[1]: http://www.diveintopython3.net/regular-expressions.html
[pycol]: https://docs.python.org/3/library/collections.html

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode