# Introduction to Text Data Processing

-----

In this notebook we explore how to pull text data of interest out of unstructured data sets. First, we review basic Python tools that can be used for initial data exploration and, in many cases, for more advanced data processing tasks. Next, we review another important tool, regular expressions, which can simplify the task of finding and selecting specific data in a large document. Python provides a native implementation of [regular expressions][re] through the `re` module.

-----
[re]: https://docs.python.org/3/library/re.html


### Text Data Processing

In many cases, we will be presented with unstructured or even
semi-structured text data. For example, Tweet messages, email messages,
or other documents can often be considered as character sequences. In
these cases, we often can perform basic data processing by employing
built-in Python data structures and collections. 

The main tool we can use for text processing is the Python string
(`str`) object and its associated methods. One important point to
remember is that in Python a string is immutable, so any change creates
a new string. This affects the cost of using Python to process large
text data sets, which often leads to other solutions, several of which
are presented later in this notebook. The `str` object has a number of
[useful methods][pystm]:

- `split`: return a list of token strings that are delimited by a
separator, such as a space.

- `find`: return the lowest index in the string where a substring is
located, or `-1` if the substring is not found.

- `replace`: return a new string with all occurrences of a substring
replaced.

- `join`: return a string that concatenates the input strings, separated
by the string on which the method is called.

- `count`: return the number of non-overlapping instances of a substring.

- `lower`: return a copy of the text converted to lowercase characters.

- `lstrip` / `rstrip`: return a string with the leading/trailing
characters (whitespace by default) removed.
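
As a brief illustration of the methods listed above, the following sketch applies them to a short sample sentence. The sample text and variable names are assumptions made for this example only; they do not come from the email data used later in this notebook.

```python
# Hypothetical sample text, used only for illustration.
text = "  Python strings are immutable; Python methods return new strings.  "

left = text.lstrip()     # leading whitespace removed
stripped = left.rstrip() # trailing whitespace removed as well

tokens = stripped.split()                    # split on runs of whitespace
first_python = stripped.find("Python")       # lowest index of the substring
python_count = stripped.count("Python")      # non-overlapping occurrences
lowered = stripped.lower()                   # new, lowercase string
replaced = stripped.replace("Python", "Py")  # new string, all occurrences replaced
rejoined = "-".join(tokens[:3])              # combine tokens with a separator

print(tokens[:3])                  # ['Python', 'strings', 'are']
print(first_python, python_count)  # 0 2
print(rejoined)                    # Python-strings-are
print(replaced)

# Because strings are immutable, the original value is unchanged.
print(repr(text))
```

Each method returns a new object, so `text` still carries its leading and trailing spaces at the end of the example.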

In addition, one can make use of standard [Python sequence
operators][pso] to quickly perform basic text data processing. Given a
value `v`, an integer `n`, and similarly typed sequences `s` and `t`:

| Operation | Description |
| ----- | ----- |
| `v in s`| `True` if `v` is in the sequence `s`, otherwise `False`|
| `v not in s`| `False` if `v` is in the sequence `s`, otherwise `True`|
| `s + t`| concatenation of `s` and `t`|
| `s * n` or `n * s`| `n` shallow copies of `s` concatenated|
| `len(s)`| the number of elements in the sequence `s`|
| `min(s)`| the smallest element in the sequence `s`|
| `max(s)`| the largest element in the sequence `s`|
| `s.count(v)`| number of times `v` appears in `s`|
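
Since a string is itself a sequence of characters, each operator in the table can be applied directly to text. The sketch below uses two short strings chosen only for illustration.

```python
# Two small example strings (illustrative values only).
s = "data"
t = "base"

print("at" in s)       # True: "at" is a substring of "data"
print("z" not in s)    # True
print(s + t)           # database
print(s * 2)           # datadata
print(len(s))          # 4
print(min(s), max(s))  # a t (characters compare by code point)
print(s.count("a"))    # 2
```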

Finally, Python provides additional [data collection classes][cl] in the
`collections` library, which is part of the standard Python
distribution. Currently, this library provides the `namedtuple`,
`deque`, `ChainMap`, `Counter`, `OrderedDict`, `defaultdict`,
`UserDict`, `UserList`, and `UserString` classes. In the following code
example, we demonstrate the use of a `Counter` object to perform a
simple word count.

-----
[pystm]: https://docs.python.org/3/library/stdtypes.html#string-methods
[pso]: https://docs.python.org/3/library/stdtypes.html#common-sequence-operations
[cl]: https://docs.python.org/3/library/collections.html

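```python
import collections as cl

# Read the whole message, replacing newlines with spaces
with open("data/email.txt", "r") as myfile:
    msg = myfile.read().replace('\n', ' ')

# Split on whitespace to obtain the word tokens
words = msg.split()

# Count how often each token occurs
mr = cl.Counter(words)

# Display the 25 most frequent tokens
print(mr.most_common(25))
```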

[('a', 4), ('you', 4), ('the', 4), ('Docker', 3), ('at', 3), ('new', 3), ('on', 3), ('your', 3), ('this', 3), ('ACCY', 2), ('570', 2), ('Instructor', 2), ('New', 2), ('Course', 2), ('Image', 2), ('a.student@gmail.com', 2), ('Robert', 2), ('course', 2), ('to', 2), ('or', 2), ('command', 2), ('if', 2), ('is', 2), ('have', 2), ('email', 2)]


### Regular Expressions

Regular expressions, also called REs or regexes, are expressions that
can be used to match one or more occurrences of a particular pattern.
Regular expressions are not unique to Python; they are used in many
programming languages and in many Unix command-line tools such as sed,
grep, and awk. [Regular expressions][re] are used in Python through the
`re` module. To build a regular expression, you need to understand the
syntax of the RE language. Once a regular expression is written, it is
compiled and then executed by a matching engine written in C in order to
provide fast execution.

To begin, most characters in a regular expression simply match
themselves. For example, `python` would match any occurrence of the six
letters `python`, either alone or embedded in another word. There are
several special characters, known as metacharacters, that control the
behaviour of the rest of the regular expression. These metacharacters
are listed in the following table.

| Metacharacter | Meaning | Example |
| ---- | ----- | ----- |
| . |  Matches any character except a newline | `1.3` matches `123`, `1a3`, and `1#3` among others |
| ^ | Matches sequence at the beginning of the line| `^Python` matches `Python` at the beginning of a line |
| $ | Matches sequence at the end of the line | `Python$` matches `Python` at the end of a line |
| * | Matches zero or more occurrences of a pattern | `12*3` matches `13`, `123`, `1223`, etc. |
| + |  Matches one or more occurrences of a pattern | `12+3` matches `123`, `1223`, etc. |
| ? |  Matches zero or one occurrences of a pattern | `12?3` matches `13` and `123` |
| { } | Repetition quantifier | `{m,n}` (no space after the comma) means match at least `m` and at most `n` occurrences |
| [ ] | Used to specify a character class | `[a-z]` means match any lowercase letter |
| \ | Escape character | `\w` means match an alphanumeric character or underscore, `\s` means match any whitespace character, and `\\` means match a backslash |
| &#124; | Or operator | `A`&#124;`B` matches either `A` or `B` |
| ( ) | Grouping operator | `(ab)+` matches `ab`, `abab`, etc. |
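
The behaviour of these metacharacters can be checked with a short experiment. The sketch below is only illustrative: the patterns and candidate strings are assumptions chosen for this example, and `re.fullmatch` is used so that a candidate counts as a match only when the entire string fits the pattern.

```python
import re

# Each entry pairs a pattern with candidate strings (illustrative values).
examples = [
    (r"1.3",     ["123", "1a3", "13"]),
    (r"12*3",    ["13", "1223", "143"]),
    (r"12+3",    ["13", "123"]),
    (r"12?3",    ["13", "123", "1223"]),
    (r"a{2,3}",  ["a", "aa", "aaaa"]),
    (r"[a-z]+",  ["abc", "ABC"]),
    (r"cat|dog", ["cat", "dog", "cow"]),
    (r"(ab)+",   ["abab", "aba"]),
]

results = {}
for pattern, candidates in examples:
    # Keep only the candidates that the pattern matches in full
    results[pattern] = [c for c in candidates if re.fullmatch(pattern, c)]
    print(f"{pattern!r:12} matches {results[pattern]}")

# ^ and $ anchor a match to the start or end of a line; re.MULTILINE
# makes them apply to every line rather than only to the whole string.
lines = "Python is fun\nI like Python"
start_hits = re.findall(r"^Python", lines, flags=re.MULTILINE)
end_hits = re.findall(r"Python$", lines, flags=re.MULTILINE)
print(start_hits, end_hits)
```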

One additional point to remember is that inside a character class (i.e.,
`[ ]`) many of these metacharacters lose their special meaning and
simply match themselves; for example, `[.$]` matches a literal period or
dollar sign. A few characters take on a different meaning instead: when
`^` is the first character in a class it means _not_, so `[^\w]` matches
any character that is not alphanumeric or an underscore.
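
A small sketch of how character classes treat metacharacters follows. The sample strings are hypothetical and exist only to show the three cases: literal metacharacters, a negated class, and a caret that is not in first position.

```python
import re

sample = "Cost: $4.50 (approx.) ^_^"  # hypothetical text

# Inside [ ], the characters $ . ( ) simply match themselves
literal = re.findall(r"[$.()]", sample)
print(literal)   # ['$', '.', '(', '.', ')']

# A leading ^ negates the class: any character that is not a word character
non_word = re.findall(r"[^\w]", "a-b c_d!")
print(non_word)  # ['-', ' ', '!']

# A ^ that is not the first character in the class is a literal caret
carets = re.findall(r"[_^]", sample)
print(carets)    # ['^', '_', '^']
```

Note that the underscore in `c_d` is not reported by `[^\w]`, because `\w` includes the underscore.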

To master regular expressions requires a lot of practice, but the
investment is well worth it as they are used in many different contexts
and can greatly simplify otherwise complex tasks. Given a regular
expression, there are several functions that can be used to process text
data.

- `compile`: compiles a regular expression into a reusable pattern object for faster evaluation.
- `search`: finds the first match of a regular expression anywhere in a string.
- `match`: finds a match of a regular expression only at the start of a string.
- `split`: splits the string at each match of a regular expression.
- `sub`: replaces substrings that match a regular expression with a different string.
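
The functions listed above can be sketched on a single line of text. The sample sentence, including the phone number, is made up for this example; only the calls into the `re` module are standard.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Contact a.student@gmail.com or call 555-0100 today."

# compile: build a reusable pattern object
word = re.compile(r"\w+")

# search: first match anywhere in the string
first = word.search(text)
print(first.group())  # Contact

# match: succeeds only if the pattern matches at the start of the string
at_start = word.match(text)
not_at_start = re.match(r"call", text)
print(at_start.group(), not_at_start)  # Contact None

# split: break a string wherever the pattern matches
parts = re.split(r"\s+", "one  two\tthree")
print(parts)  # ['one', 'two', 'three']

# sub: replace every match with a different string
masked = re.sub(r"\d", "#", text)
print(masked)  # Contact a.student@gmail.com or call ###-#### today.
```

The difference between `search` and `match` is the most common source of confusion: `call` does occur in the text, but `match` returns `None` because the occurrence is not at the start.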

We can modify our previous string processing example by using a regular
expression to replace punctuation, and any other character that is
neither alphanumeric nor whitespace, with a space before counting words.

-----
[re]: https://docs.python.org/3/howto/regex.html

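```python
import collections as cl
import re

# Match any character that is neither a word character nor whitespace
pattern = re.compile(r'[^\w\s]')

with open("data/email.txt", "r") as myfile:
    msg = myfile.read().replace('\n', ' ')

# Replace the matched characters with spaces, then split into tokens
words = pattern.sub(' ', msg).split()

# Count the tokens and display the 25 most frequent
mr = cl.Counter(words)

print(mr.most_common(25))
```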

[('a', 6), ('you', 4), ('the', 4), ('Docker', 3), ('at', 3), ('new', 3), ('on', 3), ('your', 3), ('this', 3), ('ACCY', 2), ('570', 2), ('Instructor', 2), ('no', 2), ('Note', 2), ('New', 2), ('Course', 2), ('Image', 2), ('student', 2), ('gmail', 2), ('com', 2), ('Robert', 2), ('course', 2), ('image', 2), ('to', 2), ('or', 2)]


-----
## Breakout Session

During this breakout, you should work to improve your Python text data
processing skills. Specific problems you can attempt include the
following:

1. Modify the first string processing code to convert all text to
lowercase characters before accumulating the word counts.

2. Use a Python `set` to obtain the unique words in the text
message.

3. Use Regular Expressions to remove the email encoding characters from
the message text.

Additional, more advanced problems:

1. Save several emails from within your mail reader and modify the
Python code to process them in bulk to extract the sender, date
sent, and subject.

2. Save several webpages (perhaps by using wget), and modify the
BeautifulSoup code example to parse out and display the page title, any
JavaScript code libraries, and any CSS style file references.

-----

<font color='red' size = '5'> Student Exercise </font>

Earlier in this notebook, we used the sqlite module to execute SQL queries. Now that you have run the cells in this notebook, go back to the relevant cells and make these changes. Be sure to understand how your changes impact the file input and output process.

1. Try creating the Surf Shop database as a persistent database (i.e., not in memory).
2. With the persistent Surf Shop database, execute queries to count the number of items in the store, and sort them into descending order by their description.
3. The airport example demonstrates how easy it is to use the `to_sql` function on a Pandas DataFrame. Using any data set you choose (e.g., the Adult data or the Auto MPG data), read the data into a DataFrame and persist it in a new database that you have created. Using the database, find all missing values.

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [Dive Into Python3][1] regular expression chapter.
2. The [Python Collections][pycol] documentation.

-----

[1]: http://www.diveintopython3.net/regular-expressions.html
[pycol]: https://docs.python.org/3/library/collections.html

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode