----
Text Processing
=====

![](http://ecx.images-amazon.com/images/I/512LDICmVjL._SX311_BO1,204,203,200_.jpg)

By The End Of This Session You Should Be Able To:
----

- List the different levels of text processing
- Use the Python Standard Library to generate text
- Write "pretty" classes in Python

----
Text Processing Levels
-----

1. Character
2. Words 
3. Sentences / multiple words
4. Paragraphs / multiple sentences 
5. Document
6. Corpus / multiple documents 
 
Source: Taming Text, p 9

-----
Character
------

- Character encodings
- Case (upper and lower)
- Punctuation
- Numbers 
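
A minimal sketch of these character-level concerns, using only the standard library. The sample string and variable names are made up for illustration.

```python
import string

text = "Héllo, World! It's 2024."

# Character encodings: the same text takes a different number of bytes
# depending on the encoding ("é" is two bytes in UTF-8, one in Latin-1).
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Case: lowercase everything so "World" and "world" compare equal.
lowered = text.lower()

# Punctuation and numbers: pick out characters by class.
punct = [c for c in text if c in string.punctuation]
digits = [c for c in text if c.isdigit()]

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(lowered)
print(punct)
print(digits)
```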

----
Words
-----

- Word segmentation: dividing text into words. Fairly easy for English and other languages that use whitespace; much harder for languages like Chinese and Japanese.
- Stemming: the process of shortening a word to its base or root form.
- Abbreviations, acronyms, and spelling: handling each of these helps identify what a word actually is.
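
A rough sketch of word segmentation and stemming for English. `tokenize` and `naive_stem` are hypothetical helpers written for this slide, not library functions, and the suffix list is an arbitrary assumption. Real stemmers (for example the Porter stemmer) use many more rules.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; a toy stand-in for a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The walrus talked of ships and sealing wax.")
stems = [naive_stem(t) for t in tokens]

print(tokens)
print(stems)
```

Note that the toy stemmer turns "walrus" into "walru": blindly stripping suffixes over-stems, which is one reason to reach for a tested implementation.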

------
Sentences
-----

- Sentence boundary detection is a well-understood problem in English, but is still not perfect.
- Phrase detection. "San Francisco" and "quick red fox" are examples of phrases.
- Parsing: breaking sentences down into subject-verb and other relationships often yields useful information about words and their relationships to each other.
- Combining the definitions of words and their relationships to each other to determine the meaning of a sentence.
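
To see why sentence boundary detection "is still not perfect", here is a deliberately naive splitter. `split_sentences` is a hypothetical helper that assumes a sentence ends at `.`, `!`, or `?` followed by whitespace.

```python
import re

def split_sentences(text):
    """Split after ., !, or ? when followed by whitespace (naive on purpose)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sentences = split_sentences(
    "Dr. Smith lives in San Francisco. He saw a quick red fox!"
)
for sentence in sentences:
    print(repr(sentence))
```

The abbreviation "Dr." is wrongly treated as its own sentence, so the two-sentence input comes back as three pieces.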

------
Paragraphs
------

At this level, processing becomes more difficult because the goal is a deeper understanding of the author’s intent.

For example, algorithms for summarization often require being able to identify which sentences are more important than others.
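
One crude way to rank sentences by importance is to score each one by the average frequency of its words within the paragraph. This is only a sketch under that assumption: `top_sentence` is a hypothetical helper, and it skips steps a real summarizer would need, such as stop-word removal.

```python
import re
from collections import Counter

def top_sentence(paragraph):
    """Return the sentence whose words are, on average, most frequent."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    freq = Counter(re.findall(r"[a-z']+", paragraph.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    return max(sentences, key=score)

paragraph = (
    "Text processing is fun. "
    "Text processing uses text tools. "
    "Lunch was late."
)
print(top_sentence(paragraph))
```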

----
Document
----

Similar to the paragraph level, understanding the meaning of a document often requires knowledge that goes beyond what’s contained in the actual document. 

Authors often expect readers to have a certain background or possess certain reading skills. 

----
Corpus
-----

At this level, people want to quickly find items of interest as well as group related documents and read summaries of those documents.

Applications that can aggregate and organize facts and opinions and find relationships are particularly useful.
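
Finding items of interest across many documents usually starts with an index. Below is a minimal inverted index over a made-up three-document corpus; the document IDs, texts, and the `search` helper are all assumptions for illustration.

```python
from collections import defaultdict

corpus = {
    "doc1": "the walrus and the carpenter",
    "doc2": "the carpenter built a boat",
    "doc3": "oysters followed the walrus",
}

# Map each word to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for word in text.split():
        index[word].add(doc_id)

def search(word):
    """Return the sorted IDs of documents containing the word."""
    return sorted(index.get(word, set()))

print(search("walrus"))
print(search("boat"))
print(search("zebra"))
```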

---- 
Python is "batteries included" for text processing
-----

![](http://abstrusegoose.com/strips/batteries_included.png)

In [1]:
from string import *

In [2]:
whos

Variable          Type                  Data/Info
-------------------------------------------------
Formatter         type                  <class 'string.Formatter'>
Template          _TemplateMetaclass    <class 'string.Template'>
ascii_letters     str                   abcdefghijklmnopqrstuvwxy<...>BCDEFGHIJKLMNOPQRSTUVWXYZ
ascii_lowercase   str                   abcdefghijklmnopqrstuvwxyz
ascii_uppercase   str                   ABCDEFGHIJKLMNOPQRSTUVWXYZ
capwords          function              <function capwords at 0x10250eb70>
digits            str                   0123456789
hexdigits         str                   0123456789abcdefABCDEF
octdigits         str                   01234567
printable         str                   0123456789abcdefghijklmno<...>/:;<=>?@[\]^_`{|}~ 	\n
punctuation       str                   !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
whitespace        str                    	\n


In [3]:
phrase_dirty = "You have to be f*#$ing with me!?"

In [13]:
capwords(phrase_dirty)

'You Have To Be F*#$ing With Me!?'
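
The `whos` listing also shows `Template`, which is handy for generating text. A small sketch follows; the message wording and placeholder names are invented for the example.

```python
from string import Template, capwords

greeting = Template("Dear $name, your order of $count items has shipped.")

# substitute() fills every placeholder and raises KeyError if one is missing.
message = greeting.substitute(name=capwords("ada lovelace"), count=3)
print(message)

# safe_substitute() leaves unknown placeholders in place instead of raising.
partial = Template("$greeting, $name!").safe_substitute(greeting="Hello")
print(partial)
```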

Student Activity
-----

In [11]:
# TODO: Remove the punctuation from the  phrase
phrase_dirty = "You have to be f*#$ing with me!?"

assert phrase_radio_edit == 'You have to be fing with me'

In [7]:
phrase_radio_edit = "".join([c for c in phrase_dirty if c not in punctuation])
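
An alternative to the list comprehension is `str.translate` with a deletion table built by `str.maketrans`. This is a sketch of the same activity, not the only acceptable answer.

```python
import string

phrase_dirty = "You have to be f*#$ing with me!?"

# The third argument to maketrans lists characters to delete.
delete_punctuation = str.maketrans("", "", string.punctuation)
phrase_radio_edit = phrase_dirty.translate(delete_punctuation)

print(phrase_radio_edit)
```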

Student Activity
-----

Let's make a simple password generator function!

Your one-liner should return something like this:  
'kZmuSUVeVC'   
'mGEsuIfl91'  
'FEFsWwAgLM'

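In [None]:
import random
import string

n = 10
# Use a throwaway loop variable and a separate result name so that
# neither `n` nor the `random` module gets overwritten.
password = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(n))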
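
For passwords that actually protect something, the standard library's `secrets` module is designed for security-sensitive randomness, unlike `random`. A sketch of the same generator as a function; `make_password` and its defaults are assumptions for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def make_password(length=10, alphabet=ALPHABET):
    """Build a password from cryptographically strong random choices."""
    return "".join(secrets.choice(alphabet) for _ in range(length))

print(make_password())
print(make_password(length=16))
```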

----
Extending Python with new classes
----

In [39]:
s = Sentence('"The time has come," the Walrus said,')

In [40]:
s

Sentence('"The time ha... Walrus said,')

In [41]:
s.words

['"The', 'time', 'has', 'come,"', 'the', 'Walrus', 'said,']

In [43]:
import reprlib

class Sentence:
    "A data structure to hold English sentences for simplified processing."
    def __init__(self, text): 
        self.text = text
        self.words = text.split(" ")
    
    def __getitem__(self, index):
        return self.words[index] 
    
    def __len__(self):
        return len(self.words) 
    
    def __repr__(self):
        return 'Sentence({})'.format(reprlib.repr(self.text))

In [None]:
s.

In [42]:
s[4]

'the'

Classes should:

0. Meet spec, aka work correctly
1. Have docstrings
2. Have nice representations
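
A sketch that ties these three points together. It repeats the `Sentence` class from above and adds a `__str__` method, which is an addition for illustration and not part of the original class.

```python
import reprlib

class Sentence:
    """A sequence of whitespace-separated words from one sentence."""

    def __init__(self, text):
        self.text = text
        self.words = text.split()

    def __getitem__(self, index):
        return self.words[index]

    def __len__(self):
        return len(self.words)

    def __repr__(self):
        # Developer-facing: abbreviated, and shaped like the constructor call.
        return "Sentence({})".format(reprlib.repr(self.text))

    def __str__(self):
        # Reader-facing: the full text (added here for illustration).
        return self.text

s = Sentence('"The time has come," the Walrus said,')
print(repr(s))               # short form for the console
print(str(s))                # full text for readers
print(len(s), s[-1])         # sequence behaviour from __len__ / __getitem__
print("Walrus" in s)         # membership falls back to __getitem__
print(Sentence.__doc__)      # the docstring is available at runtime
```

Because `__getitem__` accepts integer indexes, Python can also iterate over a `Sentence` and test membership without any extra methods.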

In [44]:
import textblob

In [45]:
textblob?

[RTFM: textblob](https://textblob.readthedocs.io/en/dev/)

In [53]:
from textblob import TextBlob

text = '''
It’s too bad that some of the young people that were killed over the weekend 
didn’t have guns attached to their [hip], 
frankly, where bullets could have flown in the opposite direction...
'''

blob = TextBlob(text)

In [54]:
blob.words

WordList(['It’s', 'too', 'bad', 'that', 'some', 'of', 'the', 'young', 'people', 'that', 'were', 'killed', 'over', 'the', 'weekend', 'didn’t', 'have', 'guns', 'attached', 'to', 'their', 'hip', 'frankly', 'where', 'bullets', 'could', 'have', 'flown', 'in', 'the', 'opposite', 'direction'])

In [55]:
blob.word_counts

defaultdict(int,
            {'attached': 1,
             'bad': 1,
             'bullets': 1,
             'could': 1,
             'didn’t': 1,
             'direction': 1,
             'flown': 1,
             'frankly': 1,
             'guns': 1,
             'have': 2,
             'hip': 1,
             'in': 1,
             'it’s': 1,
             'killed': 1,
             'of': 1,
             'opposite': 1,
             'over': 1,
             'people': 1,
             'some': 1,
             'that': 2,
             'the': 3,
             'their': 1,
             'to': 1,
             'too': 1,
             'weekend': 1,
             'were': 1,
             'where': 1,
             'young': 1})
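
A similar word count can be approximated with the standard library alone. This sketch uses `collections.Counter` and a simple regex tokenizer; TextBlob's tokenizer follows different rules, so counts may not match it exactly on other inputs. The sample text is chosen just for the example.

```python
import re
from collections import Counter

text = (
    "The time has come, the Walrus said, to talk of many things: "
    "of shoes and ships and sealing-wax, of cabbages and kings"
)

# Lowercase first so "The" and "the" are counted together.
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

print(counts["of"], counts["and"], counts["the"])
print(counts.most_common(3))
```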

In [56]:
blob.sentiment

Sentiment(polarity=-0.19999999999999996, subjectivity=0.26666666666666666)

Summary
-----

- Language has many levels. Pick the right level to process and model.
- Use the Standard Library. DO IT!
- Don't be afraid to create your own data structures.
- But often people have already created good ones.

----