# String Processing in Python

One thing that distinguishes data science from traditional statistics is that the data sets we work with tend to be messier and less conventionally structured. Manipulating such data requires programming skills that traditional statistics courses rarely teach.

One increasingly common data format is **raw text**. Python is well suited to analyzing text data because its built-in string type comes with a rich set of processing methods. In this workbook, we will sample just a few of these capabilities; for a comprehensive list, take a look at [the Python documentation for strings](https://docs.python.org/3.6/library/stdtypes.html#textseq).

# Green Eggs and Ham

It is said that the Dr. Seuss book _Green Eggs and Ham_ was written in response to a challenge from his publisher to write a book using a vocabulary of just 50 words. Is this true? Let's investigate.

## File I/O

First, let's read in the text of _Green Eggs and Ham_.

In [1]:
# The with statement ensures that the file is closed.
with open("/data/GreenEggsAndHam.txt") as f:
    text = f.read()
    # DO STUFF WITH THE TEXT
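
To see what the `with` statement does, here is a minimal sketch that runs without the notebook's data file. The temporary file and the names `sample`, `path`, and `demo_text` are hypothetical stand-ins for illustration, and the explicit `encoding="utf-8"` is an assumption about how the text is stored.

```python
import os
import tempfile

# Hypothetical stand-in for the data file: write a few lines to a
# temporary file so that the example runs anywhere.
sample = "I am Sam\n\nI am Sam\nSam I am\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
    path = tmp.name

# Read the whole file into a single string.
with open(path, encoding="utf-8") as f:
    demo_text = f.read()

# The file is closed once the block ends, but the string is still available.
print(f.closed)             # True
print(demo_text == sample)  # True

os.remove(path)  # clean up the temporary file
```

Because the string outlives the `with` block, later cells can keep working with it even though the file itself is already closed.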

## Strings are like Lists

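Strings in Python behave much like lists: you can index them, slice them, and iterate over them. The main difference is that strings are immutable, so you cannot change a character in place.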

In [4]:
# You can index and slice them. This slice takes the first 100 characters.
text[:100]

'I am Sam\n\nI am Sam\nSam I am\n\nThat Sam-I-am\nThat Sam-I-am!\nI do not like\nthat Sam-I-am\n\nDo you like\ng'

In [11]:
# You can iterate over the characters in a string.
num_As = 0
for char in text:
    # Compare case-insensitively so that both 'a' and 'A' are counted.
    if char.upper() == 'A':
        num_As += 1
num_As

210
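
The sketch below collects a few more list-like operations on a short string, and shows the one place where strings differ from lists: they cannot be modified in place. The variable names (`s`, `num_as`, `lowered`) are chosen for illustration only.

```python
s = "I am Sam"

print(s[0])        # first character: 'I'
print(s[-3:])      # last three characters: 'Sam'
print(len(s))      # number of characters: 8
print("Sam" in s)  # substring test: True

# Counting characters without an explicit loop.
num_as = s.upper().count("A")
print(num_as)      # 2

# Unlike lists, strings are immutable: item assignment raises TypeError.
try:
    s[0] = "i"
except TypeError as err:
    print("TypeError:", err)

# To "change" a string, build a new one instead.
lowered = s[0].lower() + s[1:]
print(lowered)     # 'i am Sam'
```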

### WARNING

Do not attempt to print large amounts of text in the Jupyter notebook. Your notebook may freeze, and you may not be able to do anything for several minutes.

## Splitting and Stripping

We want to count the number of distinct words in the text. To do so, we first need to split the string into words, which we can do with the `.split()` method. (By default, if you don't pass in a separator, it splits on runs of whitespace, including newlines.)
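
The difference between the default behavior and an explicit separator is easiest to see on a small example. The string `line` below is made up for illustration; it contains a double space and blank lines, as real text often does.

```python
line = "I am  Sam\n\nThat Sam-I-am!\n"

# No argument: split on runs of whitespace and drop empty strings.
print(line.split())
# ['I', 'am', 'Sam', 'That', 'Sam-I-am!']

# Explicit separator: split on every single space, keeping empty strings
# and leaving the newlines inside the pieces.
print(line.split(" "))
# ['I', 'am', '', 'Sam\n\nThat', 'Sam-I-am!\n']

# To split into lines rather than words, use splitlines().
print(line.splitlines())
# ['I am  Sam', '', 'That Sam-I-am!']
```

For counting words, the no-argument form is usually the one you want, since it never produces empty strings.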

In [18]:
words = text.split()
words

['I',
 'am',
 'Sam',
 'I',
 'am',
 'Sam',
 'Sam',
 'I',
 'am',
 'That',
 'Sam-I-am',
 'That',
 'Sam-I-am!',
 'I',
 'do',
 'not',
 'like',
 'that',
 'Sam-I-am',
 'Do',
 'you',
 'like',
 'green',
 'eggs',
 'and',
 'ham',
 'I',
 'do',
 'not',
 'like',
 'them,',
 'Sam-I-am.',
 'I',
 'do',
 'not',
 'like',
 'green',
 'eggs',
 'and',
 'ham.',
 'Would',
 'you',
 'like',
 'them',
 'Here',
 'or',
 'there?',
 'I',
 'would',
 'not',
 'like',
 'them',
 'here',
 'or',
 'there.',
 'I',
 'would',
 'not',
 'like',
 'them',
 'anywhere.',
 'I',
 'do',
 'not',
 'like',
 'green',
 'eggs',
 'and',
 'ham.',
 'I',
 'do',
 'not',
 'like',
 'them,',
 'Sam-I-am',
 'Would',
 'you',
 'like',
 'them',
 'in',
 'a',
 'house?',
 'Would',
 'you',
 'like',
 'them',
 'with',
 'a',
 'mouse?',
 'I',
 'do',
 'not',
 'like',
 'them',
 'in',
 'a',
 'house.',
 'I',
 'do',
 'not',
 'like',
 'them',
 'with',
 'a',
 'mouse.',
 'I',
 'do',
 'not',
 'like',
 'them',
 'here',
 'or',
 'there.',
 'I',
 'do',
 'not',
 'like',
 'them',

In [15]:
len(set(words))

117

Why are there so many different words? 

We haven't accounted for capitalization (so "would" is different from "Would") or punctuation (so "house?" is different from "house.")! Let's convert all words to lowercase and strip the punctuation. This process of converting text to a standard form is known as **normalization**.

In [26]:
normal = []
for word in words:
    normal.append(word.lower().rstrip(".?!,"))

In [28]:
len(set(normal))

52
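
Note that `.rstrip()` only removes characters from the end of a word, and only the ones you list. As a hedged sketch of a slightly more general approach, the example below strips every character in `string.punctuation` from both ends of each word. The word list `sample_words` is hypothetical; whether the actual text contains leading punctuation such as quotation marks is an assumption, not something checked here.

```python
import string

# Hypothetical words, including one wrapped in quotation marks.
sample_words = ["Would", "would", "house?", "house.", "Sam-I-am!", '"Here,"']

# rstrip() stops at the first trailing character that is not in its
# argument, so the closing quote protects the comma.
print('"Here,"'.lower().rstrip(".?!,"))           # "here,"

# strip() with string.punctuation removes punctuation from both ends
# but leaves interior characters, such as the hyphens in Sam-I-am, alone.
print('"Here,"'.lower().strip(string.punctuation))  # here

# A set comprehension normalizes and de-duplicates in one step.
normalized = {w.lower().strip(string.punctuation) for w in sample_words}
print(sorted(normalized))
# ['here', 'house', 'sam-i-am', 'would']
```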

## Format Strings

Another useful feature of Python is format strings, which let you insert numbers or other strings into a template string. The `%` operator used here is Python's counterpart of `sprintf()` in C: `%d` marks a slot for an integer and `%s` a slot for a string.

In [29]:
"There are %d unique words in %s." % (len(set(words)), "Green Eggs and Ham")

'There are 117 unique words in Green Eggs and Ham.'

Instead of specifying the replacement values in order as a tuple, we can also assign names to the replacement arguments and then specify the replacement values as a dict. This is useful when the same replacement argument appears multiple times in a string.

Note also the use below of a multiline string, which is delimited by triple quotes (instead of single quotes).

In [35]:
print("""I am %(name)s
%(name)s I am

That %(name)s-I-am
That %(name)s-I-am!
I do not like
that %(name)s-I-am.""" % {
    "name": "Sam"
})

I am Sam
Sam I am

That Sam-I-am
That Sam-I-am!
I do not like
that Sam-I-am.
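
The `%` operator is not the only way to format strings. As a brief sketch, the same results can be produced with the `str.format()` method or, in Python 3.6 and later, with f-strings. The variable names below are chosen for illustration, and the count 117 is simply copied from the output earlier in this workbook.

```python
title = "Green Eggs and Ham"
n_unique = 117  # copied from the earlier output, not recomputed here

# Three equivalent ways to build the same sentence.
a = "There are %d unique words in %s." % (n_unique, title)
b = "There are {} unique words in {}.".format(n_unique, title)
c = f"There are {n_unique} unique words in {title}."
print(a == b == c)  # True

# f-strings also work with triple-quoted multiline strings, and a name
# can be reused as many times as needed.
name = "Sam"
verse = f"""I am {name}
{name} I am"""
print(verse)
```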
