# From Files to Strings to "Texts"

In [1]:
# First we open the file:
opened_file = open('../data/mdg.txt', 'r')

# Then we read the file:
text_as_read = opened_file.read()

In [2]:
# What happens when we try to print the opened file?
print(opened_file)

<_io.TextIOWrapper name='../data/mdg.txt' mode='r' encoding='UTF-8'>
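The file object above stays open until we close it. A common alternative is to open the file inside a `with` block, which closes it automatically. The following is a minimal sketch: `sample_path` and its contents are hypothetical stand-ins for `../data/mdg.txt`, created in a temporary folder so that the example runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for '../data/mdg.txt' so the sketch runs anywhere.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("Off there to the right -- somewhere -- is a large island.")

# The with-block closes the file as soon as the indented code finishes.
with open(sample_path, "r", encoding="utf-8") as opened_file:
    text_as_read = opened_file.read()

print(type(text_as_read))   # the contents come back as a plain string
print(opened_file.closed)   # the file object has already been closed
```

The string survives after the block ends; only the connection to the file is released.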


In [4]:
# The work of the two lines above can also be achieved in one line.
mdg = open('../data/mdg.txt', 'r').read()

In [5]:
type(mdg)

str

In [6]:
len(mdg)

44236

In [7]:
print(mdg[0:100])

"Off there to the right -- somewhere -- is a large island," said Whitney. "It's rather a mystery -- 


In [8]:
# One way to include a block of text is to use triple quotes.
# It doesn't matter if the quotes are double or single.
text = """
The Nellie, a cruising yawl, swung to her anchor without a flutter of
the sails, and was at rest. The flood had made, the wind was nearly
calm, and being bound down the river, the only thing for it was to come
to and wait for the turn of the tide.
"""

split_text = text.split()

print(split_text)

['The', 'Nellie,', 'a', 'cruising', 'yawl,', 'swung', 'to', 'her', 'anchor', 'without', 'a', 'flutter', 'of', 'the', 'sails,', 'and', 'was', 'at', 'rest.', 'The', 'flood', 'had', 'made,', 'the', 'wind', 'was', 'nearly', 'calm,', 'and', 'being', 'bound', 'down', 'the', 'river,', 'the', 'only', 'thing', 'for', 'it', 'was', 'to', 'come', 'to', 'and', 'wait', 'for', 'the', 'turn', 'of', 'the', 'tide.']


In [9]:
comma_text = text.split(",")
print(comma_text)

['\nThe Nellie', ' a cruising yawl', ' swung to her anchor without a flutter of\nthe sails', ' and was at rest. The flood had made', ' the wind was nearly\ncalm', ' and being bound down the river', ' the only thing for it was to come\nto and wait for the turn of the tide.\n']


In [11]:
# When text in a code example runs across several lines,
# enclose it in triple quotation marks so Python treats it as one string.
# (This is one reason to load text from a file.)

sentence = """The Nellie, a cruising yawl, swung to her anchor without a flutter of the sails, and was at rest."""

split = sentence.split()
print(split)
# print("This set has {} items:".format(len(set(split))), set(split))

['The', 'Nellie,', 'a', 'cruising', 'yawl,', 'swung', 'to', 'her', 'anchor', 'without', 'a', 'flutter', 'of', 'the', 'sails,', 'and', 'was', 'at', 'rest.']


In [10]:
'The' == 'the'

False

In [12]:
sentence = """The Nellie, a cruising yawl, swung to her anchor without a flutter of the sails, and was at rest."""

split = sentence.lower().split()

print("This set has {} items:".format(len(set(split))), set(split))

This set has 17 items: {'was', 'to', 'rest.', 'cruising', 'at', 'nellie,', 'a', 'flutter', 'her', 'sails,', 'anchor', 'and', 'without', 'the', 'of', 'yawl,', 'swung'}
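Notice that punctuation is still glued to some of the words in the set: `'nellie,'` and `'rest.'`, for example. One possible way to trim it, sketched here with the standard library's `string.punctuation` (the name `cleaned` is just an illustration), is to strip punctuation from both ends of each token:

```python
import string

sentence = "The Nellie, a cruising yawl, swung to her anchor without a flutter of the sails, and was at rest."

# Lowercase, split on whitespace, then trim punctuation from both ends of each token.
cleaned = [word.strip(string.punctuation) for word in sentence.lower().split()]

print(cleaned)
print("Unique words:", len(set(cleaned)))
```

Because `strip()` only touches the ends of a token, punctuation inside a word, such as the apostrophe in a contraction, is left alone.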


### Using Regex to Tokenize

In [13]:
# We need the regular expression library
import re

mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()

A quick walkthrough might help a little:

* **`import re`** tells the script to import the regular expression module which comes bundled with every Python installation but doesn't get loaded into our workspace unless we tell it to do so.
* **`mdg_words`** is the object we are creating: everything to the right is the process by which we create the object.
* We aren't going to discuss the regular expression substitution, **`re.sub()`** that gets done here, except to note that you should read the stuff inside the parentheses like this: `(find this pattern, substitute this, in this text)` -- in this case I am telling it to find things that are **not** (`[^ ]` is called a *negated set*) letters (big and small) or apostrophes and replace them with spaces. 
* **`.lower()`** is a method you can apply to strings that makes everything lower case -- otherwise "The" and "the" would count as two different words.
* **`.split()`** is the string method discussed above and we're using its default setting of splitting on white spaces, of which we have plenty, since we have replaced everything except for letters and apostrophes with white space.

The **`split()`** method turns our string into a **list**. In this case, a list of words that are in the same order as they are in the original text. (The computer has no reason to disturb this order.)
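To watch the chain work one step at a time, here is a small sketch that applies the same pattern to a short sample string instead of the whole story. The names `sample`, `step_one`, `step_two`, and `step_three` are only for illustration.

```python
import re

sample = '"It\'s rather a mystery -- "'

# Step 1: replace every character that is not a letter or an apostrophe with a space.
step_one = re.sub("[^a-zA-Z']", " ", sample)

# Step 2: make everything lower case.
step_two = step_one.lower()

# Step 3: split on whitespace, which turns the string into a list.
step_three = step_two.split()

print(repr(step_one))   # repr() makes the leading and trailing spaces visible
print(step_three)
```

The substitution swaps characters one for one, so the string keeps its length; it is `split()` that throws the extra spaces away.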

If we ask how long the list is, we should come back with a reasonably close count of the words -- don't forget we kept apostrophes, and while most are buried inside contractions, there may be some loose apostrophes. 

In [14]:
print('Words in text: {}.'.format(len(mdg_words)))

Words in text: 8017.


In [15]:
# We can slice strings and lists, 
# but lists of words are more human-like than strings of characters.
# Weird, but true.

print(mdg_words[0:50])

['off', 'there', 'to', 'the', 'right', 'somewhere', 'is', 'a', 'large', 'island', 'said', 'whitney', "it's", 'rather', 'a', 'mystery', 'what', 'island', 'is', 'it', 'rainsford', 'asked', 'the', 'old', 'charts', 'call', 'it', "'ship", 'trap', 'island', "'", 'whitney', 'replied', 'a', 'suggestive', 'name', "isn't", 'it', 'sailors', 'have', 'a', 'curious', 'dread', 'of', 'the', 'place', 'i', "don't", 'know', 'why']
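The loose apostrophes mentioned earlier show up here as `"'ship"` and a lone `"'"`. If they get in the way, one option is to strip apostrophes from the ends of each word and drop anything left empty. This is a sketch on a hand-picked sample of the list above, and it comes with a trade-off: a possessive such as `sailors'` would lose its apostrophe too.

```python
# A hand-picked sample of the tokens printed above.
words = ["call", "it", "'ship", "trap", "island", "'", "whitney", "isn't", "don't"]

# Strip apostrophes from the ends of each word; apostrophes inside a word are kept.
tidied = [word.strip("'") for word in words]

# Drop tokens that were nothing but apostrophes and are now empty strings.
tidied = [word for word in tidied if word]

print(tidied)
```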


### Using NLTK Tokenizers <a name="nltktoken"></a>

Whether or not you develop your own tokenizer somewhere along the way is entirely up to you and the nature of the projects you undertake. You may never find yourself doing so. There are certainly a lot of options already available, a number of which are packaged with the Natural Language Toolkit, more often called by its acronym, "NLTK", which, by the way, is also the name we use (in lower case) to import it in Python: 

```python
import nltk
```

If we were to run the line above, we would probably wait a few seconds as the NLTK library loaded -- it's quite large. Because it is a large library, and we really don't need all its functionality, most people don't load all of it -- and isn't it cool that we can load only the parts we need! This is handy as you begin to work with larger scripts and larger data sets: keeping your workspace as tidy, and as small, as possible becomes a necessity. And, in some cases, it's actually easier to import a single tool and call it by its own name.

#### NLTK's WhitespaceTokenizer <a name="whitespace"></a>

In [16]:
# In this instance, we are going to tell Python that we only want 
# one particular tool from the larger toolkit:

from nltk.tokenize import WhitespaceTokenizer

In [17]:
mdg_tokens = WhitespaceTokenizer().tokenize(mdg.lower())

Please note that we now have two versions of "The Most Dangerous Game" rendered as a list in Python: **`mdg_words`** and **`mdg_tokens`**. We created the former using regex and the latter with the NLTK WhitespaceTokenizer. We could have re-used the former name, and that would have, in effect, replaced the older list with the newer one. Please keep this in mind as you work, and also remember to use sensible names for the objects you create. (The more self-explanatory things are, the better as your code grows.)

Now, let's print the first 50 items of the new list -- I use 50 as a convenient, quick look, but you can adjust the number down to 25 or up to 100 or some other number that you find more useful. I often change the numbers several times, contracting and expanding as I feel the need.

In [18]:
print(mdg_tokens[0:50])

['"off', 'there', 'to', 'the', 'right', '--', 'somewhere', '--', 'is', 'a', 'large', 'island,"', 'said', 'whitney.', '"it\'s', 'rather', 'a', 'mystery', '--', '"', '"what', 'island', 'is', 'it?"', 'rainsford', 'asked.', '"the', 'old', 'charts', 'call', 'it', "'ship-trap", 'island,\'"', 'whitney', 'replied.', '"a', 'suggestive', 'name,', "isn't", 'it?', 'sailors', 'have', 'a', 'curious', 'dread', 'of', 'the', 'place.', 'i', "don't"]
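The difference between the two lists is easier to see on a single line of text. As far as I know, splitting on whitespace with NLTK's tokenizer gives the same result as Python's own `split()`, so this sketch uses only the standard library and compares a plain whitespace split with the regex approach from earlier. The variable names are illustrative.

```python
import re

opening = '"Off there to the right -- somewhere -- is a large island," said Whitney.'

# Whitespace only: punctuation stays attached to the words, and dashes become tokens.
whitespace_tokens = opening.lower().split()

# Regex first: everything except letters and apostrophes becomes a space before splitting.
regex_tokens = re.sub("[^a-zA-Z']", " ", opening).lower().split()

print(len(whitespace_tokens), whitespace_tokens)
print(len(regex_tokens), regex_tokens)
```

Which version you want depends on the question you are asking: the whitespace version keeps the punctuation, while the regex version gives a cleaner word count.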


#### NLTK's Word_Tokenize <a name="word"></a>


The whitespace tokenizer leaves contractions such as *it's* and *don't* intact. To be clear, splitting contractions is actually a feature from the point of view of natural language processing. To follow that form of processing, you can use NLTK's **`word_tokenize()`**. As always, you can either import the entire `nltk` library and then use the function in your code by writing `tokens = nltk.word_tokenize(your_text)`, or import it specifically, as below:

In [19]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(mdg.lower())

print(tokens[0:25])

['``', 'off', 'there', 'to', 'the', 'right', '--', 'somewhere', '--', 'is', 'a', 'large', 'island', ',', "''", 'said', 'whitney', '.', '``', 'it', "'s", 'rather', 'a', 'mystery', '--']


As you can see, the contraction *it's* has been split into *it* and *'s*. 

Finally, we should note that the preferred practice when working with the Penn Treebank word_tokenizer is to use it in combination with the sentence tokenizer. What's that you say? There's a function in the NLTK that will break my text into sentences? Yes, there is, and in a subsequent notebook, working again with "The Most Dangerous Game," we will explore its utility. For now, let's just make sure we know how to use it, and, anticipating some control structures, how to use it in combination with the word tokenizer.

First, let's import the sentence tokenizer, create a list of sentences, and have a look at the first five sentences it creates:

In [None]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(mdg.lower())

print(sentences[0:5])

That's a little hard to read. What if we write a `for` loop to make the printout a bit more readable? In Python 3, `print()` ends its output with a newline by default, so printing the sentences one at a time puts each on its own line: `print(whatever)`. (You may come across a trailing comma, as in `print(whatever,)`. It is a holdover from the Python 2 `print` statement, where a trailing comma suppressed the newline; in Python 3 it changes nothing.) If you want a blank line between items, add a newline character to whatever you are printing: `print(whatever + "\n")`. I'm lazy, so I used the plain `print()` below:

In [None]:
for sentence in sentences[0:5]:
    print(sentence)

As you can see from even this limited example, the sentence tokenizer is not perfect: note how it splits the sentence with a quoted question into two. Yikes. If things like this matter to you, you are going to need to write your own tokenizer -- which is not as hard as it may sound right now, but it is additional work when you would rather be examining your text.

Now, let's see if we can't combine the two tokenizers so that we get the preferred context for the word tokenizer, which is a single sentence, and yet all of our text is tokenized...

If we simply pass the output of the sentence tokenizer as input to the word tokenizer, we will get a `TypeError`: both tokenizers expect a string as input, and the output of the sentence tokenizer is a list, so something's gonna break:

```python
sentences = sent_tokenize(mdg)
tokens = word_tokenize(sentences)
print(tokens[0:20])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-b0aa2f5e281c> in <module>
      1 sentences = sent_tokenize(mdg)
----> 2 tokens = word_tokenize(sentences)
      3 print(tokens[0:20])
```

See how Python not only tells you the kind of error but where the error occurs? (Sometimes it will actually point out the exact spot in a line!)
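The same kind of mismatch is easy to reproduce without NLTK. This sketch hands a list to `re.sub()`, which expects a string, catches the resulting `TypeError`, and then does the job properly by processing one string at a time. It is a standard-library illustration of the idea, not a replay of the traceback above, and the two sample sentences are made up for the purpose.

```python
import re

sentences = ["Off there to the right.", "It's rather a mystery."]

caught = None
try:
    # re.sub() wants a single string, so a whole list makes it raise a TypeError.
    re.sub("[^a-zA-Z']", " ", sentences)
except TypeError as error:
    caught = error
    print("TypeError:", error)

# The fix is to hand over one string at a time.
words = [re.sub("[^a-zA-Z']", " ", sentence).lower().split() for sentence in sentences]
print(words)
```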

What we need to do is embed one of the tokenizers inside the other, stacking them as it were, in order to get the results we want:

In [None]:
tokens = []
for sentence in sent_tokenize(mdg):
    for word in word_tokenize(sentence):
        tokens.append(word)
print(tokens[0:50])

It is a bit unfair that your second sight of a `for` loop should be one wrapped inside another, but this bit of code is not as complicated as it might look. Let's examine it more closely:

**`tokens = []`** creates an empty list, like first putting a bucket under a spigot before turning it on. (If you think about it, the idea that you can create the bucket as you pour the water is both weird and cool, but here's the old-fashioned way of doing things. Hmm, it feels solid.)

**`for sentence in sent_tokenize(mdg):`** grabs a sentence at a time using the power of the sentence tokenizer to do the work.

**`for word in word_tokenize(sentence):`** tells Python, "Hey, bud, while you've got that sentence in your hand, would you go ahead and tokenize it?"

**`tokens.append(word)`** drops each word (or punctuation mark!) into the `tokens` bucket. 

**`print(tokens[0:50])`** just helps us check our work.
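The same nested pattern can be tried without NLTK installed. In the sketch below, `simple_sent_split` and `simple_word_split` are hypothetical, much cruder stand-ins for `sent_tokenize` and `word_tokenize`, written with regular expressions purely so the example is self-contained. It also shows that the two loops can be collapsed into a single list comprehension that builds the same list.

```python
import re

def simple_sent_split(text):
    """Hypothetical stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def simple_word_split(sentence):
    """Hypothetical stand-in for word_tokenize: runs of letters/apostrophes, or single punctuation marks."""
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sentence)

text = "The flood had made. The wind was nearly calm!"

# The nested loops: the outer one walks the sentences, the inner one walks the words.
tokens = []
for sentence in simple_sent_split(text):
    for word in simple_word_split(sentence):
        tokens.append(word)

# The same work as a list comprehension; the for clauses read in the same order as the loops.
flat_tokens = [word for sentence in simple_sent_split(text) for word in simple_word_split(sentence)]

print(tokens)
print(tokens == flat_tokens)
```

The explicit loops are easier to read and to sprinkle with `print` calls while you are learning; the comprehension is the more compact form you will often see in other people's code.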

**Nota bene**: about this constant `print` thing I do to check my work, feel free to drop it anywhere, especially in the middle of things like `for` loops: it's a great way to see what the program is doing and helps you learn how to code more quickly. (This is something my collaborator, Katherine Kinnaird, taught me.)

Okay, that's enough for one workbook. We will return to this combination, or stack, of sentence and word tokenizers in the next workbook when we turn to the matter of counting words and creating dictionaries.

## Word Frequencies

Once we've established how we are going to tokenize our strings into words, we can start counting those words, er, tokens! In the remaining part of this notebook, we are going to explore three ways to compile word frequencies for a text:

- The Built-In Way
- The NLTK Way
- The Pandas Way

In [None]:
import re

# First we load our file into a string
mdg = open('mdg.txt', 'r').read()

# Then we turn that string into a list of words
mdg_words = re.sub("[^a-zA-Z']"," ", mdg).lower().split()
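For the built-in way, one option is `Counter` from the standard library's `collections` module, which takes a list and tallies how often each item occurs. The sketch below runs on a short sample sentence rather than the full text so that it is self-contained; `sample`, `sample_words`, and `word_counts` are illustrative names, and passing `mdg_words` to `Counter` should work the same way.

```python
from collections import Counter
import re

sample = "The flood had made, the wind was nearly calm, and the tide was turning."

# Same tokenizing recipe as above: keep letters and apostrophes, lowercase, split.
sample_words = re.sub("[^a-zA-Z']", " ", sample).lower().split()

# Counter builds a dictionary-like object mapping each word to its count.
word_counts = Counter(sample_words)

print(word_counts["the"])
print(word_counts.most_common(3))   # the three most frequent words with their counts
```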

In [None]:
import pandas as pd

mdg_series = pd.Series(mdg_words)

print(mdg_series[0:5])

In [None]:
mdg_counts = mdg_series.value_counts()
print(mdg_counts[0:5])

In [None]:
mdg_counts.to_csv('mdg.csv')
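If you counted words with the built-in tools rather than Pandas, the standard library's `csv` module can save the results in a similar spirit. This is a sketch under a few assumptions: the word list is a made-up sample, the header names `word` and `count` are my own choice, and the output goes to an in-memory buffer instead of a file so that nothing is written to disk. The layout is not meant to match what `to_csv` produces.

```python
import csv
import io
from collections import Counter

# A made-up sample; in practice this would be the full list of words.
word_counts = Counter(["the", "flood", "the", "wind", "the", "was", "was"])

# io.StringIO behaves like an open text file but lives in memory.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "count"])              # header row (names chosen for illustration)
writer.writerows(word_counts.most_common())     # one row per word, most frequent first

print(buffer.getvalue())
```

To write a real file instead, replace the buffer with `open('some_name.csv', 'w', newline='')` inside a `with` block.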