# Get a Unique Word List from Gutenberg

This notebook is a variation on Anna's [Get Unique Token Jupyter Notebook](http://localhost:8891/notebooks/data_prep/create_unique_token_list.ipynb).

We will pull a text corpus directly from NLTK's Gutenberg collection. Project Gutenberg contains 60,000 public domain e-books made available for non-commercial use; NLTK bundles a small selection of these texts, listed below.

In [49]:
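# Import the Gutenberg corpus reader
# (if the corpus is not installed yet, run nltk.download('gutenberg') once)
from nltk.corpus import gutenberg

# List the texts bundled with NLTK
gutenberg.fileids()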

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [60]:
# Let's get Jane Austen's "Emma"
emma_text = gutenberg.raw('austen-emma.txt')
print(len(emma_text))
print(emma_text[50:218])

887071
Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly


# Clean the text and split it into words

In [36]:
# Import re, Python's regular expression module
import re

# Pattern matching any single character that is not a lowercase letter a-z
pattern = r"[^a-z]"

# Lowercase the text, then replace every match
# (digits, punctuation, whitespace) with a space
emma_txt = re.sub(pattern, ' ', emma_text.lower())

In [40]:
# Split on runs of one or more spaces to get the individual words
emma_wordlist = re.split(r" +", emma_txt)
len(emma_wordlist)

161977
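
To see what the cleaning and splitting steps do, here is a minimal sketch on a short string rather than the full novel. The `sample` text and the variable names `cleaned`, `split_with_re`, and `split_plain` are illustrative assumptions, not part of the original notebook. The sketch also shows that `re.split` can leave an empty string at each end of the list when the cleaned text starts or ends with a space, and that `str.split()` with no arguments avoids this.

```python
import re

# Illustrative sample that starts and ends with non-letter characters
sample = "[Emma by Jane Austen 1816]\n"

# Lowercase, then replace every character outside a-z with a space
cleaned = re.sub(r"[^a-z]", " ", sample.lower())

# Splitting on runs of spaces keeps an empty string at each end,
# because the cleaned text begins and ends with a space
split_with_re = re.split(r" +", cleaned)
print(split_with_re)   # ['', 'emma', 'by', 'jane', 'austen', '']

# str.split() with no arguments ignores leading and trailing whitespace
split_plain = cleaned.split()
print(split_plain)     # ['emma', 'by', 'jane', 'austen']
```

Either approach works for building the word list; with `re.split`, any empty strings need to be removed afterwards.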

# Keep only unique words

In [42]:
# Create an empty list where we'll store exactly one
# of each token
unique_token_list = []

# For each token in the word list,
for token in emma_wordlist:
    # if (and only if) that token is not yet in the unique list
    if token not in unique_token_list:
        # add it to the unique list
        unique_token_list.append(token)

# Sort the list, so it'll be easier
# to spot duplicates if they exist
unique_token_list.sort()

In [59]:
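# After sorting, the first element is the empty string left over
# from re.split, so drop it (run this cell only once)
unique_token_list = unique_token_list[1:]
print(len(unique_token_list))
print(unique_token_list[100:120])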

7094
['active', 'activity', 'actual', 'actually', 'acute', 'acuteness', 'adair', 'adapt', 'add', 'added', 'adding', 'addition', 'additional', 'address', 'addressed', 'addresses', 'addressing', 'adelaide', 'adequate', 'adherence']
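
The loop above checks every token against a growing list, which becomes slow for long texts. As an alternative sketch (not the approach used in this notebook), a `set` removes duplicates in one step, and `discard('')` drops the empty string if it is present. The `words` list below is a small hypothetical stand-in for `emma_wordlist`.

```python
# Small stand-in for emma_wordlist, including stray empty strings
words = ['', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich',
         'and', 'emma', '']

# A set keeps exactly one copy of each distinct token
unique_words = set(words)

# Remove the empty string; discard() does nothing if it is absent
unique_words.discard('')

# sorted() returns a new alphabetically ordered list
unique_sorted = sorted(unique_words)
print(unique_sorted)
# ['and', 'clever', 'emma', 'handsome', 'rich', 'woodhouse']
```

Because the empty string is removed by value rather than by position, this version is safe to run more than once.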


# Store our words in a text file

In [48]:
# Write one token per line
with open('jane_austen_emma.txt', 'w') as f:
    for token in unique_token_list:
        f.write(token + '\n')
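
As a quick sanity check, the saved file can be read back and compared with the list in memory. The sketch below does this with a short hypothetical `tokens` list and a temporary directory, so it does not touch `jane_austen_emma.txt`; the file name `tokens.txt` and the explicit `encoding='utf-8'` are assumptions made for the example.

```python
import os
import tempfile

# Short stand-in for unique_token_list
tokens = ['active', 'activity', 'actual']

# Work in a temporary directory that is deleted afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tokens.txt')

    # Write one token per line, as in the cell above
    with open(path, 'w', encoding='utf-8') as f:
        for token in tokens:
            f.write(token + '\n')

    # Read the file back; splitlines() drops the newline characters
    with open(path, encoding='utf-8') as f:
        loaded = f.read().splitlines()

# True if the round trip preserved every token in order
print(loaded == tokens)
```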