# Step 1: Explore the data

In [19]:
file = open('../1.3_Cleaning_Text_Data/metamorphosis_clean.txt', 'r+')
data = file.read()
file.close()

In [26]:
data[:1000]

'One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin.  He lay on\nhis armour-like back, and if he lifted his head a little he could\nsee his brown belly, slightly domed and divided by arches into stiff\nsections.  The bedding was hardly able to cover it and seemed ready\nto slide off any moment.  His many legs, pitifully thin compared\nwith the size of the rest of him, waved about helplessly as he\nlooked.\n\n"What\'s happened to me?" he thought.  It wasn\'t a dream.  His room,\na proper human room although a little too small, lay peacefully\nbetween its four familiar walls.  A collection of textile samples\nlay spread out on the table - Samsa was a travelling salesman - and\nabove it there hung a picture that he had recently cut out of an\nillustrated magazine and housed in a nice, gilded frame.  It showed\na lady fitted out with a fur hat and fur boa who sat upright,\nraising a heavy fur muff that covered the whole

In [21]:
len(data)

119163

# Step 2. Whitespace/Punctuation/Normalize Case
## Manually

### Tokenization
Tokenization describes splitting paragraphs into sentences, or sentences into individual words.
Sentences can be split into individual words and punctuation through a similar process. Most commonly we split words between white spaces.
There can be problems when a word is abbreviated, truncated or is possessive especially in the case of names that use punctuation (like O’Neil).

In [29]:
# Split by Whitespace

words = data.split()
words[90:100]

['thought.',
 'It',
 "wasn't",
 'a',
 'dream.',
 'His',
 'room,',
 'a',
 'proper',
 'human']

In [30]:
# Remove punctuation

import re
words = re.split(r'\W+', data)
words[90:100]

['me', 'he', 'thought', 'It', 'wasn', 't', 'a', 'dream', 'His', 'room']

In this code `\W+` matches one or more word characters (same as `[a-zA-Z0-9_]+`). So simple!

In [31]:
# Split by Whitespace and Remove Punctuation

words = data.split()

# remove punctuation from each word
import string
table = str.maketrans('','',string.punctuation)
stripped = [w.translate(table) for w in words]
stripped[90:100]

['thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']

We used python3 string function this time. If you prefer regex or other methods, just use whatever you want.

* `maketrans()` This static method returns a translation table usable for `str.translate()`
* `translate()` Return a copy of the string in which each character has been mapped through the given translation table.
Finally, we got `"wasnt" from "wasn't".`

### Capitalization
Reduce everything to lower case for simplicity. But it is important to remember that some words, like “US” to “us”, can change meanings when reduced to the lower case.

In [34]:
words = data.split()

# convert to lower case
words = [word.lower() for word in words]
words[90:100]

['thought.',
 'it',
 "wasn't",
 'a',
 'dream.',
 'his',
 'room,',
 'a',
 'proper',
 'human']

## NLTK

In [35]:
!pip install -U nltk
!python -m nltk.downloader all

Requirement already up-to-date: nltk in /opt/conda/lib/python3.6/site-packages
Requirement already up-to-date: singledispatch in /opt/conda/lib/python3.6/site-packages (from nltk)
Collecting six (from nltk)
  Downloading https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: six
  Found existing installation: six 1.11.0
    Uninstalling six-1.11.0:
      Successfully uninstalled six-1.11.0
Successfully installed six-1.12.0
[33mYou are using pip version 9.0.1, however version 19.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpor

[nltk_data]    |   Unzipping corpora/qc.zip.
[nltk_data]    | Downloading package reuters to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    | Downloading package rte to /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/rte.zip.
[nltk_data]    | Downloading package semcor to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    | Downloading package senseval to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/senseval.zip.
[nltk_data]    | Downloading package sentiwordnet to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/sentiwordnet.zip.
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/sentence_polarity.zip.
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading p

### Tokenization

In [36]:
from nltk.tokenize import word_tokenize

In [44]:
words = word_tokenize(data)
words[100:110]

['me', '?', "''", 'he', 'thought', '.', 'It', 'was', "n't", 'a']

In [45]:
# Tokenization by sentences

from nltk import sent_tokenize
sentences = sent_tokenize(data)

In [47]:
sentences[1]

'He lay on\nhis armour-like back, and if he lifted his head a little he could\nsee his brown belly, slightly domed and divided by arches into stiff\nsections.'

### Filter out Punctuation

In [49]:
words = word_tokenize(data)

# Remove all tokens that are not alphabetic
words = [word for word in words if word.isalpha()]
words[90:100]

['It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']

# Step 3. Stopwords/Stemming

## 1. Filter out Stopwords and Pipelines
A majority of the words in a given text are connecting parts of a sentence rather than showing subjects, objects or intent. Word like “the” or “and” can be removed by comparing text to a list of stopwords.

In [51]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [53]:
# load data
file = open('../1.3_Cleaning_Text_Data/metamorphosis_clean.txt', 'r+')
data = file.read()
file.close()

# split into words
from nltk.tokenize import word_tokenize
words = word_tokenize(data)

# convert to lower case
words = [w.lower() for w in words]

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[90:100])

['lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy']


## 2. Stemming
Stemming is a process where words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix. There are several stemming models, including Porter and Snowball. But there is a danger of “over-stemming” were words like “universe” and “university” are reduced to the same root of “univers”.

In [60]:
# load data
file = open('../1.3_Cleaning_Text_Data/metamorphosis_clean.txt', 'r+')
data = file.read()
file.close()

# split into words
from nltk.tokenize import word_tokenize
words = word_tokenize(data)

# convert to lower case
words = [w.lower() for w in words]

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]

# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]

print(stemmed[:100])

['one', 'morn', 'gregor', 'samsa', 'woke', 'troubl', 'dream', 'found', 'transform', 'bed', 'horribl', 'vermin', 'He', 'lay', 'armourlik', 'back', 'lift', 'head', 'littl', 'could', 'see', 'brown', 'belli', 'slightli', 'dome', 'divid', 'arch', 'stiff', 'section', 'the', 'bed', 'hardli', 'abl', 'cover', 'seem', 'readi', 'slide', 'moment', 'hi', 'mani', 'leg', 'piti', 'thin', 'compar', 'size', 'rest', 'wave', 'helplessli', 'look', 'what', 'happen', 'thought', 'It', 'nt', 'dream', 'hi', 'room', 'proper', 'human', 'room', 'although', 'littl', 'small', 'lay', 'peac', 'four', 'familiar', 'wall', 'A', 'collect', 'textil', 'sampl', 'lay', 'spread', 'tabl', 'samsa', 'travel', 'salesman', 'hung', 'pictur', 'recent', 'cut', 'illustr', 'magazin', 'hous', 'nice', 'gild', 'frame', 'It', 'show', 'ladi', 'fit', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'rais', 'heavi']


# Step 4. Other tools

## 1. Lemmatization
Lemmatization is also an alternative to removing inflection. By determining the part of text and utilizing WordNet’s lexical database of English, it can get better results.

It is a more accurate but slower. Stemming may be more useful in queries for databases whereas Lemmatization may work much better when trying to determine text sentiment.

## 2. Word Embedding/Text Vectors
Word embedding is the modern way of representing words as vectors. The aim of word embeddings is to find a series of high dimensionality vectors (one for each word) that represent the relation of words in such a way that semantically related words are ‘close together’ in that high dimensional space. `Word2Vec` and `GloVe` are the most common models for converting text to vectors. Often, T-SNE (as well as PCA) is used to reduce the dimensionality enough to display as a 2 or 3 dimensional graph. Check out [this example](https://medium.com/cindicator/t-sne-and-word-embedding-weekend-of-a-data-scientist-5c99ddacbf51) of T-SNE applied to word embeddings.