# Clean Text Data Using the Natural Language Toolkit

Import the libraries needed. For more information on how to install the Natural Language Toolkit (NLTK), see
http://www.nltk.org/install.html

In [33]:
import pandas as pd #for dataframe imports and data management
import nltk #Natural Language ToolKit
import string #we are going to use some string methods in the cleaning

Import the sample dataset provided by the Creative Destruction Lab from a CSV file into a DataFrame.

In [27]:
dataset = pd.read_csv('applicant_ventures_without_cohort_1_overfit_test_doubled.csv')

Get the data that we are going to normalize and clean from the DataFrame. In this example I'm going to use the "Team Relationships" column of the dataset.

In [28]:
dataset["Team Relationships"]

0      Although there aren't co-founders in strict se...
1      Nick and Cathy have known each other since ear...
2                                 I am the only founder.
3      The founder Denis and Peter have known each ot...
4      The founders are siblings and have known each ...
5      Ruby joined the team as the HR Consultant in A...
6      The concept for Ardle started with my previous...
7      Nzola and Brent have known each other for thre...
8      Reza has known Dr. Shana Kelley for 1½ years a...
9      The team came together just around 2 year ago ...
10     The three of us have been working together for...
11     Although we were both at Western, Mallorie and...
12     Yuri and Alex have been working together for 7...
13     Gil and Eyal are brothers, grew up together (3...
14     Muneer, Andrei and Alex have known each other ...
15     Jacalyn and J. Grant Miller, her husband, have...
16     Ari, Jik and Rohan met early in the summer of ...
17     Brandon and Mike met in 

## Split by Whitespace and Remove Punctuation
We may want the words, but without the punctuation like commas and quotes. We also want to keep contractions together.
One way would be to split the document into words by white space, then use string translation to replace all punctuation with nothing (i.e. remove it). Python provides a constant called string.punctuation that lists the ASCII punctuation characters. For example:

In [35]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

We can use the function maketrans() to create a mapping table. The first two arguments are left empty because we are not replacing any characters with others; the third argument lists all of the characters to remove during the translation process.
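To see what such a mapping table contains, here is a minimal sketch that removes only two characters. The name `small_table` and the example string are made up for illustration and are not part of the dataset.

```python
# Build a table that deletes only '!' and '?'
# (a small, hypothetical stand-in for string.punctuation)
small_table = str.maketrans('', '', '!?')

# The table is a dict: keys are Unicode code points, None means "delete"
print(small_table)

# translate() applies the table to every character of the string
cleaned = "Really?! Yes!".translate(small_table)
print(cleaned)
```

The table maps the code points of `!` and `?` (33 and 63) to `None`, so the translated string comes out as `Really Yes`.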

In [43]:
table = str.maketrans('', '', string.punctuation)
sample = dataset["Team Relationships"][0]
words = sample.split()
stripped = [w.translate(table) for w in words]
print(stripped)

['Although', 'there', 'arent', 'cofounders', 'in', 'strict', 'sense', 'one', 'of', 'our', 'key', 'employees', 'was', 'Jonathans', 'roommate', 'while', 'he', 'was', 'a', 'student', 'on', 'exchange', 'in', 'Helsinki', 'They', 'have', 'known', 'each', 'other', 'for', '12', 'years', 'Jonathan', 'has', 'successful', 'working', 'relationships', 'with', 'both', 'advisors', 'as', 'well', 'Dr', 'Katzman', 'will', 'be', 'the', 'principal', 'investigator', 'on', 'an', 'upcoming', 'clinical', 'trial', 'of', 'the', '1Datapoint', 'technology', 'Dr', 'Srigley', 'worked', 'with', '1Datapoint', 'on', 'the', 'project', 'that', 'has', 'been', 'published', 'in', 'IEEE']


Contractions like “aren’t” have become “arent”, and hyphenated words like “co-founders” have become “cofounders”.
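If the goal is to keep contractions and hyphenated words intact, one possible alternative is to strip punctuation only from the start and end of each word. This is a sketch of a different technique from the translation table above; the sentence `text` is a shortened, hypothetical variant of the sample.

```python
import string

# Shortened, hypothetical variant of the sample text
text = "Although there aren't co-founders in strict sense, they met in Helsinki."

# str.strip(chars) removes the given characters only from both ends of a word,
# so apostrophes and hyphens inside a word are left alone
edge_stripped = [w.strip(string.punctuation) for w in text.split()]
print(edge_stripped)
```

With this approach “aren't” and “co-founders” survive unchanged, while the trailing comma and period are removed.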

## Normalizing Case

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. “Apple” the company vs “apple” the fruit is a commonly used example).

In [49]:
normalized_words = [word.lower() for word in words]
print(normalized_words)

['although', 'there', "aren't", 'co-founders', 'in', 'strict', 'sense,', 'one', 'of', 'our', 'key', 'employees', 'was', "jonathan's", 'roommate', 'while', 'he', 'was', 'a', 'student', 'on', 'exchange', 'in', 'helsinki.', 'they', 'have', 'known', 'each', 'other', 'for', '12', 'years.', 'jonathan', 'has', 'successful', 'working', 'relationships', 'with', 'both', 'advisors', 'as', 'well.', 'dr.', 'katzman', 'will', 'be', 'the', 'principal', 'investigator', 'on', 'an', 'upcoming', 'clinical', 'trial', 'of', 'the', '1datapoint', 'technology.', 'dr.', 'srigley', 'worked', 'with', '1datapoint', 'on', 'the', 'project', 'that', 'has', 'been', 'published', 'in', 'ieee.']
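The effect on vocabulary size can be checked on a tiny, made-up token list (an assumption for illustration, not data from the CSV):

```python
# Hypothetical tokens with mixed case
demo_tokens = ["Apple", "apple", "The", "the", "IEEE"]

# Vocabulary = set of distinct tokens
vocab_before = set(demo_tokens)
vocab_after = {t.lower() for t in demo_tokens}

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

Five distinct tokens collapse to three after lowercasing, and “Apple” can no longer be told apart from “apple”.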


## Tokenization and Cleaning with NLTK

In [51]:
# Split into sentences
from nltk import sent_tokenize 
sentences = sent_tokenize(sample)
sentences

["Although there aren't co-founders in strict sense, one of our key employees was Jonathan's roommate while he was a student on exchange in Helsinki.",
 'They have known each other for 12 years.',
 'Jonathan has successful working relationships with both advisors as well.',
 'Dr. Katzman will be the principal investigator on an upcoming clinical trial of the 1Datapoint technology.',
 'Dr. Srigley worked with 1Datapoint on the project that has been published in IEEE.']
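For comparison, a naive splitter built on a regular expression breaks on every period, including the one in “Dr.”. The sketch below uses only the standard library and a shortened piece of the sample; it is not how sent_tokenize works internally, it only illustrates the problem that a trained sentence tokenizer avoids.

```python
import re

# Two sentences taken (shortened) from the sample text
text = ("They have known each other for 12 years. "
        "Dr. Katzman will be the principal investigator.")

# Split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r'(?<=[.!?])\s+', text)
print(naive_sentences)
```

The naive version returns three pieces because it treats “Dr.” as the end of a sentence, whereas the sent_tokenize output above keeps “Dr. Katzman ...” together.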

In [53]:
# Split into words. word_tokenize splits tokens based on white space and punctuation:
# commas and periods are taken as separate tokens, contractions are split apart
# (e.g. "What's" becomes "What" and "'s"), quotes are kept, and so on.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sample)
print(tokens)

['Although', 'there', 'are', "n't", 'co-founders', 'in', 'strict', 'sense', ',', 'one', 'of', 'our', 'key', 'employees', 'was', 'Jonathan', "'s", 'roommate', 'while', 'he', 'was', 'a', 'student', 'on', 'exchange', 'in', 'Helsinki', '.', 'They', 'have', 'known', 'each', 'other', 'for', '12', 'years', '.', 'Jonathan', 'has', 'successful', 'working', 'relationships', 'with', 'both', 'advisors', 'as', 'well', '.', 'Dr.', 'Katzman', 'will', 'be', 'the', 'principal', 'investigator', 'on', 'an', 'upcoming', 'clinical', 'trial', 'of', 'the', '1Datapoint', 'technology', '.', 'Dr.', 'Srigley', 'worked', 'with', '1Datapoint', 'on', 'the', 'project', 'that', 'has', 'been', 'published', 'in', 'IEEE', '.']


### Filter Out Punctuation
We can filter out all tokens that we are not interested in, such as all standalone punctuation.

This can be done by iterating over all tokens and only keeping those tokens that are all alphabetic. Python has the function isalpha() that can be used. For example:

In [55]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sample)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words)

['Although', 'there', 'are', 'in', 'strict', 'sense', 'one', 'of', 'our', 'key', 'employees', 'was', 'Jonathan', 'roommate', 'while', 'he', 'was', 'a', 'student', 'on', 'exchange', 'in', 'Helsinki', 'They', 'have', 'known', 'each', 'other', 'for', 'years', 'Jonathan', 'has', 'successful', 'working', 'relationships', 'with', 'both', 'advisors', 'as', 'well', 'Katzman', 'will', 'be', 'the', 'principal', 'investigator', 'on', 'an', 'upcoming', 'clinical', 'trial', 'of', 'the', 'technology', 'Srigley', 'worked', 'with', 'on', 'the', 'project', 'that', 'has', 'been', 'published', 'in', 'IEEE']
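It is worth checking what isalpha() throws away, because it removes more than standalone punctuation. The sketch below runs the same filter on a handful of tokens copied from the word_tokenize output above; the variable names are illustrative.

```python
# A few tokens copied from the word_tokenize output above
some_tokens = ['are', "n't", 'co-founders', 'for', '12', 'years', '.',
               'Dr.', 'Katzman', '1Datapoint']

# isalpha() is True only if every character in the token is a letter
kept = [t for t in some_tokens if t.isalpha()]
dropped = [t for t in some_tokens if not t.isalpha()]

print(kept)
print(dropped)
```

Tokens such as “co-founders”, “12”, “Dr.” and “1Datapoint” are dropped along with the punctuation, which matches the filtered output above.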


### Filter out Stop Words (and Pipeline)
Stop words are those words that do not contribute to the deeper meaning of the phrase.

They are the most common words, such as “the”, “a”, and “is”.

For some applications, like document classification, it may make sense to remove stop words.

NLTK provides a list of commonly used stop words for a variety of languages, such as English. They can be loaded as follows:

In [57]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

You can see that they are all lower case. Note that contractions in the list, such as "you're" and "she's", keep their apostrophes.

You could compare your tokens to the stop words and filter them out, but you must ensure that your text is prepared the same way.
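A minimal sketch of why the preparation has to match: the stop word set below is a small illustrative subset of the list printed above, and the tokens are made up for the example.

```python
# Illustrative subset of the NLTK English stop word list printed above
demo_stop_words = {"the", "a", "is", "they", "have"}
demo_tokens = ["They", "have", "known", "the", "founders"]

# Comparing raw tokens misses capitalized stop words such as "They"
raw_filtered = [t for t in demo_tokens if t not in demo_stop_words]

# Lowercasing before the comparison makes the tokens match the list
lower_filtered = [t for t in demo_tokens if t.lower() not in demo_stop_words]

print(raw_filtered)
print(lower_filtered)
```

Without lowercasing, “They” slips through the filter because the list only contains “they”.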

Let’s demonstrate this with a small pipeline of text preparation including:

1. Load the raw text.
2. Split into tokens.
3. Convert to lowercase.
4. Remove punctuation from each token.
5. Filter out remaining tokens that are not alphabetic.
6. Filter out tokens that are stop words.

In [59]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sample)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words)

['although', 'nt', 'cofounders', 'strict', 'sense', 'one', 'key', 'employees', 'jonathan', 'roommate', 'student', 'exchange', 'helsinki', 'known', 'years', 'jonathan', 'successful', 'working', 'relationships', 'advisors', 'well', 'dr', 'katzman', 'principal', 'investigator', 'upcoming', 'clinical', 'trial', 'technology', 'dr', 'srigley', 'worked', 'project', 'published', 'ieee']


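Running this example, we can see that in addition to all of the other transforms, stop words like “a” and “the” have been removed.

Note that we are still left with tokens like “nt”: word_tokenize split “aren't” into “are” and “n't”, and removing the apostrophe left “nt”, which is not in the stop word list.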
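One possible way to get rid of such leftovers is to drop the contraction fragments before stripping punctuation. The sketch below works on tokens copied from the start of the word_tokenize output; the `contraction_parts` set is a hypothetical, non-exhaustive list, and `demo_stop_words` is a small subset of the NLTK list chosen for this example.

```python
import string

# Tokens copied from the start of the word_tokenize output above
demo_tokens = ['Although', 'there', 'are', "n't", 'co-founders', 'in',
               'strict', 'sense', ',']
# Small illustrative subset of the NLTK English stop word list
demo_stop_words = {'there', 'are', 'in'}
# Hypothetical, non-exhaustive set of fragments the tokenizer splits off
contraction_parts = {"n't", "'s", "'re", "'ve", "'ll", "'d", "'m"}

table = str.maketrans('', '', string.punctuation)
# Lowercase and drop contraction fragments while they are still recognizable
lowered = [t.lower() for t in demo_tokens if t.lower() not in contraction_parts]
# Remove punctuation, then keep alphabetic tokens that are not stop words
stripped = [t.translate(table) for t in lowered]
clean_words = [w for w in stripped if w.isalpha() and w not in demo_stop_words]
print(clean_words)
```

Whether dropping “n't” is acceptable depends on the application, since it also discards the negation.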

## Stem Words

Stemming refers to the process of reducing each word to its root or base. For example, “fishing”, “fished”, and “fisher” all reduce to the stem “fish”. Some applications, like document classification, may benefit from stemming in order to both reduce the vocabulary and focus on the sense or sentiment of a document rather than its deeper meaning.
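To make the idea concrete, here is a toy suffix stripper written for this example. It is a deliberately simplified sketch, not the Porter algorithm or any NLTK function: the helper `naive_stem` and its suffix list are assumptions, and real stemmers apply ordered rules with extra conditions.

```python
def naive_stem(word, suffixes=("ing", "ed", "er", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

demo_words = ["fishing", "fished", "fisher", "fish"]
stems = [naive_stem(w) for w in demo_words]
print(stems)
```

All four words map to “fish”, so they would count as a single vocabulary entry after stemming.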

There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the PorterStemmer class.

In [60]:
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(sample)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed)

['although', 'there', 'are', "n't", 'co-found', 'in', 'strict', 'sens', ',', 'one', 'of', 'our', 'key', 'employe', 'wa', 'jonathan', "'s", 'roommat', 'while', 'he', 'wa', 'a', 'student', 'on', 'exchang', 'in', 'helsinki', '.', 'they', 'have', 'known', 'each', 'other', 'for', '12', 'year', '.', 'jonathan', 'ha', 'success', 'work', 'relationship', 'with', 'both', 'advisor', 'as', 'well', '.', 'dr.', 'katzman', 'will', 'be', 'the', 'princip', 'investig', 'on', 'an', 'upcom', 'clinic', 'trial', 'of', 'the', '1datapoint', 'technolog', '.', 'dr.', 'srigley', 'work', 'with', '1datapoint', 'on', 'the', 'project', 'that', 'ha', 'been', 'publish', 'in', 'ieee', '.']
