# **CLEANING TEXT DATA**

## **0. DOWNLOAD DATA**

[Metamorphosis by Franz Kafka Plain Text UTF-8](http://www.gutenberg.org/cache/epub/5200/pg5200.txt) (load the page twice).

Open the file, delete the header and footer information, and save the result as "metamorphosis_clean.txt" in the ./data folder, which is where the code in this notebook looks for it.
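
The manual step can also be scripted. The following is a minimal sketch, assuming the downloaded file marks the body with lines beginning `*** START OF` and `*** END OF` (the usual Project Gutenberg convention, but check your copy). The function name `strip_gutenberg_boilerplate` and the sample string are made up for illustration.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines (assumed format)."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1   # body begins on the line after the START marker
        elif line.startswith("*** END OF"):
            end = i         # body stops on the line before the END marker
            break
    return "\n".join(lines[start:end]).strip()


# Hand-written stand-in for the downloaded file, not the real content
sample = "\n".join([
    "The Project Gutenberg eBook of Metamorphosis",
    "*** START OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***",
    "Metamorphosis",
    "Franz Kafka",
    "Translated by David Wyllie",
    "*** END OF THE PROJECT GUTENBERG EBOOK METAMORPHOSIS ***",
    "Updated editions will replace the previous one.",
])
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

Some front matter may remain after the START marker in the real file, so it is still worth opening the result and checking the first and last lines by hand.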

## **1. SNEAK PEEK INTO THE DATA**

### 1.1 LOAD DATA

In [1]:
# load the data (the Gutenberg plain-text file is UTF-8 encoded)
filename = './data/metamorphosis_clean.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

### 1.2 TOKENIZATION
Tokenization means splitting text into smaller units: paragraphs into sentences, or sentences into individual words.

In [2]:
# Split by Whitespace and Remove Punctuation

# split into words by white space
words = text.split()

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:10])

['Metamorphosis', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'I', 'One', 'morning']


[maketrans()](https://docs.python.org/3/library/stdtypes.html#str.maketrans): this static method returns a translation table usable for str.translate(). When called with three arguments, every character in the third argument is mapped to None, i.e. deleted.

[translate()](https://docs.python.org/3/library/stdtypes.html#str.translate): returns a copy of the string in which each character has been mapped through the given translation table.

Note the side effect: apostrophes are removed as well, so **wasn't** becomes **wasnt**.
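
The effect of the translation table is easiest to see on a few individual words. The sketch below uses hand-picked sample strings (an assumption, not taken from the file) and compares deleting all punctuation with trimming it only from the ends of each word, which keeps inner apostrophes and hyphens. It also shows that `string.punctuation` covers ASCII characters only, so typographic quotes survive either approach.

```python
import string

# Hand-picked samples; the last one uses typographic (non-ASCII) quotes
samples = ["wasn't", "morning,", "well-known", "“What’s"]

# Approach used above: delete every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
deleted = [w.translate(table) for w in samples]
print(deleted)   # ['wasnt', 'morning', 'wellknown', '“What’s']

# Alternative: trim punctuation only at the start and end of each word
trimmed = [w.strip(string.punctuation) for w in samples]
print(trimmed)   # ["wasn't", 'morning', 'well-known', '“What’s']
```

Which variant is preferable depends on whether contractions and hyphenated words should stay recognizable in later steps.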

### 1.3 CAPITALIZATION

In [3]:
words_lower = [word.lower() for word in words]
print(words_lower[:10])

['metamorphosis', 'franz', 'kafka', 'translated', 'by', 'david', 'wyllie', 'i', 'one', 'morning,']


## **2. NLTK**

### 2.1 INSTALL NLTK (Natural Language Toolkit)

In [4]:
#! pip3 install -U nltk
#! python3 -m nltk.downloader all

### 2.2 TOKENIZATION

In [5]:
# load the data
filename = './data/metamorphosis_clean.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()


# split by words
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
print(tokens[:10])

['Metamorphosis', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'I', 'One', 'morning']


- **words = word_tokenize(text)** - split into words
- **sentences = sent_tokenize(text)** - split into sentences (also imported from nltk.tokenize)
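
To make the two levels of tokenization concrete without NLTK, here is a naive regex-based sketch. It is a stand-in for illustration only and not how `sent_tokenize` or `word_tokenize` work internally; the helper names and the sample paragraph are made up, and a simple rule like this will split wrongly after abbreviations such as "Mr.".

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_words(sentence):
    """Extract alphabetic words, keeping an inner apostrophe (wasn't)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)

paragraph = "Gregor woke up. He wasn't well! What had happened?"
sentences = naive_sentences(paragraph)
print(sentences)
# ['Gregor woke up.', "He wasn't well!", 'What had happened?']
print(naive_words(sentences[1]))
# ['He', "wasn't", 'well']
```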

### 2.3 FILTER OUT PUNCTUATION

In [6]:
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:10])

['Metamorphosis', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'I', 'One', 'morning']
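
The `isalpha()` filter removes more than punctuation. The sketch below uses a hand-written token list that imitates the kind of output `word_tokenize` produces (contractions split into pieces such as `n't`, hyphenated words kept whole); the tokens are illustrative and not taken from the file.

```python
# Hand-written tokens imitating tokenizer output (illustrative only)
tokens = ['Gregor', ',', 'was', "n't", 'well-known', '1915', 'woke', '.']

kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)      # ['Gregor', 'was', 'woke']
print(dropped)   # [',', "n't", 'well-known', '1915', '.']
```

Contraction fragments, hyphenated words and numbers are discarded along with the punctuation, which may or may not be what a given task needs.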


## **3. STOPWORDS / STEMMING**

### 3.1 FILTER OUT STOPWORDS AND PIPELINES

Many of the words in a given text connect the parts of a sentence rather than convey subjects, objects or intent. Words like “the” or “and” can be removed by comparing the text against a list of stopwords.

In [10]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words[:10])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [8]:
# load data
filename = './data/metamorphosis_clean.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()


# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

# convert to lower case
tokens = [w.lower() for w in tokens]

# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]

# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]

# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:10])

['metamorphosis', 'franz', 'kafka', 'translated', 'david', 'wyllie', 'one', 'morning', 'gregor', 'samsa']
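
The cleaning steps above can be bundled into one reusable function. This is a sketch under stated assumptions: `clean_tokens` is a hypothetical helper name, it expects already-tokenized input, and the tiny `demo_stop_words` set stands in for the NLTK stopword list so that the example runs without NLTK.

```python
import string

_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(tokens, stop_words):
    """Lowercase, strip ASCII punctuation, keep alphabetic non-stopwords."""
    lowered = (t.lower() for t in tokens)
    stripped = (t.translate(_PUNCT_TABLE) for t in lowered)
    return [t for t in stripped if t.isalpha() and t not in stop_words]

# Stand-in stopword set (assumption); in the notebook this would be
# set(stopwords.words('english'))
demo_stop_words = {'by', 'i', 'the', 'and'}
demo_tokens = ['Metamorphosis', 'Franz', 'Kafka', 'Translated', 'by',
               'David', 'Wyllie', 'I', 'One', 'morning', ',']

cleaned_tokens = clean_tokens(demo_tokens, demo_stop_words)
print(cleaned_tokens)
# ['metamorphosis', 'franz', 'kafka', 'translated', 'david', 'wyllie', 'one', 'morning']
```

Passing the stopword collection as a set keeps each membership test fast, which matters once the token list covers a whole book.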


### 3.2 STEMMING

Stemming is a process where words are reduced to a root by removing inflection through dropping unnecessary characters, usually a suffix. There are several stemming models, including Porter and Snowball. However, there is a danger of “over-stemming”, where words like “universe” and “university” are reduced to the same root, “univers”.

In [9]:
# load data
filename = './data/metamorphosis_clean.txt'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()


# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:10])

['metamorphosi', 'franz', 'kafka', 'translat', 'by', 'david', 'wylli', 'I', 'one', 'morn']
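
The suffix-dropping idea and the over-stemming risk can be illustrated with a deliberately naive stemmer. This toy is not the Porter algorithm, which applies a much richer set of ordered rules; the suffix list and the `naive_stem` name are assumptions chosen only to make the mechanism visible.

```python
# Toy suffix list (assumption), checked in this order so longer suffixes win
SUFFIXES = ("ities", "ity", "ing", "ed", "es", "e", "s")

def naive_stem(word, min_stem=3):
    """Drop the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["universe", "university", "universities", "translated", "morning"]:
    print(w, "->", naive_stem(w))
# universe -> univers
# university -> univers
# universities -> univers
# translated -> translat
# morning -> morn
```

The first three words collapse to the same root even though "universe" and "university" mean different things, which is exactly the over-stemming problem.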


## **4. OTHER TOOLS**

### 4.1 LEMMATIZATION

Lemmatization is an alternative way of removing inflection. By determining the part of speech and using WordNet’s lexical database of English, it can get better results than stemming.

It is more accurate but slower. Stemming may be more useful for database queries, whereas lemmatization may work much better when trying to determine text sentiment.
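
Why the part of speech matters can be shown with a toy lookup table. This is only a sketch of the idea: the table entries, the tag names and the `toy_lemmatize` helper are all made up for illustration, whereas a real lemmatizer consults a full lexical database such as WordNet.

```python
# Tiny hand-written lemma table keyed by (word, part of speech); illustrative only
LEMMA_TABLE = {
    ('saw', 'verb'): 'see',
    ('saw', 'noun'): 'saw',
    ('better', 'adj'): 'good',
    ('mice', 'noun'): 'mouse',
    ('was', 'verb'): 'be',
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(toy_lemmatize('saw', 'verb'))    # see
print(toy_lemmatize('saw', 'noun'))    # saw
print(toy_lemmatize('Better', 'adj'))  # good
```

The same surface form maps to different lemmas depending on its part of speech, and "better" maps to "good", a relation that suffix stripping cannot recover.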

### 4.2 WORD EMBEDDINGS / TEXT VECTORS

Word embeddings are the modern way of representing words as vectors. The aim is to find a high-dimensional vector for each word such that semantically related words are ‘close together’ in that high-dimensional space. Word2Vec and GloVe are among the most common models for converting text to vectors. Often, t-SNE (or PCA) is used to reduce the dimensionality enough to display the vectors as a two- or three-dimensional plot. Check out this example of t-SNE applied to word embeddings.
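
What "close together" means can be sketched with cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are invented for illustration; real Word2Vec or GloVe vectors are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, not output of any trained model
toy_vectors = {
    'father': [0.9, 0.8, 0.1],
    'mother': [0.8, 0.9, 0.1],
    'apple':  [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors['father'], toy_vectors['mother'])
sim_unrelated = cosine_similarity(toy_vectors['father'], toy_vectors['apple'])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these toy values the related pair scores close to 1 and the unrelated pair scores much lower, which is the geometric property a good embedding model aims for.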

## **5. FURTHER READING**

[Ultimate guide to deal with Text Data (using Python) – for Data Scientists & Engineers](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/)