# NLP Basics: Implementing a pipeline to clean text

### Pre-processing text data

Cleaning the text data highlights the attributes you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of the following steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. Lemmatize/Stem

The first three steps are covered in this chapter because nearly every text-cleaning pipeline implements them. Lemmatizing and stemming are covered in the next chapter; they are helpful but not critical.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

In [None]:
# What does the cleaned version look like?
data_cleaned = pd.read_csv("SMSSpamCollection_cleaned.tsv", sep='\t')
data_cleaned.head()

### Remove punctuation

In [None]:
import string
string.punctuation

In [None]:
def remove_punctuation(text):
    # The list comprehension keeps every character that is not in string.punctuation.
    # The result is a list of characters, so we join it back into a single string.
    text_nopunct = [ch for ch in text if ch not in string.punctuation]
    return "".join(text_nopunct)

# The function can be passed to apply directly; no lambda wrapper is needed
data['body_text_clean'] = data['body_text'].apply(remove_punctuation)
data.head()
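
To check the function outside the DataFrame, here is a minimal sketch that runs it on a single string. The sample sentence is made up for illustration, and `remove_punctuation_translate` is a hypothetical alternative (not part of the original notebook) that uses `str.translate`; it should give the same result and is usually faster on long strings.

```python
import string

def remove_punctuation(text):
    # Same logic as the notebook cell: drop every punctuation character
    return "".join(ch for ch in text if ch not in string.punctuation)

# Hypothetical alternative: a translation table that deletes punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    return text.translate(PUNCT_TABLE)

sample = "Free entry!!! Text WIN to 80082, now..."  # made-up example message
print(remove_punctuation(sample))            # Free entry Text WIN to 80082 now
print(remove_punctuation_translate(sample))  # Free entry Text WIN to 80082 now
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes pass through both versions unchanged.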

### Tokenization

In [None]:
import re

def tokenize(text):
    # Split on runs of one or more non-word characters (\W+).
    # The raw string (r'...') stops Python from treating \W as a string escape.
    tokens = re.split(r'\W+', text)
    return tokens

data['body_text_tokenized'] = data['body_text_clean'].apply(lambda x: tokenize(x.lower()))

data.head()
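
One edge case is worth seeing on a small input: `re.split` returns an empty string when the text starts or ends with a separator. The sketch below uses a made-up sample string, and `tokenize_findall` is a hypothetical variant (not from the original notebook) that collects runs of word characters with `re.findall` and so never produces empty tokens.

```python
import re

def tokenize(text):
    # Same approach as the notebook cell: split on runs of non-word characters
    return re.split(r"\W+", text)

def tokenize_findall(text):
    # Hypothetical variant: collect the word-character runs directly
    return re.findall(r"\w+", text)

sample = "free entry text win now "  # made-up example with a trailing space
print(tokenize(sample))          # ['free', 'entry', 'text', 'win', 'now', '']
print(tokenize_findall(sample))  # ['free', 'entry', 'text', 'win', 'now']
```

Whether the empty token matters depends on the later steps; it can also be filtered out when stopwords are removed.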

### Remove stopwords

In [None]:
import nltk

# Requires a one-time download of the corpus: nltk.download('stopwords')
stopword = nltk.corpus.stopwords.words('english')

In [None]:
def rm_stopwords(tokenized_list):
    # Keep only the tokens that do not appear in the stopword list
    res = [word for word in tokenized_list if word not in stopword]
    return res

data['body_text_nostop'] = data['body_text_tokenized'].apply(rm_stopwords)
data.head()
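
The three steps can also be chained into a single function. The following is a sketch under stated assumptions, not the notebook's own code: `clean_text` is a hypothetical helper, and `STOPWORDS` is a tiny hand-written stand-in for the NLTK list so the example runs without downloading the corpus. Unlike the cells above, it also drops empty tokens.

```python
import re
import string

# Assumption: a small stand-in for nltk.corpus.stopwords.words('english')
STOPWORDS = {"i", "am", "the", "a", "to", "is", "in", "you"}

def clean_text(text, stopwords=STOPWORDS):
    # 1. Remove punctuation
    text_nopunct = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. Lowercase and tokenize on runs of non-word characters
    tokens = re.split(r"\W+", text_nopunct.lower())
    # 3. Remove stopwords (and any empty tokens left by the split)
    return [tok for tok in tokens if tok and tok not in stopwords]

print(clean_text("I am going to the store, are you?"))  # ['going', 'store', 'are']
```

With the DataFrame from this chapter, the same function could be applied in one pass with `data['body_text'].apply(clean_text)`, passing the real NLTK list as `stopwords` if desired.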