# Supplemental Data Cleaning: Using Stemming

In the last chapter, we learned the basics of preparing our text for model building. We learned how to **remove punctuation, tokenize, and remove stop words** to give Python a clean list of words. In this chapter, we take our cleaning one step further to give the model better information for classifying the text. We'll introduce two concepts, stemming and lemmatizing.

Let's start with stemming. So what is stemming? Formally, stemming is the process of reducing inflected or derived words to their word stem or root.
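To make the definition concrete, here is a deliberately naive sketch of suffix stripping. It is not the Porter algorithm used later in this chapter; the `naive_stem` function, its suffix list, and the minimum stem length are hypothetical choices made only for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least 3 characters of stem.

    Illustration only: real stemmers apply many ordered, conditional rules.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Several inflected forms collapse onto the same stem
examples = ["grow", "grows", "growing", "jumped"]
naive_stems = [naive_stem(w) for w in examples]
for w, s in zip(examples, naive_stems):
    print(w, "->", s)
```

The point is that the model then sees one token instead of several variants of the same word, which shrinks the vocabulary it has to learn from.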

### Test out Porter stemmer

In [None]:
import nltk

ps = nltk.PorterStemmer()
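A quick way to try the stemmer is to pass it a few related words and compare the results. The word list below is an arbitrary sample chosen for illustration; only `nltk.PorterStemmer` and its `stem` method come from NLTK.

```python
import nltk

ps = nltk.PorterStemmer()

# Arbitrary sample words: two families of related forms
words = ["grows", "growing", "grow", "run", "running", "runner"]

# Map each word to the stem the Porter stemmer assigns it
stems = {word: ps.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The "grow" family collapses to a single stem, while "runner" keeps its own stem instead of reducing to "run". Stemming relies on rules rather than word meaning, so related words are not always merged.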

### Read in raw text

In [None]:
import pandas as pd
import re
import string
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')

# Assuming the file has no header row, header=None keeps the first message as data
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=None)
data.columns = ['label', 'body_text']

data.head()

### Clean up text

In [None]:
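def clean_text(text):
    # Iterating over a string yields characters, so this drops punctuation marks
    text = "".join([char for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Keep only the tokens that are not stop words
    text = [word for word in tokens if word not in stopwords]
    return text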

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

### Stem text
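A minimal sketch of this step, assuming the stemmer is applied to an already tokenized, stop-word-free list like the ones `clean_text` returns. The `stemming` helper, the sample tokens, and the `body_text_stemmed` column name are assumptions made for illustration.

```python
import nltk

ps = nltk.PorterStemmer()


def stemming(tokenized_text):
    """Return a new list with each token replaced by its Porter stem."""
    return [ps.stem(word) for word in tokenized_text]


# Hypothetical tokens standing in for one cleaned SMS message
sample_tokens = ["free", "entry", "winning", "tickets"]
stemmed = stemming(sample_tokens)
print(stemmed)

# With the DataFrame from the earlier cells, the same helper could be applied per row:
# data['body_text_stemmed'] = data['body_text_nostop'].apply(stemming)
```

Keeping the stemmer in its own function, separate from `clean_text`, makes it easy to compare the stemmed and unstemmed columns side by side.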