# Text pre-processing

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">

This notebook shows how to use [NLTK](http://www.nltk.org) to pre-process a textual feature in a Machine Learning dataset.

## Algorithm

- Pre-process the text
- Split the data into training and test sets
- Fit a CountVectorizer on the training set
- Transform the training set with the CountVectorizer
- Select the best features from the output of the CountVectorizer
- Transform the training set using the best features
- Create a RandomForestClassifier
- Fit the RandomForestClassifier on x_train and y_train
- Predict using x_test
- Compute metrics.accuracy_score(y_test, y_pred)
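
The steps above can be sketched end to end on a toy corpus. This is only a minimal sketch under stated assumptions, not the run shown in this notebook: the `docs` and `labels` values are made up for illustration, the documents are assumed to be pre-processed already, and `k=5` and `test_size=0.25` are arbitrary choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical, already pre-processed documents and labels (illustration only)
docs = [
    "great room friendly staff", "lovely view great breakfast",
    "friendly staff clean room", "great location lovely hotel",
    "clean room great view", "lovely staff great stay",
    "dirty room rude staff", "terrible noise dirty bathroom",
    "rude staff terrible stay", "dirty bed terrible smell",
    "noisy room rude manager", "terrible breakfast dirty floor",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split into training and test before fitting anything
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, random_state=8, stratify=labels
)

# Fit the CountVectorizer on the training set only, then transform both sets
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

# Select the best features using the training set only
selector = SelectKBest(chi2, k=5)
X_train_best = selector.fit_transform(X_train, y_train)
X_test_best = selector.transform(X_test)

# Train the classifier, predict on the test set, and score the predictions
classifier = RandomForestClassifier(random_state=8)
classifier.fit(X_train_best, y_train)
y_pred = classifier.predict(X_test_best)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

Fitting the vectorizer and the selector on the training set only keeps information from the test set out of the model.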

In [28]:
import nltk

## Tokenize
We turn our string into a list of words.

In [29]:
text = "Jersey sales is a curious business Whether you re buying the stylish top to represent your favorite team player or color you re always missing out on better artwork"
tknzr = nltk.tokenize.simple.SpaceTokenizer() # splits on the space character
a = tknzr.tokenize(text)
a

['Jersey',
 'sales',
 'is',
 'a',
 'curious',
 'business',
 'Whether',
 'you',
 're',
 'buying',
 'the',
 'stylish',
 'top',
 'to',
 'represent',
 'your',
 'favorite',
 'team',
 'player',
 'or',
 'color',
 'you',
 're',
 'always',
 'missing',
 'out',
 'on',
 'better',
 'artwork']
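
NLTK documents `SpaceTokenizer` as equivalent to `s.split(' ')`, so it splits on every single space. The sketch below uses plain Python strings (no NLTK) and a made-up input with a double space to show what that implies.

```python
# Hypothetical input with a double space between the first two words
text = "Jersey  sales is a curious business"

# Splitting on a single space keeps an empty token for the extra space
tokens_single_space = text.split(" ")
# Splitting on any run of whitespace does not
tokens_whitespace = text.split()

print(tokens_single_space)  # ['Jersey', '', 'sales', 'is', 'a', 'curious', 'business']
print(tokens_whitespace)    # ['Jersey', 'sales', 'is', 'a', 'curious', 'business']
```

Empty tokens like this are harmless in this notebook, because the filtering step later keeps only alphabetic tokens and `''.isalpha()` is `False`.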

## Filter stopwords
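First we need to download the corpus of stopwords. Calling `nltk.download('stopwords')` fetches it directly. If you call `nltk.download()` with no argument in a text-only session instead, an interactive prompt appears: press 'd', type 'stopwords', then press 'q'.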

In [30]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Now let's see what these words are:

In [31]:
from nltk.corpus import stopwords
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

We define a function that filters out stopwords:

In [32]:
def filter_stopwords(text, stopword_list):
    '''normalizes the words by turning them all lowercase and then filters out the stopwords'''
    stopword_set = set(stopword_list) #a set makes each membership test fast
    words = [w.lower() for w in text] #normalize the words in the text, making them all lowercase
    #filtering stopwords
    filtered_words = [] #declare an empty list to hold our filtered words
    for word in words: #iterate over all words from the text
        #only keep words that are not in the stopword list, are alphabetic, and are more than 1 character
        if word not in stopword_set and word.isalpha() and len(word) > 1:
            filtered_words.append(word) #add word to the filtered_words list if it meets the above conditions
    # filtered_words.sort() #sort filtered_words list
    return filtered_words

In [37]:
words = filter_stopwords(a, stopwords.words('english'))
print(words)

['jersey', 'sales', 'curious', 'business', 'whether', 'buying', 'stylish', 'top', 'represent', 'favorite', 'team', 'player', 'color', 'always', 'missing', 'better', 'artwork']
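
To see what each of the three conditions in the filter removes, here is a small standalone sketch. The `stopword_set` below is a made-up stand-in for `stopwords.words('english')`, and the tokens are hypothetical.

```python
# Hypothetical mini stopword list, standing in for stopwords.words('english')
stopword_set = {"my", "were", "of", "out"}

# Hypothetical tokens, including a number-like token and a one-letter token
tokens = ["My", "$200", "Gucci", "sunglasses", "were", "stolen",
          "out", "of", "my", "bag", "x"]

kept = [
    w.lower() for w in tokens
    if w.lower() not in stopword_set   # drops 'my', 'were', 'out', 'of'
    and w.isalpha()                    # drops '$200'
    and len(w) > 1                     # drops 'x'
]
print(kept)  # ['gucci', 'sunglasses', 'stolen', 'bag']
```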


## Stem

A simple example to illustrate stemming:

In [38]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem("science"))

sci
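
The Lancaster stemmer is comparatively aggressive, which is why "science" is cut down to "sci". As a rough way to judge whether that suits your data, the sketch below prints its output next to that of NLTK's `PorterStemmer`. The word list is an arbitrary choice for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

lancaster = LancasterStemmer()
porter = PorterStemmer()

# Print each word with its Lancaster stem and its Porter stem side by side
for word in ["science", "business", "favorite", "player"]:
    print(word, lancaster.stem(word), porter.stem(word))
```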


Now we define a function to stem words in a list:

In [39]:
def stem_words(words):
    '''stems each word in the list using the Lancaster stemmer'''
    stemmer = LancasterStemmer()
    #stem every word, keeping the original order
    return [stemmer.stem(word) for word in words]

We define another function that concatenates all the words in a list and apply it to our list of stemmed words:

In [40]:
def concatenate(words):
    s = ""
    for word in words:
        s = s + word + " "
    return s

concatenate(stem_words(words))

'jersey sal cury busy wheth buy styl top repres favorit team play col alway miss bet artwork '
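
Note the trailing space in the output. If that matters for your use, `" ".join(words)` is one possible alternative. The sketch below compares the two on a short made-up list.

```python
words = ["jersey", "sal", "cury"]  # hypothetical short list of stems

def concatenate(words):
    # Same logic as the function above: append each word followed by a space
    s = ""
    for word in words:
        s = s + word + " "
    return s

# str.join puts the separator only between the words
joined = " ".join(words)

print(repr(concatenate(words)))  # 'jersey sal cury '
print(repr(joined))              # 'jersey sal cury'
```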

## Application to StumbleUpon data

In [9]:
import pandas as pd
data = pd.read_csv('/data/kaggle-stumbleupon-evergreen.csv', index_col=1)

In [10]:
data.head()

| url | urlid | body | alchemy_category | alchemy_category_score | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | compression_ratio | ... | is_news | lengthyLinkDomain | linkwordscore | news_front_page | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| http://techflesh.com/eye-controlled-laptop/ | 11 | A lot of companies including heavyweights like... | ? | ? | 2.218182 | 0.345455 | 0.072727 | 0.054545 | 0.054545 | 0.471204 | ... | 1 | 1 | 20 | 0 | 2240 | 55 | 3 | 0.272727 | 0.11991 | 0 |
| http://www.johnnywander.com/comics/163 | 13 | A lot of people have been requesting the frost... | arts_entertainment | 0.390873 | 1.960784 | 0.365385 | 0.25 | 0.057692 | 0.0 | 0.53172 | ... | ? | 0 | 18 | 0 | 1995 | 52 | 1 | 0.096154 | 0.107981 | 0 |
| http://deliciouslyorganic.net/chicken-and-black-bean-quesadillas/ | 45 | Our house is aflutter this week Pete pinned on... | ? | ? | 1.75 | 0.493151 | 0.171233 | 0.150685 | 0.130137 | 0.474257 | ... | 1 | 0 | 15 | 0 | 5370 | 146 | 6 | 0.164384 | 0.087558 | 1 |
| http://apac2020.the-diplomat.com/ | 47 | A decade into The Asian Century The Diplomat l... | ? | ? | 1.540541 | 0.148936 | 0.042553 | 0.0 | 0.0 | 0.506566 | ... | ? | 0 | 15 | ? | 1638 | 47 | 0 | 0.0 | 0.098726 | 0 |
| http://www.news.com.au/business/markets/apples-tim-cook-says-we-have-too-much-money/story-e6frfm30-1226280357290 | 58 | CEO Tim Cook s next challenge is to figure out... | ? | ? | 2.980583 | 0.65 | 0.309091 | 0.154545 | 0.095455 | 0.440956 | ... | 1 | 1 | 33 | 0 | 6122 | 220 | 11 | 0.1 | 0.087379 | 0 |


Let's apply our pre-processing to the 'body' value of each row:

In [16]:
def process_strings(row):
    text = row['body']
    filtered_words = filter_stopwords(tknzr.tokenize(text), stopwords.words('english'))
    text_preprocessed = concatenate(stem_words(filtered_words))
    return text_preprocessed
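
Since the function only reads one column, a possible variation is to call `apply` on that column rather than on whole rows, and to build the stopword collection once instead of once per row. The sketch below illustrates the idea with a simplified, hypothetical `preprocess` function (no stemming), a made-up `STOPWORDS` set and a two-row toy frame.

```python
import pandas as pd

# Hypothetical stand-in for the English stopword list, built once up front
STOPWORDS = frozenset({"a", "of", "the"})

def preprocess(text):
    """Lowercase, split on spaces, and drop stopwords and non-alphabetic or 1-character tokens."""
    words = [w.lower() for w in text.split(" ")]
    return " ".join(
        w for w in words if w.isalpha() and len(w) > 1 and w not in STOPWORDS
    )

# Toy frame with a single 'body' column (illustration only)
frame = pd.DataFrame({"body": ["A lot of companies", "The frosting recipe"]})

# Series.apply passes each string directly to the function
frame["body-preprocessed"] = frame["body"].apply(preprocess)
print(frame["body-preprocessed"].tolist())  # ['lot companies', 'frosting recipe']
```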

In [12]:
%time data['body-preprocessed'] = data.apply(process_strings, axis = 1)

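NameError: ("name 'filter_stopwords' is not defined", 'occurred at index http://techflesh.com/eye-controlled-laptop/')

This error appears when the cell is run before `filter_stopwords` has been defined. Run the cell that defines it in the Filter stopwords section first, then re-run this one.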

In [16]:
data[['body-preprocessed']]

| url | body-preprocessed |
| --- | --- |
| http://techflesh.com/eye-controlled-laptop/ | lot company includ heavyweight lik microsoft b... |
| http://www.johnnywander.com/comics/163 | lot peopl request frost recip us shar bas mak ... |
| http://deliciouslyorganic.net/chicken-and-black-bean-quesadillas/ | hous aflut week pet pin new rank fin on job mo... |
| http://apac2020.the-diplomat.com/ | decad as century diplom look back past year dr... |
| http://www.news.com.au/business/markets/apples-tim-cook-says-we-have-too-much-money/story-e6frfm30-1226280357290 | ceo tim cook next challeng fig wheth appl brea... |
| http://www.gamenet.com/game/Football-Games-Smart-Soccer/ | smart socc goalkeep footbal footbal gam smart ... |
| http://www.rightathome.com/Food/Recipes/Pages/fun-and-flavorful-lasagna-cupcakes.aspx | surpr famy cupcak din try fun twist tradit las... |
| http://www.webmd.com/balance/features/your-guide-to-never-feeling-tired-again?page=1 | webmd feat redbook magazin nant way tackl lif ... |
| http://h-i-g-h.org/comfortable-personal-transportation-future/ | comfort person transport fut comfort person tr... |
| http://www.menshealth.com/mhlists/best_healthy_foods/Cinnamon.php | old world spic us reach men stomach mix sug st... |


## Application to hotel reviews data

In [59]:
import pandas as pd
hotel_data = pd.read_csv('/data/hotel-reviews.csv', index_col=0)
hotel_data.head()
hotel_data.columns

Index(['text', 'label'], dtype='object')

In [60]:
def process_hotel_strings(row):
    text = row['text']
    filtered_words = filter_stopwords(tknzr.tokenize(text), stopwords.words('english'))
    text_preprocessed = concatenate(stem_words(filtered_words))
    return text_preprocessed

hotel_data['text-preprocessed'] = hotel_data.apply(process_hotel_strings, axis = 1)

In [61]:
hotel_data['text-preprocessed'].head()

0    gucc sunglass stol bag fil report hotel sec an...
1    gorg hotel outsid reach elev thing start look ...
2    hotel impress upon ent staff howev felt room d...
3    going internet retail last minut hotel left av...
4    check rm next wok bed bug arm report man assig...
Name: text-preprocessed, dtype: object

In [67]:
hotel_data['text'].head()

0    My $200 Gucci sunglasses were stolen out of my...
1    This was a gorgeous hotel from the outside and...
2    The hotel is very impressive upon entering and...
3    Going to the Internet Retailer 2010 at the las...
4    I checked into this hotel, Rm 1760 on 11/13/20...
Name: text, dtype: object

In [68]:
corpus = hotel_data['text']

from sklearn.feature_extraction.text import CountVectorizer
raw_vectorizer = CountVectorizer(strip_accents='ascii', min_df=1)
raw_vectorizer.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents='ascii', token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
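
The repr above shows `lowercase=True` and `token_pattern='(?u)\\b\\w\\w+\\b'`, meaning tokens are runs of two or more word characters, digits included. The sketch below mimics just that tokenize-and-count step with the standard library. It is a rough illustration and not scikit-learn's implementation: it ignores `strip_accents` and vocabulary building, and the review text is made up.

```python
import re
from collections import Counter

# Token pattern copied from the CountVectorizer defaults
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def count_tokens(document):
    """Lowercase the text, extract tokens of 2+ word characters, and count them."""
    return Counter(re.findall(TOKEN_PATTERN, document.lower()))

# Hypothetical review text
counts = count_tokens("My $200 Gucci sunglasses were stolen. I want my sunglasses back!")
print(counts)
```

Here "200" is kept as a token while the one-letter word "I" is dropped, which is consistent with the many numeric entries in the feature names.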

In [73]:
raw_vectorizer.get_feature_names() # removed in scikit-learn 1.2; use get_feature_names_out() on recent versions

['00',
 '000',
 '00a',
 '00am',
 '00pm',
 '03',
 '04',
 '05',
 '06',
 '07',
 '08',
 '0800',
 '09',
 '10',
 '100',
 '103',
 '104',
 '105',
 '105mph',
 '107',
 '10am',
 '10pm',
 '10th',
 '10x',
 '10yo',
 '11',
 '110',
 '1112',
 '116',
 '11am',
 '11th',
 '12',
 '120',
 '122',
 '1230am',
 '125',
 '127',
 '129',
 '12am',
 '12pm',
 '12th',
 '13',
 '130',
 '130lb',
 '1334',
 '135',
 '139',
 '13th',
 '14',
 '140',
 '1400',
 '149',
 '14th',
 '15',
 '150',
 '1500',
 '1508',
 '1519',
 '1546',
 '155',
 '159',
 '15mins',
 '15th',
 '16',
 '160',
 '1605',
 '160th',
 '1618',
 '165',
 '16th',
 '16thz',
 '17',
 '170',
 '173',
 '175',
 '1760',
 '179',
 '17th',
 '18',
 '180',
 '1802',
 '18th',
 '19',
 '1900',
 '1901',
 '191',
 '1920s',
 '1923',
 '1927',
 '1960',
 '1968',
 '1970s',
 '1992',
 '19th',
 '1am',
 '1or',
 '1pm',
 '1st',
 '20',
 '200',
 '2000',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '200ish',
 '2010',
 '2011',
 '2012',
 '203',
 '20th',
 '20x',
 '21',
 '212',
 '214',
 '215',
 '217'

In [74]:
transformed_corpus = raw_vectorizer.transform(corpus)

In [75]:
transformed_corpus.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
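
`transform` returns a SciPy sparse matrix, and `toarray()` materializes every cell, zeros included. A back-of-the-envelope estimate of the dense size, using the shape `(1600, 9571)` reported at the end of this notebook and the `numpy.int64` dtype from the vectorizer settings:

```python
# Shape of transformed_corpus as reported by transformed_corpus.shape
n_rows, n_cols = 1600, 9571
bytes_per_cell = 8  # numpy.int64 uses 8 bytes per value

# Every cell is stored in a dense array, whether it is zero or not
dense_bytes = n_rows * n_cols * bytes_per_cell
print(dense_bytes / 1e6)  # 122.5088 (megabytes)
```

This is fine for a quick look at this corpus, but for larger ones it is usually safer to keep the matrix sparse. Its `nnz` attribute gives the number of stored non-zero entries.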

In [78]:
y = hotel_data['label'].tolist()

In [91]:
TEST_SIZE = 0.3 # ratio of data to have in test
SEED = 8 # to be used to initialize random number generator, for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(transformed_corpus, y, test_size=TEST_SIZE, random_state=SEED)

In [92]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(random_state=SEED)
model = classifier.fit(X_train, y_train)

In [93]:
y_pred = model.predict(X_test)

In [94]:
import sklearn.metrics as metrics
metrics.accuracy_score(y_test, y_pred)

0.7145833333333333
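
Accuracy is the fraction of test examples whose predicted label matches the true label. The sketch below computes it by hand on made-up labels, then checks the score above against the test set size implied by 1600 rows and `TEST_SIZE = 0.3`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 4 of the 5 predictions are correct
y_true = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_hat))  # 0.8

# 30% of 1600 reviews is 480 test reviews, so the reported score
# is consistent with 343 correct predictions out of 480
print(343 / 480)
```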

In [95]:
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(chi2, k=20)
selector.fit(transformed_corpus, y)

SelectKBest(k=20, score_func=<function chi2 at 0x7f2102195488>)

In [99]:
selector.transform(transformed_corpus).shape

(1600, 20)

In [97]:
transformed_corpus.shape

(1600, 9571)
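
To see which terms a fitted selector kept, one option is to map the selected column indices back to the vectorizer's vocabulary. The sketch below does this on a made-up six-document corpus. The `docs` and `labels` values and `k=3` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy corpus and labels (illustration only)
docs = [
    "great room friendly staff", "lovely view great breakfast",
    "clean room great staff", "dirty room rude staff",
    "terrible noise dirty bathroom", "rude staff terrible stay",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
selector = SelectKBest(chi2, k=3).fit(X, labels)

# vocabulary_ maps term -> column index; invert it to look up terms by column
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
selected_terms = [index_to_term[i] for i in selector.get_support(indices=True)]
print(selected_terms)
```

When the selected features feed a classifier that is evaluated on a held-out test set, fitting the selector on the training rows only keeps the test labels from influencing which features are chosen.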