# Text pre-processing

All Rights Reserved © <a href="http://www.louisdorard.com" style="color: #6D00FF;">Louis Dorard</a>

<img src="http://s3.louisdorard.com.s3.amazonaws.com/ML_icon.png">

This notebook shows how to use [NLTK](http://www.nltk.org) to pre-process a textual feature in a Machine Learning dataset.

## Algorithm

- Pre-process the text
- Split the data into training and test sets
- Fit CountVectorizer on the training data
- Transform the training data with CountVectorizer
- Select the best features from the output of CountVectorizer
- Transform the training data using the best features
- Create a RandomForestClassifier
- Fit the RandomForestClassifier on the vectorized and selected version of X_train, with y_train as labels
- Predict using X_test
- Compute metrics.accuracy_score(y_test, y_pred)

### Set up pre-processing

In [9]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Define a function that filters out stopwords:

In [10]:
def filter_stopwords(text, stopword_list):
    '''lowercases the tokens in text, then filters out stopwords, non-alphabetic tokens and single characters'''
    words = [w.lower() for w in text]  # normalize the tokens by making them all lowercase
    filtered_words = []  # holds the tokens that pass the filter
    for word in words:
        # keep only words that are not in stopword_list, are alphabetic, and are longer than 1 character
        if word not in stopword_list and word.isalpha() and len(word) > 1:
            filtered_words.append(word)
    return filtered_words
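
As a quick check of the filter's behaviour, here is a minimal self-contained sketch. The token list and the three-word stopword set are made up for illustration and stand in for NLTK's English stopword list; the function body is a compact restatement of the one above.

```python
def filter_stopwords(text, stopword_list):
    """Lowercase the tokens, then drop stopwords, non-alphabetic tokens and single characters."""
    words = [w.lower() for w in text]
    return [w for w in words
            if w not in stopword_list and w.isalpha() and len(w) > 1]

# Hypothetical inputs: a tiny stopword set instead of stopwords.words('english')
toy_stopwords = {"the", "was", "a"}
toy_tokens = ["The", "hotel", "was", "a", "GREAT", "place", "!", "$200", "x"]

filtered = filter_stopwords(toy_tokens, toy_stopwords)
print(filtered)  # ['hotel', 'great', 'place']

# A token with punctuation attached is not purely alphabetic, so it is dropped
print(filter_stopwords(["hotel,"], toy_stopwords))  # []
```

Note the second call: because `isalpha()` rejects any token containing punctuation or digits, a word that is split on spaces only and still carries a comma or full stop is removed entirely rather than cleaned.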

### Define a function to stem words in a list

In [11]:
from nltk.stem import LancasterStemmer
def stem_words(words):
    '''stems the word list using the Lancaster stemmer'''
    #stemming words
    stemmed_words = [] #declare an empty list to hold our stemmed words
    stemmer = LancasterStemmer()
    for word in words:
        stemmed_word=stemmer.stem(word) #stem the word
        stemmed_words.append(stemmed_word) #add it to our stemmed word list
    # stemmed_words.sort() #sort the stemmed_words
    return stemmed_words

### Define another function that concatenates all the words in a list

In [12]:
def concatenate(words):
    s = ""
    for word in words:
        s = s + word + " "
    return s
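
The loop above appends a space after every word, so the returned string ends with a trailing space. The sketch below compares it with `str.join`, which puts spaces only between words. The name `concatenate_join` is made up for this comparison and is not part of the notebook.

```python
def concatenate(words):
    # Same loop as in the notebook: appends a space after every word
    s = ""
    for word in words:
        s = s + word + " "
    return s

def concatenate_join(words):
    # Hypothetical alternative: join inserts a space only between words
    return " ".join(words)

words = ["gorg", "hotel", "outsid"]
print(repr(concatenate(words)))       # 'gorg hotel outsid '
print(repr(concatenate_join(words)))  # 'gorg hotel outsid'
```

Either form works as input to a whitespace- or pattern-based tokenizer, since the only difference is the trailing space.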

## Application to Hotel Reviews

### Import data

In [13]:
import pandas as pd
hotel_data = pd.read_csv('/data/hotel-reviews.csv', index_col=0)
hotel_data.head()
hotel_data.columns

Index(['text', 'label'], dtype='object')

### Pre-process data

In [15]:
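tknzr = nltk.tokenize.simple.SpaceTokenizer()
english_stopwords = set(stopwords.words('english'))  # load once; a set gives fast membership tests

def process_hotel_strings(row):
    '''tokenizes the review on spaces, filters stopwords, stems, and joins the result back into one string'''
    text = row['text']
    filtered_words = filter_stopwords(tknzr.tokenize(text), english_stopwords)
    text_preprocessed = concatenate(stem_words(filtered_words))
    return text_preprocessed

hotel_data['text-preprocessed'] = hotel_data.apply(process_hotel_strings, axis = 1)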

In [16]:
hotel_data['text-preprocessed'].head()

0    gucc sunglass stol bag fil report hotel sec an...
1    gorg hotel outsid reach elev thing start look ...
2    hotel impress upon ent staff howev felt room d...
3    going internet retail last minut hotel left av...
4    check rm next wok bed bug arm report man assig...
Name: text-preprocessed, dtype: object

In [17]:
hotel_data['text'].head()

0    My $200 Gucci sunglasses were stolen out of my...
1    This was a gorgeous hotel from the outside and...
2    The hotel is very impressive upon entering and...
3    Going to the Internet Retailer 2010 at the las...
4    I checked into this hotel, Rm 1760 on 11/13/20...
Name: text, dtype: object

### Split data

In [73]:
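# Note: X is built from the raw 'text' column; use hotel_data['text-preprocessed'] instead to train on the pre-processed text
X = hotel_data['text'].tolist()
y = hotel_data['label'].tolist()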

In [74]:
TEST_SIZE = 0.3 # fraction of the data held out for testing
SEED = 8 # used to initialize the random number generator, for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=SEED)
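
To illustrate what a seeded split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation and will not reproduce the same partition as `train_test_split`; the function name `simple_split` and the toy data are assumptions made for this example.

```python
import random

def simple_split(items, labels, test_size, seed):
    """Shuffle the indices with a seeded generator, then use the first
    test_size fraction of them as the test set and the rest for training."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Hypothetical toy data: 10 "reviews" with alternating labels
toy_X = ["review %d" % i for i in range(10)]
toy_y = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_split(toy_X, toy_y, test_size=0.3, seed=8)
print(len(X_tr), len(X_te))  # 7 3

# The same seed gives the same partition on every run
assert simple_split(toy_X, toy_y, 0.3, 8) == (X_tr, X_te, y_tr, y_te)
```

Fixing the seed is what makes the later accuracy figures repeatable: with a different seed, different reviews end up in the test set.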

### Vectorise the training data

In [75]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(strip_accents='ascii', min_df=1)
vectorizer.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents='ascii', token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [95]:
len(vectorizer.get_feature_names())

8097

In [113]:
transformed_X_train = vectorizer.transform(X_train)
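
The fit and transform steps can be pictured with a small pure-Python sketch of a bag-of-words count. It reuses the `token_pattern` and lowercasing shown in the CountVectorizer output above, but it is only an approximation: accent stripping is left out, the result is a list of lists rather than a sparse matrix, and the helper names and toy sentences are made up for illustration.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer repr: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """One row of counts per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows

train_docs = ["The room was clean", "Dirty room, dirty bed"]
vocab = fit_vocabulary(train_docs)
print(vocab)   # {'bed': 0, 'clean': 1, 'dirty': 2, 'room': 3, 'the': 4, 'was': 5}
print(transform(train_docs, vocab))        # [[0, 1, 0, 1, 1, 1], [1, 0, 2, 1, 0, 0]]
print(transform(["A clean pool"], vocab))  # [[0, 1, 0, 0, 0, 0]]
```

The last line shows why the vocabulary is fitted on the training data only: a word that first appears at test time ("pool") has no column and is simply ignored.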

### Select the top k features and output accuracy

In [131]:
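def predict(k):
    '''trains a random forest on the k best features (chi-squared) and returns its accuracy on the test set'''
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.ensemble import RandomForestClassifier
    import sklearn.metrics as metrics

    # keep the k features with the highest chi-squared score against the labels
    selector = SelectKBest(chi2, k=k)
    selector.fit(transformed_X_train, y_train)
    selected_X_train = selector.transform(transformed_X_train)

    classifier = RandomForestClassifier(random_state=SEED)
    model = classifier.fit(selected_X_train, y_train)
    # apply the vectorizer and selector, both fitted on the training data only, to the test set
    y_pred = model.predict(selector.transform(vectorizer.transform(X_test)))
    return metrics.accuracy_score(y_test, y_pred)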

max_accuracy = -1
max_k = -1

# try k = 50, 100, 150, ... and keep the value with the highest test accuracy
for k in range(50, len(vectorizer.get_feature_names()), 50):
    accuracy = predict(k)
    if accuracy > max_accuracy:
        max_k = k
        max_accuracy = accuracy

print("Max")
print(max_k)
print(max_accuracy)

Max
600
0.7833333333333333
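
To give an intuition for what the chi-squared feature selection ranks, here is a simplified pure-Python sketch of the statistic on a toy count matrix. It is an illustration under stated assumptions, not scikit-learn's implementation: the column names, labels and helper names (`chi2_scores`, `top_k`) are all made up, and the expected count of a word in a class is taken to be the class frequency times the word's total count.

```python
def chi2_scores(X, y):
    """Chi-squared statistic for each column of a count matrix X (list of rows)
    against the class labels y."""
    n_samples, n_features = len(X), len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        rows_c = [row for row, label in zip(X, y) if label == c]
        class_prob = len(rows_c) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows_c)
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring columns, returned in column order."""
    best = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(best)

# Hypothetical counts for the columns ["dirty", "room", "clean"]
toy_X = [[2, 1, 0],
         [1, 1, 0],
         [0, 1, 1],
         [0, 1, 2]]
toy_y = ["bad", "bad", "good", "good"]

scores = chi2_scores(toy_X, toy_y)
print(scores)            # [3.0, 0.0, 3.0]
print(top_k(scores, 2))  # [0, 2]
```

"room" appears equally often in both classes, so it scores 0 and is the first column to be dropped, while "dirty" and "clean" each occur in only one class and score highest.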
