## Natural Language Processing for Text Classification with NLTK and Scikit-learn

In this project, we will expand on this foundation and explore different ways to improve our text classification results. We will cover and use:

* Regular Expressions
* Feature Engineering
* Multiple scikit-learn Classifiers
* Ensemble Methods


In [144]:
import sys
import nltk
import sklearn
import pandas as pd
import numpy as np


print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pd.__version__))
print('Numpy: {}'.format(np.__version__))

Python: 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
NLTK: 3.3
Scikit-learn: 0.19.1
Pandas: 0.24.2
Numpy: 1.19.2


### 1. Load the Dataset

Now that we have ensured that our libraries are installed correctly, let's load the data set as a Pandas DataFrame. Furthermore, let's extract some useful information such as the column information and class distributions. 

The data set we will be using comes from the UCI Machine Learning Repository. It contains more than 5,000 labeled SMS messages that were collected for mobile phone spam research. It can be downloaded from the following URL:

https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

In [145]:
# load the dataset of SMS messages
df = pd.read_csv(r"C:\Users\limz\Desktop\Spam-Filter-using-Ensemble-Learning-master\dataswahili.csv", header=None,  encoding= "ISO-8859-1")
df.head()

Unnamed: 0,0
0,
1,@data
2,"spam,""You have won yourself ksh 100, 000 for b..."
3,customer. Click http://equitybanknig-plc.com/ ...
4,"ham,Please call me . Thank you."


In [146]:
# Remove unnecessary columns and rows
df = df.iloc[:, :2]
df = df.iloc[1:, :]
df.head()

Unnamed: 0,0
1,@data
2,"spam,""You have won yourself ksh 100, 000 for b..."
3,customer. Click http://equitybanknig-plc.com/ ...
4,"ham,Please call me . Thank you."
5,"spam,""SAFARICOM NGURUMA IMBAMBE PROMOTIONS: De..."


In [147]:
# check class distribution 
classes = df[0]
print(classes.value_counts())

ham,leta bread na blue band for the party                                        3
ham,I'm leaving my house now.                                                    2
spam,"WINNER!! As a valued network customer you have been selected to            2
ham,WHO ARE YOU SEEING?                                                          2
rates app                                                                        2
ham,Ok i msg u b4 i leave my house.                                              2
spam,Please call our customer service representative on 0800 169 6031 between    2
spam,Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm     2
ham,26th OF AUGUST  is my engagement                                             1
ham,Where to get those good clothes                                              1
agent no. as 286286 to claim"                                                    1
ham,ALRITE SAM ITS NIC JUST CHECKIN THAT THIS IS UR NUMBER-SO IS IT?T.B*         1
ham,
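The counts above show that each row still holds the label and the message fused into one string (for example `ham,Please call me . Thank you.`), and that some rows are fragments of a longer message. Below is a minimal sketch of one way to separate them. It assumes every record starts with `ham,` or `spam,` and that any other non-empty line continues the previous message. `parse_records` is a hypothetical helper and the sample lines are illustrative; neither is part of the original notebook.

```python
def parse_records(lines):
    """Split raw 'label,message' lines into (label, message) pairs.

    Assumption: a new record starts with 'ham,' or 'spam,'; any other
    non-empty line is treated as a continuation of the previous message.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('@'):
            continue  # skip blank lines and header lines such as '@data'
        label, sep, rest = line.partition(',')
        if sep and label in ('ham', 'spam'):
            records.append([label, rest.strip().strip('"')])
        elif records:
            # no recognised label: append to the previous message
            records[-1][1] += ' ' + line.strip('"')
    return [(label, text) for label, text in records]


# Illustrative lines only, shaped like the rows printed above.
sample = [
    '@data',
    'ham,Please call me . Thank you.',
    'spam,"You have won a prize, dear',
    'customer. Call now"',
]
for label, text in parse_records(sample):
    print(label, '|', text)
```

With labels and text in separate fields, the label column can be encoded on its own and the text column cleaned on its own.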

### 2. Preprocess the Data

Preprocessing the data is an essential step in natural language processing. In the following cells, we will convert our class labels to integer values using the LabelEncoder from sklearn; replace email addresses, URLs, phone numbers, and other symbols using regular expressions; remove stop words; and extract word stems.

In [148]:
# import LabelEncoder from scikit-learn
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df[0] = df[0].apply(str)  # make sure every entry in the column is a string

Y = encoder.fit_transform(classes)
Y[:10]


array([  45, 1177,  216,  710, 1150, 1054, 1209,  194, 1217,  980],
      dtype=int64)
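The encoded values above run past a thousand because the encoder was fitted on whole lines, so every distinct line became its own class. The sketch below mimics in plain Python how a label encoder maps sorted unique labels to integers. It assumes the labels have already been separated from the message text. `encode_labels` is a hypothetical stand-in for illustration, not the scikit-learn API.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[label] for label in labels]


# Illustrative labels only; real ones would come from the parsed dataset.
classes, encoded = encode_labels(['ham', 'spam', 'ham', 'ham'])
print(classes)   # ['ham', 'spam']
print(encoded)   # [0, 1, 0, 0]
```

With only two distinct labels, the encoded values are 0 and 1, which is the binary target a spam classifier expects.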

In [149]:
# store the sms data
text_msg = df[0]
print(text_msg[:10])    

1                                                 @data
2     spam,"You have won yourself ksh 100, 000 for b...
3     customer. Click http://equitybanknig-plc.com/ ...
4                       ham,Please call me . Thank you.
5     spam,"SAFARICOM NGURUMA IMBAMBE PROMOTIONS: De...
6     number has WON Ksh100,000/= Call us NOW.. 0788...
7     spam,Dear jumia shopper your purchase last mon...
8                 card go to www.jumiaa.co.ke  to claim
9     spam,Hello ! I just got 500 credit and 10GB in...
10                       here:http://winprize.net/kenya
Name: 0, dtype: object


#### 2.1 Regular Expressions

Some common regular expression metacharacters (descriptions copied from Wikipedia):

**^**     Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

**.**     Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".

**[ ]**    A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].
The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].

**[^ ]**   Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.

**$**      Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

**( )**    Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires \( \).

**\n**     Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is vaguely defined in the POSIX.2 standard. Some tools allow referencing more than nine capturing groups.

**\***     Matches the preceding element zero or more times. For example, ab*c matches "ac", "abc", "abbbc", etc. [xyz]* matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. (ab)* matches "", "ab", "abab", "ababab", and so on.

**{m,n}**  Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regexes. BRE mode requires \{m,n\}.
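The metacharacters above can be tried out with Python's built-in `re` module. This is a small sketch using made-up strings. Note that `re` follows Python's own regex flavor, so the POSIX-specific remarks (BRE mode, no backslash escapes inside brackets) do not apply to it.

```python
import re

# '^' and '$' anchor the match to the start and end of the string.
print(bool(re.search(r'^spam', 'spam offer')))      # True
print(bool(re.search(r'offer$', 'spam offer')))     # True

# '.' matches any single character; a bracket expression matches one of a set.
print(re.findall(r'a.c', 'abc a-c ac'))             # ['abc', 'a-c']
print(re.findall(r'[abc]', 'cab!'))                 # ['c', 'a', 'b']
print(re.findall(r'[^a-z]', 'ab1 c'))               # ['1', ' ']

# '*' repeats zero or more times; '{m,n}' bounds the repetition count.
print(re.findall(r'ab*c', 'ac abc abbbc'))          # ['ac', 'abc', 'abbbc']
print(re.fullmatch(r'a{3,5}', 'aaaa') is not None)  # True
print(re.fullmatch(r'a{3,5}', 'aa') is not None)    # False

# '( )' captures a group and '\1' refers back to what it matched.
print(re.findall(r'(\w)\1', 'hello book'))          # ['l', 'o']
```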

In [150]:
# use regular expressions to replace email addresses, URLs, phone numbers, and other numbers
# (regex=True is passed explicitly because newer pandas versions default to literal matching)

# Replace email addresses with 'emailaddress'
processed = text_msg.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'emailaddress', regex=True)

# Replace URLs with 'webaddress'
processed = processed.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
processed = processed.str.replace(r'£|\$', 'moneysymb', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
processed = processed.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phonenumbr', regex=True)

# Replace numbers with 'numbr'
processed = processed.str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)
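The email, URL, and phone patterns in the cell above are anchored with `^` and `$`, so they only fire when the entire message is an address or number. The sketch below contrasts the anchored URL pattern with an unanchored one using the standard `re` module. The message, the `example.com` address, and the unanchored pattern are illustrative assumptions, not part of the original notebook.

```python
import re

message = 'Click http://example.com/win or call 0712 345 678 now'

# Anchored: the whole string must be a URL, so nothing is replaced here.
anchored = re.sub(r'^http://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(/\S*)?$',
                  'webaddress', message)
print(anchored == message)   # True

# Unanchored (illustrative pattern): URLs are replaced wherever they occur.
unanchored = re.sub(r'https?://\S+', 'webaddress', message)
print(unanchored)
# Click webaddress or call 0712 345 678 now
```

This is consistent with the processed output, where URLs inside longer messages survive as tokens such as `http` and `com` instead of becoming `webaddress`.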



In [151]:
# Remove punctuation
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)


In [152]:
# change words to lower case - Hello, HELLO, hello are all the same word
processed = processed.str.lower()
print(processed)

1                                                    data
2       spam you have won yourself ksh numbr numbr for...
3       customer click http equitybanknig plc com for ...
4                            ham please call me thank you
5       spam safaricom nguruma imbambe promotions dear...
6       number has won kshnumbr numbr call us now numb...
7       spam dear jumia shopper your purchase last mon...
8                    card go to www jumiaa co ke to claim
9       spam hello i just got numbr credit and numbrgb...
10                           here http winprize net kenya
11      spam dear equitel customer your equitel number...
12                                   to numbr for upgrade
13      spam we ve been hired numbr shorten ur life nu...
14                                                 pardon
15      spam hi earn real money online by working as a...
16      up to numbrkshs weekly call our customer care now
17      spam have you seen the doctor about your healness
18      spam n

In [153]:
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

# remove stop words from text messages

stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x : ' '.join(term for term in x.split() if term not in stop_words))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\limz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\limz\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [154]:
# Reduce each word to its stem using a Porter stemmer

PS = nltk.PorterStemmer()

processed = processed.apply(lambda x: ' '.join(PS.stem(term) for term in x.split()))
print(processed[29])

everi numbr min


### 3. Generating Features

Feature engineering is the process of using domain knowledge of the data to create features for machine learning algorithms. In this project, the words in each text message will be our features. For this purpose, we tokenize each message into words and count how often each word occurs. The code below then keeps the first 1,000 distinct words, in order of first appearance, as features.

In [155]:
from nltk.tokenize import word_tokenize

# create bags of words
all_words = []
for message in processed:
        words = word_tokenize(message)
        for w in words:
            all_words.append(w)

all_words = nltk.FreqDist(all_words)


In [156]:
# print the total number of words and the 15 most common words

print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

Number of words: 2299
Most common words: [('ham', 737), ('numbr', 620), ('u', 346), ('nan', 161), ('spam', 160), ('call', 113), ('go', 91), ('ur', 74), ('ok', 68), ('lor', 53), ('come', 52), ('got', 49), ('wat', 47), ('n', 45), ('free', 43)]


In [157]:
# use the first 1,000 distinct words (in order of first appearance) as features

word_features = list(all_words.keys())[:1000]
print(word_features)

['data', 'spam', 'ksh', 'numbr', 'loyal', 'equiti', 'bank', 'custom', 'click', 'http', 'equitybanknig', 'plc', 'com', 'detail', 'ham', 'pleas', 'call', 'thank', 'safaricom', 'nguruma', 'imbamb', 'promot', 'dear', 'subscrib', 'phone', 'number', 'kshnumbr', 'us', 'send', 'money', 'jumia', 'shopper', 'purchas', 'last', 'month', 'via', 'card', 'go', 'www', 'jumiaa', 'co', 'ke', 'claim', 'hello', 'got', 'credit', 'numbrgb', 'internet', 'free', 'winpriz', 'net', 'kenya', 'equitel', 'expir', 'sm', 'upgrad', 'hire', 'shorten', 'ur', 'life', 'mpesa', 'pardon', 'hi', 'earn', 'real', 'onlin', 'work', 'part', 'time', 'job', 'easili', 'numbrksh', 'weekli', 'care', 'seen', 'doctor', 'heal', 'ibamb', 'na', 'nambari', 'yako', 'imezawadia', 'kesho', 'kwa', 'maelezo', 'zaidi', 'piga', 'numbrr', 'major', 'omong', 'forward', 'name', 'nephew', 'ad', 'id', 'armi', 'recruit', 'chanc', 'remain', 'love', 'babi', 'gal', 'tri', 'back', 'worn', 'plu', 'techno', 'use', 'servic', 'menu', 'atm', 'withdraw', 'enter',
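Slicing `all_words.keys()` returns words in the order they were first seen, not by frequency, as the list above shows. In current NLTK versions `FreqDist` is a subclass of `collections.Counter`, so `most_common(n)` is available for picking the highest-count words instead. The sketch below uses a plain `Counter` and made-up tokens to show the difference.

```python
from collections import Counter

# Illustrative tokens only.
tokens = ['ok', 'now', 'call', 'free', 'call', 'free', 'call']
counts = Counter(tokens)

# Keys come back in first-seen order, not by frequency.
print(list(counts.keys())[:2])      # ['ok', 'now']

# most_common(n) returns the n highest-count (word, count) pairs.
top_words = [word for word, _ in counts.most_common(2)]
print(top_words)                    # ['call', 'free']
```

Applied to the notebook, the same idea would read `[w for w, _ in all_words.most_common(1500)]`, assuming a vocabulary of the 1,500 most frequent words is what is wanted.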

In [158]:
# The find_features function determines which of the word features are contained in a message

def find_features(message):
    words = word_tokenize(message)
    feature = {}
    for word in word_features:
        feature[word] = (word in words)
    return feature

# Let's see an example
features = find_features(processed[18])
for key, value in features.items():
    if value:
        print(key)

spam
numbr
equiti
safaricom
nguruma
ibamb
na
nambari
yako
imezawadia


In [159]:
# Now let's do it for all the messages
messages = list(zip(processed, Y))

# define a seed for reproducibility
seed = 1
np.random.seed(seed)  # call the function; assigning to it would overwrite it and leave the shuffle unseeded
np.random.shuffle(messages)

# call find_features function for each SMS message
featuresets = [(find_features(text), label) for (text, label) in messages]
print(featuresets[2])

({'data': False, 'spam': False, 'ksh': False, 'numbr': False, 'loyal': False, 'equiti': False, 'bank': False, 'custom': False, 'click': False, 'http': False, 'equitybanknig': False, 'plc': False, 'com': False, 'detail': False, 'ham': False, 'pleas': False, 'call': False, 'thank': False, 'safaricom': False, 'nguruma': False, 'imbamb': False, 'promot': False, 'dear': False, 'subscrib': False, 'phone': False, 'number': False, 'kshnumbr': False, 'us': False, 'send': False, 'money': False, 'jumia': False, 'shopper': False, 'purchas': False, 'last': False, 'month': False, 'via': False, 'card': False, 'go': False, 'www': False, 'jumiaa': False, 'co': False, 'ke': False, 'claim': False, 'hello': False, 'got': False, 'credit': False, 'numbrgb': False, 'internet': False, 'free': False, 'winpriz': False, 'net': False, 'kenya': False, 'equitel': False, 'expir': False, 'sm': False, 'upgrad': False, 'hire': False, 'shorten': False, 'ur': False, 'life': False, 'mpesa': False, 'pardon': False, 'hi': F

In [179]:
# we can split featuresets into train and test sets
from sklearn.model_selection import train_test_split

training, testing = train_test_split(featuresets, test_size = 0.25, random_state = seed)


In [180]:
# print the length of the training and testing sets
print(len(training))
print(len(testing))

1137
380


### 4. Scikit-Learn Classifiers with NLTK

Now that we have our feature sets, we can start training models! Let's start with a simple linear support vector classifier, then expand to other algorithms. We'll need to import each algorithm we plan on using from sklearn. We also need to import some performance metrics, such as accuracy_score and classification_report.

In [181]:
# we can use sklearn algorithms in NLTK
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC

model = SklearnClassifier(SVC(kernel = 'linear'))

# train the model on the training data
model.train(training)

# and evaluate it on the testing set
accuracy = nltk.classify.accuracy(model, testing)*100
print("SVC Accuracy: {}".format(accuracy))

SVC Accuracy: 12.631578947368421
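An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent training label. `majority_baseline_accuracy` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy (in percent) of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return 100.0 * hits / len(test_labels)


# Illustrative labels, not taken from the dataset.
train = ['ham', 'ham', 'ham', 'spam']
test = ['ham', 'spam', 'ham', 'ham']
print(majority_baseline_accuracy(train, test))   # 75.0
```

A classifier that scores near or below such a baseline is usually a sign of a problem with the labels or features, not with the algorithm.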


In [182]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Define models to train
names = ["SVM","K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifier = [SVC(kernel = 'linear'), KNeighborsClassifier(), DecisionTreeClassifier(), RandomForestClassifier(), LogisticRegression(),  SGDClassifier(max_iter = 100), MultinomialNB(), SVC(kernel = 'linear')]
models = zip(names, classifier)
for name, model in models:
    nltk_model = SklearnClassifier(model)
    nltk_model.train(training)
    accuracy = nltk.classify.accuracy(nltk_model, testing)*100
    print("{} Accuracy: {}".format(name, accuracy))

SVM Accuracy: 12.631578947368421
K Nearest Neighbors Accuracy: 12.105263157894736
Decision Tree Accuracy: 12.631578947368421
Random Forest Accuracy: 12.368421052631579
Logistic Regression Accuracy: 12.631578947368421
SGD Classifier Accuracy: 12.631578947368421
Naive Bayes Accuracy: 12.368421052631579
SVM Linear Accuracy: 12.631578947368421


In [183]:
# Ensemble method - Voting classifier
from sklearn.ensemble import VotingClassifier

names = ["K Nearest Neighbors", "Decision Tree", "Random Forest", "Logistic Regression", "SGD Classifier",
         "Naive Bayes", "SVM Linear"]

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter = 100),
    MultinomialNB(),
    SVC(kernel = 'linear')
]

models = list(zip(names, classifiers))

nltk_ensemble = SklearnClassifier(VotingClassifier(estimators = models, voting = 'hard', n_jobs = -1 ))
nltk_ensemble.train(training)
accuracy = nltk.classify.accuracy(nltk_ensemble, testing)*100
print("Voting Classifier: Accuracy: {}".format(accuracy))

Voting Classifier: Accuracy: 12.631578947368421
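With `voting = 'hard'`, the ensemble predicts the label that most of its member classifiers chose. The sketch below shows that majority rule in plain Python for three hypothetical classifiers. Breaking ties by sorted label order is an assumption of this sketch.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the majority label among one sample's predictions.

    Ties are broken by sorted label order (an assumption of this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)


# Each row holds the labels predicted by three hypothetical classifiers.
per_sample = [
    ['ham', 'ham', 'spam'],
    ['spam', 'spam', 'ham'],
    ['ham', 'spam', 'spam'],
]
print([hard_vote(row) for row in per_sample])   # ['ham', 'spam', 'spam']
```

Because the vote only combines the members' predictions, an ensemble of classifiers that all score about the same, as in the results above, will tend to score about the same as well.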




In [185]:
# make class label prediction for testing set
txt_features, labels = zip(*testing)

prediction = nltk_ensemble.classify_many(txt_features)



In [186]:
# print a confusion matrix and a classification report
print(classification_report(labels, prediction))

pd.DataFrame(
    confusion_matrix(labels, prediction),
    index = [['actual', 'actual'], ['ham', 'spam']],
    columns = [['predicted', 'predicted'], ['ham', 'spam']])



             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00         0
          2       0.00      0.00      0.00         1
          4       0.00      0.00      0.00         1
          5       0.00      0.00      0.00         1
          6       0.00      0.00      0.00         0
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         1
          9       0.00      0.00      0.00         0
         11       0.00      0.00      0.00         1
         16       0.00      0.00      0.00         1
         21       0.00      0.00      0.00         1
         28       0.00      0.00      0.00         1
         29       0.00      0.00      0.00         0
         30       0.00      0.00      0.00         1
         34       0.00      0.00      0.00         0
         36       0.00      0.00      0.00         0
         38       0.00      0.00      0.00   

ValueError: Shape of passed values is (513, 513), indices imply (2, 2)
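The `ValueError` above arises because the true and predicted labels span 513 distinct classes, so the confusion matrix is 513 by 513 and cannot take a two-entry `ham`/`spam` index. The sketch below builds a confusion matrix in plain Python for an explicit list of labels. It assumes the labels are already the two classes. `confusion_counts` is a hypothetical helper and the data is illustrative.

```python
def confusion_counts(actual, predicted, labels):
    """Return a len(labels) x len(labels) matrix; rows are actual, columns predicted."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[position[a]][position[p]] += 1
    return matrix


# Illustrative labels only.
label_names = ['ham', 'spam']
actual = ['ham', 'ham', 'spam', 'spam', 'ham']
predicted = ['ham', 'spam', 'spam', 'spam', 'ham']
matrix = confusion_counts(actual, predicted, label_names)
for name, row in zip(label_names, matrix):
    print('actual', name, row)
```

The matrix has one row and one column per entry in `label_names`, so index and column labels of that same length will always fit it.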

#### Thank you!