# Text Classification with NLTK and Scikit-Learn
> In this post, we will build on our NLP foundation and explore ways to improve text classification with NLTK and Scikit-learn. Specifically, we will build SMS spam filters.

# Scikit-Learn

Scikit-learn is a free machine learning library for Python. It:

- Provides a selection of efficient tools for machine learning and statistical modeling, including:
    - **Classification:** Identifying which category an object belongs to. Example: Spam detection
    - **Regression:** Predicting a continuous variable based on relevant independent variables. Example: Stock price predictions
    - **Clustering:** Automatic grouping of similar objects into different clusters. Example: Customer segmentation
    - **Dimensionality Reduction:** Reducing the number of input variables in training data while preserving the salient relationships in the data
- Features various algorithms such as support vector machines, random forests, and k-nearest neighbours.
- Supports Python numerical and scientific libraries such as NumPy and SciPy.


Some popular groups of models provided by scikit-learn include:

- **Clustering:** Group unlabeled data such as KMeans.
- **Cross Validation:** Estimate the performance of supervised models on unseen data.
- **Datasets:** for test datasets and for generating datasets with specific properties for investigating model behavior.
- **Dimensionality Reduction:** Reduce the number of attributes in data for summarization, visualization and feature selection such as Principal component analysis.
- **Ensemble Methods:** Combine the predictions of multiple supervised models.
- **Feature Extraction:** Define attributes in image and text data.
- **Feature Selection:** Identify meaningful attributes from which to create supervised models.
- **Parameter Tuning:** Get the most out of supervised models.
- **Manifold Learning:** Summarize and depict complex multi-dimensional data.
- **Supervised Models:** A vast array, including but not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
- **Unsupervised Learning Algorithms:** They include clustering, factor analysis, PCA (Principal Component Analysis), and unsupervised neural networks.


![fig_sckl](https://ulhpc-tutorials.readthedocs.io/en/latest/python/advanced/scikit-learn/images/scikit.png)
Image Source: ulhpc-tutorials.readthedocs.io

## Required Packages

In [None]:
import sys
import nltk
import sklearn
import pandas as pd
import numpy as np

## Version Check

In [None]:
print('Python: {}'.format(sys.version))
print('NLTK: {}'.format(nltk.__version__))
print('Scikit-learn: {}'.format(sklearn.__version__))
print('Pandas: {}'.format(pd.__version__))
print('NumPy: {}'.format(np.__version__))

Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
NLTK: 3.8.1
Scikit-learn: 1.2.2
Pandas: 1.5.3
NumPy: 1.23.5


## Load the dataset
Now that we have ensured that our libraries are installed correctly, let's load the data set as a Pandas DataFrame. Furthermore, let's extract some useful information such as the column information and class distributions.

The data set we will be using comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). It contains over 5,000 labeled SMS messages that have been collected for mobile phone spam research.

In [None]:
!wget https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
!unzip sms+spam+collection.zip

--2023-12-11 08:38:10--  https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘sms+spam+collection.zip’

sms+spam+collection     [<=>                 ]       0  --.-KB/s               sms+spam+collection     [ <=>                ] 198.65K  --.-KB/s    in 0.04s   

2023-12-11 08:38:10 (5.32 MB/s) - ‘sms+spam+collection.zip’ saved [203415]

Archive:  sms+spam+collection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [None]:
# Load the dataset of SMS messages
sms = pd.read_table('SMSSpamCollection', header=None, encoding='utf-8')
sms.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
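The default integer column names (`0` and `1`) work, but they are easy to mix up later. As a small sketch, the same tab-separated layout can be loaded with descriptive column names. The names `label` and `message` and the two sample rows are assumptions made for illustration; the rest of this post keeps the original `sms[0]` and `sms[1]` columns.

```python
import io
import pandas as pd

# Two rows in the same tab-separated layout as SMSSpamCollection
# (sample text is illustrative, not read from the real file)
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# names=[...] assigns readable column names instead of 0 and 1
df = pd.read_csv(io.StringIO(sample), sep='\t', header=None,
                 names=['label', 'message'])

print(df.columns.tolist())
print(df['label'].value_counts())
```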


In [None]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       5572 non-null   object
 1   1       5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [None]:
# Check class distribution
sms[0].value_counts()

ham     4825
spam     747
Name: 0, dtype: int64

From the data summary, we can see that spam messages are labeled `spam` and non-spam messages are labeled `ham`. There are 747 spam messages and 4825 ham messages in the dataset, so the classes are imbalanced.

## Data-preprocess
The labels are stored as strings. For a model to use them, we need to convert them to numeric values (0 for `ham`, 1 for `spam`). This kind of process is called **label encoding**. There are several ways to encode the labels:

- use `pd.get_dummies`, which produces one-hot (indicator) columns
- use `LabelEncoder` in `sklearn.preprocessing`, which maps each class to an integer

Here, we use `LabelEncoder`:

In [None]:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
label = enc.fit_transform(sms[0])
print(label[:10])
print(sms[0][:10])

[0 0 1 0 0 1 0 0 1 1]
0     ham
1     ham
2    spam
3     ham
4     ham
5    spam
6     ham
7     ham
8    spam
9    spam
Name: 0, dtype: object
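To make the difference between the two options concrete, here is a minimal sketch on a five-element toy series (the toy labels are an assumption for illustration). `LabelEncoder` assigns integers to the sorted class names, so `ham` becomes 0 and `spam` becomes 1, while `pd.get_dummies` returns one indicator column per class.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels, for illustration only
toy_labels = pd.Series(['ham', 'ham', 'spam', 'ham', 'spam'])

toy_enc = LabelEncoder()
encoded = toy_enc.fit_transform(toy_labels)      # one integer per label
classes = list(toy_enc.classes_)                 # index in this list == integer code
decoded = list(toy_enc.inverse_transform(encoded))  # back to the original strings

# One-hot alternative: one indicator column per class
dummies = pd.get_dummies(toy_labels)

print(encoded, classes)
print(dummies)
```

`inverse_transform` is handy later when you want to report predictions as `ham`/`spam` instead of 0/1.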


In [None]:
text = sms[1]
text[:10]

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
5    FreeMsg Hey there darling it's been 3 week's n...
6    Even my brother is not like to speak with me. ...
7    As per your request 'Melle Melle (Oru Minnamin...
8    WINNER!! As a valued network customer you have...
9    Had your mobile 11 months or more? U R entitle...
Name: 1, dtype: object

Now it is time for text preprocessing. In the previous post, we learned several text preprocessing techniques. Before applying them, we need to normalize the text by replacing or removing special characters and numbers such as phone numbers. To do this, we can use **regular expressions** (regex for short) for pattern matching. Here are some common regex constructs, as described on Wikipedia.

- **^** Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

- **.** Matches any single character (many applications exclude newlines, and exactly which characters are considered newlines is flavor-, character-encoding-, and platform-specific, but it is safe to assume that the line feed character is included). Within POSIX bracket expressions, the dot character matches a literal dot. For example, a.c matches "abc", etc., but [a.c] matches only "a", ".", or "c".

- **[ ]** A bracket expression. Matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z]. The - character is treated as a literal character if it is the last or the first (after the ^, if present) character within the brackets: [abc-], [-abc]. Note that backslash escapes are not allowed. The ] character can be included in a bracket expression if it is the first (after the ^) character: []abc].

- **[^ ]** Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.

- **\$** Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

- **( )** Defines a marked subexpression. The string matched within the parentheses can be recalled later (see the next entry, \n). A marked subexpression is also called a block or capturing group. BRE mode requires `\( \)`.

- **\\n** Matches what the nth marked subexpression matched, where n is a digit from 1 to 9. This construct is vaguely defined in the POSIX.2 standard. Some tools allow referencing more than nine capturing groups.

- **\*** Matches the preceding element zero or more times. For example, `ab*c` matches "ac", "abc", "abbbc", etc. `[xyz]*` matches "", "x", "y", "z", "zx", "zyx", "xyzzy", and so on. `(ab)*` matches "", "ab", "abab", "ababab", and so on.

- **{m,n}** Matches the preceding element at least m and not more than n times. For example, a{3,5} matches only "aaa", "aaaa", and "aaaaa". This is not found in a few older instances of regexes. BRE mode requires `\{m,n\}`.

If you want to test your regex form, test it [here](https://regexr.com/)
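The constructs listed above can also be tried directly with Python's built-in `re` module. The following is a small sketch on made-up strings (the sample text is an assumption for illustration). Note that the list above describes POSIX-style syntax; Python's flavor differs in a few details, for example it does allow backslash escapes inside brackets.

```python
import re

sample = "Call 555-1234 now, or email win@prize.example today"

# [ ] bracket expression with {m,n}: runs of 3 to 4 digits
digit_runs = re.findall(r'[0-9]{3,4}', sample)

# [^ ] negated bracket expression: drop everything except letters and spaces
letters_only = re.sub(r'[^a-zA-Z ]', '', sample)

# ^ and $ anchor the pattern to the whole string, so this needs an all-digit string
whole_match = re.search(r'^\d+$', sample) is not None
# Without anchors, any digit run anywhere in the string is enough
partial_match = re.search(r'\d+', sample) is not None

# ( ) capturing group recalled with \1: detect a doubled word
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test').group(1)

print(digit_runs, whole_match, partial_match, doubled)
```

The difference between `whole_match` and `partial_match` matters for the cleaning patterns used in this post: a pattern wrapped in `^...$` only fires when the entire message matches it.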

In [None]:
# Use regular expressions. regex=True is passed explicitly because
# pandas 2.0+ treats the pattern as a literal string by default.

# Replace email addresses with 'emailaddress'
# (the ^...$ anchors mean this only matches when the whole message is an address)
processed = text.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'emailaddress', regex=True)

# Replace URLs with 'webaddress' (also anchored to the whole message)
processed = processed.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'webaddress', regex=True)

# Replace money symbols with 'moneysymb' (£ can be typed with ALT key + 156)
processed = processed.str.replace(r'£|\$', 'moneysymb', regex=True)

# Replace 10-digit phone numbers (formats include parentheses, spaces, no spaces, dashes) with 'phonenumbr'
processed = processed.str.replace(r'^\(?[\d]{3}\)?[\s-]?[\d]{3}[\s-]?[\d]{4}$', 'phonenumbr', regex=True)

# Replace numbers with 'numbr'
processed = processed.str.replace(r'\d+(\.\d+)?', 'numbr', regex=True)



We also need to remove characters that carry no useful signal, such as punctuation and extra whitespace.

In [None]:
# Remove punctuation (regex=True is explicit for pandas 2.0+)
processed = processed.str.replace(r'[^\w\d\s]', ' ', regex=True)

# Replace whitespace between terms with a single space
processed = processed.str.replace(r'\s+', ' ', regex=True)

# Remove leading and trailing whitespace
processed = processed.str.replace(r'^\s+|\s+?$', '', regex=True)



After that, we convert every message to lower case.

In [None]:
processed = processed.str.lower()
processed

0       go until jurong point crazy available only in ...
1                                 ok lar joking wif u oni
2       free entry in numbr a wkly comp to win fa cup ...
3             u dun say so early hor u c already then say
4       nah i don t think he goes to usf he lives arou...
                              ...                        
5567    this is the numbrnd time we have tried numbr c...
5568                  will ü b going to esplanade fr home
5569    pity was in mood for that so any other suggest...
5570    the guy did some bitching but i acted like i d...
5571                            rofl its true to its name
Name: 1, Length: 5572, dtype: object
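To see what these steps do to a single message, the sketch below applies the unanchored patterns from above to one plain string with the `re` module. The helper name `normalize` and the sample message are assumptions for illustration; the anchored email, URL, and phone patterns are left out because they only match when the whole message is an address or number.

```python
import re

def normalize(message: str) -> str:
    """Apply the money, number, punctuation, and whitespace steps to one string."""
    message = re.sub(r'£|\$', 'moneysymb', message)      # money symbols
    message = re.sub(r'\d+(\.\d+)?', 'numbr', message)   # integers and decimals
    message = re.sub(r'[^\w\d\s]', ' ', message)         # punctuation -> space
    message = re.sub(r'\s+', ' ', message)               # collapse whitespace
    return message.strip().lower()                       # trim and lower-case

example = normalize("WINNER!! You have won £900 prize. Call 09061701461 now")
print(example)
```

Because the money symbol is replaced before the digits, an amount such as `£900` ends up as the single token `moneysymbnumbr`, which is why that token shows up in the processed messages.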

In the previous post, we learned about stop word removal as a text preprocessing step. We can apply it here.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

processed = processed.apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


We can also use `PorterStemmer` to reduce each word to its stem.

In [None]:
ps = nltk.PorterStemmer()
processed = processed.apply(lambda x: ' '.join(ps.stem(term) for term in x.split()))

In [None]:
print(processed)

0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri numbr wkli comp win fa cup final tk...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    numbrnd time tri numbr contact u u moneysymbnu...
5568                              ü b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: 1, Length: 5572, dtype: object


As you can see, the processed messages look quite different from the originals, because regular expression cleaning, stop word removal, and stemming have all been applied.
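The effect of the last two steps is easiest to follow on one message. The sketch below runs stop word removal and stemming on message 3 from the lower-cased output above. The stop word set here is a tiny hand-picked subset (an assumption, so the example runs without downloading the NLTK corpus); the real pipeline uses the full English list.

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stop word set (hypothetical subset, not NLTK's full list)
toy_stop_words = {'i', 'he', 'to', 'the', 'is', 'in', 'so', 'then'}
toy_stemmer = PorterStemmer()

def remove_stops_and_stem(message):
    # Drop stop words first, then stem whatever remains
    kept = [term for term in message.split() if term not in toy_stop_words]
    return ' '.join(toy_stemmer.stem(term) for term in kept)

result = remove_stops_and_stem('u dun say so early hor u c already then say')
print(result)
```

Stems such as `earli` and `alreadi` are not dictionary words. The Porter stemmer only guarantees that different forms of a word map to the same string, which is all the classifier needs.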

## Feature extraction
After preprocessing, we need to extract features from the text messages. To do this, we first tokenize each message into words. In this case, we will use the 5000 most common words as features.

In [None]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')
all_words = []

for message in processed:
    words = word_tokenize(message)
    for w in words:
        all_words.append(w)

all_words = nltk.FreqDist(all_words)

# Print the result
print('Number of words: {}'.format(len(all_words)))
print('Most common words: {}'.format(all_words.most_common(15)))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Number of words: 6579
Most common words: [('numbr', 2648), ('u', 1207), ('call', 674), ('go', 456), ('get', 451), ('ur', 391), ('gt', 318), ('lt', 316), ('come', 304), ('moneysymbnumbr', 303), ('ok', 293), ('free', 284), ('day', 276), ('know', 275), ('love', 266)]
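`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The three toy messages below are assumptions for illustration.

```python
from collections import Counter

# Toy pre-processed messages, for illustration only
toy_messages = ['free entri numbr', 'call numbr free', 'ok lar']

# Count every whitespace-separated token across all messages
toy_counts = Counter(word for msg in toy_messages for word in msg.split())

# The most frequent tokens become the feature vocabulary
top_two = toy_counts.most_common(2)
toy_vocab = {word for word, _ in top_two}

print(len(toy_counts), top_two)
```

`len(toy_counts)` is the number of distinct tokens, which corresponds to the "Number of words" figure printed above.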


In [None]:
# Use the 5000 most common words as features
word_features = [x[0] for x in all_words.most_common(5000)]
print(word_features)

['numbr', 'u', 'call', 'go', 'get', 'ur', 'gt', 'lt', 'come', 'moneysymbnumbr', 'ok', 'free', 'day', 'know', 'love', 'like', 'got', 'time', 'good', 'want', 'text', 'send', 'txt', 'need', 'one', 'today', 'take', 'ü', 'see', 'stop', 'home', 'think', 'repli', 'r', 'lor', 'sorri', 'still', 'tell', 'n', 'numbrp', 'back', 'mobil', 'da', 'dont', 'make', 'k', 'week', 'pleas', 'phone', 'say', 'hi', 'work', 'new', 'pl', 'later', 'hope', 'miss', 'ask', 'co', 'meet', 'msg', 'messag', 'night', 'dear', 'c', 'wait', 'happi', 'well', 'tri', 'give', 'great', 'much', 'thing', 'claim', 'oh', 'min', 'wat', 'hey', 'number', 'na', 'friend', 'thank', 'ye', 'way', 'www', 'let', 'e', 'prize', 'feel', 'even', 'right', 'tomorrow', 'wan', 'alreadi', 'pick', 'cash', 'said', 'care', 'b', 'amp', 'yeah', 'im', 'leav', 'realli', 'tone', 'babe', 'win', 'life', 'morn', 'find', 'last', 'sleep', 'servic', 'keep', 'sure', 'use', 'anyth', 'uk', 'buy', 'would', 'year', 'start', 'contact', 'lol', 'also', 'urgent', 'nokia', 'w

Now that we have the feature list, we need to find which of those features appear in each message.

In [None]:
def find_features(message):
    # Keep only the tokens that appear in the feature vocabulary
    words = word_tokenize(message)
    features = [w for w in words if w in word_features]
    return ' '.join(features)

In [None]:
dataset = []
for message in processed:
  features = find_features(message)
  dataset.append(features)
print(dataset)

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat', 'ok lar joke wif u oni', 'free entri numbr wkli comp win fa cup final tkt numbrst may numbr text fa numbr receiv entri question std txt rate c appli numbrovernumbr', 'u dun say earli hor u c alreadi say', 'nah think goe usf live around though', 'freemsg hey darl numbr week word back like fun still tb ok xxx std chg send moneysymbnumbr rcv', 'even brother like speak treat like aid patent', 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press numbr copi friend callertun', 'winner valu network custom select receivea moneysymbnumbr prize reward claim call numbr claim code klnumbr valid numbr hour', 'mobil numbr month u r entitl updat latest colour mobil camera free call mobil updat co free numbr', 'gon na home soon want talk stuff anymor tonight k cri enough today', 'six chanc win cash numbr numbr numbr pound txt cshnumbr send numbr cost numbrp day numbrday numbr tsandc appli rep

Finally, we have a simple dataset that we can use for training. The same approach can be applied to other datasets. Next, we shuffle the data, vectorize it, and split it into a training set and a test set.

In [None]:
messages = list(zip(dataset, label))
np.random.seed(1)
np.random.shuffle(messages)

In [None]:
print(messages[5])

('ok thk got u wan numbr come wat', 0)


In [None]:
X,y = zip(* messages)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)
vectorizer.get_feature_names_out()

array(['____', 'aah', 'aaniy', ..., 'zyada', 'èn', 'únumbr'], dtype=object)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
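When no size is given, `train_test_split` holds out 25% of the rows for testing. Because spam is the minority class, it can also be useful to keep the class ratio the same in both splits. The sketch below shows the `stratify` option on toy data; it is an optional variation, not what the cell above does, and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 toy samples with a 16:4 class imbalance, for illustration only
toy_X = np.arange(20).reshape(-1, 1)
toy_y = np.array([0] * 16 + [1] * 4)

# stratify=toy_y keeps the 80/20 class ratio in both the train and test parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=1)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_train.sum(), toy_y_test.sum())
```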

## Scikit-learn Classifier
Now that we have built the training and test sets, we can build machine learning models in scikit-learn. We will use the following algorithms and compare their performance:

- K Nearest Neighbors
- Decision Tree
- Random Forest
- Logistic Regression
- SGD Classifier
- Naive Bayes
- Support Vector Machine

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

names = ['K Nearest Neighbors', 'Decision Tree', 'Random Forest', 'Logistic Regression', 'SGD Classifier',
         'Naive Bayes', 'Support Vector Classifier']

classifiers = [
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(max_iter=100),
    MultinomialNB(),
    SVC(kernel='linear')
]

models = zip(names, classifiers)

# Train each classifier and report its accuracy on the held-out test set
for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(y_test, predictions)
    print("{} model Accuracy: {}".format(name, accuracy))

K Nearest Neighbors model Accuracy: 0.9081119885139985
Decision Tree model Accuracy: 0.968413496051687
Random Forest model Accuracy: 0.9784637473079684
Logistic Regression model Accuracy: 0.9676956209619526
SGD Classifier model Accuracy: 0.9827709978463748
Naive Bayes model Accuracy: 0.9777458722182341
Support Vector Classifier model Accuracy: 0.9791816223977028
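Accuracy figures are easier to judge against a baseline. Because the dataset is imbalanced, a model that always predicts `ham` already scores well. The short calculation below uses the class counts reported earlier (4825 ham, 747 spam) over the whole dataset; the test split will differ slightly.

```python
# Class counts from the value_counts() output earlier in this post
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Accuracy of a trivial model that labels every message as ham
baseline_accuracy = ham_count / total

print(round(baseline_accuracy, 4))
```

Any classifier worth keeping should beat this baseline clearly, which puts the K Nearest Neighbors result in perspective.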


From the results, most models reach roughly 97-98% accuracy, with K Nearest Neighbors trailing at about 91%. We can also combine the models and let them vote on each prediction, which is one kind of ensemble approach. To do this, we use `VotingClassifier` from `sklearn.ensemble`. You can find the details of the Voting Classifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html).

In [None]:
from sklearn.ensemble import VotingClassifier

# VotingClassifier expects a list of (name, estimator) tuples
models = list(zip(names, classifiers))

sk_ensemble = VotingClassifier(estimators=models, voting='hard', n_jobs=-1)
sk_ensemble.fit(X_train, y_train)
predictions = sk_ensemble.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Voting Classifier model Accuracy: {}".format(accuracy))

Voting Classifier model Accuracy: 0.9784637473079684
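With `voting='hard'`, each fitted model casts one vote per message and the majority label wins. The idea can be sketched by hand with NumPy; the prediction matrix below is made up for illustration and does not come from the models above.

```python
import numpy as np

# Rows are three hypothetical models, columns are three messages (0 = ham, 1 = spam)
toy_predictions = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])

n_models = toy_predictions.shape[0]
spam_votes = toy_predictions.sum(axis=0)              # spam votes per message
majority = (spam_votes > n_models / 2).astype(int)    # spam only with a strict majority

print(spam_votes, majority)
```

Using an odd number of models, as in this post with seven classifiers, avoids ties in a two-class problem.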


We are done. We can also generate a classification report, which summarizes precision, recall, and F1-score for each class.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions,digits=5))

              precision    recall  f1-score   support

           0    0.97792   0.99750   0.98761      1199
           1    0.98235   0.86082   0.91758       194

    accuracy                        0.97846      1393
   macro avg    0.98014   0.92916   0.95260      1393
weighted avg    0.97854   0.97846   0.97786      1393
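The spam row of the report can be reproduced from four confusion matrix counts. The counts below are back-calculated from the precision, recall, and support figures in the report, so treat them as an inference rather than an output of the original run.

```python
# Confusion matrix counts for the spam class (inferred from the report above)
true_spam_caught = 167    # spam predicted as spam
spam_missed = 27          # spam predicted as ham
ham_flagged = 3           # ham predicted as spam
ham_passed = 1196         # ham predicted as ham

precision = true_spam_caught / (true_spam_caught + ham_flagged)
recall = true_spam_caught / (true_spam_caught + spam_missed)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (true_spam_caught + ham_passed) / (
    true_spam_caught + spam_missed + ham_flagged + ham_passed)

print(round(precision, 5), round(recall, 5), round(f1, 5), round(accuracy, 5))
```

The gap between precision and recall says the ensemble rarely flags a legitimate message as spam, but lets roughly one in seven spam messages through.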

