### Problem:
* How do we deal with imbalanced class distribution for textual data classification?

An imbalanced dataset in Natural Language Processing is a dataset whose number of data samples is not the same in the different classes. One class has more data samples than the other class.

For example, one class has 3000 samples, and the other may have 300. The class with more data samples is known as the majority class, while the other one is known as the minority class.

When we train a model with an imbalanced dataset, the model will be biased towards the majority class. The model may make wrong predictions and give inaccurate results. It has a negative impact when we use the model in production, and the stakeholders depend on it for business operations.

In Natural Language Processing (NLP), we have various libraries that can handle text data that have an imbalance. We will use the Imbalanced-learn library. This library will balance the classes in the dataset. It will also reduce model bias and enhance the NLP performance.

We will first build a spam classifier model with natural language processing without balancing the classes in the dataset. We will implement the same model but use Imbalanced-Learn to balance the classes. Then, we will compare the two models (before and after balancing) and evaluate their performance.

#### Data
* Spam classification dataset:
    We will use the SMS collection dataset to train the NLP model. It has two labeled classes. (spam and ham). The spam class contains all the spam SMS. The ham class has all the SMS that are not spam.
    
    The NLP model will classify an SMS as either spam or not spam. Ensure you can download the SMS collection dataset from here. The link will give you the complete SMS collection dataset.
    
    We load the dataset using Pandas.

In [27]:
import pandas as pd
import numpy as np
import nltk
import string
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import re
from sklearn.feature_extraction.text import TfidfVectorizer

In [74]:
#nltk.download()

In [73]:
df = pd.read_csv("spam_classification.csv", sep="\t", header=None)

In [75]:
print(df.shape)
df.rename(columns={0: 'label', 1: 'text'}, inplace=True)
df.head()

(5572, 2)


Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
df['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

Calculating the length of each data sample

We will create a new length column that will show the length of each data sample. This new column will help us with preprocessing the data samples.

In [76]:
df['length'] = df['text'].apply(lambda x: len(x))
df.head()

Unnamed: 0,label,text,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


We will begin cleaning the text.

Text cleaning
Before building the spam classification model, we will clean the dataset to have the required format. Many text cleaning steps will format the text. It includes removing unnecessary words, punctuation, stop words, white spaces, and unnecessary symbols from the text dataset.

For this tutorial, we will implement the following steps:

Removing stop words: Stop words do not contribute to the meaning of a sentence since they are common in a language. Stop words for the English language are pronouns, conjunctions, and articles. Removing stop words enables the NLP model to focus on unique words in the SMS messages that will add value.

Converting all the SMS messages to lower case: It ensures that we have a uniform dataset.

Removing numbers and other numeric values: It ensures that only text that remains in the dataset adds value to the model.

Removing punctuations: It involves removing full stops and other punctuation marks. These are the unnecessary symbols in the dataset.

Removing extra white spaces: White space occupies the dataset, but they do not carry information. Removing the extra white spaces ensures we only remain with the text that the model will use.

Lemmatizing the texts: Stemming reduces inflected forms of a text/word into its lemma or dictionary form. For example, the words/texts “running”, “ran”, and “runs” are all reduced to the root form “run”.

Tokenization: It is the splitting/breaking of the raw texts into smaller words or phrases known as tokens. We will implement the text cleaning steps using Natural Language Toolkit (NLTK).

NLTK has smaller sub-libraries that perform specific text cleaning tasks. These smaller libraries also have methods for text cleaning.

The next step is to download the smaller sub-libraries from NLTK as follows:

It will tokenize the text in the dataset.                                                  

In [None]:
nltk.download('punkt')

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/sumana/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [77]:
def convert_to_lower(text):
    return text.lower()
df['text'] = df['text'].apply(lambda x: convert_to_lower(x))

def remove_numbers(text):
    number_pattern = r'\d+'
    without_number = re.sub(pattern=number_pattern, repl=" ", string=text)
    return without_number
df['text'] = df['text'].apply(lambda x: remove_numbers(x))

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

def remove_stopwords(text):
    removed = []
    stop_words = list(stopwords.words("english"))
    tokens = word_tokenize(text)
    for i in range(len(tokens)):
        if tokens[i] not in stop_words:
            removed.append(tokens[i])
    return " ".join(removed)
df['text'] = df['text'].apply(lambda x: remove_punctuation(x))

def remove_extra_white_spaces(text):
    single_char_pattern = r'\s+[a-zA-Z]\s+'
    without_sc = re.sub(pattern=single_char_pattern, repl=" ", string=text)
    return without_sc
df['text'] = df['text'].apply(lambda x: remove_extra_white_spaces(x))

# def lemmatizing(text):
#     lemmatizer = WordNetLemmatizer()
#     tokens = word_tokenize(text)
#     for i in range(len(tokens)):
#         lemma_word = lemmatizer.lemmatize(tokens[i])
#         tokens[i] = lemma_word
#     return " ".join(tokens)
# df['text'] = df['text'].apply(lambda x: lemmatizing(x))

In [78]:
#Converting the class labels into integer values
label_map = {
    'ham': 0,
    'spam': 1,
}

In [79]:
df['label'] = df['label'].map(label_map)

In [80]:
df.head()

Unnamed: 0,label,text,length
0,0,go until jurong point crazy available only in ...,111
1,0,ok lar joking wif oni,29
2,1,free entry in wkly comp to win fa cup final tk...,155
3,0,u dun say so early hor c already then say,49
4,0,nah dont think he goes to usf he lives around ...,61


Implementing text vectorization
It converts the raw text into a format the NLP model can understand and use. Vectorization will create a numerical representation of the text strings called a sparse matrix or word vectors. The model works with numbers and not raw text. We will use TfidfVectorizer to create the sparse matrix.

In [81]:
tf_wb= TfidfVectorizer()

In [82]:
X_tf = tf_wb.fit_transform(df['text'])

In [83]:
X_tf = X_tf.toarray()

In [84]:
print(X_tf)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [85]:
from sklearn.model_selection import train_test_split

In [86]:
X_train_tf, X_test_tf, y_train_tf, y_test_tf = train_test_split(X_tf, df['label'].values, test_size=0.3)

In [87]:
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(X_train_tf, y_train_tf)

NB_pred= NB.predict(X_test_tf)
print(NB_pred)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test_tf, NB_pred))

[0 0 0 ... 0 1 0]
0.8929425837320574


In [88]:
#!pip install imbalanced-learn

In [89]:
from imblearn.over_sampling import RandomOverSampler
from collections import Counter

#### Implementing Imbalanced-Learn
RandomOverSampler will increase the data samples in the minority class (spam). It makes the minority class have the same data samples as the majority class (ham). The function synthesizes new dummy data samples in the minority class to enable class balancing.

In [90]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'].values, test_size=0.30)

In [91]:
Counter(y_train)

Counter({0: 3361, 1: 539})

In [92]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)

X_train_tf = vectorizer.transform(X_train)
X_train_tf = X_train_tf.toarray()

In [93]:
X_test_tf = vectorizer.transform(X_test)
X_test_tf = X_test_tf.toarray()

### Applying RandomOverSampler function
#### The function uses the sampling_strategy parameter to balance the class. We set the parameter’s value to 1 to ensure the dataset classes have 1:1 data samples. We then apply the function to the training set. It will generate the new data samples to ensure both classes are balanced.



In [94]:
ROS = RandomOverSampler(sampling_strategy=1)

In [95]:
X_train_ros, y_train_ros = ROS.fit_resample(X_train_tf, y_train)

In [96]:
Counter(y_train_ros)

Counter({0: 3361, 1: 3361})

### Using the balanced dataset to build the same model

In [97]:
nb = GaussianNB()

In [98]:
nb.fit(X_train_ros, y_train_ros)

y_preds = nb.predict(X_test_tf)
print(y_preds)

print(accuracy_score(y_test, y_preds))

[0 1 0 ... 0 0 0]
0.8995215311004785


Conclusion

We used Imbalanced-learn to handle imbalanced text data in natural language processing. We cleaned the text dataset and implemented the text preprocessing steps using the NLTK library. We implemented text vectorization and fed the model the sparse matrix.

We then implemented a spam classifier model without balancing the dataset and calculated the accuracy score. We also implemented the same model but used Imbalanced-Learn to balance the classes.

Finally, we compared the two models (before and after balancing), with a slghtly increased accuracy.