<a href="https://colab.research.google.com/github/khenm/email_classification/blob/develop/text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Text Classification**
<p align="justify">
Text classification is an important machine learning task in Natural Language Processing. The goal is to build a program that classifies texts into given categories. Applications include spam detection, sentiment analysis, and topic classification.
In this project, we will build a text classification program that distinguishes spam from ham messages using the Naive Bayes algorithm.
</p>

In [1]:
# Download data
!gdown --id 1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R

Downloading...
From: https://drive.google.com/uc?id=1N7rk-kfnDFIGMeX0ROVTjKh71gcgx-7R
To: /content/2cls_spam_text_cls.csv
100% 486k/486k [00:00<00:00, 73.3MB/s]


In [2]:
import string
import nltk
nltk.download('stopwords')
nltk.download('punkt')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [6]:
# Read data
DATASET_PATH = '/content/2cls_spam_text_cls.csv'
df = pd.read_csv(DATASET_PATH)

messages = df['Message'].values.tolist()
labels = df['Category'].values.tolist()

#### **Preprocessing features**
<p align="justify">
The content of the messages is diverse, made up of many different combinations of words and characters. Some messages contain abbreviations, meaningless words, or several inflected forms of the same word. It is therefore important to normalize the text in advance. In this section, we will work through the following pipeline:

Message ⟶ Lowercase ⟶ Punctuation Removal ⟶ Tokenizer ⟶ Remove Stopwords ⟶ Stemming
</p>

In [7]:
# Preprocessing helpers
def lowercase(text):
    return text.lower()


def punctuation_removal(text):
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


def tokenize(text):
    return nltk.word_tokenize(text)


def remove_stopwords(tokens):
    # A set makes each membership test O(1) instead of scanning a list
    stop_words = set(nltk.corpus.stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]


def stemming(tokens):
    stemmer = nltk.stem.PorterStemmer()
    return [stemmer.stem(token) for token in tokens]


def preprocess_text(text):
    text = lowercase(text)
    text = punctuation_removal(text)
    tokens = tokenize(text)
    tokens = remove_stopwords(tokens)
    tokens = stemming(tokens)
    return tokens


messages = [preprocess_text(message) for message in messages]

In [11]:
# Create a dictionary
def create_dictionary(messages):
    dictionary = []
    for tokens in messages:
        for token in tokens:
            if token not in dictionary:
                dictionary.append(token)
    # Sort so that the feature column order is deterministic
    dictionary = sorted(dictionary)
    return dictionary


dictionary = create_dictionary(messages)

Next, we create the features for each message by counting how often each dictionary word occurs in it. The result is one count vector per message, with one entry per dictionary word.

In [12]:
def create_features(tokens, dictionary):
    features = np.zeros(len(dictionary))
    for token in tokens:
        if token in dictionary:
            features[dictionary.index(token)] += 1
    return features


X = np.array([create_features(tokens, dictionary) for tokens in messages])
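
The counting step above calls `dictionary.index(token)` for every token, which scans the list each time. As a minimal sketch, assuming the same sorted-list dictionary, the lookup can be replaced by a token-to-index mapping built once. The names `build_index`, `create_features_fast`, and the toy data are hypothetical and not part of the original notebook.

```python
import numpy as np


def build_index(dictionary):
    # Hypothetical helper: map each token to its column position once
    return {token: i for i, token in enumerate(dictionary)}


def create_features_fast(tokens, token_to_index):
    # Same count vector as create_features, but with O(1) lookups
    features = np.zeros(len(token_to_index))
    for token in tokens:
        idx = token_to_index.get(token)
        if idx is not None:  # tokens outside the dictionary are ignored
            features[idx] += 1
    return features


# Toy data for illustration only (not taken from the dataset)
toy_dictionary = ['free', 'meet', 'prize', 'win']
toy_index = build_index(toy_dictionary)
toy_vector = create_features_fast(
    ['win', 'free', 'prize', 'win', 'unknown'], toy_index
)
print(toy_vector)  # [1. 0. 1. 2.]
```

The resulting vector has the same layout as the one produced by `create_features`, so it could be used as a drop-in replacement if feature creation becomes slow on larger datasets.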

In [13]:
# Preprocessing the labels
le = LabelEncoder()
y = le.fit_transform(labels)
print(f'Classes: {le.classes_}')
print(f'Encoded labels: {y}')

Classes: ['ham' 'spam']
Encoded labels: [0 0 1 ... 0 0 0]


#### **Splitting train/val/test**
<p align='justify'>
To train a model, we do not use the whole dataset for training. Instead, we split it into three subsets: Train, Validation, and Test, in a 7/2/1 ratio. To make the results reproducible across runs, we also fix the random SEED.
</p>

In [14]:
VAL_SIZE = 0.2
TEST_SIZE = 0.125
SEED = 0

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=VAL_SIZE,
    shuffle=True,
    random_state=SEED
)

X_train, X_test, y_train, y_test = train_test_split(
    X_train, y_train,
    test_size=TEST_SIZE,
    shuffle=True,
    random_state=SEED
)
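
It may not be obvious why `TEST_SIZE = 0.125` yields a 10% test set. The second split is applied to the 80% that remains after the validation split, so the fractions multiply. The short check below is only an arithmetic sketch; the variable names other than `VAL_SIZE` and `TEST_SIZE` are illustrative, and the actual subset sizes can differ by a sample because scikit-learn rounds to whole samples.

```python
VAL_SIZE = 0.2
TEST_SIZE = 0.125

# First split: 20% of the full dataset goes to validation
val_fraction = VAL_SIZE
remaining = 1 - VAL_SIZE

# Second split: 12.5% of the remaining 80% goes to test
test_fraction = remaining * TEST_SIZE
train_fraction = remaining - test_fraction

print(round(train_fraction, 3), round(val_fraction, 3), round(test_fraction, 3))
# 0.7 0.2 0.1
```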

#### **Train the Model**
<p align='justify'>
In this part, we use the scikit-learn library and fit a Gaussian Naive Bayes model on the training set.
</p>

In [15]:
model = GaussianNB()
print('Start Training...')
model = model.fit(X_train, y_train)
print('Training completed!')

Start Training...
Training completed!


#### **Model Evaluation**
<p align='justify'>
After training, the next step is to evaluate the model's performance. We run the trained model on the Validation and Test sets and use the accuracy score to assess it.
</p>

In [16]:
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

val_accuracy = accuracy_score(y_val, y_val_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f'Val accuracy: {val_accuracy}')
print(f'Test accuracy: {test_accuracy}')

Val accuracy: 0.8816143497757848
Test accuracy: 0.8602150537634409
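
For reference, the accuracy score is the fraction of predictions that match the true labels. The sketch below reproduces that computation with NumPy on made-up labels; the `accuracy` helper and the toy arrays are hypothetical and only illustrate the metric.

```python
import numpy as np


def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Toy labels for illustration only: 3 of 5 predictions are correct
toy_true = [0, 0, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 0]
print(accuracy(toy_true, toy_pred))  # 0.6
```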


#### **Prediction**
<p align='justify'>
We can now use this model on a new message. The message has to go through the same steps from scratch: preprocessing, creating features, and passing them to the model for prediction. The model outputs 0 or 1, so we use inverse_transform() to recover the original label.
</p>

In [17]:
def predict(text, model, dictionary):
    processed_text = preprocess_text(text)
    features = create_features(processed_text, dictionary)
    features = np.array(features).reshape(1, -1)
    prediction = model.predict(features)
    # le is the LabelEncoder fitted on the labels earlier in the notebook
    prediction_cls = le.inverse_transform(prediction)[0]
    return prediction_cls

test_input = 'I am actually thinking a way of doing something useful'
prediction_cls = predict(test_input, model, dictionary)
print(f'Prediction: {prediction_cls}')

Prediction: ham
