## Introduction
Natural Language Processing (NLP) has gained significant attention in the field of artificial intelligence due to the rise of technologies like ChatGPT, making it one of the most prominent and dynamic areas of research and application today.

NLP involves the analysis, understanding and manipulation of human language by computers to extract meaningful insights, for example through sentiment analysis or text classification. Here, we employ NLP techniques to perform text classification.

Text classification can be performed through either supervised or unsupervised learning. In supervised learning, the model has a list of targets (or labels) to check its predictions against. In unsupervised learning, the observations are partitioned into groups based only on similarity scores computed from the features.

In this exercise, we aim to perform both supervised and unsupervised learning on an SMS text corpus that contains ham and spam messages. The goal is to classify a message as ham or spam based solely on the words present in it.

Link to dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset 

## Problem Statement
Will we be able to classify text messages (SMS) into ham or spam based solely on the words in the text?

In [5]:
# Import libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")

In [6]:
# Read in dataset
df = pd.read_csv("../dataset/nus_sms_corpus/spam.csv", 
                 usecols=[0, 1],
                 encoding="ISO-8859-1")
df = df.rename(columns={"v1":"label",
                        "v2":"text"})

sms = df["text"].values

In [None]:
# View first few rows of data set
df.head()

In [124]:
# [skip] Read in output from previous run
df = pd.read_csv("../dataset/nus_sms_corpus/spam2.csv")
corpus_w_trigrams2 = df["trigrams_list"].values

# Subset out the corpus with trigrams and convert to list of lists
import ast
literal_w_trigrams = df["trigrams_list"].apply(ast.literal_eval)
corpus_w_trigrams = list(literal_w_trigrams.values)

sms = df["text"].values

## Data Cleaning

### Text Formatting
The dataset consists of only two columns: the targets (or labels) and the text. The text needs the following formatting steps:
1. Strip accents by normalizing the text to plain ASCII.
2. Convert the text to lowercase.
3. Remove punctuation.
4. Remove stop words, which are very common and carry little meaning.
5. Collapse runs of whitespace between words into a single space.
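
The five steps can be traced on a single message. The sketch below uses only the standard library and a tiny, hypothetical stop-word set (`TOY_STOPWORDS`) in place of the NLTK list, so its output is illustrative rather than what the full pipeline would produce.

```python
import re
from unicodedata import normalize

# Hypothetical stop-word set for illustration; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"you", "have", "a", "to", "the", "now"}

def clean_text(text):
    # 1. Strip accents: decompose characters, then drop anything that is not ASCII
    ascii_text = normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 2. Lowercase
    lowered = ascii_text.lower()
    # 3. Remove punctuation (non-word, non-space characters) and underscores
    no_punc = re.sub(r"[^\w\s]|_", "", lowered)
    # 4. Drop stop words; 5. split() and join() collapse whitespace runs to one space
    kept = [word for word in no_punc.split() if word not in TOY_STOPWORDS]
    return " ".join(kept)

cleaned = clean_text("Congrats!!  You have WON a   caf\u00e9 voucher. Call now to claim")
print(cleaned)
```

Lowercasing before the stop-word lookup matters: a stop-word list that contains only lowercase entries would otherwise miss "You" or "WON".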

### Lemmatization
Text lemmatization is the process of reducing words to their base form, known as the lemma. Lemmatization retains the meaning of the reduced words. This lowers the number of unique words in the corpus, which makes the NLP process more manageable: the vocabulary shrinks while the feature representation is maintained.
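
To make the effect on vocabulary size concrete, here is a minimal sketch with a hypothetical lookup table (`TOY_LEMMAS`). It stands in for a real lemmatizer such as NLTK's `WordNetLemmatizer`, which relies on a dictionary and part-of-speech tags rather than a hand-written mapping.

```python
# Hypothetical lemma lookup for illustration only
TOY_LEMMAS = {"calling": "call", "called": "call", "calls": "call",
              "prizes": "prize", "won": "win", "winning": "win"}

def toy_lemmatize(tokens):
    # Map each token to its lemma, leaving unknown tokens unchanged
    return [TOY_LEMMAS.get(token, token) for token in tokens]

tokens = ["called", "calling", "calls", "prize", "prizes", "won", "winning"]
lemmas = toy_lemmatize(tokens)

print(len(set(tokens)), "unique tokens before lemmatization")
print(len(set(lemmas)), "unique tokens after lemmatization")
```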

In [13]:
# Define function to format text
import re
from unicodedata import normalize
from nltk.corpus import stopwords

# nltk.download("wordnet")
# nltk.download('omw-1.4')
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

# Extra stop words are lowercase because tokens are lowercased before the lookup
list_stopwords = set(stopwords.words("english") + ["i", "u", "you", "ur", "2", "4"])

def format_text(text):

    # Remove accents
    rm_accent = normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8", "ignore")
    
    # Remove punctuation from text    
    rm_punc_var = re.sub(r"[^\w\s]", "", rm_accent)
    rm_punc = re.sub(r"_", "", rm_punc_var)       
    
    # Remove frequently occurring words (stop words) from text
    rm_stopwords_list = [word.lower() for word in rm_punc.split() if word.lower() not in list_stopwords]
    rm_stopwords = " ".join(rm_stopwords_list)
    
    # Reduce all whitespace between words to a single space
    new_text = re.sub(r"\s+", " ", rm_stopwords)
    
    return new_text


In [14]:
# Define function to lemmatize text
from nltk import WordNetLemmatizer, pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet

# Instantiate the lemmatizer.
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    
    # Tokenize divides the string into a list of substrings
    tokenized_text = word_tokenize(text)
    
    # Tag the tokenized substrings with the "part of speech tag".
    tagged_token = pos_tag(tokenized_text)
    
    root = []
    
    for token in tagged_token:
        
        
        word = token[0]
        tag = token[1][0]
        
        
        if tag.startswith('J'):
            root.append(lemmatizer.lemmatize(word, wordnet.ADJ))
        elif tag.startswith('V'):
            root.append(lemmatizer.lemmatize(word, wordnet.VERB))
        elif tag.startswith('N'):
            root.append(lemmatizer.lemmatize(word, wordnet.NOUN))
        elif tag.startswith('R'):
            root.append(lemmatizer.lemmatize(word, wordnet.ADV))
        else:          
            root.append(word)
    
    # Remove single character word
    root = [word for word in root if len(word) > 1]
    
    return root             # return list of strings in a tokenized format

In [15]:
# Format and lemmatize text
format_sms = list(map(format_text, sms))
lemmatize_sms = list(map(lemmatize_text, format_sms))

## N-gram
In NLP, an n-gram is a contiguous sequence of n words from a given sample of text. An n-gram captures the context of a word by considering the words surrounding it in the sequence. N-grams can help identify keywords or phrases that are indicative of a certain category, in this case whether a text is ham or spam.
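
As a minimal sketch of the definition, the hypothetical helper `ngrams` below slides a window of size n over a token list and joins each window with an underscore, the same delimiter that appears in the phrase tokens later in this notebook. It enumerates every n-gram, whereas the phrase model used below keeps only the pairs that co-occur often enough.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens; each window is one n-gram
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = ["urgent", "please", "call", "landline"]
toy_bigrams = ngrams(toy_tokens, 2)
toy_trigrams = ngrams(toy_tokens, 3)

print(toy_bigrams)
print(toy_trigrams)
```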

The number of phrases generated from a given corpus can be tuned empirically so that the phrases make up a chosen percentage of the total number of unique words. In this project, the number of bigrams is ~4% of the number of unigrams and the number of trigrams is ~1%.
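
The two knobs that control how many phrases are returned are `min_count` and `threshold`. The sketch below shows how they interact, assuming the default scoring formula as we understand it from gensim's `Phrases` (worth checking against the gensim documentation). The function name `phrase_score` and all counts are hypothetical.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    # Higher when the pair co-occurs more often than the individual word counts suggest
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts, not taken from the SMS corpus
score = phrase_score(count_a=40, count_b=25, count_ab=15, vocab_size=8000, min_count=5)

# A pair is kept as a phrase only when its score exceeds the threshold
keep = score > 3
print(score, keep)
```

Raising either `min_count` or `threshold` lowers the number of pairs that pass, which matches the counts recorded in the comments of the cells below.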

In [16]:
# List of unigrams
unigrams = set([token for sms in lemmatize_sms for token in sms])

In [17]:
# Create bigram model
from gensim.models import Phrases
from gensim.models.phrases import Phraser

# total no. of sms = 5572
# total no. of unigrams = 8469

# bigrams to represent ~4% of total unigrams

bigram_model = Phrases(lemmatize_sms, min_count=5, threshold=3)

# min_count=50, threshold=5         # returns 1 bigram
# min_count=20, threshold=5         # returns 17 bigrams
# min_count=10, threshold=5         # returns 90 bigrams
# min_count=10, threshold=10        # returns 75 bigrams
# min_count=5, threshold=5          # returns 287 bigrams
# min_count=5, threshold=3          # returns 318 bigrams       [use this]

# Apply bigram model to corpus (build the Phraser once and reuse it)
bigram_phraser = Phraser(bigram_model)
corpus_w_bigrams = [bigram_phraser[sms] for sms in lemmatize_sms]

In [71]:
# Create trigram model 

# trigrams to represent ~1% of total unigrams

trigram_model = Phrases(corpus_w_bigrams, min_count=3, threshold=5)

# min_count=5, threshold=5          # returns 23 trigrams
# min_count=5, threshold=3          # returns 24 trigrams
# min_count=3, threshold=3          # returns 102 trigrams
# min_count=3, threshold=5          # returns 96 trigrams       [use this]

# Apply trigram model to corpus_w_bigrams (build the Phraser once and reuse it)
trigram_phraser = Phraser(trigram_model)
corpus_w_trigrams = [trigram_phraser[sms] for sms in corpus_w_bigrams]

In [None]:
# Check corpus_w_bigrams & corpus_w_trigrams
bigrams = set()
for sms in corpus_w_bigrams:
    for word in sms:
        if len(word.split("_")) == 2:
            bigrams.add(word)
            
trigrams = set()
for sms in corpus_w_trigrams:
    for word in sms:
        if len(word.split("_")) == 3:
            trigrams.add(word)
            
print(bigrams)            
print(trigrams)

In [20]:
# Define function to count the texts each token occurs in and its percentage of the total count
def count_token(corpus):
    
    # Count the number of texts in the corpus that contain each token
    token_dict = {}
    
    for text in corpus:
        for token in set(text):
            if token in token_dict:
                token_dict[token] += 1
            else:
                token_dict[token] = 1
    
    # Convert dictionary to dataframe for plot
    token_df = (pd.DataFrame(token_dict, 
                             index=[0]).T
                                       .reset_index()
                                       .rename(columns={"index":"token", 
                                                        0:"count"}))
    
    # Calculate count percentage 
    total_token_count = token_df["count"].sum()
    token_df["count_perc"] = token_df["count"] / total_token_count * 100
    
    
    return token_df

In [21]:
# Determine frequently occurring tokens
df_unigrams_count = count_token(lemmatize_sms).sort_values("count_perc", ascending=False)
df_trigrams_count = count_token(corpus_w_trigrams).sort_values("count_perc", ascending=False)

# Top 10 most frequently occurring tokens (data frames are already sorted)
top_unigrams = df_unigrams_count.head(10)
top_trigrams = df_trigrams_count.head(10)

# Unigram counts are mostly higher than bigram or trigram counts,
# so filter for trigrams only to identify the most frequently occurring ones
only_trigrams_mask = df_trigrams_count["token"].apply(lambda x: len(x.split("_")) == 3)
only_trigrams = df_trigrams_count[only_trigrams_mask]
top_only_trigrams = only_trigrams.sort_values("count_perc", ascending=False).head(10)

In [22]:
# Group most frequently occurring words by ham or spam
top_unigram_tokens = top_unigrams["token"].values
top_trigram_tokens = top_trigrams["token"].values
top_trigram_tokens_only = top_only_trigrams["token"].values

# Append list of unigrams and trigrams back into the data frame
df["unigrams_list"] = lemmatize_sms
df["trigrams_list"] = corpus_w_trigrams

In [23]:
# Calculate the frequency of token occurrence in each ham and spam group
def search_token(top_token_list, column):
    
    tokens_df = pd.DataFrame()
    
    for token in top_token_list:
        
        # Returns True/False if token present in text
        token_pres = df[column].apply(lambda x: token in x)
        df["token_pres"] = token_pres
        
        # Count occurrences of token in ham and spam group
        token_df = df.groupby("label").agg(token_pres_count = ("token_pres", "sum"),
                                           label_count = ("token_pres", "count"))
        token_df["token_perc"] = token_df["token_pres_count"] / token_df["label_count"] * 100
        token_df["token"] = [token] * 2
        
        token_df.reset_index(inplace=True)
        
        # Collect outputs into data frame
        tokens_df = pd.concat([tokens_df, token_df], axis=0)
    
    return tokens_df

In [24]:
# Data frame for horizontal stacked barplot for unigram
unigrams_stacked = search_token(top_unigram_tokens, "unigrams_list")
unigrams_stacked.drop(["token_pres_count", "label_count"], axis=1, inplace=True)

# Manually arrange barplot output
token_seq = ["like", "dont", "ill", "know", "ok", "call", "come", "im", "go", "get"]
unigrams_stacked["token"] = pd.Categorical(unigrams_stacked["token"],
                                           token_seq)

unigrams_stacked.sort_values("token", inplace=True)

weight_counts = {
    "ham": unigrams_stacked[unigrams_stacked["label"] == "ham"]["token_perc"].values,
    "spam": unigrams_stacked[unigrams_stacked["label"] == "spam"]["token_perc"].values,
}

In [25]:
# Data frame for horizontal stacked barplot for trigram
trigrams_stacked = search_token(top_trigram_tokens_only, "trigrams_list")
trigrams_stacked.drop(["token_pres_count", "label_count"], axis=1, inplace=True)

# Manually arrange barplot output
token_seq = ['log_onto_httpwwwurawinnercom', 'land_line_claim',
             'urgent_please_call', 'valid_12hrs_150ppm',
             'match_please_call', 'reply_call_08000930705',
             'good_morning_dear', 'pls_send_message',
             'happy_new_year', 'im_gon_na']
trigrams_stacked["token"] = pd.Categorical(trigrams_stacked["token"],
                                           token_seq)

trigrams_stacked.sort_values("token", inplace=True)

weight_counts = {
    "ham": trigrams_stacked[trigrams_stacked["label"] == "ham"]["token_perc"].values,
    "spam": trigrams_stacked[trigrams_stacked["label"] == "spam"]["token_perc"].values,
}


In [26]:
# Measure length of text for histogram
df["unigrams_list"] = lemmatize_sms
df["unigrams_len"] = df["unigrams_list"].apply(lambda x: len(x))

## Data Visualization
Code for the plots used in the presentation.

In [None]:
# Import libraries for plotting
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from matplotlib.ticker import FixedLocator, FuncFormatter

In [None]:
# matplotlib reset
matplotlib.rc_file_defaults()

In [None]:
# ham and spam count barplot
# Name the columns explicitly so the plot does not depend on the pandas version
count_df = df["label"].value_counts().rename_axis("label").reset_index(name="count")

sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (2, 4)})

ax = sns.barplot(count_df, x="label", y="count", linewidth=0);

# Setting fonts for tick labels
ax.set_yticklabels([])
ax.set_xticklabels(ax.get_xticklabels(), family="serif");

# Set title
ax.set_title("ham & spam Count", family="serif", weight="bold", size=14);

# Remove axis labels
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

# Removing spines
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_color("#CBCACA")

# Adding gridlines in the background
ax.yaxis.grid(True, which="major", linestyle="--", alpha=0.05, color="#0C3578")

# Label sms label count directly
for patch, ncount in zip(ax.patches, ["4825", "747"]):
    ax.annotate(ncount, 
                (patch.get_x(), patch.get_height()),
                ha = "center",
                xytext=(patch.get_x() + patch.get_width()/2, patch.get_height()-300),
                size=12,
                color="#F8F8F8",
                family="serif")

# Set plot background color
ax.set_facecolor("#F8F8F8")


In [None]:
# Horizontal stacked barplot of trigram proportions in ham and spam groups
tokens = trigrams_stacked[trigrams_stacked["label"] == "spam"]["token"].values

sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (4, 1.5)})

fig, ax = plt.subplots()
left = np.zeros(10)

for label, weight_count in weight_counts.items():
    
    if label == "spam":
        color = "#CB8962"
    else:
        color = "#5975A4"
    
    p = ax.barh(tokens, weight_count, label=label, left=left, # type: ignore
                linewidth=0, color=color)
    left += weight_count # type: ignore
    
    
# Setting fonts for tick labels
ax.set_yticklabels(ax.get_yticklabels(), family="serif", size=7)
ax.set_xticklabels(ax.get_xticklabels(), family="serif", size=6)


# Set title
ax.set_title("Proportion of Trigrams in ham and spam",
             family="serif", size=10, weight="bold")

# Shift tick labels to top
ax.tick_params(top=True, labeltop=True, bottom=False, labelbottom=False, color="#CBCACA")
ax.tick_params(axis="both", which="major", pad=0.1, length=2)

# Set limit
ax.set_xlim(0, 2)

# Remove axis labels
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

ax.xaxis.set_major_formatter(mtick.PercentFormatter())

# Removing spines
ax.spines["right"].set_visible(False)
ax.spines["top"].set_color("#CBCACA")
ax.spines["top"].set_linewidth(0.8)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_visible(False)

# Adding gridlines in the background
ax.yaxis.grid(False)
ax.xaxis.grid(True, which="major", linestyle="--", alpha=0.05, color="#0C3578")

ax.set_facecolor("#F8F8F8")

In [None]:
# Horizontal stacked barplot of unigram proportions in ham and spam groups
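tokens = unigrams_stacked[unigrams_stacked["label"] == "spam"]["token"].values

# weight_counts was last assigned from the trigram data, so rebuild it from the unigram data
weight_counts = {
    "ham": unigrams_stacked[unigrams_stacked["label"] == "ham"]["token_perc"].values,
    "spam": unigrams_stacked[unigrams_stacked["label"] == "spam"]["token_perc"].values,
}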

sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (4, 1.5)})

fig, ax = plt.subplots()
left = np.zeros(10)

for label, weight_count in weight_counts.items():
    
    if label == "spam":
        color = "#CB8962"
    else:
        color = "#5975A4"
    
    p = ax.barh(tokens, weight_count, label=label, left=left, # type: ignore
                linewidth=0, color=color)
    left += weight_count # type: ignore
    
    
# Setting fonts for tick labels
ax.set_yticklabels(ax.get_yticklabels(), family="serif", size=7)
ax.set_xticklabels(ax.get_xticklabels(), family="serif", size=6)


# Set title
ax.set_title("Proportion of Unigrams in ham and spam",
             family="serif", size=10, weight="bold")

# Shift tick labels to top
ax.tick_params(top=True, labeltop=True, bottom=False, labelbottom=False, color="#CBCACA")
ax.tick_params(axis="both", which="major", pad=0.1, length=2)

# Set limit
ax.set_xlim(0, 50)

# Remove axis labels
ax.xaxis.label.set_visible(False)
ax.yaxis.label.set_visible(False)

ax.xaxis.set_major_formatter(mtick.PercentFormatter())

# Removing spines
ax.spines["right"].set_visible(False)
ax.spines["top"].set_color("#CBCACA")
ax.spines["top"].set_linewidth(0.8)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_visible(False)

# Adding gridlines in the background
ax.yaxis.grid(False)
ax.xaxis.grid(True, which="major", linestyle="--", alpha=0.05, color="#0C3578")

ax.set_facecolor("#F8F8F8")

In [None]:
# Histogram of length of text
sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (4, 3)})

ax = sns.histplot(data=df, x="unigrams_len", 
                  hue="label", palette=["#5975A4", "#CB8962"],
                  kde=True, line_kws={"linewidth": 0.7},
                  alpha=0.7, legend=False,
                  linewidth=0.1);

# Adjust grid lines
ax.xaxis.grid(False)
ax.yaxis.grid(True, which="major", linestyle="--", alpha=0.1, linewidth=0.5, color="#0C3578")

# Edit axis tick labels
ax.yaxis.set_major_locator(FixedLocator([150, 300, 450, 600]))
ax.set_yticklabels(ax.get_yticklabels(), family="serif", size=9)
ax.set_xticklabels(ax.get_xticks(), family="serif", size=10)
ax.xaxis.set_major_formatter(FuncFormatter(lambda x, pos: int(x)))

# Adjust titles
ax.set_title("Histogram of Length of Text", family="serif", size=12, weight="bold")
ax.set_xlabel("Length of Text", family="serif", size=10)
ax.yaxis.label.set_visible(False);

# Adjust spines
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_color("#CBCACA")
ax.spines["bottom"].set_linewidth(0.75)


ax.set_facecolor("#F8F8F8")

## Create the Vocabulary with TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure to evaluate the importance of a word in a document within a corpus. 
* Term Frequency measures how frequently the word occurs in the document.
* Inverse Document Frequency measures how informative a word is based on how rarely it appears across documents: words that occur in many documents receive a lower weight.
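
The two components can be combined by hand on a toy corpus. The sketch below is a minimal illustration that uses the raw count as the term frequency and log2(N / df) as the inverse document frequency. The documents and the helper name `toy_tfidf` are hypothetical. gensim's `TfidfModel` additionally normalizes each document vector by default, so its values differ in scale from the ones computed here.

```python
import math
from collections import Counter

toy_docs = [["free", "prize", "call"],
            ["call", "later"],
            ["free", "free", "entry"]]

n_docs = len(toy_docs)
# Document frequency: number of documents that contain each token
doc_freq = Counter(token for doc in toy_docs for token in set(doc))

def toy_tfidf(doc):
    term_freq = Counter(doc)
    # Raw count times log2(N / df); a token present in every document would score 0
    return {token: count * math.log2(n_docs / doc_freq[token])
            for token, count in term_freq.items()}

weights = toy_tfidf(toy_docs[2])
print(weights)
```

Although "free" appears twice in the third document, "entry" receives the higher weight because it occurs in only one of the three documents.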

TF-IDF for each token (n-gram) was calculated, and the mean of the values for each text was determined and plotted against the labels. The boxplot shows that the mean TF-IDF values are generally higher for ham messages than for spam messages. This suggests that ham messages carry more distinctive words than spam messages, possibly because the words that occur in spam messages are repetitive.

The TF-IDF means will be another feature added for the classification downstream. 

In [38]:
# Calculate the TF-IDF using the gensim library
from gensim.corpora.dictionary import Dictionary
from gensim.models.tfidfmodel import TfidfModel

# Initialize a Dictionary that maps a token with an integer ID.
vocab = Dictionary(corpus_w_trigrams)
bow = [vocab.doc2bow(sms) for sms in corpus_w_trigrams]

# Calculate the TF-IDF
tfidf = TfidfModel(bow)
corpus_tfidf = tfidf[bow]

# Find TF-IDF mean for each text (empty texts yield NaN, which is replaced with 0 later)
df["tfidf"] = list(corpus_tfidf)
df["tfidf_mean"] = df["tfidf"].apply(lambda x: np.mean([i[1] for i in x]))

In [None]:
# Boxplot for TF-IDF Mean
sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (3, 4)})

ax = sns.boxplot(data=df, x="label", y="tfidf_mean");

# Plot title 
ax.set_title("TF-IDF Mean Against Label", family="serif",
             size=14, weight="bold")

# Remove axis labels
ax.xaxis.label.set_visible(False)
ax.set_ylabel("TF-IDF Mean", family="serif", size=10)

# Set axis tick labels
ax.set_yticklabels(ax.get_yticklabels(), family="serif", size=9)
ax.set_xticklabels(ax.get_xticklabels(), family="serif", size=10)

# Removing spines
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["bottom"].set_color("#CBCACA")

# Set grid
ax.xaxis.grid(False)
ax.yaxis.grid(True, which="major", linestyle="--", 
              alpha=0.1, linewidth=0.5, color="#0C3578")

ax.set_facecolor("#F8F8F8")

## Word Embedding
Word embedding is a technique for representing words in numerical form. It maps each word to a dense vector that captures semantic and syntactic information such as the word's context, its part of speech, and its relatedness to other words in a corpus.
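
Relatedness between two embedded words is commonly measured with cosine similarity. The sketch below uses hypothetical 3-dimensional vectors (`toy_vectors`) purely for illustration. Trained embeddings have far more dimensions and their values are learned from data.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, not taken from any trained model
toy_vectors = {"free": [0.9, 0.1, 0.0],
               "prize": [0.8, 0.2, 0.1],
               "later": [0.0, 0.2, 0.9]}

sim_related = cosine_similarity(toy_vectors["free"], toy_vectors["prize"])
sim_unrelated = cosine_similarity(toy_vectors["free"], toy_vectors["later"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

Vectors that point in a similar direction score close to 1, while vectors that share little score close to 0.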

Word embeddings are usually trained on a large text corpus (such as the Google News dataset) to learn relationships and co-occurrence frequencies between words. The corpus has to be very large (about 100 billion words for Google News) for the learned connections between words to be meaningful. The SMS corpus we have here (about 8,000 unique words) is not large enough for a meaningful word embedding, but we can still proceed with it.

In [43]:
# Word embedding based only on sms corpus
from gensim.models.word2vec import Word2Vec

np.set_printoptions(suppress=True)

word_vec = Word2Vec(corpus_w_trigrams,
                    vector_size=100,        # size of your vector that represents the word
                    window=3,               # number of words away from token
                    min_count=1,            # min occurrence of word
                    epochs=5,
                    seed=42)

vectors = word_vec.wv                       # return an object KeyedVectors

In [44]:
# Loading Google News pretrained word2vec model
from gensim.models import KeyedVectors

file_path = "../dataset/nus_sms_corpus/GoogleNews-vectors-negative300.bin.gz"

# Only unigrams in this language model
gvectors = KeyedVectors.load_word2vec_format(file_path, binary=True)

## Supervised Learning
Supervised learning will be performed in two different ways:
1. using the TF-IDF sparse matrix
2. using the word embedding dense vectors

Two additional features, Text Length and TF-IDF Mean, will be appended to the feature matrix to check whether they result in better classification.
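
For the dense-vector route, each message has to become a single fixed-length row. A minimal sketch of one way to do this is to average the vectors of the tokens that have an embedding and then append the two extra features. The 4-dimensional `toy_vectors` and the helper `message_features` are hypothetical and only illustrate the shape of the result.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings; trained models use 100 or 300 dimensions
toy_vectors = {"free": np.array([0.9, 0.1, 0.0, 0.2]),
               "prize": np.array([0.7, 0.3, 0.2, 0.0])}

def message_features(tokens, tfidf_mean, dim=4):
    known = [toy_vectors[token] for token in tokens if token in toy_vectors]
    # Fall back to a zero vector when no token has an embedding
    avg = np.mean(known, axis=0) if known else np.zeros(dim)
    # Append the two hand-crafted features: TF-IDF mean and text length
    return np.append(avg, [tfidf_mean, len(tokens)])

features = message_features(["free", "prize", "txt"], tfidf_mean=0.5)
print(features)
```

Because text length is on a much larger scale than the embedding values, scaling the combined matrix before training (as done with `MinMaxScaler` further down) keeps one feature from dominating.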

In [48]:
# Import libraries for ML training and evaluation
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, classification_report

In [91]:
# Binary-encode target labels: spam = 1, ham = 0 (for some ML models)
targets = np.where(df["label"] == "spam", 1, 0)

# convert list of trigrams back to text
corpus_text = [" ".join(tokens) for tokens in corpus_w_trigrams]

In [None]:
# Multinomial Naive Bayes w sparse matrix
x_train, x_test, y_train, y_test = train_test_split(corpus_text,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

vectorizer = CountVectorizer()

x_train_vect = vectorizer.fit_transform(x_train)
x_test_vect = vectorizer.transform(x_test)

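# Re-weight the counts with TF-IDF so the model is trained on TF-IDF features
tfidf = TfidfTransformer()
x_train_vect = tfidf.fit_transform(x_train_vect)
x_test_vect = tfidf.transform(x_test_vect)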

nb = MultinomialNB()
nb.fit(x_train_vect, y_train);

y_pred_label = nb.predict(x_test_vect)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))


In [None]:
# Improving recall by lowering the decision threshold from 0.5 to 0.3 (trades precision for recall)
y_pred_threshold = (nb.predict_proba(x_test_vect)[:,1] >= 0.3).astype(int)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_threshold))
print("classification report: \n", classification_report(y_test, y_pred_threshold))

In [None]:
# Improving recall scores by re-sampling
# Note: up-sampling before the train/test split can place copies of the same spam
# message in both sets, which may inflate the test scores
from sklearn.utils import resample
import ast

resample_spam = resample(df[df["label"]=="spam"], replace=True, n_samples=50, random_state=42)
df_upsampled = pd.concat([df, resample_spam])

#literal_w_trigrams = df_upsampled["trigrams_list"].apply(ast.literal_eval)
#corpus_w_trigrams = list(literal_w_trigrams.values)
corpus_w_trigrams2 = df_upsampled["trigrams_list"]

# Compute the targets and corpus text for the upsampled data under separate names,
# so the original `targets` and `corpus_text` stay aligned with the other features
targets_up = np.where(df_upsampled["label"] == "spam", 1, 0)
corpus_text_up = [" ".join(tokens) for tokens in corpus_w_trigrams2]

# Re-train the model with upsampled dataset
x_train, x_test, y_train, y_test = train_test_split(corpus_text_up,
                                                    targets_up,
                                                    train_size=0.8,
                                                    random_state=23)

vectorizer = CountVectorizer()

x_train_vect = vectorizer.fit_transform(x_train)
x_test_vect = vectorizer.transform(x_test)

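# Re-weight the counts with TF-IDF so the model is trained on TF-IDF features
tfidf = TfidfTransformer()
x_train_vect = tfidf.fit_transform(x_train_vect)
x_test_vect = tfidf.transform(x_test_vect)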

nb = MultinomialNB()
nb.fit(x_train_vect, y_train);

y_pred_label = nb.predict(x_test_vect)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# Using TfidfVectorizer (can try tuning hyperparameters)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

tvec = TfidfVectorizer()
tvec_x_train = tvec.fit_transform(x_train)
tvec_x_test = tvec.transform(x_test)

nb = MultinomialNB()
nb.fit(tvec_x_train, y_train)

# scores = cross_val_score(nb, tvec_x_train, y_train, cv=5)
y_pred_label = nb.predict(tvec_x_test)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [72]:
# Convert text to dense vector using word vectors trained from SMS corpus

vector_lists = []

for sms in corpus_w_trigrams:
    
    # Index the KeyedVectors directly to look up a token's vector
    vector_list = [vectors[token] for token in sms if token in vectors]
    
    if vector_list:
        vector_avg = np.mean(vector_list, axis=0)
        # Note: squaring discards the sign of each component
        vector_avg = np.square(vector_avg)
    else:
        # After text formatting, an sms that contained only stopwords is reduced to empty text
        vector_avg = np.zeros(100)
    
    vector_lists.append(vector_avg)

In [73]:
# Convert text to dense vector using word vectors trained from Google News Database

gvector_lists = []

for sms in lemmatize_sms:
    
    gvector_list = []
    for token in sms:
        if token in gvectors:
            gvector_list.append(gvectors[token])
        else:
            # Tokens missing from the Google News vocabulary contribute a zero vector
            gvector_list.append(np.zeros(300))
    
    if gvector_list:
        gvector_avg = np.mean(gvector_list, axis=0)
    else:
        # After text formatting, an sms that contained only stopwords is reduced to empty text
        gvector_avg = np.zeros(300)
    
    gvector_lists.append(gvector_avg)

In [74]:
# Appending text length and TF-IDF mean to the dense vectors for ML training
tfidf_mean = df["tfidf_mean"].values
tfidf_mean = np.nan_to_num(tfidf_mean, nan=0)

unigrams_len = df["unigrams_len"].values

for i in range(len(vector_lists)):
    vector_lists[i] = np.append(vector_lists[i], [tfidf_mean[i], unigrams_len[i]])

for i in range(len(gvector_lists)):
    gvector_lists[i] = np.append(gvector_lists[i], [tfidf_mean[i], unigrams_len[i]])

In [75]:
# Scale vector with MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
vector_scaled = scaler.fit_transform(vector_lists)
gvector_scaled = scaler.fit_transform(gvector_lists)

In [None]:
# Logistic Regression with dense vector
x_train, x_test, y_train, y_test = train_test_split(gvector_scaled,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

y_pred_label = logreg.predict(x_test)
print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# Support Vector Classifier with dense vector
from sklearn.svm import SVC

x_train, x_test, y_train, y_test = train_test_split(gvector_scaled,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

svc = SVC()

svc.fit(x_train, y_train)

y_pred_label = svc.predict(x_test)
print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# xgboost with sparse matrix
from xgboost import XGBClassifier

x_train, x_test, y_train, y_test = train_test_split(corpus_text,        # sparse matrix
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

vectorizer = CountVectorizer()

x_train_vect = vectorizer.fit_transform(x_train)
x_test_vect = vectorizer.transform(x_test)

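# Re-weight the counts with TF-IDF so the model is trained on TF-IDF features
tfidf = TfidfTransformer()
x_train_vect = tfidf.fit_transform(x_train_vect)
x_test_vect = tfidf.transform(x_test_vect)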

xgb = XGBClassifier()
xgb.fit(x_train_vect, y_train)

y_pred_label = xgb.predict(x_test_vect)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# xgboost with dense vector from sms corpus, tuned with GridSearchCV
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

x_train, x_test, y_train, y_test = train_test_split(vector_scaled,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

xgb = XGBClassifier()
xgb.fit(x_train, y_train)
y_pred_label = xgb.predict(x_test)

param_grid = {"max_depth": [3, 4, 5, 6, 7],
              "learning_rate": [0.1, 0.01, 0.001],
              "n_estimators": [50, 100, 200, 300, 400, 500]}
grid = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5)
grid.fit(x_train, y_train)

print(grid.best_params_)
y_pred_label = grid.predict(x_test)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# xgboost with dense vector from sms corpus
from xgboost import XGBClassifier

x_train, x_test, y_train, y_test = train_test_split(vector_scaled,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

xgb = XGBClassifier()
xgb.fit(x_train, y_train)
y_pred_label = xgb.predict(x_test)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

In [None]:
# xgboost with dense vector from google corpus
from xgboost import XGBClassifier

x_train, x_test, y_train, y_test = train_test_split(gvector_scaled,
                                                    targets,
                                                    train_size=0.8,
                                                    random_state=23)

xgb = XGBClassifier()
xgb.fit(x_train, y_train)
y_pred_label = xgb.predict(x_test)

print("confusion matrix: \n", confusion_matrix(y_test, y_pred_label))
print("classification report: \n", classification_report(y_test, y_pred_label))

## Unsupervised Learning
Here we determine whether unsupervised learning can separate the texts into ham and spam based solely on the text in the messages. Target labels are normally unavailable in unsupervised learning, but we have them here, so we can cross-check the K-means cluster assignments against the labels to see how well unsupervised learning works.
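
One caveat when cross-checking: K-means cluster ids are arbitrary, so cluster 0 may correspond to spam in one run and to ham in another. A minimal sketch of a label-agnostic accuracy for the two-cluster case is shown below. The helper name `aligned_accuracy` and the toy arrays are hypothetical.

```python
def aligned_accuracy(cluster_ids, true_labels):
    # Accuracy with the cluster ids used as-is
    direct = sum(c == t for c, t in zip(cluster_ids, true_labels)) / len(true_labels)
    # With two clusters, flipping the ids is the only other possible mapping
    return max(direct, 1 - direct)

toy_cluster_ids = [1, 1, 0, 1, 0, 1]   # hypothetical K-means output
toy_true_labels = [0, 0, 1, 0, 1, 1]   # 1 = spam, 0 = ham
accuracy = aligned_accuracy(toy_cluster_ids, toy_true_labels)
print(accuracy)
```

Here the raw ids agree with the labels only once out of six, but after flipping the mapping five of the six messages are grouped correctly.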

In [77]:
# Import unsupervised learning libraries
from sklearn.cluster import KMeans
from scipy import sparse

In [None]:
# K-means clustering model with dense vector (word embedding)
km = KMeans(n_clusters=2,
            init='k-means++',)
km.fit(np.array(gvector_lists))     # with Google News word vector

unsup_pred = km.predict(np.array(gvector_lists))

# The original split of ham and spam is 4825 and 747 respectively.
np.bincount(km.labels_)

In [None]:
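# Arguments are (y_true, y_pred). K-means cluster ids are arbitrary, so cluster 0 may correspond to spam
print("confusion matrix: \n", confusion_matrix(targets, unsup_pred))
print("classification report: \n", classification_report(targets, unsup_pred))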

In [None]:
# K-means clustering model with sparse matrix (TF-IDF)
vectorizer = CountVectorizer()

# Convert the text length and TF-IDF Mean features into a sparse matrix
stfidf = sparse.csr_matrix(tfidf_mean).transpose()
sunigrams_len = sparse.csr_matrix(unigrams_len).transpose()

corpus_sparse = vectorizer.fit_transform(corpus_text)
tfidf = TfidfTransformer()
tfidf_sparse = tfidf.fit_transform(corpus_sparse);

# Stack the text length and TF-IDF Mean sparse matrix features
tfidf_sparse2 = sparse.hstack((tfidf_sparse, stfidf, sunigrams_len))

km = KMeans(n_clusters=2)
km.fit(tfidf_sparse2)
unsup_pred = km.predict(tfidf_sparse2)
np.bincount(km.labels_)

In [None]:
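# Arguments are (y_true, y_pred). K-means cluster ids are arbitrary, so cluster 0 may correspond to spam
print("confusion matrix: \n", confusion_matrix(targets, unsup_pred))
print("classification report: \n", classification_report(targets, unsup_pred))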

In [None]:
# Plotting out the scatterplot of tfidf_mean and unigrams_len of the actual and predicted targets
new_df = pd.DataFrame({"tfidf_mean": tfidf_mean,
                       "unigrams_len": unigrams_len,
                       "pred_y": unsup_pred,
                       "actual_y": targets})

sns.set_theme(rc={"figure.dpi":300, 'savefig.dpi':300})   # adjust image resolution
sns.set(rc={"figure.facecolor": "#F8F8F8",
            "figure.figsize": (12, 4)})

fig, ax = plt.subplots(ncols=2)
sns.scatterplot(new_df, x="tfidf_mean", y="unigrams_len", hue="pred_y", ax=ax[0]);
sns.scatterplot(new_df, x="tfidf_mean", y="unigrams_len", hue="actual_y", ax=ax[1]);