Binary Text Classification of Ham and Spam Text Messages Using Logistic Regression

This notebook follows the workflow below:

1. Data Preprocessing: 
    -Load the dataset into sentences and labels
    -Split the dataset into training, validation and testing sets 
    -Report the distribution in the form of a table
    -Clean the data of any noise (urls, punctuation, and numbers) & change to lower case
    -Tokenize input text into tokens, including word stemming and stopword removal
    -Build your own TF-IDF feature extractor using the training set
2. Build a logistic regression classifier using L2 regularization
    -Derive the gradient of the objective function of LR with respect to w and b. 
    -Implement logistic regression via initialization, objective function, and gradient descent
    -Implement accuracy, precision, recall and F1 score as test metrics
    -Write a function for SGD and Mini-batch GD
    -Evaluate the model on the test set and report the metrics
3. Cross Validation
    -Implement cross-validation to choose the best value of the regularization hyperparameter lambda, judged on validation performance
4. Conclusion
    -Analyze the results and compare to baseline
5. Create a multiclass classifier for texts by various authors

In [34]:
import pandas as pd
import string

Load the dataset. The SMSSpamCollection file is tab-separated and has no header row, so the column names label and text are supplied explicitly when reading it.

In [15]:
spam_df = pd.read_csv('a1-data/SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])

spam_df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [10]:
spam_df.shape

(5572, 2)

Objective of the split_dataset function:
    -Split the dataset into training, validation and test sets
    -Return each set split into features and labels (this enables easier tokenization later)
    -Shuffle the rows with a fixed seed (random_state=42) so the split is reproducible: the same rows land in the same sets on every run. If the structure of the model is effective, it should learn at a similar rate regardless of the data split, and changing the seed is one way to check this
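
As a minimal sketch of the shuffling and boundary arithmetic described above (the toy frame and variable names here are illustrative assumptions, not part of the notebook's data), shuffling twice with the same seed gives the same row order, and a 60/20/20 split of the 5572 rows reported by spam_df.shape gives these set sizes:

```python
import pandas as pd

# Toy frame standing in for the SMS data (illustrative only).
toy_df = pd.DataFrame({"text": list("abcdefghij")})

# Shuffling twice with the same seed yields the same row order.
first = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
second = toy_df.sample(frac=1, random_state=42).reset_index(drop=True)
same_order = first.equals(second)

# Boundaries for a 60/20/20 split of n = 5572 rows.
# int() truncates, so the test set absorbs the leftover rows.
n = 5572
train_end = int(0.6 * n)
val_end = train_end + int(0.2 * n)
sizes = (train_end, val_end - train_end, n - val_end)

print(same_order, sizes)
```

Because the training and validation sizes are truncated, the test set ends up one row larger than the validation set.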

In [18]:
def split_dataset(df, train_size, val_size, test_size):
    # The three fractions must cover the whole dataset
    assert abs(train_size + val_size + test_size - 1.0) < 1e-9

    # Shuffle with a fixed seed so the split is reproducible
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    n = len(df)
    train_end = int(train_size * n)
    val_end = train_end + int(val_size * n)

    train_df = df.iloc[:train_end]
    val_df = df.iloc[train_end:val_end]
    test_df = df.iloc[val_end:]  # remainder, so no rows are dropped

    # Features stay as one-column DataFrames; labels are Series
    X_train, y_train = train_df[['text']], train_df['label']
    X_val, y_val = val_df[['text']], val_df['label']
    X_test, y_test = test_df[['text']], test_df['label']

    return X_train, X_val, X_test, y_train, y_val, y_test

X_train, X_val, X_test, y_train, y_val, y_test = split_dataset(spam_df, 0.6, 0.2, 0.2)

Output a table showing the number of samples in each class for the training, validation and test sets.

In [28]:
def data_distribution(y_train, y_val, y_test):
    # One column per split, one row per class
    df = pd.DataFrame({
        'Train': y_train.value_counts(),
        'Validation': y_val.value_counts(),
        'Test': y_test.value_counts(),
    }).fillna(0).astype(int)
    return df

data_distribution(y_train, y_val, y_test)

Unnamed: 0,Train,Validation,Test
ham,2898,966,961
spam,445,148,154

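The counts above suggest that spam makes up roughly 13 to 14 percent of each split (for example, 445 of 3343 training messages). A minimal sketch of reporting shares instead of counts follows; class_proportions and the toy labels are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

def class_proportions(y_train, y_val, y_test):
    """Return the share of each class per split (each column sums to 1)."""
    return pd.DataFrame({
        'Train': y_train.value_counts(normalize=True),
        'Validation': y_val.value_counts(normalize=True),
        'Test': y_test.value_counts(normalize=True),
    }).fillna(0.0)

# Toy labels for illustration only (not the SMS data).
toy_train = pd.Series(['ham', 'ham', 'ham', 'spam'])
toy_val = pd.Series(['ham', 'spam'])
toy_test = pd.Series(['ham', 'ham'])

props = class_proportions(toy_train, toy_val, toy_test)
print(props)
```

With an imbalance like this, accuracy alone can look good for a model that predicts ham most of the time, which is one reason precision, recall and F1 appear in the workflow.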

Objective of the clean_text function:
    -Remove punctuation, URLs and numbers
    -Change text to lowercase

In [37]:
def clean_text(X):
    X = X.str.lower()
    # Remove URLs first, while their punctuation is still intact.
    # The pattern http\S+ also matches https links.
    X = X.str.replace(r"http\S+", "", regex=True)
    X = X.str.translate(str.maketrans("", "", string.punctuation))
    X = X.str.replace(r"\d+", "", regex=True)
    return X

# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning
X_train = X_train.assign(text=clean_text(X_train['text']))
X_val = X_val.assign(text=clean_text(X_val['text']))
X_test = X_test.assign(text=clean_text(X_test['text']))
X_train.head()

Unnamed: 0,text
0,squeeeeeze this is christmas hug if u lik my f...
1,and also ive sorta blown him off a couple time...
2,mmm thats better now i got a roast down me id...
3,mm have some kanji dont eat anything heavy ok
4,so theres a ring that comes with the guys cost...
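
To make the cleaning steps concrete, here is a small sketch on two made-up messages. The function name clean_text_demo and the sample strings are assumptions for illustration. The final whitespace-collapsing step is an extra added only to make the printed result easier to read; it is not part of the cleaning described above.

```python
import string
import pandas as pd

def clean_text_demo(X):
    X = X.str.lower()
    X = X.str.replace(r"http\S+", "", regex=True)                    # drop URLs
    X = X.str.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    X = X.str.replace(r"\d+", "", regex=True)                        # drop numbers
    # Extra step for readability only: collapse repeated whitespace.
    return X.str.split().str.join(" ")

sample = pd.Series([
    "WINNER!! Claim 1000 now at http://example.com/win",
    "Ok lar... Joking wif u oni...",
])
cleaned = clean_text_demo(sample)
print(cleaned.tolist())
```

Removing a number or URL from the middle of a message leaves repeated spaces behind. Tokenizing with split() ignores them, so the cleaning itself does not need to collapse them.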


Tokenize the dataset:
    -Split the text into tokens on whitespace
    -Reduce words to their stems
    -Remove stop words (common words that add little semantic value)

In [None]:
STOPWORDS = {
    "a", "an", "the", "and", "or", "but",
    "is", "are", "was", "were", "be",
    "to", "of", "in", "on", "for", "with",
    "that", "this", "it", "as", "at"
}

def tokenize_text(X):
    return X.split()

def remove_stopwords(tokens, stopwords = STOPWORDS):
    return [t for t in tokens if t not in stopwords]

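def step_tokens(token):
    # Stems a single token by stripping the first matching suffix.
    # Longer suffixes come first so "es" and "est" are tried before "s".
    suffixes = ["ing", "est", "es", "ed", "ly", "s"]
    for suf in suffixes:
        # Only strip when at least three characters of the word remain
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token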
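
The three steps can be chained into one pass over a cleaned message. The sketch below is self-contained and uses hypothetical names (DEMO_STOPWORDS, DEMO_SUFFIXES, stem, preprocess) with a shortened stopword list. It illustrates one way to compose the steps and is not necessarily how the rest of the notebook does it.

```python
# Shortened stopword list and suffix list, for illustration only.
DEMO_STOPWORDS = {"a", "the", "is", "to", "in", "for"}
DEMO_SUFFIXES = ["ing", "est", "es", "ed", "ly", "s"]  # longer suffixes first

def stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suf in DEMO_SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[:-len(suf)]
    return token

def preprocess(text):
    tokens = text.split()                                     # tokenize on whitespace
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]   # drop stopwords
    return [stem(t) for t in tokens]                          # stem each remaining token

result = preprocess("free entry in a weekly comp to win the final tickets")
print(result)
# ['free', 'entry', 'week', 'comp', 'win', 'final', 'ticket']
```

A suffix-stripping stemmer this simple will over-stem some words (for example, "times" becomes "tim"), which is usually acceptable for bag-of-words features as long as the same function is applied to the training, validation and test sets.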
