# Group 1
# Document Classification
Adam Gersowitz, Diego Correa, Maria A Ginorio

It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

Here is one example of such data: UCI Machine Learning Repository: Spambase Data Set.

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). More adventurous students are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


## Required Packages

In [17]:
# data processing packages
import string

import os
import numpy as np
import pandas as pd

# nltk packages
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords


# sklearn packages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

## Data
We read the ham and spam data from a tab-delimited text file in our GitHub repository and relabeled the data columns. The shape of our dataframe and a preview of the data can be viewed below.

In [3]:
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df_data = pd.read_csv('https://raw.githubusercontent.com/mgino11/Web_Analytics/main/Datasets/ham_spam_data.txt',
                      on_bad_lines='skip', delimiter="\t", header=None)

# label columns
df_data.columns = ['label','email']

# preview data
print("data shape:",df_data.shape)
df_data.head(5)

data shape: (5572, 2)


| | label | email |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [25]:
df_data.drop_duplicates(inplace=True)

# get new shape
df_data.shape

(5169, 2)
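
As a minimal sketch of what this step does, the toy frame below (made-up rows, not taken from the dataset) shows that `drop_duplicates` removes rows that repeat in every column and keeps the first occurrence. The names `toy` and `deduped` are only illustrative.

```python
import pandas as pd

# Made-up rows for illustration; row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "email": ["ok lar", "free entry", "ok lar", "see you"],
})

# With no arguments, rows are compared on all columns and the first copy is kept.
deduped = toy.drop_duplicates()

print(toy.shape, "->", deduped.shape)
print(list(deduped.index))
```

In the notebook the same call reduces the data from 5572 to 5169 rows, so 403 exact duplicates were removed. The surviving rows keep their original index labels; `reset_index(drop=True)` can renumber them if a contiguous index is needed.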

## Data cleaning

### Stopwords
In this section, we use the English stopword list from nltk to clean the text. We then define a function that removes punctuation and drops stopwords (matched case-insensitively) for cleaning and organizational purposes. The function returns each message as a list of tokens; it does not lowercase or stem the remaining words.

In [20]:
# stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [30]:
import string

# create data cleaning function
def data_cleaning(email):
    #remove punct
    nopunc = [char for char in email if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # remove stopwords
    clean_words = [word for word in nopunc.split() if word.lower() not in stop_words]
    return clean_words

We applied the function to our dataframe. The results of our data_cleaning can be previewed below.

In [31]:
df_data['email'] = df_data['email'].apply(data_cleaning)
df_data.head()

| | label | email |
|---|---|---|
| 0 | ham | [Go, jurong, point, crazy, Available, bugis, n... |
| 1 | ham | [Ok, lar, Joking, wif, u, oni] |
| 2 | spam | [Free, entry, 2, wkly, comp, win, FA, Cup, fin... |
| 3 | ham | [U, dun, say, early, hor, U, c, already, say] |
| 4 | ham | [Nah, dont, think, goes, usf, lives, around, t... |
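
To make the cleaning step concrete, the sketch below applies the same two steps (strip punctuation, then drop stopwords) to a single sample message. It is only an illustration: `clean_message` is a hypothetical name, and `DEMO_STOP_WORDS` is a tiny hand-written stand-in for nltk's English list so that the snippet runs without downloading the corpus.

```python
import string

# Assumption: a tiny illustrative stopword set, NOT nltk's full English list.
DEMO_STOP_WORDS = {"i", "he", "to", "the", "in", "a", "so"}

def clean_message(text, stop_words=DEMO_STOP_WORDS):
    """Strip ASCII punctuation, then drop stopwords (case-insensitive match)."""
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

sample = "Free entry in 2 a wkly comp to win the FA Cup!"
tokens = clean_message(sample)
print(tokens)
# ['Free', 'entry', '2', 'wkly', 'comp', 'win', 'FA', 'Cup']
```

One consequence of stripping punctuation first is visible in the preview above: "don't" becomes "dont", which survives the stopword filter. Removing stopwords before punctuation, or adding the apostrophe-free forms to the stopword set, would be one way to handle contractions.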


## Split Train and Test

From there, we used the train_test_split function from sklearn to split our data 80/20 for training and testing purposes.

In [32]:
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [33]:
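# features (message text) go first, labels second; the cleaned token lists are
# joined back into strings so the vectorizer and the .str counts below see text
X_train, X_test, y_train, y_test = train_test_split(df_data['email'].str.join(' '),
                                                    df_data['label'],
                                                    random_state=0,
                                                    test_size=0.2)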
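
The sketch below illustrates the argument order and the resulting split sizes on synthetic stand-in data with the same number of rows as our deduplicated frame. The texts and labels are invented for the example; only the call pattern mirrors the notebook.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 5169 fake messages with roughly one spam in eight.
texts = [f"message {i}" for i in range(5169)]
labels = ["spam" if i % 8 == 0 else "ham" for i in range(5169)]

# Features are passed first and labels second, so the outputs come back
# as (X_train, X_test, y_train, y_test) in that order.
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                          random_state=0,
                                          test_size=0.2)

print(len(X_tr), len(X_te))
```

With `test_size=0.2`, scikit-learn rounds the test set up, so 5169 rows give 1034 test rows and 4135 training rows. Passing `stratify=labels` is an optional extra that keeps the spam/ham ratio the same in both parts.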

## Extract Features
We further prepared our data by applying a term frequency–inverse document frequency (TF-IDF) vectorizer to our email text. TfidfVectorizer learns a vocabulary from the corpus and weights each term by how often it occurs in a message, discounted by how many messages contain it.

This is a very common way to transform text into a numeric representation that a machine learning algorithm can be fitted on for prediction.

We fit the vectorizer on the training data X_train, ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams), and then transform X_train with it.

In [35]:
vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
X_train_vector = vect.transform(X_train)
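
As a small illustration of how `min_df` and `ngram_range` interact, the sketch below fits a vectorizer on a four-document toy corpus. The documents are invented, and `min_df=2` is used instead of 5 only because the corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus; each string plays the role of one cleaned message.
train_docs = ["free entry win cash", "free entry today",
              "call me later", "call me today"]
test_docs = ["free cash today", "call me"]

# Keep only n-grams (n = 1..3) that appear in at least 2 training documents.
vect_demo = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(train_docs)
kept_terms = sorted(vect_demo.vocabulary_)
print(kept_terms)

# The test set is transformed with the vocabulary learned on the training set.
train_matrix = vect_demo.transform(train_docs)
test_matrix = vect_demo.transform(test_docs)
print(train_matrix.shape, test_matrix.shape)
```

Only "call", "call me", "entry", "free", "free entry", "me" and "today" occur in two or more training documents, so the vocabulary has seven terms and both matrices have seven columns. Terms seen only in the test documents are ignored.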

## Add Features

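We can add features such as the length of the message, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than a letter, digit, or underscore). This is helpful because spam emails usually contain digits and dollar signs and tend to be lengthy. Note that data_cleaning strips ASCII punctuation, including the dollar sign, so these counts are only informative when computed on text that still contains those characters.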
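
To show what each of the four counts measures, here is a minimal sketch on a single invented message, using only the standard library. The regular expressions are the same ones passed to `str.count` in the notebook; the sample text and variable names are hypothetical.

```python
import re

# Invented raw message (punctuation intact) for illustration.
sample = "WIN $1000 now!! Call 555-0199"

n_length = len(sample)                   # total number of characters
n_digits = len(re.findall(r"\d", sample))    # digit characters
n_dollars = len(re.findall(r"\$", sample))   # literal dollar signs
n_nonword = len(re.findall(r"\W", sample))   # not a letter, digit or underscore

print(n_length, n_digits, n_dollars, n_nonword)
# 29 11 1 8
```

The eight non-word characters here are four spaces, the dollar sign, two exclamation marks and the hyphen, so whitespace is included in that count.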

In [36]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')


# Train Data
add_length=X_train.str.len()
add_digits=X_train.str.count(r'\d')
add_dollars=X_train.str.count(r'\$')
add_characters=X_train.str.count(r'\W')

X_train_transformed = add_feature(X_train_vector , [add_length, add_digits,  add_dollars, add_characters])

# Test Data
add_length_t=X_test.str.len()
add_digits_t=X_test.str.count(r'\d')
add_dollars_t=X_test.str.count(r'\$')
add_characters_t=X_test.str.count(r'\W')


X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t,  add_dollars_t, add_characters_t])
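
The stacking step can be checked on a tiny example. The sketch below uses a made-up 3x2 matrix in place of the TF-IDF output and two invented feature columns; it is meant only to show how the shapes line up, not to reproduce our data.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for a TF-IDF matrix: 3 documents, 2 terms (values are made up).
X_demo = csr_matrix(np.array([[0.5, 0.0],
                              [0.0, 1.0],
                              [0.2, 0.3]]))

# One value per document for each extra feature.
length = [10, 25, 7]
digits = [0, 4, 1]

# csr_matrix([...]) has shape (2, 3); transposing gives one column per feature.
extra = csr_matrix([length, digits]).T
X_aug = hstack([X_demo, extra], "csr")

print(X_aug.shape)
print(X_aug.toarray()[1])
```

The result has shape (3, 4): the two original columns followed by one column per added feature. The added columns are on a much larger scale than TF-IDF weights, so scaling them may be worth considering for models that are sensitive to feature magnitude.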

# Models
## Logistic Regression
We build and train a logistic regression model and report the AUC score on the test dataset: