# Group 1
# Document Classification
Adam Gersowitz, Diego Correa, Maria A Ginorio

It can be useful to be able to classify new "test" documents using already classified "training" documents. A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.

Here is one example of such data: UCI Machine Learning Repository: Spambase Data Set.

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or from another source such as your own spam folder). More adventurous students are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.


## Required Packages

In [1]:
# data processing packages
import string

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from pandas_profiling import ProfileReport

%matplotlib inline

In [None]:

# # sklearn packages
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.model_selection import train_test_split
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.linear_model import SGDClassifier

## Data
Label = The last column of 'spambase.data' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).

Features = Most of the attributes indicate how frequently a particular word or character occurs in the e-mail.

* 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD
* 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR
* 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
* 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of the longest uninterrupted sequence of capital letters
* 1 continuous integer [1,...] attribute of type capital_run_length_total = total number of capital letters in the e-mail
* 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0)
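
As a minimal sketch of how this layout maps onto a DataFrame, the snippet below separates the feature columns from the class label by position. The three-row frame and its values are made up for illustration; only the convention that the label sits in the last column comes from the dataset description.

```python
import pandas as pd

# Hypothetical miniature of the Spambase layout: feature columns first, label last.
toy = pd.DataFrame({
    "word_freq_make": [0.00, 0.21, 0.00],
    "char_freq_$": [0.00, 0.18, 0.00],
    "capital_run_length_average": [3.76, 5.11, 1.00],
    "spam": [1, 1, 0],
})

features = toy.iloc[:, :-1]  # every column except the last
label = toy.iloc[:, -1]      # last column: 1 = spam, 0 = not spam

print(features.shape)
print(label.value_counts().to_dict())
```

Selecting by position rather than by name keeps the sketch independent of how the label column happens to be called in a given copy of the data.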

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/mgino11/Web_Analytics/main/Datasets/spambase.csv')

# preview data
df.head(2).T

| | 0 | 1 |
|---|---|---|
| word_freq_make | 0.0 | 0.21 |
| word_freq_address | 0.64 | 0.28 |
| word_freq_all | 0.64 | 0.5 |
| word_freq_3d | 0.0 | 0.0 |
| word_freq_our | 0.32 | 0.14 |
| word_freq_over | 0.0 | 0.28 |
| word_freq_remove | 0.0 | 0.21 |
| word_freq_internet | 0.0 | 0.07 |
| word_freq_order | 0.0 | 0.0 |
| word_freq_mail | 0.0 | 0.94 |


In [3]:
print(df.shape)

(4601, 58)


## Preprocessing

**Pandas Profiling**

In [11]:
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title='Pandas Profiling Report', html={'style':{'full_width':True}})

In [None]:
profile


## Tokenize
Tokenization breaks the original text into smaller pieces called tokens. Tokens are the basic building blocks of text: everything that helps us understand the meaning of the text is derived from the tokens and their relationships to one another. For example, a word is a token in a sentence, and a character can be treated as a token in a word.

In [40]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# df_data is expected to hold the raw text in 'email' and the class in 'label'
df_data['email'] = df_data['email'].apply(word_tokenize)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [41]:
df_data["email"][0]

['Go',
 'until',
 'jurong',
 'point',
 ',',
 'crazy',
 '..',
 'Available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet',
 '...',
 'Cine',
 'there',
 'got',
 'amore',
 'wat',
 '...']

In [42]:
# stopwords
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maria\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [44]:
#remove stop words
df_data['email'] = df_data['email'].apply(lambda x: ' '.join(
    [word for word in x if word not in (stop_words)]))
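
The tokenization and stop-word steps can also be sketched without NLTK, which may help when the punkt and stopwords downloads are unavailable. The regular expression and the three-word stop list below are illustrative assumptions, not what NLTK uses internally.

```python
import re

# Hypothetical stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"until", "only", "in"}

def tokenize(text):
    """Split text into word tokens and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = tokenize("Go until jurong point, crazy.. Available only in bugis")
filtered = remove_stop_words(tokens)

print(tokens)
print(" ".join(filtered))
```

Unlike the notebook cell, this sketch lowercases each token before comparing it with the stop list, so capitalized stop words are removed as well.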

## Lemmatization

Stemming is only concerned with reducing a word to its stem, regardless of whether the result is a real word, whereas lemmatization returns a dictionary form that makes sense. For example:

* In stemming, history is reduced to the stem histori, which is not a word.
* In lemmatization, the result is history.

In [45]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
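
To make the stemming side of this comparison concrete, here is a small sketch using NLTK's `PorterStemmer`, which needs no corpus download. The word list is an arbitrary illustration and is not taken from the dataset.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Arbitrary example words; stems are not guaranteed to be dictionary words.
words = ["history", "studies", "running"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The `WordNetLemmatizer` created above can be applied in the same way with `lemmatizer.lemmatize(word)` once the wordnet corpus has been fetched with `nltk.download('wordnet')`.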


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

## Split Train & Test

From there, we used the train_test_split function from sklearn to split our data 80/20 for training and testing purposes.

In [7]:


from sklearn.model_selection import train_test_split

In [8]:
# features (email text) first, labels second
X_train, X_test, y_train, y_test = train_test_split(df_data['email'],
                                                    df_data['label'],
                                                    random_state=0,
                                                    test_size=0.2)
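
The 80/20 split can be checked on a toy frame before running it on the real data. This is a sketch under assumed column names (`email`, `label`) and made-up messages; it only shows the argument order and the resulting sizes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 messages, 3 of them labeled spam (1).
toy = pd.DataFrame({
    "email": [f"message number {i}" for i in range(10)],
    "label": [0] * 7 + [1] * 3,
})

# Text goes in as the first array, labels as the second.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["email"], toy["label"], test_size=0.2, random_state=0
)

print(len(X_tr), len(X_te))
print(type(X_tr.iloc[0]).__name__)
```

Checking the type of one element of the training features is a cheap way to confirm that the text, and not the labels, ended up on the feature side.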

## Extract Features
We further prepared our data by applying a term frequency–inverse document frequency (TF-IDF) vectorizer to the email text. The TfidfVectorizer class weights each term by how often it appears in a document and how rare it is across the corpus, so that distinctive terms stand out as features.

TF-IDF is a very common way to transform text into a numeric representation that a machine learning algorithm can be fit on.

We fit and transform the training data X_train using a TfidfVectorizer, ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams).
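
The effect of `min_df` and `ngram_range` is easier to see on a tiny corpus. The sketch below uses four made-up messages and `min_df=2` instead of 5 so that some terms survive the filter; everything else follows the settings described here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus.
corpus = [
    "free prize now",
    "free prize today",
    "meeting at noon",
    "free lunch at noon",
]

# Keep only n-grams (n = 1..3) that occur in at least 2 documents.
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 3)).fit(corpus)
toy_matrix = toy_vect.transform(corpus)

kept_terms = sorted(toy_vect.vocabulary_)
print(kept_terms)
print(toy_matrix.shape)
```

Every trigram here occurs in a single document, so none of them passes the document-frequency filter; only the unigrams and bigrams shared by at least two messages remain.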

In [10]:
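vect = TfidfVectorizer(min_df=5, ngram_range=(1,3)).fit(X_train)
X_train_vector = vect.transform(X_train)

AttributeError: 'int' object has no attribute 'lower'

This error means the vectorizer received integers instead of strings. It is raised when X_train holds the numeric labels rather than the email text, for example when the features and labels are passed to train_test_split in the wrong order.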

## Add Features

We can add features such as the length of the text, the number of digits, the number of dollar signs, and the number of non-word characters (anything other than a letter, digit, or underscore). These are helpful because spam emails usually contain digits and dollar signs and tend to be lengthy.

In [None]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')


# Train Data
add_length=X_train.str.len()
add_digits=X_train.str.count(r'\d')
add_dollars=X_train.str.count(r'\$')
add_characters=X_train.str.count(r'\W')

X_train_transformed = add_feature(X_train_vector , [add_length, add_digits,  add_dollars, add_characters])

# Test Data
add_length_t=X_test.str.len()
add_digits_t=X_test.str.count(r'\d')
add_dollars_t=X_test.str.count(r'\$')
add_characters_t=X_test.str.count(r'\W')


X_test_transformed = add_feature(vect.transform(X_test), [add_length_t, add_digits_t,  add_dollars_t, add_characters_t])
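
A quick way to sanity-check the extra columns is to run the same steps on two made-up messages and inspect the last four columns of the result. The sketch repeats the `add_feature` helper so that it runs on its own, and it converts each Series to a NumPy array before stacking, which is an assumption made here for robustness rather than something the notebook does.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

def add_feature(X, feature_to_add):
    """Return a sparse matrix with the extra feature columns appended."""
    return hstack([X, csr_matrix(feature_to_add).T], "csr")

# Hypothetical messages: one spam-like, one ordinary.
texts = pd.Series(["Win $100 now!!", "see you at noon"])
tfidf = TfidfVectorizer().fit_transform(texts)

extra = [
    texts.str.len().to_numpy(),           # length of the text
    texts.str.count(r"\d").to_numpy(),    # digits
    texts.str.count(r"\$").to_numpy(),    # dollar signs
    texts.str.count(r"\W").to_numpy(),    # non-word characters
]
augmented = add_feature(tfidf, extra)

print(augmented.shape)
print(augmented.toarray()[:, -4:])
```

Because the count features are on a very different scale from the TF-IDF weights, it may be worth scaling them before fitting a linear model; that step is not part of this notebook.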

# Models
## Logistic Regression
We train a logistic regression model and report the AUC score on the test dataset:

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
# X_train_transformed.toarray()
# X_test_transformed.toarray()
# np.array(y_train)

In [None]:
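# C=100 applies only weak regularization
logReg_model = LogisticRegression(C=100, solver='lbfgs', max_iter=1000)

logReg_model.fit(X_train_transformed, y_train)

# hard class predictions
y_predicted = logReg_model.predict(X_test_transformed)

# AUC is a ranking metric, so score it with the predicted probability of the positive class
y_scores = logReg_model.predict_proba(X_test_transformed)[:, 1]
auc = roc_auc_score(y_test, y_scores)
auc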
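
Whether AUC is computed from hard class predictions or from predicted scores can change the result. The toy example below, with made-up labels and scores, illustrates the difference; the 0.5 threshold is an assumption that mirrors the default decision rule of `predict`.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.9]

# Hard labels obtained by thresholding the probabilities at 0.5.
y_hard = [int(p >= 0.5) for p in y_prob]

auc_from_scores = roc_auc_score(y_true, y_prob)
auc_from_labels = roc_auc_score(y_true, y_hard)

print(auc_from_scores)
print(auc_from_labels)
```

The scores rank both positives above both negatives, so the score-based AUC is 1.0, while thresholding misclassifies one positive and lowers the label-based AUC to 0.75.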