### Data 620 - Week 10-11 Assignment 
### Leticia Salazar
### April 2, 2023

##### Task:

It can be useful to be able to classify new "test" documents using already classified "training" documents.  A common example is using a corpus of labeled spam and ham (non-spam) e-mails to predict whether or not a new document is spam.  Here is one example of such data:  UCI Machine Learning Repository: Spambase Data Set.

For this project, you can use the above dataset to predict the class of new documents (either withheld from the training dataset or taken from another source, such as your own spam folder).

For more adventurous students, you are welcome (encouraged!) to come up with a different set of documents (including scraped web pages!?) that have already been classified (e.g. tagged), then analyze these documents to predict how new documents should be classified.

#### Dataset:

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography...

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.

* Dataset characteristics: Multivariate
* Attribute Characteristics: Integer, Real
* Associated Tasks: Classification
* Number of Instances: 4601
* Number of Attributes: 57
* Missing Values?: Yes
* Area: Computer
* Date Donated: 1999-07-01
* Number of Web Hits: 736179


#### Attribute Information:

The last column of 'spambase.data' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file. Here are the definitions of the attributes:

* 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.


* 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail


* 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters


* 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters


* 1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail


* 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
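
The attribute definitions above can be illustrated on a single string. This is a minimal sketch, not the code used to build the dataset: the helper names (`word_freq`, `char_freq`, `capital_runs`) are hypothetical, matching words case-insensitively is an assumption (the dataset description does not say), and returning zeros for text with no capital letters is also an assumption.

```python
import re

def word_freq(text, word):
    """100 * (occurrences of `word`) / (total words in `text`).

    A "word" is any maximal run of alphanumeric characters.
    Case-insensitive matching is an assumption of this sketch.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

def char_freq(text, char):
    """100 * (occurrences of `char`) / (total characters in `text`)."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: no capitals yields zeros (the dataset's minimum is 1)
        return (0.0, 0, 0)
    return (sum(runs) / len(runs), max(runs), sum(runs))

sample = "FREE money!!! Claim your FREE prize NOW"
free_pct = word_freq(sample, "free")   # 2 of 7 words
bang_pct = char_freq(sample, "!")      # 3 of 39 characters
run_stats = capital_runs(sample)       # runs are FREE, C, FREE, NOW -> (3.0, 4, 12)
```

Note that the total run length equals the number of capital letters in the text, which is why `capital_run_length_total` is described both ways.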

#### Load Libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import csv

#### Load Data

In [None]:
# paths to the local data files
spam_data = 'spambase.data'
spam_data_names = 'spambase.names'

In [None]:
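# Read `spambase.data` into a Python list of rows (each row is a list of strings).
# The 'rU' mode was removed in Python 3.11; the csv module recommends newline=''.
with open(spam_data, newline='') as f:
    reader = csv.reader(f)
    dataset = [line for line in reader]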

In [None]:
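# Read the attribute names from `spambase.names` into a Python list.
# The 'rU' mode was removed in Python 3.11; the csv module recommends newline=''.
with open(spam_data_names, newline='') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_NONE)
    # Skip blank lines, '|' comment lines, and the class line that starts with '1';
    # the attribute name is the text before the first ':'.
    fieldnames = [
        line[0].split(":", 1)[0]
        for line in reader
        if len(line) > 0 and '|' not in line[0] and line[0][0] != '1'
    ]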
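
The name-parsing step can also be sketched without the `csv` module, using plain string handling. The inline text below is a shortened, hand-written stand-in for `spambase.names` (an assumption about its layout: `|` comment lines, one class line, then `name: type.` lines), and both `parse_field_names` and the appended `"spam"` label for the class column are hypothetical names chosen for illustration.

```python
import io

# Shortened stand-in for the layout of `spambase.names` (illustrative only)
names_text = """| SPAM E-MAIL DATABASE ATTRIBUTES (in .names format)
|
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
word_freq_address:      continuous.
char_freq_;:            continuous.
capital_run_length_total: continuous.
"""

def parse_field_names(lines):
    """Return attribute names from `name: type.` lines, skipping everything else."""
    names = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("|"):
            continue  # blank line or comment
        if ":" not in line:
            continue  # class line such as "1, 0.  | spam, non-spam classes"
        names.append(line.split(":", 1)[0])
    return names

# The data file has one more column than the names list: the class label
columns = parse_field_names(io.StringIO(names_text)) + ["spam"]
```

With the real files, a list built this way could be passed as column names when loading `spambase.data` into a DataFrame.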

In [2]:
spam = pd.read_csv("https://raw.githubusercontent.com/ustunb/classification-pipeline/master/Data/Raw%20Data%20Files/spambase.csv")
spam.head(10)

Unnamed: 0,Spam,WordFreqMake,WordFreqAddress,WordFreqAll,WordFreq3D,WordFreqOur,WordFreqOver,WordFreqRemove,WordFreqInternet,WordFreqOrder,...,WordFreqConference,CharFreqSemicolon,CharFreqParentheses,CharFreqBracket,CharFreqExcalamationMark,CharFreqDollarSign,CharFreqPound,CapitalRunLengthAverage,CapitalRunLengthLongest,CapitalRunLengthTotal
0,1,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,1,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,1,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191
5,1,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,...,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54
6,1,0.0,0.0,0.0,0.0,1.92,0.0,0.0,0.0,0.0,...,0.0,0.0,0.054,0.0,0.164,0.054,0.0,1.671,4,112
7,1,0.0,0.0,0.0,0.0,1.88,0.0,0.0,1.88,0.0,...,0.0,0.0,0.206,0.0,0.0,0.0,0.0,2.45,11,49
8,1,0.15,0.0,0.46,0.0,0.61,0.0,0.3,0.0,0.92,...,0.0,0.0,0.271,0.0,0.181,0.203,0.022,9.744,445,1257
9,1,0.06,0.12,0.77,0.0,0.19,0.32,0.38,0.0,0.06,...,0.0,0.04,0.03,0.0,0.244,0.081,0.0,1.729,43,749


In [3]:
spam.describe()

Unnamed: 0,Spam,WordFreqMake,WordFreqAddress,WordFreqAll,WordFreq3D,WordFreqOur,WordFreqOver,WordFreqRemove,WordFreqInternet,WordFreqOrder,...,WordFreqConference,CharFreqSemicolon,CharFreqParentheses,CharFreqBracket,CharFreqExcalamationMark,CharFreqDollarSign,CharFreqPound,CapitalRunLengthAverage,CapitalRunLengthLongest,CapitalRunLengthTotal
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.394045,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,...,0.031869,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285
std,0.488698,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,...,0.285735,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0
75%,1.0,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,...,0.0,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0
max,1.0,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,...,10.0,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0


In [4]:
print(spam.dtypes)

Spam                          int64
WordFreqMake                float64
WordFreqAddress             float64
WordFreqAll                 float64
WordFreq3D                  float64
WordFreqOur                 float64
WordFreqOver                float64
WordFreqRemove              float64
WordFreqInternet            float64
WordFreqOrder               float64
WordFreqMail                float64
WordFreqReceive             float64
WordFreqWill                float64
WordFreqPeople              float64
WordFreqReport              float64
WordFreqAddresses           float64
WordFreqFree                float64
WordFreqBusiness            float64
WordFreqEmail               float64
WordFreqYou                 float64
WordFreqCredit              float64
WordFreqYour                float64
WordFreqFont                float64
WordFreq0                   float64
WordFreqMoney               float64
WordFreqHP                  float64
WordFreqHPL                 float64
WordFreqGeorge              

In [5]:
# Count spam and non-spam (the class column in this CSV is named `Spam`, not `spamclass`)
count_spam = len(spam[spam.Spam == 1])
count_nonspam = len(spam[spam.Spam == 0])

print("Spam: %d" % count_spam)
print("Non-spam: %d" % count_nonspam)
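
Class balance can be checked directly on the `Spam` column shown in the `head()` and `dtypes` output. The sketch below uses a small hand-made frame (`toy`, a hypothetical stand-in) so that it runs without downloading the data; the same expressions should apply to the `spam` DataFrame loaded earlier.

```python
import pandas as pd

# Hypothetical five-row stand-in for the real data: 1 = spam, 0 = non-spam
toy = pd.DataFrame({"Spam": [1, 0, 0, 1, 0]})

# Boolean masks sum to the number of matching rows
n_spam = int((toy["Spam"] == 1).sum())
n_nonspam = int((toy["Spam"] == 0).sum())

# For a 0/1 label, the column mean is the share of spam
spam_share = float(toy["Spam"].mean())

print("Spam: %d" % n_spam)
print("Non-spam: %d" % n_nonspam)
print("Share of spam: %.3f" % spam_share)
```

On the full dataset, the mean of `Spam` reported by `spam.describe()` (0.394045) is this same share, so roughly 39% of the 4601 e-mails are labeled spam.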