<a href="https://colab.research.google.com/github/sergiomora03/AdvancedTopicsAnalytics/blob/main/exercises/E4-SpamClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 3
## Spam Classification
### Context
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

### Content
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

This corpus was collected from free or free-for-research sources on the Internet:

- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link](http://www.grumbletext.co.uk/).
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: [Web Link](http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/).
- A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at [Web Link](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf).
- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link](http://www.esp.uem.es/jmgomez/smsspamcorpus/). This corpus has been used in academic research.

### Acknowledgements
The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The creators ask that, if you find the dataset useful, you reference the paper below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

In [None]:
!pip install wget



In [None]:
import pandas as pd
import numpy as np
import wget
import os
from zipfile import ZipFile

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, auc, roc_curve

import gensim
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
try:
    from google.colab import files  # only succeeds inside Google Colab
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    !unzip smsspamcollection.zip
    df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['target', 'text'])
except ModuleNotFoundError:
    # Local run: download and extract into ./Data using OS-independent paths
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    path = os.path.join(os.getcwd(), 'Data')
    os.makedirs(path, exist_ok=True)  # the target directory must exist before downloading
    temp = os.path.join(path, 'smsspamcollection.zip')
    wget.download(url, temp)
    with ZipFile(temp) as file:
        file.extractall(path)
    df = pd.read_csv(os.path.join(path, 'SMSSpamCollection'), sep='\t', header=None, names=['target', 'text'])



In [None]:
df.head()

| | target | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
display(df.shape) #Number of rows (instances) and columns in the dataset
df["target"].value_counts()/df.shape[0] #Class distribution in the dataset

(5572, 2)

ham     0.865937
spam    0.134063
Name: target, dtype: float64

In [None]:
df['target'].value_counts()

ham     4825
spam     747
Name: target, dtype: int64

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})

In [None]:
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

Preprocess the text data: convert all text to lowercase, strip punctuation, tokenize with the NLTK package, and remove English stop words.


In [None]:
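stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop English stop words."""
    text = text.lower()
    # Iterating over a string yields characters, so this drops punctuation marks.
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)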

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)
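
To make the effect of `preprocess` concrete, here is a dependency-free sketch of the same three steps applied to one message. The tiny `DEMO_STOP_WORDS` set and the whitespace split are stand-ins assumed for illustration only; the cell above uses NLTK's full English stop-word list and `word_tokenize`, so its output can differ.

```python
import string

# Hypothetical mini stop-word list; NLTK's English list is much longer.
DEMO_STOP_WORDS = {"in", "a", "to", "the", "and"}

def preprocess_demo(text):
    """Lowercase, strip punctuation, drop stop words (toy version)."""
    text = text.lower()
    # str.translate removes every ASCII punctuation character in one pass.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in text.split() if tok not in DEMO_STOP_WORDS]
    return " ".join(tokens)

demo_clean = preprocess_demo("Free entry in 2 a wkly comp to win FA Cup!")
print(demo_clean)  # free entry 2 wkly comp win fa cup
```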

Train a Word2Vec model on the preprocessed training data using the Gensim package.

In [None]:
sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=5, negative=20, min_count=1, workers=4)

Convert the preprocessed text data to a vector representation using the Word2Vec model.

In [None]:
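def vectorize(sentence):
    """Represent a message as the mean of its in-vocabulary word vectors."""
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        # No known word: fall back to a zero vector of the embedding size.
        return np.zeros(model.wv.vector_size)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)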

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
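
The averaging step can be checked by hand on a toy vocabulary. The sketch below assumes two made-up 2-dimensional vectors (`toy_vectors`) in place of a trained model and uses a hypothetical helper, `mean_pool`, that mirrors the logic of `vectorize`: unknown words are skipped, and a message with no known words maps to a zero vector.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings; a trained Word2Vec model would
# supply these through model.wv.
toy_vectors = {
    "free": np.array([1.0, 0.0]),
    "win": np.array([0.0, 1.0]),
}

def mean_pool(sentence, vectors, dim):
    """Average the vectors of known words; zero vector if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

pooled = mean_pool("free win unknownword", toy_vectors, dim=2)
empty = mean_pool("unknownword", toy_vectors, dim=2)
print(pooled)  # [0.5 0.5]
print(empty)   # [0. 0.]
```

Because every message is reduced to a single mean vector, word order and message length are lost, which is one reason bag-of-words features can outperform this representation on short texts.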

Train a classification model such as logistic regression, random forests, or support vector machines using the vectorised training data and the spam/ham labels.

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

Evaluate the performance of the classification model on the test set using accuracy, precision, recall and F1 score.

In [None]:
y_pred = clf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))

Accuracy: 0.8660287081339713
AUC: 0.5
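
An accuracy of about 0.866 together with an AUC of exactly 0.5 is the pattern produced by a classifier that labels every message as ham: the accuracy equals the ham share of the data and the ROC curve collapses onto the diagonal. The sketch below reproduces that pattern on synthetic labels and also reports precision, recall and F1. The 87/13 class split and the `_demo` names are assumptions chosen to resemble the class balance above; this is not the real test set.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Synthetic labels: 87 ham (0) and 13 spam (1); illustrative only.
y_true_demo = np.array([0] * 87 + [1] * 13)
# A degenerate classifier that always predicts ham.
y_const_demo = np.zeros_like(y_true_demo)

baseline = {
    "accuracy": accuracy_score(y_true_demo, y_const_demo),
    # zero_division=0 avoids a warning when no message is predicted as spam.
    "precision": precision_score(y_true_demo, y_const_demo, zero_division=0),
    "recall": recall_score(y_true_demo, y_const_demo, zero_division=0),
    "f1": f1_score(y_true_demo, y_const_demo, zero_division=0),
    "auc": roc_auc_score(y_true_demo, y_const_demo),
}
for name, value in baseline.items():
    print(f"{name}: {value:.2f}")
```

Recall and F1 for the spam class expose the problem immediately, which is why accuracy alone is a weak metric on imbalanced data.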


**Random Forest Classification**

In [None]:
clrf = RandomForestClassifier()
clrf.fit(X_train, y_train)

In [None]:
y_pred = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))

Accuracy: 0.9204545454545454
AUC: 0.7087855169692187
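
Both AUC figures above come from `roc_curve` applied to hard 0/1 predictions, which gives a ROC curve with a single operating point. A ranking-based AUC is usually computed from class scores such as `predict_proba(...)[:, 1]`. The sketch below contrasts the two on synthetic data from `make_classification`; the data, the `_demo` names and the parameter choices are assumptions for illustration, so the numbers say nothing about the SMS results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (about 87% negatives), illustrative only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.87, 0.13], random_state=18
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=18
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr_demo, y_tr_demo)

# AUC from hard labels: only one threshold (0.5) is evaluated.
auc_from_labels = roc_auc_score(y_te_demo, clf_demo.predict(X_te_demo))
# AUC from probabilities: every threshold contributes to the curve.
auc_from_scores = roc_auc_score(y_te_demo, clf_demo.predict_proba(X_te_demo)[:, 1])

print("AUC from labels:", round(auc_from_labels, 3))
print("AUC from scores:", round(auc_from_scores, 3))
```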


# Exercise 3.1

Remove stop words, then predict the target using CountVectorizer.

Use a Random Forest classifier.

In [None]:
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

1.

In [None]:
# Plain CountVectorizer with default settings
vect = CountVectorizer()
X_dcv_tr = vect.fit_transform(X_train)
X_dcv_te = vect.transform(X_test)
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_dcv_tr, y_train, cv=10)).describe())

count    10.000000
mean      0.976667
std       0.004902
min       0.969231
25%       0.974359
50%       0.976923
75%       0.976923
max       0.984615
dtype: float64
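
In the cell above the vocabulary is fitted on the whole training set before cross-validation, so each validation fold has already influenced the features. Wrapping the vectorizer and the classifier in a pipeline refits the vocabulary inside every fold. A minimal sketch on a made-up toy corpus follows; the texts, the labels and the small `n_estimators` are assumptions chosen only to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

toy_texts = [
    "win a free prize now", "free entry claim your prize",
    "urgent call to claim cash", "free cash prize waiting",
    "win cash now call", "claim your free entry now",
    "see you at lunch", "are you coming home",
    "call me when you arrive", "ok see you later",
    "lunch at home today", "coming home later today",
]
toy_labels = [1] * 6 + [0] * 6  # 1 = spam, 0 = ham

pipe = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=25, random_state=18),
)
# With an integer cv and a classifier, scikit-learn uses stratified folds.
fold_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=3)
print(fold_scores)
```

On the real data the same pattern would take `X_train` and `y_train` in place of the toy lists.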


In [None]:
clrf.fit(X_dcv_tr, y_train) #Train
y_pred_cv = clrf.predict(X_dcv_te) #Test
fpr, tpr, thresholds = roc_curve(y_test, y_pred_cv)

In [None]:
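# Score this model's predictions (y_pred_cv). The recorded output below was produced
# with the stale y_pred from the Word2Vec random forest, hence the repeated 0.9204...
print('Accuracy:', accuracy_score(y_test, y_pred_cv))
print('AUC:', auc(fpr, tpr))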

Accuracy: 0.9204545454545454
AUC: 0.8727678571428572


2.

In [None]:
# CountVectorizer with English stop words removed; note that lowercase=False keeps the original casing
vect = CountVectorizer(lowercase=False, stop_words='english')
X_dcv_tr = vect.fit_transform(X_train)
X_dcv_te = vect.transform(X_test)
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_dcv_tr, y_train, cv=10)).describe())

count    10.000000
mean      0.976923
std       0.007352
min       0.966667
25%       0.973077
50%       0.976923
75%       0.980769
max       0.989744
dtype: float64


In [None]:
clrf.fit(X_dcv_tr, y_train) #Train
y_pred_cv = clrf.predict(X_dcv_te) #Test
fpr, tpr, thresholds = roc_curve(y_test, y_pred_cv)

In [None]:
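# Score this model's predictions (y_pred_cv). The recorded output below was produced
# with the stale y_pred from the Word2Vec random forest, hence the repeated 0.9204...
print('Accuracy:', accuracy_score(y_test, y_pred_cv))
print('AUC:', auc(fpr, tpr))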

Accuracy: 0.9204545454545454
AUC: 0.8858154104183109


3.

In [None]:
# CountVectorizer with n-grams, max_features, min_df and max_df
vect = CountVectorizer(ngram_range=(1, 2), max_features=2000, max_df=0.95, min_df=5)
X_dcv_tr = vect.fit_transform(X_train)
X_dcv_te = vect.transform(X_test)
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_dcv_tr, y_train, cv=10)).describe())

count    10.000000
mean      0.981282
std       0.005803
min       0.971795
25%       0.976923
50%       0.983333
75%       0.986538
max       0.987179
dtype: float64


In [None]:
clrf.fit(X_dcv_tr, y_train) #Train
y_pred_cv = clrf.predict(X_dcv_te) #Test
fpr, tpr, thresholds = roc_curve(y_test, y_pred_cv)

In [None]:
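# Score this model's predictions (y_pred_cv). The recorded output below was produced
# with the stale y_pred from the Word2Vec random forest, hence the repeated 0.9204...
print('Accuracy:', accuracy_score(y_test, y_pred_cv))
print('AUC:', auc(fpr, tpr))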

Accuracy: 0.9204545454545454
AUC: 0.9059046961325967


# Exercise 3.2

Predict the target using TfidfVectorizer.

Use a Random Forest classifier.

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})

# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3900,)
(1672,)
(3900,)
(1672,)


In [None]:
# Plain TfidfVectorizer with default settings
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train_tfidf, y_train, cv=10)).describe())

count    10.000000
mean      0.977179
std       0.006333
min       0.969231
25%       0.974359
50%       0.975641
75%       0.981410
max       0.989744
dtype: float64
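
`TfidfVectorizer` differs from `CountVectorizer` by down-weighting terms that appear in many documents. With scikit-learn's default smoothing the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term. The sketch below checks this on three made-up messages (the messages and the `toy_` names are assumptions for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_msgs = ["free prize now", "call me now", "see you now"]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_msgs)

# idf_ is aligned with the column indices stored in vocabulary_.
idf_by_term = {term: toy_tfidf.idf_[col]
               for term, col in toy_tfidf.vocabulary_.items()}
print(round(idf_by_term["now"], 3))   # 1.0   (appears in every message)
print(round(idf_by_term["free"], 3))  # 1.693 (appears in one message)
```

"now" occurs in all three messages, so it gets the minimum weight, while "free" occurs in only one and is weighted about 1.7 times higher before row normalisation.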


In [None]:
y_train.astype('int')
y_test.astype('int')

3087    0
4062    0
572     0
2453    0
4593    0
       ..
5405    0
2771    0
3373    0
1204    0
2400    0
Name: target, Length: 1672, dtype: int64

In [None]:
print(X_train_tfidf)

In [None]:
print(X_test_tfidf)

In [None]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(3900, 7172)
(1672, 7172)


In [None]:
clrf.fit(X_train_tfidf, y_train) #Train

In [None]:
y_pred_cv = clrf.predict(X_train_tfidf)

In [None]:
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)

Accuracy on Training:  1.0


In [None]:
y_pred_te = clrf.predict(X_test_tfidf)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

Accuracy on Test:  0.97188995215311
AUC: 0.8950892857142857


### **Complex TfidfVectorizer**

In [None]:
# TfidfVectorizer with n-grams, max_features, min_df and max_df
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=2000, max_df=0.95, min_df=5)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train_tfidf, y_train, cv=10)).describe())

count    10.000000
mean      0.980769
std       0.007074
min       0.969231
25%       0.974359
50%       0.983333
75%       0.986538
max       0.989744
dtype: float64


In [None]:
clrf.fit(X_train_tfidf, y_train) #Train

In [None]:
y_pred_cv = clrf.predict(X_train_tfidf)

In [None]:
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)

Accuracy on Training:  1.0


In [None]:
y_pred_te = clrf.predict(X_test_tfidf)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

Accuracy on Test:  0.9736842105263158
AUC: 0.9036725532754539


# Exercise 3.3

Predict the target using CountVectorizer or TfidfVectorizer.

Choose any classification model and justify why.

Exercises 3.1 and 3.2 produced predictions with both methods, with the following results:

| Method | Accuracy | AUC |
|-------|-----------|-----------|
| Plain CountVectorizer | 0.920454 | 0.87276 |
| CountVectorizer with stop words removed (`lowercase=False`) | 0.92045 | 0.88582 |
| CountVectorizer with n-grams, max_features, min_df and max_df | 0.920454 | 0.90590 |
| Plain TfidfVectorizer | 0.971889 | 0.89508 |
| TfidfVectorizer with n-grams, max_features, min_df and max_df | 0.973684 | 0.90367 |

The **TfidfVectorizer with n-grams** recorded the highest accuracy, **0.973684**, with an AUC of **0.90367**, so it is selected as the best model. Two caveats apply. The three CountVectorizer accuracies are identical because those cells scored `y_pred` from the earlier Word2Vec random forest instead of `y_pred_cv`, so they do not measure the CountVectorizer models. In addition, the CountVectorizer with n-grams has a marginally higher AUC (0.90590).

# Exercise 3.4

Increase and decrease the values of the parameters `vector_size`, `window` and `negative`, then predict the target.

Plot model performance against the different parameter values.

Use a Random Forest classifier and a classification model of your choice, and justify why.

parameter variation: `vector_size`

### **vector_size = 150**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

In [None]:
stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

In [None]:
sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=150, window=5, negative=20, min_count=1, workers=4)

In [None]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(150)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

In [None]:
clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.945385
std       0.011147
min       0.920513
25%       0.944231
50%       0.946154
75%       0.953205
max       0.956410
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9300239234449761
AUC: 0.7444998026835044


### **vector_size = 50**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=50, window=5, negative=20, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(50)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.942308
std       0.008903
min       0.925641
25%       0.939103
50%       0.944872
75%       0.946154
max       0.956410
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9192583732057417
AUC: 0.7213027821625887


### **vector_size = 200**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=200, window=5, negative=20, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(200)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.940256
std       0.009898
min       0.925641
25%       0.934615
50%       0.939744
75%       0.945513
max       0.958974
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9234449760765551
AUC: 0.7161725532754538


parameter variation: `window`

### **window = 10**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=10, negative=20, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.958462
std       0.010865
min       0.935897
25%       0.952564
50%       0.960256
75%       0.964103
max       0.976923
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.937799043062201
AUC: 0.7904992107340174


### **window = 15**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=15, negative=20, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.960513
std       0.010342
min       0.933333
25%       0.958974
50%       0.962821
75%       0.966026
max       0.969231
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9503588516746412
AUC: 0.8392610497237569


parameter variation: `negative` (with `window=15` kept from the previous step)

### **negative = 10**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=15, negative=10, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.958974
std       0.008800
min       0.943590
25%       0.956410
50%       0.957692
75%       0.965385
max       0.971795
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9467703349282297
AUC: 0.816433997632202


### **negative = 15**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=15, negative=15, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.962051
std       0.008615
min       0.941026
25%       0.959615
50%       0.962821
75%       0.967949
max       0.971795
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9521531100478469
AUC: 0.8384101223362274


### **negative = 25**

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=15, negative=25, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.960000
std       0.007852
min       0.941026
25%       0.957051
50%       0.961538
75%       0.964103
max       0.969231
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9473684210526315
AUC: 0.8243266574585635
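
The exercise asks for a plot of performance against the parameter values. One way to produce it is to collect the test accuracy and AUC printed by the Random Forest runs above and draw one panel per parameter, as sketched below. The numbers are copied (rounded to four decimals) from the recorded outputs; the dictionary layout, the output file name and the use of matplotlib are assumptions. Each point is a single unseeded run, so small differences may be noise.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# (parameter value, test accuracy, test AUC) copied from the runs above.
recorded = {
    "vector_size (window=5, negative=20)": [
        (50, 0.9193, 0.7213), (100, 0.9205, 0.7088),
        (150, 0.9300, 0.7445), (200, 0.9234, 0.7162)],
    "window (vector_size=100, negative=20)": [
        (5, 0.9205, 0.7088), (10, 0.9378, 0.7905), (15, 0.9504, 0.8393)],
    "negative (vector_size=100, window=15)": [
        (10, 0.9468, 0.8164), (15, 0.9522, 0.8384),
        (20, 0.9504, 0.8393), (25, 0.9474, 0.8243)],
}

fig, axes = plt.subplots(1, len(recorded), figsize=(15, 4), sharey=True)
for ax, (title, rows) in zip(axes, recorded.items()):
    values, accuracies, aucs = zip(*rows)
    ax.plot(values, accuracies, marker="o", label="test accuracy")
    ax.plot(values, aucs, marker="s", label="test AUC")
    ax.set_title(title)
    ax.set_xlabel(title.split(" ")[0])  # parameter name
    ax.set_xticks(values)
axes[0].set_ylabel("score")
axes[0].legend()
fig.tight_layout()
fig.savefig("word2vec_parameter_sweep.png")  # hypothetical output file name
```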


Running the best parameters for `vector_size`, `window` and `negative` (all at the same time)

The best model used the following parameters, which gave the best performance:

-------------------------

* vector_size=100
* window=15
* negative=15

In [None]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})
# split data into training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([word for word in text if word not in string.punctuation])
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=15, negative=15, min_count=1, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])

clrf = RandomForestClassifier()
print(pd.Series(cross_val_score(clrf, X_train, y_train, cv=10)).describe())
clrf.fit(X_train, y_train) #Train
y_pred_cv = clrf.predict(X_train)
accuracy_training = accuracy_score(y_train, y_pred_cv)
print('Accuracy on Training: ', accuracy_training)
y_pred_te = clrf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_te)
accuracy_test = accuracy_score(y_test, y_pred_te)
print('Accuracy on Test: ', accuracy_test)
print('AUC:', auc(fpr, tpr))

count    10.000000
mean      0.958462
std       0.010662
min       0.933333
25%       0.956410
50%       0.958974
75%       0.964744
max       0.971795
dtype: float64
Accuracy on Training:  1.0
Accuracy on Test:  0.9491626794258373
AUC: 0.8253625690607734
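
The same configuration (`vector_size=100`, `window=15`, `negative=15`) scored 0.9522 test accuracy in the `negative = 15` run and 0.9492 here, because neither `Word2Vec` nor `RandomForestClassifier` was given a seed. Fixing `random_state` makes the forest repeatable, as the sketch below shows on synthetic data (the data and the helper `fit_predict` are assumptions for illustration). For `Word2Vec`, the gensim documentation describes combining `seed=...` with `workers=1` and a fixed `PYTHONHASHSEED` to obtain repeatable runs; that part is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the averaged word vectors; illustrative only.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=18)

def fit_predict(seed):
    """Train a small forest with a fixed seed and return spam-class scores."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    return forest.fit(X_syn, y_syn).predict_proba(X_syn)[:, 1]

# Two independent fits with the same seed give identical scores.
same_seed_equal = np.array_equal(fit_predict(18), fit_predict(18))
print(same_seed_equal)  # True
```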
