
# Exercise 4
## Spam Classification
### Context
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 messages in English, each tagged as ham (legitimate) or spam.

### Content
The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.
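
As a rough illustration of this layout, the sketch below parses two tab-separated lines with Python's standard `csv` module. The sample messages are invented for the example and are not taken from the corpus.

```python
import csv
import io

# Hypothetical sample in the same layout: label, a tab, then the raw text.
sample = "ham\tSee you at lunch?\nspam\tWIN a free prize, reply NOW\n"

# QUOTE_NONE treats quote characters inside messages as ordinary text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = [(label, text) for label, text in reader]

for label, text in rows:
    print(label, "|", text)
```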

This corpus has been collected from free or free-for-research sources on the Internet:

- A collection of 425 SMS spam messages was manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: [Web Link](http://www.grumbletext.co.uk/).
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. These messages were collected from volunteers who were made aware that their contributions were going to be made publicly available. The NUS SMS Corpus is available at: [Web Link](http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/).
- A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at [Web Link](http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf).
- Finally, we have incorporated the SMS Spam Corpus v.0.1 Big. It has 1,002 SMS ham messages and 322 spam messages and is publicly available at: [Web Link](http://www.esp.uem.es/jmgomez/smsspamcorpus/). This corpus has been used in several academic studies.

### Acknowledgements
The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The creators ask that, if you find the dataset useful, you cite the paper listed below and the web page http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ in your papers, research, etc.

We offer a comprehensive study of this corpus in the following paper. This work presents a number of statistics, studies and baseline results for several machine learning methods.

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

In [1]:
!pip install wget



In [2]:
import pandas as pd
import numpy as np
import wget
import os
from zipfile import ZipFile

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import string

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, auc, roc_curve
from sklearn.metrics import classification_report

import gensim
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
try:
    from google.colab import files  # only importable inside Google Colab
    !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
    !unzip smsspamcollection.zip
    df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['target', 'text'])
except ModuleNotFoundError:
    # Local run: download and extract into ./Data using portable paths
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    path = os.path.join(os.getcwd(), 'Data')
    os.makedirs(path, exist_ok=True)  # create the folder if it is missing
    wget.download(url, path)
    temp = os.path.join(path, 'smsspamcollection.zip')
    with ZipFile(temp) as file:
        file.extractall(path)
    df = pd.read_csv(os.path.join(path, 'SMSSpamCollection'), sep='\t', header=None, names=['target', 'text'])

--2023-09-22 20:49:11--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘smsspamcollection.zip’

2023-09-22 20:49:12 (1.43 MB/s) - ‘smsspamcollection.zip’ saved [203415]

Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


In [4]:
df.head()

  target                                               text
0    ham  Go until jurong point, crazy.. Available only ...
1    ham                      Ok lar... Joking wif u oni...
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...
3    ham  U dun say so early hor... U c already then say...
4    ham  Nah I don't think he goes to usf, he lives aro...


In [5]:
display(df.shape) #Number of rows (instances) and columns in the dataset
df["target"].value_counts()/df.shape[0] #Class distribution in the dataset

(5572, 2)

ham     0.865937
spam    0.134063
Name: target, dtype: float64

In [6]:
X = df['text']
y = df['target'].map({'ham':0, 'spam':1})

In [7]:
# Split the data into training and test sets, keeping the class proportions (stratify)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)
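
To make the effect of `stratify` more concrete, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation; the function name `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=18):
    """Return (train_idx, test_idx) with about test_size of each class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with roughly the class balance reported above (about 13% spam).
labels = ["spam"] * 130 + ["ham"] * 870
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes the same fraction of its messages to the test set, so the spam share is the same in both parts.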

Preprocess the text data by converting all text to lowercase, removing punctuation, and removing stop words using the NLTK package.


In [8]:
stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])  # drop punctuation characters
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)
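
The same three steps can be sketched without NLTK to show what happens to a single message. The stop-word set below is a tiny stand-in chosen for illustration (the notebook uses NLTK's full English list), tokenisation is a plain whitespace split rather than `word_tokenize`, and the example message is invented.

```python
import string

# Tiny illustrative stop-word list; NLTK's English list is much larger.
STOP_WORDS = {"a", "the", "to", "is", "you", "in"}

def simple_preprocess(text):
    text = text.lower()
    # Remove every punctuation character in one pass.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = simple_preprocess("Free entry: text WIN to 80082, you are in!")
print(cleaned)  # free entry text win 80082 are
```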

Train a Word2Vec model on the preprocessed training data using the Gensim package.

In [9]:
sentences = [sentence.split() for sentence in X_train]
model = Word2Vec(sentences, vector_size=100, window=5, negative=20, min_count=1, workers=4)
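
As a simplified view of what `window` controls, the sketch below lists the (center, context) word pairs that fall within a given distance of each other. It leaves out the rest of Gensim's training procedure (for example negative sampling, which `negative` configures); the helper name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["free", "entry", "win", "cup", "final"]
narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=2)
print(len(narrow), len(wide))  # 8 14
```

A larger window lets each word see more of the message, which matters for short texts such as SMS.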

Convert the preprocessed text data to a vector representation using the Word2Vec model.

In [10]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(model.vector_size)  # fallback length matches the embedding size
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
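
The averaging step can be checked by hand on a toy example. The three-dimensional vectors below are made up and merely stand in for `model.wv`; `average_vector` is a hypothetical pure-Python counterpart of `vectorize`.

```python
# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "free": [1.0, 0.0, 2.0],
    "win": [3.0, 2.0, 0.0],
}
DIM = 3

def average_vector(sentence, vectors, dim):
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return [0.0] * dim  # fallback has the same length as the embeddings
    # Average each coordinate across the words that were found.
    return [sum(component) / len(found) for component in zip(*found)]

known = average_vector("free win today", toy_vectors, DIM)
unknown = average_vector("hello there", toy_vectors, DIM)
print(known)    # [2.0, 1.0, 1.0]
print(unknown)  # [0.0, 0.0, 0.0]
```

Words missing from the vocabulary ("today") are skipped, and a sentence with no known words maps to the zero vector.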

Train a classification model such as logistic regression, random forests, or support vector machines using the vectorised training data and the spam/ham labels.

In [11]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

Evaluate the performance of the classification model on the testing set with the accuracy, precision, recall and F1 score.

In [12]:
y_pred = clf.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))

Accuracy: 0.8660287081339713
AUC: 0.5


In [13]:
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

Precision: 0.0
Recall: 0.0
F1 Score: 0.0


In [14]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672
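
The report above (recall 1.00 for ham, 0.00 for spam) is what a model that labels every test message as ham would produce. The sketch below recomputes the metrics from the corresponding confusion counts; treating a zero denominator as 0.0 is a convention assumed here.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Every test message predicted as ham: 1448 ham and 224 spam, per the report.
accuracy, precision, recall, f1 = metrics(tp=0, fp=0, fn=224, tn=1448)
print(accuracy, precision, recall, f1)
```

The accuracy equals the share of ham in the test set (1448 / 1672), which is why accuracy alone is misleading on an imbalanced dataset like this one.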



# Exercise 3.1

Remove stopwords, then predict the target using CountVectorizer.

Use a Random Forest classifier.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

In [17]:
CV = CountVectorizer(stop_words = 'english')

In [18]:
X_train_CV = CV.fit_transform(X_train)
X_test_CV = CV.transform(X_test)
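
The fit/transform split used above can be illustrated with a simplified bag-of-words in pure Python. This is only a sketch: it skips CountVectorizer's own tokenisation, lowercasing and stop-word removal, and the documents are invented.

```python
from collections import Counter

train_docs = ["free prize win", "see you at lunch", "win win prize"]
test_doc = "free lunch today"

# "fit": build a sorted vocabulary from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def to_counts(doc):
    """Count known words (the transform step); unseen words are ignored."""
    row = [0] * len(vocabulary)
    for word, n in Counter(doc.split()).items():
        if word in index:
            row[index[word]] = n
    return row

train_row = to_counts("win win prize")
test_row = to_counts(test_doc)
print(vocabulary)
print(train_row)  # [0, 0, 0, 1, 0, 2, 0]
print(test_row)   # [0, 1, 1, 0, 0, 0, 0]
```

Building the vocabulary from the training data alone is what keeps information from the test set out of the features.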

In [19]:
clf = RandomForestClassifier()
clf.fit(X_train_CV, y_train)

In [20]:
y_pred = clf.predict(X_test_CV)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1448
           1       0.99      0.77      0.87       224

    accuracy                           0.97      1672
   macro avg       0.98      0.89      0.93      1672
weighted avg       0.97      0.97      0.97      1672

Accuracy: 0.9688995215311005
AUC: 0.8858154104183109
Precision: 0.9942528735632183
Recall: 0.7723214285714286
F1 Score: 0.8693467336683417
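
Because `roc_curve` receives hard 0/1 predictions here, the curve has a single interior point and the reported AUC reduces to the mean of the true-positive and true-negative rates. The sketch below checks this with confusion counts that are consistent with the printed precision (173/174) and recall (173/224); the counts are inferred from those figures, not printed by the notebook.

```python
def auc_from_hard_labels(tp, fp, fn, tn):
    """Trapezoid area under the ROC points (0, 0), (FPR, TPR), (1, 1)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Inferred counts: 173 spam caught, 1 ham flagged, 51 spam missed, 1447 ham kept.
tp, fp, fn, tn = 173, 1, 51, 1447
area = auc_from_hard_labels(tp, fp, fn, tn)
balanced_accuracy = (tp / (tp + fn) + tn / (tn + fp)) / 2
print(area, balanced_accuracy)
```

For a threshold-independent AUC, one option is to pass scores such as `clf.predict_proba(X_test_CV)[:, 1]` to `roc_curve` instead of `y_pred`; that would change the numbers shown above.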


# Exercise 3.2

Predict the target using TfidfVectorizer.

Use a Random Forest classifier.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

In [23]:
Tfidf = TfidfVectorizer(stop_words = 'english')

In [24]:
X_train_Tfidf = Tfidf.fit_transform(X_train)
X_test_Tfidf = Tfidf.transform(X_test)
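
For intuition about the weights, the sketch below computes TF-IDF by hand using the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what scikit-learn documents for the default `smooth_idf=True` and `norm='l2'`. Tokenisation is a plain whitespace split and the documents are invented.

```python
import math
from collections import Counter

docs = ["free prize win", "see you at lunch", "win win prize"]
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

def idf(word):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    counts = Counter(doc.split())
    weights = {w: c * idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}  # L2-normalised row

weights = tfidf("free prize win")
print(weights)
```

"free" appears in one document while "prize" and "win" appear in two, so "free" receives the largest weight in this row.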

In [25]:
clf = RandomForestClassifier()
clf.fit(X_train_Tfidf, y_train)

In [26]:
y_pred = clf.predict(X_test_Tfidf)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      1448
           1       1.00      0.76      0.87       224

    accuracy                           0.97      1672
   macro avg       0.98      0.88      0.92      1672
weighted avg       0.97      0.97      0.97      1672

Accuracy: 0.9683014354066986
AUC: 0.8816964285714286
Precision: 1.0
Recall: 0.7633928571428571
F1 Score: 0.8658227848101265


# Exercise 3.3

Predict the target using CountVectorizer or TfidfVectorizer.

Choose any classification model and justify the choice.

In [27]:
from sklearn.svm import SVC

In [28]:
# Train using CountVectorizer.
clf = SVC(gamma=0.25, random_state=42)
clf.fit(X_train_CV, y_train)

In [29]:
y_pred = clf.predict(X_test_CV)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1448
           1       1.00      0.41      0.58       224

    accuracy                           0.92      1672
   macro avg       0.96      0.70      0.77      1672
weighted avg       0.93      0.92      0.91      1672

Accuracy: 0.9204545454545454
AUC: 0.703125
Precision: 1.0
Recall: 0.40625
F1 Score: 0.5777777777777777


In [30]:
# Train using TfidfVectorizer.
clf = SVC(gamma=0.25, C=0.8, random_state=42)
clf.fit(X_train_Tfidf, y_train)

In [31]:
y_pred = clf.predict(X_test_Tfidf)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98      1448
           1       0.97      0.78      0.86       224

    accuracy                           0.97      1672
   macro avg       0.97      0.89      0.92      1672
weighted avg       0.97      0.97      0.97      1672

Accuracy: 0.9671052631578947
AUC: 0.8866663378058407
Precision: 0.9720670391061452
Recall: 0.7767857142857143
F1 Score: 0.8635235732009927


We chose a support vector machine with an 'rbf' kernel because it performs well when the decision boundary is not linear. It is also a relatively simple model that is easy to run. With this model, TfidfVectorizer is the better option for this problem: it reaches an F1 score of 0.86 on the spam class, versus 0.58 with CountVectorizer, although the two runs also use different values of `C`.
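
For reference, the 'rbf' kernel measures similarity as `exp(-gamma * ||x - y||^2)`. The short sketch below evaluates it for `gamma=0.25`, the value passed to `SVC` above; the input points are arbitrary.

```python
import math

def rbf_kernel(x, y, gamma=0.25):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

same = rbf_kernel([1.0, 0.0], [1.0, 0.0])
apart = rbf_kernel([1.0, 0.0], [0.0, 1.0])
print(same, apart)
```

Identical points have similarity 1, and the value decays towards 0 as points move apart; `gamma` sets how fast.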

# Exercise 3.4

Increase and decrease the values of the parameters `vector_size`, `window` and `negative`, then predict the target.

Plot the different parameter values against the performance of the model.

Use a Random Forest classifier and a classification model of your choice, and justify the choice.

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = df['target'], test_size = 0.3, random_state = 18)

In [33]:
stop_words = set(stopwords.words('english'))
def preprocess(text):
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])  # drop punctuation characters
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

X_train = X_train.apply(preprocess)
X_test = X_test.apply(preprocess)

sentences = [sentence.split() for sentence in X_train]

parameter variation: `vector_size`

In [34]:
model = Word2Vec(sentences, vector_size=100, window=5, negative=20, min_count=1, workers=4)

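# With any other value of vector_size the pipeline raised an error in these runs.
# Likely cause (an assumption): the zero-vector fallback in vectorize() must have length vector_size.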
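
One plausible cause of such an error, offered as an assumption rather than a diagnosis, is a zero-vector fallback whose length differs from `vector_size`: the feature rows then have unequal lengths and cannot form a regular matrix. The toy sketch below reproduces the mismatch with made-up 4-dimensional embeddings; `build_features` is a hypothetical helper.

```python
def build_features(sentences, vectors, fallback_dim):
    rows = []
    for sentence in sentences:
        found = [vectors[w] for w in sentence.split() if w in vectors]
        if not found:
            rows.append([0.0] * fallback_dim)  # used when no word is known
        else:
            rows.append([sum(c) / len(found) for c in zip(*found)])
    return rows

# Hypothetical 4-dimensional embeddings (as if vector_size=4).
toy_vectors = {"free": [1.0, 0.0, 0.0, 0.0], "win": [0.0, 1.0, 0.0, 0.0]}
sentences = ["free win", "unknown words only"]

mismatched = build_features(sentences, toy_vectors, fallback_dim=100)
matched = build_features(sentences, toy_vectors, fallback_dim=4)

mismatched_lengths = {len(row) for row in mismatched}
matched_lengths = {len(row) for row in matched}
print(mismatched_lengths, matched_lengths)
```

Taking the fallback length from the model (for example `model.vector_size` in Gensim) keeps every row the same length regardless of the chosen `vector_size`.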

In [35]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(model.vector_size)  # fallback length matches the embedding size
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train1 = np.array([vectorize(sentence) for sentence in X_train])
X_test1 = np.array([vectorize(sentence) for sentence in X_test])

In [36]:
clf = RandomForestClassifier()
clf.fit(X_train1, y_train)

In [37]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1448
           1       0.94      0.46      0.61       224

    accuracy                           0.92      1672
   macro avg       0.93      0.73      0.79      1672
weighted avg       0.93      0.92      0.91      1672

Accuracy: 0.9234449760765551
AUC: 0.7256067482241515
Precision: 0.9444444444444444
Recall: 0.45535714285714285
F1 Score: 0.6144578313253012


In [38]:
clf = SVC(gamma=0.25, C=0.8, random_state=42)
clf.fit(X_train1, y_train)

In [39]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672

Accuracy: 0.8660287081339713
AUC: 0.5
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


parameter variation: `window`

In [40]:
model = Word2Vec(sentences, vector_size=100, window=15, negative=20, min_count=1, workers=4)

In [41]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(model.vector_size)  # fallback length matches the embedding size
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train1 = np.array([vectorize(sentence) for sentence in X_train])
X_test1 = np.array([vectorize(sentence) for sentence in X_test])

In [42]:
clf = RandomForestClassifier()
clf.fit(X_train1, y_train)

In [43]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1448
           1       0.94      0.68      0.79       224

    accuracy                           0.95      1672
   macro avg       0.95      0.84      0.88      1672
weighted avg       0.95      0.95      0.95      1672

Accuracy: 0.9515550239234449
AUC: 0.8361779794790843
Precision: 0.9440993788819876
Recall: 0.6785714285714286
F1 Score: 0.7896103896103897


In [44]:
clf = SVC(gamma=0.25, C=0.8, random_state=42)
clf.fit(X_train1, y_train)

In [45]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672

Accuracy: 0.8660287081339713
AUC: 0.5
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


parameter variation: `negative`

In [46]:
model = Word2Vec(sentences, vector_size=100, window=5, negative=15, min_count=1, workers=4)

In [47]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(model.vector_size)  # fallback length matches the embedding size
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train1 = np.array([vectorize(sentence) for sentence in X_train])
X_test1 = np.array([vectorize(sentence) for sentence in X_test])

In [48]:
clf = RandomForestClassifier()
clf.fit(X_train1, y_train)

In [49]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96      1448
           1       0.93      0.44      0.60       224

    accuracy                           0.92      1672
   macro avg       0.93      0.72      0.78      1672
weighted avg       0.92      0.92      0.91      1672

Accuracy: 0.9210526315789473
AUC: 0.7185650157853196
Precision: 0.9339622641509434
Recall: 0.4419642857142857
F1 Score: 0.6


In [50]:
clf = SVC(gamma=0.25, C=0.8, random_state=42)
clf.fit(X_train1, y_train)

In [51]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      1448
           1       0.00      0.00      0.00       224

    accuracy                           0.87      1672
   macro avg       0.43      0.50      0.46      1672
weighted avg       0.75      0.87      0.80      1672

Accuracy: 0.8660287081339713
AUC: 0.5
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


Running the best values of `vector_size`, `window` and `negative` together

In [52]:
model = Word2Vec(sentences, vector_size=100, window=15, negative=15, min_count=1, workers=4)

In [53]:
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [model.wv[word] for word in words if word in model.wv]
    if len(words_vecs) == 0:
        return np.zeros(model.vector_size)  # fallback length matches the embedding size
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train1 = np.array([vectorize(sentence) for sentence in X_train])
X_test1 = np.array([vectorize(sentence) for sentence in X_test])

In [54]:
clf = RandomForestClassifier()
clf.fit(X_train1, y_train)

In [55]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1448
           1       0.91      0.69      0.78       224

    accuracy                           0.95      1672
   macro avg       0.93      0.84      0.88      1672
weighted avg       0.95      0.95      0.95      1672

Accuracy: 0.9485645933014354
AUC: 0.8382251381215471
Precision: 0.9058823529411765
Recall: 0.6875
F1 Score: 0.7817258883248731


In [56]:
clf = SVC(gamma=0.25, C=0.8, random_state=42)
clf.fit(X_train1, y_train)

In [57]:
y_pred = clf.predict(X_test1)
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

print(classification_report(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('AUC:', auc(fpr, tpr))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 Score:', f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      1.00      0.94      1448
           1       0.94      0.14      0.24       224

    accuracy                           0.88      1672
   macro avg       0.91      0.57      0.59      1672
weighted avg       0.89      0.88      0.84      1672

Accuracy: 0.8833732057416268
AUC: 0.568505820836622
Precision: 0.9393939393939394
Recall: 0.13839285714285715
F1 Score: 0.2412451361867704
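
To compare the Word2Vec settings side by side, the sketch below collects the Random Forest F1 scores for the spam class from the outputs above (rounded to four decimals) and prints a simple text bar chart. Neither Word2Vec nor the Random Forest is seeded in this notebook, so part of the differences may be run-to-run variation.

```python
# Random Forest F1 scores for the spam class, copied from the outputs above
# (vector_size is 100 in every run).
rf_f1 = {
    "window=5, negative=20": 0.6145,
    "window=15, negative=20": 0.7896,
    "window=5, negative=15": 0.6000,
    "window=15, negative=15": 0.7817,
}

# Print the configurations from best to worst with a proportional bar.
for config, score in sorted(rf_f1.items(), key=lambda item: item[1], reverse=True):
    bar = "#" * round(score * 40)
    print(f"{config:<24} {score:.4f} {bar}")

best_config = max(rf_f1, key=rf_f1.get)
print("Best:", best_config)
```

In these runs the two configurations with `window=15` score clearly higher than those with `window=5`, while changing `negative` from 20 to 15 makes little difference.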
