<a href="https://colab.research.google.com/github/hend-isleem/ml-spam-classification/blob/main/Spam_SMS_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **SMS Spam Detection**


---


Ikhlas Jihad El-Aydi 220170458

Hend Tarek Isleem     220171091

## 1 - Overview of the Idea

A classic beginner machine learning project that uses Python libraries to cluster a dataset of SMS messages into 'spam' and 'ham' with **k-means**.

The dataset is a collection of 5,574 SMS messages taken from the UCI Machine Learning Repository; each message is to be tagged as "spam" or "ham".

The true labels, spam (0) or ham (1), are provided in the file 'accurate_results.csv', and the clustering results are compared against them to compute the accuracy.

We use the **tf-idf approach** to build the feature vectors for clustering, as it has been shown to give high prediction accuracy.



## 2 - Overview of the Work Pipeline:


* Loading Data
* Preprocessing
* Feature Selection
* Feature Vector Modelling 
* k-means clustering and evaluation
* Showing Results 

## 3 - Packages

In [2]:
pip install -U nltk

Requirement already up-to-date: nltk in /usr/local/lib/python3.7/dist-packages (3.6.2)


In [3]:
from sklearn.metrics import accuracy_score
import csv
import nltk
import re
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns
# %matplotlib inline
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfTransformer

## **Loading Data**

In [4]:
#Reading the spam collection dataset
def reading_training_file():
    print("Reading messages from dataset...")
    documents = []
    with open("sms_dataset.csv") as csvfile:
        rows = csv.reader(csvfile)
        next(rows, None) #Skip column headers
        for row in rows:
            message = row[1]
            documents.append(message) #messages are appended to the list 'documents'
    print("Finished reading messages and appended to the list..")
    return documents
documents = reading_training_file()
print('length of documents: ', len(documents))

Reading messages from dataset...
Finished reading messages and appended to the list..
length of documents:  5574


In [5]:
# doing so with pandas Dataframe
data = pd.read_csv('sms_dataset.csv')
df = pd.DataFrame(data=data)
df.head()

Unnamed: 0,SMS_id,SMS
0,1,"\tGo until jurong point, crazy.. Available on..."
1,2,\tOk lar... Joking wif u oni...\n
2,3,\tFree entry in 2 a wkly comp to win FA Cup f...
3,4,\tU dun say so early hor... U c already then ...
4,5,"\tNah I don't think he goes to usf, he lives ..."


## **Pre-processing**

1. Remove non-alphabetic characters, using `re` (Python's built-in regex library).
2. Tokenization: split longer strings of text into smaller pieces, or tokens.     
  Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc.
3. Stop-word removal: remove stop words (such as a, the, an, to, of) with NLTK.
4. Stemming: strip affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain its stem.
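
As a rough illustration of steps 1 to 3, the sketch below cleans and tokenizes a message using only the standard library. The `TOY_STOP_WORDS` set and the `clean_and_tokenize` helper are hypothetical stand-ins introduced here; the notebook itself relies on NLTK's tokenizer and stop-word list, and stemming (step 4) is left out because it needs NLTK's `PorterStemmer`.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses nltk.corpus.stopwords.words('english').
TOY_STOP_WORDS = {"a", "an", "the", "to", "of", "in", "is", "on"}

def clean_and_tokenize(message):
    """Keep letters only, split on whitespace, drop stop words and short tokens."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", message)   # step 1: non-letters -> space
    tokens = letters_only.lower().split()               # step 2: naive whitespace tokenization
    return [t for t in tokens
            if t not in TOY_STOP_WORDS and len(t) > 2]  # step 3: stop words and short tokens

print(clean_and_tokenize("Free entry in 2 a wkly comp to win the FA Cup!"))
# ['free', 'entry', 'wkly', 'comp', 'win', 'cup']
```

Splitting on whitespace is a simplification; `nltk.word_tokenize` handles punctuation and contractions more carefully.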

In [6]:
nltk.download('punkt')
nltk.download('stopwords')
# Preprocessing: cleaning, tokenization, stop-word removal, stemming, vocabulary indexing
def preprocessing(documents):
    print ("Precprocessing the messages for clustering...")
    vocab_glob = {}        # stemmed token -> integer index
    final_documents = []   # one list of token indices per message
    # Build the stop-word set and the stemmer once instead of once per message
    stop_words = set(nltk.corpus.stopwords.words('english'))
    p = PorterStemmer()
    for document in documents:
        text = document.replace("</p>", "")  # removing </p>
        text = text.replace("<p>", " ")      # removing <p>
        text = text.replace("http", " ")
        text = text.replace("www", " ")
        text = re.sub(r'([a-z])\1+', r'\1', text)  # collapse repeated lowercase letters ("soooo" -> "so")
        text = re.sub(r'\s+', ' ', text)           # collapse runs of whitespace
        text = re.sub(r'\.+', '.', text)           # collapse runs of dots
        text = re.sub(r"(?:\@|'|https?\://)\s+", "", text)  # drop @, ' or a URL scheme followed by whitespace
        text = re.sub(r"[^a-zA-Z]", " ", text)     # replace every non-letter with a space
        text = re.sub(r'[^\w\s]', '', text)        # remove punctuation
        text = re.sub(r"\d+", "", text)            # remove numbers from text
        tokens_text = nltk.word_tokenize(text)     # tokenizing the document
        tokens_text = [w for w in tokens_text if w.lower() not in stop_words]  # stop-word removal
        tokens_text = [w.lower() for w in tokens_text]        # convert to lower case
        tokens_text = [w for w in tokens_text if len(w) > 2]  # keep tokens longer than 2 characters
        tokens_text = [p.stem(w) for w in tokens_text]        # Porter stemming
        token_ind = []
        for token in tokens_text:
            if token not in vocab_glob:
                vocab_glob[token] = len(vocab_glob)  # next unused index
            token_ind.append(vocab_glob[token])
        final_documents.append(token_ind)
    print ("Finished pre-processing words..")
    return vocab_glob,final_documents
vocab_glob,final_documents =  preprocessing(documents)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Precprocessing the messages for clustering...
Finished pre-processing words..


## **Feature Selection**

Feature selection reduces the number of input variables to those believed to be most useful to the model.
Among all the tokenized words, the most frequent ones are better candidates for separating the classes than rare words, which can be ignored.

We use a **band-pass filter** idea, which keeps only the words whose frequencies fall within a certain range and rejects the rest. In the `feature_selection` function only the lower cutoff is applied: a word is kept when it occurs more than 14 times.
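
The frequency filter described above can be sketched in a few lines with `collections.Counter`. The helper name `select_by_frequency`, its `low`/`high` parameters and the toy documents are assumptions made for this example; the optional `high` argument shows what a two-sided (band-pass) cutoff might look like.

```python
from collections import Counter

def select_by_frequency(token_docs, low, high=None):
    """Return tokens whose total count is > low and, if high is given, <= high."""
    counts = Counter(tok for doc in token_docs for tok in doc)
    return sorted(
        tok for tok, n in counts.items()
        if n > low and (high is None or n <= high)
    )

# Toy data: call occurs 4 times, free and win twice, home once.
docs = [["free", "win", "call"], ["call", "free", "call"], ["call", "home", "win"]]
print(select_by_frequency(docs, low=1))          # ['call', 'free', 'win']
print(select_by_frequency(docs, low=1, high=3))  # ['free', 'win']
```

An upper cutoff can be useful when very common words carry little signal, although stop-word removal already handles many of those.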


In [7]:
# Feature engineering: selecting the most frequent words
def feature_selection(final_documents):
    print("Feature selection started...")
    # Total number of occurrences of each token index across all messages
    doc_freq = {}
    for document in final_documents:
        for index in document:
            if index in doc_freq:
                doc_freq[index] += 1
            else:
                doc_freq[index] = 1
    # Build the feature list once, after counting is complete
    top_features = []
    for token in doc_freq.keys():
        if doc_freq[token] > 14:    # removing low-frequency words (band-pass filtering) (by trial: 2 * avg_freq)
            top_features.append(token)
    print('avg_freq = ', sum(doc_freq.values()) / len(doc_freq))
    # Map feature position -> token index
    top_words = {}
    i = 0
    for token in top_features:
        top_words[i] = token
        i += 1
    print("Feature selection Done...")
    return top_words
top_words = feature_selection(final_documents)

Feature selection started...
avg_freq =  7.228547854785479
Feature selection Done...


## **Feature Vector Modelling**

In [9]:
# tf-idf approach: build the term-count matrix, then weight it with scikit-learn's TfidfTransformer (used for obtaining high accuracy)
def feature_matrix_tfidf(top_words,final_documents):
    print("Calculating tfidf weight using tfidf transformer...")
    print(final_documents)
    indexes_features = top_words.values()
    print(indexes_features)
    # Token index -> column position, so each lookup is O(1)
    position = {token: col for col, token in enumerate(indexes_features)}
    columns = []   # one count vector per message (rows of the matrix, despite the name)

    for val in final_documents:
        feature_vector = [0] * (len(indexes_features))
        for j in val:
            if j in position:
                feature_vector[position[j]] = val.count(j)
        columns.append(feature_vector)

    # norm=None is the documented way to disable normalization
    tfidf = TfidfTransformer(norm=None, use_idf=True, sublinear_tf=True, smooth_idf=True)
    tfidf.fit(columns)
    tfidf_matrix = tfidf.transform(columns)
    return tfidf_matrix
test1 = feature_matrix_tfidf(top_words, final_documents)
print ("test1 length: " ,  test1.shape)

Calculating tfidf weight using tfidf transformer...
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 17, 27, 28, 29, 30, 31], ...
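
To make the weighting concrete, here is a small standard-library sketch of the formula that, as far as the scikit-learn documentation describes it, corresponds to `sublinear_tf=True`, `smooth_idf=True` and no normalization. The function name `tfidf_weights` and the toy count matrix are made up for this example; it is an illustration of the formula, not a replacement for `TfidfTransformer`.

```python
import math

def tfidf_weights(count_rows):
    """Unnormalized tf-idf with sublinear tf and smoothed idf.

    tf  = 1 + ln(count)   for count > 0, else 0
    idf = ln((1 + n_docs) / (1 + df)) + 1
    """
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # df[j]: number of documents in which term j appears at least once
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [
        [(1 + math.log(c)) * idf[j] if c > 0 else 0.0 for j, c in enumerate(row)]
        for row in count_rows
    ]

# Toy count matrix: 3 messages, 2 terms (assumed data).
counts = [[2, 0], [1, 1], [1, 0]]
for row in tfidf_weights(counts):
    print([round(w, 3) for w in row])
```

In this toy matrix the first term appears in every message, so its idf is exactly 1, while the second term appears in only one message and is weighted up.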

## **2-means clustering and evaluation**

In [11]:
# Clustering the tf-idf matrix using the k-means algorithm from scikit-learn
def kmeans(f_vector):
    print ("Clustering the matrix...")
    num_clusters = 2
    km = KMeans(n_clusters=num_clusters, random_state=99, init='k-means++', n_init=14, max_iter=100, tol=0.001, copy_x=True)
    y_kmeans = km.fit_predict(f_vector)   # fits the model and returns one cluster id per message
    print("typeof y_kmeans : ", type(y_kmeans)) #ndarray
    clusters = y_kmeans.tolist()
    df['cluster'] = y_kmeans   # side effect: adds a 'cluster' column to the global DataFrame df
    print ("Results of Clustering:")
    print (clusters[:5])
    print ("Length of results:", len(clusters))
    return clusters
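
For intuition about what the clustering call does, the following is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) for two clusters in plain Python. It assumes hand-picked starting centroids and toy 2-D points, and it omits the `k-means++` initialization and the repeated restarts (`n_init`) that scikit-learn performs; `two_means` is a hypothetical helper, not part of the notebook.

```python
def two_means(points, init_centroids, max_iter=100):
    """Lloyd's algorithm for k = 2; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]  # copy so the caller's list is untouched
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(2), key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        if new_labels == labels:   # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(2):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centroids = two_means(points, init_centroids=[[0.0, 0.0], [6.0, 5.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [5.5, 5.0]]
```

Which cluster receives id 0 and which receives id 1 depends entirely on the starting centroids, which is why the cluster ids cannot be read as class labels directly.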

## **Utilities Functions to Write/Read Results and Evaluation**

In [12]:
# Writing clustering results to a file with headers and serial numbers
def write_to_csv(clusters):
    print ("opening the file to save clustering results.....")
    file_save = "final_results.csv"
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(file_save, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(["SMS_id","label"])
        numbers = list(range(1, len(clusters)+1))
        print ("writing to the file....")
        for row in zip(numbers, clusters):
            writer.writerow(row)
        print ("Completed writing process")

    # Write the cluster labels alone, without headers, to a second file
    # so they can be compared with 'accurate_results.csv' to calculate the accuracy
    f_save = "clustering_results.csv"
    with open(f_save, 'w', encoding='UTF8', newline='') as f:
        writer = csv.writer(f)
        for row in clusters:
            writer.writerow([row])

In [13]:
# Calculating the accuracy of the results by comparing them with the true labels.
# In the accurate_results file, ham messages are denoted by '1' and spam messages by '0'.
# k-means assigns cluster ids arbitrarily, so different runs may label ham as 1 and spam as 0, or vice versa.
# Before comparing, make sure the results file uses ham as '1' and spam as '0', converting the labels if necessary.
def read_accuracy_file():
    list_prediction = []
    list_true = []
    # Predicted cluster labels; rows that are not '0' or '1' are skipped
    with open("clustering_results.csv", "r") as f_pre:
        for row in csv.reader(f_pre):
            if row[0] in ("0", "1"):
                list_prediction.append(int(row[0]))

    # True labels
    with open("accurate_results.csv", "r") as f:
        for row in csv.reader(f):
            if row[0] in ("0", "1"):
                list_true.append(int(row[0]))
    return list_prediction, list_true
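
The comments above point out that the cluster ids may come out swapped relative to the true labels. One possible way to handle this, sketched below under the assumption that both lists contain only 0 and 1, is to score both label mappings and keep the better one. The helper `aligned_accuracy` is hypothetical and is not called anywhere in the notebook.

```python
def aligned_accuracy(y_true, y_pred):
    """Accuracy of a 2-cluster labelling under the better of the two label mappings."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    # With binary labels, flipping every prediction turns each mismatch into a match.
    best = max(matches, n - matches)
    return best / n

y_true = [1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0]   # mostly the same grouping, but with swapped ids
print(aligned_accuracy(y_true, y_pred))  # 0.8
```

Taking the better mapping is slightly optimistic by construction, so the score should be read as agreement between the clusters and the classes rather than as a classifier accuracy.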


## **Main Function**

In [14]:
#Main function
def main():
    documents = reading_training_file()
    vocab_glob,final_documents=preprocessing(documents)
    top_words= feature_selection(final_documents)
    print(len(top_words.keys()) , " features")
    #feature_vector= features_vector_binary(vocab_glob,final_documents)
    #freq_matrix=feature_matrix_term_frequency(top_words, final_documents)
    feature_vector=feature_matrix_tfidf(top_words, final_documents)  
    print(feature_vector)
    print("")
    clusters = kmeans(feature_vector)
    print("\nOriginal Results:")
    original_results = pd.read_csv('accurate_results.csv')
    print(original_results['1'].values[:5])
    write_to_csv(clusters)
    y_pre, list_true = read_accuracy_file()
    accuracy =  accuracy_score(list_true, y_pre)
    print ("Accuracy score:", accuracy)
main()

Reading messages from dataset...
Finished reading messages and appended to the list..
Precprocessing the messages for clustering...
Finished pre-processing words..
Feature selection started...
avg_freq =  7.228547854785479
Feature selection Done...
587  features
Calculating tfidf weight using tfidf transformer...
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 17, 27, 28, 29, 30, 31], ...

In [15]:
df.head(10)

Unnamed: 0,SMS_id,SMS,cluster
0,1,"\tGo until jurong point, crazy.. Available on...",1
1,2,\tOk lar... Joking wif u oni...\n,1
2,3,\tFree entry in 2 a wkly comp to win FA Cup f...,0
3,4,\tU dun say so early hor... U c already then ...,1
4,5,"\tNah I don't think he goes to usf, he lives ...",1
5,6,\tFreeMsg Hey there darling it's been 3 week'...,1
6,7,\tEven my brother is not like to speak with m...,1
7,8,\tAs per your request 'Melle Melle (Oru Minna...,1
8,9,\tWINNER!! As a valued network customer you h...,0
9,10,\tHad your mobile 11 months or more? U R enti...,0
