<a href="https://colab.research.google.com/github/hend-isleem/ml-spam-classification/blob/main/Spam_SMS_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1 - Packages

In [None]:
pip install -U nltk

Collecting nltk
  Downloading https://files.pythonhosted.org/packages/5e/37/9532ddd4b1bbb619333d5708aaad9bf1742f051a664c3c6fa6632a105fd8/nltk-3.6.2-py3-none-any.whl (1.5MB)
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.6.2


In [None]:
from sklearn.metrics import accuracy_score
import csv
import nltk
import re
from sklearn.cluster import KMeans
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfTransformer

## 2 - Overview of the Idea

 A classic beginner machine learning project that uses Python libraries to cluster a dataset of SMS messages into 'spam' and 'ham' with **k-means**.
 
 The dataset is a collection of 5,574 SMS messages taken from the UCI Machine Learning Repository; each message is to be tagged as "spam" or "ham".

 We use the **tf-idf approach** to turn the text into feature vectors for clustering. It down-weights words that appear in many messages, and it is a widely used weighting scheme that generally gives good accuracy on text.
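To make the tf-idf weighting concrete, here is a minimal standard-library sketch. It is not the notebook's code: the notebook imports scikit-learn's `TfidfTransformer`, and this sketch only mirrors the smoothed formula documented for that class's defaults (raw term count times `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation). The function name `tfidf` and the toy messages are made up for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # Scale each document vector to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Toy, made-up messages (already tokenised).
docs = [["free", "prize", "call"],
        ["call", "me", "later"],
        ["call", "free", "now"]]
weights = tfidf(docs)
print(weights[0])
```

In the first message, "call" occurs in all three documents and so gets the lowest weight, while "prize" occurs only once in the collection and gets the highest.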



## 3 - Overview of the Work Pipeline:

* Loading Data
* Preprocessing
* Feature Selection
* Feature Vector Modelling 
* k-means clustering and evaluation
* Showing Results 
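One detail of the "k-means clustering and evaluation" step is worth sketching. k-means returns arbitrary cluster ids (0 or 1), so comparing them with the spam/ham tags requires choosing which cluster stands for which tag. The sketch below is an assumption about how that can be done, not the notebook's own evaluation code (the notebook imports `accuracy_score` and may proceed differently). The function name `clustering_accuracy` and the sample labels are hypothetical.

```python
def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy of a 2-cluster assignment under the better of the two label mappings."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster lists must have the same length")
    # Mapping A: cluster 1 -> "spam", cluster 0 -> "ham".
    hits = sum((c == 1) == (y == "spam")
               for y, c in zip(true_labels, cluster_ids))
    acc = hits / len(true_labels)
    # Mapping B is the mirror image, so its accuracy is 1 - acc.
    return max(acc, 1 - acc)

# Made-up example: the clusters line up with the tags only after swapping ids.
truth = ["ham", "ham", "spam", "ham", "spam"]
clusters = [1, 1, 0, 1, 1]
score = clustering_accuracy(truth, clusters)
print(score)
```

Taking the better of the two mappings is only reasonable for two clusters; with more clusters a proper assignment step would be needed.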

## **Loading Data**

In [None]:
#Reading the spam collection dataset
def reading_training_file():
    print("Reading messages from dataset...")
    documents = []
    with open("sms_dataset.csv") as csvfile:
        rows = csv.reader(csvfile)
        next(rows, None) #Skip column headers
        for row in rows:
            message = row[1]
            documents.append(message) #messages are appended to the list 'documents'
    print("Finished reading messages and appended to the list..")
    return documents
print('length of documents: ', len(reading_training_file()))

Reading messages from dataset...
Finished reading messages and appended to the list..
length of documents:  5574
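The reader above keeps only the message text. For the later evaluation step it can help to keep the tags as well. The sketch below shows one way to do that with `csv.reader`, under a loud assumption: the tag is taken to be in column 0, which the notebook does not confirm (it only shows that the message is in column 1). The function name `read_labels_and_messages` and the in-memory sample rows are made up so the example runs without `sms_dataset.csv`.

```python
import csv
import io

# Made-up rows standing in for sms_dataset.csv; the column order is an assumption.
sample = io.StringIO(
    "label,message\n"
    "ham,See you at lunch\n"
    'spam,"WINNER!! Claim your prize, call now"\n'
)

def read_labels_and_messages(csvfile):
    rows = csv.reader(csvfile)
    next(rows, None)  # skip column headers
    labels, documents = [], []
    for row in rows:
        labels.append(row[0])     # assumed tag column ("spam" / "ham")
        documents.append(row[1])  # message column, as in the notebook
    return labels, documents

labels, documents = read_labels_and_messages(sample)
print(labels)
print(documents)
```

Because `csv.reader` handles quoting, the comma inside the second message does not split it into two fields.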


## **Pre-processing**

In [None]:
#Preprocessing steps including tokenization
#Note: word_tokenize and the stopword list rely on NLTK data packages fetched with nltk.download()
def preprocessing(documents):
    print("Preprocessing the messages for clustering...")
    vocab_glob = {} # maps each stemmed token to its integer index
    final_documents = [] # one list of token indices per message
    stop_words = set(nltk.corpus.stopwords.words('english')) # stopword list, loaded once
    p = PorterStemmer() # stemming tokenized documents using Porter Stemmer
    for document in documents:
        text = document.replace("</p>", "") # removing </p>
        text = text.replace("<p>", " ") # removing <p>
        text = text.replace("http", " ")
        text = text.replace("www", " ")
        text = re.sub(r'([a-z])\1+', r'\1', text) # collapse runs of a repeated lowercase letter
        text = re.sub(r'\s+', ' ', text) # collapse runs of whitespace
        text = re.sub(r'\.+', '.', text) # collapse runs of dots
        text = re.sub(r"(?:@|'|https?://)\s+", "", text) # drop '@', apostrophes and URL schemes that are followed by whitespace
        text = re.sub("[^a-zA-Z]", " ", text) # replace every non-letter with a space
        text = re.sub(r'[^\w\s]', '', text) # remove punctuation
        text = re.sub(r"\d+", "", text) # remove number from text
        tokens_text = nltk.word_tokenize(text) # tokenizing the documents
        tokens_text = [w for w in tokens_text if w.lower() not in stop_words] # stopword reduction
        tokens_text = [w.lower() for w in tokens_text] # convert to lower case
        tokens_text = [w for w in tokens_text if len(w) > 2] # considering tokens with length>2 (meaningful words)
        tokens_text = [p.stem(w) for w in tokens_text]
        token_ind = []
        for token in tokens_text:
            if token not in vocab_glob:
                vocab_glob[token] = len(vocab_glob) # next unused index, starting at 0
            token_ind.append(vocab_glob[token])
        final_documents.append(token_ind)
    print("Finished pre-processing words..")
    return vocab_glob, final_documents
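
To make the two return values of `preprocessing` concrete, the sketch below repeats only its vocabulary-indexing step on tokens that are already cleaned and stemmed, so it runs without NLTK. The second helper, which turns the index lists into per-message count rows, is an assumption about what a later feature-vector step could look like and is not taken from the notebook. The names `index_tokens` and `to_count_rows` and the toy tokens are hypothetical.

```python
def index_tokens(tokenized_docs):
    """Assign each new token the next free integer and encode documents as index lists."""
    vocab = {}
    indexed = []
    for tokens in tokenized_docs:
        ids = []
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)  # indices start at 0
            ids.append(vocab[token])
        indexed.append(ids)
    return vocab, indexed

def to_count_rows(indexed, vocab_size):
    """Turn index lists into dense term-count rows (one row per document)."""
    rows = []
    for ids in indexed:
        row = [0] * vocab_size
        for i in ids:
            row[i] += 1
        rows.append(row)
    return rows

vocab, indexed = index_tokens([["free", "prize", "call"], ["call", "free", "free"]])
counts = to_count_rows(indexed, len(vocab))
print(vocab)    # {'free': 0, 'prize': 1, 'call': 2}
print(indexed)  # [[0, 1, 2], [2, 0, 0]]
print(counts)   # [[1, 1, 1], [2, 0, 1]]
```

A count matrix of this shape is the kind of input a tf-idf weighting step expects. For a vocabulary of realistic size, a sparse matrix would be a better fit than dense lists.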

