<a href="https://colab.research.google.com/github/heiniit/Colab/blob/master/spam_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a spam detector 

Just "Run all" :)

Load the dataset and do some preprocessing

In [0]:
!wget http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip smsspamcollection.zip

In [0]:
import pandas as pd

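# The file has no header row, so name the columns explicitly;
# otherwise the first message is consumed as the header and lost.
sms_df = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["class", "message"])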

# Numeric label: 1 for spam, 0 for ham.
sms_df["class_inx"] = (sms_df["class"] == "spam").astype(int)

Split the data into training and test sets

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(sms_df["message"], sms_df["class_inx"], test_size=0.3)

Build the model...

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = TfidfVectorizer()
# MultinomialNB accepts the sparse matrix directly, so no dense copy is needed.
train_vectors = vectorizer.fit_transform(X_train)

classifier = MultinomialNB()
classifier.fit(train_vectors, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

...classify test data...

In [0]:
# Reuse the vocabulary fitted on the training set; keep the result sparse.
test_vectors = vectorizer.transform(X_test)
predicted = classifier.predict(test_vectors)

... and compare to true values.

In [6]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test.values, predicted)

0.9569377990430622
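
Accuracy alone can look better than the classifier really is, because most messages in this dataset are ham. A confusion matrix separates the two kinds of error. The sketch below uses made-up toy labels (`toy_true` and `toy_pred` are placeholders, not results from this notebook) only to show how the four counts are read.

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test and predicted (1 = spam, 0 = ham).
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 1, 0]

# With labels=[0, 1], rows are the true classes and columns the predictions,
# so ravel() yields: true ham, false spam, missed spam, caught spam.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print(f"ham kept: {tn}, ham flagged as spam: {fp}")
print(f"spam missed: {fn}, spam caught: {tp}")
```

In the notebook, the same call should work with `y_test` and `predicted` in place of the toy lists.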

What to do next?
* The classifier is not perfect. Investigate which messages are misclassified.
* Does removing stop words (TfidfVectorizer stop_words parameter) improve the classification?
* The vectorizer uses single words by default. Try n-grams of different sizes (TfidfVectorizer ngram_range parameter). How does that affect the result?
* Instead of the naive Bayes classifier, try an SVM or a neural network classifier, for example. Which one works best?
* Can you tune classifier parameters to get even better accuracy?
* Try to classify your own messages. Can you construct messages that are misclassified?
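
As a possible starting point for the first three experiments, here is a minimal sketch. The helper `fit_and_report` and the toy messages are hypothetical, made up for illustration; the toy data is far too small to say anything about which setting is better. The helper trains a TF-IDF plus naive Bayes pipeline and returns the accuracy together with the misclassified messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def fit_and_report(vectorizer, X_train, y_train, X_test, y_test):
    """Hypothetical helper: train a pipeline, return (accuracy, errors)."""
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    # Collect (message, true label, predicted label) for every mistake.
    errors = [
        (message, int(truth), int(guess))
        for message, truth, guess in zip(X_test, y_test, predicted)
        if truth != guess
    ]
    accuracy = 1 - len(errors) / len(predicted)
    return accuracy, errors


# Toy stand-in data (1 = spam, 0 = ham); not taken from the real dataset.
toy_train = [
    "win a free prize now",
    "free cash prize claim now",
    "are we still meeting for lunch",
    "see you at home tonight",
]
toy_labels = [1, 1, 0, 0]
toy_test = ["claim your free prize", "lunch at home"]
toy_truth = [1, 0]

# Vectorizer settings to compare: default words, stop words removed, 1-2 grams.
settings = {
    "words": TfidfVectorizer(),
    "no stop words": TfidfVectorizer(stop_words="english"),
    "1-2 grams": TfidfVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vectorizer in settings.items():
    results[name] = fit_and_report(vectorizer, toy_train, toy_labels, toy_test, toy_truth)
    print(name, results[name])
```

In the notebook, the helper could be called with `X_train`, `y_train`, `X_test` and `y_test` in place of the toy lists. Swapping `MultinomialNB()` for another classifier inside the pipeline is one way to approach the fourth experiment.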

