# Spam Classification

A simple application that demonstrates the Naive Bayes algorithm for multinomial models, pandas for reading datasets, and other scikit-learn modules.

## Installation

``Step 1``

Clone this repository. 
```shell
git clone https://github.com/grayoj/spam-detection.git
```

Install the <a href="http://python.org">Python Programming Language</a>

``Step 2``
Using pip, install <a href="http://streamlit.com">Streamlit</a>, which serves the application.
```shell
$ pip install streamlit
```

``Step 3``
Install pandas to read the dataset.
```shell
$ pip install pandas
```

``Step 4``
Install scikit-learn, which contains the modeling and evaluation modules used in the project.
```shell
$ pip install scikit-learn
```

``Step 5``
Install NumPy for mathematical functions.
```shell
$ pip install numpy
```

``Step 6``
Install the Natural Language Toolkit (NLTK).
```shell
$ pip install nltk
```

If you use Pylance, it should now resolve all of the imports used in the project.
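For readers without Pylance, a rough way to confirm the installation steps above is to ask Python whether each package can be found. This is only a sketch: the `REQUIRED` list and the `check_imports` helper are illustrative names, not part of the project. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib.util

# Import names used by this project (illustrative list, not from the repository).
REQUIRED = ["streamlit", "pandas", "sklearn", "numpy", "nltk"]


def check_imports(names):
    """Return a dict mapping each module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_imports(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'ok' if found else 'missing'}")
```

Any package reported as missing can be installed with the matching `pip install` step above.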

In [None]:
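# Import the string module for its punctuation constants.
import string

# Import the helper that splits data into training and test sets.
from sklearn.model_selection import train_test_split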

# Import pandas to read CSV files.
import pandas as pd

# Import natural language toolkit.
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

# Import scikit-learn
import sklearn

# Import Naive Bayes Module
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Import module to display accuracy
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Multinomial NB
Multinomial Naive Bayes is a classifier suited to discrete features, such as the word counts used in text classification.
See the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html">scikit-learn documentation</a> for details.
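As a minimal sketch of how the classifier behaves on count features, the example below fits `MultinomialNB` on a made-up two-word vocabulary. The data, the column meanings, and the variable names are assumptions for illustration and do not come from the project's dataset.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: column 0 counts the word "free", column 1 counts "lunch".
X = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

clf = MultinomialNB()  # uses additive smoothing (alpha=1.0) by default
clf.fit(X, y)

# A new message that contains "free" four times and "lunch" never.
new_message = np.array([[4, 0]])
predicted = clf.predict(new_message)[0]
probabilities = clf.predict_proba(new_message)[0]  # ordered as clf.classes_
print(predicted, probabilities)
```

Because "free" only appears in the spam rows, the new message is assigned to class 1, and smoothing keeps the ham probability small but non-zero.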

## Loading the Dataset

In [None]:

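# Load the raw CSV file.
df = pd.read_csv('dataset/data.csv')
# Drop the unnamed extra columns and give the remaining ones readable names.
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df.rename(columns={'v1': 'labels', 'v2': 'message'}, inplace=True)
# Remove duplicate rows, then encode the labels as 0 (ham) and 1 (spam).
df.drop_duplicates(inplace=True)
df['labels'] = df['labels'].map({'ham': 0, 'spam': 1})
print(df.head())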
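To see what these steps do without the real file, the sketch below runs the same rename, de-duplicate, and label-mapping sequence on a hypothetical three-row sample held in memory. The sample rows are invented; only the `v1`/`v2` column names follow the code above.

```python
import io

import pandas as pd

# Hypothetical sample in the same v1/v2 layout as dataset/data.csv.
raw = io.StringIO(
    "v1,v2\n"
    "ham,See you at lunch\n"
    "spam,Win a free prize now\n"
    "spam,Win a free prize now\n"
)

sample = pd.read_csv(raw)
sample = sample.rename(columns={"v1": "labels", "v2": "message"})
sample = sample.drop_duplicates()  # the repeated spam row is removed
sample["labels"] = sample["labels"].map({"ham": 0, "spam": 1})
print(sample)
```

After these steps the sample holds two rows, with `labels` stored as integers rather than strings.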

## Cleaning the Dataset

In [None]:
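# Build the stop-word set once instead of on every call.
stop_words = set(stopwords.words('english'))

def clean_data(message):
    # Strip punctuation characters from the message.
    message_without_punc = [character for character in message if character not in string.punctuation]
    message_without_punc = ''.join(message_without_punc)

    # Drop English stop words and rejoin the remaining words with spaces.
    separator = ' '
    return separator.join([word for word in message_without_punc.split() if word.lower() not in stop_words])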

df['message'] = df['message'].apply(clean_data)
x = df['message']
y = df['labels']
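The effect of the cleaning step can be sketched without downloading the NLTK corpus. The example below uses a tiny, hypothetical stop-word set in place of NLTK's English list, and `clean_text` is an illustrative name rather than the project's function.

```python
import string

# Hypothetical stop-word set for illustration; the project uses NLTK's English list.
STOP_WORDS = {"a", "the", "is", "to", "you", "have"}


def clean_text(message: str) -> str:
    """Remove punctuation, then drop stop words and rejoin with single spaces."""
    no_punc = message.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punc.split() if word.lower() not in STOP_WORDS]
    return " ".join(kept)


cleaned = clean_text("Congratulations! You have WON a prize, call now.")
print(cleaned)  # Congratulations WON prize call now
```

Joining with a space matters: with an empty separator the remaining words would fuse into one long token, and the vectorizer would no longer see individual words.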

## Implement Count Vectorizer
This converts a collection of text documents to a matrix of token counts.

In [None]:
cv = CountVectorizer()

## Train Model

In [None]:
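# Fit the vectorizer and convert the cleaned messages to a token-count matrix.
x = cv.fit_transform(x)

# Hold out 20% of the data for testing; random_state makes the split reproducible.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Train the classifier and predict labels for the held-out messages.
model = MultinomialNB().fit(x_train, y_train)
predictions = model.predict(x_test)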

print(accuracy_score(y_test, predictions))
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
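To help read the printed metrics, here is a sketch on four made-up labels showing how the confusion matrix relates to accuracy. The label lists are invented and are not results from this project.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predictions (0 = ham, 1 = spam).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()
accuracy = accuracy_score(y_true, y_pred)

print(matrix)
print(f"true negatives={tn}, false positives={fp}, false negatives={fn}, true positives={tp}")
print(f"accuracy={accuracy}")
```

Here one ham message is wrongly flagged as spam (a false positive), so three of four predictions are correct and the accuracy is 0.75.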

## Predict Text
Predict whether a message is spam and return the result as a string.

In [None]:
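def predict(text):
    # text is assumed to be a single message string.
    labels = ['This is not spam', 'This is spam']
    # Clean the text like the training data, then reuse the fitted vectorizer;
    # refitting would replace the vocabulary the model was trained on.
    x = cv.transform([clean_data(text)])
    # predict() returns one label (0 or 1) per input message.
    v = int(model.predict(x)[0])
    return labels[v]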
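The sketch below shows the prediction step end to end on a made-up training set, to illustrate why the vectorizer is fitted once and then only applied with `transform` at prediction time. The training texts, labels, and the `classify` helper are assumptions for illustration and are not the project's data or API.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages (1 = spam, 0 = ham).
train_texts = [
    "win free prize now",
    "free cash offer",
    "see you at lunch",
    "meeting moved to monday",
]
train_labels = [1, 1, 0, 0]

# Fit the vocabulary and the model once, on the training data only.
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)


def classify(text):
    """Return 1 for spam or 0 for ham, using the already fitted vocabulary."""
    features = vec.transform([text])  # transform, not fit_transform
    return int(clf.predict(features)[0])


print(classify("free prize offer"))
print(classify("see you monday lunch"))
```

Calling `fit_transform` inside the prediction function would rebuild the vocabulary from the single input message, so the feature columns would no longer match the ones the model was trained on.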