# Building a SPAM Filter for SMS

In this project, we build the logic of a SPAM filter for SMS messages using the Naive Bayes algorithm.

To train the algorithm, we use a data set of 5,572 SMS messages already classified by humans, which can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).
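
For reference, the multinomial Naive Bayes rule with additive smoothing can be written as below. The symbols are notation chosen here to mirror the variable names used in the notebook's code (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`); they are an assumption about presentation, not something taken from the data set's documentation.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}}
```

The same two expressions hold with Ham in place of Spam, and a message receives the label whose expression is larger.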

In [1]:
import pandas as pd

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label','SMS'])
df.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
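
The file is read as tab-separated text with no header row, so the column names are supplied by hand. A minimal sketch of the same `read_csv` call on two made-up rows held in memory (the `raw` string and the `sample_df` name are illustrative and not part of the notebook):

```python
import io

import pandas as pd

# Two made-up rows in the same layout as the data file:
# label, a tab character, then the message text, with no header row.
raw = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

# header=None stops pandas from treating the first message as column names.
sample_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=['Label', 'SMS'])
print(sample_df)
```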


In [2]:
df.shape

(5572, 2)

In [3]:
df['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [4]:
random_df = df.sample(frac=1,random_state=1)
random_df

|   | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |
| 5461 | ham | Ok i thk i got it. Then u wan me 2 come now or... |
| 4210 | ham | I want kfc its Tuesday. Only buy 2 meals ONLY ... |
| 4216 | ham | No dear i was sleeping :-P |
| 1603 | ham | Ok pa. Nothing problem:-) |
| 1504 | ham | Ill be there on &lt;#&gt; ok. |


In [5]:
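# reset_index() returns a new DataFrame (the old row labels move to an 'index' column),
# so later column assignments do not act on a slice of random_df.
training_df = random_df.iloc[0:4458].reset_index()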
training_df.shape

(4458, 3)

In [6]:
# Same pattern as the training set: build a new DataFrame instead of modifying a slice in place.
test_df = random_df.iloc[4458:].reset_index()
test_df.shape

(1114, 3)
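
The cut at row 4458 corresponds to an 80/20 split of the 5,572 shuffled messages (5,572 × 0.8 = 4,457.6, rounded to 4,458). A minimal sketch of computing the cut point instead of hard-coding it, shown on a made-up 10-row frame; the names `toy_df`, `split_point`, `train_part` and `test_part` are illustrative, and `drop=True` is a variation that discards the old index rather than keeping it as a column as the notebook does:

```python
import pandas as pd

# Toy frame standing in for the full data set.
toy_df = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {n}' for n in range(10)],
})

# Shuffle reproducibly, then cut at 80% of the rows.
shuffled = toy_df.sample(frac=1, random_state=1)
split_point = round(len(shuffled) * 0.8)

# drop=True discards the shuffled row labels instead of adding an 'index' column.
train_part = shuffled.iloc[:split_point].reset_index(drop=True)
test_part = shuffled.iloc[split_point:].reset_index(drop=True)

print(train_part.shape, test_part.shape)
```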

In [7]:
training_df['Label'].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [8]:
test_df['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [9]:
# Replace every non-word character with a space, then lowercase.
# regex=True is needed because recent pandas versions treat the pattern as a literal string by default.
training_df['SMS'] = training_df['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
training_df.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | yep by the pretty sculpture |
| 1 | 4028 | ham | yes princess are you going to make me moan |
| 2 | 958 | ham | welp apparently he retired |
| 3 | 4642 | ham | havent |
| 4 | 4674 | ham | i forgot 2 ask ü all smth there s a card on ... |


In [10]:
# Apply the same cleaning to the test set (regex=True so that \W is read as a pattern).
test_df['SMS'] = test_df['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
test_df.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 2131 | ham | later i guess i needa do mcat study too |
| 1 | 3418 | ham | but i haf enuff space got like 4 mb |
| 2 | 3424 | spam | had your mobile 10 mths update to latest oran... |
| 3 | 1538 | ham | all sounds good fingers makes it difficult ... |
| 4 | 5393 | ham | all done all handed in don t know if mega sh... |
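
The cleaning step replaces every non-word character with a space and lowercases the text, which is why `don't` shows up as `don t` in the table. A minimal sketch of the same transformation on a single string using the standard `re` module; the helper name `tokenize` is hypothetical and does not appear in the notebook:

```python
import re


def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()


print(tokenize("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```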


In [11]:
# str.split() already yields a Python list of words per message, so no further cast is needed.
training_df['SMS'] = training_df['SMS'].str.split()

In [12]:
vocabulary = []
for i in training_df['SMS']:
    for j in i:
        vocabulary.append(j)
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
vocabulary[0:10]

['airport',
 'prepaid',
 'stripes',
 '8883',
 'telly',
 'psychic',
 'tiime',
 'project',
 'determine',
 'jay']
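
The vocabulary is the set of distinct words across all training messages. As an alternative sketch, the same result can be built with a single set comprehension; sorting is an added choice here so that the order is reproducible between runs, and the names `tokenized_messages` and `vocabulary_sketch` are illustrative:

```python
# Made-up tokenized messages standing in for training_df['SMS'].
tokenized_messages = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['welp', 'apparently', 'he', 'retired'],
    ['by', 'the', 'way'],
]

# A set removes duplicates; sorted() turns it into a list with a stable order.
vocabulary_sketch = sorted({word for sms in tokenized_messages for word in sms})

print(len(vocabulary_sketch))
print(vocabulary_sketch[:3])
```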

In [13]:
word_counts_per_sms = {unique_word: [0] * len(training_df['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_df['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
        

In [14]:
transformed_df = pd.DataFrame(word_counts_per_sms)


In [15]:
transformed_df.shape


(4458, 7783)
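
Each column of the transformed frame is one vocabulary word and each row is one message, with the cell holding how often the word occurs in that message. A small sketch of the same counting scheme on two made-up messages (all names here are illustrative):

```python
import pandas as pd

# Two made-up tokenized messages.
tokenized_messages = [['free', 'entry', 'free'], ['see', 'u', 'there']]
vocab = sorted({word for sms in tokenized_messages for word in sms})

# One list of per-message counts for every vocabulary word, initialised to zero.
counts = {word: [0] * len(tokenized_messages) for word in vocab}
for row, sms in enumerate(tokenized_messages):
    for word in sms:
        counts[word][row] += 1

# Rows are messages, columns are words.
count_df = pd.DataFrame(counts)
print(count_df)
```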

In [16]:
training_df.shape

(4458, 3)

In [17]:
training_set = pd.concat([training_df, transformed_df], axis=1)
training_set.shape

(4458, 7786)

In [20]:
training_set.head()

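|   | index | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1078 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4028 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 958 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4642 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4674 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |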


In [18]:
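# Class priors, looked up by label rather than by position
# (positional [0] on a label-indexed Series is deprecated in recent pandas).
p_ham = training_set['Label'].value_counts(normalize=True)['ham']
p_spam = 1 - p_ham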

In [25]:
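n_spam = 0  # total number of words across all spam messages
n_ham = 0   # total number of words across all ham messages
n_vocabulary = len(vocabulary)
alpha = 1   # additive (Laplace) smoothing constant
for i in training_set.itertuples():
    # itertuples() puts the row label at position 0, so i[2] is Label and i[3] is the word list
    len_sms = len(i[3])
    if i[2] == 'spam':
        n_spam = n_spam + len_sms
    else:
        n_ham = n_ham + len_sms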
    


In [26]:
n_vocabulary

7783

In [31]:
dict_spam = {unique_word:0 for unique_word in vocabulary}
dict_ham = {unique_word:0 for unique_word in vocabulary}

In [32]:
training_spam = training_set[training_set['Label'] == 'spam']
training_ham = training_set[training_set['Label'] == 'ham']

In [39]:
for i in vocabulary:
    n_word_spam = training_spam[i].sum()
    p_word_given_spam = (n_word_spam + alpha) / (n_spam + (alpha * n_vocabulary))
    n_word_ham = training_ham[i].sum()
    p_word_given_ham = (n_word_ham + alpha) / (n_ham + (alpha * n_vocabulary))
    dict_spam[i] = p_word_given_spam
    dict_ham[i] = p_word_given_ham
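
The loop above applies additive smoothing: `alpha` is added to every word count so that a word never seen in a class still gets a small non-zero probability. A minimal sketch of the same formula on made-up counts; the function name `smoothed_probability` and the toy numbers are assumptions for illustration only:

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class)."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)


# Made-up counts for a 5-word vocabulary inside one class (10 words in total).
toy_counts = {'free': 2, 'win': 3, 'call': 5, 'see': 0, 'you': 0}
n_words = sum(toy_counts.values())

toy_probs = {
    word: smoothed_probability(count, n_words, len(toy_counts))
    for word, count in toy_counts.items()
}

print(toy_probs['free'])        # (2 + 1) / (10 + 5)
print(toy_probs['see'] > 0)     # an unseen word still has a non-zero probability
print(sum(toy_probs.values()))  # the smoothed estimates still sum to about 1
```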

In [53]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)  # raw string avoids an invalid-escape warning
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for i in message:
        if i in dict_spam:
            p_spam_given_message *= dict_spam[i]
        if i in dict_ham:
            p_ham_given_message *= dict_ham[i]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
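
Multiplying many probabilities smaller than 1 can underflow to `0.0` for long messages, which would make both scores equal. A common variation, sketched here as an alternative and not as the notebook's method, is to add logarithms instead of multiplying. The toy priors and word probabilities below are made up, and `classify_log` is a hypothetical name:

```python
import math
import re

# Made-up parameters standing in for p_spam, p_ham, dict_spam and dict_ham.
p_spam_toy, p_ham_toy = 0.2, 0.8
word_probs_spam = {'free': 0.3, 'win': 0.3, 'see': 0.05, 'you': 0.05}
word_probs_ham = {'free': 0.05, 'win': 0.05, 'see': 0.3, 'you': 0.3}


def classify_log(message):
    """Compare log-scores so that long messages do not underflow to zero."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam_toy)
    log_ham = math.log(p_ham_toy)

    for word in words:
        # Words outside the vocabulary are skipped, as in the original function.
        if word in word_probs_spam:
            log_spam += math.log(word_probs_spam[word])
        if word in word_probs_ham:
            log_ham += math.log(word_probs_ham[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human classification'


print(classify_log("Free win!"))
print(classify_log("See you"))
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow.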

In [44]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888126e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [45]:
correct = 0
total = test_df.shape[0]

In [54]:
test_df['predicted'] = test_df['SMS'].apply(classify)


In [60]:
for i in test_df.iterrows():
    if i[1]['Label'] == i[1]['predicted']:
        correct += 1

In [61]:
accuracy = correct / total
accuracy

0.9874326750448833
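
Accuracy is the share of test messages whose predicted label matches the human label. As an alternative sketch to the row-by-row loop, the comparison can be done on whole columns; the `results` frame below is made up and `accuracy_sketch` is an illustrative name:

```python
import pandas as pd

# Made-up labels and predictions standing in for test_df.
results = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

# The comparison gives a boolean Series; its mean is the fraction of matches.
accuracy_sketch = (results['Label'] == results['predicted']).mean()
print(accuracy_sketch)  # 0.75
```

For the notebook's own result, an accuracy of 0.9874 on 1,114 test messages corresponds to 1,100 correct predictions.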