### Hello
### Bag-of-Words and Naive Bayes are the topics of this learning video.
#### You are listening to John Williams - Schindler's List

### Implementing Bag of Words from scratch
##### The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR).
#### The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

In [39]:
# Sample Documents. DANTE ALIGHIERI is the artist of this video
documents = ['Dante Alighieri, who wrote the Divine Comedy, was an italian poet',
             'The Divine Comedy describes the journey of Dante through Hell,\
Purgatory, and Paradise']

# Convert every document to lower case
lower_case_documents = [doc.lower() for doc in documents]
print(lower_case_documents)

['dante alighieri, who wrote the divine comedy, was an italian poet', 'the divine comedy describes the journey of dante through hell,purgatory, and paradise']


In [40]:
# Remove punctuation: replace every non-alphanumeric character with a space
import re

sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(re.sub(r"[^a-zA-Z0-9]", " ", i))

print(sans_punctuation_documents)

['dante alighieri  who wrote the divine comedy  was an italian poet', 'the divine comedy describes the journey of dante through hell purgatory  and paradise']


In [42]:
# Tokenization with NLTK (word_tokenize needs NLTK's tokenizer data to be downloaded first)
from nltk.tokenize import word_tokenize
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(word_tokenize(i))
print(preprocessed_documents)

[['dante', 'alighieri', 'who', 'wrote', 'the', 'divine', 'comedy', 'was', 'an', 'italian', 'poet'], ['the', 'divine', 'comedy', 'describes', 'the', 'journey', 'of', 'dante', 'through', 'hell', 'purgatory', 'and', 'paradise']]


In [43]:
# Counter counts the occurrences of each item in a list; dict() turns the result into a plain dictionary
frequency_list = []

import pprint
from collections import Counter
for i in preprocessed_documents:
    frequency_list.append(dict(Counter(i)))

pprint.pprint(frequency_list)

[{'alighieri': 1,
  'an': 1,
  'comedy': 1,
  'dante': 1,
  'divine': 1,
  'italian': 1,
  'poet': 1,
  'the': 1,
  'was': 1,
  'who': 1,
  'wrote': 1},
 {'and': 1,
  'comedy': 1,
  'dante': 1,
  'describes': 1,
  'divine': 1,
  'hell': 1,
  'journey': 1,
  'of': 1,
  'paradise': 1,
  'purgatory': 1,
  'the': 2,
  'through': 1}]
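
The four cells above (lower-casing, stripping punctuation, tokenizing, counting) can be folded into one helper. This is a minimal sketch, not part of the original notebook: the name `bag_of_words` is hypothetical, and it uses `str.split()` instead of `word_tokenize`, on the assumption that the two give the same tokens once punctuation has been replaced by spaces.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Return a {word: count} dict for one document (hypothetical helper)."""
    lowered = document.lower()
    # Replace every non-alphanumeric character with a space (same idea as the cell above)
    cleaned = re.sub(r"[^a-z0-9]", " ", lowered)
    # split() with no argument splits on runs of whitespace
    tokens = cleaned.split()
    return dict(Counter(tokens))

documents = ['Dante Alighieri, who wrote the Divine Comedy, was an italian poet',
             'The Divine Comedy describes the journey of Dante through Hell, '
             'Purgatory, and Paradise']

frequency_list = [bag_of_words(doc) for doc in documents]
print(frequency_list[1]['the'])  # 'the' appears twice in the second document
```

Keeping the steps in one function makes it easy to apply the same preprocessing to any new document.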


### Implementing Bag of Words with sklearn

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

In [45]:
print(count_vector)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)


In [46]:
count_vector.fit(documents)
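# Each word will be a column of our matrix.
# Note: scikit-learn 1.0 and later provide get_feature_names_out(); get_feature_names() was removed in later releases.
features_name = count_vector.get_feature_names()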
features_name

['alighieri',
 'an',
 'and',
 'comedy',
 'dante',
 'describes',
 'divine',
 'hell',
 'italian',
 'journey',
 'of',
 'paradise',
 'poet',
 'purgatory',
 'the',
 'through',
 'was',
 'who',
 'wrote']

In [47]:
doc_array = count_vector.transform(documents)
doc_array.toarray()

# two documents (rows) and 19 distinct words (columns)

array([[1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0]])

In [48]:
import pandas
frequency_matrix = pandas.DataFrame(doc_array.toarray(), columns = features_name )
frequency_matrix

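|   | alighieri | an | and | comedy | dante | describes | divine | hell | italian | journey | of | paradise | poet | purgatory | the | through | was | who | wrote |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 |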
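
To connect this matrix with the from-scratch counts, the sketch below rebuilds the same rows in plain Python: the columns are the sorted vocabulary across all documents, and each cell is a word's count in one document. The `frequency_list` literal repeats the counts printed earlier; `vocabulary` and `matrix` are illustrative names. It assumes there are no one-character tokens, since the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters.

```python
frequency_list = [
    {'alighieri': 1, 'an': 1, 'comedy': 1, 'dante': 1, 'divine': 1, 'italian': 1,
     'poet': 1, 'the': 1, 'was': 1, 'who': 1, 'wrote': 1},
    {'and': 1, 'comedy': 1, 'dante': 1, 'describes': 1, 'divine': 1, 'hell': 1,
     'journey': 1, 'of': 1, 'paradise': 1, 'purgatory': 1, 'the': 2, 'through': 1},
]

# Columns: every distinct word, in alphabetical order
vocabulary = sorted(set().union(*frequency_list))

# Rows: one per document; words absent from a document count as 0
matrix = [[counts.get(word, 0) for word in vocabulary] for counts in frequency_list]

print(len(vocabulary))
for row in matrix:
    print(row)
```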


### Bayes Classification Model

#### We are going to use the Naive Bayes algorithm to build a model that classifies SMS messages as spam or not spam, based on labelled training data.

#### Understanding our dataset
We use the SMS Spam Collection dataset from the UCI Machine Learning Repository. Its two columns are not named in the file, so we name them when loading.

The first column is the label and takes two values: 'ham' (not spam) and 'spam'. The second column holds the message text.

In [49]:
import pandas as pd
# You need to download the dataset here: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

df = pd.read_table('smsspamcollection/SMSSpamCollection', sep = '\t', names = ['label', 'sms_message'])
# Output printing out first 5 rows
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [50]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

|   | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [51]:
# Training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
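
The split sizes follow from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is given, a quarter of the rows go to the test set. The arithmetic below is a sketch of that rule (rounding the test count up), not a call into scikit-learn.

```python
import math

n_samples = 5572          # rows in the full dataset, as printed above
test_fraction = 0.25      # train_test_split default when no sizes are passed

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```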


In [52]:
X_train.head()

710     4mths half price Orange line rental & latest c...
3740                           Did you stitch his trouser
2711    Hope you enjoyed your new content. text stop t...
3155    Not heard from U4 a while. Call 4 rude chat pr...
3748    Ü neva tell me how i noe... I'm not at home in...
Name: sms_message, dtype: object

#### Applying Bag of Words processing to our dataset

In [53]:
# Instantiate CountVectorizer
count_vector = CountVectorizer()

# Learn the vocabulary from the training data and return its count matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the same vocabulary. Note that we do not fit CountVectorizer on the testing data
testing_data = count_vector.transform(X_test)

In [54]:
print('training data',training_data.shape)
print('testing data',testing_data.shape)

training data (4179, 7456)
testing data (1393, 7456)
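
Both matrices have 7456 columns because the vocabulary is learned from the training messages only and then reused for the test messages; test words outside that vocabulary are ignored. A toy illustration in plain Python follows. The helper names `fit_vocabulary` and `transform` are hypothetical stand-ins, not scikit-learn functions, and the whitespace tokenizer is a simplification.

```python
def fit_vocabulary(docs):
    """Map each distinct word in docs to a column index (hypothetical helper)."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: column for column, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count vocabulary words per document; unknown words are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.lower().split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

train_docs = ['free entry win', 'ok see you']
test_docs = ['win free prize']      # 'prize' never appears in train_docs

vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)

print(len(train_matrix[0]), len(test_matrix[0]))  # same number of columns
print(test_matrix)
```

Fitting on the test set as well would let information about unseen messages leak into the features, which is why only `transform` is applied to it.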


### Naive Bayes implementation using scikit-learn

In [55]:
# scikit-learn has several Naive Bayes implementations.
# We use MultinomialNB from sklearn.naive_bayes, which suits discrete features such as word counts.
# 3 steps: import, instantiate, fit
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
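
To make the fitted model less of a black box, here is a minimal sketch of multinomial Naive Bayes on count vectors, using add-one smoothing to mirror `alpha=1.0` in the output above. It is an illustration under simplifying assumptions (dense Python lists and a toy four-message dataset invented for the example), not scikit-learn's implementation; `fit_nb` and `predict_nb` are hypothetical names.

```python
import math

def fit_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word across the documents of this class
        word_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        denom = sum(word_totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in word_totals]
    return log_prior, log_likelihood

def predict_nb(X, log_prior, log_likelihood):
    """Pick the class with the highest log posterior for each count vector."""
    predictions = []
    for x in X:
        scores = {
            c: log_prior[c] + sum(n * lp for n, lp in zip(x, log_likelihood[c]))
            for c in log_prior
        }
        predictions.append(max(scores, key=scores.get))
    return predictions

# Toy data (invented): columns are counts of ['free', 'win', 'meeting', 'lunch']
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
y_toy = [1, 1, 0, 0]   # 1 = spam, 0 = ham, as in the label mapping above

log_prior, log_likelihood = fit_nb(X_toy, y_toy)
print(predict_nb([[1, 1, 0, 0], [0, 0, 1, 2]], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word unseen in one class from forcing that class's probability to zero.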

In [56]:
# predictions
predictions = naive_bayes.predict(testing_data)
predictions

array([0, 0, 0, ..., 0, 1, 0])

In [57]:
# Evaluate the model: accuracy, precision, recall and F1
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', accuracy_score(y_test, predictions))
print('Precision score: ', precision_score(y_test, predictions))
print('Recall score: ', recall_score(y_test, predictions))
print('F1 score: ', f1_score(y_test, predictions))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
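
All four scores are functions of the confusion-matrix counts. The counts below are not printed by the notebook; they are inferred here as the integers consistent with the reported scores and the 1393 test rows, so treat them as an assumption. With them, each metric can be recomputed by hand:

```python
# Confusion counts inferred from the reported scores (assumption, see above)
tp, fp, fn = 174, 5, 11          # spam caught, ham flagged as spam, spam missed
tn = 1393 - tp - fp - fn         # remaining test rows: ham classified as ham

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)          # of actual spam, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Under these counts the model misses more spam (11 messages) than it wrongly flags ham (5 messages), which is why recall is lower than precision.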


#### Thanks for watching the video.
#### Please subscribe to the YouTube channel:
https://www.youtube.com/channel/UCokk3F3UoS7O0Vmja7MXi-w?
#### and follow Jabraghe on Facebook: 
https://www.facebook.com/Jabraghe/
