In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Changing working directory
import os
os.chdir('/content/drive/My Drive/Tutorials/NLP_Learning/Udacity-NLP-tutorial/Udacity-Natural-Language-Processing-Nanodegree/2. Tutorial Spam Classifier')

Mounted at /content/drive


In [None]:
!pwd

/content/drive/My Drive/Tutorials/NLP_Learning/Udacity-NLP-tutorial/Udacity-Natural-Language-Processing-Nanodegree/2. Tutorial Spam Classifier


In [None]:
!ls

bayesian_exp.ipynb	  Bayesian_Inference_solution.ipynb  smsspamcollection
Bayesian_Inference.ipynb  images


In [None]:
import os
import nltk
nltk.data.path.append(os.path.join(os.getcwd(), "smsspamcollection"))
import re

### Step 0: Introduction to the Naive Bayes Theorem ###

Bayes Theorem is one of the earliest probabilistic inference algorithms. It was developed by Reverend Bayes (who used it to try to infer the existence of God, no less), and it still performs extremely well for certain use cases.

It's best to understand this theorem using an example. Let's say you are a member of the Secret Service and you have been deployed to protect the Democratic presidential nominee during one of his/her campaign speeches. Being a public event that is open to all, your job is not easy and you have to be on the constant lookout for threats. So one place to start is to put a certain threat-factor for each person. So based on the features of an individual, like age, whether the person is carrying a bag, looks nervous, etc., you can make a judgment call as to whether that person is a viable threat. 

If an individual ticks all the boxes up to a level where it crosses a threshold of doubt in your mind, you can take action and remove that person from the vicinity. 

_Bayes Theorem works in the same way, as we are computing the probability of an event (a person being a threat) based on the probabilities of certain related events (age, presence of bag or not, nervousness of the person, etc.)._

One thing to consider is the independence of these features from each other. For example, if a child looks nervous at the event, the likelihood of that person being a threat is lower than if a grown man looked nervous. To break this down a bit further, there are two features we are considering here: age AND nervousness. If we look at these features individually, we could design a model that flags ALL persons who are nervous as potential threats. However, we would likely get a lot of false positives, as there is a strong chance that minors present at the event will be nervous. Hence, by considering the age of a person along with the 'nervousness' feature, we would likely get a more accurate result as to who is a potential threat and who isn't.

This is the 'Naive' bit of the theorem: it treats each feature as independent of the others. That assumption does not always hold, and when it fails it can affect the final judgement.

In short, Bayes Theorem calculates the probability of a certain event happening (in our case, a message being spam) based on the joint probabilistic distributions of certain other events (in our case, the appearance of certain words in a message). We will dive into the workings of Bayes Theorem later in the mission, but first, let us understand the data we are going to work with.
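As a preview, and only as a sketch in notation that this notebook has not introduced yet, Bayes Theorem for two events $A$ and $B$ can be written as:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Under the 'naive' independence assumption, and taking $w_1, \dots, w_n$ (symbols assumed here for illustration) to be the words that appear in a message, the probability that the message is spam is proportional to:

$$P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

The product over individual words is exactly where the independence assumption enters: each word contributes its own factor, regardless of which other words are present.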

In [None]:
print('List of files inside "smsspamcollection" : ')
!ls smsspamcollection

List of files inside "smsspamcollection" : 
readme	SMSSpamCollection


In [None]:
import pandas as pd
import string

In [None]:
df = pd.read_csv("smsspamcollection/SMSSpamCollection", sep='\t', header = None, names=['label', 'sms_message'])

In [None]:
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
# convert label to binary
df['label'] = df.label.map({"ham":0, "spam":1})

In [None]:
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:
df.shape

(5572, 2)

### Bag of Words

What we have here in our data set is a large collection of text data (__5,572 rows of data__). Most ML algorithms rely on numerical data to be fed into them as input, and email/sms messages are usually text heavy. 

Here we'd like to introduce the Bag of Words (`BoW`) concept, a term for problems that involve a 'bag of words', that is, a collection of text data that needs to be worked with. <font color='green'>
The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. 
</font>

It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter. 

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word (token) being the column, and the corresponding (row, column) values being the frequency of occurrence of each word or token in that document.

For example: 

Let's say we have 4 documents, which are text messages in our case, as follows:

`['Hello, how are you!',
'Win money, win from home.',
'Call me now.',
'Hello, Call hello you tomorrow?']`

Our objective is to convert this set of text into a frequency distribution matrix in which the documents are numbered in the rows and each word is a column name, with the corresponding value being the frequency of that word in the document.

Let's break this down and see how we can do this conversion using a small set of documents.

To handle this, we will be using scikit-learn's
[CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class, which does the following:

* It tokenizes the string (separates the string into individual words) and gives an integer ID to each token.
* It counts the occurrence of each of those tokens.

**Please Note:** 

* `CountVectorizer` automatically converts all tokenized words to their lower case form so that it does not treat words like 'He' and 'he' differently. It does this using the `lowercase` parameter, which is set to `True` by default.

* It also ignores all punctuation, so that a word next to a punctuation mark (for example: 'hello!') is treated the same as the bare word (for example: 'hello'). It does this using the `token_pattern` parameter, whose default regular expression selects tokens of 2 or more alphanumeric characters.

* The third parameter to take note of is the `stop_words` parameter. Stop words are the most commonly used words in a language, such as 'am', 'an', 'and', 'the', etc. By setting this parameter to `'english'`, `CountVectorizer` will automatically ignore all words (from our input text) that are found in scikit-learn's built-in list of English stop words. This is helpful because stop words can skew our calculations when we are trying to find key words that are indicative of spam.

We will dive into the application of each of these into our model in a later step, but for now it is important to be aware of such preprocessing techniques available to us when dealing with textual data.
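To make the first two notes concrete, here is a minimal sketch of how `lowercase` and the default `token_pattern` shape the vocabulary. The two sample sentences and the variable names are illustrative assumptions, not part of the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentences, chosen only to illustrate the defaults.
sample_docs = ['Hello, how are you!', 'He said hello to a friend.']

# Defaults: lowercase=True, and token_pattern keeps tokens of 2+ characters,
# so 'Hello' and 'hello' merge and the one-letter word 'a' is dropped.
default_vocab = sorted(CountVectorizer().fit(sample_docs).vocabulary_)

# With lowercase=False, 'Hello' and 'hello' become separate features.
cased_vocab = sorted(CountVectorizer(lowercase=False).fit(sample_docs).vocabulary_)

print(default_vocab)
print(cased_vocab)
```

Comparing the two printed vocabularies shows why the default lowercasing is usually what we want for word counts.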

**Step 1: Convert all strings to lower case**

In [None]:
# Convert all strings to lower case
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

l_docs = [doc.lower() for doc in documents]
print(f'documents to lower case : {l_docs}')

documents to lower case : ['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']


**Step 2: Removing all punctuation**

>>**Instructions:**
Remove all punctuation from the strings in the document set. Save the strings into a list called 
'sans_punctuation_documents'. 

In [None]:
sans_punctuation_documents = []

for doc in l_docs:
    sans_punctuation_documents.append(doc.translate(str.maketrans('','',string.punctuation)))

print(f'after removal of punctuation : {sans_punctuation_documents}')

after removal of punctuation : ['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


**Step 3: Tokenization**

Tokenizing a sentence in a document set means splitting the sentence into individual words using a delimiter. The delimiter specifies which character we use to identify the beginning and end of a word. Most commonly, we use a single space as the delimiter, and that is also the case for our documents here.

>>**Instructions:**
Tokenize the strings stored in 'sans_punctuation_documents' using the split() method. Store the final document set 
in a list called 'preprocessed_documents'.

In [None]:
preprocessed_documents = []

# generating tokens
for doc in sans_punctuation_documents:
    preprocessed_documents.append(doc.split())

print(f'generated tokens : {preprocessed_documents}')

generated tokens : [['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
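Splitting on whitespace works here only because the punctuation was removed first. As a hedged side-by-side, the sketch below contrasts a plain `split()` on raw text with a regular expression equivalent to `CountVectorizer`'s default `token_pattern`. The variable names are illustrative.

```python
import re

text = 'Hello, Call hello you tomorrow?'

# Splitting on whitespace alone leaves punctuation attached to the words.
naive_tokens = text.lower().split()

# Same pattern as CountVectorizer's default token_pattern:
# runs of two or more word characters, so punctuation is never captured.
regex_tokens = re.findall(r'(?u)\b\w\w+\b', text.lower())

print(naive_tokens)  # ['hello,', 'call', 'hello', 'you', 'tomorrow?']
print(regex_tokens)  # ['hello', 'call', 'hello', 'you', 'tomorrow']
```

The regex route folds punctuation removal and tokenization into a single step, which is roughly what scikit-learn does internally.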


**Step 4: Count frequencies**

Now that we have our document set in the required format, we can proceed to counting the occurrence of each word in each document of the document set. We will use the `Counter` class from the Python `collections` library for this purpose.

`Counter` counts the occurrence of each item in the list and returns a dictionary-like object (a `dict` subclass) in which each key is an item being counted and the corresponding value is the count of that item in the list.

>>**Instructions:**
Using the Counter() method and preprocessed_documents as the input, create a dictionary with the keys being each word in each document and the corresponding values being the frequency of occurrence of that word. Save each Counter dictionary as an item in a list called 'frequency_list'.

In [None]:
from collections import Counter

frequency_list = []

for tokens in preprocessed_documents:
    frequency_list.append(Counter(tokens))

print(f'frequencies are : {frequency_list}')

frequencies are : [Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}), Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}), Counter({'call': 1, 'me': 1, 'now': 1}), Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
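The per-document counters can be lined up into the document-by-word matrix described earlier. The following is a minimal from-scratch sketch, not scikit-learn's implementation. It assumes an alphabetically sorted vocabulary, and the names `vocabulary` and `matrix` are illustrative.

```python
from collections import Counter

# The counters produced in Step 4, repeated so this cell runs on its own.
frequency_list = [
    Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
    Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
    Counter({'call': 1, 'me': 1, 'now': 1}),
    Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1}),
]

# One column per distinct word, in alphabetical order.
vocabulary = sorted(set().union(*frequency_list))

# One row per document; a Counter returns 0 for words it has not seen.
matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in matrix:
    print(row)
```

Because a `Counter` returns 0 for missing keys, no special handling is needed for words that a document does not contain.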


<font color='green'>
Congratulations! You have implemented the Bag of Words process from scratch! As we can see in our previous output, we have a frequency distribution dictionary for each document, which gives a clear view of the text that we are dealing with.

We should now have a solid understanding of what is happening behind the scenes in scikit-learn's `sklearn.feature_extraction.text.CountVectorizer` class.

We will now use `sklearn.feature_extraction.text.CountVectorizer` in the next step.
</font>

### Implementing Bag of Words in scikit-learn ###

Now that we have implemented the BoW concept from scratch, let's go ahead and use scikit-learn to do this process in a clean and succinct way. We will use the same document set as we used in the previous step. 

>>**Instructions:**
Import the sklearn.feature_extraction.text.CountVectorizer class and create an instance of it called 'count_vector'.

In [None]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call hello you tomorrow?']


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()

In [None]:
count_vector.fit(documents)
print(f'count vector : {count_vector.get_feature_names_out()}')

count vector : ['are' 'call' 'from' 'hello' 'home' 'how' 'me' 'money' 'now' 'tomorrow'
 'win' 'you']


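The `get_feature_names_out()` method returns our feature names for this dataset, which is the set of words that make up our vocabulary for 'documents'. (Older scikit-learn releases called this method `get_feature_names()`; it was removed in version 1.2.)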

>>**Instructions:**
Create a matrix with each row representing one of the 4 documents, and each column representing a word (feature name). 
Each value in the matrix will represent the frequency of the word in that column occurring in the particular document in that row. 
You can do this using the transform() method of CountVectorizer, passing in the document data set as the argument. The transform() method returns a SciPy sparse matrix of integer counts, which you can convert to a dense NumPy array using toarray(). Call the array 'doc_array'.

In [None]:
doc_array = count_vector.transform(documents).toarray()
print(f'doc array is : \n {doc_array}')

doc array is : 
 [[1 0 0 1 0 1 0 0 0 0 0 1]
 [0 0 1 0 1 0 0 1 0 0 2 0]
 [0 1 0 0 0 0 1 0 1 0 0 0]
 [0 1 0 2 0 0 0 0 0 1 0 1]]
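Note that `transform()` (like `fit_transform()`) returns a SciPy sparse matrix rather than a plain NumPy array, which is why `toarray()` is needed. The sketch below inspects that intermediate object. The variable names are illustrative, and `fit_transform()` is used here only as a shorthand for `fit()` followed by `transform()`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# Learn the vocabulary and count the words in one call.
sparse_counts = CountVectorizer().fit_transform(documents)

print(sparse.issparse(sparse_counts))  # True
print(sparse_counts.shape)             # (4, 12): 4 documents, 12 words
print(sparse_counts.nnz)               # 15 non-zero entries are stored

# Convert to a dense NumPy array for display or for a DataFrame.
dense_counts = sparse_counts.toarray()
print(type(dense_counts))
```

Only the non-zero entries are stored, so the sparse form matters once the vocabulary grows to thousands of words, as it will for the full SMS dataset.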


Now we have a clean representation of the documents in terms of the frequency distribution of the words in them. To make it easier to understand, our next step is to convert this array into a dataframe and name the columns appropriately.

>>**Instructions:**
Convert the 'doc_array' we created into a dataframe, with the column names as the words (feature names). Call the dataframe 'frequency_matrix'.

In [None]:
frequency_matrix = pd.DataFrame(doc_array,columns=count_vector.get_feature_names_out())



frequency_matrix

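| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |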


Congratulations! You have successfully implemented the Bag of Words representation for a document dataset that we created.

One potential issue with this method is that if our text dataset is extremely large (say, a large collection of news articles or email data), certain words will be more common than others simply due to the structure of the language itself. For example, words like 'is', 'the', 'an', pronouns, grammatical constructs, etc., could skew our matrix and affect our analysis.

There are a couple of ways to mitigate this. One way is to use the `stop_words` parameter and set its value to `'english'`. This will automatically ignore all the words in our input text that are found in scikit-learn's built-in list of English stop words.

Another way of mitigating this is to use TF-IDF weighting, available in scikit-learn as [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer). This method is out of scope for this lesson.