# A Notebook for Text Classification #  

This notebook shows how to classify text documents. Before a classifier can be trained, the text has to be converted into numerical features, a step called featurization.

The following cell defines helper functions for text featurization and classification. Run it before any other cell in this notebook.

In [1]:
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

def SampleData(dataset):
    # Load the tab-separated dataset, display its first rows, and return them.
    df=pd.read_csv(dataset,sep='\t')
    with pd.option_context('max_colwidth',160):
        display(df.head())
    return df.head()
    

def featDataset(dataset):
    output=dataset[:-4]+'_Vectorized.txt'
    with open(output,"w") as w:
        with open(dataset,encoding='utf-8', mode = 'r') as f:
            data=f.readlines()
            text=[entry.split('\t')[1].rstrip() for entry in data[1:]]
            labels=[entry.split('\t')[0] for entry in data[1:]]
            parsedText=list(map(textParse,text))
            vocabList=createVocabList(parsedText)
            for word in vocabList:
                w.write(word+',')
            w.write('class\n')
            for i in range(len(labels)):
                returnVec=setOfWords2Vec(vocabList,parsedText[i])
                for num in returnVec:
                    w.write(str(num)+',')
                w.write(labels[i]+"\n")
            return vocabList

def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return list(vocabSet)
        
def textParse(bigString):
    # Split on runs of non-letters; keep lowercase tokens longer than 2 characters.
    import re
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else: print('the word: %s is not in my Vocabulary' % word)
    return returnVec

def loadDataSet(dataset): 
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        #print("attributes",len(attributes))
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        for i in range(len(instances[0])):
            try:
                dataArray.append([float(instance[i]) for instance in instances])
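            except ValueError:
                # Non-numeric column: this branch relies on an encode() helper that is not
                # defined in this notebook. The vectorized SMS file is all 0/1, so it is not reached.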
                encodedData,codeBook=encode([instance[i] for instance in instances])
                dataArray.append(encodedData)
                print(attributes[i],': ',list(codeBook.items()))
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return instances,labels
    
def evaluateClf(clf,instances,labels,n_foldCV):
    # Cross-validate every supported Naive Bayes classifier in the list `clf`.
    for item in clf:
        name=type(item).__name__
        if name in ("BernoulliNB","GaussianNB","MultinomialNB"):
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======%s======" % name)
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

def predict(testset):
    # Uses the global classifier clf_G, which is trained later in the notebook.
    if "clf_G" in globals():
        prediction=clf_G.predict(testset)
        print(type(clf_G).__name__+": ",prediction)

## Explore the data
["SMSSpamCollection.txt"](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) includes 5572 SMS messages collected from four different research sources, each labeled as spam or ham. "testset_SMS" is the test set: it contains two randomly chosen SMS messages that were removed from the SMSSpamCollection dataset to prevent dataset contamination, which is why the file used below holds 5570 messages.

The example data are SMS messages labeled as spam or not spam (`ham` in the file). The task is to classify a new text message as spam or ham. Each SMS message is treated as a document (an instance for classification). All instances are stored in a single file, with one document per line.
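
As a minimal sketch of that layout, the snippet below parses a few lines into (label, text) pairs. It assumes a header line followed by tab-separated `class` and `content` fields, which is how the helper functions read the file; `parse_lines` is a hypothetical helper, and the spam message is abbreviated for illustration.

```python
def parse_lines(lines):
    """Split tab-separated lines into (label, text) pairs, skipping the header."""
    pairs = []
    for line in lines[1:]:  # lines[0] is assumed to be the header row
        label, text = line.rstrip("\n").split("\t", 1)
        pairs.append((label, text))
    return pairs

# Illustrative in-memory sample; the real data lives in SMSSpamCollection.txt.
raw = [
    "class\tcontent\n",
    "ham\tOk lar... Joking wif u oni...\n",
    "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts\n",
]

documents = parse_lines(raw)
for label, text in documents:
    print(label, "->", text)
```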

The following cell downloads the dataset and displays an excerpt of the SMS messages.

In [2]:
!wget https://raw.githubusercontent.com/khider/INF549/master/Homework%20Assignments/Homework%206/SMSSpamCollection.txt
sample = SampleData('SMSSpamCollection.txt')

--2021-10-26 11:26:12--  https://raw.githubusercontent.com/khider/INF549/master/Homework%20Assignments/Homework%206/SMSSpamCollection.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
ERROR: cannot verify raw.githubusercontent.com's certificate, issued by ‘CN=DigiCert SHA2 High Assurance Server CA,OU=www.digicert.com,O=DigiCert Inc,C=US’:
  Unable to locally verify the issuer's authority.
To connect to raw.githubusercontent.com insecurely, use `--no-check-certificate'.


| | class | content |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives around here though |


The next two cells show the number of instances in each class, first as a bar chart and then as a table of counts.

In [3]:
dataset = 'SMSSpamCollection.txt'
df=pd.read_csv(dataset,sep='\t')
df['class'].value_counts().plot(kind='bar')

<AxesSubplot:>

In [4]:
df['class'].value_counts()

ham     4824
spam     746
Name: class, dtype: int64
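
The counts show that the classes are imbalanced. As a quick sketch of what that implies, the snippet below computes the class proportions and the accuracy that a trivial classifier would reach by always predicting the majority class. The counts are copied from the output above; the variable names are illustrative.

```python
counts = {"ham": 4824, "spam": 746}  # class counts reported above

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always answers with the most frequent class
# is right exactly as often as that class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = proportions[majority_label]

print(total)                                        # 5570
print(majority_label, round(baseline_accuracy, 3))  # ham 0.866
```

Any trained classifier should be judged against this baseline rather than against 0.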

##  An Example of Featurization
Before we featurize the real data, it helps to understand how featurization works. The following cell takes two documents and shows how they are turned into two feature vectors.

Many machine learning algorithms only take numerical data as input. Since our data is words, we need a way to convert it into numerical data. This is called featurization. First, we will create a list of all the words that appear in all the documents. That list will enable us to create a **vector** for each document where each element of the vector corresponds to a word in that list. For a given instance, each element is a 1 if that word is in the instance and a 0 if it is not.  
 
For instance, if the two documents are **"I am taking INF549"** and **"I love learning data science"**, then the vocabulary list would be ["inf","love","learning","taking","science","data"] (the order of the words may vary) and the two vectorized instances would be [1,0,0,1,0,0] and [0,1,1,0,1,1]. 0 and 1 are boolean values denoting the presence of a word (token): 0 means the word does not appear in the document and 1 means it does. **In our algorithm, the function keeps only tokens that consist of English letters and are longer than 2 characters**, which is why "I", "am", and the digits in "INF549" are dropped. Run the cell below to see the result.

In [5]:
two_pieces_text=list(map(textParse,["I am taking INF549","I love learning data science"]))
vocabList=createVocabList(two_pieces_text)
featurized_vectors=pd.DataFrame([setOfWords2Vec(vocabList,instance) for instance in two_pieces_text],columns=vocabList,\
                               index=["Text 1","Text 2"])
display(featurized_vectors)

| | science | data | inf | taking | learning | love |
|---|---|---|---|---|---|---|
| Text 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| Text 2 | 1 | 1 | 0 | 0 | 1 | 1 |
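
The column order in the output above differs from the order listed in the text because `createVocabList` builds the vocabulary from a set, whose iteration order is not guaranteed. The sketch below is one possible variant, not the notebook's own method: it sorts the vocabulary so the columns are reproducible. The helper names `tokenize`, `build_sorted_vocab`, and `to_binary_vector` are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase alphabetic tokens longer than two characters (mirrors textParse)."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if len(tok) > 2]

def build_sorted_vocab(token_lists):
    """Collect the unique tokens and sort them so the column order is stable."""
    return sorted({tok for tokens in token_lists for tok in tokens})

def to_binary_vector(vocab, tokens):
    """Return 1 for each vocabulary word present in the tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

docs = ["I am taking INF549", "I love learning data science"]
token_lists = [tokenize(d) for d in docs]
vocab = build_sorted_vocab(token_lists)
vectors = [to_binary_vector(vocab, tokens) for tokens in token_lists]

print(vocab)       # ['data', 'inf', 'learning', 'love', 'science', 'taking']
print(vectors[0])  # [0, 1, 0, 0, 0, 1]
print(vectors[1])  # [1, 0, 1, 1, 1, 0]
```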


## Preprocess the document to identify all the words
Before we featurize the text, we have to create a list of every word that appears in any of the messages. The following cell generates and displays the vocabulary list for the five sample messages shown earlier.

In [6]:
text=[instance.rstrip() for instance in sample.iloc[:,1]]
parsedText=list(map(textParse,text))
vocabList=createVocabList(parsedText)
#print('Parsed Text: ')
#for instance in parsedText:
#    print(instance)
print('Vocabulary List: \n',vocabList)

Vocabulary List: 
 ['goes', 'final', 'win', 'txt', 'lives', 'free', 'rate', 'there', 'dun', 'crazy', 'apply', 'bugis', 'great', 'lar', 'say', 'entry', 'cup', 'early', 'cine', 'question', 'wif', 'std', 'receive', 'usf', 'wat', 'think', 'jurong', 'tkts', 'amore', 'then', 'comp', 'wkly', 'point', 'over', 'oni', 'around', 'buffet', 'already', 'may', 'joking', 'though', 'only', 'hor', 'text', 'got', 'nah', 'don', 'here', 'available', 'world', 'until']


## Generate features for the documents
For each document, we generate a feature vector which records the appearance of each word in the vocabulary list. 

The following cell will output the vectors corresponding to the parsed text you got from the last step.

In [7]:
for instance in parsedText:
    print(setOfWords2Vec(vocabList,instance))

[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]


## Put them together
Now we will generate a feature vector for every document in the whole dataset and write the vectors to a file whose name ends in `_Vectorized.txt`.

**Training set vectorization**

In [8]:
dataset='SMSSpamCollection.txt'
vocabList=featDataset(dataset)
print('The featurization of your documents is done!')

The featurization of your documents is done!
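
The vectorized file written by `featDataset` has one header row (the vocabulary words followed by `class`) and then one comma-separated row per document (the 0/1 features followed by the label). The sketch below reproduces that layout with the standard `csv` module, writing to an in-memory buffer instead of a file. The tiny vocabulary and rows are made up for illustration.

```python
import csv
import io

vocab = ["free", "lar", "win"]                     # hypothetical tiny vocabulary
rows = [([0, 1, 0], "ham"), ([1, 0, 1], "spam")]   # (feature vector, label) pairs

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(vocab + ["class"])       # header: one column per word, then the label
for vector, label in rows:
    writer.writerow(vector + [label])    # 0/1 features followed by the class

print(buffer.getvalue())
```

Because `textParse` keeps only letters, vocabulary words never contain commas, so a plain comma-separated layout is safe here.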


## Training and testing a Naïve Bayes Classifier
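The following cells train a Naive Bayes classifier on the featurized dataset. The helper functions above recognize three Naive Bayes variants (Bernoulli, Gaussian, and Multinomial); here we train a Multinomial Naive Bayes classifier and calculate its accuracy score on a held-out test set.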
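
To give an intuition for what such a classifier computes, here is a minimal sketch of the textbook multinomial Naive Bayes rule with Laplace smoothing. It is not scikit-learn's implementation; the function names, the `alpha` default, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate a log prior and smoothed log feature probabilities per class."""
    n_features = len(X[0])
    class_counts = Counter(y)
    feature_counts = {c: [0.0] * n_features for c in class_counts}
    for row, label in zip(X, y):
        for i, value in enumerate(row):
            feature_counts[label][i] += value
    model = {}
    for c, n_docs in class_counts.items():
        total = sum(feature_counts[c])
        log_prior = math.log(n_docs / len(y))
        log_theta = [
            math.log((count + alpha) / (total + alpha * n_features))
            for count in feature_counts[c]
        ]
        model[c] = (log_prior, log_theta)
    return model

def predict_one(model, row):
    """Pick the class with the highest log prior plus weighted log likelihood."""
    scores = {
        c: log_prior + sum(v * lt for v, lt in zip(row, log_theta))
        for c, (log_prior, log_theta) in model.items()
    }
    return max(scores, key=scores.get)

# Toy binary features over the made-up vocabulary ["free", "win", "lunch", "today"].
X = [[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0]]
y = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(X, y)
print(predict_one(model, [1, 1, 0, 0]))  # spam
print(predict_one(model, [0, 0, 1, 0]))  # ham
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.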

In [9]:
# data prep
instances,labels=loadDataSet(dataset[:-4]+'_Vectorized.txt')
idx_spam = [i for i, j in enumerate(labels) if j == 'spam']
idx_ham = [i for i, j in enumerate(labels) if j == 'ham']
Xs_Train, Xs_test, Ys_train, Ys_test = train_test_split(instances[idx_spam], np.array(labels)[idx_spam], test_size=0.25, random_state=42) 
Xh_Train, Xh_test, Yh_train, Yh_test = train_test_split(instances[idx_ham], np.array(labels)[idx_ham], test_size=0.25, random_state=42) 

In [10]:
X_train=np.append(Xs_Train,Xh_Train, axis=0)
X_test=np.append(Xs_test,Xh_test, axis=0) 
Y_train=np.append(Ys_train,Yh_train) 
Y_test=np.append(Ys_test,Yh_test) 
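
Splitting spam and ham separately and then concatenating the parts keeps the class ratio similar in the training and test sets, which is known as a stratified split. The sketch below shows the idea using only the standard library; `stratified_split` is a hypothetical helper and the labels are made up. scikit-learn's `train_test_split` also accepts a `stratify` argument for the same purpose, although its exact rounding and shuffling may differ from this sketch.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=42):
    """Split indices class by class so train and test keep similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["ham"] * 16 + ["spam"] * 4   # made-up labels with a 4:1 imbalance
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))         # 15 5
print(sorted(labels[i] for i in test_idx))   # four ham, one spam
```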

In [11]:
clf_G = MultinomialNB()
clf_G.fit(X_train, Y_train)
print("Multinomial Naive Bayes is used.")

Multinomial Naive Bayes is used.


In [12]:
prediction=clf_G.predict(X_test)
accuracy_score(Y_test,prediction)

0.8916786226685797
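
Because 4824 of the 5570 messages are ham, accuracy alone can hide how well the classifier finds spam. As a sketch, precision and recall for the spam class can be computed from the true and predicted labels. The label lists below are made up for illustration and are not results from this notebook; `spam_metrics` is a hypothetical helper.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham", "spam", "ham"]

precision, recall = spam_metrics(y_true, y_pred)
print(round(precision, 3), round(recall, 3))  # 0.667 0.667
```

With the notebook's own variables, the same function could be called as `spam_metrics(Y_test, prediction)`.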

## Predict unseen examples
The following cell asks you to enter a text message, featurizes it with the vocabulary built from the training data, and prints the resulting vector. The cell after it passes that vector to the trained classifier and prints the predicted class.

**Test set vectorization**  

In [None]:
testset=input('Please enter the text message that you want to test:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

**Predict result**  

In [None]:
testset=np.array(returnVec).reshape(1, -1)
predict(testset)
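
A new message will usually contain words that are not in the training vocabulary, and `setOfWords2Vec` prints a notice for each of them. The sketch below is one possible alternative, not the notebook's method: it looks words up in a dictionary (avoiding a list search per word) and collects unknown words instead of printing them. The names `tokenize` and `featurize` and the tiny vocabulary are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase alphabetic tokens longer than two characters (mirrors textParse)."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if len(tok) > 2]

def featurize(text, word_index):
    """Return a 0/1 vector over the vocabulary plus the words that were skipped."""
    vector = [0] * len(word_index)
    unknown = []
    for tok in tokenize(text):
        idx = word_index.get(tok)
        if idx is None:
            unknown.append(tok)   # not seen during training, so it has no column
        else:
            vector[idx] = 1
    return vector, unknown

vocab = ["entry", "free", "win"]                      # hypothetical tiny vocabulary
word_index = {word: i for i, word in enumerate(vocab)}

vector, unknown = featurize("Free entry to WIN a prize today", word_index)
print(vector)   # [1, 1, 1]
print(unknown)  # ['prize', 'today']
```

Unknown words cannot influence the prediction because the classifier has no feature for them, so skipping them silently does not change the result.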