# A Notebook for Text Classification #  

This notebook shows you how to classify text documents. Before a classifier can be run, the text has to be converted into numerical features. This step is called featurization.

The following cell contains some predefined functions to implement text featurization and classification. Please make sure you have run this cell before you run other cells in this notebook.

In [None]:
import numpy as np
import pandas as pd
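from IPython.display import display  # display() is used by SampleData
from sklearn.model_selection import cross_val_score  # needed by evaluateClf
from sklearn.naive_bayes import BernoulliNB,GaussianNB,MultinomialNB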

def SampleData(dataset):
    import pandas as pd
    df=pd.read_csv(dataset,sep='\t')  # sep must be passed by keyword in recent pandas
    with pd.option_context('display.max_colwidth',160):
        display(df.head())
    return df.head()
    

def featDataset(dataset):
    output=dataset[:-4]+'_Vectorized.txt'
    with open(output,"w") as w:
        with open(dataset,encoding='utf-8', mode = 'r') as f:
            data=f.readlines()
            text=[entry.split('\t')[1].rstrip() for entry in data[1:]]
            labels=[entry.split('\t')[0] for entry in data[1:]]
            parsedText=list(map(textParse,text))
            vocabList=createVocabList(parsedText)
            for word in vocabList:
                w.write(word+',')
            w.write('class\n')
            for i in range(len(labels)):
                returnVec=setOfWords2Vec(vocabList,parsedText[i])
                for num in returnVec:
                    w.write(str(num)+',')
                w.write(labels[i]+"\n")
            return vocabList

def createVocabList(dataSet):
    vocabSet=set([])
    for document in dataSet:
        vocabSet=vocabSet|set(document)
    return list(vocabSet)
        
def textParse(bigString):
    import re
    #listOfTokens=re.split(r'\W*',bigString)
    listOfTokens=re.split(r'[^A-Za-z]+',bigString)
    return [tok.lower() for tok in listOfTokens if len(tok)>2]

def setOfWords2Vec(vocabList,inputSet):
    returnVec=[0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)]=1
        else: print('the word: %s is not in my Vocabulary' % word)
    return returnVec

def loadDataSet(dataset): 
    with open(dataset) as f:
        data=f.readlines()
        attributes=data[0].rstrip().split(',')[:-1]
        #print("attributes",len(attributes))
        instances=[entry.rstrip().split(',')[:-1] for entry in data[1:]]
        dataArray=[]
        for i in range(len(instances[0])):
            try:
                dataArray.append([float(instance[i]) for instance in instances])
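            except ValueError:
                # NOTE: encode() is not defined in this notebook; this branch is only reached for non-numeric columns.
                encodedData,codeBook=encode([instance[i] for instance in instances])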
                dataArray.append(encodedData)
                print(attributes[i],': ',list(codeBook.items()))
        instances=np.array(dataArray).T
        labels=[entry.rstrip().split(',')[-1] for entry in data[1:]]
        return instances,labels
'''
def chooseClassifier(choice,instances,labels):
    clf=[]
    choice=choice.split(',')
    if "1" in choice:
        clf_B = BernoulliNB()
        clf_B.fit(instances, labels)
        print('Bernoulli Naive Bayes is used.')
        clf.append(clf_B)
    if "2" in choice:
        clf_G = GaussianNB()
        clf_G.fit(instances, labels)
        print("Gaussian Naive Bayes is used.")
        clf.append(clf_G)
    if "3" in choice:
        clf_M = MultinomialNB()
        clf_M.fit(instances, labels)
        print("Multinomial Naive Bayes is used.")
        clf.append(clf_M)
    if not ('1' in choice or '2' in choice or '3' in choice):
        print("Please choose a correct classifier.")
    return clf
'''
    
def evaluateClf(clf,instances,labels,n_foldCV):
    for item in clf:
        if type(item).__name__=="BernoulliNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======BernoulliNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif type(item).__name__=="GaussianNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======GaussianNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif type(item).__name__=="MultinomialNB":
            scores = cross_val_score(item, instances, labels, cv=n_foldCV)
            print("======MultinomialNB======")
            print(scores)
            print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
'''         
def predict(clf,testset):
    for item in clf:
        if type(item).__name__=="BernoulliNB":
            prediction=item.predict(testset)
            print("BernoulliNB: ",prediction)
        elif type(item).__name__=="GaussianNB":
            prediction=item.predict(testset)
            print("GaussianNB: ",prediction) 
        elif type(item).__name__=="MultinomialNB":
            prediction=item.predict(testset)
            print("MultinomialNB:",prediction)
'''   
def predict(testset):
    if "clf_B" in globals():
        prediction=clf_B.predict(testset)
        print("BernoulliNB: ",prediction)
    if "clf_G" in globals():
        prediction=clf_G.predict(testset)
        print("GaussianNB: ",prediction)
    if "clf_M" in globals():
        prediction=clf_M.predict(testset)
        print("MultinomialNB: ",prediction)

## Explore the data
["SMSSpamCollection.txt"](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) includes 5572 SMS messages collected from four different research sources, each labeled as spam or ham. "testset_SMS" is the test set. It contains two randomly chosen SMS messages from the SMSSpamCollection dataset, which were deleted from the original dataset to prevent dataset contamination.

The example data are SMS messages labeled as spam or not spam (ham). The task is to classify a new text message as spam or ham. Each SMS message is treated as a document (an instance for classification). All the instances are stored in a single file, and each line corresponds to a single document.
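To make the file layout concrete, here is a minimal sketch of reading such a file. It assumes what `featDataset` above implies: a header line followed by lines of the form `label<TAB>message`. The header names and the two sample messages are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample in the assumed layout: a header line, then "label<TAB>message".
raw_lines = [
    "label\ttext\n",
    "ham\tSee you at lunch tomorrow\n",
    "spam\tWIN a FREE prize now, call 0800 123 456\n",
]

def parse_line(line):
    """Split one line into (label, message) at the first tab."""
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message

# Skip the header line, then parse every remaining line as one document.
documents = [parse_line(line) for line in raw_lines[1:]]
for label, message in documents:
    print(label, "->", message)
```

Splitting only at the first tab keeps a message intact even if it happens to contain a tab character itself.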

The following cell will give you an excerpt of the SMS message dataset. 

In [None]:
dataset=input('Please Enter Your Data Set:')
sample=SampleData(dataset)

##  An Example of Featurization
Before we explore the real data, it would be helpful to understand how featurization works first. The following cell will show you two documents and how they are combined to generate two feature vectors.  

Many machine learning algorithms only take numerical data as input. Since our data is words, we need a way to convert it into numerical data. This is called featurization. First, we will create a list of all the words that appear in all the documents. That list will enable us to create a **vector** for each document where each element of the vector corresponds to a word in that list. For a given instance, each element is a 1 if that word is in the instance and a 0 if it is not.  
 
For instance, if the two documents are **"I am taking INF549"** and **"I love learning data science"**, then the vocabulary list would be ["inf","love","learning","taking","science","data"] and the two vectorized instances would be [1,0,0,1,0,0] and [0,1,1,0,1,1]. 0 and 1 are boolean values denoting the presence of a word (token): 0 means the word doesn't appear in the document and 1 means it does. In our algorithm, the function only keeps tokens that are made of English letters and are longer than 2 characters. That is why "I" and "am" are dropped and "INF549" becomes "inf". The order of the words in the vocabulary list may differ from run to run, because the list is built from a set. Run the cell below to see the result.

In [None]:
two_pieces_text=list(map(textParse,["I am taking INF549","I love learning data science"]))
vocabList=createVocabList(two_pieces_text)
featurized_vectors=pd.DataFrame([setOfWords2Vec(vocabList,instance) for instance in two_pieces_text],columns=vocabList,\
                               index=["Text 1","Text 2"])
display(featurized_vectors)

## Preprocess the document to identify all the words
Before we featurize the text, we have to create a list with all the words that appear in all the messages. The following cell will generate and display the vocabulary list.

In [None]:
text=[instance.rstrip() for instance in sample.iloc[:,1]]
parsedText=list(map(textParse,text))
vocabList=createVocabList(parsedText)
#print('Parsed Text: ')
#for instance in parsedText:
#    print(instance)
print('Vocabulary List: \n',vocabList)

## Generate features for the documents
For each document, we generate a feature vector that records whether each word in the vocabulary list appears in that document.

The following cell will output the vectors corresponding to the parsed text you got from the last step.

In [None]:
for instance in parsedText:
    print(setOfWords2Vec(vocabList,instance))

## Put them together
Now we will generate a feature vector for every document in the whole dataset.
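The vectorized file that `featDataset` writes is a comma-separated table: one column per vocabulary word plus a final `class` column. The sketch below reproduces that layout in memory. It uses Python's `csv` module instead of the manual `write` calls in `featDataset`, and the vocabulary and rows are hypothetical.

```python
import csv
import io

vocab = ["free", "lunch", "prize"]                   # hypothetical vocabulary
rows = [([1, 0, 1], "spam"), ([0, 1, 0], "ham")]     # hypothetical (vector, label) pairs

buffer = io.StringIO()                               # stands in for the output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(vocab + ["class"])                   # header: vocabulary words, then the label column
for vector, label in rows:
    writer.writerow(vector + [label])                # one document per line

print(buffer.getvalue())
```

`loadDataSet` reads this layout back by taking every column except the last as features and the last column as the label.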

**Training set vectorization**

In [None]:
dataset=input('Please Enter Your Data Set:')
vocabList=featDataset(dataset)
print('The featurization of your documents is done!')

## Train a Naïve Bayes Classifier
The following cells will train a Naïve Bayes classifier on the featurized dataset. Three Naïve Bayes classifiers are provided. They make different assumptions about how the feature values are distributed, so their performance may differ from one dataset to another.

You can try to run different classifiers and compare their performance.
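To give a feel for what happens inside one of these classifiers, here is a from-scratch sketch of the Bernoulli variant on 0/1 vectors. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the function names, the toy vocabulary, and the toy data are all made up, and `alpha=1.0` is plain Laplace smoothing. The Gaussian variant would instead treat each feature as a continuous value, and the Multinomial variant would model word counts.

```python
import math

def train_bernoulli_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and smoothed per-word presence probabilities."""
    classes = sorted(set(labels))
    n_features = len(vectors[0])
    model = {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior = math.log(len(rows) / len(vectors))
        # P(word present | class), smoothed so no probability is exactly 0 or 1.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, probs)
    return model

def predict_bernoulli_nb(model, vector):
    """Return the class with the highest log posterior for one 0/1 vector."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for x, p in zip(vector, probs):
            # Absent words contribute evidence too, via log(1 - p).
            score += math.log(p) if x else math.log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy data over a hypothetical vocabulary ["free", "prize", "lunch", "meeting"].
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 1, 0, 0]))
print(predict_bernoulli_nb(model, [0, 0, 0, 1]))
```

Working in log space avoids multiplying many small probabilities together, which matters once the vocabulary has thousands of words.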

### Bernoulli Naïve Bayes Classifier

In [None]:
instances,labels=loadDataSet(dataset[:-4]+'_Vectorized.txt')
clf_B = BernoulliNB()
clf_B.fit(instances, labels)
print('Bernoulli Naive Bayes is used.')

### Gaussian Naïve Bayes Classifier

In [None]:
instances,labels=loadDataSet(dataset[:-4]+'_Vectorized.txt')
clf_G = GaussianNB()
clf_G.fit(instances, labels)
print("Gaussian Naive Bayes is used.")

### Multinomial Naïve Bayes Classifier

In [None]:
instances,labels=loadDataSet(dataset[:-4]+'_Vectorized.txt')
clf_M = MultinomialNB()
clf_M.fit(instances, labels)
print("Multinomial Naive Bayes is used.")

## Predict unseen examples
The following cells will ask you to enter a test message, featurize it with the vocabulary built from the training set, and predict its label. Run the cells and you will get the results from each classifier you have trained.
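A new message has to be featurized against the vocabulary built from the training data, so words the classifier has never seen cannot be represented. The sketch below illustrates this with a hypothetical four-word vocabulary. `tokenize` mirrors the rule used by `textParse`, and `to_vector` is a simplified stand-in for `setOfWords2Vec` that silently skips unknown words instead of printing a message.

```python
import re

def tokenize(text):
    """Lowercased alphabetic tokens longer than two characters (same rule as textParse)."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", text) if len(tok) > 2]

def to_vector(vocab, tokens):
    """Build a 0/1 vector over vocab; tokens outside the vocabulary are ignored."""
    index = {word: i for i, word in enumerate(vocab)}
    vector = [0] * len(vocab)
    for tok in tokens:
        if tok in index:
            vector[index[tok]] = 1
    return vector

train_vocab = ["call", "free", "lunch", "prize"]   # hypothetical training vocabulary
message = "Call NOW to claim your FREE holiday!"
tokens = tokenize(message)
unknown = [tok for tok in tokens if tok not in train_vocab]

print(to_vector(train_vocab, tokens))
print("ignored:", unknown)
```

The resulting vector always has the same length as the training vocabulary, which is what the trained classifiers expect as input.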

**Test set vectorization**  

In [None]:
testset=input('Please enter the text message that you want to test:')
returnVec=setOfWords2Vec(vocabList,textParse(testset))
print(returnVec)

**Predict result**  

In [None]:
testset=np.array(returnVec).reshape(1, -1)
predict(testset)

Now you can print this notebook as a PDF file and turn it in.