# Lecture 11: Classification 3, Part 2
Example for naïve Bayes

### Example
The example we are going to follow for the naïve Bayes classifier is based on text classification. Instead of a spam detector, we will work with Twitter data and check whether a tweet is related to data science.

In [None]:
import pandas as pd
train = pd.read_csv('Train_QuantumTunnel_Tweets.csv', encoding='utf-8')
# In Python 3 it is better to specify the encoding of a text file. In this case UTF-8.

print(train[62:65])

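We would like to pre-process the text of the tweets to remove URLs and the `#` symbol that marks hashtags. Note that only the `#` character is removed; the hashtag word itself stays in the text.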

Further reading on regular expressions: [introduction](https://www.w3schools.com/python/python_regex.asp), [cheat sheet](https://www.rexegg.com/regex-quickstart.html), [simple example](https://www.w3schools.com/python/trypython.asp?filename=demo_regex_match_group).

In [None]:
import re # regular expressions package
def tw_preprocess(tw):
    ptw = re.sub(r"http\S+", "", tw)
    ptw = re.sub(r"#", "", ptw)
    return ptw
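
To see what the two substitutions do, here is a minimal sketch on a made-up tweet. The function name `strip_urls_and_hash` and the sample text are assumptions for illustration; the two patterns are the same ones used in `tw_preprocess`.

```python
import re

def strip_urls_and_hash(tw):
    """Same two substitutions as tw_preprocess, repeated so this cell runs alone."""
    no_urls = re.sub(r"http\S+", "", tw)  # drop everything from 'http' up to the next whitespace
    return re.sub(r"#", "", no_urls)      # drop the '#' character, keep the word after it

# Hypothetical tweet, not taken from the dataset.
sample = "Loving #DataScience! Read more at https://example.com/post now"
cleaned = strip_urls_and_hash(sample)
print(repr(cleaned))
```

Removing the URL leaves the whitespace that surrounded it, so the cleaned text can contain double spaces. This does not matter here because the vectoriser splits on word boundaries.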

In [None]:
train['Tweet'] = train['Tweet'].apply(tw_preprocess)

print(train[62:65])

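Next we generate a term-document matrix, that is, a matrix whose rows correspond to documents and whose columns correspond to words. We can get
this done with the help of *CountVectorizer*. With `binary=True`, each entry only records whether a word occurs in a tweet, not how many times.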

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectoriser = CountVectorizer(lowercase=True, stop_words='english', binary=True)

We can now apply our vectoriser to the training tweets.

In [None]:
X_train = vectoriser.fit_transform(train['Tweet'])

The result is a large sparse matrix, so printing it is not a good idea. Nonetheless, we can still inspect the vocabulary that has been gathered from the training set:

In [None]:
vectoriser.get_feature_names_out()[1005:1011]  
# We can list the vocabulary created with the help of get_feature_names_out.
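
Besides the vocabulary, a sparse matrix can be summarised without printing it. The following sketch, on a hypothetical three-document corpus, reads the shape, the number of non-zero entries and the term-to-column mapping; the same attributes would be available on `X_train` and `vectoriser`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus used only to keep the example self-contained.
toy_docs = ["big data tools", "quantum physics news", "data science news"]
toy_vec = CountVectorizer(binary=True)
toy_X = toy_vec.fit_transform(toy_docs)

n_docs, n_terms = toy_X.shape
density = toy_X.nnz / (n_docs * n_terms)  # fraction of entries that are non-zero

print(f"{n_docs} documents x {n_terms} terms, {toy_X.nnz} non-zero entries")
print(f"density: {density:.2f}")
print(toy_vec.vocabulary_)  # dictionary mapping each term to its column index
```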

We are now in a position to create our model using the sparse matrix generated by our vectoriser and the labels provided with the training dataset:

In [None]:
from sklearn import naive_bayes
model = naive_bayes.MultinomialNB().fit(X_train, list(train['Data_Science']))
# We are using the MultinomialNB algorithm from naive_bayes.
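
To make the fitted quantities less of a black box, the sketch below recomputes the smoothed per-class word probabilities by hand on a tiny made-up matrix and compares them with what `MultinomialNB` stores in `feature_log_prob_`. The arrays `X_toy` and `y_toy` are assumptions for illustration, and `alpha=1.0` is the scikit-learn default (Laplace smoothing).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical binary term-document matrix: 4 documents, 3 terms.
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
y_toy = np.array([1, 1, 0, 0])

alpha = 1.0  # smoothing parameter, the MultinomialNB default
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

manual = []
for c in nb.classes_:
    term_counts = X_toy[y_toy == c].sum(axis=0)  # how often each term occurs in class c
    probs = (term_counts + alpha) / (term_counts.sum() + alpha * X_toy.shape[1])
    manual.append(np.log(probs))
manual = np.array(manual)

print(np.exp(manual).round(3))                     # P(term | class), one row per class
print(np.allclose(manual, nb.feature_log_prob_))   # agreement with scikit-learn
```

The smoothing term ensures that a word never seen in one class still gets a small non-zero probability, so a single unseen word cannot force the class probability to zero.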

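We can compute the confusion matrix on the training set. Because the model has already seen these tweets, the result should look very good, but it is an optimistic estimate and says little about performance on new data.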

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(train['Data_Science'], model.predict(X_train))
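
When reading the output, recall that scikit-learn puts the true labels on the rows and the predicted labels on the columns. A minimal sketch with made-up labels (`y_true` and `y_pred` are assumptions) shows how to unpack a binary confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = not data science, 1 = data science.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()            # this order holds for the binary case
accuracy = (tn + tp) / cm.sum()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy:.2f}")
```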

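Finally, we can apply our model to the testing dataset provided. We load the data and apply the same preprocessing we used for the training dataset. Note that we call `transform`, not `fit_transform`, so that the test tweets are encoded with the vocabulary learned from the training set: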

In [None]:
test = pd.read_csv('Test_QuantumTunnel_Tweets.csv', encoding='utf-8')
test['Tweet'] = test['Tweet'].apply(tw_preprocess)

X_test = vectoriser.transform(test['Tweet'])

In [None]:
pred = model.predict(X_test)
print(pred)

# Column 1 holds the probability of the second class listed in model.classes_.
pred_probs = model.predict_proba(X_test)[:, 1]
print(pred_probs)
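
The columns of `predict_proba` follow the order of `classes_`, so it can be safer to look up the column of the class of interest than to hard-code the index. A minimal sketch on made-up data, assuming the positive class is labelled `1`:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data: term 0 appears only in class 1, term 1 only in class 0.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = np.array([1, 1, 0, 0])
nb = MultinomialNB().fit(X_toy, y_toy)

probs = nb.predict_proba(np.array([[1, 0]]))  # one row, one column per class
pos_col = list(nb.classes_).index(1)          # column that belongs to label 1

print(nb.classes_)
print(probs)
print(probs[0, pos_col])
```

Each row of `predict_proba` sums to one, so in the binary case the other column is simply the complement.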

Let us check, for example, the probability assigned to the tweet with id 103, which sits at position 102 because Python indexing starts at zero:

In [None]:
pred_probs[102]

In [None]:
print(test[102:103])
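
The off-by-one between ids and positions is easy to get wrong, so here is a small sketch of the two ways of selecting a row in pandas. The DataFrame `toy` and its `id` column are hypothetical; the column names in the real test file may differ.

```python
import pandas as pd

# Hypothetical frame whose ids start at 1, while positions start at 0.
toy = pd.DataFrame({"id": [1, 2, 3, 4], "Tweet": ["a", "b", "c", "d"]})

row_by_position = toy.iloc[2]                  # third row, counted from zero
position_of_id = toy.index[toy["id"] == 3][0]  # look the row up by its id instead

print(row_by_position["id"])
print(position_of_id)
```

Looking the row up by id is more robust if the file is ever filtered or reordered, since positions would then no longer match ids.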

The rest of the data can be checked in a similar fashion. Notice that we have used a very small corpus for this demo, yet naïve Bayes can still give useful results despite its naïve independence assumption. Furthermore, in this example we were not particularly careful when cleaning the data: we left numbers and punctuation in the corpus, and we did not apply any stemming to the text.
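
As a rough sketch of what a more careful cleaning step could look like, the function below also removes digits and ASCII punctuation and normalises whitespace. The name `clean_more` is an assumption and is not part of the lecture code; stemming is left out because it needs an additional library such as NLTK.

```python
import re
import string

def clean_more(tw):
    """Hypothetical stricter variant of tw_preprocess."""
    text = re.sub(r"http\S+", "", tw)  # remove URLs, as before
    text = re.sub(r"\d+", "", text)    # remove runs of digits
    # Remove ASCII punctuation, which includes '#'; other Unicode symbols are kept.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip().lower()  # tidy whitespace, lowercase

# Hypothetical tweet, not taken from the dataset.
example_clean = clean_more("Top 10 #DataScience tips!!! https://example.com/x 2024")
print(example_clean)
```

Whether this helps depends on the data: removing digits, for instance, would also discard tokens where the number carries meaning.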