# Email similarity
### Supervised learning - Naive Bayes classifier using sklearn

- The dataset is taken from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html
- The project is part of the Data Science career path on the Codecademy platform.

In [59]:
#Importing the libraries 
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

### Exploring the Data


In [12]:
#Loading the dataset of emails 
emails = fetch_20newsgroups()
#20 different categories that are available in the dataset
emails.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

We want to classify emails about hockey and baseball. To load only those categories, we pass the ***categories*** parameter to the *fetch_20newsgroups()* function.

In [66]:
#categories = ['rec.sport.baseball', 'rec.sport.hockey']

Alternatively, we can try different category combinations and observe how the accuracy of the classifier changes. We would expect categories that are further apart in topic and vocabulary to be classified with greater accuracy. For example, it should be easier to distinguish emails about hockey from emails about tech support than emails about hockey from emails about baseball.

In [80]:
#Uncomment the following lists and check the accuracy at the end of the program.
#categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey']
categories = ['comp.windows.x','sci.electronics']


In [81]:
emails = fetch_20newsgroups(categories=categories)

From the emails dataset, we can access the following information:
- Data
- Labels
- Class names

In [82]:
#Data stored as a list in emails.data
emails.data
#Number of emails loaded for the selected categories
len(emails.data)

1184

In [83]:
#Labels for emails
emails.target
#One label per email, so the count matches len(emails.data)
len(emails.target)

1184

In [84]:
#The names of classes
emails.target_names

['comp.windows.x', 'sci.electronics']

### Making the Training and Test Sets
The dataset comes with a predefined split: setting the ***subset*** parameter to either *'train'* or *'test'* loads the corresponding part. The ***shuffle*** parameter shuffles the data, which can matter for models that assume the samples are independent and identically distributed (i.i.d.). ***random_state*** controls the random number generation used for shuffling; passing an integer makes the output of repeated function calls reproducible.

In [85]:
train_emails = fetch_20newsgroups(categories = categories,
                                 subset = 'train',shuffle = True, random_state = 108)
test_emails = fetch_20newsgroups(categories = categories ,
                                 subset = 'test',shuffle = True, random_state = 108)
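
The role of an integer seed can be illustrated without downloading anything. The sketch below uses NumPy's `RandomState` directly on a made-up array of sample indices; it only demonstrates the general idea of seeded shuffling and is not a description of how *fetch_20newsgroups()* shuffles internally. The names `indices`, `first`, and `second` are illustrative.

```python
import numpy as np

# Hypothetical stand-in for a dataset: ten sample indices.
indices = np.arange(10)

# Two generators seeded with the same integer produce the same permutation.
first = np.random.RandomState(108).permutation(indices)
second = np.random.RandomState(108).permutation(indices)

print(first)
print(np.array_equal(first, second))  # True
```

Because both generators start from the same seed, rerunning the cell yields the same ordering, which is what makes a shuffled split reproducible.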

### Counting Words
The next step is to encode the text as numeric vectors so that the machine learning algorithm can work with it. We will use scikit-learn's CountVectorizer() class for this purpose.

In [86]:
#Instantiating a vectorizer instance
counter = CountVectorizer()
#Learn the vocabulary by fitting the vectorizer on the training and test texts
counter.fit(train_emails.data+test_emails.data)

CountVectorizer()
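
Fitting on the combined texts, as done here, puts words that occur only in the test set into the vocabulary. A common alternative is to fit on the training texts alone, in which case words unseen during fitting are ignored by the transform. The sketch below contrasts the two on made-up sentences; `train_texts`, `test_texts`, `train_only`, and `combined` are illustrative names that do not appear in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for train_emails.data and test_emails.data (hypothetical texts).
train_texts = ["the puck slid across the ice"]
test_texts = ["the bat hit the puck"]

# Vocabulary learned from the training texts only.
train_only = CountVectorizer().fit(train_texts)
# Vocabulary learned from both sets, mirroring the cell above.
combined = CountVectorizer().fit(train_texts + test_texts)

print(sorted(train_only.vocabulary_))  # no "bat" or "hit"
print(sorted(combined.vocabulary_))    # includes "bat" and "hit"

# Words missing from the training-only vocabulary are dropped at transform time.
print(train_only.transform(test_texts).toarray())
```

Either choice works with the classifier used later; the difference is only in which columns exist in the count matrix.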

The fitted vectorizer stores its vocabulary in ***counter.vocabulary_***, a dictionary in which each word is a key and its index in the count vector is the value.


In [95]:
#counter.vocabulary_
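
The full vocabulary is too large to print, so a two-sentence toy corpus can make the word-to-index mapping easier to see. This is a minimal sketch; `docs` and `toy_counter` are made-up names, and the sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents standing in for emails.
docs = ["the puck hit the net", "the bat hit the ball"]
toy_counter = CountVectorizer().fit(docs)

# Print each word next to the column index it was assigned.
for word, index in sorted(toy_counter.vocabulary_.items(), key=lambda item: item[1]):
    print(index, word)

# Each row holds the word counts of one document, one column per vocabulary word.
toy_counts = toy_counter.transform(docs).toarray()
print(toy_counts)
```

The indices follow the alphabetical order of the words, so column 0 is "ball" and column 5 is "the", which appears twice in each sentence.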

To create a vector of counts, we pass the text to the vectorizer's ***.transform*** method.

In [88]:
train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

Each sample in the training dataset has been converted into a vector.

In [89]:
#Let's observe the vector value of the first sample
train_counts.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [90]:
#The vector is long and mostly zeros, so the printed output is truncated. Check its full length:
len(train_counts.toarray()[0])

35316

In [91]:
#Let's count the number of words in the first sample:
np.sum(train_counts.toarray()[0])

206
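
The ***.transform*** method returns a SciPy sparse matrix, so calling `.toarray()` on the whole matrix builds a dense array with one row per email and one column per vocabulary word. For a quick look at a single sample, slicing the row first may be cheaper. The sketch below shows this on a toy corpus; `docs`, `toy_counts`, and `first_row` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents standing in for emails.
docs = ["the puck hit the net", "the bat hit the ball"]
toy_counts = CountVectorizer().fit_transform(docs)

# (number of documents, vocabulary size)
print(toy_counts.shape)

# Slice out the first row; the slice keeps it two-dimensional and sparse.
first_row = toy_counts[0:1]

# Densify only this one row for inspection.
print(first_row.toarray()[0])

# Total number of counted words and number of distinct words in the sample.
print(int(first_row.sum()))
print(first_row.nnz)
```

On the notebook's data, the same pattern would be `train_counts[0:1]` in place of `train_counts.toarray()[0]`.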

### Making a Naive Bayes Classifier

We will use the Naive Bayes classifier, a supervised machine learning algorithm that applies Bayes' theorem to make classifications.

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$

The theorem gives the probability of A given B. In our classification problem, this becomes the probability that an email belongs to a category given its contents. Taking hockey and baseball as the example categories, we compare *P(hockey | email)* and *P(baseball | email)*, and whichever probability is higher becomes the classifier's prediction. The two probabilities are given by the following formulas:

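$$P(\text{hockey} \mid \text{email}) = \frac{P(\text{email} \mid \text{hockey}) \cdot P(\text{hockey})}{P(\text{email})}$$

$$P(\text{baseball} \mid \text{email}) = \frac{P(\text{email} \mid \text{baseball}) \cdot P(\text{baseball})}{P(\text{email})}$$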
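
Both formulas share the denominator *P(email)*, so comparing the numerators is enough to pick a class. The "naive" part is the assumption that words are independent given the class, which lets *P(email | class)* be computed as a product of per-word probabilities. The sketch below works this out by hand on an invented four-email training set; the texts, the variable names, and the smoothing constant `alpha = 1.0` (Laplace smoothing, which is also the default of MultinomialNB) are choices made for this illustration only.

```python
import math
from collections import Counter

# Hypothetical toy training set: (text, label) pairs.
training = [
    ("puck ice goal", "hockey"),
    ("ice stick puck", "hockey"),
    ("bat ball pitch", "baseball"),
    ("ball home run", "baseball"),
]
email = "puck ball ice"
alpha = 1.0  # Laplace smoothing constant (an assumption of this sketch)

# Count words per class and documents per class.
labels = sorted({label for _, label in training})
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1
vocabulary = {word for counts in word_counts.values() for word in counts}

def log_joint(text, label):
    """Return log P(label) + sum of log P(word | label) over the words in text."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(training))
    for word in text.split():
        likelihood = (word_counts[label][word] + alpha) / (total + alpha * len(vocabulary))
        score += math.log(likelihood)
    return score

# Numerators of Bayes' theorem for each class.
joint = {label: math.exp(log_joint(email, label)) for label in labels}
# P(email): the same for every class, needed only to normalise.
evidence = sum(joint.values())
posterior = {label: value / evidence for label, value in joint.items()}
prediction = max(posterior, key=posterior.get)

print(posterior)
print(prediction)  # hockey
```

With two hockey words and one baseball word in the email, the posterior comes out at 0.75 for hockey and 0.25 for baseball.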


In [92]:
#Build a classifier using scikit-learn
classifier = MultinomialNB()

In [93]:
#Train a classifier on training data and its labels
classifier.fit(train_counts,train_emails.target)

MultinomialNB()

In [94]:
#Accuracy of classifier with respect to our test data
classifier.score(test_counts,test_emails.target)

0.9606598984771574
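
The ***.score*** method reports the fraction of test samples whose predicted label matches the true label. To see individual predictions, or which class the errors fall into, `predict` can be combined with helpers from `sklearn.metrics`. The sketch below does this on an invented toy dataset; all texts, labels, and `toy_` names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Invented training and test texts; 0 = hockey, 1 = baseball.
train_texts = ["puck ice goal", "ice stick puck", "bat ball pitch", "ball home run"]
train_labels = [0, 0, 1, 1]
test_texts = ["goal on the ice", "pitch the ball"]
test_labels = [0, 1]

# Fit the vectorizer and the classifier on the toy training data.
toy_counter = CountVectorizer().fit(train_texts)
toy_classifier = MultinomialNB().fit(toy_counter.transform(train_texts), train_labels)

# Predict a label for each test text.
toy_test_counts = toy_counter.transform(test_texts)
predictions = toy_classifier.predict(toy_test_counts)
print(predictions)

# accuracy_score on the predictions matches classifier.score on the same data.
print(accuracy_score(test_labels, predictions))
print(toy_classifier.score(toy_test_counts, test_labels))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(test_labels, predictions))
```

On the notebook's data, the equivalent calls would use `classifier`, `test_counts`, and `test_emails.target`.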