# Email Similarity

In this project, you will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish. For example, how difficult do you think it is to distinguish emails about hockey from emails about baseball? How hard is it to tell emails about hockey from emails about tech? In this project, we’ll find out exactly how difficult those two tasks are.

# Question

1.
We’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content.

Print emails.target_names to see the different categories.

2.
We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. We can select the categories of articles we want from fetch_20newsgroups by adding the parameter categories.

In the function call, set categories equal to the list ['rec.sport.baseball', 'rec.sport.hockey']



3.
Let’s take a look at one of these emails.

All of the emails are stored in a list called emails.data. Print the email at index 5 in the list.



4.
All of the labels can be found in the array emails.target. Print the label of the email at index 5.

The labels themselves are numbers, but those numbers correspond to the label names found at emails.target_names.

Is this a baseball email or a hockey email?
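
The lookup from a numeric label to a category name can be sketched without downloading anything. The two lists below are made-up stand-ins for emails.target_names and emails.target, not the real dataset, so the printed category says nothing about the actual email at index 5:

```python
# Hypothetical stand-ins for emails.target_names and emails.target.
target_names = ['rec.sport.baseball', 'rec.sport.hockey']
target = [0, 1, 1, 0, 1, 1, 0]

# Each label is simply an index into target_names.
label = target[5]
category = target_names[label]
print(label, category)
```

With the real data, the same idea would read emails.target_names[emails.target[5]].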



## Making the Training and Test Sets

5.
We now want to split our data into training and test sets. Change the name of your variable from emails to train_emails. Add these three parameters to the function call:

* subset='train'
* shuffle=True
* random_state=108

Adding the random_state parameter ensures that the data is shuffled the same way every time you run the code.
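
Why a fixed seed gives repeatable results can be illustrated with Python's standard random module. This is only an analogy under the assumption that a seeded generator behaves similarly. It is not how scikit-learn shuffles internally, and the helper shuffled and the sample strings are made up for the sketch:

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)
    copy = list(items)
    rng.shuffle(copy)
    return copy

# Placeholder values standing in for real emails.
emails = ['email_a', 'email_b', 'email_c', 'email_d', 'email_e']

first_run = shuffled(emails, seed=108)
second_run = shuffled(emails, seed=108)

# The same seed produces the same order on every run.
print(first_run == second_run)  # True
```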



6.
Create another variable named test_emails and set it equal to fetch_20newsgroups. The parameters of the function should be the same as before except subset should now be 'test'.

## Counting Words

7.
We want to transform these emails into lists of word counts. The CountVectorizer class makes this easy for us.

Create a CountVectorizer object and name it counter.



8.
We need to tell counter what possible words can exist in our emails. counter has a .fit() method that takes a list of all your data.

Call .fit() with test_emails.data + train_emails.data as a parameter.



9.
We can now make a list of the counts of our words in our training set.

Create a variable named train_counts. Set it equal to the result of counter’s .transform() method, using train_emails.data as the parameter.



10.
Let’s also make a variable named test_counts. This should be the same function call as before, but use test_emails.data as the parameter of transform.
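
Steps 7 through 10 can be tried on a toy example before running them on the real emails. The sentences below are invented for illustration; in the project, train_emails.data and test_emails.data take their place:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up miniature "emails" standing in for the real data.
train_docs = ["the puck hit the goalie", "the pitcher threw a strike"]
test_docs = ["the goalie caught the puck"]

counter = CountVectorizer()
# Learn the vocabulary from every document, as in step 8.
counter.fit(test_docs + train_docs)

# Turn each document into a row of word counts.
train_counts = counter.transform(train_docs)
test_counts = counter.transform(test_docs)

# Columns follow alphabetical word order; with default settings,
# one-letter tokens such as "a" are not counted.
vocabulary = sorted(counter.vocabulary_)
print(vocabulary)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row corresponds to one document and each column to one word, so the entry for "the" in the first training row is 2.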

## Making a Naive Bayes Classifier

11.
Let’s now make a Naive Bayes classifier that we can train and test on. Create a MultinomialNB object named classifier.



12.
Call classifier’s .fit() method. .fit() takes two parameters. The first should be our training set, which for us is train_counts. The second should be the labels associated with the training emails. Those are found in train_emails.target.
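
To see what .fit() expects, here is a small sketch on hand-written counts. The two columns and the label meanings are assumptions made up for the example, not values from the newsgroups data:

```python
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: column 0 counts "pitcher", column 1 counts "puck".
train_counts = [[2, 0], [3, 0], [0, 2], [0, 3]]
# Hypothetical labels: 0 = baseball, 1 = hockey.
train_labels = [0, 0, 1, 1]

classifier = MultinomialNB()
classifier.fit(train_counts, train_labels)

# One email mentioning only "pitcher", one mentioning only "puck".
predictions = classifier.predict([[1, 0], [0, 1]])
print(predictions)
```

The first argument holds one row of counts per email, and the second holds one label per row, in the same order.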



13.
Test the Naive Bayes classifier by printing the result of classifier’s .score() method. .score() takes the test set and the test labels as parameters.

.score() returns the accuracy of the classifier on the test data: the fraction of test emails whose label the classifier predicted correctly, as a number between 0 and 1.
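
Accuracy itself is simple to compute by hand. The sketch below uses a hypothetical helper named accuracy and made-up labels to show the calculation for a classifier like this one:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up labels for four emails.
predicted_labels = [0, 1, 1, 0]
actual_labels = [0, 1, 0, 0]

# 3 of the 4 predictions match, so the accuracy is 0.75.
print(accuracy(predicted_labels, actual_labels))
```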



## Testing Other Datasets

14.
Our classifier does a pretty good job distinguishing between baseball emails and hockey emails. But let’s see how it does with emails about really different topics.

Find where you create train_emails and test_emails. Change the categories to be ['comp.sys.ibm.pc.hardware','rec.sport.hockey'].

Did your classifier do a better or worse job on these two datasets?



15.
Play around with different sets of data. Can you find a set that’s incredibly accurate or incredibly inaccurate?

The possible categories are listed below.

* 'alt.atheism'
* 'comp.graphics'
* 'comp.os.ms-windows.misc'
* 'comp.sys.ibm.pc.hardware'
* 'comp.sys.mac.hardware'
* 'comp.windows.x'
* 'misc.forsale'
* 'rec.autos'
* 'rec.motorcycles'
* 'rec.sport.baseball'
* 'rec.sport.hockey'
* 'sci.crypt'
* 'sci.electronics'
* 'sci.med'
* 'sci.space'
* 'soc.religion.christian'
* 'talk.politics.guns'
* 'talk.politics.mideast'
* 'talk.politics.misc'
* 'talk.religion.misc'
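
One way to explore systematically is to list every pair of categories and try each in turn. The sketch below only builds the pairs, using an arbitrary subset of the categories above; it does not download data or train anything:

```python
from itertools import combinations

# A small subset of the categories listed above, chosen arbitrarily for this sketch.
categories = [
    'comp.sys.ibm.pc.hardware',
    'rec.sport.baseball',
    'rec.sport.hockey',
    'sci.space',
]

# Every unordered pair of distinct categories.
pairs = [list(pair) for pair in combinations(categories, 2)]

for pair in pairs:
    print(pair)
```

Each pair could then be passed as the categories argument when creating train_emails and test_emails.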

In [9]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups(categories=['rec.sport.baseball', 'rec.sport.hockey',
                                        'comp.sys.ibm.pc.hardware'])

print(emails.target_names)
print("")
print(emails.target)
print("")
print(emails.data[5])




['comp.sys.ibm.pc.hardware', 'rec.sport.baseball', 'rec.sport.hockey']

[1 1 0 ... 2 1 1]

From: cdw@dcs.ed.ac.uk (Chris Walton)
Subject: Upgrading a modem ...
Organization: Department of Computer Science, Edinburgh University
Lines: 19

I have an old tandon type modem (that's all the info I have apart from 
the fact that it is black!).   Does anyone have any info about this modem
or upgrading it ??? Reply by e-mail please to cdw@dcs.ed.ac.uk.

= Chris - E-mail: cdw@dcs.ed.ac.uk or C.Walton@ed or p92019@cplab.ph.ed.ac.uk =
=         Tel.:   031-667-9764 or 0334-74244 (at weekends)                    =
=         Write:  4/2 Romero Place, Edinburgh, EH16 5BJ.                      =
Finagle's Fourth Law:
  Once a job is fouled up, anything done to improve it only makes it worse.


-- 
= Chris - E-mail: cdw@dcs.ed.ac.uk or C.Walton@ed or p92019@cplab.ph.ed.ac.uk =
=         Tel.:   031-667-9764 or 0334-74244 (at weekends)                    =
=         Write:  4/2 Romer

In [16]:
# Categories to compare; the same list is used for the training and test sets.
categories = ['rec.sport.baseball', 'rec.sport.hockey', 'comp.sys.ibm.pc.hardware']

train_emails = fetch_20newsgroups(categories=categories, subset='train',
                                  shuffle=True, random_state=108)

test_emails = fetch_20newsgroups(categories=categories, subset='test',
                                 shuffle=True, random_state=108)



counter = CountVectorizer()

counter.fit(test_emails.data+train_emails.data)

train_counts = counter.transform(train_emails.data)

test_counts = counter.transform(test_emails.data)

print(train_counts)
print(" ")
print(" ")
print(test_counts)


  (0, 332)	2
  (0, 1508)	2
  (0, 2174)	2
  (0, 3022)	1
  (0, 4492)	2
  (0, 4655)	2
  (0, 5551)	1
  (0, 5883)	1
  (0, 5977)	1
  (0, 6091)	1
  (0, 6195)	2
  (0, 7008)	1
  (0, 7053)	2
  (0, 7688)	3
  (0, 7693)	2
  (0, 8218)	2
  (0, 8526)	1
  (0, 8570)	3
  (0, 9187)	1
  (0, 9288)	2
  (0, 10168)	3
  (0, 10408)	1
  (0, 10544)	1
  (0, 10968)	4
  (0, 11203)	2
  :	:
  (1786, 21624)	1
  (1786, 21848)	1
  (1786, 22136)	1
  (1786, 22285)	1
  (1786, 24655)	1
  (1786, 24720)	3
  (1786, 24754)	2
  (1786, 25295)	1
  (1786, 25498)	1
  (1786, 27485)	1
  (1786, 27831)	2
  (1786, 28073)	3
  (1786, 28377)	1
  (1786, 29021)	1
  (1786, 29197)	2
  (1786, 29202)	2
  (1786, 29268)	1
  (1786, 29280)	1
  (1786, 30905)	1
  (1786, 31560)	1
  (1786, 31792)	1
  (1786, 31902)	1
  (1786, 31961)	2
  (1786, 31975)	1
  (1786, 32235)	2
 
 
  (0, 410)	3
  (0, 1530)	2
  (0, 1866)	1
  (0, 1869)	1
  (0, 1870)	1
  (0, 1893)	2
  (0, 2192)	2
  (0, 3355)	1
  (0, 4574)	1
  (0, 5488)	1
  (0, 5527)	1
  (0, 5674)	2
  (0, 5883)	4
  (0,

In [17]:
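# Train the classifier on the training word counts and their labels.
classifier = MultinomialNB()
classifier.fit(train_counts, train_emails.target)

# Report accuracy on the held-out test set.
print(classifier.score(test_counts, test_emails.target))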

0.9781144781144782
