# Email Similarity

In this project, I will use scikit-learn’s Naive Bayes implementation on several different datasets. By comparing the accuracy of the classifier, I can find which pairs of topics are harder to tell apart. For example, how difficult do you think it is to distinguish emails about hockey from emails about baseball? How hard is it to distinguish emails about hockey from emails about tech? In this project, I will find out exactly how difficult those two tasks are.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = fetch_20newsgroups()

### Exploring the Data

A dataset of emails has been imported from scikit-learn's datasets. Each email is tagged according to its content. Let's see the different categories.

In [2]:
print(emails.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


I am interested in seeing how effective the Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. 

In [5]:
emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'])

Let's take a look at one of these emails. 

In [6]:
print(emails.data[5])

From: mmb@lamar.ColoState.EDU (Michael Burger)
Subject: More TV Info
Distribution: na
Nntp-Posting-Host: lamar.acns.colostate.edu
Organization: Colorado State University, Fort Collins, CO  80523
Lines: 36

United States Coverage:
Sunday April 18
  N.J./N.Y.I. at Pittsburgh - 1:00 EDT to Eastern Time Zone
  ABC - Gary Thorne and Bill Clement

  St. Louis at Chicago - 12:00 CDT and 11:00 MDT - to Central/Mountain Zones
  ABC - Mike Emerick and Jim Schoenfeld

  Los Angeles at Calgary - 12:00 PDT and 11:00 ADT - to Pacific/Alaskan Zones
  ABC - Al Michaels and John Davidson

Tuesday, April 20
  N.J./N.Y.I. at Pittsburgh - 7:30 EDT Nationwide
  ESPN - Gary Thorne and Bill Clement

Thursday, April 22 and Saturday April 24
  To Be Announced - 7:30 EDT Nationwide
  ESPN - To Be Announced


Canadian Coverage:

Sunday, April 18
  Buffalo at Boston - 7:30 EDT Nationwide
  TSN - ???

Tuesday, April 20
  N.J.D./N.Y. at Pittsburgh - 7:30 EDT Nationwide
  TSN - ???

Wednesday, April 21
  St. Louis a

In [7]:
print(emails.target[5])

1


In [8]:
print(emails.target_names)

['rec.sport.baseball', 'rec.sport.hockey']


The email at index 5 has label 1, which corresponds to rec.sport.hockey, so it is a hockey email.

### Making the Training & Test Sets

In [9]:
train_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'train', shuffle= True, random_state= 108)

In [10]:
test_emails = fetch_20newsgroups(categories = ['rec.sport.baseball', 'rec.sport.hockey'], subset = 'test', shuffle= True, random_state= 108)

### Counting Words

I want to transform these emails into lists of word counts. 

In [11]:
counter = CountVectorizer()

I need to tell counter which words can appear in the emails. Here I fit it on both the training and test emails so that every word has a column.

In [12]:
counter.fit(test_emails.data + train_emails.data)

CountVectorizer()

I can now count the words in the training set and the test set.

In [13]:
train_counts = counter.transform(train_emails.data)

In [14]:
test_counts = counter.transform(test_emails.data)
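
To make the fit and transform steps above more concrete, here is a minimal sketch on two made-up sentences (the documents and the names `docs`, `toy_counter`, `toy_counts` and `vocab` are hypothetical, not part of the newsgroups data). It shows that `fit` learns one column per word and `transform` fills in how often each word occurs in each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
docs = ["the puck hit the net", "the bat hit the ball"]

toy_counter = CountVectorizer()
toy_counter.fit(docs)                     # learn the vocabulary
toy_counts = toy_counter.transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_counter.vocabulary_, key=toy_counter.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one document and each column is one word, so "the" gets a count of 2 in both rows. By default CountVectorizer lowercases the text and ignores single-character tokens.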

### Making a Naive Bayes Classifier

Let's make a Naive Bayes classifier that I can train and test on. 

In [15]:
classifier = MultinomialNB()

In [16]:
classifier.fit(train_counts, train_emails.target)

MultinomialNB()

In [17]:
print(classifier.score(test_counts, test_emails.target))

0.9723618090452262


The above shows my classifier has an accuracy of just over 97%. 
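
The count, fit and score steps above could also be bundled into one function so they are easy to rerun on other data. The sketch below is one possible way to do that, not the code used in this notebook: `naive_bayes_accuracy` is a hypothetical helper, the sentences are invented toy data, and unlike the cells above it learns the vocabulary from the training texts only, so test words it has never seen are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels):
    """Fit CountVectorizer + MultinomialNB on the training texts and
    return the accuracy on the test texts (hypothetical helper)."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    return model.score(test_texts, test_labels)


# Toy data standing in for two newsgroups (0 = baseball, 1 = hockey).
train_texts = [
    "pitcher threw a strike",
    "home run in the ninth inning",
    "goalie saved the puck",
    "power play goal on ice",
]
train_labels = [0, 0, 1, 1]
test_texts = ["the pitcher hit a home run", "the goalie stopped the puck"]
test_labels = [0, 1]

toy_accuracy = naive_bayes_accuracy(train_texts, train_labels, test_texts, test_labels)
print(toy_accuracy)
```

With the real data, the same function could be called with `train_emails.data`, `train_emails.target`, `test_emails.data` and `test_emails.target`. The score may differ slightly from the one above because the vocabulary is built differently.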

### Testing Other Datasets

In [18]:
train_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'train', shuffle= True, random_state= 108)

test_emails = fetch_20newsgroups(categories = ['comp.sys.ibm.pc.hardware','rec.sport.hockey'], subset = 'test', shuffle= True, random_state= 108)

In [19]:
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

In [20]:
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))

0.9974715549936789


Accuracy has improved to nearly 100%. This makes sense, as the two topics are very different and are less likely to share the same words than hockey and baseball in the previous example.
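
One rough way to check the idea that similar topics share more words is to compare their vocabularies directly. The sketch below computes the Jaccard overlap (shared words divided by all words) between two groups of documents. The functions `vocabulary` and `jaccard_overlap` and the toy sentences are hypothetical, and I have not run this on the newsgroups data, so treat it as an illustration of the idea rather than a measured result.

```python
from sklearn.feature_extraction.text import CountVectorizer


def vocabulary(texts):
    """Return the set of words CountVectorizer extracts from texts."""
    return set(CountVectorizer().fit(texts).vocabulary_)


def jaccard_overlap(texts_a, texts_b):
    """Jaccard similarity of two vocabularies: shared words / all words."""
    vocab_a, vocab_b = vocabulary(texts_a), vocabulary(texts_b)
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Toy documents, invented for illustration only.
hockey = ["the goalie saved the puck", "a late goal won the game"]
baseball = ["the pitcher won the game", "a late home run"]
hardware = ["my disk controller failed", "install the new motherboard"]

sports_overlap = jaccard_overlap(hockey, baseball)
mixed_overlap = jaccard_overlap(hockey, hardware)
print(sports_overlap, mixed_overlap)
```

In this toy case the two sports share more of their vocabulary than hockey and hardware do. On real emails, common words such as "the" appear in every topic, and Naive Bayes uses word counts rather than just word presence, so this overlap is only a crude indicator.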

Let's try two closely related topics to see if the accuracy drops.

In [21]:
train_emails = fetch_20newsgroups(categories = ['soc.religion.christian','talk.religion.misc'], subset = 'train', shuffle= True, random_state= 108)

test_emails = fetch_20newsgroups(categories = ['soc.religion.christian','talk.religion.misc'], subset = 'test', shuffle= True, random_state= 108)

In [22]:
counter.fit(test_emails.data + train_emails.data)

train_counts = counter.transform(train_emails.data)
test_counts = counter.transform(test_emails.data)

In [23]:
classifier.fit(train_counts, train_emails.target)
print(classifier.score(test_counts, test_emails.target))

0.9183359013867488


As you can see from the above, when the two topics are closely related, the accuracy drops compared to the previous two examples. This makes sense, as the two topics share a lot of the same words, making it difficult to distinguish between them.