# Naive Bayes Model for Newsgroups Data

For an explanation of the Naive Bayes model, see [our course notes](https://jennselby.github.io/MachineLearningCourseNotes/#naive-bayes).

This notebook uses code from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

## Instructions
0. If you haven't already, follow [the setup instructions here](https://jennselby.github.io/MachineLearningCourseNotes/#setting-up-python3) to get all necessary software installed.
0. Read through the code in the following sections:
  * [Newsgroups Data](#Newsgroups-Data)
  * [Model Training](#Model-Training)
  * [Prediction](#Prediction)
0. Complete at least one of the following exercises:
  * [Exercise Option #1 - Standard Difficulty](#Exercise-Option-#1---Standard-Difficulty)
  * [Exercise Option #2 - Advanced Difficulty](#Exercise-Option-#2---Advanced-Difficulty)

In [291]:
from sklearn.datasets import fetch_20newsgroups # the 20 newsgroups set is included in scikit-learn
from sklearn.naive_bayes import MultinomialNB # we need this for our Naive Bayes model

# These next two are about processing the data. We'll look into this more later in the semester.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

import pandas

## Newsgroups Data

Back in the day, [Usenet](https://en.wikipedia.org/wiki/Usenet_newsgroup) was a popular discussion system where people could discuss topics in relevant newsgroups (think Slack channel or subreddit). At some point, someone pulled together messages sent to 20 different newsgroups, to use as [a dataset for doing text processing](http://qwone.com/~jason/20Newsgroups/).

We are going to pull out messages from just a few different groups to try out a Naive Bayes model.

Examine the newsgroups dictionary, to make sure you understand the dataset.

**Note**: If you get an error about SSL certificates (this can happen on macOS), you can fix it with the following steps:
1. In Finder, click on Applications in the list on the left panel
1. Double click to go into the Python folder (it will be called something like Python 3.7)
1. Double click on the Install Certificates command in that folder


In [292]:
# which newsgroups we want to download
newsgroup_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']

# get the newsgroup data (organized much like the iris data)
newsgroups = fetch_20newsgroups(categories=newsgroup_names, shuffle=True, random_state=265)

newsgroups.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
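
To make the relationship between these keys concrete, here is a minimal sketch using a tiny hand-made stand-in for the dataset. The documents and labels below are invented for illustration and are not taken from the real data, and `label_name` is a hypothetical helper. The point is that `target` holds one integer per document, and that integer is an index into `target_names`.

```python
# A toy stand-in with the same shape as the fetched dataset (contents are made up).
toy_newsgroups = {
    'data': ['The goalie made a great save', 'The shuttle launch was delayed'],
    'target': [0, 1],  # one integer label per document
    'target_names': ['rec.sport.hockey', 'sci.space'],
}

def label_name(dataset, doc_index):
    """Return the newsgroup name for the document at doc_index."""
    return dataset['target_names'][dataset['target'][doc_index]]

for i, text in enumerate(toy_newsgroups['data']):
    print('{0!r} => {1}'.format(text, label_name(toy_newsgroups, i)))
```

With the real data, the same lookup would be `newsgroups.target_names[newsgroups.target[0]]` for the first message.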

This next part does some processing of the data, because the scikit-learn Naive Bayes module is expecting numerical data rather than text data. We will talk more about what this code is doing later in the semester. For now, you can ignore it.

In [293]:
# Convert the text into numbers that represent each word (bag of words method)
word_vector = CountVectorizer()
word_vector_counts = word_vector.fit_transform(newsgroups.data)

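# Account for the length of the documents:
#   rescale the raw counts so long documents don't dominate (each row is normalized);
#   with its default settings, TfidfTransformer also down-weights words that appear in many documents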
term_freq_transformer = TfidfTransformer()
term_freq = term_freq_transformer.fit_transform(word_vector_counts)
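
For anyone curious about what the two steps above are doing, here is a rough pure-Python sketch of the same idea on three made-up sentences. It is not scikit-learn's implementation: the tokenizer regex and the smoothed inverse-document-frequency formula are my assumptions about the library's default behavior, and `tokenize`, `bag_of_words`, and `tf_idf` are hypothetical helper names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r'\b\w\w+\b', text.lower())

def bag_of_words(docs):
    """Return (vocabulary, count_rows): a sorted word list and one count row per document."""
    token_lists = [tokenize(doc) for doc in docs]
    vocabulary = sorted(set(tok for toks in token_lists for tok in toks))
    rows = []
    for toks in token_lists:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

def tf_idf(count_rows):
    """Weight counts by a smoothed inverse document frequency, then scale each row to unit length."""
    n_docs = len(count_rows)
    n_words = len(count_rows[0])
    # Number of documents that contain each word at least once.
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_words)]
    # Assumed smoothing: rarer words get a larger weight, and no weight drops below 1.
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in weighted))
        weighted_rows.append([value / norm if norm else 0.0 for value in weighted])
    return weighted_rows

docs = ['I play hockey', 'I like space', 'Hockey players like hockey']
vocabulary, counts = bag_of_words(docs)
weights = tf_idf(counts)
print(vocabulary)
print(counts)
```

Each row of `counts` is one document and each column is one vocabulary word, which is the same layout as `word_vector_counts` (though that one is stored as a sparse matrix).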

## Model Training

Now we fit the Naive Bayes model to the subset of the 20 newsgroups data that we've pulled out.

In [294]:
# Train the Naive Bayes model
model = MultinomialNB().fit(term_freq, newsgroups.target)
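
As a rough illustration of what fitting a multinomial Naive Bayes model involves, the sketch below estimates class priors and smoothed per-class word probabilities from a toy count matrix. This is a from-scratch sketch under my own assumptions (additive smoothing with `alpha=1.0`, a made-up four-word vocabulary, and a hypothetical function name), not the code scikit-learn runs.

```python
import math

def train_multinomial_nb(count_rows, labels, alpha=1.0):
    """Return (classes, log_priors, log_likelihoods) estimated with additive smoothing."""
    classes = sorted(set(labels))
    n_words = len(count_rows[0])
    log_priors = {}
    log_likelihoods = {}
    for cls in classes:
        rows = [row for row, label in zip(count_rows, labels) if label == cls]
        # Prior: the fraction of training documents that belong to this class.
        log_priors[cls] = math.log(len(rows) / len(labels))
        # Total weight of each word across this class's documents.
        word_totals = [sum(row[j] for row in rows) for j in range(n_words)]
        denominator = sum(word_totals) + alpha * n_words
        log_likelihoods[cls] = [math.log((total + alpha) / denominator) for total in word_totals]
    return classes, log_priors, log_likelihoods

# Toy counts over a hypothetical vocabulary ['hockey', 'orbit', 'puck', 'rocket'].
count_rows = [[2, 0, 1, 0], [1, 0, 2, 0], [0, 2, 0, 1]]
labels = ['hockey', 'hockey', 'space']
classes, log_priors, log_likelihoods = train_multinomial_nb(count_rows, labels)
print(math.exp(log_priors['hockey']))         # prior probability of the hockey class
print(math.exp(log_likelihoods['space'][1]))  # smoothed P('orbit' | space)
```

The smoothing term keeps a word that never appears in a class from getting a probability of exactly zero, which would otherwise rule that class out for any document containing the word.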

## Prediction

Let's see how the model does on some (very short) documents that we made up to fit into the specific categories our model is trained on.

In [295]:
# Predict some new fake documents
fake_docs = [
    'That GPU has amazing performance with a lot of shaders',
    'The player had a wicked slap shot',
    'I spent all day yesterday soldering capacitors banks',
    'I spent all day soldering capacitors banks',
    'Today I have to solder a capacitor',
    'NASA has rovers on Mars',
    'I like space',
    'I play hockey',
    'Electronics are so much better than hockey!',
    'I think that electronics have a lot of applications to computer graphics',
    'I think that computer graphics have a lot of applications to space',
    'I think that graphics have a lot of applications to space']
fake_counts = word_vector.transform(fake_docs)

fake_term_freq = term_freq_transformer.transform(fake_counts)

predicted = model.predict(fake_term_freq)
print('Predictions:')
for doc, group in zip(fake_docs, predicted):
    print('\t{0} => {1}'.format(doc, newsgroups.target_names[group]))

probabilities = model.predict_proba(fake_term_freq)
print('Probabilities:')
print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))
for probs in probabilities:
    print(''.join(['{:<17.8}'.format(prob) for prob in probs]))

Predictions:
	That GPU has amazing performance with a lot of shaders => comp.graphics
	The player had a wicked slap shot => rec.sport.hockey
	I spent all day yesterday soldering capacitors banks => sci.space
	I spent all day soldering capacitors banks => sci.electronics
	Today I have to solder a capacitor => sci.electronics
	NASA has rovers on Mars => sci.space
	I like space => sci.space
	I play hockey => rec.sport.hockey
	Electronics are so much better than hockey! => rec.sport.hockey
	I think that electronics have a lot of applications to computer graphics => comp.graphics
	I think that computer graphics have a lot of applications to space => comp.graphics
	I think that graphics have a lot of applications to space => sci.space
Probabilities:
comp.graphics    rec.sport.hockey sci.electronics  sci.space        
0.29466149       0.22895149       0.24926344       0.22712357       
0.12948055       0.51155698       0.18248712       0.17647535       
0.18765395       0.24247134       0.277
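
Each row of probabilities above gives one value per newsgroup, and the prediction is the group with the largest value. The sketch below shows one way such a row can be computed from a model's log-probabilities. The parameter values are hypothetical, the vocabulary has only two words, and `posterior_probabilities` is a made-up helper, so treat it as an illustration of the arithmetic rather than of scikit-learn's internals.

```python
import math

def posterior_probabilities(counts, log_priors, log_likelihoods):
    """Return {class: posterior probability} for one document's word counts."""
    # Joint log score: log P(class) + sum over words of count * log P(word | class).
    scores = {
        cls: log_priors[cls] + sum(c * lp for c, lp in zip(counts, log_likelihoods[cls]))
        for cls in log_priors
    }
    # Normalize with the log-sum-exp trick so long documents do not underflow to zero.
    top = max(scores.values())
    log_total = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {cls: math.exp(s - log_total) for cls, s in scores.items()}

# Hypothetical parameters for a two-word vocabulary ['hockey', 'space'].
log_priors = {'rec.sport.hockey': math.log(0.5), 'sci.space': math.log(0.5)}
log_likelihoods = {
    'rec.sport.hockey': [math.log(0.9), math.log(0.1)],
    'sci.space': [math.log(0.2), math.log(0.8)],
}

probs = posterior_probabilities([1, 0], log_priors, log_likelihoods)  # document with only 'hockey'
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

Because the scores are normalized across classes, a single strongly class-specific word can dominate the result for a short document.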

# Exercise Option #1 - Standard Difficulty

Modify the fake documents and add some new documents of your own. 

What words in your documents have particularly large effects on the model probabilities? Note that we're not looking for documents that consist of a single word, but for words that, when included or excluded from a document, tend to change the model's output.



Removing the word "yesterday" from the third sentence moved the model's classification from the space group to the electronics group. A possible explanation comes from the counts in the exercise below: "yesterday" appears less often in the electronics group than in the space group, so when the word is present it nudges the model toward space, and removing it takes that evidence away.

Words from the names of the groups have a large effect on the model's predictions, but not an equal one. For example, my third addition uses both "hockey" and "electronics". A human would probably assign that sentence to the electronics group, but the model assigned it to hockey with some confidence, which suggests that "hockey" carries more weight than "electronics". A likely reason is that "hockey" appears only in the hockey group (see below), while "electronics" also appears in other groups, so it is weaker evidence for any single group.

The phrase "computer graphics" is also weighted more heavily than "graphics" alone. This is not because the phrase is longer (as I previously thought), but because the model counts "computer" and "graphics" as two separate words, each of which adds evidence for the graphics group.
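
The word-removal experiment described above can be made systematic. The sketch below generates every version of a sentence with exactly one word left out; `leave_one_word_out` is a hypothetical helper, not part of the notebook or of scikit-learn.

```python
def leave_one_word_out(doc):
    """Yield (removed_word, shortened_doc) for each word position in doc."""
    words = doc.split()
    for i, word in enumerate(words):
        yield word, ' '.join(words[:i] + words[i + 1:])

variants = list(leave_one_word_out('I spent all day yesterday soldering capacitors banks'))
for removed, shortened in variants:
    print('{0:>12} removed: {1}'.format(removed, shortened))
```

Each shortened sentence could then be passed through `word_vector.transform`, `term_freq_transformer.transform`, and `model.predict_proba` from the cells above, and its probabilities compared with those of the full sentence to see which word moves them the most.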

# Exercise Option #2 - Advanced Difficulty

Write some code to count up how often the words you found in the exercise above appear in each category in the training dataset. Does this match up with your intuition?

In [296]:
#Divide up Newsgroup data by categories
newsgroup_groups=[[] for i in range(len(newsgroups.target_names))]

for message, index in zip(newsgroups.data, newsgroups.target):
    newsgroup_groups[index].append(message)

#Transform messages
group_word_vector_counts=[word_vector.transform(group) for group in newsgroup_groups]

In [300]:
#Words to be counted
words=['NASA','the', 'hockey', 'electronics', 'space', 'is', 'yesterday']

#Keep only the words that are in the vocabulary, so the row labels used later stay aligned
#(CountVectorizer lowercases text by default, so look up the lowercase form)
words=[word for word in words if word.lower() in word_vector.vocabulary_]

#Makes list of the column index assigned to each of the words by word_vector
word_values=[word_vector.vocabulary_[word.lower()] for word in words]

#Finds how many times each word appears in the data by category (sum of that word's column)
counts=[[int(group_word_vector_count[:,i].sum()) for i in word_values]
        for group_word_vector_count in group_word_vector_counts]

#Flip axis
counts=list(zip(*counts))

#Add totals
counts_totals=[list(word_counts)+[sum(word_counts)] for word_counts in counts]

In [301]:
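#Label the columns with target_names so they follow the same order as the integer targets
counts_df=pandas.DataFrame(counts_totals, columns=list(newsgroups.target_names)+["Total"], index=words)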

counts_df

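| | comp.graphics | rec.sport.hockey | sci.electronics | sci.space | Total |
|---|---:|---:|---:|---:|---:|
| NASA | 75 | 14 | 68 | 737 | 894 |
| the | 4537 | 8080 | 5527 | 8264 | 26408 |
| hockey | 0 | 649 | 0 | 0 | 649 |
| electronics | 4 | 0 | 103 | 10 | 117 |
| space | 69 | 8 | 19 | 1400 | 1496 |
| is | 1751 | 1422 | 1738 | 2068 | 6979 |
| yesterday | 5 | 16 | 1 | 11 | 33 |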


As the table above shows, words of interest (such as 'space' and 'hockey') show up reasonably frequently in their respective groups. For instance, 'space' appears more than half as often as 'is' in the space group (1400 vs. 2068 occurrences), which is surprising given how common 'is' is.

The word 'electronics' appears fairly rarely compared to the other keywords, even in its own group (103 occurrences). This is probably not because the electronics group is small: using the count of 'the' as a proxy for group size, it is the third largest of the four groups.
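
One way to make this comparison less dependent on group size is to express each keyword's count relative to the count of 'the' in the same group. The sketch below does that arithmetic with numbers copied from the table above. Using 'the' as a size proxy is the same rough assumption made in the text, not a rigorous normalization.

```python
# Counts copied from the table above: (keyword count, count of 'the') in the keyword's own group.
own_group_counts = {
    'hockey': (649, 8080),       # rec.sport.hockey
    'electronics': (103, 5527),  # sci.electronics
    'space': (1400, 8264),       # sci.space
}

# Occurrences of each keyword per 1000 occurrences of 'the' in its own group.
rates = {word: 1000 * count / the_count for word, (count, the_count) in own_group_counts.items()}

for word, rate in sorted(rates.items(), key=lambda item: item[1]):
    print('{0:12} {1:6.1f} per 1000 uses of "the"'.format(word, rate))
```

By this measure 'electronics' comes out at roughly 19 per 1000, compared with about 80 for 'hockey' and about 169 for 'space', so it is still the least frequent of the three after adjusting for group size.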