# Extra Bits 

A few extra things we can do with classifiers.

We'll need to retrain our classifier from class 15...

In [27]:
import numpy as np

In [28]:
# back to the newsgroup data; load the categories we want and create our test and training data

from sklearn.datasets import fetch_20newsgroups

categories = ['rec.autos', 'rec.motorcycles',
              'sci.space', 'comp.graphics']

train = fetch_20newsgroups(subset='train', categories=categories) # quickly make a train dataset
test = fetch_20newsgroups(subset='test', categories=categories) # quickly make a test dataset


In [29]:
# train the model

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

count_vect = CountVectorizer(stop_words='english')
X_train_counts = count_vect.fit_transform(train.data)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, train.target)

In [30]:
# predict labels for our test data

X_new_counts = count_vect.transform(test.data)

X_new_tfidf = tfidf_transformer.transform(X_new_counts)

labels = clf.predict(X_new_tfidf)

In [31]:
# just check some metrics before we move on
from sklearn.metrics import classification_report
report = classification_report(test.target, labels, target_names=train.target_names)

print(report) # f1-score is harmonic mean of precision and recall

                 precision    recall  f1-score   support

  comp.graphics       0.97      0.92      0.94       389
      rec.autos       0.92      0.96      0.94       396
rec.motorcycles       0.96      0.96      0.96       398
      sci.space       0.96      0.97      0.96       394

       accuracy                           0.95      1577
      macro avg       0.95      0.95      0.95      1577
   weighted avg       0.95      0.95      0.95      1577
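
As a quick check on the comment in the cell above, the f1-score column can be reproduced from the precision and recall columns. The sketch below uses the rounded comp.graphics values from the report, so it only agrees to two decimals; the helper name `f1` is ours, not part of scikit-learn.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# rounded comp.graphics row from the report above
print(round(f1(0.97, 0.92), 2))  # 0.94
```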



## Extracting most informative features

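Here's one new thing: we can extract the most informative features for each label. `MultinomialNB` can be read as a linear model whose per-class weights are the empirical log probabilities of each feature given a class, log P(x_i|y). scikit-learn stores these in the `feature_log_prob_` attribute; older releases also exposed them as `coef_`, which was deprecated in version 0.24 and removed in 1.1.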
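
To make "empirical log probability" concrete, here is a minimal sketch of how such a value can be computed from per-class feature counts with additive (Laplace) smoothing, which is what `MultinomialNB` applies by default with `alpha=1.0`. The toy counts and the helper name `feature_log_probs` are made up for illustration.

```python
import math

def feature_log_probs(counts, alpha=1.0):
    """Smoothed log P(x_i | y) for one class, given that class's feature counts."""
    total = sum(counts) + alpha * len(counts)
    return [math.log((c + alpha) / total) for c in counts]

# hypothetical counts of three words across the documents of a single class
log_probs = feature_log_probs([3, 0, 1])
print([round(lp, 3) for lp in log_probs])

# within one class, the smoothed probabilities sum to 1
print(round(sum(math.exp(lp) for lp in log_probs), 6))
```

In this notebook the classifier is fit on tf-idf weights rather than raw counts, so the per-class "counts" are fractional sums, but the same formula applies.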

In [32]:
# extract the most informative features for each label
# (feature_log_prob_ holds the values that older releases also exposed as coef_)

def print_top10(vect, clf, target_names):
    feature_names = vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
    for i, label in enumerate(target_names):
        # indices of the 10 largest log probabilities, in ascending order
        top10 = np.argsort(clf.feature_log_prob_[i])[-10:]
        print("Top features for " + label + " (label " + str(i) + "): ")
        print(", ".join(str(feature_names[j]) for j in top10) + ", ")

print_top10(count_vect, clf, train.target_names)

Top features for comp.graphics (label 0): 
3d, organization, com, subject, files, lines, university, image, edu, graphics, 
Top features for rec.autos (label 1): 
just, lines, organization, subject, writes, article, cars, com, edu, car, 
Top features for rec.motorcycles (label 2): 
lines, subject, organization, writes, ca, article, dod, edu, bike, com, 
Top features for sci.space (label 3): 
toronto, com, gov, moon, alaska, access, henry, nasa, edu, space, 


In [33]:
# predict the probability of each class for a single input string

def predict_probs(s, model=clf):
    test_str = [s]  # the vectorizer expects an iterable of documents, not a bare string
    X_new_counts = count_vect.transform(test_str)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)

    pred = model.predict_proba(X_new_tfidf)
    for i, label in enumerate(train.target_names):
        print("Probability of being classified as " + label + " (label " + str(i) + "): " + str(pred[0][i]))

In [34]:
predict_probs("sending a payload to the ISS")

Probability of being classified as comp.graphics (label 0): 0.20380777835091615
Probability of being classified as rec.autos (label 1): 0.19778891182239855
Probability of being classified as rec.motorcycles (label 2): 0.18993853422928691
Probability of being classified as sci.space (label 3): 0.4084647755973983
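
The vectorize, transform, predict sequence above can also be bundled into a single scikit-learn pipeline, so that raw strings go in and class probabilities come out. The sketch below is self-contained and uses a tiny made-up corpus rather than the newsgroup data, so its probabilities are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny made-up corpus, for illustration only
docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the car needs new tires",
    "my car engine is loud",
]
targets = ["space", "space", "autos", "autos"]

# each step's output feeds the next; the last step is the classifier
pipe = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
pipe.fit(docs, targets)

# predict_proba returns one row per input document, one column per class
probs = pipe.predict_proba(["sending a satellite into orbit"])[0]
for label, p in zip(pipe.classes_, probs):
    print(label, round(p, 3))
```

Keeping the steps in one object also avoids accidentally calling `fit_transform` on test data.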


Much of this is super easy to do with [TextBlob](https://textblob.readthedocs.io/en/dev/classifiers.html). 

In [35]:
# create some basic training data (note that you can also load in data from files)

train = [
     ('I love this sandwich.', 'yum'),
     ('this is an amazing sundae!', 'yum'),
     ('I feel very good about these dumplings.', 'yum,'),  # note: the comma inside 'yum,' makes this a separate third label; it appears in the outputs below
     ('this is the most delicious salad I have ever had.', 'yum'),
     ("what a delicious pizza", 'yum'),
     ('I do not like this restaurant', 'yuck'),
     ('I am tired of this chicken.', 'yuck'),
     ("I can't deal with all this mayonnaise", 'yuck'),
     ('mayonnaise is my sworn enemy!', 'yuck'),
     ('my sandwich is horrible.', 'yuck')
]

test = [
     ('the sandwich was good.', 'yum'),
     ('I do not enjoy mayonnaise', 'yum'),
     ("The special salad was not very good today.", 'yuck'),
     ("This tastes amazing!", 'yum'),
     ('I always like to put sprinkles on my ice cream cone.', 'yum'),
     ("I can't believe I'm eating this.", 'yuck')
]

In [36]:
# Now we’ll create a Naive Bayes classifier, passing the training data into the constructor.

from textblob.classifiers import NaiveBayesClassifier
clf = NaiveBayesClassifier(train)

In [37]:
clf.classify("This is an amazing hot dog!")

'yum'

You can get the label probability distribution with the `prob_classify(text)` method.

In [38]:
prob_dist = clf.prob_classify("This one's delicious.")

prob_dist.max()

'yum'

The distribution's `prob(label)` method returns the probability assigned to a single label.

In [39]:
round(prob_dist.prob("yum"), 2)

0.8

In [40]:
round(prob_dist.prob("yuck"), 2)

0.2

## Classifying TextBlobs

Another way to classify text is to pass a classifier into the constructor of `TextBlob` and call its `classify()` method.

In [43]:
from textblob import TextBlob
blob = TextBlob("The pasta is delicious. But the sauce is horrible.", classifier=clf)
blob.classify()

'yum'

The advantage of this approach is that you can classify sentences within a TextBlob.

In [44]:
for s in blob.sentences:
    print(s)
    print(s.classify())

The pasta is delicious.
yum
But the sauce is horrible.
yuck


## Evaluating Classifiers

To compute the accuracy on our test set, use the `accuracy(test_data)` method.

In [45]:
clf.accuracy(test)

0.6666666666666666
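
Accuracy here is simply the fraction of test examples whose predicted label matches the gold label. A minimal sketch, using a hypothetical `accuracy` helper and made-up predictions chosen so that four of six are correct:

```python
def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = ["yum", "yum", "yuck", "yum", "yum", "yuck"]
predicted = ["yum", "yuck", "yum", "yum", "yum", "yuck"]  # made-up predictions
print(accuracy(predicted, gold))  # 4 correct out of 6
```

With only six test examples, each one moves the score by about 0.17, so the figure should be read loosely.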

Use the `show_informative_features()` method to display a listing of the most informative features.

In [46]:
# show informative features

clf.show_informative_features(5)  

Most Informative Features
          contains(this) = False            yum, : yum    =      2.5 : 1.0
     contains(delicious) = False            yuck : yum    =      1.8 : 1.0
            contains(my) = False             yum : yuck   =      1.5 : 1.0
    contains(mayonnaise) = False             yum : yuck   =      1.5 : 1.0
            contains(is) = False            yum, : yum    =      1.5 : 1.0


## Feature Extractors

By default, the NaiveBayesClassifier uses a simple feature extractor that indicates which words in the training set are contained in a document.

For example, the sentence “I love ice cream sundaes” might have the features `contains(sundaes): True` or `contains(cones): False`.
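
The idea behind such a "contains" extractor can be sketched in a few lines. This is a simplified illustration, not TextBlob's actual implementation; the function name, the whitespace tokenization, and the toy vocabulary are all assumptions.

```python
def contains_extractor(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    # crude tokenization: split on whitespace, strip basic punctuation, lowercase
    tokens = {tok.strip(".,!?").lower() for tok in document.split()}
    return {"contains({0})".format(word): (word in tokens) for word in sorted(vocabulary)}

# hypothetical vocabulary collected from a training set
vocab = {"sundaes", "cones", "love"}
feats = contains_extractor("I love ice cream sundaes", vocab)
print(feats)
# {'contains(cones)': False, 'contains(love)': True, 'contains(sundaes)': True}
```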

You can override this feature extractor by writing your own. A feature extractor is simply a function with `document` (the text to extract features from) as the first argument. The function may include a second argument, `train_set` (the training dataset), if necessary.

The function should return a dictionary of features for `document`.

For example, let’s create a feature extractor that just uses the first and last words of a document as its features.

In [50]:
def end_word_extractor(document):
    tokens = document.split()
    first_word, last_word = tokens[0], tokens[-1]
    feats = {}
    feats["first({0})".format(first_word)] = True
    feats["last({0})".format(last_word)] = True
    return feats

We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.

In [48]:
clf2 = NaiveBayesClassifier(test, feature_extractor=end_word_extractor)
blob = TextBlob("Ice cream sundaes are my very favorite!", classifier=clf2)
blob.classify()

'yum'

*Note about how this can be used to filter*