# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. In digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero. Here, by contrast, we'll represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse -- only a few words of the total vocabulary are active in any given document. The bag-of-words assumption is that the label depends only on which words appear; their order is not important.
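
To make the bag-of-words idea concrete, here is a minimal library-free sketch. The two sentences and the extra vocabulary words are made up for illustration, and the token rule (lowercase, words of two or more characters) is meant to mirror CountVectorizer's default rather than reproduce it exactly.

```python
import re
from collections import Counter

# Token rule assumed to mirror CountVectorizer's default: words of 2+ characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(TOKEN_RE.findall(text.lower()))

# Two made-up documents with the same words in a different order.
doc_a = "the rocket reached orbit and the crew cheered"
doc_b = "the crew cheered and the rocket reached orbit"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Reordering the words leaves the representation unchanged.
same_bag = (bag_a == bag_b)

# Against a larger (hypothetical) vocabulary, many entries of the vector are zero.
extra_words = {"atheism", "graphics", "image", "religion", "space", "file"}
vocabulary = sorted(set(bag_a) | extra_words)
vector_a = [bag_a.get(word, 0) for word in vocabulary]
nonzero_fraction = sum(1 for v in vector_a if v != 0) / float(len(vector_a))

print("same bag: %s, non-zero fraction: %.2f" % (same_bag, nonzero_fraction))
```

With a real vocabulary of tens of thousands of words, the non-zero fraction for a single post is far smaller than in this toy case.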

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [3]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test set in half so that we also have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [5]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = len(newsgroups_test.target)
test_data, test_labels = newsgroups_test.data[num_test//2:], newsgroups_test.target[num_test//2:]
dev_data, dev_labels = newsgroups_test.data[:num_test//2], newsgroups_test.target[:num_test//2]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print 'training label shape:', train_labels.shape
print 'test label shape:', test_labels.shape
print 'dev label shape:', dev_labels.shape
print 'labels names:', newsgroups_train.target_names

training label shape: (2034L,)
test label shape: (677L,)
dev label shape: (676L,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


(1) For each of the first 5 training examples, print the text of the message along with the label.

In [139]:
def P1(num_examples=5):
### STUDENT START ###
    for i in range(num_examples):
        print train_labels[i], ' : ',  newsgroups_train.target_names[train_labels[i]], '\n',  train_data[i], '\n'
### STUDENT END ###
P1()

1  :  comp.graphics 
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych 

3  :  talk.religion.misc 


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries. 

2  :  sci.

(2) Use CountVectorizer to turn the raw training text into feature vectors. You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

a. The output of the transform (also of fit_transform) is a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. What is the size of the vocabulary? What is the average number of non-zero features per example? What fraction of the entries in the matrix are non-zero? Hint: use "nnz" and "shape" attributes.

b. What are the 0th and last feature strings (in alphabetical order)? Hint: use the vectorizer's get_feature_names function.

c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"]. Confirm the training vectors are appropriately shaped. Now what's the average number of non-zero features per example?

d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features. What size vocabulary does this yield?

e. Use the "min_df" argument to prune words that appear in fewer than 10 documents. What size vocabulary does this yield?

f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? Hint: build a vocabulary for both train and dev and look at the size of the difference.
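
Parts (c), (d) and (f) can be sketched without any library to show what is being counted. The toy documents below are made up. In scikit-learn the corresponding options are CountVectorizer's `vocabulary`, `analyzer='char'` and `ngram_range` arguments. Treating "fraction of words missing" as a fraction of distinct dev words is an assumption about how (f) should be read.

```python
import re

TOKEN_RE = re.compile(r"\b\w\w+\b")

def words(text):
    """Lowercased word tokens of 2+ characters."""
    return TOKEN_RE.findall(text.lower())

def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of length n_min..n_max (spaces included)."""
    text = text.lower()
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

# Made-up stand-ins for train_data and dev_data.
toy_train = ["Space graphics and space images", "Religion and atheism debate"]
toy_dev = ["Graphics cards render space scenes"]

# (c) Fixed vocabulary: one column per listed word, applied to the training documents.
fixed_vocab = ["atheism", "graphics", "space", "religion"]
fixed_counts = [[words(doc).count(w) for w in fixed_vocab] for doc in toy_train]

# (d) Vocabulary of character bigrams and trigrams.
char_vocab = set(g for doc in toy_train for g in char_ngrams(doc))

# (f) Fraction of distinct dev words that never occur in the training data.
train_vocab = set(w for doc in toy_train for w in words(doc))
dev_vocab = set(w for doc in toy_dev for w in words(doc))
missing = dev_vocab - train_vocab
missing_fraction = len(missing) / float(len(dev_vocab))

print("fixed-vocabulary counts: %s" % fixed_counts)
print("character n-gram vocabulary size: %d" % len(char_vocab))
print("missing fraction: %.2f" % missing_fraction)
```

Note that the fixed-vocabulary matrix has one row per training document and one column per listed word, so its shape is (number of documents, 4).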

In [142]:
def P2():
    ### STUDENT START ###
    #--a Fitting and Transforming the text into numerical matrices
    vectorizer = CountVectorizer(min_df=1)
    X = vectorizer.fit_transform(train_data)
    print '\nnon-zeros:',  X.nnz, ' shape:',  X.shape, ' avg non-zeros:', X.nnz / float(X.shape[0]), ' sorted?:', X.has_sorted_indices

    #--b Getting the feature names (vocabulary strings) learned from the training data
    Y = vectorizer.get_feature_names()
    print '\nzeroth feature: ' , Y[0], ' last feature: ', Y[len(Y)-1]

    #--c Fitting on the 4 words themselves as a tiny corpus to check the resulting 4x4 shape
    own_vocab = (['atheism', 'graphics', 'space', 'religion'])
    own_X = vectorizer.fit_transform(own_vocab)
    print '\nnon-zeros:',  own_X.nnz, ' shape:',  own_X.shape, ' avg non-zeros:', own_X.nnz / float(own_X.shape[0])

    #--d Switching from unigram features to word-level bigram and trigram features
    ngram_vectorizer = CountVectorizer(ngram_range=(2, 3), analyzer='word', min_df=1)
    M = ngram_vectorizer.fit_transform(train_data)
    print '\nWith bigram and trigrams, vocab is:',  M.shape[1]

    #--e Restricting minimum training features to be from at least 10 documents
    ngram_vectorizer_df = CountVectorizer(ngram_range=(2, 3), analyzer='word', min_df=10)
    D = ngram_vectorizer_df.fit_transform(train_data)
    print '\nWith constraint of min. documents of 10, vocab is:',  D.shape[1]

    #--f Refitting the default CountVectorizer on the dev data and comparing its vocabulary size with the training vocabulary
    H = vectorizer.fit_transform(dev_data)
    print '\nvocab(train data):', X.shape[1], 'vocab(dev data):', H.shape[1], ' Difference is:', (X.shape[1]-H.shape[1]), 'as a fraction:', float((X.shape[1]-H.shape[1]))/(X.shape[1]) 
    ### STUDENT END ###
P2()


non-zeros: 196700  shape: (2034, 26879)  avg non-zeros: 96.7059980334  sorted?: 0

zeroth feature:  00  last feature:  zyxel

non-zeros: 4  shape: (4, 4)  avg non-zeros: 1.0

With bigram and trigrams, vocab is: 510583

With constraint of min. documents of 10, vocab is: 3381

vocab(train data): 26879 vocab(dev data): 16246  Difference is: 10633 as a fraction: 0.395587633469
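
Part (a) also asks for the fraction of matrix entries that are non-zero. As a worked check, it follows directly from the `nnz` and `shape` figures printed above; the three numbers below are copied from that output.

```python
# Figures copied from the printed output above.
nnz = 196700      # stored non-zero entries
n_docs = 2034     # rows (training examples)
n_vocab = 26879   # columns (vocabulary size)

avg_nonzero_per_doc = nnz / float(n_docs)
fraction_nonzero = nnz / float(n_docs * n_vocab)

print("average non-zero features per example: %.2f" % avg_nonzero_per_doc)
print("fraction of non-zero entries: %.5f" % fraction_nonzero)
```

That is roughly 0.36% of the entries, which is why a sparse matrix format is used.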


(3) Use the default CountVectorizer options and report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier; find the optimal value for k. Also fit a Multinomial Naive Bayes model and find the optimal value for alpha. Finally, fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization. A few questions:

a. Why doesn't nearest neighbors work well for this problem?

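ANS: Nearest neighbors relies on distances between raw count vectors. With roughly 27,000 mostly-zero features, those distances are dominated by common words and by document length, neither of which varies much by topic. Only a handful of topic-specific words separate the classes, so the nearest documents often come from a different topic.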

b. Any ideas why logistic regression doesn't work as well as Naive Bayes?

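ANS: A likely reason is the small training set relative to the number of features (about 2,000 documents versus about 27,000 words). Logistic regression has to estimate the weights for all words jointly and tends to overfit, whereas Naive Bayes, with its strong independence assumption, estimates each word's statistics separately and gets by with less data.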

c. Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

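ANS: In general, the larger the C, the larger the sum of the squared weights. In scikit-learn, C is the inverse of the regularization strength, so a small C penalizes large weights heavily and a large C lets them grow.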
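
The relationship between C and the squared weights can be seen on a toy problem. The sketch below assumes an objective of the form 0.5 * w^2 + C * (log-loss), which is how I understand scikit-learn's L2-penalized logistic regression to be parameterized. The one-dimensional data set, the missing intercept and the plain gradient-descent solver are all simplifications made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up 1-D data set that is not linearly separable.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

def fit_weight(C, n_iter=20000):
    """Minimise 0.5 * w**2 + C * sum(log-loss) by gradient descent (no intercept)."""
    # Step size from an upper bound on the curvature, so the iteration is stable.
    step = 1.0 / (1.0 + C * sum(x * x for x in xs) / 4.0)
    w = 0.0
    for _ in range(n_iter):
        grad = w + C * sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys))
        w -= step * grad
    return w

Cs = [0.01, 0.1, 1.0, 10.0, 100.0]
squared_weights = [fit_weight(C) ** 2 for C in Cs]
for C, sq in zip(Cs, squared_weights):
    print("C=%g  squared weight=%.4f" % (C, sq))
```

Because C multiplies the data-fit term, a larger C makes the penalty relatively less important and the fitted weight moves toward the unregularized solution.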

In [144]:
def P3():
    ### STUDENT START ###
    vectorizer = CountVectorizer(min_df=1)
    train_data_vectorized = vectorizer.fit_transform(train_data)
    dev_data_vectorized = vectorizer.transform(dev_data)

    # knn
    # arbitrarily chose 1-10 neighbors to loop through
    for i in range (1, 11):
        knn_params = KNeighborsClassifier(n_neighbors= i, metric='minkowski', p = 2) 
        # train knn classifier
        knn_fit = knn_params.fit(train_data_vectorized, train_labels)
        # for each k-value, print performance metrics
        predKnn =  knn_fit.predict(dev_data_vectorized)
        print 'k-value:', i , 'f1 score:', metrics.f1_score(dev_labels, predKnn, average='weighted')
    print 'Optimal k-value = 7, with an f1-score=0.450. \n'

    # multinomial NB
    # chose several values of alpha spanning a few orders of magnitude to evaluate
    alphas =[0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]
    for i in range (len(alphas)):    
        Mclf = MultinomialNB(alpha=alphas[i])
        Mclf_fit = Mclf.fit(train_data_vectorized, train_labels)
        predMNB =  Mclf_fit.predict(dev_data_vectorized)
        print 'alpha:', alphas[i], 'f1 score:', metrics.f1_score(dev_labels, predMNB, average='weighted')
    print 'Optimal alpha=0.1 with an f1-score=0.790. \n'

    #logistic regression
    Cs = [1e5, 1e4, 1e3, 1e2, 1e1, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
    for i in range (len(Cs)):
        logf= LogisticRegression(penalty='l2', C=Cs[i])
        log_fit= logf.fit(train_data_vectorized, train_labels)
        predlog =  log_fit.predict(dev_data_vectorized)
        print '\n', 'C:', Cs[i],'f1 score:', metrics.f1_score(dev_labels, predlog, average='weighted')
        print "class:1:", sum(pow(log_fit.coef_[0], 2)), "class:2:", sum(pow(log_fit.coef_[1], 2)), "class:3:", sum(pow(log_fit.coef_[2], 2)), "class:4:", sum(pow(log_fit.coef_[3], 2))
    print 'Optimal C=0.1 with an f1-score=0.697. \n'   

### STUDENT END ###
P3()

k-value: 1 f1 score: 0.380503001853
k-value: 2 f1 score: 0.380542124044
k-value: 3 f1 score: 0.408415022544
k-value: 4 f1 score: 0.403122799385
k-value: 5 f1 score: 0.428760723622
k-value: 6 f1 score: 0.446665054087
k-value: 7 f1 score: 0.450479100061
k-value: 8 f1 score: 0.446983581171
k-value: 9 f1 score: 0.43656661762
k-value: 10 f1 score: 0.427850290594
Optimal k-value = 7, with an f1-score=0.450. 

alpha: 0.0 f1 score: 0.374977827703
alpha: 0.0001 f1 score: 0.762834870483
alpha: 0.001 f1 score: 0.770251883616
alpha: 0.01 f1 score: 0.775166321854
alpha: 0.1 f1 score: 0.79030523851
alpha: 0.5 f1 score: 0.7862862962
alpha: 1.0 f1 score: 0.777732023602
alpha: 2.0 f1 score: 0.768996647234
alpha: 10.0 f1 score: 0.667481433826
Optimal alpha=0.1 with an f1-score=0.790. 


C: 100000.0 f1 score: 0.680147626545
class:1: 2624.0387608 class:2: 2269.15106905 class:3: 3209.38075587 class:4: 3501.001475

C: 10000.0 f1 score: 0.685522363993
class:1: 3628.93154986 class:2: 2814.72343565 class:3: 33

ANSWER:

(4) Train a logistic regression model. Find the 5 features with the largest weights for each label -- 20 features in total. Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels. Create the table again with bigram features. Any surprising features in this table?

In [145]:
def P4():
    ### STUDENT START ###
    
    labels = newsgroups_train.target_names
    # vectorizing the training data with unigram (single-word) features
    monogram_vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=1)
    train_data_Mono_vect = monogram_vectorizer.fit_transform(train_data)
    Mono_featNames = monogram_vectorizer.get_feature_names()
    
    bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), analyzer='word', min_df=1)
    train_data_Bi_vect = bigram_vectorizer.fit_transform(train_data)
    Bi_featNames = bigram_vectorizer.get_feature_names()
    
    # core logistic regression declaration
    logf= LogisticRegression(penalty='l2', C=0.1)
        
    # a --- fitting the logistic model on a monogram feature data
    log_fit= logf.fit(train_data_Mono_vect, train_labels) 
    # negate coeffs and use argsort to get a descending sort, then pick the top 5 features for each class (along axis 1)
    sort_coeff =  np.argsort(-1.00000*(log_fit.coef_), axis=1)
    top20wghts_coeff = sort_coeff[0:4, 0:5]
    # flatten for easier looping
    flat_coeff= top20wghts_coeff.flatten()

    # print table of monogram fits
    print '\n| ', 'feature', ' |  ', labels[0], ' | ', labels[1],' | ', labels[2],' | ', labels[3],' |'
    for idx in range(0, 20):
        print Mono_featNames[flat_coeff[idx]], ' | ', log_fit.coef_[0,flat_coeff[idx]] , ' | ', log_fit.coef_[1,flat_coeff[idx]] ,' | ', log_fit.coef_[2,flat_coeff[idx]] ,' | ', log_fit.coef_[3,flat_coeff[idx]], ' |' 
            
    # b --- fitting the logistic model on bigram  feature data
    log_fit_b = logf.fit(train_data_Bi_vect, train_labels)
    # negate coeffs and use argsort to get a descending sort, then pick the top 5 features for each class (along axis 1)
    sort_coeff_b =  np.argsort(-1.00000*(log_fit_b.coef_), axis=1)
    top20wghts_coeff_b = sort_coeff_b[0:4, 0:5]
    # flatten for easier looping
    flat_coeff_b= top20wghts_coeff_b.flatten()
    
    # print table of bigram fits
    print '\n| ', 'feature', ' |  ', labels[0], ' | ', labels[1],' | ', labels[2],' | ', labels[3],' |'
    for idx in range(0, 20):
        print Bi_featNames[flat_coeff_b[idx]], ' | ', log_fit_b.coef_[0,flat_coeff_b[idx]] , ' | ', log_fit_b.coef_[1,flat_coeff_b[idx]] ,' | ', log_fit_b.coef_[2,flat_coeff_b[idx]] ,' | ', log_fit_b.coef_[3,flat_coeff_b[idx]], ' |' 
    print '\n Surprising result: with bigrams, many of the top features are uninformative, frequently used English articles and prepositions, yielding a generally poorer fit'
    ### STUDENT END ###
P4()


|  feature  |   alt.atheism  |  comp.graphics  |  sci.space  |  talk.religion.misc  |
atheism  |  0.495418356796  |  -0.207325630215  |  -0.199941614282  |  -0.267754467218  |
religion  |  0.494077150295  |  -0.298757335164  |  -0.3932368874  |  0.00392538451667  |
bobby  |  0.478243383208  |  -0.120402716857  |  -0.16789623251  |  -0.22782902043  |
atheists  |  0.461117113844  |  -0.0793601633718  |  -0.158353196395  |  -0.295276401438  |
islam  |  0.426406272369  |  -0.0848390127703  |  -0.165086290613  |  -0.164906119095  |
graphics  |  -0.411184655658  |  1.00737234387  |  -0.651167218264  |  -0.372234034171  |
image  |  -0.263467734536  |  0.641990191736  |  -0.36761726718  |  -0.216178532876  |
file  |  -0.17728920947  |  0.641100294104  |  -0.421568504325  |  -0.288273762203  |
computer  |  -0.0398300853404  |  0.559031049019  |  -0.329120157129  |  -0.228690702954  |
3d  |  -0.182003018061  |  0.546926455409  |  -0.311652540329  |  -0.18144038333  |
space  |  -0.65522186367  |

ANSWER:

(5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer. The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization. For example, try lowercasing everything, replacing sequences of numbers with a single token, removing various other non-letter characters, and shortening long words. If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular. With your new preprocessor, how much did you reduce the size of the dictionary?

For reference, I was able to improve dev F1 by 2 points.
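
As a minimal sketch of the normalizations listed above, the hypothetical function below takes a raw string and returns a normalized string. That is the contract of CountVectorizer's `preprocessor` argument, whereas `tokenizer` expects a function that returns a list of tokens. The placeholder token `num` and the word-length cutoff of 6 are arbitrary choices for illustration, not tuned values.

```python
import re

def normalize_text(s, max_word_len=6):
    """Hypothetical preprocessor: raw string in, normalized string out."""
    s = s.lower()                                   # lowercase everything
    s = re.sub(r"\d+", " num ", s)                  # digit runs -> one placeholder token
    s = re.sub(r"[^a-z\s]", " ", s)                 # drop remaining non-letter characters
    s = re.sub(r"([a-z]{%d})[a-z]+" % max_word_len, r"\1", s)  # shorten long words
    return re.sub(r"\s+", " ", s).strip()           # collapse runs of whitespace

example = normalize_text("Saved 1024x768 images in 3DS-format, extraordinarily fast!!")
print(example)
```

A custom preprocessor replaces the vectorizer's built-in lowercasing step, which is why the function lowercases explicitly.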

In [182]:
def empty_preprocessor(s):
    return s

def better_preprocessor(s):
### STUDENT START ###
    # adapted from a scikit-learn example: tokenize, then replace tokens that start with a digit or underscore with a generic '#'
    token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
    tokens = token_pattern.findall(s)
    tokens = ["#" if token[0] in "0123456789_" else token
              for token in tokens]
    return tokens
### STUDENT END ###

def P5():
### STUDENT START ###
    # fitting with the custom function above, passed as the tokenizer because it returns a list of tokens
    vectorizer = CountVectorizer(min_df=1, tokenizer= better_preprocessor)
    train_data_vectorized = vectorizer.fit_transform(train_data)
    dev_data_vectorized = vectorizer.transform(dev_data)

    #logistic regression
    logf= LogisticRegression(penalty='l2', C=0.1)
    log_fit= logf.fit(train_data_vectorized, train_labels)
    predlog =  log_fit.predict(dev_data_vectorized)
    print '\n', 'With preprocessing, f1 score:', metrics.f1_score(dev_labels, predlog, average='weighted'), 'Without preprocessing, for same C, from P3, C: 0.1 f1 score: 0.697'

### STUDENT END ###
P5()


With preprocessing, f1 score: 0.688768654247 Without preprocessing, for same C, from P3, C: 0.1 f1 score: 0.697


(6) The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.
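
The difference between the two penalties can be illustrated with a single shrinkage step applied to a made-up weight vector. This is only a sketch of the mechanism (the proximal step for each penalty), not how scikit-learn's solver is implemented, and the weights and penalty strength are arbitrary.

```python
def l2_penalty(weights):
    """Sum of squared weights."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights."""
    return sum(abs(w) for w in weights)

def l2_shrink(weights, lam):
    """Proximal step for (lam / 2) * sum(w^2): every weight is scaled toward 0."""
    return [w / (1.0 + lam) for w in weights]

def l1_shrink(weights, lam):
    """Proximal step for lam * sum(|w|) (soft-thresholding): small weights become exactly 0."""
    return [(1.0 if w > 0 else -1.0) * max(abs(w) - lam, 0.0) for w in weights]

weights = [3.0, -0.2, 0.05, -1.5]   # made-up weights
lam = 0.5                           # made-up penalty strength

after_l2 = l2_shrink(weights, lam)
after_l1 = l1_shrink(weights, lam)
zeros_l2 = sum(1 for w in after_l2 if w == 0.0)
zeros_l1 = sum(1 for w in after_l1 if w == 0.0)

print("L2 penalty: %.4f, L1 penalty: %.4f" % (l2_penalty(weights), l1_penalty(weights)))
print("zeros after L2 step: %d, zeros after L1 step: %d" % (zeros_l2, zeros_l1))
```

The L2 step shrinks every weight by the same factor and leaves none at zero, while the L1 step subtracts a fixed amount and zeroes any weight smaller than that amount.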

Train a logistic regression model using a "l1" penalty. Output the number of learned weights that are not equal to zero. How does this compare to the number of non-zero weights you get with "l2"? Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.01 (the default is .0001).

In [179]:
def P6():
    # Keep this random seed here to make comparison easier.
    np.random.seed(0)

    ### STUDENT START ###
    # og_labels = newsgroups_train.target_names
    
    m_vectorizer = CountVectorizer(min_df=1)
    train_data_vect = m_vectorizer.fit_transform(train_data)
    dev_data_vect = m_vectorizer.transform(dev_data)
    
    featNames = m_vectorizer.get_feature_names()
    
    L1_nnz = []
    L1_acc = []
    
    Cs = [1e5, 1e4, 1e3, 1e2, 1e1, 1, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
    for i in range (len(Cs)):
        # l1 Regularization
        l1_logf= LogisticRegression(penalty='l1', C=Cs[i],  tol=0.01)
        l1_log_fit= l1_logf.fit(train_data_vect, train_labels)  
        l1_predlog =  l1_log_fit.predict(dev_data_vect)
        l1_nnz_coeffs = l1_log_fit.coef_.nonzero()
        l1_vocab_nnz = np.array(l1_nnz_coeffs).shape[1]  
        
        #l2 regularization for comparison
        l2_logf= LogisticRegression(penalty='l2', C=Cs[i],  tol=0.01)
        l2_log_fit= l2_logf.fit(train_data_vect, train_labels)  
        l2_predlog =  l2_log_fit.predict(dev_data_vect)
        l2_nnz_coeffs = l2_log_fit.coef_.nonzero() 
        l2_vocab_nnz = np.array(l2_nnz_coeffs).shape[1] 
        #l1_acc = metrics.f1_score(dev_labels, l1_predlog, average='weighted')
                                    
                
        L1_nnz = []
        L1_acc = []
        
        print '\n L1', 'C:', Cs[i], 'non-zero coeffs:',  l1_vocab_nnz ,'L2 non-zero coeffs:', l2_vocab_nnz                               
        
    print 'Non zero coefficients for L2 do not vary by C, while for L1 as C gets smaller, so does the feature space'
    
   
    ### STUDENT END ###
P6()


 L1 C: 100000.0 non-zero coeffs: 106437 L2 non-zero coeffs: 107516

 L1 C: 10000.0 non-zero coeffs: 95802 L2 non-zero coeffs: 107516

 L1 C: 1000.0 non-zero coeffs: 62059 L2 non-zero coeffs: 107516

 L1 C: 100.0 non-zero coeffs: 17773 L2 non-zero coeffs: 107516

 L1 C: 10.0 non-zero coeffs: 4849 L2 non-zero coeffs: 107516

 L1 C: 1 non-zero coeffs: 1778 L2 non-zero coeffs: 107516

 L1 C: 0.1 non-zero coeffs: 359 L2 non-zero coeffs: 107516

 L1 C: 0.01 non-zero coeffs: 35 L2 non-zero coeffs: 107516

 L1 C: 0.001 non-zero coeffs: 5 L2 non-zero coeffs: 107516

 L1 C: 0.0001 non-zero coeffs: 0 L2 non-zero coeffs: 107516

 L1 C: 1e-05 non-zero coeffs: 0 L2 non-zero coeffs: 107516
Non zero coefficients for L2 do not vary by C, while for L1 as C gets smaller, so does the feature space
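
The counts above are total non-zero coefficients across all four classes (the L2 figure, 107516, is 4 x 26879). The pruning step in (6) needs something slightly different: the set of features with a non-zero weight in at least one class. A minimal library-free sketch of that selection follows, using a made-up weight matrix and hypothetical feature names.

```python
# Made-up weight matrix: one row per class, one column per feature.
coef = [
    [0.0, 0.7, 0.0, 0.0, -0.2],
    [0.0, 0.0, 0.0, 0.4, 0.0],
    [0.0, -0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1],
]
feature_names = ["and", "graphics", "the", "space", "god"]  # hypothetical

n_features = len(coef[0])

# Total non-zero coefficients, counted over every class.
total_nonzero = sum(1 for row in coef for w in row if w != 0.0)

# Features to keep: columns with a non-zero weight in at least one class.
kept = [j for j in range(n_features) if any(row[j] != 0.0 for row in coef)]
kept_names = [feature_names[j] for j in kept]

def prune_columns(matrix, columns):
    """Keep only the given columns of a row-major matrix."""
    return [[row[j] for j in columns] for row in matrix]

# Made-up count matrix (2 documents x 5 features), pruned to the kept features.
toy_counts = [[2, 1, 3, 0, 0], [1, 0, 4, 2, 1]]
pruned_counts = prune_columns(toy_counts, kept)

print("total non-zero coefficients: %d" % total_nonzero)
print("kept features: %s" % kept_names)
```

The pruned matrix is what the L2 model would be retrained on, and the number of kept features is the vocabulary size for the plot.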


(7) Use the TfidfVectorizer -- how is this different from the CountVectorizer? Train a logistic regression model with C=100.
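
As a rough illustration of the difference, the sketch below reweights raw counts by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then L2-normalizes each row. This follows the formula given in scikit-learn's documentation as I understand it. The three tokenized documents are made up.

```python
import math

# Made-up tokenized documents.
docs = [["space", "launch", "space"], ["space", "graphics"], ["graphics", "image", "file"]]
vocab = sorted(set(w for d in docs for w in d))
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = dict((w, sum(1 for d in docs if w in d)) for w in vocab)

# Smoothed inverse document frequency: rarer words get larger weights.
idf = dict((w, math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0) for w in vocab)

def tfidf_row(doc):
    """Raw count times idf for each vocabulary word, scaled to unit L2 length."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(d) for d in docs]
for row in tfidf:
    print(" ".join("%.3f" % v for v in row))
```

Compared with raw counts, words that occur in many documents are down-weighted, and the row normalization removes most of the effect of document length.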

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.
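
The ratio R is computed once per dev document, from that document's row of predicted probabilities and its true label. A minimal sketch with made-up probabilities and labels follows. With scikit-learn, the probability rows would come from the fitted model's `predict_proba` on the dev matrix.

```python
def ratio_R(probs, true_label):
    """Maximum predicted probability divided by the probability of the correct label."""
    return max(probs) / probs[true_label]

# Made-up predicted probabilities (rows: documents, columns: 4 classes) and true labels.
pred_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.90, 0.03, 0.02],
    [0.25, 0.25, 0.40, 0.10],
    [0.01, 0.01, 0.97, 0.01],
    [0.60, 0.20, 0.15, 0.05],
]
true_labels = [0, 3, 2, 0, 1]

ratios = [ratio_R(p, y) for p, y in zip(pred_probs, true_labels)]

# Indices of the 3 documents with the largest R, i.e. the most confident mistakes.
top3 = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)[:3]

for i in top3:
    print("document %d: R = %.1f" % (i, ratios[i]))
```

R equals 1 when the prediction is correct and grows as the model becomes more confident in a wrong label, so the top documents are the ones worth reading in full.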

In [175]:
def P7():
### STUDENT START ###
    og_labels = newsgroups_train.target_names 

    tfid_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.95, min_df=2, norm='l2', stop_words='english')
    train_data_vect_t = tfid_vectorizer.fit_transform(train_data)
    dev_data_vect_t = tfid_vectorizer.transform(dev_data)
    fns =tfid_vectorizer.get_feature_names()
     
    # fitting and training logistic model
    t_logf= LogisticRegression(penalty='l2', C=100)
    t_log_fit= t_logf.fit(train_data_vect_t, train_labels)
    t_predlog =  t_log_fit.predict(dev_data_vect_t)
    
    # computing R ratio from predicted probabilities
    p_pred = t_log_fit.predict_proba(dev_data_vect_t)
    p_pred_max = p_pred[np.argmax(p_pred, axis= 0)]
    p_pred_max_flat = p_pred_max.flatten()
    max_p =  p_pred_max_flat[np.argmax( p_pred_max_flat)]
    
    R = p_pred/max_p
    
    sort_R =  np.argsort(-1.00000*(R), axis=0)
    top3_R = sort_R[0:4, 0:3]
    # flatten for easier looping
    flat_R= top3_R.flatten()

    # print misclassifications
    for idx in range(len(flat_R)):
        if idx ==0:
            print '\n', og_labels[0] , 'top 3 misclassifications \n'
        elif idx==3:
            print '\n',og_labels[1] , 'top 3 misclassifications \n'
        elif idx==6:
            print '\n',og_labels[2] , 'top 3 misclassifications \n'
        elif idx==9:
            print '\n',og_labels[3] , 'top 3 misclassifications \n'
        else:
            print ''
            
        print fns[flat_R[idx]]
              
    
   
    # print(classification_report(dev_labels, t_predlog, target_names= og_labels))
    # print metrics.f1_score(dev_labels, t_predlog, average='weighted')
    
### STUDENT END ###
P7()


alt.atheism top 3 misclassifications 

33

5011

15m

comp.graphics top 3 misclassifications 

101010

58

130

sci.space top 3 misclassifications 

1024x768

373

391

talk.religion.misc top 3 misclassifications 

150

45g

2000


ANSWER: The model has a really hard time with numbers. Tokenizing numbers would improve performance. 

(8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.