# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topic, inferred from the text. In digit classification, each input is relatively dense (a 28x28 matrix of pixels, many of which are non-zero); here, each document is relatively sparse (a bag-of-words), and only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the counts of words, not their order.
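
As a minimal sketch of the bag-of-words idea, the snippet below uses only the standard library rather than `sklearn`. The two example sentences and the 9-word `vocabulary` are made up for illustration. It shows that word order is discarded and that most vocabulary entries are zero for a given document.

```python
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    return Counter(text.lower().split())


doc_a = "the rocket reached orbit"
doc_b = "orbit reached the rocket"

# The same words in a different order give identical features.
bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)
print(bow_a == bow_b)

# Hypothetical vocabulary: only a few of its entries are non-zero for doc_a.
vocabulary = ["atheism", "graphics", "orbit", "pixel", "reached",
              "religion", "rocket", "space", "the"]
vector = [bow_a[word] for word in vocabulary]
print(vector)
```

`CountVectorizer` produces the same kind of count vector, but stores it as a sparse matrix so the zeros take no memory.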

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target
label_names              = newsgroups_train.target_names

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('label names:',          label_names)

In [None]:
print('training label shape:', train_labels.shape)
print('training data size:', len(train_data))

### Part 1:

For each of the first 5 training examples, print the text of the message along with the label.

In [None]:
def P1(num_examples=5):
    ### STUDENT START ###
    for i in range(num_examples):
        print("TEXT: ",train_data[i],"\nLabel",train_labels[i],"\nLabel name:", newsgroups_train.target_names[train_labels[i]],"\n")
    ### STUDENT END ###

P1(5)

### Part 2:

Transform the training data into a matrix of **word** unigram feature vectors.  What is the size of the vocabulary? What is the average number of non-zero features per example?  What is the fraction of the non-zero entries in the matrix?  What are the 0th and last feature strings (in alphabetical order)?<br/>
_Use `CountVectorizer` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method._

Now transform the training data into a matrix of **word** unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. What is the average number of non-zero features per example?<br/>
_Use `CountVectorizer(vocabulary=...)` and its `.transform` method._

Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(analyzer=..., ngram_range=...)` and its `.fit_transform` method._

Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(min_df=...)` and its `.fit_transform` method._

Now again transform the training data into a matrix of **word** unigram feature vectors. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?<br/>
_Hint: Build vocabularies for both train and dev and look at the size of the difference._

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  See about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.  
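
The two passes described in the notes can be sketched in plain Python. This is an illustration only, not how `CountVectorizer` is implemented: the helper names `fit` and `transform`, the whitespace tokenizer, and the three toy documents are all assumptions. Each sparse row is stored as a `{column index: count}` dictionary, so counting stored entries plays the role of `.nnz`.

```python
def fit(docs):
    """Pass 1 ("fit"): build a sorted vocabulary mapping word -> column index."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def transform(docs, vocab):
    """Pass 2 ("transform"): one sparse row per document, {column: count}."""
    rows = []
    for doc in docs:
        row = {}
        for w in doc.lower().split():
            if w in vocab:  # words outside the vocabulary are ignored
                row[vocab[w]] = row.get(vocab[w], 0) + 1
        rows.append(row)
    return rows


train = ["space shuttle launch", "graphics card driver", "space probe"]
vocab = fit(train)
rows = transform(train, vocab)

nnz = sum(len(row) for row in rows)        # analogue of the .nnz attribute
n_docs, n_feats = len(rows), len(vocab)    # analogue of the .shape attribute
avg_nonzero = nnz / n_docs                 # non-zero features per example
density = nnz / (n_docs * n_feats)         # fraction of non-zero entries
print(n_feats, avg_nonzero, density)
```

Calling `transform(["space telescope"], vocab)` keeps only `space`, which mirrors how `.transform` silently drops development words that were never seen during `.fit`.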

In [None]:
def P2():
    ### STUDENT START ###
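    # First question: word unigram features on the full training set.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_data)
    num_docs, vocab_size = X.shape
    print("Vocabulary size:", vocab_size)
    # X.nnz counts the stored non-zero entries without building a dense array.
    print("Average non-zero features per example: {:.3f}".format(X.nnz / num_docs))
    print("Fraction of non-zero entries: {:.5f}".format(X.nnz / (num_docs * vocab_size)))

    features = vectorizer.get_feature_names()
    print("First and last features:", features[0], "; ", features[-1])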
    
    print("Second question")
    vocab = ["atheism", "graphics", "space", "religion"]
    vectorizer1 = CountVectorizer(vocabulary=vocab)
    X1 = vectorizer1.transform(train_data)
    print("  Vocabulary size:", X1.shape[1])
    # Average over the examples (rows), not over the 4 vocabulary words.
    avg_nonzero = X1.nnz / X1.shape[0]
    print("  Average number of non-zero features per example: {:.5f}".format(avg_nonzero))
    
    print("Third Question: character bigrams and trigrams")
    # 'char_wb' builds n-grams only inside word boundaries;
    # analyzer='char' would also allow n-grams that span spaces.
    vectorizer2 = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
    X2 = vectorizer2.fit_transform(train_data)
    print("  Vocabulary size: ", X2.shape[1])

    
    print("Fourth Question: remove words that appear in fewer than 10 documents")
    # min_df=10 keeps the words that appear in at least 10 documents.
    vectorizer3 = CountVectorizer(min_df=10)
    X3 = vectorizer3.fit_transform(train_data)
    print("  Vocabulary size: ", X3.shape[1])
    
    print("Fifth Question: Fraction of words in dev_data that are not in train_data")
    vectorizer4 = CountVectorizer()
    vectorizer4.fit(dev_data)
    # Compare the two vocabularies as sets of words.
    train_vocab = set(vectorizer.vocabulary_)
    dev_vocab = set(vectorizer4.vocabulary_)
    missing = dev_vocab - train_vocab
    print("  Fraction is {:.3f}".format(len(missing) / len(dev_vocab)))
    ### STUDENT END ###

P2()

### Part 3:

Transform the training and development data to matrices of word unigram feature vectors.

1. Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score.
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.

* Why doesn't k-Nearest Neighbors work well for this problem?
* Why doesn't Logistic Regression work as well as Naive Bayes does?
* What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
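
To make `average="weighted"` concrete, here is a sketch of the computation in plain Python: a per-class f1 score, averaged with each class weighted by its number of true examples. The function name `weighted_f1` and the toy labels are assumptions for illustration; in the project itself, `metrics.f1_score` should be used.

```python
def weighted_f1(y_true, y_pred):
    """Per-class f1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        support = tp + fn  # number of true examples of this class
        score += f1 * support / total
    return score


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print("{:.3f}".format(weighted_f1(y_true, y_pred)))
```

Because the weights are the class supports, a class with many development examples influences the score more than a rare one.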

In [None]:
from sklearn.model_selection import GridSearchCV
def P3():
    ### STUDENT START ###
#    print(Xarr.shape)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_data)
#    print(len(train_data))
    Xarr = X.toarray()
#    print(Xarr5.shape)
    # model will ignore the ~4000 words in Y that are missing in X
    Y =  vectorizer.transform(dev_data)
    Yarr = Y.toarray()
#    print(Yarr.shape) 
#    features = vectorizer.get_feature_names()

    #K-neighbors
    print("K-neighbors")
#    for k in [1,2,3,4,5,6,7,8,9,10,11,12]:
    for k in [6,7,8]:
        model = KNeighborsClassifier(k)
        model.fit(Xarr, train_labels)
        dev_predict = model.predict(Yarr)
        out = classification_report(dev_labels, dev_predict, output_dict=True)
#        print(out)
        wgtavg = out['weighted avg']
        print("k=", k," Accuracy {:.3f}".format(out['accuracy'])," f1 score {:.3f}".format(wgtavg['f1-score']))

    # MultinomialNB
    print("Naive Bayes")
    # Fit one model per smoothing value and report the weighted f1 score.
    for alpha in [1.0e-10, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]:
        # alpha is passed by keyword; newer sklearn versions require this.
        model = MultinomialNB(alpha=alpha)
        model.fit(Xarr, train_labels)
        dev_predict = model.predict(Yarr)
        out = classification_report(dev_labels, dev_predict, output_dict=True)
        wgtavg = out['weighted avg']
        print("Alpha: ", alpha, "\tf1 score {:.3f}".format(wgtavg['f1-score']))
###


    # Logistic Regression
    print("Logistic Regression")
    for c in [0.001, 0.01, 0.1, 0.5, 1, 1.5, 2, 10]:
        model = LogisticRegression(C=c, solver="liblinear", multi_class="auto")
        model.fit(Xarr, train_labels)
        dev_predict = model.predict(Yarr)
        out = classification_report(dev_labels, dev_predict, output_dict=True)
        wgtavg = out['weighted avg']

        # Sum of squared weights for each topic (coef_ has one row per topic).
        s = np.sum(model.coef_ ** 2, axis=1)
        print("C= ", c, "f1 score {:.3f}".format(wgtavg['f1-score']), "\tSum sq wgts: ", s)
 
    ### STUDENT END ###

P3()

ANSWER: 
I ran KNeighbors classifier for k in range(1,11). It took a long time to run.
The accuracy peaked at 0.47 for k=7 then went back down to 0.45 for k>7.
I have reduced the range so it is easier for the grader to run the cell.
Best alpha = 0.1
Best C = 0.5

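Why doesn't k-Nearest Neighbors work well for this problem?
k-Nearest Neighbors compares documents by the Euclidean distance between their count vectors. Those vectors are very high-dimensional and sparse, so the distance is driven mostly by document length and by common words that appear in every topic, rather than by the few words that identify the topic. Two documents with similar counts of common words are considered close even when their topics differ.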
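
One weakness of Euclidean distance on raw count vectors can be seen in a toy example. The three vectors and the 4-word vocabulary below are made up for illustration, and cosine similarity is shown only for contrast (it is not part of the assignment). A short document ends up closer to a short document on another topic than to a long document on its own topic.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity, which ignores vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary columns: ["orbit", "rocket", "pixel", "render"]
short_space = [1, 1, 0, 0]
long_space = [10, 10, 0, 0]
short_graphics = [0, 0, 1, 1]

d_same_topic = euclidean(short_space, long_space)      # about 12.7
d_other_topic = euclidean(short_space, short_graphics)  # 2.0
print(d_same_topic, d_other_topic)
print(cosine(short_space, long_space), cosine(short_space, short_graphics))
```

By Euclidean distance the nearest neighbor of `short_space` is the graphics document, even though the two share no words.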

Why doesn't Logistic Regression work as well as Naive Bayes does? Logistic regression learns a separate weight for every feature and topic directly from the training labels. With a vocabulary this large relative to the number of training documents, those weights are prone to overfitting. Naive Bayes instead models the distribution of words within each topic, which is a closer match to the structure of text and, with smoothing, needs less data to estimate.

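What is the relationship between logistic regression's sum of squared weights vs. C value?
C is the inverse of the regularization strength. As C gets larger, the penalty term w^T w in the cost function matters less relative to the training loss, so the algorithm has more leeway to grow positive and negative weights to minimize cost, and the sum of squared weights increases with C. The f1 score on the development data does not keep improving, though: it was best at C = 0.5.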
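
The relationship can be read off the objective. As a sketch, assuming the L2-penalized form given in the scikit-learn documentation for each one-vs-rest binary problem (labels $y_i \in \{-1, +1\}$ and intercept $b$; this notation is an assumption and does not come from the assignment):

$$\min_{w,\,b} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^T w + b\right)\right)\right)$$

C multiplies the data-fit term, so a larger C makes the penalty $\frac{1}{2} w^T w$ relatively cheaper and allows the squared weights to grow.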

### Part 4:

Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.  For each topic, find the 5 features with the largest weights (that's 20 features in total).  Show a 20 row (features) x 4 column (topics) table of the weights.

Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 
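
The selection step can be sketched without a trained model. The snippet below uses plain Python lists; the helper `argsort` imitates what `np.argsort` returns, and the feature names and weights are hypothetical. It picks the top 2 features per topic (the assignment asks for 5) and builds a table with one row per selected feature and one column per topic.

```python
def argsort(values):
    """Indices that sort values in ascending order, like np.argsort."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Hypothetical weights: one row per topic, one column per feature.
features = ["the moon", "in space", "3d graphics", "of god", "you are"]
coef = [
    [-0.2, -0.5, -0.1, 0.6, 0.1],    # topic 0
    [-0.3, -0.4, 0.9, -0.6, -0.2],   # topic 1
    [0.8, 0.7, -0.2, -0.1, -0.3],    # topic 2
]

top_k = 2
rows = []
for weights in coef:
    # Ascending order, so the last top_k indices hold the largest weights.
    rows.extend(argsort(weights)[-top_k:][::-1])

# One table row per selected feature, one column per topic.
table = [[coef[topic][idx] for topic in range(len(coef))] for idx in rows]
for idx, weights in zip(rows, table):
    print("{:<12}".format(features[idx]),
          " ".join("{:>7.2f}".format(w) for w in weights))
```

Printing every topic's weight for each selected feature makes it easy to see whether a feature that is strong for one topic is neutral or negative for the others.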

In [None]:
def P4():
    ### STUDENT START ###
    # Word bigram features. Keep the matrix sparse: a dense copy is very large.
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(train_data)
    print(X.shape)
    model = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    model.fit(X, train_labels)

    num_topics = model.coef_.shape[0]
    feature_names = vectorizer.get_feature_names()

    # Table header: one weight column per topic.
    header = "{:<20} {:<18}".format("Top feature for", "Feature")
    header += "".join("{:>20}".format(name) for name in label_names)
    print(header)
    for topic in range(num_topics):
        # argsort is ascending, so the last 5 indices hold the largest weights.
        top5 = np.argsort(model.coef_[topic, :])[-5:][::-1]
        for idx in top5:
            # Weights of this feature for all 4 topics.
            weights = "".join("{:>20.4f}".format(w) for w in model.coef_[:, idx])
            print("{:<20} {:<18}{}".format(label_names[topic], feature_names[idx], weights))

    ### STUDENT END ###

P4()

ANSWER: Some of the top features are surprising. For example, "is there" and "out there" get large weights for comp.graphics, and "are you" and "you are" get large weights for alt.atheism, even though these bigrams carry no obvious topical meaning.

### Part 5:

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

Produce a Logistic Regression model (with no preprocessing of text).  Evaluate and show its f1 score and size of the dictionary.

Produce an improved Logistic Regression model by preprocessing the text.  Evaluate and show its f1 score and size of the vocabulary.  Try for an improvement in f1 score of at least 0.02.

How much did the improved model reduce the vocabulary size?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.
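
The preprocessing ideas listed above can be sketched with `re.sub`. The function name `sketch_preprocessor`, the placeholder token `numtoken`, and the 8-character cutoff are assumptions chosen for illustration; they are not tuned and are not the settings used in the answer below.

```python
import re


def sketch_preprocessor(text, max_len=8):
    """Lowercase, replace digit runs, drop non-letters, shorten long words."""
    text = text.lower()
    # Replace every run of digits with a single placeholder token.
    text = re.sub(r"\d+", " numtoken ", text)
    # Replace anything that is not a lowercase letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Truncate long words so that related word forms share a prefix.
    words = [word[:max_len] for word in text.split()]
    return " ".join(words)


print(sketch_preprocessor("Call 555-1234 about the 3D-rendering!!!"))
```

A function like this can be passed as `CountVectorizer(preprocessor=...)`. Note that a custom preprocessor replaces the default lowercasing step, so it has to lowercase the text itself.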

In [None]:
from nltk.stem import LancasterStemmer

# Build the stemmer once; it is reused for every document.
# (The Lancaster stemmer worked better for this example than the Porter stemmer.)
lancaster_stemmer = LancasterStemmer()


def better_preprocessor(s):
    ### STUDENT START ###
    new_s = s.lower()
    new_s = re.sub(r"\W", " ", new_s)  # replace special characters with spaces

    # Stem each whitespace-separated word.
    words = re.split(r"\s+", new_s)
    stemmed_words = [lancaster_stemmer.stem(word) for word in words]
    new_s = ' '.join(stemmed_words)

    new_s = re.sub(r"[0-9]", "", new_s)  # remove digits
    # Drop common stop words. Each match consumes the whitespace around it,
    # so the second of two adjacent stop words can survive.
    new_s = re.sub(r"\s+(in|the|all|for|and|on|you|i|he|it|have|am|are|is|just|that|be|a|to|but|there)\s", " ", new_s)

    return new_s
    ### STUDENT END ###


def P5():
    ### STUDENT START ###
    # Baseline: default CountVectorizer preprocessing.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_data)
    # Dev words that are not in the training vocabulary are ignored.
    Y = vectorizer.transform(dev_data)
    print("First run, X and Y shapes")
    print(X.shape)
    print(Y.shape)
    model = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    model.fit(X, train_labels)
    dev_predict = model.predict(Y)
    f1_base = metrics.f1_score(dev_labels, dev_predict, average="weighted")
    print("Dictionary size: ", X.shape[1], "f1 score {:.3f}".format(f1_base))

    # Same model with the custom preprocessor.
    vectorizer1 = CountVectorizer(preprocessor=better_preprocessor)
    X1 = vectorizer1.fit_transform(train_data)
    Y1 = vectorizer1.transform(dev_data)
    print("After preprocessor, X and Y shapes")
    print(X1.shape)
    print(Y1.shape)

    model.fit(X1, train_labels)
    dev_predict1 = model.predict(Y1)
    f1_new = metrics.f1_score(dev_labels, dev_predict1, average="weighted")
    print("Dictionary size: ", X1.shape[1], "f1 score {:.3f}".format(f1_new))
    print("Vocabulary reduction: ", X.shape[1] - X1.shape[1])
    ### STUDENT END ###

P5()

ANSWER: The vocabulary shrank by about 12,000 words (about 45%), and the f1 score improved by close to 0.02.

### Part 6:

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, L1 regularization drives many of the weights to 0, effectively removing unimportant features.
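
The contrast between the two penalties can be illustrated with their shrinkage steps (the proximal operators of the L1 and L2 penalties). This is a sketch for intuition only: the function names, weights, and strength are made up, and it is not the algorithm that `liblinear` runs. The L1 step subtracts a fixed amount and clips at zero, while the L2 step scales every weight by the same factor.

```python
def l1_step(w, strength):
    """Soft-thresholding: move w toward zero by `strength`, clipping at 0."""
    if w > strength:
        return w - strength
    if w < -strength:
        return w + strength
    return 0.0


def l2_step(w, strength):
    """Proportional shrinkage: scale w toward zero, never reaching it."""
    return w / (1.0 + strength)


weights = [2.0, 0.3, -0.1, -1.5]          # hypothetical weights
after_l1 = [l1_step(w, 0.5) for w in weights]
after_l2 = [l2_step(w, 0.5) for w in weights]

print("L1:", after_l1)   # small weights become exactly 0
print("L2:", after_l2)   # all weights shrink, none become 0
print("zeroed by L1:", sum(1 for w in after_l1 if w == 0.0))
```

Weights smaller than the threshold vanish under the L1 step, which is why an L1-regularized model can be used to prune the vocabulary.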

For several L1 regularization strengths ...<br/>
* Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.  Produce a new Logistic Regression model using the reduced vocabulary and **L2** regularization strength of 0.5.  Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.

Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.

How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).

In [None]:
def P6():
    # Keep this random seed here to make comparison easier.
    #np.random.seed(0)
    
    ### STUDENT START ###
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_data)
    # Dev words that are not in the training vocabulary are ignored.
    Y = vectorizer.transform(dev_data)

    # Points for the plot: log vocabulary size and f1 scores.
    x = []
    y = []
    y1 = []
    for c in [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2]:
        print("c = ", c)
        # L1 model on the full vocabulary; tol is relaxed to help convergence.
        l1_model = LogisticRegression(C=c, solver="liblinear", multi_class="auto",
                                      penalty="l1", tol=0.015)
        l1_model.fit(X, train_labels)
        f1_l1 = metrics.f1_score(dev_labels, l1_model.predict(Y), average="weighted")
        print("l1 - Dictionary size: ", X.shape[1], "f1 score {:.3f}".format(f1_l1))
        y1.append(f1_l1)

        # Keep only the features with a non-zero weight for at least one topic.
        keep = np.flatnonzero(np.any(l1_model.coef_ != 0, axis=0))
        X1 = X[:, keep]
        Y1 = Y[:, keep]

        # L1 model refit on the reduced vocabulary.
        l1_model.fit(X1, train_labels)
        f1_l1_reduced = metrics.f1_score(dev_labels, l1_model.predict(Y1), average="weighted")
        print("l1 with dictionary size = ", X1.shape[1], "f1 score {:.3f}".format(f1_l1_reduced))

        # L2 model (C=0.5) trained on the reduced vocabulary.
        l2_model = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto", penalty="l2")
        l2_model.fit(X1, train_labels)
        f1_l2 = metrics.f1_score(dev_labels, l2_model.predict(Y1), average="weighted")
        print("l2 with dictionary size= ", X1.shape[1], "f1 score {:.3f}".format(f1_l2))
        x.append(np.log(X1.shape[1]))
        y.append(f1_l2)

    # Plot f1 score against log vocabulary size.
    fig, ax = plt.subplots()
    ax.plot(x, y, c='green', label="l2 (C=0.5) on reduced vocabulary")
    ax.plot(x, y1, c='blue', label="l1 (features with non-zero weights)")
    ax.legend()
    ax.grid(True)
    ax.set_xlabel("Ln(vocabulary size)")
    ax.set_ylabel("f1-score")
    ax.set_title("f1 score vs. log vocabulary size")
    plt.show()
    ### STUDENT END ###

P6()

ANSWER: With L1 regularization, the model's performance first improves as the vocabulary grows, then deteriorates once ln(vocabulary size) passes a certain value (about 7). Once the vocabulary has been reduced, the L2 model gives a better f1 score than the L1 model.

### Part 7:

How is `TfidfVectorizer` different than `CountVectorizer`?

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.

Show the 3 documents with highest R ratio, where ...<br/>
$R\,ratio = maximum\,predicted\,probability \div predicted\,probability\,of\,correct\,label$

Explain what the R ratio describes.  What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

Note:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.
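
The R ratio defined above can be sketched in plain Python. The probabilities and labels below are made up; in the project they would come from `.predict_proba` and `dev_labels`. The function name `r_ratios` is an assumption for illustration.

```python
def r_ratios(probs, labels):
    """R = maximum predicted probability / probability of the correct label."""
    return [max(row) / row[label] for row, label in zip(probs, labels)]


# Hypothetical predicted probabilities for 3 documents and 4 topics.
probs = [
    [0.5, 0.25, 0.125, 0.125],    # correct label 0: prediction is right
    [0.125, 0.5, 0.25, 0.125],    # correct label 0: confidently wrong
    [0.25, 0.25, 0.375, 0.125],   # correct label 2: prediction is right
]
labels = [0, 0, 2]

ratios = r_ratios(probs, labels)
# Document indices ordered from the highest R ratio to the lowest.
worst_first = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
print(ratios)
print(worst_first[0])
```

R equals 1 whenever the correct label has the highest probability, and it grows as the model becomes more confident in a wrong label.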

In [None]:
def P7():
    ### STUDENT START ###
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_data)
    Xarr = X.toarray()
    #print(Xarr.shape)
    # Y will ignore the 4000 missing words
    Y =  vectorizer.transform(dev_data)
    Yarr = Y.toarray()
    #print(Yarr.shape) 
    
    # Run regression, fit, predict
    LR = LogisticRegression(C=100, solver="liblinear", multi_class="auto")
    LR.fit(Xarr, train_labels)
    LR.predict(Yarr)
    dev_predict = LR.predict(Yarr)
    #        print(dev_predict.shape)
    out = classification_report(dev_labels, dev_predict, output_dict=True)
    wgtavg = out['weighted avg']

    print("Dictionary size: ",Xarr.shape[1],"\nf1 score for dev set with TfidfVectorizer {:.3f}".format(wgtavg['f1-score']))
    #    print(" Dictionary size: ",Xarr.shape[1],"f1 score",out)

    # Run regression, fit, predict for Xarr
    LR = LogisticRegression(C=100, solver="liblinear", multi_class="auto")
    LR.fit(Xarr, train_labels)
    LR.predict(Xarr)
    train_predict = LR.predict(Xarr)
    #        print(dev_predict.shape)
    out = classification_report(train_labels, train_predict, output_dict=True)
    wgtavg = out['weighted avg']

    print("f1 score for train set with TfidfVectorizer {:.3f}".format(wgtavg['f1-score']))
    #    print(" Dictionary size: ",Xarr.shape[1],"f1 score",out)
    
    # Calculate R = max predicted probability / probability of the correct label.
    p = LR.predict_proba(Yarr)
    R = p.max(axis=1) / p[np.arange(p.shape[0]), dev_labels]

    # Indices of the n largest R ratios, sorted in ascending order of R.
    # (np.argpartition does not sort the top n, so argsort is used instead.)
    n = 10
    ind = np.argsort(R)[-n:]

    # Print the predicted and correct topics for the top n R ratios.
    print("R for top 10 ratios")
    for i in range(-n, 0):
        print("i=", i, " R: {:.1f}".format(R[ind[i]]), " predicted topic:", label_names[np.argmax(p[ind[i], :])], "\t correct topic:", label_names[dev_labels[ind[i]]])

    # Print the 3 documents with the highest R ratio.
    for i in range(-3, 0):
        print("\n i=", i, " document: \n", dev_data[ind[i]])
           

    # Find top features for atheism and religion
    features = vectorizer.get_feature_names()
    i_ath = 0
    i_rel = 3
    coef_ath = LR.coef_[i_ath,:]
    coef_rel = LR.coef_[i_rel,:]
    #print(coef_ath.shape)

    ind_ath = np.argsort(coef_ath)  
    ind_rel = np.argsort(coef_rel)

    print ("\n\nTOP FEATURES FOR Atheism and Religion")
    for i in range(-10,0):
        print(i," Top Atheism features:",features[ind_ath[i]], coef_ath[ind_ath[i]])

    for i in range(-10,0):
        print(i," Top Religion features:",features[ind_rel[i]], coef_rel[ind_rel[i]])
 

    # Same model with CountVectorizer, for comparison.
    cv = CountVectorizer()
    X_cv = cv.fit_transform(train_data)
    Y_cv = cv.transform(dev_data)

    LR = LogisticRegression(C=100, solver="liblinear", multi_class="auto")
    LR.fit(X_cv, train_labels)
    dev_predict = LR.predict(Y_cv)
    f1_cv = metrics.f1_score(dev_labels, dev_predict, average="weighted")

    print("\nf1 score with CountVectorizer {:.3f}".format(f1_cv))

    ### STUDENT END ###

P7()

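ANSWER: `CountVectorizer` stores raw term counts. `TfidfVectorizer` multiplies each term's count in a document (tf) by an inverse document frequency (idf) weight that shrinks as the term appears in more documents, so words that occur almost everywhere count for less. The idf starts from the number of documents that contain the term as a fraction of the total number of documents. The ratio is then modified in order to
    
    (1) reduce the total scale of these numbers (take log)
    (2) have only non-negative numbers (take the inverse of the ratio so that it is at least 1, hence a non-negative log) 
    (3) prevent division by zero (add 1 to denominator)

With the default settings, each document vector is then scaled to unit length.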
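
As a sketch of the weighting, the snippet below computes tf-idf in plain Python. It assumes the smoothed formula from the scikit-learn documentation, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row (the `TfidfVectorizer` defaults). The whitespace tokenizer and the three toy documents are assumptions, so the numbers will not match `TfidfVectorizer` on real text.

```python
import math


def tfidf(docs):
    """Return (vocabulary, idf weights, L2-normalized tf-idf rows)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    # Smoothed idf: 1 is added to numerator and denominator, then to the log.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        raw = [doc.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, idf, rows


docs = ["space probe launch", "space shuttle", "graphics card"]
vocab, idf, rows = tfidf(docs)
print("idf(space) = {:.4f}".format(idf["space"]))   # appears in 2 documents
print("idf(probe) = {:.4f}".format(idf["probe"]))   # appears in 1 document
```

The word that appears in more documents (`space`) gets the smaller idf weight, which is the main difference from raw counts.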
    
A large ratio means that the model was badly wrong: the probability assigned to the (incorrect) predicted label is much larger than the probability assigned to the correct label. When the prediction is correct, R equals 1.

Issue: The model is not good at differentiating Religion and Atheism, because both topics deal with very similar subjects. Removing features that are likely to appear in both topics should help the model.
    Another issue could be that about 4,000 features in the dev data set are not in the train data set. The model has no weights for those words, which may contribute to the gap between the f1 score for the train data set (0.98) and the f1 score for the dev set (0.76).

### Part 8 EXTRA CREDIT:

Produce a Logistic Regression model to implement your suggestion from Part 7.

In [None]:
def P8():
    # Same tf-idf model as in Part 7, without the features "christ" and "jesus".
    # Everything is rebuilt here because the variables in P7 are local to it.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_data)
    Y = vectorizer.transform(dev_data)
    features = vectorizer.get_feature_names()

    # Column indices of all features except the ones to delete.
    delete_features = ["christ", "jesus"]
    keep = [i for i, name in enumerate(features) if name not in delete_features]
    X2 = X[:, keep]
    Y2 = Y[:, keep]

    LR = LogisticRegression(C=100, solver="liblinear", multi_class="auto")
    LR.fit(X2, train_labels)
    dev_predict2 = LR.predict(Y2)
    f1 = metrics.f1_score(dev_labels, dev_predict2, average="weighted")
    print("Dictionary size: ", X2.shape[1], "f1 score {:.3f}".format(f1))

P8()



ANSWER: Removing the words christ and jesus from features since they are likely to appear in both religion and 
    atheism slightly improved the performance of the model (from 0.760 to 0.764). 