# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topics inferred from the text. In digit classification, each input is relatively dense (represented as a 28x28 matrix of pixels, many of which are non-zero). Here, each document is relatively sparse (represented as a bag-of-words): only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the counts of words, not their order.
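
To make the bag-of-words assumption concrete, here is a minimal sketch using only the standard library. The `bag_of_words` helper and the two toy sentences are illustrative assumptions, not part of the assignment, and `CountVectorizer` tokenizes more carefully than a plain whitespace split.

```python
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split on whitespace, and count each token."""
    return Counter(text.lower().split())

# Two toy documents with the same words in a different order.
doc_a = "the rocket reached orbit"
doc_b = "orbit reached the rocket"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

# Word order is discarded, so the two representations are identical.
same_bag = bow_a == bow_b
print(same_bag)

# Words that never occur have a count of zero, which is what makes
# the representation sparse over a large vocabulary.
print(bow_a["graphics"])
```

Because only counts survive, any model trained on these features cannot tell the two sentences apart.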

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         newsgroups_train.target_names)

### Part 1:

For each of the first 5 training examples, print the text of the message along with the label.

In [None]:
def P1(num_examples=5):
    ### STUDENT START ###
    i = 0
    # Print label and message by looping through the examples
    while i < num_examples:
        print("Train Label: " + str(newsgroups_train.target_names[train_labels[i]]) + "\nTrain Data: \n" + str(train_data[i]) + "\n\n")
        i = i + 1
    ### STUDENT END ###

P1(5)

### Part 2:

Transform the training data into a matrix of **word** unigram feature vectors.  What is the size of the vocabulary? What is the average number of non-zero features per example?  What is the fraction of the non-zero entries in the matrix?  What are the 0th and last feature strings (in alphabetical order)?<br/>
_Use `CountVectorizer` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method._

Now transform the training data into a matrix of **word** unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. What is the average number of non-zero features per example?<br/>
_Use `CountVectorizer(vocabulary=...)` and its `.transform` method._

Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(analyzer=..., ngram_range=...)` and its `.fit_transform` method._

Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(min_df=...)` and its `.fit_transform` method._

Now again transform the training data into a matrix of **word** unigram feature vectors. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?<br/>
_Hint: Build vocabularies for both train and dev and look at the size of the difference._

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  See about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.  
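
The quantities asked for in this part can be read off the sparse matrix directly. The following is a small sketch on toy corpora (the sentences and variable names are assumptions for illustration). It reads the vocabulary from the fitted `vocabulary_` attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora, assumed for illustration; the project uses newsgroup posts.
toy_train = ["space shuttle launch", "graphics card", "space graphics space"]
toy_dev = ["space probe", "graphics card driver"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(toy_train)  # sparse matrix: documents x vocabulary

n_docs, vocab_size = X.shape
avg_nonzero = X.nnz / n_docs                  # non-zero features per document
frac_nonzero = X.nnz / (n_docs * vocab_size)  # density of the matrix
train_vocab = sorted(vectorizer.vocabulary_)  # terms in alphabetical order

# Fraction of the dev vocabulary that never appears in training.
dev_vocab = set(CountVectorizer().fit(toy_dev).vocabulary_)
missing = dev_vocab - set(train_vocab)
frac_missing = len(missing) / len(dev_vocab)

print(X.shape, X.nnz, avg_nonzero, frac_nonzero)
print(train_vocab[0], train_vocab[-1], frac_missing)
```

Repeated words within a document increase a count but not `.nnz`, since each (document, term) pair is stored once.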

In [None]:
def P2():
    ### STUDENT START ###
    #training data into a matrix of word unigram feature vectors
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(train_data)
    print("1)\n")
    #1a) What is the size of the vocabulary?
    print("Length of vocab: " + str(len(vectorizer.get_feature_names())))
    #1b) What is the average number of non-zero features per example? 
    print("Avg num of non-zero features: " + str(matrix.nnz/len(train_data)))
    #1c) What is the fraction of the non-zero entries in the matrix? 
    print("Fraction of non-zero features: " + str(matrix.nnz/(matrix.shape[0]*matrix.shape[1])))
    #1d) What are the 0th and last feature strings (in alphabetical order)?
    print("Zeroth feature string: " + vectorizer.get_feature_names()[0])
    print("Last feature string: " + vectorizer.get_feature_names()[-1])
    print("\n\n")
    
    #training data into a matrix of word unigram feature vectors using your own vocabulary with 
    #these 4 words: ["atheism", "graphics", "space", "religion"]
    vectorizer2 = CountVectorizer(vocabulary = ["atheism", "graphics", "space", "religion"])
    matrix2 = vectorizer2.transform(train_data)
    print("2)\n")
    #2a) Confirm the size of the vocabulary. 
    print("Length of vocab: " + str(len(vectorizer2.get_feature_names())))
    #2b) What is the average number of non-zero features per example?
    print("Avg num of non-zero features: " + str(matrix2.nnz/len(train_data)))
    print("\n\n")
    
    #training data into a matrix of character bigram and trigram feature vectors
    vectorizer3 = CountVectorizer(analyzer="char", ngram_range=(2,3))
    matrix3 = vectorizer3.fit_transform(train_data)
    print("3)\n")
    #3a) What is the size of the vocabulary?
    print("Length of vocab: " + str(len(vectorizer3.get_feature_names())))
    print("\n\n")
    
    #training data into a matrix of word unigram feature vectors and prune words that appear in fewer than 10 documents
    vectorizer4 = CountVectorizer(min_df=10)
    matrix4 = vectorizer4.fit_transform(train_data)
    print("4)\n")
    #4a) What is the size of the vocabulary?
    print("Length of vocab: " + str(len(vectorizer4.get_feature_names())))
    print("\n\n")
   
    # training data into a matrix of word unigram feature vectors -difference with dev
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(train_data)
    vectorizer5 = CountVectorizer()
    matrix5 = vectorizer5.fit_transform(dev_data)
    
    #5a) What is the fraction of words in the development vocabulary that is missing from the training vocabulary?
    dev_features = set(vectorizer5.get_feature_names())
    train_features = set(vectorizer.get_feature_names())
    difference = dev_features - train_features
    print("5)\n")
    print("Fraction of dev vocab missing from train vocab: " + str(len(difference)/len(dev_features)))
    
    ### STUDENT END ###

P2()

### Part 3:

Transform the training and development data to matrices of word unigram feature vectors.

1. Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score.
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.

* Why doesn't k-Nearest Neighbors work well for this problem?
* Why doesn't Logistic Regression work as well as Naive Bayes does?
* What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
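
For reference, the weighted f1 score is the per-class f1 averaged with weights proportional to each class's number of true examples. The sketch below recomputes it by hand in plain Python. The `weighted_f1` helper and the toy labels are assumptions for illustration, and in the project you would call `metrics.f1_score` instead.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += n_true * f1
    return total / len(y_true)

# Toy labels: one class-0 example is misclassified as class 1.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
score = weighted_f1(y_true, y_pred)
print(score)
```

Here classes 0 and 1 each have f1 of 0.8 and class 2 has f1 of 1.0, so the weighted score is (3 * 0.8 + 2 * 0.8 + 1 * 1.0) / 6.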

In [None]:
def P3():
    ### STUDENT START ###
    vectorizer = CountVectorizer()
    vectorizer.fit(train_data)
    matrix_train = vectorizer.transform(train_data)
    matrix_dev = vectorizer.transform(dev_data)
    
    #Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score. 
    #For each model, show the k value and f1 score.
    k_vals = [1, 2, 3, 4, 5, 6, 7]
#     k = 1
    for k in k_vals:
        k_model1 = KNeighborsClassifier(k)
        k_model1.fit(matrix_train, train_labels)
        k_model1_preds = k_model1.predict(matrix_dev)
        k_model1_fscore = metrics.f1_score(y_true = dev_labels, y_pred = k_model1_preds, average='weighted')
        print("For Nearest Neighbor = " + str(k) + ", the f-score of the trained model is " + str( k_model1_fscore))
    
#     k = 3
#     k_model3 = KNeighborsClassifier(k)
#     k_model3.fit(matrix_train, train_labels)
#     k_model3_preds = k_model3.predict(matrix_dev)
#     k_model3_fscore = metrics.f1_score(y_true = dev_labels, y_pred = k_model3_preds, average='weighted')
#     print("For Nearest Neighbor = " + str(k) + ", the f-score of the trained model is " + str(k_model3_fscore))
          
#     k = 5
#     k_model5 = KNeighborsClassifier(k)
#     k_model5.fit(matrix_train, train_labels)
#     k_model5_preds = k_model5.predict(matrix_dev)
#     k_model5_fscore = metrics.f1_score(y_true = dev_labels, y_pred = k_model5_preds, average='weighted')
#     print("For Nearest Neighbor = " + str(k) + ", the f-score of the trained model is " + str(k_model5_fscore))
    print("\n\n")
    
    #Optimize F1 SCORE
    
    
    #Produce several Naive Bayes models by varying smoothing (alpha), 
    #including one with alpha set approximately to optimize f1 score. 
    #For each model, show the alpha value and f1 score.
    #a = .0001
    alpha_vals = [.0001, .001, .01, .1, 1, 2, 10]
    for a in alpha_vals:
        nv_model1 = MultinomialNB(alpha = a)
        nv_model1.fit(matrix_train, train_labels)
        nv_model1_preds = nv_model1.predict(matrix_dev)
        nv_model1_fscore = metrics.f1_score(y_true = dev_labels, y_pred = nv_model1_preds, average='weighted')
        print("For Multinomial naive bayes, alpha = " + str(a) + ", the f-score of the trained model is " + str(nv_model1_fscore))

#     a = .001
#     nv_model2 = MultinomialNB(alpha = a)
#     nv_model2.fit(matrix_train, train_labels)
#     nv_model2_preds = nv_model2.predict(matrix_dev)
#     nv_model2_fscore = metrics.f1_score(y_true = dev_labels, y_pred = nv_model2_preds, average='weighted')
#     print("For Multinomial naive bayes, alpha = " + str(a) + ", the f-score of the trained model is " + str(nv_model2_fscore))

    
#     a = .01
#     nv_model3 = MultinomialNB(alpha = a)
#     nv_model3.fit(matrix_train, train_labels)
#     nv_model3_preds = nv_model3.predict(matrix_dev)
#     nv_model3_fscore = metrics.f1_score(y_true = dev_labels, y_pred = nv_model3_preds, average='weighted')
#     print("For Multinomial naive bayes, alpha = " + str(a) + ", the f-score of the trained model is " + str(nv_model3_fscore))
    print("\n\n")
    
    #Optimize F1 SCORE
    
    #Produce several Logistic Regression models by varying L2 regularization strength (C),
    #including one with C set approximately to optimize f1 score. 
    #For each model, show the C value, f1 score, and sum of squared weights for each topic.
    #c = .0001
    c_vals = [.0001, .001, .01, .1, 1, 2, 10]
    for c in c_vals:
        lgr_model1 = LogisticRegression(C=c, solver="liblinear", multi_class="auto", penalty="l2")
        lgr_model1.fit(matrix_train, train_labels)
        lgr_model1_preds = lgr_model1.predict(matrix_dev)
        lgr_model1_fscore = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model1_preds, average='weighted')
        print("For Logistic Regression, L2 regularization strength (C) = " + str(c) + ", the f-score of the trained model is " 
          + str(lgr_model1_fscore) + ", Sum of Squared weights: " + str(np.sum(lgr_model1.coef_**2, axis=1)))
    
#     c = .001
#     lgr_model2 = LogisticRegression(C=c, solver="liblinear", multi_class="auto", penalty="l2")
#     lgr_model2.fit(matrix_train, train_labels)
#     lgr_model2_preds = lgr_model2.predict(matrix_dev)
#     lgr_model2_fscore = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model2_preds, average='weighted')
#     print("For Logistic Regression naive bayes, L2 regurlarization strength (C) = " + str(c) + ", the f-score of the trained model is " 
#           + str(lgr_model2_fscore) + ", Sum of Squared weights: " + str(np.sum(lgr_model2.coef_**2, axis=1)))
    
#     c = .01
#     lgr_model3 = LogisticRegression(C=c, solver="liblinear", multi_class="auto", penalty="l2")
#     lgr_model3.fit(matrix_train, train_labels)
#     lgr_model3_preds = lgr_model1.predict(matrix_dev)
#     lgr_model3_fscore = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model3_preds, average='weighted')
#     print("For Logistic Regression naive bayes, L2 regurlarization strength (C) = " + str(c) + ", the f-score of the trained model is " 
#           + str(lgr_model3_fscore) + ", Sum of Squared weights: " + str(np.sum(lgr_model3.coef_**2, axis=1)))
#     print("\n\n")
    #Optimize F1 SCORE
    
    ### STUDENT END ###

P3()

ANSWER:
k-Nearest Neighbors uses the values of features to calculate distances between observations. With a very large vocabulary and only a few non-zero counts per document, these distances are likely dominated by common words and document length rather than by topic-specific words, so the nearest neighbors are often not from the same topic.

Logistic regression probably doesn't perform as well as Naive Bayes because the training set is small relative to the number of features. Logistic regression tends to perform better when the sample size is bigger.

As C increases, the sum of squared weights also increases. C is the inverse of regularization strength, so a larger C penalizes large weights less. That means the sum of squared weights is also inversely related to regularization strength.
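
The relationship between C and the weights can be checked on a tiny problem. This is a sketch on made-up two-feature data (the points, labels, and C values are assumptions), using `LogisticRegression` with its default L2 penalty rather than the exact settings from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable binary data (assumed for illustration).
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

sum_sq = {}
for C in [0.01, 1.0, 100.0]:
    # Larger C means a weaker L2 penalty on the weights.
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    sum_sq[C] = float(np.sum(model.coef_ ** 2))

for C, value in sum_sq.items():
    print(C, value)
```

The sum of squared weights grows as C grows, which matches the trend seen across topics in the output above.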

### Part 4:

Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.  For each topic, find the 5 features with the largest weights (that's 20 features in total).  Show a 20 row (features) x 4 column (topics) table of the weights.

Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 
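
As a quick illustration of the `np.argsort` hint, the sketch below picks the indices of the largest weights in each row of a small made-up weight matrix (the values are assumptions, standing in for `.coef_`).

```python
import numpy as np

# Toy weight matrix: 2 topics (rows) x 4 features (columns).
weights = np.array([[0.1, 2.0, -1.0, 0.5],
                    [1.5, -0.2, 0.3, 0.9]])

k = 2
# argsort sorts ascending, so the last k columns hold the largest weights;
# reversing them lists the largest first.
top_k = np.argsort(weights, axis=1)[:, -k:][:, ::-1]
print(top_k)

# Rows of the table: the full column of weights for each selected feature.
for topic, feature_ids in enumerate(top_k):
    for j in feature_ids:
        print(topic, j, weights[:, j])
```

Indexing `weights[:, j]` gives the weight of feature `j` for every topic, which is what each row of the 20 x 4 table needs.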

In [None]:
def P4():
    ### STUDENT START ###
    #set vectorizer to bigram data range
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    matrix_train = vectorizer.fit_transform(train_data)
    vocab=np.array(vectorizer.get_feature_names())
    #logistic regression model
    lgr_model1 = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    lgr_model1.fit(matrix_train, train_labels)
    #get the coefficients of the model
    coeff = lgr_model1.coef_
#     print(coeff.shape[0])
#     print(coeff.shape[1])
    #sort the coefficients
    sorted_idx = np.argsort(coeff)
    # print(sorted_idx)
    
    #for each topic print the target name
    for i in range(coeff.shape[0]):
        print(newsgroups_train.target_names[i])
        #print vocab in each topic (we look at the last 5 since they are already sorted)
        for j in sorted_idx[i, -5:]:
            print("\t " + vocab[j])
    
    #To print table, first put the topic names (in top row)
    table = np.hstack(('\t', newsgroups_train.target_names))
    #loop through as above
    for i in range(coeff.shape[0]):
        for j in sorted_idx[i, -5:]:
            #get the weight of each vocab
            weights = coeff[:, j]
            #print(vocab[j], weights)
            #stack the vocab and weights to create a table
            temp = np.hstack((vocab[j], weights))
#             print(temp)
            table = np.vstack((table, temp))
#     print(table)

    #to print the table loop through the rows and columns
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            if (i == 0):
                print(table[i][j] + "\t", end= " ")
            else: 
                print(table[i][j], end= " ")
        print("\n")
    
    ### STUDENT END ###

P4()

ANSWER: One thing that I find surprising about the list of bigram features is that they contain a lot of superfluous words such as "are you" and "is there." There are definitely more of these than actual content-related words. I guess this shows that, depending on why we are analyzing the data, it might be important to have a stop word list that would account for such superfluous words.

### Part 5:

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

Produce a Logistic Regression model (with no preprocessing of text).  Evaluate and show its f1 score and size of the dictionary.

Produce an improved Logistic Regression model by preprocessing the text.  Evaluate and show its f1 score and size of the vocabulary.  Try for an improvement in f1 score of at least 0.02.

How much did the improved model reduce the vocabulary size?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.
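
The sketch below shows one way `re.sub` can implement the preprocessing ideas listed above. The `normalize` function, the `num` placeholder token, and the 6-letter cutoff are assumptions chosen for illustration, not a recommended or tuned configuration.

```python
import re

def normalize(text, max_len=6):
    """Lower-case, collapse digit runs, drop non-letters, shorten long words."""
    text = text.lower()
    # Replace each run of digits with a single placeholder token.
    text = re.sub(r"\d+", "num", text)
    # Replace anything that is not a lower-case letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]+", " ", text)
    # Keep only the first max_len letters of each word.
    words = [word[:max_len] for word in text.split()]
    return " ".join(words)

example = normalize("Launched in 1969, Apollo-11 reached the MOON!!")
print(example)
```

A function like this can be passed as `CountVectorizer(preprocessor=...)`; whether it improves the f1 score has to be checked on the development data.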

In [None]:
def better_preprocessor(s):
    ## STUDENT START ###
    #make lower case
    s = s.lower()
    # replace each digit with the letter "n" (so "1969" becomes "nnnn")
    s = re.sub(r"\d", "n", s)
    # replace anything outside [a-zA-Z0-9_] with a space
    s = re.sub(r"\W", " ", s)
    # replace underscores with a space
    s = re.sub("_", " ", s)
    
    words = s.split()
    new_string = ""
    # truncate words longer than 4 letters to their first 4 letters
    # initially I tried length 5 for words, but then my f-score didn't improve by .02.
    # It was interesting to see the effect of the length of the words on the f-score
    for word in words:
        if len(word) > 4:
            word = word[:4]
        new_string = new_string + " " + word
    
    return new_string
    
    ## STUDENT END ###

def P5():
    ## STUDENT START ###
    
    #no preprocessing - only default lower casing
    vectorizer = CountVectorizer()
    matrix_train = vectorizer.fit_transform(train_data)
    matrix_dev = vectorizer.transform(dev_data)
    vocab=np.array(vectorizer.get_feature_names())
    #logistic model for no preprocessing
    lgr_model1 = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    lgr_model1.fit(matrix_train, train_labels)
    # get predictions and f1_score
    lgr_model1_preds = lgr_model1.predict(matrix_dev)
    lgr_model1_fscore = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model1_preds, average='weighted')
    print("With no pre-processing of text, \n\t the f-score of the trained model is " 
          + str(lgr_model1_fscore) + "\n\t the size of the vocab is " + str(len(vocab)))
    
    
    #preprocessing - with function we wrote above
    vectorizer_processed = CountVectorizer(preprocessor=better_preprocessor)
    matrix_train_processed = vectorizer_processed.fit_transform(train_data)
    matrix_dev_processed = vectorizer_processed.transform(dev_data)
    vocab2=np.array(vectorizer_processed.get_feature_names())
    #logistic model with preprocessing
    lgr_model2 = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    lgr_model2.fit(matrix_train_processed, train_labels)
    # get predictions and f1_score
    lgr_model2_preds = lgr_model2.predict(matrix_dev_processed)
    lgr_model2_fscore = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model2_preds, average='weighted')
    print("With pre-processing of text, \n\t the f-score of the trained model is " 
          + str(lgr_model2_fscore) + "\n\t the size of the vocab is " + str(len(vocab2)))
    
    print("The improved model (pre-processing of text) reduced the vocab size by " + str(len(vocab) - len(vocab2)))
    ## STUDENT END ###

P5()

### Part 6:

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, L1 regularization drives many of the weights to 0, effectively removing unimportant features.
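
Written out, the two penalties differ only in the first term of the objective. The formulation below is the binary case as I recall it from the scikit-learn user guide, with labels $y_i \in \{-1, +1\}$, weights $w$, intercept $c$, and inverse regularization strength $C$. Treat the exact scaling as an assumption rather than a quotation.

```latex
% L2-regularized logistic regression
\min_{w,\,c} \; \frac{1}{2} \sum_j w_j^2
  \;+\; C \sum_i \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + c\right)\right)\right)

% L1-regularized logistic regression
\min_{w,\,c} \; \sum_j \lvert w_j \rvert
  \;+\; C \sum_i \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + c\right)\right)\right)
```

In both cases a smaller $C$ gives the penalty term more influence relative to the training loss.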

For several L1 regularization strengths ...<br/>
* Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.  Produce a new Logistic Regression model using the reduced vocabulary and **L2** regularization strength of 0.5.  Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.

Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.

How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).
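
Reducing the vocabulary comes down to finding the columns of the weight matrix that are non-zero for at least one topic. The sketch below does this with a boolean mask on a made-up 4 x 4 weight matrix (the vocabulary and values are assumptions, standing in for `.coef_` from an L1 model).

```python
import numpy as np

# Toy vocabulary and L1-style weights: 4 topics (rows) x 4 features (columns).
vocab = np.array(["card", "graphics", "launch", "space"])
coef = np.array([
    [0.0, 0.5, 0.0,  0.0],
    [0.0, 0.0, 0.0, -1.2],
    [0.0, 0.0, 0.0,  0.0],
    [0.0, 0.1, 0.0,  0.0],
])

# Keep a feature if any topic gives it a non-zero weight.
keep = np.any(coef != 0, axis=0)
reduced_vocab = vocab[keep].tolist()
print(reduced_vocab)
```

The resulting list can be passed as `CountVectorizer(vocabulary=...)` to build the reduced feature matrices.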

In [None]:
def P6():
    #Keep this random seed here to make comparison easier.
    np.random.seed(0)
    
    ## STUDENT START ###
    
    # removed .0001 because they were causing the length of the reduced vocab to be zero
    c_values=[0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
    #c_values = [.0001, .001, .01, .1, 1, 2, 10]
    # Column names in table
    table = np.array(['C', 'Red. Vocab Size', 'l1 F1 Score (less vocab)', 'Vocab Size', 'l1 F1 Score'])
    # for each c value
    for c in c_values:
        vectorizer = CountVectorizer()
        matrix_train = vectorizer.fit_transform(train_data)
        matrix_dev = vectorizer.transform(dev_data)
        vocab=np.array(vectorizer.get_feature_names())
        # model with l1 penalty
        lgr_model1 = LogisticRegression(C=c, penalty="l1", tol=0.015, solver='liblinear')
        lgr_model1.fit(matrix_train, train_labels)
        # get predictions and f1score
        lgr_model1_preds = lgr_model1.predict(matrix_dev)
        lgr_model1_f1score = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model1_preds, average='weighted')
        lgr_weights1 = lgr_model1.coef_
        #print(np.sum(lgr_weights1 != 0))
    
        #count, for each feature, how many topics give it a non-zero weight
        sum_of_weights = np.sum(lgr_weights1 != 0, axis = 0)
        #get indexes of features with at least one non-zero weight
        indices_with_nonzero_weights = np.where(sum_of_weights > 0)
        reduced_vocab = np.array(vectorizer.get_feature_names())[indices_with_nonzero_weights[0]]
        #set this as the reduced vocab
        reduced_vocab = set(reduced_vocab)
        #print("sized of reduced vocab: " + str(len(reduced_vocab)))
    
        
        #vectorizer with reduced vocab
        reduced_vectorizer = CountVectorizer(vocabulary = reduced_vocab)
        reduced_matrix_train = reduced_vectorizer.fit_transform(train_data)
        reduced_matrix_dev = reduced_vectorizer.transform(dev_data)
        #model with reduced vocab using l1 penalty
        reduced_lgr_model1 = LogisticRegression(C=c, penalty="l1", tol=0.015, solver='liblinear')
        reduced_lgr_model1.fit(reduced_matrix_train, train_labels)
        reduced_lgr_model1_preds = reduced_lgr_model1.predict(reduced_matrix_dev)
        reduced_lgr_model1_f1score = metrics.f1_score(y_true = dev_labels, y_pred = reduced_lgr_model1_preds, average='weighted')
        reduced_lgr_weights1 = reduced_lgr_model1.coef_
        # create table with f1score for reduced and normal vocab for each c
        table = np.vstack((table, np.array([c, len(reduced_vocab), reduced_lgr_model1_f1score, len(vocab), lgr_model1_f1score])))
    
    
    #show the L1 regularization strength, vocabulary size, and f1 score associated with the new model
    
        
        # logistic model with penalty l2
        lgr_model2 = LogisticRegression(C =0.5, penalty="l2", tol=0.015, solver='liblinear')
        lgr_model2.fit(matrix_train, train_labels)
        lgr_model2_preds = lgr_model2.predict(matrix_dev)
        lgr_model2_f1score = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model2_preds, average='weighted')
        
    # print the table by looping through rows and columns    
    print("Table")
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            if (i != 0 and j == 2):
                print("\t\t\t" + table[i][j] + "\t", end= " ")
            else: 
                print(table[i][j]+ "\t", end= " ")
        print("\n")
    
    # plot the figure (leave the top row and only plot F1 score and Reduced Vocab sizes)
    plt.figure()
    plt.plot(np.log(table[1:, 1].astype(float)), table[1:, 2].astype(float))
    plt.xlabel('Reduced Vocabulary Size (log)')
    plt.ylabel('L1 F1 Score')
    plt.title('Plot of Reduced Vocabulary Size and L1 F1 Score at different Regularization Values')

    
    ## STUDENT END ###

P6()

ANSWER: The models based on reduced vocabularies perform close to the model based on the full vocabulary, and at certain C values they perform better.

### Part 7:

How is `TfidfVectorizer` different than `CountVectorizer`?

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.

Show the 3 documents with highest R ratio, where ...<br/>
$R\,ratio = maximum\,predicted\,probability \div predicted\,probability\,of\,correct\,label$

Explain what the R ratio describes.  What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.
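
The R ratio can be computed for all documents at once from the matrix of predicted probabilities. The sketch below uses a made-up 3 x 3 probability matrix and labels (both are assumptions, standing in for the output of `.predict_proba` and the dev labels).

```python
import numpy as np

# Toy predicted probabilities: 3 documents (rows) x 3 classes (columns).
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.25, 0.25, 0.50]])
labels = np.array([0, 2, 2])  # correct class for each document

rows = np.arange(len(labels))
# Maximum predicted probability divided by the probability of the correct label.
r_ratio = probs.max(axis=1) / probs[rows, labels]
predicted = probs.argmax(axis=1)

# Document indices ordered from highest to lowest R ratio.
worst_first = np.argsort(r_ratio)[::-1]
print(r_ratio, predicted, worst_first[0])
```

Documents 0 and 2 are classified correctly, so their ratio is 1. Document 1 is confidently assigned to the wrong class, so its ratio is about 8.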

Note:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.

In [None]:
def P7():
    ## STUDENT START ###
    #tfid vectorizers
    tfid_vectorizer = TfidfVectorizer()
    tfid_matrix_train = tfid_vectorizer.fit_transform(train_data)
    tfid_matrix_dev = tfid_vectorizer.transform(dev_data)
    tfid_vocab = np.array(tfid_vectorizer.get_feature_names())
    #logistic regression model using tfid vectorizers
    tfid_lgr_model1 = LogisticRegression(C=100, solver="liblinear", penalty="l2", multi_class="auto")
    tfid_lgr_model1.fit(tfid_matrix_train, train_labels)
    # predictions and f1 score
    tfid_lgr_model1_preds = tfid_lgr_model1.predict(tfid_matrix_dev)
    tfid_lgr_model1_f1score = metrics.f1_score(y_true = dev_labels, y_pred = tfid_lgr_model1_preds, average='weighted')
    tfid_lgr_model1_probs_preds = tfid_lgr_model1.predict_proba(tfid_matrix_dev)
    print("Using a tfid vectorizer, \n\t the f-score of the trained model is " 
          + str(tfid_lgr_model1_f1score) + "\n\t the size of the vocab is " + str(len(tfid_vocab)))
    
    
    #regular vectorizer
    vectorizer = CountVectorizer()
    matrix_train = vectorizer.fit_transform(train_data)
    matrix_dev = vectorizer.transform(dev_data)
    vocab = np.array(vectorizer.get_feature_names())
    #logistic regression model 
    lgr_model1 = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    lgr_model1.fit(matrix_train, train_labels)
    # predictions and f1 score
    lgr_model1_preds = lgr_model1.predict(matrix_dev)
    lgr_model1_f1score = metrics.f1_score(y_true = dev_labels, y_pred = lgr_model1_preds, average='weighted')
    lgr_model1_probs_preds = lgr_model1.predict_proba(matrix_dev)
    print("Using a normal vectorizer, \n\t the f-score of the trained model is " 
          + str(lgr_model1_f1score) + "\n\t the size of the vocab is " + str(len(vocab)))
    
    R = []
    # loop through the dev examples and get the R ratio for each one
    for i in range(len(dev_labels)):
        max_prob = tfid_lgr_model1_probs_preds[i].max()
        label_prob = tfid_lgr_model1_probs_preds[i][dev_labels[i]]
        R.append(max_prob / label_prob)
    # convert to R ratio array
    R = np.array(R)
    # sorted the R ratios
    sorted_R = np.argsort(R)
#     print(sorted_R)
    # get the indices of the 3 largest R ratios
    largest_R = sorted_R[-3:]
#     print(largest_R)
    #index_large_R = sorted_R[-3:][::-1]
    #print(np.argsort(lgr_model1_probs_preds[largest_R[2]]))
    
    # loop through the documents with the largest R ratios
    for r in largest_R:
        # predicted label is the class with the highest probability under the tf-idf model
        predicted = np.argmax(tfid_lgr_model1_probs_preds[r])
        print("Ratio R is", R[r])
        print("Correct Label:", newsgroups_train.target_names[dev_labels[r]])
        print("Predicted Label:", newsgroups_train.target_names[predicted])
        print("Document: \n\t" + str(dev_data[r]) + "\n\n")


    ## STUDENT END ###

P7()

ANSWER:

CountVectorizer creates a matrix where each cell holds the number of times a feature appears in a document.  TfidfVectorizer creates a matrix where each cell holds the tf-idf weight of a feature in a document.
Tf-idf = Term Frequency x Inverse Document Frequency.
Term Frequency = number of times a word appears in a document / total number of words in the document  --- measures how frequent a word is within a document 
Inverse Document Frequency = log_10(total number of documents / number of documents in which the word appears).  --- measures how informative a word is. The log dampens the effect of a very large number of documents.

A high R ratio shows that the model predicted a label with a high maximum probability while giving the correct label a low probability. An R ratio of 1 means the prediction was correct.

One thing that could be going wrong is that there is a high number of similar words across multiple documents, which causes the model to choose the incorrect label. These words are probably superfluous words such as "is, are, there." One way to possibly improve the model and reduce this issue is to have a stop word list. 
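
Note that scikit-learn's `TfidfVectorizer` does not use exactly the textbook formula. As far as I understand its documentation, with default settings (`smooth_idf=True`, `norm='l2'`) it uses the raw count as term frequency, computes idf as ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit length. The sketch below reproduces that calculation by hand on two toy documents. The documents and the `tfidf` helper are assumptions for illustration.

```python
import math
from collections import Counter

# Two toy documents, already tokenized (assumed for illustration).
docs = [["space", "space", "the"], ["graphics", "the"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw count x smoothed idf, followed by L2 normalization."""
    counts = Counter(doc)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

In the second document both words occur once, but "graphics" gets the larger weight because "the" appears in every document.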



### Part 8 EXTRA CREDIT:

Produce a Logistic Regression model to implement your suggestion from Part 7.

In [None]:
def P8():
    #pulled this list of random words from part 4
    tfid_vectorizer = TfidfVectorizer(stop_words=['is', 'there', 'are', 'it', 'was', 'in', 'for', 'that'])

    tfid_matrix_train = tfid_vectorizer.fit_transform(train_data)
    tfid_matrix_dev = tfid_vectorizer.transform(dev_data)
    
    tfid_lgr_model1 = LogisticRegression(C=100, solver="liblinear", penalty="l2", multi_class="auto")
    tfid_lgr_model1.fit(tfid_matrix_train, train_labels)
    # predictions and f1 score
    tfid_lgr_model1_preds = tfid_lgr_model1.predict(tfid_matrix_dev)
    tfid_lgr_model1_f1score = metrics.f1_score(y_true = dev_labels, y_pred = tfid_lgr_model1_preds, average='weighted')
    print("Using a tfid vectorizer with stop words, \n\t the f-score of the trained model is " + str(tfid_lgr_model1_f1score))
    
    
P8()