# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topic, inferred from the text. In digit classification, each input is relatively dense (a 28x28 matrix of pixels, many of which are non-zero). Here, each document is relatively sparse (represented as a bag-of-words): only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the count of words, not their order.
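
To make the bag-of-words assumption concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` is a hypothetical name used for illustration and is not part of the project code. It shows that two texts containing the same words in a different order map to the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts, discarding word order (illustrative helper)."""
    return Counter(text.lower().split())

a = bag_of_words("the shuttle reached orbit")
b = bag_of_words("orbit reached the shuttle")
print(a == b)  # the two orderings produce identical counts
```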

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [None]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         newsgroups_train.target_names)

### Part 1:

For each of the first 5 training examples, print the text of the message along with the label.

In [None]:
def P1(num_examples=5):
    ### STUDENT START ###
    # Print the text and the label of each of the first num_examples training posts.
    for i in range(num_examples):
        print("This is data example " + str(i + 1) + ": \n" + train_data[i])
        print("This is label example " + str(i + 1) + ": " + str(train_labels[i]))
    ### STUDENT END ###

P1(5)


### Part 2:

Transform the training data into a matrix of **word** unigram feature vectors.  What is the size of the vocabulary? What is the average number of non-zero features per example?  What is the fraction of the non-zero entries in the matrix?  What are the 0th and last feature strings (in alphabetical order)?<br/>
_Use `CountVectorizer` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method (renamed `.get_feature_names_out` in newer scikit-learn releases)._

Now transform the training data into a matrix of **word** unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. What is the average number of non-zero features per example?<br/>
_Use `CountVectorizer(vocabulary=...)` and its `.transform` method._

Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(analyzer=..., ngram_range=...)` and its `.fit_transform` method._

Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(min_df=...)` and its `.fit_transform` method._

Now again transform the training data into a matrix of **word** unigram feature vectors. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?<br/>
_Hint: Build vocabularies for both train and dev and look at the size of the difference._
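
The hint above amounts to a set difference. A minimal sketch, using made-up toy vocabularies rather than the newsgroup data (the variable names are illustrative only):

```python
# Toy vocabularies (made-up words, not taken from the newsgroup data).
train_vocab = {"space", "orbit", "nasa"}
dev_vocab = {"space", "nasa", "pixel", "render"}

# Words seen in dev but never in training.
missing = dev_vocab - train_vocab
fraction_missing = len(missing) / len(dev_vocab)
print(sorted(missing), fraction_missing)
```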

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  See about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.  
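
As a small illustration of the notes above, the sketch below runs `CountVectorizer` on a three-document toy corpus (invented for this example) and reads the vocabulary size and sparsity statistics off the resulting sparse matrix. It uses the fitted `vocabulary_` attribute to list the words, which should be available regardless of scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for illustration, not from the newsgroup data).
docs = ["space shuttle launch", "graphics card driver", "space graphics"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # pass 1 builds the vocabulary, pass 2 counts

n_docs, vocab_size = X.shape
avg_nonzero = X.nnz / n_docs             # non-zero features per document
density = X.nnz / (n_docs * vocab_size)  # fraction of non-zero entries

print(sorted(vectorizer.vocabulary_))    # vocabulary_ maps word -> column index
print(vocab_size, avg_nonzero, density)
```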

In [None]:
def P2(x):
    ### STUDENT START ###
    #First part of the problem to simply run the two attributes and method
    vectorizer = CountVectorizer()
    v_array = vectorizer.fit_transform(x)
    print("Part 1: Transform the training data into a matrix of word unigram feature vectors")
    print("The size of the vocabulary is: " + str(v_array.shape[1]))
    print("The average number of non-zero features per example is: " + str(v_array.getnnz(axis=1).mean()))
    print("The fraction of the non-zero entries in the matrix are: " + str(v_array.nnz / (v_array.shape[0] * v_array.shape[1])))
    print("The 0th and the last feature strings are: " + str(vectorizer.get_feature_names()[0]) + " and " + str(vectorizer.get_feature_names()[-1]))

    #Transform the training data into a matrix of word unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"]
    vectorizer2 = CountVectorizer(vocabulary=["atheism", "graphics", "space", "religion"])
    vocab_array = vectorizer2.transform(x)
    print("\nPart 2: Transform the training data into a matrix of word unigram feature vectors using your own vocabulary with these 4 words: [atheism, graphics, space, religion]")
    #Confirm the vocabulary size, as the question asks
    print("The size of the vocabulary is: " + str(vocab_array.shape[1]))
    print("The average number of non-zero features per example when vocabulary is atheism, graphics, space and religion is: " + str(vocab_array.getnnz(axis=1).mean()))

    #Transform the training data into a matrix of character bigram and trigram 
    vectorizer3 = CountVectorizer(analyzer="char", ngram_range=(2,3))
    v_bitrigram = vectorizer3.fit_transform(x)
    print("\nPart 3: Transform the training data into a matrix of character bigram and trigram")
    print("The size of the vocabulary for a bigram and trigram ngram_range is: " + str(v_bitrigram.shape[1]))  

    #Transform the training data into a matrix of word unigram feature vectors and prune words that appear in fewer than 10 documents
    vectorizer4 = CountVectorizer(min_df=10)
    v_pruned = vectorizer4.fit_transform(x)
    print("\nPart 4: Transform the training data into a matrix of word unigram feature vectors and prune words that appear in fewer than 10 documents")
    print("The size of the vocabulary for a word unigram where words appearing in fewer than 10 documents are pruned is: " + str(v_pruned.shape[1]))  

    #Build a vocabulary from the dev data and compare it with the training vocabulary
    vectorizer5 = CountVectorizer()
    v_dev = vectorizer5.fit_transform(dev_data)
    #vocabulary_ maps each word to its column index, so its keys are the vocabulary
    train_vocab = set(vectorizer.vocabulary_)
    dev_vocab = set(vectorizer5.vocabulary_)
    missing_vocab = dev_vocab - train_vocab
    print("\nPart 5: Transform the training data and dev data into a matrix of word unigram feature vectors")
    print("The size of the vocabulary for the dev data is: " + str(v_dev.shape[1]))
    print("The size of the vocabulary for the training data is: " + str(v_array.shape[1]))
    print("The fraction of words in the development vocabulary that is missing from the training vocabulary is: " + str(len(missing_vocab) / len(dev_vocab)))


    ### STUDENT END ###

P2(train_data)


### Part 3:

Transform the training and development data to matrices of word unigram feature vectors.

1. Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score.
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.

* Why doesn't k-Nearest Neighbors work well for this problem?
* Why doesn't Logistic Regression work as well as Naive Bayes does?
* What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
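
To clarify what `average="weighted"` means, here is a plain-Python sketch of a support-weighted f1 score. The function `weighted_f1` is a hypothetical re-implementation written for illustration. It is intended to mirror the weighted average, but the project itself should keep using `metrics.f1_score`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative re-implementation)."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of this class
        total += support * f1
    return total / len(y_true)

print(weighted_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Each class contributes in proportion to how many true examples it has, so a frequent topic influences the score more than a rare one.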

In [None]:
def P3(k,alph,c):
    ### STUDENT START ###
    vectorizer_3 = CountVectorizer()
    v_train = vectorizer_3.fit_transform(train_data)
    v_dev = vectorizer_3.transform(dev_data)

    print("Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score. For each model, show the k value and f1 score.")
    z = 0
    zz = 1
    for i in range(1,k):
      kn_model = KNeighborsClassifier(n_neighbors=i)
      kn_model.fit(v_train,train_labels)
      kn_predict = kn_model.predict(v_dev)
      #Compute the f1 score once and reuse it
      kn_f1 = metrics.f1_score(dev_labels,kn_predict,average='weighted')
      print("For k = " + str(i) + " the f1-score is " + str(kn_f1))
      if kn_f1 > z:
        z = kn_f1
        zz = i
    print("For KNeighbors the best k = " + str(zz) + " with an f1-score of " + str(z))
    
    print("\nProduce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score. For each model, show the alpha value and f1 score.")
    a = 0
    aa = 1
    for i in range(0,len(alph)):
      clf = BernoulliNB(alpha=alph[i])
      clf.fit(v_train,train_labels)
      clf_predict = clf.predict(v_dev)
      print("For alpha = " + str(alph[i]) + " the f1-score is " + str(metrics.f1_score(dev_labels,clf_predict,average='weighted')))
      if metrics.f1_score(dev_labels,clf_predict,average='weighted') > a:
        a = metrics.f1_score(dev_labels,clf_predict,average='weighted')
        aa = alph[i]
    print("For BernoulliNB the best alpha = " + str(aa) + " with an f1-score of " + str(a))

    print("\nProduce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score")
    b = 0
    bb = 1
    bbb = None
    for i in range(0,len(c)):
      lg_reg = LogisticRegression(C=c[i], solver="liblinear", multi_class="auto")
      lg_reg.fit(v_train,train_labels)
      lg_reg_predict = lg_reg.predict(v_dev)
      #Compute the f1 score once and reuse it
      lg_f1 = metrics.f1_score(dev_labels,lg_reg_predict,average='weighted')
      #coef_ has one row per topic, so summing over axis=1 gives one value per topic
      sq_weights = np.sum(lg_reg.coef_**2, axis=1)
      print("For Logistic Regression when C = " + str(c[i]) + " the f1-score is " + str(lg_f1) + " the sum of the squared weights for each topic is: " + str(sq_weights))

      if lg_f1 > b:
        b = lg_f1
        bb = c[i]
        bbb = sq_weights
    print("For Logistic Regression model the best C = " + str(bb) + " with an f1-score of " + str(b) + " the sum of the squared weights for each topic is: " + str(bbb))

    ### STUDENT END ###



max_k = 113
alphas = [0,0.00001,0.0001,0.001,0.01,0.1,1,2]
c = [0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]
P3(max_k,alphas,c)

ANSWER:
Why doesn't k-Nearest Neighbors work well for this problem?

> k-Nearest Neighbors isn't well suited to this problem because of the variation in vocabulary and the high frequency of words that are irrelevant to meaning. The word-count vectors are sparse and high-dimensional, so they do not form clear clusters, and distances between documents say little about topic.

Why doesn't Logistic Regression work as well as Naive Bayes does?

> Naive Bayes is good at handling the larger feature set with binary markers.  It can then derive a clearer pattern in classification.  Logistic Regression is taking in many features with low relevance.  It would likely be better to reduce dimensions or limit the vocabulary to get a better logistic regression result. Even then it may not outperform Naive Bayes.

What is the relationship between logistic regression's sum of squared weights vs. C value?

> As the C value increases, the sum of the squared weights increases rapidly.  C is the inverse of the regularization strength, so a larger C penalizes large weights less.  I believe this means that the risk of overfitting is increasing rapidly as well, making it a possibly poor model for generalization even if the predictions are strong.

### Part 4:

Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.  For each topic, find the 5 features with the largest weights (that's 20 features in total).  Show a 20 row (features) x 4 column (topics) table of the weights.

Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 
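
The `np.argsort` note can be sketched on a tiny, made-up weight matrix. The values in `coef` below are hypothetical and stand in for a fitted model's `.coef_`. The sketch picks the top `k` features per topic and then builds a features-by-topics table of weights.

```python
import numpy as np

# Hypothetical weight matrix: 2 topics (rows) x 4 features (columns).
coef = np.array([[0.1, 0.9, -0.3, 0.5],
                 [0.7, -0.2, 0.4, 0.0]])

k = 2
# argsort sorts ascending, so the last k columns hold the largest weights;
# reversing puts the largest first.
top = np.argsort(coef, axis=1)[:, -k:][:, ::-1]

# Rows = selected features (k per topic), columns = topics.
table = coef[:, top.ravel()].T
print(top)
print(table)
```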

In [None]:
def P4():
    ### STUDENT START ###
    vectorizer_4 = CountVectorizer(analyzer="word", ngram_range=(2,2))
    train_vbigram = vectorizer_4.fit_transform(train_data)
    train_words = vectorizer_4.get_feature_names()

    model = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    model.fit(train_vbigram,train_labels)

    #Sort the array and then get the values from the indices
    #print(model.coef_[1][0])
    sorted_coef = np.argsort(model.coef_)[:,-5:]
    top_values = []
    for label in range(sorted_coef.shape[0]):
      for col in range(sorted_coef.shape[1]):
        top_values.append(model.coef_[:,sorted_coef[label][col]])

    top_values = np.array(top_values)
    top_values = np.around(top_values,decimals=3)

    #Get the feature list
    top_features = []
    for i in range(sorted_coef.shape[0]):
      for j in range(sorted_coef.shape[1]):
        top_features.append(train_words[sorted_coef[i][j]])
    #print(len(top_features))
    #print(top_features[:20])

    #Create table with pyplot
    columns = ('Atheism', 'Graphics', 'Space', 'Religion')

    fig, axs =plt.subplots()
    axs.set_axis_off() 
    table = axs.table( 
        cellText = top_values,  
        rowLabels = top_features,   
        colLabels = columns, 
        rowColours =["palegreen"] * len(top_features),  
        colColours =["palegreen"] * len(columns), 
        cellLoc ='center',  
        loc ='upper left')     
    
    plt.show()

    ### STUDENT END ###

P4()
#print(train_data)

ANSWER: I find it surprising how generic these bigrams are. They really don't relate to the topics for which they have a high score.

### Part 5:

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

Produce a Logistic Regression model (with no preprocessing of text).  Evaluate and show its f1 score and size of the dictionary.

Produce an improved Logistic Regression model by preprocessing the text.  Evaluate and show its f1 score and size of the vocabulary.  Try for an improvement in f1 score of at least 0.02.

How much did the improved model reduce the vocabulary size?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.
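
The preprocessing ideas listed at the start of this part can be sketched with `re.sub`. This is one possible sketch, not a recommended solution: the function name `sketch_preprocessor`, the `max_word_len` parameter, and the placeholder token `numtoken` are all assumptions made for illustration.

```python
import re

def sketch_preprocessor(text, max_word_len=8):
    """Illustrative preprocessor: lower-case, collapse numbers, strip symbols, shorten words."""
    text = text.lower()
    text = re.sub(r"\d+", " numtoken ", text)   # each run of digits becomes one token
    text = re.sub(r"[^a-z\s]", " ", text)       # drop non-letter characters
    # Keep only the first max_word_len characters of longer words.
    text = re.sub(r"\b(\w{%d})\w+" % max_word_len, r"\1", text)
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(sketch_preprocessor("Launched 1999 rockets, Internationally!!"))
```

A function like this could be passed as `CountVectorizer(preprocessor=...)`. Whether any of these steps improves the f1 score on this data has to be checked empirically.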

In [None]:
def better_preprocessor(s):
    ### STUDENT START ###
    s = s.lower()
    ps = nltk.PorterStemmer()


    s = re.sub('\n', ' ', s)
    #s = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    s = re.sub(r'[^\w\s]', ' ', s)


    #s = ps.stem(s)

    #s = re.sub('to be ', ' ', s)
    #s = re.sub('was just ', ' ', s)
    #s = re.sub('cheers kent ', ' ', s)
    #s = re.sub('is there ', ' ', s)
    #s = re.sub('and such ', ' ', s)

    #Manual stemming
    s = re.sub('s ', ' ', s)
    s = re.sub('ing ', ' ', s)
    s = re.sub('ment ', 'e ', s)


    #s = re.sub('the ', ' ', s)
    s = re.sub('and ', ' ', s)
    #s = re.sub('just ', ' ', s)
    s = re.sub('but ', ' ', s)
    s = re.sub('no ', ' ', s)
    s = re.sub('email ', ' ', s)
    s = re.sub('internet ', ' ', s)
    s = re.sub('url ', ' ', s)
    s = re.sub('child ', ' ', s)
    #s = re.sub('book ', ' ', s)
    
    #s = re.sub('to ', ' ', s)
    #s = re.sub('i ', ' ', s)
    #Collapse runs of whitespace; the raw string avoids an invalid escape sequence warning
    s = re.sub(r'\s+', ' ', s)


    
    return s
    ### STUDENT END ###

def P5():
    ### STUDENT START ###
    #Transform the data without preprocessing
    vectorizer_5 = CountVectorizer()
    v_train = vectorizer_5.fit_transform(train_data)
    v_dev = vectorizer_5.transform(dev_data)

    #Run the Logistic Regression Model
    model = LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")
    model.fit(v_train,train_labels)
    predict_model_5 = model.predict(v_dev)
    print("Without preprocessing the size of the vocabulary is: " + str(v_train.shape[1]))
    print("Without preprocessing the f1-score is: " + str(round(metrics.f1_score(dev_labels,predict_model_5,average='weighted'),3)))

    #Transform the data with preprocessing
    vectorizer_5b = CountVectorizer(preprocessor=better_preprocessor)
    vb_train = vectorizer_5b.fit_transform(train_data)
    vb_dev = vectorizer_5b.transform(dev_data)
    model.fit(vb_train, train_labels)
    predict_model_5b = model.predict(vb_dev)
    print("With preprocessing the size of the vocabulary is: " + str(vb_train.shape[1]))
    print("With preprocessing the f1-score is: " + str(round(metrics.f1_score(dev_labels,predict_model_5b,average='weighted'),3)))

    ### STUDENT END ###

P5()

### Part 6:

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, L1 regularization drives many of the weights to 0, effectively removing unimportant features.
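
For reference, the two penalized objectives can be written out. This follows the form given in the scikit-learn user guide for binary logistic regression with labels $y_i \in \{-1, 1\}$. The symbols $w$ (weights), $b$ (intercept), and $x_i$ (feature vector of example $i$) are notation introduced here, and with the `liblinear` solver the multi-class case fits one such problem per topic.

$$\text{L2:}\quad \min_{w,\,b}\; \frac{1}{2}\sum_j w_j^2 \;+\; C\sum_{i=1}^{n}\log\left(1 + e^{-y_i\,(x_i^\top w + b)}\right)$$

$$\text{L1:}\quad \min_{w,\,b}\; \sum_j |w_j| \;+\; C\sum_{i=1}^{n}\log\left(1 + e^{-y_i\,(x_i^\top w + b)}\right)$$

Because `C` multiplies the data-fit term rather than the penalty, a smaller `C` means stronger regularization.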

For several L1 regularization strengths ...<br/>
* Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.  Produce a new Logistic Regression model using the reduced vocabulary and **L2** regularization strength of 0.5.  Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.

Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.

How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).
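
The vocabulary-reduction step can be sketched with a boolean mask over the columns of the weight matrix. The arrays below are made-up stand-ins: `coef` plays the role of an L1-fitted model's `.coef_`, and `X` plays the role of a small dense feature matrix.

```python
import numpy as np

# Hypothetical L1-fitted weights: 2 topics x 4 features, mostly zero.
coef = np.array([[0.0, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.0, -1.2]])
X = np.arange(12).reshape(3, 4)  # toy feature matrix: 3 documents x 4 features

keep = np.any(coef != 0, axis=0)  # True where any topic has a non-zero weight
X_reduced = X[:, keep]            # keep only the surviving feature columns
print(keep, X_reduced.shape)
```

An alternative, used in the solution below, is to collect the surviving feature names and pass them to `CountVectorizer(vocabulary=...)`.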

In [None]:
def P6(c):
    # Keep this random seed here to make comparison easier.
    # np.random.seed() returns None, so store the seed value separately.
    seed = 0
    np.random.seed(seed)
    
    ### STUDENT START ###
    #Vectorizer the data
    vectorizer_6 = CountVectorizer()
    v_train = vectorizer_6.fit_transform(train_data)
    v_dev = vectorizer_6.transform(dev_data)

    #Run the Logistic Regressions with L1 regularization
    best_c = 0
    best_f1 = 0
    best_vocab_size = 0
    vocab_size_array = []
    l2_f1_scores = []
    for i in c:
      model_l1 = LogisticRegression(solver='liblinear',penalty="l1", C=i, tol=0.015, random_state=seed)
      model_l1.fit(v_train,train_labels)
      predict_model_l1 = model_l1.predict(v_dev)
      f1_score = metrics.f1_score(dev_labels,predict_model_l1,average='weighted')
      vocab_size_array.append(np.count_nonzero(np.count_nonzero(model_l1.coef_,axis=0)))
      print("\nFor C = " + str(i) + " the f1-score is " + str(f1_score))
      if f1_score > best_f1:
        best_f1 = f1_score
        best_c = i
        best_vocab_size = np.count_nonzero(np.count_nonzero(model_l1.coef_,axis=0))
      #Reduce the vocabulary to the features with a non-zero L1 weight for at least one topic; this has to be redone for each C value
      #Fetch the feature names once per C value rather than once per feature
      feature_names = vectorizer_6.get_feature_names()
      reduced_vocab = []
      counter = 0
      #print(np.count_nonzero(np.count_nonzero(model_l1.coef_, axis=0)))
      for j in np.count_nonzero(model_l1.coef_,axis=0):
        if j > 0:
          reduced_vocab.append(feature_names[counter])
        counter += 1
      #print(len(reduced_vocab))

      #Run Logistic Regression with L2 regularization and the reduced vocabulary iteratively to capture f1 scores
      vectorizer_6b = CountVectorizer(vocabulary=reduced_vocab)
      vb_train = vectorizer_6b.fit_transform(train_data)
      vb_dev = vectorizer_6b.transform(dev_data)


      model_l2 = LogisticRegression(solver='liblinear', penalty='l2',C=0.5,tol=0.015, random_state=seed)
      model_l2.fit(vb_train,train_labels)
      predict_model_l2 = model_l2.predict(vb_dev)
      l2_f1_scores.append(round(metrics.f1_score(dev_labels,predict_model_l2,average='weighted'),3))
      print("The L1 regularization strength was: ",i)
      print("The vocabulary size was: ",len(reduced_vocab))
      print("The L2 f1-score is: ",metrics.f1_score(dev_labels,predict_model_l2,average='weighted'))
    
      reduced_vocab.clear()
      counter = 0

    #print(l2_f1_scores)
    #print(vocab_size_array)
    
    log_vocab_array = np.log(vocab_size_array)
    
    #print(log_vocab_array)

    plt.plot(log_vocab_array,l2_f1_scores,color='blue')
    plt.scatter(log_vocab_array,l2_f1_scores,color='blue',marker='o')
    #These are functions: call them instead of assigning to them
    plt.ylabel('F1 Scores')
    plt.xlabel('Log Vocab Size')
    plt.title('F1 Scores for L2 Regularization vs Log of L1 Reduced Vocab Size')

    plt.show()
    


    ### STUDENT END ###

c = [0.3,0.4,0.5,0.6,0.7,0.8,0.9]
P6(c)

ANSWER: The original model with the full vocabulary was run in Part 5.  It had an f1 score of 0.708, which is higher than all of the scores in this run.  It's surprising because I'd assume dropping zero-weight features would result in the same if not improved performance, but that doesn't seem to be the case.

### Part 7:

How is `TfidfVectorizer` different than `CountVectorizer`?

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.
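
As background for the tf-idf representation, the sketch below computes tf-idf weights by hand for a toy count matrix. It assumes the smoothed idf formula and L2 row normalization that, to my understanding, are `TfidfVectorizer`'s defaults. The matrix `counts` is invented for illustration.

```python
import numpy as np

# Toy count matrix: 3 documents x 3 terms; term 0 appears in every document.
counts = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = np.count_nonzero(counts, axis=0)       # document frequency per term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (assumed default)
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(np.round(tfidf, 3))
```

In the first document both terms occur once, but the rarer term ends up with the larger weight, which is the effect that raw counts cannot express.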

Show the 3 documents with highest R ratio, where ...<br/>
$R\,ratio = maximum\,predicted\,probability \div predicted\,probability\,of\,correct\,label$

Explain what the R ratio describes.  What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

Note:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.
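
The R ratio defined above can be sketched on a small, made-up probability matrix. Here `probs` stands in for the output of `.predict_proba` and `labels` for the true labels; both are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical predicted probabilities: 3 documents x 3 topics.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.125, 0.75, 0.125],
                  [0.25, 0.25, 0.5]])
labels = np.array([0, 2, 1])  # true topic index for each document

p_max = probs.max(axis=1)                       # highest probability per document
p_true = probs[np.arange(len(labels)), labels]  # probability given to the true topic
r_ratio = p_max / p_true

worst = np.argsort(r_ratio)[::-1][:2]  # indices of the 2 largest R ratios
print(r_ratio, worst)
```

The maximum is taken per document (`axis=1`), so a correctly classified document has a ratio of exactly 1 and larger values mark confident mistakes.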

In [None]:
def P7():
    ### STUDENT START ###
    #Start with Vectorizing data using both methods
    c_vectorizer = CountVectorizer()
    c_train_data = c_vectorizer.fit_transform(train_data)
    c_dev_data = c_vectorizer.transform(dev_data)
    t_vectorizer = TfidfVectorizer()
    t_train_data = t_vectorizer.fit_transform(train_data)
    t_dev_data = t_vectorizer.transform(dev_data)

    #Setup the regression
    c_model = LogisticRegression(C=100,solver='liblinear',multi_class='auto')
    c_model.fit(c_train_data,train_labels)
    predict_c_model = c_model.predict(c_dev_data)
    print('The CountVectorizer based model has an f1-score of: ',metrics.f1_score(dev_labels,predict_c_model,average='weighted'))
    t_model = LogisticRegression(C=100,solver='liblinear',multi_class='auto')
    t_model.fit(t_train_data,train_labels)
    predict_t_model = t_model.predict(t_dev_data)
    print('The TfidfVectorizer based model has an f1-score of: ',metrics.f1_score(dev_labels,predict_t_model,average='weighted'))

    #Calculate the R ratio and select the 3 highest
    t_model_probs = t_model.predict_proba(t_dev_data)
    #print(t_model_probs)
    #Use each document's own maximum probability, not the maximum over the whole matrix
    max_t_model_probs = np.max(t_model_probs, axis=1)
    r_ratios = []
    for i in range(len(dev_labels)):
      r_ratios.append(max_t_model_probs[i] / t_model_probs[i,dev_labels[i]])

    r_array = np.array(r_ratios)
    r_max_ind = np.argsort(r_array)[-3:]

    r_3_max = []
    for r in r_max_ind:
      r_3_max.append(r_ratios[r])
    
    #Print the examples
    print('The three highest R ratios are: ', r_3_max)

    counter = 1
    for r in r_max_ind:
      print("\nExample ",counter)
      print('The correct label was ',newsgroups_train.target_names[dev_labels[r]], 'with a probability of ',round(t_model_probs[r,dev_labels[r]],3))
      print('The predicted label was ', newsgroups_train.target_names[predict_t_model[r]], 'with a probability of ',round(np.max(t_model_probs,axis=1)[r],3))
      print('\n',dev_data[r])
      counter += 1


    ### STUDENT END ###

P7()

ANSWER:
How is TfidfVectorizer different than CountVectorizer?
>TfidfVectorizer weights each term count by its inverse document frequency and normalizes each row, so it returns a normalized float matrix, whereas CountVectorizer returns raw integer counts.

How does the TfidfVectorizer f1 score differ from the CountVectorizer?
>Surprisingly, the TfidfVectorizer returns an f1 score that is nearly 0.08 better than the CountVectorizer.

Explain what the R ratio describes. What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.
>The R ratio describes how many times larger the highest predicted probability is than the probability assigned to the true label. A value of exactly 1 means the prediction was correct. For a misclassified post, the closer the ratio is to 1, the closer the decision was between the two labels.

>The model seems to be making mistakes often on short posts where there may not be enough data to differentiate correctly, but also when there is vocabulary that likely represents the predicted label.  For instance, *ftp* in the religion post likely led the classifier to choose the graphics label.

>One way to address this would be to use better stop words and to tokenize key phrases. For instance, Book of Mormon and other key phrases could each be treated as a single token.  Stop words would include words that are likely too general to be specific to a topic, like email.




### Part 8 EXTRA CREDIT:

Produce a Logistic Regression model to implement your suggestion from Part 7.