<a href="https://colab.research.google.com/github/nugzar/mics-w207/blob/master/Nugzar_Nebieridze_p2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup postings on a variety of topics. You'll train classifiers to distinguish between the topics based on the text of the posts. Whereas with digit classification, the input is relatively dense: a 28x28 matrix of pixels, many of which are non-zero, here we'll represent each document with a "bag-of-words" model. As you'll see, this makes the feature representation quite sparse -- only a few words of the total vocabulary are active in any given document. The bag-of-words assumption here is that the label depends only on the words; their order is not important.

The SK-learn documentation on feature extraction will prove useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up and write your own code.

In [0]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

In [0]:
import nltk

Load the data, stripping out metadata so that we learn classifiers that only use textual features. By default, newsgroups data is split into train and test sets. We further split the test so we have a dev set. Note that we specify 4 categories to use for this project. If you remove the categories argument from the fetch function, you'll get all 20 categories.

In [0]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('test label shape:', test_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', newsgroups_train.target_names)

In [0]:
newsgroups_train.target_names

(1) For each of the first 5 training examples, print the text of the message along with the label.

In [0]:
def P1(num_examples=5):

  ### STUDENT START ###
  #print (np.column_stack((train_labels[:5], train_data[:5])))
  #print (dict(zip(train_labels[:5], train_data[:5])))
  
  for i in range(5):
    print(dict(label=newsgroups_train.target_names[train_labels[i]], text=train_data[i]))

  ### STUDENT END ###
  
P1()

(2) Use CountVectorizer to turn the raw training text into feature vectors. You should use the fit_transform function, which makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").

The vectorizer has a lot of options. To get familiar with some of them, write code to answer these questions:

a. The output of the transform (also of fit_transform) is a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html. What is the size of the vocabulary? What is the average number of non-zero features per example? What fraction of the entries in the matrix are non-zero? Hint: use "nnz" and "shape" attributes.

b. What are the 0th and last feature strings (in alphabetical order)? Hint: use the vectorizer's get_feature_names function.

c. Specify your own vocabulary with 4 words: ["atheism", "graphics", "space", "religion"]. Confirm the training vectors are appropriately shaped. Now what's the average number of non-zero features per example?

d. Instead of extracting unigram word features, use "analyzer" and "ngram_range" to extract bigram and trigram character features. What size vocabulary does this yield?

e. Use the "min_df" argument to prune words that appear in fewer than 10 documents. What size vocabulary does this yield?

f. Using the standard CountVectorizer, what fraction of the words in the dev data are missing from the vocabulary? Hint: build a vocabulary for both train and dev and look at the size of the difference.

In [0]:
def P2():
  ### STUDENT START ###
  
  vectorizer = CountVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  print ("Question a)")
  print (" - vocabulary length:", len(vectorizer.vocabulary_))
  print (" - matrix shape:", vtrain.shape)
  print (" - non-zero features per example:", vtrain.nnz / float(vtrain.shape[0]))
  print (" - non-zero features per matrix: %.2f%%" % (vtrain.nnz / float(vtrain.shape[0] * vtrain.shape[1]) * 100))

  print ("\nQuestion b)")
  print (" - 0th feature string:", vectorizer.get_feature_names()[0])
  print (" - last feature string:", vectorizer.get_feature_names()[-1])
  
  # Specifying our own vicabulary of words
  vectorizer = CountVectorizer(vocabulary=["atheism", "graphics", "space", "religion"])
  vtrain = vectorizer.fit_transform(train_data)
  print ("\nQuestion c)")
  print (" - vocabulary length:", len(vectorizer.vocabulary_))
  print (" - matrix shape:", vtrain.shape)
  print (" - non-zero features per example:", vtrain.nnz / float(vtrain.shape[0]))
  print (" - non-zero features per matrix: %.2f%%" % (vtrain.nnz / float(vtrain.shape[0] * vtrain.shape[1]) * 100))

  bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
  bigram_vtrain = bigram_vectorizer.fit_transform(train_data)
  print ("\nQuestion d)")
  print (" - (bigrams) vocabulary length:", len(bigram_vectorizer.vocabulary_))
  print (" - (bigrams) matrix shape:", bigram_vtrain.shape)
  print (" - (bigrams) non-zero features per example:", bigram_vtrain.nnz / float(bigram_vtrain.shape[0]))
  print (" - (bigrams) non-zero features per matrix: %.2f%%" % (bigram_vtrain.nnz / float(bigram_vtrain.shape[0] * bigram_vtrain.shape[1]) * 100))
  analyze = bigram_vectorizer.build_analyzer()
  print (" - (bigrams) extracting just 5 bigrams using analyzer:", analyze(train_data[0])[:5])  
  
  trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))
  trigram_vtrain = trigram_vectorizer.fit_transform(train_data)
  print ("")
  print (" - (trigrams) vocabulary length:", len(trigram_vectorizer.vocabulary_))
  print (" - (trigrams) matrix shape:", trigram_vtrain.shape)
  print (" - (trigrams) non-zero features per example:", trigram_vtrain.nnz / float(trigram_vtrain.shape[0]))
  print (" - (trigrams) non-zero features per matrix: %.2f%%" % (trigram_vtrain.nnz / float(trigram_vtrain.shape[0] * trigram_vtrain.shape[1]) * 100))
  analyze = trigram_vectorizer.build_analyzer()
  print (" - (trigrams) extracting just 5 trigrams using analyzer:", analyze(train_data[0])[:5])  

  min_df_10_vectorizer = CountVectorizer(min_df=10)
  min_df_10_vtrain = min_df_10_vectorizer.fit_transform(train_data)
  print ("\nQuestion e)")
  print (" - (min_df=10) vocabulary length:", len(min_df_10_vectorizer.vocabulary_))
  print (" - (min_df=10) matrix shape:", min_df_10_vtrain.shape)
  print (" - (min_df=10) non-zero features per example:", min_df_10_vtrain.nnz / float(min_df_10_vtrain.shape[0]))
  print (" - (min_df=10) non-zero features per matrix: %.2f%%" % (min_df_10_vtrain.nnz / float(min_df_10_vtrain.shape[0] * min_df_10_vtrain.shape[1]) * 100))
  
  train_vectorizer = CountVectorizer()
  vtrain = train_vectorizer.fit_transform(train_data)

  dev_vectorizer = CountVectorizer()
  vdev = dev_vectorizer.fit_transform(dev_data)
  print ("\nQuestion f)")
  print (" - train vocabulary length:", len(train_vectorizer.vocabulary_))
  print (" - dev vocabulary length:", len(dev_vectorizer.vocabulary_))
  print (" - not in train data:", len(list(set(dev_vectorizer.vocabulary_) - set(train_vectorizer.vocabulary_))))
  print (" - not in dev data:", len(list(set(train_vectorizer.vocabulary_) - set(dev_vectorizer.vocabulary_))))
  print (" - difference in lenghts:", len(train_vectorizer.vocabulary_) - len(dev_vectorizer.vocabulary_))

  ### STUDENT END ###

P2()

(3) Use the default CountVectorizer options and report the f1 score (use metrics.f1_score) for a k nearest neighbors classifier; find the optimal value for k. Also fit a Multinomial Naive Bayes model and find the optimal value for alpha. Finally, fit a logistic regression model and find the optimal value for the regularization strength C using l2 regularization. A few questions:

a. Why doesn't nearest neighbors work well for this problem?

b. Any ideas why logistic regression doesn't work as well as Naive Bayes?

c. Logistic regression estimates a weight vector for each class, which you can access with the coef\_ attribute. Output the sum of the squared weight values for each class for each setting of the C parameter. Briefly explain the relationship between the sum and the value of C.

In [0]:
def P3():
  ### STUDENT START ###
  
  vectorizer = CountVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)
  
  f1_s = {}
  expected = dev_labels
  
  # Tried different ranges. After k=30 the f1 score drops
  #for k in range(1,300):
  for k in range(100,120):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(vtrain, train_labels)
    
    predicted = model.predict(vdev)

    f1_s[k] = metrics.f1_score(expected, predicted, average='weighted')

  max_f1_k = max(f1_s, key=f1_s.get)
  print("KNeighborsClassifier: f1 score knn = %d, f1 = %.4f" % (max_f1_k, f1_s[max_f1_k]))

  alphas_s = {}
  
  # Same here. Tried different alphas. This is the best range I have found
  for a in np.linspace(start = 0.09, stop = 0.095, num = 500):
    mnb_model = MultinomialNB(alpha=a)
    mnb_model.fit(vtrain, train_labels)
    predicted_mnb = mnb_model.predict(vdev)

    alphas_s[a] = metrics.accuracy_score(expected, predicted_mnb)

  max_alpha_k = max(alphas_s, key=alphas_s.get)
  print("MultinomialNB: best alpha = %.5f, accuracy = %.5f" % (max_alpha_k, alphas_s[max_alpha_k]))

  # After first try, among 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0 and so on
  # the best C value was 0.1. So, the best C value shold be between 
  # 0.01 and 1. Let's limit the scope
  
  C_value = 0.001
  C_values = []
  Accuracies = []
  Squares = []
  
  # After multiple tries and errors the best value appeared to be 
  # C = 0.948980, accuracy = 0.72041
  for C_value in np.linspace(start = 0.01, stop = 1, num = 20):
    C_values.append(C_value)
    lr_model = LogisticRegression(penalty='l2', solver='liblinear', tol=0.01, multi_class='auto', C=C_value)  
    lr_model.fit(vtrain, train_labels)
    lr_predicted = lr_model.predict(vdev)
    Accuracies.append(metrics.accuracy_score(expected, lr_predicted))
    Squares.append(np.sum(np.square(lr_model.coef_), axis=(0,1)))

  ix = Accuracies.index(max(Accuracies))
  print("LogisticRegression: best C = %f, accuracy = %.5f" % (C_values[ix], Accuracies[ix]))
  
  print ()
  print (C_values)
  print (Squares)
  ### STUDENT END ###
  
P3()

ANSWER: <br />
a) KNeighborsClassifier: f1 score knn = 112, f1 = 0.4789<br />
Nearest neighbor is not working because the nearest neighbor is the word that starts with the same letters. If one word starts with "A", that does not mean that another word that starts witn "A" should be the neighbour. Apple is more relevant to Oranges than to "Application"<br /><br />

b)MultinomialNB: best alpha = 0.09133, accuracy = 0.79438<br />
Naive Bias takes into cinsideration the frequency of appearance of the word in the text. This adds additional weight to the text. While Logistic Reggression checks the existence of the word in the text but does not check the frequency of appearance<br />
<br />
LogisticRegression: best C = 0.948980, accuracy = 0.72041<br />
C values: [0.01, 0.06210526315789474, 0.11421052631578947, 0.16631578947368422, 0.21842105263157896, 0.2705263157894737, 0.32263157894736844, 0.37473684210526315, 0.4268421052631579, 0.4789473684210527, 0.5310526315789474, 0.5831578947368421, 0.6352631578947369, 0.6873684210526316, 0.7394736842105263, 0.791578947368421, 0.8436842105263158, 0.8957894736842106, 0.9478947368421053, 1.0]<br />

Squares: [10.591165360490724, 66.46962944510123, 114.32185611729118, 158.97972757032557, 196.73405899612735, 213.91530119404237, 244.43347975416233, 299.53446484790135, 333.66588684874415, 349.20652913298744, 369.06480814748323, 323.18700779185514, 395.33660316642465, 407.97973715054627, 445.0369484261342, 460.76150847860487, 465.4660572795282, 542.6942205086331, 507.5945797684571, 446.62118584848815]<br />
<br />
Lower C specifies stronger regularization. Therefore, both weights (positive and negative) have lower values. When C increases, this means weaker regularization. Therefore the weights are higher and the negative cost/penalty value is lower. Therefore when we square the values we are getting higher values.

(4) Train a logistic regression model. Find the 5 features with the largest weights for each label -- 20 features in total. Create a table with 20 rows and 4 columns that shows the weight for each of these features for each of the labels. Create the table again with bigram features. Any surprising features in this table?

In [0]:
def print_top_weights(ngram_r=(1, 1)):
  ### STUDENT START ###

  vectorizer = CountVectorizer(ngram_range=ngram_r)
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)  

  lr_model = LogisticRegression(penalty='l2',solver='liblinear',multi_class='ovr')
  lr_model.fit(vtrain, train_labels)
  wordlist = []
  vocab = list(vectorizer.vocabulary_)

  for i in range(lr_model.coef_.shape[0]):
    top_weights = sorted(lr_model.coef_[i], reverse=True)[:5]
    ixs = [lr_model.coef_[i].tolist().index(w) for w in top_weights]
    top_words = [vocab[ix] for ix in ixs]
    
    #print (top_words)
    #print (top_weights)
    #print (np.where(ixs)[0])
    
    wordlist.extend(top_words)
  
  print ("Top 20 features:", ngram_r)
  print (wordlist)

  for i in range(lr_model.coef_.shape[0]):
    print ([lr_model.coef_[i][vocab.index(w)] for w in wordlist])
  
  ### STUDENT END ###

def P4():
  ### STUDENT START ###

  print_top_weights(ngram_r=(1, 1))
  print ()
  print_top_weights(ngram_r=(2, 2))
  
  ### STUDENT END ###

P4()

ANSWER:

Top 20 features: (1, 1)<br />
['slips', 'content', 't2n', 'cling', 'sacrificed', 'cargo', 'khwedodah', 'cg', '26', '354', 'iank', 'rites', 'bristol', 'debated', 'shaken', 'circles', 'venusian', '07', 'minign', 'invoked']<br />
[1.1248336372805916, 1.0298971848061016, 0.9901230792434893, 0.9539402376766195, 0.9394610760463263, -0.758541216093077, -0.5825685107025586, -0.33473085909394956, -0.3590352304238924, 0.14366535652558202, -1.2602976999024391, -0.4139415750627057, -0.5723280900994645, -0.47022713272143557, -0.35541621876063284, -0.7402619208857182, -0.607811657371762, -0.533278924102178, -0.30874636360743546, -0.793598936373556]<br />
[-0.39797629862755146, -0.0963645457002289, -0.22098520114013492, -0.6172078429934622, -0.4100088078699556, 1.9367273020012277, 1.3455660314883604, 1.2661015000620803, 1.1247814403210772, 0.9772429395012646, -1.316374784608689, -0.6714812952694931, -0.47908563205085525, -0.46516057723073434, -0.39368498517997846, -0.40947793867477494, -0.41831821162078153, -0.10668895937642836, -0.27342427043547035, -0.0794957570179783]<br />
[-0.4201901584891434, -0.31982968540290596, -0.34098222991357496, -0.7928458613298989, -0.4492984451905958, -1.3367616289075834, -0.8259067842084594, -0.8067724416998179, -0.7025001465980208, -0.6822145944331048, 2.162719035537564, 1.2253125135141474, 1.0119859152973572, 0.9366847223350961, 0.9201894412319044, -0.5253514976389329, -0.2704695875778207, -0.3160465993715539, -0.4483024032600511, -0.14882143379037843]<br />
[-0.3953913796313006, -0.8352072464975461, -0.46360049663427466, -0.06444581959645786, -0.43421652772343033, -0.7629495947315801, -0.46828623655551327, -0.6268635650561725, -0.37846183056365146, -0.4874506006241537, -1.1710177060124225, -0.6294898519667773, -0.46786899515142055, -0.3323658287949194, -0.3805764948049671, 1.1481342826502094, 1.1177943583915695, 1.0548848960800434, 0.9128104190351662, 0.9055169585754957]<br />
<br />
<br />
We have very strange combinations in bigrams :)<br />
Top 20 features: (2, 2) / Bigram<br />
['weight lifting', 'atheists try', 'the wp', 'try watson', 'built telescope', 'valid than', 'old stuff', 'military scientific', 'filename before', 'radial direction', 'package msdos', 'reliablility and', '25 22', 'b098747 by', 'directors includes', 'advertise regularly', 'up joint', 'lifestyle is', 'routines fft', 'bbs system']<br />
[0.7716703999604864, 0.6776306634226352, 0.649139435743164, 0.6337912074430369, 0.5693904786262417, -0.755578301542672, -0.37981345682468115, -0.544989092154071, -0.42846296628879144, -0.32429111098076924, -0.31408110015002483, -0.40493651951543624, -0.3177089931167513, -0.24229763745987556, -0.1577261329595157, -0.22788138463001542, -0.14909394806520845, -0.1326568795047691, -0.16670497278842034, -0.17641886290055966]<br />
[-0.2576378604238498, -0.19292537972741647, -0.8823773521240098, -0.238296287476917, -0.3183548524341258, 1.3195513319571806, 1.0372814163786108, 0.972234597894636, 0.9126050178484336, 0.8963697507292151, -0.645801433403874, -0.5764810087291432, -0.3885153393701742, -0.3996525103749189, -0.22650596392881858, -0.253139031193569, -0.21633296360908813, -0.21181779389339006, -0.2612737478558932, -0.2249166275238969]<br />
[-0.3520784156738887, -0.19793897433504118, -0.8219338371274696, -0.1951402246610944, -0.5770568383292033, -0.6137170388834878, -0.47078677878072883, -0.530990586965104, -0.5687450369243211, -0.5779039890073162, 1.03017330245163, 0.9517607828387479, 0.7388443543091359, 0.6881533129329107, 0.6755417124130042, -0.21136339445563354, -0.19067610677193378, -0.20088506961488634, -0.35924707611890355, -0.1951804240854557]<br />
[-0.2006255612397081, -0.3019892923758894, 0.6014765921432099, -0.1705408041530425, 0.0008703414118077683, -0.6998707400520107, -0.3966451628389555, -0.5073173674372629, -0.34145119688910847, -0.33287746796930545, -0.32450998372322587, -0.24062318541317257, -0.2748037600208872, -0.2569477363983065, -0.16374858109380674, 0.7456656767199157, 0.7065745448561445, 0.7014820927999529, 0.6537993607169632, 0.647989455301782]<br />



(5) Try to improve the logistic regression classifier by passing a custom preprocessor to CountVectorizer. The preprocessing function runs on the raw text, before it is split into words by the tokenizer. Your preprocessor should try to normalize the input in various ways to improve generalization. For example, try lowercasing everything, replacing sequences of numbers with a single token, removing various other non-letter characters, and shortening long words. If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular. With your new preprocessor, how much did you reduce the size of the dictionary?

For reference, I was able to improve dev F1 by 2 points.

In [0]:
def empty_preprocessor(s):
    return s

def better_preprocessor(s):
  ### STUDENT START ###
  
  # I have reviewed results of vectorizer.vocabulary_ and 
  # vectorizer.get_feature_names(). All of the words are 
  # lowercase. There are no punctuations except underscore
  # There are bunch of words that contain numbers, some of 
  # them are years, some of them look like zip codes
  # I decided to get rid of undescores and try to normalize 
  # different numbers that may mean some measures (kg, ound...)

  # convert string to lowercase and 
  # add spaces to the end and start (this will simplify regexes)
  data = ' '+ s.lower() + ' '
  
  # replace non alphanumeric symbols with spaces 
  data = re.sub('[^a-z0-9]', ' ', data)

  # Trying to convert plurals to singulars
  # verbs in past tense to current tense
  # and similar grammar issues
  data = re.sub('\b+ment ', ' ', data)
  data = re.sub('ing ', 'e ', data)
  data = re.sub('ies ', 'y ', data)
  data = re.sub('s ', ' ', data)
  data = re.sub('ed ', 'e ', data)
  data = re.sub('e ', ' ', data)

  # replace words with numbers only with underscores
  data = re.sub('\s\d+\s', ' ____ ', data)
  # not sure why the previous regex is still leaving some numbers
  # the second one gets rid of the rest of the numbers
  data = re.sub('\s\d+\s', ' ____ ', data)
  
  # replacing different measures
  data = re.sub('\s\d+m\s', ' ____m ', data)
  data = re.sub('\s\d+cm\s', ' ____cm ', data)
  data = re.sub('\s\d+km\s', ' ____km ', data)
  data = re.sub('\s\d+mi\s', ' ____mi ', data)
  data = re.sub('\s\d+mph\s', ' ____mph ', data)
  data = re.sub('\s\d+g\s', ' ____g ', data)
  data = re.sub('\s\d+nd\s', ' ____th ', data)
  data = re.sub('\s\d+th\s', ' ____th ', data)
  data = re.sub('\s\d+st\s', ' ____th ', data)
  data = re.sub('\s\d+e\d+\s', ' ____e0 ', data)
  data = re.sub('\s\d+k\s', ' ____k ', data)
  data = re.sub('\s\d+mb\s', ' ____mb ', data)
  data = re.sub('\s\d+pixel\s', ' ____pixel ', data)
  data = re.sub('\s\d+index\s', ' ____index ', data)
  data = re.sub('\s\d+am\s', ' ____am ', data)
  data = re.sub('\s\d+pm\s', ' ____pm ', data)
  data = re.sub('\s\d+hz\s', ' ____hz ', data)
  data = re.sub('\s\d+khz\s', ' ____khz ', data)
  data = re.sub('\s\d+mhz\s', ' ____mhz ', data)
  
  # replacing resolutions: 123x123, 123x123x123
  data = re.sub('\s\d+x\d+\s', ' ___x___ ', data)
  data = re.sub('\s\d+x\d+x\d+\s', ' ___x___x___ ', data)
  
  # remove words with more than 10 letters (nothing chaned, so commenting out)
  #data = re.sub('\s[a-z0-9]{10,200}\s', ' ', data)
  
  return data

  ### STUDENT END ###

def P5():
  
  ### STUDENT START ###
  
  vectorizer = CountVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)
  vtest = vectorizer.transform(test_data)
  
  better_vectorizer = CountVectorizer(preprocessor=better_preprocessor)
  better_vtrain = better_vectorizer.fit_transform(train_data)
  better_vdev = better_vectorizer.transform(dev_data)
  better_vtest = better_vectorizer.transform(test_data)
  
  #print (len(vectorizer.vocabulary_))
  #print (len(better_vectorizer.vocabulary_))
  #print ()

  lr_model = LogisticRegression(penalty='l2',solver='liblinear',tol=0.01, multi_class='auto')
  lr_model.fit(vtrain, train_labels)
  lr_predicted_dev = lr_model.predict(vdev)
  lr_predicted_test = lr_model.predict(vtest)

  better_lr_model = LogisticRegression(penalty='l2',solver='liblinear',tol=0.01, multi_class='auto')
  better_lr_model.fit(better_vtrain, train_labels)
  better_lr_predicted_dev = better_lr_model.predict(better_vdev)
  better_lr_predicted_test = better_lr_model.predict(better_vtest)

  print ("Dev data")
  print("Standard accuracy (dev) = %.5f" % (metrics.accuracy_score(dev_labels, lr_predicted_dev)))
  print("Preprocessor accuracy (dev) = %.5f" % (metrics.accuracy_score(dev_labels, better_lr_predicted_dev)))
  print ()
  print ("Standard F1 (dev) = %.5f" % (metrics.f1_score(dev_labels, lr_predicted_dev, average='weighted')))
  print ("Preprocessor F1 (dev) = %.5f" % (metrics.f1_score(dev_labels, better_lr_predicted_dev, average='weighted')))
  print ()
  print ("Test data")
  print("Standard accuracy (test) = %.5f" % (metrics.accuracy_score(test_labels, lr_predicted_test)))
  print("Preprocessor accuracy (test) = %.5f" % (metrics.accuracy_score(test_labels, better_lr_predicted_test)))
  print ()
  print ("Standard F1 (test) = %.5f" % (metrics.f1_score(test_labels, lr_predicted_test, average='weighted')))
  print ("Preprocessor F1 (test) = %.5f" % (metrics.f1_score(test_labels, better_lr_predicted_test, average='weighted')))
  
  ### STUDENT END ###

P5()

(6) The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. That is, logistic regression seeks the set of weights that minimizes errors in the training data AND has a small size. The default regularization, L2, computes this size as the sum of the squared weights (see P3, above). L1 regularization computes this size as the sum of the absolute values of the weights. The result is that whereas L2 regularization makes all the weights relatively small, L1 regularization drives lots of the weights to 0, effectively removing unimportant features.

Train a logistic regression model using a "l1" penalty. Output the number of learned weights that are not equal to zero. How does this compare to the number of non-zero weights you get with "l2"? Now, reduce the size of the vocabulary by keeping only those features that have at least one non-zero weight and retrain a model using "l2".

Make a plot showing accuracy of the re-trained model vs. the vocabulary size you get when pruning unused features by adjusting the C parameter.

Note: The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.01 (the default is .0001).

In [0]:
def get_nonzero_vector(p='l2'):
  
  ### STUDENT START ###

  vectorizer = CountVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)
  vocab = np.array(list(vectorizer.vocabulary_))

  lr_model_l1 = LogisticRegression(penalty=p,solver='liblinear',multi_class='ovr',max_iter=5000,tol=0.01)
  #lr_model_l1 = LogisticRegression(penalty=p, solver='liblinear', tol=0.01, multi_class='auto', C=0.948980,max_iter=5000)
  lr_model_l1.fit(vtrain, train_labels)
  
  # returning array with nonzero values for indices that differ from zero
  return sum(np.where(lr_model_l1.coef_ != 0.0, 1, 0))
  
  ### STUDENT END ###
  
def P6():
  # Keep this random seed here to make comparison easier.
  np.random.seed(0)

  ### STUDENT START ###
  
  vectorizer = CountVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)
  vocab = np.array(list(vectorizer.vocabulary_))

  lr_model_l1 = LogisticRegression(penalty='l1',solver='liblinear',multi_class='auto', C=0.948980)
  lr_model_l1.fit(vtrain, train_labels)
  
  ixs_l1 = get_nonzero_vector('l1')
  ixs_l2 = get_nonzero_vector('l2')

  print ('Number of nonzero weighted items for L1: ', len(vocab[ixs_l1 > 0]))
  print ('Number of nonzero weighted items for L2: ', len(vocab[ixs_l2 > 0]))
  
  filtered_vocabulary = vocab[ixs_l1 > 0]
  
  vectorizer_filtered = CountVectorizer(vocabulary=filtered_vocabulary)
  vtrain_filtered = vectorizer_filtered.fit_transform(train_data)
  vdev_filtered = vectorizer_filtered.transform(dev_data)
  
  lr_model_filtered = LogisticRegression(penalty='l2', solver='liblinear', tol=0.01, multi_class='auto', C=0.948980)
  lr_model_filtered.fit(vtrain_filtered, train_labels)
  lr_predicted_filtered = lr_model_filtered.predict(vdev_filtered)
  accuracy_filtered = metrics.accuracy_score(dev_labels, lr_predicted_filtered)
  
  print("Accuracy of filtered vocabulary = %.5f" % (accuracy_filtered))
  
  C_values = []
  Accuracies = []
  Vocab_sizes = []
  
  for C_value in np.linspace(start = 0.01, stop = 1, num = 100):  
    C_values.append(C_value)
    
    lr_model_c = LogisticRegression(penalty='l1', solver='liblinear', tol=0.01, multi_class='auto', C=C_value)
    lr_model_c.fit(vtrain, train_labels)
    lr_predicted_c = lr_model_c.predict(vdev)
    accuracy_filtered_c = metrics.accuracy_score(dev_labels, lr_predicted_c)
    
    Accuracies.append(accuracy_filtered_c)
    Vocab_sizes.append(sum(sum(np.where(lr_model_c.coef_ != 0.0, 1, 0)) > 0))
    
    C_value = C_value * 10
  
  print ("C_values:", C_values)
  print ("Accuracies:", Accuracies)
  print ("Vocab_sizes:", Vocab_sizes)
  
  fig, ax1 = plt.subplots()
  
  ax2 = ax1.twinx()
  accuracies = ax1.plot(C_values, Accuracies, 'g-', label='Accuracies')
  filtered_accuracy = ax1.plot(C_values, accuracy_filtered * np.ones(len(C_values)), 'r--', label='Re-trained')
  vocabularies = ax2.plot(C_values, Vocab_sizes, 'b-', label='Vocabulary sizes')

  ax1.set_xlabel('C values')
  ax1.set_ylabel('Accuracies', color='g')
  ax2.set_ylabel('Vocabulary sizes', color='b')

  lns = accuracies + vocabularies +filtered_accuracy
  labels = [l.get_label() for l in lns]
  ax1.legend(lns, labels)

  plt.title("Comparison of accuracies")
  plt.show()


  ### STUDENT END ###

P6()

(7) Use the TfidfVectorizer -- how is this different from the CountVectorizer? Train a logistic regression model with C=100.

Make predictions on the dev data and show the top 3 documents where the ratio R is largest, where R is:

maximum predicted probability / predicted probability of the correct label

What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.

In [0]:
def P7():
  ### STUDENT START ###
  
  # Initializing TfidfVectorizer
  vectorizer = TfidfVectorizer()
  vtrain = vectorizer.fit_transform(train_data)
  vdev = vectorizer.transform(dev_data)

  # Training LogisticRegression
  lr_model_l2 = LogisticRegression(penalty='l2',solver='liblinear',multi_class='auto',C=100)
  lr_model_l2.fit(vtrain, train_labels)
  
  # Getting predicted values and probabilities
  lr_predicted_l2 = lr_model_l2.predict(vdev)
  lr_predicted_proba_l2_byalllabels = lr_model_l2.predict_proba(vdev)
  lr_predicted_proba_l2 = np.array([max(x) for x in lr_predicted_proba_l2_byalllabels])
  
  # Filtering matched labels
  matched = lr_predicted_l2 == dev_labels
  
  # To preserve the indexes, we are just setting probabilities to
  # unmatched labels to negative value. Therefore R value will also be negative
  # and these values will not get in TOP 3
  lr_predicted_proba_l2[lr_predicted_l2 != dev_labels] = -1.

  # Getting max probability from the array
  max_prob = max(lr_predicted_proba_l2)
  
  # Calculating R array
  R = max_prob / lr_predicted_proba_l2
  
  # Getting top 3 R-s
  R_top = sorted(R, reverse=True)[:3]
  
  # Getting indexes of the best R values
  ixs = (R == R_top[0]) | (R == R_top[1]) | (R == R_top[2])
  #top_words = [vocab[ix] for ix in ixs]
  
  print (np.array(dev_data)[ixs])
  print (np.array(lr_predicted_proba_l2_byalllabels)[ixs])
  
  ## STUDENT END ###

P7()

ANSWER:<br />
**Use the TfidfVectorizer -- how is this different from the CountVectorizer?**<br />
TfidfVectorizer additinally takes into consideration the term frequency.
<br />
<br />
**What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.**<br />
There are empty values in the dev data. This means, that data that contains only '' was correctly predicted with minimal probability. I think the easiest way to fix this is data sanitizing. Empty stging cannot be assigned the correct label. Therefore, before just blindly processing the data from forums, it is always a good decision to filter out empty strings or the garbage, that by definition cannot be assigned a correct label

(8) EXTRA CREDIT

Try implementing one of your ideas based on your error analysis. Use logistic regression as your underlying model.