# Problem 1: Multiclass (30%)
### So far we have largely focused on binary classification, where the input is a document and the output is a yes or a no (or probability of yes). 

### More complex tasks exist in which the input is a document and the output is chosen from more than two classes.

## In this problem we'll investigate two so-called multiclass problems
### Multiclass: an observation is assigned to exactly ONE of $N>2$ categories
 - ### E.g. is this sentence positive, negative, or neutral sentiment
 - ### E.g. is this email spam, a promotion, or a personal message


### Multiclass-multilabel: an observation can belong to more than one of $N \geq 2$ categories
 - ### E.g. is this document about `{sports, current events, Steph Curry}` (a document can be about more than one)
 - ### E.g. is this blood sample A, B, O, $+$, $-$ (blood can be `A+` or `A-`)

## We will study the metrics we can use to evaluate these classification problems

In [1]:
import numpy as np
import pandas as pd
%pylab inline

import json

from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

Populating the interactive namespace from numpy and matplotlib


In [2]:
np.random.seed(1234)

## We will start with multiclass by studying the 20 newsgroups data
# $ \\ $
# $ \\ $
# Part 0: get the data
 - ### use the builtin function `from sklearn.datasets import fetch_20newsgroups`
 - ### NB: look at the docs and use the `remove` kwarg in order to get cleaned data

## TODO
 - ## fetch the data separately for the train and test data
 - ## How many classes are present? 
 - ## What is the most common class? Please give the name, not the number.
 - ## What is the accuracy of the best constant guess in the train set?
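
As a reminder of what "best constant guess" means, here is a minimal sketch on made-up labels. The arrays `target` and `target_names` are hypothetical stand-ins for the corresponding attributes of the fetched data, not the real newsgroups values.

```python
import numpy as np

# Hypothetical integer labels and class names (toy data for illustration).
target = np.array([0, 2, 1, 2, 2, 0, 2, 1])
target_names = ['alpha', 'beta', 'gamma']

counts = np.bincount(target)                 # occurrences of each class index
most_common = int(counts.argmax())           # index of the most frequent class
constant_acc = counts.max() / counts.sum()   # accuracy of always guessing it

print('found {} classes'.format(len(counts)))
print('most common class: {}'.format(target_names[most_common]))
print('constant guess acc: {:.3f}'.format(constant_acc))
```

Always predicting the most frequent class is the best any constant guess can do, so its accuracy is simply that class's share of the data.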

In [3]:
from sklearn.datasets import fetch_20newsgroups


In [4]:
data_train = fetch_20newsgroups(...
data_test = fetch_20newsgroups(...



In [None]:
val_counts = ...
print('found {} classes'.format(val_counts.shape[0]))
most_common_class = ...
print('most common class: {}'.format(most_common_class))

dummy_acc = accuracy_score(...
print('constant guess acc: {:.3f}'.format(dummy_acc))

# Part 1: fit a model
## As we saw with MNIST, logistic regression is capable of fitting multiclass data.
 - ## Encode the text as a bag of words and fit logistic regression to the data
 - ## Calculate the out-of-sample accuracy score
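
To recall what a bag-of-words encoding produces, the sketch below builds one by hand for two made-up sentences. It is a toy stand-in and only splits on whitespace; `CountVectorizer` additionally handles tokenization details, lowercasing, and sparse storage.

```python
from collections import Counter

# Two hypothetical documents.
docs = ['the cat sat', 'the dog sat on the cat']

# Vocabulary: one column per distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Each document becomes a row of token counts aligned with `vocab`.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts.get(tok, 0) for tok in vocab])

print(vocab)
print(rows)
```

Word order is discarded: only how often each vocabulary word appears in a document survives the encoding.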

In [6]:
# Todo
# 1. make a count vectorizer with max_features=20000
# 2. fit it
# 3. transform the train and test data into numbers
vec = ...

# your code here
xtr = ... # train data
xte = ... # test data


In [None]:
# TODO
# 1. fit logistic regression
# 2. compute accuracy score

# your code here
accuracy_score(...

# Part 2: Evaluate metrics
### As we have seen previously, while accuracy is useful, it does not always capture all the behavior we want in a metric.

### Here we will extend the concept of f1 score to the multiclass setting. There are several ways to do this
 - report a different f1 score for every class (no averaging)
 - report the mean f1 score over all classes
 - report a weighted f1 score, weighted by class prevalence.

### For each of these three types of f1
 - calculate the score(s) without the help of scikit learn
 - compare it to the corresponding f1 score evaluated with scikit-learn (NB: you'll need to read the docs for `f1_score`).
 - Write down the pros and cons for this method of calculating multiclass f1 score
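
The three averaging schemes can be sketched on a small made-up example with three classes. The labels below are hypothetical and unrelated to the newsgroups data; the point is only how the one-vs-rest counts combine. In `f1_score` these variants correspond to `average=None`, `average='macro'`, and `average='weighted'`.

```python
import numpy as np

# Toy true labels and predictions for 3 classes (made up for illustration).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 1, 2, 2, 0, 2])
n_classes = 3

f1s = []
for k in range(n_classes):
    # Treat class k as "positive" and every other class as "negative".
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    fn = np.sum((y_pred != k) & (y_true == k))
    denom = 2 * tp + fp + fn
    f1s.append(2 * tp / denom if denom > 0 else 0.0)
f1s = np.array(f1s)

# Macro: plain mean, so every class counts equally.
macro_f1 = f1s.mean()

# Weighted: mean weighted by how often each class appears in y_true.
support = np.bincount(y_true, minlength=n_classes)
weighted_f1 = np.sum(f1s * support) / support.sum()

print(f1s, macro_f1, weighted_f1)
```

Because floating point sums can differ in the last digits, comparisons against scikit-learn are safer with `np.allclose` or `np.isclose` than with `==`.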

In [8]:
preds = lr.predict(xte)

f1s = []
for label_index, label_name in enumerate(data_train.target_names):
    # calculate the f1 score of one (label_index) vs rest
    # your code here...
    f1s.append(...

for label_name, fs in zip(data_train.target_names, f1s):
    print('fscore for {} \t = {:.3f}'.format(label_name, fs))

print('\n\n')
# compare to sklearn
# use a tolerance: exact float equality can fail on rounding differences
success = np.allclose(f1s, f1_score(data_test.target, preds, average=None))
if success:
    print('sklearn builtin matches results')
else:
    print('scores do not match')

In [1]:
# The pros are ...
# The cons are ...

In [None]:
# now the macro f1 (the mean of the f1s for each class)
f1_macro = ... # calculate without sklearn
f1_macro_sk = f1_score(... # calculate with sklearn
assert np.isclose(f1_macro, f1_macro_sk)  # compare floats with a tolerance
print('macro f1: {} \t sklearn macro f1 {}'.format(
    f1_macro, 
    f1_macro_sk
))



In [None]:
# The pros are ...
# The cons are ...

In [None]:
# now weighted by class prevalence
# TODO:
#  - calculate the frequency of each class
#  - take a weighted average of the f1s, weighted by these weights
#  - compare to sklearn
wts = ...
weighted_f1 = ... # without sklearn
weighted_f1_sk = f1_score(... # with sklearn

print('weighted f1 {} \t sklearn weighted f1 {}'.format(weighted_f1, weighted_f1_sk))



In [None]:
# The pros are ...
# The cons are ...

# Part 3: Confusion Matrix
## The confusion matrix is a handy way to understand errors in classification problems.  It is a 2-D grid of what values were predicted and what the actual values were. 

See [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) in the sklearn docs. 

## Create a confusion matrix for the 20-newsgroups dataset and comment on the most common failure modes
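
For intuition, a confusion matrix can be built by hand in a few lines. The sketch below uses made-up labels for three classes and follows the scikit-learn convention of actual classes on the rows and predicted classes on the columns.

```python
import numpy as np

# Toy true labels and predictions (hypothetical, not the newsgroups data).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 1, 2, 2, 0, 2])
n_classes = 3

# cm[i, j] counts observations whose actual class is i and predicted class is j.
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

# Zero out the diagonal (correct predictions) to find the largest error cell.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(cm)
print('largest error cell: actual {} predicted as {}'.format(i, j))
```

When several error cells tie, `argmax` reports the first one, so it is worth looking at the whole off-diagonal rather than a single cell.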

In [3]:
from sklearn.metrics import confusion_matrix
# your code here
# NB: it's handy to call `pd.DataFrame` on the confusion matrix to print it out nicely

In [4]:
# comments here

# Problem 2: Multiclass Multilabel Problems (20%)
### In this problem we'll examine academic articles from the [arXiv](https://www.arxiv.org).
### Authors who submit articles can attach one or more categories to the articles

# Part 0: Load the data
## TODO
 - ### load the data
 - ### compute all of the unique categories in the train data
 - ### What are the 10 most common pairs of categories that occur together?
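
Counting co-occurring pairs can be sketched with `itertools.combinations` and a `Counter`. The list `article_categories` below is hypothetical; how the category list is stored in each real JSON record is an assumption you will need to check against the data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-article category lists (toy data for illustration).
article_categories = [
    ['q-fin.ST', 'stat.AP'],
    ['q-fin.ST', 'q-fin.RM', 'stat.AP'],
    ['q-fin.PR'],
    ['q-fin.RM', 'q-fin.ST'],
]

pair_counts = Counter()
for cats in article_categories:
    # Sort first so that (a, b) and (b, a) are counted as the same pair.
    for pair in combinations(sorted(set(cats)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(10))
```

Articles with a single category contribute no pairs, which is why the third toy article leaves the counts unchanged.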

In [None]:
with open('../../data/arxiv-qfin-train.json') as fi:
    data_train = json.load(fi)

with open('../../data/arxiv-qfin-test.json') as fi:
    data_test = json.load(fi)

    
print(len(data_train), len(data_test))

In [5]:
# compute the unique categories here


In [6]:
# compute the co-occurring categories here
# Hint:
#  - loop through all the train articles
#  - loop through all the pairs of categories
#  - keep track of the counts of every pair


# Part 1: Encode the data

## We will encode the title of each article using a bag of words (`CountVectorizer`). Try limiting the features to about 20k. 

## Encoding targets is a bit trickier for multilabel problems. In this case we want our target to be a matrix of shape $N_{samples} \times N_{categories}$, but each row does not have to sum to 1.
 - ## NB: scikit-learn has a `MultiLabelBinarizer` to help here.

# $ \\ $
## TODO
 - ## fit a `CountVectorizer` on the titles to create `x_train` and `x_test`
 - ## create `y_train` and `y_test` to be matrices of shape $N_{samples} \times N_{categories}$ with all 0s and 1s
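
To make the target layout concrete, here is a hand-rolled sketch of the 0/1 indicator matrix for a few made-up label sets. `MultiLabelBinarizer` produces the same kind of matrix; the label names here are hypothetical.

```python
import numpy as np

# Hypothetical label sets, one list per sample (the last sample has none).
labels = [['a', 'c'], ['b'], ['a', 'b', 'c'], []]

# One column per distinct category.
classes = sorted({c for row in labels for c in row})
col = {c: j for j, c in enumerate(classes)}

# y[i, j] is 1 when sample i carries category j, else 0.
y = np.zeros((len(labels), len(classes)), dtype=int)
for i, row in enumerate(labels):
    for c in row:
        y[i, col[c]] = 1

print(classes)
print(y)
print(y.sum(axis=1))  # row sums are not constrained to equal 1
```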

In [19]:
vec = CountVectorizer(...
vec.fit(...
x_train = ...
x_test = ...

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# your code here

y_train = ...
y_test = ...
print(y_train.shape, y_test.shape)
print(list(mlb.classes_))

## Part 2: Model the data
### While scikit-learn's `LogisticRegression` does not accept multilabel targets directly, keras can handle them.
### Create and fit a multilabel logistic regression model.
### NB: think hard about the activation function and loss function that are appropriate in this case!
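
One way to see why the choice matters is to compare two common output activations on the same made-up scores. This is a NumPy sketch for intuition only, not a keras model, and the numbers are hypothetical.

```python
import numpy as np

# Hypothetical raw scores (logits) for one sample and 3 categories.
logits = np.array([[2.0, 1.0, -1.0]])

# Softmax: categories compete, so each row always sums to 1.
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Sigmoid: each category gets its own independent yes/no probability.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax.sum(axis=1))
print(sigmoid.sum(axis=1))

# Binary cross-entropy averaged over every (sample, category) entry,
# for a hypothetical target in which two categories are both "on".
y = np.array([[1, 1, 0]])
bce = -np.mean(y * np.log(sigmoid) + (1 - y) * np.log(1 - sigmoid))
print(bce)
```

A target row with two 1s cannot be matched by probabilities that are forced to sum to 1, which is worth keeping in mind when picking the activation and loss.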

In [29]:
from keras.models import Model
from keras.layers import Input, Dense, Softmax, Dropout
import keras.backend as K

K.clear_session()
doc_input = Input( ...
# your code here
# don't forget to compile your model

In [None]:
model.fit(
    ...
)

In [None]:
pd.DataFrame(model.history.history)[['val_loss', 'val_accuracy']].plot(
    figsize=(12,7), secondary_y='val_loss'
)

# Part 3: f1 score
## While modeling is more difficult in the multilabel case, the metrics are, oddly, simpler. Here, we can only compute metrics class by class.

### For each class, print the accuracy and f1 score for the class. Comment on the results. 

In [None]:
preds = model.predict(...
# loop through all the classes
# compute and print the accuracy and f1 for that class
for i, class_ in enumerate(mlb.classes_):
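    # assumes `preds` holds probabilities; threshold at 0.5 to get 0/1 labels
    pred_labels = (preds[:, i] > 0.5).astype(int)
    acc = accuracy_score(y_test[:, i], pred_labels)
    f1 = f1_score(y_test[:, i], pred_labels)
    print('class {} \t\tacc: {:.3f} \tf1: {:.3f}'.format(class_, acc, f1))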

# Problem 3: New Metrics (30%)
## In this problem we'll explore new metrics associated with true positives and false positives.

## Part 0: Load the IMDB data and fit a model
 - ### Load the imdb data
 - ### featurize the text using TFIDF
 - ### Fit logistic regression
 - ### calculate the in-sample and out of sample accuracy and f1 score

In [16]:
import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline


def load_imdb_data_text(imdb_data_dir, random_seed=1234):
    train_dir = os.path.join(imdb_data_dir, 'train')
    test_dir = os.path.join(imdb_data_dir, 'test')

    np.random.seed(random_seed)
    texts = []
    targets = []
    for label in ('pos', 'neg'):
        data_dir = os.path.join(train_dir, label)
        files = glob.glob(os.path.join(data_dir, '*.txt'))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = (label == 'pos')
            texts.append(text)
            targets.append(target)

    train_docs = texts
    y_train = np.array(targets)


    texts = []
    targets = []
    for label in ('pos', 'neg'):
        data_dir = os.path.join(test_dir, label)
        files = glob.glob(os.path.join(data_dir, '*.txt'))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = (label == 'pos')
            texts.append(text)
            targets.append(target)

    test_docs = texts
    y_test = np.array(targets)

    inds = np.arange(y_train.shape[0])
    np.random.shuffle(inds)

    train_docs = [train_docs[i] for i in inds]
    y_train = y_train[inds]
    
    return (train_docs, y_train), (test_docs, y_test)

(train_docs, y_train), (test_docs, y_test) = load_imdb_data_text('../../data/aclImdb/')
print('found {} train docs and {} test docs'.format(len(train_docs), len(test_docs)))

Populating the interactive namespace from numpy and matplotlib
found 25000 train docs and 25000 test docs


In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

In [None]:
vec = TfidfVectorizer...
# more code here


# more code here
preds_train = ...
preds_test = ...

print('#'*20 + ' in sample ' + '#'*20 )
print('\t\taccuracy: {:.3f}'.format(accuracy_score(y_train, preds_train)))
print('\t\tf1: {:.3f}'.format(f1_score(y_train, preds_train)))
print('\n\n')
print('#'*20 + ' out of sample ' + '#'*20 )
print('\t\taccuracy: {:.3f}'.format(accuracy_score(y_test, preds_test)))
print('\t\tf1: {:.3f}'.format(f1_score(y_test, preds_test)))

## Part 2: Tradeoff between true positives and false positives
Typically we take a threshold of 0.5 probability to consider something a positive example.
However, as we change this threshold we can change the number of true positives we get.
 - Example: at a threshold of 0.0001 we will get nearly all of the true positives
 - Example: at a threshold of 0.999 we will get almost none of the true positives

Notice: as we change our threshold and increase the number of true positives we will also increase the number of false positives we pick up.

In this part you will create a graph of the false positive rate on the x-axis and the true positive rate on the y-axis. This is often called the `receiver operator characteristic` (ROC) curve. Make this curve for the out-of-sample data below.

Note: while you can use the builtin scikit-learn functionality for this, you will __not receive credit__ if you do. 
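
The mechanics of the threshold sweep can be sketched on a handful of made-up scores. The labels and probabilities below are hypothetical, and the evenly spaced threshold grid is just one possible choice.

```python
import numpy as np

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

n_pos = y_true.sum()
n_neg = len(y_true) - n_pos

fprs, tprs = [], []
# Sweep from a strict threshold (nothing predicted positive) to a lax one.
for t in np.linspace(1.0, 0.0, 101):
    pred = scores >= t
    tprs.append(np.sum(pred & (y_true == 1)) / n_pos)  # true positive rate
    fprs.append(np.sum(pred & (y_true == 0)) / n_neg)  # false positive rate

print(list(zip(fprs, tprs))[::20])
```

The curve starts at (0, 0), ends at (1, 1), and never decreases in either coordinate as the threshold is lowered.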

In [17]:
from tqdm import tqdm


In [None]:
# your code here
# hint: 
#  - loop through the thresholds
#  - calculate the true positives and false positives

# hint: what values for thresholds should you loop through?

In [None]:
pd.Series(true_pos_rates, index=false_pos_rates).plot(figsize=(12,8), fontsize=16)
plt.xlabel('False Pos Rate', fontsize=16)
plt.ylabel('True Pos Rate', fontsize=16)
plt.title('Receiver Operator Characteristic', fontsize=20)

## Part 3: Baseline
 - What does the receiver operator curve look like for a random guessing classifier? 
 - Make the same plot as above but add the random guessing curve
 - Add comments about WHY the random guessing curve looks this way
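
A quick simulation can help build intuition for the baseline. The sketch below assumes "random guessing" means scores drawn uniformly at random, independent of the labels; both the labels and the scores are simulated, not taken from the IMDB data.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=20000)   # simulated 0/1 labels
scores = rng.random(20000)                # scores unrelated to the labels

thresholds = np.linspace(1.0, 0.0, 101)
# Fraction of positives (resp. negatives) scoring at or above each threshold.
tprs = np.array([np.mean(scores[y_true == 1] >= t) for t in thresholds])
fprs = np.array([np.mean(scores[y_true == 0] >= t) for t in thresholds])

# Largest gap between the two rates across all thresholds.
print(np.abs(tprs - fprs).max())
```

Since the scores carry no information about the labels, positives and negatives clear any given threshold at about the same rate.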

In [None]:
ax = pd.Series(true_pos_rates, index=false_pos_rates, name='logistic regression').plot(
    figsize=(12,8), fontsize=16
)
baseline_series = ... # your code here for the ROC for random guessing
baseline_series.to_frame('random guess').plot(ax=ax, fontsize=16)
plt.xlabel('False Pos Rate', fontsize=16)
plt.ylabel('True Pos Rate', fontsize=16)
plt.title('Receiver Operator Characteristic', fontsize=20)
plt.legend(fontsize=16)

In [38]:
# add comments here

## Part 4: Boiling it down to a single number
 - While the ROC is a useful curve and contains a lot of information, it is useful to distill it down to a single number. Typically, the area under the curve is used. Calculate the area under the curve and add it as the title to your previous plot.
 - Hint: think about approximations for integrals for finding area under a curve
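
One such approximation is the trapezoid rule, sketched here on a few made-up ROC points. The arrays `fpr` and `tpr` are hypothetical and assumed to be sorted by increasing false positive rate.

```python
import numpy as np

# Hypothetical ROC points, sorted by increasing false positive rate.
fpr = np.array([0.0, 0.0, 0.0, 1 / 3, 2 / 3, 2 / 3, 1.0])
tpr = np.array([0.0, 1 / 3, 2 / 3, 2 / 3, 2 / 3, 1.0, 1.0])

# Trapezoid rule: width of each step times the average height over that step.
widths = np.diff(fpr)
heights = (tpr[1:] + tpr[:-1]) / 2
area = np.sum(widths * heights)

print('area under the curve = {:.3f}'.format(area))
```

Vertical segments have zero width and contribute nothing, so only the steps where the false positive rate moves add area.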


In [None]:
area_under = ... # your code here


# repeat the plotting code here

plt.title('Area under the curve = {:.3f}'.format(area_under), fontsize=20)
plt.legend(fontsize=16)

## Part 5: Check your work and comment on the results
 - "There's gotta be a better way!"
 - In fact, `scikit-learn` will take care of a lot of the headache here. 
 - `from sklearn.metrics import plot_roc_curve` (removed in scikit-learn 1.2; newer versions offer `RocCurveDisplay.from_estimator` instead)
 - read the docs and use this function



In [None]:
from sklearn.metrics import plot_roc_curve, auc
# your code here

## A few comments:
 - The area under the ROC has a nice interpretation: it is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative example.
 - This metric is also nice since it is independent of a threshold. 
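
The ranking interpretation can be checked directly by comparing every positive score with every negative score. The data below is made up, and counting ties as half a win is a common convention assumed here.

```python
import numpy as np

# Hypothetical labels and predicted scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# diff[i, j] > 0 means positive i outranks negative j.
diff = pos[:, None] - neg[None, :]
auc_pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

print('pairwise AUC = {:.3f}'.format(auc_pairwise))
```

No threshold appears anywhere in this calculation: only the ordering of the scores matters.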

# Problem 4: Examining Coefficients (20%)
In class we skipped an important step: we never made sure our models made sense. 
Logistic regression provides coefficients, which allow us to determine if a model
is learning anything reasonable. 

In this problem, you'll load the IMDB data, fit logistic regression, and examine the coefficients. 
Print out the largest and smallest (largest negative) coefficients and comment on the results.
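
The mechanics of ranking coefficients can be sketched on a made-up vocabulary. The arrays `words` and `coefs` are hypothetical stand-ins for the vectorizer's feature names and the fitted model's coefficient vector.

```python
import numpy as np

# Hypothetical vocabulary and coefficients (not fitted on any real data).
words = np.array(['awful', 'great', 'movie', 'boring', 'wonderful', 'the'])
coefs = np.array([-3.1, 2.8, 0.1, -2.4, 3.3, 0.0])

order = np.argsort(coefs)                 # indices from smallest to largest
most_negative = words[order[:2]]          # strongest push toward class 0
most_positive = words[order[-2:][::-1]]   # strongest push toward class 1

print('most negative:', list(most_negative))
print('most positive:', list(most_positive))
```

Words with coefficients near zero, such as the toy entry for "the", have little influence on the prediction either way.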

In [8]:
import os
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

# copy code from above to load the data


In [2]:
(train_docs, y_train), (test_docs, y_test) = load_imdb_data_text('../../data/aclImdb/')
print('found {} train docs and {} test docs'.format(len(train_docs), len(test_docs)))

found 25000 train docs and 25000 test docs


In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vec = TfidfVectorizer(...
# more code here
                      
lr = LogisticRegression(...


In [None]:
# Hint: you can call `vec.get_feature_names_out()` (`vec.get_feature_names()` in scikit-learn versions before 1.0) to get the words in order
# that correspond to the columns of the TFIDF matrix 
# This is useful to pass to the index of a pd.Series

In [None]:
coefs = pd.Series(...)

# NB: to get the largest items in a series by abs try
#    coefs.loc[coefs.abs().nlargest(20).index]

In [9]:
# comments here