**Table of Contents**

* [Text Classifiers](#cls)
    * [Generative Models](#gen)
    * [Discriminative Models](#disc)
    * [Logistic Regression](#lr)
        * [Parameter Estimation](#mle)
    * [Richer Features](#ff)
        
**Table of Exercises**     

* [Exercise 1](#ex1) (-/1)
* [Exercise 2](#ex2) (-/1)
* [Exercise 3](#ex3) (-/3)
* [Exercise 4](#ex4) (-/5)


**General notes**

* In this notebook you are expected to use $\LaTeX$. 
* Use python3

**ILOs**

After completing this lab you should be able to 

* develop text classifiers based on logistic regression 
* estimate parameters via gradient ascent using sklearn

# <a name="cls"> Text Classifiers



Text classifiers learn to map a given piece of text to a distribution over $C$ categories. Given a dataset of labelled documents, we can use the maximum likelihood principle to estimate the parameters of this distribution.

Then, we can classify unlabelled documents by predicting the mode of the distribution over classes given text.

## <a name="gen">  Generative Models

A **generative model**, such as the Naive Bayes classifier, obtains the conditional distribution $Y|S=w_{1:m}$ via *probabilistic inference*. That is, it prescribes a joint distribution over text and classes assigning probability 

\begin{equation}
P_{SY}(w_{1:m}, y) = P_Y(y)P_{S|Y}(w_{1:m}|y)
\end{equation}

to labelled data $(w_{1:m}, y)$, and uses Bayes rule to infer the posterior probability of any one class $c \in \mathcal Y$ given the text: 

\begin{equation}
P_{Y|S}(c|w_{1:m}) = \frac{P_Y(c)P_{S|Y}(w_{1:m}|c)}{\sum_{y \in \mathcal Y} P_Y(y)P_{S|Y}(w_{1:m}|y)}
\end{equation}

Crucially, in order to address data sparsity while generating the complex outcome $w_{1:m}$, a generative model exploits conditional independence assumptions to factorise $P_{S|Y}(w_{1:m}|y)$ conveniently. For example, the Naive Bayes model assumes independence of words given the class.

After training, we can use the model to make decisions, for example, by classifying an unlabelled document $s$ with the mode of the posterior distribution $Y|S=s$:

\begin{equation}
y^\star = \mathrm{argmax}_{y \in \mathcal Y} ~ \log P_Y(y) + \log P_{S|Y}(s|y) - \underbrace{\log P_S(s)}_{\text{constant}}
\end{equation}
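As a minimal sketch of this decision rule, the snippet below computes a Naive Bayes posterior in log space for a two-class toy model. The vocabulary, the probabilities, and the helper name `nb_posterior` are all hypothetical values chosen for illustration, not estimates from any corpus.

```python
import numpy as np

# Hypothetical toy parameters: 2 classes, 3-word vocabulary (illustration only)
toy_classes = ["neg", "pos"]
toy_vocab = {"good": 0, "bad": 1, "film": 2}
log_prior = np.log(np.array([0.5, 0.5]))        # log P_Y(y)
log_cond = np.log(np.array([[0.1, 0.6, 0.3],    # log P(w | neg)
                            [0.6, 0.1, 0.3]]))  # log P(w | pos)


def nb_posterior(tokens):
    """Posterior over classes, assuming words are independent given the class."""
    ids = [toy_vocab[w] for w in tokens]
    # log joint per class: log P_Y(y) + sum_i log P(w_i | y)
    log_joint = log_prior + log_cond[:, ids].sum(axis=1)
    # subtracting log P_S(s) normalises; it does not change the argmax
    log_evidence = np.logaddexp.reduce(log_joint)
    return np.exp(log_joint - log_evidence)


posterior = nb_posterior(["good", "film"])
prediction = toy_classes[int(np.argmax(posterior))]
print(posterior, prediction)
```

Because $\log P_S(s)$ is the same for every class, dropping the normalisation step would give the same prediction.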

Generative models are elegant and, in general, they can be quite flexible. The Naive Bayes model, in particular, is not *very* flexible. It does not easily accommodate rich and highly dependent features. For example, to capture the scope of a negation or of a conjunction of contrast (e.g., `but` in `The plot was interesting , but I expected more of the acting`) we need to model long-range dependencies in the sentence. 
Writing linguistic rules for this is not immediately obvious, thus we would like to design algorithms that can eventually figure these dependencies out on their own. 

One way to have the NB model capture such long-range dependencies is to allow it to generate complex features beyond unigrams, for example bigrams or skip-bigrams (pairs of words that are not necessarily adjacent, for example, `(not, nice)` in `that was not very nice`). This approach has at least two caveats: the feature space can grow very large, with many very sparse features; and, given a class, every feature is generated from the same distribution. 
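To make the notion of a skip-bigram concrete, here is a small sketch that enumerates ordered word pairs separated by a limited number of tokens. The function name `skip_bigrams` and the `max_skip` parameter are illustrative choices, not part of nltk or sklearn.

```python
from itertools import combinations


def skip_bigrams(tokens, max_skip=2):
    """Return ordered word pairs with at most `max_skip` tokens between them."""
    pairs = []
    for i, j in combinations(range(len(tokens)), 2):
        if j - i - 1 <= max_skip:  # number of tokens skipped between positions i and j
            pairs.append((tokens[i], tokens[j]))
    return pairs


pairs = skip_bigrams("that was not very nice".split(), max_skip=1)
print(pairs)
```

Even for this five-word sentence and a single allowed skip, there are more pairs than words, which hints at how quickly such a feature space grows.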

**Quiz** What is the problem with generating a very heterogeneous set of features (e.g., words, bigrams, linguistic features, binary features, count features, etc.) from the same conditional distribution?

<details>
    <summary><b>SOLUTION</b></summary>
    
They are artificially treated as if they existed in the same sample space. In reality, the sample space of words is very different from the sample space of bigrams (the latter is much bigger), for example. The frequency of these different features (which is the statistic that the feature distributions of an NBC can capture) varies widely depending on the feature type. The model has no automatic mechanism to learn the relative importance of the different feature sets.
    
</details>

---

In T2, you designed a Naive Bayes classifier yourself. In A2, you will use scikit-learn for that, so that you can save coding time and concentrate on other aspects of the problem.

Let us start by loading labelled data from nltk and preparing it in a format compatible with scikit-learn's API.

In [None]:
from nltk.corpus import sentence_polarity  # binary sentiment classification
from nltk.corpus import subjectivity  # binary classification
from nltk.corpus import brown  # 15-way document classification 

In [None]:
# order of difficulty: subjectivity < sentence_polarity < brown

# sentence_polarity is small, so you can use it to get comfortable with the assignment
# without having to wait too long

# in some exercises you may be asked to use a specific corpus

corpus = sentence_polarity  

In [None]:
labels = corpus.categories()
C = len(labels)
print("{}-way classification:\n{}".format(C, '\n'.join(labels)))

As usual, we will split our observations into three disjoint sets: 80% for training, 10% for whatever development purposes we have, and 10% for testing the generalisation of our classifier at the end. 

In [None]:
import numpy as np


def prepare_corpus(nltk_corpus, categories, seed=23, BOS='<s>', EOS='</s>'):
    """
    Prepare an nltk text categorization corpus in a sklearn friendly format.
    
    This function is very similar to what you saw in T2, but here we add BOS tokens in addition to EOS tokens 
    (while the BOS token has no effect in NBC with unigram conditionals, 
    it can be useful for some of the feature-richer classifiers we will develop here).
    
    :param nltk_corpus: something like sentence_polarity
    :param categories: a list of categories (each a string), 
        sklearn will treat categories as 0-based integers, thus we will map the ith element in this list to y=i
    :param seed: for reproducibility
    :param BOS: if not None, start every sentence with a single BOS token
    :param EOS: if not None, end every sentence with a single EOS token
    :return: training, dev, test
        each an np.array such that 
        * array[:, 0] are the inputs (documents, each a string)
        * array[:, 1] are the outputs (labels)
    """
    pairs = []    
    prefix = [BOS] if BOS else []
    suffix = [EOS] if EOS else []
    for label in categories:  # here we pair doc (as a single string) and label (string)
        # this time we will concatenate the EOS symbol to the string
        pairs.extend((' '.join(prefix + s + suffix), label) for s in nltk_corpus.sents(categories=[label]))
    # we turn the pairs into a numpy array
    # np arrays are very convenient for the indexing tools np provides, as we will see
    pairs = np.array(pairs)
    # it's good to shuffle the pairs
    rng = np.random.RandomState(seed)    
    rng.shuffle(pairs)
    # let's split the np array into training (80%), dev (10%), and test (10%)
    num_pairs = pairs.shape[0]
    # we can use slices to select the first 80% of the rows
    training = pairs[0:int(num_pairs * 0.8),:]
    # and similarly for the next 10%
    dev = pairs[int(num_pairs * 0.8):int(num_pairs * 0.9),:]
    # and for the last 10%
    test = pairs[int(num_pairs * 0.9):,:] 
    return training, dev, test

Map your choice of corpus to sklearn's style:

In [None]:
training, dev, test = prepare_corpus(corpus, labels)

In [None]:
training.shape, dev.shape, test.shape

In [None]:
from tabulate import tabulate
print(tabulate([[dev[0, 1], dev[0, 0]], [dev[1, 1], dev[1, 0]]], headers=['label', 'doc']))

**Quiz** Use sklearn `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.naive_bayes.MultinomialNB` to implement a Naive Bayes classifier with only a few lines of code. Use `sklearn.metrics.classification_report` to analyse performance on dev set.

**Tip** You can use `sklearn.pipeline.Pipeline` to encapsulate the vectorizer and the classifier in a single object that supports the fit/predict functionality of the classifier while dealing with vectorization.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

<details>
    <summary><b>SOLUTION</b></summary>
    
```python
cls_nb = Pipeline(
    [
        ('vect', CountVectorizer(ngram_range=(1,1))),  # unigram counts
        ('clf', MultinomialNB(alpha=0.5)),             # NBC parameterised using ngram counts (equivalent to our NaiveBayesClassifier from T2)    
    ]
)
cls_nb.fit(training[:, 0], training[:, 1])
print("NBC {}-way".format(C))
print(classification_report(dev[:,1], cls_nb.predict(dev[:, 0])))    
# Accuracy close to 0.51 for Brown and close to 0.78 for sentence_polarity
```    

</details>

---

In [None]:
cls_nb = Pipeline(
    [
        ('vect', CountVectorizer(ngram_range=(1,1))),
        ('clf', MultinomialNB(alpha=0.5)),                 
    ]
)
cls_nb.fit(training[:, 0], training[:, 1])
print("NBC {}-way".format(C))
print(classification_report(dev[:,1], cls_nb.predict(dev[:, 0])))

<a name="ex1" style="color:red">**Exercise 1**</a> You can use CountVectorizer to obtain a heterogeneous feature set by changing the `ngram_range` argument of its constructor to `(1, 2)`, for example, which gathers counts for unigrams and bigrams. 

* In this exercise you should use the `brown` corpus.
* Compare two versions of NBC in terms of performance on dev set: one version should use unigram counts only, the other version should use both unigram and bigram counts.
* Produce a table of results comparing the two classifiers in terms of precision, recall, and f1-score.


Even though more information is presumably better, you should observe that richer features are not helping. That is likely because NBC cannot distinguish the different feature types (unigrams vs bigrams), leading to sub-optimal use of statistics. 

---

Did you know you can use sklearn to plot a confusion matrix?

In [None]:
import matplotlib.pyplot as plt
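# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay (>= 1.0) replaces it
from sklearn.metrics import ConfusionMatrixDisplay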


<details>
    <summary><b>Show me how</b></summary>
    
```python
import matplotlib.pyplot as plt
# requires scikit-learn >= 1.0 (plot_confusion_matrix was removed in 1.2)
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(C/2, C/2))  # create a figure that's large enough for our task
_ = ConfusionMatrixDisplay.from_estimator(
    cls_nb, # classifier pipeline
    dev[:,0], # inputs
    dev[:,1], # true labels
    xticks_rotation='vertical', # nice for visualisation
    ax=ax  # use our large figure
)
plt.show()    
```    

</details>

---

If instead of plotting, you want the actual data of the confusion matrix, you can use `sklearn.metrics.confusion_matrix`.
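A minimal sketch of that call on hand-made labels (the label lists below are made up for illustration). Rows correspond to true classes and columns to predicted classes, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# made-up gold labels and predictions, for illustration only
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

# cm[i, j] counts documents whose true class is i and predicted class is j
cm = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
print(cm)
```

With the notebook's own objects, the analogous call would be `confusion_matrix(dev[:, 1], cls_nb.predict(dev[:, 0]))`.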

## <a name="disc">  Discriminative Models

A **discriminative model** computes the probability $P_{Y|S}(y|w_{1:m})$ *directly* by means of a parametric function that maps some input text $w_{1:m}$ to a probability vector over $C$ classes. Suppose this vector-valued function is called $\boldsymbol \pi$; mathematically, we write:

\begin{equation}
\boldsymbol \pi: \Sigma^* \to \Delta_{C-1}
\end{equation}

That is, $\boldsymbol \pi$ maps from the space of all strings $\Sigma^*$ to the space of $C$-dimensional probability vectors. Then, the probability $P_{Y|S}(y|w_{1:m})$ of any pair $(w_{1:m}, y)$ is given by the $y$th element of the vector $\boldsymbol\pi(w_{1:m})$, that is, $\pi_y(w_{1:m})$.

As before, our models have parameters $\boldsymbol\theta$ that we adjust for them to perform well, in this case, the parameters control the function $\boldsymbol\pi$, and we write it as $\boldsymbol\pi(w_{1:m}; \boldsymbol\theta)$.

**Notation guideline**

From now on, we will adopt the notation that is more common (and convenient) in discriminative models. We will refer to the input text (for us, that is a sentence, a paragraph, or document) as simply $x$.  

Let $X$ take on values in the space $\Sigma^*$ of all strings, and let $Y$ take on values in the set $\mathcal Y = \{1, \ldots, C\}$, then our discriminative model prescribes the following statistical process:

\begin{equation}
    Y|X=x \sim \mathrm{Cat}(\boldsymbol\pi(x; \boldsymbol\theta))
\end{equation}

which therefore assigns probability $P_{Y|X}(y|x, \boldsymbol\theta)=\mathrm{Cat}(y|\boldsymbol\pi(x; \boldsymbol\theta)) = \pi_y(x; \boldsymbol\theta)$ to a pair $(x, y)$.


**Terminology** See that the discriminative model *cannot* be used to generate text. It simply does not have a generative story of $x$, and thus it will always need a user (or dataset) to produce input texts. In some applications, this may be a limitation, but in *many* language technology applications, this is not at all a limitation.



**Quiz** How do we classify with a discriminative model? What is the time complexity of this procedure? Assume that an evaluation of $\boldsymbol \pi$ takes constant time $t$.


<details>
    <summary><b>SOLUTION</b></summary>

We output the mode of the conditional distribution:
    
\begin{equation}
    y^\star = \mathrm{argmax}_{y \in \{1, \ldots, C\}} ~ \pi_y(x; \boldsymbol\theta)
\end{equation}    
    
This involves one assessment of $\boldsymbol \pi$ and then a search for the best class out of $C$ candidates, thus, with constant time for $\boldsymbol\pi$ itself, this takes time $\mathcal O(t + C)$. 
    
As we do not really get to choose $C$ (it is task dependent), the performance of our classifier will depend on how efficient $\boldsymbol \pi$ is.
    
</details>

---

## <a name="lr"> Logistic Regression

Our first discriminative model will be a linear model, which in the context of classification (as opposed to 
regression) is better known as **logistic regression** or **log-linear model**.


A general logistic regression model for $C$-way classification is parameterised as follows:

\begin{align}
    Y|X=x &\sim \mathrm{Cat}(\boldsymbol \pi(x; \boldsymbol\theta)) \\
    \boldsymbol\pi(x; \boldsymbol\theta) &= \mathrm{softmax}(\mathbf W \mathbf f(x) + \mathbf b) \\
    \boldsymbol \theta &= \{\mathbf W, \mathbf b\}
\end{align}

where 

* $\mathbf f$ is a vector-valued function that maps the input text $x$ to a $D$-dimensional vector of real-valued features, thus $\mathbf f(x) \in \mathbb R^D$ is a feature representation (or encoding) of the input text $x$;
* $\mathbf W  \in \mathbb R^{C\times D}$ is a matrix where each of the $C$ rows is a $D$-dimensional vector of real values, each of which captures the relative contribution of a feature to one of the $C$ classes;
* $\mathbf b \in \mathbb R^C$ is a $C$-dimensional vector of real values, each of which expresses a bias towards (if positive) or against (if negative) one of the classes;
* the linear model $\mathbf W \mathbf f(x) + \mathbf b$ produces a $C$-dimensional vector of scores (or logits);
* and the $\mathrm{softmax}$ function maps the logits from $\mathbb R^C$ to a $C$-dimensional probability vector in the probability simplex $\Delta_{C-1}$.

Note that we use column vectors (as commonly done in mathematics), thus when we say $\mathbf b \in \mathbb R^C$ you should imagine a numpy array with shape `[C, 1]`. You should always be careful with these conventions as oftentimes software packages adopt different conventions, sometimes more or less standardised. Most technical descriptions of models (as opposed to code documentation) use column vectors, unless otherwise noted. 
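The parameterisation above can be sketched in a few lines of numpy. Everything here is illustrative: the sizes, the random weights, and the feature vector `f_x` are made up, and for brevity the sketch uses 1-D arrays of shape `(C,)` rather than explicit column vectors of shape `[C, 1]`.

```python
import numpy as np


def softmax(s):
    """Map a vector of logits to a probability vector."""
    s = s - s.max()   # shifting all logits by a constant leaves the output unchanged
    e = np.exp(s)
    return e / e.sum()


num_classes, num_features = 3, 4                  # hypothetical C and D
rng = np.random.default_rng(0)
W = rng.normal(size=(num_classes, num_features))  # one row of weights per class
b = np.zeros(num_classes)                         # one bias per class
f_x = np.array([1.0, 0.0, 2.0, 0.0])              # made-up feature vector f(x)

logits = W @ f_x + b          # C scores
pi = softmax(logits)          # C probabilities
y_star = int(np.argmax(pi))   # classification: mode of Cat(pi)
print(pi, y_star)
```

Since softmax is monotone in each logit, the predicted class is also the argmax of the logits themselves.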



**Quiz** Define the softmax function, and explain why we are sure that its output can always be safely interpreted as the parameters of a Categorical distribution.


<details>
    <summary><b>SOLUTION</b></summary>

The softmax function maps a vector of scores $\mathbf s \in \mathbb R^C$ to:
    
\begin{equation}
    \mathrm{softmax}(\mathbf s) = \left\langle \frac{\exp(s_1)}{\sum_{c=1}^C \exp(s_c)}, \ldots, \frac{\exp(s_C)}{\sum_{c=1}^C \exp(s_c)} \right\rangle
\end{equation}    
    
Every element of the output is positive, because the numerator is an exponentiated number (and exp is always positive) and the denominator is positive (since it is a sum of positive quantities). Moreover, because the denominator is the sum of the numerators, we know that the output is such that its elements add up to 1, as a Categorical parameter should.
    
</details>

---

### <a name="mle"> Parameter Estimation 

---
    
**Remark** This section is a brief review of MLE for feature-rich models. The theory presented here has little impact on the assignment itself, as you will not be implementing parameter estimation yourself but delegating it to scikit-learn. However, your conceptual understanding of what happens in parameter estimation for these models is something we will assess in other forms (e.g., reading, exam, technical report).
    
---    
    
In MLE, we are given a dataset of observations $\mathcal D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ which we use to assess the log-likelihood of any particular choice of $\boldsymbol\theta$:

\begin{equation}
    \mathcal L_{\mathcal D}(\boldsymbol\theta) = \sum_{n=1}^N \log P_{Y|X}(y^{(n)}|x^{(n)}, \boldsymbol \theta)
\end{equation}

When the log-likelihood is concave (equivalently, the negative log-likelihood is convex), the MLE solution, that is, the parameter vector $\boldsymbol \theta$ for which $\mathcal L_{\mathcal D}(\boldsymbol\theta)$ attains its maximum value, can be obtained by solving the following problem:

\begin{equation}
    \nabla_{\boldsymbol\theta} \mathcal L_{\mathcal D}(\boldsymbol\theta) = \mathbf 0
\end{equation}

That is, by finding the $\boldsymbol\theta$ that makes the gradient of the log-likelihood evaluate to a vector of zeros.

The *gradient* $\nabla_{\boldsymbol\theta} \mathcal L_{\mathcal D}(\boldsymbol\theta)$ is the vector of partial derivatives of the log-likelihood with respect to the coordinates of the parameter vector. 
As the log-likelihood value depends on all data points, and derivatives are linear, the gradient we are looking for aggregates contributions from all observations:

\begin{align}
\nabla_{\boldsymbol\theta} \mathcal L_{\mathcal D}(\boldsymbol\theta) &= \nabla_{\boldsymbol\theta} \sum_{n=1}^N \log P_{Y|X}(y^{(n)}|x^{(n)}, \boldsymbol \theta) \\
 &=\sum_{n=1}^N \nabla_{\boldsymbol\theta} \log P_{Y|X}(y^{(n)}|x^{(n)}, \boldsymbol \theta)
\end{align}


We know how to solve this equation exactly for simple models, such as the $n$-gram LM, or the NBC (it leads to the count and divide formula that we employed so many times). For some other models, we do not know how to solve this exactly, but we know how to solve it (or approximately solve it) by numerical optimisation. 

A general strategy that works well as long as we have a differentiable likelihood is to use gradient ascent, whereby we start with an initial guess $\boldsymbol \theta^{(0)}$, and iteratively refine it by taking small steps (controlled by a positive learning rate $\eta$) in the direction of steepest ascent:

\begin{equation}
    \boldsymbol \theta^{(t)} = \boldsymbol \theta^{(t-1)} + \eta \nabla_{\boldsymbol\theta^{(t-1)}} \mathcal L_{\mathcal D}(\boldsymbol\theta^{(t-1)})
\end{equation}
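The update rule can be sketched for a softmax (logistic regression) model on a tiny hand-made dataset. The data, the learning rate, and the helper names are assumptions made for illustration; this is full-batch gradient ascent on the log-likelihood, not what sklearn runs internally.

```python
import numpy as np


def softmax_rows(S):
    """Row-wise softmax of a [N, C] matrix of logits."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)


def log_likelihood(W, b, F, y):
    P = softmax_rows(F @ W.T + b)                # [N, C] class probabilities
    return np.log(P[np.arange(len(y)), y]).sum()


def gradient(W, b, F, y):
    P = softmax_rows(F @ W.T + b)
    R = -P
    R[np.arange(len(y)), y] += 1.0               # onehot(y) - pi(x), per data point
    return R.T @ F, R.sum(axis=0)                # gradients w.r.t. W [C, D] and b [C]


# made-up feature vectors (N=6, D=2) and 0-based labels
F = np.array([[2., 0.], [1., 0.], [0., 1.], [0., 2.], [1., 1.], [1., 1.]])
y = np.array([0, 0, 1, 1, 0, 1])

W, b = np.zeros((2, 2)), np.zeros(2)             # initial guess theta^(0)
eta = 0.05                                       # learning rate
history = [log_likelihood(W, b, F, y)]
for t in range(100):
    gW, gb = gradient(W, b, F, y)
    W, b = W + eta * gW, b + eta * gb            # step in the direction of steepest ascent
    history.append(log_likelihood(W, b, F, y))
print(history[0], history[-1])
```

With a sufficiently small `eta`, the log-likelihood should not decrease from one step to the next; with a learning rate that is too large, the iterates can overshoot.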


This can be rather difficult if the dataset $\mathcal D$ is too large, because we would need to keep all data points and all predictions $\boldsymbol \pi(x^{(1)}; \boldsymbol \theta), \ldots, \boldsymbol \pi(x^{(N)}; \boldsymbol \theta)$ in memory for every parameter update, since the gradient aggregates contributions from all of them:

\begin{align}
\nabla_{\boldsymbol\theta^{(t-1)}} \mathcal L_{\mathcal D}(\boldsymbol\theta^{(t-1)}) &= \sum_{n=1}^N \nabla_{\boldsymbol\theta^{(t-1)}} \log P_{Y|X}(y^{(n)}|x^{(n)}, \boldsymbol \theta^{(t-1)}) \\
 &= \sum_{n=1}^N \nabla_{\boldsymbol\theta^{(t-1)}} \log \pi_{y^{(n)}}(x^{(n)}; \boldsymbol\theta^{(t-1)})
\end{align}
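For the logistic regression model defined earlier, each summand has a simple closed form. The following is a sketch of the derivation, writing $\mathbf s = \mathbf W \mathbf f(x) + \mathbf b$ for the logits and $\mathbf e_y$ for the one-hot vector of class $y$ (a symbol introduced here only for illustration):

\begin{align}
\log \pi_y(x; \boldsymbol\theta) &= s_y - \log \sum_{c=1}^C \exp(s_c) \\
\frac{\partial}{\partial s_k} \log \pi_y(x; \boldsymbol\theta) &= [k = y] - \pi_k(x; \boldsymbol\theta) \\
\nabla_{\mathbf b} \log \pi_y(x; \boldsymbol\theta) &= \mathbf e_y - \boldsymbol\pi(x; \boldsymbol\theta) \\
\nabla_{\mathbf W} \log \pi_y(x; \boldsymbol\theta) &= \left(\mathbf e_y - \boldsymbol\pi(x; \boldsymbol\theta)\right) \mathbf f(x)^\top
\end{align}

In words, the gradient pushes the score of the observed class up and the scores of all classes down in proportion to their current probability, scaled by the features that fired.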


To circumvent this problem we use *stochastic optimisation*, whereby in each step we replace the *exact* gradient above by a *Monte Carlo* (MC) estimate of it
\begin{align}
    \nabla_{\boldsymbol\theta} \mathcal L_{\mathcal D}(\boldsymbol\theta) &\overset{\text{MC}}{\approx} \frac{N}{B} \sum_{b=1}^B \nabla_{\boldsymbol\theta} \log P_{Y|X}(y^{(b)}|x^{(b)}, \boldsymbol\theta)\\
    &\quad\text{where } (x^{(b)}, y^{(b)}) \sim \mathcal D
\end{align}
which is just the sample mean of gradient vectors computed for a batch of $B$ data points sampled uniformly at random from the dataset, rescaled by $N$ so that it estimates the sum over all $N$ observations (in practice this constant is often absorbed into the learning rate $\eta$). We can sample with or without replacement; the estimate is unbiased either way.


Under some conditions about the learning rate $\eta$ and the shape of $\mathcal L_{\mathcal D}(\boldsymbol \theta)$, this procedure is guaranteed to converge to the optimum in finite time. 
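A small numerical sketch of why mini-batch estimates are reasonable stand-ins for the exact gradient. The per-example "gradients" below are random placeholder vectors (in a real model they would be the gradients of the per-example log-probabilities), and the sizes are arbitrary. The comparison is made between per-example averages, so the constant factor $N$ that relates the average to the summed log-likelihood is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B = 12, 3, 4                      # dataset size, parameter dimension, batch size

# placeholder per-example gradient vectors, one row per data point
per_example_grads = rng.normal(size=(N, D))
full_mean = per_example_grads.mean(axis=0)   # exact average gradient over the dataset

# one pass over a shuffled dataset, split into N // B batches (sampling without replacement)
perm = rng.permutation(N)
batches = perm.reshape(N // B, B)
estimates = np.stack([per_example_grads[idx].mean(axis=0) for idx in batches])

print(estimates)                        # each row is a noisy estimate of full_mean
print(estimates.mean(axis=0), full_mean)
```

Each batch estimate differs from the exact average, but averaged over a full shuffled pass they recover it, which is the sense in which the noise cancels out over many updates.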

This procedure is seemingly rather simple, but an efficient implementation requires sophisticated data structures and algorithms, since our feature functions are sparse and high-dimensional. 

Instead of implementing it yourself, you will work with sklearn's implementation, which is robust and correct.

In [None]:
from sklearn.linear_model import LogisticRegression

As always, begin by checking the documentation `LogisticRegression?`. 

In sklearn, we can find a number of solvers (optimisers) that address the MLE problem above. For small datasets it's okay to use the default, but as datasets grow (e.g., brown corpus) the default can be quite slow and require too much memory. We therefore advise that you use `solver='sag'` (stochastic average gradient), which implements a stochastic gradient-based algorithm for you.

The other parameter that matters is the coefficient of regularisation.

Linear models can have *millions* of features, and thus millions of parameters. More often than not, we have less *data* than we have parameters. This makes your function too expressive, and essentially, the optimiser can find parameter vectors that are optimal in terms of log-likelihood but terrible in terms of generalisation to heldout data.

To fight this tendency to *overfit* to observations, we impose a penalty on the *complexity* of the model. You can think of the complexity of the model as the effective number of free parameters it has. These complexity penalties take the form of a cost, or *regulariser* function that judges general properties of the parameter vector, such as its norm or magnitude.

One of the most common regularisers is the $L_2$ norm (length of a vector in a Euclidean space). That is the regulariser we will be using.

A regularised version of MLE solves the following problem:

\begin{align}
\theta^\star &= \mathrm{argmax}_{\boldsymbol \theta} ~ \mathcal L_{\mathcal D}(\boldsymbol \theta) - \lambda L_2(\boldsymbol\theta)
\end{align}

where $\lambda \ge 0$ is the importance of the regulariser. This is a hyperparameter which we cannot fix automatically, and instead have to count on a development set, or cross-validation, to test a range of reasonable options. In sklearn, you can control this via the argument `C` of `LogisticRegression`, though note that this `C` is interpreted as $\lambda^{-1}$, thus larger `C` means less regularisation.
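The effect of `C` can be observed directly on the learned weights. The sketch below fits the same model with three values of `C` on a made-up six-document dataset (the documents, labels, and chosen values are illustrative) and records the Euclidean norm of the coefficient vector.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# made-up toy data, for illustration only
toy_docs = ["good film", "great acting", "good plot",
            "bad film", "boring plot", "bad acting"]
toy_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
X_toy = CountVectorizer().fit_transform(toy_docs)

norms = {}
for C_value in [0.01, 1.0, 100.0]:
    # default solver and L2 penalty; only the regularisation strength changes
    clf = LogisticRegression(C=C_value, max_iter=1000).fit(X_toy, toy_labels)
    norms[C_value] = float(np.linalg.norm(clf.coef_))
print(norms)
```

Smaller `C` (stronger regularisation) should shrink the weights towards zero, while larger `C` lets them grow to fit the training data more closely.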

**Quiz** Train a classifier using sklearn's LogisticRegression, for features use unigram counts via `CountVectorizer`. 


<details>
    <summary><b>SOLUTION</b></summary>

```python
cls_lr = Pipeline(
    [
        ('vect', CountVectorizer(ngram_range=(1,1))),  # map strings to unigram counts (as in NBC)
        ('clf', LogisticRegression(
            max_iter=500,  # run stochastic gradient ascent for this many iterations
            verbose=2,     # some messages of progress
            C=100.,        # controls regularisation
            solver='sag')  # choice of solver ('sag' is a stochastic algorithm that is much faster and more memory efficient than the default)
        ),
    ]
)
cls_lr.fit(training[:, 0], training[:, 1])  # this may take a moment with large corpora and/or large feature sets
print(classification_report(dev[:,1], cls_lr.predict(dev[:, 0])))    
```
    
</details>

---

<a name="ex2" style="color:red">**Exercise 2**</a> Compare NBC and LogisticRegression.


* In this exercise you should use the `brown` corpus.
* Compare the two model types (NBC vs LR) as well as two types of feature spaces, namely, unigram counts only vs unigram and bigram counts.
* You don't need to hand-tune the regularisation coefficient of LR nor NBC's smoothing, simply use `C=100.` for the former and `alpha=0.5` for the latter.
* Produce a table of results comparing the four systems in terms of precision, recall, and f1-score on the development set.
* Discuss your findings.

**Tip** For discriminative models, the relative magnitude of the features matters more than for generative models since the goal is not to explain the input but rather the mapping to the probability of the labels, thus it is a good idea to normalise the counts. A yet more effective thing to do is to normalise the counts while also taking distributional information about the relevance of the feature into account. You don't need to code anything for that, you can simply use a *Transformer* in sklearn, to transform your counts to normalised counts (have a look at [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html?highlight=tfidf#sklearn.feature_extraction.text.TfidfTransformer), which we also use in our own solution below). A pipeline that uses this transformer looks like this:

```python
from sklearn.feature_extraction.text import TfidfTransformer
cls_lr = Pipeline(
    [
        ('vect', CountVectorizer(ngram_range=(1,1))),  # maps strings to (sparse) vectors of counts
        ('tfidf', TfidfTransformer()),                 # maps counts to normalised tfidf scores
        # ...
    ]
)
```
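To see what the transformer does to raw counts, here is a small sketch on three made-up documents. With its default settings, `TfidfTransformer` rescales every row to unit Euclidean length and down-weights terms that occur in many documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# made-up documents, for illustration only
toy_docs = ["the film was good", "the film was bad bad bad", "the acting"]

vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)             # sparse matrix of raw counts
tfidf = TfidfTransformer().fit_transform(counts)  # sparse matrix of tf-idf scores

# Euclidean norm of each row after the transformation
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

vocab = vect.vocabulary_
weight_the = tfidf[2, vocab["the"]]        # "the" occurs in every document
weight_acting = tfidf[2, vocab["acting"]]  # "acting" occurs in one document
print(row_norms, weight_the, weight_acting)
```

In the third document both words occur once, yet `acting` receives the larger weight because it is rarer across documents.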

## <a name="ff"> Richer Features

The point behind logistic regression is to expand our feature representation to include richer information about potential dependencies that signal a class or another.

In this part you will develop feature functions for your logistic regression model.

Our customised feature functions will fit right into the sklearn pipeline API. We provide you with the basic structure and an example, which you can modify into your favourite feature function. Check the complete example before coding anything.

In [None]:
from collections import Counter
import re
# base class for transforming the input in sklearn
from sklearn.base import TransformerMixin  


class MyFF(TransformerMixin):
    """
    Our input is text (a string of space-separated tokens)
    which we will transform to a dictionary of sparse features (using python dict).
    
    Check the example and then include your own ideas for features.
    """

    def __init__(self, lowercase=False, normalize_emphasis=False, unigrams=False, length=False, count_ed=False):
        """
        :param lowercase: should we lowercase before doing anything?
        :param normalize_emphasis: a toy demonstration of a rule to normalize text
            this one will turn `soooo` into `so` and `naaah` into `nah` and keep track of how
            many times that happened
        :param unigrams: count unigrams
        :param length: a length feature
        :param count_ed: count words that end in `ed`
        """
        self._lowercase = lowercase
        self._unigrams = unigrams
        self._length = length
        self._count_ed = count_ed
        self._normalize_emphrasis = normalize_emphasis
    
    def fit(self, X, y=None, **fit_params):
        """We don't need to change this"""
        return self
    
    def get_params(self, deep=True):
        """We don't need to change this"""
        return dict()

    def _ff(self, string):
        """
        This is our feature function, it maps a document to a space of real-valued features.
        It uses python dict to represent only the features with non-zero values.
        
        :param string: a document as a string 
        :returns: dict (key is string, value is int/float)
        """
        fvec = Counter()  # Let's count how many times each feature fires
        # we can do some pre-processing if we like
        if self._lowercase:
            string = string.lower()
        # we then tokenize on spaces
        s = string.split()
        # and begin applying our feature templates
        if self._normalize_emphrasis: # here we have a toy example (use regex to find two specific patterns)
            _s = []
            n = 0
            for w in s:
                # fullmatch with 2+ repeats: rewrite `soooo` but leave `so`, `some`, `song` alone
                if re.fullmatch(r'so{2,}', w.lower()):
                    _s.append('so')
                    n += 1
                elif re.fullmatch(r'na{2,}h', w.lower()):
                    _s.append('nah')
                    n += 1
                else:
                    _s.append(w)
            s = _s
            fvec["emphasis"] = n
        # we can count word occurrences, as this is pretty important
        if self._unigrams:
            fvec.update(('word={}'.format(w) for w in s))
        # we can also keep length around
        if self._length:
            fvec['length'] = len(s)
        # and count some simple patterns as well
        if self._count_ed:
            fvec['*ed='] = sum(1 for w in s if w.endswith('ed'))
                   
        return fvec
    
    def transform(self, X, **transform_params):
        """Here we transform each input (a string) into a python dict full of features"""
        return [self._ff(s) for s in X]

Let's see what this does to some examples:

In [None]:
ff = MyFF(lowercase=True, normalize_emphasis=True, unigrams=True, length=True, count_ed=True)
ff.transform(['I loved this film !', "Captain Marvel is sooooo awesome !", "Did I like it? Naah"])

And here is how we plug it into a classifier pipeline:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction import DictVectorizer


text_clf = Pipeline(
    [
        ('ff', MyFF(
            unigrams=True, 
            length=True)
        ),
        ('dict', DictVectorizer()),   # This will convert python dicts into efficient sparse data structures
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression(max_iter=500, verbose=2, C=100., solver='sag')),
    ]
)

**Warning** If your feature spaces grow too large, you may run out of memory. If that happens you can replace 
`DictVectorizer` in the pipeline with `sklearn.feature_extraction.FeatureHasher`, which has a fixed memory footprint (at the expense of some accuracy, you can think of it as a lossy compression of the original features).
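A minimal sketch of the hasher on its own, using two hand-written feature dicts of the kind a feature function might return (the dicts and the choice of `n_features` are illustrative). The number of output columns is fixed in advance, no matter how many distinct feature names appear.

```python
from sklearn.feature_extraction import FeatureHasher

# hand-written feature dicts, for illustration only
feature_dicts = [
    {"word=i": 1, "word=loved": 1, "length": 5},
    {"word=nah": 1, "emphasis": 1, "length": 6},
]

# feature names are hashed into a fixed number of columns; no vocabulary is stored
hasher = FeatureHasher(n_features=2**10, input_type="dict")
X_hashed = hasher.transform(feature_dicts)   # sparse matrix with 2**10 columns
print(X_hashed.shape, X_hashed.nnz)
```

In a pipeline, a step such as `('hash', FeatureHasher(n_features=2**18))` would take the place of `('dict', DictVectorizer())`; distinct features may collide in the same column, which is the source of the lossy compression.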

<a name="ex3" style="color:red">**Exercise 3**</a> Design feature functions and test how your logistic regression classifier reacts to them in terms of performance on development set. 

Note that it's easier to beat NBC where it's not already good enough (e.g., surely `brown`, maybe `sentence_polarity`) and harder to beat it where it is pretty good (e.g., subjectivity). Though note that if your feature space is too large, you may run into lack of computing resources, so be careful.  

You can sample ideas from this list, from the textbook, and you can also include your own ideas.

1. distributional features other than unigram counts such as skip-bigram counts: skip-bigrams are word pairs that are not necessarily adjacent (e.g., `(I, run)` in `I like to run`). 
2. case features: help detect named-entities (e.g., Apple is likely the company, not the fruit, except when it's the first token of the sentence, in which case it's harder to tell)
3. number, date: 
4. text normalisation: detect numbers and dates using a regular expression and replace them by tags such as NUM or DATE; stem (see `from nltk.stem.snowball import SnowballStemmer`) or lemmatize (see `from nltk.stem import WordNetLemmatizer`) words
5. scope features: detect linguistic scope such as negation
6. membership features: detect which words belong to a category that is relevant to the task (e.g., which or how many words in the sentence are known to generally express positive/negative sentiment, see for example `from nltk.corpus import opinion_lexicon`).
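As a starting point for one idea from the list, here is a rough sketch of number normalisation with a regular expression. The pattern, the `NUM` tag, and the function name are illustrative assumptions; the pattern only covers plain digit strings with optional `.` or `,` separators and would need extending for dates or other formats.

```python
import re

# digits, optionally followed by groups such as ",50" or ".000" (a deliberately simple pattern)
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalise_numbers(tokens):
    """Replace tokens that look like numbers with the tag NUM."""
    return ["NUM" if NUM_RE.fullmatch(w) else w for w in tokens]


normalised = normalise_numbers("it cost 3,50 in 1999 !".split())
num_count = normalised.count("NUM")   # could itself be used as a count feature
print(normalised, num_count)
```

Inside a feature function like `_ff`, such a step would run on the token list before counting unigrams, so that all numbers share a single feature.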

**Guideline** For full points include three features, all from different categories. Please also have a look at E4 before you start your work.

---

<a name="ex4" style="color:red">**Exercise 4**</a> Now that you have a feature function, run a complete experiment:

* you should have a *baseline* which you intend to improve upon (NBC with unigram counts);
* also pick a stronger baseline (e.g., LogisticRegression with unigram and bigram counts);
* finally, test LogisticRegression with your feature function and see if you can beat the stronger baseline.
* use the dev set for decisions during development;
* for the best version of your NBC baseline, LR baseline, and your own proposed classifier, report performance on test set, including a confusion matrix;
* discuss your findings.

This exercise may be more interesting if you use `brown`, but we will accept `sentence_polarity` if your computer runs out of memory.

*Tip* sometimes to find the best setting for your proposed solution you need to find good hyperparameters (for example, you may need to try a few options for `C` in logistic regression).


**Guidelines** Three points for doing what is required, including using the two features developed in E3. The final two points are for showing extra effort. This can be an extra feature, more effort put into creating an interesting feature, having to put in effort to use the full brown corpus, etc. Please show/argue what extra effort you've made in the comments.