# INFO 3350/6350

## Lecture 12: Neural networks

## Neural networks and deep learning

* We've used NLP tools at many points this semester, but this isn't an NLP class
* That said, neural methods have transformed many areas of NLP over the last decade
    * And deep learning -- a subset of neural methods -- has been very widely applied in machine learning and AI
* Our tasks today: define "neural network," relate neural nets to other learning systems, take a look at how a neural network works, and show how to implement a very simple neural classifier in Python

### What is a neural network?

* A neural network is a computing system comprising artificial neurons
* Neurons were originally (1940s) intended to model organic brain behavior
    * But now, the name is really just a bit of jargon. No one thinks it's important whether or not computational neurons have anything to do with biological neurons.
* Individual neurons are mathematical functions that take a vector of input values and produce a single output value.
    * We've seen lots of these kinds of functions over the semester, not all of them related to actually existing neural networks
    * What matters are the details of the functions and the ways they relate to one another in a network
* In a neural network, the neurons are connected to one another in one or more layers, so that the output of one neuron is the input of another (or many others)
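
To make "a function that takes a vector and produces a single value" concrete, here is a minimal sketch of three neurons wired together. The weights, the biases, and the choice of the sigmoid as the squashing function are all made up for illustration; nothing here is learned.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """One artificial neuron: weighted sum of the inputs plus a bias, then a squashing function."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 0.0, 2.0])  # input vector (say, three word counts)

# Two neurons in a first layer, each with its own hypothetical weights and bias
h1 = neuron(x, np.array([0.5, -1.0, 0.25]), b=-1.0)
h2 = neuron(x, np.array([-0.3, 0.8, 0.1]), b=0.2)

# A third neuron takes the outputs of the first two as its input vector
out = neuron(np.array([h1, h2]), np.array([1.5, -2.0]), b=0.1)

print("first-layer outputs:", h1, h2)
print("network output:", out)
```

Each call to `neuron` returns one number, and the last call shows the "output of one neuron is the input of another" idea in its smallest form.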
    
### Logistic regression

* Logistic regression **is not a neural network** in the modern sense, but it captures much of the spirit of a basic neural network and a lot of the math is related, so let's revisit it
* Fit training data to a linear model: $z = W_0 + W_1 x_1 + W_2 x_2 + ...$
    * Values of $x\ $ are observed properties of an object (counts of individual words, say)
    * The $W\ $s are weights. We multiply the weight associated with each word (for example) by the number of times that word occurs in a document.
        * Multiplying two vectors element by element and then summing the results is called a **dot product**
    * Add up the weight * count products and we produce an output value, $z$
    * Note that values of $z$ can range from -infinity to +infinity
* Transform the linear value into a score between 0 and 1 using the sigmoid function: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
* The sigmoid function looks like this:
    
<img src="./images/sigmoid.png">

* When we train a logistic regression classifier, we're trying to learn the set of weights that produce the most accurate classifications
* We learn the weights by:
    * Initializing to random values (or equal values, or some arrangement that reflects our best guess about the correct weights)
    * Calculating **cross-entropy loss**, that is, how far away are our predicted outputs from the known-true (gold) values.
        * Our goal is to minimize this loss function by adjusting the weights in our model
        * See Jurafsky and Martin, ch. 5 ("Logistic Regression"), for the math, but the short version is that we take the negative log of the probability (ranging from 0 to 1) that the model assigns to the true label (which is either 0 or 1), averaged over the training examples
        * Trivia point: logistic regression is a more advanced version of the **perceptron** (which uses a binary loss function rather than a probabilistic one). The perceptron was invented at Cornell (by Frank Rosenblatt in 1958).
    * Adjusting our weights using **gradient descent**
        * Again, the math isn't important to us, but ... we find the gradient (slope) of the loss function by partial differentiation with respect to each of the weights. In short, we find how the loss function changes in response to small changes in each weight, then move the weight in the direction that minimizes the loss. Repeat until the loss function stops changing (much) and hope we've found the global minimum (that is, the globally best weights).
* If you've been around neural networks and machine learning, these terms will sound familiar: loss function, gradient descent. Now you know what they mean.
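
The training recipe above (initialize, compute cross-entropy loss, adjust by gradient descent, repeat) fits in a few lines of `numpy`. This is a minimal sketch on made-up toy data, not how `sklearn` fits its models; the learning rate, step count, and zero initialization are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p, eps=1e-12):
    """Mean negative log probability assigned to the true labels."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Toy data: two features, label is 1 when their sum is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)  # initialize weights to equal values
b = 0.0          # intercept (W_0 in the linear model)
lr = 0.1         # learning rate: how far to move on each step

losses = []
for step in range(500):
    p = sigmoid(X @ w + b)              # predicted probabilities
    losses.append(cross_entropy(y, p))  # how wrong are we right now?
    grad_w = X.T @ (p - y) / len(y)     # gradient of the loss w.r.t. each weight
    grad_b = np.mean(p - y)             # gradient w.r.t. the intercept
    w -= lr * grad_w                    # step downhill
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == (y == 1))
print("first loss:", losses[0])
print("last loss: ", losses[-1])
print("training accuracy:", accuracy)
```

With all-zero weights every prediction starts at 0.5, so the first loss is log 2; the loss then falls as the weights move in the direction the gradient indicates.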

### From logistic regression to feed-forward networks

* The problem with logistic regression (which is a great classifier for many problems!) is that it can only learn linear relationships between inputs and outputs. If our problem is nonlinear, logistic regression might not work well on it.
* The simplest way to understand the relationship between logistic regression and a basic neural network is that a neural network is made up of multiple logistic-like functions, each of which can learn a different part of the correct solution (where "solution" = function that best fits the training data)
* Here's a schematic representation (from Jurafsky and Martin) of a feed-forward network with a single hidden layer (the middle one, with labels $h_i$):

<img src="./images/neural_network.png">

* There are three layers here: input, hidden, and output.
    * The input layer is the data you feed into the system.
    * The hidden layer is where the weights are adjusted to maximize classification accuracy. This is what *learns*.
    * The output layer translates numerical values calculated in the hidden layer into class probabilities (that is, into specific classification decisions).
* The math in this case is the same as in the logistic case, except that:
    * We have matrices of weights across the neurons, rather than a single vector of weights for a single neuron
    * We have a vector of outputs from the hidden layer, rather than a single, scalar output
    * Gradient descent is harder, because there are more paths to differentiate
        * This is the most consequential difference in practical terms, because it really slows down training
        * The standard approach is **backpropagation**. For details, see Jurafsky and Martin, ch. 7 ("Neural Networks"). It's like partial differentiation, but performed piece-wise backward through all the possible paths from outputs to inputs via the hidden layer(s).
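
Here is a minimal sketch of the forward pass through a network with one hidden layer, using XOR, a standard textbook example of a problem that no single linear model can solve. The weights are set by hand rather than learned, and the hidden units use a ReLU activation instead of the sigmoid (a swap made here because it keeps the arithmetic exact).

```python
import numpy as np

def relu(z):
    """Rectified linear unit: pass positive values through, zero out negatives."""
    return np.maximum(0, z)

# Hand-set parameters (an illustration, not the result of training)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])  # hidden-layer weight matrix: one row per hidden neuron
b = np.array([0.0, -1.0])   # hidden-layer biases
u = np.array([1.0, -2.0])   # output-layer weights

def forward(x):
    h = relu(W @ x + b)  # a vector of hidden-layer outputs
    return u @ h         # a single scalar output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
outputs = [float(forward(x)) for x in X]

for x, out in zip(X, outputs):
    print(x, "->", out)
```

The output is 1 exactly when one input (but not both) is 1. Note the matrix `W` in place of a single weight vector and the vector `h` in place of a single scalar, as described in the list above. A real classifier would also pass the final value through a sigmoid or softmax to get class probabilities.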
        
### From shallow to deep

* Even a neural network with a single hidden layer (of possibly infinite width; that is, made up of arbitrarily many neurons) can be shown to be able to represent a function of arbitrary complexity
    * Note in passing: this is a remarkable result. It means that neural networks are immensely flexible in the relationships between inputs and outputs that they can model.
    * But this fact doesn't imply that it's *easy* to learn a correct or high-performing representation of an arbitrary function in a neural network
* In practice, it can be (sometimes!) more efficient to build networks that are narrower but *deeper*; that have more layers
* Deep learning also largely removes the need for (certain kinds of) feature engineering, since the layers learn maximally effective transformations of the data
    * But the right kinds of data still need to be present in the first place!
    * If you only give your network word counts, it won't magically engineer paratextual features.
* You may have heard of **convolutional** neural networks and **recurrent** neural networks. These are networks whose connections depart from the plain feed-forward pattern, in which every neuron in one layer feeds every neuron in the next.
    * Convolutional networks are (or, were) widely used in image recognition
    * Recurrent networks (in which parts of layers are connected both forward and backward) are (again, were) often used in NLP applications
* All of this is **bloody slow** and involves a lot of matrix math. Two main factors have driven the deep learning revolution over the last two decades:
    * Web-scale data, which provides enough instances to learn fine distinctions in complex decision boundaries
        * A method that can model arbitrarily complex functions isn't much good if you don't have enough data to explore the function space
    * GPUs (graphics cards), which are essentially super-fast matrix calculators
        * These make computing with all that data tractable (more or less)
* More recently, we've discovered that we can do without recurrence or convolution, provided we have enough data. This is the insight behind the transformer architecture, BERT, and all that has followed. We'll have more to say about that in future lectures. 
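
To put rough numbers on the wide-versus-deep trade-off mentioned above, here is a small sketch that counts the weights and biases in a fully connected network. The helper `count_parameters` is hypothetical (it is not part of any library), and it assumes 300 input features and a single output unit. Parameter count is only a crude proxy for training cost and says nothing about accuracy.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases in a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (inputs x outputs) weights plus one bias per output unit
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

wide = count_parameters(300, (300,))    # one wide hidden layer
deep = count_parameters(300, (30, 10))  # two narrower hidden layers

print("one hidden layer of 300: ", wide)
print("hidden layers of 30 & 10:", deep)
```

The narrower, deeper network here has roughly a tenth as many parameters to learn as the wide one.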

## Basic neural network classification in `sklearn`

The only neural classifier built into `sklearn` is the multi-layer perceptron (which isn't really a perceptron at all, but the name stuck). We'll demonstrate it here; it's easy and works as a drop-in replacement for any other classifier.

For more advanced work with neural networks, you'd want to explore frameworks like [Keras](https://keras.io/), [PyTorch](https://pytorch.org/), and [TensorFlow](https://www.tensorflow.org/), which are more flexible and support computations on GPUs. In the next lecture, you'll see this in action.

Here, we're going to work with the **album review data** from our last problem set. Recall that the task is to predict, for a given review, whether its score is above or below the mean of all reviews, given the text of the review itself.

In [1]:
# load embedding representation of reviews data
import numpy as np
import os
import pickle
from   sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif
from   sklearn.linear_model import LogisticRegression
from   sklearn.model_selection import cross_val_score, train_test_split
from   sklearn.neural_network import MLPClassifier
from   sklearn.preprocessing import StandardScaler

with open(os.path.join('supplements', 'X_embed.pickle'), 'rb') as f:
    X = pickle.load(f) # embedding-based
    X = StandardScaler().fit_transform(X) # scale features

with open(os.path.join('supplements', 'X_tfidf.pickle'), 'rb') as f:
    X_tfidf = pickle.load(f) # token-based
    X_tfidf = StandardScaler().fit_transform(X_tfidf.toarray()) # scale features
    
with open(os.path.join('supplements', 'y.pickle'), 'rb') as f:
    y = pickle.load(f) # labels

print("Embedding feature array shape:", X.shape)

Embedding feature array shape: (1836, 300)


### Baseline

First, score a simple logistic regression classifier on word embedding data. This is **not** a neural classifier. We'll use it as a baseline.

In [2]:
%%time
# logit score
logit_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Logit accuracy:", np.mean(logit_scores))

Logit accuracy: 0.6405180073451013
CPU times: user 2.35 s, sys: 50.2 ms, total: 2.4 s
Wall time: 341 ms


In [3]:
# fully naive baseline: share of the positive class (here, the most common class)
y.sum()/len(y)

0.6034858387799564
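
The fraction above equals the majority-class baseline only because the positive class happens to be the larger one. A sketch of a more general version uses `sklearn`'s `DummyClassifier`, which can be scored like any other classifier. The data here is synthetic (`X_demo` and `y_demo` are made-up stand-ins), so the number will not match the reviews data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labels, roughly 60% positive; the dummy classifier ignores the features
rng = np.random.default_rng(0)
y_demo = (rng.random(500) < 0.6).astype(int)
X_demo = np.zeros((len(y_demo), 1))

# Always predict whichever class is most common in the training folds
dummy = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_val_score(dummy, X_demo, y_demo, cv=5)

majority_share = max(y_demo.mean(), 1 - y_demo.mean())
print("Dummy accuracy:", np.mean(dummy_scores))
print("Majority share:", majority_share)
```

Because it goes through `cross_val_score`, this baseline is computed the same way as the other scores in this notebook.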

### Multilayer Perceptron

Score an MLP classifier. This **is** a neural method, but it doesn't perform very well out of the box.

In [4]:
%%time
# MLP score, no optimization
mlpc = MLPClassifier()
mlp_scores = cross_val_score(mlpc, X, y, cv=5, n_jobs=-1)
print("MLP accuracy:", np.mean(mlp_scores))

MLP accuracy: 0.6339992299490581
CPU times: user 35.3 ms, sys: 74.8 ms, total: 110 ms
Wall time: 6.19 s


Not great! It's slower and performs worse than logistic regression. Recall, for comparison, that our token-based, TF-IDF weighted, logistic regression score (from the problem set) was around 0.66.

#### Tuning with grid search
Let's try some tuning (tuning neural networks can be super important). Note our use of `GridSearchCV` (see [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)), which makes it easier to try different combinations of hyperparameters.

Grid search is slow, as are neural methods. We're working with a subset of the data (as you may want to do in your own work if the full compute costs are high), but this makes it less likely that the parameters we find will also be the best ones for the full dataset. Oh well ...

In [5]:
%%time
# Grid search: wide vs. deep, and compare solvers
from sklearn.model_selection import GridSearchCV
import warnings

params = {
    'hidden_layer_sizes': [(300,), (100,), (10,), (2,), (100,10), (30,10), (10,2)],
    'solver':['adam', 'lbfgs'],
    'max_iter':[2000] # not part of the search, but set a classifier parameter
}
clf = GridSearchCV(mlpc, params, n_jobs=-1)

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# perform grid search
with warnings.catch_warnings() as w:
    warnings.simplefilter("ignore")
    clf.fit(X_train, y_train) # Note subset of the data!

CPU times: user 19.7 s, sys: 1.26 s, total: 21 s
Wall time: 11.6 s


In [6]:
# Which parameters are best?
clf.best_params_

{'hidden_layer_sizes': (300,), 'max_iter': 2000, 'solver': 'adam'}

In [7]:
# What's the cv score of the best classifier?
clf.best_score_

0.6607531749901325
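
The best score alone doesn't tell us how close the other configurations came. A fitted `GridSearchCV` object also stores every combination it tried in its `cv_results_` attribute. Here is a minimal sketch on a small synthetic dataset with a reduced grid (both are assumptions chosen to keep it fast), so the numbers will not match the reviews data.

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in for the real feature matrix and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

demo_params = {
    'hidden_layer_sizes': [(10,), (5, 2)],  # one wide-ish layer vs. two narrow ones
    'solver': ['adam', 'lbfgs'],
}
demo_search = GridSearchCV(MLPClassifier(max_iter=300, random_state=0), demo_params, cv=3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence convergence warnings
    demo_search.fit(X_demo, y_demo)

# Print every configuration, best first, with mean and spread of the fold scores
results = demo_search.cv_results_
for i in np.argsort(results['rank_test_score']):
    print(
        results['params'][i],
        round(results['mean_test_score'][i], 3),
        "+/-",
        round(results['std_test_score'][i], 3),
    )
```

If the top few configurations are within a standard deviation of each other, the "best" parameters are mostly noise, and the cheaper network is a reasonable pick.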

In [8]:
%%time
# Score after tuning
mlp_tuned_scores = cross_val_score(
    MLPClassifier(**clf.best_params_), 
    X_test, 
    y_test, 
    cv=5,
    n_jobs=-1,
    verbose=1
)
print("MLP accuracy (tuned):", np.mean(mlp_tuned_scores))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


MLP accuracy (tuned): 0.6440947797112181
CPU times: user 501 ms, sys: 565 ms, total: 1.07 s
Wall time: 705 ms


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.6s remaining:    0.9s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.7s finished
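
The cell above cross-validates a fresh classifier on the 20% test split alone. Another option is to score the fitted search object directly on the held-out rows, since `GridSearchCV` refits the best configuration on the whole training split by default (`refit=True`). This is sketched on synthetic data with made-up names, not the reviews data.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, split the same way as in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    {'hidden_layer_sizes': [(10,), (5, 2)]},
    cv=3,
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence convergence warnings
    search.fit(X_tr, y_tr)           # tuning sees only the training split

# The search object delegates to its refit best estimator
held_out_accuracy = search.score(X_te, y_te)
print("Best params:", search.best_params_)
print("Held-out accuracy:", held_out_accuracy)
```

This gives a single held-out number instead of a five-fold average, but the model being scored was trained on four times as much data.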


#### Compare token-based MLP

In [9]:
%%time
# Compare the untuned, token-based version

# restrict to 300 most informative token features
selector = SelectKBest(f_classif, k=300)
X_tfidf_selected = selector.fit_transform(X_tfidf, y)

mlp_tfidf_scores = cross_val_score(
    mlpc, 
    X_tfidf_selected, 
    y, 
    cv=5, 
    n_jobs=-1,
    verbose=1
)
print("MLP accuracy (using tokens):", np.mean(mlp_tfidf_scores))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    0.8s remaining:    1.2s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.9s finished


MLP accuracy (using tokens): 0.8044677763298187
CPU times: user 41.7 ms, sys: 35.2 ms, total: 76.9 ms
Wall time: 1.09 s
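
One caveat about the cell above: `SelectKBest` was fit on all of `y` before cross-validation, so every test fold helped choose the features, and the token-based score may be optimistic. Here is a minimal sketch of the effect on pure-noise synthetic data, where the honest answer is chance (about 0.5). The names `X_noise` and `y_noise` are made up, and logistic regression stands in for the MLP to keep things fast.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: no feature has any real relationship to the labels
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 2000))
y_noise = rng.integers(0, 2, size=200)

# Leaky: select features using ALL labels, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y_noise, cv=5).mean()

# Honest: selection happens inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X_noise, y_noise, cv=5).mean()

print("Selection outside CV:", leaky)
print("Selection inside CV: ", honest)
```

Wrapping the selector and the classifier in one pipeline means each fold picks its features from its own training rows only. How much this would change the token-based score on the reviews data is something we'd have to rerun to find out.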


So, none of this is very impressive (apart from the fact that we can do it at all via `sklearn`, which is pretty cool).

A couple of things to keep in mind:

* MLP is nowhere near state-of-the-art
* We don't have very much data to work with. Remember that neural methods thrive on large training sets.
* Embeddings don't outperform tokens as input features in this case. But note that we selected the tokens specifically suited to the given task, whereas static embeddings are general representations of word senses. So, our 300-dimensional embeddings will work for lots of tasks, whereas the 300 most-informative tokens for the score prediction task aren't likely to be the same 300 best tokens for a different task.

That said, you can see that using a neural classifier is no panacea. Neural methods are not the first-best option for all tasks, and they often introduce lots of computational complexity. Use them when they help, especially when you've tried other, cheaper methods and have found those methods wanting.