## Text Classification

#### Formalizing classification

In [3]:
from IPython.display import Image
Image(url="fclf.png", width=500, height=400)

#### The Naïve Bayes’ Classifier

In [4]:
Image(url="bclf.png", width=200, height=200)

##### Bayes’ rule
- It relates the posterior probability $P(C_{i}|X_{j})$ to the prior probability $P(C_{i})$ and the class-conditional probability $P(X_{j}|C_{i})$.
- The denominator acts to normalise everything, so that all the probabilities sum to 1.
- When $X$ is a vector of feature values instead of just one feature, the same rule applies and we pick the class with the highest posterior probability. This is known as ***the maximum a posteriori or MAP hypothesis***, and it gives us a way to choose the output class.
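
As a small worked example of Bayes' rule and the MAP choice, the sketch below uses two classes and a single observed feature value. The class names and all probabilities are hypothetical numbers picked only for illustration.

```python
# Hypothetical numbers, chosen only for illustration.
priors = {"spam": 0.3, "ham": 0.7}        # P(C_i)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(X_j | C_i) for one observed feature value

# Numerator of Bayes' rule for each class: P(X_j | C_i) * P(C_i)
joint = {c: likelihoods[c] * priors[c] for c in priors}

# Denominator: P(X_j), obtained by summing the numerators over all classes
evidence = sum(joint.values())

# Posterior P(C_i | X_j) for each class; these sum to 1
posteriors = {c: joint[c] / evidence for c in joint}

# MAP hypothesis: the class with the largest posterior
map_class = max(posteriors, key=posteriors.get)
```

Because the denominator is the same for every class, the MAP class can also be found by comparing the numerators alone.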

Naive: the features are assumed to be conditionally independent of each other, given the class

In [5]:
Image(url="nb.png", width=500, height=500)

So the classifier rule for the naïve Bayes’ classifier is to select the class $C_{i}$ for which the following computation is the maximum:

In [6]:
Image(url="nbclf.png", width=200, height=200)
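
The decision rule can be sketched for text as follows. This is a minimal illustration, not the notebook's own implementation: it assumes bag-of-words features, works in log space to avoid underflow, and uses add-one smoothing, which is not mentioned in the notes above. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns log-priors and log-likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing: an assumption made for this sketch.
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik

def predict_naive_bayes(tokens, log_prior, log_lik):
    """Return the class maximizing log P(C) + sum of log P(word | C)."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
    return max(scores, key=scores.get)

toy_docs = [
    ("cheap pills buy now".split(), "spam"),
    ("buy cheap watches".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
]
log_prior, log_lik = train_naive_bayes(toy_docs)
prediction = predict_naive_bayes("buy cheap pills".split(), log_prior, log_lik)
```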

#### Perceptron Test Algorithm for Binary Classification:
#### Predict class -1 or +1 for example x

In [7]:
Image(url="teclf.png", width=500, height=400)
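
A minimal sketch of the prediction step, assuming the usual formulation in which the activation is the dot product of `w` and `x` plus the bias `b`, and the predicted class is its sign. Returning -1 when the activation is exactly zero is a choice made here, not something stated in the notes.

```python
def perceptron_predict(w, b, x):
    """Predict -1 or +1 for feature vector x given weights w and bias b."""
    activation = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
    # Ties (activation == 0) are mapped to -1 in this sketch.
    return 1 if activation > 0 else -1
```

For example, with `w = [1.0, -2.0]` and `b = 0.5`, the input `[3.0, 1.0]` has activation 1.5 and is assigned class +1.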

#### Perceptron Training Algorithm:
#### Find good values for (w,b) given training data D

In [8]:
Image(url="trclf.png", width=500, height=400)
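
The training loop can be sketched as below, assuming the standard mistake-driven update: whenever an example is misclassified (or lies on the boundary), add `y * x` to `w` and `y` to `b`. The name `max_iter` stands in for MaxIter, and the toy data is hypothetical.

```python
def perceptron_train(data, max_iter):
    """data: list of (x, y) pairs with y in {-1, +1}. Returns (w, b)."""
    n_features = len(data[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_iter):
        for x, y in data:
            activation = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            if y * activation <= 0:
                # Mistake (or on the boundary): move the boundary toward the example.
                w = [w_d + y * x_d for w_d, x_d in zip(w, x)]
                b += y
    return w, b

toy_data = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, b = perceptron_train(toy_data, max_iter=10)
```

On linearly separable data such as this toy set, the updates stop once every training example is classified correctly.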

#### Machine Learning Vocabulary

##### $x$ is often called the feature vector
- its elements are defined (by us, the model designers) to capture properties or features of the input that are expected to correlate with predictions
##### $w$ and $b$ are the parameters of the classifier
- they are needed to fully define the classification function $f(x) = y$
- their values are found by the training algorithm using training data D
##### MaxIter is a hyperparameter
- controls when training stops
- MaxIter impacts the nature of function $f$ indirectly

All of the above affect the performance of the final classifier

#### Perceptron for binary classification

In [9]:
Image(url="pclf.png", width=500, height=400)

### Linear Support Vector Machines

In [10]:
Image(url="msvm.png", width=400, height=400)

In [11]:
Image(url="esvm.png", width=400, height=400)

#### More Machine Learning vocabulary:
#### overfitting/underfitting/generalization

##### Training error is not sufficient
- We care about generalization to new examples
- A classifier can classify training data perfectly, yet classify new examples incorrectly
    - Because training examples are only a sample of the data distribution
        - a feature might correlate with class by coincidence
    - Because training examples could be noisy
        - e.g., an accident in labeling

#### Overfitting

Consider a model 𝜃 and its:
   - Error rate over training data $error_{train}(𝜃)$
   - True error rate over all data $error_{true}(𝜃)$

We say 𝜃 overfits the training data if
   - $error_{train} (𝜃) < error_{true} (𝜃)$

#### Evaluating on test data

Problem: we don’t know $error_{true}(𝜃)$!

- Solution:
    - we set aside a test set
         - some examples that will be used for evaluation
    - we don’t look at them during training!
    - after learning a classifier 𝜃, we calculate $error_{test}(𝜃)$
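
The procedure can be sketched as follows. The helper names, the 20% test fraction, and the threshold classifier standing in for a learned model are all hypothetical choices for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as the test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(predict, examples):
    """Fraction of (x, y) examples on which predict(x) differs from y."""
    mistakes = sum(1 for x, y in examples if predict(x) != y)
    return mistakes / len(examples)

# Toy data: one feature, label +1 when the feature is at least 10.
examples = [([float(i)], 1 if i >= 10 else -1) for i in range(20)]
train, test = train_test_split(examples)

def threshold_classifier(x):
    """Stand-in for a classifier learned from `train` only."""
    return 1 if x[0] >= 10 else -1

test_error = error_rate(threshold_classifier, test)
```

The test examples play no part in fitting the classifier; they are touched only in the final `error_rate` call.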

#### Overfitting

- Another way of putting it
- A classifier 𝜃 is said to overfit the training data, if there are other parameters 𝜃′, such that
    - 𝜃 has a smaller error than 𝜃′ on the training data
    - but 𝜃 has larger error on the test data than 𝜃′.

#### Underfitting/Overfitting

- Underfitting
    - Learning algorithm had the opportunity to learn more from training data, but didn’t 

- Overfitting
    - Learning algorithm paid too much attention to idiosyncrasies of the training data; the resulting classifier doesn’t generalize

#### Back to the Perceptron

- Practical strategies to improve generalization for the perceptron
    - Averaging
    - Randomize order of training data
    - Use a development test set to find good hyperparameter values
        - E.g., early stopping is a good strategy to avoid overfitting
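
The first two strategies can be combined in a short sketch. It assumes the simplest form of averaging, where the current weights are added to a running sum after every example and the average is returned at the end; more efficient bookkeeping exists but is omitted here. The function name and toy data are hypothetical.

```python
import random

def averaged_perceptron_train(data, max_iter, seed=0):
    """Perceptron with per-epoch shuffling; returns averaged (w, b)."""
    rng = random.Random(seed)
    data = list(data)
    n_features = len(data[0][0])
    w, b = [0.0] * n_features, 0.0
    w_sum, b_sum, count = [0.0] * n_features, 0.0, 0
    for _ in range(max_iter):
        rng.shuffle(data)  # randomize the order of training data each epoch
        for x, y in data:
            activation = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            if y * activation <= 0:
                w = [w_d + y * x_d for w_d, x_d in zip(w, x)]
                b += y
            # Accumulate after every example, so weights that survive longer count more.
            w_sum = [s + w_d for s, w_d in zip(w_sum, w)]
            b_sum += b
            count += 1
    return [s / count for s in w_sum], b_sum / count

toy_data = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w_avg, b_avg = averaged_perceptron_train(toy_data, max_iter=5)
```

Choosing `max_iter` by checking error on a development set after each epoch would be one way to implement early stopping.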

#### Logistic Regression for Binary Classification

#### From Perceptron to Probabilities:
#### the Logistic Regression classifier

- The perceptron gives us a prediction y, and the activation can take any real value
- What if we want a probability p(y|x) instead?

#### The sigmoid function (aka the logistic function)

In [12]:
Image(url="lr.png", width=500, height=400)

In [13]:
Image(url="lre.png", width=300, height=400)
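
A minimal sketch of the sigmoid in code. The two-branch form is a numerical-stability choice made here so that very negative inputs do not overflow `math.exp`.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)): maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Algebraically equivalent form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

Note that `sigmoid(0)` is exactly 0.5 and that `sigmoid(-z)` equals `1 - sigmoid(z)`, which is what lets one output represent both P(y=1|x) and P(y=0|x).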

#### Making Predictions with the Logistic Regression Classifier

Given a test instance x, predict class 1 if P(y=1|x) > 0.5, and class 0 otherwise

In [14]:
Image(url="lrd.png", width=300, height=400)

Inputs x for which P(y=1|x) = 0.5 constitute the decision boundary
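
The prediction rule can be sketched as below, assuming P(y=1|x) is the sigmoid of `w · x + b`. The function names and the example weights are hypothetical.

```python
import math

def predict_proba(w, b, x):
    """P(y=1|x) under logistic regression with weights w and bias b."""
    z = sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(w, b, x, threshold=0.5):
    """Class 1 if P(y=1|x) exceeds the threshold, class 0 otherwise."""
    return 1 if predict_proba(w, b, x) > threshold else 0

w, b = [1.0, -1.0], 0.0                     # hypothetical parameters
label_a = predict_label(w, b, [2.0, 1.0])   # z = 1, so P(y=1|x) > 0.5
label_b = predict_label(w, b, [1.0, 1.0])   # z = 0, exactly on the decision boundary
```

Since the sigmoid equals 0.5 exactly when its input is 0, the decision boundary is the set of inputs with `w · x + b = 0`.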

#### Ingredients required for training

- Loss function or cost function
    - A measure of distance between classifier prediction and true label for a given set of parameters

In [15]:
Image(url="loss.png", width=400, height=400)

- An algorithm to minimize this loss
    - Here we’ll introduce stochastic gradient descent

#### The cross-entropy loss function

- Loss function used for logistic regression and often for neural networks
- Defined as follows:

In [16]:
Image(url="ce.png", width=400, height=400)
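
A minimal sketch of the loss for one example, assuming the standard binary form: the negative log-probability the classifier assigns to the true label, with labels in {0, 1}. The clipping constant `eps` is added here only to avoid `log(0)`.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Cross-entropy loss for one example; p_pred is the predicted P(y=1|x)."""
    p = min(max(p_pred, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))
```

The loss is small when the predicted probability of the true label is close to 1 and grows without bound as that probability approaches 0.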

#### Deriving the cross-entropy loss function

- Conditional maximum likelihood
    - Choose parameters that maximize the log probability of true labels y given inputs x

In [17]:
Image(url="cm.png", width=400, height=400)

- Cross-entropy loss is defined as

In [18]:
Image(url="dce.png", width=400, height=400)
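
The two steps above connect as follows. This sketch assumes the standard binary setting with $y \in \{0, 1\}$ and writes $\hat{y} = \sigma(w \cdot x + b)$ for the predicted probability of class 1; the symbol $\hat{y}$ is introduced here and may differ from the notation in the figures.

$$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}$$

$$\log p(y|x) = y \log \hat{y} + (1-y) \log (1-\hat{y})$$

$$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[\, y \log \hat{y} + (1-y) \log (1-\hat{y}) \,\right]$$

Maximizing the log probability of the true label is therefore the same as minimizing the cross-entropy loss.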

#### Gradient Descent

In [19]:
Image(url="gd.png", width=400, height=400)

#### Illustrating Gradient Descent
- The gradient indicates the direction of greatest increase of the cost/loss function.

- Gradient descent finds parameters (w,b) that decrease the loss by taking a step in the opposite direction of the gradient.

In [20]:
Image(url="igd.png", width=500, height=500)

In [21]:
Image(url="cgd.png", width=500, height=500)

In [22]:
Image(url="dces.png", width=500, height=500)
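
A minimal sketch of stochastic gradient descent for logistic regression. It assumes the standard result that the gradient of the cross-entropy loss with respect to `w` is `(p - y) * x`, and with respect to `b` is `(p - y)`, where `p` is the predicted P(y=1|x). The learning rate, epoch count, and toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic_regression(data, eta=0.1, epochs=100, seed=0):
    """data: list of (x, y) pairs with y in {0, 1}. Returns (w, b)."""
    rng = random.Random(seed)
    data = list(data)
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x)) + b)
            error = p - y
            # Step in the direction opposite to the gradient, scaled by eta.
            w = [w_d - eta * error * x_d for w_d, x_d in zip(w, x)]
            b -= eta * error
    return w, b

toy = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]
w, b = sgd_logistic_regression(toy)
```

Each update uses a single example rather than the full training set, which is what makes the procedure stochastic.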

### SGD hyperparameter: the learning rate

- The hyperparameter 𝜂 that controls the size of the step down the gradient is called the learning rate

- If 𝜂 is too large, training might not converge; if 𝜂 is too small, training might be very slow.

- How to set the learning rate? Common strategies:
    - decay over time: $𝜂 =1/(𝐶+n)$
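
The decay schedule can be sketched as below. The notes do not define `n`; this sketch assumes it counts the updates made so far, and the value `C = 10.0` is a hypothetical choice.

```python
def decayed_learning_rate(n, C=1.0):
    """eta = 1 / (C + n), where n is assumed to be the number of updates so far."""
    return 1.0 / (C + n)

# Learning rates for the first five updates with a hypothetical C = 10.
schedule = [decayed_learning_rate(n, C=10.0) for n in range(5)]
```

A larger `C` gives a smaller initial learning rate and a gentler relative decay over the first updates.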

### A multiclass logistic regression classifier

***Goal***: predict probability P(y=c|x), where c is one of k classes in set C

#### The softmax function

- A generalization of the sigmoid
- Input: a vector z of dimensionality k
    - $z = [z_{1},z_{2},\cdots,z_{k}]$
- Output: a vector of dimensionality k

In [29]:
Image(url="softmax.png", width=500, height=500)
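
A minimal sketch of the softmax, assuming the standard definition in which each output is `exp(z_i)` divided by the sum of `exp(z_j)` over all `j`. Subtracting the maximum before exponentiating is a numerical-stability choice that does not change the result.

```python
import math

def softmax(z):
    """Map a list of k real scores to k probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]  # shift by max(z) to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
```

Larger scores receive larger probabilities, and with k = 2 the softmax reduces to the sigmoid of the difference between the two scores.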

#### Model definition:

In [30]:
Image(url="softmaxmd.png", width=500, height=500)
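
In code, the model can be sketched as one weight vector and one bias per class, with the k resulting scores passed through the softmax. This assumes the usual formulation; the names `W` and `b` and the example values are hypothetical.

```python
import math

def multiclass_predict_proba(W, b, x):
    """W: one weight vector per class; b: one bias per class. Returns P(y=c|x) for each c."""
    z = [sum(w_d * x_d for w_d, x_d in zip(w_c, x)) + b_c for w_c, b_c in zip(W, b)]
    m = max(z)
    exps = [math.exp(z_c - m) for z_c in z]
    total = sum(exps)
    return [e / total for e in exps]

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights for k = 3 classes
b = [0.0, 0.0, 0.0]
probs = multiclass_predict_proba(W, b, [2.0, 0.5])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```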

#### Learning in Multiclass Logistic Regression

- Loss function for a single example 

In [31]:
Image(url="lmc.png", width=500, height=500)
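
A minimal sketch of the per-example loss, assuming it is the negative log of the softmax probability assigned to the correct class. It is computed here with the log-sum-exp trick for numerical stability; the function name is hypothetical.

```python
import math

def multiclass_cross_entropy(z, true_class):
    """-log softmax(z)[true_class], computed as logsumexp(z) - z[true_class]."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(z_i - m) for z_i in z))
    return log_sum_exp - z[true_class]
```

With k equal scores the loss is `log(k)`, and it shrinks toward 0 as the score of the correct class grows relative to the others.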