Discriminative Classifier
===

![](images/discriminative.png)

By The End of This Session You Will:
---
- Know the difference between discriminative and generative models
- Be able to use the discriminative MaxEnt model to carry out NLP tasks

__Supplemental to these videos:__
- [video 1](https://class.coursera.org/nlp/lecture/38)
- [video 2](https://class.coursera.org/nlp/lecture/51)

***
<br>
<br>

Discriminative vs Generative Models
===

So far we have covered one example of a generative model, Naive Bayes, for sentiment analysis and text classification.

Summary 
---

1. __Generative models compute joint probabilities__

   - Joint probabilities of observing the data ___and___ each of the classes respectively,  $p(D, C)$
   - Concretely, probability of observing a certain document and a positive sentiment class
   - e.g. $p(\text{this movie sucks}, \text{positive sentiment})$ is low
   - Revisiting Naive Bayes, 
   
     $$p(D, C) = p(D | C) \cdot p(C)$$
     
     $$\text{Joint probability} = \text{Likelihood} \times \text{Prior}$$
     
     <br>
     
   - Naive Bayes subsequently computes the posterior (conditional) probability
   
     $$p(C | D) = \frac{p(D, C)}{p(D)}$$
     
     $$\text{Posterior probability} = \frac{\text{Joint probability}}{\text{Probability of observing data}}$$
    
   <br>

2. __Discriminative models compute conditional probabilities__

   - Conditional probability of each of the classes ___given___ the observed data, $p(C | D)$
   - Concretely, probability of a positive sentiment class given a certain document
   - e.g. $p(\text{positive sentiment | this movie sucks})$ is low
   - Revisiting logistic regression,
   
     $$p(C|D) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x_1 + \beta_2x_2 + \dots)}} $$
     
   <br>
   
3. __Training discriminative models is more difficult than training generative models__

   - The parameters of a discriminative model, e.g. $\beta_1, \beta_2, \dots$, have to be optimized by maximizing the conditional likelihood with gradient methods
   - The parameters of a generative model such as Naive Bayes are simply relative frequencies of classes and features, obtained by counting
   - Generative models are therefore quicker to train and easier to optimize
   
   <br>

4. __Discriminative models tend to be more accurate than generative models__

   - As shown empirically by Klein and Manning (2002), discriminative models outperform generative models on various NLP tasks
   - Especially when discriminative models are used with smoothing / regularization
   
     <br>
     
     <img src="images/performance.png" width="300px">
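
As a rough illustration of items 1 and 3 above, the sketch below estimates Naive Bayes parameters purely by counting, then turns the joint $p(D, C)$ into the posterior $p(C | D)$. The toy corpus, the add-one smoothing and the helper names `joint` and `posterior` are assumptions made for this example; they are not part of the course material.

```python
from collections import Counter

# Hypothetical toy corpus: (document, class) pairs invented for illustration.
train = [
    ("great movie", "pos"),
    ("great acting", "pos"),
    ("movie sucks", "neg"),
    ("boring movie sucks", "neg"),
]

classes = sorted({c for _, c in train})
vocab = sorted({w for d, _ in train for w in d.split()})

# Generative parameters are just counts: class frequencies and
# per-class word frequencies. No gradient-based optimization is needed.
class_counts = Counter(c for _, c in train)
word_counts = {c: Counter() for c in classes}
for doc, c in train:
    word_counts[c].update(doc.split())


def joint(doc, c, alpha=1.0):
    """p(D, C) = p(D | C) * p(C), with add-alpha smoothing (an assumption)."""
    prior = class_counts[c] / len(train)
    total = sum(word_counts[c].values())
    likelihood = 1.0
    for w in doc.split():
        likelihood *= (word_counts[c][w] + alpha) / (total + alpha * len(vocab))
    return likelihood * prior


def posterior(doc):
    """p(C | D) = p(D, C) / p(D), where p(D) sums the joint over all classes."""
    joints = {c: joint(doc, c) for c in classes}
    evidence = sum(joints.values())
    return {c: j / evidence for c, j in joints.items()}


post = posterior("movie sucks")
print(post)
```

On this toy data the negative class receives the larger posterior for "movie sucks", in line with the low $p(\text{this movie sucks}, \text{positive sentiment})$ described in item 1.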




***
<br>
<br>

Knowledge Check Questions
---

1) What is the fundamental difference between generative and discriminative models in terms of the probabilities they compute?

<details><summary>
Click here for solution to 1.
</summary>
`
Discriminative models compute the conditional probability of observing a class given the data. 

Generative models compute the joint probability of observing a class and the data together.
`
</details>

2) What are the practical trade-offs between using a discriminative and a generative model?

<details><summary>
Click here for solution to 2.
</summary>
`
Discriminative models are more difficult / take longer to train but are more accurate.
`
</details>

***
<br>
<br>

MaxEnt (Discriminative) Models
===

The MaxEnt model extends logistic regression from a binary target to the prediction of multiple classes.

---
Summary 
---

1. __MaxEnt is logistic regression with multiple target classes__

   - Logistic regression only permits a 0/1 target, whereas MaxEnt allows many classes
   - MaxEnt is also known as the softmax model
   - Say I have features $x_1, x_2, x_3$ and classes $c_1, c_2, c_3$, and there is only one correct class out of the 3 for each instance
   - __MaxEnt formula:__
   
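     $$p(C_i|D) = \frac{e^{\beta_i^T x}}{\sum_{j=1}^k e^{\beta_j^T x}} $$

     where $C_i$ is the $i$-th class, $\beta_i$ is the weight vector for that class, $x$ is the feature vector and $k$ is the number of classes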
     
   <br>
   
2. __Pictorial representation of MaxEnt__

   - The blue circles represent inputs (features) in the model
   - $x_0$ is the bias term, and $x_1$ and $x_2$ are the other features
   - The red, green and cyan ellipses are softmax functions applied to the dot product of the inputs and the weights ($\beta$)
   - Since there are three classes, the model is represented by 3 softmax functions (red, green and cyan)
   - Each arrowed line represents a weight ($\beta$). Since there are 3 inputs (the bias plus 2 features), each of the softmax functions is represented by 3 weights / lines

   ![](images/softmax.png)
   
   <br>

3. __Sklearn implementation of MaxEnt__

   - MaxEnt is implemented by the `LogisticRegression` class in sklearn. Learn more [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
   - The input to the MaxEnt model could be the bag-of-words representation or other user-defined features
   - The target would be sentiments or different classes of text
   - Specifying regularization (`penalty` as `l1` or `l2`) is important for MaxEnt performance
   - The hyperparameter `C` controls how much regularization is applied to the model. It is the inverse of the regularization strength, so smaller values mean stronger regularization
   
     ![](images/logistic.png)
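
The sklearn workflow in item 3 might look roughly like the sketch below: a bag-of-words input, three sentiment classes as the target, and `C` as the regularization hyperparameter. The documents, labels and the choice of `C=1.0` are made up for illustration. The sketch relies on the default L2 penalty rather than passing `penalty` explicitly, since argument handling has changed between sklearn versions; check the documentation for the version you have installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, invented for this example.
docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "this movie sucks",
    "boring plot terrible acting",
    "the movie was okay",
    "average plot nothing special",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Bag-of-words representation: one column per vocabulary word.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# C is the inverse regularization strength; the default penalty is L2.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X, labels)

# One probability per class for a new document; each row sums to 1.
new_doc = vectorizer.transform(["this movie sucks"])
probs = model.predict_proba(new_doc)
print(dict(zip(model.classes_, probs[0])))
```

Trying a few values of `C` (for example on a held-out set) is a simple way to see how much regularization affects accuracy.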
   

<br>
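
To make the MaxEnt formula from item 1 concrete, here is a minimal sketch that evaluates the softmax by hand in plain Python. The weight values, the input vector and the function name `maxent_probs` are hypothetical. The last lines check that with two classes, one of which has all-zero weights, the softmax reduces to the standard logistic function $1 / (1 + e^{-z})$.

```python
import math


def maxent_probs(betas, x):
    """Return p(C_i | D) for every class i.

    betas: one weight list per class (k lists of length n)
    x:     feature list of length n, with x[0] = 1 acting as the bias input
    """
    # Dot product beta_i^T x for each class.
    scores = [sum(b * xi for b, xi in zip(beta_i, x)) for beta_i in betas]
    # Subtracting the max score avoids overflow and cancels in the ratio.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights for 3 classes and 3 inputs (bias plus 2 features).
betas = [
    [0.5, 1.0, -1.0],
    [0.0, -0.5, 0.5],
    [-0.5, 0.2, 0.3],
]
x = [1.0, 2.0, 0.0]

probs = maxent_probs(betas, x)
print(probs)

# Two-class special case: softmax with one all-zero weight vector
# equals the logistic function applied to the other class's score.
two_class = maxent_probs([betas[0], [0.0, 0.0, 0.0]], x)
z = sum(b * xi for b, xi in zip(betas[0], x))
logistic = 1.0 / (1.0 + math.exp(-z))
```

The probabilities always sum to 1 across classes, which is what lets the model pick exactly one correct class per instance.
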
<br>
<br>
----