---
Logistic Regression: The workhorse of Data Science
=====

![](http://i.stack.imgur.com/bA57S.png)

---
By the end of this session, you should be able to:
---

- Explain the difference between Logistic Regression and Naive Bayes
- Explain the difference between Linear and Logistic Regression
- Fit a multinomial Logistic Regression to text data

---
Logistic Regression vs Naive Bayes
---

![](http://cdn2.hubspot.net/hubfs/426799/powertools.png)

---
Generative models
----   

- P(c, d)
- Seek to maximize the joint likelihood
- Generate the observed data from hidden (latent) variables, placing probabilities over both the observed data and the hidden variables (e.g., n-gram models, Naive Bayes)

![](images/gen_model_formula.png)

Create a joint model of the form p(y,x) = p(y)p(x|y), then condition on the observed features x to derive the class posterior p(y|x).

![](images/gen_model.png)
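
Written out, the conditioning step is just Bayes' rule. This is a sketch in the same p(y,x) notation used above; the primed label y' is only a summation index over all classes and is not part of the original slides.

$$
p(y \mid x) = \frac{p(x, y)}{p(x)} = \frac{p(y)\, p(x \mid y)}{\sum_{y'} p(y')\, p(x \mid y')}
$$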

---
Discriminative models
----

- P(c|d)
- Take the data as given and model __only__ the conditional probability of the class.
- e.g., Logistic Regression and SVM

![](images/disriminative_model_flow.png)

![](images/disrim_model.png)


----
Generative Model vs Discriminative Model
----

A generative model learns the joint probability distribution p(x,y).

A discriminative model learns the conditional probability distribution p(y|x).

The data: (x,y): (1,0), (1,0), (2,0), (2, 1)

| __p(x,y)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | 1/2 | 0 |
| x=2 | 1/4 | 1/4 |


| __p(y\|x)__ | y=0 | y=1 |  
|:-------:|:------:|:------:|
| x=1 | 1 | 0 |
| x=2 | 1/2 | 1/2 |

The distribution p(y|x) is the natural distribution for classifying a given example x into a class y, which is why algorithms that model it directly are called discriminative algorithms. Generative algorithms model p(x,y), which can be transformed into p(y|x) by applying Bayes' rule and then used for classification. However, the distribution p(x,y) can also be used for other purposes. For example, you could use p(x,y) to generate likely (x,y) pairs.

[Source](http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm)
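
Both tables above can be reproduced by simple counting. The snippet below is a minimal sketch using only the Python standard library; the variable names (`joint`, `conditional`) are illustrative and not from the original source.

```python
from collections import Counter
from fractions import Fraction

# Toy data from the example above, as (x, y) pairs.
data = [(1, 0), (1, 0), (2, 0), (2, 1)]

n = len(data)
pair_counts = Counter(data)              # how often each (x, y) pair occurs
x_counts = Counter(x for x, _ in data)   # how often each x occurs

# Joint distribution p(x, y): count of the pair over the total count.
joint = {(x, y): Fraction(pair_counts[(x, y)], n)
         for x in (1, 2) for y in (0, 1)}

# Conditional distribution p(y | x): count of the pair over the count of x.
conditional = {(x, y): Fraction(pair_counts[(x, y)], x_counts[x])
               for x in (1, 2) for y in (0, 1)}

for x in (1, 2):
    for y in (0, 1):
        print(f"x={x} y={y}  p(x,y)={joint[(x, y)]}  p(y|x)={conditional[(x, y)]}")
```

Note that the joint table sums to 1 over all cells, while the conditional table sums to 1 within each row (each value of x).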

---
Take home message
----

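Given enough training data, discriminative models generally outperform generative models on classification tasks. With little training data, a generative model such as Naive Bayes can still win, because it approaches its (higher) asymptotic error faster.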

[On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes](http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf)

---
Check for understanding
---

<details><summary>
Is Naive Bayes a generative or discriminative model?
</summary>
Naive Bayes is a __generative__ model.
</details>

----
Regressions
----

![](http://www.marketingdistillery.com/wp-content/uploads/2014/11/TheThreeRegressions.png)

![](http://www.appstate.edu/~whiteheadjc/service/logit/logit.gif)

[Read more here](http://faculty.cas.usf.edu/mbrannick/regression/Logistic.html)

---
Check for understanding
---

<details><summary>
If I wanted to build a classifier to predict click-through rate (i.e., will this person click on my ad and give me $$$?), which type of regression would I use?
</summary>
Logistic Regression
</details>

---
Logistic Regression: Linear Regression for Classification
----

Generalize linear regression to the (binary) classification setting by making two changes. 

1) Replace the Gaussian distribution for y with a Bernoulli distribution, which is more appropriate when the response is binary, y ∈ {0, 1}.

![](images/berm.png)

2) Compute a linear combination of the inputs, as before, but then pass it through a function that ensures 0 ≤ μ(x) ≤ 1 by defining:

![](images/sigma.png)

sigm(η) refers to the sigmoid function, also known as the logistic function. Its inverse is the logit (log-odds) function.

![](images/sigma_formula.png)

Squashing function

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/2000px-Logistic-curve.svg.png" style="width: 400px;"/>

It maps the whole real line into the interval (0, 1), which is necessary for the output to be interpreted as a probability.
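
The squashing behaviour is easy to check numerically. Below is a minimal standard-library sketch; the function names `sigmoid` and `logit` are illustrative choices, and `logit` is included only to show the inverse mapping from a probability back to the real line.

```python
import math

def sigmoid(eta: float) -> float:
    """Logistic function 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    # For negative eta, use the equivalent form exp(eta) / (1 + exp(eta)).
    z = math.exp(eta)
    return z / (1.0 + z)

def logit(p: float) -> float:
    """Inverse of the sigmoid: the log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

for eta in (-5.0, 0.0, 5.0):
    print(eta, sigmoid(eta))
```

Large negative inputs are pushed toward 0, large positive inputs toward 1, and an input of 0 maps to exactly 0.5. In floating point the output can saturate at 0.0 or 1.0 for extreme inputs, even though the mathematical function never reaches either bound.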



Put parts 1 and 2 together:

![](images/logistic_regession.png)

[Read more here](http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html)
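
To make the two parts concrete, here is a rough sketch of fitting a one-feature logistic regression with plain gradient descent on the negative log-likelihood. Everything in it is an assumption for illustration: the toy data, the learning rate, the iteration count, and the names `w`, `b`, and `predict_proba`. In practice you would reach for a library implementation instead.

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical 1-D toy data: small x tends to be class 0, large x class 1.
# The classes overlap, so the fitted weights stay finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0        # weight and intercept
learning_rate = 0.1

for _ in range(5000):
    grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        mu = sigmoid(w * x + b)   # part 2: squashed linear combination
        # Part 1: the Bernoulli negative log-likelihood has gradient (mu - y) * input.
        grad_w += (mu - y) * x
        grad_b += (mu - y)
    w -= learning_rate * grad_w / len(xs)
    b -= learning_rate * grad_b / len(xs)

def predict_proba(x):
    """Estimated P(y = 1 | x) under the fitted model."""
    return sigmoid(w * x + b)

print(predict_proba(0.5), predict_proba(3.0))
```

The only differences from a linear regression fit are the sigmoid applied to the linear combination and the Bernoulli likelihood in place of the Gaussian one.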

----
Logistic Regression: Beyond Binary
----

### "Softmax"

Generalizes logistic regression to classification problems where the class label can take on more than 2 possible values.

One class (K) serves as the baseline, and every other class is compared against it:

![](images/multiclass.png)
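
A minimal sketch of the baseline-class formulation, assuming the common parameterization in which the baseline class K has its linear score fixed at 0 and the other K-1 classes each get their own score. The function name `softmax_with_baseline` is illustrative.

```python
import math

def softmax_with_baseline(scores):
    """Class probabilities when the last class K is the baseline.

    `scores` holds the linear scores for classes 1..K-1; the baseline
    class K has its score fixed at 0 (an assumption of this sketch).
    """
    all_scores = list(scores) + [0.0]
    # Subtract the max score before exponentiating for numerical stability.
    m = max(all_scores)
    exps = [math.exp(s - m) for s in all_scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax_with_baseline([2.0, 1.0])   # K = 3 classes
print(probs)
```

With a single non-baseline score (K = 2), this reduces to the binary logistic regression from the previous slide: the probability of class 1 is the sigmoid of its score.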

<br>
<br>
<br>

---
Feature Engineering: Next to data cleaning, what you should spend your time doing
----

![](http://3.bp.blogspot.com/-v7RCgnSNLfA/U5stfq5zuJI/AAAAAAAADt4/9_87WaSa620/s1600/features-in-ML.jpg)

- Feature engineering is both art and science
- Find candidate features, select, and weight them
- The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation.


---
Check for understanding
---

<details><summary>
Given that NLP models are sparse (it is not uncommon to have more than 1 million parameters), what extra step do you need to take during feature engineering?
</summary>
Regularize  
<br>
![](http://i.stack.imgur.com/HwqTv.png)
</details>

---
Feature-Based Linear Classifiers
----

- Assign a weight $\lambda_i$ to each feature $f_i$, then use a linear function of the features to predict the class
- Make each vote positive by taking the exponential of the linear weighting ($\sum_i \lambda_i f_i(c,d)$)
- Normalize the votes to turn them into probabilities
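
The three steps above can be sketched in a few lines of Python. Everything concrete here is hypothetical and chosen only for illustration: the two classes, the indicator features, the hand-set weights, and the helper names `active_features` and `classify`. In a real model the weights would be learned from data.

```python
import math

# Hypothetical weights lambda_i, one per feature f_i(c, d).
# Each feature pairs an observation about the document with a class.
weights = {
    ("contains_goal", "sports"): 1.5,
    ("contains_goal", "politics"): -0.5,
    ("contains_vote", "politics"): 2.0,
    ("contains_vote", "sports"): -1.0,
}
classes = ["sports", "politics"]

def active_features(c, doc):
    """Return the indicator features f_i(c, d) that fire (equal 1)."""
    feats = []
    if "goal" in doc:
        feats.append(("contains_goal", c))
    if "vote" in doc:
        feats.append(("contains_vote", c))
    return feats

def classify(doc):
    # Step 1: linear score (the "vote") for each class.
    votes = {c: sum(weights.get(f, 0.0) for f in active_features(c, doc))
             for c in classes}
    # Step 2: exponentiate so every vote is positive.
    positive = {c: math.exp(v) for c, v in votes.items()}
    # Step 3: normalize so the values sum to 1.
    total = sum(positive.values())
    return {c: p / total for c, p in positive.items()}

print(classify(["late", "goal", "wins", "match"]))
```

A document with no active features gets a vote of 0 for every class, so the classifier falls back to a uniform distribution.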

----
Not Important (for this class)
----

- Bayes Net/Graphical Model/PGMs. Don't worry. We'll cover them in special topics
- The notation. Machine learning notation is a mess. Use the notation from Jared's class.

<br>
<br>

----