## Conditional Probability and Its Intuition

**Note**: The **Naive Bayes** algorithm is built on **Bayes’ theorem**.

It is a type of **supervised classification** algorithm.

**Naive Bayes** is a probabilistic classifier: it returns the probability of a test point belonging to each class rather than just the label of the test point.

Example: text classification, such as deciding **whether an email is Spam or Ham**.

**Conditional Probability**

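Let A and B be two events:

A = the email is Spam

B = the email contains the word VIAGRA

P(A) = probability of an email being Spam, and P(B) = probability of an email containing the word VIAGRA

For example, P(A) = 20% and P(A|B) = 70%: an email is far more likely to be Spam once we know it contains the word.

**Note**: P(A) is the **prior**, the probability before we have any information about the specific situation. P(A|B) is the **posterior**, the probability after we know something about the situation (here, that the word is present).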

**Conditional probability** tells us how likely one event is once another event is known to have occurred. This is the more common situation in real-world scenarios, compared with looking at the absolute probability of events in isolation.


## Bayes' Theorem

Ex - "India wins 70% matches when Sachin scores a century".

Suppose that India plays 100 matches, out of which it wins 60 and loses 40. Also, Sachin Tendulkar plays these 100 matches, scores a century in 12 of them, and doesn't score a century in the rest 88.

To make things interesting, you also have this additional information: out of the 60 games that India wins, Sachin scores a century in 10, and out of the 40 games that India loses, Sachin scores a century only in two.

Let us look at the **two-way contingency matrix** for the above case:


| | India Win | India Lose | Total |
|---|---|---|---|
| Sachin Scores a Century | 10 | 2 | 12 |
| Sachin Doesn't Score a Century | 50 | 38 | 88 |
| Total | 60 | 40 | 100 |


A = "India wins",  B = "Sachin scores Century"

P(A) = 60/100,  P(B) = 12/100


**Conditional Probability**

P(A|B) = "India wins when Sachin scores Century" = 10/12

**Joint Probability**

P(A,B) = P(A $\cap$ B) = "India wins and Sachin scores Century" = 10/100 = P(B $\cap$ A)

P(A $\cap$ B) = P(A|B) $\times$ P(B) = P(B|A) $\times$ P(A)

So, from the above condition,

P(A|B) $\times$ P(B) = P(B|A) $\times$ P(A)

#### Bayes' Theorem:

P(A|B) = $\frac{P(B|A) \times P(A)}{P(B)}$

P(A|B) = $\frac{P(A \cap B)}{P(B)}$
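The identity can be checked numerically on the cricket contingency table. The following is a minimal sketch that uses exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

# Counts from the contingency table (100 matches in total).
total = 100
wins = 60               # India wins (event A)
centuries = 12          # Sachin scores a century (event B)
wins_and_century = 10   # both events happen

p_a = Fraction(wins, total)                    # P(A)
p_b = Fraction(centuries, total)               # P(B)
p_a_and_b = Fraction(wins_and_century, total)  # P(A and B)

p_a_given_b = p_a_and_b / p_b   # P(A|B) from the definition
p_b_given_a = p_a_and_b / p_a   # P(B|A) from the definition

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
bayes_rhs = p_b_given_a * p_a / p_b

print(p_a_given_b, bayes_rhs)
```

Both routes give 10/12, which reduces to 5/6.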


#### Exercise :

Let’s define the events A and B as follows:

   - A: The email is spam

   - B: The email contains the word ‘viagra’

Now, based on the above definition, solve the questions given below.

**Ques 1** : What type of probability is ‘the probability that an email which contains the word ‘viagra’ is spam’?

**Ans** : Conditional Probability P(A=Spam | B=Viagra)

**Ques 2** : What type of probability is ‘the probability that an email contains the word ‘viagra’ and it is spam’?

**Ans** : Joint Probability P(A=Spam $\cap$ B=Viagra)

**Ques 3** : What is the symbolic notation of conditional probability?

**Ans** : P(A|B)

**Ques 4** : What is the symbolic notation of joint probability?

**Ans** : P(A $\cap$ B)


### Graded Questions

|Courses|Data Science(DS)|Machine Learning(ML)|Deep Learning(DL)|Big Data(BD)|Artificial Intelligence(AI)|Total|
|---|---|---|---|---|---|---|
|Male|80|60|40|50|30|260|
|Female|70|40|50|70|10|240|
|Total|150|100|90|120|40|500|

**Ques 1** : Given this contingency table, what is the probability that a randomly selected person joined Data Science?

**Ans** : 150 / 500

**Ques 2** : What is the probability that a randomly selected female joined DS? In other words, what is the probability of a person joining DS, GIVEN that she is female?

**Ans** : 70 / 240

**Ques 3** : Consider a set containing all DL students OR all male students. What is the Probability that a randomly selected person will belong to this set?

**Ans** : Using the formula P(A OR B) = P(A) + P(B) - P(A $\cap$ B), with A = DL student and B = male:

P(DL OR Male) = 90/500 + 260/500 - 40/500 = 310/500

**Ques 4** : What is the probability of a student being a female AND an AI student? P(A $\cap$ B)

**Ans** : 10 / 500

**Ques 5** : What is the probability of a Deep Learning (DL) student being a male?

**Ans** : 40 / 90


**Note** : ``A being B`` means ``P(B | A)``, ex - ``Deep Learning (DL)`` student **being** a ``male``? = P(Male | DL)
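As a cross-check, the five answers can be recomputed directly from the contingency table. This is a small sketch; the helper names `gender_total` and `course_total` are made up for illustration.

```python
from fractions import Fraction

courses = ["DS", "ML", "DL", "BD", "AI"]
counts = {
    "Male": dict(zip(courses, [80, 60, 40, 50, 30])),
    "Female": dict(zip(courses, [70, 40, 50, 70, 10])),
}
total = sum(sum(row.values()) for row in counts.values())  # 500 students


def gender_total(gender):
    """Number of students of the given gender (row total)."""
    return sum(counts[gender].values())


def course_total(course):
    """Number of students in the given course (column total)."""
    return sum(counts[g][course] for g in counts)


p_ds = Fraction(course_total("DS"), total)                                    # Ques 1
p_ds_given_female = Fraction(counts["Female"]["DS"], gender_total("Female"))  # Ques 2
p_dl_or_male = Fraction(                                                      # Ques 3
    course_total("DL") + gender_total("Male") - counts["Male"]["DL"], total
)
p_female_and_ai = Fraction(counts["Female"]["AI"], total)                     # Ques 4
p_male_given_dl = Fraction(counts["Male"]["DL"], course_total("DL"))          # Ques 5

print(p_ds, p_ds_given_female, p_dl_or_male, p_female_and_ai, p_male_given_dl)
```

Note that the conditional probabilities divide by a row or column total, while the joint and marginal probabilities divide by the grand total.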


### Naive Bayes - With One Feature

Naïve Bayes is a probabilistic classifier: it uses probability as the criterion for classification and returns the probability of a test point belonging to a class, using Bayes’ theorem. Example: deciding whether an email is Spam or Ham.

   - P($C_{i}$|x) = $\frac{P(x|C_{i}) \times P(C_{i})}{P(x)}$, where $C_{i}$ denotes the classes and x denotes the features of the data point.
   - For categorical data, the probabilities are calculated simply by counting the number of instances/occurrences.
   - The denominator P(x) is the same for every class, so it can be ignored when comparing classes without affecting the final outcome.
   - The class assigned to the new test point is the class for which P($C_{i}$|x) is greatest.
   

|Type of mushroom|Cap shape|
|---|---|
|Edible|Convex|
|Edible|Convex|
|Edible|Bell|
|Edible|Bell|
|Poisonous|Convex|
|Edible|Convex|
|Poisonous|Convex|
|Poisonous|Convex|
|Poisonous|Flat|
|Poisonous|Flat|
|Edible|Convex|
|Poisonous|Bell|
|Edible|Bell|
|Poisonous|Knobbed|

P(C=Edible|x=Convex) = $\frac {P(x=Convex|C=Edible) \times P(C=Edible)}{P(x=Convex)}$

P(C=Poisonous|x=Convex) = $\frac {P(x=Convex|C=Poisonous) \times P(C=Poisonous)}{P(x=Convex)}$

We don't need to compute P(x=Convex), because it is a common factor in both equations.

P(Mushroom is Edible) = 7/14 = 0.5

P(Mushroom is Poisonous) = 7/14 = 0.5

P(CapShape=Convex|Edible=Yes) = 4/7

P(CapShape=Convex|Edible=No) = 3/7

P(Edible=Yes|CapShape=Convex) = $\frac {P(CapShape=Convex|Edible=Yes) \times P(Edible=Yes)}{P(CapShape=Convex)}$ = (4/7 $\times$ 1/2)/P(CapShape=Convex) = $\frac {4}{14 P(CapShape=Convex)}$

P(Edible=No|CapShape=Convex) = $\frac {P(CapShape=Convex|Edible=No) \times P(Edible=No)}{P(CapShape=Convex)}$ = (3/7 $\times$ 1/2)/P(CapShape=Convex) = $\frac {3}{14 P(CapShape=Convex)}$

Both probabilities have P(CapShape=Convex) as a common denominator. It only acts as a scaling factor, so it can be dropped when comparing the two classes because it is the same for both.

$\frac {4}{14 P(CapShape=Convex)}$ > $\frac {3}{14 P(CapShape=Convex)}$

$\frac {4}{14}$ > $\frac {3}{14}$

Since P(Edible=Yes|CapShape=Convex) > P(Edible=No|CapShape=Convex), we classify the observation into the **Edible** class.
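The counting procedure above can be sketched in a few lines of Python. The data is copied from the 14-mushroom table; the function name `numerator` is only illustrative, and the score it returns is the posterior without the common denominator.

```python
from collections import Counter
from fractions import Fraction

# (class, cap shape) pairs copied from the 14-row table.
data = [
    ("Edible", "Convex"), ("Edible", "Convex"), ("Edible", "Bell"),
    ("Edible", "Bell"), ("Poisonous", "Convex"), ("Edible", "Convex"),
    ("Poisonous", "Convex"), ("Poisonous", "Convex"), ("Poisonous", "Flat"),
    ("Poisonous", "Flat"), ("Edible", "Convex"), ("Poisonous", "Bell"),
    ("Edible", "Bell"), ("Poisonous", "Knobbed"),
]

class_counts = Counter(label for label, _ in data)
pair_counts = Counter(data)


def numerator(label, shape):
    """P(shape | label) * P(label): the posterior without P(shape)."""
    prior = Fraction(class_counts[label], len(data))
    likelihood = Fraction(pair_counts[(label, shape)], class_counts[label])
    return likelihood * prior


scores = {label: numerator(label, "Convex") for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are 4/14 for Edible and 3/14 for Poisonous, so the convex mushroom is assigned to Edible.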

#### Comprehension: Naive Bayes With One Feature:

|S.No|Type of mushroom|Cap shape|
|---|---|---|
|1.|Poisonous|Convex|
|2.|Edible|Convex|
|3.|Poisonous|Convex|
|4.|Edible|Convex|
|5.|Edible|Convex|
|6.|Poisonous|Convex|
|7.|Edible|Bell|
|8.|Edible|Bell|
|9.|Edible|Convex|
|10.|Poisonous|Convex|
|11.|Edible|Flat|
|12.|Edible|Bell|

Consider the table shown above. There are two types of mushrooms, **edible** and **poisonous**, which is the target (dependent) variable.  They have various kinds of **cap-shapes**. Out of the total 12 mushrooms, eight are edible and four poisonous.

**Note** : **Cap-Shape** is the only feature in this dataset.

**Ques 1**: The probability of a CONVEX mushroom being edible, P(C = edible | X = CONVEX) is given by:

**Ans** : P( X = CONVEX | C = edible) $\times$ P(C = edible) / P(X = CONVEX)

**Ques 2**: The value of P(C = edible) is simply the number of edible mushrooms in the dataset divided by the total observations. What is the value of P( C = edible)?

**Ans** : 8/12 = 66.66%

This means that approx. 66.66% of all mushrooms are edible. Note that P(C = edible) appears in the numerator of the Bayes expression and this value is directly proportional to the chances of a mushroom being edible.

**Ques 1**: Now let’s say you picked a new mushroom whose cap-shape is CONVEX. What are the chances of this happening, i.e. what is the value of P(X = CONVEX)?

**Ans** : 8/12

**Ques 2**: What is the probability of the mushroom being CONVEX given it is edible, i.e. P(X = CONVEX | C = edible)? This is the fraction of CONVEX mushrooms out of all the edible ones.

**Ans** : 4/8

**Ques 3**: In the previous questions, you have calculated that P(C = edible) is 8/12, P(X = CONVEX) is 8/12 and  P(X = CONVEX | C = edible) is 4/8.

What is the probability that the CONVEX mushroom is edible, P(C = edible | X = CONVEX)?

**Ans** : 4/8

**Ques 4**: In the previous question, you found the probability of the CONVEX mushroom being edible. What is the probability of the CONVEX mushroom being poisonous, P(C = poisonous | X = CONVEX)?

**Ans** : 4/8

**Ques 5**: What are the chances of a random mushroom being poisonous, i.e. P(C = poisonous)?

**Ans** : 4/12

**Ques 6**: What are the chances of a mushroom being CONVEX given it is poisonous, i.e. P(X = CONVEX | C = poisonous)?

**Ans** : 1


**Let’s analyse the results of this problem:**

The probabilities of a **CONVEX** mushroom being **edible** and **poisonous** are both 50%. 

**Note** Denominator is common in both calculations, i.e. P(X = CONVEX) = 8/12, and thus you do not need to calculate it. You can simply compare the numerators and conclude the classes based on that:

Edible: P( X = CONVEX | C = edible) $\times$ P(C = edible) =  (4/8)$\times$(8/12) = 4/12 = 33.33%

Poisonous: P( X = CONVEX | C = poisonous) $\times$ P(C = poisonous) =  (4/4)$\times$(4/12) = 4/12 = 33.33%

Since both numerators are 4/12, you cannot classify the **CONVEX** mushroom as **edible** or **poisonous** (if you consider 50% as the threshold probability for classification). The fundamental concept is that you only need to compare the numerators for the two classes and assign the class based on that.

Let’s now break down the **Bayes theorem**. The 50% probability that the CONVEX mushroom is edible (or poisonous) is a result of three probabilities. P(edible | CONVEX) is:

  - Proportional to P(edible), which tells us how abundant edible mushrooms are; if P(edible) is high, then P(edible | CONVEX) will be high simply because edible mushrooms are abundant!  
     - P(edible) is 66.66% and P(poisonous) is 33.33 %
     - This pushes the favour towards edible since they are in abundance
     
  - Proportional to P(CONVEX | edible), which explains how likely you are to find a CONVEX mushroom if you separately consider all the edible ones;
     - P(CONVEX | edible) is 50% and P(CONVEX | poisonous) is 100%
     - This pushes the favour towards poisonous since all poisonous mushrooms are CONVEX
     
  - Inversely proportional to P(CONVEX); this term cancels out while comparing the two classes
  
Thus, the numerators are equal because the two probabilities in each product balance each other out.

Edible numerator: P(edible) $\times$ P(CONVEX | edible) = 66.66% $\times$ 50% = 33.33%

Poisonous numerator: P(poisonous) $\times$ P(CONVEX | poisonous) = 33.33% $\times$ 100% = 33.33%
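To see where the 50% comes from, the two numerators can be normalised by their sum, which equals P(X = CONVEX) by the law of total probability. A minimal sketch using the values read off the 12-mushroom table:

```python
from fractions import Fraction

# Quantities read off the 12-mushroom table.
prior = {"edible": Fraction(8, 12), "poisonous": Fraction(4, 12)}
likelihood_convex = {"edible": Fraction(4, 8), "poisonous": Fraction(4, 4)}

# Numerator of Bayes' theorem for each class.
numerators = {c: prior[c] * likelihood_convex[c] for c in prior}

# P(X = CONVEX) is the sum of the numerators over all classes.
evidence = sum(numerators.values())

# Dividing by the evidence turns the numerators into posteriors.
posterior = {c: numerators[c] / evidence for c in numerators}
print(numerators, evidence, posterior)
```

The evidence comes out as 8/12, matching the direct count of convex mushrooms, and both posteriors are 1/2.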


### Conditional Independence in Naive Bayes

|Type of mushroom|Cap shape|Cap Surface|
|---|---|---|
|Edible|Convex|Smooth|
|Edible|Convex|Smooth|
|Edible|Bell|Scaly|
|Edible|Bell|Scaly|
|Poisonous|Convex|Smooth|
|Edible|Convex|Smooth|
|Poisonous|Convex|Fibrous|
|Poisonous|Convex|Scaly|
|Poisonous|Flat|Smooth|
|Poisonous|Flat|Scaly|
|Edible|Convex|Fibrous|
|Poisonous|Bell|Scaly|
|Edible|Bell|Fibrous|
|Poisonous|Knobbed|Scaly|

P(Edible = Yes | x = (Convex,Smooth)) $\propto$ P(x = (Convex,Smooth) | Edible = Yes) $\times$ P(Edible = Yes), where the common denominator P(x = (Convex,Smooth)) has been dropped.

**Naive Assumption**: Cap-Shape and Cap-Surface are conditionally independent given the class. The expression then becomes:

P(Edible = Yes | x = (Convex,Smooth)) $\propto$ **P(x = (Convex,Smooth) | Edible = Yes) $\times$ P(Edible = Yes)** = P(x = Convex | Edible = Yes) $\times$ P(x = Smooth | Edible = Yes) $\times$ P(Edible = Yes) = 4/7 $\times$ 3/7 $\times$ 7/14 = 12/98

Similarly for **Poisonous**,

P(Edible = No | x = (Convex,Smooth)) $\propto$ **P(x = (Convex,Smooth) | Edible = No) $\times$ P(Edible = No)** = P(x = Convex | Edible = No) $\times$ P(x = Smooth | Edible = No) $\times$ P(Edible = No) = 3/7 $\times$ 2/7 $\times$ 7/14 = 6/98

Classifying the test point: since 12/98 > 6/98, P(Edible | x = (Convex,Smooth)) > P(Poisonous | x = (Convex,Smooth)).

The test point is classified as **Edible**.

**Naïve Bayes follows an assumption that the variables are conditionally independent given the class i.e.**

P(X = convex,smooth | C= edible) = P(X=smooth | C=edible) $\times$ P(X=convex | C=edible)

Hence the name “Naïve”: in most real-world situations the variables are not conditionally independent given the class label, yet the algorithm often works well nonetheless.

Let us say you are trying to compute **P(A and B | C)**. If **P(A | C)** is the same for all values of **B** and **P(B | C)** is the same for all values of **A**, then there is **conditional independence** between A and B, given C. This is when P(A and B | C) = P(A | C) x P(B | C), implying that A is not conditioned on B or vice versa.

Despite this assumption, Naive Bayes has proven to work very well in some cases, such as text classification.
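The sketch below recomputes the two scores from the 14-mushroom table and also compares the naive product with the joint frequency counted directly from the data, to show that the assumption is an approximation. The function names are illustrative only.

```python
from fractions import Fraction

# (class, cap shape, cap surface) rows copied from the 14-row table.
rows = [
    ("Edible", "Convex", "Smooth"), ("Edible", "Convex", "Smooth"),
    ("Edible", "Bell", "Scaly"), ("Edible", "Bell", "Scaly"),
    ("Poisonous", "Convex", "Smooth"), ("Edible", "Convex", "Smooth"),
    ("Poisonous", "Convex", "Fibrous"), ("Poisonous", "Convex", "Scaly"),
    ("Poisonous", "Flat", "Smooth"), ("Poisonous", "Flat", "Scaly"),
    ("Edible", "Convex", "Fibrous"), ("Poisonous", "Bell", "Scaly"),
    ("Edible", "Bell", "Fibrous"), ("Poisonous", "Knobbed", "Scaly"),
]


def naive_likelihood(label, shape, surface):
    """P(shape | label) * P(surface | label), using the naive assumption."""
    in_class = [r for r in rows if r[0] == label]
    p_shape = Fraction(sum(r[1] == shape for r in in_class), len(in_class))
    p_surface = Fraction(sum(r[2] == surface for r in in_class), len(in_class))
    return p_shape * p_surface


def joint_likelihood(label, shape, surface):
    """P(shape, surface | label) counted directly, without the assumption."""
    in_class = [r for r in rows if r[0] == label]
    hits = sum(r[1] == shape and r[2] == surface for r in in_class)
    return Fraction(hits, len(in_class))


def prior(label):
    return Fraction(sum(r[0] == label for r in rows), len(rows))


edible_score = naive_likelihood("Edible", "Convex", "Smooth") * prior("Edible")
poisonous_score = naive_likelihood("Poisonous", "Convex", "Smooth") * prior("Poisonous")
print(edible_score, poisonous_score)
print(naive_likelihood("Edible", "Convex", "Smooth"),
      joint_likelihood("Edible", "Convex", "Smooth"))
```

For the Edible class the naive product is 12/49 while the directly counted joint frequency is 3/7, so the two differ in this small table, yet the resulting classification is still Edible.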


### Comprehension - Naive Bayes with Multiple Features

|S.No|Type of mushroom|Cap shape|Cap Surface|
|---|---|---|---|
|1.|Poisonous|Convex|Scaly|
|2.|Edible|Convex|Scaly|
|3.|Poisonous|Convex|Smooth|
|4.|Edible|Convex|Smooth|
|5.|Edible|Convex|Fibrous|
|6.|Poisonous|Convex|Scaly|
|7.|Edible|Bell|Scaly|
|8.|Edible|Bell|Scaly|
|9.|Edible|Convex|Scaly|
|10.|Poisonous|Convex|Scaly|
|11.|Edible|Flat|Scaly|
|12.|Edible|Bell|Smooth|

**Useful numbers:**

Number of edible mushrooms = 8

Number of poisonous mushrooms = 4

**Ques 1** : Say you take a new mushroom which is (CONVEX, SMOOTH). What is the numerator of P(C = edible | X = CONVEX, SMOOTH)?

**Ans** : P(edible) x P(CONVEX | edible) x P(SMOOTH | edible)

**Ques 2** : What is P(CONVEX | edible)?

**Ans** : 4/8

**Ques 3** : What is P(SMOOTH| edible)?

**Ans** : 2/8

**Ques 4** : What is P(CONVEX | poisonous)?

**Ans** : 1

**Ques 5** : What is P(SMOOTH| poisonous)?

**Ans** : 1/4

**Ques 6** : In the previous questions, you have calculated that:

P(CONVEX | edible) = 4/8

P(SMOOTH | edible) = 2/8

P(CONVEX | poisonous) = 1 and

P(SMOOTH | poisonous) = 1/4

If all mushrooms above 50% probability of being edible are classified as edible, is the CONVEX, SMOOTH mushroom edible?

**Ans** : Cannot be decided; it is a tie. Writing d for the common denominator P(CONVEX, SMOOTH):

P(edible | CONVEX, SMOOTH) = P(edible) $\times$ P(CONVEX | edible) $\times$ P(SMOOTH | edible) / d = (8/12)(4/8)(2/8)/d = 1/(12d)

P(poisonous | CONVEX, SMOOTH) = P(poisonous) $\times$ P(CONVEX | poisonous) $\times$ P(SMOOTH | poisonous) / d = (4/12)(1)(1/4)/d = 1/(12d)

Since both posteriors are equal to 1/(12d), i.e. 50% each, this mushroom cannot be classified with a 50% threshold. If you take a higher threshold, such as 60% (which is reasonable since you don't want to be responsible for people eating poisonous mushrooms), it will be classified as poisonous. Why? Because with a 60% threshold you require P(edible | CONVEX, SMOOTH) to be at least 60%, and it is only 50%.

**Classification Rule**
  - If P($C_{1}$|x) > P($C_{2}$|x)
  - x is classified as $C_{1}$
  - This is the maximum a posteriori (MAP) classification rule
  
  
 $P(C_{i}|X) = \frac{P(X|C_{i}) P(C_{i})}{P(X)}$
 
   - $P(C_{i})$ is known as the **prior probability**. It is the probability of an event occurring before the collection of new data. The prior plays an important role when classifying with Naïve Bayes, as it strongly influences the class of the new test point.
   - $P(X|C_{i})$ represents the **likelihood function**. It tells the likelihood of a data point occurring in a category. The conditional independence assumption is leveraged while computing the likelihood.
   - The denominator $P(X)$ is the same for every class, so it can be ignored when comparing classes without affecting the final outcome.
   - $P(C_{i}|X)$ is called the **posterior probability**. It is compared across the classes, and the test point is assigned the class whose posterior probability is greatest.
   

### Prior, Posterior and Likelihood

**Bayesian classification** is based on the principle that ‘you combine your **prior knowledge or beliefs about a population** with the **case specific information** to get the **actual (posterior) probability**’.

In many cases, the prior has a tremendous effect on the classification. If the prior is neutral (50% are edible), then the likelihood may largely decide the outcome.

If the likelihood is neutral (e.g. 50%), then the prior probability may largely decide the outcome. If the prior is way too powerful, then likelihood often barely affects the result.

**Posterior probability**

It is the outcome which combines **prior beliefs** and **case-specific information**. It is a balanced outcome of the prior and the likelihood.


|S.No|Type of mushroom|Cap shape|Cap Surface|
|---|---|---|---|
|1.|Poisonous|Convex|Scaly|
|2.|Edible|Convex|Scaly|
|3.|Poisonous|Convex|Smooth|
|4.|Edible|Convex|Smooth|
|5.|Edible|Convex|Fibrous|
|6.|Poisonous|Convex|Scaly|
|7.|Edible|Bell|Scaly|
|8.|Edible|Bell|Scaly|
|9.|Edible|Convex|Scaly|
|10.|Poisonous|Convex|Scaly|
|11.|Edible|Flat|Scaly|
|12.|Edible|Bell|Smooth|

**Ques 1** : In the table above, the prior probability is higher for a mushroom being:

**Ans** : Edible 

**Ques 2** : Say you consider a (CONVEX, SCALY) mushroom. The likelihood is higher for it being:

**Ans** : Poisonous (75%) > Edible (31.25%)

**Ques 3** : The values of P(X|Class). P(Class) where X = (CONVEX, SCALY) for both classes (edible and poisonous) are respectively:

**Ans** : Edible = 20.8 %; Poisonous = 25.0 % 

Edible: P(CONVEX | Edible). P(SCALY | EDIBLE). P(Edible) = (4/8)(5/8)(8/12) = 20.8% , Poisonous: P(CONVEX | poisonous). P(SCALY | poisonous). P(Poisonous) = (4/4)(3/4)(4/12) = 25%

**Ques 4** : For the (CONVEX, SCALY) mushroom:

**Ans** : The prior is in favor of edible; posterior in favor of poisonous
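The interplay of prior, likelihood and posterior for the (CONVEX, SCALY) mushroom can be reproduced with a short sketch. The function `summarise` is a hypothetical helper that works for any number of categorical features; it requires Python 3.8 or later for `math.prod`.

```python
from fractions import Fraction
from math import prod

# (class, (cap shape, cap surface)) rows copied from the 12-row table.
rows = [
    ("Poisonous", ("Convex", "Scaly")), ("Edible", ("Convex", "Scaly")),
    ("Poisonous", ("Convex", "Smooth")), ("Edible", ("Convex", "Smooth")),
    ("Edible", ("Convex", "Fibrous")), ("Poisonous", ("Convex", "Scaly")),
    ("Edible", ("Bell", "Scaly")), ("Edible", ("Bell", "Scaly")),
    ("Edible", ("Convex", "Scaly")), ("Poisonous", ("Convex", "Scaly")),
    ("Edible", ("Flat", "Scaly")), ("Edible", ("Bell", "Smooth")),
]


def summarise(label, x):
    """Return (prior, naive likelihood, prior * likelihood) for a class."""
    in_class = [feats for c, feats in rows if c == label]
    prior = Fraction(len(in_class), len(rows))
    likelihood = prod(
        Fraction(sum(f[i] == value for f in in_class), len(in_class))
        for i, value in enumerate(x)
    )
    return prior, likelihood, prior * likelihood


x = ("Convex", "Scaly")
for label in ("Edible", "Poisonous"):
    prior, likelihood, numerator = summarise(label, x)
    print(label, prior, likelihood, float(numerator))
```

The prior favours Edible (8/12 against 4/12), the likelihood favours Poisonous (3/4 against 5/16), and the product favours Poisonous (1/4 against 5/24).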


### Graded Questions

|S.No|	Class|	Freq 1|	Freq 2|	Freq 3|	Freq 4|
|---|---|---|---|---|---|
|1.|	Spam|	free|	buy|	limited|	hurry|
|2.|	Ham|	reply|	data|	report|	presentation|
|3.|	Ham|	report|	presentation|	file|	end of day|
|4.|	Spam|	limited|	file|	buy|	click|
|5.|	Ham|	meeting|	timelines|	limited|	documents|
|6.|	Spam|	hurry|	data|	buy|	stock|
|7.|	Spam|	limited|	sex|	click|	viagra|
|8.|	Ham|	presentation|	end of day|	data|	report|
|9.|	Ham|	reply|	data|	presentation|	click|
|10.|	Spam|	free|	reply|	weekend|	click|
|11.|	Spam|	limited|	click|	free|	hurry|
|12.|	Ham|	meeting|	end of day|	weekend|	data|
|13.|	Spam|	hurry|	weekend|	stock|	offer|
|14.|	Ham|	report| 	presentation|	file|	end of day|
|15.|	Ham|	free|	timelines|	reply|	offer|


**Ques 1** : What is the prior probability of a mail being spam, P(class = spam)?

**Ans** : 7/15

**Ques 2** : What does Naive Bayes assume while classifying spam or ham mails?

**Ans** : That frequency of keywords like hurry, free, offer etc. are conditionally independent of each other

**Ques 3** : Consider an email with the vector of features X = (free, data, weekend, click). What is the likelihood, P(X | spam)?

**Ans** : 4/ 2401 

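P(X | spam) = P(free|spam). P(data|spam). P(weekend | spam). P(click|spam) = (2/7)(1/7)(1/7)(2/7) = 4/2401. Each word is matched against its own column: for example, 'free' appears as Freq 1 in 2 of the 7 spam mails.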

**Ques 4** : Consider an email with the vector of features X = (free, data, weekend, click). What is the likelihood, P(X | ham)?

**Ans** : 2/ 4096

P(X | ham) = P(free|ham). P(data|ham). P(weekend | ham). P(click | ham) = (1/8)(2/8)(1/8)(1/8) = 2/4096.

**Ques 5** : The value of P(X|Class). P(Class) for class = spam for X = (free, data, weekend, click)?

**Ans** : (4/ 2401)(7/ 15)

**Ques 6** : What is the posterior for class = ham (i.e. without division by denominator) for the feature vector  X = (free, data, weekend, click)?

**Ans** : (2/4096)(8/15)

**Ques 7** : Which class should be point X = (free, data, weekend, click) be classified into?

**Ans** : Spam (The (numerators of) posteriors, P(Class | X) for spam and ham are respectively (7/15)(4/2401) and (8/15)(2/4096), spam's being higher.)
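The answers above can be reproduced with a short sketch. It assumes, consistently with the fractions used in the answers, that each feature is positional: the i-th word of X is compared only with the Freq i column. The function names are illustrative, and `math.prod` needs Python 3.8 or later.

```python
from fractions import Fraction
from math import prod

# (class, (Freq 1, Freq 2, Freq 3, Freq 4)) rows copied from the table.
emails = [
    ("Spam", ("free", "buy", "limited", "hurry")),
    ("Ham", ("reply", "data", "report", "presentation")),
    ("Ham", ("report", "presentation", "file", "end of day")),
    ("Spam", ("limited", "file", "buy", "click")),
    ("Ham", ("meeting", "timelines", "limited", "documents")),
    ("Spam", ("hurry", "data", "buy", "stock")),
    ("Spam", ("limited", "sex", "click", "viagra")),
    ("Ham", ("presentation", "end of day", "data", "report")),
    ("Ham", ("reply", "data", "presentation", "click")),
    ("Spam", ("free", "reply", "weekend", "click")),
    ("Spam", ("limited", "click", "free", "hurry")),
    ("Ham", ("meeting", "end of day", "weekend", "data")),
    ("Spam", ("hurry", "weekend", "stock", "offer")),
    ("Ham", ("report", "presentation", "file", "end of day")),
    ("Ham", ("free", "timelines", "reply", "offer")),
]


def prior(label):
    return Fraction(sum(c == label for c, _ in emails), len(emails))


def likelihood(label, x):
    """Naive likelihood: product of per-column match frequencies."""
    in_class = [feats for c, feats in emails if c == label]
    return prod(
        Fraction(sum(f[i] == word for f in in_class), len(in_class))
        for i, word in enumerate(x)
    )


x = ("free", "data", "weekend", "click")
scores = {label: likelihood(label, x) * prior(label) for label in ("Spam", "Ham")}
prediction = max(scores, key=scores.get)
print(likelihood("Spam", x), likelihood("Ham", x), prediction)
```

Under this positional reading the likelihoods are 4/2401 for spam and 2/4096 for ham, and the mail is classified as Spam.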


![image.png](attachment:image.png)

**Ques 1** : What is the accuracy of the model?

**Ans** : 940/1000

**Ques 2** : What is the sensitivity of the model?

**Ans** : 440/480

**Ques 3** : What is the specificity of the model?

**Ans** : 500/520

**Ques 4** : Given that you do not want to misclassify any genuine emails, which metric should be as high as possible?

**Ans** : Specificity (Fraction of correctly classified hams is measured by specificity (true negative rate).)
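The three metrics can be written out explicitly. The counts below are not read from the image; they are inferred from the answers above (940/1000, 440/480, 500/520), assuming spam is the positive class and ham the negative class.

```python
# Counts inferred from the answers above (an assumption, not taken from the image).
tp, fn = 440, 40   # spam mails: correctly flagged / missed
tn, fp = 500, 20   # ham mails: correctly passed / wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all mails classified correctly
sensitivity = tp / (tp + fn)                 # true positive rate (spam caught)
specificity = tn / (tn + fp)                 # true negative rate (ham kept)

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Raising specificity means fewer genuine (ham) mails are wrongly flagged as spam.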


## Naive Bayes for Text Classification

Suppose the document is -

**"The movie Padmavat is a great example of beautiful cinematography".**

**Ques 1** : Given the document could you list down the stop words being used in it?

**Ans** : {'The','is','a','of'}

**Ques 2** : Could you enter the no. of words in the vocabulary after removing stop words?

**Ans** : 6


#### Bag of Words Representation

|Document|Class|
|---|---|
|Upgrad is a great educational institution|Education|
|Educational greatness depends on ethics|Education|
|A Story of great ethics and educational greatness|Education|
|Sholey is a great cinema|Cinema|
|good movie depends on good story|Cinema|


**Dictionary/Vocabulary**

|SNo|Dictionary before stop words removal|
|---|---|
|0|and|
|1|cinema|
|2|depends|
|3|Educational|
|4|ethics|
|5|good|
|6|great|
|7|greatness|
|8|institution|
|9|is|
|10|movie|
|11|of|
|12|on|
|13|Sholey|
|14|story|
|15|Upgrad|

|SNo|Stop Words|
|---|---|
|0|and|
|1|is|
|2|of|
|3|on|

|SNo|Dictionary after stop words removal|
|---|---|
|0|cinema|
|1|depends|
|2|Educational|
|3|ethics|
|4|good|
|5|great|
|6|greatness|
|7|institution|
|8|movie|
|9|Sholey|
|10|story|
|11|Upgrad|


**Bag of Words Representation**

|cinema|depends|Educational|ethics|good|great|greatness|institution|movie|Sholey|story|Upgrad|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0|1|0|0|1|0|1|0|0|0|1|
|0|1|1|1|0|0|1|0|0|0|0|0|
|0|0|1|1|0|1|1|0|0|0|1|0|
|1|0|0|0|0|1|0|0|0|1|0|0|
|0|1|0|0|2|0|0|0|1|0|1|0|

The rows are the documents and the columns are the words of the dictionary; each cell holds the number of times that word occurs in that document.

**Ques 1** : Why is it called the Bag of Words representation?

**Ans** : Because the sentences are broken down into words and the ordering no longer matters, as if the words were put in a bag and shuffled.
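A bag-of-words matrix like the one above can be built with a few lines of plain Python. This is a minimal sketch: words are lower-cased, and the article "a" is added to the stop-word set as an assumption, since it appears in the documents but not in the 12-word dictionary.

```python
from collections import Counter

documents = [
    "Upgrad is a great educational institution",
    "Educational greatness depends on ethics",
    "A Story of great ethics and educational greatness",
    "Sholey is a great cinema",
    "good movie depends on good story",
]
# "and", "is", "of", "on" come from the stop-word table; "a" is an assumption.
stop_words = {"a", "and", "is", "of", "on"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


# Sorted dictionary of all remaining words.
vocabulary = sorted({w for doc in documents for w in tokenize(doc)})

# One row of word counts per document, columns ordered like the dictionary.
matrix = []
for doc in documents:
    counts = Counter(tokenize(doc))
    matrix.append([counts[w] for w in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Because only counts are stored, two documents containing the same words in a different order produce identical rows.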

### Document Classifier

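$$D^{education} = \begin{bmatrix} 0&0&1&0&0&1&0&1&0&0&0&1 \\ 0&1&1&1&0&0&1&0&0&0&0&0 \\ 0&0&1&1&0&1&1&0&0&0&1&0 \end{bmatrix}$$

Total number of words in the Education class (sum of all entries) = 13

$$D^{cinema} = \begin{bmatrix} 1&0&0&0&0&1&0&0&0&1&0&0 \\ 0&1&0&0&2&0&0&0&1&0&1&0 \end{bmatrix}$$

Total number of words in the Cinema class (sum of all entries) = 8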


**Ques 1** : Let us suppose that the following sentence is the "document" -

“An educational firm should be highly focussed on the quality it delivers.“

Can you specify the value of |V|, after removing the stop words, for the above document? Please recall that |V| represents length of the dictionary.

**Ans** : 6

**Prior** = { P(Education) = 3/5 and P(Cinema) = 2/5 }

**Posterior** = P(Education | w1, w2, ..., wn) and P(Cinema | w1, w2, ..., wn)

P(Cinema | w1, w2, ..., wn) = $\frac {P(w1, w2, \ldots, wn | Cinema) \times P(Cinema)}{P(w1, w2, \ldots, wn)}$

**Likelihood** = P(w1, w2, ..., wn | Cinema)

As per the naive assumption, the likelihood for a class C factorises as:

P(w1, w2, ..., wn | C) = P(w1|C) $\times$ P(w2|C) $\times$ ... $\times$ P(wn|C)

||n-education|P(w given C = education)|n-cinema|P(w given C = cinema)|
|---|---|---|---|---|
|w1=cinema|0|0|1|1/8|
|w2=depends|1|1/13|1|1/8|
|w3=educational|3|3/13|0|0|
|w4=ethics|2|2/13|0|0|
|w5=good|0|0|2|2/8|
|w6=great|2|2/13|1|1/8|
|w7=greatness|2|2/13|0|0|
|w8=institution|1|1/13|0|0|
|w9=movie|0|0|1|1/8|
|w10=Sholey|0|0|1|1/8|
|w11=Story|1|1/13|1|1/8|
|w12=Upgrad|1|1/13|0|0|


Now take the test document "great story". Is it education or cinema?

P(education|"great story") $\propto$ P(great|education) $\times$ P(story|education) $\times$ P(education) = 2/13 $\times$ 1/13 $\times$ 3/5 ≈ 0.007

P(cinema|"great story") $\propto$ P(great|cinema) $\times$ P(story|cinema) $\times$ P(cinema) = 1/8 $\times$ 1/8 $\times$ 2/5 ≈ 0.006

Since the value for education is greater than the value for cinema, the test document is classified into the "Education" class.
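The same calculation can be sketched in Python using the per-class word counts from the table. The function name `score` is illustrative; it returns the unnormalised posterior (likelihood times prior) without any smoothing.

```python
from fractions import Fraction

# Word counts per class, copied from the n-education and n-cinema columns.
word_counts = {
    "education": {"depends": 1, "educational": 3, "ethics": 2, "great": 2,
                  "greatness": 2, "institution": 1, "story": 1, "upgrad": 1},
    "cinema": {"cinema": 1, "depends": 1, "good": 2, "great": 1,
               "movie": 1, "sholey": 1, "story": 1},
}
priors = {"education": Fraction(3, 5), "cinema": Fraction(2, 5)}


def score(words, label):
    """Prior times the product of P(word | class), with no smoothing."""
    counts = word_counts[label]
    total = sum(counts.values())      # 13 for education, 8 for cinema
    result = priors[label]
    for w in words:
        result *= Fraction(counts.get(w, 0), total)
    return result


test_doc = ["great", "story"]
scores = {label: score(test_doc, label) for label in priors}
prediction = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, prediction)
```

A word that never occurs in a class contributes a factor of 0 here, which makes the whole score 0 for that class.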


**Ques 1** : Suppose you have the following dictionary based on some training document narrating stories about love or action.

|W1|W2|W3|W4|W5|W6|W7|W8|
|---|---|---|---|---|---|---|---|
|bike|couple|fast|furious|tears|love|shoot|songs|

What will be feature vector of the document “A fast moving bike entered into the complex and shoot the couple.“

**Ans** : {1,1,1,0,0,0,1,0}

**Ques 2** : Assume the following likelihoods i.e P(word|class) for each word being part of a positive or negative review of a hotel. 

||Pos|Neg|
|---|---|---|
|i|0.09|0.16|
|loved|0.30|0.06|
|the|0.06|0.05|
|food|0.04|0.35|
|and|0.08|0.07|
|cleanliness|0.40|0.03|

What class will Naive Bayes assign to the sentence “I loved the food and cleanliness.” if the priors of the classes are considered equal ( it is equivalent to not considering the prior )

**Ans** : Pos

**Ques 3** : What class will Naive Bayes assign to the sentence “I loved the food and cleanliness.” if the prior probabilities for positive and negative classes are considered 0.1 and 0.9 respectively.

**Ans** : Neg
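The effect of the prior in the last two questions can be checked with a small sketch. It sums logarithms instead of multiplying probabilities, a common way to avoid very small numbers; this is a choice made for the example, not something the questions require. The function name `classify` is illustrative.

```python
import math

# P(word | class) values copied from the table above.
likelihoods = {
    "Pos": {"i": 0.09, "loved": 0.30, "the": 0.06, "food": 0.04,
            "and": 0.08, "cleanliness": 0.40},
    "Neg": {"i": 0.16, "loved": 0.06, "the": 0.05, "food": 0.35,
            "and": 0.07, "cleanliness": 0.03},
}


def classify(sentence, priors):
    """Return the class with the largest log(prior) + sum of log-likelihoods."""
    words = sentence.lower().replace(".", "").split()
    log_scores = {
        label: math.log(priors[label]) + sum(math.log(table[w]) for w in words)
        for label, table in likelihoods.items()
    }
    return max(log_scores, key=log_scores.get)


sentence = "I loved the food and cleanliness."
print(classify(sentence, {"Pos": 0.5, "Neg": 0.5}))
print(classify(sentence, {"Pos": 0.1, "Neg": 0.9}))
```

With equal priors the likelihood decides the outcome; with a 0.1 / 0.9 prior the strong prior for the negative class outweighs it.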


### Laplace Smoothing

You came across the **‘zero probability problem’** - the probability of a word which has never appeared in a class (though it may have appeared in the dataset in another class) is 0.

You will now understand how a technique called **‘Laplace smoothing’** helps solve this problem.

Why do we need **Laplace smoothing**?

Test document: "very good educational institution"

"very" is ignored (it is not present in the dictionary).

P(Education | "good educational institution") $\propto$ P(good | Education) $\times$ P(educational | Education) $\times$ P(institution | Education) $\times$ P(Education) = 0 $\times$ 3/13 $\times$ 1/13 $\times$ 3/5 = 0 {Not valid}

P(Cinema | "good educational institution") $\propto$ P(good | Cinema) $\times$ P(educational | Cinema) $\times$ P(institution | Cinema) $\times$ P(Cinema) = 2/8 $\times$ 0 $\times$ 0 $\times$ 2/5 = 0 {Not valid}

Both values are 0, so the document can't be classified into either class. This is where **Laplace smoothing** helps.

||n-education|P(w given C = education)|n-cinema|P(w given C = cinema)|
|---|---|---|---|---|
|w1=cinema|0+1=1|1/(13+12)=1/25|1+1=2|2/(8+12)=2/20|
|w2=depends|1+1=2|2/(13+12)=2/25|1+1=2|2/(8+12)=2/20|
|w3=educational|3+1=4|4/(13+12)=4/25|0+1=1|1/(8+12)=1/20|
|w4=ethics|2+1=3|3/(13+12)=3/25|0+1=1|1/(8+12)=1/20|
|w5=good|0+1=1|1/(13+12)=1/25|2+1=3|3/(8+12)=3/20|
|w6=great|2+1=3|3/(13+12)=3/25|1+1=2|2/(8+12)=2/20|
|w7=greatness|2+1=3|3/(13+12)=3/25|0+1=1|1/(8+12)=1/20|
|w8=institution|1+1=2|2/(13+12)=2/25|0+1=1|1/(8+12)=1/20|
|w9=movie|0+1=1|1/(13+12)=1/25|1+1=2|2/(8+12)=2/20|
|w10=Sholey|0+1=1|1/(13+12)=1/25|1+1=2|2/(8+12)=2/20|
|w11=Story|1+1=2|2/(13+12)=2/25|1+1=2|2/(8+12)=2/20|
|w12=Upgrad|1+1=2|2/(13+12)=2/25|0+1=1|1/(8+12)=1/20|

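Now, after applying Laplace smoothing to the test document "good educational institution":

Education = 1/25 $\times$ 4/25 $\times$ 2/25 $\times$ 0.6 ≈ 0.0003

Cinema = 3/20 $\times$ 1/20 $\times$ 1/20 $\times$ 0.4 = 0.00015

Hence, the document can now be classified into the "Education" class.

**Laplace Smoothing**: P(w | C) = $\frac{n_{w,C} + \alpha}{n_{C} + \alpha |V|}$, where $n_{w,C}$ is the number of times word w occurs in class C, $n_{C}$ is the total number of words in class C, |V| is the size of the dictionary and $\alpha$ is the smoothing parameter. The table above uses $\alpha$ = 1 and |V| = 12.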

Please note that - If there are words occurring in a test sentence which are not a part of the dictionary, then they will not be considered as part of the feature vector since it only considers the words that are part of the dictionary. These new words will be completely ignored.
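A minimal sketch of the smoothed calculation follows. The parameter name `alpha` and the function names are illustrative; with `alpha=1` the per-word probabilities match the smoothed table, and words outside the dictionary are skipped as described above.

```python
from fractions import Fraction

# Word counts per class, copied from the unsmoothed table.
word_counts = {
    "education": {"depends": 1, "educational": 3, "ethics": 2, "great": 2,
                  "greatness": 2, "institution": 1, "story": 1, "upgrad": 1},
    "cinema": {"cinema": 1, "depends": 1, "good": 2, "great": 1,
               "movie": 1, "sholey": 1, "story": 1},
}
priors = {"education": Fraction(3, 5), "cinema": Fraction(2, 5)}
vocabulary = set().union(*word_counts.values())   # 12 distinct words


def smoothed_probability(word, label, alpha=1):
    """(count + alpha) / (class total + alpha * |V|)."""
    counts = word_counts[label]
    total = sum(counts.values())
    return Fraction(counts.get(word, 0) + alpha, total + alpha * len(vocabulary))


def score(words, label, alpha=1):
    """Prior times smoothed likelihoods; out-of-dictionary words are ignored."""
    result = priors[label]
    for w in words:
        if w in vocabulary:
            result *= smoothed_probability(w, label, alpha)
    return result


test_doc = "very good educational institution".split()
scores = {label: score(test_doc, label) for label in priors}
prediction = max(scores, key=scores.get)
print({k: float(v) for k, v in scores.items()}, prediction)
```

No factor is zero any more, so the two classes can be compared and the document goes to the Education class.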


### Quick Introduction to Bernoulli Naive Bayes

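Bernoulli Naive Bayes only records whether a word occurs in the document or not, so every entry of the feature vector is 0 or 1.

Multinomial Naive Bayes instead uses the number of times each word occurs in the document.

For the five documents used earlier, the binarized matrix is: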

D = $$\begin{bmatrix} 0&0&1&0&0&1&0&1&0&0&0&1 \\ 0&1&1&1&0&0&1&0&0&0&0&0 \\ 0&0&1&1&0&1&1&0&0&0&1&0 \\ 1&0&0&0&0&1&0&0&0&1&0&0 \\ 0&1&0&0&1&0&0&0&1&0&1&0 \end{bmatrix}$$


**Sparse Matrix**

![image-2.png](attachment:image-2.png)

**Compressed Sparse Matrix**

![image-3.png](attachment:image-3.png)


|Doc.No.|Document|Class|
|---|---|---|
|0|Coffee Tea  Soup Coffee Coffee|Hot|
|1|Coffee is hot and so is Soup  and Tea|Hot|
|2|Espresso is a hot Coffee  and not a Tea|Hot|
|3|Coffee is neither Tea nor Soup|Hot|
|4|Sprite Pepsi  Cold Coffee and cold Tea|Cold|

**Ques 1** : How many words will be there in the dictionary vector without stop words?

**Ans** : 8

**Ques 2** : What will the feature vector look like after transforming the following document?

“Coffee is neither Tea nor Soup”

The words in the dictionary are ordered as shown below:

|coffee|cold|espresso|hot|pepsi|soup|sprite|tea|
|---|---|---|---|---|---|---|---|

**Ans** : 1,0,0,0,0,1,0,1

**Ques 3** : What will the feature vector look like after transforming the following document?

“I hate cold Coffee but love Tea and hot Coffee”

The words in the dictionary are ordered as shown above.

**Ans** : 2,1,0,1,0,0,0,1

**Ques 4** : We have been asked to classify a new document whose content is not yet disclosed. What most likely will its class be?

**Ans** : Hot (In absence of the extra information we look at the prior. The prior probability of a document being of class Hot is ⅘ i.e 80%. Because 4 documents in the corpus of 5 documents belong to the Hot class)

**Ques 5** : What is the probability of word “Coffee” appearing in a document which has been classified as "Hot" if we are planning to do a Multinomial Naive Bayes Classification?

**Ans** : 6/16

**Ques 6** : What is Binarization of a feature vector? (Bernoulli Naive Bayes)

**Ans** : Converting all non-zero word count of a feature vector to 1 and leaving zero counts as it is

**Ques 7** : What is Binarized feature vector for the document  “I hate cold Coffee but love Tea and hot Coffee”?

|coffee|cold|espresso|hot|pepsi|soup|sprite|tea|
|---|---|---|---|---|---|---|---|

**Ans** : 1,1,0,1,0,0,0,1

**Ques 8** : What is the correct expression for the likelihood of document  “Coffee and Tea” for the “Hot” class if we are planning to do a Multinomial Naive Bayes Classification?

**Ans** : P(Coffee | Hot) * P(Tea | Hot)

**Ques 9** : What is the value of Likelihood of document  “Coffee and Tea” for the “Cold” class if we are planning to do a Multinomial Naive Bayes Classification?

**Ans** : 1/6 * 1/6



|Word|P(word given Hot)|P(word given Cold)|
|---|---|---|
|Coffee|6/16|1/6|
|cold|0|2/6|
|espresso|1/16|0|
|hot|P|0|
|pepsi|0|Q|
|soup|3/16|0|
|sprite|R|1/6|
|tea|4/16|1/6|

**Ques 1** : Few conditional probabilities  have been left blank and marked as P , Q and R. What is a possible combination of P, Q and R?

**Ans** : P = 2/16, Q = 1/6, R = 0

**Ques 2** : What is the value of P(“ I love tea and coffee”|Hot)? Use the table given above to calculate.

**Ans** : 4/16 * 6/16 


|Word|Count in Hot (+1)|P(word given Hot)|Count in Cold (+1)|P(word given Cold)|
|---|---|---|---|---|
|Coffee|6+1|(6+1)/(16+8)|1+1|(1+1)/(6+8)|
|cold|0+1|(0+1)/(16+8)|2+1|(2+1)/(6+8)|
|espresso|1+1|(1+1)/(16+8)|0+1|(0+1)/(6+8)|
|hot|2+1|(2+1)/(16+8)|0+1|(0+1)/(6+8)|
|pepsi|0+1|(0+1)/(16+8)|1+1|(1+1)/(6+8)|
|soup|3+1|(3+1)/(16+8)|0+1|(0+1)/(6+8)|
|sprite|0+1|(0+1)/(16+8)|1+1|(1+1)/(6+8)|
|tea|4+1|(4+1)/(16+8)|1+1|(1+1)/(6+8)|


**Ques 1** : What is the value of P(“ I love cold coffee”|Hot)?

**Ans** : 1/24 * 7/24

**Ques 2** : What is the most likely class for the document “cold tea”  based on the likelihood terms only (i.e. assume equal priors for both the classes)

**Ans** : Cold { P(cold tea | cold) > P(cold tea | Hot) }

**Ques 3** : Compute the most likely class for the document “cold tea”  based on likelihood and prior of the classes. Assume a naive Bayes classifier and use Laplace smoothing for the likelihoods. Its class should be

**Ans** : Hot { P(cold tea | cold)*P(cold) < P(cold tea | Hot)*P(Hot) }
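The Hot/Cold exercise can be worked end to end from the raw documents. The sketch below keeps only words from the given 8-word dictionary (so no separate stop-word list is needed), builds count and binarized feature vectors, and applies Laplace smoothing with `alpha=1`. All function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

vocabulary = ["coffee", "cold", "espresso", "hot", "pepsi", "soup", "sprite", "tea"]
training = [
    ("Coffee Tea Soup Coffee Coffee", "Hot"),
    ("Coffee is hot and so is Soup and Tea", "Hot"),
    ("Espresso is a hot Coffee and not a Tea", "Hot"),
    ("Coffee is neither Tea nor Soup", "Hot"),
    ("Sprite Pepsi Cold Coffee and cold Tea", "Cold"),
]


def tokens(text):
    """Lower-case the text and keep only dictionary words."""
    return [w for w in text.lower().split() if w in vocabulary]


def count_vector(text):
    """Multinomial features: number of occurrences of each dictionary word."""
    counts = Counter(tokens(text))
    return [counts[w] for w in vocabulary]


def binarize(vector):
    """Bernoulli features: 1 if the word occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in vector]


# Total word counts per class (16 words for Hot, 6 for Cold).
class_counts = {label: Counter() for _, label in training}
for text, label in training:
    class_counts[label].update(tokens(text))

priors = {
    label: Fraction(sum(l == label for _, l in training), len(training))
    for label in class_counts
}


def likelihood(text, label, alpha=1):
    """Multinomial likelihood with Laplace smoothing."""
    total = sum(class_counts[label].values())
    result = Fraction(1)
    for w in tokens(text):
        result *= Fraction(class_counts[label][w] + alpha,
                           total + alpha * len(vocabulary))
    return result


doc = "cold tea"
by_likelihood = max(priors, key=lambda c: likelihood(doc, c))
by_posterior = max(priors, key=lambda c: likelihood(doc, c) * priors[c])
print(by_likelihood, by_posterior)
```

On likelihood alone "cold tea" goes to Cold, but once the 4/5 prior for Hot is included the decision flips to Hot.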


### Naive Bayes Practice Questions

**Ques 1** : A bag A contains 3 Red and 4 Green balls and another bag B contains 4 Red and 6 Green balls. One bag is selected at random and a ball is drawn from it. If the ball drawn is found Green , find the probability that the bag chosen was A.

**Ans** : 20/41 

``Step 1:``

Let E1, E2 denote the events of selecting bag A and B respectively. 

Then P(E1) = P(E2) = 1/2.

 Let G denote the event that the ball chosen from the selected bag is Green.

Then we have to find P(E1|G).

``Step 2:``

By hypothesis, P(G|E1) = 4/7 and P(G|E2) = 6/10.

By Bayes' theorem, P(E1|G) = [P(G|E1) $\times$ P(E1)] / P(G)

Now what is P(G)? By the law of total probability, P(G) = P(G,E1) + P(G,E2)

= P(G|E1)P(E1) + P(G|E2)P(E2)

Therefore P(E1|G) = [P(G|E1) $\times$ P(E1)] / P(G)

= [P(G|E1)P(E1)] / [P(E1)P(G|E1) + P(E2)P(G|E2)]

= [(4/7)(1/2)] / [(1/2)(4/7) + (1/2)(6/10)] = (4/14) / (4/14 + 6/20) = 20/41
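
The same calculation can be checked numerically. The sketch below is a minimal helper (the name `posterior` and the dictionary layout are illustrative choices, not part of the original notes) that applies Bayes' theorem with the denominator expanded by the law of total probability.

```python
from fractions import Fraction


def posterior(priors, likelihoods, target):
    """Return P(target | evidence) given P(h) and P(evidence | h) for each hypothesis h."""
    # Law of total probability: P(evidence) = sum over h of P(evidence | h) * P(h)
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return priors[target] * likelihoods[target] / evidence


priors = {"A": Fraction(1, 2), "B": Fraction(1, 2)}      # bag chosen at random
p_green = {"A": Fraction(4, 7), "B": Fraction(6, 10)}    # P(Green | bag)
result = posterior(priors, p_green, "A")
print(result)  # 20/41
```

Because the helper sums over whatever hypotheses are passed in, the same function also handles problems with more than two bags.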

**Ques 2** : Bag A contains 6 Green and 4 Blue balls, bag B contains 4 Green and 6 Blue balls, and bag C contains 5 Green and 5 Blue balls. A bag is randomly selected and a ball is drawn from it. If the ball drawn is Green, find the probability that it is drawn from bag A.

**Ans** : 2/5

Let EA, EB, EC and G be the events defined as follows:

EA =   Bag A is chosen, EB = Bag B is chosen,

EC = Bag C is chosen, and G= ball drawn is Green.

``Step 1:``

Since there are three bags and one of the three bags is chosen at random, therefore

P(EA) = P(EB) = P(EC) = 1/3

If EA has occurred, then bag A has been chosen, which contains 6 Green and 4 Blue balls. The probability of drawing a Green ball from it is 6/10.

So, P(G|EA) = 6/10

Similarly, we have P(G|EB) = 4/10 and P(G|EC) = 5/10

``Step 2:``

We are required to find P(EA|G), i.e. given that the ball drawn is Green, the probability that it was drawn from bag A.

By Bayes’ theorem, we have

P(EA|G) = [P(G|EA) $\times$ P(EA)] / P(G)

Now what is P(G)? By the law of total probability, P(G) = P(G,EA) + P(G,EB) + P(G,EC)

= P(G|EA)P(EA) + P(G|EB)P(EB) + P(G|EC)P(EC)

Therefore P(EA|G) = [(6/10)(1/3)] / [(1/3)(6/10) + (1/3)(4/10) + (1/3)(5/10)] = (6/30) / (15/30) = 2/5
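
As a sanity check on the 2/5 result, the experiment can also be simulated. This is a rough Monte Carlo sketch, not part of the original solution: the function name `simulate`, the number of trials and the seed are arbitrary choices, and the estimate is only approximately equal to the exact answer.

```python
import random


def simulate(bags, trials=200_000, seed=0):
    """Estimate P(bag A | Green) by repeating the two-stage experiment."""
    rng = random.Random(seed)
    names = list(bags)
    green_draws = 0
    green_from_a = 0
    for _ in range(trials):
        bag = rng.choice(names)        # each bag is equally likely
        ball = rng.choice(bags[bag])   # draw one ball uniformly from that bag
        if ball == "Green":
            green_draws += 1
            if bag == "A":
                green_from_a += 1
    # Condition on Green: keep only the trials where a Green ball was drawn.
    return green_from_a / green_draws


bags = {
    "A": ["Green"] * 6 + ["Blue"] * 4,
    "B": ["Green"] * 4 + ["Blue"] * 6,
    "C": ["Green"] * 5 + ["Blue"] * 5,
}
estimate = simulate(bags)
print(round(estimate, 2))
```

The estimate should land close to 0.4; the conditioning step (discarding non-Green draws) is what turns the simulation into an estimate of P(EA|G) rather than P(EA).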


## Table for given questions :

|Courses|Data Science(DS)|Machine Learning(ML)|Deep Learning(DL)|Big Data(BD)|Artificial Intelligence(AI)|Total|
|---|---|---|---|---|---|---|
|Male|80|60|40|50|30|260|
|Female|70|40|50|70|10|240|
|Total|150|100|90|120|40|500|

**Note** : To prove that two variables (say A and B) are independent, we must show that

P( A AND B) = P(A | B) * P(B) = P(A) * P(B)

**Ques 1** : Given this contingency table, determine whether being Male and having joined the Big Data course are independent.

**Ans** : No, They are not.

P(A AND B) is P( Male AND BD Student) = 50/500

P(A) = P(Male) = 260/500

P(B) = P(BD Student) = 120/500

P(A | B) = P(Male GIVEN BD Student) = 50/120


**OK- now we have everything we need to check for independence:**

P( A AND B) = P(A | B) * P(B) = P(A) * P(B)

P(A | B) * P(B) = P(Male | BD Student) * P(BD Student) = 50/120 * 120/500 = 50/500 = 0.1

P(A) * P(B) = P(Male) * P(BD Student) = 260/500 * 120/500 = 0.1248

Since 0.1 ≠ 0.1248, P(Male | BD Student) * P(BD Student) is not equal to P(Male) * P(BD Student).

So NO, these are NOT independent.
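
The independence check can be written as a small script over the contingency table. This is a sketch only; the dictionary layout and the function name `check_independence` are illustrative, and the counts are copied from the table above.

```python
from fractions import Fraction

# Enrolment counts copied from the contingency table.
table = {
    "Male":   {"DS": 80, "ML": 60, "DL": 40, "BD": 50, "AI": 30},
    "Female": {"DS": 70, "ML": 40, "DL": 50, "BD": 70, "AI": 10},
}
grand_total = sum(sum(row.values()) for row in table.values())  # 500


def check_independence(gender, course):
    """Compare P(gender AND course) with P(gender) * P(course)."""
    p_joint = Fraction(table[gender][course], grand_total)
    p_gender = Fraction(sum(table[gender].values()), grand_total)
    p_course = Fraction(sum(row[course] for row in table.values()), grand_total)
    return p_joint == p_gender * p_course, p_joint, p_gender * p_course


independent, p_joint, p_product = check_independence("Male", "BD")
print(independent, p_joint, p_product)  # False 1/10 78/625
```

Here 78/625 = 0.1248, matching the hand calculation. With real survey data an exact equality test is usually too strict; a statistical test would be more appropriate there, but for this exercise the exact comparison is what the question asks for.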

**Ques 2** : The probability of a student being a Female student and a DL Student is greater than the probability of a Female student being a DL student.

**Ans** : False 

Let us first calculate the probability of a student being a Female student and a DL student. We essentially want to calculate P(F and DL). From the table we can see that there are 50 students who are both Female and DL students out of a total of 500 students, hence P(F and DL) = 50/500 = 1/10.

Now the second probability is the “probability of a Female student being a DL student”, or P(DL | F). We are interested in the probability of a DL student given that the student is Female. There are 240 Females, out of which 50 are DL students. Therefore P(DL | F) = 50/240 = 5/24.

Since 1/10 < 5/24, the statement is false.
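
A short numeric comparison of the joint and the conditional probability, written as a sketch with the three counts taken from the table (variable names are illustrative):

```python
from fractions import Fraction

female_dl = 50       # Female students enrolled in DL
female_total = 240   # all Female students
grand_total = 500    # all students

p_joint = Fraction(female_dl, grand_total)   # P(F and DL)
p_cond = Fraction(female_dl, female_total)   # P(DL | F)
print(p_joint, p_cond, p_joint > p_cond)     # 1/10 5/24 False
```

More generally, P(F and DL) = P(DL | F) $\times$ P(F) and P(F) ≤ 1, so a joint probability can never exceed the corresponding conditional probability.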



||A|Not A|Total|
|---|---|---|---|
|B||||
|Not B||||
|Total|||50|


**Ques 1** : Suppose A and B are two independent events. Given that P(not A) = 0.2 and P(B) = 0.3, consider the unfilled contingency table above (50 observations in total) and answer the questions. What is the probability of A and B happening together?

**Ans** : 12/50, since P(A and B) = P(A) $\times$ P(B) = 0.8 $\times$ 0.3 = 0.24

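**Ques 2** : What is the probability of A happening and B not happening?

**Ans** : 28/50, since P(A and not B) = P(A) $\times$ P(not B) = 0.8 $\times$ 0.7 = 0.56

**Ques 3** : What is the probability of B happening given that A has not happened? In other words, what is the value of P(B|not A)?

**Ans** : 3/10, since independence implies P(B|not A) = P(B)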
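
The whole table can be filled in from the two marginals, because independence lets every cell be written as a product. The sketch below assumes the grand total of 50 implied by the answers; the variable names are illustrative.

```python
from fractions import Fraction

total = 50
p_a = 1 - Fraction(2, 10)   # P(A) = 1 - P(not A) = 0.8
p_b = Fraction(3, 10)       # P(B) = 0.3

# Under independence, each cell count is total * P(row) * P(column).
cells = {
    ("B", "A"): total * p_b * p_a,
    ("B", "Not A"): total * p_b * (1 - p_a),
    ("Not B", "A"): total * (1 - p_b) * p_a,
    ("Not B", "Not A"): total * (1 - p_b) * (1 - p_a),
}
for (row, col), count in cells.items():
    print(row, col, count)

# P(B | not A) = count(B, Not A) / count(Not A)
p_b_given_not_a = cells[("B", "Not A")] / (cells[("B", "Not A")] + cells[("Not B", "Not A")])
print(p_b_given_not_a)  # 3/10
```

The four cells come out as 12, 3, 28 and 7, which sum to 50 and reproduce the three answers above.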


**Ques 1** : Consider the following equation in a Naive Bayes classification problem.

P(x|C) = P($x_{1}$|C) . P($x_{2}$|C) . ...... . P($x_{d}$|C) = $\prod_{k = 1}^{d} P(x_{k}|C)$

Here x is a feature vector whose attributes are x1, x2, ..., xd, and C is a specific class. Which of the following is/are true w.r.t. the above information?

 1. The above equation is only true if x1, x2...xd are conditionally independent given the class C

 2. P(x|C) simply means: “How likely is it to observe this particular pattern x given that it belongs to class C?”

 3. In the context of a classification problem P(x|C) is also termed as the likelihood 

 4. P(x|C) is also termed as the posterior probability

**Ans** : 1,2,3
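
To make the product form concrete, here is a tiny sketch with made-up per-feature likelihoods (the three numbers are purely illustrative, not from any dataset). It also shows the equivalent sum of logarithms, a common way to avoid numerical underflow when d is large.

```python
import math

# Hypothetical values of P(x_k | C) for d = 3 features of one class C.
feature_likelihoods = [0.5, 0.25, 0.8]

# Naive Bayes likelihood: the product over features,
# valid only under conditional independence given C.
likelihood = math.prod(feature_likelihoods)

# Same quantity in log space: log P(x | C) = sum over k of log P(x_k | C).
log_likelihood = sum(math.log(p) for p in feature_likelihoods)

print(likelihood, math.exp(log_likelihood))
```

Both printed values agree up to floating-point rounding.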

**Ques 2** : In Bayes' rule below, the likelihood $P(X \mid C_{i})$, in the context of a classification problem, can be interpreted as

P($C_{i} \mid X$) = $\frac {P(X \mid C_{i}) \times P(C_{i})}{P(X)}$

**Ans** : What is the probability of observing a given feature vector knowing that it belongs to a class C

**Ques 3** : The prior probability, in the context of a classification problem, can be interpreted as

**Ans** : What is the probability of a class C in the sample being considered

**Ques 4** : The posterior probability, in the context of a classification problem, can be interpreted as

**Ans** : What is the probability that a particular object belongs to class C given its observed feature values

**Ques 5** : Which of the following is/are true w.r.t  the assumptions made in Naive Bayes Classification?

 1. All the rows of a collection of data are i.i.d. (independent and identically distributed), i.e., all data points are independent of each other and are drawn from the same distribution.

 2. All features are conditionally independent.

 3. Even if the samples are not independent and identically distributed (i.i.d.), they can be classified using Naive Bayes Classification

 4. Conditional independence of the features of a feature vector is not required

**Ans** : 1 and 2


## Graded Questions

**Ques 1** : What is the size of vocabulary after removing the stop words? Note that the vocabulary size depends only on the training set.

**Ans** : 35858 

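    from sklearn.feature_extraction.text import CountVectorizer

    vect = CountVectorizer(stop_words='english')  # avoid shadowing the built-in dict
    vect.fit(X_train)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    X_train_vocabs_dict = vect.get_feature_names_out()
    len(X_train_vocabs_dict)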
    
**Ques 2** : Suppose we don't want to consider (rare) words that appear in fewer than 3% of the documents, or (extremely common) words that appear in more than 80% of the documents.

Use CountVectorizer(stop_words='english', min_df=.03, max_df=.8) to create a new vocabulary from the training set. What is the size of the new vocabulary?

**Ans** : 1643

    vect3 = CountVectorizer(stop_words='english',min_df=.03,max_df=.8)
    vect3.fit(X_train)
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    X_train_vocabs = vect3.get_feature_names_out()
    len(X_train_vocabs)
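
To see what the `min_df` and `max_df` thresholds do, the document-frequency filter can be imitated in plain Python. This is a simplified sketch, not scikit-learn's implementation: it splits on whitespace only, does no stop-word removal, uses a hypothetical toy corpus, and the function name `filter_vocabulary` is made up for illustration.

```python
def filter_vocabulary(documents, min_df=0.03, max_df=0.8):
    """Keep words whose document frequency (as a proportion) lies in [min_df, max_df]."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):   # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word for word, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )


# Hypothetical toy corpus, not the movie-review data.
docs = ["great movie", "great plot", "great acting", "great cast", "great dull movie"]
vocab = filter_vocabulary(docs, min_df=0.3, max_df=0.8)
print(vocab)  # ['movie']
```

In this toy corpus "great" appears in every document and is dropped as too common, the words seen in a single document are dropped as too rare, and only "movie" (2 of 5 documents) survives.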
    
**Ques 3** : Suppose we build the vocabulary from the training data using CountVectorizer(stop_words='english', min_df=.03, max_df=.8) and then transform the test data with that same fitted vectorizer. How many nonzero entries are there in the sparse matrix (corresponding to the test data)?

**Ans** : 51663

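    vect3 = CountVectorizer(stop_words='english',min_df=.03,max_df=.8)
    vect3.fit(X_train)
    # Reuse the vocabulary learned from the training set; do not refit on the test set
    X3_test_fv = vect3.transform(X_test)
    X3_test_fv.nnz  # number of stored entries in the sparse matrix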

**Ques 4** : Train a Bernoulli Naive Bayes model on the training set and predict the classes of the test set. Each movie review in the test set has been labelled as 'Pos' or 'Neg'. What is the accuracy of the model?

Note - Dictionary should be prepared using CountVectorizer(stop_words='english', min_df=.03, max_df=.8)

**Ans** : 0.79

    from sklearn.naive_bayes import BernoulliNB

    bnb = BernoulliNB()
    bnb.fit(X1_train_fv, y_train)
    y1_pred_class = bnb.predict(X1_test_fv)

    from sklearn import metrics
    print(metrics.accuracy_score(y_test, y1_pred_class))
    
**Ques 5** : The confusion matrix is a matrix which tabulates True Negatives (TN), False Positives (FP), False Negatives (FN) and True Positives (TP) as follows:

 	
||Predicted Negative ⇓|Predicted Positive ⇓|
|---|---|---|
|Actual Negative ⇒|TN|FP|
|Actual Positive ⇒|FN|TP|

Run metrics.confusion_matrix(actual class of test data, predicted class of test data). How many reviews are actually negative but have been classified as positive by the model?

Note :

1. Dictionary should be  prepared using CountVectorizer(stop_words='english',min_df=.03,max_df=.8)

2. Remember that we have tagged negative as 0 and positive as 1. If needed, look up the documentation of confusion_matrix to understand which values in the cells correspond to positives/negatives.

3. The confusion_matrix docs state that C[i, j] is the number of observations known to be in class i and predicted to be in class j. The entry C[0, 1] is therefore the number of reviews that are actually 0 (negative) but predicted 1 (positive).

**Ans** : 23

    confusion=metrics.confusion_matrix(y_test, y1_pred_class)
    print(confusion)
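
The indexing convention can be illustrated without scikit-learn. The sketch below builds a 2x2 confusion matrix by hand for a handful of hypothetical labels (not the movie-review results) and reads off the false positives and the accuracy; the function name `confusion_matrix_2x2` is made up for illustration.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return C where C[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical labels: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

matrix = confusion_matrix_2x2(y_true, y_pred)
(tn, fp), (fn, tp) = matrix              # rows are actual classes, columns are predictions
accuracy = (tn + tp) / len(y_true)
print(matrix, fp, accuracy)  # [[3, 1], [1, 3]] 1 0.75
```

The entry `matrix[0][1]` (here `fp`) is the count asked for in the question: items that are actually negative but were classified as positive.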