# Probability
## [Definition](#definition)
## [Probability in NLP](#probability_in_nlp)
### [Why Probability is used in NLP](#why_probability-used_in_nlp)
### [How Probability is used in NLP](#how_probability_used_in_nlp)
### [Discrete Sample Space](#discrete_sample_space)
### [Probability Mass Function](#probability_mass_function)
### [Sample Space Constraints](#sample_space_constraints)
### [Events](#events)
### [Random Variable](#random_variable)
### [Probability of the sentence](#probability_of_a_sentence)
### [Chain Rule](#chain_rule)
### [Markov Assumption](#markov_assumption)
### [Language Modeling](#language_modeling)
#### [Language Model Parameters](#language_model_parameters)
#### [Choosing n-gram for Language Model](#choosing_ngram_for_lm)
#### [Reducing number of parameters](#reducing_no_of_parameters)
### [Prior Probability](#prior_probability)
### [Conditional Probability (Posterior Probability)](#conditional_probability)
### [Joint Probability](#joint_probability)
## [Perplexity](#perplexity)
### [Why Perplexity](#why_perplexity)

## Definition <a class='anchor' name='definition'></a>

Probability is defined as the likelihood that an event will occur.

E.g., flipping a coin: there is a 50% chance, or probability, that heads will come up on any given toss of a fair coin.

Probability can be expressed

- as a percentage, e.g. 60%
- in decimal form, e.g. 0.6
- as a fraction, e.g. 6/10


## Probability in NLP <a class='anchor' name='probability_in_nlp'></a>

### Why Probability is used in NLP <a class='anchor' name='why_probability-used_in_nlp'></a>
- Probability is used to estimate what the next word in a sentence could be
- It provides methods to predict, or decide on, the next word in a sequence based on sampled data
- It supports informed decisions when there is a certain degree of uncertainty and some observed data
    - Example: How * you?
    - Find all the possible words that might appear between "How" and "you"
    - To get an understanding, see Google NGram Viewer
- It provides a quantitative description of the chances or likelihoods associated with various outcomes

### How Probability is used in NLP <a class='anchor' name='how_probability_used_in_nlp'></a>

1. Probability of a sentence
    - Which sentence is most likely (probable)?
2. Probability of the next word in the sentence
    - How likely is "you" as the next word after the query sentence "How are ____?"
        - The likelihood of the next word is estimated from observation by conducting an experiment: counting the words in a document

### Discrete Sample Space <a class='anchor' name='discrete_sample_space'></a>
- Consider the following Bag of Words (_count = 52_)
    - Experiment
        - Extracting tokens from a document
    - Outcome
        - Every token/word _x_ in the document
    - Sample Document
        - A weather balloon is floating at a constant height above Earth when it releases a pack of instruments. (Level 1) a. If the pack hits the ground with a downward velocity of −73.5 m/s, how far did the pack fall? b. Calculate the distance the ball has rolled at the end of 2.2 s 
- The outcome of the experiment is 52 samples (words).
    - They constitute the _sample space_ $\Omega$, the set of all possible outcomes (tokens are lowercased; numbers and punctuation are dropped)
        - $\Omega$ = 'a', 'weather', 'balloon', 'is', 'floating', 'at', 'a', 'constant', 'height', 'above', 'earth', 'when', 'it', 'releases', 'a', 'pack', 'of', 'instruments', 'level', 'a', 'if', 'the', 'pack', 'hits', 'the', 'ground', 'with', 'a', 'downward', 'velocity', 'of', 'm', 's', 'how', 'far', 'did', 'the', 'pack', 'fall', 'b', 'calculate', 'the', 'distance', 'the', 'ball', 'has', 'rolled', 'at', 'the', 'end', 'of', 's'
- Each word in this sample belongs to $\Omega$, represented by $x \in \Omega$
- Each sample $x \in \Omega$ is assigned a probability score $P(x) \in [0, 1]$
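
The experiment above can be sketched in a few lines of Python. The tokenization rule used here (lowercase the text, keep only runs of ASCII letters) is an assumption chosen because it reproduces the 52 tokens listed above; the names `SAMPLE_DOCUMENT` and `tokenize` are illustrative only.

```python
import re

# Sample document copied from the text above.
SAMPLE_DOCUMENT = (
    "A weather balloon is floating at a constant height above Earth when it "
    "releases a pack of instruments. (Level 1) a. If the pack hits the ground "
    "with a downward velocity of −73.5 m/s, how far did the pack fall? "
    "b. Calculate the distance the ball has rolled at the end of 2.2 s"
)


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


tokens = tokenize(SAMPLE_DOCUMENT)  # the sample space: one sample per token
types = set(tokens)                 # the distinct word types

print(len(tokens))  # 52 samples
print(len(types))   # 37 word types
```

A different tokenizer (for example one that keeps numbers) would give a different sample space, so the counts depend on this choice.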

### Probability Mass Function <a class='anchor' name='probability_mass_function'></a>
- Probability Function | Probability Distribution Function
    - A _probability function_ or _probability distribution function_ distributes a total probability mass of $1$ over all the samples in the sample space $\Omega$

### Sample Space Constraints <a class='anchor' name='sample_space_constraints'></a>
- All the words in the $\Omega$, must satisfy the following constraints:
    1. $P(x) \in [0,1]$ for all $x \in \Omega$
    2. $\sum_{x \in \Omega} P(x) = 1$
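
As a minimal sketch, the two constraints can be checked mechanically. The example assumes the simplest possible probability function, one that gives every token occurrence in the sample document the same mass $\frac{1}{|\Omega|}$; the helper name `satisfies_constraints` is hypothetical.

```python
import re
from fractions import Fraction

SAMPLE_DOCUMENT = (
    "A weather balloon is floating at a constant height above Earth when it "
    "releases a pack of instruments. (Level 1) a. If the pack hits the ground "
    "with a downward velocity of −73.5 m/s, how far did the pack fall? "
    "b. Calculate the distance the ball has rolled at the end of 2.2 s"
)
# Assumed tokenization: lowercase, keep runs of ASCII letters.
tokens = re.findall(r"[a-z]+", SAMPLE_DOCUMENT.lower())

# Assumption: a uniform probability function, P(x) = 1/|Omega| for every sample x.
pmf = [Fraction(1, len(tokens)) for _ in tokens]


def satisfies_constraints(masses):
    """Check P(x) in [0, 1] for every sample and that the masses sum to 1."""
    return all(0 <= p <= 1 for p in masses) and sum(masses) == 1


print(satisfies_constraints(pmf))
```

`Fraction` keeps the arithmetic exact; with floating-point masses the sum should be compared using `math.isclose` instead of `==`.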

### Events <a class='anchor' name='events'></a>

- Events can be described as a variable taking a certain value
- An __*Event*__ is a collection of samples of the same type, $E \subseteq \Omega$
    - $P(E) = \sum_{x \in E} P(x)$
- Example
    - Consider the sample document above, with each of the 52 samples assigned probability $\frac{1}{52}$
        - Total number of words = 52.
        - The number of __*unique*__ words = 37, i.e. there are 37 __*types*__ of words in this BOW.
        - 6 word types (a, at, of, pack, s, the) have frequencies $> 1$, so 15 of the 52 tokens repeat a type already seen ($52 - 37 = 15$).
        - Example 1:
            - $|E_{\text{pack}}| = 3$
                - In the above corpus, the word type _pack_ occurs $3$ times
            - $P(E = \text{'pack'}) = \frac{3}{52} \approx 0.058$
        - Example 2:
            - $|E_{\text{the}}| = 6$
                - In the above corpus, the word type _the_ occurs $6$ times
            - $P(E = \text{'the'}) = \frac{6}{52} \approx 0.115$
        - Example 3:
            - $|E_{\text{weather}}| = 1$
                - In the above corpus, the word type _weather_ occurs only $1$ time
            - $P(E = \text{'weather'}) = \frac{1}{52} \approx 0.019$
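
The three examples can be reproduced with a short sketch of $P(E) = \sum_{x \in E} P(x)$. It assumes the same letter-only tokenization and a uniform mass of $\frac{1}{52}$ per sample; `event_probability` is an illustrative name.

```python
import re
from fractions import Fraction

SAMPLE_DOCUMENT = (
    "A weather balloon is floating at a constant height above Earth when it "
    "releases a pack of instruments. (Level 1) a. If the pack hits the ground "
    "with a downward velocity of −73.5 m/s, how far did the pack fall? "
    "b. Calculate the distance the ball has rolled at the end of 2.2 s"
)
# Assumed tokenization: lowercase, keep runs of ASCII letters.
tokens = re.findall(r"[a-z]+", SAMPLE_DOCUMENT.lower())

# Assumption: every sample carries the same probability mass.
p_sample = Fraction(1, len(tokens))


def event_probability(word_type):
    """P(E): sum P(x) over all samples x whose type is word_type."""
    return sum(p_sample for x in tokens if x == word_type)


for word in ("pack", "the", "weather"):
    p = event_probability(word)
    print(word, p, round(float(p), 3))
```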

### Random Variable <a class='anchor' name='random_variable'></a>

- A __random variable__ [8] is a variable whose possible values are numerical outcomes of a random experiment
- Two types of random variable
    - Continuous
    - Discrete
- For NLP, it will be __*Discrete*__

- To capture the type-token distinction, we use a random variable $W$.
    - $W(x)$ maps each sample $x \in \Omega$ to its word type
- $V$ is the set of types, and a single type is represented by a variable $v \in V$
- Given the random variable $W$ and a value $v$, $P(W = v)$ is the probability of the event that $W$ takes the value $v$, i.e. $P(W = v) = P(x \in \Omega: W(x) = v)$
    - Example: $P(W = \text{'the'}) = P(\text{'the'}) = \frac{6}{52} \approx 0.115$
- Random variables are useful in describing/ constructing various events
- In NLP, we will often consider random variables representing the experiment of choosing a word within a vocabulary or choosing a sentence within a language. [11]

### Probability of the sentence $W$ <a class='anchor' name='probability_of_a_sentence'></a>

> $P(W) = P(w_1, w_2, ..., w_n)$

### Chain Rule <a class='anchor' name='chain_rule'></a>

> $P(w_1, w_2, ..., w_n) = P(w_1)P(w_2|w_1)...P(w_n|w_1,...,w_{n-1})$

> $P('I got this one') = P('I', 'got', 'this', 'one')$

> $P('I got this one') = P('I') × P('got' | 'I') × P('this' | 'I got') × P('one' | 'I got this')$
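
The expansion above follows a mechanical pattern: each word is conditioned on everything before it. A minimal sketch (the function names are hypothetical) that lists the factors for any sentence:

```python
def chain_rule_factors(words):
    """Return one (word, history) pair per word; history is all preceding words."""
    return [(word, tuple(words[:i])) for i, word in enumerate(words)]


def format_factor(word, history):
    """Render a factor in the notation used above."""
    if not history:
        return f"P('{word}')"
    joined = " ".join(history)
    return f"P('{word}' | '{joined}')"


sentence = "I got this one".split()
factors = chain_rule_factors(sentence)
print(" × ".join(format_factor(w, h) for w, h in factors))
```

The sketch only enumerates the factors; turning them into numbers requires estimates of each conditional probability.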

### Markov Assumption <a class='anchor' name='markov_assumption'></a>
- From [12]
    - The probability of a word depends only on the $k-1$ previous words (history).
        - $P(w_n|w_1,w_2,...,w_{n-1})=P(w_n|w_{n+1-k},...,w_{n-1})$
        - Example: $k=2$
            - $P('I got this one') = P('I') × P('got' | 'I') × P('this' | 'got') × P('one' | 'this')$
    - This is called the __Markov Assumption__: only the closest $k-1$ preceding words are relevant:
        - *Unigram*: previous words do not matter
        - *Bigram*: only the previous word matters (like above example)
        - *Trigram*: only the previous two words matter
            - $P(\text{nextWord} \mid \text{prevWord2} \; \text{prevWord1})$
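
Under the Markov assumption, the history of each factor is simply truncated to the $k-1$ preceding words. A small sketch, with `markov_factors` as an illustrative name:

```python
def markov_factors(words, k):
    """Return (word, history) pairs where history is at most the k-1 preceding words.

    k=1 gives a unigram model, k=2 a bigram model, k=3 a trigram model.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    factors = []
    for i, word in enumerate(words):
        start = max(0, i - (k - 1))  # keep only the k-1 closest preceding words
        factors.append((word, tuple(words[start:i])))
    return factors


sentence = "I got this one".split()
for k in (1, 2, 3):
    print(k, markov_factors(sentence, k))
```

For `k=2` the result matches the bigram example above: each word is conditioned only on the word immediately before it.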

### Language Modeling <a class='anchor' name='language_modeling'></a>

- From [9]
    - Computing the probability of a Sentence $P(W)$
        - A model that computes $P(W)$ is called a __language model__
            - A better term for this would be "The Grammar"
            - But "Language Model" or LM is standard
    - Which sentence is most likely (most probable)?
        - I saw this dog running across the street.
        - Saw dog this I running across street the.
    - Why? You have a *language model* in your head
        - $P(\text{"I saw this"}) \gg P(\text{"saw dog this"})$
- From [12]
    - A language model is a probability distribution over word/character sequences
    - We would like to find a language model $P$, such that
        - $P("And nothing but the truth") \approx 0.001$
        - $P("And nuts sing on the roof") \approx 0.000000001$
    - **__Bigram Model__**
        - $P(\text{but} \mid \text{nothing}) = \frac{P(\text{nothing but})}{P(\text{nothing})} \approx \frac{C_1}{C_2}$
            - Let $C_1$ be the count of how many times the phrase "nothing but" occurred in the training corpus
            - Let $C_2$ be the count of how many times the token "nothing" occurred in the training corpus
    - **__Trigram Model__**
        - $P(\text{the} \mid \text{nothing but}) = \frac{P(\text{nothing but the})}{P(\text{nothing but})} \approx \frac{C_1}{C_2}$
            - Let $C_1$ be the count of how many times the phrase "nothing but the" occurred in the training corpus
            - Let $C_2$ be the count of how many times the phrase "nothing but" occurred in the training corpus
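
The count ratios $\frac{C_1}{C_2}$ above can be sketched as follows. The three-sentence `TOY_CORPUS` is made up purely for illustration, and the function names are hypothetical; the sketch applies no smoothing, so an unseen history raises an error instead of returning an estimate.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy training corpus, invented for this example.
TOY_CORPUS = [
    "and nothing but the truth",
    "nothing but trouble",
    "there is nothing here",
]


def ngram_counts(sentences, n):
    """Count every n-gram (as a tuple of words) within each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def estimate(sentences, word, history):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    history = tuple(history)
    if not history:
        raise ValueError("history must contain at least one word")
    c1 = ngram_counts(sentences, len(history) + 1)[history + (word,)]
    c2 = ngram_counts(sentences, len(history))[history]
    if c2 == 0:
        raise ValueError("history never seen in the corpus")
    return Fraction(c1, c2)


print(estimate(TOY_CORPUS, "but", ["nothing"]))         # bigram estimate
print(estimate(TOY_CORPUS, "the", ["nothing", "but"]))  # trigram estimate
```

In this toy corpus "nothing" occurs 3 times and "nothing but" twice, so the bigram estimate is $\frac{2}{3}$; "nothing but the" occurs once, so the trigram estimate is $\frac{1}{2}$.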

#### Language Model Parameters <a class='anchor' name='language_model_parameters'></a>

- From [12]
    - Each probability factor in the probability of a sentence is called a __model parameter__
    - Example: Markov Assumption (bigram model)
        - $P('I got this one') = P('I') × P('got' | 'I') × P('this' | 'got') × P('one' | 'this')$
            - Each probability in the above equation is a model parameter
    - The number of n-grams is exactly the number of parameters we have to learn
        - In the above example, the bigram factorization of the sentence uses 4 model parameters

#### Choosing n-gram for Language Model <a class='anchor' name='choosing_ngram_for_lm'></a>

- From [12]
    - Larger n
        - __greater discrimination__:
            - more information about the context of the specific instance
        - but __less reliability__:
            - Our model is too complex: it has too many parameters
            - Cannot estimate parameters reliably from limited data (data sparseness)
                - too many chances that the history has never been seen before
                - our estimates are not reliable because we have not seen enough examples
    - Small n
        - __less discrimination__:
            - not enough history to predict the next word very well, or model is not so good
        - but __more reliable__:
            - more instances in training data, better statistical estimates of our parameters
    - Bigrams and Trigrams are used in practice
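
The data-sparseness argument can be made concrete with a rough sketch that reuses the 52-token sample document from earlier in these notes (with an assumed letter-only tokenization). With a vocabulary of size $|V|$ there are $|V|^n$ possible n-grams, but a small corpus contains only a tiny fraction of them.

```python
import re

SAMPLE_DOCUMENT = (
    "A weather balloon is floating at a constant height above Earth when it "
    "releases a pack of instruments. (Level 1) a. If the pack hits the ground "
    "with a downward velocity of −73.5 m/s, how far did the pack fall? "
    "b. Calculate the distance the ball has rolled at the end of 2.2 s"
)
# Assumed tokenization: lowercase, keep runs of ASCII letters.
tokens = re.findall(r"[a-z]+", SAMPLE_DOCUMENT.lower())
vocab_size = len(set(tokens))


def distinct_ngrams(words, n):
    """Set of distinct n-grams that actually occur in the word sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


for n in (1, 2, 3):
    possible = vocab_size ** n                   # parameters the model could need
    observed = len(distinct_ngrams(tokens, n))   # n-grams with any evidence at all
    print(f"n={n}: possible={possible}, observed={observed}")
```

The number of possible n-grams grows exponentially in n while the number observed is bounded by the corpus length, which is why larger n gives less reliable estimates on limited data.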

#### Reducing number of parameters <a class='anchor' name='reducing_no_of_parameters'></a>

- From [12]
    - To reduce the number of parameters, we can:
        - do stemming (use stems instead of word types)
            - help = helps = helped
        - group words into semantic classes
            - {Monday, Tuesday, Wednesday, Thursday, Friday} = one word
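
Both ideas can be sketched as a normalization step applied before counting n-grams. The suffix rule below is a deliberately naive stand-in, not a real stemming algorithm, and the class label `<WEEKDAY>` and all function names are assumptions made for illustration.

```python
# Hypothetical semantic class: all listed weekdays collapse to one symbol.
SEMANTIC_CLASSES = {
    day: "<WEEKDAY>"
    for day in ("monday", "tuesday", "wednesday", "thursday", "friday")
}


def naive_stem(word):
    """Strip a couple of common English suffixes (toy rule, not a real stemmer)."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(word):
    """Map a word to its semantic class if it has one, otherwise to its stem."""
    word = word.lower()
    return SEMANTIC_CLASSES.get(word, naive_stem(word))


words = ["help", "helps", "helped", "Monday", "Friday"]
reduced = {normalize(w) for w in words}
print(len(set(words)), "types before,", len(reduced), "types after")
```

Fewer distinct types means fewer distinct n-grams and therefore fewer parameters to estimate; in practice an established stemmer or lemmatizer would replace `naive_stem`.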

### Prior Probability <a class='anchor' name='prior_probability'></a>

- From [10]
    - *Prior Probability*: The probability before we consider any additional knowledge
        - $P(A)$

### Conditional Probability (Posterior Probability) <a class='anchor' name='conditional_probability'></a>

- From [7]
    - The conditional probability $P(E_2 | E_1)$ is the probability of event $E_2$ given that event $E_1$ has occurred.
        - You can think of this as the probability of $E_2$ given that $E_1$ is the temporary sample set
    - $P(E_2 | E_1) = \frac{P(E_1,E_2)} {P(E_1)}$ if $P(E_1) > 0$
- From [10] 
- From [12]
    - $P(X|Y)$ means the probability that X is true when we already know Y is true
        - P(baby is named John | baby is a boy) = 0.002
        - P(baby is a boy | baby is named John) = 1

### Joint Probability <a class='anchor' name='joint_probability'></a>

- From [12]
    - $P(X,Y)$ means the probability that X and Y are both true, for example:
        - P(brown eyes, boy) = (number of all baby boys with brown eyes)/(total number of babies)
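
Prior, joint, and conditional probability can be tied together with a small sketch over the sample document used earlier in these notes. Here the samples are the 51 adjacent word pairs (bigrams) of that document, $E_1$ is "the first word is 'the'" and $E_2$ is "the second word is 'pack'". The tokenization rule and all names are assumptions.

```python
import re
from fractions import Fraction

SAMPLE_DOCUMENT = (
    "A weather balloon is floating at a constant height above Earth when it "
    "releases a pack of instruments. (Level 1) a. If the pack hits the ground "
    "with a downward velocity of −73.5 m/s, how far did the pack fall? "
    "b. Calculate the distance the ball has rolled at the end of 2.2 s"
)
# Assumed tokenization: lowercase, keep runs of ASCII letters.
tokens = re.findall(r"[a-z]+", SAMPLE_DOCUMENT.lower())
bigrams = list(zip(tokens, tokens[1:]))  # each adjacent word pair is one sample


def prob(event):
    """Fraction of samples for which the predicate `event` holds."""
    return Fraction(sum(1 for b in bigrams if event(b)), len(bigrams))


p_e1 = prob(lambda b: b[0] == "the")                        # prior P(E1)
p_joint = prob(lambda b: b[0] == "the" and b[1] == "pack")  # joint P(E1, E2)
p_cond = p_joint / p_e1                                     # P(E2 | E1)

print(p_e1, p_joint, p_cond)
```

Dividing by $P(E_1)$ restricts attention to the 6 bigrams that start with "the", 2 of which continue with "pack", so $P(E_2 | E_1) = \frac{1}{3}$.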

## Perplexity <a class='anchor' name='perplexity'></a>

- From [9]
    - Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words.
        - $PP(W) = P(w_1,w_2,\ldots,w_N)^{-\frac{1}{N}}$
        - $PP(W) = \sqrt[N]{\frac{1}{{P(w_1,w_2,\ldots, w_N)}}}$
        - Chain Rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1,...,w_{i-1})}}$
        - For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}}$
        
    - Minimizing perplexity is the same as maximizing probability
        - The best language model is one that best predicts an unseen test set
        - __Lower Perplexity = Better Model__
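
A minimal sketch of the formula, assuming the per-word conditional probabilities $P(w_i|\ldots)$ have already been produced by some language model. The computation is done in log space, since multiplying many small probabilities directly would underflow; `perplexity` is an illustrative name.

```python
import math


def perplexity(word_probs):
    """PP(W) = (prod_i 1 / p_i) ** (1 / N), computed via logarithms."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    if any(p <= 0 for p in word_probs):
        raise ValueError("probabilities must be positive")
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(w_1, ..., w_N)
    return math.exp(-log_prob / n)


# Hypothetical per-word probabilities from two models on the same test text.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]
print(perplexity(confident_model))
print(perplexity(uncertain_model))
```

A model that assigns probability $\frac{1}{10}$ to every word has perplexity 10, as if it were choosing uniformly among 10 words at each step; the model assigning higher probabilities gets the lower perplexity.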

### Why Perplexity <a class='anchor' name='why_perplexity'></a>
- From [9]
    - Say we have learned probabilities from a __training set__.
    - Next we need to look at the model's performance on some new data
        - This is a __test set__: a dataset different from our training set
    - Then we need an __evaluation metric__ to tell us how well our model is doing on the test set.
    - One such metric is __perplexity__