# Unigram method for counting probabilities

## Differences between the unigram method and the true n-gram method
1. The unigram method uses the AND (product) rule of probability and treats the words as independent: just multiply the probabilities of the individual words.
2. The true n-gram method uses conditional probability: `P(abcd) = P(d|abc) * P(abc)`.  
Applying this repeatedly gives the probability as a chain of conditional probabilities:  
`P('There was heavy rain') = P('There')P('was'|'There')P('heavy'|'There was')P('rain'|'There was heavy')`  
A **bigram** model approximates each factor by conditioning only on the previous word, e.g. `P('rain'|'heavy')` in place of `P('rain'|'There was heavy')`. There is no way to obtain a longer-context probability such as `P('rain'|'There was')` using just **bigram** models.  
3. The true n-gram method outputs very coarse values, especially for `P(d|abc)`, because `Freq(abcd)` is often very small, so the estimate tends to sit close to 0 or 1.
4. Use the true n-gram method only when there is enough data, so that we get more observations and higher counts for each n-gram.
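
To make the difference concrete, here is a minimal sketch on a made-up toy corpus. The corpus and the function names `unigram_probability` and `bigram_probability` are assumptions for illustration and are not part of the notebook below. The counts are unsmoothed, so the sketch only works for words and word pairs that actually occur in the corpus.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the notebook below.
corpus = "there was heavy rain there was heavy traffic there was light rain".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
total_tokens = len(corpus)


def unigram_probability(sentence):
    """Unigram method: multiply the independent word probabilities."""
    probability = 1.0
    for word in sentence.split():
        probability *= unigram_counts[word] / total_tokens
    return probability


def bigram_probability(sentence):
    """Bigram approximation of the chain rule: P(w1) * P(w2|w1) * P(w3|w2) * ..."""
    words = sentence.split()
    probability = unigram_counts[words[0]] / total_tokens
    for previous, current in zip(words, words[1:]):
        # P(current | previous) = Freq(previous current) / Freq(previous)
        probability *= bigram_counts[(previous, current)] / unigram_counts[previous]
    return probability


sentence = "there was heavy rain"
p_unigram = unigram_probability(sentence)
p_bigram = bigram_probability(sentence)
print("unigram:", p_unigram)
print("bigram: ", p_bigram)
```

On this corpus the unigram product works out to 1/576, while the bigram chain gives 1/12, since "was" always follows "there" here and the bigram model can use that.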


## Why do we use add-one smoothing
To move some probability mass from seen words to unseen words. We pretend that everything in the vocab is seen at least once. However, the pretended counts give a false impression of the data we actually observed (hence why we don't perform add-k smoothing, for instance).
  
Whether add-one smoothing is a good choice depends on the size of the vocab relative to the number of observations the language model is constructed from.
#### If small vocab but many observations
This case is a good fit for add-one smoothing. With many observations but a small vocab, only a small probability mass is distributed towards the unseen words.  
e.g. each unseen word gets a probability of 1/(a very large number)

#### If large vocab but few observations

Here add-one smoothing could shift the distribution quite drastically towards the unseen words.  
e.g. if the vocab is the 26 English letters and the only observation is "a", then without add-one smoothing we get P(a) = 1. With add-one smoothing it becomes P(a) = (1 + 1)/(1 + 26) = 2/27, and each of the 25 unseen letters gets 1/27.
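
The two cases can be compared numerically. This is a minimal sketch: the helper `add_one_probability` and the figure of 1,000,000 observations are made up for illustration, and the vocab is assumed to be the 26 lowercase English letters.

```python
import string


def add_one_probability(count, total_observations, vocab_size):
    """Add-one smoothed estimate: (count + 1) / (N + V)."""
    return (count + 1) / (total_observations + vocab_size)


vocab_size = len(string.ascii_lowercase)  # 26 letters
unseen_letters = vocab_size - 1           # every letter except "a"

# Large vocab relative to the data: a single observation, the letter "a".
seen_small = add_one_probability(1, 1, vocab_size)
unseen_mass_small = unseen_letters * add_one_probability(0, 1, vocab_size)

# Small vocab relative to the data: a hypothetical 1,000,000 observations,
# all of them the letter "a" (made-up number, for illustration only).
n_large = 1_000_000
seen_large = add_one_probability(n_large, n_large, vocab_size)
unseen_mass_large = unseen_letters * add_one_probability(0, n_large, vocab_size)

print(f"1 observation: P(a) = {seen_small:.6f}, unseen mass = {unseen_mass_small:.6f}")
print(f"{n_large} observations: P(a) = {seen_large:.6f}, unseen mass = {unseen_mass_large:.6f}")
```

With a single observation, 25/27 of the probability mass ends up on letters that were never seen. With a million observations the unseen letters share a negligible amount.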


### Add-one or add-k smoothing rests on the belief that we should normalise over, and believe in, all events (in the training set)

In [1]:
import math
import nltk

In [2]:
text_collection_PC = ["I Don't Want To Go", "A Groovy Kind Of Love", "You Can't Hurry Love", "This Must Be Love", "Take Me With You"]
text_collection_AS = ["All Out Of Love", "Here I Am", "I Remember Love", "Love Is All", "Don't Tell Me"]

vocab = {} # Shared vocab across both collections; its length is the number of distinct words seen overall
text_collection_PC_freq = {}
text_collection_AS_freq = {}
total_text_collection_PC_freq = 0
total_text_collection_AS_freq = 0

### Declare our variables and song titles

## Formula for counting the probability

`(freq(word) + 1)/(total_freq_of_intended_dictionary + length(vocab))`
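
As a sanity check on the smoothed formula, with the add-one in the numerator, the probabilities for one artist should still sum to 1 over the whole vocab. The sketch below reuses the song titles from this notebook; the names `as_counts`, `shared_vocab` and `smoothed_probability` are made up here and are not used by the cells that follow.

```python
from collections import Counter

text_collection_PC = ["I Don't Want To Go", "A Groovy Kind Of Love", "You Can't Hurry Love", "This Must Be Love", "Take Me With You"]
text_collection_AS = ["All Out Of Love", "Here I Am", "I Remember Love", "Love Is All", "Don't Tell Me"]

# Word counts for Air Supply only.
as_counts = Counter(word for title in text_collection_AS for word in title.split())
as_total = sum(as_counts.values())

# The vocab is shared: every distinct word from both collections.
shared_vocab = {word for title in text_collection_PC + text_collection_AS for word in title.split()}


def smoothed_probability(word):
    """(freq(word) + 1) / (total_freq + length(vocab)) for the Air Supply collection."""
    return (as_counts[word] + 1) / (as_total + len(shared_vocab))


probability_sum = sum(smoothed_probability(word) for word in shared_vocab)
print("P(Love | Air Supply) =", smoothed_probability("Love"))
print("P(You  | Air Supply) =", smoothed_probability("You"))
print("Sum over the vocab   =", probability_sum)
```

"Love" appears 3 times in the 16 Air Supply tokens, so it gets (3 + 1)/(16 + 26) = 4/42, while "You" never appears for Air Supply and gets 1/42.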

### Count the Air Supply word frequencies first

In [3]:
for title in text_collection_AS:
    for word in title.split():
        vocab[word] = 1  # register the word in the shared vocab
        total_text_collection_AS_freq += 1  # count every token, repeats included
        if word not in text_collection_AS_freq:
            text_collection_AS_freq[word] = 1
        else:
            text_collection_AS_freq[word] += 1

### Count the Phil Collins word frequencies

In [4]:
for title in text_collection_PC:
    for word in title.split():
        vocab[word] = 1  # register the word in the shared vocab
        total_text_collection_PC_freq += 1  # count every token, repeats included
        if word not in text_collection_PC_freq:
            text_collection_PC_freq[word] = 1
        else:
            text_collection_PC_freq[word] += 1



In [5]:
print (len(vocab))
print ("Total for AS, including total vocab size: ", (total_text_collection_AS_freq + len(vocab)))
print ("Total for PC, including total vocab size: ", (total_text_collection_PC_freq + len(vocab)))
print ("==========================")


26
Total for AS, including total vocab size:  42
Total for PC, including total vocab size:  48


### Add the query text

In [6]:
query = "I Remember You"

### Calculate probability

- We ignore query words that are not in the vocab at all, i.e. words never seen in either collection
- We also work in log space, as a product of many probabilities can get really small (summing logs instead prevents underflow of floating point values)
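
The underflow problem is easy to reproduce. In this minimal sketch the per-word probability of 1e-5 and the length of 100 words are made-up numbers, chosen only so that the plain product falls below what a Python float can represent.

```python
import math

word_probability = 1e-5  # hypothetical probability of each word
number_of_words = 100    # hypothetical length of the text

product = 1.0
log_sum = 0.0
for _ in range(number_of_words):
    product *= word_probability              # plain product of probabilities
    log_sum += math.log10(word_probability)  # same quantity in log10 space

print("Plain product:", product)
print("Sum of logs:  ", log_sum)
```

The plain product collapses to 0.0, so two different texts could no longer be compared, while the sum of logs stays at about -500 and remains usable for ranking.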

In [7]:
# These accumulate log10 probabilities, so they start at log10(1) = 0
probability_AS = 0
probability_PC = 0

for word in query.split():
    if (word in text_collection_PC_freq):
        # Seen for this artist: add-one smoothing turns the count into count + 1
        probability_PC += math.log10(text_collection_PC_freq[word] + 1) - math.log10(total_text_collection_PC_freq + len(vocab))
    elif (word in vocab):
        # Not seen for this artist but present in the vocab: the smoothed count is 0 + 1 = 1
        probability_PC += math.log10(1) - math.log10(total_text_collection_PC_freq + len(vocab))
    else:
        # If the word is unseen in the data in general, then we don't consider it at all
        pass

for word in query.split():
    if (word in text_collection_AS_freq):
        # Seen for this artist: add-one smoothing turns the count into count + 1
        probability_AS += math.log10(text_collection_AS_freq[word] + 1) - math.log10(total_text_collection_AS_freq + len(vocab))
    elif (word in vocab):
        # Not seen for this artist but present in the vocab: the smoothed count is 0 + 1 = 1
        probability_AS += math.log10(1) - math.log10(total_text_collection_AS_freq + len(vocab))
    else:
        # Unseen in the data in general: skipped, same as above
        pass

print ("Air Supply", probability_AS)
print ("Phil Collins", probability_PC)

Air Supply -4.0915966208100585
Phil Collins -4.265572461743118


## Final result

In [8]:
if (probability_PC < probability_AS):
    print ("This title is from Air Supply")
else:
    print ("This title is from Phil Collins")

This title is from Air Supply
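
The two scoring loops above are identical apart from the dictionary they read, so the whole procedure can be folded into a pair of small functions. This is only a sketch of one possible refactor: `train`, `log_score`, `collections_by_artist` and `best_artist` are names introduced here and do not appear in the cells above.

```python
import math
from collections import Counter


def train(titles):
    """Count word frequencies over a list of titles."""
    return Counter(word for title in titles for word in title.split())


def log_score(query, freq, vocab):
    """Sum of add-one smoothed log10 probabilities; words outside the vocab are skipped."""
    denominator = sum(freq.values()) + len(vocab)
    return sum(
        math.log10(freq[word] + 1) - math.log10(denominator)
        for word in query.split()
        if word in vocab
    )


collections_by_artist = {
    "Phil Collins": ["I Don't Want To Go", "A Groovy Kind Of Love", "You Can't Hurry Love", "This Must Be Love", "Take Me With You"],
    "Air Supply": ["All Out Of Love", "Here I Am", "I Remember Love", "Love Is All", "Don't Tell Me"],
}

freqs = {artist: train(titles) for artist, titles in collections_by_artist.items()}
vocab = set().union(*freqs.values())  # shared vocab across both artists

scores = {artist: log_score("I Remember You", freq, vocab) for artist, freq in freqs.items()}
best_artist = max(scores, key=scores.get)

for artist, score in scores.items():
    print(artist, score)
print("This title is from", best_artist)
```

A `Counter` returns 0 for a word it has not seen, so the "seen for this artist" and "only in the vocab" branches collapse into a single expression, and adding a third artist only needs another entry in the dictionary.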
