# Unigram method for counting probabilities

## Differences between the unigram method and the true n-gram method
1. The unigram method treats words as independent and uses the AND (product) rule of probability: just multiply the probabilities of the individual words.
2. The true n-gram method uses conditional probability: `P(abcd) = P(d|abc) * P(abc)`  
Applying this rule repeatedly gives a chain of conditional probabilities. In a **bigram** model each word is conditioned only on the word before it:  
`P('There was heavy rain') ~ P('There')P('was'|'There')P('heavy'|'was')P('rain'|'heavy')`
3. The true n-gram method outputs very discrete (coarse) values, especially for `P(d|abc)`, because `Freq(abcd)` and `Freq(abc)` can be the same number when the data is sparse, which makes the estimate exactly 1.
4. Use the true n-gram method only when we have enough data.
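
To make the contrast concrete, here is a minimal sketch of both estimates on a three-sentence toy corpus. The corpus and the function names are invented for illustration, and the bigram estimate is plain relative frequency with no smoothing.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "there was heavy rain",
    "there was light rain",
    "heavy rain was expected",
]

tokens = [sentence.split() for sentence in corpus]
unigram_counts = Counter(word for sent in tokens for word in sent)
bigram_counts = Counter(
    (prev, curr) for sent in tokens for prev, curr in zip(sent, sent[1:])
)
total_tokens = sum(unigram_counts.values())


def unigram_prob(sentence):
    """Multiply the word probabilities as if words were independent."""
    prob = 1.0
    for word in sentence.split():
        prob *= unigram_counts[word] / total_tokens
    return prob


def bigram_prob(sentence):
    """Chain P(w1) * P(w2|w1) * ... using unsmoothed relative frequencies.

    P(curr|prev) is approximated as Freq(prev curr) / Freq(prev).
    """
    words = sentence.split()
    prob = unigram_counts[words[0]] / total_tokens
    for prev, curr in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, curr)] / unigram_counts[prev]
    return prob


print(unigram_prob("there was heavy rain"))
print(bigram_prob("there was heavy rain"))
```

In this toy corpus every "there" is followed by "was", so `P('was'|'there')` comes out as exactly 1. That is the coarse behaviour point 3 warns about when data is sparse.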


## Why do we use add-one smoothing
To move some probability mass from seen words to unseen words. We pretend that every word in the vocabulary has been seen at least once. However, because those extra counts are invented, add-one smoothing gives a somewhat false impression of the index we create (hence we don't perform add-k smoothing, for instance).
  
If the language model is built from a large number of observations:
- With a small vocabulary but many observations, smoothing does not change the distribution significantly.
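
The effect described above can be sketched on a hypothetical four-word vocabulary. The counts and the helper `add_one_distribution` are assumptions made up for this example; both count tables have the same 3:1 ratio between "love" and "you", and only the number of observations differs.

```python
def add_one_distribution(counts, vocab):
    """Return add-one smoothed probabilities for every word in vocab."""
    total = sum(counts.get(word, 0) for word in vocab)
    return {
        word: (counts.get(word, 0) + 1) / (total + len(vocab)) for word in vocab
    }


toy_vocab = ["love", "you", "me", "rain"]   # hypothetical vocabulary

small_counts = {"love": 3, "you": 1}        # 4 observations
large_counts = {"love": 3000, "you": 1000}  # 4000 observations, same ratio

small = add_one_distribution(small_counts, toy_vocab)
large = add_one_distribution(large_counts, toy_vocab)

# Unsmoothed, "love" has probability 0.75 in both cases.
print(small["love"], small["rain"])
print(large["love"], large["rain"])
```

With only 4 observations the smoothed probability of "love" drops from 0.75 to 0.5, while with 4000 observations it stays within a fraction of a percent of 0.75.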

In [1]:
import math
import nltk

In [2]:
text_collection_PC = ["I Don't Want To Go", "A Groovy Kind Of Love", "You Can't Hurry Love", "This Must Be Love", "Take Me With You"]
text_collection_AS = ["All Out Of Love", "Here I Am", "I Remember Love", "Love Is All", "Don't Tell Me"]

vocab = {} # Shared vocabulary across both collections; len(vocab) is the number of distinct words
text_collection_PC_freq = {}
text_collection_AS_freq = {}
total_text_collection_PC_freq = 0
total_text_collection_AS_freq = 0

### Declare our variables and song titles

## Formula for counting the probability

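`freq(word)/(total_freq_of_intended_dictionary + length(vocab))`

Here `freq(word)` is the raw count, so only the denominator is smoothed. Full add-one smoothing would also use `freq(word) + 1` in the numerator.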
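
As a rough illustration, the formula as written can be compared with textbook add-one smoothing, which also adds 1 to the count in the numerator. The two function names are made up for this sketch; the figures 16 and 26 are the Air Supply token count and the shared vocabulary size from this notebook.

```python
def notebook_prob(count, total, vocab_size):
    """Formula as written above: raw count over (total + vocab size)."""
    return count / (total + vocab_size)


def add_one_prob(count, total, vocab_size):
    """Textbook add-one smoothing: every count is incremented by one."""
    return (count + 1) / (total + vocab_size)


# Air Supply figures from this notebook: 16 word tokens, 26 distinct words overall.
total_AS, vocab_size = 16, 26

print(notebook_prob(2, total_AS, vocab_size))  # "I" occurs twice: 2/42
print(add_one_prob(2, total_AS, vocab_size))   # 3/42
print(notebook_prob(0, total_AS, vocab_size))  # "You" is unseen: 0/42
print(add_one_prob(0, total_AS, vocab_size))   # 1/42
```

With the raw count, an unseen word gets probability 0, whose logarithm is undefined, so such words have to be skipped. With the `+ 1` numerator they contribute a small non-zero probability instead.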

### Count the Air Supply word frequencies first

In [3]:
for titles in text_collection_AS:
    for word in titles.split():
        vocab[word] = 1
        total_text_collection_AS_freq += 1
        if (word not in text_collection_AS_freq):
            text_collection_AS_freq[word] = 1
        else:
            text_collection_AS_freq[word] += 1

### Count the Phil Collins word frequencies

In [4]:
for titles in text_collection_PC:
    for word in titles.split():
        vocab[word] = 1
        total_text_collection_PC_freq += 1

        if (word not in text_collection_PC_freq):
            text_collection_PC_freq[word] = 1
        else:
            text_collection_PC_freq[word] += 1
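
The two counting loops above could also be written with `collections.Counter` from the standard library. This is only an alternative sketch; the helper `count_words` and the shorter variable names are not part of the original notebook.

```python
from collections import Counter


def count_words(titles):
    """Count how often each whitespace-separated word occurs in the titles."""
    return Counter(word for title in titles for word in title.split())


titles_AS = ["All Out Of Love", "Here I Am", "I Remember Love", "Love Is All", "Don't Tell Me"]
titles_PC = ["I Don't Want To Go", "A Groovy Kind Of Love", "You Can't Hurry Love", "This Must Be Love", "Take Me With You"]

freq_AS = count_words(titles_AS)
freq_PC = count_words(titles_PC)

# The shared vocabulary is the union of the words seen in either collection.
shared_vocab = set(freq_AS) | set(freq_PC)
total_AS = sum(freq_AS.values())
total_PC = sum(freq_PC.values())

print(len(shared_vocab), total_AS, total_PC)
```

A `Counter` returns 0 for missing words, so lookups for unseen query words do not need a separate membership check.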



In [5]:
print (len(vocab))
print ("Total for AS, including total vocab size: ", (total_text_collection_AS_freq + len(vocab)))
print ("Total for PC, including total vocab size: ", (total_text_collection_PC_freq + len(vocab)))
print ("==========================")


26
Total for AS, including total vocab size:  42
Total for PC, including total vocab size:  48


### Add the query text

In [6]:
query = "I Remember You"

### Calculate probability

- We ignore query words that do not appear in the artist's frequency dictionary
- We also work in log space, because a product of many probabilities quickly becomes very small; summing logarithms instead prevents floating-point underflow
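
The underflow problem can be reproduced with a short sketch. The per-word probability and the number of words are hypothetical values chosen only to make the effect visible.

```python
import math

word_prob = 1e-5   # hypothetical probability of each word
num_words = 100    # hypothetical number of words in the text

# Multiplying raw probabilities: the true value is 1e-500, which is smaller
# than the smallest positive double, so the running product ends up at 0.0.
product = 1.0
for _ in range(num_words):
    product *= word_prob

# Summing log10 probabilities keeps the same information in a safe range.
log_sum = sum(math.log10(word_prob) for _ in range(num_words))

print(product)
print(log_sum)
```

The product collapses to `0.0`, while the log sum stays near -500 and can still be compared against the score of another model.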

In [7]:
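# Despite the names, these accumulate log10 probabilities, so they start at 0 (the log of 1)
probability_AS = 0
probability_PC = 0

# Add the log probability of each query word that occurs in the Phil Collins titles
for word in query.split():
    if (word in text_collection_PC_freq):
        probability_PC += (math.log10(text_collection_PC_freq[word]) - math.log10(total_text_collection_PC_freq + len(vocab)))

# Do the same for the Air Supply titles
for word in query.split():
    if (word in text_collection_AS_freq):
        probability_AS += (math.log10(text_collection_AS_freq[word]) - math.log10(total_text_collection_AS_freq + len(vocab)))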
        
print ("Air Supply", probability_AS)
print ("Phil Collins", probability_PC)

Air Supply -2.94546858513182
Phil Collins -3.061452479087193


## Final result

In [8]:
if (probability_AS > probability_PC):
    print ("This title is from Air Supply")
else:
    print ("This title is from Phil Collins")

This title is from Air Supply
