# Language Identifier Using Word Bigrams

In my [previous tutorial](https://github.com/nwams/NLTK-Hands-On-Tutorial/blob/master/2-1-Deriving-N-Grams-from-Text.ipynb) I demonstrated how to use N-grams at the character level to identify a language. In this tutorial I work with word bigrams within a sentence. Here a bigram is a pair of adjacent words in a given sentence. For example, for the sentence "The cow jumps over the moon" the bigrams (after lowercasing) would be:
* the cow
* cow jumps
* jumps over
* over the
* the moon
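
The list above can be reproduced with a few lines of plain Python. This is a minimal sketch that assumes lowercasing and whitespace splitting are enough for this sentence; the helper name `word_bigrams` is hypothetical and not part of NLTK.

```python
def word_bigrams(sentence):
    """Return adjacent word pairs from a sentence (lowercased, whitespace-split)."""
    tokens = sentence.lower().split()
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    return list(zip(tokens, tokens[1:]))

bigrams = word_bigrams("The cow jumps over the moon")
for first, second in bigrams:
    print(first, second)
```

A sentence with `n` words yields `n - 1` bigrams, which is why the six-word example produces five pairs.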

We use bigrams because groups of words often carry more information than single words when trying to teach a machine the **meaning** of a sentence. Where can bigrams be applied? One application is prediction: given a word, a bigram model can predict the word most likely to follow it.
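
To make the prediction idea concrete, here is a rough sketch of next-word prediction from bigram counts. The helper names `build_successors` and `predict_next` and the toy corpus are made up for illustration; a real model would be trained on far more text and would need a strategy for unseen words.

```python
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1
    return successors

def predict_next(successors, word):
    """Return the most frequent follower of `word`, or None if the word was never followed."""
    if word not in successors:
        return None
    return successors[word].most_common(1)[0][0]

# Toy corpus, invented for this example only
toy_tokens = "the cow jumps over the moon and the cow sleeps".split()
toy_successors = build_successors(toy_tokens)
prediction = predict_next(toy_successors, "the")
print(prediction)
```

In the toy corpus "the" is followed by "cow" twice and by "moon" once, so the sketch predicts "cow".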

The discussion in [this Quora post](https://www.quora.com/What-is-a-bigram-and-a-trigram-layman-explanation-please) does a great job of explaining what a bigram is.

In [14]:
import string
from nltk import word_tokenize

# Helper function that cleans a sentence and then tokenizes it
def punct_dig_tokenizer(sentence):
    '''Remove punctuation and digits first, then tokenize'''
    # When maketrans() receives three arguments, each character in the third argument is mapped to None (deleted).
    sentence = sentence.translate(str.maketrans('', '', string.punctuation + string.digits))
    return word_tokenize(sentence.lower())
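
To see what the translation table does on its own, the sketch below applies the same `str.maketrans` call to a made-up sentence. It assumes a plain whitespace split as a stand-in for `word_tokenize`, so that it runs without NLTK.

```python
import string

# The third argument lists characters to delete; the first two (empty here)
# would otherwise map characters one-to-one.
delete_table = str.maketrans('', '', string.punctuation + string.digits)

sample = "Room 101: it's 9 o'clock, isn't it?"
cleaned = sample.translate(delete_table)

# Whitespace split stands in for word_tokenize in this sketch
sample_tokens = cleaned.lower().split()
print(sample_tokens)
```

One side effect worth knowing: deleting apostrophes merges contractions, so "it's" becomes "its" and "isn't" becomes "isnt".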

## 1. Let's Go Step-by-Step to Understand the Process

**Tokenize at the word level**

In [20]:
example_text = 'Oh, then, I see Queen Mab hath been with you.'
example_tokens_words = punct_dig_tokenizer(example_text)
example_tokens_words

['oh', 'then', 'i', 'see', 'queen', 'mab', 'hath', 'been', 'with', 'you']

**Tokenize at the character level**

In [21]:
example_tokens_chars = list(example_tokens_words[0])  # split the first word into characters
example_tokens_chars

['o', 'h']

**Create unigrams using NLTK's [ngrams](https://kite.com/python/docs/nltk.ngrams) function**

In [25]:
from nltk import ngrams

example_tokens_words_unigrams = list(ngrams(example_tokens_words, 1))
example_tokens_words_unigrams

[('oh',),
 ('then',),
 ('i',),
 ('see',),
 ('queen',),
 ('mab',),
 ('hath',),
 ('been',),
 ('with',),
 ('you',)]

**Create bigrams; apply left and right padding**

In [27]:
example_tokens_words_bigrams = list(ngrams(example_tokens_words, 2, pad_left=True, pad_right=True,
                                          left_pad_symbol='_', right_pad_symbol='_'))
example_tokens_words_bigrams

[('_', 'oh'),
 ('oh', 'then'),
 ('then', 'i'),
 ('i', 'see'),
 ('see', 'queen'),
 ('queen', 'mab'),
 ('mab', 'hath'),
 ('hath', 'been'),
 ('been', 'with'),
 ('with', 'you'),
 ('you', '_')]
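
The padding adds one `_` symbol on each side, so the first and last words each take part in a bigram that marks a sentence boundary. A rough plain-Python equivalent for the bigram case is sketched below; the helper `padded_bigrams` is hypothetical and only mirrors the output shown above.

```python
def padded_bigrams(tokens, pad='_'):
    """Bigrams over tokens with one boundary symbol added on each side."""
    padded = [pad] + list(tokens) + [pad]
    # Pair each element with its successor, including the boundary symbols
    return list(zip(padded, padded[1:]))

pairs = padded_bigrams(['oh', 'then', 'i'])
print(pairs)
```

With padding on both sides, `n` tokens produce `n + 1` bigrams instead of `n - 1`.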

**Let's create a frequency distribution that counts the occurrences of each word**

In [62]:
from nltk import FreqDist

f_dist = FreqDist(example_tokens_words_unigrams) # FreqDist expects an iterable of tokens

for key, value in f_dist.items():
    print(key, ':', value)

('oh',) : 1
('then',) : 1
('i',) : 1
('see',) : 1
('queen',) : 1
('mab',) : 1
('hath',) : 1
('been',) : 1
('with',) : 1
('you',) : 1


**Create a dictionary of the words and their frequencies**

In [66]:
unigram_dict = {}

for k, v in f_dist.items():
    unigram_dict[''.join(k)] = v
    
unigram_dict

{'oh': 1,
 'then': 1,
 'i': 1,
 'see': 1,
 'queen': 1,
 'mab': 1,
 'hath': 1,
 'been': 1,
 'with': 1,
 'you': 1}
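
The same pattern extends to bigrams. Joining each tuple with a space gives keys such as `'oh then'`, the same key format as the pickled bigram dictionary loaded further down (for example `'of the'`). This sketch uses `collections.Counter` in place of `FreqDist` so that it runs without NLTK. How the pickled files were actually produced is not shown in this tutorial, so treat the approach as an assumption.

```python
from collections import Counter

tokens = ['oh', 'then', 'i', 'see', 'queen', 'mab', 'hath', 'been', 'with', 'you']
padded = ['_'] + tokens + ['_']

# Count each adjacent pair, then join the pair into a single string key
pair_counts = Counter(zip(padded, padded[1:]))
bigram_dict = {' '.join(pair): count for pair, count in pair_counts.items()}
print(bigram_dict)
```

Note that a space is used as the separator here, whereas `''.join(k)` is enough for unigrams because each key tuple holds a single word.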

**Load the text from a file**

In [8]:
file = '2-3_ngram_langid_files/LangId.train.English.txt'

with open(file, encoding='utf8') as f:
    content = f.read().lower()

content.replace('\n', '')[:300] # strip newline characters and show the first 300 characters

"approval of the minutes of the previous sitting the minutes of yesterday 's sitting have been distributed . are there any comments ? mr president , on monday i made a point of order about president nicole fontaine 's reported comments in the british press regarding her recent visit with her majesty "

**Read in the English unigram pickled dictionary**

In [10]:
import pickle

with open('2-3_ngram_langid_files/English.unigram.pickle', 'rb') as handle:
    unigram_enlgish_dict = pickle.load(handle)

unigram_enlgish_dict

{'approval': 3,
 'of': 2769,
 'the': 5699,
 'minutes': 11,
 'previous': 13,
 'sitting': 11,
 'yesterday': 15,
 's': 244,
 'have': 491,
 'been': 193,
 'distributed': 4,
 'are': 571,
 'there': 229,
 'any': 98,
 'comments': 9,
 'mr': 371,
 'president': 270,
 'on': 865,
 'monday': 11,
 'i': 721,
 'made': 61,
 'a': 1343,
 'point': 70,
 'order': 76,
 'about': 93,
 'nicole': 1,
 'fontaine': 4,
 'reported': 2,
 'in': 1668,
 'british': 16,
 'press': 13,
 'regarding': 23,
 'her': 51,
 'recent': 12,
 'visit': 6,
 'with': 449,
 'majesty': 1,
 'queen': 3,
 'elizabeth': 1,
 'ii': 1,
 'labour': 14,
 'member': 111,
 'this': 887,
 'house': 49,
 'miller': 2,
 'repeated': 6,
 'what': 132,
 'were': 82,
 'purported': 1,
 'to': 2490,
 'be': 620,
 'remarks': 1,
 'not': 525,
 'once': 27,
 'but': 240,
 'three': 21,
 'times': 11,
 'tuesday': 8,
 'and': 2040,
 'wednesday': 2,
 'he': 72,
 'sought': 5,
 'drag': 1,
 'into': 78,
 'political': 51,
 'controversy': 2,
 'use': 45,
 'name': 16,
 'score': 2,
 'cheap': 7,


**Read in the English bigram pickled dictionary**

In [15]:
with open('2-3_ngram_langid_files/English.bigram.pickle', 'rb') as handle:
    bigram_enlgish_dict = pickle.load(handle)

bigram_enlgish_dict

{'_ approval': 1,
 'approval of': 3,
 'of the': 906,
 'the minutes': 6,
 'minutes of': 2,
 'the previous': 9,
 'previous sitting': 2,
 'sitting the': 1,
 'of yesterday': 2,
 'yesterday s': 5,
 's sitting': 1,
 'sitting have': 1,
 'have been': 65,
 'been distributed': 1,
 'distributed are': 1,
 'are there': 5,
 'there any': 1,
 'any comments': 1,
 'comments mr': 1,
 'mr president': 192,
 'president on': 8,
 'on monday': 10,
 'monday i': 1,
 'i made': 1,
 'made a': 3,
 'a point': 8,
 'point of': 17,
 'of order': 6,
 'order about': 1,
 'about president': 1,
 'president nicole': 1,
 'nicole fontaine': 1,
 'fontaine s': 1,
 's reported': 1,
 'reported comments': 1,
 'comments in': 1,
 'in the': 442,
 'the british': 9,
 'british press': 1,
 'press regarding': 1,
 'regarding her': 1,
 'her recent': 1,
 'recent visit': 2,
 'visit with': 1,
 'with her': 2,
 'her majesty': 1,
 'majesty queen': 1,
 'queen elizabeth': 1,
 'elizabeth ii': 1,
 'ii a': 1,
 'a british': 1,
 'british labour': 1,
 'labo

**Show how many times "of the" occurs in the text**

In [16]:
bigram_enlgish_dict.get('of the') # Get the value of the key

906
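
Combined with the unigram counts, a bigram count gives a rough estimate of how likely one word is to follow another: count("of the") / count("of"). The sketch below uses only the two counts printed in the outputs above. The helper `follow_probability` is hypothetical and applies no smoothing, so unseen words simply get a probability of zero.

```python
def follow_probability(bigram_counts, unigram_counts, first, second):
    """Estimate P(second | first) as count('first second') / count('first')."""
    first_count = unigram_counts.get(first, 0)
    if first_count == 0:
        return 0.0  # unseen first word; no smoothing in this sketch
    return bigram_counts.get(f'{first} {second}', 0) / first_count

# Counts copied from the dictionaries shown above
sample_unigrams = {'of': 2769, 'the': 5699}
sample_bigrams = {'of the': 906}

p_the_given_of = follow_probability(sample_bigrams, sample_unigrams, 'of', 'the')
print(round(p_the_given_of, 3))
```

In this corpus, "of" is followed by "the" roughly one time in three (906 out of 2769).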

**Show the most frequently occurring unigrams**

In [21]:
import operator

english_unigram_freq = sorted(unigram_enlgish_dict.items(), key=operator.itemgetter(1), reverse=True)
english_unigram_freq[:10]

[('the', 5699),
 ('of', 2769),
 ('to', 2490),
 ('and', 2040),
 ('in', 1668),
 ('a', 1343),
 ('is', 1303),
 ('that', 1205),
 ('this', 887),
 ('on', 865)]
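
As an alternative to `sorted` with `operator.itemgetter`, the standard library's `collections.Counter` offers `most_common`, which returns the highest counts first. The sketch below uses a small subset of the counts shown above rather than the full pickled dictionary.

```python
from collections import Counter

# A small subset of the unigram counts shown above, for illustration
sample_counts = {'the': 5699, 'of': 2769, 'minutes': 11, 'to': 2490, 'approval': 3}

# most_common(n) returns the n (word, count) pairs with the largest counts
top_three = Counter(sample_counts).most_common(3)
print(top_three)
```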

In [None]:
labels, values = zip(*english_unigram_freq)
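
The `zip(*...)` call in the cell above transposes a list of `(word, count)` pairs into one tuple of words and one tuple of counts, which is the shape most plotting functions expect. A small sketch with the top three pairs from the output above:

```python
top_pairs = [('the', 5699), ('of', 2769), ('to', 2490)]

# Unpacking with * passes each pair as a separate argument to zip,
# which then groups the first elements together and the second elements together
labels, values = zip(*top_pairs)
print(labels)
print(values)
```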