
# Tutorial 2: Modules for Language Identification (N-Gram, Stopword and Word Bigram Analysis)

**Tutorial 2.1: Deriving N-Grams from Text**

Based on "N-Gram-Based Text Categorization: Categorizing Text With Python" by Alejandro Nolla.

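What are n-grams? An n-gram is a contiguous sequence of n items (here, characters) taken from a longer text. For a fuller introduction, see [this overview of language detection](https://cloudmark.github.io/Language-Detection/).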

***Tokenization:*** Splits a string into substrings made up only of letters and apostrophes, discarding punctuation and whitespace, to prepare the text for n-gram analysis.

In [1]:
#Lowercase text in string
s = "Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves."
s = s.lower()

In [2]:
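#Import the regular-expression tokenizer
from nltk.tokenize import RegexpTokenizer
#Keep runs of letters, apostrophes and backticks; only the accented letters used in this sentence (é, è, î) are listed
tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")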
s_tokenized = tokenizer.tokenize(s)
s_tokenized

['le',
 'temps',
 'est',
 'un',
 'grand',
 'maître',
 'dit',
 'on',
 'le',
 'malheur',
 'est',
 "qu'il",
 'tue',
 'ses',
 'élèves']
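
The character class used above lists only the accented letters that occur in this particular sentence, so a word containing another accented letter (for example ç or û) would be split at that letter. As a minimal sketch, assuming only the standard-library `re` module rather than NLTK, a Unicode-aware pattern could look like the following; `tokenize_letters` is a hypothetical helper name, not part of the original tutorial.

```python
import re

# [^\W\d_] matches any Unicode letter: a "word" character that is
# neither a digit nor an underscore. Apostrophes are allowed as well.
LETTERS_AND_APOSTROPHES = re.compile(r"(?:[^\W\d_]|')+")

def tokenize_letters(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return LETTERS_AND_APOSTROPHES.findall(text.lower())

sample = "Ça coûte cher, n'est-ce pas ?"
tokens = tokenize_letters(sample)
print(tokens)
```

The hyphen in "n'est-ce" still separates tokens, just as "dit-on" was split into 'dit' and 'on' above.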

***Generating N-Grams:*** Extracts overlapping, consecutive slices of length n from a longer string. How often each slice occurs differs from language to language, which is why n-grams can be used for language detection.

In [None]:
#Import the ngrams function and create a list to hold the generated n-grams
from nltk.util import ngrams
generated_4grams = []

#Generate padded character 4-grams (n = 4) for each word in the tokenized string
for word in s_tokenized:
    generated_4grams.append(list(ngrams(word, 4, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_')))
generated_4grams

At this point generated_4grams is a list of lists (one list per word), so it needs flattening into a single list of 4-grams:

In [5]:
#Flatten the per-word lists into a single list of 4-gram tuples
generated_4grams = [word for sublist in generated_4grams for word in sublist]
generated_4grams[:10]

[('_', '_', '_', 'l'),
 ('_', '_', 'l', 'e'),
 ('_', 'l', 'e', '_'),
 ('l', 'e', '_', '_'),
 ('e', '_', '_', '_'),
 ('_', '_', '_', 't'),
 ('_', '_', 't', 'e'),
 ('_', 't', 'e', 'm'),
 ('t', 'e', 'm', 'p'),
 ('e', 'm', 'p', 's')]
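
To make the effect of the padding easier to see, here is a small pure-Python sketch that reproduces the same slices for a single word. It is not NLTK's implementation; `padded_char_ngrams` is a hypothetical helper that assumes the same pad symbol (`_`) is used on both sides.

```python
def padded_char_ngrams(word, n, pad="_"):
    """Return the character n-grams of word, padded with n-1 symbols on each side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    # Slide a window of width n across the padded string
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

grams = padded_char_ngrams("le", 4)
print(grams)  # ['___l', '__le', '_le_', 'le__', 'e___']
```

The five strings correspond to the first five tuples in the output above: the padding lets every character of the word appear in each position of the window.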

Join each 4-gram tuple into a single string (n = 4)

In [None]:
#Join each 4-gram tuple into a string, building a new list so generated_4grams is not modified in place
ng_list_4grams = [''.join(ng) for ng in generated_4grams]
ng_list_4grams

Sort n-grams by how frequently they appear within the text

In [None]:
#Create a dictionary mapping each n-gram to the number of times it occurs
freq_4grams = {}

#Iterate through the n-grams and count each occurrence
for ngram in ng_list_4grams:
    freq_4grams[ngram] = freq_4grams.get(ngram, 0) + 1

# itemgetter(1) returns the second element of each (ngram, count) pair, so the list is sorted by count.
from operator import itemgetter

# We only keep the 300 most frequent n-grams, as suggested in the original paper on n-gram-based text categorization.
freq_4grams_sorted = sorted(freq_4grams.items(), key=itemgetter(1), reverse=True)[:300]
freq_4grams_sorted
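
The counting and sorting steps can also be expressed with `collections.Counter` from the standard library. The sketch below uses a short, made-up list of n-gram strings (`sample_ngrams`) in place of ng_list_4grams so that it runs on its own; it is an alternative formulation, not the code used in the original tutorial.

```python
from collections import Counter

# Hypothetical stand-in for ng_list_4grams
sample_ngrams = ["__le", "_est", "__le", "est_", "_est", "__le"]

# Counter tallies the occurrences; most_common(k) returns the k most
# frequent (ngram, count) pairs in descending order of count.
top_ngrams = Counter(sample_ngrams).most_common(300)
print(top_ngrams)  # [('__le', 3), ('_est', 2), ('est_', 1)]
```

Asking for 300 entries when fewer distinct n-grams exist simply returns all of them, which matches the behavior of the slice in the cell above.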

Obtain n-grams for multiple values of n (n = 1, 2, 3, 4)

In [8]:
from nltk import everygrams

# The code below works on a single string, so rejoin the tokens into one cleaned sentence.
s_clean = ' '.join(s_tokenized)
s_clean

"le temps est un grand maître dit on le malheur est qu'il tue ses élèves"

In [11]:
#Define an n-gram extractor that generates unigrams, bigrams, trigrams and 4-grams of a string (lengths 1 to 4)
#Each space becomes '_ _' so word boundaries are marked with '_'; n-grams that span a space are then discarded
def ngram_extractor(sent):
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4)
            if ' ' not in ng and '\n' not in ng and ng != ('_',)]

ngram_extractor(s_clean)

['l',
 'e',
 't',
 'e',
 'm',
 'p',
 's',
 'e',
 's',
 't',
 'u',
 'n',
 'g',
 'r',
 'a',
 'n',
 'd',
 'm',
 'a',
 'î',
 't',
 'r',
 'e',
 'd',
 'i',
 't',
 'o',
 'n',
 'l',
 'e',
 'm',
 'a',
 'l',
 'h',
 'e',
 'u',
 'r',
 'e',
 's',
 't',
 'q',
 'u',
 "'",
 'i',
 'l',
 't',
 'u',
 'e',
 's',
 'e',
 's',
 'é',
 'l',
 'è',
 'v',
 'e',
 's',
 'le',
 'e_',
 '_t',
 'te',
 'em',
 'mp',
 'ps',
 's_',
 '_e',
 'es',
 'st',
 't_',
 '_u',
 'un',
 'n_',
 '_g',
 'gr',
 'ra',
 'an',
 'nd',
 'd_',
 '_m',
 'ma',
 'aî',
 'ît',
 'tr',
 're',
 'e_',
 '_d',
 'di',
 'it',
 't_',
 '_o',
 'on',
 'n_',
 '_l',
 'le',
 'e_',
 '_m',
 'ma',
 'al',
 'lh',
 'he',
 'eu',
 'ur',
 'r_',
 '_e',
 'es',
 'st',
 't_',
 '_q',
 'qu',
 "u'",
 "'i",
 'il',
 'l_',
 '_t',
 'tu',
 'ue',
 'e_',
 '_s',
 'se',
 'es',
 's_',
 '_é',
 'él',
 'lè',
 'èv',
 've',
 'es',
 'le_',
 '_te',
 'tem',
 'emp',
 'mps',
 'ps_',
 '_es',
 'est',
 'st_',
 '_un',
 'un_',
 '_gr',
 'gra',
 'ran',
 'and',
 'nd_',
 '_ma',
 'maî',
 'aît',
 'îtr',
 'tre',
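
The `'_ _'` replacement in ngram_extractor is easy to misread, so the following pure-Python sketch spells out the same idea without NLTK: each word gets a `_` marker at every interior word boundary, and n-grams are taken inside each marked word only. `word_bounded_ngrams` is a hypothetical name, and the sketch assumes words are separated by single spaces, as in s_clean.

```python
def word_bounded_ngrams(sentence, min_len=1, max_len=4, marker="_"):
    """Character n-grams that never cross a word boundary.

    Interior word boundaries are marked with `marker`; a lone marker is skipped.
    """
    words = sentence.split(" ")
    grams = []
    for n in range(min_len, max_len + 1):
        for i, word in enumerate(words):
            # No marker before the first word or after the last one,
            # mirroring sent.replace(' ', '_ _')
            prefix = marker if i > 0 else ""
            suffix = marker if i < len(words) - 1 else ""
            chunk = prefix + word + suffix
            for start in range(len(chunk) - n + 1):
                gram = chunk[start:start + n]
                if gram != marker:
                    grams.append(gram)
    return grams

result = word_bounded_ngrams("le temps")
print(result)
```

For "le temps" the marked words are "le_" and "_temps", so the result contains 'le_' and '_tem' but never an n-gram such as 'e_t' that would join two words.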


**Tutorial 2.2: Detecting Text Language by Counting Stopwords**

Based on "Detecting Text Language With Python and NLTK" by Alejandro Nolla.

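Stop words are very common words that are filtered out before processing because they are mostly grammatical rather than semantic in nature; for example, search engines typically drop words like 'the' or 'of'. Because each language has its own set of stop words, counting how many stop words from each language appear in a text gives a simple signal of which language it is written in.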
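
As a rough illustration of the counting idea, the sketch below scores a tokenized sentence against three tiny stopword sets and picks the language with the most matches. The sets in `STOPWORDS` are hypothetical, hand-picked samples for demonstration only; a real run would use complete stopword lists such as those distributed with NLTK. The function names are likewise assumptions.

```python
# Hypothetical, deliberately tiny stopword samples (illustration only)
STOPWORDS = {
    "english": {"the", "is", "and", "of", "to", "in", "that", "it"},
    "french": {"le", "la", "est", "un", "et", "de", "que", "ses"},
    "spanish": {"el", "la", "es", "un", "y", "de", "que", "los"},
}

def stopword_scores(tokens):
    """Count, per language, how many distinct tokens are stopwords."""
    unique_tokens = set(tokens)
    return {lang: len(unique_tokens & words) for lang, words in STOPWORDS.items()}

def guess_language(tokens):
    """Return the language whose stopword set overlaps the tokens the most."""
    scores = stopword_scores(tokens)
    return max(scores, key=scores.get)

tokens = ["le", "temps", "est", "un", "grand", "maître", "dit", "on",
          "le", "malheur", "est", "qu'il", "tue", "ses", "élèves"]
scores = stopword_scores(tokens)
print(scores)                  # {'english': 0, 'french': 4, 'spanish': 1}
print(guess_language(tokens))  # french
```

Spanish still scores 1 because "un" is shared with French, which shows why longer texts and fuller stopword lists give a more reliable answer.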