# Simple language recognition

The aim of this notebook is to implement a simple language recognition system based on n-grams.
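As a quick illustration of what an n-gram is, here is a minimal sketch. The helper name `ngrams` is hypothetical and is not used in the rest of the notebook; it only shows that an n-gram is a sequence of n consecutive words.

```python
def ngrams(words, n):
    # slide a window of n words over the list, one position at a time
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "j aime ce pays".split()
print(ngrams(words, 1))  # unigrams: one word per tuple
print(ngrams(words, 2))  # bigrams: pairs of consecutive words
```

A text shorter than n words simply yields an empty list.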

This exercise is inspired by the first NLP homework I had to do at the University of : http://tesniere.univ-fcomte.fr/

**Step 1:** 

- choose 3 languages

- download 2 texts per language (you can find them at https://www.gutenberg.org/)

My languages: English, French, Esperanto.

**Step 2:** clean the text files

**Step 3:** find the unigrams (warm-up step)

**Step 4:** find the bigrams

**Step 5:** determine whether a new text is in English, French or Esperanto

## Clean the text

**Warning**: pay attention to punctuation, upper case, etc.

Note 1: there are many ways to clean a document. For example, you can use a regular expression to extract only the words of the text (re.findall(r'\w+', text)), or use a specialized library such as NLTK.
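A minimal sketch of the regular-expression approach from Note 1, assuming we only want lowercase word tokens (the function name `tokenize` is hypothetical):

```python
import re

def tokenize(text):
    # \w+ matches runs of letters, digits and underscores (Unicode-aware
    # for str patterns in Python 3), so punctuation and spaces are dropped
    return re.findall(r'\w+', text.lower())

print(tokenize("L'énorme platane, qui la couvre!"))
```

Note that `\w` also keeps digits and underscores, so a stricter pattern may be preferable if numbers should be ignored.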

Note 2: punctuation can sometimes help to detect a language, for instance '¿' in Spanish. The writing system is also a strong clue about the language (compare the Latin alphabet with the Cyrillic alphabet).
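To illustrate Note 2, here is a rough sketch that counts letters per writing system. It relies on the assumption that the first word of a character's Unicode name (as returned by `unicodedata.name`) names its script, which holds for Latin and Cyrillic letters; the function name `script_counts` is hypothetical.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. 'CYRILLIC SMALL LETTER PE' -> 'CYRILLIC'
            name = unicodedata.name(ch, '')
            counts[name.split(' ')[0] if name else 'UNKNOWN'] += 1
    return counts

print(script_counts("привет world"))
print('¿' in "¿Dónde está la biblioteca?")  # a punctuation clue for Spanish
```

This is not used in the rest of the notebook, which stays with word-based detection.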

**In this Notebook we detect a language based on words**


In [1]:
import os # files place
import re # remove punctuation (you can also use string or NLTK)
from collections import Counter # build the N-gram frequencies

# 1 : clean the text
# Inputs : directory path, text list in a specific language
# Output : text (string) without punctuation and extra whitespaces
def clean_text(directory, file_list):    
    text = ''
    
    # read the file
    for file in file_list :
        with open(os.path.join(directory, file)) as f:
            lines = f.read()
            
            # replace punctuation with spaces, lowercase, then collapse
            # every run of whitespace (newlines included) into one space
            lines = re.sub(r'[^\w\s]', ' ', lines)
            lines = lines.lower()
            lines = re.sub(r'\s+', ' ', lines)
            
            text += lines

    return text


In [29]:
# 2 : apply your clean function on texts

# dir and texts
directory = 'data/'
texts = os.listdir(directory)

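# NOTE : these are substring tests on the file names, so the names must be
# unambiguous (a name such as 'french_1.txt' would also contain 'en')

# french texts :
french = [f for f in texts if 'fr' in f]

# esperanto texts :
esperanto = [f for f in texts if 'esp' in f]

# english texts :
english = [f for f in texts if 'en' in f]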

# clean text :
clean_fr = clean_text(directory, french)
clean_esp = clean_text(directory, esperanto)
clean_en = clean_text(directory, english)

print("Clean text for esperanto : ", clean_esp[0:100])

Clean text for esperanto :  estis iam dudek kvin stansoldatoj ili ciuj estis fratoj car ili devenis de malnova stankulero la paf


## Unigrams

Find the list of unigrams in the texts and sort them by frequency.

Here we simply use a Counter, which counts the occurrences of each word in the text. It's exactly the same as building a vocabulary with a count attached to each word.

In [3]:
# Build the unigram frequencies of a text
# Input : a text (str)
# Output : Counter object mapping each word to its number of occurrences
#          (use .most_common() to get the words sorted by frequency)
def unigrams(text):
    
    words = text.split()
    
    return Counter(words)

en_unig = unigrams(clean_en)
esp_unig = unigrams(clean_esp)
fr_unig = unigrams(clean_fr)

# some operations you can make on Counter objects :
print(en_unig.most_common(10))
print('\nThere are {} unigrams in esperanto texts.\n'.format(len(esp_unig)))
print('There are {} unigrams in french texts.\n'.format(len(fr_unig)))
print('There are {} unigrams in english texts.\n'.format(len(en_unig)))

[('the', 4895), ('of', 2784), ('and', 2476), ('a', 2104), ('to', 2051), ('in', 1538), ('he', 1349), ('i', 1188), ('was', 1114), ('that', 1057)]

There are 5381 unigrams in esperanto texts.

There are 10180 unigrams in french texts.

There are 11197 unigrams in english texts.



## Bigrams

Build the bigram lists for each text and sort them by frequency.

Here we also show one way to reduce the size of the vocabulary: replacing each word with its lemma (for French only, as an example).

**warning** : we don't want duplicates

In [4]:
import spacy

nlp = spacy.load("fr_core_news_md")

In [5]:
text = nlp(clean_fr)

# join the lemma of every token with single spaces
fr_text_lemma = ' '.join(word.lemma_ for word in text)
    
print('Standard text :', clean_fr[100:200])
print('Lemmatized text :', fr_text_lemma[100:200])

Standard text : us l énorme platane qui la couvre l abrite et l ombrage tout entière j aime ce pays et j aime y vivr
Lemmatized text : n sou l énorme platane qui le couvre l abriter et l ombrage tout entier j aimer ce pays et j aimer y


In [6]:
# results
print('There are {} unigrams in french texts.\n'.format(len(fr_unig)))
print('There are {} tokens in the french text lemmatized'.format(len(unigrams(fr_text_lemma))))

There are 10180 unigrams in french texts.

There are 6415 tokens in the french text lemmatized


In [7]:
# Build the bigram frequencies of the text
# Input : text (str)
# Output : Counter object of bigram frequencies
def bigrams(text):
    
    words = text.split()
    
    # pair each word with the next one (sliding window of 2 words)
    bigram_list = [' '.join(pair) for pair in zip(words, words[1:])]
    
    return Counter(bigram_list)

french_bg = bigrams(clean_fr)
print(french_bg.most_common(10))

[('de la', 292), ('qu il', 256), ('c est', 212), ('de l', 183), ('d un', 174), ('d une', 166), ('je ne', 154), ('à la', 153), ('j ai', 152), ('et il', 137)]


In [8]:
# rest of the bigrams :
english_bg = bigrams(clean_en)
print(english_bg.most_common(10))

esperanto_bg = bigrams(clean_esp)
print(esperanto_bg.most_common(10))

[('of the', 680), ('in the', 471), ('it was', 240), ('to the', 213), ('to be', 189), ('on the', 172), ('he had', 170), ('he was', 170), ('at the', 160), ('of a', 156)]
[('en la', 319), ('de la', 219), ('sur la', 165), ('kaj la', 113), ('tio ci', 112), ('tiu ci', 98), ('la doktoro', 95), ('al la', 93), ('diris la', 91), ('al si', 85)]


**As you can see, Esperanto and French share some frequent bigrams (for example 'de la')**
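The overlap can also be checked programmatically. The sketch below is only an illustration: the helper name `shared_top_ngrams` is hypothetical, and the two Counters are rebuilt by hand from the top-10 outputs printed above rather than from the full texts.

```python
from collections import Counter

def shared_top_ngrams(counter_a, counter_b, n=10):
    # n-grams that appear in the n most frequent entries of both Counters
    top_a = [gram for gram, _ in counter_a.most_common(n)]
    top_b = {gram for gram, _ in counter_b.most_common(n)}
    return [gram for gram in top_a if gram in top_b]

# counts copied from the top-10 outputs above
french_top = Counter({'de la': 292, 'qu il': 256, 'c est': 212, 'de l': 183,
                      'd un': 174, 'd une': 166, 'je ne': 154, 'à la': 153,
                      'j ai': 152, 'et il': 137})
esperanto_top = Counter({'en la': 319, 'de la': 219, 'sur la': 165,
                         'kaj la': 113, 'tio ci': 112, 'tiu ci': 98,
                         'la doktoro': 95, 'al la': 93, 'diris la': 91,
                         'al si': 85})

print(shared_top_ngrams(french_top, esperanto_top))
```

In the notebook itself, `shared_top_ngrams(french_bg, esperanto_bg)` would run the same check on the full Counters.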

## Similarity between lists

In this step we assign weights to the words of an ordered list.

During the previous steps we created Counter objects to order words by frequency. Now we use this order to assign weights to words. The first word of the smaller list gets a weight equal to the length of that list, and the weight decreases by one for each following word.

After that we pad the weights of the larger list with zeros.

These steps let us compare a new text with all the weighted lists. The list with the highest score indicates the most likely language of the new text.

**Step 1**: convert Counter objects into lists.

- **remember** : keep word order (they are ordered by frequency)

In [9]:
# Counter objects for each language (unigrams) : 
french_c_ug = fr_unig
english_c_ug = en_unig
esperanto_c_ug = esp_unig

In [10]:
# Convert Counter object into a list,
# Input : Counter object of a specific language
# Output : ordered list respecting the order of the words in this Counter

def counter_to_list(counter_object):
    
    dic = dict(counter_object)
    # output : {'she': 1, 'was': 1, 'driving': 1, 'her': 2, 'car': 2, ... }
    
    # this step is necessary to respect the order of words in the Counter object
    dic = sorted(dic.items(), key=lambda x : x[1], reverse= True)
    # output : [('her', 2), ('car', 2), ('she', 1), ('was', 1), ('driving', 1), ('when', 1), ...]

    ordered_list = [ i[0] for i in dic ]
    # output : ['her', 'car', 'she', 'was', 'driving', 'when', ...]
    
    return ordered_list

# test :
test_text = "she was driving her car when a dear appeared and stopped her car"
counter_object = Counter(test_text.split())
print(counter_to_list(counter_object))


['her', 'car', 'she', 'was', 'driving', 'when', 'a', 'dear', 'appeared', 'and', 'stopped']


In [11]:
# Convert all Counter objects into ordered lists:
french_l_ug = counter_to_list(french_c_ug)
english_l_ug = counter_to_list(english_c_ug)
esperanto_l_ug = counter_to_list(esperanto_c_ug)

# Check if the conversion is right :
print(french_c_ug.most_common(10))
print(french_l_ug[0:10])
print(english_c_ug.most_common(10))
print(english_l_ug[0:10])
print(esperanto_c_ug.most_common(10))
print(esperanto_l_ug[0:10])

[('de', 2768), ('et', 2239), ('la', 2002), ('le', 1675), ('il', 1461), ('à', 1424), ('les', 1344), ('l', 1302), ('un', 1230), ('je', 1200)]
['de', 'et', 'la', 'le', 'il', 'à', 'les', 'l', 'un', 'je']
[('the', 4895), ('of', 2784), ('and', 2476), ('a', 2104), ('to', 2051), ('in', 1538), ('he', 1349), ('i', 1188), ('was', 1114), ('that', 1057)]
['the', 'of', 'and', 'a', 'to', 'in', 'he', 'i', 'was', 'that']
[('la', 2586), ('kaj', 1525), ('de', 773), ('mi', 644), ('en', 615), ('li', 569), ('si', 536), ('al', 526), ('ne', 500), ('ci', 408)]
['la', 'kaj', 'de', 'mi', 'en', 'li', 'si', 'al', 'ne', 'ci']


**Step 2 :** associate weights to words in the ordered list.

The first word of the smaller list gets a weight equal to the size of that list, then the weights decrease by one per word. The larger list receives the same weights, padded with zeros (see the example).


- L1 = [a, b, c, d, e]
- L2 = [a, b, f, c, g, d]
- weights of L1 =  [5, 4, 3, 2, 1]
- weights of L2 = [5, 4, 3, 2, 1, 0]

**Step 3 :** for each word found in both lists, multiply its two weights, then sum the products

- S(L1, L2) = 5x5 + 4x4 + 3x2 + 2x0 = 25 + 16 + 6 + 0 = 47 (terms for a, b, c and d; e is not in L2)

In [12]:
# assume that l1 is never longer than l2
def similarity_lists(l1, l2):
    
    # 1st word's weight = size of the small list, then decreasing :
    w_small_l = list(range(len(l1), 0, -1))
    # [5, 4, 3, 2, 1]
    
    # pad the weights of the large list with zeros :
    padding_val = len(l2) - len(l1)
    # 1
    
    # weights of the large list :
    w_large_l = w_small_l + ([0] * padding_val)
    # [5, 4, 3, 2, 1, 0]
    
    # assign weights to the words of each list :
    dic1 = {word : w for word, w in zip(l1, w_small_l)}
    dic2 = {word : w for word, w in zip(l2, w_large_l)}
    # {'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1}
    # {'a': 5, 'b': 4, 'f': 3, 'c': 2, 'g': 1, 'd': 0}
    
    # sum the products of the weights of the words found in both lists :
    count = 0
    for word, weight1 in dic1.items():
        if word in dic2:
            count += weight1 * dic2[word]
    
    return count

# test with small lists of char : 
l1 = ['a', 'b', 'c', 'd', 'e']
l2 = ['a', 'b', 'f', 'c', 'g', 'd']
print(similarity_lists(l1, l2))

47


In [28]:
# Check similarity between a text in an unknown language and our 3 ordered lists of unigrams

unk_texts = ['i love to eat icecream', 'i love ice', 'je vais à la piscine avec mes amis', 
             'La cielo nin gardu de vido de sceno komenco oni vidadis en la semitajo', 
             'La cielo nin gardu de vido de sceno']

for text in unk_texts:
    # unigrams of this text (don't forget to clean it if you are using a larger file).
    # Note : iterating over a Counter yields the words in order of first appearance
    # (Python 3.7+), so that is the order used to weight the unknown text.
    unk_ug = unigrams(text)

    sim_english = similarity_lists(unk_ug, english_l_ug)
    sim_french = similarity_lists(unk_ug, french_l_ug)
    sim_esperanto = similarity_lists(unk_ug, esperanto_l_ug)

    # scores
    #print(sim_english)
    #print(sim_esperanto)
    #print(sim_french)
    
    # Tell the user which language is detected:

    if sim_english > sim_esperanto and sim_english > sim_french:
        print('Your text \'{}\' is probably in english'.format(text))
    elif sim_french > sim_esperanto and sim_french > sim_english:
        print('Your text \'{}\' is probably in french'.format(text))
    elif sim_esperanto > sim_french and sim_esperanto > sim_english:
        print('Your text \'{}\' is probably in esperanto'.format(text))
    else:
        # no single best score (a tie, for example every score at zero)
        print('I don\'t understand the sentence \'{}\' , tell me more'.format(text))

Your text 'i love to eat icecream' is probably in english
I don't understand the sentence 'i love ice' , tell me more
Your text 'je vais à la piscine avec mes amis' is probably in french
Your text 'La cielo nin gardu de vido de sceno komenco oni vidadis en la semitajo' is probably in esperanto
Your text 'La cielo nin gardu de vido de sceno' is probably in french


### Note that here we are checking sentences

As the results show, the system can't always give the right answer.

Esperanto looks a lot like French here, because the two share some tokens. More precisely, Esperanto draws on many languages, because it aims to be THE international language! More info here: https://fr.wikipedia.org/wiki/Esp%C3%A9ranto

You can also check the language by using bigrams.
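A minimal, self-contained sketch of what a bigram-based check could look like. It assumes the same rank-weighting idea as the unigram similarity (position i in either list gets weight len(small) - i, never below 0). The function names and the two toy training sentences are invented for this illustration; they are not the notebook's data.

```python
from collections import Counter

def ranked_bigrams(text):
    # bigrams of the text, most frequent first
    words = text.lower().split()
    counts = Counter(' '.join(pair) for pair in zip(words, words[1:]))
    return [bigram for bigram, _ in counts.most_common()]

def rank_similarity(small, large):
    # weight of the item at position i: len(small) - i, floored at 0
    n = len(small)
    w_small = {item: n - i for i, item in enumerate(small)}
    w_large = {item: max(n - i, 0) for i, item in enumerate(large)}
    return sum(w * w_large.get(item, 0) for item, w in w_small.items())

def detect_language(text, models):
    # score the unknown text against each language and keep the best one
    # (in case of a tie, max() returns the first language in the dict)
    unknown = ranked_bigrams(text)
    scores = {lang: rank_similarity(unknown, ranked)
              for lang, ranked in models.items()}
    return max(scores, key=scores.get), scores

# toy training sentences, invented for this sketch
models = {
    'english': ranked_bigrams("the cat sat on the mat and the dog sat on the rug"),
    'french': ranked_bigrams("le chat est sur le tapis et le chien est sur le lit"),
}

best, scores = detect_language("the dog sat on the mat", models)
print(best, scores)
```

With the notebook's data, the models would instead come from `counter_to_list(bigrams(clean_en))` and its French and Esperanto counterparts.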

The more text you provide, the better your chances of detecting the right language! :)