### Simple Language Guesser using Character Bigrams

In [4]:
import nltk
from nltk.corpus import udhr

### Problem a)

1. The function `build_language_models(languages, language_base)` returns a conditional frequency distribution of lowercase character bigrams, with the languages as conditions.<br>

2. Algorithm explanation:<br>
First, a dictionary `language_base2` is initialised with one empty list per language. A loop then lowercases each word, splits it into character bigrams, and appends those bigrams to the list for that language. Finally, a conditional frequency distribution is built from this dictionary of character bigrams and returned.


In [5]:
def build_language_models(languages, language_base):
    language_base2=dict([(key, []) for key in languages]) #initialised new dict with keys as languages
    
    for language in languages:
        for word in language_base[language]:
            language_base2[language].extend(list(nltk.bigrams(word.lower()))) #fill dictionary with character bigrams
            
    #Now find conditional frequency distribution of character bigrams with keys as languages        
    cfd=nltk.ConditionalFreqDist((lang,bi) for lang in language_base2.keys() for bi in language_base2[lang])    
    return cfd
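
As a rough illustration of what the returned distribution holds, the sketch below rebuilds the same kind of per-language bigram counts using only the standard library. The helper names `char_bigrams` and `toy_models` and the two tiny word lists are made up for this example; `zip(word, word[1:])` stands in for `nltk.bigrams`, and a `Counter` plays the role of the per-language frequency distribution.

```python
from collections import Counter

def char_bigrams(word):
    """Return the lowercase character bigrams of a word as a list of tuples."""
    word = word.lower()
    return list(zip(word, word[1:]))

def toy_models(language_base):
    """Map each language to a Counter of its character bigrams (hypothetical helper)."""
    models = {}
    for language, words in language_base.items():
        counts = Counter()
        for word in words:
            counts.update(char_bigrams(word))
        models[language] = counts
    return models

# Made-up word lists, only for illustration (not the UDHR corpus)
toy_base = {"English": ["The", "then"], "German_Deutsch": ["die", "sie"]}
models = toy_models(toy_base)

# Relative frequency of a bigram = its count / total bigrams for that language
english_total = sum(models["English"].values())
print(models["English"][("t", "h")] / english_total)
```

Here `("t", "h")` occurs twice among the five English bigrams, so its relative frequency is 2/5. This is the same kind of value that `language_model_cfd[language].freq(key)` reports in the next cell.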

In [6]:
languages = ['English', 'German_Deutsch', 'Finnish_Suomi', 'Italian']
language_base = dict((language, udhr.words(language + '-Latin1')) for language in languages)
language_model_cfd = build_language_models(languages, language_base)

# print the models for visual inspection
for language in languages:
    for key in list(language_model_cfd[language].keys())[:10]:
        print(language, key, "->", language_model_cfd[language].freq(key))

English ('u', 'n') -> 0.0063174114021571645
English ('n', 'i') -> 0.004930662557781202
English ('i', 'v') -> 0.0032357473035439137
English ('v', 'e') -> 0.01140215716486903
English ('e', 'r') -> 0.020338983050847456
English ('r', 's') -> 0.003389830508474576
English ('s', 'a') -> 0.001694915254237288
English ('a', 'l') -> 0.019106317411402157
English ('d', 'e') -> 0.00600924499229584
English ('e', 'c') -> 0.007087827426810477
German_Deutsch ('d', 'i') -> 0.0099361249112846
German_Deutsch ('i', 'e') -> 0.016465578424414477
German_Deutsch ('a', 'l') -> 0.008232789212207239
German_Deutsch ('l', 'l') -> 0.005961674946770759
German_Deutsch ('l', 'g') -> 0.00127750177430802
German_Deutsch ('g', 'e') -> 0.02185947480482612
German_Deutsch ('e', 'm') -> 0.0055358410220014195
German_Deutsch ('m', 'e') -> 0.006813342796309439
German_Deutsch ('e', 'i') -> 0.02995031937544358
German_Deutsch ('i', 'n') -> 0.01973030518097942
Finnish_Suomi ('i', 'h') -> 0.0037654653039268424
Finnish_Suomi ('h', 'm') 

### Algorithm

**Input**: language_model_cfd, text
<br>
**Output**: language_guess
<br><br>
**Initialize**: words = text split on spaces, test_text_char_bigram = [ ], score_checker = {language: 0} for each language in language_model_cfd.keys()
<br>
**begin**:<br>
1.**for** word in words **do**<br>
2.&emsp;    Convert word into lower case.<br>
3.&emsp;    Find character bigram of converted word.<br>
4.&emsp;    Append bigram to test_text_char_bigram.<br>
5.**for** language in language_model_cfd **do**<br>
6.&emsp;    **for** char_bigram in test_text_char_bigram **do**<br>
7.&emsp;&emsp;        score_checker[language]+=language_model_cfd[language].freq(char_bigram)<br>
8.max_score=max(score_checker.values())<br>
9.**for** language in score_checker.keys() **do**<br>
10.&emsp;    **if** score_checker[language] == max_score **then**<br>
11.&emsp;&emsp;        language_guess=language<br>
12.**return** language_guess<br>

#### Algorithm explanation:
For each language in the conditional frequency distribution, this algorithm computes an overall score: the sum of the relative frequencies of the test text's character bigrams under that language's model. It then selects the language with the highest overall score. The algorithm correctly guessed the language of all four texts given in the problem.

1. In steps 1 to 4, the algorithm builds a list of the character bigrams of the words in the given text.
2. In steps 5 to 7, the algorithm builds a dictionary whose keys are the language names and whose values are the accumulated frequency scores. For example, suppose the bigram ('i','e') is more frequent in German than in English. It then contributes a higher frequency score to German than to English. For each bigram in the text, we add its frequency to the running score of each language, which is kept in a dictionary named score_checker.
3. In steps 8 to 11, the algorithm finds the language with the highest overall score.
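
Steps 5 to 7 can be traced by hand on a toy model. The sketch below uses two invented frequency tables (the numbers are made up for illustration, not taken from the UDHR models) and plain dictionaries in place of the conditional frequency distribution. A bigram missing from a table is assumed to contribute 0, which mirrors how `freq` behaves for a sample that was never counted.

```python
# Invented relative frequencies, for illustration only
toy_freqs = {
    "English": {("t", "h"): 0.03, ("h", "e"): 0.02},
    "German_Deutsch": {("i", "e"): 0.02, ("e", "i"): 0.03},
}

# Character bigrams of the words "die" and "ei"
text_bigrams = [("d", "i"), ("i", "e"), ("e", "i")]

scores = {language: 0.0 for language in toy_freqs}
for language, table in toy_freqs.items():
    for bigram in text_bigrams:
        # Unseen bigrams add nothing to the score
        scores[language] += table.get(bigram, 0.0)

print(scores)
```

None of the three bigrams appears in the toy English table, so English stays at 0, while German accumulates the frequencies of ('i','e') and ('e','i') and ends up with the higher score.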

The following code prints the dictionary `score_checker`, with the languages as keys and the cumulative frequency scores of the text's character bigrams as values.

In [7]:
text = "Come in altri paesi europei del mediterraneo, sono presenti tratti distintivi ed elementi che caratterizzano la dieta mediterranea."
words=text.split(" ")       #Convert the given text into list of words  
test_text_char_bigrams=[]   #This list is used to store character bigrams of the given text

for word in words:
    #Save lowercase character bigrams of each word.
    test_text_char_bigrams.extend(list(nltk.bigrams(word.lower())))

#Initialize dictionary with keys as languages and value=0
score_checker=dict.fromkeys(language_model_cfd.keys(),0)

#Accumulate the frequency score of each bigram for each language in language_model_cfd.
for language in language_model_cfd.keys():
    for char_bigram in test_text_char_bigrams:
        score_checker[language]+=language_model_cfd[language].freq(char_bigram)

(score_checker)        

{'English': 0.8713405238828973,
 'German_Deutsch': 0.8600425833924772,
 'Finnish_Suomi': 0.6358257127487892,
 'Italian': 1.0007423904974013}

As is evident from the above output, the best guess for the given text is **"Italian"**, as it has the highest cumulative frequency score.
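
Picking the winner from this dictionary can also be written in one line. The sketch below copies the scores printed above into a literal dictionary and uses the built-in `max` with `key=score_checker.get`. This is an alternative to the loop in steps 8 to 11 of the algorithm rather than the notebook's own code; on a tie it returns whichever top-scoring key comes first in the dictionary.

```python
# Scores copied from the output above
score_checker = {
    'English': 0.8713405238828973,
    'German_Deutsch': 0.8600425833924772,
    'Finnish_Suomi': 0.6358257127487892,
    'Italian': 1.0007423904974013,
}

# max iterates over the keys and compares them by their score
best_language = max(score_checker, key=score_checker.get)
print(best_language)  # Italian
```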

### Function implementation 

In [8]:
def guess_language(language_model_cfd, text):
    words=text.split(" ")       #Convert the given text into a list of words
    test_text_char_bigrams=[]   #This list stores the character bigrams of the given text
    
    for word in words:
        #Save lowercase character bigrams of each word
        test_text_char_bigrams.extend(list(nltk.bigrams(word.lower())))
    
    #Initialize dictionary with keys as languages and value=0
    score_checker=dict.fromkeys(language_model_cfd.keys(),0)
    
    #Accumulate the frequency score of each bigram for each language in language_model_cfd
    for language in language_model_cfd.keys():
        for char_bigram in test_text_char_bigrams:
            score_checker[language]+=language_model_cfd[language].freq(char_bigram)
    max1=max(score_checker.values())  #Find the maximum overall score
    
    #Find the key (language) whose value equals the maximum overall score.
    #Compare with ==, not "is": "is" tests object identity, not numeric equality.
    language_guess = [lang for lang in score_checker.keys() if score_checker[lang] == max1]
    return language_guess[0]

### Testing on texts

In [9]:
text1 = "Syksy on kaunis vuodenaika, varsinkin kun ei sada."
text2 = "Erkenntnisfortschritte ergeben sich durch das Wechselspiel von Beobachtung oder Experiment mit der Theorie."
text3 = "Come in altri paesi europei del mediterraneo, sono presenti tratti distintivi ed elementi che caratterizzano la dieta mediterranea."
text4 = "A healthy diet is important if you want to live a healthy life."

In [45]:
print('guess for finnish text is', guess_language(language_model_cfd, text1))
print('guess for german text is', guess_language(language_model_cfd, text2))
print('guess for italian text is', guess_language(language_model_cfd, text3))
print('guess for english text is', guess_language(language_model_cfd, text4))

guess for finnish text is Finnish_Suomi
guess for german text is German_Deutsch
guess for italian text is Italian
guess for english text is English
