In [1]:
# Run this cell once before doing anything else
!pip install --target=$nb_path nltk==3.5
!python3 -m nltk.downloader udhr

Defaulting to user installation because normal site-packages is not writeable
[nltk_data] Downloading package udhr to /home/alexander/nltk_data...
[nltk_data]   Package udhr is already up-to-date!


In [2]:
import nltk 
from nltk.corpus import udhr

## Homework 4.1 (6 points)
### Alexander Praus, Maike Arnold

Implement a language guesser, i.e. a function that takes a given text and outputs the language it thinks the text is
written in. The function should base its decision on the frequency of individual characters in each language.

In [103]:
# build the language models
# udhr contains the Universal Declaration of Human Rights in over 300 languages
languages = ['English', 'German_Deutsch', 'Spanish']
language_base = dict((language, udhr.words(language + '-Latin1')) for language in languages)

a) Implement a function `build_language_models(languages,words)` which takes a list of languages and a
dictionary of words as arguments and returns a conditional frequency distribution where:
*   the languages are the conditions
*   the values are the lower case characters found in `words[language]`

Call the function as follows:

In [104]:
def build_language_models(languages, words):
    freqDist = nltk.ConditionalFreqDist((language, char.lower()) for language in languages for word in words[language] for char in word)  
    return freqDist

In [105]:
language_model_cfd = build_language_models(languages, language_base)
print(language_model_cfd.conditions())
# print the models for visual inspection (you always should have a look at the data :)
for language in languages:
    for key in list(language_model_cfd[language].keys())[:10]:
        print(language, key, "->", language_model_cfd[language].freq(key))

['English', 'German_Deutsch', 'Spanish']
English u -> 0.02212549873050417
English n -> 0.08076411558457261
English i -> 0.07882964575021158
English v -> 0.011606819006166122
English e -> 0.12017893845967839
English r -> 0.06770644420263572
English s -> 0.05126345061056704
English a -> 0.08040140249062991
English l -> 0.04570184983677911
English d -> 0.03615040502962157
German_Deutsch d -> 0.054984823721690404
German_Deutsch i -> 0.0745972449217838
German_Deutsch e -> 0.16845668923651647
German_Deutsch a -> 0.053116974083586274
German_Deutsch l -> 0.04027550782162036
German_Deutsch g -> 0.03829091758113472
German_Deutsch m -> 0.016577165538174177
German_Deutsch n -> 0.10156432407191221
German_Deutsch r -> 0.07623161335512492
German_Deutsch k -> 0.011440579033387813
Spanish d -> 0.060897821639186424
Spanish e -> 0.12552653748946924
Spanish c -> 0.05295462751233602
Spanish l -> 0.05271392466000722
Spanish a -> 0.10903839210494644
Spanish r -> 0.06462871585028282
Spanish i -> 0.07678420989

b) Develop an algorithm which calculates the overall score of a given text based on the frequency of characters
accessible by `language_model_cfd[language].freq(character)`. Explain how the algorithm works.

In [107]:
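# Helper name chosen here for illustration; wrapping the loop in a function
# defines `text` and `language`, which were undefined in the bare loop.
def score_text(language_model_cfd, language, text):
    prob = 0
    for char in text.lower():
        prob += language_model_cfd[language].freq(char)
    return prob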

The algorithm iterates over all characters of the given text and sums the relative frequencies of the individual characters, as given by the frequency distribution of the respective language. The language with the highest sum is the guess. To keep the algorithm case-insensitive, the text is converted to lowercase with `text.lower()`.
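The summation described above can be tried out without the corpus download. The following is only a minimal sketch: it uses `collections.Counter` in place of `nltk.ConditionalFreqDist`, and the two short training samples, as well as the names `char_model`, `score` and `toy_models`, are assumptions made for illustration.

```python
from collections import Counter

def char_model(sample):
    """Relative frequency of each letter in a lowercased sample text."""
    counts = Counter(ch for ch in sample.lower() if ch.isalpha())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def score(model, text):
    """Sum the model frequency of every character of the lowercased text."""
    return sum(model.get(ch, 0.0) for ch in text.lower())

# Toy training samples (assumption: stand-ins for the UDHR corpus)
toy_models = {
    "English": char_model("all human beings are born free and equal"),
    "Spanish": char_model("todos los seres humanos nacen libres e iguales"),
}

text = "los humanos"
scores = {lang: score(model, text) for lang, model in toy_models.items()}
print(scores)
print(max(scores, key=scores.get))  # language with the highest summed frequency
```

Characters that never occur in a sample (including the space) simply contribute 0 to the sum.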

c) Implement a function `guess_language(language_model_cfd, text)` that returns the most likely language
for a given text according to your algorithm from the previous subtask.

In [110]:
def guess_language(language_model_cfd, text):
    probabilities = dict()
    for language in language_model_cfd.conditions():
        # score = sum of the character frequencies of the lowercased text
        prob = 0
        for char in text.lower():
            prob += language_model_cfd[language].freq(char)
        probabilities[language] = prob

    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

d) Test your implementation with the following data:

In [111]:
text1 = "Peter had been to the office before they arrived."
text2 = "Si terminas tu tarea, te dare un caramelo."
text3 = "Das ist ein schon recht langes deutsches Beispiel."

# guess the language by comparing the frequency distributions
print('guess for english text is', guess_language(language_model_cfd, text1))
print('guess for spanish text is', guess_language(language_model_cfd, text2))
print('guess for german text is', guess_language(language_model_cfd, text3))

guess for english text is ('German_Deutsch', 3.0331543310763487)
guess for spanish text is ('Spanish', 2.5370080635455525)
guess for german text is ('German_Deutsch', 3.0392248424001864)


e) Discuss why English and German texts are difficult to distinguish with the given approach.

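This is most likely because both English and German are Germanic languages. They therefore share part of their vocabulary and have more similar character distributions. Spanish, by contrast, is a Romance language, so there are fewer similarities between German/English and Spanish than between German and English. In addition, because the score is a plain sum of frequencies, a language whose distribution puts more weight on the letters of the text is favored: in the models printed above, `e` has a frequency of about 0.17 in German but only about 0.12 in English, which pulls the `e`-rich English test sentence towards German. We would most likely run into the same problem if we were to test this approach on Spanish and Portuguese or Spanish and Italian.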
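The effect on the English test sentence can be checked directly against the printed models. The sketch below is only a partial calculation: it uses the rounded frequencies of the seven letters that appear in the printed output for both English and German, so the sums are not the full scores reported in d). The names `partial_score`, `english` and `german` are chosen here for illustration.

```python
# Frequencies copied (rounded) from the character models printed in 4.1 a),
# restricted to the letters listed there for both languages.
english = {"e": 0.1202, "n": 0.0808, "i": 0.0788, "r": 0.0677,
           "a": 0.0804, "l": 0.0457, "d": 0.0362}
german = {"e": 0.1685, "n": 0.1016, "i": 0.0746, "r": 0.0762,
          "a": 0.0531, "l": 0.0403, "d": 0.0550}

text1 = "Peter had been to the office before they arrived."

def partial_score(freqs, text):
    """Sum frequencies over the characters covered by the table."""
    return sum(freqs.get(ch, 0.0) for ch in text.lower())

english_score = partial_score(english, text1)
german_score = partial_score(german, text1)
print("English (partial):", english_score)
print("German  (partial):", german_score)

# Contribution of the letter 'e' alone to the difference
e_count = text1.lower().count("e")
print("gap due to 'e':", e_count * (german["e"] - english["e"]))
```

On these seven letters the German model already scores higher than the English one for an English sentence, and most of the gap comes from `e`.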

## Homework 4.2 (4 points)

The previous language guesser was based on the frequency of characters. Implement alternative language guessers
based on the following lexical units:

In all of the following exercises, large parts of the code were copied from Homework 4.1. For each exercise, the only things that had to be changed were the events used when creating the ConditionalFreqDist and the algorithm described in 4.1 b).

a) tokens

In [68]:
def build_language_models(languages, words):
    freqDist = nltk.ConditionalFreqDist((language, word.lower()) for language in languages for word in words[language])  
    return freqDist

In [69]:
language_model_cfd = build_language_models(languages, language_base)

# print the models for visual inspection (you always should have a look at the data :)
for language in languages:
    for key in list(language_model_cfd[language].keys())[:10]:
        print(language, key, "->", language_model_cfd[language].freq(key))

English universal -> 0.002807411566535654
English declaration -> 0.003368893879842785
English of -> 0.045480067377877596
English human -> 0.0072992700729927005
English rights -> 0.010106681639528355
English preamble -> 0.0005614823133071309
English whereas -> 0.0039303761931499155
English recognition -> 0.0016844469399213925
English the -> 0.06176305446378439
English inherent -> 0.0005614823133071309
German_Deutsch die -> 0.026298487836949377
German_Deutsch allgemeine -> 0.003287310979618672
German_Deutsch erklärung -> 0.003287310979618672
German_Deutsch der -> 0.03353057199211045
German_Deutsch menschenrechte -> 0.0039447731755424065
German_Deutsch resolution -> 0.0006574621959237344
German_Deutsch 217 -> 0.0006574621959237344
German_Deutsch a -> 0.0006574621959237344
German_Deutsch ( -> 0.0006574621959237344
German_Deutsch iii -> 0.0006574621959237344
Spanish declaración -> 0.0022688598979013048
Spanish universal -> 0.0022688598979013048
Spanish de -> 0.06466250709018719
Spanish dere

In [82]:
def guess_language(language_model_cfd, text):
    probabilities = dict()
    for language in language_model_cfd.conditions():
        # score = sum of the frequencies of the lowercased tokens
        prob = 0
        for token in nltk.word_tokenize(text):
            prob += language_model_cfd[language].freq(token.lower())
        probabilities[language] = prob
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

In [83]:
# guess the language by comparing the frequency distributions
print('guess for english text is', guess_language(language_model_cfd, text1))
print('guess for spanish text is', guess_language(language_model_cfd, text2))
print('guess for german text is', guess_language(language_model_cfd, text3))

guess for english text is ('English', 0.1420550252667041)
guess for spanish text is ('German_Deutsch', 0.09993425378040763)
guess for german text is ('German_Deutsch', 0.08086785009861933)


b) character bigrams

In [92]:
def build_language_models(languages, words):
    freqDist = nltk.ConditionalFreqDist((language, bigram) for language in languages for word in words[language] for bigram in nltk.bigrams(word.lower()))  
    return freqDist

In [93]:
language_model_cfd = build_language_models(languages, language_base)

# print the models for visual inspection (you always should have a look at the data :)
for language in languages:
    for key in list(language_model_cfd[language].keys())[:10]:
        print(language, key, "->", language_model_cfd[language].freq(key))

English ('u', 'n') -> 0.0063174114021571645
English ('n', 'i') -> 0.004930662557781202
English ('i', 'v') -> 0.0032357473035439137
English ('v', 'e') -> 0.01140215716486903
English ('e', 'r') -> 0.020338983050847456
English ('r', 's') -> 0.003389830508474576
English ('s', 'a') -> 0.001694915254237288
English ('a', 'l') -> 0.019106317411402157
English ('d', 'e') -> 0.00600924499229584
English ('e', 'c') -> 0.007087827426810477
German_Deutsch ('d', 'i') -> 0.0099361249112846
German_Deutsch ('i', 'e') -> 0.016465578424414477
German_Deutsch ('a', 'l') -> 0.008232789212207239
German_Deutsch ('l', 'l') -> 0.005961674946770759
German_Deutsch ('l', 'g') -> 0.00127750177430802
German_Deutsch ('g', 'e') -> 0.02185947480482612
German_Deutsch ('e', 'm') -> 0.0055358410220014195
German_Deutsch ('m', 'e') -> 0.006813342796309439
German_Deutsch ('e', 'i') -> 0.02995031937544358
German_Deutsch ('i', 'n') -> 0.01973030518097942
Spanish ('d', 'e') -> 0.03513596089214788
Spanish ('e', 'c') -> 0.014359914

In [95]:
def guess_language(language_model_cfd, text):
    probabilities = dict()
    for language in language_model_cfd.conditions():
        # score = sum of the frequencies of the character bigrams of each token
        prob = 0
        for token in nltk.word_tokenize(text):
            for bigram in nltk.bigrams(token.lower()):
                prob += language_model_cfd[language].freq(bigram)
        probabilities[language] = prob
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

In [96]:
# guess the language by comparing the frequency distributions
print('guess for english text is', guess_language(language_model_cfd, text1))
print('guess for spanish text is', guess_language(language_model_cfd, text2))
print('guess for german text is', guess_language(language_model_cfd, text3))

guess for english text is ('English', 0.3320493066255778)
guess for spanish text is ('Spanish', 0.26336694164375196)
guess for german text is ('German_Deutsch', 0.4481192334989355)


c) token bigrams

In [97]:
def build_language_models(languages, words):
    freqDist = nltk.ConditionalFreqDist((language, bigram) for language in languages for bigram in nltk.bigrams(words[language]))  
    return freqDist

In [98]:
language_model_cfd = build_language_models(languages, language_base)

# print the models for visual inspection (you always should have a look at the data :)
for language in languages:
    for key in list(language_model_cfd[language].keys())[:10]:
        print(language, key, "->", language_model_cfd[language].freq(key))

English ('Universal', 'Declaration') -> 0.0011235955056179776
English ('Declaration', 'of') -> 0.0011235955056179776
English ('of', 'Human') -> 0.0011235955056179776
English ('Human', 'Rights') -> 0.0011235955056179776
English ('Rights', 'Preamble') -> 0.0005617977528089888
English ('Preamble', 'Whereas') -> 0.0005617977528089888
English ('Whereas', 'recognition') -> 0.0005617977528089888
English ('recognition', 'of') -> 0.0005617977528089888
English ('of', 'the') -> 0.011235955056179775
English ('the', 'inherent') -> 0.0005617977528089888
German_Deutsch ('Die', 'Allgemeine') -> 0.0006578947368421052
German_Deutsch ('Allgemeine', 'Erklärung') -> 0.0013157894736842105
German_Deutsch ('Erklärung', 'der') -> 0.0013157894736842105
German_Deutsch ('der', 'Menschenrechte') -> 0.002631578947368421
German_Deutsch ('Menschenrechte', 'Resolution') -> 0.0006578947368421052
German_Deutsch ('Resolution', '217') -> 0.0006578947368421052
German_Deutsch ('217', 'A') -> 0.0006578947368421052
German_Deu

In [101]:
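def guess_language(language_model_cfd, text):
    probabilities = dict()
    for language in language_model_cfd.conditions():
        prob = 0
        # NOTE: nltk.bigrams(text) iterates over the characters of the string,
        # so it yields character pairs, not token pairs. Token bigrams would
        # require tokenizing first, e.g. nltk.bigrams(nltk.word_tokenize(text)).
        for bigram in nltk.bigrams(text):
            prob += language_model_cfd[language].freq(bigram)
        probabilities[language] = prob
    print(probabilities)
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]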

In [102]:
# guess the language by comparing the frequency distributions
print('guess for english text is', guess_language(language_model_cfd, text1))
print('guess for spanish text is', guess_language(language_model_cfd, text2))
print('guess for german text is', guess_language(language_model_cfd, text3))

{'English': 0.0, 'German_Deutsch': 0.0, 'Spanish': 0.0}
guess for english text is ('English', 0.0)
{'English': 0.0, 'German_Deutsch': 0.0, 'Spanish': 0.0}
guess for spanish text is ('English', 0.0)
{'English': 0.0, 'German_Deutsch': 0.0, 'Spanish': 0.0}
guess for german text is ('English', 0.0)


d) Discuss which approach should work best theoretically. Is this reflected in the results?

In theory, approaches that use bigrams should work better than those using single characters or tokens, because they take the context of the character or token into account. This is reflected in the results to a certain extent: the character-bigram approach was the only one to guess the language of all three texts correctly. The token-bigram approach, however, failed completely; the score for every language was 0 in all three test cases. One reason lies in the implementation: `guess_language` in 4.2 c) passes the raw string to `nltk.bigrams`, which therefore produces pairs of characters, and these practically never match the pairs of tokens stored in the model. Even with correct tokenization, though, we would expect very low scores, because we only used one (very specific) text to train our model. The three test texts were:

`
text1 = "Peter had been to the office before they arrived."
text2 = "Si terminas tu tarea, te dare un caramelo."
text3 = "Das ist ein schon recht langes deutsches Beispiel."
`

Just by looking at the texts, one can see that they have very little, if anything, in common with the Universal Declaration of Human Rights. It is therefore very unlikely that their token bigrams occur in our model. To fix this, we would need to train the model on more texts and make sure that they do not all come from one very specific domain. With such training data, the results of the token-bigram approach would probably be a lot more accurate.
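The sparsity argument can be illustrated with a small sketch. It is not the notebook's NLTK code: `collections.Counter` stands in for `nltk.ConditionalFreqDist`, `str.split` is a simplification of `nltk.word_tokenize`, and the two toy corpora and all function names are assumptions made for illustration. The sketch tokenizes the text before forming bigrams and lowercases both the model and the input.

```python
from collections import Counter

def bigrams(seq):
    """Adjacent pairs of a sequence (same idea as nltk.bigrams)."""
    return list(zip(seq, seq[1:]))

def token_bigram_model(tokens):
    """Relative frequency of each lowercased token pair."""
    counts = Counter(bigrams([t.lower() for t in tokens]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def score(model, text):
    # Tokenize first; str.split is a simplification of nltk.word_tokenize
    tokens = text.lower().split()
    return sum(model.get(pair, 0.0) for pair in bigrams(tokens))

# Toy corpora (assumption: stand-ins for udhr.words(...))
toy = {
    "English": "everyone has the right to life and the right to liberty".split(),
    "German_Deutsch": "jeder hat das recht auf leben und das recht auf freiheit".split(),
}
models = {lang: token_bigram_model(toks) for lang, toks in toy.items()}

in_domain = "the right to education"
out_of_domain = "Peter had been to the office"
for text in (in_domain, out_of_domain):
    print(text, {lang: score(m, text) for lang, m in models.items()})

# Passing the raw string instead yields character pairs, which do not
# occur as keys of this token-bigram model:
char_pairs = bigrams(list(in_domain))
print(sum(models["English"].get(p, 0.0) for p in char_pairs))
```

The in-domain phrase shares token pairs with the English toy corpus and gets a positive score, while the out-of-domain sentence scores 0 for both languages even though it is tokenized correctly.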

## Homework 4.3 
*(This homework is not part of the bonus system. However, we recommend you to work it out. It will save you some time in the future.)*

Copy all functions implemented in the tasks and homeworks to one file and name it `UKP_Lib.py`. You can then access, for example, the function `word_freq` from the previous tasks with the following statement:

`from UKP_Lib import word_freq`

You just implemented your first module. If you are familiar with another object-oriented language, feel free to use classes and OO in the exercises. Familiarize yourself with the syntax of OO constructs in Python, e.g. by consulting http://docs.python.org/tutorial/classes.html
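As one possible way to use classes in such a module, the character-based guesser from Homework 4.1 could be bundled as follows. This is only a sketch: the class `LanguageGuesser` and its methods are hypothetical names, and `collections.Counter` is used instead of NLTK so that the file runs on its own.

```python
# UKP_Lib.py (sketch; the class and method names are hypothetical)
from collections import Counter, defaultdict

class LanguageGuesser:
    """Character-frequency language guesser bundled as a class."""

    def __init__(self):
        # one character counter per language
        self.counts = defaultdict(Counter)

    def train(self, language, words):
        """Add the lowercased characters of all words to the language model."""
        for word in words:
            self.counts[language].update(word.lower())

    def freq(self, language, char):
        """Relative frequency of a character in the language model."""
        total = sum(self.counts[language].values())
        return self.counts[language][char] / total if total else 0.0

    def guess(self, text):
        """Return the language with the highest summed character frequency."""
        scores = {lang: sum(self.freq(lang, ch) for ch in text.lower())
                  for lang in self.counts}
        return max(scores, key=scores.get)

if __name__ == "__main__":
    guesser = LanguageGuesser()
    guesser.train("English", ["human", "rights", "freedom"])
    guesser.train("Spanish", ["derechos", "humanos", "libertad"])
    print(guesser.guess("los derechos"))
```

Saved as `UKP_Lib.py`, the class could then be imported with `from UKP_Lib import LanguageGuesser`, in the same way as the functions.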