### Fast language recognizer

Please download the texts from Absalon:

[textes](https://absalon.ku.dk/files/7715992/download?download_frd=1)

[texten](https://absalon.ku.dk/files/7715991/download?download_frd=1)

Save them in the same directory as this notebook.

#### Step 1: Read the data

In [3]:
# data I/O
# Don Quijote de la Mancha, by Miguel de Cervantes
with open('./textes.txt', 'r', encoding='utf-8') as f:  # plain-text file, assumed to be UTF-8
    data_es = f.read()
chars_es = list(set(data_es))
data_size_es, vocab_size_es = len(data_es), len(chars_es)

print ('data has %d characters, %d unique.' % (data_size_es, vocab_size_es))

data has 2117498 characters, 104 unique.


In [4]:
# data I/O
# Pride and Prejudice, by Jane Austen
with open('./texten.txt', 'r', encoding='utf-8') as f:  # plain-text file, assumed to be UTF-8
    data_en = f.read()
chars_en = list(set(data_en))
data_size_en, vocab_size_en = len(data_en), len(chars_en)

print ('data has %d characters, %d unique.' % (data_size_en, vocab_size_en))

data has 775741 characters, 92 unique.


#### What does the data look like?

In [5]:
print (data_es[1000:1050])

 pliego del dicho
libro a tres maravedís y medio; 


In [6]:
print (data_en[10000:10050])

eas,” he continued, “let us return
      to Mr. Bi


In [7]:
import nltk

#### Step 2: Get conditional frequencies

In [8]:
cfreq_bigrams_corpuses= nltk.ConditionalFreqDist(nltk.bigrams(data_es))
cfreq_bigrams_corpuses

<ConditionalFreqDist with 104 conditions>

In [9]:
cfreq_bigrams_corpuses['o']

FreqDist({' ': 47195, 's': 25541, 'n': 18164, 'r': 15705, ',': 10299, 'm': 5798, 't': 4903, 'd': 4081, 'l': 3472, 'c': 3043, ...})

#### Step 3: Get conditional probabilities from frequencies

In [10]:
cprob_bigrams_corpuses = nltk.ConditionalProbDist(cfreq_bigrams_corpuses, nltk.MLEProbDist)

In [11]:
cfreq_bigrams_corpusen = nltk.ConditionalFreqDist(nltk.bigrams(data_en))
cprob_bigrams_corpusen = nltk.ConditionalProbDist(cfreq_bigrams_corpusen, nltk.MLEProbDist)
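
To make Steps 2 and 3 more concrete, here is a minimal plain-Python sketch of what the two NLTK calls compute on character bigrams: first count how often each character follows another, then divide each count by the total for its first character (the maximum likelihood estimate). The helper names `bigram_counts` and `mle_probs` and the toy string are illustrative assumptions, not part of NLTK or of this notebook.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    """counts[first][second] = number of times `second` directly follows `first`."""
    counts = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        counts[first][second] += 1
    return counts

def mle_probs(counts):
    """Turn counts into P(second | first) = count(first, second) / count(first, *)."""
    probs = {}
    for first, followers in counts.items():
        total = sum(followers.values())
        probs[first] = {second: n / total for second, n in followers.items()}
    return probs

# Toy example (hypothetical data, not one of the two novels)
toy_counts = bigram_counts("abracadabra")
toy_probs = mle_probs(toy_counts)
print(dict(toy_counts['a']))  # {'b': 2, 'c': 1, 'd': 1}
print(toy_probs['a'])         # {'b': 0.5, 'c': 0.25, 'd': 0.25}
```

In the toy string, `a` is followed by `b` twice and by `c` and `d` once each, so the estimated probability of `b` after `a` is 2/4.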

In [12]:
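def prod(ite):
    # Product of a sequence of numbers (1 for an empty sequence).
    # Iterative, so long sentences do not hit Python's recursion limit;
    # multiplying right to left keeps the same evaluation order as a recursive definition.
    result = 1
    for x in reversed(ite):
        result = x * result
    return result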

#### Step 4: Write a function to get the probability of a sentence, given a language model

In [13]:

def return_proba(cpd, sentence):
    # Multiply P(second char | first char) over every character bigram in the sentence
    return prod([cpd[first].prob(second) for first, second in nltk.bigrams(sentence)])

return_proba(cprob_bigrams_corpusen,"I want coffee"),return_proba(cprob_bigrams_corpuses,"I want coffee")

(1.837190970465065e-12, 7.349855830592194e-23)
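
The probabilities above are already tiny for a 13-character sentence. For much longer inputs, a product of many values below 1 can underflow to exactly 0.0 in floating point, at which point the two languages can no longer be compared. A common workaround is to sum log-probabilities instead. The sketch below illustrates this with a hand-written toy model; `toy_model`, `proba` and `log_proba` are hypothetical names introduced here, and the model is a plain dict rather than an NLTK object.

```python
import math

# Hypothetical toy model: toy_model[first][second] = P(second | first)
toy_model = {'a': {'a': 0.5, 'b': 0.5}, 'b': {'a': 1.0}}

def proba(model, sentence):
    """Plain product of bigram probabilities (same idea as return_proba)."""
    result = 1.0
    for first, second in zip(sentence, sentence[1:]):
        result *= model.get(first, {}).get(second, 0.0)
    return result

def log_proba(model, sentence):
    """Sum of log-probabilities; -inf as soon as one bigram has probability 0."""
    total = 0.0
    for first, second in zip(sentence, sentence[1:]):
        p = model.get(first, {}).get(second, 0.0)
        if p == 0.0:
            return float('-inf')
        total += math.log(p)
    return total

long_text = 'a' * 2000
print(proba(toy_model, long_text))      # 0.0, because the product underflows
print(log_proba(toy_model, long_text))  # finite, about 1999 * log(0.5)
```

Because the logarithm is monotonic, comparing log-probabilities picks the same language as comparing the raw probabilities whenever the latter are still representable.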

#### Step 5: Write a function to get the language in which the text is more likely to be written (either Spanish or English)

In [14]:
def spanish_or_english (sent):
    if return_proba(cprob_bigrams_corpusen,sent) > return_proba(cprob_bigrams_corpuses,sent):
        return "english"
    else:
        return "spanish"

### Playground

In [15]:
spanish_or_english("My hometown is beautiful")

'english'

In [16]:
spanish_or_english("My name is Manex and my hometown is called Zumarraga")  # Yes! The model was fooled by this example!

'spanish'
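
One possible contributor to mistakes like this: with maximum likelihood estimates, a single character bigram that never occurs in a training text gives the whole sentence probability 0 under that model, and `spanish_or_english` answers "spanish" whenever the English score is not strictly larger. Whether that is what happens in this particular example has not been checked here. A common remedy is add-one (Laplace) smoothing combined with log-probabilities. The sketch below is one way to do that in plain Python; the function names, the `alpha` parameter and the two short sample strings are assumptions made for illustration. NLTK also provides `nltk.LaplaceProbDist`, which may serve as an alternative to `nltk.MLEProbDist` in Step 3.

```python
import math
from collections import Counter, defaultdict

def bigram_counts(text):
    """counts[first][second] = number of times `second` follows `first`."""
    counts = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        counts[first][second] += 1
    return counts

def smoothed_log_proba(counts, sentence, vocab_size, alpha=1.0):
    """Log-probability of `sentence` with add-alpha smoothing (alpha=1 is add-one)."""
    total = 0.0
    for first, second in zip(sentence, sentence[1:]):
        followers = counts.get(first, Counter())  # empty if `first` was never seen
        numerator = followers[second] + alpha
        denominator = sum(followers.values()) + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

def classify(sentence, counts_en, counts_es, vocab_size):
    """Return the language whose smoothed log-probability is higher."""
    score_en = smoothed_log_proba(counts_en, sentence, vocab_size)
    score_es = smoothed_log_proba(counts_es, sentence, vocab_size)
    return "english" if score_en > score_es else "spanish"

# Tiny samples standing in for the two novels (assumption: for illustration only)
sample_en = "it is a truth universally acknowledged that a single man"
sample_es = "en un lugar de la mancha de cuyo nombre no quiero acordarme"
vocab_size = len(set(sample_en) | set(sample_es))
counts_en, counts_es = bigram_counts(sample_en), bigram_counts(sample_es)
print(classify("the man", counts_en, counts_es, vocab_size))
```

With smoothing, every bigram receives a small non-zero probability, so an unusual name lowers a language's score instead of eliminating it outright.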

In [17]:
spanish_or_english("My name is Joe and my hometown is called Sacramento")

'english'

In [18]:
spanish_or_english("Me llamo William Smith y vivo en el barrio de Bel Air")

'spanish'