# Exercise 1: Language classification

Characters used in a language have a direct correlation with the language itself. Design a solution that will allow a user to provide a document and identify (classify) the language it was written in. The languages that this tool should identify are: Spanish, English, Italian, German, French, Portuguese, and Danish.

In [531]:
import wikipedia as wk
import pandas as pd
import numpy as np
import nltk
import unicodedata
import string
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

## Data Exploration and Acquisition
Let's begin by getting the text from the Wikipedia pages provided. We'll use this text for both training and testing our model, and we'll fetch it with the "wikipedia" package for Python. The functions in this package take a language code and a page name as inputs, both of which we can grab directly from the links provided.

In [532]:
languages_s = ['es','it','en','de','fr','pt','nl']
languages_l = ['Spanish','Italian','English','German','French','Portuguese','Danish']
page_names  = ['Divina_comedia', 
               'Divina_Commedia', 
               'Divine_Comedy',
               'Göttliche_Komödie', 
               'Divine_Comédie', 
               'Divina_Comédia',
               'De_goddelijke_komedie']
text_splits = ['== Traducciones ==', 
               '== Data di composizione ==',
               '== Earliest manuscripts ==', 
               '== Rezeptionsgeschichte ==', 
               'Sainte Vierge et finalement Dante',
               'Virgem Maria e esta concede sua visita', 
               '== Receptie ==']

Text splits are necessary because Wikipedia pages generally end with references and glossaries which, while useful, do not contain full sentences. Documents (sentences) don't always have to be complete, but I would like the data to be consistent for this exercise. Also note that not all the text splits occur at a section heading: Python (version 2, as used here) doesn't always handle unicode strings gracefully, so in some cases I split the data at an earlier point.

**Please note that when I refer to a document, I mean a single sentence.**

In [533]:
text_splits

['== Traducciones ==',
 '== Data di composizione ==',
 '== Earliest manuscripts ==',
 '== Rezeptionsgeschichte ==',
 'Sainte Vierge et finalement Dante',
 'Virgem Maria e esta concede sua visita',
 '== Receptie ==']

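Below are the native names of the languages we will be classifying, as reported by the Wikipedia package. Note that the `nl` code and the `De_goddelijke_komedie` page are Dutch (Nederlands), not Danish, so the class labeled 'Danish' in this notebook is actually trained and tested on Dutch text.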

In [534]:
for language in languages_s:
    print wk.languages()[language]

español
italiano
English
Deutsch
français
português
Nederlands


In [535]:
lang_dict = dict(zip(languages_l, [1,2,3,4,5,6,7]))
rev_dict = dict(zip([1,2,3,4,5,6,7], languages_l))
lang_dict

{'Danish': 7,
 'English': 3,
 'French': 5,
 'German': 4,
 'Italian': 2,
 'Portuguese': 6,
 'Spanish': 1}

Let's create a dictionary that holds the content of each Wikipedia page, with the language names as keys.

In [536]:
def get_wiki(language, page_name):
    wk.set_lang(language)
    data = wk.page(page_name)
    return data.content

In [537]:
content = {}
for row in zip(languages_s, languages_l, page_names, text_splits):
    content[row[1]] = (get_wiki(row[0], row[2])).split(row[3])[0]

To get the data in a format that will be acceptable for the model, we will have to transform it in the following ways:
* convert unicode to ascii
* split into sentences
* remove punctuation
* lowercase all text
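
The cleaning steps above can be sketched on a single string. This is only an illustration in Python 3 syntax (the notebook itself runs on Python 2), the helper name `clean_sentence` is hypothetical, and sentence splitting is left out because it needs the NLTK tokenizer models. The sketch also shows a side effect worth knowing: with NFKC normalization, accented letters are dropped entirely during the ASCII conversion, whereas NFKD would keep the base letter.

```python
import string
import unicodedata


def clean_sentence(text, form="NFKC"):
    """Convert to ASCII, strip punctuation, and lowercase (hypothetical helper)."""
    # Characters with no ASCII equivalent are silently dropped by errors="ignore"
    ascii_text = unicodedata.normalize(form, text).encode("ascii", "ignore").decode("ascii")
    # Remove every character listed in string.punctuation
    no_punct = ascii_text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


print(clean_sentence("Libro más famoso de su autor."))               # NFKC: 'á' is dropped
print(clean_sentence("Libro más famoso de su autor.", form="NFKD"))  # NFKD: 'á' becomes 'a'
```

Whether losing the accented letters matters for classification is an open question here; it simply explains tokens such as "ms" in the cleaned Spanish text.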

In [538]:
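def transform_df(languages, content, int_conv):
    """Build a DataFrame of cleaned sentences ('Text') and integer labels ('Label')."""
    dfs = []
    idx = 0
    for l in languages:
        # Drop every non-ASCII character (accented letters are removed, not replaced)
        sent = unicodedata.normalize('NFKC', content[l]).encode('ascii', 'ignore')
        # Split into sentences with the language-specific NLTK tokenizer
        sent = nltk.sent_tokenize(sent, language=l)
        # Strip punctuation and lowercase each sentence
        sent = [(x.translate(string.maketrans("",""), string.punctuation)).lower() for x in sent]
        rge = np.arange(idx, idx+len(sent), 1)
        text = pd.Series(data=sent, name='Text', index=rge)
        label = pd.Series(data=[int_conv[l] for x in sent], name='Label', index=rge)
        dfs.append(pd.concat([text, label], axis=1))
        idx = idx + len(sent)  # keep the row index contiguous across languages
    return pd.concat(dfs, axis=0)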

In [539]:
result = transform_df(languages_l, content, lang_dict)
result.head()

| | Text | Label |
|---|---|---|
| 0 | la divina comedia en italiano divina commedia | 1 |
| 1 | es un poema escrito por dante alighieri | 1 |
| 2 | se desconoce la fecha exacta en que fue escrit... | 1 |
| 3 | libro ms famoso de su autor es una de las obra... | 1 |
| 4 | es considerada la obra maestra de la literatur... | 1 |


We now have a dataframe containing all the text from the various Wikipedia pages, split into sentences and labeled by language.

In [540]:
result['Label'].value_counts()

7    361
5    277
4    218
2    113
1    106
6     88
3     68
Name: Label, dtype: int64

The sentences are clearly not evenly distributed across the languages. Since individuals contribute to Wikipedia articles and nothing requires the text to be consistent between languages, the same page can vary widely in length from one language to another. The Multinomial Naive Bayes classifier in sklearn can partly compensate for this: setting `fit_prior=False` makes it use a uniform class prior instead of one learned from the label counts.
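
To see how much the imbalance would matter, the sketch below compares the class prior implied by the sentence counts above with a uniform prior. It is plain Python 3 arithmetic on the counts printed in the previous cell, not a call into sklearn, and the variable names are made up for illustration.

```python
# Sentence counts per label, copied from the value_counts() output above
label_counts = {7: 361, 5: 277, 4: 218, 2: 113, 1: 106, 6: 88, 3: 68}

total = sum(label_counts.values())

# Prior a classifier would learn from the data if class frequencies were used
empirical_prior = {label: n / total for label, n in label_counts.items()}

# Prior used when every class is treated as equally likely
uniform_prior = {label: 1 / len(label_counts) for label in label_counts}

for label in sorted(label_counts):
    print(label, round(empirical_prior[label], 3), round(uniform_prior[label], 3))
```

With the empirical prior, the largest class would start with more than five times the weight of the smallest one before any word is looked at, which is the bias a uniform prior removes.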

## Model Creation, Training, and Testing
One of the most popular text classifiers is the Multinomial Naive Bayes classifier. We can try a random forest classifier as well for comparison.
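
To make the idea concrete, here is a toy multinomial Naive Bayes written from scratch in Python 3. It is a sketch of the general technique, not sklearn's implementation: the function names and training sentences are made up, it assumes a uniform class prior, and it uses additive smoothing with `alpha=0.1`. Words that never appeared in training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=0.1):
    """Return per-class log-probabilities of each vocabulary word (toy sketch)."""
    vocab = {word for doc in docs for word in doc.split()}
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(doc.split())
    model = {}
    for label, counts in class_counts.items():
        # Additive smoothing: every vocabulary word gets alpha pseudo-counts
        denom = sum(counts.values()) + alpha * len(vocab)
        model[label] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return model


def predict_mnb(model, text):
    """Pick the class with the highest summed log-probability (uniform prior)."""
    words = text.split()
    scores = {
        label: sum(logp[w] for w in words if w in logp)
        for label, logp in model.items()
    }
    return max(scores, key=scores.get)


# Made-up training sentences, one per language
train_docs = [
    "la divina comedia es un poema",
    "the divine comedy is a poem",
    "die gottliche komodie ist ein gedicht",
]
train_labels = ["Spanish", "English", "German"]

toy_model = train_mnb(train_docs, train_labels)
print(predict_mnb(toy_model, "es un poema famoso"))
print(predict_mnb(toy_model, "the poem"))
```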

Since we can't pass a dataframe of text directly to a classifier, let's convert our cleaned data into vectors, using two methods:
* Count Vector: for each document (sentence), count the number of times each word appears (term frequency)
* TFIDF Vector: term frequency * inverse document frequency

See link for more details on TFIDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
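
The two representations can be worked through by hand on a tiny made-up corpus. The Python 3 sketch below uses the textbook definition, tf * log(N / df); sklearn's `TfidfVectorizer` additionally smooths the idf term and normalizes each row by default, so its numbers will differ from these. All names and sentences here are illustrative.

```python
import math
from collections import Counter

# Three made-up "documents" (sentences)
docs = ["la divina comedia", "the divine comedy", "la divina commedia"]
tokenized = [doc.split() for doc in docs]

# Count vectors: term frequency per document
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears
n_docs = len(docs)
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(doc_counts):
    """Textbook TF-IDF weights for one document."""
    return {term: tf * math.log(n_docs / df[term]) for term, tf in doc_counts.items()}


weights = [tfidf(c) for c in counts]
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that occur in only one document ("comedia") get a higher weight than terms shared across documents ("la", "divina"), which is the effect TFIDF is meant to have.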

In [541]:
count_vectorizer = CountVectorizer(max_df=0.95)
tfidf_vectorizer = TfidfVectorizer(max_df=0.95)

In [542]:
tfidf = tfidf_vectorizer.fit_transform(result['Text'].values)
count = count_vectorizer.fit_transform(result['Text'].values)
mnb = MultinomialNB(alpha=0.1, fit_prior=False)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [543]:
def model_test(data, model, t_size, data_type, model_type):
    X_train, X_test, y_train, y_test = train_test_split(data, \
        result['Label'].values, test_size = t_size, random_state=8)
    model.fit(X_train, y_train)
    cv_score = np.mean(cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy'))
    test_score = model.score(X_test, y_test)
    print "The cross-validation score for the %s model, using %s data is %s" \
        % (model_type, data_type, cv_score)
    print "The test set score for the %s model, using %s data is %s" \
        % (model_type, data_type, test_score)

In [544]:
model_test(tfidf, mnb, .25, "TFIDF", "Multinomial NB")

The cross-validation score for the Multinomial NB model, using TFIDF data is 0.989221534899
The test set score for the Multinomial NB model, using TFIDF data is 0.977272727273


In [545]:
model_test(count, mnb, .25, "Count Vectorizer", "Multinomial NB")

The cross-validation score for the Multinomial NB model, using Count Vectorizer data is 0.988134578378
The test set score for the Multinomial NB model, using Count Vectorizer data is 0.980519480519


In [546]:
model_test(tfidf, rf, .25, "TFIDF", "Random Forest")

The cross-validation score for the Random Forest model, using TFIDF data is 0.946887706354
The test set score for the Random Forest model, using TFIDF data is 0.957792207792


In [547]:
model_test(count, rf, .25, "Count Vectorizer", "Random Forest")

The cross-validation score for the Random Forest model, using Count Vectorizer data is 0.943626008017
The test set score for the Random Forest model, using Count Vectorizer data is 0.941558441558


The Multinomial Naive Bayes (MNB) model performed best with either feature set: TFIDF gave the higher cross-validation score (0.989 vs. 0.988) and count vectors gave the higher test score (0.981 vs. 0.977), so either one could be used for further predictions on other data. The Random Forest models didn't perform poorly either (0.94 to 0.96); they just can't match the near-perfect performance of MNB. Since the test data used for the above scores came from the same source as the training data, the Divine Comedy articles, it would be interesting to see how the best model performs on data from a different source.

Let's create a final model using all of the data (training and test splits combined), which I can then use on new text strings to predict their language.

In [548]:
final_model = mnb.fit(tfidf, result['Label'].values)

In [549]:
final_model.classes_

array([1, 2, 3, 4, 5, 6, 7])

## Testing on Outside Data

The model above expects a sequence of documents as input for prediction, so let's gather one text sample per language from other Wikipedia pages. All of the strings below were copied directly from Wikipedia. **None of them come from the Divine Comedy page.**

In [550]:
txt1 = "A peer review of 42 science articles found in both Encyclopædia Britannica and Wikipedia was published in Nature in 2005, and found that Wikipedia's level of accuracy approaches Encyclopedia Britannica's"
lang1 = "English"
txt2 = "A pesar de que el bajo imperio se extendía por las tierras de la periferia del Mediterráneo, en la historia militar de Roma las batallas navales fueron, por lo general, menos significativas que las batallas terrestres, debido a su dominio casi incuestionable del mar tras las feroces luchas navales de la Primera Guerra Púnica."
lang2 = "Spanish"
txt3 = "Dal 1951 al 1956 con la compagine biancorossa vinse numerosi trofei e disputò la finale di Coppa dei Campioni contro il Real Madrid nell'edizione 1955-1956; qualche settimana dopo si trasferì proprio alla squadra spagnola. Giocò tre stagioni a Madrid in cui vinse tre volte la Coppa dei Campioni, due volte il campionato nazionale e una Coppa Latina."
lang3 = "Italian"
txt4 = "Son règne est marqué par une double fidélité : à l'Empire, dont il tire sa légitimité en tant que vicaire impérial ; au parti gibelin, dont il devient le chef incontesté dans le nord de l'Italie."
lang4 = "French"
txt5 = "Den ihr allgemein gegebenen Namen der „großen Gräfin“ verdankt sie ebenso ihrer Macht wie ihren glänzenden Geistesgaben und ihrer hohen Bildung."
lang5 = "German"
txt6 = "Ao final da Guerra de Sucessão Espanhola com o Tratado de Utrecht em 1713, o duque de Savoia readquiriu suas possessões originais e recebeu o título de Rei da Sicília."
lang6 = "Portuguese"
txt7 = "Sinds de nieuwe tijd, toen de Europeanen met de verkenning en onderwerping van de rest van de wereld begonnen werd de westerse cultuur de dominante cultuur van de wereld."
lang7 = "Danish"

In [551]:
test_txt = [txt1, txt2, txt3, txt4, txt5, txt6, txt7]
test_lang = [lang1, lang2, lang3, lang4, lang5, lang6, lang7]
test_txt_cln = []
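for txt in test_txt:
    # Decode the UTF-8 byte string, then apply the same cleaning as the training data
    source = unicode(txt, 'utf-8')
    sent = unicodedata.normalize('NFKC', source).encode('ascii', 'ignore')
    sent = sent.translate(string.maketrans("",""), string.punctuation).lower()
    # NOTE: the raw `txt` is appended here, not the cleaned `sent`, so the
    # cleaning above does not reach the vectorizer
    test_txt_cln.append(txt)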

In [552]:
test_txt_cln = np.array(test_txt_cln)

Let's use our existing model and create predictions for the new data.

In [553]:
X_test = tfidf_vectorizer.transform(test_txt_cln)
pred = final_model.predict(X_test)

In [554]:
pred_l = []
for val in pred:
    pred_l.append(rev_dict[val])

for val in zip(pred_l, test_lang):
    print "Prediction: %s" % val[0]
    print "Actual: %s" % val[1]
    print "_____"

Prediction: English
Actual: English
_____
Prediction: Spanish
Actual: Spanish
_____
Prediction: Italian
Actual: Italian
_____
Prediction: French
Actual: French
_____
Prediction: German
Actual: German
_____
Prediction: Portuguese
Actual: Portuguese
_____
Prediction: Danish
Actual: Danish
_____


Our model classified all 7 samples from other Wikipedia pages correctly, although 7 samples are far too few to draw firm conclusions.

## Conclusions and Final Thoughts
I was able to create a model that classifies text strings into one of seven languages with roughly 98% accuracy on held-out sentences. Languages share far fewer words and spellings with each other than they have unique ones. For this reason, a classifier built on word counts was able to achieve a high degree of classification success from this training data alone.

As stated earlier, the Multinomial Naive Bayes model had the highest degree of success. Using count vectors or TFIDF did not lead to much difference in classification accuracy.

I used 5-fold cross-validation on the training data to check for overfitting. The models did not appear to overfit, as the test scores were very similar to the cross-validation scores.
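
For readers unfamiliar with the procedure, the sketch below shows how k-fold cross-validation partitions sample indices: each fold serves once as the validation set while the remaining folds are used for training. This is a simplified, unshuffled and unstratified version written in Python 3 for illustration; the helper name `k_fold_indices` is hypothetical, and sklearn's own splitters do more (for classifiers they stratify by label).

```python
def k_fold_indices(n_samples, k=5):
    """Return a list of (train_indices, validation_indices) pairs, one per fold."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, validation))
        start = stop
    return folds


for train, validation in k_fold_indices(11, k=5):
    print("train:", train, "validation:", validation)
```

The reported cross-validation score is then the mean of the per-fold accuracies.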

The model above was also able to generalize pretty well on unseen data (granted I only used 7 data points for this portion of the exercise).

If I had all the time in the world, I would do the following to improve my solution:
* Train the model on several articles. I only had about 1,200 documents (sentences), which is far fewer than what was used in some papers I read. Having more articles would increase the overall vocabulary of each language significantly.
* Consider using a stop-word list for each language. I didn't use stop-words because I didn't find a list for every language. Removing stop words could be useful since some short words look the same in multiple languages. This also may not have helped but it would have been interesting to check.
* Create a one-off language classification check. The sklearn vectorizer expects a sequence of documents rather than a bare string, so it would have been nice to create a function that takes any single string and returns a prediction.
* Create a large test set from other sources and test the model again.
* I had issues with unicode to ascii conversion when "Testing on Outside Data". I would research this issue more.
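
The one-off check mentioned in the list above could look roughly like the sketch below (Python 3 syntax). It assumes an already fitted vectorizer with a `transform` method, a fitted classifier with a `predict` method, and a mapping from integer labels to language names, such as `tfidf_vectorizer`, `final_model`, and `rev_dict` from this notebook. The function name `predict_language` is hypothetical, and the sketch skips the text cleaning step.

```python
def predict_language(text, vectorizer, model, label_names):
    """Classify a single string by wrapping it in a one-element list."""
    # The vectorizer expects an iterable of documents, so pass a list of one
    features = vectorizer.transform([text])
    # predict() returns one label per row; take the only one
    label = model.predict(features)[0]
    return label_names[label]


# Hypothetical usage with the objects built in this notebook:
# predict_language("Son règne est marqué par une double fidélité",
#                  tfidf_vectorizer, final_model, rev_dict)
```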

## Sources
* https://github.com/goldsmith/Wikipedia - Wikipedia API wrapper for Python
* https://en.wikipedia.org/wiki/Tf%E2%80%93idf - Explanation of TFIDF
* http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html - Scikit Learn Multinomial Naive Bayes implementation
* http://corporavm.uni-koeln.de/vardial/sharedtask.html - DSL Shared Task problem set
* O'Reilly, Natural Language Processing with Python, 2009
* As always, several stack overflow pages were also of use when I got stuck