# Detecting Text Language by Counting Stop Words
This is based on Alejandro Nolla's <a href="http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/" target="_blank">Detecting Text Language with Python and NLTK</a>.

As I mentioned in my <a href="https://medium.com/nwamaka-imasogie/stand-up-comedy-and-nlp-c7d64002520c" target="_blank">Stand-Up Comedy and NLP Python publication</a>, **stop words** are common words, such as 'a' and 'the', that add little meaning to a text. We often filter them out before doing any kind of processing because they are mostly grammatical (not semantic) in nature.

We can use stop words to detect language. Search engines, for example, can restrict results to a language we choose, such as English or Spanish. To do that, each indexed text is analyzed ahead of time to "guess" its language, and that guess is stored alongside the text.

# 1. Tokenizing

When tokenizing, we have to decide whether we want to break the text down into "words" or "tokens". One thing to think about is whether we should keep contractions together or not (for example, _they're_ versus _they are_). In this case I'm going to **split all punctuation into separate tokens**.

In [2]:
text = "Yo man, it's time for you to shut yo' mouth! I ain't even messin' dawg."

In [3]:
from nltk.tokenize import wordpunct_tokenize # regex-based tokenizer which splits text on whitespace and punctuation (except for underscore)

tokens = wordpunct_tokenize(text)
tokens

['Yo',
 'man',
 ',',
 'it',
 "'",
 's',
 'time',
 'for',
 'you',
 'to',
 'shut',
 'yo',
 "'",
 'mouth',
 '!',
 'I',
 'ain',
 "'",
 't',
 'even',
 'messin',
 "'",
 'dawg',
 '.']
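
To make the splitting rule concrete, here is a minimal sketch that reproduces the same tokens with the standard `re` module. It assumes the tokenizer's rule is "runs of word characters, or runs of characters that are neither word characters nor whitespace"; the pattern below is my reading of that rule, not something taken from the notebook.

```python
import re

text = "Yo man, it's time for you to shut yo' mouth! I ain't even messin' dawg."

# Assumed pattern: either a run of word characters (letters, digits, underscore)
# or a run of characters that are neither word characters nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(len(regex_tokens))
print(regex_tokens[:6])
```

Because the apostrophe is not a word character, contractions such as _it's_ and _ain't_ are split into three tokens each, which is why fragments like `'s'` and `'t'` appear in the output above.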

# 2. Exploring NLTK's stop words corpus
Now I have tokens to match against a list of stop words. NLTK is handy here because it ships with a corpus of stop words for various languages.

In [4]:
from nltk.corpus import stopwords
stopwords.readme().replace('\n', ' ') # The contents of the README file of the corpus

'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  '

In [5]:
stopwords.fileids() # Most corpora consist of a set of files. fileids() is a list of identifiers for these files 

['arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'kazakh',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish',
 'turkish']

**Corpus readers have a variety of ways to read data from a corpus like .words(), .raw(), and .sents()**

In [6]:
stopwords.raw('danish').replace('\n', ' ')

'og i jeg det at en den til er som på de med han af for ikke der var mig sig men et har om vi min havde ham hun nu over da fra du ud sin dem os op man hans hvor eller hvad skal selv her alle vil blev kunne ind når være dog noget ville jo deres efter ned skulle denne end dette mit også under have dig anden hende mine alt meget sit sine vor mod disse hvis din nogle hos blive mange ad bliver hendes været thi jer sådan '

In [7]:
stopwords.words('english')[:15]

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him']

In [8]:
stopwords.words('russian')

['и',
 'в',
 'во',
 'не',
 'что',
 'он',
 'на',
 'я',
 'с',
 'со',
 'как',
 'а',
 'то',
 'все',
 'она',
 'так',
 'его',
 'но',
 'да',
 'ты',
 'к',
 'у',
 'же',
 'вы',
 'за',
 'бы',
 'по',
 'только',
 'ее',
 'мне',
 'было',
 'вот',
 'от',
 'меня',
 'еще',
 'нет',
 'о',
 'из',
 'ему',
 'теперь',
 'когда',
 'даже',
 'ну',
 'вдруг',
 'ли',
 'если',
 'уже',
 'или',
 'ни',
 'быть',
 'был',
 'него',
 'до',
 'вас',
 'нибудь',
 'опять',
 'уж',
 'вам',
 'ведь',
 'там',
 'потом',
 'себя',
 'ничего',
 'ей',
 'может',
 'они',
 'тут',
 'где',
 'есть',
 'надо',
 'ней',
 'для',
 'мы',
 'тебя',
 'их',
 'чем',
 'была',
 'сам',
 'чтоб',
 'без',
 'будто',
 'чего',
 'раз',
 'тоже',
 'себе',
 'под',
 'будет',
 'ж',
 'тогда',
 'кто',
 'этот',
 'того',
 'потому',
 'этого',
 'какой',
 'совсем',
 'ним',
 'здесь',
 'этом',
 'один',
 'почти',
 'мой',
 'тем',
 'чтобы',
 'нее',
 'сейчас',
 'были',
 'куда',
 'зачем',
 'всех',
 'никогда',
 'можно',
 'при',
 'наконец',
 'два',
 'об',
 'другой',
 'хоть',
 'после',
 'на

In [9]:
len(stopwords.words(['english', 'spanish'])) # There are 466 total English and Spanish stop words

466

# 3. Classification

Now I will score each language by how many of its stop words appear in the text. Start by looping through the stop word lists for all of the languages, then count how many distinct stop words from each list our tokenized text contains. The score is a raw count, not a true probability.

The text is finally classified as the language whose stop word list it overlaps with the most.

In [18]:
language_ratios = {}

words = [word.lower() for word in tokens] # lowercase all tokens

# Advantage of sets over lists: optimized for checking whether a specific element is contained in the set
words_set = set(words)

# For each language, count the unique stop words that also appear in the text
for language in stopwords.fileids():
    stopwords_set = set(stopwords.words(language))
    overlapping_words = words_set.intersection(stopwords_set)
    language_ratios[language] = len(overlapping_words) # language score (a raw count, despite the name)

language_ratios

{'arabic': 0,
 'danish': 3,
 'dutch': 0,
 'english': 8,
 'finnish': 0,
 'french': 2,
 'german': 1,
 'hungarian': 1,
 'italian': 1,
 'kazakh': 0,
 'norwegian': 3,
 'portuguese': 1,
 'romanian': 2,
 'russian': 0,
 'spanish': 1,
 'swedish': 2,
 'turkish': 0}

In [19]:
# key=language_ratios.get makes max() compare the languages by their score instead of by their name
highest_scoring_language = max(language_ratios, key=language_ratios.get)
highest_scoring_language

'english'
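
`max()` only returns a single winner, so it hides how close the runners-up were and silently picks the first language it sees when there is a tie. As a rough sketch, the scores can be ranked with `sorted()` instead. The dictionary below is copied from the output printed above; the names `ranking` and `winners` are my own.

```python
# Scores copied from the language_ratios output above
language_ratios = {
    'arabic': 0, 'danish': 3, 'dutch': 0, 'english': 8, 'finnish': 0,
    'french': 2, 'german': 1, 'hungarian': 1, 'italian': 1, 'kazakh': 0,
    'norwegian': 3, 'portuguese': 1, 'romanian': 2, 'russian': 0,
    'spanish': 1, 'swedish': 2, 'turkish': 0,
}

# Sort (language, score) pairs from highest to lowest score.
# sorted() is stable, so tied languages keep their original order.
ranking = sorted(language_ratios.items(), key=lambda item: item[1], reverse=True)

for language, score in ranking[:3]:
    print(f"{language}: {score}")

# Collect every language that shares the top score, to make ties visible
best_score = ranking[0][1]
winners = [language for language, score in ranking if score == best_score]
print(winners)
```

Here English wins clearly with 8 matches, while Danish and Norwegian tie for second place with 3 each.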

In [23]:
# We can even see which English stop words were found
words_set.intersection(set(stopwords.words(highest_scoring_language)))

{'ain', 'for', 'i', 'it', 's', 't', 'to', 'you'}
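
The steps above can be wrapped into one reusable function. This is only a sketch under a few assumptions: `detect_language` and `sample_lists` are hypothetical names, the regular expression stands in for the tokenizer so the block runs without any downloads, and the tiny stop word lists are hand-picked for illustration rather than taken from NLTK.

```python
import re

def detect_language(text, stopword_lists):
    """Return (best_language, scores) for a text.

    stopword_lists maps a language name to an iterable of stop words.
    best_language is None when no stop word from any list is found.
    """
    # Lowercase, then split into word runs and punctuation runs
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    words_set = set(tokens)

    # Score = number of distinct stop words of that language found in the text
    scores = {
        language: len(words_set & set(words))
        for language, words in stopword_lists.items()
    }

    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores
    return best, scores

# Illustrative lists only. With NLTK installed, the full lists could be built with:
# {lang: stopwords.words(lang) for lang in stopwords.fileids()}
sample_lists = {
    "english": ["the", "is", "and", "to", "you", "for", "it"],
    "spanish": ["el", "la", "es", "y", "de", "que", "para"],
}

language, scores = detect_language("¿Qué es lo que quieres de la vida?", sample_lists)
print(language, scores)
```

Returning `None` when every score is zero avoids labelling a text that contains no known stop words, such as a string of digits, with whichever language happens to come first.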