# Detecting Text Language by Counting Stop Words

Based on [Detecting Text Language With Python and NLTK](https://web.archive.org/web/20240613234104/https://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/) by Alejandro Nolla.

In [1]:
text = "Yo, man! It's time to learn NLP. I'm for real, dawg!"

## 1. Tokenization

This time, we will use `wordpunct_tokenize`, a regexp-based tokenizer that splits text on whitespace and punctuation (except for underscores):

In [2]:
try:
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    print('[!] You need to install NLTK (https://www.nltk.org/).')

In [3]:
test_tokens = wordpunct_tokenize(text)
test_tokens

['Yo',
 ',',
 'man',
 '!',
 'It',
 "'",
 's',
 'time',
 'to',
 'learn',
 'NLP',
 '.',
 'I',
 "'",
 'm',
 'for',
 'real',
 ',',
 'dawg',
 '!']
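
To see why underscores are the exception, the same splitting can be approximated with the standard `re` module. This is only a sketch: it assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of characters that are neither word characters nor whitespace), and `simple_wordpunct` is a hypothetical helper name, not part of NLTK.

```python
import re

# Assumed pattern: a run of word characters (letters, digits, underscore),
# or a run of characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Split text into alphanumeric runs and punctuation runs."""
    return WORDPUNCT_PATTERN.findall(text)

print(simple_wordpunct("It's snake_case, dawg!"))
# ['It', "'", 's', 'snake_case', ',', 'dawg', '!']
```

Because `_` counts as a word character in regular expressions, `snake_case` stays in one piece, while the apostrophe splits `It's` into three tokens.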

## 2. Exploring NLTK’s stop words corpus

**Stop words** are very common words that are filtered out before processing because they are mostly grammatical rather than semantic in nature, e.g., “the”, “is”, or “to”. Some search engines also remove common lexical words like “want” from queries.

NLTK comes with a corpus of stop words in various languages called `stopwords`. We can learn more about it using the `readme()` method. It returns the raw text as a string, so, although it’s not mandatory, we pass it to `print()` so that newline characters are rendered as line breaks instead of being displayed as `\n` escape sequences:

In [4]:
from nltk.corpus import stopwords

print(stopwords.readme()[:80])

Stopwords Corpus

This corpus contains lists of stop words for several languages
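
The difference between echoing a string and printing it can be checked without NLTK. In this small sketch, `raw_text` is a stand-in string copied from the output above, not the full readme.

```python
# Stand-in for the start of the readme text (copied from the output above).
raw_text = "Stopwords Corpus\n\nThis corpus contains lists of stop words"

# repr() is what a notebook shows when a string is the last expression
# in a cell: newlines appear as the two-character sequence \n.
echoed = repr(raw_text)
print(echoed)

# print() writes the string itself, so each newline becomes a line break.
print(raw_text)
```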


Most corpora consist of a set of files, each containing a piece of text. A list of identifiers for these files can be accessed via `fileids()`.

In [5]:
stopwords.fileids()[:10]

['albanian',
 'arabic',
 'azerbaijani',
 'basque',
 'belarusian',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch']

In [6]:
len(stopwords.fileids())

33

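Corpus readers provide different methods for reading data from a corpus. For example, `raw()` returns the contents of a file as a single string, while `words()` returns them as a list of words.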

In [7]:
print(stopwords.raw('english')[:60])

a
about
above
after
again
against
ain
all
am
an
and
any
are



In [8]:
stopwords.words('english')[:13]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are']

Other corpus readers also provide `sents()` and `paras()`, which return sentences and paragraphs, respectively. In our particular case, however, calling them would cause an error: the stop words corpus reader is of type `WordListCorpusReader`, which reads plain word lists, so there are no sentences or paragraphs.

Finally, corpus readers allow us to read the text of multiple files together. For example, we can count the total number of Norwegian and Swedish stop words.

In [9]:
len(stopwords.words(['norwegian', 'swedish']))

290
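
Note that this total counts a word twice if it appears in both lists, because the lists are simply concatenated. The sketch below illustrates the difference with two tiny sample lists; they are illustrative only and are not the real NLTK data.

```python
# Illustrative miniature word lists (NOT the actual NLTK stop word files).
norwegian_sample = ['og', 'i', 'det', 'på']
swedish_sample = ['och', 'i', 'det', 'på']

# Concatenating the lists keeps shared words twice.
combined = norwegian_sample + swedish_sample
total_count = len(combined)

# Converting to a set keeps each distinct word once.
unique_count = len(set(combined))

print(total_count, unique_count)  # 8 5
```

With the real corpus, wrapping the result of `stopwords.words([...])` in `set()` before calling `len()` should give the number of distinct words in the same way.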

## 3. Classification

Even though NLTK’s stop words corpus is intended to help us filter out stop words in the language we are working with, it is essentially a list of words in more than 30 languages. Therefore, we can use it to identify languages by checking the presence of these words. Of course, this is just an exercise and is not a serious method for classifying text by language.

We will loop through the stop word lists of all languages and count how many distinct stop words from each language our text contains. The text is then classified as the language with the highest count.

In [10]:
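# Despite the name, this dictionary holds raw counts, not ratios.
language_ratios = {}

# Lowercase the tokens so that, e.g., 'It' matches the stop word 'it'.
test_words = [word.lower() for word in test_tokens]
test_words_set = set(test_words)

for language in stopwords.fileids():
    stopwords_set = set(stopwords.words(language))
    # Distinct words from the text that are stop words in this language.
    common_elements = test_words_set.intersection(stopwords_set)
    language_ratios[language] = len(common_elements)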

print('TEST TEXT LANGUAGE "SCORES"')
print('Albanian:', language_ratios['albanian'])
print('Bengali:', language_ratios['bengali'])
print('English:', language_ratios['english'])

TEST TEXT LANGUAGE "SCORES"
Albanian: 3
Bengali: 0
English: 6


We then use Python’s `max()` function to find the language with the highest “score”. Iterating over a dictionary yields its keys, so we pass `language_ratios.get` as the `key` argument to make `max()` compare the languages by their scores rather than by their names.

In [11]:
most_rated_language = max(language_ratios, key=language_ratios.get)
most_rated_language

'english'
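
The behavior of `max()` with a `key` function can be tried in isolation. This sketch reuses the three scores printed above; the `ranking` list is an optional extra for inspecting the runners-up.

```python
# Scores taken from the output of the previous cell.
scores = {'albanian': 3, 'bengali': 0, 'english': 6}

# max() iterates over the keys and compares them by scores.get(key).
best = max(scores, key=scores.get)

# Sorting the (language, score) pairs shows the full ranking.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(best)     # english
print(ranking)  # [('english', 6), ('albanian', 3), ('bengali', 0)]
```

If two languages share the highest score, `max()` returns the one it encounters first, so ties are resolved by the order of the dictionary.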

We can also check which English stop words were found.

In [12]:
test_words_set.intersection(set(stopwords.words(most_rated_language)))

{'for', 'i', 'it', 'm', 's', 'to'}
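
The steps in this section can be collected into one function. The following is a minimal sketch, not the original author’s code: `detect_language` is a hypothetical helper, it tokenizes with a plain `\w+` regular expression instead of `wordpunct_tokenize` (punctuation tokens never match stop words, so the counts are unaffected), and the two short word lists are illustrative samples rather than the NLTK data.

```python
import re

def detect_language(text, stopword_lists):
    """Return (best_language, scores) for a mapping of language -> stop words."""
    # Lowercased distinct words of the text.
    words = set(re.findall(r"\w+", text.lower()))
    # Number of distinct stop words from each language found in the text.
    scores = {
        language: len(words.intersection(stop_words))
        for language, stop_words in stopword_lists.items()
    }
    return max(scores, key=scores.get), scores

# Illustrative sample lists (NOT the real NLTK stop word files).
sample_lists = {
    'english': ['i', 'it', 'to', 'for', 'the'],
    'spanish': ['yo', 'el', 'la', 'para', 'de'],
}

text = "Yo, man! It's time to learn NLP. I'm for real, dawg!"
language, scores = detect_language(text, sample_lists)
print(language, scores)  # english {'english': 4, 'spanish': 1}
```

With NLTK installed, the mapping could instead be built from the corpus, for example `{lang: stopwords.words(lang) for lang in stopwords.fileids()}`.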