# Stemming

Let's see an example of stemming text using [NLTK](https://www.nltk.org/) (Natural Language Toolkit). We will use its `SnowballStemmer` implementation. The source code is available online, so if you are curious about how stemming is done in different languages, you can look at the [`nltk.stem` documentation](https://www.nltk.org/api/nltk.stem.html).

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize 

In [None]:
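# word_tokenize relies on the Punkt tokenizer data, which must be downloaded once.
# Assumption: recent NLTK releases may ask for 'punkt_tab' instead; if a LookupError
# names another resource, download the one it mentions.
nltk.download('punkt')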

## Tokenization

In [None]:
text = "At first, historical linguistics served as the cornerstone of comparative linguistics primarily as a tool for linguistic reconstruction.[5] Scholars were concerned chiefly with establishing language families and reconstructing prehistoric proto-languages, using the comparative method and internal reconstruction."
text

Let's start with word tokenization. Notice how it splits the text into words and symbols.

In [None]:
" ".join(word_tokenize(text))

Rule-based tokenizers have to make implementation choices: for example, keeping hyphenated words whole (`proto-languages`) or splitting symbols into separate tokens (`[ 5 ]`). Different tokenizers will produce different results.

NLTK provides several [word and sentence tokenizers](https://www.nltk.org/api/nltk.tokenize.html).
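To make these differences concrete, here is a small sketch comparing three tokenization strategies on a short phrase adapted from the text above. The `sample` string and variable names are our own illustration; `TreebankWordTokenizer` and `WordPunctTokenizer` are NLTK classes that should work without downloading extra data.

```python
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer

# Illustrative phrase (not part of the original notebook).
sample = "prehistoric proto-languages [5]"

# 1. Plain whitespace splitting: brackets stay glued to the number.
whitespace_tokens = sample.split()

# 2. Treebank-style rules: brackets become separate tokens.
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

# 3. WordPunct: every run of punctuation is split off, including the hyphen.
wordpunct_tokens = WordPunctTokenizer().tokenize(sample)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("treebank", treebank_tokens),
    ("wordpunct", wordpunct_tokens),
]:
    print(f"{name:>10}: {tokens}")
```

Which behavior is preferable depends on the task: splitting on the hyphen lets `proto` and `languages` be processed separately, while keeping the hyphenated word whole preserves it as a single term.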

## Stemming

Now let's apply stemming to every token made up only of word characters (letters, digits, and underscores); punctuation tokens are dropped.

In [None]:
import re

re_word = re.compile(r"^\w+$")
stemmer = SnowballStemmer("english")
stemmed = [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
" ".join(stemmed)

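Note how the words are simply truncated to their stems, which are not always real words. Note also that "were" did not change: the stemmer only strips suffixes by rule, so it cannot relate an irregular form to its base verb "be".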
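One side effect of the `^\w+$` filter is worth spelling out: a token is kept only if every character is a word character, so a hyphenated token is discarded entirely rather than stemmed. The sketch below uses a hand-written token list (an assumption standing in for real tokenizer output) to show which tokens pass the filter.

```python
import re

# Same pattern as in the notebook: the whole token must be word characters.
re_word = re.compile(r"^\w+$")

# Hand-written tokens for illustration; real tokenizer output may differ.
tokens = ["proto-languages", "linguistics", ",", "[", "5", "]", "were"]

kept = [t for t in tokens if re_word.match(t)]
dropped = [t for t in tokens if not re_word.match(t)]

print("kept:   ", kept)     # tokens that would reach the stemmer
print("dropped:", dropped)  # punctuation and the hyphenated word
```

Note that digits count as word characters, so a token such as `5` is kept. If hyphenated words matter for your task, a more permissive pattern or a tokenizer that splits on hyphens may be a better fit.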

Here is another example, with the irregular past tense "went":

In [None]:
text = "I went to the cinema"
stemmed = [stemmer.stem(word) for word in word_tokenize(text.lower()) if re_word.match(word)]
        
" ".join(stemmed)
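To contrast regular and irregular forms directly, the following sketch stems a few individual words. The word list is our own choice for illustration; `SnowballStemmer("english")` is the same stemmer used above.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Illustrative words: two regular inflections, two irregular verb forms.
words = ["running", "languages", "went", "were"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

A regular inflection such as "running" is reduced to "run", whereas "went" and "were" come back unchanged because no suffix rule applies to them.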

## Going Further

NLTK provides several stemmer implementations covering several languages. Notably, [this short tutorial](http://www.nltk.org/howto/stem.html) shows how to use the Snowball stemmer in several languages. You can also look directly at [NLTK's implementation](https://www.nltk.org/_modules/nltk/stem/snowball.html) of the different stemmers.
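As a starting point, here is a minimal sketch of using the Snowball stemmer with languages other than English. The example words are our own picks, and the output is printed rather than assumed; `SnowballStemmer.languages` lists the language names the class accepts.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names supported by the Snowball stemmer.
print(SnowballStemmer.languages)

# Illustrative (language, word) pairs; swap in your own.
examples = [
    ("english", "running"),
    ("french", "continuellement"),
    ("german", "Autobahnen"),
]

# Build one stemmer per language and stem the corresponding word.
results = {}
for language, word in examples:
    results[language] = SnowballStemmer(language).stem(word)
    print(f"{language}: {word} -> {results[language]}")
```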