## Overview

In any natural language, many words share the same root. For example, consider the root *auto*, which means *self*. Combined with other word parts, it gives us automatic, automobile, autocrat, and many others. Another example is the word *active*. We can't reduce these roots any further and still have meaning. These roots are also *stems*: we can add endings (auto*matically*, auto*mobiles*, activa*ted*), a prefix (*in*active), or both (*de*activa*ted*).

In language, this process is called inflection: a word is modified for different uses, including tense, case, number, and gender. Inflected words still retain their core meaning.

In English (and other languages), inflection applied to a verb is called conjugation. We modify verbs for person, tense, and number. An example of conjugation for tense would be: work, worked, am working.

Nouns are inflected for number (plural) by changing the suffix: cat-cats, wolf-wolves, puppy-puppies. Just to make things confusing, sometimes more than the suffix changes, as in goose-geese or mouse-mice.

In natural language processing, we use stemming or lemmatization to trim our words down to the root word or stem. As we can see from the inflection examples, language is complicated and there are no simple rules to do this trimming. Let's look in more detail at the process to get a better idea of how it works.

### Stemming

As we've already seen from the inflection examples above, the process of stemming can involve something as simple as removing an "s" or "es" from the end of a noun, or an "ed" or "ing" from the end of a verb.

These rules aren't comprehensive but are a good starting place. Fortunately, there is a lot of research on stemming algorithms. For more information on various stemming algorithms, check out the resources listed below.
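
The simple suffix rules described above can be sketched in a few lines of Python. This is only an illustration, not one of the published stemming algorithms: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions made for this example.

```python
# Naive suffix stripping: an illustrative sketch, not a real stemming algorithm.
# The suffix list and the minimum stem length are assumptions for this example.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["cats", "foxes", "worked", "working", "wolves", "geese"]:
    print(w, "-->", naive_stem(w))
```

With these rules, "wolves" is cut down to "wolv" and "geese" is left untouched, which shows why such simple rules are only a starting place.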

Let's look in more detail at the Porter stemmer. It works by using an explicit list of suffixes and a list of criteria under which that suffix can be removed. The stemmer works in phases where each phase follows a particular rule, depending on the end of the word. For example, the first phase uses the following "rule group".

| Suffix |        | Replacement | Word     |        | Stem   |
|--------|--------|-------------|----------|--------|--------|
| SSES   | &rarr; | SS          | caresses | &rarr; | caress |
| IES    | &rarr; | I           | ponies   | &rarr; | poni   |
| SS     | &rarr; | SS          | caress   | &rarr; | caress |
| S      | &rarr; |             | cats     | &rarr; | cat    |

We can see that the resulting stem on the right doesn't have to be an actual word ("poni"), because we can still understand the meaning. Here's an example of what some text looks like after going through the Porter stemmer.

```plaintext
> *Sample text:* Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation
> *Porter stemmer:* such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret
```
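
To make the rule group from the table above concrete, here is a minimal sketch of how it could be applied. It covers only the four rules shown in the table, not the rest of the Porter stemmer, and the names `FIRST_RULE_GROUP` and `apply_first_rule_group` are hypothetical. The rules are tried from longest suffix to shortest, so that "caresses" matches SSES before it can match S.

```python
# Sketch of the single rule group shown in the table above (not the full stemmer).
# Each entry is (suffix, replacement); order matters, longest suffix first.
FIRST_RULE_GROUP = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def apply_first_rule_group(word):
    """Apply the first matching rule and return the result; otherwise return the word."""
    word = word.lower()
    for suffix, replacement in FIRST_RULE_GROUP:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "-->", apply_first_rule_group(w))
```

The SS rule looks like it does nothing, but it stops a word like "caress" from falling through to the S rule and losing its final letter.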

A Porter stemmer can be implemented in Python with the Natural Language Toolkit (NLTK), but we are going to work with the spaCy library for this part of the course. spaCy doesn't do stemming out of the box; instead, it uses a different technique called *lemmatization*, which we'll discuss next.

### Lemmatization

As you read through the example sentence after stemming, it will seem like the words are literally "chopped off". The job of a stemmer is to remove the endings, which is essentially what the Porter stemmer example above shows. This type of stemming works well when the result doesn't need to be human-readable. Stemming is useful in search and information retrieval applications; also, it's fast.

Lemmatization, on the other hand, is more methodical. The goal is to transform a word into its base form, called a *lemma*. Plural nouns, including those with irregular spellings, are transformed to the singular form (*geese* to *goose*). Verbs are transformed to their base form (*went* to *go*, *found* to *find*).

However, this type of processing has a higher computational cost than stemming. For our purposes, spaCy does a pretty good job of lemmatizing.
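
To make the idea of a lemma concrete, here is a toy lookup-table lemmatizer. It is only a sketch: the table is a small, hand-written, hypothetical example, the names `LEMMA_TABLE` and `lookup_lemma` are made up for illustration, and it is not meant to represent how spaCy works internally.

```python
# Toy lookup-table lemmatizer; the table is a hypothetical, hand-written example.
LEMMA_TABLE = {
    "geese": "goose",
    "mice": "mouse",
    "wolves": "wolf",
    "puppies": "puppy",
    "went": "go",
    "found": "find",
    "worked": "work",
    "working": "work",
}


def lookup_lemma(word):
    """Return the lemma from the table, or the lowercased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for w in ["Geese", "mice", "went", "working", "tunnel"]:
    print(w, "-->", lookup_lemma(w))
```

Unlike the stemmed output, every result here is a real word, but the approach only works for words that someone has already put in the table.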

## Follow Along

Let's apply lemmatization to some of our example text from the previous objective.

```python
# Import the library
import spacy

# Create an example sentence
sent = "The rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that Alice had not a moment to think about stopping herself before she found herself falling down a very deep well."

# Load the language library
nlp = spacy.load("en_core_web_lg")
doc = nlp(sent)

# Lemma Attributes
for token in doc:
    print(token.text, " --> ", token.lemma_)
```

```plaintext
The  -->  the
rabbit  -->  rabbit
-  -->  -
hole  -->  hole
went  -->  go
straight  -->  straight
on  -->  on
like  -->  like
a  -->  a
tunnel  -->  tunnel
for  -->  for
some  -->  some
way  -->  way
,  -->  ,
and  -->  and
then  -->  then
dipped  -->  dip
suddenly  -->  suddenly
down  -->  down
,  -->  ,
so  -->  so
suddenly  -->  suddenly
that  -->  that
Alice  -->  Alice
had  -->  have
not  -->  not
a  -->  a
moment  -->  moment
to  -->  to
think  -->  think
about  -->  about
stopping  -->  stop
herself  -->  -PRON-
before  -->  before
she  -->  -PRON-
found  -->  find
herself  -->  -PRON-
falling  -->  fall
down  -->  down
a  -->  a
very  -->  very
deep  -->  deep
well  -->  well
.  -->  .
```


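And there we go! We have tokenized *and* lemmatized our first sentence in spaCy. The lemmas are much easier to read than the stemmed text. In this particular sentence there aren't a lot of changes: most tokens are already in their base form, and only the inflected verbs (*went*, *dipped*, *had*, *stopping*, *found*, *falling*) and the pronouns differ from the original text.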

To make this process more efficient, let's wrap these text normalization steps in a single function.

```python
# Tokenizing and lemmatizing in one function
def get_lemmas(text):

    # Initialize a list
    lemmas = []

    # Convert the input text into a spaCy doc
    doc = nlp(text)

    # Remove stop words, punctuation, and personal pronouns (PRON)
    for token in doc:
        if not token.is_stop and not token.is_punct and token.pos_ != "PRON":
            lemmas.append(token.lemma_)

    # Return the lemmatized tokens
    return lemmas
```

```python
# Example text (https://en.wikipedia.org/wiki/Geology)
geology = [
    "Geology describes the structure of the Earth on and beneath its surface, and the processes that have shaped that structure.",
    "It also provides tools to determine the relative and absolute ages of rocks found in a given location, and also to describe the histories of those rocks.",
    "By combining these tools, geologists are able to chronicle the geological history of the Earth as a whole, and also to demonstrate the age of the Earth.",
    "Geology provides the primary evidence for plate tectonics, the evolutionary history of life, and the Earth's past climates.",
]
```

```python
# Find the lemmas for each sentence in the above text
geology_lemma = [get_lemmas(sentence) for sentence in geology]
print(geology_lemma)
```

```plaintext
[['geology', 'describe', 'structure', 'Earth', 'beneath', 'surface', 'process', 'shape', 'structure'], ['provide', 'tool', 'determine', 'relative', 'absolute', 'age', 'rock', 'find', 'give', 'location', 'describe', 'history', 'rock'], ['combine', 'tool', 'geologist', 'able', 'chronicle', 'geological', 'history', 'Earth', 'demonstrate', 'age', 'Earth'], ['geology', 'provide', 'primary', 'evidence', 'plate', 'tectonic', 'evolutionary', 'history', 'life', 'Earth', 'past', 'climate']]
```


Compared to the original sentences, lots of words have changed, either by becoming singular or by returning to the base form of the verb: processes to process, shaped to shape, combining to combine.

## Challenge

Using the `get_lemmas()` function above, complete the following text normalization steps on any text you choose. The text can be something more technical from Wikipedia or your favorite book.

* tokenization
* removing stop words and punctuation
* removing pronouns (by part of speech)
* lemmatization

Compare your original text with the lemmatized version and see if you can notice any patterns.
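
One possible way to make that comparison easier is to list only the tokens that lemmatization actually changed. The helper below is a sketch with a hypothetical name, `changed_pairs`. It works on plain `(text, lemma)` tuples so it can be tried without loading a model. With spaCy, the pairs would come from `(token.text, token.lemma_)` as in the loop above.

```python
def changed_pairs(pairs):
    """Keep only the (text, lemma) pairs where lemmatization altered the token."""
    return [(text, lemma) for text, lemma in pairs if text.lower() != lemma.lower()]


# A few (token, lemma) pairs copied from the Alice example output above
sample_pairs = [
    ("The", "the"),
    ("rabbit", "rabbit"),
    ("went", "go"),
    ("dipped", "dip"),
    ("Alice", "Alice"),
    ("had", "have"),
    ("stopping", "stop"),
    ("found", "find"),
    ("falling", "fall"),
]

for text, lemma in changed_pairs(sample_pairs):
    print(text, "-->", lemma)
```

Comparing in lowercase means a token such as "The" is not reported as changed just because its lemma is lowercased.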

## Resources

* [Stemming algorithms](https://pdfs.semanticscholar.org/1c0c/0fa35d4ff8a2f925eb955e48d655494bd167.pdf)
* [Porter stemmer](http://people.scs.carleton.ca/~armyunis/projects/KAPI/porter.pdf)
* [spaCy: Text Processing (Lemmatization)](https://spacy.io/api/annotation#text-processing)