# Retrieve n-grams: Example

This example shows the different options for extracting n-grams from a tagged text with the corpora toolbox. The tagged text must be provided as a list of Tag objects (from corpora_toolbox.nlp.**NLP_primitives**).

The retrieve_n_grams function from corpora_toolbox.nlp.**NLP_classics** has the following particularities:

1. It extracts n-grams using the period (tagged "Fp", following the [FreeLing](http://nlp.lsi.upc.edu/freeling/) tagging style) as a sentence boundary. This means that two words separated by a period that the tagger identifies as a sentence breaker won't be part of the same n-gram. For example, in "I am happy. The sun is shining.", "happy_the" is not a bigram, because the period splits the text into two sentences.
2. The retriever allows the extraction of n-grams of POS, original tokens or lemmas.
3. It also permits the attachment of the tag to the n-gram. This allows the differentiation between n-grams that include **homonyms**. For example, "*bear/N*" (the animal) and "*bear/V*" (to withstand).
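
The third point can be illustrated without the toolbox. The sketch below is only an illustration: the `(lemma, tag)` pairs are hand-written, hypothetical data rather than tagger output, and the `lemma/tag` string format simply mirrors the notation used in the list above.

```python
from collections import Counter

# Hypothetical (lemma, tag) pairs written by hand; not real tagger output.
tagged = [("the", "D"), ("bear", "N"), ("can", "V"),
          ("bear", "V"), ("the", "D"), ("cold", "N")]

# Unigrams without tags: both readings of "bear" share one key.
plain_counts = Counter(lemma for lemma, _ in tagged)

# Unigrams with the tag attached: the two readings stay apart.
tagged_counts = Counter(f"{lemma}/{tag}" for lemma, tag in tagged)

print(plain_counts["bear"])                                # 2
print(tagged_counts["bear/N"], tagged_counts["bear/V"])    # 1 1
```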

### Setting the relative path for importing the corpora_toolbox 

In [1]:
import platform
import sys

if platform.platform().lower().startswith("windows"):
    relative_path="..\\"
else:
    relative_path="../"

sys.path.append(relative_path)
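
As an alternative to branching on the platform, the parent directory can be resolved with `pathlib`, which handles path separators itself. This is a minimal sketch, assuming the notebook is run from a folder directly inside the repository root; note that it appends an absolute path rather than a relative one.

```python
import sys
from pathlib import Path

# Absolute path of the directory that contains the current working directory.
parent_dir = str(Path.cwd().parent)

# Avoid adding the same entry twice when the cell is re-run.
if parent_dir not in sys.path:
    sys.path.append(parent_dir)
```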

### Importing functions

In [2]:
from corpora_toolbox.nlp.NLP_primitives import Tag
from corpora_toolbox.nlp.NLP_classics import retrieve_n_grams
from collections import Counter

### Creating a list of Tag objects

The n-grams will be retrieved from the phrase "Verónica has a dog, and a cat, and a mouse, and an alligator. Laura has a rabbit.". The following cell creates a list of Tag objects filled with the part-of-speech tags and lemmas of the phrase.

POS tags: 

    N = noun
    V = verb
    D = determiner
    F = punctuation mark
        Fc = comma
        Fp = period
    C = conjunction

In [3]:
list_of_tags = []

list_of_tags.append( Tag("Veronica", "veronica", "N") )
list_of_tags.append( Tag("has", "have", "V") )
list_of_tags.append( Tag("a", "a", "D") )
list_of_tags.append( Tag("dog", "dog", "N") )
list_of_tags.append( Tag(",", ",", "Fc") )
list_of_tags.append( Tag("and", "and", "C") )
list_of_tags.append( Tag("a", "a", "D") )
list_of_tags.append( Tag("cat", "cat", "N") )
list_of_tags.append( Tag(",", ",", "Fc") )
list_of_tags.append( Tag("and", "and", "C") )
list_of_tags.append( Tag("a", "a", "D") )
list_of_tags.append( Tag("mouse", "mouse", "N") )
list_of_tags.append( Tag(",", ",", "Fc") )
list_of_tags.append( Tag("and", "and", "C") )
list_of_tags.append( Tag("an", "a", "D") )
list_of_tags.append( Tag("animal", "alligator", "N") )
list_of_tags.append( Tag(".", ".", "Fp") )
list_of_tags.append( Tag("Laura", "laura", "N") )
list_of_tags.append( Tag("has", "have", "V") )
list_of_tags.append( Tag("a", "a", "D") )
list_of_tags.append( Tag("rabbit", "rabbit", "N") )
list_of_tags.append( Tag(".", ".", "Fp") )

### Using the default options to extract n-grams

By default, the *retrieve_n_grams* function retrieves n-grams of lemmas and ignores punctuation marks. Following the [FreeLing](http://nlp.lsi.upc.edu/freeling/) tagging style, punctuation tags always start with "F".

**NOTE**: Regardless of the value of the *remove_punctuation* parameter, *retrieve_n_grams* will always consider the "Fp" (period) tag as the chunker mark to separate sentences.
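
The behaviour described in the note can be sketched in plain Python. This is not the toolbox's implementation: `sketch_n_grams` is a hypothetical helper that works on `(lemma, pos)` pairs instead of Tag objects, and it only assumes what is stated above, namely that "Fp" always closes a sentence and that other tags starting with "F" are dropped only when punctuation removal is requested.

```python
def sketch_n_grams(n, tagged, remove_punctuation=True):
    """Build n-grams from (lemma, pos) pairs, never crossing an "Fp" tag."""
    sentences, current = [], []
    for lemma, pos in tagged:
        if pos == "Fp":
            # The period always ends the sentence and is never emitted.
            sentences.append(current)
            current = []
        elif remove_punctuation and pos.startswith("F"):
            continue  # skip commas and other punctuation marks
        else:
            current.append(lemma)
    sentences.append(current)  # tokens after the last period, if any

    grams = []
    for sent in sentences:
        grams.extend("_".join(sent[i:i + n]) for i in range(len(sent) - n + 1))
    return grams


# Hand-written sample, not tagger output.
sample = [("dog", "N"), (",", "Fc"), ("and", "C"), ("cat", "N"), (".", "Fp"),
          ("laura", "N"), ("have", "V"), (".", "Fp")]

print(sketch_n_grams(2, sample))
# ['dog_and', 'and_cat', 'laura_have']
print(sketch_n_grams(2, sample, remove_punctuation=False))
# ['dog_,', ',_and', 'and_cat', 'laura_have']
```

In both calls "cat_laura" is absent, because the period between the two words ends the first sentence.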

In [4]:
unigrams = retrieve_n_grams(1, list_of_tags)
bigrams = retrieve_n_grams(2, list_of_tags)
trigrams = retrieve_n_grams(3, list_of_tags)
tetragrams = retrieve_n_grams(4, list_of_tags)
pentagrams = retrieve_n_grams(5, list_of_tags)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")
print("4-grams: ", tetragrams, "\n")
print("5-grams: ", pentagrams, "\n")

1-grams:  ['veronica', 'have', 'a', 'dog', 'and', 'a', 'cat', 'and', 'a', 'mouse', 'and', 'a', 'alligator', 'laura', 'have', 'a', 'rabbit'] 

2-grams:  ['veronica_have', 'have_a', 'a_dog', 'dog_and', 'and_a', 'a_cat', 'cat_and', 'and_a', 'a_mouse', 'mouse_and', 'and_a', 'a_alligator', 'laura_have', 'have_a', 'a_rabbit'] 

3-grams:  ['veronica_have_a', 'have_a_dog', 'a_dog_and', 'dog_and_a', 'and_a_cat', 'a_cat_and', 'cat_and_a', 'and_a_mouse', 'a_mouse_and', 'mouse_and_a', 'and_a_alligator', 'laura_have_a', 'have_a_rabbit'] 

4-grams:  ['veronica_have_a_dog', 'have_a_dog_and', 'a_dog_and_a', 'dog_and_a_cat', 'and_a_cat_and', 'a_cat_and_a', 'cat_and_a_mouse', 'and_a_mouse_and', 'a_mouse_and_a', 'mouse_and_a_alligator', 'laura_have_a_rabbit'] 

5-grams:  ['veronica_have_a_dog_and', 'have_a_dog_and_a', 'a_dog_and_a_cat', 'dog_and_a_cat_and', 'and_a_cat_and_a', 'a_cat_and_a_mouse', 'cat_and_a_mouse_and', 'and_a_mouse_and_a', 'a_mouse_and_a_alligator'] 





**- Bonus tip: -** You can easily get a dictionary with the frequency of each n-gram by using Counter, a dict subclass from the collections module:

In [5]:
from collections import Counter 

print(Counter(unigrams),"\n")
print(Counter(bigrams),"\n")
print(Counter(trigrams),"\n")

Counter({'a': 5, 'and': 3, 'have': 2, 'veronica': 1, 'dog': 1, 'cat': 1, 'mouse': 1, 'alligator': 1, 'laura': 1, 'rabbit': 1}) 

Counter({'and_a': 3, 'have_a': 2, 'veronica_have': 1, 'a_dog': 1, 'dog_and': 1, 'a_cat': 1, 'cat_and': 1, 'a_mouse': 1, 'mouse_and': 1, 'a_alligator': 1, 'laura_have': 1, 'a_rabbit': 1}) 

Counter({'veronica_have_a': 1, 'have_a_dog': 1, 'a_dog_and': 1, 'dog_and_a': 1, 'and_a_cat': 1, 'a_cat_and': 1, 'cat_and_a': 1, 'and_a_mouse': 1, 'a_mouse_and': 1, 'mouse_and_a': 1, 'and_a_alligator': 1, 'laura_have_a': 1, 'have_a_rabbit': 1}) 
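
A Counter also offers a few conveniences on top of a plain dictionary. The sketch below uses a short, hand-copied sample of bigrams (`sample_bigrams` is an illustrative name) to show `most_common` and a simple relative-frequency calculation.

```python
from collections import Counter

# A small sample of lemma bigrams, copied by hand for illustration.
sample_bigrams = ["have_a", "a_dog", "dog_and", "and_a",
                  "a_cat", "cat_and", "and_a", "a_mouse"]

counts = Counter(sample_bigrams)

# The most frequent bigram with its count.
print(counts.most_common(1))   # [('and_a', 2)]

# Relative frequency of each bigram over the total number of bigrams.
total = sum(counts.values())
relative = {gram: count / total for gram, count in counts.items()}
print(relative["and_a"])       # 0.25
```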



### Playing with the parameters

#### Retrieving n-grams of the original words (without lemmatization)

In [6]:
unigrams = retrieve_n_grams(1,list_of_tags, lemmas=False)
bigrams = retrieve_n_grams(2, list_of_tags, lemmas=False)
trigrams = retrieve_n_grams(3, list_of_tags, lemmas=False)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['Veronica', 'has', 'a', 'dog', 'and', 'a', 'cat', 'and', 'a', 'mouse', 'and', 'an', 'animal', 'Laura', 'has', 'a', 'rabbit'] 

2-grams:  ['Veronica_has', 'has_a', 'a_dog', 'dog_and', 'and_a', 'a_cat', 'cat_and', 'and_a', 'a_mouse', 'mouse_and', 'and_an', 'an_animal', 'Laura_has', 'has_a', 'a_rabbit'] 

3-grams:  ['Veronica_has_a', 'has_a_dog', 'a_dog_and', 'dog_and_a', 'and_a_cat', 'a_cat_and', 'cat_and_a', 'and_a_mouse', 'a_mouse_and', 'mouse_and_an', 'and_an_animal', 'Laura_has_a', 'has_a_rabbit'] 



#### Including punctuation marks in the n-grams

In [7]:
unigrams = retrieve_n_grams(1,list_of_tags, remove_punctuation=False)
bigrams = retrieve_n_grams(2, list_of_tags, remove_punctuation=False)
trigrams = retrieve_n_grams(3, list_of_tags, remove_punctuation=False)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['veronica', 'have', 'a', 'dog', ',', 'and', 'a', 'cat', ',', 'and', 'a', 'mouse', ',', 'and', 'a', 'alligator', 'laura', 'have', 'a', 'rabbit'] 

2-grams:  ['veronica_have', 'have_a', 'a_dog', 'dog_,', ',_and', 'and_a', 'a_cat', 'cat_,', ',_and', 'and_a', 'a_mouse', 'mouse_,', ',_and', 'and_a', 'a_alligator', 'laura_have', 'have_a', 'a_rabbit'] 

3-grams:  ['veronica_have_a', 'have_a_dog', 'a_dog_,', 'dog_,_and', ',_and_a', 'and_a_cat', 'a_cat_,', 'cat_,_and', ',_and_a', 'and_a_mouse', 'a_mouse_,', 'mouse_,_and', ',_and_a', 'and_a_alligator', 'laura_have_a', 'have_a_rabbit'] 



#### Attaching the tags to the n-grams

In [8]:
unigrams = retrieve_n_grams(1,list_of_tags, only_tokens=False)
bigrams = retrieve_n_grams(2, list_of_tags, only_tokens=False)
trigrams = retrieve_n_grams(3, list_of_tags, only_tokens=False)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['veronica/N', 'have/V', 'a/D', 'dog/N', 'and/C', 'a/D', 'cat/N', 'and/C', 'a/D', 'mouse/N', 'and/C', 'a/D', 'alligator/N', 'laura/N', 'have/V', 'a/D', 'rabbit/N'] 

2-grams:  ['veronica_have/N_V', 'have_a/V_D', 'a_dog/D_N', 'dog_and/N_C', 'and_a/C_D', 'a_cat/D_N', 'cat_and/N_C', 'and_a/C_D', 'a_mouse/D_N', 'mouse_and/N_C', 'and_a/C_D', 'a_alligator/D_N', 'laura_have/N_V', 'have_a/V_D', 'a_rabbit/D_N'] 

3-grams:  ['veronica_have_a/N_V_D', 'have_a_dog/V_D_N', 'a_dog_and/D_N_C', 'dog_and_a/N_C_D', 'and_a_cat/C_D_N', 'a_cat_and/D_N_C', 'cat_and_a/N_C_D', 'and_a_mouse/C_D_N', 'a_mouse_and/D_N_C', 'mouse_and_a/N_C_D', 'and_a_alligator/C_D_N', 'laura_have_a/N_V_D', 'have_a_rabbit/V_D_N'] 



#### Retrieving n-grams of the POS tags only

For linguistic analysis, it is common to analyze patterns of n-grams considering only the POS tags.

In [9]:
unigrams = retrieve_n_grams(1,list_of_tags, only_tags=True)
bigrams = retrieve_n_grams(2, list_of_tags, only_tags=True)
trigrams = retrieve_n_grams(3, list_of_tags, only_tags=True)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['N', 'V', 'D', 'N', 'C', 'D', 'N', 'C', 'D', 'N', 'C', 'D', 'N', 'N', 'V', 'D', 'N'] 

2-grams:  ['N_V', 'V_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_V', 'V_D', 'D_N'] 

3-grams:  ['N_V_D', 'V_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'N_V_D', 'V_D_N'] 
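
POS-only n-grams can also act as a filter over lexical n-grams. The sketch below assumes that the lemma n-grams and the POS n-grams were extracted with the same `n` and the same punctuation setting, so that both lists are aligned position by position. The sample lists are copied from the second sentence of the outputs shown in this example, and `matching_pattern` is a hypothetical helper, not part of the corpora toolbox.

```python
def matching_pattern(lemma_grams, tag_grams, pattern):
    """Return the lemma n-grams whose aligned POS n-gram equals `pattern`."""
    return [lemma for lemma, tags in zip(lemma_grams, tag_grams) if tags == pattern]


# Bigrams of the sentence "Laura has a rabbit.", as lemmas and as POS tags.
lemma_bigrams = ["laura_have", "have_a", "a_rabbit"]
tag_bigrams = ["N_V", "V_D", "D_N"]

# Keep only the determiner + noun bigrams.
noun_phrases = matching_pattern(lemma_bigrams, tag_bigrams, "D_N")
print(noun_phrases)  # ['a_rabbit']
```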

