## Retrieve n-grams: Example

This example tests the extraction of n-grams from a tagged text. The tagged text must be provided as a list of "tag" objects (from the corpora_toolbox.nlp.freeling_module).

The retrieve_n_grams function from corpora_toolbox.nlp.NLP_classics has the following particularities:
1. It extracts n-grams treating the period (tagged "Fp" by the tagger) as a sentence separator. Two words separated by a period that the tagger identifies as a sentence break are never part of the same n-gram. For example, in "I am happy. The sun is shining.", "happy_the" is not a bigram, because the function uses the period as a basic sentence chunker.
2. The retriever 
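
The sentence-boundary behaviour described in point 1 can be illustrated with a minimal sketch. It is not the actual `retrieve_n_grams` implementation: `sketch_n_grams`, the plain `(lemma, pos)` tuples, and the "P" and "A" tags are hypothetical stand-ins used only for illustration, and the underscore-joined format is assumed from the "happy_the" example.

```python
def sketch_n_grams(n, tokens, sentence_break="Fp"):
    """Build n-grams from (lemma, pos) pairs without crossing sentence breaks."""
    n_grams = []
    sentence = []
    # A trailing sentinel flushes the last sentence even without a final period.
    for lemma, pos in list(tokens) + [(None, sentence_break)]:
        if pos == sentence_break:
            # Slide a window of size n over the finished sentence only.
            for i in range(len(sentence) - n + 1):
                n_grams.append("_".join(sentence[i:i + n]))
            sentence = []
        else:
            sentence.append(lemma)
    return n_grams


# "I am happy. The sun is shining." as hypothetical (lemma, pos) pairs.
tokens = [
    ("i", "P"), ("be", "V"), ("happy", "A"), (".", "Fp"),
    ("the", "D"), ("sun", "N"), ("be", "V"), ("shine", "V"), (".", "Fp"),
]
bigrams = sketch_n_grams(2, tokens)
print(bigrams)
# ['i_be', 'be_happy', 'the_sun', 'sun_be', 'be_shine']
```

Because the window never spans the "Fp" token, "happy_the" does not appear in the result.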

#### Setting the relative path for importing the corpora_toolbox 

In [1]:
import platform
import sys

if platform.platform().lower().startswith("windows"):
    relative_path="..\\"
else:
    relative_path="../"

sys.path.append(relative_path)
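
As an alternative, the platform check can be avoided with `pathlib`, which handles path separators on every operating system. This is a minimal sketch, assuming the notebook is run from a folder one level below the directory that contains corpora_toolbox; `parent_dir` is a name introduced here for illustration.

```python
import sys
from pathlib import Path

# Assumption: the notebook is run from a folder one level below the project root.
parent_dir = str(Path.cwd().parent)

# Avoid adding the same entry twice when the cell is re-run.
if parent_dir not in sys.path:
    sys.path.append(parent_dir)
```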


#### Importing functions

In [2]:
from corpora_toolbox.nlp.freeling_module import tag
from corpora_toolbox.nlp.NLP_classics import retrieve_n_grams
from collections import Counter

### Create a list of Tag objects

From the phrase "Verónica has a dog, and a cat, and a mouse, and an alligator. Laura has a rabbit."
POS tags: 
    N = noun
    V = verb
    D = determiner
    F = Punctuation Sign
        Fc = comma
        Fp = period
    C = conjunction

In [3]:
list_of_tags = []

t = tag("Veronica", "veronica", "N")
list_of_tags.append(t)

t = tag("has", "have", "V")
list_of_tags.append(t)

t = tag("a", "a", "D")
list_of_tags.append(t)

t = tag("dog", "dog", "N")
list_of_tags.append(t)

t = tag(",", ",", "Fc")
list_of_tags.append(t)

t = tag("and", "and", "C")
list_of_tags.append(t)

t = tag("a", "a", "D")
list_of_tags.append(t)

t = tag("cat", "cat", "N")
list_of_tags.append(t)

t = tag(",", ",", "Fc")
list_of_tags.append(t)

t = tag("and", "and", "C")
list_of_tags.append(t)

t = tag("a", "a", "D")
list_of_tags.append(t)

t = tag("mouse", "mouse", "N")
list_of_tags.append(t)

t = tag(",", ",", "Fc")
list_of_tags.append(t)

t = tag("and", "and", "C")
list_of_tags.append(t)

t = tag("an", "a", "D")
list_of_tags.append(t)

t = tag("alligator", "alligator", "N")
list_of_tags.append(t)

t = tag(".", ".", "Fp")
list_of_tags.append(t)

t = tag("Laura", "laura", "N")
list_of_tags.append(t)

t = tag("has", "have", "V")
list_of_tags.append(t)

t = tag("a", "a", "D")
list_of_tags.append(t)

t = tag("rabbit", "rabbit", "N")
list_of_tags.append(t)

t = tag(".", ".", "Fp")
list_of_tags.append(t)
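
The same list can be written more compactly from (form, lemma, POS) triples. The sketch below uses a hypothetical `Tag` namedtuple as a stand-in so that it runs on its own; its field names are assumptions, and the argument order is inferred from the calls above. In the notebook, the imported `tag` class would presumably be called with the same three arguments.

```python
from collections import namedtuple

# Hypothetical stand-in for corpora_toolbox's `tag` class (field names assumed).
Tag = namedtuple("Tag", ["form", "lemma", "pos"])

triples = [
    ("Veronica", "veronica", "N"), ("has", "have", "V"),
    ("a", "a", "D"), ("dog", "dog", "N"), (",", ",", "Fc"),
    ("and", "and", "C"), ("a", "a", "D"), ("cat", "cat", "N"), (",", ",", "Fc"),
    ("and", "and", "C"), ("a", "a", "D"), ("mouse", "mouse", "N"), (",", ",", "Fc"),
    ("and", "and", "C"), ("an", "a", "D"), ("alligator", "alligator", "N"),
    (".", ".", "Fp"),
    ("Laura", "laura", "N"), ("has", "have", "V"),
    ("a", "a", "D"), ("rabbit", "rabbit", "N"), (".", ".", "Fp"),
]

# One object per token, in sentence order.
sketch_tags = [Tag(*triple) for triple in triples]
print(len(sketch_tags))  # 22
```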

#### First, we extract n-grams (n: 1-3) as they appear in the sentence (including punctuation marks):

In [4]:
unigrams = retrieve_n_grams(1, list_of_tags, remove_punctuation=False, lemmas=True)
bigrams = retrieve_n_grams(2, list_of_tags, remove_punctuation=False, lemmas=True)
trigrams = retrieve_n_grams(3, list_of_tags, remove_punctuation=False, lemmas=True)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['veronica', 'have', 'a', 'dog', ',', 'and', 'a', 'cat', ',', 'and', 'a', 'mouse', ',', 'and', 'a', 'alligator', 'laura', 'have', 'a', 'rabbit'] 

2-grams:  ['veronica_have', 'have_a', 'a_dog', 'dog_,', ',_and', 'and_a', 'a_cat', 'cat_,', ',_and', 'and_a', 'a_mouse', 'mouse_,', ',_and', 'and_a', 'a_alligator', 'laura_have', 'have_a', 'a_rabbit'] 

3-grams:  ['veronica_have_a', 'have_a_dog', 'a_dog_,', 'dog_,_and', ',_and_a', 'and_a_cat', 'a_cat_,', 'cat_,_and', ',_and_a', 'and_a_mouse', 'a_mouse_,', 'mouse_,_and', ',_and_a', 'and_a_alligator', 'laura_have_a', 'have_a_rabbit'] 



#### We can opt not to consider punctuation
##### Important note:
Punctuation tagged "Fp" (a period) is still used by the function to separate the n-grams of different sentences.

In [5]:
unigrams = retrieve_n_grams(1, list_of_tags, remove_punctuation=True, lemmas=True)
bigrams = retrieve_n_grams(2, list_of_tags, remove_punctuation=True, lemmas=True)
trigrams = retrieve_n_grams(3, list_of_tags, remove_punctuation=True, lemmas=True)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['veronica', 'have', 'a', 'dog', 'and', 'a', 'cat', 'and', 'a', 'mouse', 'and', 'a', 'alligator', 'laura', 'have', 'a', 'rabbit'] 

2-grams:  ['veronica_have', 'have_a', 'a_dog', 'dog_and', 'and_a', 'a_cat', 'cat_and', 'and_a', 'a_mouse', 'mouse_and', 'and_a', 'a_alligator', 'laura_have', 'have_a', 'a_rabbit'] 

3-grams:  ['veronica_have_a', 'have_a_dog', 'a_dog_and', 'dog_and_a', 'and_a_cat', 'a_cat_and', 'cat_and_a', 'and_a_mouse', 'a_mouse_and', 'mouse_and_a', 'and_a_alligator', 'laura_have_a', 'have_a_rabbit'] 
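
The interaction between punctuation removal and the "Fp" sentence break can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: `sketch_split_sentences` is a hypothetical helper, tokens are plain `(lemma, pos)` tuples, and any tag starting with "F" is assumed to be punctuation, following the tag list above.

```python
def sketch_split_sentences(tokens, remove_punctuation=True):
    """Split (lemma, pos) pairs at "Fp"; optionally drop other "F*" tokens."""
    sentences, current = [], []
    for lemma, pos in tokens:
        if pos == "Fp":
            # The period always closes the sentence and is never emitted.
            if current:
                sentences.append(current)
            current = []
        elif remove_punctuation and pos.startswith("F"):
            continue  # e.g. drop commas tagged "Fc"
        else:
            current.append(lemma)
    if current:
        sentences.append(current)
    return sentences


tokens = [
    ("a", "D"), ("dog", "N"), (",", "Fc"), ("and", "C"), ("a", "D"),
    ("cat", "N"), (".", "Fp"),
    ("laura", "N"), ("have", "V"), ("a", "D"), ("rabbit", "N"), (".", "Fp"),
]
sentences = sketch_split_sentences(tokens)
# Bigrams are built per sentence, so none of them spans the period.
bigrams = ["_".join(s[i:i + 2]) for s in sentences for i in range(len(s) - 1)]
print(bigrams)
# ['a_dog', 'dog_and', 'and_a', 'a_cat', 'laura_have', 'have_a', 'a_rabbit']
```

The comma disappears, so "dog" and "and" become adjacent, while "cat_laura" never appears because the period still ends the sentence.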



For linguistic analysis, it is common to analyze n-gram patterns considering only the POS tags:

In [6]:
unigrams = retrieve_n_grams(1, list_of_tags, remove_punctuation=True, only_tags=True)
bigrams = retrieve_n_grams(2, list_of_tags, remove_punctuation=True, only_tags=True)
trigrams = retrieve_n_grams(3, list_of_tags, remove_punctuation=True, only_tags=True)

print("1-grams: ", unigrams, "\n")
print("2-grams: ", bigrams, "\n")
print("3-grams: ", trigrams, "\n")

1-grams:  ['N', 'V', 'D', 'N', 'C', 'D', 'N', 'C', 'D', 'N', 'C', 'D', 'N', 'N', 'V', 'D', 'N'] 

2-grams:  ['N_V', 'V_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_V', 'V_D', 'D_N'] 

3-grams:  ['N_V_D', 'V_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'D_N_C', 'N_C_D', 'C_D_N', 'N_V_D', 'V_D_N'] 
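
Since the n-grams are returned as a plain list of strings, the `Counter` class imported earlier can summarize how often each POS pattern occurs. The sketch below copies the 2-gram list from the output above so that it runs on its own; `pos_bigrams` and `pattern_counts` are names introduced here for illustration.

```python
from collections import Counter

# POS-tag bigrams copied from the printed 2-grams output.
pos_bigrams = ['N_V', 'V_D', 'D_N', 'N_C', 'C_D', 'D_N', 'N_C', 'C_D',
               'D_N', 'N_C', 'C_D', 'D_N', 'N_V', 'V_D', 'D_N']

# Count each pattern and list them from most to least frequent.
pattern_counts = Counter(pos_bigrams)
for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

In this sample, the determiner-noun pattern "D_N" is the most frequent, with 5 occurrences out of 15 bigrams.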

