# Inducing simple grammars from text using GITTA

This notebook shows how [GITTA](https://github.com/twinters/gitta) *(Grammar Induction using a Template Tree Approach)*
can induce a template-driven generative grammar from textual examples.

In [1]:
import random
from gitta import grammar_induction
random.seed(123)

## Creating a dataset
Create a dataset of text for which you would like to induce a generative grammar.
Then pass it to GITTA using the `induce_grammar_using_template_trees` method.
While the default values should already work, you can give some more hints to GITTA about your expected grammar using its parameters.
The most important are:
- `relative_similarity_threshold`: 0 = join slots if at least one value overlaps, 1 = never join slots unless their values overlap 100%.
- `allow_empty_string`: True if slots are allowed to map to empty strings, False if you want at least one token from every slot. Useful for obtaining grammars that are simple and easy to correct.
- `max_depth`: The maximum depth the internal template tree is allowed to reach at any point, which also limits how deep your grammar can be.
- `use_best_merge_candidate`: Makes GITTA use the best merge candidate, at the cost of some speed. Turning this boolean off can speed up induction, but might result in slightly less accurate grammars.
- `prune_redundant`: Prunes nodes of the template tree if all their children are already covered by other sibling nodes. Turning this off might give the grammar more than one path to generate the same string.
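
To build some intuition for the two extremes of `relative_similarity_threshold`, here is a minimal sketch of one way such a decision could work. It is not GITTA's implementation: the Jaccard overlap and the helper names `relative_similarity` and `should_join` are assumptions chosen purely for illustration.

```python
def relative_similarity(values_a, values_b):
    """Jaccard overlap between two sets of slot values (an assumed measure)."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def should_join(values_a, values_b, threshold):
    """Join two slots if they share at least one value and meet the threshold."""
    similarity = relative_similarity(values_a, values_b)
    return similarity > 0 and similarity >= threshold


# Hypothetical slot contents, taken from the two "I like my ... and my ..." examples
first_pet_slot = {"cat", "dog"}
second_pet_slot = {"dog", "chicken"}

print(should_join(first_pet_slot, second_pet_slot, threshold=0.1))  # low threshold: joined
print(should_join(first_pet_slot, second_pet_slot, threshold=1.0))  # strict threshold: kept apart
```

Under this assumed measure, the two slots share one value out of three, so a low threshold merges them into a single slot while a threshold of 1 keeps them separate.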

In [2]:
dataset = ["I like my cat and my dog", 
           "I like my dog and my chicken",
           "Alice the cat is jumping",
           "Bob the dog is walking",
           "Cathy the cat is walking"]
reconstructed_grammar = grammar_induction.induce_grammar_using_template_trees(
    dataset, relative_similarity_threshold=0.1,
)
print(reconstructed_grammar.to_json())


{
    "origin": [
        "<B> the <C> is <D>",
        "I like my <C> and my <C>"
    ],
    "C": [
        "cat",
        "chicken",
        "dog"
    ],
    "B": [
        "Alice",
        "Bob",
        "Cathy"
    ],
    "D": [
        "jumping",
        "walking"
    ]
}


## Generating more examples using the grammar
Now that we have induced this grammar, we can check whether it indeed generates more examples along the same lines as our input
by enumerating all possible generations.
If you only want a limited number of generations, you can use `generate()` instead.

In [3]:
all_generations = reconstructed_grammar.generate_all()
all_generations

{"Alice the cat is jumping",
 "Alice the cat is walking",
 "Alice the chicken is jumping",
 "Alice the chicken is walking",
 "Alice the dog is jumping",
 "Alice the dog is walking",
 "Bob the cat is jumping",
 "Bob the cat is walking",
 "Bob the chicken is jumping",
 "Bob the chicken is walking",
 "Bob the dog is jumping",
 "Bob the dog is walking",
 "Cathy the cat is jumping",
 "Cathy the cat is walking",
 "Cathy the chicken is jumping",
 "Cathy the chicken is walking",
 "Cathy the dog is jumping",
 "Cathy the dog is walking",
 "I like my cat and my cat",
 "I like my cat and my chicken",
 "I like my cat and my dog",
 "I like my chicken and my cat",
 "I like my chicken and my chicken",
 "I like my chicken and my dog",
 "I like my dog and my cat",
 "I like my dog and my chicken",
 "I like my dog and my dog"}

## Exporting to other grammar frameworks

GITTA can export its grammars to other popular frameworks, such as the default grammar arrow notation, or Tracery.
If you need exports to other grammar frameworks, feel free to raise [an issue](https://github.com/twinters/gitta/issues),
or send a pull request for the [grammar_exporter](https://github.com/twinters/gitta/blob/master/gitta/grammar_exporter.py) file.

In [4]:
from gitta import grammar_exporter

print("Arrow notation:")
print(grammar_exporter.to_arrow_notation(reconstructed_grammar))
print("\n")
print("Tracery notation:")
print(grammar_exporter.to_tracery(reconstructed_grammar))

Arrow notation:
S -> #B# the #C# is #D# | I like my #C# and my #C#
C -> cat | chicken | dog
B -> Alice | Bob | Cathy
D -> jumping | walking


Tracery notation:
{
    "origin": [
        "#B# the #C# is #D#",
        "I like my #C# and my #C#"
    ],
    "C": [
        "cat",
        "chicken",
        "dog"
    ],
    "B": [
        "Alice",
        "Bob",
        "Cathy"
    ],
    "D": [
        "jumping",
        "walking"
    ]
}
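
Comparing the outputs above with the JSON form shown earlier, the visible difference is how slots are written: `<C>` becomes `#C#`, and in arrow notation `origin` is shown as `S`. As a rough illustration of that mapping only (not the code in `grammar_exporter`), the conversion can be sketched on a plain dictionary; `to_tracery_dict` and `to_arrow_lines` are hypothetical helper names.

```python
import re

# The induced grammar in its JSON form, as a plain dictionary
grammar = {
    "origin": ["<B> the <C> is <D>", "I like my <C> and my <C>"],
    "C": ["cat", "chicken", "dog"],
    "B": ["Alice", "Bob", "Cathy"],
    "D": ["jumping", "walking"],
}


def to_tracery_dict(grammar):
    """Rewrite every <slot> reference as a Tracery-style #slot# reference."""
    return {
        name: [re.sub(r"<(\w+)>", r"#\1#", rule) for rule in rules]
        for name, rules in grammar.items()
    }


def to_arrow_lines(grammar, start="origin"):
    """Render one 'name -> option | option' line per rule, with the start rule named S."""
    lines = []
    for name, rules in to_tracery_dict(grammar).items():
        left_hand_side = "S" if name == start else name
        lines.append(f"{left_hand_side} -> {' | '.join(rules)}")
    return "\n".join(lines)


print(to_arrow_lines(grammar))
```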
