#### Context-Free Grammars (CFGs)

-CFGs are one of the most common types of formal grammar used in NLP.

-CFGs define rules where a single non-terminal on the left-hand side of a production rule can be replaced by a sequence of terminals and non-terminals.

(Terminals: the actual words or symbols of the language.)

(Non-terminals: abstract categories, such as noun phrase or verb phrase, used to group terminals and describe patterns in the language.)

S      ->       NP VP

NP     ->       Det N

VP     ->       V  NP

Det    ->       'the' | 'a'

N      ->       'cat' | 'dog'

V      ->       'chases' | 'sees'
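
To make the rewriting step concrete, here is a minimal sketch in plain Python (no NLTK) that stores the rules above in a dictionary and performs a leftmost derivation, replacing one non-terminal per step. The names `RULES` and `leftmost_derivation`, and the convention of passing the index of the chosen alternative for each step, are illustrative assumptions and not part of any library.

```python
# Hypothetical plain-Python encoding of the toy grammar above.
# Keys are non-terminals; each value lists the alternative right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N": [["cat"], ["dog"]],
    "V": [["chases"], ["sees"]],
}


def leftmost_derivation(choices, start="S"):
    """Rewrite the leftmost non-terminal once per entry in `choices`.

    `choices[i]` is the index of the alternative to use at step i.
    Returns every intermediate sentential form, starting with [start].
    """
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # The leftmost symbol that still has rules is the next non-terminal.
        pos = next(i for i, sym in enumerate(form) if sym in RULES)
        form = form[:pos] + RULES[form[pos]][choice] + form[pos + 1:]
        steps.append(list(form))
    return steps


# Alternatives chosen so that the derivation ends in "the cat chases the dog".
steps = leftmost_derivation([0, 0, 0, 0, 0, 0, 0, 0, 1])
for form in steps:
    print(" ".join(form))
```

Each printed line differs from the previous one by exactly one rule application, which is what "context-free" means in practice: a non-terminal is rewritten without looking at its neighbours.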

In [1]:
import nltk

grammar = nltk.CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'the' | 'a'
  N -> 'cat' | 'dog'
  V -> 'chases' | 'sees'
""")

# ChartParser yields every parse tree the grammar licenses for the tokens.
parser = nltk.ChartParser(grammar)
sentence = "the cat chases the dog".split()

for tree in parser.parse(sentence):
    print(tree)

(S (NP (Det the) (N cat)) (VP (V chases) (NP (Det the) (N dog))))
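
The bracketed string above encodes a tree: each `(` opens a node, the first token after it is the node's label, and the rest are its children. As a rough illustration of how to read that notation, the sketch below parses it into nested tuples using only the standard library. The helper names `read_tree` and `leaves` are made up for this example; when NLTK is available, its own tree class is the more natural tool.

```python
def read_tree(text):
    """Parse a bracketed tree such as "(NP (Det the) (N cat))".

    Returns nested tuples of the form (label, child, child, ...), where a
    child is either another tuple or a word (str).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, *children)

    return parse_node()


def leaves(node):
    """Collect the words at the leaves, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]


parsed = read_tree(
    "(S (NP (Det the) (N cat)) (VP (V chases) (NP (Det the) (N dog))))"
)
print(parsed[0])       # label of the root node
print(leaves(parsed))  # words in sentence order
```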


#### Language Generation

In [3]:
import nltk
from nltk import CFG

cfg = CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP | V
    Det -> 'the' | 'a'
    N -> 'cat' | 'dog' | 'ball'
    V -> 'chased' | 'saw'
""")

# A generator that yields every sentence the given CFG can produce.
# It assumes the grammar has no recursive rules; otherwise the recursion never ends.

def sentence_generate(grammar, starter):

    productions = grammar.productions(lhs=starter)

    for production in productions:
        if all(isinstance(symbol, str) for symbol in production.rhs()):
            yield list(production.rhs())
        else:
            for sub_sentence in combine_symbols(grammar, production.rhs()):
                yield sub_sentence

def combine_symbols(grammar,symbols):

    if not symbols:
        yield []
    else:
        first, rest = symbols[0], symbols[1:]
        if isinstance(first, nltk.grammar.Nonterminal):
            for first_part in sentence_generate(grammar, first):
                for rest_part in combine_symbols(grammar, rest):
                    yield first_part + rest_part  
        else:
            for rest_part in combine_symbols(grammar, rest):
                yield [first] + rest_part  

example_sentences = list(sentence_generate(cfg, nltk.Nonterminal('S')))
example_sentences

[['the', 'cat', 'chased', 'the', 'cat'],
 ['the', 'cat', 'chased', 'the', 'dog'],
 ['the', 'cat', 'chased', 'the', 'ball'],
 ['the', 'cat', 'chased', 'a', 'cat'],
 ['the', 'cat', 'chased', 'a', 'dog'],
 ['the', 'cat', 'chased', 'a', 'ball'],
 ['the', 'cat', 'saw', 'the', 'cat'],
 ['the', 'cat', 'saw', 'the', 'dog'],
 ['the', 'cat', 'saw', 'the', 'ball'],
 ['the', 'cat', 'saw', 'a', 'cat'],
 ['the', 'cat', 'saw', 'a', 'dog'],
 ['the', 'cat', 'saw', 'a', 'ball'],
 ['the', 'cat', 'chased'],
 ['the', 'cat', 'saw'],
 ['the', 'dog', 'chased', 'the', 'cat'],
 ['the', 'dog', 'chased', 'the', 'dog'],
 ['the', 'dog', 'chased', 'the', 'ball'],
 ['the', 'dog', 'chased', 'a', 'cat'],
 ['the', 'dog', 'chased', 'a', 'dog'],
 ['the', 'dog', 'chased', 'a', 'ball'],
 ['the', 'dog', 'saw', 'the', 'cat'],
 ['the', 'dog', 'saw', 'the', 'dog'],
 ['the', 'dog', 'saw', 'the', 'ball'],
 ['the', 'dog', 'saw', 'a', 'cat'],
 ['the', 'dog', 'saw', 'a', 'dog'],
 ['the', 'dog', 'saw', 'a', 'ball'],
 ...]
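
Because this grammar has no recursive rules, the set of sentences is finite and its size can be worked out by hand: there are 2 × 3 = 6 noun phrases and 2 × 6 + 2 = 14 verb phrases, so 6 × 14 = 84 sentences. The sketch below checks that count with a plain-Python enumeration built on `itertools.product`. The dictionary `RULES` is a hand-written copy of the grammar and `expand` is a hypothetical helper, not an NLTK function.

```python
from itertools import product

# Hypothetical dictionary form of the `cfg` grammar defined above.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["cat"], ["dog"], ["ball"]],
    "V": [["chased"], ["saw"]],
}


def expand(symbol):
    """Yield every word list derivable from `symbol`.

    Only terminates for grammars without recursive rules.
    """
    if symbol not in RULES:  # terminal: a plain word
        yield [symbol]
        return
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol, then take every combination.
        parts = [list(expand(sym)) for sym in rhs]
        for combo in product(*parts):
            yield [word for part in combo for word in part]


sentences = list(expand("S"))
print(len(sentences))
print(" ".join(sentences[0]))
```

For grammars that do contain recursion (for example `NP -> NP PP`), an enumeration like this needs a depth or count limit to terminate.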

In [4]:
# Constituency parsing with Stanza. A constituency parse groups the words of a
# sentence into nested phrases (NP, VP, ...), making its grammatical structure
# explicit, which can support more accurate language understanding and generation.

!pip install stanza






In [5]:
import stanza

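# The keyword is `processors` (plural); with the misspelled `processor`,
# the request is ignored and every default processor is loaded.
nlp_stanza = stanza.Pipeline('en', processors='tokenize,pos,constituency', download_method=None)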
doc_stanza = nlp_stanza("The cat chases the dog.")
for sentence in doc_stanza.sentences:
    print(sentence.constituency)



(ROOT (S (NP (DT The) (NN cat)) (VP (VBZ chases) (NP (DT the) (NN dog))) (. .)))


In [6]:
tree = doc_stanza.sentences[0].constituency
tree.label

'ROOT'

In [7]:
tree.children

((S (NP (DT The) (NN cat)) (VP (VBZ chases) (NP (DT the) (NN dog))) (. .)),)

In [8]:
tree.children[0].children

((NP (DT The) (NN cat)), (VP (VBZ chases) (NP (DT the) (NN dog))), (. .))
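
The `label` and `children` attributes shown above are enough to walk the whole tree recursively. The sketch below illustrates this on a hand-built stand-in, so it runs without Stanza. `Node`, `leaf`, `words`, and `phrases` are hypothetical names for this example, and the stand-in assumes that leaf nodes carry the word as their label and have no children.

```python
from collections import namedtuple

# Stand-in for a parse-tree node: only `label` and `children` are modelled,
# mirroring the two attributes used in the cells above.
Node = namedtuple("Node", ["label", "children"])


def leaf(tag, word):
    """Build a pre-terminal node such as (DT The)."""
    return Node(tag, (Node(word, ()),))


def words(node):
    """Return the words under `node`, left to right."""
    if not node.children:
        return [node.label]
    return [w for child in node.children for w in words(child)]


def phrases(node, label):
    """Yield the text of every subtree whose label equals `label`."""
    if node.label == label:
        yield " ".join(words(node))
    for child in node.children:
        yield from phrases(child, label)


# Hand-built copy of the tree printed above.
subject = Node("NP", (leaf("DT", "The"), leaf("NN", "cat")))
obj = Node("NP", (leaf("DT", "the"), leaf("NN", "dog")))
vp = Node("VP", (leaf("VBZ", "chases"), obj))
root = Node("ROOT", (Node("S", (subject, vp, leaf(".", "."))),))

print(list(phrases(root, "NP")))
```

If the assumption about leaf nodes holds, the same `phrases` function should work unchanged on `doc_stanza.sentences[0].constituency`, since it relies only on `label` and `children`.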

In [9]:
#Dependency Parsing using Stanza
import stanza
nlp_stanza = stanza.Pipeline(lang='en', processors='tokenize,mwt,pos,lemma,depparse')



In [10]:
doc_stanza = nlp_stanza('The cat chases the dog.')
for sent in doc_stanza.sentences:
    for word in sent.words:
        # word.head is a 1-based index into sent.words; 0 marks the root
        head = sent.words[word.head - 1].text if word.head > 0 else "root"
        print(f'id: {word.id}\tword: {word.text}\thead id: {word.head}\thead: {head}\tdeprel: {word.deprel}')


id: 1	word: The	head id: 2	head: cat	deprel: det
id: 2	word: cat	head id: 3	head: chases	deprel: nsubj
id: 3	word: chases	head id: 0	head: root	deprel: root
id: 4	word: the	head id: 5	head: dog	deprel: det
id: 5	word: dog	head id: 3	head: chases	deprel: obj
id: 6	word: .	head id: 3	head: chases	deprel: punct


In [11]:
for sentence in doc_stanza.sentences:
    for word in sentence.words:
        print(f"{word.text} -> {word.head}, {word.deprel}")

The -> 2, det
cat -> 3, nsubj
chases -> 0, root
the -> 5, det
dog -> 3, obj
. -> 3, punct
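
In a dependency parse each word points at its head, so inverting that relation gives, for every head, the list of words that depend on it. The sketch below shows one possible use: recovering a (subject, verb, object) triple from the rows printed above. The rows are copied by hand so the example runs without Stanza, and `ROWS`, `dependents`, and `triple` are illustrative names only.

```python
from collections import defaultdict

# (id, text, head id, deprel) rows copied from the output above.
ROWS = [
    (1, "The", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "chases", 0, "root"),
    (4, "the", 5, "det"),
    (5, "dog", 3, "obj"),
    (6, ".", 3, "punct"),
]

text_by_id = {word_id: text for word_id, text, _, _ in ROWS}

# Map each head id to its (relation, dependent text) pairs; head 0 is the root.
dependents = defaultdict(list)
for word_id, text, head, deprel in ROWS:
    dependents[head].append((deprel, text))

# The root word is the one whose head id is 0.
root_id = next(word_id for word_id, _, head, _ in ROWS if head == 0)
relations = dict(dependents[root_id])
triple = (relations.get("nsubj"), text_by_id[root_id], relations.get("obj"))
print(triple)
```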
