<a href="https://colab.research.google.com/github/pmadhyastha/INM434/blob/main/syntactic_structure_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Syntax and Syntactic Structure Prediction

In [None]:
!pip install tabulate

## Parsing language using CFG

In the code below, we will implement the CYK algorithm (this is a toy implementation). 

The `Grammar` class is the main class below. It loads a context-free grammar, given as a list of production-rule strings (an example list is provided after the code), and parses sentences with it.


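Notice that we have a custom `DictList` class: assigning a value to a key appends it to a list instead of overwriting the previous value. The `grammar_rules` variable is a `DictList` object that stores the production rules of the grammar in inverted form, where each key is the right-hand side of a production and each value is the list of left-hand sides that produce it.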

The parse_table variable is initialized as None, and will later be used to store the parse table created during the CYK parsing algorithm.

The `Grammar` constructor builds the `grammar_rules` dictionary. It splits each rule string at `->`, strips the surrounding whitespace, and stores the left-hand side under the key given by the right-hand side.
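
To make this inverted layout concrete, here is a minimal sketch that builds the same kind of index with a standard `defaultdict` as a stand-in for `DictList`. The helper name `build_rule_index` and the rule `N -> pizza` are assumptions made for illustration; they are not part of the notebook's classes or grammar.

```python
from collections import defaultdict


def build_rule_index(rules):
    """Map each right-hand side to the list of left-hand sides that produce it."""
    index = defaultdict(list)
    for rule in rules:
        lhs, rhs = rule.split("->")
        # Normalise whitespace so "NP PP" and " NP  PP " give the same key.
        index[" ".join(rhs.split())].append(lhs.strip())
    return dict(index)


rule_index = build_rule_index(["S -> NP VP", "VP -> V NP", "NP -> pizza", "N -> pizza"])
print(rule_index["pizza"])       # ['NP', 'N']
print(rule_index.get("NP VP"))   # ['S']
```

Keying on the right-hand side suits a bottom-up parser, which always starts from what it has already recognised and asks which symbols could produce it.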

The `apply_rules` function takes a right-hand side as a string (a single word, or two non-terminals separated by a space) and returns the list of left-hand sides of the production rules that match it, or `None` if no rule matches.

The CYK parsing algorithm consists of three nested loops. The outer loop iterates over the length of the substring being considered (in words), the middle loop iterates over the starting position of that substring, and the innermost loop iterates over the points at which the substring can be split in two. For example, if the sentence is "the cat sat on the mat", the middle loop would start at "the", then "cat", then "sat", and so on.

The first loop, which is over the length of the substring being considered, starts with 2 (single words are handled beforehand, when the first row of the table is filled) and goes up to the length of the sentence. The second loop, which is over the starting position of the substring being considered, starts with 1 and goes up to the length of the sentence minus the length of the substring plus 1. For each combination of substring length and starting position, the function then iterates over all possible ways of splitting the substring into two non-empty parts, represented by the variable p, which is the number of words in the left part. For example, if the substring being considered is "the cat sat", p would be 1 (for splitting the substring into "the" and "cat sat") and 2 (for splitting the substring into "the cat" and "sat").
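
The loop bounds can be checked in isolation. The sketch below only enumerates the substrings and split points visited for a three-word sentence, using the same 1-based `l`, `s`, `p` convention as the `parse` method; the function name `enumerate_splits` is an assumption made for illustration.

```python
def enumerate_splits(tokens):
    """Yield (length, start, split, left_words, right_words) in CYK order."""
    n = len(tokens)
    for l in range(2, n + 1):            # substring length
        for s in range(1, n - l + 2):    # 1-based start position
            for p in range(1, l):        # number of words in the left part
                left = tokens[s - 1:s - 1 + p]
                right = tokens[s - 1 + p:s - 1 + l]
                yield l, s, p, left, right


splits = list(enumerate_splits("the cat sat".split()))
for l, s, p, left, right in splits:
    print(f"l={l} s={s} p={p}: {' '.join(left)} | {' '.join(right)}")
```

For three words this visits four splits in total: one for each of the two substrings of length 2, and two for the full sentence.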

For each combination of substring length, starting position, and split point, the function retrieves two table cells: the productions that derive the part of the substring to the left of the split point and the productions that derive the part to the right. It then iterates over all pairs formed by taking one production from each cell and checks whether the grammar contains a rule whose right-hand side is that pair of labels. If it does, the left-hand side of that rule is added to the cell for the whole substring.
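
The combination step for a single split point can be sketched on its own. Here the two cells are plain lists of labels and the rule index is a plain dictionary; both are simplified stand-ins for the `Cell` and `DictList` objects, and `combine` is a hypothetical helper name.

```python
def combine(left_labels, right_labels, rule_index):
    """Return every label that can be built from one left and one right label."""
    results = []
    for a in left_labels:
        for b in right_labels:
            # The key has the same "A B" shape as a binary rule's right-hand side.
            results.extend(rule_index.get(f"{a} {b}", []))
    return results


rule_index = {"NP VP": ["S"], "V NP": ["VP"], "NP PP": ["NP"], "VP PP": ["VP"]}
print(combine(["NP"], ["VP"], rule_index))        # ['S']
print(combine(["V", "NP"], ["NP"], rule_index))   # ['VP']
print(combine(["P"], ["V"], rule_index))          # []
```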



In [None]:
from tabulate import tabulate


class DictList(dict):
    
    def __setitem__(self, key, value):
        try:
            self[key]
        except KeyError:
            super(DictList, self).__setitem__(key, [])
        self[key].append(value)


class ProductionRule:
    
    def __init__(self, result, p1=None, p2=None):
        self.result = result
        self.p1 = p1
        self.p2 = p2
    
    @property
    def type(self):
        return self.result
    
    @property
    def left_child(self):
        return self.p1
    
    @property
    def right_child(self):
        return self.p2


class Cell:
    
    def __init__(self, productions=None):
        self.productions = productions or []
            
    def add_production(self, result, p1=None, p2=None):
        self.productions.append(ProductionRule(result, p1, p2))
    
    def set_productions(self, productions):
        self.productions = productions
    
    @property
    def types(self):
        return [p.type for p in self.productions]
    
    @property
    def rules(self):
        return self.productions


class Grammar:
    
    def __init__(self, elist):
        self.grammar_rules = DictList()
        self.parse_table = None
        self.length = 0
        for line in elist:
            a, b = line.split("->")
            # Index each rule by its right-hand side (key) -> left-hand side (value)
            self.grammar_rules[b.strip()] = a.strip()
        
        print('')
        print('Grammar loaded. Rules read:')
        self.print_rules()
        print('')
    
    def print_rules(self):
        for r in self.grammar_rules:
            for p in self.grammar_rules[r]:
                print(f"{p} --> {r}")
        
    def apply_rules(self, t):
        return self.grammar_rules.get(t, None)
            
    def parse(self, sentence):
        self.number_of_trees = 0
        self.tokens = sentence.split()
        self.length = len(self.tokens)
        self.parse_table = [[Cell() for _ in range(self.length - y)] for y in range(self.length)]
        
        # Fill the first row of the table: one cell per word
        for x, t in enumerate(self.tokens):
            r = self.apply_rules(t)
            if r is None:
                raise ValueError(f"The word {t} is not in the grammar")
            for w in r:
                self.parse_table[0][x].add_production(w, ProductionRule(t))
        
        # Run the CYK parser: l = substring length, s = 1-based start, p = words in the left part
        for l in range(2, self.length+1):
            for s in range(1, self.length-l+2):
                for p in range(1, l):
                    t1 = self.parse_table[p-1][s-1].rules
                    t2 = self.parse_table[l-p-1][s+p-1].rules
                    for a in t1:
                        for b in t2:
                            r = self.apply_rules(f"{a.type} {b.type}")
                            if r is not None:
                                for w in r:
                                    print(f"This rule is being applied: {w}[{l},{s}] --> {a.type}[{p},{s}] {b.type}[{l-p},{s+p}]")
                                    self.parse_table[l-1][s-1].add_production(w, a, b)
                               
        self.number_of_trees = len(self.parse_table[self.length-1][0].types)
        if self.number_of_trees > 0:
            print(f"Number of possible trees: {self.number_of_trees}")
        else:
            print("The sentence seems incorrect")
        
    def get_trees(self):
        return self.parse_table[self.length-1][0].productions

    def print_parse_table(self):
        lines = []

        for row in reversed(self.parse_table):
            l = []
            for cell in row:
                l.append(cell.types)
            lines.append(l)
        
        lines.append(self.tokens)
        print(tabulate(lines))

In [None]:
example = ['S -> NP VP', 'PP -> P NP', 'VP -> V NP', 'VP -> VP PP', 'NP -> NP PP', 'NP -> pizza', 'NP -> I', 'V -> eat', 'NP -> pineapple', 'P -> with', 'V -> ate']


In [None]:
g = Grammar(example)
g.parse('I eat pizza with pineapple')
g.print_parse_table()
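
The productions in the top cell keep references to their children through `p1` and `p2`, so a tree can be written out recursively. The following is a sketch rather than part of the notebook's classes: `Node` is a hypothetical stand-in with the same three fields as `ProductionRule`, and `to_bracketed` is an assumed helper name.

```python
from collections import namedtuple

# Stand-in with the same fields as ProductionRule (result, p1, p2).
Node = namedtuple("Node", ["result", "p1", "p2"], defaults=[None, None])


def to_bracketed(node):
    """Render a production and its descendants as a bracketed string."""
    if node.p1 is None:                      # a bare word
        return node.result
    if node.p2 is None:                      # a word-level rule such as NP -> pizza
        return f"({node.result} {to_bracketed(node.p1)})"
    return f"({node.result} {to_bracketed(node.p1)} {to_bracketed(node.p2)})"


tree = Node("S",
            Node("NP", Node("I")),
            Node("VP", Node("V", Node("ate")), Node("NP", Node("pizza"))))
print(to_bracketed(tree))  # (S (NP I) (VP (V ate) (NP pizza)))
```

Because the function only reads `result`, `p1` and `p2`, it should also work on the objects returned by `g.get_trees()`, for example `for t in g.get_trees(): print(to_bracketed(t))`.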

## TODO:
- Give another example grammar in the same format (this format is known as Chomsky Normal Form).
- Try different sentences.
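
When writing a new grammar, a rough check that every rule has one of the two shapes the parser expects (`A -> B C` or `A -> word`) can save some debugging. The sketch below is an assumption-laden helper, not part of the notebook: `cnf_violations` is a hypothetical name, and it treats a symbol as a non-terminal only if it appears on the left-hand side of some rule.

```python
def cnf_violations(rules):
    """Return the rules that are neither 'A -> B C' nor 'A -> word'."""
    parsed = []
    for rule in rules:
        lhs, rhs = rule.split("->")
        parsed.append((lhs.strip(), rhs.split()))
    # Assumption: anything that appears on a left-hand side is a non-terminal.
    non_terminals = {lhs for lhs, _ in parsed}

    bad = []
    for lhs, rhs in parsed:
        binary = len(rhs) == 2 and all(sym in non_terminals for sym in rhs)
        lexical = len(rhs) == 1 and rhs[0] not in non_terminals
        if not (binary or lexical):
            bad.append(f"{lhs} -> {' '.join(rhs)}")
    return bad


print(cnf_violations(["S -> NP VP", "VP -> V NP", "NP -> pizza", "V -> ate"]))  # []
print(cnf_violations(["S -> NP VP PP", "NP -> N", "N -> pizza"]))
# ['S -> NP VP PP', 'NP -> N']
```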

## Exploring dependency parsing

In the following, we will navigate the dependency parse trees of sentences. Rather than building a parser ourselves, we will rely on a trained model to extract the parses, using a popular NLP library called `spacy`.

We will first install the library.



In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

We will now import the `spacy` library and load the pre-trained English model `en_core_web_sm` (please refer to the spaCy documentation for further information about the library).

In [None]:
import spacy
from spacy import displacy
pipeline = spacy.load("en_core_web_sm")

In [None]:
sample_text = "Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."

# from wikipedia! 

Let us now pass the sentence through the `spacy` pipeline, which runs several processing steps (such as tokenisation and part-of-speech tagging) and also generates the dependency parse.

In [None]:
sample_piped_output = pipeline(sample_text)


We are mainly interested in the dependency parse, which the code below visualises:

In [None]:
displacy.render(sample_piped_output, style="dep", jupyter=True, options={'distance': 90})

Isn't that sentence a tad long and difficult to process in our head? 


## TODO: 
- Is there a mistake in the parse tree? 
- Try your own set of sentences and visualise the parse tree. 




Let us now print the children of "is". Is this correct? Does it make sense?

In [None]:
is_token = next(t for t in sample_piped_output if t.text == "is")  # look "is" up by text, not by a fixed index
print([token.text for token in is_token.children])



Let us now print the subtree of the word "is":

In [None]:
is_token = next(t for t in sample_piped_output if t.text == "is")  # look "is" up by text, not by a fixed index
print(list(is_token.subtree))
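
What `children` and `subtree` compute can be illustrated without a trained model. The sketch below stores a small hand-written parse as a list of head indices. The sentence, its head indices, and the helper names are illustrative assumptions and do not come from spaCy's output.

```python
words = ["I", "ate", "pizza", "with", "pineapple"]
# heads[i] is the index of word i's head; the root ("ate") points to itself.
heads = [1, 1, 1, 2, 3]


def children(i):
    """Indices of the words whose head is word i (ignoring the root's self-link)."""
    return [j for j, h in enumerate(heads) if h == i and j != i]


def subtree(i):
    """Indices of word i and all of its descendants, in sentence order."""
    found = [i]
    for child in children(i):
        found.extend(subtree(child))
    return sorted(found)


print([words[j] for j in children(1)])  # ['I', 'pizza']
print([words[j] for j in subtree(2)])   # ['pizza', 'with', 'pineapple']
```

The children of a word are only its direct dependents, whereas the subtree also includes everything below them, which is why the subtree of the root covers the whole sentence.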

Now let's go through each word in the sentence and print its part-of-speech tag and dependency relation.

In [None]:
for word in sample_piped_output:
    print(f"{word.text}: {word.pos_} {word.dep_}")

## TODO: 
- Can you get all the subjects and objects in this sentence? 
- To learn more about dependency labels, see: https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf