### Assignement 2

The assignment consists in the development, in NLTK, OpenNLP, SketchEngine or GATE/Annie a pipeline that, starting from a text in input, in a given language (English, French, German and Italian are admissible) outputs the syntactic tree of the sentence itself, intended as a tree with root in S for sentence, and leaves on the tokens labelled with a single Part-of-speech. The generation of the tree can pass through one of the following models:

1) PURE SYMBOLIC. The tree is generated by a LR analysis with CF LL2 grammar as a base. Candidates can assume the following:

   a) Adjectives in English and German shall be only prefixed to nouns, whilst in French and Italian are only suffixed;

    b) Verbs are all at present tense;

    c) No pronouns are admitted;

    d) Only one adverb is admitted, always post-poned with respect to the verb (independently of the language, and the type of adverb);

    Overall the point above map a system that could be devised in regular expressions, but a Context-free grammar would be simpler to     
    define. Candidate can either define a system by themselves or use a syntactic tree generation system that can be found on GitHub. 
    Same happens for POS-tagging, where some of the above mentioned systems can be customized by existing techniques that are available
    in several fashions (including a pre-defined NLTK and OpenNLP libraries for POS-tagging and a module in GATE for the same purpose. Ambiguity 
    should be blocked onto first admissible tree.

2) PURE ML. Candidates can develop a PLM with one-step Markov chains to forecast the following token, and used to generate the forecast of the
     POS tags to be attributed. In this case the PLM can be generated starting with a Corpus, that could be obtained online, for instance by 
     using the Wikipedia access API, or other available free repos (including those available with SketchEngine. In this approach, candidates should
     never use the forecasting to approach the determination of outcomes (for this would be identical purpose of distinguishing EN/non ENG (and
     then IT/non IT, FR/not FR or DE/not DE) but only to identify the POS model in a sequence. In this case, the candidate should output the most
     likely POS tagging, without associating the sequence to a tree in a direct fashion.

Candidates are free to employ PURE ML approach to simplify, or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach while generating a mixed model.

In [66]:
#************************ GENERAL IMPORTS ************************#
import spacy
import nltk; nltk.download("europarl_raw")
from tqdm import tqdm 
import math,random,pathlib,collections
from nltk.corpus import europarl_raw 

[nltk_data] Downloading package europarl_raw to
[nltk_data]     /home/kativen/nltk_data...
[nltk_data]   Package europarl_raw is already up-to-date!


In [74]:
#LOADING ENGLISH SPACY 
file = nltk.sent_tokenize(europarl_raw.english.raw(europarl_raw.english.fileids()[0]))
nlp = spacy.load("en_core_web_sm") 

In [68]:
#LOADING ITALIAN SPACY 
file = nltk.sent_tokenize(europarl_raw.italian.raw(europarl_raw.italian.fileids()[0]))
nlp = spacy.load("it_core_news_sm") 

In [69]:
#LOADING GERMAN SPACY 
file = nltk.sent_tokenize(europarl_raw.german.raw(europarl_raw.german.fileids()[0]))
nlp = spacy.load("de_core_news_sm") 

In [70]:
#LOADING FRENCH SPACY 
file = nltk.sent_tokenize(europarl_raw.french.raw(europarl_raw.french.fileids()[0]))
nlp = spacy.load("fr_core_news_sm") 

#### Esempio libro nltk 

In [71]:

#### EXAMPLE  FROM NLTK BOOK 

# Generate a CSP Grammar with nltk 
eng_grammar = nltk.CFG.fromstring(""" # this is for an example usage
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'cat' | 'dog'
V -> 'chased' | 'sat'
""") # Takes a File as an input and returns a nltk.CFG object 

parser = nltk.BottomUpLeftCornerChartParser(grammar=eng_grammar)
sentence = "the cat chased the dog".split()
trees = parser.parse(sentence)
for tree in trees: 
    print(tree)

(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det the) (N dog))))


### LAVORO EFFETTIVO:

In [98]:
grammar_rules = "S -> NP VP\nPP -> P NP\nNP -> Det N | Det N PP | 'I'\nVP -> V NP | VP PP\n"

# For each sentence in the file do the POS tagging and save the results in a map with every word saved in a list labeled with the Tag.
for sentence in file: 
    possible_pos = set()
    grammar = {}
    spacy_parsed_sent= nlp(sentence)
    for token in spacy_parsed_sent:
        possible_pos.add(token.pos_)
        if not token.pos_ in grammar:
            grammar[token.pos_] = []
        word = '"' + token.text + '"'
        if word not in grammar[token.pos_]:
            grammar[token.pos_].append(word)

    # Target types 
    grammar_rules+= "SPACE -> ' '\n" 
    for type in possible_pos:  
        if type != "SPACE":
            appo_string = f"{type} -> "
            index = len(grammar[type]) - 1
            for word in grammar[type][0:index]:
                appo_string+= " {} |".format(word)
            appo_string+= " {}\n".format(grammar[type][-1])
            grammar_rules+= appo_string 

    # if "NOUN" in possible_pos and "PROPN" in possible_pos:
    #     grammar_rules += f'N -> {" | ".join(grammar["NOUN"] + grammar["PROPN"])}\n'
    # if "VERB" in possible_pos and "AUX" in possible_pos:
    #     grammar_rules += f'N -> {" | ".join(grammar["VERB"] + grammar["AUX"])}\n'

    print(f"Sentence: {sentence}\n")
    print(f"Grammar Rules: {grammar_rules}")
    nltk_grammar = nltk.CFG.fromstring(grammar_rules)
    # print(f"Sentence Grammar: {nltk_grammar}")

    parser = nltk.ChartParser(nltk_grammar)
    sentence = nltk.word_tokenize(sentence)
    for tree in parser.parse(sentence):
        print(tree)



Sentence:  
Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .

Grammar Rules: S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
SPACE -> ' '
NOUN ->  "session" | "year" | "hope" | "period"
CCONJ ->  "and"
ADJ ->  "happy" | "new" | "pleasant" | "festive"
VERB ->  "declare" | "resumed" | "adjourned" | "like" | "wish" | "enjoyed"
PART ->  "to"
ADP ->  "of" | "on" | "in"
AUX ->  "would"
NUM ->  "17" | "1999"
SCONJ ->  "that"
PUNCT ->  "," | "."
ADV ->  "once" | "again"
PRON ->  "I" | "you"
PROPN ->  "Resumption" | "European" | "Parliament" | "Friday" | "December"
DET ->  "the" | "a"

Sentence: Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .

ValueError: Grammar does not cover some of the input words: "'part-session'".

In [91]:
sentence = nltk.word_tokenize(sentence)
print("sent: ", sentence)
for tree in parser.parse(sentence):
    print(tree)

sent:  ['Although', ',', 'as', 'you', 'will', 'have', 'seen', ',', 'the', 'dreaded', "'", 'millennium', 'bug', "'", 'failed', 'to', 'materialise', ',', 'still', 'the', 'people', 'in', 'a', 'number', 'of', 'countries', 'suffered', 'a', 'series', 'of', 'natural', 'disasters', 'that', 'truly', 'were', 'dreadful', '.']


ValueError: Grammar does not cover some of the input words: '\'Although\', \'as\', \'will\', \'have\', \'seen\', \'dreaded\', "\'", \'millennium\', \'bug\', "\'", \'failed\', \'materialise\', \'still\', \'people\', \'number\', \'countries\', \'suffered\', \'series\', \'natural\', \'disasters\', \'truly\', \'were\', \'dreadful\''.

Grammar Rules: S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
SPACE -> ' '
PRON ->  'A'
SPACE -> ' '
X ->  'l'
SPACE -> ' '
INTJ ->  't'
SPACE -> ' '
X ->  'h'
SPACE -> ' '
INTJ ->  'o'
SPACE -> ' '
NOUN ->  'u'
SPACE -> ' '
PROPN ->  'g'
SPACE -> ' '
X ->  'h'
SPACE -> ' '
SPACE -> ' '
PUNCT ->  ','
SPACE -> ' '
SPACE -> ' '
PRON ->  'a'
SPACE -> ' '
NOUN ->  's'
SPACE -> ' '
SPACE -> ' '
X ->  'y'
SPACE -> ' '
INTJ ->  'o'
SPACE -> ' '
NOUN ->  'u'
SPACE -> ' '
SPACE -> ' '
PROPN ->  'w'
SPACE -> ' '
PRON ->  'i'
SPACE -> ' '
X ->  'l'
SPACE -> ' '
X ->  'l'
SPACE -> ' '
SPACE -> ' '
X ->  'h'
SPACE -> ' '
PRON ->  'a'
SPACE -> ' '
X ->  'v'
SPACE -> ' '
X ->  'e'
SPACE -> ' '
SPACE -> ' '
NOUN ->  's'
SPACE -> ' '
X ->  'e'
SPACE -> ' '
X ->  'e'
SPACE -> ' '
CCONJ ->  'n'
SPACE -> ' '
SPACE -> ' '
PUNCT ->  ','
SPACE -> ' '
SPACE -> ' '
INTJ ->  't'
SPACE -> ' '
X ->  'h'
SPACE -> ' '
X ->  'e'
SPACE -> ' '
SPACE -> ' '
X ->  'd'
SPACE -> ' '
X ->  'r'
SPACE 