### Assignment 2

The assignment consists of developing, in NLTK, OpenNLP, SketchEngine or GATE/ANNIE, a pipeline that takes an input text in a given language (English, French, German and Italian are admissible) and outputs the syntactic tree of each sentence: a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part of speech. The tree can be generated with one of the following models:

1) PURE SYMBOLIC. The tree is generated by an LR analysis based on a context-free LL(2) grammar. Candidates can assume the following:

    a) Adjectives in English and German only precede nouns, whereas in French and Italian they only follow them;

    b) All verbs are in the present tense;

    c) No pronouns are admitted;

    d) Only one adverb is admitted, always placed after the verb (regardless of the language and of the type of adverb).

    Overall, the points above describe a system that could be expressed with regular expressions, but a context-free grammar would be simpler to
    define. Candidates can either define a system themselves or use a syntactic tree generation system available on GitHub.
    The same holds for POS tagging, where some of the systems mentioned above can be customized with existing techniques, available
    in several forms (including predefined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose). Ambiguity
    should be resolved by stopping at the first admissible tree.

2) PURE ML. Candidates can develop a PLM with one-step Markov chains to forecast the following token, and use it to forecast the
     POS tags to be assigned. In this case the PLM can be built from a corpus obtained online, for instance through
     the Wikipedia access API or other freely available repositories (including those available with SketchEngine). In this approach, candidates should
     never use the forecasting to determine outcomes (this would serve the same purpose as distinguishing EN/non-EN, and
     then IT/non-IT, FR/non-FR or DE/non-DE) but only to identify the POS model in a sequence. In this case, the candidate should output the most
     likely POS tagging, without directly associating the sequence with a tree.
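
One possible reading of the one-step Markov chain for POS tagging is a bigram model over tags combined with per-tag word counts, decoded with the Viterbi algorithm. The sketch below is only an illustration under stated assumptions: the tiny hand-tagged corpus `TAGGED` is invented here, add-one smoothing is an arbitrary choice, and the helper names `train`, `log_prob` and `viterbi` are hypothetical rather than part of the assignment.

```python
import math
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
TAGGED = [
    [("the", "DET"), ("fat", "ADJ"), ("cat", "NOUN"), ("jumps", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("runs", "VERB"), ("away", "ADV")],
    [("small", "ADJ"), ("cats", "NOUN"), ("love", "VERB"),
     ("the", "DET"), ("cat", "NOUN")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sentence in corpus:
        prev = "<s>"  # sentence-start marker
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability of `key` under `counter`."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, trans, emit):
    """Return the most likely tag sequence for `words`."""
    tags = sorted(emit)
    n_tags = len(tags)
    # Vocabulary size plus one slot reserved for unseen words.
    n_words = len({w for counts in emit.values() for w in counts}) + 1

    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags)
            + log_prob(emit[t], words[0].lower(), n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            e = log_prob(emit[t], words[i].lower(), n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train(TAGGED)
predicted = viterbi(["the", "small", "cat", "runs"], trans, emit)
print(predicted)
```

The output is a tag sequence only, in line with the requirement that this model yields the most likely POS tagging without attaching a tree to it. A real corpus would replace `TAGGED`.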

Candidates are free to employ the PURE ML approach to simplify or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach, thereby producing a mixed model.
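
In such a mixed setup, a tagger's output can feed a small deterministic parser. The sketch below is a hand-written predictive (recursive-descent) parser for English under assumptions a) to d); it is not the LR analysis named in the assignment. The grammar it encodes (`S -> NP VP`, `NP -> DET? ADJ* NOUN`, `VP -> VERB ADV? NP?`), the tag names, and the function name `parse_sentence` are all assumptions made for illustration.

```python
def parse_sentence(tagged):
    """Parse [(word, tag), ...] into a nested-tuple tree rooted in S.

    Assumed grammar (English, present tense, no pronouns):
        S  -> NP VP
        NP -> DET? ADJ* NOUN
        VP -> VERB ADV? NP?
    """
    pos = 0  # index of the next unread token

    def peek():
        # Tag of the next token, or None at end of input.
        return tagged[pos][1] if pos < len(tagged) else None

    def take(tag):
        # Consume one token that must carry the given tag.
        nonlocal pos
        if peek() != tag:
            raise ValueError(f"expected {tag} at position {pos}, found {peek()}")
        word = tagged[pos][0]
        pos += 1
        return (tag, word)

    def noun_phrase():
        children = []
        if peek() == "DET":
            children.append(take("DET"))
        while peek() == "ADJ":  # adjectives precede the noun
            children.append(take("ADJ"))
        children.append(take("NOUN"))
        return ("NP", *children)

    def verb_phrase():
        children = [take("VERB")]
        if peek() == "ADV":  # at most one adverb, after the verb
            children.append(take("ADV"))
        if peek() in {"DET", "ADJ", "NOUN"}:
            children.append(noun_phrase())
        return ("VP", *children)

    tree = ("S", noun_phrase(), verb_phrase())
    if pos != len(tagged):
        raise ValueError(f"unexpected {peek()} at position {pos}")
    return tree


tree = parse_sentence([
    ("the", "DET"), ("fat", "ADJ"), ("cat", "NOUN"),
    ("runs", "VERB"), ("away", "ADV"),
])
print(tree)
```

Because each decision depends on a single lookahead tag, the parser never produces more than one tree, so the "first admissible tree" rule is satisfied trivially. For French or Italian, the adjective loop would move after the noun.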

In [1]:
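#************************ GENERAL IMPORTS ************************#
import collections
import math
import pathlib
import random

import nltk
import spacy
from nltk.corpus import europarl_raw
from tqdm import tqdm

# Fetch the Europarl sample corpus used by the commented-out loaders below.
nltk.download("europarl_raw")

# Maps grammar symbols to spaCy's universal POS tag names.
# In NLTK's CFG syntax, quoted right-hand sides are terminals.
spacy_to_nltk_gram = """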
PUNCT -> "PUNCT"
SCONJ -> "SCONJ"
PRON -> "PRON"
SYM -> "SYM"
NUM -> "NUM"
N -> "NOUN"
V -> "VERB"
AUX -> "AUX"
P -> "ADP"
ADJ -> "ADJ"
ADV -> "ADV"
"""

[nltk_data] Downloading package europarl_raw to
[nltk_data]     /home/kativen/nltk_data...
[nltk_data]   Package europarl_raw is already up-to-date!
[nltk_data] Downloading package universal_treebanks_v20 to
[nltk_data]     /home/kativen/nltk_data...
[nltk_data]   Package universal_treebanks_v20 is already up-to-date!
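
As a side note on the mapping above: because quoted right-hand sides are terminals in NLTK's CFG syntax, a rule such as `N -> "NOUN"` matches the literal token `"NOUN"`. A minimal sketch, assuming the parser is fed a sequence of POS tags instead of words; the reduced grammar and the tag list here are illustrative and are not the notebook's own.

```python
import nltk

# Reduced, illustrative grammar: the quoted symbols are terminals, so the
# input tokens must be POS-tag strings rather than words.
tag_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> ADJ N | N
VP -> V | V ADVP
ADVP -> ADV
N -> "NOUN"
V -> "VERB"
ADJ -> "ADJ"
ADV -> "ADV"
""")

parser = nltk.ChartParser(tag_grammar)
tags = ["ADJ", "NOUN", "VERB", "ADV"]

# Keep only the first tree the parser yields (None if there is no parse).
first_tree = next(iter(parser.parse(tags)), None)
print(first_tree)
```

Taking only the first tree is one simple way to respect the "first admissible tree" rule when a grammar is ambiguous.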


In [2]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.english.raw(europarl_raw.english.fileids()[0]))
file = [
    "The fat cat is jumping.",
    "The red cat is blue.",
    "The cat is running away.",
    "I love cats.",
    "Small cats are awesome.",
    "Fat cats are awesome."
]

# LOADING ENGLISH SPACY 
nlp = spacy.load("en_core_web_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP
NP -> NUM ADJ N | N
VP -> V NP | V | V ADVP | VP SCONJ VP | AUX VP
ADVP -> ADV 
ADJP -> ADJ | ADJ ADJP
PP -> P NP
""" + spacy_to_nltk_gram

In [8]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.italian.raw(europarl_raw.italian.fileids()[0]))
file = [
    "Il gatto grasso sta saltando.",
    "Il gatto rosso è blu.",
    "Il gatto sta correndo via.",
    "Amo i gatti.",
    "I gatti piccoli sono fantastici.",
    "I gatti grassi sono fantastici."
]

#LOADING ITALIAN SPACY 
nlp = spacy.load("it_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP
NP -> NUM ADJ N | N
VP -> V NP | V | V ADVP | VP SCONJ VP | AUX VP
ADJP -> ADJ | ADJ ADJP
ADVP -> ADV 
PP -> P NP
""" + spacy_to_nltk_gram

In [11]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.german.raw(europarl_raw.german.fileids()[0]))
file = [ 
    "Die fette Katze springt.",
    "Die rote Katze ist blau.",
    "Die Katze rennt davon.",
    "Ich liebe Katzen.",
    "Kleine Katzen sind toll.",
    "Fette Katzen sind großartig."
]

#LOADING GERMAN SPACY 
nlp = spacy.load("de_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP
NP -> NUM ADJ N | N
VP -> V NP | V | V ADVP | VP SCONJ VP | AUX VP
ADJP -> ADJ | ADJ ADJP
ADVP -> ADV 
PP -> P NP
""" + spacy_to_nltk_gram

In [13]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.french.raw(europarl_raw.french.fileids()[0]))
file = [
    "Le gros chat saute.",
    "Le chat rouge est bleu.",
    "Le chat s'enfuit.",
    "J'aime les chats.",
    "Les petits chats sont géniaux.",
    "Les gros chats sont géniaux."
]

#LOADING FRENCH SPACY 
nlp = spacy.load("fr_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP
NP -> PRON | NUM ADJ N | N
VP -> V NP | V | V ADVP | VP SCONJ VP | AUX VP
ADJP -> ADJ | ADJ ADJP
ADVP -> ADV 
PP -> P NP
""" + spacy_to_nltk_gram

### FIRST PIPELINE:
(No treebank parser for extracting the grammar)
For each sentence in the file, run POS tagging and store the results in a map from each tag to the list of words labelled with it.
Then convert the grammar into an NLTK grammar object and use it inside the parser.
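
The lexical-rule step described above can be isolated in a small helper that needs neither spaCy nor NLTK. This is a sketch only: the function name `lexical_rules` is hypothetical, and the input is assumed to be a list of `(word, tag)` pairs such as those obtained from `token.text` and `token.pos_`.

```python
from collections import defaultdict


def lexical_rules(tagged):
    """Build NLTK-style lexical productions from (word, tag) pairs.

    Each tag becomes one rule listing its distinct words in order of
    first appearance, e.g.  NOUN -> "cat" | "dog"
    """
    words_by_tag = defaultdict(list)
    for word, tag in tagged:
        quoted = '"' + word + '"'
        if quoted not in words_by_tag[tag]:
            words_by_tag[tag].append(quoted)
    # One production per tag, sorted by tag name for a stable output.
    return "".join(
        f"{tag} -> {' | '.join(words)}\n"
        for tag, words in sorted(words_by_tag.items())
    )


rules = lexical_rules([
    ("The", "DET"), ("fat", "ADJ"), ("cat", "NOUN"),
    ("is", "AUX"), ("jumping", "VERB"), (".", "PUNCT"),
])
print(rules)
```

The returned string can be appended to a base grammar before it is passed to `nltk.CFG.fromstring`. A token that itself contains a double quote would need escaping first, which this sketch does not handle.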

In [1]:


for sentence in file:
    # Map each POS tag found in the sentence to its quoted words,
    # in order of first appearance.
    grammar = {}
    spacy_parsed_sent = nlp(sentence)
    for token in spacy_parsed_sent:
        word = '"' + token.text + '"'
        words = grammar.setdefault(token.pos_, [])
        if word not in words:
            words.append(word)

    # Append one lexical rule per tag, e.g. NOUN -> "cat" | "cats"
    grammar_rules = base_grammar
    for pos_tag, words in grammar.items():
        grammar_rules += f"{pos_tag} -> {' | '.join(words)}\n"

    print(f"{sentence}\n")
    nltk_grammar = nltk.CFG.fromstring(grammar_rules)
    # print(f"Sentence Grammar: {nltk_grammar}")
    parser = nltk.ChartParser(nltk_grammar)

    spacy_tokenized = [token.text for token in spacy_parsed_sent]
    # The first token is skipped before parsing.
    for tree in parser.parse(spacy_tokenized[1:]):
        print(tree)

NameError: name 'file' is not defined