### Assignment 2

The assignment consists of developing, in NLTK, OpenNLP, SketchEngine or GATE/Annie, a pipeline that takes an input text in a given language (English, French, German and Italian are admissible) and outputs the syntactic tree of each sentence: a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part of speech. The tree can be generated with one of the following models:

1) PURE SYMBOLIC. The tree is generated by an LR analysis based on a CF LL2 grammar. Candidates can assume the following:

    a) Adjectives in English and German only precede nouns, whilst in French and Italian they only follow them;

    b) All verbs are in the present tense;

    c) No pronouns are admitted;

    d) Only one adverb is admitted, always placed after the verb (independently of the language and of the type of adverb).

    Overall, the points above describe a system that could be devised with regular expressions, but a context-free grammar would be simpler to define. Candidates can either define a system by themselves or use a syntactic tree generation system found on GitHub. The same holds for POS tagging, where some of the above-mentioned systems can be customized with existing techniques that are available in several forms (including pre-defined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose). Ambiguity should be resolved by keeping the first admissible tree.

2) PURE ML. Candidates can develop a PLM with one-step Markov chains to forecast the following token, and use it to forecast the POS tags to be assigned. In this case the PLM can be built from a corpus obtained online, for instance through the Wikipedia access API or other freely available repositories (including those available with SketchEngine). In this approach, candidates should never use the forecast to determine outcomes (this would serve the same purpose as distinguishing EN/non-EN, and then IT/non-IT, FR/non-FR or DE/non-DE), but only to identify the POS model in a sequence. In this case, the candidate should output the most likely POS tagging, without directly associating the sequence with a tree.
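
For illustration only, a one-step (first-order) Markov model over POS tags might be estimated as in the sketch below. The tag sequences are a hypothetical hand-written toy corpus, not data taken from Wikipedia or SketchEngine, and `train_transitions` and `most_likely_next` are hypothetical helper names. The sketch only predicts the most likely tag following a given tag; a fuller solution would score whole tag sequences.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each sentence is already reduced to its POS tags.
TAG_SEQUENCES = [
    ["DET", "ADJ", "NOUN", "VERB", "PUNCT"],
    ["DET", "NOUN", "VERB", "ADV", "PUNCT"],
    ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
]

START = "<s>"  # artificial tag marking the beginning of a sentence


def train_transitions(sequences):
    """Estimate P(next_tag | previous_tag) from relative frequencies."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Pair every tag with the tag that precedes it.
        for prev, nxt in zip([START] + seq, seq):
            counts[prev][nxt] += 1
    return {
        prev: {tag: c / sum(ctr.values()) for tag, c in ctr.items()}
        for prev, ctr in counts.items()
    }


def most_likely_next(transitions, prev):
    """Return the tag with the highest transition probability after `prev`."""
    candidates = transitions[prev]
    return max(candidates, key=candidates.get)


transitions = train_transitions(TAG_SEQUENCES)
print(most_likely_next(transitions, START))  # tag most likely to open a sentence
print(transitions["DET"])                    # distribution of tags following DET
```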

Candidates are free to employ the PURE ML approach to simplify or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach, thus producing a mixed model.

### Pure Symbolic:
To solve this assignment I decided to use the Pure Symbolic approach. I later found that the task breaks down into three main subtasks:
1. Tokenize the input phrase and tag its parts of speech, in all four languages;
2. Create a base grammar for each language following the provided rules, add all the word-tag pairs (terminals) to it, then transform it into an NLTK-compatible version;
3. With the NLTK grammar object, create a parser and use it to generate a syntactic tree for the phrase. If the parser finds more than one tree for a single phrase, print only the first one.

In [1]:
#************************ GENERAL IMPORTS ************************#
import spacy
import nltk 
from nltk.tree import TreePrettyPrinter
spacy_to_nltk_gram = """
N -> NOUN
V -> VERB
P -> ADP
"""

#### 1. Tokenization and POS Tagging
To tokenize and perform POS tagging I used the spaCy library. spaCy supports a broad catalogue of languages (far more than NLTK) and performs both operations in a single function call.
Given an input text, spaCy returns a sequence of token objects, each of which carries its POS tag as an attribute.
I created one block for each language, using the same variable names in every block, so switching language is as easy as rerunning the language-specific block.

### English

In [2]:
# LOADING DATA
file = [
    "The fat cat is jumping.",
    "The red cat is blue.",
    "The cat is running away.",
    "I love cats.",
    "Small cats are awesome.",
    "Fat cats are awesome."
]

# LOADING ENGLISH SPACY 
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NUM ADJ N | N | ADJP NP  | DET NP 
VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
ADVP -> ADV 
ADJP -> ADJ | ADJ ADJP
PP -> P NP
""" + spacy_to_nltk_gram

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 7.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Italian

In [60]:
# LOADING DATA
file = [
    "Il gatto grasso sta saltando.",
    "Il gatto rosso è blu.",
    "Il gatto sta correndo via.",
    "Amo i gatti.",
    "I gatti piccoli sono fantastici.",
    "I gatti grassi sono fantastici."
]

#LOADING ITALIAN SPACY 
spacy.cli.download("it_core_news_sm")
nlp = spacy.load("it_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NUM N ADJ | N | NP ADJP | DET NP 
VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
ADVP -> ADV 
ADJP -> ADJ | ADJ ADJP
PP -> P NP
""" + spacy_to_nltk_gram

### German

In [63]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.german.raw(europarl_raw.german.fileids()[0]))
file = [ 
    "Die fette Katze springt.",
    "Die rote Katze ist blau.",
    "Die Katze rennt davon.",
    "Ich liebe Katzen.",
    "Kleine Katzen sind toll.",
    "Fette Katzen sind großartig."
]

#LOADING GERMAN SPACY 
spacy.cli.download("de_core_news_sm")
nlp = spacy.load("de_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NUM ADJ N | N | ADJP NP  | DET NP 
VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
ADVP -> ADV 
ADJP -> ADJ | ADJ ADJP
PP -> P NP
""" + spacy_to_nltk_gram

### French

In [13]:
# LOADING DATA
# file = nltk.sent_tokenize(europarl_raw.french.raw(europarl_raw.french.fileids()[0]))
file = [
    "Le gros chat saute.",
    "Le chat rouge est bleu.",
    "Le chat s'enfuit.",
    "J'aime les chats.",
    "Les petits chats sont géniaux.",
    "Les gros chats sont géniaux."
]

#LOADING FRENCH SPACY 
spacy.cli.download("fr_core_news_sm")
nlp = spacy.load("fr_core_news_sm") 

# LANGUAGE SPECIFIC GRAMMAR  
base_grammar= """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NUM N ADJ | N | NP ADJP | DET NP 
VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP | AUX ADV 
ADVP -> ADV 
ADJP -> ADJ | ADJ ADJP
PP -> P NP
""" + spacy_to_nltk_gram

#### 2. Creating an NLTK-compatible Grammar
I created a phrase-specific grammar by appending to the language-specific grammar one rule per POS tag, listing the words of the phrase that carry that tag. <br/>
To convert this string into a grammar I used the **nltk.CFG.fromstring** function, and then used its return value (an NLTK grammar object) to create a phrase-specific parser.
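
The construction described above can be sketched in isolation as follows. This is only an illustration under stated assumptions: `BASE_GRAMMAR` is a deliberately tiny, hypothetical grammar (not one of the language grammars defined earlier), `build_grammar` is a hypothetical helper name, and the word-tag pairs are written by hand instead of coming from spaCy.

```python
import nltk

# Hypothetical, deliberately tiny base grammar used only for this sketch.
BASE_GRAMMAR = """
S -> NP VP
NP -> DET N
VP -> V
N -> NOUN
V -> VERB
"""


def build_grammar(tagged_tokens, base=BASE_GRAMMAR):
    """Append one terminal rule per POS tag and return an NLTK CFG."""
    words_by_tag = {}
    for word, tag in tagged_tokens:
        quoted = f'"{word}"'  # terminals must be quoted in the grammar string
        bucket = words_by_tag.setdefault(tag, [])
        if quoted not in bucket:
            bucket.append(quoted)

    rules = base
    for tag, words in words_by_tag.items():
        # e.g. DET -> "The"
        rules += f"{tag} -> {' | '.join(words)}\n"
    return nltk.CFG.fromstring(rules)


# Hand-written (word, tag) pairs standing in for the spaCy output.
tagged = [("The", "DET"), ("cat", "NOUN"), ("jumps", "VERB")]
grammar = build_grammar(tagged)
for production in grammar.productions():
    print(production)
```

Because the terminal rules are rebuilt for every phrase, the resulting grammar only covers the words of that phrase, which keeps it small at the cost of creating a new parser each time.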

#### 3. Creating a Parser and Generating Syntactic Trees
The parser returns a list of compatible trees, which represent all the possible ways in which the phrase can be parsed. <br/>
Not all the input phrases can be parsed with the given base grammar: this shows the limitations of the provided grammar and of this method.
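
Both behaviours, several trees for one phrase and no tree at all, can be reproduced with a small sketch. The grammar below is a hypothetical reduction written for this example (it is not one of the base grammars above) and `first_tree` is a hypothetical helper name. The final full stop can attach either to S or to VP, so the first phrase is ambiguous, while the second phrase starts with a pronoun that no rule can turn into an NP.

```python
import nltk

# Hypothetical reduced grammar: PUNCT can attach to S or to VP (ambiguity),
# and PRON is a known tag that no phrase rule uses (unparsable input).
grammar = nltk.CFG.fromstring("""
S -> NP VP PUNCT | NP VP
NP -> DET N | N
VP -> V | VP PUNCT
N -> NOUN
V -> VERB
DET -> "The"
NOUN -> "cat"
VERB -> "jumps"
PRON -> "I"
PUNCT -> "."
""")
parser = nltk.ChartParser(grammar)


def first_tree(tokens):
    """Return the first parse tree (or None) and the number of trees found."""
    trees = list(parser.parse(tokens))
    return (trees[0] if trees else None), len(trees)


tree, n_trees = first_tree(["The", "cat", "jumps", "."])
print(f"{n_trees} trees found, printing the first one")
print(tree)

no_tree, n_none = first_tree(["I", "jumps", "."])
print(f"{n_none} trees found for the phrase starting with a pronoun")
```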


In [3]:
for sentence in file:
    # Map each POS tag to the distinct (quoted) words carrying it in this sentence.
    grammar = {}
    spacy_parsed_sent = nlp(sentence)
    print(f"{sentence}\n")
    for token in spacy_parsed_sent:
        print(f"{token.text} -> {token.pos_}")
        word = '"' + token.text + '"'
        words = grammar.setdefault(token.pos_, [])
        if word not in words:
            words.append(word)

    print("\n")

    # Append one terminal rule per POS tag, e.g. DET -> "The"
    grammar_rules = base_grammar
    for pos, words in grammar.items():
        grammar_rules += f"{pos} -> {' | '.join(words)}\n"

    nltk_grammar = nltk.CFG.fromstring(grammar_rules)
    parser = nltk.ChartParser(nltk_grammar)

    # Parse the spaCy tokens; if the sentence is ambiguous, print only the first tree.
    spacy_tokenized = [token.text for token in spacy_parsed_sent]
    trees = list(parser.parse(spacy_tokenized))
    if trees:
        print(TreePrettyPrinter(trees[0]).text())
    print("\n\n")

The fat cat is jumping.

The -> DET
fat -> ADJ
cat -> NOUN
is -> AUX
jumping -> VERB
. -> PUNCT


              S                        
      ________|_____________________    
     NP                |            |  
  ___|____             |            |   
 |        NP           VP           |  
 |    ____|___      ___|_____       |   
 |   |        NP   |         VP     |  
 |   |        |    |         |      |   
 |  ADJP      N    |         V      |  
 |   |        |    |         |      |   
DET ADJ      NOUN AUX       VERB  PUNCT
 |   |        |    |         |      |   
The fat      cat   is     jumping   .  




The red cat is blue.

The -> DET
red -> ADJ
cat -> NOUN
is -> AUX
blue -> ADJ
. -> PUNCT


              S                     
      ________|__________________    
     NP                |         |  
  ___|____             |         |   
 |        NP           |         |  
 |    ____|___         |         |   
 |   |        NP       VP        |  
 |   |        |    