# Sentence Patterns

When analyzing sentences to understand them, we can use *word sequence patterns*. These are mainly used to classify and to generate text. In addition, the technique of *walking the syntactic dependency tree* can be used to extract specific information from a sentence.

**Note**: the material in this notebook is taken from chapter 6 of https://nostarch.com/NLPPython and adapted to the latest spaCy version (3.2.x).

## Word Sequence Patterns

These patterns are used to detect and classify sentences. For example, we can classify a sentence as a *question asking about ability*, and so on. The first two cells compare two sentences token by token, first by dependency label and then by part-of-speech tag.


In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u'We can overtake them.')
doc2 = nlp(u'You must specify it.')

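# Compare token by token; the -1 skips the final period.
# Both sentences are assumed to have the same number of tokens.
for i in range(len(doc1)-1):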
    if doc1[i].dep_ == doc2[i].dep_:
        print(doc1[i].text, doc2[i].text, doc1[i].dep_, spacy.explain(doc1[i].dep_))


We You nsubj nominal subject
can must aux auxiliary
overtake specify ROOT None
them it dobj direct object


In [2]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc1 = nlp(u'We can overtake them.')
doc2 = nlp(u'You must specify it.')

# Same comparison as before, this time on coarse part-of-speech tags.
for i in range(len(doc1)-1):
    if doc1[i].pos_ == doc2[i].pos_:
        print(doc1[i].text, doc2[i].text, doc1[i].pos_, spacy.explain(doc1[i].pos_))


We You PRON pronoun
can must AUX auxiliary
overtake specify VERB verb
them it PRON pronoun


The two scripts above show that although the two sentences have different meanings, they share the same sentence pattern. The script below does the same thing for several sentences combined into one text, comparing every later sentence against the first.

In [3]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'We can overtake them. You must specify it. I could do it.')

sents = list(doc.sents)
for sent in sents[1:]:
    for i in range(len(sents[0])-1):
        if sents[0][i].dep_ == sent[i].dep_:
            print(sents[0][i].text, sent[i].text, sents[0][i].dep_, spacy.explain(sents[0][i].dep_))


We You nsubj nominal subject
can must aux auxiliary
overtake specify ROOT None
them it dobj direct object
We I nsubj nominal subject
can could aux auxiliary
overtake do ROOT None
them it dobj direct object
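
The token-by-token comparisons above can be condensed into a single check on whole label sequences. The following is a minimal sketch and not part of the original notebook: `same_pattern` is a hypothetical helper, and the label lists are copied by hand from the outputs above (with a parsed text they could be built with something like `[t.dep_ for t in sent]`), so the block runs without loading a model.

```python
def same_pattern(labels_a, labels_b):
    """Return True when two sentences have identical label sequences."""
    # Comparing whole lists also handles sentences of different length,
    # which an index-based loop does not do safely.
    return list(labels_a) == list(labels_b)

# Dependency labels copied from the printed output above (punctuation omitted).
we_can_overtake_them = ['nsubj', 'aux', 'ROOT', 'dobj']
you_must_specify_it = ['nsubj', 'aux', 'ROOT', 'dobj']
# Hypothetical labels for a sentence without an auxiliary, for contrast.
no_auxiliary = ['nsubj', 'ROOT', 'dobj']

print(same_pattern(we_can_overtake_them, you_must_specify_it))
print(same_pattern(we_can_overtake_them, no_auxiliary))
```

The first call should report a match and the second should not, because the sequences differ in both length and content.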


We can also check whether a sentence follows a specific pattern:

In [4]:
import spacy

nlp = spacy.load('en_core_web_sm')

def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+2] is always a valid index.
    for i in range(len(doc)-2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            for tok in doc[i+2].children:
                if tok.dep_ == 'dobj':
                    return True
    return False

doc = nlp(u'We can overtake them.')

if dep_pattern(doc):
    print('Found')
else:
    print('Not found')

Found


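The linear part of this pattern (subject, auxiliary, root) can also be checked with spaCy's *Matcher*. Note that the pattern below does not test whether the root has a direct object: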

In [5]:
import spacy

from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

matcher = Matcher(nlp.vocab)

pattern = [[{"DEP": "nsubj"}, {"DEP": "aux"}, {"DEP": "ROOT"}]]

matcher.add("NsubjAuxRoot", pattern)

doc = nlp(u"We can overtake them.")

matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]
    print("Span: ", span.text)
    print("The positions in the doc are: ", start, "-", end)


Span:  We can overtake
The positions in the doc are:  0 - 3
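
Conceptually, a token pattern like this one is compared against contiguous windows of tokens. The sketch below illustrates that idea in plain Python. It is not how spaCy implements `Matcher`; `find_label_sequence` is a hypothetical helper, and the `punct` label for the final period is an assumption.

```python
def find_label_sequence(labels, pattern):
    """Return (start, end) spans where `pattern` occurs contiguously in `labels`."""
    spans = []
    width = len(pattern)
    for start in range(len(labels) - width + 1):
        if labels[start:start + width] == pattern:
            spans.append((start, start + width))  # end index is exclusive
    return spans

# Dependency labels for 'We can overtake them.'
# ('punct' for the period is assumed, the rest comes from the earlier output).
labels = ['nsubj', 'aux', 'ROOT', 'dobj', 'punct']

spans = find_label_sequence(labels, ['nsubj', 'aux', 'ROOT'])
print(spans)
```

The returned span uses the same convention as the `start` and `end` values printed above: `end` points one past the last matched token.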


We can define more than one pattern and combine them:

In [6]:
import spacy

nlp = spacy.load('en_core_web_sm')

def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+2] is always a valid index.
    for i in range(len(doc)-2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            for tok in doc[i+2].children:
                if tok.dep_ == 'dobj':
                    return True
    return False

def pos_pattern(doc):
    for token in doc:
        if token.dep_ == 'nsubj' and token.tag_ != 'PRP':
            return False
        if token.dep_ == 'aux' and token.tag_ != 'MD':
            return False
        if token.dep_ == 'ROOT' and token.tag_ != 'VB':
            return False
        if token.dep_ == 'dobj' and token.tag_ != 'PRP':
            return False
    return True

#Testing code

doc = nlp(u'We can overtake them.')
if dep_pattern(doc) and pos_pattern(doc):
    print('Found')
else:
    print('Not found')

doc = nlp(u'I might send them a card as a reminder.')
if dep_pattern(doc) and pos_pattern(doc):
    print('Found')
else:
    print('Not found')


Found
Not found
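
The chain of `if` statements in `pos_pattern` can also be expressed as a lookup table from dependency label to expected tag. This is only a sketch of that design choice: `EXPECTED_TAGS` and `tags_match` are hypothetical names, and the token annotations are written by hand (assumed, not produced by running the model), so no spaCy pipeline is needed to run it.

```python
# Expected fine-grained tag for each dependency label of interest.
EXPECTED_TAGS = {'nsubj': 'PRP', 'aux': 'MD', 'ROOT': 'VB', 'dobj': 'PRP'}

def tags_match(tokens, expected=EXPECTED_TAGS):
    """tokens: iterable of (text, dep, tag) tuples."""
    for text, dep, tag in tokens:
        # Labels that are not in the table are simply ignored.
        if dep in expected and tag != expected[dep]:
            return False
    return True

# Hand-written annotations (assumptions for illustration only).
overtake = [('We', 'nsubj', 'PRP'), ('can', 'aux', 'MD'),
            ('overtake', 'ROOT', 'VB'), ('them', 'dobj', 'PRP')]
send_card = [('I', 'nsubj', 'PRP'), ('might', 'aux', 'MD'),
             ('send', 'ROOT', 'VB'), ('them', 'dative', 'PRP'),
             ('card', 'dobj', 'NN')]

print(tags_match(overtake))
print(tags_match(send_card))
```

Under these assumed annotations the second sentence is rejected because its direct object is a noun (`NN`) rather than a personal pronoun (`PRP`).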


The following script adds a pattern for *pronouns*. Why is this needed? Because once a word has been introduced in a text, later sentences often refer back to it with a *pronoun*. For example:

```
The trucks are traveling slowly. We can overtake them
```

In this text, *them* refers to *trucks*, which was mentioned earlier.

In [7]:
import spacy

nlp = spacy.load('en_core_web_sm')

def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+2] is always a valid index.
    for i in range(len(doc)-2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            for tok in doc[i+2].children:
                if tok.dep_ == 'dobj':
                    return True
    return False

def pos_pattern(doc):
    for token in doc:
        if token.dep_ == 'nsubj' and token.tag_ != 'PRP':
            return False
        if token.dep_ == 'aux' and token.tag_ != 'MD':
            return False
        if token.dep_ == 'ROOT' and token.tag_ != 'VB':
            return False
        if token.dep_ == 'dobj' and token.tag_ != 'PRP':
            return False
    return True

def pron_pattern(doc):
    plural = ['we','us','they','them']
    for token in doc:
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            # Lowercase the text so that capitalized pronouns are matched as well.
            if token.text.lower() in plural:
                return 'plural'
            else:
                return 'singular'
    return 'not found'

doc = nlp(u'We can overtake them.')

if dep_pattern(doc) and pos_pattern(doc):
    print('Found:', 'the pronoun in position of direct object is',
    pron_pattern(doc))
else:
    print('Not found')


Found: the pronoun in position of direct object is plural


## Using *Word Sequence Patterns* to Generate Text

In [8]:
import spacy

nlp = spacy.load('en_core_web_sm')

def dep_pattern(doc):
    # Stop two tokens before the end so that doc[i+2] is always a valid index.
    for i in range(len(doc)-2):
        if doc[i].dep_ == 'nsubj' and doc[i+1].dep_ == 'aux' and doc[i+2].dep_ == 'ROOT':
            for tok in doc[i+2].children:
                if tok.dep_ == 'dobj':
                    return True
    return False

def pos_pattern(doc):
    for token in doc:
        if token.dep_ == 'nsubj' and token.tag_ != 'PRP':
            return False
        if token.dep_ == 'aux' and token.tag_ != 'MD':
            return False
        if token.dep_ == 'ROOT' and token.tag_ != 'VB':
            return False
        if token.dep_ == 'dobj' and token.tag_ != 'PRP':
            return False
    return True

def pron_pattern(doc):
    plural = ['we','us','they','them']
    for token in doc:
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            # Lowercase the text so that capitalized pronouns are matched as well.
            if token.text.lower() in plural:
                return 'plural'
            else:
                return 'singular'
    return 'not found'

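def find_noun(sents, num):
    if num == 'plural':
        taglist = ['NNS','NNPS']
    elif num == 'singular':
        taglist = ['NN','NNP']
    else:
        # pron_pattern returned 'not found', so there is no noun to look for
        return 'Noun not found'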
    for sent in reversed(sents):
        for token in sent:
            if token.tag_ in taglist:
                return token.text
    return 'Noun not found'

def gen_utterance(doc, noun):
    sent = ''
    for i,token in enumerate(doc):
        if token.dep_ == 'dobj' and token.tag_ == 'PRP':
            sent = doc[:i].text + ' ' + noun + ' ' + doc[i+1:len(doc)-2].text + 'too.'
            return sent
    return 'Failed to generate an utterance' 

doc = nlp(u'The symbols are clearly distinguishable. I can recognize them promptly.')

sents = list(doc.sents)

response = ''

noun = ''

for i, sent in enumerate(sents):
    if dep_pattern(sent) and pos_pattern(sent):
        noun = find_noun(sents[:i], pron_pattern(sent))
        if noun != 'Noun not found':
            response = gen_utterance(sents[i],noun)
            break

print(response)

I can recognize symbols too.
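
The string-building step in `gen_utterance` can be isolated on a plain list of words. The sketch below is a simplified variant rather than the notebook's exact logic: `replace_pronoun` is a hypothetical helper that drops everything after the pronoun, whereas `gen_utterance` keeps the slice `doc[i+1:len(doc)-2]`.

```python
def replace_pronoun(words, pronoun_index, noun):
    """Keep the words before the pronoun, substitute the noun, and end with 'too.'."""
    kept = list(words[:pronoun_index]) + [noun]
    return ' '.join(kept) + ' too.'

# Tokens of the second sentence; 'them' sits at index 3.
words = ['I', 'can', 'recognize', 'them', 'promptly', '.']

print(replace_pronoun(words, 3, 'symbols'))
```

Joining a list of words avoids having to manage the spaces by hand, which the concatenation in `gen_utterance` has to do explicitly.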


## Information Extraction - *Walking Dependency Tree*

In [9]:
import spacy
nlp = spacy.load('en_core_web_sm')

#Here's the function that figures out the destination

def det_destination(doc):
    for i, token in enumerate(doc):
        if token.ent_type != 0 and token.ent_type_ == 'GPE':
            while True:
                token = token.head
                if token.text == 'to':
                    return doc[i].text
                if token.head == token:
                    return 'Failed to determine'
    return 'Failed to determine'

#Testing the det_destination function

doc = nlp(u'I am going to the conference in Berlin.')

dest = det_destination(doc)

print('It seems the user wants a ticket to ' + dest)


It seems the user wants a ticket to Berlin
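
The core of `det_destination` is a walk from a token up through its heads until the root is reached. The sketch below shows that walk on plain lists. The `heads` indices are written by hand and are an assumption about how the sentence is parsed, and `ancestors` is a hypothetical helper. As in spaCy, the root is represented as its own head.

```python
def ancestors(index, heads):
    """Yield the indices of a token's ancestors, from its head up to the root."""
    while heads[index] != index:  # the root is its own head
        index = heads[index]
        yield index

words = ['I', 'am', 'going', 'to', 'the', 'conference', 'in', 'Berlin', '.']
# Assumed head index for each token; 'going' (index 2) is the root.
heads = [2, 2, 2, 2, 5, 3, 5, 6, 2]

# Walk upward from 'Berlin' (index 7).
path = [words[i] for i in ancestors(7, heads)]
print(path)
print('to' in path)
```

With these assumed heads, the path from *Berlin* passes through *in*, *conference*, and *to* before reaching the root, which is why the function above accepts Berlin as the destination.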


### Summarizing Text

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"The product sales hit a new record in the first quarter, with 18.6 million units sold.")

phrase = ''

for token in doc:
    if token.pos_ == 'NUM':
        while True:
            phrase = phrase + ' ' + token.text
            token = token.head
            if token not in list(token.head.lefts):
                phrase = phrase + ' ' + token.text
                if list(token.rights):
                    phrase = phrase + ' ' + doc[token.i+1:].text
                break
        break

while True:
    token = doc[token.i].head
    if token.pos_ != 'ADP':
        phrase = token.text + phrase
    if token.dep_ == 'ROOT':
        break

for tok in token.lefts:
    if tok.dep_ == 'nsubj':
        phrase = ' '.join([left.text for left in tok.lefts]) + ' ' + tok.text + ' ' + phrase
        break

print(phrase.strip())


The product sales hit 18.6 million units sold.


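Earlier, the destination (Berlin) was determined only from the *to* + *GPE* pattern. This can be improved by using context: if that pattern is not found, fall back to any *GPE* in the sentence and ask the user to confirm it.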

In [11]:
import spacy

nlp = spacy.load('en_core_web_sm')

def det_destination(doc):
    for i, token in enumerate(doc):
        if token.ent_type != 0 and token.ent_type_ == 'GPE':
            while True:
                token = token.head
                if token.text == 'to':
                    return doc[i].text
                if token.head == token:
                    return 'Failed to determine'
    return 'Failed to determine'

def guess_destination(doc):
    for token in doc:
        if token.ent_type != 0 and token.ent_type_ == 'GPE':
            return token.text
    return 'Failed to determine'

def gen_response(doc):
    dest = det_destination(doc)
    if dest != 'Failed to determine':
        return 'When do you need to be in ' + dest + '?'
    dest = guess_destination(doc)
    if dest != 'Failed to determine':
        return 'You want a ticket to ' + dest +', right?'
    return 'Are you flying somewhere?'

doc = nlp(u'I am going to the conference in Berlin.')

print(gen_response(doc))


When do you need to be in Berlin?
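
The three-level fallback in `gen_response` can be looked at separately from the parsing. The sketch below is a hypothetical variant that uses `None` instead of the `'Failed to determine'` sentinel string. `choose_response` and its parameter names are assumptions made for illustration.

```python
def choose_response(confirmed_dest=None, guessed_dest=None):
    """Pick a reply based on how reliably the destination was determined."""
    if confirmed_dest is not None:
        # A GPE reached through 'to': treat it as the destination.
        return 'When do you need to be in ' + confirmed_dest + '?'
    if guessed_dest is not None:
        # Some GPE was mentioned: ask the user to confirm it.
        return 'You want a ticket to ' + guessed_dest + ', right?'
    # No place name at all.
    return 'Are you flying somewhere?'

print(choose_response(confirmed_dest='Berlin'))
print(choose_response(guessed_dest='Berlin'))
print(choose_response())
```

Using `None` keeps a real place name from ever being confused with the failure marker, at the cost of changing the return convention of the helper functions.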


## Modifiers for More Specific Meaning

In [12]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Kiwano has jelly-like flesh with a refreshingly fruity taste. This is a nice exotic fruit from Africa. It is definitely worth trying.")

fruit_adjectives = []
fruit_origins = []

for token in doc:
    if token.text == 'fruit':
        fruit_adjectives = fruit_adjectives + [modifier.text for modifier in token.lefts if modifier.pos_ == 'ADJ']
        fruit_origins = fruit_origins + [doc[modifier.i + 1].text for modifier in token.rights if modifier.text == 'from' and doc[modifier.i + 1].ent_type != 0]

print('The list of adjectival modifiers for word fruit:', fruit_adjectives)
print('The list of GPE names applicable to word fruit as postmodifiers:', fruit_origins)

The list of adjectival modifiers for word fruit: ['nice', 'exotic']
The list of GPE names applicable to word fruit as postmodifiers: ['Africa']
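
Collecting premodifiers amounts to filtering the tokens that are attached to a given head and stand to its left. The sketch below shows that filter on hand-written `(text, pos, head)` tuples. The annotations are assumptions about the parse of the second sentence, and `left_adjectives` is a hypothetical helper, not a spaCy function.

```python
def left_adjectives(tokens, head_index):
    """Return adjectives attached to the token at head_index that precede it.

    tokens: list of (text, pos, head) tuples, where head is a token index.
    """
    return [text for i, (text, pos, head) in enumerate(tokens)
            if head == head_index and i < head_index and pos == 'ADJ']

# 'This is a nice exotic fruit from Africa.' with assumed annotations.
tokens = [
    ('This', 'PRON', 1), ('is', 'AUX', 1), ('a', 'DET', 5),
    ('nice', 'ADJ', 5), ('exotic', 'ADJ', 5), ('fruit', 'NOUN', 1),
    ('from', 'ADP', 5), ('Africa', 'PROPN', 6), ('.', 'PUNCT', 1),
]

print(left_adjectives(tokens, 5))
```

The determiner *a* is also attached to *fruit* from the left, but it is filtered out by the part-of-speech test, mirroring the `modifier.pos_ == 'ADJ'` condition in the cell above.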
