<a href="https://colab.research.google.com/github/krispudzian/POS_Neighbors/blob/main/Syntax.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pay Attention to Your Neighbors: A Simple Approach to Part-of-Speech Tagging

Understanding our native language is one of the most natural skills we have. We produce and consume words as easily as we breathe. Yet understanding language is a complex process. Some words have multiple meanings, and we often hide the true meaning behind jokes or sarcasm. Fortunately, we can rely on context, we have a sense of humor, and we know what sarcasm is, so despite these challenges we usually get the meaning. Do computers get it? This is one of the biggest challenges in Natural Language Processing (NLP).


In this article, I write about part-of-speech tagging, sentence syntax, and words with multiple meanings. I give some examples of homonyms. At the end, I use spaCy, a Python library, to show how its pipeline handles the task of tagging.

## Table of Contents

1. [Part of Speech Tagging](#tagging)
2. [Syntax](#syntax)
3. [Information from Neighbors](#neighbors)
4. [Different Meanings of the Same Word](#diff-meaning)
5. [Examples](#examples)
    - [Leaves](#leaves)
    - [Rose](#rose)
    - [Arm](#arm)
    - [Kind](#kind)
6. [Wrap-Up](#wrap-up)
7. [References](#refs)



## Part of Speech Tagging <a name="tagging"></a>

Knowing a word by itself isn't enough in many NLP tasks, such as semantic analysis or information retrieval. For example, you can *read a book* or *book a flight*. How can we distinguish the different meanings of **book**? This is where knowing a word's part of speech helps.
A part of speech (POS) is a category of words with similar syntactic behavior, i.e. words that play similar roles within sentences. *Lexical category* is a fancier name for the same thing.

The process of assigning each word in a text to one of these categories is called part-of-speech tagging. It isn't an easy task, but taggers can draw on several sources of information.

### Main Sources of Information for POS Tagging

*   Knowledge of neighboring words. Many syntactic rules exclude some parts of speech in a given position.
*   Knowledge of word probabilities. For example, **man** can be a verb, but it is far more often a noun.
*   Prefixes: *unfaithful*, un- -> likely an adjective.
*   Suffixes: *importantly*, -ly -> likely an adverb.
*   Capitalization in the middle of a sentence -> likely a proper noun (a name, a country, etc.).
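The word-shape cues from the list (prefix, suffix, capitalization) can be sketched in a few lines of Python. This is only an illustration: `guess_pos` and its length thresholds are hypothetical choices, not part of any library, and such cues are weak on their own (*family* ends in -ly, and the pronoun *I* is capitalized mid-sentence).

```python
def guess_pos(word, sentence_initial=False):
    """Return a rough POS guess from the shape of a word, or None if no cue fires."""
    # Capitalization only counts as a cue away from the start of a sentence.
    if word[:1].isupper() and not sentence_initial:
        return "PROPN"
    lower = word.lower()
    # Suffix cue: -ly suggests an adverb (the length check is an arbitrary guard).
    if lower.endswith("ly") and len(lower) > 3:
        return "ADV"
    # Prefix cue: un- suggests an adjective (again with an arbitrary length guard).
    if lower.startswith("un") and len(lower) > 4:
        return "ADJ"
    return None


for word in ["importantly", "unfaithful", "Canada", "book"]:
    print(word, "->", guess_pos(word))
```

A real tagger treats cues like these as evidence to weigh against the context, not as hard rules.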

In this article, we focus on the first source of information. How is knowledge of neighboring words useful in the process of tagging? Time to get to know syntax.



## Syntax <a name="syntax"></a>

Syntax is the study of how components are ordered in phrases and sentences. Its rules let us combine words into phrases and sentences. Every language has a characteristic word order that its speakers follow without thinking. English speakers use the SVO (Subject-Verb-Object) order, and to them it feels like the only natural one.

Take the sentence *I study law*. This is a typical SVO sentence, where *I* is the subject (S), *study* is the verb (V), and *law* is the object (O). Now try to put these three words in any order other than SVO: *I law study* (SOV) or *Study I law* (VSO). Both sound extremely unnatural because that is not how English speakers form sentences.
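To see every possible ordering of the three roles at once, here is a small sketch using only the standard library. The names `ROLES` and `word_orders` are made up for this illustration.

```python
from itertools import permutations

# The example sentence from the text, keyed by grammatical role.
ROLES = {"S": "I", "V": "study", "O": "law"}


def word_orders(roles):
    """Map each ordering label (e.g. "SVO") to the words arranged in that order."""
    return {
        "".join(order): " ".join(roles[role] for role in order)
        for order in permutations(roles)
    }


for label, sentence in word_orders(ROLES).items():
    print(f"{label}: {sentence}")
```

Of the six orderings it prints, only SVO reads as an ordinary English statement.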

I know what you might think: "Why would you change the order of these words in the first place? Nobody speaks like this!" Not in English, but there are thousands of other languages. How does the following sentence sound to you? "He knows that I Canada already with my parents visited have." Weird, right? Yet this is the order in which native German speakers build the equivalent sentence. To make it even funnier, English and German belong to the same language group: West Germanic.

## Information from Neighbors <a name="neighbors"></a>

The point is that we cannot put words together in any order and expect a well-formed sentence. There are rules we must obey, and, as the previous section showed, we usually do. We immediately notice when someone breaks them. Computers, however, don't know these rules, so we need to teach them. Luckily, knowing the part of speech of a word tells us something about its neighbors. Here are some simplified syntactic rules for English:

1.   Adjectives never describe verbs, adverbs, or other adjectives
2.   Adjectives precede nouns or pronouns
3.   Determiners precede adjectives or nouns
4.   Determiners or adjectives never precede verbs
5.   Adverbs never describe nouns
6.   Pronouns are used in place of nouns
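Rules like these can be used to eliminate candidate tags for an ambiguous word. The sketch below encodes only rule 4 (determiners and adjectives never precede verbs) and the neighbor reading of rule 5 (a word right after an adverb is not a noun). The table `FORBIDDEN_AFTER` and the function `filter_candidates` are hypothetical names for this illustration, and real taggers learn such constraints statistically instead of hard-coding them.

```python
# For each previous tag, the tags that may NOT come directly after it.
# This is a deliberately tiny, simplified subset of the rules above.
FORBIDDEN_AFTER = {
    "DET": {"VERB"},  # rule 4: determiners never precede verbs
    "ADJ": {"VERB"},  # rule 4: adjectives never precede verbs
    "ADV": {"NOUN"},  # rule 5: adverbs never describe nouns
}


def filter_candidates(prev_tag, candidates):
    """Drop candidate tags that the previous word's tag rules out."""
    blocked = FORBIDDEN_AFTER.get(prev_tag, set())
    remaining = [tag for tag in candidates if tag not in blocked]
    # If the rules would eliminate everything, keep the original candidates.
    return remaining or list(candidates)


print(filter_candidates("DET", ["NOUN", "VERB"]))   # after "the"
print(filter_candidates("ADV", ["NOUN", "VERB"]))   # after "quickly"
print(filter_candidates("NOUN", ["NOUN", "VERB"]))  # after "father"
```

After a determiner only the noun reading survives, after an adverb only the verb reading survives, and after a noun these two rules alone cannot decide.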



## Different Meanings of the Same Word <a name="diff-meaning"></a>

We can't assign a part of speech by looking at a word alone, because many words have multiple meanings. Homonyms are words that share the same spelling and pronunciation but have different meanings and origins. We need context to pick the correct sense, and finding it is not an easy task for computers.
Word-sense disambiguation (WSD) is the process of identifying which sense of a word is used in a sentence.
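For tagging, the first step is simply knowing which words are ambiguous. A minimal sketch, assuming a hand-written lexicon limited to the words discussed in this article (`CANDIDATE_TAGS` and `ambiguous_words` are illustrative names, and the lexicon is far from complete):

```python
# Candidate parts of speech for the ambiguous words used in this article.
CANDIDATE_TAGS = {
    "book": {"NOUN", "VERB"},
    "man": {"NOUN", "VERB"},
    "leaves": {"NOUN", "VERB"},
    "rose": {"NOUN", "VERB"},
    "arm": {"NOUN", "VERB"},
    "kind": {"NOUN", "ADJ"},
}


def ambiguous_words(sentence):
    """Return the words of a sentence that have more than one candidate tag."""
    # Naive tokenization: split on whitespace and strip common punctuation.
    words = [word.strip(".,!?").lower() for word in sentence.split()]
    return [word for word in words if len(CANDIDATE_TAGS.get(word, ())) > 1]


print(ambiguous_words("He quickly rose from his seat."))
print(ambiguous_words("I like any kind of cheese."))
```

Only the words this function returns need the context to decide their tag; every other word is left to the remaining cues.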


## Examples <a name="examples"></a>

I'll give you some homonym examples here. Let's see if we can assign the correct part of speech using only the syntactic rules about neighbors.

In [1]:
import spacy


# Compare the full major version number, not just its first character
if int(spacy.__version__.split(".")[0]) < 3:
    !pip install -U pip setuptools wheel
    !pip install -U spacy
    !python -m spacy download en_core_web_sm
    print(spacy.__version__)
    print("Please restart runtime")
else:
    print(spacy.__version__)



3.0.6


In [2]:
from termcolor import colored
nlp = spacy.load('en_core_web_sm')


def displayer(sentences, colored_word=""):
    """Print each token of every sentence next to its coarse POS tag.

    Parameters:
        sentences: a string containing the sentences to display
        colored_word: a token to highlight in bold blue (exact, case-sensitive match)
    """
    doc = nlp(sentences)

    head_str = "{0:>12} | {1} ".format("Word", "POS")
    print(head_str)
    # Skip fragments of one or two tokens
    sents = [sent for sent in doc.sents if len(sent) > 2]

    for sent in sents:
        print(" -- "*10)
        # Drop the last token, assumed to be the sentence-final punctuation mark
        for token in sent[:-1]:
            fixed_str = "{0:>12} | {1}".format(token.text, token.pos_)
            if token.text == colored_word:
                print(colored(fixed_str, "blue", attrs=["bold"]))
            else:
                print(fixed_str)

### Leaves <a name="leaves"></a>
**Leaves** can be a noun, the plural of *leaf*, or a verb, the third-person singular form of *leave*.

In the first sentence, **leaves** directly follows a determiner, so it cannot be a verb and must be a noun. In the second one, it's a verb because it comes after the noun *father*, which acts as its subject.

In [3]:
sentences = "The children love to play in the leaves. They do not like when their father leaves for work."
displayer(sentences, "leaves")

        Word | POS 
 --  --  --  --  --  --  --  --  --  -- 
         The | DET
    children | NOUN
        love | VERB
          to | PART
        play | VERB
          in | ADP
         the | DET
      leaves | NOUN
 --  --  --  --  --  --  --  --  --  -- 
        They | PRON
          do | AUX
         not | PART
        like | VERB
        when | ADV
       their | PRON
      father | NOUN
      leaves | VERB
         for | ADP
        work | NOUN


### Rose <a name="rose"></a>

**Rose** can be a noun (a flower) or the past tense of the verb *rise*.

In the first sentence, **rose** comes directly after a determiner, so it cannot be a verb. In the second sentence, **rose** follows an adverb, so it cannot be a noun.

In [4]:
sentences = "My favorite flower is a rose. He quickly rose from his seat."
displayer(sentences, "rose")

        Word | POS 
 --  --  --  --  --  --  --  --  --  -- 
          My | PRON
    favorite | ADJ
      flower | NOUN
          is | AUX
           a | DET
        rose | NOUN
 --  --  --  --  --  --  --  --  --  -- 
          He | PRON
     quickly | ADV
        rose | VERB
        from | ADP
         his | PRON
        seat | NOUN


### Arm <a name="arm"></a>

**Arm** can be a noun (a part of the human body) or a verb meaning to equip with something that strengthens.

In the first sentence, **arm** follows a determiner, so it cannot be a verb. In the second sentence, **arm** sits between two nouns, *Schools* and *children*, and the sentence would have no verb without it, so it is unlikely to be a third noun.

In [5]:
sentences = "I have an ant bite on the arm. Schools arm children with an education."
displayer(sentences, "arm")

        Word | POS 
 --  --  --  --  --  --  --  --  --  -- 
           I | PRON
        have | VERB
          an | DET
         ant | ADJ
        bite | NOUN
          on | ADP
         the | DET
         arm | NOUN
 --  --  --  --  --  --  --  --  --  -- 
     Schools | NOUN
         arm | VERB
    children | NOUN
        with | ADP
          an | DET
   education | NOUN


### Kind <a name="kind"></a>
**Kind** can be a noun meaning *a type of something*, or an adjective meaning *caring*.

In the first sentence, **kind** follows a determiner and precedes a noun, so it's most likely an adjective. In the second sentence, **kind** follows a determiner and precedes a preposition. No noun follows for the determiner to attach to, so **kind** itself is most likely the noun.

In [6]:
sentences = "He is a kind person. I like any kind of cheese."
displayer(sentences, "kind")

        Word | POS 
 --  --  --  --  --  --  --  --  --  -- 
          He | PRON
          is | AUX
           a | DET
        kind | ADJ
      person | NOUN
 --  --  --  --  --  --  --  --  --  -- 
           I | PRON
        like | VERB
         any | DET
        kind | NOUN
          of | ADP
      cheese | NOUN


As you can see, spaCy does a great job of identifying the correct part of speech, at least for simple sentences. I did come across one mistake that I find interesting.

In the sentence below, **Kind** is tagged as an adverb, and I don't know why. **Kind** can never be an adverb; **kindly** is its adverbial counterpart. Additionally, **kind** here describes a noun, which adverbs never do.

In [7]:
sentences = "Kind people make the world a better place."
displayer(sentences, "Kind")

        Word | POS 
 --  --  --  --  --  --  --  --  --  -- 
        Kind | ADV
      people | NOUN
        make | VERB
         the | DET
       world | NOUN
           a | DET
      better | ADJ
       place | NOUN
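A neighbor rule can catch this kind of error after the fact. The sketch below takes the tags printed above as plain `(word, tag)` pairs and flags any adverb that sits directly before a noun, following rule 5. The function name `suspicious_adverbs` is made up for this illustration, and the check is a heuristic that only marks tags worth a second look.

```python
# The (word, tag) pairs printed by the cell above.
TAGGED = [
    ("Kind", "ADV"), ("people", "NOUN"), ("make", "VERB"), ("the", "DET"),
    ("world", "NOUN"), ("a", "DET"), ("better", "ADJ"), ("place", "NOUN"),
]


def suspicious_adverbs(tagged):
    """Return words tagged ADV that are immediately followed by a NOUN."""
    flagged = []
    # Walk over adjacent pairs: (current token, next token).
    for (word, tag), (_, next_tag) in zip(tagged, tagged[1:]):
        if tag == "ADV" and next_tag == "NOUN":
            flagged.append(word)
    return flagged


print(suspicious_adverbs(TAGGED))
```

On this sentence the only flagged word is *Kind*.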


## Wrap-Up <a name="wrap-up"></a>

In this article, I focused on grammar and basic syntactic rules. They are so obvious to us that we obey them unconsciously. One of the tasks of NLP is to teach computers to pay attention to these rules.

## References <a name="refs"></a>


1. [Sequence Labeling for Parts of Speech](https://web.stanford.edu/~jurafsky/slp3/8.pdf)
2. [Homonyms](https://grammar.yourdictionary.com/for-students-and-parents/words-with-multiple-meanings.html)
3. [Video about English Syntax](https://www.youtube.com/watch?v=n9168PgGHBc&ab_channel=EvanAshworth)
4. [Some Methods for Sequence Models for POS Tagging](https://www.youtube.com/watch?v=SkaPqBDPzKc&list=PLoROMvodv4rOFZnDyrlW3-nI7tMLtmiJZ&index=58&t=5s&ab_channel=stanfordonline)

