## Understanding the Structure of a Sentence

There are two major components of NLP.

1. **Natural language understanding** (NLU) is considered the first component of NLP. NLU is considered an Artificial Intelligence-Hard (AI-Hard) or Artificial Intelligence-Complete (AI-Complete) problem. NLU is defined as the process of converting NL input into a useful representation by using computational linguistics tools.

NLU requires the following analyses to convert NL into a useful representation:
* Morphological analysis
* Lexical analysis
* Syntactic analysis
* Semantic analysis
* Handling ambiguity
* Discourse integration
* Pragmatic analysis

2. **Natural language generation** (NLG) is considered the second component of NLP. NLG is defined as the process of generating NL as output by a machine. The output should be logical, meaning that whatever NL the machine generates should make sense.

### Branches of NLP

NLP involves two major branches that help us develop NLP applications: the computational branch, **Computer Science**, and the **Linguistics** branch.

### Defining context-free grammar

Now let's focus on NLU, and to understand it, first we need to understand context-free grammar (CFG) and how it is used in NLU.

Context-free grammar is defined by its four main components. Those four components are shown in this symbolic representation of CFG:
* A set of non-terminal symbols, **N**
* A set of terminal symbols, **T**
* A start symbol, **S**, which is a non-terminal symbol
* A set of rules called **production rules**, **P**, for generating sentences

Let's take an example to get a better understanding of the context-free grammar terminology:

$X \rightarrow \alpha$

Here, $X \rightarrow \alpha$ is called a phrase structure rule or production rule, and it belongs to $P$. $X \in N$ means $X$ belongs to the set of non-terminal symbols; $\alpha \in (N \cup T)^*$ means $\alpha$ is a sequence of terminal and/or non-terminal symbols. $X$ can be rewritten in the form of $\alpha$. The rule tells you which element can be rewritten to generate a sentence, and in what order the resulting elements appear.
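As a minimal sketch of these definitions, a grammar can be written down as the four components and a production rule applied as a single rewriting step. The concrete symbols and the helper names `is_valid_rule` and `rewrite` are illustrative assumptions, not part of any library.

```python
# A CFG as four components (N, T, S, P); the concrete symbols are illustrative.
N = {"S", "NP", "VP"}            # non-terminal symbols
T = {"he", "likes", "cricket"}   # terminal symbols
S = "S"                          # start symbol, must be a non-terminal
P = [("S", ["NP", "VP"])]        # production rules: (left-hand side, right-hand side)


def is_valid_rule(rule, nonterminals, terminals):
    """Check that the left side is a non-terminal and every
    right-side symbol is either a non-terminal or a terminal."""
    lhs, rhs = rule
    return lhs in nonterminals and all(
        sym in nonterminals or sym in terminals for sym in rhs
    )


def rewrite(form, position, rule):
    """Rewrite the symbol at `position` in `form` using one production rule."""
    lhs, rhs = rule
    if form[position] != lhs:
        raise ValueError(f"expected {lhs!r} at position {position}")
    return form[:position] + rhs + form[position + 1:]


assert S in N
assert all(is_valid_rule(r, N, T) for r in P)
print(rewrite(["S"], 0, P[0]))  # ['NP', 'VP']
```

Each call to `rewrite` replaces exactly one non-terminal, so the order of the symbols on the right-hand side is preserved in the result.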

So, let's start from there. I want to generate the following sentence:

*He likes cricket.*

In order to generate the preceding sentence, I'm proposing the following production rules:
* R1: S -> NP VP
* R2: NP -> N
* R3: NP -> Det N
* R4: VP -> V NP
* R5: VP -> V
* R6: N -> Person Name | He | She | Boy | Girl | It | cricket | song | book
* R7: V -> likes | reads | sings

![image.png](attachment:737474a3-8754-43ca-994e-f1744b26fbd5.png)

Now, let's see how the parse tree is generated:
* According to the production rules, S can be rewritten as a combination of a noun phrase (NP) and a verb phrase (VP); see rule R1.
* NP can be further rewritten as either a noun (N) or a determiner (Det) followed by a noun; see rules R2 and R3.
* VP can be rewritten as a verb (V) followed by an NP, or as just V; see rules R4 and R5.
* N can be rewritten as Person Name, He, She, and so on; see rule R6. N is a non-terminal symbol, and the words on the right-hand side of R6 are terminal symbols.
* V can be rewritten as any of the options on the right-hand side of rule R7. V is also a non-terminal symbol, and the verbs it rewrites to are terminal symbols.
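The walk-through above can be sketched as a small top-down parser that tries rules R1 to R7 and returns a parse tree as nested tuples. This is a hedged illustration in plain Python, not the method used to draw the figure: the function names are hypothetical, words are lowercased, "Person Name" is omitted, and the `Det` entries are an assumption because the rules do not list any determiners.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],                      # R1
    "NP": [["N"], ["Det", "N"]],              # R2, R3
    "VP": [["V", "NP"], ["V"]],               # R4, R5
    "N": [["he"], ["she"], ["boy"], ["girl"], ["it"],
          ["cricket"], ["song"], ["book"]],   # R6 (lowercased, "Person Name" omitted)
    "V": [["likes"], ["reads"], ["sings"]],   # R7
    "Det": [["a"], ["the"]],                  # assumption: not defined in the rules
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for every way `symbol` matches tokens[start:]."""
    if symbol not in GRAMMAR:  # terminal symbol: must equal the next token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, start)


def expand(symbol, rhs, tokens, start):
    """Match the symbols of one right-hand side in order, left to right."""
    partial = [([], start)]
    for child in rhs:
        partial = [
            (kids + [tree], nxt)
            for kids, pos in partial
            for tree, nxt in parse(child, tokens, pos)
        ]
    for kids, end in partial:
        yield (symbol, kids), end


def parse_sentence(sentence):
    """Return the first parse tree that covers the whole sentence, or None."""
    tokens = sentence.lower().rstrip(".").split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):
            return tree
    return None


print(parse_sentence("He likes cricket."))
```

For *He likes cricket.* the returned tree nests S over an NP (N: he) and a VP (V: likes, followed by an NP with N: cricket), which mirrors rules R1, R2, R4, R6, and R7. The sketch relies on the grammar having no left-recursive rules; a grammar with such rules would need a different parsing strategy.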

Here, we have seen a very basic and simple example of CFG. Context-free grammar is also
called **phrase structure grammar**.

Exercise
1. Using the rules given previously in this section, generate a parse tree for the following sentence:
   *She sings a song.*
2. Generate production rules and make a parse tree for the following sentence:
   *That boy is reading a book.*

https://yohasebe.com/rsyntaxtree/

![image.png](attachment:6aeea587-b06f-4383-946a-f9a0f7f60ab9.png)

* R1: S -> NP VP
* R2: NP -> N
* R3: NP -> Det N
* R4: VP -> V NP
* R5: VP -> V
* R6: VP -> AV GV NP
* R7: N -> Person Name | He | She | Boy | Girl | It | cricket | song | book
* R8: V -> likes | reads | sings
* R9: AV -> is
* R10: GV -> reading

![image.png](attachment:1eb1198c-7473-4c72-abad-5c11b2f9746e.png)
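One way to sanity-check the extended rules R1 to R10 is to enumerate every sentence they can generate and confirm that the target sentence is among them. The sketch below assumes lowercased words, omits "Person Name", and adds `Det -> that | a` as an assumption, since the rules use `Det` without listing its words. The `generate` helper is hypothetical.

```python
from itertools import product

GRAMMAR = {
    "S": [["NP", "VP"]],                              # R1
    "NP": [["N"], ["Det", "N"]],                      # R2, R3
    "VP": [["V", "NP"], ["V"], ["AV", "GV", "NP"]],   # R4, R5, R6
    "N": [["he"], ["she"], ["boy"], ["girl"], ["it"],
          ["cricket"], ["song"], ["book"]],           # R7 (lowercased, "Person Name" omitted)
    "V": [["likes"], ["reads"], ["sings"]],           # R8
    "AV": [["is"]],                                   # R9
    "GV": [["reading"]],                              # R10
    "Det": [["that"], ["a"]],                         # assumption: not listed in the rules
}


def generate(symbol):
    """Return every terminal string (as a tuple of words) derivable from `symbol`."""
    if symbol not in GRAMMAR:          # terminal symbol: derives only itself
        return [(symbol,)]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(generate(child) for child in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results


sentences = {" ".join(words) for words in generate("S")}
print(len(sentences))                               # 2376
print("that boy is reading a book" in sentences)    # True
```

Exhaustive enumeration only terminates here because none of the rules is recursive, so the language is finite. It also shows that the grammar overgenerates: strings such as *it sings that cricket* are derivable even though they are not meaningful English.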

### Morphological analysis

Morphology is the branch of linguistics that studies how words are structured and formed.

What are morphemes?

In linguistics, a morpheme is the smallest meaningful unit of a given language. Morphemes are the basic unit of morphology. Let's take an example. The word *boy* consists of a single morpheme, whereas *boys* consists of two morphemes: *boy* and *-s*.

Many morphemes are **affixes**, which attach to a stem. Affixes can be divided into four types:
1. Prefixes, which appear before a stem, such as **un**happy
2. Suffixes, which appear after a stem, such as happi**ness**
3. Infixes, which appear inside a stem, such as b**um**ili (this means buy in Tagalog, a language from the Philippines)
4. Circumfixes, which surround a stem by attaching to both its beginning and its end, such as **ka**baddang**an** (this means help in Tuwali Ifugao, another language from the Philippines)

Morphological analysis is used in word segmentation and in **Part-of-Speech (POS)** tagging.
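To make the idea of splitting a word into morphemes concrete, here is a deliberately naive affix-stripping sketch. The affix lists contain only the English examples from this section, and the `segment` function and the `MIN_STEM` threshold are illustrative assumptions rather than a real morphological analyzer.

```python
PREFIXES = ("un",)          # illustrative, from "unhappy"
SUFFIXES = ("ness", "s")    # illustrative, from "happiness" and "boys"; longest first
MIN_STEM = 3                # assumption: never leave a stem shorter than this


def segment(word):
    """Split `word` into its prefix, stem, and suffix using the toy affix lists."""
    prefix = ""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
            prefix = p
            break
    rest = word[len(prefix):]

    suffix = ""
    for s in SUFFIXES:
        if rest.endswith(s) and len(rest) - len(s) >= MIN_STEM:
            suffix = s
            break
    stem = rest[:len(rest) - len(suffix)]

    # Keep only the parts that were actually found.
    return [m for m in (prefix, stem, suffix) if m]


for word in ("boy", "boys", "unhappy", "happiness"):
    print(word, "->", segment(word))
```

With these lists, *boys* splits into `boy` and `s`, while *boy* stays a single morpheme. Pure string matching over-segments easily (for example, *uncle* would be split into `un` and `cle`), which is why practical analyzers rely on a lexicon or learned rules instead.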

![image.png](attachment:26a445e8-70e5-4e93-b60c-9961fa3b89ca.png)

#### What is a word?