## CS585: Natural Language Processing
### Overview + Finite State Automata
<br><br>
#### Illinois Institute of Technology  
#### Aron Culotta



<br><br><br>

## What is Natural Language?


## Natural vs. Unnatural (Formal) Languages

**Natural**
- Emerges from intelligent beings
- We **discover** the grammar.
- Full of ambiguity
- English, Spanish, Dolphin Language?

**Formal**
- Defined by humans
- We **prescribe** the grammar.
- Designed to **remove** ambiguity
- Python, math, ...

## NLP Examples

![figs/watson.jpg](figs/watson.jpg)
<br><br><br><br><br><br>

![figs/siri.png](figs/siri.png)
<br><br><br><br><br><br>

![figs/translate.jpg](figs/translate.jpg)
<br><br><br><br><br><br>

![figs/echo.jpg](figs/echo.jpg)
<br><br><br><br><br><br>

![figs/her.jpg](figs/her.jpg)

In [1]:
%%HTML
<video width="600" height="400" controls>
  <source src="figs/dave.mp4" type="video/mp4">
</video>
<br><br><br><br><br><br><br><br><br><br><br><br>


<br><br><br><br><br><br>

#### Natural Language Processing = Linguistics + Logic + Machine Learning

Each has a long history...



## NLP History

- Dates back to first days of computing (Turing)
- Combines linguistics, formal logic, and statistics


## NLP History
### 1940s-1950s

**Logic**
- automaton (Turing)
- Context-free grammars (Chomsky)

**Statistics**
- McCulloch-Pitts neuron (perceptron)
- Noisy channel model (Shannon)

#### Turing test (Alan Turing, 1950)


Q: Please write me a sonnet on the topic of the Forth Bridge.  
A: Count me out on this one. I never could write poetry.  


Q: Add 34957 to 70764  
A: (Pause about 30 seconds and then give answer as) 105621.

<br><br><br><br>

## NLP History

### 1957-1970

**Logic**
- Theorem proving (Newell & Simon) **[n.b. Herb Simon was chair of PolySci dep't at IIT from '42-'49]**
- Logical inference

**Statistics**
- Neural nets
- Bayesian statistics (e.g., identify author of text)


#### ELIZA (1964, Weizenbaum)

User: You are like my father in some ways.  
ELIZA: WHAT RESEMBLANCE DO YOU SEE  
User: You are not very aggressive but I think you don’t want me to notice that.  
ELIZA: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE  
User: You don’t argue with me.  
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU?  
User: You are afraid of me.  
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

<br><br><br><br>

## NLP History
### 1970-1983

**Logic**
  - Winograd's SHRDLU

**Statistics**
  - speech recognition (AT&T Bell Labs, IBM)
  - Hidden Markov Models

![figs/shrdlu.png](figs/shrdlu.png)


## NLP History

### 1983-1993
- Return to finite state models
- Empirical (data-driven) approach: IBM speech recognition

### 1994-1999
- Empirical approach widespread
- Bayesian statistics
- graphical models

### 2000s
- Combinations of logical and empirical approaches
  - Markov logic networks, etc.
- Deep learning
  - revival of neural nets from 1960s

## Linguistics

- **Phonetics and Phonology:** The study of linguistic sounds.
  - /fəˈnediks/

<br><br><br><br>

- **Morphology:** The study of the meaningful components of words.

![figs/morph.png](figs/morph.png)

<br><br><br><br>

- **Syntax:** The study of the structural relationships between words.
  -  "*I’m I do, sorry that afraid Dave I’m can’t.*"
  
![figs/dog.png](figs/dog.png)

<br><br><br><br>

- **Semantics:** The study of meaning.

![figs/green.png](figs/green.png)


<br><br><br><br>
- **Pragmatics:** The study of how language is used to accomplish goals.
  - "*Honey, do you think it's cold in here?*"

<br><br><br><br>

- **Discourse:** The study of linguistic units larger than a single utterance.
  - **Dave**: Open the pod bay doors, HAL.
  - **HAL**: I'm sorry Dave, I can't do **<font color=blue>that</font>.**



<br><br><br><br>

## Ambiguity: The Good and the Bad

- Makes language fun and interesting for humans, but makes language difficult for computers.
- The central problem in NLP is **resolving ambiguity**.


- E.g., "*I made her duck*."

<br><br><br><br><br><br><br><br>



1. I cooked waterfowl for her.
2. I cooked waterfowl belonging to her.
3. I created the (plaster?) duck she owns.
4. I caused her to quickly lower her head or body.
5. I waved my magic wand and turned her into undifferentiated waterfowl.


- Syntactic ambiguity (1 vs 4): "duck" $\rightarrow$ verb or noun?  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **part-of-speech tagging, syntactic parsing**
- Semantic ambiguity (1 vs 3): "make" $\rightarrow$ *create* or *cook*? &nbsp;&nbsp; **word sense disambiguation**
- Phonetic ambiguity: "I" or "eye"; "made" or "maid"?  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; **speech recognition**

<br><br><br>

## Models & Algorithms

- State machines
- Rule systems
- Logic
- Probability
- Dynamic programming
- Machine Learning

## Diving right in with a simple example...
## Sheep Language Processing


![figs/wolf.jpg](figs/wolf.jpg)

Is this string something a sheep would say?

```
baa!  
baaa!  
baaaa!  
baaaaa!  
baaaaaa!  
.  
.  
.
```
**but not**

```
bark  
woof  
meow
.
.
.
```

<br><br><br><br><br><br>

#### Method 1: Regular Expressions

In [2]:
import re

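def is_a_sheep(text):
    # fullmatch anchors both ends of the string, so 'baa!\n' is rejected too
    # ('$' alone would still match just before a trailing newline).
    return re.fullmatch(r'baa+!', text) is not None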
        
is_a_sheep('baa!')

True

In [3]:
is_a_sheep('baaaaaa!')

True

In [4]:
is_a_sheep('ba!')

False

In [5]:
is_a_sheep('baa')

False

<br><br><br><br><br><br>

#### Method 2: Finite State Automata

<br><br><br><br><br><br>
![figs/baa.png](figs/baa.png)

![figs/transition.png](figs/transition.png)


## Finite State Automata, Formally

- $Q$: a finite set of $N$ states $\{q_0 \ldots q_{N-1}\}$
- $\Sigma:$ a finite input alphabet of symbols
- $q_0$: the start state
- $F$: the set of final states, $F \subseteq Q$
- $\delta(q,i)$: the transition function  between states. $Q \times \Sigma \rightarrow Q$
  - Given a state $q \in Q$ and an input symbol  $i \in \Sigma$, return a new state $q' \in Q$.

In the sheep example above:
- $Q=\{q_0, q_1, q_2, q_3, q_4\}$
- $\Sigma = \{a,b,!\}$
- $ F = \{q_4\}$
- $\delta(q,i)$ is the transition table above in Fig 2.12.

## Recognizing a string with an FSA

![figs/tape.png](figs/tape.png)

- Given an FSA and a string, read each symbol and transition among states according to $\delta(q,i)$
- If all symbols are read, and the machine ends in one of the final states $F$, then the string is **accepted.**
- Else, the string is **rejected.**
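
The recognition procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the course's reference implementation: the names `SHEEP_DELTA`, `SHEEP_FINAL`, and `recognize` are made up for illustration, and the transition table is an assumption chosen to agree with the regular expression `baa+!` and the five states $q_0 \ldots q_4$.

```python
# Assumed transition table for the sheep FSA: (state, symbol) -> next state.
# Pairs that are missing from the table are undefined transitions.
SHEEP_DELTA = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # self-loop: any number of extra a's
    (3, "!"): 4,
}
SHEEP_FINAL = {4}


def recognize(text, delta=SHEEP_DELTA, start=0, final=SHEEP_FINAL):
    """Return True if the deterministic FSA accepts text, else False."""
    state = start
    for symbol in text:
        if (state, symbol) not in delta:
            return False          # undefined transition: reject
        state = delta[(state, symbol)]
    # All symbols were read; accept only if we ended in a final state.
    return state in final


print(recognize("baaa!"))  # accepted
print(recognize("baa"))    # rejected: input ends before reaching q_4
```

Storing $\delta(q,i)$ as a dictionary keeps the code close to the transition table: one entry per filled-in cell.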

## What about undefined transitions?

E.g., "*baZ*"

<br><br><br><br>

![figs/fail.png](figs/fail.png)
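
One way to handle undefined transitions is to send every missing (state, symbol) pair to an explicit fail (sink) state, which makes the transition function total. The sketch below assumes the sheep transition table used in this lecture; the helper names and the state label `q_fail` are hypothetical.

```python
def add_fail_state(delta, states, alphabet, fail="q_fail"):
    """Return a copy of delta in which every missing (state, symbol)
    pair, including those leaving the fail state, maps to fail."""
    total = dict(delta)
    for q in list(states) + [fail]:
        for symbol in alphabet:
            total.setdefault((q, symbol), fail)
    return total


# Assumed (partial) sheep transition table: (state, symbol) -> next state.
sheep_delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4}
total_delta = add_fail_state(sheep_delta, states=range(5), alphabet="ab!")


def recognize_total(text, delta, start=0, final=frozenset({4}), fail="q_fail"):
    """Run the FSA; symbols outside the alphabet (e.g. 'Z') also fail."""
    state = start
    for symbol in text:
        state = delta.get((state, symbol), fail)
    return state in final


print(recognize_total("baZ", total_delta))    # rejected via the fail state
print(recognize_total("baaa!", total_delta))  # accepted
```

Once the machine enters the fail state it can never leave, so the string is rejected no matter what follows.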



## Formal Language

- The regular expression and FSA above are equivalent ways of defining a **formal language.**

**Formal Language:** a (possibly infinite) set of strings composed of symbols from a finite alphabet.

E.g., for sheep:

$$L(m) = \{baa!, baaa!, baaaa!, baaaaa!, ...\}$$

<br><br>

A key approach in NLP: Convert a natural language to a formal language that is "close."


![figs/automata.png](figs/automata.png)

## Regular Languages

- Class of languages definable by a regular expression
- Class of languages definable by a finite state automaton.


1. $\emptyset$ is a regular language
2. $\forall a \in \Sigma \cup \{\epsilon\}, \{a\}$ is a regular language.
3. If $L_1$ and $L_2$ are regular languages, then so are:
   1. $L_1 \cdot L_2 = \{xy $ $  | $ $ x \in L_1 , y \in L_2\}$ (**concatenation**)
   2. $L_1 \cup L_2$ (**union**)
   3. $L_1^*$ (**Kleene closure**)
   
   
E.g., if $L_1=\{a,b\}$ and $L_2 = \{c,d\}$:
  - $L_1 \cdot L_2 = \{ac, ad, bc, bd\}$
  - $L_1 \cup L_2 = \{a, b, c, d\}$
  - $L_1^* = \{\epsilon, a, b, aa, ab, ba, bb, aaa, \ldots\}$ (all strings over $\{a, b\}$, including the empty string)
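
The three operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch with hypothetical helper names; because the Kleene closure is infinite, `kleene_up_to` enumerates only the strings up to a chosen length.

```python
def concat(l1, l2):
    """Concatenation: every string of l1 followed by every string of l2."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """Union: strings that are in l1 or in l2."""
    return set(l1) | set(l2)


def kleene_up_to(lang, max_len):
    """Strings of lang* whose length is at most max_len."""
    result = {""}        # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        # Extend each newly found string by one more string from lang.
        frontier = {x + y for x in frontier for y in lang
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result


L1, L2 = {"a", "b"}, {"c", "d"}
print(sorted(concat(L1, L2)))
print(sorted(union(L1, L2)))
print(sorted(kleene_up_to(L1, 2), key=lambda s: (len(s), s)))
```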

## FSA for Words

- E.g., represent the formal language of all valid expressions of dollar amounts.
- Alphabet is now **words** instead of **letters**

![figs/money.png](figs/money.png)

<br><br><br><br>

![figs/money2.png](figs/money2.png)

<br><br><br><br>

## Non-deterministic FSA

![figs/nfsa.png](figs/nfsa.png)

An FSA whose transition function is not **fully determined**: a state may have more than one possible next state for the same input symbol.

- E.g., if we see the letter *a* in state $q_2$, we can either go to $q_2$ **OR** $q_3$.

- NFSAs are often more convenient for defining a language.

- **Note:** Every NFSA can be converted to an equivalent FSA (though, possibly with an exponential number of states).

![figs/nfsa2.png](figs/nfsa2.png)
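
The conversion from an NFSA to an equivalent deterministic FSA is usually done with the subset construction, where each deterministic state is a *set* of NFSA states. The sketch below assumes a sheep NFSA with the $q_2$ choice described above and no $\epsilon$-transitions; the table and function names are illustrative, not taken from the figures.

```python
from collections import deque

# Assumed NFSA: (state, symbol) -> set of possible next states.
NFSA_DELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},   # the non-deterministic choice
    (3, "!"): {4},
}


def determinize(delta, start, final, alphabet):
    """Subset construction (no epsilon moves). Returns the deterministic
    transition table, its start state, and its set of final states."""
    start_set = frozenset({start})
    dfa_delta, dfa_final = {}, set()
    seen, queue = {start_set}, deque([start_set])
    while queue:
        current = queue.popleft()
        if current & final:
            dfa_final.add(current)   # contains at least one NFSA final state
        for symbol in alphabet:
            target = frozenset(nxt for q in current
                               for nxt in delta.get((q, symbol), ()))
            if not target:
                continue             # leave the transition undefined
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_final


def accepts(text, dfa_delta, start_set, dfa_final):
    """Run the determinized machine on text."""
    state = start_set
    for symbol in text:
        state = dfa_delta.get((state, symbol))
        if state is None:
            return False
    return state in dfa_final


dfa_delta, dfa_start, dfa_final = determinize(NFSA_DELTA, 0, {4}, "ab!")
print(accepts("baaa!", dfa_delta, dfa_start, dfa_final))
```

For this small machine only five subsets are reachable, but in the worst case the construction can produce exponentially many.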

## Dealing with ambiguity

- **Backup:** Whenever we come to a choice point, we could put a marker to mark where we were in the input, and what state the automaton was in. Then if it turns out that we took the wrong choice, we could back up and try another path.
  - E.g., depth-first and breadth-first search
- **Look-ahead:** We could look ahead in the input to help us decide which path to take.
- **Parallelism:** Whenever we come to a choice point, we could look at every alternative path in parallel.
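
Two of these strategies can be sketched for NFSA recognition. `recognize_backup` keeps an agenda of (state, input position) markers and returns to them when a path fails (depth-first or breadth-first, depending on which end of the agenda is popped), while `recognize_parallel` tracks every reachable state at once. The NFSA table and all names here are assumptions for illustration.

```python
from collections import deque

# Assumed sheep NFSA: (state, symbol) -> set of possible next states.
NFSA_DELTA = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
NFSA_FINAL = {4}


def recognize_backup(text, delta=NFSA_DELTA, start=0, final=NFSA_FINAL,
                     depth_first=True):
    """Backup: search over (state, position) markers kept on an agenda."""
    agenda = deque([(start, 0)])
    visited = {(start, 0)}
    while agenda:
        # Popping from the right gives depth-first, from the left breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(text):
            if state in final:
                return True
            continue              # dead end: back up to another marker
        for nxt in delta.get((state, text[pos]), ()):
            if (nxt, pos + 1) not in visited:
                visited.add((nxt, pos + 1))
                agenda.append((nxt, pos + 1))
    return False


def recognize_parallel(text, delta=NFSA_DELTA, start=0, final=NFSA_FINAL):
    """Parallelism: follow all alternative paths simultaneously."""
    current = {start}
    for symbol in text:
        current = {nxt for q in current for nxt in delta.get((q, symbol), ())}
        if not current:
            return False          # every path has died
    return bool(current & final)


for s in ["baa!", "baaaa!", "ba!", "baZ"]:
    print(s, recognize_backup(s), recognize_parallel(s))
```

Both functions accept exactly the same strings; they differ only in how the search space is explored.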

**State-space search is a key part of NLP algorithms**

"*I made her duck*."

**To the syllabus!**
<https://github.com/iit-cs585/main>

#### image sources

- https://www.cs.colorado.edu/~martin/SLP/

- https://www.washingtonpost.com/business/on-it/how-ibm-is-trying-to-commercialize-watson/2014/05/09/4f552506-d23c-11e3-937f-d3026234b51c_story.html

- http://www.howtogeek.com/229308/26-actually-useful-things-you-can-do-with-siri/

- http://mashable.com/2015/01/14/google-translate-word-lens/

- https://www.youtube.com/watch?v=ng7Sti29S5k

- http://www.kurzweilai.net/a-review-of-her-by-ray-kurzweil

- https://www.youtube.com/watch?v=9W5Am-a_xWw

- http://mosermichael.github.io/cstuff/all/blog/2015/02/05/nlp-revisited.html

- http://all-about-linguistics.group.shef.ac.uk/branches-of-linguistics/morphology/what-is-morphology/

- http://english.stackexchange.com/questions/294993/ambiguous-syntax-tree-and-phrase-structure-rules

- https://en.wikipedia.org/wiki/Talk%3AColorless_green_ideas_sleep_furiously

- http://www.salem-news.com/articles/september102009/oxycontin_wolf_9-10-09.php