# Week 1 - Natural Language Content Analysis

## Overview

- **Le**xical analysis (POS tagging)
- **S**yntactic analysis
- **S**emantic analysis
- **P**ragmatic analysis
- **D**iscourse analysis

**LeSSPD**

## Lexical analysis (POS tagging)

Also known as *part-of-speech (POS) tagging*.

Determine the basic units of a sentence and the meaning of each unit.

This entails labeling each word in a sentence according to its syntactic category, e.g.,
- noun
- verb
- preposition, etc.

This gives the category of each word, but not yet the structure or meaning of the sentence.
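
As a rough illustration of labeling words by category, here is a minimal sketch of a lookup-based tagger. The lexicon, the tag names, and the `tag_words` helper are all hypothetical and made up for this example; real taggers use context and statistical models rather than a fixed dictionary.

```python
# Hypothetical toy lexicon mapping each word to a single POS tag (illustration only).
TOY_LEXICON = {
    "a": "DET",
    "the": "DET",
    "dog": "NOUN",
    "boy": "NOUN",
    "playground": "NOUN",
    "is": "AUX",
    "chasing": "VERB",
    "on": "PREP",
}


def tag_words(sentence: str) -> list[tuple[str, str]]:
    """Label each whitespace-separated word with its lexicon tag, or 'UNK'."""
    return [(word, TOY_LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]


tagged = tag_words("A dog is chasing a boy on the playground")
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

A fixed dictionary cannot handle words that belong to more than one category, which is exactly the word-level ambiguity discussed later in these notes.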

## Syntactic analysis (parsing)

This step determines the relationships between the words in a sentence, grouping them into phrases such as noun phrases, verb phrases, and prepositional phrases, and hence reveals the syntactic structure of the sentence.


## Semantic analysis

This step uses the meaning of words and the syntactic structure from the previous two steps to determine the meaning of a sentence or a larger linguistic unit.

## Pragmatic analysis

This step determines the meaning in context, e.g., the reason(s) behind the actions described in a sentence. Put differently, the goal here is to understand the purpose of the communication.

## Discourse analysis

Analyze a large chunk of text, e.g., a set of sentences, by taking into account the connections between the sentences and interpreting the meaning of each sentence in context.


# Challenges (WAPA)

Ambiguities arise because natural language is not designed for computers.

Computers also lack the background knowledge needed to disambiguate text.


## Word level ambiguities

Some words are overloaded with different meanings (**ambiguous sense**)
and/or belong to different syntactic categories (**ambiguous POS**).

For example, the word "design" can be either a noun or a verb.


## Syntactic ambiguities

A phrase or sentence may have multiple valid syntactic structures, each leading to a different meaning.

### Ambiguous modification

For example, "Natural language processing" can mean "processing of natural language" or "natural processing of language".

It is unclear if the word "natural" modifies "language" or "processing".

### Prepositional phrase attachment ambiguities

Also known as *PP attachment ambiguities*.

E.g., "A man saw a boy with a telescope". It is unclear who has the telescope: did the man use it to see the boy, or did the boy have it when the man saw him?

This ambiguity arises because it is unclear whether the phrase "with a telescope" attaches to "saw" or to "a boy".
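
The two readings can be written down as two different tree structures over the same words. The sketch below uses hand-written nested tuples, not the output of a parser; the labels (`S`, `NP`, `VP`, `PP`) and the `leaves` helper are assumptions chosen for illustration.

```python
# Two hand-written bracketings of the same sentence (not produced by a parser).
# Each node is (label, child, child, ...); leaves are plain strings.
PP = ("PP", "with", ("NP", "a", "telescope"))

# Reading 1: the PP attaches to the verb phrase, so the man used the telescope.
verb_attachment = (
    "S",
    ("NP", "A", "man"),
    ("VP", "saw", ("NP", "a", "boy"), PP),
)

# Reading 2: the PP attaches to the noun phrase, so the boy has the telescope.
noun_attachment = (
    "S",
    ("NP", "A", "man"),
    ("VP", "saw", ("NP", ("NP", "a", "boy"), PP)),
)


def leaves(tree) -> list[str]:
    """Return the words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # skip the label at index 0
        words.extend(leaves(child))
    return words


# Both structures cover exactly the same word sequence.
print(" ".join(leaves(verb_attachment)))
print(leaves(verb_attachment) == leaves(noun_attachment))  # True
```

The word sequence is identical in both cases; only the position of the `PP` node differs, which is why the surface text alone cannot settle the interpretation.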


## Anaphora resolution

This ambiguity arises from uncertainty about which entity a pronoun refers to. In the sentence "John persuaded Bill to buy a television for himself", it is unclear if "himself" refers to "John" or "Bill".

## Presupposition

The sentence "He has quit smoking" implies that he has smoked before. It is difficult for a computer to make this inference in general.



# State of the art in NLP

Deep understanding is still hard to achieve, especially in the general sense. High accuracy is restricted to specific domains/datasets.

Fairly high accuracy (with the usual caveats) can be achieved for

- POS tagging
- Partial parsing
- Entity relation extraction
- Word sense disambiguation
- Sentiment analysis

Deep semantic meaning is still hard to achieve.


# NLP for text retrieval

- Must be general & efficient -> Shallow NLP.
- A "bag of words" representation is sufficient for most search tasks because these tasks require only fairly "crude" results: as long as a document contains the queried words, it is likely to be relevant.

- Some text retrieval techniques can address NLP problems. For example, retrieving documents that contain the words in a query can achieve word sense disambiguation, because a specific meaning of a word tends to occur frequently only together with certain other words.

- Complex search tasks still require deep NLP (e.g., machine translation).
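
A minimal sketch of the bag-of-words idea, assuming whitespace tokenization and lowercase matching. The documents and the helper names (`bag_of_words`, `contains_all_query_words`) are made up for illustration.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count words; word order is discarded."""
    return Counter(text.lower().split())


def contains_all_query_words(query: str, document: str) -> bool:
    """Crude relevance test: every query word occurs somewhere in the document."""
    doc_bag = bag_of_words(document)
    return all(word in doc_bag for word in bag_of_words(query))


# Hypothetical documents, made up for illustration.
docs = {
    "d1": "java coffee beans are roasted before brewing",
    "d2": "java is a programming language that runs on a virtual machine",
}

query = "java programming"
matches = [doc_id for doc_id, text in docs.items() if contains_all_query_words(query, text)]
print(matches)  # ['d2']
```

The second query word acts as context: requiring "programming" to co-occur with "java" filters out the coffee sense without any explicit word sense disambiguation step.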


---

# Text access modes

## Push vs Pull

Who initiates the access?

Pull (think search engine **pulls** in user): The user initiates access by requesting ad hoc information.

Push (think recommenders **push** content to user): The system recommends information to the user based on some knowledge about the user.


## Querying vs Browsing

The choice depends on which mode is more **convenient** for the user.

Convenience depends on whether you know the search terms, and on how easy it is to express them or enter the query.

- **Query**: If you know what you are looking for.
- **Browsing**: If you want to explore data.


# What is text retrieval

User query -> System -> Relevant documents.

## TR vs Database Query

**Information**
- Unstructured vs structured.
- Ambiguous vs well-defined semantics.

**Query**
- Ambiguous vs well-defined semantics.
- Incomplete vs complete specification.


**Answers**
- Relevant documents vs matched records.

**TR is an empirically defined problem**
- Cannot mathematically prove one method is better than another.
- Must rely on empirical evaluations involving users.


# Formal formulation of TR

**Vocabulary**: $V = \{w_1, w_2, \ldots, w_N\}$ of language.

**Query**: $q = q_1, \ldots, q_m$ where $q_i \in V$.

**Document**: $d_i = d_{i1}, \ldots, d_{im_i}$ where $d_{ij} \in V$.

**Collection**: $\mathcal{C} = \{d_1, \ldots, d_M \}$.

**Relevant documents**: $\mathcal{R}(q) \subseteq \mathcal{C}$.

**Task**: Compute $\mathcal{R}^\prime(q) \approx \mathcal{R}(q)$.


## Computing $\mathcal{R}^\prime(q)$

1. Document selection

System determines if a document is relevant (absolute relevance). No ranking.

2. Document ranking

System determines if document is more relevant than another (relative relevance). 

User determines cutoff.

- Ranking is generally preferred because it is slightly "easier": a classifier that performs document selection is unlikely to be perfectly accurate, and it might even return no relevant documents when the query is overly specific (an **"over-constrained"** query).

- In other cases, the query might be too ambiguous (an **"under-constrained"** query), leading to too many results. Without ranking, it is difficult for the user to navigate the results.

- It is hard to find the right position between these two extremes because the user may not be clear about what he/she is looking for, or may not know the correct way to express the query.
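
The difference between the two strategies can be sketched as follows. The collection is hypothetical, and the scoring rule (number of distinct query words found in the document) is only a stand-in for a real ranking function.

```python
# Hypothetical collection; the scoring rule below is a stand-in for a real
# ranking function.
collection = {
    "d1": "text retrieval with ranking",
    "d2": "text mining and analytics",
    "d3": "cooking recipes",
}


def score(query: str, document: str) -> int:
    """Number of distinct query words that appear in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))


def select(query: str, docs: dict[str, str]) -> set[str]:
    """Document selection: keep only documents containing every query word."""
    n_terms = len(set(query.lower().split()))
    return {doc_id for doc_id, text in docs.items() if score(query, text) == n_terms}


def rank(query: str, docs: dict[str, str]) -> list[str]:
    """Document ranking: order all documents by descending score."""
    return sorted(docs, key=lambda doc_id: score(query, docs[doc_id]), reverse=True)


query = "text retrieval evaluation"  # over-constrained for this collection
selected = select(query, collection)
ranked = rank(query, collection)
print(selected)  # set()
print(ranked)    # ['d1', 'd2', 'd3']
```

With the over-constrained query, strict selection returns nothing, while ranking still puts the closest document first and leaves the cutoff to the user.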

## Ranking function and retrieval models

- **Ranking function**: $f(q, d) \in \mathbb{R}$.
   - Should rank relevant documents above irrelevant ones.
   - Challenge is how to measure likelihood that document $d$ is relevant to query $q$.

- **Retrieval models** = formalization of relevance (gives a computational definition to relevance).

## Different retrieval models

- **Similarity-based model**: $f(q, d) = \text{similarity}(q, d)$.
- **Probabilistic model**: $f(q, d) = P(R = 1 \vert q, d)$ where $R \in \{0, 1\}$.
  - Classic probabilistic model
  - Language model
  - Divergence-from-randomness model
- **Probabilistic inference model**: $f(q, d) = P(d \rightarrow q)$.
- **Axiomatic model**: $f(q, d)$ must satisfy a set of constraints.

These models tend to result in similar ranking functions involving similar variables.


# Common ideas in TR

- Term frequency (how often a term appears in a document).
- Document length (a term appearing in a shorter document might be more relevant).
- Document frequency (how many documents in the collection contain the term).
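
A minimal sketch of how these three statistics could be computed, assuming whitespace tokenization and a made-up collection. Document frequency is taken here in its usual sense: the number of documents that contain the term at least once.

```python
from collections import Counter

# Hypothetical collection used only for illustration.
collection = {
    "d1": "news about presidential campaign",
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign presidential candidate",
}

tokens = {doc_id: text.split() for doc_id, text in collection.items()}

# Term frequency: how often each term occurs in each document.
term_freq = {doc_id: Counter(words) for doc_id, words in tokens.items()}

# Document length: number of tokens in each document.
doc_length = {doc_id: len(words) for doc_id, words in tokens.items()}

# Document frequency: number of documents that contain each term at least once.
doc_freq = Counter(term for words in tokens.values() for term in set(words))

print(term_freq["d3"]["presidential"])             # 2
print(doc_length["d2"])                            # 5
print(doc_freq["presidential"], doc_freq["news"])  # 2 3
```

A term such as "news" that occurs in every document has a high document frequency and therefore says little about which document is relevant.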


# Popular retrieval models

When optimized, the following models work equally well.

- BM25
- Pivoted length normalization
- Query likelihood
- PL2

**BM25** is the most popular.

# Vector space model (VSM)

- Represent documents and queries as length-$N$ vectors.
- Relevance = similarity between $q$ and $d$.
- Each dimension of the vector corresponds to a concept; ideally, the concepts are orthogonal.
- Each entry in the vector is the weight of the corresponding concept.

## Unspecified part of VSM 

- How to define/select the basic "concepts".
- How to assign term weights for documents and queries.
- How to define the similarity measure.


## Vector placement: Bit vector

- $q = (x_1, \ldots, x_N), d = (y_1, \ldots, y_N)$
   - $x_i, y_i \in \{0, 1\}$: 1 if word $w_i$ is present, 0 otherwise.
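
A minimal sketch of building such bit vectors, assuming a small made-up vocabulary and whitespace tokenization. The names `VOCAB` and `to_bit_vector` are hypothetical.

```python
# Hypothetical vocabulary V = (w_1, ..., w_N); the order fixes each dimension.
VOCAB = ["news", "about", "presidential", "campaign", "food"]


def to_bit_vector(text: str, vocab: list[str] = VOCAB) -> list[int]:
    """Return a length-N vector with 1 where the vocabulary word occurs in text."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]


q = to_bit_vector("news about presidential campaign")
d = to_bit_vector("news about organic food campaign")
print(q)  # [1, 1, 1, 1, 0]
print(d)  # [1, 1, 0, 1, 1]
```

Words outside the vocabulary ("organic" here) are simply dropped, and repeated words still map to a single 1.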
   
## Dot product as similarity

- $f(q, d) = q \cdot d$.
- For the bit-vector case, this is simply the number of distinct terms that the query and the document share. All matched terms get equal weight.
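
A minimal sketch of the dot product as a ranking function over bit vectors. The vectors below are hypothetical and assume a 5-word vocabulary.

```python
def dot(q: list[int], d: list[int]) -> int:
    """Dot product of two equal-length vectors."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(q, d))


# Hypothetical bit vectors over a 5-word vocabulary.
q = [1, 1, 1, 1, 0]
d1 = [1, 1, 0, 1, 1]
d2 = [1, 0, 0, 0, 0]

print(dot(q, d1))  # 3
print(dot(q, d2))  # 1
```

Here `d1` scores higher than `d2` because it shares three distinct terms with the query, but the score cannot tell a rare, informative match from a common one.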
   