# Natural Language Processing (NLP)

## NLP Terms

### Vocabulary

- From [v2] Basic Text Processing - Word Tokenization
    - Set of Types
        - Set of unique tokens in the Text/Corpus

### Types and Tokens

- From [28]
  - Consider the following example
    - "Ever tried. Ever failed."
    - "No matter. Try again."
    - "Fail again. Fail better"
  - There are two tokens of type "Ever", two tokens of type "again", and two tokens of type "Fail".
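
The counts above can be reproduced with a minimal sketch. It assumes a simple regular-expression tokenizer and case-sensitive matching, so "Fail" and "failed" count as different types; neither choice comes from the source.

```python
import re
from collections import Counter

text = "Ever tried. Ever failed. No matter. Try again. Fail again. Fail better"

# Tokens: every occurrence of a word in the running text.
tokens = re.findall(r"\w+", text)

# Types: the distinct words; together they form the vocabulary.
token_counts = Counter(tokens)
vocabulary = set(token_counts)

print("Number of tokens:", len(tokens))
print("Number of types:", len(vocabulary))
for word in ("Ever", "again", "Fail"):
    print(word, "->", token_counts[word], "tokens")
```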

#### Type

- From [v2] Basic Text Processing - Word Tokenization
    - An element of the vocabulary

#### Token

- From [v2] Basic Text Processing - Word Tokenization
    - An instance of that type in running text

### Stemming

- From [28]
  - Morphology is the study of the internal structures of words
  - Often a word is composed of a stem (root) with added affixes (inflections), such as plurals and past tenses
    - E.g., _trapped_ is composed of the stem _trap_ and the affix _ed_
  - Stemming, a kind of morphological analysis, is the process of reducing inflected words to their stems.
  - As a form of normalization, stemming tends to increase recall and reduce precision
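
To illustrate the idea, here is a deliberately naive suffix-stripping sketch. `toy_stem` is a hypothetical helper written for these notes; it is not the Porter algorithm or any library's stemmer, and its suffix list and length threshold are assumptions that only cover simple cases such as _trapped_.

```python
def toy_stem(word):
    """Strip one common inflectional suffix (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "trapp" -> "trap".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for form in ("trap", "traps", "trapped", "trapping"):
    print(form, "->", toy_stem(form))
```

Because all four forms map to the same stem, a query for one form can match documents that contain any of the others, which is where the gain in recall comes from.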

### Normalization

- From [28]
  - The motivation for normalization is that many different strings of characters often convey essentially identical meanings
  - Given that we want to get at the meaning that underlies the words, it seems reasonable to normalize superficial variations by converting them to the same form.
  - Most common types of Normalization
    - Case Folding (converting all words to lower case)
    - Stemming (reducing inflected words to their stem or root form)

### Annotation

- From [28]
  - Annotation is the inverse of Normalization
  - Just as different strings of characters may have the same meaning, identical strings of characters may have different meanings, depending on the context.
  - Common forms of annotation include
    - Part-of-speech tagging (marking words according to their parts of speech)
    - Word sense tagging (marking ambiguous words according to their intended meanings)
    - Parsing (analysing the grammatical structure of sentences and marking the words in the sentences according to their grammatical roles)
  - Annotation decreases recall and increases precision
    - Example, by tagging _program_ as a noun or verb
      - We may be able to selectively search for documents that are about the act of computer programming (verb)
      - instead of documents that discuss particular computer programs (noun)
    - Hence we can increase the precision

### Performance Measures in Information Retrieval

- From [28]
  - The performance of an IR system is often measured by _precision_ and _recall_

#### Precision

- From [28]
  - The _precision_ of a system is an estimate of the conditional probability that a document is truly relevant to a query, if the system says it is relevant.

#### Recall

- From [28]
  - The _recall_ of a system is an estimate of the conditional probability that the system will say that a document is relevant to a query, if it is truly relevant
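
Both definitions can be estimated from counts: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below uses made-up document IDs, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(retrieved, relevant):
    """Estimate precision and recall from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: the system returns four documents,
# while three documents are truly relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```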

### Semantic

- From [27]
  - Indicates Tenses _(Past vs Present vs Future)_
  - Count _(Singular vs Plural)_
  - Gender _(Masculine vs Feminine)_

## WordNet

- From <https://www.guru99.com/wordnet-nltk.html>
  - In NLTK, WordNet is exposed through a corpus reader; WordNet itself is a lexical database for English
  - It is a semantically oriented dictionary of English
  - It can be used to find the
    - _meanings of words_
    - _Synonyms_
    - _Antonyms_
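  - From WordNet, the following information about a word or phrase can be retrieved:
    - Synonyms (words having the same meaning)
    - Hypernyms (generic terms that designate a class of specifics, e.g., _meal_ is a hypernym of _breakfast_)
    - Hyponyms (specific terms that fall under a more generic term, e.g., _breakfast_ is a hyponym of _meal_)
    - Holonyms (the whole that something is part of, e.g., _meal_ is a holonym of _proteins_ and _carbohydrates_)
    - Meronyms (a part of a larger whole, e.g., _meal_ is a meronym of _daily food intake_)
  - WordNet groups words by part of speech: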
    - Noun
    - Verb
    - Adjective
    - Adverb
  - Can be used for text analytics

```python
from nltk.corpus import wordnet as wn
syns = wn.synsets("good")
print(syns)
```

```
[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]
```

```python
from nltk.corpus import wordnet as wn
synonyms = []
antonyms = []

for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print('Synonyms: ', set(synonyms))
print('Antonyms: ', set(antonyms))
```

```
Synonyms:  {'upright', 'dear', 'safe', 'trade_good', 'adept', 'unspoilt', 'thoroughly', 'full', 'right', 'salutary', 'honorable', 'honest', 'practiced', 'goodness', 'well', 'soundly', 'good', 'beneficial', 'sound', 'undecomposed', 'serious', 'unspoiled', 'skilful', 'near', 'commodity', 'effective', 'respectable', 'just', 'skillful', 'secure', 'proficient', 'estimable', 'in_force', 'expert', 'dependable', 'in_effect', 'ripe'}
Antonyms:  {'evil', 'badness', 'ill', 'evilness', 'bad'}
```

## NLP Matrices

### Token-Document Matrix

- From [28]
  - One row for each token
    - A row vector for a token has binary values: An element is 1 if the given token appears in the given document and 0 otherwise
  - One column for each document
  - Used for polysemy, typically in __*WSD (Word Sense Disambiguation)*__ algorithms, which deal with word tokens

### Type-Document-Matrix

- From [28]
  - One row for each type
    - A row vector for a type has integer values: An element is the frequency of the given type in the given document
  - One column for each document
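
A minimal sketch of a type-document matrix, assuming a regular-expression tokenizer and three toy documents. The binary variant at the end only records presence or absence, in the spirit of the binary values described for the token-document matrix.

```python
import re
from collections import Counter

documents = [
    "Ever tried. Ever failed.",
    "No matter. Try again.",
    "Fail again. Fail better",
]

tokenized = [re.findall(r"\w+", doc) for doc in documents]
types = sorted({token for doc in tokenized for token in doc})
counts = [Counter(doc) for doc in tokenized]

# One row per type, one column per document; each element is a frequency.
type_document = [[doc_counts[t] for doc_counts in counts] for t in types]

# Presence/absence version of the same matrix.
binary = [[1 if value > 0 else 0 for value in row] for row in type_document]

for t, row in zip(types, type_document):
    print(f"{t:>8}", row)
```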

### Term-Document-Matrix

- From [v1] Lec 33
  - ![Term_Document_Matrix](images/Term_Document_Matrix.jpg)
  - Each column represents the various features of a document
  - Each row represents a word in the corpus
- From [27]
  - More idea on SVD over Term-Document-Matrix
- From [28]
  - Rows correspond to Terms
  - Columns correspond to Documents
  - Use: Similarity of Documents

### Word-Document-Matrix

### Word-Context-Matrix

## Types of matrices used in VSM (Vector Space Model)

- From [28]
  - TO DO
  - Term-Document Matrix
  - Word-Context Matrix
  - Pair-Pattern Matrix

### Term-Document Matrix

- From [28]
  - Row vectors correspond to Terms
  - Column vectors correspond to Documents
  - The content of each element will normally be a TF-IDF weight
  - Use: To find the __*Similarity of Documents*__
    - The relevance of a document to a query is given by the similarity of their vectors
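
The ranking idea can be sketched with NumPy. The corpus and query are toy data, and the weighting `tf * log(N / df)` is one common TF-IDF variant chosen here as an assumption; the source does not fix a formula. Similarity is measured with the cosine of the angle between the query vector and each document column.

```python
import re
from collections import Counter

import numpy as np

documents = [
    "the mason cuts the stone",
    "the carpenter cuts the wood",
    "the dog barks at the cat",
]
query = "mason stone"

tokenized = [re.findall(r"\w+", d.lower()) for d in documents]
terms = sorted({t for doc in tokenized for t in doc})
index = {t: i for i, t in enumerate(terms)}

# Raw term frequencies: rows = terms, columns = documents.
tf = np.zeros((len(terms), len(documents)))
for j, doc in enumerate(tokenized):
    for term, count in Counter(doc).items():
        tf[index[term], j] = count

# Assumed weighting: idf = log(N / df), so terms found in every document get 0.
df = np.count_nonzero(tf, axis=1)
idf = np.log(len(documents) / df)
tfidf = tf * idf[:, np.newaxis]


def cosine(u, v):
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


# Represent the query in the same term space as the documents.
q = np.zeros(len(terms))
for term, count in Counter(re.findall(r"\w+", query.lower())).items():
    if term in index:
        q[index[term]] = count * idf[index[term]]

scores = [cosine(q, tfidf[:, j]) for j in range(len(documents))]
best = int(np.argmax(scores))
print("Most relevant document:", documents[best])
```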

### Word-Context-Matrix

- From [28]
  - Row vectors correspond to words
  - Column vectors correspond to the contexts in which the words occur
    - Context: words, phrases, sentences, paragraphs, chapters, documents, or more exotic possibilities, such as sequences of characters or patterns
  - Use: To find the __*Similarity of Words*__. i.e., measuring __*Attributional Similarity*__
    - By looking at row vectors in the Term-Document-Matrix instead of column vectors
  - Attributional Similarity
    - The attributional similarity between two words $a$ and $b$, $sim_a(a,b) \in \mathbb{R}$, depends on the degree of correspondence between the properties of $a$ and $b$.
    - The more correspondence there is, the greater the attributional similarity
  - TO DO
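
A minimal sketch of a word-context matrix, assuming the context of a word is the set of words within a fixed window around it. The corpus and the window size are toy choices, and raw counts are used for simplicity; practical systems usually reweight the counts before comparing rows.

```python
import re

import numpy as np

corpus = [
    "the dog chased the cat",
    "the wolf chased the deer",
    "the dog ate the meat",
    "the wolf ate the meat",
    "the carpenter cut the wood",
]
window = 2  # assumed context size: two words on each side

sentences = [re.findall(r"\w+", s.lower()) for s in corpus]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Rows = words, columns = context words, elements = co-occurrence counts.
matrix = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, word in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[index[word], index[sent[j]]] += 1


def cosine(u, v):
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


# Attributional similarity is read off the row vectors.
sim_dog_wolf = cosine(matrix[index["dog"]], matrix[index["wolf"]])
sim_dog_wood = cosine(matrix[index["dog"]], matrix[index["wood"]])
print(f"dog~wolf: {sim_dog_wolf:.3f}, dog~wood: {sim_dog_wood:.3f}")
```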

### Pair-Pattern Matrix

- From [28]
  - Row vectors correspond to pairs of words, such as $mason : stone$ and $carpenter : wood$
  - Column vectors correspond to the patterns in which the pairs co-occur
    - such as "X cuts Y" and "X works with Y"
  - Use: To find the __*Relational Similarity of Words*__, by measuring the semantic similarity of patterns
    - It measures the similarity of the semantic relationships between pairs of words
  - Relational Similarity
    - The relational similarity between two _pairs_ of words $a : b$ and $c : d$, $sim_r(a:b, c:d) \in \mathbb{R}$, depends on the degree of correspondence between the relations of $a : b$ and $c : d$
    - The more correspondence there is, the greater the relational similarity
      - Example:
        - $dog$ and $wolf$ have a relatively _high degree of attributional similarity_
        - Whereas $dog : bark$ and $cat : meow$ have a relatively _high degree of relational similarity_
  - TO DO
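
A minimal sketch of a pair-pattern matrix built from a toy corpus. The word pairs are given in advance, a pattern is taken to be the words between X and Y, and the article "the" is dropped from patterns; all three are simplifying assumptions made for illustration.

```python
import math
import re
from collections import Counter

corpus = [
    "the mason cuts the stone",
    "the carpenter cuts the wood",
    "the mason works with the stone",
    "the carpenter works with the wood",
    "the dog chases the cat",
]
pairs = [("mason", "stone"), ("carpenter", "wood"), ("dog", "cat")]

# Count, for each pair X:Y, the patterns "X ... Y" in which it occurs.
counts = {pair: Counter() for pair in pairs}
for sentence in corpus:
    words = re.findall(r"\w+", sentence.lower())
    for x, y in pairs:
        if x in words and y in words and words.index(x) < words.index(y):
            between = words[words.index(x) + 1 : words.index(y)]
            middle = " ".join(w for w in between if w != "the")
            counts[(x, y)]["X " + middle + " Y"] += 1

patterns = sorted({p for c in counts.values() for p in c})
# Rows = word pairs, columns = patterns.
matrix = [[counts[pair][p] for p in patterns] for pair in pairs]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_related = cosine(matrix[0], matrix[1])    # mason:stone vs carpenter:wood
sim_unrelated = cosine(matrix[0], matrix[2])  # mason:stone vs dog:cat
print(patterns)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```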

## SVD in NLP

- From [28]

### Latent Meaning

- From [28]
  - It is a method to discover latent meaning

### Noise Reduction

- From [28]
  - TO DO

### Sparsity Reduction

- From [28]
  - TO DO

### High-order Co-occurrence

- From [28]
  - TO DO

## Word Embedding

### Using LSI

- From [v1] Lec 39
  - The dimensionality of the embeddings is determined by the number of singular values retained from the diagonal singular-value matrix
  - If you apply SVD to a Term-Context Matrix (e.g., a Term-Document Matrix)
    - Left Singular Matrix will contain the word Embeddings
    - Right Singular Matrix will contain the Context Embeddings
  - ![Word_Embedding](images/Word_Embedding.jpg)
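
The decomposition can be sketched with `numpy.linalg.svd`. The matrix below holds hypothetical toy counts, and scaling the singular vectors by the retained singular values is one common convention, assumed here rather than taken from the lecture.

```python
import numpy as np

terms = ["mason", "stone", "carpenter", "wood"]

# Hypothetical toy counts: rows = terms, columns = documents (contexts).
X = np.array([
    [2, 1, 0, 0],  # mason
    [1, 2, 0, 0],  # stone
    [0, 0, 2, 1],  # carpenter
    [0, 0, 1, 2],  # wood
], dtype=float)

# X = U @ diag(s) @ Vt, with singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of singular values kept = dimensionality of the embeddings

# Left singular vectors give the word embeddings,
# right singular vectors give the context (document) embeddings.
word_embeddings = U[:, :k] * s[:k]
context_embeddings = Vt[:k, :].T * s[:k]

for term, vector in zip(terms, word_embeddings):
    print(f"{term:>10}", np.round(vector, 3))
```

In this toy matrix, _mason_ and _stone_ end up with the same 2-dimensional embedding even though their original rows differ, which is the kind of latent grouping the truncation is meant to expose.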

### Using Neural Network

- TO DO

## Continuous Bag of Words (CBOW)

![Continuous_Bag_Of_Words_Model](images/Continuous_Bag_Of_Words_Model.jpg)
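
CBOW predicts a target word from its surrounding context words. The sketch below only shows how such (context, target) training examples could be generated from a token sequence; the window size and the helper name `cbow_examples` are assumptions, and no neural network is trained here.

```python
def cbow_examples(tokens, window=1):
    """Return (context_words, target_word) examples for a CBOW-style model."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = "no matter try again".split()
for context, target in cbow_examples(tokens, window=1):
    print(context, "->", target)
```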

## Skip Gram Model

![Skip_Gram_Model](images/Skip_Gram_Model.jpg)
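
Skip-gram works in the opposite direction: it predicts each surrounding context word from the center word. The sketch below only generates the (center, context) training pairs; the window size and the helper name `skip_gram_pairs` are assumptions, and no model is trained.

```python
def skip_gram_pairs(tokens, window=1):
    """Return (center_word, context_word) pairs for a skip-gram-style model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "no matter try again".split()
for center, context in skip_gram_pairs(tokens, window=1):
    print(center, "->", context)
```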

## Polysemy

- Meaning in Tamil: பலபொருள் ஒருசொல்
- Same word having different meanings

- [34] describes Polysemy in Tamil, Telugu languages

## Named Entity Recognition (NER)

- From v1 Lecture 53
  - Named Entity Recognition (NER)
    - Example: "Mr. John is the CEO of the company. And he had done great things for the company."
      - What does "he" refer to in the second sentence?
      - The system should be able to say that "he" refers to the CEO of that company
      - The lecture presents this as NER; strictly, NER identifies the named entities (e.g., "John" as a person), and linking "he" back to that entity is coreference resolution, which builds on it.
    - How many times is the CEO referred to in the document? Is it possible to find that?
      - An NER model should be used, which can recognize that person as the CEO wherever he is mentioned in the document.
  - The data is sequential
  - An RNN model may be required for this kind of problem

## Paraphrase Detection

- From v1 Lecture 53
  - Paraphrase detection - identifying semantically equivalent questions
    - A question can be asked in different ways
    - All those questions are semantically equivalent
    - Paraphrase detection is finding semantically equivalent sentences
    - Example: IT Call Center
      - They receive calls having various queries, but semantically equivalent
      - The company needs to find the most frequently asked questions (FAQ), so that new joiners can answer those calls
      - Here paraphrase detection is required
  - The data is sequential
  - An RNN model may be required for this kind of problem

## Language Generation

- From v1 Lecture 53
  - Language Generation
    - 'Given a photograph, you are asked to write a line about the photograph'
      - You look at the content of the photograph, say objects and you give some title
      - We can use Language Generation Model
        - Input is going to be different
          - Meaning, the photograph may have 3 objects, 5 objects, or any number of objects in it
          - We should be able to process those without changing or adjusting our neural network size

## Machine Translation

- From v1
  - Lecture 53
    - Given a parallel corpus, we should be able to translate from one language to the other
    - We have to do sentence by sentence translation to have correct translation (not word by word translation, which won't give correct translation)

## Speech Recognition

- From v1
  - Lecture 53
    - Should be able to translate a speech audio file from one language to another
    - Based on the speech, the system should be able to recognize what the speaker is saying
      - Example: "Wreck a nice beach" or "recognize speech"
        - Based on the context, it should be able to understand that the speaker said "recognize speech"
      - This can be achieved in NLP when the words are treated as a time series

## Spell Checking

- From v1
  - Lecture 53
    - As you type, the system should measure the distance (e.g., edit distance) between the characters typed so far and the words in the dictionary, and start suggesting the closest words.
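
One way to make "distance" concrete is the Levenshtein edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The lecture does not name a specific metric, so this choice, the tiny dictionary, and the helper names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(typed, dictionary, max_suggestions=3):
    """Rank dictionary words by distance to the typed string (ties: alphabetical)."""
    ranked = sorted(dictionary, key=lambda w: (edit_distance(typed, w), w))
    return ranked[:max_suggestions]


dictionary = ["recognize", "speech", "beach", "wreck", "nice"]
print(suggest("speach", dictionary, max_suggestions=2))
```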

## Machine Translation

### Word 2 Word Translation

- From [v1] See Video Lecture 63

### Syntactic Translation

- From [v1] See Video Lecture 63

### Semantic Translation

- From [v1] See Video Lecture 63

### Interlingua Translation

- From [v1] See Video Lecture 63

### Automatic Machine Translation

- From [v1] See Video Lecture 64