# Natural Language Processing (NLP)

## NLP Terms

### Vocabulary

- From [v2] Basic Text Processing - Word Tokenization
    - Set of Types
        - Set of unique tokens in the Text/Corpus

### Type

- From [v2] Basic Text Processing - Word Tokenization
    - An element of the vocabulary

### Token

- From [v2] Basic Text Processing - Word Tokenization
    - An instance of that type in running text
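
The three terms above can be illustrated with a minimal sketch. The `tokenize` helper and the sample sentence are assumptions made for illustration; the lowercase-and-split rule is deliberately simplistic and is not the tokenizer used in the lecture.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word-like runs (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

text = "The cat sat on the mat. The mat was flat."

tokens = tokenize(text)      # every instance in running text
vocabulary = set(tokens)     # the set of types (unique tokens)

print("Tokens:", len(tokens))
print("Types: ", len(vocabulary))
print("Vocabulary:", sorted(vocabulary))
```

Here `the` is a single type that occurs as three tokens, so the token count is larger than the vocabulary size.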

## WordNet

- From <https://www.guru99.com/wordnet-nltk.html>
  - WordNet is a lexical database for English; in NLTK it is accessed through a corpus reader
  - It is a semantically oriented dictionary of English
  - It can be used to find the
    - _meanings of words_
    - _Synonyms_
    - _Antonyms_
  - From WordNet, the following relations can be looked up for a word or phrase:
    - Synonyms (words with the same meaning)
    - Hypernyms (more generic terms that name a class of specifics, e.g., _meal_ is a hypernym of _breakfast_)
    - Hyponyms (more specific terms within a class, e.g., _breakfast_ is a hyponym of _meal_)
    - Holonyms (the whole that contains a part, e.g., _meal_ is a holonym of proteins and carbohydrates)
    - Meronyms (a part of a larger whole, e.g., _meal_ is a meronym of daily food intake)
  - WordNet is organized by part of speech:
    - Noun
    - Verb
    - Adjective
    - Adverb
  - Can be used for text analytics

```python
from nltk.corpus import wordnet as wn
syns = wn.synsets("good")
print(syns)
```

```
[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]
```

```python
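from nltk.corpus import wordnet as wn

synonyms = []
antonyms = []

# Walk every synset (sense) of "good" and every lemma within it.
for syn in wn.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        # Not every lemma has an antonym; keep the first one when present.
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

# Sets drop the duplicates collected across synsets.
print('Synonyms: ', set(synonyms))
print('Antonyms: ', set(antonyms))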
```

```
Synonyms:  {'upright', 'dear', 'safe', 'trade_good', 'adept', 'unspoilt', 'thoroughly', 'full', 'right', 'salutary', 'honorable', 'honest', 'practiced', 'goodness', 'well', 'soundly', 'good', 'beneficial', 'sound', 'undecomposed', 'serious', 'unspoiled', 'skilful', 'near', 'commodity', 'effective', 'respectable', 'just', 'skillful', 'secure', 'proficient', 'estimable', 'in_force', 'expert', 'dependable', 'in_effect', 'ripe'}
Antonyms:  {'evil', 'badness', 'ill', 'evilness', 'bad'}
```

### Term-Document-Matrix

- From [v1] Lec 33
  - ![Term_Document_Matrix](images/Term_Document_Matrix.jpg)
  - Each column represents a document; its entries are that document's features (e.g., term counts)
  - Each row represents a word (term) in the corpus
- From [27]
  - More ideas on applying SVD to the term-document matrix
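
A term-document matrix with terms as rows and documents as columns can be built by hand, as in the rough sketch below. The three toy documents and the use of raw counts as cell values are assumptions for illustration; the figure above may use a different weighting.

```python
from collections import Counter

# Toy corpus (hypothetical); each string is one document.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

tokenized = [doc.split() for doc in documents]
terms = sorted({tok for doc in tokenized for tok in doc})   # row labels
counts = [Counter(doc) for doc in tokenized]                # one counter per document

# matrix[i][j] = number of times terms[i] occurs in documents[j]
matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]

for term, row in zip(terms, matrix):
    print(f"{term:>5} {row}")
```

Reading down a column gives the count vector for one document; reading across a row shows how one term is spread over the corpus.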

## Semantic

- From [27]
  - Indicates Tenses _(Past vs Present vs Future)_
  - Count _(Singular vs Plural)_
  - Gender _(Masculine vs Feminine)_

## Word Embedding

### Using LSI

- From [v1] Lec 39
  - The dimensionality of the embeddings is determined by how many singular values are kept from the diagonal singular-value matrix
  - Applying SVD to the Term-Context Matrix (i.e., the Term-Document Matrix):
    - The Left Singular Matrix contains the word embeddings
    - The Right Singular Matrix contains the context embeddings
  - ![Word_Embedding](images/Word_Embedding.jpg)
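
The decomposition described above can be sketched with NumPy. This is a minimal illustration, assuming a small made-up count matrix `X` and an arbitrary choice of `k = 2` retained singular values; it is not the lecture's exact procedure.

```python
import numpy as np

# Toy term-document count matrix (hypothetical): rows are terms, columns are documents.
terms = ["cat", "dog", "mat", "log", "sat"]
X = np.array([
    [2, 0, 1],
    [0, 2, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                # number of singular values to keep (assumed)
word_embeddings = U[:, :k]           # one k-dimensional row per term
context_embeddings = Vt[:k, :].T     # one k-dimensional row per document

# Rank-k approximation of the original matrix.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for term, vec in zip(terms, word_embeddings):
    print(term, np.round(vec, 3))
```

Some treatments scale the retained columns of `U` by the singular values before using them as embeddings; that is a modelling choice and is left out here.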

### Using Neural Network

- TO DO

## Continuous Bag of Words (CBOW)

![Continuous_Bag_Of_Words_Model](images/Continuous_Bag_Of_Words_Model.jpg)
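
CBOW trains a model to predict a target word from the words surrounding it. The sketch below only builds the (context, target) training pairs, not the network itself; the `cbow_pairs` helper, the sample sentence, and the window size are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

tokens = "the quick brown fox jumps".split()
for context, target in cbow_pairs(tokens, window=1):
    print(context, "->", target)
```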

## Skip Gram Model

![Skip_Gram_Model](images/Skip_Gram_Model.jpg)
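
Skip-gram trains a model to predict the surrounding context words from a single center word. The sketch below only generates the (center, context) training pairs, not the network itself; the `skipgram_pairs` helper, the sample sentence, and the window size are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs for a skip-gram-style model."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                           # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
for center, context in skipgram_pairs(tokens, window=1):
    print(center, "->", context)
```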