## How Does Natural Language Processing (NLP) Work?

NLP models work by finding relationships between the constituent parts of language — for example, the letters, words, and sentences found in a text dataset. 

NLP architectures use various methods for data preprocessing, feature extraction, and modeling. Some of these processes are:

- Data preprocessing: Before a model processes text for a specific task, the text often needs to be preprocessed to improve model performance or to turn words and characters into a format the model can understand. Various techniques may be used in this data preprocessing:

#### Sentence segmentation & Word tokenization

    - Sentence segmentation breaks a large piece of text into linguistically meaningful sentence units. This looks straightforward in languages like English, where a period usually marks the end of a sentence, but it is not trivial: a period can also mark an abbreviation, in which case it should stay part of the abbreviation token rather than end the sentence. The process becomes even more complex in languages, such as ancient Chinese, that don’t have a delimiter that marks the end of a sentence.

    - Tokenization splits text into individual words and word fragments. The result generally consists of a word index (a mapping from each token to an integer ID) and the tokenized text, in which words are replaced by those IDs for use in deep learning methods. Telling the model which tokens to ignore (for example, padding tokens) can improve efficiency.
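The two steps above can be sketched in plain Python. This is a minimal illustration, not a production segmenter: the abbreviation list, the function names, and the sample text are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a real segmenter needs a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """End a sentence at '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Lowercase, keep abbreviations whole, split other punctuation off words."""
    tokens = []
    for chunk in sentence.lower().split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # the period stays inside the abbreviation token
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

def build_word_index(token_lists):
    """Map each distinct token to an integer ID (from 1) in order of first appearance."""
    word_index = {}
    for tokens in token_lists:
        for tok in tokens:
            word_index.setdefault(tok, len(word_index) + 1)
    return word_index

text = "Dr. Smith studies NLP. NLP rocks!"
sentences = split_sentences(text)
token_lists = [tokenize(s) for s in sentences]
word_index = build_word_index(token_lists)
encoded = [[word_index[t] for t in tokens] for tokens in token_lists]

print(sentences)  # ['Dr. Smith studies NLP.', 'NLP rocks!']
print(encoded)    # [[1, 2, 3, 4, 5], [4, 6, 7]]
```

Note that “Dr.” does not end the first sentence, and that “nlp” receives the same ID (4) in both sentences.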

In [20]:
from IPython.display import Image
Image(url="tokenizer.png", width=500, height=350)

#### Stemming, lemmatization, and Stop word removal

    - Stemming is a heuristic, rule-based process that converts words to their base forms by stripping affixes. For example, “university,” “universities,” and “university’s” might all be mapped to the base “univers.” (One limitation of this approach is that “universe” may also be mapped to “univers,” even though “universe” and “university” don’t have a close semantic relationship.)

    - Lemmatization is a more formal way to find a word’s dictionary form (its lemma) by analyzing the word’s morphology against a vocabulary. NLTK provides both stemmers and a lemmatizer; spaCy provides lemmatization but no stemmer.

    - Stop word removal drops the most commonly occurring words that add little information to the text, such as “the,” “a,” and “an.”
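To make the stemming limitation concrete, here is a toy suffix-stripping stemmer and a stop word filter in plain Python. The suffix rules and the stop word list are hypothetical and far smaller than what a real stemmer (such as Porter's) or a library stop word list contains; the sketch only illustrates the idea.

```python
# Toy suffix rules, tried in order; a stand-in for a real stemmer's rule set.
SUFFIX_RULES = ["ities", "ity", "es", "s", "e"]

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "in"}

def toy_stem(word):
    """Strip a possessive ending, then the first matching suffix."""
    word = word.lower()
    for possessive in ("'s", "’s"):
        if word.endswith(possessive):
            word = word[: -len(possessive)]
    for suffix in SUFFIX_RULES:
        # Only strip if at least 3 characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

words = ["university", "universities", "university's", "universe"]
stems = [toy_stem(w) for w in words]
print(stems)     # ['univers', 'univers', 'univers', 'univers']

filtered = remove_stop_words(["the", "history", "of", "a", "university"])
print(filtered)  # ['history', 'university']
```

All four words collapse to the same stem, including “universe,” which is exactly the over-merging described above. A lemmatizer avoids this by looking words up in a dictionary instead of applying blind suffix rules.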


In [21]:
Image(url="stemlem.png", width=500, height=400)

#### Feature extraction




Most conventional machine-learning techniques work on features: generally numbers that describe a document in relation to the corpus that contains it.

Text representations are classified into four categories:

- Basic vectorization approaches
- Distributed representations
- Universal language representation
- Handcrafted features


##### Basic Vectorization

- One-Hot Encoding: In one-hot encoding, each word $w$ in the corpus vocabulary is given a unique integer ID $w_{id}$ between 1 and $|V|$, where $V$ is the corpus vocabulary. Each word is then represented by a $|V|$-dimensional binary vector that is 0 everywhere except for a 1 at index $w_{id}$.

In [12]:
Output: [[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

Cons:
   - The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies.
   - It does not give a fixed-length representation for a text: a text with n words becomes n vectors.
   - It treats words as atomic units and has no notion of (dis)similarity between words.
   - It cannot handle the out-of-vocabulary (OOV) problem: a word that was not in the vocabulary has no ID and therefore no vector.
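A minimal sketch of one-hot encoding in plain Python follows. The toy corpus and the function names are assumptions chosen for illustration. IDs run from 1 to |V| as described above, so the vector position is the ID minus 1.

```python
# Toy corpus, assumed for illustration.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Assign each distinct word an ID from 1 to |V| in order of first appearance.
vocab = {}
for document in corpus:
    for word in document.split():
        vocab.setdefault(word, len(vocab) + 1)

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's position."""
    vector = [0] * len(vocab)
    vector[vocab[word] - 1] = 1  # raises KeyError for an out-of-vocabulary word
    return vector

def encode(document):
    """Encode a text as one vector per word, so the length varies with the text."""
    return [one_hot(w) for w in document.split()]

encoded = encode("man bites dog")
print(vocab)    # {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
print(encoded)  # [[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]
```

Two of the drawbacks listed above are visible directly: `encode("dog eats")` returns two vectors rather than three, and `one_hot("cat")` fails because “cat” has no ID.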
    

##### Bag of Words

   - Bag-of-Words: Bag-of-Words counts the number of times each word or n-gram (a sequence of n consecutive words) appears in a document. For example, below, the Bag-of-Words model creates a numerical representation of the dataset based on how many times each word in the word_index occurs in the document.

Cons:
   - The size of the vector increases with the size of the vocabulary.
   - It does not capture the similarity between different words that mean the same thing.
   - It has no way to handle out-of-vocabulary words.
   - Word order information is lost. A bag of n-grams recovers some context and word-order information, at the cost of a larger vocabulary.
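The counting step, and the effect of switching from single words to n-grams, can be sketched as follows. This is an illustrative plain-Python version; the corpus and helper names are assumptions, and libraries such as scikit-learn offer ready-made vectorizers for real use.

```python
from collections import Counter

# Toy corpus, assumed for illustration.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(documents, n=1):
    """Map each distinct n-gram to a column index (alphabetical order)."""
    terms = sorted({t for doc in documents for t in ngrams(doc.split(), n)})
    return {term: i for i, term in enumerate(terms)}

def bag_of_words(document, vocabulary, n=1):
    """Count each vocabulary term in the document; unknown terms are dropped."""
    counts = Counter(ngrams(document.split(), n))
    return [counts.get(term, 0) for term in vocabulary]

unigram_vocab = build_vocabulary(corpus, n=1)
bigram_vocab = build_vocabulary(corpus, n=2)

unigram_a = bag_of_words("dog bites man", unigram_vocab)
unigram_b = bag_of_words("man bites dog", unigram_vocab)
bigram_a = bag_of_words("dog bites man", bigram_vocab, n=2)
bigram_b = bag_of_words("man bites dog", bigram_vocab, n=2)

print(unigram_a, unigram_b)  # [1, 1, 0, 0, 1, 0] [1, 1, 0, 0, 1, 0]
print(bigram_a, bigram_b)    # [0, 1, 1, 0, 0, 0, 0, 0] [1, 0, 0, 0, 0, 0, 1, 0]
```

With single words, “dog bites man” and “man bites dog” get identical vectors because word order is lost. With bigrams the two vectors differ, but the vocabulary grows from 6 to 8 entries even on this tiny corpus.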

In [11]:
Image(url="bow.png", width=500, height=350)

##### Term frequency–inverse document frequency

   - TF-IDF: In Bag-of-Words, we count the occurrence of each word or n-gram in a document. In contrast, with TF-IDF, we weight each word by its importance. To evaluate a word’s significance, we consider two things:

        - Term Frequency: How important is the word in the document?

$$\mathrm{TF}(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{number of words in } d}$$

        - Inverse Document Frequency: How important is the term in the whole corpus?

$$\mathrm{IDF}(w) = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents that include } w}\right)$$

A word that occurs many times in a document looks important, but frequency alone is misleading: words like “a” and “the” appear often in almost every document, so their TF score is always high. Inverse Document Frequency corrects for this: it is high if the word is rare and low if the word is common across the corpus. The TF-IDF score of a term is the product of TF and IDF.

The intuition behind TF-IDF is as follows: if a word $w$ appears many times in a document $d_{i}$ but does not occur much in the rest of the documents $d_{j}$ in the corpus, then the word $w$ must be of great importance to the document $d_{i}$. The importance of $w$ should increase in proportion to its frequency in $d_{i}$, but at the same time, its importance should decrease in proportion to the word’s frequency in other documents $d_{j}$ in the corpus.

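This method still suffers from the curse of dimensionality: each TF-IDF vector has one dimension per vocabulary term, so the vectors remain high-dimensional and sparse.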
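The TF and IDF definitions above translate almost line for line into Python. The sketch below uses the unsmoothed formulas exactly as stated in this section; the toy corpus and function names are assumptions, and many libraries use smoothed or normalized variants that give different numbers.

```python
import math

# Toy corpus, assumed for illustration.
corpus = [
    "the dog bites the man",
    "the man eats food",
    "the dog eats meat",
]
documents = [doc.split() for doc in corpus]

def term_frequency(word, document):
    """Occurrences of the word divided by the number of words in the document."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    """log(N / number of documents containing the word).

    Assumes the word occurs in at least one document (no smoothing).
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

first = documents[0]
scores = {word: tf_idf(word, first, documents) for word in set(first)}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}  {score:.3f}")
```

In the first document, “the” has the highest term frequency (2 of 5 words) but appears in every document, so its IDF is log(1) = 0 and its TF-IDF score is 0. “bites” appears only in this document and receives the highest score.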

In [32]:
Image(url="tf.png", width=500, height=350)

If we look back at all the representation schemes we’ve discussed so far, we notice three fundamental drawbacks:
- They’re discrete representations—i.e., they treat language units (words, n-grams, etc.) as atomic units. This discreteness hampers their ability to capture relationships between words.
- The feature vectors are sparse and high-dimensional representations. The dimensionality increases with the size of the vocabulary, with most values being zero for any vector. This hampers learning capability. Further, high-dimensionality representation makes them computationally inefficient.
- They cannot handle OOV words.

##### Distributed Representations

To overcome these limitations, methods to learn low-dimensional representations were devised.

Distributional similarity:
This is the idea that the meaning of a word can be understood from the context in which the word appears. For example, in “NLP rocks,” the literal meaning of “rocks” is “stones,” but the context shows that it is being used to mean something good and fashionable.

##### Word Embeddings

The neural network–based word representation model known as Word2Vec, which builds on distributional similarity, can capture word analogy relationships such as:

King – Man + Woman ≈ Queen

It comes in two variations: 
   - Skip-Gram, in which we try to predict surrounding words given a target word
   - Continuous Bag-of-Words (CBOW), which tries to predict the target word from surrounding words. 
   
Once training is finished, the final (output) layer is discarded, and the model takes a word as input and outputs a word embedding that can be used as input to many NLP tasks. Word2Vec embeddings capture context: words that appear in similar contexts get similar embeddings.
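The difference between the two variations is easiest to see in the training examples each one generates from a sentence. The sketch below only builds those examples and does not train a network; the window size, function names, and sample sentence are assumptions for illustration.

```python
def skip_gram_pairs(tokens, window=2):
    """(target, context) pairs: each surrounding word is predicted from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context_words, target) examples: the target is predicted from its surroundings."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
sg = skip_gram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)

print(sg[:3])   # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
print(cbow[2])  # (['quick', 'fox'], 'brown')
```

Skip-Gram produces one example per (target, neighbor) pair, while CBOW produces one example per position with all neighbors grouped as the input.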

In [23]:
Image(url="CBOW.png", width=500, height=350)

In [24]:
Image(url="pCBOW.png", width=500, height=350)

In [25]:
Image(url="nCBOW.png", width=500, height=350)

In [26]:
Image(url="SG.png", width=500, height=350)

In [27]:
Image(url="pSG.png", width=500, height=350)

In [28]:
Image(url="nSG.png", width=500, height=350)


GloVe is similar to Word2Vec in that it also learns word embeddings, but it does so with matrix factorization techniques rather than by training a predictive neural network. The GloVe model builds a matrix from the global word-to-word co-occurrence counts.
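The co-occurrence matrix that such a model starts from can be sketched as follows. This is a simplified illustration: it uses raw, unweighted counts within a symmetric window and stops before the factorization step. The corpus, window size, and function name are assumptions, and the weighting used by the actual GloVe implementation is not reproduced here.

```python
from collections import defaultdict

# Toy corpus, assumed for illustration.
corpus = ["dog bites man", "man bites dog", "dog eats meat"]

def cooccurrence_counts(documents, window=1):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(float)
    for doc in documents:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            # Look only to the left, and record the pair in both directions.
            for j in range(max(0, i - window), i):
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts

counts = cooccurrence_counts(corpus, window=1)
print(counts[("dog", "bites")])            # 2.0
print(counts[("dog", "eats")])             # 1.0
print(counts.get(("dog", "man"), 0.0))     # 0.0
```

Because the counts are accumulated over the whole corpus before any learning happens, the statistics are global, in contrast to Word2Vec, which processes one local context window at a time.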


It’s not necessary to train your own embeddings, and using pre-trained word embeddings often suffices.