# Structure of the book


![Structure](Images/Structure.jpg)



# Chapter 1 : NLP - A Primer

## NLP Tasks
![NLP tasks](Images/NLPTasks.jpg)


* **Language modeling** : Learn the probability of a sequence of words appearing in a given language (speech recognition, OCR, ...)

* **Text classification** : Bucketing the text into a known set of categories based on its content.

* **Information extraction** : Extracting relevant information from text.

* **Information retrieval** : Finding documents relevant to a user query from a large collection.

* **Conversational agent** : Building dialogue systems

* **Text summarization** : Create short summaries of longer documents

* **Question answering** : Building a system that can automatically answer questions.

* **Machine translation** : Converting a piece of text from one language to another.

* **Topic modeling** : Uncovering the topical structure of a large collection of documents.
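
To make the language modeling task above concrete, here is a minimal sketch of a bigram model that estimates the probability of a word sequence from counts. The toy corpus, the function names, and the `<s>` / `</s>` boundary markers are assumptions made for illustration; there is no smoothing, so any unseen bigram gives a probability of zero.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count context words and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """P(sentence) as a product of maximum-likelihood bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0                            # context never seen in training
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

# Hypothetical toy corpus
corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams = train_bigram_model(corpus)

p_seen = sentence_probability("the cat sat", unigrams, bigrams)
p_unseen = sentence_probability("sat the cat", unigrams, bigrams)
print(p_seen, p_unseen)
```

A word order that appears in the corpus gets a non-zero probability, while a scrambled order gets zero. This is the property that speech recognition and OCR systems use to rank candidate transcriptions.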


## Language building blocks

We can think of human language as composed of four major building blocks: phonemes,
morphemes and lexemes, syntax, and context. NLP applications need knowledge
of different levels of these building blocks, starting from the basic sounds
of language (phonemes) to texts with some meaningful expressions (context).

![Language building blocks](Images/BuildingBlocksLanguage.jpg)

* **Phonemes** : Smallest unit of sound in a language. Standard English has 44 phonemes. -> Speech understanding.

* **Morphemes and lexemes** :

    - Morphemes : Smallest unit of language that has meaning. All prefixes and suffixes are morphemes. Ex: un + break + able

    - Lexemes : Structural variations of morphemes related to one another by meaning. For example, “run” and “running” belong to the same lexeme form.

    -> Tokenization, stemming, part-of-speech tagging

* **Syntax** : Set of rules to construct grammatically correct sentences out of words and phrases in a language. Syntax gives language a hierarchical structure, with words at the lowest level, followed by part-of-speech tags, followed by phrases, and ending with a sentence at the highest level. In Figure 1-6 of the book, both sentences have a similar structure and hence a similar syntactic parse tree.

* **Context** : How various parts of a language come together to convey a particular meaning. It includes long-term references, world knowledge, common sense, and the literal meaning of words and phrases. -> Sarcasm detection, summarization and topic modeling.


## NLP Challenges

* Ambiguity

* Common knowledge : It is the set of all facts that most humans are aware of. In any conversation, it is assumed that these facts are known.

* Creativity : Various styles, dialects, genres, and variations.

* Diversity across languages : A solution that works for one language might not work at all for
another language.


## Approaches to NLP

### Heuristics-Based NLP

Similar to other early AI systems, early attempts at designing NLP systems were based
on building rules for the task at hand. This required that the developers had some
expertise in the domain to formulate rules that could be incorporated into a program.

***Example :***
 * ***Lexicon-based sentiment analysis***
 
 * ***WordNet, a database of words and the semantic relationships between them*** (synonyms, hyponyms, meronyms)

 * ***Common sense world knowledge : Open Mind Common sense***

 * ***Regex***
 
 * ***Context-free grammar (CFG)*** : A type of formal grammar used to model natural languages. It can capture more complex and hierarchical information than a regex can. -> Used in rule-based systems like GATE (text extraction for closed and well-defined domains).
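
As an illustration of the heuristic approach, lexicon-based sentiment analysis can be sketched in a few lines: count the polarity of known words and apply a hand-written rule for negation. The lexicon, the negation list, and the scores below are hypothetical toy values, not taken from any real resource.

```python
import re

# Hypothetical toy lexicon: word -> polarity score
SENTIMENT_LEXICON = {
    "good": 1, "great": 2, "love": 2,
    "bad": -1, "terrible": -2, "hate": -2,
}
# Hand-written rule: a negation word flips the next sentiment word
NEGATIONS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum the polarity of known words, flipping the sign after a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())   # regex-based tokenization
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        value = SENTIMENT_LEXICON.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False                           # negation is consumed
    return score

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label(lexicon_sentiment("I love this great movie")))
print(label(lexicon_sentiment("This is not good")))
```

The rules are easy to read and to debug, but every new phenomenon (sarcasm, intensifiers, domain-specific vocabulary) needs another hand-written rule, which is why this approach suits closed and well-defined domains.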

### Machine learning for NLP

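Machine learning techniques such as classification and regression are applied to NLP tasks. For example, classification can assign news articles to a topic, and regression can estimate the price of a stock based on social media discussion.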

* ***Naive Bayes*** : Naive Bayes is a classic algorithm for classification tasks that mainly relies on Bayes’ theorem (as is evident from the name). Using Bayes’ theorem, it calculates the probability of observing a class label given the set of features for the input data. A characteristic of this algorithm is that it assumes each feature is independent of all
other features.

* ***Support vector machine*** : The support vector machine (SVM) is another popular classification algorithm. The goal in any classification approach is to learn a decision boundary that acts as a separation between different categories of text (e.g., politics versus sports in our news classification example). This decision boundary can be linear or nonlinear (e.g., a circle).

* ***Hidden Markov Model*** : A statistical model that assumes there is an underlying, unobservable process with hidden states that generates the data; that is, we can only observe the data once it is generated. An HMM then tries to model the hidden states from this data.

For example, consider the NLP task of part-of-speech (POS) tagging, which deals with assigning part-of-speech tags to sentences. HMMs are used for POS tagging of text data. Here, we assume that the text is generated according to an underlying grammar, which is hidden underneath the text. Along with this, HMMs also make the Markov assumption, which means that each hidden state is dependent on the previous state(s).
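
The POS tagging idea above can be sketched with the Viterbi algorithm, a standard way to find the most probable sequence of hidden states in an HMM. All tags, words, and probabilities below are made-up toy values chosen for illustration; a real tagger would estimate them from an annotated corpus.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observed words."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            # Markov assumption: only the previous state matters
            column[s] = max(
                (best[t - 1][p][0] * trans_p[p][s] * emit_p[s].get(words[t], 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy parameters
states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.6, "walk": 0.3, "barks": 0.1},
    "VERB": {"barks": 0.7, "walk": 0.25, "dog": 0.05},
}

tags = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
print(tags)
```

Here the words are the observed data and the tags are the hidden states: the transition table plays the role of the hidden grammar, and the emission table says how likely each tag is to produce each word.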

* ***Conditional random fields (CRF)*** : Another algorithm for sequential data. Conceptually, a CRF performs a classification task on each element in the sequence.

Imagine the same example of POS tagging, where a CRF can tag word by word, classifying each word into one of the parts of speech from the pool of all POS tags.
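
Going back to Naive Bayes from the list above, here is a minimal sketch of a multinomial Naive Bayes text classifier for the politics versus sports news example. The training sentences and function names are hypothetical. Add-one (Laplace) smoothing is an assumption added here so that unseen word and class combinations do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(text, class_counts, word_counts, vocabulary):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        log_prob = math.log(class_counts[label] / total_docs)   # prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue                                        # ignore unknown words
            # Independence assumption: each word contributes its own factor
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            log_prob += math.log(likelihood)
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy training set
training_data = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("parliament passed the budget", "politics"),
    ("the minister won the election", "politics"),
]
model = train_naive_bayes(training_data)

print(predict("striker scored a goal", *model))
print(predict("minister passed the budget", *model))
```

Summing log probabilities word by word is exactly where the "naive" independence assumption shows up: each word is scored on its own, regardless of its neighbors.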

### Deep Learning for NLP

* ***Recurrent neural network*** : Language is inherently sequential. A sentence in any language flows from one direction to another (e.g., English reads from left to right). Thus, a model that can progressively read an input text from one end to another can be very useful for language understanding. Recurrent neural networks (RNNs) are specially designed to keep such sequential processing and learning in mind. RNNs have neural units that are capable of remembering what they have processed so far. This memory is temporal, and the information is stored and updated with every time step as the RNN reads the next word in the input.

* ***Long short-term memory*** : Despite their capability and versatility, RNNs suffer from the problem of forgetful memory: they cannot remember longer contexts and therefore do not perform well when the input text is long, which is typically the case with text inputs. Long short-term memory networks (LSTMs), a type of RNN, were invented to mitigate this shortcoming. LSTMs circumvent the problem by letting go of the irrelevant context and only remembering the part of the context that is needed to solve the task at hand. This relieves the load of remembering a very long context in one vector representation.
Gated recurrent units (GRUs) are another variant of RNNs that are used mostly in language generation.

![Structure](Images/LTSM.jpg)
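
The "memory updated at every time step" idea behind RNNs can be sketched as a single recurrent update, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b). This is a plain (Elman-style) RNN cell written without any library; the weights and the 2-dimensional word vectors are arbitrary toy values, and the gates that LSTMs and GRUs add are deliberately left out.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: mix the current input with the previous hidden state."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))            # current word
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))  # memory so far
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b_h):
    """Read the sequence from left to right, carrying the hidden state along."""
    h = [0.0] * len(b_h)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# Hypothetical toy weights and word vectors (first and third word are identical)
W_xh = [[0.5, -0.3], [0.2, 0.8]]
W_hh = [[0.1, 0.4], [-0.2, 0.3]]
b_h = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

states = run_rnn(inputs, W_xh, W_hh, b_h)
for t, h in enumerate(states):
    print(t, h)
```

The same word vector appears at steps 0 and 2 but produces different hidden states, because the second occurrence is combined with the memory of what was read before it.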

* ***Convolutional neural networks*** :
CNNs have also seen success in NLP, especially in text-classification tasks. One can replace each word in a sentence with its corresponding word vector, and all vectors are of the same size (d) (refer to “Word Embeddings” in Chapter 3). Thus, they can be stacked one over another to form a matrix or 2D array of dimension n ✕ d, where n is the number of words in the sentence and d is the size of the word vectors. This matrix can now be treated similar to an image and can be modeled by a CNN. The main advantage CNNs have is their ability to look at a group of words together using a context window.

![Structure](Images/CNN.jpg)
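
The n ✕ d sentence matrix and the context window can be sketched as follows: one filter spanning two consecutive words slides down the matrix, and max pooling keeps its strongest response. The word vectors and filter weights are made-up toy values (d = 3); a real model would learn many filters of several window sizes.

```python
def conv1d_over_words(sentence_matrix, kernel, bias=0.0):
    """Slide a (window x d) filter down an (n x d) sentence matrix, then apply ReLU."""
    window = len(kernel)
    n = len(sentence_matrix)
    feature_map = []
    for start in range(n - window + 1):
        total = bias
        for offset in range(window):
            word_vec = sentence_matrix[start + offset]
            total += sum(w * x for w, x in zip(kernel[offset], word_vec))
        feature_map.append(max(0.0, total))   # ReLU activation
    return feature_map

def max_pool(feature_map):
    """Keep the strongest response of the filter over the whole sentence."""
    return max(feature_map)

# Hypothetical toy word vectors with d = 3
embeddings = {
    "the": [0.1, 0.0, 0.2],
    "movie": [0.5, 0.3, 0.1],
    "was": [0.0, 0.1, 0.0],
    "great": [0.9, 0.8, 0.4],
}
sentence = ["the", "movie", "was", "great"]
sentence_matrix = [embeddings[w] for w in sentence]   # shape n x d = 4 x 3

# One filter looking at a context window of 2 consecutive words
kernel = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]

feature_map = conv1d_over_words(sentence_matrix, kernel)
pooled = max_pool(feature_map)
print(feature_map, pooled)
```

Each entry of the feature map summarizes one group of adjacent words, which is the context-window advantage mentioned above.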

* ***Transformers*** : 

Transformers model the textual context, but not in a sequential manner. Given a word in the input, a transformer looks at all the words around it (known as self-attention) and represents each word with respect to its context. For example, the word “bank” can have different meanings depending on the context in which it appears. If the context talks about finance, then “bank” probably denotes a financial institution.
On the other hand, if the context mentions a river, then it probably indicates a bank of the river.
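
The "bank" example can be sketched with a stripped-down version of scaled dot-product self-attention. This is a simplification: the learned query, key, and value projections and the multiple heads of a real transformer are omitted, so the word vectors themselves act as queries, keys, and values. The 2-dimensional vectors are hypothetical toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each word attends to every word in the sentence, itself included."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # Contextual representation: weighted average of all word vectors
        context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy word vectors
bank, money, river = [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]

out_finance, w_finance = self_attention([bank, money])
out_river, w_river = self_attention([bank, river])

print("bank next to money:", out_finance[0])
print("bank next to river:", out_river[0])
```

The input vector for "bank" is identical in both sentences, but its output representation is pulled toward "money" in one case and toward "river" in the other. All words are processed at once rather than one after another.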

Recently, large transformers have been used for transfer learning with smaller downstream tasks. Transfer learning is a technique in AI where the knowledge gained while solving one problem is applied to a different but related problem. With transformers, the idea is to train a very large transformer model in an unsupervised manner (known as pre-training) to predict a part of a sentence given the rest of the content so that it can encode the high-level nuances of the language in it.

Example : BERT

![Structure](Images/BERT.jpg)

* ***Autoencoders*** : 

A different kind of network that is used mainly for learning a compressed vector representation of the input. For example, if we want to represent a text by a vector, what is a good way to do it? We can learn a mapping function from input text to the vector. To make this mapping function useful, we “reconstruct” the input back from the vector representation. This is a form of unsupervised learning since you don’t need human-annotated labels for it.

![Structure](Images/Autoencoder.jpg)
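
The compress-then-reconstruct idea can be sketched with the smallest possible autoencoder: a linear encoder that maps a 2-dimensional input to a single number, and a linear decoder that maps it back. The data, initial weights, learning rate, and number of epochs are arbitrary toy choices; a real text autoencoder would use neural layers over token sequences.

```python
def reconstruction_error(data, enc, dec):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        code = enc[0] * x[0] + enc[1] * x[1]             # compressed representation
        total += sum((dec[i] * code - x[i]) ** 2 for i in range(2))
    return total / len(data)

def train_autoencoder(data, enc, dec, epochs=200, lr=0.01):
    """Train a linear 2 -> 1 -> 2 autoencoder with plain SGD on squared error."""
    enc, dec = list(enc), list(dec)                      # do not modify the inputs
    for _ in range(epochs):
        for x in data:
            code = enc[0] * x[0] + enc[1] * x[1]
            err = [dec[i] * code - x[i] for i in range(2)]
            # Gradient of the loss with respect to the code, using the current decoder
            grad_code = sum(2 * err[i] * dec[i] for i in range(2))
            for i in range(2):
                dec[i] -= lr * 2 * err[i] * code
            for j in range(2):
                enc[j] -= lr * grad_code * x[j]
    return enc, dec

# Toy data: 2-d points that all lie on one line, so one number is enough to describe them
data = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [0.5, 0.5]]
init_enc, init_dec = [0.5, 0.1], [0.1, 0.5]

error_before = reconstruction_error(data, init_enc, init_dec)
enc, dec = train_autoencoder(data, init_enc, init_dec)
error_after = reconstruction_error(data, enc, dec)
print(error_before, error_after)
```

No labels are involved: the only training signal is how well the input can be rebuilt from its compressed code, which is what makes the approach unsupervised.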

Problems with deep learning for NLP:

* Overfitting on small datasets

* Few-shot learning and synthetic data generation

* Domain adaptation : If we utilize a large DL model that is trained on datasets originating from some common domains (e.g., news articles) and apply the trained model to a newer domain that is different from the common domains (e.g., social media posts), it may yield poor performance.

* Interpretable models

* Common sense and world knowledge

* Cost

* On-device deployment
