# Introduction to Deep Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

# Language models and Introduction to Text Analysis

![](https://blog.feedly.com/wp-content/uploads/2019/03/Screen-Shot-2019-03-06-at-12.07.45-PM.png)

The main aim of this notebook is to provide an introduction to the wide topic of _text analysis_. In particular, we are going to focus on predictive _language models_.

## Definition

> A __language model__ is a statistical model that assigns a probability to each word of a vocabulary, with or without conditioning on the surrounding context.

More formally, it is a __probability distribution__ over sequences of words.
Hence, given a sequence of words $W = (w_1, \ldots, w_m)$, the LM assigns a probability $p(w_1, \ldots, w_m)$ to the whole sequence.
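
By the chain rule of probability, this joint probability can always be factored into a product of conditional probabilities, one per word. The identity below is standard and is written out only as a reminder; the index $i$ is introduced here for illustration:

$$p(w_1, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1})$$

Several of the approaches listed below, such as unigram and n-gram models, can be read as different approximations of these conditional factors.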

## How does it work?

There are several different probabilistic approaches to modeling language, which vary depending on the purpose of the language model. 

From a technical perspective, the various types differ by the amount of text data they analyze and the math they use to analyze it. For example, a language model designed to generate sentences for an automated Twitter bot may use different math and analyze text data in a different way than a language model designed for determining the likelihood of a search query.

Some common statistical language modeling types are:

* **Unigram**. The unigram is the simplest type of language model. It doesn't look at any conditioning context in its calculations. It evaluates each word or term independently. Unigram models commonly handle language processing tasks such as information retrieval. The unigram is the foundation of a more specific model variant called the query likelihood model, which uses information retrieval to examine a pool of documents and match the most relevant one to a specific query.

* **N-gram**. N-grams are a relatively simple approach to language models. They create a probability distribution for a sequence of n words. The n can be any positive integer, and defines the size of the "gram", or sequence of words being assigned a probability. For example, if n = 5, a gram might look like this: "can you please call me." The model then assigns probabilities using sequences of size n. Basically, n can be thought of as the amount of context the model is told to consider. Unigrams (n = 1), bigrams (n = 2) and trigrams (n = 3) are the most common special cases.

* **Bidirectional**. Unlike n-gram models, which analyze text in one direction (backwards), bidirectional models analyze text in both directions, backwards and forwards. These models can predict any word in a sentence or body of text by using every other word in the text. Examining text bidirectionally increases result accuracy. This type is often utilized in machine learning and speech generation applications. For example, Google uses a bidirectional model to process search queries.

* **Exponential**. Also known as maximum entropy models, this type is more complex than n-grams. Simply put, the model evaluates text using an equation that combines feature functions and n-grams. This type specifies features and parameters of the desired results and, unlike n-grams, leaves the analysis parameters more open: it doesn't specify individual gram sizes, for example. The model is based on the principle of maximum entropy, which states that, among all probability distributions consistent with the observed constraints, the one with the most entropy is the best choice. In other words, the distribution that makes the fewest additional assumptions is preferred. Exponential models are designed to maximize entropy, which minimizes the number of statistical assumptions built into the model. This enables users to better trust the results they get from these models.

* **Continuous space**. This type of model represents words as a non-linear combination of weights in a neural network. The process of assigning weights to a word is also known as *word embedding*. This type becomes especially useful as datasets get increasingly large, because larger datasets often include more unique words. The presence of many unique or rarely used words can cause problems for a linear model like an n-gram. This is because the number of possible word sequences increases, and the patterns that inform results become weaker. By weighting words in a non-linear, distributed way, this model can "learn" to approximate words and is therefore less easily misled by unknown values. Its "understanding" of a given word is not as tightly tethered to the immediate surrounding words as it is in n-gram models.
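
To make the notion of a "gram" from the list above concrete, here is a minimal sketch that slides a window of size n over a tokenized sentence. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not part of any particular library.

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length L contains L - n + 1 windows of size n
    # (none at all if the sequence is shorter than n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization of the example sentence used above
tokens = "can you please call me".split()

bigrams = ngrams(tokens, 2)      # pairs of adjacent words
trigrams = ngrams(tokens, 3)     # triples of adjacent words
five_grams = ngrams(tokens, 5)   # the whole sentence as a single gram

print(bigrams)
```

With n = 1 the windows are single words, which is exactly the unigram case.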

### The Unigram model

A unigram model can be treated as the combination of several one-state finite automata. It splits the probabilities of different terms in a context. For example, from the sentence

```python
sentence = 'I have a dream[END]'
```

a model that conditions on the full history would estimate:

$$\begin{array}{l}
 p(\mathrm{i}) \\
 p(\text{have}|\mathrm{i}) \\
 p(\text{a}|\text{i have}) \\
 p(\text{dream}|\text{i have a}) \\
 p(\text{[END]}|\text{i have a dream}) \\
 \end{array}$$

The unigram model instead estimates each probability independently of the context, _i.e._ $p(\text{a}|\text{i have}) = p(\text{a})$.

The unigram language model makes the following assumptions:
1. The probability of each word is independent of any words before it.
2. Instead, it only depends on the fraction of time this word appears among all the words in the training text. In other words, training the model is nothing but calculating these fractions for all unigrams in the training text.

![](https://miro.medium.com/max/1252/1*qF3ZRwKDgmAG4cUBw3XeGA.png)
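
Training as described above amounts to counting. The following is a minimal sketch under stated assumptions: the two toy sentences, the whitespace tokenization and the function name `train_unigram` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_unigram(tokens):
    """Estimate p(w) as count(w) divided by the total number of training tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy training text (an assumption); each sentence is closed with [END]
training_sentences = ["i have a dream", "i have a plan"]
training_tokens = []
for sentence in training_sentences:
    training_tokens.extend(sentence.split() + ["[END]"])

unigram_probs = train_unigram(training_tokens)
print(unigram_probs)
```

Because every probability is a fraction of the same total, the estimates sum to 1 over the training vocabulary.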


#### Evaluating the model

After estimating all unigram probabilities, we can apply these estimates to calculate the probability of each sentence in the evaluation text: each sentence probability is the product of word probabilities.

![](https://miro.medium.com/max/2000/1*5Y3ue0yFSsc-TAPVuf51pQ.png)

We can go further than this and estimate the probability of the entire evaluation text, such as a book or a Wikipedia page. Under the naive assumption that each sentence in the text is independent of the other sentences, we can decompose this probability as the product of the sentence probabilities, which in turn are nothing but products of word probabilities.

![](https://miro.medium.com/max/2000/1*QKSI3UMUbfJ40Y5fiw97VA.png)
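
The two products described above can be sketched in a few lines. The probability table below is hypothetical (it is not estimated from any real corpus), and the function names are assumptions for illustration.

```python
import math

# Hypothetical unigram probabilities, as if estimated from a training text
unigram_probs = {
    "i": 0.2, "have": 0.2, "a": 0.2,
    "dream": 0.1, "plan": 0.1, "[END]": 0.2,
}


def sentence_probability(sentence, probs):
    """Product of the unigram probabilities of the words, plus the [END] symbol."""
    tokens = sentence.split() + ["[END]"]
    # Raises KeyError for a word that is missing from `probs`
    return math.prod(probs[token] for token in tokens)


def text_probability(sentences, probs):
    """Product of sentence probabilities, assuming sentences are independent."""
    return math.prod(sentence_probability(s, probs) for s in sentences)


p_sentence = sentence_probability("i have a dream", unigram_probs)
p_text = text_probability(["i have a dream", "i have a plan"], unigram_probs)
print(p_sentence, p_text)
```

Even in this tiny example the text probability is already very small; for long texts such products quickly underflow in floating point, which is one practical reason to work with log probabilities.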

##### The role of ending symbols
As outlined above, our language model not only assigns probabilities to words, but also probabilities to all sentences in a text. As a result, to ensure that the probabilities of all possible sentences sum to 1, we need to add the symbol `[END]` to the end of each sentence and estimate its probability as if it were a real word. This is a rather esoteric detail, and you can read more about its rationale [here](https://web.stanford.edu/~jurafsky/slp3/3.pdf).

#### Evaluation metric: average log likelihood

When we take the log on both sides of the above equation for the probability of the evaluation text, the log probability of the text (also called the log likelihood) becomes the sum of the log probabilities of each word. Lastly, we divide this log likelihood by the number of words in the evaluation text to ensure that our metric does not depend on the number of words in the text.

![](https://miro.medium.com/max/1228/1*BNrYUIi-hGFTeBnMy0sesg.png)

As a result, we end up with the metric of average log likelihood, which is simply the average of the log probabilities that the trained model assigns to each word in our evaluation text. In other words, the better our language model is, the higher the probability it assigns, on average, to each word in the evaluation text.

Other common evaluation metrics for language models include [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy) and [perplexity](https://en.wikipedia.org/wiki/Perplexity). However, they still refer to basically the same thing: cross-entropy is the negative of average log likelihood, while perplexity is the exponential of cross-entropy.
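
The relationship between the three metrics can be sketched as follows. This is an illustration under stated assumptions: the probability table is hypothetical, natural logarithms are used (so perplexity is `exp` of the cross-entropy), and the `[END]` symbol is counted as a word when averaging.

```python
import math

# Hypothetical unigram probabilities, as if estimated from a training text
unigram_probs = {
    "i": 0.2, "have": 0.2, "a": 0.2,
    "dream": 0.1, "plan": 0.1, "[END]": 0.2,
}


def average_log_likelihood(sentences, probs):
    """Mean natural-log probability per token, with [END] counted as a token."""
    tokens = [t for s in sentences for t in s.split() + ["[END]"]]
    log_likelihood = sum(math.log(probs[t]) for t in tokens)
    return log_likelihood / len(tokens)


evaluation_sentences = ["i have a plan"]
avg_ll = average_log_likelihood(evaluation_sentences, unigram_probs)
cross_entropy = -avg_ll                # negative of the average log likelihood
perplexity = math.exp(cross_entropy)   # exponential of the cross-entropy

print(avg_ll, cross_entropy, perplexity)
```

A higher average log likelihood corresponds to a lower cross-entropy and a lower perplexity, so the three numbers always rank models in the same order.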

#### What about unknown words?

There is a big problem with the above unigram model: for a unigram that appears in the evaluation text but not in the training text, its count in the training text — hence its probability — will be zero. This will completely implode our unigram model: the log of this zero probability is negative infinity, leading to a negative infinity average log likelihood for the entire model!

##### Laplace smoothing
To combat this problem, we will use a simple technique called [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing):

1. We add an artificial unigram called `[UNK]` to the list of unique unigrams in the training text. This represents all the unknown tokens that the model might encounter during evaluation. Of course, the count for this unigram will be zero in the training set, and the unigram vocabulary size — number of unique unigrams — will increase by 1 after this new unigram is added.

2. Next, we add a pseudo-count of $k$ to all the unigrams in our vocabulary. The most common value of $k$ is 1, and this goes by the intuitive name of “add-one smoothing”.

![](https://miro.medium.com/max/1400/1*QoX9dDVsj2M3QDLFGbEWyg.png)

As a result, for each unigram, the numerator of the probability formula will be the raw count of the unigram plus $k$, the pseudo-count from Laplace smoothing. Furthermore, the denominator will be the total number of words in the training text plus the unigram vocabulary size times $k$. This is because each unigram in our vocabulary has $k$ added to its count, which adds a total of $k \times \text{vocabulary size}$ to the total number of unigrams in the training text.
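
The two steps above can be sketched as follows. This is a minimal illustration, not the implementation from any specific repository: the toy training tokens and the names `train_smoothed_unigram` and `lookup` are assumptions.

```python
from collections import Counter


def train_smoothed_unigram(tokens, k=1):
    """Add-k smoothed unigram estimates, with [UNK] added to the vocabulary."""
    counts = Counter(tokens)
    # Step 1: extend the vocabulary with [UNK], whose raw count is zero
    vocabulary = set(counts) | {"[UNK]"}
    # Step 2: add the pseudo-count k to every unigram in the vocabulary
    denominator = sum(counts.values()) + k * len(vocabulary)
    return {word: (counts[word] + k) / denominator for word in vocabulary}


def lookup(word, probs):
    """Probability of `word`, falling back to [UNK] for unseen words."""
    return probs.get(word, probs["[UNK]"])


# Toy training text (an assumption): 10 tokens, 6 distinct unigrams
training_tokens = "i have a dream [END] i have a plan [END]".split()
smoothed = train_smoothed_unigram(training_tokens, k=1)

print(lookup("i", smoothed))          # seen twice in training
print(lookup("nightmare", smoothed))  # unseen, mapped to [UNK]
```

Here the denominator is 10 training tokens plus 7 vocabulary entries (6 observed unigrams and `[UNK]`), so the smoothed estimates still sum to 1.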

##### Effect of Laplace smoothing
Each time the unigram model encounters an unknown word in the evaluation text, it converts that word to the unigram `[UNK]`. The latter unigram has a count of zero in the training text but, thanks to the pseudo-count $k$, it now has a non-zero probability:

![](https://miro.medium.com/max/832/1*lbzIr0LY7VD4pdEbVrE7Iw.png)

Furthermore, Laplace smoothing also shifts some probability from the common tokens to the rare tokens. Imagine two unigrams having counts of 2 and 1, which become 3 and 2 respectively after add-one smoothing. The more common unigram previously had double the probability of the less common unigram, but now only has 1.5 times the probability of the other one.

![](https://miro.medium.com/max/1228/1*n06c9a0GWyAJ45kKrxnkbA.png)


This can be seen from the estimated probabilities of the 10 most common unigrams and the 10 least common unigrams in the training text: after add-one smoothing, the former lose some of their probabilities, while the probabilities of the latter increase significantly relative to their original values. In short, this evens out the probability distribution of unigrams, hence the term “smoothing” in the method’s name.
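
The two-unigram example above (counts of 2 and 1) can be checked with exact fractions. This is a small sketch under the assumption that these two unigrams make up the whole training text; the labels `common` and `rare` are hypothetical.

```python
from fractions import Fraction

# Hypothetical training counts for two unigrams
counts = {"common": 2, "rare": 1}
k = 1                                # add-one smoothing
vocabulary_size = len(counts) + 1    # the extra entry is [UNK]
total = sum(counts.values())

unsmoothed = {w: Fraction(c, total) for w, c in counts.items()}
smoothed = {
    w: Fraction(c + k, total + k * vocabulary_size) for w, c in counts.items()
}

ratio_before = unsmoothed["common"] / unsmoothed["rare"]
ratio_after = smoothed["common"] / smoothed["rare"]
print(ratio_before, ratio_after)  # prints: 2 3/2
```

The common unigram drops from 2/3 to 1/2, the rare one stays at 1/3, and the remaining 1/6 goes to `[UNK]`, which is the "evening out" described above.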

### Coding!

Building a unigram model is really instructive, so the curious reader is strongly encouraged to go through this excellent [GitHub repository](https://github.com/seismatica/ngram/blob/master/analysis/part1.ipynb).