# Data Preprocessing Techniques

Data preprocessing involves cleaning, normalizing, and converting user prompts or raw text into a format that models can process effectively. It enhances model performance and helps standardize textual data for analysis.

## Common Preprocessing Techniques

1. **Tokenization**  
   Breaking down text into smaller units called tokens, which can be words, subwords, or characters. This step ensures the text is manageable for the model. Tokenization approaches may vary by language, as some languages (e.g., Chinese) require special segmentation.

2. **Stemming**  
   Reducing words to their root or base form by stripping suffixes and prefixes using heuristic rules. Stemming does not guarantee meaningful root forms (e.g., *flies* → *fli*), as it relies on simple rules rather than linguistic context.

3. **Lemmatization**  
   Reducing a word to its base or dictionary form (lemma) using linguistic analysis. Unlike stemming, lemmatization considers the grammatical role and context of a word, making it more accurate but computationally expensive.  

   *Example*: *am, are, is* → *be* (lemma).

4. **Normalization**  
   Converting text into a standardized form to ensure consistency and improve model interpretability. Common normalization steps include:  
   - Lowercasing text (e.g., *HELLO* → *hello*).  
   - Removing punctuation or special characters.  
   - Removing stop words (e.g., *the, is, and*), though this can sometimes hurt performance in tasks where stop words carry meaning.

5. **Part-of-Speech (POS) Tagging**  
   Assigning grammatical categories (e.g., nouns, verbs, adjectives) to each word in a sentence. While not strictly a preprocessing step, POS tagging can support other tasks like lemmatization or dependency parsing by providing grammatical structure and improving semantic understanding.

<img src="https://miro.medium.com/v2/resize:fit:1024/1*pzjECYWP8WOWhwfCjebZVw.png" width=700>

**Note**: These techniques are often combined to tailor preprocessing to the specific requirements of the NLP task. Each technique has its own strengths and trade-offs, which should be carefully considered for optimal results.
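
As a rough illustration of how tokenization and normalization can be chained, here is a minimal sketch using only the Python standard library. The `STOP_WORDS` set and the `preprocess` function are hypothetical names made up for this example, and whitespace tokenization is an assumption that would not work for languages such as Chinese.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "is", "and", "a", "an", "of"}

def preprocess(text, remove_stop_words=True):
    """Lowercase, strip punctuation, split on whitespace, optionally drop stop words."""
    text = text.lower()                    # normalization: lowercasing
    text = re.sub(r"[^\w\s]", " ", text)   # normalization: replace punctuation with spaces
    tokens = text.split()                  # tokenization: whitespace-separated word tokens
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

tokens = preprocess("The cat is on the mat, and the dog barks!")
```

Stemming and lemmatization are left out here because they usually rely on a dedicated NLP library.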

# Feature Extraction

Feature extraction involves transforming raw input data into numerical representations that retain meaningful information, enabling models to process and analyze the data. In NLP, feature extraction techniques help capture the importance, relationships, and patterns of words or phrases in a corpus.

## Common Feature Extraction Techniques

1. **Bag of Words (BoW)**  
   Bag of Words represents text as a collection of its words, ignoring grammar and word order but retaining word frequency. It is simple and quick to compute, but the lack of contextual understanding and high-dimensional sparsity can limit its effectiveness. Additionally, non-informative terms (e.g., stop words) may appear frequently and dominate the representation.  

   <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*sK9pIrYSFfqlsDbC.png" width=500>

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**  
   TF-IDF evaluates the importance of a term in a document relative to a collection of documents. It balances local importance (frequency within a document) and global rarity (occurrence across all documents).  
   - **Term Frequency (TF)**: Measures the frequency of a term in a document (note that the TF score can be different for the same word in different documents).  
   - **Inverse Document Frequency (IDF)**: Reduces the weight of commonly occurring terms that appear across many documents, emphasizing rare but important terms.  
   - **TF-IDF**: Combines TF and IDF to assign higher scores to terms that are frequent in a document but rare in the corpus (each document will have its own TF-IDF score)

   **Formulas**:  

   $$\text{TF}(t,d)=\frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$  

   $$\text{IDF}(t, D) = \log{\frac{\text{Total number of documents in } D}{\text{Number of documents in which term } t \text{ appears} + 1}}$$  

   $$\text{TF-IDF}(t, d, D) = \text{TF}(t,d) \cdot \text{IDF}(t, D)$$  

   <img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*bK8wXF-TtjQpCiVo.png" width=500>

3. **N-Grams**  
   N-grams are contiguous sequences of `n` items (e.g., characters, words, or tokens) extracted from text. They capture small amounts of context depending on the value of `n`, with larger `n` representing longer sequences and more context. However, increasing `n` can lead to sparsity and computational challenges. N-grams are commonly used in text representation, feature extraction, and language modeling.

   <img src="https://cdn.botpenguin.com/assets/website/N_Gram_feb7947286.png" width=300>

4. **Word Embeddings**  
   Word embeddings are powerful feature extraction techniques that represent words as dense, continuous vectors in a high-dimensional space. These vectors capture semantic meaning, syntactic properties, and relationships between words. Unlike sparse representations (e.g., BoW, TF-IDF), embeddings leverage distributional semantics to model words based on their context. Popular embedding methods include Word2Vec, GloVe, and contextual embeddings like BERT, which capture word meanings in different contexts.
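
The sparse representations above (Bag of Words, n-grams, TF-IDF) can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy `corpus` is invented, documents are assumed to be tokenized already, and the IDF uses the `+ 1` in the denominator exactly as written in the formula above (libraries often use different smoothing).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequency(term, doc):
    """TF(t, d): occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    """IDF(t, D) = log(|D| / (number of documents containing t + 1))."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (containing + 1))

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus of already-tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "away"],
    ["a", "bird", "flew"],
]

bag_of_words = Counter(corpus[0])              # BoW: word -> count, word order ignored
bigrams = ngrams(corpus[0], 2)                 # word-level 2-grams of the first document
score_the = tf_idf("the", corpus[0], corpus)   # frequent across documents -> low score
score_cat = tf_idf("cat", corpus[0], corpus)   # rarer across documents -> higher score
```

Note that with this particular smoothing a term that occurs in every document gets a slightly negative IDF, which is one reason implementations vary in how they smooth.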


# One-hot encoding
Each word in the vocabulary is represented as a unique vector whose length equals the vocabulary size, with all entries set to 0 except the one at the word's index, which is set to 1

<img src="https://miro.medium.com/v2/resize:fit:1400/1*GsKLFAlzoNeIKo-1gz1h_Q.png">

Issues:
1. One-hot vectors treat every word as independent, so they encode no relationship between words (the dot product of any two distinct one-hot vectors is 0)
2. The vectors become very high-dimensional when the vocabulary is large
3. Computation is inefficient because almost all entries are 0
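
A minimal sketch of one-hot encoding, assuming a small made-up `vocabulary`; the helper names are hypothetical. It also shows the first issue above: distinct words always have a dot product of 0.

```python
def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocabulary = ["cat", "dog", "king", "queen"]   # hypothetical toy vocabulary
o_king = one_hot("king", vocabulary)
o_queen = one_hot("queen", vocabulary)
relatedness = dot(o_king, o_queen)             # 0 for any two different words
```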

# Word embeddings
Word embeddings convert words into numerical representations (vectors) in a continuous, dense vector space whose dimension is low relative to the vocabulary size. The goal is to capture the semantic meaning of words such that the distance and direction between vectors reflect the similarity and relationships among the corresponding words. Word embeddings are used as inputs to downstream models

<img src="https://miro.medium.com/v2/resize:fit:1056/1*GkJpulpSAIm6GTeC1dVR_w.png" width=500>

## Embedding similarity
The distance and direction between vectors represent their relationship

Given two word embeddings $e_1$, $e_2$, the similarity between the two is calculated as

$$\text{Similarity} = \frac{e_1 \cdot e_2}{||e_1|| ||e_2||}$$

The similarity value lies between -1 and 1: 1 means the two vectors point in the same direction (maximally similar), -1 means they point in opposite directions, and 0 means they are orthogonal (no relationship)

<img src="https://miro.medium.com/v2/resize:fit:1400/1*sXNXYfAqfLUeiDXPCo130w.png" width=600>
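
The cosine similarity formula above can be written directly in plain Python. The function name and the example vectors are made up for illustration.

```python
import math

def cosine_similarity(e1, e2):
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    return dot / (norm1 * norm2)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])       # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])           # 0
opposite_direction = cosine_similarity([1.0, 2.0], [-1.0, -2.0]) # close to -1
```

The function assumes neither vector is all zeros; a real implementation would need to guard against division by zero.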

## Embedding matrix
For a given word $w$ with one-hot encoding vector $o_w$, the embedding matrix $E$ maps the one-hot representation $o_w$ to its embedding $e_w$ as follows:

$$e_w = Eo_w$$

$E$: the embedding matrix, whose number of rows equals the number of embedding features and whose number of columns equals the number of words in the one-hot encoding dictionary. Because $o_w$ contains a single 1, the product simply selects the column of $E$ that corresponds to $w$

<img src="https://miro.medium.com/v2/resize:fit:1400/1*Bq6lIOdjCK172I1V04RwLA.png" width=500>
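
To make $e_w = Eo_w$ concrete, the sketch below multiplies a small made-up embedding matrix by a one-hot vector and compares the result with a direct column lookup. The matrix values are arbitrary assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical embedding matrix: 2 features (rows) x 3 vocabulary words (columns).
E = [
    [0.1, 0.4, 0.7],
    [0.2, 0.5, 0.8],
]
o_w = [0, 1, 0]                     # one-hot vector for the second word
e_w = matvec(E, o_w)                # embedding via matrix multiplication
column = [row[1] for row in E]      # the same embedding via direct lookup
```

Since the multiplication only picks out one column, implementations typically use the lookup instead of the full matrix product.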

### t-SNE (t-distributed Stochastic Neighbor Embedding)
t-SNE is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/t-sne-en.png?411bf8ff5d5c06c6e90cae95f7110ff1" width=500>


## Apply word embedding
Word embeddings are widely used for transfer learning

Steps:
1. Train a model to learn word embeddings from a large corpus of text, or use pre-trained embeddings
2. Use the word embeddings on a smaller training set to complete the given task
3. (Optional) Fine-tune the word embeddings on the data of the new task

# Learning word embedding
Note: the individual components of the learned word embeddings are not necessarily interpretable, since the axes chosen by the algorithm do not necessarily align with interpretable axes

## Word2vec
Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words

### Skip-gram
The skip-gram word2vec model is a supervised model that takes in a context word, $c$, and predicts a target word, $t$, where $c$ is randomly selected within a window around $t$. First, we convert $c$ from its one-hot vector, $o_c$, into its embedding, $e_c$, using the embedding matrix, $E$. Then we feed the embedding, $e_c$, into a softmax unit to predict, for each word in a vocabulary of size $m$, the probability that it is the target word, where

$$P(t|c) = \frac{e^{\theta^T_te_c}}{\sum^{m}_{j=1}e^{\theta^T_je_c}}$$

$P(t|c)$: the probability of a word being the target word, $t$, given the context word, $c$

$\theta_t$: the parameters in the softmax unit that associate with $t$

Loss function:
$$L(\hat y, y) = - \sum_{i=1}^{m}y_i \log(\hat y_i)$$

We can then use gradient descent to train the parameters within the softmax unit and the embedding matrix, $E$

When sampling the context word, $c$, a heuristic weighting is used instead of sampling directly at random from the corpus, which would over-select the most common words

Note: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. To solve this, we can use a hierarchical softmax classifier instead of a normal classifier to speed up the computation

<img src="https://aegis4048.github.io/images/featured_images/skip-gram.png" width=700>
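
The softmax probability $P(t|c)$ and the loss above can be sketched as follows. The parameter values in `theta` and `e_c` are invented for illustration, and the vocabulary size is only 3 so the result is easy to check by hand.

```python
import math

def softmax(scores):
    shift = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def skipgram_probabilities(theta, e_c):
    """P(t|c) for every word t: softmax over the scores theta_j^T e_c."""
    scores = [sum(t * e for t, e in zip(theta_j, e_c)) for theta_j in theta]
    return softmax(scores)

def cross_entropy(probabilities, target_index):
    """With a one-hot label only the true target's term survives the sum."""
    return -math.log(probabilities[target_index])

# Hypothetical softmax parameters (one row theta_j per vocabulary word) and context embedding.
theta = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
e_c = [2.0, 0.0]
probs = skipgram_probabilities(theta, e_c)
loss = cross_entropy(probs, 0)
```

The denominator touches every row of `theta`, which is the cost that grows with the vocabulary size.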

### Negative sampling
Use the same approach as the skip-gram model to sample a context word, $c$, and a target word, $t$, and treat this pair as the positive example. Then pair $c$ with $k$ randomly selected words from the vocabulary to create $k$ negative examples

To train the embedding matrix, we convert both $c$ and $t$ into their embeddings and feed them into a network of logistic units that estimates how likely the context word and the target word are to appear together, where

$$P(y=1|c, t) = \sigma(\theta_t^Te_c)$$

$P(y=1|c, t)$: the probability that the context word and the target word appear together

At each iteration, the model is trained on only one positive example and $k$ negative examples, so only $k + 1$ binary classifiers are updated instead of a softmax over the whole vocabulary. This makes training significantly faster than the softmax-based skip-gram model

<img src="https://aegis4048.github.io/jupyter_images/neg_opt_1.png">
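
A small sketch of how the training pairs and the logistic probability could look. The function names and the toy vocabulary are hypothetical, and drawing negatives uniformly without replacement is a simplifying assumption (word2vec implementations typically sample negatives from a frequency-based distribution).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_training_pairs(context, target, vocabulary, k, rng):
    """Return one positive (label 1) and k negative (label 0) examples for a context word."""
    pairs = [(context, target, 1)]
    candidates = [w for w in vocabulary if w != target]
    for word in rng.sample(candidates, k):   # assumption: uniform sampling, no repeats
        pairs.append((context, word, 0))
    return pairs

def pair_probability(theta_t, e_c):
    """P(y = 1 | c, t) = sigmoid(theta_t^T e_c)."""
    return sigmoid(sum(t * e for t, e in zip(theta_t, e_c)))

vocabulary = ["orange", "juice", "king", "book", "the", "of"]   # hypothetical
pairs = make_training_pairs("orange", "juice", vocabulary, k=3, rng=random.Random(0))
```

A randomly drawn "negative" word may occasionally be a genuine neighbour of the context word; this label noise is usually tolerated.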

## GloVe
The GloVe model uses a co-occurrence matrix $X$, where each $X_{i,j}$ denotes the number of times that a target word $i$ appears in the context of the word $j$

Its cost function is
$$J(\theta) = \frac{1}{2}\sum_{i, j=1}^{m}f(X_{i,j})(\theta_i^Te_j + b_i + b'_j - \log(X_{i,j}))^2$$

$f$: a weighting function such that $f(X_{i,j}) = 0$ when $X_{i,j} = 0$ (do not add loss if two words are not in context)

Initially, we initialize $e$ and $\theta$ randomly. After training, given the symmetry that $e$ and $\theta$ play in this model, the final word embedding $e_w^{final} = \frac{e_w + \theta_w}{2}$

# Sentiment classification
Sentiment classification predicts the sentiment expressed in a given sentence

To make predictions, we first convert every word in the sentence from its one-hot vector into an embedding. Then we take the average or sum of the embeddings and feed it into a softmax unit for classification. A weakness of this method is that it ignores word order, so it struggles with negation, especially multiple negations in the same sentence

Another method is to feed the word embeddings of the sentence into an RNN one time step at a time and feed the activation from the last time step into a softmax unit for classification

<img src="https://www.tensorflow.org/static/text/tutorials/images/bidirectional.png" width=400>

# Recurrent neural network
An RNN is specialized for tasks that involve sequential inputs, such as speech and language. RNNs are called recurrent because they perform the same computation for every element of a sequence, with each output depending on the previous computations

Compared to standard feed-forward neural networks, RNNs have the advantages of
* Handling inputs and outputs of different lengths
* Sharing features learned across different positions of the sequence
* Processing data based on its order and context

## Notations
$x^{(i)<t>}$: the input from the $i$th training example at the time step $t$

$y^{(i)<t>}$: the output for the $i$th training example at the time step $t$ (not necessarily a fixed size)

$T_x^{(i)}$: size of the input of the $i$th training example

$T_y^{(i)}$: size of the output of the $i$th training example

$a^{<t>}$: the activation at the time step $t$ ($a^{<0>} = \vec{0}$)

## Representing words
Each word is represented as a one-hot vector based on a dictionary

## Architecture
<img src="https://miro.medium.com/v2/resize:fit:1400/1*SKGAqkVVzT6co-sZ29ze-g.png">

The image shows the unrolled version of an RNN. The actual RNN reuses one cell (the same parameters) at every time step

At each time step, the network takes in the input at the current time step, $x^{<t>}$, and the activation from the previous time step, $a^{<t-1>}$, to produce the output $y^{<t>}$ and the activation $a^{<t>}$, which is passed to the next time step. Thus, an RNN cell has 2 inputs and 2 outputs

### Forward propagation
$$a^{<t>} = g_1(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$$
$$y^{<t>} = g_2(W_{ya}a^{<t>} + b_y)$$

RNN cell architecture
<img src="https://global.discourse-cdn.com/dlai/original/3X/2/c/2cd9b38764a152e508d90650b5a365599c6347f8.png">

The activation $a^{<t-1>}$ contains the information from previous timesteps, which represents the context. The input $x^{<t>}$ represents the information from the current timestep. The RNN cell combines them to produce an output that contains both the context and the current information. This output serves as the context for the next timestep

For the output layer, the RNN transforms the activation at this timestep and applies a softmax function to it. The softmax function gives the probability of each word being the next word, $y^{<t>}$. In general, we choose the entry with the highest probability as the next word (to introduce randomness, we may instead sample among the words with high probability)

Note: in RNN, tanh activation is used most often
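
The forward pass can be sketched in plain Python as below. This is a toy illustration, not a reference implementation: the parameter values are invented, $g_1$ is taken to be tanh and $g_2$ softmax, and the output at each step is computed from the freshly updated activation $a^{<t>}$, which is the usual convention.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    shift = max(z)                                   # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, a_prev, p):
    """One time step: returns the new activation a_t and the output distribution y_t."""
    pre = [aa + ax + b for aa, ax, b in
           zip(matvec(p["W_aa"], a_prev), matvec(p["W_ax"], x_t), p["b_a"])]
    a_t = [math.tanh(v) for v in pre]                # g_1 = tanh
    logits = [ya + b for ya, b in zip(matvec(p["W_ya"], a_t), p["b_y"])]
    return a_t, softmax(logits)                      # g_2 = softmax

def rnn_forward(xs, p):
    a = [0.0] * len(p["b_a"])                        # a^{<0>} is the zero vector
    ys = []
    for x_t in xs:                                   # the same cell is reused at every step
        a, y_t = rnn_step(x_t, a, p)
        ys.append(y_t)
    return a, ys

# Hypothetical toy parameters: 2 hidden units, vocabulary of 3 one-hot words.
params = {
    "W_aa": [[0.1, -0.2], [0.3, 0.4]],
    "W_ax": [[0.5, -0.1, 0.2], [-0.3, 0.8, 0.1]],
    "W_ya": [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]],
    "b_a": [0.0, 0.0],
    "b_y": [0.0, 0.0, 0.0],
}
final_a, outputs = rnn_forward([[1, 0, 0], [0, 1, 0]], params)
```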

## Loss function
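$$L(\hat{y}, y) = \sum^{T_y}_{t=1} -y^{<t>}\log(\hat{y}^{<t>}) - (1 - y^{<t>})\log(1 - \hat{y}^{<t>})$$

The BCE loss function is used at each timestep to measure the difference between the predicted probability, $\hat{y}^{<t>}$, and the true label, $y^{<t>}$, when each output is a binary label. When the output is a softmax over the vocabulary, the true label contains one entry with value 1 and the rest are 0s (one-hot encoding), and the per-timestep loss becomes the categorical cross-entropy, $-\sum_i y_i^{<t>}\log(\hat{y}_i^{<t>})$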

## Different types of RNN
1. Many to many: many inputs and many outputs (e.g., translation)
2. Many to one: many inputs and only one output (e.g., movie rating based on a description)
3. One to many: one input and many outputs (e.g., music generation)

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*qUcWLuiBgpFVACYE.png">

## Training language model
1. Collect a large corpus (body) of text as the training set
2. Tokenize the training text and map each token to a one-hot vector based on the dictionary (UNK for unknown words)
3. Feed the tokenized input to the network one token at a time and predict the probability of the next word ($x^{<1>} = \vec{0}$)
4. Construct the cost function from the predicted probabilities and perform gradient descent (backpropagation through time) to update the parameters

## Generate sequence (one to many)
In order to generate text, we can randomly sample a word from the probability distribution given by the softmax output at the time step, $y^{<t>}$. Then the sampled word is fed as the input of the next time step, where $x^{<t+1>} = y^{<t>}$, to generate a sequence

Initially, we start with $a^{<0>} = \vec{0}$ and $x^{<1>} = \vec{0}$

<img src="https://media5.datahacker.rs/2020/09/59-1-1024x410.jpg" width=700>
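
The sampling loop can be sketched as follows. The `step_fn` argument stands in for a trained RNN cell and is an assumption of this example, as are the `<EOS>` stop token and the deterministic `toy_step` used to make the behaviour easy to follow.

```python
import random

def sample_sequence(step_fn, vocabulary, hidden_size, max_len, rng, eos="<EOS>"):
    """Sample words one at a time, feeding each sampled word back in as the next input."""
    a = [0.0] * hidden_size                  # a^{<0>} = 0
    x = [0.0] * len(vocabulary)              # x^{<1>} = 0
    words = []
    for _ in range(max_len):
        a, probs = step_fn(x, a)             # softmax output of the current time step
        index = rng.choices(range(len(vocabulary)), weights=probs, k=1)[0]
        if vocabulary[index] == eos:
            break
        words.append(vocabulary[index])
        x = [1.0 if i == index else 0.0 for i in range(len(vocabulary))]  # one-hot of the sample
    return words

def toy_step(x_t, a_prev):
    """Hypothetical stand-in for a trained cell: always predicts the next vocabulary entry."""
    previous = x_t.index(1.0) if 1.0 in x_t else -1
    probs = [0.0] * len(x_t)
    probs[previous + 1] = 1.0
    return a_prev, probs

vocabulary = ["the", "cat", "<EOS>"]
sentence = sample_sequence(toy_step, vocabulary, hidden_size=2, max_len=10, rng=random.Random(0))
```

With a real model the distribution is not one-hot, so repeated calls with different random seeds produce different sequences.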

## Vanishing gradient and solutions
For long sequences, traditional RNNs suffer from vanishing gradients, which makes the model slower to train and makes it difficult to learn long-term dependencies (e.g., the context of a long sentence).

As the backpropagation algorithm moves backward from the output layer towards the input layer (in an RNN, backward through the time steps), the gradients often get smaller and smaller and approach zero, which leaves the weights of the early layers nearly unchanged. This is caused in part by the saturating nature of some activation functions

### Solutions
1. Proper initialization of weights (Xavier initialization)
2. Use Non-saturating activation function: LeakyReLU (non-zero gradient)
3. Batch normalization: normalize the activation to stabilize activations and ensure gradients remain within a reasonable range
4. Gradient clipping: force the gradient to be in a certain range (the range requires tuning)
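
As one illustration of gradient clipping, the sketch below rescales a gradient vector whenever its L2 norm exceeds a threshold. Clipping by norm is only one variant (clipping each entry to a fixed range is another), and the function name and threshold are assumptions of this example.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so its L2 norm is at most max_norm; direction is preserved."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is scaled down to norm 1
unchanged = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is already within range
```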

# Gated recurrent unit (GRU)
GRU mitigates the vanishing gradient problem by capturing long-term dependencies with a memory cell controlled by 2 gates: an update gate and a relevance gate. The update gate decides how much past information to keep and how much to overwrite, and the relevance gate determines how much past information is used when forming the new candidate memory

Note: $c^{<t>}$ and $a^{<t>}$ are the same in the context of GRU, so there's only 2 inputs and 2 outputs for each GRU cell

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/gru-ltr.png?00f278f71b4833d32a87ed53d86f251c" width=500>


$$c^{<t>} = a^{<t>}$$

$$\tilde c^{<t>} = tanh(W_{cc}(\Gamma_r * c^{<t-1>}) + W_{cx}x^{<t>} + b_c)$$

$$\Gamma_u = \sigma(W_{uc}c^{<t-1>} + W_{ux}x^{<t>} + b_u)$$

$$\Gamma_r = \sigma(W_{rc}c^{<t-1>} + W_{rx}x^{<t>} + b_r)$$

$$c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t - 1>}$$

$c^{<t>}$: the actual memory content at time step $t$. Initially, we can set $c^{<0>} = a^{<0>}$. Note that $c^{<t>}$ is a vector, so it can capture multiple dependencies at the same time

$\tilde c^{<t>}$: the current memory content at time step $t$, which is the candidate for updating the actual memory content. Note that the current memory content depends on the actual memory content, $c^{<t-1>}$, the input at this time step, $x^{<t>}$, and the relevance gate, $\Gamma_r$

$\Gamma_r$: the relevance gate that decides how relevant the actual memory content, $c^{<t-1>}$, is for computing the current memory content $\tilde c^{<t>}$. $\Gamma_r$ depends on the actual memory content, $c^{<t-1>}$, and the input at this time step, $x^{<t>}$

$\Gamma_u$: the update gate that decides whether the actual memory content, $c^{<t>}$, will be updated to the calculated current memory content, $\tilde c^{<t>}$. $\Gamma_u$ depends on the actual memory content, $c^{<t-1>}$, and the input at this time step, $x^{<t>}$

$c^{<t>} = \Gamma_u * \tilde c^{<t>} + (1 - \Gamma_u) * c^{<t - 1>}$: the function that decides whether the value $c^{<t>}$ will be updated to the value of $\tilde c^{<t>}$. $*$ denotes element-wise multiplication, so $c^{<t>}$, $\tilde c^{<t>}$, $\Gamma_u$, and $\Gamma_r$ must have the same dimensions

Note: since $\Gamma_r$ and $\Gamma_u$ are calculated with a sigmoid function, their values lie between 0 and 1 and are often close to either 0 or 1. $\Gamma_r \approx 1$ indicates relevant and $\Gamma_r \approx 0$ indicates irrelevant; $\Gamma_u \approx 1$ means update the value and $\Gamma_u \approx 0$ means keep the old value. Because the gates act element-wise, the update can be partial: some entries of $c^{<t>}$ are updated while others are kept

If $\Gamma_u \approx 1$, $c^{<t>} = \tilde c^{<t>}$

If $\Gamma_u \approx 0$, $c^{<t>} = c^{<t-1>}$
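
The GRU equations can be traced with a small plain-Python sketch. Everything below is illustrative: `zero_params` is a hypothetical helper that sets all weights to zero so that the gate biases alone decide the behaviour, which makes the two limiting cases above easy to see.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_c, c, W_x, x, b):
    """Compute W_c c + W_x x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row_c, c)) + sum(w * v for w, v in zip(row_x, x)) + b_i
            for row_c, row_x, b_i in zip(W_c, W_x, b)]

def gru_step(x_t, c_prev, p):
    """One GRU step following the equations above; returns c_t (which equals a_t)."""
    gamma_u = [sigmoid(z) for z in affine(p["W_uc"], c_prev, p["W_ux"], x_t, p["b_u"])]
    gamma_r = [sigmoid(z) for z in affine(p["W_rc"], c_prev, p["W_rx"], x_t, p["b_r"])]
    gated_prev = [r * c for r, c in zip(gamma_r, c_prev)]          # Gamma_r * c_prev
    c_tilde = [math.tanh(z) for z in affine(p["W_cc"], gated_prev, p["W_cx"], x_t, p["b_c"])]
    return [u * ct + (1.0 - u) * cp for u, ct, cp in zip(gamma_u, c_tilde, c_prev)]

def zero_params(hidden, inputs):
    """Hypothetical helper: all-zero weights and biases for the u, r and c equations."""
    p = {}
    for gate in ("u", "r", "c"):
        p[f"W_{gate}c"] = [[0.0] * hidden for _ in range(hidden)]
        p[f"W_{gate}x"] = [[0.0] * inputs for _ in range(hidden)]
        p[f"b_{gate}"] = [0.0] * hidden
    return p

params = zero_params(hidden=2, inputs=3)
params["b_u"] = [-50.0, -50.0]                         # update gate close to 0: keep old memory
kept = gru_step([1.0, 0.0, 0.0], [0.5, -0.3], params)
params["b_u"] = [50.0, 50.0]                           # update gate close to 1: take the candidate
overwritten = gru_step([1.0, 0.0, 0.0], [0.5, -0.3], params)
```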


## LSTM
Long Short-Term Memory networks (LSTMs) address the difficulty that traditional RNNs have with long-term dependencies by using a system of gates that controls how information flows through the network, deciding what to keep and what to forget over extended sequences

The LSTM cell takes in 3 inputs and produces 3 outputs. The inputs are the previous cell state (the information stored at the end of the previous time step), the previous hidden state (the activation from the previous time step), and the input at the current time step, $x^{<t>}$

The hidden state and the current-timestep input are very similar to those of traditional RNNs. The cell state acts like a "memory" that carries information forward using basic operations such as addition and multiplication, remembering important information and forgetting unimportant information. This is done by 3 gates: the update gate, the forget gate, and the output gate

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/lstm-ltr.png?4539fbbcbd9fabfd365936131c13476c" width=500>

$$\tilde c^{<t>} = \tanh(W_{ca}a^{<t-1>} + W_{cx}x^{<t>} + b_c)$$

$$\Gamma_u = \sigma(W_{ua}a^{<t-1>} + W_{ux}x^{<t>} + b_u)$$

$$\Gamma_f = \sigma(W_{fa}a^{<t-1>} + W_{fx}x^{<t>} + b_f)$$

$$\Gamma_o = \sigma(W_{oa}a^{<t-1>} + W_{ox}x^{<t>} + b_o)$$

$$c^{<t>} = \Gamma_u * \tilde c^{<t>} + \Gamma_f * c^{<t - 1>}$$

$$a^{<t>} = \Gamma_o * c^{<t>}$$

$\tilde c^{<t>}$: in LSTM, the current memory content depends on the activation from the previous time step, $a^{<t-1>}$, and the input at the current time step, $x^{<t>}$

$\Gamma_u$, $\Gamma_f$, $\Gamma_o$: the update gate, forget gate, and output gate; all depend on the activation from the previous time step, $a^{<t-1>}$, and the input at the current time step, $x^{<t>}$

$c^{<t>} = \Gamma_u * \tilde c^{<t>} + \Gamma_f * c^{<t - 1>}$: the function that decides whether to update the memory content or not. Compared to the GRU equation, this equation is more flexible because it uses two gates to update the memory content; this means we can not only decide whether to update the memory content to $\tilde c^{<t>}$, but can also keep both $\tilde c^{<t>}$ and $c^{<t - 1>}$ by adding them when $\Gamma_u \approx 1$ and $\Gamma_f \approx 1$

$a^{<t>} = \Gamma_o * c^{<t>}$: the activation $a^{<t>}$ is a filtered version of the cell state $c^{<t>}$ (many formulations apply $\tanh$ to $c^{<t>}$ before the output gate)

Although the forget gate and the update gate both have values between 0 and 1, they do not necessarily add up to 1 (they are independent), which gives the model more flexibility

In general, LSTM is more powerful and flexible than the GRU but requires more computational power
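
A plain-Python sketch of one LSTM step, following the equations in this section (including $a^{<t>} = \Gamma_o * c^{<t>}$ without an extra tanh). As an assumption for illustration, the hypothetical `zero_params` helper sets all weights to zero so the biases alone control the gates; with the update and forget gates both close to 1, the new cell state is approximately the sum of the candidate and the old state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_a, a, W_x, x, b):
    """Compute W_a a + W_x x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row_a, a)) + sum(w * v for w, v in zip(row_x, x)) + b_i
            for row_a, row_x, b_i in zip(W_a, W_x, b)]

def lstm_step(x_t, a_prev, c_prev, p):
    """One LSTM step; returns the new activation a_t and cell state c_t."""
    def gate(name, fn):
        return [fn(z) for z in affine(p[f"W_{name}a"], a_prev, p[f"W_{name}x"], x_t, p[f"b_{name}"])]
    c_tilde = gate("c", math.tanh)        # candidate memory
    gamma_u = gate("u", sigmoid)          # update gate
    gamma_f = gate("f", sigmoid)          # forget gate
    gamma_o = gate("o", sigmoid)          # output gate
    c_t = [u * ct + f * cp for u, ct, f, cp in zip(gamma_u, c_tilde, gamma_f, c_prev)]
    a_t = [o * c for o, c in zip(gamma_o, c_t)]
    return a_t, c_t

def zero_params(hidden, inputs):
    """Hypothetical helper: all-zero weights and biases for the c, u, f and o equations."""
    p = {}
    for name in ("c", "u", "f", "o"):
        p[f"W_{name}a"] = [[0.0] * hidden for _ in range(hidden)]
        p[f"W_{name}x"] = [[0.0] * inputs for _ in range(hidden)]
        p[f"b_{name}"] = [0.0] * hidden
    return p

params = zero_params(hidden=2, inputs=3)
for name in ("c", "u", "f", "o"):
    params[f"b_{name}"] = [50.0, 50.0]    # candidate close to 1, all gates close to 1
a_t, c_t = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [0.5, -0.3], params)
```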

# Different types of RNNs
### Deep RNN
The Deep RNN is constructed by stacking multiple layers of RNN together. In this architecture, every RNN layer predicts the sequence of outputs to send to the next RNN layer instead of predicting a single output value. Then the final RNN layer predicts the single output

Note: we can start processing the next layer as soon as the current layer produces an output for the current time step, and there is no need to wait for the entire sequence to be processed by the current layer

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" width=300>

### CNN-RNN
CNN-RNN architecture is a combination of CNN and RNN architectures. It first uses the CNN network layer to extract the essential features from the input and then send them to the RNN layer to support sequence prediction. An example application for this architecture is generating textual descriptions for the input image

<img src="https://media.springernature.com/lw1200/springer-static/image/art%3A10.1007%2Fs11063-024-11687-w/MediaObjects/11063_2024_11687_Fig3_HTML.png" width=500>

### Encoder-decoder RNN (Seq2Seq)
The encoder-decoder RNN architecture has an encoder that converts the input into an intermediate encoder vector. A decoder then transforms the intermediate encoder vector into the final result. An application of this model is machine translation

<img src="https://miro.medium.com/v2/resize:fit:1400/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg" width=500>

### Bidirectional RNN
A bidirectional RNN connects two RNN layers, one running in the forward direction and the other in the backward direction. With this architecture, the output layer can get information from the past and the future simultaneously. In general, the forward and backward layers process the sequence independently, and their outputs are combined to produce the final output. An application of this model is sentiment classification

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/bidirectional-rnn-ltr.png?e3e66fae56ea500924825017917b464a" width=400>

# Sequence to sequence model
A sequence-to-sequence (Seq2Seq) model transforms one sequence into another sequence (e.g. translation, text generation, summarization). Seq2Seq models are particularly useful in tasks where the model needs to process and understand the entire input sequence before generating an output sequence

For example, in language translation, the model reads the entire input sentence first to capture its holistic meaning (context, grammar, and semantics) and then generates the translated sentence. A word-by-word translation often fails because many languages have different grammatical structures and idiomatic expressions that require contextual understanding to produce a meaningful translation

Seq2Seq models address this challenge by encoding the entire input sequence into a fixed-length context vector (via the encoder), and then decoding this vector to generate the output sequence step-by-step (via the decoder). This allows the model to handle the input and output sequences of different lengths effectively.

The basic sequence-to-sequence model has an encoder and a decoder. The encoder takes in the input sequence, converts it to a fixed-length context encoding that captures the meaning of the input sequence, and feeds it to the decoder. The decoder takes in the context encoding and generates a sequence based on it. Note that the output at each timestep of the encoder network is discarded

Note: the encoder can be a CNN, RNN, or other architecture depending on the input data

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*RSdvbRGBJnLgX4E0.png" width=700>

In principle, the decoder feeds its prediction at the previous timestep as the input to the next timestep, which serves as context for predicting the next word. At the beginning of training this causes problems: the untrained network outputs essentially random predictions, and using them as context makes the later predictions in the sequence even less accurate, which leads to very slow convergence and instability. Therefore, during training we use teacher forcing: the ground truth label from the previous timestep is always used as the input for the next timestep. This is like training each timestep individually, which allows the model to learn faster. After the model is trained (testing phase), we switch back to using the model's own prediction at the previous timestep as the input for the next timestep
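
The difference between the two modes can be sketched as below. This is a simplified illustration: the `<SOS>` and `<EOS>` tokens, the function names, and the `predict_next` callable (which stands in for one decoder step and ignores the hidden state) are all assumptions of this example.

```python
def teacher_forcing_inputs(target_tokens, sos="<SOS>"):
    """Training: the decoder input at each step is the ground-truth token of the previous step."""
    return [sos] + list(target_tokens[:-1])

def free_running_decode(predict_next, max_len, sos="<SOS>", eos="<EOS>"):
    """Testing: the decoder input at each step is the model's own previous prediction."""
    tokens = []
    previous = sos
    for _ in range(max_len):
        previous = predict_next(previous)   # hypothetical single decoder step
        if previous == eos:
            break
        tokens.append(previous)
    return tokens

target = ["je", "suis", "<EOS>"]
decoder_inputs = teacher_forcing_inputs(target)   # aligned with target, shifted by one step

# Hypothetical "trained model" that maps each token to the next one.
lookup = {"<SOS>": "je", "je": "suis", "suis": "<EOS>"}
generated = free_running_decode(lookup.get, max_len=10)
```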

Compared to the earlier language model, which starts from the input $\vec 0$, the decoder of a Seq2Seq model is a conditional language model that generates the sequence with maximum probability (the most likely sequence) given the encoding as the condition

### Issues with Seq2Seq model and solutions
1. The amount of "memory" that the model can capture from the input sequence depends on the context vector size. With a small context vector, the model may not be able to capture the entire context of the input sequence, especially for long input sequences. A solution is to use a deep RNN architecture, which allows the context vector to capture more information

2. For the generation process, we look for the sentence with maximum probability given the encoding, $P(y^{<1>}, ..., y^{<T_y>}|encoding)$. Greedy search does not work well here because it only maximizes the probability of each word given the previous words rather than the probability of the entire generated sequence. A solution is beam search, which approximately maximizes the probability of the whole output sentence by keeping several candidate sequences at each step

## Beam search
Beam search is used to find the most likely output sequence, $y$, given the input encoding, $x$

Steps:
1. Define a beam width, $B$
2. At the first time step, select the top $B$ words with the highest probability
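3. For each kept word/sequence, feed it into the next time step and compute the combined probability of each extended sequence. Keep only the top $B$ sequences with the highest probability and drop the rest
4. Repeat step 3 until the sequences end (e.g., an end-of-sentence token is produced or a maximum length is reached)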

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/beam-search-en.png?3515955a2324591070618dd85812d5d7" width=900>

$B$: the beam width, which determines how many of the most likely sequences are tracked. Larger values of $B$ yield better results but are slower and use more memory. Smaller values of $B$ give worse results but are less computationally intensive. A standard value for $B$ is around 10
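
The steps above can be sketched with a toy next-word model. The `toy_model` table, the `<EOS>` token, and the function names are invented for this example, and the sketch compares raw log-probabilities without any length normalization. The table is constructed so that greedy search (beam width 1) commits to the locally best first word and misses the globally more probable sequence.

```python
import math

def beam_search(next_token_probs, beam_width, max_len, eos="<EOS>"):
    """Return the (sequence, log-probability) pair with the highest score found."""
    beams = [((), 0.0)]                       # (sequence so far, summed log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in next_token_probs(seq).items():
                if prob > 0:
                    candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:        # keep only the top B
            (completed if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    completed.extend(beams)                   # sequences cut off at max_len
    return max(completed, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical conditional distribution P(next word | prefix)."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    return table.get(prefix, {"<EOS>": 1.0})

greedy_seq, greedy_score = beam_search(toy_model, beam_width=1, max_len=5)
beam_seq, beam_score = beam_search(toy_model, beam_width=2, max_len=5)
```

Here the greedy result has probability 0.6 × 0.5 = 0.3, while a beam width of 2 finds the sequence starting with "b" with probability 0.4 × 0.9 = 0.36.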

### Length normalization
The combined probability is calculated by multiplying the probability of each word given the previous words, and beam search looks for the sequence that maximizes it:

$$\hat y = \arg\max_y \prod_{t=1}^{T_y}P(y^{<t>}|x, y^{<1>}, ..., y^{<t-1>})$$

$T_y$: the length of the output sequence

$x$: the context vector from the encoder

$y^{<t>}$: the prediction of the model at timestep $t$

Thus, the longer the sentence, the smaller the combined probability, which can result in numerical underflow. To prevent this, we apply the normalized log-likelihood objective, where

$$\text{Objective}(y) = \frac{1}{(T_y)^{\alpha}} \sum_{t=1}^{T_y}\log P(y^{<t>}|x, y^{<1>}, ..., y^{<t-1>}), \qquad \hat y = \arg\max_y \text{Objective}(y)$$

Computing the objective as a sum of logs prevents numerical underflow. Since the objective is calculated on a log scale, its values are negative (or zero): a more negative objective indicates a smaller probability, and a less negative objective indicates a greater probability

Since longer sequences have smaller probabilities, the model tends to prefer shorter sequences. To counter this, we apply the term $\frac{1}{(T_y)^{\alpha}}$, which averages the log-probability over the sequence length so that long sequences can be generated as well. $\alpha$ is a softening exponent with a value usually between 0.5 and 1
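
A short numerical sketch of the effect of length normalization. The per-step probabilities below are invented: the longer sequence has the smaller raw probability (0.7⁴ ≈ 0.24 versus 0.5² = 0.25), yet it scores higher once the log-likelihood is normalized by length.

```python
import math

def normalized_log_likelihood(step_probs, alpha=0.7):
    """Sum of log-probabilities divided by (sequence length) ** alpha."""
    length = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (length ** alpha)

short_seq = [0.5, 0.5]             # hypothetical per-step probabilities, T_y = 2
long_seq = [0.7, 0.7, 0.7, 0.7]    # hypothetical per-step probabilities, T_y = 4

raw_short = normalized_log_likelihood(short_seq, alpha=0.0)   # alpha = 0: no normalization
raw_long = normalized_log_likelihood(long_seq, alpha=0.0)
norm_short = normalized_log_likelihood(short_seq, alpha=1.0)  # alpha = 1: full averaging
norm_long = normalized_log_likelihood(long_seq, alpha=1.0)
```

Without normalization the short sequence wins (`raw_short > raw_long`); with normalization the long sequence wins (`norm_long > norm_short`).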

### Beam search error analysis
Beam search error analysis helps us determine whether a badly generated sequence is caused by the RNN or by the beam search algorithm, since beam search can miss the optimal sequence due to its limited search space

Suppose $\hat y$ is a bad sequence generated by the model and $y$ is a good target sequence. We can then calculate the probabilities $P(\hat y|x)$ and $P(y|x)$

* If $P(\hat y|x) \geq P(y|x)$: the model assigns the good sequence a probability no higher than the bad one, which indicates that the RNN itself is at fault. This can be addressed by using a different architecture, applying regularization, or getting more training data

* If $P(\hat y|x) < P(y|x)$: the model assigns the good sequence a higher probability than the bad one, which indicates that the RNN prefers the good sequence but the beam search algorithm failed to find it. This can be addressed by increasing the beam width

Note: if length normalization is applied, the objective should be compared instead of the probability
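
In practice this comparison is repeated over many bad examples and the faults are tallied. A minimal sketch, with made-up probabilities and hypothetical function names:

```python
from collections import Counter

def attribute_error(p_generated, p_reference):
    """Blame the RNN if it scores the bad output at least as high as the good reference."""
    return "RNN" if p_generated >= p_reference else "beam search"

def tally_errors(examples):
    """examples: iterable of (P(y_hat | x), P(y | x)) pairs for badly generated sequences."""
    return Counter(attribute_error(p_gen, p_ref) for p_gen, p_ref in examples)

# Hypothetical scores for three bad outputs.
counts = tally_errors([(0.02, 0.05), (0.04, 0.01), (0.03, 0.08)])
```

Whichever component accounts for the larger share of errors is the one worth improving first.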

## Bleu score
The bilingual evaluation understudy (Bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision

The precision for n-grams of a single size $n$ is
$$p_n = \frac{\sum_{\text{n-gram} \in \hat y} \text{count}_{clip}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat y} \text{count}(\text{n-gram})}$$

$p_n$: n-gram precision, calculated as the number of correctly predicted n-grams (after clipping) over the total number of predicted n-grams

$\sum_{\text{n-gram} \in \hat y}$: sum over every contiguous n-gram in the generated sequence

$\text{count}_{clip}(\text{n-gram})$: the count of the n-gram in the generated sequence, clipped to the maximum number of times it appears in any single reference

$\text{count}(\text{n-gram})$: the number of times the n-gram appears in the generated sequence


The combined bleu score is
$$\text{Bleu score} = \text{BP} \cdot exp({\frac{1}{N}\sum_{k=1}^{N}\log p_k})$$

where $N$ is the largest n-gram size considered and $\text{BP}$ is a brevity penalty that penalizes outputs shorter than the reference

The bleu score is a number between 0 and 1. The closer the score is to 1, the more similar the generated sequence and the reference are. In practice, a score of 0.6 - 0.7 is often considered good; a score very close to 1 is rare and may indicate overfitting to the references

## BLEU Score (Bilingual Evaluation Understudy)

The **BLEU (Bilingual Evaluation Understudy) score** is a metric used to evaluate the quality of machine-translated text by comparing it to one or more reference translations. It measures **n-gram precision** while incorporating a penalty for overly short translations.

### 1. N-gram Precision

BLEU evaluates how many n-grams in the generated translation appear in the reference translation. The **n-gram precision** is calculated as:

$$
p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}
$$

where:

- $p_n$ = **n-gram precision** (i.e., the fraction of predicted n-grams that appear in the reference translation).
- $\hat{y}$ = generated (candidate) translation.
- $\sum_{\text{n-gram} \in \hat{y}}$ = sum over all contiguous n-grams in the generated sequence.
- $\text{count}(\text{n-gram})$ = number of times the n-gram occurs in the generated sequence.
- $\text{count}_{\text{clip}}(\text{n-gram})$ = **clipped count**, which limits the count of an n-gram to the maximum number of times it appears in **any single reference translation** (prevents artificially high precision due to repeated words).
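
The clipped precision can be sketched in plain Python as follows. This is an illustrative sketch rather than a reference implementation; the helper names `ngrams` and `clipped_precision` and the example sentences are assumptions, and tokens are taken to be whitespace-separated words.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    """Clipped n-gram precision p_n of a candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the maximum count found in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = clipped_precision(candidate, references, 1)
```

In this example "the" appears 7 times in the candidate but at most 2 times in a single reference, so the unigram precision is clipped to 2/7 instead of 7/7.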

### 2. Combined BLEU Score

The overall BLEU score is calculated using **a weighted geometric mean of n-gram precisions** and a **brevity penalty** to penalize translations that are too short.

$$
\text{BLEU} = \text{BP} \cdot \exp \left( \sum_{k=1}^{N} w_k \log p_k \right)
$$

where:

- $w_k = \frac{1}{N}$ (uniform weight for each precision score when using up to $N$-grams, typically $N=4$).
- $p_k$ = precision for n-grams of size $k$.
- **Brevity Penalty (BP)**:

$$
\text{BP} =
\begin{cases} 
1 & \text{if } L_c \geq L_r \\
e^{(1 - L_r / L_c)} & \text{if } L_c < L_r
\end{cases}
$$

where:

- $L_c$ = length of the generated (candidate) translation.
- $L_r$ = length of the reference translation whose length is closest to $L_c$.
- This penalty discourages overly short translations, which might artificially inflate n-gram precision.
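
Putting the pieces together, a compact sentence-level BLEU could look like the sketch below. It assumes uniform weights $w_k = 1/N$, returns 0 when any precision is 0 (no smoothing), and breaks ties for the closest reference length in favor of the shorter one; these choices and all function names are assumptions of this sketch.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def _clipped_precision(candidate, references, n):
    cand_counts = Counter(_ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(_ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def brevity_penalty(cand_len, ref_len):
    """BP = 1 if L_c >= L_r, else exp(1 - L_r / L_c)."""
    if cand_len >= ref_len:
        return 1.0
    if cand_len == 0:
        return 0.0
    return math.exp(1 - ref_len / cand_len)

def bleu(candidate, references, max_n=4):
    precisions = [_clipped_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Reference length closest to the candidate length (shorter one on ties)
    closest_ref_len = min((len(r) for r in references),
                          key=lambda length: (abs(length - len(candidate)), length))
    return brevity_penalty(len(candidate), closest_ref_len) * math.exp(log_mean)

reference = "the cat is on the mat".split()
truncated = "the cat is on".split()
score_exact = bleu(reference, [reference])
score_short = bleu(truncated, [reference])
```

The truncated candidate has perfect n-gram precision, so its score is determined entirely by the brevity penalty $e^{1 - 6/4}$.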

### 3. Interpreting BLEU Scores

- BLEU scores range from **0 to 1** (higher is better).
- **Typical scores**:
  - **0.6 - 0.7**: Considered **good** machine translation.
  - **0.8 - 1.0**: Indicates potential **overfitting** (rare in practical use cases).
  - **< 0.3**: Poor translation quality.

# Attention Mechanism
Issue: In traditional **Seq2Seq models** (such as those using RNNs or LSTMs), the **context vector** has a fixed size and serves as the only source of information for the decoder. This can lead to **information loss**, especially for long input sequences, since the decoder must rely solely on this compressed representation.

Solution: The **attention mechanism** allows the model to **dynamically focus on different parts of the input sequence** at each decoding step. Instead of compressing all input information into a single fixed-size vector, the encoder keeps one piece of context information per input, called a hidden state, to better retain input information during encoding. When decoding, attention assigns **different weights** to each input's hidden state based on its relevance to the current decoding step, so although the model keeps all of the input context, it only looks at the important, relevant parts at each step

## Advantages of attention mechanism

1. **Handling Long-Range Dependencies**  
   - The model can refer to any part of the input sequence, regardless of length, improving translation quality and other sequence tasks.

2. **Better Contextual Understanding**  
   - By selectively attending to relevant words, the model can generate more **coherent and contextually appropriate** outputs.

3. **Improved Interpretability**  
   - The attention weights show which parts of the input are most influential, making the model’s decisions more transparent.

4. **Parallelization (only in Transformer models, not in RNN-based models)**  
   - Unlike RNN-based attention, **self-attention in Transformers** allows for **parallel computation**, making training much faster.

## How attention works
The attention mechanism allows each encoder **time step** to output its own **hidden state**, $h_{t}$, as part of the **context information**, rather than producing a single fixed-size context vector at the end. These hidden states are similar to those in any sequential model, where each hidden state contains information from **all previous inputs up to that point**. Once computed, these hidden state values remain **fixed** throughout the decoding process.

At the end of encoding, we have **$T$ fixed hidden states** from the encoder, where $T$ is the length of the input sequence. These hidden states serve as **inputs to the attention mechanism**, which assigns different weights to them and computes a **weighted sum** to form a **context vector** at each decoding step.

After encoding, the last hidden state of the encoder is fed as the initial hidden state of the decoder, similar to the traditional Seq2Seq model. At each time step, the decoder produces its own hidden state, and we compute a **score** for each encoder hidden state by taking the dot product or some learned transformation between the decoder hidden state and each encoder hidden state. These scores represent the relevance of each input word for generating the current output. A higher score indicates that a particular input word is more important, and the attention mechanism will assign it a higher weight when generating the current output.

Next, we pass all the scores through a softmax function to normalize them into attention weights, ensuring that each weight is between 0 and 1, and that their sum equals 1. These attention weights determine how much each encoder hidden state contributes to the final context vector used by the decoder.

Finally, we calculate the context vector by computing the weighted sum of the encoder hidden states using the attention weights. This sum forms the context vector for the decoder at the current timestep, which contains a focused representation of the input. This method overcomes the information bottleneck of the intermediary state by allowing the decoder model to access all the hidden states 

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/0*C8OjldkhqO6aTHyG.png">

## Mathematical explanation
First, the encoder calculates a hidden state for each input timestep, which we refer to as $h_{t}$ for $t = 0, 1, ..., T$. After encoding, we have an array of hidden states $[h_0, h_1, ..., h_T]$

At each step of decoding, the decoder takes in the decoder hidden state, $h_{s-1}$ and the predicted word, $y_{s-1}$ (ground truth label at training time) from the previous timestep to compute its current hidden state, where
$$h_{s} = f(h_{s-1}, y_{s-1})$$

Note: for the first decoder timestep, the input hidden state is the last hidden state of the encoder, and the input word is a placeholder such as a zero vector or a start-of-sequence token

After obtaining the current decoder hidden state, we use this value to calculate a score for each encoder hidden state, where
$$e^{(s)}_{t} = score(h_s, h_t)$$

$e^{(s)}_{t}$: how much the decoder at step $s$ should focus on the encoder hidden state $h_t$. You can think of two timelines: one for the encoder, indexed by $t = 0, 1, ..., T$, and another for the decoder, indexed by $s = 0, 1, ..., T'$

Different types of attention mechanisms use different score functions. For each decoder timestep $s$, we repeat the score calculation once for every encoder hidden state to obtain $e^{(s)} = [e^{(s)}_{0}, e^{(s)}_{1}, ..., e^{(s)}_{T}]$

After obtaining the scores, we normalize them using a softmax layer to get the attention weights, where
$$\alpha^{(s)}_{t} = \frac{exp(e^{(s)}_{t})}{\sum_{t'=0}^{T} exp(e^{(s)}_{t'})}$$

$\alpha^{(s)}_{t}$: the attention weight to the $t$th encoder hidden state

$e^{(s)}_{t}$:  the score with respect to the $t$th encoder hidden state

$\sum_{t'=0}^{T} exp(e^{(s)}_{t'})$: the sum of the exponentiated scores over all encoder hidden states

Again, we repeat this calculation for every encoder hidden state to obtain the attention weights $\alpha^{(s)} = [\alpha^{(s)}_{0}, \alpha^{(s)}_{1}, ..., \alpha^{(s)}_{T}]$. Each $\alpha^{(s)}_{t}$ tells the decoder how much it should focus on the $t$th encoder hidden state when generating the output at step $s$. All attention weights are values between 0 and 1, and they sum to 1. A value closer to 0 indicates that the input should not be focused on, while a value closer to 1 means the input is highly attended to

Then, we calculate the weighted sum of the encoder hidden states using the attention weights to obtain the context vector for the decoder timestep $s$, where
$$c_s = \sum^{T}_{t=0}{\alpha^{(s)}_{t} h_t}$$

$c_s$: the context vector at decoder timestep $s$

Finally, the decoder can make a prediction at this timestep by using its hidden state and the context vector, where
$$y_s = g(h_s, c_s)$$

$y_s$: the decoder prediction at timestep $s$
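
One decoding step of the score, softmax, and context-vector computation can be sketched in plain Python. This sketch assumes a dot product as the score function, and the vectors are toy values chosen for illustration; `attention_context` is a hypothetical helper name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step."""
    # e_t = score(h_s, h_t), here a plain dot product
    scores = [dot(decoder_state, h_t) for h_t in encoder_states]
    # alpha_t = softmax over all encoder positions
    weights = softmax(scores)
    # c_s = sum_t alpha_t * h_t
    dim = len(encoder_states[0])
    context = [sum(w * h_t[d] for w, h_t in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # h_0, h_1, h_2
decoder_state = [2.0, 0.0]                              # h_s
weights, context = attention_context(decoder_state, encoder_states)
```

The first and third encoder states align with the decoder state and receive equal, larger weights, so the context vector is dominated by them.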

In this case (for example, with a plain dot product as the score function), the attention mechanism is not a neural network and has no parameters of its own to learn. It is only there to help the encoder and decoder learn better. However, learned parameters can be added to improve the model's capability

<img src="https://media.geeksforgeeks.org/wp-content/uploads/20200603211336/attn.png">

In the image, a feed-forward network is added, which is responsible for transforming the target hidden state, $h_t$, into a representation, $A$, that is compatible with the attention mechanism. It takes the target hidden state, h(t-1), and applies a linear transformation followed by a non-linear activation function (e.g., ReLU) to obtain the new representation, $A$. In other words, instead of using the hidden state directly, it uses a neural network to map the hidden state to a new representation that serves as the input to the attention layer. The formulas above still hold after replacing every $h_t$ with $A$

# Transformer
The transformer model combines the attention mechanism with a CNN-style way of processing. Instead of reading tokens one at a time, it processes the entire input sequence at once in parallel while still capturing the context of the sequence, which improves both computational speed and model performance

## Self-attention
One issue with traditional word embeddings is that the embedding vectors do not account for the context of the text, even though a word can have different meanings in different contexts. Self-attention addresses this by applying the attention model to a traditional embedding, $e^{<t>}$, and outputting a more refined, contextualized embedding, $A^{<t>}$. This significantly improves the model's ability to "understand" the text

First, the model takes in the word embeddings of the input sequence and calculates a query vector, a key vector, and a value vector for each word, where

$$q^{<t>} = W^Qe^{<t>}$$

$$k^{<t>} = W^Ke^{<t>}$$

$$v^{<t>} = W^Ve^{<t>}$$

$q^{<t>}$: the query vector for the $t$th word that "looks" for a specific context (feature) from other words that can influence the meaning of this word

$k^{<t>}$: the key vector for the $t$th word that encodes how the $t$th word, as context, can influence the meaning of other words

$v^{<t>}$: the value vector for the $t$th word, which determines how another word's embedding should be updated when the $t$th word is relevant context for it

To calculate the attention

$$A^{<t>}(q^{<t>}, K, V) = \sum_{i}\frac{exp(q^{<t>} \cdot k^{<i>})}{\sum_{j}exp(q^{<t>} \cdot k^{<j>})}v^{<i>}$$

$A^{<t>}(q^{<t>}, K, V)$: the attention (updated, refined, contextualized embedding) of the $t$th word

$q^{<t>} \cdot k^{<i>}$: a dot product that determines how much the $i$th word influences the meaning of the $t$th word. The larger the value, the more the $i$th word can influence the meaning of the $t$th word

$\frac{exp(q^{<t>} \cdot k^{<i>})}{\sum_{j}exp(q^{<t>} \cdot k^{<j>})}$: a softmax that calculates the probability distribution of how much each word in the sequence can influence the $t$th word

$\sum_{i}\frac{exp(q^{<t>} \cdot k^{<i>})}{\sum_{j}exp(q^{<t>} \cdot k^{<j>})}v^{<i>}$: first determines how much each word in the sequence can influence the $t$th word using softmax, then weights each value vector $v^{<i>}$ accordingly and sums over the entire input sequence

Vectorized implementation:
$$A(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

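$A(Q, K, V)$: a matrix that contains the calculated attention for every word in the sequence

$\sqrt{d_k}$: the square root of the dimension of the key/query vectors. Dividing by it keeps the dot products from growing with the dimension, which improves the numerical stability of the softmax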
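
The scaled dot-product self-attention above can be sketched without any libraries. This is a toy illustration: the projection matrices are passed in as nested lists, identity matrices are used in the example purely for readability, and the function names are assumptions of this sketch.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, W_q, W_k, W_v):
    """Return one contextualized vector per input embedding."""
    queries = [matvec(W_q, e) for e in embeddings]
    keys = [matvec(W_k, e) for e in embeddings]
    values = [matvec(W_v, e) for e in embeddings]
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot product between this query and every key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextualized = self_attention(embeddings, identity, identity, identity)
```

Each output row is a convex combination of the value vectors, so with these inputs every component stays between 0 and 1.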

<img src="https://sebastianraschka.com/images/blog/2023/self-attention-from-scratch/summary.png" width=700>

## Multi-head attention 
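Multi-head attention performs self-attention $h$ times in parallel, where $h$ is the number of heads and each head has its own $W^Q$, $W^K$, and $W^V$. Each head looks for a different feature in the context and proposes an update to improve the embedding. The proposed updates from all heads are combined (in the standard Transformer, concatenated and passed through a linear layer) and added to the original embedding to obtain the refined embedding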

<img src="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff406d55e-990a-4e3b-be82-d966eb74a3e7_1766x1154.png" width=700>

## Architecture
<img src="https://deeprevision.github.io/posts/001-transformer/transformer.png" width=600>

### Encoder
Each encoder block has 3 main components: multi-head attention (MHA), layer normalization, and an MLP (feed-forward layer). It takes the embeddings of the sequence as input, processes them through a multi-head attention (MHA) layer, and then feeds the result into a multilayer perceptron (MLP) to form the output matrices, $K$ and $V$, which contain information about the textual context. The add & norm layer adds the residual connection and applies layer normalization, which helps speed up training. Dropout and residual connections are also used within each encoder block, as displayed in the diagram

Typically, the encoder is stacked up $n$ times before feeding the output to the decoder to better learn different attention representations and boost the predictive power

### Decoder
The decoder aims to fuse the encoder output with the target sequence to make predictions. It takes the embeddings of the sequence generated so far as input; initially, this is just a start of sentence token ($<SOS>$). This input is passed to a masked multi-head attention layer to generate the query matrix, $Q$. The next layer takes $Q$ from the masked multi-head attention layer together with the matrices $K$ and $V$ from the encoder; this information is passed into an MHA layer followed by an MLP layer. Finally, the decoder feeds the output of the MLP layer to a classifier to predict the next word

The newly generated sequence is then fed into the decoder again to produce a new matrix $Q$ and predict the next word. This loop repeats until an end of sentence token ($<EOS>$) is generated. Like the encoder, the decoder block itself is typically stacked $n$ times

#### Masked multi-head attention
The masked multi-head attention layer prevents the network from accessing future tokens during training, when the whole target sequence is fed in at once. This is done with a look-ahead mask that sets the attention scores of future tokens to negative infinity. After the softmax, all future tokens have a weight of 0, meaning the network does not have access to those tokens
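
The look-ahead mask can be illustrated with a small sketch in plain Python. The score matrix below is made up; row $i$ holds the attention scores of token $i$ over all positions, and the helper names are assumptions of this sketch.

```python
import math

def causal_mask(scores):
    """Set scores[i][j] to -inf for every future position j > i."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)  # always finite: the diagonal entry is never masked
    exps = [math.exp(s - m) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores for a 3-token target sequence
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
masked_weights = [softmax(row) for row in causal_mask(scores)]
```

After masking, the first token can only attend to itself, the second to the first two tokens, and only the last token sees the whole sequence.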

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*QYFua-iIKp5jZLNT.png" width=400><img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*3ykVCJ9okbgB0uUR.png" width=400>


## Positional encoding
Positional encodings are added to the input embeddings to provide information on where each word is located in the sequence. This is very important since the transformer processes all tokens of the sequence at once and otherwise has no information about their order

The positional encoding uses sine and cosine functions to give each position in the sequence a unique representation

$$PE_{(pos, 2i)} = \sin{(\frac{pos}{10000^{\frac{2i}{d}}})}$$

$$PE_{(pos, 2i + 1)} = \cos{(\frac{pos}{10000^{\frac{2i}{d}}})}$$

$pos$: the position index of the word in the sequence

$d$: size of the word embedding

$i$: the index of the sine/cosine pair of dimensions, where $i = 0, 1, ..., d/2 - 1$

The positional encoding fills the even dimensions ($2i$) of the vector using the sine function and the odd dimensions ($2i + 1$) using the cosine function. This is repeated until the vector has size $d$, meaning the positional encoding has the same dimension as the word embedding. Then, the positional encoding is added to the word embedding to give the network information on the position of each token
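
A minimal sketch of the sinusoidal scheme, with sine on the even dimension indices and cosine on the odd ones, is shown below. The function name `positional_encoding` and the chosen sizes are assumptions for illustration only.

```python
import math

def positional_encoding(seq_len, d):
    """Return a seq_len x d table of sinusoidal positional encodings."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d):
            i = k // 2  # dimensions 2i and 2i + 1 share the same frequency
            angle = pos / (10000 ** (2 * i / d))
            # Even dimension -> sine, odd dimension -> cosine
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d=6)
```

At position 0 every sine entry is 0 and every cosine entry is 1, and each later position gets a different pattern that is added to the corresponding word embedding.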

# Speech recognition and trigger word detection
## Model
* Attention model
* Connectionist temporal classification (CTC) model


# Resources
* CS230: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks#