# Word Embeddings
Word embeddings are a type of dense vector representation of words in natural language, which are typically learned from large corpora of text using neural network models. The goal of word embeddings is to capture the semantic and syntactic relationships between words, such that similar words are represented by similar vectors in the embedding space.

Each word in the vocabulary is typically represented by a vector of fixed length. Individual elements of the vector usually do not correspond to a single interpretable feature; instead, a word's meaning is distributed across all dimensions. The vectors are learned from the co-occurrence patterns of words in the training corpus, such that words that appear in similar contexts are represented by similar vectors.
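
To make "similar vectors" concrete, similarity between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked 4-dimensional toy vectors (hypothetical values, not taken from any trained model) purely to illustrate the computation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_embeddings = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat~dog: {sim_cat_dog:.3f}, cat~car: {sim_cat_car:.3f}")
```

With these toy values, "cat" scores much closer to "dog" than to "car", which is the behavior a trained embedding space is expected to show for words used in similar contexts.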

Word embeddings have become a powerful tool in natural language processing (NLP) and related fields, as they can be used as input to machine learning models for a wide range of NLP tasks, such as text classification, sentiment analysis, and language translation. By representing words in a dense vector space, word embeddings can capture subtle semantic relationships between words that are difficult to represent using traditional sparse representations, such as bag-of-words models.

## Embedding Matrix
An embedding matrix is a matrix that contains the word embeddings for all the words in a given vocabulary. Each row of the matrix corresponds to the embedding vector for a single word in the vocabulary.

The size of the embedding matrix depends on the size of the vocabulary and the dimensionality of the embedding vectors. For example, if the vocabulary has 10,000 words and the embedding vectors have a dimensionality of 300, the embedding matrix will have dimensions 10,000 x 300.
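
As a small sketch of the 10,000 x 300 example, the code below builds a matrix of that shape and looks up one word's vector. The matrix is filled with random numbers standing in for learned values, and `word_index` is a hypothetical vocabulary index. It also shows that a row lookup is equivalent to multiplying a one-hot vector by the matrix.

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300
rng = np.random.default_rng(0)

# Random values stand in for learned embeddings (assumption for illustration).
E = rng.normal(size=(vocab_size, embedding_dim))

word_index = 4242  # hypothetical index of some word in the vocabulary

# Direct row lookup: the usual way to fetch an embedding.
vec_lookup = E[word_index]

# Equivalent formulation: one-hot row vector times the embedding matrix.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0
vec_matmul = one_hot @ E

print(E.shape, vec_lookup.shape)
print(np.allclose(vec_lookup, vec_matmul))
```

In practice the row lookup is preferred, since the one-hot product does the same work with far more arithmetic.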

The values in the embedding matrix are learned during training. In one common setup, a neural network model is trained to predict the context or neighboring words of a given word, based on its embedding vector. The embeddings are learned by minimizing a loss function that measures the difference between the predicted context words and the actual context words, typically using backpropagation and stochastic gradient descent.

Once the embedding matrix has been learned, it can be used as input to machine learning models for a wide range of natural language processing (NLP) tasks, such as text classification, sentiment analysis, and language translation. By representing words as dense vectors in an embedding space, the embedding matrix captures the semantic relationships between words, which can be used to improve the performance of NLP models.

## Learning Word Embeddings
There are several approaches to learning word embeddings from text data, some of which are:

- Count-based methods: These methods are based on the co-occurrence counts of words in a large corpus of text. The idea is to build a co-occurrence matrix of word pairs, where the (i,j) entry of the matrix represents the number of times the words i and j occur together within a fixed-size context window. The matrix is then factorized using techniques like singular value decomposition (SVD) or non-negative matrix factorization (NMF) to obtain low-dimensional embeddings of the words.

- Prediction-based methods: These methods are based on neural network models that are trained to predict the context or neighboring words of a given word, given its embedding. The embeddings are learned by minimizing a loss function that measures the difference between the predicted context words and the actual context words, typically using backpropagation and stochastic gradient descent. Popular models in this category include Word2Vec and fastText.

- Hybrid methods: These methods combine elements of count-based and prediction-based approaches. GloVe is often described this way: like count-based methods, it is fit to global co-occurrence counts, but like prediction-based methods, it learns the embeddings by gradient-based optimization of an explicit objective rather than by a direct matrix decomposition.
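
The count-based route in the list above can be sketched end to end on a tiny corpus. This is a minimal illustration, not a reference implementation: the two-sentence corpus, the window size, and the target dimension `k` are all made up, and the positive pointwise mutual information (PPMI) reweighting applied before the SVD is a common choice assumed here rather than something the count-based description requires.

```python
import numpy as np

# Toy corpus and settings (hypothetical, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
window = 2

vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a symmetric context window.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[index[w], index[sent[j]]] += 1

# PPMI reweighting: log of observed over expected co-occurrence, clipped at 0.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(counts * total / (row * col))
ppmi = np.where(counts > 0, np.maximum(pmi, 0.0), 0.0)

# Truncated SVD: keep the top-k singular directions as embeddings.
k = 2
U, S, Vt = np.linalg.svd(ppmi)
embeddings = U[:, :k] * np.sqrt(S[:k])
print(embeddings.shape)
```

Scaling the left singular vectors by the square root of the singular values is one of several conventions; using `U[:, :k]` alone or `U[:, :k] * S[:k]` are also seen.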

Overall, the choice of approach depends on the specific needs of the application and the nature of the data. Count-based methods are simple and interpretable, but may not capture complex semantic relationships between words. Prediction-based methods, on the other hand, are more complex and computationally expensive, but can capture more nuanced semantic relationships. Hybrid methods offer a balance between the two, but may require more tuning and experimentation to obtain good results.

## Word2Vec
Word2Vec is a popular prediction-based method for learning word embeddings from text data. It consists of two main variants: Continuous Bag-of-Words (CBOW) and Skip-gram.

In the CBOW variant, the model takes as input a set of context words and tries to predict the center word. Specifically, given a window of context words, the model computes the average of their embeddings, which serves as the input to a neural network with a single hidden layer. The output of the network is a softmax probability distribution over the vocabulary, which is trained to predict the center word. The model is trained by minimizing the negative log-likelihood of the observed center words in the training data.
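
The CBOW forward pass described above can be sketched in a few lines. The names `W_in` and `W_out`, the tiny vocabulary size, and the word indices are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 5

# Two weight matrices (hypothetical names): input embeddings and output weights.
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def cbow_probs(context_ids):
    """Softmax distribution over the vocabulary for the center word."""
    h = W_in[context_ids].mean(axis=0)  # average of the context embeddings
    scores = W_out @ h                  # one score per vocabulary word
    scores -= scores.max()              # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

context_ids = [1, 2, 4, 5]  # hypothetical indices of the context words
center_id = 3               # hypothetical index of the true center word

probs = cbow_probs(context_ids)
loss = -np.log(probs[center_id])  # negative log-likelihood of the center word
print(probs.sum(), loss)
```

Training would repeat this over every window in the corpus and update both matrices by gradient descent on the loss.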

In the Skip-gram variant, the model takes as input a center word and tries to predict the context words that occur in its vicinity. Specifically, given a center word, the model computes its embedding, which serves as the input to a neural network with a single hidden layer. The output of the network is a softmax probability distribution over the vocabulary, which is trained to predict the context words. The model is trained by minimizing the negative log-likelihood of the observed context words in the training data.
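
To show what the Skip-gram training data looks like, the sketch below enumerates the (center, context) pairs for a short sentence. The helper name `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window):
    """Return every (center, context) pair within the given window size."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

Each pair is one training example: the model sees the first word and is asked to assign high probability to the second.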

The full softmax over the vocabulary is expensive to compute, so in practice both variants of Word2Vec are usually trained with stochastic gradient descent using an approximation such as hierarchical softmax or negative sampling. Negative sampling involves sampling a few negative (non-context) words in addition to each positive (context) word, and adjusting the model parameters to increase the score of the positive words and decrease the score of the negative words.

Overall, Word2Vec is a powerful method for learning high-quality word embeddings that capture the semantic and syntactic relationships between words. The embeddings can be used as input to machine learning models for a wide range of natural language processing tasks, such as text classification, sentiment analysis, and language translation.

## Negative Sampling
Negative sampling is a technique used in natural language processing (NLP) to train word embeddings, particularly in the context of prediction-based models such as Word2Vec.

In these models, the goal is to predict the context words of a given target word based on its embedding. To do this, the model learns to assign high probabilities to the correct context words and low probabilities to other words in the vocabulary that are not in the context. However, computing the probabilities for all words in the vocabulary can be computationally expensive, especially for large vocabularies.

Negative sampling addresses this problem by only computing scores for a small number of negative (non-context) words, rather than all words in the vocabulary. Specifically, during training, for each positive (context) word in the training data, a small number of negative words are sampled from a noise distribution over the vocabulary; Word2Vec uses the unigram distribution raised to the 3/4 power. The softmax is thereby replaced by a set of binary classification problems: the model is trained to assign high probabilities to the positive words and low probabilities to the negative words.
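
A minimal sketch of the negative-sampling loss for a single (center, context) pair follows. The matrix names, sizes, word counts, and indices are all hypothetical; the noise distribution uses word counts raised to the 0.75 power, as in Word2Vec. For simplicity, this sketch does not filter out accidental draws of the true context word.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_negatives = 50, 16, 5

# Random weights stand in for partially trained parameters (assumption).
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

# Hypothetical word counts; noise distribution is proportional to count^0.75.
word_counts = rng.integers(1, 100, size=vocab_size)
noise_dist = word_counts ** 0.75
noise_dist = noise_dist / noise_dist.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_id, context_id, negative_ids):
    """Binary logistic loss: one positive pair plus the sampled negatives."""
    v = W_in[center_id]
    pos_score = W_out[context_id] @ v
    neg_scores = W_out[negative_ids] @ v
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

center_id, context_id = 3, 7  # hypothetical word indices
negative_ids = rng.choice(vocab_size, size=num_negatives, p=noise_dist)
loss = sgns_loss(center_id, context_id, negative_ids)
print(negative_ids, loss)
```

Only `num_negatives + 1` rows of `W_out` are touched per training pair, instead of all `vocab_size` rows that a full softmax would require.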

The number of negative samples to use is typically a hyperparameter that is tuned to the specific application and dataset. A larger number of negative samples can improve the quality of the embeddings, but also increases the computational cost of training.

Overall, negative sampling is an efficient and effective technique for training word embeddings in large-scale NLP applications. By only computing scores for a small number of negative words, it allows prediction-based models to scale to large vocabularies and datasets with little loss in embedding quality.

## GloVe Word Vectors
GloVe (Global Vectors for Word Representation) is a popular method for learning word embeddings from large text corpora. It is a count-based approach that builds on the idea of matrix factorization of the co-occurrence matrix of words.

In GloVe, a co-occurrence matrix is first constructed by counting how often each word appears in the context of every other word within a given window size. Unlike PMI-based methods, GloVe does not convert these counts into a pointwise mutual information (PMI) matrix; it works with the logarithm of the raw co-occurrence counts.

Next, GloVe learns two sets of low-dimensional vectors, word embeddings and context embeddings, together with a bias term for each word and each context. These are fit so that the dot product of a word vector and a context vector, plus the two biases, approximates the logarithm of the corresponding co-occurrence count. This amounts to an implicit low-rank factorization of the log-count matrix, so words that have similar contexts are represented by similar vectors in the embedding space.

The training objective in GloVe is to minimize a weighted sum of squared errors between the dot product of the word and context embeddings (plus bias terms) and the logarithm of the co-occurrence counts, taken over word pairs with nonzero counts. The optimization uses stochastic gradient methods (AdaGrad in the original implementation), with a weighting function that down-weights rare co-occurrences, which tend to be noisy, and caps the influence of very frequent ones.
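
The objective can be written out directly. The sketch below evaluates the weighted least-squares loss on a small random count matrix; the variable names and the toy data are assumptions, while the weighting function with `x_max=100` and `alpha=0.75` follows the values reported in the GloVe paper.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function: grows as (x / x_max)^alpha, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error between w_i . c_j + b_i + b_j and log X_ij."""
    i, j = np.nonzero(X)  # only pairs with nonzero counts contribute
    x = X[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return np.sum(glove_weight(x) * (pred - np.log(x)) ** 2)

# Toy inputs: random counts and small random parameters (illustration only).
rng = np.random.default_rng(0)
vocab_size, dim = 6, 4
X = rng.integers(0, 20, size=(vocab_size, vocab_size)).astype(float)
W = rng.normal(scale=0.1, size=(vocab_size, dim))
W_ctx = rng.normal(scale=0.1, size=(vocab_size, dim))
b = np.zeros(vocab_size)
b_ctx = np.zeros(vocab_size)

loss = glove_loss(X, W, W_ctx, b, b_ctx)
print(loss)
```

Restricting the sum to nonzero entries avoids taking the logarithm of zero and keeps the cost proportional to the number of observed co-occurrences rather than the square of the vocabulary size.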

Overall, GloVe is a powerful method for learning high-quality word embeddings that capture the semantic and syntactic relationships between words. The embeddings can be used as input to machine learning models for a wide range of natural language processing tasks, such as text classification, sentiment analysis, and language translation.

## Sentiment Classification
Sentiment classification is the task of automatically analyzing the sentiment of a piece of text, such as a review or a social media post. The goal is to determine whether the text expresses a positive, negative, or neutral sentiment towards a particular topic or entity.

Sentiment classification is an important task in natural language processing (NLP), as it has a wide range of applications in fields such as marketing, customer service, and product development. Some common examples include sentiment analysis of product reviews, social media sentiment analysis for brand monitoring, and sentiment analysis of political speeches and news articles.

The most common approach to sentiment classification involves using supervised machine learning techniques, such as support vector machines (SVMs), decision trees, or neural networks. These models are trained on labeled data, where each text sample is labeled with its corresponding sentiment (positive, negative, or neutral). During training, the model learns to identify features in the text that are predictive of sentiment, such as sentiment words, emoticons, and other linguistic cues.

Once the model has been trained, it can be used to classify new, unlabeled text samples into one of the sentiment categories. The model applies the learned features and rules to the new text sample and assigns it a sentiment label.
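
One simple way to connect word embeddings to this pipeline is to average the embeddings of the words in a text and feed the result to a linear classifier. The sketch below does this with logistic regression trained by plain gradient descent. Everything in it is a toy assumption: the 2-dimensional embeddings are hand-picked so that the first dimension loosely tracks polarity, the four training examples are invented, and only two classes (positive and negative) are used.

```python
import numpy as np

# Hypothetical hand-picked embeddings (not from a trained model).
emb = {
    "great": np.array([1.0, 0.2]),
    "good": np.array([0.8, 0.1]),
    "awful": np.array([-1.0, 0.3]),
    "bad": np.array([-0.8, 0.1]),
    "movie": np.array([0.0, 1.0]),
    "plot": np.array([0.0, 0.9]),
}

def featurize(text):
    """Average the embeddings of known words; zeros if none are known."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Invented labeled data: 1 = positive, 0 = negative.
train = [("great movie", 1), ("good plot", 1), ("awful movie", 0), ("bad plot", 0)]
X = np.stack([featurize(t) for t, _ in train])
y = np.array([label for _, label in train], dtype=float)

# Logistic regression fitted with batch gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def predict(text):
    """Return 1 for positive sentiment, 0 for negative."""
    return int(featurize(text) @ w + b > 0)

print(predict("good movie"), predict("awful plot"))
```

Averaging discards word order, so this kind of model cannot distinguish "not good" from "good not"; that limitation is one motivation for sequence models.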

In recent years, there has been growing interest in using deep learning models, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), for sentiment classification. These models can capture more complex relationships between words and their contexts, leading to improved performance on sentiment classification tasks.

## Debiasing Word Embeddings
Word embeddings can encode biases that reflect stereotypes and prejudices present in the data used to train them. To mitigate these biases, various debiasing techniques have been proposed. Here are some common approaches:

- Hard debiasing: This approach involves removing bias by directly manipulating the embedding vectors to reduce gender, race, or other demographic associations. The simplest technique estimates a bias direction in the embedding space, for example from the mean difference vector between gender-specific word pairs (e.g., he-she, him-her), and then removes the component along that direction from every gender-neutral word. However, this approach may also remove relevant information from the embeddings and may be difficult to generalize to other types of biases.

- Soft debiasing: Soft debiasing involves modifying the training process to reduce the influence of biased data. One approach is to re-weight the training data to balance the contributions of different groups, such as gender or race. Another approach is to modify the loss function used to train the embeddings so that it penalizes biased associations. Studies such as "Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes" measure these associations in trained embeddings; that kind of measurement is what a bias-penalizing loss would aim to reduce.

- Post-processing: This approach involves modifying the embedding vectors after they have been trained to reduce biases; hard debiasing is one instance. More generally, one can identify gender or race subspaces in the embedding space and project the vectors onto the orthogonal complement of those subspaces. Another approach is to train a separate debiasing model that learns to predict the bias direction in the embedding space and removes it.
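
The projection step shared by hard debiasing and subspace-based post-processing can be sketched as follows. The embeddings are hypothetical 3-dimensional toy vectors, and the bias direction is estimated from the mean difference of two word pairs; published methods often use more pairs and a principal component analysis instead, so treat this as a simplified illustration.

```python
import numpy as np

def bias_direction(pairs, emb):
    """Unit vector along the mean difference of the given word pairs."""
    diffs = [emb[a] - emb[b] for a, b in pairs]
    d = np.mean(diffs, axis=0)
    return d / np.linalg.norm(d)

def neutralize(v, direction):
    """Remove the component of v along the (unit-length) bias direction."""
    return v - np.dot(v, direction) * direction

# Hypothetical toy embeddings, hand-picked for illustration.
emb = {
    "he": np.array([1.0, 0.2, 0.1]),
    "she": np.array([-1.0, 0.2, 0.1]),
    "him": np.array([0.9, 0.1, 0.3]),
    "her": np.array([-0.9, 0.1, 0.3]),
    "engineer": np.array([0.4, 0.7, 0.5]),
}

g = bias_direction([("he", "she"), ("him", "her")], emb)
debiased = neutralize(emb["engineer"], g)
print(g, debiased, np.dot(debiased, g))
```

After neutralizing, the vector for "engineer" has no component along the estimated direction, while its other coordinates are unchanged. Whether that removes the bias in downstream use still has to be evaluated separately.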

Overall, debiasing word embeddings is an active area of research, and different techniques may be appropriate for different applications and types of biases. It is important to carefully evaluate the effectiveness and potential side effects of debiasing techniques before applying them in practice.