# Deep Learning with Python

## How do machines learn?

### What is deep learning?
Deep learning is a subset of machine learning, which is a subset of artificial intelligence. In industry, deep learning is used to solve practical tasks in a variety of fields such as computer vision (image), natural language processing (text), and automatic speech recognition (audio). In short, deep learning is a subset of methods in the machine learning toolbox, primarily using artificial neural networks, which are a class of algorithm loosely inspired by the human brain.

### What is machine learning?

A field of study that gives computers the ability to learn without being
explicitly programmed.

### What is the difference between machine learning and deep learning?

Machine learning is the broader field: any method that learns from data rather than being explicitly programmed. Deep learning is the subset of those methods built primarily on artificial neural networks, and it is the usual choice for perceptual tasks such as computer vision (image), natural language processing (text), and automatic speech recognition (audio).

Examples of machine learning algorithms include linear regression, logistic regression, decision trees, random forests, and support vector machines. Examples of deep learning algorithms include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs). 

Examples where machine learning is used without deep learning include:
1. Spam detection, which is often a simple classification problem that classic algorithms handle well without deep learning.
2. Credit card fraud detection
3. Customer segmentation

#### Machine Learning algorithms can be classified into two main types:
1. Supervised Learning: In supervised learning, the algorithm is trained on input-output pairs. The algorithm learns to map the input to the output.<br>
Example: Given a dataset of images of cats and dogs, each labeled as a cat or a dog, the algorithm learns to classify new images as either a cat or a dog.
2. Unsupervised Learning: In unsupervised learning, the algorithm is given input data without the corresponding output. The algorithm learns to find patterns in the input data.<br>
Example: Given a dataset of images of cats and dogs, the algorithm learns to group the images into two separate clusters of cats and dogs. It doesn't know the labels of the images, but it tries to find patterns in the data and group them accordingly.

Machine learning algorithms can also be divided into parametric and non-parametric models. Parametric models have a fixed number of parameters, whereas non-parametric models have a number of parameters that grows with the amount of training data.

As you can see, there are really four different types of algorithms to choose from. An algorithm is either unsupervised or supervised, and either parametric or nonparametric.

#### Supervised Parametric Learning
Supervised parametric learning machines are machines with a fixed number of knobs (that’s the parametric part), wherein learning occurs by turning the knobs. Input data comes in, is processed based on the angle of the knobs, and is transformed into a prediction. The knobs are then turned to minimize the error in the prediction. A classic example of a supervised parametric learning machine is linear regression. In machine learning terms, the knobs are called weights, and the goal is to find the best weights for making predictions. The initial weights are typically set randomly.

General steps in this type of learning are:
1. Initialize the weights randomly.
2. For each data point in the training data, compute the prediction using the current weights.
3. Compute the error between the prediction and the actual target.
4. Update the weights to minimize the error.
5. Repeat steps 2-4 until the error is minimized.
6. Use the weights to make predictions on new data.
7. Evaluate the model on the test data.
8. If the model is not performing well, go back to step 1.
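
The steps above can be sketched for the simplest case, a linear model with one weight and one bias. This is a minimal illustration, not a reference implementation: the function name `train_linear_model`, the toy data, the learning rate, and the epoch count are all assumptions chosen for the example.

```python
import random

def train_linear_model(data, learning_rate=0.01, epochs=1000, seed=0):
    """Fit y ~= w * x + b by repeating the predict / measure / update loop."""
    rng = random.Random(seed)
    # Step 1: initialize the weights (the "knobs") randomly.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):                  # Step 5: repeat steps 2-4
        for x, y in data:
            pred = w * x + b                 # Step 2: predict with current weights
            error = pred - y                 # Step 3: signed error for this point
            # Step 4: move each weight against the gradient of error ** 2.
            w -= learning_rate * 2 * error * x
            b -= learning_rate * 2 * error
    return w, b

# Toy training data generated from y = 2x + 1 (made up for illustration).
train_data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = train_linear_model(train_data)

# Step 6: use the learned weights on an input that was not in the training data.
prediction = w * 5.0 + b
print(f"w={w:.3f}, b={b:.3f}, prediction for x=5: {prediction:.3f}")
```

On this noise-free toy data the learned weights should land very close to 2 and 1. Real data is noisy, so the error would level off above zero rather than reach it.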


#### Unsupervised Parametric Learning
It is similar to supervised parametric learning, but the difference is that the input data is not labeled. The goal is to find patterns in the data without knowing the labels. A classic example of an unsupervised parametric learning machine is k-means clustering, where the parameters are the cluster centers and their number (k) is fixed in advance. In this type of learning, the initial parameters are again typically set randomly.

The same general steps as above are followed, except that the error is measured against the structure of the data itself (for k-means, the distance from each point to its cluster center) rather than against a labeled target.

In non-parametric learning, the number of parameters grows with the amount of training data. A classic example of a non-parametric learning machine is k-nearest neighbors, which has no weights to initialize: it stores the training data itself and predicts by comparing new inputs against it. The boundary between the two families is not always sharp. In parametric learning, we set the number of parameters explicitly, and in practice we often choose a larger number when more training data is available; once chosen, though, that number stays fixed during training.

## Introduction to neural networks

### What is a neural network?

I highly recommend watching this playlist on neural networks by 3Blue1Brown: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

A neural network is simply a collection of connected neurons. The neurons are connected in a way that allows them to send signals to each other. The signals are then processed by the neurons to produce an output. The output is then used to make a decision or perform a task. Neural networks are inspired by the human brain, which is made up of billions of interconnected neurons. 

Neural networks generally consist of three main components: an input layer, one or more hidden layers, and an output layer. The input layer receives input data, the hidden layers process the input data, and the output layer produces the final output. The hidden layers are called hidden because their values are neither the inputs we supply nor the outputs we observe; they are intermediate representations internal to the network.

Each connection between two neurons has a weight. The weights determine how much influence one neuron has on another. The weights are learned during the training process, where the network is trained on a dataset of input-output pairs. The training process involves adjusting the weights to minimize the error between the predicted output and the actual output. The backpropagation algorithm computes how much each weight contributed to the error (the gradient), and an optimization algorithm, such as gradient descent, uses those gradients to adjust the weights.

The values in the hidden layers are called activations. Each activation is computed by applying a function, called the activation function, to the weighted sum of a neuron's inputs. Not every neuron needs to fire every time input data is passed through the network, so the activation function determines which neurons fire and how strongly. A classic activation function is the sigmoid function, which maps its input to a value between 0 and 1. It introduces non-linearity into the network, which allows it to learn complex patterns in the data. Sigmoid is now rarely used in hidden layers because of the vanishing gradient problem, although it remains common in output layers. The most common hidden-layer activation function today is the ReLU function, which maps its input to a value between 0 and infinity. ReLU is computationally efficient and is much less prone to the vanishing gradient problem.

What is the vanishing gradient problem?<br>
The vanishing gradient problem occurs when the gradients flowing back through the network become very small, which makes it difficult for the early layers to learn. The sigmoid function contributes to it because it has a very small gradient for inputs of large magnitude: where the curve flattens out, even a large change in x leads to a very small change in y (check the sigmoid graph), and its gradient is never larger than 0.25. Multiplying many such small gradients together across layers shrinks them toward zero. The ReLU function is much less prone to the problem because it has a constant gradient of 1 for positive input values.
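
To make the effect concrete, the sketch below compares the slope of sigmoid and ReLU at a few inputs and then multiplies ten slopes together, as backpropagation would across ten layers. It is a simplification that ignores the weights, which also enter the product in a real network; the function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.1f}")

# One slope factor per layer is multiplied in during backpropagation.
# Even at sigmoid's best-case slope, ten layers shrink the signal a lot.
chained_sigmoid = sigmoid_grad(0.0) ** 10   # 0.25 ** 10
chained_relu = relu_grad(1.0) ** 10         # 1.0 ** 10
print(chained_sigmoid, chained_relu)
```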

The interface for the neural network is simple: it accepts an input variable as information and a weights variable as knowledge, and it outputs a prediction.

What does that mean when a neural network makes a prediction?<br>
Roughly speaking, it means the network gives a high score to the inputs based on how similar they are to the weights.
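
One way to see this "similarity score" is to write the simplest possible network as a weighted sum (a dot product) of inputs and weights. The function name `neural_network` and the example numbers are assumptions made for illustration.

```python
def neural_network(inputs, weights):
    """Return the weighted sum (dot product) of inputs and weights."""
    assert len(inputs) == len(weights)
    return sum(i * w for i, w in zip(inputs, weights))

weights = [1.0, 0.0, -1.0]   # the network's "knowledge"

similar = neural_network([1.0, 0.0, -1.0], weights)    # same pattern as the weights
unrelated = neural_network([0.0, 1.0, 0.0], weights)   # nothing in common with the weights
opposite = neural_network([-1.0, 0.0, 1.0], weights)   # the reverse pattern

print(similar, unrelated, opposite)
```

The input that matches the weights gets the highest score, the unrelated input scores zero, and the reversed input gets the lowest score.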

Neural networks can be stacked on top of each other to form deep neural networks. Deep neural networks are used to learn complex patterns in the data, such as images, text, and audio. Deep neural networks are used in a variety of applications, such as computer vision, natural language processing, and automatic speech recognition.

### What is a perceptron?

A perceptron is a simple neural network that consists of a single neuron. The neuron takes an input, applies a function to the input, and produces an output. The output is then used to make a decision or perform a task. The perceptron is the building block of neural networks, and it is used to learn patterns in the data. The perceptron is inspired by the human brain, which is made up of billions of interconnected neurons.

A perceptron has three main parts: the inputs, a set of weights (one per input, usually plus a bias), and an output. It multiplies each input by its weight, sums the results, and passes the sum through a function to produce the output. During training, the weights are adjusted to reduce the error between the predicted output and the actual output, using an update rule such as gradient descent.

On its own, a single perceptron can only learn patterns that are linearly separable. Its value is as a building block: networks built from many such neurons are what get used on images, text, and audio in applications such as computer vision, natural language processing, and automatic speech recognition.
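
As a small illustration, the sketch below trains a single neuron to reproduce the logical AND of two inputs. It uses the classic perceptron update rule with a step function, which is one possible choice and not necessarily the exact procedure these notes have in mind; the function names and the learning rate of 1.0 are assumptions.

```python
def step(z):
    """Fire (1) when the weighted sum is non-negative, otherwise stay off (0)."""
    return 1 if z >= 0 else 0

def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_perceptron(samples, learning_rate=1.0, epochs=20):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)   # -1, 0, or +1
            # Move each weight toward the target, scaled by its input.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Truth table for logical AND, which is linearly separable.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
outputs = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(weights, bias, outputs)
```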

### What is a neural network layer?

A neural network layer is a group of neurons that sit at the same depth of the network. The neurons in a layer do not usually connect to each other; instead, each one receives signals from the previous layer, processes them, and sends its output to the next layer.

A network is generally organized into three kinds of layers: an input layer, one or more hidden layers, and an output layer. The input layer receives input data, the hidden layers process it, and the output layer produces the final output. The hidden layers are called hidden because their values are neither the inputs we supply nor the outputs we observe.

Each connection between a neuron in one layer and a neuron in the next has a weight, which determines how much influence the first neuron has on the second. The weights are learned during training by adjusting them to minimize the error between the predicted output and the actual output, using an optimization algorithm such as gradient descent.

The values a layer produces are called activations. Each one is computed by applying an activation function, such as sigmoid or ReLU, to the weighted sum of the neuron's inputs.

Neural network layers can be stacked on top of each other to form deep neural networks. Deep neural networks are used to learn complex patterns in data such as images, text, and audio, in applications such as computer vision, natural language processing, and automatic speech recognition.
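
A fully connected layer and the idea of stacking can be sketched in a few lines. The weights and biases below are arbitrary numbers picked for the example, and `dense_layer` is a hypothetical helper, not a library function.

```python
def relu(x):
    return max(0.0, x)

def dense_layer(inputs, weights, biases, activation=relu):
    """One fully connected layer: each row of `weights` belongs to one neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

inputs = [1.0, 2.0]

# Hidden layer: 3 neurons, each with 2 weights (one per input).
hidden = dense_layer(
    inputs,
    weights=[[0.5, -1.0], [1.0, 1.0], [-1.0, 0.5]],
    biases=[0.0, -1.0, 0.5],
)

# Output layer: 1 neuron with 3 weights (one per hidden activation), no activation.
output = dense_layer(hidden, weights=[[1.0, 1.0, 1.0]], biases=[0.0],
                     activation=lambda z: z)
print(hidden, output)
```

Stacking is just feeding one layer's activations into the next layer as its inputs.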

### What is a neural network activation function?

An activation function is a mathematical function applied to the weighted sum of a neuron's inputs to determine whether, and how strongly, the neuron should fire. It introduces non-linearity into the network, which allows it to learn complex patterns in the data. A classic choice is the sigmoid function, which maps its input to a value between 0 and 1, but it is now rarely used in hidden layers because of the vanishing gradient problem. The most common choice today is the ReLU function, which maps its input to a value between 0 and infinity; it is computationally efficient and much less prone to the vanishing gradient problem.

### What is a neural network loss function?

A loss function is a mathematical function that measures the error between the predicted output and the actual output. It tells us how well the network is performing on a given task, and training adjusts the weights of the network to minimize it. The most common loss function is the mean squared error, which measures the average squared difference between the predicted output and the actual output. The mean squared error is used to train regression models, which predict a continuous value. Another common loss function is the cross-entropy loss, which measures how far the predicted class probabilities are from the true class labels: it is small when the network assigns a high probability to the correct class and large when it assigns a low one. The cross-entropy loss is used to train classification models, which predict a discrete value.
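
Both losses fit in a few lines. The sketch below is a simplified version: `cross_entropy` handles a single example whose true class is given as an index, and all the numbers are made up for illustration.

```python
import math

def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs, true_class):
    """Negative log of the probability the model gave to the correct class."""
    return -math.log(predicted_probs[true_class])

# Regression: squared misses are 0.25, 0.25, and 0.0.
mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# Classification with three classes, where class 1 is the correct one.
confident_right = cross_entropy([0.1, 0.8, 0.1], true_class=1)
confident_wrong = cross_entropy([0.8, 0.1, 0.1], true_class=1)

print(mse, confident_right, confident_wrong)
```

The cross-entropy is much larger when the model puts most of its probability on a wrong class than when it puts it on the right one.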


Why is the error squared? <br>
Think about an archer hitting a target. When the shot hits 2 inches too high, how much did the archer miss by? When the shot hits 2 inches too low, how much did the archer miss by? Both times, the archer missed by only 2 inches. The primary reason to square “how much you missed” is that it forces the output to be positive. (pred - goal_pred) could be negative in some situations, unlike actual error.

Doesn’t squaring make big errors (>1) bigger and small errors (<1) smaller? <br>
Yeah ... It’s kind of a weird way of measuring error, but it turns out that amplifying big errors and reducing small errors is OK. Later, you’ll use this error to help the network learn, and you’d rather it pay attention to the big errors and not worry so much about the small ones. Good parents are like this, too: they practically ignore errors if they’re small enough (breaking the lead on your pencil) but may go nuclear for big errors (crashing the car). See why squaring is valuable?

Why do you want only positive error?<br>
Eventually, you’ll be working with millions of input -> goal_prediction pairs, and we’ll still want to make accurate predictions. So, you’ll try to take the average error down to 0. If errors could be negative, they would cancel each other out when averaged. For example, if you have 3 errors: -1, 1, and 2, the average error is about 0.67, even though the network missed by 1, 1, and 2. Squaring them gives 1, 1, and 4, for an average of 2.0, so no miss is hidden by another. That is why you want the errors to be positive.

What is hot and cold learning?<br>
Hot and cold learning means wiggling the weights to see which direction reduces the error the most, moving the weights in that direction, and repeating until the error gets to 0.

Problems of Hot and cold learning:<br>
1. It’s inefficient. You have to wiggle the weights in every direction to see which one reduces the error the most.
2. It’s imprecise. You can’t tell how much to wiggle the weights to reduce the error the most. You have to try different amounts to see what works best.
3. It’s slow. You have to try different amounts of wiggling for every weight, and you have to do this for every input -> goal_prediction pair.
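
A minimal sketch of hot and cold learning for a single weight shows where these problems come from. The input, goal, starting weight, and step size are arbitrary values chosen for the example, and `hot_and_cold` is a name made up here.

```python
def hot_and_cold(input_value, goal, weight=0.5, step_amount=0.001, max_iterations=2000):
    """Wiggle one weight up and down, keeping whichever direction lowers the error."""
    for iteration in range(max_iterations):
        error = (input_value * weight - goal) ** 2
        # Two extra predictions per step, just to pick a direction.
        up_error = (input_value * (weight + step_amount) - goal) ** 2
        down_error = (input_value * (weight - step_amount) - goal) ** 2
        if error <= up_error and error <= down_error:
            break   # neither wiggle helps any more
        if up_error < down_error:
            weight += step_amount
        else:
            weight -= step_amount
    return weight, error, iteration

final_weight, final_error, iterations = hot_and_cold(input_value=0.5, goal=0.8)
print(final_weight, final_error, iterations)
```

With a fixed step of 0.001 the weight needs on the order of a thousand steps to travel from 0.5 to about 1.6, and it can never land closer to the ideal weight than the step size allows.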

What is Gradient descent?<br>
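Gradient descent is a way to find the weights that reduce the error without trying all possible weights. Instead of wiggling each weight to see what happens, it calculates the slope (derivative) of the error with respect to each weight. The sign of the slope gives the direction to move the weight, and its size indicates how far to move it. It does this for a single input -> goal_prediction pair, and then repeats for every input -> goal_prediction pair.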
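
For a single weight and a squared error, the idea can be sketched as follows. The learning rate of 0.5 and the example values are assumptions picked so the loop converges quickly.

```python
def gradient_descent(input_value, goal, weight=0.5, learning_rate=0.5, iterations=50):
    """Repeatedly move one weight against the slope of the squared error."""
    for _ in range(iterations):
        pred = input_value * weight
        # Slope of error = (pred - goal) ** 2 with respect to the weight.
        gradient = 2 * (pred - goal) * input_value
        # Direction and amount both come from the gradient.
        weight -= learning_rate * gradient
    return weight

learned_weight = gradient_descent(input_value=0.5, goal=0.8)
print(learned_weight, 0.5 * learned_weight)
```

Because the size of each step shrinks as the error shrinks, the weight approaches the ideal value of 1.6 in a few dozen updates.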

What is divergence?<br>
Divergence is when each weight update overshoots by more than the last, so the weights and the error grow with every step instead of shrinking. Eventually the numbers become too large to represent, and the error shows up as inf or NaN (not a number). It typically happens when the inputs are large, because the weight update is proportional to the input, so a large input produces a large update. The fix is to make the updates smaller by multiplying each weight update (not the weights themselves) by a small number, called the learning rate.

What is the learning rate?<br>
The learning rate is a small positive number, usually less than 1, that scales how much the weights change on each update. If it is too large, the updates overshoot and training can diverge. If it is too small, the weights barely change and training is very slow. A well-chosen learning rate prevents overshooting while still making progress.

What does overshooting mean?<br>
Overshooting means that a weight update is so large that the weight jumps past the value that minimizes the error and lands on the other side. A small overshoot is harmless, because the next update pulls the weight back. But if every update overshoots by more than the previous one, the error grows with each step, which is divergence. Reducing the learning rate shrinks the updates and stops the overshooting.
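
The sketch below runs the same single-weight update twice on a deliberately large input, once with a learning rate that is too big and once with a smaller one. The function name `run` and all values are assumptions chosen to make the contrast visible.

```python
def run(input_value, goal, learning_rate, weight=0.5, iterations=20):
    """Return the squared error recorded before each of `iterations` updates."""
    errors = []
    for _ in range(iterations):
        pred = input_value * weight
        errors.append((pred - goal) ** 2)
        # The update is proportional to the input, so large inputs mean large steps.
        weight -= learning_rate * 2 * (pred - goal) * input_value
    return errors

diverging = run(input_value=2.0, goal=0.8, learning_rate=1.0)
converging = run(input_value=2.0, goal=0.8, learning_rate=0.1)

print("learning rate 1.0:", diverging[:4])
print("learning rate 0.1:", converging[:4])
```

With the large learning rate the error grows at every step; with the smaller one it shrinks toward zero.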


Types of Gradient Descent Optimization Algorithms:<br>
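1. Batch Gradient Descent - In this type of optimization, instead of updating the weights after each data point, the weights are updated after a fixed number of data points (a batch). Note that many texts use the name "batch gradient descent" for the full-dataset variant in item 4 and call the variant described here mini-batch gradient descent.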
2. Stochastic Gradient Descent - In this type of optimization, a single data point is used to compute the gradient of the loss function. The weights are then updated using the gradient of the loss function. This type of optimization is computationally efficient and is suitable for large datasets.
3. Mini-Batch Gradient Descent - In this type of optimization, a small batch of data points is used to compute the gradient of the loss function. The weights are then updated using the gradient of the loss function. This type of optimization is computationally efficient and is suitable for large datasets.
4. Full Gradient Descent - In this type of optimization, the entire dataset is used to compute the gradient of the loss function. The weights are then updated using the gradient of the loss function. This type of optimization is computationally expensive and is not suitable for large datasets.
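
These variants differ only in how many data points contribute to each weight update, so a single `batch_size` parameter can cover them. The sketch below fits a one-weight model; the data, learning rate, and function name `train` are assumptions for illustration.

```python
import random

def train(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~= w * x; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the gradient of (w * x - y) ** 2 over the batch.
            gradient = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * gradient
            updates += 1
    return w, updates

data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]   # made-up data from y = 3x

w_sgd, n_sgd = train(data, batch_size=1)             # stochastic: one point per update
w_mini, n_mini = train(data, batch_size=2)           # mini-batch: a few points per update
w_full, n_full = train(data, batch_size=len(data))   # full: whole dataset per update

print(w_sgd, n_sgd, w_mini, n_mini, w_full, n_full)
```

All three reach the same weight here; what changes is the number of updates per pass over the data and how noisy each update is.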

Regularization Techniques:<br>
Regularization techniques help improve a neural network's generalization ability by reducing overfitting. They do this by minimizing needless complexity and exposing the network to more diverse data. Regularization techniques include L1 and L2 regularization, dropout, and early stopping.

L1 and L2 regularization are two common regularization techniques used in neural networks. L1 regularization adds a penalty term to the loss function that is proportional to the absolute value of the weights. L2 regularization adds a penalty term to the loss function that is proportional to the square of the weights. L1 regularization is used to encourage sparsity in the weights, while L2 regularization is used to prevent the weights from becoming too large.
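
The two penalty terms can be sketched directly. The `strength` parameter (often written as lambda) and the `data_loss` value are hypothetical numbers for the example.

```python
def l1_penalty(weights, strength):
    """Penalty proportional to the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weight values."""
    return strength * sum(w ** 2 for w in weights)

weights = [0.5, -2.0, 0.0]
data_loss = 0.3   # hypothetical loss on the training data

total_l1 = data_loss + l1_penalty(weights, strength=0.01)   # 0.3 + 0.01 * 2.5
total_l2 = data_loss + l2_penalty(weights, strength=0.01)   # 0.3 + 0.01 * 4.25

print(total_l1, total_l2)
```

Note how the large weight (-2.0) dominates the L2 penalty because it is squared, which is why L2 pushes hardest against large weights.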

Dropout:<br>
Dropout is a regularization technique that randomly sets a fraction of the neurons in the network to zero during training. This helps prevent the network from becoming too reliant on any one neuron and encourages the network to learn more robust features.
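
A minimal sketch of dropout applied to one layer's activations follows. It uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; that rescaling is a common convention and an assumption here, not something stated in these notes.

```python
import random

def dropout(activations, drop_rate, rng, training=True):
    """Zero each activation with probability drop_rate (training only)."""
    if not training or drop_rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_rate
    # Survivors are scaled up so the expected sum of the layer stays the same.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 1000

dropped = dropout(activations, drop_rate=0.5, rng=rng)
at_inference = dropout(activations, drop_rate=0.5, rng=rng, training=False)

print(sum(1 for a in dropped if a == 0.0), "of", len(dropped), "activations zeroed")
```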


Why do we need multiple layers in a neural network?<br>
Some datasets are so complex that there is no direct correlation between the input and the output. In such cases, consider a network with two layers: the first layer learns intermediate features of the data that each have some correlation with the output, and the second layer takes those features and learns how they correlate with the output. In this way, the network can learn complex patterns in the data. This only works if the layers are non-linear. Non-linearity is important because stacking purely linear layers still produces a linear function of the input, so the network would not be able to learn complex patterns. So, when the data has no correlation between input and output, we try to create correlation as the data passes through the layers. One way to achieve this is to switch off the nodes whose values are negative (which is what ReLU does): the network can then make a hidden node respond to some inputs and ignore others. This is what happens in the hidden layers of the neural network.

Why can training more lead to overfitting?<br>
Training a machine learning model for too many epochs or iterations can lead to overfitting due to the model learning the training data too well, including its noise and outliers, at the expense of generalization to unseen data.

How do you get neural networks to train only on the signal (the essence of a dog) and ignore the noise (other stuff irrelevant to the classification)?<br>
One way is early stopping. It turns out a large amount of noise comes in the fine-grained detail of an image, and most of the signal (for objects) is found in the general shape and perhaps color of the image.

Types of regularization techniques:<br>
1. Early stopping: Stop training when the validation error starts to increase, indicating that the model is starting to overfit.
2. Dropout (industry standard): Randomly set a fraction of units to 0 at each update during training, which helps prevent overfitting.
3. Batch Gradient Descent: Update the weights after each batch of training data, rather than after each individual data point. Averaging the update over a batch smooths out the noise from individual examples, and mainly makes training faster and more stable.
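
Early stopping, the first technique in the list, reduces to watching the validation error and remembering the best epoch. The sketch below assumes the validation errors are already available as a list; the `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions for illustration.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the best epoch, stopping once the validation error
    has failed to improve for `patience` consecutive epochs."""
    best_error = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_epoch = error, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation error keeps rising: likely overfitting
    return best_epoch

# Made-up validation curve: improves, then starts rising again.
validation_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(validation_errors)
print("keep the weights from epoch", best)
```

In a real training loop the weights from the best epoch would be saved and restored after stopping.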

Constraints for choosing the activation function:<br>
1. Must be continuous and (almost everywhere) differentiable: This is necessary for gradient-based optimization algorithms to work. ReLU, for example, has a kink at 0 but is differentiable everywhere else.
2. Must be infinite in domain: The function must produce an output for every possible input, since a neuron's weighted sum can be any real number.
3. Good activation functions are monotonic, never changing direction. This helps the network learn more effectively. This particular constraint isn’t technically a requirement. Unlike functions that have missing values (noncontinuous), you can optimize functions that aren’t monotonic. But consider the implication of having multiple input values map to the same output value.
4. Good activation functions are nonlinear (they squiggle or turn). This allows the network to learn from data that isn’t linearly separable.

State of the art activation functions:<br>
1. ReLU: Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning models. It is defined as f(x) = max(0, x) and is computationally efficient.
2. Sigmoid: The sigmoid function is a classic activation function that squashes the output to the range (0, 1). It is often used in the output layer of binary classification models.
3. Tanh: The hyperbolic tangent (tanh) function is similar to the sigmoid function but squashes the output to the range (-1, 1). It is often used in the hidden layers of neural networks. This function is zero-centered, which can help with optimization.
4. Leaky ReLU: Leaky ReLU is a variant of the ReLU function that allows a small gradient when the input is negative, preventing the dying ReLU problem.
5. Softmax: The softmax function is commonly used in the output layer of multi-class classification models to convert raw scores into probabilities.
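
Each of these functions is only a line or two of code. The sketch below uses plain Python; the `slope` of 0.01 for Leaky ReLU is a commonly used default but still an assumption, and subtracting the maximum score inside `softmax` is a standard trick for numerical stability.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of going to 0."""
    return x if x > 0 else slope * x

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for x in [-2.0, 0.0, 3.0]:
    print(x, relu(x), leaky_relu(x), round(sigmoid(x), 4), round(tanh(x), 4))
print(softmax([2.0, 1.0, 0.1]))
```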

Why do we need softmax activation when we have sigmoid activation?<br>
Sigmoid activation is used for binary classification, squashing output values between 0 and 1 for two-class problems. Softmax activation, however, is used for multi-class classification, ensuring the output probabilities sum up to 1 across all classes, allowing the model to make probabilistic predictions across multiple classes. They serve different purposes based on the nature of the classification problem: sigmoid for binary and softmax for multi-class tasks. Because the softmax outputs must sum to 1, raising the probability of the right class automatically lowers the probabilities of the wrong classes, so the network does not need to penalize each wrong class separately.
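
The difference shows up when both functions are applied to the same three raw scores. The scores below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    top = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # raw outputs for three classes

independent = [sigmoid(s) for s in scores]   # each class judged on its own
competing = softmax(scores)                  # classes share one unit of probability

print("sigmoid:", [round(p, 3) for p in independent], "sum =", round(sum(independent), 3))
print("softmax:", [round(p, 3) for p in competing], "sum =", round(sum(competing), 3))
```

The sigmoid outputs add up to more than 1, so they cannot be read as a probability distribution over the three classes, while the softmax outputs can.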

When neural networks have lots of parameters but not very many training examples, overfitting is difficult to avoid. What is the solution?<br>
Regularization techniques can help prevent overfitting in neural networks with many parameters and limited training examples. Common regularization techniques include L1 and L2 regularization, dropout, and early stopping.

But even with regularization, the model may still overfit if the training data is limited. In such cases, other strategies can be employed to mitigate overfitting.

For example, image models have a lot of parameters but often not very many training examples, so overfitting is difficult to avoid. An architecture called the Convolutional Neural Network (CNN) is specifically designed to handle image data and has been shown to be effective in learning from limited training examples. CNNs use convolutional layers to extract features from images and pooling layers to reduce the dimensionality of the data, which can help prevent overfitting by capturing the most important information in the images. Because it reuses a small set of weights across the whole image, a CNN has far fewer parameters than a fully connected network of similar size, and hence the model is less likely to overfit.

### Convolutional Neural Networks (CNNs)

In a CNN, many small linear layers are reused in every position of the image instead of a single very big linear layer. Each small layer, called a convolutional kernel, sees only a small patch of the input, and the same weights are applied to every patch. This greatly reduces the number of parameters, and hence the model is less likely to overfit.

In CNNs, the filters (also called kernels) are used to extract features from the images. Each filter is convolved with the input image, that is, slid across it position by position, to extract features such as edges, textures, and shapes. These features are then passed through activation functions such as ReLU to introduce non-linearity into the model. The output of the convolutional layers is then passed through pooling layers to reduce the dimensionality of the data and make the model more robust to variations in the input.
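
The pooling step can be sketched on its own. The example below uses max pooling with non-overlapping 2x2 windows, which is one common choice among several; the feature map values are made up.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, cols - size + 1, size)]
        for r in range(0, rows - size + 1, size)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(feature_map)
print(pooled)   # a 4x4 map becomes a 2x2 map
```

Each output value summarizes a whole window, so small shifts of a feature inside the window do not change the result.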

This technique allows each kernel to learn a particular pattern and then search for that pattern anywhere in the image during inference. This is the reason why CNNs are so effective in image recognition tasks and can also be used to reduce overfitting in neural networks with many parameters and limited training examples.

"Reusing weights is one of the most important innovations in deep learning." - What does this mean?<br>
Reusing weights in deep learning refers to the practice of sharing the same set of weights across multiple parts of a neural network. This allows the network to learn common patterns or features in the data more efficiently by reusing the learned representations across different parts of the network. By sharing weights, the network can generalize better to new data and make more accurate predictions.
For example, in Convolutional Neural Networks (CNNs), the same set of filters (weights) is applied to different parts of the input image to extract features such as edges, textures, and shapes. By reusing the same filters, the network can learn to detect these features more effectively and efficiently, leading to better performance on image recognition tasks.
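
Weight reuse in a convolutional layer can be sketched with a single 2x2 kernel slid over a 4x4 image. The image and kernel values are made up for illustration, and, like most deep learning libraries, the sketch slides the kernel without flipping it (strictly speaking, a cross-correlation).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # The same kernel weights are reused at every position.
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        feature_map.append(row)
    return feature_map

# A dark left half and a bright right half: one vertical edge in the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Responds when brightness increases from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)
```

The kernel has only 4 weights, yet it produces all 9 outputs. A fully connected layer mapping the same 16 pixels to 9 outputs would need 16 * 9 = 144 weights.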

### Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language. NLP techniques enable computers to understand, interpret, and generate human language, allowing for tasks such as language translation, sentiment analysis, text summarization, and more.

NLP tries to solve the problem of understanding and generating human language by using techniques from machine learning, deep learning, and linguistics. Some common NLP tasks include:

1. Text Classification: Categorizing text documents into predefined categories or labels.
2. Named Entity Recognition (NER): Identifying and classifying named entities (e.g., names of people, organizations, locations) in text.
3. Sentiment Analysis: Determining the sentiment or opinion expressed in text (e.g., positive, negative, neutral).
4. Machine Translation: Translating text from one language to another.
5. Text Generation: Generating human-like text based on input prompts.
6. Question Answering: Answering questions based on a given context or passage.
7. Text Summarization: Generating concise summaries of longer text documents.

NLP techniques often involve processing and analyzing text data using methods such as tokenization, part-of-speech tagging, parsing, and semantic analysis. Machine learning models, including neural networks, are commonly used in NLP tasks to learn patterns and relationships in text data and make predictions or generate text.

Generally speaking, NLP tasks seek to do one of three things:<br>
1. label a region of text
2. link two or more regions of text
3. try to fill in missing information (missing words) based on context.

Why do neural networks only take fixed-size inputs?<br>
Neural networks typically require fixed-size inputs because the architecture and parameters of the network are fixed during training and inference. Having fixed-size inputs allows for consistent and efficient computation within the network.

When processing data with neural networks, each layer in the network expects inputs of a specific size, and the parameters of the network (such as the number of neurons in each layer) are set accordingly. If inputs were of variable sizes, it would be challenging to ensure that the dimensions of the input match the expected dimensions of each layer, leading to inconsistencies and difficulties in training and inference.

Additionally, fixed-size inputs facilitate the use of batch processing techniques, where multiple data samples are processed simultaneously to improve computational efficiency. With fixed-size inputs, batches of data can be easily organized and processed together, optimizing computation and memory usage during training.

Furthermore, many neural network architectures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), rely on specific input shapes and sizes to leverage their inherent structure and properties effectively. For example, CNNs use filters of fixed sizes to extract features from images, while RNNs handle sequences of varying length by processing one fixed-size input at each time step.

Overall, requiring fixed-size inputs allows neural networks to maintain consistency in computation, facilitate batch processing, and leverage the specific architectures and properties of different network types for efficient and effective learning and inference.

#### RNNs

##### What is the need for "1 of K" encoding in RNNs?
In the context of Recurrent Neural Networks (RNNs), "1 of K" encoding (also known as "one-hot encoding") is a method used to represent categorical variables as binary vectors. Each category is represented by a vector where one element is set to 1 and all other elements are set to 0.

The need for "1 of K" encoding in RNNs arises when dealing with categorical data as input or output. Here's why it's used:

Representation of Categorical Variables: RNNs often deal with sequential data where some of the inputs or outputs are categorical in nature (e.g., words in a sentence, characters in a text, categorical labels). These categorical variables need to be represented numerically for the RNN to process them.

Ensuring Numerical Compatibility: Neural networks, including RNNs, typically operate on numerical data. By using "1 of K" encoding, categorical variables are converted into numerical representations that can be processed by the network.

Distinct Representations: Each category is represented by a unique binary vector, ensuring that the network can distinguish between different categories. This is crucial for tasks such as language modeling, where each word needs to be uniquely represented.

Loss Calculation: In many tasks, RNNs are trained using loss functions that compare numerical predictions against numerical targets. One-hot encoding the ground truth labels lets the loss compare them directly with the network's predicted outputs, which are probability vectors of the same length.

Output Layer Activation: For classification tasks, RNNs often use softmax activation in the output layer, which produces one probability per class. A one-hot target has the same shape, with a 1 for the correct class, so it pairs naturally with the softmax output when computing the loss.

In summary, "1 of K" encoding is needed in RNNs to represent categorical variables numerically, ensure compatibility with neural network operations, enable distinct representations for different categories, facilitate loss calculation, and support appropriate activation functions in the output layer.
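
A minimal sketch of "1 of K" encoding for characters follows. The four-character vocabulary and the helper name `one_hot` are assumptions made for the example.

```python
def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1 at position `index`."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

vocab = ['a', 'b', 'c', 'd']                       # hypothetical vocabulary
char_to_index = {ch: i for i, ch in enumerate(vocab)}

# Encode the sequence "cab" one character (one time step) at a time.
encoded = [one_hot(char_to_index[ch], len(vocab)) for ch in "cab"]
for ch, vector in zip("cab", encoded):
    print(ch, vector)
```

Each time step of the RNN would receive one of these fixed-size vectors as its input.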

##### But can't we just use the position of the character in the vocabulary? Is that encoding used in RNNs?
Yes, instead of using "1 of K" encoding, one could represent categorical variables, such as characters or words, using their position in a vocabulary. This approach is commonly referred to as "index-based encoding" or "integer encoding."

In index-based encoding, each unique category (e.g., character or word) is mapped to a unique integer value based on its position in the vocabulary. For example, if we have a vocabulary containing the characters 'a', 'b', 'c', and 'd', we could assign them the integer values 0, 1, 2, and 3 respectively.

While index-based encoding is indeed an alternative to "1 of K" encoding, there are some considerations to keep in mind:

Implied Ordering: Integer values carry an order and a magnitude that the categories do not have. If 'a' is 0 and 'd' is 3, feeding these numbers directly into the network suggests that 'd' is somehow three steps greater than 'a', and the effect grows with the size of the vocabulary. "1 of K" encoding avoids this because every category is equally distant from every other.

Sparse Representation: "1 of K" encoding results in a sparse representation, where all elements in the encoded vector are zero except for one. Stored naively, each vector needs as many elements as the vocabulary has entries, which is far more memory than a single integer. Stored sparsely, only the position of the single non-zero element needs to be kept, which is the integer index itself.

Network Behavior: Some neural network architectures may perform better or worse with different encoding schemes. For example, some architectures may be more suited to handle "1 of K" encoded inputs efficiently, while others may work well with integer encoded inputs.

Both encoding schemes have their own advantages and disadvantages, and the choice between them depends on factors such as the size of the vocabulary, the specific requirements of the task, and the characteristics of the neural network architecture being used.

Because RNNs process their inputs with matrix multiplication, "1 of K" encoding is a natural fit: multiplying a one-hot vector by a weight matrix simply selects the single row of the matrix that corresponds to the category. Since every element except one is zero, this can be computed efficiently as a row lookup by index, which is how integer encoding and "1 of K" encoding end up working together in practice.
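
The row-selection effect is easy to check by hand. In the sketch below, the weight matrix values are arbitrary and `vector_times_matrix` is a helper written for this example.

```python
def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix (vector @ matrix)."""
    n_cols = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# One row of weights per vocabulary entry ('a', 'b', 'c', 'd'), two columns.
weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

one_hot_c = [0, 0, 1, 0]                                # "1 of K" encoding of 'c'
via_matmul = vector_times_matrix(one_hot_c, weights)    # full multiplication
via_lookup = weights[2]                                 # integer index of 'c'

print(via_matmul, via_lookup)
```

Both routes give the same result, but the lookup skips all the multiplications by zero.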
