# Neural Networks

- From an NLP perspective
    - Another mechanism to process a corpus and get insights out of it

## Why Do We Need Machine Learning (Neural Networks)?

- To understand the need for NN, we first need to understand the limitations of standard algorithms
    - __*Standard Algorithm*__: An algorithm is a sequence of instructions to solve a problem
        - The steps to solve problems are well defined
            - We know _what the inputs are_
            - We _know the rules for manipulating the input_, and
            - We _know what output we expect from the program_
        - Steps are coded in some ordered sequence to transform the input from one form to another
        - Rules are unambiguous
            - Without specifying rules, we can't solve any problem in _algorithmic fashion_
        - Sufficient Knowledge is available to fully solve the problem
            - We need to know about the Domain to solve the problem
- There are problems whose solutions cannot be formulated using standard rule-based algorithms
- Problems that involve subtle inputs cannot be solved using a standard algorithmic approach, for example face recognition, speech recognition, and handwritten character recognition
- Finding examples and using experience gained in similar situations is useful
- Examples provide certain underlying patterns
- Patterns give the ability to predict some outcome or help in constructing an approximate model
- __Learning__ is the key to the ambiguous world

## What is Learning

- Given the input and output, finding the relationship between Input and Output
    - The learned relationship is called a model
    - The model lets us give an input similar to the examples used in learning and get the output
- There are problems where we can't clearly give exact details of the input and the rules to find the output.
    - In such cases, we want the __*Machine*__ to __*Learn*__ from the given input (examples) to find (estimate) the output
- The system looks for (latent) patterns in the examples and uses them to predict/estimate the outcome
- There will always be new data, so _learning_ is a continuous process and the model should be updated accordingly

## How Is ML Used in NLP?

- Classification
- Word Embedding
- Learning a Sentence (This is not possible with a Probabilistic Language Model)
- (How to) Encode a Paragraph
- (How to) Encode a Problem Statement
- Translation from one language to another
    - Can be done with Statistical Machine Translation models, but NN-based models have an advantage
- Modeling conversations - Chatbot

## Perceptron

- From [17]
    - The perceptron is the basic element of a neural network
    - It is where neural networks started
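
A minimal sketch of what a single perceptron computes, assuming a step activation that outputs 1 when the weighted sum plus bias is positive. The function name `perceptron_output` and the parameter values are hypothetical, chosen only for illustration.

```python
def perceptron_output(x, w, w0):
    """Weighted sum of the inputs plus bias, passed through a step function."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if activation > 0 else 0


# Hypothetical hand-picked parameters, not learned from data.
weights = [0.5, 0.5]
bias = -0.25

fired = perceptron_output([1.0, 0.0], weights, bias)   # 0.5 - 0.25 is positive
silent = perceptron_output([0.0, 0.0], weights, bias)  # -0.25 is not positive
print(fired, silent)
```

Other activation functions are possible; the step function is used here only because it gives a hard 0/1 decision.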

## Gradient Descent

- Helps in answering
    - How do we iterate to really get to the solution by descending down or ascending up the slope using gradient descent
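
The idea of stepping down a slope can be sketched on a one-variable function. This is an illustrative example, not taken from the lecture: the function $f(x) = (x - 3)^2$, the learning rate, and the step count are all assumptions.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move towards a minimum."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * gradient(x)  # use "+" instead to ascend
    return x


# Assumed example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))
```

A learning rate that is too large can overshoot the minimum, and one that is too small needs many more steps, so the value above is only a reasonable choice for this particular function.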

# Classification

- From [v1] Week 3, Lec 2
    - __Classification__ is the task of assigning predefined disjoint categories to objects
    - Example
        - Detect spam emails
        - Find the set of mobile phones priced below Rs. 10000 that received 5-star reviews
            - In this kind of classification, NER is used to extract the features, and classification is then performed over the extracted features
        - Identify the category of an incoming document as Sports, Politics, Entertainment or Business
        - Determine whether a movie review is a positive or negative review

## Definition of Classification

- The input is a collection of records
- Each record is represented by a tuple ($x$,$y$)
- $x=x_1,x_2,...,x_n$ and $y=y_1,y_2,...,y_n$ are the input features and the classes respectively
- Example
    - $x \in \mathbb{R}^{2}$ is a vector, the set of observed variables
    - ($x$,$y$) are related by an unknown function $f(.)$. The goal is to estimate a function $g(.)$, also known as a classifier function, such that $g(x) = f(x), \forall x$
    
    ![Classification_Model](images/Classification_Model.jpg)
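
The definition can be made concrete with a small sketch. The records, the candidate classifier `g`, and its parameters below are hypothetical; the point is only to show records as $(x, y)$ tuples and to check how often $g(x)$ agrees with the given class.

```python
# Hypothetical records: x is a 2-dimensional feature vector, y is the class (+1 or -1).
records = [
    ((0.2, 0.1), -1),
    ((0.4, 0.3), -1),
    ((0.9, 0.8), 1),
    ((0.7, 0.6), 1),
]


def g(x):
    """Candidate classifier: an assumed linear rule x1 + x2 - 1 > 0."""
    return 1 if x[0] + x[1] - 1.0 > 0 else -1


# Fraction of records on which g agrees with the known class.
agreement = sum(g(x) == y for x, y in records) / len(records)
print(agreement)
```

Agreement on the given records does not guarantee $g(x) = f(x)$ for unseen $x$; it only measures the fit on the examples at hand.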

## What Does the Classifier Function Do?

- Assuming we have a linearly separable $x$, the classifier function $g(.)$ implements a decision rule
    - Fitting a straight line to a given data set requires two parameters ($w_0$ and $w$)
        - $w_0$ is the bias
            - It determines the distance of the line from the origin
        - $w$ is the weight
            - It determines the orientation of the line
        - Both $w_0$ and $w$ are called _Model Parameters_
        - __*Fitting the Line*__: This decision boundary line is estimated in an iterative fashion
            - Using the errors that are calculated after fitting a line in each iteration
            - This fit is learnt during the above iterative process
    - The decision rule divides the data space into two sub-spaces separating two classes using a boundary
    - The distance of the boundary from the origin $= \frac{w_0}{\parallel w \parallel}$
    - Distance of any point from the boundary $=d=\frac{g(x)}{\parallel w \parallel}$
    
    ![Classifier_Function](images/Classifier_Function.jpg)
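
The two distance formulas can be checked numerically. The sketch below assumes the linear form $g(x) = w \cdot x + w_0$ and uses hypothetical values for $w$, $w_0$, and the query point; both distances are signed, so the sign tells which side of the boundary the origin or the point lies on.

```python
import math


def g(x, w, w0):
    """Assumed linear discriminant g(x) = w . x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters chosen so that the norm of w is exactly 5.
w = [3.0, 4.0]
w0 = -5.0

boundary_distance = w0 / norm(w)                 # w0 / ||w||
point_distance = g([3.0, 4.0], w, w0) / norm(w)  # g(x) / ||w||
print(boundary_distance, point_distance)
```

Here the negative value for the origin and the positive value for the point indicate that they lie on opposite sides of the boundary.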

## Linearly Separable

- From https://en.wikipedia.org/wiki/Linear_separability
    - If a set of points can be separated by using a line (a hyperplane in higher dimensions), then we can say that the points are __*linearly separable*__
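
As an illustration, the sketch below searches a small grid of candidate lines for one that separates two labelings of the same four points. The grid, the helper names, and the two labelings (the AND and XOR truth tables) are assumptions made for this example. A failed grid search is not a proof in general, although XOR is known not to be linearly separable.

```python
from itertools import product


def separates(points, labels, w1, w2, w0):
    """True if w1*x1 + w2*x2 + w0 > 0 holds exactly for the points labeled 1."""
    return all(
        (w1 * x1 + w2 * x2 + w0 > 0) == bool(y)
        for (x1, x2), y in zip(points, labels)
    )


def find_separating_line(points, labels):
    """Try every (w1, w2, w0) on a coarse grid; return the first that works."""
    grid = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0
    for w1, w2, w0 in product(grid, repeat=3):
        if separates(points, labels, w1, w2, w0):
            return (w1, w2, w0)
    return None


points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_line = find_separating_line(points, [0, 0, 0, 1])  # AND labels
xor_line = find_separating_line(points, [0, 1, 1, 0])  # XOR labels
print(and_line is not None, xor_line is None)
```

The search finds a line for the AND labeling and none for the XOR labeling.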

# Linear Models for Classification

- The goal of classification is to take a vector $x$ and assign it to one of the $N$ discrete classes $\mathbb{C}_n$, where $n=1,2,3,...,N$
    - The classes are disjoint and an input is assigned to only one class
    - The input space is divided into _decision regions_
    - The boundaries are called _decision boundaries or decision surfaces_
    - In general, if the input space is $N$ dimensional, then $g(x) = 0$ defines an $(N-1)$-dimensional hyperplane

## Geometry of the Linear Discriminant Function

![Geomentry_of_the_Linear_Discriminant_Function](images/Geomentry_of_the_Linear_Discriminant_Function.jpg)

## Decision Boundary

### 1D-Decision Boundary for OR Gate

- The decision regions are separated by a hyperplane, which is defined by $g(x) = 0$.
- This separates linearly separable classes $\mathbb{C}_1$ and $\mathbb{C}_2$
- The $OR$ Gate _Truth Table_

| $x_1$ | $x_2$ | y |
|-------|-------|---|
| 0     | 0     | 0 |
| 0     | 1     | 1 |
| 1     | 0     | 1 |
| 1     | 1     | 1 |

- If either input feature $x_1$ or $x_2$ is 1, then the result is $1$
- The diagram below depicts the boundary line for this OR gate
    ![1D_Decision_Boundary_For_OR_Gate](images/1D_Decision_Boundary_For_OR_Gate.jpg)
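
One line that reproduces the OR truth table can be checked directly. The weights and bias below are hand-picked assumptions, not values from the lecture; many other choices work equally well.

```python
def g(x1, x2, w=(1.0, 1.0), w0=-0.5):
    """Assumed linear discriminant; the boundary is x1 + x2 - 0.5 = 0."""
    return w[0] * x1 + w[1] * x2 + w0


truth_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}

# Predict 1 on the positive side of the boundary and 0 otherwise.
predictions = {inputs: int(g(*inputs) > 0) for inputs in truth_table}
print(predictions == truth_table)
```

Only $(0, 0)$ falls on the negative side, because its score is $-0.5$ while every other input scores at least $0.5$.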

### 1D-Decision Boundary for AND Gate

- The decision regions are separated by a hyperplane and it is defined by $g(x) = 0$.
- This separates linearly separable classes $\mathbb{C}_1$ and $\mathbb{C}_2$
- The $AND$ Gate _Truth Table_

| $x_1$ | $x_2$ | y |
|-------|-------|---|
| 0     | 0     | 0 |
| 0     | 1     | 0 |
| 1     | 0     | 0 |
| 1     | 1     | 1 |

- The output is 1 only when both $x_1$ and $x_2$ are 1
- The boundary line for this $AND$ gate will be similar to the one below

    ![1D_Decision_Boundary_For_AND_Gate](images/1D_Decision_Boundary_For_AND_Gate.jpg)
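
Instead of picking the line by hand, the weights for the AND gate can be learned from the truth table. The sketch below uses the classic perceptron learning rule, which is an assumption here since the notes do not name a specific update rule. The function name, epoch count, and learning rate are also illustrative.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge w and w0 by the error on each sample."""
    w = [0.0, 0.0]
    w0 = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + w0 > 0 else 0
            error = target - predicted  # 0 when the sample is already correct
            w[0] += learning_rate * error * x1
            w[1] += learning_rate * error * x2
            w0 += learning_rate * error
    return w, w0


and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, w0 = train_perceptron(and_samples)

learned = [1 if w[0] * x1 + w[1] * x2 + w0 > 0 else 0 for (x1, x2), _ in and_samples]
print(learned)
```

Because the AND data is linearly separable, this rule settles on a separating line after a handful of passes; the exact weights it finds depend on the sample order and the starting values.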

### Decision Boundary for Sentiments - NLP

- The concept of decision boundary can be applied to NLP as well
- Let us consider some positive and negative sentiment terms which are contained in two classes $\mathbb{C}_P$ and $\mathbb{C}_N$
    - $\mathbb{C}_P = [\text{achieve efficient improve profitable}] = +1$
    - $\mathbb{C}_N = [\text{termination penalties misconduct serious}] = -1$
    
    ![Decision_Boundary_For_Sentiments](images/Decision_Boundary_For_Sentiments.jpg)
    
- __*Note*__
    - Here inputs are texts
    - We need to transform the input for finding the decision boundary
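
One simple way to transform text into numeric input is to count how many terms from each class appear in it. This is only a sketch under that assumption: the two-feature representation, the weights, and the example sentences are hypothetical, and tokenization is a plain whitespace split that ignores punctuation.

```python
POSITIVE_TERMS = {"achieve", "efficient", "improve", "profitable"}
NEGATIVE_TERMS = {"termination", "penalties", "misconduct", "serious"}


def featurize(text):
    """Map text to (count of positive terms, count of negative terms)."""
    tokens = text.lower().split()
    positives = sum(token in POSITIVE_TERMS for token in tokens)
    negatives = sum(token in NEGATIVE_TERMS for token in tokens)
    return (positives, negatives)


def classify(text, w=(1.0, -1.0), w0=0.0):
    """Linear decision rule on the two counts; ties fall to the negative class."""
    x = featurize(text)
    score = w[0] * x[0] + w[1] * x[1] + w0
    return 1 if score > 0 else -1


positive_label = classify("efficient teams improve results")
negative_label = classify("serious misconduct led to termination")
print(positive_label, negative_label)
```

With these weights the decision boundary is the line where the two counts are equal.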

### Decision Boundary - Variation of $W_J$

- The slope (weight) of the line is decided by variations of $W_J$
    - While fitting a line / learning the model, the slope of the line needs to be adjusted to get the _line of fit_
    - So, $W_J$ needs to be adjusted based on the errors calculated after each iteration during fitting

    ![Decision_Boundary_Variation_of_W_J](images/Decision_Boundary_Variation_of_W_J.jpg)
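
For a two-dimensional input, the effect of a weight on the slope can be made explicit. Assuming $g(x) = w_1 x_1 + w_2 x_2 + w_0$, solving $g(x) = 0$ for $x_2$ gives $x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}$. The helper name and the parameter values below are hypothetical.

```python
def boundary_slope_intercept(w1, w2, w0):
    """Slope and intercept of the line w1*x1 + w2*x2 + w0 = 0, solved for x2."""
    if w2 == 0:
        raise ValueError("the boundary is vertical, so x2 cannot be expressed via x1")
    return -w1 / w2, -w0 / w2


# Vary w1 only, keeping w2 and w0 fixed (assumed values).
slopes = []
intercepts = []
for w1 in (0.5, 1.0, 2.0):
    slope, intercept = boundary_slope_intercept(w1, w2=1.0, w0=-1.0)
    slopes.append(slope)
    intercepts.append(intercept)

print(slopes, intercepts)
```

Changing $w_1$ rotates the line around its fixed $x_2$-intercept, while the intercept itself depends only on $w_0$ and $w_2$.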

### Decision Boundary - Variation of Bias

- The distance of the decision boundary from origin is decided by bias ($w_0$)
    - During fitting, the line/hyperplane needs to be moved to get an approximate decision boundary that splits the input into decision regions
    - So, the bias $w_0$ also needs to be varied during the iterative process of learning/fitting
- The figure below shows the contribution of the bias to the creation of the decision boundary

    ![Decision_Boundary_Variation_of_Bias](images/Decision_Boundary_Variation_of_Bias.jpg)
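
The effect of the bias can be sketched by keeping $w$ fixed and varying only $w_0$. Assuming $g(x) = w \cdot x + w_0$, the point on the boundary nearest to the origin is $-w_0 \, w / \parallel w \parallel^2$. The helper name and the values of $w$ and $w_0$ are hypothetical.

```python
import math


def closest_boundary_point(w, w0):
    """Point on the boundary w . x + w0 = 0 that is nearest to the origin."""
    squared_norm = sum(wi * wi for wi in w)
    return [-w0 * wi / squared_norm for wi in w]


w = [3.0, 4.0]  # fixed orientation (assumed values)
distances = []
for w0 in (-5.0, -10.0, -15.0):
    point = closest_boundary_point(w, w0)
    distances.append(math.hypot(*point))  # length of the vector from the origin

print([round(d, 6) for d in distances])
```

The distance grows in proportion to the magnitude of $w_0$ while the direction of the nearest point stays the same, so the boundary moves parallel to itself.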

### Decision Boundary and Gradient Descent

- Example showing the iterative process of fitting the line using _Gradient Descent_
    - Assume we have the following
        - Input: 10 points are taken as input, shown in picture as $\perp$
        - Output: Assume output classes are well defined and well known
    - What we don't know is how to fit the line so that decision region(s) are created for each class
        - This fit has to be learnt during the iterative process
- Picture shows the fitting of the line in 10 iterations
    - Let $y$ be our target
        - The green line which goes over $\perp$ in the left diagram
    - Let $\hat{y}$ be the estimate
        - The first line is shown in blue color (parallel to x-axis)
    - Goal is to have $y-\hat{y} \approx 0$
        - That is, the error should reach the _minima_; once no further change can be made to the model parameters, we can stop iterating
    - In each iteration the error $y-\hat{y}$ is calculated and propagated back to the model, which is asked to learn the parameters ($w$ and $w_0$ keep changing in each iteration)
    
Image 1             |  Image 2
:-------------------------:|:-------------------------:
![Decision_Boundary_And_Gradient_Descent](images/Decision_Boundary_And_Gradient_Descent.jpg)  |  ![Decision_Boundary_And_Gradient_Descent_2](images/Decision_Boundary_And_Gradient_Descent_2.jpg)
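
The iterative fit described above can be sketched as gradient descent on the mean squared error between $y$ and $\hat{y} = w x + w_0$. The data (10 points on an assumed target line $y = 2x + 1$), the learning rate, and the function name are illustrative assumptions, not the values behind the pictures.

```python
def fit_line(xs, ys, iterations=10, learning_rate=0.5):
    """Gradient descent on mean squared error for y_hat = w * x + w0."""
    w, w0 = 0.0, 0.0  # first estimate: a horizontal line, parallel to the x-axis
    n = len(xs)
    history = []      # mean squared error measured before each update
    for _ in range(iterations):
        errors = [y - (w * x + w0) for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Move against the gradient of the mean squared error.
        w += learning_rate * (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        w0 += learning_rate * (2.0 / n) * sum(errors)
    return w, w0, history


xs = [i / 10 for i in range(10)]   # 10 input points
ys = [2.0 * x + 1.0 for x in xs]   # assumed target line
w, w0, history = fit_line(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

The error shrinks with every iteration, but 10 iterations do not reach the target exactly; running more iterations brings $w$ and $w_0$ closer to the assumed values 2 and 1.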

## Linearly Separable

- Is the data below linearly separable?

    ![Linearly_Separable](images/Linearly_Separable.jpg)