# 1) What is a loss function

### Common Loss functions in machine learning

Machines learn by means of a loss function. It is a method of evaluating how well a specific algorithm models the given data. If predictions deviate too much from actual results, the loss function returns a very large number. Gradually, with the help of an optimization algorithm, the model learns to reduce the error in its predictions. In this article we will go through several loss functions and their applications in machine learning and deep learning.

There is no one-size-fits-all loss function for algorithms in machine learning. Several factors are involved in choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating derivatives and, to some degree, the percentage of outliers in the data set.

Broadly, loss functions can be classified into two major categories depending on the type of learning task: regression losses and classification losses. In classification, we try to predict an output from a finite set of categorical values, e.g. given a large data set of images of handwritten digits, categorizing each image as one of the digits 0–9. Regression, on the other hand, deals with predicting a continuous value, e.g. given the floor area, number of rooms and size of rooms, predicting the price of the property.

# Regression Losses

### Mean Square Error/Quadratic Loss/L2 Loss  

Mathematical formulation :-
<img src = 'pic/1.png'>
As the name suggests, mean squared error is measured as the average of the squared differences between predictions and actual observations. It is only concerned with the average magnitude of the errors, irrespective of their direction. However, due to squaring, predictions that are far from the actual values are penalized much more heavily than less deviated predictions. MSE is also smooth and differentiable everywhere, which makes its gradients easy to calculate.

In [12]:
import numpy as np

y_hat = np.array([0.000, 0.166, 0.333])
y_true = np.array([0.000, 0.254, 0.998])

def mse(predictions, targets):
    differences = predictions - targets
    differences_squared = differences ** 2
    mean_of_differences_squared = differences_squared.mean()
    return mean_of_differences_squared

print("Predicted values are: " + str(["%.8f" % elem for elem in y_hat]))
print("Actual values are: " + str(["%.8f" % elem for elem in y_true]))
mse_val = mse(y_hat, y_true)
print("ms error is: " + str(mse_val))

Predicted values are: ['0.00000000', '0.16600000', '0.33300000']
Actual values are: ['0.00000000', '0.25400000', '0.99800000']
ms error is: 0.14998966666666666


### Mean Absolute Error/L1 Loss
Mathematical formulation :-
<img src='pic/2.png'>
Mean absolute error, on the other hand, is measured as the average of the absolute differences between predictions and actual observations. Like MSE, it measures the magnitude of the errors without considering their direction. Unlike MSE, MAE is not differentiable at zero, so optimizing it can need more complicated tools such as linear programming or subgradient methods. On the other hand, MAE is more robust to outliers since it does not square the errors.

In [7]:
def mae(predictions, targets):
    differences = predictions - targets
    absolute_differences = np.absolute(differences)
    mean_absolute_differences = absolute_differences.mean()
    return mean_absolute_differences

print("Predicted values are: " + str(["%.8f" % elem for elem in y_hat]))
print("Actual values are: " + str(["%.8f" % elem for elem in y_true]))
mae_val = mae(y_hat, y_true)
print("mae error is: " + str(mae_val))

Predicted values are: ['0.00000000', '0.16600000', '0.33300000']
Actual values are: ['0.00000000', '0.25400000', '0.99800000']
mae error is: 0.251
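
To make the robustness claim concrete, here is a small sketch comparing how MSE and MAE react when one target value is an outlier. The arrays and the helper names `mse_loss` and `mae_loss` are made up for illustration and are not part of the cells above.

```python
import numpy as np

def mse_loss(predictions, targets):
    # Average of squared differences
    return float(np.mean((predictions - targets) ** 2))

def mae_loss(predictions, targets):
    # Average of absolute differences
    return float(np.mean(np.abs(predictions - targets)))

predictions = np.array([1.5, 2.5, 3.5, 4.5])
clean_targets = np.array([1.0, 2.0, 3.0, 4.0])     # every error is 0.5
outlier_targets = np.array([1.0, 2.0, 3.0, 14.0])  # last error is -9.5

mse_clean = mse_loss(predictions, clean_targets)
mae_clean = mae_loss(predictions, clean_targets)
mse_outlier = mse_loss(predictions, outlier_targets)
mae_outlier = mae_loss(predictions, outlier_targets)

print("clean:   MSE =", mse_clean, " MAE =", mae_clean)
print("outlier: MSE =", mse_outlier, " MAE =", mae_outlier)
```

With these made-up numbers the single outlier multiplies MSE by 91 but MAE by only 5.5, which is what "more robust to outliers" means in practice.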


### Mean Bias Error
Mathematical formulation :-
<img src='pic/3.png'>
This is much less common in the machine learning domain than its counterparts. It is the same as MAE, with the only difference that we don’t take absolute values. Clearly there is a need for caution, as positive and negative errors can cancel each other out. Although less useful as a measure of accuracy, it can determine whether the model has a positive or negative bias.

In [13]:
def mbe(predictions, targets):
    differences = predictions - targets
    mean_of_differences = differences.mean()
    return mean_of_differences

print("Predicted values are: " + str(["%.8f" % elem for elem in y_hat]))
print("Actual values are: " + str(["%.8f" % elem for elem in y_true]))
mbe_val = mbe(y_hat, y_true)
print("mb error is: " + str(mbe_val))

Predicted values are: ['0.00000000', '0.16600000', '0.33300000']
Actual values are: ['0.00000000', '0.25400000', '0.99800000']
mb error is: -0.251
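
The cancellation effect is easy to see on a toy case. The sketch below uses invented arrays: one set of predictions is wrong by +1 and -1 alternately, the other overshoots every target by 0.5.

```python
import numpy as np

targets = np.array([2.0, 2.0, 2.0, 2.0])

# Errors of +1, -1, +1, -1: large but symmetric
noisy = np.array([3.0, 1.0, 3.0, 1.0])
mbe_noisy = float((noisy - targets).mean())          # signed errors cancel
mae_noisy = float(np.abs(noisy - targets).mean())    # magnitudes do not

# Every prediction overshoots by 0.5: small but systematic
biased = targets + 0.5
mbe_biased = float((biased - targets).mean())

print("noisy:  MBE =", mbe_noisy, " MAE =", mae_noisy)
print("biased: MBE =", mbe_biased)
```

The noisy predictions get an MBE of zero despite being wrong every time, while the sign of the second MBE shows the direction of the systematic bias.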


# Classification Losses

### Hinge Loss/Multi class SVM Loss
In simple terms, the score of the correct category should be greater than the score of each incorrect category by some safety margin (usually one). Hence hinge loss is used for maximum-margin classification, most notably for support vector machines. Although not differentiable everywhere, it is a convex function, which makes it easy to work with the usual convex optimizers used in the machine learning domain.

Mathematical formulation :-
<img src='pic/4.png'>  
Consider an example where we have three training examples and three classes to predict: dog, cat and horse. Below are the values predicted by our algorithm for each of the classes :-  
<img src='pic/5.jpg'>  
Computing hinge losses for all 3 training examples :-

In [17]:
## 1st training example
max(0, (1.49) - (-0.39) + 1) + max(0, (4.21) - (-0.39) + 1)
max(0, 2.88) + max(0, 5.6)
# 2.88 + 5.6
# 8.48 (High loss as very wrong prediction)

## 2nd training example
max(0, (-4.61) - (3.28)+ 1) + max(0, (1.46) - (3.28)+ 1)
max(0, -6.89) + max(0, -0.82)
# 0 + 0
# 0 (Zero loss as correct prediction)

## 3rd training example
max(0, (1.03) - (-2.27)+ 1) + max(0, (-2.37) - (-2.27)+ 1)
max(0, 4.3) + max(0, 0.9)
# 4.3 + 0.9
# 5.2 (High loss as very wrong prediction)
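
The hand calculation above can be wrapped in a small function. This is a sketch: the name `multiclass_hinge_loss` is hypothetical, and because the score table only exists as an image, the column order (dog, cat, horse) and the true class of each example are assumptions inferred from the arithmetic above.

```python
import numpy as np

def multiclass_hinge_loss(scores, correct_index, margin=1.0):
    """Sum of max(0, s_j - s_correct + margin) over all wrong classes j."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[correct_index] + margin)
    margins[correct_index] = 0.0  # the correct class is not compared with itself
    return float(margins.sum())

# Assumed layout: columns are [dog, cat, horse]; assumed true classes: dog, cat, horse
examples = [
    ([-0.39, 1.49, 4.21], 0),
    ([-4.61, 3.28, 1.46], 1),
    ([1.03, -2.37, -2.27], 2),
]
losses = [multiclass_hinge_loss(scores, label) for scores, label in examples]
print([round(loss, 2) for loss in losses])  # [8.48, 0.0, 5.2]
```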

### Cross Entropy Loss/Negative Log Likelihood
This is the most common setting for classification problems. Cross-entropy loss increases as the predicted probability diverges from the actual label.  

Mathematical formulation :-

<img src = 'pic/6.png'>

Notice that when the actual label is 1 (y(i) = 1), the second half of the function disappears, whereas when the actual label is 0 (y(i) = 0), the first half drops out. In short, we are just taking the negative log of the predicted probability for the ground truth class. An important aspect of this is that cross-entropy loss heavily penalizes predictions that are confident but wrong.

In [18]:
predictions = np.array([[0.25,0.25,0.25,0.25],
                        [0.01,0.01,0.01,0.96]])
targets = np.array([[0,0,0,1],
                   [0,0,0,1]])

def cross_entropy(predictions, targets, epsilon=1e-10):
    predictions = np.clip(predictions, epsilon, 1. - epsilon)
    N = predictions.shape[0]
    ce_loss = -np.sum(np.sum(targets * np.log(predictions + 1e-5)))/N
    return ce_loss

cross_entropy_loss = cross_entropy(predictions, targets)
print ("Cross entropy loss is: " + str(cross_entropy_loss))

Cross entropy loss is: 0.7135329699138555
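
The two-term description above matches the usual binary cross-entropy, so the following sketch assumes that form: the loss is the negative mean of `y*log(p) + (1-y)*log(1-p)`. The function name and the probabilities are invented for illustration. It shows how the loss grows as predictions move from confident and right to confident and wrong.

```python
import numpy as np

def binary_cross_entropy(probabilities, labels, epsilon=1e-12):
    # Clip so that log(0) never occurs
    p = np.clip(np.asarray(probabilities, dtype=float), epsilon, 1.0 - epsilon)
    y = np.asarray(labels, dtype=float)
    # For y = 1 only the first term survives; for y = 0 only the second
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

labels = [1, 1, 0]
confident_right = binary_cross_entropy([0.95, 0.90, 0.05], labels)
unsure = binary_cross_entropy([0.60, 0.55, 0.45], labels)
confident_wrong = binary_cross_entropy([0.05, 0.10, 0.95], labels)

print("confident and right:", confident_right)
print("unsure:             ", unsure)
print("confident and wrong:", confident_wrong)
```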


# 2) Loss function vs RSS

### RSS
In a model with a single explanatory variable, RSS is given by:

<img src = "https://wikimedia.org/api/rest_v1/media/math/render/svg/2f6526aa487b4dc460792bf1eeee79b2bba77709" >
or  
<img src = "https://wikimedia.org/api/rest_v1/media/math/render/svg/63e1a994055df3be373f8f85a194e3bd1f750e3e" >

And here we are wondering about an interesting question:  

Does minimizing the RSS in your model always equate to minimizing the MSE as MSE=1/n * RSS?

The answer is **YES**  

This is a trivial application of optimisation rules. So long as n is constant (i.e., does not depend on θ), then for any objective function F and any positive function h you have:  

argmin[F(θ)] = argmin[h(n)*F(θ)]

Here h(n) = 1/n > 0, so RSS and MSE are minimized by the same θ.
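
A quick numerical check of this claim, as a sketch on synthetic data: a one-parameter model y ≈ θ·x is evaluated on a grid of candidate θ values, and the minimizer of RSS is compared with the minimizer of MSE. The data, the grid and the variable names are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + rng.normal(scale=0.1, size=x.size)  # synthetic data, true slope 3

thetas = np.linspace(0.0, 6.0, 601)               # candidate slopes
residuals = y[None, :] - thetas[:, None] * x[None, :]
rss = np.sum(residuals ** 2, axis=1)              # one RSS value per theta
mse = rss / x.size                                # same curve scaled by 1/n

best_by_rss = thetas[np.argmin(rss)]
best_by_mse = thetas[np.argmin(mse)]
print(best_by_rss, best_by_mse)
```

Scaling the objective by a positive constant changes the height of the curve but not the location of its minimum, so both lines pick the same θ.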


# 3) Discrete Wavelet Transform (DWT) vs Discrete Fourier Transform (DFT)

The discrete Fourier transform (DFT) tells you the frequency components of a signal, averaged over the entire duration of the signal. The discrete wavelet transform (DWT) gives information about the frequency (actually, basis) components as well as being able to indicate what time these components occur at.  
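
A minimal sketch of this difference, using a unit impulse placed at two different times. The DFT magnitudes are identical in both cases (the timing is hidden in the phase, spread over every coefficient), while a single level of Haar detail coefficients is nonzero only near the impulse. The helper `haar_details` is a hand-rolled, hypothetical one-level Haar step, not a library call.

```python
import numpy as np

n = 16
early = np.zeros(n)
early[2] = 1.0    # impulse near the start
late = np.zeros(n)
late[11] = 1.0    # same impulse, later in time

# DFT magnitude spectra: flat and identical for both signals
mag_early = np.abs(np.fft.fft(early))
mag_late = np.abs(np.fft.fft(late))

def haar_details(signal):
    # One Haar level: scaled difference of each adjacent pair
    pairs = signal.reshape(-1, 2)
    return (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)

detail_early = haar_details(early)
detail_late = haar_details(late)

print(np.allclose(mag_early, mag_late))   # True
print(np.flatnonzero(detail_early), np.flatnonzero(detail_late))  # [1] [5]
```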

### Pros of DWT vs DFT  
1) Time and frequency information  

2) A lot of flexibility - there are many different types of DWT bases, whereas the DFT is just based on cos and sin of different frequencies (or equivalently, complex exponentials of different frequencies).  

3) Because data are shattered into more components, it becomes much easier to filter in or filter out a given nonstationary waveform.  

4) A lot of signals are found to be sparse in an appropriate DWT basis. This makes it easy to, for instance, filter noise out of a phoneme by using a simple binary mask in the DWT domain.

### Cons of DWT vs DFT

1) Greater complexity. Greater complexity translates in this case into more resources required to perform the computation - more memory and/or processor cycles and/or time. If you don’t need to locate events in time, or if the signal is stationary in the frequency domain, there is no advantage to DWT, and more effort to compute the values.  

2) The theory is more difficult to understand. DFTs are much easier to understand, and are more intuitive. However, understanding enough to use a library function for DWTs is not much more difficult than learning to use a DFT routine, perhaps 2 or 3 times harder.  

3) The flexibility of DWTs is a two-edged sword - it is sometimes very difficult to choose which basis to use. Do you need a differentiable basis? Then you won’t want a Haar or low order Daubechies basis.  

4) It is more difficult to interpret the results.

# 4) How can this technique be useful for data reduction if the wavelet transformed data are of the same length as the original data

The discrete wavelet transform (DWT) is a linear signal processing technique. It transforms a data vector D into a numerically different vector D’ of wavelet coefficients. The two vectors are of the same length. However, it is useful for compression in the sense that wavelet-transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest wavelet coefficients, e.g. retaining all wavelet coefficients larger than some particular threshold and setting the remaining coefficients to zero. The resulting data representation is sparse, so computations that can take advantage of sparsity are very fast if performed in wavelet space. Given a set of coefficients, an approximation of the original data can be obtained by applying the inverse DWT.

The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique involving sines and cosines. The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data in each iteration, resulting in fast computational speed. The method is as follows:   

1) The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary.  

2) Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. The second performs a weighted difference, which acts to bring out the detailed features of the data.

3) The two functions are applied to pairs of input data, resulting in two sets of data of length L/2. In general, these represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively.

4) The two functions are recursively applied to sets of data obtained in the previous loop, until the resulting data sets obtained are of length 2.

5) A selection of values from the data sets obtained in the above iterations are designated the wavelet coefficients of the transformed data. 
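
The steps above can be sketched with the Haar wavelet, the simplest choice of smoothing and difference functions. Everything here is an illustrative assumption rather than a reference implementation: the function names, the Haar basis, the sample data and the threshold of 0.5 are all made up. As in step 4, the recursion stops once the resulting sets have length 2.

```python
import numpy as np

def pad_to_power_of_two(data):
    """Step 1: zero-pad so the length is an integer power of 2."""
    data = np.asarray(data, dtype=float)
    target = 1 << max(data.size - 1, 0).bit_length()
    return np.pad(data, (0, target - data.size))

def haar_dwt(data):
    """Steps 2-4: repeatedly split the smooth part into smooth + detail."""
    coeffs = np.asarray(data, dtype=float).copy()
    length = coeffs.size
    while length > 2:
        pairs = coeffs[:length].reshape(-1, 2)
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)  # weighted average
        detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)  # weighted difference
        coeffs[:length // 2] = smooth
        coeffs[length // 2:length] = detail
        length //= 2
    return coeffs

def haar_idwt(coeffs):
    """Inverse transform: undo the levels from coarsest to finest."""
    data = np.asarray(coeffs, dtype=float).copy()
    length = 2
    while length < data.size:
        smooth = data[:length].copy()
        detail = data[length:2 * length].copy()
        data[0:2 * length:2] = (smooth + detail) / np.sqrt(2.0)
        data[1:2 * length:2] = (smooth - detail) / np.sqrt(2.0)
        length *= 2
    return data

signal = pad_to_power_of_two([4.0, 4.0, 4.0, 4.0, 10.0, 10.2])  # padded to length 8
coeffs = haar_dwt(signal)

# Data reduction: keep only the strong coefficients, zero the rest
compressed = np.where(np.abs(coeffs) >= 0.5, coeffs, 0.0)
approx = haar_idwt(compressed)

print("nonzero coefficients kept:", np.count_nonzero(compressed), "of", coeffs.size)
print("max reconstruction error: ", np.max(np.abs(approx - signal)))
```

The transformed vector has the same length as the input, but only the few nonzero coefficients (and their positions) need to be stored, which is where the reduction comes from.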

Wavelet transforms can be applied to multidimensional data such as data cubes. Wavelet transforms
have many real world applications, including the compression of fingerprint images, computer vision,
and analysis of time-series data and data cleaning.

# 5) What is Model Building
In regression analysis, model building is the process of developing a probabilistic model that best describes the relationship between the dependent and independent variables. The major issues are finding the proper form (linear or curvilinear) of the relationship and selecting which independent variables to include. In building models it is often desirable to use qualitative as well as quantitative variables.

Quantitative variables measure how much or how many; qualitative variables represent types or categories. For instance, suppose it is of interest to predict sales of an iced tea that is available in either bottles or cans. Clearly, the independent variable “container type” could influence the dependent variable “sales.” Container type is a qualitative variable, however, and must be assigned numerical values if it is to be used in a regression study. So-called dummy variables are used to represent qualitative variables in regression analysis. For example, the dummy variable x could be used to represent container type by setting x = 0 if the iced tea is packaged in a bottle and x = 1 if the iced tea is in a can. If the beverage could be placed in glass bottles, plastic bottles, or cans, it would require two dummy variables to properly represent the qualitative variable container type. In general, k - 1 dummy variables are needed to model the effect of a qualitative variable that may assume k values.
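
As a sketch of the k - 1 rule, the hypothetical helper below encodes the three container types with two dummy variables, treating the first category as the baseline (all zeros). The function name and the sample observations are invented for illustration.

```python
def dummy_encode(values, categories):
    """Encode each value with k - 1 indicator columns; categories[0] is the baseline."""
    indicator_levels = categories[1:]
    return [[1 if value == level else 0 for level in indicator_levels]
            for value in values]

container_types = ["glass bottle", "plastic bottle", "can"]   # k = 3
observations = ["can", "glass bottle", "plastic bottle", "can"]

encoded = dummy_encode(observations, container_types)
print(encoded)  # [[0, 1], [0, 0], [1, 0], [0, 1]]
```

A third column is unnecessary: a row of all zeros already identifies the baseline category, and adding it would make the columns linearly dependent with the intercept.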

# 6) What is Heuristic Method
There are 2^d possible attribute combinations of d attributes, so an exhaustive search quickly becomes infeasible and heuristic methods are used instead.

Typical heuristic attribute selection methods:

1) Best step-wise feature selection:

        The best single attribute is picked first

        Then the next best attribute conditioned on the first, ...

2) Step-wise attribute elimination:

        Repeatedly eliminate the worst attribute

3) Best combined attribute selection and elimination:

        At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
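
The first method (step-wise forward selection) could look like the sketch below. The scoring criterion is an assumption, since the notes do not say how "best" is measured: here an attribute is best if adding it gives the lowest residual sum of squares of a least-squares fit. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def rss_of_fit(features, target):
    # Least-squares fit with an intercept; return the residual sum of squares
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sum((target - design @ coef) ** 2))

def forward_selection(X, y, n_select):
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        # Pick the attribute that helps most, given those already selected
        best = min(remaining, key=lambda j: rss_of_fit(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only columns 2 and 0 influence the target in this synthetic example
y = 4.0 * X[:, 2] - 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

chosen = forward_selection(X, y, n_select=2)
print(chosen)
```

Selecting 2 of 5 attributes this way takes 5 + 4 = 9 model fits, whereas scoring every possible subset would take 2^5 = 32.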

# 7) What is Greedy Search
A greedy algorithm is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage with the intent of finding a global optimum. In many problems, a greedy strategy does not usually produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time.

For example, a greedy strategy for the traveling salesman problem (which is of a high computational complexity) is the following heuristic: "At each step of the journey, visit the nearest unvisited city." This heuristic does not intend to find a best solution, but it terminates in a reasonable number of steps; finding an optimal solution to such a complex problem typically requires unreasonably many steps. In mathematical optimization, greedy algorithms optimally solve combinatorial problems having the properties of matroids, and give constant-factor approximations to optimization problems with submodular structure.
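
The nearest-unvisited-city heuristic can be sketched as follows. The four city coordinates are invented so that the greedy tour is visibly worse than the optimum, which is found here by brute force (feasible only because the instance is tiny). The function names are hypothetical.

```python
import math
from itertools import permutations

def nearest_neighbor_tour(cities, start=0):
    unvisited = set(range(len(cities)))
    unvisited.remove(start)
    tour = [start]
    while unvisited:
        current = cities[tour[-1]]
        # Greedy step: always move to the closest city not yet visited
        nearest = min(unvisited, key=lambda i: math.dist(current, cities[i]))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour

def tour_length(cities, tour):
    # Closed tour: the last city connects back to the first
    return sum(math.dist(cities[a], cities[b])
               for a, b in zip(tour, tour[1:] + tour[:1]))

cities = [(0.0, 0.0), (1.0, 0.0), (-1.1, 0.0), (3.2, 0.0)]

greedy_tour = nearest_neighbor_tour(cities)
greedy_length = tour_length(cities, greedy_tour)

# Brute force over all tours starting at city 0
optimal_length = min(tour_length(cities, [0, *perm])
                     for perm in permutations(range(1, len(cities))))

print(greedy_tour, round(greedy_length, 2), round(optimal_length, 2))
# [0, 1, 2, 3] 10.6 8.6
```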

# 8) What is Stratified Sampling
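Stratified sampling should not be confused with cluster sampling, which draws whole groups rather than sampling within each group.

Partition the data set into strata, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data from each).

It is typically used with skewed data, so that rare groups are still represented in the sample.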
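
A minimal sketch of proportional stratified sampling, using only the standard library. The function name, the `stratum_of` callback and the 90/10 toy data set are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=0):
    rng = random.Random(seed)
    # Partition the records by stratum
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    # Draw roughly the same percentage from every stratum (at least one record)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Skewed toy data: 90 "common" records and 10 "rare" ones
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
sample = stratified_sample(records, stratum_of=lambda r: r[0], fraction=0.2)

counts = {label: sum(1 for r in sample if r[0] == label)
          for label in ("common", "rare")}
print(counts)  # {'common': 18, 'rare': 2}
```

A plain random sample of 20 records could by chance contain no rare records at all; sampling within each stratum rules that out.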

# 9) Simple random sampling vs Sampling without replacement vs Sampling with replacement
1) Simple random sampling
       
       There is an equal probability of selecting any particular item
       
2) Sampling without replacement

        Once an object is selected, it is removed from the population
        
3) Sampling with replacement

        A selected object is not removed from the population
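
The difference between the two schemes can be shown with NumPy's random generator. The population, sample sizes and seed below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(10)

# Without replacement: each item can be drawn at most once
without_replacement = rng.choice(population, size=5, replace=False)

# With replacement: drawn items stay in the population, so repeats are possible
# (and unavoidable here, since 20 draws come from only 10 items)
with_replacement = rng.choice(population, size=20, replace=True)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Both calls are simple random sampling in the sense that every item has the same probability of being picked on each draw.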