# Artificial Neural Network

* Though the concept of artificial neural networks has existed since the 1950s, it is only recently that hardware has become capable enough to turn theory into practice. In principle, neural networks can approximate any continuous function. In practice, however, we are often stuck with networks that underperform or take a very long time to reach decent results. One should approach such problems statistically rather than relying on gut feeling about which changes to make to the architecture of the network. One of the first steps should be proper preprocessing of the data.



* Other than mean normalisation and scaling, Principal Component Analysis (PCA) may be useful in speeding up training. If the dimension of the data is reduced while most of the variance is still retained, one can save on space without compromising much on the quality of the data. Neural networks can also be trained faster when each example has fewer input features.



* Reduction in dimension can be achieved by decomposing the covariance matrix of the training data into three matrices using singular value decomposition (SVD). The columns of the first matrix are the eigenvectors of the covariance matrix. Furthermore, these vectors are orthonormal, hence they may be treated as basis vectors.



* We pick the first few vectors out of this matrix, the number being equal to the number of dimensions we wish to reduce the data to. Multiplying the original data matrix (with its original dimensions) by the matrix obtained in the previous step gives a new matrix, which is both reduced in dimension and linearly transformed.



* The above steps are mathematical in nature, but essentially we simply “projected” the data from the higher dimension to a lower dimension, similar to projecting points in a plane onto a well-fitting line in such a way that the distance each point has to “travel” is minimised.
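
The projection steps described above can be sketched in NumPy. This is a minimal illustration rather than a definitive implementation: the function name `pca_project`, the toy data, and the choice of `k` are assumptions made for the example.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n_samples x n_features) onto the top-k principal components."""
    # Mean-normalise each feature so the covariance matrix is centred.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Covariance matrix of the features (n_features x n_features).
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # SVD of the symmetric covariance matrix: the columns of U are orthonormal
    # eigenvectors and S holds the matching variances in descending order.
    U, S, _ = np.linalg.svd(cov)
    U_k = U[:, :k]                      # keep the first k basis vectors
    Z = Xc @ U_k                        # reduced data (n_samples x k)
    retained = S[:k].sum() / S.sum()    # fraction of variance retained
    return Z, U_k, retained

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 points lying close to a line in 3-D.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
Z, U_k, retained = pca_project(X, k=1)
print(Z.shape, round(float(retained), 3))
```

The `retained` ratio is one way to judge whether enough variance survives the reduction; a common practice is to pick the smallest `k` that keeps it above a chosen threshold.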



* One may use PCA for visualising the data by reducing it to 3D or 2D. However, a commonly recommended alternative for visualisation is t-distributed stochastic neighbour embedding (t-SNE), which, unlike PCA, is based on probability distributions over pairs of points. t-SNE tries to minimise the difference (measured by the Kullback-Leibler divergence) between the conditional probabilities in the higher and the reduced dimensions.


<img src="https://miro.medium.com/max/247/1*u92Cpw35fcyWaeCd1FhZcQ.png"/>


* The conditional probability is high for points close together (measured by their Euclidean distance) and low for the ones which are far apart. Points are grouped according to the obtained distribution. The variance is chosen such that points in dense areas are given a smaller variance compared to points in sparse areas.
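
As a rough sketch of the conditional probability described above, the snippet below assumes the standard Gaussian form, in which the similarity of point j to point i decays with their squared Euclidean distance. The per-point bandwidths `sigmas` are simply passed in here, which is an assumption for illustration; actual t-SNE implementations search for each bandwidth to match a target perplexity.

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Return P with P[i, j] = p(j | i) for the rows of X, given one sigma per point."""
    # Squared Euclidean distance between every pair of points.
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)            # guard against tiny negative round-off
    # Gaussian similarity centred on point i with its own bandwidth sigma_i.
    logits = -d2 / (2.0 * sigmas[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)   # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

# Two points close together and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
P = conditional_probabilities(X, sigmas=np.ones(3))
print(P.round(3))
```

Each row of `P` sums to one, and the nearby pair assigns almost all of its probability to each other.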




* Though George Cybenko proved in 1989 that neural networks with even a single hidden layer can approximate any continuous function, it may be desirable to introduce polynomial features of higher degree into the network in order to obtain better predictions. One might consider increasing the number of hidden layers. Loosely speaking, each additional layer raises the degree of the polynomial features the network can represent compactly. Though this could also be achieved by raising the number of neurons in the existing layers, that would require far more neurons (and hence more computation time) than adding hidden layers to the network, for approximating a function with a similar amount of error. On the other hand, making neural nets “deep” results in unstable gradients. The instability takes two forms, namely the vanishing and the exploding gradient problems.




* The weights of a neural network are generally initialised with random values drawn from a Gaussian distribution with mean 0 and standard deviation 1. This makes sure that most of the weights are between -1 and 1. The sigmoid function has a maximum derivative of 0.25 (when the input is zero). For weights in this limited range, the absolute value of the product of a weight and a sigmoid derivative is therefore also less than 0.25. The gradient of a perceptron comprises the product of many such terms, each being less than 0.25. The more layers that lie between a perceptron and the output, the more such terms appear in the product, resulting in the vanishing gradient issue.
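
To make the argument above concrete, consider the simplifying assumption of a chain with one sigmoid neuron per layer, where $z_l = w_l a_{l-1} + b_l$, $a_l = \sigma(z_l)$ and $C$ is the cost. These symbols are introduced here for illustration and do not appear in the original text.

$$
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4},
\qquad
\frac{\partial C}{\partial b_1}
= \sigma'(z_1)\, w_2\, \sigma'(z_2)\, w_3\, \sigma'(z_3) \cdots w_L\, \sigma'(z_L)\, \frac{\partial C}{\partial a_L}
$$

If every weight satisfies $|w_l| < 1$, then each factor $|w_l\,\sigma'(z_l)|$ is below $\tfrac{1}{4}$, so

$$
\left| \frac{\partial C}{\partial b_1} \right| \le \left(\tfrac{1}{4}\right)^{L} \left| \frac{\partial C}{\partial a_L} \right| ,
$$

which shrinks exponentially with the number of layers $L$.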





* Essentially, the gradient of a perceptron in an earlier hidden layer (closer to the input layer) is given by the sum of products of the gradients of the deeper layers and the weights assigned to each of the links between them. Hence, the earlier layers receive very small gradients. As a result, their weights change little during learning and become almost stagnant over time. The first layers are supposed to carry most of the information, yet they get trained the least. Hence, the problem of vanishing gradients eventually leads to the death of the network.
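
The effect can be observed numerically. The sketch below assumes a deliberately simplified network, a chain with one sigmoid neuron per layer, zero biases, and weights drawn from a standard Gaussian; the function name `chain_gradients` and the depth of 20 are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradients(weights, biases, x):
    """Return d(output)/d(z_l) for each layer l of a one-neuron-per-layer sigmoid chain."""
    # Forward pass: store each pre-activation z_l.
    a, zs = x, []
    for w, b in zip(weights, biases):
        z = w * a + b
        zs.append(z)
        a = sigmoid(z)
    # sigma'(z) = sigma(z) * (1 - sigma(z)), never larger than 0.25.
    dsig = [sigmoid(z) * (1.0 - sigmoid(z)) for z in zs]
    # Backward pass: delta_l = delta_{l+1} * w_{l+1} * sigma'(z_l).
    deltas = [0.0] * len(zs)
    deltas[-1] = dsig[-1]
    for l in range(len(zs) - 2, -1, -1):
        deltas[l] = deltas[l + 1] * weights[l + 1] * dsig[l]
    return np.array(deltas)

rng = np.random.default_rng(0)
n_layers = 20
weights = rng.normal(0.0, 1.0, size=n_layers)   # mean 0, standard deviation 1
biases = np.zeros(n_layers)
grads = chain_gradients(weights, biases, x=0.5)
for l in (0, 4, 9, 14, 19):
    print(f"layer {l + 1:2d}: |gradient| = {abs(grads[l]):.3e}")
```

Printing the magnitudes layer by layer shows how much smaller the gradient reaching the first layers is compared with the one at the last layer.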



* There might be circumstances in which the weights grow beyond one during training. In that case, one might wonder whether vanishing gradients could still create problems. Instead, this might lead to the exploding gradient problem, in which the gradients in the earlier layers become huge. This occurs if the weights are large and the biases are such that the product of each weight with the derivative of the sigmoid stays greater than one. On the other hand, that is a little difficult to achieve, because a larger weight tends to produce a larger input to the activation function, where the derivative of the sigmoid is very low. This also helps establish why the vanishing gradient issue is difficult to prevent. In order to address this problem, we choose other activation functions, avoiding sigmoid.


<img src="https://miro.medium.com/max/303/1*dVoDR24VRWfn3l7alQLWMQ.png"/>




* Though sigmoid is a popular choice because it squashes the input between zero and one, and because its derivative can be written as a function of the sigmoid itself, neural networks relying on it might suffer from unstable gradients. Moreover, the sigmoid outputs are not zero-centred; they are all positive. This means that the gradients with respect to all the weights feeding into a unit are either all positive or all negative, depending on the sign of the gradient flowing back from the next layer.



* A frequently recommended activation function is Maxout. Maxout maintains two (or more) sets of parameters, and the set which yields the highest pre-activation value is the one used as the output. Fixing the parameter sets in particular ways gives other familiar activations as special cases. One such case is the Leaky Rectified Linear Unit (Leaky ReLU): the gradient remains 1 when the input is greater than 0, and takes a small positive slope when the input is less than 0, so the output there is a small negative value proportional to the input.
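
A minimal sketch of both activations follows. The slope `alpha=0.01` and the parameter shapes are assumptions chosen for illustration. The example also shows that a Maxout unit with two parameter sets (the identity and a scaled identity) reproduces Leaky ReLU.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: slope 1 for z > 0, small slope alpha otherwise."""
    return np.where(z > 0, z, alpha * z)

def maxout(x, W, b):
    """Maxout layer. W has shape (k, n_in, n_out) and b has shape (k, n_out).
    Each of the k parameter sets gives a linear pre-activation; the largest one is the output."""
    z = np.einsum("i,kio->ko", x, W) + b   # shape (k, n_out)
    return z.max(axis=0)

x = np.array([-2.0, 0.5, 3.0])
# Two parameter sets: identity weights and identity weights scaled by alpha.
W = np.stack([np.eye(3), 0.01 * np.eye(3)])
b = np.zeros((2, 3))
print(maxout(x, W, b))   # max(z, 0.01 * z) for each component
print(leaky_relu(x))     # same values
```

In a trained Maxout layer the parameter sets are learned rather than fixed, so the unit can take other piecewise-linear shapes as well.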



* Another problem encountered in neural networks, especially deep ones, is internal covariate shift. The statistical distribution of the input to each layer keeps changing as training proceeds, because the parameters of the preceding layers keep changing. This can cause a significant change in the domain and hence reduce training efficiency. A solution to the problem is to perform normalisation for every mini-batch. We compute the mean and variance for each such batch, instead of for the entire data. The input is normalised before feeding it into almost every hidden layer. The process is commonly known as batch normalisation. Applying batch normalisation can assist in overcoming the issue of vanishing gradients as well.
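
The per-batch normalisation can be sketched as below for the training-time forward pass only. The learnable scale `gamma` and shift `beta` and the small constant `eps` are standard parts of batch normalisation that the text above does not mention; they are included here as labelled assumptions.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalise a mini-batch x of shape (batch_size, n_features), feature by feature."""
    mu = x.mean(axis=0)                       # mean of this mini-batch only
    var = x.var(axis=0)                       # variance of this mini-batch only
    x_hat = (x - mu) / np.sqrt(var + eps)     # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

rng = np.random.default_rng(0)
# Hypothetical mini-batch of 64 examples with 4 features, far from zero-centred.
x = rng.normal(5.0, 3.0, size=(64, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.var(axis=0).round(3))
```

At inference time, implementations typically replace the batch statistics with running averages collected during training; that part is omitted here.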



* Regularisation can be improved by implementing dropout. During training, randomly chosen nodes are switched off in some or all of the layers of the network. Hence, in every iteration we effectively train a different network, and the network obtained at the end of training behaves like a combination of all of them. This helps in addressing the problem of overfitting.
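
One common way to implement this is “inverted” dropout, sketched below. It is a variant chosen for this example and not necessarily the one the text has in mind; the names `dropout` and `keep_prob` are assumptions.

```python
import numpy as np

def dropout(a, keep_prob, rng, training=True):
    """Inverted dropout applied to the activations a of one layer."""
    if not training:
        return a                                  # no units are dropped at test time
    mask = rng.random(a.shape) < keep_prob        # True for units that stay on
    # Dividing by keep_prob keeps the expected activation unchanged.
    return a * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, keep_prob=0.8, rng=rng)
print(round(float(dropped.mean()), 3), np.unique(dropped))
```

A fresh mask is drawn on every call, so each training iteration sees a different thinned network.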



* Whatever tweaks are applied, one must always keep track of the percentage of dead neurons in the network, and adjust the learning rate accordingly.
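
A simple way to track this is sketched below, assuming ReLU-style activations where a “dead” neuron is one that outputs zero for every example in a batch. The function name and the toy activation matrix are hypothetical.

```python
import numpy as np

def dead_fraction(activations):
    """Fraction of units whose output is zero for every example in the batch.
    activations has shape (batch_size, n_units)."""
    dead = np.all(activations <= 0.0, axis=0)
    return float(dead.mean())

# Toy activations for 3 examples and 4 units; units 0 and 2 never fire.
acts = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.5, 0.0, 0.0],
])
print(dead_fraction(acts))  # → 0.5
```

If this fraction keeps growing during training, lowering the learning rate is one of the adjustments worth trying.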



* Certain diagnostics may be performed on the trained model to understand its behaviour better. Bias and variance are the two important quantities here. They can be assessed by plotting the value of the loss function (without regularisation) on the training and the cross-validation data sets against the number of training examples.



<img src="https://miro.medium.com/max/1742/1*jjp_6OHijUROG3Xs06mJWw.png"/>



* In the figure above, the curve in red represents the cross-validation data while blue marks the training data set. The first figure is roughly what is obtained when the architecture is suffering from high bias. It means the architecture is poor, hence it gives pretty high errors even on the training data set. Adding more features to the network (such as adding more hidden layers, and hence introducing polynomial features) could be useful. If the network is suffering from high variance, the trained parameters fit the training set well, but perform poorly when tested on “unseen” data (the validation or the test set). This could be because the model “over-fits” the training data. Getting more data could act as a fix. Reducing the number of hidden layers in the network might also be useful in this case. Playing with the regularisation parameter could help as well. Increasing its value could fix high variance, whereas decreasing it should assist in fixing high bias.
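
The data behind such curves can be produced as sketched below. To keep the example short and self-contained, a polynomial least-squares fit stands in for the neural network, which is an assumption of this sketch; the toy data set and the function name `learning_curve` are also hypothetical.

```python
import numpy as np

def learning_curve(x_train, y_train, x_cv, y_cv, degree, sizes):
    """Unregularised mean squared error on the first m training examples
    and on the cross-validation set, for each m in sizes."""
    train_err, cv_err = [], []
    for m in sizes:
        coeffs = np.polyfit(x_train[:m], y_train[:m], degree)
        train_pred = np.polyval(coeffs, x_train[:m])
        cv_pred = np.polyval(coeffs, x_cv)
        train_err.append(np.mean((train_pred - y_train[:m]) ** 2))
        cv_err.append(np.mean((cv_pred - y_cv) ** 2))
    return np.array(train_err), np.array(cv_err)

def target(x):
    return np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 200)
y_train = target(x_train) + 0.1 * rng.normal(size=200)
x_cv = rng.uniform(0, 1, 200)
y_cv = target(x_cv) + 0.1 * rng.normal(size=200)

sizes = [10, 20, 50, 100, 200]
# A straight line is too simple for a sine wave, so this is a high-bias model.
train_err, cv_err = learning_curve(x_train, y_train, x_cv, y_cv, degree=1, sizes=sizes)
for m, tr, cv in zip(sizes, train_err, cv_err):
    print(f"m={m:3d}  train={tr:.3f}  cv={cv:.3f}")
```

With the deliberately underpowered degree-1 model, both errors stay high however many examples are used, which is the high-bias pattern. Raising `degree` gives the model more capacity and changes the shape of the curves.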



* Though it has been observed that a large amount of training data can improve the performance of a network, getting a lot of data might be costly and time consuming. If the network is suffering from high bias or vanishing gradients, more data would be of no use. Hence simple diagnostics such as these should be applied first, as they tell us which step to take next.

# Challenges of Deep Learning

## Deep learning is opaque

* While decisions made by rule-based software can be traced back to the last if and else, the same can’t be said about machine learning and deep learning algorithms. This lack of transparency in deep learning is what we call the “black box” problem. Deep learning algorithms sift through millions of data points to find patterns and correlations that often go unnoticed by human experts. The decisions they make based on these findings often confound even the engineers who created them.



* This might not be a problem when deep learning is performing a trivial task where a wrong decision will cause little or no damage. But when it’s deciding the fate of a defendant in court or the medical treatment of a patient, mistakes can have more serious repercussions.



* “The transparency issue, as yet unsolved, is a potential liability when using deep learning for problem domains like financial trades or medical diagnosis, in which human users might like to understand how a given system made a given decision,” Marcus writes.


* Algorithmic bias is one of the problems stemming from the opacity of deep learning algorithms. Machine learning algorithms often inherit the biases of the training data they ingest, such as preferring to show higher-paying job ads to men rather than women, or preferring white skin over dark in adjudicating beauty contests. These problems are hard to debug in the development phase and often result in controversial news headlines when the deep-learning-powered software goes into production.


## Is deep learning doomed to fail?


* Certainly not. But it is bound for a reality check. “In general, deep learning is a perfectly fine way of optimizing a complex system for representing a mapping between inputs and outputs, given a sufficiently large data set,” Marcus writes.


* Deep learning must be acknowledged for what it is, a highly efficient technique for solving classification problems, which will perform well when it has enough training data and a test set that closely resembles the training data set.



* But it’s not a magic wand. If you don’t have enough training data, or when your test data differs greatly from your training data, or when you’re not solving a classification problem, then “deep learning becomes a square peg slammed into a round hole, a crude approximation when there must be a solution elsewhere.”



* Marcus also suggests in his paper that deep learning has to be combined with other technologies such as plain-old rule-based programming and other AI techniques such as reinforcement learning. Other experts such as Starmind’s Pascal Kaufmann propose neuroscience as the key to creating real AI that will be able to achieve human-like problem solving.



* “Deep learning is not likely to disappear, nor should it,” Marcus writes. “But five years into the field’s resurgence seems like a good moment for a critical reflection, on what deep learning has and has not been able to achieve.”




## Deep learning is data hungry



* “In a world with infinite data, and infinite computational resources, there might be little need for any other technique,” Marcus says in his paper. And therein lies the problem, because we don’t live in such a world.



* You can never give every possible labelled sample of a problem space to a deep learning algorithm. Therefore, it will have to generalize or interpolate between its previous samples in order to classify data it has never seen before such as a new image or sound that’s not contained in its dataset.



* “Deep learning currently lacks a mechanism for learning abstractions through explicit, verbal definition, and works best when there are thousands, millions or even billions of training examples,” Marcus writes.



* So what happens when a deep learning algorithm doesn’t have enough quality training data? It can fail spectacularly, such as mistaking a rifle for a helicopter, or humans for gorillas.



* The heavy reliance on precise and abundant data also makes deep learning algorithms vulnerable to spoofing. “Deep learning systems are quite good at some large fraction of a given domain, yet easily fooled,” Marcus says.



* Testament to this are many crazy stories, such as deep learning algorithms mistaking stop signs for speed limit signs after a little defacing, or British police software not being able to distinguish sand dunes from nudes.


## Deep learning is shallow


* Another problem with deep learning algorithms is that they’re very good at mapping inputs to outputs but not so good at understanding the context of the data they’re handling. In fact, the word “deep” in deep learning is much more a reference to the architecture of the technology and the number of hidden layers it contains than an allusion to any deep understanding of what it does. “The representations acquired by such networks don’t, for example, naturally apply to abstract concepts like ‘justice,’ ‘democracy’ or ‘meddling,’” Marcus writes.