# What is Deep Learning?
- Cut through the noise to differentiate between the press perception and world-changing developments.

- Questions to be tackled:
  - What has Deep Learning achieved so far?
  - How significant is it?
  - Where are we headed next?
  - Should we believe the hype?

## Artificial Intelligence, Machine Learning and Deep Learning

|![AI, ML and DL Venn Diagram](./images/AI_ML_DL_venn.png)|
|:----:|
|*Artificial Intelligence, Machine Learning and Deep Learning*|

<br><br>

|Artificial Intelligence|Machine Learning|Deep Learning|
|:----------------------|:---------------|:------------|
|<ul><li>Started in the 1950s</li><li>A concise definition: *The effort to automate intellectual tasks normally performed by humans.*</li><li>Superset of Machine Learning and Deep Learning; it also includes techniques that don't involve learning.</li><li>Symbolic AI:<ul><li>Handcrafting a large number of explicit rules to manipulate knowledge</li><li>Dominant from the 1950s to the 1980s</li><li>Peaked in popularity in the 1980s, during the *expert systems* boom</li><li>Difficult to use for fuzzy problems, like Image Classification and Speech Recognition.</li><li>Hence Machine Learning arose.</li></ul></li></ul>|<ul><li>Ada Lovelace's remark on the *Analytical Engine*: "The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform... Its province is to assist us in making available what we're already acquainted with."</li><li>Alan Turing, when introducing the *Turing Test*, quoted the above remark as "Lady Lovelace's Objection" and came to the conclusion that general-purpose computers are capable of learning and originality.</li><li>Machine Learning arises from the questions:<ul><li>Could a computer go beyond what we know how to order it to perform and learn on its own how to perform a specified task?</li><li>Could a computer surprise us?</li><li>Can it define rules on its own by looking at data?</li></ul><br><img alt="Classical Programming vs Machine Learning" src="./images/classicalprogramming-vs-ml.png" style="align:center"/></li><li>An ML system is trained rather than explicitly programmed.</li><li>It finds statistical structure in the data examples presented to it.</li><li>Tightly linked to mathematical statistics, but differs from statistics in several ways:<ul><li>ML deals with more complex data, in larger amounts, for which classical statistical analysis such as Bayesian analysis would be impractical.</li></ul></li></ul>|<ul><li>A specific subfield of Machine Learning</li><li>A better way to find hidden representations in the data.</li><li>Data representation is covered in the section below.</li><li>*Deep* in Deep Learning refers to learning the input-to-output mapping through successive layers of increasingly meaningful representations.</li><li>The number of layers in a model is called the *depth* of the model.</li><li>Other apt names for the field could have been:<ul><li>Layered Representations Learning</li><li>Hierarchical Representations Learning</li></ul></li><li>Other ML approaches focus on learning only one or two layers of representations, while Deep Learning models can have hundreds of layers. Hence, those other approaches are sometimes called *shallow learning*.</li><li>Most often, the layered representations of Deep Learning are learned using *neural networks*.</li><li>The term "neural network" comes from neurobiology, and many of its characteristics are drawn from "our understanding" of the human brain. But a neural network isn't a model of the human brain!</li><li>There is no evidence that the brain uses learning algorithms like those used by modern Deep Networks!</li></ul>|

### Data Representation Learning
- Here we will learn about the differences between Deep Learning and other Machine Learning algorithms.
- To do ML we need 3 things:
  - Input data points. Example: images of dogs and cats.
  - Examples of expected outputs. Example: labels stating whether each image shows a dog or a cat.
  - A way to measure whether the algorithm is doing a good job. This measures the distance between the algorithm's current output and the expected output, and it is the feedback signal used to adjust the algorithm.

- An ML model (algorithm) transforms its input into meaningful outputs.
- This transformation is "learned" from exposure to known examples of inputs and outputs.
- Central problem in ML and DL - *meaningfully transform data*, i.e. change the representation of the input data into something that is closer to the expected output.
- *Representation* - a different way to look at data.
- Example: a color image can be represented in the RGB or the HSV format.
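
As a small illustration of the RGB/HSV point, the sketch below re-expresses one color in both formats using Python's standard `colorsys` module. The color chosen (pure red) is just an assumed example; the data stays the same and only the representation changes.

```python
import colorsys

# Pure red, written as RGB components in the range [0, 1] (an assumed example).
red_rgb = (1.0, 0.0, 0.0)

# The same color re-expressed as (hue, saturation, value).
red_hsv = colorsys.rgb_to_hsv(*red_rgb)

# Converting back recovers the original triple: no information was lost,
# only the way of looking at the color changed.
round_trip = colorsys.hsv_to_rgb(*red_hsv)

print("RGB:", red_rgb)
print("HSV:", red_hsv)
print("back to RGB:", round_trip)
```

Which representation is "better" depends on the task: selecting all red pixels is simpler in RGB, whereas making an image less saturated is simpler in HSV.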

Let's take a concrete example!

![Data Representation Example](./images/data_representation_example.png)


Now we can use $x > 0$ to define the blue class and $x \le 0$ to define the red class.
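
The idea can be sketched in a few lines of Python. The points, labels and the 45-degree rotation below are made-up assumptions for illustration (they are not read from the figure); the sketch only shows that after a suitable coordinate change a rule on the new $x$ alone separates the two classes.

```python
import math

# Hypothetical toy data: (x, y, label). The classes are separated by the
# diagonal line y = x, so neither raw coordinate is enough on its own.
points = [
    (1.0, 3.0, "red"),
    (0.5, 2.0, "red"),
    (-2.0, -1.0, "red"),
    (3.0, 1.0, "blue"),
    (2.0, 0.5, "blue"),
    (-1.0, -2.0, "blue"),
]

def rotate(x, y, angle):
    """Coordinate change: express (x, y) in axes rotated by `angle` radians."""
    new_x = x * math.cos(angle) + y * math.sin(angle)
    new_y = -x * math.sin(angle) + y * math.cos(angle)
    return new_x, new_y

def classify(new_x):
    """The simple rule, applied in the new representation."""
    return "blue" if new_x > 0 else "red"

angle = -math.pi / 4  # assumed rotation that lines the new x axis up with the data
predictions = [classify(rotate(x, y, angle)[0]) for x, y, _ in points]
print(predictions)
```

Here the rotation was picked by hand; the rest of this section is about letting an algorithm find such a transformation on its own.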

- Machine Learning algorithms perform these kinds of transformations automatically!
- *Learning* in the context of Machine Learning stands for automatic search for better representations.
- Transformations performed by Machine Learning algorithms can be:
  - Coordinate Changes
  - Linear Projections (which may destroy some info.)
  - Translations
  - Nonlinear operations (example, select all points where x>0)

- ML algorithms are not very "creative" at finding the best transformations. They only search for the best combination of transformations within a predefined set of operations, called a *hypothesis space*.
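
A minimal sketch of such a search, under stated assumptions: the toy points are invented, the hypothesis space is restricted to rotations in 15-degree steps, and plain accuracy is used as the feedback signal. Real algorithms search far larger spaces with smarter strategies.

```python
import math

# Hypothetical toy data: (x, y, label).
points = [
    (1.0, 3.0, "red"), (0.5, 2.0, "red"), (-2.0, -1.0, "red"),
    (3.0, 1.0, "blue"), (2.0, 0.5, "blue"), (-1.0, -2.0, "blue"),
]

def accuracy(angle):
    """Fraction of points classified correctly by the rule `new_x > 0`
    after rotating the axes by `angle` radians."""
    correct = 0
    for x, y, label in points:
        new_x = x * math.cos(angle) + y * math.sin(angle)
        predicted = "blue" if new_x > 0 else "red"
        correct += predicted == label
    return correct / len(points)

# The hypothesis space: a fixed, predefined set of candidate transformations.
hypothesis_space = [math.radians(d) for d in range(-180, 180, 15)]

# "Learning" here is just a search guided by a feedback signal (accuracy).
best_angle = max(hypothesis_space, key=accuracy)
best_accuracy = accuracy(best_angle)
print(math.degrees(best_angle), best_accuracy)
```

The search can only ever return one of the candidate rotations; a transformation outside the hypothesis space is never considered, however good it might be.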

- So what makes Deep Learning special?
  - Deep Learning tries to find layered hidden representations, whereas other ML approaches only find shallow representations.
  - Shallow ML algorithms are not very good at finding the complex representations needed for data such as images and speech, whereas Deep Networks can reach near-human-level performance on such tasks. This is due to their layered representations!
  - Think of Neural Networks as a multi-stage information distillation operation, where information goes through successive filters and comes out increasingly *purified*!

![deep representations learnt by model](./images/deep_representations_learnt_by_model.png)

### Understanding How Deep Learning Works!
- Here we will see how Deep Networks map inputs to desired outputs using a deep sequence of simple data transformations (layers), learned by exposure to examples of input-target pairs.
- The transformation performed by a layer is *parameterized* by the *weights* it holds. *Weights* are a bunch of numbers, a.k.a. *parameters*.
- *Learning* means finding a set of values for the *weights* of all layers such that the network maps inputs correctly (read optimally) to their associated targets.

![](./images/nn_parameterized_by_weights.png)
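
To make "parameterized by weights" concrete, here is a minimal sketch of a two-layer network in plain Python. The layer sizes, the weight values and the ReLU non-linearity are all assumptions chosen for illustration; the takeaway is that the same code produces a different input-to-output mapping as soon as the numbers in `w1`, `b1`, `w2`, `b2` change.

```python
def dense_layer(inputs, weights, biases):
    """One layer: output[j] = relu(sum_i inputs[i] * weights[i][j] + biases[j])."""
    outputs = []
    for j in range(len(biases)):
        total = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(max(0.0, total))  # ReLU non-linearity (an assumed choice)
    return outputs

# Hypothetical weights for a 2-input -> 3-unit -> 1-unit network.
w1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -0.5]]
b1 = [0.0, 0.0, 1.0]
w2 = [[1.0], [1.0], [1.0]]
b2 = [0.0]

hidden = dense_layer([2.0, 1.0], w1, b1)   # first representation of the input
output = dense_layer(hidden, w2, b2)       # second representation: the prediction
print(hidden, output)
```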

- A Deep Network can contain tens of millions of parameters, and changing one of them affects the behavior of all the others, so finding the right set of weights can be a daunting task.

- To control the output of a neural network, we must have a sense of how far the current output is from the desired output. This distance is computed by the *loss function*, also called the *objective function*.

![](./images/loss_measures_quality_of_output.png)
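
As one possible example of a loss function, the sketch below uses mean squared error. This is an assumed choice (the notes do not name a specific loss), and the predictions and targets are invented values.

```python
def mean_squared_error(predictions, targets):
    """Average squared distance between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 0.0, 1.0]   # hypothetical true targets
far_off = [0.0, 1.0, 0.0]   # predictions of a poorly tuned network
close = [0.9, 0.1, 0.8]     # predictions of a better tuned network

loss_far = mean_squared_error(far_off, targets)
loss_close = mean_squared_error(close, targets)
print(loss_far, loss_close)
```

The better the predictions match the targets, the lower the loss score, which is exactly what makes it usable as a feedback signal.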

- The trick is to use the loss function's distance score as a feedback signal to adjust the weights a little, in a direction that will lower the loss score for the current example. This adjustment is the job of an *optimizer*, which implements the *Backpropagation* algorithm - the central algorithm in Deep Learning.

![](./images/loss_used_as_feedback_signal.png)

- Initially the weights are assigned random values, so naturally the output is far from the required one and the loss score is high.
- With each example shown, the network adjusts its weights a little to perform better on that example. Repeating this is called the *training loop*. The final result is a *trained network*, whose weights produce outputs that are as close as they can be to the targets.
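
The loop described above can be sketched for the smallest possible "network": a single weight `w` and bias `b`. Everything here is an assumption for illustration (the data rule `target = 2 * x + 1`, the learning rate, the number of passes), and the gradient is written out by hand; in a multi-layer network, Backpropagation is what computes these gradients.

```python
import random

random.seed(0)

# Hypothetical training pairs generated from the rule target = 2 * x + 1.
examples = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# Weights start out random, so the first predictions are far from the targets.
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)

def loss(w, b):
    """Mean squared error of the model w * x + b over all examples."""
    return sum((w * x + b - t) ** 2 for x, t in examples) / len(examples)

initial_loss = loss(w, b)
learning_rate = 0.05  # how big each adjustment is (assumed value)

for epoch in range(200):
    for x, target in examples:
        error = (w * x + b) - target
        # Gradient of error**2 is 2*error*x w.r.t. w and 2*error w.r.t. b.
        # Step a little in the direction that lowers the loss for this example.
        w -= learning_rate * 2.0 * error * x
        b -= learning_rate * 2.0 * error

final_loss = loss(w, b)
print(initial_loss, final_loss, w, b)
```

After training, `w` and `b` end up close to the values 2 and 1 that generated the data, even though the loop was never told about that rule.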

### What Deep Learning has achieved so far
- Near-human-level image classification
- Near-human-level speech recognition
- Near-human-level handwriting transcription
- Improved machine translation
- Improved text-to-speech conversion
- Digital assistants such as Google Now and Amazon Alexa
- Near-human-level autonomous driving
- Improved ad targeting, as used by Google, Baidu and Bing
- Improved search results on the web
- Ability to answer natural language questions
- Superhuman Go playing

### Don't Believe in Short-term hype
- While DL is making great strides, we shouldn't take talk of *human-level general intelligence* too seriously.
- When a technology fails to deliver on inflated expectations, research investment dries up, slowing progress for a long time.
- This has happened twice in the past: the first *AI winter* in the 1970s, and a second one in the 1990s, after *expert systems* proved expensive to maintain.

### The Promise of AI
- We've only started applying Deep Learning to many important problems for which it could prove transformative, from medical diagnosis to digital assistants.
- AI research is moving quickly, thanks to unprecedented levels of funding.
- But relatively little of this research has made its way into products that are used in the real world.
- Doctors, accountants and people in general don't use many AI systems in their daily lives. Yes, there are recommendation systems, digital assistants and other such things, but they are nowhere near the full potential of AI.
- Just as the impact of the internet was poorly understood in 1995, the impact of fully deployed AI is poorly understood today.
- In the not-so-distant future:
  - AI will be our assistant, maybe even our friend
  - It may educate our kids
  - It may watch over our health
  - It will deliver our groceries to our door
  - It will drive us from point A to point B
  - It will be our interface to an increasingly complex and information-intensive world.
  - It will help us move forward by helping all scientific fields make big breakthroughs.
- We may see a few setbacks, maybe even another AI winter! But we will get there eventually.
- Don't believe the short-term hype, but do believe in the long-term vision.

## Before Deep Learning: A brief history of Machine Learning
- Deep Learning isn't the first successful approach to ML.
- It isn't used in every industrial ML application today, and there can be many reasons for that:
  - Not enough data
  - The problem is better solved by a different algorithm
- Don't treat Deep Learning as the only hammer for every Machine Learning problem.
- Get familiar with other approaches and practice them when apt.
- Below is a brief walkthrough of some classical ML algorithms.


### Probabilistic Modeling
- Application of the principles of statistics to data analysis.
- One of the earliest forms of ML
- Example: Naive Bayes
  - Applies Bayes' theorem for classification, while assuming that the features of the input are all independent (a "naive" assumption)
  - This form of analysis predates computers
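
A minimal sketch of Naive Bayes on categorical data follows. The tiny weather dataset is invented, and the add-one (Laplace) smoothing is an extra assumption that is hard-coded for features with exactly two possible values; it is not part of the description above.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset: (weather, wind) -> whether a game was played.
training_data = [
    (("sunny", "weak"), "yes"),
    (("sunny", "strong"), "no"),
    (("rainy", "strong"), "no"),
    (("rainy", "weak"), "yes"),
    (("sunny", "weak"), "yes"),
    (("rainy", "strong"), "no"),
]

class_counts = Counter(label for _, label in training_data)

# feature_counts[(feature position, label)][value] = number of occurrences
feature_counts = defaultdict(Counter)
for features, label in training_data:
    for position, value in enumerate(features):
        feature_counts[(position, label)][value] += 1

def posterior_scores(features):
    """Unnormalized P(label | features) = P(label) * prod_i P(feature_i | label).
    Multiplying per-feature probabilities is the "naive" independence assumption."""
    scores = {}
    for label, count in class_counts.items():
        score = count / len(training_data)  # prior P(label)
        for position, value in enumerate(features):
            seen = feature_counts[(position, label)][value]
            # Add-one smoothing; the "+ 2" assumes two possible values per feature.
            score *= (seen + 1) / (count + 2)
        scores[label] = score
    return scores

def predict(features):
    scores = posterior_scores(features)
    return max(scores, key=scores.get)

print(predict(("sunny", "weak")), predict(("rainy", "strong")))
```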

- A closely related model - *Logistic Regression*, a.k.a. the "Hello World" of ML
  - It also predates computers
  - Often the first thing a Data Scientist tries on a dataset
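
For reference, here is a minimal logistic regression sketch in plain Python, trained with batch gradient descent. The "hours studied" data, the learning rate and the step count are assumptions made up for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D data: hours studied -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.1  # assumed value

for step in range(5000):
    grad_w = grad_b = 0.0
    for x, y in zip(hours, passed):
        p = sigmoid(w * x + b)
        # Gradient of the log loss for one example is (p - y) times the input.
        grad_w += (p - y) * x
        grad_b += (p - y)
    w -= learning_rate * grad_w / len(hours)
    b -= learning_rate * grad_b / len(hours)

def predict_proba(x):
    """Estimated probability of passing after `x` hours of study."""
    return sigmoid(w * x + b)

print(predict_proba(1.0), predict_proba(4.0))
```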

### Early Neural Networks
- Although the core ideas of NNs were investigated in toy forms as early as the 1950s, the approach took decades to get started.
- The missing piece for a long time was an efficient way to train them. This changed in the 1980s, when multiple people independently rediscovered the Backpropagation algorithm.
- The first successful practical application of NNs came in 1989 from Bell Labs, when Yann LeCun combined the earlier ideas of convolutional NNs and backprop and applied them to the problem of classifying handwritten digits. The resulting network was called LeNet.
- It was used by the US Postal Service in the 1990s to automate the reading of ZIP codes on mail envelopes.


### Kernel Methods
- A new approach rose to fame in the 1990s: *kernel methods*.
- *Kernel methods* are a group of classification algorithms, the best known of which is the *Support Vector Machine (SVM)*.
- SVMs aim at solving classification problems by finding good *decision boundaries* between two sets of points belonging to two different classes.

![](https://predictiveprogrammer.com/wp-content/uploads/2019/03/svm_rbf_kernel.jpg)

- To classify new points, just check which side of the boundary they fall on.
- SVMs find the boundary in two steps:
  - The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane.
  - A good decision boundary is computed by trying to maximize the distance between the hyperplane and the closest data points from each class, a step called *maximizing the margin*. This allows the boundary to generalize well to new samples outside of the training dataset.
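
The notion of margin can be illustrated without a full SVM solver. The sketch below only compares two hand-picked candidate hyperplanes on invented 2-D points and keeps the one with the larger margin; an actual SVM searches over all hyperplanes with an optimization procedure that is not shown here.

```python
import math

# Hypothetical 2-D points from two classes.
class_a = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
class_b = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]

def signed_distance(point, w, b):
    """Signed distance from `point` to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def margin(w, b):
    """Smallest distance from the hyperplane to any training point.
    Negative if some point lies on the wrong side."""
    distances = [-signed_distance(p, w, b) for p in class_a]
    distances += [signed_distance(p, w, b) for p in class_b]
    return min(distances)

# Two candidate boundaries that both separate the classes perfectly.
candidates = {
    "x + y = 5.5": ((1.0, 1.0), -5.5),
    "x = 2.5": ((1.0, 0.0), -2.5),
}
margins = {name: margin(w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```

Both candidates classify the training points correctly, but the one with the larger margin leaves more room for new points that land slightly off the training data.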

- Mapping data to higher dimensions looks good on paper, but in practice it's often computationally intractable. That's where the *kernel trick* comes in.

- Here's the gist of the *kernel trick*:
  - To search for a good decision hyperplane in the new representation space, we don't need to explicitly compute the coordinates of every point in that space; we just need the distance between pairs of points in that space, which can be computed efficiently using a *kernel function*.
  - A *kernel function* is a computationally tractable operation that maps any pair of points in the original space to the distance between them in the target representation space.
  - Kernel functions are typically crafted by hand rather than learned from data.
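
A small numerical check of the trick, assuming a degree-2 polynomial kernel (one common hand-crafted kernel, chosen here only as an example). Strictly speaking, a kernel returns the inner product of two mapped points; the distance between them then follows from $\lVert\phi(p)-\phi(q)\rVert^2 = k(p,p) - 2k(p,q) + k(q,q)$, so the mapped coordinates are never needed.

```python
import math

def explicit_map(p):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(p, q):
    """Homogeneous polynomial kernel of degree 2: k(p, q) = (p . q) ** 2."""
    return dot(p, q) ** 2

def kernel_distance(p, q):
    """Distance between the mapped points, computed without mapping them."""
    squared = poly_kernel(p, p) - 2.0 * poly_kernel(p, q) + poly_kernel(q, q)
    return math.sqrt(max(0.0, squared))

p, q = (1.0, 2.0), (3.0, -1.0)  # arbitrary example points

explicit = math.dist(explicit_map(p), explicit_map(q))  # the expensive route
implicit = kernel_distance(p, q)                        # the kernel trick
print(explicit, implicit)
```

For this tiny map the saving is negligible, but the same identity holds for kernels whose feature space is very high-dimensional, which is where the trick pays off.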

- At that time SVMs:
  - Gave state-of-the-art (SOTA) results on simple classification problems
  - Were backed by extensive theory and mathematical analysis, making them well understood and easily interpretable
  - Proved hard to scale to large datasets
  - Didn't give good results on perceptual problems like image classification
  - Required special, hand-crafted feature engineering for perceptual problems, which is difficult and brittle.

### Decision Trees, Random Forests & Gradient Boosting Machines
- Decision Trees:
  - Flowchart-like structures that let us classify input data points
  - Easy to visualize and interpret
  - By the 2010s they were often preferred to kernel methods

![](./images/decision_tree.png)
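
The flowchart structure maps directly onto a nested data structure. The tree below is written by hand with invented features and thresholds (it is not the tree from the figure); a learned decision tree has the same shape, except that the questions and thresholds are chosen from data.

```python
# Each internal node asks "is sample[feature] > threshold?"; leaves are labels.
tree = {
    "question": ("weight_kg", 8.0),
    "yes": "dog",
    "no": {
        "question": ("ear_length_cm", 6.0),
        "yes": "dog",
        "no": "cat",
    },
}

def classify(sample, node=tree):
    """Follow the flowchart from the root until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        feature, threshold = node["question"]
        node = node["yes"] if sample[feature] > threshold else node["no"]
    return node

big_animal = {"weight_kg": 30.0, "ear_length_cm": 10.0}
small_animal = {"weight_kg": 4.0, "ear_length_cm": 4.0}
print(classify(big_animal), classify(small_animal))
```

Because a prediction is just the list of questions answered along one path, it is easy to explain why a given sample received its label.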

- Random Forest:
  - A robust, practical take on Decision Tree learning
  - Builds a large number of specialized trees and ensembles their outputs
  - Almost always the second-best algorithm for any shallow ML task
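
A heavily simplified sketch of the ensembling idea: each "tree" here is only a one-question stump, trained on a bootstrap sample and a randomly chosen feature, and the forest predicts by majority vote. The data, the number of trees and the stump learner are all assumptions for illustration; real random forests grow full trees.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical data: ((feature_0, feature_1), label).
data = [((1.0, 5.0), "a"), ((2.0, 4.0), "a"), ((3.0, 6.0), "a"),
        ((6.0, 1.0), "b"), ((7.0, 2.0), "b"), ((8.0, 0.0), "b")]

def train_stump(sample, feature):
    """Fit a one-question tree: the best threshold on a single feature."""
    best = None
    for values, _ in sample:
        threshold = values[feature]
        for low, high in (("a", "b"), ("b", "a")):
            errors = sum(
                (low if v[feature] <= threshold else high) != label
                for v, label in sample
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, low, high)
    return best[1:]

def stump_predict(stump, values):
    feature, threshold, low, high = stump
    return low if values[feature] <= threshold else high

forest = []
for _ in range(25):
    bootstrap = [random.choice(data) for _ in data]  # sample with replacement
    feature = random.randrange(2)                    # random feature per tree
    forest.append(train_stump(bootstrap, feature))

def forest_predict(values):
    """Ensemble the trees' outputs by majority vote."""
    votes = Counter(stump_predict(stump, values) for stump in forest)
    return votes.most_common(1)[0][0]

print(forest_predict((0.5, 7.0)), forest_predict((9.0, -1.0)))
```

Individual stumps trained on an unlucky sample can be wrong, but the vote across many differently trained trees is much more stable, which is the source of the robustness mentioned above.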

- Gradient Boosting Machines:
  - Like a random forest, it is a technique that ensembles weak prediction models, generally Decision Trees.
  - It uses *gradient boosting*, a way to improve any ML model by iteratively training new models that specialize in addressing the weak points of the previous models.
  - Applied to Decision Trees, gradient boosting results in models that outperform random forests most of the time, while having similar properties.
  - It may be one of the best algorithms today for non-perceptual tasks.
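
The "address the weak points of the previous models" idea can be sketched for regression with squared error, where the weak points are simply the residuals. The data, the learning rate, the number of rounds and the use of one-split stumps are assumptions for illustration; libraries such as XGBoost implement far more refined versions.

```python
# Hypothetical 1-D regression data (xs sorted, all distinct).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

def fit_stump(xs, residuals):
    """Best single-split regressor: predict the mean residual on each side."""
    best = None
    for threshold in xs[:-1]:  # keeps both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def mse(predictions):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

learning_rate = 0.5                      # shrinkage applied to each new model
baseline = sum(ys) / len(ys)             # start from a constant prediction
predictions = [baseline] * len(xs)
initial_error = mse(predictions)
ensemble = []

for _ in range(20):
    # For squared error, the negative gradient is simply the residual.
    residuals = [y - p for y, p in zip(ys, predictions)]
    stump = fit_stump(xs, residuals)     # new model targets the current mistakes
    ensemble.append(stump)
    predictions = [p + learning_rate * stump_predict(stump, x)
                   for p, x in zip(predictions, xs)]

final_error = mse(predictions)
print(initial_error, final_error)
```

Unlike the random forest, where trees are trained independently and then averaged, each model here depends on the errors left by all the models before it.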

### Back to Neural Networks
- Although NNs had been almost completely shunned by the wider community, a few groups kept working on them and started making breakthroughs:
  - Geoffrey Hinton at the University of Toronto
  - Yoshua Bengio at the University of Montreal
  - Yann LeCun at New York University
  - IDSIA in Switzerland
- In 2011, Dan Ciresan from IDSIA began to win academic image classification competitions using GPU-trained Deep NNs.
- In 2012, a team led by Alex Krizhevsky and advised by Geoffrey Hinton brought the top-5 accuracy in the ImageNet Challenge to 83.6%, up from 74.3% in 2011 using conventional computer vision (CV) methods.
- By 2015, the winners reached an accuracy of 96.4%, and the classification task on ImageNet was considered a completely solved problem. (All of this using Convolutional Neural Networks.)
- For several years the European Organization for Nuclear Research, CERN, used decision-tree-based methods for the analysis of particle data from the ATLAS detector at the Large Hadron Collider (LHC); but CERN eventually switched to Keras-based NNs due to their higher performance and ease of training on large datasets.

### What Makes Deep Learning Different?
- Primary reason it took off - better performance on many problems.
- Other big reason - it automates what used to be a major step in the ML workflow: *feature engineering*.
- With DL, the model learns all the features in one pass, rather than us having to engineer them by hand.

- Then why not apply shallow models successively to emulate Deep NNs?
  - In practice, there are fast diminishing returns to successive applications of shallow models, because *the optimal first representation layer in a three-layer model isn't the optimal first layer in a one-layer or two-layer model.*
  - Deep Learning allows a model to learn all layers of representation jointly, at the same time, rather than in succession.
  - When one internal feature is adjusted, all the features that depend on it adapt automatically.
- DL learns from data in an incremental, layer-by-layer way in which increasingly complex representations are developed. Since these intermediate representations are learned jointly, each layer is updated to accommodate the needs of both the layer above and the layer below.

- Current landscape: gradient boosting machines (e.g. XGBoost) for non-perceptual tasks, DL for perceptual tasks.

## Why Deep Learning? Why now?
- Many of the key ideas behind current DL algorithms were already known two decades ago, so why the hype now?
- Three technical forces are driving advances in ML today:
  - Hardware: CPUs and GPUs have become much faster
  - Datasets and benchmarks: the rise of the internet provided us with a lot of data to train these models.
  - Algorithmic advances: better *activation functions*, *weight-initialization schemes* and *optimization schemes*