![image.png](attachment:image.png)

# What is Deep Learning?

This Lecture Covers
- High-level definitions of fundamental concepts
- Timeline of the development of machine learning
- Key factors behind deep learning’s rising popularity and future potential

## Artificial intelligence, machine learning, and deep learning
![image-2.png](attachment:image-2.png)

### Artificial intelligence
- Alan Turing.
    - _Computing Machinery and Intelligence_.

![image.png](attachment:image.png)

In 1956, John McCarthy:

_The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it. An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer._


- AI can be described as the effort to automate intellectual tasks normally performed by humans.
- As such, AI is a general field that encompasses machine learning and deep learning, but that also includes many more approaches that may not involve any learning.
<!-- Consider that until the 1980s, most AI textbooks didn’t mention “learning” at all! Early chess programs, for instance, only involved hardcoded rules crafted by programmers, and didn’t qualify as machine learning. In fact, for a fairly long time, most experts believed that human-level artificial intelligence could be achieved by having programmers handcraft a  sufficiently large set of explicit rules for manipulating knowledge stored in explicit databases. -->

- _Symbolic AI_: Human-level artificial intelligence could be achieved by having programmers handcraft a sufficiently large set of explicit rules for manipulating knowledge stored in explicit databases.
- It was the dominant paradigm in AI from the 1950s to the late 1980s, and it reached its peak popularity during the expert systems boom of the 1980s.
- Although symbolic AI proved suitable for solving well-defined, logical problems, such as playing chess, it turned out to be intractable to figure out explicit rules for solving more complex, fuzzy problems, such as image classification, speech recognition, or natural language translation. A new approach arose to take symbolic AI’s place: **machine learning**.

### Machine Learning
Charles Babbage: _The Analytical Engine_.
- It was a way to use mechanical operations to automate certain computations from the field of mathematical analysis
- Designed in the 1830s and 1840s
- Visionary and far ahead of its time.
- The concept of general-purpose computation was yet to be invented.

In 1843, Ada Lovelace remarked on the invention of the Analytical Engine:

_The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. . . . Its province is to assist us in making available what we're already acquainted with._

Even with 178 years of historical perspective, Lady Lovelace’s observation remains arresting:
- Could a general-purpose computer “originate” anything, or would it always be bound to dully execute processes we humans fully understand?
- Could it ever be capable of any original thought?
- Could it learn from experience? Could it show creativity?

- The usual way to make a computer do useful work is to have a human programmer write down rules—a computer program—to be followed to turn input data into appropriate answers.
-  Machine learning turns this around: the machine looks at the input data and the corresponding answers, and figures out what the rules should be.
- A machine learning system is trained rather than explicitly programmed.
- It's presented with many examples relevant to a task, and it finds statistical structure in these examples that eventually allows the system to come up with rules for automating the task.
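As a minimal sketch of this contrast, the snippet below first writes a rule by hand and then recovers the same rule from input/answer pairs alone. The temperature-conversion task, the function names, and the least-squares fit are illustrative assumptions, not part of the lecture.

```python
def handwritten_rule(celsius):
    """Classical programming: a human writes the rule explicitly."""
    return celsius * 9 / 5 + 32


def learn_linear_rule(inputs, answers):
    """Machine learning: recover slope and intercept from examples only.

    Uses an ordinary least-squares fit of answers ~ slope * input + intercept.
    """
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(answers) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, answers))
    var = sum((x - mean_x) ** 2 for x in inputs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Input data and the corresponding answers (Celsius -> Fahrenheit pairs).
inputs = [-40.0, 0.0, 37.0, 100.0]
answers = [-40.0, 32.0, 98.6, 212.0]

slope, intercept = learn_linear_rule(inputs, answers)
print(slope, intercept)  # close to 1.8 and 32, the hand-written rule
```

The learned `slope` and `intercept` play the role of the "rules" that the system comes up with from examples.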

![image.png](attachment:image.png)

- ML started to flourish in the 1990s.
- Most popular and most successful subfield of AI.

Driven by:
1. Availability of faster hardware.
1. Larger datasets.

- Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset of millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical.

- Machine learning, and especially deep learning, exhibits comparatively little mathematical theory -- maybe too little -- and is fundamentally an engineering discipline.
- Unlike theoretical physics or mathematics, machine learning is a very _hands-on_ field driven by empirical findings and **deeply reliant on advances in software and hardware**.

### Learning rules and representations from data
1. Input data points.
1. Examples of the expected output.
1. A way to measure whether the algorithm is doing a good job—This is necessary in order to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call learning.

- The central problem in machine learning and deep learning is to meaningfully transform data.
- To learn useful representations of the input data at hand—representations that get us closer to the expected output.

- What’s a representation?
- A different way to look at data -- to represent or encode data.
- A color image can be encoded in the RGB format (red-green-blue) or in the HSV format (hue-saturation-value).
- Some tasks that may be difficult with one representation can become easy with another.
- Machine learning models are all about finding appropriate representations for their input data—transformations of the data that make it more amenable to the task at hand.
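The RGB/HSV point can be made concrete with Python's standard `colorsys` module. This is only a sketch: the pixel value, the thresholds in `is_reddish`, and both helper names are assumptions chosen for illustration.

```python
import colorsys


def is_reddish(rgb):
    """Easy in RGB: compare the red channel against the other two."""
    r, g, b = rgb
    return r > 0.6 and g < 0.4 and b < 0.4


def desaturate(rgb, factor=0.5):
    """Easy in HSV: scale the saturation channel, leave hue and value alone."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    return colorsys.hsv_to_rgb(h, s * factor, v)


pixel = (0.9, 0.2, 0.1)       # hypothetical pixel, channels in [0, 1]
print(is_reddish(pixel))      # True
print(desaturate(pixel, 0.0)) # saturation 0 gives a gray with the same value
```

The same pixel is stored two ways; which encoding is "better" depends entirely on the task.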

- The inputs are the coordinates of our points.
- The expected outputs are the colors of our points.
- A way to measure whether our algorithm is doing a good job could be, for instance, the percentage of points that are correctly classified.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

- Black points are such that x > 0.
- White points are such that x < 0.
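The three ingredients (inputs, expected outputs, and a measure of success) can be sketched on a toy version of this example. The eight points below are hypothetical stand-ins for the figure's data, and the 45-degree rotation is a hand-picked coordinate change that happens to suit them.

```python
import math

# Hypothetical sample data: (x, y) coordinates and their colors.
points = [
    ((1.0, 0.5), "black"), ((-0.5, 2.0), "black"),
    ((2.0, -1.0), "black"), ((0.5, 0.5), "black"),
    ((-1.0, -0.5), "white"), ((0.5, -2.0), "white"),
    ((-2.0, 1.0), "white"), ((-0.5, -0.5), "white"),
]


def rotate(point, degrees):
    """Coordinate change: rotate a point around the origin."""
    a = math.radians(degrees)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def rule(point):
    """The simple rule from the text: black if x > 0, white otherwise."""
    return "black" if point[0] > 0 else "white"


def accuracy(transform):
    """Feedback signal: fraction of points whose color is predicted correctly."""
    hits = sum(rule(transform(p)) == color for p, color in points)
    return hits / len(points)


raw_accuracy = accuracy(lambda p: p)                     # rule on raw coordinates
rotated_accuracy = accuracy(lambda p: rotate(p, -45.0))  # rule after the coordinate change
print(raw_accuracy, rotated_accuracy)  # prints 0.75 1.0
```

On the raw coordinates the rule misclassifies two points; after the change of coordinates it classifies all of them correctly.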

- We used our human intelligence to come up with our own appropriate representation of the data.
- Fine for such an extremely simple problem.
- Could you do the same if the task were to:
    - Classify images of handwritten digits? Could you write down explicit, computer-executable image transformations that would illuminate the difference between a 6 and an 8, between a 1 and a 7, across all kinds of different handwriting?
    - To an extent, yes: rules based on hand-crafted representations such as "number of closed loops" or "vertical and horizontal pixel histograms" can do a decent job of telling apart handwritten digits.
    - But finding such useful representations by hand is hard work, and, as you can imagine, the resulting rule-based system is brittle -- a nightmare to maintain.
    - Every time you come across a new example of handwriting that breaks your carefully thought-out rules, you will have to add new data transformations and new rules, while taking into account their interaction with every previous rule.

<!--
- Can we automate this process?
- Learning, in the context of machine learning, describes an automatic search process for data transformations that produce useful representations of some data, guided by some feedback signal—representations that are amenable to simpler rules solving the task at hand.
-->

- ML algorithms aren’t usually creative in finding these transformations.
- They’re merely searching through a predefined set of operations, called a __hypothesis space__.
- For instance, the space of all possible coordinate changes would be our hypothesis space in the 2D coordinates classification example.
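A hedged sketch of such a search: the hypothesis space here is restricted to rotations in 15-degree steps (an assumption made to keep the example small), the labeled points are hypothetical, and classification accuracy serves as the feedback signal.

```python
import math

# Hypothetical labeled points in the plane.
points = [
    ((1.0, 0.5), "black"), ((-0.5, 2.0), "black"),
    ((2.0, -1.0), "black"), ((0.5, 0.5), "black"),
    ((-1.0, -0.5), "white"), ((0.5, -2.0), "white"),
    ((-2.0, 1.0), "white"), ((-0.5, -0.5), "white"),
]


def accuracy_after_rotation(degrees):
    """Rotate every point, apply the rule 'black if new x > 0', and score it."""
    a = math.radians(degrees)
    hits = 0
    for (x, y), color in points:
        new_x = x * math.cos(a) - y * math.sin(a)
        predicted = "black" if new_x > 0 else "white"
        hits += predicted == color
    return hits / len(points)


# The predefined set of candidate transformations: the hypothesis space.
hypothesis_space = range(0, 360, 15)

# "Learning" here is a plain search guided by the feedback signal.
best_angle = max(hypothesis_space, key=accuracy_after_rotation)
best_accuracy = accuracy_after_rotation(best_angle)
print(best_angle, best_accuracy)
```

Nothing creative happens: the search can only return a rotation that was already in the candidate set.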

- Now that you understand what we mean by learning, let’s take a look at what makes deep learning special.

### The "Deep" in "Deep Learning"

<!-- - DL is a specific subfield of ML:
    - Emphasis on learning successive layers of increasingly meaningful representations.
- The "deep" in "deep learning": successive layers of representations.
- How many layers contribute to a model of the data is called the depth of the model.
- Other appropriate names for the field could have been layered representations learning or hierarchical representations learning.
- Modern deep learning often involves tens or even hundreds of successive layers of representations, and they’re all learned automatically from exposure to training data. Meanwhile, other approaches to machine learning tend to focus on learning
- only one or two layers of representations of the data (say, taking a pixel histogram and then applying a classification rule); hence, they’re sometimes called shallow learning. -->

![image-3.png](attachment:image-3.png)



![image.png](attachment:image.png)

- You can think of a deep network as a multistage information-distillation process, where information goes through successive filters and comes out increasingly purified (that is, useful with regard to some task).


### Understanding how deep learning works
<!-- 
- ML is about mapping inputs (such as images) to targets (such as the label “cat”).
- Which is done by observing many examples of input and targets.
- You also know that deep neural networks do this input-to-target mapping via a deep sequence of simple data transformations (layers).
- Data transformations are learned by exposure to examples. Now let’s look at how this learning happens, concretely.
 -->

![image.png](attachment:image.png)




<!--
- To control the output of a neural network, you need to be able to measure how far this output is from what you expected. This is the job of the loss function of the network, also sometimes called the objective function or cost function.
-->

![image.png](attachment:image.png)



- Initially, the weights of the network are assigned random values, so the network merely implements a series of random transformations.
- Naturally, its output is far from what it should ideally be, and the loss score is accordingly very high.
- But with every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases.
- This is the training loop, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function.
- A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network.
- Once again, it’s a simple mechanism that, once scaled, ends up looking like magic.
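The loop described above can be sketched with a "network" that has just one weight and one bias. The target relationship `y = 2x + 1`, the learning rate, and the number of passes are assumptions for illustration; real networks have many more parameters but follow the same pattern.

```python
import random

random.seed(0)

# Hypothetical examples: inputs paired with their expected outputs.
examples = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

# Start from random weights, so the initial predictions are far off.
w = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)


def loss(w, b):
    """Mean squared distance between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = loss(w, b)
learning_rate = 0.05

for epoch in range(50):                     # tens of passes over the examples
    for x, y in examples:
        error = (w * x + b) - y             # how far the output is from the target
        w -= learning_rate * 2 * error * x  # nudge each weight against its gradient
        b -= learning_rate * 2 * error

final_loss = loss(w, b)
print(initial_loss, final_loss)
print(w, b)  # should end up close to 2 and 1
```

Each update is tiny, but repeated over the examples it drives the loss toward its minimum.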

### What deep learning has achieved so far

- Near-human-level image classification.
- Near-human-level speech transcription.
- Near-human-level handwriting transcription.
- Dramatically improved machine translation.
- Dramatically improved text-to-speech conversion.
- Digital assistants such as Google Assistant and Amazon Alexa.
- Near-human-level autonomous driving.
- Improved ad targeting, as used by Google, Baidu, or Bing.
- Improved search results on the web.
- Ability to answer natural language questions.
- Superhuman Go playing.


- Automatically transcribing the tens of thousands of ancient manuscripts held in the Vatican’s Apostolic Archive.
- Detecting and classifying plant diseases in fields using a simple smartphone.
- Assisting oncologists or radiologists with interpreting medical imaging data.
- Predicting natural disasters such as floods, hurricanes, or even earthquakes, and so on.
- With every milestone, we’re getting closer to an age where deep learning assists us in every activity and every field of human endeavor—science, medicine, manufacturing, energy, transportation, software development, agriculture, and even artistic creation.

### Don’t believe the short-term hype
Although some world-changing applications like autonomous cars are already within reach, many more are likely to remain elusive for a long time, such as:
- Believable dialogue systems.
- Human-level machine translation across arbitrary languages.
- Human-level natural language understanding.
- In particular, talk of human-level general intelligence shouldn’t be taken too seriously.
    - The risk with high expectations for the short term is that, as technology fails to deliver, research investment will dry up, slowing progress for a long time.

- Twice in the past, AI went through a cycle of intense optimism followed by disappointment and skepticism, with a dearth of funding as a result.
- The first cycle concerned symbolic AI in the 1960s.
- Marvin Minsky claimed in 1967: "_Within a generation . . . the problem of creating ‘artificial intelligence’ will substantially be solved._"

- Three years later, in 1970, he made a more precisely quantified prediction: "_In from three to eight years we will have a machine with the general intelligence of an average human being._"
- **In 2021 such an achievement still appears to be far in the future—so far that we have no way to predict how long it will take**.
- In the 1960s and early 1970s, several experts believed it to be right around the corner (as do many people today).
- A few years later, as these high expectations failed to materialize, researchers and government funds turned away from the field, marking the start of the first AI winter (a reference to a nuclear winter, because this was shortly after the height of the Cold War).

- In the 1980s, a new take on symbolic AI, expert systems, started gathering steam among large companies. A few initial success stories triggered a **wave of investment**, with corporations around the world starting their own in-house AI departments to develop expert systems.
- Around 1985, companies were spending over 1 billion USD each year on the technology.
- By the early 1990s, these systems had proven expensive to maintain, difficult to scale, and limited in scope, and interest died down.
- Thus began the second AI winter.

- We may be currently witnessing the third cycle of AI hype and disappointment, and we’re still in the phase of intense optimism.
- It’s best to moderate our expectations for the short term and make sure people less familiar with the technical side of the field have a clear idea of what deep learning can and can’t deliver.

### The promise of AI
- AI research has been moving forward amazingly quickly in the past ten years, in large part due to a level of funding never before seen in the short history of AI.
- Most of the research findings of deep learning aren’t yet applied, or at least are not applied to the full range of problems they could solve across all industries.

- Make no mistake: **AI is coming**.
- In a not-so-distant future.

AI will be:
- Your assistant or your friend.
- It will answer your questions.
- It will help educate your kids.
- It will watch over your health.
- It will deliver your groceries to your door.
- It will drive you from point A to point B.
- It will be your interface to an increasingly complex and information-intensive world.
- More importantly, AI will help humanity as a whole move forward, by assisting human scientists in new breakthrough discoveries across all scientific fields, from genomics to mathematics.

- Don’t believe the short-term hype, but do believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way.

## Before deep learning: A brief history of machine learning

- DL isn’t the first successful form of machine learning.
- Probably most of the machine learning algorithms used in the industry today aren't DL algorithms.
- DL isn’t always the right tool for the job:
    - Sometimes there isn’t enough data for DL to be applicable.
    - Sometimes the problem is better solved by a different algorithm.
- If DL is your first contact with ML, you may find yourself in a situation where all you have is the deep learning hammer, and every machine learning problem starts to look like a nail.
- The only way not to fall into this trap is to be familiar with other approaches and practice them when appropriate.


### Probabilistic Modeling

- _Probabilistic modeling_ is the application of the principles of statistics to data analysis.
- One of the earliest forms of ML.
- One of the best-known algorithms in this category is the Naive Bayes algorithm (NB).

<!-- 
- NB is a type of machine learning classifier based on applying Bayes' theorem while assuming that the features in the input data are all independent (a strong, or "naive" assumption, which is where the name comes from).
- This form of data analysis predates computers and was applied by hand decades before its first computer implementation (most likely dating back to the 1950s).
- Bayes' theorem and the foundations of statistics date back to the eighteenth century, and these are all you need to start using Naive Bayes classifiers.
 -->

- A closely related model is logistic regression (logreg for short).


<!-- 
- sometimes considered to be the "Hello World" of modern machine learning.
- Don’t be misled by its name -- logreg is a classification algorithm rather than a regression algorithm.
- Much like Naive Bayes, logreg predates computing by a long time, yet it’s still useful to this day, thanks to its simple and versatile nature.
- It’s often the first thing a data scientist will try on a dataset to get a feel for the classification task at hand.
 -->

### Early neural networks
- Although the core ideas of neural networks were investigated in toy forms as early as the 1950s, the approach took decades to get started.
- For a long time, the missing piece was an efficient way to train large neural networks.
- This changed in the mid-1980s, when multiple people independently rediscovered the backpropagation algorithm -- a way to train chains of parametric operations using gradient-descent optimization.
- The first successful practical application of neural nets came in 1989 from Bell Labs, when Yann LeCun combined the earlier ideas of convolutional neural networks and backpropagation.
    - It was applied to the problem of classifying handwritten digits.
    - The resulting network, dubbed LeNet, was used by the United States Postal Service in the 1990s to automate the reading of ZIP codes on mail envelopes.
    - [https://www.youtube.com/watch?v=FwFduRA_L6Q](https://www.youtube.com/watch?v=FwFduRA_L6Q).
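To make "training chains of parametric operations" more tangible, here is a small sketch of backpropagation on a chain of two operations, `y = w2 * tanh(w1 * x)`. The chain, the squared-error loss, and all numeric values are assumptions for illustration; the gradients from the chain rule are checked against finite differences.

```python
import math


def forward(w1, w2, x):
    """Chain of two parametric operations."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def loss(w1, w2, x, target):
    return (forward(w1, w2, x)[1] - target) ** 2


def backward(w1, w2, x, target):
    """Backpropagation: apply the chain rule from the loss back to each weight."""
    h, y = forward(w1, w2, x)
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h                # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dw1 = dloss_dh * (1 - h * h) * x  # d tanh(u)/du = 1 - tanh(u)^2
    return dloss_dw1, dloss_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Numerical check with central differences.
eps = 1e-6
num_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
num_w2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
print(grad_w1, num_w1)
print(grad_w2, num_w2)
```

Gradient descent then moves `w1` and `w2` a small step against these gradients, exactly as in any other training loop.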

### Kernel methods

- A Support Vector Machine (SVM) is a classification algorithm that works by finding "decision boundaries" separating two classes.

- The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane (if the data were two-dimensional, a hyperplane would be a straight line).

![image-2.png](attachment:image-2.png)

- A kernel function is a computationally tractable operation that maps any two points in your initial space to the distance between these points in your target representation space, completely bypassing the explicit computation of the new representation. Kernel functions are typically crafted by hand rather than learned from data—in the case of an SVM, only the separation hyperplane is learned.
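A small sketch of the kernel trick, assuming a degree-2 polynomial kernel (a common textbook choice, not one named in this lecture). The kernel returns the inner product in the target space without ever building the new representation, and the distance mentioned above follows directly from three kernel evaluations.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def phi(p):
    """Explicit map from 2D to the 3D target space of the degree-2 kernel."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(p, q):
    """Same inner product, computed entirely in the original 2D space."""
    return dot(p, q) ** 2


p, q = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(p), phi(q))
via_kernel = poly_kernel(p, q)

# Squared distance in the target space, with and without the explicit map.
diff = [u - v for u, v in zip(phi(p), phi(q))]
dist_sq_explicit = dot(diff, diff)
dist_sq_kernel = poly_kernel(p, p) - 2 * poly_kernel(p, q) + poly_kernel(q, q)

print(explicit, via_kernel)
print(dist_sq_explicit, dist_sq_kernel)
```

For this kernel the explicit map is cheap, but for others the target space is huge or even infinite-dimensional, which is where bypassing `phi` pays off.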

- At the time they were developed, SVMs exhibited state-of-the-art performance on simple classification problems and were one of the few machine learning methods backed by extensive theory and amenable to serious mathematical analysis, making them well understood and easily interpretable. Because of these useful properties, SVMs became extremely popular in the field for a long time.

- But SVMs proved hard to scale to large datasets and didn’t provide good results for perceptual problems such as image classification. Because an SVM is a shallow method, applying an SVM to perceptual problems requires first extracting useful representations manually (a step called feature engineering), which is difficult and brittle.

- For instance, if you want to use an SVM to classify handwritten digits, you can’t start from the raw pixels; you should first find by hand useful representations that make the problem more tractable, like the pixel histograms I mentioned earlier.

### Decision trees, random forests, and gradient boosting machines



![image.png](attachment:image.png)

### Back to neural networks

- Geoffrey Hinton -- the University of Toronto.
- Yoshua Bengio -- University of Montreal.
- Yann LeCun -- New York University.
- IDSIA in Switzerland.

- In 2011, Dan Ciresan from IDSIA began to win academic image-classification competitions with GPU-trained deep neural networks—the first practical success of modern deep learning.
- But the watershed moment came in 2012, with the entry of Hinton's group in the yearly large-scale image-classification challenge ImageNet (ImageNet Large Scale Visual Recognition Challenge, or ILSVRC for short).

- The ImageNet challenge was notoriously difficult at the time, consisting of classifying high-resolution color images into 1,000 different categories after training on 1.4 million images.
- In 2011, the top-five accuracy of the winning model, based on classical approaches to computer vision, was only 74.3%. Then, in 2012, a team led by Alex Krizhevsky and advised by Geoffrey Hinton was able to achieve a top-five accuracy of 83.6%, a significant breakthrough. The competition has been dominated by deep convolutional neural networks every year since. By 2015, the winner reached an accuracy of 96.4%, and the classification task on ImageNet was considered to be a completely solved problem.
- Since 2012, deep convolutional neural networks (convnets) have become the go-to algorithm for all computer vision tasks; more generally, they work on all perceptual tasks.
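Top-five accuracy counts a prediction as correct when the true label is among the model's five highest-scoring classes. The helper below is a generic sketch of that metric with a hypothetical three-class score table, so it uses smaller values of `k`; it is not the official ILSVRC evaluation code.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for 3 examples over 3 classes, with their true labels.
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [2, 1, 0]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=2))
print(top_k_accuracy(scores, labels, k=3))
```

The metric is more forgiving than plain accuracy, which matters when there are 1,000 categories and several of them are plausible for the same image.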

###  What makes deep learning different

- Deep learning also makes problem-solving much easier, because it completely automates what used to be the most crucial step in a machine learning workflow: **feature engineering**.
- DL allows a model to learn all layers of representation **jointly**, at the same time, rather than in succession (greedily, as it's called).
- With joint feature learning, whenever the model adjusts one of its internal features, all other features that depend on it automatically adapt to the change, without requiring human intervention.

These are the two essential characteristics of how deep learning learns from data:
1. Incremental, layer-by-layer way in which increasingly complex representations are developed.
1. And the fact that these intermediate incremental representations are learned jointly, each layer being updated to follow both the representational needs of the layer above and the needs of the layer below.

### The modern machine learning landscape

Machine learning tools used by top teams (top 5 for any competition) on Kaggle.

![image-2.png](attachment:image-2.png)

- In early 2019, Kaggle ran a survey asking teams that ended in the top five of any competition since 2017 which primary software tool they had used in the competition. It turns out that top teams tend to use either deep learning methods (most often via the Keras library) or gradient boosted trees (most often via the LightGBM or XGBoost libraries).

Tool usage across the machine learning and data science industry

![image.png](attachment:image.png)

From 2016 to 2020, the entire machine learning and data science industry has been dominated by these two approaches:
1. DL. Used for perceptual problems such as image classification.
1. Gradient boosted trees. Used for problems where structured data is available.

## Why deep learning? Why now?

1. Hardware.
1. Datasets and benchmarks.
1. Algorithmic advances.

### Hardware

- From 1990 to 2010, CPUs became faster by a factor of 5,000.
- Today you can run models on your laptop.
- But complex models still need much more computing power.

- Throughout the 2000s, companies like NVIDIA and AMD invested billions of dollars in developing fast, massively parallel chips (graphical processing units, or GPUs) to power the graphics of increasingly photorealistic video games—cheap, single-purpose supercomputers designed to render complex 3D scenes on your screen in real time.
- In 2007, NVIDIA launched CUDA.
- A small number of GPUs started replacing massive clusters of CPUs in various highly parallelizable applications, beginning with physics modeling.
- DL networks are highly parallelizable.
- Around 2011 some researchers began to write CUDA implementations of neural nets
    - Alex Krizhevsky was among the first.


- The NVIDIA Titan RTX, a GPU that cost 2,500 USD in 2019, can deliver a peak of 16 teraFLOPS in single precision (16 trillion float32 operations per second).
- That's about 500x more computing power than the world’s fastest supercomputer from 1990, the Intel Touchstone Delta.
- On a Titan RTX, it takes only a few hours to train an ImageNet model of the sort that would have won the ILSVRC competition around 2012 or 2013.
- Meanwhile, large companies train deep learning models on clusters of hundreds of GPUs.

- The DL industry has been moving beyond GPUs and is investing in increasingly specialized, efficient chips for deep learning.
- In 2016, at its annual I/O convention, Google revealed its Tensor Processing Unit (TPU) project:
    - a new chip design developed from the ground up to run deep neural networks significantly faster and far more energy efficiently than top-of-the-line GPUs.
- In 2020, the third iteration of the TPU card represents 420 teraFLOPS of computing power.
- That's on the order of 10,000 times more than the Intel Touchstone Delta from 1990.
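The hardware figures quoted in this section can be cross-checked with a few lines of arithmetic. The supercomputer's throughput below is only the value implied by the "500x" ratio in the text, not an independently sourced specification.

```python
TERA = 1e12
GIGA = 1e9

titan_rtx_flops = 16 * TERA  # peak single-precision figure quoted above
tpu_flops = 420 * TERA       # third-generation TPU figure quoted above
titan_vs_delta = 500         # ratio stated in the text

# Throughput of the 1990 supercomputer implied by the stated ratio.
implied_delta_flops = titan_rtx_flops / titan_vs_delta

# How the TPU compares to that implied figure.
tpu_vs_delta = tpu_flops / implied_delta_flops

print(implied_delta_flops / GIGA)  # implied gigaFLOPS of the 1990 machine
print(tpu_vs_delta)                # TPU-to-1990-supercomputer ratio
```

Under these numbers the ratio comes out somewhat above 10,000, so the figure in the text is best read as an order of magnitude.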

### Data

If deep learning is the steam engine of this new industrial revolution, then data is its coal:
- the raw material that powers our intelligent machines, without which nothing would be possible.

Today, large companies work with image datasets, video datasets, and natural language datasets that couldn’t have been collected without the internet.

- User-generated image tags on Flickr, for instance, have been a treasure trove of data for computer vision.
- So are YouTube videos.
- And Wikipedia is a key dataset for natural language processing.
- The ImageNet dataset consists of 1.4 million images that have been hand annotated with 1,000 image categories (one category per image).


### Algorithms

- In the late 2000s, we were missing a reliable way to train very deep neural networks.
- As a result, neural networks were still fairly shallow.
- They could not compete with shallow methods such as SVMs.
- Algorithmic improvements in 2009-2010:
    - Better activation functions for neural layers.
    - Better weight-initialization schemes, starting with layer-wise pretraining, which was then quickly abandoned.
    - Better optimization schemes, such as RMSProp and Adam.
- After that: 10 or more layers.

2014, 2015 and 2016:
- Batch normalization.
- Residual connections.
- Depthwise separable convolutions.

- Today we can train models that are arbitrarily deep from scratch.
- This has unlocked the use of extremely large models, which hold considerable representational power—that is to say, which encode very rich hypothesis spaces.
- **Extreme scalability**. Tens of layers and tens of millions of parameters.
    - ResNet, Inception, or Xception.
    - BERT, GPT-3, or XLNet.

### A new wave of investment

![image.png](attachment:image.png)

- Meanwhile, large tech companies such as Google, Amazon, and Microsoft have invested in internal research departments in amounts that would most likely dwarf the flow of venture-capital money.

### The democratization of deep learning

- In the early days, doing deep learning required significant C++ and CUDA expertise, which few people possessed.
- Nowadays, basic Python scripting skills suffice to do advanced deep learning research.

### Will it last?

- **Simplicity** -- Deep learning removes the need for feature engineering, replacing complex, brittle, engineering-heavy pipelines with simple, end-to-end trainable models that are typically built using only five or six different tensor operations.
- **Scalability** -- Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can take full advantage of Moore’s law. In addition, deep learning models are trained by iterating over small batches of data, allowing them to be trained on datasets of arbitrary size. (The only bottleneck is the amount of parallel computational power available, which, thanks to Moore’s law, is a fast-moving barrier.)
- **Versatility and reusability** -- Unlike many prior machine learning approaches, deep learning models can be trained on additional data without restarting from scratch, making them viable for continuous online learning—an important property for very large production models. Furthermore, trained deep learning models are repurposable and thus reusable: for instance, it’s possible to take a deep learning model trained for image classification and drop it into a video-processing pipeline. This allows us to reinvest previous work into increasingly complex and powerful models. This also makes deep learning applicable to fairly small datasets.
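The scalability point about iterating over small batches can be sketched with a generator that never holds more than one batch in memory. `sample_stream` is a hypothetical data source, and the running sum stands in for the training step a real model would perform on each batch.

```python
def sample_stream(n):
    """Hypothetical data source that yields samples one at a time."""
    for i in range(n):
        yield float(i)


def iterate_minibatches(samples, batch_size):
    """Group any iterable of samples into lists of at most batch_size items."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # last, possibly smaller, batch
        yield batch


batch_sizes = []
total = 0.0
for batch in iterate_minibatches(sample_stream(10), batch_size=4):
    batch_sizes.append(len(batch))
    total += sum(batch)  # placeholder for one training step on this batch

print(batch_sizes, total)  # prints [4, 4, 2] 45.0
```

Because only one batch is materialized at a time, the same loop works whether the stream holds ten samples or ten billion.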

**Deep learning is still a revolution in the making, and it will take many years to realize its full potential.**