# References | 14. Review, Conclusion

In [None]:
from IPython.display import YouTubeVideo

### References

If you are interested in Chollet's theory of intelligence, and the series of tests he developed, here is a fairly comprehensive series of videos explaining Chollet's paper ["On the Measure of Intelligence"](https://arxiv.org/abs/1911.01547).

In [2]:
YouTubeVideo('3_qGrmD6iQY', width=853, height=480) # Yannic Kilcher, On the Measure of Intelligence by François Chollet - Part 1: Foundations (Paper Explained) 

In [4]:
YouTubeVideo('THcuTJbeD34', width=853, height=480) # Yannic Kilcher, On the Measure of Intelligence by François Chollet - Part 2: Human Priors (Paper Explained) 

In [5]:
YouTubeVideo('cuyM63ugsxI', width=853, height=480) # Yannic Kilcher, On the Measure of Intelligence by François Chollet - Part 3: The Math (Paper Explained) 

In [6]:
YouTubeVideo('O9kFX33nUcU', width=853, height=480) # Yannic Kilcher, On the Measure of Intelligence by François Chollet - Part 4: The ARC Challenge (Paper Explained) 

### Foundations / Going Deeper

[Goodfellow, Bengio and Courville, *Deep Learning*](https://www.deeplearningbook.org/)  
[Christopher M. Bishop, *Pattern Recognition and Machine Learning*](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)  
[MacKay, *Information Theory, Inference, and Learning Algorithms*](https://www.inference.org.uk/mackay/itila/) (ignore the horrendous website)  
MacKay has lectures here as well on [YouTube](https://www.youtube.com/playlist?list=PLruBu5BI5n4aFpG32iMbdWoRVAA-Vcso6).  

### 14.2.3 Local generalization vs. extreme generalization

Beyond our access, through experience, to broad training sets, there is another fundamental difference between us and them (DL models).

We maintain *abstract* models of our social and natural environment, and we use these models to anticipate and predict.

Thanks to that, we are able to form representations of things that are novel to our experience, like dogs cooking or winning the lottery.

We manage this feat through abstraction and reasoning.

#### Note

Some of these strategies (e.g. building a model of the world and acting accordingly) are studied in **Reinforcement Learning**, which unfortunately we haven't looked at this term.

#### Extreme generalisation

The ability to adapt to novel experiences despite a paucity of data is what Chollet calls *extreme generalisation*.

Deep nets generalise only locally. They only adapt to new situations that closely match the training set. 

<!-- <img src="images/chollet.local-vs-extreme-generalisation.p.448.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/9-conclusions/images/chollet.local-vs-extreme-generalisation.p.448.png?raw=true">

<small>DLWP, p. 448</small>

#### Example: a trip to the moon

Imagine training a model to launch a moon-bound rocket. The training set would have to include millions of launch trials before the model could accurately predict the voyage.

Humans, however, used reasoning and abstraction to develop rocket science, and managed the task with only a few trials.

---

## 14.4 Implementing intelligence: The missing ingredients

### 14.4.2 The two poles of abstraction

#### Value-centric analogy

<!-- <img src="images/chollet.value-centric-analogy.p.456.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/9-conclusions/images/chollet.value-centric-analogy.p.456.png?raw=true">

<small>DLWP, p. 456</small>

#### Program-centric analogy

<!-- <img src="images/chollet.program-centric-analogy.p.457.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/9-conclusions/images/chollet.program-centric-analogy.p.457.png?raw=true">

<small>DLWP, p. 457</small>

#### Cognition as a combination of both kinds of abstraction

<!-- <img src="images/chollet.two-poles-abstraction.p.458.png"> -->
<img src="https://github.com/jchwenger/AI/blob/main/9-conclusions/images/chollet.two-poles-abstraction.p.458.png?raw=true">

<small>DLWP, p. 458</small>

### 14.4.3 The missing half of the picture

Deep Learning focusses almost exclusively on the first kind, "value-centric analogies".

It might be more fruitful to view these two aspects as belonging to a **spectrum**, with humans using different strategies at different times.

### 14.5.1 Models as programs

Current AI reasoning relies on hardcoding e.g. search algorithms, graph manipulation and formal logic.

- In **AlphaGo**, for example, most of the intelligence is hardcoded as a [Monte-Carlo Tree Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search). Learning from data only occurs in specialised submodules.
- **RNNs** are less restricted than feedforward nets because they apply simple geometric transformations within a feedback loop.
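To make the "feedback loop" in the RNN bullet concrete, here is a minimal sketch of a recurrent cell with a single hidden unit, in plain Python. The function name `rnn_forward` and the weight values are hypothetical, chosen only for illustration. Note that the `for` loop is written by hand: only the weights inside it would be learnt.

```python
import math


def rnn_forward(inputs, w_x, w_h, b):
    """Run a one-unit recurrent cell over a sequence of floats."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The same simple transformation at every step,
        # fed with its own previous output (the feedback loop).
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Hypothetical weights: with w_h != 0 the first input still
# influences the later states; with w_h == 0 it is forgotten.
with_memory = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.8, b=0.0)
without_memory = rnn_forward([1.0, 0.0, 0.0], w_x=0.5, w_h=0.0, b=0.0)
print(with_memory)
print(without_memory)
```

Setting `w_h` to zero removes the feedback, and each step then behaves like an independent feedforward unit.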

#### Program synthesis

If the system had access to **programming primitives** such as `if`, `while`, variables, disk storage, sorting operations, data structures (lists, hash tables etc.), the hypothesis space would far exceed that of current models.

**Program synthesis** searches automatically for simple programs, and terminates once a program matches the given input-output pairs.

But instead of adjusting the parameters of a hardcoded model, program synthesis generates *source code*.

One could imagine augmenting Deep Learning models with program synthesis capabilities.
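The search described above can be sketched as a brute-force enumeration. This is a toy illustration, not a real synthesis system: the four primitives and the names `synthesise`, `run` and `to_source` are assumptions made up for this example. Programs are tried from shortest to longest, the search stops at the first one that matches every input-output pair, and the result is emitted as source code.

```python
from itertools import product

# Hypothetical primitives for illustration: each maps an int to an int.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "dec": lambda x: x - 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}


def run(program, x):
    """Apply the named primitives to x, in order."""
    for name in program:
        x = PRIMITIVES[name](x)
    return x


def synthesise(examples, max_length=4):
    """Return the first shortest primitive sequence matching all pairs."""
    for length in range(1, max_length + 1):  # simplest programs first
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                return program  # terminate once every pair is matched
    return None  # nothing found within the length budget


def to_source(program, name="f"):
    """Render a found program as Python source code."""
    body = "x"
    for step in program:
        body = f"{step}({body})"
    return f"def {name}(x):\n    return {body}"


examples = [(1, 4), (2, 6), (3, 8)]  # target behaviour: 2 * (x + 1)
program = synthesise(examples)
print(program)  # → ('inc', 'double')
print(to_source(program))
```

The number of candidate programs grows exponentially with `max_length`, which is one reason why guiding such a search with learnt models is attractive.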

### Beyond backpropagation

Deep Learning/program synthesis hybrids will likely no longer be differentiable, so we cannot use backpropagation to train them.

But there are alternatives, such as optimisers that make few assumptions about the loss function (or, at least, do not assume continuity and smoothness):
- [genetic algorithms](https://en.wikipedia.org/wiki/Evolutionary_computation);
- [simulated annealing](https://en.wikipedia.org/wiki/Simulated_annealing);
- [swarm intelligence](https://en.wikipedia.org/wiki/Swarm_intelligence);
- etc...
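As an illustration of one item in the list above, here is a minimal sketch of simulated annealing on a piecewise-constant loss, whose gradient is zero almost everywhere (so gradient descent would get no signal). The loss function, the linear cooling schedule and all parameter values are assumptions chosen for this example.

```python
import math
import random


def loss(x):
    """Piecewise-constant loss: zero gradient almost everywhere."""
    return abs(math.floor(x) - 7)


def simulated_annealing(loss_fn, x0, steps=5000, temp0=5.0, step_size=1.0, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    x, current = x0, loss_fn(x0)
    best_x, best = x, current
    for i in range(steps):
        # Linear cooling schedule (one simple choice among many).
        temp = temp0 * (1 - i / steps) + 1e-9
        candidate = x + rng.uniform(-step_size, step_size)
        candidate_loss = loss_fn(candidate)
        delta = candidate_loss - current
        # Always accept improvements; accept worse moves
        # with probability exp(-delta / temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x, current = candidate, candidate_loss
            if current < best:
                best_x, best = x, current
    return best_x, best


best_x, best = simulated_annealing(loss, x0=0.0)
print(best_x, best)
```

The optimiser only ever evaluates the loss; it never asks for a derivative, which is what makes this family of methods usable on non-differentiable models.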

### Automated ML

There is research on how **architectures themselves** can be learnt rather than handcrafted, for instance [**Neural Architecture Search** (NAS)](https://en.wikipedia.org/wiki/Neural_architecture_search) and [**Meta-Learning**](https://en.wikipedia.org/wiki/Meta-learning_(computer_science)).

Hyperparameter tuning is a simple search procedure. Quite a few tuning systems already exist. Even architecture search algorithms are feasible.

But learning architectures in conjunction with model weights would be more desirable. This would be more efficient because at the moment each architecture has to be trained from scratch.
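Hyperparameter tuning as a simple search procedure can be sketched with random search. Everything here is hypothetical: the search space, the function names, and above all `validation_score`, which is a synthetic stand-in for "train a model with this configuration and return its validation accuracy".

```python
import math
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "num_layers": [1, 2, 3, 4],
    "units": [16, 32, 64, 128],
}


def validation_score(config):
    """Synthetic stand-in for training a model and measuring validation accuracy.

    In a real setting, every call would train a new model from scratch.
    """
    lr_penalty = 0.1 * abs(math.log10(config["learning_rate"]) + 3)
    depth_penalty = 0.05 * abs(config["num_layers"] - 2)
    width_penalty = 0.001 * abs(config["units"] - 64)
    return 1.0 - lr_penalty - depth_penalty - width_penalty


def random_search(space, score_fn, trials=20, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(SEARCH_SPACE, validation_score)
print(best_config, best_score)
```

Each of the `trials` evaluations stands for a full training run, which is the inefficiency described above: nothing learnt for one configuration is reused for the next.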

### Lifelong learning and modular subroutine reuse

There is a lot of wasted effort in Deep Learning: every dataset/model/task requires training from scratch.

Future hybrid Deep Learning/program synthesis systems, however, will require far more modular reuse.

This is because many datasets are insufficiently informative – it will be necessary to use information from previously encountered datasets. 

(We don't relearn the language every time we read a new book...)

#### Example: translation

A single DL model trained to translate both English/German and French/Italian is better at each individual translation than a model trained on one pair alone.

The more languages you add, the better it becomes, due to information overlap.

The same goes for vision: training an image segmentation model in conjunction with an image classification model yields a model that is better at both tasks.

#### Foundation models

We already use **pretrained models** (i.e. fixed models with weights trained on a large dataset) in computer vision and NLP, and pretraining procedures are expanding fast.

The rise of very large models used as a basis for multiple tasks leads to the concept of [**Foundation models**](https://en.wikipedia.org/wiki/Foundation_models) (a large model acquiring general knowledge that can then be finetuned).

The rise of Transformers is really that story (GPT-3 being perhaps the most famous foundation model now).