# Introduction to Machine Learning

Machine Learning is one of the best-known fields in Data Science these days: *A machine that can learn by itself!* As impressive as it sounds, you will see all the possibilities that Artificial Intelligence has to offer to technology. In this course, we will help you define terms like Big Data, Artificial Intelligence and Deep Learning to bring you up to speed with the latest technologies in Data Science.

## What you'll learn in this class 🧐🧐

- Defining what Machine Learning is
- Understanding the importance and potential of this field
- Understanding the applications of Machine Learning
- Understanding the importance of Data Cleaning
- Understanding the main models of Machine Learning
- Writing a first Data Preprocessing template

## What is Machine Learning 🦾

### What is AI

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/ai.png)

Computer Science is the study of what computers can do in general. Within Computer Science, there is a branch called Artificial Intelligence, whose purpose is to make computers mimic the way the human brain works. This brings us to Machine Learning, which is just one branch of Artificial Intelligence.

Be careful: many people equate Data Science with Machine Learning and think that ML is all a Data Scientist does. Data Scientists do use ML to understand data, but it is not the only thing they do. They also use statistics and other analysis tools, such as Tableau and SQL, that we have seen in previous courses.

### Definition of Machine Learning

Here's a definition from Stanford University:

> *“Machine learning is the science of getting computers to act without being explicitly programmed.”*


Put another way:

> *"Machine learning is the science that gives computers the ability to act without coding every step of the program."*

This is a good summary of what Machine Learning makes possible. In the past, a programmer had to code each step of a program for it to work. With Machine Learning, the program learns from data instead of having every step coded by hand.


## Why is ML the future?

### For Data Science

By 2005, humanity as a whole had created about 130 exabytes of data. To put things in perspective, we would have to cut down all the trees of about 1000 Amazonian forests to be able to put all this data on paper. Huge, right? 😮 Even more impressive: by 2020, scientists expect that we will have created 50,000 exabytes of data. So you can see the exponential growth in the amount of data created by humanity in recent years.

Today, we are not able to manage all the data available. This is where Machine Learning comes in. If computers are able to *"learn by themselves"*, they will be able to process more data than humans and create models that can predict the future.

### For businesses

For companies, Machine Learning is a new tool that will enable them to get to know their customers better and therefore to offer products that are better suited to their needs. For example, you can recommend products based on what other customers have already bought. Insurance companies can more accurately predict the risks associated with a customer and therefore better adjust their prices.

More broadly, Machine Learning can potentially have a significant impact around the world. Driverless cars are a first example: beyond the novelty of being driven by a robot, the productivity gains can be considerable (imagine the time you would save if you didn't have to spend it driving). Google even argues that these cars could increase life expectancy, because the number of accidents would drastically decrease.

### For healthcare

This last example brings us to our last point. Thanks to Machine Learning, doctors are able to predict diseases more effectively. This is essential because earlier diagnosis of certain serious illnesses drastically improves patients' prognosis.

## Applications of Machine Learning


* **Movie recommendations on Netflix** 🍿

    * Machine Learning is able to predict with great accuracy the content you will like based on what you have previously watched. Netflix's recommendations are based on this type of Machine Learning algorithm.

* **Facebook's facial recognition** 📸

    * If you've ever tried to tag someone on Facebook, you must have already seen that the company was able to recognize where the faces of the people in a photo were. The algorithm that enables this is based on Deep Learning (a branch of Machine Learning).

* **Virtual reality headset** 🥽

    * We've all seen these headsets (Facebook's Oculus or Google's Daydream). The fact that the image moves when you move your head is based on Machine Learning: an algorithm is able to recognize your movements and transcribe them into the game.

* **Driverless cars** 🏎️
    * We've already talked about them: just like the Boston Dynamics robots, driverless cars rely on reinforcement learning algorithms to drive and adapt to the terrain.

## Cycle of a Machine Learning project 🔁

Being able to predict the future is not easy. Here are the steps to build a Machine Learning algorithm.

### Data Collection

The first piece of information needed in Machine Learning is how the data is collected and what type of data it is. This will be one of the factors that will decide which ML model you will use for your analyses. Here are some of the things you will need to think about:

* **How is the data structured?** 🗃️

    * There are several levels of structure. Your data can be **completely structured,** which means that it is stored in a database and is well ordered (as in a SQL database for example).

    * Your data can be **semi-structured,** which is a bit more complex because it means that it is not cleaned up and you will probably have to reformat it to match the structure you want to have. For example, you may have numbers that are stored as text.

    * Finally, you may have data that is completely **unstructured.** In this case, your computer won't be able to find a way to organize your data. For example, pixels in a picture, frames in a movie, or words in a blog post are all unstructured data.

* **What is the size of your dataset?** 🦃

    * It is important to know how much data you are going to process to know how efficient your model should be. For example, if you manage terabytes of data (1 TB = 1,000 GB), your model must be able to ingest that much data and process it within a reasonable time.

    * Size is not the only problem: you also need to know what type of data you will have in your model. If the data is numerical, you will not use the same model as if it were words or images. You could technically also have both words and numbers to process at the same time, and then the model is still the same.

    * Finally, you will need to determine whether your data reflects reality or whether some of it may be wrong. If you think that the accuracy of your data may vary, your model will also be different. For example, in the case of virtual reality, someone's movement is not exactly the same every time, even if the desired action is the same. Therefore, your model must be able to recognize this.

* **Do you already have results? (i.e - Is your data labeled?)** 🏷️

    * To train certain kinds of algorithms, you will need to feed them data that already contains the results of what happened. For example, if you are studying your customers' churn rate (the percentage of customers who stopped using your product), you probably already have a sample of customers for whom you know whether they kept using the product or stopped. This is called **labeled data**. You already have answers that your model can be trained on.

    * Sometimes, you will simply need to understand the data that you collected: your goal is not to make predictions but to build groups of homogeneous data. In that case, we usually use **unlabeled data**. We don't have any results in advance.
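
To make the ideas of semi-structured and labeled data more concrete, here is a minimal sketch in plain Python. The records, the field names (`age`, `monthly_spend`, `churned`) and the helper `parse_customer` are all hypothetical and only serve as an illustration. The numbers arrive as text and are converted before the features are separated from the label.

```python
# Hypothetical raw records: the numeric fields arrive as text (semi-structured data).
raw_customers = [
    {"age": "34", "monthly_spend": "120.5", "churned": "yes"},
    {"age": "51", "monthly_spend": "80", "churned": "no"},
    {"age": "27", "monthly_spend": "45.25", "churned": "no"},
]


def parse_customer(record):
    """Convert text fields to proper types so that a model can use them."""
    return {
        "age": int(record["age"]),
        "monthly_spend": float(record["monthly_spend"]),
        "churned": record["churned"] == "yes",
    }


customers = [parse_customer(record) for record in raw_customers]

# Labeled data: the features describe each customer, the label is the known outcome.
features = [(c["age"], c["monthly_spend"]) for c in customers]
labels = [c["churned"] for c in customers]

print(features)
print(labels)
```

If the `churned` field were not available, only `features` would remain, and the dataset would be unlabeled.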

### Data Cleaning 🧹

Cleaning your data is probably the most daunting and yet most important task in the cycle. Indeed, a model is only as effective as the data you feed it with. That's why you have to be very careful with your sources. Here's what to look for:

* **Missing or inconsistent values**

    * Most of the time, there will be missing or inconsistent values in your dataset. Let's take the example above. You may have a customer who used your product regularly, then stopped and then started using it again but only 4 or 5 times in a year. It is likely that this customer does not follow the "normal" pattern of the rest of your customers. It will be preferable to remove the data for this customer from your sample as it is not consistent.

    * You may also have missing data. These can give your model a hard time because it will not know how to interpret this value. There are several techniques to manage these missing values: delete them or replace them with an average.

* **Standardization**

    * Some data will also have to be put on the same scale. For example, if you want to measure wages and ages, the former are expressed in thousands while the latter are expressed in tens. For some models, the data will have to be standardized, i.e., everything has to be put back on the same range of values. For instance, in the previous case, you could express every value as a number between 0 and 1.

* **Filter**

    * As mentioned above, when you have inconsistent or missing data, it may be best to remove it from your data source because it is not representative of your population.
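
The cleaning steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the values are made up, `None` is used to mark a missing value, and the helper names `fill_missing_with_mean` and `min_max_scale` are hypothetical. Rescaling to the range 0 to 1 is often called min-max scaling.

```python
from statistics import mean

# Hypothetical customer data; None marks a missing value.
ages = [25, 40, None, 55]
wages = [30000, 52000, 41000, None]


def fill_missing_with_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1.

    Assumes the values are not all identical (otherwise the range would be 0).
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


ages_scaled = min_max_scale(fill_missing_with_mean(ages))
wages_scaled = min_max_scale(fill_missing_with_mean(wages))

print(ages_scaled)   # ages and wages now live on the same 0-to-1 scale
print(wages_scaled)
```

Dropping the rows that contain `None` instead of filling them would be the "delete" option mentioned above.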

### Building a model

Now that you have collected and cleaned up your data, you can build your model. This is the coolest part of the process. There are several types of Machine Learning, each with several types of models. We are going to see the most popular ones.

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/machine_learning.png)

#### Supervised Machine Learning

If you have labeled data, your model is **supervised**. It will use the results you already have to learn how to make its predictions. So you're going to split your dataset in two. The first part will be your **training set**, which you will use to train your model. The second part will be your **test set**, which you will use to test the accuracy of your model. There are two types of supervised models.
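
As an illustration, splitting a dataset into a training set and a test set could look like the sketch below. The helper `split_train_test`, the 80/20 ratio and the toy dataset are assumptions made for this example, not a prescribed method.

```python
import random


def split_train_test(rows, test_ratio=0.2, seed=0):
    """Shuffle the rows, then split them into a training set and a test set."""
    shuffled = rows[:]                     # copy so the original list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical labeled dataset: (years_of_experience, salary) pairs.
dataset = [(years, 30000 + 2000 * years) for years in range(10)]

train_set, test_set = split_train_test(dataset, test_ratio=0.2)
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids a test set made only of the last rows, which may not be representative if the data is sorted.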

##### Regressions

In a regression, you will try to predict a numerical value. Here is a list of examples where regression can be useful:

- Predicting someone's salary based on years of experience
- Knowing a company's future revenues based on its investments

Your results will depend on one or more variables. In the first example, someone's salary depends on the number of years of experience. In the second, the variables are the investments the company has made. This is what it looks like in a graph:

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/regressions.png)

Each point is what actually happened in real life (e.g. for each person, a point is added at the intersection of their age on the horizontal axis and their salary on the vertical axis). The line represents the prediction of the model. We will develop below how this line is constructed.
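
To give an idea of what such a line looks like in code, here is a minimal sketch that fits a straight line with ordinary least squares, one common way to build a regression line. The data and the helper `fit_line` are hypothetical and use the salary example from the list above.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: years of experience and the corresponding salaries.
years = [1, 2, 3, 4, 5]
salaries = [32000, 34000, 36000, 38000, 40000]

slope, intercept = fit_line(years, salaries)

# Use the fitted line to predict the salary for 6 years of experience.
predicted_salary = slope * 6 + intercept
print(slope, intercept, predicted_salary)
```

Here the toy data falls exactly on a line, so the fit is perfect. Real data is noisy and the line only approximates the points.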

##### Classification

Classification models will allow you to predict a category. Here are two examples:

- Is this email spam?
- Is this person going to buy product X or product Y?

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/supervised_classification.png)

In this graph, the points are the actual values, positioned according to the data characterizing emails and people. The colors are the predictions made by your model. If we take the first example from above, the blue color corresponds to emails that are not spam and the red color to emails that are spam.
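
The sketch below shows one simple way to predict a category: a k-nearest-neighbors classifier, chosen here purely for illustration. The two features (number of links and number of exclamation marks), the training emails and the helper `classify` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled emails: (number of links, number of "!" characters) and a label.
training_emails = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((7, 9), "spam"),
    ((8, 6), "spam"),
    ((9, 8), "spam"),
]


def classify(features, training_data, k=3):
    """Predict the majority label among the k closest training examples."""
    by_distance = sorted(training_data, key=lambda item: math.dist(features, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(classify((8, 8), training_emails))  # many links and "!" characters
print(classify((1, 1), training_emails))  # few links and "!" characters
```

A new email receives the label of the training emails it resembles most, which is the intuition behind the colored regions in the graph.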

#### Unsupervised Machine Learning

Unlike supervised ML, this time you **do not have labeled data**. Your model will try to build groups of homogeneous data. Here are the most popular models:

##### Clustering

Clustering is similar to classification except that you don't know in advance which categories your data will belong to. Your model will make groups by itself, called clusters, based on all the variables you have. Here is an application:

- I'm a distributor and I would like to test several marketing strategies on my different customer targets. However, I don't know these targets well. I can use a clustering model to get to know them.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/unsupervised_clustering.png)

Here is a graphical representation of clustering: the colors represent the different groups that your model has created.
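
As an illustration of how a model can build groups by itself, here is a very small version of k-means, one popular clustering algorithm. The customer data, the helper `k_means` and the hand-picked starting centroids are assumptions made for this sketch. Real implementations usually pick the starting centroids randomly.

```python
import math


def k_means(points, centroids, n_iterations=10):
    """Tiny k-means: alternate between assigning points and moving centroids."""
    for _ in range(n_iterations):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            closest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[closest].append(point)
        # Update step: move each centroid to the mean of its points.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket in euros).
customer_points = [(1, 20), (2, 25), (1, 22), (10, 90), (11, 95), (12, 100)]

# Starting centroids are picked by hand here to keep the example deterministic.
centroids, clusters = k_means(customer_points, centroids=[(1, 20), (12, 100)])

print(clusters[0])  # occasional customers with small baskets
print(clusters[1])  # frequent customers with large baskets
```

No label is given to the algorithm: the two groups emerge only from the distances between customers.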

#### Deep Learning

##### Neural Network

Deep Learning is a sub-domain of Machine Learning based on the use of Artificial Neural Networks (ANNs) as a model. Intuitively, ANNs mimic the way the human brain works to predict outcomes. The best example of an application is facial recognition in photos:

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/Artificial_neural_network.png)

As you can see on the graph, you first feed several values (or images) into your network at the input level. Your algorithm then decomposes your images through several layers (hidden layers) and outputs a result. To give a concrete example, here are the kinds of questions involved in recognizing a face in an image:

- Can we see the oval shape in the image?
- Does this oval shape correspond to a head?
- Does it have hair?
- Can we see eyes on this head?
- …

At the end of the process, your algorithm is able to recognize the person in the image by going through all these steps. This process is very long and requires a lot of computing power.
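
To show how values flow from the input layer through a hidden layer to the output, here is a minimal sketch of a forward pass in plain Python. Everything in it is an assumption made for illustration: the weights are hand-picked (a real network learns them from data), the sigmoid activation is one choice among many, and the network is far smaller than anything used for face recognition.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1 / (1 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical, hand-picked parameters.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]  # 2 hidden neurons, 3 inputs each
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0]]                         # 1 output neuron, 2 inputs
output_biases = [0.0]

pixels = [0.2, 0.9, 0.5]                               # input layer: 3 values
hidden = layer(pixels, hidden_weights, hidden_biases)  # hidden layer
output = layer(hidden, output_weights, output_biases)  # output layer

print(hidden)
print(output)
```

The single output value lies between 0 and 1 and could be read as a score such as "how likely is it that this image contains a face".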


##### What is the difference between Machine Learning & Deep Learning?

It is often asked *what is the difference between Machine Learning and Deep Learning*. Well, it boils down to 3 things:

1. The algorithm you will use for Deep Learning is an ANN
2. In Machine Learning you will most often use **structured data** whereas in Deep Learning you will use **unstructured data**
3. In Machine Learning, you will usually perform a feature extraction process where you try to extract the variables that seem important for your model, whereas in Deep Learning you simply feed in the raw unstructured data

If you want to learn more, feel free to check out our article (in French): [La vraie différence entre machine learning et deep learning](https://www.jedha.co/blog/la-vraie-difference-entre-machine-learning-deep-learning)

![](https://uploads-ssl.webflow.com/5ecea319ef4214bb71278093/6005af7e4850ab3e94d806e0_Screen%20Shot%202021-01-18%20at%2016.54.11.png)

#### Natural Language Processing

Image recognition is an application case of Deep Learning but the latter is also used in Natural Language Processing (NLP). Natural Language Processing is a technology used to interpret words in a text. It is used a lot in the construction of Chatbots or to recognize whether a review is positive or negative automatically.

#### Reinforcement Learning

Finally, Reinforcement Learning is the last type of Machine Learning. It is a powerful tool whose idea is to give the algorithm a reward every time it is right in its prediction and a penalty every time it is wrong. The reward or penalty simply materializes as +1 or -1.

We won't develop this topic further in the course, but if you are interested, reinforcement learning can be used to solve the [multi-armed bandit](<https://fr.wikipedia.org/wiki/Bandit_manchot_(math%C3%A9matics)>) problem, which you can find in the resources.
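
For the curious, here is a minimal sketch of the reward and penalty idea applied to a multi-armed bandit. It uses the epsilon-greedy strategy, which is one simple approach chosen here for illustration. The win probabilities, the function name `epsilon_greedy` and all parameter values are hypothetical.

```python
import random


def epsilon_greedy(win_probabilities, n_rounds=5000, epsilon=0.1, seed=0):
    """Play a multi-armed bandit, learning which arm pays best from +1 / -1 rewards."""
    rng = random.Random(seed)
    n_arms = len(win_probabilities)
    pulls = [0] * n_arms             # how many times each arm was played
    average_reward = [0.0] * n_arms  # running average of the rewards per arm

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try a random arm
        else:
            # exploit: play the arm that looks best so far
            arm = max(range(n_arms), key=lambda a: average_reward[a])

        # Reward of +1 when the arm wins, penalty of -1 when it loses.
        reward = 1 if rng.random() < win_probabilities[arm] else -1

        pulls[arm] += 1
        average_reward[arm] += (reward - average_reward[arm]) / pulls[arm]

    return pulls, average_reward


# Hypothetical slot machines: the algorithm does not know these probabilities.
pulls, average_reward = epsilon_greedy([0.2, 0.5, 0.8])
print(pulls)
print(average_reward)
```

The algorithm is never told which arm is best. It discovers it by trial and error, guided only by the rewards and penalties it receives.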

### Evaluate and implement your model

The very last step in your process is to evaluate the performance of your model. The best way to do this is to confront it with reality. We've talked about dividing your dataset into a training set and a test set. The test set will allow you to test your model.

There are also performance indicators like _confusion matrices_ that will allow you to decide how well your model performs.

Your model will not be 100% accurate. However, this is not the point. Your model has value when it is better at predicting the future than chance. If your algorithm predicts the future better than chance, then it may be worth keeping.
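
As an illustration, the sketch below builds the four cells of a confusion matrix for a spam classifier and compares the model's accuracy with a naive baseline. The test results are made up, and using "always predict the most frequent class" as a stand-in for chance is an assumption of this example.

```python
from collections import Counter

# Hypothetical test-set results for a spam classifier.
actual = ["spam", "spam", "not spam", "not spam", "spam", "not spam", "not spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "not spam", "spam", "spam", "not spam", "not spam"]

# Count each (actual, predicted) pair: these counts are the cells of the confusion matrix.
confusion = Counter(zip(actual, predicted))

true_positives = confusion[("spam", "spam")]           # spam correctly flagged
false_negatives = confusion[("spam", "not spam")]      # spam that was missed
false_positives = confusion[("not spam", "spam")]      # normal email wrongly flagged
true_negatives = confusion[("not spam", "not spam")]   # normal email correctly kept

accuracy = (true_positives + true_negatives) / len(actual)

# Naive baseline: always predict the most frequent class in the test set.
baseline_accuracy = Counter(actual).most_common(1)[0][1] / len(actual)

print(true_positives, false_negatives, false_positives, true_negatives)
print(accuracy, baseline_accuracy)
```

In this made-up example the model is right 75% of the time, against 62.5% for the naive baseline, so it adds some value.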


#### Overfitting VS Underfitting 

As a Data Scientist, your #1 enemy is **overfitting**. The concept is simple. Sometimes your model learns the training data too well and is therefore not good at predicting new test data. This is what you really need to avoid, and the main way to do so is to **get more data**.

On the other hand, there is the concept of **underfitting**, where your model simply isn't performing well on the data. Here the idea is to find more complex models that can really fit the data.

![](https://essentials-assets.s3.eu-west-3.amazonaws.com/M04-Machine-learning/underfitting_overfitting.png)
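
The difference can be seen on a toy example by comparing the error on the training set with the error on the test set. The sketch below is only an illustration: the data is made up, and the three models (a constant, a straight line, and a model that memorizes the training points) are hypothetical stand-ins for "too simple", "reasonable" and "too flexible".

```python
from statistics import mean


def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical noisy data that roughly follows y = 2x.
train_x, train_y = [0, 2, 4, 6, 8], [1.5, 2.5, 9.5, 10.5, 17.5]
test_x, test_y = [1.5, 3.5, 5.5, 7.5], [2.5, 7.5, 10.5, 15.5]


def constant_model(x):
    """Too simple (underfitting): ignores x and always predicts the training mean."""
    return mean(train_y)


def memorizing_model(x):
    """Too flexible (overfitting): returns the y of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


# Reasonable fit: a least-squares line through the training data.
x_bar, y_bar = mean(train_x), mean(train_y)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(train_x, train_y)) / sum(
    (x - x_bar) ** 2 for x in train_x
)
intercept = y_bar - slope * x_bar


def line_model(x):
    return slope * x + intercept


errors = {}
models = [("constant", constant_model), ("line", line_model), ("memorizing", memorizing_model)]
for name, model in models:
    train_error = mean_squared_error(train_y, [model(x) for x in train_x])
    test_error = mean_squared_error(test_y, [model(x) for x in test_x])
    errors[name] = (train_error, test_error)
    print(name, round(train_error, 2), round(test_error, 2))
```

The constant model is poor on both sets, which is underfitting. The memorizing model is perfect on the training set but does worse than the line on the test set, which is overfitting.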



## How do ML models learn?

### **Minimizing a cost function** 😵

Don't worry, it sounds way more complicated than it actually is. 😌 In fact, how do you think a Machine Learning algorithm learns? It does so using what we call a **cost function**. It works iteratively in the following way:

1. You choose an ML model (we'll cover different kinds of models later on)
2. At the beginning your model will make random predictions
3. Then you will compare your predictions with reality (e.g. you will measure whether your model correctly classified a photo as showing a dog or a cat)
4. Then you will try to minimize the difference between reality and your predictions using the cost function

There are many different types of cost functions that you can use. Here are two examples:

* *Mean Squared Error* - $\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
* *Log Loss* - $\sum_{i=1}^{n} \left[-y_i\log(\hat{y}_i) - (1-y_i)\log(1-\hat{y}_i)\right]$

The way your algorithm *"learns"* is simply by trying to minimize a cost function. If the cost is low, it means that your algorithm's predictions are close to reality. Otherwise, it means that your algorithm has not learned yet.
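
To make this loop concrete, here is a minimal sketch that implements the two cost functions listed above and then minimizes the Mean Squared Error with gradient descent, one common minimization technique. The model (`prediction = w * x`), the data, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import math
import random


def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, y_pred):
    """Log loss for labels in {0, 1} and predicted probabilities strictly between 0 and 1."""
    return sum(
        -y * math.log(p) - (1 - y) * math.log(1 - p) for y, p in zip(y_true, y_pred)
    )


# Hypothetical data following y = 3x; the model has to discover the factor 3.
xs = [1, 2, 3, 4]
ys = [3, 6, 9, 12]

# Step 2: start from a random parameter, so the first predictions are random.
w = random.Random(0).uniform(-1, 1)
learning_rate = 0.01
costs = []

for step in range(200):
    predictions = [w * x for x in xs]
    # Step 3: compare the predictions with reality using the cost function.
    costs.append(mean_squared_error(ys, predictions))
    # Step 4: move w in the direction that lowers the cost (gradient of the MSE w.r.t. w).
    gradient = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient

print(costs[0], costs[-1])  # the cost shrinks as the model learns
print(w)                    # w ends up close to 3
```

The same loop works with `log_loss` for a classification model. Only the cost function and its gradient change.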

## Resources 📚📚

- Confusion matrix - [https://bit.ly/2FQo6WU](https://bit.ly/2FQo6WU)
- Problem of the Multi-Armed Bandit - [https://bit.ly/2rlcRRZ](https://bit.ly/2rlcRRZ)
- Machine Learning Stanford - [http://stanford.io/2aeK8WA](http://stanford.io/2aeK8WA)
- Introduction to Machine Learning with Google PM (Slides) - [http://bit.ly/2tAX96C](http://bit.ly/2tAX96C)
- Introduction to Machine Learning with Google PM (presentation) - [http://bit.ly/2u1Foyk](http://bit.ly/2u1Foyk)
- Machine Learning A-Z: Hands-On Python & R in Data Science - [http://bit.ly/2tB2niQ](http://bit.ly/2tB2niQ)
- Diving into Machine Learning - by Rob Craft, Group Product Manager at Google - [http://bit.ly/2v5hLC2](http://bit.ly/2v5hLC2)
- Machine Learning Introduction Regression and Classification - [http://intel.ly/2vah4qW](http://intel.ly/2vah4qW)