# Introduction to Machine Learning

## What is Machine Learning?

Machine learning is about building systems that can make useful predictions from inputs they have never seen before. To do this, the systems are first shown examples of the kinds of inputs they should expect.

For example, let's say we need to build a spam filter for emails. Whenever a new email comes in, we have to figure out whether it is spam or not and flag it accordingly.

To build a regular "rule-based" system as opposed to a machine learning system, we would inspect a lot of emails that are spam and those that are not, and come up with a set of keywords that occur frequently in spam emails. Then, whenever a new email comes in, we would check the contents of the email against our set of keywords. If there are any matches, we would declare the email spam.

Some problems with using such a system are:
 - How do you decide which keywords are spam? It would be terrible if you had to sit somewhere and think of all such keywords.
 - Even if you do come up with such a list of keywords, isn't it possible that a mail that is not spam contains one of those words?
 - What if a keyword that you've "blacklisted" is spelt slightly differently in an email? For example, you've blacklisted "lottery", but the email contains the word "l0ttery".
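
The problems listed above can be seen in a minimal sketch of such a rule-based filter. The keyword list and the function name `is_spam_rule_based` are made up for illustration; they are not part of any real spam filter.

```python
# Hypothetical keyword list; a real filter would need far more entries.
SPAM_KEYWORDS = {"lottery", "winner", "prize"}


def is_spam_rule_based(email_text: str) -> bool:
    """Flag an email as spam if any of its words is a blacklisted keyword."""
    words = email_text.lower().split()
    # Strip surrounding punctuation so "lottery!" still matches "lottery".
    cleaned = {word.strip(".,!?:;\"'") for word in words}
    return not cleaned.isdisjoint(SPAM_KEYWORDS)


print(is_spam_rule_based("You won the lottery!"))                   # True: exact keyword match
print(is_spam_rule_based("You won the l0ttery!"))                   # False: the misspelling slips through
print(is_spam_rule_based("Our office lottery pool meets at noon"))  # True: a harmless email is flagged
```

The second and third calls show the two failure modes: a spam email that avoids the exact keyword is missed, and a legitimate email that happens to contain it is flagged.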

In contrast, to build a machine learning system, we would need to collect a lot of examples of emails that are spam and those that are not:

<img src="emailspam.png" width="500px"/>

We would then **train a model** with this data. We will look at what these terms mean later. For now, imagine the model as a black box, and training as the process of showing it the examples we have collected. During training, the model "figures out" how it is going to identify if an email is "spam" or "not spam". After we have trained the model, we can give it any new email that we get and the model will predict whether it is "spam" or "not spam".

<img src="spam.png" width="600px"/>

## Important Terms

### Labels
A **label** is what we are predicting using machine learning. It could be the temperature tomorrow, the breed of a dog in a picture, or even the translation in French of a sentence in English. In the example above, we were trying to predict whether an email was spam or not. We had two labels - "spam" and "not spam".

### Features
A **feature** is an input that is used to predict the **label**. A very simple machine learning project might use just one feature. However, typical machine learning projects have several features - maybe even millions.

In our example above, our features would be things that describe the email we are trying to categorize as spam/not spam, such as:
 - The sender's address
 - The time of day the email was sent
 - The words in the email
 - The words in the subject of the email

### Examples
An **example** is a data point. In the email spam filter example above, every line in the table is an **example**.

An example can be either **labeled** or **unlabeled**.

Labeled examples have both features and a label. Such examples are used to **train** the model.

Unlabeled examples contain features but no label. We use our model to make predictions about such examples.
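
As a rough sketch, an example can be represented as a small record whose label may be missing. The class name `EmailExample` and its fields are assumptions chosen to mirror the spam filter features listed earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmailExample:
    # Features: inputs that describe the email (field names are illustrative).
    sender: str
    hour_sent: int
    subject: str
    body: str
    # Label: None means the example is unlabeled.
    label: Optional[str] = None

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


labeled = EmailExample("promo@example.com", 3, "You WON!", "Claim your prize now", label="spam")
unlabeled = EmailExample("alice@example.com", 14, "Lunch?", "Are you free at noon?")

print(labeled.is_labeled)    # True: can be used for training
print(unlabeled.is_labeled)  # False: the model has to predict its label
```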

### Models
A model is what defines the relationship between features and the label. For our spam detection example, the model has to "learn" what kind of features are associated with emails being spam, and which are associated with emails not being spam.

**Training** a model consists of showing it **labeled examples** so that it can adjust its internal parameters to learn the relationship between the features and the labels it is shown.

**Inference** means using the model with the internal parameters that it has "learned" to make predictions on **unlabeled examples**.
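
To make training and inference concrete, here is a deliberately tiny sketch. It is not how real spam filters work: the "model" is just a pair of word counters, and the functions `train` and `predict` and the four training emails are invented for illustration.

```python
from collections import Counter


def train(labeled_examples):
    """Count how often each word appears in spam and in non-spam emails.

    The two counters play the role of the model's internal parameters.
    """
    counts = {"spam": Counter(), "not spam": Counter()}
    for text, label in labeled_examples:
        counts[label].update(text.lower().split())
    return counts


def predict(model, text):
    """Inference: pick the label whose training emails shared more words with the text."""
    words = text.lower().split()
    spam_score = sum(model["spam"][w] for w in words)
    not_spam_score = sum(model["not spam"][w] for w in words)
    return "spam" if spam_score > not_spam_score else "not spam"


training_data = [
    ("win a free prize now", "spam"),
    ("free lottery tickets", "spam"),
    ("meeting notes attached", "not spam"),
    ("lunch at noon tomorrow", "not spam"),
]
model = train(training_data)
print(predict(model, "claim your free prize"))   # spam
print(predict(model, "notes from the meeting"))  # not spam
```

Nobody wrote down a list of spam keywords here. The counts were derived from the labeled examples, which is the essential difference from the rule-based approach.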


## Types of Machine Learning

In this course, we'll cover two types of machine learning:

### Supervised Learning

In supervised learning, the data consists of both features and labels, as we have already discussed. We use these labeled examples to train a model and then we run inference on unlabeled examples.

There are two types of supervised learning models, depending on the nature of the label.

If the label is a **continuous value** then a **regression** model has to be used. For example:

 - Price of a house
 - Average temperature the following day
 - The salary of a software engineer

If the label has **discrete values** then a **classification** model has to be used. For example:

 - Whether a passenger of the Titanic survived
 - Whether the animal in a photo is a cat or a dog
 - The risk of granting a loan: high, medium or low
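
The difference between the two kinds of output can be sketched with two very simple models. Both functions and all the numbers below are illustrative assumptions: `fit_line` is an ordinary least-squares line fit, and `nearest_neighbour_label` copies the label of the closest training example.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbour_label(labeled_points, query):
    """Classification: return the label of the closest training example."""
    closest = min(labeled_points, key=lambda point: abs(point[0] - query))
    return closest[1]


# Regression: the label is a continuous value (made-up house sizes and prices).
slope, intercept = fit_line([50.0, 100.0, 150.0], [100.0, 200.0, 300.0])
print(slope * 120.0 + intercept)  # 240.0

# Classification: the label is one of a fixed set of values (made-up weights in kg).
animals = [(4.0, "cat"), (5.0, "cat"), (20.0, "dog"), (30.0, "dog")]
print(nearest_neighbour_label(animals, 6.5))  # cat
```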

❓ What kind of model is required for our spam filter example above? Regression or classification?

### Unsupervised Learning

In unsupervised learning, the data does not have any labels. It is the responsibility of the algorithm you use to find patterns in the features and group similar examples together.

Examples of unsupervised learning are:

 - Clustering (e.g. K-Means, Hierarchical)
 - Anomaly detection
 - Dimensionality reduction
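
As an illustration of clustering, here is a minimal one-dimensional K-Means sketch. The function name `k_means_1d`, the data points, and the starting centroids are assumptions made for this example. Real implementations handle many dimensions and choose starting centroids more carefully.

```python
def k_means_1d(points, initial_centroids, iterations=10):
    """Minimal 1-D K-Means: alternate between assigning points and moving centroids."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if no point was assigned to it
                centroids[i] = sum(cluster) / len(cluster)
    return centroids, clusters


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # no labels, just feature values
centroids, clusters = k_means_1d(points, initial_centroids=[0.0, 6.0])
print([round(c, 2) for c in centroids])  # [1.0, 5.0]
print(clusters)                          # [[1.0, 1.2, 0.8], [5.0, 5.2, 4.8]]
```

The algorithm was never told which points belong together; it found the two groups from the feature values alone.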


## Other Terms You Might Encounter

### Artificial Intelligence

> the theory and development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. --- Oxford Dictionary

### Deep Learning

Neural networks are a family of models that are loosely inspired by how the human brain works. There are many different kinds of neural networks. Deep learning refers to working with neural networks that have many layers.

### Natural Language Processing

NLP is the ability to analyze and understand human language using computer programs.

### Reinforcement Learning

Reinforcement learning is all about an **agent** learning how to perform a particular task by trial and error, and receiving feedback after every action.


*How they are related to each other:*


![AI ML DL NLP RL](ai-ml-dl-nlp-rl.png)


## When does Machine Learning make sense?
1. You have a lot of data available.
2. There is an underlying pattern in the data.
3. It is not possible to make a rule-based system.

If you do not have enough data available, then you will not be able to train a model that performs well. The more data a model sees, the better its chances are of making high quality predictions. There are data augmentation techniques available that let you increase the amount of data to train your model with, without having to actually collect more data.

If there is no pattern in the data that you use to train a model, the model will never complain about this. You will just get a model whose predictions are as good as random noise. You can perform **exploratory data analysis** on the data you have collected to see if there is any pattern.

Machine learning systems take a lot of time and effort to build. There is a significant amount of time required to collect data, clean and transform it, and train a model. It is generally easier and cheaper to build rule-based systems. If it is feasible to solve your problem using a rule-based system, it would not be wise to use machine learning just for the heck of it.


## A Machine Learning Pipeline

### Data Collection

The first step after you have decided on the problem you want to solve is to collect the data you need. Machine learning needs a lot of data, and if you realize you don't have enough data available, it wouldn't make sense to go ahead. Of course, how much data is "enough" depends on many factors, such as the complexity of the problem you are trying to solve and the complexity of the kind of model you choose.

### Data Pre-processing

Machine learning models work only with numbers. Any kind of data you have first needs to be converted to a numerical format before it can be used to train a machine learning model.

For example, in the Titanic dataset that we have been using, the **Sex** column has the text 'male' and 'female'. We need to convert this to numbers like this before we can use it in a machine learning model:

![Data preprocessing on Sex column of Titanic Dataset](data-preprocessing.png)

We'll take a look at the different ways in which we can do this in more detail later.
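
In the meantime, the simplest version of this conversion can be sketched as a lookup table. The mapping below is an assumption: which number stands for which value is arbitrary, as long as it is applied consistently.

```python
# Assumed mapping for illustration; the choice of 0 and 1 is arbitrary.
SEX_TO_NUMBER = {"male": 0, "female": 1}

sex_column = ["male", "female", "female", "male"]
encoded = [SEX_TO_NUMBER[value] for value in sex_column]
print(encoded)  # [0, 1, 1, 0]
```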

### Train/Test Split

We want to train a model that can do well at making predictions on data it has never seen before. To do this, we set aside a certain percentage of the data as **test data**, and refer to the remaining data as **training data**.

<img src="train_test_split.png" width="300px" />

Only the **training data** is used to train the model. After training is complete, the **test data** is then used to evaluate the performance of the model. Since the model has not seen the test data during training, it is as good as performing predictions on unseen data.

If we have a lot of training data and little test data, we will be able to train a better model (since we have more data to train with). However, we won't be able to evaluate our model well enough.

In contrast, if we have less training data and more test data, we won't be able to train a good model (since we don't have enough data to train with), but we'll get a better estimate of how our model performs on data it hasn't seen before.

A typical split is **80% training data** and **20% test data**.
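
A split like this can be sketched in a few lines. The helper `split_train_test` is written by hand for illustration (libraries such as scikit-learn ship their own version), and the seed value is an arbitrary choice that makes the shuffle repeatable.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples, then hold out a fraction of them as test data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


examples = list(range(10))
train_data, test_data = split_train_test(examples)
print(len(train_data), len(test_data))  # 8 2
```

Shuffling before splitting matters when the data is stored in some order, for example sorted by date or by label. Without it, the test data might not resemble the training data.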

### Model Training

Model training is where the computer does most of the work. It looks at your data and tries to estimate parameters of the model that make it learn the relationship between input and output. During model training, you need to provide three things:

 - A model to train
 - A learning algorithm to train the model
 - Data to train the model on

A model has **parameters** and **hyperparameters**. During model training, the **parameters** of a model are updated to fit the data. A model's **hyperparameters** are set before training begins and are not updated by the training process, so additional steps have to be carried out to find the best values of **hyperparameters** for a model.
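
The distinction can be sketched with a one-parameter model trained by gradient descent. Everything here is an illustrative assumption: the model is `y = w * x`, the learning algorithm is plain gradient descent on the mean squared error, and the default values of `learning_rate` and `epochs` were picked by hand.

```python
def train_linear_model(xs, ys, learning_rate=0.01, epochs=200):
    """Fit y = w * x with gradient descent on the mean squared error.

    learning_rate and epochs are hyperparameters: chosen before training.
    w is a parameter: updated from the data during training.
    """
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated from y = 2 * x
w = train_linear_model(xs, ys)
print(round(w, 3))  # 2.0
```

Training recovers `w` from the data, but nothing in the loop changes `learning_rate` or `epochs`. Trying several values for them and comparing the resulting models is one of the "additional steps" mentioned above.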

### Model Evaluation

There are different evaluation metrics available that you can choose from based on the kind of problem you are solving, for example:

 - Accuracy
 - Mean Squared Error
 - Precision/Recall/F1 Score
 - R-squared
 - Intersection over Union (IoU)
 - ROC-AUC
 - Mutual Information
 
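During the model training process, the aim is to minimize or maximize the chosen evaluation metric, depending on the metric: an error such as mean squared error should be as low as possible, while a score such as accuracy should be as high as possible.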
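
A few of these metrics are simple enough to sketch by hand. The function names and the sample predictions below are made up for illustration; in practice you would normally use a library implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision: how many predicted positives were right.
    Recall: how many actual positives were found."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification metrics (1 = spam, 0 = not spam).
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))                # 0.6
print(precision_recall(y_true, y_pred))        # precision and recall are both 2/3

# Regression metric.
print(mean_squared_error([3.0, 5.0], [2.0, 7.0]))  # 2.5
```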

### Overfitting and Underfitting

If your model performs very well on your training data, but cannot make good predictions on your test data, then we say that the model is **overfitting** the training data. What this means is that the model has memorized the training data instead of learning the patterns in it. As a result, it cannot make correct predictions on data it hasn't seen before.

If your model performs poorly on both training and test data, then we say it is **underfitting**. This means that the model is not complex enough to learn the pattern in the training data.
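
These two definitions can be turned into a rough rule of thumb that compares the training score with the test score. The function `diagnose_fit` and its thresholds (`good_enough`, `max_gap`) are arbitrary assumptions for illustration; sensible values depend on the problem and the metric.

```python
def diagnose_fit(train_score, test_score, good_enough=0.8, max_gap=0.1):
    """Rough diagnosis from two scores where higher is better (e.g. accuracy)."""
    if train_score >= good_enough and train_score - test_score > max_gap:
        return "overfitting"   # good on training data, much worse on test data
    if train_score < good_enough and test_score < good_enough:
        return "underfitting"  # poor on both
    return "no obvious problem"


print(diagnose_fit(0.99, 0.60))  # overfitting
print(diagnose_fit(0.55, 0.52))  # underfitting
print(diagnose_fit(0.90, 0.88))  # no obvious problem
```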

### Model Deployment

After you have trained a model, you would want to make it available for making predictions. A very common pattern is to wrap the model with a web framework like Flask, and make the model available as a REST API.
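
The pattern can be sketched as follows. To keep the example free of third-party dependencies, this sketch uses Python's built-in `http.server` module instead of Flask; the overall shape (a `/predict` endpoint that accepts JSON and returns JSON) would be similar in Flask. The `predict` function is a hypothetical stand-in for a trained model, and the endpoint path and port are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text):
    """Stand-in for a trained model (hypothetical; a real service would load one)."""
    return "spam" if "lottery" in text.lower() else "not spam"


def handle_predict(request_body: bytes) -> bytes:
    """Turn a JSON request body into a JSON prediction response."""
    payload = json.loads(request_body)
    result = {"prediction": predict(payload["text"])}
    return json.dumps(result).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        response = handle_predict(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def make_server(port=8000):
    """Create the server; call .serve_forever() on the result to start it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server().serve_forever()` would start the service. A client could then send a POST request to `/predict` with a body such as `{"text": "Win the lottery"}` and receive the prediction as JSON.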

### Drift and Retraining

Over time, the performance of your model might deteriorate. This could be because the distribution of the data that your model has to run predictions on starts to differ from the data that it was trained on. You need to keep monitoring your deployed model for this kind of deterioration in performance. When this happens, you need to retrain your model. This might mean going all the way back to the beginning and collecting more data that represents the distribution your model is supposed to make predictions on.
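
One very simple way to watch for such a change is to compare a feature's average in recent data against its average in the training data. The sketch below is only one of many possible checks; the function names, the sample values, and the threshold of 3 standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev


def drift_score(training_values, recent_values):
    """How many training standard deviations the recent mean has moved."""
    shift = abs(mean(recent_values) - mean(training_values))
    return shift / stdev(training_values)


def has_drifted(training_values, recent_values, threshold=3.0):
    """Flag drift when the recent mean is far from the training mean."""
    return drift_score(training_values, recent_values) > threshold


training_values = [10.0, 12.0, 11.0, 13.0, 9.0]  # feature values seen during training
print(has_drifted(training_values, [11.0, 10.0, 12.0]))  # False: recent data looks similar
print(has_drifted(training_values, [20.0, 22.0, 21.0]))  # True: recent data has shifted
```

A check like this only looks at one feature's mean, so it can miss other kinds of change. Monitoring the model's actual prediction quality, when true labels become available, remains the more direct signal.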

## References
 - [Machine Learning Crash Course by Google](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology)
 - [Data Preprocessing in Data Mining & Machine Learning](https://towardsdatascience.com/data-preprocessing-in-data-mining-machine-learning-79a9662e2eb)