# Machine Learning
## <a href="#I">I What is Machine Learning?</a>
### <a href="#I.1">I.1 How We Get Machines to Learn</a>
### <a href="#I.2">I.2 Deep Learning: More Accuracy, More Math and More Computing power</a>
### <a href="#I.3">I.3 Challenges and limitations</a>
### <a href="#I.4">I.4 Machine Learning Applications</a>
## <a href="#II">II Types of Machine Learning Techniques</a>
### <a href="#II.1">II.1 Supervised Learning</a>
#### <a href="#II.1.1">II.1.1 Types of Supervised Machine Learning Techniques</a>
##### <a href="#II.1.1.1">II.1.1.1 Regression</a>
##### <a href="#II.1.1.2">II.1.1.2 Classification</a>
### <a href="#II.2">II.2 Unsupervised Learning</a>
#### <a href="#II.2.1">II.2.1 Types of Unsupervised Machine Learning Techniques</a>
##### <a href="#II.2.1.1">II.2.1.1 Clustering</a>
##### <a href="#II.2.1.2">II.2.1.2 Association</a>
##### <a href="#II.2.1.3">II.2.1.3 Anomaly detection</a>
### <a href="#II.3">II.3 Semi-Supervised Learning</a>
### <a href="#II.4">II.4 Reinforcement Learning</a>
## <a href="#III">III Datasets and Machine Learning</a>
### <a href="#III.1">III.1 Data and Data Sets</a>
#### <a href="#III.1.1">III.1.1 Independent vs Dependent Variables</a>
#### <a href="#III.1.2">III.1.2 The different kind of variables</a>
##### <a href="#III.1.2.1">III.1.2.1 Categorical or qualitative variable</a>
##### <a href="#III.1.2.2">III.1.2.2 Numerical or quantitative variables</a>
### <a href="#III.2">III.2 Supervised Learning and Data Sets</a>
#### <a href="#III.2.1 ">III.2.1 Underfitting and Overfitting models</a>
### <a href="#III.3">III.3 How to get training DataSets ?</a>
#### <a href="#III.3.1">III.3.1 Python DataSets</a>
#### <a href="#III.3.2">III.3.2 Kaggle Datasets</a>
#### <a href="#III.3.3">III.3.3 Amazon Datasets</a>
#### <a href="#III.3.4">III.3.4 UCI Machine Learning Repository</a>
#### <a href="#III.3.5">III.3.5 Google’s Datasets Search Engine</a>
## <a href="#IV">IV Machine Learning with Python</a>
### <a href="#IV.1">IV.1 scikit-learn</a>
### <a href="#IV.2">IV.2 TensorFlow</a>
### <a href="#IV.3">IV.3 Theano</a>
### <a href="#IV.4">IV.4 Keras </a>
### <a href="#IV.5">IV.5 PyTorch</a>

# Machine Learning
<a id="I"></a>
## I What is Machine Learning?

<blockquote><i><b>Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.</b></i></blockquote>

The above definition encapsulates the ideal objective or ultimate aim of machine learning, as expressed by many researchers in the field. 

__Note__: __Machine learning__ is a subset of __AI__.

<a id="I.1"></a>
### I.1 How We Get Machines to Learn

There are different approaches to getting machines to learn, from using basic __decision trees__ to __clustering__ to layers of __artificial neural networks__ (the latter of which has given way to __deep learning__), depending on what task you’re trying to accomplish and the type and amount of data that you have available. 

There are many different types of machine learning algorithms, with hundreds published each year, and they’re typically grouped either by learning style (e.g. __supervised learning__, __unsupervised learning__, __semi-supervised learning__) or by similarity in form or function (e.g. __classification__, __regression__, __clustering__). <br>
Regardless of learning style or function, all combinations of machine learning algorithms consist of the following:

1. __Representation__: a set of classifiers or the language that a computer understands
2. __Evaluation__: objective/scoring function
3. __Optimization__
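
The three components listed above can be illustrated with a minimal sketch. It assumes a one-parameter linear model, a mean squared error score and plain gradient descent; the data, function names and hyperparameters are made up for illustration and are only one of many possible choices.

```python
# Toy data following y = 3 * x (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]


def predict(w, x):
    """Representation: the family of models y = w * x."""
    return w * x


def mean_squared_error(w):
    """Evaluation: score a candidate w on the data (lower is better)."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit(learning_rate=0.01, steps=500):
    """Optimization: gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        gradient = sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * gradient
    return w


w = fit()
print(round(w, 3), round(mean_squared_error(w), 6))
```

Swapping any one of the three pieces (a different model family, a different score, a different search procedure) gives a different learning algorithm.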

<a id="I.2"></a>
### I.2 Deep Learning: More Accuracy, More Math and More Computing power

__Deep learning__ is a subset of machine learning. Usually, this term refers to __Deep Artificial Neural Networks__ (__DANN__).
__DANN__ are a set of algorithms that have set new records in accuracy for many important problems, such as image recognition, sound recognition, recommender systems, natural language processing etc. For example, deep learning is part of DeepMind’s well-known AlphaGo algorithm, which beat the former world champion at Go in early 2016.

_Deep_ is a technical term: it refers to the number of layers in a neural network. A deep network has more than one __hidden layer__. Multiple hidden layers allow deep neural networks to learn features of the data in a so-called feature hierarchy, because simple features (e.g. two pixels) recombine from one layer to the next, to form more complex features (e.g. a line). <br>
Nets with many layers pass input data (features) through more mathematical operations than nets with few layers, and are therefore more computationally intensive to train. <br>
Computational intensity is one of the hallmarks of deep learning, and it is one reason why chips called GPUs are in demand to train deep-learning models.
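
To make the cost argument concrete, here is a rough sketch that pushes the same input through a shallow and a deeper network and counts the multiplications. The layer sizes, random weights and `tanh` activation are arbitrary assumptions, biases are left out, and no training happens; it only shows how the work grows with the number of layers.

```python
import math
import random


def make_layer(n_inputs, n_outputs, rng):
    """Return n_outputs rows of n_inputs random weights."""
    return [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_outputs)]


def forward(layers, features):
    """Pass the features through every layer; return the output and the multiplication count."""
    activations = features
    multiplications = 0
    for weights in layers:
        next_activations = []
        for row in weights:
            total = sum(w * a for w, a in zip(row, activations))
            multiplications += len(row)
            next_activations.append(math.tanh(total))
        activations = next_activations
    return activations, multiplications


rng = random.Random(0)
features = [0.5, -0.2, 0.1, 0.9]

# One hidden layer of 8 units.
shallow = [make_layer(4, 8, rng), make_layer(8, 1, rng)]
# Three hidden layers of 8 units each.
deep = [make_layer(4, 8, rng), make_layer(8, 8, rng),
        make_layer(8, 8, rng), make_layer(8, 1, rng)]

_, shallow_cost = forward(shallow, features)
_, deep_cost = forward(deep, features)
print(shallow_cost, deep_cost)
```

Real networks have thousands of units per layer and repeat this pass for every training example, which is where parallel hardware pays off.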

<a id="I.3"></a>
### I.3 Challenges and limitations

The two biggest, historical (and ongoing) problems in machine learning have involved __overfitting__ (in which the model exhibits bias towards the training data and does not generalize to new data) and __dimensionality__ (algorithms with more features work in higher/multiple dimensions, making understanding the data more difficult). <br>
Having access to a large enough data set is, in some cases, also a primary problem.

<a id="I.4"></a>
### I.4 Machine Learning Applications

Here are a few popular applications of data science.

- __Fraud detection__: many financial institutions deploy data science in their validation pipeline to automatically flag suspicious credit or debit card transactions.
In this situation the features are, for instance: location of the transaction, amount, credit score of participants, credit card status, …
And the result is a binary classification.
An extension of this application could be loan request approval, where a financial institution could decide to approve or turn down loan requests by customers based on patterns learnt from past loan applications.


- __Recommender systems__ (__recsys__): recommender systems provide product recommendation to users based on past preferences or other user information. In the e-commerce sector, _recsys_ are used to automatically recommend products to users based on what has been purchased.


- __Image recognition__: here the challenge is to correctly identify objects in an image. Face recognition, for instance, is used today by many security systems.


- __Digital advertising__: the idea here is to provide targeted personalized ads which have a far greater conversion rate compared to traditional advertising. Instead of the "one size fits all" approach of traditional advertisements, digital adverts display ads that are relevant to an individual user.


- __Email Spam__ and __Malware Filtering__: thousands of malware samples are detected every day, and each piece of code is 90–98% similar to its previous versions. Anti-malware systems powered by machine learning learn the coding pattern and can therefore easily detect new malware with 2–10% variation and offer protection against it.


<a id="II"></a>
## II Types of Machine Learning Techniques

Based on the kind of data available and the research question at hand, a data scientist will choose an algorithm based on a specific learning model.

In a __supervised learning model__, the algorithm learns on a labeled dataset, providing an answer key that the algorithm can use to evaluate its accuracy on training data. <br>

An __unsupervised learning model__, in contrast, provides unlabeled data that the algorithm tries to make sense of by extracting features and patterns on its own.<br>

A __semi-supervised learning model__ takes a middle ground. It uses a small amount of labeled data bolstering a larger set of unlabeled data. <br>

Finally, a __reinforcement learning model__ trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation.<br>

During this presentation, we will focus on the 2 main kinds of learning models: supervised and unsupervised.

<a id="II.1"></a>
### II.1 Supervised Learning

If you’re learning a task under supervision, someone is present judging whether you’re getting the right answer. Similarly, in __supervised learning__, that means having a full set of __labeled data__ while training an algorithm.

Fully labeled means that _each example in the training dataset is tagged with the answer the algorithm should come up with on its own_. So, a labeled dataset of flower images would tell the model which photos were of roses, daisies and daffodils. When shown a new image, the model compares it to the training examples to predict the correct label.

<img src="nbimages/dataset2.png" alt="Supervised" title="Supervised" width=400 height=400 />
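
As a hedged illustration of the flower example, the sketch below uses a 1-nearest-neighbor rule, one of the simplest ways to "compare a new example to the training examples". Two made-up numeric measurements stand in for the image features, and the values and labels are hypothetical.

```python
import math

# Hypothetical labeled training set: (petal length cm, petal width cm) -> flower name.
training_set = [
    ((1.4, 0.2), "daisy"),
    ((1.5, 0.3), "daisy"),
    ((4.5, 1.5), "rose"),
    ((4.7, 1.4), "rose"),
    ((6.0, 2.5), "daffodil"),
    ((5.9, 2.3), "daffodil"),
]


def predict_label(features):
    """Return the label of the closest training example (1-nearest neighbor)."""
    _, label = min(training_set, key=lambda example: math.dist(example[0], features))
    return label


print(predict_label((1.6, 0.25)))
print(predict_label((5.8, 2.4)))
```

Most supervised algorithms do not store and compare raw examples like this; they fit parameters instead. The principle is the same, though: the labels in the training set drive the prediction.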

<a id="II.1.1"></a>
#### II.1.1 Types of Supervised Machine Learning Techniques

There are two main areas where supervised learning is useful: classification problems and regression problems.

<img src="nbimages/classsvsreg.png" alt="Regression/Classification" title="Regression/Classification" width=400 height=400 />

<a id="II.1.1.1"></a>
##### II.1.1.1 Regression

A __regression__ technique predicts a single continuous output value, learned from training data.

Example: You can use regression to predict the house price from training data. The input variables will be locality, size of a house, etc.
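
A minimal sketch of the house price example, assuming a single input variable (size) and ordinary least squares. The sizes and prices are invented for illustration.

```python
# Hypothetical training data: house size in square meters -> price in thousands.
sizes = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 200.0, 260.0, 310.0]

mean_size = sum(sizes) / len(sizes)
mean_price = sum(prices) / len(prices)

# Ordinary least squares for a single input variable.
slope = sum((s - mean_size) * (p - mean_price) for s, p in zip(sizes, prices)) / sum(
    (s - mean_size) ** 2 for s in sizes
)
intercept = mean_price - slope * mean_size


def predict_price(size):
    """Predict the price of a house from its size using the fitted line."""
    return intercept + slope * size


print(round(slope, 2), round(intercept, 2), round(predict_price(100.0), 2))
```

With more input variables (locality, number of rooms, ...) the same idea applies, but the fit is usually delegated to a library.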

<a id="II.1.1.2"></a>
##### II.1.1.2 Classification

__Classification__ means assigning each input to one of a fixed set of classes. If the algorithm tries to label input into two distinct classes, it is called binary classification. Selecting between more than two classes is referred to as multiclass classification.

<a id="II.2"></a>
### II.2 Unsupervised Learning

Clean, perfectly labeled datasets aren’t easy to come by. And sometimes, researchers are asking the algorithm questions they don’t know the answer to. That’s where __unsupervised learning__ comes in.

In unsupervised learning, a machine learning algorithm is handed a dataset without explicit instructions on what to do with it. The training dataset is a collection of examples without a specific desired outcome or correct answer. The ML algorithm then attempts to automatically find structure in the data by analysing the features.

__Note__: Because there is no "ground truth" element to the data, it’s difficult to measure the accuracy of an algorithm trained with unsupervised learning. But there are many research areas where labeled data is elusive, or too expensive, to get. In these cases, giving the learning model free rein to find patterns of its own may be the best option.

<a id="II.2.1"></a>
#### II.2.1 Types of Unsupervised Machine Learning Techniques

Depending on the problem at hand, the unsupervised learning model can organize the data in different ways.

<a id="II.2.1.1"></a>
##### II.2.1.1 Clustering

_Clustering_ is an important concept in unsupervised learning (it is its most common application). It mainly deals with finding a structure or pattern in a collection of uncategorized data. Clustering algorithms will process your data and find natural clusters (groups) if they exist in the data.

<img src="nbimages/clustering.png" alt="Clustering" title="Clustering" width=400 height=400 />
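
One widely used clustering algorithm is k-means. The sketch below is a bare-bones version under simplifying assumptions: the number of clusters and the starting centroids are chosen by hand, and the points are made up.

```python
import math


def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labels are involved at any point: the two groups emerge only from the distances between the points.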

<a id="II.2.1.2"></a>
##### II.2.1.2 Association

_Association_ rules allow you to establish associations amongst data objects inside large databases. This unsupervised technique is about discovering relationships between variables in large databases. By looking at a couple key attributes of a data point, an unsupervised learning model can predict the other attributes with which they’re commonly associated.
For example if you fill an online shopping cart with diapers, applesauce and sippy cups, thanks to the use of a ML algorithm the site may recommend that you add a bib and a baby monitor to your order.
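
A simple way to sketch the shopping cart example is to count how often items appear together in past orders and recommend those that co-occur often enough. The orders, the `confidence` measure as defined here and the 0.5 threshold are illustrative assumptions, not a full association rule miner.

```python
from collections import Counter
from itertools import permutations

# Hypothetical past orders (one set of items per shopping cart).
orders = [
    {"diapers", "applesauce", "bib"},
    {"diapers", "sippy cup", "bib"},
    {"diapers", "applesauce", "baby monitor"},
    {"coffee", "applesauce"},
    {"diapers", "bib"},
]

item_counts = Counter(item for order in orders for item in order)
# Count every ordered pair (a, b) of distinct items that share an order.
pair_counts = Counter(pair for order in orders for pair in permutations(sorted(order), 2))


def confidence(antecedent, consequent):
    """Share of orders containing `antecedent` that also contain `consequent`."""
    return pair_counts[(antecedent, consequent)] / item_counts[antecedent]


def recommend(item, min_confidence=0.5):
    """Items that co-occur with `item` in at least `min_confidence` of its orders."""
    return sorted(
        other for other in item_counts
        if other != item and confidence(item, other) >= min_confidence
    )


print(confidence("diapers", "bib"))
print(recommend("diapers"))
```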

<a id="II.2.1.3"></a>
##### II.2.1.3 Anomaly detection

Banks detect fraudulent transactions by looking for unusual patterns in customers’ purchasing behavior. For instance, if the same credit card is used in California and Denmark within the same day, that’s cause for suspicion. Here, an unsupervised learning model can be used to flag outliers in a dataset.
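
As a rough sketch of outlier flagging, the code below marks transaction amounts that lie far from the mean, measured in standard deviations (a z-score rule). The amounts and the threshold of 2 are assumptions for illustration; real fraud systems use many more features and more robust methods.

```python
import statistics

# Hypothetical transaction amounts for one customer.
amounts = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 950.0]


def flag_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std > threshold]


flagged = flag_outliers(amounts)
print(flagged)
```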


<a id="II.3"></a>
### II.3 Semi-Supervised Learning

__Semi-supervised learning__ is, for the most part, just what it sounds like: a training dataset with both labeled and unlabeled data.<br> 
This method is particularly useful when extracting relevant features from the data is difficult, and labeling examples is a time-intensive task for experts.<br>
The __semi-supervised learning model__ can still benefit from the small proportion of labeled data and improve its accuracy compared to a fully unsupervised model.
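
One possible way to exploit unlabeled data is self-training: label the unlabeled example you are most sure about, add it to the labeled set, and repeat. The sketch below assumes one-dimensional data and uses "closest to an already labeled value" as the notion of certainty; the values and labels are hypothetical.

```python
def self_train(labeled, unlabeled):
    """Repeatedly label the unlabeled value closest to an already labeled one."""
    labeled = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Pick the (unlabeled value, labeled value) pair with the smallest distance.
        value, neighbor = min(
            ((u, known) for u in remaining for known in labeled),
            key=lambda pair: abs(pair[0] - pair[1]),
        )
        labeled[value] = labeled[neighbor]  # pseudo-label
        remaining.remove(value)
    return labeled


labeled = {1.0: "A", 9.0: "B"}                  # small labeled set
unlabeled = [1.5, 2.0, 8.0, 8.5, 3.0, 7.0]      # larger unlabeled set
result = self_train(labeled, unlabeled)
print(result)
```

Only two examples were labeled by hand, yet every value ends up with a label that follows the structure of the unlabeled data.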

<a id="II.4"></a>
### II.4 Reinforcement Learning

In this kind of machine learning, AI agents are attempting to find the optimal way to accomplish a particular goal, or improve performance on a specific task. As the agent takes action that goes toward the goal, it receives a reward. The overall aim: predict the best next step to take to earn the biggest final reward.

To make its choices, the agent relies both on learnings from past feedback and exploration of new tactics that may present a larger payoff. This involves a long-term strategy — just as the best immediate move in a chess game may not help you win in the long run, the agent tries to maximize the cumulative reward.

It’s an iterative process: the more rounds of feedback, the better the agent’s strategy becomes. This technique is especially useful for training robots, which make a series of decisions in tasks like steering an autonomous vehicle or managing inventory in a warehouse.
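
The reward-driven loop described above can be sketched with tabular Q-learning, one common reinforcement learning algorithm. Everything here is an assumption made for illustration: a five-cell corridor as the environment, a reward of 1 for reaching the last cell, and arbitrary learning rate, discount and exploration values.

```python
import random

N_STATES = 5            # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)      # move left or right
GOAL = N_STATES - 1


def step(state, action):
    """Environment: move within the corridor, reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap the episode length
            best = max(q[(state, a)] for a in ACTIONS)
            greedy = [a for a in ACTIONS if q[(state, a)] == best]
            # Explore with probability epsilon, otherwise exploit (ties broken at random).
            action = rng.choice(ACTIONS) if rng.random() < epsilon else rng.choice(greedy)
            next_state, reward = step(state, action)
            future = 0.0 if next_state == GOAL else max(q[(next_state, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted value of the next state.
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
            if state == GOAL:
                break
    return q


q = train()
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(policy)
```

The discount factor `gamma` is what encodes the long-term view: a move is valued by the reward it eventually leads to, not only by its immediate payoff.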



<a id="III"></a>
## III Datasets and Machine Learning

One of the hardest problems to solve in machine learning or deep learning has nothing to do with the algorithm used: it’s the problem of getting the right data in the right format.

Getting the right data means gathering or identifying the data that correlates with the outcomes you want to predict; i.e. data that contains a signal about events you care about. <br>
Verifying that the data is aligned with the problem you seek to solve must be done by a __data scientist__. <br>
If you do not have the right data, then your efforts to build an AI solution must return to the data collection stage.

The right end format for machine learning is generally a __multi-dimensional array__ (a __tensor__ in deep learning). So data pipelines built for ML or DL will generally convert all data – be it images, video, sound, voice, text or time series – into vectors and tensors to which linear algebra operations can be applied. <br>
That data frequently needs to be scaled, standardized and cleaned to increase its usefulness, and those are all steps of a Data Science process.
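
As a small sketch of the scaling step, the code below standardizes two made-up columns so that each has zero mean and unit standard deviation. The column names and values are hypothetical, and the helper assumes the values are not all identical (otherwise the standard deviation is zero).

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical raw features on very different scales.
ages = [20, 30, 40, 50]
salaries = [20_000, 40_000, 60_000, 80_000]

# One row per example, one standardized column per feature.
rows = list(zip(standardize(ages), standardize(salaries)))
for row in rows:
    print(row)
```

After standardization both columns live on the same scale, so neither dominates distance or gradient computations just because of its units.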

When doing __supervised learning__, the corresponding algorithms (methods) need a good training set to work properly. Collecting and constructing the training set – a sizable body of known data – takes time and domain-specific knowledge of where and how to gather relevant information. <br>

The training set acts as the benchmark against which the __supervised learning__ algorithms are trained. That is what they learn to reconstruct before they’re unleashed on data they haven’t seen before.

At this stage, knowledgeable humans need to find the right raw data and transform it into a numerical representation that the algorithm can understand (a multi-dim array or a tensor). 

To create a useful training set, you have to understand the problem you’re solving; i.e. what you want your learning algorithm to pay attention to, which outcomes you want to predict.

<a id="III.1"></a>
### III.1 Data and Data Sets

<a id="III.1.1"></a>
#### III.1.1 Independent vs Dependent Variables

Any predictive mathematical model tends to divide the observations (data) into dependent and independent features in order to determine the causal effect.

Variables of interest in an experiment (those that are measured or observed) are called __response__ or __dependent variables__.<br>
Other variables in the experiment that affect the response and can be set or measured by the experimenter are called __predictor__, __explanatory__, __features__ or __independent variables__.<br> 

__Note__: the relationship between dependent and independent variables need not be linear; it can, for example, be polynomial.

For example, in the below data set, the independent variables are the input of the purchasing process being analyzed. The result (whether a user purchased or not) is the dependent variable.

<img src="nbimages/variable.png" alt="dependent/independent variables" title="dependent/independent variables" width=400 height=400 />
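
In code, this split usually means separating each record into a list of input features and a target value. The sketch below uses hypothetical column names and records loosely modeled on a purchasing data set; they are not taken from the figure.

```python
# Hypothetical records, loosely modeled on a purchasing data set.
records = [
    {"age": 25, "salary": 30_000, "purchased": "no"},
    {"age": 41, "salary": 72_000, "purchased": "yes"},
    {"age": 33, "salary": 54_000, "purchased": "yes"},
]

feature_names = ["age", "salary"]      # independent variables
target_name = "purchased"              # dependent variable

X = [[record[name] for name in feature_names] for record in records]
y = [record[target_name] for record in records]

print(X)
print(y)
```

The names `X` and `y` follow a common convention in Python ML code: `X` holds one row of features per example, and `y` holds the matching outcomes.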

<a id="III.1.2"></a>
#### III.1.2 The different kind of variables

<a id="III.1.2.1"></a>
##### III.1.2.1 Categorical or qualitative variable

__Categorical variables__ contain a finite number of categories or distinct groups.<br> __Categorical variables__ take on values that are names or labels. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables. In the same way __categorical predictors__ include gender, material type, and payment method.

Under categorical data we have 2 subtypes: 
1. nominal 
2. ordinal.

__Nominal data__ is where there is no innate ordering of categories (male/female for instance).
__Ordinal data__ has a concept of ordering (debug/info/warning/error/fatal for instance).

Examples of Categorical Variables:

- Class in college (e.g. freshman, sophomore, junior, senior).
- Party affiliation (e.g. Republican, Democrat, Independent).
- Type of pet owned (e.g. dog, cat, rodent, fish).
- Favorite author (e.g. Stephen King, James Patterson, Charles Dickens).
- Preferred airline (e.g. Swiss, EasyJet, Qantas).
- Hair color (e.g. blond, brunette, black).
- Types of hats (e.g. sombrero, beanie, fedora).

As a general rule, if you can’t add something, then it’s categorical. For example, you can’t add cat + dog.
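
The practical difference between the two subtypes shows up in what you can compute. A brief sketch, using two of the examples listed above with made-up values: ordinal categories can be ranked and sorted, while nominal categories can only be compared for equality and counted.

```python
from collections import Counter

# Ordinal: the categories have a rank, so they can be compared and sorted.
CLASS_ORDER = ["freshman", "sophomore", "junior", "senior"]
rank = {name: position for position, name in enumerate(CLASS_ORDER)}

students = ["junior", "freshman", "senior", "sophomore"]
sorted_students = sorted(students, key=rank.get)
most_advanced = max(students, key=rank.get)

# Nominal: no order exists, so only equality tests and counts make sense.
pets = ["dog", "cat", "dog", "fish"]
pet_counts = Counter(pets)

print(sorted_students, most_advanced)
print(pet_counts.most_common(1))
```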

<a id="III.1.2.2"></a>
##### III.1.2.2 Numerical or quantitative variables

__Numerical__ or __quantitative variables__ are variables whose values represent a measurable quantity.  

There are 2 subtypes of numerical variables: 
1. discrete 
2. continuous.

__Discrete variables__ are numeric variables that have a countable number of values between any two values. Discrete variables take on integer values. For example, the number of rooms in a house. 

__Continuous variables__ are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or date/time. For example, the length of a part or the date and time a payment is received.

Examples of Quantitative Variables / Numeric Variables:

- High school Grade Point Average (e.g. 4.0, 3.2, 2.1).
- Number of pets owned (e.g. 1, 2, 4).
- Bank account balance (e.g. 100, 987, -42).
- Number of stars in a galaxy (e.g. 100, 2301, 1 trillion) .
- Average number of lottery tickets sold (e.g. 25, 2.789, 2 million).
- How many cousins you have (e.g. 0, 12, 22).
- The amount in your paycheck (e.g. 200, 1.457, 2.222).

General rule of thumb: if you can add it, it’s quantitative.

<a id="III.2"></a>
### III.2 Supervised Learning and Data Sets

Supervised learning typically works with two data sets: __training__ and __test__. 

- The first set you use is the __training set__ (the larger of the two). Running a training set through a supervised learning algorithm, a neural network for instance, teaches the net how to weight different features, adjusting coefficients according to their likelihood of minimizing errors in your results.

<img src="nbimages/dataset.png" alt="Learning phase" title="Learning phase" width=400 height=400 />

- The second set is your __test set__. It functions as a seal of approval, and you don’t use it until the end. After you’ve trained and optimized your model, you test your supervised learning algorithm against this final random sampling. The results it produces should validate that it works "properly" (with a reasonable level of accuracy).
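
A minimal sketch of how such a split can be made, assuming a random 80/20 partition. The helper name `split_train_test`, the fraction and the seed are illustrative choices; libraries such as scikit-learn ship their own, more complete utilities for this.

```python
import random


def split_train_test(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out the last `test_fraction` as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


examples = [(x, 2 * x) for x in range(10)]   # made-up (feature, label) pairs
train_set, test_set = split_train_test(examples)
print(len(train_set), len(test_set))
```

Shuffling before splitting matters when the data is stored in some meaningful order, for example sorted by date or by label.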

<a id="III.2.1 "></a>
#### III.2.1 Underfitting and Overfitting models

We require that the model learn from known examples and __generalize__ from those known examples to new examples in the future. 

Too little learning and the model will perform poorly on the training dataset and on new data. The model will __underfit__ the problem. <br>

Too much learning and the model will perform well on the training dataset and poorly on new data, the model will __overfit__ the problem. <br> 

This happens because your model is trying too hard to capture the noise in your training dataset. By noise we mean here the data points that don’t really represent the true properties of your data.
__Overfitting__ may occur when you have a very complex model (for example, the number of features is large) or when the number of labeled examples is very low.

A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and __overfit__ the training dataset. Both cases result in a model that does not __generalize__ well.
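
The contrast can be sketched on a tiny made-up data set that follows y = 2x plus a fixed "noise" term of +1 or -1. The three models below are illustrative stand-ins: a constant predictor for too little capacity, a least-squares line for a reasonable fit, and a lookup of the nearest training point for a model that memorizes the noise.

```python
def mse(model, xs, ys):
    """Mean squared error of `model` on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: y = 2x plus a fixed "noise" term of +1 or -1.
train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [1, 1, 5, 5, 9, 9, 13, 13]
test_x = [0.4, 1.4, 2.4, 3.4, 4.4, 5.4, 6.4]
test_y = [-0.2, 3.8, 3.8, 7.8, 7.8, 11.8, 11.8]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y)) / sum(
    (x - mean_x) ** 2 for x in train_x
)

models = {
    # Too little capacity: always predicts the average training label.
    "constant": lambda x: mean_y,
    # Reasonable capacity: a straight line fitted by least squares.
    "linear": lambda x: mean_y + slope * (x - mean_x),
    # Too much capacity: returns the label of the nearest training point, noise included.
    "memorizer": lambda x: train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))],
}

results = {name: (mse(m, train_x, train_y), mse(m, test_x, test_y)) for name, m in models.items()}
for name, (train_error, test_error) in results.items():
    print(f"{name:10s} train={train_error:6.2f} test={test_error:6.2f}")
```

The constant model is poor on both sets (underfitting). The memorizer is perfect on the training set but clearly worse than the line on the test set (overfitting).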

There are two ways to approach an underfitting model:

1. Ensure that your model is sufficiently complex, which you can accomplish by adding features.
2. Change the data preprocessing steps.

There are two ways to approach an overfitting model:

1. Train the algorithm on more examples.
2. Reduce the complexity of the model or of the training data, for example by using fewer features.

<a id="III.3"></a>
### III.3 How to get training DataSets ?

Various repositories of open data sets are available: 

<a id="III.3.1"></a>
#### III.3.1 Python DataSets

Several Python "toolkits" (Scikit Learn, Seaborn, PyDataset...) provide facilities to get test datasets.<br>

__Seaborn__ provides the __load_dataset()__ function to load a dataset from their online repository. The function __get_dataset_names()__ will return a list of available datasets ('anscombe', 'attention','brain_networks','car_crashes', ... ,'tips','titanic').

With __Scikit-Learn__, the __sklearn.datasets__ module provides utilities to load datasets, including functions (load_iris(), load_diabetes(), ...) to load and fetch popular reference datasets.

__pydataset__ has about 757 (mostly numerical-based) datasets, that are based on R Datasets. 
To load a dataset you simply use the __data()__ function:<br> _data('titanic', show_doc=True)_


<a id="III.3.2"></a>
#### III.3.2 Kaggle Datasets

(https://www.kaggle.com/datasets)

Here, each dataset is a small community where you can have a discussion about data, find some public code or create your own projects. Sometimes you can find notebooks with algorithms that solve the prediction problem in this specific dataset.


<a id="III.3.3"></a>
#### III.3.3 Amazon Datasets

(https://registry.opendata.aws) [Registry of Open Data on AWS]

This source contains many datasets in different fields, such as public transport, ecological resources and satellite images. It also has a search engine to help you find the dataset you are looking for, as well as a description and usage examples for all datasets.
The datasets are stored in Amazon Web Services (AWS) resources.

<a id="III.3.4"></a>
#### III.3.4 UCI Machine Learning Repository

(https://archive.ics.uci.edu/ml/index.php) 

Another great repository of datasets, from the University of California, Irvine (School of Information and Computer Sciences). It classifies the datasets by the type of machine learning problem. <br>
You can find univariate and multivariate time-series datasets, as well as datasets for classification, regression or recommendation systems. <br>
Some of the datasets at UCI are already cleaned and ready to be used.

<a id="III.3.5"></a>
#### III.3.5 Google’s Datasets Search Engine

(https://toolbox.google.com/datasetsearch)

In late 2018, Google launched a new service: a toolbox that can search for datasets by name.<br> 
Their aim is to unify tens of thousands of different repositories for datasets and make that data discoverable. 

#### ...


<a id="IV"></a>
## IV Machine Learning with Python
<a id="IV.1"></a>
### IV.1 scikit-learn

The __scikit-learn__ library is one of the most popular ML libraries, if not the most popular, across all languages.<br>
It has a huge number of features for data mining and data analysis, making it a top choice for researchers and developers alike.

It's built on top of the popular NumPy, SciPy, and matplotlib libraries. 
 
One of its best features is great documentation and tons of tutorials. Thanks to the library's popularity you won't have much trouble finding resources to show you how to get your models up and running.

<a id="IV.2"></a>
### IV.2 TensorFlow

__TensorFlow__ is a high-level neural network library that helps you program your network architectures while avoiding the low-level details. 
The focus is more on allowing you to express your computation as a data flow graph, which is much more suited to solving complex problems.

It is mostly written in C++ and exposed through Python bindings, so you don't have to worry about sacrificing performance. 

You can deploy your model to one or more CPUs or GPUs in a desktop, server, or mobile device all with the same API. 

It was developed for the Google Brain project.

<a id="IV.3"></a>
### IV.3 Theano

__Theano__ is a machine learning library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays. 

Like scikit-learn, Theano tightly integrates with NumPy. 

The transparent use of the GPU makes Theano fast and painless to set up, which is pretty crucial for those just starting out. Some have described it as more of a research tool than a production tool, so use it accordingly.

__Note__: Theano is no longer in active development.

<a id="IV.4"></a>
### IV.4 Keras 

__Keras__ is an open source high-level neural network library written in Python. It is capable of running on top of TensorFlow or Theano. It is designed to enable fast experimentation with deep neural networks.

<a id="IV.5"></a>
### IV.5 PyTorch

__PyTorch__ is an open source machine learning library for Python, based on Torch (a scientific computing framework). It is used for applications such as natural language processing and was developed by Facebook’s AI research group.  

PyTorch is, compared to Keras, a low-level library.