# Giving Computers the Ability to Learn from Data

__Machine Learning__ is the application and science of algorithms that make sense of data. We are living in an age where data comes in abundance; using self-learning algorithms from the field of machine learning, we can turn this data into knowledge. 

In this Jupyter Notebook, we will learn about the main concepts and different types of machine learning. Together with a basic introduction to the relevant terminology, we will lay the groundwork for successfully using machine learning techniques for practical problem solving. 

We will cover the following topics: 
* The general concepts of machine learning
* The three types of learning and basic terminology
* The building blocks for successfully designing machine learning systems
* Installing and setting up Python for data analysis and machine learning

# Building intelligent machines to transform data into knowledge

In this age of modern technology, there is one resource that we have in abundance: a large amount of structured and unstructured data. In the second half of the twentieth century, machine learning evolved as a subfield of **Artificial Intelligence** that involved **self-learning algorithms that derived knowledge from data in order to make predictions**. Instead of requiring humans to manually derive rules and build models from analyzing large amounts of data, **machine learning offers a more efficient alternative for capturing the knowledge in data to gradually improve the performance of predictive models and make data-driven decisions**. Thanks to machine learning, we enjoy robust email spam filters, convenient text and voice recognition software, reliable web search engines, challenging chess-playing programs, and, hopefully soon, safe and efficient self-driving cars.

# The three different types of machine learning

There are three types of machine learning: **Supervised Learning**, **Unsupervised Learning**, and **Reinforcement Learning**.

<img src='01_01.png'>

## Supervised Learning

The main goal of supervised learning is to learn a model from **labeled training data** that allows us to make predictions about unseen or future data. Here, the term **supervised** refers to a set of samples where the desired output signals (labels) are already known.

Considering the example of email filtering, we can train a model using a supervised machine learning algorithm on a corpus of labeled emails (emails that are correctly marked as spam or not-spam) to predict whether a new email belongs to either of the two categories.

A supervised learning task with discrete class labels, such as in the previous email spam filtering example, is also called a **classification task**.
Another subcategory of supervised learning is **regression**, where the outcome signal is a continuous value.

<img src='01_02.png'>

### Classification for predicting class labels

**Classification is a subcategory of supervised learning where the goal is to predict the categorical class labels of new instances, based on past observations**. Those class labels are discrete, unordered values that can be understood as the group memberships of the instances. The previously mentioned example of email spam detection represents a typical example of a binary classification task, where the machine learning algorithm learns a set of rules in order to distinguish between two possible classes: spam and not-spam emails. 

However, the set of class labels does not have to be of a binary nature. The predictive model learned by a supervised learning algorithm can assign any class label that was presented in the training dataset to a new, unlabeled instance. A typical example of a **multiclass classification** task is handwritten character recognition. Here, we could collect a training dataset that consists of multiple handwritten examples of each letter in the alphabet. Now, if a user provides a new handwritten character via an input device, our predictive model will be able to predict the correct letter in the alphabet with a certain accuracy. However, our machine learning system would be unable to correctly recognize any of the digits zero to nine, for example, if they were not part of our training dataset.

The following figure illustrates the concept of a binary classification task given 30 training samples; 15 samples are labeled as the negative class (minus signs) and 15 samples are labeled as the positive class (plus signs). In this scenario, our dataset is two-dimensional, which means that each sample has two values associated with it: **x1** and **x2**. Now, we can use a supervised machine learning algorithm to learn a rule, the **decision boundary** represented as a dashed line, that can separate those two classes and classify new data into each of those two categories given its x1 and x2 values.

<img src='01_03.png'>
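To make the idea of a learned decision boundary concrete, here is a minimal sketch that assumes a nearest-centroid rule, which is only one of many possible classifiers and is not necessarily the one shown in the figure. The toy data, the class centers, and the `predict` helper are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: 15 negative and 15 positive samples,
# each described by two features (x1, x2).
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(15, 2))
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(15, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 15 + [1] * 15)  # 0 = negative class, 1 = positive class

# "Training": store the mean (centroid) of each class.
centroids = np.array([X[y == label].mean(axis=0) for label in (0, 1)])

def predict(samples):
    """Assign each sample to the class with the closest centroid."""
    samples = np.atleast_2d(samples)
    # distances has shape (n_samples, n_classes)
    distances = np.linalg.norm(
        samples[:, None, :] - centroids[None, :, :], axis=2
    )
    return distances.argmin(axis=1)

# Classify two new, unlabeled points from their x1 and x2 values.
print(predict([[-1.5, -2.5], [1.0, 3.0]]))
```

With this rule, the decision boundary is the straight line halfway between the two class centroids, so a new point is labeled according to the side of the line on which it falls.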

### Regression for predicting continuous outcomes

A second type of supervised learning is the prediction of continuous outcomes, which is also called **regression analysis**. In the regression analysis, we are given a number of predictor (**explanatory**) variables and a continuous response variable (**outcome** or **target**), and **we try to find a relationship between those variables that allows us to predict an outcome**. 
For example, let's assume that we are interested in predicting the math SAT scores of our students. If there is a relationship between the time spent studying for the test and the final scores, we could use it as training data to learn a model that uses the study time to predict the test scores of future students who are planning to take this test.

<img src='01_04.png'>
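As a rough sketch of the study-time example, we can fit a straight line with ordinary least squares using NumPy's `polyfit`. The hours and scores below are invented purely for illustration, and a straight line is an assumption about the shape of the relationship.

```python
import numpy as np

# Hypothetical training data: hours studied and the resulting test scores.
hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
scores = np.array([450.0, 500.0, 540.0, 610.0, 650.0])

# Fit score = slope * hours + intercept by least squares.
slope, intercept = np.polyfit(hours, scores, deg=1)

def predict_score(study_hours):
    """Predict the test score of a future student from the study time."""
    return slope * study_hours + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"predicted score for 7 hours: {predict_score(7.0):.1f}")
```

Here the explanatory variable is the study time and the continuous response variable is the test score.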

## Solving interactive problems with reinforcement learning

Another type of machine learning is **reinforcement learning**. In reinforcement learning, **the goal is to develop a system (*agent*) that improves its performance based on interactions with the environment**. Since the information about the current state of the environment typically also includes a so-called **reward signal**, **we can think of reinforcement learning as a field related to supervised learning**. However, **in reinforcement learning this feedback is not the correct ground truth label or value, but a measure of how well the action was measured by a reward function**. Through its interaction with the environment, **an agent can then use reinforcement learning to learn a series of actions that maximizes this reward via an exploratory trial-and-error approach or deliberative planning**.

A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series of moves depending on the state of the board (the environment), and the reward can be defined as **win** or **lose** at the end of the game.

<img src='01_05.png'>

There are many different subtypes of reinforcement learning. However, a general scheme is that the agent in reinforcement learning tries to maximize the reward by a series of interactions with the environment. Each state can be associated with a positive or negative reward, and a reward can be defined as accomplishing an overall goal, such as winning or losing a game of chess. For instance, in chess the outcome of each move can be thought of as a different state of the environment. To explore the chess example further, let's think of visiting certain locations on the chess board as being associated with a positive event, for instance, removing an opponent's chess piece from the board or threatening the queen. Other positions, however, are associated with a negative event, such as losing a chess piece to the opponent in the following turn. Now, not every turn results in the removal of a chess piece, and reinforcement learning is concerned with learning the series of steps by maximizing a reward based on immediate and delayed feedback.
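Chess is far too large for a short example, so the following sketch uses a hypothetical five-cell corridor instead. The agent starts in cell 0 and only receives a reward when it reaches cell 4, which makes the feedback delayed. The sketch assumes tabular Q-learning, one specific reinforcement learning algorithm that is not named in the text above; the environment, the constants, and the `step` function are illustrative.

```python
import random

N_STATES = 5        # positions 0..4; reaching position 4 ends the episode
ACTIONS = (0, 1)    # 0 = step left, 1 = step right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]

for _ in range(500):
    state, done = 0, False
    while not done:
        # Explore by trial and error: pick a random action.
        action = random.choice(ACTIONS)
        next_state, reward, done = step(state, action)
        # Move the estimate toward reward plus discounted future value.
        future = 0.0 if done else GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (reward + future - Q[state][action])
        state = next_state

# Greedy policy: in each non-terminal state, take the highest-valued action.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Although only the final step is rewarded, the discounted updates propagate that reward backwards, so the learned policy prefers stepping right in every cell.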

## Discovering hidden structures with unsupervised learning

In **supervised learning**, we know the right answer beforehand when we train our model, and in **reinforcement learning**, we define a measure of reward for particular actions by the agent. In **unsupervised learning**, however, we are dealing with **unlabeled data** of unknown structure. Using unsupervised learning techniques, **we are able to explore the structure of our data to extract meaningful information without the guidance of a known outcome variable or reward function**. 

### Finding subgroups with clustering

**Clustering** is an exploratory data analysis technique that **allows us to organize a pile of information into meaningful subgroups (*clusters*) without having any prior knowledge of their group memberships**. Each cluster that arises during the analysis defines a group of objects that share a certain degree of similarity but are more dissimilar to objects in other clusters, which is why clustering is also sometimes called **unsupervised classification**. Clustering is a **great technique for structuring information and deriving meaningful relationships from data**. For example, it allows marketers to discover customer groups based on their interests, in order to develop distinct marketing programs.

The following figure illustrates how clustering can be applied to organizing unlabeled data into three distinct groups based on the similarity of their features x1 and x2:

<img src='01_06.png'>
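One way to find such groups is k-means clustering, sketched below under a few assumptions: the data are three hypothetical, well-separated blobs, the number of clusters is known in advance, and the centroids are initialized with a simple farthest-first heuristic. The figure above may have been produced with a different method.

```python
import numpy as np

# Hypothetical unlabeled data: three well-separated blobs in the (x1, x2) plane.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])

def kmeans(X, k, n_iter=20):
    """Minimal k-means: returns (labels, centroids)."""
    # Farthest-first initialization: start from the first sample, then
    # repeatedly add the sample farthest from all centroids chosen so far.
    chosen = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in chosen], axis=0)
        chosen.append(X[d.argmax()])
    centroids = np.array(chosen)

    for _ in range(n_iter):
        # Assignment step: label each sample with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its samples.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(X, k=3)
print(np.bincount(labels))  # number of samples assigned to each cluster
```

Note that no class labels are used at any point; the groups emerge only from the distances between samples.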

### Dimensionality reduction for data compression

Another subfield of unsupervised learning is **dimensionality reduction**. Often we are working with data of high dimensionality (**each observation comes with a high number of measurements**), which can present a challenge for limited storage space and the computational performance of machine learning algorithms. Unsupervised dimensionality reduction is a commonly used approach in feature preprocessing to **remove noise from data**, which can also degrade the predictive performance of certain algorithms, and **compress the data onto a smaller dimensional subspace while retaining most of the relevant information**.

Sometimes, **dimensionality reduction can also be useful for visualizing data**; for example, a high-dimensional feature set can be projected onto one-, two-, or three-dimensional feature spaces in order to visualize it via 3D or 2D scatterplots or histograms. The following figure shows an example where nonlinear dimensionality reduction was applied to compress a 3D Swiss Roll onto a new 2D feature subspace:

<img src='01_07.png'>
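The Swiss Roll in the figure requires a nonlinear method. As a simpler illustration of the same compression idea, the sketch below assumes linear principal component analysis (PCA) computed via the singular value decomposition. The data are hypothetical 3D points that lie close to a 2D plane.

```python
import numpy as np

# Hypothetical data: 200 samples with 3 features that lie close to a 2D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.8]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

# PCA: center the data, then use the top right-singular vectors as new axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt[:2]                # the two directions of largest variance
X_2d = X_centered @ components.T   # compressed representation

# Fraction of the total variance kept by the two retained dimensions.
retained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(X_2d.shape, f"{retained:.4f}")
```

Because the third direction carries almost only noise here, dropping it loses very little of the variance in the data.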

# Introduction of the basic terminology and notations

Now that we have discussed the three broad categories of machine learning, **supervised**, **unsupervised**, and **reinforcement learning**, let us have a look at the basic terminology that we will be using throughout this material. The following table depicts an excerpt of the **Iris dataset**, which is a classic example in the field of machine learning. The Iris dataset contains the measurements of 150 Iris flowers from three different species: Setosa, Versicolor, and Virginica. Here, each **flower sample represents one row in our dataset**, and **the flower measurements in centimeters are stored as columns**, which we also call the **features** of the dataset:

<img src='01_08.png'>

To keep the notation and implementation simple yet efficient, we will make use of some of the basics of linear algebra. In the following chapters, we will use a matrix and vector notation to refer to our data. We will follow the common convention to represent **each sample as a separate row** in a **feature matrix X**, where **each feature is stored as a separate column**. The Iris dataset consisting of 150 samples and four features can then be written as a **150x4 matrix X**.
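In NumPy, this convention maps directly onto a two-dimensional array. The sketch below uses only four illustrative rows rather than all 150 samples; the variable names `X` and `y` follow the notation above, and the values are examples in the style of the dataset.

```python
import numpy as np

# Illustrative rows in the layout described above: one row per flower,
# columns = sepal length, sepal width, petal length, petal width (in cm).
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
])
# One class label per row of X.
y = np.array(["Setosa", "Setosa", "Versicolor", "Virginica"])

print(X.shape)   # (n_samples, n_features)
print(X[0])      # feature vector of the first sample (a row)
print(X[:, 2])   # third feature of every sample (a column)
```

For the full dataset, `X.shape` would be `(150, 4)`, matching the 150x4 matrix described above.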

# A roadmap for building machine learning systems

In previous sections, we discussed the basic concepts of machine learning and the three types of learning. In this section, we will discuss the other important parts of a machine learning system accompanying the learning algorithm. The following diagram shows a typical workflow for using machine learning in predictive modeling, which we will discuss in the following subsections:

<img src='01_09.png'>

## Preprocessing - getting data into shape

Let's begin with discussing the roadmap for building machine learning systems. **Raw data rarely comes in the form and shape that is necessary for the optimal performance of a learning algorithm**. Thus, **the preprocessing of the data is one of the most crucial steps in any machine learning application**. If we take the Iris flower dataset from the previous section as an example, we can think of the raw data as a series of flower images from which we want to extract meaningful features. Useful features could be the color, the hue, the intensity of the flowers, the height, and the flower lengths and widths. **Many machine learning algorithms also require that the selected features are on the same scale for optimal performance**, which is often achieved by transforming the features into the range [0, 1], or to a **standard normal distribution with zero mean and unit variance**, as we will see in later chapters.
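The two scaling approaches mentioned above can be sketched in a few lines of NumPy. The feature matrix is hypothetical, and the sketch assumes that no feature is constant, since a constant column would cause a division by zero.

```python
import numpy as np

# Hypothetical feature matrix: two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: map each feature (column) to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: give each feature zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

After either transformation, both columns are on a comparable scale even though the raw values differ by two orders of magnitude.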

**Some of the selected features may be highly correlated and therefore redundant to a certain degree**. In those cases, **dimensionality reduction techniques are useful for compressing the features onto a lower dimensional subspace.** Reducing the dimensionality of our feature space has the advantage that **less storage space is required**, and **the learning algorithm can run much faster**. In certain cases, **dimensionality reduction can also improve the predictive performance of a model if the dataset contains a large number of irrelevant features, or noise, that is, if the dataset has a low signal-to-noise ratio**. 

To determine whether our machine learning algorithm **not only performs well on the training set but also generalizes well to new data**, we also want to **randomly divide the dataset into a separate training and test set**. We use the training set to train and optimize our machine learning model, while we keep the test set until the very end to evaluate the final model. 
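A random split can be sketched as follows. The helper name `split_train_test`, the 30 percent test fraction, and the toy data are assumptions made for illustration.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and hold out a fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical data: 10 samples with 2 features and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(X_train.shape, X_test.shape)
```

Every sample ends up in exactly one of the two sets, and the test set is put aside until the final evaluation.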

## Training and selecting a predictive model

Many different machine learning algorithms have been developed to solve different problem tasks. An important point that can be summarized from David Wolpert's famous **No free lunch theorems** is that we can't get learning for free. Intuitively, we can relate this concept to the popular saying, **_I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail_**. For example, **each classification algorithm has its inherent biases, and no single classification model enjoys superiority if we do not make any assumptions about the task**. In practice, it is therefore essential to compare at least a handful of different algorithms in order to train and select the best performing model. But before we can compare different models, we first have to decide upon a metric to measure performance. One commonly used metric is classification **accuracy**, which is defined as **the proportion of correctly classified instances**. 
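The accuracy metric defined above is straightforward to compute. In this small sketch, the function name `accuracy` and the example labels are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical labels: 4 of the 5 predictions are correct.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
```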

One legitimate question to ask is this: **how do we know which model performs well on the final test dataset and real-world data if we do not use this test set for the model selection, but keep it for the final model evaluation?** In order to address the issue embedded in this question, **different cross-validation techniques** can be used where **the training dataset is further divided into training and validation subsets in order to estimate the generalization performance of the model**. Finally, **we also cannot expect that the default parameters of the different learning algorithms provided by the software libraries are optimal for our specific problem task**. Therefore, we will make frequent use of **hyperparameter optimization techniques** that help us to **fine-tune the performance of our model** in later chapters. Intuitively, **we can think of those hyperparameters as parameters that are not learned from the data but represent the knobs of a model that we can turn to improve its performance**. 
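One common cross-validation scheme is k-fold cross-validation, sketched below under stated assumptions. The helper names are hypothetical, the "model" is a trivial baseline that always predicts the most frequent training label, and the data are invented. The point is only to show how the training data are repeatedly divided into training and validation subsets.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_val_accuracy(fit_predict, X, y, k=5):
    """Average validation accuracy of a model over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        y_pred = fit_predict(X[train_idx], y[train_idx], X[val_idx])
        scores.append(np.mean(y_pred == y[val_idx]))
    return float(np.mean(scores))

def majority_class(X_train, y_train, X_val):
    """Baseline model: always predict the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_val), values[counts.argmax()])

# Hypothetical data: 15 samples of class 1 and 5 samples of class 0.
X = np.zeros((20, 2))
y = np.array([1] * 15 + [0] * 5)
print(cross_val_accuracy(majority_class, X, y))
```

Any function with the same `fit_predict(X_train, y_train, X_val)` signature could be passed in, which makes it possible to compare several candidate models without touching the test set.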

## Evaluating models and predicting unseen data instances

After we have selected a model that has been fitted on the training dataset, we can use the test dataset to estimate how well it performs on unseen data, that is, to estimate the generalization error. If we are satisfied with its performance, we can now use this model to predict new, future data. It is important to note that **the parameters for the previously mentioned procedures, such as feature scaling and dimensionality reduction, are solely obtained from the training dataset, and the same parameters are later reapplied to transform the test dataset, as well as any new samples; otherwise, the performance measure on the test data may be overly optimistic**.
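For standardization, this rule means that the mean and standard deviation are computed on the training set only and then reused unchanged. The sketch below uses hypothetical one-feature data to show the pattern.

```python
import numpy as np

# Hypothetical split: the test values are never used to estimate mean or std.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [6.0]])

# Estimate the scaling parameters from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Reuse the same parameters for the training set, the test set,
# and any new samples that arrive later.
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

print(mean, std)
print(X_test_std.ravel())
```

The transformed test set does not have exactly zero mean and unit variance, which is expected, because its own statistics were deliberately not used.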

# Installing Python and packages from the Python Package Index

It is recommended that you use the most recent version of Python 3 that is currently available, although most of the code examples may also be compatible with Python 2.7 or higher.

The additional packages that we will be using throughout this material can be installed via the *pip* installer program, which has been bundled with Python since Python 3.4.

After we have successfully installed Python, we can execute *pip* from the terminal to install additional Python packages: 

**pip install SomePackage**

## Using the Anaconda Python distribution and package manager

A highly recommended alternative Python distribution for scientific computing is Anaconda. Anaconda is a free (including for commercial use), enterprise-ready Python distribution that bundles all the essential Python packages for data science, math, and engineering in one user-friendly, cross-platform distribution.

After successfully installing Anaconda, we can install new Python packages using the following command:

**conda install SomePackage**

## Packages for scientific computing, data science and machine learning

Throughout this material, we will mainly use **NumPy**'s multidimensional arrays to store and manipulate data. Occasionally, we will make use of **Pandas**, which is a library built on top of NumPy that provides additional higher-level data manipulation tools that make working with tabular data even more convenient. To augment our learning experience and visualize quantitative data, which is often extremely useful to intuitively make sense of it, we will use the very customizable **Matplotlib** library. To implement machine learning models, we will use the **scikit-learn** library.
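Before moving on, it can be handy to check which of these packages are already available in the current environment. The following sketch uses only the standard library; the helper name `check_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

# Import names of the packages mentioned above
# (scikit-learn is imported as "sklearn").
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]

def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}

for name, installed in check_packages(PACKAGES).items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any package reported as missing can then be installed with *pip* or *conda* as described above.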

# Summary

In this chapter, we explored machine learning at a very high level and familiarized ourselves with the big picture and major concepts that we are going to explore in the following chapters in more detail. We learned that supervised learning is composed of two important subfields: classification and regression. While classification models allow us to categorize objects into known classes, we can use regression analysis to predict the continuous outcomes of target variables. Unsupervised learning not only offers useful techniques for discovering structures in unlabeled data, but it can also be useful for data compression in feature preprocessing steps. We briefly went over the typical roadmap for applying machine learning to problem tasks, which we will use as a foundation for deeper discussions and hands-on examples in the following chapters. Finally, we set up our Python environment and installed the required packages to get ready to see machine learning in action.

Later in this material, in addition to machine learning itself, we will also introduce different techniques to preprocess our dataset, which will help us to get the best performance out of different machine learning algorithms. While we will cover classification algorithms quite extensively throughout the book, we will also explore different techniques for regression analysis and clustering.

In the following chapter, we will start our journey by implementing one of the earliest machine learning algorithms for classification, which will prepare us for Chapter 3, where we cover more advanced machine learning algorithms using the scikit-learn open source machine learning library.