# 1. Introduction

* 싸이그래머 / QGM - ML [1]
* 김무성

# Contents

* 1.1 Machine learning: what and why? 
* 1.2 Supervised learning
* 1.3 Unsupervised learning
* 1.4 Some basic concepts in machine learning

### A probabilistic approach

* This book adopts the view that the best way to make machines that can learn from data is to use the tools of probability theory, which has been the mainstay of statistics and engineering for centuries. 
    - Probability theory can be applied to any problem involving uncertainty. In machine learning, uncertainty comes in many forms: what is the best prediction (or decision) given some data? what is the best model given some data? what measurement should I perform next? etc.
    - The systematic application of probabilistic reasoning to all inferential problems, including inferring parameters of statistical models, is sometimes called a Bayesian approach. 
    - However, this term tends to elicit very strong reactions (either positive or negative, depending on who you ask), so we prefer the more neutral term “probabilistic approach”. 
    - Besides, we will often use techniques such as maximum likelihood estimation, which are not Bayesian methods, but certainly fall within the probabilistic paradigm.

### model-based approach

* Rather than describing a cookbook of different heuristic methods, this book stresses a principled model-based approach to machine learning. 
* For any given model, a variety of algorithms can often be applied.

### graphical models

* We will often use the language of graphical models to specify our models in a concise and intuitive way. 
* In addition to aiding comprehension, the graph structure aids in developing efficient algorithms, as we will see. 
* However, this book is not primarily about graphical models; it is about probabilistic modeling in general.

# 1.1 Machine learning: what and why? 

* 1.1.1 Types of machine learning

We are drowning in information and starving for knowledge. — John Naisbitt.

* big data
* machine learning

## 1.1.1 Types of machine learning

* predictive or supervised learning
    - training set
    - features, attributes or covariates
    - response variable
        - categorical or nominal variable
            - classification or pattern recognition
        - real-valued
            - regression
            - ordinal regression
* descriptive or unsupervised learning
    - knowledge discovery
* reinforcement learning

<img src="figures/cap1.1.png" width=600 />

# 1.2 Supervised learning

* 1.2.1 Classification
* 1.2.2 Regression

## 1.2.1 Classification

* 1.2.1.1 Example
* 1.2.1.2 The need for probabilistic predictions
* 1.2.1.3 Real-world applications

Here the goal is to learn a mapping from inputs x to outputs y, where y ∈ {1,...,C}, with C being the number of classes.

* mutually exclusive (class labels)
    - binary classification : C = 2
    - multiclass classification : C > 2
* not mutually exclusive
    - multi-label classification
        - = multiple output model

* function approximation
    - y = f(x)
    - ŷ = f̂(x)

### 1.2.1.1 Example

### 1.2.1.2 The need for probabilistic predictions

<img src="figures/cap1.2.png" width=600 />

* MAP estimate (MAP stands for maximum a posteriori)
    - This corresponds to the most probable class label, and is called the mode of the distribution p(y|x, D);
* confidence
    - Now consider a case such as the yellow circle, where p(ŷ|x,D) is far from 1.0. 
    - In such a case we are not very confident of our answer, so it might be better to say “I don’t know” instead of returning an answer that we don’t really trust.
    - This is particularly important in domains such as 
        - medicine
        - finance
    - examples
        - Watson (IBM)
        - SmartASS (Google)
            - click-through rate (CTR)
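
The MAP decision with an "I don't know" option can be sketched as follows. This is a minimal illustration rather than code from the book: the function name `map_predict` and the `threshold` value of 0.8 are hypothetical, and the class probabilities p(y = c|x, D) are assumed to come from some already-fitted model.

```python
def map_predict(probs, threshold=0.8):
    """Return (label, confidence), or (None, confidence) when unsure.

    probs: dict mapping class label -> p(y = c | x, D), assumed to sum to 1.
    threshold: hypothetical confidence cutoff, not a value from the text.
    """
    label = max(probs, key=probs.get)  # mode of the distribution (MAP estimate)
    confidence = probs[label]
    if confidence < threshold:
        return None, confidence        # abstain: "I don't know"
    return label, confidence


confident = map_predict({"yes": 0.95, "no": 0.05})
unsure = map_predict({"yes": 0.55, "no": 0.45})
print(confident, unsure)
```

In a risk-averse domain the threshold would presumably be set from the costs of a wrong answer versus an abstention, which this sketch does not model.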

### 1.2.1.3 Real-world applications

#### Document classification and email spam filtering

p(y = c|x, D)

* document classification
    - A special case: email spam filtering
        - spam y = 1 or ham y = 0.

* bag of words

<img src="figures/cap1.3.png" width=600 />
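
A bag-of-words encoding can be sketched in a few lines. The example assumes the binary variant, where entry (i, j) is 1 if word j occurs in document i, and uses naive whitespace tokenization; the function name `bag_of_words` and the toy documents are made up for illustration.

```python
def bag_of_words(documents):
    """Build a binary document-by-word matrix.

    Returns (vocabulary, matrix) where matrix[i][j] == 1 if and only if
    word j of the vocabulary occurs in document i.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            row[index[word]] = 1  # presence only; counts are discarded
        matrix.append(row)
    return vocabulary, matrix


docs = ["win money now", "meeting at noon", "win a free meeting"]
vocab, X = bag_of_words(docs)
print(vocab)
for row in X:
    print(row)
```

Word order is lost in this representation, which is what makes every document a fixed-length feature vector.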

#### Classifying flowers

* feature extraction
    - sepal length and width, and petal length and width
* scatter plot
* exploratory data analysis

<img src="figures/cap1.4.png" width=600 />

<img src="figures/cap1.5.png" width=600 />

#### Image classification and handwriting recognition

* image classification
* handwriting recognition
* MNIST

<img src="figures/cap1.6.png" width=600 />

#### Face detection and recognition

* object detection or object localization
    - face detection
    - sliding window detector
* face recognition
    - invariant

<img src="figures/cap1.7.png" width=600 />

## 1.2.2 Regression

Regression is just like classification except the response variable is continuous.

<img src="figures/cap1.8.png" width=600 />

* Here are some examples of real-world regression problems.
    - Predict tomorrow’s stock market price given current market conditions and other possible side information.
    - Predict the age of a viewer watching a given video on YouTube.
    - Predict the location in 3d space of a robot arm end effector, given control signals (torques) sent to its various motors.
    - Predict the amount of prostate specific antigen (PSA) in the body as a function of a number of different clinical measurements.
    - Predict the temperature at any location inside a building using weather data, time, door sensors, etc.

# 1.3 Unsupervised learning

* 1.3.1 Discovering clusters
* 1.3.2 Discovering latent factors
* 1.3.3 Discovering graph structure
* 1.3.4 Matrix completion

We now consider unsupervised learning, where we are just given output data, without any inputs.

* knowledge discovery
* density estimation

<img src="figures/cap1.9.png" />

* There are two differences from the supervised case. 
    - First, we have written p(xi|θ) instead of p(yi|xi,θ); 
        - that is, 
            - supervised learning is conditional density estimation,
            - whereas unsupervised learning is unconditional density estimation. 
    - Second, xi is a vector of features, so we need to create multivariate probability models.

<font color="blue">When we’re learning to see, nobody’s telling us what the right answers are — we just look. Every so often, your mother says “that’s a dog”, but that’s very little information. You’d be lucky if you got a few bits of information — even one bit per second — that way. The brain’s visual system has 10^14 neural connections. And you only live for 10^9 seconds. So it’s no use learning one bit per second. You need more like 10^5 bits per second. And there’s only one place you can get that much information: from the input itself. — Geoffrey Hinton, 1996 (quoted in (Gorder 2006)).</font>


## 1.3.1 Discovering clusters

* clustering
* hidden or latent variable
* model based clustering

#### clustering

<img src="figures/cap1.11.png" />

<img src="figures/cap1.12_0.png" />

#### hidden or latent variable

<img src="figures/cap1.12.png" />

<img src="figures/cap1.10.png" width=600 />

## 1.3.2 Discovering latent factors

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data.

* dimensionality reduction
* visualizing
* principal components analysis (PCA)

<img src="figures/cap1.13.png" width=600 />

<img src="figures/cap1.14.png" width=600 />
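
As a rough sketch of what PCA computes, the code below finds the first principal direction of a small dataset and projects the data onto it. It uses power iteration on the sample covariance matrix, which is one simple way to get the leading eigenvector and is not necessarily how the book computes PCA; the function names and the toy points are hypothetical.

```python
import math


def first_principal_component(data, iterations=100):
    """Estimate the leading principal direction by power iteration.

    data: list of equal-length numeric rows, one row per example.
    Returns (means, direction) where direction has unit length.
    Assumes the data has nonzero variance and that the start vector
    is not orthogonal to the leading eigenvector.
    """
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(data, means, direction):
    """Map each row to its 1-D coordinate along the given direction."""
    return [sum((row[j] - means[j]) * direction[j] for j in range(len(means)))
            for row in data]


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, direction = first_principal_component(points)
coords = project(points, means, direction)
print(direction, coords)
```

Because the toy points lie exactly on the line y = 2x, a single coordinate per point is enough to reconstruct them; real data would lose some variance in the projection.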

## 1.3.3 Discovering graph structure

* Sometimes we measure a set of correlated variables, and we would like to discover which ones are most correlated with which others. 
* This can be represented by a graph G, in which nodes represent variables, and edges represent direct dependence between variables.

<img src="figures/cap1.16.png" />

<img src="figures/cap1.15.png" width=600 />

## 1.3.4 Matrix completion

* 1.3.4.1 Image inpainting
* 1.3.4.2 Collaborative filtering
* 1.3.4.3 Market basket analysis

Sometimes we have missing data, that is, variables whose values are unknown.

* NaN
* imputation
* matrix completion

### 1.3.4.1 Image inpainting

<img src="figures/cap1.17.png" width=600 />

### 1.3.4.2 Collaborative filtering

Another interesting example of an imputation-like task is known as collaborative filtering. A common example of this concerns predicting which movies people will want to watch based on how they, and other people, have rated movies which they have already seen.

* Netflix
* sparse

<img src="figures/cap1.18.png" width=600 />

### 1.3.4.3 Market basket analysis

* In commercial data mining, there is much interest in a task called market basket analysis. 
* The data consists of a (typically very large but sparse) binary matrix, where each column represents an item or product, and each row represents a transaction. 
* We set xij = 1 if item j was purchased on the i’th transaction. 
* Given a new partially observed bit vector, representing a subset of items that the consumer has bought, the goal is to predict which other bits are likely to turn on, representing other items the consumer might be likely to buy.
* Unlike collaborative filtering, we often assume there is no missing data in the training data, since we know the past shopping behavior of each customer.
* frequent itemset mining
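
A very small version of frequent itemset mining, restricted to pairs of items, might look like this. Each transaction is stored as a set of item names, which is a sparse encoding of one row of the binary matrix (item j is in basket i exactly when xij = 1). The names `frequent_pairs` and `min_support` and the shopping data are hypothetical, and this brute-force count is only a sketch, not a scalable mining algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}


transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
pairs = frequent_pairs(transactions, min_support=3)
print(pairs)
```

A frequent pair suggests a prediction rule of the kind described above: if one item of the pair is in a new basket, the other is a candidate for a bit that is likely to turn on.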

# 1.4 Some basic concepts in machine learning

* 1.4.1 Parametric vs non-parametric models
* 1.4.2 A simple non-parametric classifier: K-nearest neighbors 
* 1.4.3 The curse of dimensionality 
* 1.4.4 Parametric models for classification and regression
* 1.4.5 Linear regression
* 1.4.6 Logistic regression
* 1.4.7 Overfitting
* 1.4.8 Model selection
* 1.4.9 No free lunch theorem

## 1.4.1 Parametric vs non-parametric models

p(y|x) or p(x)

* parametric model
    - the model has a fixed number of parameters
    - Parametric models have the advantage of often being faster to use,
    - but the disadvantage of making stronger assumptions about the nature of the data distributions.
* non-parametric model
    - the number of parameters grows with the amount of training data
    - Non-parametric models are more flexible, 
    - but often computationally intractable for large datasets.

## 1.4.2 A simple non-parametric classifier: K-nearest neighbors 

* A simple example of a non-parametric classifier is the K nearest neighbor (KNN) classifier.
* This simply “looks at” the K points in the training set that are nearest to the test input x, counts how many members of each class are in this set, and returns that empirical fraction as the estimate, as illustrated in Figure 1.14.

* indicator function
* memory-based learning or instance-based learning
* Voronoi tessellation

<img src="figures/cap1.19.png" width=600 />

<img src="figures/cap1.19.5.png" width=600 />

<img src="figures/cap1.20.png" width=600 />
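
The KNN rule described above (find the K nearest training points, count class members, return the empirical fractions) can be sketched as below. Euclidean distance is an assumption, since the notes do not fix a distance measure, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict_proba(train_x, train_y, x, k):
    """Return {class: fraction of the k nearest neighbors with that class}.

    Classes with no members among the k neighbors are omitted
    (their estimated probability is 0).
    """
    neighbors = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], x),  # Euclidean distance
    )[:k]
    counts = Counter(label for _, label in neighbors)
    return {label: count / k for label, count in counts.items()}


train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict_proba(train_x, train_y, (1, 1), k=3)
in_between = knn_predict_proba(train_x, train_y, (4, 4), k=5)
print(near_origin, in_between)
```

Note that nothing is fitted ahead of time: the whole training set is kept and searched at prediction time, which is why this is called memory-based or instance-based learning.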

## 1.4.3 The curse of dimensionality 

* The poor performance in high dimensional settings is due to the curse of dimensionality.

<img src="figures/cap1.21.png" width=600 />
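
One standard way to see the problem, assumed here since the notes only show the figure, is to take inputs spread uniformly over a D-dimensional unit cube and ask how long the edge of a sub-cube must be to contain a fraction f of the data. The answer is f^(1/D), which the hypothetical helper `edge_length` computes.

```python
def edge_length(fraction, dim):
    """Edge of a sub-cube holding `fraction` of uniform data in `dim` dimensions."""
    return fraction ** (1.0 / dim)


for dim in (1, 3, 10):
    print(dim, round(edge_length(0.1, dim), 3))
```

In 10 dimensions, capturing 10% of the data already requires a cube spanning roughly 80% of the range along every axis, so the "nearest" neighbors are no longer local.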

## 1.4.4 Parametric models for classification and regression 

The main way to combat the curse of dimensionality is to make some assumptions about the nature of the data distribution (either p(y|x) for a supervised problem or p(x) for an unsupervised problem). 

* inductive bias
* parametric model

## 1.4.5 Linear regression

* linear regression
* scalar product
* weight vector
* residual error

<img src="figures/cap1.23.png" width=600 />
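
For a single input feature, the weight vector and residual errors can be computed by hand. The sketch below assumes the model y = w0 + w1·x + ε and fits it by ordinary least squares; the function names and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w0 + w1 * x for scalar inputs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx              # slope
    w0 = y_mean - w1 * x_mean   # intercept (the bias term)
    return w0, w1


def residuals(xs, ys, w0, w1):
    """Residual error of each example: observed minus predicted."""
    return [y - (w0 + w1 * x) for x, y in zip(xs, ys)]


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
errors = residuals(xs, ys, w0, w1)
print(w0, w1, errors)
```

With several features the prediction becomes the scalar product of the weight vector and the input vector, and the fit is usually obtained with a linear algebra routine rather than these closed-form sums.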

* Gaussian or normal distribution
    - bell curve

<img src="figures/cap1.22.png" width=600 />

<img src="figures/cap1.25.png" width=600 />

* bias

<img src="figures/cap1.26.png" width=600 />

* basis function expansion

<img src="figures/cap1.27.png" width=600 />

<img src="figures/cap1.28.png" width=400 />

* polynomial regression

<img src="figures/cap1.24.png" width=600 />

## 1.4.6 Logistic regression

* Bernoulli distribution
* sigmoid
    - = logistic function (the book also calls it the logit function, although strictly the logit is its inverse)
* squashing function
* decision rule
* linearly separable

<img src="figures/cap1.30.png" width=600 />

<img src="figures/cap1.29.png" width=600 />

<img src="figures/cap1.31.png" width=600 />
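
The sigmoid and the resulting decision rule can be sketched as follows. The weights are made up for illustration rather than learned, and the 0.5 threshold is the usual default, assumed here; `predict` is a hypothetical helper name.

```python
import math


def sigmoid(eta):
    """Squash any real number into (0, 1); written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def predict(w, x, threshold=0.5):
    """Return (label, p) where p = p(y = 1 | x, w) = sigmoid(w . x)."""
    eta = sum(wj * xj for wj, xj in zip(w, x))
    p = sigmoid(eta)
    return (1 if p > threshold else 0), p


w = [-1.0, 2.0]              # hypothetical weights; x[0] = 1 acts as the bias input
label, p = predict(w, [1.0, 1.0])
print(label, round(p, 3))
```

With a 0.5 threshold the decision boundary is the set of inputs where the scalar product w·x equals 0, which is a linear boundary in input space.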

## 1.4.7 Overfitting

When we fit highly flexible models, we need to be careful that we do not overfit the data, that is, we should avoid trying to model every minor variation in the input, since this is more likely to be noise than true signal. 

<img src="figures/cap1.24.png" width=600 />

<img src="figures/cap1.32.png" width=600 />

## 1.4.8 Model selection

* misclassification rate
* generalization error
* U-shaped curve
* underfits
* validation set
* cross validation (CV)
    - K folds
    - leave-one-out cross validation, or LOOCV
    - model selection

When we have a variety of models of different complexity (e.g., linear or logistic regression models with different degree polynomials, or KNN classifiers with different values of K), how should we pick the right one? 

#### misclassification rate

<img src="figures/cap1.33.png" width=600 />

#### generalization error

* U-shaped curve
* underfits
* validation set

<img src="figures/cap1.35.png" width=600 />

#### cross validation (CV)
* K folds
* leave-one-out cross validation, or LOOCV

Choosing K for a KNN classifier is a special case of a more general problem known as model selection, where we have to choose between models with different degrees of flexibility.
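
Choosing K by cross validation can be sketched as below. The code assumes a plain majority-vote KNN with Euclidean distance and a simple interleaved fold split; the function names, the candidate values of K, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, x, k):
    """train: list of (point, label). Majority vote among the k nearest."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_error(data, k, n_folds):
    """Misclassification rate of KNN, estimated by n_folds-fold CV."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    mistakes = 0
    for i, held_out in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train = [pair for j, fold in enumerate(folds) if j != i
                 for pair in fold]
        mistakes += sum(knn_classify(train, x, k) != y for x, y in held_out)
    return mistakes / len(data)


data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0),
        ((5, 5), 1), ((5, 6), 1), ((6, 5), 1), ((6, 6), 1)]
n_folds = len(data)  # one example per fold, i.e. LOOCV
candidates = [1, 3, 7]
scores = {k: cv_error(data, k, n_folds) for k in candidates}
best_k = min(candidates, key=scores.get)
print(scores, best_k)
```

On this toy data K = 7 misclassifies every held-out point, because each prediction is outvoted by the other class: a small example of a model that is too inflexible and underfits.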

## 1.4.9 No free lunch theorem

<font color="blue">All models are wrong, but some models are useful. — George Box (Box and Draper 1987, p424).</font>

* speed-accuracy-complexity tradeoffs

# References

* [1] Machine Learning: A Probabilistic Perspective - http://www.amazon.com/gp/product/0262018020