# Machine Learning
__A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.__

OR

Machine learning enables a machine to learn automatically from data, improve its performance with experience, and make predictions without being explicitly programmed.

__The term machine learning was first introduced by Arthur Samuel in 1959.__

#### How does Machine Learning work?
A machine learning system learns from historical data, builds a prediction model, and predicts the output whenever it receives new data. The accuracy of the predicted output depends in part on the amount of data: a larger, representative dataset generally helps build a model that predicts more accurately.
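As a minimal sketch of this learn-then-predict loop, the snippet below fits a straight line to a few historical points and uses it on a new input. The data (hours studied versus exam score) and the helper names `fit_line` and `predict` are made up for illustration only.

```python
# Historical data (hypothetical): hours studied -> exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Learn slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned model to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(hours, scores)   # learn from historical data
prediction = predict(model, 6.0)  # predict for new data
print(prediction)
```

The model is just the pair `(slope, intercept)`; real systems replace `fit_line` with a richer learning algorithm, but the train-then-predict flow stays the same.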

## Classification of Machine Learning

At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

# Supervised Learning

Supervised learning is a type of machine learning in which we provide labeled sample data to the system in order to train it; on that basis, it predicts the output.

The system builds a model from the labeled data to learn the relationship between inputs and outputs. Once training is done, we test the model on sample data to check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. It is based on supervision, much as a student learns under the supervision of a teacher. Spam filtering is an example of supervised learning.

Supervised learning can be grouped further in two categories of algorithms:

1. __Classification__

2. __Regression__
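The difference between the two categories lies in the output: classification predicts a discrete label, regression a continuous value. The sketch below uses the same toy 1-nearest-neighbor rule for both; the data and the function name `nearest_neighbor` are illustrative assumptions, not part of the original notes.

```python
def nearest_neighbor(train_x, train_y, query):
    """Return the target of the training input closest to the query."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]


# Classification: number of suspicious words in an email -> discrete label.
word_counts = [0, 1, 7, 9]
labels = ["ham", "ham", "spam", "spam"]
predicted_label = nearest_neighbor(word_counts, labels, 6)

# Regression: floor area -> continuous price (made-up units).
areas = [50.0, 80.0, 120.0]
prices = [150.0, 240.0, 360.0]
predicted_price = nearest_neighbor(areas, prices, 85.0)

print(predicted_label, predicted_price)
```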

# Steps to solve a supervised learning problem

In order to solve a given problem of supervised learning, one has to perform the following steps:

1. __Determine the type of training examples__. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
2. __Gather a training set__. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
3. __Determine the input feature representation of the learned function__. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.
4. __Determine the structure of the learned function and corresponding learning algorithm.__ For example, the engineer may choose to use support vector machines or decision trees.
5. __Complete the design. Run the learning algorithm on the gathered training set.__ Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
6. __Evaluate the accuracy of the learned function.__ After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
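The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data are simulated, the feature representation is polynomial, the single control parameter is the polynomial degree, and NumPy's `polyfit` stands in for the learning algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: gather (here: simulate) input objects and corresponding outputs.
x = rng.uniform(-3.0, 3.0, size=300)
y = np.sin(x) + rng.normal(scale=0.2, size=300)

# Hold out a test set and carve a validation set out of the remaining data.
idx = rng.permutation(300)
train_idx, val_idx, test_idx = idx[:180], idx[180:240], idx[240:]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given data."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Steps 3-5: pick a representation, run the learner on the training set,
# and tune the control parameter (the degree) on the validation set.
candidate_degrees = [1, 3, 5, 7]
fits = {d: np.polyfit(x[train_idx], y[train_idx], d) for d in candidate_degrees}
val_errors = {d: mse(fits[d], x[val_idx], y[val_idx]) for d in candidate_degrees}
best_degree = min(val_errors, key=val_errors.get)

# Step 6: measure performance once on the separate test set.
test_error = mse(fits[best_degree], x[test_idx], y[test_idx])
print(best_degree, test_error)
```

The test set is used only once, after the degree has been chosen, so the reported error is not biased by the tuning itself.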


## What are dependent and independent variables?

__Independent variables (also referred to as input features) are the inputs to the process being analyzed. Dependent variables are the outputs of the process.__

The dependent variable, also called the __output variable__, __label__, __target variable__, __outcome variable__, or __response variable__ and usually denoted by Y, is the variable being predicted in supervised learning.

For example, in the data set below, the independent variables are the inputs of the purchasing process being analyzed. The result (whether a user purchased or not) is the dependent variable.
![DependentAndIndependentVariable.png](attachment:DependentAndIndependentVariable.png)
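In code, this distinction usually becomes a split into a feature matrix X and a target vector y. The rows and column names below (`age`, `estimated_salary`, `purchased`) are hypothetical stand-ins for a purchase data set, not values taken from the figure.

```python
# Hypothetical purchase records; the column names are assumptions.
rows = [
    {"age": 19, "estimated_salary": 19000, "purchased": 0},
    {"age": 35, "estimated_salary": 20000, "purchased": 0},
    {"age": 47, "estimated_salary": 25000, "purchased": 1},
    {"age": 32, "estimated_salary": 150000, "purchased": 1},
]

feature_names = ["age", "estimated_salary"]  # independent variables
target_name = "purchased"                    # dependent variable

# X holds one list of input features per record, y the matching outputs.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(X)
print(y)
```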

# Four major issues to consider in supervised learning

### 1. Bias-variance tradeoff
### 2. Function complexity and amount of training data
### 3. Dimensionality of the input space
### 4. Noise in the output values

## Bias-variance tradeoff
Whenever we discuss model prediction, it’s important to understand prediction errors (bias and variance). There is a tradeoff between a model’s ability to minimize bias and variance. __Gaining a proper understanding of these errors would help us not only to build accurate models but also to avoid the mistake of overfitting and underfitting.__

__When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.__ This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f.
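A small experiment makes the gap between training and test MSE visible. The sketch below assumes a synthetic target sin(πx) with Gaussian noise and compares a degree-1 and a degree-9 polynomial fit; these choices are illustrative, not from the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_f(x):
    """The unknown function f (assumed here for the simulation)."""
    return np.sin(np.pi * x)


# A small noisy training set and a larger test set from the same process.
x_train = np.linspace(-1.0, 1.0, 12)
y_train = true_f(x_train) + rng.normal(scale=0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_f(x_test) + rng.normal(scale=0.3, size=x_test.size)


def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.3f}, "
          f"test MSE {errors[degree][1]:.3f}")
```

The flexible degree-9 fit drives the training MSE far below that of the straight line, while its test MSE stays well above its own training MSE, which is the signature of overfitting described above.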

#### What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. This leads to high error on both training and test data.
#### What is variance?
Variance is the variability of the model's prediction for a given data point: it tells us how much the predictions spread when the model is trained on different training sets. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.
#### Mathematically
Let Y be the variable we are trying to predict and X the other covariates. We assume there is a relationship between the two such that

    Y=f(X) + e

Where e is the error term and it’s normally distributed with a mean of 0.

We will build a model f̂(X) of f(X) using linear regression or any other modeling technique.

So the expected squared error at a point x is

![BiasAndVariance1.png](attachment:BiasAndVariance1.png)

The Err(x) can be further decomposed as

![BiasAndVariance2.png](attachment:BiasAndVariance2.png)

Err(x) is the sum of bias², variance, and the irreducible error.
The irreducible error is the error that cannot be reduced by building better models. It is a measure of the amount of noise in our data. No matter how good we make our model, our data will contain a certain amount of noise, or irreducible error, that cannot be removed.
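For reference, the standard textbook form of this decomposition is written out below. It is a sketch that assumes the two figures above show the usual result; the symbol σ_e² for the variance of the error term e is an assumption introduced here.

```latex
\begin{aligned}
\mathrm{Err}(x) &= E\big[(Y - \hat{f}(x))^2\big] \\
&= \big(E[\hat{f}(x)] - f(x)\big)^2
   + E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]
   + \sigma_e^2 \\
&= \mathrm{Bias}^2 + \mathrm{Variance} + \text{Irreducible error}
\end{aligned}
```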

#### Bias and variance using bulls-eye diagram


![BiasAndVariance3.png](attachment:BiasAndVariance3.png)
In the above diagram, the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. We can repeat our process of model building to get separate hits on the target.

In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model or when we try to fit a linear model to nonlinear data. Such models, for example linear and logistic regression, are too simple to capture complex patterns in the data.

In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train a model extensively on a noisy dataset. These models have low bias and high variance. Such models tend to be very complex, for example decision trees, which are prone to overfitting.

![BiasAndVariance4.png](attachment:BiasAndVariance4.png)

#### Why is there a bias-variance tradeoff?
If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find a good balance that neither overfits nor underfits the data.

This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
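The tradeoff can be observed directly by refitting a simple and a complex model on many simulated training sets and measuring how the predictions behave. The sketch below is an assumption-laden illustration: the true function sin(πx), the noise level, the fixed input grid, and the polynomial degrees 1 and 7 are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

x = np.linspace(-1.0, 1.0, 20)   # fixed inputs, reused in every round
f_true = np.sin(np.pi * x)       # assumed true function f
noise_sd = 0.3
n_rounds = 200


def decompose(degree):
    """Estimate bias^2, variance and their total for one model complexity."""
    preds = np.empty((n_rounds, x.size))
    for r in range(n_rounds):
        # Each round draws a fresh noisy training set and refits the model.
        y = f_true + rng.normal(scale=noise_sd, size=x.size)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x)
    bias_sq = float(np.mean((preds.mean(axis=0) - f_true) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    total = float(np.mean((preds - f_true) ** 2))  # error against f, noise excluded
    return bias_sq, variance, total


results = {d: decompose(d) for d in (1, 7)}
for d, (b, v, t) in results.items():
    print(f"degree {d}: bias^2 {b:.4f}, variance {v:.4f}, total {t:.4f}")
```

Because the error is measured against the noise-free f, the total equals bias² plus variance; adding the noise variance would give the full expected error on noisy outputs.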
#### Total Error
To build a good model, we need to find a good balance between bias and variance such that it minimizes the total error.


![BiasAndVariance6.png](attachment:BiasAndVariance6.png)

![BiasAndVariance5.png](attachment:BiasAndVariance5.png)

A model with an optimal balance of bias and variance neither overfits nor underfits the data.

Therefore understanding bias and variance is critical for understanding the behavior of prediction models.

# Example demonstrating underfitting and overfitting with the help of MSE (mean squared error)
__https://medium.com/analytics-vidhya/bias-variance-trade-off-in-datascience-and-calculating-with-python-766158812c46__

# Important points about the bias-variance tradeoff
1. For high-variance models, one remedy is feature reduction.
2. High bias usually stems from insufficient model capacity and shows up as underfitting.
3. A model with low variance and low bias is the ideal model.
4. A model with low bias and high variance is an overfitting model. Generally speaking, overfitting means poor generalization: the model memorizes the training set rather than learning the general concepts behind the data.
5. A model with high bias and low variance is usually an underfitting model.
6. A model with high bias and high variance is the worst-case scenario, as both components of the prediction error are large.

import mlxtend
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from mlxtend.evaluate import bias_variance_decomp

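# Data URL
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
# Note: read_csv treats the first row as column names by default; if this file
# has no header row, pass header=None so the first record is kept as data.
df = read_csv(url)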
data = df.values
X, y = data[:, :-1], data[:, -1]

'''To test the model we need a dataset that is separate from the training set.
Lacking a real test dataset, the obvious solution is to split the available
dataset into two sets, one for training and the other for testing.

The split should be random. train_test_split() from sklearn.model_selection
splits data arrays into random training and testing subsets in the
requested proportions.
'''

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

'''Explanation of the parameters of train_test_split()

X, y: the arrays to split, here the features and the target.

test_size: the share of the data to hold out as the test set, given as a
fraction. Here 0.33 means 33% of the dataset becomes the test set.
If you specify this parameter, you can omit the next one (train_size).

train_size: only needed if you do not specify test_size. It works the same
way, but gives the fraction of the dataset to use as the training set.

random_state: an integer that seeds the random number generator used for
the split, or a RandomState instance to use as the generator. If you pass
nothing, the RandomState instance used by np.random is used instead.

'''

# Make a model
model = LinearRegression()

# Now estimate bias and variance
mse, bias, var = bias_variance_decomp(model, X_train, y_train, X_test, y_test, loss='mse', num_rounds=200, random_seed=1)
print('MSE: %.3f' % mse)
print('Bias: %.3f' % bias)
print('Variance: %.3f' % var)

# MSE: 25.957
# Bias: 24.394
# Variance: 1.563

'''If run multiple times, results will vary given the stochastic nature of the algorithm 
or evaluation procedure, or differences in numerical precision. 
We may consider running the examples a few times and compare the average outcome.

Note that the model has a high bias and a low variance. 
This is expected for a linear regression model. 
We can also see that the sum of the estimated bias and variance equals the 
estimated error of the model, e.g. 24.394 + 1.563 = 25.957.'''

MSE: 25.663
Bias: 24.181
Variance: 1.482



# Ways to mitigate the bias-variance tradeoff on small- and medium-scale problems

Ensemble methods come to the rescue: they are meta-algorithms that combine several machine learning models to decrease bias and/or variance and improve model performance.

Instead of fitting a single final model, you can fit multiple final models. Together, the group of final models may be used as an ensemble. For a given input, each model in the ensemble makes a prediction and the final output prediction is taken as the average of the predictions of the models.
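A minimal sketch of this averaging idea, in the style of bagging: each ensemble member is fitted on a bootstrap resample of the training data and the final prediction is the mean of the member predictions. The synthetic data, the polynomial base model, and the ensemble size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic training data and a noise-free test grid (assumed setup).
x_train = rng.uniform(-1.0, 1.0, size=40)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.3, size=40)
x_test = np.linspace(-0.9, 0.9, 50)
y_test = np.sin(np.pi * x_test)

n_models = 25
degree = 5

member_preds = []
for _ in range(n_models):
    # Bootstrap: sample training indices with replacement.
    sample = rng.integers(0, x_train.size, size=x_train.size)
    coeffs = np.polyfit(x_train[sample], y_train[sample], degree)
    member_preds.append(np.polyval(coeffs, x_test))
member_preds = np.array(member_preds)

# The ensemble output is the average of the members' predictions.
ensemble_pred = member_preds.mean(axis=0)

member_mse = np.mean((member_preds - y_test) ** 2, axis=1)
ensemble_mse = float(np.mean((ensemble_pred - y_test) ** 2))
print(f"mean member MSE {member_mse.mean():.4f}, ensemble MSE {ensemble_mse:.4f}")
```

For squared error, the averaged prediction is never worse than the average error of the individual members, which is why averaging helps most when the members disagree with each other.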

__By building several models with different inductive biases and aggregating their outputs, we hope to get a model with better performance. Commonly used ensemble methods include bagging, boosting, and stacking.__

https://medium.com/analytics-vidhya/bias-variance-trade-off-in-datascience-and-calculating-with-python-766158812c46

https://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/

The links above cover the theory behind bagging, boosting, and stacking.

__If a model overfits, study structural risk minimization, which balances the fit to the training data against the complexity of the model.__
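One common practical reading of this principle is to add a complexity penalty to the training objective. The sketch below uses an L2 (ridge) penalty on a linear model as a stand-in; this is an illustrative choice rather than the only form of structural risk minimization, and the data and the function name `fit_penalized` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: only the first two of ten features matter (assumed setup).
X = rng.normal(size=(30, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -1.0]
y = X @ true_w + rng.normal(scale=0.5, size=30)


def fit_penalized(X, y, penalty):
    """Minimize ||Xw - y||^2 + penalty * ||w||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


w_unpenalized = fit_penalized(X, y, 0.0)   # plain least squares
w_penalized = fit_penalized(X, y, 10.0)    # complexity is penalized

print(np.linalg.norm(w_unpenalized), np.linalg.norm(w_penalized))
```

The penalized weights have a smaller norm, trading a slightly worse fit on the training data for a simpler model; the penalty strength would normally be chosen on a validation set.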

# Supervised algorithms and approaches
### Algorithms
1. Support Vector Machines
2. linear regression
3. logistic regression
4. naive Bayes
5. linear discriminant analysis
6. decision trees
7. k-nearest neighbor algorithm
8. Neural Networks (Multilayer perceptron)
9. Random Forests

### Approaches
1. Similarity learning
2. Analytical learning
3. Artificial neural network
4. Backpropagation
5. Boosting (meta-algorithm)
6. Bayesian statistics
7. Case-based reasoning
8. Decision tree learning
9. Inductive logic programming
10. Gaussian process regression
11. Genetic Programming
12. Group method of data handling
13. Kernel estimators
14. Learning Automata
15. Learning Classifier Systems
16. Minimum message length (decision trees, decision graphs, etc.)
17. Multilinear subspace learning
18. Naive Bayes classifier
19. Maximum entropy classifier
20. Conditional random field
21. Nearest Neighbor Algorithm
22. Probably approximately correct (PAC) learning
23. Ripple down rules, a knowledge acquisition methodology
24. Symbolic machine learning algorithms
25. Subsymbolic machine learning algorithms
26. Support vector machines
27. Minimum Complexity Machines (MCM)
28. Ensembles of Classifiers
29. Ordinal classification
30. Data Pre-processing
31. Handling imbalanced datasets
32. Statistical relational learning
33. Proaftn, a multicriteria classification algorithm