## Causes of Error:

### Bias (Underfitting)
As mentioned before, bias occurs when a model has enough data but is not complex enough to capture the underlying relationships. As a result, the model consistently and systematically misrepresents the data, leading to low accuracy in prediction. This is known as underfitting.
Simply put, bias occurs when we have an inadequate model. An example might be when we have objects that are classified by color and shape, but our model can only partition and classify objects just by color (it is an overly simplified model) and therefore consistently mislabels future objects.
Or perhaps we have continuous data that is polynomial in nature but our model can only represent linear relationships. In this case it does not matter how much data we feed the model because it just cannot represent the underlying relationship that we need a more complex model for.
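
The polynomial-versus-linear case above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is a noise-free, hypothetical `y = x**2` on `[-1, 1]`, and `fit_line` and `linear_fit_mse` are helper names invented for this example, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (closed form)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_fit_mse(n_points):
    """Fit a line to n_points samples of y = x**2 and return the training MSE."""
    # assumption: evenly spaced, noise-free samples on [-1, 1]
    xs = [-1.0 + 2.0 * i / (n_points - 1) for i in range(n_points)]
    ys = [x ** 2 for x in xs]
    slope, intercept = fit_line(xs, ys)
    squared_errors = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / n_points


# the error does not shrink towards zero as we add data: the model is too simple
for n in (11, 101, 1001):
    print("%4d points: mse = %.4f" % (n, linear_fit_mse(n)))
```

Even with a hundred times more points, the error stays well above zero, which is what high bias looks like in practice.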


### Variance (Overfitting)
When we train a model, we typically use a limited number of samples (the training set) drawn from a larger population. If we repeatedly train a model on randomly selected subsets of the data, we would expect its predictions to differ depending on the specific examples given to it. Here variance is a measure of how much the predictions vary for any given test sample.
Some variance is normal, but too much variance indicates that the model is unable to generalize its predictions to the larger population from which the training samples were drawn. High sensitivity to the training set is also known as overfitting, and generally occurs when the model is too complex, when we do not have enough data to support it, or both.
We can typically reduce the variability of a model's predictions and increase precision by training on more data.
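
A rough way to see this is to retrain the same model many times on different random training sets and measure how much its prediction at one fixed test point moves around. The sketch below assumes a hypothetical population `y = 2x + 1` plus Gaussian noise; `fit_line` and `prediction_spread` are names made up for this illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (closed form)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def prediction_spread(train_size, n_repeats=200, seed=0):
    """Standard deviation of the prediction at x = 0.5 across retrained models."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_repeats):
        # draw a fresh training set from the assumed population
        xs = [rng.uniform(0, 1) for _ in range(train_size)]
        ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
        slope, intercept = fit_line(xs, ys)
        predictions.append(slope * 0.5 + intercept)
    mean = sum(predictions) / len(predictions)
    variance = sum((p - mean) ** 2 for p in predictions) / len(predictions)
    return variance ** 0.5


print("spread with   5 training samples: %.3f" % prediction_spread(5))
print("spread with 200 training samples: %.3f" % prediction_spread(200))
```

The spread with 200 samples should come out much smaller than with 5, matching the claim that more data reduces the variability of predictions.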

Balancing bias and variance to find the sweet spot with a high R2 score is the key to getting a good ML model.

A good read on bias and variance:
http://scott.fortmann-roe.com/docs/BiasVariance.html

## Methods to split data to get the best model with low error
A practical way to find the best model is guess and check, i.e. try out the candidate models.
Each model is built from training data, and we then rank the models by checking an error/score metric against test data.

To do this, we divide our sample data into a train/test split: the model is fit on the training part and then evaluated on the test part.






In [1]:
import numpy as np
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1)
# how x_train, y_train, x_test and y_test are built is explained below
#clf.fit(x_train,y_train)
#clf.score(x_test,y_test) # score() comes from ClassifierMixin and returns the mean accuracy

### Cross validation
Given a sample of data with expected results, we need to train on some parts of it and use other parts to test our ML model. This train/test split of our sample can be done easily with the cross_validation module.

In [2]:

from sklearn import cross_validation, datasets
iris = datasets.load_iris()
print iris.data.shape, iris.target.shape # 150 rows of input data, each with 4 features
x_train, x_test, y_train, y_test = cross_validation.train_test_split(iris.data,iris.target, test_size=0.4, random_state=0)
print x_train.shape, x_test.shape, y_train.shape, y_test.shape

(150, 4) (150,)
(90, 4) (60, 4) (90,) (60,)



### KFold
K-fold extends the basic train/test split.
The idea is to run the split K times: divide the sample data into K folds, use the first fold as test data for the first run, then the second fold, and so on until the Kth fold.
This way all data is used for both training and testing. Out of the K models fitted on the training data we can pick the one with the lowest error, although the K scores are more commonly averaged to estimate how well the model generalizes.
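
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how K-fold index pairs can be generated without shuffling. `k_fold_indices` is a hypothetical helper written for this note, not a scikit-learn function; it assumes folds are contiguous and that the first `n_samples % n_folds` folds get one extra sample.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists, one pair per fold, without shuffling."""
    indices = list(range(n_samples))
    start = 0
    for fold in range(n_folds):
        # the first n_samples % n_folds folds get one extra sample
        size = n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(4, 2):
    print("%s %s" % (train, test))
# prints:
# [2, 3] [0, 1]
# [0, 1] [2, 3]
```

Each index appears in exactly one test fold, so every sample is tested once and used for training in the remaining runs.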

In [3]:
kf = cross_validation.KFold(4,n_folds=2) # total size of data and number of folds we want to make
for train, test in kf:
    print ("%s %s" % (train,test)) # will print 2 rows for 2 folds with sets swapped
    
# these indices can be used to pick the train and corresponding test data
# note: the indices can be used like array indices into the input data, e.g.
print iris.data[0:2] # returns the 2 rows of iris data at index 0 and 1

# hence for each run we do
# build x_train, y_train, x_test, y_test from the indices
#clf.fit(x_train,y_train)
#clf.score(x_test,y_test)
# pick the best classifier

kf = cross_validation.KFold(10,n_folds=4,shuffle=True) # shuffle is important: it shuffles the indices in case the first half
# of the data follows one pattern and the second half another
for train, test in kf:
    print ("%s %s" % (train,test))
    



[2 3] [0 1]
[0 1] [2 3]
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]]
[0 1 3 4 5 7 9] [2 6 8]
[2 3 5 6 7 8 9] [0 1 4]
[0 1 2 3 4 6 8 9] [5 7]
[0 1 2 4 5 6 7 8] [3 9]


### GridSearchCV
The above is a lot of work. Also, for each train/test combination we often need to run the model against different parameters of our ML algorithm. In the case of SVM, the algorithm has parameters such as
kernel = linear/rbf
C = a positive number (e.g. 1, 10)

GridSearchCV is a wrapper class that takes an algorithm plus a dictionary of parameter values, generates every combination of those parameters, and runs the algorithm with each combination using K-fold cross validation splits of the data.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
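
The "all combinations" part can be sketched with the standard library alone. This is not how scikit-learn implements it internally; `parameter_grid` is a hypothetical helper that only shows which candidates a grid over `kernel` and `C` would produce.

```python
import itertools


def parameter_grid(param_options):
    """Return every combination of the given parameter options as a list of dicts."""
    names = sorted(param_options)
    value_lists = [param_options[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


grid = parameter_grid({'kernel': ('linear', 'rbf'), 'C': [1, 10]})
for params in grid:
    print(params)  # one candidate parameter setting per line, 4 in total
```

Each of these candidates would then be scored with cross validation, and the best-scoring one kept.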

In [19]:
from sklearn import grid_search 

paramaters = { 'kernel':('linear','rbf'), 'C':[1,10] }  #params with all possible options: C is 1 or 10
                                                        #and kernel can be linear or rbf

svr = svm.SVC() #choose estimator


clf = grid_search.GridSearchCV(svr,paramaters,scoring='accuracy')  # build grid search to combine the estimator with the various combinations of
                                                # parameters, building a different classifier for each
clf.fit(iris.data,iris.target); # will carry out k-fold cross validation

print clf.best_score_
print clf.best_params_
print clf.scorer_

print clf.best_estimator_

clf = grid_search.GridSearchCV(svr,paramaters,scoring='f1_macro')  # build grid search to combine the estimator with the various combinations of
                                                # parameters, building a different classifier for each
clf.fit(iris.data,iris.target); # will carry out k-fold cross validation
print clf.best_score_
# the scoring function can be customized using a string that maps to a function, as listed in the link
# 'http://scikit-learn.org/stable/modules/model_evaluation.html'

# Since clf runs many combinations of the ML model with different params,
# it can run them in parallel;
# the degree of parallelism is given by the argument n_jobs

clf = grid_search.GridSearchCV(svr,paramaters,scoring='accuracy',n_jobs=10)  
clf.fit(iris.data,iris.target); 

print clf.best_score_

# we can also pass the cross validation folds explicitly; otherwise a default cross validation strategy is used
# check http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation on how to choose strategy options
k_fold = cross_validation.KFold(len(iris.target), n_folds=6, shuffle=True, random_state=0)
clf = grid_search.GridSearchCV(svr,paramaters,scoring='accuracy',n_jobs=10,cv=k_fold)  
clf.fit(iris.data,iris.target); 

print clf.best_score_

0.98
{'kernel': 'linear', 'C': 1}
make_scorer(accuracy_score)
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
0.979947931698
0.98
0.973333333333


### Learning Curve
A learning curve plots the number of training points on the x-axis against the error score on the y-axis.
If we have training data x_train, y_train and test data x_test, y_test, then two curves can be made:
take a sample x_train[:s], where s increases by a step size along the x-axis,
fit the ML model on x_train[:s] and y_train[:s],
then compute the error score against the data we trained on, i.e. x_train[:s], y_train[:s],
and against the test data x_test, y_test.
This results in two curves, the training curve and the testing curve, for increasing values of s.
The training curve moves from an error score near 0 up to some higher value, while the testing curve moves from high to low.
The point where the curves converge is where the model, with its current parameters, fits as well as it can.

If the curves converge high on the y-axis, we can assume the model with the parameters we used is underfitting: even with more training data, the error on the training data itself is still very high.

If the training error is low on the y-axis but the gap between the testing and training curves stays large, we can attribute this to overfitting: the model has fit the training data so closely that the test error is much higher than the training error.
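
The procedure above can be sketched without any plotting by just computing the two error values for each s. The example assumes hypothetical noisy linear data and a simple least-squares line as the model; `fit_line`, `mse` and `learning_curve_points` are names invented for this illustration.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (closed form)."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve_points(x_train, y_train, x_test, y_test, sizes):
    """Return (s, training error, testing error) for each training size s."""
    points = []
    for s in sizes:
        model = fit_line(x_train[:s], y_train[:s])
        points.append((s, mse(model, x_train[:s], y_train[:s]), mse(model, x_test, y_test)))
    return points


# assumption: synthetic data y = 2x + 1 with Gaussian noise, 80 train / 40 test
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(120)]
ys = [2 * x + 1 + rng.gauss(0, 0.3) for x in xs]
x_train, y_train, x_test, y_test = xs[:80], ys[:80], xs[80:], ys[80:]

curve = learning_curve_points(x_train, y_train, x_test, y_test, [2, 5, 10, 20, 40, 80])
for s, train_err, test_err in curve:
    print("s=%3d  train error=%.3f  test error=%.3f" % (s, train_err, test_err))
```

With s=2 the line passes through both training points, so the training error starts at essentially zero and rises as more points are added.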



### Model Complexity
A complexity graph is a curve with an ML model parameter of increasing complexity (such as increasing max_depth in the case of a decision tree) on the x-axis and the error score on the y-axis.

Note that here the error score is always computed from the same test data, and the model is trained on a fixed-size training set.
The graph helps to clearly identify overfitting/underfitting, in the spirit of Occam's razor.
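
As a rough sketch of such a graph, the example below swaps the decision tree for a tiny pure-Python k-nearest-neighbours regressor, where a smaller k means a more complex model. Everything here is an assumption for illustration: the synthetic `y = x**2` data, the helper names `knn_predict` and `knn_mse`, and the chosen values of k.

```python
import random


def knn_predict(x_train, y_train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / float(k)


def knn_mse(x_train, y_train, xs, ys, k):
    """Mean squared error of the k-NN predictions on (xs, ys)."""
    errors = [(knn_predict(x_train, y_train, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# assumption: synthetic data, fixed 100-sample training set and 50-sample test set
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(150)]
ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
x_train, y_train, x_test, y_test = xs[:100], ys[:100], xs[100:], ys[100:]

complexity_curve = []
for k in (100, 30, 10, 3, 1):  # ordered from simplest to most complex model
    train_err = knn_mse(x_train, y_train, x_train, y_train, k)
    test_err = knn_mse(x_train, y_train, x_test, y_test, k)
    complexity_curve.append((k, train_err, test_err))
    print("k=%3d  train error=%.4f  test error=%.4f" % (k, train_err, test_err))
```

At k=100 both errors are high (underfitting), while at k=1 the training error drops to zero because each training point is its own nearest neighbour; whether the test error rises again there is the overfitting signal to look for.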

