# Bias/Variance Trade Off

One of the basic challenges we face when dealing with real-world data is overfitting versus underfitting your regression, your model, or your predictions to that data.

Bias is basically how far off you are from the correct values, that is, how good your predictions are overall at predicting the right value. If you take the mean of all your predictions, is it more or less in the right spot? Or are your errors all consistently skewed in one direction or another? If so, then your predictions are biased in a certain direction.

Variance is a measure of how spread out, how scattered, your predictions are. If your predictions are all over the place, that's high variance. But if they are tightly clustered, whether around the correct value or around an incorrect one (which would be high bias), then your variance is low.

![title](Bias.PNG)

* Starting with the dartboard in the upper left-hand corner, you can see that our points are all scattered about the center. So overall, the mean error comes out to be pretty close to reality. Our bias is actually very low, because our predictions are all centered on the correct point. However, we have very high variance, because these points are scattered all over the place. So, this is an example of low bias and high variance.

* If we move on to the dartboard in the upper right corner, we see that our points are all consistently skewed from where they should be, to the Northwest. So this is an example of high bias in our predictions, where they're consistently off by a certain amount. We have low variance because they're all clustered tightly around the wrong spot, but at least they're close together, so we're being consistent in our predictions. That's low variance. But, the bias is high. So again, this is high bias, low variance.

* In the dartboard in the lower left corner, you can see that our predictions are scattered around the wrong mean point. So, we have high bias; everything is skewed to some place where it shouldn't be. But our variance is also high. So, this is kind of the worst of both worlds here; we have high bias and high variance in this example.

* Finally, in a wonderful perfect world, you would have an example like the lower right dartboard, where we have low bias, where everything is centered around where it should be, and low variance, where things are all clustered pretty tightly around where they should be. So, in a perfect world that's what you end up with.
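
The four dartboards can be imitated with a few lines of plain Python. This is only a sketch under stated assumptions: the predictions are one-dimensional rather than points on a board, the true value of 10 and the offsets and spreads are made-up numbers, and `bias_and_variance` is a hypothetical helper written for this illustration.

```python
import random
import statistics

def bias_and_variance(predictions, true_value):
    """Return (bias, variance) of a list of predictions for one true value."""
    mean_prediction = statistics.fmean(predictions)
    bias = mean_prediction - true_value          # how far the average is from the truth
    variance = statistics.pvariance(predictions) # how scattered the predictions are
    return bias, variance

rng = random.Random(0)
true_value = 10.0  # assumed "bullseye" for this illustration

# Each scenario: (offset of the mean from the truth, spread of the predictions)
scenarios = {
    "low bias, high variance": (0.0, 3.0),
    "high bias, low variance": (4.0, 0.3),
    "high bias, high variance": (4.0, 3.0),
    "low bias, low variance": (0.0, 0.3),
}

results = {}
for name, (offset, spread) in scenarios.items():
    predictions = [rng.gauss(true_value + offset, spread) for _ in range(1000)]
    results[name] = bias_and_variance(predictions, true_value)
    bias, variance = results[name]
    print(f"{name}: bias={bias:+.2f}, variance={variance:.2f}")
```

The bias comes out near 0 or near 4 depending on the offset, and the variance comes out near the square of the spread, mirroring the four dartboards.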

In reality, you often need to choose between bias and variance. It comes down to overfitting versus underfitting your data.

![title](Bias1.PNG)

In the left graph we have a straight line, which you can think of as having very low variance: the line itself barely wiggles. But the bias, the error at each individual point, is actually very high. In contrast, in the overfitted graph on the right, we have gone out of our way to fit the observations. That line has high variance but low bias, because each individual point is very close to the actual value.

Overall, we are not trying to reduce just bias or just variance. We want to reduce error.
Error can be expressed in terms of bias and variance:

$$\text{Error} = \text{Bias}^2 + \text{Variance}$$

Error is equal to bias squared plus variance, so both bias and variance contribute to error, with bias entering as a squared term. But it's really the error that we want to minimize, not bias or variance specifically. An overly complex model will have high variance and low bias, and an overly simple model will have low variance and high bias; however, both could end up with similar error.
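
The relationship can be checked numerically. The sketch below assumes a hypothetical predictor that is consistently 1.5 too high with noisy output (both numbers are made up), takes "error" to mean the mean squared error against a single true value, and ignores any irreducible noise in the data itself.

```python
import random
import statistics

rng = random.Random(1)
true_value = 5.0

# Hypothetical predictor: consistently 1.5 too high, with noise of std 2.0
predictions = [rng.gauss(true_value + 1.5, 2.0) for _ in range(10_000)]

bias = statistics.fmean(predictions) - true_value
variance = statistics.pvariance(predictions)
mse = statistics.fmean((p - true_value) ** 2 for p in predictions)

print(f"bias^2 + variance  = {bias ** 2 + variance:.4f}")
print(f"mean squared error = {mse:.4f}")
```

The two printed numbers agree up to floating-point rounding, because the mean squared error splits exactly into the squared bias and the (population) variance of the predictions.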

KNN Example - 

In K-Nearest Neighbours, increasing K spreads the neighborhood we average over across a larger area. That decreases variance, because we are smoothing over more points, but it may increase bias, because we pull in neighbors that are less and less relevant to the point we started from. Choosing a small K does the opposite: we average only over neighbors close to the starting point, which keeps bias low, but with so few points in the average the noise in individual neighbors is not smoothed out, so variance is high.
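
A small experiment makes this concrete. It is a minimal sketch, not a reference implementation: the data are assumed to be noisy samples of y = sin(x), `knn_predict` and `make_training_set` are hypothetical helpers written in plain Python instead of scikit-learn, and the values of K (1 and 40 out of 50 points) are chosen only to exaggerate the effect.

```python
import math
import random
import statistics

def knn_predict(train, x_query, k):
    """Average the targets of the k training points nearest to x_query."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x_query))[:k]
    return statistics.fmean(y for _, y in nearest)

def make_training_set(rng, n=50, noise=0.3):
    """Draw n noisy samples of y = sin(x) for x in [0, pi]."""
    xs = [rng.uniform(0.0, math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, noise)) for x in xs]

rng = random.Random(42)
x_query = math.pi / 2            # point at which we predict
true_value = math.sin(x_query)   # the noise-free answer at that point

summary = {}
for k in (1, 40):
    # Refit on 300 independent training sets to see how the prediction varies
    predictions = [knn_predict(make_training_set(rng), x_query, k)
                   for _ in range(300)]
    bias = statistics.fmean(predictions) - true_value
    variance = statistics.pvariance(predictions)
    summary[k] = (bias, variance)
    print(f"k={k:2d}: bias={bias:+.3f}, variance={variance:.4f}")
```

With K=1 the prediction follows a single noisy neighbor, so the bias is small and the variance is large. With K=40 the average reaches far down the sides of the curve, so the variance shrinks but the prediction is pulled well below the true value.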

# K-Fold Cross Validation to Avoid Over Fitting

Earlier we talked about train/test as a good way of detecting overfitting and measuring how well your model performs on data that it has never seen before. To recap, we split all our data into two segments: training data and testing data. We train our model using only the training data set and then evaluate its performance on the testing data set. That guards against overfitting, since we are testing against data the model has never seen before.

However, train/test also has limitations: you could end up overfitting to your specific train/test split. This can happen when the training data is not representative of the entire data set, for example because too many samples of one kind ended up in the training set and skewed the results. This can be overcome by using K-Fold Cross Validation.

Idea behind K-Fold Cross Validation

1. Instead of dividing our data into two buckets, one for training and one for testing, we divide it into K buckets.
2. We reserve one of those buckets for testing purposes, for evaluating the results of our model.
3. We train our model on the remaining K-1 buckets, and then use the reserved bucket to evaluate how well the model did. We repeat this K times, so that each bucket serves as the test set exactly once.
4. We average the resulting K scores (for example r-squared values for a regression, or accuracy for a classifier) to get a final metric from k-fold cross-validation.
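
The four steps can be sketched in plain Python. This is an illustration under assumptions, not how scikit-learn implements it: `k_fold_indices` and `cross_validate_mean_model` are hypothetical helpers, the "model" simply predicts the mean of its training targets, and the targets are made-up numbers.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Step 1: fold i takes every k-th shuffled index, so sizes differ by at most 1
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Steps 2 and 3: fold i is reserved for testing, the other k-1 for training
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

def cross_validate_mean_model(ys, k):
    """Score a toy 'predict the training mean' model; return per-fold MSEs."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        prediction = statistics.fmean(ys[i] for i in train_idx)          # "train"
        mse = statistics.fmean((ys[i] - prediction) ** 2 for i in test_idx)  # "test"
        fold_errors.append(mse)
    return fold_errors

ys = [float(v) for v in range(20)]   # made-up targets for illustration
fold_errors = cross_validate_mean_model(ys, k=5)
print(fold_errors)
# Step 4: average the per-fold metrics into one final number
print(statistics.fmean(fold_errors))
```

Every sample lands in the test bucket exactly once, so the final average uses all of the data for evaluation without ever testing a model on data it was trained on.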

# Example of K-Fold Cross Validation

The way we use K-Fold Cross Validation is this: you have a model that you are trying to tune, with different variations of the model and different parameters you might want to tweak. For example, it may be the degree of the polynomial in a polynomial fit. The idea is to try different variations of your model, measure them all using k-fold cross-validation, and find the one that minimizes error against your test data.

In practice you will want to use k-fold cross-validation to measure the accuracy of your model against the test data and keep refining the model: keep trying different parameters, different variations of the model, or maybe a different model entirely, until you find the technique that reduces error the most.

We are going to apply the k-fold cross-validation technique to the Iris data set, which we used in earlier examples. We are going to use the SVC model for classification. We start by doing a conventional single train/test split and see how that performs. We use a test_size of 0.4, which means that 40% of the data is reserved for testing. We then build an SVC model for predicting Iris species using the training data. We fit the SVC model using the linear kernel and call the model clf. Then we call the score() function on clf to measure the performance against the test data.

In [11]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
# Split the iris data into train/test data sets with
# 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
# Build an SVC model for predicting iris classifications
# using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)

0.9666666666666667

We have an accuracy of about 96.7%, which is very good. But this is a small data set with only 150 samples, and we are training on 60% of it and testing on the remaining 40%, so both sets are smaller still. With so few samples, the score could depend heavily on this particular split, and the model may be overfitting. So let's use K-Fold Cross Validation to check.

We already have the SVC model clf, which was defined earlier. We call cross_val_score() from sklearn's model_selection module and pass it the model clf and the entire data set, that is, all the feature data and the target data as well. We use cv=5, which means the data is split into 5 folds; each fold is used once for testing while the other 4 are used for training. This automatically evaluates our model against the entire data set, split up 5 different ways, and gives back the score for each split.

In [12]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
# Print the accuracy for each fold:
print(scores)
# And the mean accuracy of all 5 folds:
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


We have a mean accuracy of 98%. With 5-fold cross-validation, our results are even better than the 96.7% we had initially. This is very good. Let us see whether we can improve the model by using a polynomial kernel instead of the linear kernel, and check whether that overfits the data.

In [14]:
# Polynomial kernel (SVC's default degree is 3)
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())

[1.         1.         0.9        0.93333333 1.        ]
0.9666666666666666


The mean score is 96.7%, which is lower than the 98% we got from the linear kernel with the same cross-validation. This suggests that the polynomial kernel is probably overfitting the data.
Let's try different degrees of polynomial and see if that changes the result.

In [23]:
# Try polynomial degrees 0 through 4 and print the mean accuracy for each
for i in range(5):
    clf = svm.SVC(kernel='poly', C=1, degree=i).fit(X_train, y_train)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(i)
    #print (scores)
    print(scores.mean())

0
0.3333333333333333
1
0.9800000000000001
2
0.9733333333333334
3
0.9666666666666666
4
0.96


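We can see that the best result is for a degree of 1, which basically makes the kernel linear. A degree of 0 ignores the features entirely, so its score of 33% is no better than guessing among the three classes.
Let us try different kernels and observe the changes.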

In [27]:
# RBF kernel (the default when no kernel is specified)
clf = svm.SVC(C=1).fit(X_train, y_train)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


In [29]:
# Sigmoid kernel
clf = svm.SVC(kernel='sigmoid', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())

[0.33333333 0.1        0.         0.03333333 0.        ]
0.09333333333333334


The best kernels to use for this experiment are RBF, linear, and polynomial with a degree of 1, each of which reaches a mean accuracy of 98%. The sigmoid kernel performs far worse.
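
The kernel comparison can also be collected into a single loop. This is a sketch rather than the notebook's original code: `mean_scores` and `best_kernel` are names introduced here, the estimator is passed unfitted because cross_val_score refits a copy of it on every fold, and the exact scores may differ from those shown above depending on the scikit-learn version, since defaults such as gamma have changed over time.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

mean_scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # cross_val_score clones and refits the model on each fold,
    # so the estimator does not need to be fitted beforehand
    clf = svm.SVC(kernel=kernel, C=1)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    mean_scores[kernel] = scores.mean()
    print(f"{kernel}: {mean_scores[kernel]:.4f}")

# Pick the kernel with the highest mean accuracy across the 5 folds
best_kernel = max(mean_scores, key=mean_scores.get)
print("best:", best_kernel)
```

Keeping the scores in a dictionary makes it easy to add more candidates, such as different values of C, and compare them all on the same folds.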