#Thinkful Data Science Course
##Unit 4: Predicting the Future
##Lesson 8: Evaluating Classifier Performance

Throughout the unit we've been splitting our data into training, test, and validation sets. Let's take a moment to discuss why this is necessary. By now you can probably see that fitting an estimator and testing its performance on the same data is a methodological mistake. It's like a professor administering a test with exactly the same questions as the practice test: to get 100%, a student would only have to memorize the solutions to the practice test; they wouldn't actually have to learn anything. If you test your estimator on the data used to train it, it has already seen all the answers and can achieve a perfect score, even though it could very well fail to predict anything useful on data it has never seen before. This is called overfitting. Predicting on never-before-seen data is the whole point, so knowing how our estimator performs on data it has already seen isn't really useful.
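
The exam analogy can be made concrete with a minimal sketch. It assumes a recent scikit-learn (where `train_test_split` lives in `sklearn.model_selection`) and uses an unpruned decision tree, chosen here only for illustration because it can memorize its training data. The `random_state` values are arbitrary and only make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features and labels as NumPy arrays
X, y = load_iris(return_X_y=True)

# Hold out 40% of the data that the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)  # random_state is an arbitrary choice

# An unpruned tree (illustrative choice) can fit its training data very closely
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Score on the data used for fitting versus the held-out data
train_accuracy = tree.score(X_train, y_train)
test_accuracy = tree.score(X_test, y_test)
print('Accuracy on training data:', train_accuracy)
print('Accuracy on held-out data:', test_accuracy)
```

The training accuracy is the "practice test" score; the held-out accuracy is the more honest estimate of how the estimator would do on new data.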

Holding out a subset of your data for testing, i.e., excluding a subset of your data from your training set, gives you some never-before-seen data to test your estimator's performance. The scikit-learn library has a train_test_split helper function to randomly split data into training and test sets.

When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set, because the hyperparameters can be tweaked until the estimator performs optimally on that particular test set. In this way, knowledge about the test set can “leak” into the model, and we can no longer make claims about how it will generalize (i.e., how it will perform on never-before-seen data).

To resolve this problem, we can hold out yet another subset of our data for validation. Training proceeds on the training set, evaluation is done on the validation set, and when it seems like we have a good model, we can perform our final evaluation on the test set.
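
The three-way workflow described above can be sketched as follows. This is a minimal illustration, not a prescribed recipe: it assumes a recent scikit-learn, and the split fractions, the candidate values of `C`, and the `random_state` values are all arbitrary choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: set aside a final test set (20% here, an arbitrary fraction)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split what remains into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Step 3: choose the hyperparameter C using only the validation set
candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid for illustration
best_C, best_val_score = None, -1.0
for C in candidate_Cs:
    model = SVC(kernel='linear', C=C).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_C, best_val_score = C, val_score

# Step 4: evaluate once on the untouched test set
final_model = SVC(kernel='linear', C=best_C).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)
print('Chosen C:', best_C)
print('Validation accuracy:', best_val_score)
print('Test accuracy:', test_score)
```

Because the test set plays no part in choosing `C`, its score is not contaminated by the tuning loop.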

####Use the train_test_split() helper function to split the Iris dataset into training and test sets, holding out 40% of the data for testing.
How many points do you have in your training set? In your test set?

In [5]:
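# sklearn.cross_validation was removed in scikit-learn 0.20; the helper now lives in model_selection
from sklearn.model_selection import train_test_split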
import pandas as pd

In [6]:
from sklearn import datasets
iris = datasets.load_iris()

In [7]:
# Build the feature columns directly from the data array
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
iris_df = pd.DataFrame(iris.data, columns=feature_names)
iris_df['target'] = iris.target
# Map the integer class codes (0, 1, 2) to species names
iris_df['target_flower'] = iris_df['target'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
# One sub-frame per species
iris_df1 = iris_df[iris_df['target_flower'] == 'setosa']
iris_df2 = iris_df[iris_df['target_flower'] == 'versicolor']
iris_df3 = iris_df[iris_df['target_flower'] == 'virginica']

In [9]:
# DataFrame.as_matrix() was removed from pandas; to_numpy() returns the same NumPy array
X = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].to_numpy()
y = iris_df['target'].to_numpy()

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)

In [14]:
X_train

array([[ 6.4,  2.7,  5.3,  1.9],
       [ 6.3,  2.9,  5.6,  1.8],
       [ 6.5,  3. ,  5.2,  2. ],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 5.8,  2.6,  4. ,  1.2],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 5.1,  3.5,  1.4,  0.2],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 6.9,  3.1,  5.1,  2.3],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6. ,  3.4,  4.5,  1.6],
       [ 5.6,  2.7,  4.2,  1.3],
       [ 6.5,  3.2,  5.1,  2. ],
       [ 6.3,  2.5,  5. ,  1.9],
       [ 6.9,  3.2,  5.7,  2.3],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 6. ,  2.9,  4.5,  1.5],
       [ 7.4,  2.8,  6.1,  1.9],
       [ 7.2,  3. ,  5.8,  1.6],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 7.2,  3.6,  6.1,  2.5],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 6. ,  2.2,  5. ,  1.5],
       [ 5.8,  2.7,  3.9,  1.2],
       [ 5.1,  3.7,  1.5,  0.4],
       [ 4.8,  3. ,  1.4,  0.3],
       [ 5.1,  3.8,  1.9,  0.4],
       [ 6.7,  3.1,  4.7,  1.5],
       [ 5.9,  3. ,  4.2,  1.5],
       [ 5. ,  3. ,  1.6,  0.2],
       ...

In [15]:
X_test

array([[ 6.7,  3.1,  4.4,  1.4],
       [ 6.8,  3.2,  5.9,  2.3],
       [ 6.1,  3. ,  4.9,  1.8],
       [ 5.1,  3.5,  1.4,  0.3],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 5.1,  3.4,  1.5,  0.2],
       [ 5.4,  3.4,  1.5,  0.4],
       [ 5.6,  2.5,  3.9,  1.1],
       [ 5. ,  3.4,  1.6,  0.4],
       [ 5.8,  2.8,  5.1,  2.4],
       [ 5.5,  3.5,  1.3,  0.2],
       [ 7.2,  3.2,  6. ,  1.8],
       [ 5.5,  2.5,  4. ,  1.3],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 4.9,  3.1,  1.5,  0.1],
       [ 5. ,  3.2,  1.2,  0.2],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 5.6,  2.9,  3.6,  1.3],
       [ 6.6,  3. ,  4.4,  1.4],
       [ 6.4,  2.9,  4.3,  1.3],
       [ 4.9,  2.5,  4.5,  1.7],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 5.1,  3.8,  1.6,  0.2],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 6.7,  3.3,  5.7,  2.1],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 4.8,  3.4,  1.6,  0.2],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5.9,  3. ,  5.1,  1.8],
       ...

In [16]:
y_train

array([2, 2, 2, 2, 1, 2, 0, 1, 2, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2, 2, 2, 0, 2,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 0, 2, 0, 1, 2, 1, 0, 1, 1, 1, 2, 0, 2, 2, 0, 0, 1, 1, 2,
       2, 1, 1, 2, 0, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 0, 1, 0, 0, 1, 1])

In [17]:
y_test

array([1, 2, 2, 0, 1, 0, 0, 1, 0, 2, 0, 2, 1, 2, 0, 0, 1, 1, 1, 1, 2, 1, 0,
       2, 2, 0, 1, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 1, 1, 2, 2, 1, 0, 0, 2, 2,
       1, 0, 2, 2, 1, 1, 1, 0, 2, 0, 1, 0, 1, 1])

####How many points do you have in your training set? 

In [22]:
print('There are', len(X_train), 'points in the training set')

There are 90 points in the training set


####In your test set?

In [24]:
print('There are', len(X_test), 'points in the test set, which is', (len(X_test)/(len(X_test)+len(X_train)))*100, '% of the data.')

There are 60 points in the test set, which is 40.0 % of the data.


####Fit a linear Support Vector Classifier to the training set and evaluate its performance on the test set. 

What is the score? How does it compare to the score in the Support Vector Machine lesson?

In [28]:
from sklearn import svm
# Linear-kernel support vector classifier
svc = svm.SVC(kernel='linear')
# Fit on the training split only, so the test split stays unseen
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

####What is the score?
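
One way to answer this is to call the classifier's `score` method on the held-out test set, which returns the mean accuracy. The sketch below is self-contained and assumes a recent scikit-learn; the `random_state` argument is an addition for repeatability and is not part of the split used earlier in this notebook, so the exact number will differ from run to run without it and no specific value is quoted here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 40% of the data for testing, as in the exercise
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=0)  # random_state is an assumption

# Fit a linear support vector classifier on the training split only
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)

# score() returns the fraction of test points classified correctly
test_score = svc.score(X_test, y_test)
print('Accuracy on the test set:', test_score)
```

Since this score comes from data the classifier never saw during fitting, it is the figure to compare against a score computed on the training data itself.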

####How does it compare to the score in the Support Vector Machine lesson?