# Project 2: Supervised Learning
### Building a Student Intervention System

## 1. Classification vs Regression

*Your goal is to identify students who might need early intervention - which type of supervised machine learning problem is this, classification or regression? Why?*

This is a classification problem. The required output is a discrete label: each student either needs early intervention or does not. A regression model would instead predict a continuous value.

## 2. Exploring the Data

Let's go ahead and read in the student dataset first.

_To execute a code cell, click inside it and press **Shift+Enter**._

In [3]:
# Import libraries
import numpy as np
import pandas as pd

In [4]:
# Read student data
student_data = pd.read_csv("student-data.csv")
print "Student data read successfully!"
# Note: The last column 'passed' is the target/label; all others are feature columns


Student data read successfully!


Now, can you find out the following facts about the dataset?
- Total number of students
- Number of students who passed
- Number of students who failed
- Graduation rate of the class (%)
- Number of features

_Use the code block below to compute these values. Instructions/steps are marked using **TODO**s._

In [5]:
# TODO: Compute desired values - replace each '?' with an appropriate expression/function call
n_students = student_data.shape[0]
n_features = student_data.shape[1] - 1  # exclude the target column 'passed'
n_passed = student_data[student_data['passed']=='yes'].shape[0]
n_failed = student_data[student_data['passed']=='no'].shape[0]
grad_rate = 100.0 * n_passed / n_students  # expressed as a percentage
print "Total number of students: {}".format(n_students)
print "Number of students who passed: {}".format(n_passed)
print "Number of students who failed: {}".format(n_failed)
print "Number of features: {}".format(n_features)
print "Graduation rate of the class: {:.2f}%".format(grad_rate)

Total number of students: 395
Number of students who passed: 265
Number of students who failed: 130
Number of features: 30
Graduation rate of the class: 67.09%


## 3. Preparing the Data
In this section, we will prepare the data for modeling, training and testing.

### Identify feature and target columns
It is often the case that the data you obtain contains non-numeric features. This can be a problem, as most machine learning algorithms expect numeric data to perform computations with.

Let's first separate our data into feature and target columns, and see if any features are non-numeric.<br/>
**Note**: For this dataset, the last column (`'passed'`) is the target or label we are trying to predict.

In [6]:
# Extract feature (X) and target (y) columns
feature_cols = list(student_data.columns[:-1])  # all columns but last are features
target_col = student_data.columns[-1]  # last column is the target/label
print "Feature column(s):-\n{}".format(feature_cols)
print "Target column: {}".format(target_col)

X_all = student_data[feature_cols]  # feature values for all students
y_all = student_data[target_col]  # corresponding targets/labels
print "\nFeature values:-"
print X_all.head()  # print the first 5 rows

Feature column(s):-
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']
Target column: passed

Feature values:-
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher   
1     GP   F   17       U     GT3       T     1     1  at_home     other   
2     GP   F   15       U     LE3       T     1     1  at_home     other   
3     GP   F   15       U     GT3       T     4     2   health  services   
4     GP   F   16       U     GT3       T     3     3    other     other   

    ...    higher internet  romantic  famrel  freetime goout Dalc Walc health  \
0   ...       yes       no        no       4         3     4    1    1      3   
1   ...    

### Preprocess feature columns

As you can see, there are several non-numeric columns that need to be converted! Many of them are simply `yes`/`no`, e.g. `internet`. These can be reasonably converted into `1`/`0` (binary) values.

Other columns, like `Mjob` and `Fjob`, have more than two values, and are known as _categorical variables_. The recommended way to handle such a column is to create as many columns as possible values (e.g. `Fjob_teacher`, `Fjob_other`, `Fjob_services`, etc.), and assign a `1` to one of them and `0` to all others.

These generated columns are sometimes called _dummy variables_, and we will use the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies) function to perform this transformation.
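
To make the idea of dummy variables concrete, here is a minimal plain-Python sketch of the transformation. It is not how `pandas.get_dummies()` is implemented; the helper name `one_hot` and the sample `Fjob` values are assumptions made up for illustration.

```python
def one_hot(values, prefix):
    """Return one dict per value: 1 in the matching dummy column, 0 in the others."""
    categories = sorted(set(values))  # one dummy column per distinct value
    rows = []
    for v in values:
        rows.append({"{}_{}".format(prefix, c): int(v == c) for c in categories})
    return rows

# Hypothetical sample of the 'Fjob' column (not taken from the dataset)
fjob = ["teacher", "other", "services", "other"]
encoded = one_hot(fjob, "Fjob")

for original, row in zip(fjob, encoded):
    print("{} -> {}".format(original, sorted(row.items())))
```

Each row ends up with exactly one `1`, which is why a categorical column with k distinct values becomes k numeric columns.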

In [7]:
# Preprocess feature columns
def preprocess_features(X):
    outX = pd.DataFrame(index=X.index)  # output dataframe, initially empty

    # Check each column
    for col, col_data in X.iteritems():
        # If data type is non-numeric, try to replace all yes/no values with 1/0
        if col_data.dtype == object:
            col_data = col_data.replace(['yes', 'no'], [1, 0])
        # Note: This should change the data type for yes/no columns to int

        # If still non-numeric, convert to one or more dummy variables
        if col_data.dtype == object:
            col_data = pd.get_dummies(col_data, prefix=col)  # e.g. 'school' => 'school_GP', 'school_MS'

        outX = outX.join(col_data)  # collect column(s) in output dataframe

    return outX

X_all = preprocess_features(X_all)
print "Processed feature columns ({}):-\n{}".format(len(X_all.columns), list(X_all.columns))

Processed feature columns (48):-
['school_GP', 'school_MS', 'sex_F', 'sex_M', 'age', 'address_R', 'address_U', 'famsize_GT3', 'famsize_LE3', 'Pstatus_A', 'Pstatus_T', 'Medu', 'Fedu', 'Mjob_at_home', 'Mjob_health', 'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home', 'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher', 'reason_course', 'reason_home', 'reason_other', 'reason_reputation', 'guardian_father', 'guardian_mother', 'guardian_other', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences']


### Split data into training and test sets

So far, we have converted all _categorical_ features into numeric values. In this next step, we split the data (both features and corresponding labels) into training and test sets.

In [8]:
# First, decide how many training vs test samples you want
num_all = student_data.shape[0]  # same as len(student_data)
num_train = 300  # about 75% of the data
num_test = num_all - num_train

# TODO: Then, select features (X) and corresponding labels (y) for the training and test sets
# Note: Shuffle the data or randomly select samples to avoid any bias due to ordering in the dataset

from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split( X_all, y_all, test_size=num_test, random_state=42)

print "Training set: {} samples".format(X_train.shape[0])
print "Test set: {} samples".format(X_test.shape[0])
# Note: If you need a validation set, extract it from within training data

Training set: 300 samples
Test set: 95 samples
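
As a rough illustration of what a shuffled split does, the sketch below shuffles the row indices once with a fixed seed and holds out the last `num_test` of them. It uses only the standard library and placeholder data, so it will not reproduce the exact rows scikit-learn selects; `shuffle_split` is a hypothetical helper, not a scikit-learn function.

```python
import random

def shuffle_split(rows, labels, num_test, seed=42):
    """Shuffle indices once, then hold out num_test of them as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    cut = len(indices) - num_test
    train_idx, test_idx = indices[:cut], indices[cut:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Placeholder data with the same number of rows as the student dataset
rows = list(range(395))
labels = ["yes" if i % 3 else "no" for i in rows]

X_tr, X_te, y_tr, y_te = shuffle_split(rows, labels, num_test=95)
print("Training set: {} samples".format(len(X_tr)))
print("Test set: {} samples".format(len(X_te)))
```

Features and labels are indexed with the same shuffled indices, so each sample keeps its own label after the split.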


## 4. Training and Evaluating Models
Choose 3 supervised learning models that are available in scikit-learn, and appropriate for this problem. For each model:

- What is the theoretical O(n) time & space complexity in terms of input size?
- What are the general applications of this model? What are its strengths and weaknesses?
- Given what you know about the data so far, why did you choose this model to apply?
- Fit this model to the training data, try to predict labels (for both training and test sets), and measure the F<sub>1</sub> score. Repeat this process with different training set sizes (100, 200, 300), keeping test set constant.

Produce a table showing training time, prediction time, F<sub>1</sub> score on training set and F<sub>1</sub> score on test set, for each training set size.

Note: You need to produce 3 such tables - one for each model.

In [9]:
# Train a model
import time

def train_classifier(clf, X_train, y_train):
    print "Training {}...".format(clf.__class__.__name__)
    start = time.time()
    clf.fit(X_train, y_train)
    end = time.time()
    print "Done!\nTraining time (secs): {:.3f}".format(end - start)

# TODO: Choose a model, import it and instantiate an object
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()

# Fit model to training data
train_classifier(clf, X_train, y_train)  # note: using entire training set here
print clf  # you can inspect the learned model by printing it

Training GaussianNB...
Done!
Training time (secs): 0.004
GaussianNB()


In [10]:
# Predict on training set and compute F1 score
from sklearn.metrics import f1_score

def predict_labels(clf, features, target):
    print "Predicting labels using {}...".format(clf.__class__.__name__)
    start = time.time()
    y_pred = clf.predict(features)
    end = time.time()
    print "Done!\nPrediction time (secs): {:.3f}".format(end - start)
    return f1_score(target.values, y_pred, pos_label='yes')

train_f1_score = predict_labels(clf, X_train, y_train)
print "F1 score for training set: {}".format(train_f1_score)

Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.80378250591


In [11]:
# Predict on test data
print "F1 score for test set: {}".format(predict_labels(clf, X_test, y_test))

Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.763358778626
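
For reference, the F<sub>1</sub> score reported above is the harmonic mean of precision and recall for the positive class (`'yes'` here). The following plain-Python sketch shows that calculation on made-up labels rather than the student data; `f1_binary` is a hypothetical helper, not the scikit-learn function.

```python
def f1_binary(y_true, y_pred, pos_label="yes"):
    """F1 = 2 * precision * recall / (precision + recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = ["yes", "yes", "yes", "no", "no"]
y_pred = ["yes", "yes", "no", "yes", "no"]
score = f1_binary(y_true, y_pred)
print("F1 score: {:.3f}".format(score))
```

Because F<sub>1</sub> ignores true negatives, it rewards finding the positive class rather than agreeing on the negative one, which matters when classes are imbalanced.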


In [12]:
# Train and predict using different training set sizes
def train_predict(clf, X_train, y_train, X_test, y_test):
    print "------------------------------------------"
    print "Training set size: {}".format(len(X_train))
    train_classifier(clf, X_train, y_train)
    predict_train=predict_labels(clf, X_train, y_train)
    print "F1 score for training set: {}".format(predict_train)
    predict_test=predict_labels(clf, X_test, y_test)
    print "F1 score for test set: {}".format(predict_test)
    return predict_test,predict_train # let's return the scores, so we can use them for comparisons

# TODO: Run the helper function above for desired subsets of training data
# Note: Keep the test set constant
test,train = train_predict(clf, X_train, y_train, X_test, y_test)
print "Test:",test,"Train:",train

------------------------------------------
Training set size: 300
Training GaussianNB...
Done!
Training time (secs): 0.002
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.002
F1 score for training set: 0.80378250591
Predicting labels using GaussianNB...
Done!
Prediction time (secs): 0.002
F1 score for test set: 0.763358778626
Test: 0.763358778626 Train: 0.80378250591


In [112]:
# TODO: Train and predict using two other models
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn import linear_model, decomposition, datasets, ensemble
from sklearn.naive_bayes import GaussianNB
#from sklearn.neural_network import MLPClassifier # not in version 0.17

def runCLF():
    scores={}

    clf = DecisionTreeClassifier(class_weight ="balanced")
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["DecisionTreeClassifier"]=[test,train]

    clf = KNeighborsClassifier(n_jobs =-1)
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["KNeighborsClassifier"]=[test,train]

    clf=LinearSVC(class_weight="balanced")
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["LinearSVC"]=[test,train]

    clf=SVC(class_weight='balanced')
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["SVC"]=[test,train]

    clf=linear_model.LogisticRegression(n_jobs =-1)
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["LogisticRegression"]=[test,train]

    clf = ensemble.AdaBoostClassifier()
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["AdaBoostClassifier-plain"]=[test,train]

    clf = ensemble.AdaBoostClassifier(GaussianNB())
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["AdaBoostClassifier-GaussianNB"]=[test,train]

    clf = ensemble.AdaBoostClassifier(DecisionTreeClassifier(max_depth=2,class_weight ="balanced"))
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["AdaBoostClassifier-DecisionTreeClassifier"]=[test,train]

    clf = ensemble.RandomForestClassifier(n_jobs =-1,class_weight="balanced")
    test,train = train_predict(clf, X_train, y_train, X_test, y_test)
    scores["RandomForestClassifier"]=[test,train]

    scoresDF= pd.DataFrame(scores,index=['test','train']).T
    return scoresDF.sort_values('test',ascending=False)[:3]
runs = 1
holdData=[]
for r in range(runs) :
    holdData.append(runCLF())

for h in holdData:
    print h

------------------------------------------
Training set size: 300
Training DecisionTreeClassifier...
Done!
Training time (secs): 0.002
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for training set: 1.0
Predicting labels using DecisionTreeClassifier...
Done!
Prediction time (secs): 0.001
F1 score for test set: 0.724409448819
------------------------------------------
Training set size: 300
Training KNeighborsClassifier...
Done!
Training time (secs): 0.001
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.155
F1 score for training set: 0.880898876404
Predicting labels using KNeighborsClassifier...
Done!
Prediction time (secs): 0.142
F1 score for test set: 0.780141843972
------------------------------------------
Training set size: 300
Training LinearSVC...
Done!
Training time (secs): 0.055
Predicting labels using LinearSVC...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.603278688525
Predic


* What is the theoretical O(n) time & space complexity in terms of input size?
* What are the general applications of this model? What are its strengths and weaknesses?
* Given what you know about the data so far, why did you choose this model to apply?

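Experimentation pointed to three models that may classify this data best: LogisticRegression, AdaBoost, and KNeighbors.

LogisticRegression is an eager learner: it does its work at training time, so it is fast to query. The trade-off is that it must be retrained to take new data into account.

AdaBoost is a boosting algorithm: it trains a sequence of weak learners, gives more weight to the samples that earlier learners misclassified, and combines their votes.

KNeighbors is a lazy learner: it stores the training data and defers the work to query time. It therefore needs more memory and prediction time, but new data can be added without a separate retraining step.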
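
The lazy-learner behaviour can be sketched in a few lines of plain Python. This is only an illustration, not how `KNeighborsClassifier` is implemented; the class name `TinyKNN` and the toy points are assumptions.

```python
from collections import Counter

class TinyKNN(object):
    """Toy k-nearest-neighbours classifier illustrating lazy learning."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no model is built here
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        # All the work happens at query time: measure the squared Euclidean
        # distance to every stored point, then let the k closest ones vote
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, xi)), label)
            for xi, label in zip(self.X, self.y)
        )
        votes = Counter(label for _, label in dists[:self.k])
        return votes.most_common(1)[0][0]

# Toy 2-D points in two well-separated groups
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["no", "no", "no", "yes", "yes", "yes"]

knn = TinyKNN(k=3).fit(X, y)
near_origin = knn.predict_one((0.5, 0.5))
far_corner = knn.predict_one((5.5, 5.5))
print("{} {}".format(near_origin, far_corner))
```

Note that `fit` is trivially cheap while `predict_one` scans every stored sample, which matches the timings recorded above: KNeighbors trains quickly but is the slowest model at prediction time.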

## 5. Choosing the Best Model

- Based on the experiments you performed earlier, in 1-2 paragraphs explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?
- In 1-2 paragraphs explain to the board of supervisors in layman's terms how the final model chosen is supposed to work (for example if you chose a Decision Tree or Support Vector Machine, how does it make a prediction).
- Fine-tune the model. Use Gridsearch with at least one important parameter tuned and with at least 3 settings. Use the entire training set for this.
- What is the model's final F<sub>1</sub> score?
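
Conceptually, a grid search tries every combination of the listed parameter values and keeps the best-scoring one. The sketch below shows that loop in plain Python. The scoring function `toy_score` is a made-up stand-in for a cross-validated F<sub>1</sub> score, and `grid_search` is a hypothetical helper, not the scikit-learn class.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(C, penalty):
    # Assumption for illustration only: pretend the score peaks at C=0.5
    # and that the 'l2' penalty is slightly better than 'l1'
    return -abs(C - 0.5) + (0.01 if penalty == "l2" else 0.0)

param_grid = {"C": [0.05, 0.5, 10.0], "penalty": ["l1", "l2"]}
best_params, best_score = grid_search(param_grid, toy_score)
print("Best parameters: {}".format(sorted(best_params.items())))
```

The number of fits grows multiplicatively with each added parameter (3 x 2 = 6 combinations here), so on a small dataset it is usually worth tuning only the parameters that matter most.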

In [118]:
# TODO: Fine-tune your model and report the best F1 score

def modelTune(clf, params, X, y):
    #Fine tune model with grid search
    from sklearn.grid_search import GridSearchCV
    from sklearn.metrics import make_scorer, f1_score


    grid_search = GridSearchCV(clf, 
                               param_grid=params, 
                               cv=4, 
                               n_jobs= 2,
                               scoring=make_scorer(f1_score, 
                                                   pos_label="yes",
                                                   greater_is_better=True)) 

    grid_search.fit(X, y)

    # Pull the best parameters out of the grid search and set them on clf.
    # Probably not necessary, but easier to deal with.
    clf.set_params(**grid_search.best_params_)
    return clf
    # To check each parameter combination and its score:
    #for gs in grid_search.grid_scores_:
    #    print gs


In [119]:

print "best classifier was:"
clf=linear_model.LogisticRegression(n_jobs =-1)
test_def,train_def=train_predict(clf, X_train, y_train, X_test, y_test)
print "\nOriginal clf:\n",clf #compare after gridsearch


params={'C' : [.005,.05,.5,1.,10.,100.,],
        'fit_intercept' : [True, False],
        'class_weight': [ None,'balanced'],
        'random_state' : [None,42],
        #'solver' : ['newton-cg', 'lbfgs', 'liblinear'],#, 'sag'],
        'tol': [0.00001,0.0001,.001],
        'penalty': ['l1', 'l2']
       }

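from sklearn.base import clone
# Tune a copy so the default clf stays available for comparison,
# and search on the training set only so the test set stays unseen
clf_tuned=modelTune(clone(clf), params, X_train, y_train)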

test_GS,train_GS=train_predict(clf_tuned, X_train, y_train, X_test, y_test)
print "GridSearch best clf:\n",clf_tuned 

# The dataset is small, so grid search may not find better parameters than the defaults.
# Check for that and report whichever F1 score on the full data set is better.
clf_GS_F1=predict_labels(clf_tuned, X_all, y_all)
clf_def_F1=predict_labels(clf, X_all, y_all)
if clf_GS_F1 >= clf_def_F1:
    print "\nFinal clf_GS, F1 score:",clf_GS_F1,"\n",clf_tuned # after grid search
else:
    print "\nFinal clf_def, F1 score:",clf_def_F1,"\n",clf # default parameters
    
clf_GS_F1

best classifier was:
------------------------------------------
Training set size: 300
Training LogisticRegression...
Done!
Training time (secs): 0.004
Predicting labels using LogisticRegression...
Done!
Prediction time (secs): 0.000
F1 score for training set: 0.846846846847
Predicting labels using LogisticRegression...
Done!
Prediction time (secs): 0.000
F1 score for test set: 0.805970149254

Original clf:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


JoblibValueError: JoblibValueError
___________________________________________________________________________
Multiprocessing exception:
...........................................................................
C:\Anaconda2\lib\runpy.py in _run_module_as_main(mod_name='ipykernel.__main__', alter_argv=1)
    157     pkg_name = mod_name.rpartition('.')[0]
    158     main_globals = sys.modules["__main__"].__dict__
    159     if alter_argv:
    160         sys.argv[0] = fname
    161     return _run_code(code, main_globals, None,
--> 162                      "__main__", fname, loader, pkg_name)
        fname = r'C:\Anaconda2\lib\site-packages\ipykernel\__main__.py'
        loader = <pkgutil.ImpLoader instance>
        pkg_name = 'ipykernel'
    163 
    164 def run_module(mod_name, init_globals=None,
    165                run_name=None, alter_sys=False):
    166     """Execute a module's code without importing it

...........................................................................
C:\Anaconda2\lib\runpy.py in _run_code(code=<code object <module> at 000000000226FBB0, file ...lib\site-packages\ipykernel\__main__.py", line 1>, run_globals={'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': r'C:\Anaconda2\lib\site-packages\ipykernel\__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from 'C:\Anaconda2\lib\site-packages\ipykernel\kernelapp.pyc'>}, init_globals=None, mod_name='__main__', mod_fname=r'C:\Anaconda2\lib\site-packages\ipykernel\__main__.py', mod_loader=<pkgutil.ImpLoader instance>, pkg_name='ipykernel')
     67         run_globals.update(init_globals)
     68     run_globals.update(__name__ = mod_name,
     69                        __file__ = mod_fname,
     70                        __loader__ = mod_loader,
     71                        __package__ = pkg_name)
---> 72     exec code in run_globals
        code = <code object <module> at 000000000226FBB0, file ...lib\site-packages\ipykernel\__main__.py", line 1>
        run_globals = {'__builtins__': <module '__builtin__' (built-in)>, '__doc__': None, '__file__': r'C:\Anaconda2\lib\site-packages\ipykernel\__main__.py', '__loader__': <pkgutil.ImpLoader instance>, '__name__': '__main__', '__package__': 'ipykernel', 'app': <module 'ipykernel.kernelapp' from 'C:\Anaconda2\lib\site-packages\ipykernel\kernelapp.pyc'>}
     73     return run_globals
     74 
     75 def _run_module_code(code, init_globals=None,
     76                     mod_name=None, mod_fname=None,

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\__main__.py in <module>()
      1 
      2 
----> 3 
      4 if __name__ == '__main__':
      5     from ipykernel import kernelapp as app
      6     app.launch_new_instance()
      7 
      8 
      9 
     10 

...........................................................................
C:\Anaconda2\lib\site-packages\traitlets\config\application.py in launch_instance(cls=<class 'ipykernel.kernelapp.IPKernelApp'>, argv=None, **kwargs={})
    584         
    585         If a global instance already exists, this reinitializes and starts it
    586         """
    587         app = cls.instance(**kwargs)
    588         app.initialize(argv)
--> 589         app.start()
        app.start = <bound method IPKernelApp.start of <ipykernel.kernelapp.IPKernelApp object>>
    590 
    591 #-----------------------------------------------------------------------------
    592 # utility functions, for convenience
    593 #-----------------------------------------------------------------------------

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\kernelapp.py in start(self=<ipykernel.kernelapp.IPKernelApp object>)
    398         
    399         if self.poller is not None:
    400             self.poller.start()
    401         self.kernel.start()
    402         try:
--> 403             ioloop.IOLoop.instance().start()
    404         except KeyboardInterrupt:
    405             pass
    406 
    407 launch_new_instance = IPKernelApp.launch_instance

...........................................................................
C:\Anaconda2\lib\site-packages\zmq\eventloop\ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    146             PollIOLoop.configure(ZMQIOLoop)
    147         return PollIOLoop.instance()
    148     
    149     def start(self):
    150         try:
--> 151             super(ZMQIOLoop, self).start()
        self.start = <bound method ZMQIOLoop.start of <zmq.eventloop.ioloop.ZMQIOLoop object>>
    152         except ZMQError as e:
    153             if e.errno == ETERM:
    154                 # quietly return on ETERM
    155                 pass

...........................................................................
C:\Anaconda2\lib\site-packages\tornado\ioloop.py in start(self=<zmq.eventloop.ioloop.ZMQIOLoop object>)
    878                 self._events.update(event_pairs)
    879                 while self._events:
    880                     fd, events = self._events.popitem()
    881                     try:
    882                         fd_obj, handler_func = self._handlers[fd]
--> 883                         handler_func(fd_obj, events)
        handler_func = <function null_wrapper>
        fd_obj = <zmq.sugar.socket.Socket object>
        events = 1
    884                     except (OSError, IOError) as e:
    885                         if errno_from_exception(e) == errno.EPIPE:
    886                             # Happens when the client closes the connection
    887                             pass

...........................................................................
C:\Anaconda2\lib\site-packages\tornado\stack_context.py in null_wrapper(*args=(<zmq.sugar.socket.Socket object>, 1), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = (<zmq.sugar.socket.Socket object>, 1)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
C:\Anaconda2\lib\site-packages\zmq\eventloop\zmqstream.py in _handle_events(self=<zmq.eventloop.zmqstream.ZMQStream object>, fd=<zmq.sugar.socket.Socket object>, events=1)
    428             # dispatch events:
    429             if events & IOLoop.ERROR:
    430                 gen_log.error("got POLLERR event on ZMQStream, which doesn't make sense")
    431                 return
    432             if events & IOLoop.READ:
--> 433                 self._handle_recv()
        self._handle_recv = <bound method ZMQStream._handle_recv of <zmq.eventloop.zmqstream.ZMQStream object>>
    434                 if not self.socket:
    435                     return
    436             if events & IOLoop.WRITE:
    437                 self._handle_send()

...........................................................................
C:\Anaconda2\lib\site-packages\zmq\eventloop\zmqstream.py in _handle_recv(self=<zmq.eventloop.zmqstream.ZMQStream object>)
    460                 gen_log.error("RECV Error: %s"%zmq.strerror(e.errno))
    461         else:
    462             if self._recv_callback:
    463                 callback = self._recv_callback
    464                 # self._recv_callback = None
--> 465                 self._run_callback(callback, msg)
        self._run_callback = <bound method ZMQStream._run_callback of <zmq.eventloop.zmqstream.ZMQStream object>>
        callback = <function null_wrapper>
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    466                 
    467         # self.update_state()
    468         
    469 

...........................................................................
C:\Anaconda2\lib\site-packages\zmq\eventloop\zmqstream.py in _run_callback(self=<zmq.eventloop.zmqstream.ZMQStream object>, callback=<function null_wrapper>, *args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    402         close our socket."""
    403         try:
    404             # Use a NullContext to ensure that all StackContexts are run
    405             # inside our blanket exception handler rather than outside.
    406             with stack_context.NullContext():
--> 407                 callback(*args, **kwargs)
        callback = <function null_wrapper>
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    408         except:
    409             gen_log.error("Uncaught exception, closing connection.",
    410                           exc_info=True)
    411             # Close the socket on an uncaught exception from a user callback

...........................................................................
C:\Anaconda2\lib\site-packages\tornado\stack_context.py in null_wrapper(*args=([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],), **kwargs={})
    270         # Fast path when there are no active contexts.
    271         def null_wrapper(*args, **kwargs):
    272             try:
    273                 current_state = _state.contexts
    274                 _state.contexts = cap_contexts[0]
--> 275                 return fn(*args, **kwargs)
        args = ([<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>],)
        kwargs = {}
    276             finally:
    277                 _state.contexts = current_state
    278         null_wrapper._wrapped = True
    279         return null_wrapper

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\kernelbase.py in dispatcher(msg=[<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>])
    255         if self.control_stream:
    256             self.control_stream.on_recv(self.dispatch_control, copy=False)
    257 
    258         def make_dispatcher(stream):
    259             def dispatcher(msg):
--> 260                 return self.dispatch_shell(stream, msg)
        msg = [<zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>, <zmq.sugar.frame.Frame object>]
    261             return dispatcher
    262 
    263         for s in self.shell_streams:
    264             s.on_recv(make_dispatcher(s), copy=False)

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\kernelbase.py in dispatch_shell(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, msg={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2016-03-29T16:53:16.609000', u'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', u'msg_type': u'execute_request', u'session': u'44400A7D9C034E9F86936DE42A9874F8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', 'msg_type': u'execute_request', 'parent_header': {}})
    207             self.log.error("UNKNOWN MESSAGE TYPE: %r", msg_type)
    208         else:
    209             self.log.debug("%s: %s", msg_type, msg)
    210             self.pre_handler_hook()
    211             try:
--> 212                 handler(stream, idents, msg)
        handler = <bound method IPythonKernel.execute_request of <ipykernel.ipkernel.IPythonKernel object>>
        stream = <zmq.eventloop.zmqstream.ZMQStream object>
        idents = ['44400A7D9C034E9F86936DE42A9874F8']
        msg = {'buffers': [], 'content': {u'allow_stdin': True, u'code': u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2016-03-29T16:53:16.609000', u'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', u'msg_type': u'execute_request', u'session': u'44400A7D9C034E9F86936DE42A9874F8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', 'msg_type': u'execute_request', 'parent_header': {}}
    213             except Exception:
    214                 self.log.error("Exception in message handler:", exc_info=True)
    215             finally:
    216                 self.post_handler_hook()

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\kernelbase.py in execute_request(self=<ipykernel.ipkernel.IPythonKernel object>, stream=<zmq.eventloop.zmqstream.ZMQStream object>, ident=['44400A7D9C034E9F86936DE42A9874F8'], parent={'buffers': [], 'content': {u'allow_stdin': True, u'code': u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1', u'silent': False, u'stop_on_error': True, u'store_history': True, u'user_expressions': {}}, 'header': {'date': '2016-03-29T16:53:16.609000', u'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', u'msg_type': u'execute_request', u'session': u'44400A7D9C034E9F86936DE42A9874F8', u'username': u'username', u'version': u'5.0'}, 'metadata': {}, 'msg_id': u'9E5D2DC47D02467B815FC7C75CBC3C6D', 'msg_type': u'execute_request', 'parent_header': {}})
    365         if not silent:
    366             self.execution_count += 1
    367             self._publish_execute_input(code, parent, self.execution_count)
    368 
    369         reply_content = self.do_execute(code, silent, store_history,
--> 370                                         user_expressions, allow_stdin)
        user_expressions = {}
        allow_stdin = True
    371 
    372         # Flush output before sending the reply.
    373         sys.stdout.flush()
    374         sys.stderr.flush()

...........................................................................
C:\Anaconda2\lib\site-packages\ipykernel\ipkernel.py in do_execute(self=<ipykernel.ipkernel.IPythonKernel object>, code=u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1', silent=False, store_history=True, user_expressions={}, allow_stdin=True)
    170 
    171         reply_content = {}
    172         # FIXME: the shell calls the exception handler itself.
    173         shell._reply_content = None
    174         try:
--> 175             shell.run_cell(code, store_history=store_history, silent=silent)
        shell.run_cell = <bound method ZMQInteractiveShell.run_cell of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1'
        store_history = True
        silent = False
    176         except:
    177             status = u'error'
    178             # FIXME: this code right now isn't being used yet by default,
    179             # because the run_cell() call above directly fires off exception

...........................................................................
C:\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py in run_cell(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, raw_cell=u'\nprint "best classifier was:"\nclf=linear_mod...f_F1,"\\n",clf #after gridseach\n    \nclf_GS_F1', store_history=True, silent=False, shell_futures=True)
   2718                 self.displayhook.exec_result = result
   2719 
   2720                 # Execute the user code
   2721                 interactivity = "none" if silent else self.ast_node_interactivity
   2722                 self.run_ast_nodes(code_ast.body, cell_name,
-> 2723                    interactivity=interactivity, compiler=compiler, result=result)
        interactivity = 'last_expr'
        compiler = <IPython.core.compilerop.CachingCompiler instance>
   2724 
   2725                 # Reset this so later displayed values do not modify the
   2726                 # ExecutionResult
   2727                 self.displayhook.exec_result = None

...........................................................................
C:\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py in run_ast_nodes(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, nodelist=[<_ast.Print object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Print object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.Print object>, <_ast.Assign object>, <_ast.Assign object>, <_ast.If object>, <_ast.Expr object>], cell_name='<ipython-input-119-1deeaf7e00f5>', interactivity='last', compiler=<IPython.core.compilerop.CachingCompiler instance>, result=<IPython.core.interactiveshell.ExecutionResult object>)
   2820 
   2821         try:
   2822             for i, node in enumerate(to_run_exec):
   2823                 mod = ast.Module([node])
   2824                 code = compiler(mod, cell_name, "exec")
-> 2825                 if self.run_code(code, result):
        self.run_code = <bound method ZMQInteractiveShell.run_code of <ipykernel.zmqshell.ZMQInteractiveShell object>>
        code = <code object <module> at 000000001F583D30, file "<ipython-input-119-1deeaf7e00f5>", line 17>
        result = <IPython.core.interactiveshell.ExecutionResult object>
   2826                     return True
   2827 
   2828             for i, node in enumerate(to_run_interactive):
   2829                 mod = ast.Interactive([node])

...........................................................................
C:\Anaconda2\lib\site-packages\IPython\core\interactiveshell.py in run_code(self=<ipykernel.zmqshell.ZMQInteractiveShell object>, code_obj=<code object <module> at 000000001F583D30, file "<ipython-input-119-1deeaf7e00f5>", line 17>, result=<IPython.core.interactiveshell.ExecutionResult object>)
   2880         outflag = 1  # happens in more places, so it's easier as default
   2881         try:
   2882             try:
   2883                 self.hooks.pre_run_code_hook()
   2884                 #rprint('Running code', repr(code_obj)) # dbg
-> 2885                 exec(code_obj, self.user_global_ns, self.user_ns)
        code_obj = <code object <module> at 000000001F583D30, file "<ipython-input-119-1deeaf7e00f5>", line 17>
        self.user_global_ns = {'DecisionTreeClassifier': <class 'sklearn.tree.tree.DecisionTreeClassifier'>, 'GaussianNB': <class 'sklearn.naive_bayes.GaussianNB'>, 'GridSearchCV': <class 'sklearn.grid_search.GridSearchCV'>, 'In': ['', u'# Import libraries\nimport numpy as np\nimport pandas as pd', u'# Read student data\nstudent_data = pd.read_cs... the target/label, all other are feature columns', u'# TODO: Compute desired values - replace each ...on rate of the class: {:.2f}%".format(grad_rate)', u'# Extract feature (X) and target (y) columns\n...-"\nprint X_all.head()  # print the first 5 rows', u'# Preprocess feature columns\ndef preprocess_f....format(len(X_all.columns), list(X_all.columns))', u'# First, decide how many training vs test samp...dation set, extract it from within training data', u'# Train a model\nimport time\n\ndef train_clas...you can inspect the learned model by printing it', u'# Predict on training set and compute F1 score...ore for training set: {}".format(train_f1_score)', u'# Predict on test data\nprint "F1 score for te... 
{}".format(predict_labels(clf, X_test, y_test))', u'# Train and predict using different training s...test, y_test)\nprint "Test:",test,"Train:",train', u'# TODO: Train and predict using two other mode...int scoresDF.sort(\'test\',ascending=False)[:-1]', u'# TODO: Fine-tune your model and report the be...ain, X_test, y_test)\nprint clf #after gridseach', u'# TODO: Fine-tune your model and report the be...ain, X_test, y_test)\nprint clf #after gridseach', u'# TODO: Fine-tune your model and report the be...ain, X_test, y_test)\nprint clf #after gridseach', u'# TODO: Fine-tune your model and report the be...ain, X_test, y_test)\nprint clf #after gridseach', u'# TODO: Fine-tune your model and report the be...ain, X_test, y_test)\nprint clf #after gridseach', u'# TODO: Train and predict using two other mode...int scoresDF.sort(\'test\',ascending=False)[:-1]', u'# TODO: Train and predict using two other mode...int scoresDF.sort(\'test\',ascending=False)[:-1]', u'# TODO: Train and predict using two other mode...int scoresDF.sort(\'test\',ascending=False)[:-1]', ...], 'KNeighborsClassifier': <class 'sklearn.neighbors.classification.KNeighborsClassifier'>, 'LinearSVC': <class 'sklearn.svm.classes.LinearSVC'>, 'Out': {}, 'SVC': <class 'sklearn.svm.classes.SVC'>, 'X_all':      school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns], 'X_test':      school_GP  school_MS  sex_F  sex_M  age  ad...    2       1         0  

[95 rows x 48 columns], ...}
   2886             finally:
   2887                 # Reset our crash handler in place
   2888                 sys.excepthook = old_excepthook
   2889         except SystemExit as e:

...........................................................................
C:\cygwin\home\llathrop\projects\udacity-Projects\student_intervention\<ipython-input-119-1deeaf7e00f5> in <module>()
     12         #'solver' : ['newton-cg', 'lbfgs', 'liblinear'],#, 'sag'],
     13         'tol': [0.00001,0.0001,.001],
     14         'penalty': ['l1', 'l2']
     15        }
     16 
---> 17 clf_tuned=modelTune(clf, params,X_all,y_all)
     18 
     19 test_GS,train_GS=train_predict(clf_tuned, X_train, y_train, X_test, y_test)
     20 print "GridSearch best clf:\n",clf_tuned 
     21 

...........................................................................
C:\cygwin\home\llathrop\projects\udacity-Projects\student_intervention\<ipython-input-118-1382769c7c96> in modelTune(clf=LogisticRegression(C=1.0, class_weight=None, dua...ol=0.0001,
          verbose=0, warm_start=False), params={'C': [0.005, 0.05, 0.5, 1.0, 10.0, 100.0], 'class_weight': [None, 'balanced'], 'fit_intercept': [True, False], 'penalty': ['l1', 'l2'], 'random_state': [None, 42], 'tol': [1e-05, 0.0001, 0.001]}, X=     school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns], y=0      0
1      0
2      1
3      1
4      1
5  ...   0
393    1
394    0
Name: passed, dtype: int64)
     12                                n_jobs= 2,
     13                                scoring=make_scorer(f1_score, 
     14                                                    pos_label="yes",
     15                                                    greater_is_better=True)) 
     16 
---> 17     grid_search.fit(X, y)
     18 
     19     #reach into the grid search and pull out the best parameters, and set those on clf. probably not necesary, but easier to deal with
     20     bestGridParams={}
     21     for bp in grid_search.best_params_:

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\grid_search.py in fit(self=GridSearchCV(cv=4, error_score='raise',
       e...=make_scorer(f1_score, pos_label=yes), verbose=0), X=     school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns], y=0      0
1      0
2      1
3      1
4      1
5  ...   0
393    1
394    0
Name: passed, dtype: int64)
    799         y : array-like, shape = [n_samples] or [n_samples, n_output], optional
    800             Target relative to X for classification or regression;
    801             None for unsupervised learning.
    802 
    803         """
--> 804         return self._fit(X, y, ParameterGrid(self.param_grid))
        self._fit = <bound method GridSearchCV._fit of GridSearchCV(...make_scorer(f1_score, pos_label=yes), verbose=0)>
        X =      school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns]
        y = 0      0
1      0
2      1
3      1
4      1
5  ...   0
393    1
394    0
Name: passed, dtype: int64
        self.param_grid = {'C': [0.005, 0.05, 0.5, 1.0, 10.0, 100.0], 'class_weight': [None, 'balanced'], 'fit_intercept': [True, False], 'penalty': ['l1', 'l2'], 'random_state': [None, 42], 'tol': [1e-05, 0.0001, 0.001]}
    805 
    806 
    807 class RandomizedSearchCV(BaseSearchCV):
    808     """Randomized search on hyper parameters.

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\grid_search.py in _fit(self=GridSearchCV(cv=4, error_score='raise',
       e...=make_scorer(f1_score, pos_label=yes), verbose=0), X=     school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns], y=0      0
1      0
2      1
3      1
4      1
5  ...   0
393    1
394    0
Name: passed, dtype: int64, parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
    548         )(
    549             delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
    550                                     train, test, self.verbose, parameters,
    551                                     self.fit_params, return_parameters=True,
    552                                     error_score=self.error_score)
--> 553                 for parameters in parameter_iterable
        parameters = undefined
        parameter_iterable = <sklearn.grid_search.ParameterGrid object>
    554                 for train, test in cv)
    555 
    556         # Out is a list of triplet: score, estimator, n_test_samples
    557         n_fits = len(out)

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr>>)
    807             if pre_dispatch == "all" or n_jobs == 1:
    808                 # The iterable was consumed all at once by the above for loop.
    809                 # No need to wait for async callbacks to trigger to
    810                 # consumption.
    811                 self._iterating = False
--> 812             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
    813             # Make sure that we get a last message telling us we are done
    814             elapsed_time = time.time() - self._start_time
    815             self._print('Done %3i out of %3i | elapsed: %s finished',
    816                         (len(self._output), len(self._output),

---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
ValueError                                         Tue Mar 29 16:53:46 2016
PID: 6764                            Python 2.7.11: C:\Anaconda2\python.exe
...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self=<sklearn.externals.joblib.parallel.BatchedCalls object>)
     67     def __init__(self, iterator_slice):
     68         self.items = list(iterator_slice)
     69         self._size = len(self.items)
     70 
     71     def __call__(self):
---> 72         return [func(*args, **kwargs) for func, args, kwargs in self.items]
     73 
     74     def __len__(self):
     75         return self._size
     76 

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator=LogisticRegression(C=0.005, class_weight=None, d...tol=1e-05,
          verbose=0, warm_start=False), X=     school_GP  school_MS  sex_F  sex_M  age  ad...   3       5         5  

[395 rows x 48 columns], y=0      0
1      0
2      1
3      1
4      1
5  ...   0
393    1
394    0
Name: passed, dtype: int64, scorer=make_scorer(f1_score, pos_label=yes), train=array([ 88,  91,  93,  94,  95,  96,  97,  98, 1...    386, 387, 388, 389, 390, 391, 392, 393, 394]), test=array([  0,   1,   2,   3,   4,   5,   6,   7,  ...     99, 100, 103, 106, 114, 118, 124, 127, 128]), verbose=0, parameters={'C': 0.005, 'class_weight': None, 'fit_intercept': True, 'penalty': 'l1', 'random_state': None, 'tol': 1e-05}, fit_params={}, return_train_score=False, return_parameters=True, error_score='raise')
   1545                              " numeric value. (Hint: if using 'raise', please"
   1546                              " make sure that it has been spelled correctly.)"
   1547                              )
   1548 
   1549     else:
-> 1550         test_score = _score(estimator, X_test, y_test, scorer)
   1551         if return_train_score:
   1552             train_score = _score(estimator, X_train, y_train, scorer)
   1553 
   1554     scoring_time = time.time() - start_time

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\cross_validation.pyc in _score(estimator=LogisticRegression(C=0.005, class_weight=None, d...tol=1e-05,
          verbose=0, warm_start=False), X_test=     school_GP  school_MS  sex_F  sex_M  age  ad...   2       4         0  

[100 rows x 48 columns], y_test=0      0
1      0
2      1
3      1
4      1
5  ...   0
127    0
128    0
Name: passed, dtype: int64, scorer=make_scorer(f1_score, pos_label=yes))
   1601 def _score(estimator, X_test, y_test, scorer):
   1602     """Compute the score of an estimator on a given test set."""
   1603     if y_test is None:
   1604         score = scorer(estimator, X_test)
   1605     else:
-> 1606         score = scorer(estimator, X_test, y_test)
   1607     if not isinstance(score, numbers.Number):
   1608         raise ValueError("scoring must return a number, got %s (%s) instead."
   1609                          % (str(score), type(score)))
   1610     return score

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\metrics\scorer.pyc in __call__(self=make_scorer(f1_score, pos_label=yes), estimator=LogisticRegression(C=0.005, class_weight=None, d...tol=1e-05,
          verbose=0, warm_start=False), X=     school_GP  school_MS  sex_F  sex_M  age  ad...   2       4         0  

[100 rows x 48 columns], y_true=0      0
1      0
2      1
3      1
4      1
5  ...   0
127    0
128    0
Name: passed, dtype: int64, sample_weight=None)
     85             return self._sign * self._score_func(y_true, y_pred,
     86                                                  sample_weight=sample_weight,
     87                                                  **self._kwargs)
     88         else:
     89             return self._sign * self._score_func(y_true, y_pred,
---> 90                                                  **self._kwargs)
     91 
     92 
     93 class _ProbaScorer(_BaseScorer):
     94     def __call__(self, clf, X, y, sample_weight=None):

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\metrics\classification.pyc in f1_score(y_true=0      0
1      0
2      1
3      1
4      1
5  ...   0
127    0
128    0
Name: passed, dtype: int64, y_pred=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..., 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int64), labels=None, pos_label='yes', average='binary', sample_weight=None)
    634 
    635 
    636     """
    637     return fbeta_score(y_true, y_pred, 1, labels=labels,
    638                        pos_label=pos_label, average=average,
--> 639                        sample_weight=sample_weight)
        average = 'binary'
    640 
    641 
    642 def fbeta_score(y_true, y_pred, beta, labels=None, pos_label=1,
    643                 average='binary', sample_weight=None):

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\metrics\classification.pyc in fbeta_score(y_true=0      0
1      0
2      1
3      1
4      1
5  ...   0
127    0
128    0
Name: passed, dtype: int64, y_pred=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..., 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int64), beta=1, labels=None, pos_label='yes', average='binary', sample_weight=None)
    751                                                  beta=beta,
    752                                                  labels=labels,
    753                                                  pos_label=pos_label,
    754                                                  average=average,
    755                                                  warn_for=('f-score',),
--> 756                                                  sample_weight=sample_weight)
    757     return f
    758 
    759 
    760 def _prf_divide(numerator, denominator, metric, modifier, average, warn_for):

...........................................................................
C:\Anaconda2\lib\site-packages\sklearn\metrics\classification.pyc in precision_recall_fscore_support(y_true=array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1,..., 0,
       0, 0, 0, 0, 0, 0, 0, 0], dtype=int64), y_pred=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,..., 1,
       1, 1, 1, 1, 1, 1, 1, 1], dtype=int64), beta=1, labels=None, pos_label='yes', average='binary', warn_for=('f-score',), sample_weight=None)
    979                 if len(present_labels) < 2:
    980                     # Only negative labels
    981                     return (0., 0., 0., 0)
    982                 else:
    983                     raise ValueError("pos_label=%r is not a valid label: %r" %
--> 984                                      (pos_label, present_labels))
    985             labels = [pos_label]
    986     if labels is None:
    987         labels = present_labels
    988         n_labels = None

ValueError: pos_label='yes' is not a valid label: array([0, 1], dtype=int64)
___________________________________________________________________________
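
The final `ValueError` points at the likely cause: the traceback shows `y` as an `int64` series of 0/1 values, while the scorer was built with `pos_label="yes"`. Below is a minimal sketch of the mismatch and one possible fix, assuming the target stays encoded as 0/1 with 1 meaning "passed". The toy arrays are made up for illustration and are not the student data; `mismatch_error`, `f1_fixed` and `scorer` are hypothetical names.

```python
from sklearn.metrics import f1_score, make_scorer

# Toy labels (illustrative only): 1 = passed, 0 = failed.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1, 1]  # a model that predicts "passed" for everyone

# Reproduce the mismatch: the labels are integers, but pos_label is a string.
try:
    f1_score(y_true, y_pred, pos_label="yes")
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

# Possible fix: make pos_label match the encoding actually present in y.
f1_fixed = f1_score(y_true, y_pred, pos_label=1)

# The same setting applied to a scorer that could be passed to a grid search.
scorer = make_scorer(f1_score, pos_label=1)

print(mismatch_error)
print(f1_fixed)
```

The alternative would be to leave the target as the original `'yes'`/`'no'` strings and keep `pos_label="yes"`; either way the scorer and the labels need to agree.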