# Validation Methods Investigation Using Breast Cancert Dataset


Objectives for this notebook:
- Implement sklearn pipelines applied to various models
- Use gridsearch with different regularization techniques
- Implement a step forward variable selection algorithm

Types of regularization:
1. Lasso & Ridge
2. ElastoNet




In [169]:
#import librarys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('dark')
pd.set_option('display.max_colwidth', 1000)
import warnings
warnings.filterwarnings('ignore')

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

Investigate the data and understand the features. Load in DataFrame format.

In [199]:
data = load_breast_cancer()
#extract data
features, target = data.data, data.target

#create dataframe
data = pd.DataFrame(features, columns=data.feature_names)

#join the target column
data = pd.concat([data, pd.DataFrame(target)], axis=1)

#rename the target column
data.rename(columns={0: 'target'}, inplace=True)

#remove the spaced in the columns headers
data.columns = data.columns.str.replace(" ", "_")

print(data.shape)
data.head()

(569, 31)


Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


#### Observations
1. Data should be normalized before classifiying
2. Relatively small data set. Use kfold validation on all model types.
2. There are 30 features that likely have colinearity. Consider PCA.

In [None]:
#plot distributions

In [103]:
data.iloc[:,-1].value_counts()

1    357
0    212
Name: nan, dtype: int64

In [104]:
212/357

0.5938375350140056

### Define helper functions

- **load_data** - Retrieves the data from sklearn
- **get_train_test** - Splits the data in train and test
- **classify_model** - Takes in a model, and applied gridsearch. Returns the best model paramers, roc auc score, classification report, and model results report
- **pipeline_pca** - A pipeline that takes a model and applies standard scaler and pca

In [40]:
#funciton to load the data
def load_data():
    """
    Function to load the data in numpy array format.
    
    Use for input to the Classifier pipeline.
    """
    
    data = load_breast_cancer()
    X, Y = data.data, data.target
    
    return X, Y

In [72]:
def get_train_test(train_size=0.7, random_state=99):
    """
    
    Loads the data.
    Returns X and Y test and train sets.
    
    """
    
    X, Y = load_data()

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y)#, train_size=train_size, random_state=random_state)
    
    return X_train, X_test, Y_train, Y_test
    

In [189]:
def classify_model(model, parameters, score='roc_auc'):
    
    """
    General function for implementing grid search with validation to any model.
    
    Returns 
    
    """
    #load the train and test sets
    X_train, X_test, Y_train, Y_test = get_train_test(X, Y)
    
    #initalize kfolds
    kf = KFold(10, random_state=99)
    
    #initalize the grid search of the selected model
    #call return_train_score true to include training scores in the results output
    model_gs = GridSearchCV(model, parameters, cv=kf, scoring=score, return_train_score=True)
    
    #fit the model
    model_gs.fit(X_train.astype(float), Y_train)
    
    #get model predictions
    Y_predictions = model_gs.predict(X_test.astype(float))
    
    #return various scoring metrics for model
    cr = classification_report(Y_test, Y_predictions)
    
    #calcualte the ROC AUC value and print with the best parameters
    roc_score = roc_auc_score(Y_test, Y_predictions)
    print('ROC AUC Score for Hold-Out set: {}'.format(roc_score), end='\n')
    
    #print the best parameters from model
    print('Best model parameters {}'.format(model_gs.best_params_), end="\n")
    
    
    return roc_score, pd.DataFrame(model_gs.cv_results_), cr
    
    

In [67]:
def pipeline_pca(model):
    """
    Creates a pipeline using scaler, pca, and the input model
    
    Returns the pipeline.
    """
    
    scaler = StandardScaler()
    
    pca = PCA()
    
    pipe = make_pipeline(scaler, pca, model)
    
    return pipe

### Logistic Classifier

Default settings for logistic regression classifier are to use l1 regularization with liblinear. In this case want to see the performance of the classifier without it.

In [177]:
#use the lbfgs solver on the classifier so we can infvestiagte performnce of no regularization
logistic = LogisticRegression(solver='lbfgs')

#c is the weight of the regularization in the model
c_space = [0.01, 0.1, 1, 10, 50, 100, 300]

#class weight is the representaiton of the class imblance. In this dataset class 1 accounts for 60% of the classes.
#try a range of balancing weights. balanced balance data. 
class_weights = ['balanced', {0:6, 1:4}, {0:1, 1:1}]

penalty = ['none']

parameter_logistic = {'pca__n_components' : [0.80, 0.90, 0.95],
                      'logisticregression__C' : c_space,
                      'logisticregression__class_weight': class_weights,
                     'logisticregression__penalty': penalty}

pipe = pipeline_pca(logistic)

#run grid seearch on the classifier
%time _, results_lr, report_lr = classify_model(pipe, parameter_logistic)

ROC AUC Score for Hold-Out set: 0.9895833333333333
Best model parameters {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'balanced', 'logisticregression__penalty': 'none', 'pca__n_components': 0.9}
CPU times: user 31 s, sys: 7.22 s, total: 38.2 s
Wall time: 9.75 s


In [157]:
print(report_lr)

              precision    recall  f1-score   support

           0       0.95      0.96      0.95        54
           1       0.98      0.97      0.97        89

    accuracy                           0.97       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.97      0.97      0.97       143



If we want to see the combinations of the training paramerters we can call results_lr.

In [160]:
result_cols = ['rank_test_score', 'params', 'mean_train_score', 'mean_test_score']
results_lr[result_cols].sort_values('mean_test_score', ascending=False).head(5)

Unnamed: 0,rank_test_score,params,mean_train_score,mean_test_score
15,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'none', 'pca__n_components': 0.8}",0.995468,0.99402
33,1,"{'logisticregression__C': 10, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'none', 'pca__n_components': 0.8}",0.995468,0.99402
42,1,"{'logisticregression__C': 50, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'none', 'pca__n_components': 0.8}",0.995468,0.99402
60,1,"{'logisticregression__C': 300, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'none', 'pca__n_components': 0.8}",0.995468,0.99402
6,1,"{'logisticregression__C': 0.01, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'none', 'pca__n_components': 0.8}",0.995468,0.99402


Note how the top 5 scores used a pca of 0.8. It is likely we should try tuning for lower pca values.

Also note that the imblance in the classes did not have strong change in the train/test scores.

### Lasso & Ridge Classifierer

A lasso classifier is essentially logistic regression with L1 regularization. Similarly, Ridge regression uses 'l2' regularization. So we can add this as a parameter in our grid search call.

In [176]:
lsso_ridge = LogisticRegression(solver='liblinear')

#c is the weight of the regularization in the model
c_space = [0.01, 0.1, 1, 10, 50, 100, 300]

#class weight is the representaiton of the class imblance. In this dataset class 1 accounts for 60% of the classes.
#try a range of balancing weights. balanced balance data. 
class_weights = ['balanced', {0:6, 1:4}, {0:1, 1:1}]

#penalties indicate the kind of regression we are performing. l1 is lasso, and l2 is ridge
penalty = ['l1', 'l2']

parameter_lsso_ridge = {'pca__n_components' : [0.80, 0.90, 0.95],
                      'logisticregression__C' : c_space,
                      'logisticregression__class_weight': class_weights,
                     'logisticregression__penalty' : penalty}

pipe_lso_rdg = pipeline_pca(lsso_ridge)

%time _, results_lso, report_lso = classify_model(pipe_lso_rdg, parameter_lsso_ridge)

ROC AUC Score for Hold-Out set: 0.9742726877985237
Best model parameters {'logisticregression__C': 0.1, 'logisticregression__class_weight': {0: 6, 1: 4}, 'logisticregression__penalty': 'l1', 'pca__n_components': 0.9}
CPU times: user 38.8 s, sys: 9.35 s, total: 48.1 s
Wall time: 12.3 s


In [163]:
print(report_lso)

              precision    recall  f1-score   support

           0       0.96      0.91      0.93        53
           1       0.95      0.98      0.96        90

    accuracy                           0.95       143
   macro avg       0.95      0.94      0.95       143
weighted avg       0.95      0.95      0.95       143



In [166]:
results_lso[result_cols].sort_values('mean_test_score', ascending=False).head()

Unnamed: 0,rank_test_score,params,mean_train_score,mean_test_score
49,1,"{'logisticregression__C': 1, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'l1', 'pca__n_components': 0.9}",0.99684,0.995899
50,1,"{'logisticregression__C': 1, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'l1', 'pca__n_components': 0.95}",0.997128,0.995899
85,3,"{'logisticregression__C': 50, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'l1', 'pca__n_components': 0.9}",0.99702,0.995671
88,3,"{'logisticregression__C': 50, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'l2', 'pca__n_components': 0.9}",0.99702,0.995671
103,3,"{'logisticregression__C': 100, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__penalty': 'l1', 'pca__n_components': 0.9}",0.997023,0.995671


### ElastoNet Classifier

Similar to Lasso and Ridge, the parmeters in logistic regression offer an option to apply both L1 and L2 regularization at the same time. This is known as ElastoNet regularization.

The elastonet regularization can only be used with the 'saga' solver.

In [182]:
elasto = LogisticRegression(solver='saga')

#c is the weight of the regularization in the model
c_space = [0.01, 0.1, 1, 10, 50, 100, 300]

#class weight is the representaiton of the class imblance. In this dataset class 1 accounts for 60% of the classes.
#try a range of balancing weights. balanced balance data. 
class_weights = ['balanced', {0:6, 1:4}, {0:1, 1:1}]

#penalties indicate the kind of regression we are performing. l1 is lasso, and l2 is ridge
penalty = ['elasticnet']

#defines the mixing of the l1 and l2 regularization. l1=0 means use l2 regularization.
l1_ratio = [0.1, 0.3, 0.5, 0.7, 0.9]

parameter_elasto = {'pca__n_components' : [0.80, 0.90, 0.95],
                      'logisticregression__C' : c_space,
                      'logisticregression__class_weight': class_weights,
                     'logisticregression__penalty' : penalty,
                   'logisticregression__l1_ratio' : l1_ratio}

pipe_elasso = pipeline_pca(elasto)

%time _, results_elaso, report_elaso = classify_model(pipe_elasso, parameter_elasto)

ROC AUC Score for Hold-Out set: 0.9470443349753696
Best model parameters {'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.1, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}
CPU times: user 2min 44s, sys: 38.4 s, total: 3min 22s
Wall time: 51.9 s


In [183]:
print(report_elaso)

              precision    recall  f1-score   support

           0       0.95      0.93      0.94        56
           1       0.95      0.97      0.96        87

    accuracy                           0.95       143
   macro avg       0.95      0.95      0.95       143
weighted avg       0.95      0.95      0.95       143



In [184]:
results_elaso[result_cols].sort_values('mean_test_score', ascending=False).head()

Unnamed: 0,rank_test_score,params,mean_train_score,mean_test_score
47,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.1, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.997989,0.997372
104,1,"{'logisticregression__C': 1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.9, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.998587,0.997372
68,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': {0: 6, 1: 4}, 'logisticregression__l1_ratio': 0.5, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.998467,0.997372
134,1,"{'logisticregression__C': 1, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__l1_ratio': 0.9, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.998537,0.997372
77,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': {0: 1, 1: 1}, 'logisticregression__l1_ratio': 0.1, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.997931,0.997372


#### Interpretation of results

Classifier performance appears independent of the C space and the balancing used. We will interpret the results from setting C to 0.1 and class weight to balanced.

In [185]:
parameter_elasto = {'pca__n_components' : [0.80, 0.90, 0.95],
                      'logisticregression__C' : [0.1],
                      'logisticregression__class_weight': ['balanced'],
                     'logisticregression__penalty' : penalty,
                   'logisticregression__l1_ratio' : l1_ratio}

%time _, results_elaso, report_elaso = classify_model(pipe_elasso, parameter_elasto)

ROC AUC Score for Hold-Out set: 0.9567191283292978
Best model parameters {'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.9, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.9}
CPU times: user 6.39 s, sys: 1.47 s, total: 7.85 s
Wall time: 2.01 s


In [186]:
print(report_elaso)

              precision    recall  f1-score   support

           0       0.95      0.95      0.95        59
           1       0.96      0.96      0.96        84

    accuracy                           0.96       143
   macro avg       0.96      0.96      0.96       143
weighted avg       0.96      0.96      0.96       143



In [187]:
results_elaso[result_cols].sort_values('mean_test_score', ascending=False).head()

Unnamed: 0,rank_test_score,params,mean_train_score,mean_test_score
13,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.9, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.9}",0.99393,0.993606
14,1,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.9, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.993924,0.993606
10,3,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.7, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.9}",0.994149,0.993557
11,3,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.7, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.95}",0.994149,0.993557
4,5,"{'logisticregression__C': 0.1, 'logisticregression__class_weight': 'balanced', 'logisticregression__l1_ratio': 0.3, 'logisticregression__penalty': 'elasticnet', 'pca__n_components': 0.9}",0.994598,0.993247


Fixing C to 0.1 and class weight to balance showed a reduction in the performance (unlikely this reduction of 0.4% is significant). It did allow us to see greater sensitivity of the l1_ratio. 

### Step forward selection

This is a brute force method to find the features that result in the best model fit. To implement this we will define a new pipeline with just MinMaxScaler and the selected model. The model will be Linear Regression with l2.

In [203]:
def pipeline_min_max(model):
    """
    Creates a pipeline using scaler, pca, and the input model
    
    Returns the pipeline.
    """
    
    min_max = MinMaxScaler()
    
    pipe = make_pipeline(min_max, model)
    
    return pipe

Following function tests the performance of a logistic regression classifier on indivdual columns. If a column performs better than a previous benchmark it will filter the other columns and use the last column. Best performaning columns will be selected as the final feature set.

In [197]:
def step_forward_selection(dataframe, target, model, parameters):
    
    #create a set from the columns in the input dataframe
    remaining = set(dataframe.columns)
    
    #remove the target variable from the set
    remaining.remove(target)
    
    #initalize an array to hold the best features, and set two scores for comparison.
    selected = []
    current_score, best_new_score = 0.0, 0.0
    
    #loop while remaining has features in the list and that the current score = best score
    while remaining and current_score == best_new_score:
        #store the scores from the current set of columns
        scores_with_candidates = []
        
        #loop through the features in remaining
        for candidate in remaining:
            
            X = dataframe[candidate]
            Y = dataframe[target]
            
            #setup the model in a pipeline
            pipe = pipeline_min_max(model)
            
            #run the modified classifer model
            score, _, _ = classify_model_sf(X, Y, pipe, parameters)
            
            #append the score and canditate label
            scores_with_candidates.append((score, candidate))
            
        #sort the set. last element is highest value     
        scores_with_candidates.sort()
        
        #assign the highest score and best candidate
        best_new_score, best_candidate = scores_with_candidates.pop()
        
        #compare accuracy. if better, assign the new score and feature to the selected candidates
        if current_score > best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    
    print('Selected features: \n {}'.format(selected))
    
    
    X = dataframe[candidate]
    Y = dataframe[target]
            
    #setup the model in a pipeline
    pipe = pipeline_min_max(model)
            
    #run the modified classifer model
    score, results_sf, report_sf = classify_model_sf(X, Y, pipe, parameters)
    
    return results_sf, report_sf

In [198]:
def classify_model_sf(X, Y, model, parameters, score='roc_auc'):
    
    """
    Modified classify function for use in the step forward selection function.
    
    General function for implementing grid search with validation to any model.
    
    Returns 
    
    """
    #load the train and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=99)
    
    #initalize kfolds
    kf = KFold(10, random_state=99)
    
    #initalize the grid search of the selected model
    #call return_train_score true to include training scores in the results output
    model_gs = GridSearchCV(model, parameters, cv=kf, scoring=score, return_train_score=True)
    
    #fit the model
    model_gs.fit(X_train.astype(float), Y_train)
    
    #get model predictions
    Y_predictions = model_gs.predict(X_test.astype(float))
    
    #return various scoring metrics for model
    cr = classification_report(Y_test, Y_predictions)
    
    #calcualte the ROC AUC value and print with the best parameters
    roc_score = roc_auc_score(Y_test, Y_predictions)
    print('ROC AUC Score for Hold-Out set: {}'.format(roc_score), end='\n')
    
    #print the best parameters from model
    print('Best model parameters {}'.format(model_gs.best_params_), end="\n")
    
    
    return roc_score, pd.DataFrame(model_gs.cv_results_), cr

In [204]:
#initate a default logistic regression model to run the step forward selection on

logistic_sf = LogisticRegression(solver='liblinear')

params = {'logisticregression__C': [0.1],
         'logisticregression__class_weight': ['balanced']}

results_sf, report_sf = step_forward_selection(data, 'target', logistic_sf, params)

ValueError: Expected 2D array, got 1D array instead:
array=[ 493.8  579.1  401.5  392.   471.3  928.8  477.3  819.8  260.9  264.
  221.8  566.2  599.4  453.1  826.8  693.7  520.  1041.   519.8  462.
  918.6  568.9  244.5  493.1  451.1 1075.   432.2  963.7 1092.   668.6
 1077.   610.7 1320.   428.   523.8 1297.   455.3 1686.   514.   399.8
  337.7  644.8  371.5  320.8  508.8  719.5  857.6  933.1  502.5  658.8
  994.   365.6  546.3  492.1  633.   457.9  170.4  310.8  527.2  980.5
  466.7  307.3  701.9  673.7  758.6  788.5  553.5  551.7 1138.  1068.
  461.4  689.5  407.4 2499.   800.  1419.  1123.   644.2  433.8 1288.
  449.9  712.8  465.4  519.4  420.3 2501.   246.3  412.5  395.7 1306.
  402.7  904.6  504.8 1264.   744.7 1878.   507.9  990.   143.5  559.2
 1145.   340.9  370.   629.9 1482.   606.5  541.6 1191.   290.9  431.9
 1110.   290.2  584.8  662.7  716.6  464.1  366.5  594.2 1152.   476.5
 1076.   294.5  245.2  951.6  558.1  992.1  948.   618.4  445.3 1214.
  509.2 1407.  1261.   363.7  641.2  651.9  378.2  609.1  285.7 1052.
  566.3  578.3 1206.   412.7  813.7  403.3  387.3  489.   390.  1104.
  588.9 1162.  1364.   684.5  373.9  991.7 1546.   492.9  651.   418.7
  684.5  448.6  590.   269.4  432.  1299.   982.   288.5  668.3  623.9
  475.9  520.2 1260.   381.9  552.4  514.5 1207.   408.8  386.3  403.1
 1491.   656.1  761.7  680.9  928.2  632.6 1157.  1747.   441.   599.5
  373.2  495.   280.5  514.3  203.9 1326.   546.4  904.3  682.5  257.8
 1250.   585.9  928.3  998.9 1130.   674.5  629.8  412.6  600.4  273.9
  477.4  477.1  248.7  409.1  361.6  506.3  384.6  321.6  575.5  469.1
  646.1  446.  1148.   731.3  597.8  394.1  761.3 1138.   512.2  616.5
  477.3  384.8  689.4 1230.   537.3  793.2  880.2  705.6  602.4  506.3
  748.9  329.6  559.2  455.8  716.9  463.7 1223.   573.2  525.2  347.
  698.8  389.4 1386.   716.6  476.3  678.1  704.4  399.8  366.8  446.2
  840.4  912.7  402.   575.3  782.7  432.8  595.9  286.3  271.2  221.2
  546.1 1245.   324.2  426.   386.8  666.   437.6  485.8  807.2  464.4
  403.5  338.3  710.6  664.9  575.3  358.9  489.9  420.3  501.3  516.6
  234.3  596.6  538.4  690.2  747.2  633.1 1404.   503.2  668.7  391.2
  404.9  766.6 1194.   371.1  512.  1670.   561.   421.   817.7  440.6
  981.6  674.8  426.7 1076.   803.1  496.4 1509.   530.2  585.   955.1
 1311.   396.5  380.3 1761.   805.1  336.1  311.9 1132.   466.5  481.6
  432.   372.7  408.2  485.6 1308.   441.3 1001.   360.5  537.9  420.5
  466.1  571.1  930.9 1006.   512.2  640.7  431.1 1264.   781.   230.9
  381.1  713.3 2010.   416.2  685.9  920.6 1102.   419.8  584.1  445.2
  645.7 1007.   534.6  271.3  272.5  520.   289.7  288.1  578.9 1155.
  838.1  224.5  561.3  302.4 1384.   409.  1169.   664.7 1024.   250.5
  984.6  317.5  869.5].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.