A baseline is a method that uses heuristics, simple summary statistics, randomness, or machine learning to create predictions for a dataset. You can use these predictions to measure the baseline's performance (e.g., accuracy); this metric then becomes the reference point against which you compare any other machine learning algorithm.

# From scratch

This section describes some basic algorithms implemented from scratch.

## Random Prediction Algorithm

The random prediction algorithm is very simple to implement: for each test row, it predicts an outcome picked at random from the values observed in the data set.

In [1]:
import random
import collections

import numpy as np

This function generates random predictions for classification and regression problems by sampling uniformly from the unique values of the target y.

In [2]:
def random_algorithm(y, test, seed=None):
    random.seed(seed)
    
    output_values = np.unique(y)
    predicted = list()
    
    for row in test:
        index = random.randrange(len(output_values))
        predicted.append(output_values[index])
    
    return predicted

**Example**

In [3]:
from sklearn import datasets
from sklearn import metrics

In [4]:
iris = datasets.load_iris()

X, y = iris['data'], iris['target']
y_predicted = random_algorithm(y, X, seed=42)

print 'Random Prediction Accuracy:', metrics.accuracy_score(y, y_predicted)
print 'Prediction:', y_predicted[:10]

Random Prediction Accuracy: 0.36
Prediction: [1, 0, 0, 0, 2, 2, 2, 0, 1, 0]
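
With three equally frequent classes, uniform random guessing is expected to be right about one time in three, which is consistent with the 0.36 observed above. The sketch below checks this by simulation using only the standard library; the helper name `simulate_random_accuracy` and the synthetic labels are illustrative assumptions, not part of the original notebook.

```python
import random


def simulate_random_accuracy(labels, n_trials=200, seed=0):
    """Average accuracy of uniform random guessing over several trials."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    total = 0.0
    for _ in range(n_trials):
        # Guess one class per instance, ignoring the true label entirely
        guesses = [rng.choice(classes) for _ in labels]
        hits = sum(1 for g, t in zip(guesses, labels) if g == t)
        total += hits / float(len(labels))
    return total / n_trials


# Synthetic stand-in for the iris targets: 3 classes, 50 instances each
labels = [0] * 50 + [1] * 50 + [2] * 50
average_accuracy = simulate_random_accuracy(labels)
print(average_accuracy)  # expected to be close to 1/3
```

A single run, like the one above with seed=42, can land a few points above or below this average.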


## Zero Rule Algorithm

The Zero Rule Algorithm uses information from the data to generate a single rule for making predictions. This rule depends on the problem type (regression or classification). For classification problems, the algorithm predicts the majority class. For regression, it can use the mean or the median of the target values.

In [5]:
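def zero_rule_algorithm(y, test, problem='classification', regression_method=np.mean, seed=None):
    # seed is unused because the rule is deterministic
    if problem == 'classification':
        # Majority class; on a tie, the class that appears first in y wins
        y = list(y)
        prediction = max(y, key=y.count)
    else:
        # Summary statistic of the targets, e.g. np.mean or np.median
        prediction = regression_method(y)
    
    # Predict the same value for every test row
    y_predicted = [prediction for _ in test]
    
    return y_predicted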

**Example**

For classification.

In [6]:
iris = datasets.load_iris()

X, y = iris['data'], iris['target']
y_predicted = zero_rule_algorithm(y, X, problem='classification', seed=42)

print 'Number instances in classes:', collections.Counter(y)
print 'Zero Rule Algorithm Accuracy:', metrics.accuracy_score(y, y_predicted)
print 'Prediction:', y_predicted[:10]

Number instances in classes: Counter({0: 50, 1: 50, 2: 50})
Zero Rule Algorithm Accuracy: 0.333333333333
Prediction: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
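
Because all three iris classes have 50 instances, the majority class is a tie, and the accuracy equals the class proportion of 1/3. The majority class can also be found with `collections.Counter`, which counts every label once instead of calling `y.count` for each element. This is a minimal sketch; `majority_class` is a hypothetical helper name, and the tie rule (first label seen wins) is chosen to mirror `max(y, key=y.count)`.

```python
import collections


def majority_class(y):
    """Return the most frequent label; ties go to the label seen first."""
    counts = collections.Counter(y)
    best = max(counts.values())
    for label in y:
        if counts[label] == best:
            return label


print(majority_class([1, 0, 0, 2, 0, 1]))               # 0
print(majority_class([0] * 50 + [1] * 50 + [2] * 50))   # 0 (three-way tie)
```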


For regression.

In [7]:
boston = datasets.load_boston()
X, y = boston['data'], boston['target']

y_predicted = zero_rule_algorithm(y, X, problem='regression', regression_method=np.mean, seed=42)

print 'Using mean method'
print 'Zero Rule Algorithm Mean Squared Error:', metrics.mean_squared_error(y, y_predicted)
print 'Prediction:', y_predicted[:5]

y_predicted = zero_rule_algorithm(y, X, problem='regression', regression_method=np.median, seed=42)

print '\nUsing median method'
print 'Zero Rule Algorithm Mean Squared Error:', metrics.mean_squared_error(y, y_predicted)
print 'Prediction:', y_predicted[:5]

Using mean method
Zero Rule Algorithm Mean Squared Error: 84.4195561562
Prediction: [22.532806324110677, 22.532806324110677, 22.532806324110677, 22.532806324110677, 22.532806324110677]

Using median method
Zero Rule Algorithm Mean Squared Error: 86.1959288538
Prediction: [21.199999999999999, 21.199999999999999, 21.199999999999999, 21.199999999999999, 21.199999999999999]
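
The mean gives a lower mean squared error than the median in the output above. This is expected: among all constant predictions, the mean minimizes the squared error, while the median minimizes the absolute error. The sketch below illustrates this on made-up numbers; the values and the helper names `mse` and `mae` are illustrative assumptions.

```python
def mse(values, constant):
    """Mean squared error of predicting the same constant for every value."""
    return sum((v - constant) ** 2 for v in values) / float(len(values))


def mae(values, constant):
    """Mean absolute error of predicting the same constant for every value."""
    return sum(abs(v - constant) for v in values) / float(len(values))


# Made-up targets with one large value that pulls the mean upwards
values = [1.0, 2.0, 3.0, 4.0, 10.0]
mean = sum(values) / len(values)            # 4.0
median = sorted(values)[len(values) // 2]   # 3.0 (odd number of values)

print(mse(values, mean), mse(values, median))  # 10.0 11.0
print(mae(values, mean), mae(values, median))  # 2.4 2.2
```

Which statistic makes the better baseline therefore depends on the metric used for evaluation.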


# Using dummy estimators from scikit-learn

Dummy estimators are available in scikit-learn, with several simple strategies for both classification and regression problems. More information is available here: http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html

In [28]:
import pandas as pd

from sklearn import cross_validation
from sklearn import dummy
from sklearn import linear_model

**Example 1**  
Testing all available strategies for a classification problem (same data set as above, relabeled into two classes).

In [73]:
random_state = 42

iris = datasets.load_iris()

X, y = iris['data'], iris['target']

# Relabel targets to create an imbalanced binary data set (class 1 vs. the rest)
y[y != 1] = 0

# Split the data set into train and test subsets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8, random_state=random_state)

# All available methods
strategies = ['stratified', 'most_frequent', 'prior', 'uniform', 'constant']
results = []

# Evaluate each strategy, fixing the random state for reproducibility
for strategy in strategies:
    if strategy == 'constant':
        # The constant strategy requires the constant parameter (a class label present in the training data)
        dummy_model = dummy.DummyClassifier(strategy=strategy, random_state=random_state, constant=0)
    else:
        dummy_model = dummy.DummyClassifier(strategy=strategy, random_state=random_state)
    
    dummy_model.fit(X_train, y_train)
    
    y_predicted = dummy_model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_predicted)
    
    results.append((strategy, accuracy))
    
pd.DataFrame(results, columns=['strategy', 'accuracy'])

|   | strategy | accuracy |
|---|---|---|
| 0 | stratified | 0.533333 |
| 1 | most_frequent | 0.7 |
| 2 | prior | 0.7 |
| 3 | uniform | 0.366667 |
| 4 | constant | 0.7 |
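
The ranking in this table follows from the class proportions. If a fraction p_c of the instances belongs to class c, and the test set has the same proportions as the training set, then most_frequent and prior are expected to score max p_c, constant scores the proportion of the chosen class, uniform scores 1/k for k classes, and stratified scores the sum of p_c squared. The sketch below computes these expectations; `expected_accuracy` is a hypothetical helper, and the 2:1 proportions approximate the relabeled iris targets.

```python
def expected_accuracy(priors, strategy, constant=None):
    """Expected accuracy of a dummy strategy given class proportions.

    `priors` maps each class label to its proportion; the test set is
    assumed to follow the same proportions as the training set.
    """
    if strategy in ('most_frequent', 'prior'):
        return max(priors.values())
    if strategy == 'uniform':
        return 1.0 / len(priors)
    if strategy == 'stratified':
        return sum(p * p for p in priors.values())
    if strategy == 'constant':
        return priors[constant]
    raise ValueError('unknown strategy: %s' % strategy)


# Roughly 100 instances of class 0 and 50 of class 1 after relabeling
priors = {0: 2.0 / 3.0, 1: 1.0 / 3.0}
strategies = ['stratified', 'most_frequent', 'prior', 'uniform', 'constant']
expected = {}
for name in strategies:
    expected[name] = expected_accuracy(priors, name, constant=0)
    print('%s: %.3f' % (name, expected[name]))
```

The measured accuracies come from a test set of only 30 rows, so they can deviate noticeably from these expectations.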


For comparison, the test below shows the accuracy of a real model.

In [74]:
model = linear_model.LogisticRegression()
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

print 'Logistic Regression Accuracy:', metrics.accuracy_score(y_test, y_predicted)

Logistic Regression Accuracy: 0.766666666667


**Example 2**  
Testing all available strategies for a regression problem (same data set as above).

In [75]:
random_state = 42

boston = datasets.load_boston()

X, y = boston['data'], boston['target']

# Split the data set into train and test subsets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, train_size=0.8, random_state=random_state)

strategies = ['mean', 'median', 'quantile', 'constant']
results = []

for strategy in strategies:
    print strategy
    if strategy == 'quantile':
        # The quantile strategy requires the quantile parameter
        dummy_model = dummy.DummyRegressor(strategy=strategy, quantile=0.75)
    
    elif strategy == 'constant':
        # The constant strategy requires the constant parameter; any real value is allowed
        dummy_model = dummy.DummyRegressor(strategy=strategy, constant=12.0)
    
    else:
        # The mean and median strategies take no extra parameters
        dummy_model = dummy.DummyRegressor(strategy=strategy)
        
    dummy_model.fit(X_train, y_train)
    y_predicted = dummy_model.predict(X_test)
    
    mse = metrics.mean_squared_error(y_test, y_predicted)
    
    results.append((strategy, mse))

pd.DataFrame(results, columns=['strategy', 'mean squared error'])

mean
median
quantile
constant


|   | strategy | mean squared error |
|---|---|---|
| 0 | mean | 75.04543 |
| 1 | median | 73.346275 |
| 2 | quantile | 97.459216 |
| 3 | constant | 163.360392 |


For comparison, the test below shows the mean squared error of a real model.

In [77]:
model = linear_model.LinearRegression()
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

print 'Linear Regression Mean Squared Error:', metrics.mean_squared_error(y_test, y_predicted)

Linear Regression Mean Squared Error: 24.3114269297
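
One way to summarize the comparison is the relative reduction in error against the baseline. This is a sketch, not part of the original notebook: `relative_improvement` is a hypothetical helper, and the two numbers are copied from the outputs above (the mean dummy strategy and linear regression).

```python
def relative_improvement(baseline_error, model_error):
    """Fraction of the baseline's error removed by the model (1.0 is perfect)."""
    return 1.0 - model_error / float(baseline_error)


baseline_mse = 75.04543      # DummyRegressor with strategy='mean', from the table above
model_mse = 24.3114269297    # LinearRegression, from the output above

improvement = relative_improvement(baseline_mse, model_mse)
print('%.3f' % improvement)  # about 0.676
```

A value near zero would mean the model adds little over the baseline, and a negative value would mean the model is worse than the baseline.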
