# Randomly Selecting Data for Training

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

# Given a Pandas dataframe, df, with columns A, B, C and target
input_columns = ['A', 'B', 'C']
X, y = df[input_columns], df['target']

# Split the dataset into a training and a testing set
# Test set will be the 25% taken randomly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(X_train.shape, y_train.shape)
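
The cell above assumes an existing `df`. As a minimal, self-contained sketch, the same split can be tried on a made-up eight-row dataframe (the values are purely illustrative) to see the sizes of the two sets and the effect of fixing `random_state`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for df: 8 rows, columns A, B, C and target
df = pd.DataFrame({
    'A': range(8),
    'B': range(10, 18),
    'C': range(20, 28),
    'target': [0, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df[['A', 'B', 'C']], df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

# 25% of 8 rows goes to the test set, the rest to the training set
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=33)
print(X_train.index.equals(X_train_again.index))
```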

# Training Data Standardization 

Many machine learning estimators require standardized input: they may behave badly if the individual features do not look more or less like standard normally distributed data (for example, Gaussian with zero mean and unit variance).

Standardization shifts and rescales each feature but does not change the shape of its distribution, as you can verify by plotting the X values before and after scaling. It prevents features with large values from weighing too heavily on the final result. Note that the scaler below is fitted on the training set only and then applied to both sets, so no information from the test set leaks into training.

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
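
To check what the scaler actually does, here is a small sketch on synthetic data (two made-up features on very different scales, not the dataset used above). It verifies that each scaled column has mean close to 0 and standard deviation close to 1, and that the result is just a shift and rescale of the original values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two illustrative features: one large and skewed, one small and symmetric
X_demo = np.column_stack([
    rng.exponential(scale=1000.0, size=200),
    rng.normal(loc=5.0, scale=0.1, size=200),
])

scaler_demo = StandardScaler().fit(X_demo)
X_scaled = scaler_demo.transform(X_demo)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means)  # both very close to 0
print(col_stds)   # both very close to 1

# The transform is (x - mean) / std per column, so the shape of each
# distribution (e.g. the skew of the first feature) is preserved
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_scaled, manual))
```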

## Preprocessing

### Labeling

In [None]:
from sklearn.preprocessing import LabelEncoder

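# titanic_X is assumed to be a DataFrame with a categorical 'sex' column
enc = LabelEncoder()
label_encoder = enc.fit(titanic_X['sex'])  # learn the sorted set of classes
# Replace each class with its integer index
titanic_X['gender_labeled'] = label_encoder.transform(titanic_X['sex'])
titanic_X[['sex', 'gender_labeled']].head()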
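
A minimal sketch of how `LabelEncoder` assigns its integers, using a made-up four-value series instead of the Titanic data: classes are sorted, each class gets its position in that sorted order, and `inverse_transform` maps the integers back.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative sample, not the real Titanic column
sex = pd.Series(['male', 'female', 'female', 'male'])

encoder = LabelEncoder().fit(sex)
print(encoder.classes_)   # classes in sorted order: 'female', then 'male'

codes = encoder.transform(sex)
print(codes)              # [1 0 0 1]

# Recover the original labels from the integer codes
restored = encoder.inverse_transform(codes)
print(restored)
```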

### One Hot Encoding

In [None]:
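import pandas as pd

# get_dummies joins prefix and value with '_' itself, so use 'pclass', not 'pclass_'
one_hot_columns = pd.get_dummies(titanic_X['pclass'], prefix='pclass')
titanic_X = pd.concat([titanic_X, one_hot_columns], axis=1)
titanic_X.head()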
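
The effect of one hot encoding is easier to see on a tiny example. The sketch below uses a made-up `pclass` column with three hypothetical values. Note that `pd.get_dummies` inserts its own `_` separator between the prefix and each value, and that every row ends up with exactly one active column.

```python
import pandas as pd

# Illustrative data; the class labels are assumptions for this sketch
toy = pd.DataFrame({'pclass': ['1st', '3rd', '2nd', '1st']})

dummies = pd.get_dummies(toy['pclass'], prefix='pclass')
print(list(dummies.columns))  # ['pclass_1st', 'pclass_2nd', 'pclass_3rd']

# Append the indicator columns to the original frame
toy = pd.concat([toy, dummies], axis=1)
print(toy)

# Exactly one indicator is set in each row
ones_per_row = dummies.sum(axis=1)
print(ones_per_row.tolist())
```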

# Result Evaluation

In [None]:
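from sklearn import metrics

# clf is assumed to be a classifier already fitted on X_train, y_train
y_train_pred = clf.predict(X_train)
# Accuracy on the training set, an optimistic estimate of real performance
print(metrics.accuracy_score(y_train, y_train_pred))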

To finish our evaluation process, we introduce a very useful method known as cross-validation. As explained before, we have to partition our dataset into a training set and a testing set. However, partitioning the data leaves fewer instances to train on, and, depending on the particular partition (usually made randomly), we can get better or worse results. Cross-validation reduces this dependence on a single partition, lowering the variance of the result and producing a more realistic score for our models. The usual steps for k-fold cross-validation are the following:

1. Partition the dataset into k different subsets.
2. Create k different models by training on k-1 subsets and testing on the remaining subset.
3. Measure the performance on each of the k models and take the average measure.
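
The three steps can be sketched by hand before relying on a helper function. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`) and `LogisticRegression` is used as a stand-in classifier, not the model used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Step 1: partition the dataset into k = 5 subsets
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
fold_sizes = []
for train_idx, test_idx in kf.split(X_demo):
    # Step 2: train a fresh model on k-1 subsets, test on the remaining one
    model = LogisticRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
    fold_sizes.append(len(test_idx))

# Step 3: average the k performance measures
print(fold_scores)
print(np.mean(fold_scores))
```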

First, we create a composite estimator: a pipeline of the standardization step and the linear model. This ensures that each iteration fits the scaler on the training folds only, and then trains and tests on the transformed data. The Pipeline class also simplifies the construction of more complex models that chain multiple transformations. We choose k = 5 folds, so each time we train on 80 percent of the data and test on the remaining 20 percent. By default, cross-validation uses the estimator's own score method (accuracy for classifiers) as its performance measure, but we can select another measure through the scoring argument.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

# create a composite estimator made by a pipeline of the standardization and the linear model
clf = Pipeline([
        ('scaler', preprocessing.StandardScaler()),
        ('linear_model', SGDClassifier())
])
# create a k-fold cross validation iterator of k=5 folds
# (KFold no longer takes the number of samples; it is inferred from X)
cv = KFold(n_splits=5, shuffle=True, random_state=33)
# by default the score used is the one returned by score method of the estimator (accuracy)
scores = cross_val_score(clf, X, y, cv=cv)
print(scores)

We obtained an array with the k scores. We can calculate the mean and the standard error to obtain a final figure:

In [None]:
import numpy as np
from scipy.stats import sem

def mean_score(scores):
    # Mean of the fold scores, plus or minus the standard error of the mean
    return "Mean score: {0:.3f} (+/- {1:.3f})".format(np.mean(scores), sem(scores))

print(mean_score(scores))
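
For reference, the standard error reported by `scipy.stats.sem` is the sample standard deviation (with `ddof=1`) divided by the square root of the number of scores. A short check on made-up fold scores:

```python
import numpy as np
from scipy.stats import sem

# Hypothetical scores from 5 folds, for illustration only
demo_scores = np.array([0.80, 0.85, 0.90, 0.75, 0.95])

# Standard error of the mean computed by hand
manual_sem = demo_scores.std(ddof=1) / np.sqrt(len(demo_scores))

print("Mean score: {0:.3f} (+/- {1:.3f})".format(np.mean(demo_scores), sem(demo_scores)))
print(np.isclose(sem(demo_scores), manual_sem))
```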

## Building a cross validator

In [None]:
import numpy as np
from scipy.stats import sem
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

def evaluate_cross_validation(clf, X, y, K):
    # create a k-fold cross validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))

svc_1 = SVC(kernel='linear')
evaluate_cross_validation(svc_1, X_train, y_train, 5)
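
The cross-validation code above relies on the estimator's default score. `cross_val_score` also accepts a `scoring` argument, which selects a different performance measure. The following sketch, on synthetic data and with an arbitrary choice of the F1 score, compares the default accuracy with `scoring='f1'` for the same pipeline and folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_model', SGDClassifier(random_state=0)),
])
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

# Default: the estimator's score method, which is accuracy for classifiers
acc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo)
# Same folds, scored with the F1 measure instead
f1_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='f1')

print(acc_scores)
print(f1_scores)
```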
