# General Process of Data Analysis

## General Approach

In this section, I will walk you through the process of a Kaggle competition.

### Data Exploration

What we do at this stage is called EDA (Exploratory Data Analysis), which means analytically exploring data in order to provide some insights for subsequent processing and modeling.

Usually we would load the data using Pandas and make some visualizations to understand the data.
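
As a minimal sketch of that first look, the snippet below loads a tiny in-memory CSV (a hypothetical stand-in for a competition's training file; the column names are made up for illustration) and prints the basic summaries one would usually check first.

```python
import io

import pandas as pd

# Hypothetical stand-in for a competition's train.csv.
csv_text = """id,age,income,city,target
1,34,52000,London,0
2,45,,Paris,1
3,23,31000,London,0
4,51,88000,Berlin,1
5,38,47000,,0
6,29,39000,Paris,0
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)       # number of rows and columns
print(df.dtypes)      # which columns are numerical, which are categorical
print(df.describe())  # summary statistics of the numerical columns

# Missing values per column and the class balance of the target.
missing_per_column = df.isnull().sum()
target_share = df["target"].value_counts(normalize=True)
print(missing_per_column)
print(target_share)
```

With a real competition file you would replace the in-memory text with the path to the CSV.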

### Visualization

For plotting, Matplotlib and Seaborn should suffice.

Some common practices:

- Inspect the distribution of the target variable. Depending on the scoring metric, an imbalanced target distribution might harm the model’s performance.
- For numerical variables, use box plots and scatter plots to inspect their distributions and check for outliers.
- For classification tasks, plot the data with points colored according to their labels. This can help with feature engineering.
- Make pairwise distribution plots and examine the correlations between variables.

Be sure to read this inspiring tutorial of exploratory visualization before you go on.

### Statistical Tests

We can perform some statistical tests to confirm our hypotheses. Sometimes we can get enough intuition from visualization, but quantitative results are always good to have. Note that real-world data is almost never i.i.d., so we have to be careful about which test we use and how we interpret the findings.

In many competitions, public LB scores are not very consistent with local CV scores due to noise or a non-i.i.d. distribution. You can use test results to roughly set a threshold for deciding whether an increase in score reflects a genuine improvement or mere randomness.
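
One possible way to get such a threshold, sketched here on synthetic data, is to score two models on the same CV folds and run a paired t-test on the per-fold scores. The dataset, the two models and the choice of test are all assumptions made for illustration; fold scores are not truly independent, so the p-value should be read as a rough guide only.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for a competition training set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A fixed seed makes both models see exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=cv
)

# Paired test: is the mean per-fold difference distinguishable from zero?
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

# Spread of the per-fold differences: a rough noise level for score changes.
fold_noise = np.std(scores_a - scores_b, ddof=1)

print("mean A:", scores_a.mean(), "mean B:", scores_b.mean())
print("p-value:", p_value, "fold noise:", fold_noise)
```

A score gain that is much smaller than `fold_noise` is hard to tell apart from randomness.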

### Data Preprocessing

In most cases, we need to preprocess the dataset before constructing features. Some common steps are:

- Join the files, when several are provided.
- Deal with missing data.
- Deal with outliers.
- Encode categorical variables if necessary.
- Deal with noise. For example, you may have some floats derived from raw figures. The loss of precision during floating-point arithmetic can bring a lot of noise into the data: two seemingly different values might have been the same before conversion. Sometimes noise harms the model, and we would want to avoid that.

How we choose to perform preprocessing largely depends on what we learned about the data in the previous stage. In practice, I recommend using Jupyter Notebook for data manipulation and mastering the frequently used Pandas operations. The advantage is that you get to see the results immediately and can modify or rerun code blocks. This also makes it very convenient to share your approach with others. After all, reproducible results are very important in data science.
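
The preprocessing steps listed above can be sketched in Pandas roughly as follows. The two small tables, their column names, the percentile bounds and the rounding precision are all hypothetical choices for illustration; the right treatment depends on what the exploration stage revealed.

```python
import numpy as np
import pandas as pd

# Hypothetical tables standing in for two files of a competition.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "city": ["London", "Paris", None, "London"],
})
orders = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "amount": [10.0, np.nan, 12.5, 900.0],
    "ratio": [0.1 + 0.2, 0.3, 0.30000000000000004, 0.5],
})

# 1. Join the files on their shared key.
df = orders.merge(users, on="user_id", how="left")

# 2. Missing data: median for numbers, an explicit label for categories.
df["amount"] = df["amount"].fillna(df["amount"].median())
df["city"] = df["city"].fillna("unknown")

# 3. Outliers: clip values to the 5th-95th percentile range.
low, high = df["amount"].quantile([0.05, 0.95])
df["amount"] = df["amount"].clip(low, high)

# 4. Encode the categorical column as dummy variables.
df = pd.get_dummies(df, columns=["city"])

# 5. Noise: round floats so values that differ only by precision loss match again.
df["ratio"] = df["ratio"].round(6)

print(df)
```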

### Feature Engineering

Some describe the essence of Kaggle competitions as feature engineering supplemented by model tuning and ensemble learning. Yes, that makes a lot of sense. Feature engineering gets you very far. Yet it is how well you know the domain of the given data that decides how far you can go. For example, in a competition where the data consists mainly of text, Natural Language Processing techniques are a must. Constructing useful features is something we all have to keep learning in order to do better.

Basically, when you feel that a variable is intuitively useful for the task, you can include it as a feature. But how do you know it actually works? The simplest way is to plot it against the target variable.

### Feature Selection

Generally speaking, we should try to craft as many features as we can and have faith in the model’s ability to pick up the most significant features. Yet there’s still something to gain from feature selection beforehand:

- Fewer features mean faster training.
- Some features are linearly related to others. This might put a strain on the model.
- By picking out the most important features, we can use interactions between them as new features. Sometimes this gives a surprising improvement.

The simplest way to inspect feature importance is to fit a random forest model. There are more robust feature selection algorithms (e.g. this) which are theoretically superior but impractical due to the lack of an efficient implementation. You can combat noisy data (to an extent) simply by increasing the number of trees used in the random forest.

Feature selection is especially important in competitions where the data is anonymized, because it saves you from wasting time figuring out the meaning of a variable that has no significance.
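
A minimal sketch of inspecting feature importance with a random forest is shown below. The data is synthetic (generated so that the first three columns are the informative ones), and the number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the 3 informative features come first,
# followed by 7 pure-noise features.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

# More trees give more stable importance estimates on noisy data.
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
for idx in ranking[:5]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```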

### Feature Encoding

Sometimes raw features have to be converted to some other formats for them to work properly.

For example, suppose we have a categorical variable which can take more than 10K different values. Then naively creating dummy variables is not a feasible option. An acceptable solution is to create dummy variables for only a subset of the values (e.g. values that constitute 95% of the feature importance) and assign everything else to an ‘others’ class.
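
The idea can be sketched as below. Note the assumption: this version keeps the most frequent values until they cover 95% of the rows, which is a simpler criterion than feature importance. The helper name `group_rare_categories` and the toy data are made up for illustration.

```python
import pandas as pd


def group_rare_categories(series, coverage=0.95, other_label="others"):
    """Keep the most frequent values that cover `coverage` of the rows
    and relabel everything else as `other_label`."""
    freq = series.value_counts(normalize=True)  # sorted, most frequent first
    # Share of rows already covered *before* each value is added.
    share_before = freq.cumsum() - freq
    keep = freq.index[share_before < coverage]
    return series.where(series.isin(keep), other_label)


# Toy column: 5 distinct values, two of them rare.
colors = pd.Series(
    ["red"] * 60 + ["blue"] * 30 + ["green"] * 6 + ["pink"] * 2 + ["teal"] * 2
)
grouped = group_rare_categories(colors, coverage=0.95)
dummies = pd.get_dummies(grouped)
print(dummies.columns.tolist())
```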

Updated on Oct 28th, 2016: For the scenario described above, another possible solution is to use Factorization Machines. Please refer to this post by Kaggle user “idle_speculation” for details.

### Model Selection

When the features are set, we can start training models. Kaggle competitions usually favor tree-based models:

- Gradient Boosted Trees
- Random Forest
- Extra Randomized Trees

The following models are slightly worse in terms of general performance, but are suitable as base models in ensemble learning (discussed later):

- SVM
- Linear Regression
- Logistic Regression
- Neural Networks

Note that this does not apply to computer vision competitions, which are pretty much dominated by neural network models.

All these models are implemented in Sklearn.

Here I want to emphasize the greatness of Xgboost. The outstanding performance of gradient boosted trees and Xgboost’s efficient implementation make it very popular in Kaggle competitions. Nowadays almost every winner uses Xgboost in one way or another.

Updated on Oct 28th, 2016: Recently Microsoft open sourced LightGBM, a potentially better library than Xgboost for gradient boosting.

By the way, for Windows users, installing Xgboost can be a painful process. You can refer to this post by me if you run into problems.

### Model Training

We can improve a model’s performance by tuning its parameters. A model usually has many parameters, but only a few of them are significant to its performance. For example, the most important parameters for a random forest are the number of trees in the forest and the maximum number of features considered when growing each tree. We need to understand how models work and what impact each parameter has on the model’s performance, be it accuracy, robustness or speed.

Normally we would find the best set of parameters by a process called grid search. What it does is simply iterate through all the combinations of candidate values and pick the best one.

By the way, a random forest usually reaches its optimum when max_features is set to the square root of the total number of features.
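
As a small sketch of grid search, the snippet below uses scikit-learn’s `GridSearchCV` to tune the two random forest parameters mentioned above. The synthetic dataset and the candidate values in the grid are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a prepared training set.
X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# Every combination of these values is tried: 2 x 3 = 6 candidates.
param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 0.5, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # each candidate is scored with 3-fold CV
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the cost grows with the product of the grid sizes, it pays to keep the grid limited to the few parameters that matter.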

Here I’d like to stress some points about tuning XGB. These parameters are generally considered to have real impacts on its performance:

- eta: Step size used in updating weights. Lower eta means slower training but better convergence.
- num_round: Total number of iterations.
- subsample: The ratio of training data used in each iteration. This is to combat overfitting.
- colsample_bytree: The ratio of features used in each iteration. This is like max_features in RandomForestClassifier.
- max_depth: The maximum depth of each tree. Unlike random forest, gradient boosting would eventually overfit if we do not limit its depth.
- early_stopping_rounds: If we don’t see an increase of validation score for a given number of iterations, the algorithm will stop early. This is to combat overfitting, too.

Usual tuning steps:

1. Reserve a portion of the training set as the validation set.
2. Set eta to a relatively high value (e.g. 0.05 ~ 0.1) and num_round to 300 ~ 500.
3. Use grid search to find the best combination of the other parameters.
4. Gradually lower eta until we reach the optimum.
5. Use the validation set as the watchlist to re-train the model with the best parameters. Observe how the score changes on the validation set in each iteration. Find the optimal value for early_stopping_rounds.

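```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

# X_train, y_train and df_test are assumed to be loaded already.
# Hold out 30% of the training data as the validation set.
X_dtrain, X_deval, y_dtrain, y_deval = train_test_split(
    X_train, y_train, random_state=1026, test_size=0.3)
dtrain = xgb.DMatrix(X_dtrain, y_dtrain)
deval = xgb.DMatrix(X_deval, y_deval)
watchlist = [(deval, 'eval')]
params = {
    'booster': 'gbtree',
    'objective': 'reg:squarederror',  # formerly named 'reg:linear'
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'eta': 0.05,
    'max_depth': 7,
    'seed': 2016,
    'verbosity': 1,  # replaces the older 'silent' flag
    'eval_metric': 'rmse'
}
# Train for up to 500 rounds; stop if the eval RMSE has not improved for 50 rounds.
clf = xgb.train(params, dtrain, num_boost_round=500, evals=watchlist,
                early_stopping_rounds=50)
pred = clf.predict(xgb.DMatrix(df_test))
```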

Finally, note that models with randomness all have a parameter like seed or random_state to control the random seed. You must record this with all other parameters when you get a good model. Otherwise you wouldn’t be able to reproduce it.
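
The effect of the seed is easy to check. The sketch below, on synthetic data with arbitrarily chosen seeds, trains the same random forest twice with the same `random_state` and once with a different one, then compares the predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data for illustration.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)


def fit_and_predict(seed):
    """Train a fresh forest with the given seed and predict on X."""
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    return model.fit(X, y).predict(X)


# Same seed: the model is reproduced. Different seed: it is not.
same_seed_matches = np.allclose(fit_and_predict(2016), fit_and_predict(2016))
other_seed_matches = np.allclose(fit_and_predict(2016), fit_and_predict(7))
print(same_seed_matches, other_seed_matches)
```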

### Cross Validation

Cross validation is an essential step in model training. It tells us whether our model is at high risk of overfitting. In many competitions, public LB scores are not very reliable. Often when we improve the model and get a better local CV score, the LB score becomes worse. It is widely believed that we should trust our CV scores in such situations. Ideally we would want CV scores obtained by different approaches to improve in sync with each other and with the LB score, but this is not always possible.

Usually 5-fold CV is good enough. If we use more folds, the CV score would become more reliable, but the training takes longer to finish as well. However, we shouldn’t use too many folds if our training data is limited. Otherwise we would have too few samples in each fold to guarantee statistical significance.
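
A minimal 5-fold CV sketch with scikit-learn is shown below. The synthetic data, the Ridge model and the RMSE metric are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration.
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)

# Shuffle once with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=2016)

# scikit-learn maximizes scores, so RMSE is reported with a negative sign.
scores = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=cv, scoring="neg_root_mean_squared_error"
)
rmse = -scores

print("RMSE per fold:", rmse)
print("mean:", rmse.mean(), "std:", rmse.std())
```

The standard deviation across folds gives a feel for how much of a score change could be noise.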

How to do CV properly is not a trivial problem. It requires constant experimentation and case-by-case discussion. Many Kagglers share their CV approaches (like this one) after competitions in which reliable CV was hard to achieve.

### Ensemble Generation

Ensemble learning refers to the technique of combining different models. It can reduce both the bias and the variance of the final model (you can find a proof here), thus increasing the score and reducing the risk of overfitting. Recently it has become virtually impossible to win a prize in Kaggle competitions without using ensembles.

Common approaches of ensemble learning are:

- Bagging: Use different random subsets of training data to train each base model. Then all the base models vote to generate the final predictions. This is how random forest works.
- Boosting: Train base models iteratively, modifying the weights of training samples according to the results of the previous iteration. This is how gradient boosted trees work. (Actually it’s not the whole story. Apart from boosting, GBTs try to learn the residuals of earlier iterations.) It performs better than bagging but is more prone to overfitting.
- Blending: Use non-overlapping data to train different base models and take a weighted average of them to obtain the final predictions. This is easy to implement but uses less data.
- Stacking: To be discussed next.

In theory, for the ensemble to perform well, two factors matter:

- Base models should be as unrelated as possible. This is why we tend to include non-tree-based models in the ensemble even though they don’t perform as well. The math says that the greater the diversity, the less bias in the final ensemble.
- Performance of base models shouldn’t differ too much.

Actually we have a trade-off here. In practice we may end up with highly related models of comparable performance. Yet we ensemble them anyway because it usually increases the overall performance.
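
Blending as described above can be sketched as follows: two different base models are trained on non-overlapping halves of the training data and their predictions are combined with a weighted average. The synthetic data, the choice of models and the weights are hypothetical; in practice the weights would be tuned on held-out data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each base model gets its own non-overlapping half of the training data.
X_a, X_b, y_a, y_b = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)
linear = Ridge(alpha=1.0).fit(X_a, y_a)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_b, y_b)

pred_linear = linear.predict(X_test)
pred_forest = forest.predict(X_test)

# Hypothetical blending weights; they should sum to 1.
weights = np.array([0.7, 0.3])
blend = weights[0] * pred_linear + weights[1] * pred_forest


def rmse(pred):
    """Root mean squared error against the held-out targets."""
    return np.sqrt(mean_squared_error(y_test, pred))


# Lower correlation between base predictions means more diversity.
correlation = np.corrcoef(pred_linear, pred_forest)[0, 1]
print("linear:", rmse(pred_linear), "forest:", rmse(pred_forest))
print("blend:", rmse(blend), "correlation:", correlation)
```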

### Stacking

Compared with blending, stacking makes better use of training data. 
It’s much like cross validation. Take 5-fold stacking as an example. First we split the training data into 5 folds. Next we will do 5 iterations. In each iteration, train every base model on 4 folds and predict on the hold-out fold. You have to keep the predictions on the testing data as well. This way, in each iteration every base model will make predictions on 1 fold of the training data and all of the testing data. After 5 iterations we will obtain a matrix of shape #(samples in training data) X #(base models). This matrix is then fed to the stacker (it’s just another model) in the second level. After the stacker is fitted, use the predictions on testing data by base models (each base model is trained 5 times, therefore we have to take an average to obtain a matrix of the same shape) as the input for the stacker and obtain our final predictions.

Maybe it’s better to just show the code:

```python
import numpy as np
from sklearn.model_selection import KFold


class Ensemble:
    def __init__(self, n_folds, stacker, base_models):
        self.n_folds = n_folds
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        """Fit on training data (X, y) and return predictions for test data T."""
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)
        # Fixed folds so every base model is trained on the same splits.
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=2016)
        folds = list(kfold.split(X))
        # Second-level inputs: one column per base model.
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            S_test_i = np.zeros((T.shape[0], len(folds)))
            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                clf.fit(X_train, y_train)
                # Out-of-fold predictions fill this model's training column.
                S_train[test_idx, i] = clf.predict(X_holdout)
                S_test_i[:, j] = clf.predict(T)
            # Average the test predictions of the per-fold fits.
            S_test[:, i] = S_test_i.mean(axis=1)
        self.stacker.fit(S_train, y)
        return self.stacker.predict(S_test)
```

Finally, generating pairwise polynomial interactions between the top-ranking features can further improve the model’s performance.
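
One way to generate such pairwise interactions is scikit-learn’s `PolynomialFeatures`, sketched below. The small matrix `X_top` is a hypothetical stand-in for the columns of the top-ranking features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical matrix holding only the top-ranking features (3 columns here).
X_top = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

# degree=2 with interaction_only=True yields the original columns plus
# every pairwise product, without squared terms or a constant column.
interactions = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
)
X_expanded = interactions.fit_transform(X_top)

print(X_expanded.shape)  # (2, 6): 3 original columns + 3 pairwise products
print(X_expanded[0])     # [1. 2. 3. 2. 3. 6.]
```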