# GENERAL DATA ANALYSIS PIPELINE

## Type of labels

The labels define the problem and can be of different types, such as:

- Single column, binary values (classification problem, one sample belongs to one class only and there are only two classes)
- Single column, real values (regression problem, prediction of only one value)
- Multiple columns, binary values (classification problem, one sample belongs to one class, but there are more than two classes)
- Multiple columns, real values (regression problem, prediction of multiple values)
- Multilabel (classification problem, one sample can belong to several classes)
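As a rough illustration of the label types above, scikit-learn's `type_of_target` helper reports how it interprets a label array. This is only a sketch: the toy arrays in `label_examples` are made up for the example, and the multi-class case is written with integer class ids because a one-hot (multiple column, binary) matrix is reported by this helper as `multilabel-indicator`.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Toy label arrays (hypothetical), one per label type listed above
label_examples = {
    "single column, binary": np.array([0, 1, 1, 0]),
    "single column, real": np.array([0.5, 1.7, 3.2, 2.4]),
    "multi-class (integer class ids)": np.array([0, 2, 1, 2]),
    "multiple columns, real": np.array([[0.5, 1.2], [2.1, 3.3]]),
    "multilabel (indicator matrix)": np.array([[1, 0, 1], [0, 1, 1]]),
}

# Ask scikit-learn how it would interpret each label array
detected = {name: type_of_target(y) for name, y in label_examples.items()}
for name, kind in detected.items():
    print(f"{name}: {kind}")
```

Checking the detected type before modelling is a cheap way to confirm that the labels describe the problem you think they do.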

## Evaluation Metrics

For any kind of machine learning problem, we must know how we are going to evaluate our results, that is, what the evaluation metric or objective is.
For example, for a **skewed binary classification problem** we generally choose the area under the receiver operating characteristic curve (**ROC AUC** or simply **AUC**).
For **multi-label or multi-class classification problems** we generally choose **categorical cross-entropy (multiclass log loss)**, and for **regression problems** we generally choose **mean squared error**.
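The three metrics mentioned above are all available in `sklearn.metrics`. The following is a minimal sketch on small hand-written arrays (the values are invented for illustration and are not results from any real model):

```python
import numpy as np
from sklearn.metrics import log_loss, mean_squared_error, roc_auc_score

# Binary classification: true classes and predicted scores for the positive class
y_true_bin = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true_bin, y_score)

# Multi-class classification: one row of class probabilities per sample
y_true_multi = np.array([0, 1, 2, 1])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
])
mlogloss = log_loss(y_true_multi, y_proba)

# Regression: true and predicted values
y_true_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_reg = np.array([2.5, 0.0, 2.0, 8.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)

print("AUC:", auc)
print("multiclass log loss:", mlogloss)
print("MSE:", mse)
```

Note that AUC is computed from scores or probabilities, not from hard class predictions, and that higher is better for AUC while lower is better for log loss and MSE.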

## Training-Validation Set Split

Suppose that you have already split the data into a training set and a test set. You then have to split the training set again into the actual training set and a validation set (or use cross-validation).
The split into training and validation sets must be done according to the labels. For any kind of classification problem, use stratified splitting. In Python, you can do this very easily with scikit-learn:

In [None]:
from sklearn.model_selection import StratifiedKFold
eval_size = 0.10
# StratifiedKFold lives in sklearn.model_selection; it takes n_splits, and the data is passed to split()
kf = StratifiedKFold(n_splits=round(1. / eval_size))
train_indices, valid_indices = next(kf.split(X, y))
X_train, y_train = X[train_indices], y[train_indices]
X_val, y_val = X[valid_indices], y[valid_indices]

For a regression task, a simple K-Fold split should suffice. There are, however, more complex methods that try to keep the distribution of labels the same in the training and validation sets, but they are not covered here.

In [None]:
from sklearn.model_selection import KFold
eval_size = 0.10
# KFold lives in sklearn.model_selection; it takes n_splits, and the data is passed to split()
kf = KFold(n_splits=round(1. / eval_size))
train_indices, valid_indices = next(kf.split(X))
X_train, y_train = X[train_indices], y[train_indices]
X_val, y_val = X[valid_indices], y[valid_indices]

Here eval_size is set to 0.10, but the right value depends on how much data you have.
Once the data has been split, set the validation data aside and don't touch it. Any operation that is fitted on the training set must be saved and then applied unchanged to the validation set.
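To make the last point concrete, here is a small sketch using `StandardScaler` as the saved operation. The arrays `X_train_demo` and `X_val_demo` are random placeholders invented for this example; the idea is simply that the scaler is fitted on the training part only and the fitted object is then reused on the validation part.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features standing in for the real training/validation split
rng = np.random.RandomState(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(90, 3))
X_val_demo = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# Fit on the training data only: the scaler stores the training mean and std
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the saved statistics on the validation data (transform only, no refitting)
X_val_scaled = scaler.transform(X_val_demo)

print("training means stored in the scaler:", scaler.mean_)
print("validation means after scaling:", X_val_scaled.mean(axis=0))
```

The validation means are close to zero but not exactly zero, which is expected: the validation set was scaled with the training statistics, not with its own.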

## Pre-processing

The next step is to identify the different variables in the data. There are usually three types of variables we deal with: **numerical variables**, **categorical variables** and **variables with text inside them**.

Separate out the numerical variables first. These variables don’t need any kind of processing and thus we can start applying normalization and machine learning models to these variables.

There are two ways in which we can handle **categorical data**:
- Convert the categorical data to labels
- Convert the labels to binary variables (one-hot encoding)

In [None]:
# Convert the categorical data to labels

from sklearn.preprocessing import LabelEncoder

# LabelEncoder handles one column at a time, so keep one fitted encoder per feature
label_encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_features}
X_train_cat = X_train[categorical_features].copy()
for col, lbl_enc in label_encoders.items():
    X_train_cat[col] = lbl_enc.transform(X_train[col])

# One-hot encoding. Older scikit-learn versions required converting the categories to numbers
# with LabelEncoder first; recent versions accept string categories directly.
# If it's important to keep the names of the categories, use pd.get_dummies()

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
ohe.fit(X_train[categorical_features])
X_train_cat = ohe.transform(X_train[categorical_features])

Let’s formulate a general rule on handling **text variables**. We can combine all the text variables into one and then use some algorithms which work on text data and convert it to numbers.

The text variables can be joined as follows:

In [None]:
text_data_train = list(X_train[text_features].apply(lambda row: "%s %s" % (row["column 1"], row["column 2"]), axis=1))

We can use CountVectorizer or TfidfVectorizer on it:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
ctv = CountVectorizer()
# Keep the raw text lists untouched so that both vectorizers can be fitted on them
ctv_train = ctv.fit_transform(text_data_train)
vocab = ctv.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
ctv_valid = ctv.transform(text_data_valid)

from sklearn.feature_extraction.text import TfidfVectorizer
tfv = TfidfVectorizer()
tfv_train = tfv.fit_transform(text_data_train)
vocab = tfv.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
tfv_valid = tfv.transform(text_data_valid)

The TfidfVectorizer performs better than the counts most of the time and I have seen that the following parameters for TfidfVectorizer work almost all the time.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
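# The boolean options are passed as True; recent scikit-learn versions reject integers here
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}',
    ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True, stop_words="english")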

## Feature stacker

You can horizontally stack all the features before putting them through further processing by using numpy hstack or scipy sparse hstack, depending on whether you have dense or sparse features.

In [None]:
import numpy as np
from scipy import sparse

# in case of dense data:
X = np.hstack((x1,x2, ...))

# in case of sparse data:
X = sparse.hstack((x1,x2, ...))

This can also be achieved with the FeatureUnion class, which is useful when there are other preprocessing steps such as PCA or feature selection.

In [None]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

pca = PCA(n_components=10)
skb = SelectKBest(k=1)
combined_features = FeatureUnion([("pca", pca), ("skb",skb)])

Once we have stacked the features together, we can start applying machine learning models. At this stage, the only models you should go for are ensemble tree-based models. These models include:

- RandomForestClassifier
- RandomForestRegressor
- ExtraTreesClassifier
- ExtraTreesRegressor
- XGBClassifier
- XGBRegressor

We cannot apply linear models to the above features since they are not normalized. To use linear models, one can use Normalizer or StandardScaler from scikit-learn.

These normalization methods work best on dense features and don't give very good results when applied to sparse features. One can, however, apply StandardScaler to sparse matrices without centering the data (parameter: with_mean=False).
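The sparse case can be sketched as follows. The matrix `X_sparse` is a tiny made-up example; the point is only that `StandardScaler` refuses to center a sparse matrix (centering would turn the zeros into non-zeros), while `with_mean=False` scales each column to unit variance and keeps the matrix sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical sparse feature matrix (3 samples, 3 features)
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 6.0, 1.0],
]))

# The default scaler tries to subtract the mean, which is not allowed for sparse input
try:
    StandardScaler().fit_transform(X_sparse)
    centering_failed = False
except ValueError:
    centering_failed = True

# Without centering, only the variance is scaled and the zeros stay zeros
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print("default scaler raised an error:", centering_failed)
print("still sparse:", sparse.issparse(X_scaled))
print("column std after scaling:", X_scaled.toarray().std(axis=0))
```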

If the above steps give a “good” model, we can move on to hyperparameter optimization; if they don't, we can use the following steps to improve our model.

## Decomposition Methods

The next steps include decomposition methods:

In [6]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_17.png")

For the sake of simplicity, we will leave out LDA and QDA transformations. For high-dimensional data, PCA is generally used to decompose the data. For images, start with 10-15 components and increase this number as long as the quality of the result improves substantially. For other types of data, we select 50-60 components initially (we tend to avoid PCA as long as we can deal with the numerical data as it is).

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=12)
pca.fit(xtrain)
xtrain = pca.transform(xtrain)

For **text data**, after conversion of text to sparse matrix, go for Singular Value Decomposition (SVD). A variation of SVD called TruncatedSVD can be found in scikit-learn.

In [None]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=120)  # text features usually need 120-200 components
svd.fit(xtrain)
xtrain = svd.transform(xtrain)

The number of SVD components that generally works for TF-IDF or counts is between 120 and 200. Any number above this might improve the performance, but not substantially, and it comes at the cost of computing power.
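Putting the text steps together, the following sketch runs TF-IDF followed by TruncatedSVD on a toy corpus. The documents in `docs_train` and `docs_valid` are invented, and `n_components=2` is used only because the toy vocabulary is tiny; on real text the range given above is the more sensible starting point.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the joined text columns
docs_train = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "stock prices fell sharply today",
    "the market rallied after the report",
    "investors bought stocks and bonds",
]
docs_valid = ["the cat chased the dog", "stocks fell after the report"]

# Text to sparse TF-IDF matrix: fit on training text, reuse on validation text
tfv = TfidfVectorizer()
X_tfidf_train = tfv.fit_transform(docs_train)
X_tfidf_valid = tfv.transform(docs_valid)

# Sparse TF-IDF matrix to a small dense matrix via truncated SVD
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd_train = svd.fit_transform(X_tfidf_train)
X_svd_valid = svd.transform(X_tfidf_valid)

print(X_tfidf_train.shape, "->", X_svd_train.shape)
print("explained variance kept:", svd.explained_variance_ratio_.sum())
```

Watching how `explained_variance_ratio_.sum()` grows as `n_components` increases is one way to judge when additional components stop paying for their computing cost.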

After evaluating further performance of the models, we move to scaling of the datasets, so that we can evaluate linear models too. The normalized or scaled features can then be sent to the machine learning models or feature selection modules.

## Feature Selection

There are multiple ways in which feature selection can be achieved. One of the most common ways is greedy feature selection (forward or backward). In greedy feature selection we choose one feature, train a model and evaluate its performance on a fixed evaluation metric. We keep adding or removing features one by one and record the performance of the model at every step. We then select the set of features with the best evaluation score. One implementation of greedy feature selection with AUC as the evaluation metric can be found here: https://github.com/abhishekkrthakur/greedyFeatureSelection. Note that this implementation is not perfect and must be modified according to your requirements.
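The forward variant of the procedure described above can be sketched in a few lines. This is not the implementation from the linked repository; it is a simplified illustration that assumes a dense NumPy feature matrix, uses cross-validated AUC as the metric, and stops as soon as no remaining feature improves the score. The function name `greedy_forward_selection` and the synthetic data are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def greedy_forward_selection(model, X, y, cv=3):
    """Add one feature at a time while the cross-validated AUC keeps improving."""
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score every remaining feature when added to the current selection
        scores = {
            f: cross_val_score(model, X[:, selected + [f]], y, cv=cv, scoring="roc_auc").mean()
            for f in remaining
        }
        best_feature = max(scores, key=scores.get)
        if scores[best_feature] <= best_score:
            break  # no candidate improves the score, so stop
        best_score = scores[best_feature]
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_score


# Synthetic binary classification data with a few informative features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
selected_features, auc = greedy_forward_selection(
    LogisticRegression(max_iter=1000), X_demo, y_demo
)
print("selected feature indices:", selected_features)
print("cross-validated AUC:", round(auc, 3))
```

Each round trains one model per remaining feature, so the cost grows quickly with the number of features. That is the main reason the faster, model-based methods are worth considering.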

Other faster methods of feature selection include selecting best features from a model. We can either look at coefficients of a logit model or we can train a random forest to select best features and then use them later with other machine learning models.

In [None]:
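from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)
# The forest itself has no transform(); SelectFromModel keeps the features whose
# importance is above the threshold (the mean importance by default)
selector = SelectFromModel(clf, prefit=True)
X_selected = selector.transform(X)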

Remember to keep the number of estimators low and the optimization of hyperparameters minimal so that you don't overfit.

The feature selection can also be achieved using Gradient Boosting Machines.

In [None]:
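import xgboost as xgb

params = {}

# Wrap the training features and labels in the DMatrix format that xgb.train expects
dtrain = xgb.DMatrix(X, label=y)
model = xgb.train(params, dtrain, num_boost_round=100)
# Rank the features by how often they are used in splits (most used first)
sorted(model.get_fscore().items(), key=lambda t: -t[1])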

Another popular method for feature selection from non-negative sparse datasets is chi-squared (chi2) based feature selection, which is also implemented in scikit-learn.

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

skb = SelectKBest(chi2, k=20)
X_selected = skb.fit_transform(X, y)

Here, we use chi2 in conjunction with SelectKBest to select 20 features from the data. The number of selected features, k, also becomes a hyperparameter we want to optimize to improve the result of our machine learning models.
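One possible way to treat the number of selected features as a hyperparameter is to put SelectKBest and the model in a single Pipeline and search over `k`. The sketch below makes several assumptions for illustration: the digits dataset bundled with scikit-learn (its pixel features are non-negative, as chi2 requires), a Naive Bayes classifier, accuracy as the metric, and an arbitrary grid of `k` values.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Non-negative pixel counts, so chi2 is applicable
X_digits, y_digits = load_digits(return_X_y=True)

# Feature selection and model in one pipeline, so k is tuned inside cross-validation
pipeline = Pipeline([
    ("skb", SelectKBest(chi2)),
    ("clf", MultinomialNB()),
])
search = GridSearchCV(
    pipeline,
    param_grid={"skb__k": [10, 20, 40, 64]},  # candidate numbers of features (arbitrary)
    cv=3,
    scoring="accuracy",
)
search.fit(X_digits, y_digits)

print("best k:", search.best_params_["skb__k"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```

Because the selector sits inside the pipeline, it is refitted on the training part of every fold, which keeps the validation folds out of the feature selection step.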

The next (or intermediate) major step is model selection and hyperparameter optimization.

## Model Selection and Hyperparameter Optimization

We generally use the following algorithms in the process of selecting a machine learning model:

Classification:
- Random Forest
- GBM
- Logistic Regression
- Naive Bayes
- Support Vector Machines
- k-Nearest Neighbors

Regression:
- Random Forest
- GBM
- Linear Regression
- Ridge
- Lasso
- SVR

Which parameters should I optimize? How do I choose parameters closest to the best ones? 
Let’s break down the hyperparameters, model wise:

In [7]:
Image(url="http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/07/abhishek_24.png")

RS* = No proper values can be recommended; go for random search on these hyperparameters.
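Random search is available in scikit-learn as `RandomizedSearchCV`. The sketch below is only an illustration: the synthetic data, the choice of a random forest, the sampled ranges in `param_distributions`, and the small `n_iter` are all assumptions made for the example and are not the values from the table above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data standing in for the real training set
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Distributions to sample from (illustrative ranges only)
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 12),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,           # number of random configurations to try
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("cross-validated AUC:", round(search.best_score_, 3))
```

In practice `n_iter` would be much larger. The budget is fixed in advance, so adding more hyperparameters to the search does not increase the number of models trained.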