## Applying the method from this post in practice
### File structure
1. `path.py`: defines the paths to all data files used below
  * original data: `train.csv`, `test.csv`
  * parsed data: `parsed_train.csv`, `parsed_test.csv`
  * parsed and normalized data: `parsed_train_normalized.csv`, `parsed_test_normalized.csv`

<br>2. `preprocessing.py`: preprocessing methods for the original data set <br>
<br>
  &emsp;&emsp; *functions:*
  * loadParseData: performs limited feature engineering and imputes missing values with the mean or median, depending on the feature's format
  * fill_svd: returns the complete data set with missing values imputed by an SVD method (run after *loadParseData*)
  * normalization: normalizes the data with a min-max scaler
  * et_selection: selects features ranked by importance using an **ExtraTreesClassifier** model
  * svc_selection: selects features through regularization using a **LinearSVC** model

<br>3. `paramsTuning.py`: tunes each model's parameters for the best cross-validation performance using the **grid_search** method <br>
  *class model_tuning_params consists of:*  
  * \__init__: takes a model name (xgb, rf, regress, svm) and initializes the corresponding model and the parameter grid to test
  * modelfit_xgb: xgboost-specific tuning with *xgb.cv*; returns the best-fitting parameters
  * grid_search: general function for the other models; returns the best-fitting parameters<br>
  
<br>4. `ensemble.py`: an ensemble approach that feeds the preliminary models' predicted probabilities back into the training data and trains again with an xgboost model <br>
<br>5. `offset.py`: an approach to compensate for the imbalanced data set. An offset vector is added to the predicted values to make their distribution more reasonable; the offset vector is optimized on the training data with the *fmin_powell* function.  
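
Of the preprocessing functions listed above, `fill_svd` is the least self-explanatory. The sketch below shows one common way to impute missing values with an SVD: fill the gaps with column means, then repeatedly replace them with a low-rank reconstruction. The function name `svd_impute`, its parameters, and the toy data are assumptions made for illustration; the actual `fill_svd` may work differently.

```python
import numpy as np


def svd_impute(X, rank=2, n_iter=200, tol=1e-8):
    """Iteratively impute NaN entries with a rank-`rank` SVD reconstruction.

    Hypothetical helper for illustration, not the repository's fill_svd.
    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from the column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed values; overwrite only the missing ones.
        updated = np.where(missing, low_rank, X)
        change = np.linalg.norm(updated - filled) / max(np.linalg.norm(filled), 1e-12)
        filled = updated
        if change < tol:
            break
    return filled


# Toy data: an exactly rank-2 matrix with ten entries hidden.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))
observed = truth.copy()
rows = np.arange(0, 30, 3)
cols = (rows // 3) % 6
observed[rows, cols] = np.nan

completed = svd_impute(observed, rank=2)
```

The rank is the main knob here: too low and the reconstruction underfits, too high and it simply reproduces the initial mean fill.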
    

### Pipeline walk-through:

~~~
git clone git@github.com:Entheos1994/Entheos1994.github.io.git
cd adml_dataminer
pip install -r requirements.txt
~~~

~~~python
# add the package path so the local modules can be imported
import sys
import os
path = os.getcwd()
sys.path.append(path)
~~~

~~~python
import pandas as pd

# import path, which defines TRAIN_PATH and TEST_PATH
import path

# preprocessing
import pre_processing
from pre_processing import *
parsed_train, parsed_test = loadParseData()

# SVD imputation on the combined train and test data
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
all_data = pd.concat([parsed_train, parsed_test])
parsed_svd = fill_svd(all_data)

# normalization
parsed_train_normalized, parsed_test_normalized = normalization(parsed_train, parsed_test)

# feature selection with ExtraTrees using parsed_train (not normalized)
# alternative 1: keep the top-ranked features whose cumulative importance reaches 90% (tunable through default_import)
extra_feature_list_1 = et_selection(parsed_train, alternatives=False, default_import=0.9)

# alternative 2: keep features whose individual importance is over 0.5%
extra_feature_list_2 = et_selection(parsed_train, alternatives=True)


# feature selection with a LinearSVC model using parsed_train_normalized (normalized)
# alternative 1: select features using l1 regularization
svc_feature_list_l1 = svc_selection(parsed_train_normalized, p='l1')

# alternative 2: select features using l2 regularization
svc_feature_list_l2 = svc_selection(parsed_train_normalized, p='l2')
~~~
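
The two ExtraTrees alternatives above (a cumulative-importance cutoff and a per-feature threshold) can be sketched with scikit-learn as follows. This is a rough illustration on synthetic data: `select_by_importance` and its parameters are hypothetical names, and the real `et_selection` may rank or threshold differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def select_by_importance(X, y, feature_names, cumulative=0.9,
                         min_importance=None, seed=26):
    """Return feature names chosen from ExtraTrees importances.

    Hypothetical helper for illustration, not the repository's et_selection.
    """
    model = ExtraTreesClassifier(n_estimators=100, random_state=seed).fit(X, y)
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]  # most important first
    if min_importance is not None:
        # Alternative 2: keep every feature above an individual threshold.
        keep = [i for i in order if importances[i] > min_importance]
    else:
        # Alternative 1: smallest top-ranked prefix whose importances
        # sum to at least `cumulative`.
        running = np.cumsum(importances[order])
        n_keep = min(int(np.searchsorted(running, cumulative)) + 1, len(order))
        keep = order[:n_keep]
    return [feature_names[i] for i in keep]


# Synthetic stand-in for parsed_train.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=26)
names = [f"f{i}" for i in range(X.shape[1])]

top_90 = select_by_importance(X, y, names, cumulative=0.9)
above_threshold = select_by_importance(X, y, names, min_importance=0.005)
```

Tree-based importances do not depend on feature scale, which is consistent with running this step on the unnormalized data.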

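### Preprocessing

~~~python
# reload pre_processing after editing it, without restarting the session
# (the imp module was removed in Python 3.12; importlib provides reload)
import importlib

import pre_processing
importlib.reload(pre_processing)
from pre_processing import *
~~~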

Load the data


### Parameter tuning

~~~python
import paramsTuning
from paramsTuning import *
xgb_model = model_tuning_params(model_name='xgb', random_seed=26)
rf_model = model_tuning_params(model_name='rf', random_seed=26)
svm_model = model_tuning_params(model_name='svm', random_seed=26)
regress_model = model_tuning_params(model_name='regress', random_seed=26)
~~~

~~~python
'''
These calls can take a long time to find the best parameters
--------------------------------------------------------
1. xgb
xgb_best = xgb_model.model_xgb(dtrain=parsed_train, useTrainCV=True, cv_folds=5, early_stopping_rounds=50, metric='rmse',
                     obt='reg:linear')
xgb_best.model.get_params()


2. random forest
rf_best = rf_model.grid_search(parsed_train)
rf_best.model.get_params()

3. svm
svm_best = svm_model.grid_search(parsed_train_normalized)
svm_best.model.get_params()

4. regress
regress_best = regress_model.grid_search(parsed_train_normalized)
regress_best.model.get_params()

'''
~~~
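
For readers who want to see what a cross-validated grid search looks like in isolation, here is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The parameter grid, the scoring metric, and the choice of a random forest are assumptions for illustration; they are not taken from `paramsTuning.py`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for parsed_train.
X, y = make_classification(n_samples=200, n_features=10, random_state=26)

# Illustrative grid; a real search would usually cover more values.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=26),
    param_grid,
    cv=5,                # 5-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_       # combination with the best mean CV score
best_model = search.best_estimator_     # refit on all of X, y with best_params
```

The cost grows with the product of the grid sizes times the number of folds, which is why the calls above are noted as slow.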

### Offset vector

~~~python
# take the regress model as an example
from copy import deepcopy

import offset
from offset import *
regress = deepcopy(regress_model.model)
predictors = [col for col in parsed_train_normalized.columns.values if col not in ['Response', 'Id']]
regress.fit(parsed_train_normalized[predictors], parsed_train_normalized['Response'])

train_preds = regress.predict(parsed_train_normalized[predictors])
test_preds = regress.predict(parsed_test_normalized[predictors])

offseted_preds = offset_apply(train_preds, test_preds, parsed_train_normalized['Response'])
~~~

### Ensemble

~~~python
import ensemble
from ensemble import *
rf_class_model = rf_model.model
xgb_class_model = xgb_model.model
svm_class_model = svm_model.model
regress_class_model = regress_model.model
ensembled_pred = proba_ensemble(parsed_train, parsed_train_normalized, parsed_test, parsed_test_normalized)
print(ensembled_pred)
~~~
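
The idea of feeding predicted probabilities back into training can be sketched as a two-stage stacking setup. The example below is an illustration under several assumptions: it uses synthetic data, out-of-fold probabilities from `cross_val_predict` (whether `ensemble.py` does this is not stated), and scikit-learn's `GradientBoostingClassifier` as a stand-in for the xgboost second stage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the parsed train and test sets.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=26)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=26, stratify=y)

# First-stage models (illustrative choices).
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=26),
    LogisticRegression(max_iter=1000),
]

train_blocks, test_blocks = [X_train], [X_test]
for model in base_models:
    # Out-of-fold probabilities keep each row's own label from leaking
    # into the features the second stage trains on.
    oof = cross_val_predict(model, X_train, y_train, cv=5, method="predict_proba")
    train_blocks.append(oof)
    # For the test set, fit on all training rows and predict once.
    test_blocks.append(model.fit(X_train, y_train).predict_proba(X_test))

stacked_train = np.hstack(train_blocks)
stacked_test = np.hstack(test_blocks)

# Second stage: gradient boosting on original features plus probabilities.
final_model = GradientBoostingClassifier(random_state=26).fit(stacked_train, y_train)
stacked_pred = final_model.predict(stacked_test)
```

Each base model contributes one probability column per class, so the stacked matrices here have 10 original features plus 3 columns for each of the two base models.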