.. toctree::

# FAQ

## What is Steppy's standard documentation framework?

The documentation framework is **Sphinx**.

## What is the difference between the scikit-learn Pipeline and Steppy?

In a pipeline, a series of data operations is connected together. In a data-analytics problem, this is usually a left-to-right series of transformer calls ended by an estimator, that is, a pipeline of invocations.

The major limitation of the scikit-learn Pipeline wrapper is that the data object passed between stages must have the same form at input and output, i.e. the data object(s) stay the same and are implicit throughout the Pipeline.

Steppy therefore introduces two new wrappers, Step and Adapter, to avoid this limitation:

Steps communicate data between each other with **Adapters**, which are implemented as Python dictionaries. This makes it possible to pass collections of arbitrary data types (Numpy arrays, Pandas dataframes, etc.). The basic structure is as follows:

    data_train = {'input':
                    {
                         'X': X_train,
                         'y': y_train,
                    }
                }

where X_train and y_train are local data objects, X and y are the names of the arguments of the step-exec-object-instance, and 'input' is the name of the Adapter.


See https://github.com/neptune-ml/steppy-examples/blob/master/tutorials/1-getting-started.ipynb for more information.
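
As a rough illustration of the structure above, the following plain-Python sketch shows how a nested dictionary of this shape can be unpacked into named arguments. It does not use Steppy itself; `fit_step` and the toy lists are hypothetical stand-ins chosen only for this example.

```python
# Plain-Python sketch; `fit_step` is a hypothetical stand-in for a step's
# fit method and is not part of the Steppy API.
X_train = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_train = [0, 1, 1]

data_train = {
    'input': {
        'X': X_train,
        'y': y_train,
    }
}


def fit_step(X, y):
    """Pretend to fit a model and report what was received."""
    return {'n_samples': len(X), 'n_targets': len(y)}


# The inner dictionary keys match the argument names, so the dictionary
# can be unpacked directly into the call.
summary = fit_step(**data_train['input'])
print(summary)
```

The point of the sketch is only that the inner keys ('X', 'y') line up with argument names, while the outer key ('input') names the group of data.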

## Error "no module named deepsense"

In the repository folder, run:

    pip3 install -r requirements.txt

This should resolve the error.

## How to get around the xp = f(x) versus x, xp = f(x) challenge

All of the arguments/inputs of every step-exec-object-instance must be covered by the named Adapter, but the step-exec-object-instance need not use every named argument of that Adapter. For example, the following Adapter would work for a **Step** that needs only X.


    data_train = {'input':
                    {
                         'X': X_train,
                         'y': y_train,
                         'z1': some_other_bound_variable,
                         'etc': etc,
                    }
                }
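
The idea that a step may ignore entries it does not need can be sketched in plain Python. This is not how Steppy is implemented internally; `select_arguments` and `transform` are hypothetical helpers that only illustrate picking the named arguments a function accepts.

```python
import inspect


def select_arguments(func, available):
    """Keep only the entries of `available` that `func` accepts by name."""
    accepted = inspect.signature(func).parameters
    return {name: value for name, value in available.items() if name in accepted}


def transform(X):
    """Hypothetical step that needs only X."""
    return [value * 2 for value in X]


data_train = {
    'input': {
        'X': [1, 2, 3],
        'y': [0, 1, 0],
        'z1': 'some other bound variable',
    }
}

# Extra entries ('y', 'z1') are dropped before the call.
kwargs = select_arguments(transform, data_train['input'])
result = transform(**kwargs)
print(sorted(kwargs), result)
```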

## Are there other resources for Steppy?

Yes. See https://steppy.readthedocs.io

## Are there tutorials for Steppy?

Yes, in the form of notebooks. See https://github.com/neptune-ml/steppy-examples/tree/master/tutorials

## Is Steppy thread-safe?

Steppy itself is thread-safe. However, you will use Steppy with many other packages, which may not be thread-safe. For example, NumPy ndarrays can be accessed in a thread-safe manner, but you must be careful with shared state if you mutate an array.

In Pandas, deleting a column is usually not thread-safe, as changing the size of a DataFrame usually results in a new DataFrame object. At some point this may change in Pandas and other Python libraries, as multi-core CPUs are becoming common.
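
One common way to be careful with shared state is to guard every mutation with a lock. The following is a minimal sketch, assuming NumPy is installed; the shared array and the `add_ones` worker are made up for illustration and are unrelated to Steppy's own code.

```python
import threading

import numpy as np

counts = np.zeros(4, dtype=np.int64)  # array shared by all threads
lock = threading.Lock()


def add_ones(n_iterations):
    """Increment every element n_iterations times, one locked update at a time."""
    for _ in range(n_iterations):
        with lock:
            # The read-modify-write happens while the lock is held, so
            # updates from different threads cannot interleave.
            counts[:] = counts + 1


threads = [threading.Thread(target=add_ones, args=(1000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counts)
```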

## Q&A from open-solution-data-science-bowl-2018

## Q&A from Home Credit Default Risk (Open source solution for Kaggle)

August-2018

Dear Kagglers,

As you well know, we share our work with the community for the benefit of all. We want to help those who are learning the ropes of data science get into the groove of things, we want to help those who are less organized make their code and process cleaner, and finally we want those at the very top of the game to use our work (or parts of it) to develop state-of-the-art solutions more quickly and to test the boundaries of what is possible for a given problem.

:Authors: Kuba & Kamil

## What platforms are supported?

Python 3.5 and Ubuntu 16.04. More details can be found in requirements.txt.

(Ed: It also runs in approximately 3 hours in the following configuration:

    Model Name: Mac Pro
    OS: High Sierra, version 10.13.6
    Processor Name: 12-Core Intel Xeon E5
    Processor Speed: 2.7 GHz
    Total Number of Cores: 12
    L2 Cache (per Core): 256 KB
    L3 Cache: 30 MB
    Memory: 64 GB
)

## How long does it take this solution to run?

Finally, let me mention that end-to-end execution takes something like 10-12 hours, so you may indeed wait some time for the features to be computed. The wait happens at the log message "Step installment_payments_hand_crafted, fitting and transforming...".

Runtime heavily depends on your hardware.

Please make sure that you are 100% compliant with our requirements.txt file. Differences in packages versions may lead to unexpected results :)

## Is it possible to save features with your pipeline?

I mean raw features, for the train and test sets, in the exact same shape as you feed them to your model?

:Author: narsil (kaggle handle)

### Answer

A Steppy Step object is designed to handle cases like this. There are 3 flags that are particularly important for your problem:

    persist_output = True
    cache_output = True
    load_persisted_output = False
    
In our case the Step that joins the features is called feature_joiner, so go to this line and set persist_output=True both during training and evaluation.

Then you need to generate the features. I would suggest you do it in one go with something like this:

    neptune run --config configs/neptune.yaml main.py train --pipeline_name lightGBM

It will dump your features in the /YOUR_EXPERIMENT/outputs directory.

Everything can be loaded with

    sklearn.externals.joblib.load(filepath)

The same thing can be done for the test set. Just run

    neptune run --config configs/neptune.yaml main.py predict --pipeline_name lightGBM

Note that our code splits the data into train/valid by default, so if you want features for the entire training set rather than a train/valid split, go to the pipeline_manager and change

    train_data = {'application': {'X': train_data_split.drop(cfg.TARGET_COLUMNS, axis=1),
                                  'y': train_data_split[cfg.TARGET_COLUMNS].values.reshape(-1),
                                  'X_valid': valid_data_split.drop(cfg.TARGET_COLUMNS, axis=1),
                                  'y_valid': valid_data_split[cfg.TARGET_COLUMNS].values.reshape(-1)
                                  },
to this:

    train_data = {'application': {'X': tables.application_train.drop(cfg.TARGET_COLUMNS, axis=1),
                                  'y': tables.application_train[cfg.TARGET_COLUMNS].values.reshape(-1),
                                  'X_valid':None,
                                  'y_valid': None
                                  },
                                  
It will most likely fail at training (because we need X_valid, y_valid) but it should generate the features correctly. If you have any trouble, please let me know.


:Author: Jakub Czakon (kaggle handle)

## How do I convert a saved feature file into a dataframe?

:Author: incarnation (Kaggle handle)

### Answer

You can get a data frame with all the features using the following code:

    from sklearn.externals import joblib
    df = joblib.load('path/to/your/feature_joiner')['features']

:Author: Miłosz Michta (kaggle handle)
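
For readers who want to try the load step in isolation, here is a minimal round-trip sketch. It assumes the standalone `joblib` package and pandas are installed (newer scikit-learn releases no longer ship `sklearn.externals.joblib`; with the pinned requirements the import shown above applies). The dictionary contents are toy values, not the real feature_joiner output.

```python
import os
import tempfile

import joblib
import pandas as pd

# Toy stand-in for what a persisted feature_joiner output might contain.
outputs = {
    'features': pd.DataFrame({'f1': [0.1, 0.2], 'f2': [1.0, 2.0]}),
    'feature_names': ['f1', 'f2'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, 'feature_joiner')
    joblib.dump(outputs, filepath)   # what persisting a step output amounts to
    loaded = joblib.load(filepath)   # the load step from the answer above

df = loaded['features']
print(df.shape)
```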


## Where do I change the setting in a project to use multiple CPUs?

For example, the machine has 16 CPUs and I want to use 8 of them. How should I configure the setting in the project?

:Author: Shize Su (kaggle handle)

### Answer

num_workers in neptune.yaml is set to 1 by default.

If you want to run your training on 8 CPUs, set num_workers: 8.

(Ed: One caveat is that most multi-CPU/core architectures support 2 or more threads per CPU/core. num_workers should be set to the thread count. For example, the Intel Core i7-7700K has 4 cores and 8 parallel threads.)

:Author: Miłosz Michta
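
To find a sensible upper bound for num_workers on a given machine, the standard library can report the number of logical CPUs (hardware threads). The helper below is a hypothetical sketch and is not part of the project; it only clamps a requested value to what the OS reports.

```python
import os


def pick_num_workers(requested):
    """Clamp a requested worker count to the logical CPUs reported by the OS."""
    logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, min(requested, logical_cpus))


num_workers = pick_num_workers(8)
print(num_workers)
```

The resulting number can then be written into neptune.yaml as the num_workers value.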

## Where could we set (/change) the random seed for the kfold split? 

:Author: Shize Su (kaggle handle)

### Answer

There is only one global random seed in pipeline_config.py: 

    RANDOM_SEED = 90210
    
To change the random seed for the k-fold split only, you can do this manually in the _get_fold_generator function in pipeline_manager.py.

:Author: Miłosz Michta (kaggle handle)
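
The effect of giving the split its own seed can be sketched with the standard library alone. The real _get_fold_generator is not reproduced here; `make_folds` and `FOLD_SEED` are hypothetical names, and the shuffling scheme is a simple stand-in for whatever splitter the project uses.

```python
import random

RANDOM_SEED = 90210   # global seed, as in pipeline_config.py
FOLD_SEED = 1234      # hypothetical separate seed used for the split only


def make_folds(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs from a seeded shuffle of the indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold_id in range(n_splits):
        valid_idx = indices[fold_id::n_splits]
        valid_set = set(valid_idx)
        train_idx = [i for i in indices if i not in valid_set]
        yield train_idx, valid_idx


folds = list(make_folds(n_samples=10, n_splits=5, seed=FOLD_SEED))
for train_idx, valid_idx in folds:
    print(sorted(valid_idx))
```

Because the split uses its own seed, changing FOLD_SEED changes the folds without touching any other source of randomness that reads RANDOM_SEED.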

## Which setting in the project code should be changed to avoid overwriting state files?

:Author: incarnation (Kaggle handle)

### Answer


There are 3 flags in the Step constructor that you need to consider:

    persist_output: True
    load_persisted_output: True
    cache_output: True
    
If you don't want to overwrite anything during eval/test, just make sure to set

    persist_output: False
    
All the steps used are defined in pipeline_blocks.py. More specifically, you need to take care of the feature_joiner.

It is also important to know that you can have a separate experiment_dir for train and test, so that you can load your transformed features but change the classifier on top of them. In that case you would need to copy experiment_dir/transformers to a new experiment directory. Also make sure to specify

    clean_experiment_directory_before_training: 0

in neptune.yaml, otherwise it will remove everything from the experiment_dir. One last piece of the puzzle, if you want to be a Steppy master: if your lgbm is already trained in that folder, Steppy will simply load that model and transform with it unless you pass

    force_fitting: True

to the Step constructor for that step. This means that when running grid search or random search, or simply retraining with different hyperparameters, you don't have to remove the lgbm transformer from experiment_dir/transformers; you can simply pass that flag and overwrite it.

:Author: Jakub Czakon


## How do you change models?

:Author: Maximilian Hahn

### Answer

You can change this:

    'random_forest': {'train': partial(sklearn_main,
                                            ClassifierClass=RandomForestClassifier,
                                            clf_name='random_forest',
                                            train_mode=True),
                     'inference': partial(sklearn_main,
                                                ClassifierClass=RandomForestClassifier,
                                                clf_name='random_forest',
                                                train_mode=False)
                    }
to this:

    'random_forest': partial(sklearn_main,
                         ClassifierClass=RandomForestClassifier,
                         clf_name='random_forest'),
                         
This is for the case of the scikit-learn RandomForestClassifier. For other models, you can use a model from steppy-toolkit or write your own custom model, using the closest model you can find in steppy-toolkit as a template.

:Author: Miłosz Michta (Kaggle handle)

## Is there a way to run Steppy or Neptune using Jupyter?

:Author: Daniel Burrueco

### Answer

If you are doing it in Python:

    !python main.py -- train_evaluate_predict_cv --pipeline_name lightGBM

:Author: William Green

### Answer

What is being executed are pipelines from main.py. In practice, this means that you can create a notebook in the repository root, import the required libs and execute one of these pipelines.

For example, the training pipeline is defined in the pipeline_manager.py file. Just make sure that you put in the correct paths to the data.

:Author: Kamil (Kaggle handle)

## I ran out of disk space. Is it possible to run with the cache turned off?

:Author: Dromosys (Kaggle handle)

### Answer

Set 

    cache_output=False
    persist_output=False 
    load_persisted_output=False 
    
in pipelines.py and pipeline_blocks.py

:Author: Jakub Czakon (Kaggle handle)

##  It looks like the script was designed to run on neptune only. I don't see the difference between Fast Track and Step by step installation guide.

:Author: nlgn (Kaggle handle)

### Answer
Our code is neptune-agnostic, thus you can run it as a Python script:

    python main.py train_evaluate_predict --pipeline_name lightGBM

The full list of pipelines is: lightGBM, XGBoost, random_forest, log_reg, svc. I'm still playing with XGBoost.

My intention for Fast Track was to give three short points for a user who can clone the repo, install the requirements, etc.


:Author: kamil (Kaggle handle)

## Could you point me to the right class structure and method names for saving and loading fold data?

### Answer

If you want to read them in one go:

    from sklearn.externals import joblib
    feature_dict = joblib.load('PATH/TO/FEATURE_JOINER/')
    features = feature_dict['features']

:Author: Jakub Czakon (Kaggle handle)

## Is it possible to save target values via feature_joiner / feature_joiner_valid while using CV? Or how can I access and store the CV target values?

:Author: kkaczmarek

### Answer

You can adjust FeatureJoiner to do that:

    import numpy as np
    import pandas as pd
    from steppy.base import BaseTransformer


    class FeatureJoiner(BaseTransformer):
        def __init__(self, use_nan_count=False, **kwargs):
            super().__init__()
            self.use_nan_count = use_nan_count

        def transform(self, numerical_feature_list, categorical_feature_list, targets, **kwargs):
            # Join all feature frames column-wise on a fresh 0..n-1 index.
            features = numerical_feature_list + categorical_feature_list
            for feature in features:
                feature.reset_index(drop=True, inplace=True)
            features = pd.concat(features, axis=1).astype(np.float32)
            if self.use_nan_count:
                features['nan_count'] = features.isnull().sum(axis=1)

            outputs = dict()
            outputs['features'] = features
            outputs['feature_names'] = list(features.columns)
            outputs['categorical_features'] = self._get_feature_names(categorical_feature_list)
            # Pass the targets through so they are persisted with the features.
            outputs['targets'] = targets
            return outputs

        def _get_feature_names(self, dataframes):
            feature_names = []
            for dataframe in dataframes:
                try:
                    feature_names.extend(list(dataframe.columns))
                except Exception as e:
                    # A Series has no .columns; fall back to its name.
                    print(e)
                    feature_names.append(dataframe.name)

            return feature_names
        
Remember to change the pipeline_blocks accordingly.

If you want it just for ad hoc purposes, however, I would simply dump the targets with a _foldX suffix in pipeline_manager.py. For example:

    for fold_id, (train_idx, valid_idx) in enumerate(fold_generator):
        (train_data_split,
         valid_data_split) = tables.application_train.iloc[train_idx], tables.application_train.iloc[valid_idx]

        joblib.dump(train_data_split[cfg.TARGET_COLUMNS], 'train_target_fold_{}'.format(fold_id))

:Author: Jakub Czakon (Kaggle handle)

##  I am wondering: after running the code, where can I find the output, such as 'submission.csv'?

:Author:  DKADKA

### Answer

It is placed in the experiment_directory that you specified in the yaml file.

:Author: Kamil (Kaggle handle)

##  When I run any of the functions that include parallel apply the function gets hung up and does not move.

:Author: benedic2 (Kaggle handle)

### Answer

I observed this effect with other versions of Pandas:

1. Make sure that you have the version listed in the requirements file.
1. Check CPU and memory utilization. Very likely everything is just fine; it just takes some time to extract features from the files.
1. If you are sure that you have issues with multiprocessing, you can fall back to standard Pandas. In that case you do not use the parallel apply function; you simply replace it with an operation on the Pandas group object, something like groupobject.apply(func).reset_index(). This does the same thing, but with plain Pandas.

:Author: kamil (Kaggle handle)
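
The plain-Pandas fallback from the last point can be sketched as follows. The table, the column names (`client_id`, `amount`) and the `total_amount` function are made up for illustration; only the groupobject.apply(func).reset_index() pattern comes from the answer above.

```python
import pandas as pd

# Toy data; the table and column names are hypothetical.
installments = pd.DataFrame({
    'client_id': [1, 1, 2, 2, 2],
    'amount': [10.0, 20.0, 5.0, 5.0, 5.0],
})


def total_amount(group):
    """Aggregate one group's amounts into a single number."""
    return group.sum()


# Single-process equivalent of a parallel apply over groups.
groupobject = installments.groupby('client_id')['amount']
features = groupobject.apply(total_amount).reset_index()
print(features)
```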

## Is the number of estimators set anywhere for the LGBM models?

:Author: benedic2 (Kaggle handle)

### Answer

Yes, we set it in the configuration files. Look for lgbm__number_boosting_rounds. 

:Author: kamil (Kaggle handle)

## How do you go about removing unimportant features? Do you simply remove all features with SHAP values below a certain threshold?

:Author:  (Kaggle handle)

### Answer

Actually, we are not removing any features, as this could be really problematic. Note that in our notebook we analyze the data/model from fold_0, and features which have zero importance in this fold have nonzero importance in others.

:Author: Miłosz Michta (Kaggle handle)
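
The risk described above can be shown with a tiny sketch. The importance values below are made up purely for illustration and are not results from the solution; the feature and fold names are hypothetical.

```python
# Made-up importance values for illustration only; not results from the solution.
importance_by_fold = {
    'fold_0': {'feature_a': 12.0, 'feature_b': 0.0},
    'fold_1': {'feature_a': 10.0, 'feature_b': 3.0},
    'fold_2': {'feature_a': 11.0, 'feature_b': 0.0},
}

feature_names = sorted(importance_by_fold['fold_0'])

# Features that look useless when only fold_0 is inspected.
zero_in_fold_0 = [name for name in feature_names
                  if importance_by_fold['fold_0'][name] == 0.0]

# Features that are zero in every fold, a much stricter criterion.
zero_everywhere = [name for name in feature_names
                   if all(fold[name] == 0.0 for fold in importance_by_fold.values())]

print(zero_in_fold_0, zero_everywhere)
```

Dropping everything in the first list would discard a feature that still contributes in another fold, which is why a single-fold threshold is risky.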

## Error:  from toolkit.sklearn_transformers.models import SklearnClassifier ModuleNotFoundError: No module named 'toolkit'

:Author: Omid Safarzadeh (Kaggle handle)

### Answer

I think that you did not install steppy-toolkit==0.1.5. Try this:

    pip3 install steppy-toolkit==0.1.5

:Author: kamil (Kaggle handle)

## Is the command same for training after making adjustments to the parameters?

:Author: William Green (Kaggle handle)

### Answer
 
Remember to change the clean_experiment_directory in the neptune.yaml to False and to set force_fitting=True in the lightgbm Step. Running it again will load the features for each fold and train the model on top of them.

:Author: Jakub Czakon (Kaggle handle)

## Q&A from TGS Salt Identification Challenge (Open source solution for Kaggle)

## Goals

- Establish a solid benchmark for the competition.
- Make this competition more approachable by giving starter code and providing help via the discussion forum.
- Promote the idea of clean and extensible code for Kaggle competitions :)
- In this topic you will read about open solution updates, new ideas, experiments and comments. Feel invited to participate in building this.

Have fun competing :) 

:Authors: Kuba & Kamil


## error: File "open-solution-salt-identification-master/common_blocks/callbacks.py",

    line 150, in on_batch_end loss = loss.data.cpu().numpy()[0] IndexError: too many indices for array 
   
  
:Author: tommao (kaggle handle)

### Answer

Please make sure that you have torch==0.3.1. More generally, please make sure that you have all the requirements in place: requirements.txt.

:Author: Kamil (Kaggle handle)

## How do I change the cost function?

:Author: Ali Sharaf (kaggle handle)

### Answer

Go to models.py and override the set_loss() method with whatever you like.

For example, I experimented a bit with the Lovász loss and pretty much just passed lovasz_softmax from the PyTorch implementation there.

:Author: Jakub Czakon

##  'Error: Got unexpected extra argument (True)'

I used:

    python main.py -- train --pipeline_name unet --dev_mode True

:Author: Tammao (kaggle handle)


### Answer
--dev_mode is a flag (it does not need any further arguments). You just need to put --dev_mode without True.

Therefore, if you need dev_mode go with this command: 

    python main.py -- train --pipeline_name unet --dev_mode

If you do not need dev_mode use this one: 

    python main.py -- train --pipeline_name unet
    
:Author: kamil (kaggle handle)
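
The behaviour of a value-less flag can be reproduced in isolation. The project's own command-line code is not shown here; this sketch uses the standard-library `argparse` as a stand-in, with the option names taken from the commands above.

```python
import argparse

parser = argparse.ArgumentParser(prog='main.py')
parser.add_argument('--pipeline_name', required=True)
# A flag takes no value: present means True, absent means False.
parser.add_argument('--dev_mode', action='store_true')

with_flag = parser.parse_args(['--pipeline_name', 'unet', '--dev_mode'])
without_flag = parser.parse_args(['--pipeline_name', 'unet'])
print(with_flag.dev_mode, without_flag.dev_mode)

# A value placed after the flag is rejected, because the flag does not consume it.
try:
    parser.parse_args(['--pipeline_name', 'unet', '--dev_mode', 'True'])
    rejected = False
except SystemExit:
    rejected = True
print(rejected)
```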

## How can I determine whether it is using a pretrained encoder or not?

     U-Net parameters encoder: ResNet152

It is using ResNet152 as the encoder. According to models.py there are a lot of PRETRAINED_NETWORKS.

:Author: Tammao (kaggle handle)

### Answer
There is a dictionary with models at the top of models.py and ResNet152 is one of the keys. So by selecting the encoder in the neptune.yaml config, you select that model from models.py.


:Author: Jakub Czakon (kaggle handle)

## When I run train() a second time in one session it doesn't train. Is it because the model is already trained?

I get the same result if I delete the models from the experiment folders.

:Author: Tammao (kaggle handle)

### Answer

The models are already trained, so running again just loads the model and transforms/predicts with it. You can change the experiment_dir at the top of main.py to run another model.

:Author: Jakub Czakon (kaggle handle)

## How do I know which model, loss, etc. are being used?

:Author: Daniel Möller (kaggle handle)

### Answer
What we provide is source code that gives you a good jump-start into the competition. It is worth spending a while analyzing how we think about the solution. Specifically, I recommend that you git clone our source code and open it in your favorite IDE (I use PyCharm for coding). Then start with main.py and analyze the flow of execution. This will give you a good overview of what is happening. Regarding the things that you mentioned:

1. Some installation and execution help is here.
2. The model is a UNet implemented in PyTorch.
3. activation function

:Author: kamil (kaggle handle)

## Is the train.csv the metadata.csv?

:Author: William Green (kaggle handle)

### Answer

No, you need to create it by running python main.py -- prepare_metadata or neptune run main.py prepare_metadata. Remember to specify the paths in neptune.yaml first.

:Author: Jakub Czakon (kaggle handle)

## neptune: Error: Invalid parameter 'prepare_metadata'. Parameter names must begin with double dash.

When I tried to run it, I got the following response:

    neptune: Executing in Offline Mode.

    neptune: Error: Invalid parameter 'prepare_metadata'. Parameter names must begin with double dash.

What am I doing wrong?

:Author: byoussin (kaggle handle)

### Answer
Try with double dash:

    python main.py -- prepare_metadata

:Author: Jakub Czakon (kaggle handle)

## When I run train, it gives an error and the entire experiment directory vanishes, along with the meta_dir.

:Author: arao (kaggle handle)

### Answer
You have your meta_dir inside of your experiment_dir, and in neptune.yaml you have the option

        overwrite: 1

which means it cleans the experiment directory before doing anything else.

My directory structure is something like this:

project_dir:

- data
- meta
- experiments
    - exp_1
    - exp_2
    
The most important part is that your meta directory should not be inside the workspace/experiment_dir.

:Author: Jakub Czakon (kaggle handle)
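
A quick way to check a layout before training is to test whether one path lies inside another. The helper below is a hypothetical sketch using only the standard library; the directory names follow the example structure above, and the comparison is purely lexical (it does not resolve symlinks or touch the disk).

```python
from pathlib import PurePosixPath


def is_inside(path, parent):
    """Return True if `path` is `parent` itself or lies anywhere below it."""
    path, parent = PurePosixPath(path), PurePosixPath(parent)
    return path == parent or parent in path.parents


experiment_dir = 'project_dir/experiments/exp_1'
safe_meta_dir = 'project_dir/meta'
risky_meta_dir = 'project_dir/experiments/exp_1/meta'

# A meta directory inside the experiment directory would be wiped on cleanup.
print(is_inside(safe_meta_dir, experiment_dir))
print(is_inside(risky_meta_dir, experiment_dir))
```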

## Q&A from open-solution-ship-detection
https://github.com/neptune-ml/open-solution-ship-detection


## Q&A from open-solution-googleai-object-detection
https://github.com/neptune-ml/open-solution-googleai-object-detection

## Q&A from Mapping Challenge
https://www.crowdai.org/challenges/mapping-challenge


## Q&A from open-solution-value-prediction
https://github.com/neptune-ml/open-solution-value-prediction


## Q&A from open-solution-cdiscount-starter
https://github.com/neptune-ml/open-solution-cdiscount-starter


## Q&A from open-solution-talking-data
https://github.com/neptune-ml/open-solution-talking-data

## Q&A from open-solution-avito-demand-prediction
https://github.com/neptune-ml/open-solution-avito-demand-prediction


## Q&A from open-solution-toxic-comments