Skip to content
C++ code for rapid implementation of the Structured Sum-of-Squared Decomposition algorithm
Jupyter Notebook C++ Python Other
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
data
s3d rerun on macos after windows: also worked Dec 5, 2018
splitted_data
.gitignore
README.md
split_data.py

README.md

structured sum of squares decomposition (s3d) algorithm

c++ code for fast implementation of the structured sum of squares decomposition (s3d) algorithm

note: this is a python wrapper. see original repo: https://github.com/peterfennell/S3D for pure c++ implementation


clarification

the design of the constructor is meant to be easy for hyperparameter tuning by cross validation. therefore, the methods fit and predict may be a little weird to use as standalone functions.

see a quickstart notebook for more

python dependencies

  • pandas
  • joblib
  • networkx (the latest version has the edge arrow fixed!)
  • matplotlib
  • scikit-learn
  • seaborn
  • palettable

these can be installed by:

pip install -r requirements.txt

compiling s3d

compiling is simple and straight forward, do:

make all

or

make clean && make all

to remove previous compiled files.

data for cross val

for cross validation processes, datasets should be splitted first. the script split_data.py can be used to do so. the script can be run as the following:

python split_data.py data_name num_folds

this will read in data_name.csv from data/ folder; split data and store the train/test sets into splitted_data/data_name/ folder. for example, if there is a dataset called breastcancer.csv in data/, we can do:

python split_data.py breastcancer 5

if we do ls splitted_data/breastcancer/, we will see:

0  1  2  3  4

which are fold indices named folder, each of which there will be:

num_rows.csv  test.csv  train.csv

train.csv is the training set; test.csv is the testing set; num_rows.csv store the number of rows in train/test respectively.

finally, you can parallelize the data partitioning with more cores:

ptyhon split_data.py data_name num_folds num_jobs

where num_jobs is 1 by default.

data format

data format: in data_name.csv, the first column is the target column named target, followed by features:

target,feature_1,feature_2,...,feature_p

file structure

file directory setup: PYS3D class will by default create subfodlers data_name in:

tmp/ predictions/ models/ cv/

where:

  • tmp will store the "inner cross validation" temporary files
  • predictions will store the prediction results
  • models will store the trained models
  • cv will store the cross validation performance (on the validation set)

therefore, the first step is to create all these 4 for your convenience. you can do:

./init

which will cleanup and create these folders

reversely, run ./cleanup to remove these folders

You can’t perform that action at this time.