Implementation of new efficient approximation algorithms for computing string kernels

mufarhan/sequence_class_NIPS_2017

Efficient-Approximation-Algorithms-for-Strings-Kernel-Based-Sequence-Classification

References:

Farhan, M., Tariq, J., Zaman, A., Shabbir, M., Khan, I. "Efficient Approximation Algorithms for Strings Kernel Based Sequence Classification." NIPS, 2017

This repository implements new approximation-based kernel algorithms for strings (see References).

Data Sources:

Ding-Dubchak - http://ranger.uta.edu/~chqding/protein/
Music Genre - http://opihi.cs.uvic.ca/sound/genres/
ISMIR Contest - http://ismir2004.ismir.net/genre_contest/
Artist20 - https://labrosa.ee.columbia.edu/projects/artistid/

Installation:

Download the Source Code folder to your desired directory, and make sure the environment variable for Java is set, so that javac and java can be run from the command line.

cd "Source Code"
javac Main.java

This will produce the class files for string kernel computations:

Main.class : computes mismatch kernel matrices

Usage:

The program takes a text file with sequences and outputs a text file with a kernel matrix.

E.g., to compute the mismatch(8,2) kernel for 1000 sequences with alphabet size 1024:

java Main music.genre.txt 1000 8 2 1024

This will create the file Kernel-k8-m2.txt containing the kernel matrix.

String kernels are called with the following parameters:
java Main <Sequence-file> <# of Sequences> <k> <m> <AlphabetSize>

where
<Sequence-file> is the file with sequence data:
one sequence per line (each line should end with a line feed), with sequence elements separated by spaces;
all sequence elements are assumed to be in the range [0, <AlphabetSize> - 1].
See the Datasets folder for an example of the sequence file format.
<k>,<m>,<b>,<sigma> are the corresponding kernel parameters (see References, and the help for a particular function, for details).
<# of Sequences> is the total number of sequences.
<AlphabetSize> is the size of the alphabet.
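
The input format described above can be checked before running Main. The following is a minimal sketch, not part of the repository: the class SequenceFileCheck and its methods are hypothetical names, and skipping blank lines is an assumption, since the README does not say how Main treats them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the repository): validates sequence data
// against the format described in the Usage section.
class SequenceFileCheck {

    // Parses one line of space-separated integers and checks that every
    // element lies in the range [0, alphabetSize - 1].
    static int[] parseLine(String line, int alphabetSize) {
        String[] tokens = line.trim().split("\\s+");
        int[] sequence = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            int symbol = Integer.parseInt(tokens[i]);
            if (symbol < 0 || symbol >= alphabetSize) {
                throw new IllegalArgumentException(
                    "Symbol " + symbol + " is outside [0, " + (alphabetSize - 1) + "]");
            }
            sequence[i] = symbol;
        }
        return sequence;
    }

    // Reads one sequence per line and checks the total against <# of Sequences>.
    static List<int[]> readSequences(BufferedReader reader, int expectedCount, int alphabetSize)
            throws IOException {
        List<int[]> sequences = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            sequences.add(parseLine(line, alphabetSize));
        }
        if (sequences.size() != expectedCount) {
            throw new IllegalArgumentException(
                "Expected " + expectedCount + " sequences, found " + sequences.size());
        }
        return sequences;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample standing in for a sequence file with alphabet size 1024.
        String sample = "0 17 1023 5\n3 3 512\n";
        List<int[]> sequences =
            readSequences(new BufferedReader(new StringReader(sample)), 2, 1024);
        System.out.println("Read " + sequences.size() + " sequences");
    }
}
```

To check a real file, the StringReader could be replaced with a reader opened on the sequence file, passing the same <# of Sequences> and <AlphabetSize> values that are given to Main.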

The output kernel matrix is written to the file Kernel-k<k>-m<m>.txt.
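
For readers unfamiliar with the quantity stored in the kernel matrix, the sketch below computes the mismatch(k,m) kernel by brute force, using its standard definition: for every possible k-mer over the alphabet, count the k-mers of each sequence within Hamming distance m of it, and sum the products of the two counts. This is not the approximation algorithm implemented in Main. It is exact, unnormalized, and exponential in k, so it is only usable for tiny alphabets and small k. The class name NaiveMismatchKernel and its methods are hypothetical, and whether Main normalizes its output is not stated in this README.

```java
// Hypothetical illustration (not part of the repository): exact brute-force
// mismatch(k,m) kernel for tiny inputs.
class NaiveMismatchKernel {

    // Counts the k-mers of seq that differ from gamma in at most m positions.
    static long countNeighbors(int[] seq, int[] gamma, int m) {
        int k = gamma.length;
        long count = 0;
        for (int start = 0; start + k <= seq.length; start++) {
            int mismatches = 0;
            for (int j = 0; j < k && mismatches <= m; j++) {
                if (seq[start + j] != gamma[j]) {
                    mismatches++;
                }
            }
            if (mismatches <= m) {
                count++;
            }
        }
        return count;
    }

    // Sums countNeighbors(x) * countNeighbors(y) over all alphabetSize^k k-mers.
    static long kernel(int[] x, int[] y, int k, int m, int alphabetSize) {
        int[] gamma = new int[k]; // current candidate k-mer, starting at 0...0
        long total = 0;
        while (true) {
            total += countNeighbors(x, gamma, m) * countNeighbors(y, gamma, m);
            // Advance gamma like an odometer in base alphabetSize.
            int pos = k - 1;
            while (pos >= 0 && ++gamma[pos] == alphabetSize) {
                gamma[pos] = 0;
                pos--;
            }
            if (pos < 0) {
                break; // every k-mer has been visited
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy data over the alphabet {0, 1, 2}; real inputs are far larger.
        int[][] sequences = {
            {0, 1, 2, 0, 1},
            {1, 2, 0, 0},
            {2, 2, 1, 0, 1}
        };
        int k = 2, m = 1, alphabetSize = 3;
        // Print the kernel matrix, one row per line, values separated by spaces.
        for (int[] x : sequences) {
            StringBuilder row = new StringBuilder();
            for (int[] y : sequences) {
                row.append(kernel(x, y, k, m, alphabetSize)).append(' ');
            }
            System.out.println(row.toString().trim());
        }
    }
}
```

With m = 0 the same function reduces to the spectrum kernel, which counts exact k-mer matches between the two sequences.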

Authors:

Muhammad Farhan 14030031@lums.edu.pk
Imdad Ullah Khan imdad.khan@lums.edu.pk
