ValueError: This solver needs samples of at least 2 classes in the data #49

Open
mrshanth opened this Issue Jul 7, 2015 · 3 comments

mrshanth commented Jul 7, 2015

Hi,

I am using SparkLinearSVC. The code is as follows:

svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z, classes=np.unique(train_y))

and I get the following error:

File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
    return f(iterator)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
    mapper = lambda X_y: super(cls, self).fit(
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
    self.loss
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

However, the data contains 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and classes 0 and 1 make up 92% and 8% of the data respectively.
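The error itself can be reproduced locally with plain scikit-learn, since LinearSVC raises it whenever a single fit() call sees only one class (a minimal sketch with made-up data, not the reporter's actual dataset):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical single-class block: 10 samples, all labelled 0,
# mimicking what one Spark worker can receive from a blocked RDD.
X = np.random.rand(10, 3)
y = np.zeros(10)

try:
    LinearSVC().fit(X, y)
except ValueError as e:
    print(e)  # the same "needs samples of at least 2 classes" error
```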

@kszucs kszucs added the bug label Jul 7, 2015

kszucs (Contributor) commented Jul 7, 2015
Sadly, this is indeed a bug. Sparkit-learn trains sklearn's linear models on each block in parallel, then averages them in a reduce step. At least one of your blocks contains only one of the labels. To check, count the blocks holding fewer than two classes:

train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()

As a workaround, you could randomize the order of the training data so that no block contains a single label, but this bug is still waiting for a clever solution.
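To see why a 92/8 split with blocks of 2000 can produce single-class blocks, here is a small sketch with synthetic labels (the label-sorted layout and counts are assumptions, not the reporter's actual data): if the rows arrive sorted by label, several blocks are pure class 0, while shuffling the rows removes them:

```python
import numpy as np

rng = np.random.default_rng(0)
block_size = 2000

# Synthetic labels with the issue's 92/8 imbalance, sorted by label
# (e.g. as a sorted export or a groupBy would produce them).
y = np.array([0] * 9200 + [1] * 800)

def single_class_blocks(labels):
    """Count blocks containing only one label -- the blocks fit() fails on."""
    blocks = [labels[i:i + block_size] for i in range(0, len(labels), block_size)]
    return sum(1 for b in blocks if np.unique(b).size < 2)

print(single_class_blocks(y))  # sorted labels: 4 pure class-0 blocks

rng.shuffle(y)                 # randomize row order before blocking
print(single_class_blocks(y))  # shuffled: a pure block is astronomically unlikely
```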



mrshanth commented Jul 8, 2015

Thanks

jaydee92 commented Dec 14, 2017

I believe I found a workaround for this. Since these problems tend to occur with highly imbalanced datasets, I would suggest using StratifiedShuffleSplit and varying the train_size or test_size ratio, as seen below:

from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

# Sweep over stratified train ratios; stratification keeps both classes in each split
for trainRatio in np.arange(0.05, 1, 0.05):
    split = StratifiedShuffleSplit(n_splits=2, train_size=trainRatio)
    for trainIdx, testIdx in split.split(X, y):
        Xtrain, Xtest = X[trainIdx], X[testIdx]
        ytrain, ytest = y[trainIdx], y[testIdx]
        model = someModel()
        model.fit(Xtrain, ytrain)
        pred = model.predict(Xtest)
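A self-contained sketch of this workaround (the synthetic 92/8 dataset and LogisticRegression standing in for someModel are assumptions, not part of the original comment): stratification preserves the class ratio inside every split, so even a small train_size still contains both classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic 92/8 imbalanced dataset, standing in for the real X, y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 920 + [1] * 80)
X[y == 1] += 1.5  # shift class 1 so there is something to learn

# StratifiedShuffleSplit keeps the 92/8 ratio inside every split,
# so even train_size=0.1 (100 samples) still contains both classes.
split = StratifiedShuffleSplit(n_splits=1, train_size=0.1, random_state=0)
trainIdx, testIdx = next(split.split(X, y))

print(np.bincount(y[trainIdx]))  # both classes present despite the small split

model = LogisticRegression()
model.fit(X[trainIdx], y[trainIdx])
pred = model.predict(X[testIdx])
```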
