Speed issue of precision-recall quality control layer #456

MaratKhabibullin · 2021-09-29T12:30:45Z

Please implement the calculation of positivesCorrect, positivesTotal, negativesCorrect, and negativesTotal on the device side.

neoml/NeoML/src/Dnn/Layers/PrecisionRecallLayer.cpp

Lines 65 to 91 in 7b76536

void CPrecisionRecallLayer::RunOnceAfterReset()
{
	CPtr<CDnnBlob> inputBlob = inputBlobs[0];
	CPtr<CDnnBlob> expectedLabelsBlob = inputBlobs[1];

	CArray<float> labels;
	labels.SetSize( expectedLabelsBlob->GetObjectCount() );
	expectedLabelsBlob->CopyTo( labels.GetPtr(), labels.Size() );

	CArray<float> networkOutputs;
	networkOutputs.SetSize( inputBlob->GetObjectCount() );
	inputBlob->CopyTo( networkOutputs.GetPtr(), networkOutputs.Size() );

	for( int i = 0; i < inputBlob->GetObjectCount(); i++ ) {
		if( labels[i] > 0 ) {
			if( networkOutputs[i] >= 0 ) {
				positivesCorrect++;
			}
			positivesTotal++;
		} else {
			if( networkOutputs[i] < 0 ) {
				negativesCorrect++;
			}
			negativesTotal++;
		}
	}
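For reference, the four counters collected by the loop above are enough to derive precision and recall. The sketch below replays the same counting rule (label > 0 is a positive, output >= 0 is a predicted positive) on plain std::vector inputs and then applies the standard definitions precision = TP / (TP + FP) and recall = TP / positivesTotal, with FP = negativesTotal - negativesCorrect. The names CCounts, CountOutcomes, Precision and Recall are hypothetical and not part of NeoML; this is only an illustration of what the counters mean, not of how the layer reports its output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the counters accumulated by the loop above.
struct CCounts {
	int positivesCorrect = 0; // true positives
	int positivesTotal = 0;   // all ground-truth positives
	int negativesCorrect = 0; // true negatives
	int negativesTotal = 0;   // all ground-truth negatives
};

// Same rule as the layer's loop: label > 0 is positive, output >= 0 is predicted positive.
CCounts CountOutcomes( const std::vector<float>& labels, const std::vector<float>& outputs )
{
	assert( labels.size() == outputs.size() );
	CCounts counts;
	for( std::size_t i = 0; i < labels.size(); i++ ) {
		if( labels[i] > 0 ) {
			if( outputs[i] >= 0 ) {
				counts.positivesCorrect++;
			}
			counts.positivesTotal++;
		} else {
			if( outputs[i] < 0 ) {
				counts.negativesCorrect++;
			}
			counts.negativesTotal++;
		}
	}
	return counts;
}

// precision = TP / (TP + FP); false positives are negatives not classified correctly.
double Precision( const CCounts& c )
{
	const int falsePositives = c.negativesTotal - c.negativesCorrect;
	const int predictedPositives = c.positivesCorrect + falsePositives;
	return predictedPositives == 0 ? 0.0 : static_cast<double>( c.positivesCorrect ) / predictedPositives;
}

// recall = TP / (TP + FN) = TP / positivesTotal.
double Recall( const CCounts& c )
{
	return c.positivesTotal == 0 ? 0.0 : static_cast<double>( c.positivesCorrect ) / c.positivesTotal;
}
```

With labels { 1, -1, -1, -1, 1 } and outputs { 0.7, -0.2, 0.3, -0.9, 0.0 }, both positives are found (recall 1.0), but one negative is predicted positive, so precision is 2/3.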


MaratKhabibullin · 2021-09-29T12:38:42Z

Example:

	CConstFloatHandle calculatedLogit = inputBlobs[0]->GetData();

	CConstFloatHandle groundtruth = inputBlobs[1]->GetData();

	CConstFloatHandle notIgnoredSegmentsMask = inputBlobs[2]->GetData();

	const int vectorSize = inputBlobs[0]->GetDataSize();

	CFloatHandleStackVar minusOne( MathEngine() );
	minusOne.SetValue( -1.f );
	CFloatHandleStackVar zero( MathEngine() );
	zero.SetValue( 0.0f );
	CFloatHandleStackVar ones( MathEngine(), vectorSize );
	MathEngine().VectorFill( ones, 1.0f, vectorSize );

	// Mask of the elements classified as the positive class.
	// The threshold is 0.5 (in probability terms), so it is enough to look
	// at the sign: positive-class elements have a positive logit.
	CFloatHandleStackVar binarizedCalculation( MathEngine(), vectorSize );
	MathEngine().VectorReLUDiff( calculatedLogit, ones, binarizedCalculation, vectorSize, zero );

	// Mask of the ground-truth positive elements
	CFloatHandleStackVar binarizedLabel( MathEngine(), vectorSize );
	MathEngine().VectorReLUDiff( groundtruth, ones, binarizedLabel, vectorSize, zero );

	// Mask of the correctly classified positive elements
	CFloatHandleStackVar positiveMask( MathEngine(), vectorSize );
	// 1 where both vectors contain 1, 0 otherwise
	MathEngine().VectorEltwiseMin( binarizedLabel, binarizedCalculation, positiveMask, vectorSize );
	// Count only the non-ignored segments
	MathEngine().VectorEltwiseMultiply( positiveMask, notIgnoredSegmentsMask, positiveMask, vectorSize );

	// Number of correctly classified positive elements
	CFloatHandleStackVar positiveCorrectTotal( MathEngine() );
	MathEngine().VectorSum( positiveMask, vectorSize, positiveCorrectTotal );

	// Total number of positive-class elements
	CFloatHandleStackVar positivesTotal( MathEngine(), 1 );
	CFloatHandleStackVar temp( MathEngine(), vectorSize );
	MathEngine().VectorCopy( temp, binarizedLabel, vectorSize );
	// Count only the non-ignored segments
	MathEngine().VectorEltwiseMultiply( temp, notIgnoredSegmentsMask, temp, vectorSize );
	MathEngine().VectorSum( temp, vectorSize, positivesTotal );

	// Mask of the correctly classified negative elements
	CFloatHandleStackVar negativesCorrect( MathEngine(), vectorSize );
	// 0 where both vectors contain 0, 1 otherwise
	MathEngine().VectorEltwiseMax( binarizedLabel, binarizedCalculation, negativesCorrect, vectorSize );
	// Correctly classified negative elements are now 0.
	// Invert the vector.
	// {0, 1} -> {-1, 0}
	MathEngine().VectorAddValue( negativesCorrect, negativesCorrect, vectorSize, minusOne );
	// {-1, 0} -> {0, 1}
	MathEngine().VectorAbs( negativesCorrect, negativesCorrect, vectorSize );
	// Count only the non-ignored segments
	MathEngine().VectorEltwiseMultiply( negativesCorrect, notIgnoredSegmentsMask, negativesCorrect, vectorSize );

	// Number of correctly classified negative elements
	CFloatHandleStackVar negativesCorrectTotal( MathEngine() );
	MathEngine().VectorSum( negativesCorrect, vectorSize, negativesCorrectTotal );

	// Total number of negative-class elements
	CFloatHandleStackVar negativesTotal( MathEngine() );
	// Negative elements are now 0.
	// Invert the vector.
	// {0, 1} -> {-1, 0}
	MathEngine().VectorAddValue( binarizedLabel, binarizedLabel, vectorSize, minusOne );
	// {-1, 0} -> {0, 1}
	MathEngine().VectorAbs( binarizedLabel, binarizedLabel, vectorSize );
	// Count only the non-ignored segments
	MathEngine().VectorEltwiseMultiply( binarizedLabel, notIgnoredSegmentsMask, binarizedLabel, vectorSize );
	MathEngine().VectorSum( binarizedLabel, vectorSize, negativesTotal );

	negativesCount += to<int>( negativesTotal.GetValue() );
	positivesCount += to<int>( positivesTotal.GetValue() );
	truePositivesCount += to<int>( positiveCorrectTotal.GetValue() );
	trueNegativeCount += to<int>( negativesCorrectTotal.GetValue() );

	assert( positivesCount >= 0 );
	assert( negativesCount >= 0 );
	assert( trueNegativeCount <= negativesCount );
	assert( truePositivesCount <= positivesCount );
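The example relies on a few identities for {0, 1} masks: element-wise min acts as logical AND, element-wise max acts as OR, and |x - 1| acts as NOT. The host-side sketch below replays the same sequence of operations on std::vector (no NeoML involved) so the mask arithmetic can be checked against a hand count. The names CMaskCounts, Binarize, Sum and CountWithMasks are hypothetical. Binarize assumes VectorReLUDiff with a zero threshold yields 1 only where the input is strictly greater than 0; under that assumption a logit of exactly 0 is counted as negative here, while the original loop treats networkOutputs[i] >= 0 as positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result holder; sums are kept as float, as in the device version.
struct CMaskCounts {
	float positivesCorrect = 0;
	float positivesTotal = 0;
	float negativesCorrect = 0;
	float negativesTotal = 0;
};

// Assumed behavior of VectorReLUDiff( x, ones, result, n, zero ): 1 where x > 0, else 0.
static std::vector<float> Binarize( const std::vector<float>& x )
{
	std::vector<float> result( x.size() );
	for( std::size_t i = 0; i < x.size(); i++ ) {
		result[i] = x[i] > 0 ? 1.f : 0.f;
	}
	return result;
}

static float Sum( const std::vector<float>& x )
{
	float sum = 0;
	for( float value : x ) {
		sum += value;
	}
	return sum;
}

CMaskCounts CountWithMasks( const std::vector<float>& logits, const std::vector<float>& labels,
	const std::vector<float>& notIgnored )
{
	const std::size_t n = logits.size();
	assert( labels.size() == n && notIgnored.size() == n );
	const std::vector<float> predicted = Binarize( logits );
	const std::vector<float> truth = Binarize( labels );
	std::vector<float> tp( n ), pos( n ), tn( n ), neg( n );
	for( std::size_t i = 0; i < n; i++ ) {
		// min of two {0, 1} values == AND: predicted positive and actually positive
		tp[i] = std::fmin( truth[i], predicted[i] ) * notIgnored[i];
		pos[i] = truth[i] * notIgnored[i];
		// max == OR; |OR - 1| == NOR: both masks are 0, i.e. a true negative
		tn[i] = std::fabs( std::fmax( truth[i], predicted[i] ) - 1.f ) * notIgnored[i];
		// |x - 1| == NOT: actually negative
		neg[i] = std::fabs( truth[i] - 1.f ) * notIgnored[i];
	}
	CMaskCounts counts;
	counts.positivesCorrect = Sum( tp );
	counts.positivesTotal = Sum( pos );
	counts.negativesCorrect = Sum( tn );
	counts.negativesTotal = Sum( neg );
	return counts;
}
```

For logits { 2, -1, 0.5, -3, 0 }, labels { 1, 1, 0, 0, 0 } and mask { 1, 1, 1, 0, 1 }, the fourth element is ignored and the last one (logit exactly 0) becomes a true negative, giving 1 of 2 positives and 1 of 2 negatives correct.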

FedyuninV · 2021-09-29T14:37:21Z

Have you tested its performance? Have you met any networks where the current version leads to a significant slowdown?

My arguments:

It's hard to find an example where quality control takes a significant share of training time.
The amount of data is too small, so the CUDA kernel grids will be too small, and the kernel launch overhead may negate any benefit from GPU parallelism.
In the example above, .GetValue() is called, which leads to a synchronous cudaMemcpy. As a result, this example has no advantage in terms of CUDA device synchronization.

MaratKhabibullin · 2021-09-29T14:39:00Z

Have you tested its performance?

Yep.

Have you met any networks where current version leads to a significant decrease in speed?

Yes, a semantic segmentation net with a large output.

FedyuninV · 2021-09-29T14:44:01Z

OK then, can you please provide the inputBlobs sizes of this layer in the segmentation net?

MaratKhabibullin · 2021-09-29T15:11:05Z

832 * 320 (H x W)
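A quick back-of-the-envelope calculation puts this size in perspective. The sketch assumes one float per pixel, a single image per run, and that the CPU path copies both the network output and the label blob to the host on every run; none of these details are stated in the thread, and all constant names are hypothetical.

```cpp
#include <cstddef>

// Assumption: sizeof( float ) is 4 bytes on the target platform.
static_assert( sizeof( float ) == 4, "float is assumed to be 4 bytes" );

constexpr std::size_t kHeight = 832;
constexpr std::size_t kWidth = 320;
// Values per blob, assuming one float per pixel and one image per run.
constexpr std::size_t kValuesPerBlob = kHeight * kWidth;
// Bytes per blob.
constexpr std::size_t kBytesPerBlob = kValuesPerBlob * sizeof( float );
// Device-to-host traffic when both outputs and labels are copied each run.
constexpr std::size_t kBytesCopiedToHost = 2 * kBytesPerBlob;
// An on-device reduction only has to bring back the four counters.
constexpr std::size_t kBytesWithDeviceReduction = 4 * sizeof( float );
```

That is 266,240 values and 1,064,960 bytes per blob, or about 2.03 MiB of device-to-host copies per image (plus a 266,240-iteration host loop), versus 16 bytes if only the counters are transferred. Each of those transfers still synchronizes with the device, which is the point raised about .GetValue() above.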

FedyuninV mentioned this issue Dec 30, 2021

Optimize CPrecisionRecallLayer #533 (merged)

FedyuninV closed this as completed in #533 Dec 30, 2021
