
porting an example from tensorflow #86

Closed
prashant-saxena opened this issue Mar 22, 2024 · 4 comments

Comments

@prashant-saxena

Hello,

I'm trying to convert a simple project for silence (low-noise) detection in audio files from TensorFlow to ydf.
The input data is a single NumPy array of shape (1500, 20): 1500 samples of Mel Frequency Cepstral Coefficients (MFCC), with 20 floats each.

How do I train on this data using ydf?
Later, I would like to generate predictions for a single MFCC array of 20 floats.

Thanks

@rstz
Collaborator

rstz commented Mar 22, 2024

Hi, you can train directly on multi-dimensional numpy data, as explained in the documentation: https://ydf.readthedocs.io/en/latest/tutorial/multidimensional_feature

The super short version (with random data):

import numpy as np
import ydf

num_examples = 10000
num_rows = 20  # values per example, e.g. the 20 MFCC coefficients

# Random training data: a (num_examples, num_rows) feature matrix and a binary label.
train_data = np.random.uniform(size=(num_examples, num_rows))
train_label = np.random.randint(0, 2, size=(num_examples))

train_ds = {"features": train_data, "label": train_label}

model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Predict on a single example (note the leading batch dimension of 1).
test_data = {"features": np.random.uniform(size=(1, num_rows))}

model.predict(test_data)
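
Adapted to the MFCC setup from the question, a minimal sketch of the same pattern (the names mfcc_train, silence_labels and mfcc_frame are placeholders, not from this thread; the arrays are random stand-ins for the real data):

import numpy as np
import ydf

# 1500 MFCC frames, 20 coefficients each (stand-in for the real features).
mfcc_train = np.random.uniform(size=(1500, 20))
# Binary "silent / not silent" label per frame (stand-in for the real labels).
silence_labels = np.random.randint(0, 2, size=(1500,))

train_ds = {"features": mfcc_train, "label": silence_labels}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Prediction for a single MFCC frame: keep the leading batch dimension of 1.
mfcc_frame = np.random.uniform(size=(1, 20))
print(model.predict({"features": mfcc_frame}))  # probability of the positive class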

@prashant-saxena
Author

Hi,
Thanks for the tip.
I tried it as you suggested, but the prediction values look like random values between 0.0 and 1.0, and they are not useful at all.

@prashant-saxena
Author

OK, here is the test. Extract the files (train.npy, test.npy) from the attached zip file:

import numpy as np
import ydf

train_data = np.load('train.npy')
train_label = np.random.randint(0, 2, size=(train_data.shape[0]))

print(train_data.shape)

train_ds = {"features": train_data, "label": train_label}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
test_data = {"features": np.load('test.npy')}

predictions = model.predict(test_data)
print(predictions)

For the same data, TensorFlow's predictions are 99% correct, but ydf's predictions look random to me. Am I missing something here?
ydf.zip

@achoum
Collaborator

achoum commented Apr 16, 2024

This notebook shows how to train a model on this dataset and make predictions with a Random Forest and a Gradient Boosted Trees model. The notebook also runs a cross-validation to evaluate the quality of the predictions on this small dataset.

The model self-evaluation (model.describe(); out-of-bag accuracy of 53%) and the cross-validation (learner.cross_validation(train_ds); accuracy=50%, AUC=51%) show that the input features are virtually uncorrelated with the labels.
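
For reference, a minimal sketch of the two evaluation calls mentioned above, run on the dataset from the previous comment (the exact numbers in the reports depend on the data):

import numpy as np
import ydf

train_data = np.load("train.npy")
train_label = np.random.randint(0, 2, size=(train_data.shape[0],))
train_ds = {"features": train_data, "label": train_label}

learner = ydf.GradientBoostedTreesLearner(label="label")
model = learner.train(train_ds)

# Self-evaluation report computed during training
# (out-of-bag / validation metrics, depending on the learner).
model.describe()

# Cross-validation on the training data; returns an evaluation (accuracy, AUC, ...).
evaluation = learner.cross_validation(train_ds)
print(evaluation)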

You mention that "TensorFlow's predictions are 99% correct". Are you sure you are using the same dataset? If so, are you sure you are not evaluating on the training dataset?

@achoum achoum closed this as completed May 8, 2024