
porting an example from tensorflow #86

Closed
prashant-saxena opened this issue Mar 22, 2024 · 4 comments

Comments

@prashant-saxena

Hello,

I'm trying to convert a simple project for silence (low-noise) detection in audio files from TensorFlow to ydf.
The input data is a single NumPy array of shape (1500, 20): 1500 samples of Mel Frequency Cepstral Coefficients (MFCC), with 20 floats each.

How do I train on this data using ydf?
Later, I would like to generate predictions for a single MFCC array of 20 floats.

Thanks

@rstz
Collaborator

rstz commented Mar 22, 2024

Hi, you can train directly on multi-dimensional numpy data, as explained in the documentation: https://ydf.readthedocs.io/en/latest/tutorial/multidimensional_feature

The super short version (with random data):

import numpy as np
import ydf

num_examples = 10000
num_rows = 20  # values per example, e.g. the 20 MFCC coefficients

# Random training data: a (num_examples, num_rows) feature matrix and a binary label.
train_data = np.random.uniform(size=(num_examples, num_rows))
train_label = np.random.randint(0, 2, size=(num_examples))

train_ds = {"features": train_data, "label": train_label}

model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Predict on a single example (note the leading batch dimension of 1).
test_data = {"features": np.random.uniform(size=(1, num_rows))}

model.predict(test_data)
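
Adapted to the MFCC setup from the question, a minimal sketch of the same pattern (the names mfcc_train, silence_labels and mfcc_frame are placeholders, not from this thread; the arrays are random stand-ins for the real data):

import numpy as np
import ydf

# 1500 MFCC frames, 20 coefficients each (stand-in for the real features).
mfcc_train = np.random.uniform(size=(1500, 20))
# Binary "silent / not silent" label per frame (stand-in for the real labels).
silence_labels = np.random.randint(0, 2, size=(1500,))

train_ds = {"features": mfcc_train, "label": silence_labels}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)

# Prediction for a single MFCC frame: keep the leading batch dimension of 1.
mfcc_frame = np.random.uniform(size=(1, 20))
print(model.predict({"features": mfcc_frame}))  # probability of the positive class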

@prashant-saxena
Author

Hi,
Thanks for the tip.
I tried it as you suggested, but the prediction values look like random values between 0.0 and 1.0, and they are not useful at all.

@prashant-saxena
Author

OK, here is the test. Extract the files (train.npy, test.npy) from the attached zip file:

import numpy as np
import ydf

train_data = np.load('train.npy')
train_label = np.random.randint(0, 2, size=(train_data.shape[0]))

print(train_data.shape)

train_ds = {"features": train_data, "label": train_label}
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
test_data = {"features": np.load('test.npy')}

predictions = model.predict(test_data)
print(predictions)

For the same data, TensorFlow's predictions are 99% correct, but ydf's predictions look random to me. Am I missing something here?
ydf.zip

@achoum
Collaborator

achoum commented Apr 16, 2024

This notebook shows how to train a model on this dataset and make predictions with a Random Forest and a Gradient Boosted Trees model. The notebook also runs a cross-validation to evaluate the quality of the predictions on this small dataset.

The model self-evaluation (model.describe(); out-of-bag accuracy of 53%) and the cross-validation (learner.cross_validation(train_ds); accuracy=50%, AUC=51%) show that the input features are virtually uncorrelated with the labels.
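
For reference, a minimal sketch of the two evaluation calls mentioned above, run on the dataset from the previous comment (the exact numbers in the reports depend on the data):

import numpy as np
import ydf

train_data = np.load("train.npy")
train_label = np.random.randint(0, 2, size=(train_data.shape[0],))
train_ds = {"features": train_data, "label": train_label}

learner = ydf.GradientBoostedTreesLearner(label="label")
model = learner.train(train_ds)

# Self-evaluation report computed during training
# (out-of-bag / validation metrics, depending on the learner).
model.describe()

# Cross-validation on the training data; returns an evaluation (accuracy, AUC, ...).
evaluation = learner.cross_validation(train_ds)
print(evaluation)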

You mention that "TensorFlow's predictions are 99% correct". Are you sure you are using the same dataset? If so, are you sure you are not evaluating on the training dataset?

@achoum achoum closed this as completed May 8, 2024