The way to select particular features from CSV file #50

Closed
galimru opened this issue Sep 28, 2020 · 5 comments · Fixed by #81
Labels
documentation Improvements or additions to documentation

Comments

@galimru

galimru commented Sep 28, 2020

Hi,

I'm new to ML stuff but decided to start with your library.

I have a question:
Given this dataset: https://archive.ics.uci.edu/ml/datasets/student+performance

I'm trying to fit a linear regression to predict the grade (G3) for students based on several features from this dataset.
I want to use the following features: G1, G2, studytime, failures, absences.

How can I define which columns of the CSV file should be used as features for the dataset?
It seems to me that CSVLoader treats every column that is not a response name as a feature.
This leads to an exception, because it tries to convert every feature to a double:

Exception in thread "main" java.lang.NumberFormatException: For input string: "course"
	at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
	at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)

I've seen Python examples using the Pandas library where you can pick which columns are features and which column is the result.
Is it possible to do something like that with Tribuo?
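
For readers following along, the exception above can be reproduced outside Tribuo. This is a minimal plain-Java sketch, not Tribuo's loader code: the class and method names are hypothetical, and it only assumes that every non-response cell is passed to `Double.parseDouble`, which is what the stack trace suggests.

```java
final class ParseAllColumnsSketch {

    /**
     * Parses every cell in the row as a double, mimicking a loader that
     * treats all non-response columns as numeric features.
     * (Hypothetical helper for illustration, not a Tribuo method.)
     */
    static double[] parseAll(String[] row) {
        double[] values = new double[row.length];
        for (int i = 0; i < row.length; i++) {
            // Throws NumberFormatException for text cells such as "course".
            values[i] = Double.parseDouble(row[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // A row made only of numeric cells parses without trouble.
        System.out.println(parseAll(new String[] {"5", "6", "2"}).length);

        // A row containing a text cell fails in the same way as the trace above.
        try {
            parseAll(new String[] {"course", "5", "6"});
        } catch (NumberFormatException e) {
            System.out.println("Failed: " + e.getMessage());
        }
    }
}
```

So the fix is not to change how the text is parsed, but to avoid handing the text columns to a numeric parser in the first place.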

galimru added the enhancement label on Sep 28, 2020
@Craigacp
Member

Yep, but to process CSVs where you don't want to use all the columns, you need to use RowProcessor rather than the simple CSVLoader. If you instantiate a CSVDataSource, you can pass it a RowProcessor instance, which controls how columnar data is parsed. The RowProcessor accepts a fieldProcessorMap, which maps each feature field to the processor used to convert the String in that field into a list of Tribuo features, and a ResponseProcessor, which you can use to pick the response column. The RowProcessor is very flexible, so there are lots of other parameters, but that should be sufficient to get you going.
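
The idea described above, a map from field name to parser plus a separately chosen response column, can be sketched in plain Java. This is only an illustration of the concept and deliberately does not use Tribuo's classes: `ColumnSelectionSketch`, `Row`, and `load` are hypothetical names, the `;` separator is an assumption about the file, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;
import java.util.regex.Pattern;

final class ColumnSelectionSketch {

    /** One parsed row: the selected feature values plus the response value. */
    static final class Row {
        final Map<String, Double> features;
        final double response;

        Row(Map<String, Double> features, double response) {
            this.features = features;
            this.response = response;
        }
    }

    /** Returns the position of a column in the header, or fails loudly. */
    static int indexOf(List<String> header, String column) {
        int idx = header.indexOf(column);
        if (idx < 0) {
            throw new IllegalArgumentException("No such column: " + column);
        }
        return idx;
    }

    /**
     * Parses only the columns named in fieldParsers, plus the response column.
     * Columns that are not listed (for example text columns) are never parsed.
     */
    static List<Row> load(List<String> lines, String separator,
                          Map<String, ToDoubleFunction<String>> fieldParsers,
                          String responseColumn) {
        String splitter = Pattern.quote(separator);
        List<String> header = Arrays.asList(lines.get(0).split(splitter));
        List<Row> rows = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(splitter, -1);
            Map<String, Double> features = new LinkedHashMap<>();
            for (Map.Entry<String, ToDoubleFunction<String>> e : fieldParsers.entrySet()) {
                String cell = cells[indexOf(header, e.getKey())];
                features.put(e.getKey(), e.getValue().applyAsDouble(cell));
            }
            double response = Double.parseDouble(cells[indexOf(header, responseColumn)]);
            rows.add(new Row(features, response));
        }
        return rows;
    }

    public static void main(String[] args) {
        // The five feature columns from the question, each parsed as a double.
        Map<String, ToDoubleFunction<String>> fieldParsers = new LinkedHashMap<>();
        for (String name : Arrays.asList("G1", "G2", "studytime", "failures", "absences")) {
            fieldParsers.put(name, Double::parseDouble);
        }
        // Made-up rows in the general shape of the dataset, for illustration only.
        List<String> lines = Arrays.asList(
                "school;reason;studytime;failures;absences;G1;G2;G3",
                "GP;course;2;0;6;5;6;6");
        for (Row r : load(lines, ";", fieldParsers, "G3")) {
            System.out.println(r.features + " -> " + r.response);
        }
    }
}
```

In Tribuo itself, the map in this sketch corresponds to the fieldProcessorMap, the response column choice to the ResponseProcessor, and the loading step to a CSVDataSource built around a RowProcessor. The exact constructors are not shown in this thread, so check the Tribuo javadoc before relying on any particular signature.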

@galimru
Author

galimru commented Sep 28, 2020

Great, this is very clear. I appreciate it.

@Craigacp
Member

I'll add docs to the RowProcessor constructors in the docs pass this week.

Craigacp added the documentation label and removed the enhancement label on Sep 29, 2020
@leccelecce

Hi @Craigacp - I'm in the same boat, trying to get started with Tribuo and some basic CSV files with non-double Feature values. I've been working my way through the library, but it would be great to have some code samples using RowProcessor etc. I had a quick look in the docs and tests but couldn't see anything obvious. Are there plans to build out the docs, or have I just missed something?

@Craigacp
Member

Craigacp commented Oct 15, 2020

I'm writing a tutorial on the RowProcessor at the moment; it should land in a public branch sometime early next week.

In general, yes, we do plan to build out the docs, focusing initially on the areas people are flagging. Next up on deck after the RowProcessor tutorial is one on third party model loading using ONNXExternalModel and XGBoostExternalModel. We'll add TensorFlow to that tutorial once Tribuo has been migrated over to the recent tensorflow-java 0.2.0 release, as that will change how Tribuo's TF interface works.
