Distinguish between discrete and continuous variables #7

Open
tfaits opened this issue Jul 19, 2016 · 1 comment

Comments

@tfaits
Collaborator

tfaits commented Jul 19, 2016

In its current form, PathoStat accepts "batch" and "condition" as possible discrete variables and lets the user color or group data (in various plots) by either one. However, we're adding functionality: PathoStat will accept any number of covariates, such as patient age, weight, race, or disease status. We still want to let users color or group data by these covariates, but that doesn't make much sense for continuous variables: without binning, how do you group people by weight? You can, however, order data by a continuous variable. We want to at least distinguish between the two types, and we may want to add functionality specific to continuous variables.
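To make the binning-versus-ordering distinction concrete, here is a minimal base R sketch. The `covariates` table, its column names, and the cut points are all made up for illustration and are not PathoStat data or defaults.

```r
# Hypothetical covariate table; names and values are illustrative only.
covariates <- data.frame(
  sample_id = c("s1", "s2", "s3", "s4", "s5", "s6"),
  weight_kg = c(54.2, 61.0, 72.5, 80.1, 95.3, 68.4),
  stringsAsFactors = FALSE
)

# Grouping by a continuous variable requires binning it first.
# The break points here are arbitrary assumptions.
covariates$weight_group <- cut(
  covariates$weight_kg,
  breaks = c(-Inf, 60, 80, Inf),
  labels = c("low", "medium", "high"),
  ordered_result = TRUE
)

# Ordering needs no binning: sort samples directly by the continuous value.
ordered_ids <- covariates$sample_id[order(covariates$weight_kg)]
```

The binned column is an ordered factor, so it can be used wherever a discrete grouping variable is expected, while the raw column remains available for ordering.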

@mlbendall
Collaborator

I agree with this; I am running up against the same issue now. If you are just looking for the types as currently assigned, you can do this:

sapply(sample_variables(pstat), function(v) { class(sample_data(pstat)[[v]]) })

However, I think we need to be explicit in assigning types to sample variables. A function should be implemented that either accepts user input to assign types or attempts to infer them from the data. Inference will not always be correct. For example, R (read.table or similar) interprets "Subject ID" as an integer, but it should be a factor, since there is no meaningful ordering to the subjects. Still, inferring from the data would be a good first step.
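A rough sketch of what such a function could look like, in base R. The name `infer_variable_types`, its `overrides` and `max_levels` arguments, and the heuristics are all hypothetical; nothing here is an existing PathoStat function. It assumes the sample data has already been converted to a plain data.frame.

```r
# Hypothetical helper (not part of PathoStat): guess a type for each column,
# then apply any user-supplied overrides by column name.
infer_variable_types <- function(df, overrides = list(), max_levels = 10) {
  guess_one <- function(x) {
    if (is.ordered(x)) return("ordered")
    if (is.factor(x) || is.logical(x)) return("factor")
    n_unique <- length(unique(x[!is.na(x)]))
    if (is.character(x)) {
      # Assumed heuristic: text with a few repeated values is categorical,
      # anything else is treated as display-only text.
      if (n_unique <= max_levels && n_unique < length(x)) return("factor")
      return("character")
    }
    if (is.integer(x)) return("integer")
    if (is.numeric(x)) return("numeric")
    "character"
  }
  types <- vapply(df, guess_one, character(1))
  # User input always wins over the inferred guess.
  if (length(overrides) > 0) {
    types[names(overrides)] <- unlist(overrides)
  }
  types
}

# Made-up metadata: subject_id is read as an integer but is really a label.
meta <- data.frame(
  subject_id = c(101L, 102L, 103L, 104L),
  age = c(34.5, 51.0, 47.2, 29.8),
  condition = c("case", "control", "case", "control"),
  stringsAsFactors = FALSE
)

inferred <- infer_variable_types(meta)
corrected <- infer_variable_types(meta, overrides = list(subject_id = "factor"))
```

Inference alone labels `subject_id` as an integer, which is the failure case described above; the override is what corrects it.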

I propose we have more than two types, following the standard R data types:

  • factors: categorical/nominal variables
  • ordered factors: ordinal variables, useful for representing longitudinal variables and discretizing continuous variables
  • integer: continuous type
  • numeric/double: continuous type
  • character: text that does not need to be treated as a variable, mostly for display purposes.

These types will naturally suggest how to display them. For example, factors can be displayed using "select" inputs and qualitative color palettes, while ordered factors may also use "select" inputs but be displayed with sequential color palettes.
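One way to encode that mapping is a small lookup, sketched below. The function `display_hint` and the returned labels are hypothetical, and the "slider" suggestion for continuous types is an added assumption rather than something proposed in this thread.

```r
# Hypothetical lookup (not a PathoStat API): suggest a UI input and a
# palette family for each variable type.
display_hint <- function(type) {
  switch(type,
    factor    = list(input = "select", palette = "qualitative"),
    ordered   = list(input = "select", palette = "sequential"),
    integer   = ,  # falls through to the numeric case
    numeric   = list(input = "slider", palette = "sequential"),  # assumption
    character = list(input = "none", palette = "none"),
    stop("unknown variable type: ", type)
  )
}

hint_factor <- display_hint("factor")
hint_integer <- display_hint("integer")
```

Keeping the mapping in one place means plot and UI code only needs to ask for a hint by type, and an unrecognized type fails loudly instead of silently picking a palette.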

In addition, users should be able to indicate which covariates are "of interest". Perhaps there should be several categories, such as secondary/confounders, batch covariates, and random effects.
