Charts: stratified random sampling based on a target #490

Closed
reza1615 opened this issue May 21, 2021 · 6 comments
Labels
enhancement New feature or request

Comments

@reza1615

Currently, the random sampling for charts doesn't apply stratified sampling based on a target.
It would be helpful to have a dropdown to set the target column and generate stratified samples based on it.
That way the sample would keep the same target distribution as the whole dataset.

@reza1615 changed the title from "stratified random sampling based on a target" to "Charts: stratified random sampling based on a target" on May 21, 2021
@aschonfeld added the enhancement label on May 21, 2021
@harshithakolipaka

Hey, I would like to work on this issue. Can you point me to the source code so that I can add the feature?

@aschonfeld
Collaborator

@harshithakolipaka thanks for your interest in this feature! To give you some background on how this will work: you'll have to update the run_query function to accept a boolean named parameter, stratified_random_sample. The call will look something like this:

run_query(
    handle_predefined(data_id),
    query,
    global_state.get_context_variables(data_id),
    pct=inputs.get("load"),
    pct_type=inputs.get("load_type"),
    stratified_random_sample=True,
)

The run_query function is located here

The idea is that you'll want to check this boolean parameter, and if it's True, run your random sampling code. Run it on the df parameter passed into run_query; that's the dataframe you've loaded into D-Tale. You'll probably want to call that function before anything else in run_query. Lastly, your random sampling function should return a dataframe, which the rest of run_query will then use.
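
As a rough illustration of the flow described above, here is a minimal sketch assuming pandas 1.1 or later (for `DataFrameGroupBy.sample`). The names `stratified_sample` and `run_query_sketch`, along with the `target`, `frac`, and `random_state` parameters, are hypothetical and are not D-Tale's actual API; the real run_query takes other arguments as well.

```python
import pandas as pd


def stratified_sample(df, target, frac, random_state=None):
    """Draw the same fraction of rows from every class of `target`.

    Hypothetical helper. Rows whose target is NaN are dropped, because
    groupby excludes NaN keys by default.
    """
    return df.groupby(target, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


def run_query_sketch(df, query=None, stratified_random_sample=False,
                     target=None, frac=None, random_state=None):
    """Hypothetical stand-in for run_query: sample first, then filter."""
    if stratified_random_sample:
        # Sampling happens before anything else, as suggested above.
        df = stratified_sample(df, target, frac, random_state)
    if query:
        df = df.query(query)
    return df


if __name__ == "__main__":
    data = pd.DataFrame({"y": ["a"] * 80 + ["b"] * 20, "x": range(100)})
    result = run_query_sketch(
        data, stratified_random_sample=True, target="y", frac=0.3, random_state=0
    )
    print(result["y"].value_counts())
```

Sampling per group rather than over the whole frame is what keeps the class proportions of the target close to those of the full data.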

Honestly, if you'd just like to provide the random sampling function I can plug in the rest and make sure you still get credit for the code you've committed. Let me know if you have any other questions. Thanks

@aschonfeld
Collaborator

@reza1615 I'm looking to implement this but wanted to know which of these two solutions is more suitable:

@reza1615
Author

@reza1615 I'm looking to implement this but wanted to know which of these two solutions were more suitable:

Hi, the second one. The sampling_rate is the ratio we choose for downsampling the data. You already have it for downsampling. For example, if the user selects 30% in the UI, the sample_rate is 0.3.
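
To make the arithmetic concrete, here is a small sketch of turning a UI percentage into a sampling rate and checking that the target distribution survives. It assumes pandas 1.1 or later; `load_pct`, `sample_rate`, and the `target` column are illustrative names, not taken from D-Tale's code.

```python
import pandas as pd

# Hypothetical UI selection: the user asks to load 30% of the data.
load_pct = 30
sample_rate = load_pct / 100  # 0.3

# Toy data with a 90/10 class imbalance in the target column.
df = pd.DataFrame({"target": [0] * 900 + [1] * 100, "value": range(1000)})

# Take the same fraction from each target class.
sample = df.groupby("target", group_keys=False).sample(
    frac=sample_rate, random_state=42
)

# Both should show the same class proportions.
print(df["target"].value_counts(normalize=True))
print(sample["target"].value_counts(normalize=True))
```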

@reza1615
Author

[Image: Screenshot_20230410_155030_Chrome]

@aschonfeld
Collaborator

@reza1615 @harshithakolipaka just released v2.15.0 with this feature. Let me know if you have any issues.
