Changes in functions for generation and an example of vertical plot with additional charts #182
base: master
Changes from all commits
83a15bb
5dab608
3eb9c65
e40d6b9
7f5f918
e0d9df0
4e18668
b0c9c7b
3374d0d
9c546e0
7805d3f
4cab536
b1d0ec6
f64fb19
5443acb
5731bc6
ede49b5
062e337
3d884c4
b000f15
684be8c
ce55bd0
35ef9bf
746f679
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,7 @@ | |
import numpy as np | ||
|
||
|
||
def generate_samples(seed=0, n_samples=10000, n_categories=3): | ||
def generate_samples(seed=0, n_samples=10000, n_categories=3, extra_columns=0): | ||
"""Generate artificial samples assigned to set intersections | ||
|
||
Parameters | ||
|
@@ -19,12 +19,16 @@ def generate_samples(seed=0, n_samples=10000, n_categories=3): | |
Number of samples to generate | ||
n_categories : int | ||
Number of categories (named "cat0", "cat1", ...) to generate | ||
extra_columns : int | ||
If a vector is required, this indicates the number of additional | ||
columns (named "value1", "value2", ...) generated alongside "value" | ||
|
||
Returns | ||
------- | ||
DataFrame | ||
Field 'value' is a weight or score for each element. | ||
Field 'index' is a unique id for each element. | ||
Field(s) 'value{i}' are additional values for multiple-feature samples. | ||
Index includes a boolean indicator mask for each category. | ||
|
||
Note: Further fields may be added in future versions. | ||
|
@@ -34,19 +38,25 @@ def generate_samples(seed=0, n_samples=10000, n_categories=3): | |
generate_counts : Generates the counts for each subset of categories | ||
corresponding to these samples. | ||
""" | ||
assert extra_columns >= 0, 'extra_columns parameter should be non-negative' | ||
rng = np.random.RandomState(seed) | ||
df = pd.DataFrame({'value': np.zeros(n_samples)}) | ||
len_samples = 1 + extra_columns | ||
df = pd.DataFrame(np.zeros((n_samples, len_samples))) | ||
valuename_lst = [f'value{i}' if i > 0 else 'value' for i in | ||
Can we just call this variable |
||
range(len_samples)] | ||
df.columns = valuename_lst | ||
|
||
for i in range(n_categories): | ||
r = rng.rand(n_samples) | ||
df['cat%d' % i] = r > rng.rand() | ||
df['value'] += r | ||
r = rng.rand(n_samples, len_samples) | ||
df[f'cat{i}'] = r[:, 0] > rng.rand() | ||
Comment on lines +50 to +51
This puzzles me. We're only using the first column of a random matrix of values, and
Don't worry about making the values correlate with the categories. Just put in the docstring that the extra column values may change in a future version so we have licence to do it later. |
||
df[valuename_lst] += r | ||
|
||
df.reset_index(inplace=True) | ||
df.set_index(['cat%d' % i for i in range(n_categories)], inplace=True) | ||
df.set_index([f'cat{i}' for i in range(n_categories)], inplace=True) | ||
return df | ||
|
||
|
||
def generate_counts(seed=0, n_samples=10000, n_categories=3): | ||
def generate_counts(seed=0, n_samples=10000, n_categories=3, extra_columns=0): | ||
"""Generate artificial counts corresponding to set intersections | ||
|
||
Parameters | ||
|
@@ -57,20 +67,30 @@ def generate_counts(seed=0, n_samples=10000, n_categories=3): | |
Number of samples to generate statistics over | ||
n_categories : int | ||
Number of categories (named "cat0", "cat1", ...) to generate | ||
extra_columns: int | ||
Number of additional features to be used to generate each | ||
sample (value, value1, value2, ...) | ||
|
||
Returns | ||
------- | ||
Series | ||
Counts indexed by boolean indicator mask for each category. | ||
Series or DataFrame | ||
A Series of counts indexed by boolean indicator mask for each category, | ||
when ``extra_columns`` is 0. Otherwise a DataFrame with column ``value`` | ||
equivalent to the value produced when ``extra_columns`` is 0, as well as | ||
further random variables ``value1``, ``value2``, for extra columns. | ||
|
||
See Also | ||
-------- | ||
generate_samples : Generates a DataFrame of samples that these counts are | ||
derived from. | ||
""" | ||
assert extra_columns >= 0, 'extra_columns parameter should be non-negative' | ||
df = generate_samples(seed=seed, n_samples=n_samples, | ||
n_categories=n_categories) | ||
return df.value.groupby(level=list(range(n_categories))).count() | ||
n_categories=n_categories, | ||
extra_columns=extra_columns) | ||
df.drop('index', axis=1, inplace=True) | ||
df = df if extra_columns > 0 else df.value | ||
return df.groupby(level=list(range(n_categories))).count() | ||
I don't think counting is meaningful for the extra columns. Maybe we should use a different aggregate?
Or maybe we shouldn't offer this functionality in |
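To make the "different aggregate" suggestion concrete, here is a minimal sketch, not the PR's implementation. It re-creates the proposed `generate_samples` from this diff so the block runs on its own, then keeps a count for the subset size and uses a per-subset mean for every value column. The helper name `summarise_by_subset`, the `count` column name, and the `_mean` suffix are hypothetical choices made for illustration only.

```python
import numpy as np
import pandas as pd


def generate_samples(seed=0, n_samples=10000, n_categories=3, extra_columns=0):
    # Standalone copy of the version proposed in this diff (variable names shortened).
    rng = np.random.RandomState(seed)
    n_values = 1 + extra_columns
    names = ['value' if i == 0 else f'value{i}' for i in range(n_values)]
    df = pd.DataFrame(np.zeros((n_samples, n_values)), columns=names)
    for i in range(n_categories):
        r = rng.rand(n_samples, n_values)
        df[f'cat{i}'] = r[:, 0] > rng.rand()
        df[names] += r
    df.reset_index(inplace=True)
    df.set_index([f'cat{i}' for i in range(n_categories)], inplace=True)
    return df


def summarise_by_subset(df, n_categories):
    # Hypothetical helper: one row per combination of category memberships.
    levels = list(range(n_categories))
    grouped = df.drop(columns='index').groupby(level=levels)
    # The count is the same for every column, so report it once.
    counts = grouped['value'].count().rename('count')
    # A mean is one candidate aggregate for the value columns.
    means = grouped.mean().add_suffix('_mean')
    return pd.concat([counts, means], axis=1)


summary = summarise_by_subset(
    generate_samples(n_samples=500, extra_columns=2), n_categories=3)
print(summary)
```

Whether a mean, a sum, or no aggregate at all belongs in `generate_counts` is the open question in this thread; the sketch only shows that the count and the other aggregates can sit side by side in one frame.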
||
|
||
|
||
def generate_data(seed=0, n_samples=10000, n_sets=3, aggregated=False): | ||
|
I think using generate_samples here makes more sense? But maybe 10k samples is a lot for three swarm plots.
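One way to address the sample-size concern is to draw a reproducible subsample before plotting. The sketch below is an assumption-laden illustration rather than code from the PR: it builds a stand-in frame with the same shape as the `generate_samples` output (boolean category index plus a `value` column), and the helper name `subsample_for_plot` and the target of 300 rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for generate_samples output: boolean membership index plus a value column.
rng = np.random.RandomState(0)
n_samples = 10000
samples = pd.DataFrame({
    'cat0': rng.rand(n_samples) > 0.5,
    'cat1': rng.rand(n_samples) > 0.5,
    'cat2': rng.rand(n_samples) > 0.5,
    'value': rng.rand(n_samples),
}).set_index(['cat0', 'cat1', 'cat2'])


def subsample_for_plot(df, n, seed=0):
    # Hypothetical helper: keep at most n rows, chosen reproducibly.
    return df.sample(n=min(n, len(df)), random_state=seed)


small = subsample_for_plot(samples, 300)
print(len(small))
```

Passing a smaller `n_samples` to `generate_samples` directly would achieve the same thing without the extra step; subsampling is only useful when the full frame is also needed elsewhere.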