Command to generate a test database, part 1 #1111

Merged
merged 13 commits into from Feb 20, 2017

@rowanseymour commented Feb 20, 2017

So far this only generates non-message data: locations, orgs, users, groups, fields, labels, test contacts, contacts, contact group memberships, contact field values. Figure I'll add messages when I get on to improving message searching performance.

The default settings generate a database with 100 orgs and 1,000,000 contacts. This takes about an hour and the resultant database is ~3.1GB.

PR includes a ~7MB dump of the AdminBoundary table in Postgres's compressed/custom format, after loading the test-data/nigeria.zip geojson file. Loading the geojson file takes half an hour but the dump loads in seconds.
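
For reference, a dump in Postgres's custom format is normally written with `pg_dump` and loaded with `pg_restore`. A minimal sketch of the two command lines, built in Python; the database name, table name and file path are hypothetical placeholders, not the names used in this PR:

```python
def dump_command(db_name, table, dump_path):
    # pg_dump writes a single table to an archive in custom format
    return ["pg_dump", "--format=custom", "--table=" + table, "--file=" + dump_path, db_name]


def restore_command(db_name, dump_path):
    # pg_restore reads a custom-format archive back into an existing database
    return ["pg_restore", "--dbname=" + db_name, dump_path]


# HYPOTHETICAL names, for illustration only
print(" ".join(dump_command("my_db", "my_boundary_table", "boundaries.dump")))
print(" ".join(restore_command("my_db", "boundaries.dump")))
```

Either list can be passed to `subprocess.check_call`. The custom format is compressed by default, which is why the dump stays small.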

@rowanseymour self-assigned this Feb 20, 2017
c_id = base_contact_id + c

# ensure every org gets at least one contact
org = orgs[c] if c < len(orgs) else self.random_org(orgs)
Member Author

This is important to ensure that contacts for different orgs aren't stored sequentially, which wouldn't resemble a real-world database and would affect query performance when an org's contacts are all bunched at the beginning or end of a table's data.
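
To illustrate, a rough sketch of that allocation loop on its own. `pick_random_org` is a hypothetical uniform stand-in for `self.random_org`, whose real implementation isn't shown in this thread:

```python
import random


def pick_random_org(num_orgs, rng):
    # HYPOTHETICAL stand-in for self.random_org: a plain uniform choice
    return rng.randrange(num_orgs)


def allocate_contacts(num_orgs, num_contacts, rng):
    """Returns a list where item c is the org index for contact c."""
    org_for_contact = []
    for c in range(num_contacts):
        # the first num_orgs contacts go one per org, so no org ends up empty
        if c < num_orgs:
            org_for_contact.append(c)
        else:
            org_for_contact.append(pick_random_org(num_orgs, rng))
    return org_for_contact


rng = random.Random(1)
allocation = allocate_contacts(num_orgs=5, num_contacts=1000, rng=rng)

# contact ids belonging to the first org: spread through the id range, not one block
first_org_ids = [c for c, org in enumerate(allocation) if org == 0]
print(first_org_ids[:5], "...", first_org_ids[-1])
```

Because each contact's org is chosen independently, any one org's rows end up interleaved with everyone else's rather than sitting in one contiguous run of ids.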

@nicpottier
Collaborator

Is this 1,000,000 contacts across 100 orgs, or does each org have 1,000,000 contacts? Seems like the former is probably overkill. Our scaling issues tend to centre on single orgs being large, so I wonder if we could cut down on the build time by just building 10 orgs instead of 100.

# We want a variety of large and small orgs so when allocating content like contacts and messages, we apply a
# bias toward the beginning orgs. if there are N orgs, then the amount of content the first org will be
# allocated is (1/N) ^ (1/bias). This sets the bias so that the first org will get ~50% of the content:
self.org_bias = math.log(1.0 / num_orgs, 0.5)
Collaborator

Guess I could have just read the code. :)

@rowanseymour
Member Author

It's 1,000,000 contacts split across 100 orgs. Org generation doesn't take that much time, so there's relatively little difference between 10 and 100 orgs. Regardless of how many orgs there are, the first org always gets ~50% of the total contacts.

One thing I haven't tried yet is ditching all indexes and recreating them after, but am trying to not over-engineer things at this point.
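
For anyone checking the maths on the ~50% figure, here is a small sketch. It assumes an org index is picked as `int(N * u ** bias)` for a uniform `u` in [0, 1); that formula is an assumption which is consistent with the comment in the code, but the actual `random_org` isn't shown in this thread. Under that assumption the first org is picked whenever `u < (1/N) ^ (1/bias)`, and since `bias` equals log2(N), that threshold is 0.5 for any N.

```python
import math
import random


def org_bias(num_orgs):
    # same expression as in the PR: log base 0.5 of 1/N, which equals log2(N)
    return math.log(1.0 / num_orgs, 0.5)


def random_org_index(num_orgs, bias, rng):
    # ASSUMPTION: raising a uniform sample to the power `bias` skews it toward 0
    index = int(num_orgs * rng.random() ** bias)
    return min(index, num_orgs - 1)  # guard against edge cases such as a single org


def first_org_share(num_orgs, samples, rng):
    bias = org_bias(num_orgs)
    hits = sum(1 for _ in range(samples) if random_org_index(num_orgs, bias, rng) == 0)
    return hits / samples


rng = random.Random(0)
for n in (10, 100):
    print(n, round(first_org_share(n, 100000, rng), 3))  # both come out close to 0.5
```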

@rowanseymour
Member Author

A sample of the first org in a database with 1,000,000 contacts in total:

[Image: screen shot 2017-02-20 at 16 19 33]

Collaborator

@nicpottier left a comment

I'd say a few hours is a good target for build time for these, so agree let's keep it simple as long as we can keep it under that. Looks good!

Member

@ericnewcomer left a comment

👍

@rowanseymour merged commit e5c2ba3 into master Feb 20, 2017
@rowanseymour deleted the make_test_db branch February 20, 2017 15:10

3 participants