
Why 130MB Initial Bucket on windows ? #39

Closed
olekukonko opened this issue Dec 8, 2013 · 18 comments

Comments

@olekukonko
Collaborator

Just curious why tiedot needs to create an initial 130 MB file for _uid & data on Windows. That is 260 MB wasted without inserting any documents.

@HouzuoGuo
Owner

In the default configuration, the data file (documents) grows in 128 MB steps, and the hash table has an initial capacity of:

  • 2^14 keys (16384 distinct values)
  • 100 entries per key

The original reason for such a large initial size was to let benchmarks take accurate measurements (> 1 second for each feature) without being interrupted by file capacity growth.

But you are absolutely correct - in real usage scenarios, high initial capacity is not desired. What do you think about 32 MB data + 32 MB index?
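
For a rough sense of scale, here is a back-of-the-envelope sketch of the upfront allocation those defaults imply. Only the 2^14 keys, 100 entries per key, and 128 MB figures come from the comment above; the program itself is illustrative:

package main

import "fmt"

func main() {
    const (
        hashKeys      = 1 << 14 // 2^14 = 16384 distinct hash keys
        entriesPerKey = 100     // initial bucket capacity per key
        dataStepMB    = 128     // data file grows in 128 MB steps
    )
    // The index pre-allocates room for every possible entry upfront.
    fmt.Println("pre-allocated index entries:", hashKeys*entriesPerKey) // 1638400
    fmt.Println("initial data file size (MB):", dataStepMB)
}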

@alexandrestein

It's a good idea... 👍
I think we already discussed that. :-)

@HouzuoGuo
Owner

There is actually a "small_disk" branch that cuts the initial size down to only 4 MB per collection =D but as a result, performance is about 100x worse.

That sounds like a plan - I will fix the benchmarks and reduce the initial file size as well.

@olekukonko
Collaborator Author

I see 3 other possibilities:

  • Make the size optional (4, 8, 16, 32, 64, 128 MB) and recommend 128 MB for best performance (see the sketch below)
  • Introduce preemptive growth based on percentage usage and the volume of data inserted per second
  • Store and read data from memory, with the file system serving only as a backup
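
A minimal sketch of what the first option could look like. The CollectionConfig type and its field are hypothetical, written here for illustration rather than taken from tiedot's actual API:

package main

import "fmt"

// CollectionConfig is a hypothetical configuration struct exposing a
// user-selectable initial/growth file size.
type CollectionConfig struct {
    SizeMB int // one of 4, 8, 16, 32, 64, 128
}

// Validate rejects sizes outside the proposed options.
func (c CollectionConfig) Validate() error {
    switch c.SizeMB {
    case 4, 8, 16, 32, 64, 128:
        return nil
    }
    return fmt.Errorf("size must be one of 4/8/16/32/64/128 MB, got %d", c.SizeMB)
}

func main() {
    cfg := CollectionConfig{SizeMB: 128} // 128 MB recommended for best performance
    fmt.Println(cfg.Validate())          // <nil>
}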

@alexandrestein

Going from 130 MB down to 4 MB, I can imagine it has an impact on performance.
But maybe somewhere between those values you can find a balance between performance and DB file size.

You spoke about 32 MB, it sounds good to me 👍

@olekukonko
Collaborator Author

@alexandrestein I agree, 32 MB sounds good, but the size could be flexible.

@HouzuoGuo
Owner

How about offering two options:

  • Small collection - grows every 32 MB
  • Large collection - grows every 128 MB (current config)

And by default, the HTTP API creates a small collection; a request parameter can be set for creating the large collection (see the sketch below).

Benchmarks will continue to use the large collection.
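
A sketch of how that might look against the HTTP API. The endpoint and parameter names here are illustrative; in particular, the large parameter is the proposal under discussion, not an existing flag:

package main

import "net/http"

func main() {
    // Default: create a small collection (grows every 32 MB).
    http.Get("http://localhost:8080/create?col=events")

    // Hypothetical "large" parameter opting into the large collection
    // (grows every 128 MB, matching the current default).
    http.Get("http://localhost:8080/create?col=events&large=true")
}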

@olekukonko
Collaborator Author

Does the collection growth really need to be fixed? How about "small collection - grows every 32 MB when data <= X", because growing a large collection by 32 MB at a time might be too much overhead.

Example

// getIncrement returns the growth increment based on the current file size.
func getIncrement() int {
    size := getSize() // current collection file size in bytes
    switch {
    case size > 536870912: // > 512 MB
        return 134217728 // grow by 128 MB
    case size > 134217728: // > 128 MB
        return 67108864 // grow by 64 MB
    default:
        return 33554432 // grow by 32 MB
    }
}

We can still look for better thresholds after proper testing, but this is just an example.
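
A nice property of this scheme is that growth overhead stays roughly proportional to collection size: small collections never over-allocate by more than 32 MB, while large collections extend in fewer, bigger steps.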

@HouzuoGuo
Owner

That sounds like a rather nice idea.

@HouzuoGuo
Owner

The next question may be more interesting: what shall we do with the hash table?

There are some difficulties with downsizing the hash table:

  • The algorithm is a classic static hash table (unfortunately, dynamic resizing is close to impossible)
  • The initial "head" buckets must be allocated upfront.
  • Performance worsens a lot if the initial hash table size is brought down.
  • Downsizing the hash table configuration is not feasible right now - it would break everyone's existing hash tables.

There seem to be two easy solutions:

  • Rewrite the hash table to use a better algorithm
  • Make the hash table parameters configurable (see the sketch below)

What do you think - any better ideas?
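
A sketch of what configurable hash table parameters might look like. The defaults mirror the figures mentioned earlier (2^14 keys, 100 entries per key), but the HashTableConfig type itself is hypothetical:

package main

import "fmt"

// HashTableConfig is a hypothetical set of tunables for the static hash table.
// Shrinking either field reduces the upfront allocation, at the cost of more
// bucket chaining (and therefore worse performance) as the collection grows.
type HashTableConfig struct {
    HashBits      uint // key bits; 14 gives 2^14 = 16384 head buckets
    EntriesPerKey int  // entries pre-allocated per key; currently 100
}

// InitialEntries returns the number of entries allocated upfront.
func (c HashTableConfig) InitialEntries() int {
    return (1 << c.HashBits) * c.EntriesPerKey
}

func main() {
    large := HashTableConfig{HashBits: 14, EntriesPerKey: 100}
    small := HashTableConfig{HashBits: 12, EntriesPerKey: 25}
    fmt.Println(large.InitialEntries(), small.InitialEntries()) // 1638400 102400
}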

@alexandrestein

I don't think this is a big deal.

In most real cases, data won't grow by more than a few MB per minute. And 1 MB of data per minute is a lot, even if you store logs or things like that (I exclude the cases where you store images or binary files).

And those whose data grows like that probably take care of setting up the database correctly. :-)

I think the best thing to do is to set a small growth size by default and let users configure the database properly if they have special needs (a server app that appends a lot, or millions of users adding content).

I may be wrong...

@HouzuoGuo
Owner

I think dynamically determining collection growth is a very good idea.

How about:

  • Dynamically determine collection growth
  • Make hash table size configurable (user has a choice of small/large)

@HouzuoGuo
Owner

See my comment in #23

What do you think?

@HouzuoGuo
Owner

Fixed in nextgen - the number of collection partitions is now configurable. A collection with one partition will use only 32 MB of disk storage in the beginning.
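
In other words, the initial footprint scales with the partition count. A one-line sketch using the 32 MB per-partition figure from the comment above:

// initialSizeMB estimates a nextgen collection's starting disk footprint,
// assuming 32 MB per partition as stated above.
func initialSizeMB(partitions int) int {
    return partitions * 32 // e.g. 1 partition -> 32 MB, 8 partitions -> 256 MB
}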

@agozie

agozie commented Apr 26, 2015

I have 11 collections, 256 MB each, with no data at all! Some collections will only hold a few entries. This is a major blow to my project.
Any thoughts?

@HouzuoGuo
Owner

@agozie sorry - the size of collection files depends on the number of CPUs on the system:
https://github.com/HouzuoGuo/tiedot/blob/master/db/db.go#L50

I intended for the initial size of a collection to depend on GOMAXPROCS, so the line must have been a mistake.

The collection file size could be reduced to 64 MB by replacing runtime.NumCPU() with 1.
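
A sketch of that change at the linked line (paraphrased for illustration, not a verbatim quote of db.go):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    // Before: one partition per CPU core, so the initial collection size
    // scales with the machine (4 cores x 64 MB would explain 256 MB per collection).
    numParts := runtime.NumCPU()
    fmt.Println("partitions (before):", numParts)

    // After: force a single partition, reducing each collection to 64 MB upfront.
    numParts = 1
    fmt.Println("partitions (after):", numParts)
}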

@agozie

agozie commented Apr 27, 2015

How will reducing runtime.NumCPU() to one affect performance? Thanks a lot.

@HouzuoGuo
Owner

Reducing it to one should retain approximately 30% of the performance in your scenario. Run the benchmark (./tiedot -mode=bench) to be sure.
