
Wouldn't storing the index into π be as big (or bigger) than the original data? #23

Open
bradparks opened this issue Jul 12, 2014 · 9 comments

@bradparks

commented Jul 12, 2014

Hey! So say I find my data at index x.

Wouldn't the value of x typically be bigger than the data itself, and therefore cost just as much to store?

For example, say the data I wanted to store was "1234567890" and it doesn't show up until index 1838448384248234818.

Then storing that index would cost more than storing the data itself... I can see a case for some large numbers that happen to be found at index locations shorter than the data itself, but how would this work otherwise?
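
A rough way to see why this is typical, as a back-of-envelope sketch (it assumes the digits of π behave like uniformly random digits, which is widely believed but not proven):

d = 10                            # length of the data, in decimal digits
expected_offset = 10 ** d         # a given d-digit string first shows up around here on average
print(len(str(expected_offset)))  # 11 -- the index needs about as many digits as the data itself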

@bradparks changed the title from "Wouldn't storing the index into π be as big (or bigger) as the original # itself?" to "Wouldn't storing the index into π be as big (or bigger) than the original data?" on Jul 12, 2014

@georgy7

commented Jul 14, 2014

I supposed the author of pifs had already solved this problem (through an additional bit, for example). But really, how about some statistics demonstrating pifs's viability? I mean its ability to compress data.

@eduard-sukharev

commented Jul 14, 2014

Prove me wrong, but the sole purpose of the project is to demonstrate the possibility, not to actually get usable or sane results.
pifs is unusable and in no way a viable project; it has no real value apart from academic use, I guess. This project is actually more of a joke than a real-life solution. Although it works in some way, the issue you've raised is quite obvious and is a show-stopper.
Increasing the data chunk length leads to increased offset lengths, which tend to take more memory than they save. That said, this project might be more usable as a cryptographic solution of some kind.

@ddaws

commented Jul 14, 2014

It cannot consequently be stated that:

Increasing the data chunk length leads to increased offset lengths

This is not always the case. You are just as likely to find a large data chunk in the first several digits of pi as you are anywhere else in pi.

@georgy7

commented Jul 14, 2014

You are just as likely to find a large data chunk in the first several digits of pi as you are anywhere in pi.

That's not true. If we have a fixed-size integer cell for a pi offset, it can only point to a limited number of places in pi. In that part of pi there is a large number of small chunks and a smaller number of bigger chunks. And even if all possible small chunks appear there, not every small chunk has every possible continuation.
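
A rough pigeonhole check of this point, sketched under the assumption of a 64-bit offset cell (the cell size is an assumption for illustration, not something pifs specifies):

max_positions = 2 ** 64              # distinct offsets a 64-bit cell can address
chunk_len = 20                       # a modest chunk length, in decimal digits
possible_chunks = 10 ** chunk_len    # every possible 20-digit sequence
# 2**64 is about 1.8e19, so fewer than one in five of all 20-digit chunks can
# even start within reach of a 64-bit offset, and the gap widens for longer chunks.
print(max_positions < possible_chunks)  # True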

@anubisthejackle

commented Jul 14, 2014

It's very true that it's mathematically impossible for a single file stored using this method to use less memory than it does now. But what about a cloud solution? Storing individual bits this way, sort of like rainbow tables, and referencing individual bits, or even long strings of bits, which could then be compiled by reference.

Granted, this may or may not have the same speed deficiencies that the current code has, but I believe that, storage-wise, storing this metadata would be less of a burden.

@georgy7

commented Jul 14, 2014

How about this algorithm: when you know the exact pi offset B (a bignum), you create its estimated value D (as a long exponent of some big base), so that you can convert D back to a bignum and get D' < B.

Then you take the first few bytes of the original sequence and check whether the segment from D' to B contains this little opening subsequence. If it does, you slightly lengthen the opening subsequence and try again.

Finally, you store D (a 64-bit long), the length of the original sequence, and the initial subsequence of the original sequence. So in the worst case, Result = Input + 64 bits + ? bits for the length.
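
A minimal sketch of this idea, under some simplifying assumptions: the digits of pi are already available as a string pi_str, the exact offset B is known, and the "estimated value" is taken to be B rounded down to a short mantissa rather than a base/exponent pair. The helper names are made up for illustration.

def approximate_offset(B, mantissa_bits=52):
    # Round B down so it fits in a short mantissa; this guarantees D' <= B.
    if B.bit_length() <= mantissa_bits:
        return B
    shift = B.bit_length() - mantissa_bits
    return (B >> shift) << shift

def encode(pi_str, data, B):
    # Store the approximate offset D', the data length, and the shortest
    # prefix of the data whose first occurrence at or after D' already
    # continues as the full data. By n = len(data) this always succeeds,
    # because the data itself occurs at B >= D'.
    D = approximate_offset(B)
    for n in range(1, len(data) + 1):
        prefix = data[:n]
        pos = pi_str.find(prefix, D)
        if pi_str[pos:pos + len(data)] == data:
            return D, len(data), prefix

def decode(pi_str, D, length, prefix):
    # Recover the data by locating the stored prefix at or after D'.
    pos = pi_str.find(prefix, D)
    return pi_str[pos:pos + length]

In the worst case the stored prefix is the whole sequence, which matches the bound above: Result = Input + the bits for D + the bits for the length.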

@ddaws

commented Jul 14, 2014

@georgy7 I don't think there are fewer large chunks within a range of pi than there are small chunks.

Consider having n sequential digits of pi and wanting to determine the number of unique sequential subsets of size x. At position 0 the first subset runs from position 0 up to position x, and at position 1 the second runs from position 1 up to position x + 1. Thus at position i the subset runs from position i up to position x + i. Subsets may start anywhere from position 0 through position n - x.

Thus, when looking for chunks in the first 10000 digits of pi, a chunk size of:

  • 16 results in 9984 chunks.
  • 32 results in 9968 chunks.

So doubling the chunk size barely reduces the number of potential chunks.

Next, consider uniqueness. Larger chunks are more likely to be unique, and for this reason I believe there will likely be the same number of, or more, unique chunks of a larger size than of a smaller one.

Experiment :

# Read the digits of pi from a local text file.
with open('PI25K_DP.TXT', 'r') as f:
    pi_str = f.read().rstrip()

print(pi_str)
print('digits : {0}'.format(len(pi_str)))


def count_unique_chunks(chunk_size):
    # Slide a window of chunk_size digits over pi and collect the distinct chunks.
    chunks = set()
    for i in range(len(pi_str) - chunk_size):
        chunks.add(pi_str[i:i + chunk_size])

    num_unique_chunks = len(chunks)
    print('{0} | {1}'.format(chunk_size, num_unique_chunks))
    return num_unique_chunks


for chunk_size in range(64):
    count_unique_chunks(chunk_size)

You may find PI25K_DP.TXT here.

Results :

Chunk Size | Number of Unique Chunks
0 1
1 10
2 100
3 1000
4 9169
5 22170
6 24690
7 24957
8 24987
9 24991
10 24990
11 24989
12 24988
13 24987
14 24986
15 24985
16 24984
17 24983
18 24982
19 24981
20 24980
21 24979
22 24978
23 24977
24 24976
25 24975
26 24974
27 24973
28 24972
29 24971
30 24970
31 24969
32 24968
33 24967
34 24966
35 24965
36 24964
37 24963
38 24962
39 24961
40 24960
41 24959
42 24958
43 24957
44 24956
45 24955
46 24954
47 24953
48 24952
49 24951
50 24950
51 24949
52 24948
53 24947
54 24946
55 24945
56 24944
57 24943
58 24942
59 24941
60 24940
61 24939
62 24938
63 24937
@georgy7

commented Jul 14, 2014

@dreid93
25K digits of pi contain 24937 chunks of length 63 decimal digits.
But how many chunks of this length can exist? It's 10^63.
So the probability that the first 25K digits of pi contain a given random 63-digit sequence is 24937 / 10^63.
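
To put that in storage terms, a back-of-envelope sketch (not from the thread; it only compares information content, assuming pi behaves like a random digit stream):

import math

# You would expect to scan on the order of 10**63 digits of pi before a given
# 63-digit chunk shows up, so the offset is itself a number of that magnitude.
offset_bits = math.ceil(math.log2(10 ** 63))   # 210 bits to store such an offset
chunk_bits = math.ceil(63 * math.log2(10))     # 210 bits of information in the chunk
print(offset_bits, chunk_bits)                 # the index costs as much as the data it points to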

@kenorb

commented May 12, 2015
