Linear hashed files and multi-attribute hashing are two techniques that can be used together to produce hashed files that grow as needed and which allow all attributes to contribute to the hash value of each tuple.
Creates MALH files by accepting four command line arguments:
the name of the relation the number of attributes the initial number of data pages (rounded up to nearest 2n) the multi-attribute hashing choice vector
This gives you storage for one relation/table, and is analogous to making an SQL data definition like:
create table R ( a1 text, a2 text, ... an text );
Note that, internally, attributes are indexed 0..n-1 rather than 1..n.
The following example of using create makes a table called abc with 4 attributes and 8 initial data pages:
$ ./create abc 4 6 "0,0:0,1:1,0:1,1:2,0:3,0"
The choice vector (fourth argument above) indicates that
bit 0 from attribute 0 produces bit 0 of the MA hash value bit 1 from attribute 0 produces bit 1 of the MA hash value bit 0 from attribute 1 produces bit 2 of the MA hash value bit 1 from attribute 1 produces bit 3 of the MA hash value bit 0 from attribute 2 produces bit 4 of the MA hash value bit 0 from attribute 3 produces bit 5 of the MA hash value
The above choice vector only specifies 6 bits of the combined hash, but combined hashes contain 32 bits. The remaining 26 entries in the choice vector are automatically generated by cycling through the attributes and taking bits from the high-order hash bits from each of those attributes.
Reads tuples, one per line, from standard input and inserts them into the relation specified on the command line. Tuples all take the form val1,val2,...,valn. The values can be any sequence of characters except ',' and '?'.
The bucket where the tuple is placed is determined by the appropriate number of bits of the combined hash value. If the relation has 2^d data pages, then d bits are used. If the specified data page is full, then the tuple is inserted into an overflow page of that data page.
Takes a "query tuple" on the command line, and finds all tuples in either the data pages or overflow pages that match the query. Queries take the form val1,val2,...,valn, where some of the vali can be '?' (without the quotes). Such "attributes" represent wild-cards and can match any value in the corresponding attribute position. Some example query tuples, and their interpretation are given below.
?,?,? # matches any tuple in the relation
10,?,? # matches any tuple with 10 as the value of attribute 0
?,abc,? # matches any tuple with abc as the value of attribute 1
10,abc,? # matches any tuple with 10 and abc as the values of attributes 0 and 1
R.info containing global information such as
a count of the number of attributes the depth of main data file (d for linear hashing) the page index of the split pointer (sp for linear hashing) a count of the number of main data pages the total number of tuples (in both data and overflow pages) the choice vector (cv for multi-attribute hashing)
R.data containing data pages, where each data page contains
offset of start of free space overflow page index (or NO_PAGE if none) a count of the number of tuples in that page the tuples (as comma-separated C strings)
R.ovflow containing overflow pages, which have the same structure as data pages
When a MALH relation is first created, it is set to contain a 2^n pages, with depth d=n and split pointer sp=0. The overflow file is initially empty. The following diagram shows an MALH file R with initial state with n=2.
Once you have the executables, you could build a sample database as follows:
$ ./create R 3 4 "0,0:0,1:0,2:1,0:1,1:2,0"
cv[0] is (0,0)
cv[1] is (0,1)
cv[2] is (0,2)
cv[3] is (1,0)
cv[4] is (1,1)
cv[5] is (2,0)
cv[6] is (0,31)
cv[7] is (1,31)
cv[8] is (2,31)
...
cv[30] is (0,23)
cv[31] is (1,23)
This command creates a new table called R with 3 attributes. It will be stored in files called R.info, R.data and R.ovflow. The data file initially has 4 pages (so depth=2). The overflow is initially empty. The lower-order 6 bits of the choice vector are given on the command line; the remaining bits are auto-generated. Given the file size (4 pages), only two of the hash bits are actually needed.
You could check the status of the files for table R via the stats command:
$ ./stats R
Global Info:
#attrs:3 #pages:4 #tuples:0 d:2 sp:0
Choice vector
0,0:0,1:0,2:1,0:1,1:2,0:0,31:1,31:2,31:0,30:1,30:2,30:0,29:1,29:2,29:0,28:1,28:2,28:
0,27:1,27:2,27:0,26:1,26:2,26:0,25:1,25:2,25:0,24:1,24:2,24:0,23:1,23
Bucket Info:
# Info on pages in bucket
(pageID,#tuples,freebytes,ovflow)
0 (d0,0,1012,-1)
1 (d1,0,1012,-1)
2 (d2,0,1012,-1)
3 (d3,0,1012,-1)
Since the file is size 2^d, the split pointer sp = 0. The rest of the global information should be self explanatory, as should choice vector. The bucket info shows a quadruple for each page; since there are no overflow pages (yet), only data pages appear. The pageID value in each quad consists of the character 'd' (indicating a data file), plus the page index. Each page is 1024 bytes long, which includes a small header, plus 1012 bytes of free space for tuples. There are currently zero tuples in any of the pages. The overflow page IDs are all -1 (for NO_PAGE) to indicate that no data page has an overflow page.
You can insert data into the table using the insert command This command reads tuple from its standard input and inserts them into the named table. For example, the command below inserts a single tuple into the R MALH files:
$ ./insert R
100,abc,xyz
hash(100) = 00011100 00101000 10100111 11101100
Ctl-D
The insert command prints the hash value for the tuple (based on just the first attribute), and then inserts it into the file. Since the table is currently empty, this tuple will be inserted into page 0.
Typing many individual tuples is tedious, so we have provided a command, gendata, which can generate tuples appropriate for a given table. It takes four comand line arguments, only two of which are compulsory: the number of tuples to generate, and the number of attributes in each tuple. a sample usage:
$ ./gendata 5 3
1,sandwich,pocket
2,circus,spectrum
3,snail,adult
4,crystal,fungus
5,bowl,surveyor
This generates five tuples, each with three attributes. The first attribute is a unique ID value; the other attributes are random words. You can modify the starting ID value and the seed for the random number generator from the command line.
You could use gendata to generate large numbers of tuples, and insert them as follows:
$ ./gendata 250 3 101 | ./insert R
hash(101) = 11110100 01100100 11010000 00110000
hash(102) = 00100101 10100110 10100001 11100100
hash(103) = 10110011 11001111 10100111 00001000
hash(104) = 00001100 11100000 10000011 11000000
...
hash(348) = 11110000 01011110 01000010 00101001
hash(349) = 01101101 01100101 00011111 10100111
hash(350) = 10011011 01100101 01111001 11001000
This will insert 250 tuples into the table, with ID values starting at 101. You can check the final state of the database using the stats command. It should look something like:
$ ./stats R
Global Info:
#attrs:3 #pages:4 #tuples:251 d:2 sp:0
Choice vector
0,0:0,1:0,2:1,0:1,1:2,0:0,31:1,31:2,31:0,30:1,30:2,30:0,29:1,29:2,29:0,28:1,28:2,28:
0,27:1,27:2,27:0,26:1,26:2,26:0,25:1,25:2,25:0,24:1,24:2,24:0,23:1,23
Bucket Info:
# Info on pages in bucket
(pageID,#tuples,freebytes,ovflow)
[ 0] (d0,56,4,0) -> (ov0,15,737,-1)
[ 1] (d1,57,2,3) -> (ov3,2,981,-1)
[ 2] (d2,59,1,2) -> (ov2,2,976,-1)
[ 3] (d3,54,7,1) -> (ov1,6,905,-1)
This shows that each data page has one overflow page, and that each data page has roughly the same number of tuples. The bucket starting at data page 0 has a few more tuples than th other buckets, because it has more tuples (15) in the overflow page. Note that page IDs in the overflow pages are distinguished by starting with "ov". Note also that e.g. the data page at position 3 in the data file has an overflow page at position 1 in the overflow file.
You could then use the select command to search for tuples using a command like:
$ ./select R 101,?,?