---
title: Data structure introduction
sidebar_label: Introduction
---
We have already seen data structures such as arrays. All data structures have two main aspects:

- the operations they support (search for an item, find the minimum, find the maximum, insert a new item, delete an existing item, etc.); we can think of this as the interface to the data structure
- the implementation of the data structure

Depending on what operations we need to perform on our data, a specific data structure may be more suitable than others.
Arrays are good for certain operations but bad for others. Here are some pros and cons of static arrays (arrays whose size is fixed).

Pros:

- Constant-time access given the index
- Consist purely of data; in contrast, linked lists must store both the data and a pointer for each entry
- Physical continuity (memory locality) between successive elements helps exploit the high-speed cache memory in modern CPUs

Cons:

- The size of the array must be known beforehand
- Resizing the array is expensive
- Insertion/deletion requires shifting subsequent elements
Dynamic arrays might start with a specific size and double the size once they run out of space. Although resizing is expensive, we hope that we won't do it too often (for $n$ insertions we resize only about $\lg n$ times, since the size doubles on each resize).

The apparent waste in this procedure involves the recopying of the old contents on each expansion. If half the elements move once, a quarter of the elements move twice, and so on, the total number of movements is

$$M = \sum_{i=1}^{\lg n} i \cdot \frac{n}{2^i} \le n \sum_{i=1}^{\infty} \frac{i}{2^i} = 2n$$

Thus each of the $n$ elements moves an average of only twice, and the total work of managing the dynamic array is of the same order as for a simple array, $\mathcal{O}(n)$.

Another disadvantage is wasted memory: right after doubling, almost half of the allocated array is unused.
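The doubling strategy above can be sketched in Python; this is a minimal illustration (the class and the `copies` counter are just for demonstration, not how production languages implement growth):

```python
class DynamicArray:
    """Fixed-capacity array that doubles when full, counting element copies."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.data = [None] * self.capacity
        self.copies = 0  # total element movements caused by resizing

    def append(self, item):
        if self.size == self.capacity:
            # Out of space: allocate a twice-as-large array and recopy.
            self.capacity *= 2
            new_data = [None] * self.capacity
            for i in range(self.size):
                new_data[i] = self.data[i]
                self.copies += 1
            self.data = new_data
        self.data[self.size] = item
        self.size += 1


arr = DynamicArray()
for x in range(1024):
    arr.append(x)
# Copies stay linear: 1 + 2 + 4 + ... + 512 = 1023 < 2 * 1024.
print(arr.copies)  # 1023
```

Running this confirms the analysis: 1024 appends cause only 1023 copy operations, i.e. fewer than two moves per element on average.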
Searching in a linked list can be done iteratively or recursively. Searching in a sorted array can be done by binary search in $\mathcal{O}(\log n)$.

Pros of linked lists:

- No overflow (the list grows as long as memory is available)
- Insertion and deletion are simpler than with static arrays
- With large records, moving pointers is much more efficient than moving the data itself
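Both linked-list search styles mentioned above can be sketched as follows (the `Node` class and function names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    next: "Optional[Node]" = None


def search_iterative(head, k):
    """Walk the list until the key is found; O(n)."""
    while head is not None:
        if head.key == k:
            return head
        head = head.next
    return None


def search_recursive(node, k):
    """The same search expressed recursively."""
    if node is None:
        return None
    if node.key == k:
        return node
    return search_recursive(node.next, k)


# Build the list 3 -> 1 -> 4 and search it.
head = Node(3, Node(1, Node(4)))
print(search_iterative(head, 4).key)  # 4 (found)
print(search_recursive(head, 2))      # None (not found)
```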
Stacks and queues are containers that store items and return them based on the order in which they arrived. Stack: last in, first out (LIFO). Queue: first in, first out (FIFO). Both can be implemented using arrays or linked lists.
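As a quick sketch, Python's built-in list and `collections.deque` give array-backed versions of both containers:

```python
from collections import deque

# Stack (LIFO): a Python list gives O(1) push/pop at the end.
stack = []
stack.append(1)
stack.append(2)
stack.append(3)
print(stack.pop())  # 3: last in, first out

# Queue (FIFO): deque gives O(1) at both ends (popping a list's
# front would be O(n), since all later elements must shift).
queue = deque()
queue.append(1)
queue.append(2)
queue.append(3)
print(queue.popleft())  # 1: first in, first out
```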
A dictionary supports the following operations:

- `Search(S, k)`: returns a pointer to an item `x` such that `key[x] = k`, or `nil` if no such element exists
- `Insert(S, x)`
- `Delete(S, x)`
- `Min(S)`, `Max(S)`: the item with the smallest/largest key
- `Predecessor(S, x)` / `Successor(S, x)`: the item whose key immediately precedes/follows the key of `x`

There are a variety of implementations of these dictionary operations, each of which yields different time complexities for the different operations.
Example: array-based sets using unsorted arrays:

- `Search(S, k)`: sequential search, $\mathcal{O}(n)$
- `Insert(S, x)`: place in the first empty spot (after the last element), $\mathcal{O}(1)$; we can use a dynamic array, so there is no need to worry about resizing
- `Delete(S, x)`: copy the $n$-th (last) item to the `x`-th spot, $\mathcal{O}(1)$
- `Min/Max`: sequential search, $\mathcal{O}(n)$
- `Successor/Predecessor`: sequential search, $\mathcal{O}(n)$ (the previous/next index is not the successor/predecessor)
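The bullets above can be sketched directly (the class name is illustrative; `search` returns an index so that `delete` can use the copy-the-last-item trick):

```python
class UnsortedArraySet:
    """Dictionary operations on an unsorted (dynamic) array."""

    def __init__(self):
        self.items = []

    def search(self, k):
        """Sequential search, O(n); returns the index or None."""
        for i, item in enumerate(self.items):
            if item == k:
                return i
        return None

    def insert(self, x):
        """O(1): append in the first empty spot after the last element."""
        self.items.append(x)

    def delete(self, i):
        """O(1): overwrite slot i with the last item, then shrink by one."""
        self.items[i] = self.items[-1]
        self.items.pop()

    def minimum(self):
        """Sequential search, O(n)."""
        return min(self.items)


s = UnsortedArraySet()
for x in (5, 2, 9, 1):
    s.insert(x)
s.delete(s.search(2))  # remove 2 by copying the last item (1) into its slot
print(s.items)         # [5, 1, 9]
print(s.minimum())     # 1
```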
If we use a sorted array instead, searching can be done by binary search and the complexity reduces to $\mathcal{O}(\log n)$. `Insert/Delete` become costly, $\mathcal{O}(n)$, as we need to keep the array sorted. `Min/Max` is easy: the first/last element, $\mathcal{O}(1)$. `Successor/Predecessor` is also $\mathcal{O}(1)$: simply the next/previous index.
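A quick sketch of the sorted-array trade-offs using Python's standard `bisect` module:

```python
from bisect import bisect_left, insort

a = [2, 5, 9, 14]

# Search: binary search, O(log n).
i = bisect_left(a, 9)
found = i < len(a) and a[i] == 9
print(found)  # True

# Insert: finding the spot is O(log n), but shifting the tail
# to keep the array sorted makes the whole operation O(n).
insort(a, 7)
print(a)  # [2, 5, 7, 9, 14]

# Min/Max and Successor/Predecessor are just index arithmetic, O(1).
print(a[0], a[-1])         # 2 14
j = a.index(7)
print(a[j - 1], a[j + 1])  # predecessor 5, successor 9
```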
We can implement a dictionary using a singly/doubly linked, sorted/unsorted linked list:

| Dictionary operation | Singly unsorted | Singly sorted | Doubly unsorted | Doubly sorted |
|---|---|---|---|---|
| `Search(L, k)` | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| `Insert(L, x)` | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ |
| `Delete(L, x)` | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| `Successor(L, x)` | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |
| `Predecessor(L, x)` | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |
| `Min(L)` | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |
| `Max(L)` | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ |

In a singly linked list, deleting `x` costs $\mathcal{O}(n)$ because we must first find the predecessor of `x` to relink around it; `Max` on a sorted singly linked list is $\mathcal{O}(1)$ only if we maintain a separate tail pointer.
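The $\mathcal{O}(1)$ deletion in the doubly linked columns is worth spelling out: given a pointer to the node, we can relink its neighbors directly without searching for the predecessor. A minimal sketch (class and function names are illustrative):

```python
class DNode:
    """Node of a doubly linked list."""

    def __init__(self, key):
        self.key = key
        self.prev = None
        self.next = None


def delete(node):
    """O(1) unlink: no search needed, since node.prev is at hand."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev


# Build 1 <-> 2 <-> 3 and delete the middle node.
a, b, c = DNode(1), DNode(2), DNode(3)
a.next, b.prev = b, a
b.next, c.prev = c, b
delete(b)
print(a.next.key, c.prev.key)  # 3 1: a and c are now linked directly
```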
Hash tables are a very practical way to maintain a dictionary. The idea is simply that looking an item up in an array takes $\mathcal{O}(1)$ once we know its index.

A hash function is a mathematical function that maps keys to integers. Since we are mapping a larger set to a smaller set, collisions might happen. If the keys are uniformly distributed, then each bucket should contain very few keys.

A good hash function has the following properties:

- It is cheap to evaluate
- It tends to use all positions from $0, 1, \ldots, M-1$ with uniform frequency
Here is an example of a hash function: say we want to map text data to integers. The first step is usually to map the key to a big integer; recall that ASCII characters have codes up to $128$ ($2^7$). Treating each character as a digit of a base-$\alpha$ number (with $\alpha = 128$ for ASCII), a string $S$ maps to

$$H(S) = \sum_{i=0}^{|S|-1} \alpha^{|S|-(i+1)} \times \mathrm{char}(s_i)$$

Then we use modular arithmetic to map these large integers to smaller ones between $0$ and $M-1$, where $M$ is the size of our hash table:

$$h(S) = H(S) \bmod M$$

where $M$ is best chosen as a large prime not too close to a power of two.
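The scheme above can be sketched in a few lines; the table size `m` here is an arbitrary prime picked for illustration, and we take the modulus at every step so the running value never grows large:

```python
def string_hash(s, m=1000003, alpha=128):
    """Map a string to a bucket index in [0, m): treat the characters as
    digits of a base-alpha number, then reduce mod the (prime) table size."""
    h = 0
    for ch in s:
        h = (h * alpha + ord(ch)) % m  # mod each step keeps h small
    return h


print(string_hash("hash"))
print(string_hash("hash") == string_hash("hash"))  # deterministic: True
print(0 <= string_hash("collision") < 1000003)     # valid index: True
```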
No matter how good our hash function is, we have to be prepared for collisions, since we are mapping a larger space into a smaller one. Recall the birthday paradox: in a party of about 23 people, there is a 50% probability that two people share the same birthday, and with about 56 people the probability is very close to 1. If we have too many collisions, operations on our data structure become less efficient. For example, lookup takes $\mathcal{O}(1 + x)$, where $x$ is the number of collisions in a particular bin; in the worst case, when every key lands in the same bin, it is $\mathcal{O}(n)$.
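One standard way to handle collisions is chaining: each bucket holds a list of the keys that hash to it, and a lookup scans only that one bucket. A minimal sketch (class name and the tiny table size are illustrative, chosen to force collisions):

```python
class ChainedHashTable:
    """Hash table with chaining: each bucket is a list of colliding keys."""

    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def insert(self, key):
        self.buckets[hash(key) % self.m].append(key)

    def search(self, key):
        # Cost is O(1 + x): one hash plus a scan of the x keys in the bucket.
        return key in self.buckets[hash(key) % self.m]


t = ChainedHashTable()
for k in range(20):   # 20 keys into 8 buckets: collisions are guaranteed
    t.insert(k)
print(t.search(7))    # True
print(t.search(99))   # False
print(max(len(b) for b in t.buckets))  # longest chain = worst-case scan
```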
A priority queue supports the following three operations:

- `Insert(Q, x)`: given an item `x` with key `k`, insert it into the priority queue `Q`
- `Min/Max(Q)`: return a pointer to the item with the smallest/largest key
- `Delete-Min/Max(Q)`: remove the item with the smallest/largest key

We could use balanced binary search trees for priority queues, and all three operations can be done in $\mathcal{O}(\log n)$.
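In practice a binary heap also achieves these bounds (logarithmic `Insert` and `Delete-Min`, constant-time `Min`), and Python ships one in the standard library:

```python
import heapq

# A binary heap: O(log n) Insert and Delete-Min, O(1) Min.
q = []
for key in (5, 1, 4, 2):
    heapq.heappush(q, key)  # Insert(Q, x)

print(q[0])              # Min(Q): 1, without removing it
print(heapq.heappop(q))  # Delete-Min(Q): removes and returns 1
print(heapq.heappop(q))  # next smallest: 2
```

`heapq` implements a min-heap; for `Max`/`Delete-Max`, the usual trick is to push negated keys.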