#### Hashing:
Hashing is the process of generating a fixed-size output from a variable-size input using mathematical functions known as hash functions. In a data structure, this output determines the index (location) at which an item is stored.

+ Components of Hashing
    - Key
    - Hash Function
    - Hash Table

--------------

+ Types of hashing:
    - Trivial
    - Double
    - Chained
    - Closed

--------------

Eg: Log File: 
1) Method 1:
    - Implement dictionary
        - Unordered list
    - Insertion: O(1) & Deletion: O(n) (search and delete)

2) Method 2:
    - Implement Bucket Array
    - The range of keys must be known in advance
    - Element k is stored in Array[k], eg: k=5 => Array[5]
    - Keys must be integers
    - Insertion: O(1) & Deletion: O(1)

3) Method 3:
    - Implement Hash Table
        - Hash Function: input key (eg: a string) => hash code (an integer) => compression map => index into the bucket array
        - Bucket Array
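
The three methods above can be sketched roughly as follows. The names `UnorderedListLog`, `BucketArrayLog` and `hash_index` are hypothetical, and the character-sum hash code is only a placeholder assumption, not a recommended hash function.

```python
class UnorderedListLog:
    """Method 1: dictionary backed by an unordered list."""

    def __init__(self):
        self.items = []

    def insert(self, key):
        self.items.append(key)            # O(1): append at the end

    def delete(self, key):
        self.items.remove(key)            # O(n): linear search, then delete


class BucketArrayLog:
    """Method 2: bucket array indexed directly by an integer key."""

    def __init__(self, key_range):
        self.buckets = [None] * key_range  # key range must be known up front

    def insert(self, key, value):
        self.buckets[key] = value         # O(1): key k lives in buckets[k]

    def delete(self, key):
        self.buckets[key] = None          # O(1)


def hash_index(key, n_buckets):
    """Method 3: hash code followed by a compression map."""
    code = sum(ord(ch) for ch in key)     # hash code: string -> integer
    return code % n_buckets               # compression map: integer -> [0, N-1]


log = BucketArrayLog(10)
log.insert(5, "entry for key 5")          # stored in buckets[5]
slot = hash_index("abc", 7)               # (97 + 98 + 99) % 7 == 0
```

Method 3 keeps the O(1) indexing of the bucket array while dropping the requirement that keys be small integers.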


#### Hash Codes:
A good hash code:
- minimizes collisions
- h(key1) == h(key2), given key1 == key2 (always required)
- h(key1) != h(key2), given key1 != key2 (the ideal; it cannot always be achieved, which is why collisions occur)


+ Memory Address:
    - h(key) == memory address of key, eg: h("abc") == 5000 == memory address of "abc".
    - Disadvantage:
        - if exactly the same data is stored at 2 different addresses, eg: "abc" at 5000 & at 5010, the two copies get different h(key) outputs, so equal keys do not hash equally.
+ Integer Cast:
    - converts keys to integer using type casting
    - Disadvantage:
        - information is lost and collisions occur, eg: casting float -> int converts 3.6, 3.8 and 3.2 all to 3.
+ Component Sum:
    - eg: 64 bits key => split into two 32 bits components, add them, and keep a 32 bits hash code (break the key into fixed-size tokens)
    - if the sum needs more than 32 bits, ignore the overflow to bring the output back to 32 bits
    - eg: string example,
        - h("post") -> ascii(p) + ascii(o) + ascii(s) + ascii(t)
        - Disadvantage: h("post") == h("stop") == h("pots"), since all anagrams give the same value

+ Polynomial accumulation:
    - eg: string example,
        - h("post") -> ascii(p)*(a^3) + ascii(o)*(a^2) + ascii(s)*(a^1) + ascii(t)*(a^0)
        - h(key,a)
        - Advantage: values like "stop" & "post" will have different hash values.
        - computed with Horner's rule, sum = (sum*a) + arr[i] for each index i of arr (below, p, o, s, t stand for their ASCII codes):
            - i=0,sum=0, hash = p
            - i=1,sum=p, hash= pa + o
            - i=2,sum=pa + o, hash=pa^2 + oa + s
            - i=3,sum=pa^2 + oa + s, hash=pa^3 + oa^2 + sa + t
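
A minimal sketch of the component-sum and polynomial-accumulation hash codes for strings. The function names and the constant `a=33` are illustrative assumptions; the 32-bit mask stands in for "ignore the overflow".

```python
MASK_32 = 0xFFFFFFFF  # keep only the low 32 bits, i.e. ignore overflow


def component_sum(key):
    """Add the character codes of the key, truncated to 32 bits."""
    total = 0
    for ch in key:
        total = (total + ord(ch)) & MASK_32
    return total


def polynomial_hash(key, a=33):
    """Polynomial accumulation via Horner's rule: sum = sum*a + code."""
    total = 0
    for ch in key:
        total = (total * a + ord(ch)) & MASK_32
    return total


# Anagrams collide under the component sum: 112 + 111 + 115 + 116 == 454
same = component_sum("post") == component_sum("stop") == component_sum("pots")

# Position-dependent weights separate them under polynomial accumulation
different = polynomial_hash("post") != polynomial_hash("stop")
```

Because each character is multiplied by a different power of `a`, reordering the characters changes the result, which the plain sum cannot detect.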

#### Compression Map:
- converts an integer hash code of any magnitude (negative or positive) to the range [0 to N-1]
- N => size of the bucket array

1) Division:
    - cmp(y) => y % N
    - collision will still occur.

2) Multiply Add Divide:
    - cmp(y) => ((ay+b)%p)%N
        - N -> size of the bucket array (preferably a prime integer)
        - a & b -> non zero constants
        - p -> prime integer > N
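
Both compression maps can be sketched as below. The constants `a=3`, `b=7` and `p=109` are arbitrary illustrative choices (with `p` taken as a prime larger than `N`), not values from these notes.

```python
def compress_division(y, n):
    """Division method: y % N."""
    return y % n


def compress_mad(y, n, a=3, b=7, p=109):
    """Multiply-Add-Divide method: ((a*y + b) % p) % N."""
    return ((a * y + b) % p) % n


N = 11
keys = [11, 110, 220]                                # all multiples of N

by_division = [compress_division(y, N) for y in keys]  # [0, 0, 0]
by_mad = [compress_mad(y, N) for y in keys]            # [7, 10, 2]
```

Keys that share a common pattern (here, multiples of `N`) pile into one bucket under plain division, while the extra multiply-add step spreads them out. Collisions remain possible with either method.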

#### Rehashing

- Load Factor (λ = n/N):
    - n => no of entries
    - N => no of buckets
        - n/N < 1, good condition:
            - continue with hashing operations
        - n/N >= 1, not good condition:
            - Rehashing:
                - increase the number of buckets from N to N' => 2N+1
                - re-insert every key x, moving it from bucket x%N to bucket x%N'
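
A rough sketch of the rehashing step, assuming integer keys, separate chaining (each bucket is a list) and the division compression map. The helper names `load_factor` and `rehash` are hypothetical.

```python
def load_factor(n_entries, n_buckets):
    """Load factor = n / N."""
    return n_entries / n_buckets


def rehash(buckets):
    """Move every key from N buckets into N' = 2N + 1 buckets."""
    new_size = 2 * len(buckets) + 1
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for x in chain:
            new_buckets[x % new_size].append(x)   # x % N  ->  x % N'
    return new_buckets


buckets = [[] for _ in range(3)]                  # N = 3
for x in (4, 7, 10):
    buckets[x % 3].append(x)                      # all three land in bucket 1

n = sum(len(chain) for chain in buckets)
if load_factor(n, len(buckets)) >= 1:             # 3 / 3 >= 1, so rehash
    buckets = rehash(buckets)                     # N' = 7
```

After rehashing, the keys 4, 7 and 10 sit in buckets 4, 0 and 3 respectively, so the collisions from the smaller table disappear.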

#### Bloom Filter:
- Two phases: Phase 1 (inserting) & Phase 2 (querying)
- Answers the question "does an element belong to a set?". It can prove that a value is not in the set, but it cannot prove that a value is present, because queries can return FALSE POSITIVES.
- Size of the Bloom filter (number of bits): M

Eg: M=5
- h1(x) => x % 5
- h2(x) => (2x+3) % 5

Phase 1 (insertion):
- x -> 9,  h1(9) = 4 & h2(9) = 1,   => | 0 | 1 | 0 | 0 | 1 | 
- x -> 11,  h1(11) = 1 & h2(11) = 0,  => | 1 | 1 | 0 | 0 | 1 |, if a position is already 1 (overlap), leave it as 1.

Phase 2 (querying):
- x -> 15, h1(15) = 0 & h2(15) = 3,  => | 1 | 1 | 0 | 0 | 1 |, pos 3 is 0, so 15 is definitely not present in the set (a single 0 among the checked positions is enough)
- x -> 16, h1(16) = 1 & h2(16) = 0, => | 1 | 1 | 0 | 0 | 1 |, pos 1 & 0 are both 1, so 16 is reported as probably present even though it was never inserted (False Positive)
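
The worked example above can be reproduced with a short sketch. The function names `bloom_insert` and `might_contain` are hypothetical; `M`, `h1` and `h2` follow the example.

```python
M = 5
bits = [0] * M


def h1(x):
    return x % M


def h2(x):
    return (2 * x + 3) % M


def bloom_insert(x):
    """Phase 1: set both positions to 1 (an existing 1 stays 1)."""
    bits[h1(x)] = 1
    bits[h2(x)] = 1


def might_contain(x):
    """Phase 2: False means definitely absent, True means probably present."""
    return bits[h1(x)] == 1 and bits[h2(x)] == 1


bloom_insert(9)     # sets positions 4 and 1
bloom_insert(11)    # sets positions 1 and 0

absent = might_contain(15)          # position 3 is 0 -> False
false_positive = might_contain(16)  # positions 1 and 0 are 1 -> True
```

The query for 16 returns `True` although 16 was never inserted, which is exactly the false positive described in Phase 2.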


#### Trivial Hashing / Index Mapping
- This approach commonly employs the identity hash function, which maps any input to itself. The key is used directly as the index into the hash table, and the corresponding value is stored at that position.

    - h(1) = 1 index of hash table
    - h(2) = 2 index of hash table

Negative keys can be allowed by indexing with abs(key) while hashing; a separate slot or flag per sign is then needed so that k and -k do not collide.
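
A minimal sketch of index mapping that also accepts negative keys. The bound `MAX_KEY`, the two-column layout (one column per sign) and the function names are assumptions made for illustration.

```python
MAX_KEY = 100  # assumed upper bound on abs(key)

# table[k][0] marks the key +k, table[k][1] marks the key -k
table = [[False, False] for _ in range(MAX_KEY + 1)]


def insert_key(key):
    """Identity hash: abs(key) is the row, the sign picks the column."""
    table[abs(key)][1 if key < 0 else 0] = True


def contains_key(key):
    return table[abs(key)][1 if key < 0 else 0]


insert_key(7)
insert_key(-5)
```

With this layout `contains_key(-5)` is true while `contains_key(5)` stays false, so using abs() does not merge a key with its negation.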

#### Collisions
-   Since a hash function maps a key that may be a big integer or a string to a small number, two keys can produce the same value. The situation where a newly inserted key maps to an already occupied slot in the hash table is called a collision, and it must be handled using a collision handling technique.

---------------

Two methods to handle collisions:
+ Separate Chaining 
+ Open Addressing



| Separate Chaining | Open Addressing |
| --- | --- |
| Chaining is simpler to implement. | Open addressing requires more computation. |
| In chaining, the hash table never fills up; we can always add more elements to a chain. | In open addressing, the table may become full. |
| Chaining is less sensitive to the hash function or load factors. | Open addressing requires extra care to avoid clustering and to control the load factor. |
| Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted. | Open addressing is used when the frequency and number of keys is known. |
| Cache performance of chaining is not good, as keys are stored using linked lists. | Open addressing provides better cache performance, as everything is stored in the same table. |
| Wastage of space (some parts of the hash table in chaining are never used). | In open addressing, a slot can be used even if an input doesn’t map to it. |
| Chaining uses extra space for links. | No links in open addressing. |
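
The two strategies in the table can be contrasted with a small sketch. It assumes integer keys, the division compression map, Python lists in place of linked lists, and linear probing as the open-addressing scheme (one of several possible probe sequences). Deletion is omitted. The class names are hypothetical.

```python
class ChainedTable:
    """Separate chaining: each bucket holds a list of colliding keys."""

    def __init__(self, n_buckets):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key):
        chain = self.buckets[key % len(self.buckets)]
        if key not in chain:
            chain.append(key)             # the chain can always grow

    def contains(self, key):
        return key in self.buckets[key % len(self.buckets)]


class LinearProbingTable:
    """Open addressing: colliding keys move to the next free slot."""

    def __init__(self, n_slots):
        self.slots = [None] * n_slots

    def insert(self, key):
        n = len(self.slots)
        for step in range(n):
            i = (key % n + step) % n      # probe home slot, then the next ones
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i                  # slot where the key ended up
        raise OverflowError("table is full")

    def contains(self, key):
        n = len(self.slots)
        for step in range(n):
            i = (key % n + step) % n
            if self.slots[i] is None:
                return False              # empty slot ends the probe sequence
            if self.slots[i] == key:
                return True
        return False


chained = ChainedTable(5)
for k in (3, 8, 13):
    chained.insert(k)                     # all three share bucket 3

probing = LinearProbingTable(5)
placed = [probing.insert(k) for k in (3, 8, 13)]   # slots 3, 4, 0
```

The chained table keeps all three keys in one bucket, while linear probing spills 8 and 13 into slots that their hash values do not map to, and raises once every slot is taken.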