In [6]:
%%html
<style>h1{text-align:center;}h1{text-transform:none;}.rendered_html h4{color:#17b6eb;font-size: 1.6em;}img[alt=dia1]{width:35%;}img[alt=book]{width:20%;font-size: 3em;}img[alt=dia2]{width:50%;}.author{font-size:8px;}</style>

# Lecture 8: Elementary Data Structures: Linked Lists, Hashing

## 1. Linked Lists
<div class="author">src: Introduction to Algorithms, Thomas H. Cormen</div>

A __linked list__ is a data structure in which the objects are arranged in a linear order. Unlike an array, however, in which the linear order is determined by the array indices, the order in a linked list is determined by a pointer in each object.

Linked lists provide a simple, flexible representation for dynamic sets.

In its most basic form, each node contains: __data__, and a __reference__ (in other words, a link) to the next node in the sequence. This structure allows for efficient insertion or removal of elements from any position in the sequence during iteration. 

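A drawback of linked lists is that accessing the element at a given position takes linear time, because the list must be traversed from the head. Arrays also have better cache locality than linked lists, since their elements are stored contiguously.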

![dia2](img/8linkedlist.png)
<div class="author">src: wikipedia.org</div>

### 1.1 Items are not constrained in physical memory placement

![dia2](img/8linkedlist2.png)
<div class="author">src: pas.rochester.edu</div>

### 1.2 Nomenclature

Each record of a linked list is often called an __element__ or __node__. Each node holds at least two pieces of information:

1. The __next link__ or __next pointer__: the field that contains the address (reference) of the next node.

2. The __data, information, value, cargo, or payload__ fields: the fields that hold the item itself.

The __head__ of a list is its first node. 

The __tail__ of a list may refer either to the rest of the list after the head, or to the last node in the list. 

### 1.3 Basic Concepts
<div class="author">src: wikipedia.org</div>

##### 1.3.1 Singly linked list

Singly linked lists contain nodes that have a data field as well as a 'next' field, which points to the next node in the sequence. Operations that can be performed on singly linked lists include insertion, deletion and traversal.

![dia2](img/8linkedlist.png)
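
The three operations named above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: `SketchNode`, `insert_after`, `delete_after` and `to_list` are hypothetical names chosen for this example. It shows that insertion and deletion next to a known node only rewire references, while traversal visits every node.

```python
class SketchNode:
    """Hypothetical minimal node: a payload plus a reference to the next node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next


def insert_after(node, data):
    """Insert a new node right after `node` in O(1); no elements are shifted."""
    node.next = SketchNode(data, node.next)
    return node.next


def delete_after(node):
    """Unlink the node that follows `node` in O(1), if there is one."""
    if node.next is not None:
        node.next = node.next.next


def to_list(head):
    """Traverse from `head` and collect the payloads (O(n))."""
    items = []
    current = head
    while current is not None:
        items.append(current.data)
        current = current.next
    return items


head = SketchNode(12, SketchNode(99, SketchNode(37)))
insert_after(head, 50)          # 12 -> 50 -> 99 -> 37
delete_after(head.next)         # removes 99: 12 -> 50 -> 37
print(to_list(head))            # [12, 50, 37]
```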


##### 1.3.2 Doubly linked list

In a __doubly linked list__, each node contains, besides the next-node link, a second link field pointing to the __previous__ node in the sequence. The two links may be called __forward__ and __backwards__, or __next__ and __prev__. 

![dia2](img/8doublyll.png)

A technique known as __XOR-linking__ allows a doubly linked list to be implemented using a single link field in each node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be available in some high-level languages. 

An ordinary doubly linked list stores addresses of the previous and next list items in each list node, requiring two address fields: 
```
 ...  A       B         C         D         E  ...
          –>  next –>  next  –>  next  –>
          <–  prev <–  prev  <–  prev  <–
```
An XOR linked list compresses the same information into one address field by storing the bitwise XOR (here denoted by ⊕) of the address for previous and the address for next in one field: 
```
 ...  A        B         C         D         E  ...
          ⇌   A⊕C   ⇌   B⊕D   ⇌   C⊕E   ⇌
```
<div class="author">src: wikipedia.org</div>

Taking *link(C)* as an example:
```
link(C) = addr(B)⊕addr(D)
```
When traversing from B to C, addr(B) is already known, so the address of D can be recovered:
```
addr(D) = link(C) ⊕ addr(B)
```
because:
```
link(C) ⊕ addr(B) = addr(B)⊕addr(D) ⊕ addr(B) = addr(D)
```
using:
- X⊕X = 0
- X⊕0 = X
- X⊕Y = Y⊕X
- (X⊕Y)⊕Z = X⊕(Y⊕Z)
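
The traversal rule can be tried out in Python, with one caveat: Python does not allow XOR on object references, so the sketch below simulates memory with a list whose indices play the role of addresses. The names `memory`, `allocate`, `build_xor_list` and `traverse` are assumptions of this example, and index 0 is reserved as the null address.

```python
# Simulated memory: the list index plays the role of an address.
# Index 0 is reserved as the null address (an assumption of this sketch).
memory = [None]


def allocate(data):
    """Store a [data, link] record and return its simulated address."""
    memory.append([data, 0])
    return len(memory) - 1


def build_xor_list(values):
    """Build an XOR-linked list and return (head_address, tail_address)."""
    addresses = [allocate(v) for v in values]
    for i, addr in enumerate(addresses):
        prev_addr = addresses[i - 1] if i > 0 else 0
        next_addr = addresses[i + 1] if i + 1 < len(addresses) else 0
        memory[addr][1] = prev_addr ^ next_addr   # link = addr(prev) XOR addr(next)
    if not addresses:
        return 0, 0
    return addresses[0], addresses[-1]


def traverse(start_addr):
    """Walk the list from one end; works from the head or from the tail."""
    items = []
    prev_addr, current = 0, start_addr
    while current != 0:
        data, link = memory[current]
        items.append(data)
        # addr(next) = link XOR addr(prev)
        prev_addr, current = current, link ^ prev_addr
    return items


head, tail = build_xor_list(["A", "B", "C", "D", "E"])
print(traverse(head))   # ['A', 'B', 'C', 'D', 'E']
print(traverse(tail))   # ['E', 'D', 'C', 'B', 'A']
```

The same single link field supports both directions; which direction is walked depends only on which neighbour address is known at the start.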




##### Drawbacks
- Debugging tools cannot follow XOR chains, hence debugging is more difficult
- The price for the decrease in memory usage is an increase in code complexity, making maintenance more expensive
- Not all languages support type conversion between pointers and integers, and XOR on pointers is not defined in some contexts
- XOR linked lists do not provide some of the important advantages of doubly linked lists, such as the ability to delete a node from the list knowing only its address or the ability to insert a new node before or after an existing node when knowing only the address of the existing node
- Computer systems have increasingly cheap and plentiful memory; therefore, storage overhead is not generally an overriding issue outside specialized embedded systems

##### 1.3.3 Circular linked list

The last node of a list can point to the first node of the list; in that case, the list is said to be __circular__. Otherwise, it is said to be *open* or *linear*.

![dia2](img/8circularl.png)

In the case of a circular doubly linked list, the first node also points to the last node of the list. 
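
A circular list has no `None` at the end, so a traversal needs a different stopping rule. The following sketch (hypothetical names `RingNode`, `make_circular` and `ring_to_list`, singly linked for brevity) stops as soon as it arrives back at the node it started from.

```python
class RingNode:
    """Hypothetical node for this sketch: payload plus next reference."""
    def __init__(self, data):
        self.data = data
        self.next = None


def make_circular(values):
    """Link the values into a ring and return the first node (None if empty)."""
    head = None
    tail = None
    for value in values:
        node = RingNode(value)
        if head is None:
            head = node
        else:
            tail.next = node
        tail = node
    if tail is not None:
        tail.next = head    # close the ring: last node points to the first
    return head


def ring_to_list(head):
    """Visit every node exactly once, stopping when we are back at the head."""
    items = []
    current = head
    while current is not None:
        items.append(current.data)
        current = current.next
        if current is head:
            break
    return items


ring = make_circular([12, 99, 37])
print(ring_to_list(ring))            # [12, 99, 37]
print(ring.next.next.next is ring)   # True
```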

### 1.4 Python Implementation

##### 1.4.1 Node Implementation

Start with a single node class, since a complete list is simply several nodes linked together.

The `Node` class holds some `data` and a single pointer `next` that points to the next `Node` object in the linked list. Furthermore, let's include some getter and setter methods.

In [2]:
class Node:
    def __init__(self, init_data):
        self.data = init_data
        self.next = None
        
    def get_data(self):
        return self.data
    
    def get_next(self):
        return self.next
    
    def set_data(self, new_data):
        self.data = new_data
        
    def set_next(self, new_next):
        self.next = new_next

#####  Linked List class

|Class / Method|Description|
|:---|:---|
|LinkedList()|creates a new list that is empty. It needs no parameters and returns an empty list.|
|add(item)| adds a new item to the list. It needs the item and returns nothing|
|remove(item)| removes the item from the list. It needs the item and modifies the list|
|search(item)| searches for the item in the list. It needs the item and returns a boolean value|
|is_empty()|tests to see whether the list is empty. It needs no parameters and returns a boolean value|
|length()|returns the number of items in the list. It needs no parameters and returns an integer|
|append(item)|adds a new item to the end of the list making it the last item in the collection. It needs the item and returns nothing|
|index(item)|returns the position of the item in the list. It needs the item and returns the index|
|insert(pos, item)|adds a new item to the list at position pos. It needs the item and returns nothing|
|pop()|removes and returns the last item in the list. It needs nothing and returns an item|
|pop(pos)|removes and returns the item at position pos. It needs the position and returns the item|

##### 1.4.2. Linked List class implementation

In [3]:
class LinkedList:
    # Function to initialize head
    def __init__(self):
        self.head = None
    
    def add(self, item):
        temp = Node(item)
        temp.set_next(self.head)
        self.head = temp
        
    def remove(self, item):
        # Walk the list, remembering the node before the current one
        current = self.head
        previous = None
        while current is not None and current.get_data() != item:
            previous = current
            current = current.get_next()

        if current is None:
            return  # item not found (or empty list): nothing to remove
        if previous is None:
            self.head = current.get_next()
        else:
            previous.set_next(current.get_next())
    
    def length(self):
        current = self.head
        count = 0
        while current != None:
            count = count + 1
            current = current.get_next()
        return count

    def search(self, item):
        current = self.head
        found = False
        while current != None and not found:
            if current.get_data() == item:
                found = True
            else:
                current = current.get_next()
        return found
        
    # Utility function to print the linked list
    def print_list(self):
        temp = self.head
        while temp is not None:
            print(temp.data)
            temp = temp.next
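
The class above covers only part of the method table. As a rough sketch of how some of the remaining entries (`is_empty`, `append`, `index`, `pop`) could look, the block below uses a compact stand-in node class so that it runs on its own. The names `_Node` and `SketchLinkedList` and the choice to raise `ValueError` / `IndexError` (mirroring Python's built-in list) are assumptions of this example, not part of the lecture code.

```python
class _Node:
    """Compact stand-in for the Node class above (sketch only)."""
    def __init__(self, data):
        self.data = data
        self.next = None


class SketchLinkedList:
    def __init__(self):
        self.head = None

    def is_empty(self):
        return self.head is None

    def append(self, item):
        """Add item at the end; O(n) because we walk to the last node."""
        node = _Node(item)
        if self.head is None:
            self.head = node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = node

    def index(self, item):
        """Return the position of the first match, or raise ValueError."""
        current, position = self.head, 0
        while current is not None:
            if current.data == item:
                return position
            current = current.next
            position += 1
        raise ValueError(f"{item!r} is not in the list")

    def pop(self):
        """Remove and return the last item; raises IndexError if empty."""
        if self.head is None:
            raise IndexError("pop from empty list")
        previous, current = None, self.head
        while current.next is not None:
            previous, current = current, current.next
        if previous is None:
            self.head = None
        else:
            previous.next = None
        return current.data


sketch = SketchLinkedList()
for value in (1, 2, 3):
    sketch.append(value)
print(sketch.index(2))    # 1
print(sketch.pop())       # 3
print(sketch.is_empty())  # False
```

Note that `append` and `pop` are both O(n) here because only the head is stored; keeping an extra tail reference would make `append` O(1).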
            

#### Exercise 1:
Name two advantages of using linked lists over arrays to store data.

#### Exercise 2:

Name an advantage of using arrays over linked lists to store data

#### Exercise 3:

Add a method to the LinkedListEx3 class to remove duplicates from the linked list.

In [4]:
class LinkedListEx3:
    def __init__(self):
        self.head = None
    
    def add(self, item):
        temp = Node(item)
        temp.set_next(self.head)
        self.head = temp

    def print_list(self):
        temp = self.head
        while(temp):
            print(temp.data)
            temp = temp.next
            
    def remove_duplicates(self):
        #
        # Add method code here!!!!
        #
        pass  # placeholder so the cell runs; replace with your solution

    
#Driver Code    
llist = LinkedListEx3()
 
llist.add(1)
llist.add(3)
llist.add(5)
llist.add(5)
llist.add(7)
llist.add(5)
llist.add(9)
print("Linked List after removing duplicates:",)
llist.remove_duplicates()
llist.print_list()


#### Exercise 4:

Add a method to the LinkedListEx4 class to move the last element to the front of a given linked list.

__Example:__ `1→2→3→4→5` should change to `5→1→2→3→4`

In [None]:
class LinkedListEx4:
    def __init__(self):
        self.head = None
    
    def add(self, item):
        temp = Node(item)
        temp.set_next(self.head)
        self.head = temp

    def print_list(self):
        temp = self.head
        while(temp):
            print(temp.data)
            temp = temp.next
            
    def move_last_to_front(self):
        #
        # Add code here!!!!
        #
        pass  # placeholder so the cell runs; replace with your solution

    
#Driver Code    
llist = LinkedListEx4()
 
llist.add(5)
llist.add(4)
llist.add(3)
llist.add(2)
llist.add(1)

print("Linked List after moving last item to front:",)
llist.move_last_to_front()
llist.print_list()

#### Exercise 5:

You are given two __non-empty__ linked lists representing two non-negative integers. The digits are stored in __reverse order__, and each of their nodes contains a single digit. Add the two numbers and return the sum as a linked list.

You may assume the two numbers do not contain any leading zero, except the number 0 itself.

![dia1](img/8addtwonumbers.jpg)
<div class="author">src: leetcode.com</div>

__Example:__
```
Input: l1 = [2,4,3], l2 = [5,6,4]
Output: [7,0,8]
Explanation: 342 + 465 = 807.
```

In [None]:
def add_two_numbers(l1, l2):
    #
    # Implement here!
    #
    pass  # placeholder so the cell runs; replace with your solution
        
    

#Driver Code
l1 = LinkedList()
l1.add(1)
l1.add(9)
l1.add(9)

l2 = LinkedList()
l2.add(2)
l2.add(3)
l2.add(8)

add_two_numbers(l1, l2)


#### Extra Exercise (Homework):

Implement a __Doubly Linked List__.


## 2. Hashing
<div class="author">src: programiz.com</div>

Hashing is a technique of __mapping__ a large set of arbitrary data into a __hash table__ (tabular indexes) using a __hash function__. It is a method for representing __dictionaries__ for large datasets.

It allows lookup, update and retrieval operations to run in constant time on average, i.e. __O(1)__. 
The efficiency of the mapping depends on the quality of the hash function used.

### 2.1 What is it used for?

Lookups are unavoidable in large datasets and are often very time-consuming:

- Linear search: `O(n)`
- Binary search: `O(log n)`

As datasets grow, these costs become significant.

Searching for an element in a hash table can take as long as searching for an element in a linked list, i.e. $\Theta(n)$ time in the worst case. In practice, however, hashing performs extremely well: under reasonable assumptions, the average time to search for an element in a hash table is `O(1)`.

When the number of keys actually stored is small relative to the total number of possible keys, hash tables become an effective alternative to directly addressing an array, since a hash table typically uses an array of size proportional to the number of keys actually stored. 

Instead of using the key as an array index directly, the array index is computed from the key. 

##### Applications of Hash Table

Hash tables are used where

- constant-time lookup and insertion are required
- data needs to be indexed

Hash functions are also a building block of cryptographic applications, although those require special cryptographic hash functions.


### 2.2 Functioning

![dia2](img/8hashtable.png)
<div class="author">src: Jorge Stolfi, CC BY-SA 3.0, wikimedia</div>

A __hash table__ uses a __hash function__ to compute an index, also called a __hash code__, into an array of __buckets__ or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. 

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause __hash collisions__ where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way. 

### 2.3 Hash table complexities

|Algorithm|Average|Worst case|
|:---|:---|:---|
|Space|$\Theta(n)$|O(n)|
|Search|$\Theta(1)$|O(n)|
|Insert|$\Theta(1)$|O(n)|
|Delete|$\Theta(1)$|O(n)|

Hashing is an example of a __space-time tradeoff__. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if time is infinite, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.

### 2.4 Hash function

![dia1](img/8hashfunction.webp)
<div class="author">src: programiz.com</div>

A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes.

Let $k$ be a key and $h(k)$ be a _hash function_. A hash function is an easily computable function that maps a key $k$ to a "random-like" index in the range $[0, m-1]$, where $m$ is the size of the hash table.

##### 2.4.1 "Good Hashing Functions"

A good hashing function should have the following properties:

- it should be *efficiently computable*, in constant time and using simple arithmetic operations
- it should produce *few collisions*:
    - it should be a function of *every bit* of the key (otherwise keys that differ only in the ignored bits will collide, e.g. "1111101", "1111110", "1111111")
    - it should break up (scatter) naturally occurring clusters of key values, e.g. "temp1", "temp2", "temp3"

##### 2.4.2 Common (simple) Hash Functions

##### 2.4.2.1 Division Method

If `k` is a key and `m` is the size of the hash table, the hash function `h()` is calculated as:

$$h(k) = k\mod m$$

This is called *division hashing*. It satisfies our first criterion of efficiency, but consecutive keys are mapped to consecutive entries, so it does not do a good job of breaking up clusters.
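
A short sketch makes the clustering behaviour visible. The function name `division_hash` and the table size `m = 10` are chosen only for this illustration.

```python
def division_hash(key, m):
    """Division method: h(k) = k mod m."""
    return key % m


m = 10
# Consecutive keys land in consecutive slots, so clusters are preserved
print([division_hash(k, m) for k in (120, 121, 122, 123)])  # [0, 1, 2, 3]
# Keys that differ by a multiple of m collide
print(division_hash(37, m), division_hash(107, m))          # 7 7
```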

##### 2.4.2.2 Multiplicative Hash Function

$$h(k) = (a\cdot k)\mod m$$

where $a$ is a large prime number (or at least shares no common factors with $m$). Alternatively:

$$h(k) = \lfloor m A k \rfloor \mod m$$

where $A$ is a real-valued constant. An advantage of hashing by multiplication is that the choice of $m$ is not critical. Although any real value $A$ produces a hash function, some values work better than others; Donald Knuth suggests a value based on the golden ratio: $A = \frac{\sqrt{5}-1}{2} \approx 0.618$.
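
The real-valued variant can be sketched as follows. The code computes $\lfloor m \cdot (kA \bmod 1) \rfloor$, which gives the same result as $\lfloor m A k \rfloor \bmod m$ for non-negative keys and integer $m$, but keeps the intermediate floating-point values small. The function name `multiplicative_hash` and the sample table sizes are assumptions of this example.

```python
import math

A = (math.sqrt(5) - 1) / 2   # about 0.618, the constant suggested by Knuth


def multiplicative_hash(key, m, a=A):
    """Multiplication method: floor(m * frac(key * a)) for a non-negative integer key."""
    fractional = (key * a) % 1.0      # fractional part of key * a
    return math.floor(m * fractional)


m = 16
# Consecutive keys are scattered across the table instead of staying adjacent
print([multiplicative_hash(k, m) for k in range(1, 6)])   # [9, 3, 13, 7, 1]
```

Because the result depends on floating-point arithmetic, very large keys may lose precision; integer-only variants exist but are outside the scope of this sketch.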

##### 2.4.2.3 Linear Hash Function

$$h(k) = (a\cdot k + b)\mod m$$

This enhances the multiplicative hash function with an added constant term $b$.

##### 2.4.2.4 Polynomial Hash Function

We can further extend the linear hash function to a polynomial. This is often handy with keys that consist of a sequence of objects, such as strings
or the coordinates of points in a multi-dimensional space.

Suppose that the key being hashed is a sequence of numbers $x = (c_0, c_1, \ldots, c_{k-1})$.
We map them to a single number by computing a polynomial function whose coefficients are these values.

$$h(c_0, \ldots, c_{k-1}) =\left ( \sum_{i=0}^{k-1}c_i p^i \right ) \mod m$$

For example, if $k = 4$ and $p = 37$, the associated polynomial would be $c_0 + 37c_1 + 37^2c_2 + 37^3c_3$.
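
For strings, the coefficients $c_i$ can be taken to be the character codes. The sketch below follows the formula directly; the function name `polynomial_hash`, the use of `ord()` for the coefficients and the default table size are assumptions made for illustration.

```python
def polynomial_hash(text, p=37, m=1_000_003):
    """Compute (sum of c_i * p**i) mod m, where c_i is the code point of character i."""
    total = 0
    power = 1                      # holds p**i mod m
    for ch in text:
        total = (total + ord(ch) * power) % m
        power = (power * p) % m
    return total


print(polynomial_hash("ab", p=37, m=1000))              # (97 + 98*37) % 1000 = 723
print(polynomial_hash("ab") != polynomial_hash("ba"))   # True: the order of characters matters
```

Reducing modulo `m` inside the loop keeps the intermediate numbers small without changing the result.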

### 2.5. Hash Collision

It may be that different keys are mapped to the same location. Such events are
called __hash collisions__, and a key element in the design of a good hashing system is how collisions
are to be handled.

Collisions can be resolved using one of the following techniques:

- Collision resolution by __chaining__
- __Open Addressing__: Linear/Quadratic Probing and Double Hashing

##### 2.5.1 Collision resolution by (separate) chaining

If we have additional memory at our disposal, a simple approach to collision resolution, called separate chaining, is to store the colliding entries in a separate (doubly-)linked list,
one for each table entry. More formally, each table entry stores a reference to a list data
structure that contains all the dictionary entries that hash to this location.

If `j` is the slot for multiple elements, it contains a pointer to the head of the list of elements. If no element is present, `j` contains `NIL`.

![dia2](img/8chained.webp)
<div class="author">src: programiz.com</div>
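
A minimal sketch of separate chaining is shown below. To keep it short, each chain is a plain Python list rather than the (doubly-)linked list described above, and the index is computed with Python's built-in `hash()`; the class name `ChainedHashTable` and its methods are hypothetical names for this example.

```python
class ChainedHashTable:
    """Separate chaining; each slot holds a Python list acting as the chain."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]   # independent empty chains

    def _index(self, key):
        return hash(key) % self.m

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)       # overwrite existing key
                return
        chain.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:
                return value
        return default

    def remove(self, key):
        index = self._index(key)
        self.slots[index] = [(k, v) for k, v in self.slots[index] if k != key]


table = ChainedHashTable(m=4)
table.put(1, "one")
table.put(5, "five")      # 1 and 5 collide: both map to slot 1 when m = 4
print(table.get(1), table.get(5))   # one five
table.remove(1)
print(table.get(1))                 # None
```

Both colliding entries remain retrievable because the chain in slot 1 stores them side by side.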

##### 2.5.2 Collision resolution by open addressing
<div class="author">src: cs.umd.edu</div>

Unlike chaining, open addressing doesn't store multiple elements into the same slot. Here, each slot is either filled with a single key or left NIL. These collision-resolution methods do not require additional storage. The objective is to store all keys within the hash table.

To know which table entries store a value and which do not, we will store a special value, called __empty__ in the empty table entries. The value of empty must be such that it matches no valid key.

Whenever we attempt to insert a new entry and find that its position is already occupied, we
will begin probing other table entries until we discover an empty location where we can place
the new key. In its most general form, an open addressing system involves a secondary search
function, $f$. If we discover that location $h(x)$ is occupied, we next try locations

$$(h(x) + f (1))\mod m, (h(x) + f (2)) \mod m, (h(x) + f (3)) \mod m, \ldots$$

until finding an open location. 

![dia2](img/8linearprobing.png)
The collision between John Smith and Sandra Dee (both hashing to cell 873) is resolved by placing Sandra Dee at the next free location, cell 874.

##### Linear Probing
<div class="author">src: programiz.com</div>

In linear probing, collision is resolved by checking the next slot.

$$h(k, i) = (h'(k) + i) \mod m$$

where

- $i \in \{0, 1, \ldots\}$ is the probe number
- $h'(k)$ is an auxiliary (ordinary) hash function

If a collision occurs at $h(k, 0)$, then $h(k, 1)$ is checked. In this way, the value of $i$ is incremented linearly.

The problem with linear probing is that clusters of adjacent filled slots build up (primary clustering). When a new element hashes into a cluster, the entire cluster must be traversed to find a free slot. This adds to the time required to perform operations on the hash table.
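
Insertion and search with linear probing can be sketched as below. The sketch assumes integer keys, uses the division method as the auxiliary hash function, and marks unused slots with `None`; the names `EMPTY`, `lp_insert` and `lp_search` are made up for this example. Deletion is left out on purpose, since simply emptying a slot would break later searches.

```python
EMPTY = None   # marker for unused slots; assumes None is never a valid key


def lp_insert(table, key):
    """Insert key with linear probing; returns the slot index used."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m      # h(k, i) = (h'(k) + i) mod m
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def lp_search(table, key):
    """Return the slot holding key, or -1 if the key is absent."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is EMPTY:
            return -1                 # an empty slot ends the probe sequence
        if table[slot] == key:
            return slot
    return -1


table = [EMPTY] * 7
for key in (10, 17, 24, 5):
    lp_insert(table, key)
print(table)                 # [None, None, None, 10, 17, 24, 5]
print(lp_search(table, 24))  # 5
print(lp_search(table, 3))   # -1
```

The keys 10, 17 and 24 all hash to slot 3 and fill slots 3 to 5; key 5 then finds its own slot taken and is pushed to slot 6, which shows how a cluster grows.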

##### Quadratic Probing

It works similarly to linear probing, but the spacing between the probed slots grows with each probe, according to the following relation.

$$h(k, i) = (h'(k) + c_1 \cdot i + c_2 \cdot i^2) \mod m$$

where

- $c_1$ and $c_2$ are positive auxiliary constants
- $i \in \{0, 1, \ldots\}$ is the probe number

Quadratic Probing is used to avoid primary clustering.
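
The probe sequence can be listed with a small helper. The function name `quadratic_probe_sequence`, the constants $c_1 = c_2 = 1$ and the division method for $h'(k)$ are assumptions chosen for this illustration.

```python
def quadratic_probe_sequence(key, m, c1=1, c2=1, probes=4):
    """Return the first `probes` slots h(k, i) = (h'(k) + c1*i + c2*i*i) mod m."""
    start = key % m                   # h'(k): division method, an assumption
    return [(start + c1 * i + c2 * i * i) % m for i in range(probes)]


print(quadratic_probe_sequence(3, m=11))    # [3, 5, 9, 4]
print(quadratic_probe_sequence(14, m=11))   # [3, 5, 9, 4]
```

The probed slots are no longer adjacent, but keys with the same initial slot (here 3 and 14) still follow exactly the same sequence.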

##### Double Hashing

Both linear probing and quadratic probing have shortcomings. Recall that in any open-addressing scheme, we are accessing the probe sequence $h(x) + f (1), h(x) + f (2), ...$. 

In double hashing, the increment function $f(i)$ also depends on the search key: it is computed with a second hash function, which gives the technique its name.

More formally, we define two hash functions $h(x)$ and $g(x)$. We use $h(x)$ to determine the first probe location. If this entry is occupied, we then try:

$$h(x) + g(x), h(x) + 2g(x), h(x) + 3g(x), . . .$$

Let $|T|$ be the number of slots in the hash table $T$. Given a key $x$, the $(i+1)$-st probe location is computed by:

$$h_i(x)= (h(x)+i\cdot g(x)) \mod |T|$$
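
A sketch of the resulting probe sequence follows. The two hash functions used here, $h(x) = x \bmod m$ and $g(x) = 1 + (x \bmod (m-1))$, are a common textbook choice and are assumptions of this example, as is the function name `double_hash_sequence`.

```python
def double_hash_sequence(key, m, probes=4):
    """Return the first `probes` slots (h(x) + i*g(x)) mod m."""
    h = key % m                  # primary hash, assumed for this sketch
    g = 1 + key % (m - 1)        # secondary hash, never 0 so the probe always moves
    return [(h + i * g) % m for i in range(probes)]


print(double_hash_sequence(10, m=7))   # [3, 1, 6, 4]
print(double_hash_sequence(17, m=7))   # [3, 2, 1, 0]
```

Both keys start at slot 3, yet they continue along different probe sequences because their step sizes $g(x)$ differ.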

### 2.6 Python Example
<div class="author">src: programiz.com</div>

#### Exercise 6
Which hashing function is used in the following code?

In [None]:
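# Python program to demonstrate working of HashTable 

def checkPrime(n):
    # Return 1 if n is prime, 0 otherwise
    if n < 2:
        return 0
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return 0
    return 1

def getPrime(n):
    # Return the smallest odd prime >= n
    if n % 2 == 0:
        n = n + 1
    while not checkPrime(n):
        n += 2
    return n

# Size the table with the same prime that the hash function uses,
# so that every computed index is a valid slot
capacity = getPrime(10)
hashTable = [[] for _ in range(capacity)]

def hashFunction(key):
    return key % capacity

def insertData(key, data):
    # Note: a colliding key simply overwrites the slot (no collision handling)
    index = hashFunction(key)
    hashTable[index] = [key, data]

def removeData(key):
    index = hashFunction(key)
    hashTable[index] = []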

insertData(123, "apple")
insertData(432, "mango")
insertData(213, "banana")
insertData(654, "guava")

print(hashTable)

removeData(123)

print(hashTable)

#### Exercise 7:

Indicate whether you would use an Array, LinkedList or Hash Table to store data in each of the following cases. Justify your answer:

1. A list of employee records needs to be stored in a manner that makes it easy to find the max or min in the list.

2. A data set contains many records with duplicate keys. The only requirement is to keep the list in sorted order.

3. A library needs to maintain books by their ISBN. The only important thing is finding them as quickly as possible.

4. A data set needs to be maintained in order to find the median of the set quickly.

#### Exercise 8:
<div class="author">src: geeksforgeeks.org</div>

Given the following input: $$(4322, 1334, 1471, 9679, 1989, 6171, 6173, 4199)$$ and the hash function $$x \mod 10$$.

Which of the following statements are true? 
1. 9679, 1989, 4199 hash to the same value 
2. 1471, 6171 hash to the same value 
3. All elements hash to the same value 
4. Each element hashes to a different value 

- A) 1 only 
- B) 2 only 
- C) 1 and 2 only 
- D) 3 or 4 

#### Exercise 9:

The keys $12, 18, 13, 2, 3, 23, 5, 15$ are inserted into an initially empty hash table of length 10 using open addressing with hash function 

$$h(k) = k \mod 10$$ 

and linear probing. What is the resultant hash table? 