# Intro to Data Structures and Algorithms 

[course link](https://learn.udacity.com/courses/ud513)

## Lesson 4. Maps and Hashing

### Sets and Maps

The defining characteristic of a map is its key-value structure.  
In computer science, a dictionary is a map.  

Set: a data structure comparable to a list. The biggest difference is that a list keeps its elements in some order, while a set is an unordered data structure. Also, sets do not allow repeated elements. 

In Python, sets are unordered collections of unique elements. This means that you cannot access set elements by index directly. 

Sets in Python are mutable. This means that you can add or remove elements from a set.   
However, while the set itself can be modified, the elements it contains must be hashable, which in practice means an immutable type such as a number, a string, or a tuple of immutable values.
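
As a small illustration of the points above, the sketch below uses made-up sample values (`letters`, `points`) to show deduplication, mutation, and the two operations a set rejects. It is only meant as a quick check of the behaviour described, not as part of the course material.

```python
# A set silently drops duplicates.
letters = set(['a', 'b', 'a', 'c'])
print(len(letters))        # 3

# The set itself is mutable: elements can be added and removed.
letters.add('d')
letters.discard('a')
print(sorted(letters))     # ['b', 'c', 'd']

# Elements must be hashable: a tuple works, a list does not.
points = {(0, 1), (2, 3)}
try:
    points.add([4, 5])
except TypeError as err:
    print('cannot add a list:', err)

# Sets have no positional index.
try:
    letters[0]
except TypeError as err:
    print('cannot index a set:', err)
```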

Map: a set-based data structure.  
Map <key, value>  
The keys of a dictionary form a set, i.e. keys must be unique! 

In Python 3.7 and later, dictionaries are ordered: items keep the order in which they were inserted, and that order will not change.  
In Python 3.6 and earlier, the language did not guarantee any order for dictionaries.  
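
A short sketch of both properties, unique keys and insertion order, using a hypothetical `stock` dictionary invented for this example. It assumes Python 3.7 or later.

```python
stock = {'apple': 3, 'pear': 5}
stock['kiwi'] = 2

# Assigning to an existing key overwrites its value; no duplicate key appears.
stock['apple'] = 10
print(len(stock))                        # 3

# Iteration follows insertion order (Python 3.7+).
print(list(stock))                       # ['apple', 'pear', 'kiwi']

# The keys view supports set operations such as intersection.
print(stock.keys() & {'apple', 'plum'})  # {'apple'}
```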

#### Task 1. 

In [1]:
"""Time to play with Python dictionaries!
You're going to work on a dictionary that
stores cities by country and continent.
One is done for you - the city of Mountain 
View is in the USA, which is in North America.

You need to add the cities listed below by
modifying the structure.
Then, you should print out the values specified
by looking them up in the structure.

Cities to add:
Bangalore (India, Asia)
Atlanta (USA, North America)
Cairo (Egypt, Africa)
Shanghai (China, Asia)"""

locations = {'North America': {'USA': ['Mountain View']}}

"""Print the following (using "print").
1. A list of all cities in the USA in
alphabetic order.
2. All cities in Asia, in alphabetic
order, next to the name of the country.
In your output, label each answer with a number
so it looks like this:
1
American City
American City
2
Asian City - Country
Asian City - Country"""

cities_to_add = ['Bangalore (India, Asia)',
                 'Atlanta (USA, North America)',
                 'Cairo (Egypt, Africa)',
                 'Shanghai (China, Asia)']


# adding data to our dictionary
for entry in cities_to_add:
    # 'Bangalore (India, Asia)' -> city 'Bangalore', rest 'India, Asia)'
    # partition keeps multi-word city names such as 'Mountain View' intact
    city, _, rest = entry.partition(' (')
    country, continent = rest.rstrip(')').split(', ')
    # setdefault creates the missing continent / country level on first use
    locations.setdefault(continent, {}).setdefault(country, []).append(city)


# 1. Print a list of all cities in the USA in alphabetic order.
print(1, *sorted(locations['North America']['USA']), sep='\n')


# 2. Print all cities in Asia, in alphabetic order, next to the name of the country.
asian_cities = [city + ' - ' + country
                for country, cities in locations['Asia'].items()
                for city in cities]
print(2, *sorted(asian_cities), sep='\n')

1
Atlanta
Mountain View
2
Bangalore - India
Shanghai - China


### Hashing

Using a data structure that employs a hash function allows you to do lookups in constant time on average. The list-based structures covered so far take O(n) to search when the data is unsorted.

A hash function takes some value as input and produces a hash value.  

A simple example of a hash function is division by 10, with the remainder used as the hash value.  

The hash value acts as an index into an array where we store the original values, so we can look up an element by its index in constant time. 
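
The idea can be sketched in a few lines. The names `hash_value`, `slots`, and `contains` are invented for this illustration, and the sample numbers were chosen so that no two of them share a remainder.

```python
def hash_value(number, num_slots=10):
    """Remainder after dividing by num_slots, as in the notes."""
    return number % num_slots

# One array slot per possible remainder.
slots = [None] * 10
for number in [21, 48, 3, 76]:
    slots[hash_value(number)] = number

def contains(number):
    """Look at exactly one slot, regardless of how many values are stored."""
    return slots[hash_value(number)] == number

print(slots)         # [None, 21, None, 3, None, None, 76, None, 48, None]
print(contains(48))  # True
print(contains(58))  # False
```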

### Collision

Collision happens when two different values have the same hash value.  

There are two main techniques for dealing with collisions:
- change the hash function, so that the remainders it produces are different
- change the structure of your array: instead of storing a single value in each cell, store a list (a collection) of all values with the same hash value in each cell (a bucket).    

The first option will use a lot of extra space, but will still have constant lookup time complexity.  

The second option is the one usually used. Its worst-case lookup time complexity is O(m), where m is the length of the collection in a bucket. If every value lands in the same bucket, m equals the total number of values. 

You can also be creative and use a second hash function inside each bucket to keep the overall lookup time constant. 
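
A minimal sketch of the bucket approach, assuming `% 10` as the hash function. `NUM_BUCKETS`, `store`, and `lookup` are names made up for this example.

```python
NUM_BUCKETS = 10
# Each cell holds a list, so colliding values can live side by side.
buckets = [[] for _ in range(NUM_BUCKETS)]

def store(value):
    bucket = buckets[value % NUM_BUCKETS]
    if value not in bucket:
        bucket.append(value)

def lookup(value):
    # Only the one bucket is scanned: O(m) for a bucket of length m.
    return value in buckets[value % NUM_BUCKETS]

for v in [15, 25, 35, 7]:
    store(v)

print(buckets[5])   # [15, 25, 35]
print(lookup(25))   # True
print(lookup(45))   # False
```

Here 15, 25, and 35 all collide in bucket 5, so a lookup there has to scan three entries, while bucket 7 holds a single value.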

### Load Factor

When we're talking about hash tables, we can define a "load factor":  
`Load Factor = Number of Entries / Number of Buckets`  

The purpose of a load factor is to give us a sense of how "full" a hash table is. For example, if we're trying to store 10 values in a hash table with 1000 buckets, the load factor would be 0.01, and the majority of buckets in the table will be empty. We end up wasting memory by having so many empty buckets, so we may want to rehash, i.e. come up with a new hash function with fewer buckets. We can use the load factor as an indicator of when to rehash: the closer it is to 0, the emptier, or sparser, our hash table is.  

On the flip side, the closer our load factor is to 1 (meaning the number of values equals the number of buckets), the better it would be for us to rehash and add more buckets. Any table with a load factor greater than 1 is guaranteed to have collisions. 

To compute the load factor, divide the number of values by the number of buckets. For example, with 100 values and 100 buckets (0 to 99), the load factor is 100/100 = 1.
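
The formula is easy to check in code. The helper name `load_factor` is an assumption for this sketch. The first two calls reuse the numbers from the text, and the third is an extra case with more values than buckets.

```python
def load_factor(num_entries, num_buckets):
    """Number of entries divided by number of buckets."""
    return num_entries / num_buckets

print(load_factor(10, 1000))   # 0.01 -> very sparse table
print(load_factor(100, 100))   # 1.0  -> as many values as buckets
print(load_factor(150, 100))   # 1.5  -> more values than buckets, collisions guaranteed
```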

What hash function to choose when the values are multiples of 5:
- Dividing a bunch of multiples of 5 by another multiple of 5 (such as 125) will cause a lot of collisions. Here's an example, where 10 is used as the divisor:  
  `5 % 10 = 5`  
  `10 % 10 = 0`  
  `15 % 10 = 5`  
  `20 % 10 = 0`
- % 87 is better than % 125, but because 87 is less than 100, the number of values, it'll still have collisions.
- % 1001 is good, but it'll create a ton of leftover buckets and waste a lot of memory.
- The best hash function for our data is % 107. 
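
The comparison above can be checked with a short experiment. The notes do not give the actual data set, so the sketch below assumes the first 100 positive multiples of 5 (5 to 500). The helper `bucket_stats` is also made up for this illustration.

```python
# Assumed sample data: the first 100 positive multiples of 5.
values = [5 * i for i in range(1, 101)]

def bucket_stats(values, divisor):
    """Return (buckets used, values that collide, buckets left empty)."""
    used = len({v % divisor for v in values})
    return used, len(values) - used, divisor - used

for divisor in (125, 87, 107, 1001):
    print(divisor, bucket_stats(values, divisor))
# 125 (25, 75, 100)
# 87 (87, 13, 0)
# 107 (100, 0, 7)
# 1001 (100, 0, 901)
```

With this sample, % 107 avoids collisions while leaving only 7 buckets empty, whereas % 1001 avoids collisions at the cost of 901 empty buckets.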

### Hash Maps

A Python dictionary is a hash map!

You can use the key as the input to a hash function and then store the key-value pair in the bucket given by the resulting hash value. 

Hash maps are very popular because constant-time lookup can really speed up your code. 
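
To make the key-to-bucket step concrete, here is a deliberately simplified hash map. `SimpleHashMap` is a hypothetical teaching class, not how Python's `dict` is actually implemented. It assumes Python's built-in `hash()` as the hash function and a fixed number of buckets.

```python
class SimpleHashMap:
    """Toy hash map: hash the key, then keep (key, value) pairs in that bucket."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The key, not the value, decides which bucket is used.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # keys are unique: overwrite
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default


capitals = SimpleHashMap()
capitals.put('Egypt', 'Cairo')
capitals.put('India', 'New Delhi')
print(capitals.get('Egypt'))          # Cairo
print(capitals.get('Peru', 'n/a'))    # n/a
```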

### String Keys

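String hashing example, where `n` is the length of the string and `s[i]` is the character code at position `i`:  
`s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]`  

For why 31 is used as the multiplier, see [this Stack Overflow discussion](https://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier).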
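
The polynomial above can be evaluated left to right without computing any powers (Horner's method). The function name `string_hash` is made up for this sketch. Unlike Java's `hashCode`, it does not wrap the result to 32 bits, so values for long strings will differ from Java's.

```python
def string_hash(s, base=31):
    """Evaluate s[0]*base^(n-1) + s[1]*base^(n-2) + ... + s[n-1]."""
    h = 0
    for ch in s:
        # Multiplying the running total by base shifts earlier characters
        # up by one power before the next character code is added.
        h = h * base + ord(ch)
    return h

print(string_hash('ab'))   # 97*31 + 98 = 3105
print(string_hash('ba'))   # 98*31 + 97 = 3135
```

Because each position gets a different power of 31, strings with the same letters in a different order, such as `'ab'` and `'ba'`, hash to different values.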

#### Task 2. 

In [2]:
"""Write a HashTable class that stores strings
in a hash table, where keys are calculated
using the first two letters of the string."""

class HashTable:
    def __init__(self):
        self.table = [None]*10000
    
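    def calculate_hash_value(self, string):
        """Helper function to calculate a hash value from a string.

        Uses the first two characters, so the string needs at least two.
        """
        hash_value = ord(string[0]) * 100 + ord(string[1])
        # Keep the index inside the table: characters with codes above 99
        # (e.g. lowercase letters) would otherwise run past the 10000 slots.
        return hash_value % len(self.table)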
    
    def store(self, string):
        """Input a string that's stored in the table."""
        hash_value = self.calculate_hash_value(string)
        if self.table[hash_value] is not None:
            self.table[hash_value].append(string)
        else:
            self.table[hash_value] = [string]
    
    def lookup(self, string):
        """Return the hash value if the string is already in the table. Return -1 otherwise."""
        hash_value = self.calculate_hash_value(string)
        if self.table[hash_value] is not None:
            if string in self.table[hash_value]:
                return hash_value
        return -1
    
    
# Setup
hash_table = HashTable()

# Test calculate_hash_value
# Should be 8568
print(hash_table.calculate_hash_value('UDACITY'))

# Test lookup edge case
# Should be -1
print(hash_table.lookup('UDACITY'))

# Test store
hash_table.store('UDACITY')
# Should be 8568
print(hash_table.lookup('UDACITY'))

# Test store edge case
hash_table.store('UDACIOUS')
# Should be 8568
print(hash_table.lookup('UDACIOUS'))


8568
-1
8568
8568
