https://www.geeksforgeeks.org/hashing-set-1-introduction/

https://www.geeksforgeeks.org/hashing-data-structure/?ref=lbp

https://www.geeksforgeeks.org/hash-table-data-structure/

https://www.geeksforgeeks.org/what-is-hashing/

https://www.geeksforgeeks.org/hash-map-in-python/


https://medium.com/basecs/taking-hash-tables-off-the-shelf-139cbf4752f0

### Hashing | Set 1 (Introduction)

Suppose we want to design a system for storing employee records, using phone numbers as keys, and we want the following operations to be efficient:

    Insert a phone number and corresponding information.
    Search a phone number and fetch the information.
    Delete a phone number and related information.

We can think of using the following data structures to maintain information about different phone numbers.

    1. Array of phone numbers and records.
    2. Linked List of phone numbers and records.
    3. Balanced binary search tree with phone numbers as keys.
    4. Direct Access Table.
    
For **arrays and linked lists**, we need to search linearly, which takes O(n) time and can be costly in practice. If we use an array and keep the data sorted, a phone number can be found in O(log n) time using binary search, but insert and delete operations become costly (O(n)) because we have to maintain sorted order.
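
The sorted-array trade-off can be sketched with Python's standard `bisect` module. The phone numbers below are made-up sample data and `contains` is a hypothetical helper, not something from the original article.

```python
import bisect

# Hypothetical sample data: phone numbers kept in sorted order.
phones = [5550101, 5550123, 5550147, 5550199]

def contains(sorted_phones, number):
    """Binary search: O(log n) comparisons on a sorted list."""
    i = bisect.bisect_left(sorted_phones, number)
    return i < len(sorted_phones) and sorted_phones[i] == number

print(contains(phones, 5550147))  # True
print(contains(phones, 5550150))  # False

# insort finds the position in O(log n), but shifting the
# following elements to make room still costs O(n).
bisect.insort(phones, 5550130)
print(phones)
```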

With a **balanced binary search tree**, we get moderate search, insert and delete times. All of these operations are guaranteed to run in O(log n) time.

Another solution that one can think of is a **direct access table**, where we make a big array and use phone numbers as indexes into the array. An entry in the array is NIL if the phone number is not present; otherwise the entry stores a pointer to the record for that phone number. In terms of time complexity this solution is the best of all: every operation takes O(1) time. For example, to insert a phone number, we create a record with the details for that number, use the phone number as the index, and store the pointer to the new record in the table.

This solution has many practical limitations. The first problem is that the extra space required is huge. For example, if a phone number has n digits, we need O(m × 10^n) space for the table, where m is the size of a pointer to a record. Another problem is that an integer type in a programming language may not be able to hold n digits.

Because of these limitations, a direct access table cannot always be used.
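
A minimal sketch of a direct access table, assuming toy 4-digit "phone numbers" so that the table stays small. The function names and the sample record are illustrative only.

```python
# Assumption: keys are 4-digit numbers, so 10**4 slots cover every key.
DIGITS = 4
table = [None] * (10 ** DIGITS)  # one slot per possible key

def insert(number, record):
    table[number] = record       # O(1): the key itself is the index

def search(number):
    return table[number]         # None plays the role of NIL

def delete(number):
    table[number] = None

insert(1234, {"name": "Alice"})
print(search(1234))  # {'name': 'Alice'}
print(search(4321))  # None
delete(1234)
print(search(1234))  # None
print(len(table))    # 10000
```

With real 10-digit numbers the same table would need 10^10 slots, which is the space problem described above.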

**Hashing** is a solution that works in almost all such situations and, in practice, performs extremely well compared to the data structures above (array, linked list, balanced BST). With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case. Now let us understand what hashing is.
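
As a rough illustration of the idea, here is a minimal hash table that resolves collisions by chaining. It is a sketch, not how Python's `dict` is implemented: the class name, the fixed bucket count, and the absence of resizing are all simplifying assumptions.

```python
class ChainedHashTable:
    """Toy hash table: each bucket is a list of (key, value) pairs."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Map the key's hash onto one of the buckets.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]


phone_book = ChainedHashTable()
phone_book.insert("555-0101", {"name": "Alice"})
phone_book.insert("555-0123", {"name": "Bob"})
print(phone_book.search("555-0101"))  # {'name': 'Alice'}
phone_book.delete("555-0101")
print(phone_book.search("555-0101"))  # None
```

Because the number of buckets never grows here, chains get longer as more keys are added; real implementations resize the table to keep operations close to O(1) on average.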

In [28]:
# Create dictionary
dictionary_capitals = {'Lisboa': 'Portugal', 'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom'}
print (dictionary_capitals)
print (dictionary_capitals.keys())
print (dictionary_capitals.values())

# Searching in a dictionary: .get() returns None for a missing key
print(dictionary_capitals.get('Prague'))
# Insert a new key-value pair (with a deliberately wrong value)
dictionary_capitals['Berlin'] = 'Italy'
print (dictionary_capitals)
# Assigning to an existing key overwrites its value
dictionary_capitals['Berlin'] = 'Germany'
print (dictionary_capitals)
# Delete key-value pair
del dictionary_capitals['Lisboa']
print (dictionary_capitals)

dictionary_capitals.clear()
print (dictionary_capitals)


{'Lisboa': 'Portugal', 'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom'}
dict_keys(['Lisboa', 'Madrid', 'Lisbon', 'London'])
dict_values(['Portugal', 'Spain', 'Portugal', 'United Kingdom'])
None
{'Lisboa': 'Portugal', 'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom', 'Berlin': 'Italy'}
{'Lisboa': 'Portugal', 'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom', 'Berlin': 'Germany'}
{'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom', 'Berlin': 'Germany'}
{}


In [39]:
euroCaps={'Lisboa': 'Portugal', 'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom'}
euroCaps['London'] = 'Great Britain'
euroCaps['Berlin'] = 'Germany'

for key, value in euroCaps.items():
    print (f"{key}, {value}")


Lisboa, Portugal
Madrid, Spain
Lisbon, Portugal
London, Great Britain
Berlin, Germany


In [41]:
for key in euroCaps.keys():
    print (f"{key}")

for value in euroCaps.values():
    print (f"{value.upper()}")

Lisboa
Madrid
Lisbon
London
Berlin
PORTUGAL
SPAIN
PORTUGAL
GREAT BRITAIN
GERMANY


HashMap - Best Practices and Common Mistakes 
1. Keys must be hashable, which in practice means immutable - if the content of a key could change, the hash function would return a different hash and Python would not be able to find the value associated with the key. This is why built-in mutable types such as lists cannot be used as dictionary keys.
2. Addressing hashmap collisions -
       Hashing works best when each key maps to a unique location in the hash table, but hash functions can return the same output for different inputs.
       For example, with a division hash function, different integers may have the same hash value (they leave the same remainder under modulo division). This problem is called a collision.
       Collisions must be resolved, and several techniques exist, such as chaining and open addressing. Luckily, in the case of dictionaries, Python handles potential collisions under the hood.
3. Understanding load factor - The load factor is defined as the ratio of the number of elements in the table to the total number of buckets. It measures how full the table is. As a rule of thumb, the lower the load factor and the more evenly the data is distributed, the lower the likelihood of collisions.
4. Be aware of performance - A good hash function minimizes the number of collisions, is easy to compute, and distributes the items evenly across the hash table. Collisions can also be reduced by increasing the table size or the complexity of the hash function. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large, as it would result in memory-consuming, less efficient hashmaps.
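
The points above can be made concrete with a short sketch. The bucket count, the `division_hash` helper, and the sample values are assumptions chosen for illustration.

```python
# Keys must be hashable: a tuple works as a key, a list does not.
locations = {(38.72, -9.14): 'Lisbon'}
try:
    locations[[40.42, -3.70]] = 'Madrid'
except TypeError as err:
    print('TypeError:', err)

# A division (modulo) hash sends different keys to the same bucket.
NUM_BUCKETS = 10  # hypothetical table size

def division_hash(key):
    return key % NUM_BUCKETS

print(division_hash(12), division_hash(22), division_hash(32))  # 2 2 2

# Load factor = number of stored entries / number of buckets.
num_entries = 7
load_factor = num_entries / NUM_BUCKETS
print(load_factor)  # 0.7
```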

Are dictionaries what you need?

    Dictionaries are great, but other data structures may be more suitable for your specific data and needs. Dictionaries do not support common sequence operations, such as positional indexing, slicing, and concatenation with `+`, which makes them less flexible and more difficult to work with in certain scenarios.
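
A small sketch of what these limitations look like in practice, together with common workarounds. The variable names are illustrative.

```python
from itertools import islice

euro = {'Madrid': 'Spain', 'Lisbon': 'Portugal', 'London': 'United Kingdom'}

# Positional indexing is not supported: 0 is treated as a key.
try:
    euro[0]
except KeyError as err:
    print('KeyError:', err)

# Concatenation with + is not supported either.
try:
    euro + {'Berlin': 'Germany'}
except TypeError as err:
    print('TypeError:', err)

# Workarounds: take the first items in insertion order, merge by unpacking.
first_two = dict(islice(euro.items(), 2))
merged = {**euro, 'Berlin': 'Germany'}
print(first_two)
print(len(merged))  # 4
```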

**Alternative Python Hashmap Implementations**

Let’s see some of the most popular examples.

**Defaultdict** - Every time you try to access a key that is not present in your dictionary with square brackets, Python raises a KeyError. One way to avoid this is to look up values with the .get() method. A more convenient way is to use defaultdict, available in the collections module. A defaultdict behaves almost exactly like a dictionary. The main difference is that accessing a non-existent key with square brackets does not raise an error: the defaultdict calls its default factory, stores the result under that key, and returns it.



In [42]:
from collections import defaultdict 

# Defining the dict 
capitals = defaultdict(lambda: "The key doesn't exist") 
capitals['Madrid'] = 'Spain'
capitals['Lisboa'] = 'Portugal'
  
print(capitals['Madrid']) 
print(capitals['Lisboa']) 
print(capitals['Ankara']) 

Spain
Portugal
The key doesn't exist
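
One detail worth knowing: looking up a missing key with square brackets also inserts it, while `.get()` does not. A short sketch that reuses the same default factory as above:

```python
from collections import defaultdict

capitals = defaultdict(lambda: "The key doesn't exist")
capitals['Madrid'] = 'Spain'

# .get() does not call the default factory and does not add the key.
print(capitals.get('Ankara'))  # None
print('Ankara' in capitals)    # False

# Square-bracket access calls the factory and stores the result.
print(capitals['Ankara'])      # The key doesn't exist
print('Ankara' in capitals)    # True
print(len(capitals))           # 2
```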


**Counter** - Counter is a subclass of a Python dictionary that is specifically designed for counting hashable objects. It’s a dictionary where elements are stored as keys and their counts are stored as values.

There are several ways to initialize a Counter:

    From a sequence (any iterable) of items.

    From a dictionary of keys and counts.

    From keyword arguments (name=count).


In [43]:
from collections import Counter 

# a new counter from an iterable
c1 = Counter(['aaa','bbb','aaa','ccc','ccc','aaa'])
# a new counter from a mapping
c2 = Counter({'red': 4, 'blue': 2})     
# a new counter from keyword args
c3 = Counter(cats=4, dogs=8)       
# print results
print(c1)
print(c2)
print(c3)

Counter({'aaa': 3, 'ccc': 2, 'bbb': 1})
Counter({'red': 4, 'blue': 2})
Counter({'dogs': 8, 'cats': 4})


In [44]:
print('keys of the counter: ', c3.keys())
print('values of the counter: ',c3.values()) 
print('list with all elements: ', list(c3.elements())) 
print('number of elements: ', c3.total()) # sum of all counts; Counter.total() requires Python 3.10+
print('2 most common occurrences: ', c3.most_common(2)) # 2 most common occurrences 

keys of the counter:  dict_keys(['cats', 'dogs'])
values of the counter:  dict_values([4, 8])
list with all elements:  ['cats', 'cats', 'cats', 'cats', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs', 'dogs']
number of elements:  12
2 most common occurrences:  [('dogs', 8), ('cats', 4)]



**Scikit-learn hashing methods**

Scikit-learn (sklearn) comes with text feature-extraction tools built on hashmaps and hashing that can be very useful for feature engineering.

One of the most common is CountVectorizer. It transforms a collection of texts into vectors based on how often each word occurs in each document, and it keeps the learned vocabulary in a dictionary that maps each word to a column index. CountVectorizer is particularly helpful in text analysis contexts.

In [47]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
documents = ["Welcome to this new DataCamp Python course",
            "Welcome to this new DataCamp R skill track",
            "Welcome to this new DataCamp Data Analyst career track"]
 
# Create a Vectorizer Object
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)

# print unique values 
print('unique words: ', vectorizer.get_feature_names_out())


# show the word-frequency matrix (converted from sparse to dense) as a DataFrame
pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

unique words:  ['analyst' 'career' 'course' 'data' 'datacamp' 'new' 'python' 'skill'
 'this' 'to' 'track' 'welcome']


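|   | analyst | career | course | data | datacamp | new | python | skill | this | to | track | welcome |
|---|---------|--------|--------|------|----------|-----|--------|-------|------|----|-------|---------|
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 |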
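
To see where the hashmap comes in, the same counts can be reproduced with plain dictionaries. This is a simplified sketch that assumes CountVectorizer's default settings (lowercasing, tokens of two or more word characters); `tokenize` and `count_vector` are hypothetical helpers, not scikit-learn functions.

```python
import re
from collections import Counter

documents = ["Welcome to this new DataCamp Python course",
             "Welcome to this new DataCamp R skill track",
             "Welcome to this new DataCamp Data Analyst career track"]

# Assumption: tokens are runs of 2+ word characters, so "R" is dropped.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Vocabulary: a dict mapping each distinct token to a column index.
tokens = sorted({t for doc in documents for t in tokenize(doc)})
vocabulary = {token: i for i, token in enumerate(tokens)}

def count_vector(text):
    """Return one row of word counts, ordered by vocabulary index."""
    row = [0] * len(vocabulary)
    for token, n in Counter(tokenize(text)).items():
        if token in vocabulary:
            row[vocabulary[token]] = n
    return row

matrix = [count_vector(doc) for doc in documents]
print(tokens)
for row in matrix:
    print(row)
```

The rows match the word-frequency matrix above, one list per document.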
