# **Bloom Filter**

    
   A **Bloom filter** is a probabilistic data structure used to test whether an element belongs to a set. It can tell us with certainty that an element is not in the set, but it can only tell us that an element is probably in the set.
    
   How does a Bloom filter work? It combines ideas from hash sets and bitsets into a single data structure. We create a bitset with m bits, all initially 0, and hash each element with k hash functions. Each hash selects one position in the bitset, so adding an element sets up to k entries to 1.
    
   When we want to check whether an element is present in the filter, we must hash it with the same k hash functions that were used to add elements. If all the selected entries are 1, the element is probably in the Bloom filter. If one or more entries are 0, the element is definitely not in the filter.
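
   The add and check steps described above can be sketched with the standard library alone. This is a toy illustration, not the implementation used later in this notebook: the values of `M` and `K` are arbitrary, the helper names are hypothetical, and salting SHA-256 with the hash number is an assumed stand-in for k independent hash functions.

```python
import hashlib

M = 32  # number of bits in the toy filter (illustrative value)
K = 3   # number of hash functions (illustrative value)

bits = [0] * M  # the bitset, initially all zeros


def positions(element):
    """Return the K bit positions selected for an element.

    Salting SHA-256 with the hash number stands in for K separate
    hash functions; this is an assumption made for the sketch.
    """
    result = []
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % M)
    return result


def add(element):
    # Set every selected position to 1
    for p in positions(element):
        bits[p] = 1


def might_contain(element):
    # True only if every selected position is 1; a single 0 proves absence
    return all(bits[p] == 1 for p in positions(element))


add("PT 42531")
print(positions("PT 42531"))      # the positions that were set
print(might_contain("PT 42531"))  # an added element is always reported as present
```

   An element that was never added is usually rejected because at least one of its positions is still 0, but nothing prevents all of its positions from having been set by other elements.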
    
   In some cases, all entries may be 1 for an element that was never added, because other elements happened to set the same bits. This is where the probabilistic nature of the filter shows itself, by giving us false positives. The false positive probability can be approximated from the number of hash functions, k, the number of elements stored in the filter (its capacity), c, and the number of bits, m:  $$ \left(1 - e^{-kc/m}\right)^k $$
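
   To get a feel for the numbers, the formula can be evaluated directly. The sketch below is only illustrative: the function name `false_positive_probability` is hypothetical, and the sample values (m = 29 bits, k = 7 hashes, c = 3 elements) are the parameters of the small test filter built later in this notebook.

```python
import math


def false_positive_probability(k, c, m):
    """Approximate false positive probability (1 - e^(-k*c/m))^k.

    k: number of hash functions
    c: number of elements stored in the filter
    m: number of bits in the filter
    """
    return (1 - math.exp(-k * c / m)) ** k


p = false_positive_probability(k=7, c=3, m=29)
print(f"{p:.4f}")  # a little under 0.01, i.e. just below 1%
```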
    
   Because the filter stores only the bits set by hashes of the elements, rather than the elements themselves, it is very space-efficient. Bloom filters are therefore a good fit for applications where memory is limited and an occasional false positive is acceptable.

   **Possible application of the algorithm:**

   Build software whose goal is to prevent fraud in the transactions that go through a banking institution.
    
   If the tax identification number (NIF) of the person who is performing the transaction exists in a set of NIFs classified as fraudulent, then that transaction should be denied.
    
   A simple implementation would load every NIF classified as fraudulent into the filter and, for every transaction, check whether the NIF involved is a member. Because false positives are possible, the bank then verifies whether the transactions flagged as fraudulent really are fraudulent.
    
   **Packages used:**
    
   The math module is part of the Python standard library and provides access to mathematical functions.
    
   The mmh3 package is a Python library for MurmurHash3, a family of fast, non-cryptographic hash functions.
    
   The bitarray package provides an object type that efficiently represents an array of Booleans.

    


In [1]:
# Section dedicated to the imports
import mmh3
import math
from bitarray import bitarray

In [20]:

class BloomFilter:
    
    # Initializes the BloomFilter
    def __init__(self, num_items, error_percent):
        # Filter size in bits for num_items elements and a false positive probability of
        # error_percent %: m = -n * ln(p) / (ln 2)^2 (formula from Wikipedia)
        self.size = math.ceil(-(num_items * math.log(error_percent/100))/(math.log(2)**2))
        
        # Ideal number of hash functions: k = (m / n) * ln 2 (formula from Wikipedia);
        # at least one hash function is always used
        self.num_hashes = max(1, round((self.size/num_items) * math.log(2)))
        
        # Create a bitarray with the calculated size and set all bits to zero
        self.filter = bitarray(self.size)
        self.filter.setall(0)
        
    
    
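    # Hashes the NIB once per seed (0 .. num_hashes - 1) and sets the resulting bits in the filter.
    def add_element(self, nib):
        for t in range(self.num_hashes):
            # mmh3.hash returns a signed 32-bit integer; Python's % keeps the index non-negative
            index = mmh3.hash(nib, t) % self.size
            self.filter[index] = 1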
    
    
    # Checks whether the NIB may be present: False is definitive, True may be a false positive
    def is_member(self, nib):
        for t in range(self.num_hashes):
            index = mmh3.hash(nib, t) % self.size
            if self.filter[index] == 0:
                return False
        return True
    
    # Auxiliary functions just to see some statistics regarding the filter.
    def get_bloom(self):
        return self.filter
    
    def get_num_hashes(self):
        return self.num_hashes
    
    def get_filter_size(self):
        return self.size
    

In [23]:
# Tests just to see if the base implementation is working as expected.

# Instantiating the Bloom filter for 3 elements and a 1% chance of false positives
bloom = BloomFilter(3, 1)

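# Adding some mock NIBs to the filter. Adding nib a second time sets the same bits again,
# so it has no further effect; nib3 is never added and is expected to be reported as absent.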
nib = 'PT 42531'
nib2 = 'PT 3455'
nib3 = 'PT 23452'
bloom.add_element(nib)
bloom.add_element(nib)
bloom.add_element(nib2)

# Printing some statistics
print('Contents of the filter:\n->',bloom.get_bloom())
print('Filter size:\n->', bloom.get_filter_size())
print('Ideal number of hashes:\n->', bloom.get_num_hashes())

# Checking if the elements are present or not
print(f'\nIs \"{nib}\" a member of the filter?\n->',bloom.is_member(nib))
print(f'\nIs \"{nib2}\" a member of the filter?\n->',bloom.is_member(nib2))
print(f'\nIs \"{nib3}\" a member of the filter?\n->',bloom.is_member(nib3))


Contents of the filter:
-> bitarray('00110010110000101010110010111')
Filter size:
-> 29
Ideal number of hashes:
-> 7

Is "PT 42531" a member of the filter?
-> True

Is "PT 3455" a member of the filter?
-> True

Is "PT 23452" a member of the filter?
-> False
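
The filter size and number of hashes printed above can be reproduced from the sizing formulas used in the constructor, m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The following is a small standalone sketch of that arithmetic; the helper name `bloom_parameters` is hypothetical and is not part of the class.

```python
import math


def bloom_parameters(num_items, error_percent):
    """Return (size in bits, number of hash functions) for a Bloom filter.

    num_items: expected number of elements, n
    error_percent: target false positive probability, in percent
    """
    p = error_percent / 100
    m = math.ceil(-(num_items * math.log(p)) / (math.log(2) ** 2))
    k = round((m / num_items) * math.log(2))
    return m, k


print(bloom_parameters(3, 1))  # (29, 7), matching the statistics printed above
```

For 100000 elements and a 0.5% false positive probability, the same arithmetic gives roughly 1.1 million bits and 8 hash functions.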


In [32]:
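# Initial instantiation
num_items = 100000  # expected number of NIFs to store
prob = 0.5          # target false positive probability, in percent (0.5%)

bloom_filter = BloomFilter(num_items, prob)

# Initial data import: one NIF per line
with open('data_nifs.csv', 'r') as file:
    for line in file:
        bloom_filter.add_element(line.strip())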
        
print(bloom_filter.is_member('582914276'))


True
