<a href="https://colab.research.google.com/github/gmelaku/Assignment1/blob/master/lab_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab-2: Bloom Filter
# 1: Please upload "random_ip.txt" (provided at the beginning of the Lab-2 instructions) inside a "sample_data" directory
# 2: Fill in the right-hand side below each # TODO comment in the code. There are 2 such lines in create_bloom_filter() and 1 such line in validate_bloom_filter()

If you do not have Spark and Hadoop set up properly on your system, use the following lines of code to download and set them up:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz

!pip install -q findspark

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

We will have three functions:

1. create_spark_context(): creates a new SparkContext or returns the existing one

2. create_bloom_filter(): creates a Bloom filter

3. validate_bloom_filter(): validates a Bloom filter
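
Before bringing Spark in, it may help to see the same idea in plain Python. The sketch below is only an illustration, not the lab solution: the helper names (`bloom_indices`, `build_filter`, `might_contain`) and the sample addresses are made up, and only the four hash functions and the `10 ** num_of_digits` modulus are taken from the lab code.

```python
import hashlib

# Hypothetical helper names; only the four hash functions come from the lab code.
HASHES = (hashlib.md5, hashlib.sha224, hashlib.sha256, hashlib.sha384)

def bloom_indices(item, num_of_digits):
    """Return the four indices an item maps to, one per hash function."""
    modulus = 10 ** num_of_digits
    return [int(h(item.encode("utf-8")).hexdigest(), 16) % modulus for h in HASHES]

def build_filter(items, num_of_digits):
    """Collect the sorted, de-duplicated indices that would be set to 1."""
    indices = set()
    for item in items:
        indices.update(bloom_indices(item, num_of_digits))
    return sorted(indices)

def might_contain(filter_indices, item, num_of_digits):
    """True only if every index for the item is present in the filter."""
    present = set(filter_indices)
    return all(i in present for i in bloom_indices(item, num_of_digits))

sample = ["10.0.0.1", "192.168.1.7", "172.16.0.42"]  # made-up addresses
flt = build_filter(sample, num_of_digits=3)
# Inserted items are never reported as missing (no false negatives).
print(all(might_contain(flt, ip, 3) for ip in sample))  # True
```

An item that was never inserted can still come back as "possibly present" if all four of its indices happen to be set by other items; that is the false-positive case the lab's messages refer to.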

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark import SparkContext

def create_spark_context():
  return SparkContext.getOrCreate()


In [3]:
from google.colab import files
uploaded = files.upload()

Saving random_ip.txt to random_ip.txt


In [0]:
import hashlib
import math
in_fname = "./random_ip.txt"
def create_bloom_filter(in_fname, out_fname):
    """Creates a Bloom filter index from the input file and writes it to out_fname"""
    # Create Spark Context
    sc = create_spark_context()

    # Read the input file into an RDD
    in_data = sc.textFile(in_fname)

    # First we need to find out how large our Bloom filter should be
    # This will determine the length of the indices we will get from the hash functions

    # In this implementation, the number of hash functions is preset to 4
    k = 4
    # Count the number of elements in the input
    n = in_data.count()
    # m gives how many indices we need
    m = k * n / (math.log(2))
    # Number of digits we need to take out of each hash function to get the indices
    num_of_digits = len(str(int(m)))

    print("*********************************************************")
    print("Bloom filter will use indices of length:", num_of_digits)
    print("*********************************************************")

    # Split each input line into whitespace-separated elements
    in_split = in_data.flatMap(lambda x: x.split())

    # Apply four different hash functions to create the Bloom filter
    # Each one of these RDDs holds the indices that will be set to 1
    # to mark the existence of the input elements
    in_md5    = in_split.map(lambda x: int(hashlib.md5(x.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits))
    in_sha224 = in_split.map(lambda x: int(hashlib.sha224(x.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits))
    in_sha256 = in_split.map(lambda x: int(hashlib.sha256(x.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits))
    in_sha384 = in_split.map(lambda x: int(hashlib.sha384(x.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits))

    # Union all sets of indices, so we know the full collection of indices that have to be set to 1
    in_allhashes = in_md5.union(in_sha384).union(in_sha256).union(in_sha224)

    # As multiple hash functions can map the same input element to the same index, we need to remove duplicates

    # HINT!! : WE NEED TO GET SORTED INDICES AFTER GETTING UNIQUE INDICES
    # Create a (key, value) pair per index, so we can use reduceByKey to remove duplicates
    # TODO: FILL OUT RIGHT HAND SIDE: generate a mapper to map each index inside in_allhashes to 1, i.e. (index, 1)
    in_mapped = in_allhashes.map(lambda x: (x, 1))

    # Use reduceByKey to remove duplicate indices, and sort the result by index
    # TODO: FILL OUT RIGHT HAND SIDE: merge duplicate entries and then sort them
    in_nodupes = in_mapped.reduceByKey(lambda x, y: x + y).sortByKey()

    # This is the list of indices set to 1 in the Bloom filter
    bloom_indices = in_nodupes.map(lambda x: x[0])

    # Save Bloom filter indices to output file
    bloom_indices.saveAsTextFile(out_fname)
    sc.stop()
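
The sizing rule used above, m = k * n / ln(2), can be tried on its own without Spark. The following is a rough sketch: `index_digits` and `estimated_false_positive_rate` are hypothetical helper names, `n = 100` is an assumed element count rather than the real size of random_ip.txt, and the false-positive formula is the usual textbook approximation, not something the lab code computes.

```python
import math

def index_digits(n, k=4):
    """Digits needed for indices when m = k * n / ln(2), as in the lab code."""
    m = k * n / math.log(2)
    return len(str(int(m)))

def estimated_false_positive_rate(n, k, slots):
    """Textbook approximation (1 - e^(-k*n/slots))^k for a Bloom filter."""
    return (1 - math.exp(-k * n / slots)) ** k

n = 100                  # assumed element count, not taken from random_ip.txt
digits = index_digits(n) # m is about 577, so 3 digits
slots = 10 ** digits     # the lab code takes indices modulo 10 ** digits
print(digits, estimated_false_positive_rate(n, 4, slots))
```

Because the indices are taken modulo `10 ** num_of_digits` rather than modulo m, the filter effectively has somewhat more slots than m, which under this approximation lowers the estimated false-positive rate.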

In [0]:
def validate_bloom_filter(infile, item, num_of_digits):
    """Checks if an item exists in a set of indices for a bloom filter"""
    # Create Spark Context
    sc = create_spark_context()

    # Get the index value to check for the four hash functions we are using
    index_md5    = int(hashlib.md5(item.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits)
    index_sha224 = int(hashlib.sha224(item.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits)
    index_sha256 = int(hashlib.sha256(item.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits)
    index_sha384 = int(hashlib.sha384(item.encode("utf-8")).hexdigest(),16) % (10 ** num_of_digits)

    # Read the file with the Bloom filter indices previously created
    bloom_filter_indices = sc.textFile(infile)

    # Check whether each of the four indices is present in the Bloom filter
    is_index1 = bloom_filter_indices.filter(lambda x: int(x) == index_md5).count() > 0
    is_index2 = bloom_filter_indices.filter(lambda x: int(x) == index_sha224).count() > 0
    is_index3 = bloom_filter_indices.filter(lambda x: int(x) == index_sha256).count() > 0
    is_index4 = bloom_filter_indices.filter(lambda x: int(x) == index_sha384).count() > 0

    # Get logical AND of all index existence flags
    # TODO: FILL OUT RIGHT HAND SIDE: perform logical AND of all index existence flags
    is_element_there = is_index1 and is_index2 and is_index3 and is_index4
    # Print message with results depending on indices
    if is_element_there:
        print("Element", item, "is possibly in the set of elements.")
    else:
        print("Element", item, "is definitely not in the set of elements.")
    sc.stop()
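
The function above runs four separate Spark jobs for every lookup. When the saved index is small, one possible alternative is to read it into a Python set once and test membership locally. The sketch below assumes the index was written by `saveAsTextFile()`, which produces a directory of `part-*` files with one index per line; the helper names (`item_indices`, `load_indices`, `possibly_present`) are hypothetical, and the demo uses a temporary directory that only imitates that layout.

```python
import glob
import hashlib
import os
import tempfile

HASHES = (hashlib.md5, hashlib.sha224, hashlib.sha256, hashlib.sha384)

def item_indices(item, num_of_digits):
    """Indices for one item, computed the same way as in validate_bloom_filter()."""
    modulus = 10 ** num_of_digits
    return [int(h(item.encode("utf-8")).hexdigest(), 16) % modulus for h in HASHES]

def load_indices(index_dir):
    """Read every part-* file of a saveAsTextFile() output directory into a set."""
    indices = set()
    for path in glob.glob(os.path.join(index_dir, "part-*")):
        with open(path) as fh:
            indices.update(int(line) for line in fh if line.strip())
    return indices

def possibly_present(indices, item, num_of_digits):
    """Logical AND over the four index lookups."""
    return all(i in indices for i in item_indices(item, num_of_digits))

# Demo against a temporary directory that imitates Spark's output layout.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "part-00000"), "w") as fh:
        for i in sorted(set(item_indices("XKXDT7WZPQ", 3))):
            fh.write(f"{i}\n")
    loaded = load_indices(demo_dir)

found = possibly_present(loaded, "XKXDT7WZPQ", 3)  # its indices were written above
missing = possibly_present(set(), "AAAAAAAAAA", 3)  # an empty filter contains nothing
print(found, missing)  # True False
```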

Now we will create the Bloom filter indices for the data in our input file.

In [0]:
infile = '/content/sample_data/random_ip.txt'
outfile = '/content/sample_data/index'
create_bloom_filter(infile, outfile)

Now we will check whether a given item possibly belongs to the set represented by the indices we created with the Bloom filter.

In [0]:
infile = '/content/sample_data/index' # path to the index directory written by create_bloom_filter()
num_of_digits = 3 # must match the index length printed by create_bloom_filter(infile, outfile)

item = 'XKXDT7WZPQ'
validate_bloom_filter(infile, item, num_of_digits)

item = 'AAAAAAAAAA'
validate_bloom_filter(infile, item, num_of_digits)
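
A "possibly in the set" answer can be a false positive. To get a feel for how often that happens, one option is to probe a filter with items known to be absent and count the hits. The sketch below does this in plain Python on synthetic data: the 100 random 10-character items stand in for the real input file, and `indices_for` and `random_item` are hypothetical helpers, so the printed rate says nothing about random_ip.txt itself.

```python
import hashlib
import random
import string

HASHES = (hashlib.md5, hashlib.sha224, hashlib.sha256, hashlib.sha384)

def indices_for(item, num_of_digits):
    """Set of indices an item maps to under the four lab hash functions."""
    modulus = 10 ** num_of_digits
    return {int(h(item.encode("utf-8")).hexdigest(), 16) % modulus for h in HASHES}

def random_item(rng, length=10):
    """Random uppercase/digit string, similar in shape to the lab's test items."""
    return "".join(rng.choice(string.ascii_uppercase + string.digits) for _ in range(length))

rng = random.Random(0)  # fixed seed so runs are repeatable
num_of_digits = 3
members = {random_item(rng) for _ in range(100)}  # synthetic stand-in for the input
filter_indices = set().union(*(indices_for(m, num_of_digits) for m in members))

# Probe with items that are known not to be members.
probes = [p for p in (random_item(rng) for _ in range(2000)) if p not in members]
false_positives = sum(indices_for(p, num_of_digits) <= filter_indices for p in probes)
print("observed false positive rate:", false_positives / len(probes))
```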