# Big Data Platform
## Assignment 3: Serverless

**By:**  

Omri Newman, 204866586<br> 
<br><br>

**The goal of this assignment is to:**
- Understand and practice the details of Serverless

**Instructions:**
- Students will form teams of two people each, and submit a single homework for each team.
- The same score for the homework will be given to each member of your team.
- Your solution is in the form of a Jupyter notebook file (with extension ipynb).
- Images/Graphs/Tables should be submitted inside the notebook.
- The notebook should be runnable and properly documented. 
- Please answer all the questions and include all your code.
- You are expected to submit clear, Pythonic code.
- You can change function signatures/definitions.

**Submission:**
- Submission of the homework will be done via Moodle by uploading (not Zip):
    - Jupyter Notebook
    - 2 Log files
    - Additional local scripts
- The homework needs to be entirely in English.
- The deadline for submission is on Moodle.
- Late submission won't be allowed.

  
- In case of identical code submissions - both groups will get a Zero. 
- Some groups might be selected randomly to present their code.

**Requirements:**  
- Python 3.6 should be used.  
- You should implement the algorithms by yourself using only basic Python libraries (such as numpy, pandas, etc.).

<br><br><br><br>

**Grading:**
- Q0 - 10 points - Setup
- Q1 - 40 points - Serverless MapReduceEngine
- Q2 - 20 points - MapReduce job to calculate inverted index
- Q3 - 30 points - Shuffle

`Total: 100`

In [1]:
# !pip install git+https://github.com/lithops-cloud/lithops.git

In [2]:
import lithops
import sqlite3
import pandas as pd
import numpy as np
import os

# Question 0
## Setup

1. Navigate to IBM Cloud and open a trial account. There is no need to provide a credit card.
2. Choose IBM Cloud Object Storage service from the catalog
3. Create a new bucket in IBM Cloud Object Storage
4. Create credentials for the bucket with HMAC (access key and secret key)
5. Choose IBM Cloud Functions service from the catalog and create a service


#### Lithops setup
1. Using the `git` tool, install the master branch of the Lithops project from
https://github.com/lithops-cloud/lithops
2. Follow Lithops documentation and configure Lithops against IBM Cloud Functions and IBM Cloud Object Storage
3. Configure Lithops log level to be in DEBUG mode
4. Run the Hello World example using the Futures API and verify that everything works properly.


#### IBM Cloud Object Storage setup
1. Upload all the input CSV files that you used in homework 2 into the bucket you created in IBM Cloud Object Storage


<br><br><br>

We've decided to work with AWS Lambda and S3. The configuration below contains the credentials and settings required for both the object storage and the FaaS backend.

In [3]:
# Credentials are read from environment variables (the variable names are an assumption)
# instead of being hard-coded, so that secrets never appear in the notebook.
config = {'lithops': {'backend': 'aws_lambda', 'storage': 'aws_s3'},
          'aws': {'access_key_id': os.environ['AWS_ACCESS_KEY_ID'],
                  'secret_access_key': os.environ['AWS_SECRET_ACCESS_KEY'],
                  'account_id': '434994646116'},
          'aws_lambda': {'execution_role': 'arn:aws:iam::434994646116:role/lithops-execution-role',
                         'region_name': 'us-east-1'},
          'aws_s3': {'storage_bucket': 'lithops-bucket', 'region_name': 'us-east-1'}}

We instantiate a `ServerlessExecutor` with the configuration above, following the Lithops documentation.

Let us test our connection by running the `hello_world` example in lithops documentation.

In [15]:
def hello_world(name):
    return 'Hello {}!'.format(name)

with lithops.ServerlessExecutor(config=config, log_level='DEBUG') as lexec:
    lexec.call_async(hello_world, 'World')
    print(lexec.get_result())

2022-01-03 22:07:41,659 [INFO] lithops.config -- Lithops v2.5.9.dev0
2022-01-03 22:07:41,660 [DEBUG] lithops.config -- Loading Serverless backend module: aws_lambda
2022-01-03 22:07:41,661 [DEBUG] lithops.config -- Loading Storage backend module: aws_s3
2022-01-03 22:07:41,662 [DEBUG] lithops.storage.backends.aws_s3.aws_s3 -- Creating S3 client
2022-01-03 22:07:41,668 [INFO] lithops.storage.backends.aws_s3.aws_s3 -- S3 client created - Region: us-east-1
2022-01-03 22:07:41,669 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating AWS Lambda client
2022-01-03 22:07:41,669 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating Boto3 AWS Session and Lambda Client
2022-01-03 22:07:41,792 [INFO] lithops.serverless.backends.aws_lambda.aws_lambda -- AWS Lambda client created - Region: us-east-1
2022-01-03 22:07:41,792 [DEBUG] lithops.invokers -- ExecutorID 02e085-3 - Invoker initialized. Max workers: 1000
2022-01-03 22:07:41,792 [DEBUG] lithops.invokers -- Execu

Hello World!


2022-01-03 22:07:48,150 [DEBUG] lithops.invokers -- ExecutorID 02e085-3 - Async invoker 1 finished
2022-01-03 22:07:48,150 [DEBUG] lithops.invokers -- ExecutorID 02e085-3 - Async invoker 0 finished


# Question 1
## Serverless MapReduceEngine

Modify MapReduceEngine from homework 2 into the MapReduceServerlessEngine, where map and reduce tasks are executed as serverless actions instead of local threads. In particular:
1. Deploy all map tasks as serverless actions by using Lithops against IBM Cloud Functions.
2. Collect the results from all map tasks, store them in the same SQLite database you used in MapReduceEngine, and use the same code for the sort and shuffle phase.
3. Deploy the reduce tasks by using Lithops against IBM Cloud Functions. Instead of persisting the results of the reduce tasks, return them to the MapReduceServerlessEngine and proceed with the same workflow as in MapReduceEngine.
4. Return the results of the reduce tasks to the user.

**Please attach:**  
Text file with all log messages Lithops printed to console during the execution. Make
sure log level is set to DEBUG mode.

#### Code:

In [6]:
conn = sqlite3.connect('mydb.db')

In [7]:
c = conn.execute('''CREATE TABLE IF NOT EXISTS temp_results(
                    key VARCHAR(20),
                    value VARCHAR(20)
                    )''')
conn.commit()

Checking the schema:

In [8]:
pd.read_sql('SELECT * FROM temp_results', conn).head()

| key | value |
|-----|-------|


Here we define `MapReduceServerlessEngine`. This is a modified version of `MapReduceEngine` from HW2. Some remarks:
1. We use `with` statements so that the connection to the cloud is closed as soon as each phase finishes.
2. The map result returned by `executor.get_result()` is converted to a 3-D array (tasks × pairs per task × 2), so it is reshaped into a 2-D table of (key, value) rows before being written to SQLite. This assumes every map task returns the same number of pairs.
3. Originally, the reduce input built in the shuffle phase was a list of NumPy records. We convert them to tuples so that they work with Lithops.

In [9]:
class MapReduceServerlessEngine():
    @staticmethod
    def execute(input_data, map_function, reduce_function, conn, config):
        """Run map and reduce as serverless tasks, with a local SQLite shuffle in between."""
        # Map: one serverless invocation per item in input_data
        with lithops.ServerlessExecutor(config=config, log_level='DEBUG') as executor:
            executor.map(map_function, input_data)
            results = executor.get_result()
        # Flatten (tasks, pairs per task, 2) into (all pairs, 2) and store the pairs in SQLite
        arr = np.array(results)
        pd.DataFrame(arr.reshape(arr.shape[0]*arr.shape[1],arr.shape[2]),
                     columns=['key', 'value']).to_sql('temp_results', conn, if_exists='replace', index=False)
        # Shuffle: group all values that share a key into one comma-separated string
        sql = '''SELECT key, GROUP_CONCAT(value)
         FROM temp_results
         GROUP BY key
         ORDER BY key'''
        # Convert NumPy records to plain tuples so Lithops can pass them as arguments
        reduce_input = list(map(tuple, pd.read_sql(sql, conn).to_records(index=False)))
        # Reduce: one serverless invocation per key
        with lithops.ServerlessExecutor(config=config, log_level='DEBUG') as executor:
            executor.map(reduce_function, reduce_input)
            return executor.get_result()
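
To make the shuffle phase easier to follow in isolation, here is a minimal local sketch that uses only the standard-library `sqlite3` module and an in-memory database. The sample pairs are made up for illustration; only the table name and the grouping query are taken from the engine above.

```python
import sqlite3

# Hypothetical map output: (key, value) pairs as the map tasks would return them.
pairs = [
    ("city_Haifa", "myCSV1.csv"),
    ("city_Kiel", "myCSV2.csv"),
    ("city_Haifa", "myCSV3.csv"),
]

shuffle_conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
shuffle_conn.execute("CREATE TABLE temp_results (key TEXT, value TEXT)")
shuffle_conn.executemany("INSERT INTO temp_results VALUES (?, ?)", pairs)

# Same grouping query as in the shuffle phase of the engine.
shuffle_sql = """SELECT key, GROUP_CONCAT(value)
                 FROM temp_results
                 GROUP BY key
                 ORDER BY key"""
reduce_input = [tuple(row) for row in shuffle_conn.execute(shuffle_sql)]
shuffle_conn.close()

# Each tuple holds a key and one comma-separated string of its values.
for key, documents in reduce_input:
    print(key, "->", documents)
```

Each resulting tuple is what a single reduce task receives. The order of the values inside a concatenated string is not guaranteed by SQLite, which is one reason the reduce function works on a set.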

We modify the map and reduce functions from HW2 to work with lithops:

In [10]:
def inverted_map(document_name, doc_object):
    """Parse csv file into a list creating (key, value) pairs. Keys are entries 
       from the csv, and values are the csv filename."""
    file = doc_object.decode('utf-8').strip().replace('\r\n', '\n')
    lst = []
    rows = file.split('\n')
    title_row = rows[0].split(',')
    for line in rows[1:]:
        dic = dict(zip(title_row, line.strip().split(',')))
        lst.extend([(f'{key}_{value}', document_name) for key, value in dic.items()])
    return lst

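def inverted_reduce(value, documents):
    """Return [key, documents], where documents is a comma-separated string
       of the unique file names in which the key appears."""
    # documents arrives as a single comma-separated string built by GROUP_CONCAT
    unique_documents = set(map(str.strip, documents.split(',')))
    return [value, ','.join(unique_documents)]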
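
Before deploying to the cloud, the two functions can be checked locally on a tiny input. The sketch below repeats them so that it runs on its own; the sample CSV bytes and the file names (`sample.csv`, `a.csv`, `b.csv`) are hypothetical.

```python
def inverted_map(document_name, doc_object):
    """Turn the bytes of a CSV file into (column_value, document_name) pairs."""
    file = doc_object.decode('utf-8').strip().replace('\r\n', '\n')
    lst = []
    rows = file.split('\n')
    title_row = rows[0].split(',')
    for line in rows[1:]:
        dic = dict(zip(title_row, line.strip().split(',')))
        lst.extend([(f'{key}_{value}', document_name) for key, value in dic.items()])
    return lst


def inverted_reduce(value, documents):
    """Return [key, comma-separated unique document names]."""
    return [value, ','.join(set(map(str.strip, documents.split(','))))]


# Hypothetical two-row CSV, passed as bytes the way object storage would return it.
sample = b"city,name\r\nHaifa,Dana\r\nKiel,Jan\r\n"
mapped = inverted_map("sample.csv", sample)
print(mapped)

# Duplicates and stray spaces in the shuffled string are removed by the reducer.
reduced = inverted_reduce("city_Haifa", "a.csv, b.csv,a.csv")
print(reduced)
```

The map call yields one pair per cell of the CSV, and the reduce call keeps each document name once. The order of names in the reduced string may vary between runs because it comes from a set.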

# Question 2
## Submit MapReduce job to calculate inverted index
1. Use input_data: `cos://bucket/<path to CSV data>`
2. Submit MapReduce job with reduce and map functions as you used in homework 2, as follows

    `mapreduce = MapReduceServerlessEngine()`  
    `results = mapreduce.execute(input_data, inverted_map, inverted_index)`   
    `print(results)`

**Please attach:**  
Text file with all log messages Lithops printed to console during the execution. Make
sure log level is set to DEBUG mode.

#### Code:

We start by building the input data from the object storage. Each item in the input data is a tuple containing the name of a CSV file and the file's contents.

In [16]:
with lithops.ServerlessExecutor(config=config, log_level='DEBUG') as lexec:
    input_data =  list(zip([f'myCSV{i}.csv' for i in range(1,21)],
                       [lexec.internal_storage.get_data(f'csv_files/myCSV{i}.csv') for i in range(1,21)]))

2022-01-03 22:11:24,215 [INFO] lithops.config -- Lithops v2.5.9.dev0
2022-01-03 22:11:24,216 [DEBUG] lithops.config -- Loading Serverless backend module: aws_lambda
2022-01-03 22:11:24,217 [DEBUG] lithops.config -- Loading Storage backend module: aws_s3
2022-01-03 22:11:24,218 [DEBUG] lithops.storage.backends.aws_s3.aws_s3 -- Creating S3 client
2022-01-03 22:11:24,225 [INFO] lithops.storage.backends.aws_s3.aws_s3 -- S3 client created - Region: us-east-1
2022-01-03 22:11:24,226 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating AWS Lambda client
2022-01-03 22:11:24,227 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating Boto3 AWS Session and Lambda Client
2022-01-03 22:11:24,338 [INFO] lithops.serverless.backends.aws_lambda.aws_lambda -- AWS Lambda client created - Region: us-east-1
2022-01-03 22:11:24,339 [DEBUG] lithops.invokers -- ExecutorID 02e085-4 - Invoker initialized. Max workers: 1000
2022-01-03 22:11:24,340 [DEBUG] lithops.invokers -- Execu

Next, we run our MapReduce engine and get the results. Note the attached log files.

In [12]:
mapreduce = MapReduceServerlessEngine()
results = mapreduce.execute(input_data, inverted_map, inverted_reduce, conn, config)  

2022-01-03 20:58:28,429 [INFO] lithops.config -- Lithops v2.5.9.dev0
2022-01-03 20:58:28,432 [DEBUG] lithops.config -- Loading Serverless backend module: aws_lambda
2022-01-03 20:58:28,438 [DEBUG] lithops.config -- Loading Storage backend module: aws_s3
2022-01-03 20:58:28,439 [DEBUG] lithops.storage.backends.aws_s3.aws_s3 -- Creating S3 client
2022-01-03 20:58:28,446 [INFO] lithops.storage.backends.aws_s3.aws_s3 -- S3 client created - Region: us-east-1
2022-01-03 20:58:28,447 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating AWS Lambda client
2022-01-03 20:58:28,448 [DEBUG] lithops.serverless.backends.aws_lambda.aws_lambda -- Creating Boto3 AWS Session and Lambda Client
2022-01-03 20:58:28,565 [INFO] lithops.serverless.backends.aws_lambda.aws_lambda -- AWS Lambda client created - Region: us-east-1
2022-01-03 20:58:28,566 [DEBUG] lithops.invokers -- ExecutorID 02e085-1 - Invoker initialized. Max workers: 1000
2022-01-03 20:58:28,569 [DEBUG] lithops.invokers -- Execu

2022-01-03 20:58:34,121 [DEBUG] lithops.future -- ExecutorID 02e085-1 | JobID M000 - Got status from call 00005 - Activation ID: f85dac6e-37dd-4651-8f33-033e1643fc9b - Time: 0.29 seconds
2022-01-03 20:58:34,127 [DEBUG] lithops.future -- ExecutorID 02e085-1 | JobID M000 - Got status from call 00010 - Activation ID: 99240487-7fae-4947-9232-9080ca59131d - Time: 0.24 seconds
2022-01-03 20:58:34,428 [DEBUG] lithops.future -- ExecutorID 02e085-1 | JobID M000 - Got output from call 00005 - Activation ID: f85dac6e-37dd-4651-8f33-033e1643fc9b
2022-01-03 20:58:34,591 [DEBUG] lithops.monitor -- ExecutorID 02e085-1 - Pending: 11 - Running: 0 - Done: 9
2022-01-03 20:58:34,753 [DEBUG] lithops.future -- ExecutorID 02e085-1 | JobID M000 - Got output from call 00010 - Activation ID: 99240487-7fae-4947-9232-9080ca59131d
2022-01-03 20:58:34,764 [DEBUG] lithops.future -- ExecutorID 02e085-1 | JobID M000 - Got status from call 00007 - Activation ID: 30d80487-55ed-4477-99c7-20337fed88ea - Time: 0.22 seconds

2022-01-03 20:58:39,128 [INFO] lithops.invokers -- ExecutorID 02e085-2 | JobID M000 - Starting function invocation: inverted_reduce() - Total: 23 activations
2022-01-03 20:58:39,128 [DEBUG] lithops.invokers -- ExecutorID 02e085-2 | JobID M000 - Worker processes: 1 - Chunksize: 1
2022-01-03 20:58:39,135 [DEBUG] lithops.invokers -- ExecutorID 02e085-2 - Async invoker 0 started
2022-01-03 20:58:39,142 [DEBUG] lithops.invokers -- ExecutorID 02e085-2 - Async invoker 1 started
2022-01-03 20:58:39,144 [DEBUG] lithops.invokers -- ExecutorID 02e085-2 | JobID M000 - Free workers: 1000 - Going to run 23 activations in 23 workers
2022-01-03 20:58:39,207 [INFO] lithops.invokers -- ExecutorID 02e085-2 | JobID M000 - View execution logs at C:\Users\guypa\AppData\Local\Temp\lithops\logs\02e085-2-M000.log
2022-01-03 20:58:39,220 [DEBUG] lithops.monitor -- ExecutorID 02e085-2 - Starting Storage job monitor
2022-01-03 20:58:39,221 [INFO] lithops.wait -- ExecutorID 02e085-2 - Getting results from function

2022-01-03 20:58:42,711 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got status from call 00015 - Activation ID: 2117b137-772a-4c51-85ba-2d1f03b9a368 - Time: 0.57 seconds
2022-01-03 20:58:42,711 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got status from call 00008 - Activation ID: 573a0e82-750f-4df1-8fd2-876b2a6fbb23 - Time: 0.26 seconds
2022-01-03 20:58:42,909 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got output from call 00015 - Activation ID: 2117b137-772a-4c51-85ba-2d1f03b9a368
2022-01-03 20:58:42,992 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got output from call 00001 - Activation ID: 0f389066-dd25-4771-8e31-855831e89dcc
2022-01-03 20:58:43,013 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got output from call 00003 - Activation ID: 20617ef8-2cc2-4ff0-8cff-c95e02c399ef
2022-01-03 20:58:43,033 [DEBUG] lithops.future -- ExecutorID 02e085-2 | JobID M000 - Got output from call 00008 - Activation ID:

Note that the result is a list of lists, where the i'th inner list is equivalent to `part-i-final.csv` from HW2.

In [13]:
results

[['city_Haifa',
  'myCSV11.csv,myCSV19.csv,myCSV9.csv,myCSV14.csv,myCSV20.csv,myCSV10.csv,myCSV8.csv,myCSV5.csv,myCSV17.csv,myCSV3.csv,myCSV1.csv,myCSV15.csv,myCSV6.csv,myCSV18.csv'],
 ['city_Hamburg',
  'myCSV10.csv,myCSV5.csv,myCSV6.csv,myCSV7.csv,myCSV9.csv,myCSV14.csv,myCSV18.csv,myCSV2.csv,myCSV15.csv,myCSV12.csv,myCSV3.csv,myCSV20.csv,myCSV17.csv,myCSV1.csv'],
 ['city_Kiel',
  'myCSV12.csv,myCSV2.csv,myCSV8.csv,myCSV13.csv,myCSV15.csv,myCSV16.csv,myCSV6.csv,myCSV7.csv,myCSV14.csv,myCSV20.csv,myCSV9.csv,myCSV4.csv,myCSV19.csv,myCSV3.csv,myCSV10.csv,myCSV17.csv,myCSV5.csv'],
 ['city_London',
  'myCSV12.csv,myCSV15.csv,myCSV19.csv,myCSV17.csv,myCSV3.csv,myCSV7.csv,myCSV10.csv,myCSV11.csv,myCSV20.csv,myCSV14.csv,myCSV16.csv,myCSV2.csv,myCSV1.csv,myCSV9.csv,myCSV13.csv'],
 ['city_Munchen',
  'myCSV3.csv,myCSV19.csv,myCSV6.csv,myCSV17.csv,myCSV10.csv,myCSV7.csv,myCSV12.csv,myCSV15.csv,myCSV16.csv,myCSV14.csv,myCSV2.csv,myCSV4.csv,myCSV1.csv,myCSV8.csv,myCSV5.csv,myCSV11.csv'],
 ['city_

In [14]:
conn.close()
os.remove('mydb.db')

# Question 3
## Shuffle

MapReduceServerlessEngine deploys both map and reduce tasks as serverless invocations.   
However, once the map stage has completed, the results are transferred from the map tasks to the SQLite database located on the client machine (a laptop in your case), the shuffle is performed locally, and the reduce tasks are then invoked with the relevant parameters.

(To support your answers, feel free to use examples, Images, etc.)
<br><br>

**1. Explain why this approach is not efficient and what the pros and cons of such an architecture are in general. In a broader scope, you may assume that MapReduceServerlessEngine is executed on some powerful machine and not just a laptop.**

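The whole point of using serverless services is to remove the need for local compute and data storage (hence the use of invocations for both map and reduce tasks). Downloading the intermediate data for the shuffle step defeats that purpose: every map result has to travel over the network to the client, the client becomes a single bottleneck between the two parallel phases, and large intermediate data may cause memory issues on the client.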

Pro: Running the shuffle locally removes the need to pay for serverless functionality for that step.

Con: Additional communication between cloud services and local machines for the shuffle step increases the chance of failures, despite fault tolerance from the serverless provider.

<br><br>
**2. Suggest how you can improve the shuffle so that intermediate data is not downloaded to the client at all and the shuffle is performed in the cloud as well. Explain the pros and cons of the approaches you suggest.**


One option is to create the SQL database in object storage on the cloud, which removes the need to download data to the client's machine.

Pro: Storage. No need to worry about external hard disks when storing big data on the cloud.

Con: Increased latency. The shuffle step only needs to read the intermediate data, and reading it from the cloud will generally be slower than reading it locally.

<br><br>
**3. Can you make serverless shuffle?**


Yes. Assuming you remove the use of SQL databases, you could define new functions that replicate their functionality and then call them using the Lithops `call_async` method.

Con: This task doesn't require multiple machines so deploying it on the cloud could be considered unnecessary.  
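
As a rough illustration of the answer above, the following sketch replaces the SQL query with a plain Python function that groups values by key. The name `shuffle` and the sample pairs are hypothetical, and the function is run locally here; in principle a function like this could be submitted as a serverless invocation (for instance with `call_async`), although how the arguments would be packaged is not shown.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, mimicking SELECT key, GROUP_CONCAT(value) ... GROUP BY key ORDER BY key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sort by key and join each group's values into one comma-separated string.
    return [(key, ','.join(values)) for key, values in sorted(grouped.items())]


# Hypothetical map output, flattened into a single list of (key, value) pairs.
map_output = [
    ("city_Kiel", "myCSV2.csv"),
    ("city_Haifa", "myCSV1.csv"),
    ("city_Haifa", "myCSV3.csv"),
]
reduce_input = shuffle(map_output)
print(reduce_input)
# [('city_Haifa', 'myCSV1.csv,myCSV3.csv'), ('city_Kiel', 'myCSV2.csv')]
```

The output has the same shape as the tuples produced by the SQLite shuffle, so the reduce function would not need to change. A single function like this still processes all pairs in one place, which is consistent with the con noted above.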

<br><br><br><br>
Good Luck :) 