# Big Data Platform
## Assignment 3: Serverless

**By:**  

John Doe, 300123123  
Jane Doe, 200123123

<br><br>

**The goal of this assignment is to:**
- Understand and practice the details of Serverless

**Instructions:**
- Students will form teams of two people each, and submit a single homework for each team.
- The same score for the homework will be given to each member of your team.
- Your solution is in the form of a Jupyter notebook file (with extension ipynb).
- Images/Graphs/Tables should be submitted inside the notebook.
- The notebook should be runnable and properly documented. 
- Please answer all the questions and include all your code.
- You are expected to submit clear, Pythonic code.
- You can change function signatures/definitions.

**Submission:**
- Submission of the homework will be done via Moodle by uploading (not Zip):
    - Jupyter Notebook
    - 2 Log files
    - Additional local scripts
- The homework needs to be entirely in English.
- The deadline for submission is on Moodle.
- Late submission won't be allowed.

  
- In case of identical code submissions, both groups will get a zero.
- Some groups might be selected randomly to present their code.

**Requirements:**  
- Python 3.6 should be used.  
- You should implement the algorithms by yourself using only basic Python libraries (such as numpy, pandas, etc.).

<br><br><br><br>

**Grading:**
- Q0 - 10 points - Setup
- Q1 - 40 points - Serverless MapReduceEngine
- Q2 - 20 points - MapReduce job to calculate inverted index
- Q3 - 30 points - Shuffle

`Total: 100`

<br><br>

# Question 0
## Setup

1. Navigate to IBM Cloud and open a trial account. There is no need to provide a credit card.
2. Choose IBM Cloud Object Storage service from the catalog
3. Create a new bucket in IBM Cloud Object Storage
4. Create credentials for the bucket with HMAC (access key and secret key)
5. Choose IBM Cloud Functions service from the catalog and create a service


#### Lithops setup
1. Using the `git` tool, install the master branch of the Lithops project from
https://github.com/lithops-cloud/lithops
2. Follow the Lithops documentation and configure Lithops against IBM Cloud Functions and IBM Cloud Object Storage.
3. Set the Lithops log level to DEBUG.
4. Run the Hello World example using the Futures API and verify that everything works properly.
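
The Hello World check in step 4 can be sketched roughly as follows. This is a minimal sketch that assumes Lithops is already installed and configured as described in steps 1 to 3; the function names `hello` and `run_hello_world` are made up for illustration, and passing `log_level='DEBUG'` to `FunctionExecutor` is assumed to be an alternative to setting the level in the configuration file.

```python
def hello(name):
    """Function that runs remotely as a single serverless activation."""
    return f'Hello {name}!'


def run_hello_world():
    # Imported here so this file can be loaded even where Lithops is not installed.
    import lithops

    # log_level='DEBUG' asks Lithops to print debug messages to the console.
    fexec = lithops.FunctionExecutor(log_level='DEBUG')
    fexec.call_async(hello, 'World')
    return fexec.get_result()


# Usage (requires a configured Lithops installation):
# print(run_hello_world())
```

If the call returns the greeting and the console shows DEBUG lines from `lithops.config` and `lithops.invokers`, the setup is most likely working.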


#### IBM Cloud Object Storage setup
1. Upload all the input CSV files that you used in homework 2 into the bucket you created in IBM Cloud Object Storage


<br><br><br>

# Question 1
## Serverless MapReduceEngine

Modify MapReduceEngine from homework 2 into MapReduceServerlessEngine, where map and reduce tasks are executed as serverless actions instead of local threads. In particular:
1. Deploy all map tasks as serverless actions by using Lithops against IBM Cloud Functions.
2. Collect the results from all map tasks, store them in the same SQLite database you used in MapReduceEngine, and use the same code for the sort and shuffle phase.
3. Deploy the reduce tasks by using Lithops against IBM Cloud Functions. Instead of persisting the results of the reduce tasks, return them to MapReduceServerlessEngine and proceed with the same workflow as in MapReduceEngine.
4. Return the results of the reduce tasks to the user.

**Please attach:**  
Text file with all log messages Lithops printed to console during the execution. Make
sure log level is set to DEBUG mode.

#### Code:

In [113]:
from lithops import FunctionExecutor
from lithops import Storage

# general
import os
import time
import random
import warnings
import glob

import sqlite3
from sqlite3 import Error

# ml
import numpy as np
import scipy as sp
import pandas as pd

# notebook
from IPython.display import display

In [None]:
warnings.filterwarnings('ignore')

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

**Uploading all myCSV files from Assignment 2 to IBM Cloud Object Storage using the Lithops Storage API:**

In [None]:
st = Storage()

In [None]:
for my_csv in glob.glob('../Assignment2/myCSV*.csv'):
    with open(my_csv, 'rb') as fl:
        # use only the file name as the object key, so keys look like 'myCSV1.csv'
        st.put_object('idc.g.test', os.path.basename(my_csv), fl)

**Creating SQLite database:**

In [169]:
database = 'tmp_database.db'

CREATE_SCHEMA_STR = ''' CREATE TABLE IF NOT EXISTS temp_results (
                                        key TEXT,
                                        value TEXT
                                    ); '''

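conn = None
try:
    conn = sqlite3.connect(database)
    c = conn.cursor()
    c.execute(CREATE_SCHEMA_STR)
    conn.commit()
except Error as e:
    print(e)
finally:
    # close the connection; the engine opens its own connection later
    if conn:
        conn.close()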

**MapReduceServerlessEngine:**

In [170]:
class MapReduceServerlessEngine():
        
    def execute(self, input_data, map_function, reduce_function, params):
        
        # Use FunctionExecutor as a context manager.
        # fexec.map spawns one function activation per item of the input list.
        # Note: input_data[:3] sends only the first three inputs to the map stage.
        with FunctionExecutor() as fexec:
            fut = fexec.map(map_function, input_data[:3], extra_args=[params])
            results = fut.get_result()
        
        # at this point all maps completed and their results returned in a list (list of lists)
        
        # flatten the list of lists to one list
        flat_results = [item for sublist in results for item in sublist]
        
        # creating dataframe from all results
        tmp_dfs = pd.DataFrame(flat_results, columns=['key', 'value'])
        
        # creating connection and querying db
        sql_conn = None
        try:
            sql_conn = sqlite3.connect(database)
            tmp_dfs.to_sql('temp_results',sql_conn, if_exists='append',index=False)
            
            cur = sql_conn.cursor()
            sort_and_shuffle_query = "SELECT key, GROUP_CONCAT(value) as value " \
                                     "FROM temp_results " \
                                     "GROUP BY key " \
                                     "ORDER BY key "
            cur.execute(sort_and_shuffle_query)
            sort_and_shuffle_res = cur.fetchall()
        except Error as e:
            return e
        finally:
            if sql_conn:
                sql_conn.close()
                
        list_of_results = [list(res) for res in sort_and_shuffle_res]
        
        # reduce part        
        with FunctionExecutor() as fexec:
            fut = fexec.map(reduce_function, list_of_results)
            reduce_results = fut.get_result()
                
        print('MapReduce Completed')
        return reduce_results           
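
The sort-and-shuffle query inside `execute` can be tried in isolation. The following is a small sketch that uses an in-memory SQLite database and made-up `(key, value)` pairs instead of real map results; the table layout matches `temp_results`, everything else is an assumption for illustration.

```python
import sqlite3

# Hypothetical map output: (key, document name) pairs from two files.
pairs = [('John', 'a.csv'), ('Dana', 'a.csv'), ('John', 'b.csv')]

conn = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
conn.execute('CREATE TABLE temp_results (key TEXT, value TEXT)')
conn.executemany('INSERT INTO temp_results VALUES (?, ?)', pairs)

# Same idea as the engine's query: one row per key, values joined with commas.
query = ('SELECT key, GROUP_CONCAT(value) AS value '
         'FROM temp_results GROUP BY key ORDER BY key')
shuffled = [list(row) for row in conn.execute(query)]
conn.close()

print(shuffled)
```

Each resulting row has the shape `[key, 'doc1,doc2,...']`, which is what a reduce function receives as its single argument. SQLite does not guarantee the order of the values inside `GROUP_CONCAT`, so the reduce step should not rely on it.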

# Question 2
## Submit MapReduce job to calculate inverted index
1. Use input_data: `cos://bucket/<path to CSV data>`
2. Submit MapReduce job with reduce and map functions as you used in homework 2, as follows

    `mapreduce = MapReduceServerlessEngine()`  
    `results = mapreduce.execute(input_data, inverted_map, inverted_index)`   
    `print(results)`

**Please attach:**  
Text file with all log messages Lithops printed to console during the execution. Make
sure log level is set to DEBUG mode.

#### Code:

In [40]:
# pulling all csv file names from my bucket in object storage
keys_list = st.list_keys('idc.g.test', prefix='myCSV')

2022-01-04 19:20:13,321 [DEBUG] lithops.config -- Loading configuration from /Users/gorelik/PycharmProjects/BigData/MS_Big_Data/Assignment3/.lithops_config
2022-01-04 19:20:13,335 [DEBUG] lithops.config -- Loading Storage backend module: ibm_cos
2022-01-04 19:20:13,340 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Creating IBM COS client
2022-01-04 19:20:13,341 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2022-01-04 19:20:13,341 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Using access_key and secret_key
2022-01-04 19:20:13,346 [INFO] lithops.storage.backends.ibm_cos.ibm_cos -- IBM COS client created - Region: eu-de


In [None]:
st.put_object('idc.g.test', 'test_upload_myCSV1.csv', 'Hello World')

In [41]:
# adding location for further processing
input_data = ['cos://idc.g.test/' + key for key in keys_list]
input_data

['cos://idc.g.test/myCSV1.csv',
 'cos://idc.g.test/myCSV10.csv',
 'cos://idc.g.test/myCSV11.csv',
 'cos://idc.g.test/myCSV12.csv',
 'cos://idc.g.test/myCSV13.csv',
 'cos://idc.g.test/myCSV14.csv',
 'cos://idc.g.test/myCSV15.csv',
 'cos://idc.g.test/myCSV16.csv',
 'cos://idc.g.test/myCSV17.csv',
 'cos://idc.g.test/myCSV18.csv',
 'cos://idc.g.test/myCSV19.csv',
 'cos://idc.g.test/myCSV2.csv',
 'cos://idc.g.test/myCSV20.csv',
 'cos://idc.g.test/myCSV3.csv',
 'cos://idc.g.test/myCSV4.csv',
 'cos://idc.g.test/myCSV5.csv',
 'cos://idc.g.test/myCSV6.csv',
 'cos://idc.g.test/myCSV7.csv',
 'cos://idc.g.test/myCSV8.csv',
 'cos://idc.g.test/myCSV9.csv']

In [171]:
mapreduce = MapReduceServerlessEngine()
status = mapreduce.execute(input_data, inverted_map, inverted_reduce, params={'column':0})
print(status)

2022-01-04 21:58:19,992 [INFO] lithops.config -- Lithops v2.5.8
2022-01-04 21:58:20,002 [DEBUG] lithops.config -- Loading configuration from /Users/gorelik/PycharmProjects/BigData/MS_Big_Data/Assignment3/.lithops_config
2022-01-04 21:58:20,006 [DEBUG] lithops.config -- Loading Serverless backend module: ibm_cf
2022-01-04 21:58:20,007 [DEBUG] lithops.config -- Loading Storage backend module: ibm_cos
2022-01-04 21:58:20,008 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Creating IBM COS client
2022-01-04 21:58:20,008 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Set IBM COS Endpoint to https://s3.eu-de.cloud-object-storage.appdomain.cloud
2022-01-04 21:58:20,008 [DEBUG] lithops.storage.backends.ibm_cos.ibm_cos -- Using access_key and secret_key
2022-01-04 21:58:20,013 [INFO] lithops.storage.backends.ibm_cos.ibm_cos -- IBM COS client created - Region: eu-de
2022-01-04 21:58:20,014 [DEBUG] lithops.serverless.backends.ibm_cf.ibm_cf -- Creating IBM Cloud Functions client
2022-01-0

2022-01-04 21:58:24,208 [DEBUG] lithops.invokers -- ExecutorID ee8043-71 - Invoker initialized. Max workers: 1200
2022-01-04 21:58:24,208 [DEBUG] lithops.invokers -- ExecutorID ee8043-71 - Serverless invoker created
2022-01-04 21:58:24,209 [DEBUG] lithops.executors -- Function executor for ibm_cf created with ID: ee8043-71
2022-01-04 21:58:24,209 [INFO] lithops.invokers -- ExecutorID ee8043-71 | JobID M000 - Selected Runtime: lithopscloud/ibmcf-python-v38 - 256MB
2022-01-04 21:58:24,210 [DEBUG] lithops.storage.storage -- Runtime metadata found in local memory cache
2022-01-04 21:58:24,210 [DEBUG] lithops.job.job -- ExecutorID ee8043-71 | JobID M000 - Serializing function and data
2022-01-04 21:58:24,211 [DEBUG] lithops.job.serialize -- Referenced modules: None
2022-01-04 21:58:24,212 [DEBUG] lithops.job.serialize -- Modules to transmit: None
2022-01-04 21:58:24,212 [DEBUG] lithops.job.job -- ExecutorID ee8043-71 | JobID M000 - Uploading function and modules to the storage backend
2022-

[['Albert', 'myCSV11.csv,myCSV1.csv,myCSV10.csv'], ['Dana', 'myCSV1.csv,myCSV10.csv,myCSV11.csv'], ['Johanna', 'myCSV1.csv,myCSV11.csv,myCSV10.csv'], ['John', 'myCSV10.csv,myCSV1.csv'], ['Marc', 'myCSV10.csv,myCSV1.csv,myCSV11.csv'], ['Michael', 'myCSV1.csv'], ['Scott', 'myCSV1.csv,myCSV11.csv'], ['Steven', 'myCSV10.csv,myCSV11.csv']]
MapReduce Completed


2022-01-04 21:58:27,877 [DEBUG] lithops.invokers -- ExecutorID ee8043-71 - Async invoker 0 finished
2022-01-04 21:58:27,877 [DEBUG] lithops.invokers -- ExecutorID ee8043-71 - Async invoker 1 finished


In [135]:
def inverted_map(obj, extra_args):
    keys = []
    # reading file
    data = obj.data_stream.read().decode(encoding='utf-8')
    # reading lines and extracting keys from data to list
    for line in data.splitlines():
        keys.append(line.split(',')[extra_args['column']])
    # returning list of tuples (key_value, document_name)
    return list(zip(keys[1:], [obj.key]*len(keys[1:])))
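
To check the map function without invoking any cloud function, it can be fed a fake object. This is only a sketch: `fake_obj` is a hypothetical stand-in that mimics the two attributes the function uses (`key` and `data_stream`), and the CSV content is invented.

```python
import io
from types import SimpleNamespace


def inverted_map(obj, extra_args):
    # Same logic as the cell above: one (value, file name) pair per data row.
    data = obj.data_stream.read().decode(encoding='utf-8')
    keys = [line.split(',')[extra_args['column']] for line in data.splitlines()]
    # keys[0] is the header, so it is skipped
    return list(zip(keys[1:], [obj.key] * len(keys[1:])))


# Hypothetical stand-in for the object Lithops passes to the function.
fake_obj = SimpleNamespace(
    key='myCSV1.csv',
    data_stream=io.BytesIO(b'firstname,city\nJohn,Paris\nDana,Rome\n'),
)

pairs = inverted_map(fake_obj, {'column': 0})
print(pairs)
```

The header row is dropped and every remaining value of the selected column is paired with the name of the file it came from.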

In [137]:
def inverted_reduce(documents):

    # extracting key and value
    res_list = [documents[0]]    
    value = documents[1]
    
    # split value by ',' to get list of documents names and remove duplicates using set
    docs_set = set(value.split(','))
    # creating new "value" string from the set with no duplicates
    docs_str = ','.join(docs_set)
    
    # appending value to res list
    res_list.append(docs_str)
    
    return res_list
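
A quick local check of the reduce function might look like this. The input row is made up; it has the `[key, 'doc1,doc2,...']` shape produced by the shuffle query, with one duplicate document name.

```python
def inverted_reduce(documents):
    # Same logic as the cell above, condensed.
    key, value = documents[0], documents[1]
    return [key, ','.join(set(value.split(',')))]


# Hypothetical shuffled row: 'John' appears twice in myCSV1.csv.
row = ['John', 'myCSV1.csv,myCSV10.csv,myCSV1.csv']
result = inverted_reduce(row)

# A set has no defined order, so compare the documents after sorting them.
print(result[0], sorted(result[1].split(',')))
```

Because the duplicates are removed with a `set`, the order of the document names in the output string can differ between runs.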

# Question 3
## Shuffle

MapReduceServerlessEngine deploys both map and reduce tasks as serverless invocations.   
However, once the map stage is completed, the results are transferred from the map tasks to the SQLite database located on the client machine (a laptop in your case). The shuffle is then performed locally, and the reduce tasks are invoked with the relevant parameters.

(To support your answers, feel free to use examples, Images, etc.)
<br><br>

**1. Explain why this approach is not efficient and what the pros and cons of such an architecture are in general. In a broader scope, you may assume that MapReduceServerlessEngine is executed on some powerful machine and not just a laptop.**

We are talking about Big Data. Without Big Data there would be no need for serverless or MapReduce: we could process everything on our own machine in a short time, without losing any time moving data over the network (serverless + shuffle).

With Big Data, performing the shuffle locally means that all map results must be moved over the network to our local machine. This is the main problem, and the time it takes depends on how much data there is and how far it has to travel. Also, Big Data means that "even the strongest machine on the planet can't process it" (from Gil's lectures). Here we run a small example, but at real Big Data scale a local shuffle could be impossible.

Let's assume the data is not that big. First, as mentioned above, we still have the problem of moving all the data over the network to the local machine. Second, even with a strong machine we want to process the data efficiently, and processing everything on one machine takes longer than processing it in parallel on different machines.

On the other hand, when the shuffle runs locally on one machine we do not face stragglers during the shuffle (the case where most of the processing is done and we wait for a few slow tasks before we can continue), and we do not have to rerun maps or wait for a distributed shuffle to finish.
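
To make the network cost concrete, here is a rough back-of-the-envelope sketch. The data size and bandwidth are hypothetical numbers, not measurements from this assignment, and the estimate ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s):
    """Time to move size_bytes over a link, ignoring latency and overhead."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Hypothetical numbers: 1 TB of intermediate data over a 1 Gbit/s link.
seconds = transfer_seconds(10**12, 10**9)
print(f'{seconds:.0f} s, about {seconds / 3600:.1f} hours')
```

Under these assumptions, only moving the intermediate data to the client takes hours, before any local shuffle work starts.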

<br><br>
**2. Suggest how you can improve the shuffle so that intermediate data is not downloaded to the client at all and the shuffle is performed in the cloud as well. Explain the pros and cons of the approaches you suggest.**


We can write our remote function logic so that the map tasks store their results in object storage and the reducers read them from there, or so that the results are passed by key directly to other machines. It is up to us how to implement it. We can also leverage the location of the machines, such as the region, to reduce network movement: if our map functions run in Germany, for example, we can save the results in the same region and configure the reducers in Germany to process them. This way we shorten the shuffle phase by minimizing data transfer. The downside is that we still have to implement the shuffle logic ourselves, deal with a lot of complicated data transfer, and handle stragglers that can delay the completion of the whole job.
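
One possible way to do the map-side part of this idea is hash partitioning: each map task writes one object per reducer, so all pairs with the same key end up with the same reducer. The sketch below is an assumption, not the implemented engine. A plain `dict` stands in for the object storage bucket, and `NUM_REDUCERS`, `partition_for`, `write_map_output`, and the key layout are all hypothetical names.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # hypothetical number of reduce tasks


def partition_for(key, num_reducers=NUM_REDUCERS):
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode('utf-8')) % num_reducers


def write_map_output(store, map_id, pairs):
    """Group one map task's pairs by target reducer and write one object per reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition_for(key)].append((key, value))
    for part, items in buckets.items():
        # Hypothetical key layout: shuffle/part-<reducer>/map-<map task>
        store[f'shuffle/part-{part}/map-{map_id}'] = items


store = {}  # a dict standing in for an object storage bucket
write_map_output(store, 0, [('John', 'a.csv'), ('Dana', 'a.csv')])
write_map_output(store, 1, [('John', 'b.csv')])
print(sorted(store))
```

With a real bucket, the assignment to `store[...]` would become an upload, and the intermediate data would never pass through the client.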

<br><br>
**3. Can you make a serverless shuffle?**


Sure. It seems that my answer to question 2 also contains the answer to this question. Yes, we can do it: for example, we can run the map functions and save temporary files to cloud object storage, and write logic for cloud functions that are triggered to read all the temporary files generated by the maps from object storage and sort them, so that each reducer is given a specific key.
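
The reduce side of such a serverless shuffle could look roughly like the sketch below. It is an illustration under assumptions: a `dict` stands in for the object storage bucket, the key layout `shuffle/part-<reducer>/map-<map task>` and the function `shuffle_and_reduce` are hypothetical, and the sample pairs are invented.

```python
from collections import defaultdict

# A dict standing in for an object storage bucket that already holds map output.
# The key layout (shuffle/part-<reducer>/map-<map task>) is hypothetical.
store = {
    'shuffle/part-0/map-0': [('Dana', 'a.csv'), ('John', 'a.csv')],
    'shuffle/part-0/map-1': [('John', 'b.csv')],
    'shuffle/part-1/map-1': [('Marc', 'b.csv')],
}


def shuffle_and_reduce(store, part):
    """Work of one reduce activation: read its partition, group by key, reduce."""
    prefix = f'shuffle/part-{part}/'
    grouped = defaultdict(set)
    for name in sorted(store):
        if name.startswith(prefix):
            for key, value in store[name]:
                grouped[key].add(value)
    # Sorting keys and documents gives a deterministic inverted index.
    return [[key, ','.join(sorted(docs))] for key, docs in sorted(grouped.items())]


print(shuffle_and_reduce(store, 0))
```

Each reduce activation only reads the objects under its own prefix, so the grouping happens in the cloud and the client receives only the final results.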

<br><br><br><br>
Good Luck :) 