# 2020 ML Engineer Home Assignment

This is a Jupyter notebook containing exercises from three fields:
1. Algorithms,
2. Python,
3. Spark.

We think it can be completed within a couple of hours, and we expect you to send back the results within a week (7 days).  
Your submission can take the form of this notebook with your code and any necessary remarks added to it, but we are also fine with other forms, as well as with partial solutions.

## Python requirements:

You can use the library versions below to run this notebook.

If you are using Conda to manage your Python environments, you can use a command like the one below:
```
conda create --name vl-home-assignment --file requirements.txt
```

Requirements:
```
pyspark=2.3.0
python=3.6.10
jupyter
```

## Ex. 1: Algorithms

You are given two singly-linked lists (precisely: the references to their heads).
Implement an optimal (both in terms of memory and time) function that returns the intersection node.  

We define intersection based on reference, not value. 
That is, if the k-th element of the first list is exactly the same node reference as the j-th element of the second list then the lists intersect. 
We expect that your solution will provide your own implementation of the singly-linked lists. 
Please provide a couple of examples testing your solution.
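
One common way to meet the time and memory requirements is to align the two lists by length and then walk them in lockstep, comparing references with `is`. The sketch below is only an illustration under stated assumptions: the `Node` class and the `length` and `intersection` helpers are hypothetical names, and both lists are assumed to be acyclic. It runs in O(n + m) time and uses O(1) extra memory.

```python
from typing import Any, Optional


class Node:
    """A singly-linked list node (hypothetical helper for this sketch)."""

    def __init__(self, value: Any, next_node: "Optional[Node]" = None) -> None:
        self.value = value
        self.next = next_node


def length(head: Optional[Node]) -> int:
    """Count the nodes reachable from head (assumes no cycle)."""
    count = 0
    while head is not None:
        count += 1
        head = head.next
    return count


def intersection(head_a: Optional[Node], head_b: Optional[Node]) -> Optional[Node]:
    """Return the first node shared by both lists (by identity), or None."""
    len_a, len_b = length(head_a), length(head_b)
    # Advance the longer list so both pointers are equally far from the tail.
    for _ in range(len_a - len_b):
        head_a = head_a.next
    for _ in range(len_b - len_a):
        head_b = head_b.next
    # Walk in lockstep; the first identical reference is the intersection.
    # If there is none, both pointers reach None at the same time.
    while head_a is not head_b:
        head_a = head_a.next
        head_b = head_b.next
    return head_a


# Two lists that share the tail c -> d.
shared = Node("c", Node("d"))
list_a = Node("a1", Node("a2", shared))
list_b = Node("b1", shared)
# Same values as the shared tail, but different node objects.
list_c = Node("c", Node("d"))

print(intersection(list_a, list_b) is shared)  # True
print(intersection(list_a, list_c))            # None
```

The last example shows why the comparison has to use references: `list_c` holds the same values as the shared tail, yet it does not intersect `list_a`.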

In [None]:
# Space for your solution

import random


list1 = []
list2 = []


class Create_Node:
    def __init__(self, dataval=None):
        self.dataval = dataval
        self.nextval = None
        
    
class Create_LinkedList:
    def __init__(self):
        self.headval = None
        
    
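    def reference_of_node(self, list_name):
        # Walk the list and record one random integer per node in the given
        # global list. These are random values, not node references.
        node_value = self.headval
        while node_value is not None:
            rv = random.randint(5, 50)  # You can change the lower and upper bounds

            # Compare by identity so the target list is chosen reliably
            if list_name is list1:
                list1.append(rv)
            else:
                list2.append(rv)
            node_value = node_value.nextval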

In [None]:
# First linked list (named linked_list to avoid shadowing the built-in list)
linked_list = Create_LinkedList()
linked_list.headval = Create_Node("fish")
e2 = Create_Node("vegetable")
e3 = Create_Node("chicken")
e4 = Create_Node("beef")
e5 = Create_Node("mutton")

# Link the head node to the second node
linked_list.headval.nextval = e2

# Link the remaining nodes in order
e2.nextval = e3
e3.nextval = e4
e4.nextval = e5


linked_list.reference_of_node(list1)


# Second linked list (named linked_list to avoid shadowing the built-in list)
linked_list = Create_LinkedList()
linked_list.headval = Create_Node("computer")
e2 = Create_Node("mouse")
e3 = Create_Node("heartdisk")
e4 = Create_Node("laptop")
e5 = Create_Node("monitor")


# Link the head node to the second node
linked_list.headval.nextval = e2

# Link the remaining nodes in order
e2.nextval = e3
e3.nextval = e4
e4.nextval = e5

linked_list.reference_of_node(list2)

In [None]:
# Values that occur in both generated lists
c = [d for d in list1 if d in list2]

print(f"Generated list1 is {list1} and \ngenerated list2 is {list2}")

if c:
    print("Yes! There is an intersection between the two lists")
else:
    print("There is no intersection between the two lists")

## Ex. 2: Python

Given two files, `file.json` and `file.csv`, containing the same mapping, write a small library with loaders for these file formats. 
A loader `l` should have two public members: `l.keys` and `l.data`.
There should be basic filtering functionality built around the `keys` attribute, as in the example below.

Use Python Standard Library for parsing json and csv files.

### Sample files
`file.json`:
```
{
  "alfa": 1,
  "beta": 2
}
```
`file.csv`:
```
alfa,1
beta,2
```

### Sample usage
```
from typing import Tuple
from file_loaders import Loader, JsonLoader, CsvLoader
loaders: Tuple[Loader, ...] = (
    JsonLoader('file.json'),
    CsvLoader('file.csv')
)
for l in loaders:
    print(f"======== {type(l).__name__} =======")
    print(f"data: {l.data}") # => {'alfa': 1, 'beta': 2}
    print(f"keys: {l.keys}") # => ('alfa', 'beta')
    l.keys = ('alfa')
    print(f"data: {l.data}") # => {'alfa': 1}
    print(f"keys: {l.keys}") # => alfa
```
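
One possible shape for such a library is sketched below, using only `json` and `csv` from the standard library. The base-class layout, the `_read` hook, the conversion of CSV values to `int`, and the choice to ignore unknown keys are all assumptions of this sketch, not requirements stated in the task. The sample files are written to a temporary directory so that the block runs on its own.

```python
import csv
import json
import os
import tempfile
from typing import Any, Dict


class Loader:
    """Base loader; subclasses implement _read() and return the full mapping."""

    def __init__(self, path: str) -> None:
        self._all_data: Dict[str, Any] = self._read(path)
        self._keys: Any = tuple(self._all_data)

    def _read(self, path: str) -> Dict[str, Any]:
        raise NotImplementedError

    @property
    def keys(self) -> Any:
        return self._keys

    @keys.setter
    def keys(self, wanted: Any) -> None:
        # Store the value as given, so ('alfa') is kept as the string 'alfa'.
        self._keys = wanted

    @property
    def data(self) -> Dict[str, Any]:
        # A bare string is treated as a single key; unknown keys are skipped.
        selected = (self._keys,) if isinstance(self._keys, str) else self._keys
        return {k: self._all_data[k] for k in selected if k in self._all_data}


class JsonLoader(Loader):
    def _read(self, path: str) -> Dict[str, Any]:
        with open(path) as handle:
            return json.load(handle)


class CsvLoader(Loader):
    def _read(self, path: str) -> Dict[str, Any]:
        with open(path, newline="") as handle:
            # Each non-empty row is "key,value"; values are assumed to be integers.
            return {row[0]: int(row[1]) for row in csv.reader(handle) if row}


# Write the two sample files from the task into a temporary directory.
workdir = tempfile.mkdtemp()
json_path = os.path.join(workdir, "file.json")
csv_path = os.path.join(workdir, "file.csv")
with open(json_path, "w") as handle:
    json.dump({"alfa": 1, "beta": 2}, handle)
with open(csv_path, "w", newline="") as handle:
    handle.write("alfa,1\nbeta,2\n")

loaders = (JsonLoader(json_path), CsvLoader(csv_path))
for l in loaders:
    print(f"======== {type(l).__name__} =======")
    print(f"data: {l.data}")
    print(f"keys: {l.keys}")
    l.keys = ('alfa')
    print(f"data: {l.data}")
    print(f"keys: {l.keys}")
```

Keeping the filter in the `data` property means the full mapping is read only once, and resetting `keys` later can widen the view again without touching the file.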

In [None]:
# Space for your solution

# Create the JSON file
import json
import pandas as pd

data = {
    "alfa": [1],
    "beta": [2]
}

with open('data.json', 'w') as outfile:
    json.dump(data, outfile)

# Read the JSON file back with pandas (the json module would work as well)
json_file = pd.read_json("data.json")
json_file = tuple(json_file)  # column names of the DataFrame as a tuple
json_file

In [None]:
# Create the CSV file
data = {
    "alfa": [1],
    "beta": [2]
}

df_file = pd.DataFrame(data)
df_file.to_csv("data.csv", index=False)  # do not write the row index
file_csv = pd.read_csv("data.csv")
file_csv = file_csv.to_dict(orient='list')  # DataFrame to a dict of lists
file_csv

In [None]:
loaders = {
    "data": file_csv,
    "keys": json_file
}

In [None]:
loaders

In [None]:
loaders["data"]

In [None]:
loaders['keys']

In [None]:
loaders['keys'] = loaders['data']

In [None]:
loaders['data']

In [None]:
loaders['keys']

## Ex. 3: Spark

Below are two Spark dataframes representing 1) people and 2) hobbies.
You are given two tasks.
1. Show how many hobbies each person has.
2. Show all hobbies that no one has.
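
As a sanity check on what the two tasks should produce, here is a plain-Python sketch (no Spark involved) over the same sample rows that appear in the cells below. It assumes that a person without any matching hobby row should be reported with a count of 0, which the task statement does not spell out. The variable names are hypothetical.

```python
from collections import Counter

# Same rows as persons_df and hobbies_df.
persons = [(1, "Mary"), (2, "John"), (4, "Tom")]
hobbies = [(1, "Python"), (1, "Spark"), (1, "Reading"), (2, "Sleeping"), (3, "Soccer")]

# Task 1: count hobby rows per person id; a Counter returns 0 for missing ids.
per_id = Counter(person_id for person_id, _ in hobbies)
hobby_counts = {name: per_id[person_id] for person_id, name in persons}

# Task 2: hobbies whose id does not match any person.
person_ids = {person_id for person_id, _ in persons}
unclaimed = [hobby for person_id, hobby in hobbies if person_id not in person_ids]

print(hobby_counts)  # {'Mary': 3, 'John': 1, 'Tom': 0}
print(unclaimed)     # ['Soccer']
```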




In [None]:
from pyspark.sql import SparkSession, DataFrame
import pyspark.sql.functions as F

# global spark session
spark = (SparkSession
         .builder
         .appName('MLEngineerSpark')
         .master('local[2]')
         .getOrCreate())

In [None]:
persons_df = spark.createDataFrame(
            data=[
                (1, "Mary"),
                (2, "John"),
                (4, "Tom"),
            ], schema=("id","name"))

hobbies_df = spark.createDataFrame(
            data=[
                (1, "Python"),
                (1, "Spark"),
                (1, "Reading"),
                (2, "Sleeping"),
                (3, "Soccer"),
            ], schema=("id","hobby"))

persons_df.show(truncate=False)
hobbies_df.show(truncate=False)

In [None]:
# 1. Show how many hobbies each person has.
# Space for your solution

# Register temp views so the DataFrames can be queried with SQL
persons_df.createOrReplaceTempView("people")
hobbies_df.createOrReplaceTempView("hobby")

# LEFT JOIN keeps people without hobbies; COUNT(h.hobby) reports 0 for them
spark.sql("SELECT p.name, COUNT(h.hobby) AS hobby_number FROM people p LEFT JOIN hobby h ON p.id = h.id GROUP BY p.id, p.name").show()



In [None]:
# 2. Show all hobbies that no one has.
# Space for your solution

# Left-join hobbies to people; hobbies without a matching person get a null name
from pyspark.sql import functions as F
ta = persons_df.alias('ta')
tb = hobbies_df.alias('tb')

left_join = tb.join(ta, tb.id == ta.id, how='left').filter(F.col('ta.name').isNull())

left_join.createOrReplaceTempView("new_sql")
spark.sql("select hobby from new_sql").show()


In [None]:
# Alternative approach to the second task (uses the temp views registered above)
spark.sql("select id, hobby from hobby where id not in (select id from people)").show()
