# Task 1

In [0]:
# Deleting tables left from previous runs in case they still exist after deleting an inactive cluster
dbutils.fs.rm("/user", recurse=True)

Out[1]: True

In [0]:
# We need to install 'ipython_unittest' to run unittests in a Jupyter notebook
!pip install -q ipython_unittest

You should consider upgrading via the '/databricks/python3/bin/python -m pip install --upgrade pip' command.


In [0]:
# Loading PySpark modules that we need
import unittest
from collections import Counter
from pyspark.sql import DataFrame
from pyspark.sql.types import *

#### Subtask 1: defining the schema for the data
Typically, the first thing to do before loading the data into a Spark cluster is to define the schema for the data. Look at the schema for 'badges' and try to define the schema for other tables similarly.

In [0]:
# Defining a schema for 'badges' table
badges_schema = StructType([StructField('UserId', IntegerType(), False),
                            StructField('Name', StringType(), False),
                            StructField('Date', TimestampType(), False),
                            StructField('Class', IntegerType(), False)])

# Defining a schema for 'posts' table
posts_schema = StructType([StructField('Id', IntegerType(), False),
                            StructField('ParentId', IntegerType(), True),
                            StructField('PostTypeId', IntegerType(), False),
                            StructField('CreationDate', TimestampType(), False),
                            StructField('Score', IntegerType(), False),
                            StructField('ViewCount', IntegerType(), False),
                            StructField('Body', StringType(), False),
                            StructField('OwnerUserId', IntegerType(), False),
                            StructField('LastActivityDate', TimestampType(), False),
                            StructField('Title', StringType(), True),
                            StructField('Tags', StringType(), True),
                            StructField('AnswerCount', IntegerType(), True),
                            StructField('CommentCount', IntegerType(), False),
                            StructField('FavoriteCount', IntegerType(), True),
                            StructField('CloseDate', TimestampType(), True)])


# Defining a schema for 'users' table
users_schema = StructType([StructField('Id', IntegerType(), False),
                            StructField('Reputation', IntegerType(), False),
                            StructField('CreationDate', TimestampType(), False),
                            StructField('DisplayName', StringType(), False),
                            StructField('LastAccessDate', TimestampType(), False),
                            StructField('AboutMe', StringType(), False),
                            StructField('Views', IntegerType(), False),
                            StructField('UpVotes', IntegerType(), False),
                            StructField('DownVotes', IntegerType(), False)])

# Defining a schema for 'comments' table
comments_schema = StructType([StructField('PostId', IntegerType(), False),
                            StructField('Score', IntegerType(), False),
                            StructField('Text', StringType(), False),
                            StructField('CreationDate', TimestampType(), False),
                            StructField('UserId', IntegerType(), False)])


#### Subtask 2: implementing two helper functions
Next, we need to implement two helper functions:
1. 'load_csv' receives the path of a CSV file and a schema, loads the file at that path into a Spark DataFrame, and returns the DataFrame;
2. 'save_df' receives a Spark DataFrame and a table name and saves the DataFrame as a Parquet file on DBFS.

Note that the column separator in the CSV files is the TAB character ('\t') and that the first row contains the column names.

DBFS is the distributed file system that Databricks Community Edition uses to store and access data.
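
To make the file layout concrete, here is a small plain-Python sketch (no Spark involved) of how a gzipped, tab-separated file with a header row is parsed. The sample rows are made up for illustration and are not taken from the project data. The 'load_csv' helper itself relies on Spark's CSV reader rather than on this approach.

```python
import csv
import gzip
import io

# Hypothetical sample rows, invented for this sketch.
raw = (
    "UserId\tName\tDate\tClass\n"
    "1\tTeacher\t2014-05-13 23:58:30\t3\n"
    "2\tStudent\t2014-05-14 00:01:12\t3\n"
)

# Compress in memory so the sketch needs no files on disk.
compressed = gzip.compress(raw.encode("utf-8"))

# gzip handles decompression; DictReader takes the first row as column names
# and splits every following row on the TAB character.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as handle:
    rows = list(csv.DictReader(handle, delimiter="\t"))

print(list(rows[0].keys()))
print(rows[0]["Name"])
```

Note that every value comes back as a string here. In the Spark version, the schema passed to 'load_csv' is what turns the columns into integers and timestamps.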

In [0]:
def load_csv(source_file: "path for the CSV file to load", schema: "schema for the CSV file being loaded as a DataFrame") -> DataFrame:
    df = (spark.read
          .format("csv")
          .option("header", "true")
          .option("sep", '\t')
          .schema(schema)
          .load(source_file)
    )
    return df

def save_df(df: "DataFrame to be saved", table_name: "name under which the DataFrame will be saved") -> None:
    df.write.saveAsTable(table_name)
    

In [0]:
# Loading 'ipython_unittest' so we can use '%%unittest_main' magic command
%load_ext ipython_unittest

#### Subtask 3: validating the implementation by running the tests

Run the cell below and make sure that all the tests run successfully. Moreover, at the end there should be four Parquet files named 'badges', 'comments', 'posts', and 'users' in '/user/hive/warehouse'.

Note that we assume the data for the project has already been stored on DBFS under '/FileStore/tables/' as 'badges_csv.gz', 'comments_csv.gz', 'posts_csv.gz', and 'users_csv.gz'.

In [0]:
%%unittest_main
class TestTask1(unittest.TestCase):
   
    # test 1
    def test_load_badges(self):
        result = load_csv(source_file="/FileStore/tables/badges_csv.gz", schema=badges_schema)
        self.assertIsNotNone(result, "Badges dataframe did not load successfully")
        self.assertIsInstance(result, DataFrame, "Result type is not of spark.sql.DataFrame")
        self.assertEqual(result.count(), 105640, "Number of records is not correct")

        column_names = Counter(map(str.lower, ['UserId', 'Name', 'Date', 'Class']))
        self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)),
                              "Missing column(s) or column name mismatch")
    
    # test 2
    def test_load_posts(self):
        result = load_csv(source_file="/FileStore/tables/posts_csv.gz", schema=posts_schema)
        self.assertIsNotNone(result, "Posts dataframe did not load successfully")
        self.assertIsInstance(result, DataFrame, "Result type is not of spark.sql.DataFrame")
        self.assertEqual(result.count(), 61432, "Number of records is not correct")

        column_names = Counter(map(str.lower,
                                   ['Id', 'ParentId', 'PostTypeId', 'CreationDate', 'Score', 'ViewCount', 'Body', 'OwnerUserId',
                                    'LastActivityDate', 'Title', 'Tags', 'AnswerCount', 'CommentCount', 'FavoriteCount',
                                    'CloseDate']))
        self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)),
                              "Missing column(s) or column name mismatch")
    
    # test 3
    def test_load_comments(self):
        result = load_csv(source_file="/FileStore/tables/comments_csv.gz", schema=comments_schema)
        self.assertIsNotNone(result, "Comments dataframe did not load successfully")
        self.assertIsInstance(result, DataFrame, "Result type is not of spark.sql.DataFrame")
        self.assertEqual(result.count(), 58735, "Number of records is not correct")

        column_names = Counter(map(str.lower, ['PostId', 'Score', 'Text', 'CreationDate', 'UserId']))
        self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)),
                              "Missing column(s) or column name mismatch")
    
    # test 4
    def test_load_users(self):
        result = load_csv(source_file="/FileStore/tables/users_csv.gz", schema=users_schema)
        self.assertIsNotNone(result, "Users dataframe did not load successfully")
        self.assertIsInstance(result, DataFrame, "Result type is not of spark.sql.DataFrame")
        self.assertEqual(result.count(), 91616, "Number of records is not correct")

        column_names = Counter(map(str.lower,
                                   ['Id', 'Reputation', 'CreationDate', 'DisplayName', 'LastAccessDate', 'AboutMe',
                                    'Views', 'UpVotes', 'DownVotes']))
        self.assertCountEqual(column_names, Counter(map(str.lower, result.columns)),
                              "Missing column(s) or column name mismatch")
    
    # test 5
    def test_save_dfs(self):
        dfs = [("/FileStore/tables/users_csv.gz", users_schema, "users"),
               ("/FileStore/tables/badges_csv.gz", badges_schema, "badges"),
               ("/FileStore/tables/comments_csv.gz", comments_schema, "comments"),
               ("/FileStore/tables/posts_csv.gz", posts_schema, "posts")
               ]

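        # Unpack each (path, schema, table name) tuple instead of indexing into it
        for source_file, schema, table_name in dfs:
            df = load_csv(source_file=source_file, schema=schema)
            save_df(df, table_name)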



Success

.....
----------------------------------------------------------------------
Ran 5 tests in 92.868s

OK
Out[7]: <unittest.runner.TextTestResult run=5 errors=0 failures=0>
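
The column checks in the tests compare names case-insensitively by lower-casing both sides. The sketch below shows the same idea as a standalone helper in plain Python. The name 'same_columns' is hypothetical and is not part of the notebook. One detail worth knowing: when 'assertCountEqual' is given two 'Counter' objects it iterates over their keys, so a duplicated column name would go unnoticed, whereas comparing the counters with '==' also compares how often each name occurs.

```python
from collections import Counter

def same_columns(expected, actual):
    """Case-insensitive comparison of column names, including multiplicity."""
    return Counter(map(str.lower, expected)) == Counter(map(str.lower, actual))

# Same names, different case: treated as equal.
matching = same_columns(['UserId', 'Name', 'Date', 'Class'],
                        ['userid', 'name', 'date', 'class'])

# One column missing: not equal.
missing = same_columns(['UserId', 'Name', 'Date', 'Class'],
                       ['userid', 'name', 'date'])

# A duplicated column is also detected, because the counts differ.
duplicated = same_columns(['Id', 'Score'], ['id', 'id', 'score'])

print(matching, missing, duplicated)
```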

#### Subtask 4: answering questions about Spark-related concepts

Please write a short description for each of the terms below, one to two short paragraphs per term. Don't copy-paste; instead, write your own understanding.

1. What do the terms 'Spark Application', 'SparkSession', 'Transformations', 'Action', and 'Lazy Evaluation' mean in the context of Spark?

Write your descriptions in the next cell.



A Spark application is a program that uses Apache Spark to process data. It consists of a driver program, which runs the application's main code, and a set of executors that run on the worker nodes of a cluster. The driver's responsibility is to split the work into tasks and schedule them on the executors. Each task is executed by an executor on a worker node.

A SparkSession is the entry point to Spark's functionality in an application. It wraps the underlying SparkContext and provides a single interface for running SQL queries, reading and writing data, and performing DataFrame operations.

Transformations are functions that take an existing RDD as input and return a new RDD; the input RDD is immutable and is not modified. Narrow transformations include map() and flatMap(), and wide transformations include distinct() and groupByKey(). The difference lies in where the data needed to compute a record resides: for a narrow transformation it comes from a single partition of the parent RDD, whereas for a wide transformation it may come from many partitions of the parent RDD, which requires shuffling data between partitions.
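
The narrow/wide distinction can be illustrated with a small plain-Python simulation. This is only a sketch of the idea and not Spark code: partitions are modelled as lists, and the names 'narrow_map' and 'wide_group_by_key' as well as the toy partitioner are assumptions made for this example.

```python
from collections import defaultdict

# Two "partitions" of (key, value) records, modelled as plain lists.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

def narrow_map(parts, fn):
    """Narrow: each output partition is computed from exactly one input partition."""
    return [[fn(record) for record in part] for part in parts]

def wide_group_by_key(parts, num_partitions):
    """Wide: records are redistributed by key, so an output partition may
    need data from every input partition (the shuffle)."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner, not the one Spark uses.
            target = sum(key.encode("utf-8")) % num_partitions
            shuffled[target][key].append(value)
    return [dict(bucket) for bucket in shuffled]

doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(partitions, num_partitions=2)

print(doubled)
print(grouped)
```

In the grouped result, the values for key "a" end up together even though they started in different input partitions, which is exactly the data movement a narrow transformation never needs.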

Actions are distinct from transformations. Instead of returning a new RDD or DataFrame, an action returns a plain value to the driver or writes data to storage. For example, count() is an action that returns the number of rows in a DataFrame.

Lazy evaluation means that Spark does not execute anything until an action is triggered. When an action is called, Spark looks at all the transformations recorded up to that point and builds a plan of the operations that must be executed to produce the action's result.
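
As a rough analogy in plain Python, generators behave in a similarly lazy way: building the pipeline does no work, and only consuming it does. This is only an analogy under that assumption. Spark records transformations in an execution plan rather than using generators, and the 'log' list and 'numbers' function below are invented for this sketch.

```python
log = []

def numbers(n):
    """Yield 0..n-1 and record each read so we can see when work happens."""
    for i in range(n):
        log.append(f"read {i}")
        yield i

# Building the pipeline runs nothing yet, like chaining transformations.
evens = (x for x in numbers(6) if x % 2 == 0)
squares = (x * x for x in evens)
reads_before = len(log)

# Consuming the pipeline plays the role of an action and triggers the work.
total = sum(squares)
reads_after = len(log)

print(reads_before, total, reads_after)
```

No element is read while the pipeline is being defined; all six reads happen only once 'sum' asks for a result.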
