In [8]:
from pyspark.sql import SparkSession
import numpy as np
import pandas as pd
import datetime
import pyspark.sql.functions as F

## Setup

Before doing anything else, run the setup cell below, which creates a CSV file called `data.csv` in this directory.

## TODO: 
1. Put some NaNs into the DF.

In [2]:
# Run me!
!python ./data_setup.py

---

## Making the Instance and Getting the Data

Here we're going to create a Spark session, the entry point that wraps the Spark context.  If you need to restart it, go to "Kernel > Restart Kernel."  In general, if anything Spark-related goes wrong, restarting the kernel is the first thing to try.

Problems are given by section below.  Note that you *will have to import some additional modules from pyspark; not all required imports are given above*.

---

### 1.1 Importing with inferSchema

Import the `data.csv` file.  Use `inferSchema=True` to infer the schema.

---

### 1.2 Importing with Structs

Import the `data.csv` file again.  This time, build the schema explicitly from `StructField`s inside a `StructType` instead of inferring it.

---

### 1.3 Some Example Queries of the Data

1. Select only those rows whose `categorical_col` is `Low` and whose `int_col` value is negative.
2. Group by `categorical_col`, giving the sum and the average of `int_col` as new columns.
3. Group by `categorical_col`, giving the sum and the average of `int_col` as new columns, and show only the groups whose average is greater than 0.

In [10]:
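# Requires an active SparkSession bound to the name `spark`.
# Note: despite its name, `rdd` holds a DataFrame, not an RDD.
file_loc = "./data.csv"
rdd = spark.read.format("csv") \
        .option("inferSchema", True) \
        .option("header", True) \
        .load(file_loc)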


In [15]:
# Query 1: chained filters act as a logical AND.
rdd.filter(rdd.categorical_col == "Low") \
   .filter(rdd.int_col < 0) \
   .show()

+-------------------+--------------------+-------+---------------+--------+
|       datetime_col|           float_col|int_col|categorical_col|bool_col|
+-------------------+--------------------+-------+---------------+--------+
|2018-01-01 00:00:00|  0.5748556785619214|    -99|            Low|    true|
|2018-01-01 05:00:00|  0.7361655134141251|    -82|            Low|    true|
|2018-01-02 03:00:00|  0.9940334727502934|    -99|            Low|   false|
|2018-01-02 04:00:00|   0.726596701630739|     -8|            Low|   false|
|2018-01-02 10:00:00| 0.20751173210042095|    -61|            Low|   false|
|2018-01-02 13:00:00|  0.6195021839986666|    -12|            Low|   false|
|2018-01-02 14:00:00|  0.7408211423975096|    -17|            Low|   false|
|2018-01-02 22:00:00|   0.976770375639157|    -29|            Low|   false|
|2018-01-03 03:00:00|   0.774084033537829|    -14|            Low|   false|
|2018-01-03 06:00:00|   0.808864300373038|    -90|            Low|   false|
|2018-01-03 