# Processing Big Data - Data Ingestion
© Explore Data Science Academy

## Honour Code
I ***JEROME, PIALAT***, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).
    Non-compliance with the honour code constitutes a material breach of contract.



## Context 

To work constructively with any dataset, you first need an ingestion profile that ensures the data at the source can be readily consumed. For this section of the predict, as the Data Engineer in the team, you will be required to design and implement the ingestion process. For the purposes of the project, an AWS S3 bucket will act as your data source. All the data required can be found [here](https://processing-big-data-predict-stocks-data.s3.eu-west-1.amazonaws.com/stocks.zip).

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/DataIngestion.jpg"
     alt="Data Ingestion"
     style="float: center; padding-bottom=0.5em"
     width=40%/>
     <p><em>Figure 1. Data Ingestion</em></p>
</div>

Your manager, Gnissecorp Atadgib, knowing very well that you've recently completed your Data Engineering qualification, asks you to make use of Apache Spark for the ingestion as well as the rest of the project. His rationale is that stock market data is generated every day, is quite time-sensitive, and will require scalability when deployed to a production environment.

## Dataset - US Nasdaq




<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/Nasdaq.png"
     alt="Nasdaq"
     style="float: center; padding-bottom=0.5em"
     width=50%/>
     <p><em>Figure 2. Nasdaq</em></p>
</div>

The data that you will be working with is a historical snapshot of market data taken from the Nasdaq electronic market. This dataset contains historical daily prices for all tickers currently trading on Nasdaq. The up-to-date list can be found on their [website](https://www.nasdaq.com/).


The provided data contains price data from 02 January 1962 to 01 April 2020. The data found in the S3 bucket has been stored in the following structure:

```
     stocks/<Year>/<Month>/<Day>/stocks.csv
```
The CSV file for each trading day contains the following columns:
- **Date** - specifies trading date
- **Open** - opening price
- **High** - maximum price during the day
- **Low** - minimum price during the day
- **Close** - close price adjusted for splits
- **Adj Close** - close price adjusted for both dividends and splits
- **Volume** - the number of shares that changed hands during a given day
- **stock** - the ticker symbol of the stock
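
Because the bucket layout is regular, the location of any trading day's file can be derived from its date. The following is a minimal sketch, assuming the month and day directories are zero-padded to two digits (as in the `1962\01\02` path used later in this notebook). `path_for_day` is a hypothetical helper name, not part of the provided material.

```python
from datetime import date


def path_for_day(day: date, root: str = "stocks") -> str:
    """Build the relative path of the CSV file for one trading day."""
    # Assumption: month and day folders are zero-padded, e.g. 1962/01/02.
    return f"{root}/{day.year}/{day.month:02d}/{day.day:02d}/stocks.csv"


print(path_for_day(date(1962, 1, 2)))
# stocks/1962/01/02/stocks.csv
```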

## Basic initialisation
To get you started, let's import some basic Python libraries as well as Spark modules and functions.

In [84]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import os

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

Remember that we need a `SparkContext` and a `SparkSession` to interface with Spark.
We will mostly use the `SparkContext` to interact with RDDs and the `SparkSession` to work with DataFrames.

> ℹ️ **Instructions** ℹ️
>
>Initialise a new **Spark Context** and **Session** that you will use to interface with Spark.

In [85]:
#TODO: Write your code here
spark = SparkSession.builder.getOrCreate()  # create a session, or reuse the active one
sc = spark.sparkContext  # the context that backs the session

## Investigate dataset schema
At this point, it is enough to read in a single file to ascertain the data structure. You will be required to use the information obtained from the small subset to create a data schema. This data schema will be used when reading the entire dataset using Spark.

> ℹ️ **Instructions** ℹ️
>
>Make use of Pandas to read in a single file and investigate the plausible data types to be used when creating a Spark data schema. 
>
>*You may use as many coding cells as necessary.*

In [86]:
#TODO: Write your code here
# Read a single day's file with Spark; the header row supplies the column names
raw_df = spark.read.csv(r'C:\Users\Jpialat\OneDrive - Raging River Trading (Pty) Ltd\Documents\stocks\1962\01\02\stocks.csv', header=True)

In [87]:
raw_df.toPandas()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,stock
0,1962-01-02,6.5321550369262695,6.556184768676758,6.5321550369262695,6.5321550369262695,1.5366575717926023,55900.0,AA
1,1962-01-02,6.125843524932861,6.160982131958008,6.125843524932861,6.125843524932861,1.4146506786346436,59700.0,ARNC
2,1962-01-02,0.8374485373497009,0.8374485373497009,0.8230452537536621,0.8230452537536621,0.1457476019859314,352200.0,BA
3,1962-01-02,1.6041666269302368,1.6197916269302368,1.5885416269302368,1.6041666269302368,0.1369574218988418,163200.0,CAT
4,1962-01-02,0.0,3.296130895614624,3.2440476417541504,3.296130895614624,0.0519925132393837,105600.0,CVX
5,1962-01-02,0.092908389866352,0.0960261151194572,0.092908389866352,0.092908389866352,0.0355172455310821,817400.0,DIS
6,1962-01-02,0.0,30.6875,30.375,30.375,0.5069432854652405,1600.0,DTE
7,1962-01-02,0.0,10.28125,10.125,10.125,0.2273507714271545,25600.0,ED
8,1962-01-02,0.0,7.6875,7.541666507720947,7.583333492279053,0.9789445996284484,49200.0,FL
9,1962-01-02,0.7512019276618958,0.7637219429016113,0.7436898946762085,0.7486979365348816,0.0017815923783928,2156500.0,GE


In [88]:
raw_df.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Adj Close: string (nullable = true)
 |-- Volume: string (nullable = true)
 |-- stock: string (nullable = true)
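
The instruction above asks for Pandas, so the same inspection can also be done there. The following is a minimal sketch that uses two rows copied from the sample output above as an in-memory stand-in for a real file; to inspect real data, swap the `StringIO` object for the path to one day's `stocks.csv`.

```python
import io

import pandas as pd

# Two rows copied from the sample output above, standing in for a real file.
sample_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume,stock\n"
    "1962-01-02,6.5321550369262695,6.556184768676758,6.5321550369262695,"
    "6.5321550369262695,1.5366575717926023,55900.0,AA\n"
    "1962-01-02,6.125843524932861,6.160982131958008,6.125843524932861,"
    "6.125843524932861,1.4146506786346436,59700.0,ARNC\n"
)

sample_df = pd.read_csv(sample_csv)

# Pandas infers a dtype per column; these hint at the Spark types to use later.
print(sample_df.dtypes)
```

On this sample the price columns and `Volume` are inferred as `float64`, while `Date` and `stock` are read as text. That suggests a date type, floating-point types, an integer type (the volumes are whole numbers) and a string type for the final Spark schema.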



## Read CSV files

When working with big data, processing the entire data batch on every run during development is often impractical because it is so time-consuming. If the data is uniform, a smaller subset is sufficient for building basic functionality. Your manager has identified the year **1962** for the initial testing of data ingestion.

> ℹ️ **Instructions** ℹ️
>
>Read in the data for **1962** using a data schema that purely uses string data types. You will be required to convert to the appropriate data types at a later stage.
>
>*You may use as many coding cells as necessary.*

In [89]:
directory = r'C:\Users\Jpialat\OneDrive - Raging River Trading (Pty) Ltd\Documents\stocks\1962'
li = []  # paths of every CSV file found under the 1962 directory
for root, subdirectories, files in os.walk(directory):  # walk all month/day folders
    for file in files:
        if file.endswith('.csv'):
            li.append(os.path.join(root, file))

In [90]:
# Read every 1962 file; with schema inference disabled, all columns load as strings
subset_1962 = spark.read.format("csv") \
  .option("header", "true") \
  .option("inferSchema", "false") \
  .load(li)

subset_1962.show(1000)

+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|      Date|                Open|                High|                 Low|               Close|           Adj Close|   Volume|stock|
+----------+--------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|1962-02-19|   5.839290142059326|   5.907374858856201|   5.839290142059326|   5.863319873809815|  1.3863292932510376|  29900.0|   AA|
|1962-02-19|   5.481634140014648|  5.5284857749938965|   5.481634140014648|   5.516772747039795|  1.2804527282714844|  32000.0| ARNC|
|1962-02-19|  0.9074074029922484|  0.9156378507614136|  0.8991769552230835|   0.903292179107666|  0.1614154428243637| 619400.0|   BA|
|1962-02-19|  1.6770833730697632|  1.6927083730697632|  1.6614583730697632|  1.6770833730697632|  0.1440587043762207| 170400.0|  CAT|
|1962-02-19|                 0.0|  3.5788691043853764|        
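
One way to make the all-string requirement explicit is to pass a schema to the reader instead of relying on its defaults. The following is a minimal sketch, assuming the eight column names shown in the output above. `string_schema_ddl` is a hypothetical helper that builds a DDL-formatted schema string, which `DataFrameReader.schema` and the `schema` argument of `spark.read.csv` accept in place of a `StructType`.

```python
def string_schema_ddl(columns):
    """Return a DDL schema string that declares every column as STRING."""
    # Backticks quote names that contain spaces, such as "Adj Close".
    return ", ".join(f"`{name}` STRING" for name in columns)


columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume", "stock"]
schema_ddl = string_schema_ddl(columns)
print(schema_ddl)

# Usage with the session and file list defined earlier (not executed here):
# subset_1962 = spark.read.csv(li, header=True, schema=schema_ddl)
```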

## Update column names
To make the data easier to work with, you will need to make a few changes:
1. Column headers should all be in lowercase; and
2. Whitespaces should be replaced with underscores.


> ℹ️ **Instructions** ℹ️
>
>Make sure that the column headers are all in lowercase and that any whitespaces are replaced with underscores.
>
>*You may use as many coding cells as necessary.*

In [91]:
#TODO: Write your code here
new_df = subset_1962.select('*')

In [92]:
for col in new_df.columns:
    new_df = new_df.withColumnRenamed(col, col.lower().replace(' ','_'))

In [93]:
new_df.show(5)

+----------+------------------+------------------+------------------+------------------+------------------+--------+-----+
|      date|              open|              high|               low|             close|         adj_close|  volume|stock|
+----------+------------------+------------------+------------------+------------------+------------------+--------+-----+
|1962-02-19| 5.839290142059326| 5.907374858856201| 5.839290142059326| 5.863319873809815|1.3863292932510376| 29900.0|   AA|
|1962-02-19| 5.481634140014648|5.5284857749938965| 5.481634140014648| 5.516772747039795|1.2804527282714844| 32000.0| ARNC|
|1962-02-19|0.9074074029922484|0.9156378507614136|0.8991769552230835| 0.903292179107666|0.1614154428243637|619400.0|   BA|
|1962-02-19|1.6770833730697632|1.6927083730697632|1.6614583730697632|1.6770833730697632|0.1440587043762207|170400.0|  CAT|
|1962-02-19|               0.0|3.5788691043853764|              20.0| 3.549107074737549|0.0565012246370315|273600.0|  CVX|
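+----------+------------------+------------------+------------------+------------------+------------------+--------+-----+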
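
The renaming rule can also be kept in a small standalone function, which makes it easy to check without a Spark session. This is a minimal sketch: `normalise_column_name` is a hypothetical helper, and collapsing repeated whitespace into a single underscore is an assumption that goes slightly beyond the two rules above.

```python
import re


def normalise_column_name(name: str) -> str:
    """Lowercase a column header and replace whitespace with underscores."""
    # Assumption: surrounding whitespace is dropped and runs of whitespace
    # collapse into one underscore.
    return re.sub(r"\s+", "_", name.strip().lower())


print([normalise_column_name(c) for c in ["Date", "Adj Close", "stock"]])
# ['date', 'adj_close', 'stock']

# Usage with the DataFrame above (not executed here):
# new_df = subset_1962.toDF(*[normalise_column_name(c) for c in subset_1962.columns])
```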

## Null Values
Null values often represent missing pieces of data. It is always good to know where your null values lie - so you can quickly identify and remedy any issues stemming from these.

> ℹ️ **Instructions** ℹ️
>
>Write code to count the number of null values found in each column.
>
>*You may use as many coding cells as necessary.*

In [94]:
#TODO: Write your code here
# Count the null values in each column.

missing_count = {}  # null count per column
_total_count = new_df.count()  # total number of rows, computed once
for column in new_df.columns:   # loop through each column
    _count = new_df.where(new_df[column].isNull()).count()  # null count in this column
    print(f'There are {_count} ({round(_count/_total_count*100, 3)}%) null values in {column} column')  # report count and percentage
    missing_count[column] = _count  # record the result in the missing_count dictionary


There are 0 (0.0%) null values in date column
There are 0 (0.0%) null values in open column
There are 0 (0.0%) null values in high column
There are 22 (0.431%) null values in low column
There are 0 (0.0%) null values in close column
There are 0 (0.0%) null values in adj_close column
There are 21 (0.411%) null values in volume column
There are 0 (0.0%) null values in stock column
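
The loop above launches a separate Spark job for every column. On larger data it may be cheaper to count all nulls in a single pass. The following is a minimal sketch, assuming Spark SQL expression strings are acceptable here; `null_count_exprs` is a hypothetical helper that only builds the expressions.

```python
def null_count_exprs(columns):
    """Build one Spark SQL aggregate expression per column that counts nulls."""
    # `col IS NULL` is cast to 0/1 and summed; backticks guard unusual names.
    return [f"SUM(CAST(`{c}` IS NULL AS INT)) AS `{c}`" for c in columns]


exprs = null_count_exprs(["low", "volume"])
print(exprs)

# Usage with the DataFrame above (not executed here); one aggregation job in total:
# missing_count = new_df.selectExpr(*null_count_exprs(new_df.columns)).first().asDict()
```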


## Data type conversion - The final data schema

Now that we have identified the number of missing values in the data set, we'll move on to convert our data schema to the required data types. 

> ℹ️ **Instructions** ℹ️
>
>Use typecasting to convert the string data types in your current data schema to more appropriate data types.
>
>*You may use as many coding cells as necessary.*

In [95]:
#TODO: Write your code here

new_df2 = new_df.withColumn("date", new_df["date"].cast(DateType()))
new_df2 = new_df2.withColumn("open", new_df2["open"].cast(FloatType()))
new_df2 = new_df2.withColumn("high", new_df2["high"].cast(FloatType()))
new_df2 = new_df2.withColumn("low", new_df2["low"].cast(FloatType()))
new_df2 = new_df2.withColumn("close", new_df2["close"].cast(FloatType()))
new_df2 = new_df2.withColumn("adj_close", new_df2["adj_close"].cast(FloatType()))
new_df2 = new_df2.withColumn("volume", new_df2["volume"].cast(IntegerType()))

In [96]:
new_df2.show(1000)

+----------+-----------+-----------+-----------+-----------+------------+-------+-----+
|      date|       open|       high|        low|      close|   adj_close| volume|stock|
+----------+-----------+-----------+-----------+-----------+------------+-------+-----+
|1962-02-19|    5.83929|   5.907375|    5.83929|    5.86332|   1.3863293|  29900|   AA|
|1962-02-19|   5.481634|   5.528486|   5.481634|  5.5167727|   1.2804527|  32000| ARNC|
|1962-02-19|  0.9074074| 0.91563785| 0.89917696|  0.9032922|  0.16141544| 619400|   BA|
|1962-02-19|  1.6770834|  1.6927084|  1.6614584|  1.6770834|   0.1440587| 170400|  CAT|
|1962-02-19|        0.0|   3.578869|       20.0|   3.549107| 0.056501225| 273600|  CVX|
|1962-02-19|0.099767394|0.099767394| 0.09820853| 0.09820853| 0.037543412| 817400|  DIS|
|1962-02-19|        0.0|    29.9375|      29.75|    29.9375|  0.49964145|   1600|  DTE|
|1962-02-19|        0.0|   9.921875|   9.890625|   9.921875|  0.22499175|   8800|   ED|
|1962-02-19|        0.0|  7.0833

In [97]:
new_df2.printSchema()

root
 |-- date: date (nullable = true)
 |-- open: float (nullable = true)
 |-- high: float (nullable = true)
 |-- low: float (nullable = true)
 |-- close: float (nullable = true)
 |-- adj_close: float (nullable = true)
 |-- volume: integer (nullable = true)
 |-- stock: string (nullable = true)
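
The repeated `withColumn` calls can also be driven by a single mapping from column name to target type. This is a minimal sketch: `TARGET_TYPES` and `cast_columns` are hypothetical names, and the type names are the string forms that `Column.cast` accepts alongside `DataType` objects.

```python
# Target type per column, written as Spark SQL type names.
TARGET_TYPES = {
    "date": "date",
    "open": "float",
    "high": "float",
    "low": "float",
    "close": "float",
    "adj_close": "float",
    "volume": "int",
}


def cast_columns(df, target_types):
    """Return a Spark DataFrame with each listed column cast to its target type."""
    for name, type_name in target_types.items():
        # Replace the column in place with its cast version.
        df = df.withColumn(name, df[name].cast(type_name))
    return df


# Usage with the DataFrame above (not executed here):
# new_df2 = cast_columns(new_df, TARGET_TYPES)
```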



## Consolidate missing values
We have to check if the data type conversion above was done correctly.
If the casting was not successful, a null value gets inserted into the dataframe. You can thus check for successful conversion by determining if any null values are included in the resulting dataframe.


> ℹ️ **Instructions** ℹ️
>
>Write code to compare the number of invalid entries (nulls) pre-conversion and post-conversion.
>
>*You may use as many coding cells as necessary.*

In [98]:
#TODO: Write your code here

nulls = {}  # post-conversion null count per column
_total_count = new_df2.count()  # total number of rows, computed once
for column in new_df2.columns:   # loop through each column
    _count = new_df2.where(new_df2[column].isNull()).count()  # null count in this column
    print(f'There are {_count} ({round(_count/_total_count*100, 3)}%) null values in {column} column')  # report count and percentage
    nulls[column] = _count  # record the result in the nulls dictionary

There are 0 (0.0%) null values in date column
There are 0 (0.0%) null values in open column
There are 0 (0.0%) null values in high column
There are 42 (0.823%) null values in low column
There are 0 (0.0%) null values in close column
There are 21 (0.411%) null values in adj_close column
There are 21 (0.411%) null values in volume column
There are 0 (0.0%) null values in stock column


Here you should be able to see if any of your casts went wrong. 
Do not attempt to correct any missing values at this point. This will be dealt with in later sections of the predict.
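
The instruction asks for a comparison, which the two printed lists leave to the reader. The following is a minimal sketch that subtracts the pre-conversion counts from the post-conversion counts. It uses the figures printed above in place of the live `missing_count` and `nulls` dictionaries, and `new_nulls` is a hypothetical helper.

```python
# Null counts copied from the two outputs above.
missing_count = {"date": 0, "open": 0, "high": 0, "low": 22,
                 "close": 0, "adj_close": 0, "volume": 21, "stock": 0}
nulls = {"date": 0, "open": 0, "high": 0, "low": 42,
         "close": 0, "adj_close": 21, "volume": 21, "stock": 0}


def new_nulls(before, after):
    """Return the columns whose null count grew, with the size of the increase."""
    return {c: after[c] - before[c] for c in before if after[c] > before[c]}


print(new_nulls(missing_count, nulls))
# {'low': 20, 'adj_close': 21}
```

On these figures the casts introduced 20 new nulls in `low` and 21 in `adj_close`, which suggests those columns held values that could not be parsed as floats.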

## Generate parquet files
When writing in Spark, we typically use the parquet format. Parquet is a columnar format that Spark can write in parallel, and it stores metadata such as the schema alongside the data.

When writing, it is good to make sure that the data is sufficiently partitioned. 

Generally, data should be partitioned with one partition for every 200MB of data, but this also depends on the size of your cluster and executors. 


### Check the size of the dataframe before partitioning

In [99]:
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

In [101]:
rdd = new_df2.rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
obj = rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
size = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(obj)
size_MB = size/1000000
partitions = max(int(size_MB/200), 2)
print(f'The dataframe is {size_MB} MB')

The dataframe is 53.064496 MB


### Write parquet files to the local directory
> ℹ️ **Instructions** ℹ️
>
> Use the **coalesce** function and the number of **partitions** derived above to write parquet files to your local directory 
>
>*You may use as many coding cells as necessary.*

In [None]:
#TODO: Write your code here
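
The following is a minimal sketch of this step, assuming `new_df2` and `partitions` from the cells above. The helper name `write_parquet`, the output directory `stocks_parquet` and the `overwrite` save mode are assumptions rather than requirements of the predict.

```python
def write_parquet(df, output_dir, partitions):
    """Reduce a Spark DataFrame to `partitions` partitions and write parquet."""
    # coalesce merges existing partitions without a full shuffle, so it can
    # only lower the partition count, never raise it.
    df.coalesce(partitions).write.mode("overwrite").parquet(output_dir)


# Usage with the objects defined above (not executed here):
# write_parquet(new_df2, "stocks_parquet", partitions)
```

Spark writes each partition as its own part file inside the output directory, so with the two partitions derived above the directory would be expected to hold at most two data files.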