
# Processing Big Data - Data Ingestion
© Explore Data Science Academy

## Honour Code
I {**NDIVHUHO JUDITH**, **MAGWEDE**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).
Non-compliance with the honour code constitutes a material breach of contract.



## Context

To work constructively with any dataset, one needs to create an ingestion profile to make sure that the data at the source can be readily consumed. For this section of the predict, as the Data Engineer in the team, you will be required to design and implement the ingestion process. For this project, an AWS S3 bucket will act as your data source. All the data required can be found [here](https://processing-big-data-predict-stocks-data.s3.eu-west-1.amazonaws.com/stocks.zip).

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/DataIngestion.jpg"
     alt="Data Ingestion"
     style="float: center; padding-bottom=0.5em"
     width=40%/>
     <p><em>Figure 1. Data Ingestion</em></p>
</div>

Your manager, Gnissecorp Atadgib, knowing very well that you've recently completed your Data Engineering qualification, asks you to make use of Apache Spark for the ingestion as well as the rest of the project. His rationale is that stock market data is generated every day, is quite time-sensitive, and will require scalability when deployed to a production environment.

## Dataset - US Nasdaq




<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/data_engineering/transform/predict/Nasdaq.png"
     alt="Nasdaq"
     style="float: center; padding-bottom=0.5em"
     width=50%/>
     <p><em>Figure 2. Nasdaq</em></p>
</div>

The data that you will be working with is a historical snapshot of market data taken from the Nasdaq electronic market. This dataset contains historical daily prices for all tickers currently trading on Nasdaq. The up-to-date list can be found on their [website](https://www.nasdaq.com/).

The provided data contains prices from 02 January 1962 to 01 April 2020. The data in the S3 bucket is stored in the following structure:

```
     stocks/<Year>/<Month>/<Day>/stocks.csv
```
The CSV file for each trading day contains the following fields:
- **Date** - specifies trading date
- **Open** - opening price
- **High** - maximum price during the day
- **Low** - minimum price during the day
- **Close** - close price adjusted for splits
- **Adj Close** - close price adjusted for both dividends and splits
- **Volume** - the number of shares that changed hands during a given day
- **stock** - ticker symbol of the stock (present as an eighth column in the sample file inspected later in this notebook)
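
The directory layout above lends itself to wildcard paths when only part of the dataset is needed. The following is a minimal sketch that builds such a pattern for a single year. The helper name `year_glob` and the default `stocks` root are illustrative assumptions; in practice the root would be wherever the archive was extracted.

```python
def year_glob(year: int, root: str = "stocks") -> str:
    """Return a wildcard path matching every daily file for one year.

    `root` is a hypothetical location of the extracted archive. The
    <Month> and <Day> levels are left as wildcards, so no assumption is
    made about whether those directory names are zero-padded.
    """
    return f"{root}/{year}/*/*/stocks.csv"


print(year_glob(1962))  # stocks/1962/*/*/stocks.csv
```

Spark's file readers accept glob patterns in paths, so a pattern like this could be passed to `spark.read.csv` in place of a single file name.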

## Basic initialisation
To get you started, let's import some basic Python libraries as well as Spark modules and functions.

In [35]:
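# Install Java 8, which Spark runs on
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack the Spark 3.1.1 distribution built for Hadoop 3.2
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
# findspark locates the unpacked Spark; py4j bridges Python and the JVM
!pip install -q findspark py4j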





In [36]:
# Pin PySpark to the same version as the Spark distribution downloaded above
!pip install pyspark==3.1.1




In [38]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

In [53]:
import pandas as pd
import numpy as np
import findspark
findspark.init()
import matplotlib.pyplot as plt

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

In [40]:
!pip install fsspec



In [41]:
!pip install s3fs



In [42]:
import s3fs

In [43]:
import pyspark
print(pyspark.__version__)

3.1.1


Remember that we need a `SparkContext` and `SparkSession` to interface with Spark.
We will mostly use the `SparkContext` to interact with RDDs and the `SparkSession` to work with DataFrames.

> ℹ️ **Instructions** ℹ️
>
>Initialise a new **Spark Context** and **Session** that you will use to interface with Spark.

In [44]:
#TODO: Write your code here

spark = SparkSession.builder.master("local[*]").appName("YourAppName").getOrCreate()

# Access the SparkContext from SparkSession
sc = spark.sparkContext

## Investigate dataset schema
At this point, it is enough to read in a single file to ascertain the data structure. You will be required to use the information obtained from the small subset to create a data schema. This data schema will be used when reading the entire dataset using Spark.

> ℹ️ **Instructions** ℹ️
>
>Make use of Pandas to read in a single file and investigate the plausible data types to be used when creating a Spark data schema.
>
>*You may use as many coding cells as necessary.*

In [57]:
from google.colab import files
uploaded = files.upload()

Saving stocks.csv to stocks.csv


In [58]:
import io

In [59]:
# Assuming the file is uploaded correctly
df_sample = pd.read_csv(io.BytesIO(uploaded['stocks.csv']))

# Display the data types of the columns
print(df_sample.dtypes)

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
stock         object
dtype: object
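
One way to carry these findings over to Spark is to translate the pandas dtype names into Spark SQL type names. The sketch below does this for the dtypes printed above and builds a DDL-style schema string. The mapping table `PANDAS_TO_SPARK` and the helper `to_ddl` are hypothetical names introduced for illustration, and the mapping covers only the dtypes that appear in this sample plus `int64`.

```python
# dtypes as printed by pandas for the sample file
pandas_dtypes = {
    "Date": "object",
    "Open": "float64",
    "High": "float64",
    "Low": "float64",
    "Close": "float64",
    "Adj Close": "float64",
    "Volume": "float64",
    "stock": "object",
}

# Hypothetical mapping from pandas dtype names to Spark SQL type names
PANDAS_TO_SPARK = {"object": "STRING", "float64": "DOUBLE", "int64": "BIGINT"}


def to_ddl(dtypes: dict) -> str:
    """Build a DDL schema string; backticks protect names containing spaces."""
    return ", ".join(
        f"`{name}` {PANDAS_TO_SPARK[dtype]}" for name, dtype in dtypes.items()
    )


print(to_ddl(pandas_dtypes))
```

Note that pandas reports `Date` as `object` because it reads dates as plain strings unless told otherwise, so a date type for that column would still have to be chosen by hand.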


## Read CSV files

When working with big data, processing the entire dataset on every run during development is rarely practical because it is so time-consuming. If the data is uniform, it is sufficient to work with a smaller subset to create basic functionality. Your manager has identified the year **1962** to perform the initial testing for data ingestion.

> ℹ️ **Instructions** ℹ️
>
>Read in the data for **1962** using a data schema that purely uses string data types. You will be required to convert to the appropriate data types at a later stage.
>
>*You may use as many coding cells as necessary.*

In [15]:
# Use the name of the uploaded CSV file as the input path
file_path_1962 = next(iter(uploaded))

In [16]:
#TODO: Write your code here

# Define the schema with string data types
schema = StructType([
    StructField("Date", StringType(), True),
    StructField("Open", StringType(), True),
    StructField("High", StringType(), True),
    StructField("Low", StringType(), True),
    StructField("Close", StringType(), True),
    StructField("Adj Close", StringType(), True),
    StructField("Volume", StringType(), True)
])

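# Read the data for 1962 using the defined schema.
# Note: header=True is not passed, so the CSV header line is read as a data row
# (it appears as the first row of the output below). The file's eighth column,
# stock, is not in the schema and is therefore not loaded.
df_1962 = spark.read.csv(file_path_1962, schema=schema)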

# Show the first few rows of the dataframe
df_1962.show()

+----------+-------------------+-------------------+-------------------+-------------------+--------------------+---------+
|      Date|               Open|               High|                Low|              Close|           Adj Close|   Volume|
+----------+-------------------+-------------------+-------------------+-------------------+--------------------+---------+
|      Date|               Open|               High|                Low|              Close|           Adj Close|   Volume|
|1962-01-02| 6.5321550369262695|  6.556184768676758| 6.5321550369262695| 6.5321550369262695|  1.5366575717926023|  55900.0|
|1962-01-02|  6.125843524932861|  6.160982131958008|  6.125843524932861|  6.125843524932861|  1.4146506786346436|  59700.0|
|1962-01-02| 0.8374485373497009| 0.8374485373497009| 0.8230452537536621| 0.8230452537536621|  0.1457476019859314| 352200.0|
|1962-01-02| 1.6041666269302368| 1.6197916269302368| 1.5885416269302368| 1.6041666269302368| 0.13695742189884186| 163200.0|
|1962-01

## Update column names
To make the data easier to work with, you will need to make a few changes:
1. Column headers should all be in lowercase; and
2. Whitespaces should be replaced with underscores.
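
The two rules above can be captured in a small helper that is independent of Spark. This is a sketch only: the name `normalise_column` is hypothetical, and collapsing a run of several whitespace characters into a single underscore is a design choice that goes slightly beyond the rules as stated.

```python
import re


def normalise_column(name: str) -> str:
    """Lowercase a header and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", name.strip().lower())


headers = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]
print([normalise_column(h) for h in headers])
# ['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume']
```

The resulting names can then be applied to a dataframe in one step with `toDF`, which takes the new column names as positional arguments.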


> ℹ️ **Instructions** ℹ️
>
>Make sure that the column headers are all in lowercase and that any whitespaces are replaced with underscores.
>
>*You may use as many coding cells as necessary.*

In [17]:
#TODO: Write your code here
# Update column names to lowercase and replace whitespaces with underscores
df_1962 = df_1962.toDF(*[c.lower().replace(' ', '_') for c in df_1962.columns])

# Show the updated column names
df_1962.show()

+----------+-------------------+-------------------+-------------------+-------------------+--------------------+---------+
|      date|               open|               high|                low|              close|           adj_close|   volume|
+----------+-------------------+-------------------+-------------------+-------------------+--------------------+---------+
|      Date|               Open|               High|                Low|              Close|           Adj Close|   Volume|
|1962-01-02| 6.5321550369262695|  6.556184768676758| 6.5321550369262695| 6.5321550369262695|  1.5366575717926023|  55900.0|
|1962-01-02|  6.125843524932861|  6.160982131958008|  6.125843524932861|  6.125843524932861|  1.4146506786346436|  59700.0|
|1962-01-02| 0.8374485373497009| 0.8374485373497009| 0.8230452537536621| 0.8230452537536621|  0.1457476019859314| 352200.0|
|1962-01-02| 1.6041666269302368| 1.6197916269302368| 1.5885416269302368| 1.6041666269302368| 0.13695742189884186| 163200.0|
|1962-01

## Null Values
Null values often represent missing pieces of data. It is always good to know where your null values lie so that you can quickly identify and remedy any issues stemming from them.

> ℹ️ **Instructions** ℹ️
>
>Write code to count the number of null values found in each column.
>
>*You may use as many coding cells as necessary.*

In [18]:
#TODO: Write your code here
# Count the number of null values in each column
from pyspark.sql.functions import isnan, when, count, col

df_1962.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_1962.columns]).show()

+----+----+----+---+-----+---------+------+
|date|open|high|low|close|adj_close|volume|
+----+----+----+---+-----+---------+------+
|   0|   0|   0|  0|    0|        0|     0|
+----+----+----+---+-----+---------+------+



## Data type conversion - The final data schema

Now that we have identified the number of missing values in the data set, we'll move on to convert our data schema to the required data types.

> ℹ️ **Instructions** ℹ️
>
>Use typecasting to convert the string data types in your current data schema to more appropriate data types.
>
>*You may use as many coding cells as necessary.*

In [19]:
from pyspark.sql.types import DateType, DoubleType

In [22]:
df_1962.columns

['Date', 'Open', 'High', 'Low', 'Close', 'adj_close', 'volume']

In [23]:
#TODO: Write your code here
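# Convert string data types to appropriate data types.
# Note: Spark resolves column names case-insensitively by default, so
# withColumn("Date", ...) replaces the existing `date` column and adopts the
# capitalised name, which is why the schema below is no longer all lowercase.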
df_1962 = df_1962.withColumn("Date", df_1962["Date"].cast(DateType()))
df_1962 = df_1962.withColumn("Open", df_1962["Open"].cast(DoubleType()))
df_1962 = df_1962.withColumn("High", df_1962["High"].cast(DoubleType()))
df_1962 = df_1962.withColumn("Low", df_1962["Low"].cast(DoubleType()))
df_1962 = df_1962.withColumn("Close", df_1962["Close"].cast(DoubleType()))
df_1962 = df_1962.withColumn("adj_close", df_1962["adj_close"].cast(DoubleType()))
df_1962 = df_1962.withColumn("Volume", df_1962["Volume"].cast(IntegerType()))

# Show the updated data types
df_1962.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- adj_close: double (nullable = true)
 |-- Volume: integer (nullable = true)



## Consolidate missing values
We have to check if the data type conversion above was done correctly.
If the casting was not successful, a null value gets inserted into the dataframe. You can thus check for successful conversion by determining if any null values are included in the resulting dataframe.
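
The reasoning above can be illustrated without Spark. The sketch below mimics a lenient cast that yields `None` for unparseable text and compares null counts before and after. The helper names and the four-value sample column are made up for illustration; the sample deliberately includes a header string read as data and one genuine gap.

```python
from typing import List, Optional


def cast_double(value: Optional[str]) -> Optional[float]:
    """Mimic a lenient cast: return None when the text is not a number."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def count_nulls(values: List[Optional[object]]) -> int:
    """Count the entries that are None."""
    return sum(v is None for v in values)


# Illustrative column: a stray header string, two prices, and one missing value
raw_open = ["Open", "6.5321550369262695", None, "0.8374485373497009"]
converted = [cast_double(v) for v in raw_open]

before = count_nulls(raw_open)
after = count_nulls(converted)
print(f"nulls before: {before}, after: {after}, introduced by cast: {after - before}")
# nulls before: 1, after: 2, introduced by cast: 1
```

The difference between the two counts is the number of values that failed to convert, which is the figure this section asks you to determine. Rows must not be dropped before the second count, otherwise the failed casts disappear from it.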


> ℹ️ **Instructions** ℹ️
>
>Write code to compare the number of invalid entries (nulls) pre-conversion and post-conversion.
>
>*You may use as many coding cells as necessary.*

In [27]:
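#TODO: Write your code here
# Count the number of null values in each column after conversion
import pyspark.sql.functions as F

# Note: na.drop() removes every row that contains a null, so the counts
# below are zero by construction. Count on df_1962 itself to see how many
# values the casts turned into nulls.
df_1962_converted = df_1962.na.drop()

# Check for null values in every column
df_1962_converted.select([
    F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_1962_converted.columns
]).show()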

+----+----+----+---+-----+---------+------+
|Date|Open|High|Low|Close|adj_close|Volume|
+----+----+----+---+-----+---------+------+
|   0|   0|   0|  0|    0|        0|     0|
+----+----+----+---+-----+---------+------+



Here you should be able to see if any of your casts went wrong.
Do not attempt to correct any missing values at this point. This will be dealt with in later sections of the predict.

## Generate parquet files
When writing data from Spark, we typically use the Parquet format. Parquet is a columnar format that Spark can write in parallel, and each file stores its schema alongside the data.

When writing, make sure that the data is sufficiently partitioned.

As a rule of thumb, aim for roughly one partition per 200 MB of data, although the right number also depends on the size of your cluster and executors.
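
As a worked example of this rule, the sketch below turns a size in bytes into a partition count. It mirrors the formula used in this notebook's size-estimation cell (floor division by 200 MB with a floor of two partitions); the function name `partition_count` and its parameter names are hypothetical.

```python
def partition_count(size_bytes: int, target_mb: int = 200, minimum: int = 2) -> int:
    """Return one partition per `target_mb` of data, but never fewer than `minimum`."""
    size_mb = size_bytes / 1_000_000
    return max(int(size_mb / target_mb), minimum)


print(partition_count(6_536_016))      # about 6.5 MB -> 2 (the minimum applies)
print(partition_count(1_000_000_000))  # 1000 MB -> 5
```

For small inputs the minimum dominates, so a dataframe of a few megabytes is still written as two partitions.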


### Check the size of the dataframe before partitioning

In [61]:
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

In [64]:
df = spark.read.csv('stocks.csv', header=True, inferSchema=True)


In [65]:
df.show(5)


+----------+------------------+------------------+------------------+------------------+-------------------+--------+-----+
|      Date|              Open|              High|               Low|             Close|          Adj Close|  Volume|stock|
+----------+------------------+------------------+------------------+------------------+-------------------+--------+-----+
|1962-01-02|6.5321550369262695| 6.556184768676758|6.5321550369262695|6.5321550369262695| 1.5366575717926023| 55900.0|   AA|
|1962-01-02| 6.125843524932861| 6.160982131958008| 6.125843524932861| 6.125843524932861| 1.4146506786346436| 59700.0| ARNC|
|1962-01-02|0.8374485373497009|0.8374485373497009|0.8230452537536621|0.8230452537536621| 0.1457476019859314|352200.0|   BA|
|1962-01-02|1.6041666269302368|1.6197916269302368|1.5885416269302368|1.6041666269302368|0.13695742189884186|163200.0|  CAT|
|1962-01-02|               0.0|3.2961308956146236|3.2440476417541504|3.2961308956146236| 0.0519925132393837|105600.0|  CVX|
+-------

In [66]:
rdd = df.rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
obj = rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
size = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(obj)
size_MB = size/1000000
partitions = max(int(size_MB/200), 2)
print(f'The dataframe is {size_MB} MB')

The dataframe is 6.536016 MB


### Write parquet files to the local directory
> ℹ️ **Instructions** ℹ️
>
> Use the **coalesce** function and the number of **partitions** derived above to write parquet files to your local directory
>
>*You may use as many coding cells as necessary.*

In [69]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [70]:
#TODO: Write your code here
# Check the size of the dataframe before partitioning
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

rdd = df.rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
obj = rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
size = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(obj)
size_MB = size / 1000000
partitions = max(int(size_MB / 200), 2)
print(f'The dataframe is {size_MB} MB')

# Write parquet files to the local directory.
# coalesce(n) merges existing partitions down to at most n without a full shuffle.
df_1962_converted.coalesce(partitions).write.parquet('/content/drive/My Drive/Processing data')


The dataframe is 6.540016 MB
