In [12]:
"""

DATA ENGINEERING ETL PIPELINE - XETRA DATASET

2: Reading multiple files and cleansing data.

Aim:
Write a production-ready ETL pipeline using Python and pandas.

Overview:
Xetra is a German stock exchange based in Frankfurt and operated by Deutsche Börse Group.
Data related to daily trading activity is stored publicly in an Amazon S3 bucket.
(Update: as of July 2022 the data is no longer available, so an archival S3 bucket is used instead.)

Task:
Use Jupyter Notebook as a prototyping tool to extract and transform source data.
Request and extract source data from cloud-based web services.
Use loops and iteration to read and consolidate multiple source files.
Familiarise yourself with pandas functions for cleaning output data.

The steps to be performed are outlined below:

    1) Import the necessary libraries and functions for the project.
    2) Create variables to define the Amazon S3 cloud resource we're going to call.
    3) Retrieve the trading data from the Amazon S3 bucket named 'xetra-1234'.
    4) Filter the bucket's objects by date prefix and concatenate the results into one bucket list.
    5) Extract the body of each object (csv) and read it into a pandas dataframe:
        - Initialisation step, where the csv column template is taken from the first element of the bucket list.
        - Iteration step, where every list element (including the first) is read and concatenated to the pandas dataframe.
    6) Print the column names and remove any that are unnecessary.
    7) Remove any records with missing values.
    8) Print the data frame object.
    
"""


In [13]:
import boto3 #AWS service management package.
import pandas as pd #Data analysis library.
from io import StringIO #String buffer to read CSV files.

In [14]:
s3 = boto3.resource('s3') #Use the Amazon S3 cloud storage resource.
bucket = s3.Bucket('xetra-1234') #Create instance of the "xetra" data bucket.

In [15]:
bucket_obj1 = bucket.objects.filter(Prefix='2022-01-28/') #Objects whose keys start with the 2022-01-28 date prefix.
bucket_obj2 = bucket.objects.filter(Prefix='2022-02-28/') #Objects whose keys start with the 2022-02-28 date prefix.
bucket_objects = list(bucket_obj1) + list(bucket_obj2) #Combine both filtered collections into one bucket list.
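
The two hard-coded prefixes above could be generalised to a date range. The following is a minimal sketch, not part of the original notebook: `date_prefixes` and `list_objects` are hypothetical helper names, and it assumes the bucket stores one `YYYY-MM-DD/` prefix per day, as the two prefixes above suggest.

```python
from datetime import date, timedelta


def date_prefixes(start, end):
    """Return one 'YYYY-MM-DD/' prefix per calendar day from start to end inclusive."""
    days = (end - start).days
    return [f"{(start + timedelta(days=n)).isoformat()}/" for n in range(days + 1)]


def list_objects(bucket, prefixes):
    """Collect the objects found under each prefix into one flat list.

    `bucket` is assumed to behave like the boto3 Bucket used above,
    i.e. it exposes bucket.objects.filter(Prefix=...).
    """
    objects = []
    for prefix in prefixes:
        objects.extend(bucket.objects.filter(Prefix=prefix))
    return objects


prefixes = date_prefixes(date(2022, 1, 28), date(2022, 1, 30))
print(prefixes)  # ['2022-01-28/', '2022-01-29/', '2022-01-30/']
```

With the `bucket` defined in the earlier cell, `list_objects(bucket, prefixes)` would play the role of `bucket_objects`. Days without data simply contribute no objects.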

In [16]:
#Read csv body of dataset into pandas dataframe - initialisation step:
csv_obj_init = bucket.Object(key=bucket_objects[0].key).get().get('Body') #Get the streaming body of the first object in the bucket list.
csv_obj_init = csv_obj_init.read().decode('utf-8') #Read the body and decode it into a utf-8 string.
data = StringIO(csv_obj_init) #Wrap the string in a file-like buffer that pandas can read.
df_init = pd.read_csv(data, delimiter=',') #Read data into pandas data frame.
df_all = pd.DataFrame(columns=df_init.columns) #Initialise an empty df_all with the df_init columns.

In [17]:
#Read csv body of dataset into pandas dataframe - iteration step:
for obj in bucket_objects:
    csv_obj = bucket.Object(key=obj.key).get().get('Body') #Get the streaming body of the current list element.
    csv_obj = csv_obj.read().decode('utf-8') #Read the body and decode it into a utf-8 string.
    data = StringIO(csv_obj) #Wrap the string in a file-like buffer that pandas can read.
    df = pd.read_csv(data, delimiter=',') #Read data as pandas data frame.
    df_all = pd.concat([df, df_all]) #Stack the new rows on top of the master dataframe (each file keeps its own row labels).
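
The initialisation and iteration steps repeat the same download-and-parse logic, and calling `pd.concat` inside the loop copies the growing dataframe on every pass. One possible alternative, sketched here under assumptions, is to wrap the read in a function and concatenate once at the end. `read_csv_object` and `read_all` are hypothetical names that do not appear in the original notebook, and `bucket` is assumed to behave like the boto3 Bucket created earlier.

```python
from io import StringIO

import pandas as pd


def read_csv_object(bucket, key):
    """Download one csv object from the bucket and parse it into a dataframe."""
    body = bucket.Object(key=key).get().get('Body')  # streaming body of the object
    text = body.read().decode('utf-8')               # decode bytes into a string
    return pd.read_csv(StringIO(text), delimiter=',')


def read_all(bucket, keys):
    """Read every key and concatenate the results in a single pd.concat call.

    Note: pd.concat raises ValueError if `keys` is empty.
    """
    frames = [read_csv_object(bucket, key) for key in keys]
    return pd.concat(frames, ignore_index=True)      # renumber rows 0..n-1
```

A call such as `read_all(bucket, [obj.key for obj in bucket_objects])` would then stand in for both cells. Unlike the loop above, this keeps the files in list order and gives every row a unique label.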

In [18]:
csv_obj #Display the last csv object read; this file holds only the header row, which lists the column names.

'ISIN,Mnemonic,SecurityDesc,SecurityType,Currency,SecurityID,Date,Time,StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume,NumberOfTrades\r\n'
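
The last object read contains nothing but the header row. As a small illustration (a sketch, not part of the original notebook), parsing such a string yields an empty dataframe that still carries all 14 column names, so it adds no rows when concatenated:

```python
from io import StringIO

import pandas as pd

# Header line copied from the output above.
header_only = (
    'ISIN,Mnemonic,SecurityDesc,SecurityType,Currency,SecurityID,Date,Time,'
    'StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume,NumberOfTrades\r\n'
)

df_empty = pd.read_csv(StringIO(header_only))
print(df_empty.shape)  # (0, 14)
print(df_empty.empty)  # True
```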

In [19]:
#Remove unnecessary columns by storing the required column names in a variable and passing it to .loc.
columns_use = ['ISIN', 'Date', 'Time', 'StartPrice', 'MaxPrice', 'MinPrice', 'EndPrice', 'TradedVolume']
df_all = df_all.loc[:,columns_use]
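
An alternative to selecting columns after loading is to let `pd.read_csv` skip the unwanted ones through its `usecols` parameter. The sketch below uses a single made-up data row (the ISIN and prices are placeholders, not real Xetra data) to show the idea:

```python
from io import StringIO

import pandas as pd

columns_use = ['ISIN', 'Date', 'Time', 'StartPrice', 'MaxPrice', 'MinPrice', 'EndPrice', 'TradedVolume']

# Hypothetical sample: the real header followed by one invented record.
sample_csv = (
    'ISIN,Mnemonic,SecurityDesc,SecurityType,Currency,SecurityID,Date,Time,'
    'StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume,NumberOfTrades\n'
    'XX0000000001,ABC,EXAMPLE CO,Common stock,EUR,1,2022-01-28,08:59,10.0,10.5,9.5,10.2,100,3\n'
)

# Only the listed columns are parsed; the other six are never loaded.
df_sample = pd.read_csv(StringIO(sample_csv), usecols=columns_use)
print(df_sample.shape)  # (1, 8)
```

Columns come back in file order rather than in the order of `columns_use`; here the two happen to coincide.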

In [20]:
df_all.dropna(inplace=True) #Drop every row that contains at least one missing value.
df_all.shape #Show (rows, columns) after the drop; compare with the row count before it to see how many were removed.

(257248, 8)
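
The shape alone does not show how many rows `dropna` removed or which columns held the gaps. A minimal sketch on an invented three-row frame (the values are placeholders, not Xetra data) shows one way to record both:

```python
import pandas as pd

# Hypothetical demo frame with two incomplete rows.
df_demo = pd.DataFrame({
    'ISIN': ['AA', 'BB', None],
    'StartPrice': [1.0, None, 3.0],
    'TradedVolume': [10, 20, 30],
})

rows_before = len(df_demo)
missing_per_column = df_demo.isna().sum()  # count of missing values in each column
df_clean = df_demo.dropna()                # returns a new frame, df_demo is unchanged
rows_dropped = rows_before - len(df_clean)

print(missing_per_column)
print(rows_dropped)  # 2
```

Applied to `df_all`, the same pattern would need `rows_before` captured before the in-place drop.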

In [21]:
df_all #Print data frame object.

Unnamed: 0,ISIN,Date,Time,StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume
0,US98956P1021,2022-02-28,20:30,113.100,113.100,113.100,113.100,0
1,US9224171002,2022-02-28,20:30,24.600,24.600,24.600,24.600,0
2,IT0005143547,2022-02-28,20:30,3.100,3.100,3.100,3.100,0
0,CA0679011084,2022-02-28,16:00,20.215,20.215,20.185,20.185,60
1,CA32076V1031,2022-02-28,16:00,10.060,10.060,10.060,10.060,11
...,...,...,...,...,...,...,...,...
16728,DK0061539921,2022-01-28,08:59,23.270,23.270,23.270,23.270,37
16729,DE000A3E5D56,2022-01-28,08:59,30.060,30.080,30.060,30.080,218
16730,DE000A3E5D64,2022-01-28,08:59,38.100,38.100,38.100,38.100,34
16731,FR0000121147,2022-01-28,08:59,38.840,38.840,38.840,38.840,40
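
The row labels in the output above restart at 0 for each source file, because `pd.concat` keeps every frame's original index by default. If unique labels are wanted, `reset_index(drop=True)` is one option. A small sketch with invented values:

```python
import pandas as pd

# Two hypothetical per-file frames, each labelled from 0.
part_a = pd.DataFrame({'ISIN': ['AA', 'BB']})
part_b = pd.DataFrame({'ISIN': ['CC', 'DD']})

stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# drop=True discards the old labels instead of keeping them as a column.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

For the notebook's frame this would be `df_all = df_all.reset_index(drop=True)`.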
