# Data Application Example

#### Import the necessary libraries, such as pandas and boto3. For more information, visit the websites listed below:
- https://pandas.pydata.org/docs/getting_started/overview.html
- https://aws.amazon.com/sdk-for-python/
- https://apscheduler.readthedocs.io/en/3.x/

In [2]:
import pandas as pd
import boto3 as bt
from apscheduler.schedulers.background import BackgroundScheduler

#### The function below reads the parquet file from a relative path and handles the case where the file doesn't exist or its content is invalid. For more information about <i>fastparquet</i>, see: https://fastparquet.readthedocs.io/en/latest/

In [4]:
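def parquet_reader(relative_path=""):
    try:
        # read_parquet already returns a DataFrame, so no extra wrapping is needed
        return pd.read_parquet(relative_path, engine='fastparquet')
    except Exception as error:
        # Report the failure and return None so the caller can check the result
        print(f"Unable to locate the parquet file or the content is invalid :( ({error})")
        return None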
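
One way to make the validation more explicit is to check the path before handing it to pandas. The sketch below is a hypothetical helper (`resolve_parquet_path` is not part of the original notebook) that uses only the standard library. It assumes callers would rather receive an exception than a printed message.

```python
from pathlib import Path


def resolve_parquet_path(relative_path):
    """Return an absolute Path for an existing file, or raise.

    Hypothetical helper: the name and behavior are illustrative assumptions.
    """
    if not relative_path:
        raise ValueError("A path to the parquet file is required")
    # Resolve relative segments such as '../fixtures' against the working directory
    path = Path(relative_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No parquet file found at {path}")
    return path
```

With this in place, a call such as `parquet_reader(str(resolve_parquet_path(relative_path)))` would fail early with a specific error when the file is missing, which separates "file not found" from "content is invalid".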

In [5]:
relative_path = '../fixtures/sample-file-assessment.snappy.parquet'

data_frame = parquet_reader(relative_path)
display(data_frame)

Unnamed: 0,name,value,start_date,end_date,year_week,has_subtrackers,token,dataplatform_inserted_at,country,os_name
0,Website,0.0,2020-12-17,2020-12-17,2020_51,True,17xptrn,2020-12-24 13:12:53.538505,GB,ios
1,Email,0.0,2020-12-17,2020-12-17,2020_51,True,dy3vti6,2020-12-24 13:12:53.538505,GB,ios
2,Instagram Installs,1.0,2020-12-17,2020-12-17,2020_51,True,ew373nn,2020-12-24 13:12:53.538505,GB,ios
3,Google Ads UAC,7.0,2020-12-17,2020-12-17,2020_51,True,f91turk,2020-12-24 13:12:53.538505,GB,ios
4,Facebook Installs,1.0,2020-12-17,2020-12-17,2020_51,True,iongxw7,2020-12-24 13:12:53.538505,GB,ios
...,...,...,...,...,...,...,...,...,...,...
9325,Display,0.0,2021-12-12,2021-12-12,2021_49,True,y9o2eiy,2021-12-14 03:00:31.907207,ALL,all
9326,Email,0.0,2021-12-13,2021-12-13,2021_50,True,dy3vti6,2021-12-14 03:00:32.113487,ALL,all
9327,Partnerships,0.0,2021-12-13,2021-12-13,2021_50,True,er8cjmc,2021-12-14 03:00:32.113487,ALL,all
9328,Organic,0.0,2021-12-13,2021-12-13,2021_50,False,oyojl1x,2021-12-14 03:00:32.113487,ALL,all


#### The function defined below removes the named column from the specified DataFrame. If no column name is specified, it removes the first column of the DataFrame.

In [6]:
def drop_column(column_name="", data_frame=None):
    if data_frame is None:
        print("DataFrame can't be empty :)")
        return
    else:
        try:
            if not column_name:
                column_name = data_frame.columns[0]
            data_frame.drop(columns=column_name, inplace=True)
        except KeyError:
            print("The column does not exist :(, please try another value")
    return data_frame

In [47]:
column_name = 'name'
drop_column(column_name, data_frame)
display(data_frame)

The column does not exist :(, please try another value


Unnamed: 0,value,start_date,end_date,year_week,has_subtrackers,token,dataplatform_inserted_at,country,os_name
0,0.0,2020-12-17,2020-12-17,2020_51,True,17xptrn,2020-12-24 13:12:53.538505,GB,ios
1,0.0,2020-12-17,2020-12-17,2020_51,True,dy3vti6,2020-12-24 13:12:53.538505,GB,ios
2,1.0,2020-12-17,2020-12-17,2020_51,True,ew373nn,2020-12-24 13:12:53.538505,GB,ios
3,7.0,2020-12-17,2020-12-17,2020_51,True,f91turk,2020-12-24 13:12:53.538505,GB,ios
4,1.0,2020-12-17,2020-12-17,2020_51,True,iongxw7,2020-12-24 13:12:53.538505,GB,ios
...,...,...,...,...,...,...,...,...,...
9325,0.0,2021-12-12,2021-12-12,2021_49,True,y9o2eiy,2021-12-14 03:00:31.907207,ALL,all
9326,0.0,2021-12-13,2021-12-13,2021_50,True,dy3vti6,2021-12-14 03:00:32.113487,ALL,all
9327,0.0,2021-12-13,2021-12-13,2021_50,True,er8cjmc,2021-12-14 03:00:32.113487,ALL,all
9328,0.0,2021-12-13,2021-12-13,2021_50,False,oyojl1x,2021-12-14 03:00:32.113487,ALL,all
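
Because `drop_column` uses `inplace=True`, it modifies the DataFrame it receives, so running the same cell again on an already-modified frame raises the `KeyError` branch. The message printed above is consistent with the cell having been executed more than once. As a minimal sketch, assuming a non-mutating variant is acceptable, the hypothetical `drop_column_copy` below returns a new DataFrame and tolerates repeated calls.

```python
import pandas as pd


def drop_column_copy(data_frame, column_name=None):
    """Return a new DataFrame without `column_name`; the input is untouched.

    Hypothetical variant of drop_column, shown for illustration only.
    """
    if column_name is None:
        # Fall back to the first column, mirroring the original behavior
        column_name = data_frame.columns[0]
    # errors="ignore" makes a repeated call a no-op instead of a KeyError
    return data_frame.drop(columns=column_name, errors="ignore")


# Small illustrative frame, not the real dataset
sample = pd.DataFrame({"name": ["Website", "Email"], "value": [0.0, 0.0]})
once = drop_column_copy(sample, "name")
twice = drop_column_copy(once, "name")  # second call changes nothing
```

The trade-off is that a misspelled column name is silently ignored, so this variant suits re-runnable notebook cells more than strict pipelines.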


#### Below, a lambda function uses the <i>boto3</i> library to upload the file named <i>test0</i> directly to the specified S3 bucket (an example of usage is provided in the last line).

In [49]:
file_name = 'test0'
data_frame.to_parquet(file_name)

s3_uploader = lambda s3_key='', bucket_name='', data_frame_file='' : bt.resource('s3').meta.client.upload_file(data_frame_file, bucket_name, s3_key)

# s3_uploader(result_name, bucket_name, parquet_file_path)
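
The lambda above takes its arguments in a different order from the one it passes to `upload_file`, which is easy to mix up. The sketch below is a hypothetical named-function alternative (`upload_file_to_s3` is not part of the original notebook). It assumes the caller supplies an S3 client, for example one created with `bt.client('s3')`, and it passes boto3's documented `Filename`, `Bucket` and `Key` parameters by keyword.

```python
def upload_file_to_s3(s3_client, file_path, bucket_name, s3_key):
    """Upload a local file to S3 through an already-created client.

    Hypothetical helper; `s3_client` is expected to expose boto3's
    upload_file(Filename, Bucket, Key) method.
    """
    if not (file_path and bucket_name and s3_key):
        raise ValueError("file_path, bucket_name and s3_key are all required")
    # Keyword arguments make the mapping explicit regardless of call order
    s3_client.upload_file(Filename=file_path, Bucket=bucket_name, Key=s3_key)
    # Return the destination URI so the caller can log or reuse it
    return f"s3://{bucket_name}/{s3_key}"
```

With AWS credentials configured, this might be called as `upload_file_to_s3(bt.client('s3'), file_name, bucket_name, s3_key)`, where `bucket_name` and `s3_key` are placeholders for real values. Passing the client in also lets the function be exercised with a stand-in object instead of a live bucket.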