## DATA ENGINEERING PIPELINE - INTRASTAT DECLARATION

Aim: Write a production ready data engineering pipeline using python and pandas.

Overview: Intrastat is a system that collects information relating to the trade of goods. This script will transform sample invoice data from a fictious company into a submissable Swedish intrastat declaration.

Task:

Below outlines the steps to be performed:

    01) Import the necessary libraries for the project.
    02) Define the functions that will faciliate the data engineering.
    03) Request intrastat commodity code list URL and read content into pd dataframe.
    04) Request ECB FX rates from URL, parse xml file and read data into pd dataframe.
    05) Cleanse and transform ex.rate data into pivot table.
    06) Request the sample intrastat data url and read content into pd dataframe.
    07) Verify sample data using commodity code list.
    08) Apply daily exchange rate calculation on sample invoice values.
    09) Apply final transformations to intrastat output file. 
    10) Display the content of the prepared file.
    11) Export the content as an excel file, submissible to the Swedish stats authority.

#### Import Packages

In [99]:
import pandas as pd # Data analysis library.
import numpy as np # Array and matrice libary
import ssl # Secure sockets layer package.
import urllib # Url handling module.
import xml.etree.ElementTree as et # XML parsing library.
import datetime as dt # Datetime parsing library.

#### Define Functions

Functions to read, transform and export commodity code data.

In [100]:
def cc_read(url):
    # Function to request commodity code list and read url content into dataframe.
    try:
        # If URL is valid print confirmation.
        urllib.request.urlopen(url)
        print('Message: Requested URL is valid.')
        # Read data into pandas dataframe.
        df = pd.read_excel(url)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            # If URL is invalid print errror.
            print('Error: Requested commodity code url is invalid.')
    return df

def cc_transform(df):
    # Function to convert commodity codes into cn8 format.
    df.iloc[:,0] = df.iloc[:,0].astype("str").str.pad(8, side='left', fillchar='0')
    # Rename first column to CN8. 
    df.columns.values[0] = 'CN8'
    return df

def cc_export(df):
    # Function to export CC codes to csv file 
    df.to_csv('CN8 Codes.csv', index=False)
    
def cc_rte_process(url):
    # Function to run full rte process and display csv output file.
    df_request = cc_read(url)
    df_transform = cc_transform(df_request)
    cc_export(df_transform)
    print('1. Commodity Code List')
    display(pd.read_csv('CN8 Codes.csv', dtype=str))
    return df_transform

Functions to read, transform and export foreign exchange rate data.

In [101]:
def fx_parse_xml(xml_obj, xml_child, xml_namespaces):
    # Function to parse xml content and read into dataframe. 
    try:
        xml_tree = et.parse(xml_obj)
        xml_root = xml_tree.getroot()
        # Find required tags and store data via list comprehension.
        rows = xml_root.findall('.//ex:Cube', namespaces=xml_namespaces)
        xml_data = [[row.get('time'), row.get('currency'), row.get('rate')] for row in rows]
        # Create columns for dataframe and read in content.
        df = pd.DataFrame(xml_data, columns = ['Date', 'Currency', 'Rate'])
        print('Message: Xml data parsing successful.')
    except et.ParseError:
            # Return empty dataframe and print errpr if parsing failed.
            df = pd.DataFrame()
            print('Error: Xml data parsing failed.')
    return df    
    
def fx_create_pivot(df):
    # Function to create fx rate pivot table by date and currency.
    df_out = pd.pivot_table(df, index='Date', columns='Currency', values='Rate')
    # Add weekend dates missing from period to the table index.
    date_idx = pd.date_range(df['Date'].min(), df['Date'].max())
    df_out.index = pd.DatetimeIndex(df_out.index)
    df_out = df_out.reindex(date_idx)
    # Fill forward missing weekend fx rate values. 
    df_out = df_out.ffill(axis=0)
    df_out = df_out.sort_index(ascending=0)
    return df_out

def fx_read(url, xml_child, xml_namespaces):
    # Function to request fx rate url, parse content and read into dataframe.
    try:
        # Request url content. 
        xml_object = urllib.request.urlopen(url)
        df = fx_parse_xml(xml_object, xml_child, xml_namespaces)
    except urllib.error.HTTPError as e:
        if e.code == '404':
            # If URL is invalid create empty dataframe and print error.
            print('Error: Requested URL is invalid.')
    return df

def fx_transform(df):
    # Function to fill empty fx dates, clean data and convert into pivot table. 
    df = df.ffill(axis=0)
    # Drop all other empty rows.
    df = df.dropna()
    # Create ex.rate pivot table.
    df['Rate'] = pd.to_numeric(df['Rate'])
    df_out = fx_create_pivot(df)
    return df_out

def fx_export(df):
    # Function to export to csv file 
    df.index.name = 'Date'
    df.to_csv('ECB FX Rates.csv', index=True)
    
def fx_rte_process(url, xml_child, xml_namespaces):
    # Function to run full rte process and display csv output file.
    df_read = fx_read(url, xml_child, xml_namespaces)
    df_transform = fx_transform(df_read)
    fx_export(df_transform)
    print('2. ECB FX Rate Table')
    display(pd.read_csv('ECB FX Rates.csv', dtype=str))
    return df_transform

Functions to read, check, transform and export source data.

In [102]:
def src_return_mot(mode):
    # Function to return mode of transport code from string.
    mot_switch={'Sea':'1', 'Rail':'2','Road':'3', 'Air':'4'}
    return mot_switch.get(mode,"Invalid mode of transport\n")
    
def src_cc_check(df_src, df_cc):
    # Function to verify src commodity codes match official codes.
    df_src['Commodity Code'] = df_src['Commodity Code'].astype(str)
    df_out = pd.merge(df_src, df_cc, how='left', left_on='Commodity Code', right_on='CN8')
    df_out['CC Check'] = np.where(df_out['Commodity Code'] == df_out['CN8'], 'OK', '`ERROR')
    return df_out

def src_fx_convert(df_src, df_fx):
    # Function to perform fx rate conversion by shipping date.
    df_src['Shipping Date'] = df_src['Shipping Date'].astype(str)
    df_src['Shipping Date'] = pd.to_datetime(df_src['Shipping Date'], format="%d-%m-%Y")
    df_out = pd.merge(df_src, df_fx, how='left', left_on='Shipping Date', right_on=df_fx.index)
    df_out.rename(columns = {'SEK':'EUR to SEK'}, inplace = True)
    df_out['Net (SEK)'] = df_out['Net (EUR)'].astype('float').multiply(df_out['EUR to SEK'].astype('float'))    
    return df_out

def src_read(url):
    # Function to request source data url and read content into dataframe.
    try:
        # If URL is valid print confirmation.
        urllib.request.urlopen(url)
        print('Message: Requested URL is valid.')
        print('3. Intrastat Source File')
        # Read data into pandas dataframe.
        df = pd.read_excel(url)
        display(df)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            # If URL is invalid print errror.
            print('Error: Requested source data url is invalid.')
    return df

def src_analyse(df_src, df_cc, df_fx):
    # Analyse source data through commodity code checks and fx rate conversions.  
    df_cc_check = src_cc_check(df_src, df_cc)
    df_fx_convert = src_fx_convert(df_cc_check, df_fx)
    return df_fx_convert

def src_transform(df_src):
    # Function to transform column data into format for intrastat submission.
    df_src['Mode of Transport'] = [src_return_mot(mode) for mode in df_src['Mode of Transport']] 
    df_src['Partner VAT'] = np.where(df_src['Transaction'] == 'B2C', 'QV999999999999', df_src['Partner VAT'])
    df_src['Mass (KG)'] = df_src['Mass (grams)'].astype('float').multiply(0.001)
    df_src["County of Origin"] = 'CN'
    df_src = df_src.drop(['Description_x', 'Mass (grams)', 'Shipping Date', 'Ship From', 'Incoterms', 'Transaction','CN8','SU', 'Description_y', 'CC Check', 'Net (EUR)', 'EUR to SEK'], axis = 1)
    df_src = df_src[['Ship To', 'Commodity Code','Net (SEK)', 'Quantity', 'Mass (KG)', 'County of Origin', 'Mode of Transport', 'Partner VAT']]    
    return df_src

def src_export(file_name, df):
    # Function to export to excel file.
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.to_excel(file_name, index=False)
    print('\nMessage: Submission file successfully exported.')
    print('4. Intrastat Submission File')
    display(df)

def src_rate_process(url, df_cc, df_fx, filename):
    # Function to perform the full rate process.
    df_read = src_read(url)
    df_analyse = src_analyse(df_read, df_cc, df_fx)
    df_transform = src_transform(df_analyse)
    src_export(filename, df_transform)
    

#### Define Main

In [103]:
def main():
    # Define variables.
    src_url = 'https://github.com/homodudu/Data-Engineering/raw/main/intrastat/Intrastat%20Dispatches%20Data%20Sample.xlsx'
    cc_url = 'https://www.cbs.nl/-/media/cbsvooruwbedrijf/international-trade-in-goods/commoditycodes-2023.xlsx'
    fx_url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml'
    xml_namespaces = {'ex': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}
    xml_child = './/ex:Cube'
    output_file_name = 'Intrastat Submission Sample.xlsx'
    
    # Disable security certificate checks for url requests.
    ssl._create_default_https_context = ssl._create_unverified_context
     
    #Read url references into data frames.
    df_cn8_codes = cc_rte_process(cc_url)
    df_fx_rates = fx_rte_process(fx_url, xml_child, xml_namespaces)
    
    #Run intrastat data check, fx conversion, transformation and export process.
    src_rate_process(src_url, df_cn8_codes, df_fx_rates, output_file_name)
    
# Define main as program entry point if script is running as standalone and not as module.
if __name__=="__main__":
    main()

Message: Requested URL is valid.
1. Commodity Code List


Unnamed: 0,CN8,SU,Description
0,01012100,p/st,Pure-bred breeding horses
1,01012910,p/st,Horses for slaughter
2,01012990,p/st,"Live horses (excl. for slaughter, pure-bred fo..."
3,01013000,p/st,Live asses
4,01019000,p/st,Live mules and hinnies
...,...,...,...
9750,97052900,-,Collections and collectors? pieces of zoologic...
9751,97053100,-,Collections and collectors? pieces of numismat...
9752,97053900,-,Collections and collectors? pieces of numismat...
9753,97061000,-,"Antiques, over 250 years old"


Message: Xml data parsing successful.
2. ECB FX Rate Table


Unnamed: 0,Date,AUD,BGN,BRL,CAD,CHF,CNY,CZK,DKK,GBP,...,NZD,PHP,PLN,RON,SEK,SGD,THB,TRY,USD,ZAR
0,2023-02-01,1.5392,1.9558,5.5174,1.4506,0.998,7.3452,23.775,7.4396,0.88413,...,1.6903,59.318,4.7075,4.9117,11.3455,1.4303,35.841,20.4978,1.0894,18.8328
1,2023-01-31,1.5476,1.9558,5.5373,1.457,1.0032,7.3198,23.792,7.4388,0.88073,...,1.6858,59.192,4.709,4.921,11.348,1.4268,35.787,20.3787,1.0833,18.87755
2,2023-01-30,1.539,1.9558,5.5654,1.4532,1.0045,7.3601,23.861,7.4383,0.87978,...,1.6778,59.47,4.7103,4.9055,11.262,1.431,35.68,20.5063,1.0903,18.90565
3,2023-01-29,1.5289,1.9558,5.5104,1.4479,1.0017,7.369,23.826,7.4378,0.87885,...,1.6759,59.187,4.7085,4.8965,11.2108,1.4277,35.702,20.4365,1.0865,18.80375
4,2023-01-28,1.5289,1.9558,5.5104,1.4479,1.0017,7.369,23.826,7.4378,0.87885,...,1.6759,59.187,4.7085,4.8965,11.2108,1.4277,35.702,20.4365,1.0865,18.80375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,2022-11-08,1.5435,1.9558,5.203,1.3489,0.9911,7.2495,24.326,7.4378,0.87378,...,1.686,58.187,4.6918,4.8978,10.8373,1.4022,37.22,18.5991,0.9996,17.85835
86,2022-11-07,1.5428,1.9558,5.07,1.3464,0.9874,7.2189,24.301,7.4393,0.87135,...,1.6834,58.361,4.6865,4.8855,10.832,1.4022,37.284,18.5875,0.9993,17.799
87,2022-11-06,1.5311,1.9558,4.9682,1.3351,0.9863,7.0894,24.422,7.4419,0.87478,...,1.6769,57.672,4.6825,4.8893,10.8538,1.3891,36.906,18.3845,0.9872,17.7783
88,2022-11-05,1.5311,1.9558,4.9682,1.3351,0.9863,7.0894,24.422,7.4419,0.87478,...,1.6769,57.672,4.6825,4.8893,10.8538,1.3891,36.906,18.3845,0.9872,17.7783


Message: Requested URL is valid.
3. Intrastat Source File


Unnamed: 0,Commodity Code,Description,Mass (grams),Net (EUR),Quantity,Shipping Date,Ship From,Ship To,County of Origin,Mode of Transport,Incoterms,Transaction,Partner VAT
0,61012010,Men's or boys' overcoats,595,190,1,04-01-2023,SE,DE,China,Rail,DAP,B2C,Private Customer
1,61012090,Men's or boys' overcoats,678,150,1,01-01-2023,SE,NL,China,Sea,DDP,B2B,NL999999999999
2,61013010,Men's or boys' overcoats,704,175,1,05-01-2023,SE,ES,China,Road,DAP,B2C,Private Customer
3,61019080,Men's or boys' overcoats,844,135,1,07-01-2023,SE,FR,China,Air,DDP,B2B,FR999999999999
4,61021010,Women's or girls' overcoats,461,145,1,02-01-2023,SE,NL,China,Road,DDP,B2B,NL999999999999
5,61021090,Women's or girls' overcoats,589,160,1,03-01-2023,SE,BE,China,Air,DDP,B2B,BE999999999999
6,61022090,Women's or girls' overcoats,533,155,1,05-01-2023,SE,PT,China,Sea,DAP,B2C,Private Customer
7,61029010,Women's or girls' overcoats,406,180,1,06-01-2023,SE,IT,China,Rail,DAP,B2C,Private Customer



Message: Submission file successfully exported.
4. Intrastat Submission File


Unnamed: 0,Ship To,Commodity Code,Net (SEK),Quantity,Mass (KG),County of Origin,Mode of Transport,Partner VAT
0,DE,61012010,2121.407,1,0.595,CN,2,QV999999999999
1,NL,61012090,1668.27,1,0.678,CN,1,NL999999999999
2,ES,61013010,1957.34,1,0.704,CN,3,QV999999999999
3,FR,61019080,1519.83,1,0.844,CN,4,FR999999999999
4,NL,61021010,1619.1135,1,0.461,CN,3,NL999999999999
5,BE,61021090,1782.88,1,0.589,CN,4,BE999999999999
6,PT,61022090,1733.644,1,0.533,CN,1,QV999999999999
7,IT,61029010,2026.44,1,0.406,CN,2,QV999999999999
