## DATA ENGINEERING PIPELINE - INTRASTAT DECLARATION

Aim: Write a production-ready data engineering pipeline using Python and pandas.

Overview: Intrastat is the system that collects statistics on the trade of goods between EU member states. This script transforms sample invoice data from a fictitious company into a submittable Swedish Intrastat declaration.

Task:

The steps to be performed are outlined below:

    01) Import the necessary libraries for the project.
    02) Define the functions that will facilitate the data engineering.
    03) Read the sample Intrastat data into a pandas DataFrame.
    04) Request the Intrastat commodity code list from its URL and read the content into a pandas DataFrame.
    05) Verify the sample data against the commodity code list.
    06) Request the ECB FX rates from their URL, parse the XML file and read the data into a pandas DataFrame.
    07) Cleanse and transform the exchange rate data into a pivot table.
    08) Apply the daily exchange rate calculation to the sample invoice values.
    09) Apply the final transformations to the Intrastat output file.
    10) Display the content of the prepared file.
    11) Export the content as an Excel file, submittable to the Swedish statistics authority.

#### Import Packages

In [62]:
import pandas as pd # Data analysis library.
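import numpy as np # Array and matrix library.
import ssl # Secure sockets layer package.
import urllib.request # URL handling module (opening URLs).
import urllib.error # URL handling module (exceptions raised by urllib.request).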
import sys # Runtime environment handling module.
import xml.etree.ElementTree as et # XML parsing library.
import datetime as dt # Datetime parsing library.

#### Define Functions

Functions to read, transform and export commodity code data.

In [79]:
def cc_request(url):
    # Request commodity code list from url.
    try:
        # Check that the URL is reachable and close the connection once verified.
        with urllib.request.urlopen(url):
            print('Message: Requested commodity code url is valid.\n')
        # Read data into pandas dataframe.
        df = pd.read_excel(url)
    except urllib.error.URLError:
        # If URL is invalid print error and exit with a non-zero status.
        print('Error: Requested commodity code url is invalid.\n')
        sys.exit(1)
    return df

def cc_transform(df):
    # Rename first column to CN8 (rename avoids mutating the column index in place).
    df = df.rename(columns={df.columns[0]: 'CN8'})
    # Pad left with zeros to the 8-character CN8 format.
    df['CN8'] = df['CN8'].astype("str").str.pad(8, side='left', fillchar='0')
    return df

def cc_export(df):
    # Export to csv file 
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.to_csv('CN8 Codes.csv', index=False)
    
def cc_rte_process(url):
    # Run full rte process and display csv output file.
    df_request = cc_request(url)
    df_transform = cc_transform(df_request)
    cc_export(df_transform)
    print('2. Commodity Code List')
    display(pd.read_csv('CN8 Codes.csv', dtype=str))
    return df_transform
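
The padding step in `cc_transform` matters when the code column is read as numbers, because a numeric value cannot keep a leading zero (`01012100` would arrive as `1012100`). The following is a minimal sketch on a hypothetical three-row frame, not the CBS file itself; the column name `Code` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample: codes held as integers, so leading zeros are lost.
df = pd.DataFrame({
    'Code': [1012100, 1012910, 61012010],
    'Description': ['Pure-bred breeding horses',
                    'Horses for slaughter',
                    "Men's or boys' overcoats"],
})

# Rename the first column, then left-pad with zeros to eight characters.
df = df.rename(columns={df.columns[0]: 'CN8'})
df['CN8'] = df['CN8'].astype(str).str.pad(8, side='left', fillchar='0')

print(df['CN8'].tolist())  # ['01012100', '01012910', '61012010']
```

Codes that already have eight digits pass through unchanged, so the step is safe to apply to the whole column.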

Functions to read, transform and export foreign exchange rate data.

In [72]:
def fx_parse_xml(xml_obj, xml_child, xml_namespaces):
    # Parse xml content. 
    xml_tree = et.parse(xml_obj)
    xml_root = xml_tree.getroot()
    # Find required child element instances and store content via list comprehension.
    rows = xml_root.findall(xml_child, namespaces=xml_namespaces)
    xml_data = [[row.get('time'), row.get('currency'), row.get('rate')] for row in rows]
    # Create columns for dataframe and read in content.
    df = pd.DataFrame(xml_data, columns = ['Date', 'Currency', 'Rate'])
    return df
    
def fx_create_pivot(df):
    # Create fx rate pivot table by date and currency.
    df_out = pd.pivot_table(df, index='Date', columns='Currency', values='Rate')
    # Add weekend dates missing from period to the table index.
    # ISO-formatted date strings sort chronologically, so min/max give the period bounds.
    min_date = df['Date'].min()
    max_date = df['Date'].max()
    date_idx = pd.date_range(min_date, max_date)
    df_out.index = pd.DatetimeIndex(df_out.index)
    df_out = df_out.reindex(date_idx)
    # Fill forward missing weekend fx rate values.
    df_out = df_out.ffill(axis=0)
    # Sort with the most recent date first.
    df_out = df_out.sort_index(ascending=False)
    return df_out

def fx_request(url, xml_child, xml_namespaces):
    # Request foreign exchange rates from url.
    try:
        # If URL is valid print confirmation.
        xml_object = urllib.request.urlopen(url)
        print('\n\nMessage: Requested fx rate url is valid.\n')
        df = fx_parse_xml(xml_object, xml_child, xml_namespaces)
    except urllib.error.URLError:
        # If URL is invalid print error.
        print('\n\nError: Requested fx rate url is invalid.\n')
        sys.exit()
    return df

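def fx_transform(df):
    # Fill forward the date only: each date element precedes its currency rows.
    # Filling every column would copy the previous block's last rate into the date row,
    # which the pivot would then average into that currency's rate.
    df = df.assign(Date=df['Date'].ffill())
    # Drop the date-only and empty rows, which carry no currency or rate.
    df = df.dropna().copy()
    # Create ex.rate pivot table.
    df['Rate'] = pd.to_numeric(df['Rate'])
    df_out = fx_create_pivot(df)
    return df_out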

def fx_export(df):
    # Export to csv file, naming the date index so it gets a column header.
    df.index.name = 'Date'
    df.to_csv('ECB FX Rates.csv', index=True)
    
def fx_rte_process(url, xml_child, xml_namespaces):
    # Run full rte process and display csv output file.
    df_request = fx_request(url, xml_child, xml_namespaces)
    df_transform = fx_transform(df_request)
    fx_export(df_transform)
    print('3. ECB FX Rate Table')
    display(pd.read_csv('ECB FX Rates.csv', dtype=str))
    return df_transform
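
The parse, pivot and fill steps can be tried offline on a small in-memory document. The sketch below is not the notebook's method: instead of collecting every `Cube` element and forward-filling the date, it reads the date from each date-level `Cube` and then walks that element's children. `SAMPLE_XML`, `NS` and `parse_rates` are hypothetical names, and the rates are illustrative values in the shape of the ECB feed rather than a copy of it.

```python
import io
import xml.etree.ElementTree as et

import pandas as pd

# Illustrative XML shaped like the ECB feed; a Friday and the following Monday.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2023-01-09">
      <Cube currency="USD" rate="1.0696"/>
      <Cube currency="SEK" rate="11.196"/>
    </Cube>
    <Cube time="2023-01-06">
      <Cube currency="USD" rate="1.05"/>
      <Cube currency="SEK" rate="11.258"/>
    </Cube>
  </Cube>
</gesmes:Envelope>"""

NS = {'ex': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}


def parse_rates(xml_file):
    """Return one row per (date, currency) pair found in the XML file object."""
    root = et.parse(xml_file).getroot()
    rows = []
    # Date-level cubes are the ones carrying a 'time' attribute.
    for day in root.findall('.//ex:Cube[@time]', namespaces=NS):
        # Their direct children carry the currency and rate.
        for cube in day.findall('ex:Cube', namespaces=NS):
            rows.append([day.get('time'), cube.get('currency'),
                         float(cube.get('rate'))])
    return pd.DataFrame(rows, columns=['Date', 'Currency', 'Rate'])


# urlopen returns bytes, so the sample is wrapped in a bytes buffer as well.
df = parse_rates(io.BytesIO(SAMPLE_XML.encode('utf-8')))

# One row per date, one column per currency.
table = df.pivot(index='Date', columns='Currency', values='Rate')
table.index = pd.to_datetime(table.index)

# Reindex onto every calendar day, then carry the last published rate forward.
all_days = pd.date_range(table.index.min(), table.index.max())
table = table.reindex(all_days).ffill()

print(table)
```

Saturday and Sunday have no published rate, so after the reindex they start out empty and the forward fill gives them Friday's values.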

Functions to read, check, transform and export source data.

In [82]:
def src_read(input_file_name):
    # Read intrastat source data.
    try:
        df = pd.read_excel(input_file_name, dtype = str)
    except (FileNotFoundError, ValueError):
        # If file is not read into data frame print error and exit with a non-zero status.
        print('Error: Source data could not be read.\n')
        sys.exit(1)
    # If file is read into data frame print confirmation.
    print('Message: Source data successfully read.\n')
    print('1. Intrastat Source File')
    display(df)
    return df

def src_return_mot(mode):
    # Switch function for mot codes.
    mot_switch={'Sea':'1', 'Rail':'2','Road':'3', 'Air':'4'}
    return mot_switch.get(mode,"Invalid mode of transport\n")
    
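def src_checks(df_src, df_cc):
    # Look up each source commodity code in the CN8 reference list.
    df_out = pd.merge(df_src, df_cc, how='left', left_on='Commodity Code', right_on='CN8')
    # Flag codes that have no match in the reference list.
    df_out['CC Check'] = np.where(df_out['Commodity Code'] == df_out['CN8'], 'OK', 'ERROR')
    # Replace the partner VAT number with the placeholder used for private (B2C) customers.
    df_out['Partner VAT'] = np.where(df_out['Transaction'] == 'B2C', 'QV999999999999', df_out['Partner VAT'])
    return df_out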

def src_fx_convert(df_src, df_fx):
    # Parse shipping dates (day-month-year) so they match the fx table's datetime index.
    df_src['Shipping Date'] = df_src['Shipping Date'].astype("string")
    df_src['Shipping Date'] = pd.to_datetime(df_src['Shipping Date'], format="%d-%m-%Y")
    # Look up the fx rates that apply on each shipping date.
    df_out = pd.merge(df_src, df_fx, how='left', left_on='Shipping Date', right_on=df_fx.index)
    df_out = df_out.rename(columns = {'SEK':'EUR to SEK'})
    # Convert net invoice values from EUR to SEK.
    df_out['Net (SEK)'] = df_out['Net (EUR)'].astype('float').multiply(df_out['EUR to SEK'].astype('float'))
    return df_out

def src_transform(df_src):
    # Convert mode of transport descriptions to their numeric codes.
    df_src['Mode of Transport'] = [src_return_mot(mode) for mode in df_src['Mode of Transport']]
    # Convert mass from grams to kilograms.
    df_src['Mass (KG)'] = df_src['Mass (grams)'].astype('float').multiply(0.001)
    # Every sample row originates in China, so the country code is set directly.
    df_src["County of Origin"] = 'CN'
    # Keep only the declaration columns, in submission order; all other columns are discarded.
    df_src = df_src[['Ship To', 'Commodity Code','Net (SEK)', 'Quantity', 'Mass (KG)', 'County of Origin', 'Mode of Transport', 'Partner VAT'  ]]
    return df_src

def src_export(file_name, df):
    # Export to excel file 
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.to_excel(file_name, index=False)

def src_rcte_process(df_src, df_cc, df_fx, output_file_name):
    df_check = src_checks(df_src, df_cc)
    df_fx_convert = src_fx_convert(df_check, df_fx)
    df_transform = src_transform(df_fx_convert)
    src_export(output_file_name, df_transform)
    print('\n\n4. Intrastat Submission File')
    display(df_transform)
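
`src_checks` writes a `CC Check` column, but nothing later in the pipeline reads it before the column is dropped. One way to act on the check is sketched below, using the `indicator=True` option of `DataFrame.merge` to report where each key was found. The frames `invoices` and `codes` and the code `99999999` are hypothetical test data, not part of the sample file.

```python
import numpy as np
import pandas as pd

# Hypothetical invoice lines; '99999999' is deliberately absent from the list.
invoices = pd.DataFrame({'Commodity Code': ['61012010', '99999999']})

# Hypothetical extract of the commodity code reference list.
codes = pd.DataFrame({'CN8': ['61012010', '61012090']})

# indicator=True adds a '_merge' column: 'both' when the code was found,
# 'left_only' when it exists in the invoices but not in the reference list.
checked = invoices.merge(codes, how='left', left_on='Commodity Code',
                         right_on='CN8', indicator=True)
checked['CC Check'] = np.where(checked['_merge'] == 'both', 'OK', 'ERROR')

# Collect the failing codes so the run can stop or report before exporting.
invalid = checked.loc[checked['CC Check'] == 'ERROR', 'Commodity Code'].tolist()
print(invalid)  # ['99999999']
```

Whether an invalid code should abort the run or only be reported is a design choice the notebook leaves open.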
    

#### Define Main

In [83]:
def main():
    # Define variables.
    input_file_name = 'Intrastat Dispatches Data Sample.xlsx'
    output_file_name = 'Intrastat Submission Sample.xlsx'
    cc_url = 'https://www.cbs.nl/-/media/cbsvooruwbedrijf/international-trade-in-goods/commoditycodes-2023.xlsx'
    fx_url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml'
    xml_namespaces = {'ex': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}
    xml_child = './/ex:Cube'
    
    # Disable security certificate checks for url requests.
    ssl._create_default_https_context = ssl._create_unverified_context
     
    # Read source file and url references into data frames.
    df_source = src_read(input_file_name)
    df_cn8_codes = cc_rte_process(cc_url)
    df_fx_rates = fx_rte_process(fx_url, xml_child, xml_namespaces)
    
    # Run intrastat data check, fx conversion, transformation and export process.
    src_rcte_process(df_source, df_cn8_codes, df_fx_rates, output_file_name)
    
# Define main as program entry point if script is running as standalone and not as module.
if __name__=="__main__":
    main()

Message: Source data successfully read.

1. Intrastat Source File


| | Commodity Code | Description | Mass (grams) | Net (EUR) | Quantity | Shipping Date | Ship From | Ship To | County of Origin | Mode of Transport | Incoterms | Transaction | Partner VAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 61012010 | Men's or boys' overcoats | 595 | 190 | 1 | 04-01-2023 | SE | DE | China | Rail | DAP | B2C | Private Customer |
| 1 | 61012090 | Men's or boys' overcoats | 678 | 150 | 1 | 01-01-2023 | SE | NL | China | Sea | DDP | B2B | NL999999999999 |
| 2 | 61013010 | Men's or boys' overcoats | 704 | 175 | 1 | 05-01-2023 | SE | ES | China | Road | DAP | B2C | Private Customer |
| 3 | 61019080 | Men's or boys' overcoats | 844 | 135 | 1 | 07-01-2023 | SE | FR | China | Air | DDP | B2B | FR999999999999 |
| 4 | 61021010 | Women's or girls' overcoats | 461 | 145 | 1 | 02-01-2023 | SE | NL | China | Road | DDP | B2B | NL999999999999 |
| 5 | 61021090 | Women's or girls' overcoats | 589 | 160 | 1 | 03-01-2023 | SE | BE | China | Air | DDP | B2B | BE999999999999 |
| 6 | 61022090 | Women's or girls' overcoats | 533 | 155 | 1 | 05-01-2023 | SE | PT | China | Sea | DAP | B2C | Private Customer |
| 7 | 61029010 | Women's or girls' overcoats | 406 | 180 | 1 | 06-01-2023 | SE | IT | China | Rail | DAP | B2C | Private Customer |


Message: Requested commodity code url is valid.

2. Commodity Code List


| | CN8 | SU | Description |
|---|---|---|---|
| 0 | 01012100 | p/st | Pure-bred breeding horses |
| 1 | 01012910 | p/st | Horses for slaughter |
| 2 | 01012990 | p/st | Live horses (excl. for slaughter, pure-bred fo... |
| 3 | 01013000 | p/st | Live asses |
| 4 | 01019000 | p/st | Live mules and hinnies |
| ... | ... | ... | ... |
| 9750 | 97052900 | - | Collections and collectors' pieces of zoologic... |
| 9751 | 97053100 | - | Collections and collectors' pieces of numismat... |
| 9752 | 97053900 | - | Collections and collectors' pieces of numismat... |
| 9753 | 97061000 | - | Antiques, over 250 years old |




Message: Requested fx rate url is valid.

3. ECB FX Rate Table


| | Date | AUD | BGN | BRL | CAD | CHF | CNY | CZK | DKK | GBP | ... | NZD | PHP | PLN | RON | SEK | SGD | THB | TRY | USD | ZAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-13 | 1.5586 | 1.9558 | 5.5512 | 1.4494 | 1.0051 | 7.2729 | 24.011 | 7.4387 | 0.888 | ... | 1.7014 | 59.44 | 4.6888 | 4.9423 | 11.2528 | 1.4311 | 35.751 | 20.3196 | 1.0814 | 18.2482 |
| 1 | 2023-01-12 | 1.557 | 1.9558 | 5.5556 | 1.4439 | 1.0056 | 7.27 | 24.036 | 7.4385 | 0.8869 | ... | 1.6937 | 59.292 | 4.692 | 4.944 | 11.273 | 1.4309 | 35.849 | 20.2312 | 1.0772 | 18.19495 |
| 2 | 2023-01-11 | 1.5588 | 1.9558 | 5.584 | 1.4429 | 0.9967 | 7.2807 | 24.027 | 7.4375 | 0.88673 | ... | 1.6912 | 59.013 | 4.6819 | 4.9335 | 11.2783 | 1.4316 | 35.895 | 20.1793 | 1.0747 | 18.2122 |
| 3 | 2023-01-10 | 1.5616 | 1.9558 | 5.6471 | 1.4382 | 0.9908 | 7.2732 | 23.984 | 7.4375 | 0.8833 | ... | 1.6879 | 58.751 | 4.695 | 4.9338 | 11.1963 | 1.429 | 36.024 | 20.1356 | 1.0723 | 18.29405 |
| 4 | 2023-01-09 | 1.5446 | 1.9558 | 5.6475 | 1.4299 | 0.9865 | 7.2546 | 23.99 | 7.4374 | 0.88048 | ... | 1.6741 | 58.946 | 4.6963 | 4.9253 | 11.196 | 1.4244 | 35.789 | 20.0824 | 1.0696 | 18.25565 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 84 | 2022-10-21 | 1.5646 | 1.9558 | 5.1117 | 1.3465 | 0.9855 | 7.0504 | 24.511 | 7.4382 | 0.87728 | ... | 1.7347 | 57.287 | 4.7885 | 4.9125 | 11.0868 | 1.3917 | 37.349 | 18.0988 | 0.973 | 18.0323 |
| 85 | 2022-10-20 | 1.5554 | 1.9558 | 5.1387 | 1.3461 | 0.9836 | 7.0858 | 24.525 | 7.4389 | 0.87258 | ... | 1.7206 | 57.742 | 4.7728 | 4.9203 | 10.982 | 1.3959 | 37.36 | 18.2257 | 0.9811 | 17.95635 |
| 86 | 2022-10-19 | 1.5568 | 1.9558 | 5.1755 | 1.3479 | 0.981 | 7.0672 | 24.563 | 7.439 | 0.86993 | ... | 1.7264 | 57.741 | 4.7878 | 4.9248 | 10.9448 | 1.3931 | 37.469 | 18.1793 | 0.9778 | 17.87225 |
| 87 | 2022-10-18 | 1.5557 | 1.9558 | 5.1795 | 1.3495 | 0.9792 | 7.0805 | 24.593 | 7.4393 | 0.86928 | ... | 1.7251 | 57.897 | 4.804 | 4.9359 | 10.906 | 1.3963 | 37.422 | 18.2813 | 0.9835 | 17.812150000000003 |




4. Intrastat Submission File


| | Ship To | Commodity Code | Net (SEK) | Quantity | Mass (KG) | County of Origin | Mode of Transport | Partner VAT |
|---|---|---|---|---|---|---|---|---|
| 0 | DE | 61012010 | 2121.407 | 1 | 0.595 | CN | 2 | QV999999999999 |
| 1 | NL | 61012090 | 1668.27 | 1 | 0.678 | CN | 1 | NL999999999999 |
| 2 | ES | 61013010 | 1957.34 | 1 | 0.704 | CN | 3 | QV999999999999 |
| 3 | FR | 61019080 | 1519.83 | 1 | 0.844 | CN | 4 | FR999999999999 |
| 4 | NL | 61021010 | 1619.1135 | 1 | 0.461 | CN | 3 | NL999999999999 |
| 5 | BE | 61021090 | 1782.88 | 1 | 0.589 | CN | 4 | BE999999999999 |
| 6 | PT | 61022090 | 1733.644 | 1 | 0.533 | CN | 1 | QV999999999999 |
| 7 | IT | 61029010 | 2026.44 | 1 | 0.406 | CN | 2 | QV999999999999 |
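
As a consistency check on this output, dividing each `Net (SEK)` value by the matching `Net (EUR)` value recovers the exchange rate that was applied. The sketch below copies the figures from the source and submission tables; the column names `Implied rate` and `Weekday` are introduced here for illustration only.

```python
import pandas as pd

# Shipping dates and EUR values from the source file, SEK values from the
# submission file, in the same row order.
df = pd.DataFrame({
    'Shipping Date': ['04-01-2023', '01-01-2023', '05-01-2023', '07-01-2023',
                      '02-01-2023', '03-01-2023', '05-01-2023', '06-01-2023'],
    'Net (EUR)': [190, 150, 175, 135, 145, 160, 155, 180],
    'Net (SEK)': [2121.407, 1668.27, 1957.34, 1519.83,
                  1619.1135, 1782.88, 1733.644, 2026.44],
})

# Parse the dates with the same day-month-year format the pipeline uses.
df['Shipping Date'] = pd.to_datetime(df['Shipping Date'], format='%d-%m-%Y')

# Rate implied by each converted value, rounded to four decimals.
df['Implied rate'] = (df['Net (SEK)'] / df['Net (EUR)']).round(4)
df['Weekday'] = df['Shipping Date'].dt.day_name()

print(df.sort_values('Shipping Date')[['Shipping Date', 'Weekday', 'Implied rate']])
```

The shipment dated 07-01-2023, a Saturday, carries the same implied rate as the Friday shipment of 06-01-2023 (11.258), which is the behaviour the forward fill in `fx_create_pivot` is meant to produce. The two shipments dated 05-01-2023 likewise share one rate.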
