## DATA ENGINEERING PIPELINE - INTRASTAT DECLARATION

Aim: Write a production ready data engineering pipeline using python and pandas.

Overview: Intrastat is a system that collects information relating to the trade of goods. This script will transform sample invoice data from a fictious company into a submissable Swedish intrastat declaration.

Task:

Below outlines the steps to be performed:

    01) Import the necessary libraries for the project.
    02) Define the functions that will faciliate the data engineering.
    03) Request intrastat commodity code list URL and read content into pd dataframe.
    04) Request ECB FX rates from URL, parse xml file and read data into pd dataframe.
    05) Cleanse and transform ex.rate data into pivot table.
    06) Read the sample intrastat data into a pd dataframe.
    07) Verify sample data using commodity code list.
    08) Apply daily exchange rate calculation on sample invoice values.
    09) Apply final transformations to intrastat output file. 
    10) Display the content of the prepared file.
    11) Export the content as an excel file, submissible to the Swedish stats authority.

#### Import Packages

In [12]:
import pandas as pd # Data analysis library.
import numpy as np # Array and matrice libary
import ssl # Secure sockets layer package.
import urllib # Url handling module.
import sys # Runtime environment handling module.
import xml.etree.ElementTree as et # XML parsing library.
import datetime as dt # Datetime parsing library.

#### Define Functions

Functions to read, transform and export commodity code data.

In [13]:
def cc_read(url):
    # Function to request commodity code list and read url content into dataframe.
    try:
        # If URL is valid print confirmation.
        urllib.request.urlopen(url)
        print('\n\nMessage: Requested commodity code url is valid.\n')
        # Read data into pandas dataframe.
        df_in= pd.read_excel(url)
    except urllib.error.URLError:
        # If URL is invalid print error.
        print('\n\nError: Requested commodity code url is invalid.\n')
        sys.exit()
    return df

def cc_transform(df):
    # Function to convert commodity codes into cn8 format.
    df.iloc[:,0] = df.iloc[:,0].astype("str").str.pad(8, side='left', fillchar='0')
    # Rename first column to CN8. 
    df.columns.values[0] = 'CN8'
    return df

def cc_export(df):
    # Function to export CC codes to csv file 
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.to_csv('CN8 Codes.csv', index=False)
    
def cc_rte_process(url):
    # Function to run full rte process and display csv output file.
    df_request = cc_request(url)
    df_transform = cc_transform(df_request)
    cc_export(df_transform)
    print('1. Commodity Code List')
    display(pd.read_csv('CN8 Codes.csv', dtype=str))
    return df_transform

Functions to read, transform and export foreign exchange rate data.

In [14]:
def fx_parse_xml(xml_obj, xml_child, xml_namespaces):
    # Function to parse xml content and read into dataframe. 
    xml_tree = et.parse(xml_obj)
    xml_root = xml_tree.getroot()
    # Find required child element instances and store content via list comprehension.
    rows = xml_root.findall(xml_child, namespaces=xml_namespaces)
    xml_data = [[row.get('time'), row.get('currency'), row.get('rate')] for row in rows]
    # Create columns for dataframe and read in content.
    df_in= pd.DataFrame(xml_data, columns = ['Date', 'Currency', 'Rate'])
    return df
    
def fx_create_pivot(df):
    # Function to create fx rate pivot table by date and currency.
    df_out = pd.pivot_table(df, index='Date', columns='Currency', values='Rate')
    # Add weekend dates missing from period to the table index.
    max_date = df.iloc[1, df.columns.get_loc('Date')]
    min_date = df.iloc[-1, df.columns.get_loc('Date')]
    date_idx = pd.date_range(min_date, max_date)
    df_out.index = pd.DatetimeIndex(df_out.index)
    df_out = df_out.reindex(date_idx)
    # Fill forward missing weekend fx rate values. 
    df_out = df_out.ffill(axis=0)
    df_out = df_out.sort_index(ascending=0)
    return df_out

def fx_read(url, xml_child, xml_namespaces):
    # Function to request fx rate url, parse content and read into dataframe.
    try:
        # If URL is valid print confirmation.
        xml_object = urllib.request.urlopen(url)
        print('\n\nMessage: Requested fx rate url is valid.\n')
        df_in= fx_parse_xml(xml_object, xml_child, xml_namespaces)
    except urllib.error.URLError:
        # If URL is invalid print error.
        print('\n\nError: Requested fx rate url is invalid.\n')
        sys.exit()
    return df

def fx_transform(df):
    # Function to fill empty fx dates, clean data and convert into pivot table. 
    df_in= df.ffill(axis=0)
    # Drop all other empty rows.
    df_in= df.dropna()
    # Create ex.rate pivot table.
    df['Rate'] = pd.to_numeric(df['Rate'])
    df_out = fx_create_pivot(df)
    return df_out

def fx_export(df):
    # Function to export to csv file 
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.index.name = 'Date'
    df.to_csv('ECB FX Rates.csv', index=True)
    
def fx_rte_process(url, xml_child, xml_namespaces):
    # Function to run full rte process and display csv output file.
    df_request = fx_request(url, xml_child, xml_namespaces)
    df_transform = fx_transform(df_request)
    fx_export(df_transform)
    print('2. ECB FX Rate Table')
    display(pd.read_csv('ECB FX Rates.csv', dtype=str))
    return df_transform

Functions to read, check, transform and export source data.

In [15]:
def src_return_mot(mode):
    # Function to switch mot codes.
    mot_switch={'Sea':'1', 'Rail':'2','Road':'3', 'Air':'4'}
    return mot_switch.get(mode,"Invalid mode of transport\n")
    
def src_cc_check(df_src, df_cc):
    # Function to verify src commodity codes match official codes.
    df_out = pd.merge(df_src, df_cc, how='left', left_on='Commodity Code', right_on='CN8')
    df_out['CC Check'] = np.where(df_out['Commodity Code'] == df_out['CN8'], 'OK', '`ERROR')
    return df_out

def src_fx_convert(df_src, df_fx):
    # Function to perform fx rate conversion by shipping date.
    df_src['Shipping Date'] = df_src['Shipping Date'].astype("string")
    df_src['Shipping Date'] = pd.to_datetime(df_src['Shipping Date'], format="%d-%m-%Y")
    df_out = pd.merge(df_src, df_fx, how='left', left_on='Shipping Date', right_on=df_fx.index)
    df_out.rename(columns = {'SEK':'EUR to SEK'}, inplace = True)
    df_out['Net (SEK)'] = df_out['Net (EUR)'].astype('float').multiply(df_out['EUR to SEK'].astype('float'))    
    return df_out

def src_read(input_file_name):
    #Function to read intrastat source data.  
    df_in= pd.read_excel(input_file_name, dtype = str)
    try:
        # If file is read into data frame print confirmation.
        print('\n\nMessage: Source data successfully read.\n')
        print('3. Intrastat Source File')
        display(df)
    except pd.errors.ParserError:
        # If file is not read into data frame print error.
        print('Error: Source data could not be read.\n')
        sys.exit()
    return df

def src_analyse(df_src, df_cc, df_fx):
    # Function to analyse source data commodity codes and daily fx rate.  
    df_cc_check = src_cc_check(df_src, df_cc)
    df_fx_convert = src_fx_convert(df_cc_check, df_fx)
    return df_fx_convert

def src_transform(df_src):
    # Function to transform column data into format for intrastat submission.
    df_src['Mode of Transport'] = [src_return_mot(mode) for mode in df_src['Mode of Transport']] 
    df_src['Partner VAT'] = np.where(df_src['Transaction'] == 'B2C', 'QV999999999999', df_src['Partner VAT'])
    df_src['Mass (KG)'] = df_src['Mass (grams)'].astype('float').multiply(0.001)
    df_src["County of Origin"] = 'CN'
    df_src = df_src.drop(['Description_x', 'Mass (grams)', 'Shipping Date', 'Ship From', 'Incoterms', 'Transaction','CN8','SU', 'Description_y', 'CC Check', 'Net (EUR)', 'EUR to SEK' ], axis = 1)
    df_src = df_src[['Ship To', 'Commodity Code','Net (SEK)', 'Quantity', 'Mass (KG)', 'County of Origin', 'Mode of Transport', 'Partner VAT'  ]]    
    return df_src

def src_export(file_name, df):
    # Function to export to excel file.
    df.iloc[:,0] = df.iloc[:,0].astype("str")
    df.to_excel(file_name, index=False)
    print('\n\nMessage: Submission file successfully exported.\n')
    print('4. Intrastat Submission File')
    display(df)

def src_rate_process(input_file_name, df_cc, df_fx, output_file_name):
    # Function to perform rate process.
    df_read = src_read(input_file_name)
    df_analyse = src_analyse(df_read, df_cc, df_fx)
    df_transform = src_transform(df_analyse)
    src_export(output_file_name, df_transform)
    

#### Define Main

In [17]:
def main():
    # Define variables.
    input_file_name = 'Intrastat Dispatches Data Sample.xlsx'
    output_file_name = 'Intrastat Submission Sample.xlsx'
    cc_url = 'https://www.cbs.nl/-/media/cbsvooruwbedrijf/international-trade-in-goods/commoditycodes-2023.xlsx'
    fx_url = 'https://www.ecb.europa.eu/stats/eurofxref/eurofxref-hist-90d.xml'
    xml_namespaces = {'ex': 'http://www.ecb.int/vocabulary/2002-08-01/eurofxref'}
    xml_child = './/ex:Cube'
    
    # Disable security certificate checks for url requests.
    ssl._create_default_https_context = ssl._create_unverified_context
     
    #Read url references into data frames.
    df_cn8_codes = cc_rte_process(cc_url)
    df_fx_rates = fx_rte_process(fx_url, xml_child, xml_namespaces)
    
    #Run intrastat data check, fx conversion, transformation and export process.
    src_rate_process(input_file_name, df_cn8_codes, df_fx_rates, output_file_name)
    
# Define main as program entry point if script is running as standalone and not as module.
if __name__=="__main__":
    main()



Message: Requested commodity code url is valid.

1. Commodity Code List


Unnamed: 0,CN8,SU,Description
0,01012100,p/st,Pure-bred breeding horses
1,01012910,p/st,Horses for slaughter
2,01012990,p/st,"Live horses (excl. for slaughter, pure-bred fo..."
3,01013000,p/st,Live asses
4,01019000,p/st,Live mules and hinnies
...,...,...,...
9750,97052900,-,Collections and collectors? pieces of zoologic...
9751,97053100,-,Collections and collectors? pieces of numismat...
9752,97053900,-,Collections and collectors? pieces of numismat...
9753,97061000,-,"Antiques, over 250 years old"




Message: Requested fx rate url is valid.

2. ECB FX Rate Table


Unnamed: 0,Date,AUD,BGN,BRL,CAD,CHF,CNY,CZK,DKK,GBP,...,NZD,PHP,PLN,RON,SEK,SGD,THB,TRY,USD,ZAR
0,2023-01-19,1.5726,1.9558,5.6326,1.4603,0.9921,7.3424,23.924,7.4398,0.87648,...,1.6978,59.099,4.7063,4.9265,11.1533,1.4326,35.803,20.3295,1.0815,18.6931
1,2023-01-18,1.5413,1.9558,5.5252,1.4505,0.9906,7.3193,23.954,7.4399,0.8753,...,1.6683,59.17,4.6983,4.9388,11.1735,1.428,35.633,20.3706,1.0839,18.54455
2,2023-01-17,1.5611,1.9558,5.5607,1.4547,0.9998,7.3473,23.966,7.4386,0.88595,...,1.6957,59.441,4.6958,4.9356,11.285,1.4343,35.869,20.3762,1.0843,18.49935
3,2023-01-16,1.5537,1.9558,5.524,1.4486,1.0026,7.2791,23.997,7.4392,0.88758,...,1.6945,58.936,4.6935,4.9313,11.2725,1.4288,35.696,20.3175,1.0812,18.52385
4,2023-01-15,1.5586,1.9558,5.5512,1.4494,1.0051,7.2729,24.011,7.4387,0.888,...,1.7014,59.44,4.6888,4.9423,11.2528,1.4311,35.751,20.3196,1.0814,18.346600000000002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83,2022-10-28,1.5511,1.9558,5.327,1.3542,0.992,7.2159,24.465,7.4423,0.8612,...,1.7151,57.739,4.7275,4.9189,10.9403,1.4055,37.724,18.5219,0.9951,18.113300000000002
84,2022-10-27,1.561,1.9558,5.3889,1.3672,0.9949,7.2552,24.53,7.4387,0.86745,...,1.7316,58.441,4.7585,4.8893,10.9583,1.4154,37.975,18.681,1.0037,18.10255
85,2022-10-26,1.5466,1.9558,5.2944,1.3568,0.9917,7.1948,24.535,7.4381,0.86603,...,1.7249,58.493,4.7548,4.8806,10.953,1.4104,37.862,18.6461,1.0023,18.08665
86,2022-10-25,1.5599,1.9558,5.2254,1.3537,0.9888,7.2072,24.472,7.4387,0.87143,...,1.7321,57.988,4.777,4.9036,10.9728,1.405,37.758,18.3508,0.9861,18.12115




Message: Source data successfully read.

3. Intrastat Source File


Unnamed: 0,Commodity Code,Description,Mass (grams),Net (EUR),Quantity,Shipping Date,Ship From,Ship To,County of Origin,Mode of Transport,Incoterms,Transaction,Partner VAT
0,61012010,Men's or boys' overcoats,595,190,1,04-01-2023,SE,DE,China,Rail,DAP,B2C,Private Customer
1,61012090,Men's or boys' overcoats,678,150,1,01-01-2023,SE,NL,China,Sea,DDP,B2B,NL999999999999
2,61013010,Men's or boys' overcoats,704,175,1,05-01-2023,SE,ES,China,Road,DAP,B2C,Private Customer
3,61019080,Men's or boys' overcoats,844,135,1,07-01-2023,SE,FR,China,Air,DDP,B2B,FR999999999999
4,61021010,Women's or girls' overcoats,461,145,1,02-01-2023,SE,NL,China,Road,DDP,B2B,NL999999999999
5,61021090,Women's or girls' overcoats,589,160,1,03-01-2023,SE,BE,China,Air,DDP,B2B,BE999999999999
6,61022090,Women's or girls' overcoats,533,155,1,05-01-2023,SE,PT,China,Sea,DAP,B2C,Private Customer
7,61029010,Women's or girls' overcoats,406,180,1,06-01-2023,SE,IT,China,Rail,DAP,B2C,Private Customer




Message: Submission file successfully exported.

4. Intrastat Submission File


Unnamed: 0,Ship To,Commodity Code,Net (SEK),Quantity,Mass (KG),County of Origin,Mode of Transport,Partner VAT
0,DE,61012010,2121.407,1,0.595,CN,2,QV999999999999
1,NL,61012090,1668.27,1,0.678,CN,1,NL999999999999
2,ES,61013010,1957.34,1,0.704,CN,3,QV999999999999
3,FR,61019080,1519.83,1,0.844,CN,4,FR999999999999
4,NL,61021010,1619.1135,1,0.461,CN,3,NL999999999999
5,BE,61021090,1782.88,1,0.589,CN,4,BE999999999999
6,PT,61022090,1733.644,1,0.533,CN,1,QV999999999999
7,IT,61029010,2026.44,1,0.406,CN,2,QV999999999999
