DATA ENGINEERING PIPELINE - INTRASTAT COMMODITY CODES

Aim:
Write a production-ready data engineering pipeline using Python and pandas.

Overview:
Intrastat is the system that collects statistics on the trade of goods between EU member states. This script requests Intrastat commodity codes from an online resource, transforms the data, and exports it to CSV.

Task:

The steps to be performed are outlined below:

    1) Import the necessary libraries and functions for the project.
    2) Disable security certificate checks for client-server connections.
    3) Create variables to define the URL that will be requested.
    4) Fetch the 2023 Intrastat commodity codes data from the cbs.nl URL.
    5) Parse the URL content into a pandas DataFrame.
    6) Apply cleansing and transformation using pandas library functions.
    7) Export the content as a CSV file.
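
The cleansing and transformation step can be tried on a small in-memory frame before touching the network. This is a minimal sketch, assuming the first column holds the commodity codes as integers (as the sample output of this notebook suggests). The sample rows and the helper name `pad_cn8` are illustrative and not part of the script.

```python
import pandas as pd

def pad_cn8(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a zero-padded, 8-character 'CN8' text column."""
    out = df.copy()  # Avoid mutating the caller's frame.
    # Take the first column by position, cast to text, left-pad with zeros.
    out['CN8'] = out.iloc[:, 0].astype(str).str.zfill(8)
    return out

# Illustrative rows (assumption: codes arrive as integers, so leading zeros are missing).
sample = pd.DataFrame({
    'CN2023': [1012100, 97061000],
    'Description': ['Pure-bred breeding horses', 'Antiques, over 250 years old'],
})
padded = pad_cn8(sample)
print(padded['CN8'].tolist())  # ['01012100', '97061000']
```

Working on a copy keeps the input frame unchanged, which makes the step easier to test in isolation.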


Import Packages

In [39]:
import pandas as pd # Data analysis package.
import ssl # Secure sockets layer package.
import urllib.request # URL opening module.
import urllib.error # URL error classes.
import sys # Runtime environment handling module.

Define Methods

In [51]:
def request(url):
    # Disable security certificate checks for all HTTPS requests in this process.
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        # Open the URL to confirm it is reachable, closing the connection afterwards.
        with urllib.request.urlopen(url):
            print('Message: Requested URL is valid.')
        # Read the Excel workbook at the URL into a pandas DataFrame.
        df = pd.read_excel(url)
    except urllib.error.URLError:
        # If the URL cannot be reached, print an error and exit with a non-zero status.
        print('Error: Requested URL is invalid.')
        sys.exit(1)

    return df

def transform(df):
    # Copy CN codes from the first column into a new 'CN8' column as strings,
    # then left-pad the values with zeros to 8 characters.
    df['CN8'] = df.iloc[:, 0].astype(str)
    df['CN8'] = df['CN8'].str.zfill(8)
    return df

def export(df):
    # Export to csv file 
    df.to_csv('CN8 Codes.csv', encoding='utf-8', index=False)
    
def rte_process(url):
    # Run the full rte (request, transform, export) process and display the tabular data.
    df_request = request(url)
    df_transform = transform(df_request)
    export(df_transform)
    display(df_transform)
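
The `request` function above switches off certificate verification for every HTTPS request in the process by replacing a private `ssl` hook. A narrower alternative, sketched here under assumptions, is to pass an explicit `SSLContext` to a single `urlopen` call and download the content once. The helper names `make_context` and `fetch_bytes`, the `verify` flag, and the 30-second timeout are hypothetical and not part of the script.

```python
import io
import ssl
import urllib.request

def make_context(verify: bool = True) -> ssl.SSLContext:
    """Build an SSL context; verify=False mirrors the script's unverified behaviour."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be set to CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch_bytes(url: str, verify: bool = True, timeout: float = 30.0) -> io.BytesIO:
    """Download the resource once and return its content as an in-memory buffer."""
    context = make_context(verify)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return io.BytesIO(resp.read())
```

The returned buffer could then be passed to `pd.read_excel`, which accepts file-like objects, so the workbook is fetched only once and the unverified setting stays local to that call.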
    


Define Main Function

In [53]:
def main():
    # Define variables.
    url = 'https://www.cbs.nl/-/media/cbsvooruwbedrijf/international-trade-in-goods/commoditycodes-2023.xlsx'
    # Request data from URL, transform and export as CSV.
    rte_process(url)

    
# Run as the program entry point if the script is executed directly rather than imported as a module.
if __name__ == "__main__":
    main()




Message: Requested URL is valid.


Unnamed: 0,CN2023,SU,Description,CN8
0,1012100,p/st,Pure-bred breeding horses,01012100
1,1012910,p/st,Horses for slaughter,01012910
2,1012990,p/st,"Live horses (excl. for slaughter, pure-bred fo...",01012990
3,1013000,p/st,Live asses,01013000
4,1019000,p/st,Live mules and hinnies,01019000
...,...,...,...,...
9750,97052900,-,Collections and collectors' pieces of zoologic...,97052900
9751,97053100,-,Collections and collectors' pieces of numismat...,97053100
9752,97053900,-,Collections and collectors' pieces of numismat...,97053900
9753,97061000,-,"Antiques, over 250 years old",97061000
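
The zero-padded `CN8` values in the output above are text. If the exported CSV is later read back with pandas defaults, type inference would parse that column as integers and drop the leading zeros. The following is a minimal sketch of a round trip that keeps them, using an in-memory buffer in place of `CN8 Codes.csv` and two illustrative rows rather than the full dataset.

```python
import io
import pandas as pd

# Two illustrative rows shaped like the exported file.
df = pd.DataFrame({'CN2023': [1012100, 97061000], 'CN8': ['01012100', '97061000']})

# Write to an in-memory buffer standing in for 'CN8 Codes.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default type inference parses CN8 as integers, dropping the leading zero.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Declaring the column as text preserves the 8-character codes.
buffer.seek(0)
preserved = pd.read_csv(buffer, dtype={'CN8': str})

print(inferred['CN8'].tolist())   # [1012100, 97061000]
print(preserved['CN8'].tolist())  # ['01012100', '97061000']
```

Any downstream consumer of the file would need the same `dtype` hint, or an equivalent setting in its own CSV reader.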
