## DATA ENGINEERING PIPELINE - INTRASTAT COMMODITY CODES

Aim:
Write a production-ready data engineering pipeline using Python and pandas.

Overview:
Intrastat is the system that collects statistics on the trade in goods between EU member states. This script requests Intrastat commodity codes from an online resource, transforms the data and exports it to CSV.

Task:

The steps to be performed are outlined below:

    1) Import the necessary libraries for the project.
    2) Define the functions that will facilitate the data engineering.
    3) Create variables to define the URL that will be requested.
    4) Request 2023 Intrastat commodity code data from the URL.
    5) Parse the URL content into a pandas dataframe.
    6) Cleanse and transform the data using pandas library functions.
    7) Display the content as a pandas dataframe.
    8) Export the content to a CSV file.


#### Import Packages

In [4]:
import pandas as pd # Data analysis package.
import ssl # Secure sockets layer package.
import urllib.error # Url error classes (HTTPError).
import urllib.request # Url request module (urlopen).

#### Define Methods

In [5]:
def read(url):
    # Disable security certificate checks for url requests.
    # Note: this applies process-wide and is unsafe outside trusted networks.
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        # If URL responds without an HTTP error print confirmation.
        with urllib.request.urlopen(url):
            print('Message: Requested URL is valid.')
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print('Error: Requested commodity code url is invalid.')
        else:
            print(f'Error: Request for commodity code url failed with HTTP status {e.code}.')
        # Re-raise so the caller never receives an undefined dataframe.
        raise
    # Read data into pandas dataframe.
    return pd.read_excel(url)

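def transform(df, column_rename):
    first_column = df.columns[0]
    # Pad left first column with zeros to the 8-character CN8 format.
    # Assigning by label replaces the column, so the dtype can change from number to text.
    df[first_column] = df[first_column].astype("str").str.pad(8, side='left', fillchar='0')
    # Rename first column to CN8 without mutating the column index in place.
    return df.rename(columns={first_column: column_rename})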

def export(file_name, df):
    # Export to csv file without the dataframe index.
    # The first column is already text after transform, so no further cast is needed.
    df.to_csv(file_name, index=False)
    
def rte_process(url, column_rename, file_name):
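    # Run the full read, transform, export (rte) process and display the csv output file.
    # Note: display() is provided by IPython/Jupyter; use print() when running as a plain script.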
    df_request = read(url)
    df_transform = transform(df_request, column_rename)
    export(file_name,df_transform)
    display(pd.read_csv(file_name, dtype=str))
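
The `read` function above reports a problem by printing a message. As a possible alternative, the sketch below fetches the content once and converts request failures into a single exception type that the caller can catch. The function name `fetch_bytes` and the `timeout` parameter are hypothetical and not part of the original notebook; certificate verification is left at the Python default here, which is an assumption that may not suit the original data source.

```python
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=30):
    """Return the raw bytes at `url`, or raise RuntimeError with a readable message.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (404, 500, ...).
        # HTTPError is a subclass of URLError, so it must be caught first.
        raise RuntimeError(f"Request for {url} failed with HTTP status {e.code}") from e
    except urllib.error.URLError as e:
        # No usable answer at all (DNS failure, refused connection, missing file, ...).
        raise RuntimeError(f"Could not reach {url}: {e.reason}") from e
```

Under these assumptions, the returned bytes could be passed to pandas as a file-like object, for example `pd.read_excel(io.BytesIO(fetch_bytes(url)))`, so the file is downloaded only once rather than once for the check and once for the parse.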
    


#### Define Main Function

In [6]:
def main():
    # Define variables.
    url = 'https://www.cbs.nl/-/media/cbsvooruwbedrijf/international-trade-in-goods/commoditycodes-2023.xlsx'
    column_rename = 'CN8'
    file_name = 'CN8 Codes.csv'
 
    # Request data from URL, transform and export as CSV.
    rte_process(url, column_rename, file_name)

    
# Run main() only when this file is executed as a script, not when it is imported as a module.
if __name__=="__main__":
    main()




Message: Requested URL is valid.


Unnamed: 0,CN8,SU,Description
0,01012100,p/st,Pure-bred breeding horses
1,01012910,p/st,Horses for slaughter
2,01012990,p/st,"Live horses (excl. for slaughter, pure-bred fo..."
3,01013000,p/st,Live asses
4,01019000,p/st,Live mules and hinnies
...,...,...,...
9750,97052900,-,Collections and collectors' pieces of zoologic...
9751,97053100,-,Collections and collectors' pieces of numismat...
9752,97053900,-,Collections and collectors' pieces of numismat...
9753,97061000,-,"Antiques, over 250 years old"
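
The leading zeros in the output above survive only because the codes are padded as text and the CSV is read back with `dtype=str`. The sketch below illustrates this on a small in-memory sample. The sample dataframe, its `code` column name and the variable names are assumptions made for illustration; only the three code values and their units are taken from the output above.

```python
import io

import pandas as pd

# Hypothetical sample: codes arrive as integers, so leading zeros are already gone.
raw = pd.DataFrame({
    "code": [1012100, 1012910, 97061000],
    "SU": ["p/st", "p/st", "-"],
})

# Rename the first column, convert it to text and left-pad with zeros to eight characters.
padded = raw.rename(columns={"code": "CN8"})
padded["CN8"] = padded["CN8"].astype(str).str.zfill(8)

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
padded.to_csv(buffer, index=False)

# Default parsing infers integers and drops the zeros again.
as_numbers = pd.read_csv(io.StringIO(buffer.getvalue()))
# dtype=str keeps every field as text, so the padding survives.
as_text = pd.read_csv(io.StringIO(buffer.getvalue()), dtype=str)

print(as_numbers["CN8"].tolist())  # [1012100, 1012910, 97061000]
print(as_text["CN8"].tolist())     # ['01012100', '01012910', '97061000']
```

`str.zfill(8)` is used here as a shorter equivalent of `str.pad(8, side='left', fillchar='0')` for codes that contain only digits.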
