# SM04: ETL Script

The bulk of the code to clean the [Insurance Company Benchmark (COIL 2000) dataset](https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+%28COIL+2000%29) was developed in posts [SM01](https://julielinx.github.io/blog/aws01_read_from_s3/) and [SM02](https://julielinx.github.io/blog/aws02_clean_data/). Now that I've covered how the pipeline works, I can tackle streamlining my data cleaning code into a single Python script.

## Write Python script

The first step is to create the `.py` script that will be run by the pipeline step. I pretty much already wrote this code in the first two posts. The only difference between the `.py` script below and those posts is that I don't have to save intermediate steps to S3. I do all the processing at one time.

I'll only walk through the parts specific to running the code in the pipeline. The code that does the actual work of cleaning the data was already explained in detail in posts [SM01](https://julielinx.github.io/blog/aws01_read_from_s3/) and [SM02](https://julielinx.github.io/blog/aws02_clean_data/).

### Create `.py` file

I prefer to do all of my work within the same interface. I have nothing against working in an IDE like Visual Studio or PyCharm; I just don't like switching between interfaces in the middle of a project and trying to keep track of which files need to be opened where, and whether they're in sync.

While a `.py` file type isn't available under the `Launch` options in SageMaker, I can get around this using the `%%writefile` magic function. This cell magic writes the contents of the cell to the file I name, in the current working directory by default. This trick also works if I need to manually create `.txt` or other file types.

In [None]:
%%writefile etl.py
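
For anyone working outside a notebook, here's a minimal sketch of roughly what `%%writefile` does, using plain Python. The script body and the temporary folder are placeholders I made up for illustration; they aren't part of the real `etl.py`.

```python
import tempfile
from pathlib import Path

# Hypothetical script body; in a notebook this is the text below the magic line.
script_body = (
    "import pandas as pd\n"
    "\n"
    "if __name__ == '__main__':\n"
    "    print('ETL placeholder')\n"
)

# A temporary folder stands in for the notebook's working directory.
work_dir = Path(tempfile.mkdtemp())
script_path = work_dir / "etl.py"

# Like %%writefile, write_text creates the file or overwrites an existing one.
script_path.write_text(script_body)

written = script_path.read_text()
print(written)
```

Running the cell a second time simply overwrites the file, which is why the notebook reports `Overwriting etl.py` on later runs.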

### Import libraries

As always, I must import libraries. This pretty much always comes at the beginning of a file.

In [None]:
import os
import pandas as pd

### Define parameters and functions

The second part of most scripts is the parameter and function definitions. I generally put my parameters first because they're the thing I update most regularly. Putting them first makes them easy to find and update. I'm especially prone to updating parameters while prototyping. I'll hard-code parameters, then make them more dynamic once the code works.

Functions come second because I generally leave these alone once I know they work. However, I want them all in the same place so they're easy to find, reference, and update when necessary. I frequently need to reference my functions to determine expected input and output.

This particular script doesn't need any custom functions, but if I pulled in the function from the SM02 post, it would look like this:

In [None]:
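import boto3
import pandas as pd

prefix = '1_ins_dataset'

def read_mult_txt(bucket, prefix):
    """Read the matching files under an S3 prefix into a dict of DataFrames keyed by file name."""
    s3_resource = boto3.resource("s3")
    s3_bucket = s3_resource.Bucket(bucket)

    # Map each file's base name to its full S3 URI
    files = {}
    for object_summary in s3_bucket.objects.filter(Prefix=prefix):
        key = object_summary.key
        # Keep keys with a single '.' that split into at most three '/' parts
        if len(key.rsplit('.')) == 2 and len(key.split('/')) <= 3:
            files[key.split('/')[-1].split('.')[0]] = f"s3://{bucket}/{key}"

    # pandas reads s3:// URIs directly when s3fs is installed
    df_dict = {}
    for df_name, uri in files.items():
        df_dict[df_name] = pd.read_csv(uri)

    return df_dict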

### `__name__ == '__main__'`

To be honest, when working in Jupyter Notebooks, I never bother with `if __name__ == '__main__':`. I just run the code cell by cell until I get what I need. However, when running a `.py` script as a step in a SageMaker Pipeline, the code needs to be treated as a more formal piece of standalone code.

Anything that should be run when the script is called goes under `if __name__ == '__main__':`. Don't forget to indent everything under it.

In [None]:
if __name__ == '__main__':
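
To make the guard concrete, here's a small sketch. The `clean` and `main` functions are hypothetical stand-ins, not the real cleaning code; the point is only where the guard sits and what it controls.

```python
def clean(values):
    """Hypothetical stand-in for the real cleaning logic."""
    return [v.strip().lower() for v in values]


def main():
    result = clean(["  Train ", "TEST"])
    print(result)
    return result


# __name__ equals '__main__' only when the file is executed directly
# (for example `python etl.py`), not when it is imported by another module.
if __name__ == '__main__':
    main()
```

Importing this file from another script would define `clean` and `main` without running anything, while executing it directly runs `main()`.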

### Define filepaths

Coming from a background working almost exclusively in Jupyter Notebooks, the next step confused me at first.

In Jupyter Notebooks, I read from and write to S3 directly. Locally, my reads and writes go into the current working directory or somewhere within the folder structure that I designate (still generally relative to the working directory rather than full filepaths, so the code works for different users).

When working with SageMaker Pipelines, the S3 filepaths are defined in the pipeline step, not in the `.py` script. The pipeline copies the file(s) specified in the pipeline step onto the instance it spins up to run the `.py` code, placing them in specific folders within that instance.

The step type used to clean data is a Processing step. This step type uses the default folder locations `/opt/ml/processing/input` and `/opt/ml/processing/output`.

The input location is where the file(s) designated in the Pipeline step are deposited. Thus any inputs I need are referenced from that filepath, not the working directory of the `.py` script.

The output location is where SageMaker expects any produced file(s) to be located.
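
Here's a rough sketch of that read-from-input, write-to-output pattern. So it can run anywhere, temporary folders stand in for `/opt/ml/processing/input` and `/opt/ml/processing/output`, and the file names and data are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Inside a Processing container these would be '/opt/ml/processing/input'
# and '/opt/ml/processing/output'. Temporary folders stand in for them here.
base_dir = tempfile.mkdtemp()
input_path = os.path.join(base_dir, 'input')
output_path = os.path.join(base_dir, 'output')
os.makedirs(input_path)
os.makedirs(output_path)

# Simulate the pipeline depositing a file (hypothetical name and contents).
pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).to_csv(
    os.path.join(input_path, 'example.csv'), index=False)

# The script builds every path from input_path, not from its working directory.
df = pd.read_csv(os.path.join(input_path, 'example.csv'))
df['total'] = df['a'] + df['b']

# Files written under output_path are the ones the step can hand back.
out_file = os.path.join(output_path, 'example_clean.csv')
df.to_csv(out_file, index=False)
print(os.listdir(output_path))
```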

#### Example:

**Input**: In post SM01, I read files directly from the internet. As these are `.txt` files and my code is prepared to read them in this way, I can specify the URLs in the Pipeline step and they'll be deposited into the `/opt/ml/processing/input` folder. From a production standpoint, this means I can create a Pipeline and specify files in different locations without having to change the Pipeline itself. I'll cover this in more detail in a later post.

**Output**: The final product of my code is a single dataframe saved as a `.csv`. Placing this file in the `/opt/ml/processing/output` folder allows me to reference it in my Pipeline step and designate where to save it, generally an S3 location. I can also reference it in subsequent steps, allowing me to easily pass files from one step to the next. Additionally, using this method tells SageMaker what steps to run in what order. More on that in later posts.

Why `/opt/ml/processing/input` and `/opt/ml/processing/output`? It's just what AWS decided to call the folders. I can easily create my own directories if I feel like it; I just need to use the `os.makedirs()` function and put the newly created filepath in the `input` or `output` parameter. I'll need this functionality later, but for now, why complicate things? The defaults are perfectly serviceable.
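
As a quick sketch of the custom-directory option, the snippet below creates two sub-folders with `os.makedirs()`. The `train` and `test` folder names are hypothetical, and a temporary folder stands in for `/opt/ml/processing` so the code runs outside SageMaker.

```python
import os
import tempfile

# Stand-in for '/opt/ml/processing' so the sketch runs outside SageMaker.
base_dir = tempfile.mkdtemp()

# Hypothetical custom sub-folders, e.g. to keep train and test outputs apart.
train_dir = os.path.join(base_dir, 'output', 'train')
test_dir = os.path.join(base_dir, 'output', 'test')

# exist_ok=True makes the call safe to repeat; missing parents are created too.
for folder in (train_dir, test_dir):
    os.makedirs(folder, exist_ok=True)

created = sorted(os.listdir(os.path.join(base_dir, 'output')))
print(created)
```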

In [None]:
    input_path = '/opt/ml/processing/input'
    output_path = '/opt/ml/processing/output'

## Final script

The final Python script looks like this:

In [14]:
%%writefile etl.py
import pandas as pd
import os

if __name__ == '__main__':
    input_path = '/opt/ml/processing/input'
    output_path = '/opt/ml/processing/output'
    
    col_names = ['zip_agg_customer_subtype',
                 'zip_agg_number_of_houses',
                 'zip_agg_avg_size_household',
                 'zip_agg_avg_age',
                 'zip_agg_customer_main_type',
                 'zip_agg_roman_catholic',
                 'zip_agg_protestant',
                 'zip_agg_other_religion',
                 'zip_agg_no_religion',
                 'zip_agg_married',
                 'zip_agg_living_together',
                 'zip_agg_other_relation',
                 'zip_agg_singles',
                 'zip_agg_household_without_children',
                 'zip_agg_household_with_children',
                 'zip_agg_high_level_education',
                 'zip_agg_medium_level_education',
                 'zip_agg_lower_level_education',
                 'zip_agg_high_status',
                 'zip_agg_entrepreneur',
                 'zip_agg_farmer',
                 'zip_agg_middle_management',
                 'zip_agg_skilled_labourers',
                 'zip_agg_unskilled_labourers',
                 'zip_agg_social_class_a',
                 'zip_agg_social_class_b1',
                 'zip_agg_social_class_b2',
                 'zip_agg_social_class_c',
                 'zip_agg_social_class_d',
                 'zip_agg_rented_house',
                 'zip_agg_home_owners',
                 'zip_agg_1_car',
                 'zip_agg_2_cars',
                 'zip_agg_no_car',
                 'zip_agg_national_health_service',
                 'zip_agg_private_health_insurance',
                 'zip_agg_income_<_30.000',
                 'zip_agg_income_30-45.000',
                 'zip_agg_income_45-75.000',
                 'zip_agg_income_75-122.000',
                 'zip_agg_income_>123.000',
                 'zip_agg_average_income',
                 'zip_agg_purchasing_power_class',
                 'contri_private_third_party_ins',
                 'contri_third_party_ins_(firms)',
                 'contri_third_party_ins_(agriculture)',
                 'contri_car_policies',
                 'contri_delivery_van_policies',
                 'contri_motorcycle/scooter_policies',
                 'contri_lorry_policies',
                 'contri_trailer_policies',
                 'contri_tractor_policies',
                 'contri_agricultural_machines_policies',
                 'contri_moped_policies',
                 'contri_life_ins',
                 'contri_private_accident_ins_policies',
                 'contri_family_accidents_ins_policies',
                 'contri_disability_ins_policies',
                 'contri_fire_policies',
                 'contri_surfboard_policies',
                 'contri_boat_policies',
                 'contri_bicycle_policies',
                 'contri_property_ins_policies',
                 'contri_ss_ins_policies',
                 'nbr_private_third_party_ins',
                 'nbr_third_party_ins_(firms)',
                 'nbr_third_party_ins_(agriculture)',
                 'nbr_car_policies',
                 'nbr_delivery_van_policies',
                 'nbr_motorcycle/scooter_policies',
                 'nbr_lorry_policies',
                 'nbr_trailer_policies',
                 'nbr_tractor_policies',
                 'nbr_agricultural_machines_policies',
                 'nbr_moped_policies',
                 'nbr_life_ins',
                 'nbr_private_accident_ins_policies',
                 'nbr_family_accidents_ins_policies',
                 'nbr_disability_ins_policies',
                 'nbr_fire_policies',
                 'nbr_surfboard_policies',
                 'nbr_boat_policies',
                 'nbr_bicycle_policies',
                 'nbr_property_ins_policies',
                 'nbr_ss_ins_policies',
                 'nbr_mobile_home_policies']

    train = pd.read_csv(os.path.join(input_path, 'train.csv'))
    test = pd.read_csv(os.path.join(input_path, 'test.csv'))
    ground_truth = pd.read_csv(os.path.join(input_path, 'gt.csv'))
    columns = pd.read_csv(os.path.join(input_path, 'col_info.csv'))

    data_dict = {}
    data_dict['feat_info'] = columns.iloc[1:87, 0].str.split(n=2, expand=True)
    data_dict['feat_info'].columns = columns.iloc[0, 0].split(maxsplit=2)
    data_dict['L0'] = columns.iloc[89:130, 0].str.split(n=1, expand=True)
    data_dict['L0'].columns = columns.iloc[88, 0].split()
    data_dict['L2'] = columns.iloc[138:148, 0].str.split(n=1, expand=True)
    data_dict['L2'].columns = ['Value', 'Bin']

    test_df = pd.concat([test, ground_truth], axis=1)
    test_df.columns = data_dict['feat_info']['Name'].to_list()
    train.columns = data_dict['feat_info']['Name'].to_list()

    df = pd.concat([train, test_df], ignore_index=True)
    df.columns = col_names

    data_dict['L0']['Value'] = pd.to_numeric(data_dict['L0']['Value'])
    l0_dict = data_dict['L0'].set_index('Value').to_dict()['Label']
    data_dict['L2']['Value'] = pd.to_numeric(data_dict['L2']['Value'])
    l2_dict = data_dict['L2'].set_index('Value').to_dict()['Bin']
    df[df.columns[0]] = df[df.columns[0]].replace(l0_dict)
    df[df.columns[4]] = df[df.columns[4]].replace(l2_dict)

    df.to_csv(os.path.join(output_path, 'full_data.csv'), index=False)

Overwriting etl.py
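
The last block of the script, which turns coded values into labels, is the least obvious part, so here's a toy version of the same split, convert, and replace pattern. The two lookup lines and the three-row dataframe are illustrative only; the real script pulls the lookup rows from `col_info.csv`.

```python
import pandas as pd

# Illustrative lines in the 'value label' layout of the data dictionary.
raw = pd.Series(['1 High Income', '2 Very Important Provincials'])

# Split on the first run of whitespace only, so labels keep their spaces.
lookup = raw.str.split(n=1, expand=True)
lookup.columns = ['Value', 'Label']

# The codes arrive as strings; the data holds integers, so convert first.
lookup['Value'] = pd.to_numeric(lookup['Value'])
label_map = lookup.set_index('Value').to_dict()['Label']

# Swap the integer codes in a toy column for their labels.
df = pd.DataFrame({'zip_agg_customer_subtype': [2, 1, 2]})
df['zip_agg_customer_subtype'] = df['zip_agg_customer_subtype'].replace(label_map)
print(df['zip_agg_customer_subtype'].tolist())
```

Without the `pd.to_numeric` step the dictionary keys would be strings, and `replace` would leave the integer codes untouched.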


## Next steps

Now that the `.py` script has been written, I can put it all together in the next post.