## Unzip a zipped folder in SageMaker
If the data is in a zip archive that has been
copied to S3, the following steps will
- copy the zip file to the SageMaker instance
- unzip all files

In [1]:
! aws s3 cp s3://bucketname/filename.zip .

In [1]:
! unzip -o filename.zip -d zip_contents > stdout; echo -n 'files unzipping completed'

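The `stdout` file records each file as it is extracted.
When `unzip` finishes, the message
`files unzipping completed` is printed.
Because the two commands are joined with `;`, the message
is printed even if `unzip` reports an error, so check the
cell output for error messages.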
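
The same extraction can also be done from Python instead of the shell. The following is a minimal sketch using the standard-library `zipfile` module. The helper name `extract_zip` is hypothetical, and the paths in the example call simply mirror the `filename.zip` and `zip_contents` names used above.

```python
import zipfile


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)  # existing files are overwritten, like `unzip -o`
        return archive.namelist()


# Example call (assumes filename.zip has already been copied from S3):
# names = extract_zip('filename.zip', 'zip_contents')
# print(f'{len(names)} entries extracted')
```

Returning the member names gives a quick way to check how many entries were extracted without reading a separate log file.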

## Load unzipped files to S3
The function `upload_to_s3` uploads a file from the
local SageMaker instance to the S3 bucket.

Note: Uploading files one at a time is slow.
For a large number of files, an asynchronous or parallel
method is needed. A Lambda function
or the AWS CLI could also help;
these options have not been explored yet.

In [2]:
# import required packages
import boto3
import os
import time

In [None]:
# define s3 bucketname, folderpath for boto3 session

s3 = boto3.resource('s3')
bucket_name = 'bucketname'
folder_path = 'zip_contents'
bucket = s3.Bucket(bucket_name)

In [None]:
def upload_to_s3(folder, file):
    '''
    function to upload a single file from the
    unzipped folder into s3 bucket
    using put_object method
    '''
    key = folder + '/' + file
    # open the file in a with-block so the handle is closed after the upload
    with open(key, 'rb') as data:
        bucket.put_object(Key=key, Body=data)

In [None]:
# walk through unzipped folder to upload files into s3

stTime = time.time()

for root, dirs, files in os.walk(folder_path):
    for name in files:
        upload_to_s3(root, name)

# keep track of time taken to complete the process
endTime = time.time()
tTime = (endTime - stTime)/3600     #convert to hours

print(f'Time taken to upload all files into s3 bucket: {tTime:.2f} hours')
print('process completed - all files uploaded')
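
As one way to act on the note about asynchronous uploads, the sketch below runs the per-file upload on a thread pool. It has not been tried against a real bucket and is not part of the original workflow. `upload_all_concurrently` and `max_workers=8` are hypothetical choices, and `upload_fn` is assumed to be any function with the same `(folder, file)` signature as `upload_to_s3`.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all_concurrently(folder_path, upload_fn, max_workers=8):
    """Call upload_fn(root, name) for every file under folder_path on a thread pool.

    Returns the number of files submitted; re-raises an upload error if one occurs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_fn, root, name)
            for root, _dirs, files in os.walk(folder_path)
            for name in files
        ]
        for future in as_completed(futures):
            future.result()  # surfaces any exception raised inside upload_fn
    return len(futures)


# Example call. boto3 documents resource objects as not thread safe, so the
# function passed in should use a low-level client instead of the shared `bucket`:
#
# client = boto3.client('s3')
# def upload_with_client(folder, file):      # hypothetical helper
#     key = folder + '/' + file
#     client.upload_file(key, bucket_name, key)
#
# n = upload_all_concurrently(folder_path, upload_with_client)
```

Threads suit this task because each upload spends most of its time waiting on the network. The best worker count depends on file sizes and instance bandwidth, so 8 is only a starting point.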

## Load data from S3 into dataframe
The following cells
- load the contents of each `.txt` file into a dictionary keyed by filename
- convert the dictionary into a dataframe and save it locally as a `.pkl` file

In [None]:
# import required packages
import boto3
import pandas as pd
import time

In [None]:
# define bucketname for boto3 session

s3 = boto3.resource('s3')
bucket_name = 'bucketname'
bucket = s3.Bucket(bucket_name)

In [None]:
# declare empty dict
docList = {}

# load all txt files and contents in dict
stTime = time.time()

for obj in bucket.objects.all():
    key = obj.key
    if key.endswith('.txt'):
        # extracting only filename part (text after the last '/')
        fname = key.split('/')[-1]
        
        # dict key-value pair -> filename-contents
        docList[fname] = obj.get()['Body'].read().decode('utf-8')
        
# keep track of time taken to complete the process
endTime = time.time()
tTime = (endTime - stTime)/3600   #convert to hours

print(f'Time taken to load all files and contents into dictionary: {tTime:.2f} hours')
print('process completed')
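
To make the filename extraction in the loop above concrete, here is a small sketch that works on plain strings rather than a live bucket. `txt_basename` is a hypothetical helper and the sample keys are made up for illustration. It assumes that "filename" means the part of the key after the last `/`.

```python
import posixpath


def txt_basename(key):
    """Return the filename part of an S3 key if it ends with .txt, else None."""
    if not key.endswith('.txt'):
        return None
    # S3 keys use '/' as the delimiter on every platform, hence posixpath, not os.path
    return posixpath.basename(key)


sample_keys = ['zip_contents/a/report.txt', 'zip_contents/image.png', 'notes.txt']
print([txt_basename(k) for k in sample_keys])
# prints ['report.txt', None, 'notes.txt']
```

Files with the same name in different subfolders map to the same dictionary key, so keeping the full key may be safer when names can repeat.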

In [None]:
# check entries in the dict

list(docList.items())[0]

In [None]:
# convert dict into dataframe

df = pd.DataFrame(docList.items(), columns = ['FileName', 'RawContents'])

In [1]:
# save dataframe as a pickle file

df.to_pickle('docList.pkl')