#### Data processing notebook

This notebook processes the data collected in experiments that measure the latency of fetching a file from an S3 bucket, either directly or through CloudFront, the AWS CDN.

#### Methodology

An S3 bucket is deployed on each continent, in the following regions:

["sa-east-1", "us-east-1", "af-south-1", "ap-northeast-1", "eu-west-1", "ap-southeast-2"](https://github.com/gpspelle/cdk-s3-cdn-experiment/blob/main/cdk/bin/pfg.ts#L53)

These regions correspond to São Paulo, Northern Virginia, South Africa (Cape Town), Tokyo, Ireland, and Sydney, respectively.

On top of that, CloudFront is deployed and serves the content of each of these buckets under a region-specific path. For example, if https://gpspelle.click is the domain used by CloudFront, requests to https://gpspelle.click/sa-east-1/* are routed to the São Paulo (sa-east-1) bucket.
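The path-based routing described above can be sketched in a few lines of Python. This is only an illustration: the helper name `cloudfront_url` is hypothetical, and it assumes the object key is appended directly after the region prefix, as in the URL https://gpspelle.click/ap-southeast-1/100kb that appears in the logs.

```python
# Domain and region list as given in this notebook.
CLOUDFRONT_DOMAIN = "https://gpspelle.click"
REGIONS = [
    "sa-east-1", "us-east-1", "af-south-1",
    "ap-northeast-1", "eu-west-1", "ap-southeast-2",
]


def cloudfront_url(region: str, object_key: str) -> str:
    """Build the CloudFront URL for an object stored in a region's bucket.

    Hypothetical helper: assumes the layout <domain>/<region>/<object key>.
    """
    return f"{CLOUDFRONT_DOMAIN}/{region}/{object_key}"


# One CloudFront URL per region for the same object key.
urls = {region: cloudfront_url(region, "100kb") for region in REGIONS}
```

Each entry of `urls` would then be requested alongside the corresponding direct S3 URL for the same object.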

To measure the performance difference with and without CloudFront, the files in each region's bucket are fetched both through CloudFront and directly from S3. The results of these requests are logged for each region.

Each log file contains a fixed number of lines (10 in our early experiments), and each line holds the following information:


- ProcessDate := when the fetch request was executed
- SourceUrl := URL used to get the content; it can be a CloudFront (CF) URL or an S3 URL
- FileSize := size of the file that is fetched
- ElapsedTime := time to fetch the file
- StatusCode := status code of the request
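Since `SourceUrl` is the only field that tells the two request paths apart, it can help to label each row by its host. The sketch below is one possible way to do that; the function name `request_path` is hypothetical, and it assumes that any host other than the CloudFront domain is a direct S3 URL.

```python
from urllib.parse import urlparse

CLOUDFRONT_HOST = "gpspelle.click"  # CloudFront domain named in this notebook


def request_path(source_url: str) -> str:
    """Return "cloudfront" or "s3" depending on the host of a SourceUrl.

    Assumption: every URL that is not on the CloudFront domain is a
    direct S3 URL.
    """
    host = urlparse(source_url).hostname
    return "cloudfront" if host == CLOUDFRONT_HOST else "s3"
```

With a pandas DataFrame, the label could be added as a new column with `df["SourceUrl"].map(request_path)`.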

For now, we are considering files of the following sizes: [1kb, 10kb, 100kb, 1000kb, 10000kb](https://github.com/gpspelle/cdk-s3-cdn-experiment/blob/main/utils/fetchFiles.cjs#L17).
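When results are later grouped or plotted by file size, labels such as `100kb` sort lexicographically rather than numerically. A minimal sketch of a numeric sort key follows; the helper name `size_in_kb` is hypothetical, and it assumes every label is an integer followed by the suffix `kb`.

```python
def size_in_kb(label: str) -> int:
    """Convert a size label such as "100kb" into the integer 100.

    Assumption: labels are always "<integer>kb", as in the list above.
    """
    if not label.endswith("kb"):
        raise ValueError(f"unexpected size label: {label!r}")
    return int(label[:-2])


labels = ["1000kb", "1kb", "10000kb", "100kb", "10kb"]

# Sort by numeric size instead of by string comparison.
ordered = sorted(labels, key=size_in_kb)
```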


In [30]:
# Helper function to get the .csv files generated by the fetch command
from os import listdir

def find_csv_filenames(path_to_dir, suffix=".csv"):
    """Return the names of the files in path_to_dir that end with suffix."""
    filenames = listdir(path_to_dir)
    return [filename for filename in filenames if filename.endswith(suffix)]

In [7]:
filenames = find_csv_filenames(".")
print(filenames[0])

log-ap-southeast-1-20220508132655.csv
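The log filename appears to encode both the region and the time the log was written. The sketch below extracts them under that assumption: the pattern `log-<region>-<YYYYMMDDHHMMSS>.csv` is inferred from the single filename printed above, and the helper name `parse_log_filename` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: log-<region>-<14-digit timestamp>.csv
LOG_NAME = re.compile(r"^log-(?P<region>.+)-(?P<stamp>\d{14})\.csv$")


def parse_log_filename(filename: str):
    """Split a log filename into (region, datetime).

    Raises ValueError if the name does not match the assumed layout.
    """
    match = LOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected log filename: {filename!r}")
    stamp = datetime.strptime(match.group("stamp"), "%Y%m%d%H%M%S")
    return match.group("region"), stamp


region, written_at = parse_log_filename("log-ap-southeast-1-20220508132655.csv")
```

This makes it possible to group several log files by region before loading them.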


In [26]:
import pandas as pd

df = pd.read_csv(filenames[0])
assert (df['StatusCode'] == 200).all()

In [27]:
df.drop(columns=['StatusCode'], inplace=True)

In [28]:
df = df.groupby('SourceUrl')

In [29]:
print(df.describe())

                                                   ElapsedTime               \
                                                         count         mean   
SourceUrl                                                                     
https://ap-southeast-1-latency-test-pfg-unicamp...        10.0  3152.265429   
https://gpspelle.click/ap-southeast-1/100kb               10.0   368.912487   

                                                                             \
                                                           std          min   
SourceUrl                                                                     
https://ap-southeast-1-latency-test-pfg-unicamp...  998.643246  2300.877042   
https://gpspelle.click/ap-southeast-1/100kb         692.464631   119.119375   

                                                                              \
                                                            25%          50%   
SourceUrl                                       