# [Serverless Machine Learning in Action](https://www.manning.com/books/serverless-machine-learning-in-action?a_aid=osipov&a_bid=fa913283&)
## by Carl Osipov

## **Work In Progress** Source Code for [Chapter 4](https://livebook.manning.com/book/serverless-machine-learning-in-action/chapter-2?a_aid=osipov&a_bid=fa913283&) 

## <font color=red>Upload the `BUCKET_ID` file</font>

Before proceeding, ensure that you have a backup copy of the `BUCKET_ID` file created in the [Chapter 2](https://colab.research.google.com/github/osipov/smlbook/blob/master/ch2.ipynb) notebook. The contents of the `BUCKET_ID` file are reused later in this notebook and in the other notebooks.


In [None]:
import os
from pathlib import Path
assert Path('BUCKET_ID').exists(), "Place the BUCKET_ID file in the current directory before proceeding"

BUCKET_ID = Path('BUCKET_ID').read_text().strip()
os.environ['BUCKET_ID'] = BUCKET_ID
os.environ['BUCKET_ID']

## **OPTIONAL:** Download and install AWS CLI

This is unnecessary if you have already installed AWS CLI in a preceding notebook.

In [None]:
%%bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -o awscliv2.zip
sudo ./aws/install

## Specify AWS credentials

Modify the contents of the next cell to specify your AWS credentials as strings. 

If you see the following exception:

`TypeError: str expected, not NoneType`

it means that at least one of the credentials is still set to `None`. Both values must be replaced with strings.

In [None]:
import os
# *** REPLACE None in the next 2 lines with your AWS key values ***
os.environ['AWS_ACCESS_KEY_ID'] = None
os.environ['AWS_SECRET_ACCESS_KEY'] = None
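
As an optional sanity check before calling AWS, the following minimal sketch reports which credential variables are still unset. The `missing_env` helper is hypothetical and is not part of the book's code or of the AWS CLI.

```python
import os

def missing_env(names, environ=os.environ):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not environ.get(name)]

# Hypothetical check: list the credential variables that still need values.
required = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
missing = missing_env(required)
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("Both credential variables are set")
```

An empty result only shows that the variables hold some value, not that the values are valid. The `aws sts get-caller-identity` call remains the real test.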

## Confirm the credentials

Run the next cell to validate your credentials.

In [None]:
%%bash
aws sts get-caller-identity

If you have specified the correct credentials as values for the `AWS_ACCESS_KEY_ID` and the `AWS_SECRET_ACCESS_KEY` environment variables, then `aws sts get-caller-identity` in the previous cell should have returned the `UserId`, `Account`, and `Arn` for the credentials, resembling the following:

```
{
    "UserId": "█████████████████████",
    "Account": "████████████",
    "Arn": "arn:aws:iam::████████████:user/█████████"
}
```

## Specify the region

Replace the `None` in the next cell with your AWS region name, for example `us-west-2`.

In [None]:
# *** REPLACE None in the next line with your AWS region ***
os.environ['AWS_DEFAULT_REGION'] = None

If you have specified the region correctly, the following cell should return the region that you have specified.

In [None]:
%%bash
echo $AWS_DEFAULT_REGION

## Review the summary statistics of the cleaned up dataset

At the conclusion of Chapter 3, along with the cleaned-up version of the dataset, the PySpark job saved metadata with a statistical description of the dataset, including the total number of rows and the mean, standard deviation, minimum, and maximum for every column of values in the dataset. To read this information into your Jupyter notebook using Pandas, execute the following code:

In [None]:
import pandas as pd

df = pd.read_csv(f"s3://dc-taxi-{os.environ['BUCKET_ID']}-{os.environ['AWS_DEFAULT_REGION']}/parquet/clean_summary/*")

print(df.info())

To get started using the data frame, you can index the data frame using the summary column and take a look at the result:

In [None]:
summary_df = df.set_index('summary')
# keep only the numeric columns, using the public pandas API
summary_df = summary_df.select_dtypes(include = 'number')
summary_df

Let’s save the size of the dataset (that is, the number of rows) to a separate variable that can be used later:

In [None]:
ds_size = summary_df.loc['count'].astype(int).max()
print(ds_size)

Since the upcoming part of the chapter focuses on sampling from the data, create two separate series, starting with the population mean (`mu`):





In [None]:
mu = summary_df.loc['mean']
print(mu)

and the standard deviation (`sigma`) statistics:

In [None]:
sigma = summary_df.loc['stddev']
print(sigma)
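
To make the shape of the summary data concrete, here is a small sketch on a made-up frame. It assumes the summary uses the layout of Spark's `describe()` output, with one row per statistic labeled by the `summary` column. The column names and numbers are hypothetical and are not taken from the DC taxi dataset.

```python
import pandas as pd

# Hypothetical two-column summary: one row per statistic,
# labeled by the `summary` column.
toy_df = pd.DataFrame({
    "summary": ["count", "mean", "stddev", "min", "max"],
    "fareamount": [1000.0, 9.5, 4.2, 3.0, 45.0],
    "tripmileage": [1000.0, 3.1, 2.0, 0.1, 20.0],
})

toy_summary = toy_df.set_index("summary")

# Each statistic is now a row that can be selected by its label.
toy_size = toy_summary.loc["count"].astype(int).max()
toy_mu = toy_summary.loc["mean"]
toy_sigma = toy_summary.loc["stddev"]

print(toy_size)
print(toy_mu["fareamount"], toy_sigma["tripmileage"])
```

Selecting a row with `.loc` returns a series indexed by column name, which is why `mu` and `sigma` can later be combined column by column.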

## Choosing the right sample size for the test dataset

You can obtain the number of records for the 30% / 15% / 10% / 1% / 0.1% test partitions (specified by the `fractions` variable).

In [None]:
fractions = [.3, .15, .1, .01, .001]
print([ds_size * fraction for fraction in fractions])

To find power-of-two estimates for these fractions of the dataset, take the floor of the base-2 logarithm of each partition size:

In [None]:
from math import log, ceil, floor, sqrt
ranges = [floor(log(ds_size * fraction, 2)) for fraction in fractions]
print(ranges)

Let's continue with a range whose upper end is the largest sample size, $2^{24} = 16{,}777{,}216$, and whose lower end is the smallest, $2^{14} = 16{,}384$:

In [None]:
sample_size_upper, sample_size_lower = max(ranges) + 1, min(ranges) - 1
print(sample_size_upper, sample_size_lower)

Given a range, you can figure out how well the range approximates fractions of the dataset:


In [None]:
sizes = [2 ** i for i in range(sample_size_lower, sample_size_upper)]
original_sizes = sizes
fracs = [ size / ds_size for size in sizes]
print(*[(idx, sample_size_lower + idx, frac, size) for idx, (frac, size) in enumerate(zip(fracs, sizes))], sep='\n')

This shows that a test dataset size of $2^{14}$ covers only about 0.045% of the dataset, while a test dataset size of $2^{23}$ covers roughly 23% of the entire dataset.
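
The same steps can be wrapped in one function and tried on a round number. This is a sketch only: `power_of_two_sizes` is a hypothetical helper, and the dataset size of 1,000,000 rows is an assumption chosen for easy arithmetic, not the size of the DC taxi dataset.

```python
from math import floor, log2

def power_of_two_sizes(ds_size, fractions):
    """Return (exponent, size, fraction_of_dataset) tuples spanning the fractions."""
    exponents = [floor(log2(ds_size * f)) for f in fractions]
    # Widen the range by one exponent on each side, as in the cells above.
    lower, upper = min(exponents) - 1, max(exponents) + 1
    return [(i, 2 ** i, 2 ** i / ds_size) for i in range(lower, upper)]

# Hypothetical dataset of exactly 1,000,000 rows.
for exponent, size, frac in power_of_two_sizes(1_000_000, [.3, .15, .1, .01, .001]):
    print(f"2^{exponent} = {size:>7,} rows = {frac:.4%} of the dataset")
```

For this hypothetical size, the candidate test sets run from $2^{8}$ to $2^{18}$ rows.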


## Standard Error of the Mean

With this information in place, you are ready to find out the standard error of the mean across the range of the sample sizes for the individual columns in the dataset.

In [None]:
import numpy as np
def sem_over_range(lower, upper, mu, sigma):
  sizes_series = pd.Series([2 ** i for i in range(lower, upper + 1)])
  est_sem_df = pd.DataFrame( np.outer( (1 / np.sqrt(sizes_series)), sigma.values ), 
                        columns = sigma.index, 
                        index = sizes_series.values)
  return est_sem_df

sem_df = sem_over_range(sample_size_lower, sample_size_upper, mu, sigma)  
sem_df
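
The estimate above relies on the relationship $\sigma / \sqrt{n}$ for the standard error of the mean. The following sketch checks that relationship by simulation on synthetic, normally distributed data. The population parameters, sample size, and number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_pop = 2.0   # assumed population standard deviation
n = 64            # sample size
trials = 20_000   # number of independent samples drawn

# Draw many samples of size n and measure how much their means vary.
sample_means = rng.normal(loc=10.0, scale=sigma_pop, size=(trials, n)).mean(axis=1)
empirical_sem = sample_means.std(ddof=1)

# The formula used by sem_over_range: sigma / sqrt(n) = 2 / 8 = 0.25
theoretical_sem = sigma_pop / np.sqrt(n)

print(f"empirical: {empirical_sem:.4f}, theoretical: {theoretical_sem:.4f}")
```

The two numbers should agree closely, which is why the notebook can estimate the standard error for every sample size from `sigma` alone, without drawing any samples.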

On a standard, linear plot, the standard error of the mean drops steeply for the small sample sizes and then flattens, because each doubling of the sample size shrinks it by the same constant factor.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize = (12, 9))
plt.plot(sem_df.index, sem_df.mean(axis = 1))
plt.xticks(sem_df.index, 
           labels = list(map(lambda i: f"2^{i}", np.log2(sem_df.index.values).astype(int))), 
           rotation = 90);
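
The shape of this curve follows from the definition of the standard error of the mean for a sample of size $n$ drawn from a population with standard deviation $\sigma$. A short worked derivation:

$$
\mathrm{SEM}(n) = \frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\mathrm{SEM}(2n)}{\mathrm{SEM}(n)} = \frac{\sigma / \sqrt{2n}}{\sigma / \sqrt{n}} = \frac{1}{\sqrt{2}} \approx 0.707
$$

With $n = 2^{i}$ this becomes $\mathrm{SEM}(2^{i}) = \sigma \cdot 2^{-i/2}$, so every doubling removes about 29% of the remaining standard error, and halving the standard error takes two doublings (four times as many records).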

To discover the point of diminishing returns (the marginal) on the doubling of the sample size, you can start by accumulating the standard error of the mean over the increasing sample sizes. Because each doubling contributes a smaller standard error than the one before, this running total climbs quickly at first and then flattens. It is computed by

`sem_df.cumsum()`

in the following code snippet.

Next, to obtain a single, aggregate measure for each sample size, `mean(axis = 1)` averages the running total across the columns in the dataset.

In [None]:
agg_change = sem_df.cumsum().mean(axis = 1)
agg_change
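
To see what the two chained calls do, here is a sketch on a made-up frame with two columns and three sample sizes. The values are hypothetical and chosen so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical standard errors for two columns at three sample sizes.
toy_sem_df = pd.DataFrame(
    {"a": [4.0, 2.0, 1.0], "b": [8.0, 4.0, 2.0]},
    index=[16, 64, 256],
)

running_total = toy_sem_df.cumsum()      # accumulate down the rows, per column
toy_agg = running_total.mean(axis=1)     # average across the columns, per row

print(running_total)
print(toy_agg)   # values 6.0, 9.0, 10.5
```

The increments of the aggregate shrink from 3.0 to 1.5 here, which is the flattening that the next step looks for.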

The point of diminishing returns (the marginal) can also be described as the point on the curve that is the furthest from the imaginary line connecting the first and the last point on the curve.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize = (12, 9))
plt.scatter(agg_change.index, agg_change)
plt.plot(agg_change.index, agg_change)
plt.xticks(sem_df.index.values, 
           labels = list(map(lambda i: f"2^{i}", np.log2(sem_df.index.values).astype(int))), 
           rotation = 90);
# plt.xticks(agg_change.index, labels = []);

The following computes the marginal (using the `marginal` function) and assigns it to the sample size value:

In [None]:
import numpy as np

def marginal(x):
  coor = np.vstack([x.index.values, x.values]).transpose()
  return pd.Series(index = x.index, data = np.cross(coor[-1] - coor[0], coor[-1] - coor) / np.linalg.norm(coor[-1] - coor[0])).idxmin()

SAMPLE_SIZE = marginal(agg_change).astype(int)
SAMPLE_SIZE, SAMPLE_SIZE / ds_size
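
The geometry behind the furthest-point rule can be spelled out step by step. The following is a sketch, not the notebook's `marginal` function: the hypothetical `furthest_from_chord` helper writes the 2-D cross product out by hand and uses the absolute distance, whereas `marginal` uses `np.cross` with a signed distance and `idxmin`. The toy curve is made up.

```python
import numpy as np
import pandas as pd

def furthest_from_chord(curve):
    """Return the index label of the point furthest from the first-to-last chord."""
    points = np.column_stack([curve.index.values, curve.values]).astype(float)
    chord = points[-1] - points[0]
    offsets = points - points[0]
    # The magnitude of the 2-D cross product equals the chord length
    # times the perpendicular distance from the point to the chord.
    cross = chord[0] * offsets[:, 1] - chord[1] * offsets[:, 0]
    distances = np.abs(cross) / np.linalg.norm(chord)
    return curve.index[int(np.argmax(distances))]

# Hypothetical concave curve: a running total whose increments halve each step.
toy_curve = pd.Series([8.0, 12.0, 14.0, 15.0, 15.5], index=[1, 2, 3, 4, 5])
print(furthest_from_chord(toy_curve))   # 3
```

Writing the cross product explicitly also avoids calling `np.cross` on 2-dimensional vectors, which recent NumPy releases flag as deprecated.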

Since the computed `SAMPLE_SIZE` is needed to launch the PySpark job, it is saved to an environment variable.

In [None]:
os.environ['SAMPLE_SIZE'] = str(SAMPLE_SIZE)
os.environ['SAMPLE_SIZE']

## Download a Utility Script to Run PySpark Jobs

The script is downloaded as `utils.sh` and is loaded in the upcoming cells using the `source utils.sh` command.

In [None]:
%%bash
wget -q --no-cache https://raw.githubusercontent.com/osipov/smlbook/master/utils.sh

## Use a PySpark job to sample the test set
* the job also saves statistical summaries, including the z-scores and p-values of the sample test sets


By an unlucky draw, the p-value of a sampled test set may be less than 0.05. The `SAMPLE_COUNT` parameter lets the PySpark job retry the sampling up to `SAMPLE_COUNT` times until it draws a fair sample.

In [None]:
os.environ['SAMPLE_COUNT'] = str(1)
os.environ['SAMPLE_COUNT']
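
As an illustration of the 0.05 threshold, the sketch below computes a z-score and a two-sided p-value for a sample mean against a known population mean and standard deviation. It assumes a standard z-test. The `kaen` library used by the job may compute its statistics differently, and all the numbers here are hypothetical.

```python
from math import erfc, sqrt

def mean_z_and_p(sample_mean, n, mu, sigma):
    """Two-sided z-test of a sample mean against a known population mean."""
    z = (sample_mean - mu) / (sigma / sqrt(n))
    p = erfc(abs(z) / sqrt(2))   # two-sided p-value under the standard normal
    return z, p

# Hypothetical numbers: population mean 10, standard deviation 4,
# and a sample of 1,024 rows whose mean came out as 10.1.
z, p = mean_z_and_p(sample_mean=10.1, n=1024, mu=10.0, sigma=4.0)
print(f"z = {z:.2f}, p = {p:.3f}, fair sample: {p > 0.05}")
```

Here the standard error is 4 / 32 = 0.125, so the sample mean sits 0.8 standard errors from the population mean and the sample passes the check.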

The next cell uses the Jupyter `%%writefile` magic to save the source code for the PySpark job to the `dctaxi_dev_test.py` file.

In [None]:
%%writefile dctaxi_dev_test.py
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME',
                                     'BUCKET_SRC_PATH',
                                     'BUCKET_DST_PATH',
                                     'SAMPLE_SIZE',
                                     'SAMPLE_COUNT',
                                     'SEED'
                                     ])

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)

BUCKET_SRC_PATH = args['BUCKET_SRC_PATH']
df = ( spark.read.format("parquet")
        .option("inferSchema", "true")
        .load("{}".format(BUCKET_SRC_PATH))
        .cache() )

SAMPLE_SIZE = float( args['SAMPLE_SIZE'] )
dataset_size = float( df.count() )
sample_frac = SAMPLE_SIZE / dataset_size

from kaen.spark import spark_df_to_stats_pandas_df, \
                      pandas_df_to_spark_df, \
                      spark_df_to_shards_df

summary_df = spark_df_to_stats_pandas_df(df)
mu = summary_df.loc['mean']
sigma = summary_df.loc['stddev']

SEED = int(args['SEED'])
SAMPLE_COUNT = int(args['SAMPLE_COUNT'])
BUCKET_DST_PATH = args['BUCKET_DST_PATH']

for idx in range(SAMPLE_COUNT):
  dev_df, test_df = ( df
                      .cache()
                      .randomSplit( [1.0 - sample_frac, sample_frac],
                                    seed = SEED) )
  test_df = test_df.limit( int(SAMPLE_SIZE) )

  test_stats_df = \
    spark_df_to_stats_pandas_df(test_df, summary_df, pvalues = True, zscores = True)
  
  pvalues_series = test_stats_df.loc['pvalues']
  if pvalues_series.min() > 0.05:
    for df, desc in [(dev_df, "dev"), (test_df, "test")]:
      ( df
        .write
        .option('header', 'true')
        .csv(f"{BUCKET_DST_PATH}/{desc}", mode="overwrite") )

      stats_pandas_df = \
        spark_df_to_stats_pandas_df(df, summary_df, pvalues = True, zscores = True)

      ( pandas_df_to_spark_df(spark,  stats_pandas_df)
        .coalesce(1)
        .write
        .option('header', 'true')
        .csv(f"{BUCKET_DST_PATH}/{desc}/.meta/stats", mode="overwrite") )
      
      ( spark_df_to_shards_df(spark, df)
        .coalesce(1)
        .write
        .option('header', True)
        .csv(f"{BUCKET_DST_PATH}/{desc}/.meta/shards", mode='overwrite') )
        
    break
  else:
    # add idx + 1 so that the seed changes even after the first attempt (idx = 0)
    SEED = SEED + idx + 1
      
job.commit()
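
The retry logic of the job can be tried locally without a Spark cluster. The following is a pandas-only sketch under stated assumptions: the hypothetical `sample_fair_test_set` function draws the test set with `DataFrame.sample` instead of `randomSplit`, uses a plain z-test for the p-values, and runs on synthetic data, so it is not the implementation used by the Glue job or by `kaen`.

```python
import numpy as np
import pandas as pd
from math import erfc, sqrt

def sample_fair_test_set(df, sample_size, sample_count, seed):
    """Try up to sample_count seeds; return (dev, test, seed) or None."""
    mu, sigma = df.mean(), df.std()
    for attempt in range(sample_count):
        attempt_seed = seed + attempt   # a different seed for every attempt
        test = df.sample(n=sample_size, random_state=attempt_seed)
        dev = df.drop(test.index)
        # Per-column z-score of the test mean and its two-sided p-value.
        z = (test.mean() - mu) / (sigma / sqrt(sample_size))
        pvalues = z.abs().map(lambda v: erfc(v / sqrt(2)))
        if pvalues.min() > 0.05:
            return dev, test, attempt_seed
    return None

# Synthetic single-column dataset standing in for the cleaned-up data.
toy = pd.DataFrame({"fare": np.random.default_rng(7).normal(10.0, 4.0, size=2_000)})
result = sample_fair_test_set(toy, sample_size=256, sample_count=5, seed=68)
print("no fair sample found" if result is None else f"seed {result[2]} accepted")
```

Returning `None` when every attempt fails makes the failure explicit to the caller, rather than silently writing nothing.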

## Run the PySpark job as specified by `dctaxi_dev_test.py`.
* **the job should take about 8 minutes to finish**


In [None]:
%%bash
source utils.sh

PYSPARK_SRC_NAME=dctaxi_dev_test.py \
PYSPARK_JOB_NAME=dc-taxi-dev-test-job \
ADDITIONAL_PYTHON_MODULES="kaen[spark]" \
BUCKET_SRC_PATH=s3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/parquet/clean \
BUCKET_DST_PATH=s3://dc-taxi-$BUCKET_ID-$AWS_DEFAULT_REGION/csv \
SAMPLE_SIZE=$SAMPLE_SIZE \
SAMPLE_COUNT=$SAMPLE_COUNT \
SEED=68 \
run_job

Once the job finishes successfully, you can review the metadata with summary statistics about the test dataset:

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

test_stats_df = pd.read_csv(f"s3://dc-taxi-{os.environ['BUCKET_ID']}-{os.environ['AWS_DEFAULT_REGION']}/csv/test/.meta/stats/*")
test_stats_df = test_stats_df.set_index('summary')
test_stats_df

or the metadata about how the PySpark job sharded the development dataset into objects with specific IDs and the number of records per object:

In [None]:
dev_shards_df = pd.read_csv(f"s3://dc-taxi-{os.environ['BUCKET_ID']}-{os.environ['AWS_DEFAULT_REGION']}/csv/dev/.meta/shards/*")
dev_shards_df = dev_shards_df.set_index('id')
dev_shards_df

Copyright 2021 CounterFactual.AI LLC. All Rights Reserved.

Licensed under the GNU General Public License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 

You may obtain a copy of the License at

https://github.com/osipov/smlbook/blob/master/LICENSE

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.