# Deploying the PyData stack onto AWS Lambda

AWS Lambda is Amazon's serverless compute service.  It runs functions on demand, so you don't need to maintain a running server.  Lambda functions can be triggered by web requests, SQS, Kinesis, and a variety of other events.  Building apps from Lambda functions lets you scale without worrying about spinning up servers.

AWS Lambda has a well-known package size limit of 50MB, which can be expanded to 250MB through some hacks.  I hadn't thought it was possible to easily deploy functions that depend on the PyData stack (pandas, numpy, scikit-learn...) because of these size limitations.  In this notebook I walk through:

* a simple lambda deployment with no dependencies
* a regular packaged lambda deployment
* the individual steps necessary to deploy the PyData stack
* a clean scripted PyData deploy

## Implementation notes about this notebook
I use the IPython magics `%%writefile` and `%%bash` extensively.  `%%writefile` allows me to write the lambda functions and bash scripts inline.  `%%bash` allows multiline shell examples.

In a serious deployment system these bash scripts would probably be integrated into Ansible, Chef, or Puppet.  The AWS Python API could also be used to accomplish the same tasks.  Using the AWS CLI tools through bash is the most straightforward way of experimenting with the Lambda platform.


## Running this notebook

The code examples assume a properly configured AWS CLI environment.  The user for the AWS CLI environment must have permission to create Lambda functions.  This tutorial also assumes an environment variable `AWS_ID` set to your AWS account ID.  Some of the included scripts use this variable and, for privacy, replace the actual account number with "$AWS_ID" in their output.

Some bash commands, especially towards the end, take a while to run; I have prefixed these commands with `time`.

Finally, running these commands will generate AWS charges, but they should be minimal.

In [1]:
%%writefile aws_sanitize
#!/bin/bash
#this is used to prevent my AWS account ID from leaking into public output
#I'm not completely clear why protecting my account number is necessary for security
#but all tutorials do it, so I will too.
replace='$AWS_ID'
#the g flag replaces every occurrence on a line, not just the first
sed -e "s/$AWS_ID/$replace/g"

Overwriting aws_sanitize


In [None]:
!chmod +x ./aws_sanitize
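
For readers who would rather filter output from Python, the same substitution can be sketched as a small function. This is only an illustration: `sanitize` is a hypothetical helper name, and `123456789012` is a placeholder account number, not a real one.

```python
def sanitize(text, account_id, placeholder="$AWS_ID"):
    """Replace every occurrence of account_id in text with a placeholder."""
    if not account_id:
        # An empty account ID would match between every character,
        # so leave the text untouched in that case.
        return text
    return text.replace(account_id, placeholder)


# Placeholder account number, used only for illustration.
example = "arn:aws:iam::123456789012:role/service-role/aws_lambda_role"
print(sanitize(example, "123456789012"))
```

Unlike a `sed` pattern, `str.replace` treats the account ID as a literal string, so no regular-expression escaping is needed.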

## Simple Lambda Function

In [3]:
%%writefile simple_lambda/nb1_simple_lambda_function.py

def lambda_handler(event, context):
    return {'body': "hello world"}

Writing simple_lambda/nb1_simple_lambda_function.py


In [4]:
%%bash
cd simple_lambda
zip function.zip nb1_simple_lambda_function.py
aws lambda create-function \
        --function-name nb1_simple_lambda_function \
        --handler nb1_simple_lambda_function.lambda_handler \
        --zip-file fileb://function.zip \
        --runtime python3.7 \
        --role "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role" | ../aws_sanitize

  adding: nb1_simple_lambda_function.py (deflated 1%)
{
    "FunctionName": "nb1_simple_lambda_function",
    "FunctionArn": "arn:aws:lambda:us-east-2:$AWS_ID:function:nb1_simple_lambda_function",
    "Runtime": "python3.7",
    "Role": "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role",
    "Handler": "nb1_simple_lambda_function.lambda_handler",
    "CodeSize": 279,
    "Description": "",
    "Timeout": 3,
    "MemorySize": 128,
    "LastModified": "2019-04-11T23:00:18.165+0000",
    "CodeSha256": "RRKvPXlwWTIa3MfSRYBED69gpZwrHPNnGuOXByVO3Uw=",
    "Version": "$LATEST",
    "TracingConfig": {
        "Mode": "PassThrough"
    },
    "RevisionId": "f0b91af3-7e5e-455c-9669-e735a5f1aa43"
}
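
The `zip` step can also be done from Python's standard library. The sketch below builds an equivalent archive in memory; `build_zip` is a hypothetical helper, and marking each entry world-readable is a precaution I am assuming is useful, since Lambda has to be able to read the unpacked files.

```python
import io
import zipfile


def build_zip(files):
    """Return zip bytes for a {archive_name: source_text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, source in files.items():
            info = zipfile.ZipInfo(name)
            info.compress_type = zipfile.ZIP_DEFLATED
            # rw-r--r-- so the unpacked file is readable by any user
            info.external_attr = 0o644 << 16
            archive.writestr(info, source)
    return buffer.getvalue()


handler_source = (
    "def lambda_handler(event, context):\n"
    "    return {'body': \"hello world\"}\n"
)
package = build_zip({"nb1_simple_lambda_function.py": handler_source})
print(len(package), "bytes")
```

The resulting bytes could be written to `function.zip` and passed to the same `aws lambda create-function` call.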


### Let's test the function

In [5]:
%%bash
aws lambda invoke \
    --function-name "nb1_simple_lambda_function" \
    --log-type Tail  --invocation-type  RequestResponse slf.out > /dev/null
cat slf.out  | ./aws_sanitize

{"body": "hello world"}
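
Because a handler is just a Python function, it can also be exercised locally before deploying. A minimal sketch, assuming an empty event and no context object are acceptable for this handler:

```python
import json


def lambda_handler(event, context):
    return {'body': "hello world"}


# Lambda serializes the return value to JSON; do the same locally.
result = lambda_handler({}, None)
payload = json.dumps(result)
print(payload)  # → {"body": "hello world"}
```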


## Simple Package
This example shows how to package simple Python libraries with a lambda function.

In [6]:
!mkdir simple_package

mkdir: simple_package: File exists


In [7]:
%%writefile simple_package/nb1_requests_function.py
import requests

def lambda_handler(event, context):
    resp = requests.get("https://www.google.com")
    resp_len = len(resp.content)
    return {'resp_len': resp_len}

Writing simple_package/nb1_requests_function.py


In [8]:
%%bash
cd simple_package
pip install requests --target .  2>&1 > /dev/null
zip -r9 ./package_function.zip ./* 2>&1 > /dev/null
aws lambda create-function \
        --function-name nb1_requests_function \
        --handler nb1_requests_function.lambda_handler \
        --zip-file fileb://package_function.zip \
        --runtime python3.7 \
        --role "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role" | ../aws_sanitize

awscli 1.16.121 has requirement botocore==1.12.111, but you'll have botocore 1.12.112 which is incompatible.
{
    "FunctionName": "nb1_requests_function",
    "FunctionArn": "arn:aws:lambda:us-east-2:$AWS_ID:function:nb1_requests_function",
    "Runtime": "python3.7",
    "Role": "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role",
    "Handler": "nb1_requests_function.lambda_handler",
    "CodeSize": 901151,
    "Description": "",
    "Timeout": 3,
    "MemorySize": 128,
    "LastModified": "2019-04-11T23:01:26.271+0000",
    "CodeSha256": "MkPJ4PilcKK/E5N0D9swgDNnrU+EppePXsDCGreVEcI=",
    "Version": "$LATEST",
    "TracingConfig": {
        "Mode": "PassThrough"
    },
    "RevisionId": "cfcdbb96-1cc8-45ea-8739-0a442eea159e"
}


## Run_function script

In [9]:
%%writefile run_function
#!/bin/bash
aws lambda invoke \
    --function-name "$1" \
    --log-type Tail  --invocation-type  RequestResponse slf.out > /dev/null
cat slf.out  | ./aws_sanitize

Overwriting run_function


In [10]:
!chmod +x run_function
!./run_function nb1_requests_function

{"resp_len": 11293}
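
The script sends the CLI's own output to `/dev/null`. With `--log-type Tail` that output is JSON metadata containing a base64-encoded `LogResult` field holding the tail of the execution log. If you capture it instead, it could be decoded along these lines; `decode_log_tail` is a hypothetical helper and the sample metadata is made up for illustration.

```python
import base64
import json


def decode_log_tail(invoke_metadata_json):
    """Return the decoded log tail from the invoke response metadata."""
    metadata = json.loads(invoke_metadata_json)
    encoded = metadata.get("LogResult", "")
    return base64.b64decode(encoded).decode("utf-8")


# Made-up metadata shaped like the CLI's invoke response.
sample_log = "START RequestId: example\nEND RequestId: example\n"
sample = json.dumps({
    "StatusCode": 200,
    "LogResult": base64.b64encode(sample_log.encode("utf-8")).decode("ascii"),
    "ExecutedVersion": "$LATEST",
})
print(decode_log_tail(sample))
```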


## PyData Package with NumPy, Pandas, Matplotlib, and Scikit-Learn
The following script creates a packaged directory including `numpy`, `scipy`, `scikit-learn`, `pandas`, and `matplotlib`.  There are some tricky bits here that keep the package small.  Layers must be less than 250MB when expanded.  To get under this limit, the script byte-compiles all `.py` files into `.pyc` files, after which the original `.py` files can be deleted.  SciPy and NumPy also each bundle a 28MB copy of OpenBLAS with identical md5sums; the script keeps one copy and replaces the other with a symbolic link.  Further space could be trimmed by removing the unit tests included with all of these packages, but that isn't really necessary.  With these changes the expanded package shrinks from 261MB to 185MB.
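
One way to confirm that two bundled libraries really are byte-identical before replacing one with a symlink is to hash them. The sketch below groups the files under a directory by md5 digest; `file_md5` and `find_duplicates` are hypothetical helper names, and the `min_size` threshold is an assumption added to skip small files.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Hash a file in chunks so large libraries are not read in one go."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root, min_size=1):
    """Return groups of regular files under root with identical md5 digests."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks (already shared) and files below the threshold.
            if os.path.islink(path) or os.path.getsize(path) < min_size:
                continue
            by_hash[file_md5(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_duplicates("pydata_full/python", min_size=1 << 20)` after the `pip install` step would list candidate files of 1MB or more that are stored more than once.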

In [11]:
%%bash
echo "This takes about 2 minutes to run on my laptop"
rm -rf pydata_full*
mkdir -p pydata_full/python
#layers must have code in a "python" directory
cd pydata_full/python
#note the extra options to force linux packages even if you are on OS X
pip install numpy scipy scikit-learn pandas matplotlib \
        --platform manylinux1_x86_64\
        --python-version 37 \
        --only-binary=:all:  --target ./ 2>&1 > /dev/null
du -h ./ | tail -n 1
cd numpy/.libs/
#openblas 28mb and exactly the same as in scipy, symbolic links let us share this resource
rm libopenblasp-r0-382c8f3a.3.5.dev.so
#add a symlink to scipy's copy of libopenblas
ln -s ../../scipy/.libs/libopenblasp-r0-382c8f3a.3.5.dev.so ./
cd ../..
# byte compile all files in compatibility mode, which puts pyc files next to the py files
# without -b, compiled files are put into pycache directories, and pyc files in pycache
# directories are only used when the original source file is also available
python -m compileall -b ./ > /dev/null
# remove all of the pycache directories, they contain duplicates of the 
# parallel pyc files
find . -type d | grep pycache | xargs rm -rf
# remove the original .py files, keeping f2py and any path containing "__" (e.g. __init__.py)
find . -type f -mindepth 2 | grep '\.py$' | grep -v f2py | grep -v "__" | xargs rm
cd ..
#note the -y flag which preserves symbolic links
zip -y -r9 ../pydata_full.zip ./ 2>&1 > /dev/null
cd ..
du -h pydata_full.zip
du -h pydata_full/ | tail -n1

This takes about 2 minutes to run on my laptop
awscli 1.16.121 has requirement botocore==1.12.111, but you'll have botocore 1.12.112 which is incompatible.
261M	./
 65M	pydata_full.zip
185M	pydata_full/
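
The byte-compile-and-delete step can also be sketched with the standard library's `compileall` module, whose `legacy=True` option corresponds to the `-b` flag. `strip_sources` is a hypothetical helper; like the shell pipeline it leaves alone any path containing `__` or `f2py`, but it does not reproduce the `-mindepth 2` restriction.

```python
import compileall
import os


def strip_sources(root):
    """Byte-compile root in place, then delete sources that have a .pyc twin."""
    # legacy=True writes foo.pyc next to foo.py instead of into __pycache__.
    compileall.compile_dir(root, legacy=True, quiet=1)
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            source = os.path.join(dirpath, name)
            relative = os.path.relpath(source, root)
            # Mirror the shell filters: keep f2py and dunder paths as source.
            if "__" in relative or "f2py" in relative:
                continue
            # Only delete a source file once its compiled twin exists.
            if os.path.exists(source + "c"):
                os.remove(source)
                removed.append(source)
    return removed
```

Modules compiled this way are still importable, because Python falls back to a `.pyc` file that sits where the `.py` file would have been.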


## Create a Lambda Layer for fast deployment
Layers let us publish a package as a standalone entity.  When deploying lambda functions, the layer is referenced and only the application code needs to be included in the lambda deployment.  This makes deploying Python code that depends on the PyData packages much faster (2:30 -> 0:15).

In [12]:
%%bash
time aws s3 cp pydata_full.zip  s3://pandas-sklearn-demo/pydata_full.zip > /dev/null
#Note this lambda package is still published via an S3 bucket because the zip is over the 50MB limit
time aws lambda publish-layer-version --layer-name pydata_full \
            --description "Core PyData libraries packaged" \
            --content S3Bucket=pandas-sklearn-demo,S3Key=pydata_full.zip \
            --compatible-runtimes python3.7 | ./aws_sanitize

{
    "Content": {
        "Location": "https://awslambda-us-east-2-layers.s3.us-east-2.amazonaws.com/snapshots/$AWS_ID/pydata_full-54b38fcc-4bf2-4f8a-89ca-ea03d7243d10?versionId=GmkEKnvEbUk6Od1m0rPeQTWe.Oz_4p1H&X-Amz-Security-Token=AgoJb3JpZ2luX2VjEM3%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMiJGMEQCIGXaO2xsIU6XKtMUl%2FOEkhoUhBK1fgG9HjilIlK8kO0nAiAVxjeoYZ9ctLOYuZl6XKgA%2BV9RERw6iXbPNbfzlBhwNirjAwiX%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAEaDDEwNDI0NjAxNzg2NSIMNjvw6A%2F7WGEfH6GEKrcD11MSeBhxFkvAG7iwtWeuVOt7XgorQEO8lpCvDDiEJo9p6XY75hTpsYXsLXe96ltcrQixpNuS9Ekx5QnXP1pUH%2FVTV2FVCKki8AFkGgpX%2FMcA4%2Fg1ajicuvv6yNL%2BGBerLwHKgYmMIM%2B2%2BcxMTKULmxyaug%2B7lWdL7AL1iQqCUqLvaLKPX5V7gXz8R%2BORPjYk32s8JksRQ3MfwAvH9fznrL%2FEnaZJlLqtoL2La%2B9fgWfSQGt0sop5Drdunh4wZxWtlE9a938TYuF6sQE%2BwCQ%2FbZO%2BnZuA4vrOolsh3dVF3DUdNinNNokEEAVgPlDWzQLp480GLgltsthpRKdwMuekrdbdIrC6DRlGJ2PfWzksCCakjDsDzIYWknlDNTFxmrwAKd%2FGN32vFhEsqUMs1nCkmfjdcZ7GCMPb21JWMMpnAP3IYP2%2Bs2GUHuH%2F59TxOHHUnVgeLmFKTEjYzkcO%2BYn5OaLkEwLVWNf3Fdz


real	1m43.495s
user	0m2.793s
sys	0m1.941s

real	0m26.390s
user	0m0.979s
sys	0m0.278s
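
The choice between uploading a zip directly and staging it in S3 comes down to the size limits quoted in this notebook. A rough sketch of that decision follows; `upload_route` and `fits_when_expanded` are hypothetical helpers, and treating "50MB" and "250MB" as multiples of 1024 * 1024 bytes is my assumption rather than something the notebook states.

```python
# Limits as quoted in this notebook; the exact byte values are an assumption.
DIRECT_UPLOAD_LIMIT = 50 * 1024 * 1024
EXPANDED_LIMIT = 250 * 1024 * 1024


def upload_route(zip_size_bytes):
    """Pick 'direct' for small zips and 's3' for zips over the direct limit."""
    return "direct" if zip_size_bytes <= DIRECT_UPLOAD_LIMIT else "s3"


def fits_when_expanded(expanded_size_bytes):
    """Check an unzipped package against the expanded size limit."""
    return expanded_size_bytes < EXPANDED_LIMIT


# Sizes taken from the outputs in this notebook.
print(upload_route(901151))      # the requests package
print(upload_route(67982307))    # the pydata_full layer
```

With the sizes reported above, the requests package goes up directly while the PyData layer has to go through S3.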


## Create_lambda script
Packaging up these lambda functions is getting complex.  Let's put all of this into a script.

In [13]:
%%writefile create_lambda
#!/bin/bash

#note function name must be the same as the module name
function_name=$1
ver_number=$2
layer_name=$3 #e.g. pydata_full:7; the version number is required

ver_name="${function_name}_${ver_number}"
mod_file="${function_name}.py"
mod_name=$function_name
handler_name="${mod_name}.lambda_handler"
appcode_zip_file=/tmp/appcode.zip

zip $appcode_zip_file $mod_file

#aws lambda delete-function  --function-name nb1_matplotlib_s3 > /dev/null

aws lambda create-function --function-name $ver_name \
           --zip-file fileb://$appcode_zip_file  --handler $handler_name \
           --runtime python3.7 \
           --layers "arn:aws:lambda:us-east-2:$AWS_ID:layer:$layer_name" \
           --timeout 25 \
           --role "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role"  | ./aws_sanitize
./run_function $ver_name

Overwriting create_lambda
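
The same naming conventions could be driven from Python with boto3 instead of the CLI. This is a sketch under assumptions, not a drop-in replacement: `build_create_function_args` and `create_lambda` are hypothetical helpers, boto3 must be installed and configured for the second one to work, and the region is hard-coded to match the ARNs used in the script.

```python
import os


def build_create_function_args(function_name, ver_number, layer_name,
                               zip_bytes, account_id, region="us-east-2"):
    """Build keyword arguments mirroring the create_lambda shell script."""
    return {
        "FunctionName": f"{function_name}_{ver_number}",
        "Runtime": "python3.7",
        "Role": f"arn:aws:iam::{account_id}:role/service-role/aws_lambda_role",
        # the module name must match the function name, as in the script
        "Handler": f"{function_name}.lambda_handler",
        "Code": {"ZipFile": zip_bytes},
        "Timeout": 25,
        "Layers": [f"arn:aws:lambda:{region}:{account_id}:layer:{layer_name}"],
    }


def create_lambda(function_name, ver_number, layer_name, zip_bytes):
    """Create the function through boto3 (requires AWS credentials)."""
    # Imported lazily so the argument builder works without boto3 installed.
    import boto3

    args = build_create_function_args(
        function_name, ver_number, layer_name, zip_bytes, os.environ["AWS_ID"])
    client = boto3.client("lambda", region_name="us-east-2")
    return client.create_function(**args)
```

Keeping the argument construction separate from the network call makes the naming rules easy to check without touching AWS.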


In [15]:
%%writefile nb1_pandas_sum.py
import numpy as np
import pandas as pd

def lambda_handler(event, context):
    df = pd.DataFrame({'a':np.arange(40, 50, step=.5), 'b':np.arange(40,60)})
    return df.sum().to_dict()

Writing nb1_pandas_sum.py


In [19]:
#note we are using pydata_full:8 as the layer name; the version number was returned by the call to publish-layer-version
!time ./create_lambda nb1_pandas_sum 2 "pydata_full:8"

updating: nb1_pandas_sum.py (deflated 22%)
{
    "FunctionName": "nb1_pandas_sum_2",
    "FunctionArn": "arn:aws:lambda:us-east-2:$AWS_ID:function:nb1_pandas_sum_2",
    "Runtime": "python3.7",
    "Role": "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role",
    "Handler": "nb1_pandas_sum.lambda_handler",
    "CodeSize": 1277,
    "Description": "",
    "Timeout": 25,
    "MemorySize": 128,
    "LastModified": "2019-04-11T23:11:29.813+0000",
    "CodeSha256": "05UMk/6q3naVIA7Vh2s8aRue1GogpMV9eDe/5yB11+4=",
    "Version": "$LATEST",
    "TracingConfig": {
        "Mode": "PassThrough"
    },
    "RevisionId": "eb47d6f6-a2b5-454d-ae4d-a8e3d989b759",
    "Layers": [
        {
            "Arn": "arn:aws:lambda:us-east-2:$AWS_ID:layer:pydata_full:8",
            "CodeSize": 67982307
        }
    ]
}
{"a": 895.0, "b": 990.0}

real	0m14.575s
user	0m1.071s
sys	0m0.223s
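
The returned sums can be checked by hand. Column `a` holds the 20 values 40.0, 40.5, ..., 49.5 and column `b` holds the 20 integers 40 through 59, so each sum is an arithmetic series: 20 * (40 + 49.5) / 2 = 895 and 20 * (40 + 59) / 2 = 990. The same check in plain Python, without numpy or pandas:

```python
# Rebuild the two columns with plain lists.
a = [40 + 0.5 * i for i in range(20)]   # 40.0, 40.5, ..., 49.5
b = list(range(40, 60))                 # 40, 41, ..., 59

# Both columns must have the same length to form a DataFrame.
assert len(a) == len(b) == 20

totals = {"a": sum(a), "b": float(sum(b))}
print(totals)  # → {'a': 895.0, 'b': 990.0}
```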


## Matplotlib example


In [20]:
%%writefile nb1_matplotlib_s3.py
from io import BytesIO

import matplotlib as mpl
# select the non-interactive Agg backend before pyplot is imported;
# Lambda has no display
mpl.use('agg')
import matplotlib.pyplot as plt

import boto3

def save_plot(fig, bucket='pandas-sklearn-demo', key='plot.png'):
    # render the figure into an in-memory PNG
    buffer_ = BytesIO()
    fig.savefig(buffer_, format='png')
    buffer_.seek(0)
    s3 = boto3.resource('s3')
    bucket_obj = s3.Bucket(bucket)

    # upload the PNG to S3
    bucket_obj.put_object(
        Key=key, Body=buffer_,
        StorageClass='REDUCED_REDUNDANCY',
        #ACL='public-read',
        ContentType='image/png')
    # return a temporary URL for the object, valid for 100 seconds
    s3Client = boto3.client('s3')
    temp_url = s3Client.generate_presigned_url(
        'get_object', Params = {'Bucket': bucket, 'Key': key}, ExpiresIn = 100)
    return temp_url


def lambda_handler(event, context):
    fig, ax = plt.subplots(figsize=(10,7))
    ax.plot(range(20), range(20))
    image_url = save_plot(fig, key='plot7.png')
    return {'image_url': image_url}

Overwriting nb1_matplotlib_s3.py


In [21]:
!time ./create_lambda nb1_matplotlib_s3 1 "pydata_full:8"

updating: nb1_matplotlib_s3.py (deflated 46%)
{
    "FunctionName": "nb1_matplotlib_s3_1",
    "FunctionArn": "arn:aws:lambda:us-east-2:$AWS_ID:function:nb1_matplotlib_s3_1",
    "Runtime": "python3.7",
    "Role": "arn:aws:iam::$AWS_ID:role/service-role/aws_lambda_role",
    "Handler": "nb1_matplotlib_s3.lambda_handler",
    "CodeSize": 1273,
    "Description": "",
    "Timeout": 25,
    "MemorySize": 128,
    "LastModified": "2019-04-11T23:12:41.739+0000",
    "CodeSha256": "37Zh71oFfNJHnyPPNE17ugem2hjf2YEwsGVC+BBuqIw=",
    "Version": "$LATEST",
    "TracingConfig": {
        "Mode": "PassThrough"
    },
    "RevisionId": "6d500f71-2033-4c0e-8de1-d17bee43627d",
    "Layers": [
        {
            "Arn": "arn:aws:lambda:us-east-2:$AWS_ID:layer:pydata_full:8",
            "CodeSize": 67982307
        }
    ]
}
{"image_url": "https://pandas-sklearn-demo.s3.amazonaws.com/plot7.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAZOFMDRKPTCG3KHV6%2F20190411%2Fus-east-2%2Fs3%2Faws4

In [None]:
%%bash
#NOTE TO Proofreaders, these are cleanup functions that won't be in the final product
rm -rf simple_lambda
rm -rf simple_package
mkdir simple_lambda
mkdir simple_package

#delete all lambda functions with nb1_ in the name
#all lambda functions in this document are created with the nb1_ prefix
aws lambda list-functions | grep FunctionName | cut -d ":" -f 2| cut -d "," -f 1 | grep "nb1_" | xargs -L 1 aws lambda delete-function --function-name
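
The `grep`/`cut` pipeline above picks function names out of JSON text. A sketch of a more structured alternative parses the `list-functions` output and filters on the name; `functions_with_prefix` is a hypothetical helper and the sample response is made up. Note that it matches only names that start with the prefix, which is slightly stricter than `grep "nb1_"`.

```python
import json


def functions_with_prefix(list_functions_json, prefix="nb1_"):
    """Return the function names in a list-functions response with a prefix."""
    response = json.loads(list_functions_json)
    return [
        function["FunctionName"]
        for function in response.get("Functions", [])
        if function["FunctionName"].startswith(prefix)
    ]


# Made-up response shaped like `aws lambda list-functions` output.
sample = json.dumps({
    "Functions": [
        {"FunctionName": "nb1_simple_lambda_function", "Runtime": "python3.7"},
        {"FunctionName": "unrelated_function", "Runtime": "python3.7"},
        {"FunctionName": "nb1_pandas_sum_2", "Runtime": "python3.7"},
    ]
})
print(functions_with_prefix(sample))
```

Each returned name could then be passed to `aws lambda delete-function --function-name`.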
