# AWS Comprehend Entity Extraction

Jupyter notebook to demonstrate how to set up and use AWS Comprehend for entity extraction.

* [Amazon Comprehend examples using SDK for Python (Boto3)](https://docs.aws.amazon.com/code-library/latest/ug/python_3_comprehend_code_examples.html)
* [Comprehend.Client.detect_entities(**kwargs)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend/client/detect_entities.html#)
* [Permissions to allow all Amazon Comprehend actions](https://docs.aws.amazon.com/comprehend/latest/dg/security_iam_id-based-policy-examples.html#custom-policy-all-all-actions)
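
For orientation, a direct call to `detect_entities` can be sketched as below. This is a minimal sketch, not the `util_aws` wrapper used later in this notebook: the function name `extract_entity_values` is hypothetical, and `client` is assumed to be a `boto3.client("comprehend")` instance with valid credentials.

```python
from typing import Any, Dict, List, Optional


def extract_entity_values(
        client: Any,
        text: str,
        language_code: str,
        entity_type: Optional[str] = None
) -> List[str]:
    """Return the detected entity texts, highest confidence score first.

    client is assumed to be a boto3 Comprehend client (hypothetical helper).
    """
    response: Dict[str, Any] = client.detect_entities(
        Text=text,
        LanguageCode=language_code
    )
    # Each entity has the "Score", "Type", and "Text" elements.
    entities = sorted(
        response.get("Entities", []),
        key=lambda entity: entity["Score"],
        reverse=True
    )
    values: List[str] = []
    for entity in entities:
        if entity_type is not None and entity["Type"] != entity_type.upper():
            continue
        if entity["Text"] not in values:    # drop duplicates, keep the order
            values.append(entity["Text"])
    return values


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   extract_entity_values(boto3.client("comprehend"), "She moved to Finland.", "en", "location")
```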

In [None]:
!apt -qq update -y
!apt -qq install zip

In [63]:
import sys
import time
import re
import json
import pathlib

In [2]:
PATH_TO_LIB: str = str(pathlib.Path("../../../../lib").resolve())

# Constants

In [3]:
AWS_SERVICE: str = "comprehend"
LAMBDA_FUNCTION_NAME: str = "tagging-poc-dev-comprehend"

# Data

In [4]:
example="""
在 SBS On Demand 可免费收看《世界上最幸福的国度》（The World's Happiest Country）。

澳大利亚人梅丽莎·乔治奥（Melissa Georgiou）十多年前移居芬兰，在地球上最寒冷、最黑暗的地方之一寻求幸福。

梅丽莎说：“在这里生活，我最喜欢的事情之一是，无论你是在住宅区还是在城市中间，都很容易接近大自然。”
她原本是一名教师，12年前，她从悉尼的海滩换到了芬兰的黑暗冬天和冰冷湖泊，此后再也没有回头。
梅丽莎说：“对芬兰人来说，幸福的概念与澳大利亚人的幸福概念非常不同。

她说，芬兰人乐于接受将自己描绘成忧郁、矜持的形象——当地流行的一句话是：“拥有幸福的人必须把它隐藏起来。”

“在这里，我注意到的第一件事是，你不会去参加晚宴或烧烤，也不会谈论房地产。没有人问你住在哪里，住在哪个郊区，你的孩子在哪里上学。” 
芬兰人似乎对现状相当满意，他们似乎并不总是想要更多。
梅丽莎·乔治奥
北欧的黑夜
在联合国最新发布的《世界幸福报告》中，芬兰连续第六年被评为世界上最幸福的国家。

幸福专家和研究员弗兰克·马特拉（Frank Martela）解释说：“北欧国家往往是有（良好）失业救济、养老金和其他福利的国家。”
但弗兰克说，芬兰在排名中的位置往往让其本国人民感到惊讶。

“芬兰人，他们几乎感到愤怒，因为他们觉得这不可能是真的。我们听的是悲伤的音乐，还有硬摇滚。”
“因此，幸福并不是芬兰人自我形象的一部分。”

芬兰人忧郁的另一面是对毅力的文化关注，弗兰克说这重新定义了芬兰人看待幸福的方式——一个被称为“sisu”的概念——这是芬兰文化的一部分，很难直译，但可理解为意志、决心、毅力和理性面对逆境。

他说，这在芬兰人最喜欢的消遣方式中得到了最好的体现——在冰点以下的气温中泡完海澡后，在桑拿房里取暖。

“这是关于这种矛盾——从一个极端到另一个极端，而这是相当有趣的体验......因为你需要毅力。”

梅丽莎说，但芬兰有很多东西是伟大的，可以为这个国家的人提供幸福。
芬兰是受新冠大流行影响最小的欧洲国家之一，专家们将此归功于对政府的高度信任和对遵循限制措施的较小阻力。

而对政府的信任则源于国家对其公民的投资。

公立学校系统很少对儿童进行测试，是世界上最好的学校之一。芬兰也有一个全民医疗保健系统，有民众负担得起的儿童保育和对父母的有力支持。

梅丽莎说：“整个国家都在照顾孩子的成长，这个制度设置得非常好。因此，从生下我儿子到在家抚养他，再到送他去日托，再到上学，这一切的每个方面都得到了很好的支持。”
芬兰VS中国 幸福感哪国最强？
自《世界幸福报告》发布以来，北欧国家一直在前十名中占主导地位。在今年的报告中，芬兰及其邻国丹麦（第2名）、冰岛（第3名）、瑞典（第6名）和挪威（第7名）在幸福指标中的得分都很高，包括健康的预期寿命、人均GDP、低腐败程度、社会支持、自由、信任与慷慨等。

其他位列前十的国家/地区包括，荷兰（第5名）、瑞士（第8名）、卢森堡（第9名）及新西兰（第10名）。

澳大利亚在这份报告中排名第12名，紧随其后的是加拿大（第13名）、爱尔兰（第14名）、美国（第15名）。

亚洲地区，新加坡排全球第25名，较去年上升两位，台湾较去年下跌一位到第27名，日本排名升至第47名，中国大陆排第64名，香港排名第82名。

 与此同时，民调机构益普索集团（Ipsos）业发布了一份有关全球幸福指数的调查报告，结果显示，在32个国家中，幸福感指数最高的国家是中国（91%），其后是沙特阿拉伯（86%）、荷兰（85%）、印度（84%）、巴西（83%）。

澳大利亚在这份报告中位列第9名。

调查报告指出，平均而言，中等收入国家（按照世界银行定义）的幸福感，比高收入国家的幸福感增长得更明显。
"""

# Directory Arrangement

In [5]:
%%bash -s "$AWS_SERVICE"
PWD=$(basename $(pwd))
if [[ ${PWD}  != $1 ]] ; then
    echo  "make sure to be in ./${1} directory"
    exit -1
fi

In [6]:
%rm -rf ./package && mkdir ./package 
%cd ./package
PYTHON_DEPENDENCY_DIRECTORY=%pwd

/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend/package


In [7]:
PYTHON_DEPENDENCY_DIRECTORY

'/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend/package'

## PYTHONPATH

Make sure the directory that contains the intended dependency packages comes first in `sys.path`.

In [8]:
if PYTHON_DEPENDENCY_DIRECTORY != sys.path[0]:
    sys.path.insert(0, PYTHON_DEPENDENCY_DIRECTORY)
    print(sys.path)

['/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend/package', '/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python39.zip', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9', '/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload', '', '/Users/oonisim/venv/tf/lib/python3.9/site-packages']


## Python dependencies 

Keep all the Python dependencies within the package directory and use only them, to avoid Python dependency hell. The dependencies are also packaged into the Lambda deployment package so that the same dependencies used during development are used at Lambda runtime.
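
One way to confirm that a dependency is picked up from the package directory, rather than from a system-wide installation, is to inspect the module's file location. The helper below is a hypothetical sketch (`is_loaded_from` is not part of this repository).

```python
import importlib
import pathlib


def is_loaded_from(module_name: str, directory: str) -> bool:
    """Return True if the imported module's file resides under the directory."""
    module = importlib.import_module(module_name)
    module_file = getattr(module, "__file__", None)
    if module_file is None:
        # Built-in modules such as sys have no file location.
        return False
    module_path = pathlib.Path(module_file).resolve()
    return pathlib.Path(directory).resolve() in module_path.parents


# Usage in this notebook, after the dependencies are installed:
#   is_loaded_from("boto3", PYTHON_DEPENDENCY_DIRECTORY)
```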


In [73]:
%%bash -s "$PATH_TO_LIB" "$PYTHON_DEPENDENCY_DIRECTORY"
cp -r "$1/util_aws" "$2/"

In [None]:
!echo $PYTHON_DEPENDENCY_DIRECTORY
!pip install --target $PYTHON_DEPENDENCY_DIRECTORY botocore boto3 --quiet

---
# Define Function


## Lambda Function Code

Define the Lambda function. In this example, the handler extracts the text from the event payload and calls AWS Comprehend to detect the entities in it.

* [Lambda function handler in Python](https://docs.aws.amazon.com/lambda/latest/dg/python-handler.html)
* [Boto3 Error handling](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/error-handling.html)
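
The handler below relies on `LambdaFunction.get_json_payload_from_event` from `util_aws`, whose implementation is not shown in this notebook. As a rough illustration only, the kind of validation such a helper performs might look like the following sketch. The function `get_payload` is hypothetical and makes no claim to match the actual helper.

```python
import json
from typing import Any, Dict, List


def get_payload(event: Dict[str, Any], required: List[str]) -> Dict[str, Any]:
    """Extract the JSON payload from event["body"] and check the required elements.

    Hypothetical sketch: raises ValueError or TypeError on a malformed event,
    which the caller can map to an HTTP 400 response.
    """
    body = event.get("body") if isinstance(event, dict) else None
    if body is None:
        raise ValueError("event has no 'body' element")

    # The body may arrive as an escaped JSON string or as a dictionary.
    if isinstance(body, (str, bytes, bytearray)):
        try:
            body = json.loads(body)
        except json.JSONDecodeError as error:
            raise ValueError(f"body is not valid JSON: {error}") from error

    if not isinstance(body, dict):
        raise TypeError(f"expected a dictionary payload, got {type(body).__name__}")

    missing = [name for name in required if name not in body]
    if missing:
        raise ValueError(f"missing payload elements: {missing}")
    return body
```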


In [77]:
%%writefile lambda_function.py
import json
import logging
from http import (
    HTTPStatus
)
from typing import (
    List,
    Dict,
    Set,
    Any,
    Union,
)

from util_aws.boto3.lambdas import (
    LambdaFunction
)
from util_aws.boto3.comprehend import (
    ComprehendDetect
)

import botocore
import boto3


# --------------------------------------------------------------------------------
# Global instances to avoid re-instantiations
# --------------------------------------------------------------------------------
logger: logging.Logger = logging.getLogger(__name__)

comprehend: ComprehendDetect = ComprehendDetect(
    comprehend_client=boto3.client('comprehend')
)


# --------------------------------------------------------------------------------
# Lambda handler
# --------------------------------------------------------------------------------
def lambda_handler(event, context):
    """Lambda function to invoke comprehend"""
    # --------------------------------------------------------------------------------
    # Extract payload elements from JSON/Dictionary
    # --------------------------------------------------------------------------------
    try:
        payload: Dict[str, Any] = LambdaFunction.get_json_payload_from_event(
            event=event, 
            expect_payload_as_dictionary=True,
            expected_dictionary_element_names=["text"]
        )
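        text: str = payload.get('text', None)
        language_code = payload.get('language_code', None)
        entity_type = payload.get('entity_type', None)
        # Raise ValueError instead of using assert so that the except clause
        # below returns HTTP 400 (AssertionError would not be caught).
        if text is None:
            raise ValueError("'text' is missing in the payload")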
        
    except (TypeError, ValueError) as error:
        return {
            "statusCode": HTTPStatus.BAD_REQUEST,
            "headers": {
                "Content-Type": "application/json"
            },
            "body": json.dumps(
                {
                    "error": str(error),
                    "event": event
                },
                default=str
            )
        }

    # --------------------------------------------------------------------------------
    # Entity detection
    # --------------------------------------------------------------------------------
    try:
        # --------------------------------------------------------------------------------
        # Entity detection
        # --------------------------------------------------------------------------------
        results: List[str] = comprehend.detect_entities_by_type(
            text=text, 
            language_code=language_code,
            auto_detect_language=(language_code is None),
            entity_types=[entity_type] if entity_type is not None else None,
            sort_by_score=True,
            return_entity_value_only=True
        
        )
        if len(results) == 0:
            msg: str = "no entity detected in the text"
            msg = msg + (f" for the entity_type [{entity_type}]." if entity_type else ".")
            
            logger.error("%s", msg)
            return {
                "statusCode": HTTPStatus.BAD_REQUEST,
                "headers": {
                    "Content-Type": "application/json"
                },
                "body": json.dumps(
                    {
                        "error": msg,
                        "event": event
                    },
                    default=str
                )
            }

        return {
            "statusCode": HTTPStatus.OK,
            "headers": {
                "Content-Type": "application/json"
            },
            "body": json.dumps(results, sort_keys=True, default=str)
        }

    except (RuntimeError, ValueError) as error:
        return {
            "statusCode": HTTPStatus.INTERNAL_SERVER_ERROR,
            "headers": {
                "Content-Type": "application/json"
            },
            "body": json.dumps({"error": error}, sort_keys=True, default=str)
        }

        
if __name__ == "__main__":
    with open("example.txt", "r", encoding='utf-8') as example_text:
        example: str = example_text.read()

    body: dict = {
        "text": example,
        "language_code": "zh",
        "entity_type": "location"
    }
        
    body_as_escaped_string: str = json.dumps(
        body, 
        default=str, 
        ensure_ascii=True    # ASCII is 100% network safe.
    )
    event = {
        "body": body_as_escaped_string
    }
    
    # --------------------------------------------------------------------------------
    # Test the lambda handler
    # --------------------------------------------------------------------------------
    response: dict = lambda_handler(
        event=event,
        context=None
    )
    
    # --------------------------------------------------------------------------------
    # Restore the JSON/Dictionary from the body as escaped string.
    # --------------------------------------------------------------------------------
    response_body_as_dictionary = json.loads(response['body'])
    print(json.dumps(response_body_as_dictionary, indent=4, default=str, ensure_ascii=False))

Overwriting lambda_function.py


###  Test the Code

In [78]:
%store example >example.txt

Writing 'example' (str) to file 'example.txt'.


In [79]:
!python ./lambda_function.py

[
    "悉尼",
    "芬兰",
    "地球上",
    "日托",
    "全球",
    "卢森堡",
    "亚洲地区",
    "中国大陆",
    "新加坡",
    "新西兰",
    "香港",
    "瑞士",
    "欧洲",
    "荷兰",
    "美国",
    "加拿大",
    "爱尔兰",
    "日本"
]


---
# Package

* [Deployment package with dependencies](https://docs.aws.amazon.com/lambda/latest/dg/python-package.html#python-package-create-package-with-dependency)


## Dependency Libraries

Package the exact Python library versions you used for development. To avoid Python package dependency hell, do not rely on the libraries preinstalled in the AWS Lambda runtime.

## Zip File

Package the source code file and libraries into a zip file.


* [How do I troubleshoot "permission denied" or "unable to import module" errors when uploading a Lambda deployment package?](https://repost.aws/knowledge-center/lambda-deployment-package-errors)

> The correct permissions for all executable files within a Lambda deployment package is **644** in Unix permissions numeric notation. For folders within a deployment package, the correct permissions setting is **755**.
> ```
> $ chmod 644 $(find /tmp/package_contents -type f)
> $ chmod 755 $(find /tmp/package_contents -type d)
> $ zip -r new-lambda-package.zip *
> ```
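
As an alternative to `chmod` and `zip`, the archive can also be built with Python's standard `zipfile` module, setting the permission bits on each entry explicitly. This is a minimal sketch under assumptions: `build_deployment_package` is a hypothetical helper, only file entries are written (no directory entries), and the zip file is expected to be created outside the source directory.

```python
import pathlib
import stat
import zipfile


def build_deployment_package(source_dir: str, zip_path: str) -> None:
    """Zip all the files under source_dir with Unix permission 644."""
    source = pathlib.Path(source_dir)
    with zipfile.ZipFile(zip_path, mode="w") as archive:
        for path in sorted(source.rglob("*")):
            if not path.is_file():
                continue
            info = zipfile.ZipInfo.from_file(
                str(path),
                arcname=path.relative_to(source).as_posix()
            )
            # The upper 16 bits of external_attr hold the Unix file mode:
            # regular file with rw-r--r-- (644).
            info.external_attr = (stat.S_IFREG | 0o644) << 16
            archive.writestr(
                info,
                path.read_bytes(),
                compress_type=zipfile.ZIP_DEFLATED
            )


# Usage in this notebook (run from the parent of ./package):
#   build_deployment_package("./package", "./lambda-deployment-package.zip")
```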

In [80]:
!chmod -R u=rwX,go=rX .
!rm -rf ../lambda-deployment-package.zip
!zip -q -r ../lambda-deployment-package.zip .

In [81]:
%cd ..
%pwd

/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend


'/Users/oonisim/home/repository/git/oonisim/python-programs/lib/util_aws/boto3/comprehend'

In [82]:
%%bash -s "$AWS_SERVICE"
PWD=$(basename $(pwd))
if [[ ${PWD}  != $1 ]] ; then
    echo  "make sure to be in ./${1} directory"
    exit -1
fi

---
# Deploy Function

Deploy the package using AWS CLI.

* [AWS CLI lambda - create-function](https://docs.aws.amazon.com/cli/latest/reference/lambda/create-function.html)
* [Using Lambda with the AWS CLI](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-awscli.html)

> ### Create the execution role
> Create the execution role that gives your function permission to access AWS resources. 
> ```
> aws iam create-role --role-name lambda-ex --assume-role-policy-document file://trust-policy.json
> aws iam attach-role-policy --role-name lambda-ex --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
> ```
> **trust-policy.json**:
> ```
> {
>   "Version": "2012-10-17",
>   "Statement": [
>     {
>       "Effect": "Allow",
>       "Principal": {
>         "Service": "lambda.amazonaws.com"
>       },
>       "Action": "sts:AssumeRole"
>     }
>   ]
> }
> ```
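
Besides the basic execution role, the function needs permission to call Comprehend. The sketch below writes a narrower identity-based policy than the allow-all example linked at the top of this notebook. It rests on assumptions: the file name `comprehend-policy.json` is hypothetical, and `comprehend:DetectDominantLanguage` is included on the assumption that automatic language detection calls that API.

```python
import json

# Minimal policy allowing only the Comprehend actions this function is
# assumed to need (entity detection and dominant language detection).
comprehend_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "comprehend:DetectEntities",
                "comprehend:DetectDominantLanguage"
            ],
            "Resource": "*"
        }
    ]
}

# Hypothetical file name; the policy still has to be attached to the role.
with open("comprehend-policy.json", "w", encoding="utf-8") as policy_json:
    json.dump(comprehend_policy, policy_json, indent=2)
```

The resulting file could then be attached to the execution role, for example as an inline policy with `aws iam put-role-policy`.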



We need to ask the AWS team to create the function with the AWS CLI `create-function` command because we do not have the permission.

Once the Lambda function has been created, update its code as below.

In [None]:
%%bash -s "$LAMBDA_FUNCTION_NAME"
aws lambda update-function-code \
    --function-name $1 \
    --zip-file fileb://lambda-deployment-package.zip

Wait a while for the Lambda function update to complete. Otherwise, you can get the error:

```
{
    "StatusCode": 200,
    "FunctionError": "Unhandled",
    "ExecutedVersion": "$LATEST"
}
```
when invoking the function.

In [None]:
time.sleep(5)
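
Instead of a fixed sleep, the update status can be polled until it completes. The sketch below assumes `client` is a `boto3.client("lambda")` instance and reads the `LastUpdateStatus` element returned by `get_function_configuration`. The function name `wait_for_function_update` and its timeout defaults are hypothetical.

```python
import time
from typing import Any


def wait_for_function_update(
        client: Any,
        function_name: str,
        timeout: float = 60.0,
        interval: float = 2.0
) -> None:
    """Poll until the last update of the Lambda function has succeeded."""
    deadline = time.monotonic() + timeout
    while True:
        config = client.get_function_configuration(FunctionName=function_name)
        status = config.get("LastUpdateStatus")
        if status == "Successful":
            return
        if status == "Failed":
            reason = config.get("LastUpdateStatusReason")
            raise RuntimeError(f"update of [{function_name}] failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"update of [{function_name}] did not complete")
        time.sleep(interval)


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   wait_for_function_update(boto3.client("lambda"), LAMBDA_FUNCTION_NAME)
```

boto3 also ships a `function_updated` waiter on the Lambda client that serves the same purpose.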

# Invoke Function

You can invoke the function via the console UI, the CLI, an SDK, or from another AWS service. Under the hood, a Lambda function is invoked via an AWS HTTPS API call.

* [AWS API Lambda - Invoke](https://docs.aws.amazon.com/lambda/latest/dg/API_Invoke.html)

```
POST /2015-03-31/functions/FunctionName/invocations?Qualifier=Qualifier HTTP/1.1
X-Amz-Invocation-Type: InvocationType
X-Amz-Log-Type: LogType
X-Amz-Client-Context: ClientContext

Payload
```

### Request Body

The ```Payload``` in the HTTP request body is a JSON document that is passed to the function as the ```event``` argument.
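
The handler in this notebook expects the event to carry its input as an escaped JSON string under the `body` element. The following sketch shows that double encoding and how to reverse it. The helper name `build_event` is hypothetical.

```python
import json
from typing import Any, Dict, Optional


def build_event(text: str, entity_type: Optional[str] = None) -> Dict[str, Any]:
    """Build an event whose "body" is an ASCII-only escaped JSON string."""
    body: Dict[str, Any] = {"text": text}
    if entity_type is not None:
        body["entity_type"] = entity_type
    # ensure_ascii=True escapes non-ASCII characters as \uXXXX sequences.
    return {"body": json.dumps(body, ensure_ascii=True)}


event = build_event("芬兰连续第六年被评为世界上最幸福的国家。", "location")
print(event["body"].isascii())      # the escaped body contains ASCII only

# json.loads restores the original text from the escaped string.
restored = json.loads(event["body"])
print(restored["text"])
```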

### Response

```
HTTP/1.1 StatusCode
X-Amz-Function-Error: FunctionError
X-Amz-Log-Result: LogResult
X-Amz-Executed-Version: ExecutedVersion

Payload
```

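A Lambda function can return the status code, headers, and response body as a dictionary. With a direct ```Invoke``` call, the dictionary is returned as the response ```Payload```; a front end such as an API Gateway proxy integration or a function URL maps ```statusCode```, ```headers```, and ```body``` into the HTTP response.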

```
import json
        
def lambda_handler(event, context):
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json"
        },
        "body": json.dumps({
            "key": "value"
        })
    }
```
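
Invoking the function through the SDK can be sketched as below, assuming `client` is a `boto3.client("lambda")` instance. The helper name `invoke_function` is hypothetical. It checks the `FunctionError` element of the response, which is set when the function itself fails even though the API call returns status 200.

```python
import json
from typing import Any, Dict


def invoke_function(client: Any, function_name: str, payload: Dict[str, Any]) -> Any:
    """Invoke the Lambda function synchronously and return the decoded body."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(payload).encode("utf-8")
    )
    # The response payload is a stream holding the handler's return value.
    result = json.loads(response["Payload"].read())
    if "FunctionError" in response:
        raise RuntimeError(
            f"function error [{response['FunctionError']}]: {result}"
        )
    # The handler in this notebook returns its result as a JSON string in "body".
    return json.loads(result["body"])


# Usage (requires the boto3 package and AWS credentials):
#   import boto3
#   invoke_function(boto3.client("lambda"), LAMBDA_FUNCTION_NAME, payload)
```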


## Invoke via CLI

Use AWS CLI to invoke the deployed function.

* [AWS CLI lambda - invoke](https://docs.aws.amazon.com/cli/latest/reference/lambda/invoke.html)

In [None]:
with open("./package/example.txt", "r", encoding='utf-8') as example_text:
    text: str = example_text.read()

body: dict = {
    "text": text,
    # "language_code": "ru",
    "entity_type": "location"
}

body_as_escaped_string: str = json.dumps(
    body, 
    default=str, 
    ensure_ascii=True    # ASCII is 100% network safe.
)
payload = {
    "body": body_as_escaped_string
}

with open("payload.json", "w", encoding='utf-8') as payload_json:
    payload_json.write(json.dumps(payload, indent=4, default=str))

In [None]:
%%bash -s "$LAMBDA_FUNCTION_NAME"
aws lambda invoke \
    --function-name $1 \
    --payload file://payload.json \
    response.json

In [None]:
with open(file="response.json", mode="r", encoding="utf-8") as f:
    content: dict = json.loads(f.read())
    body = json.loads(content['body'])

    print(body)

    # On success the body is a JSON list of entity values (strings).
    # On failure it is a dictionary with an "error" element.
    results = set(body) if isinstance(body, list) else set()

    print(results)

# Cleanup

In [83]:
!rm -rf package/ lambda-deployment-package.zip payload.json response.json 