# Harpin AI Identity Resolution - 50k Data Sample


## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open it from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**
    1. Or your AWS account has already subscribed to this free product from AWS Marketplace: [Identity Resolution](https://aws.amazon.com/marketplace/pp/prodview-etnavzupbnthk?sr=0-7&ref_=beagle&applicationId=AWSMPContessa).

## Set up Amazon SageMaker environment

The session remembers our connection parameters to Amazon SageMaker. We'll use it to perform all of our Amazon SageMaker operations.

In [1]:
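import sagemaker
#Create the session that the cells below use for all Amazon SageMaker operations
sagemaker_session = sagemaker.Session()
#Set up your S3 bucket
s3_bucket = 'YOUR S3 BUCKET'
#Set up the Algorithm ARN from your algorithm subscription
algorithm_arn = 'YOUR ALGORITHM ARN'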

## Input Data Description
For this example, the input dataset contains 50,000 rows of sample identity data in CSV format. The file is located in the Git repository that was cloned into this notebook. Other notebooks provide examples of loading data from different sources, such as Amazon S3. Data can be read from files in CSV, Avro, or Parquet format.

In [2]:
import sagemaker
role = sagemaker.get_execution_role()

#Key prefix (no leading slash) under which all files for this example are stored in the bucket
common_prefix = 'harpin/batch_resolution'
common_prefix_url = 's3://' + s3_bucket + '/' + common_prefix

#Upload the configuration file and sample data to S3.
config_local = '../config/sample_data_50k.yml'
data_local = '../data/sample_data_50k'
config_prefix = common_prefix + '/sample_data_50k/config'
data_prefix = common_prefix + '/sample_data_50k/data'
source_config = sagemaker_session.upload_data(config_local, bucket=s3_bucket, key_prefix=config_prefix)
#upload_data returns the S3 URI of the uploaded data, which is the input location for the job
input_data = sagemaker_session.upload_data(data_local, bucket=s3_bucket, key_prefix=data_prefix)

#Set up the output s3 location for the identity graph
identity_graph = common_prefix_url
print('Input identity data location: ', input_data)
print('Source config file location: ', source_config)
print('Output identity graph location: ' + identity_graph)

## Create Identity Resolution SageMaker TrainingJob using Algorithm ARN

We will use the tools provided by the Amazon SageMaker Python SDK to create the [AlgorithmEstimator](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html) to perform the job.

In [3]:
from sagemaker.algorithm import AlgorithmEstimator

algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    base_job_name='harpin-ai-identity-resolution-50k-sample',
    output_path=identity_graph
)

## Run Identity Resolution Clustering with SageMaker TrainingJob
Note that the TrainingJob actually performs a clustering process. The clustering process produces an identity graph, which clusters the records in the input dataset into a set of disjoint customer profiles.
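
To make "disjoint" concrete: every input record ends up in exactly one profile, and no record belongs to two. The sketch below illustrates only that property, using a toy union-find over hypothetical record IDs and pairwise match links. It is not Harpin AI's matching logic, which this notebook does not describe.

```python
def cluster_records(record_ids, matched_pairs):
    """Group records into disjoint clusters from pairwise match links (toy union-find)."""
    parent = {r: r for r in record_ids}

    def find(r):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in matched_pairs:
        # Merge the two clusters that contain a and b.
        parent[find(a)] = find(b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


# Hypothetical record IDs and match links, for illustration only.
records = ["r1", "r2", "r3", "r4", "r5"]
links = [("r1", "r2"), ("r2", "r3")]
profiles = cluster_records(records, links)
print(profiles)
```

Records that match nothing else (here `r4` and `r5`) each form a single-record cluster in this toy model, so the cluster sizes always add up to the number of input records.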

In [4]:
#Specify the input data sources for up to 3 channels (i.e. clustering, clustering2 and clustering3), and a channel config file.
#And run the identity resolution process by calling the fit() method
print('Now run the identity resolution clustering using Algorithm ARN %s in region %s' % (algorithm_arn, sagemaker_session.boto_region_name))
algo.fit({"sample_data_50k": input_data, 
          "channel_config": source_config})

## Identity Graph Data and Format
The identity graph is stored in a folder containing one or more files of the same type as the input files (CSV, Avro, or Parquet). If the input files are CSV, the output files are CSV too. All fields in the input files are retained in the output files, along with one additional field called PIN. The PIN field is the unique customer profile identifier assigned to the record. Customer records with the same (non-default) PIN are considered to refer to the same customer profile. The default value for PIN is -1, meaning that the input record does not contain enough information to determine which customer profile it belongs to.
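
The PIN semantics can be sketched in plain Python before loading the real output. The example below assumes a pipe-delimited CSV with a lowercase `pin` column, matching how the analysis cells later in this notebook read the data. The `name` and `email` columns and all values are hypothetical.

```python
import csv
import io
from collections import defaultdict

DEFAULT_PIN = "-1"  # default value: not enough information to assign a profile

# Hypothetical pipe-delimited sample; columns and values are made up for illustration.
sample_output = """name|email|pin
Ann Lee|ann@example.com|10000000001
A. Lee|ann@example.com|10000000001
Bob Ray|bob@example.com|10000000002
Unknown||-1
"""

profiles = defaultdict(list)  # PIN -> records assigned to that customer profile
unresolved = []               # records left with the default PIN
for row in csv.DictReader(io.StringIO(sample_output), delimiter="|"):
    if row["pin"] == DEFAULT_PIN:
        unresolved.append(row)
    else:
        profiles[row["pin"]].append(row)

print(len(profiles), "profiles;", len(unresolved), "unresolved record(s)")
```

Records carrying the default PIN are kept apart here because sharing -1 does not mean they refer to the same customer.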

In [5]:
#Here is the output path for storing the results from running the algorithm
path = algo.output_path
!aws s3 ls $path/

In [6]:
#Change this value to the name of your own training job folder, as listed by the previous cell.
specific_run = 'harpin-ai-identity-resolution-50k-sample-TIMESTAMP'

#Specify a temporary directory, and extract the identity graph from S3 to the temp_data directory for analysis
temp_data = './temp_data'
!rm -rf $temp_data
!mkdir -p $temp_data
!aws s3 cp $path/$specific_run/output/output.tar.gz $temp_data/
!tar -xzvf $temp_data/output.tar.gz -C $temp_data/
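
Where shell commands are not available, the extraction step could also be done with Python's standard `tarfile` module. The sketch below is an alternative under that assumption, not part of the original workflow. The `extract_archive` helper and the `part-0.csv` file name are hypothetical, and the demo builds a small stand-in archive so that it runs without S3 access.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names


# Self-contained demo: build a tiny stand-in for output.tar.gz, then extract it.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "identity_graph")
    os.makedirs(src)
    with open(os.path.join(src, "part-0.csv"), "w") as f:
        f.write("name|pin\nAnn|1\n")

    archive = os.path.join(workdir, "output.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="identity_graph")

    target = os.path.join(workdir, "temp_data")
    extracted = extract_archive(archive, target)
    extracted_ok = os.path.exists(os.path.join(target, "identity_graph", "part-0.csv"))
    print(extracted)
```

With the real job output, `archive_path` would point at the downloaded `output.tar.gz` and `dest_dir` at `temp_data`.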

## Identity Graph Analysis
Now the identity resolution clustering process is finished and we have the identity graph. We can perform some simple analysis on the identity graph such as record count, unique customer profiles, duplicate identity analysis, etc.

In [7]:
#Import the pyspark libraries and create the spark object
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()

In [8]:
#Load the identity graph into a Spark DataFrame from the pipe-delimited CSV files
identity_graph = spark.read.format('csv') \
                      .options(header='true', inferSchema='false', delimiter='|') \
                      .option('mode', 'DROPMALFORMED') \
                      .load(temp_data + '/identity_graph/')

In [9]:
#List the fields and their data types in the identity graph
#The identity graph will contain the union of fields from all the input data sources, plus an additional field "pin"
identity_graph.dtypes

In [10]:
#Count the number of records in the identity graph
identity_graph.count()

In [11]:
#Count the unique number of customer profiles in the identity graph
identity_graph.filter(F.col('pin') != '-1') \
              .select('pin') \
              .distinct() \
              .count()

In [12]:
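#Perform duplicate record analysis on the identity graph. For example, there are 47 records which are assigned the same PIN (10000000543).
#Those records are considered to be referring to the same customer.
#Records with the default PIN (-1), if any, also appear as one group here, but they do not represent a single customer.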
identity_graph.groupBy('pin') \
              .count() \
              .orderBy(F.desc('count')) \
              .show(20)
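
A natural follow-up to the ranking above is the distribution of profile sizes: how many profiles have one record, two records, and so on. The sketch below shows the idea in plain Python on a hypothetical list of PIN values, one per record. On the real data the same counts would come from the `pin` column.

```python
from collections import Counter

# Hypothetical PIN values, one per record; "-1" is the default (unassigned) PIN.
pins = ["101", "101", "101", "102", "102", "103", "-1", "-1"]

# Number of records per profile, ignoring records with the default PIN.
records_per_pin = Counter(p for p in pins if p != "-1")

# Number of profiles for each profile size.
size_distribution = Counter(records_per_pin.values())

for size, n_profiles in sorted(size_distribution.items()):
    print(f"{n_profiles} profile(s) with {size} record(s)")
```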

In [13]:
#Clean up the temporary directory
!rm -rf $temp_data