## Introduction

Word2Vec is a popular algorithm used for generating dense vector representations of words in large corpora using unsupervised learning. The resulting vectors have been shown to capture semantic relationships between the corresponding words and are used extensively for many downstream natural language processing (NLP) tasks like sentiment analysis, named entity recognition and machine translation.  

SageMaker BlazingText provides efficient implementations of Word2Vec on:

- single CPU instance
- single instance with multiple GPUs - P2 or P3 instances
- multiple CPU instances (Distributed training)

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.

In [22]:
import sagemaker
from sagemaker import get_execution_role
import boto3 # AWS SDK for Python
import json

sess = sagemaker.Session()
role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf



arn:aws:iam::587023234711:role/service-role/AmazonSageMaker-ExecutionRole-20190603T102730


## Set S3 properties
S3 is needed to store and access the model input and output.


In [21]:
#!wget http://mattmahoney.net/dc/text8.zip -O text8.gz
# Uncompressing
#!gzip -d text8.gz -f
# s3_input -> training data
# s3_output_location -> result data

#!more transacoesObrasbag.csv
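
BlazingText's Word2Vec mode expects the training channel to contain a plain-text file with one "sentence" per line and tokens separated by spaces. The preprocessing that produced `transacoesObrasbag.csv` is not shown in this notebook, so the following is only a minimal sketch of how such a file could be built. It assumes the raw data is a list of `(reader_id, book_id)` transaction pairs and that each reader's items form one line; the record layout, the sample values and the helper name `build_bags` are all hypothetical.

```python
from collections import defaultdict


def build_bags(records):
    """Group (reader_id, book_id) pairs into one space-separated line per reader."""
    bags = defaultdict(list)
    for reader_id, book_id in records:
        bags[reader_id].append(str(book_id))
    # Readers appear in first-seen order (dicts keep insertion order in Python 3.7+).
    return [" ".join(book_ids) for book_ids in bags.values()]


# Hypothetical records for illustration only, not taken from the real dataset.
records = [(1, 1735), (1, 1522), (2, 278), (1, 2172), (2, 5310)]
lines = build_bags(records)
print("\n".join(lines))
```

Each printed line can then be written to the training file, so that the model treats items that occur in the same line as appearing in the same context.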

First, let us set the S3 bucket and prefix under which the training data and the model output will be stored.


In [23]:
bucket = 'recomenderteste' 
prefix = 'bookrecomender' #Replace with the prefix under which you want to store the data if needed
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
print(s3_output_location)


recomenderteste
s3://recomenderteste/bookrecomender/output
bookrecomender


We need to upload the training file to S3 so that it can be consumed by SageMaker to execute training jobs. We'll use the Python SDK to upload the file to the bucket and prefix location that we have set above.


In [25]:
train_channel = prefix + '/train'
sess.upload_data(path='transacoesObrasbag.csv', bucket=bucket, key_prefix=train_channel)
s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
print(s3_train_data)


s3://recomenderteste/bookrecomender/train


## Training Setup

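Now that we are done with all the setup that is needed, we are ready to train our Word2Vec model. To begin, let us create a sagemaker.estimator.Estimator object. This estimator will launch the training job. It needs two values, which we set first:

- region_name: the AWS region of the current session
- container: the BlazingText training image for that region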


In [13]:
region_name = boto3.Session().region_name
print(region_name)

us-east-1


In [14]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))


Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


In [26]:

bt_model = sagemaker.estimator.Estimator(container,
                                         role,
                                         train_instance_count=1,
                                         train_instance_type='ml.c4.2xlarge',
                                         train_volume_size=5,   # EBS volume size in GB
                                         train_max_run=360000,  # maximum training time in seconds
                                         input_mode='File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

Please refer to [algorithm documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext_hyperparameters.html) for the complete list of hyperparameters.

In [27]:
bt_model.set_hyperparameters(mode="batch_skipgram",
                             epochs=5,
                             min_count=5,
                             sampling_threshold=0.0001,
                             learning_rate=0.05,
                             window_size=5,
                             vector_dim=100,
                             negative_samples=5,
                             batch_size=11, #  = (2*window_size + 1) (Preferred. Used only if mode is batch_skipgram)
                             evaluation=True,# Perform similarity evaluation on WS-353 dataset at the end of training
                             subwords=False) # Subword embedding learning is not supported by batch_skipgram
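
To make the role of `window_size` more concrete, here is a simplified sketch of how skip-gram training pairs can be derived from one line of the corpus: every token is paired with the tokens at most `window_size` positions away. This illustrates the general skip-gram idea only; it is not BlazingText's implementation, which also applies steps such as subsampling and negative sampling. The function name `skipgram_pairs` and the sample sentence are assumptions made for the example.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for tokens within window_size of each center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical line of the training file: four item IDs seen together.
sentence = ["1735", "1522", "2172", "278"]
pairs = skipgram_pairs(sentence, window_size=1)
for center, context in pairs:
    print(center, "->", context)
```

With `window_size=5`, as configured above, each token would be paired with up to ten neighbours in the same line.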

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we need to create the sagemaker.session.s3_input objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.


In [28]:
train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', 
                        content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data}

In [29]:
bt_model.fit(inputs=data_channels, logs=True)

2019-07-30 18:59:58 Starting - Starting the training job...
2019-07-30 19:00:10 Starting - Launching requested ML instances......
2019-07-30 19:01:11 Starting - Preparing the instances for training...
2019-07-30 19:01:49 Downloading - Downloading input data...
2019-07-30 19:02:24 Training - Training image download completed. Training in progress..
Arguments: train
[07/30/2019 19:02:25 INFO 140552354150208] nvidia-smi took: 0.0252649784088 secs to identify 0 gpus
[07/30/2019 19:02:25 INFO 140552354150208] Running single machine CPU BlazingText training using batch_skipgram mode.
[07/30/2019 19:02:25 INFO 140552354150208] Processing /opt/ml/input/data/train/transacoesObrasbag.csv . File size: 88 MB
Read 10M words
Read 20M words
Read 30M words
Read 33M words
Number of words:  83293
##### Alpha: 0.0487  Progress: 2.56%  Million Words/sec: 3.01 #####
##### Alpha: 0.0461  Progress: 7.86%  Million W

In [31]:
bt_endpoint = bt_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------------------!

In [34]:
words = ["327797", "16708"]

payload = {"instances" : words}

response = bt_endpoint.predict(json.dumps(payload))

vecs = json.loads(response)
print(vecs)

[{'vector': [0.05717432498931885, 0.04714442044496536, -0.01673906110227108, 0.02388642355799675, 0.015328881330788136, -0.025969361886382103, -0.11886902898550034, 0.06898565590381622, 0.09335990250110626, -0.02641250565648079, -0.03700384497642517, 0.06896259635686874, -0.031042464077472687, 0.012479080818593502, -0.04531089961528778, 0.10692345350980759, 0.025482043623924255, -0.00020958659297320992, -0.03344803676009178, -0.012593284249305725, 0.07492173463106155, 0.07054547220468521, 0.006326812319457531, -0.0707009807229042, -0.03975049406290054, -0.021173417568206787, 0.06802577525377274, 0.02770390175282955, -0.0837588831782341, 0.05689423158764839, 0.011320154182612896, 0.001598377013579011, 0.04239125922322273, -0.08188647031784058, -0.0483018197119236, -0.026983944699168205, -0.01045744027942419, -0.03256293386220932, -0.0643141120672226, -0.07908020913600922, 0.008036988787353039, 0.004903685301542282, -0.016543073579669, 0.11999335139989853, 0.0431520976126194, -0.07578457

## Evaluation

Let us now download the word vectors learned by our model and visualize them using a t-SNE plot.


In [36]:
s3 = boto3.resource('s3')

# model_data is an S3 URI ("s3://<bucket>/<key>"); keep only the key that follows the bucket name
key = bt_model.model_data[bt_model.model_data.find("/", 5)+1:]
s3.Bucket(bucket).download_file(key, 'model.tar.gz')
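
The string slicing used above to obtain the object key can also be written with the standard library's URL parser, which some readers may find easier to follow. The sketch below is an alternative, not what the notebook ran; the helper name `split_s3_uri` and the example URI (including the job name in it) are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split "s3://bucket/some/key" into ("bucket", "some/key")."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical model location; the real value comes from bt_model.model_data.
example_uri = "s3://recomenderteste/bookrecomender/output/example-job/output/model.tar.gz"
bucket_name, object_key = split_s3_uri(example_uri)
print(bucket_name)
print(object_key)
```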

In [38]:
!ls

!tar -xvzf model.tar.gz

blazingtext_word2vec_text8_2019-07-30  model.tar.gz
ClassificadorCDD		       Recomender_amazon_word2Vec.ipynb
lost+found			       transacoesObrasbag.csv
vectors.txt
eval.json
vectors.bin
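
The extracted `vectors.txt` can be read into a dictionary for local inspection. The sketch below assumes the usual word2vec text layout: an optional header line with the vocabulary size and the dimension, followed by one line per token holding the token and its vector components separated by spaces. The helper name `load_vectors` is hypothetical, and the example runs on a tiny temporary file instead of the real model output.

```python
import os
import tempfile


def load_vectors(path):
    """Read a word2vec-style text file into a {token: [float, ...]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            # Lines with at most two fields are treated as the header or as blank lines.
            if len(parts) <= 2:
                continue
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny stand-in file with made-up values, used only to exercise the function.
sample = "2 3\n1735 0.1 0.2 0.3\n1522 -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
vectors = load_vectors(tmp.name)
os.remove(tmp.name)
print(sorted(vectors))
```

For the real model, `load_vectors('vectors.txt')` would be the corresponding call.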


In [None]:
books = {
    1735: 'moreninha',
    1522: 'dom casmurro',
    2172: 'Quincas Borba: ',
    278: 'Quem Mexeu no Meu Queijo?',
    5310: 'Como Fazer Amigos e Influenciar Pessoas',
    1117: 'Fundamentos da metafisica dos costumes ',
    9948: 'Convite a Filosofia:',
    277: 'Grande sertao Veredas'
}
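
A dictionary like `books` can be used to turn nearest neighbours in the embedding space into readable titles. The sketch below ranks tokens by cosine similarity to a query item and looks up their titles. It assumes that the tokens in the model are the same IDs that serve as keys in `books` (stored as strings in the model and as integers in the dictionary); the function names and the two-dimensional toy vectors are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_titles(book_id, vectors, titles, top_k=3):
    """Return (title, similarity) for the top_k tokens closest to book_id."""
    # Model tokens are strings, so index the titles by string as well.
    titles_by_token = {str(k): v for k, v in titles.items()}
    query_token = str(book_id)
    query = vectors[query_token]
    scored = [(cosine(query, vec), token)
              for token, vec in vectors.items() if token != query_token]
    scored.sort(reverse=True)
    # Fall back to the raw token when no title is known for it.
    return [(titles_by_token.get(token, token), score) for score, token in scored[:top_k]]


# Toy vectors with made-up values; real ones would come from the trained model.
toy_vectors = {"1735": [1.0, 0.0], "1522": [0.9, 0.1], "278": [0.0, 1.0]}
toy_titles = {1735: "moreninha", 1522: "dom casmurro", 278: "Quem Mexeu no Meu Queijo?"}
neighbours = similar_titles(1735, toy_vectors, toy_titles)
for title, score in neighbours:
    print("{:.3f}  {}".format(score, title))
```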



In [None]:
print(vecs)