# AWS BlazingText

The bag-of-words approach was not giving me satisfactory results, whether using single words or 2-grams. Searching for another text classification method, I came across the AWS BlazingText algorithm. From the BlazingText documentation:

_The Amazon SageMaker BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc._

## Set up environment

### Download dataset manually

Let's start by downloading the training dataset from Kaggle's website:

https://www.kaggle.com/c/nlp-getting-started/data

I downloaded the dataset manually and saved it in the 'input' folder.

The dataset could also be downloaded programmatically with the Kaggle API, but that requires authentication credentials, which would be unsafe to include in a notebook that is meant to be shared.

### Read dataset from CSV

Read the dataset from the CSV file.

In [1]:
import numpy as np
import pandas as pd

# Read train dataset
train_df = pd.read_csv('input/train.csv', encoding = 'ISO-8859-1', index_col='id')

# Read test dataset
test_df = pd.read_csv('input/test.csv', encoding = 'ISO-8859-1', index_col='id')

# Check
train_df.head()

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Text tokenization

Tweets are tokenized in a multi-step process:
1. **Hyperlinks:** Replace hyperlinks with the keyword “islink”.
1. **Numbers:** Replace all numbers with the keyword “isnumber”.
1. **Punctuation:** Remove any non-alphanumeric character.
1. **Case:** Make all tweets lowercase.
1. **Split** text into different words.
1. **Remove stopwords**.
1. **Stem words**.
1. **Label tokenized strings** as per BlazingText's requirements.

I use the _Natural Language Toolkit_ and _Regular Expressions_ to tokenize tweets.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re

nltk.download('stopwords', quiet=True) # Download the stopword list once
stop_words = set(stopwords.words('english')) # Build the set once for fast lookups
stemmer = PorterStemmer() # Reuse a single stemmer for all tweets

def tweet_to_words(text):
    text = re.sub(r'http\S+', ' islink ', text) # Replace hyperlinks with keyword
    # Replace numbers with keyword. Note: this pattern also matches standalone
    # commas and periods, so 'Sask.' produces an extra 'isnumber' token
    text = re.sub(r'(?!,$)[\d,.]+', ' isnumber ', text)
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower()) # Remove non-alphanumeric characters and convert to lower case
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem

    return words
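
To see what the two keyword substitutions do in isolation, here is a small sketch that uses only `re`. `loose_tokens` reuses the patterns from `tweet_to_words`; `strict_tokens` is a hypothetical alternative that is not used anywhere in this notebook and whose number pattern requires at least one digit. Both helper names and the shortened sample strings are my own illustration.

```python
import re

def loose_tokens(text):
    """Apply the two keyword substitutions used in tweet_to_words."""
    text = re.sub(r'http\S+', ' islink ', text)
    text = re.sub(r'(?!,$)[\d,.]+', ' isnumber ', text)
    return text.split()

def strict_tokens(text):
    """Hypothetical variant: a number must contain at least one digit."""
    text = re.sub(r'http\S+', ' islink ', text)
    text = re.sub(r'\d+(?:[.,]\d+)*', ' isnumber ', text)
    return text.split()

sample = 'Forest fire near La Ronge Sask. Canada'

# The period after 'Sask' is matched by [\d,.]+ and becomes 'isnumber'
print(loose_tokens(sample))

# The stricter pattern leaves the period alone (it is stripped later as punctuation)
print(strict_tokens(sample))

# Both patterns treat '13,000' as a single number
print(loose_tokens('13,000 people'), strict_tokens('13,000 people'))
```

Switching to the stricter pattern would change the tokens, and therefore the training file and the model, so I have left the original pattern in place for the results shown in this notebook.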

In [3]:
# Train dataset
train_set = []
for index, row in train_df.iterrows():
    tokenized_text = tweet_to_words(row['text'])
    tokenized_text.insert(0, '__label__' + str(row['target'])) # BlazingText expects each line to start with its label, prefixed with __label__
    train_set.append(tokenized_text)

# Test dataset
#test_set = []
#for index, row in test_df.iterrows():
#    test_set.append(tweet_to_words(row['text']))

print(train_set[:5]) # Check

[['__label__1', 'deed', 'reason', 'earthquak', 'may', 'allah', 'forgiv', 'us'], ['__label__1', 'forest', 'fire', 'near', 'la', 'rong', 'sask', 'isnumb', 'canada'], ['__label__1', 'resid', 'ask', 'shelter', 'place', 'notifi', 'offic', 'isnumb', 'evacu', 'shelter', 'place', 'order', 'expect'], ['__label__1', 'isnumb', 'peopl', 'receiv', 'wildfir', 'evacu', 'order', 'california'], ['__label__1', 'got', 'sent', 'photo', 'rubi', 'alaska', 'smoke', 'wildfir', 'pour', 'school']]


## Save data to train the model

### Save train dataset

In [4]:
import os
import csv

data_dir = 'data' # The folder we will use for storing data
os.makedirs(data_dir, exist_ok=True) # Create the folder if it does not exist yet

with open(data_dir + '/train.txt', 'w') as csvoutfile:
    csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
    csv_writer.writerows(train_set)
#with open(data_dir + '/test.txt', 'w') as csvoutfile:
#    csv_writer = csv.writer(csvoutfile, delimiter=' ', lineterminator='\n')
#    csv_writer.writerows(test_x)
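
Before uploading, it can be worth confirming that every line of the file starts with a `__label__` token. The following is a minimal sketch of such a check; `check_blazingtext_file` is a hypothetical helper of mine, and the sketch writes a two-row stand-in for `train_set` to a temporary folder so that it runs on its own.

```python
import csv
import os
import tempfile

def check_blazingtext_file(path):
    """Return the number of lines; raise ValueError if a line lacks a label."""
    count = 0
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            tokens = line.split()
            if not tokens or not tokens[0].startswith('__label__'):
                raise ValueError('line {} has no label: {!r}'.format(line_no, line))
            count += 1
    return count

# Tiny stand-in for train_set (rows copied from the tokenization output)
sample_rows = [
    ['__label__1', 'forest', 'fire', 'near', 'la', 'rong', 'sask', 'isnumb', 'canada'],
    ['__label__1', 'isnumb', 'peopl', 'receiv', 'wildfir', 'evacu', 'order', 'california'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'train.txt')
    # Same writer settings as the cell above: space-delimited, one tweet per line
    with open(sample_path, 'w', newline='') as out:
        csv.writer(out, delimiter=' ', lineterminator='\n').writerows(sample_rows)
    n_lines = check_blazingtext_file(sample_path)
    with open(sample_path) as f:
        first_line = f.readline().rstrip('\n')

print(n_lines, first_line)
```

In the notebook itself, the same helper could be pointed at `data_dir + '/train.txt'`.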

### Upload data to S3

In [5]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
print(role) # This is the role that SageMaker would use to leverage AWS resources (S3, CloudWatch) on your behalf

bucket = sess.default_bucket() # Replace with your own bucket name if needed
print(bucket)
prefix = 'blazingtext/supervised' #Replace with the prefix under which you want to store the data if needed

arn:aws:iam::496791827306:role/service-role/AmazonSageMaker-ExecutionRole-20191201T135241
sagemaker-eu-west-1-496791827306


In [6]:
%%time

train_channel = prefix + '/train'
#validation_channel = prefix + '/validation'

sess.upload_data(path=data_dir + '/train.txt', bucket=bucket, key_prefix=train_channel)
#sess.upload_data(path='tweets.validation', bucket=bucket, key_prefix=validation_channel)

s3_train_data = 's3://{}/{}'.format(bucket, train_channel)
#s3_validation_data = 's3://{}/{}'.format(bucket, validation_channel)

CPU times: user 22.7 ms, sys: 4.16 ms, total: 26.9 ms
Wall time: 117 ms
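
The validation channel is commented out above, so the model is trained on all labelled tweets and its accuracy is only known after submitting to Kaggle. One option is to hold out part of the labelled data. The sketch below shows a simple shuffled split using only the standard library; `split_rows`, the 20% fraction, and the seed are assumptions of mine, not something this notebook uses.

```python
import random

def split_rows(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (train_rows, validation_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled) # Seeded so the split is reproducible
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# Stand-in for train_set: ten labelled, tokenized rows
rows = [['__label__{}'.format(i % 2), 'token{}'.format(i)] for i in range(10)]
train_rows, validation_rows = split_rows(rows)

print(len(train_rows), len(validation_rows))
```

The two lists could then be written to separate files and uploaded under `train_channel` and `validation_channel`, following the commented-out lines in the cell above.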


In [7]:
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

In [8]:
region_name = boto3.Session().region_name

## Training model

### Set up training instance

In [9]:
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

train_data = sagemaker.session.s3_input(s3_train_data, distribution='FullyReplicated', content_type='text/plain', s3_data_type='S3Prefix')
data_channels = {'train': train_data}

Using SageMaker BlazingText container: 685385470294.dkr.ecr.eu-west-1.amazonaws.com/blazingtext:latest (eu-west-1)


In [10]:
bt_model = sagemaker.estimator.Estimator(container,
                                         role, 
                                         train_instance_count=1, 
                                         train_instance_type='ml.m4.xlarge',
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path=s3_output_location,
                                         sagemaker_session=sess)

### Set hyperparameters

In [11]:
bt_model.set_hyperparameters(mode="supervised",
                            evaluation=False,
                            epochs=10,
                            min_count=2,
                            learning_rate=0.05,
                            vector_dim=10,
                            early_stopping=False,
                            patience=4,
                            min_epochs=5,
                            word_ngrams=1)

### Train model

In [12]:
bt_model.fit(inputs=data_channels, logs=True)

2020-01-20 18:05:02 Starting - Starting the training job...
2020-01-20 18:05:04 Starting - Launching requested ML instances......
2020-01-20 18:06:30 Starting - Preparing the instances for training......
2020-01-20 18:07:25 Downloading - Downloading input data...
2020-01-20 18:07:54 Training - Downloading the training image..
Arguments: train
[01/20/2020 18:08:14 INFO 140231241815872] nvidia-smi took: 0.025181055069 secs to identify 0 gpus
[01/20/2020 18:08:14 INFO 140231241815872] Running single machine CPU BlazingText training using supervised mode.
[01/20/2020 18:08:14 INFO 140231241815872] 2 files found in train channel. Using /opt/ml/input/data/train/train.txt for training...
[01/20/2020 18:08:14 INFO 140231241815872] Processing /opt/ml/input/data/train/train.txt . File size: 0 MB
Read 0M words
Number of words:  5294
##### Alpha: 0.0000  Progress: 99.94%  Million Words/sec: 9.71 #####
##### Alpha: 0.0000

## Deployment

In [13]:
# Deploy estimator

# Remember to shut down the endpoint when we are done!!!!

text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
text_classifier.endpoint

--------------------------------------------------------------------------!

'blazingtext-2020-01-20-18-05-02-792'

## Testing

### Prepare test data & predict

In [14]:
predictions = []
for index, row in test_df.iterrows():
    tokenized_sentence = ' '.join(tweet_to_words(row['text'])) # Tokenize text
    payload = {"instances" : [tokenized_sentence]} # Format data for prediction
    response = text_classifier.predict(json.dumps(payload)) # Predict
    prediction = json.loads(response) # Load response
    if prediction[0]['label'][0] == '__label__1': # Save value accordingly
        predictions.append(1) # Plain ints: the np.int alias was removed in NumPy 1.24
    else:
        predictions.append(0)
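
The loop above sends one request per tweet. Since the payload's `instances` field is a list, several sentences can be sent in one request instead. The sketch below shows one way to build batched payloads and to turn a response into integer targets, using only `json`. The helper names, the batch size, and the `fake_response` values (including its `prob` field) are assumptions for illustration; only the `label` field is relied upon, as in the loop above.

```python
import json

def build_payloads(sentences, batch_size=100):
    """Yield JSON request bodies, each holding up to batch_size sentences."""
    for start in range(0, len(sentences), batch_size):
        yield json.dumps({"instances": sentences[start:start + batch_size]})

def labels_to_targets(response_body):
    """Map each predicted '__label__N' string in a response to the integer N."""
    return [int(item['label'][0].replace('__label__', ''))
            for item in json.loads(response_body)]

# Hypothetical response body, shaped like the one parsed in the loop above
fake_response = json.dumps([
    {"label": ["__label__1"], "prob": [0.9]},
    {"label": ["__label__0"], "prob": [0.7]},
])

sentences = ['forest fire near', 'love new car', 'isnumb peopl evacu']
payloads = list(build_payloads(sentences, batch_size=2))

print(len(payloads), labels_to_targets(fake_response))
```

Each payload would be passed to `text_classifier.predict` in place of the single-sentence payload, and the resulting targets appended to `predictions` in order.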

In [15]:
# Create a dataframe with predictions
predictions_df = pd.DataFrame(predictions, index=test_df.index, columns=['target']).astype('int64')
predictions_df.head()

id,target
0,1
2,1
3,1
9,1
11,1


### Save predictions for submission to Kaggle

Kaggle expects a two-column CSV file: the first column contains the _id_ and the second our predicted _target_.

In [16]:
# Save to a .csv file
predictions_df.to_csv(r'data/predictions.csv', header=True)

A file with predictions has been saved in "data/predictions.csv". Now we can upload it to Kaggle manually and find out how it performed.
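
Before uploading, a quick format check can catch a malformed file. This is a minimal sketch that validates CSV text against the two-column layout described above; `check_submission` is a hypothetical helper of mine and the sample rows are copied from the predictions preview.

```python
import csv
import io

def check_submission(csv_text):
    """Return the number of data rows; raise ValueError on a format problem."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    if header != ['id', 'target']:
        raise ValueError('unexpected header: {}'.format(header))
    rows = 0
    for row in reader:
        if len(row) != 2 or row[1] not in ('0', '1'):
            raise ValueError('bad row: {}'.format(row))
        int(row[0]) # The id must be an integer; raises ValueError otherwise
        rows += 1
    return rows

sample_csv = 'id,target\n0,1\n2,1\n3,1\n'
print(check_submission(sample_csv))
```

For the real file, the text could be read with `open('data/predictions.csv').read()` and passed to the same helper.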