# Amazon Comprehend - Classification Example
### Classify using Text Features

Objective: Train a model to identify tweets that require follow-up  

Input: Tweets  
Target: Binary. 0=Normal, 1=Followup



The labelled AWS Twitter tweets are available in this bucket:  
https://s3.console.aws.amazon.com/s3/buckets/aml-sample-data/?region=us-east-2  
File: social-media/aml_training_dataset.csv

In [None]:
import numpy as np
import pandas as pd
import json

### Download Twitter training data

In [None]:
!aws s3 cp s3://aml-sample-data/social-media/aml_training_dataset.csv .

### Prepare Training and Test data 

In [None]:
df = pd.read_csv('aml_training_dataset.csv')

In [None]:
print('Rows: {0}, Columns: {1}'.format(df.shape[0],df.shape[1]))

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df = df[['text','trainingLabel']]

In [None]:
# trainingLabel contains the class
# Valid values are: 
#  0 = Normal
#  1 = Followup

df.trainingLabel.value_counts()

In [None]:
tweet_normal = df['trainingLabel'] == 0
tweet_followup = df['trainingLabel'] == 1

In [None]:
# Some examples of tweets that are labelled as requiring follow-up
# head() avoids an IndexError if fewer than 15 rows are available
for tweet in df[tweet_followup]['text'].head(15):
    print(tweet)
    print()

In [None]:
# Some examples of tweets that are labelled as normal
# head() avoids an IndexError if fewer than 10 rows are available
for tweet in df[tweet_normal]['text'].head(10):
    print(tweet)
    print()

In [None]:
# Training, validation and test split
# The Comprehend service automatically splits the dataset it is given 80-20 into training and validation sets.
# We still need to confirm the quality of the model independently, using a test set the service never sees.

# So, reserve 10% of the data for testing and provide the remaining 90% to the Comprehend service
# Training & validation = 90% of the data
# Test                  = 10% of the data

# Randomize the dataset (fixed seed so the split is reproducible)
np.random.seed(5)
l = list(df.index)
np.random.shuffle(l)
df = df.iloc[l]

In [None]:
rows = df.shape[0]
train = int(.9 * rows)
test = rows - train

In [None]:
rows, train, test

In [None]:
df_train = df[:train]
df_test = df[train:]

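The shuffle-and-slice split above can be wrapped in a small helper and checked on toy data. This is only a sketch: `shuffle_split` and the toy frame are hypothetical names introduced for illustration, and the helper shuffles row positions rather than index labels, so it does not reproduce the exact row order of the cells above.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, train_fraction=0.9, seed=5):
    """Shuffle row positions with a fixed seed, then slice into two parts."""
    rng = np.random.RandomState(seed)
    positions = rng.permutation(len(df))      # positional indices, safe for any index
    shuffled = df.iloc[positions]
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Toy data (invented): 20 rows with alternating labels
toy = pd.DataFrame({'text': ['tweet {}'.format(i) for i in range(20)],
                    'trainingLabel': [i % 2 for i in range(20)]})

toy_train, toy_test = shuffle_split(toy)
print(len(toy_train), len(toy_test))
```

Because the helper works on positions, it also behaves correctly when the frame's index is not the default 0..n-1 range, which `df.iloc[list(df.index)]` relies on.
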
In [None]:
df_train.trainingLabel.value_counts()

In [None]:
df_test.trainingLabel.value_counts()

In [None]:
df_train.columns

## Save the data. Note: no header row, and the label comes before the text

In [None]:
df_train.to_csv('twitter_train.csv',
                index=False,
                header=False,
                columns=['trainingLabel','text'])

In [None]:
df_test.to_csv('twitter_test_with_label.csv',
                index=False,
                header=False,
                columns=['trainingLabel','text'])

In [None]:
df_test.to_csv('twitter_test_without_label.csv',
                index=False,
                header=False,
                columns=['text'])

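To see what "no header, label before text" looks like on disk, the sketch below writes a two-row toy frame to an in-memory buffer using the same `to_csv` arguments as the training file. The toy rows are invented for illustration.

```python
import io
import pandas as pd

# Toy data (invented); column names match the notebook
toy = pd.DataFrame({'text': ['my order never arrived, please help',
                             'loving the new release'],
                    'trainingLabel': [1, 0]})

buffer = io.StringIO()
toy.to_csv(buffer,
           index=False,
           header=False,
           columns=['trainingLabel', 'text'])   # label first, then text

csv_text = buffer.getvalue()
print(csv_text)
```

Each output line starts with the label, and pandas quotes any text field that contains a comma, so the file still has exactly two columns.
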
### Upload to S3

### Specify your bucket name. Replace 'aws-ml-test-nsadawi' with your bucket

In [None]:
!aws s3 cp twitter_train.csv s3://aws-ml-test-nsadawi/twitter/train/twitter_train.csv

In [None]:
!aws s3 cp twitter_test_without_label.csv s3://aws-ml-test-nsadawi/twitter/test/twitter_test_without_label.csv

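The notebook does not show how the classifier was trained or how the classification job was launched; both may well have been done in the Comprehend console. As one possible alternative, the sketch below builds the request parameters for the boto3 Comprehend calls `create_document_classifier` and `start_document_classification_job`. The role ARN, classifier name and job name are hypothetical placeholders, and `launch` is an illustrative helper that is defined but not called here.

```python
# Hypothetical placeholders: replace with your own role, names and bucket
ROLE_ARN = 'arn:aws:iam::123456789012:role/YourComprehendDataAccessRole'
BUCKET = 's3://aws-ml-test-nsadawi/twitter'

def classifier_request(name, role_arn, train_s3_uri):
    """Parameters for comprehend.create_document_classifier."""
    return {
        'DocumentClassifierName': name,
        'DataAccessRoleArn': role_arn,
        'InputDataConfig': {'S3Uri': train_s3_uri},
        'LanguageCode': 'en',
    }

def job_request(job_name, classifier_arn, role_arn, input_s3_uri, output_s3_uri):
    """Parameters for comprehend.start_document_classification_job."""
    return {
        'JobName': job_name,
        'DocumentClassifierArn': classifier_arn,
        'DataAccessRoleArn': role_arn,
        # Assumes one tweet per line in the unlabelled test file
        'InputDataConfig': {'S3Uri': input_s3_uri, 'InputFormat': 'ONE_DOC_PER_LINE'},
        'OutputDataConfig': {'S3Uri': output_s3_uri},
    }

def launch(train_request):
    """Start classifier training; needs boto3 and AWS credentials."""
    import boto3
    comprehend = boto3.client('comprehend')
    response = comprehend.create_document_classifier(**train_request)
    return response['DocumentClassifierArn']

train_request = classifier_request('twitter-followup-classifier', ROLE_ARN,
                                   BUCKET + '/train/twitter_train.csv')
print(train_request)
```

Training has to finish before a classification job can use the classifier. After that, `job_request(...)` with the returned classifier ARN, the test file under `/test/` and an output prefix such as `/test_output/` gives the arguments for `start_document_classification_job`.
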
# After Running Classification Job on Comprehend
### Copy tar gz file from S3
#### Update the S3 path to point to the file in your bucket

In [None]:
!aws s3 cp "s3://aws-ml-test-nsadawi/twitter/test_output/479320215787-CLN-85177e7f1be27bbfa8eaa87eee9b8b0f/output/output.tar.gz" .

In [None]:
# Extract the tar file content
import tarfile
# The context manager closes the archive even if extraction fails
with tarfile.open("output.tar.gz") as tar:
    tar.extractall()

In [None]:
ls

In [None]:
test_file = 'twitter_test_with_label.csv'
predicted_file = 'predictions.jsonl'

In [None]:
# Specify the column names as the file does not have column header
df = pd.read_csv(test_file,names=['trainingLabel','text'])

In [None]:
df.head()

In [None]:
predicted_class = []
predicted_prob = []

# Each line of the JSON Lines file holds the prediction for one tweet.
# As in the original logic, the first entry of 'Classes' is taken as the predicted class.
with open(predicted_file, 'r') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        j = json.loads(line)
        # Map each class name ('0' or '1') to its score
        scores = {c['Name']: c['Score'] for c in j['Classes']}
        predicted_class.append(int(j['Classes'][0]['Name']))
        # Store the positive class (1 = Followup) probability
        predicted_prob.append(scores['1'])

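The parsing cell assumes each line of `predictions.jsonl` is a JSON object whose `Classes` list holds one `Name`/`Score` pair per label, with the highest score first. The sample line below is invented to illustrate that assumed shape; compare it with a line from your own output file before relying on it.

```python
import json

# Invented sample line in the assumed shape of the Comprehend output
sample_line = ('{"File": "twitter_test_without_label.csv", "Line": "0", '
               '"Classes": [{"Name": "1", "Score": 0.83}, '
               '{"Name": "0", "Score": 0.17}]}')

record = json.loads(sample_line)

# Map each class name to its score, then pick the highest-scoring class
scores = {c['Name']: c['Score'] for c in record['Classes']}
predicted = max(scores, key=scores.get)

print(predicted, scores['1'])
```

Looking scores up by name rather than by list position keeps the positive-class probability correct even if the order of `Classes` differs from what is assumed here.
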
In [None]:
print(predicted_prob[:5],predicted_prob[-5:])
print(predicted_class[:5],predicted_class[-5:])

In [None]:
df['predicted_class'] = predicted_class
df['predicted_prob'] = predicted_prob

In [None]:
df.head()
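
With the true labels and the predictions in one frame, the independent quality check mentioned in the split step can be done with a confusion table. The following is a minimal sketch on invented toy values; the column names match the notebook, but the numbers are illustrative only and say nothing about the real model.

```python
import pandas as pd

# Toy data (invented): true labels and predicted classes
toy = pd.DataFrame({'trainingLabel':   [1, 1, 1, 0, 0, 0, 0, 1],
                    'predicted_class': [1, 0, 1, 0, 0, 1, 0, 1]})

actual_pos = toy['trainingLabel'] == 1
pred_pos = toy['predicted_class'] == 1

tp = int((actual_pos & pred_pos).sum())     # follow-up, predicted follow-up
fn = int((actual_pos & ~pred_pos).sum())    # follow-up, predicted normal
fp = int((~actual_pos & pred_pos).sum())    # normal, predicted follow-up
tn = int((~actual_pos & ~pred_pos).sum())   # normal, predicted normal

accuracy = (tp + tn) / len(toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(pd.crosstab(toy['trainingLabel'], toy['predicted_class']))
print('accuracy={:.2f} precision={:.2f} recall={:.2f}'.format(accuracy, precision, recall))
```

Running the same computation on `df` gives the test-set metrics for the follow-up class. Recall is usually the number to watch here, since a missed follow-up tweet is the costly error.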