# CORE Processing Articles for Downstream Use in Modeling

* Ingestion of data in S3 from the CORE API stored data as JSONs with up to 100 search results stored in each file. 
Per [BlazingText Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), the algorithm requires each line of the input file should contain a single sentence of space separated tokens. Raw data will need to be processed to accomodate the training format. 
* Before processing the raw data, a summary sheet will be created to catalog the data

## Initial Prep

Imports

In [2]:
import time
from datetime import datetime
import boto3
import pandas as pd
import pickle
from nltk import tokenize
import nltk
nltk.download('punkt')
import re, string
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer as netlem
lem = netlem()

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Declarations

In [3]:
core_bucket_name = 'core0823'
stg_bucket = 'core0823-stg'

Prep

In [4]:
s3_client = boto3.client('s3')

In [5]:
json_list = [i['Key'] for i in s3_client.list_objects(Bucket=core_bucket_name)['Contents']]

Generic Functions

In [6]:
def s3_file_location(f_bucket, f_file):
    """
    Simply returns a formatted string with the S3 file location
    """
    data_location = 's3://{}/{}'.format(f_bucket,f_file)
    return data_location

def serialobj_file_to_list(f_bucket, f_obj_key):
    """
    Intakes bucket and the key for a serialized object. 
    In this case it is a serialized list object from CORE #3.
    Returns list.
    """
    try:
        s3_obj = s3_client.get_object( Bucket= f_bucket, Key = f_obj_key )['Body'].read()
        return pickle.loads(s3_obj)
    except:
        pass 
        print('Fail getting and deserialization object.')

## Text Data Processing

Functions

In [7]:
def prep_sent_list(f_sent_list):
    """
    Intakes a list of sentences. 
    Uses a series of list comprehensions to prepare sentences for analysis.
    Returns a list of sentences. 
    """
    t0 = datetime.fromtimestamp( time.time() )
    print('Preparing sentences started at: {}'.format(t0))
    t_sent_list = [re.sub(r'[%s]' % re.escape(string.punctuation),'',sent.lower()) for sent in f_sent_list] # make lowercase and remove punctuation
    t1 = datetime.fromtimestamp( time.time() )
    print('Lowercase and punctuation removal completed at {}, taking {} seconds.'.format(t1, (t1-t0).total_seconds() ) )
    t_sent_list = [re.sub(r'\w*\d\w*', '',sent) for sent in t_sent_list if len( re.sub(r'\w*\d\w*', '',sent) ) > 0 ] # remove words with numbers and only where non-zero length
    t2 = datetime.fromtimestamp( time.time() )
    print('Words with numbers and zero-length removal completed at {}, taking {} seconds.'.format(t2, (t2-t1).total_seconds() ))
    
    # lemmatize and remove stop words
    t_sent_list = [' '.join([lem.lemmatize(word) for word in words if word not in STOPWORDS]) for words in [sent.split(' ') for sent in t_sent_list]]
    t3 = datetime.fromtimestamp( time.time() )
    print('Word lemmatization and stopword removal completed at {}, taking {} seconds.'.format(t3, (t3-t2).total_seconds() ))
    
    t4 = datetime.fromtimestamp( time.time() )
    print('Preparing sentences completed at {}, taking a total time of {} seconds.'.format(t4, (t4-t1).total_seconds() ))
    
    return t_sent_list

Import

In [8]:
sent_list = serialobj_file_to_list(stg_bucket,'BT_STG/sentences.txt')
print('Total sentences imported from S3: {}'.format(len(sent_list)))

Total sentences imported from S3: 4140623


Prepare text

In [None]:
prepd_sentence = []
sent_list_chunk = [[i-100000,i-1] for i in range(100000,len(sent_list),100000)]
for i in sent_list_chunk:
    print('Preparing sentences {} to {}'.format(i[0],i[1]))
    prepd_sentence.extend( prep_sent_list(sent_list[i[0]:i[1]]) )

Preparing sentences 0 to 99999
Preparing sentences started at: 2020-09-08 10:27:45.642059
Lowercase and punctuation removal completed at 2020-09-08 10:27:47.245597, taking 1.603538 seconds.
Words with numbers and zero-length removal completed at 2020-09-08 10:27:53.133917, taking 5.88832 seconds.
Word lemmatization and stopword removal completed at 2020-09-08 10:27:59.414502, taking 6.280585 seconds.
Preparing sentences completed at 2020-09-08 10:27:59.414629, taking a total time of 12.169032 seconds.
Preparing sentences 100000 to 199999
Preparing sentences started at: 2020-09-08 10:27:59.419429
Lowercase and punctuation removal completed at 2020-09-08 10:28:00.957339, taking 1.53791 seconds.
Words with numbers and zero-length removal completed at 2020-09-08 10:28:06.583864, taking 5.626525 seconds.
Word lemmatization and stopword removal completed at 2020-09-08 10:28:12.932063, taking 6.348199 seconds.
Preparing sentences completed at 2020-09-08 10:28:12.932191, taking a total time of

In [None]:
prepd_sentences = prep_sent_list(sent_list)

Store in S3

In [None]:
prepd_sentences_serialized = pickle.dumps(prepd_sentences)

In [None]:
response = s3_client.put_object(Body=sentences_serialized, Bucket='core0823-stg', Key='BT_STG/prepd_sentences.txt')