# Processing CORE Articles for Downstream Use in Modeling

* Data ingested from the CORE API is stored in S3 as JSON files, each holding up to 100 search results.
* Per the [BlazingText Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html), each line of the input file should contain a single sentence of space-separated tokens, so the raw data must be processed to accommodate that training format.
* Before processing the raw data, a summary sheet will be created to catalog the data.
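
The notebook does not show what the summary sheet contains, so the following is only a hypothetical sketch of one catalog row per file. The function name `catalog_row` and its column names are assumptions; the `data`, `_source`, `description` and `fullText` keys are the ones the parsing code further down reads.

```python
def catalog_row(file_name, payload):
    """Summarise one parsed CORE result file as a flat dict (hypothetical columns)."""
    # Tolerate a failed import (None) or a missing/empty 'data' entry
    items = (payload or {}).get('data') or []
    sources = [item.get('_source') or {} for item in items]
    return {
        'file': file_name,
        'n_results': len(items),
        'n_with_description': sum(1 for s in sources if s.get('description')),
        'n_with_full_text': sum(1 for s in sources if s.get('fullText')),
    }
```

A list of such rows could be passed to `pd.DataFrame` to build the sheet.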

## Initial Prep

Imports

In [1]:
import boto3
import pandas as pd
import json
import ast
from io import StringIO
from nltk import tokenize
import nltk
nltk.download('punkt')
import re, string
nltk.download('stopwords')
from nltk.corpus import stopwords
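STOPWORDS = set(stopwords.words('english'))
# WordNetLemmatizer also needs the 'wordnet' corpus; run nltk.download('wordnet') if it is not installed
from nltk.stem import WordNetLemmatizer as netlem
lem = netlem()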

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Declarations

In [2]:
core_bucket_name = 'core0823'
stg_bucket = 'core0823-stg'
stg_catalog_bucket = stg_bucket + '/Catalog'
stg_bt_bucket = stg_bucket + '/BT_STG'

Prep

In [3]:
s3_client = boto3.client('s3')

In [4]:
json_list = [i['Key'] for i in s3_client.list_objects(Bucket=core_bucket_name)['Contents']]
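
A single `list_objects` call returns at most 1,000 keys, so a bucket with more files would be listed only partially. A minimal sketch of a paginated listing follows; the helper name `list_all_keys` is an assumption, and it expects a boto3 S3 client such as `s3_client` above.

```python
def list_all_keys(client, bucket):
    """Collect every object key in a bucket, following pagination."""
    keys = []
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent when a page holds no objects
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys
```

With the declarations above this would be called as `json_list = list_all_keys(s3_client, core_bucket_name)`.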

Generic Functions

In [6]:
def s3_file_location(f_bucket, f_file):
    """
    Simply returns a formatted string with the S3 file location
    """
    data_location = 's3://{}/{}'.format(f_bucket,f_file)
    return data_location

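def json_file_to_dict(f_bucket, f_json_file):
    """
    Intakes bucket and json file.
    Returns dictionary, or None if the file cannot be read or parsed.
    """
    try:
        json_s3_obj = s3_client.get_object(Bucket=f_bucket, Key=f_json_file)
        tmp_str_json = json_s3_obj['Body'].read().decode('utf-8')
        # literal_eval parses Python-literal text; it rejects strict JSON tokens such as null/true/false
        fnl_json = ast.literal_eval(tmp_str_json)
        return fnl_json
    except Exception as err:
        # Report the failing key and return None so callers can skip the file
        print('Failed to import JSON file {}: {}'.format(f_json_file, err))
        return None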
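
Whether the stored files are strict JSON or Python-literal text is not stated here. As a hedged sketch, a parser could try `json.loads` first and fall back to `ast.literal_eval`; the helper name `parse_payload` is an assumption.

```python
import ast
import json

def parse_payload(text):
    """Parse text as JSON, falling back to a Python literal; return None on failure."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass  # not strict JSON, try the Python-literal form next
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None
```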

## Text Data Processing

In [134]:
def json_text_parse(f_bucket, f_file_name):
    """
    Intakes a bucket and file name.
    Parses CORE API JSON.
    Returns a flat list of sentences taken from each result's description and fullText.
    """
    results_list = []
    tmp_file = json_file_to_dict(f_bucket, f_file_name)
    if tmp_file is not None and tmp_file['data'] is not None:
        for item in tmp_file['data']:
            if item['_source']['description'] is not None:
                tmp_parse_list = tokenize.sent_tokenize(item['_source']['description'])
                results_list.extend(tmp_parse_list)

            if item['_source']['fullText'] is not None:
                tmp_parse_list = tokenize.sent_tokenize(item['_source']['fullText'])
                results_list.extend(tmp_parse_list)
            
    return results_list

def json_extract_text(f_bucket, f_file_list):
    """
    Intakes a bucket and list of JSON files from CORE API.
    Parses CORE API JSON.
    This function iterates over a list of files, where json_text_parse is for a single file.
    Returns a flat list of sentences, dropping entries of one character or fewer.
    """
    results_list = []
    for file in f_file_list:
        tmp_results = json_text_parse(f_bucket, file)
        results_list.extend(tmp_results)
    
    results_list = [sent for sent in results_list if len(sent) > 1]
    
    return results_list
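
The parsing logic can also be exercised on an in-memory payload without touching S3. The sketch below is not the notebook's method: `extract_sentences` and `naive_sent_split` are hypothetical names, the regex splitter is only a stand-in for `tokenize.sent_tokenize`, and `.get` is used on the assumption that some records may lack `_source`, `description` or `fullText`.

```python
import re

def naive_sent_split(text):
    # Stand-in splitter for testing; NOT equivalent to nltk's punkt tokenizer
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def extract_sentences(payload, sent_tokenize=naive_sent_split,
                      fields=('description', 'fullText')):
    """Return a flat list of sentences from one parsed CORE payload."""
    sentences = []
    for item in (payload or {}).get('data') or []:
        source = item.get('_source') or {}
        for field in fields:
            text = source.get(field)
            if text:  # skips None and empty strings
                sentences.extend(sent_tokenize(text))
    return sentences
```

Passing `sent_tokenize=tokenize.sent_tokenize` would use the same tokenizer as `json_text_parse`.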

In [123]:
test_func2 = json_extract_text(core_bucket_name, json_list)

In [124]:
len(test_func2)

5281986

In [129]:
test_func2[300000:300005]

['The distorted wave including up to 25(th) harmonics were prepared for testing of the neural networks.',
 "Elman's recurrent and feed forward neural networks were used to recognize each harmonic.",
 "The results obtained using Elman's recurrent neural networks are better than the results values obtained using the feed forward neural networks for resilient back propagation",
 'The junction of AI and computer security is an area of increasing concern, due to the imminent application of AI to fielded systems.',
 'Two new areas of research need are identified: artificial intelligence techniques in the development of secure systems.']

In [130]:
test_func3 = [sent for sent in test_func2 if len(sent) > 1]

In [131]:
len(test_func3)

4140623

In [147]:
def prep_sent_list(f_sent_list):
    """
    Intakes a list of sentences. 
    Uses a series of list comprehensions to prepare sentences for analysis.
    Returns a list of sentences. 
    """
    sent_list = [re.sub(r'[%s]' % re.escape(string.punctuation),'',sent.lower()) for sent in f_sent_list] # make lowercase and remove punctuation
    sent_list = [re.sub(r'\w*\d\w*','',sent) for sent in sent_list] # remove words with numbers in them
    sent_list = [sent for sent in sent_list if len(sent) > 0] # remove any zero length sentences 
    
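    # lemmatize and remove stop words; split() also drops the empty tokens left by the removals above
    for i, sent in enumerate(sent_list):
        lemmas = [lem.lemmatize(word) for word in sent.split()]
        sent_list[i] = ' '.join([word for word in lemmas if word not in STOPWORDS])
    
    # drop sentences that became empty once stop words were removed
    return [sent for sent in sent_list if sent]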
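
To see what the two regular expressions in `prep_sent_list` do, the sketch below applies them to one of the sample sentences shown earlier. It covers only the lowercase, punctuation and digit steps (no lemmatizing or stop-word removal), and the helper name `strip_punct_and_digits` is an assumption.

```python
import re
import string

def strip_punct_and_digits(sentence):
    """Lowercase, drop punctuation, drop tokens containing digits, tidy spaces."""
    lowered = sentence.lower()
    no_punct = re.sub(r'[%s]' % re.escape(string.punctuation), '', lowered)
    no_digit_words = re.sub(r'\w*\d\w*', '', no_punct)
    # Removing a token leaves a doubled space; split/join collapses it
    return ' '.join(no_digit_words.split())

sample = "The distorted wave including up to 25(th) harmonics were prepared."
cleaned = strip_punct_and_digits(sample)
print(cleaned)  # the distorted wave including up to harmonics were prepared
```

Because punctuation is removed first, `25(th)` becomes `25th` and is then dropped as a whole token.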

In [None]:
test_func4 = prep_sent_list(test_func3)

In [None]:
test = [' '.join([lem.lemmatize(word) for word in sent.lower().split(' ') if lem.lemmatize(word) not in STOPWORDS]) for sent in test_func3[:1000]]

In [146]:
test_func4[:5]

['discovering how the brain works is perhaps the most extraordinary scientific challenge of our time',
 'advances in understanding the brain will inform medical research into new treatments for neurological disorders as well as lead to powerful new techniques in artificial intelligence and robot control',
 'to meet this challenge our foundation is raising funds to support a new centre for theoretical neuroscience at oxford which will be dedicated to teaching and research in computer modelling of the brain',
 'the centre is currently based within the oxford university department of experimental psychology',
 'over the last year we have made important contributions to understanding various areas of brain function including for example how do our visual systems learn to make sens']
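
The prepared sentences still have to be written out in the one-sentence-per-line format described in the introduction. A minimal sketch, assuming the whole corpus fits in memory as one string; the helper name `sentences_to_training_text` is hypothetical.

```python
from io import StringIO

def sentences_to_training_text(sentences):
    """Join sentences into one space-separated-token sentence per line."""
    buffer = StringIO()
    for sent in sentences:
        tokens = sent.split()
        if tokens:  # skip sentences with no tokens left
            buffer.write(' '.join(tokens) + '\n')
    return buffer.getvalue()
```

The result could then be uploaded with `s3_client.put_object(Bucket=stg_bucket, Key=..., Body=text.encode('utf-8'))`. The key is left open here; note that boto3 expects a folder such as `BT_STG` as part of `Key`, not as part of the `Bucket` argument.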