## Testing code: Clean unstructured Burning Glass data

Phai Phongthiengtham: IBM CAO 2018 summer intern
 
***

This notebook demonstrates how to pre-process text, extract task measures, and prepare input for a word2vec model.

- Variable "BGTJobId" is an identifier, used as the key for merging with the processed unstructured data

Note:
- Keep track of package updates and check whether any syntax has to be changed.
- The scikit-learn module, in particular, changes its syntax quite often!
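
One way to keep track of package updates is to record the installed versions at the top of each run. The snippet below is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# packages installed in the first cell of this notebook
versions = installed_versions(['scikit-learn', 'gensim', 'spacy', 'nltk', 'pyLDAvis'])
for name, version in versions.items():
    print(name, version or 'not installed')
```

Saving this printout next to the results makes it easier to tell later which package release a given syntax change came from.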

IMPORTANT:
- The sample input file in this notebook is from MIT; the format of the version recently purchased by IBM could be different!

In [1]:
!pip install -U pyldavis
!pip install -U spacy
!pip install -U scikit-learn
!pip install -U https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
!pip install -U nltk
!pip install -U gensim
import nltk
nltk.download('all')

Collecting pyldavis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
    100% |████████████████████████████████| 1.6MB 609kB/s eta 0:00:01
Requirement not upgraded as not directly required: wheel>=0.23.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from pyldavis)
Requirement not upgraded as not directly required: numpy>=1.9.2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from pyldavis)
Requirement not upgraded as not directly required: scipy>=0.18.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from pyldavis)
Requirement not upgraded as not directly required: pandas>=0.17.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from pyldavis)
Collecting joblib>=0.8.4 (from pyldavis)
  Downloading https://files.pythonhosted.org/packages/f6/26/317725ffd9e8e8c0eb4b2fc77614f52045ddfc1c5026387fbefef9050eec/joblib-0.12.2-py2.py3-no

True

In [2]:
import requests, re, os, json, sys, csv, time, datetime, types
from botocore.client import Config
import ibm_boto3
import operator, curl
from io import StringIO
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)
from scipy import spatial
from pprint import pprint
from project_lib import Project

# sklearn
import sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; joblib is a standalone package

# gensim
import gensim
from gensim import corpora, models
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, KeyedVectors

# nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# spacy
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load(disable=['parser', 'tagger','ner'] )

# plotting tools
import matplotlib.pyplot as plt
%matplotlib inline

import pyLDAvis
import pyLDAvis.gensim
import pyLDAvis.sklearn

def __iter__(self): return 0

### Credentials for loading and saving files

In [3]:
client_c134506f72ab49d1a232818751853c8a = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=os.environ['IBM_API_KEY_ID'],  # hypothetical variable name: load the key from the environment instead of hard-coding it
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

In [4]:
# hypothetical variable name: keep the access token out of the notebook
project = Project(project_id='fbb58841-9ee4-4f0d-9784-ea4c5475261d', project_access_token=os.environ['PROJECT_ACCESS_TOKEN'])
pc = project.project_context

### Define a function that imports the raw file 

In [5]:
def import_BurningGlass(filename):
    """Read a tab-separated Burning Glass file from object storage into a DataFrame of strings."""
    bucket = 'mlforeconomy-donotdelete-pr-o5jigrnf6kwxg5'

    # try UTF-8 first, then fall back to Latin-1 if the file cannot be decoded
    for encoding in ('utf-8', 'latin-1'):
        # request a fresh stream for every attempt: a failed read may have partially consumed the previous one
        body_file = client_c134506f72ab49d1a232818751853c8a.get_object(Bucket=bucket, Key=filename)['Body']
        if not hasattr(body_file, "__iter__"):
            body_file.__iter__ = types.MethodType(__iter__, body_file)
        try:
            df = pd.read_csv(body_file, sep='\t', header=0, dtype=object, encoding=encoding)
            break
        except UnicodeDecodeError:
            if encoding == 'latin-1':
                raise  # no fallback left

    print(filename + ' : ' + str(len(df)) + ' records')

    return df
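
The encoding fallback in `import_BurningGlass` can be tried without access to object storage. The sketch below is a minimal, standard-library illustration of the same idea: attempt UTF-8 first and fall back to Latin-1 on a `UnicodeDecodeError`. The helper `decode_with_fallback` and the sample bytes are hypothetical and not part of the original notebook.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError('none of the encodings could decode the input')

# the same made-up record, stored once as UTF-8 and once as Latin-1
record = 'BGTJobId\tJobText\n1\tcafé manager\n'
utf8_bytes = record.encode('utf-8')
latin1_bytes = record.encode('latin-1')

text_a, enc_a = decode_with_fallback(utf8_bytes)
text_b, enc_b = decode_with_fallback(latin1_bytes)
print(enc_a, enc_b)
```

Latin-1 maps every byte to some character, so the fallback never raises. If a file is actually in a third encoding, it will load without an error but with wrong characters.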

### Define task clusters

In [6]:
nr_analytic = ['researching','analyzing', 'evaluating', 'planning', 'designing', 'sketching', 'research', 'analyze', 'evaluate', 
               'plan', 'design', 'sketch','devising rule', 'interpreting rule', 'budgeting', 'lettering', 'stylized', 'implementing', 
               'evaluation', 'developing', 'plans', 'analyse', 'architecture', 'installing', 'determining', 'freehand', 'sketcher', 
               'compiling', 'deign', 'rsrch', 'devising', 'administering', 'researcher', 'modifying', 'mechanicals', 'desi', 'assessing', 
               'correlate', 'synthesizing', 'implement', 'validate', 'rendering', 'defining', 'sketchbook', 'conceptualizing', 'creating', 
               'validating', 'synthesize', 'analysing', 'renderer', 'resolving', 'monitoring', 'reviewing', 'designs', 'interpreting', 'rule', 
               'fettering', 'summarize', 'researching', 'recommending', 'monitor', 'collecting', 'configuring', 'analyzing', 'identify', 
               'identifying', 'compile', 'develop', 'assess', 'formulating', 'layouts', 'constructing', 'investigating', 'architecting', 
               'prepare', 'formulate', 'evaluating', 'interpret', 'renderings', 'documenting', 'define', 'mock', 'gathering', 'recommend', 
               'evaluates', 'pylon', 'summarizing', 'illustration', 'artistic', 'determine', 'coordinate', 'dylan', 'gather', 'airbrush', 
               'analyzes', 'designer', 'analysis', 'illustrating', 'sketches', 'deploying', 'examine', 'designing', 'resign']

In [7]:
nr_interactive = ['negotiating', 'lobbying', 'coordinating', 'organizing', 'teaching', 'selling', 'buying','advising', 'advertising', 
                  'entertaining', 'presenting' , 'managing','negotiate', 'lobby','coordinate', 'organize', 'sell', 'purchase', 'advise', 
                  'advertise', 'entertain', 'presentation', 'presentations', 'tising', 'legislators', 'assisting', 'manage', 'lobbyist', 
                  'educating', 'educate', 'azov', 'articulating', 'directing', 'proposals', 'seil', 'developing', 'guying', 'vertislng', 
                  'advertlslag', 'iobby', 'structuring', 'sales', 'facilitating', 'execute', 'supporting', 'advtsg', 'administering', 
                  'negotiations', 'advt', 'lobbv', 'executing', 'explaining', 'odv', 'informative', 'advtg', 'lobbyists', 'advertlsln', 
                  'adv', 'briefings', 'coordinates', 'lobb', 'congressional', 'advert', 'delightful', 'seli', 'presenting', 'intergovernmental', 
                  'reviewing', 'advises', 'communicating', 'soiling', 'adverting', 'arranging', 'adverts', 'adver', 'vtg', 
                  'oversee', 'adverhslng', 'coordination', 'publicity', 'grassroots', 'legislative', 'academic', 'leaching', 
                  'advfg', 'adve', 'media', 'marketing', 'advertiser', 'irreverent', 'prioritize', 'teething', 'prioritizing', 
                  'adverllslng', 'overseeing', 'negotiates', 'initiate', 'supervise', 'ertislng', 'nightlife', 'negotiation', 
                  'kdv', 'leeching', 'selim', 'baying', 'evaluate', 'consult', 'preparing', 'supervising', 'negotiating', 
                  'facilitate', 'administer', 'presents', 'inform']

In [8]:
r_cognitive = ['calculating', 'bookkeeping', 'correcting', 'measuring','calculate', 'corrections', 'measurement','gauging', 'modifications', 
               'measurements', 'bkkpo', 'bkkp', 'isolating', 'bookkpg', 'micrometer', 'correcting', 'payroll', 'measurement', 'measuring', 
               'ajp', 'calculation', 'calipers', 'corrects', 'calculations', 'bkkpg', 'correction', 'bluing', 'revisions', 'dkkpg', 
               'bookkeeper', 'adjustments', 'reconcile', 'calculate', 'calculates', 'stenography', 'bkkpng', 'bkkping', 'compute', 'resolving', 
               'clerical', 'billing', 'fixing', 'calculating', 'measure', 'rectifying', 'beekeeping', 'gauges']

In [9]:
r_manual = ['operating', 'controlling', 'equipping','operate', 'control', 'equip', 'equipment', 'quip', 'eqp', 'troll', 'instruments', 'minimizing', 
            'eouip', 'eaulp', 'equlp', 'ntrol', 'eguip', 'quid', 'equlo', 'eoulp', 'equtp', 'equip', 'gulp', 'machines', 'apparatus', 'uip', 'equipage', 
            'eqpmt', 'controls', 'sterilizers', 'controlling', 'eauio', 'eqpt', 'engulfment', 'eauip', 'equipment', 'conhol', 'devices', 'ulp', 'control', 
            'eq', 'equlpt', 'instrumentation', 'equ', 'equio', 'operation', 'equlpmt', 'operating', 'machinery', 'epuip']

In [10]:
nr_manual = ['repairing', 'renovating', 'restoring', 'accommodating','repair', 'renovate', 'restore', 'service', 'accommodation', 'accommodate',
             'accommodations', 'inspecting', 'calibrating', 'accomodate', 'overhaul', 'rebuilding', 'serves', 'installing', 'restoring', 'serve', 
             'installation', 'diagnosing', 'reassembling', 'mechanic', 'repairs', 'overhauling', 'accomodation']
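
The word lists above are later turned into whole-word regular expressions and counted in each cleaned job ad. The following is a minimal sketch of that counting step on toy data; the shortened lists, the sample ad, and the helper `count_task_words` are hypothetical and only illustrate the technique.

```python
import re

# shortened, made-up clusters; the real lists are defined in the cells above
toy_clusters = {
    'analytic': ['analyze', 'design', 'plan'],
    'manual': ['repair', 'install'],
}

def count_task_words(text, words):
    """Count whole-word occurrences of any word in words within text."""
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')
    return len(pattern.findall(text))

ad = 'design and plan the network then repair planned outages'
counts = {name: count_task_words(ad, words) for name, words in toy_clusters.items()}
print(counts)  # {'analytic': 2, 'manual': 1}
```

The `\b` word boundaries are what keep `plan` from matching inside `planned`, which is why inflected forms such as `planning` and `plans` are listed separately in the clusters.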

### Clean unstructured text data
- Only tokenize, NOT lemmatize  
- https://spacy.io/usage/spacy-101

In [11]:
# string replace
def cleanup(text):
    if text == '': # allows for possibility of being empty 
        output = ''
    else:
        text = text.replace("'s", " ")
        text = text.replace("n't", " not ")
        text = text.replace("'ve", " have ")
        text = text.replace("'re", " are ")
        text = text.replace("'m","  am ")
        text = text.replace("'ll","  will ")
        text = text.replace("-"," ")
        text = text.replace("/"," ")
        text = text.replace("("," ")
        text = text.replace(")"," ")
        text = re.sub(r'[^A-Za-z ]', '', text) # remove all characters that are not A-Z, a-z, or a space (digits are removed too)
        output = ' '.join([w for w in re.split(' ',text) if not w=='']) #remove extra spaces 
    return output  

# pre-process text
def main_preprocess(text):
    text = str(text) # make sure the input is actually string
    text = ''.join([i if ord(i) < 128 else ' ' for i in text])
    if text == '': # allows for possibility of being empty 
        output = ''
    else:
        tokens = [w.text for w in nlp(cleanup(text))] # cleanup and tokenize
        output = ' '.join([w.lower() for w in tokens if not w==''])
    return output
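
To see what the string replacements do to a raw posting, the sketch below re-implements the replacement steps of `cleanup` in a standalone form, without spaCy, and applies them to a made-up sentence. The function `cleanup_demo` and the sample text are hypothetical; the spaCy tokenization step of `main_preprocess` is not reproduced here.

```python
import re

def cleanup_demo(text):
    """Standalone copy of the replacement steps in cleanup(), for illustration only."""
    replacements = [("'s", " "), ("n't", " not "), ("'ve", " have "), ("'re", " are "),
                    ("'m", " am "), ("'ll", " will "), ("-", " "), ("/", " "),
                    ("(", " "), (")", " ")]
    for old, new in replacements:
        text = text.replace(old, new)
    text = re.sub(r'[^A-Za-z ]', '', text)  # keep only letters and spaces
    return ' '.join(text.split())           # collapse repeated spaces

sample = "Don't miss out: we're hiring a Sr. Engineer (full-time), 24/7 support!"
result = cleanup_demo(sample).lower()
print(result)  # do not miss out we are hiring a sr engineer full time support
```

Note that digits are dropped along with punctuation, so a phrase such as "24/7 support" is reduced to "support".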

In [12]:
BG_file_numbers = ['01-04-2016', '01-05-2016', '01-06-2016']

for num in BG_file_numbers:
    
    filename = 'JobText_US_' + num + '.txt' # example: 'JobText_US_01-04-2016.txt'
    
    print('working on ' + filename)
    df = import_BurningGlass(filename)
    df = df[['BGTJobId','JobId','JobText']]
    
    if len(df) > 10000:
        df = df.sample(10000) # sample 10000 ads
    
    # pre-process
    df['CleanText'] = df['JobText'].apply(lambda x: main_preprocess(x))
    
    # extract task: count whole-word matches for each task cluster
    # (the pattern is built once per cluster instead of once per row)
    task_clusters = [('t_nr_analytic', nr_analytic), ('t_nr_interactive', nr_interactive), ('t_r_cognitive', r_cognitive),
                     ('t_r_manual', r_manual), ('t_nr_manual', nr_manual)]
    for column, words in task_clusters:
        pattern = re.compile('|'.join(['\\b' + w + '\\b' for w in words]))
        df[column] = df['CleanText'].apply(lambda x: len(pattern.findall(x)))
    
    # export sample files (the filename is fixed, so each iteration overwrites the previous sample)
    df_sample = df[['BGTJobId', 'JobId', 'JobText', 'CleanText']].sample(100)
    output_filename = 'BG_sampled.csv'
    project.save_data(output_filename, df_sample.to_csv(index=False), overwrite=True)

    # export task
    df_task = df[['BGTJobId', 'JobId', 't_nr_analytic', 't_nr_interactive', 't_r_cognitive', 't_r_manual', 't_nr_manual']]
    task_output_filename = 'BG_task_' + num + '.csv'
    project.save_data(task_output_filename, df_task.to_csv(index=False), overwrite=True)
    
    # export cleaned text
    df_text_for_word2vec = df['CleanText']
    text_for_word2vec_filename = 'text_for_word2vec_' + num + '.txt'
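    # header=False keeps the column name out of the file (newer pandas versions write it by default)
    project.save_data(text_for_word2vec_filename, df_text_for_word2vec.to_csv(index=False, header=False), overwrite=True)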
    
print('---DONE---')

working on JobText_US_01-04-2016.txt
JobText_US_01-04-2016.txt : 82779 records
working on JobText_US_01-05-2016.txt
JobText_US_01-05-2016.txt : 104775 records
working on JobText_US_01-06-2016.txt
JobText_US_01-06-2016.txt : 153132 records
---DONE---


- Output 1 "df_task" : to be merged with skill dataset (structured data). Variable "BGTJobId" is an identifier.
- (Phai) : Not completely sure if "JobId" is important or not -- so, let's keep it for now.
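
The merge with the structured skill data is not part of this notebook. A minimal sketch of what it could look like with pandas is shown below; the frames `df_task_demo` and `df_skill_demo`, the column `n_skills`, and all values are made up for illustration. Because the raw file is read with `dtype = object`, "BGTJobId" is a string here, and the skill dataset would need the same type for the keys to match.

```python
import pandas as pd

# made-up task counts keyed by BGTJobId (kept as strings, as in the import function)
df_task_demo = pd.DataFrame({'BGTJobId': ['101', '102', '103'],
                             't_nr_analytic': [5, 0, 2]})

# hypothetical structured skill data with the same key
df_skill_demo = pd.DataFrame({'BGTJobId': ['101', '103', '104'],
                              'n_skills': [7, 3, 9]})

# a left join keeps every ad from the task file; indicator=True adds a _merge column
merged = df_task_demo.merge(df_skill_demo, on='BGTJobId', how='left', indicator=True)
print(merged)
```

The `_merge` column shows which ads found a match (`both`) and which did not (`left_only`), which is a quick check that the identifier formats agree.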

In [13]:
df_task.head(10)

| | BGTJobId | JobId | t_nr_analytic | t_nr_interactive | t_r_cognitive | t_r_manual | t_nr_manual |
|---|---|---|---|---|---|---|---|
| 152166 | 37996970028 | 0xf5c763c5db981558e6528ba1f10c522c | 5 | 4 | 0 | 0 | 0 |
| 79487 | 37996570967 | ad9b833a5b28c84e4da6871d9a61ef24fc338bc9 | 0 | 0 | 0 | 0 | 0 |
| 144155 | 37996908958 | 3532559d9a8a2b8a8c414889753bec1e2946b2 | 2 | 5 | 0 | 0 | 1 |
| 41144 | 37996393724 | ad4943d78082dbf1de1ea54c56b665c957731e | 0 | 0 | 0 | 3 | 1 |
| 30631 | 37996360124 | 734d5c91876629db1d24641a57a16eda9f4f5e | 4 | 0 | 1 | 1 | 0 |
| 57754 | 37996446131 | 4cf323cb24a8773a2548c628e545ff437ca04e78 | 5 | 3 | 0 | 0 | 2 |
| 87116 | 37996596510 | f892676436488e28797ad196448ccb168032df78 | 4 | 1 | 1 | 5 | 2 |
| 79671 | 37996572944 | 709ce2bd373981178fa9f46562c66e43742c9b71 | 0 | 0 | 0 | 0 | 0 |
| 130883 | 37996856678 | 43e34aa3241285aeae81c5364a7855db63b6963d | 1 | 0 | 0 | 0 | 1 |
| 69998 | 37996536763 | 3598f217185b24ea5a57aaa7958ca6787a3b6a | 5 | 5 | 0 | 0 | 1 |


- Output 2 "df_text_for_word2vec" : To be used to construct a word2vec model

In [14]:
df_text_for_word2vec.head(10)

152166    title store seasonal employee cashier company ...
79487                                        host hostesses
144155    tru green sales representative s base pay comm...
41144     pharmacy tech driver elderwood other williamsv...
30631     java developer collateral management company t...
57754     personal trainer at hour fitness in sandy ut m...
87116     kelly services alignment technician at pierce ...
79671                               dental insurance humana
130883    dependency case manager trainee cma brevard em...
69998     headquarters mission street suite san francisc...
Name: CleanText, dtype: object
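
The word2vec training step itself is outside this notebook. As a sketch of how the exported text could be prepared for it, the snippet below turns a file with one cleaned ad per line into a list of token lists, which is the shape that gensim's `Word2Vec` accepts for its `sentences` argument. The in-memory string `exported` stands in for a real `text_for_word2vec_*.txt` file and is made up from the sample rows above.

```python
import io

# stand-in for the contents of an exported text file: one cleaned ad per line
exported = (
    "host hostesses\n"
    "dental insurance humana\n"
    "\n"
    "java developer collateral management\n"
)

# split each non-empty line on whitespace; the text is already lower-cased and cleaned
sentences = [line.split() for line in io.StringIO(exported) if line.strip()]
print(sentences)
```

Empty lines are skipped so that ads with no usable text do not become empty sentences in the training input.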