# The Initial Text Cleaning
Phai Phongthiengtham

This IPython notebook demonstrates the initial cleaning steps applied to the text of online job vacancy postings provided by Economic Modeling Specialists International (EMSI).

## Import necessary modules

In [1]:
import os
import re
import json
import gzip
import pandas as pd

import nltk
from nltk import word_tokenize

import enchant
from enchant import DictWithPWL
# English dictionary extended with our personal word list ('myPWL.txt')
d = enchant.DictWithPWL("en_US", 'myPWL.txt')

pd.set_option('display.max_columns', 50)

## Import raw input files

The raw input files are in JSON (JavaScript Object Notation) format. This IPython notebook demonstrates how we extract and pre-process vacancy postings from a subset of the raw data from January 2016:

In [2]:
sample_raw_filename = 'part-00000.gz' # In January 2016, we have 'part-00000.gz' up to 'part-00031.gz' 
sample_raw_file = gzip.open(sample_raw_filename, 'rb')
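
The comment above notes that the January 2016 data is split into files `part-00000.gz` through `part-00031.gz`. A minimal sketch of how the full list of file names could be generated and each file read line by line, assuming every file sits in the working directory and holds one JSON record per line (the helper names `part_filenames` and `iter_raw_lines` are hypothetical and not part of the original notebook):

```python
import gzip

def part_filenames(n_parts=32):
    # Build 'part-00000.gz', ..., 'part-00031.gz' (index zero-padded to five digits).
    return ['part-{:05d}.gz'.format(i) for i in range(n_parts)]

def iter_raw_lines(filenames):
    # Yield raw byte lines from each gzip file in turn; each file is closed
    # automatically when its 'with' block ends.
    for name in filenames:
        with gzip.open(name, 'rb') as f:
            for line in f:
                yield line

filenames = part_filenames()
print(filenames[0], filenames[-1], len(filenames))
```

Each yielded line could then be handled one observation at a time, in the same way as the single sample file opened above.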

## Transform JSON-encoded string

The pandas module is typically used to transform a JSON-encoded string into a pandas dataframe. There is, however, one problem with this method: the set of variables is not consistent across observations. Our data provider, Economic Modeling Specialists International (EMSI), has provided us with the list of all possible variables:

In [3]:
with open('list_variable.csv') as f: # one variable name per line
    list_variable = [w for w in re.split('\n', f.read()) if not w=='']
total_var_num = len(list_variable)
print(list_variable)

['yearmonth', 'city', 'state', 'zip', 'county', 'company_name', 'company_naics', 'company_isstaffingfirm', 'master_company_name', 'master_company_naics', 'master_company_isstaffingfirm', 'onet', 'cb_jobtitle_id', 'cb_jobtitle', 'edulevels_name', 'source', 'subsource', 'original_jobtitle', 'url', 'description']


Some observations do not have all these variables. We deal with this issue by transforming each observation and variable one by one, and skipping a variable whenever we encounter a *KeyError*:

In [4]:
def transform_json(obs, list_variable, total_var_num):
    # This function transforms a JSON-encoded string into a list of field values (one tab-separated observation once joined).
    # "obs" : input by observation
    # "list_variable" : list of all possible fields
    # "total_var_num" : number of all possible fields = len(list_variable)
    
    # (1.) preliminary cleaning
    obs_cleaned = str(obs.decode("utf-8"))
    obs_cleaned = re.sub('\n',' ',obs_cleaned) # remove extra line breaks within an observation.
    obs_cleaned = ''.join([i if ord(i) < 128 else ' ' for i in obs_cleaned]) # replace non-ascii characters with white space.
    
    # (2.) transform JSON-encoded string
    obs_json = json.loads(obs_cleaned) # load json object
    obs_transforms = ['']*(total_var_num) # initialize output  

    for num in range(0, total_var_num): # loop over all possible fields
        
        field = list_variable[num]  
        
        try: # record, if the value in that field exists
            value_this_field = str(obs_json[field])
            
            # remove extra tabs (mostly in the job description). 
            value_this_field_cleaned = re.sub('\t',' ',value_this_field)
            
            if field in ['onet']:
                if re.findall(r'\d{2}-\d{4}\.\d{2}',value_this_field_cleaned): 
                    pass # check if the value is in a correct format, e.g., 43-4051.00.
                else:
                    value_this_field_cleaned = "" # if not, replace with empty string 
                
            if field in ['zip','county','company_naics','master_company_naics']:
                if re.findall(r'\d+',value_this_field_cleaned): # check if the value contains a number.
                    value_this_field_cleaned = str(int(value_this_field_cleaned))                
                else:
                    # some observations have the literal string "NaN" in the raw file.
                    # if no number is found, replace the value with an empty string.
                    value_this_field_cleaned = ""
                    
            obs_transforms[num] = str(value_this_field_cleaned) # update output
        
        except KeyError: # skip, if KeyError (the field does not exist). 
            pass
        
    return obs_transforms # return final output
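
To illustrate the missing-field handling on a toy example, the sketch below uses `dict.get` with a default instead of `try`/`except KeyError`; both approaches leave an empty string for absent fields. The two records and the three-field list are made up for illustration, the helper `to_row` is hypothetical, and the field-specific checks for `onet` and the numeric fields are omitted.

```python
import json

fields = ['city', 'state', 'zip']  # toy subset of list_variable

raw_lines = [
    b'{"city": "Lexington, KY", "state": "KY", "zip": 40511}',
    b'{"city": "San Diego, CA"}',  # 'state' and 'zip' are missing
]

def to_row(raw_line, fields):
    # Decode one JSON record and return one string per field,
    # with '' for any field the record does not contain.
    record = json.loads(raw_line.decode('utf-8'))
    row = []
    for field in fields:
        value = record.get(field, '')
        # Tabs and line breaks would break the tab-separated layout.
        row.append(str(value).replace('\t', ' ').replace('\n', ' '))
    return row

for line in raw_lines:
    print('\t'.join(to_row(line, fields)))
```

Every row has the same length as `fields`, so the rows line up under a common header even when a record omits some fields.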

We transform the raw data into a tab-separated spreadsheet format. Note: We use a cloud-computing service in the actual implementation.

In [5]:
structured_data_filename = 'structured_data.txt' # define output filename
structured_data = open(structured_data_filename,'w')   
structured_data.write( '\t'.join(['ad_num'] + list_variable) + '\n' ) # write down header

ad_num = 0 # create an extra variable "ad_num"
# Together, "yearmonth" and "ad_num" uniquely identify each ad in the sample.

for line in sample_raw_file:
    
    obs_transforms = transform_json(line, list_variable, total_var_num) # transform JSON-encoded string
    assert(len(obs_transforms) == 20) # there are 20 variables in total (excluding the newly created "ad_num")
    structured_data.write( '\t'.join([str(ad_num)] + obs_transforms) + '\n' ) # write down obs
    
    ad_num += 1

structured_data.close()
sample_raw_file.close() # also close the raw input file

## Manage dataset in pandas dataframe

We import the transformed dataset into a pandas dataframe.

In [6]:
df = pd.read_csv(structured_data_filename,sep='\t', header = 0, dtype = object)
df.head(50)

Unnamed: 0,ad_num,yearmonth,city,state,zip,county,company_name,company_naics,company_isstaffingfirm,master_company_name,master_company_naics,master_company_isstaffingfirm,onet,cb_jobtitle_id,cb_jobtitle,edulevels_name,source,subsource,original_jobtitle,url,description
0,0,201601,"Sebring, FL",FL,,12055,A & Associates Inc,541611,False,"A & Associates, Inc",561311,True,43-1011.00,43.123,Claims Assistants,['High school or GED'],c20b8969,8f34a7d2,EMS Claims Specials,http://www.americasjobexchange.com/job-detail/...,JOB DESCRIPTIONPOSITION PURPOSE & OBJECTIVES:A...
1,1,201601,"Elk Grove Village, IL",IL,,17031,A-1 Roofing,238160,False,"A-1 Roofing, Inc",238160,False,43-4051.00,43.30757,Customer Service Coordinators,[],fe1c33a8,,Service Coordinator,http://www.careerbuilder.com/JobSeeker/Jobs/Jo...,"A-1 Roofing Company, located in Elk Grove Vill..."
2,2,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,13-1199.00,43.31,Customer Service Specialists,['High school or GED'],c20b8969,f302d56f,Lot Specialist,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the General Manager...
3,3,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,53-1031.00,53.15,Commercial Driver's License (CDL) Drivers,['High school or GED'],c20b8969,f302d56f,Auction Driver,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary:Reporting to the designated super...
4,4,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,37-2011.00,37.4,Environmental Services Managers,['High school or GED'],c20b8969,f302d56f,Custodian,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the facility design...
5,5,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,43-9061.00,43.7,Data Entry Clerks,['High school or GED'],c20b8969,f302d56f,Block Clerk,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the General Manager...
6,6,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,41-1012.00,43.31,Customer Service Specialists,['High school or GED'],c20b8969,f302d56f,Lot Specialist,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the General Manager...
7,7,201601,"Lexington, KY",KY,,21067,"Adesa of Lexington, Inc",423110,False,"Adesa of Lexington, Inc",423110,False,49-3023.01,49.0,Maintenance Mechanics,['High school or GED'],c20b8969,f302d56f,Auto Mechanic Technician - Experienced,http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the Mechanical Serv...
8,8,201601,"San Diego, CA",CA,,6073,"Adesa Corporation, LLC",423110,False,"Adesa Corporation, LLC",423110,False,53-6051.07,53.23,City Drivers,['High school or GED'],c20b8969,f302d56f,Vehicle Inspector I (C) - San Diego,http://kar.taleo.net/careersection/kar_pro/job...,* Job Summary: Reporting to the Inspection Ma...
9,9,201601,"San Diego, CA",CA,,6073,"Adesa Corporation, LLC",423110,False,"Adesa Corporation, LLC",423110,False,49-1011.00,49.0,Maintenance Mechanics,['High school or GED'],c20b8969,f302d56f,Mechanical Services Mgr I (Sm/Med-C),http://kar.taleo.net/careersection/kar_pro/job...,*Job Summary: Reporting to the General Manager...


### (1.) Integer variables

First, we check that all numeric variables are in the correct format: 

In [7]:
integer_var = ['yearmonth','zip','county','company_naics','master_company_naics','onet']

for var in integer_var:
    print('-------------- all possible values of ' + str(var) + ' --------------')
    all_values = sorted([str(w) for w in df[var].unique()])
    print('|'.join(all_values))
    print('')

-------------- all possible values of yearmonth --------------
201601

-------------- all possible values of zip --------------
10001|10002|10007|10010|10011|10016|10018|10020|10022|10023|10025|10034|10036|10065|10116|10150|10166|10171|1020|10301|10305|10307|1035|1040|1041|10461|10474|10507|10509|10523|10537|10543|10547|10552|10567|10573|10577|10579|10580|10583|10591|10601|10607|10701|10801|10803|1089|10917|10924|10927|10940|10960|10973|10974|10977|10990|11001|11003|1101|11021|11030|11096|11201|11213|11224|11236|11371|11375|11417|11507|11510|11516|11520|11542|11552|11554|11557|11558|11572|11580|11590|11701|11703|11704|11706|11710|11716|11717|11725|11727|11735|11740|11741|11746|11747|11751|11753|11754|11756|11757|11758|11762|11763|11764|11766|11767|11768|11772|11776|11778|11779|11782|11787|11788|11790|11797|11801|11804|11901|11934|11937|11944|11946|11949|11952|11959|11967|11968|11975|11977|1201|12010|12020|12033|12043|12047|12054|12065|12084|12090|12095|12110|12144|12159|12180|12188|122
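
Converting with `int` drops any leading zeros, which likely explains four-digit values such as `1020` in the `zip` list above. If fixed-width codes are needed later, one option is to pad them back. The sketch below assumes five-digit US ZIP codes and five-digit county FIPS codes, and uses a toy dataframe rather than the real data; the column names `zip5` and `county5` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'zip': ['1020', '10001', None],
                    'county': ['6073', '21067', '12055']}, dtype=object)

# str.zfill left-pads with zeros up to the given width; missing values stay missing.
toy['zip5'] = toy['zip'].str.zfill(5)
toy['county5'] = toy['county'].str.zfill(5)
print(toy)
```

Keeping the padded codes as strings (as `dtype = object` does when the file is read) avoids losing the zeros again.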

### (2.) Education requirement  
There are five possible values for the education requirement level. Each job posting may list more than one of the following, or leave the field blank. Our next task is to convert the education requirement level into five binary variables (each equals 1 if an ad mentions that particular education level).
 
1. high_school : "High school or GED"
2. associate : "Associate's degree"
3. bachelor : "Bachelor's degree"
4. master : "Master's degree"
5. phd : "Ph.D. or professional degree"

In the data, the education requirement level is listed as:

In [8]:
edulevels_all_values = df['edulevels_name'].unique()
print(' | '.join(edulevels_all_values))

['High school or GED'] | [] | ["Bachelor's degree"] | ["Bachelor's degree", "Master's degree"] | ["Associate's degree", "Bachelor's degree"] | ["Associate's degree"] | ['High school or GED', "Associate's degree"] | ['Ph.D. or professional degree'] | ["Master's degree", 'Ph.D. or professional degree'] | ["Master's degree"] | ["Associate's degree", "Bachelor's degree", "Master's degree"] | ['High school or GED', "Associate's degree", "Bachelor's degree"] | ["Bachelor's degree", "Master's degree", 'Ph.D. or professional degree'] | ["Associate's degree", "Bachelor's degree", "Master's degree", 'Ph.D. or professional degree'] | ['High school or GED', "Bachelor's degree"] | ['High school or GED', "Master's degree"] | ['High school or GED', "Bachelor's degree", "Master's degree"] | ["Associate's degree", "Master's degree", 'Ph.D. or professional degree'] | ["Associate's degree", "Master's degree"] | ["Bachelor's degree", 'Ph.D. or professional degree'] | ['High school or GED', "Associate's de

We convert education requirement into five binary variables.

In [9]:
all_edu = ['high_school','associate','bachelor','master','phd']

df['high_school'] = df['edulevels_name'].apply(lambda x: len(re.findall('High school',x)))
df['associate'] = df['edulevels_name'].apply(lambda x: len(re.findall('Associate',x)))
df['bachelor'] = df['edulevels_name'].apply(lambda x: len(re.findall('Bachelor',x)))
df['master'] = df['edulevels_name'].apply(lambda x: len(re.findall('Master',x)))
df['phd'] = df['edulevels_name'].apply(lambda x: len(re.findall(r'Ph\.D\.',x)))

df[['ad_num','yearmonth','edulevels_name'] + all_edu]

Unnamed: 0,ad_num,yearmonth,edulevels_name,high_school,associate,bachelor,master,phd
0,0,201601,['High school or GED'],1,0,0,0,0
1,1,201601,[],0,0,0,0,0
2,2,201601,['High school or GED'],1,0,0,0,0
3,3,201601,['High school or GED'],1,0,0,0,0
4,4,201601,['High school or GED'],1,0,0,0,0
5,5,201601,['High school or GED'],1,0,0,0,0
6,6,201601,['High school or GED'],1,0,0,0,0
7,7,201601,['High school or GED'],1,0,0,0,0
8,8,201601,['High school or GED'],1,0,0,0,0
9,9,201601,['High school or GED'],1,0,0,0,0
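
Because `re.findall` returns counts, the columns above are 0/1 only as long as each level appears at most once per ad. An alternative sketch, assuming every `edulevels_name` value is a valid Python list literal as in the values printed earlier, parses the string with `ast.literal_eval` and tests membership directly. The toy dataframe and the `label_of` mapping are for illustration only.

```python
import ast
import pandas as pd

# Column name -> label as it appears in 'edulevels_name'.
label_of = {
    'high_school': 'High school or GED',
    'associate': "Associate's degree",
    'bachelor': "Bachelor's degree",
    'master': "Master's degree",
    'phd': 'Ph.D. or professional degree',
}

toy = pd.DataFrame({'edulevels_name': [
    "['High school or GED']",
    "[]",
    "[\"Bachelor's degree\", \"Master's degree\"]",
]})

parsed = toy['edulevels_name'].apply(ast.literal_eval)  # strings -> Python lists
for column, label in label_of.items():
    # 1 if the ad lists this education level, 0 otherwise.
    toy[column] = parsed.apply(lambda levels, label=label: int(label in levels))

print(toy[list(label_of)])
```

Matching on the full label also avoids accidental hits on substrings, at the cost of failing loudly if a value is not a well-formed list.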


### (3.) Posting description

Our next task is to correct spelling errors that result from an imperfect data-scraping procedure. The next section demonstrates our spelling error correction:

In [10]:
personal_words = [w.lower() for w in re.split('\n', open('myPWL.txt').read()) if not w=='']
# load a list of personal words, i.e., words that are not contained in the default pyenchant library.  

def SpellingCorrection(input_string, personal_words):
    # this function performs spelling error corrections.

    # (1.) initial character replacements.
    text = input_string
    text = text.replace("-"," ")
    text = text.replace("_", " ")
    text = text.replace(":"," : ")
    text = text.replace("/"," / ")
    text = text.replace("("," ( ")
    text = text.replace(")"," ) ")
    text = text.replace("*"," ")
    text = text.replace("{"," ")
    text = text.replace("}"," ")
    text = text.replace("["," ")
    text = text.replace("]"," ")
    text = text.replace(","," , ")
    text = text.replace('"',' " ')
    text = text.replace('&',' & ')
    
    # (2.) tokenize the text, i.e., decompose text into words.
    input_tokens = word_tokenize(text)
    
    # (3.) correct spelling errors.
    output_tokens = list() # initialize output. 

    for word in input_tokens: 
        corrected_word = word # initialize corrected word to be the original word itself. 

        if len(word) <= 2 or word.lower() in personal_words or d.check(word.lower()):
            # ignore if the word is very short, is in our list of personal words, or is in the pyenchant dictionary. 
            pass
        elif d.suggest(word): # if misspelled, get suggestions from pyenchant module.
            list_suggestion = d.suggest(word)
            for suggestion in list_suggestion: 
                if re.sub(' ','',suggestion.lower()) == re.sub(' ','',word.lower()):
                    # record the corrected word if the correction only has to do with extra spaces, 
                    # e.g., "DESCRIPTIONPOSITION" => "DESCRIPTION POSITION".    
                    corrected_word = suggestion

        if corrected_word == word:
            if re.findall(r'\w{3,}\.\w{3,}',word):
                # record the corrected word if the correction only has to do with a missing space around a dot, 
                # e.g., "processing.ESSENTIAL" => "processing . ESSENTIAL",
                # and the word must be long enough (at least 3 characters + "." + at least 3 characters).
                split_words = re.split(r'\.',word)
                make_correction = True
                for w in split_words:
                    # check that every split word is non-empty (d.check rejects empty strings) and correctly spelled.
                    if w and d.check(w.lower()):
                        pass
                    else:
                        make_correction = False

                if make_correction == True: # only use the corrected word if all split words are correctly spelled. 
                    corrected_word = ' . '.join(split_words)

        output_tokens.append(corrected_word) # append the corrected token.

    output_tokens = [w for w in output_tokens if not w=='']
            
    return ' '.join(output_tokens) # return output as string.
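
The two acceptance rules in `SpellingCorrection` can be seen in isolation in the sketch below. It does not call pyenchant: the set `known_words` and the hard-coded suggestion list are toy stand-ins for `d.check` and `d.suggest`, and the helper names `differs_only_by_spaces` and `split_on_dots` are hypothetical.

```python
import re

# Toy stand-in for the dictionary lookup d.check(...).
known_words = {'description', 'position', 'processing', 'essential'}

def differs_only_by_spaces(word, suggestion):
    # True if the suggestion equals the word once spaces are removed,
    # i.e., the "correction" only inserts spaces.
    return suggestion.replace(' ', '').lower() == word.lower()

def split_on_dots(word):
    # Turn "aaa.bbb" into "aaa . bbb" only if the word has at least three
    # characters on both sides of a dot and every part is a known word.
    if not re.search(r'\w{3,}\.\w{3,}', word):
        return word
    parts = word.split('.')
    if all(p.lower() in known_words for p in parts):
        return ' . '.join(parts)
    return word

# Toy stand-in for d.suggest('DESCRIPTIONPOSITION').
suggestions = ['DESCRIPTION POSITION', 'DESCRIPTIONS']
accepted = [s for s in suggestions if differs_only_by_spaces('DESCRIPTIONPOSITION', s)]
print(accepted)
print(split_on_dots('processing.ESSENTIAL'))
print(split_on_dots('E.M.S'))
```

Only the suggestion that re-inserts a space is accepted, and short abbreviations such as `E.M.S` are left untouched because they fail the length requirement.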

In [11]:
job_description = list(df.description[:5])

for text in job_description:
    print('-------------- original text --------------')
    print(text)
    print('-------------- preprocessed text --------------')
    print(SpellingCorrection(text, personal_words))
    print('')

-------------- original text --------------
JOB DESCRIPTIONPOSITION PURPOSE & OBJECTIVES:A skilled position responsible for moderately complex and varied clerical and secretarial work including accurate typing/word processing.ESSENTIAL JOB FUNCTIONS:Files both electronic and paper Medicare and Medicaid claims, follows up on claims, files Workers    Compensation claims, assists patients with insurance questions and problems, keeps up with changes in Medicare and Medicaid regulations and software. Frequently requires independent action and discretion to solve problems.Types form letters and other routines correspondence based information from records and files.Processes documents which require procedural knowledge of E.M.S. Maintains files and reviews documents for accuracy. Provides information in person of by telephone to other units and the general public, applying knowledge of rules, regulations and procedures of E.M.S.Prepares and types monthly lists of allowances after Medicare and


-------------- original text --------------
*Job Summary: Reporting to the facility designated manager, perform custodial cleaning and light maintenance activities to auction buildings and grounds per auction standards and procedures. Perform all additional duties as directed by facility designated manager. Must know, practice, and ensure that company policies, procedures, and applicable state and federal laws are followed at all times.Responsibilities and Duties: 1.    Greet and provide courteous service to every person with whom they come in contact. Maintain a professional appearance and neat work environment consistent with company policy.2.    Sweep, mop and polish the floors within the offices of the auction. Dust equipment, furniture, and fixtures.3.    Remove trash and debris from various areas of the auction and dispose of in the proper central collection area.4.    Perform clean up duties on the outside grounds of auction property.5.    Perform any other cleaning and janitor