

Date: 26/08/2018

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re 2.2.1 (for regular expressions)


## 1. Introduction
This assignment involves parsing a raw text file. The file contains a total of 30,687 job postings, which are to be extracted into a structured dataset.


More details for each task will be given in the following sections.

## 2.  Import libraries 

In [18]:
import re
import json

## 3.  Opening  Files

The job postings file is opened and read into a string.

Upon observing the contents of the file, one can see each posting is delimited with the string "------------------------------".
This string was used as a delimiter to split the string into individual job postings. The length of the resulting list implies the file contains 30,687 job postings.

In [2]:
#the file is only read, so it is opened in read mode and closed automatically
with open("29466695.dat","r") as file:
    file_string = file.read()
job = file_string.split("------------------------------")
print(len(job))

30687


## 4.  Extracting Information 
While perusing the contents of the file, the pattern of a job posting soon became clear.

Each job posting consisted of various categories e.g. `ID` or `JOB DESCRIPTION`, followed by a colon or a forward slash and then the information pertaining to that category. The next step was to design a regular expression (regex) which could capture this data.

Each posting was divided into sections. A section consists of a category and its pertaining information. It begins with a category key, followed by the relevant information, and ends when either:

1. Another section is encountered
2. The job posting has ended

The regex would therefore look like: `(Opening category)(Information)(Closing category)`.

I will detail my method for developing the regex in the following cells.
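
As a minimal sketch of this three-part structure, the snippet below applies a hand-written regex to a made-up two-line posting. The posting text and the variable names are assumptions for illustration only; the real keys are derived in the following sections.

```python
import re

# Hypothetical posting fragment (not taken from the data file).
posting = "ID: 53640\nSALARY: Highly competitive\n"

# (Opening category)(Information)(Closing category)
pattern = r"^(ID[:/])(.*?)^(SALARY[:/])"

# DOTALL lets "." span newlines; MULTILINE lets "^" match at each line start.
match = re.search(pattern, posting, re.DOTALL | re.MULTILINE)
opening, information, closing = match.groups()

print(opening)              # ID:
print(information.strip())  # 53640
print(closing)              # SALARY:
```

The second group holds the information belonging to the opening category, which is the part that ends up in the dataset.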

### a. Finding Category Keys

The first challenge was that categories in the postings could take various forms. For example, job responsibilities could be of type `job_resps` or `resp`. From here on, I will refer to these category variations as **keys**.

The first regular expression I designed was of the form:
```python 
r'''^           # (1)
(_)?            # (2)
[a-zA-Z]+       # (3)
([\s_][\w]+)?   # (4)
[:/]            # (5)
'''
```
This regex says: `(1)` from the start of the line, `(2)` the string may begin with an underscore '\_', and `(3)` must then have at least one letter (upper or lowercase). `(4)` The letters may be followed by a second string of characters, consisting of either a whitespace character or an underscore followed by at least one word character (letter, digit or underscore). `(5)` The expression must end in a colon or forward slash.

However, the results contained some multiline strings of the format `word\nword[:/]`, because `\s` also matches a newline character.

To avoid this, I added a negative lookahead `(?!\n)` after the first string to ensure it was not followed by a newline character. The resulting regex:
```python 
r"^(_)?[a-zA-Z]+(?!\n)([\s_][\w]+)?[:/]"
```
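
The effect of the lookahead can be checked on a small made-up snippet (the sample text below is a hypothetical example, not taken from the data file). The first pattern joins the last word of one line to the key on the next line, while the lookahead version returns only the key:

```python
import re

# Hypothetical snippet: a line ending in a bare word, followed by two real keys.
sample = "Thanks\nSALARY: Highly competitive\n_LOC/ Yerevan\n"

without_lookahead = r"^(_)?[a-zA-Z]+([\s_][\w]+)?[:/]"
with_lookahead = r"^(_)?[a-zA-Z]+(?!\n)([\s_][\w]+)?[:/]"

# finditer + group(0) returns the whole match, regardless of capture groups.
before = [m.group(0) for m in re.finditer(without_lookahead, sample, re.MULTILINE)]
after = [m.group(0) for m in re.finditer(with_lookahead, sample, re.MULTILINE)]

print(before)  # ['Thanks\nSALARY:', '_LOC/']
print(after)   # ['SALARY:', '_LOC/']
```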


In [3]:
#all possible candidates are found
candidates=re.finditer(r"^(_)?[a-zA-Z]+(?!\n)([\s_][\w]+)?[:/]",file_string,re.MULTILINE)

In [4]:
#candidates are standardized by removing colon and forward slash and converting to same case
key_candidates=[]
#candidates added to a list
for item in candidates:
    item=item.group(0)
    if item != '':
        temp=item.strip(":").strip("/")
        key_candidates.append(temp.upper())

In [5]:
#the resulting list had 1931 candidates
key_candidates_set = set(key_candidates)
len(key_candidates_set)

1931

### b. Reducing Candidate List

The resulting candidate list was too long to work with. To reduce the number of candidate keys to analyse, I created a dictionary containing the count of each candidate and kept only candidates that occurred more than 10 times. I reasoned that the potential data loss is minuscule in comparison to the size of the data.

This whittled the list of candidates down from 1931 to 314.

From the resulting list, I quickly deleted all candidates which were semantically implausible (over 100), e.g. "cc to". I then inspected the context of the remaining candidates. The final list comprised 56 keys.

The output of this section is a list of keys, grouped by each category.

In [6]:
from collections import Counter

#each candidate is counted once; only candidates occurring more than 10 times are kept
val={item:count for item,count in Counter(key_candidates).items() if count>10}
len(val)

314

The resulting keys are provided below.

In [None]:
keys='''ID[:/]|

APPLICATION_DEADL[:/]|
APPLICATION_DL[:/]|
DEADLINE[:/]|
DEADLINES[:/]|
DEAD_LINE[:/]|

JOB_SAL[:/]|
SALARY[:/]|
REMUNERATION/|
REMUNERATION:|

JOB_LOC[:/]|
LOCATION[:/]|
LOCATIONS[:/]|
_LOC[:/]|
_LOCS[:/]|

JOB DESCRIPTION[:/]|
_DESCRIPTION[:/]|
JOB_DESC[:/]|
DESCRIPTION[:/]|

TITLE[:/]|
TITLES[:/]|
JOB TITLE[:/]|
_TTL[:/]|
JOB_T[:/]|

JOB_PROC[:/]|
JOB_PROCS[:/]|
PROCEDURE[:/]|
PROCEDURES[:/]|

JOB_RESPS[:/]|
RESP[:/]|
RESPONSIBILITIES[:/]|
RESPONSIBILITY[:/]|
JOB RESPONSIBILITIES[:/]|

ABOUT[:/]|
ABOUT COMPANY[:/]|
ABOUT PROGRAM[:/]|
ABOUT_COMPANY[:/]|
_INFO[:/]|
COMPANYS_INFO[:/]|

QUALIFICATION[:/]|
QUALIFICATIONS[:/]|
QUALIFS[:/]|
REQUIRED QUALIFICATIONS[:/]|
REQ_QUALS[:/]|

DATES/|
DATE_START/|
START DATE/|
START_DA/|
START_DATE/|
OPEN TO/|
DATES:|
DATE_START:|
START DATE:|
START_DA:|
START_DATE:|
OPEN TO:|

'''

Keys are added to a list, grouped by category.

In [7]:

#keys are split to form key list
key_list = keys.split("\n")
categories = []
j = 0
no_keys = 0
for i in range (len(key_list)-1):
    #each group of keys is separated by an empty string in the list.
    #this code section slices the list using the empty string as an index
    #output is a list of categories. Each category represents one group of keys
    if key_list [i]=='':
        category = key_list[j:i]
        no_keys+= len(category)
        categories.append(category)
        j = i+1
        
print(no_keys)

56


### c. Retrieving Data and Building Dataset

Now that the list of keys has been generated, I can write a regex that retrieves the data.

As mentioned above, the regex will be of the format: `(Opening category)(Information)(Closing category)`.

####  Opening Category
We will first look at the **Opening Category**. The opening category simply consists of the keys for that particular category.
Using Application Deadline as an example, the regex for the opening category is:
```python
r"^(APPLICATION_DEADL[:/]|APPLICATION_DL[:/]|DEADLINE[:/]|DEADLINES[:/]|DEAD_LINE[:/])"
```
That is, the Application Deadline category will begin with `APPLICATION_DEADL[:/]` or `APPLICATION_DL[:/]` or `DEADLINE[:/]` or `DEADLINES[:/]` or `DEAD_LINE[:/]`.

However, I had to compensate for two categories, Salary and Start Date, which would occasionally occur twice in one job posting. For any key in either of those categories that ended with a forward slash '/', the resultant information was always null. Hence, I skipped such keys when building the opening category regex.
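
A compact way to express this filtering is sketched below for the Salary keys. This is an illustrative alternative to the loop used later in the notebook, and the names `salary_keys`, `kept` and `opening` are assumptions introduced here:

```python
import re

# Salary keys from the key list above, without the trailing "|" separators.
salary_keys = ["JOB_SAL[:/]", "SALARY[:/]", "REMUNERATION/", "REMUNERATION:"]

# Keys ending in "/" are skipped, since their information is always null.
kept = [key for key in salary_keys if not key.endswith("/")]
opening = "^(" + "|".join(kept) + ")"
print(opening)  # ^(JOB_SAL[:/]|SALARY[:/]|REMUNERATION:)

flags = re.MULTILINE | re.IGNORECASE
print(bool(re.search(opening, "Salary: Highly competitive", flags)))  # True
print(bool(re.search(opening, "REMUNERATION/", flags)))               # False
```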

####  Information
The information part can be expressed as `(.*?)` in the regex. The question mark makes the quantifier lazy, so it stops at the first occurrence of the closing category. This was added after I discovered the regex was consistently matching up to the last key.
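
The difference between the greedy and lazy forms can be seen on a made-up fragment with two possible closing keys (the posting text is a hypothetical example):

```python
import re

# Hypothetical posting fragment with two possible closing keys.
posting = (
    "SALARY: Highly competitive\n"
    "LOCATION: Yerevan, Armenia\n"
    "TITLE: Area Sales Manager\n"
)
flags = re.DOTALL | re.MULTILINE

# Greedy: ".*" runs on to the LAST closing key it can find.
greedy = re.search(r"^(SALARY[:/])(.*)^(LOCATION[:/]|TITLE[:/])", posting, flags)
# Lazy: ".*?" stops at the FIRST closing key.
lazy = re.search(r"^(SALARY[:/])(.*?)^(LOCATION[:/]|TITLE[:/])", posting, flags)

print(repr(greedy.group(2).strip()))  # 'Highly competitive\nLOCATION: Yerevan, Armenia'
print(repr(lazy.group(2).strip()))    # 'Highly competitive'
```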

####  Closing Category
The job postings all follow a general pattern regarding category order:

`'listing' -> 'application_deadline' -> 'salary' -> 'location' -> 'job_descriptions' -> 'title' -> ('start_date') -> 'application_procedure' -> 'job_responsibilities' -> 'about_company' -> 'required_qualifications' -> ('salary') -> 'start_date'`

\* Parentheses **()** denote duplicate categories.

However, one or more categories may be missing from a posting. At first I just put all the keys in the closing category. However, the regex was returning NaN's for all categories except ID. Hence, I modified the closing category to consist of all keys for **all** categories that ***could*** potentially follow the opening category.


For instance, the closing category regex for Required Qualifications is:
```python
"^(REMUNERATION/|DATES/|DATE_START/|START DATE/|START_DA/|START_DATE/|OPEN TO/|DATES:|DATE_START:|START DATE:|START_DA:|START_DATE:|OPEN TO:)"
```
We can see this follows the above listed pattern of '(salary)'->'start_date', which come after the opening category of 'required_qualifications'.


Another consideration I made when building the closing category keys was for the repeat categories. For the start_date category, I put it as the last category option in the closing category regex. This is because a job posting always ends with a start_date category. Additionally, as it is the last option in the regex, it will only be selected if the other closing categories have not been matched. For the salary category, I noticed that the repeat will always have the key `REMUNERATION/`. Hence, this was used in the closing category for all opening categories that occur after salary ('location', 'job_description'... etc). The example above also illustrates this usage.
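
The idea of closing on every category that could follow the opening one can be sketched as below. This is a simplified illustration: the key groups in `ordered` are shortened and hypothetical, `build_regex` is a name introduced here, and the special handling of the repeated salary and start_date categories is left out.

```python
import re

# Hypothetical, shortened key groups in posting order (not the full 56-key list).
ordered = [
    ("salary", ["SALARY:"]),
    ("location", ["LOCATION:"]),
    ("title", ["TITLE:"]),
    ("start_date", ["START DATE:"]),
]

def build_regex(index):
    """Opening keys of one category, closed by the keys of every later category."""
    opening = "|".join(ordered[index][1])
    later = [key for _, keys in ordered[index + 1:] for key in keys]
    if not later:
        # The last category runs to the end of the posting.
        return "^(" + opening + ")(.*)"
    return "^(" + opening + ")(.*?)^(" + "|".join(later) + ")"

# The location category is missing from this made-up posting.
posting = "SALARY: Highly competitive\nTITLE: Area Sales Manager\nSTART DATE: ASAP\n"
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE

salary = re.search(build_regex(0), posting, flags).group(2).strip()
location = re.search(build_regex(1), posting, flags)

print(salary)    # Highly competitive
print(location)  # None
```

Because the closing part lists every later category, the salary section still ends correctly at `TITLE:` even though `LOCATION:` is absent.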

---

The regex for each category was iteratively built using the list of keys and the aforementioned patterns.

The dataset was then iteratively built with the developed regexes and stored in a dictionary.

In [28]:
def regex_info_and_close(category_list,index):
    #the category_list has already been sorted for the pattern in which they occur in the job postings file.
    #hence an opening category's closing category will consist of all items in the list of keys after its index
    new_category_list=category_list[index+1:]
    #the last category in the posting will end when the end of the posting is reached
    if index==len(category_list)-1:
        info_and_close='(.*)'
    elif index <= 2:    
        info_and_close='(.*?)^('
        #all keys in new_category_list added to closing category
        for i in range(len(new_category_list)):
            category=new_category_list[i]
            if i < len(new_category_list) - 1:
                for keys in category:
                    info_and_close+=keys.strip()              
            else :
                #we are at the last category
                for keys in category:
                    if keys != category[len(category)-1]:
                        info_and_close+=keys.strip() 
                    else:
                        #we are at the last key in the last category
                        #closing parenthesis is added and the "or" character ('|') is removed for the last key in closing category
                        info_and_close+=keys[:len(keys)-1].strip()+')'
    
    # the duplicate salary category needs to be re-added to closing categories as it is filtered out after salary has been the opening category
    #which is after index number 2.
    #The duplicate salary category will always have key `REMUNERATION/`
    else:
        info_and_close='(.*?)^('
        for i in range(len(new_category_list)):
            category=new_category_list[i]
            if i < len(new_category_list) - 1:
                for keys in category:
                    info_and_close+=keys.strip()              
            else :
                info_and_close+='REMUNERATION/|'
                for keys in category:
                    if keys != category[len(category)-1]:
                        info_and_close+=keys.strip() 
                    else:
                        info_and_close+=keys[:len(keys)-1].strip()+')'
            
    return info_and_close

regex_keys=[]

for i in range (len(categories)):
    #the beginning of the opening category
    regex_open='^('
    category=categories[i]
    for j in range(len(category)):
        key=category[j].strip()
        if j<len(category)-1:
            #as mentioned, these specific keys are bypassed in the opening category
            if key[len(key)-2]!='/':
                regex_open+=key
        else:
            #we are at the last key in the opening category
            #complete regex expression is formed adding the opening category to the info and the closing category
            if key[len(key)-2]!='/':
                regex_open+=key[:len(key)-1]+')'+regex_info_and_close(categories,i)
    regex_keys.append(regex_open)
regex_keys[5]

'^(TITLE[:/]|TITLES[:/]|JOB TITLE[:/]|_TTL[:/]|JOB_T[:/])(.*?)^(JOB_PROC[:/]|JOB_PROCS[:/]|PROCEDURE[:/]|PROCEDURES[:/]|JOB_RESPS[:/]|RESP[:/]|RESPONSIBILITIES[:/]|RESPONSIBILITY[:/]|JOB RESPONSIBILITIES[:/]|ABOUT[:/]|ABOUT COMPANY[:/]|ABOUT PROGRAM[:/]|ABOUT_COMPANY[:/]|_INFO[:/]|COMPANYS_INFO[:/]|QUALIFICATION[:/]|QUALIFICATIONS[:/]|QUALIFS[:/]|REQUIRED QUALIFICATIONS[:/]|REQ_QUALS[:/]|REMUNERATION/|DATES/|DATE_START/|START DATE/|START_DA/|START_DATE/|OPEN TO/|DATES:|DATE_START:|START DATE:|START_DA:|START_DATE:|OPEN TO:)'

In [9]:
#dataset is iteratively built using the list of regex generated in the previous cells
category_list = ['listing','application_deadline','salary','location','job_descriptions','title','application_procedure','job_responsibilities','about_company','required_qualifications','start_date']
data = {}
for i in range(len(category_list)):
    data[category_list[i]]=[]
    for j in range (len(job)):
        #developed regex is used to extract data
        #dotall to match multiline sections
        #multiline to match the expression from beginning of the line or string
        #ignorecase to ignore case of keys
        temp = re.findall(regex_keys[i],job[j],re.DOTALL|re.MULTILINE|re.IGNORECASE)
        if temp!=[]:
            data[category_list[i]].append(temp[0][1].strip())
        #if the category is not present in the posting 'N/A' is added
        else:
            data[category_list[i]].append('N/A')

See below for an example of the constructed dataset.

In [10]:
for i in range (len(category_list)):
        key=category_list[i]
        print(key,':',data[key][100])

listing : 53640
application_deadline : N/A
salary : Highly competitive
location : Yerevan, Armenia
job_descriptions : N/A
title : Area Sales Manager
application_procedure : All interested and qualified candidates are
encouraged to fill in the last updated version of HSBC Application Form
attached to this announcement and email it to: vacancy.armenia@... .
The old versions of application forms will not be reviewed. Only
short-listed candidates will be invited for interviews. Please put in the
subject line of the e-mail ""ITO Intern"".
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
job_responsibilities : - Organize and manage marketing of the companys products abroad (in CIS
countries); 
- Recruit companys representatives in CIS countries and be responsible
for the efficiency of their work.
about_company : Levon Travel has been operating in Armenian tourist

