#### Author: Trung Kien Nguyen
Created_Date : 09/08/2018

Modified_Date: 02/09/2018


Environment: Python 3.6.0 and Anaconda 5.2.0 (64-bit)

Version: 2.1

Libraries used:
* pandas (for data frame, included in Anaconda Python 3.6) 
* re (for regular expression, included in Anaconda Python 3.6) 
* json (for writing json file)


## 1. Introduction
This project parses raw job-posting text from a *.dat* file and writes the extracted records to *.json* and *.xml* documents. The input file, `job_posting.dat`, is about 71 MB in total. The required tasks are the following:

1. Extract the text from the file using efficient regular expressions.
2. Extract all the job postings and store them in `job_posting.json`.
3. Extract all the job postings and store them in `job_posting.xml`.

More details for each task will be given in the following sections.

## 2. Import Libraries


In [6]:
import pandas as pd
import re
import json

## 3. Examining and Loading Data

In this section, we load the data file `job_posting.dat`, which contains the job postings, from local disk into main memory. The text file is read as a list of strings in which each element is one line of raw data.
Let's take a look at what the data looks like.

Note: because the data is about 71 MB, displaying it may exceed the output limit of Jupyter Notebook.

If you get the following notice:

    IOPub data rate exceeded.
    The notebook server will temporarily stop sending output
    to the client in order to avoid crashing it.
    To change this limit, set the config variable
    `--NotebookApp.iopub_data_rate_limit`.
    
Please terminate the current Jupyter Notebook server and restart it from the command line with the following command:
```shell
    jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000000
```
You can replace 10000000000 with any value you want, as long as it exceeds the size of the data.
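
As an alternative to raising the limit, one could display only the first few lines. The following is a minimal sketch of that idea; the helper name `preview_lines` and the default of 20 lines are assumptions for illustration and are not part of the original notebook.

```python
from itertools import islice


def preview_lines(path, n=20):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, 'r') as f:
        # islice stops after n lines, so a large file is never fully loaded
        return list(islice(f, n))
```

Calling `preview_lines('job_posting.dat')` in a cell would then print a short list instead of the whole file.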

In [1]:
# open file job_posting.dat
with open('job_posting.dat','r') as f:
    data = f.readlines()
data


['ID: 60999\n',
 'PROCEDURE: We encourage all suitable candidates to apply\n',
 'via email through a cover letter and CV with GIS Officer on the subject\n',
 'line to: Human Resources, WWF CauPO at: office@... .\n',
 'Please clearly mention in your application letter that you learned of\n',
 'this job opportunity through Career Center and mention the URL of its\n',
 'website - www.careercenter.am, Thanks.\n',
 'JOB TITLE: Graphic Designer\n',
 'START DATE/\n',
 '_description: We are looking for candidates to take part in the\n',
 'competition for the position of Branch Manager at TechnoNICOL Armenian\n',
 'office to be opened soon.\n',
 '_LOC: Yerevan, Armenia\n',
 'REQUIRED QUALIFICATIONS:\n',
 ' \n',
 'DEAD_LINE: 04 June 2010\n',
 'ABOUT COMPANY:\n',
 ' Save the Children International established its presence\n',
 'in Armenia in 1993, with a mission to achieve immediate and lasting\n',
 'change in childrens lives.\n',
 'RESP:\n',
 ' - Mobilize rural communities, educate and train the

Skimming the raw data above, it is clear that each job posting is separated by a
`------------------------------\n` line.

With this in mind, we can re-load the raw data as a single string and split it on the separator to get a list of job postings. The cell below splits on `------\n`, the tail of the separator, so each posting keeps a trailing run of dashes as its last line; that line is removed later.

We also notice the pattern of each field: a key ending with a ':' symbol, followed by the value of that key.
```
Eg: ID: 97611\n
    JOB_PROCS: 'Interested candidate....'\n
    
```
Keep that pattern in mind; we will use it later.
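
The splitting idea can be sketched on a toy string. This is only an illustration under stated assumptions: the helper name `split_postings` and the toy text are hypothetical, and the sketch removes the whole dashed line and drops empty chunks, which differs from the notebook cell that splits on `'------\n'`.

```python
import re


def split_postings(text):
    """Split raw text into postings on lines that consist only of dashes."""
    # ^ and $ match at line boundaries because of re.MULTILINE
    chunks = re.split(r'^-{6,}[ \t]*$', text, flags=re.MULTILINE)
    # drop surrounding newlines and discard empty chunks
    return [chunk.strip() for chunk in chunks if chunk.strip()]


toy_text = (
    "ID: 1\nJOB TITLE: A\n------------------------------\n"
    "ID: 2\nJOB TITLE: B\n------------------------------\n"
)
postings = split_postings(toy_text)
```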



In [3]:
with open('job_posting.dat','r') as f:
    data = f.read()
data = data.split('------\n')
data

['ID: 60999\nPROCEDURE: We encourage all suitable candidates to apply\nvia email through a cover letter and CV with GIS Officer on the subject\nline to: Human Resources, WWF CauPO at: office@... .\nPlease clearly mention in your application letter that you learned of\nthis job opportunity through Career Center and mention the URL of its\nwebsite - www.careercenter.am, Thanks.\nJOB TITLE: Graphic Designer\nSTART DATE/\n_description: We are looking for candidates to take part in the\ncompetition for the position of Branch Manager at TechnoNICOL Armenian\noffice to be opened soon.\n_LOC: Yerevan, Armenia\nREQUIRED QUALIFICATIONS:\n \nDEAD_LINE: 04 June 2010\nABOUT COMPANY:\n Save the Children International established its presence\nin Armenia in 1993, with a mission to achieve immediate and lasting\nchange in childrens lives.\nRESP:\n - Mobilize rural communities, educate and train them to become eligible\nfor Heifer projects development;\n- Work with project holders to collect data on so

There we go: the raw data is now parsed as a list of job-posting strings.
Let's take a look at an individual posting.

In [3]:
data[0]

'ID: 60999\nPROCEDURE: We encourage all suitable candidates to apply\nvia email through a cover letter and CV with GIS Officer on the subject\nline to: Human Resources, WWF CauPO at: office@... .\nPlease clearly mention in your application letter that you learned of\nthis job opportunity through Career Center and mention the URL of its\nwebsite - www.careercenter.am, Thanks.\nJOB TITLE: Graphic Designer\nSTART DATE/\n_description: We are looking for candidates to take part in the\ncompetition for the position of Branch Manager at TechnoNICOL Armenian\noffice to be opened soon.\n_LOC: Yerevan, Armenia\nREQUIRED QUALIFICATIONS:\n \nDEAD_LINE: 04 June 2010\nABOUT COMPANY:\n Save the Children International established its presence\nin Armenia in 1993, with a mission to achieve immediate and lasting\nchange in childrens lives.\nRESP:\n - Mobilize rural communities, educate and train them to become eligible\nfor Heifer projects development;\n- Work with project holders to collect data on soc

The first job posting is shown above. What about the last element of the list?


In [4]:
data[len(data)-1]

''

It is an empty string. There is no reason to keep it, so we remove it from the list.

In [5]:
data.pop()

''

In [7]:
# the last element
data[len(data)-1]

'ID: 69508\nJOB_PROCS: Interested candidates are encouraged to submit a\nCV to: hr.sas@... mentioning ""Procurement Officer"" in the subject\nline or call: 374 10 525722 for inquiries. The Group thanks all who\nexpress interest in this opportunity; however only those selected for an\ninterview will be contacted. Applications privacy and confidentiality are\nguaranteed.\nPlease clearly mention in your application letter that you learned of\nthis job opportunity through Career Center and mention the URL of its\nwebsite - www.careercenter.am, Thanks.\nJOB TITLE: Loan Specialist\n_description: JINJ engineering consulting company is looking for a\nFundraising and Procurement Specialist for searching financial sources\nand development of project proposals.\n_LOCS: Yerevan, Armenia\nQUALIFICATION:\n - Bachelors degree in Computer Sciences or Electrical Engineering\n(Masters degree a plus);\n- At least 5 years of experience in embedded software development;\n- Hands-on software development wit

Now, let's see how many job postings we have in this list.

In [17]:
data = [d for d in data if d.strip() != '']
len(data)

30175

We have collected 30175 elements in the list, which correspond to 30175 job postings. Keep this number in mind to compare with the result after cleaning and extracting.



## 4. Wrangling Data

In this section, we extract each key and the value that follows it from the raw text data.

As mentioned earlier, the main information of a job posting is represented as key-value pairs (e.g. `ID: 1234`). Looking more closely at the individual posting shown above, each piece of information appears to be separated by a newline (`\n`). For convenience, we split a job posting, which is a single string, into a list of lines.

We start with the first element of data. Because all job postings should follow a similar pattern, we examine one posting and derive the pattern to apply to the rest of the data.


In [4]:
sample = data[0]
sample = sample.split('\n')
sample

['ID: 60999',
 'PROCEDURE: We encourage all suitable candidates to apply',
 'via email through a cover letter and CV with GIS Officer on the subject',
 'line to: Human Resources, WWF CauPO at: office@... .',
 'Please clearly mention in your application letter that you learned of',
 'this job opportunity through Career Center and mention the URL of its',
 'website - www.careercenter.am, Thanks.',
 'JOB TITLE: Graphic Designer',
 'START DATE/',
 '_description: We are looking for candidates to take part in the',
 'competition for the position of Branch Manager at TechnoNICOL Armenian',
 'office to be opened soon.',
 '_LOC: Yerevan, Armenia',
 'REQUIRED QUALIFICATIONS:',
 ' ',
 'DEAD_LINE: 04 June 2010',
 'ABOUT COMPANY:',
 ' Save the Children International established its presence',
 'in Armenia in 1993, with a mission to achieve immediate and lasting',
 'change in childrens lives.',
 'RESP:',
 ' - Mobilize rural communities, educate and train them to become eligible',
 'for Heifer projec

After splitting the string on '\n', we have a list of lines. Looking closer at the data, we notice a fairly consistent pattern of keys and values.

* The first few lines:
   ```
         'ID: 60999',
         'PROCEDURE: We encourage all suitable candidates to apply',
         'via email through a cover letter and CV with GIS Officer on the subject',
         'line to: Human Resources, WWF CauPO at: office@... .',
         'Please clearly mention in your application letter that you learned of',
         'this job opportunity through Career Center and mention the URL of its',
         'website - www.careercenter.am, Thanks.',
         'JOB TITLE: Graphic Designer',
    ```
* The last few lines:
    ```
         'RESP:',
         ' - Mobilize rural communities, educate and train them to become eligible',
         'for Heifer projects development;',
         '- Work with project holders to collect data on social-economic status of',
         'beneficiary families in targeted communities;',
         '- Educate, train and advise community groups;',
         '- Train community groups the developing plans for agriculture projects;',
         '- Monitor implementation of a comprehensive technical assistance program',
         'for beneficiaries;',
         '- Provide in time reliable and valid data reflecting progress in Heifer',
         'projects in Armenia;',
         '- Assist in other areas when needed.',
         '--------------------------'
    ```

Thus, to extract the data, we need to extract each key and the value following it. We also notice that some keys are not in a normal form, for example `_LOC` for location and `_description` for description. In order to extract all the values, we create a list of possible keys that can match the different key variants in the data.

The question is: how do we know which keys we need to extract?

After wrangling the raw data, our goal is to write the cleaned data to *'.json'* and *'.xml'* files, so the keys we need are the ones those outputs use. To find them, we examine the two given output samples, *"sample.json"* and *"sample.xml"*.

The following 11 keys are extracted from those files:

```
    id: ID of job posting
    title: title of job posting
    loc: location of company
    job_descriptions: description of job
    required_qualifications: required qualifications for the job
    job_responsibilities: responsibilities of the job
    salary: salary
    application_procedure: application procedure
    start_date: start date
    application_deadline: application deadline
    about_company: information about the company
    
```

Now we extract the keys. The following regular expression collects the key candidates:
```python
    import re
    # data_keys is the whole file read as a single string
    keys = re.findall(r'(\n[a-zA-Z]*? ?_?-?[a-zA-Z]*[:/])', data_keys)
```

In [12]:
with open('job_posting.dat','r') as f:
    data_keys = f.read()
keys = re.findall(r"(\n[a-zA-Z]*? ?_?-?[a-zA-Z]*[:/])",data_keys)
keys = [key.strip() for key in keys]
keys = list(set(keys))
keys
# data_keys

['projects are:',
 'partner/',
 'portfolio to:',
 'technology/',
 'Program Communications:',
 'balance/',
 'layout/',
 'Senior Loan/',
 'the position:',
 'organization and/',
 'inclusive:',
 'FAO/',
 'transactions/',
 'Companies at:',
 'Address:',
 'Revenue Growth:',
 'with math/',
 'staff:',
 'Contact tel:',
 'title Secretary/',
 'Waiters/',
 'branch at:',
 'Certificates/',
 'and to:',
 'Recommendations:',
 'and PL/',
 'Payment schedule:',
 'components:',
 'economics and/',
 'Auditor InA/',
 'medicine/',
 'Personal Skills:',
 'Organization and/',
 'Associate Reservations:',
 'related SW/',
 'Competencies:',
 'use/',
 'Diaspora:',
 'Project implementation/',
 'Expected Outputs:',
 'configuring/',
 'Illustrator and/',
 'on C/',
 'industry and/',
 'Required skills:',
 'office/',
 'Job Summary:',
 'USAID and/',
 'other NGOs/',
 'Armenian/',
 'manufacturing/',
 'international standards/',
 'public policy/',
 'Child Safeguarding:',
 'conclusions/',
 'addressed:',
 'Ministry Performance:',
 

Now, we extract the keywords from the 2175 keys above and create a dictionary of all the possible keys.
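
With thousands of candidates, counting how often each one occurs can help separate real field names from phrases that merely end in ':' or '/'. The sketch below is one possible way to do that, not the notebook's method; the helper name `count_key_candidates`, the simplified pattern and the toy text are assumptions.

```python
import re
from collections import Counter


def count_key_candidates(text):
    """Count lower-cased key candidates that start a line and end in ':' or '/'."""
    pattern = r'^([A-Za-z_][A-Za-z_ ]*)[:/]'
    matches = re.finditer(pattern, text, flags=re.MULTILINE)
    return Counter(m.group(1).strip().lower() for m in matches)


toy_text = "ID: 1\nJOB TITLE: A\n_LOC: X\n------\nID: 2\nJOB TITLE: B\nline to: hr\n"
counts = count_key_candidates(toy_text)
```

Candidates that occur in nearly every posting, such as `id` here, are more likely to be field names than candidates that occur once.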

The following is a dictionary of possible keys for the 11 main keys we listed above.

In [8]:
# keys for id
key_id = ['_id','id']
# keys for title
key_title = ['title','titles','job_title','job_titles','job title','job titles','title','job_t']
# keys for location
key_loc = ['location','locations','loc','locs','job_loc','located_at']
# keys for job descriptions
key_jd = ["job_descriptions","job description","job_description",'job descriptions','description','job_desc']
# keys for required qualifications
key_rq = ["required_qualifications",
          "required qualifications","required qualification","qualifications","required_qualification",
         'qualification','qualifs','req_quals']
#consider requirement(s)
# keys for job responsibilities
key_jr = ['job_responsibilities','job_responsibilitie','job responsibilities','job responsibilitie','resp',
         'job_resps','responsibility','specific responsibilities','resps',
         'responsibilities include']
# keys for salary
key_salary = ['salary','job_sal','payment','salary']
# keys for application procedure
key_proc = ['application_procedure','application procedure','procedure','procedures','job_proc','job_procs','job proc',
           'job procs','specific requirements']
# keys for start date
key_sdate = ['start_date','start date','start_da','date_start','starting date']
# keys for application deadline
key_ad = ['application_deadline','application deadline','deadline','application_deadl',
         'application_dl','deadlines','application_deadl']
# keys for about company
key_company = ['about_company','company','about','about company','_info']


keywords = [key_id,key_title,key_loc,key_jd,key_rq,key_jr,key_salary,key_proc,key_sdate,key_ad,key_company]


def check_keyword(key):
    for dic_key in keywords:
        if key in dic_key:
            return True
    return False



You may wonder why we list so many keywords for a single key. As discussed before, each key can appear in different forms. The more potential keywords we include, the more data we extract correctly.

Notice that in every keyword list, the first element is treated as the main key, because it matches the key in the output sample.

For example:
```
    key_jr = ['job_responsibilities','job_responsibilitie','job responsibilities','job responsibilitie','resp']
```
Clearly, *job_responsibilities* is the main key, as it appears in the sample output. We explain more when we dig into extracting the data.

With that in mind, we define a function to get the main key.


    

In [9]:
def get_main_key(key):
    for dic_key in keywords:
        if key in dic_key:
            return dic_key[0]
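
The same lookup can also be expressed as a flat dictionary from every alias to its main key, which avoids scanning the lists on each call. This is a hedged sketch rather than the notebook's implementation: `build_alias_map` is a hypothetical helper, and `keyword_lists` is assumed to have the same shape as `keywords` (a list of lists whose first element is the main key).

```python
def build_alias_map(keyword_lists):
    """Map each alias to the first element (main key) of its list."""
    alias_to_main = {}
    for aliases in keyword_lists:
        main_key = aliases[0]
        for alias in aliases:
            # setdefault keeps the first mapping if an alias is repeated
            alias_to_main.setdefault(alias, main_key)
    return alias_to_main


# toy keyword lists in the same shape as the notebook's `keywords`
toy_keywords = [['title', 'job title'], ['location', 'loc', 'locs']]
alias_map = build_alias_map(toy_keywords)
```

With such a map, `alias_map.get(key)` returns the main key, or `None` for an unknown key, which covers both `check_keyword` and `get_main_key`.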

Now, we extract the main information following the pattern 'key:value' by looping over every line of the sample. Here we choose to use regular expressions:

``` python
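     import re
     for line in sample:
         match = re.match(r'^([^:/|]+)[:/|]', line)  # extract the key
         # continue only if a key was found (and, in the full script, is in keywords)
         if match is not None:
             key = match.group(1)
             # the value is the text that follows the key and its delimiter
             value = re.findall(re.escape(key) + r'[:/|]? ?(.*)', line)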
```
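
To see what a key pattern does on individual lines, here is a small self-contained sketch. The function name `split_key_value` is hypothetical and the sample lines are taken from the posting shown earlier. It also shows why the keyword check is needed: an ordinary sentence that contains a colon looks like a key too.

```python
import re


def split_key_value(line):
    """Return (key, value) if the line contains a ':'; otherwise None."""
    match = re.match(r'^([^:]+):\s*(.*)$', line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


pairs = [
    split_key_value('ID: 60999'),
    split_key_value('office to be opened soon.'),
    split_key_value('line to: Human Resources, WWF CauPO at: office@... .'),
]
```

The third line is a continuation of the *PROCEDURE* text, yet it yields the candidate key `line to`, so a pattern match alone is not enough to identify a field.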

In [10]:
# sample.pop() # remove the last element of array, which is a continuous - notation

dic = {}
for line in sample:
        # remove whitespace at the beginning and end
        line = line.strip()
        # pre-processing for the convenience 
        line = line.replace(": ",":")
        # get the key of each line
        key = re.match(r'^([^:]+):',line)
        if key != None:
            # pre-processing for the sake of the convenience
            key = key.group().replace(":","") 
            key = key.replace("/","")
            
            text = ""
            if check_keyword(key.lower()): # to ignore case sensitive
                # get the following text
                pattern = r''+key+'[:/]? ?(.*)'
                text = " ".join(re.findall(pattern,line))
                
                # if key is in dictionary, add value to current value of dict
                # otherwise create a value for key
                if key in dic:
                    dic[key] = dic[key]+text
                else:
                    dic[key] = text
dic

{'ABOUT COMPANY': '',
 'ID': '60999',
 'JOB TITLE': 'Graphic Designer',
 'PROCEDURE': 'We encourage all suitable candidates to apply',
 'REQUIRED QUALIFICATIONS': '',
 'RESP': ''}

When we run the above script, we get what looks like all the keys that match the keywords defined above, together with the values that follow. However, if we look closely at the output, we find that the script does not parse the lines adequately. The following issues can be identified:

1. Some keys have values that spread over more than one line. For example:
    ```
       'RESP:',
       ' - Mobilize rural communities, educate and train them to become eligible',
       'for Heifer projects development;',
       '- Work with project holders to collect data on social-economic status of',
       'beneficiary families in targeted communities;',
    ```
    but what we got:
    ```
       'RESP': '',
    ```
    
2. Do you notice that some fields are missing?


The first problem arises because we split a job posting, a single string, into a list of lines. As a result, the value of some keys spreads over more than one line.

To fix this, we introduce a variable called previous_key to remember the last key that the expression matched. If the expression does not match the next line, we append that line to the current value of the last key.
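
The carry-forward idea can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `collect_fields` and `KNOWN_KEYS` are hypothetical names, only two keys are known, and continuation lines are joined with a single space, whereas the notebook's function concatenates them directly.

```python
import re

# hypothetical, deliberately tiny set of accepted keys (lower case)
KNOWN_KEYS = {'resp', 'about company'}


def collect_fields(lines, known_keys=KNOWN_KEYS):
    """Attach lines that do not start with a known key to the previous key."""
    fields = {}
    previous_key = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        match = re.match(r'^([^:/]+)[:/]\s*(.*)$', line)
        if match and match.group(1).strip().lower() in known_keys:
            # a new field starts here
            previous_key = match.group(1).strip()
            fields[previous_key] = match.group(2)
        elif previous_key is not None:
            # continuation line: append to the value of the last key seen
            fields[previous_key] = (fields[previous_key] + ' ' + line).strip()
    return fields


toy_lines = ['RESP:', ' - Mobilize rural communities', 'for Heifer projects;',
             'ABOUT COMPANY:', ' Save the Children']
fields = collect_fields(toy_lines)
```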

The updated script is as follows. We also wrap it in a function called extract_key_to_dict.

In [11]:
def extract_key_to_dict(data):
    dic = {}
    dic["ID"] = int(data[0].split(":")[1].strip())
    previous_key = None
    for line in data[1:]:
            # remove whitespace at the beginning and end
            line = line.strip()
            # get the key of each line
            key = re.match(r'^([^:/]+)[:/]',line)
            
            if key != None:
                text = ""
                key = key.group().replace(":","") 
                key = key.replace("/","")
                if check_keyword(key.lower()): # to ignore case sensitive
                    # get the following text
                    pattern = r''+key+'[:/]? ?(.*)'
                    text = " ".join(re.findall(pattern,line))
                    # save the current key
                    previous_key = key
                    # if key is in dictionary, add value to current value of dict
                    # otherwise create a value for key
                    if key in dic:
                        dic[key] = dic[key]+text
                    else:
                        dic[key] = text
                else:
                    # if key is not in dictionary, add the value of line to current key 
                    if previous_key != None:
                        dic[previous_key] = dic[previous_key] + line
            else:
                # if pattern doesn't match, add the value of line to current key
                if previous_key != None:
                    dic[previous_key] = dic[previous_key] + line
    return dic

extract_key_to_dict(sample)

{'ABOUT COMPANY': 'Save the Children International established its presencein Armenia in 1993, with a mission to achieve immediate and lastingchange in childrens lives.',
 'ID': 60999,
 'JOB TITLE': 'Graphic Designer',
 'PROCEDURE': 'We encourage all suitable candidates to applyvia email through a cover letter and CV with GIS Officer on the subjectline to: Human Resources, WWF CauPO at: office@... .Please clearly mention in your application letter that you learned ofthis job opportunity through Career Center and mention the URL of itswebsite - www.careercenter.am, Thanks.',
 'REQUIRED QUALIFICATIONS': 'DEAD_LINE: 04 June 2010',
 'RESP': '- Mobilize rural communities, educate and train them to become eligiblefor Heifer projects development;- Work with project holders to collect data on social-economic status ofbeneficiary families in targeted communities;- Educate, train and advise community groups;- Train community groups the developing plans for agriculture projects;- Monitor implemen

We get a better result, but looking carefully, we still lose the *location* field that is present in the raw data.

The reason is that the *location* field of this job posting is written as *"_LOC"*, which is not one of the keys we defined above.

To solve this, we only need *'loc'* in the keyword list for location. We leave out the `_` character because we can handle it in the code.

The following script handles this issue.

First, we get the key of each line using this expression:
```python
    import re
    key = re.match(r'^([^:]+):',line)
```
For each key, we extract the meaningful part:
```python
     key_line = key.group(1) # get the text of the key
     
     # get only alphabet character of the key
     key = re.search(r'([a-zA-Z]+)',key_line)
     # drop leading characters of the key until it starts with a letter
     while key is not None and key.start() != 0:
           key_line = key_line[1:]
           key = re.search(r'([a-zA-Z]+)',key_line)
```
The keywords may differ between job postings; however, we normalize them to the main keywords using the function get_main_key above.
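
Stripping the leading non-letter characters can also be written as a single substitution. The sketch below is an alternative formulation, not the notebook's code; `normalise_key` is a hypothetical helper, and it additionally lower-cases the key so that it can be compared with the keyword lists.

```python
import re


def normalise_key(raw_key):
    """Remove leading non-letter characters and lower-case the key."""
    return re.sub(r'^[^A-Za-z]+', '', raw_key).strip().lower()


examples = [normalise_key('_LOC'), normalise_key('_description'), normalise_key('JOB TITLE')]
```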

The updated script follows.

In [16]:
def extract_key_to_dict(data):
    dic = {}
    dic["_id"] = int(data[0].split(":")[1].strip())
    previous_key = None
    for line in data[1:]:
            # remove whitespace at the beginning and end
            line = line.strip()
            # get the key of each line
            key = re.match(r'^([^:/]+)[:/]',line)
            
            if key != None:
                key_line = key.group(1)
                key = re.search(r'([a-zA-Z]+)',key_line)
                while key != None and key.start() != 0:
                    key_line = key_line[1:]
                    key = re.search(r'([a-zA-Z]+)',key_line)
                text = ""
                key_line = key_line.replace(":","")
                if check_keyword(key_line.lower()): # to ignore case sensitive
                    # get the following text
                    pattern = r''+key_line+'[:/]? ?(.*)'
                    text = " ".join(re.findall(pattern,line))
                    # save the current key
                    key_line = get_main_key(key_line.lower())
                    previous_key = key_line
                    # if key is in dictionary, add value to current value of dict
                    # otherwise create a value for key
                    if key_line in dic:
                        dic[key_line] = dic[key_line]+text
                    else:
                        dic[key_line] = text
                else:
                    # if key is not in dictionary, add the value of line to current key 
                    if previous_key != None:
                        dic[previous_key] = dic[previous_key] + line
            else:
                # if pattern doesn't match, add the value of line to current key
                if previous_key != None:
                    dic[previous_key] = dic[previous_key] + line
    return dic

# extract_key_to_dict(sample)

There we go: *'required_qualifications'* is not empty, but its value is actually the *dead_line* text.

Comparing the output before and after updating the script, it is clear that *"RESP"* now has a value.

Is any data missing?

Yes, we are clearly still missing the *'dead_line'* field.


As the raw data shows, the job posting with ID 60999 does have a *'dead_line'* field, written as *'DEAD_LINE'*. Why does it not show up in the result above?

Our key filter does not recognize the *'dead_line'* field.

To fix this, we add *'dead_line'* to a keyword list and call the function *extract_key_to_dict* again to see the result. Note that the cell below appends it to `key_loc`, so the deadline text is merged into the *location* value (visible later in the data frame as `Yerevan, Armenia04 June 2010`); appending it to `key_ad` instead would map it to *application_deadline*.



In [13]:
key_loc.append('dead_line')
job = extract_key_to_dict(sample)
job

{'_id': 60999,
 'about_company': 'Save the Children International established its presencein Armenia in 1993, with a mission to achieve immediate and lastingchange in childrens lives.',
 'application_procedure': 'We encourage all suitable candidates to applyvia email through a cover letter and CV with GIS Officer on the subjectline to: Human Resources, WWF CauPO at: office@... .Please clearly mention in your application letter that you learned ofthis job opportunity through Career Center and mention the URL of itswebsite - www.careercenter.am, Thanks.',
 'job_descriptions': 'We are looking for candidates to take part in thecompetition for the position of Branch Manager at TechnoNICOL Armenianoffice to be opened soon.',
 'job_responsibilities': '- Mobilize rural communities, educate and train them to become eligiblefor Heifer projects development;- Work with project holders to collect data on social-economic status ofbeneficiary families in targeted communities;- Educate, train and advi

After running the cell above, the *DEAD_LINE* line is recognized as a key, so its text is no longer swallowed by *required_qualifications*.

The field *"required_qualifications"* is still empty, but that is because the data is actually empty.

Now we can answer the question raised at the start. We put several representations of a key into one keyword list because the fields of each job posting may be written differently. To be more precise, we build a *'bag of keywords'* for each key. The more keywords we have for each key, the more accurately we can collect the data.

As mentioned before, each job posting can use different keywords in different forms. However, we normalize the different keywords to the main keyword using the function get_main_key.

For example:
```
    ABOUT COMPANY             -->  about_company
    DEAD_LINE                 -->  application_deadline
    ID                        -->  _id
    JOB TITLE                 -->  title
    PROCEDURE                 -->  application_procedure
    RESP                      -->  job_responsibilities
    REQUIRED QUALIFICATIONS   -->  required_qualifications
    START DATE                -->  start_date
    _LOC                      -->  location
    _description              -->  job_descriptions
```
Look at the data again:

In [14]:
job

{'_id': 60999,
 'about_company': 'Save the Children International established its presencein Armenia in 1993, with a mission to achieve immediate and lastingchange in childrens lives.',
 'application_procedure': 'We encourage all suitable candidates to applyvia email through a cover letter and CV with GIS Officer on the subjectline to: Human Resources, WWF CauPO at: office@... .Please clearly mention in your application letter that you learned ofthis job opportunity through Career Center and mention the URL of itswebsite - www.careercenter.am, Thanks.',
 'job_descriptions': 'We are looking for candidates to take part in thecompetition for the position of Branch Manager at TechnoNICOL Armenianoffice to be opened soon.',
 'job_responsibilities': '- Mobilize rural communities, educate and train them to become eligiblefor Heifer projects development;- Work with project holders to collect data on social-economic status ofbeneficiary families in targeted communities;- Educate, train and advi

We have cleaned up most of the major data errors. Now the big question is how to construct the *bag of keywords*: with 30175 job postings, it is impossible to examine all the data by hand.

Let's recap our goal, which is to write the data to *.json* and *.xml* files using the main keys we found above.

To find all the possible keywords for each key, we construct a data frame from the given data.

The data frame shows which fields of a job posting are empty. We can then inspect such a posting to see whether the field is really empty or is written under a different key, and it also lets us explore the data further.


In [18]:
# clean the data first: empty fields and N/A placeholders are replaced with None
def clean_dictionary(data_dict):
    for key in data_dict.keys():
        if str(data_dict[key]).strip() == '' \
            or str(data_dict[key]).strip() =='N/A' \
            or str(data_dict[key]).strip()== 'N:A':
            data_dict[key] = None
    return data_dict
        
# transform the output order so that it matches the sample
def transform_to_order(data_dict):
    result = {}
    keys = ['_id','title','location','job_descriptions','job_responsibilities','required_qualifications','salary',
           'application_procedure','start_date','application_deadline','about_company']
    for key in keys:
        if key in data_dict:
            result[key] = data_dict[key]
        else:
            result[key] = None
    return result

pd_data = []
for d in data:
    _d = d.split("\n")
    _d.pop()
    _d = extract_key_to_dict(_d)
    _d = clean_dictionary(_d)
    _d = transform_to_order(_d)
    pd_data.append(_d)



In [19]:
df = pd.DataFrame(pd_data)
df

| | _id | about_company | application_deadline | application_procedure | job_descriptions | job_responsibilities | location | required_qualifications | salary | start_date | title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60999 | Save the Children International established it... | | We encourage all suitable candidates to applyv... | We are looking for candidates to take part in ... | - Mobilize rural communities, educate and trai... | Yerevan, Armenia04 June 2010 | | | | Graphic Designer |
| 1 | 82387 | | July 28 2009 | Applicants are requested to send a cover lette... | The Reporting Analyst has to examine the uniqu... | | Yerevan, Armenia | - University degree, preferably in finance/cre... | | 03 October 2006 | Pre-Sales Engineer |
| 2 | 47791 | The USAID funded Armenia Legislative Strengthe... | 24 January 2006 | If interested, please email your last updateda... | Mavas Group LLC is looking for successful cand... | - Be responsible for regular personal visits t... | Yerevan, Armenia | - BS/MS in Business Administration, Computer S... | | 19 September 2006 | |
| 3 | 59252 | ContourGlobal develops, acquires and operatese... | | Interested candidates are encouraged to submit... | | - Develop and maintain a profitable and qualit... | Yerevan, Armenia | - University degree with honor, MBA desirable;... | Ranging between AMD 350,000 and 1,000,000 base... | 30 December 2009 | French TranslatorOPEN TO/ |
| 4 | 42111 | | 09 May 2015 | Application forms are available at: 2aAgatange... | | - Perform on-site pre-sales and post-sales tec... | | - Fifth year college or university program cer... | | 24 August 2010 | Sales ManagerREMUNERATION: Competitive |
| 5 | 73969 | Save the Children International's mission is t... | | If interested, please send your cover letteran... | Arka News Agency is seeking for a Website Editor. | Design, develop and maintain a coplex suite of... | Yerevan, Armenia26 June 2010 | - University degree in Economics, Management, ... | Highly competitive | 27 October 2011 | Senior Developer for Real-Time Trading Applica... |
| 6 | 59158 | | | Interested applicants should submit their CVst... | ""Media Style"" LLC is looking a Journalist fo... | - Define quality and design requirements for e... | Yerevan, Armenia05 July 2007COMPANYS_INFO:""Yo... | | Starting from 320 000 AMD. | 11 October 2007 | International Relation Manager |
| 7 | 48746 | | 20 June 2008, 14:00 p.m.COMPANYS_INFO:France T... | Qualified and interested candidates arerequest... | | - Contribute original ideas for new marketing ... | Yerevan, Armenia | | Highly competitive, based on qualifications an... | 15 April 2010 | Assistant to International Project Administrator |
| 8 | 60283 | | | To apply for this position, please send yourre... | ""LDT Technology"" LLC is looking for highly m... | - Develop applications in accordance with give... | Yerevan, Armenia30 November 2013COMPANYS_INFO:... | - University degree in Computer Sciences or a ... | Competitive, based on qualifications | 05 December 2012 | Procurement Officer |
| 9 | 72612 | | 20 December 2009responsibilities:- Deal with r... | Please send your resume in English language to... | Energize Global Services CJSC is looking for J... | | | | Compatible to the salary of National Assemblye... | 10 September 2015 | Merchandiser |


A quick look at the data frame shows a problem in row 7: the *about_company* field is None, while *application_deadline* is unusually long and contains free text after the date.

The company description of job posting ID = 48746 is labelled *"COMPANYS_INFO"* in the raw data. That label is not in our bag of keywords, so the text was appended to the preceding field instead of being extracted.

A similar case appears in row 10 (ID=62463).

In row 9 (ID=72612), *application_deadline* also contains the *responsibilities* text, and the *job_responsibilities* field is None.

Now, let's add the new keywords we found above to the dictionary and rebuild the data frame.

In [20]:
key_company.append("companys_info")
key_jr.append("responsibilities")

In [21]:
def contruct_data_frame():
    pd_data = []
    for d in data:
        # split a raw posting into lines and drop the trailing element
        _d = d.split("\n")
        _d.pop()
        # extract the fields with the current keyword lists, then clean and order them
        _d = extract_key_to_dict(_d)
        _d = clean_dictionary(_d)
        _d = transform_to_order(_d)
        pd_data.append(_d)

    return pd.DataFrame(pd_data)
df = contruct_data_frame()

In [22]:
df

Unnamed: 0,_id,about_company,application_deadline,application_procedure,job_descriptions,job_responsibilities,location,required_qualifications,salary,start_date,title
0,60999,Save the Children International established it...,,We encourage all suitable candidates to applyv...,We are looking for candidates to take part in ...,"- Mobilize rural communities, educate and trai...","Yerevan, Armenia04 June 2010",,,,Graphic Designer
1,82387,,July 28 2009,Applicants are requested to send a cover lette...,The Reporting Analyst has to examine the uniqu...,,"Yerevan, Armenia","- University degree, preferably in finance/cre...",,03 October 2006,Pre-Sales Engineer
2,47791,The USAID funded Armenia Legislative Strengthe...,24 January 2006,"If interested, please email your last updateda...",Mavas Group LLC is looking for successful cand...,- Be responsible for regular personal visits t...,"Yerevan, Armenia","- BS/MS in Business Administration, Computer S...",,19 September 2006,
3,59252,"ContourGlobal develops, acquires and operatese...",,Interested candidates are encouraged to submit...,,- Develop and maintain a profitable and qualit...,"Yerevan, Armenia","- University degree with honor, MBA desirable;...","Ranging between AMD 350,000 and 1,000,000 base...",30 December 2009,French TranslatorOPEN TO/
4,42111,,09 May 2015,Application forms are available at: 2aAgatange...,,- Perform on-site pre-sales and post-sales tec...,,- Fifth year college or university program cer...,,24 August 2010,Sales ManagerREMUNERATION: Competitive
5,73969,Save the Children International's mission is t...,,"If interested, please send your cover letteran...",Arka News Agency is seeking for a Website Editor.,"Design, develop and maintain a coplex suite of...","Yerevan, Armenia26 June 2010","- University degree in Economics, Management, ...",Highly competitive,27 October 2011,Senior Developer for Real-Time Trading Applica...
6,59158,"""""Youth For Achievements"""" is an educational N...",,Interested applicants should submit their CVst...,"""""Media Style"""" LLC is looking a Journalist fo...",- Define quality and design requirements for e...,"Yerevan, Armenia05 July 2007",,Starting from 320 000 AMD.,11 October 2007,International Relation Manager
7,48746,France Telecom is a telecommunications operato...,"20 June 2008, 14:00 p.m.",Qualified and interested candidates arerequest...,,- Contribute original ideas for new marketing ...,"Yerevan, Armenia",,"Highly competitive, based on qualifications an...",15 April 2010,Assistant to International Project Administrator
8,60283,PricewaterhouseCoopers is a professional servi...,,"To apply for this position, please send yourre...","""""LDT Technology"""" LLC is looking for highly m...",- Develop applications in accordance with give...,"Yerevan, Armenia30 November 2013",- University degree in Computer Sciences or a ...,"Competitive, based on qualifications",05 December 2012,Procurement Officer
9,72612,,20 December 2009,Please send your resume in English language to...,Energize Global Services CJSC is looking for J...,- Deal with routine correspondence including a...,,,Compatible to the salary of National Assemblye...,10 September 2015,Merchandiser


The data now look much cleaner.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30175 entries, 0 to 30174
Data columns (total 11 columns):
_id                        30175 non-null int64
about_company              24079 non-null object
application_deadline       16828 non-null object
application_procedure      28129 non-null object
job_descriptions           23283 non-null object
job_responsibilities       26687 non-null object
location                   28886 non-null object
required_qualifications    26800 non-null object
salary                     18365 non-null object
start_date                 24354 non-null object
title                      25092 non-null object
dtypes: int64(1), object(10)
memory usage: 2.5+ MB


By repeating this process and constantly updating the dictionary that holds the *bag of keywords*, we get cleaner data.
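
This search for stray labels can be partly automated. The sketch below is one possible approach, not part of the original pipeline: it scans field values for a word immediately followed by a colon (the form seen in `COMPANYS_INFO:` and `responsibilities:` above) and counts the ones that are not yet known keywords. The name `candidate_keywords` and the sample values are hypothetical.

```python
import re
from collections import Counter

# A label fused into a value looks like "...p.m.COMPANYS_INFO:France Telecom..."
LABEL = re.compile(r'([A-Za-z_]+):')

def candidate_keywords(values, known):
    """Count words directly followed by ':' that are not already known keywords."""
    counts = Counter()
    for value in values:
        if not isinstance(value, str):   # skip None / NaN cells
            continue
        for label in LABEL.findall(value):
            label = label.lower()
            if label not in known:
                counts[label] += 1
    return counts

# Hypothetical sample values modelled on the rows discussed above
sample_values = [
    "20 June 2008, 14:00 p.m.COMPANYS_INFO:France Telecom is an operator",
    "20 December 2009responsibilities:- Deal with routine correspondence",
    "Yerevan, Armenia05 July 2007COMPANYS_INFO:Youth For Achievements",
    None,
]
known = {"deadline", "location"}
print(candidate_keywords(sample_values, known).most_common())
```

In the notebook, the values could come from a column such as `df['application_deadline']`. Ordinary words followed by a colon (for example `to` in `to:hr@...`) are counted too, so the result is only a list of candidates to review by hand.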

The following is the bag of potential keywords that I extracted from the raw data:


In [24]:
# keys for id
key_id = ['_id','id']
# keys for title
key_title = ['title','titles','job_title','job_titles','job title','job titles','title','job_t','ttl']
# keys for location
key_loc = ['location','locations','loc','locs','job_loc','located_at']
# keys for job descriptions
key_jd = ["job_descriptions","job description","job_description",'job descriptions','description','job_desc']
# keys for required qualifications
key_rq = ["required_qualifications",
          "required qualifications","required qualification","qualifications","required_qualification",
         'qualification','qualifs','req_quals','req_qual']
#consider requirement(s)
# keys for job responsibilities
key_jr = ['job_responsibilities','job_responsibilitie','job responsibilities','job responsibilitie','resp',
         'job_resps','responsibility','specific responsibilities','resps',
         'responsibilities include','responsibilities']
# keys for salary
key_salary = ['salary','job_sal','payment','salary','remuneration']
# keys for application procedure
key_proc = ['application_procedure','application procedure','procedure','procedures','job_proc','job_procs','job proc',
           'job procs','specific requirements']
# keys for start date
key_sdate = ['start_date','start date','start_da','date_start','dates','open to']
# keys for application deadline
key_ad = ['application_deadline','application deadline','deadline','application_deadl',
         'application_dl','deadlines','application_deadl','dead_line']
# keys for about company
key_company = ['about_company','company','about','about company','companys_info','info']


keywords = [key_id,key_title,key_loc,key_jd,key_rq,key_jr,key_salary,key_proc,key_sdate,key_ad,key_company]


Let's summarize the data again with the extended keyword lists.

In [26]:
df = contruct_data_frame()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30175 entries, 0 to 30174
Data columns (total 11 columns):
_id                        30175 non-null int64
about_company              27111 non-null object
application_deadline       28006 non-null object
application_procedure      28129 non-null object
job_descriptions           22590 non-null object
job_responsibilities       26687 non-null object
location                   28108 non-null object
required_qualifications    26791 non-null object
salary                     28114 non-null object
start_date                 27933 non-null object
title                      28111 non-null object
dtypes: int64(1), object(10)
memory usage: 2.5+ MB


We know that there are 30175 job postings; however, many of them still have null fields.

As the summary above shows, the title field has 28111 non-null objects, which means that 2064 job titles (30175 - 28111) are null. Either the raw data really has no title for those postings, or we missed a few keywords in the *bag of keywords*.
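
To tell these two causes apart, it helps to pull out the postings whose title is missing and read their raw text. A minimal sketch, assuming the cleaned records are available as a list of dictionaries like the `pd_data` list built earlier; `missing_ids` and the sample records are hypothetical.

```python
def missing_ids(records, field):
    """Return the _id of every record whose field is absent or None."""
    return [r.get('_id') for r in records if r.get(field) is None]

# Hypothetical records in the shape produced by the cleaning steps
records = [
    {'_id': 60999, 'title': 'Graphic Designer'},
    {'_id': 47791, 'title': None},
    {'_id': 11111},
]
print(missing_ids(records, 'title'))
```

On the data frame itself, `df[df['title'].isnull()]['_id']` should select the same postings.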

## 5. Storing Data

In this section, we write the data to *.json* and *.xml* files, following the two given sample files.

### 5.1. Storing Data in JSON format

The following is the sample output for the *.json* file:
```json
    {
	"listings": {
		"listing": [{
			"_id": "90173",
			"title": "Local Expert in Support of Establishment of National Disaster Observatory in Armenia",
			"location": "Yerevan, Armenia",
			"job_descriptions": {
				"description": ["Synopsys Armenia is looking for CAE for its AMSG division", "something"]
			},
			"job_responsibilities": {
				"responsibility": [
					"Provide the existing and potential customers with the information on Bank products and services",
					"Provide Bank account services by phone"
				]
			},
			"required_qualifications": {
				"qualification": [
					"Minimum 4-6 years of extensive development experience, and minimum 3+ years with the following technologies: a) ASP.NET for building the new Control Panel; b) Pure C# code both on the middle tier and as part of the web UI layer of ASP.NET",
					"Familiarity with the .NET Framework, specifically the following packages: messaging, threading, generic collections, custom controls and LINQ to SQL classes and also ADO.NET"
				]
			},
			"salary": "Commensurate with skills and experience",
			"application_procedure": "To apply for this position, please send a letter of intent with a CV addressing relevant qualifications and experience to:hr_wvarm@... with cc to: arman_grigoryan@... . In the subject line of your e-mail message, please, mention title of the position you are applying for. No information inquiries will be handled over the phone. Only short listed candidates will be notified for the interview. Please clearly mention in your application letter that you learned of this job opportunity through Career Center and mention the URL of its website - www.careercenter.am, Thanks",
			"start_date": "11 September 2012",
			"application_deadline": "08 August 2006",
			"about_company": "N/A"
            }]
        }
    }
```

Notice that in the sample output, *job_descriptions*, *job_responsibilities*, and *required_qualifications* are transformed into arrays. We define a function called *transform_to_array* to meet this requirement.

In [27]:
def split_to_items(text):
    """Split a ';'-separated field into a list of items without leading bullet dashes."""
    items = []
    for line in text.split(";"):
        # drop only a leading "-" bullet, so hyphens inside the text (e.g. "4-6 years") survive
        line = re.sub(r'^\s*-\s*', '', line)
        items.append(line.strip())
    return items


def transform_to_array(data_dict):
    # (field name, key of the nested array) pairs; a missing or null field becomes "N/A"
    fields = [('job_responsibilities', 'responsibility'),
              ('required_qualifications', 'qualification'),
              ('job_descriptions', 'descriptions')]
    for field, item_key in fields:
        text = data_dict.get(field)
        if text is not None:
            data_dict[field] = {item_key: split_to_items(text)}
        else:
            data_dict[field] = {item_key: "N/A"}
    return data_dict


# null values are transformed to "N/A"
def to_na_value(data):
    for key in data:
        if data[key] is None:
            data[key] = 'N/A'
    return data


# function to convert data to make it look like sample data
def to_json(data):
    result = {}
    listing = []
    for d in data:
        _d = d.split("\n")
        _d.pop()
        _d = extract_key_to_dict(_d)
        _d = clean_dictionary(_d)
        _d = transform_to_order(_d)
        _d = transform_to_array(_d)
        _d = to_na_value(_d)
        listing.append(_d)
    listing_object = {}
    listing_object['listing'] = listing
    result['listings'] = listing_object
    return result



In [28]:
json_file = to_json(data)
with open("job_posting.json",'w') as f:
    json.dump(json_file,f,indent=4)
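
One detail of `json.dump` is worth knowing here: by default it escapes every non-ASCII character as a `\uXXXX` sequence. If the postings contain such characters (an assumption, since the raw file is not shown here), passing `ensure_ascii=False` keeps them readable in the output. The sketch below compares both settings on a hypothetical record, writing to an in-memory buffer instead of `job_posting.json`.

```python
import io
import json

# Hypothetical record with one non-ASCII character
sample = {"listings": {"listing": [{"_id": "1", "title": "Café Manager"}]}}

def dump_text(obj, ensure_ascii):
    """Serialize obj the same way as the cell above, but into a string."""
    buf = io.StringIO()
    json.dump(obj, buf, indent=4, ensure_ascii=ensure_ascii)
    return buf.getvalue()

escaped = dump_text(sample, ensure_ascii=True)    # default behaviour
readable = dump_text(sample, ensure_ascii=False)  # keeps "é" as-is

# Both forms load back to the same data
assert json.loads(escaped) == json.loads(readable) == sample
```

When writing with `ensure_ascii=False` to a real file, it is safer to open the file with `encoding="utf-8"` so the result does not depend on the platform default.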
    

### 5.2. Storing Data in XML format

The following is the sample output for the *.xml* file:

```xml
    <?xml version="1.0" encoding="UTF-8" ?>
    <listings>
        <listing id='90173'>
            <title>Local Expert in Support of Establishment of National Disaster Observatory in Armenia</title>
            <location>Yerevan, Armenia</location>
            <job_descriptions>
                <description>
                   Synopsys Armenia is looking for CAE for its AMSG division
                </description>
            </job_descriptions>
            <job_responsibilities>
                <responsibility> 
                Provide the existing and potential customers with the information on Bank products and services
                </responsibility>
                <responsibility> 
                Provide Bank account services by phone
                </responsibility>
            </job_responsibilities>
            <required_qualifications>
                <qualification> 
                Minimum 4-6 years of extensive development experience, and minimum 3+ years with the following    technologies: a) ASP.NET for building the new Control Panel; b) Pure C# code both on the middle tier and as part of the web UI layer of ASP.NET
                </qualification>
                <qualification> 
                Familiarity with the .NET Framework, specifically the following packages: messaging, threading, generic collections, custom controls and LINQ to SQL classes and also ADO.NET
                </qualification>
            </required_qualifications>
            <salary> Commensurate with skills and experience </salary>
            <application_procedure> 
            To apply for this position, please send a letter of intent with a CV addressing relevant qualifications and experience to:hr_wvarm@... with cc to: arman_grigoryan@... . In the subject line of your e-mail message, please, mention title of the position you are applying for. No information inquiries will be handled over the phone. Only short listed candidates will be notified for the interview. Please clearly mention in your application letter that you learned of this job opportunity through Career Center and mention the URL of its website - www.careercenter.am, Thanks
            </application_procedure>
            <start_date> 11 September 2012 </start_date>
            <application_deadline> 08 August 2006 </application_deadline>
            <about_company> N/A </about_company>
        </listing>
    </listings>  

```

In [31]:
# a function to transform the data and write it to an xml file
from xml.sax.saxutils import escape

# nested field -> (key of its array in the dictionary, tag written for each item)
# (job_descriptions is stored under 'descriptions' but written as <description>)
NESTED_FIELDS = {'job_descriptions': ('descriptions', 'description'),
                 'job_responsibilities': ('responsibility', 'responsibility'),
                 'required_qualifications': ('qualification', 'qualification')}

def json2xml(json_list, line_spacing=""):
    with open("job_posting.xml", 'w', encoding="utf-8") as f:
        f.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n")
        f.write("<listings>\n")
        for json_obj in json_list:
            f.write("\t<listing id='" + str(json_obj['_id']) + "'>\n")
            for key, val in json_obj.items():
                if key == '_id':
                    # already written as the id attribute; skip it without mutating the input
                    continue
                if key in NESTED_FIELDS:
                    item_key, tag = NESTED_FIELDS[key]
                    items = val[item_key]
                    f.write("\t\t<" + key + ">\n")
                    if items != 'N/A':
                        for d in items:
                            f.write("\t\t\t<" + tag + ">\n")
                            # escape &, < and > so the file stays well-formed XML
                            f.write("\t\t\t\t" + escape(d) + "\n")
                            f.write("\t\t\t</" + tag + ">\n")
                    else:
                        f.write("\t\t\t\tN/A\n")
                    f.write("\t\t</" + key + ">\n")
                else:
                    f.write("\t\t<" + key + ">" + escape(str(val)) + "</" + key + ">\n")
            f.write("\t</listing>\n")
        f.write("</listings>\n")


In [29]:
# a function to extract, clean and transform data for xml format
def extract_clean_data_for_xml(data):
    result = []
    for d in data:
        _d = d.split("\n")
        _d.pop()
        _d = extract_key_to_dict(_d)
        _d = clean_dictionary(_d)
        _d = transform_to_order(_d)
        _d = transform_to_array(_d)
        _d = to_na_value(_d)
        result.append(_d)
    return result

result = extract_clean_data_for_xml(data)

Write the cleaned data to the XML file

In [32]:
json2xml(result)

We are done with Task 1. Let's try to read the *json* and *xml* data that we have written above.

In [39]:
# read the json file
with open("job_posting.json", 'r') as f:
    json_file = json.load(f)
json_file['listings']['listing']

[{'_id': 60999,
  'about_company': 'Save the Children International established its presencein Armenia in 1993, with a mission to achieve immediate and lastingchange in childrens lives.',
  'application_deadline': '04 June 2010',
  'application_procedure': 'We encourage all suitable candidates to applyvia email through a cover letter and CV with GIS Officer on the subjectline to: Human Resources, WWF CauPO at: office@... .Please clearly mention in your application letter that you learned ofthis job opportunity through Career Center and mention the URL of itswebsite - www.careercenter.am, Thanks.',
  'job_descriptions': {'descriptions': ['We are looking for candidates to take part in thecompetition for the position of Branch Manager at TechnoNICOL Armenianoffice to be opened soon.']},
  'job_responsibilities': {'responsibility': ['Mobilize rural communities, educate and train them to become eligiblefor Heifer projects development',
    'Work with project holders to collect data on socia

In [33]:
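# read the xml file
from bs4 import BeautifulSoup

with open("job_posting.xml", encoding="utf-8") as f:
    xml_file = f.read()
# name the parser explicitly; "xml" requires lxml (included in Anaconda)
soup = BeautifulSoup(xml_file, "xml")
print(soup.prettify())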


IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.
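
Printing the whole document exceeds the notebook's output rate limit, as the message above shows. A lighter check is to parse the file and report only a summary. The sketch below uses the standard library's `xml.etree.ElementTree`, which raises `ParseError` when the input is not well-formed; `summarize_xml` and the two-listing sample are hypothetical, and the sample stands in for `job_posting.xml` so that the block runs on its own.

```python
import io
import xml.etree.ElementTree as ET

def summarize_xml(source, preview=1):
    """Parse XML from a path or file object; return (listing count, first ids).

    ET.parse raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.parse(source).getroot()
    listings = root.findall('listing')
    return len(listings), [item.get('id') for item in listings[:preview]]

# Hypothetical two-listing document in the same layout as job_posting.xml
sample = b"""<?xml version="1.0" encoding="UTF-8" ?>
<listings>
    <listing id='60999'><title>Graphic Designer</title></listing>
    <listing id='82387'><title>Pre-Sales Engineer</title></listing>
</listings>
"""
print(summarize_xml(io.BytesIO(sample)))
```

In the notebook, the call would be `summarize_xml("job_posting.xml")`. A count equal to the number of postings suggests that every listing was written and that the file parses cleanly.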
