# Text Processing


Date: 14th April 2019

Environment: Python 2.7.11 and Jupyter notebook
Libraries used:
* re (for regular expressions, included in Anaconda Python 2.7)
* json (to write the JSON file)

## 1. Introduction

This analysis extracts data from a text file and converts the required information into an XML file and a JSON file. Regular expressions are first used to extract the chunk of details for every unit. The re library is then used on each chunk to extract the unit code, unit title, pre-requisites, prohibitions, synopsis, requirements, outcomes and chief examiners.
Thereafter, the XML file is written manually in a while loop and the JSON file is created with the json library.



## 2. Import libraries

In [830]:
# importing the re library for regular expressions
import re
# importing the json library to create the JSON file
import json

## 3. Extracting data

## Getting units into chunks

In [831]:
# opening and reading file given as infile
with open("Unit Guide.txt",'r') as infile:
    #using .read to read the file as a whole
    file = infile.read()
# re.findall scans the whole file and captures every block that sits between
# a positive lookbehind of <div class="content-inner__main">\n and
# a positive lookahead of <!-- /.content-inner__main --></div>
unit_list = re.findall(r'(?<=\<div class=\"content-inner__main\"\>\n)((.*\n)+?)(?=<!-- /.content-inner__main --></div>)',file)

# the pattern has two capturing groups, so findall returns a list of 2-tuples.
# zip(*unit_list) transposes that list, and index 0 keeps the first group
# (the full chunk) of every tuple
unit_sum = list(zip(*unit_list))[0]
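
The chunking step can be tried on a small string before running it on the real file. The sketch below is only an illustration: the two-unit `sample` is made up, and the inner group is written as non-capturing `(?:...)` (a variation on the pattern above, not the pattern used in this notebook) so that `re.findall` returns plain strings and no transposing with `zip` is needed.

```python
import re

# Hypothetical two-unit sample; the real input is "Unit Guide.txt".
sample = (
    '<div class="content-inner__main">\n'
    '<h1><span class="unitcode">ABC1234</span> - First unit</h1>\n'
    '<!-- /.content-inner__main --></div>\n'
    '<div class="content-inner__main">\n'
    '<h1><span class="unitcode">WXYZ5678</span> - Second unit</h1>\n'
    '<!-- /.content-inner__main --></div>\n'
)

# (?:...) stops the repeated line pattern from becoming a second group,
# so findall returns one string per chunk instead of a tuple.
chunk_re = re.compile(
    r'(?<=<div class="content-inner__main">\n)'
    r'((?:.*\n)+?)'
    r'(?=<!-- /\.content-inner__main --></div>)'
)

chunks = chunk_re.findall(sample)
print(len(chunks))
```

With a single capturing group, each item of `chunks` is the text of one unit, which is the same content that `unit_sum` holds above.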

## 3.1 Extracting unit code

In [832]:
# empty list that the matched unit codes are added to
units_final = []

# Two units have 4-letter codes, so the pattern accepts 3 to 4 capital letters followed by 4 digits.
# The unit code sits between <span class="unitcode"> and </span>, so both tags are placed in
# non-capturing groups and only the code itself is captured.
# re.findall scans the whole file for the pattern.
pattern = re.findall('(?:<span class=\"unitcode\">)([A-Z]{3,4}[0-9]{4})(?:</span>)',file)

# extend adds every matched code to the units_final list
units_final.extend(pattern)
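
A quick check of the unit code pattern on a made-up line shows which strings it accepts. The `sample` string is hypothetical and only meant to exercise the 3-letter, 4-letter and malformed cases.

```python
import re

code_re = re.compile(r'<span class="unitcode">([A-Z]{3,4}[0-9]{4})</span>')

# Hypothetical input: a 3-letter code, a 4-letter code and a malformed code.
sample = ('<span class="unitcode">ABC1234</span> '
          '<span class="unitcode">WXYZ5678</span> '
          '<span class="unitcode">AB12</span>')

codes = code_re.findall(sample)
print(codes)  # ['ABC1234', 'WXYZ5678']
```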

## 3.2 Extracting unit title

In [833]:
# empty list that the matched unit titles are added to
unit_title = []

# The unit title sits between "unitcode".*</span> - and <span class. To capture only the title, I've
# placed both delimiters in non-capturing groups. re.findall scans the whole file for the pattern.
title_pattern = re.findall(r'(?:\"unitcode\".*</span> -)(.*)(?:<span class)',file)

# extend adds every matched title to the unit_title list
unit_title.extend(title_pattern)
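
The title pattern keeps the space that follows the hyphen, so the captured titles may need trimming. The sketch below uses a hypothetical heading line to show the pattern from this section followed by `strip()`, and a tighter variant (an assumption, not the pattern used in this notebook) that trims the whitespace inside the regex.

```python
import re

# Hypothetical heading line in the same shape the title pattern expects.
line = ('<h1><span class="unitcode">ABC1234</span> - Intro to data'
        '<span class="sr-only">x</span></h1>')

# Pattern from this section; strip() removes the space after the hyphen.
loose = [t.strip() for t in
         re.findall(r'(?:"unitcode".*</span> -)(.*)(?:<span class)', line)]

# Tighter variant: [^<]* cannot run past the unit code, and \s* absorbs
# the whitespace around the title.
tight = re.findall(r'"unitcode">[^<]*</span>\s*-\s*(.*?)\s*<span class', line)

print(loose, tight)
```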

## 3.3 Extracting Pre-requisites

In [834]:
# empty list that holds the prerequisite / co-requisite matches of every unit
prereq2 = []
# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # re.findall looks for either prerequisites or co-requisites in the chunk.
    # Prerequisites start with Prerequisites</p>\n and end with </p>, so they are captured by
    # Prerequisites</p>\n*.*</p>; co-requisites are captured the same way by
    # Co-requisites</p>\n*.*</p>
    total_unit = re.findall(r'Prerequisites</p>\n*.*</p>|Co-requisites</p>\n*.*</p>',units)

    # the matches for this unit are appended to prereq2
    prereq2.append(total_unit)
    
# iterate over every index of prereq2
for units in range(len(prereq2)):
    # findall returns an empty list when a unit has no prerequisites or co-requisites,
    # so an empty list is replaced with "NA"
    if prereq2[units] == []:
        prereq2[units] = "NA"

# empty list that holds the unit codes found in the prerequisites and co-requisites
final_prereq2 = []
# iterate through the matches of every unit
for requi in prereq2:
    # re.findall picks out the unit codes (3 to 4 capital letters followed by 4 digits, so that
    # the 4-letter codes are not truncated). requi is a list (or "NA"), so it is converted
    # to a string for re.findall to search
    overall_u = re.findall('([A-Z]{3,4}[0-9]{4})',str(requi))
    # set removes duplicate unit codes; the set is then converted back to a list
    le = list(set(overall_u))
    # the de-duplicated list of unit codes is appended to final_prereq2
    final_prereq2.append(le)

# iterate over every index of final_prereq2
for requi in range(len(final_prereq2)):
    # an entry is an empty list when no unit codes were found,
    # so an empty list is replaced with "NA"
    if final_prereq2[requi] == []:
        final_prereq2[requi] = "NA"
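
Converting to a `set` removes duplicates but does not keep the order in which the codes appear. If the order matters, the two steps above (de-duplicate, then fall back to "NA") could be combined in one helper. This is a minimal sketch; `unique_codes` and the sample text are hypothetical names introduced for illustration.

```python
import re

CODE_RE = re.compile(r'[A-Z]{3,4}[0-9]{4}')

def unique_codes(text):
    """Return the unit codes in text, first occurrence first, or "NA" if none."""
    seen = []
    for code in CODE_RE.findall(text):
        if code not in seen:
            seen.append(code)
    return seen if seen else "NA"

# Hypothetical prerequisite block with one repeated code.
sample = ('Prerequisites</p>\n<p><a href="#">ABC1234</a> or '
          '<a href="#">WXYZ2345</a> and ABC1234</p>')

print(unique_codes(sample))       # ['ABC1234', 'WXYZ2345']
print(unique_codes('no codes'))   # NA
```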

## 3.4 Extracting prohibitions

In [835]:
# empty list that holds the prohibition match of every unit
prohibition1 = []

# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # The prohibitions start with Prohibitions</p> and end with </p>, so re.findall
    # searches the chunk for Prohibitions</p>\n*.*</p>
    prohibition = re.findall(r'Prohibitions</p>\n*.*</p>',units)
    # the match for this unit is appended to prohibition1
    prohibition1.append(prohibition)

# iterate over every index of prohibition1
for units in range(len(prohibition1)):
    # findall returns an empty list when a unit has no prohibitions,
    # so an empty list is replaced with "NA"
    if prohibition1[units] == []:
        prohibition1[units] = "NA"

# empty list that holds the prohibited unit codes of every unit
final_prohibition = []
# iterate through the prohibition match of every unit
for prohibition in prohibition1:
    # re.findall picks out the unit codes (3 to 4 capital letters followed by 4 digits).
    # prohibition is either a list of matches or "NA"; "NA"[0] is "N", which contains no unit code
    overall_prohibition = re.findall('([A-Z]{3,4}[0-9]{4})',prohibition[0])
    # set removes duplicate unit codes; the set is then converted back to a list
    set_prohibition = list(set(overall_prohibition))
    # the de-duplicated list of prohibited unit codes is appended to final_prohibition
    final_prohibition.append(set_prohibition)

# iterate over every index of final_prohibition
for prohibition in range(len(final_prohibition)):
    # an entry is an empty list when no unit codes were found,
    # so an empty list is replaced with "NA"
    if final_prohibition[prohibition] == []:
        final_prohibition[prohibition] = "NA"
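
The "search, then replace empty results with NA" step recurs in every section. One possible way to express it once is a small helper built on `re.search`, sketched below. `first_match` and the sample chunks are hypothetical; the notebook itself keeps the `re.findall` plus loop approach.

```python
import re

def first_match(pattern, text, default="NA", flags=0):
    """Return the first match of pattern in text, or default when there is none."""
    match = re.search(pattern, text, flags)
    return match.group(0) if match else default

# Hypothetical chunks: one with a prohibitions block, one without.
with_block = 'Prohibitions</p>\n<p>ABC1234, DEF5678</p>\n<h2>Synopsis</h2>'
without_block = '<h2>Synopsis</h2>\n<div>\n<p>Some text</p>'

pattern = r'Prohibitions</p>\n*.*</p>'
print(first_match(pattern, with_block))
print(first_match(pattern, without_block))  # NA
```

Because the helper always returns a string, later steps would not need to index into a list that may have been replaced by "NA".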


## 3.5 Extracting synopsis

In [836]:
# empty list that holds the synopsis match of every unit
synopsis1 = []
# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # The synopsis starts with Synopsis</h2>\n<div>\n<p> and ends with </p>. Both delimiters are
    # placed in non-capturing groups so that only the text between them is captured.
    synopsis = re.findall(r'(?:Synopsis</h2>\n<div>\n<p>)(\n*.*)(?:</p>)',units)
    # the synopsis match is appended to synopsis1
    synopsis1.append(synopsis)

# iterate over every index of synopsis1
for units in range(len(synopsis1)):
    # findall returns an empty list when a unit has no synopsis,
    # so an empty list is replaced with "NA"
    if synopsis1[units] == []:
        synopsis1[units] = "NA"

# iterate over every index of synopsis1
for j in range(len(synopsis1)):
    # each entry is either "NA" or a list whose first item is the matched text, so take that
    # text itself rather than the string form of the list (which keeps the brackets and quotes)
    text = synopsis1[j] if synopsis1[j] == "NA" else synopsis1[j][0]
    # re.sub removes anything between angle brackets, such as the unit links in the synopsis
    synopsis1[j] = re.sub('<.*?>',"",text)
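
The tag-stripping step can be checked on its own. The sketch below applies the same lazy `<.*?>` pattern to a made-up synopsis sentence; `strip_tags` is a hypothetical helper name.

```python
import re

TAG_RE = re.compile(r'<.*?>')

def strip_tags(html):
    """Remove every <...> tag and keep the text between the tags."""
    return TAG_RE.sub('', html)

# Hypothetical synopsis sentence containing a unit link.
sample = 'See <a href="units/ABC1234.html">ABC1234</a> for <em>background</em>.'
print(strip_tags(sample))  # See ABC1234 for background.
```

The `?` matters here: a greedy `<.*>` would match from the first `<` to the last `>` on the line and delete the text in between.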
    

## 3.6 Extracting Outcomes

In [857]:
# empty list that holds the outcomes block of every unit
output1 = []
# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # The outcomes block starts with Outcomes</h2>\n<div> and ends with ">Assessment. Both
    # delimiters are placed in non-capturing groups so that only the block between them is captured.
    # The block spans several lines, so [\s\S]*? is used to lazily match any character,
    # including newlines.
    output = re.findall(r'(?:Outcomes</h2>\n<div>)([\s\S]*?)(?:">Assessment)', units)
    # the outcomes block is appended to output1
    output1.append(output)

# iterate over every index of output1
for units in range(len(output1)):
    # findall returns an empty list when a unit has no outcomes block,
    # so an empty list is replaced with "NA"
    if output1[units] == []:
        output1[units] = "NA"

# empty list that holds the individual outcomes of every unit
final_output = []
# iterate through the outcomes block of every unit
for output in output1:
    # Each outcome sits between <li> and </li>, so the text between the two tags is captured.
    # [\s\S]*? lazily matches any character, including newlines, so an outcome can span several lines.
    # output is either a list of matches or "NA"; "NA"[0] is "N", which contains no <li> item.
    outputs = re.findall(r'(?:<li>)([\s\S]*?)(?:<\/li>)', output[0])
    final_output.append(outputs)

# iterate over every index of final_output
for output in range(len(final_output)):
    # an entry is an empty list when no <li> items were found,
    # so an empty list is replaced with "NA"
    if final_output[output] == []:
        final_output[output] = "NA"
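
The reason for `[\s\S]*?` is that a plain dot does not match a newline unless `re.DOTALL` is set. The sketch below compares the three options on a hypothetical outcomes block in which the first list item is split over two lines.

```python
import re

# Hypothetical outcomes block; the first <li> spans two lines.
sample = ('Outcomes</h2>\n<div>\n<ol>\n'
          '<li>explain\nregex syntax</li>\n'
          '<li>write parsers</li>\n'
          '</ol>\n</div>')

# A plain dot stops at the newline, so the two-line item is missed.
plain_dot = re.findall(r'<li>(.*?)</li>', sample)

# re.DOTALL and [\s\S] both let the match run across newlines.
with_dotall = re.findall(r'<li>(.*?)</li>', sample, re.DOTALL)
any_char = re.findall(r'<li>([\s\S]*?)</li>', sample)

print(plain_dot)
print(with_dotall == any_char)
```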

## 3.7 Extracting requirements

In [874]:
# empty list that holds the requirement block of every unit
requirement1 = []
# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # The requirement block starts with Assessment</h2>\n<div> and ends with </div>.
    # re.DOTALL makes the dot in the capturing group match newlines as well,
    # so the block can span several lines
    requirement = re.findall(r'Assessment</h2>\n<div>(.*?)</div>',units,re.DOTALL)
    # the block is appended to the requirement1 list
    requirement1.append(requirement)

# iterate over every index of requirement1
for units in range(len(requirement1)):
    # findall returns an empty list when a unit has no requirement block,
    # so an empty list is replaced with "NA"
    if requirement1[units] == []:
        requirement1[units] = "NA"
   
# empty list that holds the cleaned requirements of every unit
final_requirement = []
# iterate through the requirement block of every unit
for requirements in requirement1:
    # re.findall extracts each requirement between <p> and </p>
    overall_requirement = re.findall('(?:<p>)(.*)(?:</p>)',requirements[0])
    # re.sub removes any remaining tags from each requirement
    overall_req2 = [re.sub('<.*?>',"",req) for req in overall_requirement]
    # the cleaned list is appended to final_requirement
    final_requirement.append(overall_req2)
 
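# iterate over every index of final_requirement
for requirements in range(len(final_requirement)):
    # an empty list (no requirements found) is replaced with "NA"
    if final_requirement[requirements] == []:
        final_requirement[requirements] = "NA"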
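
As a worked example of the requirement steps (block, then paragraphs, then tag removal), the sketch below runs them on a hypothetical assessment block. The sample text and the percentages in it are made up for illustration.

```python
import re

# Hypothetical assessment block with inline markup in the second paragraph.
chunk = ('Assessment</h2>\n<div>\n'
         '<p>Exam: 60%</p>\n'
         '<p><strong>Assignments</strong>: 40%</p>\n'
         '</div>')

# Step 1: take everything between the opening <div> and the closing </div>.
block = re.findall(r'Assessment</h2>\n<div>(.*?)</div>', chunk, re.DOTALL)[0]

# Step 2: one entry per <p>...</p> line.
paragraphs = re.findall(r'<p>(.*)</p>', block)

# Step 3: strip the tags left inside each entry.
cleaned = [re.sub(r'<.*?>', '', p) for p in paragraphs]
print(cleaned)  # ['Exam: 60%', 'Assignments: 40%']
```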


## 3.8 Extracting Chief Examiners

In [875]:
# empty list that holds the chief examiner block of every unit
chief1 = []
# iterate through every unit chunk in unit_sum
for units in unit_sum:
    # The chief examiner block starts with Chief examiner(s)</p>\n<p> and ends with </p>.
    # Both delimiters are placed in non-capturing groups, and a capturing group
    # captures what lies between them
    examiner = re.findall(r'(?:Chief examiner\(s\)</p>\n<p>)(.*?)(?:</p>)',units,re.DOTALL)
    # the block is appended to the chief1 list
    chief1.append(examiner)

# iterate over every index of chief1
for units in range(len(chief1)):
    # findall returns an empty list when a unit has no chief examiner block,
    # so an empty list is replaced with "TBA"
    if chief1[units] == []:
        chief1[units] = "TBA"
        
# empty list that holds the chief examiner names of every unit
final_chief = []
# iterate through the chief examiner block of every unit
for chief in chief1:
    # each name sits between a > and </a>; the lookbehind and lookahead keep both
    # delimiters out of the captured name
    overall_chief = re.findall('(?<=>)(.*?)(?=</a>)',chief[0])
    # the names are appended to final_chief
    final_chief.append(overall_chief)

# iterate over every index of final_chief
for chief in range(len(final_chief)):
    # an entry is an empty list when no names were found,
    # so an empty list is replaced with "TBA"
    if final_chief[chief] == []:
        final_chief[chief] = "TBA"
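
A lookbehind on a bare `>` can start a match after any tag, not only after an opening `<a ...>`. When a unit lists several examiners separated by markup, that may pull the markup into the later names. The sketch below shows this on a hypothetical block (the names and URLs are made up) and compares it with a variant anchored on the `<a ...>` tag, which is an alternative and not the pattern used above.

```python
import re

# Hypothetical examiner block with two linked names separated by <br />.
block = ('<a href="http://example.org/a">Dr A Example</a><br />'
         '<a href="http://example.org/b">Prof B Sample</a>')

# Lookaround pattern from this section.
lookaround = re.findall(r'(?<=>)(.*?)(?=</a>)', block)

# Variant anchored on the opening <a ...> tag; [^<]* cannot cross a tag.
anchored = re.findall(r'<a[^>]*>([^<]*)</a>', block)

print(lookaround)
print(anchored)  # ['Dr A Example', 'Prof B Sample']
```

If the source file only ever has one examiner per block, both patterns return the same single name.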


## 4. Creating XML File

In [886]:
# creating file to write on 
file = open('unitguide.xml','w')
# starting index
i = 0
file.write('<units>\n')
# continue writing while count is less than length of units
while i < len(units_final):
    unitid='{}{}{}'.format("<unit id='",units_final[i],"'>\n")
    file.write(unitid)
    title='{}{}{}'.format("<title>",unit_title[i],"</title>\n")
    file.write(title)
    tem_synopsis = '<synopsis>' + synopsis1[i] + "</synopsis>\n"
    # re.sub brackets so that brackets are not shown
    synopsis= re.sub(r'\[|]','',tem_synopsis)
    file.write(synopsis) 
    if final_prereq2[i] == 'NA':
        prerequistic = '{}'.format('<pre_requistics> NA </pre_requistics>\n')
        file.write(prerequistic)
    else:
        file.write('<pre_requistics>\n')
        for k in range(len(final_prereq2[i])):
                file.write('<pre_requistic>'+ final_prereq2[i][k] + '</pre_requistic>')
        file.write('</pre_requistics>\n')
    if final_prohibition[i] == 'NA':
        prohibition = '{}'.format('<prohibisions> NA </prohibisions>\n')
        file.write(prohibition)
    else:
        file.write('<prohibisions>\n')
        for k in range(len(final_prohibition[i])):
            file.write('<prohibision>'+ final_prohibition[i][k] + '</prohibision>')
        file.write('</prohibisions>\n')
    if final_requirement[i] == 'NA':
        requirement = '{}'.format('<requirements> NA </requirements>\n')
        file.write(requirement)
    else:
        file.write('<requirements>\n')
        for k in range(len(final_requirement[i])):
            file.write('<requirement>'+ final_requirement[i][k] + '</requirement>')
        file.write('</requirements>\n')
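    # without this check the loop would iterate over the letters of "NA"
    if final_output[i] == 'NA':
        file.write('<outcomes> NA </outcomes>\n')
    else:
        file.write('<outcomes>\n')
        for k in range(len(final_output[i])):
            file.write('<outcome> '+ final_output[i][k] + '</outcome>')
        file.write('</outcomes>\n')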
    if final_chief[i] == 'TBA':
        chiefexam = '{}'.format('<chief_examiners> TBA </chief_examiners>\n')
        file.write(chiefexam)
    else:
        file.write('<chief_examiners>\n')
        for k in range(len(final_chief[i])):
            file.write('<chief_examiner>'+ final_chief[i][k] + '</chief_examiner>')
        file.write('</chief_examiners>\n')
            
    i +=1
    file.write('</unit>\n')
    
file.close()
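
Writing the tags by hand does not escape characters such as `&` or `<` that may appear in titles or synopses. As a possible alternative, the sketch below builds the same element layout with the standard library's `xml.etree.ElementTree`, which escapes text automatically. The unit data is hypothetical; the tag names are the ones written by the loop above.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit record; the title contains characters that need escaping.
root = ET.Element('units')
unit = ET.SubElement(root, 'unit', {'id': 'ABC1234'})
ET.SubElement(unit, 'title').text = 'Data & Text <intro>'

prereqs = ET.SubElement(unit, 'pre_requistics')
for code in ['WXYZ2345', 'DEF3456']:
    ET.SubElement(prereqs, 'pre_requistic').text = code

# tostring returns bytes on Python 3 and str on Python 2; decode works for both here.
xml_text = ET.tostring(root).decode('ascii')
print(xml_text)

# Parsing the result back recovers the original title.
parsed_title = ET.fromstring(xml_text).find('unit/title').text
```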


## 5. Creating JSON File

In [887]:

i = 0
sum_all = []
while i < len(units_final):
    summary = {'@id': units_final[i],
                           'title': unit_title[i],
                           'synopsis': synopsis1[i],           
                           'pre_requistics': 'NA' if final_prereq2[i] == 'NA' else {'pre_requistic': final_prereq2[i]},
                           'prohibisions':'NA' if final_prohibition[i] == 'NA' else {'prohibision': final_prohibition[i]},
                           'requirements':'NA' if final_requirement[i] == 'NA' else {'requirement': final_requirement[i]},
                           'outcomes': 'NA' if final_output[i] == 'NA' else {'outcome': final_output[i]},
                           'chief_examiners': 'TBA' if final_chief[i] == 'TBA' else {'chief_examiner': final_chief[i]}
              }
    sum_all.append(summary)
    i +=1

unit_dict = {"unit":sum_all}
units_dict={"units":unit_dict}

with open('unitguide.json','w') as f:
    json.dump(units_dict,f,indent = 5)
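
A round trip through `json.dumps` and `json.loads` is a simple way to confirm that the nested structure survives serialisation. The sketch below uses a hypothetical one-unit dictionary in the same `units` / `unit` layout as above.

```python
import json

# Hypothetical one-unit structure in the same layout as units_dict.
sample_dict = {
    "units": {
        "unit": [
            {
                "@id": "ABC1234",
                "title": "Example unit",
                "pre_requistics": "NA",
                "outcomes": {"outcome": ["explain regex syntax"]},
            }
        ]
    }
}

# Serialise to text, then parse the text back into Python objects.
text = json.dumps(sample_dict, indent=4, sort_keys=True)
loaded = json.loads(text)

# Collect the unit ids from the parsed structure.
ids = [unit["@id"] for unit in loaded["units"]["unit"]]
print(ids)  # ['ABC1234']
```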