# FIT5196 Assessment 1: Task 1
#### Student Name: Isobel Rowe
#### Student ID: 30042585

Date: 11/04/2019

Version: 3.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* re
* json

## 1. Introduction

This task of assignment 1 in FIT5196 deals with the extraction of data from semi-structured files. It involves using the provided file to extract and transform data into the XML and JSON formats. Broadly, the steps are as follows:

1. Opening the file and splitting it into one block of text per unit.
2. Extracting the relevant fields from each block, including:
    * unit ID
    * title
    * synopsis
    * pre-requisites
    * prohibitions
    * outcomes
    * chief examiners
3. Transforming the parsed data into the specified formats.

Each step is described in more detail in the following sections.

## 2. Import Libraries

In [1]:
import re
import json

## 3. Opening file and initial extraction

To begin, the .txt file is opened. The information for each unit starts with the line '<div class="content-inner__main">', so a loop runs through the file and uses the .startswith() method to detect where each unit begins. Each unit's lines are then collected into a single element of a list.

In [2]:
# Defining an empty list for the information to be appended to.
htmlelements = []

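#Opening the file and executing a loop
with open('30042585.txt') as file:
    for line in file:
        if line.startswith('<div class="content-inner__main">'):
            # A new unit starts here
            htmlelements.append(line)
        elif htmlelements:
            # Continuation line: add it to the current unit (lines before the first unit are skipped)
            htmlelements[-1] += line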

#Verification that the loop worked and all sections have been properly extracted
print("Length: ", len(htmlelements))

Length:  400
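
The grouping logic can be tried on a small in-memory sample before running it on the full file. The sketch below is illustrative only: the sample text and the `split_into_units` helper are assumptions made for this example and are not part of the assignment file.

```python
import io

UNIT_START = '<div class="content-inner__main">'

def split_into_units(lines):
    """Group lines into blocks, starting a new block at each unit marker."""
    blocks = []
    for line in lines:
        if line.startswith(UNIT_START):
            blocks.append(line)
        elif blocks:
            # Continuation line: attach it to the most recent unit block.
            blocks[-1] += line
        # Lines that appear before the first marker are ignored.
    return blocks

# Hypothetical two-unit sample standing in for the real .txt file.
sample = io.StringIO(
    'preamble\n'
    '<div class="content-inner__main">\n<h1>AAA1000 - First unit</h1>\n'
    '<div class="content-inner__main">\n<h1>BBB2000 - Second unit</h1>\n'
)
units = split_into_units(sample)
print(len(units))
```

Checking the block count against the expected number of units, as done above with `len(htmlelements)`, is a quick way to confirm that no unit was merged into its neighbour.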


## 4. Secondary extraction with regex

Next, using regular expressions, the relevant information (as per the assignment specifications) is extracted. Each piece of information is appended into a separate list, which is then validated by checking that its length is equal to 400, the number of units.

The regular expressions used here differ from one another, but most use positive lookbehinds, lookaheads, or both. These are useful because the relevant information can usually be isolated by matching everything between two HTML tags.
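
As a minimal illustration of how the lookarounds behave, the sketch below runs two patterns of this kind on a made-up fragment. The fragment is an assumption and the real file's markup may differ. The patterns are written as raw strings (`r'...'`) so that backslashes reach the regex engine unchanged.

```python
import re

# Hypothetical fragment; the real file's markup may differ.
snippet = '<span class="unitcode">FIT5196</span> - Data wrangling<span class="x">'

# Lookbehind: the match must be preceded by '"unitcode">', which is not returned.
demo_code = re.findall(r'(?<="unitcode">)\w{3,4}\d{4}', snippet)

# Lookbehind plus lookahead: keep only the text between '- ' and '<span class='.
demo_title = re.findall(r'(?<=-\s)\w.*(?=<span class=)', snippet)

print(demo_code, demo_title)
# ['FIT5196'] ['Data wrangling']
```

Because the lookbehind and lookahead are zero-width, only the text between them ends up in the result, which is why no tag-stripping is needed afterwards.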

### Unit ID / title / synopsis extraction

In [3]:
# Unit ID
unitid = []
for element in htmlelements:
    unitre = re.findall(r'(?<="unitcode">)\w{3,4}\d{4}', element) #unitcodes
    unitid.append(unitre)
print("Length of unit id: ", len(unitid))

# Title
title = []
for element in htmlelements:
    titlere = re.findall(r'(?<=-\s)\w.*(?=<span class=)', element) #unit title
    title.append(titlere)
print("Length of titles: ", len(title))

# Synopsis
synopsis = []
for element in htmlelements:
    synopsisre = re.findall(r'(?<=Synopsis</h2>\n<div>\n<p>)\w.*(?=</p>)', element) #synopsis
    # Ensuring any null values are filled with 'NA'
    if not synopsisre:
        synopsis.append('NA')
    else:
        synopsis.append(synopsisre)
print("Length of synopsis: ", len(synopsis))


Length of unit id:  400
Length of titles:  400
Length of synopsis:  400


### Prerequisite extraction

Here, it starts to get a little trickier. As there is no defined layout for pre-requisites in the HTML file, an initial step captures everything from the word 'Prerequisites' up to the next '<p class' tag, using the lookaheads (?=Prerequisites) and (?=<p class). Next, the unit codes are found within this initial extraction. Finally, any duplicate values in a single element are removed using a set to ensure uniqueness.

In [4]:
# Initial extraction between the tags
prereqs1 = []
for element in htmlelements:
    prereqsre1 = re.findall('(?=Prerequisites)(.*?)(?=<p class)', element, flags=re.DOTALL)
    prereqs1.append(prereqsre1)
    
# Second extraction of unit codes
prereqs2 = []
for element in prereqs1:
    prereqsre = re.findall('\w{3,4}\d{4}', str(element))
    prereqs2.append(prereqsre)
    if not prereqsre:
        prereqs2.pop()
        prereqs2.append('NA')

# Removing duplicate values
prereqs = []
for element in prereqs2:
    if element == 'NA':
        prereqs.append(element)
    else:
        setelement = set(element)
        prereqs.append(list(setelement))
        
# Verification
print("Length of pre-requisites: ", len(prereqs))

Length of pre-requisites:  400
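
One side effect of using a set is that the original order of the unit codes is not kept. If the order matters, a possible alternative is sketched below. The `unique_in_order` helper and the unit codes are hypothetical and used only for illustration.

```python
from collections import OrderedDict

def unique_in_order(codes):
    """Remove duplicates while keeping the first occurrence of each code."""
    return list(OrderedDict.fromkeys(codes))

print(unique_in_order(['FIT1000', 'FIT2000', 'FIT1000', 'FIT3000']))
# ['FIT1000', 'FIT2000', 'FIT3000']
```

`OrderedDict.fromkeys` is used here rather than a plain `dict` so that the ordering does not rely on the behaviour of a particular Python version.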


### Prohibition extraction
Similar to the pre-requisite extraction, this is a little more difficult as the prohibitions don't all follow the same format. Once again, the easiest way is to first extract everything between the two markers ('Prohibitions' and the next '<h2' tag), and then extract the actual unit codes from that text.

In [5]:
#Initial regex for extraction
prohibs1 = []
for element in htmlelements:
    prohibsre1 = re.findall('(?=Prohibitions)(.*?)(?=<h2)', element, flags=re.DOTALL) #prohibitions
    prohibs1.append(prohibsre1)

#Regex for the unit codes
prohibs2 = []
for element in prohibs1:
    prohibsre = re.findall('\w{3,4}\d{4}', str(element))
    prohibs2.append(prohibsre)
    if not prohibsre:
        prohibs2.pop()
        prohibs2.append('NA')
        
#Removing duplicate values
prohibs = []
for element in prohibs2:
    if element == 'NA':
        prohibs.append(element)
    else: 
        setelement = set(element)
        prohibs.append(list(setelement))

#Checking that all elements have been captured
print("Length of prohibitions: ", len(prohibs))

Length of prohibitions:  400


### Requirement / chief examiner extraction


In [6]:
# Requirements
reqs = []
for element in htmlelements:
    reqsre = re.findall('(?<=requirements<\/h2>\n<div>\n<p>)\w.*(?=<\/p><p>)', element) 
    reqs.append(reqsre)
    if not reqsre:
        reqs.pop()
        reqs.append('NA')
        
print("Length of requirements: ", len(reqs))

# Chief Examiner
chiefex = []
for element in htmlelements:
    chiefexre = re.findall('(?<=\">)\w.*(?=</a>\n<br/>\n</p>\n<p class=\"hbk-highlight-heading\">C)', element)
    chiefex.append(chiefexre)
    if not chiefexre:
        chiefex.pop()
        chiefex.append('TBA')
        
print("Length of chief examiners: ", len(chiefex))

Length of requirements:  400
Length of chief examiners:  400


### Outcome extraction
Once the outcome section is extracted, each individual element is split on the dividing tags and appended into a new list.

In [7]:
#Initial extraction
outcomes1 = []
for element in htmlelements:
    outcomesre = re.findall("(?<=Outcomes<\/h2>)(.*?)(?=<\/ol>)", element, flags= re.MULTILINE|re.DOTALL) 
    outcomes1.append(outcomesre)
    if not outcomesre:
        outcomes1.pop()
        outcomes1.append('NA')

# Extracting everything after 'type="1">'
outcomes2 = []        
for element in outcomes1:
    outcomesre2 = re.findall('type="1">(.*)' , str(element), flags= re.MULTILINE|re.DOTALL)
    outcomes2.append(outcomesre2)
    if not outcomesre2:
        outcomes2.pop()
        outcomes2.append('NA')
        
        
# Splitting on the tag 
outcomes3 = []
for element in outcomes2:
    outcomes3.append(str(element).split('</li>\\\\n<li>'))
    
# Validation    
print("Length of outcomes: ", len(outcomes3))
print(outcomes3[1])


Length of outcomes:  400
['["\\\\n<li>Validate the scientific literature to comprehend the progress within a specific research area;', 'Identify and apply the processes involved in the design, development and implementation of a research project;', 'Design, develop and implement a research project;', 'Acquire and analyse computer based data for graphical and tabular summarisation of findings;', 'Summarise research outcomes into scientific manuscripts in accordance with publication requirements;', 'Effectively and clearly communicate scientific principles and research findings in verbal and written form to a broad audience;', 'Identify and select techniques that are essential to the satisfactory completion and reporting of a research project', 'Apprehend the significance of ethics, laboratory etiquette and adherence to OHS', 'Recognize a critical problem and formulate a hypothesis to solve it; and', 'Interpret the research findings with reference to the existing literature</li>\\\\n\']"

As can be seen in the above output, there is a problem at the start and end of each element with some tags that have been left behind. To fix this, the re.sub() method is used with various regular expressions.


In [8]:
outcomes = []
for element in outcomes3:
    # Tags at the beginning
    element = re.sub('\[\"', '', str(element))
    element = re.sub('\\\\', '', str(element))
    element = re.sub('n<li>', '', str(element))
    # Tags at the end
    element = re.sub('</li>n', '', str(element))
    element = re.sub('\'\]\"\]', '', str(element))
    # Removing leftover symbols
    element = re.sub('\[\'|\'\]', '', str(element))
    element = re.sub('(?<=\s)\'|\'(?=\W)', '', str(element))
  
    outcomes.append(element)
    
# Validation
print("Length of outcomes: ", len(outcomes))
print(outcomes[1])


Length of outcomes:  400
Validate the scientific literature to comprehend the progress within a specific research area;, Identify and apply the processes involved in the design, development and implementation of a research project;, Design, develop and implement a research project;, Acquire and analyse computer based data for graphical and tabular summarisation of findings;, Summarise research outcomes into scientific manuscripts in accordance with publication requirements;, Effectively and clearly communicate scientific principles and research findings in verbal and written form to a broad audience;, Identify and select techniques that are essential to the satisfactory completion and reporting of a research project, Apprehend the significance of ethics, laboratory etiquette and adherence to OHS, Recognize a critical problem and formulate a hypothesis to solve it; and, Interpret the research findings with reference to the existing literature
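
Most of the clean-up above is needed because lists are converted with str(), which adds brackets, quotes and escaped newlines. A possible alternative, sketched here on a made-up block (the markup is an assumption), is to capture each list item directly from the raw string:

```python
import re

# Hypothetical outcomes block; the real markup may differ.
block = ('<h2>Outcomes</h2>\n<ol type="1">\n'
         '<li>Design a research project;</li>\n'
         '<li>Report the findings</li>\n</ol>')

# Capture the text of each list item directly from the raw string,
# so no list is converted with str() and no escapes need cleaning.
items = [item.strip() for item in re.findall(r'<li>(.*?)</li>', block, flags=re.DOTALL)]
print(items)
# ['Design a research project;', 'Report the findings']
```

This also keeps each outcome as a separate list element, rather than one long comma-separated string.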


## 5. Creating the JSON file

### Formatting elements 

The next step is creating the JSON file, which requires a little more processing. For certain elements, where the value is not equal to 'NA', the value needs to be nested, for example:

  "pre_requistics": {
                         "pre_requistic": [
                              "SCU1021",
                              "SCU1611",
                              "SCU1612",
                              "OHS1000",
                              "SCU1022"
                         ]
                    },

So, a loop is used to run through the initial lists. If the element is equal to 'NA' or 'TBA', it is appended to the final JSON list as it is. Otherwise, the element is put into a dictionary object (which fixes the formatting), and the dictionary is appended to the final JSON list.

In [9]:
# Prerequisites
prereqsjson=[]

for element in prereqs:
    prereqs_dict = {}
    if element == 'NA':
        prereqsjson.append(element) 
    else:
        prereqs_dict["pre_requistic"]=element
        prereqsjson.append(prereqs_dict) 

# Prohibitions        
prohibsjson=[]
for element in prohibs:
    prohibs_dict = {}
    if element == 'NA':
        prohibsjson.append(element) 
    else:
        prohibs_dict["prohibision"]=element
        prohibsjson.append(prohibs_dict)
        
# Requirements
reqsjson=[]
for element in reqs:
    reqs_dict = {}
    if element == 'NA':
        reqsjson.append(element) 
    else:
        reqs_dict["requirement"]=element
        reqsjson.append(reqs_dict)

# Outcomes
outcomejson=[]
for element in outcomes:
    outcome_dict = {}
    if element == 'NA':
        outcomejson.append(element) 
    else:
        outcome_dict["outcome"]=element
        outcomejson.append(outcome_dict)
        
# Chief examiners
chiefexjson=[]
for element in chiefex:
    chiefex_dict = {}
    if element == 'TBA':
        chiefexjson.append(element) 
    else:
        chiefex_dict["chief_examiner"]=element
        chiefexjson.append(chiefex_dict)
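
The five loops above all follow the same pattern, so they could be folded into one helper. The `nest` function below is a hypothetical sketch of that idea and is not the code used to produce the submitted files.

```python
def nest(values, key, missing='NA'):
    """Wrap each non-missing value in a one-key dictionary."""
    return [v if v == missing else {key: v} for v in values]

# Hypothetical sample values.
sample_prereqs = [['FIT1000', 'FIT2000'], 'NA']
print(nest(sample_prereqs, 'pre_requistic'))
# [{'pre_requistic': ['FIT1000', 'FIT2000']}, 'NA']
```

For the chief examiners, the same helper would be called with `missing='TBA'`.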

### Creating dictionary and writing to file
Now, using these newly formatted lists, along with the original unitid, title, and synopsis lists, we can create the JSON file.

The steps are:

1. Zip all the information elements together to create the values.
2. Create a list of all the final keys.
3. Make a loop to create the key:value pairs using another zip(), and append to a list.
4. Format the JSON data to add "units" and "unit" tags as per the example JSON file.
5. Write the final dictionary to a temporary file using json.dump().
6. Remove unwanted symbols and write final file.

In [10]:
# Defining the keys and values 
zipvalues = list(zip(unitid, title, synopsis, prereqsjson, prohibsjson, reqsjson, outcomejson, chiefexjson))
keys = ['@id', 'title', 'synopsis', 'pre_requistics', 'prohibisions', 'requirements', 'outcomes', 'chief_examiners']

# Creating a loop to add all elements to a list
jsondict1 = []
for i in range(len(zipvalues)):
    dictionary = dict(zip(keys, zipvalues[i]))
    jsondict1.append(dictionary)
    
#Formatting the json data
jsondict2 = {}
jsondict2['unit']=jsondict1
jsondict3 = {}
jsondict3['units']=jsondict2

# Writing to the file
with open('jsontemp.json', 'w') as outfile:   
    json.dump(jsondict3, outfile, indent=4)
    
    

Finally, the unwanted square brackets around the unitid, title, and synopsis elements are removed.

In [11]:
# Opening the file and reading
with open('jsontemp.json', 'r' ) as f:
    content = f.read()
    # using the regex to remove unwanted symbols
    content_new1 = re.sub('(?<=@id":\s)\[', '', content, flags = re.MULTILINE | re.DOTALL)
    content_new2 = re.sub('(?<=title":\s)\[', '', content_new1, flags = re.MULTILINE | re.DOTALL)
    content_new3 = re.sub('(?<=synopsis":\s)\[', '', content_new2, flags = re.MULTILINE | re.DOTALL)
    content_new4 = re.sub('\](?=,)', '', content_new3, flags = re.MULTILINE | re.DOTALL)

# Writing to the final file
with open('30042585.json', 'w') as outfile:
    outfile.write(content_new4)
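
The square brackets appear because re.findall() returns a list even when there is only one match. An alternative to editing the written file with regular expressions would be to unwrap those single-item lists before calling json.dump(). The sketch below assumes a hypothetical `unwrap` helper and a made-up record. It would only be applied to the unit ID, title and synopsis fields, since a single pre-requisite should stay inside its list.

```python
import json

def unwrap(value):
    """Return the sole item of a one-element list, otherwise the value unchanged."""
    if isinstance(value, list) and len(value) == 1:
        return value[0]
    return value

# Hypothetical record mirroring what re.findall returns for one unit.
record = {'@id': ['FIT5196'], 'title': ['Data wrangling'], 'synopsis': 'NA'}
clean = {key: unwrap(value) for key, value in record.items()}
json_text = json.dumps({'units': {'unit': [clean]}}, indent=4)
print(json_text)
```

With this approach no temporary file is needed, and the JSON structure is never touched as plain text after it has been serialised.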


## 6. Creating the XML file

In order to make the XML file, a string object is made for every relevant piece of information. The string objects include the start and end tags as they will appear in the final XML file. A placeholder ('=split=') is used so that the string can finally be broken up into a list.

### Unit ID / title / synopsis string creation


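The aim here is to remove the single and double quotes that wrap each list element without deleting apostrophes inside the text itself, so the regular expressions only match quotes that sit directly next to a square bracket.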

In [12]:
# Unit ID
unitidstring = ''
for element in unitid:
    unitidstring = unitidstring + "=split=  <unit id="  + str(element) + ">\n"
    
#Removing square brackets and replacing single quotes with doubles
unitidstring = re.sub('\[|\]', '', unitidstring)
unitidstring = re.sub('\'', '\"', unitidstring)

unitidxml = unitidstring.split('=split=')
del unitidxml[0]
print("Length of unitid: ", len(unitidxml))
    
    
# Title 
titlestring = ''
for element in title:
    titlestring = titlestring + "=split=    <title>\n    "  + str(element) + "\n    </title>\n"

#Removing single quotes
titlestring = re.sub('(?<=\[)\'', '', titlestring)
titlestring = re.sub('"(?=\])|\'(?=\])', '', titlestring)

titlexml = titlestring.split('=split=')
del titlexml[0]
print("Length of title: ", len(titlexml))


# Synopsis
synopsisstring = ''
for element in synopsis:
    synopsisstring = synopsisstring + "=split=    <synopsis>\n    " + str(element) + "\n    </synopsis> \n"

synopsisxml = synopsisstring.split('=split=')
del synopsisxml[0]
print("Length of synopsis: ", len(synopsisxml))

Length of unitid:  400
Length of title:  400
Length of synopsis:  400
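
The placeholder-and-split step could also be skipped by appending each fragment to a list as it is built. The sketch below shows the idea for the title only. The `wrap_titles` helper and the sample titles are hypothetical.

```python
def wrap_titles(titles):
    """Build one '<title>' fragment per unit without a placeholder."""
    fragments = []
    for found in titles:
        # re.findall returns a list; join its items instead of calling str() on it,
        # so no brackets or quotes have to be removed afterwards.
        text = ' '.join(found)
        fragments.append("    <title>\n    " + text + "\n    </title>\n")
    return fragments

sample_titles = [['Data wrangling'], ['Research project']]
titlexml_demo = wrap_titles(sample_titles)
print(len(titlexml_demo))
# 2
```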


### Pre-requisites / prohibitions string creation

In [13]:
# Prerequisites
prereqstring = ''

for element in prereqs:
    prereqstring = prereqstring + "=split="
    if element == "NA":
        prereqstring = prereqstring + "    <pre_requistics> "
        prereqstring = prereqstring + "NA"
        prereqstring = prereqstring + " </pre_requistics>\n"     
    else:
        prereqstring = prereqstring + "    <pre_requistics>\n"
        for element2 in element:
            prereqstring = prereqstring + "     <pre_requistic> " + element2 + " </pre_requistic>\n"
        prereqstring = prereqstring + "    </pre_requistics>\n"

prereqxml = prereqstring.split('=split=')
del prereqxml[0]
print("Length of pre-requisites: ", len(prereqxml))


# Prohibitions
prohibstring = ''

for element in prohibs:
    prohibstring = prohibstring + "=split="
    if element == "NA":
        prohibstring = prohibstring + "    <prohibisions> "
        prohibstring = prohibstring + "NA"
        prohibstring = prohibstring + " </prohibisions> \n" 
    else:
        prohibstring = prohibstring + "    <prohibisions>\n"
        for element2 in element:
            prohibstring = prohibstring + "     <prohibision> " + element2 + " </prohibision>\n"
        prohibstring = prohibstring + "    </prohibisions>\n"

prohibxml = prohibstring.split('=split=')
del prohibxml[0]
print("Length of prohibitions: ", len(prohibxml))

Length of pre-requisites:  400
Length of prohibitions:  400


### Requirements string creation

In [14]:
reqstring = ''

for element in reqs:
    reqstring = reqstring + "=split=" 
    if element == "NA":
        reqstring = reqstring + "    <requirements> "
        reqstring = reqstring + "NA"
        reqstring = reqstring + " </requirements> \n"
    else:
        reqstring = reqstring + "    <requirements>   \n" 
        reqstring = reqstring + "     <requirement> " + str(element) + " </requirement>\n"
        reqstring = reqstring + "    </requirements> \n"

reqsxml = reqstring.split('=split=')
del reqsxml[0]
print("Length of requirements: ", len(reqsxml))

Length of requirements:  400


### Outcomes string creation

In [16]:
outcomestring = ''

for element in outcomes3:
    outcomestring = outcomestring + "=split="
    
    if element == ["NA"]:  # str.split() returns a list, so a missing value arrives as ['NA']
        outcomestring = outcomestring + "    <outcomes> "
        outcomestring = outcomestring + "NA"
        outcomestring = outcomestring + " </outcomes>\n"
    else:
        outcomestring = outcomestring + "    <outcomes> \n"
        for element2 in element:
            outcomestring = outcomestring + "     <outcome> " + element2 + " </outcome>\n"
        outcomestring = outcomestring + "    </outcomes>\n"        

        
outcomexml = outcomestring.split('=split=')
del outcomexml[0]
print("Length of outcomes: ", len(outcomexml))

Length of outcomes:  400


### Chief examiners string creation

In [17]:
chiefexstring = ''

for element in chiefex:
    chiefexstring = chiefexstring + "=split="
    if element == "TBA":
        chiefexstring = chiefexstring + "    <chief_examiners> "
        chiefexstring = chiefexstring + "TBA"
        chiefexstring = chiefexstring + " </chief_examiners>\n"
        chiefexstring = chiefexstring + "  </unit>\n"
    else:
        chiefexstring = chiefexstring + "    <chief_examiners>\n" 
        chiefexstring = chiefexstring + "     <chief_examiner> " + str(element) + " </chief_examiner>\n"
        chiefexstring = chiefexstring + "    </chief_examiners>\n"
        chiefexstring = chiefexstring + "  </unit>\n"
    
chiefexxml = chiefexstring.split('=split=')
del chiefexxml[0]
print("Length of chief examiners: ", len(chiefexxml))


Length of chief examiners:  400


### Combining strings and writing to file

The last step is to combine all of the strings together into a final string and write it all to a file.
In addition, there is a lot of leftover formatting from previous sections, such as '\\\\n', and leftover tags such as '\<p>'. To clean this up, regular expressions are used with re.sub() to replace any leftover symbols.

In [21]:
length = len(unitid)
index = 0
finalxmlstring = ''
finalxmlstring = finalxmlstring + "<units>\n"

# Creating a while loop to join the XML fragments of each unit in order
while index <= length-1:
    finalxmlstring = finalxmlstring + unitidxml[index]
    finalxmlstring = finalxmlstring + titlexml[index]
    finalxmlstring = finalxmlstring + synopsisxml[index]
    finalxmlstring = finalxmlstring + prereqxml[index]
    finalxmlstring = finalxmlstring + prohibxml[index]
    finalxmlstring = finalxmlstring + reqsxml[index]
    finalxmlstring = finalxmlstring + outcomexml[index]
    finalxmlstring = finalxmlstring + chiefexxml[index]
    index+=1
    
finalxmlstring = finalxmlstring + "</units>"

# Substituting the unwanted symbols
# Quotes next to a square bracket are removed first, while the brackets are still present for the lookarounds to match
finalxmlstring = re.sub('(?<=\[)\"|(?<=\[)\'', '', finalxmlstring)
finalxmlstring = re.sub('"(?=\])|\'(?=\])', '', finalxmlstring)
finalxmlstring = re.sub('\[', '', finalxmlstring)
finalxmlstring = re.sub('\]', '', finalxmlstring)
finalxmlstring = re.sub('<li>|</li>', '', finalxmlstring)
finalxmlstring = re.sub('\\\\n\'|\\\\n', '', finalxmlstring)
finalxmlstring = re.sub('\\\\', '', finalxmlstring)
finalxmlstring = re.sub('<p>|</p>', '', finalxmlstring)
finalxmlstring = re.sub('<ul>|</ul>', '', finalxmlstring)
finalxmlstring = re.sub('<ol princestart=\"0\" start=\"1\" type=\"i\">', '', finalxmlstring)


# Writing the final string to the XML file
with open("30042585.xml", "w") as text_file:
    text_file.write(finalxmlstring)
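
For comparison, a shorter route would be to build the tree with the standard library's xml.etree.ElementTree module, which also escapes characters such as '&' automatically. The sketch below uses made-up records and only two of the fields. It does not reproduce the exact whitespace layout required by the assessment, which is why the string-based approach was kept for the submitted file.

```python
import xml.etree.ElementTree as ET

# Hypothetical records; in the notebook these would come from the extracted lists.
records = [
    {'id': 'AAA1000', 'title': 'Research & practice', 'prereqs': ['BBB1000', 'CCC1000']},
    {'id': 'DDD2000', 'title': 'Second unit', 'prereqs': 'NA'},
]

root = ET.Element('units')
for rec in records:
    unit = ET.SubElement(root, 'unit', id=rec['id'])
    ET.SubElement(unit, 'title').text = rec['title']
    prereqs = ET.SubElement(unit, 'pre_requistics')
    if rec['prereqs'] == 'NA':
        # Missing values are written as plain text inside the parent tag.
        prereqs.text = 'NA'
    else:
        # One child element per unit code.
        for code in rec['prereqs']:
            ET.SubElement(prereqs, 'pre_requistic').text = code

xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```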

## 7. Summary

This task has demonstrated the basics of parsing raw text files in Python. The main outcomes achieved while executing this task were: applying regular expressions for data extraction, string manipulation techniques, and JSON and XML file writing.

There are things I could have done in a more 'pythonic' or quicker way. For instance, writing the XML file was time-consuming - I could have written the information to an XML file in a much shorter way, but the exact formatting specified in the assessment requirements took careful consideration to get the information nested in the perfect way. But ultimately, the XML and JSON files were completed.

## 8. References

*Regular Expressions 101*. Retrieved from https://regex101.com/

Python Software Foundation. (2019). *json — JSON encoder and decoder*. Retrieved from
https://docs.python.org/2/library/json.html#module-json
