## Importing libraries and dataset


- This section downloads the Old Bailey Online XML archive (OBO_XML_7-2.zip, hosted at orda.shef.ac.uk) and unzips it into the dataset/data directory.

- The archive holds one XML file per session date, and each file can contain several trials.

- A list ('xml_files') stores the paths of all the files. It is later iterated over: each file is opened and the relevant data about its trials is collected (function 'process_trials_on_file').

- The function goes through the XML file and registers each trial inside it, gathering the data of each trial in a dictionary and appending the dictionaries to a list.

- The list is then saved as JSON ('extracted_data.json') for storage.

In [None]:
import json
import os

import numpy as np
import pandas as pd
from lxml import etree
from matplotlib import pyplot as plt

# wget and unzip are run as shell commands (!wget, !unzip) in the next cell,
# so they need no Python imports.

In [None]:
# Download the archive, then unzip the outer file and the inner OBO_XML_7-2.zip
# TODO: un-hardcode the paths
!wget https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2
!unzip 2 -d dataset
!unzip dataset/OBO_XML_7-2.zip -d dataset/data

--2025-01-21 18:13:33--  https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2
Resolving orda.shef.ac.uk (orda.shef.ac.uk)... 2a05:d018:1f4:d003:af1a:8116:f2e7:d5e9, 2a05:d018:1f4:d000:3f3e:dee3:afb0:cb4b, 54.77.156.78, ...
Connecting to orda.shef.ac.uk (orda.shef.ac.uk)|2a05:d018:1f4:d003:af1a:8116:f2e7:d5e9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 315238180 (301M) [application/zip]
Saving to: '2'

Archive:  2
 extracting: dataset/OB_xml_notes.docx  
 extracting: dataset/OBO_XML_7-2.zip  
Archive:  dataset/OBO_XML_7-2.zip
  inflating: dataset/data/licence.txt  
  inflating: dataset/data/listOA.txt  
  inflating: dataset/data/listOBP.txt  
  inflating: dataset/data/OB_xml_notes.docx  
   creating: dataset/data/ordinarysAccounts/
  inflating: dataset/data/ordinarysAccounts/.DS_Store  
  inflating: dataset/data/ordinarysAccounts/OA16760517.xml  
  inflating: dataset/data/ordinarysAccounts/OA16760705.xml  
  inflating: dataset/data/ordinarysAccounts/OA16760830.xml  
  inflating: dataset/data/ordinarysAccounts/OA16761025.xml  
  inflating: dataset/data/ordinarysAccounts/OA16770316.xml  
  inflating: dataset/data/ordinarysAccounts/OA16770504.xml  
  inflating: dataset/data/ordinarysAccounts/OA16771017.xml  
  inflating: dataset/data/ordinarysAccounts/OA16771219.xml  
  inflating: dataset/data/ordinarysAccounts/OA16780123.xml  
  inflating: dataset/data/ordinarysAccounts/OA16780306.xml 
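
The shell commands above depend on `wget` and `unzip` being installed. As a possible alternative, here is a minimal sketch of the same two-step extraction using only the Python standard library. The helper names `download` and `extract_nested` and the file name `dataset/OBO_download.zip` are assumptions made for illustration; the URL and the inner archive name come from the cell above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget call above.
ARCHIVE_URL = "https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2"


def download(url, target):
    """Download url to target unless the file already exists (hypothetical helper)."""
    target = Path(target)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, target)
    return target


def extract_nested(outer_zip, dest_dir, inner_name, inner_dest):
    """Unzip outer_zip into dest_dir, then unzip the inner archive it contains."""
    dest_dir, inner_dest = Path(dest_dir), Path(inner_dest)
    with zipfile.ZipFile(outer_zip) as outer:
        outer.extractall(dest_dir)
    with zipfile.ZipFile(dest_dir / inner_name) as inner:
        inner.extractall(inner_dest)
    return inner_dest


# Usage (commented out because the archive is about 300 MB):
# outer = download(ARCHIVE_URL, "dataset/OBO_download.zip")  # hypothetical file name
# extract_nested(outer, "dataset", "OBO_XML_7-2.zip", "dataset/data")
```

Skipping the download when the file already exists makes the cell safe to re-run without fetching the archive again.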

In [None]:
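# Creates a sorted list with the paths of all XML files
# (os.path.join avoids backslash escapes and works on any operating system)
folder_path = os.path.join('dataset', 'data', 'sessionsPapers')
xml_files = sorted(os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.xml'))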
xml_files
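
The same listing can also be written with `pathlib`, which handles path separators on every platform. This is only a sketch; `list_xml_files` is a hypothetical helper name, not part of the notebook.

```python
from pathlib import Path


def list_xml_files(folder):
    """Return the paths of the .xml files directly inside folder, sorted by name."""
    return sorted(str(p) for p in Path(folder).glob("*.xml"))


# Usage with the notebook's directory layout:
# xml_files = list_xml_files(Path("dataset") / "data" / "sessionsPapers")
```

If the file names begin with the session date, as the ordinarysAccounts names in the unzip output do (for example OA16760517.xml), sorting by name also yields chronological order.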

In [None]:
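# Gets each trial's info from a parsed XML file and adds it to a list
def process_trials_on_file(tree, data_list):
    '''
    Extracts the info of every trial in one XML file and appends one dictionary per trial to a list.

    Args:
        tree: the parsed XML document tree.
        data_list: the list that receives the trial dictionaries (modified in place).

    Every field except trial_id and trial_date is a list, because XPath returns all matches.
    '''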
    # finds the date shared by all trials in the file (None if the file has no date)
    dates = tree.xpath('//div0//interp[@type="date"]/@value')
    trial_date = str(dates[0]) if dates else None

    # collects the @value of every matching <interp> below a trial element, as plain strings
    def values(elem, path, interp_type):
        return [str(v) for v in elem.xpath(f'{path}//interp[@type="{interp_type}"]/@value')]

    # XPaths relative to one trial element
    defendant = './/p//persName[@type="defendantName"]'
    victim = './/p//persName[@type="victimName"]'
    verdict = './/rs[@type="verdictDescription"]'
    offence = './/rs[@type="offenceDescription"]'
    punishment = './/p//rs[@type="punishmentDescription"]'

    # queries each trial element directly instead of searching the whole tree by id every time
    for trial_elem in tree.xpath('//div0//div1[@type="trialAccount"][@id]'):
        trial = trial_elem.get("id")
        trial_defendant_gender = values(trial_elem, defendant, "gender")
        trial_defendant_age = values(trial_elem, defendant, "age")
        trial_defendant_occupation = values(trial_elem, defendant, "occupation")
        trial_victim_gender = values(trial_elem, victim, "gender")
        trial_verdict_category = values(trial_elem, verdict, "verdictCategory")
        trial_verdict_subcategory = values(trial_elem, verdict, "verdictSubcategory")
        trial_offence_category = values(trial_elem, offence, "offenceCategory")
        trial_offence_subcategory = values(trial_elem, offence, "offenceSubcategory")
        trial_punishment_category = values(trial_elem, punishment, "punishmentCategory")
        trial_punishment_subcategory = values(trial_elem, punishment, "punishmentSubcategory")


        data_list.append({
            "trial_id": trial,
            "trial_date": trial_date,
            "trial_defendant_gender": trial_defendant_gender,
            "trial_defendant_age": trial_defendant_age,
            "trial_defendant_occupation": trial_defendant_occupation,
            "trial_victim_gender": trial_victim_gender,
            "trial_verdict_category": trial_verdict_category,
            "trial_verdict_subcategory": trial_verdict_subcategory,
            "trial_offence_category": trial_offence_category,
            "trial_offence_subcategory": trial_offence_subcategory,
            "trial_punishment_category": trial_punishment_category,
            "trial_punishment_subcategory": trial_punishment_subcategory,
        })
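
To make the shape of the resulting dictionaries concrete, the sketch below runs a reduced version of the extraction on a tiny hand-written document. The sample XML is an assumption that only imitates the tags named in the XPaths above; it is not taken from the dataset. The standard library's `xml.etree.ElementTree` is swapped in for lxml so the example has no third-party dependency, and the helper names `interp_values` and `trials_from_root` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT from the dataset: it only mimics the tags the XPaths look for.
SAMPLE = """
<doc>
  <div0 type="sessionsPaper" id="s1">
    <interp inst="s1" type="date" value="16740429"/>
    <div1 type="trialAccount" id="t1">
      <p>
        <persName id="d1" type="defendantName">
          <interp inst="d1" type="gender" value="male"/>
        </persName>
        <rs id="o1" type="offenceDescription">
          <interp inst="o1" type="offenceCategory" value="theft"/>
        </rs>
        <rs id="v1" type="verdictDescription">
          <interp inst="v1" type="verdictCategory" value="guilty"/>
        </rs>
      </p>
    </div1>
  </div0>
</doc>
"""


def interp_values(elem, parent_tag, parent_type, interp_type):
    """Collect the value attribute of matching <interp> elements below elem."""
    path = f".//{parent_tag}[@type='{parent_type}']//interp[@type='{interp_type}']"
    return [node.get("value") for node in elem.findall(path)]


def trials_from_root(root):
    """Build one dictionary per trialAccount, with a subset of the notebook's fields."""
    date = root.find(".//div0//interp[@type='date']").get("value")
    records = []
    for trial in root.findall(".//div0//div1[@type='trialAccount']"):
        records.append({
            "trial_id": trial.get("id"),
            "trial_date": date,
            "trial_defendant_gender": interp_values(trial, "persName", "defendantName", "gender"),
            "trial_offence_category": interp_values(trial, "rs", "offenceDescription", "offenceCategory"),
            "trial_verdict_category": interp_values(trial, "rs", "verdictDescription", "verdictCategory"),
        })
    return records


records = trials_from_root(ET.fromstring(SAMPLE))
print(records)
```

Each extracted field is a list, even when it holds a single value, because a trial may contain several matching elements (or none).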


In [None]:
# Parses one XML file and adds its trials' info to a list
def get_data_from_xml(file, data_list):
    tree = etree.parse(file)                     # parses the XML file

    process_trials_on_file(tree, data_list)      # extracts the trials' info

In [None]:
trials_data = []                  # will hold one dictionary per trial

# Iterates through all existing trial files
for file in xml_files:
    get_data_from_xml(file, trials_data)

In [None]:
# Saves the list of trial dictionaries to a JSON file
def save_to_json(data, file_name):
    with open(file_name, 'w', encoding='utf-8') as f:
        json.dump(data, f)

save_to_json(trials_data, 'extracted_data.json')
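
As a quick check of the stored format, the sketch below round-trips a few records through JSON and counts the values of one list-valued field. The sample records and the helper name `count_values` are hypothetical; only the field names come from the dictionaries built above.

```python
import json
from collections import Counter


def count_values(records, field):
    """Count how often each value appears in a list-valued field across records."""
    counts = Counter()
    for record in records:
        counts.update(record.get(field, []))
    return counts


# Hypothetical records in the same shape as the dictionaries built above
sample = [
    {"trial_id": "t1", "trial_verdict_category": ["guilty"]},
    {"trial_id": "t2", "trial_verdict_category": ["guilty", "notGuilty"]},
    {"trial_id": "t3", "trial_verdict_category": []},
]

# Round-trip through JSON text, as json.dump and json.load would through a file
restored = json.loads(json.dumps(sample))
verdicts = count_values(restored, "trial_verdict_category")
print(verdicts)
```

With the real file, the records would instead be read with `json.load` from 'extracted_data.json'. Lists survive the round trip unchanged, so trials with several verdicts, or none, are handled without special cases.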