# Dataset download and cleanup


- This section downloads the original Old Bailey Online XML data (the OBO_XML_7-2 archive) and unzips it into the dataset/data directory.

- The files are organised by session date; a single file can contain several trials.

- A list (`trial_files`) stores the paths of all the files. It is then iterated over, parsing each file and collecting the relevant data about the trials (function `parse_trial_file`).

- The function `extract_trial_data` goes through the parsed XML file and registers each trial inside it, gathering the data in one dictionary per trial and appending the dictionaries to a list.

- The list is then converted to a pandas DataFrame, cleaned, and exported to JSON for storage.

## Data download

In [45]:
import os
import json

import numpy as np
import pandas as pd
from lxml import etree
from matplotlib import pyplot as plt

In [46]:
# Download and unzip file
!wget https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2
!unzip 2 -d dataset
!unzip dataset/OBO_XML_7-2.zip -d dataset/data

--2025-02-03 20:19:11--  https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2
Resolving orda.shef.ac.uk (orda.shef.ac.uk)... 2a05:d018:1f4:d000:fc8b:3c92:d877:d071, 2a05:d018:1f4:d003:286f:d78f:c2e6:d35, 54.217.202.212, ...
Connecting to orda.shef.ac.uk (orda.shef.ac.uk)|2a05:d018:1f4:d000:fc8b:3c92:d877:d071|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 315238180 (301M) [application/zip]
Saving to: '2'


Archive:  2
 extracting: dataset/OB_xml_notes.docx  
 extracting: dataset/OBO_XML_7-2.zip  
Archive:  dataset/OBO_XML_7-2.zip
  inflating: dataset/data/licence.txt  
  inflating: dataset/data/listOA.txt  
  inflating: dataset/data/listOBP.txt  
  inflating: dataset/data/OB_xml_notes.docx  
   creating: dataset/data/ordinarysAccounts/
  inflating: dataset/data/ordinarysAccounts/.DS_Store  
  inflating: dataset/data/ordinarysAccounts/OA16760517.xml  
  inflating: dataset/data/ordinarysAccounts/OA16760705.xml  
  inflating: dataset/data/ordinarysAccounts/OA16760830.xml  
  inflating: dataset/data/ordinarysAccounts/OA16761025.xml  
  inflating: dataset/data/ordinarysAccounts/OA16770316.xml  
  inflating: dataset/data/ordinarysAccounts/OA16770504.xml  
  inflating: dataset/data/ordinarysAccounts/OA16771017.xml  
  inflating: dataset/data/ordinarysAccounts/OA16771219.xml  
  inflating: dataset/data/ordinarysAccounts/OA16780123.xml  
  inflating: dataset/data/ordinarysAccounts/OA16780306.xml 
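
The `!wget` and `!unzip` commands above only work where those shell tools are installed. As a rough, portable alternative, the same two-step extraction (outer archive, then the archive nested inside it) can be sketched with the standard library alone. The helper names `download` and `extract_nested` are hypothetical and not part of the original notebook; the URL is the one from the `wget` command.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command in the cell above.
DATASET_URL = "https://orda.shef.ac.uk/ndownloader/articles/4775434/versions/2"


def download(url, target):
    """Download url to target, skipping the download if the file already exists."""
    target = Path(target)
    if not target.exists():
        urllib.request.urlretrieve(url, target)
    return target


def extract_nested(outer_zip, outer_dir, inner_name, inner_dir):
    """Unzip outer_zip into outer_dir, then unzip the archive inner_name found there."""
    outer_dir, inner_dir = Path(outer_dir), Path(inner_dir)
    with zipfile.ZipFile(outer_zip) as archive:
        archive.extractall(outer_dir)
    with zipfile.ZipFile(outer_dir / inner_name) as archive:
        archive.extractall(inner_dir)
    return inner_dir
```

Under these assumptions, `download(DATASET_URL, "2")` followed by `extract_nested("2", "dataset", "OBO_XML_7-2.zip", "dataset/data")` would mirror the three shell commands.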

In [47]:
# Creates a sorted list with the paths of all XML files
trials_folder = os.path.join('dataset', 'data', 'sessionsPapers')
trial_files = sorted(os.path.join(trials_folder, f) for f in os.listdir(trials_folder) if f.endswith('.xml'))
trial_files



['dataset\\data\\sessionsPapers\\16740429.xml',
 'dataset\\data\\sessionsPapers\\16740717.xml',
 'dataset\\data\\sessionsPapers\\16740909.xml',
 'dataset\\data\\sessionsPapers\\16741014.xml',
 'dataset\\data\\sessionsPapers\\16741212.xml',
 'dataset\\data\\sessionsPapers\\16750115.xml',
 'dataset\\data\\sessionsPapers\\16750219.xml',
 'dataset\\data\\sessionsPapers\\16750414.xml',
 'dataset\\data\\sessionsPapers\\16750707.xml',
 'dataset\\data\\sessionsPapers\\16750909.xml',
 'dataset\\data\\sessionsPapers\\16751013.xml',
 'dataset\\data\\sessionsPapers\\16751208.xml',
 'dataset\\data\\sessionsPapers\\16760114.xml',
 'dataset\\data\\sessionsPapers\\16760117.xml',
 'dataset\\data\\sessionsPapers\\16760405.xml',
 'dataset\\data\\sessionsPapers\\16760510.xml',
 'dataset\\data\\sessionsPapers\\16760628.xml',
 'dataset\\data\\sessionsPapers\\16760823.xml',
 'dataset\\data\\sessionsPapers\\16761011.xml',
 'dataset\\data\\sessionsPapers\\16761213.xml',
 'dataset\\data\\sessionsPapers\\1677011

In [48]:
# Gets each trial's info from a parsed XML file and adds it to a list
def extract_trial_data(tree, data_list):
    '''
    Extracts the info of every trial in one XML file and appends one
    dictionary per trial to a list.

    Args:
        tree: the parsed XML document tree.
        data_list: a list that receives the trial dictionaries (modified in place).

    '''
    # session date shared by all trials in the file (first date found)
    trial_date = tree.xpath('//div0//interp[@type="date"]/@value')
    trial_date = trial_date[0]

    # ids of all trials in the file
    trials_ids = tree.xpath('//div0//div1[@type="trialAccount"]/@id')

    for trial in trials_ids:
        defendant_gender = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//persName[@type="defendantName"]//interp[@type="gender"]/@value')
        defendant_age = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//persName[@type="defendantName"]//interp[@type="age"]/@value')
        defendant_occupation = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//persName[@type="defendantName"]//interp[@type="occupation"]/@value')
        victim_gender = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//persName[@type="victimName"]//interp[@type="gender"]/@value')
        verdict_category = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//rs[@type="verdictDescription"]//interp[@type="verdictCategory"]/@value')
        verdict_subcategory = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//rs[@type="verdictDescription"]//interp[@type="verdictSubcategory"]/@value')
        offence_category = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//rs[@type="offenceDescription"]//interp[@type="offenceCategory"]/@value')
        offence_subcategory = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//rs[@type="offenceDescription"]//interp[@type="offenceSubcategory"]/@value')
        punishment_category = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//rs[@type="punishmentDescription"]//interp[@type="punishmentCategory"]/@value')
        punishment_subcategory = tree.xpath(f'//div0//div1[@type="trialAccount" and @id="{trial}"]//p//rs[@type="punishmentDescription"]//interp[@type="punishmentSubcategory"]/@value')


        data_list.append({
            "trial_id": trial,
            "trial_date": trial_date,
            "defendant_gender": defendant_gender,
            "defendant_age": defendant_age,
            "defendant_occupation": defendant_occupation,
            "victim_gender": victim_gender,
            "verdict_category": verdict_category,
            "verdict_subcategory": verdict_subcategory,
            "offence_category": offence_category,
            "offence_subcategory": offence_subcategory,
            "punishment_category": punishment_category,
            "punishment_subcategory": punishment_subcategory,
        })
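
To make the extraction logic easier to follow, here is a minimal sketch of the same idea on a tiny XML sample. It is not the notebook's implementation: it uses the standard library's `xml.etree.ElementTree` instead of `lxml`, it searches relative to each trial element rather than re-querying the whole tree per trial id, and the sample document is hypothetical, modelled only on the element and attribute names that appear in the XPath expressions above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample; real session files are far richer.
SAMPLE = """
<div0 type="sessionsPaper" id="16740429">
  <interp type="date" value="16740429"/>
  <div1 type="trialAccount" id="t16740429-1">
    <p>
      <persName type="defendantName"><interp type="gender" value="male"/></persName>
      <persName type="victimName"><interp type="gender" value="female"/></persName>
      <rs type="offenceDescription"><interp type="offenceCategory" value="theft"/></rs>
    </p>
  </div1>
  <div1 type="trialAccount" id="t16740429-2">
    <p>
      <persName type="defendantName"><interp type="gender" value="male"/></persName>
      <persName type="defendantName"><interp type="gender" value="female"/></persName>
      <rs type="offenceDescription"><interp type="offenceCategory" value="kill"/></rs>
    </p>
  </div1>
</div0>
"""


def values(trial, path):
    """Return the 'value' attribute of every element matching path below trial."""
    return [node.get("value") for node in trial.iterfind(path)]


def extract_trials(root):
    """Build one dictionary per trialAccount element found below root."""
    trial_date = root.find(".//interp[@type='date']").get("value")
    records = []
    for trial in root.iterfind(".//div1[@type='trialAccount']"):
        records.append({
            "trial_id": trial.get("id"),
            "trial_date": trial_date,
            "defendant_gender": values(
                trial, ".//persName[@type='defendantName']//interp[@type='gender']"),
            "victim_gender": values(
                trial, ".//persName[@type='victimName']//interp[@type='gender']"),
            "offence_category": values(
                trial, ".//rs[@type='offenceDescription']//interp[@type='offenceCategory']"),
        })
    return records


records = extract_trials(ET.fromstring(SAMPLE.strip()))
for record in records:
    print(record)
```

As in the notebook's function, every field is a list, because a trial can involve several defendants or victims, and an empty list means the information is absent.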


In [49]:
# Parses one XML file and adds the info of its trials to a list
def parse_trial_file(file, data_list):
  tree = etree.parse(file)          # parses the XML file

  extract_trial_data(tree, data_list)      # extracts the trials' info

In [50]:
trial_records = []                  # list of trial dictionaries

# Iterates through all trial files
for file in trial_files:

  parse_trial_file(file, trial_records)

## Data cleaning

Converting the list to a pandas DataFrame.

In [51]:
df = pd.DataFrame(trial_records)

In [52]:
df

| | trial_id | trial_date | defendant_gender | defendant_age | defendant_occupation | victim_gender | verdict_category | verdict_subcategory | offence_category | offence_subcategory | punishment_category | punishment_subcategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t16740429-1 | 16740429 | [male] | [] | [] | [male] | [guilty] | [] | [violentTheft] | [highwayRobbery] | [] | [] |
| 1 | t16740429-2 | 16740429 | [male] | [] | [] | [male] | [guilty] | [] | [theft] | [grandLarceny] | [death] | [] |
| 2 | t16740429-3 | 16740429 | [male, male, male] | [] | [] | [male] | [guilty] | [] | [theft] | [burglary] | [] | [] |
| 3 | t16740429-4 | 16740429 | [male] | [] | [] | [female] | [notGuilty] | [] | [sexual] | [rape] | [] | [] |
| 4 | t16740429-5 | 16740429 | [female] | [] | [] | [female] | [guilty] | [] | [theft] | [other] | [transport] | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 197746 | t19130401-63 | 19130401 | [male] | [45] | [clicker] | [female] | [guilty] | [pleadedGuilty] | [breakingPeace] | [wounding] | [imprison] | [hardLabour] |
| 197747 | t19130401-64 | 19130401 | [male] | [24] | [labourer] | [] | [guilty] | [no_subcategory] | [sexual] | [rape] | [imprison] | [hardLabour] |
| 197748 | t19130401-65 | 19130401 | [male, male] | [24, 18] | [] | [male] | [notGuilty] | [noEvidence] | [kill] | [manslaughter] | [] | [] |
| 197749 | t19130401-66 | 19130401 | [male] | [17] | [labourer] | [] | [guilty] | [no_subcategory] | [sexual] | [sodomy] | [imprison] | [otherInstitution] |


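Replacing the following values:
- empty lists (`[]`)
- lists holding only an empty string (`['']`)
- lists holding only the value 'indeterminate' (`['indeterminate']`)

with NumPy's NaN using the DataFrame `mask` function.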

In [79]:
df_cleaned = df.mask((df.map(type).eq(list) & ~df.astype(bool)) |
                  (df.map(lambda x: (x == [''] or x == ['indeterminate']))),
                  other=np.nan)
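
The effect of this masking is easiest to see on a toy frame. The sketch below is an illustration under assumptions, not the notebook's exact code: the three-row DataFrame and the helper `is_missing` are hypothetical, and the condition is built column by column with `Series.map`, which is also available in pandas versions that predate `DataFrame.map`.

```python
import pandas as pd

# Hypothetical toy data shaped like the extracted trial records.
toy = pd.DataFrame({
    "trial_id": ["t-1", "t-2", "t-3"],
    "defendant_gender": [["male"], ["indeterminate"], []],
    "defendant_age": [[], ["24"], [""]],
})


def is_missing(value):
    """True for an empty list, [''] or ['indeterminate']; False for anything else."""
    return isinstance(value, list) and (
        len(value) == 0 or value == [""] or value == ["indeterminate"]
    )


# Boolean frame marking the cells to blank out, then mask them with NaN.
missing = toy.apply(lambda column: column.map(is_missing))
cleaned = toy.mask(missing)

print(cleaned.isna().sum())
```

Only the list-valued placeholders are replaced; real values such as `['male']` or `['24']` and the plain-string `trial_id` column are left untouched.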

In [80]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197751 entries, 0 to 197750
Data columns (total 12 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   trial_id                197751 non-null  object
 1   trial_date              197751 non-null  object
 2   defendant_gender        197627 non-null  object
 3   defendant_age           119297 non-null  object
 4   defendant_occupation    7013 non-null    object
 5   victim_gender           165658 non-null  object
 6   verdict_category        197445 non-null  object
 7   verdict_subcategory     107907 non-null  object
 8   offence_category        197695 non-null  object
 9   offence_subcategory     197695 non-null  object
 10  punishment_category     146132 non-null  object
 11  punishment_subcategory  89033 non-null   object
dtypes: object(12)
memory usage: 18.1+ MB


### Converting date column to datetime type.

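Replacing date values using `pd.to_datetime`. With `errors="coerce"`, records dated before 1677-09-21 become NaT, because they fall below the earliest date that pandas' `datetime64[ns]` type can represent (`pd.Timestamp.min`).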

In [86]:
from lxml.etree import _ElementUnicodeResult

df_cleaned['trial_date'] = df_cleaned['trial_date'].apply(
    lambda x: str(x) if isinstance(x, _ElementUnicodeResult) else x
)


In [87]:
print(df_cleaned['trial_date'])
print(type(df_cleaned.trial_id[0]))

0         16740429
1         16740429
2         16740429
3         16740429
4         16740429
            ...   
197746    19130401
197747    19130401
197748    19130401
197749    19130401
197750    19130401
Name: trial_date, Length: 197751, dtype: object
<class 'lxml.etree._ElementUnicodeResult'>


In [88]:
df_cleaned['trial_date'] = pd.to_datetime(df_cleaned['trial_date'], format='%Y%m%d', errors="coerce")

df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197751 entries, 0 to 197750
Data columns (total 12 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   trial_id                197751 non-null  object        
 1   trial_date              197522 non-null  datetime64[ns]
 2   defendant_gender        197627 non-null  object        
 3   defendant_age           119297 non-null  object        
 4   defendant_occupation    7013 non-null    object        
 5   victim_gender           165658 non-null  object        
 6   verdict_category        197445 non-null  object        
 7   verdict_subcategory     107907 non-null  object        
 8   offence_category        197695 non-null  object        
 9   offence_subcategory     197695 non-null  object        
 10  punishment_category     146132 non-null  object        
 11  punishment_subcategory  89033 non-null   object        
dtypes: datetime64[ns](1), object(1

In [89]:
print(df_cleaned['trial_date'])

0               NaT
1               NaT
2               NaT
3               NaT
4               NaT
            ...    
197746   1913-04-01
197747   1913-04-01
197748   1913-04-01
197749   1913-04-01
197750   1913-04-01
Name: trial_date, Length: 197751, dtype: datetime64[ns]
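
The NaT values above come from the lower bound of `datetime64[ns]`, which is 1677-09-21. If the earliest sessions need to keep their dates, one possible workaround (an assumption, not something the notebook does) is to parse the raw `YYYYMMDD` strings into standard-library `datetime.date` objects, which cover years 1 to 9999. The helper name `parse_session_date` is hypothetical.

```python
from datetime import date, datetime


def parse_session_date(raw):
    """Parse a 'YYYYMMDD' string into a datetime.date; return None if it is malformed."""
    try:
        return datetime.strptime(raw, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


early = parse_session_date("16740429")   # a session that pandas coerces to NaT
late = parse_session_date("19130401")

print(early, late)
print((late - early).days)               # plain date arithmetic still works
```

The trade-off is that such a column stays `object` dtype in pandas, so the vectorised `.dt` accessor is unavailable; it could be applied to the original string column with something like `Series.map(parse_session_date)`.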


## Exporting to JSON format

In [92]:
df_cleaned.to_json('cleaned_data.json', orient='records', lines=True)
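
With `orient='records'` and `lines=True`, the export is a JSON Lines file: one JSON object per row, one row per line. pandas can load it again with `pd.read_json('cleaned_data.json', lines=True)`; the sketch below shows an equivalent reader using only the standard library, which can be handy for streaming the file without pandas. The function name `read_json_lines` is hypothetical.

```python
import json


def read_json_lines(path):
    """Read a JSON Lines file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Calling `read_json_lines('cleaned_data.json')` would return one dictionary per trial, with missing values appearing as `None`.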
