# Extract
This notebook extracts information from the input data file, a zip archive that contains many XML files. Each XML file is parsed, and its text and subjects are extracted. The extracted information is then stored in a separate file. Finally, the resulting file is loaded back into memory and the number of successfully parsed files is shown.

In [1]:
# Load required modules
%reload_ext autoreload
%autoreload
%run imports.py

## Extracting information
The `parse_zip_file` function walks through the zip file. Each time it finds an XML file, it calls the `extractor.handle_xml_file` method, which keeps the extracted information on the extractor object (in `extractor.data`). Finally, the `store_dataset` function writes the collected data to `DATA_FILE`.

In [None]:
# Extract information from the input file
extractor = LegalIntelligenceExtractor()
parse_zip_file(INPUT_FILE, extractor.handle_xml_file, show_state)
store_dataset(extractor.data, DATA_FILE)
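
The callback pattern described above can be illustrated with a minimal sketch. It is not the implementation of `parse_zip_file` or `LegalIntelligenceExtractor`, which are defined elsewhere and not shown in this notebook. The names `iter_xml_members` and `TextCollector`, the `<text>` element, and the decision to skip malformed files are all assumptions made for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_members(zip_path, handle_xml, on_progress=None):
    """Call handle_xml(name, root) for every parseable XML member of a zip file.

    Hypothetical stand-in for the notebook's parse_zip_file helper.
    zip_path may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if n.lower().endswith(".xml")]
        for position, name in enumerate(names, start=1):
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                # Assumption: malformed files are skipped rather than aborting the run
                continue
            handle_xml(name, root)
            if on_progress is not None:
                on_progress(position, len(names))


class TextCollector:
    """Hypothetical handler that keeps the text of each file's <text> element."""

    def __init__(self):
        self.data = {}

    def handle_xml_file(self, name, root):
        node = root.find(".//text")
        if node is not None and node.text:
            self.data[name] = node.text.strip()
```

Passing a bound method such as `collector.handle_xml_file` as the callback lets the handler accumulate state across files, which is the same shape as the `extractor.handle_xml_file` call in the cell above.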

## Loading information
The following code loads the stored information with the `load_dataset` function.

In [None]:
# Display the number of files which were parsed successfully
dataset = load_dataset(DATA_FILE)
print("Number of successfully parsed files:", len(dataset.files))
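
The on-disk format used by `store_dataset` and `load_dataset` is not shown in this notebook. As a rough sketch of the store-then-load round trip, the example below assumes a JSON file holding a list of records. The function names `save_records` and `load_records` and the record fields are hypothetical.

```python
import json
from pathlib import Path


def save_records(records, path):
    """Write a list of JSON-serialisable records to path (assumed format)."""
    Path(path).write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")


def load_records(path):
    """Read the records back; the result equals what save_records was given."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Separating extraction from loading in this way means the expensive parsing step only has to run once; later cells can start from the stored file.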

## Transformations
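Many transformations can be applied to the loaded dataset. The `In [2]` cell below applies a few of them: it shuffles the records, removes records without a second-level subject, removes subjects with too few records, and caps the size of large subjects. It then prints the number of records per subject.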
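
To make the intent of the filtering steps concrete, here is a small sketch on plain dictionaries. It does not reproduce the notebook's `filter_incomplete_subjects`, `filter_small_subjects`, or `chop_large_subjects` helpers, whose code is not shown. The record layout (`"top"` and `"second"` keys), the function names, and the assumption that a subject with exactly the threshold number of records is kept are all hypothetical.

```python
from collections import defaultdict


def group_by_subject(records):
    """Map each (top, second) subject pair to the list of its records."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["top"], record["second"])].append(record)
    return groups


def drop_incomplete(records):
    """Keep only records that have a second-level subject."""
    return [r for r in records if r.get("second")]


def drop_small_subjects(records, minimum):
    """Keep records whose subject has at least `minimum` records (assumed semantics)."""
    groups = group_by_subject(records)
    return [r for r in records if len(groups[(r["top"], r["second"])]) >= minimum]


def cap_large_subjects(records, maximum):
    """Keep at most `maximum` records per subject, in their current order."""
    seen = defaultdict(int)
    kept = []
    for record in records:
        key = (record["top"], record["second"])
        if seen[key] < maximum:
            seen[key] += 1
            kept.append(record)
    return kept
```

In this sketch the cap keeps the first records it encounters for each subject, so shuffling beforehand turns the cap into a random sample of each large subject.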

In [2]:
# Load the dataset
dataset = load_dataset(DATA_FILE)
# Shuffle it
dataset = shuffle_dataset(dataset)
# Remove records that have only a top-level subject (no second-level subject)
dataset = filter_incomplete_subjects(dataset)
# Remove all records of subjects that have fewer than 8 records
dataset = filter_small_subjects(dataset, 8)
# Cap every second-level subject at 500 records (surplus records are removed)
dataset = chop_large_subjects(dataset, 500)

# Show the number of records per top level and per second level subject
indices = get_indices(dataset)
for target1_name in indices:
    print(target1_name, " (top level subject):")
    for target2_name in indices[target1_name]:
        print("\t", target2_name, " (second level subject):")
        print("\t\t", len(indices[target1_name][target2_name]), " records")

Bestuursrecht  (top level subject):
	 Vreemdelingenrecht  (second level subject):
		 500  records
	 Bestuursprocesrecht  (second level subject):
		 11  records
	 Socialezekerheidsrecht  (second level subject):
		 500  records
	 Bestuursstrafrecht  (second level subject):
		 40  records
	 Mededingingsrecht  (second level subject):
		 4  records
	 Omgevingsrecht  (second level subject):
		 225  records
	 Ambtenarenrecht  (second level subject):
		 257  records
	 Belastingrecht  (second level subject):
		 500  records
Strafrecht  (top level subject):
	 Materieel strafrecht  (second level subject):
		 8  records
	 Strafprocesrecht  (second level subject):
		 9  records
Civiel recht  (top level subject):
	 Verbintenissenrecht  (second level subject):
		 43  records
	 Insolventierecht  (second level subject):
		 169  records
	 Arbeidsrecht  (second level subject):
		 126  records
	 Intellectueel-eigendomsrecht  (second level subject):
		 17  records
	 Ondernemingsrecht  (second level subject