# Notebook for running the script that builds the database instance

We strongly advise running this notebook in a virtual environment (venv) set up from the `requirements.txt` file in the `../package_nationbetter/` folder. To run the code, first install the package and its required dependencies using
```
pip install ../package_nationbetter/ --upgrade 
``` 
After editing files in the package body `../package_nationbetter/nationbetter/`, rerun the previous command so that the installed `nationbetter` package, which is loaded with `import nationbetter`, picks up the changes to the `.py` files it contains.

Running the code builds the full labeled, sectioned dataset. Note that this rebuilds the full library every time and takes a few minutes. I have not yet written the functionality that checks whether rebuilding the tree would give the same output and, if so, skips the build.
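One way such a check could work is sketched below. This is only an illustration and not part of `nationbetter`: the function names and the stamp file are hypothetical. It also swaps the stated goal for a simpler proxy, comparing a fingerprint of the input files rather than the built output, on the assumption that unchanged inputs lead to an unchanged build.

```python
import hashlib
import json
import os


def fingerprint_folder(folder):
    """Return a SHA-256 digest over the relative paths and contents of all files in `folder`."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(folder):
        dirs.sort()  # make the walk order deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            digest.update(os.path.relpath(full, folder).encode("utf-8"))
            with open(full, "rb") as handle:
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(source_folder, stamp_file):
    """Return True if the sources differ from the fingerprint stored in `stamp_file`."""
    current = fingerprint_folder(source_folder)
    if os.path.exists(stamp_file):
        with open(stamp_file) as handle:
            if json.load(handle).get("fingerprint") == current:
                return False
    return True


def record_build(source_folder, stamp_file):
    """Store the current fingerprint after a successful build.

    Keep `stamp_file` outside `source_folder`, otherwise writing it
    changes the fingerprint.
    """
    with open(stamp_file, "w") as handle:
        json.dump({"fingerprint": fingerprint_folder(source_folder)}, handle)
```

A build script could call `needs_rebuild` first, run the build only when it returns `True`, and call `record_build` afterwards.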

In [1]:
pip install ../package_nationbetter --upgrade

Processing /Users/joao/Dropbox/Mathematics/Data Analysis/S2DS/Nation Better/Aug20_NationBetter/deliverable/package_nationbetter
Building wheels for collected packages: NationBeter-PIVIGO-GovUKCorpusParser
  Building wheel for NationBeter-PIVIGO-GovUKCorpusParser (setup.py) ... done
  Created wheel for NationBeter-PIVIGO-GovUKCorpusParser: filename=NationBeter_PIVIGO_GovUKCorpusParser-0.0.1-cp37-none-any.whl size=31447 sha256=683a830b6b7d30adca1149320030c98910c3dccdad8e3116db87e8e3c7938cf4
  Stored in directory: /private/var/folders/_l/g2349h3j6fjf11xrwj4fmp1h0000gn/T/pip-ephem-wheel-cache-sqyso9xz/wheels/69/4c/92/21b638651813249e84a40d959802297180031081548b8f08ad
Successfully built NationBeter-PIVIGO-GovUKCorpusParser
Installing collected packages: NationBeter-PIVIGO-GovUKCorpusParser
Successfully installed NationBeter-PIVIGO-GovUKCorpusParser-0.0.1
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install -r ../package_nationbetter/requirements.txt

In [None]:
%run -i ../package_nationbetter/examples/example_build_data.py

# Data structure

In [37]:
import nationbetter
import os

path = nationbetter.get_data_folder(
    os.path.abspath(os.path.join(__file__, '..', '..', '..')))
datapath = os.path.join(path,'data_nationbetter')
datapath

'/Users/EyzoStoutenAir/S2DS/Nation.Better/Aug20_NationBetter/deliverable/data_nationbetter'

Currently the database of extracted files is stored in the `data_nationbetter` folder. Each part of the extraction and clean-up is handled by a separate file in `package_nationbetter/nationbetter/`. These files write and read data using the methods in `file_handler.py`: `write_files`, `import_source_to_df` and `import_dict`. By default these methods import and export using pickle, but other formats can be selected by passing `out_type='csv'` or `out_type='json'`. The readers should handle these files without any change.
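The pattern described here, where a writer picks the format from `out_type` and a reader works on any of the resulting files, can be sketched with the standard library alone. The code below is not the implementation in `file_handler.py`: `save_records` and `load_records` are hypothetical names, and it stores a plain list of dicts rather than a DataFrame.

```python
import csv
import json
import os
import pickle


def save_records(records, path_stem, out_type="pickle"):
    """Write a non-empty list of dicts to `path_stem` plus an extension chosen by `out_type`."""
    if out_type == "pickle":
        path = path_stem + ".pkl"
        with open(path, "wb") as handle:
            pickle.dump(records, handle)
    elif out_type == "json":
        path = path_stem + ".json"
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(records, handle)
    elif out_type == "csv":
        path = path_stem + ".csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            # Column names are taken from the first record.
            writer = csv.DictWriter(handle, fieldnames=list(records[0]))
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f"unsupported out_type: {out_type!r}")
    return path


def load_records(path):
    """Read records back, choosing the parser from the file extension."""
    extension = os.path.splitext(path)[1]
    if extension == ".pkl":
        with open(path, "rb") as handle:
            return pickle.load(handle)
    if extension == ".json":
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    if extension == ".csv":
        # CSV has no types, so every value comes back as a string.
        with open(path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))
    raise ValueError(f"unsupported file type: {extension!r}")
```

Because the reader dispatches on the extension, calling code does not need to know which `out_type` was used when the file was written.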

In [38]:
folders = os.listdir(datapath)[1:]
folders

['formatted_html_dfs',
 'formatted_pdf_dfs',
 'labeled_corpus.pkl',
 'raw_html_dicts',
 'raw_pdf',
 'raw_pdf_dicts']
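Note that `os.listdir` returns entries in arbitrary order, so the `[1:]` slice above is fragile. It presumably drops a hidden entry such as `.DS_Store`, but that is an assumption. A more robust alternative, sketched here with a hypothetical helper name, filters hidden entries explicitly and sorts the rest:

```python
import os


def list_visible(folder):
    """Return the non-hidden entries of `folder` in sorted order."""
    return sorted(name for name in os.listdir(folder) if not name.startswith("."))
```

With this helper, `folders = list_visible(datapath)` gives a stable order, so indexing with `folders[0]` or `folders[folder_no]` selects the same folder on every machine.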

In [39]:
html_folder = os.path.join(datapath,folders[0])
html_file = os.listdir(html_folder)
html_file[0:5]

['Immigration_Rules_Appendix_6_academic_subjects_that_need_a_certificate.pkl',
 'Immigration_Rules_Appendix_7_overseas_workers_in_private_households.pkl',
 'Immigration_Rules_Appendix_A_attributes.pkl',
 'Immigration_Rules_Appendix_AR_(EU).pkl',
 'Immigration_Rules_Appendix_AR_administrative_review.pkl']

In [47]:
folder_no = 1
file_no = 2
file = os.listdir(os.path.join(datapath,folders[folder_no]))
nationbetter.import_source_to_df(os.path.join(datapath,folders[folder_no],file[file_no]))

| Unnamed: 0 | title | subtitle | subsubtitle | text_type | page_no | text | url |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | title | 1 | \n \n \nTier 2 and 5: Guidance for Sponsors -... | https://assets.publishing.service.gov.uk/gover... |
| 1 | 0.0 | 0.0 | 0.0 | main_text | 1 | \nThis addendum was published on 19 July 2019... | https://assets.publishing.service.gov.uk/gover... |
| 2 | 1.0 | 0.0 | 0.0 | title | 1 | \n \n \n \n \n \n \n \n \n \n \n \nTiers 2 an... | https://assets.publishing.service.gov.uk/gover... |
| 3 | 1.0 | 0.0 | 0.0 | main_text | 2 | \nVersion 07/20 \n \nThis guidance is to be u... | https://assets.publishing.service.gov.uk/gover... |
| 4 | 1.0 | 1.0 | 0.0 | subtitle | 2 | \n \nContents \n | https://assets.publishing.service.gov.uk/gover... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 502 | 51.0 | 5.0 | 0.0 | main_text | 208 | If your application is approved, you will not ... | https://assets.publishing.service.gov.uk/gover... |
| 503 | 51.0 | 6.0 | 0.0 | subtitle | 209 | \nIf your application for a licence is refuse... | https://assets.publishing.service.gov.uk/gover... |
| 504 | 51.0 | 6.0 | 0.0 | main_text | 209 | If we are not satisfied that you can offer gen... | https://assets.publishing.service.gov.uk/gover... |
| 505 | 51.0 | 7.0 | 0.0 | subtitle | 209 | \nFurther information \n | https://assets.publishing.service.gov.uk/gover... |


In [33]:
%run -i ../package_nationbetter/examples/example_retrieve_dataframes.py

To access one of the files in formatted_pdf_dfs and formatted_html_dfs,        select the wanted name from
 ['2020-07-13_Tier_2_Policy_Guidance.pkl', '2020-07-16_Tier-2-5-sponsor-guidance_Jul-2020_v1.pkl', 'calculating-continuous-leave-v21.pkl', 'english-language-v17.pkl', 'Final_Tier_5_Temporary_Worker_Guidance_05-04-19.pkl', 'good-character-guidance.pkl', 'Guide-AN_-_July_20.pkl', 'naturalisation-as-a-british-citizen-by-discretion-v5.pkl', 'sponsorguideappBfrom060412.pkl']
 ['Immigration_Rules_Appendix_7_overseas_workers_in_private_households.pkl', 'Immigration_Rules_Appendix_A_attributes.pkl', 'Immigration_Rules_Appendix_AR_(EU).pkl', 'Immigration_Rules_Appendix_AR_administrative_review.pkl', 'Immigration_Rules_Appendix_Armed_Forces.pkl', 'Immigration_Rules_Appendix_B_English_language.pkl', 'Immigration_Rules_Appendix_C_maintenance_(funds).pkl', 'Immigration_Rules_Appendix_D_highly_skilled_migrants.pkl', 'Immigration_Rules_Appendix_E_maintenance_(funds)_for_the_family_of_Relevant_Po