In [1]:
import pandas as pd

from web_scraping_lib import *
from text_wrangling_utils import *


### A collection of all current immigration rules
https://www.gov.uk/guidance/immigration-rules


### Web scraping with beautifulsoup
https://gilberttanner.com/blog/introduction-to-web-scraping-with-beautifulsoup

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Note: The pages we are working with make heavy use of JavaScript, which complicates matters.
As far as I understand it, BeautifulSoup ignores JavaScript completely: it parses the HTML exactly as it is downloaded from the server, BEFORE any script has modified it, and the BeautifulSoup object maps directly onto that markup.
To get the scraping right, one therefore needs to study the same plain HTML that BeautifulSoup sees.
This is NOT what we see when we open the web page in a browser and inspect it with the developer tools, since those show the document after JavaScript has run!
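
The point about JavaScript can be illustrated with a tiny, made-up page. The sketch below is not taken from the gov.uk pages; the HTML string is a hypothetical example in which a script would replace the visible text once a browser runs it. Parsing it with BeautifulSoup and Python's built-in `html.parser` shows only the text that the server sent.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a browser would replace the paragraph text by running the script.
RAW_HTML = """
<html><body>
  <div id="content"><p>Served by the server.</p></div>
  <script>
    document.getElementById("content").textContent = "Added by JavaScript.";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The script is kept as inert text inside its <script> tag; nothing is executed.
paragraphs = [p.get_text() for p in soup.find_all("p")]
content_text = soup.find(id="content").get_text(strip=True)

print(paragraphs)
print(content_text)
```

Both printed values contain only the served text, which is why the selectors used for scraping have to be designed against the raw download rather than the rendered page.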

# Documents relevant for the Tier 2 visa

In [2]:


url_list = [
    "https://www.gov.uk/guidance/immigration-rules/immigration-rules-index",
    "https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-6a-the-points-based-system",
    "https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-a-attributes",
    "https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-c-maintenance-funds",
    "https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-b-english-language"
]


for url in url_list:
    print(url + "\n")

https://www.gov.uk/guidance/immigration-rules/immigration-rules-index

https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-6a-the-points-based-system

https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-a-attributes

https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-c-maintenance-funds

https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-b-english-language



# Scraping demo

In [3]:

docs_scrape = map_df(scrape_govuk_guidance,url_list)
docs_scrape.head()

Unnamed: 0,URL,title,summary,text_dump,text_segmented,hyperlinks_dump,timestamp
0,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules: Index,The rules are divided into different documents...,\nImmigration Rules: Index\nThe rules are divi...,"[(text, Immigration Rules: Index\nThe rules ar...",[https://www.gov.uk/guidance/immigration-rules...,2020-08-17T19:43:09+00:00
1,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules part 6A: the points-based sy...,Points-based system (paragraphs 245AAA to 245Z...,\nImmigration Rules part 6A: the points-based ...,"[(text, Immigration Rules part 6A: the points-...",[],2020-08-17T19:43:09+00:00
2,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix A: attributes,Points needed for attributes for applicants in...,\nImmigration Rules Appendix A: attributes\nPo...,"[(text, Immigration Rules Appendix A: attribut...",[http://www.oanda.com],2020-08-17T19:43:10+00:00
3,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix C: maintenance (funds),Maintenance (funds),\nImmigration Rules Appendix C: maintenance (f...,"[(text, Immigration Rules Appendix C: maintena...",[],2020-08-17T19:43:13+00:00
4,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix B: English language,English Language,\nImmigration Rules Appendix B: English langua...,"[(text, Immigration Rules Appendix B: English ...",[],2020-08-17T19:43:13+00:00
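
The implementation of `map_df` lives in `web_scraping_lib` and is not shown here. Judging from the output above, it calls the scraper once per URL and stacks the results into one row per document. A minimal sketch of that pattern, assuming the scraper returns a dict per URL; `map_df_sketch` and `fake_scraper` are hypothetical names, and the fake scraper avoids any network access:

```python
import pandas as pd

def map_df_sketch(func, items):
    """Call func on every item and stack the returned dicts into one DataFrame."""
    records = [func(item) for item in items]
    return pd.DataFrame(records)

def fake_scraper(url):
    # Stand-in for a real scraper: derives its fields from the URL only.
    return {"URL": url, "title": url.rsplit("/", 1)[-1]}

demo = map_df_sketch(fake_scraper, ["https://example.org/a", "https://example.org/b"])
print(demo)
```

Each key of the returned dict becomes a column, so adding a field to the scraper adds a column without touching the mapping code.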


# Segmentation demo

In [4]:
idx = 2
test_segments = docs_scrape.loc[idx,"text_segmented"]
print(docs_scrape.loc[idx,"URL"])

https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-a-attributes


In [5]:
test_df = build_segments_df(test_segments)
#test_df = test_df.set_index("section")
test_df.head(20)

Unnamed: 0,section,subsection,is table,string
0,0,0,False,Immigration Rules Appendix A: attributes\nPoin...
1,1,0,False,Attributes for Tier 1 (Exceptional Talent) Mig...
2,1,0,False,\n1. An applicant applying for indefinite leav...
3,1,1,False,Table 1
4,1,2,False,Applications for indefinite leave to remain
5,1,2,True,...
6,1,2,False,Notes
7,1,3,False,Tier 1 (Exceptional Talent) Limit
8,1,3,False,\n4. DELETED\n5. DELETED\n6. DELETED\n
9,2,0,False,Money earned in the UK
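
The numbering in the table above is consistent with two running counters: the section counter goes up at every `section` segment, and the subsection counter goes up at every `subsection` segment and restarts at zero with each new section. The following is only a guess at that logic, not the code of `build_segments_df`. The function name `index_segments` and the `"table"` tag name are assumptions, and the result is a plain list of dicts rather than a DataFrame.

```python
def index_segments(segments):
    """Attach running section/subsection counters to (tag, string) pairs."""
    rows = []
    section, subsection = 0, 0
    for tag, string in segments:
        if tag == "section":
            section += 1
            subsection = 0  # a new section restarts the subsection numbering
        elif tag == "subsection":
            subsection += 1
        rows.append({
            "section": section,
            "subsection": subsection,
            "is table": tag == "table",  # the "table" tag name is an assumption
            "string": string,
        })
    return rows

segments = [
    ("text", "Immigration Rules Appendix A: attributes"),
    ("section", "Attributes for Tier 1 (Exceptional Talent) Migrants"),
    ("text", "2. Available points are shown in Table 1."),
    ("subsection", "Table 1"),
    ("subsection", "Applications for indefinite leave to remain"),
]
indexed = index_segments(segments)
for row in indexed:
    print(row["section"], row["subsection"], row["string"])
```

Passing the resulting list to `pd.DataFrame` would give a frame with the same four columns as the table above.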


In [6]:
for ii in range(5):
    print("#########################")
    print(test_segments[ii][0])
    print("\n")
    print(test_segments[ii][1])
    print("\n")

#########################
text


Immigration Rules Appendix A: attributes
Points needed for attributes for applicants in Tiers 1, 2, 4 and 5 of the points-based system.


#########################
section


Attributes for Tier 1 (Exceptional Talent) Migrants


#########################
text



1. An applicant applying for indefinite leave to remain as a Tier 1 (Exceptional Talent) Migrant must score 75 points for attributes.
2. Available points are shown in Table 1.
3. Notes to accompany the table are shown below the table.



#########################
subsection


Table 1


#########################
subsection


Applications for indefinite leave to remain




# Download plain HTML pages (no JavaScript!)

This is just for testing purposes and not needed for scraping!

In [7]:
fetch_hmtl2disc(url_list,"html_raw/")
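
The helper called above comes from `web_scraping_lib` and its code is not shown. As a rough idea of what saving the raw server response to disk can look like, here is a standard-library sketch. The names `fetch_bytes` and `save_raw_html` and the file naming scheme (last path segment of the URL plus `.html`) are assumptions made for illustration. The `fetch` parameter makes it possible to try the function without network access.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

def fetch_bytes(url):
    """Download the response body exactly as the server sends it."""
    with urlopen(url) as response:
        return response.read()

def save_raw_html(urls, out_dir, fetch=fetch_bytes):
    """Write one .html file per URL, named after the last path segment."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for url in urls:
        # Fall back to "index" if the URL has no path segment to name the file after.
        name = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1] or "index"
        target = out_path / f"{name}.html"
        target.write_bytes(fetch(url))
        written.append(target)
    return written
```

Calling `save_raw_html(url_list, "html_raw/")` would then write one file per document. Opening these files in a text editor shows the same markup that BeautifulSoup parses.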

In [8]:
test_url = url_list[-1]
print(test_url)

https://www.gov.uk/guidance/immigration-rules/immigration-rules-appendix-b-english-language


In [9]:
test_scrape = scrape_govuk_guidance_raw(test_url)
print(test_scrape["timestamp"])

2020-08-17T19:43:16+00:00


In [10]:
docs_scrape_raw = map_df(scrape_govuk_guidance_raw,url_list)

docs_scrape_raw.head()

Unnamed: 0,URL,title,text_dump,hyperlinks_dump,timestamp
0,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules: Index,\nImmigration Rules: Index\nThe rules are divi...,[https://www.gov.uk/guidance/immigration-rules...,2020-08-17T19:43:16+00:00
1,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules part 6A: the points-based sy...,\nImmigration Rules part 6A: the points-based ...,[],2020-08-17T19:43:16+00:00
2,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix A: attributes,\nImmigration Rules Appendix A: attributes\nPo...,[http://www.oanda.com],2020-08-17T19:43:17+00:00
3,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix C: maintenance (funds),\nImmigration Rules Appendix C: maintenance (f...,[],2020-08-17T19:43:17+00:00
4,https://www.gov.uk/guidance/immigration-rules/...,Immigration Rules Appendix B: English language,\nImmigration Rules Appendix B: English langua...,[],2020-08-17T19:43:17+00:00


In [11]:
print(docs_scrape_raw.loc[0,"text_dump"])


Immigration Rules: Index
The rules are divided into different documents. The index page will help you find the part you need.







Paragraph   number






Introduction (Paragraphs 1 to 6C)
 
 


Implementation and transitional provisions
4
 


Application
5
 


Interpretation
6
 


Public funds clarification
6A to 6C
 



Part 1: General   provisions regarding leave   to enter or remain in the   United Kingdom (Paragraphs 7 to 39E)
 
 


Leave   to enter the United Kingdom
7 to 9
 


Exercise of the power to refuse leave to enter the United Kingdom
10
 


Suspension of leave to enter or remain in the United Kingdom
10A
 


Cancellation of leave to enter or remain in the United Kingdom
10B
 


Requirement for persons arriving in the United Kingdom or seeking entry through the Channel Tunnel to produce evidence of identity and nationality
11
 


Requirement for a person not requiring leave to enter the United Kingdom to prove that he has the right   of abode
12 to 14
 


Common Trave

In [12]:
docs_scrape.loc[1,"text_segmented"][0:4]

[('text',
  'Immigration Rules part 6A: the points-based system\nPoints-based system (paragraphs 245AAA to 245ZZE).'),
 ('section', '245AAA.General requirements for indefinite leave to remain'),
 ('text',
  'The following rules apply to all requirements for indefinite leave to remain in Part 6A and Appendix A:\n\n(a) References to a “continuous period” “lawfully in the UK” means, subject to paragraph (e), residence in the UK for an unbroken period with valid leave, and for these purposes a period shall be considered unbroken where:\n    \n(i)\tthe applicant has not been absent from the UK for more than 180 days during any 12 month period in the continuous period, except that:\n        \n(1) any absence from the UK for the purpose of assisting with a national or international humanitarian or environmental crisis overseas shall not count towards the 180 days, if the applicant provides evidence that this was the purpose of the absence(s) and that their Sponsor, if there was one, agreed to

In [13]:
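# Only the last expression in a cell is displayed, so this lookup produces no output.
docs_scrape["title"]
# set_index returns a new DataFrame; docs_scrape itself is left unchanged.
docs_scrape.set_index("title")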

Unnamed: 0_level_0,URL,summary,text_dump,text_segmented,hyperlinks_dump,timestamp
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Immigration Rules: Index,https://www.gov.uk/guidance/immigration-rules/...,The rules are divided into different documents...,\nImmigration Rules: Index\nThe rules are divi...,"[(text, Immigration Rules: Index\nThe rules ar...",[https://www.gov.uk/guidance/immigration-rules...,2020-08-17T19:43:09+00:00
Immigration Rules part 6A: the points-based system,https://www.gov.uk/guidance/immigration-rules/...,Points-based system (paragraphs 245AAA to 245Z...,\nImmigration Rules part 6A: the points-based ...,"[(text, Immigration Rules part 6A: the points-...",[],2020-08-17T19:43:09+00:00
Immigration Rules Appendix A: attributes,https://www.gov.uk/guidance/immigration-rules/...,Points needed for attributes for applicants in...,\nImmigration Rules Appendix A: attributes\nPo...,"[(text, Immigration Rules Appendix A: attribut...",[http://www.oanda.com],2020-08-17T19:43:10+00:00
Immigration Rules Appendix C: maintenance (funds),https://www.gov.uk/guidance/immigration-rules/...,Maintenance (funds),\nImmigration Rules Appendix C: maintenance (f...,"[(text, Immigration Rules Appendix C: maintena...",[],2020-08-17T19:43:13+00:00
Immigration Rules Appendix B: English language,https://www.gov.uk/guidance/immigration-rules/...,English Language,\nImmigration Rules Appendix B: English langua...,"[(text, Immigration Rules Appendix B: English ...",[],2020-08-17T19:43:13+00:00
