# Finding sections containing a country

Here we take every document linked from https://www.gov.uk/guidance/immigration-rules
and find occurrences of country names in them.

We also extract all sections with a given country name in them.

Country names are detected with the geotext package; see https://pypi.org/project/geotext/

In [1]:
#!pip install https://github.com/elyase/geotext/archive/master.zip
from geotext import GeoText
from web_scraping_lib import scrape_govuk_guidance

In [3]:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
from web_scraping_lib import *

import pandas as pd
from scraping_lib import scrape_documents

Scraping all documents from https://www.gov.uk/guidance/immigration-rules

In [4]:
immigration_rules_url = "https://www.gov.uk/guidance/immigration-rules"

# Parse the index page and collect the links found in its section list.
soup = BeautifulSoup(urllib.request.urlopen(immigration_rules_url), 'html.parser')
tag = soup.article.find(attrs={'class': 'section-list'})
links = get_links_raw(tag, immigration_rules_url)

# Scrape every linked document into a DataFrame, one row per document.
scrape_df = pd.DataFrame(scrape_documents(links))
#scrape_df.to_csv('immigration_rules_scrape.csv', encoding='utf-8', index=False)

scraping 1/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-index
scraping 2/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-introduction
scraping 3/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-1-leave-to-enter-or-stay-in-the-uk
scraping 4/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-2-transitional-provisions
scraping 5/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-3-students
scraping 6/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-4-work-experience
scraping 7/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-5-working-in-the-uk
scraping 8/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-6-self-employment-and-business-people
scraping 9/51: https://www.gov.uk/guidance/immigration-rules/immigration-rules-part-6a-the-points-based-system
scraping 10/51: https://www.gov.uk/guidance/immigration-rule

In [5]:
text_segment = scrape_df['text_segmented'].tolist()

This function takes a country name and returns a list of tuples, each holding the document index and section index of a section that mentions that country.

In [6]:
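def locate_countrysection(country):
    # Collect (document index, section index) pairs for sections mentioning `country`.
    matches = []  # avoid shadowing the built-in `list`
    for i in range(len(text_segment)):
        for j in range(len(text_segment[i])):
            # Element 1 of each section holds its text.
            if country in GeoText(text_segment[i][j][1]).countries:
                matches.append((i, j))
    return matches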

In [7]:
locate_countrysection('Guernsey')

[(6, 3),
 (8, 2),
 (8, 83),
 (10, 176),
 (23, 54),
 (23, 56),
 (31, 5),
 (31, 25),
 (32, 20),
 (33, 7),
 (41, 6)]
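Calling the function above once per country re-scans every section each time. A minimal sketch of an alternative, assuming the same nested layout as `text_segment`, builds the country-to-location mapping in a single pass and then maps the index pairs back to section text. The toy data, the `(title, text)` section shape, `KNOWN_COUNTRIES`, `find_countries` and `build_country_index` are all hypothetical names introduced for illustration; in the notebook the detector role is played by `GeoText(text).countries`.

```python
from collections import defaultdict

# Toy stand-in for scrape_df['text_segmented']: one list per document.
# Each section is assumed to be a (title, text) pair, so the text sits at index 1.
text_segment = [
    [("1", "Applicants must be in the United Kingdom."),
     ("2", "This applies to Guernsey and Jersey.")],
    [("A", "No place names here."),
     ("B", "Residents of Guernsey may apply.")],
]

# Hypothetical detector that keeps the sketch dependency-free.
KNOWN_COUNTRIES = ("United Kingdom", "Guernsey", "Jersey")


def find_countries(text):
    """Return the known country names that occur in `text`."""
    return [name for name in KNOWN_COUNTRIES if name in text]


def build_country_index(segments, detect=find_countries):
    """Map each country to the (doc_index, section_index) pairs mentioning it."""
    index = defaultdict(list)
    for i, doc in enumerate(segments):
        for j, section in enumerate(doc):
            # set() so a country named twice in one section is recorded once.
            for country in set(detect(section[1])):
                index[country].append((i, j))
    return dict(index)


country_index = build_country_index(text_segment)
guernsey_hits = country_index["Guernsey"]

# Map the index pairs back to the section text they point at.
guernsey_texts = [text_segment[i][j][1] for i, j in guernsey_hits]
```

Passing `detect=lambda text: GeoText(text).countries` would plug the real detector into the same loop, at the cost of one GeoText call per section rather than one per section per query.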

The following function takes the index of a document and returns the indices of all sections in that document that mention at least one country.

In [8]:
def finding_countrysection(doc_index):
    # Keep the sections in which GeoText detects at least one country.
    return [i for i in range(len(text_segment[doc_index]))
            if GeoText(text_segment[doc_index][i][1]).countries != []]

Example for document 5

In [9]:
finding_countrysection(5)

[0, 32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 44]

In [10]:
GeoText(text_segment[5][0][1]).countries

['United Kingdom']

The following function takes the index of a document and returns the indices of all sections in that document that mention a country other than the United Kingdom.

In [11]:
def finding_nonUK_countrysection(doc_index):
    # Drop the sections whose only detected country is the United Kingdom.
    return [i for i in finding_countrysection(doc_index)
            if list(set(GeoText(text_segment[doc_index][i][1]).countries)) != ['United Kingdom']]

Example for document 6

In [12]:
finding_nonUK_countrysection(6)

[3, 80]

In [13]:
list(set(GeoText(text_segment[6][80][1]).countries))

['Bahamas',
 'Jamaica',
 'Australia',
 'Canada',
 'Dominica',
 'Belize',
 'Guyana',
 'Grenada',
 'Barbados',
 'New Zealand',
 'United States',
 'United Kingdom',
 'Ireland']
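The non-UK filter compares a de-duplicated list against `['United Kingdom']`. The same test can be written as a set difference, which may read more directly. The sketch below works on plain lists of country names, as a detector might return them, so it runs without GeoText; `mentions_other_country`, `non_home_sections` and the sample data are hypothetical and not part of the notebook.

```python
def mentions_other_country(countries, home="United Kingdom"):
    """True if `countries` names at least one country other than `home`."""
    return bool(set(countries) - {home})


def non_home_sections(countries_per_section, home="United Kingdom"):
    """Indices of sections that mention a country other than `home`."""
    return [i for i, countries in enumerate(countries_per_section)
            if mentions_other_country(countries, home)]


# One list of detected countries per section; repeats and empty lists are allowed.
sample = [
    ["United Kingdom"],
    [],
    ["United Kingdom", "Ireland", "United Kingdom"],
    ["Canada"],
]
selected = non_home_sections(sample)
```

Sections with no detected country fall out as well, because an empty set minus `{home}` is still empty. This matches the notebook's behaviour of filtering only the sections that contain a country.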