# Extract URLs from XML data using BeautifulSoup

The `contents()` method takes a URL, sends a GET request to it using the requests library, and creates a BeautifulSoup object from the XML response text. It then iterates through the `P2` tags, the `P3` tags nested inside each of them, and the `P4` tags nested inside those. For each tag that has a `DocumentURI` attribute, it appends the value of that attribute to a list called `urls`, and finally returns the list. The purpose of this method is to extract the URLs of individual provisions from the XML data so that they can be fetched and processed separately.

In [21]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
from typing import Optional, Any, List, Union, TypeVar, Type, cast, Callable
import re
from statistics import mean 


def contents(url: str) -> List[str]:
    urls = []
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features="xml")
    for a in soup.find_all('P2', DocumentURI=True):
        urls.append(a['DocumentURI'])
        for paragraph in a.find_all('P3', DocumentURI=True):
            urls.append(paragraph['DocumentURI'])
            for lines in paragraph.find_all('P4', DocumentURI=True):
                # Record the P4's own URI, not the URI of its parent P3
                urls.append(lines['DocumentURI'])
    return urls

year = 2004
number = 34

parts = contents(f'https://www.legislation.gov.uk/ukpga/{year}/{number}/data.xml')
print(parts)

['http://www.legislation.gov.uk/ukpga/2004/34/section/1/1', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/1/a', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/1/b', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/2', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/2/a', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/2/b', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/3', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/3/a', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/3/b', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/3/c', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/4', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/4/a', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/4/b', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/4/c', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/4/d', 'http://www.legislation.gov.uk/ukpga/2004/34/section/1/5', 'http://www.legislation.gov.uk/uk
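The nested traversal can be checked without a network request. The sketch below is not the notebook's code: it swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` and runs on a small hand-written fragment. `SAMPLE`, `collect_uris` and the `example.org` URIs are hypothetical names made up for illustration, and the fragment declares no XML namespace, which real responses may do.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, not real legislation.gov.uk data.
SAMPLE = """<Body>
  <P2 DocumentURI="http://example.org/section/1/1">
    <P3 DocumentURI="http://example.org/section/1/1/a">
      <P4 DocumentURI="http://example.org/section/1/1/a/i"/>
    </P3>
    <P3 DocumentURI="http://example.org/section/1/1/b"/>
  </P2>
</Body>"""


def collect_uris(xml_text):
    """Collect DocumentURI values from P2 tags, then nested P3 and P4 tags."""
    root = ET.fromstring(xml_text)
    uris = []
    for p2 in root.iter("P2"):
        if "DocumentURI" not in p2.attrib:
            continue
        uris.append(p2.attrib["DocumentURI"])
        for p3 in p2.iter("P3"):
            if "DocumentURI" not in p3.attrib:
                continue
            uris.append(p3.attrib["DocumentURI"])
            for p4 in p3.iter("P4"):
                if "DocumentURI" in p4.attrib:
                    # Each P4 contributes its own URI
                    uris.append(p4.attrib["DocumentURI"])
    return uris


print(collect_uris(SAMPLE))
```

Each level contributes the URI of the element being visited, so a `P4` entry appears directly after the `P3` that contains it.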

# section(): a method for retrieving and parsing UK legislation sections with BeautifulSoup


The `section()` method retrieves and parses the XML data for a given URL of a section of UK legislation using the BeautifulSoup library. For each paragraph (`P2para`) it prints the opening text, then calls `paragraphLine()` to fetch and print each sub-paragraph (`P3`). For any `P4` elements it currently prints only their data URLs, not their text. Finally, it prints any remaining text at the end of the paragraph that does not belong to a sub-paragraph. This method is intended for sections that do not contain any `BlockAmendments`, which represent amendments to the text of a section and can be handled by a separate method. In the future, this method can be updated to handle `BlockAmendments` and to store all textual information as an object for further processing or analysis.





In [24]:
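def section(url):
    # Note: paragraphLine() is defined in the next cell and must exist before section() is called.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features="xml")
    for a in soup.find_all('P2para'):
        # Opening text of the paragraph
        print(a.find('Text').text)
        for paragraph in a.find_all('P3', DocumentURI=True):
            # Fetch and print each sub-paragraph from its own data.xml
            paragraphLine(paragraph['DocumentURI'] + "/data.xml")
            for lines in paragraph.find_all('P4', DocumentURI=True):
                # P4 text is not fetched yet; only its data URL is printed
                print("P4:", lines['DocumentURI'] + "/data.xml")
        # If there are more Text elements than the opening text plus one per P3,
        # treat the last one as closing text that follows the sub-paragraphs.
        size = len(a.find_all('Text'))
        p3s = len(a.find_all('P3', DocumentURI=True))
        if size - 1 > p3s:
            print(a.find_all('Text')[size - 1].text)


# section() prints as it goes and returns None, so sectionObject is None.
sectionObject = section("http://www.legislation.gov.uk/ukpga/2004/34/section/5/data.xml")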

If a local housing authority consider that a category 1 hazard exists on any residential premises, they must take the appropriate enforcement action in relation to the hazard.
In subsection (1)
					 “the appropriate enforcement action” means whichever of the following courses of action is indicated by subsection (3) or (4)—
a serving an improvement notice under section 11;
b making a prohibition order under section 20;
c serving a hazard awareness notice under section 28;
d taking emergency remedial action under section 40;
e making an emergency prohibition order under section 43;
f making a demolition order under subsection (1) or (2) of section 265 of the Housing Act 1985 (c. 68);
g declaring the area in which the premises concerned are situated to be a clearance area by virtue of section 289(2) of that Act.
If only one course of action within subsection (2) is available to the authority in relation to the hazard, they must take that course of action.
If two or more courses of actio
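The idea of storing a paragraph's text as an object rather than printing it could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's implementation: the `Paragraph` and `SubParagraph` classes, `parse_paragraph`, and the `SAMPLE` fragment are all hypothetical, parsing uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and sub-paragraph text is read from the same fragment rather than fetched from a separate `data.xml`. The trailing-text check reuses the same count comparison as `section()`.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SubParagraph:
    number: str
    text: str


@dataclass
class Paragraph:
    text: str
    sub_paragraphs: list = field(default_factory=list)
    trailing_text: str = ""


# Hypothetical fragment shaped like a P2para element; no namespaces.
SAMPLE = """<P2para>
  <Text>In this section the following apply:</Text>
  <P3 DocumentURI="http://example.org/section/1/1/a">
    <Pnumber>a</Pnumber><P3para><Text>first item;</Text></P3para>
  </P3>
  <P3 DocumentURI="http://example.org/section/1/1/b">
    <Pnumber>b</Pnumber><P3para><Text>second item;</Text></P3para>
  </P3>
  <Text>and nothing else.</Text>
</P2para>"""


def clean(element):
    """Return an element's text, stripped, or '' if it is missing or empty."""
    return (element.text or "").strip() if element is not None else ""


def parse_paragraph(xml_text):
    p2para = ET.fromstring(xml_text)
    texts = list(p2para.iter("Text"))
    p3s = [p3 for p3 in p2para.iter("P3") if "DocumentURI" in p3.attrib]

    paragraph = Paragraph(text=clean(texts[0]) if texts else "")
    for p3 in p3s:
        paragraph.sub_paragraphs.append(
            SubParagraph(number=clean(p3.find(".//Pnumber")),
                         text=clean(p3.find(".//Text")))
        )
    # Same test as section(): more Text elements than the opening text plus
    # one per P3 means the last one is closing text after the sub-paragraphs.
    if len(texts) - 1 > len(p3s):
        paragraph.trailing_text = clean(texts[-1])
    return paragraph


print(parse_paragraph(SAMPLE))
```

Returning a structure like this would let callers decide how to display or analyse the text, instead of relying on printed output.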

# paragraphLine()

The `paragraphLine()` method retrieves and parses the XML data for a given URL of a sub-paragraph in UK legislation using the BeautifulSoup library. It then extracts and prints the sub-paragraph number followed by its text.

`section()` calls this method for every `P3` sub-paragraph it finds. It can also be combined with the `contents()` method to extract and print the content of all sub-paragraphs in a section of UK legislation.





In [25]:
def paragraphLine(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features="xml")
    for paragraph in soup.find_all('P3', DocumentURI=True):
        # Print the sub-paragraph's number followed by its first Text element
        print(paragraph.select('Pnumber')[0].text + " " + paragraph.select('Text')[0].text)

paragraphLine("http://www.legislation.gov.uk/ukpga/2004/34/section/194/1/b/data.xml")

b to take any appropriate action in relation to the tenant in reliance on either of those provisions.
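All three methods build their request URL by appending `/data.xml` to a base address or a `DocumentURI`. A small helper can keep that step in one place. The sketch below is a suggestion rather than part of the notebook: `to_data_url` and `act_data_url` are hypothetical names, and the only URL pattern assumed is the one already used in the cells above.

```python
def to_data_url(document_uri: str) -> str:
    """Return the data.xml URL for a URI, without doubling slashes or the suffix."""
    suffix = "/data.xml"
    base = document_uri.rstrip("/")
    if base.endswith(suffix):
        return base
    return base + suffix


def act_data_url(year: int, number: int) -> str:
    """Build the data.xml URL for a UK Public General Act from its year and number."""
    return to_data_url(f"https://www.legislation.gov.uk/ukpga/{year}/{number}")


print(act_data_url(2004, 34))
print(to_data_url("http://www.legislation.gov.uk/ukpga/2004/34/section/5"))
```

With this in place, a call such as `paragraphLine(to_data_url(paragraph['DocumentURI']))` behaves the same whether or not the URI already ends in `/data.xml`.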
