# Sample techniques to clean up markup

This notebook highlights a few Beautiful Soup 4 techniques you can use to clean up markup.

## Install packages and set up the soup object

In [97]:
#!pip install beautifulsoup4

# Install the parsers
#!pip install html5lib
#!pip install lxml

1. Import the module

In [109]:
from bs4 import BeautifulSoup

2. Load a sample HTML document into the soup object

In [110]:
# Raw string so the backslashes in the Windows-style path are not read as escape sequences
with open(r"..\sourcefiles\making-a-noise-complaint.html") as fp:
    noise_soup = BeautifulSoup(fp, 'html.parser')
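
The load step above can also be written with `pathlib`, which avoids backslash escapes and works on any operating system. This is a minimal sketch: the helper name `load_soup` and the UTF-8 encoding are assumptions, not part of the original notebook.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(path: Path) -> BeautifulSoup:
    """Parse an HTML file with the built-in html.parser (hypothetical helper)."""
    # Assumption: the source file is UTF-8 encoded.
    with path.open(encoding="utf-8") as fp:
        return BeautifulSoup(fp, "html.parser")


# Build the same relative path as the notebook, one segment at a time.
source = Path("..") / "sourcefiles" / "making-a-noise-complaint.html"

# Only parse when the file is actually present.
if source.exists():
    noise_soup = load_soup(source)
```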

## Find the content

1. Find the main block of content by the `id` attribute of its wrapping control, then give it a simpler `id`

In [100]:
content = noise_soup.find('div', { 'id':'ctl00_PlaceHolderMain_ctl03__ControlWrapper_RichHtmlField'})
content.attrs['id'] = 'sample_page'
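
`find` returns `None` when nothing matches, so the rename would fail with an error on a page that lacks the control. The sketch below shows one way to guard against that; the inline HTML and the id `ctl00_Main` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page.
sample_html = '<div id="ctl00_Main"><p>Hello</p></div>'
soup = BeautifulSoup(sample_html, "html.parser")

block = soup.find("div", {"id": "ctl00_Main"})
if block is None:
    # Fail early with a clear message instead of an error on None.
    raise ValueError("main content block not found")

# Tags support dictionary-style attribute access, equivalent to block.attrs["id"].
block["id"] = "sample_page"
```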

2. Define a class to hold the tag, attribute and value we want to check for (instances can be added to a dictionary later)

In [101]:
class Lcc_markup:
  def __init__(self, tag: str, attr: str, val: str):
    self.tag = tag
    self.attr = attr
    self.val = val

# div class="lcc-tree-container lcc-noise-nuisance"
jp = Lcc_markup('div', 'class', 'lcc-tree-container')
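
If the rules do end up in a dictionary, a `dataclass` gives the same three fields with less boilerplate. This is only a sketch of that idea: the names `MarkupRule` and `rules`, and the key `journey_page`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkupRule:
    """One piece of markup to look for: a tag name plus an attribute/value pair."""
    tag: str
    attr: str
    val: str


# Rules keyed by a descriptive name so they can be looked up or looped over.
rules = {
    "journey_page": MarkupRule("div", "class", "lcc-tree-container"),
}

jp_rule = rules["journey_page"]
```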

## Swap out unwanted controls

3. Replace the markup we don't want with a placeholder. Use the instance of the class as a convenience (this could be put in a loop later).

In [102]:
journey_page_tag = content.find(jp.tag, { jp.attr : jp.val})
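
Beautiful Soup treats `class` as a multi-valued attribute, so searching for one class name also matches tags that carry additional classes. The sketch below shows that, and the loop mentioned above, on made-up HTML; the `rules` list and the `not-present` class are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a two-class div, as in the comment in the cell above.
html = (
    '<div id="sample_page">'
    '<div class="lcc-tree-container lcc-noise-nuisance">widget</div>'
    '<p class="intro">text</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# (tag, attribute, value) triples to look for.
rules = [
    ("div", "class", "lcc-tree-container"),
    ("div", "class", "not-present"),
]

found = {}
for tag, attr, val in rules:
    # find() returns the first match, or None when nothing matches.
    found[val] = soup.find(tag, {attr: val})
```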

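4. Use the BeautifulSoup `string` property to replace the content. Assigning to `string` replaces everything inside the tag with a single text node, so any markup in the assigned value is escaped on output rather than parsed as HTML.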

In [103]:
journey_page_tag.string = '<div class="replace_me">THIS PAGE CONTAINS A JOURNEY PAGE AND NEEDS UPDATING'
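
Because `string` stores plain text, angle brackets in the assigned value come out as `&lt;` and `&gt;`. If a real placeholder element is wanted, one option is to build a new tag and swap it in with `replace_with`. This is a sketch on made-up HTML, not the notebook's own approach.

```python
from bs4 import BeautifulSoup

html = '<div class="lcc-tree-container"><ul><li>x</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")
widget = soup.find("div", {"class": "lcc-tree-container"})

# Markup assigned through .string is kept as text and escaped on output.
widget.string = "<b>bold?</b>"
escaped = str(widget)

# Alternative: create a real element and replace the widget with it.
placeholder = soup.new_tag("div")
placeholder["class"] = "replace_me"
placeholder.string = "THIS PAGE CONTAINS A JOURNEY PAGE AND NEEDS UPDATING"
widget.replace_with(placeholder)
```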

## Remove unnecessary HTML tags like `<div class='col..'`

5. For now, simply store the match criteria in a dictionary (this could become a class later). Beautiful Soup accepts a function, such as a `lambda`, as an attribute filter, so we can pass one in to `find_all`.

In [111]:
strip = {
  'tag' : 'div',
  'matchfunc' : lambda L: L and L.startswith('ctl'),
  'attr' : 'id'
}

# A lambda can be used as the attribute filter; find_all is the current name for findAll
result = content.find_all(strip['tag'], { strip['attr'] : strip['matchfunc'] })
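
The cell above only collects the matching tags. One possible next step, shown here as a sketch on made-up HTML, is `unwrap()`, which removes a tag but keeps its children in place. Whether unwrapping is the right clean-up for the real page is an assumption.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="sample_page">'
    '<div id="ctl00_wrapper"><p>Keep me</p></div>'
    '<div id="other"><p>Also kept</p></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", id="sample_page")

# The filter receives the attribute value, which is None when the attribute is absent.
matches = root.find_all("div", {"id": lambda v: v and v.startswith("ctl")})

for wrapper in matches:
    # Remove the wrapper div itself but leave its contents where they were.
    wrapper.unwrap()
```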



## Save the result to a new file

In [105]:
new_file = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Document</title>
</head>
<body>
    
</body>
</html>
'''
new_soup = BeautifulSoup(new_file, 'html.parser')
new_soup.body.insert(1, content)
# Raw string so the backslashes in the Windows-style path are not read as escape sequences
with open(r"..\cleaned_pages\making-a-noise-complaint.html", "w", encoding='utf-8') as file:
    file.write(new_soup.prettify())
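
The same save step can be tried without touching the project folders. The sketch below moves a fragment into an empty page with `append` and writes it to a temporary directory; the file name `example.html` and the minimal page shell are placeholders.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Minimal page shell and a hypothetical cleaned fragment.
shell = BeautifulSoup("<html><body></body></html>", "html.parser")
fragment = BeautifulSoup('<div id="sample_page"><p>Hello</p></div>', "html.parser").div

# append() moves the tag out of its original tree and into the new body.
shell.body.append(fragment)

# Write to a throwaway directory rather than the real output folder.
out_dir = Path(tempfile.mkdtemp())
out_file = out_dir / "example.html"
out_file.write_text(shell.prettify(), encoding="utf-8")
```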

### Check for double div tags (work in progress)

In [None]:
from bs4 import NavigableString

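for d in content.find_all('div'):
    # Match divs whose only child is a text node other than a bare newline
    if len(d.contents) == 1 and type(d.contents[0]) is NavigableString and d.contents[0] != '\n':
        # print('empty', d.contents[0])
        pass  # placeholder so the block is valid Python while this is in progress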
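
One reading of "double div" is a `div` whose only meaningful child is another `div`. The sketch below detects that case on made-up HTML, ignoring whitespace-only text between tags. The helper name `only_child_div` and this interpretation are assumptions, since the original cell is unfinished.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def only_child_div(tag):
    """Return the single child div of tag, or None (hypothetical helper)."""
    # Drop whitespace-only text nodes such as the newlines between tags.
    children = [
        c for c in tag.contents
        if not (isinstance(c, NavigableString) and not c.strip())
    ]
    if len(children) == 1 and isinstance(children[0], Tag) and children[0].name == "div":
        return children[0]
    return None


html = '<div id="outer">\n<div id="inner"><p>Text</p></div>\n</div>'
soup = BeautifulSoup(html, "html.parser")

# Collect the ids of divs that do nothing except wrap another div.
doubles = [d.get("id") for d in soup.find_all("div") if only_child_div(d) is not None]
```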