This notebook is explained at [ManavSehgal.com](https://manavsehgal.com) where you can find other useful Notebooks which can speed up your machine learning workflow and projects.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Use SHIFT+TAB keys to popup inplace code help
%config IPCompleter.greedy = True

# Output multiple statements from one input cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Customize Notebook

**table_from_top.** If the Wikipedia page has one table then use `table_from_top = 1` value. Otherwise count table number from top and replace value to get specific table.

**wikipedia_page.** Specify the wikipedia page name from where to source dataset. The CSV file will be saved with the same name.

**trace.** Set `trace = True` to trace how feature values are extracted. Does not save extracted dataset. Prefixes applied parsing/extraction rules to extracted values.

In [2]:
table_from_top = 1
wikipedia_page = 'List_of_tallest_buildings_and_structures'
trace = False

## Load and Parse

This section loads the Wikipedia page and parses the table data we are interested in converting to a dataset.

In [3]:
wikipedia_url = 'https://en.wikipedia.org/wiki/{}'.format(wikipedia_page)
page = requests.get(wikipedia_url)
soup = BeautifulSoup(page.content, 'lxml')
tables = soup.find_all('table', {'class': 'wikitable'})
table = tables[table_from_top - 1]

## Quick Preview

This section extracts the table header with feature or column names.

Use this section to quick preview if you have the right table in processing.

In [4]:
feature_names = []

header_row = table.find('tr')
for header in header_row.find_all('th'):
    feature_name = ' '.join(header.find_all(text=True))
    feature_name.replace('\n', '')
    feature_names.append(feature_name)

'Category'

'Structure'

'Country'

'City'

'Height (metres)'

'Height (feet)'

'Year built'

'Coordinates'

## Data Wrangling

This section applies data wrangling rules based on exceptions found when parsing Wikipedia tables.

- If a feature value contains a link then extract text from the link.
- Ignore text which starts with `[` square brackets.
- Ignore image links (...flags) prefix link text.
- Ignore hidden text used for IDs.

In [5]:
def has_coords(tag):
    if tag.has_attr('class'):
        if tag['class'][0] == 'latitude' or tag['class'][0] == 'longitude':
            return True
    return False

def get_coords(child):
    coords = []
    for coord in child.find_all(has_coords):
        coords.append(coord.string)
    if coords:
        if trace:
            return 'C = {}'.format(' '.join(coords))
        else:
            return ' '.join(coords)
    else:
        return ''

samples = []
sample_rows = table.find_all('tr')[1:]
for sample_row in sample_rows:
    features = []
    for feature_col in sample_row.find_all('td'):
        feature_value = ''
        text = feature_col.string
        if text:
            if trace:
                features.append('T = {}'.format(text))
            else:
                features.append(text)
            continue
        
        for child in feature_col.children:
            if child.name == 'span':
                if child.has_attr('class'):
                    if child['class'] == 'display:none':
                        continue
                if child.find_all(has_coords):
                    feature_value = get_coords(child)
                    if feature_value:
                        break
                    else:
                        continue
            if child.name == 'sup':
                continue
            if child.name == 'a':
                if child.string[0] == '[':
                    continue            
            if child.name == 'a':
                if trace:
                    feature_value = 'A = {}'.format(child.string)
                else:
                    feature_value = child.string
                break
            if child.name == 'font':
                if trace:
                    feature_value = 'F = {}'.format(child.string)
                else:
                    feature_value = child.string
                break
            try:
                # feature_value = '' for any tags not covered above
                content = child.contents
            except AttributeError:
                # Handle whitespace between child tags, treated as a child string
                if child.isspace():
                    continue
                if trace:
                    feature_value = 'E = {}'.format(child)
                else:
                    feature_value = child
                break
        features.append(feature_value)
    samples.append(dict(zip(feature_names, features)))

## Preview Dataset

This section enables you to preview the parsed dataset.

In [6]:
df = pd.DataFrame(samples)
df.head()
df.tail()

Unnamed: 0,Category,City,Coordinates,Country,Height (feet),Height (metres),Structure,Year built
0,Mixed use,Dubai,25°11′50.0″N 55°16′26.6″E,United Arab Emirates,2717,828.1,Burj Khalifa,2010
1,Self-supporting tower,Tokyo,35°42′36.5″N 139°48′39″E,Japan,2080,634.0,Tokyo Skytree,2011
2,Guyed steel lattice mast,"Blanchard, North Dakota",47°20′32″N 97°17′25″W,United States,2063,628.8,KVLY-TV mast,1963
3,Clock building,Mecca,21°25′08″N 39°49′35″E,Saudi Arabia,1972,601.0,Abraj Al Bait Towers,2011
4,Office,"New York, NY",40°42′46.8″N 74°0′48.6″W,United States,1776,541.0,One World Trade Center,2013


Unnamed: 0,Category,City,Coordinates,Country,Height (feet),Height (metres),Structure,Year built
46,Aerial tramway,Kaprun,47°11′58.62″N 12°41′16.96″E,Austria,373,113.6,Pillar of,1966
47,Sphere,Stockholm,59°17′36.92″N 18°04′58.79″E,Sweden,279,85.0,Ericsson Globe,1989
48,Brick,Genoa,44°24′16.25″N 8°54′16.67″E,Italy,253,77.0,Lighthouse of Genoa,1543
49,Gopuram,Murudeshwara,14°05′39″N 74°29′07″E,India,249,76.0,Murudeshwara Temple,2008
50,Wooden,Săpânța,47°58′59.5″N 23°42′02.5″E,Romania,246,75.0,Church of the Săpânța-Peri Monastery,2003


## Save Dataset

We can now save the dataset using the same Wikipedia page name we use earlier to extract the dataset.

In [7]:
dataset_file_name = '../datasets/wikipedia/{}.csv'.format(wikipedia_page)
if not trace:
    df.to_csv(dataset_file_name, index=False)