# Retrieving, Scraping and Decorating a Single Table

### Design Decisions

* **Immutability:** We treat the data as immutable. Every function returns new data; nothing is modified in place.
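
As a minimal sketch of what this convention looks like in practice, here is a hypothetical `add_link` helper (not part of the notebook, and the URL is a placeholder). It builds a new dict instead of writing into the one it was given.

```python
def add_link(table_dict, url):
    """Return a new dict with a 'Link' entry; the input dict is left untouched."""
    return {**table_dict, "Link": url}


original = {"Make Model": "Suzuki SV 650"}
decorated = add_link(original, "https://example.com/sv650")  # placeholder URL

print(original)   # still has no 'Link' key
print(decorated)  # the copy carries the extra entry
```

The cost is an extra copy per step, which is negligible for tables of this size.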

### Music Decisions

Code to https://www.youtube.com/watch?v=M5QY2_8704o


In [188]:
import requests
from bs4 import BeautifulSoup
import re
import time

urls = ["https://www.motorcyclespecs.co.za/model/triu/triumph-street-triple-rs-22.html",
        "https://www.motorcyclespecs.co.za/model/ducati/ducati_scrambler_desert_sled_21.html",
        "https://www.motorcyclespecs.co.za/model/bmw/bmw-r100-7-78.html",
        "https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_zrx1100%2097.htm",
        "https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_zrx1200r%2004.htm",
        "https://www.motorcyclespecs.co.za/model/triu/triumph_speed_triple_1200rs_21.html",
        "https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_z900rs_20.html",
        "https://www.motorcyclespecs.co.za/model/yamaha/yamaha_r1_20.html",
        "https://www.motorcyclespecs.co.za/model/suzu/suzuki_gsxr1300r_21.html",
        "https://www.motorcyclespecs.co.za/model/suzu/suzuki_sv_650_18.htm",
        "https://www.motorcyclespecs.co.za/model/suzu/suzuki_sv650n%2007.htm"
        ]



In [189]:
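def retrieve_url_to_string(url):
    # Request the page; fail loudly on HTTP errors instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text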

def parse_table_to_dict(text):
    soup = BeautifulSoup(text, 'lxml')  # requires the lxml parser: pip install lxml
    
    # Find the table via the one string that contains "Make Model"
    hook = soup.find_all(string=re.compile(".*Make Model.*"))
    assert len(hook) == 1  # Expect exactly one NavigableString containing "Make Model"
    hook = hook[0]
    table = hook.find_parent('table')
    rows = table.find_all('tr')
    assert len(rows) > 1  # Expect a table with more than one row
    
    # Extract the key:value pairs
    output = {}
    for row in rows:
        cols = row.find_all('td')
        assert(len(cols) == 2) # We expect two columns in each row, which we treat as a key:value pair
        key = cols[0].get_text().strip()
        value = cols[1].get_text().strip().replace("\t", "")
        output[key] = value
    return output

def decorate_table(table_dict, key_vals):
    # Dict union (Python 3.9+) returns a new dict; neither input is modified
    return table_dict | key_vals


In [190]:
raw_html = retrieve_url_to_string(urls[9])
type(raw_html)

str

In [191]:
table_dict = parse_table_to_dict(raw_html)
type(table_dict)

dict

In [192]:
table_dict_decorated = decorate_table(table_dict, {"Link":urls[9]})
type(table_dict_decorated)

dict

In [193]:
table_dict_decorated

{'Make Model': 'Suzuki SV \n    650 / ABS',
 'Year': '2018 - 19',
 'Engine': 'Four stroke, 90°-V-twin, \n    DOHC, 4 valves per cylinder',
 'Capacity': '645 cc / 39.3 cu in',
 'Bore x Stroke': '81 x 62.6 mm',
 'Cooling System': 'Liquid cooled',
 'Compression Ratio': '11.2:1',
 'Lubrication': 'Wet sump',
 'Induction': 'Fuel Injection, 39mm \nthrottle bodies',
 'Ignition': 'Electronic ignition (Transistorized)',
 'Starting': 'Electric',
 'Max Power': '56 kW / 75 hp @ 8500 rpm',
 'Max Torque': '64 Nm / 6.5 kgf-m / 47.2 lb-ft @ 8100 rpm',
 'Transmission': '6 Speed, constant mesh',
 'Final Drive': 'Chain, DID525V8, 114 links',
 'Frame': 'Compact trellis steel frame',
 'Rake': '25o',
 'Trail': '104 mm / 4.1 in',
 'Front Suspension': 'Telescopic fork, coil spring, oil \n    damped',
 'Front \n\nWheel Travel': '125 mm / 4.9 in',
 'Rear Suspension': 'Link type, coil spring, oil damped, spring \n    preload 7-step adjustable',
 'Rear \n\nWheel Travel': '130 mm / 5.1 in',
 'Front Brakes': '2 x 29

In [194]:
tables = []
for url in urls:
    print("Retrieving", url)
    raw_html = retrieve_url_to_string(url)
    table_dict = parse_table_to_dict(raw_html)
    table_dict_decorated = decorate_table(table_dict, {"Link":url})
    tables.append(table_dict_decorated)

Retrieving https://www.motorcyclespecs.co.za/model/triu/triumph-street-triple-rs-22.html
Retrieving https://www.motorcyclespecs.co.za/model/ducati/ducati_scrambler_desert_sled_21.html
Retrieving https://www.motorcyclespecs.co.za/model/bmw/bmw-r100-7-78.html
Retrieving https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_zrx1100%2097.htm
Retrieving https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_zrx1200r%2004.htm
Retrieving https://www.motorcyclespecs.co.za/model/triu/triumph_speed_triple_1200rs_21.html
Retrieving https://www.motorcyclespecs.co.za/model/kawasaki/kawasaki_z900rs_20.html
Retrieving https://www.motorcyclespecs.co.za/model/yamaha/yamaha_r1_20.html
Retrieving https://www.motorcyclespecs.co.za/model/suzu/suzuki_gsxr1300r_21.html
Retrieving https://www.motorcyclespecs.co.za/model/suzu/suzuki_sv_650_18.htm
Retrieving https://www.motorcyclespecs.co.za/model/suzu/suzuki_sv650n%2007.htm
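
The loop above requests every page back to back and stops at the first failure. A possible variation, sketched here under assumptions, pauses between requests and collects failures instead of aborting. `fetch_all`, its parameters, and `fake_fetch` are hypothetical names; in the notebook you would pass `retrieve_url_to_string` as `fetch`.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is any callable that takes a URL and returns its text.
    Returns (pages, failures): dicts keyed by URL holding the text or the exception.
    """
    pages = {}
    failures = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # be polite to the server between requests
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # deliberately broad for a sketch
            failures[url] = exc
    return pages, failures


# Stand-in fetcher so the sketch runs without touching the network.
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("simulated failure")
    return "<html>" + url + "</html>"


naps = []  # records the pauses instead of actually sleeping
pages, failures = fetch_all(["a", "bad", "c"], fake_fetch,
                            delay_seconds=0.5, sleep=naps.append)
print(sorted(pages), sorted(failures), naps)
```

Passing the fetcher in as an argument keeps the loop testable without network access.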


# Cleaning the Table

There are several things we want to do here:

### Formatting of individual entries

The text often contains stray tabs and newlines. Sometimes the newlines mark an ad-hoc sub-table inside a single entry, for example "Height: X\n Width: Y". It's a mess.

This also happens in the keys, not just the values: "Front Wheel" can show up as "Front\n Wheel".

We're going to need a bunch of special cases here.
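
One way to tell the two kinds of newline apart, as a rough sketch: treat a value as a sub-table only if every non-empty line looks like `Label: value`, and otherwise just collapse the whitespace. `normalise_whitespace` and `split_subtable` are hypothetical helpers, and the height and width figures are made up for illustration.

```python
def normalise_whitespace(text):
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


def split_subtable(value):
    """Return {label: value} if every non-empty line is 'Label: value', else None."""
    lines = [normalise_whitespace(line) for line in value.splitlines()]
    lines = [line for line in lines if line]
    if len(lines) < 2:
        return None  # a single line is never treated as a sub-table
    pairs = {}
    for line in lines:
        label, sep, rest = line.partition(":")
        if not sep or not label.strip() or not rest.strip():
            return None  # at least one line is plain text, so give up
        pairs[label.strip()] = rest.strip()
    return pairs


print(normalise_whitespace("Front \n\nWheel Travel"))
print(split_subtable("Height: 1090 mm\n Width: 760 mm"))   # made-up figures
print(split_subtable("Four stroke, 90°-V-twin, \n    DOHC, 4 valves per cylinder"))
```

Values such as `11.2:1` contain a colon but sit on a single line, so they are left alone.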

### Combined Entries

Motorcyclespecs has a habit of packing "Make Model" into a single entry, when it would be much nicer to have separate Make and Model entries. Similarly, "Bore x Stroke" packs two measurements into one entry.
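
A minimal sketch of both splits, assuming the keys have already been title-cased (so the key reads `Bore X Stroke`) and the values already have their whitespace collapsed. The function names and the new key names (`Make`, `Model`, `Bore (mm)`, `Stroke (mm)`) are hypothetical. Taking the first word as the make is a simplification that would break for multi-word makes.

```python
import re

# Two numbers separated by "x", followed by "mm", e.g. "81 x 62.6 mm"
BORE_STROKE = re.compile(r"(\d+(?:\.\d+)?)\s*[xX]\s*(\d+(?:\.\d+)?)\s*mm")


def split_make_model(table_dict):
    """Return a new dict with 'Make' and 'Model' in place of 'Make Model'."""
    make, _, model = table_dict["Make Model"].partition(" ")
    rest = {k: v for k, v in table_dict.items() if k != "Make Model"}
    return {"Make": make, "Model": model, **rest}


def split_bore_stroke(table_dict):
    """Return a new dict with numeric bore and stroke added when they can be parsed."""
    match = BORE_STROKE.search(table_dict.get("Bore X Stroke", ""))
    if match is None:
        return dict(table_dict)  # nothing recognisable; return an unchanged copy
    bore, stroke = (float(group) for group in match.groups())
    return {**table_dict, "Bore (mm)": bore, "Stroke (mm)": stroke}


row = {"Make Model": "Suzuki SV 650N / ABS", "Bore X Stroke": "81 x 62.6 mm"}
result = split_bore_stroke(split_make_model(row))
print(result)
```

The original `Bore X Stroke` string is kept alongside the parsed numbers so nothing is lost if the pattern misreads an entry.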

### Different keys for the same type of value

The tables are not consistent in naming. For example, "Front Wheel", "Wheels Front", "Front Wheels" all refer to the same thing.
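
One possible way to handle this is a lookup table from known variants to a canonical name. The sketch below matches whole keys only, so a key like "Front Wheel Travel" is never touched by accident. `KEY_ALIASES`, `canonical_key` and `rename_keys` are hypothetical names, and the sample values are made up.

```python
# Known variants mapped to the name we want to keep (whole-key matches only)
KEY_ALIASES = {
    "Front Wheels": "Front Wheel",
    "Wheels Front": "Front Wheel",
    "Rear Wheels": "Rear Wheel",
    "Wheels Rear": "Rear Wheel",
}


def canonical_key(key):
    """Map a known variant to its canonical name; unknown keys pass through."""
    return KEY_ALIASES.get(key, key)


def rename_keys(table_dict):
    """Return a new dict with canonical keys, refusing to merge two entries silently."""
    renamed = {}
    for key, value in table_dict.items():
        new_key = canonical_key(key)
        if new_key in renamed:
            raise ValueError(f"{key!r} collides with an existing {new_key!r} entry")
        renamed[new_key] = value
    return renamed


sample = {"Wheels Front": "17 in", "Front Wheel Travel": "125 mm"}  # made-up values
print(rename_keys(sample))
```

Raising on a collision is a design choice: it forces a decision about which of the two entries to keep.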

### Units

This is a real rat's nest. It would be nice if all torque and power figures used the same units, but they are reported inconsistently from page to page: kW first on one page and hp first on another, "kgf-m" on one and "kg-m" on another. Oi.
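
As a starting point, here is a sketch that pulls a kW figure and an rpm figure out of a "Max Power" string regardless of the order the units appear in. `parse_power` is a hypothetical helper. When only hp is present it converts using mechanical horsepower (0.7457 kW per hp), which is an assumption: the site may mean metric horsepower in places.

```python
import re

POWER_KW = re.compile(r"(\d+(?:\.\d+)?)\s*kW", re.IGNORECASE)
POWER_HP = re.compile(r"(\d+(?:\.\d+)?)\s*hp", re.IGNORECASE)
RPM = re.compile(r"@\s*(\d+)\s*rpm", re.IGNORECASE)

KW_PER_HP = 0.7457  # mechanical horsepower; metric horsepower would be 0.7355


def parse_power(text):
    """Return (kilowatts, rpm); either element is None when it cannot be found."""
    kw = POWER_KW.search(text)
    hp = POWER_HP.search(text)
    rpm = RPM.search(text)

    if kw:
        kilowatts = float(kw.group(1))  # prefer the kW figure when it is given
    elif hp:
        kilowatts = round(float(hp.group(1)) * KW_PER_HP, 1)
    else:
        kilowatts = None

    revs = int(rpm.group(1)) if rpm else None
    return (kilowatts, revs)


print(parse_power("56 kW / 75 hp @ 8500 rpm"))
print(parse_power("121.3 hp / 90 kW @ 11750 rpm"))
```

Torque strings would need the same treatment with their own unit patterns.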


In [195]:
def clean_keys(table_dict):
    d = {}
    for k, v in table_dict.items():
        # Remove punctuation (note that '/' is not in the list, so it survives)
        k = k.translate(str.maketrans('', '', '!"#$%&\'()*+,-.:;<=>?@[\\]^_`{|}~'))
        
        # Collapse runs of whitespace into single spaces
        k = " ".join(k.split())
        
        # Ensure Title Case
        k = k.title()

        # Collapse certain special cases. DANGER, THIS WILL COLLIDE TABLE ENTRIES
        k = k.replace("Abs", "ABS")
        k = k.replace("Rear Wheels", "Rear Wheel")
        k = k.replace("Wheels Rear", "Rear Wheel")
        k = k.replace("Front Wheels", "Front Wheel")
        k = k.replace("Wheels Front", "Front Wheel")
        k = k.replace("Final Reduction Ratio", "Final Reduction")
        k = k.replace("Gear Ratios", "Gear Ratio")
        k = k.replace("Primary Reduction Ratio", "Primary Reduction")
        
        # Save the new key
        assert(k not in d)
        d[k] = v
        
    return d
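
The `assert` in `clean_keys` reports that a collision happened but not which raw keys caused it. A small diagnostic along these lines could help when it fires. This is a sketch: `find_key_collisions` is a hypothetical helper, and `simple_clean` is a cut-down stand-in for the per-key logic above, not a copy of it.

```python
from collections import defaultdict


def find_key_collisions(table_dict, clean):
    """Group raw keys by their cleaned form; keep only groups with several raw keys."""
    groups = defaultdict(list)
    for raw_key in table_dict:
        groups[clean(raw_key)].append(raw_key)
    return {cleaned: raws for cleaned, raws in groups.items() if len(raws) > 1}


def simple_clean(key):
    # Stand-in cleaner: collapse whitespace, title-case, fold one plural form
    return " ".join(key.split()).title().replace("Front Wheels", "Front Wheel")


example = {"Front Wheel": "a", "Front \n Wheels": "b", "Rake": "c"}
collisions = find_key_collisions(example, simple_clean)
print(collisions)
```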


In [196]:
table_dict_decorated_keys_cleaned = clean_keys(table_dict_decorated)
table_dict_decorated_keys_cleaned
    

{'Make Model': 'Suzuki\n      SV 650N / ABS',
 'Year': '2007 - 08',
 'Engine': 'Four stroke, 90°-V-twin, \n    DOHC, 4 valves per cylinder',
 'Capacity': '645 cc / 39.4 cu-in',
 'Bore X Stroke': '81\n      x 62.6 mm',
 'Cooling System': 'Liquid cooled',
 'Compression Ratio': '11.5:1',
 'Lubrication': 'Wet sump',
 'Engine Oil': 'Synthetic, 10W40',
 'Induction': 'Fuel Injection',
 'Ignition': 'Digital\xa0transistorized',
 'Spark Plug': 'NGK, CR8E',
 'Starting': 'Electric',
 'Max Power': '54.7 kW / 73.4 \nhp @ 8800 rpm',
 'Max Torque': '64\n      Nm / 6.53 kg-m / 47.2 lb-ft @ 7200 rpm',
 'Clutch': 'Wet, multiple discs, cable operated',
 'Transmission': '6\n      Speed',
 'Primary Reduction': '34/71 (2.088)',
 'Gear Ratio': '1 st 32/13 (2.461) / 2nd 32/18 (1.777) / 3rd 29/21 \n(1.380) / 4th 27/24 (1.125) / 5th 25/26 (0.961) / 6th 23/27 (0.851)',
 'Final Reduction': '45/14 (3.000)',
 'Final Drive': 'Chain, #525 O-ring',
 'Frame': 'Pressure cast aluminium alloy diamond truss',
 'Front Suspen

In [197]:
keys_set = set()
for t in tables:
    t = clean_keys(t)
    keys_set = keys_set.union(t.keys())

print(len(keys_set))
keys_set


62


{'ABS',
 'Alternator',
 'Bevel / Crown Wheel',
 'Bore X Stroke',
 'Capacity',
 'Clutch',
 'Compression Ratio',
 'Consumption Average',
 'Cooling System',
 'Dashboard',
 'Desert Sled Equipment',
 'Dimensions',
 'Dry Weight',
 'Emission',
 'Engine',
 'Engine Management',
 'Engine Oil',
 'Exhaust',
 'Final Drive',
 'Final Reduction',
 'Frame',
 'Front Brakes',
 'Front Suspension',
 'Front Tyre',
 'Front Wheel',
 'Front Wheel Travel',
 'Fuel Capacity',
 'Gear Ratio',
 'Gear Ratio Sport Version',
 'Ground Clearance',
 'Ignition',
 'Induction',
 'Instruments',
 'Link',
 'Lubrication',
 'Make Model',
 'Max Power',
 'Max Torque',
 'Oil Capacity',
 'Primary Drive',
 'Primary Reduction',
 'Rake',
 'Rear Brakes',
 'Rear Suspension',
 'Rear Tyre',
 'Rear Wheel',
 'Rear Wheel Travel',
 'Rider Aids',
 'Seat Height',
 'Spark Plug',
 'Standard Equipment',
 'Standing ¼ Mile',
 'Starting',
 'Swingarm',
 'Top Speed',
 'Total Steering Lock',
 'Trail',
 'Transmission',
 'Wet Weight',
 'Wheelbase',
 'Wheels
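
Knowing which keys exist is only half the picture. It also helps to know how many tables actually carry each one, since rare keys turn into mostly-empty columns later. A minimal sketch, using a hypothetical `key_coverage` helper and made-up sample dicts:

```python
from collections import Counter


def key_coverage(dicts):
    """Count how many of the given dicts contain each key."""
    counts = Counter()
    for d in dicts:
        counts.update(d.keys())  # each key counts once per dict
    return counts


sample = [
    {"Make Model": "A", "Rake": "25"},
    {"Make Model": "B"},
    {"Make Model": "C", "Rake": "24", "ABS": "Yes"},
]
coverage = key_coverage(sample)
rare = sorted(key for key, n in coverage.items() if n == 1)
print(coverage.most_common())
print(rare)
```

In the notebook this could be run over the cleaned versions of `tables` to spot keys that appear for a single bike.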

In [198]:
def clean_vals(table_dict):
    d = {}
    for k, v in table_dict.items():
        # Collapse runs of whitespace (tabs, newlines, non-breaking spaces) into single spaces
        v = " ".join(v.split())

        # Save the cleaned value
        d[k] = v
        
    return d


In [199]:
table_dict_decorated_keys_and_vals_cleaned = clean_vals(table_dict_decorated_keys_cleaned)
table_dict_decorated_keys_and_vals_cleaned

{'Make Model': 'Suzuki SV 650N / ABS',
 'Year': '2007 - 08',
 'Engine': 'Four stroke, 90°-V-twin, DOHC, 4 valves per cylinder',
 'Capacity': '645 cc / 39.4 cu-in',
 'Bore X Stroke': '81 x 62.6 mm',
 'Cooling System': 'Liquid cooled',
 'Compression Ratio': '11.5:1',
 'Lubrication': 'Wet sump',
 'Engine Oil': 'Synthetic, 10W40',
 'Induction': 'Fuel Injection',
 'Ignition': 'Digital transistorized',
 'Spark Plug': 'NGK, CR8E',
 'Starting': 'Electric',
 'Max Power': '54.7 kW / 73.4 hp @ 8800 rpm',
 'Max Torque': '64 Nm / 6.53 kg-m / 47.2 lb-ft @ 7200 rpm',
 'Clutch': 'Wet, multiple discs, cable operated',
 'Transmission': '6 Speed',
 'Primary Reduction': '34/71 (2.088)',
 'Gear Ratio': '1 st 32/13 (2.461) / 2nd 32/18 (1.777) / 3rd 29/21 (1.380) / 4th 27/24 (1.125) / 5th 25/26 (0.961) / 6th 23/27 (0.851)',
 'Final Reduction': '45/14 (3.000)',
 'Final Drive': 'Chain, #525 O-ring',
 'Frame': 'Pressure cast aluminium alloy diamond truss',
 'Front Suspension': 'Telescopic 41 mm, oil damped, ful

In [200]:
# Clean keys first, then values, for every scraped table
cleaned_tables = [clean_vals(clean_keys(t)) for t in tables]
    

In [201]:
cleaned_tables

[{'Make Model': 'Triumph Street Triple RS',
  'Year': '2022',
  'Engine': 'Four stroke, in-line 3-cylinder, DOHC, 4 valve per cylinder',
  'Capacity': '765 cc / 46.6 cu-in',
  'Bore X Stroke': '78 x 53.4 mm',
  'Cooling System': 'Liquid-cooled',
  'Compression Ratio': '12.54:1',
  'Lubrication': 'Wet sump',
  'Induction': 'Multipoint sequential electronic fuel injection with SAI. Electronic throttle control',
  'Exhaust': 'Stainless steel 3 into 1 exhaust system low single sided stainless steel silencer',
  'Emission': 'Euro 5',
  'Ignition': 'Digital - inductive type',
  'Starting': 'Electric',
  'Max Power': '121.3 hp / 90 kW @ 11750 rpm',
  'Max Torque': '79 Nm / 58.3 lb-ft @ 9350 rpm',
  'Clutch': 'Wet, multi-plate, slip-assisted',
  'Transmission': '6 Speed with Triumph Shift Assist',
  'Final Drive': 'X ring chain',
  'Frame': 'Front - Aluminum beam twin spar Rear - 2 piece high pressure die cast',
  'Swingarm': 'Twin-sided, cast aluminum alloy',
  'Front Suspension': 'Showa 41 m

In [202]:
import pandas as pd
df = pd.DataFrame(cleaned_tables)
df

Unnamed: 0,Make Model,Year,Engine,Capacity,Bore X Stroke,Cooling System,Compression Ratio,Lubrication,Induction,Exhaust,...,Consumption Average,Standing ¼ Mile,Instruments,Primary Reduction,Final Reduction,Engine Management,Rider Aids,Engine Oil,Spark Plug,Oil Capacity
0,Triumph Street Triple RS,2022,"Four stroke, in-line 3-cylinder, DOHC, 4 valve...",765 cc / 46.6 cu-in,78 x 53.4 mm,Liquid-cooled,12.54:1,Wet sump,Multipoint sequential electronic fuel injectio...,Stainless steel 3 into 1 exhaust system low si...,...,,,,,,,,,,
1,Ducati Scrambler 800 Desert Sled,2021,"Four stroke, 90° L twin cylinder, SOHC, desm...",803 cc / 49.0 cub in,88 x 66 mm,Air cooled,11.0:1,,"Electronic fuel injection, 50 mm throttle body",Stainless steel muffler with catalytic convert...,...,,,,,,,,,,
2,BMW R 100/7,1978 - 10,"Four stroke, two cylinder horizontally opposed...",980 cc / 59.8 cu in.,94 x 70.6 mm,Air cooled,9.1:1,,2 x 36mm Bing V94 carburetors,,...,,,,,,,,,,
3,Kawasaki ZR-X 1100,1997 - 98,"Four stroke, transverse four cylinder, DOHC, 4...",1052 cc / 64.2 cu-in,79 x 59.4 mm,Liquid cooled,10.1:1,,4x 36mm Mikuni carburetors,,...,34 mpg,11.6 sec,,,,,,,,
4,Kawasaki ZRX 1200R,2005 - 06,"Four stroke, transverse four cylinder, DOHC, 4...",1165 cc / 71.0 cu-in,79 x 59.4 mm.,Liquid cooled,10.1:1,,4x Keihin CVK36 carburetors,,...,17.6 km/lit,10.9 sec,,,,,,,,
5,Triumph Speed Triple 1200RS,2021,"Four stroke, transverse three cylinder, DOHC, ...",1160 cc / 70.7 cu-in,90 x 60.8 mm,Liquid-cooled,13.2:1,,Multipoint sequential electronic fuel injectio...,Stainless steel 3 into 1 header system with un...,...,,,Full-colour 5 in TFT,,,,,,,
6,Kawasaki Z 900RS,2020,"Four stroke, transverse four cylinder, DOHC, 4...",948 cc / 57.8 cu-in,73.4 x 56.0 mm,Liquid cooled,11.8:1,Forced lubrication wet sum,DFI with 36mm Keihin throttle bodies,,...,,,,1.627 (83/51),2.933 (44/15),,,,,
7,Yamaha YZF 1000 R1,2020,"Four stroke, transverse four cylinder, DOHC, 4...",998 cc / 60.9 cu-in,79.0 x 50.9 mm,Liquid cooled,13.0:1,Wet sump,Fuel Injection with YCC-T and YCC-I,,...,,,,,,"YCC-T, YCC-I, PWR, TCS, LCS, LIF, SCS, QSS, CC...",,,,
8,Suzuki GSX 1300R Hayabusa,2021,"Four stroke, transverse four cylinder, DOHC, 4...",1340 cc / 81.8 cu-in,81 x 65 mm,Liquid cooled,12.5:1,Wet sump,Fuel injection with Ride-by-Wire throttle bodies,,...,,,,,,,Suzuki Drive Mode Selector Alpha (SDMS-α) feat...,,,
9,Suzuki SV 650 / ABS,2018 - 19,"Four stroke, 90°-V-twin, DOHC, 4 valves per cy...",645 cc / 39.3 cu in,81 x 62.6 mm,Liquid cooled,11.2:1,Wet sump,"Fuel Injection, 39mm throttle bodies",,...,,,,,,,,,,


In [180]:
df.to_csv('04 - motorcycles.csv')