# Preparing data for classification

Now that we have a more or less correct list of road names, we need to decide on what features we'll need for the classifier, and how we are going to extract them. Here are some features I'm planning to use:

* N-grams on actual road names (by this, I mean the "Tan Tock Seng" in "Tan Tock Seng Avenue", i.e. minus what I'll call the "road tag" - here, "Avenue")
* Average length of the words in the road name ("Tan Tock Seng" => (3+4+4)/3 = 3.7)
* Number of words in the road name ("Tan Tock Seng" => 3)
* Whether all the words in the road name are actual English words ("Tan Tock Seng" => False, the "Chrysanthemum" in "Chrysanthemum Drive" => True)
* Whether the road tag is Malay (Lorong, Jalan, etc) or English (Street, Avenue, etc)
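
As a rough sketch of how the simpler features in this list might be computed from a road name (tag already removed): the function names below are my own placeholders, the n-grams are assumed to be character n-grams, and `vocabulary` stands in for whatever English word list ends up being used.

```python
def char_ngrams(name, n=3):
    """Character n-grams of a road name (lowercased, spaces included).
    Assumption: the n-gram feature is character-based."""
    text = name.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def avg_word_length(name):
    """Mean number of characters per word; 0.0 for an empty name."""
    words = name.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def num_words(name):
    """Number of whitespace-separated words in the name."""
    return len(name.split())


def all_english(name, vocabulary):
    """True if every word appears in `vocabulary`, a set of lowercase
    English words (a hypothetical stand-in for a real word list)."""
    return all(w.lower() in vocabulary for w in name.split())


toy_vocabulary = {"chrysanthemum"}  # placeholder, not a real dictionary
print(round(avg_word_length("Tan Tock Seng"), 1), num_words("Tan Tock Seng"))
print(all_english("Tan Tock Seng", toy_vocabulary),
      all_english("Chrysanthemum", toy_vocabulary))
```

All of these work on the road name alone, which is why the rest of this section is about separating the name from its tag and modifiers.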

So it looks like what we'll need are the "road name" and the "road tag", with all modifiers like "North" or "First" dropped. We'll proceed to divide up the dataframe of full road names accordingly.

(Note: this is fairly pedestrian stuff. You may wish to skip ahead to the classification steps.)

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv("singapore-roadnames-final.csv")

In [5]:
# drop the column of numbers
df.drop("Unnamed: 0", inplace=True, axis=1)

In [7]:
df
# we'll be using final_name: the name column will be for combining
# our final classification info back with the geojson file 
# with all the geographic data

Unnamed: 0,name,final_name
0,Orchard Road,Orchard Road
1,Hougang Avenue 1,Hougang Avenue 1
2,Scotts Road,Scotts Road
3,Keng Lee Road,Keng Lee Road
4,Newton Road,Newton Road
5,Sarkies Road,Sarkies Road
6,Patterson Road,Paterson Road
7,Orchard Boulevard,Orchard Boulevard
8,Grange Road,Grange Road
9,Paterson Hill,Paterson Hill


In [9]:
# to get an idea of the road tags/modifiers that we should eliminate, 
# let's do a word frequency table for the full road names we do have 
from collections import Counter
c = Counter()
for name in df.final_name:
    for word in name.split():
        c[word] += 1
c

Counter({'Road': 831, 'Avenue': 396, 'Jalan': 381, 'Street': 304, 'Drive': 222, 'Lane': 136, 'Lorong': 131, 'Crescent': 116, 'Walk': 108, 'Park': 94, 'West': 82, 'Link': 70, 'Woodlands': 68, 'Terrace': 68, 'Tuas': 65, 'Bukit': 62, 'Tampines': 60, 'Place': 59, '1': 58, 'Close': 57, '2': 55, 'View': 53, 'Jurong': 52, 'Geylang': 49, 'North': 49, 'Way': 48, '3': 47, 'Kang': 46, 'Chu': 44, 'East': 44, 'Hill': 43, 'Grove': 43, 'Bedok': 40, 'Central': 39, 'Pasir': 37, 'Mo': 37, 'Kio': 37, 'Ang': 37, '4': 36, 'Rise': 35, 'South': 34, 'Changi': 33, 'Ris': 32, 'Coast': 32, 'Industrial': 31, 'Batok': 30, 'Seletar': 28, '5': 28, 'Yishun': 26, 'Upper': 26, '6': 25, 'Choa': 24, 'Lim': 23, 'Green': 23, 'Telok': 22, 'Serangoon': 21, 'Tai': 21, 'Toh': 21, 'Hougang': 21, 'Gardens': 20, 'Eunos': 20, 'Mount': 19, 'Siglap': 18, 'Teck': 18, 'Toa': 17, 'Kim': 17, '8': 17, 'Payoh': 17, 'Merah': 17, 'Hwan': 16, 'Seng': 16, 'Heights': 15, '7': 15, 'Lentor': 15, 'Chuan': 15, 'Sungei': 14, 'St': 14, 'Bridge': 14,
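
Since the full table is long, it can help to look only at the top entries, or only at the non-numeric ones. A small sketch using `Counter.most_common`, run on four names taken from the table above rather than on the whole dataframe:

```python
from collections import Counter

# a handful of names from the table above, standing in for df.final_name
names = ["Orchard Road", "Orchard Boulevard", "Hougang Avenue 1", "Scotts Road"]

# same counting as the nested loop, written as a single generator expression
counts = Counter(word for name in names for word in name.split())

# the two most frequent words, as (word, count) pairs
top = counts.most_common(2)

# drop purely numeric tokens such as "1" before eyeballing tag candidates
non_numeric = {word: n for word, n in counts.items() if not word.isdigit()}
print(top)
print(sorted(non_numeric))
```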

In [30]:
modifiers = ["North", "South", "East", "West", "Central", "Upper", "Lower", "Old", "New",
             "First", "Second", "Third", "Fourth", "Fifth", "Sixth", "Seventh", 
             "Eighth", "Ninth", "Tenth", "Seventeenth", "Twenty-fourth"]

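def remove_modifiers(words):
    """
    Removes modifiers such as numbers ("Hougang Avenue 1")
    and directions/descriptors ("?? Avenue North") from a list of words.
    A modifier in first position is kept here and handled later.
    """
    # remove numbering: tokens starting with a digit, tokens whose
    # second character is a digit, and single characters (A-Z)
    words = [word for word in words if not word[0].isdigit()
             and not (len(word) > 1 and word[1].isdigit())
             and not len(word) == 1]
    
    # remove "North/South/East/West/Central/Upper/etc", except in first position
    words = [word for i, word in enumerate(words) if i == 0 or word not in modifiers]
    
    return words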
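
To make it easier to see what this first pass keeps and drops, here is a condensed, self-contained restatement of the same filter with a deliberately trimmed modifier list. The names `MODIFIERS_DEMO`, `is_numbering` and `strip_modifiers` are mine, for illustration only, and the inputs are just sample road names.

```python
# trimmed-down modifier list, for illustration only
MODIFIERS_DEMO = {"North", "South", "East", "West", "Central", "Upper"}


def is_numbering(word):
    """True for tokens that start with a digit, have a digit as their
    second character, or consist of a single character."""
    return (word[0].isdigit()
            or (len(word) > 1 and word[1].isdigit())
            or len(word) == 1)


def strip_modifiers(words):
    """Drop numbering tokens, then drop modifiers anywhere but first position."""
    words = [w for w in words if not is_numbering(w)]
    return [w for i, w in enumerate(words) if i == 0 or w not in MODIFIERS_DEMO]


for road in ["Hougang Avenue 1", "Tuas South Avenue 2", "North Bridge Road"]:
    print(road, "->", strip_modifiers(road.split()))
```

Note that a leading modifier such as the "North" in "North Bridge Road" survives this step, because the first word is never tested against the modifier list.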

In [39]:
road_tags = ["Road", "Avenue", "Street", "Drive", "Lane",
             "Crescent", "Walk", "Park", "Terrace", "Close", "Link", 
             "Place", "Way", "Grove", "Rise", "View", "Hill", "Estate",
             "Farmway", "Green", "Garden", "Gardens", "Junction", "Boulevard",
             "Central", "Circle", "Court", "Loop", "Track", "Square",
             "Heights", "Village", "Promenade", "Vale", "Vista",
             "Sector", "Circus", "Bridge", "Gate", "Valley", "Turn",
             "Interchange", "Plaza", "Little", "Mount", "Highway", "Quay",
             "Mall", "Bank", "Plain", "Beach", "Height", "Wood", "Ring",
             "Ridge", "Island", "Ind", "Industrial", "Terminal", "Coast",
             "Centre", "Northview", "Reservoir", "Alley", "Plains", "Parkway", "Viaduct",
             "Expressway", "Tunnel", "Bow", "Concourse", "Grande", "Field", "Route",
             "link", "Cresent", "Rd", "St", "AVe", "Cres", "Av.", "Carpark"] # residual typos and abbrevs

malay_prefix_tags = ["Jalan", "Lorong", "Bukit", "Lengkok", "Taman", "Kampong", "Lengkong"]

In [40]:
def split_tag(words):
    """
    Splits a road (a list of words) into a tuple of name, tag,
    and an indicator of whether the road tag is Malay
    """
    # split road tag from the actual name
    tags = list()

    # English road tags come last; occasionally there may be >1 tag
    # e.g. "Ring Road", so repeat until we're down to the name,
    # always leaving at least one word for the name itself
    while len(words) >= 2 and words[-1] in road_tags:
        tags.append(words.pop())  # wrong order, we'll reverse them later

    if tags:
        return (' '.join(words), ' '.join(reversed(tags)), 0)  # remember to reverse!
    # the above assumed the road tags would be at the end -
    # in case it's Malay and the road tag is at the beginning,
    # or it contains no road tag at all:
    if words and words[0] in malay_prefix_tags:  # guard against an empty list
        return (' '.join(words[1:]), words[0], 1)
    return (' '.join(words), '', 0)

In [47]:
def remove_residual_modifiers(data):
    """
    Final round of modifier removal, ensuring that at least one word is left.
    e.g. a road name of just "North" would not have its modifier removed,
    but "Upper Bukit Timah" would become "Bukit Timah".
    """
    # remove any remaining "North/South/East/West/Central/Upper/etc"
    # (in practice the *initial* one, which remove_modifiers always keeps),
    # assuming that there is still something left
    road_name, road_tag, has_malay_road_tag = data
    road_name_words = road_name.split()
    if len(road_name_words) > 1:
        words = [word for word in road_name_words if word not in modifiers]
        if words:  # never strip a name down to nothing
            return (' '.join(words), road_tag, has_malay_road_tag)
    return data

In [42]:
def process_roadname(roadname):
    """
    Perform 3 steps of cleaning: initial modifier removal, 
    tag splitting, removal of residual modifiers, and return the tuple
    """
    return remove_residual_modifiers(split_tag(remove_modifiers(roadname.split())))

In [43]:
# put the results into a dataframe

split_roads = pd.DataFrame([process_roadname(road) for road in df['final_name'].values])
# rename columns
split_roads.columns = ['road_name', 'road_tag', 'has_malay_road_tag']

In [44]:
# which we then concatenate with the rest (there are other ways to do this too)

final = pd.concat([df, split_roads], axis=1)
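
One of those other ways is `DataFrame.join`, which also lines rows up by index. A minimal sketch on two made-up rows (the `_toy` frames are stand-ins, not the real data) showing that the two approaches agree when both frames share the same default index:

```python
import pandas as pd

# two made-up rows standing in for df and split_roads
df_toy = pd.DataFrame({"final_name": ["Orchard Road", "Jalan Besar"]})
split_toy = pd.DataFrame(
    [("Orchard", "Road", 0), ("Besar", "Jalan", 1)],
    columns=["road_name", "road_tag", "has_malay_road_tag"],
)

via_concat = pd.concat([df_toy, split_toy], axis=1)
via_join = df_toy.join(split_toy)  # joins on the index by default

# both line rows up by index, so the contents match here
assert via_concat.to_dict("list") == via_join.to_dict("list")
print(via_join)
```

Either way, the alignment is by index rather than by position, so if rows had been filtered out of `df` beforehand it would be safer to call `reset_index(drop=True)` on it first.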

In [45]:
final

Unnamed: 0,name,final_name,road_name,road_tag,has_malay_road_tag
0,Orchard Road,Orchard Road,Orchard,Road,0
1,Hougang Avenue 1,Hougang Avenue 1,Hougang,Avenue,0
2,Scotts Road,Scotts Road,Scotts,Road,0
3,Keng Lee Road,Keng Lee Road,Keng Lee,Road,0
4,Newton Road,Newton Road,Newton,Road,0
5,Sarkies Road,Sarkies Road,Sarkies,Road,0
6,Patterson Road,Paterson Road,Paterson,Road,0
7,Orchard Boulevard,Orchard Boulevard,Orchard,Boulevard,0
8,Grange Road,Grange Road,Grange,Road,0
9,Paterson Hill,Paterson Hill,Paterson,Hill,0


In [48]:
final.to_csv("singapore-roadnames-final-split.csv")

We don't need to classify all these roads individually, since there are repeats like "Ang Mo Kio $roadtag$ $n$" for various values of $roadtag$ and $n$. So let's do a `groupby` on the road name to collapse the repeats into one row each.

In [63]:
# we're really only interested in the max value of has_malay_road_tag 
# as we're not using the rest of the info, so aggregate just that column
gb  = final.groupby('road_name')['has_malay_road_tag'].max()
# flatten the groupby result into a regular df with the only columns we really want
gb2 = gb.reset_index()
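
To illustrate what this collapse does, here is a toy version with four made-up rows (the values are illustrative, not taken from the real output). Taking the max means a road name is flagged as Malay-tagged if any of its variants is.

```python
import pandas as pd

# illustrative rows only: two road names, two variants each
toy = pd.DataFrame({
    "road_name": ["Ang Mo Kio", "Ang Mo Kio", "Bukit Merah", "Bukit Merah"],
    "road_tag": ["Avenue", "Street", "Jalan", "Central"],
    "has_malay_road_tag": [0, 0, 1, 0],
})

# one row per road name, keeping the largest indicator value in each group
collapsed = (toy.groupby("road_name")["has_malay_road_tag"]
                .max()
                .reset_index())
print(collapsed)
```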

In [64]:
# this is the file that we'll actually do classification on
# the rest will be used for merging the actual GeoJSON file with full road names
# and linestring data to the final classification
gb2.to_csv('singapore-roadnames-final-toclassify.csv')