## Cleansing addresses - am I doing it wrong? ##

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder

X = pd.read_json("../input/train.json")
X_test = pd.read_json("../input/test.json")

We have 2 features: 'display_address' and 'street_address'. We are using these columns in the following way:

In [None]:
street_encoder = LabelEncoder()
street_encoder.fit(list(X['display_address']) + list(X_test['display_address']))
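
For readers unfamiliar with `LabelEncoder`, here is a minimal sketch of what the fit above does and how the fitted encoder would then be used. The toy address lists are made up for illustration and are not taken from the dataset. Fitting on train and test together means `transform` will not fail on an address that appears only in the test set. Note that any cleaning has to be applied before `fit`, otherwise the encoder still sees the raw strings.

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for X['display_address'] and X_test['display_address'] (assumed values)
train_addresses = ["W 11th St", "East 5th Avenue", "W 11th St"]
test_addresses = ["Broadway", "East 5th Avenue"]

encoder = LabelEncoder()
# Fit on the union so every label seen in either set gets an integer code
encoder.fit(train_addresses + test_addresses)

# Each distinct string maps to one integer; identical strings share a code
train_codes = encoder.transform(train_addresses)
test_codes = encoder.transform(test_addresses)
print(list(encoder.classes_))
print(list(train_codes), list(test_codes))
```

Because every distinct spelling gets its own code, "W 11th St" and "West 11 Street" would end up as two unrelated categories, which is the motivation for the normalization that follows.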

Let's take a closer look at the feature called 'display_address':

In [None]:
X['display_address'].head(15)

First let's deal with the **Street** addresses. There are several main representations:

 1. It can be **West/East** or simply **W/E**
 2. It can be **11th** or **11**
 3. It can be **Street**, **Str**, **St**, **St.**

So we are going to convert all these addresses to the form: **n 11 st** or **name st**.

About **Avenue**:

 1. It can be **1st**, **1** or even **First**
 2. It can be **Avenue**, **Ave**, **Ave.** or **Av**

Convert all these addresses to the form **Name/Number av**

To measure how well we normalize addresses, we will compare the number of unique addresses before and after normalization. As a last cleansing step we will remove characters like **.** and **,** and then strip leading and trailing whitespace.

In [None]:
def normalize_address(X, column):
    """Normalize the address strings in X[column] in place and print unique counts."""
    print("Before: {0}".format(len(X[column].unique())))
    # Ordered (old, new) pairs, applied as plain substring replacements.
    substitution = [('west', 'w'), ('east', 'e'), ('south', 's'), ('north', 'n'),
                    ('1st', '1'), ('1th', '1'), ('2nd', '2'), ('2th', '2'),
                    ('3rd', '3'), ('3th', '3'), ('4th', '4'), ('5th', '5'),
                    ('6th', '6'), ('7th', '7'), ('8th', '8'), ('9th', '9'),
                    ('0th', '0'),
                    ('street', 'st'), ('str', 'st'),
                    ('avenue', 'av'), ('ave', 'av'),
                    ('place', 'pl'), ('boulevard', 'blvd'), ('road', 'rd'),
                    ('first', '1'), ('second', '2'), ('third', '3'),
                    ('fourth', '4'), ('fifth', '5'), ('sixth', '6'),
                    ('seventh', '7'), ('eighth', '8'),
                    ('ninth', '9'), ('nineth', '9'),  # correct spelling plus the misspelling
                    ('tenth', '10'),
                    (',', ''), ('.', '')]

    def apply_normalization(s):
        s = s.lower()  # lowercase once instead of on every iteration
        for old, new in substitution:
            s = s.replace(old, new)
        return s.strip()  # drop leading and trailing whitespace

    X[column] = X[column].apply(apply_normalization)
    print("After: {0}".format(len(X[column].unique())))

In [None]:
normalize_address(X, 'display_address')   
normalize_address(X_test, 'display_address')
normalize_address(X, 'street_address')   
normalize_address(X_test, 'street_address')
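
One thing worth checking: because the substitutions above are plain substring replacements, they also fire inside longer words. For example, "waverly" contains "ave" and becomes "wavrly", and "broadway" contains "road" and becomes "brdway". Below is a possible alternative, a sketch only, that replaces whole tokens using regex word boundaries. The names `TOKEN_MAP`, `ORDINAL_RE`, `TOKEN_RE` and `normalize_tokens` are hypothetical, and the token map is only a subset of the list above.

```python
import re

# Hypothetical whole-word map (a subset of the substitution list above)
TOKEN_MAP = {
    "west": "w", "east": "e", "south": "s", "north": "n",
    "street": "st", "str": "st",
    "avenue": "av", "ave": "av",
    "place": "pl", "boulevard": "blvd", "road": "rd",
    "first": "1", "second": "2", "third": "3", "fourth": "4", "fifth": "5",
}

# "11th" -> "11", "1st" -> "1": digits followed by an ordinal suffix
ORDINAL_RE = re.compile(r"\b(\d+)(?:st|nd|rd|th)\b")
# Match map keys only when they stand alone as a word
TOKEN_RE = re.compile(r"\b(" + "|".join(map(re.escape, TOKEN_MAP)) + r")\b")


def normalize_tokens(s):
    """Lowercase, drop '.' and ',', then rewrite whole tokens only."""
    s = s.lower().replace(",", "").replace(".", "")
    s = ORDINAL_RE.sub(r"\1", s)
    s = TOKEN_RE.sub(lambda m: TOKEN_MAP[m.group(1)], s)
    return " ".join(s.split())  # collapse repeated spaces and trim


for raw in ["West 11th Street", "E. 5th Ave.", "Waverly Place", "Broadway"]:
    print(raw, "->", normalize_tokens(raw))
```

Whether this changes the score is an open question; it only avoids mangling names such as "waverly" and "broadway" while keeping the same target forms.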

As you can see, we decreased the number of unique addresses by nearly 30%. Let's check the first 15 addresses from the training dataframe:

In [None]:
X['display_address'].head(15)

Now I can see the following steps for further improvement:

 1. Think about house numbers: do we need them or not (e.g. "521 E 11" and "456 E 11" will be treated as different addresses)
 2. Think about outliers: "williamsburg - NO FEE", "W 10 and Waverly Place "
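
Both ideas can be prototyped on already-normalized strings. The sketch below is only one possible heuristic, not a tested solution: `STREET_TYPES`, `strip_house_number` and `has_street_type` are hypothetical names, and the rule "a leading number is a house number unless it is directly followed by a street type" is an assumption that keeps names like "5 av" intact.

```python
# Assumed set of normalized street-type tokens produced by the cleaning step
STREET_TYPES = {"st", "av", "pl", "blvd", "rd"}


def strip_house_number(address):
    """Drop a leading numeric token unless it is the street name itself ("5 av")."""
    tokens = address.split()
    if len(tokens) >= 2 and tokens[0].isdigit() and tokens[1] not in STREET_TYPES:
        return " ".join(tokens[1:])
    return " ".join(tokens)


def has_street_type(address):
    """Rough outlier check: does the string contain any street-type token?"""
    return any(token in STREET_TYPES for token in address.split())


for addr in ["521 e 11", "456 e 11", "5 av", "williamsburg - no fee"]:
    print(repr(addr), "->", repr(strip_house_number(addr)), has_street_type(addr))
```

With this rule "521 e 11" and "456 e 11" collapse to the same value, while strings such as "williamsburg - no fee" can be flagged for separate handling because they contain no street-type token.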

And now the really disappointing part of all this normalization: **the score has increased**, both on CV and on the public leaderboard! Why could that be? Any help will be greatly appreciated! My only thought so far: I have removed some information, so the addresses are now more strongly correlated with longitude/latitude and add little value for the algorithms.