# Goal

In this notebook we use GeoPy to obtain more useful location information from the coordinates of each ad. This technique is called *"Reverse Geocoding"*.

# Import Libraries and Set Options

In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from daftpy.daftfeanalysis import (location_dict, location_dataframe, location_engineering, 
                                   geonames_dict, missing_values, 
                                   eircode_homogenize, add_location)

In [24]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
plt.style.use('seaborn')

# Load Data

In [25]:
sale = pd.read_csv('data/sale_cleaned.csv', sep=',')
sale.shape

(7662, 19)

In [26]:
sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7662 entries, 0 to 7661
Data columns (total 19 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   daft_id          7662 non-null   int64  
 1   url              7662 non-null   object 
 2   name             7662 non-null   object 
 3   price            7662 non-null   float64
 4   sale_type        7662 non-null   object 
 5   floor_area       7662 non-null   int64  
 6   psr              7646 non-null   float64
 7   ber              7467 non-null   object 
 8   entered_renewed  7662 non-null   object 
 9   views            7662 non-null   float64
 10  type_house       7175 non-null   object 
 11  type             7662 non-null   object 
 12  scraping_date    7662 non-null   object 
 13  description      7661 non-null   object 
 14  latitude         7662 non-null   float64
 15  longitude        7662 non-null   float64
 16  bedroom          7662 non-null   int64  
 17  bathroom      

# Check Missing Values

In [27]:
# Check missing values in absolute and relative terms
missing_values(sale)

Unnamed: 0,Absolute,Relative
daft_id,0,0.0
url,0,0.0
name,0,0.0
price,0,0.0
sale_type,0,0.0
floor_area,0,0.0
psr,16,0.002088
ber,195,0.02545
entered_renewed,0,0.0
views,0,0.0
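
The `missing_values` helper comes from `daftpy` and its source is not shown here. As a rough sketch of what such a summary could compute, assuming it simply counts NaN cells per column and divides by the number of rows (the name `missing_summary` and the toy data are made up for illustration):

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and the share of rows they represent."""
    is_missing = df.isna()
    return pd.DataFrame({
        "Absolute": is_missing.sum(),   # number of missing cells per column
        "Relative": is_missing.mean(),  # fraction of rows that are missing
    })


# Toy data, invented for illustration only.
toy = pd.DataFrame({"psr": [1.0, None, 3.0, 4.0],
                    "ber": ["A1", "B2", None, None]})
summary = missing_summary(toy)
print(summary)
```

The mean of a boolean mask is the fraction of `True` values, which is why `is_missing.mean()` gives the relative figure directly.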


# Reverse Geocoding

We do the reverse geocoding with GeoPy and its Nominatim geocoder. The `location_engineering` function relies on two helper functions: one creates a dictionary with the extracted information and the other adds that dictionary to the DataFrame.
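
The internals of `location_engineering` are not shown in this notebook. The sketch below only illustrates the general idea under stated assumptions: `reverse_geocode_all` and `fake_reverse` are hypothetical names, and the geocoder is passed in as a callable so the example runs offline with a stand-in. The commented lines at the end show how a real callable could be built with GeoPy.

```python
from types import SimpleNamespace


def reverse_geocode_all(coordinates, reverse, report_every=200):
    """Look up each (latitude, longitude) pair and return one address dict per pair.

    `reverse` is any callable that takes a (lat, lon) tuple and returns an object
    with a `.raw` dict (as geopy's Nominatim.reverse does), or None if nothing is found.
    """
    addresses = []
    for i, (lat, lon) in enumerate(coordinates):
        if i % report_every == 0:
            print(i)  # progress marker
        location = reverse((lat, lon))
        if location is None:
            addresses.append({})
        else:
            addresses.append(dict(location.raw.get("address", {})))
    return addresses


# Stand-in for the real geocoder so the sketch runs offline; the values are invented.
def fake_reverse(point):
    if point == (0.0, 0.0):
        return None
    return SimpleNamespace(raw={"address": {"country": "Éire / Ireland",
                                            "county": "County Dublin"}})


result = reverse_geocode_all([(53.35, -6.26), (0.0, 0.0)], fake_reverse)
print(result)

# With geopy installed, a real callable could be built like this (network required):
# from geopy.geocoders import Nominatim
# from geopy.extra.rate_limiter import RateLimiter
# geolocator = Nominatim(user_agent="my-notebook")  # placeholder user agent
# reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)
```

The public Nominatim service asks for at most one request per second. At that rate, 7,662 lookups take a little over two hours.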

The cell below needs around two to three hours to run. The resulting DataFrame is saved as a CSV file in the following cell, so you can skip running it. Note that if you run it and overwrite the saved file, you may see slightly different results in the analysis notebook.

In [6]:
sale = location_engineering(df=sale)

0
200
400
600
800
1000
1200
1400
1600
1800
2000
2200
2400
2600
2800
3000
3200
3400
3600
3800
4000
4200
4400
4600
4800
5000
5200
5400
5600
5800
6000
6200
6400
6600
6800
7000
7200
7400
7600
Shape before adding: (7662, 19)
Shape after adding: (7662, 32)
----------
Difference: 13 columns


It took a long time, so I decided to save the resulting DataFrame to a CSV file.

In [7]:
sale.to_csv('data/sale_post_reverse_geocoding.csv', 
            sep=',', index=False)

--------------

# Load Post Reverse Geocoding Data

In [28]:
sale = pd.read_csv('data/sale_post_reverse_geocoding.csv', sep=',')
sale.shape

(7662, 32)

In [29]:
sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7662 entries, 0 to 7661
Data columns (total 32 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   daft_id          7662 non-null   int64  
 1   url              7662 non-null   object 
 2   name             7662 non-null   object 
 3   price            7662 non-null   float64
 4   sale_type        7662 non-null   object 
 5   floor_area       7662 non-null   int64  
 6   psr              7646 non-null   float64
 7   ber              7467 non-null   object 
 8   entered_renewed  7662 non-null   object 
 9   views            7662 non-null   float64
 10  type_house       7175 non-null   object 
 11  type             7662 non-null   object 
 12  scraping_date    7662 non-null   object 
 13  description      7661 non-null   object 
 14  latitude         7662 non-null   float64
 15  longitude        7662 non-null   float64
 16  bedroom          7662 non-null   int64  
 17  bathroom      

# Check Missing Values

In [30]:
# Check missing values in absolute and relative terms
missing_values(sale)

Unnamed: 0,Absolute,Relative
daft_id,0,0.0
url,0,0.0
name,0,0.0
price,0,0.0
sale_type,0,0.0
floor_area,0,0.0
psr,16,0.002088
ber,195,0.02545
entered_renewed,0,0.0
views,0,0.0


As we can see, there are a lot of new missing values in the recently added data. We are going to deal with them in a minute, but first let's see whether any UK ad slipped through the data cleaning step when we cleaned the coordinates.

In [31]:
sale.country.value_counts()

Éire / Ireland    7661
United Kingdom       1
Name: country, dtype: int64

There it is. Let's quickly drop it.

In [32]:
sale.drop(sale[sale.country == 'United Kingdom'].index, inplace=True)
sale.country.value_counts()

Éire / Ireland    7661
Name: country, dtype: int64

# Dealing With New Missing Values

Let's isolate the variables we are interested in to make them easier to work with.

In [33]:
location_features = ['url', 
                     'latitude', 
                     'longitude', 
                     'country_code', 
                     'country', 
                     'postcode', 
                     'state_district', 
                     'county', 
                     'municipality', 
                     'city', 
                     'town', 
                     'city_district', 
                     'locality', 
                     'road', 
                     'house_number']

missing_values(sale[location_features])

Unnamed: 0,Absolute,Relative
url,0,0.0
latitude,0,0.0
longitude,0,0.0
country_code,0,0.0
country,0,0.0
postcode,1154,0.150633
state_district,394,0.051429
county,698,0.091111
municipality,6392,0.834356
city,5644,0.736718


## Eircode

The postcode is called "Eircode" in Ireland. It consists of a "Routing Key" and a "Unique Identifier". The routing key is the first three characters of the Eircode and is associated with a city or town.

![](imgs/eircode.png)
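
As a small illustration of that structure, the sketch below splits a full Eircode into its two parts. The function name `split_eircode` is hypothetical, and the only assumption is the one stated above: three characters of routing key followed by four characters of unique identifier.

```python
def split_eircode(eircode: str) -> tuple[str, str]:
    """Split a full seven-character Eircode into (routing_key, unique_identifier)."""
    compact = eircode.replace(" ", "").upper()
    if len(compact) != 7:
        raise ValueError(f"expected 7 characters, got {len(compact)}: {eircode!r}")
    return compact[:3], compact[3:]


routing_key, unique_id = split_eircode("A67 X225")
print(routing_key, unique_id)
```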

About 73% of the values in the `city` column and 85% of the values in the `town` column are missing. However, only 15% of the values in the `postcode` column are missing, so we can use `postcode` to figure out the `town` values.

Let's check how many rows are missing both `city` and `town`.

In [34]:
sale.loc[sale[['city', 'town']].isna().all(axis=1)].shape

(4541, 32)

And how many rows are missing all three columns at once.

In [35]:
sale.loc[sale[['city', 'town', 'postcode']].isna().all(axis=1)].shape

(1046, 32)

### Scraping Geonames Page

We can scrape the [Geonames.org](http://www.geonames.org/postalcode-search.html?q=&country=IE) website to obtain the town information. That page lists each Eircode routing key together with its town or city. As we have most of the Eircodes, we can match them with the city or town names.

We will use the `geonames_dict()` function to scrape the page and create a dictionary with the information. We can then build a DataFrame from that dictionary.

In [36]:
# Make a DataFrame with the dictionary obtained from `geonames_dict` function
geonames_df = pd.DataFrame(geonames_dict())
print(geonames_df.shape)
geonames_df.head(3)

(139, 4)


Unnamed: 0,place,code,admin1,place_coordinates
0,Ballyboughal,A41,Leinster,53.52/-6.267
1,Garristown,A42,Leinster,53.566/-6.386
2,Oldtown,A45,Leinster,53.525/-6.316
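
The code of `geonames_dict()` is not shown here. As a hedged sketch of the general approach, the example below parses an HTML table with the standard library and builds a column-oriented dictionary. The markup is a made-up stand-in (the real page layout may differ), and `TableRows` and `rows_to_dict` are hypothetical names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def rows_to_dict(rows):
    """Turn [place, code, admin1] rows into a column-oriented dictionary."""
    columns = {"place": [], "code": [], "admin1": []}
    for place, code, admin1 in rows:
        columns["place"].append(place)
        columns["code"].append(code)
        columns["admin1"].append(admin1)
    return columns


# Hypothetical markup for illustration; not copied from the real page.
html = """
<table>
  <tr><th>Place</th><th>Code</th><th>Admin1</th></tr>
  <tr><td>Ballyboughal</td><td>A41</td><td>Leinster</td></tr>
  <tr><td>Garristown</td><td>A42</td><td>Leinster</td></tr>
</table>
"""
parser = TableRows()
parser.feed(html)
geonames = rows_to_dict(parser.rows)
print(geonames)
```

A dictionary of equal-length lists like this one can be passed straight to `pd.DataFrame`.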


### Homogenize Postcode Column

Now we need to check the `postcode` column in order to make sure that it has the eircode's routing key. 

Let's dig a little deeper into `postcode` column.

In [37]:
sale['postcode'].str.len().value_counts(dropna=False)

8.0     5773
NaN     1154
7.0      578
9.0       48
1.0       30
3.0       21
2.0       19
6.0       12
12.0       9
10.0       8
11.0       4
4.0        3
13.0       2
Name: postcode, dtype: int64

We should check each case and wrangle and clean it to homogenize the values.

#### Eircode length = 8

A length of eight is what we want: three characters for the routing key, four for the unique identifier, and one blank space between them.

In [38]:
sale.loc[sale.postcode.str.len() == 8, ['postcode']].sample(3)

Unnamed: 0,postcode
6893,A67 X225
5307,K45 K650
3576,A98 VP11


#### Eircode length = 7

We could add a blank space between the routing key and the unique identifier.

In [39]:
sale.loc[sale.postcode.str.len() == 7, ['postcode']].sample(3)

Unnamed: 0,postcode
281,A96H312
2781,W23EP48
6976,F92W803


#### Eircode length = 9

We will handle all these cases in the `homogenize()` function, giving most of them a `np.nan` value.

In [40]:
sale.loc[sale.postcode.str.len() == 9, ['postcode']].sample(3)

Unnamed: 0,postcode
1608,DUBLIN 13
971,DUBLIN 13
4201,DUBLIN 18


#### Eircode length = 1

Those ads with a single number as a postcode belong to Dublin. It is reasonable to think that those numbers correspond to the postal district number, so we will fix them in the `homogenize()` function.

In [41]:
sale.loc[sale.postcode.str.len() == 1, ['postcode','city']].sample(3)

Unnamed: 0,postcode,city
2996,9,Dublin
6600,4,Dublin
4294,1,Dublin


#### Eircode length = 10

Values starting with "CO" or "CO." stand for "county", so we will fill them with `np.nan`.

In [42]:
sale.loc[sale.postcode.str.len() == 10, ['postcode','city','town']].sample(3)

Unnamed: 0,postcode,city,town
1459,CO WICKLOW,,
1810,CO WICKLOW,,
811,CO WICKLOW,,


#### Eircode length = 3

We can keep Eircodes of length three since they contain the routing key and we don't need the unique identifier.

In [43]:
sale.loc[sale.postcode.str.len() == 3, ['postcode']].sample(3)

Unnamed: 0,postcode
5706,F12
2929,D04
5372,D16


#### Eircode length = 12

We will fill these ones with `np.nan`.

In [44]:
sale.loc[sale.postcode.str.len() == 12, ['postcode']]

Unnamed: 0,postcode
514,CO WESTMEATH
655,CO WESTMEATH
673,CO WESTMEATH
1474,CO WESTMEATH
2253,CO. KILKENNY
4085,CO. KILKENNY
6040,CO WESTMEATH
6108,CO. KILKENNY
7524,CO WESTMEATH


#### Eircode length = 11

We will fill these ones with `np.nan`.

In [45]:
sale.loc[sale.postcode.str.len() == 11, ['postcode']]

Unnamed: 0,postcode
3396,CO. WICKLOW
6848,CO. WICKLOW
7361,CO. WICKLOW
7467,CO. WICKLOW


#### Eircode length = 6

We will extract the Eircode from the values that contain one and keep those equal to `DUBLIN` as they are.

In [46]:
sale.loc[sale.postcode.str.len() == 6, ['postcode']].head(3)

Unnamed: 0,postcode
365,H91 DV
527,DUBLIN
1173,DUBLIN


#### Eircode length = 4

We will fill these ones with `np.nan`.

In [47]:
sale.loc[sale.postcode.str.len() == 4, ['postcode']]

Unnamed: 0,postcode
1060,0
1891,0
2302,0


#### Eircode length = 13

We will fill these ones with `np.nan`.

In [48]:
sale.loc[sale.postcode.str.len() == 13, ['postcode']]

Unnamed: 0,postcode
7044,CO. ROSCOMMON
7155,CO. ROSCOMMON


#### Eircode length = 2

We will fix all these values since they are from Dublin and have the postal district value.

In [49]:
sale.loc[sale.postcode.str.len() == 2, ['postcode','city']].sample(3)

Unnamed: 0,postcode,city
1266,D5,Dublin
4933,D5,Dublin
2185,12,Dublin


#### Homogenize Eircode

Now we can use the `eircode_homogenize()` function to homogenize the `postcode` column. This function applies the `homogenize()` function to the `postcode` column.

In [50]:
sale = eircode_homogenize(sale)

Let's check whether the function results are as expected.

In [51]:
sale['postcode'].str.len().value_counts(dropna=False)

8.0    6344
NaN    1189
3.0     118
6.0      10
Name: postcode, dtype: int64
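
The `homogenize()` function itself lives in `daftpy` and is not reproduced in this notebook. The sketch below is one possible reading of the rules described in the subsections above, not the actual implementation. The name `homogenize_sketch` and the regular expressions are assumptions, as is the mapping of a Dublin postal district such as `9` or `D5` to the routing keys `D09` and `D05`. Anything the sketch does not recognise is returned as `None`.

```python
import re

# A routing key is assumed to be one letter plus two digits, or the special key D6W.
FULL = re.compile(r"^([A-Z]\d{2}|D6W)\s*([A-Z0-9]{4})$")
ROUTING_PREFIX = re.compile(r"^([A-Z]\d{2}|D6W)\b")
DUBLIN_DISTRICT = re.compile(r"^D?(\d{1,2})$")


def homogenize_sketch(value):
    """Normalise one raw postcode value; return None when nothing usable is found."""
    if not isinstance(value, str):
        return None  # NaN or other non-string input
    text = value.strip().upper()

    match = FULL.match(text)
    if match:
        return f"{match.group(1)} {match.group(2)}"  # 'A96H312' -> 'A96 H312'

    match = ROUTING_PREFIX.match(text)
    if match:
        return match.group(1)  # 'H91 DV' -> 'H91', 'D04' -> 'D04'

    if text == "DUBLIN":
        return text  # kept as is

    match = DUBLIN_DISTRICT.match(text)
    if match and 1 <= int(match.group(1)) <= 24:
        return f"D{int(match.group(1)):02d}"  # '9' -> 'D09', 'D5' -> 'D05'

    return None  # county names, zeros and anything else


for raw in ["A67 X225", "A96H312", "H91 DV", "D04", "DUBLIN", "9", "D5", "12", "CO. WICKLOW"]:
    print(repr(raw), "->", homogenize_sketch(raw))
```

A function like this could be applied with `sale['postcode'].apply(homogenize_sketch)`. The order of the checks matters: the full pattern is tried first so that a complete Eircode is never cut down to its routing key.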

## Adding Geonames Page Information To Sale DataFrame

In [52]:
sale = add_location(df=sale, geonames_df=geonames_df)

Shape before dropping: (7661, 32)
Shape after dropping: (7661, 36)
----------
Difference: 4 columns
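
How `add_location` combines the two tables is not shown here. One plausible way, sketched on invented toy data, is to derive the routing key from the homogenized postcode and left-join the Geonames table on it. The `routing_key` column name is an assumption made for this example.

```python
import pandas as pd

# Toy frames, invented for illustration; column names follow the notebook.
sale_toy = pd.DataFrame({"daft_id": [1, 2, 3],
                         "postcode": ["A41 X225", "A42", None]})
geonames_toy = pd.DataFrame({"place": ["Ballyboughal", "Garristown"],
                             "code": ["A41", "A42"],
                             "admin1": ["Leinster", "Leinster"]})

# The routing key is the first three characters of the homogenized postcode.
sale_toy["routing_key"] = sale_toy["postcode"].str[:3]

# A left join keeps every ad, including those without a usable postcode.
merged = sale_toy.merge(geonames_toy, how="left",
                        left_on="routing_key", right_on="code")
print(merged[["daft_id", "postcode", "place"]])
```

If a routing key could appear more than once in the lookup table, a left join would duplicate ads. Passing `validate="many_to_one"` to `merge` raises an error in that case instead of silently adding rows.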


In [53]:
missing_values(sale)

Unnamed: 0,Absolute,Relative
daft_id,0,0.0
url,0,0.0
name,0,0.0
price,0,0.0
sale_type,0,0.0
floor_area,0,0.0
psr,16,0.002089
ber,195,0.025454
entered_renewed,0,0.0
views,0,0.0


Let's drop those columns that won't be useful in the future.

In [54]:
sale.drop(columns=['country_code', 'country', 'county', 'municipality', 
                   'city', 'town', 'locality', 'suburb', 'road', 'house_number', 
                   'admin1', 'place_coordinates'], inplace=True)

In [55]:
sale.to_csv('data/sale_post_geosp_fe.csv',
            sep=',', index=False)