In [1]:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
import datetime
import calendar

## EDA and data cleaning

In [2]:
df = pd.read_csv('Q1_2020_Data_analyst_transactions.csv')

df.head()

Unnamed: 0,id,Date of Sale (dd/mm/yyyy),Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description
0,1,01/01/2010,"5 Braemor Drive, Churchtown, Co.Dublin",,Dublin,343000.0,No,No,Second-Hand Dwelling house /Apartment,
1,2,03/01/2010,"134 Ashewood Walk, Summerhill Lane, Portlaoise",,Laois,185000.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...
2,11,04/01/2010,"16 Aisling Geal, Fr. Russell Road",,Limerick,110000.0,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...
3,21,04/01/2010,"48 KILLIANS COURT, MULLAGH",,Cavan,122000.0,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres
4,35,04/01/2010,"Knock, Lanesboro",,Longford,125000.0,No,No,Second-Hand Dwelling house /Apartment,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410523 entries, 0 to 410522
Data columns (total 10 columns):
id                           410523 non-null int64
Date of Sale (dd/mm/yyyy)    410523 non-null object
Address                      410523 non-null object
Postal Code                  76876 non-null object
County                       410523 non-null object
Price                        410523 non-null object
Not Full Market Price        410523 non-null object
VAT Exclusive                410523 non-null object
Description of Property      410523 non-null object
Property Size Description    52567 non-null object
dtypes: int64(1), object(9)
memory usage: 31.3+ MB


We explore the data, we notice that Postal Code have a lot of missing values. This shouldn't surprise, as the PostCode in Ireland are pretty recent. Another variable with high degree of missing values is Property SIze Description.

# Report script

Here, we generate the script that ingest the raw data, manipulate the data and transform it to the expected output. 

We notice that the date of Sale is in the object format. We want to transform it into a datetime so it would be easier to treat it later on. Also, I want to change the name column to have an easier handle and to match the final export sheet.

Since we are here, I want to rename the columns to the final layout: 
- id	
- Date	
- Address	
- Postal Code	
- County	
- Price	
- Not Full Market Price	
- VAT Exclusive	
- Description of Property	
- Property Size Description

In [4]:
# the only column to be renamed is the Date column

df.rename(columns = {'Date of Sale (dd/mm/yyyy)':'Date'}, inplace = True)
df['Date'] = pd.to_datetime(df['Date'])
df.head()

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description
0,1,2010-01-01,"5 Braemor Drive, Churchtown, Co.Dublin",,Dublin,343000.0,No,No,Second-Hand Dwelling house /Apartment,
1,2,2010-03-01,"134 Ashewood Walk, Summerhill Lane, Portlaoise",,Laois,185000.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...
2,11,2010-04-01,"16 Aisling Geal, Fr. Russell Road",,Limerick,110000.0,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...
3,21,2010-04-01,"48 KILLIANS COURT, MULLAGH",,Cavan,122000.0,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres
4,35,2010-04-01,"Knock, Lanesboro",,Longford,125000.0,No,No,Second-Hand Dwelling house /Apartment,


We want now to generate an additional label for the Province. A google search allow us to identify the counties that are associated with each province. The definitons are: 

Irish Provinces:
- Leinster (https://it.wikipedia.org/wiki/Leinster) => Carlow, Dublin, Kildare, Kilkenny, Laois, Longford, Louth, Meath, Offaly, Westmeath, Wexford e Wicklow.
- Ulster (https://it.wikipedia.org/wiki/Ulster) => Antrim, Armagh, Down, Fermanagh, Londonderry Tyrone, Cavan, Donegal, Monaghan
- Munster (https://it.wikipedia.org/wiki/Munster) => Cork, Clare, Kerry, Limerick, Tipperary, Waterford
- Connaught (https://it.wikipedia.org/wiki/Connaught) =>  Galway, Leitrim, Mayo, Roscommon, Sligo



In [36]:
prov_label = {['Carlow', 'Dublin', 'Kildare', 'Kilkenny', 'Laois', 
                           'Longford', 'Louth', 'Meath', 'Offaly', 'Westmeath', 'Wexford' 'Wicklow']:'Leinster', 
             'Ulster': ['Antrim', 'Armagh', 'Down', 'Fermanagh', 'Londonderry' 
                        'Tyrone', 'Cavan', 'Donegal', 'Monaghan'], 
             'Munster' : ['Cork', 'Clare', 'Kerry', 'Limerick', 'Tipperary', 'Waterford'],
             'Connaught': ['Galway', 'Leitrim', 'Mayo', 'Roscommon', 'Sligo']}


# Using list comprehension 
# Iterating through value lists dictionary 

df['Province'] = df['County'].map(prov_label)
df.head()

TypeError: unhashable type: 'list'

Here, we want to extract the last day of the month

In [6]:


df['Month_end_day'] = pd.to_datetime(df['Date'], format="%Y%m") + MonthEnd(0)

df.head()

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day
0,1,2010-01-01,"5 Braemor Drive, Churchtown, Co.Dublin",,Dublin,343000.0,No,No,Second-Hand Dwelling house /Apartment,,Leinster,2010-01-31
1,2,2010-03-01,"134 Ashewood Walk, Summerhill Lane, Portlaoise",,Laois,185000.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-03-31
2,11,2010-04-01,"16 Aisling Geal, Fr. Russell Road",,Limerick,110000.0,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-04-30
3,21,2010-04-01,"48 KILLIANS COURT, MULLAGH",,Cavan,122000.0,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30
4,35,2010-04-01,"Knock, Lanesboro",,Longford,125000.0,No,No,Second-Hand Dwelling house /Apartment,,Leinster,2010-04-30


Postal code is not a requirement for our output. However, we don't want to lose this information. We will fill the null values with a text label. We apply the same temporarily for Property Size and Description

In [7]:
df['Postal Code'] = df['Postal Code'].fillna('unavailable')
df['Property Size Description'] = df['Property Size Description'].fillna('unavailable')

df.head()

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day
0,1,2010-01-01,"5 Braemor Drive, Churchtown, Co.Dublin",unavailable,Dublin,343000.0,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-01-31
1,2,2010-03-01,"134 Ashewood Walk, Summerhill Lane, Portlaoise",unavailable,Laois,185000.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-03-31
2,11,2010-04-01,"16 Aisling Geal, Fr. Russell Road",unavailable,Limerick,110000.0,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-04-30
3,21,2010-04-01,"48 KILLIANS COURT, MULLAGH",unavailable,Cavan,122000.0,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30
4,35,2010-04-01,"Knock, Lanesboro",unavailable,Longford,125000.0,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410523 entries, 0 to 410522
Data columns (total 12 columns):
id                           410523 non-null int64
Date                         410523 non-null datetime64[ns]
Address                      410523 non-null object
Postal Code                  410523 non-null object
County                       410523 non-null object
Price                        410523 non-null object
Not Full Market Price        410523 non-null object
VAT Exclusive                410523 non-null object
Description of Property      410523 non-null object
Property Size Description    410523 non-null object
Province                     410523 non-null object
Month_end_day                410523 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(9)
memory usage: 37.6+ MB


To evaluate duplicate transactions, we can evaluate if there are multiple ids

In [9]:
df['id'].nunique()

410523

In [10]:
# explore address to evaluate if there are duplicates in these rows
print(df['Address'].str.lower().nunique())

# we transform all address values in lower string 
df['Address_lower'] = df['Address'].str.lower()

print(df['Address'].str.lower().nunique())

383312
383312


It seems that there are no duplicate ids, but there are duplicated addresses in the dataset. This could suggest the presence of duplicate transactions

since the transaction will associate likely the address and the price, we can check if we have duplicates when we concatenate the two information

In [11]:
# sorting by first name 
df.sort_values("Address_lower").head(30)



Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day,Address_lower
297116,297168,2018-12-02,"!5 Ard Coillte, Ballina",unavailable,Tipperary,281939.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 125 sq metres,Leinster,2018-12-31,"!5 ard coillte, ballina"
339572,339635,2018-11-15,"!9 Blossomhill, Broomfield Village, Midleton",unavailable,Cork,321585.0,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2018-11-30,"!9 blossomhill, broomfield village, midleton"
395689,395858,2019-10-31,"!A Taylor Hill Crescent, Naul Road, Balbriggan",unavailable,Dublin,273128.75,No,Yes,New Dwelling house /Apartment,unavailable,Leinster,2019-10-31,"!a taylor hill crescent, naul road, balbriggan"
153287,153300,2015-05-05,"#392 The Oak, Trimbleston, Goatstown Road",Dublin 14,Dublin,375770.93,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2015-05-31,"#392 the oak, trimbleston, goatstown road"
153288,153301,2015-05-05,"#396 The Oak, Trimbleston, Goatstown Road",Dublin 14,Dublin,375770.93,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2015-05-31,"#396 the oak, trimbleston, goatstown road"
258701,258694,2017-06-19,#NAME?,unavailable,Carlow,48000.0,Yes,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2017-06-30,#name?
342264,342374,2018-11-29,"& James Fort Avenue, Kinsale",unavailable,Cork,273127.0,No,Yes,New Dwelling house /Apartment,unavailable,Leinster,2018-11-30,"& james fort avenue, kinsale"
20974,20970,2010-12-30,"' Adrigole', Ballincurrig, Douglas",unavailable,Cork,360000.0,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-12-31,"' adrigole', ballincurrig, douglas"
67579,67515,2013-02-25,"' Bedford', Killiney Ave, Killiney",unavailable,Dublin,800000.0,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2013-02-28,"' bedford', killiney ave, killiney"
14913,14892,2010-09-24,"' Fishermans Cottage', Bawnbrack, Golden",unavailable,Tipperary,115000.0,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-09-30,"' fishermans cottage', bawnbrack, golden"


We also have some inconsistency in termsof how the address are reported. A further step would be to combine address and price of the house to match two different flags

In [12]:
# copy the data set first
df_o = df.copy()

In [13]:
df_o.drop_duplicates(subset ="Address_lower", 
                     keep = 'first', inplace = True) 

df_o['Address'].str.lower().nunique()

383312

We can evaluate how much information we lost based on the output we want to get. In particular, let's try to evaluate the information lost in the time frame we are interested in

In [14]:
mask = (df['Month_end_day'] > '2014-01-31') & (df['Month_end_day'] <= '2019-12-31')

df.loc[mask]['Address'].str.lower().nunique()

289758

In [15]:
mask = (df_o['Month_end_day'] > '2014-01-31') & (df_o['Month_end_day'] <= '2019-12-31')

df_o.loc[mask]['Address'].str.lower().nunique()

286081

we lost 287810 /289758 (ca 1%) of the information

In [16]:
df_o['Description of Property'].unique()

array(['Second-Hand Dwelling house /Apartment',
       'New Dwelling house /Apartment', 'Teach/¡ras·n CÛnaithe Ath·imhe',
       'Teach/¡ras·n CÛnaithe Nua', 'Teach/?ras?n C?naithe Nua'],
      dtype=object)

In [17]:
df_o.groupby('Description of Property')['id'].count()

Description of Property
New Dwelling house /Apartment             64595
Second-Hand Dwelling house /Apartment    318687
Teach/?ras?n C?naithe Nua                     1
Teach/¡ras·n CÛnaithe Ath·imhe               27
Teach/¡ras·n CÛnaithe Nua                     2
Name: id, dtype: int64

We have few property description that are quite strange. Not sure at the moment if we want to keep those or not. But consider. 

In [18]:
df_o.loc[df_o['Description of Property'].str.contains("Teach") ]

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day,Address_lower
1160,1145,2010-01-02,"247 GLANNTAN, GOLF LINKS ROAD, CASTLETROY.",NÌ Bhaineann,Limerick,228500.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2010-01-31,"247 glanntan, golf links road, castletroy."
12764,12751,2010-08-20,"8 Millhill Park, Skerries.",unavailable,Dublin,320000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2010-08-31,"8 millhill park, skerries."
17324,17312,2010-02-11,"Carrigvore, Killiskey.",unavailable,Wicklow,610000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2010-02-28,"carrigvore, killiskey."
20034,20121,2010-12-16,"7 Cul Na Toinne, Bunbeg.",unavailable,Donegal,74889.87,No,Yes,Teach/¡ras·n CÛnaithe Nua,nÌos mÛ n· nÛ cothrom le 38 mÈadar cearnach ag...,Leinster,2010-12-31,"7 cul na toinne, bunbeg."
22624,22655,2011-02-16,"Racecourse Road, Roscommon.",NÌ Bhaineann,Roscommon,100000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2011-02-28,"racecourse road, roscommon."
26873,26791,2011-05-27,"12 Southdene, Gleann Bhaile Na Manach, Baile N...",Baile ¡tha Cliath 14,Dublin,272000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2011-05-31,"12 southdene, gleann bhaile na manach, baile n..."
32172,32198,2011-07-09,"Station road, Castlebellingham, Dundalk.",unavailable,Louth,179000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2011-07-31,"station road, castlebellingham, dundalk."
39758,39774,2012-12-01,"Apartment 12 Block B, Corofin House Clare Vi...",Baile ?tha Cliath 17,Dublin,115045.0,No,Yes,Teach/?ras?n C?naithe Nua,n?os l? n? 38 m?adar cearnach,Leinster,2012-12-31,"apartment 12 block b, corofin house clare vi..."
48868,48861,2012-06-28,"7 Thorndale Grove, Artane, Dublin.",Baile ¡tha Cliath 5,Dublin,250000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2012-06-30,"7 thorndale grove, artane, dublin."
54057,53982,2012-10-09,"121 Ardilaun, Portmarnock, Co Dublin.",NÌ Bhaineann,Dublin,375000.0,No,No,Teach/¡ras·n CÛnaithe Ath·imhe,unavailable,Leinster,2012-10-31,"121 ardilaun, portmarnock, co dublin."


It looks like they are all related with Leinster, so dropping them won't hurt the final output

We want to evaluate the purchase in bulk. A naive approach could be to identify the purchases that happened in the same date for the same property. 

In [31]:
pd.concat([df_o, df_o['Address_lower'].str.split(', ', expand=True)], axis=1)

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day,Address_lower,0,1,2,0.1,1.1,2.1
0,1,2010-01-01,"5 Braemor Drive, Churchtown, Co.Dublin",unavailable,Dublin,343000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-01-31,"5 braemor drive, churchtown, co.dublin",5 braemor drive,churchtown,co.dublin,5 braemor drive,churchtown,co.dublin
1,2,2010-03-01,"134 Ashewood Walk, Summerhill Lane, Portlaoise",unavailable,Laois,185000.00,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-03-31,"134 ashewood walk, summerhill lane, portlaoise",134 ashewood walk,summerhill lane,portlaoise,134 ashewood walk,summerhill lane,portlaoise
2,11,2010-04-01,"16 Aisling Geal, Fr. Russell Road",unavailable,Limerick,110000.00,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-04-30,"16 aisling geal, fr. russell road",16 aisling geal,fr. russell road,,16 aisling geal,fr. russell road,
3,21,2010-04-01,"48 KILLIANS COURT, MULLAGH",unavailable,Cavan,122000.00,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30,"48 killians court, mullagh",48 killians court,mullagh,,48 killians court,mullagh,
4,35,2010-04-01,"Knock, Lanesboro",unavailable,Longford,125000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"knock, lanesboro",knock,lanesboro,,knock,lanesboro,
5,10,2010-04-01,"15a Moore Bay, Kilkee",unavailable,Clare,126500.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"15a moore bay, kilkee",15a moore bay,kilkee,,15a moore bay,kilkee,
6,25,2010-04-01,"59 ormond keep, nenagh",unavailable,Tipperary,128000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"59 ormond keep, nenagh",59 ormond keep,nenagh,,59 ormond keep,nenagh,
7,26,2010-04-01,"66 Rory O'Connor Place, Arklow",unavailable,Wicklow,145000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"66 rory o'connor place, arklow",66 rory o'connor place,arklow,,66 rory o'connor place,arklow,
8,32,2010-04-01,"Cloonlaughnan, Mount Talbot",unavailable,Roscommon,145000.00,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30,"cloonlaughnan, mount talbot",cloonlaughnan,mount talbot,,cloonlaughnan,mount talbot,
9,28,2010-04-01,"90 Suncroft Drive, Tallaght, Dublin 24",unavailable,Dublin,147950.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"90 suncroft drive, tallaght, dublin 24",90 suncroft drive,tallaght,dublin 24,90 suncroft drive,tallaght,dublin 24


In [32]:
df_o

Unnamed: 0,id,Date,Address,Postal Code,County,Price,Not Full Market Price,VAT Exclusive,Description of Property,Property Size Description,Province,Month_end_day,Address_lower,0,1,2
0,1,2010-01-01,"5 Braemor Drive, Churchtown, Co.Dublin",unavailable,Dublin,343000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-01-31,"5 braemor drive, churchtown, co.dublin",5 braemor drive,churchtown,co.dublin
1,2,2010-03-01,"134 Ashewood Walk, Summerhill Lane, Portlaoise",unavailable,Laois,185000.00,No,Yes,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-03-31,"134 ashewood walk, summerhill lane, portlaoise",134 ashewood walk,summerhill lane,portlaoise
2,11,2010-04-01,"16 Aisling Geal, Fr. Russell Road",unavailable,Limerick,110000.00,No,No,New Dwelling house /Apartment,greater than or equal to 38 sq metres and less...,Leinster,2010-04-30,"16 aisling geal, fr. russell road",16 aisling geal,fr. russell road,
3,21,2010-04-01,"48 KILLIANS COURT, MULLAGH",unavailable,Cavan,122000.00,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30,"48 killians court, mullagh",48 killians court,mullagh,
4,35,2010-04-01,"Knock, Lanesboro",unavailable,Longford,125000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"knock, lanesboro",knock,lanesboro,
5,10,2010-04-01,"15a Moore Bay, Kilkee",unavailable,Clare,126500.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"15a moore bay, kilkee",15a moore bay,kilkee,
6,25,2010-04-01,"59 ormond keep, nenagh",unavailable,Tipperary,128000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"59 ormond keep, nenagh",59 ormond keep,nenagh,
7,26,2010-04-01,"66 Rory O'Connor Place, Arklow",unavailable,Wicklow,145000.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"66 rory o'connor place, arklow",66 rory o'connor place,arklow,
8,32,2010-04-01,"Cloonlaughnan, Mount Talbot",unavailable,Roscommon,145000.00,No,Yes,New Dwelling house /Apartment,greater than 125 sq metres,Leinster,2010-04-30,"cloonlaughnan, mount talbot",cloonlaughnan,mount talbot,
9,28,2010-04-01,"90 Suncroft Drive, Tallaght, Dublin 24",unavailable,Dublin,147950.00,No,No,Second-Hand Dwelling house /Apartment,unavailable,Leinster,2010-04-30,"90 suncroft drive, tallaght, dublin 24",90 suncroft drive,tallaght,dublin 24


In [21]:
# df_o = df_o['Address_lower'].str.split(',')

In [22]:
# df_o.head()

In [23]:
# time_mask = (df_o['Month_end_day'] > '2014-01-31') & (df_o['Month_end_day'] <= '2019-12-31')
# new_house_mask= (df_o['Description of Property'].str.contains("Second-Hand") )


# df_time = df_o.loc[mask]

In [33]:
df_o['Province'].unique()

array(['Leinster'], dtype=object)