# FIT5196 Assessment 2
#### Student Name: Amarade Punfueng

Date: 06/06/2022

Environment: Jupyter Notebook 6.1.4 and Python 3.8.13

Libraries used:
* tabula
* geopandas
* regex
* numpy
* pandas
* xmlschema
* requests
* beautifulsoup
* pyplot

## Introduction

This assignment has two tasks. The first is data integration: data from three sources (XML, PDF, and a website) is combined into one dataset with the required schema. The second is data reshaping, with two criteria: the features must be on the same scale, and they should have as close to a linear relationship with the target variable as possible.

## Import libraries

In [1]:
import tabula
import geopandas as gpd
import re
import numpy as np
import pandas as pd
import xmlschema
import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt

## First task
Integrate data from three sources (XML, PDF, and a website) into one dataset with the required schema.

## 1. Import PDF and XML files

Import the PDF with the tabula library and the XML with the pandas library.

In [2]:
%%time
#Import PDF file to pandas dataframe
df_pdf = tabula.read_pdf("data_realstate.pdf")
df_csv = df_pdf[0]

'pages' argument isn't specified.Will extract only from page 1 by default.


CPU times: total: 31.2 ms
Wall time: 2min 56s


This process should take around 5 to 8 minutes to finish.

Check the structure of the dataframe and correct it: tabula produced empty Unnamed columns and placed the property IDs under the wrong header.

In [3]:
#Check the structure of the dataframe
df = df_csv.copy()
df.head()

Unnamed: 0.1,property_id,Unnamed: 0,lat,Unnamed: 1,lng,Unnamed: 2,addr_street
0,,83851,-37.854654,,145.004546,,30 Banole Avenue
1,,74446,-37.894338,,145.071977,,1/23 Rosella Street
2,,83649,-37.85633,,144.994837,,2 Somerset Place
3,,64021,-37.806072,,145.274986,,24 Vinter Avenue
4,,56630,-37.854585,,145.095298,,375 Warrigal Road


In [4]:
#Drop the empty columns and rename the column that actually holds the property IDs
df = df.drop(df.columns[[0, 3, 5]], axis=1)
df.rename(columns = {'Unnamed: 0':'property_id'}, inplace = True)
df

Unnamed: 0,property_id,lat,lng,addr_street
0,83851,-37.854654,145.004546,30 Banole Avenue
1,74446,-37.894338,145.071977,1/23 Rosella Street
2,83649,-37.856330,144.994837,2 Somerset Place
3,64021,-37.806072,145.274986,24 Vinter Avenue
4,56630,-37.854585,145.095298,375 Warrigal Road
...,...,...,...,...
1201,49261,-37.803990,145.082849,23 Kaleno View
1202,10632,-37.792689,144.931921,5 Robertson Street
1203,25263,-37.575311,144.932023,12 Windrock Avenue
1204,84711,-37.886866,144.990655,187 Ormond Road


Check the XML structure with the xmlschema library.

In [5]:
try:
    xmlschema.validate('data_realstate.xml', 'some.xsd')
except Exception as e: print(e)

junk after document element: line 7, column 0


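The error shows that the XML file has more than one top-level element, so it is not well formed. To fix it, the file is read as text, the records are wrapped in a single root element named data, and the & characters are escaped with a regular expression.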

In [6]:
#read the xml file
with open('data_realstate.xml', 'r') as f:
    data_xml = f.read()
data_xml

"<property>\n  <property_id>49261</property_id>\n  <lat>-37.80399</lat>\n  <lng>145.082849</lng>\n  <addr_street>23 Kaleno View</addr_street>\n</property>\n<property>\n  <property_id>10632</property_id>\n  <lat>-37.792689</lat>\n  <lng>144.931921</lng>\n  <addr_street>5 Robertson Street</addr_street>\n</property>\n<property>\n  <property_id>25263</property_id>\n  <lat>-37.575311</lat>\n  <lng>144.932023</lng>\n  <addr_street>12 Windrock Avenue</addr_street>\n</property>\n<property>\n  <property_id>84711</property_id>\n  <lat>-37.886866</lat>\n  <lng>144.990655</lng>\n  <addr_street>187 Ormond Road</addr_street>\n</property>\n<property>\n  <property_id>87667</property_id>\n  <lat>-37.932817</lat>\n  <lng>144.997925</lng>\n  <addr_street>37 Holyrood Street</addr_street>\n</property>\n<property>\n  <property_id>85206</property_id>\n  <lat>-37.87859</lat>\n  <lng>145.005777</lng>\n  <addr_street>10 Liscard Street</addr_street>\n</property>\n<property>\n  <property_id>4046</property_id>\n  

In [7]:
#fix the file
data_xml = ("<data>") + data_xml + ('</data>')
data_xml = re.sub('&','&amp;',data_xml)
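
The fix above can be checked on a small fragment without the data file. The following is a minimal sketch using only the standard library; the two records and the helper name `make_well_formed` are made up for illustration. It also assumes the file might already contain escaped entities such as `&amp;`, in which case a plain replacement of every `&` would escape them twice, so the pattern skips ampersands that already start an entity.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-record fragment: no root element, one bare ampersand
fragment = (
    "<property><property_id>1</property_id>"
    "<addr_street>1 Smith &amp; Jones Lane</addr_street></property>\n"
    "<property><property_id>2</property_id>"
    "<addr_street>2 King & Queen Street</addr_street></property>\n"
)

def make_well_formed(text):
    # Escape only ampersands that do not already start an entity
    text = re.sub(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)", "&amp;", text)
    # Wrap the sibling <property> elements in a single root element
    return "<data>" + text + "</data>"

root = ET.fromstring(make_well_formed(fragment))
streets = [p.findtext("addr_street") for p in root.findall("property")]
print(streets)
```

Parsing the unmodified fragment with `ET.fromstring` raises a `ParseError`, which corresponds to the "junk after document element" message reported by the validation step.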

Import the fixed XML text by using the pandas library.

In [8]:
df_xml = pd.read_xml(data_xml)
print(df_xml)

      property_id        lat         lng         addr_street
0           49261 -37.803990  145.082849      23 Kaleno View
1           10632 -37.792689  144.931921  5 Robertson Street
2           25263 -37.575311  144.932023  12 Windrock Avenue
3           84711 -37.886866  144.990655     187 Ormond Road
4           87667 -37.932817  144.997925  37 Holyrood Street
...           ...        ...         ...                 ...
1205        26542 -37.609910  144.914254     32 Tusmore Rise
1206        70912 -37.887106  145.092712     9 Dundee Avenue
1207        50445 -37.767408  145.092557  94 Manningham Road
1208        46020 -37.717353  145.133063   30  Adam Crescent
1209        74013 -37.854507  145.274292   8 Medway Crescent

[1210 rows x 4 columns]

Integrating PDF and XML into one dataframe.

In [9]:
df2 = pd.concat([df, df_xml], ignore_index=True)
df2

Unnamed: 0,property_id,lat,lng,addr_street
0,83851,-37.854654,145.004546,30 Banole Avenue
1,74446,-37.894338,145.071977,1/23 Rosella Street
2,83649,-37.856330,144.994837,2 Somerset Place
3,64021,-37.806072,145.274986,24 Vinter Avenue
4,56630,-37.854585,145.095298,375 Warrigal Road
...,...,...,...,...
2411,26542,-37.609910,144.914254,32 Tusmore Rise
2412,70912,-37.887106,145.092712,9 Dundee Avenue
2413,50445,-37.767408,145.092557,94 Manningham Road
2414,46020,-37.717353,145.133063,30 Adam Crescent


## 2. Adding data into the suburb column

Import the shapefile of suburb boundaries by using the geopandas library.

In [10]:
suburb = gpd.read_file("VIC_LOCALITY_POLYGON_shp.shp")
suburb.head()

Unnamed: 0,LC_PLY_PID,DT_CREATE,DT_RETIRE,LOC_PID,VIC_LOCALI,VIC_LOCA_1,VIC_LOCA_2,VIC_LOCA_3,VIC_LOCA_4,VIC_LOCA_5,VIC_LOCA_6,VIC_LOCA_7,geometry
0,6670,2011-08-31,,VIC2615,2012-04-27,,UNDERBOOL,,,G,,2,"POLYGON ((141.74552 -35.07229, 141.74552 -35.0..."
1,6671,2011-08-31,,VIC1986,2012-04-27,,NURRAN,,,G,,2,"POLYGON ((148.66877 -37.39571, 148.66876 -37.3..."
2,6672,2011-08-31,,VIC2862,2012-04-27,,WOORNDOO,,,G,,2,"POLYGON ((142.92288 -37.97886, 142.90449 -37.9..."
3,6673,2011-08-31,,VIC734,2017-08-09,,DEPTFORD,,,G,,2,"POLYGON ((147.82336 -37.66001, 147.82313 -37.6..."
4,6674,2011-08-31,,VIC2900,2012-04-27,,YANAC,,,G,,2,"POLYGON ((141.27978 -35.99859, 141.27989 -35.9..."


In [11]:
#Remove unused columns
suburb = suburb.drop(suburb.columns[[0,1,2,3,4,5,7,8,9,10,11]], axis=1)
suburb

Unnamed: 0,VIC_LOCA_2,geometry
0,UNDERBOOL,"POLYGON ((141.74552 -35.07229, 141.74552 -35.0..."
1,NURRAN,"POLYGON ((148.66877 -37.39571, 148.66876 -37.3..."
2,WOORNDOO,"POLYGON ((142.92288 -37.97886, 142.90449 -37.9..."
3,DEPTFORD,"POLYGON ((147.82336 -37.66001, 147.82313 -37.6..."
4,YANAC,"POLYGON ((141.27978 -35.99859, 141.27989 -35.9..."
...,...,...
2968,MELBOURNE AIRPORT,"POLYGON ((144.86382 -37.67087, 144.86405 -37.6..."
2969,BULLA,"POLYGON ((144.80217 -37.66167, 144.80243 -37.6..."
2970,SOMERS,"POLYGON ((145.19211 -38.39105, 145.19392 -38.3..."
2971,HMAS CERBERUS,"POLYGON ((145.21831 -38.38722, 145.21863 -38.3..."


Adding required columns and default values to the dataframe.

In [12]:
df2['suburb'] = 'not available'
df2['closest_train_station_id'] = 0
df2['distance_to_closest_train_station'] = 0
df2

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station
0,83851,-37.854654,145.004546,30 Banole Avenue,not available,0,0
1,74446,-37.894338,145.071977,1/23 Rosella Street,not available,0,0
2,83649,-37.856330,144.994837,2 Somerset Place,not available,0,0
3,64021,-37.806072,145.274986,24 Vinter Avenue,not available,0,0
4,56630,-37.854585,145.095298,375 Warrigal Road,not available,0,0
...,...,...,...,...,...,...,...
2411,26542,-37.609910,144.914254,32 Tusmore Rise,not available,0,0
2412,70912,-37.887106,145.092712,9 Dundee Avenue,not available,0,0
2413,50445,-37.767408,145.092557,94 Manningham Road,not available,0,0
2414,46020,-37.717353,145.133063,30 Adam Crescent,not available,0,0


Use geopandas to create a geometry column that holds each property's (lng, lat) point.

In [13]:
gdf = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.lng, df2.lat))
#Set the Coordinate Reference System to GDA94 (EPSG:4283), the Australian geographic system
gdf = gdf.set_crs(epsg=4283)
gdf

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,geometry
0,83851,-37.854654,145.004546,30 Banole Avenue,not available,0,0,POINT (145.00455 -37.85465)
1,74446,-37.894338,145.071977,1/23 Rosella Street,not available,0,0,POINT (145.07198 -37.89434)
2,83649,-37.856330,144.994837,2 Somerset Place,not available,0,0,POINT (144.99484 -37.85633)
3,64021,-37.806072,145.274986,24 Vinter Avenue,not available,0,0,POINT (145.27499 -37.80607)
4,56630,-37.854585,145.095298,375 Warrigal Road,not available,0,0,POINT (145.09530 -37.85459)
...,...,...,...,...,...,...,...,...
2411,26542,-37.609910,144.914254,32 Tusmore Rise,not available,0,0,POINT (144.91425 -37.60991)
2412,70912,-37.887106,145.092712,9 Dundee Avenue,not available,0,0,POINT (145.09271 -37.88711)
2413,50445,-37.767408,145.092557,94 Manningham Road,not available,0,0,POINT (145.09256 -37.76741)
2414,46020,-37.717353,145.133063,30 Adam Crescent,not available,0,0,POINT (145.13306 -37.71735)


Use geopandas.sjoin to identify which suburb each property is located in.

In [14]:
#Use sjoin to find which suburb each property is located in
#(predicate replaces the deprecated op argument in geopandas 0.10 and later)
df_final = gpd.sjoin(gdf, suburb, how="inner", predicate='intersects')
df_final['suburb'] = df_final['VIC_LOCA_2']
#Remove unused columns
df_final = df_final.drop(columns=['VIC_LOCA_2','index_right'])
df_final


Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,geometry
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,0,0,POINT (145.00455 -37.85465)
74,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,0,0,POINT (145.00297 -37.85263)
78,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,0,0,POINT (145.00565 -37.85170)
351,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,0,0,POINT (144.98713 -37.84736)
853,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,0,0,POINT (145.00431 -37.85601)
...,...,...,...,...,...,...,...,...
1961,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,0,0,POINT (144.83906 -37.86839)
1975,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,0,0,POINT (144.84042 -37.86588)
1987,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,0,0,POINT (145.25443 -37.77200)
2080,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,0,0,POINT (145.03812 -37.84269)
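
The spatial join above assigns each property point to the polygon that contains it. As an illustration of that idea only, here is a plain-Python ray-casting sketch. The two rectangular "suburbs" and the function names are hypothetical, and this is not how geopandas implements the join internally (it delegates to its geometry engine and handles far more detailed polygons).

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count the polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular "suburbs" given as (lng, lat) vertices
suburbs = {
    "WEST": [(144.0, -38.0), (145.0, -38.0), (145.0, -37.0), (144.0, -37.0)],
    "EAST": [(145.0, -38.0), (146.0, -38.0), (146.0, -37.0), (145.0, -37.0)],
}

def locate(lng, lat):
    """Return the first suburb containing the point, or the default label."""
    for name, polygon in suburbs.items():
        if point_in_polygon(lng, lat, polygon):
            return name
    return "not available"

print(locate(145.004546, -37.854654))
print(locate(150.0, -37.5))
```

Because the join uses how="inner", a property that falls outside every polygon is dropped from the result rather than keeping the default label, which may be worth checking by comparing row counts before and after the join.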


## 3. Adding data into the closest_train_station_id and distance_to_closest_train_station columns

Import stops.txt, which contains the train station data.

In [15]:
stops = pd.read_csv('stops.txt', delimiter = ",")
stops

Unnamed: 0,stop_id,stop_name,stop_short_name,stop_lat,stop_lon
0,15351,Sunbury Railway Station,Sunbury,-37.579091,144.727319
1,15353,Diggers Rest Railway Station,Diggers Rest,-37.627017,144.719922
2,19827,Stony Point Railway Station,Crib Point,-38.374235,145.221837
3,19828,Crib Point Railway Station,Crib Point,-38.366123,145.204043
4,19829,Morradoo Railway Station,Crib Point,-38.354033,145.189602
...,...,...,...,...,...
213,44817,Coolaroo Railway Station,Coolaroo,-37.661003,144.926056
214,45793,Lynbrook Railway Station,Lynbrook,-38.057341,145.249275
215,45794,Cardinia Road Railway Station,Pakenham,-38.071290,145.437791
216,45795,South Morang Railway Station,South Morang,-37.649159,145.067032


In [16]:
#Rename the columns
stops.rename(columns = {'stop_lat':'lat','stop_lon':'lng'}, inplace = True)
stops

Unnamed: 0,stop_id,stop_name,stop_short_name,lat,lng
0,15351,Sunbury Railway Station,Sunbury,-37.579091,144.727319
1,15353,Diggers Rest Railway Station,Diggers Rest,-37.627017,144.719922
2,19827,Stony Point Railway Station,Crib Point,-38.374235,145.221837
3,19828,Crib Point Railway Station,Crib Point,-38.366123,145.204043
4,19829,Morradoo Railway Station,Crib Point,-38.354033,145.189602
...,...,...,...,...,...
213,44817,Coolaroo Railway Station,Coolaroo,-37.661003,144.926056
214,45793,Lynbrook Railway Station,Lynbrook,-38.057341,145.249275
215,45794,Cardinia Road Railway Station,Pakenham,-38.071290,145.437791
216,45795,South Morang Railway Station,South Morang,-37.649159,145.067032


Create a function that calculates the great-circle (haversine) distance in kilometres between two points. It is the same function as my function in assessment 2.

In [17]:
##This function is slightly adjusted from
##https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836 and
##https://www.adamsmith.haus/python/answers/how-to-find-the-distance-between-two-lat-long-coordinates-in-python

def dist(lat1, lon1, lat2, lon2):

    earth_radius = 6378.0
    lat1 = np.radians(lat1)
    lon1 = np.radians(lon1)
    lat2 = np.radians(lat2)
    lon2 = np.radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2

    distance = earth_radius * 2 * np.arcsin(np.sqrt(a))
    
    return distance

Create a function to find the closest train station.

In [18]:
##This function is slightly adjusted from
##https://medium.com/analytics-vidhya/finding-nearest-pair-of-latitude-and-longitude-match-using-python-ce50d62af546
def find_closest_train_stations(lat, long):
    distances = stops.apply(lambda row: dist(lat, long, row['lat'], row['lng']),axis=1)
    return stops.loc[distances.idxmin(), 'stop_id']
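
The two functions above can be sanity-checked by hand on a few stations. The sketch below is a standard-library version of the same haversine formula and radius, applied to three stations whose coordinates appear in the stops data; the names `haversine_km` and `closest_station` are illustrative and not part of the notebook.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6378.0):
    """Great-circle distance in km, using the same formula and radius as dist."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return radius_km * 2 * math.asin(math.sqrt(a))

# Three stations from the stops data as (stop_id, lat, lng)
stations = [
    (19946, -37.850774, 145.013909),  # Toorak
    (19947, -37.844591, 145.002142),  # Hawksburn
    (19958, -37.849518, 144.989860),  # Prahran
]

def closest_station(lat, lng):
    """Return (stop_id, distance_km) of the nearest station in the list."""
    return min(
        ((stop_id, haversine_km(lat, lng, s_lat, s_lng))
         for stop_id, s_lat, s_lng in stations),
        key=lambda pair: pair[1],
    )

# Property 83478, 13 Athol Street
stop_id, km = closest_station(-37.847361, 144.987132)
print(stop_id, round(km, 3))
```

For this property the notebook reports station 19958 at roughly 0.339 km, so the sketch gives a quick cross-check of both the nearest-station choice and the distance.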

In [19]:
#Apply the function to find closest train station
df_final['closest_train_station_id'] = df_final.apply(lambda row: find_closest_train_stations(row['lat'], row['lng']),axis=1)
df_final = df_final.drop(columns=['geometry'])
df_final

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0
74,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0
78,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0
351,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0
853,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,0
...,...,...,...,...,...,...,...
1961,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0
1975,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0
1987,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,0
2080,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0


Finding the distance from the property and the closest train station.

Add coordinate of the closest train station to the dataframe.

In [20]:
df_final = pd.merge(df_final, stops, how='left', left_on=['closest_train_station_id'], right_on=['stop_id'])
df_final

Unnamed: 0,property_id,lat_x,lng_x,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,stop_id,stop_name,stop_short_name,lat_y,lng_y
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0,19946,Toorak Railway Station,Armadale,-37.850774,145.013909
1,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0,19947,Hawksburn Railway Station,South Yarra,-37.844591,145.002142
2,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0,19946,Toorak Railway Station,Armadale,-37.850774,145.013909
3,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0,19958,Prahran Railway Station,Prahran,-37.849518,144.989860
4,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,0,19946,Toorak Railway Station,Armadale,-37.850774,145.013909
...,...,...,...,...,...,...,...,...,...,...,...,...
2411,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0,19927,Seaholme Railway Station,Seaholme,-37.867842,144.840958
2412,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0,19927,Seaholme Railway Station,Seaholme,-37.867842,144.840958
2413,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,0,19878,Croydon Railway Station,Croydon,-37.795437,145.280598
2414,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0,19910,Kooyong Railway Station,Kooyong,-37.839929,145.033552


Using the function to calculate the distance between the property and the closest train station.

In [21]:
#Using the function
df_final['distance_to_closest_train_station'] =  dist(df_final['lat_x'], df_final['lng_x'],df_final['lat_y'], df_final['lng_y'])
#Remove unused column and rename column
df_final = df_final.drop(columns=['stop_id','stop_name','stop_short_name','lat_y','lng_y'])
df_final.rename(columns = {'lat_x':'lat','lng_x':'lng'}, inplace = True)
df_final

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0.929392
1,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0.897389
2,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0.733610
3,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0.339376
4,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,1.025765
...,...,...,...,...,...,...,...
2411,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0.177860
2412,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0.223980
2413,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,3.479717
2414,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0.505987


## 4. Adding data into the travel_min_to_MC and direct_journey_flag columns

Import stop_times.txt, calendar.txt, and trips.txt.

In [22]:
stop_times = pd.read_csv('stop_times.txt', delimiter = ",")
calendar = pd.read_csv('calendar.txt', delimiter = ",")
trips = pd.read_csv('trips.txt', delimiter = ",")

Per the specification, travel_min_to_MC must be calculated only for weekday services (Monday to Friday).

In [23]:
calendar

Unnamed: 0,service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
0,T2,0,0,0,0,0,1,0,20151009,20151011
1,UJ,0,0,0,0,0,0,1,20151009,20151011
2,T6,0,0,0,0,1,0,0,20151009,20151011
3,T5,1,1,1,1,0,0,0,20151012,20151015
4,T2_1,0,0,0,0,0,1,0,20151016,20151018
5,UJ_1,0,0,0,0,0,0,1,20151016,20151018
6,T6_1,0,0,0,0,1,0,0,20151016,20151018
7,T5_1,1,1,1,1,0,0,0,20151019,20151022
8,T0,1,1,1,1,1,0,0,20151023,20151122
9,T2_2,0,0,0,0,0,1,0,20151023,20151122


Assuming 1 means the service runs on that day, T0 is the only service_id that runs on every weekday from Monday to Friday.
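
The same selection can be written as a rule rather than read off the table. This is a small sketch over four of the calendar rows shown above, assuming "operates on weekdays" means a flag of 1 on all five days; `runs_every_weekday` is an illustrative name.

```python
# Rows from calendar.txt: service_id, then the Monday..Sunday flags
calendar_rows = [
    ("T2", 0, 0, 0, 0, 0, 1, 0),
    ("T6", 0, 0, 0, 0, 1, 0, 0),
    ("T5", 1, 1, 1, 1, 0, 0, 0),
    ("T0", 1, 1, 1, 1, 1, 0, 0),
]

def runs_every_weekday(row):
    # Positions 1..5 hold the Monday..Friday flags
    return all(flag == 1 for flag in row[1:6])

weekday_services = [row[0] for row in calendar_rows if runs_every_weekday(row)]
print(weekday_services)
```

Under this rule T5 (Monday to Thursday) and T6 (Friday only) are excluded even though together they cover the working week.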

Finding stop_id for Melbourne Central Railway Station.

In [24]:
stops

Unnamed: 0,stop_id,stop_name,stop_short_name,lat,lng
0,15351,Sunbury Railway Station,Sunbury,-37.579091,144.727319
1,15353,Diggers Rest Railway Station,Diggers Rest,-37.627017,144.719922
2,19827,Stony Point Railway Station,Crib Point,-38.374235,145.221837
3,19828,Crib Point Railway Station,Crib Point,-38.366123,145.204043
4,19829,Morradoo Railway Station,Crib Point,-38.354033,145.189602
...,...,...,...,...,...
213,44817,Coolaroo Railway Station,Coolaroo,-37.661003,144.926056
214,45793,Lynbrook Railway Station,Lynbrook,-38.057341,145.249275
215,45794,Cardinia Road Railway Station,Pakenham,-38.071290,145.437791
216,45795,South Morang Railway Station,South Morang,-37.649159,145.067032


In [25]:
stops.loc[stops.stop_name == 'Melbourne Central Railway Station']

Unnamed: 0,stop_id,stop_name,stop_short_name,lat,lng
17,19842,Melbourne Central Railway Station,Melbourne City,-37.809939,144.962594


In [26]:
stop_times

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
0,17182517.T2.2-ALM-B-mjp-1.1.H,04:57:00,04:57:00,19847,1,,0,0,0.000000
1,17182517.T2.2-ALM-B-mjp-1.1.H,04:58:00,04:58:00,19848,2,,0,0,723.017818
2,17182517.T2.2-ALM-B-mjp-1.1.H,05:00:00,05:00:00,19849,3,,0,0,1951.735072
3,17182517.T2.2-ALM-B-mjp-1.1.H,05:02:00,05:02:00,19850,4,,0,0,2899.073349
4,17182517.T2.2-ALM-B-mjp-1.1.H,05:04:00,05:04:00,19851,5,,0,0,3927.090952
...,...,...,...,...,...,...,...,...,...
390300,17199140.UJ.2-ain-mjp-1.4.R,18:09:00,18:09:00,20028,1,,0,0,0.000000
390301,17199140.UJ.2-ain-mjp-1.4.R,18:15:00,18:15:00,19973,4,,0,0,4011.161109
390302,17199140.UJ.2-ain-mjp-1.4.R,18:19:00,18:19:00,22180,5,,0,0,5676.741894
390303,17199142.T2.2-ain-mjp-1.5.R,24:00:00,24:00:00,20027,1,,0,0,0.000000


In [27]:
trips

Unnamed: 0,route_id,service_id,trip_id,shape_id,trip_headsign,direction_id
0,2-ALM-F-mjp-1,T0,17067982.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
1,2-ALM-F-mjp-1,T0,17067988.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
2,2-ALM-F-mjp-1,T0,17067992.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
3,2-ALM-F-mjp-1,T0,17067999.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
4,2-ALM-F-mjp-1,T0,17068003.T0.2-ALM-F-mjp-1.1.H,2-ALM-F-mjp-1.1.H,City (Flinders Street),0
...,...,...,...,...,...,...
23804,2-WMN-F-mjp-1,UJ_2,17072252.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23805,2-WMN-F-mjp-1,UJ_2,17072256.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23806,2-WMN-F-mjp-1,UJ_2,17072260.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1
23807,2-WMN-F-mjp-1,UJ_2,17072264.UJ.2-WMN-F-mjp-1.6.R,2-WMN-F-mjp-1.6.R,Williamstown,1


Comparing the two dataframes above shows that the second dot-separated group of trip_id corresponds to the service_id.

Filtering the dataframe to show only trip_id that has service_id equal to T0.

In [28]:
#Check that the first group of trip_id is always 8 digits long
stop_times[stop_times['trip_id'].str.match(r'\d{8}')]

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
0,17182517.T2.2-ALM-B-mjp-1.1.H,04:57:00,04:57:00,19847,1,,0,0,0.000000
1,17182517.T2.2-ALM-B-mjp-1.1.H,04:58:00,04:58:00,19848,2,,0,0,723.017818
2,17182517.T2.2-ALM-B-mjp-1.1.H,05:00:00,05:00:00,19849,3,,0,0,1951.735072
3,17182517.T2.2-ALM-B-mjp-1.1.H,05:02:00,05:02:00,19850,4,,0,0,2899.073349
4,17182517.T2.2-ALM-B-mjp-1.1.H,05:04:00,05:04:00,19851,5,,0,0,3927.090952
...,...,...,...,...,...,...,...,...,...
390300,17199140.UJ.2-ain-mjp-1.4.R,18:09:00,18:09:00,20028,1,,0,0,0.000000
390301,17199140.UJ.2-ain-mjp-1.4.R,18:15:00,18:15:00,19973,4,,0,0,4011.161109
390302,17199140.UJ.2-ain-mjp-1.4.R,18:19:00,18:19:00,22180,5,,0,0,5676.741894
390303,17199142.T2.2-ain-mjp-1.5.R,24:00:00,24:00:00,20027,1,,0,0,0.000000


In [29]:
#Keep only the trip_id values whose second group (service_id) is T0
stop_times = stop_times[stop_times['trip_id'].str.match(r'\d{8}\.T0')]
stop_times

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
8568,17067982.T0.2-ALM-F-mjp-1.1.H,05:01:00,05:01:00,19847,1,,0,0,0.000000
8569,17067982.T0.2-ALM-F-mjp-1.1.H,05:02:00,05:02:00,19848,2,,0,0,723.017818
8570,17067982.T0.2-ALM-F-mjp-1.1.H,05:04:00,05:04:00,19849,3,,0,0,1951.735072
8571,17067982.T0.2-ALM-F-mjp-1.1.H,05:06:00,05:06:00,19850,4,,0,0,2899.073349
8572,17067982.T0.2-ALM-F-mjp-1.1.H,05:08:00,05:08:00,19851,5,,0,0,3927.090952
...,...,...,...,...,...,...,...,...,...
389734,17072091.T0.2-WMN-F-mjp-1.6.R,23:59:00,23:59:00,19991,4,,0,0,3641.811422
389743,17072097.T0.2-WMN-F-mjp-1.6.R,24:33:00,24:33:00,19994,1,,0,0,0.000000
389744,17072097.T0.2-WMN-F-mjp-1.6.R,24:35:00,24:35:00,19993,2,,0,0,1702.554760
389745,17072097.T0.2-WMN-F-mjp-1.6.R,24:37:00,24:37:00,19992,3,,0,0,2598.738912
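
Filtering on the text of trip_id relies on the second group matching the service_id. In the trips table above, one row has service_id UJ_2 while its trip_id contains only UJ, so the two are not always identical. The sketch below compares parsing the trip_id with looking the service up through trips; the names `service_code` and `service_by_trip` are illustrative.

```python
import re

# (service_id, trip_id) pairs taken from the trips table
trips_rows = [
    ("T0", "17067982.T0.2-ALM-F-mjp-1.1.H"),
    ("UJ_2", "17072252.UJ.2-WMN-F-mjp-1.6.R"),
]

# Eight digits, a dot, then the second group up to the next dot
TRIP_ID = re.compile(r"(\d{8})\.([^.]+)\.")

def service_code(trip_id):
    """Second dot-separated group of a trip_id, or None if the pattern does not match."""
    match = TRIP_ID.match(trip_id)
    return match.group(2) if match else None

# Lookup built from trips, independent of how trip_id is formatted
service_by_trip = {trip_id: service_id for service_id, trip_id in trips_rows}

for service_id, trip_id in trips_rows:
    print(trip_id, service_code(trip_id), service_by_trip[trip_id])
```

For T0 both approaches agree in the rows shown, so the regex filter is adequate here; a merge with trips on trip_id would be the more general option.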


Create the dataframe that has only Melbourne Central Railway Station.

In [30]:
filter_df = stop_times.loc[stop_times.stop_id == 19842]
filter_df

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
9484,17068379.T0.2-ALM-F-mjp-1.2.H,06:06:00,06:06:00,19842,15,,0,0,15810.104392
9501,17068398.T0.2-ALM-F-mjp-1.2.H,09:36:00,09:36:00,19842,15,,0,0,15810.104392
9518,17068399.T0.2-ALM-F-mjp-1.2.H,09:51:00,09:51:00,19842,15,,0,0,15810.104392
9535,17068381.T0.2-ALM-F-mjp-1.2.H,06:34:00,06:34:00,19842,15,,0,0,15810.104392
9552,17068383.T0.2-ALM-F-mjp-1.2.H,06:50:00,06:50:00,19842,15,,0,0,15810.104392
...,...,...,...,...,...,...,...,...,...
358112,17070350.T0.2-UFD-F-mjp-1.9.R,17:46:00,17:46:00,19842,18,,0,0,22660.953896
358131,17070362.T0.2-UFD-F-mjp-1.9.R,18:46:00,18:46:00,19842,18,,0,0,22660.953896
358150,17070366.T0.2-UFD-F-mjp-1.9.R,19:11:00,19:11:00,19842,18,,0,0,22660.953896
358169,17070390.T0.2-UFD-F-mjp-1.9.R,21:38:00,21:38:00,19842,18,,0,0,22660.953896


If a trip_id has a stop at Melbourne Central Railway Station, that trip offers a direct journey to Melbourne Central Railway Station. Because stop_times is already filtered to service T0, these trips also operate on weekdays.

In [31]:
filter_df = filter_df.reset_index(drop=True)
stop_times = stop_times.reset_index(drop=True)
#Create the dataframe that has only trip_id have a direct journey to Melbourne Central Railway Station and operates on weekdays
df_traveltime = stop_times.loc[stop_times.trip_id.isin(filter_df['trip_id'])]
df_traveltime

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
217,17068379.T0.2-ALM-F-mjp-1.2.H,05:38:00,05:38:00,19847,1,,0,0,0.000000
218,17068379.T0.2-ALM-F-mjp-1.2.H,05:39:00,05:39:00,19848,2,,0,0,723.017818
219,17068379.T0.2-ALM-F-mjp-1.2.H,05:41:00,05:41:00,19849,3,,0,0,1951.735072
220,17068379.T0.2-ALM-F-mjp-1.2.H,05:43:00,05:43:00,19850,4,,0,0,2899.073349
221,17068379.T0.2-ALM-F-mjp-1.2.H,05:45:00,05:45:00,19851,5,,0,0,3927.090952
...,...,...,...,...,...,...,...,...,...
40124,17070400.T0.2-UFD-F-mjp-1.9.R,22:54:00,22:54:00,22180,15,,0,0,18659.883441
40125,17070400.T0.2-UFD-F-mjp-1.9.R,22:58:00,23:04:00,19854,16,,0,0,20195.638187
40126,17070400.T0.2-UFD-F-mjp-1.9.R,23:06:00,23:06:00,19843,17,,0,0,21513.362115
40127,17070400.T0.2-UFD-F-mjp-1.9.R,23:08:00,23:08:00,19842,18,,0,0,22660.953896


In [32]:
#Remove unused columns
df_traveltime = df_traveltime.drop(df_traveltime.columns[[4, 5, 6, 7, 8]], axis=1)
df_traveltime

Unnamed: 0,trip_id,arrival_time,departure_time,stop_id
217,17068379.T0.2-ALM-F-mjp-1.2.H,05:38:00,05:38:00,19847
218,17068379.T0.2-ALM-F-mjp-1.2.H,05:39:00,05:39:00,19848
219,17068379.T0.2-ALM-F-mjp-1.2.H,05:41:00,05:41:00,19849
220,17068379.T0.2-ALM-F-mjp-1.2.H,05:43:00,05:43:00,19850
221,17068379.T0.2-ALM-F-mjp-1.2.H,05:45:00,05:45:00,19851
...,...,...,...,...
40124,17070400.T0.2-UFD-F-mjp-1.9.R,22:54:00,22:54:00,22180
40125,17070400.T0.2-UFD-F-mjp-1.9.R,22:58:00,23:04:00,19854
40126,17070400.T0.2-UFD-F-mjp-1.9.R,23:06:00,23:06:00,19843
40127,17070400.T0.2-UFD-F-mjp-1.9.R,23:08:00,23:08:00,19842


Create a dataframe that holds only the Melbourne Central Railway Station rows and their arrival_time.

In [33]:
df_mel = df_traveltime.loc[df_traveltime.stop_id == 19842]
df_mel = df_mel.drop(df_mel.columns[[2]], axis=1)
df_mel

Unnamed: 0,trip_id,arrival_time,stop_id
230,17068379.T0.2-ALM-F-mjp-1.2.H,06:06:00,19842
247,17068398.T0.2-ALM-F-mjp-1.2.H,09:36:00,19842
264,17068399.T0.2-ALM-F-mjp-1.2.H,09:51:00,19842
281,17068381.T0.2-ALM-F-mjp-1.2.H,06:34:00,19842
298,17068383.T0.2-ALM-F-mjp-1.2.H,06:50:00,19842
...,...,...,...
40051,17070350.T0.2-UFD-F-mjp-1.9.R,17:46:00,19842
40070,17070362.T0.2-UFD-F-mjp-1.9.R,18:46:00,19842
40089,17070366.T0.2-UFD-F-mjp-1.9.R,19:11:00,19842
40108,17070390.T0.2-UFD-F-mjp-1.9.R,21:38:00,19842


Merge the dataframe to calculate travel time from stop_id_x to Melbourne Central Railway Station.

In [34]:
df_traveltime = pd.merge(df_traveltime, df_mel, how='left', on='trip_id')
df_traveltime

Unnamed: 0,trip_id,arrival_time_x,departure_time,stop_id_x,arrival_time_y,stop_id_y
0,17068379.T0.2-ALM-F-mjp-1.2.H,05:38:00,05:38:00,19847,06:06:00,19842
1,17068379.T0.2-ALM-F-mjp-1.2.H,05:39:00,05:39:00,19848,06:06:00,19842
2,17068379.T0.2-ALM-F-mjp-1.2.H,05:41:00,05:41:00,19849,06:06:00,19842
3,17068379.T0.2-ALM-F-mjp-1.2.H,05:43:00,05:43:00,19850,06:06:00,19842
4,17068379.T0.2-ALM-F-mjp-1.2.H,05:45:00,05:45:00,19851,06:06:00,19842
...,...,...,...,...,...,...
26697,17070400.T0.2-UFD-F-mjp-1.9.R,22:54:00,22:54:00,22180,23:08:00,19842
26698,17070400.T0.2-UFD-F-mjp-1.9.R,22:58:00,23:04:00,19854,23:08:00,19842
26699,17070400.T0.2-UFD-F-mjp-1.9.R,23:06:00,23:06:00,19843,23:08:00,19842
26700,17070400.T0.2-UFD-F-mjp-1.9.R,23:08:00,23:08:00,19842,23:08:00,19842


Per the specification, travel_min_to_MC only counts trips departing between 7 am and 9 am.

In [35]:
df_traveltime = df_traveltime.loc[(df_traveltime.departure_time > '07:00:00' ) & (df_traveltime.departure_time < '09:00:00')]
df_traveltime

Unnamed: 0,trip_id,arrival_time_x,departure_time,stop_id_x,arrival_time_y,stop_id_y
94,17068385.T0.2-ALM-F-mjp-1.2.H,07:02:00,07:02:00,19905,07:12:00,19842
95,17068385.T0.2-ALM-F-mjp-1.2.H,07:04:00,07:04:00,19906,07:12:00,19842
96,17068385.T0.2-ALM-F-mjp-1.2.H,07:07:00,07:07:00,19908,07:12:00,19842
97,17068385.T0.2-ALM-F-mjp-1.2.H,07:10:00,07:10:00,19843,07:12:00,19842
98,17068385.T0.2-ALM-F-mjp-1.2.H,07:12:00,07:12:00,19842,07:12:00,19842
...,...,...,...,...,...,...
26239,17070663.T0.2-UFD-F-mjp-1.6.R,08:54:00,08:54:00,19967,09:12:00,19842
26240,17070663.T0.2-UFD-F-mjp-1.6.R,08:56:00,08:56:00,19968,09:12:00,19842
26241,17070663.T0.2-UFD-F-mjp-1.6.R,08:57:00,08:57:00,19969,09:12:00,19842
26242,17070663.T0.2-UFD-F-mjp-1.6.R,08:59:00,08:59:00,19970,09:12:00,19842


Calculate travel time in minutes and assign the value to the time column.

In [36]:
#Copy first to avoid SettingWithCopyWarning, then compute minutes from departure to arrival at Melbourne Central
df_traveltime = df_traveltime.copy()
df_traveltime['time'] = (pd.to_timedelta(df_traveltime['arrival_time_y']) - pd.to_timedelta(df_traveltime['departure_time'])).dt.total_seconds() / 60
df_traveltime


Unnamed: 0,trip_id,arrival_time_x,departure_time,stop_id_x,arrival_time_y,stop_id_y,time
94,17068385.T0.2-ALM-F-mjp-1.2.H,07:02:00,07:02:00,19905,07:12:00,19842,10.0
95,17068385.T0.2-ALM-F-mjp-1.2.H,07:04:00,07:04:00,19906,07:12:00,19842,8.0
96,17068385.T0.2-ALM-F-mjp-1.2.H,07:07:00,07:07:00,19908,07:12:00,19842,5.0
97,17068385.T0.2-ALM-F-mjp-1.2.H,07:10:00,07:10:00,19843,07:12:00,19842,2.0
98,17068385.T0.2-ALM-F-mjp-1.2.H,07:12:00,07:12:00,19842,07:12:00,19842,0.0
...,...,...,...,...,...,...,...
26239,17070663.T0.2-UFD-F-mjp-1.6.R,08:54:00,08:54:00,19967,09:12:00,19842,18.0
26240,17070663.T0.2-UFD-F-mjp-1.6.R,08:56:00,08:56:00,19968,09:12:00,19842,16.0
26241,17070663.T0.2-UFD-F-mjp-1.6.R,08:57:00,08:57:00,19969,09:12:00,19842,15.0
26242,17070663.T0.2-UFD-F-mjp-1.6.R,08:59:00,08:59:00,19970,09:12:00,19842,13.0
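
The time filter and the subtraction can be reproduced on single values. The sketch below converts the HH:MM:SS strings to minutes, including times past midnight such as 24:33:00 that appear in stop_times. The helper names are illustrative, and the inclusive 07:00 to 09:00 bounds are an assumption; the filter cell above uses strict comparisons, so departures at exactly 07:00:00 or 09:00:00 are excluded there.

```python
def to_minutes(hhmmss):
    """Convert an 'HH:MM:SS' string to minutes; hours may be 24 or more."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 60 + minutes + seconds / 60

def travel_minutes(departure, arrival_at_mc, start="07:00:00", end="09:00:00"):
    """Minutes from departure to arrival, or None if the departure is outside [start, end]."""
    dep = to_minutes(departure)
    if not (to_minutes(start) <= dep <= to_minutes(end)):
        return None
    return to_minutes(arrival_at_mc) - dep

print(travel_minutes("07:02:00", "07:12:00"))  # first row of the table above
print(travel_minutes("06:50:00", "07:12:00"))  # departs before the window
print(to_minutes("24:33:00"))                  # after-midnight time
```

Comparing the raw strings, as the notebook does, gives the same ordering only because every time is zero-padded to two digits per field.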


A positive travel time means the train leaves the stop before it arrives at Melbourne Central Railway Station, so the trip runs towards Melbourne Central as required.
A negative travel time means the train has already passed Melbourne Central Railway Station and is travelling away from it, so those rows are removed.
A travel time of 0 means the stop is Melbourne Central Railway Station itself.

In [37]:
#Remove negative time value
df_traveltime = df_traveltime.loc[(df_traveltime.time >= 0)]
#Remove unused columns
df_traveltime = df_traveltime.drop(df_traveltime.columns[[0,1,2,4,5]], axis=1)
df_traveltime.stop_id_x.value_counts()

19842    254
19843    128
19841    123
19908     80
22180     59
        ... 
19989      3
19990      3
19856      2
19857      2
19855      2
Name: stop_id_x, Length: 167, dtype: int64

In [38]:
#Calculate the mean by using groupby
df_traveltime = df_traveltime.groupby(['stop_id_x']).mean()
df_traveltime

Unnamed: 0_level_0,time
stop_id_x,Unnamed: 1_level_1
15351,44.333333
15353,40.333333
19841,2.000000
19842,0.000000
19843,2.000000
...,...
40221,40.923077
44817,35.000000
45793,52.857143
45794,67.000000


In [39]:
#Rename the column
df_traveltime.reset_index(inplace=True)
df_traveltime.rename(columns = {'stop_id_x':'closest_train_station_id','time':'travel_min_to_MC'},inplace = True)
df_traveltime

Unnamed: 0,closest_train_station_id,travel_min_to_MC
0,15351,44.333333
1,15353,40.333333
2,19841,2.000000
3,19842,0.000000
4,19843,2.000000
...,...,...
162,40221,40.923077
163,44817,35.000000
164,45793,52.857143
165,45794,67.000000


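Merge the travel times into the main dataframe. Stations with no direct trip to Melbourne Central have no match, so their travel time is set to -1.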

In [40]:
#Left merge keeps every property; stations without a direct trip get NaN
df_final = pd.merge(df_final, df_traveltime, how='left', on='closest_train_station_id')
#Fill missing travel times with -1, then truncate the mean to whole minutes
df_final['travel_min_to_MC'] = df_final['travel_min_to_MC'].fillna(-1).astype(int)
df_final

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,travel_min_to_MC
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0.929392,14
1,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0.897389,12
2,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0.733610,14
3,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0.339376,-1
4,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,1.025765,14
...,...,...,...,...,...,...,...,...
2411,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0.177860,-1
2412,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0.223980,-1
2413,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,3.479717,41
2414,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0.505987,-1


Create the direct_journey_flag column with a default value of -1, then set it to 1 wherever a direct journey exists (travel_min_to_MC of 0 or more).

In [41]:
df_final['direct_journey_flag'] = -1
#If travel_min_to_MC >= 0 a direct journey exists, so direct_journey_flag must be 1
df_final.loc[df_final.travel_min_to_MC >= 0, 'direct_journey_flag'] = 1
df_final

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,travel_min_to_MC,direct_journey_flag
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0.929392,14,1
1,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0.897389,12,1
2,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0.733610,14,1
3,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0.339376,-1,-1
4,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,1.025765,14,1
...,...,...,...,...,...,...,...,...,...
2411,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0.177860,-1,-1
2412,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0.223980,-1,-1
2413,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,3.479717,41,1
2414,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0.505987,-1,-1
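
The averaging, fill, and flag rules from the last few cells can be mirrored in plain Python on a handful of rows. This is only a sketch: the sample rows are hypothetical, and `travel_min_to_mc` and `direct_journey_flag` are illustrative helpers rather than notebook code. It also shows that converting to int truncates the mean instead of rounding it.

```python
from collections import defaultdict
from statistics import mean

# (stop_id, minutes) pairs; hypothetical sample rows, not the real GTFS data
rows = [(15351, 44.0), (15351, 45.0), (15351, 44.0), (19842, 0.0), (19843, 2.0)]

# Group the travel times by station and average them (the groupby/mean step)
by_stop = defaultdict(list)
for stop_id, minutes in rows:
    by_stop[stop_id].append(minutes)
mean_minutes = {stop_id: mean(values) for stop_id, values in by_stop.items()}


def travel_min_to_mc(stop_id):
    """Mirror the left merge: -1 when the station has no direct trip, else the truncated mean."""
    if stop_id not in mean_minutes:
        return -1
    return int(mean_minutes[stop_id])  # int() truncates, as astype(int) does


def direct_journey_flag(travel_min):
    """Return 1 when a direct journey exists (0 minutes or more), otherwise -1."""
    return 1 if travel_min >= 0 else -1


for stop_id in (15351, 19842, 19958):
    minutes = travel_min_to_mc(stop_id)
    print(stop_id, minutes, direct_journey_flag(minutes))
```

Station 19958 is absent from the sample rows, so it falls back to -1 for both values, as in the Athol Street row of the table above.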


## 5.Adding data into remaining columns
The remaining columns are House_report, Median_house_price, House_quarterly_growth, House_twelve_month_growth and House_average_annual_growth.

This part uses web scraping. The required libraries are requests and beautifulsoup.

In [42]:
url = 'https://www.yourinvestmentpropertymag.com.au/top-suburbs/vic/'
res = requests.get(url)
res.encoding = "utf-8"
soup = BeautifulSoup(res.text, "html.parser")
#Extract a list of links 
elt =  soup.find('ul',{'class':"suburbs"})
print(elt)

<ul class="suburbs">
<li><a href="/top-suburbs/vic-3067-abbotsford.aspx">ABBOTSFORD</a></li>
<li><a href="/top-suburbs/vic-3040-aberfeldie.aspx">ABERFELDIE</a></li>
<li><a href="/top-suburbs/vic-3825-aberfeldy.aspx">ABERFELDY</a></li>
<li><a href="/top-suburbs/vic-3714-acheron.aspx">ACHERON</a></li>
<li><a href="/top-suburbs/vic-3352-addington.aspx">ADDINGTON</a></li>
<li><a href="/top-suburbs/vic-3465-adelaide-lead.aspx">ADELAIDE LEAD</a></li>
<li><a href="/top-suburbs/vic-3962-agnes.aspx">AGNES</a></li>
<li><a href="/top-suburbs/vic-3231-aireys-inlet.aspx">AIREYS INLET</a></li>
<li><a href="/top-suburbs/vic-3851-airly.aspx">AIRLY</a></li>
<li><a href="/top-suburbs/vic-3042-airport-west.aspx">AIRPORT WEST</a></li>
<li><a href="/top-suburbs/vic-3021-albanvale.aspx">ALBANVALE</a></li>
<li><a href="/top-suburbs/vic-3206-albert-park.aspx">ALBERT PARK</a></li>
<li><a href="/top-suburbs/vic-3971-alberton.aspx">ALBERTON</a></li>
<li><a href="/top-suburbs/vic-3971-alberton-west.aspx">ALBERTON

The output shows that each URL extension sits between the first "/" and the closing double quote of the href attribute.
Similarly, each suburb name sits between '">' and "</a>".

In [43]:
#Extract url and suburb name to list
extension_url = re.findall(r'(?<=/)(.*)(?=")', str(elt))
suburb_name = re.findall(r'(?<=">)(.*)(?=</a>)', str(elt))
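
A possible alternative, shown only as a sketch, is to capture the link and the name with one pattern so that the two lists cannot drift out of step. The HTML below is a short sample copied from the output above, and `sample_html` and `pairs` are hypothetical names.

```python
import re

# Two list items copied from the printed <ul class="suburbs"> element
sample_html = '''<ul class="suburbs">
<li><a href="/top-suburbs/vic-3067-abbotsford.aspx">ABBOTSFORD</a></li>
<li><a href="/top-suburbs/vic-3465-adelaide-lead.aspx">ADELAIDE LEAD</a></li>
</ul>'''

# One pattern with two groups: the href without its leading "/" and the link text
pairs = re.findall(r'<a href="/([^"]+)">([^<]+)</a>', sample_html)

extension_url = [href for href, _ in pairs]
suburb_name = [name for _, name in pairs]
print(pairs)
```

Because each tuple comes from a single match, every URL extension stays paired with its own suburb name.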

Use a for loop to request each suburb page and extract the required data.

In [44]:
%%time
BASE_URL = 'https://www.yourinvestmentpropertymag.com.au/'
NOT_AVAILABLE = 'not available'

def children_text(tag):
    #Return the stripped string of every child of the tag
    return [child.string.strip() for child in tag]

#Create empty list
all_data = []

for extension, suburb in zip(extension_url, suburb_name):

    res = requests.get(BASE_URL + extension)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, "html.parser")

    #Using beautiful soup to find required data
    house_report = soup.find('div', {'id': "ContentPlaceHolder1_ContentPlaceHolder1_contentHouse"})

    #If we cannot find any value, it is returned as not available instead
    if house_report is None:
        joined = NOT_AVAILABLE
    else:
        #Remove \n at first and last index, then combine all elements into one string
        joined = " ".join(children_text(house_report)[1:-1])

    #placeholder structure should be
    #[suburb_name,house_report,house_Median,house_QuarterlyGrowth,house_1yr,house_MedianGrowthThisYr]
    placeholder = [suburb, joined]
    for css_class in ("align_r House Median",
                      "align_r House QuarterlyGrowth",
                      "align_r House 1yr",
                      "align_r House MedianGrowthThisYr"):
        cell = soup.find('td', {'class': css_class})
        #Keep the first child's text, or not available when the cell is missing
        placeholder.append(NOT_AVAILABLE if cell is None else children_text(cell)[0])

    all_data.append(placeholder)


CPU times: total: 1min 49s
Wall time: 13min 19s


This process should take around 10 to 15 minutes to finish.
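
Most of the scraped pages are never used, because the later left merge keeps only the suburbs present in df_final. One way to shorten the run, sketched here under the assumption that suburb names in df_final match the link text exactly, is to filter the links before requesting them. The names `links`, `needed_suburbs` and `to_scrape` are hypothetical; the sample values stand in for `zip(extension_url, suburb_name)` and `set(df_final.suburb)`.

```python
# Hypothetical stand-in for zip(extension_url, suburb_name)
links = [
    ("top-suburbs/vic-3067-abbotsford.aspx", "ABBOTSFORD"),
    ("top-suburbs/vic-3040-aberfeldie.aspx", "ABERFELDIE"),
    ("top-suburbs/vic-3825-aberfeldy.aspx", "ABERFELDY"),
    ("top-suburbs/vic-3714-acheron.aspx", "ACHERON"),
]

# Hypothetical stand-in for set(df_final.suburb)
needed_suburbs = {"ABBOTSFORD", "ACHERON"}

# Keep only the pages whose suburb actually appears in the property data
to_scrape = [(extension, suburb) for extension, suburb in links if suburb in needed_suburbs]
print(len(links), "links reduced to", len(to_scrape))
```

The scraping loop could then iterate over `to_scrape` instead of the full list.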

In [45]:
print(all_data)

[['ABBOTSFORD', 'Giving property investors a a solid capital  gain of 26.64%  for the last year, Abbotsford, 3067 is  the 835th highest performer in  Australia in this respect. A 44.16% growth in median value for property investors in Abbotsford,3067 puts this suburb at number 107th in terms of best performing suburbs in VIC LACK OF BUYER INTEREST may well be the reason that Abbotsford  is offering  property investors an average of -1.95. This rate of discount on properties puts  Suburb at number 376th in terms of most discounted suburbs in VIC Advertised rents are around the  $0 mark per week – giving a return of 0.00% based on the median price in Suburb', '$1,297,500', '14.17%', '26.64%', '7.51%'], ['ABERFELDIE', 'Aberfeldie  has had a pretty good year for property investment returns compared to the rest of VIC, giving investors a capital gain of 30.81% to date . Aberfeldie,3040 was ranked 476 in Australia by increase in median property value over the quarter. Vendor discounting in A

Create a dataframe from the all_data list.

In [46]:
df_suburb = pd.DataFrame(all_data, columns =['suburb','House_report', 'Median_house_price', 'House_quarterly_growth','House_twelve_month_growth','House_average_annual_growth'])
df_suburb

Unnamed: 0,suburb,House_report,Median_house_price,House_quarterly_growth,House_twelve_month_growth,House_average_annual_growth
0,ABBOTSFORD,Giving property investors a a solid capital g...,"$1,297,500",14.17%,26.64%,7.51%
1,ABERFELDIE,Aberfeldie has had a pretty good year for pro...,"$1,805,000",12.30%,30.81%,6.65%
2,ABERFELDY,not available,not available,not available,not available,not available
3,ACHERON,not available,not available,not available,not available,not available
4,ADDINGTON,not available,not available,not available,not available,not available
...,...,...,...,...,...,...
2094,YOUANMITE,not available,not available,not available,not available,not available
2095,YUNDOOL,not available,not available,not available,not available,not available
2096,YUROKE,not available,not available,not available,not available,not available
2097,YUULONG,not available,not available,not available,not available,not available


Merge the two dataframes on suburb so that the result matches the specification.

In [47]:
df_final = pd.merge(df_final, df_suburb, how='left', on='suburb')
df_final

Unnamed: 0,property_id,lat,lng,addr_street,suburb,closest_train_station_id,distance_to_closest_train_station,travel_min_to_MC,direct_journey_flag,House_report,Median_house_price,House_quarterly_growth,House_twelve_month_growth,House_average_annual_growth
0,83851,-37.854654,145.004546,30 Banole Avenue,PRAHRAN,19946,0.929392,14,1,"Over the last year, property investments in Pr...","$1,800,000",13.37%,39.29%,11.51%
1,84203,-37.852626,145.002968,6 Aberdeen Road,PRAHRAN,19947,0.897389,12,1,"Over the last year, property investments in Pr...","$1,800,000",13.37%,39.29%,11.51%
2,84349,-37.851704,145.005646,75 Pridham Street,PRAHRAN,19946,0.733610,14,1,"Over the last year, property investments in Pr...","$1,800,000",13.37%,39.29%,11.51%
3,83478,-37.847361,144.987132,13 Athol Street,PRAHRAN,19958,0.339376,-1,-1,"Over the last year, property investments in Pr...","$1,800,000",13.37%,39.29%,11.51%
4,84124,-37.856013,145.004308,50 Packington Street,PRAHRAN,19946,1.025765,14,1,"Over the last year, property investments in Pr...","$1,800,000",13.37%,39.29%,11.51%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2411,5218,-37.868387,144.839055,46 Station Street,SEAHOLME,19927,0.177860,-1,-1,Property investors who have had real estate in...,"$1,187,500",10.88%,18.76%,7.40%
2412,5034,-37.865875,144.840423,20 Waratah Drive,SEAHOLME,19927,0.223980,-1,-1,Property investors who have had real estate in...,"$1,187,500",10.88%,18.76%,7.40%
2413,61832,-37.771996,145.254431,34 Braden Brae Drive,WARRANWOOD,19878,3.479717,41,1,Warranwood is in the bottom 10% in VIC when co...,"$1,100,000",11.61%,-29.30%,1.66%
2414,69300,-37.842694,145.038120,77 Talbot Crescent,KOOYONG,19910,0.505987,-1,-1,Investment property in Kooyong has done poorly...,"$3,150,000",13.33%,0.00%,3.84%
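
The price and growth columns are stored as display strings such as "$1,100,000" and "-29.30%". If numeric values are needed for later analysis, a small parser along these lines could convert them. `parse_amount` is a hypothetical helper and not part of the notebook, and returning None for "not available" is an assumption about how missing values should be handled.

```python
def parse_amount(text):
    """Convert '$1,297,500' or '14.17%' to a float; return None when no number is present."""
    cleaned = text.strip().replace("$", "").replace(",", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        # Covers 'not available' and any other non-numeric text
        return None


# Values taken from the Warranwood row above, plus the missing-value marker
for raw in ("$1,100,000", "-29.30%", "not available"):
    print(raw, "->", parse_amount(raw))
```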


## 6.Saving the CSV file

In [48]:
df_final.to_csv('data_A2_solution.csv',float_format='%.2f', encoding='utf-8', index=False)

## Summary

This task taught me how to integrate data from different sources (PDF, XML, and websites) and how to read and understand shapefile and GTFS data. Lastly, it taught me how to transform data towards a linear relationship with the target variable.
- **Integrate the data** by using tabula and pandas.
- **Understand and edit XML** by using xmlschema.
- **Read shapefiles** by using geopandas.
- **Web scraping** by using requests and beautifulsoup.
- **Transform data** by using numpy and matplotlib.

## References
- derricw. (2019, November 12). *Fast Haversine Approximation (Python/Pandas)*. Retrieved from https://stackoverflow.com/questions/29545704/fast-haversine-approximation-python-pandas/29546836#29546836

- Adamsmith. (n.d.). *How to find the distance between two lat-long coordinates in Python*. Retrieved from https://www.adamsmith.haus/python/answers/how-to-find-the-distance-between-two-lat-long-coordinates-in-python

- Rahil Ahmed. (2020, May 13). *Finding Nearest pair of Latitude and Longitude match using Python*. Retrieved from https://medium.com/analytics-vidhya/finding-nearest-pair-of-latitude-and-longitude-match-using-python-ce50d62af546

- axschlepzig. (2018, September 13). *Validating with an XML schema in Python*. Retrieved from https://stackoverflow.com/questions/299588/validating-with-an-xml-schema-in-python