# Python Fynesse Data Analysis Template

### 31st May 2021

### Neil D. Lawrence

### Updated 31st October 2021

This notebook serves as a stub for the fynesse data analysis pipeline.


In [None]:
%pip uninstall --yes fynesse
# Replace this with the location of your fynesse implementation.
%pip install git+https://github.com/lawrennd/fynesse_template.git


In [None]:
import fynesse

In [None]:
fynesse.access.config['data_url']

First, a quick analysis of the data as it stands: which values are missing, and where there might be anomalies.

# Initial assessment of paid price data
The first step was to look briefly through the files that hold the house price data, and to normalise the uploads by adding the 'OPTIONALLY ENCLOSED BY `""`' line.
This was necessary because the CSV files from 1995-1998 store every field as a string enclosed in "", and strings in the later files are also enclosed in double quotation marks.

The column that indicates a flat or apartment name can contain nulls, but this will be accounted for.

Some transactions will be missing from this data set, as only properties that are registered with HM Land Registry and sold for value are listed.

Some properties that have been in the same ownership since before 1990 and have not been mortgaged since will not be available.
Furthermore, only sales for value are listed: properties that are inherited, given, or exchanged do not appear as transactions.

This excludes two types of residential property: property owned by nobility or large landowners over many generations (for example royal homes, or old farms), and new builds that have not yet been sold.
The first type would be the most difficult to value from this data set. Old properties can often be worth a large amount of money, and that value cannot necessarily be calculated from location, as historical value, the size of the property on its land, and even the wealth of the owner can all be crucial factors. Further, they are not necessarily near any other similar properties.
New builds would also be difficult: if the size of the total development is not known (only a location and property type), and the quality of the work is not yet known, then only a rough estimate can be given. A better source would be the predicted price from the developer, together with a comparison against recently sold, similar new builds.

If we now look solely at the 2022 data, we can check for anomalies in the prices.
The Cumulative Distribution Function of terraced house prices from 2022 gave this graph:

£120 million is implausibly expensive for a terraced house, and inspecting the record with this price gives the location of the house.
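
For reference, a cumulative distribution of this kind can be computed directly from a list of prices. The following is a minimal sketch, not the code used for the graph: the function name `empirical_cdf` and the sample prices are made up for illustration, with one extreme value added to show how a single record stretches the top of the curve.

```python
def empirical_cdf(prices):
    """Return the sorted prices and the fraction of sales at or below each one."""
    ordered = sorted(prices)
    n = len(ordered)
    fractions = [(i + 1) / n for i in range(n)]
    return ordered, fractions


# Hypothetical terraced-house prices in pounds; the last one is an outlier.
sample_prices = [95_000, 150_000, 210_000, 340_000, 120_000_000]

ordered, fractions = empirical_cdf(sample_prices)
for price, fraction in zip(ordered, fractions):
    print(f"{price:>12,} GBP  {fraction:.2f}")
```

Plotting `ordered` against `fractions` gives the step curve. An outlier appears as a long flat stretch before the final step to 1.0.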

Inspecting this house online, together with a listing for another of the houses in its block, reveals that the real sale price was £1.2 million.
The error here is that the price in the dataset retrieved from HM Land Registry is written with two extra zeros (to represent pennies) but without a decimal point.
To clean the data, we should therefore check the most expensive and the cheapest properties.
The cheapest terraced house was listed at £12,000. (On inspection this also turned out to be wrong: that property was actually sold for £120,000.)
As the mean terraced house price was of the order of £100,000, the two-extra-zeros error would show up as houses with prices of more than £10,000,000.
Looking at these gave:

Two of these houses were listed wrongly.
(Incidentally, it is the two highest and the two lowest values in this data set that have been identified as wrong.)
It is thus pertinent to introduce a function that cleans our data before it is used for training, as an error of a factor of 100 would badly affect our model.
An anomaly-removing function that, for a given postcode or place, removes points that are more than 50 times the average or less than 0.1 times the average would therefore be appropriate. As large prices are more likely to be correct than very low prices, the function will be more lenient on higher prices, so it looks between the 10th percentile and the 95th percentile of prices for the area.
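
The multiplicative thresholds described above can be sketched as follows. This is an illustration under stated assumptions rather than the final cleaning function: the name `remove_price_anomalies`, the `(area, price)` input format, and the sample sales are all hypothetical. It also takes the median as the "average", which is a substitution on my part, because a single price that is 100 times too large would inflate the mean of a small group enough to hide itself. The percentile-based variant is not implemented here.

```python
from statistics import median


def remove_price_anomalies(sales, upper_factor=50, lower_factor=0.1):
    """Drop sales whose price is far from the typical price in their area.

    `sales` is a list of (area, price) pairs. A sale is kept when its price
    lies between lower_factor and upper_factor times the area's median price.
    """
    prices_by_area = {}
    for area, price in sales:
        prices_by_area.setdefault(area, []).append(price)

    # Assumption: the median stands in for the "average" of each area.
    typical = {area: median(prices) for area, prices in prices_by_area.items()}

    return [
        (area, price)
        for area, price in sales
        if lower_factor * typical[area] <= price <= upper_factor * typical[area]
    ]


# Hypothetical sales: one price with two extra zeros, one implausibly low.
sales = [
    ("SW1", 900_000), ("SW1", 1_100_000), ("SW1", 1_200_000), ("SW1", 120_000_000),
    ("LS1", 110_000), ("LS1", 120_000), ("LS1", 130_000), ("LS1", 1_200),
]

cleaned = remove_price_anomalies(sales)
print(cleaned)
```

The upper factor is much looser than the lower one, which matches the intention of being more lenient on high prices.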

File that holds postcode data
-frequent changes!

We now consider aspects of the paid price data and what we can learn from them.
We know that location and year sold will be aspects of this project.

First we consider the different types of property, which can be selected from the database as follows:

In [None]:
select_unique_vals(table, column_name)  # this function allows selection of the unique values in a column
# Bounded?
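
One way such a helper might look is sketched below. It is only a guess at the behaviour: it uses an in-memory `sqlite3` database so that it runs on its own, whereas the project database may be a different engine, and it takes an extra `conn` argument that the call above does not have. The table name `pp_data`, its columns, and the single-letter type codes are hypothetical.

```python
import sqlite3


def select_unique_vals(conn, table, column_name):
    """Return the distinct values stored in one column of a table."""
    # Identifiers cannot be bound as query parameters, so only pass
    # trusted table and column names to this function.
    cursor = conn.execute(f'SELECT DISTINCT "{column_name}" FROM "{table}"')
    return sorted(row[0] for row in cursor.fetchall())


# Hypothetical miniature version of the price paid table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pp_data (price INTEGER, property_type TEXT)")
conn.executemany(
    "INSERT INTO pp_data VALUES (?, ?)",
    [(250_000, "T"), (310_000, "S"), (180_000, "F"), (420_000, "D"), (205_000, "T")],
)

unique_types = select_unique_vals(conn, "pp_data", "property_type")
print(unique_types)
```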

We see that the possible property types are:
Flats (includes maisonettes), Semi-Detached, Detached, Terraced

If we plot these against prices throughout the UK for 2022, we can see that there is indeed a significant split between the different types of property:

In [None]:
plot_prices_propertytypes("all")

We note that average prices for terraced houses and flats across the UK appear very similar, which is initially surprising, but makes sense when we consider the number of terraced houses () compared to the number of flats sold () and further the distribution of the prices for each.

Further consider the variance in prices for each type of property:

Some flats sell for extremely high sums, so it is pertinent to consider prices according to location.


In [None]:
# TODO: average price of each property type per location (counties?)
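
A minimal sketch of this calculation follows, assuming the sales are available as `(property_type, location, price)` tuples. The function name `average_price_by_type_and_location`, the input format, and the sample figures are hypothetical and are not taken from the data set.

```python
from collections import defaultdict
from statistics import mean


def average_price_by_type_and_location(sales):
    """Return the mean price for each (property_type, location) pair."""
    grouped = defaultdict(list)
    for property_type, location, price in sales:
        grouped[(property_type, location)].append(price)
    return {key: mean(prices) for key, prices in grouped.items()}


# Hypothetical sales, for illustration only.
sales = [
    ("Flat", "Greater London", 450_000),
    ("Flat", "Greater London", 550_000),
    ("Flat", "Cumbria", 120_000),
    ("Terraced", "Cumbria", 140_000),
    ("Terraced", "Cumbria", 160_000),
]

averages = average_price_by_type_and_location(sales)
for (property_type, location), average in sorted(averages.items()):
    print(f"{property_type:<10} {location:<16} {average:>10,.0f}")
```

Grouping on the pair rather than on property type alone keeps a few very expensive areas from dominating the national average for a type.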

Prices of flats in London are significantly higher than
It would therefore be sensible to choose a design in which the price prediction model is selected by property type, and each model is then a function of the location and features of the property.

# Limitations of OSMNX data

On an initial look through the OSM data there are some immediate problems, mainly that, as OSM is crowd-sourced, many data points are missing.

One example is this plot of all buildings in a 1km bounding box in a residential area in South London.
The plot shows many streets as empty of buildings, but the corresponding Google Maps satellite image shows that there are in fact many houses.

This indicates that using the data to match sold houses to buildings in OSM will not be completely reliable.