In [None]:
# Install your library here, for example the fynesse template
# is set up to be pip installable
%pip install git+https://github.com/jeffrey-22/ads.git
import os, fynesse
# 2 min

In [4]:
# Import local fynesse module. Do NOT run this cell if the notebook is not run from the repo - this is a quick hack for local runs
%load_ext autoreload
%autoreload 2
import os, sys, IPython
from pathlib import Path
try:
    notebook_path = Path(IPython.get_ipython().run_line_magic('pwd', '')).as_posix()
except AttributeError:
    notebook_path = Path(__file__).resolve().as_posix()
script_path = os.path.abspath(notebook_path)
project_path = os.path.abspath(os.path.join(script_path, '..'))
sys.path.append(project_path)
import fynesse

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [None]:
# Create connections. The connections will be reused throughout the modules
pp_database_details = fynesse.access.retreive_database_details()
# fynesse.access.create_database(pp_database_details)
pp_database_conn = fynesse.access.create_connection(pp_database_details)

The schema stores price-paid transaction data for traded properties.
- ```transaction_unique_identifier``` and ```db_id``` are indices for transactions and for our database respectively.
- ```price``` is the price paid for the property, in GBP.
- ```postcode```, ```primary_addressable_object_name```, ```secondary_addressable_object_name```, ```street```, ```locality```, ```town_city```, ```district```, ```county``` describe the address. We will ultimately only use the postcode to find the latitude and longitude.
- ```property_type``` describes one of: Detached, Semi-detached, Terraced, Flat/maisonette, Other, indicated by the initials.
- ```new_build_flag```, ```tenure_type```, ```ppd_category_type```, ```record_status``` appear to be categorical fields, but they are of little use here because we are not given them at prediction time.

We will do some sanity checks and general visualisation of the data in the address part.
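
As a rough illustration of the kind of sanity check meant here, the sketch below validates a single transaction row against the schema described above. It assumes a row is available as a plain dictionary keyed by column name; `find_row_problems` is a hypothetical helper for illustration and is not part of `fynesse`.

```python
# Initials used in the property_type column, as listed in the schema notes.
PROPERTY_TYPES = {
    "D": "Detached",
    "S": "Semi-detached",
    "T": "Terraced",
    "F": "Flat/maisonette",
    "O": "Other",
}


def find_row_problems(row):
    """Return a list of problems found in one transaction row.

    `row` is assumed to be a dict keyed by column name (an assumption
    made for this sketch, not the format fynesse necessarily returns).
    """
    problems = []
    price = row.get("price")
    if not isinstance(price, (int, float)) or price <= 0:
        problems.append("price must be a positive number")
    if row.get("property_type") not in PROPERTY_TYPES:
        problems.append("unknown property_type")
    if not (row.get("postcode") or "").strip():
        problems.append("missing postcode")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to tally how often each kind of issue occurs across the table.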

In [None]:
# Write the code you need for creating the table, downloading and uploading the data here. You can use as many code blocks as you need.
# Warning: run these once when populating the database. No need to run them again just for prediction!
# 20 min
os.makedirs("tmp_data", exist_ok=True)
downloaded_pathnames = fynesse.access.download_price_data()
print(f"Downloaded files: {downloaded_pathnames}")
fynesse.access.create_pp_table(pp_database_conn)
# 15m
fynesse.access.upload_files_to_table(pp_database_conn, downloaded_pathnames, 'pp_data')
# 11m
fynesse.access.setup_pp_table(pp_database_conn)

In [None]:
# Warning: run these once when populating the database. No need to run them again just for prediction!
fynesse.access.create_postcode_table(pp_database_conn)
# 10s
postcode_filename = fynesse.access.download_postcode_data()
# 1m
fynesse.access.upload_files_to_table(pp_database_conn, [postcode_filename], 'postcode_data')
# 1m
fynesse.access.setup_postcode_table(pp_database_conn)

In [None]:
# Warning: run these once when populating the database. No need to run them again just for prediction!
fynesse.access.create_prices_coordinates_table(pp_database_conn)
# 20m
joined_table_pathnames = fynesse.access.join_all_tables(downloaded_pathnames, postcode_filename, overwrite=False)
# 10m
fynesse.access.upload_files_to_table(pp_database_conn, joined_table_pathnames, 'prices_coordinates_data', ignore_first_row=True)
# 9m
fynesse.access.setup_prices_coordinates_table(pp_database_conn)

## Question 2. Accessing OpenStreetMap and Assessing the Available Features

In question 3 you will be given the task of constructing a prediction system for property price levels at a given location. We expect that knowledge of the local region around the property should be helpful in making those price predictions. To evaluate this we will now look at [OpenStreetMap](https://www.openstreetmap.org) as a data source.

The tasks below will guide you in accessing and assessing the OpenStreetMap data. The code you write will eventually be assimilated into your Python module, but documentation of what you've included and why should remain in the notebook below.

Accessing OpenStreetMap through its API can be done using the Python library `osmnx`. Using what you have learned about the `osmnx` interface in the lectures, write general code for downloading points of interest and other relevant information that you believe may be useful for predicting house prices. Remembering the perspectives we've taken on *data science as debugging*, the remarks we've made when discussing *the data crisis* about the importance of reusability in data analysis, and the techniques we've explored in the lab sessions for visualising features and exploring their correlation, use the notebook to document your assessment of the OpenStreetMap data as a potential source of data.

The knowledge you need to do a first pass through this question will have been taught by end of lab session three (16th November 2021). You will likely want to review your answer as part of *refactoring* your code and analysis pipeline shortly before hand in.

You should write reusable code that allows you to explore the characteristics of different points of interest. Looking ahead to question 3 you'll want to incorporate these points of interest in your prediction code.

*5 marks*
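
One possible shape for the reusable part of this code is sketched below: a helper that turns a location into a bounding box, and a helper that counts points of interest per category inside that box. The download itself (via `osmnx`) is deliberately left out. The function names, the `(category, lat, lon)` tuple format and the 111 km per degree approximation are all assumptions made for this sketch.

```python
import math

# Rough length of one degree of latitude in km (an approximation).
KM_PER_DEGREE_LAT = 111.0


def bounding_box(latitude, longitude, box_km=2.0):
    """Return (north, south, east, west) for a square box of side box_km."""
    half_lat = (box_km / 2) / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    half_lon = (box_km / 2) / (KM_PER_DEGREE_LAT * math.cos(math.radians(latitude)))
    return (
        latitude + half_lat,
        latitude - half_lat,
        longitude + half_lon,
        longitude - half_lon,
    )


def count_pois_in_box(pois, box):
    """Count POIs per category inside the box.

    `pois` is an iterable of (category, lat, lon) tuples, a hypothetical
    format chosen for this sketch.
    """
    north, south, east, west = box
    counts = {}
    for category, lat, lon in pois:
        if south <= lat <= north and west <= lon <= east:
            counts[category] = counts.get(category, 0) + 1
    return counts
```

The per-category counts can then serve as candidate features, with the box size treated as a parameter to explore rather than a fixed choice.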


In [None]:
# Use this cell and cells below for summarising your analysis and documenting your decision making.

In [4]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.multivariate import pca

# Create a sample dataset (replace this with your own data)
np.random.seed(42)
data = np.random.rand(100, 5)  # 100 datapoints with 5 features each

# Standardize the data
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)

# Perform PCA with Statsmodels
pca_model = pca.PCA(data_standardized, method='eig')

# Explained variance ratio
explained_variance_ratio = pca_model.eigenvals / np.sum(pca_model.eigenvals)
print("Explained Variance Ratio:", explained_variance_ratio)

# The transformed data with reduced dimensions
principal_components = pca_model.factors
print("Principal Components:")
print(len(principal_components))
print(principal_components)

Explained Variance Ratio: [0.29005782 0.23214006 0.18701323 0.15426411 0.13652477]
Principal Components:
100
[[ 0.12338063  0.00976194  0.12045144 -0.05408748 -0.04422174]
 [ 0.00754624  0.13603111 -0.18950419 -0.04625231 -0.09271119]
 [ 0.23269906  0.02761399  0.03065082  0.0429948  -0.01436272]
 [ 0.04564239  0.00209513 -0.08425635  0.06005296 -0.11312667]
 [-0.07031759 -0.07054058 -0.10690928  0.04548833 -0.00887608]
 [-0.04237996 -0.12681684 -0.03560634 -0.09784192 -0.11723192]
 [-0.22292024  0.09538146 -0.01586747  0.05746745 -0.00621258]
 [-0.14287864 -0.07229839  0.02244075  0.02277046 -0.01983526]
 [-0.05375733  0.03556943  0.09600858  0.15981331 -0.20419865]
 [-0.01937748 -0.08094334 -0.02962887 -0.06028077 -0.07740401]
 [-0.02310773  0.05618061  0.0999706  -0.26985888  0.07705218]
 [-0.07610371 -0.20295466 -0.13922898  0.04121898  0.07569644]
 [ 0.07110406 -0.01762223 -0.12461358 -0.06498576 -0.06829724]
 [ 0.00106953  0.05229044 -0.24859607 -0.03437616  0.1713219 ]
 [-0.2072

## Model decisions

There are a total of X features, they are:
- latitude

Reasonings for the features:
- aaa

The model is a GLM, with a link of $f(x) = e^x$

Reasonings for the model:
- aaa

## Question 3. Addressing a Property Price Prediction Question

For your final tick, we will be asking you to make house price predictions for a given location, date and property type in the UK. You will provide a function that takes as input a latitude and longitude, a date, and the `property_type` (one of `F` - flat, `S` - semi-detached, `D` - detached, `T` - terraced or `O` - other). Create this function in the `address.py` file, for example in the form,

```
def predict_price(latitude, longitude, date, property_type):
    """Price prediction for UK housing."""
    pass
```

We suggest that you use the following approach when building your prediction.

1. Select a bounding box around the housing location in latitude and longitude.
2. Select a date range around the prediction date.
3. Use the data ecosystem you have built above to build a training set from the relevant time period and location in the UK. Include appropriate features from OSM to improve the prediction.
4. Train a linear model on the data set you have created.
5. Validate the quality of the model.
6. Provide a prediction of the price from the model, warning appropriately if your validation indicates the quality of the model is poor.
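
Steps 2, 5 and 6 above can be sketched with the standard library alone. The helpers below are illustrative assumptions, not part of `fynesse`: `date_window` picks a symmetric date range, `r_squared` scores held-out predictions, and `warn_if_poor` issues a warning when the score is at or below a chosen threshold.

```python
import warnings
from datetime import date, timedelta


def date_window(prediction_date, days=365):
    """Return (start, end) spanning `days` either side of prediction_date."""
    delta = timedelta(days=days)
    return prediction_date - delta, prediction_date + delta


def r_squared(observed, predicted):
    """Coefficient of determination for paired observed/predicted values.

    Assumes `observed` is not constant (otherwise the denominator is zero).
    """
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def warn_if_poor(r2, threshold=0.0):
    """Warn and return True when r2 is at or below the threshold."""
    if r2 <= threshold:
        warnings.warn(
            f"Model quality is poor (R^2 = {r2:.3f}); "
            "treat the prediction with caution."
        )
        return True
    return False
```

Using `warnings.warn` rather than raising keeps the prediction available to the caller while still flagging that validation was weak; the threshold of 0.0 is a placeholder to be tuned.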

Please note that the quality of predictions is not the main focus of the assignment - we expect to see models that output reasonable predictions and achieve a positive $R^2$, but you should not spend too much time on increasing the model's accuracy.

The knowledge you need to do a first pass through this question will have been taught by end of lab session four (7th November 2023). You will likely want to review your answer as part of *refactoring* your code shortly before hand in.



In [3]:
from datetime import date
fynesse.address.predict_price(52.206767, 0.119229, date(2023, 1, 1), 'S', pp_database_conn)

==== Validation of current model, level 2 ====
                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                  609
Model:                            GLM   Df Residuals:                      594
Model Family:                Gaussian   Df Model:                           14
Link Function:               Identity   Scale:                      8.6974e+12
Method:                          IRLS   Log-Likelihood:                -9928.8
Date:                Sat, 25 Nov 2023   Deviance:                   5.1663e+15
Time:                        00:42:33   Pearson chi2:                 5.17e+15
No. Iterations:                     3   Pseudo R-squ. (CS):             0.1423
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const

1679782.3748247623

## Large Language Models

If you used LLMs to generate or fix code in this assignment (recommended), briefly summarise the process and prompts you used. What do you think of the integration of LLMs in the data science pipeline?

```GIVE YOUR WRITTEN ANSWER HERE```

- Some other questions are answered in [this reddit forum](https://www.reddit.com/r/CST_ADS/) or [this doc](https://docs.google.com/document/d/1GfDROyUW8HVs2eyxmJzKrYGRdVyUiVXzPcDfwOO8wX0/edit?usp=sharing). Feel free to also ask about anything that comes up.