# Python Fynesse Data Analysis Template

### 31st May 2021

### Neil D. Lawrence

### Updated 31st October 2021

This notebook serves as a stub for the fynesse data analysis pipeline.


In [3]:
%pip uninstall --yes fynesse
# Replace this with the location of your fynesse implementation.
# %pip install git+https://github.com/lawrennd/fynesse_template.git


[0mNote: you may need to restart the kernel to use updated packages.


In [20]:
import sys
sys.path.insert(0, "/Users/simon/Documents/II/fynesse/")
%load_ext autoreload
%autoreload 2
import fynesse

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
fynesse.access.config['data_url']

'https://raw.githubusercontent.com/lawrennd/datasets_mirror/main/'

In [4]:
import yaml
# Setup code needed for setting up database access
database_details = {"url": "database-ads-sl2057.cgrre17yxw11.eu-west-2.rds.amazonaws.com",
                    "port": 3306}

with open("credentials.yaml") as file:
  credentials = yaml.safe_load(file)
username = credentials["username"]
password = credentials["password"]
url = database_details["url"]

In [35]:
db = fynesse.access.Database(username=username, password=password, url=url)
db.use_database("property_prices")

Successfully connected to server.


In [43]:
# db.kill_process(6)
db.get_processlist()

Unnamed: 0,Id,User,Host,db,Command,Time,State,Info,Progress
0,4,rdsadmin,localhost,mysql,Sleep,5,,,0.0
1,9,admin,131.111.5.246:61421,property_prices,Query,0,starting,SHOW FULL PROCESSLIST,0.0
2,1389,admin,131.111.5.246:62771,property_prices,Sleep,16,,,0.0


# Data Exploration

Understanding what is in the data. Is it what it's purported to be, how are missing values encoded, what are the outliers, what does each variable represent and how is it encoded.

Data that is accessible can be imported (via APIs or database calls or reading a CSV) into the machine and work can be done understanding the nature of the data. The important thing to say about the assess aspect is that it only includes things you can do *without* the question in mind. This runs counter to many ideas about how we do data analytics. The history of statistics was that we think of the question *before* we collect data. But that was because data was expensive, and it needed to be excplicitly collected. The same mantra is true today of *surveillance data*. But the new challenge is around *happenstance data*, data that is cheaply available but may be of poor quality. The nature of the data needs to be understood before its integrated into analysis. Unfortunately, because the work is conflated with other aspects, decisions are sometimes made during assessment (for example approaches to imputing missing values) which may be useful in one context, but are useless in others. So the aim in *assess* is to only do work that is repeatable, and make that work available to others who may also want to use the data.

In [7]:
import pandas as pd
pd.set_option('display.max_rows', 500)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Access

In [10]:
db.show_indexes("prices_coordinates_data")

Unnamed: 0,Table,Non_unique,Key_name,Seq_in_index,Column_name,Collation,Cardinality,Sub_part,Packed,Null,Index_type,Comment,Index_comment,Ignored
0,prices_coordinates_data,0,PRIMARY,1,db_id,A,790051,,,,BTREE,,,NO


In [55]:
db.execute_to_df("SELECT COUNT(db_id) FROM prices_coordinates_data")

Unnamed: 0,COUNT(db_id)
0,0


In [45]:
db.execute_to_df("SELECT * FROM pp_data LIMIT 1")

InterfaceError: (0, '')

## OpenStreetMap

In [383]:
import osmnx as ox

In [394]:
place_name = "Cambridge"
latitude = 52.205276
longitude = 0.119167

box_width = 0.02 # About 2.2 km
box_height = 0.02
north = latitude + box_height/2
south = latitude - box_height/2
west = longitude - box_width/2
east = longitude + box_width/2

# Retrieve POIs
tags = {"amenity": True,
        "buildings": True,
        "historic": True,
        "leisure": True,
        "shop": True,
        "tourism": True}

In [392]:
pois = ox.features_from_bbox(north, south, east, west, tags)

In [395]:
print("There are {number} points of interest surrounding {placename} latitude: {latitude}, longitude: {longitude}".format(number=len(pois), placename=place_name, latitude=latitude, longitude=longitude))

There are 2112 points of interest surrounding Cambridge latitude: 52.205276, longitude: 0.119167


In [396]:
pois.info

<bound method DataFrame.info of                           highway  \
element_type osmid                  
node         20823646         NaN   
             20824011         NaN   
             20827182         NaN   
             20921875         NaN   
             20921876         NaN   
...                           ...   
relation     2296602          NaN   
             7952616          NaN   
             8117433   pedestrian   
             9449564          NaN   
             9449788          NaN   

                                                                geometry  \
element_type osmid                                                         
node         20823646                           POINT (0.12613 52.20262)   
             20824011                           POINT (0.12837 52.21210)   
             20827182                           POINT (0.12808 52.20007)   
             20921875                           POINT (0.11811 52.20892)   
             20921876         

In [414]:
fetch_place_data("London", db)

KeyboardInterrupt: 