## Identifying false duplicates
Because the dataset contains duplicates, errors can be introduced through incorrect manual changes or through mistakes in the specification or pipeline configuration, so that data is compiled into the same entity when it should not be. This notebook downloads a dataset and examines the facts recorded against an entity to check whether it needs correcting.

In [2]:
from download_data import download_dataset
from data import get_organisation_summary
from plot import plot_map
import spatialite
import pandas as pd
import geopandas as gpd
import os
import itertools
import shapely.wkt

pd.set_option("display.max_rows", None)

Download the SQLite file to run the tests against. The download is set not to overwrite the data if it already exists.

In [3]:
# define required variables
dataset = 'conservation-area'
collection = 'conservation-area-collection'
data_dir = os.path.join('../data/entity_resolution',dataset)
dataset_path = os.path.join(data_dir,f'{dataset}.sqlite3')

In [4]:
download_dataset(dataset,collection,data_dir,overwrite=False)
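
Once the file is available, one simple way to examine the facts against an entity is to list the fields that carry more than one distinct value. The sketch below is illustrative only: it assumes the `fact` table has at least the `fact`, `entity`, `field` and `value` columns used in the query later in this notebook, it runs against a toy in-memory table rather than the downloaded file, and `conflicting_fields` is a hypothetical helper name. Several values for one field can be legitimate (for example a name that changed over time), so the output is a prompt for manual review rather than proof of a false duplicate.

```python
import sqlite3
from collections import defaultdict


def conflicting_fields(con, entity):
    """Return {field: sorted distinct values} for fields of `entity`
    that have more than one distinct value in the fact table."""
    rows = con.execute(
        "SELECT field, value FROM fact WHERE entity = ?", (entity,)
    ).fetchall()
    values = defaultdict(set)
    for field, value in rows:
        values[field].add(value)
    return {f: sorted(v) for f, v in values.items() if len(v) > 1}


# Toy data (made up for illustration, not taken from the real dataset)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (fact TEXT, entity INTEGER, field TEXT, value TEXT)")
con.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("f1", 1, "name", "Old Town"),
        ("f2", 1, "name", "Riverside"),
        ("f3", 1, "reference", "CA01"),
        ("f4", 2, "name", "High Street"),
    ],
)

print(conflicting_fields(con, 1))
```

Against the real file the same function could be passed a connection opened on `dataset_path`, assuming the schema matches.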

#### Testing WIP
I have started to create a query that compares every entity's geometry against every other entity's. The last time I tried to run it, it had not finished after running overnight, so a new approach is probably needed. It may even be more efficient to use pandas and work through each entity individually. Worth playing around with.
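
One possible direction, sketched under assumptions rather than tested on the real data: visit each unordered pair of entities once (instead of both `(a, b)` and `(b, a)`) and apply the same metric as the query, the intersection area as a percentage of the smaller geometry's area. To keep the sketch runnable without SpatiaLite, axis-aligned rectangles `(minx, miny, maxx, maxy)` stand in for the real geometries; with real data `overlap_pct` would need to be replaced by a proper geometry intersection. The function names and the sample rectangles are hypothetical.

```python
from itertools import combinations


def area(box):
    """Area of an axis-aligned rectangle (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = box
    return max(0.0, maxx - minx) * max(0.0, maxy - miny)


def overlap_pct(a, b):
    """Intersection area as a percentage of the smaller rectangle's area."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0  # rectangles do not overlap
    smaller = min(area(a), area(b))
    if smaller == 0:
        return 0.0  # degenerate rectangle, avoid dividing by zero
    return 100.0 * (width * height) / smaller


def overlapping_pairs(boxes, threshold=95.0):
    """Return (entity_a, entity_b, pct) for each unordered pair above threshold."""
    pairs = []
    # combinations() yields each pair once, halving the comparisons
    for (ent_a, a), (ent_b, b) in combinations(sorted(boxes.items()), 2):
        pct = overlap_pct(a, b)
        if pct > threshold:
            pairs.append((ent_a, ent_b, pct))
    return pairs


# Made-up rectangles keyed by entity number
boxes = {
    101: (0.0, 0.0, 10.0, 10.0),
    102: (0.25, 0.0, 10.25, 10.0),   # almost the same area as 101
    103: (50.0, 50.0, 60.0, 60.0),   # far away from the others
    104: (5.0, 5.0, 15.0, 15.0),     # partial overlap only
}

print(overlapping_pairs(boxes))
```

In the SQL itself, the equivalent of visiting each pair once would be joining on `a.entity < b.entity` rather than `a.entity <> b.entity`, although that alone may not be enough to make the query finish in reasonable time.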

In [5]:
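def get_false_duplicates(dataset_path):
    """Return pairs of geometry facts from different entities whose
    intersection covers more than 95% of the smaller geometry."""
    # plain string: the query has no placeholders, so no f-string is needed
    sql = """
            SELECT a.fact AS primary_fact,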
                a.entity AS primary_entity,
                a.value AS primary_value,
                b.fact AS secondary_fact,
                b.entity AS secondary_entity,
                b.value AS secondary_value,
                100 *(ST_Area(ST_Intersection(GeomFromText(a.value), GeomFromText(b.value)))/ MIN(ST_Area(GeomFromText(a.value)), ST_Area(GeomFromText(b.value)))) AS pct_overlap
            FROM
                (SELECT fact,
                        entity,
                        field,
                        value
                FROM fact
                WHERE field = 'geometry' 
                AND ST_IsValid(GeomFromText(value))) a
            JOIN
                (SELECT fact,
                        entity,
                        field,
                        value
                FROM fact
                WHERE field = 'geometry' 
                AND ST_IsValid(GeomFromText(value))) b 
            ON a.entity <> b.entity
            AND ST_Intersects(GeomFromText(a.value), GeomFromText(b.value))
            WHERE 100 *(ST_Area(ST_Intersection(GeomFromText(a.value), GeomFromText(b.value)))/ MIN(ST_Area(GeomFromText(a.value)), ST_Area(GeomFromText(b.value)))) > 95;
        """
    with spatialite.connect(dataset_path) as con:
        cursor = con.execute(sql)
        cols = [column[0] for column in cursor.description]
        results = pd.DataFrame.from_records(data=cursor.fetchall(), columns=cols)
    
    return results

In [None]:
results = get_false_duplicates(dataset_path)
results