## **Breakdown by UK Postcode Area, GB Postcode District and Sectors.**
- This data pipeline creates 3 GeoDataFrames from digital vector boundaries files for 3 subsets of UK/GB Postcode Geographies: **Postcode Areas** (124 rows), **Postcode Districts** (2880 rows) and **Postcode Sectors** (10814 rows).
- Each GeoDataFrame was saved to Postgres as a distinct table. 
- They can then be downloaded from Postgres and reloaded into GeoPandas for analysis and visualisation.
- **Downloaded data format: Shapefiles** (downloaded as a compressed zip file).
- **Processed data purpose:** Use as a map plotting source to produce interactive maps for various projects.
- **Data provider: https://www.opendoorlogistics.com/**
- **Data pipeline result:** 
    
    3 Postgres tables: **gdf_postcode_areas**, **gdf_postcode_districts** and **gdf_postcode_sectors**.
----


In [1]:
# IMPORT LIBRARIES.
import pandas as pd
import requests
import io
import zipfile
import shutil
import sqlalchemy
from sqlalchemy import create_engine
from shapely import wkt
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd

In [2]:
# SET DISPLAY OPTIONS (None MEANS UNLIMITED).
# TO SET NUMBER OF ROWS DISPLAYED:
pd.options.display.max_rows=200
# TO SET NUMBER OF COLUMNS DISPLAYED:
pd.options.display.max_columns=None

## 1. DOWNLOAD AND PROCESS DIGITAL VECTOR BOUNDARIES.

### Open Door Logistics UK Postcode Area, GB Postcode District and Sectors. FORMAT: Zipped Shapefiles.

In [3]:
def process_zip_shp(url):
    """
    This function downloads digital vector boundaries files for UK Postcode Areas and GB Postcode Districts 
    and Sectors.
    The files are downloaded in a compressed format as a single zip file.
    The downloaded zip file is extracted into a temporary local directory and each of the 3 contained shapefiles 
    is loaded into GeoPandas as a GeoDataFrame, where the attributes are cleaned and processed into a standardised 
    format.
    The temporary local files and directory are then deleted, and a dictionary of the GeoDataFrames is returned.
    """
   
    # SET LOCAL PATH TO SAVE SHAPEFILES TO.
    local_path = "./datasets/" 

    # DOWNLOAD THE ZIPPED DATA: DON'T NEED TO BYPASS 403 ERROR HERE HENCE NO USER AGENT REQUIRED.
    response = requests.get(url)
    # FAIL EARLY ON AN HTTP ERROR RATHER THAN TRYING TO UNZIP AN ERROR PAGE.
    response.raise_for_status()
    zipped = zipfile.ZipFile(io.BytesIO(response.content))
    # UNZIP SHAPEFILES INTO LOCAL STORAGE.
    zipped.extractall(path=local_path)

    # CREATE EMPTY DICTIONARY TO HOLD RESULT GEODATAFRAMES.
    geodf_dict = {}
    # LOAD EACH DOWNLOADED SHAPEFILE INTO GEOPANDAS AS A GEODATAFRAME. STORE GEODATAFRAMES IN THE DICTIONARY.
    for file in zipped.namelist():
        if file.endswith(".shp"):
            # LOWERCASE AND RENAME KEY FOR EACH GEODATAFRAME STORED IN THE DICTIONARY.
            geodf_dict[("gdf_" + str(file))
                       .lower()
                       .replace("distribution/","postcode_")] =\
            gpd.read_file(local_path+file)
    
    # SET DTYPES AND INDEX COLUMN FOR ALL GEODATAFRAMES.
    for gdf in geodf_dict.values():
        gdf["name"] = gdf["name"].astype("string")
        gdf.set_index("name",inplace=True)
   
    
    # CLEAN UP: DELETE LOCAL FILE VERSIONS AND TEMPORARY DIRECTORY NOW THAT DATA HAS BEEN LOADED INTO GEOPANDAS.
    shutil.rmtree(local_path,ignore_errors=True)
    
    return geodf_dict
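
The dictionary keys used later in the notebook (for example `gdf_postcode_areas.shp`) come from the archive member names. The sketch below reproduces only that renaming step on a tiny in-memory zip, without any download or GeoPandas. The member names `Distribution/Areas.shp` and similar are an assumption inferred from the `"distribution/"` replacement in `process_zip_shp`, and `shapefile_keys` is a hypothetical helper, not part of the pipeline.

```python
import io
import zipfile


def shapefile_keys(zip_bytes):
    """Map each .shp member of a zip archive to the dictionary key the pipeline would derive."""
    keys = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zipped:
        for member in zipped.namelist():
            if member.endswith(".shp"):
                # SAME RENAMING AS IN process_zip_shp: PREFIX, LOWERCASE, REPLACE THE FOLDER NAME.
                keys[member] = ("gdf_" + member).lower().replace("distribution/", "postcode_")
    return keys


# BUILD A TINY IN-MEMORY ARCHIVE WITH ASSUMED MEMBER NAMES (EMPTY PLACEHOLDERS, NOT REAL SHAPEFILES).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for member in ["Distribution/Areas.shp", "Distribution/Areas.dbf", "Distribution/Sectors.shp"]:
        archive.writestr(member, b"")

print(shapefile_keys(buffer.getvalue()))
```

Note that the `.shp` extension stays in the key, which is why the lookups in the next cell include it.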

In [4]:
# DOWNLOAD ZIPPED SHAPEFILES FOR UK postcode areas AND GB postcode districts AND sectors.
url = "https://www.opendoorlogistics.com/wp-content/uploads/Data/UK-postcode-boundaries-Jan-2015.zip"

# RUN THE FUNCTION ONCE (A SINGLE DOWNLOAD) AND USE THE APPROPRIATE DICTIONARY KEY TO SELECT EACH GEODATAFRAME.
geodf_dict = process_zip_shp(url)
gdf_postcode_areas = geodf_dict["gdf_postcode_areas.shp"]
gdf_postcode_districts = geodf_dict["gdf_postcode_districts.shp"]
gdf_postcode_sectors = geodf_dict["gdf_postcode_sectors.shp"]

---
---

## 2. CONNECT TO POSTGRESQL.

In [5]:
def connect_to_postgres():
    """
    Connect to Postgres database 'github_projects' as user 'postgres'.
    """
    conn_params_dict = {"user":"postgres",
                        "password":"password",
                        # FOR host, USE THE POSTGRES INSTANCE CONTAINER NAME, AS THE CONTAINER IP CAN CHANGE.
                        "host":"postgres",
                        "database":"github_projects"}

    connect_alchemy = "postgresql+psycopg2://%s:%s@%s/%s" % (
        conn_params_dict['user'],
        conn_params_dict['password'],
        conn_params_dict['host'],
        conn_params_dict['database']
    )

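    # CREATE POSTGRES ENGINE (CONNECTION POOL).
    engine = create_engine(connect_alchemy)
    # create_engine IS LAZY, SO OPEN AND CLOSE ONE CONNECTION TO CONFIRM THE DATABASE IS REACHABLE.
    with engine.connect():
        pass
    print("Connection to Postgres successful.")
    return engine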
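
The connection string above is assembled with plain `%s` formatting, which works for the simple credentials used here. If a password contained characters such as `@` or `/`, the URL would become ambiguous, so one option is to percent-encode the credentials first. The following is a minimal sketch under that assumption; `build_postgres_url` and the example password are hypothetical and not part of the notebook.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, database):
    """Build a SQLAlchemy-style connection URL, percent-encoding the user name and password."""
    return "postgresql+psycopg2://%s:%s@%s/%s" % (
        quote_plus(user),
        quote_plus(password),
        host,
        database,
    )


# HYPOTHETICAL PASSWORD CONTAINING CHARACTERS THAT ARE SPECIAL INSIDE A URL.
example_url = build_postgres_url("postgres", "p@ss/word", "postgres", "github_projects")
print(example_url)
```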

In [6]:
# EXECUTE FUNCTION TO CONNECT TO POSTGRES.
engine = connect_to_postgres()

Connection to Postgres successful.


---
---

## 3. WRITE / DOWNLOAD GEODATAFRAMES TO / FROM POSTGRES.

In [7]:
def ul_gdf_to_pg(gdfs,names,pkey,dtype,index_bool):
    """
    Write 1 or more GeoDataFrames to Postgres for storage. 
    For each GeoDataFrame written, the contents of the geometry column are serialized to WKT and saved as text 
    in Postgres.
    
    The following arguments are required:
    'gdfs': A list of 1 or more GeoDataFrames to write to Postgres.
    'names': A list of equal length to 'gdfs' with the corresponding names of each Postgres table to be created.
    'pkey': Which attribute to set as the new table(s) primary key. String.
    'dtype': A dictionary of "attribute names:SQL Alchemy data types" for the new table(s). All tables being 
    written must share the same attribute names/dtypes.
    'index_bool': Whether or not to write the GeoPandas index to Postgres as a column. Possible values True/False.
    """  
    for gdf,table_name in zip(gdfs,names):  
        # COPY INTO A PLAIN DATAFRAME SO THAT THE CALLER'S GEODATAFRAME IS LEFT UNCHANGED.
        df = pd.DataFrame(gdf).copy()
        # SERIALISE THE CONTENTS OF THE geometry COLUMN INTO WKT (well-known text) STRINGS SO THAT IT ...
        # CAN BE REPRESENTED BY DTYPE object AND SAVED TO POSTGRES.
        df['geometry'] = df['geometry'].apply(lambda x: wkt.dumps(x))
        
        # WRITE THE DATAFRAME TO POSTGRES.
        df.to_sql(table_name, con = engine, if_exists='replace', index=index_bool,
                  # SET POSTGRES DTYPES.
                  dtype=dtype)
    
        # ADD PRIMARY KEY TO CREATED TABLE. engine.begin() COMMITS ON SUCCESS AND ALSO WORKS ON ...
        # SQLALCHEMY 2.x, WHERE engine.execute() HAS BEEN REMOVED.
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text(f"ALTER TABLE {table_name} ADD PRIMARY KEY ({pkey})"))
    
    if len(names)==1:
        print(f"The \033[1m{table_name}\033[0m Postgres table has been successfully created.\n")
    else:
        print(f"The \033[1m{', '.join(names[:-1])}\033[0m and \033[1m{(names)[-1]}\033[0m Postgres tables have been successfully created.\n")
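
Because `ul_gdf_to_pg` interpolates `table_name` and `pkey` directly into the `ALTER TABLE` text, it may be worth checking that both are plain identifiers before the statement is run. The sketch below shows one possible guard. The helper `primary_key_statement` and its identifier rule (letters, digits and underscores, not starting with a digit) are assumptions for illustration and are stricter than what Postgres itself allows.

```python
import re

# ASSUMED RULE: LETTERS, DIGITS AND UNDERSCORES ONLY, NOT STARTING WITH A DIGIT.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def primary_key_statement(table_name, pkey):
    """Return the ALTER TABLE statement, rejecting names that are not plain identifiers."""
    for identifier in (table_name, pkey):
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"Unsafe SQL identifier: {identifier!r}")
    return f"ALTER TABLE {table_name} ADD PRIMARY KEY ({pkey})"


print(primary_key_statement("gdf_postcode_areas", "name"))
```

The returned string could then be passed to the database in place of the inline f-string.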

### 3.1. WRITE GEO DATA TO POSTGRES:

In [8]:
# WRITE THE OPEN DOOR LOGISTICS SHAPEFILE BASED GEODATAFRAMES TO POSTGRES:
ul_gdf_to_pg(gdfs=[gdf_postcode_areas,gdf_postcode_districts,gdf_postcode_sectors],
             names=["gdf_postcode_areas","gdf_postcode_districts","gdf_postcode_sectors"],
             pkey="name",
             dtype={
                    "name":sqlalchemy.types.Text,
                    "geometry":sqlalchemy.types.Text
             },
             index_bool=True
            )


The gdf_postcode_areas, gdf_postcode_districts and gdf_postcode_sectors Postgres tables have been successfully created.



---
---

### 3.2. DOWNLOAD GEO DATA FROM POSTGRES:

In [9]:
def dl_gdf_from_pg(table_name,dtype,index_col=None):
    """
    Download a single Postgres table and automatically process it into a GeoPandas GeoDataFrame. 
    For each downloaded GeoDataFrame, the contents of the geometry column are deserialized from WKT and then 
    stored with dtype="geometry" in GeoPandas.
    
    The following arguments are required:
    'table_name': A string with the name of the GeoDataFrame as held on Postgres as a table.
    'dtype': A dictionary of "attribute names:GeoPandas data types" for the downloaded GeoDataFrame. If an 
    attribute is not present in this dictionary then a default dtype is set by GeoPandas automatically for 
    the attribute.
    'index_col': Which attribute to set as the index of the GeoDataFrame. Optional string, defaults to None.
    """ 
    # DOWNLOAD DATAFRAME VERSION OF DIGITAL VECTOR BOUNDARIES FROM POSTGRES.
    df_from_pg = pd.read_sql_table(table_name, con=engine)
    
    # DESERIALIZE THE WKT STRINGS REPRESENTATION OF THE GEOMETRY COLUMN.
    df_from_pg['geometry'] = df_from_pg["geometry"].apply(lambda x: wkt.loads(x))

    # CONVERT DATAFRAME INTO GEODATAFRAME AND SET GEOMETRY COLUMN.
    gdf_from_pg = gpd.GeoDataFrame(df_from_pg,geometry=df_from_pg["geometry"],crs=4326)
    
    # SET DTYPES.
    gdf_from_pg = gdf_from_pg.astype(dtype)
    
    print(f"The \033[1m{table_name}\033[0m table was successfully downloaded from Postgres and loaded into GeoPandas.\n") 
    
    # SET INDEX COLUMN (OPTIONAL).
    if index_col is None:
        print("No index attribute was set on the GeoDataFrame so the default numeric index has been used.")
    else:
        gdf_from_pg.set_index(index_col,inplace=True)
        print(f"The \033[1m{index_col}\033[0m attribute was set as the GeoDataFrame index.")
   
    
    return gdf_from_pg

In [10]:
# OPEN DOOR LOGISTICS BASED GEODATAFRAME DOWNLOAD FROM POSTGRES, WHERE OPTIONAL INDEX COLUMN IS NOT SET IN ...
# GEOPANDAS:
dl_gdf_from_pg(table_name="gdf_postcode_sectors",
               dtype={"name":"string"}
              ).head(10)

The gdf_postcode_sectors table was successfully downloaded from Postgres and loaded into GeoPandas.

No index attribute was set on the GeoDataFrame so the default numeric index has been used.


|   | name | geometry |
|---|------|----------|
| 0 | AB10 1 | POLYGON ((-2.11645 57.14656, -2.11655 57.14663... |
| 1 | AB10 6 | MULTIPOLYGON (((-2.12239 57.12887, -2.12279 57... |
| 2 | AB10 7 | POLYGON ((-2.12239 57.12887, -2.12119 57.12972... |
| 3 | AB11 5 | POLYGON ((-2.05528 57.14547, -2.05841 57.14103... |
| 4 | AB11 6 | POLYGON ((-2.09818 57.13769, -2.09803 57.13852... |
| 5 | AB11 7 | POLYGON ((-2.11045 57.13424, -2.11116 57.13484... |
| 6 | AB11 8 | MULTIPOLYGON (((-2.05257 57.13426, -2.05729 57... |
| 7 | AB11 9 | POLYGON ((-2.08748 57.13972, -2.08687 57.13966... |
| 8 | AB12 3 | MULTIPOLYGON (((-2.08100 57.08471, -2.09129 57... |
| 9 | AB12 4 | POLYGON ((-2.12807 57.03684, -2.12986 57.03764... |
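
The sector names in this output (such as `AB10 1`) embed the district and area they belong to, which can be handy when relating rows across the three tables. The sketch below splits a sector name into its three levels. It assumes that every sector name has the form "district, space, sector digit" as in the sample above, and that the area and district tables use names like `AB` and `AB10`; neither assumption is confirmed by the notebook, and `split_sector` is a hypothetical helper.

```python
import re


def split_sector(sector_name):
    """Return (area, district, sector) for a sector name such as 'AB10 1'."""
    # THE DISTRICT IS EVERYTHING BEFORE THE SPACE, e.g. 'AB10'.
    district, separator, _ = sector_name.partition(" ")
    # THE AREA IS THE LEADING RUN OF LETTERS IN THE DISTRICT, e.g. 'AB'.
    area_match = re.match(r"[A-Z]+", district)
    if not separator or area_match is None:
        raise ValueError(f"Unexpected sector name: {sector_name!r}")
    return area_match.group(0), district, sector_name


for name in ["AB10 1", "AB12 4"]:
    print(split_sector(name))
```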


---
---
## 4. CLOSE ALL CONNECTIONS TO POSTGRES DATABASE.

In [11]:
def disconnect_from_postgres():
    """
    Completely disconnect from Postgres.
    """
    engine.dispose() 
    print("All connections to Postgres have been terminated.")

In [12]:
disconnect_from_postgres()

All connections to Postgres have been terminated.
