![Boshi](Boshi.png)

[One_Drive](https://universityofstandrews907-my.sharepoint.com/:f:/r/personal/ofh1_st-andrews_ac_uk/Documents/Finale?csf=1&web=1&e=UhPnhh)

__Context and Literature__

The Tube, the zenith of British transit on account of its frequency, acceleration and sound punctuality, has been a model for public transportation systems worldwide, with the contemporary metro system drawing heavily on 19th-century practices first conceptualised in London. The denizens of the city are blessed with an ease of transportation which extends beyond that of similarly sized conurbations the world over. The benefits of such a comprehensive system are immeasurable: the continued upkeep and prospective expansion of the network facilitate a range of socio-economic and environmental benefits which enable London to sustain itself as among the most prosperous and liveable cities worldwide. Investment in network modernisation has improved accessibility and reliability such that, in tandem with adjacent investments in active transit and congestion charging schemes, air pollution has fallen dramatically over the prior decade. Between 2016 and 2019 London saw a 94% decline in the number of people residing in areas with illegal air pollution, with investment in Tube modernisation and the expansion of the Santander Cycle scheme into parts of East and South London encouraging greater use of public and active transit.

Any analysis of the London Underground network reveals a geographic oddity: much of the system is spatially biased toward London’s north-western axis, leaving the city beleaguered by an explicit inequity in transport provision. This has long been the case. When the Tube was constructed, early Tunnel Boring Machines (TBMs) fared considerably better under the geological conditions present in the north, where the ‘London Clay’ is favourable for construction. Additionally, the distribution of wealth in the city has always favoured a western axis, with the City of Westminster an uppermost plinth of capital concentration from where the aristocracy and middle classes stipulated urban policy and reaffirmed bourgeois interests. Residents leveraged their political and economic means such that the tramways and conventional railways built across much of South and East London were halted at the affluent West End, citing damage to existing properties and the unsightliness of infrastructure such as catenary poles. Instead, those same political and economic means were used to undertake costly tunnelling exercises, providing an efficient means of transportation while preserving the existing urban fabric.

The post-war era saw the scrapping of London’s extensive tram network, which had formed a vital means of transportation across much of the conurbation’s traditionally working-class areas. This gave way to buses as the principal transport method, with the extensive network of conventional railways structured more to the benefit of a newly established post-war suburban population than to that of an increasingly isolated class of inner-city dweller. It holds true, then, that any study of London’s public transport system must acknowledge the systematic socio-economic inequalities which influenced both the distribution and type of transit modes across the urban area. Acknowledging that transit availability was historically a direct consequence of the intersection between political power and economic means provides a fascinating contextual basis for my study. Principally, I shall be looking to address a pair of research questions.

__In what fashion are historically embedded spatial inequalities in transport provision replicated across a contemporary 21st century London?__

__Does adjacency to active and public transport links influence deprivation outcomes, and how do the implications differ with demographic factors?__

Such questions are important in that the Mayor and the Greater London Authority have repeatedly identified London’s extensive transportation network as pivotal to meeting the city’s aspirational sustainability targets. A key facet of the London Plan 2021 is the instigation of a modal shift from cars to alternative means of transport, such that by 2041 80% of trips are to be made by foot, cycle or public transport. If such a plan is to succeed, however, it stands to reason that investment ought to be targeted toward equitability. With London explicitly set to transition from a city of mixed mobility to one overwhelmingly orientated around public and active transport, it is paramount that investment is aimed at ensuring all residents can take advantage of the opportunities a burgeoning green mobility network brings.

Understanding how historical transit innovations like the London Underground were optimised toward retaining the political and economic means of the middle classes, and identifying persistent trends in both the distribution and quality of transport links, is key to alleviating the harsh socio-economic divides present across Greater London. Indeed, London holds the highest GDP per head among UK regions yet has observed an unemployment rate above the national average for over three decades. In his ‘Skills for Londoners’ document, London Mayor Sadiq Khan underlines his commitment to ensuring all young Londoners can fulfil their potential irrespective of socio-economic background, with the 2016 ‘City for All Londoners’ manifesto outlining just how critical a stellar transport strategy is to the sculpting of a city which ceases to discriminate on the basis of socio-economic standing. Khan describes how transport will be used as a ‘leveller’ to enhance the prospects of vulnerable young people. Noted is the role enhanced connectivity could have in eroding the prominence of youth gang violence, which blights many disconnected, peripheral inner London neighbourhoods, particularly in areas where transport investment has historically been poor.

In this sense one can see an intersection between the accessibility of transport links and deprivation, a scenario which has been largely embedded in London’s socio-spatial patchwork. In addressing my two research questions I seek to provide clarity on the relationship between a series of demographic variables and transit accessibility. This should further understanding of the nuanced dynamics present across a range of highly varied neighbourhoods within Inner London, such that policymakers and adjacent stakeholders are better informed on how best to incorporate future active travel solutions.


![bike](bike.png)

__Methods__

The initial component of my methods consisted of obtaining data from the ‘Transport for London’ (TfL) combined API. I fetched the coordinate data for the locations of both Santander Cycle Hire stations and Cycle Parking facilities across London before extracting only the specific data I needed. Additionally, I sourced a shapefile providing the coordinate locations of all transit stops in the United Kingdom. This data would then be aggregated by Lower Super Output Area (LSOA) and added to various census data metrics for analysis using a geodemographic model. This is so as to produce a cluster output which categorises neighbourhoods in the study area in accordance with a set of common characteristics.


In [None]:
import requests
import pandas as pd
import geopandas as gpd

The code above is responsible for importing the requests, pandas and geopandas libraries

In [None]:
url_bikepoint = "https://api.tfl.gov.uk/BikePoint/"
response = requests.get(url_bikepoint)
response

Here I am making an API request for the locations of Santander Cycle Hire stations in London. The Response [200] indicates that the request has succeeded

In [None]:
data = response.json()
data

observing the data

In [None]:
rental = pd.DataFrame(data)

Here I am assigning the name rental to a pandas dataframe containing the data from the API request

In [None]:
rental

In [None]:
rental.dtypes

checking the data types

In [None]:
keep_cols = [
    "id",
    "commonName",
    "lat",
    "lon",]

subsetting the relevant columns

In [None]:
rental = rental[keep_cols]

rental now only includes these subsetted columns
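
The subsetting step can be illustrated without calling the API. The sketch below assumes the response is a list of records carrying at least the four fields kept above. The two sample records and the helper name `subset_records` are invented for illustration and are not real TfL data.

```python
# Hypothetical records shaped like the fields used above; all values are invented.
sample_records = [
    {"id": "BikePoints_A", "commonName": "Example Street", "lat": 51.50, "lon": -0.10, "placeType": "BikePoint"},
    {"id": "BikePoints_B", "commonName": "Sample Road", "lat": 51.52, "lon": -0.12, "placeType": "BikePoint"},
]

KEEP_COLS = ["id", "commonName", "lat", "lon"]


def subset_records(records, keep_cols):
    """Keep only the named fields from each record, failing loudly if one is missing."""
    rows = []
    for record in records:
        missing = [col for col in keep_cols if col not in record]
        if missing:
            raise KeyError(f"record {record.get('id')} lacks fields: {missing}")
        rows.append({col: record[col] for col in keep_cols})
    return rows


rows = subset_records(sample_records, KEEP_COLS)
print(rows[0])
```

Raising an error on a missing field is a design choice for this sketch: it surfaces a change in the API response early, rather than letting empty coordinates reach the geometry step.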

In [None]:
rentaloc = gpd.GeoDataFrame(rental, geometry=gpd.points_from_xy(rental['lon'], rental['lat']))

I am making a geopandas dataframe called rentaloc, building point geometry from the lon and lat columns of rental

In [None]:
rentaloc = rentaloc.set_crs("EPSG:4326")

The Coordinate Reference system shall be set to EPSG:4326

In [None]:
rentaloc.crs

In [None]:
rentaloc.head()

In [None]:
rentaloc = rentaloc.to_crs('EPSG:3857')

I am reprojecting rentaloc to a CRS whose units are metres, so that the buffer distance can be given in metres

In [None]:
rentalwalk = rentaloc.buffer(400)

I have created rentalwalk, which holds a buffer extending 400 units (metres in this CRS) from each point geometry in rentaloc
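
One caveat may be worth checking here. EPSG:3857 (Web Mercator) measures in metres, but it stretches distances by roughly 1/cos(latitude) away from the equator, so a buffer of 400 units covers less than 400 metres on the ground at London's latitude. The sketch below is a rough approximation only. The helper name `ground_distance_m` and the rounded latitude of 51.5 degrees are assumptions made for illustration.

```python
import math


def ground_distance_m(mercator_distance, latitude_deg):
    """Approximate true ground length of a short distance measured in EPSG:3857 units."""
    return mercator_distance * math.cos(math.radians(latitude_deg))


# 51.5 degrees north is an assumed, rounded latitude for central London.
approx_radius = ground_distance_m(400, 51.5)
print(round(approx_radius))
```

If a true 400-metre radius is wanted, one option would be to buffer in EPSG:27700 (British National Grid), as is done for the tube stations later in this notebook.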

In [None]:
rentalwalk.tail()

In [None]:
rentalarea = gpd.GeoDataFrame(geometry=rentalwalk, crs="EPSG:3857")
rentalarea['buffer_id'] = range(len(rentalarea))

I am creating a new geopandas dataframe from the rentalwalk geometry and adding a new variable, buffer_id, which gives each buffer a unique ID

In [None]:
rentalarea.head()

In [None]:
rentalarea = rentalarea.to_crs(crs="EPSG:4326")

Conversion back to the original CRS

In [None]:
rentalarea.crs

In [None]:
rentalarea.tail()

In [None]:
rentalarea.explore()

![annie](annie.png)

checking out what I just made

Critical to the analysis was the creation of a specific shapefile, ‘InnerLondonLSOA’, which contains the LSOA boundaries for the area of London most suited to my research questions. The shapefile covers the twelve Inner London Boroughs under the ‘London Government Act 1963’ statutory definition, ‘The City of London’ and ‘The London Borough of Newham’. The area was chosen because it roughly corresponds to the County of London, the administrative county established in 1889 and a forerunner of the area now governed by the Greater London Authority. The peripheral areas of Greater London are predominantly suburban, and in places rural, in nature. They demonstrate socio-economic and physical characteristics quite distinct from the Inner London Boroughs, with much of the built-up area being low-density and originating during the inter-war period, as residents moved outward following railway expansion. These areas were shaped by spatial processes quite distinct from those responsible for the more entrenched distribution of the population within London’s traditional core.

Additionally, the distribution of active and public transport infrastructure is far sparser in these areas, with the Santander Cycle scheme being effectively non-existent. Suburban Southeast London completely lacks any Underground connections, such that any attempt to include data from the Boroughs in this area would distort the outcomes. Newham was added to the shapefile because, under various statistical definitions, it forms part of Inner London in place of Greenwich. This makes sense as, from a demographic and population density standpoint, Newham holds characteristics equivalent to the Inner London Boroughs.


In [None]:
shapefile_path = "InnerLondon/InnerLondonMSOA.shp"
bounds = gpd.read_file(shapefile_path)

Loading in the shapefile which contains the bounds for my study area

In [None]:
boundconv = bounds.to_crs("EPSG:4326")
boundconv.explore()

![a1](a1.png)

Assigning the correct CRS and checking it out!

In [None]:
# 'predicate' replaces the deprecated 'op' keyword in recent geopandas releases
bikebounds = rentalarea.sjoin(boundconv, how="right", predicate="intersects")
bikebounds.tail()

Here I am creating a new geodataframe, bikebounds, by completing a spatial join between my rental area buffer geodataframe (rentalarea) and the newly loaded boundconv geodataframe.

In [None]:
BikenearOA = bikebounds.groupby('LSOA21CD')['buffer_id'].nunique().reset_index(name='BikenearOA')

BikenearOA.tail()

Above I am counting the number of buffers overlapping each LSOA, so that each LSOA is assigned a numerical value based on how many cycle hire stations lie within walking distance.
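
The counting logic can be seen on a toy table. This is a minimal sketch with invented LSOA codes and buffer IDs. It assumes, as in a right join, that an LSOA with no overlapping buffer appears once with a missing buffer_id.

```python
import pandas as pd

# Invented join result: one row per (LSOA, overlapping buffer) pair.
# LSOA "E0103" matched nothing, so its buffer_id is missing.
joined = pd.DataFrame({
    "LSOA21CD": ["E0101", "E0101", "E0102", "E0103"],
    "buffer_id": [0, 1, 1, None],
})

counts = (
    joined.groupby("LSOA21CD")["buffer_id"]
    .nunique()  # missing values are not counted
    .reset_index(name="BikenearOA")
)
print(counts)
```

Because `nunique` ignores missing values by default, unmatched LSOAs receive a count of zero rather than dropping out of the table.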

In [None]:
geom = boundconv
count = BikenearOA

Dataframe = pd.merge(geom, count, right_on="LSOA21CD", left_on="LSOA21CD")

now I am merging this with my existing boundconv dataframe

In [None]:
Dataframe.explore("BikenearOA", cmap = "YlGnBu")

![a2](a2.png)

Above is an interactive visualisation of the number of Santander Cycle hire stations within or adjacent to each LSOA.

In [None]:
Dataframe.head()

In [None]:
url_bikepark = "https://api.tfl.gov.uk/Place/Type/CyclePark"
response = requests.get(url_bikepark)
response

Here I am completing a second API request, this time fetching the locations of all cycle storage facilities across Greater London

In [None]:
data = response.json()
data

In [None]:
park = pd.DataFrame(data)

In [None]:
park.head()

In [None]:
keep_cols = [
    "id",
    "lat",
    "lon",]

subsetting for these columns specifically

In [None]:
park = park[keep_cols]

keeping only these columns under the new park dataframe

In [None]:
park = gpd.GeoDataFrame(park, geometry=gpd.points_from_xy(park['lon'], park['lat']))

I am making a new geodataframe using the lon and lat columns as geometry

In [None]:
park = park.set_crs("EPSG:4326")

setting the crs and then exploring

In [None]:
park.explore()

![b](b.png)

In [None]:
# 'predicate' replaces the deprecated 'op' keyword in recent geopandas releases
parkbounds = boundconv.sjoin(park, how="left", predicate="intersects")

As previously, conducting a spatial join on the left side, intersecting. The intention is such that my inner London area will be shown in the same dataframe with the bike parking locations

In [None]:
parkbounds.explore()

![b1](b1.png)

In [None]:
ParkperOA = parkbounds.groupby('LSOA21CD')['id'].nunique().reset_index(name='ParkperOA')

ParkperOA.info()

I am now performing an aggregation such that the number of cycle parking facilities within each individual LSOA is counted

In [None]:
ParkperOA.tail()

In [None]:
geom = Dataframe 
count = ParkperOA

Dataframe = pd.merge(geom, count, left_on="LSOA21CD", right_on="LSOA21CD")

Dataframe.tail()

Now, to merge the two dataframes into the one outcome!

In [None]:
Dataframe.explore("ParkperOA", cmap = "YlGnBu")

![b2](b2.png)

As before, we can visualise the relationship, this time between cycle parking facilities and LSOAs

In [None]:
Dataframe.head()

In [None]:
Dataframe["Parkrate"] = Dataframe["ParkperOA"]/Dataframe["Shape__Are"]

In [None]:
Dataframe["Bikerate"] = Dataframe["BikenearOA"]/Dataframe["Shape__Are"]

I'm interested in the density of both cycle hire stations and cycle parking facilities. Thus I am creating two new columns by dividing each count by the area of the LSOA
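
Densities expressed per unit of area can be very small numbers, which makes them harder to read on a legend. The sketch below uses invented counts and areas, and assumes that "Shape__Are" is recorded in square metres, which should be checked against the shapefile. The column name `Park_per_km2` is hypothetical.

```python
import pandas as pd

# Invented counts and areas; "Shape__Are" is assumed here to be in square metres.
example = pd.DataFrame({
    "LSOA21CD": ["E0101", "E0102"],
    "ParkperOA": [12, 3],
    "Shape__Are": [150_000.0, 600_000.0],
})

SQ_M_PER_SQ_KM = 1_000_000

# Same division as above, then rescaled to facilities per square kilometre.
example["Parkrate"] = example["ParkperOA"] / example["Shape__Are"]
example["Park_per_km2"] = example["Parkrate"] * SQ_M_PER_SQ_KM
print(example[["LSOA21CD", "Park_per_km2"]])
```

Rescaling does not change the ranking of LSOAs or the class breaks' membership, only the readability of the values.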

In [None]:
Dataframe.explore("Parkrate", cmap = 'Greens')

![b3](b3.png)

In [None]:
import mapclassify as mc
import matplotlib.pyplot as plt
import folium
import seaborn as sns

import geopandas as gpd

Here I am importing several packages for enhanced data visualisation

In [None]:
shapefile_path = "station/Tube2.shp"
tube = gpd.read_file(shapefile_path)

In the above code I am importing a shape file containing all the London Underground Stations in London

In [None]:
tube.head()

In [None]:
tube.crs

Here I am creating a buffer object around each tube station of 400 metres

In [None]:
tubewalk = tube.buffer(400)


In [None]:
tubearea = gpd.GeoDataFrame(geometry=tubewalk, crs="EPSG:27700")
tubearea['buffer_id'] = range(len(tubearea))

creating a new dataframe made up of the buffers, with a new variable giving each a unique ID

In [None]:
tubearea.explore()

![c](c.png)

In [None]:
tubearea = tubearea.to_crs("EPSG:4326")

converting into the same CRS as the rest of my project so some analysis can occur 

In [None]:
tubearea.head()

In [None]:
# 'predicate' replaces the deprecated 'op' keyword in recent geopandas releases
tubebounds = tubearea.sjoin(boundconv, how="right", predicate="intersects")
tubebounds.tail()

Undertaking a spatial join between the tube buffer geodataframe and my study area boundary geodataframe

In [None]:
TubenearOA = tubebounds.groupby('LSOA21CD')['buffer_id'].nunique().reset_index(name='TubenearOA')

TubenearOA.tail()

Aggregating by the number of buffers which intersect with each LSOA21CD to produce a new variable TubenearOA

In [None]:
geom = boundconv
count = TubenearOA

Dataframe2 = pd.merge(geom, count, right_on="LSOA21CD", left_on="LSOA21CD")

adding TubenearOA to the existing main geodataframe and labelling it as 'Dataframe2'

In [None]:
Dataframe2.head()

In [None]:
keep_cols = [
    "LSOA21CD",
    "TubenearOA",]

In [None]:
Dataframe3 = Dataframe2[keep_cols]

In [None]:
Dataframe2.explore("TubenearOA", cmap = 'Greens')

![c1](c1.png)

observing the TubenearOA variable

In [None]:
Dataframe = pd.merge(Dataframe, Dataframe3, right_on="LSOA21CD", left_on="LSOA21CD")


Merging the two dataframes to produce a unified outcome with all the data

In [None]:
Dataframe.head()

In [None]:
Dataframe = Dataframe.to_crs("EPSG:27700")

Conversion to EPSG:27700 as to obtain centroids using metres

In [None]:
Dataframe['centroid'] = Dataframe['geometry'].centroid

In [None]:
distances = Dataframe['centroid'].apply(lambda x: tube['geometry'].distance(x))

Creating a new function which calculates the distance from the centroid of each LSOA to the tube station locations

In [None]:
closest_tube = distances.min(axis=1)
Dataframe['closest_tube'] = closest_tube

The closest_tube variable calculates the minimum distance to any of the London Underground stations
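
The nearest-station calculation amounts to taking, for each centroid, the smallest straight-line distance to any station. A minimal sketch of that idea follows, using invented projected coordinates in metres rather than real station locations. The function name `nearest_distance` is hypothetical.

```python
import math


def nearest_distance(origin, points):
    """Smallest straight-line distance from origin to any point (same projected units)."""
    return min(math.hypot(px - origin[0], py - origin[1]) for px, py in points)


# Invented projected coordinates in metres (not real centroids or stations).
centroids = [(0.0, 0.0), (1000.0, 1000.0)]
stations = [(300.0, 400.0), (1000.0, 1600.0)]

closest = [nearest_distance(c, stations) for c in centroids]
print(closest)
```

This is a straight-line measure, so it will understate walking distance wherever the street network forces a detour.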

In [None]:
Dataframe.tail()

In [None]:
filtered_df = Dataframe.loc[Dataframe['closest_tube'] < 30000]

There was an issue with my data in that some of the geometry was a bit off. One LSOA was recorded as lying some 40,000 metres from the nearest tube station, throwing off the gradient. As such I removed it by keeping only LSOAs less than 30,000 metres from a station

In [None]:
filtered_df

In [None]:
filtered_df.explore("closest_tube", cmap = 'Reds')

![c2](c2.png)

In [None]:
filtered_df.tail()

The shapefile was used to provide a set of boundaries within which to aggregate the Cycle and Tube Station points, providing a metric through which to measure the amount of each facility on an area-by-area basis. I chose to create a set of buffers delimiting a radius of 400 metres from each of the Cycle Hire and Tube Station points. The 400-metre value was specified with reference to the common consensus that 477 metres is the average distance a pedestrian will routinely travel to a transit stop from their place of residence. Accounting for variation in street network orientation, and the fact that any buffer overlapping an LSOA contributes toward the count of the entire polygon, it made sense to reduce the buffer distance. I didn’t apply any buffers to the Cycle Parking points, as I felt people would be less willing to walk considerable distances to access cycle parking, especially as such facilities are typically used to store cycles at the conclusion of a journey, necessitating a position adjacent to the destination. As an additional variable I calculated the distance from the centre of each LSOA to the closest Tube station, the intention being to note whether a greater distance from a transit stop held any correlation with the included demographic variables.

In [None]:
import mapclassify as mc
import matplotlib.pyplot as plt
import folium
import seaborn as sns

Importing the libraries used to plot histograms and to create classification breaks, so as to visualise how each of the six newly created variables differs across the Inner London study area

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(30, 20))


sns.histplot(data=filtered_df, x="BikenearOA",ax=axes[0, 0], kde=True) 
sns.histplot(data=filtered_df, x="ParkperOA",ax=axes[0, 1], kde=True) 
sns.histplot(data=filtered_df, x="Parkrate",ax=axes[1, 0], kde=True) 
sns.histplot(data=filtered_df, x="Bikerate",ax=axes[1, 1], kde=True) 
sns.histplot(data=filtered_df, x="TubenearOA",ax=axes[2, 0], kde=True) 
sns.histplot(data=filtered_df, x="closest_tube",ax=axes[2, 1], kde=True) 

plt.tight_layout()
plt.show()

![d](d.png)

In [None]:
num_classes = 5

classifier_a = mc.NaturalBreaks(filtered_df['BikenearOA'], k=num_classes)
classifier_b = mc.NaturalBreaks(filtered_df['ParkperOA'], k=num_classes)
classifier_c = mc.NaturalBreaks(filtered_df['Parkrate'], k=num_classes)
classifier_d = mc.NaturalBreaks(filtered_df['Bikerate'], k=num_classes)
classifier_e = mc.NaturalBreaks(filtered_df['TubenearOA'], k=num_classes)
classifier_f = mc.NaturalBreaks(filtered_df['closest_tube'], k=num_classes)

Creating a sequence of natural breaks detailing the distribution of LSOAs within the histogram for each of the six new variables 
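
Once the breaks are known, each LSOA falls into the first class whose upper bound is at least its value. The sketch below illustrates that assignment with invented bounds. It assumes the bins are ascending upper bounds that include their own value, and the helper name `classify` is hypothetical.

```python
from bisect import bisect_left


def classify(value, upper_bounds):
    """Return the index of the first class whose upper bound is >= value."""
    return bisect_left(upper_bounds, value)


# Invented upper bounds for five classes; real ones would come from a classifier's bins.
upper_bounds = [1, 3, 6, 10, 20]
values = [0, 1, 2, 6, 7, 20]
classes = [classify(v, upper_bounds) for v in values]
print(classes)
```

A value sitting exactly on a break (such as 6 here) is placed in the lower class, which is why the choice of breaks matters for skewed variables with many tied values.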

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(30, 20))

filtered_df.plot(column='BikenearOA', ax=axs[0, 0],
         legend=True, cmap='Greens',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_a.bins}
        )

axs[0, 0].set_title("Cycle Hire Stations within walking distance of OA")

filtered_df.plot(column='ParkperOA', ax=axs[0, 1],
         legend=True, cmap='Greens',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_b.bins})

axs[0, 1].set_title("Cycle parking frames within OA")

filtered_df.plot(column='Parkrate', ax=axs[1, 0],
         legend=True, cmap='Greens',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_c.bins}
        )

axs[1, 0 ].set_title("Parking Frames per metre")

filtered_df.plot(column='Bikerate', ax=axs[1, 1],
         legend=True, cmap='Greens',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_d.bins}
        )

axs[1, 1].set_title("Nearby Cycle Hire Station per metre")

filtered_df.plot(column='TubenearOA', ax=axs[2, 0],
         legend=True, cmap='Greens',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_e.bins}
        )

axs[2, 0].set_title("Tube stations within walking distance")

filtered_df.plot(column='closest_tube', ax=axs[2, 1],
         legend=True, cmap='Greens_r',
         scheme='UserDefined',
         classification_kwds={'bins': classifier_f.bins}
        )

axs[2, 1].set_title("Distance from Tube station")


plt.tight_layout() 
plt.show()


![d2](d2.png)

Plotting the six map outcomes with geopandas and matplotlib, using Fisher-Jenks natural breaks

I identified a set of 18 variables from the 2021 Census website, obtaining data for each of the LSOAs within my ‘Inner London’ study area. Each variable was selected on the grounds that it would break the wider population down into a series of smaller demographic groups, principally along socio-economic and identity-based lines. This was so that any overlap between identity and geographic distribution could be observed. It acknowledges that, because of successive policy interventions, London is a city marked by spatial stratification, with various socio-economic and ethnic groups clustered in immediate adjacency to one another. Census categories associated with the quality and type of housing, in addition to health-related criteria, were particularly interesting to me, in that I sought to examine the degree to which historically unequal outcomes are replicated as of 2021. Poorer health is a particularly critical variable in that it presents a key intersection between mobility and socio-economic outcomes. In 2019 roughly 4,000 Londoners died as a result of the impact of toxic air pollution, with those exposed to the worst air pollution being more likely to be deprived Londoners and those from BAME communities. It seems plausible to hypothesise that those most distant from active and public transit options, yet without the capacity to use private transportation, are subject to a double discrimination of economic isolation and worsened health outcomes. I decided to add a variable capturing the uptake of driving as the principal means of getting to work, in addition to a variable marking those who were employed on only a part-time basis.

The rationale behind including these two variables was founded on an understanding that those residing in areas of poor public transport provision may be forced to rely on private transport or adopt an economically inactive lifestyle on account of poor accessibility to employment opportunities. Thus, inaccessibility to alternative means of transportation may enforce a ‘double discrimination’ in that an already disadvantaged population is forced to invest in alternative means of transportation or exclude itself from society: a vicious cycle which fuels disillusionment and crime.

In [None]:
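import os

import pandas as pd

csv_directory = "bigOne/"
output_name = "merged_census_data.csv"

# Collect every census CSV, skipping the merged output so that a re-run does not ingest it
csv_files = sorted(
    file for file in os.listdir(csv_directory)
    if file.endswith(".csv") and file != output_name
)

# Read each file, then place the tables side by side in a single concat.
# axis=1 aligns on row position, so every file must list the LSOAs in the same order.
frames = [pd.read_csv(os.path.join(csv_directory, f), low_memory=False) for f in csv_files]
merged_data = pd.concat(frames, axis=1)

# This path is spelled "bigone/", unlike "bigOne/" above. It is kept because the cell
# below reads from it; on a case-sensitive filesystem the two are different folders.
merged_data.to_csv("bigone/merged_census_data.csv", index=False)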


Importing the os module to handle file paths. The code works through all the CSV files within the designated directory 'bigOne' and concatenates them side by side into one large dataframe, which is then saved as a single CSV.
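
Placing tables side by side relies on every file listing the LSOAs in the same row order. As a hedged alternative, the sketch below merges on a shared key column instead. The two small tables are invented, and it assumes each file carries the 'Row Labels' column that is used as the join key further down.

```python
import io
from functools import reduce

import pandas as pd

# Two invented census-style tables sharing a key column but in different row order.
csv_a = io.StringIO("Row Labels,TotalSex,Male\nE0101,100,48\nE0102,200,101\n")
csv_b = io.StringIO("Row Labels,TotalHome\nE0102,210\nE0101,95\n")

tables = [pd.read_csv(f) for f in (csv_a, csv_b)]

# Merge on the shared key rather than relying on row position.
merged = reduce(
    lambda left, right: left.merge(right, on="Row Labels", how="outer"),
    tables,
)
print(merged)
```

Merging on the key also leaves a single 'Row Labels' column, rather than one copy per input file.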

In [None]:
import pandas as pd
import geopandas as gpd

shp_path = "InnerLondon/InnerLondonMSOA.shp"
gdf = gpd.read_file(shp_path)

Reading the Inner London shape file again, as to have it fresh out the box and in my head. Easy to forget what is happening!

In [None]:
csv_path = "bigone/merged_census_data.csv"
csv_data = pd.read_csv(csv_path, low_memory=False)
merged_data = gdf.merge(csv_data, left_on='LSOA21CD', right_on='Row Labels', how='left')

I then merge the data into one geodataframe, using LSOA21CD as the key. This is saved under merged_data

In [None]:
merged_data.tail()

In [None]:
list(merged_data.columns)

In [None]:
def calculate_percentages(dataframe, total_columns, value_columns):

    result_df = pd.DataFrame()

    for total_col, value_col in zip(total_columns, value_columns):
        percentage_col_name = f"{value_col}_percentage"

        if total_col not in dataframe.columns:
            raise ValueError(f"Total column '{total_col}' not found in the DataFrame.")
        dataframe[value_col] = pd.to_numeric(dataframe[value_col], errors='coerce')
        dataframe[total_col] = pd.to_numeric(dataframe[total_col], errors='coerce')
        
        result_df[percentage_col_name] = (dataframe[value_col] / dataframe[total_col]) * 100

    return result_df

total_cols = ['TotalDep',
              'TotalDist',
              'TotalDist',
              'TotalEmploy',
              'TotalEmploy',
              'TotalEmploystat', 
              'TotalHousing',
              'TotalHousing',
              'TotalRel',
              'TotalRel',
              'TotalHome', 
              'TotalSex',
              'TotalCrowd',
              'TotalCrowd',
              'TotalType',
              'TotalEnglish','TotalTrans',
              'TotalHour']
 
value_cols = ['Household is not deprived in any dimension',
              'Works mainly from home',
              'Works mainly at an offshore installation, in no fixed place, or outside the UK',
              'L14.1 and L14.2: Never worked and long-term unemployed',
              'L15: Full-time students',
              'Economically inactive: Long-term sick or disabled',
              'Social rented: Rents from council or Local Authority',
              'Owned: Owns outright',
              'Christian',
              'Muslim',
              'Second address is outside the UK',
              'Male',
              'Occupancy rating of rooms: -2 or less',
              'Occupancy rating of rooms: +2 or more',
              'Detached',
              'Main language is not English (English or Welsh in Wales): Cannot speak English well', 'Driving a car or van',
             'Part-time: 15 hours or less worked']

result_dataframe = calculate_percentages(merged_data, total_cols, value_cols)

The code above works as follows. I initially create an empty dataframe, result_df, before looping through the pairs of columns specified above. For each pair I calculate a percentage by dividing the value column by its total column, row by row, and store it in a new column of result_df. Before dividing, both columns are converted to numeric, with any non-numeric entries set to NaN. The function returns the percentages, which are saved as result_dataframe.
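
A small worked example may make the coercion step clearer. The rows below are invented, and the "c" stands in for any non-numeric cell that might appear in a census extract.

```python
import pandas as pd

# Invented rows; the "c" stands in for a suppressed or non-numeric cell.
toy = pd.DataFrame({"TotalSex": ["200", "150", "c"], "Male": ["98", "75", "40"]})

total = pd.to_numeric(toy["TotalSex"], errors="coerce")  # "c" becomes NaN
value = pd.to_numeric(toy["Male"], errors="coerce")

# Row-wise percentage; any row with a NaN total yields a NaN percentage.
toy["Male_percentage"] = value / total * 100
print(toy)
```

The NaN produced for the third row is carried forward, so it needs handling before any step that cannot accept missing values.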

In [None]:
result_dataframe.tail()

In [None]:
concatenated_df = pd.concat([merged_data, result_dataframe], axis=1, ignore_index=False)
concatenated_df.tail()

The latest iteration of my dataframe is to be called concatenated_df and consists of the new percentage values as stored in 'result_dataframe' and the merged_data dataframe containing the rest of my data.

In [None]:
list(concatenated_df.columns)

In [None]:
keep_cols= [
    'LSOA21CD',
    'geometry',
    'Household is not deprived in any dimension_percentage',
 'Works mainly from home_percentage',
 'Works mainly at an offshore installation, in no fixed place, or outside the UK_percentage',
 'L14.1 and L14.2: Never worked and long-term unemployed_percentage',
 'L15: Full-time students_percentage',
 'Economically inactive: Long-term sick or disabled_percentage',
 'Social rented: Rents from council or Local Authority_percentage',
 'Owned: Owns outright_percentage',
 'Christian_percentage',
 'Muslim_percentage',
 'Second address is outside the UK_percentage',
 'Male_percentage',
    'Occupancy rating of rooms: -2 or less_percentage',
    'Occupancy rating of rooms: +2 or more_percentage',
    'Detached_percentage',
    'Main language is not English (English or Welsh in Wales): Cannot speak English well_percentage',
    'Driving a car or van_percentage',
 'Part-time: 15 hours or less worked_percentage']

Dataframe4 = concatenated_df[keep_cols]

I am now specifying that only the geometry, LSOA code and associated percentage values are kept in my most up-to-date dataframe.

In [None]:
short_column_names = {
    'Household is not deprived in any dimension_percentage' : 'Not Deprived',
 'Works mainly from home_percentage' : 'Home Worker',
 'Works mainly at an offshore installation, in no fixed place, or outside the UK_percentage' : 'Mobile Worker',
 'L14.1 and L14.2: Never worked and long-term unemployed_percentage' : 'Unemployed',
 'L15: Full-time students_percentage' : 'Student',
 'Economically inactive: Long-term sick or disabled_percentage' : 'Disabled',
 'Social rented: Rents from council or Local Authority_percentage' : 'Social Renter',
 'Owned: Owns outright_percentage' : 'Owner',
 'Christian_percentage' : 'Christian',
 'Muslim_percentage' : 'Muslim',
 'Second address is outside the UK_percentage' : 'Oversees address',
 'Male_percentage' : 'Male',
    'Occupancy rating of rooms: -2 or less_percentage' : 'Overcrowded',
    'Occupancy rating of rooms: +2 or more_percentage' : 'Capacity spare' ,
    'Detached_percentage' : 'Detatched',
    'Main language is not English (English or Welsh in Wales): Cannot speak English well_percentage': 'Not fluent',
      'Driving a car or van_percentage': 'Driver',
 'Part-time: 15 hours or less worked_percentage':'Part-time'
}

Dataframe4 = Dataframe4.rename(columns=short_column_names)

For ease of use I'm specifying a set of shorter column names, which will help in the analysis portion

In [None]:
Dataframe4.tail()

In [None]:
Dataframe4.columns

In [None]:
Dataframe4.dtypes

In [None]:
numeric_columns = Dataframe4.select_dtypes(include='float64')
z_score_df = (numeric_columns - numeric_columns.mean()) / numeric_columns.std(ddof=0)
z_score_df.tail()

Before building the correlation matrix I standardise the variables. This is achieved by subtracting the column mean from each numerical value and dividing the result by the column standard deviation. The output is saved in a new dataframe called z_score_df.
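
The standardisation is z = (x - mean) / standard deviation, applied column by column. The sketch below works it through for one invented column using only the standard library. The population standard deviation is used to match `ddof=0` in the cell above, and the helper name `z_scores` is hypothetical.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardise values: subtract the mean, then divide by the population standard deviation."""
    mean = fmean(values)
    sd = pstdev(values)  # population form, matching std(ddof=0)
    return [(v - mean) / sd for v in values]


# Invented percentages for a single variable.
column = [10.0, 20.0, 30.0, 40.0]
standardised = z_scores(column)
print([round(z, 3) for z in standardised])
```

After standardisation the column has a mean of zero and a standard deviation of one, which puts variables measured on different scales on an equal footing.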

In [None]:
corr = z_score_df.corr()
corr.style.background_gradient(cmap='coolwarm')

![e](e.png)

Here I have created a correlation matrix specifying the relationship between each of the variables. The greater the correlation the more intense the colour gradient.

In [None]:
import matplotlib.pyplot as plt

plt.matshow(z_score_df.corr())
plt.show()

![e1](e1.png)

And here is a simplified Rasterised version of the above matrix. Interesting!

In [None]:
import seaborn as sns

threshold = 0.75


highly_correlated = (corr.abs() > threshold) & (corr.abs() < 1.0)

plt.figure(figsize=(10, 8))
sns.heatmap(highly_correlated, cmap='coolwarm', cbar=False, annot=True)

plt.title('Highly Correlated Variables')
plt.show()

![e2](e2.png)

The 18 variables were then placed through a further selection process in which they were standardised, with each value having its z-score calculated by subtracting the column mean and dividing by the column standard deviation. The z-score outputs were then placed in a correlation matrix such that any pair of variables with a correlation above the 0.75 threshold would be considered for removal. When choosing which of two correlated variables to bin, I placed a preference on keeping the variable which was responsible for the greatest number of correlations between it and its peers. As such I would be left with a slimmer set of more influential variables.

I have imported seaborn to enable this visualisation. I have set a threshold of 0.75 such that any pair of variables whose absolute correlation exceeds it is selected and coloured red within a new matrix.
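
The heatmap flags the correlated pairs visually; the same information can also be listed as a table, which makes it easier to count how many strong correlations each variable is involved in. The following is only a sketch under stated assumptions: `correlated_pairs` is a hypothetical helper and `toy_corr` is a made-up correlation matrix with placeholder variable names.

```python
import numpy as np
import pandas as pd

def correlated_pairs(corr: pd.DataFrame, threshold: float) -> pd.DataFrame:
    """List each variable pair whose absolute correlation exceeds threshold."""
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ["var_a", "var_b", "r"]
    return pairs[pairs["r"].abs() > threshold].reset_index(drop=True)

# Hypothetical correlation matrix for three placeholder variables.
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, -0.8],
     [0.2, -0.8, 1.0]],
    index=["A", "B", "C"],
    columns=["A", "B", "C"],
)

pairs = correlated_pairs(toy_corr, threshold=0.75)
# Count how many strong correlations each variable takes part in.
counts = pd.concat([pairs["var_a"], pairs["var_b"]]).value_counts()
print(pairs)
print(counts)
```

In the real analysis `corr` from the cell above would take the place of `toy_corr`; which member of each pair to drop remains a judgement call.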

In [None]:
z_score_df.drop(['Home Worker', 'Unemployed', 'Owner', 'Overcrowded','Muslim'], axis=1, inplace=True)
z_score_df.info()

Upon obtaining the results from this matrix I identified 5 variables which ought to be dropped from the analysis on account of strong correlation.

In [None]:
z_score_df.fillna(z_score_df.mean(), inplace=True)

And again, filling any missing values with the column means to ensure everything goes smoothly.

In [None]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist, pdist

# KMeans with 10 clusters
kmeans = KMeans(n_clusters=10)
kmeans.fit(z_score_df)
labels = kmeans.predict(z_score_df)
cluster_centres = kmeans.cluster_centers_
z_score_df['Cluster'] = kmeans.labels_

I then ran a K-means cluster analysis with the remaining variables to split the data into a number of distinct cluster groups based on the error sum of squares. The algorithm starts from a set of randomly generated cluster seeds, assigns each point to the centre whose values it most closely resembles, and then moves each centre so that the distance between it and its assigned points is reduced, repeating until the assignments settle. I initially produced 10 clusters based on the values in the z-score matrix. As the histogram demonstrated that there was much variability between the sizes of the clusters, I sought a more equitable split. This was done through the Elbow method, which plots how the within-cluster sum of squared distances falls as each new cluster group is added and marks the point of diminishing returns at which a further cluster ceases to improve the model to a reasonable extent. The number of clusters at this Elbow point was 5, so I proceeded to specify this as the number of clusters. Upon running the analysis each of the LSOAs in the specified Inner London study area was ascribed a cluster value, with the output as shown.
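
To make the "error sum of squares" concrete, the sketch below computes it by hand for a fixed assignment of points to clusters. This is the quantity scikit-learn reports as `inertia_` once the centres have converged to the cluster means. The `within_cluster_ss` helper and the four points are illustrative assumptions rather than anything taken from the notebook.

```python
import numpy as np

def within_cluster_ss(X: np.ndarray, labels: np.ndarray) -> float:
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Hypothetical 2-D points forming two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = within_cluster_ss(X, np.array([0, 0, 0, 0]))
two_clusters = within_cluster_ss(X, np.array([0, 0, 1, 1]))
print(one_cluster, two_clusters)  # 104.0 4.0
```

Moving from one cluster to two removes almost all of the error here, whereas any further split could save at most the remaining 4.0. That sharp change in slope is what the Elbow plot looks for.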

In [None]:
plt.hist(labels)

![e3](e3.png)

In [None]:
Sum_of_squared_distances = []

K_range = range(1,15)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=26)
    km = km.fit(z_score_df)
    Sum_of_squared_distances.append(km.inertia_)
    
plt.plot(K_range, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

![e4](e4.png)

In [None]:
kmeans = KMeans(n_clusters=5, random_state = 26)
kmeans.fit(z_score_df)
labels = kmeans.predict(z_score_df)
cluster_centres = kmeans.cluster_centers_

z_score_df['Cluster'] = kmeans.labels_

In [None]:
import plotly.express as px
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

plt.figure(figsize=(12, 8))

kmeans = KMeans(n_clusters=5, random_state = 26)
clusters = kmeans.fit_predict(z_score_df)

z_score_df['Cluster'] = clusters

scaler = StandardScaler()
stand_data_scaled = scaler.fit_transform(z_score_df)

# PCA analysis.
pca = PCA(n_components=2).fit(stand_data_scaled)
pca_result = pca.transform(stand_data_scaled)

#Percentage of variance explained by each of the selected components.
variance_ratio = pca.explained_variance_ratio_

# Create a scatter plot
fig = px.scatter(x=pca_result[:, 0], y=pca_result[:, 1], color=clusters,
                 labels={'color': 'Cluster'},
                 #title='Cluster Plot against 1st 2 Principal Components',
                 opacity=0.7,
                 width=800, 
                 height=800)

plt.tight_layout()
fig.show()

print(f"These two components explain {(variance_ratio.sum()*100):.2f}% of the point variability.")

![f](f.png)

In [None]:
kmeans = KMeans(n_clusters=5, random_state = 26)
clusters = kmeans.fit_predict(z_score_df)

z_score_df['Cluster'] = clusters

scaler = StandardScaler()
stand_data_scaled = scaler.fit_transform(z_score_df)

pca = PCA(n_components=2).fit(stand_data_scaled)
pca_result = pca.transform(stand_data_scaled)

variance_ratio = pca.explained_variance_ratio_

plt.figure(figsize=(10, 6))
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=clusters, palette='viridis', s=50, alpha=0.7)
plt.title('Cluster Plot against 1st 2 Principal Components')
plt.xlabel(f'Principal Component 1 variation: {variance_ratio[0]*100:.2f}%')
plt.ylabel(f'Principal Component 2 variation: {variance_ratio[1]*100:.2f}%')
plt.legend(title='Clusters')
plt.show()


![f1](f1.png)

In [None]:
kmeans = KMeans(n_clusters=5, random_state=26)
clusters = kmeans.fit_predict(z_score_df)

# Get the cluster centres as a DataFrame labelled with the feature names
cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=z_score_df.columns)

# Create a new DataFrame with cluster assignments and column names
#result_df = pd.DataFrame({'Cluster': clusters, 'Column': z_score_df.columns})

cluster_centers.head(6)

Creating radial polar charts from the cluster centre values allows for a bespoke analysis of the variables which drove the associated K-means cluster distribution. In this instance I manually cycled through each of the five cluster groups and took note of the attributes which held high z-scores, whether positive or negative.
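
The manual reading of each chart can be complemented by a small tabular check. The sketch below ranks, for every cluster centre, the features sitting furthest from the overall mean (z = 0). The `extreme_features` helper and the `toy_centres` values are hypothetical; in the notebook the `cluster_centers` dataframe would be passed in instead.

```python
import pandas as pd

def extreme_features(centres: pd.DataFrame, n: int = 2) -> dict:
    """For each cluster centre, return the n features furthest from the mean (z = 0)."""
    summary = {}
    for cluster_id, row in centres.iterrows():
        # Rank by absolute z-score so strong negatives count as much as strong positives.
        order = row.abs().sort_values(ascending=False).index[:n]
        summary[cluster_id] = row[order].round(2).to_dict()
    return summary

# Hypothetical centre values, not the ones produced by the analysis.
toy_centres = pd.DataFrame(
    {"Driver": [0.6, -1.1], "Student": [-0.3, 2.0], "Male": [0.0, 0.4]},
    index=[0, 1],
)

result = extreme_features(toy_centres)
print(result)
```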

In [None]:
first_row_centers = cluster_centers.iloc[0, :]

# len of features
num_features = len(first_row_centers)

# polar coordinates
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=True)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

![g](g.png)

In [None]:
first_row_centers = cluster_centers.iloc[1, :]

# len of features
num_features = len(first_row_centers)

# polar coordinates
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=True)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

![g1](g1.png)

In [None]:
first_row_centers = cluster_centers.iloc[3, :]

# len of features
num_features = len(first_row_centers)

# polar coordinates
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=True)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

![g2](g2.png)

In [None]:
first_row_centers = cluster_centers.iloc[2, :]

# len of features
num_features = len(first_row_centers)

# polar coordinates
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=True)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

![g3](g3.png)

In [None]:
first_row_centers = cluster_centers.iloc[4, :]

# len of features
num_features = len(first_row_centers)

# polar coordinates
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=True)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')
# Add an extra red line at the 0.0 value
ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns, rotation=45, ha='right')

plt.show()

![g4](g4.png)

In [None]:
list(z_score_df.columns)

In [None]:
z_score_df.drop(['Not Deprived',
 'Mobile Worker',
 'Student',
 'Disabled',
 'Social Renter',
 'Christian',
 'Oversees address',
 'Male', 
 'Capacity spare',
 'Detatched',
 'Not fluent','Driver',
 'Part-time'
 ], axis=1, inplace=True)
z_score_df.info()

Here we drop all of the attributes except the Cluster as we won't need them.

In [None]:
final_df = pd.concat([Dataframe4, z_score_df], axis=1, ignore_index=False)
final_df.tail()

We then add the newly created z-score dataframe to the existing master dataframe 'Dataframe4'.

In [None]:
final_df.dtypes

In [None]:
final_df.tail()

In [None]:
merged_df = final_df.merge(filtered_df, on='LSOA21CD', how='left')

In [None]:
merged_df.tail()

In [None]:
merged_df.dtypes

In [None]:
merged_df.drop(['BNG_E', 'BNG_N', 'LONG', 'LAT', 'Shape__Len', 'geometry_y', 'centroid'], axis=1, inplace=True)

In [None]:
merged_df.tail()

In [None]:
df_boss = merged_df.drop(columns=['Cluster'])

In [None]:
final = gpd.GeoDataFrame(merged_df, geometry='geometry_x', crs="EPSG:27700")

In [None]:
final.explore(column='Cluster', cmap='Set1', tiles='CartoDB positron')

![h](h.png)

In [None]:
final.dtypes

In [None]:
numeric_columns = final.select_dtypes(include='float64')
df2 = (numeric_columns - numeric_columns.mean()) / numeric_columns.std(ddof=0)
df2.tail()

In [None]:
df2.drop(['Shape__Are'], axis=1, inplace=True)

In [None]:
corr = df2.corr()
corr.style.background_gradient(cmap='coolwarm')

![h2](h2.png)

However, critically, this initial outcome does not include the variables from my aggregation analysis, which are associated with the points fetched through the Tube shapefile and the combined TfL API. I merged the two tables on the LSOA code values, giving an outcome with both the aggregated point-per-LSOA counts and the census data, before running the correlation matrix analysis a second time. I completed the process twice, once with just the census data and once with the two datasets combined, so as to note how the optimum number of clusters and the cluster extents differed. Additionally, on the second run of the correlation matrix I reduced the threshold for removal to 0.7. The rationale was that, having added several new variables, I sought to include only the most influential to ensure the eventual clusters were sufficiently distinct from one another.

Indeed, off the back of the first cluster map I felt the clusters were not distinct enough, with Clusters 3 and 0 especially being prevalent in LSOAs that seemed to be defined less by a set of distinct characteristics than by the absence of one. A contributing factor was the inclusion of the ‘closest_tube’ metric, which gave remarkably high values to LSOAs located toward the Southeast periphery of the study area. This is on account of Southeast London being completely devoid of London Underground lines, a core piece of rationale behind concentrating the study on Inner London exclusively. As the defining characteristic of the majority of LSOAs in this area is their considerable distance from a London Underground station, there was a heavy preference toward Cluster 3, a fact which diminished the role of other variables that could provide a more nuanced distinction between the various areas. As such, irrespective of its position on the correlation matrix, I took the decision to remove the ‘closest_tube’ metric for the subsequent cluster analysis.

Upon running the correlation matrix for a second time I was left with a set of 15 final variables, including 3 from the externally sourced aggregated location data: BikenearOA, ParkperOA and TubenearOA. I then ran the elbow method once more, giving a recommended cluster number of 5 as before. Upon running the analysis, I obtained a final cluster map which demonstrates how London is divided in accordance with mobility accessibility and socio-economic factors. This map will form the basis of further graphical and statistical analysis to provide answers to my two research questions.
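
The merge on LSOA codes described above can be sanity-checked with two optional arguments to pandas' `merge`. The sketch below uses made-up stand-in tables and shortened placeholder codes (`E01` to `E03`); only the `LSOA21CD` key and the `TubenearOA` column name come from the notebook, and the checks shown are a suggestion rather than the method actually used.

```python
import pandas as pd

# Hypothetical stand-ins for the census table and the aggregated mobility counts.
census = pd.DataFrame({"LSOA21CD": ["E01", "E02", "E03"], "Driver": [12.0, 30.5, 22.1]})
mobility = pd.DataFrame({"LSOA21CD": ["E01", "E02"], "TubenearOA": [3, 0]})

merged = census.merge(
    mobility,
    on="LSOA21CD",
    how="left",             # keep every census LSOA, as in the notebook merge
    validate="one_to_one",  # raise if an LSOA code is duplicated on either side
    indicator=True,         # adds a _merge column recording where each row matched
)

# LSOAs with no mobility record end up with NaN counts and would later be mean-filled.
unmatched = merged.loc[merged["_merge"] == "left_only", "LSOA21CD"].tolist()
print(unmatched)
```

Listing the unmatched codes before filling missing values makes it clear how many LSOAs receive an imputed mean rather than an observed count.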


In [None]:
import seaborn as sns

threshold = 0.7


highly_correlated = (corr.abs() > threshold) & (corr.abs() < 1.0)

plt.figure(figsize=(10, 8))
sns.heatmap(highly_correlated, cmap='coolwarm', cbar=False, annot=True)

plt.title('Highly Correlated Variables')
plt.show()

![h3](h3.png)

In [None]:
df2.drop(['Not Deprived', 'Home Worker', 'Muslim', 'Overcrowded', 'Not fluent', 'Capacity spare', 'closest_tube', 'Bikerate', 'Parkrate',], axis=1, inplace=True)
df2.info()

In [None]:
df2.fillna(df2.mean(), inplace=True)

In [None]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist, pdist

kmeans = KMeans(n_clusters=10, random_state = 71)
kmeans.fit(df2)
labels = kmeans.predict(df2)
cluster_centres = kmeans.cluster_centers_
df2['Cluster'] = kmeans.labels_

In [None]:
Sum_of_squared_distances = []

K_range = range(1,15)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=71)
    km = km.fit(df2)
    Sum_of_squared_distances.append(km.inertia_)
    
plt.plot(K_range, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

![h4](h4.png)

__Results__

The K-means cluster Elbow method stipulated that I divide the LSOAs into a set of five clusters in accordance with a set of distinct qualities. 


In [None]:
kmeans = KMeans(n_clusters=5, random_state =67)
kmeans.fit(df2)
labels = kmeans.predict(df2)
cluster_centres = kmeans.cluster_centers_

df2['Cluster'] = kmeans.labels_

In [None]:
kmeans = KMeans(n_clusters=5, random_state=67)
clusters = kmeans.fit_predict(df2)

cluster_centers = kmeans.cluster_centers_

cluster_centers = pd.DataFrame(kmeans.cluster_centers_, columns=df2.columns)

cluster_centers.head()

In [None]:
first_row_centers = cluster_centers.iloc[0, :].drop('Cluster')


num_features = len(first_row_centers)

theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(first_row_centers.index, rotation=45, ha='right')  # Use the index of first_row_centers

plt.show()

The polar graph for the initial cluster, Cluster 0, is most defined by a high uptake of driving, with the mean value being 0.6 standard deviations above the overall mean. LSOAs in this classification also tend to have a high percentage of detached, owner-occupied dwellings. Public and active transit availability is below the overall average, which provides a plausible reason for the heavy reliance on driving. The numbers of students and social renters are similarly low, such that I would suspect Cluster 0 describes the suburban area toward the study's south-east rim, perhaps largely mirroring the Cluster 3 area present in the prior map. The proportion of men in the population sits precisely on the mean, which also speaks to an environment more defined by conventional family households. Critically, however, Cluster 0 challenges the association between transport accessibility and deprivation hypothesised by my second research question. A low proportion of social renters and a high concentration of detached houses suggest that areas classified as part of Cluster 0 are toward the affluent side irrespective of the lack of alternative transit options.

![i](i.png)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

first_row_centers = cluster_centers.iloc[1, :].drop('Cluster')

num_features = len(first_row_centers)

theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns[:-1], rotation=45, ha='right')  # Exclude 'Cluster' column

plt.show()

This second graph depicts Cluster 1, which shares several attributes with Cluster 0 in that they point toward affluence. A high proportion of those residing in Cluster 1 LSOAs are owner-occupiers, while the mean number of those possessing a second home overseas is a standard deviation above the overall mean. The difference arises from the higher-than-average availability of bike hire and storage facilities, in addition to the presence of Tube infrastructure. In this sense Cluster 1 appears to describe wealthy, well-connected neighbourhoods which have seen their dependency on private automobile transport falter on account of alternative transit options.

![i1](i1.png)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

first_row_centers = cluster_centers.iloc[2, :].drop('Cluster')

num_features = len(first_row_centers)

theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_rticks(np.linspace(first_row_centers.min(), first_row_centers.max(), num=6))

ax.set_xticks(theta)
ax.set_xticklabels(cluster_centers.columns[:-1], rotation=45, ha='right')  # Use feature names as tick labels

plt.show()

The third graph depicts Cluster 4, which is characterised by high multimodal accessibility and low driving uptake, as was the case in the prior cluster. Both ownership and social renting are below the mean, suggesting a scenario of high private renting, with part-time work uptake being far below the overall mean. The male population is also roughly 0.4 standard deviations above the average, suggesting an area perhaps occupied by a large economic-migrant population, an inference made more plausible by the low ‘part-time employed’ figure. The comparatively low share of detached dwellings, in addition to the transport accessibility, also suggests that Cluster 4 is most probably concentrated in the central core of the study area, most unlike Cluster 0.

![i2](i2.png)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

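# Drop the 'Cluster' column so only the input features are plotted, as in the cells above
first_row_centers = cluster_centers.iloc[3, :].drop('Cluster')

num_features = len(first_row_centers)

# endpoint=False stops the first and last features sharing the same angle
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)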

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(first_row_centers.index, rotation=45, ha='right')  # Use index as tick labels

plt.show()

Cluster 2 is depicted by the fourth graph. Cluster 2 describes areas which match those described in the ‘context and literature’ section of my study. Tube and cycling accessibility is moderately below the study-wide mean, while the social renting attribute is nearly 2 standard deviations above the mean. Unemployment is also high, with residents of LSOAs within this cluster holding a greater propensity to drive and to work on a part-time basis. These metrics point toward Cluster 2 areas being comparatively geographically isolated, with an under-provision of mobility options reinforcing worse socio-economic prospects. Such areas also have a male population 0.5 standard deviations below the average, which perhaps evidences a prevalence of single-parent households, which could in turn reinforce these economic conditions.

![i3](i3.png)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

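# Drop the 'Cluster' column so only the input features are plotted, as in the cells above
first_row_centers = cluster_centers.iloc[4, :].drop('Cluster')

num_features = len(first_row_centers)

# endpoint=False stops the first and last features sharing the same angle
theta = np.linspace(0, 2 * np.pi, num_features, endpoint=False)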

fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
ax.plot(theta, first_row_centers, linewidth=1, color='blue', marker='o', label='Centers')

ax.plot(theta, np.zeros_like(first_row_centers), color='red', linestyle='--', label='Average')

ax.set_xticks(theta)
ax.set_xticklabels(first_row_centers.index, rotation=45, ha='right')  # Use index as tick labels

plt.show()

The fifth graph depicts Cluster 3, whose attributes largely describe a more transient, global population base. The three mobility-orientated statistics point toward accessibility far higher than in the other clusters, with each exceeding the average by 2.5 standard deviations. The Student and Oversees address values are similarly high, pointing toward a wealthy, youthful population. Additionally, given that the Driver value is more than one standard deviation below the mean, it seems likely that the focus of Cluster 3 sits in an area boasting immense connectivity, likely in the heart of London’s West End.

![i4](i4.png)

In [None]:
df2.dtypes

In [None]:
df2.drop([
 'Mobile Worker', 'Unemployed',
 'Student',
 'Disabled',
 'Social Renter',
    'Owner',
 'Christian',
 'Oversees address',
 'Male', 
 'Detatched',
 'Driver','Part-time',
                 'BikenearOA','ParkperOA', 'TubenearOA'
 
 ], axis=1, inplace=True)
df2.info()

In [None]:
df2.tail()

In [None]:
finad = pd.concat([df_boss, df2], axis=1, ignore_index=False)
finad.tail()

In [None]:
finag = gpd.GeoDataFrame(finad, geometry='geometry_x', crs="EPSG:27700")

In [None]:
finag.dtypes

In [None]:
finag.explore(column='Cluster', cmap='tab20', tiles='CartoDB positron')

![j](j.png)

__Final Map__

The final map outcome demonstrates how the various census and mobility measures intersect spatially. Striking is the way in which four of the five clusters form reasonably clear core agglomerations occupying the four cardinal directions from central London. Cluster 1 is most prominently agglomerated to the immediate west of London’s West End, centred upon areas of considerable wealth such as Chelsea, Kensington and Fulham. Outlying bands also form an axis encompassing upscale residential neighbourhoods from Marylebone north toward Hampstead Heath. Desirable exclaves within South London such as Clapham and Dulwich also fall within Cluster 1. For the most part these areas are synonymous with locales frequented by 19th-century high ‘society’, being affluent period neighbourhoods in which much of London’s wealth has been concentrated for generations. Consequently, much of London’s ‘deep-level’ Underground construction was centred on these neighbourhoods, running along a north-west orientated axis. Reference to the associated polar graph allows for an understanding of this area as being broadly well connected to public and active transport links, while holding wealth in excess of adjacent clusters on the grounds of high owner-occupation and very low social renting rates. It would make sense to infer that those residing in Cluster 1 areas are exposed to great opportunity through their ease of mobility and proximity to various critical Inner London hubs, prompting a self-perpetuating concentration of wealth in these areas.

Cluster 2 forms a prominent agglomeration to the immediate south of central London, centred on the south-east running Old Kent Road. This area has historically been among the least connected in Inner London, being what could be described as a ‘Public Transport Desert’ on account of its lack of mobility options. Typically, mobility through the area is funnelled along Old Kent Road, which runs into Central London. However, heavy reliance on buses increases journey times and reduces reliability, with congestion along the arterial roads in this area being particularly bad and negatively impacting a resident population which is among the most deprived in the UK. Several large social housing schemes are located in this area, most notably the Aylesbury Estate, which houses approximately 7,500 residents. Reference to the polar graph highlights the degree to which areas classified within Cluster 2 are distinguished by a greater-than-average ‘Social Renter’ value. This would explain why, unlike the other four ‘core’ cluster groups, Cluster 2 is more spatially dispersed, with various smaller agglomerations across the study area most likely marking the location of prominent social housing schemes.

Cluster 3 is centred on the ‘core activity areas’ of London, consisting of a prominent agglomeration over the West End and the City of London. Outlying areas are also present, focused on the regenerated business districts of Stratford and Canary Wharf as well as university campuses. Areas under this classification have the strongest tendency toward outstanding transportation links, often having multiple Tube stops. Interestingly, these areas also mirror much of the current extent of the Santander Cycle scheme rollout, perhaps in association with the fact that they contain various highly trafficked tourist sites. In terms of connectivity, then, Cluster 3 areas parallel Cluster 1 LSOAs. From a demographic perspective it seems likely that residents of these areas are similarly wealthy yet younger, being predominantly students and young professionals.

Cluster 4 holds a strong eastward bias, being situated over much of London’s traditional East End and spanning from the boundary between the City and Tower Hamlets as far east as the River Lea. This area is among the most multicultural in London, being home to a large South Asian community. The most notable characteristics of this cluster are its low number of Christians and its overrepresentation of males, traits which support the idea that this area houses a considerable number of immigrants. Driving uptake is low, most likely as a consequence of alternative travel options, yet the polar graph suggests that Cluster 4 is most distinguished from the other clusters by its identity-based demographic differences as opposed to any trend between socio-economics and accessibility. It is, however, worth bearing in mind that Tower Hamlets is among the most deprived local authorities in the UK, and that Tube accessibility is only 0.1 standard deviations above the mean of a wider area which includes vast swathes of Southeast London completely devoid of Underground connections, potentially distorting the extent to which trends can be observed.

Cluster 0 is quite distinct from the other clusters in that it isn’t grounded in a focal location; rather, it forms a broad band around the characteristically more ‘urban’ components of inner London. A quirk of the study area is that it holds a slight eastern bias, with several boroughs considered part of Inner London toward the Southeast extending into more modern suburban development. The Boroughs of Lewisham and Greenwich, for instance, which are considered ‘inner London’, share characteristics with their Outer London neighbours in that they largely developed because of suburbanisation in the 1930s, never quite having the connectivity or proximity to Westminster which granted favoured status to London’s north-west among the affluent. Much of this area has historically been industrial, with the eastward prevailing wind halting development. Thus Cluster 0 is most pertinently defined not by demographic variables but by a lack of active travel and London Underground connections. This cluster sees mobility patterns distinct from the rest of the study area, with driving having a strong prevalence. This makes sense, as the peripheral location of this cluster corresponds with the more recent development of its road network and thus increased ease of transportation. However, despite having a higher-than-average Owner population, which is a reasonable surrogate for wealth, air pollution is typically among the worst in Outer London, attributed to high vehicular traffic. As such Cluster 0 presents an interesting counter-example to the geographic distribution of wealth in Victorian London, in that it stands as a product of the modern proliferation of commuter railways such that adjacency to one's place of work is no longer a critical factor in one's home location. The middle classes are afforded an opportunity to suburbanise, yet at the cost of accessibility to potentially healthier travel solutions.

Upon observation of the final map it is interesting to see how drastically specific tube stations can shift the balance from one cluster to the next. Clusters 0 and 4 largely intermingle toward the north-eastern peripheries of the study location, with concentrated LSOA enclaves classed as Cluster 4 largely being immediately adjacent to mass transit stations. Finsbury Park and Upton Park, in North and East London respectively, are interesting examples. Similarly, Cluster 3 extends some distance south-west in a haphazard fashion, mirroring exactly the route of the Northern line's southernmost branch and highlighting the role it plays both as a nexus of connectivity in this area and as a facilitator of middle-class wealth concentration.


In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(30, 15))  

finag.plot(ax=axs[0], column='Cluster', cmap='tab20', legend=True)
axs[0].set_title('After Mobility')

final.plot(ax=axs[1], column='Cluster', cmap='tab20', legend=True)  
axs[1].set_title('Before Mobility')  

plt.tight_layout()  
plt.show()

![j1](j1.png)

The above visualisation shows the distinction between the two cluster maps before and after the inclusion of the aggregated mobility values. Please note that the cluster numbering differs between the two runs, so the comparison is indicative only.
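
Because the cluster numbers are assigned arbitrarily in each run, one way to compare the two maps more systematically is a cross-tabulation of the two sets of labels. The sketch below uses six made-up areas; the `before` and `after` series are hypothetical stand-ins for the `Cluster` columns of `final` and `finag`.

```python
import pandas as pd

# Hypothetical labels for six areas from two separate K-means runs.
before = pd.Series([0, 0, 1, 1, 2, 2], name="Before Mobility")
after = pd.Series([2, 2, 0, 0, 1, 0], name="After Mobility")

# Rows are the first run's clusters, columns the second run's clusters.
overlap = pd.crosstab(before, after)
print(overlap)
```

A row dominated by a single column indicates a cluster that survived largely intact under a new number, while a row spread across several columns indicates a cluster that was broken up once the mobility variables were added.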

When considering the two initial research questions listed below, one has to consider both the above cluster maps and the relationships between the various variables.

__In what fashion are historically embedded spatial inequalities in transport provision replicated across a contemporary 21st century London?__

__Does adjacency to active and public transport links influence deprivation outcomes, and how do the implications differ with demographic factors?__

Observation of the cluster maps underlines that public and active travel provision is indeed largely unequal, on the grounds that London's Tube network is largely a product of the socio-economic conditions which shaped it in the late 19th century. Much of the system's growth and expansion occurred over this period, with successive expansions being largely piecemeal and skewed toward existing lines. The identification of two clusters biased toward the South East, Clusters 0 and 2, as lacking sustainable transit options affirms a strict geographic disparity in transport provision along a clear axis. However, where Cluster 0 is largely a product of suburbanisation, with car ownership and affluence being comparatively high, Cluster 2 has remained an agglomeration of relative poverty, with unemployment and social renting being high. As such we can see a city divided not just by transport provision but by how public transport is received and utilised. The identification of several types of public-transport-deficient environment shows that deprivation depends not just on transport provision, access to which helps specifically in inner-city neighbourhoods and has helped certain East London (Cluster 4) communities to thrive compared with some in Cluster 2, but also on attitudes toward the role public transport takes in one's life. Cluster 0, on account of its car-centric urban grain, would perhaps be less likely to see considerable active transit participation even if such provision were rolled out.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

# lmplot returns a seaborn FacetGrid (not a matplotlib Axes), hence the name `g`
g = sns.lmplot(data=finag, x="ParkperOA", y="Driver", hue="Cluster", palette='Set1', scatter_kws={'alpha':0.5}, ci=None, height=8, aspect=1.5)

g._legend.set_title('Cluster')

plt.show()

![z](z.png)

The above plot shows the relationship between cycle parking infrastructure (ParkperOA) and the percentage of residents who drive to work (Driver), by LSOA. A clear overall relationship emerges: the areas with the highest ParkperOA values have low 'Driver' values, and conversely LSOAs with high Driver values have little cycle parking.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

ax = sns.lmplot(data=finag, x="ParkperOA", y="Unemployed", scatter_kws={'alpha':0.5}, ci=None, height=8, aspect=1.5)

plt.show()

![z1](z1.png)

Overall, Unemployed and ParkperOA are only weakly correlated; however, all LSOAs exceeding 50 cycle spaces have an unemployment rate below 15%, presenting a scenario where the most economically prosperous areas demonstrate a willingness to install infrastructure facilitating active travel.
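Both parts of this reading, the weak overall correlation and the threshold observation, can be checked numerically rather than by eye. The following is a minimal sketch using made-up values: `pearson_r` is a helper written for illustration and the two lists are hypothetical stand-ins for the ParkperOA and Unemployed columns, not the study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for seven LSOAs (cycle spaces, unemployment rate in %).
park_per_oa = [2, 5, 8, 12, 30, 55, 70]
unemployed = [22.0, 9.0, 18.0, 6.0, 14.0, 8.0, 11.0]

r = pearson_r(park_per_oa, unemployed)

# Threshold check: do all areas with more than 50 spaces sit below 15%?
high_parking = [u for p, u in zip(park_per_oa, unemployed) if p > 50]
all_below_15 = all(u < 15 for u in high_parking)

print(f"Pearson r = {r:.2f}; all high-parking LSOAs below 15%: {all_below_15}")
```

On the notebook's DataFrame the same quantities could presumably be obtained with `finag['ParkperOA'].corr(finag['Unemployed'])` and a boolean filter on `finag['ParkperOA'] > 50`.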

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
# Plot the 'Unemployed' column so the data matches the axis label and title below
sns.boxplot(data=finag, x='Cluster', y='Unemployed', palette='Set1')

plt.xlabel('Cluster')
plt.ylabel('Unemployed')
plt.title('Unemployment by Cluster')

plt.show()

![z2](z2.png)

The plot above shows that the cluster with the greatest median unemployment, Cluster 2, also has among the worst mobility metrics, alongside Cluster 0.
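The medians that the boxplot displays can also be tabulated directly, which makes the ranking of clusters explicit. This is a small sketch under stated assumptions: the cluster labels and unemployment rates are invented for illustration, and `median_by_cluster` is a hypothetical helper rather than anything defined in the notebook.

```python
from collections import defaultdict
from statistics import median


def median_by_cluster(clusters, values):
    """Group values by cluster label and return the median of each group."""
    groups = defaultdict(list)
    for cluster, value in zip(clusters, values):
        groups[cluster].append(value)
    return {cluster: median(vals) for cluster, vals in sorted(groups.items())}


# Hypothetical cluster labels and unemployment rates (%) for eight LSOAs.
clusters = [0, 0, 0, 2, 2, 2, 4, 4]
unemployed = [5.0, 6.0, 7.0, 12.0, 14.0, 16.0, 8.0, 10.0]

medians = median_by_cluster(clusters, unemployed)
highest = max(medians, key=medians.get)

print(medians)
print(f"Cluster with the highest median unemployment: {highest}")
```

With pandas, `finag.groupby('Cluster')['Unemployed'].median()` should produce the same kind of summary for the real data.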

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

ax = sns.lmplot(data=finag, x="ParkperOA", y="TubenearOA", scatter_kws={'alpha':0.5}, ci=None, height=8, aspect=1.5)

plt.show()

![z3](z3.png)

As a final point of note, there is a remarkably strong correlation between the number of cycle parking facilities and the number of Tube stations within an LSOA. This makes sense, as placing cycling facilities adjacent to a metro stop allows one to complete a sustainable multi-modal journey, which serves as a solution to the 'last mile' problem. The problem is that it is often hard for transport planners to steer transit users toward similarly efficient modes if they reside beyond reasonable walking distance of their metro station. However, the installation of parking facilities can encourage the use of bikes in tandem with the Tube, so that users complete a fully sustainable journey from doorstep to doorstep. It seems, then, that areas with existing Tube access benefit from a propensity toward more cycling infrastructure, highlighting how 19th-century inequalities in line construction have influenced mobility patterns in perhaps unappreciated ways beyond just the Tube itself. In order to create a fully equitable mobility paradigm for the 21st century, we ought to acknowledge such quirks and reframe the way in which we view the role of specific modes. Where else besides Tube stations would benefit from a network of cycle parking infrastructure, for instance?
