# Challenge 1: Geodemographic Classification

In this challenge, you will replicate the process of creating a geodemographic classification using the k-means clustering algorithm. Please select any city in the UK except London, Liverpool, or Glasgow. The main goal is to generate a meaningful and informative classification that captures the diversity of areas in your dataset using census data (for England, you can use the 2021 or 2011 census; for Scotland, you need to use the 2011 census data).

1. Define the main goal for the geodemographic classification (marketing, retail and service planning). 
2. Look for census data from the selected city for which you would like to generate the geodemographic classification.
3. Use census data at the Output Area (OA) level. Select at least four topics (socio-demographics, economics, health, and so on). Justify your topic selection based on the goal of your geodemographic classification. For example, if your geodemographics are related to marketing, economic variables might be the appropriate selection.
4. Identify the variables that will be crucial for effectively segmenting neighbourhoods. Evaluate how this choice may impact the classification results, including an exploratory data analysis (EDA).
5. Prepare, adjust or clean the dataset addressing any missing values or outliers that could distort the clustering results.
6. Include standardisation between areas and variables. Make an appropriate analysis and adjust the variable selection accordingly for any multicollinearity.
7. Utilize the k-means clustering algorithm to create a classification based on the selected variables.
8. Define the optimum number of clusters (e.g., using the Elbow method). Experiment with different values of k.
9. Evaluate your cluster groups (e.g., using PCA) and interpret your cluster centres. Describe your results and repeat the process to adjust the variable selection and cluster groups to provide more meaningful results for your geodemographic goal. Interpret the characteristics of each cluster. What demographic patterns or similarities are prevalent within each group?
10. Map the final cluster groups.
11. Finish the analysis by naming the final clusters and plotting a final map that includes the census values and the provided names.
12. Finally, acknowledge the subjective nature of classification and make analytical decisions to produce an optimum classification for your specific purpose. Reflect on the challenges and insights gained during the classification process. Ensure you document your analytical decisions and the rationale behind any important decision. Once your geodemographics are constructed, describe the potential use cases for the geodemographic classification you have built based on your initial goal.
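
Steps 6 to 8 can be sketched as follows. This is a minimal illustration on synthetic data, not the analysis itself: the column names, the 0.8 correlation threshold and the range of k are all assumptions made for the example, and scikit-learn is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for OA-level percentages; the column names are hypothetical.
rng = np.random.default_rng(42)
n_areas = 300
demo = pd.DataFrame({
    "pct_bad_health": rng.normal(6, 2, n_areas),
    "pct_unpaid_care": rng.normal(9, 3, n_areas),
    "pct_no_qualifications": rng.normal(15, 5, n_areas),
})
# A deliberately redundant variable, to show the multicollinearity check.
demo["pct_very_bad_health"] = demo["pct_bad_health"] * 0.3 + rng.normal(0, 0.1, n_areas)

# Step 6: flag one variable from each highly correlated pair
# (|r| > 0.8 is an arbitrary threshold chosen for this sketch).
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
selected = demo.drop(columns=to_drop)

# Step 6: z-score standardisation so every variable has mean 0 and unit variance.
scaled = StandardScaler().fit_transform(selected)

# Steps 7-8: fit k-means for several values of k and record the inertia
# (within-cluster sum of squares) used by the Elbow method.
inertias = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting `inertias` against k and looking for the point where the curve flattens gives a candidate number of clusters. The choice still needs checking against how interpretable the resulting cluster centres are.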

## Geodemographic classification

I will use geodemographic classification to assess differences in standards of health across different neighbourhoods in Edinburgh. This is important to understand whether service provision in different areas is sufficient. 

To do this, the classification will draw on lifestyle, household, health and economic variables to identify neighbourhoods with worse health. I will draw on a previous, similar study that applies geodemographics to public health in Greater London (Peterson et al., 2007) to inform my variable selection and analysis.

I will use output area data from the 2022 Scotland Census, which can be found here: https://www.scotlandscensus.gov.uk/documents/2022-output-area-data/

I will use data from the following tables to undertake the geodemographic classification:
* UV104 - Marital and civil partnership status
* UV301 - Provision of unpaid care
* UV302 - General health
* UV303 - Disability
* UV304 - Long term health conditions
* UV407 - Central heating
* UV501 - Highest level of qualification
* UV601 - Economic activity
* UV604 - Hours worked
* UV606 - Occupation



### Preparing the data 

In [4]:
cd UA2

/Users/milliemccallum/Documents/UA2


In [5]:
import os

import pandas as pd

csv_directory = "Data/data_6/challenge_data/census_2022/"

# List all CSVs in that folder (sorted so the column order is reproducible)
csv_files = sorted(file for file in os.listdir(csv_directory) if file.endswith(".csv"))

# Read each CSV file into its own DataFrame
tables = []
for csv_file in csv_files:
    csv_path = os.path.join(csv_directory, csv_file)  # Build the full path
    tables.append(pd.read_csv(csv_path, low_memory=False))

# Concatenate column-wise in a single call. Note: axis=1 aligns rows by
# position, so this assumes every table lists the output areas in the same
# order; columns shared between tables are repeated once per table.
merged_data = pd.concat(tables, axis=1)

# Save the merged dataset
merged_data.to_csv("Data/data_6/challenge_data/new_merged_census_data.csv", index=False)
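
The cell above joins the tables side by side with `pd.concat(..., axis=1)`, which pairs rows by position and repeats any shared column. A possible alternative, sketched here on two tiny stand-in tables, is to merge on the output area code so that rows are matched by key. It assumes every census table carries an `oa_code` column (the name used in the later merge); the codes and values below are made up for illustration.

```python
from functools import reduce

import pandas as pd

# Two tiny stand-in tables; real census tables would be read with pd.read_csv.
# The OA codes and counts are invented for this example.
health = pd.DataFrame({
    "oa_code": ["S001", "S002", "S003"],
    "Very good health": [50, 40, 30],
})
heating = pd.DataFrame({
    "oa_code": ["S003", "S001", "S002"],  # deliberately in a different order
    "No central heating": [1, 2, 3],
})

tables = [health, heating]

# Merge on the shared key so rows are matched by OA code, not by position,
# and the key column appears only once in the result.
merged = reduce(
    lambda left, right: left.merge(right, on="oa_code", how="outer"),
    tables,
)
print(merged)
```

An outer join keeps output areas that are missing from one table, so any gaps show up as `NaN` and can be handled explicitly in the cleaning step.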

In [10]:
# Import the shapefile of UK local authority districts and select only Edinburgh

import geopandas as gpd

LAD_path = "Data/data_6/challenge_data/edinburgh_oa/LAD_MAY_2024_UK_BFE.shp"
LAD = gpd.read_file(LAD_path)

lad_edinburgh = LAD[LAD["LAD24NM"]=="City of Edinburgh"]
lad_edinburgh.head()

| | LAD24CD | LAD24NM | LAD24NMW | BNG_E | BNG_N | LONG | LAT | geometry |
|---|---|---|---|---|---|---|---|---|
| 328 | S12000036 | City of Edinburgh | | 320193 | 669417 | -3.27826 | 55.9112 | MULTIPOLYGON (((313649.660 679534.410, 313650.... |


In [None]:
# Load the output area shapefile from the Scotland 2022 census
oa_path = "Data/data_6/challenge_data/edinburgh_oa/OutputArea2022_EoR.shp"
oa_census = gpd.read_file(oa_path)

# Clip to the Edinburgh local authority boundary so only Edinburgh output areas remain
edinburgh_oa = oa_census.clip(lad_edinburgh)

edinburgh_oa.to_file("Data/data_6/challenge_data/edinburgh_oa/edinburgh_oa.shp")

In [22]:
# Merge the OA shapefile with the census data so the census data covers only Edinburgh at output area level
edinburgh_path = "Data/data_6/challenge_data/edinburgh_oa/edinburgh_oa.shp"
edinburgh = gpd.read_file(edinburgh_path)

csv_path = "Data/data_6/challenge_data/new_merged_census_data.csv"
csv_data = pd.read_csv(csv_path, low_memory=False)

merged_data = edinburgh.merge(csv_data, left_on='code', right_on='oa_code', how='left')

merged_data.to_file('Data/data_6/challenge_data/edinburgh_census_data.shp', index=False)

SchemaError: Too many field names like 'Economically Active full-time students - Self-employed with employees - Part-time.2' when truncated to 10 letters for Shapefile format.
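
The error arises because the Shapefile format limits field names to 10 characters, so many long census column names collapse to the same truncated name. One possible workaround, sketched below, is to rename the columns to short unique names before writing and keep the lookup for later interpretation. `shorten_field_names` is a hypothetical helper written for this example, not part of pandas or geopandas. Another option is to write to a format without this limit, such as GeoPackage (`to_file(..., driver="GPKG")`).

```python
def shorten_field_names(names, max_len=10):
    """Map each long column name to a unique name of at most max_len characters."""
    mapping = {}
    used = set()
    for name in names:
        # Keep only letters and digits, then truncate to the length limit.
        base = "".join(ch for ch in name if ch.isalnum())[:max_len] or "col"
        candidate = base
        counter = 1
        # On a clash, replace the tail with a running number until unique.
        while candidate.lower() in used:
            suffix = str(counter)
            candidate = base[: max_len - len(suffix)] + suffix
            counter += 1
        used.add(candidate.lower())
        mapping[name] = candidate
    return mapping


columns = [
    "Economically Active full-time students - Self-employed with employees - Part-time",
    "Economically Active full-time students - Self-employed with employees - Part-time.1",
    "Economically Active full-time students - Self-employed with employees - Part-time.2",
]
lookup = shorten_field_names(columns)
for long_name, short_name in lookup.items():
    print(short_name, "<-", long_name)
```

In the notebook this lookup could be applied with `merged_data.rename(columns=lookup)`, building it from every column except `geometry`. The shortened names are not readable on their own, so the lookup should be saved alongside the shapefile.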

### Selecting input variables

In [None]:
list(merged_data.columns)