Data Exploration

# FROM DATA TO INSIGHTS

## Introduction
This notebook is set up so that it can be run from top to bottom in one go.
Python 3, conda and pip must be installed beforehand.

In [None]:
!python --version
!conda --version
!pip --version

## Install the required packages

In [None]:
!pip install folium==0.12.1
!pip install matplotlib==3.4.3
!pip install numpy==1.21.2
!pip install pandas==1.3.2
!pip install requests==2.26.0
!pip install scikit-learn==0.24.2


In [None]:
import folium
import json
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random
import requests

from IPython.display import display

from pathlib import Path

from sklearn.cluster import DBSCAN
from sklearn.cluster import AffinityPropagation
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA



In [None]:
DATA_FILE = "global_cities_data_set.json"
URL_FILE = "https://iisbvicmidlprdsa.blob.core.windows.net/fileshare/DATA_SET_DS_USE_CASE/global_cities_data_set.json?sv=2019-02-02&st=2021-08-06T08%3A18%3A35Z&se=2021-10-07T08%3A18%3A00Z&sr=b&sp=r&sig=vMOCDzuXhxSM%2BT02Wv3Zm2oW7BsXME2mZCk%2F%2BI5uMSU%3D"

START_FROM_SCRATCH = True

# Filters
REGION_FILTER = 'EUREG'

YEAR_LIST = [2018, 2019, 2020, 2021, 2022, 2023, 2024]

# Clustering hyper parameters
EPS_VALUE = 0.02
MIN_SAMPLES_VALUE = 50
N_CLUSTERS = 10


In [None]:
DTYPES_DICT = {
    'year': np.int32,
    'indicator_name': object,
    'geography_iso': object,
    'geography_country': object,
    'geographyid': object,
    'geographyname': object,
    'value_unit': object,
    'databank': object,
    'value': np.float64
}

In [None]:
FILE_LIST = [
    'Consumer spending by product',
    'Population',
    'Household numbers by income band'
]

In [None]:
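def download_and_read_source_data():
    if START_FROM_SCRATCH:
        # Download the source file and store it locally
        r = requests.get(URL_FILE)
        # Fail early instead of saving an error page as the data file
        r.raise_for_status()
        with open(DATA_FILE, 'wb') as file_object:
            file_object.write(r.content)

    # The context manager closes the file, also when parsing fails
    with open(DATA_FILE, encoding='utf8') as file_object:
        data = json.load(file_object)

    df = pd.json_normalize(data['data'])

    print("df.shape: (all): ", df.shape)
    # Make sure the year field is an integer
    df.year = df.year.astype('int32')

    return df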

## Filtering

In the current setup it's only possible to visualize the data for the EU region (EUREG).

The geographyid is unique for all countries except for the USA. Therefore a combined logical key named geography_region_key is created, consisting of geographyid and geographyname, which is fully unique within the region.
year combined with geography_region_key is a primary key which can be used to merge data.
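
The uniqueness claim can be checked directly on a data frame. The following is a minimal sketch, assuming the `year`, `geographyid` and `geographyname` columns described above; the sample rows and the helper name `count_duplicate_keys` are made up for illustration and are not part of the data set.

```python
import pandas as pd


def count_duplicate_keys(df, key_columns):
    """Return the number of rows whose key columns repeat an earlier row."""
    return int(df.duplicated(subset=key_columns).sum())


# Hypothetical sample: the same geographyid is used for two different regions.
df_sample = pd.DataFrame({
    "year": [2018, 2018, 2018],
    "geographyid": ["X01", "X01", "X02"],
    "geographyname": ["Region A", "Region B", "Region C"],
})

# Same construction as the combined key used in this notebook.
df_sample["geography_region_key"] = (
    df_sample["geographyid"] + "_" + df_sample["geographyname"])

# geographyid alone collides once, the combined key does not collide.
print(count_duplicate_keys(df_sample, ["year", "geographyid"]))
print(count_duplicate_keys(df_sample, ["year", "geography_region_key"]))
```

On the real data the same check would have to be run per indicator, as each indicator contributes its own row per region and year.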

In [None]:
def filter_data(par_df, par_year):
    par_df["geography_region_key"] = par_df["geographyid"] + "_" + par_df["geographyname"]
    par_df = par_df[(par_df['databank'] == REGION_FILTER) & (par_df['year'] == par_year)]
    print("par_df.shape: (" + REGION_FILTER +  " & " + str(par_year) +  "): ", par_df.shape)
    return par_df

## Indicators

The file provided hosts a number of different types of data as can be seen in the indicator_name field.
Some indicators belong together. For example Population per age range.
These indicator_groups are handled separately.

Singular indicators are written to separate files.
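
The split between grouped and singular indicators relies on a name prefix. A small sketch of that idea, using the three group prefixes from this notebook; the indicator names in `names` are hypothetical examples, not a listing of the real data:

```python
# Prefixes that mark an indicator as part of a group (taken from this notebook).
GROUP_PREFIXES = (
    "Household numbers by income band",
    "Population",
    "Consumer spending by product",
)

# Hypothetical indicator names, for illustration only.
names = [
    "Population - age band 1",
    "Total population",
    "Consumer spending by product / service - Food",
    "Average household size",
]

# str.startswith accepts a tuple and matches if any of the prefixes matches.
grouped = [name for name in names if name.startswith(GROUP_PREFIXES)]
singular = [name for name in names if not name.startswith(GROUP_PREFIXES)]

print("grouped: ", grouped)
print("singular:", singular)
```

Note that the match is on the start of the name only, so a name such as "Total population" is treated as a singular indicator.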

In [None]:
def split_indicators(par_data_dir_name, par_df_data):
    # Some indicators are grouped by a common name prefix.
    # A tuple is used so it can be passed to str.startswith directly.
    indicator_groups = (
        'Household numbers by income band',
        'Population',
        'Consumer spending by product'
    )

    # All indicators that do not belong to one of the groups
    other_indicators = []

    for word in par_df_data.indicator_name.unique():
        if not word.startswith(indicator_groups):
            other_indicators.append(word)

    # Create separate files for indicators.
    for indicator in other_indicators:
        df_filtered = par_df_data[(par_df_data['indicator_name'] == indicator)]
        filtered_file_name = \
            par_data_dir_name + os.path.sep + indicator.replace(" ", "_"). \
        replace(",", "_").replace("/", "_") + '.csv'
        df_filtered.to_csv(filtered_file_name, sep=";", encoding="utf-8")

    # Group some indicators into one file.
    for indicator_group in indicator_groups:
        df_filtered = par_df_data[(
            par_df_data['indicator_name'].str.startswith(indicator_group))]
        filtered_file_name = \
            par_data_dir_name + os.path.sep + indicator_group + '.csv'
        df_filtered.to_csv(filtered_file_name, sep=";", encoding="utf-8")

    par_df_data.to_csv(par_data_dir_name + os.path.sep + "total_set.csv",
                       sep=";",
                       encoding="utf-8")

## Indicator groups

Now process the indicator groups. Different bands of the same kind of data are put into one file for further processing.

As the value_unit might not be the same we can't compare the data in its original form.
For each band a ratio is calculated to indicate what proportion of total this band represents.
This makes it possible to compare the data no matter the country.
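
As an illustration of the ratio, the sketch below computes it with a vectorised `groupby(...).transform("sum")`. This is an alternative formulation under the assumption that the ratio is the band value divided by the total per year and region; the sample values are made up.

```python
import pandas as pd

# Hypothetical values for two regions with two bands each.
df_bands = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018],
    "geography_region_key": ["R1", "R1", "R2", "R2"],
    "indicator_name": ["band A", "band B", "band A", "band B"],
    "value": [30.0, 70.0, 1.0, 3.0],
})

# Total per (year, region), broadcast back onto every row of that group.
totals = df_bands.groupby(
    ["year", "geography_region_key"])["value"].transform("sum")

# Share of each band in the total of its region; 0/0 results become 0.
df_bands["ratio"] = (df_bands["value"] / totals).fillna(0)

print(df_bands[["geography_region_key", "indicator_name", "ratio"]])
```

Although the absolute values of R1 and R2 differ by more than an order of magnitude, their ratios (0.3 and 0.7 versus 0.25 and 0.75) are directly comparable.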

In [None]:
def generate_grouped_indicator_files(par_data_dir_name, par_file_item):

    df_data = pd.read_csv(
        par_data_dir_name + os.path.sep + par_file_item + ".csv",
        sep=";",
        encoding="utf8",
        dtype=DTYPES_DICT)

    print("shape: ", df_data.shape)

    # Remove unwanted columns when grouping
    df_sum = df_data.loc[:, ("geography_region_key", "year", "value")]

    # Sum values
    df_grouped = df_sum.groupby(by=['year', 'geography_region_key']).sum()
    # Back to a data frame
    df_sum = df_grouped.reset_index()

    def calculate_ratio(par_year, par_geography_region_key, par_value):
        df_filtered_sum = df_sum[(df_sum['year'] == par_year) &
            (df_sum['geography_region_key'] == par_geography_region_key)].sum()
        return par_value / df_filtered_sum.values[2]

    df_data['ratio'] = df_data.apply(
            lambda row : calculate_ratio(
                row['year'],
                row['geography_region_key'],
                row['value']), axis = 1)

    df_data['ratio'] = df_data['ratio'].fillna(0)

    print("shape: ", df_data.shape)

    # Use the function parameter, not the global loop variable
    df_data.to_csv(par_data_dir_name + os.path.sep + par_file_item + "_ext.csv",
        sep=";",
        encoding="utf8")

    print("End " + par_file_item)


In [None]:
def generate_rows_with_grouped_indicators(par_data_dir_name, par_file_item):

    df_data = pd.read_csv(
        par_data_dir_name + os.path.sep + par_file_item + "_ext.csv",
        sep=";",
        encoding="utf8")

    print("shape: ", df_data.shape)

    column_names = []
    df_data_ext = pd.DataFrame()

    indicator_names = df_data.indicator_name.unique()
    for indicator_name in indicator_names:
        # Copy the selection so a column can be added without a SettingWithCopyWarning
        df_select = df_data[df_data.indicator_name == indicator_name].copy()
        column_name = indicator_name. \
            replace("resident", ""). \
            replace("based", ""). \
            replace("current", ""). \
            replace("prices", ""). \
            replace("(", ""). \
            replace(")", ""). \
            replace("Consumer spending by product / service - ", ""). \
            replace("Household numbers by income band - ", ""). \
            replace(",", ""). \
            replace(" ", "_"). \
            replace("-", "_"). \
            replace("____", ""). \
            replace("__", "_"). \
            lower()
        #print("column_name: ", column_name)
        column_names.append(column_name)
        df_select[column_name] = df_select['ratio']
        df_select = df_select.loc[:, ("geographyid", "geography_region_key", "year", column_name)]

        if (len(df_data_ext) == 0):
            df_data_ext = df_select
        else:
            df_data_ext = df_data_ext.merge(
                right=df_select,
                on=["geographyid", "geography_region_key", "year"],
                how="outer")

        #print("Shape: ", df_data_ext.shape)

    df_data_ext.fillna(0, inplace=True)

    df_data_ext.to_csv(par_data_dir_name + os.path.sep + par_file_item + "_ext2.csv",
        sep=";",
        encoding="utf8")

    print("End " + par_file_item)


### Preprocess data

Download the data and preprocess it for each year.

In [None]:
def preprocess_data(par_data_dir_name, par_file_item, par_df_filtered):
    print("Process :", par_file_item)

    # Use the function parameters rather than the global loop variables
    split_indicators(par_data_dir_name, par_df_filtered)
    generate_grouped_indicator_files(par_data_dir_name, par_file_item)
    generate_rows_with_grouped_indicators(par_data_dir_name, par_file_item)

## Data exploration

Now we have a set of different files, one file for each indicator or indicator group. Let's look at the data in more detail.

### Primary data points

Some of the data are the primary data points. These can be divided into grouped and non-grouped indicators.


### Non-grouped indicators

| indicator_name                                                        | indicator_type | value_unit      | value_type | regions          | comment                        |
|-----------------------------------------------------------------------|-----------------|-----------------|------------|------------------|--------------------------------|
| Average_household_size                                                | demographics    | #Persons        | float      | AFR, EUREG, GCFS |                                |
| Births                                                                | demographics    | #Persons        | float      | AFR, GCFS        | how to interpret? Aggregations |
| CREA_house_price_index                                                | housing         | Index           | float      | AMREG            | CAN                            |
| Deaths                                                                | demographics    | #Persons        | float      | AFR, GCFS        | how to interpret?              |
| Employment_-_Industry                                                 | employment      | #Persons        | float      | AFR, GCFS        | not complete, how to interpret |
| Employment_-_Transport__storage__information_&_communication_services | employment      | #Persons        | float      | AFR, GCFS        | how to interpret, not complete |
| Gross_domestic_product__real                                          | gdp             | currency        | float      | EUREG, AMREG     |                                |
| Homeownership_rate                                                    | housing         | %               | float      | AMREG            | USA                            |
| Household_disposable_income__per_household__nominal                   | housing         | currency        | float      | EUREG            |                                |
| Household_disposable_income__per_household__real                      | housing         | currency        | float      | EUREG            |                                |
| Household_disposable_income__real                                     | housing         | currency        | float      | EUREG            |                                |
| Housing_permits_-_multi_family                                        | housing         | Housing permits | float      | AMREG            | USA                            |
| Housing_permits_-_single_family                                       | housing         | Housing permits | float      | AMREG            | USA                            |
| Housing_permits_-_total                                               | housing         | Housing permits | float      | AMREG            | USA                            |
| Housing_starts                                                        | housing         | null            | float      | AMREG            | CAN, how to interpret?         |
| Housing_starts_-_multi_family                                         | housing         | Housing starts  | float      | AMREG            | USA                            |
| Housing_starts_-_single_family                                        | housing         | Housing starts  | float      | AMREG            | USA                            |
| Housing_starts_-_total                                                | housing         | Housing starts  | float      | AMREG            | USA                            |
| Income_from_employment__nominal                                       | income          | currency        | float      | AMREG            | USA                            |
| Income_from_rent__dividends_and_interest__nominal                     | income          | currency        | float      | AMREG            | USA                            |
| Income_taxes__nominal                                                 | income          | currency        | float      | AMREG            | USA                            |
| Labor_force                                                           | employment      | #Persons        | float      | AMREG            | USA, CAN                       |
| Labor_force_participation_rate                                        | employment      | %               | float      | AMREG            | USA                            |
| Labour_force_participation_rate                                       | employment      | %               | float      | AMREG            | CAN                            |
| Median_household_income__real                                         | income          | currency        | float      | AMREG            | USA                            |
| Net_migration_(including_statistical_adjustment)                      | demographics    | #Persons        | float      | AFR, GCFS        | can be both negative and positive |
| New_housing_price_index                                               | housing         | Index           | float      | AMREG            | CAN                            |
| Personal_disposable_income__per_capita__real                          | income          | currency        | float      | AMREG            | USA, CAN                       |
| Personal_disposable_income__per_household__real                       | income          | currency        | float      | AMREG            | USA, CAN                       |
| Personal_income__per_capita__real                                     | income          | currency        | float      | AMREG            | USA, CAN                       |
| Personal_income__per_household__real                                  | income          | currency        | float      | AMREG            | USA, CAN                       |
| Proprietors_incomes__nominal                                          | income          | currency        | float      | AMREG            | USA                            |
| Residential_building_permits                                          | housing         | null            | float      | AMREG            | CAN                            |
| Social_security_payments__nominal                                     | income          | currency        | float      | AMREG            | USA                            |
| Total_households                                                      | housing         | #Households     | float      | All              |                                |
| Total_population                                                      | demographics    | #Persons        | float      | All              |                                |
| Unemployment_level                                                    | unemployment    | #Persons        | float      | AMREG            | USA, CAN                       |
| Unemployment_rate                                                     | unemployment    | %               | float      | AMREG            | USA, CAN                       |
| Urban_Total_Population                                                | demographics    | #Persons        | float      | All              |                                |

<br/>

### Grouped indicators

| indicator_name                    | indicator_type  | value_unit  | value_type | regions | comment                                             |
|-----------------------------------|-----------------|-------------|------------|---------|-----------------------------------------------------|
| Population*                       | demographics    | #Persons    | float      | All     |                                                     |
| Consumer spending by product*     | spending        | currency    | float      | All     | value_unit contains : empty, null                   |
| Household numbers by income band* | income          | #Households | float      | All     | value contains float values very big and very small |

<br/>

### Secondary data points

There's a set of secondary data points that describe the primary data points in terms of a number of facets. For instance geographical region, year etc.

| indicator_name    | value_unit | value_type | key  | comment                    |
|-------------------|------------|------------|------|----------------------------|
| year              | year       | int        | Key1 |                            |
| geography_iso     | category   | string     |      | ISO 3166-1 alpha-3         |
| geography_country | category   | string     |      |                            |
| geographyid       | category   | string     | Key2 | NUTS-2 region data (EUREG), No standards found for other regions |
| geographyname     | category   | string     | Key3 |                            |
| databank          | category   | string     |      |                            |

<br/>

## Conclusion

The indicators that are available for all regions are limited. The rest is fragmented; the most detailed data is available for the AMREG region.

For the geographyid a standard applies that is based on ISO 3166-1 alpha-3, extended with a 2- or 3-digit code. To visualize the clustering results on a map, longitude and latitude data are needed per region. I have only been able to find this geospatial definition for the EUREG region, not for the other regions. This is a drawback for now, but as this data should be obtainable somehow it is not considered an impediment.

## Assumptions made

Though it's possible to generate cluster data on a global level, it's not possible to visualize it. Therefore I've made the assumption that it's OK to take just the EUREG region, so the results can be shown to the stakeholders on a map.

I will focus on data that is available on a global level, but filter it to the EUREG region, so that whenever the geospatial data becomes available it's easy to visualize it for all regions of the world.


## Clustering

Now that we have preprocessed the data we can start the clustering.

In [None]:
def read_file(par_data_dir_name, par_file_name):
    X = pd.read_csv(par_data_dir_name + os.path.sep + par_file_name + '.csv',
                    sep=';',
                    encoding="utf8")

    # Dropping irrelevant columns from the data
    drop_columns = [
        'Unnamed: 0',
        'year',
        'geography_region_key',
        'geographyid'
    ]

    X_stripped = X.drop(drop_columns, axis=1)

    # Handling the missing values
    X_stripped.fillna(0, inplace=True)

    print("X.shape: ", X_stripped.shape)

    return (X, X_stripped)

In [None]:
def do_PCA(par_X_normalized):
    pca = PCA(n_components=2)
    par_X_normalized = par_X_normalized.dropna()
    X_principal = pca.fit_transform(par_X_normalized)
    X_principal = pd.DataFrame(X_principal)
    X_principal.columns = ['P1', 'P2']

    return X_principal

In [None]:
def init_algo():
    #return DBSCAN(eps=EPS_VALUE, min_samples=MIN_SAMPLES_VALUE)
    #return AffinityPropagation(random_state=None, max_iter=20)
    return AgglomerativeClustering(n_clusters=N_CLUSTERS)

In [None]:
def get_labels(par_DBSCAN, par_X_principal):
    db_default = par_DBSCAN.fit(par_X_principal)
    labels = db_default.labels_
    print("labels: ", labels.max())

    return labels

In [None]:
def generate_colours():
    '''Generate a set of random colours for the plot'''

    colours = {}
    
    for i in range(-1, 200):
        r = random.random()
        b = random.random()
        g = random.random()
        color = (r, g, b)
        colours[i] = color
    
    return colours

In [None]:
def do_plot(par_data_dir_name,
            par_file_item,
            par_labels,
            par_X_principal,
            colours):
    # Create the figure first so the points and the legend share it
    plt.figure(figsize=(9, 9))

    # Plot P1 on the x-axis and P2 on the y-axis, one scatter call per
    # cluster label so that each cluster gets its own legend entry
    for label in np.unique(par_labels):
        mask = par_labels == label
        plt.scatter(
            par_X_principal.loc[mask, 'P1'],
            par_X_principal.loc[mask, 'P2'],
            color=colours[label],
            label="Label " + str(label))

    # Building the legend
    plt.legend()

    plt.savefig(par_data_dir_name + os.path.sep + par_file_item + '.png')

    return plt

In [None]:
def run_algo(par_algo, par_X_principal):
    db = par_algo.fit(par_X_principal)

    return db
    

In [None]:
def do_clustering(par_data_dir_name, par_file_item):
    X, X_stripped = read_file(par_data_dir_name, par_file_item + "_ext2")
    X_principal = do_PCA(X_stripped)
    algo = init_algo()
    labels = get_labels(algo, X_principal)
    result = run_algo(algo, X_principal)
    plt = do_plot(par_data_dir_name,
                  par_file_item,
                  labels,
                  X_principal,
                  colours)
    plt.show()

    X['cluster'] = result.labels_.tolist()
    X.to_csv(par_data_dir_name + os.path.sep + par_file_item + "_clusters.csv",
             sep=";",
             encoding="utf8")


## Visualization

The plots show the different clusters, but it's not clear which regions the data points refer to.
Therefore we will plot the cluster data on a map so it's clear where the actual clusters are.

In [None]:
# CircleMarker colours are passed on as CSS colours, so only valid
# CSS colour names are used here
COLOURS = [
           'lightcoral',
           'lightgreen',
           'yellow',
           'plum',
           'darkgrey',
           'darkred',
           'darkgreen',
           'goldenrod',
           'indigo',
           'dodgerblue',
           'red', 
           'blue',
           'green',
           'cyan',
           'black',
           'lightyellow',
           'lightgrey',
           'olive',
           'purple',
           'lime'
]

In [None]:
def get_coordinates(coordinates, item_no):
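    # A comparison with np.nan is always False, so check the type instead:
    # missing coordinates show up as NaN (a float) rather than a list
    if not isinstance(coordinates, (list, tuple)):
        return None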

    try:
        if item_no == 0:
            return coordinates[0]
        else:
            return coordinates[1]
    except Exception:
        return None

In [None]:
def read_geo_data():
    # Local name, so the global DATA_FILE is not shadowed
    geo_file_name = "nutspt_3.json"
    with open(geo_file_name, encoding="UTF-8") as file_object:
        json_data = json.load(file_object)

    df = pd.json_normalize(json_data['features'])

    df['longitude'] = df.apply(
        lambda row : get_coordinates(row['geometry.coordinates'], 0), axis = 1)
    df['latitude'] = df.apply(
        lambda row : get_coordinates(row['geometry.coordinates'], 1), axis = 1)
    
    return df

In [None]:
def read_cluster_data(par_data_dir_name, par_file_name):
    return pd.read_csv(
        par_data_dir_name + os.path.sep + par_file_name + "_clusters.csv",
        sep=";",
        encoding="utf8")


In [None]:
def merge_data(par_df_cluster, par_df_geo):
    df_cluster_merged = par_df_cluster.merge(par_df_geo,
                        left_on='geographyid',
                        right_on='properties.id',
                        how='left')

    return df_cluster_merged.dropna()

In [None]:
def plot_map(par_data_dir_name, par_df_cluster, par_title, par_year):
    # Initialize map and center on Munich
    folium_map = folium.Map(location=[48.130518, 11.5364172],
                            zoom_start=3,
                            width='75%',
                            height='75%')

    title_html = '''
             <h3 align="center" style="font-size:16px"><b>{} ({})</b></h3>
             '''.format(par_title, par_year)

    folium_map.get_root().html.add_child(folium.Element(title_html))

    for index, row in par_df_cluster.iterrows():
        # The cluster label indexes into the colour list; wrap around
        # when there are more clusters than colours
        colour = COLOURS[int(row.cluster) % len(COLOURS)]
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            popup="<strong>" + str(row['properties.id']) + "</strong>",
            tooltip=str(row.cluster),
            color=colour,
            ).add_to(folium_map)

    folium_map.save(par_data_dir_name + os.path.sep + par_title + ".html")

    return folium_map

In [None]:
def do_visualization(par_data_dir_name,
                     par_file_item,
                     par_df_geo_data,
                     par_map_list,
                     par_year):
    df_cluster = read_cluster_data(par_data_dir_name, par_file_item)
    df_merged = merge_data(df_cluster, par_df_geo_data)
    print("df_merged.shape :", df_merged.shape)
    cluster_map = plot_map(par_data_dir_name,
                           df_merged,
                           par_file_item,
                           par_year)
    par_map_list.append(cluster_map)

    display(cluster_map)

    return par_map_list

## Main process

Loop through the different indicators and years and perform the clustering and the rendering of the maps.

### Render cluster plots

The cluster plots are rendered per indicator per year.

The plots are also saved as PNG files in the data directory.

### Render maps

The maps are rendered per indicator per year.

The maps are also saved as HTML files in the data directory.

It's possible to zoom in and out of an area.

If you hover over a data point it shows you the cluster it belongs to. This corresponds to the cluster number as can be found in the *_clusters.csv files in the data directory.

Clicking on a data point shows you the region of that data point. This corresponds to the geographyid in the *_clusters.csv files in the data directory.

In [None]:
# Download and filter base data set
df_data = download_and_read_source_data()

# Generate colour palette for cluster maps
colours = generate_colours()

# Initialize list of maps
map_list = []
# Retrieve geospatial data
df_geo_data = read_geo_data()

for file_item in FILE_LIST:
    print("Process file: ", file_item)
    for year in YEAR_LIST:
        print(">>Process year: ", year)

        data_dir_name = "data_" + str(year)
        # Create a directory for derived data.
        Path(data_dir_name).mkdir(parents=True, exist_ok=True)

        df_filtered = filter_data(df_data, year)

        preprocess_data(data_dir_name, file_item, df_filtered)
    
        do_clustering(data_dir_name, file_item)
    
        map_list = do_visualization(data_dir_name,
                                    file_item, 
                                    df_geo_data,
                                    map_list,
                                    year)

print("End cell")

## Outcome and recommendations

### Justification

#### Choice of algorithm

The choice of algorithm is based on research into the available clustering algorithms. Algorithms like K-means are not good at handling data with outliers or clusters that are not spherical.

The following image shows which algorithm would do a good job: [Cluster algorithm comparison](https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png)

I've tried DBSCAN and Agglomerative Clustering as both seem robust for a range of possible data shapes.
My preference was for DBSCAN as it determines the number of clusters itself.
During the experimentation it turned out that Agglomerative Clustering gave better results than DBSCAN. This could best be seen when plotting the data on the map. There you could for instance see the income differences between countries (Switzerland, Luxembourg and Norway are rich, the north of Italy is richer than the south, etc.).
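
The practical difference between the two algorithms can be sketched on synthetic data. The points below come from `make_blobs` and only stand in for the PCA output; the `eps` and `min_samples` values are arbitrary choices for this toy example, not the settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2D points standing in for the PCA output (not the real data).
points, _ = make_blobs(
    n_samples=150, centers=3, cluster_std=0.5, random_state=0)

agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(points)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)

# Agglomerative clustering returns exactly the requested number of clusters.
print("agglomerative clusters:", len(set(agglo_labels)))

# DBSCAN decides the number itself; label -1 marks points it treats as noise.
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)
```

With DBSCAN the outcome depends strongly on `eps` and `min_samples`, which is one possible reason why a fixed number of clusters was easier to work with here.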

#### Explainable AI

It's easy to throw all data onto one big heap and run some AI algorithm on it. This will get you a result, but it'll be difficult for a human to verify whether the result is valid. For supervised learning this is easier to justify than for data where unsupervised learning is to be used.

The setup chosen can still be understood by humans. Showing the data on the map already gives you insights. You can go back to the cluster data in the CSV files to analyse it further.

### Insights

Outlook from 2018 to 2024

#### Household numbers by income band
- The top segment stays about the same: Switzerland, Luxembourg, the center of London and Norway.
- Southern Europe shows a more diverse picture. In 2018 Italy shows a clear separation between the richer north and the poorer south. By 2024 the southern part shows more diversity in income bands.
- Greece is on the rise: by 2024 it's similar to the southern part of Italy, whereas in 2018 it was lower than that.
- Germany and the Netherlands grow to be more homogeneous by 2024 as compared to 2018. This especially shows in the former DDR.
- The eastern European countries show few differences between 2018 and 2024.

#### Consumer spending by product
- Over the years the central European countries grow to have a similar pattern. Countries like Germany, Switzerland, Austria, the Czech Republic and the UK are all in the same segment.
- Nordic and Baltic countries show similarities; this stays the same over the years.

#### Population

- Where in 2018 the separation between the north and the south of Italy was evident, this difference is projected to be equalized by 2024.
- Central Europe and Eastern Europe still show a clear separation by 2024 as was visible in 2018. Central Europe becomes more homogeneous as can be seen by the band running from Germany all the way down to Italy.
- Greece has connected to the rest of Europe by 2024 where it was still lacking this connection in 2018.
- Northwestern Europe is quite diverse and stays like that through the years.

### Conclusion

This script shows that it's possible to gain insights from this data set. The maps show differences in the demographics, spending behaviour and income throughout Europe. Due to the limited time available, not everything that could be explored has been explored. The setup of the scripts allows for an easy way to further analyze the data.

### Recommendations

As time was limited and some additional data was not available, not all options could be worked out. These are a few recommendations for possible next steps:

- Add more indicators. The script can easily be run on other indicators.
- Get geospatial data of the other regions so truly global insights can be gained. The impact of this on the script is minimal.
- Try to get more data which is available for all regions or do feature engineering on existing data. Only then can you create global insights.
- Experiment with more/fewer clusters; this needs to be discussed with the stakeholders.
- Experiment with clustering of clustering. We have a number of singular cluster topics now. If we run a clustering algorithm on the cluster data itself we could generate an aggregated clustering which could show other segments.
- A more diverse choice of real Swedish food in the food market... ;-)
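
The recommendation to experiment with the number of clusters could start from a simple comparison like the one below. This is a sketch on synthetic data, assuming the silhouette score is an acceptable first indicator; it does not replace the discussion with the stakeholders, and the range of cluster counts is an arbitrary choice.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points standing in for the PCA output (not the real data).
points, _ = make_blobs(
    n_samples=200, centers=4, cluster_std=0.6, random_state=1)

# Silhouette score per candidate number of clusters (higher is better).
scores = {}
for n_clusters in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(points)
    scores[n_clusters] = silhouette_score(points, labels)

best_n = max(scores, key=scores.get)

for n_clusters, score in scores.items():
    print(n_clusters, round(score, 3))
print("best number of clusters by silhouette score:", best_n)
```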