### Interactive Data Visualization and Analysis of World Happiness Data

Welcome to our interactive notebook, created as a part of the Doing Data Science course project in 2023. In this notebook, you are free to explore the World Happiness data, a comprehensive dataset encompassing a wide range of socio-economic and health indicators from various countries over multiple years. Our aim is to present these complex data in an engaging, intuitive, and insightful manner, allowing for in-depth analysis and exploration. It is important to know that our prior analysis has led to the formation of five distinct clusters (ranging from 0 to 4) based on these indicators, particularly focusing on the 'life ladder' as a proxy for happiness. Cluster 4 represents the happiest countries, while cluster 0 includes those with lower happiness levels.


#### Overview of the Project

- **Project Context**: This notebook is designed to allow users, particularly our fellow students, to interactively engage with the results of our analysis on the World Happiness dataset. 
- **Data Insights**: We have applied advanced data science techniques, including clustering algorithms, to uncover patterns and insights within the dataset.
- **Visualizations**: The visualizations provided here are tailored to enhance understanding and facilitate exploration of the data.

#### Key Visualizations and Analyses

- **Interactive Map of Original Data**: Explore the different aspects of the data on a dynamic world map.
- **3D Visualization of Data**: Explore the multidimensional aspects of the data in a dynamic 3D scatter plot.
- **Line Plot of Cluster Assignments Over Years**: Track and compare the evolution of cluster assignments for different countries over time.
- **Boxplots of Attributes Grouped by Clusters**: Analyze the distribution of various attributes within each cluster.
- **World Map Visualization of Cluster Assignments**: Discover spatial patterns and trends in cluster assignments across the globe.
- **World Map Visualization of Predictions for 2020**: Evaluate the accuracy of our predictive model’s cluster assignments for 2020.

#### Objective

The primary goal of this notebook is to provide an accessible platform for users to delve into the World Happiness data. We hope to facilitate a deeper understanding of how socio-economic and health indicators are interconnected and how they influence the clustering of countries. This interactive experience is aimed at providing valuable insights into global patterns and trends, encouraging users to discover their own insights.

#### Authors
- Mara Hannappel
- Karl Hendrik Tamkivi
- Thalis Goldschmidt
_______________________________________________________________________________________________________________________________________________________________________________________________________________________

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import branca.colormap as cm
from plotly.subplots import make_subplots
from ipywidgets import widgets, interact, Layout
import geopandas as gpd
import folium
from IPython.display import display
import math

# Import the dataset
data = pd.read_csv('./data/result_data.csv')

# Geometry simplification tolerance (increase this value if map generation and updates are slow)
simplification_rate = 0.075

### Interactive Map of Original Data

This interactive world map provides a dynamic way to explore different aspects of the dataset for the years 2005 to 2020. The values from the PCA have also been added to the list of selectable variables.
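
The map in the next cell colors each country by scaling the selected metric linearly between that year's minimum and maximum, with a yellow-to-green palette for "higher is better" metrics and a yellow-to-red palette otherwise. The sketch below illustrates that idea in plain Python. The helper names `pick_palette` and `lerp_color` are hypothetical, the RGB tuples are the standard CSS values for yellow and dark green, and the interpolation is only an approximation of what a linear colormap does, not a copy of the library's internals.

```python
# Metrics for which larger values are treated as better (same list as in the notebook cell)
HIGHER_IS_BETTER = {
    "life_ladder", "log_gdp_per_capita", "social_support",
    "healthy_life_exp_at_birth", "freedom_to_make_life_choices",
    "generosity", "positive_affect",
}

YELLOW = (255, 255, 0)
DARK_GREEN = (0, 100, 0)


def pick_palette(metric):
    """Choose the color stops depending on the direction of the metric."""
    if metric in HIGHER_IS_BETTER:
        return ["yellow", "darkgreen"]
    return ["yellow", "orange", "red"]


def lerp_color(value, vmin, vmax, start_rgb, end_rgb):
    """Linearly interpolate between two RGB colors for a value in [vmin, vmax]."""
    if vmax == vmin:
        t = 0.0  # degenerate range: fall back to the start color
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    rgb = [round(s + t * (e - s)) for s, e in zip(start_rgb, end_rgb)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(pick_palette("life_ladder"))
print(lerp_color(3.0, 3.0, 8.0, YELLOW, DARK_GREEN))  # lowest value in the range
print(lerp_color(8.0, 3.0, 8.0, YELLOW, DARK_GREEN))  # highest value in the range
```

Because the range is recomputed for every year, the same color can stand for different absolute values in different years.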

In [2]:
world_map_1_raw_data = data.copy()
world_map_1_raw_data = world_map_1_raw_data.iloc[:, 0:16]  # Exclude cluster labels
world_map_1_raw_data["year"] = world_map_1_raw_data["year"].astype(int)

# Load geographical data and simplify geometries
geo_data = gpd.read_file("./data/updated_geodata.shp")
geo_data['geometry'] = geo_data['geometry'].simplify(simplification_rate)  # Simplify geometry to speed up the updates

# Merge the dataset with geographical data on 'country'
world_map_1_merged_data = pd.merge(
    world_map_1_raw_data, geo_data[["country", "geometry"]], on="country", how="left"
)

# Convert to GeoDataFrame
world_map_1_gdf = gpd.GeoDataFrame(world_map_1_merged_data, geometry="geometry")

world_map_1_gdf['suicide_rate'] = world_map_1_gdf['suicide_rate'].round(2)  # Round to two decimals for readable tooltips

# Interactive Widgets
years = world_map_1_raw_data['year'].unique()
metrics = [col for col in world_map_1_raw_data.columns if col not in ['year', 'country']]

year_slider = widgets.IntSlider(
    value=max(years),
    min=min(years),
    max=max(years),
    step=1,
    description='Year:',
    continuous_update=False
)

metric_dropdown = widgets.Dropdown(
    options=metrics,
    value=metrics[0],
    description='Metric:'
)

def create_map(year, metric):
    m = folium.Map(location=[25, 0], zoom_start=1.5)

    # Filter data for the selected year
    year_data = world_map_1_gdf[world_map_1_gdf['year'] == year]

    # Determine the min and max values for the metric
    min_val, max_val = year_data[metric].min(), year_data[metric].max()

    # Define a linear colormap
    colormap = cm.LinearColormap(
        colors=['yellow', 'darkgreen'] if metric in ['life_ladder', 'log_gdp_per_capita', 'social_support', 
                                                         'healthy_life_exp_at_birth', 'freedom_to_make_life_choices', 
                                                         'generosity', 'positive_affect'] else ['yellow', 'orange', 'red'],
        vmin=min_val,
        vmax=max_val
    )

    # Create a GeoJson object with tooltip and custom style
    geo_json = folium.GeoJson(
        data=year_data.__geo_interface__,
        style_function=lambda feature: {
            'fillColor': colormap(feature['properties'][metric]) if feature['properties'][metric] is not None else 'transparent',
            'color': 'black',
            'weight': 0.5,
            'fillOpacity': 0.9
        },
        tooltip=folium.GeoJsonTooltip(fields=['country', metric])
    ).add_to(m)

    # Add colormap to the map
    colormap.add_to(m)

    folium.LayerControl().add_to(m)
    return m

# Use 'interact' for interactive map
interact(create_map, year=year_slider, metric=metric_dropdown)


### 3D Visualization of Data

This interactive 3D scatter plot provides a dynamic way to explore the dataset's multidimensional aspects. Here are some key features and tips for navigating this visualization:

- **Interactive Axes Selection**:
  - Users can choose which attributes to display on the X, Y, and Z axes. This flexibility allows for a customized view of the data's relationships and patterns.
  - The default axes are set to display the principal components, which are the result of a dimensionality reduction technique. These components capture the most significant variance in the dataset.

- **Flexible Color Encoding**:
  - While the default color coding is based on our cluster assignments, users have the freedom to choose any other attribute for color encoding. This feature adds an extra layer of depth to the analysis.
  - For instance, selecting 'life_ladder' as the color encoding provides insights into the happiness index across different countries, whereas choosing 'log_gdp_per_capita' can illustrate economic variations.

- **Year Filtering and Data Exploration**:
  - A slider is available to filter data points by year, enabling users to observe how relationships and clusters evolve over time.
  - Additionally, there's an option to ignore the year filter and view the entire dataset across all years.

- **Interactive Tooltips**:
  - Hovering over any data point will display a tooltip with detailed information about that point, including the country's name and the values of the selected attributes.
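
The default axes (PC1 to PC3) are read from precomputed columns in the dataset, so the notebook itself does not show how they were derived. As a rough illustration of what principal components are, the sketch below runs a PCA through a singular value decomposition with NumPy. The standardization step, the function name `pca_scores`, and the toy values are assumptions for illustration only and are not taken from the authors' analysis.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Return PCA scores and explained-variance ratios computed via SVD."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so no indicator dominates because of its units
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Toy rows (hypothetical values): four indicators for six countries
toy = [
    [7.5, 10.9, 0.95, 72.0],
    [6.9, 10.7, 0.91, 70.5],
    [5.1, 9.2, 0.80, 64.0],
    [4.2, 8.1, 0.66, 58.0],
    [3.4, 7.5, 0.52, 54.5],
    [6.0, 9.8, 0.88, 68.0],
]

scores, ratios = pca_scores(toy, n_components=3)
print(scores.shape)     # one row per country, one column per component
print(ratios.round(3))  # share of total variance captured by each component
```

The first column of `scores` plays the role of PC1: it is the single direction along which the standardized indicators vary the most.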

In [2]:
# Create a dataset copy for the scatter plot visualization
scatter_plot_data = data.copy()

# Convert the 'cluster' column to string for color encoding
scatter_plot_data['cluster'] = scatter_plot_data['cluster'].astype(str)

# Define the interactive plotting function for the 3D scatter plot
def interactive_plot_scatter_3d(year, x_axis, y_axis, z_axis, color_choice, ignore_year):
    # Filter the data based on the selected year or use all data if year is ignored
    filtered_data = scatter_plot_data[scatter_plot_data['year'] == year] if not ignore_year else scatter_plot_data
    filtered_data = filtered_data.sort_values(by=['cluster'])
    print(f"{filtered_data.shape[0]} data points for the selected criteria.")

    # Create and display the 3D scatter plot
    fig = px.scatter_3d(filtered_data, x=x_axis, y=y_axis, z=z_axis, color=color_choice,
                        hover_name='country')

    fig.update_layout(width=900, height=700,
                     title={
                            'text': f'3D Visualization: {x_axis} vs {y_axis} vs {z_axis}, colored by {color_choice}',
                            'y':0.95,
                            'x':0.5,
                            'xanchor': 'center',
                            'yanchor': 'top'
                        })

    # fig.show() renders the figure and returns None, so there is nothing to return
    fig.show()

# Define the options for the interactive widgets
plotting_variables = scatter_plot_data.columns[scatter_plot_data.columns.get_loc('life_ladder'):scatter_plot_data.columns.get_loc('PC3') + 1].tolist()
color_variables = scatter_plot_data.columns[scatter_plot_data.columns.get_loc('country'):scatter_plot_data.columns.get_loc('cluster') + 1].tolist()

# Create interactive widgets for user input
year_slider = widgets.IntSlider(value=scatter_plot_data['year'].max(), min=scatter_plot_data['year'].min(), max=scatter_plot_data['year'].max(), step=1, description='Year:')
ignore_year_checkbox = widgets.Checkbox(value=False, description='Ignore Year Filter')
x_axis_dropdown = widgets.Dropdown(options=plotting_variables, value='PC1', description='X Axis:')
y_axis_dropdown = widgets.Dropdown(options=plotting_variables, value='PC2', description='Y Axis:')
z_axis_dropdown = widgets.Dropdown(options=plotting_variables, value='PC3', description='Z Axis:')
color_dropdown = widgets.Dropdown(options=color_variables, value='cluster', description='Color by:')

# Combine the widgets and the plotting function for interactive visualization
interact(interactive_plot_scatter_3d, year=year_slider, x_axis=x_axis_dropdown, y_axis=y_axis_dropdown, z_axis=z_axis_dropdown, color_choice=color_dropdown, ignore_year=ignore_year_checkbox)


### Line Plot of Cluster Assignments Over Years

This visualization provides a unique perspective on how cluster assignments for different countries have evolved over the years:

- **Interactive Country Selection**:
  - Users can interactively select one or multiple countries to track their cluster assignments over time. To select multiple countries, hold 'Shift' (for a contiguous range) or 'Ctrl' ('Cmd' on macOS) while clicking the desired countries in the selection widget.
  - This feature is particularly useful for comparing the trajectories of different countries side by side.

- **Visual Representation of Changes**:
  - Each selected country is represented by a distinct line on the plot, with different colors for easy differentiation.
  - The line plot effectively illustrates the progression or stability of cluster assignments for each country, offering insights into trends and shifts over the years.
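
A flat line in this plot means a country stayed in the same cluster, while each step up or down marks a change of cluster. The sketch below counts those steps for a few made-up histories. The function name `count_cluster_changes` and the toy data are hypothetical and only illustrate how "progression or stability" could be quantified.

```python
def count_cluster_changes(history):
    """Count year-to-year cluster changes in a list of (year, cluster) pairs."""
    ordered = sorted(history)  # make sure the years are in chronological order
    clusters = [cluster for _, cluster in ordered]
    return sum(1 for prev, cur in zip(clusters, clusters[1:]) if prev != cur)


# Toy histories (hypothetical values, not taken from the dataset)
toy_histories = {
    "Country A": [(2005, 3), (2006, 3), (2007, 4), (2008, 4)],
    "Country B": [(2005, 1), (2006, 2), (2007, 1), (2008, 0)],
}

changes = {name: count_cluster_changes(h) for name, h in toy_histories.items()}
for name, n in changes.items():
    print(f"{name}: {n} cluster change(s)")
```

A count of zero corresponds to a perfectly stable country. Higher counts point to countries worth selecting in the widget for a closer look.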


In [3]:
# Create a copy of the original dataset for the line plot visualization
lineplot_data = data.copy()
lineplot_data['cluster'] = lineplot_data['cluster'].astype(int)

def get_unique_colors(n):
    """Return the first n colors of Plotly's qualitative palette (10 colors, so colors repeat beyond 10 countries)."""
    return px.colors.qualitative.Plotly[:n]

def interactive_plot_line(selected_countries):
    """Create an interactive line plot for selected countries to visualize cluster changes over years."""
    fig = px.line()

    # Check if any countries are selected for the plot
    if selected_countries:
        # Filter the data for the selected countries
        filtered_data = lineplot_data[lineplot_data['country'].isin(selected_countries)]
        unique_colors = get_unique_colors(len(selected_countries))

        # Iterate through each selected country and add a line to the plot
        for i, country in enumerate(selected_countries):
            country_data = filtered_data[filtered_data['country'] == country]
            country_color = unique_colors[i % len(unique_colors)]

            # Add a line trace for each selected country
            fig.add_trace(px.line(country_data, x="year", y="cluster",
                                  color_discrete_sequence=[country_color],
                                  hover_name="country").data[0])

            # Manually set the legend name for each country
            fig.data[-1].name = country
            fig.data[-1].showlegend = True

    # Configure the layout and appearance of the plot
    fig.update_layout(
        plot_bgcolor='white', 
        width=1000,  
        height=700,
        showlegend=True,  
        legend=dict(
            title="Country",  
            orientation="h", 
            yanchor="bottom",
            y=-0.15, 
            xanchor="center",
            x=0.5
        ),
        title={
            'text': "Cluster Changes Over Years by Country",
            'y':0.95,
            'x':0.5,
            'xanchor': 'center',
            'yanchor': 'top'
        },
        yaxis=dict(
            tickvals=[0, 1, 2, 3, 4],  # Define tick values for y-axis (cluster numbers)
            range=[0, 4],  # Set a fixed range for the y-axis
            showgrid=True,
            gridcolor='LightGrey',
            zeroline=True,
            zerolinecolor='LightGrey',
            zerolinewidth=1
        ),
        xaxis=dict(
            range=[lineplot_data['year'].min(), lineplot_data['year'].max()]  # Define the range for the x-axis (years)
        )
    )
    
    # Display the plot
    fig.show()

# Create a widget for selecting multiple countries for the plot
country_selection = widgets.SelectMultiple(options=lineplot_data['country'].unique(), 
                                           description='Countries', 
                                           disabled=False)

# Link the widget to the plotting function for interactivity
interact(interactive_plot_line, selected_countries=country_selection)


### Boxplots of Attributes Grouped by Clusters

Dive into a detailed analysis of various socio-economic and health indicators with this interactive visualization:

- **Customizable Year Range and Attribute Selection**:
  - Users can select specific year ranges and multiple attributes to generate boxplots. This feature offers flexibility in analyzing data over different time periods and across various dimensions.
  - By adjusting the year range slider and selecting desired attributes, users can tailor the visualization to focus on the aspects most relevant to their analysis.

- **Insights into Distributions Within Clusters**:
  - The boxplots provide a clear visualization of how different attributes are distributed within each cluster. This helps in identifying patterns, anomalies, or general trends within the data.
  - Attributes like GDP, life expectancy, happiness index (life ladder), and more can be compared within and across clusters, offering insights into the factors that contribute to the clustering of countries.
  - Hover over the boxplots to display exact statistics (e.g., min, max, and median).

- **Enhanced Data Exploration**:
  - This visualization is especially useful for exploring how key indicators vary across different clusters and understanding the underlying factors that lead to these variations.
  - It's an excellent tool for researchers, analysts, or anyone interested in socio-economic and health data analysis, providing a comprehensive view of the data's characteristics.

Utilize this interactive tool to explore and uncover the intricate relationships between different attributes and their impact on cluster assignments.
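
Each box summarizes the values of one attribute within one cluster. As a minimal sketch of the numbers behind a box, the example below groups toy rows by cluster and reports the minimum, median, and maximum. The function name `summarize_by_cluster` and the values are hypothetical. Quartiles are left out because their exact definition varies between tools.

```python
from collections import defaultdict
from statistics import median


def summarize_by_cluster(rows, attribute):
    """Group rows by cluster and report min, median, and max of one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row[attribute])
    return {
        cluster: {"min": min(values), "median": median(values), "max": max(values)}
        for cluster, values in sorted(groups.items())
    }


# Toy rows (hypothetical values, not taken from the dataset)
toy_rows = [
    {"cluster": 0, "life_ladder": 3.5},
    {"cluster": 0, "life_ladder": 4.1},
    {"cluster": 0, "life_ladder": 3.9},
    {"cluster": 4, "life_ladder": 7.2},
    {"cluster": 4, "life_ladder": 7.4},
    {"cluster": 4, "life_ladder": 7.6},
]

summary = summarize_by_cluster(toy_rows, "life_ladder")
for cluster, stats in summary.items():
    print(cluster, stats)
```

In this toy data the two clusters do not overlap at all on 'life_ladder', which is the kind of separation the boxplots make visible at a glance.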


In [4]:
# Create a copy of the original dataset for the boxplot visualization
boxplot_data = data.copy()

# List of attributes for which to create boxplots
attributes = ['life_ladder', 'log_gdp_per_capita', 'social_support',
              'healthy_life_exp_at_birth', 'freedom_to_make_life_choices',
              'generosity', 'perceptions_of_corruption', 'positive_affect',
              'negative_affect', 'pop_density', 'suicide_rate']

# Define a color scheme to use for all plots
color_scheme = px.colors.qualitative.Plotly

# Interactive function to plot the boxplots
def interactive_boxplot(year_range, selected_attributes):
    # Filter data for the selected year range
    filtered_data = boxplot_data[(boxplot_data['year'] >= year_range[0]) & (boxplot_data['year'] <= year_range[1])]

    # Dynamically define the number of rows and columns for the subplot grid
    n_attributes = len(selected_attributes)
    if n_attributes == 1:
        n_cols = n_rows = 1
    elif n_attributes == 2:
        n_cols = 2
        n_rows = 1
    else:
        n_cols = min(3, n_attributes)  # Up to 3 columns per row
        n_rows = math.ceil(n_attributes / n_cols)

    # Create a subplot figure
    fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=selected_attributes)

    # Add boxplots to subplots
    for i, attribute in enumerate(selected_attributes, start=1):
        row = (i - 1) // n_cols + 1
        col = (i - 1) % n_cols + 1
        for j, cluster in enumerate(sorted(filtered_data['cluster'].unique())):
            cluster_data = filtered_data[filtered_data['cluster'] == cluster]
            fig.add_trace(
                go.Box(y=cluster_data[attribute], name=f'{cluster}',
                       marker_color=color_scheme[j % len(color_scheme)], showlegend=False,
                       customdata=cluster_data[['country', 'year']],
                       hovertemplate="<br>".join([
                           "%{customdata[0]}",
                           "%{customdata[1]}",
                           f"<extra>{str(cluster)}</extra>"
                       ])),
                row=row, col=col
            )

        # Update x-axis for the subplot
        fig.update_xaxes(title_text='', tickvals=sorted(filtered_data['cluster'].unique()), row=row, col=col)

        # Update y-axis for the subplot (show gridlines)
        fig.update_yaxes(showgrid=True, gridcolor='LightGrey', row=row, col=col)

    # Update the layout
    fig.update_layout(height = 600 * n_rows, width = 1100,
                      plot_bgcolor='white')

    fig.show()

# Widget for year range slider
year_range_slider = widgets.IntRangeSlider(
    value=[boxplot_data['year'].min(), boxplot_data['year'].max()],
    min=boxplot_data['year'].min(), max=boxplot_data['year'].max(),
    step=1,
    description='Year Range',
    continuous_update=False,
    layout=Layout(width='35%'),  
    style={'description_width': 'initial'} 
)

# Widget for selecting attributes
attribute_selector = widgets.SelectMultiple(
    options=attributes,
    value=[attributes[0]],
    description='Attributes',
    disabled=False
)

# Link the widgets with the plot function
interact(interactive_boxplot, year_range=year_range_slider, selected_attributes=attribute_selector)


### World Map Visualization of Cluster Assignments

This interactive world map provides a unique perspective on our cluster assignments, derived from a PCA with hierarchical clustering:

- **Cluster Assignments Without Geographical Bias**:
  - The clusters displayed on the map were computed based solely on socio-economic and health indicators, without incorporating any geographical information such as location attributes. This ensures that the clusters are formed based on similarities in data, independent of the countries' physical locations.

- **Exploring Spatial Correlations and Temporal Changes**:
  - While geographical data was not used in the clustering process, this visualization allows us to explore if spatial correlations or patterns emerge globally. A slider is available to change the year, offering a dynamic view of how cluster assignments and potentially happiness levels have evolved over time in different regions.
  - Users can investigate questions like whether certain continents or regions exhibit similar cluster characteristics consistently over the years, or if there are noticeable shifts in happiness and development levels as indicated by our clusters.

- **Uncovering Global Trends and Insights Across Years**:
  - By presenting our cluster assignments on a world map with the ability to navigate across different years, we enable a geo-spatial and temporal exploration. This approach can reveal interesting trends, such as the stability of clusters in certain areas or significant changes in others over time.
  - The map encourages users to delve into these questions, offering a novel way to understand and visualize the results of our clustering analysis in relation to the geographical distribution of countries and their changes across years.
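
One simple numeric companion to moving the year slider is to count how many countries fall into each cluster per year. The sketch below does this with the standard library. The function name `cluster_counts_by_year` and the toy records are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict


def cluster_counts_by_year(records):
    """Count countries per cluster for each year.

    records: iterable of (country, year, cluster) tuples.
    """
    counts = defaultdict(Counter)
    for _country, year, cluster in records:
        counts[year][cluster] += 1
    # Sort years and cluster labels so the result is easy to read
    return {year: dict(sorted(c.items())) for year, c in sorted(counts.items())}


# Toy records (hypothetical values, not taken from the dataset)
toy_records = [
    ("Country A", 2019, 4), ("Country B", 2019, 3), ("Country C", 2019, 0),
    ("Country A", 2020, 4), ("Country B", 2020, 4), ("Country C", 2020, 0),
]

counts = cluster_counts_by_year(toy_records)
for year, per_cluster in counts.items():
    print(year, per_cluster)
```

A shift in these counts from one year to the next signals that some countries changed cluster, which the map then lets you locate geographically.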


In [16]:
# Create a copy of the dataset for the second world map visualization
world_map_2_raw_data = data.copy()
world_map_2_raw_data["cluster"] = world_map_2_raw_data["cluster"].astype(str)
world_map_2_raw_data["year"] = world_map_2_raw_data["year"].astype(int)

# Load geographical data
geo_data = gpd.read_file("./data/updated_geodata.shp")
geo_data['geometry'] = geo_data['geometry'].simplify(simplification_rate)

# Merge the dataset with geographical data on 'country'
world_map_2_merged_data = pd.merge(
    world_map_2_raw_data, geo_data[["country", "geometry"]], on="country", how="left"
)

# Convert the merged data to a GeoDataFrame
world_map_2_gdf = gpd.GeoDataFrame(world_map_2_merged_data, geometry="geometry")

# Extract unique years from the data for the slider widget
unique_years = world_map_2_raw_data['year'].unique()

# Define the interactive widget for selecting years
year_slider = widgets.IntSlider(
    value=max(unique_years),
    min=min(unique_years),
    max=max(unique_years),
    step=1,
    description='Year:',
    continuous_update=False
)

# Define a custom color map running from red (cluster 0) to green (cluster 4)
color_map = {
    '0': '#fa7070',  # Light red
    '1': '#e58c4e',  # Orange
    '2': '#c2a64a',  # Yellow
    '3': '#99ba67',  # Light green
    '4': '#70c794',  # Green
}

def create_map(year):
    """Create an interactive folium map for the selected year."""
    # Filter the GeoDataFrame for the selected year
    year_data = world_map_2_gdf[world_map_2_gdf['year'] == year]

    # Initialize a folium map centered around a global view
    m = folium.Map(location=[25, 0], zoom_start=1.5)

    # Add a GeoJson layer to the map with style and tooltip
    folium.GeoJson(
        data=year_data,
        style_function=lambda feature: {
            'fillColor': color_map.get(feature['properties']['cluster'], 'grey'),
            'color': 'black',
            'weight': 0.5,
            'fillOpacity': 0.7
        },
        tooltip=folium.GeoJsonTooltip(fields=['country', 'cluster'])
    ).add_to(m)

    # Add layer control to the map
    folium.LayerControl().add_to(m)
    
    # Display the map
    display(m)

# Link the interactive map function with the year slider widget
interactive_map = widgets.interactive(create_map, year=year_slider)

# Display the interactive elements
display(interactive_map)


### World Map Visualization of Predictions for 2020

Explore the accuracy of our predictive model's cluster assignments for the year 2020 on a world map:

- **Prediction Accuracy Illustrated**:
  - This map showcases the effectiveness of a linear SVC model trained on data up to 2019 in predicting the cluster assignments for countries in 2020.
  - Countries are color-coded to indicate whether their cluster assignment was predicted correctly, providing a clear visual representation of the model's performance.

- **Interactive Tooltip Details**:
  - Hovering over a country reveals the actual and predicted cluster values. This feature allows users to examine the precision of the predictions and understand the extent of any discrepancies.
  - The tooltips serve as a quick reference to evaluate how close or far off the predictions were for each country.
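
The map shows correctness country by country. The same comparison can be condensed into an overall accuracy and a per-country "distance" between actual and predicted cluster. Because the clusters are ordered from 0 (lowest happiness) to 4 (happiest), the absolute difference gives a rough sense of how far off a prediction was. The function name `prediction_report` and the toy values below are hypothetical.

```python
def prediction_report(pairs):
    """Summarize predictions given (country, actual_cluster, predicted_cluster) tuples.

    Returns the share of correct predictions and, for each miss,
    the absolute difference between actual and predicted cluster.
    """
    correct = sum(1 for _, actual, predicted in pairs if actual == predicted)
    misses = {
        country: abs(actual - predicted)
        for country, actual, predicted in pairs
        if actual != predicted
    }
    return correct / len(pairs), misses


# Toy predictions (hypothetical values, not the model's real output)
toy_pairs = [
    ("Country A", 4, 4),
    ("Country B", 3, 3),
    ("Country C", 0, 0),
    ("Country D", 1, 3),
]

accuracy, misses = prediction_report(toy_pairs)
print(f"Accuracy: {accuracy:.2f}")
print("Misses (cluster distance):", misses)
```

Note that this sketch uses integer labels, whereas the notebook cell below converts both cluster columns to strings for display, so the labels would need to be cast back to integers before taking differences.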

In [3]:
# Filter the dataset for the year 2020 and take an explicit copy for the third world map
# (the copy avoids pandas' SettingWithCopyWarning on the column assignments below)
world_map_3_raw_data = data[data['year'] == 2020].copy()

# Convert predicted_cluster to integer
world_map_3_raw_data['predicted_cluster'] = world_map_3_raw_data['predicted_cluster'].astype(int)

# Rename 'cluster' to 'actual_cluster' and convert to string for consistency
world_map_3_raw_data.rename(columns={'cluster': 'actual_cluster'}, inplace=True)
world_map_3_raw_data['actual_cluster'] = world_map_3_raw_data['actual_cluster'].astype(str)
world_map_3_raw_data['predicted_cluster'] = world_map_3_raw_data['predicted_cluster'].astype(str)

# Load geographical data
geo_data = gpd.read_file("./data/updated_geodata.shp")
geo_data['geometry'] = geo_data['geometry'].simplify(simplification_rate)

# Merge world_map_3_raw_data with geo_data on 'country'
world_map_3_merged_data = pd.merge(
    world_map_3_raw_data, geo_data[["country", "geometry"]], on="country", how="left"
)

# Convert to a GeoDataFrame
world_map_3_gdf = gpd.GeoDataFrame(world_map_3_merged_data, geometry="geometry")

def create_map():
    """Create an interactive folium map to visualize prediction accuracy."""
    # Initialize a folium map centered around the given coordinates
    m = folium.Map(location=[25, 0], zoom_start=1.5)

    # Define colors for correct and incorrect predictions
    correct_prediction_color = '#70c794'  # Light green
    incorrect_prediction_color = '#fa7070'  # Light red

    # Define a function to choose the color based on prediction accuracy
    def get_color(actual, predicted):
        return correct_prediction_color if actual == predicted else incorrect_prediction_color

    # Add a GeoJson layer to the map with the style function and tooltip for interactive display
    folium.GeoJson(
        data=world_map_3_gdf,
        style_function=lambda feature: {
            'fillColor': get_color(feature['properties']['actual_cluster'], feature['properties']['predicted_cluster']),
            'color': 'black',
            'weight': 0.5,
            'fillOpacity': 0.7
        },
        tooltip=folium.GeoJsonTooltip(fields=['country', 'actual_cluster', 'predicted_cluster'])
    ).add_to(m)

    # Display the map
    display(m)

# Create and display the map
create_map()