<a href="https://www.kaggle.com/code/mikedelong/eda-with-a-map?scriptVersionId=162116051" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install --quiet geocoder

We need geocoder to look up the latitude and longitude of each mountain by name so we can plot it on a map.

In [2]:
import pandas as pd
from warnings import filterwarnings

filterwarnings(action='ignore', category=FutureWarning)
DEATHS = '/kaggle/input/mountain-climbing-accidents-dataset/deaths_on_eight-thousanders.csv'

df = pd.read_csv(filepath_or_buffer=DEATHS, parse_dates=['Date'])
df['year'] = df['Date'].dt.year
df.head()

Unnamed: 0,Date,Name,Nationality,Cause of death,Mountain,year
0,2023-07-27,Muhammad Hassan,Pakistan,Unknown,K2,2023
1,2022-07-22,Matthew Eakin,Australia,Fall,K2,2022
2,2022-07-22,Richard Cartier,Canada,Fall,K2,2022
3,2022-07-21,Ali Akbar Sakhi,Afghanistan,"Unknown, suspected altitude sickness",K2,2022
4,2021-07-25,Rick Allen,United Kingdom,Avalanche,K2,2021


Let's build a per-mountain table, with coordinates and a fatality count for each peak, to visualize on a map.

In [3]:
from arrow import now
from geocoder import arcgis

time_start = now()
# Geocode each distinct mountain, then join the per-mountain fatality counts.
coordinates = {mountain: arcgis(location=mountain).latlng for mountain in df['Mountain'].unique().tolist()}
counts = df['Mountain'].value_counts().to_frame().reset_index()
mountain_df = (pd.DataFrame.from_dict(orient='index', data=coordinates).reset_index()
               .merge(right=counts, left_on='index', right_on='Mountain', how='inner')
               .drop(columns=['Mountain']))
mountain_df.columns = ['Mountain', 'latitude', 'longitude', 'count']
print('done in {}'.format(now() - time_start))

done in 0:00:07.697514
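
The lookup in the cell above assumes every mountain name geocodes successfully. Below is a minimal sketch of a more defensive version, assuming a failed lookup returns a falsy value (`None` or an empty list). `geocode_names` and its `lookup` parameter are hypothetical names, not part of geocoder, and the coordinates are placeholders rather than real results.

```python
def geocode_names(names, lookup):
    """Return ({name: (lat, lng)}, [names that could not be resolved]).

    `lookup` is any callable mapping a place name to a (lat, lng) pair;
    it is assumed to return a falsy value when nothing is found.
    """
    found, missing = {}, []
    for name in names:
        result = lookup(name)
        if result:
            lat, lng = result
            found[name] = (lat, lng)
        else:
            missing.append(name)
    return found, missing


# Stand-in lookup with placeholder coordinates (not real geocoding results).
fake_results = {'Peak A': [1.0, 2.0], 'Peak B': None}
found, missing = geocode_names(['Peak A', 'Peak B'], fake_results.get)
print(found, missing)
```

In the notebook, `lookup` could be `lambda name: arcgis(location=name).latlng`, and `found` could then be passed to `pd.DataFrame.from_dict(orient='index', data=found)` as before, with `missing` inspected by hand.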


In [4]:
from plotly.express import scatter_mapbox
scatter_mapbox(data_frame=mountain_df, lat='latitude', lon='longitude', size='count', hover_name='Mountain', mapbox_style='open-street-map', zoom=3, height=900)

In [5]:
from plotly.express import histogram
histogram(data_frame=df.sort_values(by='Mountain'), x='Date', color='Mountain')

The date-by-mountain histogram above gives a feel for which mountains dominate the overall figures.

In [6]:
histogram(data_frame=df, x='Mountain')

This plot demonstrates how Everest dominates the overall totals.
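
To put a number on "dominates", the share of fatalities per mountain can be computed directly. This is a minimal sketch using only the standard library; `share_by_label` is a hypothetical helper, and the sample labels are placeholders, not counts from the dataset.

```python
from collections import Counter


def share_by_label(labels):
    """Return {label: fraction of all rows}, largest share first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


# Placeholder labels standing in for the 'Mountain' column.
sample = ['A', 'A', 'A', 'B']
shares = share_by_label(sample)
print(shares)
```

Against the real data this would be called as `share_by_label(df['Mountain'])`; the pandas equivalent is `df['Mountain'].value_counts(normalize=True)`.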

In [7]:
for column in ['Nationality', 'Cause of death']:
    histogram(data_frame=df, y=column, height=1500).show()

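The breakdown by nationality shows how unevenly the fatalities are distributed. The cause-of-death column is free text (for example, "Unknown, suspected altitude sickness"), so it splits into too many one-off categories to be useful for a breakdown by volume.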
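
One way to make the cause-of-death column usable would be to collapse the free text into a few coarse groups first. The sketch below is an assumption about how that might look, not something the notebook does: `RULES`, `normalize_cause`, and the category names are all hypothetical, and the keyword list would need tuning against the full column.

```python
# Hypothetical keyword-to-category rules; the first matching keyword wins.
RULES = [
    ('avalanche', 'Avalanche'),
    ('fall', 'Fall'),
    ('altitude', 'Altitude sickness'),
]


def normalize_cause(raw):
    """Map a free-text cause to a coarse category by substring match."""
    text = str(raw).strip().lower()
    for keyword, category in RULES:
        if keyword in text:
            return category
    return 'Other/unknown'


# Values taken from the first rows of the dataset shown earlier.
causes = ['Fall', 'Avalanche', 'Unknown, suspected altitude sickness', 'Unknown']
grouped = [normalize_cause(c) for c in causes]
print(grouped)
```

In the notebook this could be applied with `df['cause_group'] = df['Cause of death'].map(normalize_cause)` (`cause_group` being a made-up column name) and plotted with `histogram(data_frame=df, y='cause_group')`. Substring matching is crude: the keyword `fall` would also match a value such as "rockfall", so the rules deserve a manual check.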

In [8]:
from plotly.express import strip
strip(data_frame=df, y='Nationality', x='Mountain', hover_name='Name', hover_data=['Date'], height=1500, stripmode='overlay', color='year')

Combining nationality and mountain, with name and date on hover and year as the color, gives as complete a view as we can manage for this many dimensions of data. Unfortunately, because the strip plot treats year as categorical, the colors are not especially helpful.
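
Since the colors are categorical either way, one possible workaround is to color by decade rather than by year, which leaves far fewer categories to tell apart. This is a minimal sketch under that assumption; `decade_label` is a hypothetical helper and it assumes the year is never missing.

```python
def decade_label(year):
    """Map a year such as 2023 to a label such as '2020s'."""
    return '{}s'.format(int(year) // 10 * 10)


labels = [decade_label(y) for y in (2023, 2021, 1999)]
print(labels)
```

In the notebook this could be used as `df['decade'] = df['year'].map(decade_label)` (`decade` being a made-up column name), followed by passing `color='decade'` to `strip` in place of `color='year'`.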

In [9]:
from plotly.express import scatter
scatter(data_frame=df.groupby(by=['Nationality', 'Mountain']).size().reset_index(name='count'),
        x='Mountain', y='Nationality', size='count', height=1500)

This view is so sparse that it may not be a useful way to understand the data, beyond showing that most of the climbers who died were from Nepal and that most of them died on Everest.
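
One way to reduce the sparsity would be to keep only the most frequent nationalities and lump the rest into a single bucket before grouping. The sketch below illustrates that idea with the standard library; `lump_rare`, its parameters, and the `'Other'` label are hypothetical, and the sample values are placeholders.

```python
from collections import Counter


def lump_rare(values, top_n, other='Other'):
    """Keep the top_n most frequent labels and replace the rest with `other`."""
    values = list(values)
    keep = {label for label, _ in Counter(values).most_common(top_n)}
    return [value if value in keep else other for value in values]


# Placeholder labels standing in for the 'Nationality' column.
sample = ['X', 'X', 'X', 'Y', 'Y', 'Z']
lumped = lump_rare(sample, top_n=2)
print(lumped)
```

In the notebook, `df.assign(Nationality=lump_rare(df['Nationality'], top_n=10))` could feed the same `groupby` as above; the cutoff of 10 is arbitrary and only meant as a starting point.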