<a href="https://www.kaggle.com/code/mikedelong/python-eda-with-maps?scriptVersionId=142764022" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import pandas as pd
df = pd.read_csv(filepath_or_buffer='/kaggle/input/life-expectancy-and-socio-economic-world-bank/life expectancy.csv')
df.head()

In [None]:
df.info()

In [None]:
from plotly.express import choropleth
choropleth(data_frame=df[df['Year'] == 2019], locations='Country Code', color='Region')

Clearly we're missing some countries from this dataset, but we have at least some data for most countries.

In [None]:
from plotly.express import histogram
histogram(data_frame=df, x='Year', color='IncomeGroup')

No country has changed income groups during the period of interest.
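
A histogram shows group counts per year rather than individual countries, so a direct check can back up this statement. The following is a minimal sketch, assuming the `Country Code` and `IncomeGroup` columns used above; the helper name `countries_changing_group` and the toy frame are made up for illustration.

```python
import pandas as pd

def countries_changing_group(frame: pd.DataFrame) -> list:
    """Return the country codes that appear with more than one income group."""
    groups_per_country = frame.groupby('Country Code')['IncomeGroup'].nunique()
    return sorted(groups_per_country[groups_per_country > 1].index.tolist())

# Toy data (made up): 'BBB' moves from low to lower-middle income.
toy = pd.DataFrame({
    'Country Code': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Year': [2018, 2019, 2018, 2019],
    'IncomeGroup': ['High income', 'High income', 'Low income', 'Lower middle income'],
})
changed = countries_changing_group(toy)
print(changed)
```

On the notebook's data, `countries_changing_group(df)` returning an empty list would confirm that no country changed groups.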

In [None]:
years = sorted(df['Year'].unique().tolist())
def make_plot_data(column: str) -> dict:
    """Build a figure dict with one choropleth trace per year plus a year slider."""
    plot_df = df[['Country Name', column, 'Year', 'Country Code']]
    data = []
    for year in years:
        year_df = plot_df[plot_df['Year'] == year]  # filter once per year
        # only the first year's trace starts visible, matching the slider's initial position
        data.append(dict(type='choropleth', locations=year_df['Country Code'], z=year_df[column],
                         hovertext=year_df['Country Name'], visible=(year == years[0])))
    # each slider step shows exactly one year's trace and hides the rest
    steps = [dict(method='restyle', args=['visible', [other == year for other in years]], label=str(year)) for year in years]
    layout = dict(geo=dict(scope='world'), sliders=[dict(active=0, pad={'t': 1}, steps=steps)], title=column)
    return dict(data=data, layout=layout)
print('loaded make-plot-data function')


In [None]:
from plotly.offline import init_notebook_mode
from plotly.offline import iplot
init_notebook_mode()
iplot(figure_or_data=make_plot_data(column='Life Expectancy World Bank'))

In [None]:
df.columns[6:]

In [None]:
for column in df.columns[6:].tolist():
    iplot(figure_or_data=make_plot_data(column=column))

In [None]:
from plotly.express import imshow
imshow(df[df['Year'] == 2019][df.columns[5:]].corr())

Some of our quantities are normalized and some are not, so throwing them into a single model unadjusted will probably not produce great results.
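
If "normalized" here means that some columns are rates or percentages while others are raw totals, one common adjustment is to put every numeric column on a comparable scale before modeling. This is a sketch rather than the notebook's method: it assumes z-score standardization suits the model in question, and the helper name `standardize` and the toy values are made up.

```python
import pandas as pd

def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale each numeric column to zero mean and unit (sample) standard deviation."""
    numeric = frame.select_dtypes(include='number')
    return (numeric - numeric.mean()) / numeric.std()

# Toy data (made up): one percentage-like column and one raw total.
toy = pd.DataFrame({
    'rate_percent': [10.0, 20.0, 30.0],
    'raw_total': [1_000.0, 50_000.0, 900_000.0],
})
scaled = standardize(toy)
print(scaled.round(3))
```

For heavily skewed totals, taking a logarithm before standardizing may work better; that choice is likewise an assumption to test against the data.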

In [None]:
from plotly.express import scatter
# marker size cannot be NaN, so drop rows without unemployment data
scatter(data_frame=df[df['Year'] == 2019].dropna(subset=['Unemployment']),
        x='Prevelance of Undernourishment', y='Life Expectancy World Bank', color='IncomeGroup', size='Unemployment',
        hover_name='Country Name', trendline='ols', trendline_scope='overall')

We would expect a negatively sloped trendline between hunger and life expectancy, given the negative correlation in the heatmap above, and that is in fact what we see.
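
The link between the heatmap and the trendline follows from the standard formula for a simple least-squares fit of $y$ on $x$. Here $x$ stands for undernourishment and $y$ for life expectancy; the symbols are introduced only for illustration.

$$
\hat{\beta} = \frac{\operatorname{cov}(x, y)}{\operatorname{var}(x)} = r_{xy}\,\frac{s_y}{s_x}
$$

Because the standard deviations $s_x$ and $s_y$ are positive, the slope $\hat{\beta}$ has the same sign as the Pearson correlation $r_{xy}$ when both are computed on the same rows.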

In [None]:
scatter(data_frame=df[df['Year'] == 2019], 
        x='Corruption', y='Life Expectancy World Bank', color='IncomeGroup', 
        hover_name='Country Name', trendline='ols', trendline_scope='overall')

It is odd to see a positive correlation between corruption and life expectancy.

In [None]:
scatter(data_frame=df[df['Year'] == 2019], 
        x='Sanitation', y='Life Expectancy World Bank', color='IncomeGroup', 
        hover_name='Country Name', trendline='ols', trendline_scope='overall')

This is much more what we would expect: low-income countries have poor sanitation, high-income countries have much better sanitation, and middle-income countries fall in between. Income is broadly positively correlated with life expectancy, and so is sanitation.

In [None]:
scatter(data_frame=df[df['Year'] == 2019], 
        x='Communicable', y='NonCommunicable', color='IncomeGroup', log_x=True, log_y=True,
        hover_name='Country Name', trendline='ols', trendline_scope='overall')

Measures of death or disease are first of all measures of population, so we would expect the communicable and non-communicable disease data to track population size above all. Because population is so unevenly distributed among countries, we need a log-log plot here, and the trendline, which is fitted to the untransformed values, looks odd when drawn on log axes.
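
A straight line on log-log axes corresponds to a power law $y = a x^b$, which can be fitted by ordinary least squares on the logarithms of both variables. The sketch below is a stand-alone illustration, not the notebook's method; the helper name `fit_power_law` and the sample numbers are made up.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) + b * log(x); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    b = sxy / sxx                          # slope in log space = exponent
    a = math.exp(mean_y - b * mean_x)      # intercept in log space = log(a)
    return a, b

# Made-up values that follow y = 2 * x**0.5 exactly.
xs = [1.0, 100.0, 10_000.0, 1_000_000.0]
ys = [2.0 * x ** 0.5 for x in xs]
a, b = fit_power_law(xs, ys)
print(round(a, 6), round(b, 6))
```

Plotly Express can also fit in log space directly by passing `trendline_options=dict(log_x=True, log_y=True)` alongside `trendline='ols'`, assuming a Plotly version that supports `trendline_options`.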