In [1]:
import pandas as pd

DATA = '/kaggle/input/average-monthly-surface-temperature-1940-2024/average-monthly-surface-temperature.csv'
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['Day'])
df.head()

,Entity,Code,year,Day,Average surface temperature,Average surface temperature.1
0,Afghanistan,AFG,1940,1940-01-15,-2.032494,11.327695
1,Afghanistan,AFG,1940,1940-02-15,-0.733503,11.327695
2,Afghanistan,AFG,1940,1940-03-15,1.999134,11.327695
3,Afghanistan,AFG,1940,1940-04-15,10.199754,11.327695
4,Afghanistan,AFG,1940,1940-05-15,17.942135,11.327695


This seems like a lot of data. How much data do we have?

In [2]:
df.shape

(198900, 6)

Wow, that's a lot of data: 198,900 rows. The last column looks like the annual average of the monthly readings for each country and year. Let's test that hypothesis.

In [3]:
df['Entity'].nunique(), df['year'].nunique(), df['Average surface temperature.1'].nunique()

(195, 85, 16557)

We have 85 years, 195 entities (countries plus aggregates such as World), and 16557 unique values in the final column. If the final column is the annual average, each entity contributes one value per year, so multiplying the number of entities by the number of years should give roughly the number of unique values.

In [4]:
85 * 195

16575
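
The product, 16575, is close to the 16557 unique values, which is consistent with the hypothesis but does not confirm it. A more direct check, sketched below, recomputes the mean of the monthly column for each entity and year and compares it with the final column. The helper name `annual_mean_matches` and the toy frame are made up for illustration. On the real data one would presumably pass `df` with 'Average surface temperature' and 'Average surface temperature.1'.

```python
import numpy as np
import pandas as pd


def annual_mean_matches(frame, monthly_col, annual_col, keys=('Entity', 'year'), atol=1e-6):
    """Return True if annual_col equals the per-group mean of monthly_col."""
    # transform('mean') broadcasts each group's mean back onto its rows.
    recomputed = frame.groupby(list(keys))[monthly_col].transform('mean')
    return bool(np.allclose(recomputed, frame[annual_col], atol=atol))


# Toy data, invented for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'B', 'B'],
    'year': [2000, 2000, 2000, 2000],
    'monthly': [1.0, 3.0, 10.0, 20.0],
    'annual': [2.0, 2.0, 15.0, 15.0],
})
print(annual_mean_matches(toy, 'monthly', 'annual'))
```

If the check fails on the real data, the final column may be a weighted or otherwise adjusted average rather than a plain mean of the twelve monthly values.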

What does our world annual average look like?

In [5]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
world_df = df[df['Entity'] == 'World'].drop(columns=['Code', 'Day', 'Average surface temperature']).drop_duplicates()
express.scatter(data_frame=world_df, x='year', y='Average surface temperature.1', trendline='lowess').show(renderer='iframe_connected')

What do we see? We see that the world annual average temperature has generally trended upward over the period of interest.

In [6]:
countries_df = df[df['Entity'] != 'World'].drop(columns=['Code', 'Day', 'Average surface temperature']).drop_duplicates()
express.scatter(data_frame=countries_df, x='year', y='Average surface temperature.1', color='Entity', height=800, trendline='lowess').show(renderer='iframe_connected')

What do we see? It looks like annual average temperatures have trended upward in all countries over the period of interest.
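
With this many overlapping curves, a visual impression is hard to verify. One rough way to check it numerically is to fit a straight line per entity and look at the sign of the slope. Note that this swaps in an ordinary least-squares line for the LOWESS trendline used in the plot, so it only captures the overall direction. The helper name `trend_slopes` and the toy frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def trend_slopes(frame, value_col, entity_col='Entity', year_col='year'):
    """Least-squares slope (value units per year) for each entity."""
    def slope(group):
        # Centre the years to keep the fit numerically well conditioned.
        x = group[year_col] - group[year_col].mean()
        return np.polyfit(x, group[value_col], deg=1)[0]

    return frame.groupby(entity_col)[[year_col, value_col]].apply(slope)


# Toy data, invented for illustration (not taken from the dataset).
toy = pd.DataFrame({
    'Entity': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [2000, 2001, 2002, 2000, 2001, 2002],
    'temp': [1.0, 2.0, 3.0, 5.0, 5.0, 5.0],
})
slopes = trend_slopes(toy, 'temp')
print(slopes)
print((slopes > 0).all())  # True only if every entity has a positive slope
```

On the real data, the share of entities with a positive slope would show whether "all countries" holds literally or only approximately.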

What countries are at the extremes? We can measure extremes in terms of both mean and variance.

In [7]:
# Aggregate with a list, not a set: a set has no guaranteed order, so the 'mean' and 'std' columns could come out swapped.
annual_df = df[['Entity', 'year', 'Average surface temperature.1']].drop_duplicates()
scatter_df = annual_df.groupby(by='Entity')['Average surface temperature.1'].agg(['mean', 'std']).reset_index()

In [8]:
express.scatter(data_frame=scatter_df, x='mean', y='std', color='Entity', height=800).show(renderer='iframe_connected',) 

What do we see? Many countries cluster together in the lower right corner (high mean, low standard deviation), but there are also some outliers. Let's pick out some examples and plot their curves together.
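
The examples in the next cell were picked by eye from the scatter plot. As an alternative, the sketch below ranks entities by a chosen statistic and returns the names at both ends. The helper name `extremes` and the toy table are made up for illustration. The table only assumes the same three columns as `scatter_df` ('Entity', 'mean', 'std').

```python
import pandas as pd


def extremes(stats, column, n=1):
    """Return (lowest, highest) entity names ranked by the given column."""
    ranked = stats.sort_values(column)
    lowest = ranked['Entity'].head(n).tolist()
    highest = ranked['Entity'].tail(n).tolist()
    return lowest, highest


# Toy summary table, invented for illustration (not taken from the dataset).
toy_stats = pd.DataFrame({
    'Entity': ['A', 'B', 'C', 'D'],
    'mean': [-18.0, 9.5, 24.0, 26.5],
    'std': [1.2, 0.9, 0.4, 0.3],
})
print(extremes(toy_stats, 'mean'))
print(extremes(toy_stats, 'std'))
```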

In [9]:
entities = {'World', 'Greenland', 'Norway', 'Samoa', 'Saudi Arabia', 'Belarus', 'Falkland Islands'}
selected_df = df[df['Entity'].isin(entities)].drop(columns=['Code', 'Day', 'Average surface temperature']).drop_duplicates()
express.scatter(data_frame=selected_df, x='year', y='Average surface temperature.1', trendline='lowess', color='Entity').show(renderer='iframe_connected')

What do we see? We see that World temperatures have low variability, probably because they average over many regions, while individual countries vary considerably from year to year.
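
The averaging explanation can be illustrated with a small simulation. It draws independent noisy series and compares the typical spread of a single series with the spread of their average. This is only a sketch on synthetic data: real country temperatures are correlated with each other, and the World value may not be a plain mean of the country values, so the reduction in the actual data need not follow the same pattern. The numbers 50 and 85 are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 synthetic "regions" with 85 "years" each, independent unit-variance noise.
n_series, n_years = 50, 85
series = rng.normal(loc=0.0, scale=1.0, size=(n_series, n_years))

# Typical year-to-year spread of one region.
individual_std = series.std(axis=1, ddof=1).mean()

# Year-to-year spread of the average across all regions.
averaged_std = series.mean(axis=0).std(ddof=1)

print(individual_std, averaged_std)
```

For independent series, the standard deviation of the average shrinks roughly with the square root of the number of series averaged.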